Abstract

In this paper we demonstrate how Python can be used throughout the entire life cycle of a graduate program in Data Science. In interdisciplinary fields, such as Data Science, the students often come from a variety of different backgrounds where, for example, some students may have strong mathematical training but less experience in programming. Python’s ease of use, open source license, and access to a vast array of libraries make it particularly suited for such students. In particular, we will discuss how Python, IPython notebooks, scikit-learn, NumPy, SciPy, and pandas can be used in several phases of graduate Data Science education, starting from introductory classes (covering topics such as data gathering, data cleaning, statistics, regression, classification, machine learning, etc.) and culminating in degree capstone research projects using more advanced ideas such as convex optimization, non-linear dimension reduction, and compressed sensing. One particular item of note is the scikit-learn library, which provides numerous routines for machine learning. Having access to such a library allows interesting problems to be addressed early in the educational process and the experience gained with such “black box” routines provides a firm foundation for the students own software development, analysis, and research later in their academic experience.

Keywords:data scienceeducationmachine learning