
G5705 Fall 2016 Introduction to Data Science

Tentative as of August 29th, 2016

Department of Statistics, Columbia University

Course Information

Classes: Tuesdays 2:40-5:25 PM, Room 520 Math


Instructor: Tian Zheng (Office hours: Mondays 12:00 PM - 2:00 PM, plus announced
online Q&A or by appointment; Room 1007, SSW). Email: tian.zheng@columbia.edu
TA: Nathan Lenssen. njl2134@columbia.edu
Course website: http://courseworks.columbia.edu
Course github: TBA

Prerequisites
The prerequisites for this course are a working knowledge of high school math and an
understanding of scientific research. Prior programming experience is NOT required.

Description
Data Science is a dynamic and fast-growing field at the interface of Statistics and Computer
Science. The emergence of massive datasets containing millions or even billions of observations
provides the primary impetus for the field. Such datasets arise, for instance, in large-scale
retailing, telecommunications, astronomy, and Internet social media. It is essential for non-data
scientists such as users of data-driven solutions and policy makers to understand the
principles, methods and technologies used in data science and how the data science approach
is being used to revolutionize decision processes in both the private and public sectors. This course
will provide a broad overview of the different areas of data science, from statistics and machine
learning to data engineering, along with many data science applications. It is designed to provide a
non-technical introduction to the data science approach and is intended for students from
non-quantitative fields. It shall not count towards degree requirements for quantitative graduate
programs such as Statistics, Computer Science, Operations Research, or Data Science.
Students should inquire with their respective programs to determine eligibility of this course to
count towards minimum degree requirements.

Course organization
The class meets weekly, following a flipped classroom approach. Well-designed pre-recorded
lectures, readings and practice exercises will be assigned before each meeting for students to
complete on their own outside the classroom. During the class meeting time, we will go over
some hands-on exercises in detail, discuss the lecture and reading materials and answer questions.
Tutorials on R and other easy-to-use data science software will also be given.

Below is a tentative schedule we will follow.

Week 1 (9/6): What is Data Science?
Tutorial: Intro to R
Week 2 (9/13): Intro to statistical thinking (I)
Tutorial: Basic data processing using R
Week 3 (9/20): Intro to statistical thinking (II)
Tutorial: Basic data analysis using R
Week 4 (9/27): Introduction to Bayesian modeling
Tutorial: TBD
Week 5 (10/4): Exploratory Data Analysis and Visualization (I)
Tutorial: Introduction to Tableau
Week 6 (10/11): Exploratory Data Analysis and Visualization (II)
Tutorial: Easy interactive graphics
Week 7 (10/18): Algorithms
Tutorial: Write an R function
Week 8 (10/25): Machine learning (I)
Tutorial: Machine learning using R
Week 9 (11/1): Machine learning (II)
Tutorial: Machine learning using R
Week 10 (11/15): Privacy and data security
Project-related tutorials: knitr
Week 11 (11/22): Data Engineering
Project-related tutorials: shiny app
Week 12 (11/29): Internet of Things
Project-related tutorials: ggplot2
Week 13 (12/6): Project presentations

Evaluation
Students' performance will be based on

Weekly short quizzes 50%

Final Project 50%

Communication
To run a flipped classroom, we use courseworks (http://courseworks.columbia.edu,
powered by Canvas) to organize online learning modules. It is absolutely essential for students
to go over the relevant learning modules before each class session. We will use the discussion
board on courseworks. The system is designed to get you help quickly and efficiently from
classmates, the TA, and myself. Rather than emailing questions to the teaching staff, I
encourage you to post your questions on the discussion board.

Textbook
There is no single required text. The learning modules are constructed using a library of
well-designed pre-recorded video segments. For tutorials, we will also prepare
documentation and suggest readings.

Here is an incomplete list of suggested reference books:

O'Neil and Schutt (2013) Doing Data Science

Provost and Fawcett (2013) Data Science for Business

Fung (2013) Numbersense: How to Use Big Data to Your Advantage

Class policy

The flipped-classroom format allows for more interaction and hands-on experience in the
classroom. To fully leverage the learning benefits this approach can offer, each student
needs to work through the assigned modules with discipline.

Academic Integrity is the cornerstone of meaningful teaching and learning. It is especially
important for courses like ours that have a final project. Remember that what matters more is
how much you learn, not what grade you will get. In your project, document the references
and resources that have been incorporated into your project and credit them
appropriately. Plagiarism is one of the most likely forms of cheating in this course. Here
are some tips to avoid plagiarism.

Emails related to learning and projects shall be redirected to our discussion board on
courseworks.

Students are expected to check email at least once every 12 hours during the week and
every 24 hours over the weekend. Students should make sure not to miss any important
class-related announcements sent by email or posted on Piazza. Emails will be
delivered to the students' official UNI addresses. It is the students' responsibility to ensure
that these emails are properly forwarded if they choose to use an alternative email address.
