Introduction to Data Science

This two-day accelerated course gives graduate students in computer science an overview of data science. It covers key topics in scalable data management, selected topics in statistics not always taught in introductory stats classes, key topics and algorithms in machine learning, an introduction to the field of visualization research, and an overview of graph analytics.

All course materials are available in the GitHub repository. Use these instructions to access the GitHub material.

Day 1: Scalable Data Management

Introduction

Appetite whetting and context (15 min)
Course goals and logistics (10 min)
Twitter Sentiment Analysis (1 hour)

Readings

(example) Yong-Yeol Ahn, Sebastian E. Ahnert, James P. Bagrow, Albert-Laszlo Barabasi, Flavor network and the principles of food pairing, Scientific Reports 1, Article number: 196 doi:10.1038/srep00196
(example) Acerbi A, Lampos V, Garnett P, Bentley RA (2013) The Expression of Emotions in 20th Century Books. PLoS ONE 8(3): e59030. doi:10.1371/journal.pone.0059030
(example) Google Flu Trends (plus: David Wagner, Google Flu Trends Wildly Overestimated This Year's Flu Outbreak, Atlantic Wire, February 13, 2013)
(example) L'Aquila quake: Italy scientists guilty of manslaughter, BBC
Drew Conway's Venn Diagram
Mike Loukides, What is data science?, O'Reilly Radar, 2010
Origins of "Volume, Velocity, Variety"
Dan McKinley, Whom the Gods Would Destroy, They First Give Real-Time Analytics
Howard Wen, "Big Ethics for Big Data", O'Reilly Media
John Markoff, New York Times, Unreported Side Effects of Drugs Are Found Using Internet Search Data, March 13, 2013
Mike Loukides, Data Skepticism, O'Reilly Media, April 2013
eScience: The Fourth Paradigm (Foreword and Introduction, pages xi-xxxi; Gray's Laws, pages 5-12)
Chris Anderson, "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete", Wired magazine, 2008
Responses to Chris Anderson, 2008

Relational Databases for Data Science

Relational Database Key Ideas (20 min)
Exercise: SQL Analytics (30 min)
In-Database Analytics (20 min)
Relational Algorithmics (10 min)

Readings

How Vertica Was the Star of the Obama Campaign, and Other Revelations
E. F. Codd, 1981 Turing Award Lecture, "Relational Database: A Practical Foundation for Productivity", 1981 (Think about which arguments from this short piece are still relevant today.)
Cohen et al. "MAD Skills: New Analysis Practices for Big Data", 2009
Erik Meijer, Gavin Bierman, A co-Relational Model of Data for Large Shared Data Banks, Communications of the ACM, 2011

Beyond MapReduce

MapReduce refresher (10 min)
Exercise: MR Algorithms (30 min)
Comparison with Databases (10 min)
Myria: Analytics-as-a-Service (20 min)
Radish: Compiling Distributed Query Plans (10 min)

Readings

Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. 2015. Spark SQL: Relational Data Processing in Spark. SIGMOD '15
Brandon Myers et al., Radish: Compiling Efficient Query Plans for Distributed Shared Memory, 2015
Gog et al., Musketeer: All for One, One for All in Data Processing Systems, EuroSys '15
Dean and Ghemawat, "MapReduce: A Flexible Data Processing Tool", Communications of the ACM, January 2010.
Ullman, Rajaraman, Mining of Massive Datasets, Chapter 2
Stonebraker et al., "MapReduce and Parallel DBMSs: Friends or Foes?", Communications of the ACM, January 2010.

Other papers mentioned:

Afrati, Foto N. and Ullman, Jeffrey D. Optimizing Joins in a Map-Reduce Environment EDBT 2010

NoSQL

Introduction (10 min)
Compared by Features (10 min)
NoSQL Response (10 min)
Exercise: Graph Processing with Pig on AWS (if time)

Readings

Rick Cattell, "Scalable SQL and NoSQL Data Stores", SIGMOD Record, December 2010 (39:4)
Data cleaning (not covered in lectures)
- Elmagarmid et al., "Duplicate Record Detection: A Survey"
- Koudas et al., "Record Linkage: Similarity Measures and Algorithms"

Day 2: Analytics and Visualization

Cherry-picked Statistics Topics

Motivation: Science "Losing Power" (5 min)
Publication Bias and Effect Size (10 min)
Fraud Detection (10 min)
Multiple Hypothesis Testing (10 min)
Is Big Data Different? (10 min)
Permutation Methods (20 min)

Readings

Chapter 3 of A Handbook of Statistical Analyses Using R
Gregory Park on overfitting to the leaderboard in a Kaggle Competition

Machine Learning Tour

Introduction (15 min)
Rules (15 min)
Trees (20 min)
Overfitting (10 min)
Evaluation (10 min)
Ensembles, Bagging, Boosting (10 min)
Random Forests (10 min)
Gradient Descent (30 min)
K-means, DBSCAN (15 min)

Readings

Xindong Wu et al., Top 10 Algorithms in Data Mining, Knowledge and Information Systems, 14(2008), 1: 1-37. (read C4.5)
Ullman, Rajaraman, Mining of Massive Datasets, Chapter 1
Pedro Domingos, A Few Useful Things to Know about Machine Learning, CACM 55(10), 2012
Xindong Wu et al., Top 10 Algorithms in Data Mining, Knowledge and Information Systems, 14(2008), 1: 1-37. (read k-means)

Visualization

Intro (10 min)
Types, Dimensions (20 min)
Encodings (10 min)
Perception (20 min)
Assignment: D3 Tutorial

Readings

Hans Rosling, The Joy of Stats
Pat Hanrahan, Tools for Data Enthusiasts
Jeffrey Heer, Michael Bostock, Vadim Ogievetsky, A Tour through the Visualization Zoo, Communications of the ACM, Volume 53 Issue 6, June 2010

Graph Analytics

Structure (20 min)
Traversal (20 min)
Patterns (20 min)
PageRank (10 min)
