This two-day accelerated course gives graduate students in computer science an overview of data science. It covers key topics in scalable data management, selected topics in statistics not always taught in introductory stats classes, key topics and algorithms in machine learning, an introduction to the field of visualization research, and graph analytics.
All course materials are available in the GitHub repository. Use these instructions to access the GitHub material.
- Appetite whetting and context (15 min)
- Course goals and logistics (10 min)
- Twitter Sentiment Analysis (1 hour)
Readings
- (example) Yong-Yeol Ahn, Sebastian E. Ahnert, James P. Bagrow, Albert-Laszlo Barabasi, Flavor network and the principles of food pairing, Scientific Reports 1, Article number: 196 doi:10.1038/srep00196
- (example) Acerbi A, Lampos V, Garnett P, Bentley RA (2013) The Expression of Emotions in 20th Century Books. PLoS ONE 8(3): e59030. doi:10.1371/journal.pone.0059030
- (example) Google Flu Trends (plus: David Wagner, Google Flu Trends Wildly Overestimated This Year's Flu Outbreak, Atlantic Wire, February 13, 2013)
- (example) L'Aquila quake: Italy scientists guilty of manslaughter, BBC
- Drew Conway's Venn Diagram
- Mike Loukides, What is data science?, O'Reilly Radar, 2010
- Origins of "Volume, Velocity, Variety"
- Dan McKinley, Whom the Gods Would Destroy, They First Give Real-Time Analytics
- Howard Wen, "Big Ethics for Big Data", O'Reilly Media
- John Markoff, New York Times, Unreported Side Effects of Drugs Are Found Using Internet Search Data, March 13, 2013
- Mike Loukides, Data Skepticism, O'Reilly Media, April 2013
- eScience: The Fourth Paradigm (Foreword and Introduction, pages xi-xxxi; Gray's Laws, pages 5-12)
- Chris Anderson, "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete", Wired magazine, 2008
- Responses to Chris Anderson, 2008
- Relational Database Key Ideas (20 min)
- Exercise: SQL Analytics (30 min)
- In-Database Analytics (20 min)
- Relational Algorithmics (10 min)
Readings
- How Vertica Was the Star of the Obama Campaign, and Other Revelations
- E. F. Codd, 1981 Turing Award Lecture, "Relational Database: A Practical Foundation for Productivity", 1981 (Think about which arguments from this short piece are still relevant today.)
- Cohen et al., "MAD Skills: New Analysis Practices for Big Data", 2009
- Erik Meijer, Gavin Bierman, A co-Relational Model of Data for Large Shared Data Banks, Communications of the ACM, 2011
- MapReduce refresher (10 min)
- Exercise: MR Algorithms (30 min)
- Comparison with Databases (10 min)
- Myria: Analytics-as-a-Service (20 min)
- Radish: Compiling Distributed Query Plans (10 min)
Readings
- Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. 2015. Spark SQL: Relational Data Processing in Spark. SIGMOD '15
- Brandon Myers et al., Radish: Compiling Efficient Query Plans for Distributed Shared Memory, 2015
- Gog et al., Musketeer: all for one, one for all in data processing systems, EuroSys '15
- Dean and Ghemawat, "MapReduce: A Flexible Data Processing Tool", Communications of the ACM, January 2010.
- Ullman, Rajaraman, Mining of Massive Datasets, Chapter 2
- Stonebraker et al., "MapReduce and Parallel DBMSs: Friends or Foes?", Communications of the ACM, January 2010.
Other papers mentioned:
- Foto N. Afrati and Jeffrey D. Ullman, Optimizing Joins in a Map-Reduce Environment, EDBT 2010
- Introduction (10 min)
- Compared by Features (10 min)
- NoSQL Response (10 min)
- Exercise: Graph Processing with Pig on AWS (if time)
Readings
- Rick Cattell, "Scalable SQL and NoSQL Data Stores", SIGMOD Record, December 2010 (39:4)
- Data cleaning (not covered in lectures)
- Elmagarmid et al., "Duplicate Record Detection: A Survey"
- Koudas et al., "Record Linkage: Similarity Measures and Algorithms"
- Motivation: Science "Losing Power" (5 min)
- Publication Bias and Effect Size (10 min)
- Fraud Detection (10 min)
- Multiple Hypothesis Testing (10 min)
- Is Big Data Different? (10 min)
- Permutation Methods (20 min)
Readings
- Chapter 3 of A Handbook of Statistical Analyses Using R
- Gregory Park on overfitting to the leaderboard in a Kaggle Competition
- Introduction (15 min)
- Rules (15 min)
- Trees (20 min)
- Overfitting (10 min)
- Evaluation (10 min)
- Ensembles, Bagging, Boosting (10 min)
- Random Forests (10 min)
- Gradient Descent (30 min)
- K-means, DBSCAN (15 min)
Readings
- Xindong Wu et al., Top 10 Algorithms in Data Mining, Knowledge and Information Systems, 14(2008), 1: 1-37. (read C4.5)
- Ullman, Rajaraman, Mining of Massive Datasets, Chapter 1
- Pedro Domingos, A Few Useful Things to Know about Machine Learning, CACM 55(10), 2012
- Xindong Wu et al., Top 10 Algorithms in Data Mining, Knowledge and Information Systems, 14(2008), 1: 1-37. (read k-means)
- Intro (10 min)
- Types, Dimensions (20 min)
- Encodings (10 min)
- Perception (20 min)
- Assignment: D3 Tutorial
Readings
- Hans Rosling, The Joy of Stats
- Pat Hanrahan, Tools for Data Enthusiasts
- Jeffrey Heer, Michael Bostock, Vadim Ogievetsky, A Tour through the Visualization Zoo, Communications of the ACM, Volume 53 Issue 6, June 2010
- Structure (20 min)
- Traversal (20 min)
- Patterns (20 min)
- PageRank (10 min)