Introduction to Data Science

This two-day accelerated course gives graduate students in computer science an overview of the main topics in data science: key ideas in scalable data management, selected topics in statistics that are not always taught in introductory statistics classes, key topics and algorithms in machine learning, an introduction to the field of visualization research, and a survey of graph analytics.

All course materials are available in the GitHub repository. Use these instructions to access the GitHub material.

Day 1: Scalable Data Management

Introduction

  • Appetite whetting and context (15 min)
  • Course goals and logistics (10 min)
  • Twitter Sentiment Analysis (1 hour)

Readings

Relational Databases for Data Science

  • Relational Database Key Ideas (20 min)
  • Exercise: SQL Analytics (30 min)
  • In-Database Analytics (20 min)
  • Relational Algorithmics (10 min)

Readings

Beyond MapReduce

  • MapReduce refresher (10 min)
  • Exercise: MR Algorithms (30 min)
  • Comparison with Databases (10 min)
  • Myria: Analytics-as-a-Service (20 min)
  • Radish: Compiling Distributed Query Plans (10 min)

Readings

Other papers mentioned:

NoSQL

  • Introduction (10 min)
  • Compared by Features (10 min)
  • NoSQL Response (10 min)
  • Exercise: Graph Processing with Pig on AWS (if time)

Readings

  • Rick Cattell, "Scalable SQL and NoSQL Data Stores", SIGMOD Record, December 2010 (39:4)
  • Data cleaning (not covered in lectures)
    • Elmagarmid et al., "Duplicate Record Detection: A Survey"
    • Koudas et al., "Record Linkage: Similarity Measures and Algorithms"

Day 2: Analytics and Visualization

Cherry-picked Statistics Topics

  • Motivation: Science "Losing Power" (5 min)
  • Publication Bias and Effect Size (10 min)
  • Fraud Detection (10 min)
  • Multiple Hypothesis Testing (10 min)
  • Is Big Data Different? (10 min)
  • Permutation Methods (20 min)

Readings

Machine Learning Tour

  • Introduction (15 min)
  • Rules (15 min)
  • Trees (20 min)
  • Overfitting (10 min)
  • Evaluation (10 min)
  • Ensembles, Bagging, Boosting (10 min)
  • Random Forests (10 min)
  • Gradient Descent (30 min)
  • K-means, DBSCAN (15 min)

Readings

Visualization

  • Intro (10 min)
  • Types, Dimensions (20 min)
  • Encodings (10 min)
  • Perception (20 min)
  • Assignment: D3 Tutorial

Readings

Graph Analytics

  • Structure (20 min)
  • Traversal (20 min)
  • Patterns (20 min)
  • PageRank (10 min)