Big Data, Small Languages, Scalable Systems

Instructor: Yanif Ahmad.
TA: P. C. Shyamshankar
Class schedule: MW 12-1.15pm, Shaffer 302.

Office hours: W 1.30-3.00pm, Malone 233.

Course Description

Big Data systems and analyses are enabling computer scientists to have an unprecedented impact on a wide variety of science, healthcare, economic and humanities applications. By making datasets and their analyses the center of focus, big data principles and methods empower a data-driven dialogue with domain experts, supported by powerful tools, abstractions and computing capabilities to collect, process, visualize and explore data.

This course pairs a project-oriented practicum on Big Data systems with discussion-oriented seminars on cutting-edge techniques in data-intensive systems design, algorithms and applications. We aim to provide a semester-long experiential learning opportunity on select big data systems and analyses, alongside an end-to-end application of these tools and methods on a concrete scientific dataset. CS 600.615 is structured in three parts. First, we study the abstractions and systems design principles behind popular big data processing frameworks (e.g., Hadoop, Spark), including how these design principles facilitate their scalability. Next, we investigate common workloads and use cases of these systems in both industrial and academic applications. These workloads are typically analytical in nature, so the course then exposes students to the scalable implementation of statistics and machine learning methods on these frameworks. Finally, students apply the skills they have gained by developing an analysis stack for our science dataset through a semester-long course project.

CS 600.615 is a discussion- and project-oriented seminar course. Its format and workload include weekly readings of academic conference papers, student presentations, in-class discussion and conference-style paper reviews on the assigned material, and most significantly, a substantial project component. Students work in groups of 3-4 on most course activities, including presentations and projects. Course projects aim to develop and apply novel data science and data management research to a big data application domain, culminating in a data science workflow that comprises processing, analytics and visualization elements. Projects must be applied to and evaluated on real-world datasets, and made openly available through software and data sharing resources such as GitHub or DataHub.

Area: Systems.
Prereq: CS 600.315/415 or CS 600.316/416, or equivalents.

Organization

Groups: Most of the work in this course (presentations and projects) will be done in groups of 3-4. Please register your group on our sign-up form by September 14, 2016.

Presentations: This course has a discussion-oriented format to introduce students to broader techniques and applications of data management, building on their existing background and experience with data management tools. Students will read the assigned material before the week's classes to prepare for an in-class discussion led by student presenters. Each student group is expected to pick one topic area and present the two papers chosen for that area in two 50-minute lectures, during the Monday and Wednesday classes.

To get the most out of the material, I suggest that presentations follow the style of conference tutorials: they should include introductory material on the topic and summarize the contributions of the assigned papers, rather than focus purely on the papers. Since lecture slots are 75 minutes, the remaining 25 minutes will be used for class discussion and follow-up material by the instructor that links the week's material to the remaining topics in the syllabus.

Paper Reviews: To encourage and prepare for in-class discussions, we ask students to individually submit (simplified) conference-style paper reviews that summarize the paper's main contribution, and the benefits and drawbacks of its concepts and methods. We anticipate that each review will take about 30 minutes once students have read the paper, and that each review will be at most 1-1.5 pages of text. We ask students to review 50% of the papers presented throughout the semester. We have included a template for the review form here. Reviews should be posted anonymously on our Piazza page for other students to read, and sent privately to the instructor by the night before the relevant presentation.

Project: Course projects should encompass challenges and solutions on data management and scalable algorithm design topics, using cloud computing resources where appropriate. Each project must also choose a big data domain as its focus, to define a concrete dataset that will be used for project evaluation. To get you thinking about potential topics, we have a list of suggestions here. Student groups are required to submit a 2-page project proposal document by Friday, September 30th, and a 10-12 page final report by Friday, December 16th. Additionally, we will ask each group to present a brief 10-15 minute project plan during the week of September 26th, and a final project demonstration during the week of December 5th. When writing your project proposal, we encourage you to consider Heilmeier's questions, which address what you intend to accomplish. In addition to pursuing the main thrust of their individual topic, project groups are encouraged to interact and integrate their efforts with other groups during the semester as part of broader class-wide project efforts.

Important Dates and Links

  • Group creation and topic selection: 11:59pm on Wednesday, September 14th, 2016.
  • Proposal presentation: Week of September 26th, 2016.
  • Proposal document: 11:59pm on Friday, September 30th, 2016.
  • Project demos: Week of December 5th, 2016.
  • Final report: 11:59pm on Friday, December 16th, 2016.
  • Piazza page
  • Signup page
  • Review form

Grading

55% Project

  • 15% Proposal
  • 40% Demonstration and Report
25% Presentation and discussion lead
20% Paper reviews

Syllabus

Week  Section          Topic                                Papers      Additional Reading
1.    Introduction     Data, Query and Analysis Modalities  P1, P2      --
2.    Survey I         Cloud Data Systems Foundations       P3, P4      --
3.    Survey II        Analytics Foundations                P5, P6      --
4.    Project I        Project Proposals                    No reading  --
5.    Systems I        Cloud Computing                      P7, P8      --
6.    Systems II       Real-time Systems                    P9, P10     --
7.    Industry         New at Google                        P11, P12    --
8.    Analytics I      Scalable Machine Learning            P13, P14    --
9.    Analytics II     Numerics                             P15, P16    --
10.   Analytics III    Time Series                          P17, P18    --
11.   Applications I   Control Systems                      P19, P20    --
12.   Applications II  Humans & Behaviors                   P21, P22    --
13.   Project II       Project Demonstrations               No reading  --

Schedule

Week 1 Yanif
Week 2 Yanif
Week 3 Yanif
Week 4 Student Teams
Week 5 Amitoj, Siddharth, Sachith, Sankalp
Week 6 Matt, Teodor, Nikhil
Week 7 Ziyan, Ehsan, Rishab, Vikas
Week 8 Nathan, Alex, Razieh
Week 9 Ben, Adhiraj, Anchit
Week 10 Adam, Ryan, Chu-Cheng, Winston
Week 11 Ke, Yirui, Yujie, Te
Week 12 Rohit, Sam, Rachel, Ben
Week 13 Student Teams

Paper List

Primary Readings

  1. [Paper] "One Size Fits All": An Idea Whose Time Has Come and Gone.
  2. [Article] Exascale Computing and Big Data.
  3. [Paper] Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing.
  4. [Paper] In-Memory Big Data Management and Processing: A Survey.
  5. [Paper] Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals.
  6. [Article] Spectral Methods for Dimensionality Reduction.
  7. [Paper] Large-scale cluster management at Google with Borg.
  8. [Paper] Characterizing Private Clouds: ...
  9. [Paper] The Dataflow Model: ...
  10. [Paper] Twitter Heron: Stream Processing at Scale.
  11. [Paper] Goods: Organizing Google’s Datasets.
  12. [Paper] Shasta: Interactive Reporting At Scale.
  13. [Paper] SystemML: Declarative Machine Learning on Spark.
  14. [Paper] TensorFlow: A system for large-scale machine learning.
  15. [Paper] Cumulon: Matrix-Based Data Analytics in the Cloud with Spot Instances.
  16. [Paper] Compressed Linear Algebra for Large-Scale Machine Learning.
  17. [Paper] BTrDB: Optimizing Storage System Design for Timeseries Processing.
  18. [Paper] Sequential Data Cleaning: A Statistical Approach.
  19. [Paper] Reactive Control of Autonomous Drones.
  20. [Paper] End to End Learning for Self-Driving Cars.
  21. [Paper] LiveLabs: Building In-Situ Mobile Sensing & Behavioural Experimentation TestBeds.
  22. [Paper] BodyScan: Enabling Radio-based Sensing on Wearable Devices for Contactless Activity and Vital Sign Monitoring.

Background Readings

  1. [Paper] A Comparison of Approaches to Large-Scale Data Analysis
  2. [Paper] MapReduce: Simplified Data Processing on Large Clusters.
  3. [Paper] Discovery-driven Exploration of OLAP DataCubes
  4. [Paper] Mining Massive Datasets (Chapter 11: Dimensionality Reduction)
  5. [Paper] Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters
  6. [Paper] PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs
  7. [Video]
  8. [Paper] REEF: Retainable Evaluator Execution Framework
  9. [Paper] Towards Resource-Elastic Machine Learning
  10. [Paper] Omega: flexible, scalable schedulers for large compute clusters
  11. [Paper] Implementing Randomized Matrix Algorithms in Parallel and Distributed Environments
  12. [Paper] Dense Matrix Algorithms
  13. [Paper] Introduction to Convex Optimization for Machine Learning
  14. [Paper] Convex Optimization
  15. [Paper] Interactions with Big Data Analytics
  16. [Paper] Declarative Interaction Design for Data Visualization
  17. [Paper] The Effects of Interactive Latency on Exploratory Visual Analysis

Additional Material and Links

A (very) large list of Big Data tools
The Elements of Statistical Learning (Book)