Instructor: Yanif Ahmad.
TA: P. C. Shyamshankar
Class schedule: MW 12-1.15pm, Shaffer 302.
Office hours: W 1.30-3.00pm, Malone 233.
Big Data systems and analyses are enabling computer scientists to have an
unprecedented impact on a wide variety of science, healthcare, economic and
By making datasets and their analyses the center of focus, big data principles
and methods empower a data-driven dialogue with domain experts, supported by
powerful tools, abstractions and computing capabilities to collect, process,
visualize and explore data.
This course pairs a project-oriented practicum on Big Data systems with discussion-oriented seminars on cutting-edge techniques in data-intensive systems design, algorithms and applications. We aim to provide a semester-long experiential learning opportunity on select big data systems and analyses, alongside an end-to-end application of these tools and methods on a concrete scientific dataset. CS 600.615 is structured in three parts. First, we study the abstractions and systems design principles behind popular big data processing frameworks (e.g., Hadoop, Spark), including how these design principles facilitate their scalability. Next, we investigate common workloads and use cases of these systems in both industrial and academic applications. These workloads are typically analytical in nature, so the course will also expose students to the scalable implementation of statistics and machine learning methods on these frameworks. Finally, students will apply the skills they have gained by developing an analysis stack for our science dataset through a semester-long course project.
CS 600.615 is a discussion- and project-oriented seminar course. Its format and workload include weekly readings of academic conference papers, student presentations, in-class discussion and conference-style paper reviews on the assigned material, and, most significantly, a substantial project component. Students work in groups of 3-4 for all course activities, including presentations and projects. Course projects aim to develop and apply novel data science and data management research to a big data application domain, culminating in a data science workflow that comprises processing, analytics and visualization elements. Projects must be applied and evaluated on real-world datasets, and made openly available through software and data sharing resources such as GitHub or DataHub.
Prereq: CS 600.315/415 or CS 600.316/416, or equivalents.
Most of the work in this course (presentations and projects) will be done in groups of 3-4.
Please register your group on our
sign-up form by September 14, 2016.
Presentations: This course has a discussion-oriented format to introduce students to broader techniques and applications of data management, building on their existing background and experience with data management tools. Students will read the assigned material before the week's classes to prepare for an in-class discussion led by student presenters. Each student group is expected to pick one topic area and present the two papers chosen for that area in two 50-minute lectures during the Monday and Wednesday classes.
To get the most out of the material, I suggest that presentations follow the style of conference tutorials: they should include introductory material on the topic and summarize the contributions of the assigned papers, rather than focus purely on the papers. Given 75-minute lecture slots, the remaining 25 minutes will be devoted to class discussion and follow-up material from the instructor linking the week's material to the remaining topics in the syllabus.
Paper Reviews: To encourage and prepare for in-class discussions, we ask students to individually submit (simplified) conference-style paper reviews that summarize the paper's main contribution, and the benefits and drawbacks of its concepts and methods. We anticipate that each review will take 30 minutes once students have read the paper, with each review being at most 1-1.5 pages of text. We ask students to review 50% of the papers presented throughout the semester. We have included a template for the review form here. Reviews should be posted anonymously on our Piazza page for other students to read, and sent privately to the instructor the night before the relevant presentation.
Project: Course projects should encompass challenges and solutions on data management and scalable algorithm design topics, using cloud computing resources where appropriate. Each project must also choose a big data domain as its focus, to define a concrete dataset that will be used for project evaluation. To get you thinking about potential topics, we have a list of suggestions here. Student groups are required to submit a 2-page project proposal document by Friday, September 30th, and a 10-12 page final report by Friday, December 16th. Additionally, we will ask each group to present a brief 10-15 minute project plan during the week of September 21st, and a final project demonstration during the week of November 30th. When writing your project proposal, we encourage you to consider Heilmeier's questions, which address what you intend to accomplish. In addition to the main thrust of their individual topic, project groups are encouraged to interact and integrate their efforts with other groups during the semester as part of broader class-wide project efforts.
|1.||Introduction||Data, Query and Analysis Modalities||P1, P2||--|
|2.||Survey I||Cloud Data Systems Foundations||P3, P4||--|
|3.||Survey II||Analytics Foundations||P5, P6||--|
|4.||Project I||Project Proposals||No reading||--|
|5.||Systems I||Cloud Computing||P7, P8||--|
|6.||Systems II||Real-time Systems||P9, P10||--|
|7.||Industry||New at Google||P11, P12||--|
|8.||Analytics I||Scalable Machine Learning||P13, P14||--|
|9.||Analytics II||Numerics||P15, P16||--|
|10.||Analytics III||Time Series||P17, P18||--|
|11.||Applications I||Control Systems||P19, P20||--|
|12.||Applications II||Humans & Behaviors||P21, P22||--|
|13.||Project II||Project Demonstrations||No reading||--|
|Week 4||Student Teams|
|Week 5||Amitoj, Siddharth, Sachith, Sankalp|
|Week 6||Matt, Teodor, Nikhil|
|Week 7||Ziyan, Ehsan, Rishab, Vikas|
|Week 8||Nathan, Alex, Razieh|
|Week 9||Ben, Adhiraj, Anchit|
|Week 10||Adam, Ryan, Chu-Cheng, Winston|
|Week 11||Ke, Yirui, Yujie, Te|
|Week 12||Rohit, Sam, Rachel, Ben|
|Week 13||Student Teams|