Big Data, Small Languages, Scalable Systems

Instructor: Yanif Ahmad.
TA: P. C. Shyamshankar
Class schedule: MW 12-1.15pm, Shaffer 302.

Office hours: W 1.30-3.00pm, Malone 233.

Example Topics

The following list of topics is intended as a high-level guide to potential project areas. Students should refine a topic into a more focused question through discussions with their collaborators and the course staff. In addition to a topic area, all projects should identify a dataset and multiple metrics for evaluating project outcomes. We encourage you to set up a one-hour meeting with the course staff during weeks 2-3 of the semester to brainstorm, identify related work, and plan resource requirements for your project.


Databases on supercomputers

Despite generating large datasets from high-resolution simulations, today's supercomputers do not widely use database systems for their analyses and data management. This is partly due to the complexity of administering databases and their lack of scalability to tens of thousands of CPU cores. Most importantly, supercomputers are not designed to run persistent services, such as a distributed database, across their cores. This project would design an analytics database that works in a serverless fashion, much like an embedded database (e.g., SQLite). A distributed embedded database on a supercomputer should take advantage of common software components such as job schedulers (e.g., SGE, Slurm) and cluster file systems (e.g., Lustre).
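The serverless pattern above can be sketched in miniature: each scheduler task (e.g., one Slurm array index) opens an embedded SQLite database over its shard, computes a partial aggregate, and a final merge step combines the partials. The function names and toy data below are illustrative assumptions, not a prescribed design.

```python
# Sketch of a "serverless" embedded analytics pattern: per-task embedded
# databases plus a final merge, with no persistent server process.
import sqlite3

def run_shard_task(rows):
    """One per-core task: embedded DB over a local shard."""
    db = sqlite3.connect(":memory:")  # stand-in for a file on the cluster FS
    db.execute("CREATE TABLE t (k TEXT, v REAL)")
    db.executemany("INSERT INTO t VALUES (?, ?)", rows)
    # Partial aggregate computed locally on the shard.
    return db.execute("SELECT k, SUM(v), COUNT(*) FROM t GROUP BY k").fetchall()

def merge(partials):
    """Final reduce step, e.g., a dependent scheduler job."""
    acc = {}
    for part in partials:
        for k, s, c in part:
            ps, pc = acc.get(k, (0.0, 0))
            acc[k] = (ps + s, pc + c)
    return {k: s / c for k, (s, c) in acc.items()}  # global mean per key

shards = [[("a", 1.0), ("b", 2.0)], [("a", 3.0)]]
result = merge(run_shard_task(s) for s in shards)
```

In a real deployment the merge would itself be a scheduler job that depends on the shard tasks, reading their partial results from the cluster file system.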

Declarative storage systems

Many of the advantages of new classes of analytics engines arise from customizing data layouts to the intended workload, to mitigate I/O overheads amongst other bottlenecks. Examples include column stores, SSD-based graph engines, and array databases. In the K3 project (see below), we synthesize entire analytics engines, including their storage layers. This project would investigate the use of K3 to generate optimal data layouts given properties of the analytics algorithms and statistics on data access. This includes generating hybrid data structures, where one part of the data structure is optimized for updates and writes while other parts are optimized for read access.
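A minimal sketch of the hybrid-layout idea: a small write-optimized buffer absorbs updates, while a read-optimized sorted run serves lookups, with the buffer merged in when it fills. This is a toy illustration of the principle, not K3's actual generated layout.

```python
# Hybrid data structure: write-optimized buffer + read-optimized sorted run.
import bisect

class HybridStore:
    def __init__(self, buffer_limit=4):
        self.buffer = {}                # write-optimized: O(1) upserts
        self.keys, self.vals = [], []   # read-optimized: sorted arrays
        self.buffer_limit = buffer_limit

    def put(self, k, v):
        self.buffer[k] = v
        if len(self.buffer) >= self.buffer_limit:
            self._flush()

    def _flush(self):
        # Merge buffered writes into the sorted run.
        merged = dict(zip(self.keys, self.vals))
        merged.update(self.buffer)
        self.keys = sorted(merged)
        self.vals = [merged[k] for k in self.keys]
        self.buffer = {}

    def get(self, k):
        if k in self.buffer:            # recent writes win
            return self.buffer[k]
        i = bisect.bisect_left(self.keys, k)
        if i < len(self.keys) and self.keys[i] == k:
            return self.vals[i]
        return None

s = HybridStore()
for i in range(6):
    s.put(i, i * i)
```

The interesting synthesis question is choosing the buffer size, run layout, and merge policy automatically from workload statistics rather than hard-coding them as above.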


Iterative and ML algorithm optimization

Many ML algorithms fit an iterative processing workflow, in which a training loop repeatedly adjusts model parameters. Examples include gradient and coordinate descent algorithms. Current data management tools provide few automatic optimizations for iterative algorithms, and this project would investigate topics such as adaptive scheduling techniques, iteration under dynamic data, and speculative execution opportunities, to improve the practical convergence achieved in distributed settings.
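The iterative workflow above can be sketched as a distributed-style gradient descent on least squares: each "worker" computes a partial gradient over its shard, and a driver averages them each iteration. The data, step size, and convergence threshold below are illustrative only.

```python
# Distributed-style iterative training loop: per-shard partial gradients,
# averaged each iteration, with a simple convergence check.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                         # noiseless synthetic labels

shards = [(X[i::4], y[i::4]) for i in range(4)]  # 4 equal-size "workers"
w = np.zeros(3)
for step in range(500):
    # Each worker's partial gradient of 0.5 * ||X w - y||^2 / n.
    grads = [Xi.T @ (Xi @ w - yi) / len(yi) for Xi, yi in shards]
    g = np.mean(grads, axis=0)
    w -= 0.1 * g
    if np.linalg.norm(g) < 1e-8:       # stop once the loop has converged
        break
```

The project topics enter exactly at this loop: when should the scheduler reassign shards, what happens if shard contents change between iterations, and which workers are worth running speculatively.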


ML algorithms are increasingly deployed as cloud services, enabling common workflows to be used by large numbers of users as they model and predict on their own datasets. Designing ML systems for large numbers of users raises several efficiency questions: how can ML algorithms reuse their computations and share work amongst multiple user requests? Furthermore, how can correlations and patterns detected in one task inform other tasks? Students interested in this project should pick a specific algorithm and study algorithm design questions that facilitate scaling across many users.
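One concrete form of cross-user sharing, given as a hedged example: when many users fit ridge-regression models over the same feature matrix, the Gram matrix and its factorization can be computed once and reused per request, leaving only cheap per-user solves. The service class and data below are assumptions for illustration.

```python
# Multi-tenant sharing: amortize the Gram matrix factorization across users.
import numpy as np

class SharedRidgeService:
    def __init__(self, X, lam=0.1):
        self.X = X
        # Work shared across all user requests: done once.
        A = X.T @ X + lam * np.eye(X.shape[1])
        self.chol = np.linalg.cholesky(A)

    def fit_for_user(self, y):
        # Per-user work is just two triangular solves.
        b = self.X.T @ y
        z = np.linalg.solve(self.chol, b)
        return np.linalg.solve(self.chol.T, z)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
svc = SharedRidgeService(X, lam=0.0)
w_user = svc.fit_for_user(X @ np.array([2.0, 0.0, -1.0, 3.0]))
```

The open design questions are which intermediate results generalize across users when their feature matrices are only partially shared, and how patterns found in one user's task can seed another's.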

Multiresolution analysis

Simulations of scientific phenomena (such as proteins or galaxies) often produce high-dimensional datasets, on the order of thousands if not millions of dimensions. Direct analysis at these scales is computationally intractable and requires finding compact, accurate low-dimensional representations. This presents a challenge of efficient and automatic feature engineering; however, many existing techniques operate on the full data representation. Few multiresolution approaches take advantage of domain-specific knowledge, such as independence properties. For example, in a protein, secondary structures such as an alpha helix behave independently when analyzing protein motions such as folding. This project would design a compositional approach to feature engineering, where compact representations are designed for each independent structure and then composed to find good representations for global structures.
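A minimal sketch of the compositional idea: run PCA independently on each substructure's coordinates (e.g., one block per alpha helix) and concatenate the per-block components into a global low-dimensional representation. The block boundaries and sizes below are illustrative.

```python
# Compositional feature engineering: per-substructure PCA, then composition.
import numpy as np

def block_pca(X, k):
    """Top-k principal component scores for one substructure's data."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                    # (n_frames, k) scores

rng = np.random.default_rng(2)
frames = rng.normal(size=(50, 12))          # 50 frames, 12 coordinates
blocks = [frames[:, :6], frames[:, 6:]]     # two "independent" substructures
features = np.hstack([block_pca(B, 2) for B in blocks])
```

Compared with PCA over the full 12-dimensional representation, each block's decomposition is cheaper and can exploit known independence; the research question is how to compose the blocks when independence is only approximate.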


Speech interfaces

Recent database research has heavily investigated techniques to improve database usability and accessibility, for example through gesture-based interfaces that support data navigation on mobile and tablet devices. This project would investigate the use of speech interfaces, in addition to gestures, to support querying and navigating large datasets. Topics include identifying the kinds of queries and actions that are better suited to speech than to gestures, and designing and implementing a speech interface to a query and visualization engine on top of existing speech recognition services.
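A toy sketch of the recognition-to-query mapping such an interface needs: transcribed utterances (from any speech service, not modeled here) are matched against templates and compiled to SQL. The grammar and table names below are purely illustrative.

```python
# Compile recognized utterances to SQL via a small template grammar.
import re

TEMPLATES = [
    (re.compile(r"show top (\d+) (\w+) by (\w+)"),
     lambda m: f"SELECT * FROM {m.group(2)} ORDER BY {m.group(3)} DESC LIMIT {m.group(1)}"),
    (re.compile(r"count (\w+)"),
     lambda m: f"SELECT COUNT(*) FROM {m.group(1)}"),
]

def utterance_to_sql(text):
    for pattern, compile_sql in TEMPLATES:
        m = pattern.fullmatch(text.lower().strip())
        if m:
            return compile_sql(m)
    return None   # fall back to gestures or a clarification dialogue

q = utterance_to_sql("show top 10 galaxies by mass")
```

Part of the project would be deciding, per query shape, whether this template path or a gesture is the better modality, and how the two compose in one session.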


Our K3 project is a state-of-the-art analytics engine in development at JHU in Prof. Ahmad's lab. K3 is a framework for building new application-specific data systems by combining database, programming language, and distributed systems techniques. We're interested in prototyping the following new functionality in K3.

Exploratory dataframe and array data management

With many of our target applications in the sciences, we would like to extend K3 to support data exploration tasks through dataframe and array APIs. These projects would start by providing new datatypes for dataframes and arrays based on existing high-performance C++ libraries (such as libdynd or blaze), wrapping them as a K3 API. Next, the single-site API would be extended to support a distributed dataframe or array abstraction, leveraging K3's existing communication and scheduling capabilities. These new distributed datatypes would then be used to showcase an interactive, scalable data exploration workflow based on user-guided dimensionality reduction of a large dataset.
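The single-site-to-distributed step can be sketched with a toy partitioned "dataframe" whose column operations run per shard and reduce globally, in the way a K3-backed datatype might dispatch over peers. The class and method names here are assumptions, not K3's actual interface.

```python
# Toy distributed dataframe abstraction: per-shard work plus a global reduce.
import numpy as np

class PartitionedFrame:
    def __init__(self, columns, num_shards=4):
        n = len(next(iter(columns.values())))
        idx = np.array_split(np.arange(n), num_shards)
        self.shards = [{c: v[i] for c, v in columns.items()} for i in idx]

    def mean(self, col):
        # Per-shard partial sums, reduced at the "driver".
        parts = [(s[col].sum(), len(s[col])) for s in self.shards]
        total, count = map(sum, zip(*parts))
        return total / count

pf = PartitionedFrame({"x": np.arange(10.0)})
m = pf.mean("x")
```

In the actual project the shard dictionaries would be backed by a C++ array library and live on separate K3 peers, with the reduce expressed over K3's communication layer.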

Web- and mobile-embedded databases

K3 is a distributed analytics engine designed as a sharded peer-to-peer system. This project would design a distributed embedded database in which a webserver is coupled with each K3 peer. In this way, K3 would support a highly localized web interface at each peer for visualization of, and interaction with, each shard of data. Global data interactions would occur through K3 peers communicating and delivering relevant data to each interested peer. This would enable coupled data visualization, as well as network visualization of how data is distributed and managed throughout the system. The project should also consider novel visualization and query modalities enabled by mobile devices.
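The per-peer pattern can be sketched as a peer embedding a tiny HTTP server that answers queries over its local shard (an in-memory SQLite database here). The route, schema, and aggregate below are illustrative assumptions.

```python
# Each peer couples a local shard with an embedded HTTP server.
import json
import sqlite3
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

def make_peer(rows):
    shard = sqlite3.connect(":memory:", check_same_thread=False)
    shard.execute("CREATE TABLE t (k TEXT, v REAL)")
    shard.executemany("INSERT INTO t VALUES (?, ?)", rows)

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):   # e.g., GET /sum -> aggregate over the local shard
            total = shard.execute("SELECT SUM(v) FROM t").fetchone()[0]
            body = json.dumps({"sum": total}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):   # keep the sketch quiet
            pass

    server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

peer = make_peer([("a", 1.5), ("b", 2.5)])
url = f"http://127.0.0.1:{peer.server_address[1]}/sum"
reply = json.loads(urllib.request.urlopen(url).read())
peer.shutdown()
```

Global views would then be built by a peer fanning out such requests to the other peers and composing the shard-local answers, rather than centralizing the data.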

Declarative query acceleration

This project would extend K3 to make use of new accelerator devices for floating-point computations, such as GPUs and Xeon Phis. A key design feature of K3 is its annotation system, which enables developers to specify where (i.e., on which machines) and how (i.e., with which algorithm) to execute computation. This project would extend our annotation capabilities so that users can indicate that a piece of computation should execute on a GPU. Subsequently, all data that the computation depends on would be allocated on, or transferred to, the GPU. The project should also consider how to assist programmers in identifying query operations whose execution on GPUs would substantially improve end-to-end performance.
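A sketch of the annotation idea in plain Python: a decorator records a placement hint, and a tiny dispatcher moves operands to the annotated device before running the kernel. The GPU path is simulated here; K3's real annotation syntax and any CUDA transfer are out of scope for this sketch.

```python
# Placement annotations: the decorator tags where a kernel should run,
# and the dispatcher handles (simulated) data movement before execution.
import numpy as np

def placed(device):
    def wrap(fn):
        fn.device = device      # the annotation: where to run this kernel
        return fn
    return wrap

def to_device(arr, device):
    # Stand-in for a host-to-accelerator transfer; a real backend would
    # allocate on the GPU here instead of returning a host array.
    return np.asarray(arr)

def run(fn, *args):
    moved = [to_device(a, fn.device) for a in args]
    return fn(*moved)

@placed("gpu:0")
def saxpy(a, x, y):
    return a * x + y

out = run(saxpy, 2.0, np.ones(4), np.arange(4.0))
```

Keeping the annotation separate from the kernel body is the point: the same `saxpy` can be retargeted by changing only the annotation, which is the property the K3 extension would preserve.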

Dataset candidates:

  • JHU-specific science datasets: biophysics and biochemistry, astrophysics, turbulence, connectome, genomics (ask Prof. Ahmad)
  • Wikipedia data dumps
  • Github
  • Datahub
  • Public datasets hosted on Amazon
  • Deepdive OpenData
  • Google datacenter trace
  • Twitter streaming API
  • UCI ML archive
  • IOT simulators
  • Smart grid simulators