## The Data Management Systems Lab @ Johns Hopkins

Our mission is to explore the challenges faced by modern computing applications that contend with vast quantities of information, from the basic sciences, to medical, enterprise and web domains.

We design, build and deploy abstractions, methods and tools to make "Big Data" and their supporting "Big Systems" easier to use, cheaper to manage, and ready to scale.

We are actively seeking students at all levels to participate in our research projects.

#### Projects

K3: Democratizing Big Data Systems Implementations
MDDB: The Molecular Dynamics Database
Collaborations: DBToaster, Dyna, OUTBIDS

#### People

F
F
P
P
Yanif Ahmad
Sarana Nutanong
Yotam Barnoy
P. C. Shyamshankar
M
U M
Nick Carey
Dan Deutsch
Naveen Natarajan

C
Tom Woolf, Jason Eisner, Alex Szalay, Christoph Koch (EPFL), Oliver Kennedy (U. Buffalo), OUTBIDS Team (incl. R. Vidal, A. Terzis, I-J. Wang)
Key: F = Faculty, P = Ph.D, M = MSE, U = Ugrad, C = Collaborator

#### Selected Publications

Adaptive Multiscale Exploration for Large-Scale Protein Analysis in the Molecular Dynamics Database. Under submission.

DBToaster: Higher-Order Delta Processing for Dynamic, Frequently Fresh Views.
Proceedings of the VLDB Endowment, 5(10):968-979, PVLDB 2012. [PDF]

K3: Language Design for Building Domain-Specific Runtimes.
Cross-model Language Design and Implementation, XLDI 2012. [PDF]

• Yanif Ahmad

• Sarana Nutanong

• Yotam Barnoy

• P. C. Shyamshankar (Shyam)

• Nick Carey
### Masters and Undergraduates

• Dan Deutsch
• Naveen Natarajan
• Vaibhav Mohan

### Alumni

• Parkavi Srithar (Project: MDDB)
• Mohit Dia (Project: MDDB, first employment: ARINC)
• Varun Sharma (Project: DBToaster/Cumulus, first employment: Google)
• Guangxishui Yang (Project: DBToaster/Hadoop, first employment: 1010data)

Legend: K3, MDDB, DBToaster, Data-Intensive Computing (DISC), Pulse/Borealis/SAND/other

• [MDDB] Adaptive Multiscale Exploration for Large-Scale Protein Analysis in the Molecular Dynamics Database. S. Nutanong, Y. Ahmad, A. Szalay, T. Woolf. Under submission.
• [K3] Declarative Computing for Algorithmic Data Analysis. N. W. Filardo, P. C. Shyamshankar, T. Vieira, Y. Ahmad, J. Eisner. Under submission.
• [DBToaster] DBToaster: Higher-Order Delta Processing for Dynamic, Frequently Fresh Views. Y. Ahmad, O. Kennedy, C. Koch, M. Nikolic. Proceedings of the VLDB Endowment, 5(10):968-979, PVLDB 2012. [PDF]
• [K3] K3: Language Design for Building Domain-Specific Runtimes. P. Shyamshankar, Z. Palmer, Y. Ahmad. First International Workshop on Cross-model Language Design and Implementation, XLDI 2012. [PDF]
• [DISC] Incremental and Parallel Analytics on Astrophysical Data Streams. D. Mishin, T. Budavari, A. Szalay, Y. Ahmad. Proceedings of the 3rd Intl. Workshop on Data Intensive Computing in the Clouds (DataCloud), 2012.
• [DISC] I/O Streaming Evaluation of Batch Queries for Data-Intensive Computational Turbulence. K. Kanov, E. Perlman, R. Burns, Y. Ahmad, A. Szalay. ACM/IEEE Conference on High Performance Computing, SC 2011. [PDF]
• [DISC] Scientific Data Management at the Johns Hopkins Institute for Data Intensive Engineering and Science (IDIES). Y. Ahmad, R. Burns, M. Kazhdan, C. Meneveau, A. Szalay, A. Terzis. SIGMOD Record 39(3): 18-23 (2010). [PDF]
• [DBToaster] DBToaster: Agile Views in a Dynamic Data Management System. O. Kennedy, Y. Ahmad, C. Koch. CIDR 2011. [PDF]

• [DBToaster] DBToaster: A SQL Compiler for High-Performance Delta Processing in Main-Memory Databases. Y. Ahmad, C. Koch. Proceedings of the VLDB Endowment, Volume 2(2): 1566–1569, 2009.
• [Pulse/Borealis] Simultaneous Equation Systems for Query Processing on Continuous-Time Data Streams. Y. Ahmad, O. Papaemmanouil, U. Çetintemel, J. Rogers. Proceedings of the International Conference on Data Engineering (ICDE), pp. 666–675, 2008.
• [SenseWeb] COLR-Tree: Communication Efficient Spatio-Temporal Index for a Sensor Data Web Portal. Y. Ahmad, S. Nath. Proceedings of the International Conference on Data Engineering (ICDE), pp. 784–793, 2008.
• [Pulse/Borealis] Declarative Temporal Data Models for Sensor-Driven Query Processing. Y. Ahmad, U. Çetintemel. Proceedings of the International Workshop on Data Management in Sensor Nets, pp. 37–42, 2007.
• [XPORT] Extensible optimization in overlay dissemination trees. O. Papaemmanouil, Y. Ahmad, U. Çetintemel, J. Jannotti, Y. Yildirim. Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 611–622, 2006.
• [XPORT] XPORT: Extensible Profile-Driven Overlay Routing Trees. O. Papaemmanouil, Y. Ahmad, U. Çetintemel, J. Jannotti. Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 769–771, 2006.
• [XPORT] Application-aware Overlay Networks for Data Dissemination. O. Papaemmanouil, Y. Ahmad, U. Çetintemel, J. Jannotti. Proceedings of the International Workshop on Semantic Enabled Networks and Services, pp. 76, 2006.
• [Borealis] Distributed Operation in the Borealis Stream Processing Engine. Y. Ahmad, B. Berg, U. Çetintemel, M. Humphrey, J. Hwang, A. Jhingran, A. Maskey, O. Papaemmanouil, A. Rasin, N. Tatbul, W. Xing, Y. Xing, S. Zdonik. Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 882–884, 2005.
• [SAND] Locality-Aware Networked Join Evaluation. Y. Ahmad, U. Çetintemel, J. Jannotti, A. Zgolinski. Proceedings of the IEEE International Workshop Networking Meets Databases, pp. 1183, 2005.
• [SAND] Network Awareness in Internet Scale Stream Processing. Y. Ahmad, U. Çetintemel, J. Jannotti, A. Zgolinski, S. Zdonik. IEEE Bulletin of the Technical Committee on Data Engineering, March 2005, Vol 28, No. 1.
• [Borealis] The Design of the Borealis Stream Processing Engine. D. Abadi, Y. Ahmad, M. Balazinska, U. Çetintemel, M. Cherniack, J. Hwang, W. Lindner, A. Maskey, A. Rasin, E. Ryvkina, N. Tatbul, Y. Xing, S. Zdonik. Proceedings of the 2nd Conference on Innovative Database Systems (CIDR), pp. 277–289, 2005.
• [SAND] Network-Aware Query Processing for Stream-Based Applications. Y. Ahmad, U. Çetintemel. Proceedings of the International Conference on Very Large Data Bases (VLDB), pp. 456–467, 2004.
• [PL] A Type System for Statically Detecting Spreadsheet Errors. Y. Ahmad, T. Antoniu, S. Goldwater, S. Krishnamurthi. Proceedings of the IEEE Conference on Automated Software Engineering (ASE), pp. 174–183, 2003.

#### Coming soon

• K3 (open-source interpreter and compiler, summer 2013)
• MDDB (web service and workbench, summer 2013)

#### Team Workspace (internal)

K3 is a procedural, event-driven language for the simple, high-level programming of long-running data-intensive computations. Its event-driven design provides out-of-the-box support for message passing and asynchronous computation. K3 aims to provide maximal flexibility in specifying system implementation features through declarative specification of implementation, deployment, and execution aspects.

K3 is under heavy development. As an early preview of its event-driven form, we illustrate computing the Fibonacci numbers.

    declare final : int

    trigger start(n:int) {} =
      send(fibonacci, me, n, 0, 1)

    trigger fibonacci(n:int, a:int, b:int) {} =
      if n == 0
      then send(result, me, b)
      else send(fibonacci, me, n - 1, b, a + b)

    trigger result(b:int) {} = final = b

    role client1 {
      source  s1 : int = file("data/args.csv", csv)
      bind    s1 -> start
      consume s1
    }
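For readers unfamiliar with trigger-based programs, the control flow of the K3 preview can be approximated in Python. This is only a sketch of the message-passing pattern, not K3's runtime: the `Runtime` class, its `trigger` decorator, and the single-node message queue are assumptions made for illustration, and the `me` address argument is dropped because everything runs in one process.

```python
from collections import deque


class Runtime:
    """Hypothetical single-node stand-in for an event-driven runtime."""

    def __init__(self):
        self.triggers = {}      # trigger name -> handler function
        self.queue = deque()    # pending (trigger name, arguments) messages
        self.final = None       # mirrors `declare final : int`

    def trigger(self, fn):
        """Register a function as a trigger under its own name."""
        self.triggers[fn.__name__] = fn
        return fn

    def send(self, name, *args):
        """Enqueue a message; the handler runs later, not immediately."""
        self.queue.append((name, args))

    def run(self):
        """Deliver messages in FIFO order until none remain."""
        while self.queue:
            name, args = self.queue.popleft()
            self.triggers[name](*args)


rt = Runtime()


@rt.trigger
def start(n):
    rt.send("fibonacci", n, 0, 1)


@rt.trigger
def fibonacci(n, a, b):
    if n == 0:
        rt.send("result", b)
    else:
        rt.send("fibonacci", n - 1, b, a + b)


@rt.trigger
def result(b):
    rt.final = b


# Plays the role of the `client1` source feeding values to `start`.
rt.send("start", 5)
rt.run()
print(rt.final)
```

Each trigger only enqueues further messages, so the recursion in `fibonacci` becomes a chain of asynchronous events rather than a growing call stack.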


#### Overview

Molecular dynamics (MD) simulations generate detailed time-series data of all-atom motions. These simulations require tremendous computing capabilities and are leading users of the world's most powerful supercomputers.

While data generation is empowered by leadership-class computing, analysis of this high-resolution data is in its infancy in terms of scalability, ease of use, and ultimately its ability to answer "grand challenge" science questions. MD analyses are often decoupled from data generation, performed as post-processing, and can take as long to evaluate as the production phase itself.

Modern data management and analysis methods can improve on this dramatically, enabling scientists to make maximal use of petascale simulation data for scientific discovery. We are developing the Molecular Dynamics Database (MDDB) to couple HPC simulations with deep on-the-fly and exploratory analysis of MD data.

Our technical focus spans MDDB's architectural design, its curation of long-timespan protein trajectory data, and its development of biochemistry query workloads. Our initial analytics and abstractions include an extensible framework for adaptive coarse-grained steering of simulations that accelerates exploration and understanding of protein structure and function.

#### Trajectory Datasets

MD trajectories are currently represented in two forms in MDDB. First, as described in the overview, a trajectory can simply be stored as a time series of 3D coordinates for efficient ingestion. This format matches the direct outputs of MD simulators, providing high data-loading throughput. Our second format is analysis-oriented: we store protein shapes (called conformations) as a sequence of phi-psi dihedral angles describing the protein backbone. If a trajectory is viewed as a movie of protein shapes, each conformation corresponds to a single frame. Figure 1 shows a conformation of the alanine dipeptide protein described using two dihedral angles (phi, psi). For biological reasons, dihedral angles marked (*) need not be stored.

##### Figure 1: Protein conformations in phi-psi space.

Phi-psi sequences are considerably more compact than the raw format, e.g., 2 phi-psi values vs. (x, y, z) coordinates of 22 atoms for the alanine dipeptide. Furthermore, we can exploit their translation- and rotation-invariant properties for query processing. Indeed, phi-psi sequences are popular low-level features when applying machine learning techniques to molecular datasets.
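To make the compactness and invariance claims concrete, the sketch below computes a dihedral angle from four 3D points using the standard `atan2` formulation, then checks that translating and rotating the points leaves the angle unchanged. The coordinates are made up for illustration, the sign convention is an assumption, and this is not MDDB's own implementation.

```python
import math


def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees defined by four consecutive points."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)   # normals of the two planes
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Illustrative points only (not real atom coordinates).
points = [(1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)]
angle = dihedral(*points)


def move(p, shift=(5, -3, 2)):
    """Translate, then rotate by cyclically permuting the axes."""
    x, y, z = (a + s for a, s in zip(p, shift))
    return (y, z, x)


moved_angle = dihedral(*(move(p) for p in points))

# Storage comparison for the alanine dipeptide figures quoted above.
raw_values_per_frame = 22 * 3      # (x, y, z) for each of 22 atoms
phi_psi_values_per_frame = 2       # one phi and one psi angle

print(angle, moved_angle, raw_values_per_frame, phi_psi_values_per_frame)
```

Because the angle depends only on relative positions, two frames that differ by a rigid-body motion map to the same phi-psi values, which is what makes the representation convenient for comparison queries.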

MDDB represents phi-psi sequences, and their concatenation into trajectories, as columnar relations. We currently store all trajectories across multiple proteins in a single columnar relation, with selective decomposition into a feature-oriented star schema that stores information across all trajectories for a specific protein phi-psi sequence. MDDB can apply a wide range of analyses to phi-psi values. For example, Figure 1 shows analyses that: (i) identify energy wells by clustering conformations frequently assumed by the protein; (ii) identify pathways (transitions) between these energy wells; (iii) empirically construct and evaluate Markov chains that encode transition probability distributions of protein conformations.
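Analysis (iii) can be illustrated with a small sketch that estimates transition probabilities from a sequence of energy-well labels by counting consecutive pairs. The labels `"A"` and `"B"` and the toy trajectory are hypothetical; in practice the labels would come from a clustering step such as (i), and MDDB's actual procedure may differ.

```python
from collections import Counter, defaultdict


def transition_matrix(states):
    """Empirical Markov transition probabilities from a state sequence."""
    counts = defaultdict(Counter)
    for src, dst in zip(states, states[1:]):
        counts[src][dst] += 1
    # Normalize each row of counts into a probability distribution.
    return {
        src: {dst: n / sum(row.values()) for dst, n in row.items()}
        for src, row in counts.items()
    }


# Hypothetical well label for each frame of a short trajectory.
trajectory = ["A", "A", "B", "A", "B", "B", "A", "A"]
P = transition_matrix(trajectory)
print(P)
```

Each row of `P` sums to 1, so the result can be read directly as the probability of moving from one well to another between consecutive frames.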

#### Query Workload

MDDB supports two substantially different categories of query workloads: one that is a natural fit for a DBMS, and a second that requires coupling in-database functionality with a range of external software components. First, inspired by the Fourth Paradigm, we are developing a benchmark of 20 query templates with our biophysics partners to capture domain-specific ad hoc and exploratory science questions. These templates are generally compositions of SPJAG (select-project-join-aggregate-group-by) queries, frequently applying geometric computations, with varying join degrees and nesting depths of correlated subqueries. The second kind of workload arises from data-dependent iterative algorithms that produce substantial quantities of derived and intermediate data while solving search problems or converging to a fixpoint or termination condition. Adaptive control and its use of reinforcement learning match this workload pattern, as do several analysis algorithms used internally in MDDB, such as k-means clustering, MCMC inference, and replica exchange (a form of parallel tempering). Interestingly, determining when MD exploration should terminate is itself a fixpoint algorithm that involves a biased trajectory sampling process.
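The second workload category can be illustrated with k-means written as an explicit fixpoint loop: the algorithm repeats its assign and update steps until the centers stop changing. This is a minimal one-dimensional sketch on made-up angle values; it ignores the periodicity of angles and is not the in-database implementation MDDB uses.

```python
def kmeans_1d(values, centers, max_iters=100):
    """Lloyd's algorithm on scalars; returns (centers, iterations used)."""
    centers = list(centers)
    for iteration in range(1, max_iters + 1):
        # Assignment step: attach each value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:   # fixpoint: nothing changed
            return centers, iteration
        centers = new_centers
    return centers, max_iters


# Hypothetical phi angles (degrees) drawn from two wells.
phi_values = [-150, -140, -160, -60, -70, -65]
centers, iterations = kmeans_1d(phi_values, centers=[-180, 0])
print(centers, iterations)
```

Every pass materializes intermediate data (the cluster assignments) that exists only to decide whether another pass is needed, which is the pattern that distinguishes this workload from one-shot SPJAG queries.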

#### System Architecture

Our design principles for MDDB are to use off-the-shelf software and hardware components to minimize prototyping times through extensive software reuse, and to enable other biophysics groups to easily reproduce our setup. MDDB can run on either the PostgreSQL or Greenplum DBMS, with basic processing on protein systems provided by the MDAnalysis package developed by the Woolf Lab, and data analysis provided by the MADlib library. Crucially, MDAnalysis provides data import and export functionality for the majority of the popular MD codes, and we use six major MD simulators in-house for our own trajectory production (CHARMM, Amber, NAMD, Gromacs, LAMMPS, Desmond). MDDB integrates with job scheduling software, specifically Gearman, to manage trajectory production and to support a novel form of highly dynamic parallel execution of the iterative algorithms present in our workload.

Figure 2 illustrates the setup of our MDDB instance and its deployment on our 10-node commodity cluster with approximately 150 cores, 700 GB of RAM, and 100 TB of storage capacity. In addition to the core DBMS, our MDDB instance is coupled with HPC resources for large-scale trajectory production, and with a web application software stack that exposes a biophysics web service and visualization capabilities. MDDB uses institution-level and national leadership-class computing resources for trajectory production, including JHU's DataScope system, as well as XSEDE and PSC's Anton resources. Our DataScope system provides substantial floating-point capability, with 90 nodes providing Tesla-class GPUs and SSD storage for pipelined simulation and ingestion. Currently, we transfer datasets from supercomputer resources synchronously after simulations, and we are exploring bulk network transfer techniques to maximize data ingestion efficiency for MDDB from wide-area sources.
##### Figure 2: MDDB system architecture.

#### Adaptive Control

MD simulations explore protein structure and function, and can conceptually be seen as a process of sampling a high-dimensional manifold capturing protein behavior. MDDB uses reinforcement learning (RL) techniques to control and implement a continuous, adaptive sampling algorithm. Figure 3 illustrates our adaptive controller as a learning agent which: (i) observes the state of the environment (e.g., the data collected so far) in order to come up with an appropriate simulation setup; (ii) interacts with the environment by executing different simulations; (iii) observes the new state and the reward associated with performing this action in the previous state; (iv) starts over. As learning progresses, the agent tries to find an action policy that maps each state of the environment to the best available action, through statistical prediction and inference on the rewards obtained by performing different actions in different states. The goal of the adaptive controller is to maximize the total reward obtained from a sequence of actions over a long period of time.
##### Figure 3: Adaptive control loop.
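The observe, act, reward, repeat cycle described above can be sketched with tabular Q-learning, a standard RL method chosen here purely for illustration; the source does not say which RL algorithm MDDB's controller uses. The states (`"sparse"`, `"dense"`), the actions (`"explore"`, `"refine"`), and the reward rule in `toy_step` are all hypothetical stand-ins for real simulation outcomes.

```python
import random


def toy_step(state, action):
    """Hypothetical environment: returns (reward, next_state)."""
    best = {"sparse": "explore", "dense": "refine"}
    reward = 1.0 if action == best[state] else 0.0
    next_state = "dense" if state == "sparse" else "sparse"
    return reward, next_state


def q_learning(step, states, actions, steps=5000,
               alpha=0.2, gamma=0.5, epsilon=0.2, seed=0):
    """Learn a state -> action policy with epsilon-greedy Q-learning."""
    rng = random.Random(seed)
    q = {s: {a: 0.0 for a in actions} for s in states}
    state = states[0]
    for _ in range(steps):
        # (i) Observe the state and choose a setup, exploring sometimes.
        if rng.random() < epsilon:
            action = rng.choice(actions)
        else:
            action = max(actions, key=lambda a: q[state][a])
        # (ii) + (iii) Act, then observe the reward and the new state.
        reward, next_state = step(state, action)
        target = reward + gamma * max(q[next_state].values())
        q[state][action] += alpha * (target - q[state][action])
        # (iv) Start over from the new state.
        state = next_state
    return {s: max(actions, key=lambda a: q[s][a]) for s in states}


policy = q_learning(toy_step, ["sparse", "dense"], ["explore", "refine"])
print(policy)
```

The discount factor `gamma` is what makes the agent value long-run reward rather than only the immediate payoff of the next simulation.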

#### Publication

Adaptive Multiscale Exploration for Large-Scale Protein Analysis in the Molecular Dynamics Database.
S. Nutanong, Y. Ahmad, A. Szalay, T. Woolf.
Under submission.