PhD Proposal: Efficient Layouts and Algorithms for Managing Versioned Datasets

Souvik Bhattacherjee
03.08.2017 09:00 to 10:30
AVW 3450

Version Control Systems were primarily designed to keep track of and provide control over changes to source code and have since provided an excellent way to combat the problem of sharing and editing files in a collaborative setting. The recent surge in data-driven decision making has resulted in a proliferation of datasets elevating them to the level of source code which in turn has led the data analysts to resort to version control systems for the purpose of storing and managing datasets and their versions over time. Unfortunately existing version control systems are poor at handling large datasets primarily due to the underlying assumption that the stored files are relatively small text files with localized changes. Moreover the algorithms used by these systems tend to be fairly simple leading to suboptimal performance when applied to large datasets. In order to address the shortcomings, a key requirement here is to have a Dataset Version Control System that will serve as a common platform to enable data analysts to efficiently store and query dataset versions, track changes to datasets and share datasets between users at ease. Towards this goal, we address the fundamental problem of compactly storing a large number of dataset versions and efficiently retrieving any specific version. We initiate our study by considering storage-retrieval trade-offs for unstructured dataset versions such as text files, blobs, etc. where the notion of a partial version is not well-defined. We observe that the underlying techniques developed for unstructured datasets are not well suited for more structured dataset versions -- a version in this setting is defined by a collection of records each of which is uniquely addressable. We carefully explore the design space for building such a system and the various storage-retrieval trade-offs, and discuss how different storage layouts influence those trade-offs. Next, we formulate several problems trading off the version storage and retrieval cost in various ways and show that most of the problems are NP-Hard. We design several offline storage layout heuristics that effectively minimize the storage costs while keeping the retrieval costs low. In addition to version retrieval queries, our system also provides support for record provenance queries. Through extensive experiments on large datasets, we demonstrate that our proposed design can operate at the scale required in most practical scenarios. To conclude the proposal, we discuss some problems that we intend to pursue as a part of the future work: (1) devise online algorithms that handle updates to the system efficiently, (2) provide support for richer queries on compressed representations, and (3) design a system for supporting annotations on versioned datasets.

Examining Committee:

Chair: Dr. Amol Deshpande

Dept rep: Dr. Peter Keleher

Member: Dr. Alan Sussman