ABSTRACT

Title of dissertation: DELTA-BASED STORAGE AND QUERYING FOR VERSIONED DATASETS
Amit Chavan, Doctor of Philosophy, 2018
Dissertation directed by: Professor Amol Deshpande, Department of Computer Science

Data-driven methods and products are becoming increasingly common in a variety of communities, leading to a huge diversity of datasets being continuously generated, modified, and analyzed. An increasingly important consideration for the underlying data management systems is that all of these datasets and their versions over time need to be stored and queried for a variety of reasons including, but not limited to, reproducibility, collaboration, provenance, auditing, introspective analysis, and backups. However, most solutions today resort to highly ad hoc and manual version management and sharing techniques, which leads to friction when managing collaborative data science workflows, while also introducing inefficiencies.

In this dissertation, we introduce a framework for dataset version management, and address the systems building, operator design, and optimization challenges involved in building a dataset version control system. We describe the various challenges and solutions in the context of our system, called DEX, that we have developed to support increasingly complex version management tasks. We show how to use delta encoding, a key component in managing redundancy, to provide efficient storage and retrieval for thousands of dataset versions, and develop a formalism to understand the various trade-offs in a principled manner. We study the storage–recreation trade-off in detail and provide a suite of inexpensive heuristics to obtain high-quality solutions under different settings. In order to provide a rich interface to specify version management tasks, we design a new query language, called VQUEL, with the ability to query dataset versions and provenance in a unified manner. We study how assumptions on the delta format can help in the design of a logical algebra, which we then use to execute increasingly complex queries efficiently. A key characteristic of our query execution methods is that the computational cost is primarily dependent on the size and the number of deltas in the expression (typically small), and not on the input dataset versions (which can be very large). Finally, we demonstrate the effectiveness of our developed techniques by extensive evaluation of DEX on a mixture of real-world and synthetic datasets.

DELTA-BASED STORAGE AND QUERYING FOR VERSIONED DATASETS
by Amit Chavan

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2018

Advisory Committee:
Professor Amol Deshpande, Chair/Advisor
Professor Aravind Srinivasan, Co-advisor
Professor Louiqa Raschid, Dean’s Representative
Professor Daniel Abadi
Professor Mihai Pop

© Copyright by Amit Chavan 2018

Dedicated to the memories of my dear (late) grandmother, Kaku.

Acknowledgments

There are many people whose support, guidance, and friendship have made this thesis possible. I am forever indebted to my advisor, Amol Deshpande, who has taught me so much and been a great mentor during the last six years. Learning how to do successful research took me many years, and I am thankful for the skills that Amol has patiently taught me.
He taught me how to communicate ideas more effectively, gave me the freedom to investigate new directions, patiently listened and gave feedback on many of my half-baked ideas, and contributed many ideas to this work.

I am fortunate to have had many mentors on the path towards my Ph.D. I would like to thank Aravind Srinivasan for his support and guidance when I needed it the most. I am indebted to Rajiv Gandhi for introducing me to computer science research and helping me apply to graduate schools. The staff at the Computer Science Department, particularly Jennifer Story, have created a wonderful place, and I am happy to be a part of it. I am also grateful to Louiqa Raschid, Daniel Abadi, Mihai Pop, and Samir Khuller for their participation on my proposal and defense committees.

I’d also like to thank my collaborators, from whom I’ve learned so very much. In particular, the work described in Chapter 3 was done jointly with two other PhD students, Souvik Bhattacherjee and Silu Huang. I was primarily responsible for the design of the algorithms, jointly responsible for the theoretical results with Silu, and for the implementation and the experimental evaluation with Souvik. The work described in Chapter 4 was done jointly with Silu Huang, with equal contributions.

I have deeply enjoyed my interactions and friendships with the members of the database group and the Computer Science department at UMD. Thank you Souvik Bhattacherjee, Hui Miao, Abdul Quamar, Theodoros Rekatsinas, Manish Purohit, and Pan Xu for making graduate study both intellectually stimulating and incredibly enjoyable. I am also lucky to have a wonderful set of friends and housemates. An incomplete list of those who helped me on this path includes Sangeetha Venkatraman, Amey Bhangale, Kartik Nayak, Bhaskar Ramasubramanian, Ramakrishna Padmanabhan, Sudha Rao, and Meethu Malu.

My heartfelt gratitude to my family, without whom I would not have made it this far. My parents provided their love and support throughout my life, and encouraged me to do things the right way, without making any compromises on the quality of work. My sister, Deepti, sometimes believed in my goals and dreams even more than I did. Lastly, to Shruti, for understanding and supporting me in all my ups and downs during this journey. I’m fortunate to have had her boundless love and her selfless and unwavering support for the last six years. I look forward to our journey ahead, together.

Table of Contents

Dedication
Acknowledgements
List of Figures
1 Introduction
  1.1 Challenges in building a DVCS
  1.2 Dissertation Overview and Contributions
    1.2.1 Efficient Storage and Retrieval
    1.2.2 A Language to Query Provenance and Versions in a Unified Manner
    1.2.3 Delta-aware Query Execution
2 Related Work
  2.1 Enabling Multi-Versioned Storage
  2.2 Version Control Systems (VCS)
  2.3 Query Languages
  2.4 Query Execution
3 Storage–Recreation Tradeoff
  3.1 Introduction
  3.2 Problem Overview
    3.2.1 Essential Notations and Preliminaries
    3.2.2 Mapping to Graph Formulation
    3.2.3 ILP Formulation
  3.3 Computational Complexity
  3.4 Proposed Algorithms
    3.4.1 Local Move Greedy Algorithm
    3.4.2 Modified Prim’s Algorithm
    3.4.3 LAST Algorithm
    3.4.4 Git Heuristic
  3.5 Experiments
    3.5.1 Datasets
    3.5.2 Comparison with SVN and Git
    3.5.3 Experimental Results
4 A Unified Query Language for Provenance and Versioning
  4.1 Introduction
  4.2 Preliminaries
  4.3 Language Features
    4.3.1 Examples
    4.3.2 Syntactic sweetenings
    4.3.3 Aggregate operators
    4.3.4 Version graph traversal
    4.3.5 Extensions to fine-grained provenance
5 Query Execution I: Set-based Operations
  5.1 Introduction
  5.2 System Overview
    5.2.1 User Data Model
    5.2.2 Queries
    5.2.3 System Data Model
      5.2.3.1 Storage Graph
      5.2.3.2 Set-backed Deltas and Properties
  5.3 Query Execution Preliminaries
    5.3.1 Optimization Metrics
    5.3.2 Access Tree
    5.3.3 Search Space
    5.3.4 Cost and Cardinality Estimation
  5.4 Query Execution Algorithms
    5.4.1 Checkout Queries
      5.4.1.1 Single datafile Checkout
      5.4.1.2 Multiple datafile Checkout
    5.4.2 Intersection Queries
    5.4.3 Union
    5.4.4 t-Threshold
  5.5 Experimental Evaluation
    5.5.1 Comparisons with Temporal Indexing
6 Query Execution II: Declarative Queries
  6.1 Introduction
  6.2 System Design
    6.2.1 Schema Specification
    6.2.2 Delta format
    6.2.3 Physical Representation
    6.2.4 Discussion
  6.3 Query Execution
    6.3.1 v-tuples
    6.3.2 Scan Operator
      6.3.2.1 Delta Contraction
      6.3.2.2 Line/Star Structures
      6.3.2.3 Applying Delta to a Materialized datafile
    6.3.3 JOIN Operator
    6.3.4 Simple Hash Join
    6.3.5 Version-aware Hash Join
    6.3.6 Other Operators
      6.3.6.1 Filter Operator
      6.3.6.2 Project Operator
  6.4 Evaluation
    6.4.1 Datasets
    6.4.2 Results
7 Conclusions
Bibliography

List of Figures

1.1 System Architecture of DEX
1.2 Three solutions with different storage and recreation costs
3.1 Matrices corresponding to the example in Figure 1.2
3.2 Graph based formulation
3.3 A feasible storage graph
3.4 Illustration of Proof of Lemma 5
3.5 Illustration of Proof of Lemma 6
3.6 Illustration of Local Move Greedy Heuristic
3.7 Directed Graph G
3.8 Undirected Graph G
3.9 Illustration of Modified Prim’s algorithm in Figure 3.7
3.10 Illustration of LAST on Figure 3.8
3.11 Distribution of delta sizes
3.12 Results for the directed case, comparing the storage costs and total recreation costs
3.13 Results for the directed case, comparing the storage costs and maximum recreation costs
3.14 Results for the undirected case, comparing the storage costs and total recreation costs
3.15 Taking workload into account leads to better solutions
3.16 Running times of LMG
4.1 Conceptual data model for VQUEL
4.2 Example version graph
5.1 Example of access trees
5.2 An instance of the Tree Contraction algorithm
5.3 Line and star structures
5.4 Access tree during the progress of C&R
5.5 Effect of varying #Δ when |Δ| = 5%
5.6 Effect of varying |A|
5.7 Effect of varying #Δ when |Δ| = 5%
5.8 Access tree shapes; (a) Line, (b) Star, (c) Line-and-star
5.9 Effect of access tree structure when |Δ| = 1%, #Δ = 100
5.10 Effect of query size when |Δ| = 1%
5.11 Intersect – Effect of |A|
5.12 Intersect – Effect of |Δ|
5.13 Union – Effect of |Δ|
5.14 t-Thres. – Effect of |Δ|
5.15 Effect of bitmap size
6.1 Three datafiles and three column-wise deltas
6.2 (a) Example query on k versions, (b) physical plan to execute the query
6.3 v-tuples for R1, R2, R3
6.4 Line and star transformations for the Scan operation
6.5 Buckets in simple hash join
6.6 Improved bucket structure for sparse keys
6.7 Scan performance over varying delta sizes

Chapter 1: Introduction

Data-driven methods and products are becoming increasingly common in a variety of communities, leading to a huge diversity of datasets being continuously generated, modified, and analyzed. Data science teams, in a variety of domains, want to acquire new datasets and perform increasingly sophisticated analysis tasks in order to derive valuable insights. Such efforts are collaborative in nature and can often span multiple teams or organizations, where one team uses datasets generated by another team in their analysis tasks. Moreover, data science tasks are also iterative in nature – data is updated regularly, algorithms are improved, or new approaches are tried out to explore their impact. An update to a dataset, for instance, might lead to updating all the tasks or “pipelines” that depend on it. An increasingly important consideration for the underlying data management systems is that all of these datasets and their versions over time need to be stored and queried for a variety of reasons including auditing, provenance, transparency, accountability, introspective analysis, and backups [1, 2, 3, 4, 5]. We outline an example data science workflow that highlights the iterative and collaborative aspects of the activity.

Example 1. Genome assembly of a whole genome sequence dataset is a complex task — apart from huge computational demands, it is not always known a priori which tools and settings will work best on the available sequence data for an organism [6].
The process typically involves testing multiple tools, parameters, and approaches to produce the best possible assembly for downstream analysis. The assemblies are evaluated on a host of metrics (e.g., the N50 statistic), and the choice of which assembly is the best one is also not always clear. One potential sequence of steps might be: Sequenced reads (FastQ files) → Error correction tools (Quake, Sickle, etc.) → Input analysis, k-mer calculation (KmerGenie) → Assembly tool (SOAPdenovo, ABySS) → Assembly analysis and selection (QUAST).

A group of researchers may collaboratively try to analyze this data in various ways, building upon the work done by the others in the team, but also trying out different algorithms or tools. New data is also likely to be ingested at various points, either as updates/corrections to the existing data or as results of additional experiments. As one can imagine, the ad hoc nature of this process and the desire not to lose any intermediate synthesized result means that the researchers will be left with a large number of datasets and analyses, mostly with large overlaps between them, and complex derivational dependencies.

We observe that although there is a wealth of data science research addressing stand-alone data analysis issues or building integrated tools for analysis, the dataset management aspects of these tools are poor, requiring data scientists to use ad hoc mechanisms to record and reason about datasets. When people collaborate over data, it is not uncommon to have hundreds or thousands of versions of collected, curated, and derived datasets, at various degrees of structure (relational/JSON to fully unstructured). Ad hoc solutions to this problem can hardly adapt to satisfy the long-term tracking of all such complex data and metadata. Once people start doing data analysis, they need sustainable and scalable tools that allow them to spend little to no effort on maintaining, sharing, and tracking changes to datasets, and instead focus on scientific and knowledge discovery.

In this dissertation, we propose new methods and tools towards the goal of making the management of such datasets hassle-free, and address some of the fundamental problems in this space. We build a new dataset version control system (DVCS) to keep track of all the dataset versions and dataset provenance in one place, making it easy to analyze or query this information. By keeping all this information in one place, we can enable a rich set of functionality that can simplify the lives of data scientists, make it easier to identify and eliminate errors, and decrease the time to obtain actionable insights. A DVCS can keep track of all the datasets and their thousands of versions, storing them compactly, while still being able to retrieve them efficiently, and query them to enable rich introspection capabilities. The goal of this dissertation is to develop a formalism for managing a large number of dataset versions, i.e., storing and querying them, and to design algorithms that use the formalism to enable efficient execution of a large class of introspection queries over these versions.

1.1 Challenges in building a DVCS

The first challenge that we consider in this dissertation is how to efficiently store and retrieve the thousands of dataset versions. Because of the nature of the tasks that create these versions, not all versions are entirely different from one another, and there is massive redundancy in their contents.
If the difference, or delta, between two similar versions can be computed, it is possible to save on storage costs by keeping only one version and the delta, instead of two versions. The large number of overlapping versions makes it imperative to exploit such deltas to compactly store all of them. However, this gives rise to the storage–recreation tradeoff: the more storage we use, the faster it is to recreate or retrieve versions, while the less storage we use, the slower it is to recreate or retrieve versions.

The second challenge is that a versioning API that simply provides put and get facilities is very limiting, and is not a good fit for data scientists to understand and reason about data contained within versions. Tools like git/svn lack sophisticated query interfaces. For example, in the data science scenario mentioned above, one may wish to write a query that finds the intersection of a set of versions, representing, for instance, the final synthesized result of different pipelines. Identifying the design goals of a language that enables users to traverse, compare, and query the version metadata, version history, and the data itself, in a holistic manner, is an important and necessary problem for a DVCS.

The third challenge is designing efficient execution algorithms for the tasks mentioned above. For instance, users of a DVCS naturally want tools that help them reason about multiple versions of datasets, and as such, they will pose queries that often access more than one version at a time (e.g., for finding similarities among a set of versions). Existing solutions require users to first get/checkout, i.e., fully reconstruct, every version of a dataset or file involved in a query before running any queries on the data stored within them. This approach is less than ideal, particularly when the individual versions are large and the users need to access multiple versions for their analysis task. First, irrespective of the size of the query result, this approach entails creating all the input versions before query processing can begin, resulting in large memory and/or I/O usage. Second, it requires users to maintain another system to assist in executing the queries. Third, this approach fails to exploit the fact that most datasets evolve through changes that are small relative to the dataset sizes, and knowing about these changes can not only help us towards our storage goals, but also enable answering queries efficiently.

1.2 Dissertation Overview and Contributions

Given the challenges described above, the primary goal of this dissertation is to develop a robust framework to model the fundamental problems for dataset version management, and to design new techniques and algorithms that enable users to execute rich queries efficiently. We have built a prototype system, called DEX, whose architecture is shown in Figure 1.1. The techniques summarized below serve as building blocks of this architecture. The two main components that this dissertation focuses on are the Storage Graph Builder and the Query Processor module. The DEX system is built on top of git and has three major components: (a) a set of command line utilities, DEX CLI, written in Python, to allow the user to interact with the repository in the form of the standard add, commit, checkout, etc., commands (similar to git), (b) the Storage Graph Builder, which decides how best to store the collection of dataset versions, and (c) the Query Processor, written in Java, that executes user queries against the compressed representation.
DEX CLI passes version management tasks not pertaining to large datasets through to git; the user may specify a file to be managed by DEX through a flag to the add command, and any tasks pertaining to those files are sent to the Storage Graph Builder (in case of add or commit) or the Query Processor.

Figure 1.1: System Architecture of DEX

The main parts of the dissertation, as well as the associated chapters and published papers, are summarized below.

1.2.1 Efficient Storage and Retrieval

Given the high overlap and duplication among the datasets, it is attractive to consider using delta encoding to store the datasets in a compact manner. Delta encoding is a cornerstone of many no-overwrite storage systems that are focused on archiving and maintaining vast quantities of datasets (simply put, collections of files). Archival and backup systems often store multiple versions or snapshots of large datasets that have significant overlap across their contents using deltas.

At a high level, delta encoding consists of representing a target version as the mutation, or delta, from the content of a source version. Typically, source and target versions are selected such that they have a large overlap across their contents and hence their delta is small. Furthermore, the source version itself may be represented as a delta from another version, and so on, creating a “graph” of versions and deltas. The compressed storage is obtained by storing only a few select versions, commonly referred to as materialized versions, and deltas (instead of the versions they represent) in this graph, such that it is possible to re-create any version by walking the path of deltas starting from a materialized file and ending at the desired file. However, this gives rise to the storage–recreation tradeoff: the more storage we use, the faster it is to recreate or retrieve versions, while the less storage we use, the slower it is to recreate or retrieve versions. We illustrate this trade-off via an example.

Example 2. Figure 1.2(i) displays a version graph, indicating the derivation relationships among 5 versions. Let V1 be the original dataset. Say there are two teams collaborating on this dataset: team 1 modifies V1 to derive V2, while team 2 modifies V1 to derive V3. Then, V2 and V3 are merged and give V5. As presented in Figure 1.2, V1 is associated with ⟨10000, 10000⟩, indicating that V1’s storage cost and recreation cost are both 10000 when stored in its entirety (we note that these two are typically measured in different units – see the second challenge below); the edge (V1 → V3) is annotated with ⟨1000, 3000⟩, where 1000 is the storage cost for V3 when stored as the modification from V1 (we call this the delta of V3 from V1) and 3000 is the recreation cost for V3 given V1, i.e., the time taken to recreate V3 given that V1 has already been recreated.

One naive solution to store these datasets would be to store all of them in their entirety (Figure 1.2 (ii)). In this case, each version can be retrieved directly, but the total storage cost is rather large, i.e., 10000 + 10100 + 9700 + 9800 + 10120 = 49720. At the other extreme, only one version is stored in its entirety while the other versions are stored as modifications or deltas to that version, as shown in Figure 1.2 (iii).
The total storage cost here is much smaller (10000 + 200 + 1000 + 50 + 200 = 11450), but the recreation cost is large for V2, V3, V4 and V5. For instance, the path (V1 → V3 → V5) needs to be accessed in order to retrieve V5, and the recreation cost is 10000 + 3000 + 550 = 13550 > 10120. Figure 1.2 (iv) shows an intermediate solution that trades off increased storage for reduced recreation costs for some versions. Here we store versions V1 and V3 in their entirety and store modifications for the other versions. This solution exhibits a higher storage cost than solution (iii) but a lower one than (ii), and still results in significantly reduced retrieval costs for versions V3 and V5 over (iii).

Figure 1.2: (i) A version graph over 5 datasets – annotation ⟨a, b⟩ indicates a storage cost of a and a recreation cost of b; (ii, iii, iv) three possible storage graphs

Despite the fundamental nature of the storage–retrieval problem, there is surprisingly little prior work on formally analyzing this trade-off and on designing techniques for identifying effective storage solutions for a given collection of datasets. Version Control Systems (VCS) like Git, SVN, or Mercurial, despite their popularity, use fairly simple algorithms underneath, and are known to have significant limitations when managing large datasets [7, 8]. Much of the prior work in the literature focuses on a linear chain of versions, or on minimizing the storage cost while ignoring the recreation cost.

We address this problem in detail in Chapter 3. We show that simply using delta compression to minimize storage space leads to very high latencies while retrieving specific datasets. We also show that the delta compression heuristics used by popular VCSs like Git and SVN are ineffective at storing datasets, both in terms of resources consumed and the quality of the solution produced. Thus, to address the first challenge mentioned above, we propose novel problem formulations towards understanding the storage and recreation tradeoff in a principled manner. We formulate several optimization problems and show that most variations are NP-Hard. As a result, we design several efficient heuristics that are effective at exploring this trade-off, and present an extensive experimental evaluation over several synthetic and real-world workloads demonstrating the effectiveness of our algorithms at handling large problem sizes.

1.2.2 A Language to Query Provenance and Versions in a Unified Manner

To be truly useful for collaborative data science, we also need the ability to specify queries and analysis tasks over the versioning and provenance information in a unified manner. In the genome assembly example mentioned earlier, there is a wide range of queries that may be of interest. Simple queries include: (a) identifying versions based on the metadata information (e.g., authors); (b) identifying versions that were derived (directly or through a chain of derivations) from a specific outdated version; and (c) finding versions that differ from their predecessor version by a large number of records.
More complex queries include: (d) finding versions where the data within satisfies certain aggregation conditions; (e) finding the intersection of a set of versions (representing, e.g., the final synthesized results of different pipelines); and (f) finding versions that contain any records derived from a specific record in a version.

These examples illustrate some of the key requirements for a query language, namely the ability to:

• Traverse the version graph (i.e., version-level provenance information) and query the metadata associated with the versions and the derivation/update edges.
• Compare several versions to each other in a flexible manner.
• Run declarative queries over data contained in a version, to the extent allowable by the structure in the data.
• Query the tuple-level provenance information, when available, in conjunction with the version-level provenance information.

In Chapter 4, we describe a language, called VQUEL, that aims to provide these features. VQUEL is largely a generalization of the Quel language (while also introducing certain syntactic conveniences that Quel does not possess), and combines features from GEM and path-based query languages. This means that VQUEL is a full-fledged relational query language, and in addition, it enables the seamless querying of the data present in versions, versioning derivation relationships, as well as versioning metadata.

1.2.3 Delta-aware Query Execution

While delta encoding is an effective method to archive large amounts of data in many immutable data stores, many of these stores require users to “check out” complete file/dataset versions in order to manipulate them. In Chapter 5 and Chapter 6, we show that this approach is less than ideal for a variety of rich queries that compare or work with data from across multiple versions, and that significant performance benefits can be had if we make use of the pre-computed deltas during query processing. In particular, in these two chapters, we present a systematic study of the problem of supporting rich analysis queries over delta-oriented storage engines, and describe the Query Execution module of DEX. We focus on the storage design and implementation of DEX for a class of semi-structured datasets that we call datafiles, and for a class of basic queries that includes multi-version checkouts, intersections, unions, t-threshold, and single block select-project-join queries.

A datafile is a file whose contents can be seen as a set of records, i.e., the order of records within a datafile is immaterial, and no two records in a datafile are identical. Examples of such files include CSV files, JSON documents, and log files, to name a few; these constitute a large fraction of files in a typical data lake, and hence are particularly suited to our study. A common method to represent a delta between two datafiles is to maintain the “deletions” and “additions” of records required to go from one datafile to the other. If a datafile has additional structure, namely a user-supplied primary key, along with details on how to parse a record into its component fields or columns, we use a more compact column-based delta format.

In Chapter 5, we develop a general cost-based optimization framework based on key algebraic properties regarding the composition of the deltas. The result of this framework is a compact algebraic expression that confines query execution to a small set of deltas, as the sketch below illustrates.
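To make the set-of-records abstraction and the deletions/additions delta format concrete, the following Python fragment is a minimal, illustrative sketch (it is not the actual DEX implementation, and the names in it are hypothetical). It shows how a set-backed delta can recreate a target version from a source version, and how a simple query such as the intersection of two adjacent versions can be answered from the stored version and the delta alone.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SetDelta:
    """Delta from a source datafile to a target datafile,
    both viewed as sets of (hashable) records."""
    deletions: frozenset  # records in the source but not in the target
    additions: frozenset  # records in the target but not in the source

def diff(source: set, target: set) -> SetDelta:
    """Compute the set-backed delta that turns `source` into `target`."""
    return SetDelta(deletions=frozenset(source - target),
                    additions=frozenset(target - source))

def apply_delta(source: set, d: SetDelta) -> set:
    """Recreate the target version from the source version and the delta."""
    return (source - d.deletions) | d.additions

# Example: V2 is derived from V1 by a small change.
V1 = {("r1", 10), ("r2", 20), ("r3", 30)}
V2 = {("r1", 10), ("r2", 25), ("r3", 30), ("r4", 40)}
d12 = diff(V1, V2)

assert apply_delta(V1, d12) == V2

# The intersection of V1 and V2 can be computed without recreating V2:
# every record of V1 that the delta does not delete is also in V2.
assert V1 & V2 == V1 - d12.deletions
```

The last assertion is the kind of rewriting the cost-based framework generalizes: rather than materializing both input versions, the operator is pushed onto the (typically much smaller) delta.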
As a side effect, the computational cost is dependent on the size and the number of deltas in the expression (which are typically small), in contrast to the size of the input datasets. We develop optimal algorithms for executing single-file or multi-file checkout queries assuming reasonable restrictions on the evaluation plan search space. We also develop a series of intuitive transformation rules that help simplify the search space for intersection, union, and t-threshold queries, and use them in conjunction with cost-based solutions for base cases to develop effective search algorithms. We present a comprehensive evaluation against synthetic datasets of varying characteristics. Our results show that our methods perform exceedingly well compared to the baselines, even for simple queries like single-file checkouts.

In Chapter 6, we extend the above framework to execute single block select-project-join (SPJ) queries, such as those that can be formulated in VQUEL, over multiple versions stored in DEX. The key design decision of our approach is to execute the various query operators exactly once for each unique tuple in the set of versions, rather than executing the query once for each version. We propose using a new representation for tuples, called v-tuples, during query processing. In addition to holding information about the fields in a tuple, a v-tuple also stores information about which of the queried versions the respective tuple appears in. We show how to efficiently construct such v-tuples from the delta-encoded representation and present our modifications to the traditional Scan and Join physical operators in order to process v-tuples efficiently. We present an extensive evaluation of this approach on multiple synthetic datasets. Our results show that richer query processing is viable in DEX, with significant performance benefits over one-version-at-a-time execution.

Delta Representations and Tradeoffs: A key question in DEX is selecting the delta variant, i.e., the particular format/algorithm for computing the delta between two files. This is because different delta formats are appropriate for different types of files: a UNIX-style line-by-line diff is a common delta format for plain text files, while an XOR is more suited to numerical array-oriented data. Exploiting the structure in the data, if known, can often lead to better deltas (e.g., for XML [9], or relations [10]). Column-based deltas may be more appropriate when a large number of records are changed slightly, e.g., due to a schema change. Furthermore, a particular delta format may be directed or undirected: if a delta Δ between source file A and target file B is directed, it may only be used to recreate B given A, and not vice versa. An undirected delta between two files, on the other hand, can accept either file as source and recreate the other.

The desire to execute queries directly on deltas (as we do in this work) brings another dimension to this choice. There is an inherent tension between the amount of information stored in a delta and our ability to push query execution onto it. In this work, we pick different delta formats depending on the class of queries that the user wishes to support. Supporting richer queries and compact deltas requires us to make assumptions about the datafile schema, as follows:

• In Chapter 3, we consider storing and retrieving individual files (in their entirety). As such, we do not require any assumptions or restrictions on the file or delta format, and any suitable delta algorithm can be used.
However, we do require the storage cost and the access cost, or their estimates, of the resulting delta.

• In Chapter 5, we consider a class of set-based query operators on the past versions. Hence we require that every version of a datafile, at commit time, can be separated into a set of records. The delta format is then also based on the set-of-records abstraction, to aid in efficient query execution.

• In Chapter 6, we consider select-project-join queries across a large number of past versions. In order for such queries to make sense, the datafile, at commit time, must be separable into records, and every record must be separable into its constituent fields/columns. In order to use a compact column-based delta format, we also require the user to specify a column or a set of columns as the primary key in the datafile.

Chapter 2: Related Work

This section surveys the state of the art in managing versioned data. Our goal in this section is to give a brief survey of the many decades of work done in this area and to put our contributions in context. We begin with prior work performed towards the goal of supporting efficient storage of multiple versions of data, followed by a review of recent solutions in the Version Control Systems space. Thereafter, we summarize work done towards building richer interfaces to query such data and to execute those queries efficiently.

2.1 Enabling Multi-Versioned Storage

Relational Database Management Systems. Many database applications require that multiple versions of records be stored and retrieved, and as such, there has been extensive research on temporal and versioned databases and their applications. The effort to satisfy the diverse needs of these applications has led to a number of versioning solutions, e.g., [11, 12, 13, 14, 15, 16]. Much work, especially earlier papers, focused on theoretical foundations, not on practical considerations such as storage efficiency and indexing versioned data. We briefly review some of the work done in this area; for a detailed survey, we refer the reader to [12]. In addition, extensive bibliographies have also been compiled, see [17] as a starting point.

There was some conceptual temporal database work in the 1980s, see [18, 19] ([18] contains references to even earlier work), on developing temporal data models and temporal query languages. The most basic concepts that a relational temporal database is based upon are valid time and transaction time, considered orthogonal to each other. Valid time denotes the time period during which a fact is true with respect to the real world. Transaction time is the time when a fact is stored in the database. The first relational database system offering temporal functionality was Postgres [20]. Postgres used R-trees [21] to index historical data, with recent data residing in a B+Tree. This separation is important, as a general multi-attribute index like an R-tree has difficulty supporting data that is current and hence does not yet have an end time. The movement of data from the B+tree to the R-tree occurs lazily.

Transaction time functionality has also received some industrial interest, particularly from Oracle [22] and Microsoft. Oracle’s FlashBack queries allow the application to access prior transaction time states of the database and to retrieve all the versions of a row between two transaction times.
It also allows for “point in time” recovery, i.e., tables and databases can be rolled back to a previous transaction time, discarding all changes after that time; this functionality can be used to deal with bad user transactions. They do not index historical versions, however, so historical version queries must go through current time versions and then search backward “linearly” in time. More recently, Oracle supports the “Total Recall” feature in Oracle 11g [23]. Building on FlashBack, the Total Recall archive is read-only and supports long-term archiving of transaction time versions, including migration of the versions to archival media. A form of compression is supported to reduce the storage cost of retaining more extensive database history. Immortal DB, which was built into Microsoft SQL Server, integrated a temporal indexing technique called the TSB-tree [14, 15] to provide high performance access and update for both current and historical data.

Buneman et al. [24] proposed an archiving technique where all versions of the data are merged into one hierarchy. An element appearing in multiple versions is stored only once, along with a timestamp. This technique of storing versions is in contrast with techniques where retrieval of certain versions may require undoing the changes (unrolling the deltas). The hierarchical data and the resulting archive are represented in XML format, which enables the use of XML tools such as an XML compressor for compressing the archive. It was not, however, a full-fledged version control system representing an arbitrary graph of versions; rather, it focused on algorithms for compactly encoding a linear chain of versions.

MOLAP systems store data in multidimensional arrays [25] with a particular focus on aggregation queries. These systems exploit data structures to efficiently compute rollups. The MOLAP system in [26] supports versions to represent changes to the data sources that should be propagated to the data warehouse periodically. But the versioning system is designed to benefit the concurrency control mechanism in order to minimize contention between query and maintenance transactions.

Scientific Databases. In many fields of science, multidimensional arrays rather than flat tables are standard data types because data values are associated with coordinates in space and time. As a result, many specialized array-processing systems have emerged, e.g., [27, 28]. As noted in [29], an important requirement that scientists have for these systems is the ability to create, archive, and explore different versions of their arrays. None of the temporal database solutions mentioned above are a good fit here because (1) simulating arrays on top of relations can be inefficient [29], and (2) their internal data structures are not specialized for time travel over array data. Hence, a no-overwrite storage manager with efficient support for querying old versions of an array is a critical component of an array database management system. In recent work, Seering et al. [30] presented a disk-based versioning system using efficient delta encoding to minimize space consumption and retrieval time in array-based systems.

Other Data Models and Deduplication Schemes. Many solutions have been proposed to support multiple versions of complex data, e.g., XML [31], object-oriented [32], and spatio-temporal data [33]. Khurana and Deshpande [34] present an approach for managing historical graph data for large information networks, and for executing snapshot retrieval queries on them.
Quinlan and Dorward [35] propose an archival “deduplication” storage system that identifies duplicate blocks across files and stores them only once to reduce storage requirements. Zhu et al. [36] present several optimizations on the basic theme. Douglis and Iyengar [37] present several techniques to identify pairs of files that could be efficiently stored using delta compression even if there is no explicit derivation information known about the two files. Ouyang et al. [38] studied the problem of compressing a large collection of related files by performing a sequence of pairwise delta compressions. They proposed a suite of text clustering techniques to prune the graph of all pairwise delta encodings and find the optimal branching (i.e., MCA) that minimizes the total weight. Similar dictionary-based reference encoding techniques have been used by Chan and Woo [39] to efficiently represent a target web page in terms of additions/modifications to a small number of reference web pages. Burns and Long [40] present a technique for in-place re-construction of delta-compressed files using a graph-theoretic approach. Kulkarni et al. [41] present a more general technique that combines several different techniques to identify similar blocks among a collection of files, and use delta compression to reduce the total storage cost (ignoring the recreation costs).

Our Contributions. Most of the work in temporal relational database management systems has focused its efforts on efficiently storing a “linear chain” of versions, unlike our work, which requires supporting an arbitrary DAG of versions. Moreover, unlike our framework, which does not make any assumptions about how the versions were modified, many schemes assume knowledge of the specific records or cells that are updated. The general concept of multi-versioning has also been used extensively in commercial databases to provide snapshot isolation [42]. However, these methods only store enough history to preserve transactional semantics, whereas we preserve all historical branches and derivation relationships to ensure integrity of the version graph. Finally, prior efforts that have looked at the problem of minimizing the total storage cost for storing a collection of related files do not typically consider the recreation cost or the tradeoffs between the two. Their design also does not consider querying data in the past versions using rich query interfaces, such as the ones available in temporal databases.

2.2 Version Control Systems (VCS)

Version Control Systems (VCS) have a long history in Computer Science. A VCS records changes to a file or set of files over time so that any user can recall a specific version later. Versioning techniques such as forward and backward delta encoding and the use of multi-version B-trees have been implemented in various legacy systems. git [43] is one of the conventional version control systems and is believed to be faster and more disk-efficient than other similar version control systems. The major difference between git and any other VCS, such as Subversion (svn) or Concurrent Versions System (cvs), is the way git thinks about its data. While most other systems store information as a list of file-based changes, git thinks of its data more like a set of snapshots of a miniature filesystem and stores changes at the snapshot level. To be efficient, if a file has not changed, git doesn’t store the file again; instead, it stores just a link to the previous identical file it has already stored.
Despite their popularity, these systems largely use fairly simple algorithms underneath that are optimized to work with modest-sized source code files, and their on-disk structures are optimized to work with line-based diffs. These systems are known to have significant limitations when handling large files and large numbers of versions [8]. As a result, a variety of extensions like git-annex [44], Git Large File Storage [45], etc., have been developed to make them work reasonably well with large files. These extensions replace large files with text pointers inside Git, while storing the file contents on a remote server like GitHub.com or Amazon S3.

Our Contributions. The initial design of DEX is modeled after conventional version control software such as git. In particular, the concepts of a no-update model and of differencing stored files against each other for more efficient storage have both been explored extensively by such systems. We build upon this work to support a superset of the conventional version control API for large datasets. We show that the underlying algorithms in git and Subversion can be extremely inefficient, both in terms of storage used and resource (memory/IO) consumption.

2.3 Query Languages

Abdessalem and Jomier [46] introduce VQL, a language designed for querying data stored in multiversion databases. VQL is based on a first-order calculus and provides users with the ability to navigate through object versions modeled by the database. More recently, new temporal constructs have been pushed into the SQL standard by the main DBMS vendors [47]. This work, however, does not directly apply to our setting because the constructs assume a linear chain of versions — as noted earlier, we could have an arbitrary branching structure of versions.

While there has been substantial work on query languages for provenance, ranging from adaptations of SQL [48], Prolog [49, 50], and SPARQL [51, 52] to specialized languages such as QLP [53, 54], PQL [55], and ProQL [56] ([57], [58] have additional examples), much of this work centers on well-defined workflows and tuple-based provenance rather than collaborative settings where multiple users interact through a derivation graph of versions in an ad hoc manner.

Our Contributions. We design a new query language, called VQUEL, that is capable of querying dataset versions, dataset provenance (e.g., which datasets a given dataset was derived from), and record-level provenance (if available). Our design draws from constructs introduced in the historical Quel [59] and GEM [60] languages, neither of which had a temporal component.

2.4 Query Execution

There are a large number of proposed indexing techniques for temporal data, e.g., [33, 61, 62]. Salzberg and Tsotras [63] present a comprehensive survey of indexing structures for temporal databases. They also present a classification of the different queries that one may ask over a temporal database. Lomet et al. [14, 15] integrated a temporal indexing technique, the TSB-tree, into Immortal DB (which was built into Microsoft SQL Server) to serve as the core access method. The TSB-tree provides high performance access and update for both current and historical data. Jouini and Jomier [64] studied the problem of efficiently indexing data with “branched evolution”. The main contributions here are the extension of temporal index structures to data with branched evolution, and an analysis method that estimates the performance of the different index structures and provides guidelines for the selection of the most appropriate one.
Soroush and Balazinska [65] present an indexing technique to support “time travel” queries for scientific arrays. A key aspect of their technique is that they can support approximate queries that can quickly identify which versions are relevant to a user and return the approximate content of these versions.

Queries in delta-based storage. Delta encoding has been used in a variety of systems to provide trade-offs among time, space, and compression performance, e.g., to reduce data transfer time for text/HTTP objects [66], to reduce access time in a file system [67], and to store many versions of generated artifacts in source code control systems (e.g., git) or other types of data [10, 30, 68]. However, the focus of many of the existing delta encoding schemes has been to access the objects in their entirety, and to the best of our knowledge, they have not considered the tradeoff between storage and “computability over deltas”. Even version control systems that provide functionality to compare multiple objects, e.g., merge, diff, etc., first recreate all required files before operating upon them. Recently, [65] presented an indexing technique to support “time travel” queries for scientific arrays wherein they support approximate queries that can quickly identify which versions are relevant to a user and return the approximate content of these versions. However, they did not consider queries that compare the contents of two or more array versions.

Deltas and computing. The concept of making deltas “first-class citizens” was explored in Heraclitus [69]. To support “what-if” scenario analysis, they provided general-purpose constructs for creating, accessing, and combining deltas. In the specific realization of their paradigm for the relational model, deltas are a set of signed atoms where the positive atoms correspond to “insertions” and the negative atoms correspond to “deletions”. In addition, the deltas have structure and can be manipulated directly by constructs in user programs, e.g., to delete all records satisfying a predicate. In contrast, our use of deltas is at the physical level and not exposed to the users. They do not consider optimizing the different types of queries against a delta storage. Executing queries with hypothetical state updates was also considered in [70]. Here the state updates (or deltas) were allowed to be expressions, and the authors considered rewriting such queries into an optimized form based on their novel rules for substitution and the rules of relational algebra. Such rules are, however, not applicable in our setting. Record-based deltas were also used in [71, 72] to provide the capability of sharing data and updates among different participants. However, they focused on formalizing the semantics of the update exchange process, e.g., mapping updates across schemas and filtering them according to local trust policies, and the challenges introduced therein.

Connections to materialized views. Using pre-computed deltas to answer user queries is, at a high level, similar to the problems that have been considered in the context of materialized view and index selection to speed up query processing.
Broadly speaking, research in this area has focused on three issues: (i) determining the search space or class of views to consider for materialization, (ii) choosing a subset of views and indexes to materialize depending on various constraints like storage overhead, maintenance overhead, effectiveness on the query workload, etc. [73, 74], and (iii) quickly determining which views to consider to answer a given query [75, 76]. In our problem setting, set-backed deltas can be considered a form of materialized views (which can be used to reconstruct base relations), with our work addressing the problem of how to use such views to answer queries. We also design a logical algebra over the specific delta formats that allows us to combine a large number of deltas to answer one query.

Evaluating set expressions. Several algorithms and data structures have been proposed in the literature to solve union, intersection, and difference problems on sets [77, 78, 79, 80] by minimizing the number of comparisons required. Although the ordered list representation is the most common, some algorithms also consider representing sets in other data structures, e.g., skip lists [81], machine word-based representations [82], etc., to obtain additional speedup. A comparison of a few of these methods is available in [83, 84]. Speeding up set operations is largely orthogonal to our approach, and we can make use of some of these techniques as additional operators with the appropriate cost model. As mentioned earlier, in this work we use an adaptive set intersection algorithm that was shown to have reasonably good performance in [83] without requiring any preprocessing step. The larger problem here, however, is how to efficiently evaluate a set expression consisting of union, intersection, and difference. [85] consider evaluating union-intersection expressions in a worst-case efficient way for a non-comparison based model. However, their approach uses hash-based dictionaries, which would require an additional pre-processing step, and it remains an open problem whether their results can be extended to handle set difference. Recently, [86] showed that, for a similar cost model, a union-intersection expression can be rewritten to perform intersections before unions, often with reduced cost. Their approach, however, did not consider rewrites in the presence of set difference.

Chapter 3: Storage–Recreation Tradeoff

3.1 Introduction

In this chapter, we present a formal study of the problem of deciding how to jointly store a collection of dataset versions, provided along with a version or derivation graph. Specifically, we focus on the problem of trading off storage costs and recreation costs in a principled fashion. Aside from being able to handle the scale, both in terms of dataset sizes and the number of versions, there are several other considerations that make this problem challenging.

• Different application scenarios and constraints lead to many variations on the basic theme of balancing storage and recreation cost (see Table 3.1). The variations arise both out of different ways to reconcile the conflicting optimization goals, as well as because of the variations in how the differences between versions are stored and how versions are reconstructed. For example, some mechanisms for constructing differences between versions lead to symmetric differences (either version can be recreated from the other version) — we call this the undirected case. The scenario with asymmetric, one-way differences is referred to as the directed case.
• Similarly, the relationship between storage and recreation costs leads to significant variations across different settings. In some cases the recreation cost is proportional to the storage cost (e.g., if the system bottleneck lies in the I/O cost or network communication), but that may not be true when the system bottleneck is CPU computation. This is especially true for sophisticated differencing mechanisms where a compact derivation procedure might be known to generate one dataset from another.

• Another critical issue is that computing deltas for all pairs of versions is typically not feasible. Relying purely on the version graph may not be sufficient, and significant redundancies across datasets may be missed.

• Further, in many cases, we may have information about relative access frequencies indicating the relative likelihood of retrieving different datasets. Several baseline algorithms for solving this problem cannot be easily adapted to incorporate such access frequencies.

The key contributions of this chapter are as follows.

• We formally define and analyze the dataset versioning problem and consider several variations of the problem that trade off storage cost and recreation cost in different manners, under different assumptions about the differencing mechanisms and recreation costs (Section 3.2). Table 3.1 summarizes the problems and our results. We show that most of the variations of this problem are NP-Hard (Section 3.3).

            Storage Cost   Recreation Cost                 Undirected Case, Δ = Φ        Directed Case, Δ = Φ / Δ ≠ Φ
Problem 1   minimize C     Ri < ∞, ∀i                      PTime, Minimum Spanning Tree (all three cases)
Problem 2   C < ∞          minimize max{Ri | 1 ≤ i ≤ n}    PTime, Shortest Path Tree (all three cases)
Problem 3   C ≤ β          minimize ∑ni=1 Ri               NP-hard                       NP-hard, LMG Algorithm
Problem 4   C ≤ β          minimize max{Ri | 1 ≤ i ≤ n}    LAST Algorithm†               NP-hard, MP Algorithm
Problem 5   minimize C     ∑ni=1 Ri ≤ θ                    NP-hard                       NP-hard, LMG Algorithm
Problem 6   minimize C     max{Ri | 1 ≤ i ≤ n} ≤ θ         LAST Algorithm†               NP-hard, MP Algorithm

Table 3.1: Problem Variations With Different Constraints, Objectives and Scenarios.

• We provide two light-weight heuristics: one when there is a constraint on the average recreation cost, and one when there is a constraint on the maximum recreation cost; we also show how we can adapt a prior solution for balancing minimum spanning trees and shortest path trees for undirected graphs (Section 3.4).

• We implement the proposed algorithms in our prototype DEX system. We present an extensive experimental evaluation of these algorithms over several synthetic and real-world workloads, demonstrating the effectiveness of our algorithms at handling large problem sizes (Section 3.5).

3.2 Problem Overview

In this section, we first introduce essential notations and then present the various problem formulations. We then present a mapping of the basic problem to a graph-theoretic problem, and also describe an integer linear program to solve the problem optimally.

3.2.1 Essential Notations and Preliminaries

Version Graph. We let {V1, V2, …, Vn} be a collection of n versions. The derivation relationships between versions are represented or captured in the form of a version graph: a directed edge from Vi to Vj represents that Vj was derived from Vi (either through an update operation, or through an explicit transformation). Since branching and merging are permitted in DEX (admitting collaborative data science), the version graph is a DAG (directed acyclic graph) instead of a linear chain. For example, Figure 1.2 represents a version graph, where V2 and V3 are derived from V1 separately, and then merged to form V5.
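A version graph of this form can be represented directly as a DAG keyed by version identifiers. The following is a minimal sketch (not the representation used inside DEX; the class and method names are our own) of recording derivation edges, including merges with multiple parents:

# Minimal sketch of a version graph as a DAG; illustrative names only.
class VersionGraph:
    def __init__(self):
        self.parents = {}            # version id -> list of versions it was derived from

    def add_version(self, vid, derived_from=()):
        # One parent for an update/transformation, several parents for a merge.
        self.parents[vid] = list(derived_from)

    def roots(self):
        # Versions with no recorded parents (e.g., the initial version).
        return [v for v, ps in self.parents.items() if not ps]

# The portion of Figure 1.2 described in the text: V2 and V3 are derived
# from V1 and later merged to form V5.
g = VersionGraph()
g.add_version("V1")
g.add_version("V2", derived_from=["V1"])
g.add_version("V3", derived_from=["V1"])
g.add_version("V5", derived_from=["V2", "V3"])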
Storage and Recreation. Given a collection of versions, we need to reason about the storage cost, i.e., the space required to store the versions, and the recreation cost, i.e., the time taken to recreate or retrieve the versions. For a version Vi, we can either:

• Store Vi in its entirety: in this case, we denote the storage required to record version Vi fully by Δi,i. The recreation cost in this case is the time needed to retrieve this recorded version; we denote that by Φi,i. A version that is stored in its entirety is said to be materialized.

• Store a “delta” from Vj: in this case, we do not store Vi fully; we instead store its modifications from another version Vj. For example, we could record that Vi is just Vj but with the 50th tuple deleted. We refer to the information needed to construct version Vi from version Vj as the delta from Vj to Vi. The algorithm giving us the delta is called a differencing algorithm. The storage cost for recording the modifications from Vj, i.e., the size of the delta, is denoted by Δj,i. The recreation cost is the time needed to recreate the recorded version given that Vj has been recreated; this is denoted by Φj,i.

Thus the storage and recreation costs can be represented using two matrices Δ and Φ: the entries along the diagonal represent the costs for the materialized versions, while the off-diagonal entries represent the costs for deltas. From this point forward, we focus our attention on these matrices: they capture all the relevant information about the versions for managing and retrieving them.

Delta Variants. Notice that by changing the differencing algorithm, we can produce deltas of various types:

• for text files, UNIX-style diffs, i.e., line-by-line modifications between versions, are commonly used;
• we could have a listing of a program, script, SQL query, or command that generates version Vi from Vj;
• for some types of data, an XOR between the two versions can be an appropriate delta; and
• for tabular data (e.g., relational tables), recording the differences at the cell level is yet another type of delta.

Furthermore, the deltas could be stored compressed or uncompressed. The various delta variants lead to the various dimensions of the problem that we describe subsequently.

The reader may be wondering why we need to reason about two matrices Δ and Φ. In some cases, the two may be proportional to each other (e.g., if we are using uncompressed UNIX-style diffs). But in many cases, the storage cost of a delta and the recreation cost of applying that delta can be very different from each other, especially if the deltas are stored in a compressed fashion. Furthermore, while the storage cost is more straightforward to account for, in that it is proportional to the bytes required to store the deltas between versions, the recreation cost is more complicated: it could depend on the network bandwidth (if versions or deltas are stored remotely), the I/O bandwidth, and the computation costs (e.g., if decompression or running of a script is needed).
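To make one of these delta variants concrete, the short sketch below uses Python's difflib to compute a UNIX-style, line-by-line delta between two toy in-memory versions and measures its size; it is an illustration only (DEX's differencing machinery is not shown here), and the simplifying assumption that Φ is proportional to Δ is ours.

import difflib

def line_delta(src_lines, dst_lines):
    # A UNIX-style, line-by-line delta from one version to another.
    return list(difflib.unified_diff(src_lines, dst_lines, lineterm=""))

def size_in_bytes(lines):
    # Uncompressed size of a version or a delta, in bytes.
    return sum(len(line) + 1 for line in lines)

# Two toy versions of a small CSV dataset.
v1 = ["id,age", "1,34", "2,61", "3,29"]
v2 = ["id,age", "1,34", "3,29", "4,50"]        # one row deleted, one row added

delta_1_to_2 = line_delta(v1, v2)

# Diagonal entries of the matrices hold full-version costs; off-diagonal
# entries hold delta costs (here we only illustrate the Δ side).
print("Δ_2,2 =", size_in_bytes(v2), "  Δ_1,2 =", size_in_bytes(delta_1_to_2))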
Example 3 Figure 3.1 shows the matrices Δ and Φ based on the version graph in Figure 1.2. The annotation associated with the edge (Vi, Vj) in Figure 1.2 is essentially ⟨Δi,j, Φi,j⟩, whereas the vertex annotation for Vi is ⟨Δi,i, Φi,i⟩. If there is no edge from Vi to Vj in the version graph, we have two choices: we can either set the corresponding Δ and Φ entries to “−” (unknown) (as shown in the figure), or we can explicitly compute the values of those entries (by running a differencing algorithm). For instance, Δ3,2 = 1100 and Φ3,2 = 3200 are computed explicitly in the figure (the specific numbers reported here are fictitious and not the result of running any specific algorithm).

(i) Δ:
         V1      V2      V3      V4      V5
V1    10000     200    1000      --      --
V2      500   10100      --      50     800
V3       --    1100    9700      --     200
V4       --      --      --    9800     900
V5       --      --      --     800   10120

(ii) Φ:
         V1      V2      V3      V4      V5
V1    10000     200    3000      --      --
V2      600   10100      --     400    2500
V3       --    3200    9700      --     550
V4       --      --      --    9800    2500
V5       --      --      --    2300   10120

Figure 3.1: Matrices corresponding to the example in Figure 1.2 (with additional entries revealed beyond the ones given by the version graph).

Discussion. Before moving on to formally defining the basic optimization problem, we note several complications that present unique challenges in this scenario.

• Revealing entries in the matrix: Ideally, we would like to compute all pairwise Δ and Φ entries, so that we do not miss any significant redundancies among versions that are far from each other in the version graph. However, when the number of versions, denoted n, is large, computing all those entries can be very expensive (and typically infeasible), since this means computing deltas between all pairs of versions. Thus, we must reason with incomplete Δ and Φ matrices. Given a version graph, one option is to restrict our deltas to correspond to actual edges in the version graph; another option is to restrict our deltas to be between “close by” versions, with the understanding that versions close to each other in the version graph are more likely to be similar. Prior work has also suggested mechanisms (e.g., based on hashing) to find versions that are close to each other [37]. We assume that some mechanism to choose which deltas to reveal is provided to us.

• Multiple “delta” mechanisms: Given a pair of versions (Vi, Vj), there could be many ways of maintaining a delta between them, with different Δi,j, Φi,j costs. For example, we can store a program used to derive Vj from Vi, which could take longer to run (i.e., the recreation cost is higher) but is more compact (i.e., the storage cost is lower), or we can explicitly store the UNIX-style diffs between the two versions, with lower recreation costs but higher storage costs. For simplicity, we pick one delta mechanism; thus the matrices Δ, Φ have just one entry per (i, j) pair. Our techniques also apply to the more general scenario with small modifications.

• Branches: Both branching and merging are common in collaborative analysis, making the version graph a directed acyclic graph. In this chapter, we assume each version is either stored in its entirety or stored as a delta from a single other version, even if it is derived from two different datasets. Although it may be more efficient to allow a version to be stored as a delta from two other versions in some cases, representing such a storage solution requires more complex constructs, and both the problems of finding an optimal storage solution for a given problem instance and retrieving a specific version become much more complicated. Getting a better understanding of such constructs remains a rich area for future work.

Matrix Properties and Problem Dimensions. The storage cost matrix Δ may be symmetric or asymmetric depending on the specific differencing mechanism used for constructing deltas.
For example, the XOR differencing function results in a symmetric Δ matrix, since the delta from a version Vi to Vj is identical to the delta from Vj to Vi. UNIX-style diffs, where line-by-line modifications are listed, can either be two-way (symmetric) or one-way (asymmetric). The asymmetry may be quite large. For instance, it may be possible to represent the delta from Vi to Vj very compactly using a command like: delete all tuples with age > 60. However, the reverse delta from Vj to Vi is likely to be quite large, since all the tuples that were deleted from Vi would be a part of that delta. In this chapter, we consider both these scenarios. We refer to the scenario where Δ is symmetric as the undirected case and the scenario where Δ is asymmetric as the directed case.

A second issue is the relationship between Φ and Δ. In many scenarios, it may be reasonable to assume that Φ is proportional to Δ. This is generally true for deltas that contain detailed line-by-line or cell-by-cell differences. It is also true if the system bottleneck is network communication or I/O cost. In a large number of cases, however, it may be more appropriate to treat them as independent quantities with no overt or known relationship. For the proportional case, we assume that the proportionality constant is 1 (i.e., Φ = Δ); the problem statements, algorithms, and guarantees are unaffected by having a constant proportionality factor. The other case is denoted by Φ ≠ Δ. This leads us to identify three distinct cases with significantly diverse properties: (1) Scenario 1: undirected case, Φ = Δ; (2) Scenario 2: directed case, Φ = Δ; and (3) Scenario 3: directed case, Φ ≠ Δ.

Objective and Optimization Metrics. Given Δ and Φ, our goal is to find a good storage solution, i.e., we need to decide which versions to materialize and which versions to store as deltas from other versions. Let S = {(i1, j1), (i2, j2), ...} denote a storage solution. A pair with ik = jk indicates that the version Vik is materialized (i.e., stored explicitly in its entirety), whereas a pair (ik, jk) with ik ≠ jk indicates that we store a delta from Vik to Vjk. We require any solution we consider to be a valid solution, where it is possible to reconstruct any of the original versions. More formally, S is considered a valid solution if and only if for every version Vi, there exists a sequence of distinct versions Vl1, ..., Vlk = Vi such that (l1, l1), (l1, l2), (l2, l3), ..., (lk−1, lk) are contained in S (in other words, there is a version Vl1 that is materialized and can be used to recreate Vi through a chain of deltas). We can now formally define the optimization goals:

• Total Storage Cost (denoted C): The total storage cost for a solution S is simply the storage cost necessary to store all the materialized versions and the deltas: C = ∑(i,j)∈S Δi,j.

• Recreation Cost for Vi (denoted Ri): Let Vl1, ..., Vlk = Vi denote a sequence that can be used to reconstruct Vi. The cost of recreating Vi using that sequence is Φl1,l1 + Φl1,l2 + ... + Φlk−1,lk. The recreation cost for Vi is the minimum of these quantities over all sequences that can be used to recreate Vi.
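The cost definitions above translate directly into code. The following sketch (illustrative only; the dictionary-based encoding and all names are ours) computes the total storage cost C and the recreation costs Ri of a candidate storage solution, using a few of the entries from Figure 3.1 and assuming each version is stored exactly once, either in full or as a single delta:

# Δ and Φ as sparse dictionaries: (j, i) -> cost of storing / applying the delta
# from Vj to Vi; (i, i) -> cost of materializing / reading Vi in full.
delta = {(1, 1): 10000, (1, 2): 200, (1, 3): 1000, (2, 4): 50, (3, 5): 200}
phi   = {(1, 1): 10000, (1, 2): 200, (1, 3): 3000, (2, 4): 400, (3, 5): 550}

# A storage solution: V1 materialized, the remaining versions stored as deltas.
solution = {(1, 1), (1, 2), (1, 3), (2, 4), (3, 5)}

def total_storage_cost(solution, delta):
    return sum(delta[e] for e in solution)

def recreation_costs(solution, phi):
    parent = {i: j for (j, i) in solution}     # each version stored exactly once
    cost = {}
    def recreate(i):
        if i not in cost:
            j = parent[i]
            cost[i] = phi[(j, i)] if j == i else recreate(j) + phi[(j, i)]
        return cost[i]
    for i in parent:
        recreate(i)
    return cost

print(total_storage_cost(solution, delta))     # C
print(recreation_costs(solution, phi))         # {i: Ri}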
Problem Formulations. We now state the problem formulations that we consider in this chapter, starting with two base cases that represent two extreme points in the spectrum of possible problems.

Problem 1 (Minimizing Storage) Given Δ, Φ, find a valid solution S such that C is minimized.

Problem 2 (Minimizing Recreation) Given Δ, Φ, identify a valid solution S such that ∀i, Ri is minimized.

The above two formulations minimize either the storage cost or the recreation cost, without worrying about the other. It may appear that the second formulation is not well-defined and that we should instead aim to minimize the average recreation cost across all versions. However, the (simple) solution that minimizes the average recreation cost also naturally minimizes Ri for each version.

In the next two formulations, we want to minimize (a) the sum of recreation costs over all versions (∑i Ri), or (b) the maximum recreation cost across all versions (maxi Ri), under the constraint that the total storage cost C is smaller than some threshold β. These problems are relevant when the storage budget is limited.

Problem 3 (MinSum Recreation) Given Δ, Φ and a threshold β, identify S such that C ≤ β, and ∑i Ri is minimized.

Problem 4 (MinMax Recreation) Given Δ, Φ and a threshold β, identify S such that C ≤ β, and maxi Ri is minimized.

The next two formulations seek to instead minimize the total storage cost C given a constraint on the sum of recreation costs or on the maximum recreation cost. These problems are relevant when we want to reduce the storage cost, but must satisfy some constraints on the recreation costs.

Problem 5 (Minimizing Storage (Sum Recreation)) Given Δ, Φ and a threshold θ, identify S such that ∑i Ri ≤ θ, and C is minimized.

Problem 6 (Minimizing Storage (Max Recreation)) Given Δ, Φ and a threshold θ, identify S such that maxi Ri ≤ θ, and C is minimized.

3.2.2 Mapping to Graph Formulation

In this section, we map our problem to a graph problem, which allows us to adapt and modify algorithms for well-studied problems such as minimum spanning tree construction and delay-constrained scheduling. Given the matrices Δ and Φ, we can construct a directed, edge-weighted graph G = (V, E) representing the relationships among the different versions as follows. For each version Vi, we create a vertex Vi in G. In addition, we create a dummy vertex V0 in G. For each Vi, we add an edge V0 → Vi, and assign its edge weight as a tuple ⟨Δi,i, Φi,i⟩. Next, for each Δi,j ≠ ∞, we add an edge Vi → Vj with edge weight ⟨Δi,j, Φi,j⟩.

The resulting graph G is similar to the original version graph, but with several important differences. An edge in the version graph indicates a derivation relationship, whereas an edge in G simply indicates that it is possible to recreate the target version using the source version and the associated edge delta (in fact, ideally G is a complete graph). Unlike the version graph, G may contain cycles, and it also contains the special dummy vertex V0. Additionally, in the version graph, if a version Vi has multiple in-edges, it is the result of a user/application merging changes from multiple versions into Vi. In contrast, multiple in-edges in G capture the multiple choices that we have in recreating Vi from other versions.

Given graph G = (V, E), the goal of each of our problems is to identify a storage graph Gs = (Vs, Es), a subgraph of G, favorably balancing the total storage cost and the recreation cost for each version. Implicitly, we will store all versions and deltas corresponding to edges in this storage graph. (We explain this in the context of the example below.) We say a storage graph Gs is feasible for a given problem if (a) each version can be recreated based on the information contained or stored in Gs, and (b) the recreation cost or the total storage cost meets the constraint listed in each problem.
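The construction of G from Δ and Φ is mechanical. The sketch below (a simplified illustration using plain dictionaries rather than DEX's internal structures) builds the edge-weighted graph, including the dummy vertex V0, from which a storage graph can then be selected:

def build_graph(delta, phi, versions):
    # V0 -> Vi carries the materialization costs <Δi,i, Φi,i>;
    # Vi -> Vj carries the delta costs <Δi,j, Φi,j> for every known delta.
    edges = {}                      # (source, target) -> (storage cost, recreation cost)
    for i in versions:
        edges[(0, i)] = (delta[(i, i)], phi[(i, i)])
    for (i, j), d in delta.items():
        if i != j:                  # off-diagonal entries are deltas
            edges[(i, j)] = (d, phi[(i, j)])
    return edges

# Reusing the known entries of the matrices in Figure 3.1.
delta = {(1, 1): 10000, (2, 2): 10100, (3, 3): 9700, (4, 4): 9800, (5, 5): 10120,
         (1, 2): 200, (1, 3): 1000, (2, 4): 50, (2, 5): 800, (3, 5): 200}
phi   = {(1, 1): 10000, (2, 2): 10100, (3, 3): 9700, (4, 4): 9800, (5, 5): 10120,
         (1, 2): 200, (1, 3): 3000, (2, 4): 400, (2, 5): 2500, (3, 5): 550}

G = build_graph(delta, phi, versions=[1, 2, 3, 4, 5])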
Example 4 Given the matrices Δ and Φ in Figures 3.1(i) and 3.1(ii), the corresponding graph G is shown in Figure 3.2. Every version is reachable from V0. For example, edge (V0, V1) is weighted with ⟨Δ1,1, Φ1,1⟩ = ⟨10000, 10000⟩; edge (V2, V5) is weighted with ⟨Δ2,5, Φ2,5⟩ = ⟨800, 2500⟩. Figure 3.3 is a feasible storage graph given the graph G in Figure 3.2, where V1 and V3 are materialized (since the edges from V0 to V1 and V3 are present) while V2, V4 and V5 are stored as modifications of other versions.

Figure 3.2: Graph G
Figure 3.3: Storage Graph Gs

After mapping our problem into a graph setting, we have the following lemma.

Lemma 1 The optimal storage graph Gs = (Vs, Es) for all six problems listed above must be a spanning tree T rooted at the dummy vertex V0 in graph G.

Proof 1 Recall that a spanning tree of a graph G(V, E) is a subgraph of G that (i) includes all vertices of G, (ii) is connected, i.e., every vertex is reachable from every other vertex, and (iii) has no cycles. Any Gs must satisfy (i) and (ii) in order to ensure that a version Vi can be recreated from V0 by following the path from V0 to Vi. Conversely, if a subgraph satisfies (i) and (ii), it is a valid Gs according to our definition above. Regarding (iii), the presence of a cycle creates redundancy in Gs. Formally, given any subgraph that satisfies (i) and (ii), we can arbitrarily delete one edge from each of its cycles until the subgraph is cycle-free, while preserving (i) and (ii).

For Problems 1 and 2, we have the following observations. A minimum spanning tree is defined as a spanning tree of smallest weight, where the weight of a tree is the sum of all its edge weights. A shortest path tree is defined as a spanning tree where the path from the root to each vertex is a shortest path between those two vertices in the original graph: this simply consists of the edges explored in an execution of Dijkstra's shortest path algorithm.

Lemma 2 The optimal storage graph Gs for Problem 1 is a minimum spanning tree of G rooted at V0, considering only the weights Δi,j.

Lemma 3 The optimal storage graph Gs for Problem 2 is a shortest path tree of G rooted at V0, considering only the weights Φi,j.

3.2.3 ILP Formulation

We present an ILP formulation of the optimization problems described above. Here, we take Problem 6 as an example; the other problems are similar. Let xi,j be a binary variable for each edge (Vi, Vj) ∈ E, indicating whether edge (Vi, Vj) is in the storage graph or not. Specifically, x0,j = 1 indicates that version Vj is materialized, while xi,j = 1 indicates that the modification from version i to version j is stored, where i ≠ 0. Let ri be a continuous variable for each vertex Vi ∈ V, where r0 = 0; ri captures the recreation cost for version i (and must be ≤ θ).

minimize ∑(Vi,Vj)∈E xi,j × Δi,j, subject to:

1. ∑i xi,j = 1, ∀j
2. rj − ri ≥ Φi,j if xi,j = 1
3. ri ≤ θ, ∀i
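As one concrete way of writing this program down, the sketch below expresses the model for Problem 6 using the PuLP modeling library (one of several possible choices; the function and variable names are ours). The implication in constraint (2) is linearized with a large constant, anticipating the transformation discussed after Lemma 4 below; r0 = 0 and constraint (3) are encoded through variable bounds and an explicit equality.

from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

def problem6_ilp(edges, theta):
    # edges: (i, j) -> (delta_ij, phi_ij), with vertex 0 as the dummy root.
    vertices = {v for e in edges for v in e}
    big_m = 2 * theta                        # "sufficiently large" constant

    prob = LpProblem("min_storage_max_recreation", LpMinimize)
    x = {e: LpVariable(f"x_{e[0]}_{e[1]}", cat=LpBinary) for e in edges}
    r = {v: LpVariable(f"r_{v}", lowBound=0, upBound=theta) for v in vertices}

    # Objective: total storage cost of the chosen edges.
    prob += lpSum(x[e] * edges[e][0] for e in edges)
    prob += r[0] == 0

    # (1) every non-root vertex has exactly one incoming edge.
    for j in vertices - {0}:
        prob += lpSum(x[(i, jj)] for (i, jj) in edges if jj == j) == 1

    # (2) linearized: r_j >= r_i + phi_ij whenever x_ij = 1.
    for (i, j), (_, phi_ij) in edges.items():
        prob += r[j] - r[i] >= phi_ij - (1 - x[(i, j)]) * big_m

    # prob.solve() would invoke PuLP's default solver on the model.
    return prob, x, r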
Lemma 4 Problem 6 is equivalent to the optimization problem described above.

Proof 2 First, constraint 1 indicates that each vertex Vi, 1 ≤ i ≤ n, has one and only one incoming edge, as described in Lemma 1. Constraint 2 ensures that no cycle exists in Gs. This can be proven by contradiction: if there exists a cycle {Vk1, …, Vkl, Vk1}, then constraint 2 gives

rk2 − rk1 ≥ Φk1,k2,  rk3 − rk2 ≥ Φk2,k3,  …,  rk1 − rkl ≥ Φkl,k1.

Summing these inequalities yields 0 ≥ Φk1,k2 + Φk2,k3 + … + Φkl,k1, which is impossible since all Φ entries are positive. Thus, constraints 1 and 2 ensure that the resulting storage graph Gs is a spanning tree. Constraint 3 corresponds to the recreation cost constraint in Problem 6, but note that ri is not necessarily the recreation cost for version Vi.

First, any solution to Problem 6 satisfies all constraints listed in the integer linear program above by setting ri = Ri; hence the optimal objective value of the program is no larger than the minimum total storage cost in Problem 6. Next, we prove the opposite direction by contradiction. Suppose there exists a solution to the program above whose objective value is strictly smaller than the minimum total storage cost in Problem 6. According to constraints 2 and 3 and the spanning tree property (let the path from root V0 to Vj be {Vk1 = V0, Vk2, ..., Vkl, Vk(l+1) = Vj}):

rj ≥ rkl + Φkl,j ≥ ... ≥ r0 + ∑m=1..l Φkm,k(m+1),  and  θ ≥ rj,
⇒ θ ≥ r0 + ∑m=1..l Φkm,k(m+1).

Thus, the recreation cost constraint for each version Vj is fulfilled, i.e., the chosen edges form a feasible storage graph for Problem 6 with a smaller total storage cost. Hence, the assumed optimum for Problem 6 is not the minimum storage cost, which is a contradiction.

Note however that the general form of an ILP does not permit an if-then statement (as in (2) above). Instead, we can transform it to the general form with the aid of a large constant C. Thus, constraint 2 can be expressed as follows:

Φi,j + ri − rj ≤ (1 − xi,j) × C

where C is a “sufficiently large” constant such that no additional constraint is added to the model. For instance, C can be set to 2 ∗ θ here. On one hand, if xi,j = 1 ⇒ Φi,j + ri − rj ≤ 0. On the other hand, if xi,j = 0 ⇒ Φi,j + ri − rj ≤ C. Since C is “sufficiently large”, no additional constraint is added.

3.3 Computational Complexity

In this section, we study the complexity of the problems listed in Table 3.1 under different application scenarios.

Problem 1 and 2 Complexity. As discussed in Section 3.2, Problems 1 and 2 can be solved in polynomial time by directly applying a minimum spanning tree algorithm (Kruskal's algorithm or Prim's algorithm for undirected graphs; Edmonds' algorithm [87] for directed graphs) and Dijkstra's shortest path algorithm, respectively. Kruskal's algorithm has time complexity O(E log V), while Prim's algorithm also has time complexity O(E log V) when using a binary heap to implement the priority queue, and O(E + V log V) when using a Fibonacci heap. The running time of Edmonds' algorithm is O(EV) and can be reduced to O(E + V log V) with a faster implementation. Similarly, Dijkstra's algorithm for constructing the shortest path tree starting from the root has a time complexity of O(E log V) via a binary heap-based priority queue implementation and O(E + V log V) via a Fibonacci heap-based implementation.

Next, we show that Problems 5 and 6 are NP-hard even for the special case where Δ = Φ and Φ is symmetric. This will lead to hardness proofs for the other variants.

Triangle Inequality. The primary challenge that we encounter while demonstrating hardness is that our deltas must obey the triangle inequality: unlike other settings where deltas need not obey real constraints, in our case deltas represent actual modifications that can be stored, and so they must obey additional realistic constraints. This causes severe complications in proving hardness, often transforming the proofs from very simple to fairly challenging. Consider the scenario where Δ = Φ and Φ is symmetric. We take Δ as an example.
The triangle inequality can be stated as follows:

|Δp,q − Δq,w| ≤ Δp,w ≤ Δp,q + Δq,w
|Δp,p − Δp,q| ≤ Δq,q ≤ Δp,p + Δp,q

where p, q, w ∈ V and p ≠ q ≠ w. The first inequality states that the “delta” between two versions cannot exceed the total “deltas” of any two-hop path with the same starting and ending vertex, while the second inequality indicates that the “delta” between two versions must be at least one version's full storage cost minus the other version's full storage cost. Since each tuple and modification is recorded explicitly when Φ is symmetric, it is natural that these two inequalities hold.

Figure 3.4: Illustration of Proof of Lemma 5

Problem 6 Hardness. We now demonstrate hardness.

Lemma 5 Problem 6 is NP-hard when Δ = Φ and Φ is symmetric.

Proof 3 Here we prove NP-hardness using a reduction from the set cover problem. Recall that in the set cover problem, we are given m sets S = {s1, s2, ..., sm} and n items T = {t1, t2, ..., tn}, where each set si covers some items, and the goal is to pick k sets from S whose union is T while minimizing k. Given a set cover instance, we now construct an instance of Problem 6 that will provide a solution to the original set cover problem. The threshold we will use in Problem 6 will be (β + 1)α, where α, β are constants that are each greater than 2(m + n). (This is just to ensure that they are “large”.) We now construct the graph G(V, E) in the following way; we display the constructed graph in Figure 3.4. Our vertex set V is as follows:

• ∀si ∈ S, create a vertex si in V.
• ∀ti ∈ T, create a vertex ti in V.
• create an extra vertex v0, and two dummy vertices v1, v2 in V.

We add the two dummy vertices simply to ensure that v0 is materialized, as we will see later. We now define the storage cost for materializing each vertex in V in the following way:

• ∀si ∈ S, the cost is α.
• ∀ti ∈ T, the cost is (β + 1)α.
• for vertex v0, the cost is α.
• for vertices v1, v2, the cost is (β + 1)α.

(These are the numbers colored blue in the tree of Figure 3.4(b).) As we can see above, we have set the costs in such a way that the vertex v0 and the vertices corresponding to sets in S have low materialization cost, while the other vertices have high materialization cost: this is by design, so that we only end up materializing these vertices. Our edge set E is now as follows.

• we connect vertex v0 to each si with weight 1.
• we connect v0 to both v1 and v2, each with weight αβ.
• ∀si ∈ S, we connect si to tj with weight αβ when tj ∈ si, where β = |V|.

It is easy to show that our constructed graph G obeys the triangle inequality. Consider a solution to Problem 6 on the constructed graph G. We now demonstrate that that solution leads to a solution of the original set cover problem. Our proof proceeds in four key steps:

Step 1: The vertex v0 will be materialized, while v1, v2 will not be materialized. Assume the contrary—say v0 is not materialized in a solution to Problem 6. Then, both v1 and v2 must be materialized, because if they are not, then the recreation cost of v1 and v2 would be at least (β + 1)α + 1, violating the condition of Problem 6. However, we can avoid materializing v1 and v2, and instead keep the deltas to v0 and materialize v0, maintaining the recreation cost as is while reducing the storage cost. Thus v0 has to be materialized, while v1, v2 will not be materialized.
(Our reason for introducing v1, v2 is precisely to ensure that v0 is materialized, so that it can provide a basis for storing deltas to the sets si.)

Step 2: None of the ti will be materialized. Say a given ti is materialized in the solution to Problem 6. Then, either a set sj that is connected to ti in Figure 3.4(a) is also materialized, or not. In the former case, we can avoid materializing ti and instead add the delta from sj to ti, thereby reducing storage cost while keeping recreation cost fixed. In the latter case, pick any sj such that sj is connected to ti and is not materialized. Then, we must have the delta from v0 to sj as part of the solution. Here, we can replace that edge and the materialized ti with a materialized sj and the delta from sj to ti: this reduces the total storage cost while keeping the recreation cost fixed. Thus, in either case, we can improve the solution if any of the ti are materialized, rendering the statement false.

Step 3: For each si, either it is materialized, or the edge from v0 to si will be part of the storage graph. This step is easy to see: since none of the ti are materialized, either each si has to be materialized, or we must store a delta from v0.

Step 4: The sets si that are materialized correspond to a minimal set cover of the original problem. It is easy to see that for each tj we must have an si such that si covers tj and si is materialized, in order for the recreation cost constraint not to be violated for tj. Thus, the materialized si must form a set cover for the original problem. Furthermore, in order for the storage cost to be as small as possible, as few si as possible must be materialized (this is the only place we can save cost). Thus, the materialized si also correspond to a minimal set cover for the original problem. Hence, minimizing the total storage cost is equivalent to minimizing k in the set cover problem.

Note that while the reduction above uses a graph with only some edge weights (i.e., recreation costs of the deltas) known, a similar reduction can be derived for a complete graph with all edge weights known. Here, we simply use the shortest path in the graph reduction above as the edge weight for the missing edges. In that case, once again, the storage graph in the solution to Problem 6 will be identical to the storage graph described above.

Problem 5 Hardness: We now show that Problem 5 is NP-Hard as well. The general philosophy is similar to the proof of Lemma 5, except that we create c dummy vertices instead of the two dummy vertices v1, v2 in Lemma 5, where c is sufficiently large—this is to once again ensure that v0 is materialized.

Lemma 6 Problem 5 is NP-Hard when Δ = Φ and Φ is symmetric.

Figure 3.5: Illustration of Proof of Lemma 6

Proof 4 We prove NP-hardness using a reduction from the set cover problem. Recall that in the set cover decision problem, we are given m sets S = {s1, s2, ..., sm} and n items T = {t1, t2, ..., tn}, where each set si covers some items, and given a k, we ask whether there is a sub-collection of at most k sets from S whose union is T. Given a set cover instance, we now construct an instance of Problem 5 that will provide a solution to the original set cover decision problem.
The corresponding decision problem for Problem 5 is: given threshold + ( + 1) n + k + (m − k)( + 1) + ( + 1)c in Problem 5, 47 is the minimum total storage cost in the constructed graph G no bigger than + k + (m − k) + n + c. We now construct the graph G(V , E) in the following way; we display the constructed graph in Figure 3.5. Our vertex set V is as follows: • ∀si ∈ S, create a vertex si in V. • ∀ti ∈ T , create a vertex ti in V. • create an extra vertex v0, and c dummy vertices {v1, v2,… , vc} in V . We add the c dummy vertices simply to ensure that v0 is materialized, as we will see later. We now dene the storage cost for materializing each vertex in V in the following way: • ∀si ∈ S, the cost is . • ∀ti ∈ T , the cost is ( + 1) . • for vertex v0, the cost is . • for each vertex in {v1, v2,… , vc}, the cost is + 1. (These are the numbers colored blue in the tree of Figure 3.5.) As we can see above, we have set the costs in such a way that the vertex v0 and the vertices corresponding to sets in S have low materialization cost while the vertices corresponding to T have high materialization cost: this is by design so that we only end up materializing these vertices. Even though the costs of the dummy vertices is close to that of v0, si , we will show below that they will not be materialized either. Our edge set E is now as follows. • we connect vertex v0 to each si with weight 1. 48 • we connect v0 to vi , 1 ≤ i ≤ c each with weight 1. • ∀si ∈ S, we connect si to tj with weight when tj ∈ si , where = |V |. It is easy to show that our constructed graph G obeys the triangle inequality. Consider a solution to Problem 5 on the constructed graph G. We now demonstrate that that solution leads to a solution of the original set cover problem. Our proof proceeds in four key steps: Step 1: The vertex v0 will be materialized, while vi , 1 ≤ i ≤ c will not be materialized. Let’s examine the rst part of this observation, i.e., that v0 will be materialized. Assume the contrary. If v0 is not materialized, then at least one vi , 1 ≤ i ≤ c, or one of the si must be materialized, because if not, then the recreation cost of {v1, v2,… , vc} would be at least ( +2)c > ( +1)c+ +( +1) n+k +(m−k)( +1), violating the condition (exceeding total recreation cost threshold) of Problem 5. However we can avoid materializing this vi (or si), instead keep the delta from vi (or si) to v0 and materialize v0, reducing the recreation cost and the storage cost. Thus v0 has to be materialized. Furthermore, since v0 is materialized, ∀vi , 1 ≤ i ≤ c will not be materialized and instead we will retain the delta to v0, reducing the recreation cost and the storage cost. Hence, the rst step is complete. Step 2: None of the ti will be materialized. Say a given ti is materialized in the solution to Problem 5. Then, either we have a set sj where sj is connected to ti in Figure 3.5(a) also materialized, or not. Let us consider the former case. In the former case, we can avoid materializing ti , and instead add the delta from sj to ti , thereby reducing storage cost while keeping recreation cost xed. In the latter case, pick any sj such that sj is connected to ti and is not materialized. Then, we must have the delta from v0 to sj as part of the solution. 49 Here, we can replace that edge, and the materialized ti , with materialized sj , and the delta from sj to ti : this would reduce the total storage cost while keeping the recreation cost xed. 
Thus, in either case, we can improve the solution if any of the ti are materialized, rendering the statement false. Step 3: For each si , either it is materialized, or the edge from v0 to si will be part of the storage graph. This step is easy to see: since none of the ti are materialized, either each si has to be materialized, or we must store a delta from v0. Step 4: If the minimum total storage cost is no bigger than + k + (m − k) + n + c, then there exists a subset  ⊂ S such that ∪{F∈}F = T and | | ≤ k in the original set cover decision problem, and vice versa. Let’s examine the rst part. If the minimum total storage cost is no bigger than + k + (m − k) + n + c, then the storage cost for all si ∈ S must be no bigger than k + (m − k) since the storage cost for v0, {v1, v2,… , vc} and {t1, t2,… , tn} is , c and n respectively according to Step 1 and 2. This indicates that at most k si ∈ S is materialized (we let the set of materialized si be M and |M | ≤ k). Next, we prove that each tj is stored as the modication from the materialized si ∈ M . Suppose there exists one or more tj which is stored as the modication from si ∈ S − M , then the total recreation cost must be more than + (( + 1) n + 1) + k + (m − k)( + 1) + ( + 1)c, which exceeds the total recreation threshold. Thus, we have each tj ∈ T is stored as the modication from si ∈ M . Let  = M , we can obtain ∪{F∈}F = T and | | ≤ k. Thus, If the minimum total storage cost is no bigger than + k + (m − k) + n + c, then there exists a subset  ⊂ S such that ∪{F∈}F = T and | | ≤ k in the original set cover decision problem. Next let’s examine the second part. If there exists a subset ⊂ S such that ∪{F∈}F = T and | | ≤ k in the original set cover decision problem, then we can materialize each vertex 50 si ∈  as well as the extra vertex v0, connect v0 to {v1, v2,… , vc} as well as sj ∈ S −  , and connect tj to one si ∈  . The resulting total storage is + k + (m − k) + n + c and the total recreation cost equals to the threshold. Thus, if there exists a subset  ⊂ S such that ∪{F∈}F = T and | | ≤ k in the original set cover decision problem, then the minimum total storage cost is no bigger than + k + (m − k) + n + c. Thus, the decision problem in Problem 5 is equivalent to the decision problem in set cover problem. Once again, the problem is still hard if we use a complete graph as opposed to a graph where only some edge weights are known. Since Problem 4 swaps the constraint and goal compared to Problem 6, it is simi- larly NP-Hard. (Note that the decision versions of the two problems are in fact identical, and therefore the proof still applies.) Similarly, Problem 3 is also NP-Hard. Now that we have proved the NP-hard even in the special case where Δ = Φ and Φ is symmetric, we can conclude that Problem 3, 4, 5, 6, are NP-hard in a more general setting where Φ is not symmetric and Δ ≠ Φ, as listed in Table 3.1. Hop-Based Variants.So far, our focus has been on proving hardness for the special case where Δ = Φ and Δ is undirected. We now consider a dierent kind of special case, where the recreation cost of all pairs is the same, i.e., Φij = 1 for all i, j, while Δ ≠ Φ, and Δ is undirected. In this case, we call the recreation cost as the hop cost, since it is simply the minimum number of delta operations (or "hops") needed to reconstruct Vi . 
The reason why we bring up this variant is that it directly corresponds to a special case of the well-studied d-MinimumSteinerTree problem: given an undirected graph G = (V, E) and a subset of terminal vertices, find a tree of minimum weight that spans all the terminals while its diameter is bounded by d. The special case of the d-MinimumSteinerTree problem in which the terminal set is all of V, i.e., the minimum spanning tree problem with bounded diameter, directly corresponds to Problem 6 for the hop cost variant described above. The hardness for this special case was demonstrated by [88] using a reduction from the SAT problem:

Lemma 7 Problem 6 is NP-Hard when Δ ≠ Φ and Δ is symmetric, and Φi,j = 1 for all i, j.

Note that this proof crucially uses the fact that Δ ≠ Φ, unlike Lemmas 5 and 6; thus the proofs are incomparable (i.e., one does not subsume the other). For the hop-based variant, additional results on hardness of approximation are known by way of the d-MinimumSteinerTree problem [88, 89, 90]:

Lemma 8 ([88]) For any ε > 0, Problem 6 has no (1 − ε) ln n-approximation unless NP ⊆ DTIME(n^(log log n)).

Since the hop-based variant is a special case of the last column of Table 3.1, this indicates that Problem 6 for the most general case is similarly hard to approximate; we suspect similar results hold for the other problems as well. It remains to be seen if hardness of approximation can be demonstrated for the variants in the second- and third-last columns.

3.4 Proposed Algorithms

As discussed in Section 3.2, our different application scenarios lead to different problem formulations, spanning different constraints and objectives, and different assumptions about the nature of Φ and Δ. Given that we demonstrated in the previous section that these problems are NP-Hard, we focus on developing efficient heuristics. In this section, we present two novel heuristics: first, in Section 3.4.1, we present LMG, or the Local Move Greedy algorithm, tailored to the case when there is a bound or objective on the average recreation cost: thus, this applies to Problems 3 and 5. Second, in Section 3.4.2, we present MP, or Modified Prim's algorithm, tailored to the case when there is a bound or objective on the maximum recreation cost: thus, this applies to Problems 4 and 6. We present two variants of the MP algorithm tailored to two different settings. Then, we present two algorithms adapted from the literature to fit our problems — in Section 3.4.3, an approximation algorithm called LAST, and in Section 3.4.4, an algorithm called GitH, which is based on Git repack; we compare these against our algorithms in Section 3.5. Note that LAST and GitH do not explicitly optimize any objectives or constraints in the manner of LMG or MP, and thus the four algorithms are applicable under different settings: LMG and MP are applicable when there is a bound or constraint on the average or maximum recreation cost, while LAST and GitH are applicable when a “good enough” solution is needed. Furthermore, note that all these algorithms apply to both directed and undirected versions of the problems, and to the symmetric and asymmetric cases.

Figure 3.6: Illustration of Local Move Greedy Heuristic

3.4.1 Local Move Greedy Algorithm

The LMG algorithm is applicable when we have a bound or constraint on the average-case recreation cost.
We focus on the case where there is a constraint on the storage cost (Problem 3); the other case (Problem 5), where the constraint is instead on the total recreation cost, can be solved by repeated invocations of this procedure together with a binary search over the storage budget.

Outline. At a high level, the algorithm starts with the Minimum Spanning Tree (MST) as GS, and then greedily adds edges from the Shortest Path Tree (SPT) that are not present in GS, while GS respects the bound on storage cost.

Detailed Algorithm. The algorithm starts off with GS equal to the MST. The SPT naturally contains all the edges corresponding to complete versions. The basic idea of the algorithm is to replace deltas in GS with versions from the SPT that maximize the following ratio:

ρ = (reduction in sum of recreation costs) / (increase in storage cost)

This is simply the reduction in total recreation cost per unit addition of weight to the storage graph GS. Let Ecand consist of the edges in the SPT that are not present in GS (these precisely correspond to the versions that are not explicitly stored in the MST, and are instead computed via deltas in the MST). In each “round”, we pick the edge euv ∈ Ecand that maximizes ρ, and replace the previous edge eu′v into v. The reduction in the sum of the recreation costs is computed by adding up the reductions in recreation costs of all w ∈ GS that are descendants of v in the storage graph (including v itself). On the other hand, the increase in storage cost is simply the weight of euv minus the weight of eu′v. This process is repeated as long as the storage budget is not violated. We explain this with the means of an example.

Example 5 Figure 3.6(a) denotes the current GS. Node 0 corresponds to the dummy node. Now, we are considering replacing edge e14 with edge e04, that is, we are replacing a delta to version 4 with version 4 itself. Then, the denominator of ρ is simply Δ04 − Δ14. The numerator is the change in recreation costs of versions 4, 5, and 6 (notice that 5 and 6 are below 4 in the tree). This is actually simple to compute: it is simply three times the change in the recreation cost of version 4 (since the change affects all three versions equally). Thus, the numerator of ρ is simply 3 × (Φ01 + Φ14 − Φ04).

Complexity. In a given round, computing ρ for a given edge is O(|V|). This leads to an overall O(|V|^3) complexity, since we have up to |V| rounds and up to |V| edges in Ecand. However, if we are smart about this computation (by precomputing and maintaining across all rounds the number of nodes “below” every node), we can reduce the complexity of computing ρ for a given edge to O(1). This leads to an overall complexity of O(|V|^2). Algorithm 1 provides pseudocode of the described technique.

Algorithm 1: Local Move Greedy Heuristic
Input: Minimum Spanning Tree (MST), Shortest Path Tree (SPT), source vertex V0, space budget W
Output: A tree T with weight ≤ W rooted at V0 with minimal sum of access costs
1  Initialize T as the MST.
2  Let d(Vi) be the distance from V0 to Vi in T, and p(Vi) denote the parent of Vi in T. Let W(T) denote the storage cost of T. Let Ecand be the set of SPT edges not in T.
3  while W(T) < W do
4      (ρmax, ē) ← (0, ∅)
5      foreach euv ∈ Ecand do
6          compute ρe
7          if ρe > ρmax then
8              (ρmax, ē) ← (ρe, euv)
9          end
10     end
11     T ← T \ eu′v ∪ ē; Ecand ← Ecand \ ē
12     if Ecand = ∅ then
13         return T
14     end
15 end
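The mechanics of LMG can also be seen in the following compact Python sketch (our own simplification, not the DEX implementation; it recomputes costs naively rather than maintaining them incrementally): starting from the MST, it repeatedly materializes the version whose SPT edge gives the best ratio ρ of recreation-cost reduction to storage-cost increase, while the storage budget permits.

def lmg(mst_parent, delta, phi, budget):
    # mst_parent maps each version to its parent in the MST rooted at the dummy
    # vertex 0 (parent 0 means "materialized"); delta/phi map (u, v) to the
    # storage / recreation cost of edge u -> v, including the (0, v) entries.
    parent = dict(mst_parent)
    versions = list(parent)

    def recreation(v):
        cost, x = 0.0, v
        while True:
            u = parent[x]
            cost += phi[(u, x)]
            if u == 0:
                return cost
            x = u

    def in_subtree(w, v):
        # True if v lies on the path from w up to the root (w included).
        x = w
        while x != 0:
            if x == v:
                return True
            x = parent[x]
        return False

    storage = sum(delta[(parent[v], v)] for v in versions)
    while True:
        best = None
        for v in versions:
            u = parent[v]
            if u == 0:
                continue                                  # already materialized
            extra = delta[(0, v)] - delta[(u, v)]         # increase in storage cost
            saved = recreation(v) - phi[(0, v)]           # saving per affected version
            if storage + extra > budget or saved <= 0:
                continue
            n_below = sum(1 for w in versions if in_subtree(w, v))
            rho = (saved * n_below) / max(extra, 1e-9)
            if best is None or rho > best[0]:
                best = (rho, v, extra)
        if best is None:
            return parent
        _, v, extra = best
        parent[v] = 0                                     # materialize v
        storage += extra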
Access Frequencies. Note that the algorithm can easily take into account access frequencies of different versions and instead optimize for the total weighted recreation cost (weighted by access frequencies). The algorithm is similar, except that the numerator of ρ will capture the reduction in weighted recreation cost.

3.4.2 Modified Prim's Algorithm

Next, we introduce a heuristic algorithm based on Prim's algorithm for minimum spanning trees for Problem 6, where the goal is to reduce the total storage cost while the recreation cost for each version stays within the threshold θ; the solution for Problem 4 is similar.

Outline. At a high level, the algorithm is a variant of Prim's algorithm, greedily adding the version with the smallest storage cost and the corresponding edge to form a spanning tree T. Unlike Prim's algorithm, where the spanning tree simply grows, in this case even an edge already present in T could be removed in future iterations. At all stages, the algorithm maintains the invariant that the recreation cost of every version in T is bounded by θ.

Detailed Algorithm. At each iteration, the algorithm picks the version Vi with the smallest storage cost to be added to the tree. Once this version Vi is added, we consider adding deltas to all other versions Vj such that their recreation cost through Vi is within the constraint θ and the storage cost does not increase. Each version maintains a pair l(Vi) and d(Vi): l(Vi) denotes the marginal storage cost of Vi, while d(Vi) denotes the total recreation cost of Vi. At the start, l(Vi) is simply the storage cost of Vi in its entirety.

We now describe the algorithm in detail. The set X represents the version set of the current spanning tree T. Initially X = ∅. In each iteration, the version Vi with the smallest storage cost (l(Vi)) in the priority queue PQ is picked and added into the spanning tree T (lines 7–8). When Vi is added into T, we need to update the storage cost and recreation cost for all Vj that are neighbors of Vi. Notice that in Prim's algorithm, we do not need to consider neighbors that are already in T. However, in our scenario a better path to such a neighbor may be found, and this may result in an update (lines 10–17). For instance, if edge ⟨Vi, Vj⟩ can make Vj's storage cost smaller while the recreation cost for Vj does not increase, we can update p(Vj) = Vi as well as d(Vj), l(Vj), and T. For neighbors Vj ∉ T (lines 19–24), we update d(Vj), l(Vj), p(Vj) if edge ⟨Vi, Vj⟩ can make Vj's storage cost smaller and the recreation cost for Vj is no bigger than θ. Algorithm 2 terminates in |V| iterations since one version is added into X in each iteration.

Figure 3.7: Directed Graph G
Figure 3.8: Undirected Graph G
Figure 3.9: Illustration of Modified Prim's algorithm on the graph in Figure 3.7

Example 6 Say we operate on the graph G given by Figure 3.7, and let the threshold θ be 6. Each version Vi is associated with a pair ⟨l(Vi), d(Vi)⟩. Initially, version V0 is pushed into the priority queue. When V0 is dequeued, each neighbor Vj updates ⟨l(Vj), d(Vj)⟩ as shown in Figure 3.9(a). Notice that l(Vi), i ≠ 0, is at this point simply the storage cost for that version. For example, when considering edge (V0, V1), l(V1) = 3 and d(V1) = 3 are set since the recreation cost (if V1 is stored in its entirety) is smaller than the threshold θ, i.e., 3 < 6. Afterwards, versions V1, V2 and V3 are inserted into the priority queue.
Next, we dequeue V1 since l(V1) is the smallest among the versions in the priority queue, and add V1 to the spanning tree. We then update ⟨l(Vj), d(Vj)⟩ for all neighbors of V1; e.g., the recreation cost for version V2 will be 6 and the storage cost will be 2 when considering edge (V1, V2). Since 6 ≤ 6, (l(V2), d(V2)) is updated to (2, 6) as shown in Figure 3.9(b); however, ⟨l(V3), d(V3)⟩ will not be updated since the recreation cost is 3 + 4 > 6 when considering edge (V1, V3). Subsequently, version V2 is dequeued because it has the lowest l(V2), and is added to the tree, giving Figure 3.9(b). Version V3 is dequeued next. When V3 is dequeued from PQ, (l(V2), d(V2)) is updated again. This is because the storage cost for V2 can be reduced to 1 while the recreation cost remains 6 when considering edge (V3, V2), even though V2 is already in T, as shown in Figure 3.9(c). Eventually, we get the final answer in Figure 3.9(d).

Complexity. The complexity of the algorithm is the same as that of Prim's algorithm, i.e., O(|E| log |V|). Each edge is scanned once, and the priority queue needs to be updated once per edge in the worst case.

Algorithm 2: Modified Prim's Algorithm
Input: Graph G = (V, E), threshold θ
Output: Spanning Tree T = (VT, ET)
1  Let X be the version set of the current spanning tree T; initially T = ∅, X = ∅;
2  Let p(Vi) be the parent of Vi; l(Vi) denotes the storage cost from p(Vi) to Vi, and d(Vi) denotes the recreation cost from root V0 to version Vi;
3  Initially, d(V0) = l(V0) = 0 and ∀i ≠ 0, d(Vi) = l(Vi) = ∞;
4  Enqueue ⟨V0, (l(V0), d(V0))⟩ into priority queue PQ
5  (PQ is sorted by l(Vi));
6  while PQ ≠ ∅ do
7      ⟨Vi, (l(Vi), d(Vi))⟩ ← top(PQ), dequeue(PQ);
8      T = T ∪ ⟨Vi, p(Vi)⟩, X = X ∪ Vi;
9      for Vj ∈ (Vi's neighbors in G) do
10         if Vj ∈ X then
11             if (Φi,j + d(Vi)) ≤ d(Vj) and Δi,j ≤ l(Vj) then
12                 T = T − ⟨Vj, p(Vj)⟩;
13                 p(Vj) = Vi;
14                 T = T ∪ ⟨Vj, p(Vj)⟩; d(Vj) ← Φi,j + d(Vi);
15                 l(Vj) ← Δi,j;
16             end
17         end
18         else
19             if (Φi,j + d(Vi)) ≤ θ and Δi,j ≤ l(Vj) then
20                 d(Vj) ← Φi,j + d(Vi);
21                 l(Vj) ← Δi,j; p(Vj) = Vi;
22                 enqueue (or update) ⟨Vj, (l(Vj), d(Vj))⟩ in PQ;
23             end
24         end
25     end
26 end
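A minimal Python rendering of Modified Prim's algorithm above is given below (an illustrative simplification that re-scans vertices instead of maintaining a priority queue; all names are ours):

import math

def modified_prim(edges, versions, theta):
    # edges: (i, j) -> (delta_ij, phi_ij); `versions` includes the dummy vertex 0.
    INF = math.inf
    l = {v: INF for v in versions}       # marginal storage cost of v
    d = {v: INF for v in versions}       # recreation cost of v
    p = {v: None for v in versions}
    l[0] = d[0] = 0
    p[0] = 0
    in_tree = set()

    while len(in_tree) < len(versions):
        # pick the vertex outside the tree with the smallest marginal storage cost
        vi = min((v for v in versions if v not in in_tree), key=lambda v: l[v])
        if l[vi] == INF:
            raise ValueError("no feasible storage graph under threshold theta")
        in_tree.add(vi)
        for (a, vj), (dl, ph) in edges.items():
            if a != vi:
                continue
            if vj in in_tree:
                # a cheaper delta that does not worsen the recreation cost
                if ph + d[vi] <= d[vj] and dl <= l[vj]:
                    p[vj], d[vj], l[vj] = vi, ph + d[vi], dl
            else:
                # respect the per-version recreation threshold
                if ph + d[vi] <= theta and dl <= l[vj]:
                    p[vj], d[vj], l[vj] = vi, ph + d[vi], dl
    return p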
Let MST denote the minimum spanning tree of graph G and SP (V0, Vi) denote the shortest path from V0 to Vi in G. The algorithm starts with theMST and then conducts a depth-rst traversal inMST . Each nodeV keeps track of its path cost from root as well as its parent, denoted as d(Vi) and p(Vi) respectively. Given the approximation parameter , when visiting each node Vi , we rst check whether d(Vi) is bigger than × SP (V0, Vi) where SP stands for shortest path. If yes, we replace the path to Vi with the shortest path 61 from root to Vi inG and update d(Vi) as well as p(Vi). In addition, we keep updating d(Vi) and p(Vi) during depth rst traversal as stated in line 4-7 of Algorithm 3. Example 7 Figure 3.10 (a) is the minimum spanning tree (MST) rooted at node V0 of G in Figure 3.8. The approximation threshold is set to be 2. The algorithm starts with the MST and conducts a depth-rst traversal in the MST from root V0. When visiting node V2, d(V2) = 3 and the shortest path to node V2 is 3, thus 3 < 2 × 3. We continue to visit node V2 and V3. When visiting V3, d(V3) = 8 > 2 × 3 where 3 is the shortest path to V3 in G. Thus, d(V3) is set to be 3 and p(V3) is set to be node 0 by replacing with the shortest path ⟨V0, V3⟩ as shown in Figure 3.10 (b). Afterwards, the back-edge < V3, V1 > is traversed in MST. Since 3 + 2 < 6, where 3 is the current value of d(V3), 2 is the edge weight of (V3, V1) and 6 is the current value in d(V1), thus d(V1) is updated as 5 and p(V1) is updated as node V3. At last node V4 is visited, d(V4) is rst updated as 7 according to line 3-7. Since 7 < 2×4, lines 9-11 are not executed. Figure 3.10 (c) is the resulting spanning tree of the algorithm, where the recreation cost for each node is under the constraint and the total storage cost is 3 + 3 + 2 + 2 = 10. Complexity. The complexity of the algorithm is O(|E| log |V |). Given the minimum spanning tree and shortest path tree rooted at V0, Algorithm 3 is conducted via depth rst traversal on MST. It is easy to show that the complexity for Algorithm 3 is O(|V |). The time complexity for computing minimum spanning tree and shortest path tree is O(|E| log |V |) using heap-based priority queue. 62 Algorithm 3: Balance MST and Shortest Path Tree [91] Input : Graph G = (V , E), MST , SP Output : Spanning Tree T = (VT , ET ) 1 Initialize T as MST . Let d(Vi) be the distance from V0 to Vi in T and p(Vi) be the parent of Vi in T . 2 while DFS traversal on MST do 3 (Vi , Vj)← the edge currently in traversal; 4 if d(Vj) > d(Vi) + ei,j then 5 d(Vj)← (d(Vi) + ei,j); 6 p(Vj)← Vi ; 7 end 8 if d(Vj) > ∗ SP (V0, Vj) then 9 add shortest path (V0, Vj) into T ; 10 d(Vj)← SP (V0, Vj); 11 p(Vj)← V0; 12 end 13 end V0 V V0 0 3 3 3 3 3 3 V V V 31 2 V 1 V V V3 2 3 2 2 2 2 2 V3 V4 V V4 1 2 (a) (b) V4 (c) Figure 3.10: Illustration of LAST on Figure 3.8 3.4.4 Git Heuristic This heuristic is an adaptation of the current heuristic used by Git and we refer to it as GitH. We rst describe our understanding of the heuristic used by Git when a user runs git-repack, followed by a sketch of GitH. Git uses delta compression to reduce the amount of storage required to store a large 63 number of les (objects) that contain duplicated information. However, git’s algorithm for doing so is not clearly described anywhere. An old discussion with Linus has a sketch of the algorithm [92]. However there have been several changes to the heuristics used that don’t appear to be documented anywhere. 
3.4.4 Git Heuristic

This heuristic is an adaptation of the current heuristic used by Git, and we refer to it as GitH. We first describe our understanding of the heuristic used by Git when a user runs git-repack, followed by a sketch of GitH.

Git uses delta compression to reduce the amount of storage required to store a large number of files (objects) that contain duplicated information. However, Git's algorithm for doing so is not clearly described anywhere. An old discussion with Linus has a sketch of the algorithm [92]. However, there have been several changes to the heuristics used that don't appear to be documented anywhere. The following describes our understanding of the algorithm based on the latest Git source code.1 Here we focus on “repack”, where the decisions are made for a large group of objects. However, the same algorithm appears to be used for normal commits as well. Most of the algorithm code is in the file builtin/pack-objects.c.

1 Cloned from https://github.com/git/git on 5/11/2015, commit id: 8440f74997cf7958c7e8ec853f590828085049b8

Step 1: Sort the objects, first by “type”, then by “name hash”, and then by “size” (in decreasing order). The comparator is (line 1503): static int type_size_sort(const void *_a, const void *_b). Note the name hash is not a true hash; the pack_name_hash() function (pack-objects.h) simply creates a number from the last 16 non-whitespace characters, with the last characters counting the most (so all files with the same suffix, e.g., .c, will sort together).

Step 2: The next key function is ll_find_deltas(), which goes over the files in the sorted order. It maintains a list of W objects (W = window size, default 10) at all times. For the next object, say O, it finds the delta between O and each of the objects, say B, in the window; it chooses the object with the minimum value of: delta(B, O) / (max_depth − depth of B), where max_depth is a parameter (default 50), and the depth of B refers to the length of the delta chain between a root and B. The original algorithm appears to have only used delta(B, O) to make the decision, but the “depth bias” (the denominator) was added at a later point to prefer slightly larger deltas with smaller delta chains. The key lines for the above part:

• line 1812 (check each object in the window): ret = try_delta(n, m, max_depth, &mem_usage);
• lines 1617-1618 (depth bias): max_size = (uint64_t)max_size * (max_depth - src->depth) / (max_depth - ref_depth + 1);
• line 1678 (compute delta and compare size): delta_buf = create_delta(src->index, trg->data, trg_size, &delta_size, max_size);

create_delta() returns non-null only if the new delta being tried is smaller than the current delta (modulo the depth bias), specifically, only if the size of the new delta is less than the max_size argument. Note: lines 1682-1688 appear redundant given the depth bias calculations.

Step 3: Originally the window was just the last W objects before the object O under consideration. However, the current algorithm shuffles the objects in the window based on the choices made. Specifically, let b1, …, bW be the current objects in the window, and let the object chosen to delta against for O be bi. Then bi is moved to the end of the list, so the new list is: [b1, b2, …, bi−1, bi+1, …, bW, O, bi]. Then, when we move to the new object after O (say O′), we slide the window, and so the new window is: [b2, …, bi−1, bi+1, …, bW, O, bi, O′]. Small detail: the list is actually maintained as a circular buffer, so the list doesn't have to be physically “shifted” (moving bi to the end does involve a shift though). The relevant code here is lines 1854-1861.

Finally, we note that Git never considers/computes/stores a delta between two objects of different types, and it does the above in a multi-threaded fashion, by partitioning the work among a given number of threads. Each of the threads operates independently of the others.

GitH. GitH uses two parameters: w (window size) and d (max depth). We consider the versions in a non-increasing order of their sizes. The first version in this ordering is chosen as the root of the storage graph and has depth 0 (i.e., it is materialized). At all times, we maintain a sliding window containing at most w versions. For each version Vi after the first one, let Vl denote a version in the current window. We compute Δ′l,i = Δl,i/(d − dl), where dl is the depth of Vl (thus deltas with shallow depths are preferred over slightly smaller deltas with higher depths). We find the version Vj with the lowest value of this quantity and choose it as Vi's parent (as long as dj < d); the depth of Vi is then set to dj + 1. The sliding window is then modified: Vj is moved to the end of the window (so that it stays in the window longer), Vi is added to the window, and the version at the beginning of the window is dropped.

Complexity. The running time of the heuristic is O(|V| log |V| + w|V|), excluding the time to construct deltas.
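A condensed Python sketch of GitH as described above is given below (illustrative only; delta_size stands in for whatever differencing function is used, and the fallback of materializing a version when no window entry is usable is our own simplification):

from collections import deque

def gith(versions, size, delta_size, w=10, d=50):
    # versions: identifiers; size: v -> full size; delta_size(a, b): size of
    # the delta from a to b. Returns parent pointers and depths.
    order = sorted(versions, key=lambda v: size[v], reverse=True)   # non-increasing size
    root = order[0]
    parent = {root: root}
    depth = {root: 0}
    window = deque([root], maxlen=w)        # appending drops the oldest entry when full

    for vi in order[1:]:
        best, best_score = None, float("inf")
        for vl in window:
            if depth[vl] >= d:
                continue                                   # would exceed the max depth
            score = delta_size(vl, vi) / (d - depth[vl])   # depth-biased delta size
            if score < best_score:
                best, best_score = vl, score
        if best is None:
            parent[vi], depth[vi] = vi, 0                  # materialize if no usable base
        else:
            parent[vi], depth[vi] = best, depth[best] + 1
            window.remove(best)
            window.append(best)                            # keep the chosen base around longer
        window.append(vi)
    return parent, depth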
At all times, we maintain a sliding window containing at most w versions. For each version Vi after the first one, let Vl denote a version in the current window. We compute Δ′l,i = Δl,i/(d − dl), where dl is the depth of Vl (thus deltas with shallow depths are preferred over slightly smaller deltas with higher depths). We find the version Vj with the lowest value of this quantity and choose it as Vi's parent (as long as dj < d); the depth of Vi is then set to dj + 1. The sliding window is then modified as in Git: Vj is moved to the end of the window (so it stays in the window longer), Vi is added to the window, and the version at the beginning of the window is dropped.

Complexity. The running time of the heuristic is O(|V| log |V| + w|V|), excluding the time to construct deltas.

                               DC            LC            BF        LF
Number of versions             100010        100002        986       100
Number of deltas               18086876      2916768       442492    3562
Average version size (MB)      347.65        356.46        0.401     422.79
MCA-Storage Cost (GB)          1265.34       982.27        0.0250    2.2402
MCA-Sum Recreation Cost (GB)   11506437.83   29934960.95   0.9648    47.6046
MCA-Max Recreation Cost (GB)   257.6         717.5         0.0063    0.5998
SPT-Storage Cost (GB)          33953.84      34811.14      0.3854    41.2881
SPT-Sum Recreation Cost (GB)   33953.84      34811.14      0.3854    41.2881
SPT-Max Recreation Cost (GB)   0.524         0.55          0.0063    0.5091

Table 3.2: Dataset properties

Figure 3.11: Distribution of delta sizes in the datasets (each delta size scaled by the average version size in the dataset; y-axis: normalized delta values).

3.5 Experiments

In the following sections, we present an extensive evaluation of our designed algorithms using a combination of synthetic and derived real-world datasets. Apart from implementing the algorithms described above, we note that LMG and LAST require both SPT and MST as input. For both directed and undirected graphs, we use Dijkstra's algorithm to find the single-source shortest path tree (SPT). We use Prim's algorithm to find the minimum spanning tree for undirected graphs. For directed graphs, we use an implementation [93] of Edmonds' algorithm [87] for computing the min-cost arborescence (MCA). We ran all our experiments on a 2.2GHz Intel Xeon CPU E5-2430 server with 64GB of memory, running 64-bit Red Hat Enterprise Linux 6.5.

3.5.1 Datasets

We use four datasets: two synthetic and two derived from real-world source code repositories. Although there are many publicly available source code repositories with large numbers of commits (e.g., on GitHub), those repositories typically contain fairly small (source code) files, and further, the changes between versions tend to be localized and are typically very small; we expect dataset versions generated during collaborative data analysis to contain much larger datasets and to exhibit large changes between versions. We were unable to find any realistic workloads of that kind. Hence, we generated realistic dataset versioning workloads as follows. First, we wrote a synthetic version generator suite, driven by a small set of parameters, that is able to generate a variety of version histories and corresponding datasets. Second, we created two real-world datasets using publicly available forks of popular repositories on GitHub. We describe each of the two below.

Synthetic Datasets: Our synthetic dataset generation suite (which may be of independent interest to researchers working on version management) takes a two-step approach to generate a dataset, which we sketch below.
The first step is to generate a version graph with the desired structure, controlled by the following parameters:
• number of commits, i.e., the total number of versions.
• branch interval and probability, the number of consecutive versions after which a branch can be created, and the probability of creating a branch.
• branch limit, the maximum number of branches from any point in the version history. We choose a number in [1, branch limit] uniformly at random when we decide to create branches.
• branch length, the maximum number of commits in any branch. The actual length is a uniformly chosen integer between 1 and branch length.

Once a version graph is generated, the second step is to generate the appropriate versions and compute the deltas. The files in our synthetic dataset are ordered CSV files (containing tabular data) and we use deltas based on UNIX-style diffs. The previous step also annotates each edge (u, v) in the version graph with edit commands that can be used to produce v from u. Edit commands are a combination of the following six instructions: add or delete a set of consecutive rows, add or remove a column, and modify a subset of rows or columns. Using this, we generated two synthetic datasets (Table 3.2):
• Densely Connected (DC): This dataset is based on a "flat" version history, i.e., the number of branches is high, they occur often, and they have short lengths. For each version in this dataset, we compute the delta with all versions within a 10-hop distance in the version graph to populate additional entries in Δ and Φ.
• Linear Chain (LC): This dataset is based on a "mostly-linear" version history, i.e., the number of branches is low, they occur after large intervals, and they have longer lengths. For each version in this dataset, we compute the delta with all versions within a 25-hop distance in the version graph to populate Δ and Φ.

Real-world datasets: We use 986 forks of the Twitter Bootstrap repository and 100 forks of the Linux repository to derive our real-world workloads. For each repository, we check out the latest version in each fork and concatenate all files in it (by traversing the directory structure in lexicographic order). Thereafter, we compute deltas between all pairs of versions in a repository, provided the size difference between the versions under consideration is less than a threshold. We set this threshold to 100KB for the Twitter Bootstrap repository and 10MB for the Linux repository. This gives us two real-world datasets, Bootstrap Forks (BF) and Linux Forks (LF), with properties shown in Table 3.2.

3.5.2 Comparison with SVN and Git

We begin by evaluating the performance of two popular version control systems, SVN (v1.8.8) and Git (v1.7.1), using the LF dataset. We create an FSFS-type repository in SVN, which is more space efficient than a Berkeley DB-based repository [94]. We then import the entire LF dataset into the repository in a single commit. The amount of space occupied by the db/revs/ directory is around 8.5GB and it takes around 48 minutes to complete the import. We contrast this with the naive approach of applying gzip to the files, which results in a total compressed storage of 10.2GB. In the case of Git, we add and commit the files in the repository and then run git repack -a -d --depth=50 --window=50 on the repository (unlike git repack, svnadmin pack has a negligible effect on the storage cost, as it primarily aims to reduce disk seeks and the per-version disk usage penalty by concatenating files into a single "pack" [95, 96]). The size of the Git pack file is 202MB, although the repack consumes 55GB of memory and takes 114 minutes (for higher window sizes, Git fails to complete the repack as it runs out of memory).
In comparison, the solution found by the MCA algorithm occupies 516MB of compressed storage (2.24GB when uncompressed) when using UNIX diff for computing the deltas. To make a fair comparison with Git, we use xdiff from the LibXDiff library [97] for computing the deltas, which forms the basis of Git's delta computing routine. Using xdiff brings down the total storage cost to just 159MB. The total time taken is around 102 minutes; this includes the time taken to compute the deltas and then to find the MCA for the corresponding graph.

The main reason behind SVN's poor performance is its use of "skip-deltas" to ensure that at most O(log n) deltas are needed for reconstructing any version; that tends to lead it to repeatedly store redundant delta information, as a result of which the total space requirement increases significantly. The heuristic used by Git is much better than SVN's (Section 3.4.4). However, as we show later (Figure 3.12), our implementation of that heuristic (GitH) required more storage than LMG for guaranteeing similar recreation costs.

Figure 3.12: Results for the directed case, comparing the storage costs and total recreation costs (panels (a)–(d): datasets DC, LC, BF, and LF; algorithms LMG, MP, LAST, and GitH; x-axis: storage cost in GB, y-axis: sum of recreation costs in GB).

Figure 3.13: Results for the directed case, comparing the storage costs and maximum recreation costs (panels (a)–(b): datasets DC and LF; algorithms LMG, MP, and LAST; x-axis: storage cost in GB, y-axis: maximum recreation cost in GB).

Figure 3.14: Results for the undirected case, comparing the storage costs and total recreation costs (panels (a)–(c): datasets DC, LC, and BF) or maximum recreation costs (panel (d): dataset DC); algorithms LMG, MP, and LAST.

3.5.3 Experimental Results

Directed Graphs. We begin with a comprehensive evaluation of the three algorithms, LMG, MP, and LAST, on directed datasets. Given that all of these algorithms have parameters that can be used to trade off the storage cost and the total recreation cost, we compare them by plotting the different solutions they are able to find for the different values of their respective input parameters. Figure 3.12(a–d) shows four such plots; we run each of the algorithms with a range of different values for its input parameter and plot the storage cost and the total (sum) recreation cost for each of the solutions found.
We also show the minimum possible values for these two costs: the vertical dashed red line indicates the minimum storage cost required for storing the versions in the dataset, as found by MCA, and the horizontal one indicates the minimum total recreation cost, as found by SPT (equal to the sum of all version sizes).

The first key observation we make is that the total recreation cost decreases drastically if we allow a small increase in the storage budget over MCA. For example, for the DC dataset, the sum recreation cost for MCA is over 11PB (see Table 3.2), as compared to just 34TB for the SPT solution (which is the minimum possible). As we can see from Figure 3.12(a), a space budget of 1.1× the MCA storage cost reduces the sum of recreation costs by three orders of magnitude. Similar trends can be observed for the remaining datasets and across all the algorithms. We observe that LMG results in the best trade-off between the sum of recreation costs and the storage cost, with LAST performing fairly closely. An important takeaway here, especially given the amount of prior work that has focused purely on storage cost minimization, is that it is possible to construct balanced trees where the sum of recreation costs can be reduced and brought close to that of SPT while using only a fraction of the space that SPT needs.

We also ran the GitH heuristic on all four datasets with varying window and depth settings. For BF, we ran the algorithm with four different window sizes (50, 25, 20, 10) for a fixed depth of 10 and provided the GitH algorithm with all the deltas that it requested. For all other datasets, we ran GitH with an infinite window size but restricted it to choose from deltas that were available to the other algorithms (i.e., only deltas with sizes below a threshold); as we can see, the solutions found by GitH exhibited very good total recreation cost, but required significantly higher storage than the other algorithms. This is not surprising given that GitH is a greedy heuristic that makes choices in a somewhat arbitrary order.

In Figures 3.13(a–b), we plot the maximum recreation costs, instead of the sum of recreation costs across all versions, for two of the datasets (the other two datasets exhibited similar behavior). The MP algorithm found the best solutions here for all datasets, and we also observed that LMG and LAST both show plateaus for some datasets, where the maximum recreation cost did not change when the storage budget was increased. This is not surprising given that the basic MP algorithm tries to optimize for the storage cost given a bound on the maximum recreation cost, whereas both LMG and LAST focus on minimization of the storage cost, and one version with a high recreation cost is unlikely to affect that significantly.

Figure 3.15: Taking workload into account leads to better solutions (LMG vs. workload-aware LMG-W on the DC and LF datasets; x-axis: storage cost in GB, y-axis: sum of recreation costs in GB).

Undirected Graphs. We test the three algorithms on the undirected versions of three of the datasets (Figure 3.14). For DC and LC, undirected deltas between pairs of versions were obtained by concatenating the two directional deltas; for the BF dataset, we use UNIX diff itself to produce undirected deltas. Here again we observe that LMG consistently outperforms the other algorithms in terms of finding a good balance between the storage cost and the sum of recreation costs.
MP again shows the best results when trying to balance the maximum recreation cost and the total storage cost. Similar results were observed for the other datasets but are omitted due to space limitations.

Workload-aware Sum of Recreation Cost Optimization. In many cases, we may be able to estimate access frequencies for the various versions (from historical access patterns), and if available, we may want to take those into account when constructing the storage graph. The LMG algorithm can be easily adapted to take such information into account, whereas it is not clear how to adapt either LAST or MP in a similar fashion. In this experiment, we use LMG to compute a storage graph such that the sum of recreation costs is minimal given a space budget, while taking workload information into account. The workload here assigns a frequency of access to each version in the repository using a Zipfian distribution (with exponent 2); real-world access frequencies are known to follow such distributions. Given the workload information, the algorithm should find a storage graph whose (frequency-weighted) sum of recreation costs is lower than that obtained when the workload information is not taken into account (i.e., when all versions are assumed to be accessed equally frequently). Figure 3.15 shows the results for this experiment. As we can see, for the DC dataset, taking the access frequencies into account during optimization led to much better solutions than ignoring them. On the other hand, for the LF dataset, we did not observe a large difference.

Running Times. Here we evaluate the running times of the LMG algorithm. Recall that LMG takes MST (or MCA) and SPT as inputs. In Figure 3.16, we report the total running time as well as the time taken by LMG itself. We generated a set of version graphs as subsets of the graphs for the LC and DC datasets as follows: for a given number of versions n, we randomly choose a node and traverse the graph starting at that node in a breadth-first manner until we construct a subgraph with n versions. We generate 5 such subgraphs for increasing values of n and report the average running time for LMG; the storage budget for LMG is set to three times the space required by the MST (all our reported experiments with LMG use a smaller storage budget than that). The time taken by LMG on the DC dataset is more than on LC for the same number of versions; this is because DC has lower delta values than LC (see Fig. 3.2) and thus requires more edges from SPT to satisfy the storage budget.

Figure 3.16: Running times of LMG (panels: (a) directed, (b) undirected; x-axis: number of versions, y-axis: time in seconds; curves show LMG time and total time for the LC and DC datasets).

On the other hand, MP takes between 1 and 8 seconds on those datasets when the recreation cost bound is set to the maximum. Similar to LMG, LAST requires the MST/MCA and SPT as inputs; however, the running time of LAST itself is linear and it takes less than 1 second in all cases. Finally, the time taken by GitH on the LC and DC datasets for varying window sizes ranges from 35 seconds (window = 1000) to a little more than 120 minutes (window = 100000); note that this excludes the time for constructing the deltas. In summary, although LMG is inherently a more expensive algorithm than MP or LAST, it runs in reasonable time on large input sizes; we note that all of these times are likely to be dwarfed by the time it takes to construct deltas, even for moderately-sized datasets.

Comparison with ILP solutions.
Finally, we compare the quality of the solutions found by MP with the optimal solution found using the Gurobi Optimizer for Problem 6. We use the ILP formulation from Section 3.2.3 with a constraint on the maximum recreation cost (θ), and compare the optimal storage cost with that of the MP algorithm (which resulted in the solutions with the lowest maximum recreation costs in our evaluation). We use our synthetic dataset generation suite to generate three small datasets, with 15, 25 and 50 versions, denoted v15, v25 and v50 respectively, and compute deltas between all pairs of versions.

v15   θ     0.20   0.21   0.22   0.23   0.24
      ILP   0.36   0.36   0.22   0.22   0.22
      MP    0.36   0.36   0.23   0.23   0.23
v25   θ     0.63   0.66   0.69   0.72   0.75
      ILP   2.39   1.95   1.50   1.18   1.06
      MP    2.88   2.13   1.7    1.18   1.18
v50   θ     0.30   0.34   0.41   0.54   0.68
      ILP   1.43   1.10   0.83   0.66   0.60
      MP    1.59   1.45   1.06   0.91   0.82

Table 3.3: Comparing ILP and MP solutions for small datasets, given a bound on the max recreation cost, θ (in GB)

Table 3.3 reports the results of this experiment, across five θ values. The ILP turned out to be very difficult to solve, even for these very small problem sizes, and in many cases the optimizer did not finish and the reported numbers are the best solutions found by it. As we can see, the solutions found by MP are quite close to the ILP solutions for the small problem sizes for which we could get any solutions out of the optimizer. However, extrapolating from the (admittedly limited) data points, we expect that on large problem sizes, MP may be significantly worse than optimal for some variations on the problems (we note that the optimization problem formulations involving the max recreation cost are likely to turn out to be harder than the formulations that focus on the average recreation cost). Development of better heuristics and approximation algorithms with provable guarantees for the various problems that we introduce are rich areas for further research.

Chapter 4: A Unified Query Language for Provenance and Versioning

4.1 Introduction

In this chapter, we present our design of a version-aware query language, called VQUEL, capable of querying dataset versions, dataset provenance (e.g., which datasets a given dataset was derived from), and record-level provenance (if available). While git and svn have proved tremendously useful for collaborative source code management, their versioning API is based on a notion of files, not structured records, and as such is not a good fit for a scenario with a mix of structured and unstructured datasets; the versioning API is also not capable of allowing data scientists to reason about the data contained within versions and the relationships between the versions in a holistic manner. VQUEL draws from constructs introduced in the historical Quel [59] and GEM [60] languages, neither of which had a temporal component.

4.2 Preliminaries

Recall that in DEX, a version consists of one or more datasets that are semantically grouped together.
Every version is identified by an ID, is immutable, and any update to a version conceptually results in a new version with a different version ID (note that the physical data structures are not necessarily immutable, as we described in the previous chapter). New versions can also be created through the application of transformation programs to one or more existing versions. The version-level provenance that captures these processes is maintained as a version graph.

Figure 4.1: Conceptual data model for VQUEL (entities: Version, Relation, Record, Author): the notation "{T}" denotes a set of values of T; fields in the Record entity can be conceptually thought of as a union of all fields across records; other fields and entities (for instance Authors) are not shown to keep the discussion brief; for each entity, entries in the left and right columns denote the attribute name and type respectively.

Figure 4.1 shows a portion of the conceptual data model that we use to write queries against, and Figure 4.2 shows an example of a few versions along with the version graph connecting them. The data model consists of four essential tables: Version, Relation, File, and Record. Additional tables like Column and Author are required in DEX but are not essential for the purpose of this discussion. The difference between Relation and File is that a relation has a fixed schema for all its records (recorded in the Column table) while a file has no such requirement. To that effect, we denote the records in a relation as tuples.

Figure 4.2: An example version graph where circles denote versions; version V1 has two Relations, Employee and Department, each having a set of records, {E1, E2, E3} and {D1, D2} respectively; version V2 adds new records to both the Employee and Department relations and also adds a new File, Forms.csv. Edge annotations (not shown) are used to capture information about the derivation process itself, including references to transformation programs or scripts if needed.

The Version table maintains the information about the different versions in the database, including the "commit_id" (unique across the versions), and various attributes capturing metadata about the version, such as the creation time and author, as well as "commit_msg" and "creation_ts", representing the commit message and creation time respectively. There are four set-valued attributes called "relations", "files", "parents" and "children", recording the relations and files contained in the version, and the direct parents and children in the version graph respectively. The last two refer back to the Version table, whereas the first two refer to the Relation and File tables respectively. A tuple in the Relation table, in turn, records the information for a relation including its schema; we view the tuples in the relation as a set-valued attribute of this table itself, which allows us to locate a relation and then query the data inside it, as we will see in the next section. The Files table is analogous, but records information appropriate for an unstructured file.
Note that neither of these tables has a primary key but rather the at- tributes “name” and “full_path” serve as discriminators, and must be combined with the 81 version “id” to construct primary keys. The “changed” attribute is a derived (redundant) attribute that indicates whether the relation/le changed from the parent version, and is very useful for version-oriented queries. Finally, Record is a virtual table that can be conceptually thought of as a union of all tuples and records in all relations and les across the versions. The one exception are the “parents” and “children” attributes, which refer back to the Record table and can be used to refer to ne-grained provenance information within queries. This table is never directly referenced in the queries, but is depicted here for completeness. The prove- nance information must “obey” the version graph, e.g., in the example shown, records in version V2 can only have records in version V1 as parents. We note here that this data model is a high-level conceptual one mainly intended for ease of querying and aims to maximize data independence. For instance, although the ne-grained provenance information is conceptually maintained in the Record table here and can be queried using the “parents” and “children” attributes, the implementa- tion could maintain that information at schema-level wherever feasible to minimize the storage requirements. 4.3 Language Features VQUEL is largely a generalization of the Quel language (while also introducing certain syntactic conveniences that Quel does not possess), and combines features from GEM and path-based query languages. This means thatVQUEL is a full-edged relational query language, and in addition, it enables the seamless querying of the nested data 82 model described in the previous section, encoding versioning derivation relationships, as well as versioning metadata. VQUEL will be illustrated using example queries on the repository shown in Fig- ure 4.2, with certain deviations introduced when necessary. We will introduce the con- structs in VQUEL incrementally, starting from those present in Quel to the new ones designed for the setting in DEX. For ease of understanding, we rst present a version that is clear and easy to understand, but results in longer queries. In Section 4.3.2 we describe additional constructs to make the queries concise. 4.3.1 Examples We begin with some simpleVQUEL queries. Most of these queries are also straight- forward to write in SQL; the queries that cannot be written in SQL easily begin in Sec- tion 4.3.3. Here, we gradually introduce the constructs of VQUEL as a prelude to the more complex queries combining versioning and data. Query 1 Who is the author of version with id “v01”? 1 range of V is Version 2 retrieve V.author.name 3 where V.id = "v01" A VQUEL query has two elements: iterator setup (range above) and retrieval (retrieve above) of objects satisfying a predicate (where above). Iterators in VQUEL are similar to tuple variables in Quel, but more powerful, in the sense that they can iterate over objects at any level of our hierarchical data model. They are declared with a statement of the form: 1 range of is 83 The retrieve statement is used to select the object properties, and is of the form: 1 retrieve [into ][unique] 2 [where ] 3 [sort by [asc/desc] {, [asc/desc]}] The retrieve statement fetches all the object attributes specied in the target-list for those objects satisfying the where clause. Query 2 What commits did Alice make after January 01, 2015? 
1 range of V is Version 2 retrieve V.all 3 where V.author.name = "Alice" and V.creation_ts >= "01/01/2015" In Queries 1 and 2, note the use of GEM-style tuple-reference attributes, namely V.author , and the keyword all from Quel. The comparators =, !=, <, <=, > and >= are allowed in comparisons, and the logical connectives and, or, and not can be used to combine comparisons. Multiple iterators can be set up before a retrieval statement, and their respective sets can be dened as a function of previously declared iterators. The next example illustrates this idea. The rst range clause sets up an iterator V over all the versions. The second range clause denes an iterator over all relations inside a version. Query 3 List the commit timestamps of versions that contain the Employee relation. 1 range of V is Version 2 range of R is V.Relations 3 retrieve V.commit_ts 4 where R.name = "Employee" Query 4 Show the commit history of the Employee relation in reverse chronological order. 1 range of V is Version 2 range of R is V.Relations 3 retrieve V.creation_ts, V.author.name, V.commit_message 4 where R.name = "Employee" and R.changed = true 5 sort by V.creation_ts desc 84 Similarly, we can set up a range clause over tuples inside a relation. Analogous to a relational database, the user needs to be familiar with the schema to be able to pose such a query. Query 5 Show the history of the tuple with employee id “e01” from Employee relation. 1 range of V is Version 2 range of R is V.Relations 3 range of E is R.Tuples 4 retrieve E.all, V.commit_id, V.creation_ts 5 where E.employee_id = "e01" and R.name = "Employee" 6 sort by V.creation_ts 4.3.2 Syntactic sweetenings In this section, we introduce some shorthand constructs to keep the size of the queries small. These constructs are meant only for brevity, and each of them can be mapped to an equivalent query without using shorthands. The rst one is analogous to a lter operation over a set declaration: we can use predicates in the set declaration block of the range statement. For instance, in the follow- ing example, both queries iterate over the same set of versions. Note that the retrieve into clause in (b1) sets up a new iterator V over all the versions satisfying constraints in where clause. 1 (a1) range of V is Version(id = "v01") 2 3 (b1) range of T is Version 4 retrieve into V (T.all) 5 where T.id = "v01" The next example shows the principle in action on a query that would otherwise become quite long. Again, (a2) and (b2) below show identical queries written using the short notation (a) and their equivalent form (b). 85 Query 6 Find all Employee tuples in version “v01” that are dierent in version “v02”. 1 (a2) range of E1 is Version(id = "v01").Relations(name = "Employee").Tuples 2 range of E2 is Version(id = "v02").Relations(name = "Employee").Tuples 3 retrieve E1.all 4 where E1.employee_id = E2.employee_id and E1.all != E2.all 5 6 (b2) range of V1 is Version 7 range of R1 is V1.Relations 8 range of E1 is R1.Tuples 9 range of V2 is Version 10 range of R2 is V2.Relations 11 range of E2 is R2.Tuples 12 retrieve E1.all 13 where V1.id="v01" and R1.name="Employee" 14 and V2.id="v02" and R2.name="Employee" 15 and E1.employee_id = E2.employee_id and E1.all != E2.all 4.3.3 Aggregate operators The aggregate functions sum, avg, count, any, min and max are also provided in VQUEL. Any expression involving components of iterated entity attributes, constants and arith- metic symbols can be used as the argument of these functions. 
Due to the nested nature of iterators, we introduce the _all version of these operators, i.e. count_all, sum_all, etc. The general syntax of an aggregate expression is: 1 agg_op([/] [group by ] [where < predicate>]) This evaluates the agg_op on each group of of objects that satisfy the . We see two examples next. Query 7 For each version, count the number of relations inside it. 1 range of V is Version 2 range of R is V.Relations 3 retrieve V.id, count(R) Query 8 Find all versions containing precisely 100 Employees with last name “Smith”. 86 1 range of V is Version 2 range of E is V.Relations(name = "Employee").Tuples 3 retrieve V.commit_id 4 where count(E.employee_id where E.last_name = "Smith") = 100 In both queries above, the aggregation is performed only over objects at the innermost level of an iterator expression. In query 7, R is an iterator over relations inside a version V, and count iterates only over the innermost level of this iterator hierarchy, that is, R. Similarly, in query 8, the count expression only iterates over the tuples inside a relation inside a version. Notice that the latter query is not very easy to express in vanilla SQL: there is no easy way to use SQL to retrieve version numbers, which in a traditional non-versioned context would either be considered as schema-level information, or involve multiple joins depending on the level of normalization of the schema. VQUEL, on the other hand, allows us to set up the nested iterators that makes such queries very easy to express. The next two examples show the usage of count_all operator. The dierence from the count operator is that all the “parent” iterators are evaluated, instead of only the innermost iterator, to compute the value of the aggregate. Another way to reason about this behavior is that count has an implicit grouping list of attributes in its by clause: query 9 is identical to query 8. Query 9 Find all versions containing precisely 100 employees with last name “Smith”. 1 range of V is Version 2 range of R is V.Relations(name = "Employee") 3 range of E is R.Tuples 4 retrieve V.commit_id 5 where count_all(E.employee_id group by R, V where E.last_name = "Smith") = 100 Aggregates having a group by clause can also be used in the predicate to restrict the 87 results of the query. In query 9, the result of count_all for each group is compared against 100. Query 10 gives another example. Query 10 Find all versions containing precisely 100 tuples in all relations put together inside a version. 1 range of V is Version 2 range of R is V.Relations 3 range of T is R.Tuples 4 retrieve V.all 5 where count_all(T group by V) = 100 The next few examples show how we can use aggregate operators across a set of versions to answer a variety of questions about the data. Query 11 Among a group of versions, nd the version containing most tuples that satisfy a predicate. For instance, which version contains the most number of employees above age 50? 1 range of V is Version 2 range of E is V.Relations(name = "Employee").Tuples 3 retrieve into T (V.id as id, count(E.id where E.age > 50) as c) 4 retrieve T.id 5 where T.c = max(T.c) Up until now, for an iterator, we have been exploring “down” the hierarchy. We also provide appropriate functions, depending on the type of iterator, to refer to values of entities “up” in the hierarchy. In the next query, Version(T) is used to refer to the version attributes of tuples in T. Query 12 Which versions are such that the natural join between relations S and T has more than 100 tuples? 
88 1 range of V is Version 2 range of S is V.Relations(name = "S").Tuples 3 range of T is V.Relations(name = "T").Tuples 4 retrieve into Q(V.id as id, 5 count_all(S.id group by V where S.id = T.s_id and Version(S).id = Version(T).id) as c) 6 retrieve Q.id 7 where Q.c >= 100 4.3.4 Version graph traversal VQUEL has three constructs aimed at traversing the version graph. Each of these operate on a version at a time, specied over an iterator. • P(): Return the set of ancestor version of this version, until integer num- ber of hops in the version graph. If the number of hops is not specied, we go till the rst version. Duplicates are removed. • D(): Similar to P() except that it returns the descendant/derived versions. • N(): Similar to P() except that it returns the versions that are number of hops away. The next few queries illustrate these constructs. Notice once again that queries of this type are not very easy to express in SQL, which does not permit the easy traversal of graphs, or specication of path queries. The constructs we introduce are reminiscent of constructs in graph traversal languages [98]; these combined with the rest of the power of VQUEL enable some fairly challenging queries to be expressed rather easily. Query 13 Find all versions within 2 commits of “v01” which have less than 100 employees. 89 1 range of V is Version(id = "v01") 2 range of N is V.N(2) 3 range of E is N.Relations(name = "Employee").Tuples 4 retrieve N.all 5 where count(E) < 100 Query 14 Find all versions where the delta from the previous version is greater than 100 tuples. 1 range of V is Version 2 range of P is V.P(1) 3 retrieve unique V.all 4 where abs(count(V.Relations.Tuples) - count(P.Relations.Tuples)) > 100 Query 15 For each tuple in Employee relation as of version “v01”, nd the parent version where it rst appeared. 1 range of V is Version(id = "v01") 2 range of E is V.Relations(name = "Employee").Tuples 3 range of P is V.P() 4 range of PE is P.Relations(name = "Employee").Tuples 5 retrieve E.id, P.id 6 where E.employee_id = PE.employee_id and P.commit_ts = min(P.commit_ts) 4.3.5 Extensions to ne-grained provenance Finally, in some cases, we may have complete transparency into the operations performed by data scientists. In such cases, we can record, reason about, and access tuple-level provenance information. Here is an example of a query that can refer to tuple-level provenance: Query 16 For tuples in version “v01” in relation S that satisfy a predicate, say value of attribute attr = x, nd all parent tuples that they depend on. 1 range of E is Version(id = ‘‘v01’’).Relations(name = ‘‘S’’).Tuples 2 range of P is E.parents 3 retrieve E.id, P.id 4 where E.attr = x 90 Similar queries can be used to “walk up” the derivation path of given tuples, for example, to identify the origins of specic tuples. 91 Chapter 5: Query Execution I: Set-based Operations 5.1 Introduction As discussed in Chapter 3, DEX makes use of delta encoding to store past versions of datasets eciently on disk. Many archival and backup systems, including version control systems like git, SVN, etc., often store multiple versions or snapshots of large datasets or les that have signicant overlap across their contents using deltas. As a re- sult, there has been signicant work on various aspects of delta encoding-based storage systems: computing near-optimal deltas for a variety of data formats [99, 100], quickly nding ideal les to delta from [101], and supporting delta storage in le systems, scien- tic databases, network transport, etc. 
[5, 67]. However, existing delta-oriented storage engines oer limited or no support for querying the data stored within them; the pri- mary query type supported by those engines is checkout, i.e., reconstructing a specic version of a dataset or a le. With such storage engines becoming ecient and main- stream, there is an increasing desire and opportunity to perform rich analysis queries over the historical information contained within such data stores. The queries of inter- est include auditing or provenance queries over the datasets (e.g., identify the datasets where a particular property holds), analyzing the evolution of a dataset over time (i.e., temporal analytics), and comparing results of SQL-like queries over dierent versions 92 of the same dataset (obtained through, e.g., applying dierent analysis pipelines to the same initial dataset). However, other delta-oriented storage engines of today require users to “check out” complete le/dataset versions in order to manipulate them. This approach is less than ideal particularly when the individual versions are large and the users need to ac- cess multiple versions for their analysis task. In this chapter, we present computation- ally cheap methods to evaluate a query by pushing down query execution to the level of deltas. 5.2 System Overview We begin with a brief description of the user-facing data model before describing the dif- ferent types of queries that we support. Thereafter, we describe the system data model, i.e., the physical organization of data, and the primitives used by the system to evaluate the queries. 5.2.1 User Data Model We recall a few important denitions for ease of exposition. The user data model in DEX has two main abstractions – datafile, and version – that form the basis of all user interactions. As mentioned earlier, a datafile is a le whose contents are interpreted as set of records. The user species a record separator when a datafile is added in the system. Within a datafile, we consider a record as an unstructured sequence of bytes. The only 93 constraint we impose, however, is that a datafile cannot contain identical records: two records are said to be identical if they both have the same sequence of bytes. For instance, textual at les such as CSV or logs can be seen as containing one record per line. A version is a point-in-time snapshot of one or more datafiles typically residing in a directory on the user’s le system. A version, identied by a unique ID, is immutable, and can be created at any point in time by any user who has access to the repository. In addition to datafiles and versions, DEX also captures the version-level prove- nance – derivation and transformation relationships among the set of all versions – in a data structure called the version graph. Nodes in a version graph correspond to versions and edges capture relationships such as derivation, branching, transformation, etc, between two versions. Since a version graph is typically much smaller than the datafile contents, it can be kept and traversed in memory to identify the versions that are referenced in a query. We use the following notation to formalize the above discussion. Let  be the set of all versions. Each version V ∈  contains a nite number of datafiles, say, V = {A1,… , At}. Let = {A1,… , An} be the set of all datafiles across all versions. Note that it is possible for a datafile to be present in more than one version – this happens when the said datafile is not modied in the respective versions. 
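To make these abstractions concrete before continuing with the notation, here is a minimal sketch in Python (illustrative names only, not DEX's actual API), under the assumptions stated above: records are opaque byte strings, a datafile holds no duplicate records, and a version is an immutable snapshot referencing one or more datafiles.

from dataclasses import dataclass

# A record is an opaque sequence of bytes; a datafile is modeled as a set of
# records, which mirrors the "no identical records" constraint.
Record = bytes
Datafile = frozenset  # frozenset of Record

def make_datafile(raw: bytes, record_sep: bytes = b"\n") -> Datafile:
    """Split a file's contents into records using the user-specified separator."""
    return frozenset(r for r in raw.split(record_sep) if r)

@dataclass(frozen=True)
class Version:
    version_id: str        # unique, immutable identifier
    datafiles: frozenset   # IDs of the datafiles contained in this snapshot

# The same (unchanged) datafile may be referenced by several versions.
A1 = make_datafile(b"r1\nr2\nr3")
v1 = Version("v01", frozenset({"A1", "A2"}))
v2 = Version("v02", frozenset({"A1", "A3"}))  # A1 unmodified, so it is shared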
The set of datafiles that appear in a version are kept track of as metadata in the corresponding node of the version graph. Let Aa = {r1,… , rm} be the set of records contained in datafile Aa. As mentioned before, no two records in a datafile are identical, i.e., ri ≠ rj , ∀ri , rj ∈ Aa. 94 5.2.2 Queries We now describe the semantics of each of the core operations that are the primary focus of this chapter. Checkout: Checkouts are the primary mechanism for reading o older versions of a dataset. Any version or any set of datafiles can be checked out, and the result is copied to the location suggested by the user (typically, it will be a directory on the user’s machine). When a checkout query is issued, the version graph is consulted to identify the set of datafiles that comprise it. Specically, the checkout operation takes as input a set of k ≥ 1 datafiles k = {Ax1 ,… , Axk} ⊂  and outputs k les, one for each datafile. Henceforth, we use the notation Checkout(k) to denote the checkout operation. Intersect: The intersect operation is an important operation when comparing the con- tents of a datafile that was modied across multiple versions. Similar to set intersec- tion, given a set of k ≥ 2 datafiles k = {Ax1 ,… , Axk} ⊂ , the intersect operation outputs a single datafile containing records that appear in all datafiles in k , i.e., {r ∶ r ∈ Ax1 ∧⋯ ∧ r ∈ Axk}. We use the notation I (k) to denote the intersect operation. Union: The union operation, denoted by U (k), returns a single datafile containing records that appear in any of the datafiles in k , i.e., {r ∶ r ∈ Ax1 ∨⋯ ∨ r ∈ Axk}. t-Threshold: Given as input a set of k ≥ 3 datafiles k and an integer 1 < t < k, the t- threshold operation, denoted by Tt(k), returns a single datafile that contains records appearing in at least t of the datafiles in k . This generalizes the above operations – t = 1 and t = k correspond to union and intersection respectively. 95 Although the above set of operations is intended as a starting point for investigat- ing the nascent topic of query processing over deltas, these operations already enable many interesting queries. For example, comparing the results of intersection, union and/or t-threshold across the versions of an evolving dataset can provide insights into the evolution process (e.g., properties of the records that change frequently vs those that remain static). Intersection or t-threshold across the results of dierent machine learn- ing pipelines on the same input dataset can help us identify which types of records are dicult to predict correctly, which can help an analyst steer the training process. Fur- ther, t-threshold can return, for each record, a bitmap indicating the versions to which it belongs; depending on the semantics of the versions being queried, that information could be used for a variety of purposes including correlation analysis, anomaly detection, and visualizations. Finally, if specic analyses of interest are known in advance, materi- alized views (e.g., projections, results of aggregate queries or joins) can be computed in advance as the dataset versions are ingested; by exploiting the overlaps, these material- ized views could be persisted cheaply in the storage engine itself. Although this requires a priori planning, the benets at the time of querying could be tremendous. Dening and automatically materializing such views remains a rich area for future work. 
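Since all four operations are plain set algebra over record sets, their intended semantics can be illustrated with a short sketch (hypothetical function names; datafiles modeled as Python sets); note how t-threshold generalizes union (t = 1) and intersection (t = k).

from collections import Counter
from functools import reduce

def intersect(datafiles):
    # I(.): records that appear in every datafile.
    return reduce(set.intersection, (set(a) for a in datafiles))

def union(datafiles):
    # U(.): records that appear in at least one datafile.
    return reduce(set.union, (set(a) for a in datafiles))

def t_threshold(datafiles, t):
    # Tt(.): records that appear in at least t of the k datafiles.
    counts = Counter(r for a in datafiles for r in set(a))
    return {r for r, c in counts.items() if c >= t}

# t = 1 reduces to union and t = k to intersection.
A = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}]
assert t_threshold(A, 1) == union(A)
assert t_threshold(A, len(A)) == intersect(A) == {3}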
5.2.3 System Data Model Next, we discuss the storage graph and the delta encoding scheme used in DEX to store the versions of datafiles on disk. Thereafter, we describe few properties of the deltas and discuss methods of combining them that will be useful in subsequent sections. 96 5.2.3.1 Storage Graph Let  = (V , E) be a storage graph (see Figure 5.1 for an example). Note that this graph is dierent from version graph, described in section 5.2.1. While the version graph captures derivation or transformation relationships between versions of datasets, the storage graph represents information at the granularity of datafiles (encompassing all versions) and is meant to indicate delta relationships between them. Moreover, the storage graph is used by internal query execution routines and, unlike version graph, is not intended to be exposed to the end user. The vertex set V of the storage graph cap- tures all unique datafiles across all versions, and a special empty datafile, A0. Thus, V = A0 ∪. An edge e(Ai , Aj) ∈ E represents the delta between datafiles Ai and Aj , and the edge set E represents the deltas that are chosen to store all datafiles. The weight of the edge we represents the storage cost (size in bytes) of the delta. For an edge e(A0, Ai), we represents the storage cost of Ai in its entirety (i.e., Ai is materialized). We require that  be a connected graph so that it is possible to reconstruct any of the datafiles in. Specically, a path fromA0 toAi indicates the materialized datafile (one following A0 on the path) and the sequence of deltas to apply in order to recreate Ai . Thus, to store all the datafiles in , it is sucient to store only the materialized datafiles in  and all the deltas in E. Prior systems have made use of the storage graph representation [30, 65, 102], al- beit with dierent monikers, to model a delta based solution to store data versions. The storage graph also generalizes the sequence-of-deltas model where the versions are or- 97 dered according to a certain criteria, e.g., timestamp, le size, etc., and every version ex- cept the rst is stored as a delta against the previous one. The sequence-of-deltas model, although conceptually simple, has the downside that the retrieval time grows linearly with the number of versions stored. The storage graph representation addresses this limitation by allowing multiple versions to be derived from one version. For instance, if we require that every datafile derives 3 datafiles not derived by others, we can pack approximately 80K datafiles and have a maximum delta sequence of length 10. 5.2.3.2 Set-backed Deltas and Properties The delta format that we consider in this chapter, called Set-backed Deltas, is an undi- rected delta format, similar to the standard UNIX line-by-line di. A set-backed delta Δ between a source datafileAi and a target datafileAj , is a set of two datafiles, Δ− and Δ+, that correspond to “deletions” and “insertions” respectively. Δ− is the set of records that are present in Ai but not in Aj , while Δ+ is the set of records that are not present in Ai but present in Aj . Δ can also be used to reconstruct Ai from Aj by exchanging Δ− and Δ+. When using set-backed deltas, we require them to be consistent [69], i.e., a delta does not contain the same record in Δ− and Δ+. This does not preclude updates to a record, including schema changes, since an update can be recorded as deleting the old record and adding a new record. Denition 1 (Consistent Delta) A delta is said to be consistent if Δ− ∩ Δ+ = ∅. 
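As an illustration of the set-backed delta format (a sketch with hypothetical names, not DEX's on-disk encoding), a delta is simply the pair of set differences between the source and target datafiles, and consistency in the sense of Definition 1 then holds by construction.

def make_delta(src: set, tgt: set):
    # Set-backed delta from src to tgt: (deletions, insertions).
    return src - tgt, tgt - src

def is_consistent(delta) -> bool:
    # Definition 1: the deletion and insertion sets must be disjoint.
    minus, plus = delta
    return not (minus & plus)

A1 = {b"r1", b"r2", b"r3"}
A2 = {b"r2", b"r3", b"r4"}
delta = make_delta(A1, A2)   # ({b"r1"}, {b"r4"})
assert is_consistent(delta)
# Swapping the two sets gives the delta in the other direction (A2 to A1),
# which is what makes this format undirected.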
Because datafiles and deltas are sets, we will often make use of the following three standard operations on sets – union (∪), intersection (∩), and difference (−).

Figure 5.1: (a) A storage graph over datafiles A1,…, A12; nodes shaded in blue (A1, A3) indicate materialized datafiles, and edge annotations indicate the disk size of the delta; (b) access tree for Q(A12), which is the shortest path from A0 to A12; (c) access tree for Q(A6, A8, A9, A12), which is the minimum cost Steiner tree for the terminals {A0, A6, A8, A9, A12}.

Continuing the example, when we use Δ to construct Aj from Ai, we call this operation patching Ai using Δ, and denote it as Aj = Ai ⊕ Δ.

Definition 2 (Patch) Ai ⊕ Δ = (Ai − Δ−) ∪ Δ+.

Observation 1 If Δ is consistent, Ai ⊕ Δ = (Ai − Δ−) ∪ Δ+ = (Ai ∪ Δ+) − Δ−.

Next, we describe another important property of set-deltas, called contraction. Intuitively, delta contraction corresponds to combining two deltas into a single delta such that the new delta has the same effect as applying the individual deltas. Formally, if A1, A2, A3 are three datafiles and Δ1 = Δ(A1, A2), Δ2 = Δ(A2, A3), we use the patch operator as before to represent contraction as follows.

Definition 3 (Delta Contraction) Δ = Δ1 ⊕ Δ2, where Δ− = (Δ−1 − Δ+2) ∪ Δ−2 and Δ+ = (Δ+1 − Δ−2) ∪ Δ+2.   (5.2.1)

Although delta contraction, as defined above, can be applied to two arbitrary deltas, the result is well-defined only if the target datafile of Δ1 is the same as the source datafile of Δ2. The result Δ has the same source datafile as Δ1 and derives the target datafile of Δ2. This definition can be generalized to a sequence of deltas: the contraction of a sequence of deltas Δ1,…, Δm is the result of the operation Δ1 ⊕ ⋯ ⊕ Δm. Given the above properties, we can infer that:

Observation 2 If Δ1 and Δ2 are consistent, then their contraction, Δ = Δ1 ⊕ Δ2, is also consistent.

Observation 3 The patch operation is associative, i.e., (Δ1 ⊕ Δ2) ⊕ Δ3 = Δ1 ⊕ (Δ2 ⊕ Δ3).

Although some of these observations might seem straightforward, formalizing them is crucial to argue the correctness of the transformations that we perform later.

5.3 Query Execution Preliminaries

We begin with a more formal treatment of the query optimization problem, first discussing the optimization metrics of interest and introducing the two-phase optimization approach that we take. We then briefly discuss the issues of cost and cardinality estimation and the search space of query evaluation plans.

Given a query Q(𝒜k), where Q is one of {Checkout, I, U, Tt} (Section 5.2.2), against a given storage graph 𝒢, there are two somewhat independent stages in the overall query execution. First, we need to identify all the relevant datafiles and deltas in 𝒢 that are necessary to execute Q(𝒜k). We refer to this problem as finding an access tree of Q(𝒜k), and describe it in detail in Section 5.3.2. Second, given an access tree, we need to devise an efficient evaluation plan that describes exactly what operations are used to compute the result of Q(𝒜k). This plan is represented as a delta expression: an algebraic expression where the operands are datafiles and deltas from the storage graph 𝒢, and the operations are patch and primitive set operations. During this stage, we also consider the problem of finding a good ordering of evaluating the different operations in the delta expression.
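Since every evaluation plan is ultimately composed of patch and contraction steps, a minimal, self-contained sketch of Definitions 2 and 3 (deltas modeled as pairs of Python sets; illustrative names only) may help make such delta expressions concrete.

def patch(datafile: set, delta):
    # Definition 2: A ⊕ Δ = (A − Δ−) ∪ Δ+.
    minus, plus = delta
    return (datafile - minus) | plus

def contract(d1, d2):
    # Definition 3: contract Δ1 (A1 to A2) and Δ2 (A2 to A3) into one A1-to-A3 delta.
    m1, p1 = d1
    m2, p2 = d2
    return (m1 - p2) | m2, (p1 - m2) | p2

# Checking Definition 3 and Observation 3 (associativity) on a tiny example.
A1, A2, A3 = {1, 2, 3}, {2, 3, 4}, {3, 4, 5}
d12 = (A1 - A2, A2 - A1)   # delta from A1 to A2
d23 = (A2 - A3, A3 - A2)   # delta from A2 to A3
assert patch(A1, contract(d12, d23)) == A3
assert patch(patch(A1, d12), d23) == A3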
We describe the techniques for each query Q ∈ {Checkout, I , U , Tt} in Section 5.4. 5.3.1 Optimization Metrics To be able to develop a systematic cost-based approach to query execution, we rst need to identify appropriate optimization metrics and cost models. It is unfortunately dicult to develop a single cost metric that captures the costs of the two stages discussed above, which also makes it hard to do joint optimization across them. Because the backend store is likely to be relatively expensive to access (we expect it to be distributed in general), we would like to minimize the amount of data that is read from the backend store; this also reduces the network I/O. Once the data has been gathered, however, the dierent ways to evaluate a query can have very dierent CPU costs and wall-clock time. Hence, 101 for the second phase, we would prefer to use a metric that tracks the CPU cost. We adopt a two-phase approach in DEX inspired by this. We rst nd the best “access tree” that minimizes the total amount of data that needs to be read (in bytes) from the backend store. In other words, we identify the set of datafiles and deltas that have the smallest total size, that are sucient to reconstruct the required datafiles. We then search for the best evaluation plan according to a cost model that estimates the CPU resources needed by the plan. We discuss the specics in further detail in Section 5.3.4 when we discuss the operator implementations. We do not explicitly account for disk access costs during the second phase for sev- eral reasons. First, although the overall storage graph and the delta sizes in total are expected to be very large, the access tree for any given query is typically much smaller and the deltas constituting that will typically t in the memory of a powerful machine. More importantly, most of our algorithms (Section 5.3.4) access the deltas sequentially (while reading and writing), and thus even if the deltas were disk resident (or interme- diate results needed to be written to disk), the CPU and/or the memory bandwidth is still the main bottleneck. One exception here is binary search or gallop search (that an intersection operation might employ) where our approach might underestimate the cost of an intersection in case of extreme skew. However, our cost estimation procedure can be easily modied to account for that case. Moreover, the deltas are typically stored in a compressed fashion on disk, thereby making it necessary to uncompress them by read- ing them once into memory, and further making the overall computation CPU-bound. 102 5.3.2 Access Tree Given a query Q(k), an access tree, Q = (VQ , EQ) is a subgraph of  such that: (i) A0 ∪k ⊆ VQ ⊆ V , and (ii) Q is a tree, i.e., a connected graph with no cycles. The rst condition implies that all datafiles required by the query are part of the access tree. The second condition ensures that we have a valid and minimal solution: (i) Valid: because Q is connected, there exists at least one path between A0 and Axi , which denotes the materialized datafile and the sequence of deltas to apply to reconstruct Axi , (ii) Minimal: because Q is a tree, for every Axi ∈ k , Q contains exactly one path from A0 to Axi . We dene the cost of an access tree as the sum of weights of all edges in it, i.e., C(Q) = ∑e∈EQ we . When the edge weights correspond to the sizes of the deltas, this def- inition captures the cost metric mentioned above. 
To address the problem of identifying the least cost access tree, we consider two cases, k = 1 and k > 1. We refer to these as single datafile access and multiple datafile access respectively. Single datafile Access: When k = 1, 1 = {Ax1}. Any A0 to Ax1 path in  satises the conditions of an access tree. Thus, nding the least cost access tree amounts to nding the shortest path between A0 and Ax1 , and we use the classical Dijkstra’s algorithm. Multiple datafile Access: When k > 1, the problem of nding a low cost access tree is equivalent to nding a Steiner Tree [103]. Here, the set of nodes A0 ∪ k act as terminals and our objective is to nd a minimum cost Steiner tree that contains all of them. This problem is -Hard, i.e., arbitrarily good approximations cannot be achieved in polynomial time (unless  = ). In this work, we use the classical 2- 103 approximation algorithm, which nds a tree with cost at most 2 times the optimal. Example 8 Consider the query Checkout({A6, A8, A9, A12}) on the storage graph in Fig- ure 5.1(a). Figure 5.1(c) shows the least cost access tree for this query. 5.3.3 Search Space Cost-based optimization requires us dene the search space of potential, equivalent plans. The search space that we use in this work revolves around two equivalences: (i) associativity of the patch operation, and (ii) De Morgan’s laws for set theory. We can thus generate equivalent evaluation plans by repeatedly applying those equivalence rules. Unfortunately the number of dierent evaluation plans is very large, even with just the rst rule (Section 5.4.1). Unlike relational query optimization, the set of po- tential intermediate results is not easy to dene either, and thus this problem does not seem amenable to dynamic programming-style algorithms used there. We instead take a hybrid approach where we use a series of heuristic transformation rules, based on De Morgan’s laws, to simplify the expressions, and use a dynamic programming-based al- gorithm (that exploits the associativity of patch) to optimize the sub-expressions in the simplied expression. Apart from generating alternative query expressions using logical equivalence rules, it is also possible to expand the search space of candidate plans by considering the impact of physical access structures on the data, e.g., secondary indexes. For in- stance, B-Trees on datales or deltas can be helpful when records are ltered on some attribute, bloom lters on deltas can help in evaluating queries like set dierence, and 104 so on. Additional considerations also arise when a join result is required across multiple versions – the delta chains for the dierent sets of datales (corresponding to the dif- ferent relations) may not be “aligned” and the access tree selection will have to consider possibility of “joint” optimizations. Understanding this search space further, especially for richer queries involving joins and aggregates, remains a rich area for future work. 5.3.4 Cost and Cardinality Estimation The cost of executing any of the set operations mentioned so far depends on the physical datafile format and the specic implementation of the operation. Since there exist several implementations for the set operations, there exist several cost functions. In DEX, the primary method of storing a datafile is clustered storage. In this method, records are stored in a sorted manner based on a suitable derived key (e.g., SHA1). 
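A small sketch of what clustered storage looks like (hypothetical helper names; SHA1 of the record bytes serves as the derived key): records are kept sorted by key, which is what enables the merge- and search-based set algorithms whose costs are outlined next.

import hashlib

def record_key(record: bytes) -> str:
    # Derived key for a record; here, SHA1 of the record bytes.
    return hashlib.sha1(record).hexdigest()

def cluster(records) -> list:
    # Clustered storage: unique records kept sorted by their derived key.
    return sorted(set(records), key=record_key)

def merge_intersect(a: list, b: list) -> list:
    # Linear-scan intersection of two key-sorted record lists (one of the
    # standard evaluation strategies over this storage format).
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        ka, kb = record_key(a[i]), record_key(b[j])
        if ka == kb:
            out.append(a[i]); i += 1; j += 1
        elif ka < kb:
            i += 1
        else:
            j += 1
    return out

A1 = cluster([b"r1", b"r2", b"r3"])
A2 = cluster([b"r2", b"r3", b"r4"])
assert merge_intersect(A1, A2) == cluster([b"r2", b"r3"])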
There are several algorithms for evaluating a set expression between two or more operands based on this storage format, and we outline our choices next along with their respective costs. To keep the discussion simple, we describe the algorithms and their respective cost functions when all input data for a specific operation fits in memory and there is no paging of intermediate results to disk. Even if some deltas are large enough to require using disk, most of the algorithms below access the deltas sequentially and thus can be used with small modifications. We note that our optimization algorithms are largely agnostic to the specific choices for operator implementations, and can be used as long as the costs of the operations can be estimated.

Intersection: To compute the intersection of l datafiles, A1, …, Al, we use an adaptive algorithm introduced in [83] called Small Adaptive (SA). SA first sorts the set of input datafiles according to their size. For each element in the smallest datafile, SA performs a gallop search on the second smallest datafile. A gallop search consists of two stages. In the first stage, we determine a range in which the element would reside if it were in the datafile. This range is found by identifying the first exponent j such that the element at position 2^j is greater than the searched element. In the second stage, a binary search is performed in the range (2^(j−1), 2^j) to find if the element exists. If found, a new gallop search is performed in the remaining l − 2 datafiles to determine if the element is present in the intersection; otherwise, a new search is performed. After this step, each datafile has an examined range (from the beginning to the position returned by the current gallop search) and an unexamined range. SA then selects the two datafiles with the smallest unexamined ranges and repeats the process until one of the datafiles has been fully examined. Because intersections only make sets smaller, as the algorithm progresses with several sets, the time to do each intersection effectively reduces. In particular, as pointed out in [83], the algorithm benefits largely if the set sizes vary widely, and performs poorly if the set sizes are all roughly the same. Since one gallop search takes O(log i) time, where i is the index where the element would be in the datafile, we can model the worst-case cost of intersection as,

C∩(A1, …, Al) = l |A1| log(|Al|/|A1|),   (5.3.1)

where A1 and Al are the smallest and largest datafiles respectively.

Union: To take a union of l datafiles {A1, …, Al}, we perform a linear scan over all lists to merge them, and output the result.

C∪(A1, …, Al) = |A1| + ⋯ + |Al|.   (5.3.2)

Set Difference: To compute the set difference A1 − A2, we choose the better among the following two based on input sizes: perform a linear scan over both datafiles and use a merging algorithm, or, for each element in A1, perform a gallop search on A2, including the element in the output if the search fails. This can be captured using the cost function,

C−(A1, A2) = min{|A1| + |A2|, |A1| log(|A2|/|A1|)}.   (5.3.3)

Patch: This is a binary operation where the two inputs are either (i) a datafile (M) and a delta (Δ), or (ii) two deltas (Δ1 and Δ2). In the first case, the output datafile can be computed by performing one linear scan over each of M, Δ+ and Δ−, and evaluating Definition 2, making the cost function,

C⊕(M, Δ) = |M| + |Δ|.   (5.3.4)

Typically, |M| > |Δ−|, and we use the linear scan approach to compute the set difference. In the second case, the output Δ can be computed by evaluating Definition 3.
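As a concrete illustration, here is a minimal Java sketch of the intersection, union, and set-difference cost functions above (Equations 5.3.1-5.3.3). It is not the DEX cost model itself; the class and method names and the choice of log base 2 are assumptions of the sketch.

public class SetOpCosts {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // C_intersect(A1,...,Al) = l * |A1| * log(|Al| / |A1|),
    // where sizes[] is assumed sorted so sizes[0] is the smallest datafile.
    // (A real estimator would guard the degenerate case of equal-size inputs.)
    static double intersectCost(long[] sizes) {
        int l = sizes.length;
        return l * sizes[0] * log2((double) sizes[l - 1] / sizes[0]);
    }

    // C_union(A1,...,Al) = |A1| + ... + |Al| (one merging pass over all inputs).
    static double unionCost(long[] sizes) {
        double c = 0;
        for (long s : sizes) c += s;
        return c;
    }

    // C_difference(A1, A2) = min(|A1| + |A2|, |A1| * log(|A2| / |A1|)):
    // merge-based scan vs. one gallop search per element of A1
    // (the gallop option is only meaningful when |A1| <= |A2|).
    static double differenceCost(long a1, long a2) {
        return Math.min((double) a1 + a2, a1 * log2((double) a2 / a1));
    }

    public static void main(String[] args) {
        System.out.println(intersectCost(new long[]{1_000, 50_000, 80_000}));
        System.out.println(unionCost(new long[]{1_000, 50_000, 80_000}));
        System.out.println(differenceCost(1_000_000, 50_000));
    }
}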
Note that the datafiles of Δ2 are scanned twice, once to compute Δ+ and once to compute Δ−. Thus, the cost function is given as,

C⊕(Δ1, Δ2) = |Δ1| + 2|Δ2|.   (5.3.5)

Cardinality Estimation: Because we restrict the search space as discussed in Section 5.3.3, we require intermediate result size estimates only when two deltas are patched. Let Δ = Δ1 ⊕ Δ2, where Δ1 and Δ2 are the deltas between three datafiles A1, A2, A3 arranged in a chain (Δ1 between A1 and A2, and Δ2 between A2 and A3). Let x = |Δ−| and y = |Δ+|. We want to estimate x and y. By definition, Δ is a consistent delta between A1 and A3. Therefore, |A3| = |A1| − x + y. Since |A1| and |A3| are known, we can estimate x from y, or vice versa. From Definition 3 we can obtain intervals for both x and y as,

x ∈ [max(0, |Δ1−| − |Δ2+|) + |Δ2−|, |Δ1−| + |Δ2−|],
y ∈ [max(0, |Δ1+| − |Δ2−|) + |Δ2+|, |Δ1+| + |Δ2+|].

We estimate the quantity with the smaller interval, where the value is chosen uniformly at random from the corresponding interval.

5.4 Query Execution Algorithms

Next we present a series of algorithms for cost-based optimization for each of the different query types.

5.4.1 Checkout Queries

Let Checkout(𝒜k) and 𝒯Q denote a checkout query and its access tree, respectively. We first consider the case when k = 1 (single datafile checkout), followed by the case k > 1 (multiple datafile checkout).

5.4.1.1 Single datafile Checkout

Recall that the access tree 𝒯Q, when k = 1, is the shortest path from A0 to Ax1 in 𝒢. The delta expression for single datafile checkout is therefore of the form,

ℰ : M ⊕ Δ1 ⊕ Δ2 ⊕ ⋯ ⊕ Δm,

where M is the materialized datafile.

Evaluation Algorithms: Since the ⊕ operation is associative, we can evaluate ℰ in multiple ways by changing the placement of "parentheses". For instance, one method is to evaluate the expression from left to right, i.e., ℰ : (((M ⊕ Δ1) ⊕ Δ2) ⊕ ⋯ ⊕ Δm). Alternately, we can evaluate the expression from right to left, or in any arbitrary fashion that repeatedly combines two operands at a time, until we are left with the result. These evaluation methods, in general, will have varying costs. The total number of evaluation orders is equivalent to the classical problem of counting the number of ways of associating m applications of a binary operator, and is given by the (m − 1)th Catalan number, which is Ω(4^m/m^(3/2)). Note that a greedy algorithm that iteratively combines the two deltas having the least cost is not always the optimal strategy.

Example 9 Consider the expression ℰ : Δ1 ⊕ Δ2 ⊕ Δ3, where the deltas are such that |Δ1| = x, |Δ2| ≃ |Δ3| = y, x ≪ y. Intuitively, the deltas Δ2, Δ3 are larger compared to Δ1 and they are such that they almost "undo" each other. The greedy algorithm will pick the plan (Δ1 ⊕ Δ2) ⊕ Δ3 with estimated cost ≈ 2x + 5y, while the optimal plan Δ1 ⊕ (Δ2 ⊕ Δ3) has cost ≈ x + 2y + 2ε, where ε = |Δ2 ⊕ Δ3|.

For the sake of completeness, we have reproduced the classical dynamic programming algorithm to select the (estimated) best evaluation order below. We call this the path contraction (PC) algorithm. We use PC extensively in subsequent sections to determine the best evaluation order to combine a sequence of deltas. The runtime of PC is Θ(m^3), where m is the number of deltas. The input to PC is a materialized datafile, say M, followed by a sequence of deltas, say Δ1, …, Δm, and the output is the datafile represented by the corresponding delta expression, M ⊕ Δ1 ⊕ ⋯ ⊕ Δm.
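The following Java sketch illustrates the cardinality estimation procedure above for Δ = Δ1 ⊕ Δ2: it computes the feasible intervals for x = |Δ−| and y = |Δ+|, picks the quantity with the smaller interval uniformly at random, and derives the other from |A3| = |A1| − x + y. The class and method names are illustrative, not the DEX implementation.

import java.util.concurrent.ThreadLocalRandom;

public class DeltaCardinality {
    // Returns {estimated |Delta^-|, estimated |Delta^+|}.
    static long[] estimate(long d1Minus, long d1Plus, long d2Minus, long d2Plus,
                           long sizeA1, long sizeA3) {
        long xLo = Math.max(0, d1Minus - d2Plus) + d2Minus, xHi = d1Minus + d2Minus;
        long yLo = Math.max(0, d1Plus - d2Minus) + d2Plus,  yHi = d1Plus + d2Plus;
        long x, y;
        if (xHi - xLo <= yHi - yLo) {          // x has the smaller interval
            x = uniform(xLo, xHi);
            y = sizeA3 - sizeA1 + x;           // from |A3| = |A1| - x + y
        } else {
            y = uniform(yLo, yHi);
            x = sizeA1 - sizeA3 + y;
        }
        return new long[]{x, y};
    }

    static long uniform(long lo, long hi) {
        return lo + (hi > lo ? ThreadLocalRandom.current().nextLong(hi - lo + 1) : 0);
    }

    public static void main(String[] args) {
        long[] est = estimate(50, 80, 40, 30, 1_000_000, 1_000_020);
        System.out.println("|Delta^-| ~ " + est[0] + ", |Delta^+| ~ " + est[1]);
    }
}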
The algorithm uses the property that if the optimal solution splits the contraction of a path of length m into two sub-paths, then the contraction of each of the two sub-paths must be optimal (otherwise, we could improve the solution for the sub-paths to improve the overall solution). For 0 ≤ i ≤ j ≤ m, let C[i, j] denote the minimum cost to contract the sequence Δi, …, Δj, D[i, j] denote the estimated size of the corresponding intermediate result, and S[i, j] denote how to best split the sequence. For notational convenience only, if M is present, Δ0 = M. The pseudocode of PC is shown in Algorithm 4.

As discussed in Section 5.3.3, we have syntactically restricted the space of alternative evaluation plans for checkout by only considering the associativity of the patch operation. Although additional transformations could be used to expand the search space, we could not identify any such transformation rules for checkout that were effective outside of pathological cases.

Algorithm 4: Path Contraction (PC)
Input: A materialized datafile M (optional); Δ1, …, Δm
Result: Minimum cost to evaluate M ⊕ Δ1 ⊕ ⋯ ⊕ Δm
1 for l ← 2 to m do
2   for i ← 0 to m − l + 1 do
3     j ← i + l − 1
4     C[i, j] ← min_{i≤k<j} { C[i, k] + C[k+1, j] + C⊕(D[i, k], D[k+1, j]) }
5     S[i, j] ← the k achieving the minimum; D[i, j] ← estimated size of patching D[i, k] and D[k+1, j]
6 return C[0, m] and the evaluation order encoded by S

5.4.1.2 Multiple datafile Checkout

When k > 1, the access tree 𝒯Q is no longer a single path, and we must additionally decide which partial results to share across the k requested datafiles. Consider the first node Bi (starting from the root M) that has l > 1 children. Let B0, …, Bi−1 be the intermediate nodes on the M−Bi path. Let 𝒯Q(Bj), 0 ≤ j < i, be the access tree rooted at Bj (this tree is equivalent to deleting the nodes M, B0, …, Bj−1 from 𝒯Q). For example, Figure 5.2(b) shows two components: (i) the path 𝒫(Bi−1) (above), and (ii) the access tree 𝒯Q(Bi−1) (below). Let split(𝒯Q) denote the operation that splits 𝒯Q at Bi into l access trees, one for each child of Bi. This is shown in Figure 5.2(c). Let split-par(𝒯Q) denote the operation that splits 𝒯Q at Bi into l access trees, one for each child of Bi, but this time preserving the parent sequence of deltas in each split. This is shown in Figure 5.2(d).

Figure 5.2: An instance of the Tree Contraction algorithm; (a) is an access tree for the query Checkout(A1, A2, A3); (b) shows the path 𝒫(Bi−1) and the subtree 𝒯Q(Bi−1); (c) shows split(𝒯Q); (d) shows split-par(𝒯Q).

Algorithm 5: Tree Contraction (TC)
Input: Access tree 𝒯Q
1 Apply PC for each path 𝒫(Ai), Ai ∈ 𝒜k. Memoize C[], S[] and D[].
2 return Best-Subexp({}, 𝒯Q)
3 Procedure Best-Subexp(𝒟, 𝒯Q)
  Input: Deltas 𝒟; access tree 𝒯Q
4   if 𝒯Q is a path then return best solution for {𝒟 ∪ 𝒯Q}
5   forall 0 ≤ j < i do
6     Δ𝒫(Bj) ← estimated delta for the sequence 𝒫(Bj)
7     𝒟′ ← 𝒟 ∪ Δ𝒫(Bj)
8     cost_j ← Best-Subexp(𝒟′, 𝒯Q(Bj))
9   Δ𝒫(Bi) ← estimated delta for the sequence 𝒫(Bi)
10  𝒟′ ← 𝒟 ∪ Δ𝒫(Bi)
11  cost_split ← Σ_{𝒯Q′ ∈ split(𝒯Q)} Best-Subexp(𝒟′, 𝒯Q′)
12  cost_split-par ← Σ_{𝒯Q′ ∈ split-par(𝒯Q)} Best-Subexp(𝒟, 𝒯Q′)
13  return the best of the above costs

Before we conclude this discussion, it will be helpful to understand, as the following example shows, why a simple greedy strategy of always sharing the largest possible expression (from left to right) is not always optimal.

Example 11 Consider an access tree for checking out A3 and A4, in which M is the materialized datafile, M is connected to A1 by Δ1, A1 is connected to A2 by Δ2, and A3 and A4 are each stored as a delta (Δ3 and Δ4, respectively) from A2. The instance is constructed such that Δ2, Δ3, Δ4 are large (say, y ≈ |Δ2| ≈ |Δ3| ≈ |Δ4|) and Δ3, Δ4 "undo" most of the changes done by Δ2. Δ1 is a small independent set of changes, say, x = |Δ1|, x ≪ y. The greedy strategy will force us to share Δ′ = Δ1 ⊕ Δ2 with |Δ′| ≈ x + y. Thus the cost of the greedy strategy is ≈ 3x + 8y. On the other hand, evaluating Δ1 ⊕ (Δ2 ⊕ Δ3), Δ1 ⊕ (Δ2 ⊕ Δ4) incurs a cost ≈ 2x + 6y.
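To make the PC dynamic program concrete, here is a minimal Java sketch of the recurrence in Algorithm 4. For brevity it applies the delta-delta patch cost |left| + 2|right| uniformly and uses a crude additive size estimate for intermediate results (DEX would use Equation 5.3.4 for the datafile-delta case and the cardinality estimator of Section 5.3.4); the class name and this simplification are assumptions of the sketch, not the DEX implementation.

public class PathContraction {
    long[][] C, D;
    int[][] S;

    // sizes[0] may be the materialized datafile M, sizes[1..m] the deltas.
    PathContraction(long[] sizes) {
        int n = sizes.length;
        C = new long[n][n];
        D = new long[n][n];
        S = new int[n][n];
        for (int i = 0; i < n; i++) D[i][i] = sizes[i];
        for (int len = 2; len <= n; len++) {
            for (int i = 0; i + len - 1 < n; i++) {
                int j = i + len - 1;
                C[i][j] = Long.MAX_VALUE;
                for (int k = i; k < j; k++) {
                    long patch = D[i][k] + 2 * D[k + 1][j];     // simplified C_patch(left, right)
                    long cost = C[i][k] + C[k + 1][j] + patch;
                    if (cost < C[i][j]) {
                        C[i][j] = cost;
                        S[i][j] = k;
                        D[i][j] = D[i][k] + D[k + 1][j];        // crude intermediate-size estimate
                    }
                }
            }
        }
    }

    // Parenthesization of the optimal plan, e.g. "((M . D1) . (D2 . D3))".
    String plan(String[] names, int i, int j) {
        if (i == j) return names[i];
        int k = S[i][j];
        return "(" + plan(names, i, k) + " . " + plan(names, k + 1, j) + ")";
    }

    public static void main(String[] args) {
        long[] sizes = {3_000_000, 5_000, 150_000, 150_000};   // M, D1, D2, D3
        String[] names = {"M", "D1", "D2", "D3"};
        PathContraction pc = new PathContraction(sizes);
        System.out.println(pc.plan(names, 0, sizes.length - 1)
                           + "  cost=" + pc.C[0][sizes.length - 1]);
    }
}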
5.4.2 Intersection Queries

Given an intersect query I(𝒜k) and its access tree 𝒯Q, a straightforward method, which we treat as a baseline, is to first use TC to perform Checkout(𝒜k), followed by the intersection. This approach, however, only considers the associativity of patch and the sharing of sub-expressions in order to find a good evaluation order. We now develop a set of transformation rules on the access tree that allow us to compute partial intersection results using only the deltas. Since a delta between two datafiles already captures a notion of difference between them, we leverage this information and avoid redundant computation while finding the intersection.

The transformation rules are based on identifying two simple structures in the access tree 𝒯Q, called line and star (Figure 5.3). In each figure, we use boxes to denote datafiles in 𝒜k and circles to denote other datafiles. Also, if a box or circle is filled, it denotes a materialized datafile.

Figure 5.3: (a) A line of two or more datafiles; (b) a line when the materialized datafile M is not a part of the query input; (c) a star.

Line Access Trees: Consider the query I(A1, A2) with the datafiles arranged as in Figure 5.3(a). Here, A1 is the materialized datafile while A2 is stored as a delta from A1. It is easy to see that: R = A1 ∩ A2 = A1 − Δ1−. In general, for the query I(𝒜k) with the datafiles arranged as in Figure 5.3(a), the result R is computed as,

R = I(𝒜k) = A1 − (Δ1− ∪ ⋯ ∪ Δk−1−).   (5.4.1)

Note that the above equality does not hold if there are other datafiles in 𝒯Q, even if 𝒯Q is a line. We use this equality to introduce our first transformation rule, which "reduces" the deltas in a line structure to a single delta that gives the result for the intersect query. Conceptually, this reduced delta acts as a delta between two datafiles: the same materialized datafile as in the line and a (new) datafile representing the intersection result.

T1: If Δ1, …, Δk−1 are the deltas in the line, then the reduced delta Δl for the intersect query is composed as,

Δl− = Δ1− ∪ Δ2− ∪ ⋯ ∪ Δk−1−;  Δl+ = ∅.

This transformation rule significantly reduces the amount of data that needs to be subsequently processed. To handle the case when the materialized datafile is not a part of the line, as in Figure 5.3(b), we use a two-step approach. First, treating A1 as if it were the materialized datafile, we use rule T1 to compute the reduced delta Δl. Second, we contract Δ and Δl, since they share the datafile A1. The result is computed as R = M ⊕ Δ ⊕ Δl.

Next, we discuss how to evaluate eq. (5.4.1). Consider the identity X − (Y ∪ Z) = (X − Y) − Z for three sets X, Y, Z. If |Y|, |Z| ≪ |X|, observe that X − (Y ∪ Z) will often have less cost than (X − Y) − Z. Intuitively, if we do the set difference first, then |X − Y| will be comparable to |X| and will end up being scanned again. Specifically, under the cost model stated in Section 5.3.4, when |X| > 3 max(|Y|, |Z|), performing X − (Y ∪ Z) will result in a reduced cost. We therefore use the following greedy heuristic when evaluating eq. (5.4.1).

H1: Let 𝒟 = {Δ1−, …, Δk−1−} and R = M. We iteratively perform the following until 𝒟 is empty: let Δ′ be the largest-size delta in 𝒟. If |R| > 3|Δ′|, we replace the largest two deltas in 𝒟 by their union; else, we set R = R − Δ′ and remove Δ′ from 𝒟.

Star Access Trees: Consider the query I(A1, A2) with the datafiles arranged as in Figure 5.3(c).
Here, M is the materialized datafile and A1 and A2 are stored as deltas from M. We have that:

R = A1 ∩ A2 = (M − (Δ1− ∪ Δ2−)) ∪ (Δ1+ ∩ Δ2+).

To see why, recall that Δi− indicates the set of records to be removed from M to get Ai. Hence, no record in Δi− can be a part of the intersection result. Additionally, new records (that do not exist in M) can be added only if they belong to all of the Δi+. In general, for the query I(𝒜k) with the datafiles arranged as in Figure 5.3(c), the result R is computed as,

R = I(𝒜k) = (M − (∪_{i=1}^{k} Δi−)) ∪ (∩_{i=1}^{k} Δi+).   (5.4.2)

The result R is written in terms of the materialized datafile M. This leads us to our second transformation rule, which "reduces" the deltas in a star structure to a single delta that gives the result for the intersect query. Conceptually, this reduced delta acts as a delta between M and the intersection result.

T2: If Δ1, …, Δk are the deltas in the star, then the reduced delta Δs for the intersect query is composed as,

Δs− = ∪_{i=1}^{k} Δi−;  Δs+ = ∩_{i=1}^{k} Δi+.

We use H1 to evaluate Equation (5.4.2). Since none of the Δi+ can help reduce intermediate result sizes, the intersection of the Δi+'s can be done independently. Finally, we also make the following observation.

Observation 4 Δl and Δs are consistent.

Arbitrary Access Trees: We develop an algorithm, called Contract and Reduce (C&R), that puts the above two techniques together for arbitrary access trees. With minor modifications, the same algorithm can be used for other types of queries, and hence we describe its general form. The pseudocode for C&R is shown in Algorithm 6.

Starting with a query Q ∈ {I, U, Tt} and its access tree 𝒯Q as inputs, C&R iteratively evaluates partial delta expressions, effectively reducing the size of 𝒯Q. Each iteration of the algorithm has two phases: a contract phase and a reduce phase. In the contract phase, we identify all maximal continuous delta paths: a path where all nodes, except the start and end nodes, have exactly 2 neighbors, and none of the intermediate nodes is a part of 𝒜k. Each path should be of length > 2 and be the longest possible. Every such path is then contracted to a single delta using PC. Specifically, if Δ1, …, Δu is the sequence of deltas on the path between two nodes Ax and Ay in 𝒯Q, we use PC to find the best order to evaluate Δ = Δ1 ⊕ ⋯ ⊕ Δu, execute the operations, and replace the sequence by the delta Δ between Ax and Ay. In the reduce phase, we find all lines and stars in 𝒯Q and reduce them according to the appropriate transformation rules: T1/T3 for lines and T2/T4/T5 for stars. Each transformation takes as input two or more deltas, either in a line or star configuration, and replaces them by a single delta. Note that if all paths are contracted and the number of deltas in 𝒯Q is more than 1, there will be at least one reduction to perform.

The algorithm ends when there is only one delta remaining in 𝒯Q. At this point, we simply apply the delta to the materialized datafile in 𝒯Q and return the result. We illustrate the behaviour of the algorithm with the help of an example.

Example 12 Consider the query I(A6, A8, A9, A12) with the access tree as in Figure 5.4(a). In the first iteration, during the contract phase, we compute the deltas: (1) Δc1 = Δ1 ⊕ Δ3 ⊕ Δ6, (2) Δc2 = Δ4 ⊕ Δ7, and (3) Δc3 = Δ8 ⊕ Δ9 (Figure 5.4(b)). In the reduce phase, we reduce A6 and A12, which are arranged in a line (Figure 5.4(c)). This reduction puts A9 and Q(A6, A12) in a star, which is then reduced as in Figure 5.4(d).
In the next iteration, during the contract phase, we compute Δc4 = Δ2 ⊕ Δs1 (Figure 5.4(e)). In the reduce phase, we reduce the star formed by A8 and Q(A6, A9, A12), which leaves the access tree with a single delta.

Figure 5.4: Access tree during the progress of C&R.

Algorithm 6: Contract and Reduce (C&R)
Data: Query Q ∈ {I, U, Tt} and access tree 𝒯Q
1 while 𝒯Q contains more than one delta do
2   Contract Phase:
3     P ← list of maximal continuous delta paths of length > 2 in 𝒯Q
4     Contract all paths in P using PC and update 𝒯Q
5     if 𝒯Q becomes a line/star then
6       return R ← result using H1
7   Reduce Phase:
8     L ← list of line structures in 𝒯Q
9     Reduce each line in L based on Q, i.e., apply one of T1/T3, and update 𝒯Q
10    S ← list of star structures in 𝒯Q
11    Reduce each star in S based on Q, i.e., apply one of T2/T4/T5, and update 𝒯Q
   /* At this point 𝒯Q contains one delta. */
12 return R ← Root(𝒯Q) ⊕ Δ

5.4.3 Union

In this section, we give transformation rules for the line and star structures for the query U(𝒜k). We can then use C&R with these rules to evaluate arbitrary access tree structures.

Line: Consider the query U(𝒜k) with the datafiles arranged as in Figure 5.3(a). Then: R = A1 ∪ (∪_{i=1}^{k−1} Δi+). The transformation rule for a line can therefore be stated as,

T3: If Δ1, …, Δk−1 are the deltas in the line, then the reduced delta Δl for the union query is composed as,

Δl− = ∅;  Δl+ = ∪_{i=1}^{k−1} Δi+.

Star: If the datafiles are arranged as shown in Figure 5.3(c), we have that:

R = U(𝒜k) = (M − (∩_{i=1}^{k} Δi−)) ∪ (∪_{i=1}^{k} Δi+).

To see why, note that since Δi− indicates the set of records to be removed from M to get Ai, if a record is absent in the union, it must have been present in all Δi−. New records that are added in any Δi+ are a part of the union result. Then:

T4: If Δ1, …, Δk are the deltas in the star, then the reduced delta Δs for the union query is composed as,

Δs− = ∩_{i=1}^{k} Δi−;  Δs+ = ∪_{i=1}^{k} Δi+.

We conclude this section by mentioning that, similar to the intersection case, Δl and Δs for the union query are consistent.

5.4.4 t-Threshold

In order to evaluate a t-threshold query Tt(𝒜k), we make use of multiset-backed deltas during intermediate query execution, instead of the set-backed deltas that we have been using so far. This introduces two important issues during the execution of C&R that are the main focus of this section. First, we need to re-define the semantics of delta contraction in this new setting. Second, a line cannot be reduced in a straightforward manner as before. We begin with some definitions, describe the transformation rule for a star, and then discuss each of the two issues in detail.

A multiset, unlike a set, allows multiple instances of any of its elements. We represent a multiset as A = {(r, c) : c ∈ N≥1}, where r is an element, N≥1 is the set of natural numbers, and c is referred to as the multiplicity of r. A set is a multiset with all multiplicities equal to 1. Consider the following two operations concerning multisets.

Definition 4 (Multiset Union) Multiset union, denoted by A = A1 ⊎ A2, returns a multiset containing the elements that occur either in A1 or in A2, where A1 and A2 are either multisets or sets. The multiplicity of an element in A is the sum of its multiplicities in A1 and A2.

Definition 5 (Multiset Restrict) If A is a multiset, A_{c≤p} is the set of elements in A with multiplicity at most p.
Similarly, A_{c≥p} is the set of elements with multiplicity at least p.

We now describe how to evaluate Tt(𝒜k) when 𝒯Q is a star.

Star: Consider the query R = T3(𝒜4), i.e., find all records that appear in at least 3 of {A1, A2, A3, A4}, when they are arranged in a star (Figure 5.3(c)). Suppose the delta Δ is such that Δ− = Δ1− ⊎ Δ2− ⊎ Δ3− ⊎ Δ4− and Δ+ = Δ1+ ⊎ Δ2+ ⊎ Δ3+ ⊎ Δ4+. Here, Δ− and Δ+ are multisets, and Δ is a multiset-backed delta. Then:

R = T3(𝒜4) = (M − Δ−_{c≥2}) ∪ Δ+_{c≥3}.

To understand why, consider a record (r, c) ∈ Δ−. This indicates that r ∈ M and c of the 4 deltas, {Δ1−, Δ2−, Δ3−, Δ4−}, ask to delete r. As long as c ≥ 2, r will be absent from at least 2 of {A1, A2, A3, A4}, and hence cannot appear in at least 3 of them. Similarly, consider a record (s, c) ∈ Δ+. This indicates that s ∉ M and c of the 4 deltas, {Δ1+, Δ2+, Δ3+, Δ4+}, ask to add s. As long as c ≥ 3, s will be present in at least 3 of {A1, A2, A3, A4}. More generally, the transformation rule for a star can be stated as below.

T5: If Δ1, …, Δk are the deltas in the star, then the reduced delta Δs for the t-threshold query is composed as,

Δs− = ⊎_{i=1}^{k} Δi−;  Δs+ = ⊎_{i=1}^{k} Δi+.

We note the following: (i) With every multiset-backed delta, we keep an integer value 2 ≤ p(Δ) ≤ k which indicates the number of datafiles in 𝒜k that are reduced by this delta. For instance, p(Δ) = 4 in the previous example. In general, the access tree will have several disjoint stars (or lines), which are reduced at different times in the evaluation process. We discuss how p(Δ) is used shortly. (ii) The deltas Δs−_{c≥k−t+1} and Δs+_{c≥t} are consistent.

We now describe the semantics of the patch operation in the presence of multiset deltas.

Delta Contraction: Algorithm 7 computes the result of Δ = Δx ⊕ Δy, where Δy is a multiset delta. Note that, due to the nature of C&R, only the last delta in any sequence of deltas can be a multiset delta. The result delta Δ is also a multiset delta. The main idea here is to preserve the semantics of two values: (i) the multiplicity c of a record r, and (ii) the number of datafiles that are reduced due to the delta, p(Δ). Since this is a patch operation, we can simply set p(Δ) ← p(Δy). Recall that a record (r′, c′) ∈ Δy− indicates that c′ of the p(Δy) deltas ask us to remove r′. Now consider a record r ∈ Δx+. If r ∉ Δy−, then we can add (r, p(Δy)) to Δ+. However, if r ∈ Δy−, then we need to "fix" the multiplicity of r, i.e., add (r, p(Δy) − c) to Δ+, where c is the multiplicity of r in Δy−. The other case is similar.

Algorithm 7: Patch operation for multiset-based deltas
Data: Set-backed delta Δx, and multiset-based delta Δy
Result: Multiset delta Δ = Δx ⊕ Δy
1 Initialize Δ ← Δy, p ← p(Δ) ← p(Δy)
2 for r ∈ Δx+ do
3   if r ∈ Δ− then
4     Remove (r, c) from Δ−; Add (r, p − c) to Δ+
5   else
6     Add (r, p) to Δ+
7 for r ∈ Δx− do
8   if r ∈ Δ+ then
9     Remove (r, c) from Δ+; Add (r, p − c) to Δ−
10  else
11    Add (r, p) to Δ−
12 return Δ

We use Algorithm 7 in place of the patch operator defined in Section 5.3.4 when one of the operands is a multiset delta. Its estimated cost is modeled as C⊕(Δx, Δy) = |Δx| + 2|Δy|, and we can use PC during the contract phase as before.

Line: Consider the query R = T2(𝒜4), with the datafiles A1, A2, A3, A4 arranged in a line (A1 materialized, and Δi the delta between Ai and Ai+1). Consider a record r such that r ∈ Δ1+ and r ∈ Δ3−. Although r is present in two opposite deltas, r ∈ R. More generally, in the case of t-threshold queries, simply knowing whether a record is in Δi− or Δi+ is not sufficient to conclude if it is present in the result.
We also require knowledge of the "position" of the delta containing the record on the line. Alternately, we can reduce the line by considering deltas in right-to-left order, using the following simple modification to Algorithm 7. Suppose that we know how to contract Δ2 ⊕ Δ3 to obtain a multiset delta Δy. We show how to modify Algorithm 7 to compute Δ = Δ1 ⊕ Δy. The central idea is again to set the record multiplicities and p(Δ) correctly. Note that p(Δy) = 2, as it reduces A3 and A4. Since Δ is also meant to reduce A2, we set p(Δ) = p(Δy) + 1 (line 1). Consider a record r ∈ Δ1+. If r ∉ Δy−, then we can add (r, p(Δy) + 1) to Δ+ (line 6). On the other hand, if r ∈ Δy−, we add (r, p(Δy) − c + 1) to Δ+ (line 4), where c is the multiplicity of r in Δy−. The other case is similar.

Figure 5.5: Effect of varying #Δ when |Δ| = 5% (median response time in milliseconds vs. number of deltas for LR, Greedy, and PC).

Figure 5.6: Effect of varying |A| (response time vs. average datafile size for LR, Greedy, and PC).

5.5 Experimental Evaluation

In this section, we present a comprehensive evaluation of our DEX prototype. The key takeaway from our study is that pushing down computation to the deltas can lead to significant savings, an order of magnitude in many cases. Surprisingly, even for a single datafile checkout, we see large benefits in the computational time. We also show, through an illustrative experiment (Section 5.5), that using auxiliary data structures like bitmaps can increase the benefits many-fold, indicating that this is a rich direction for future work.

Figure 5.7: Effect of varying #Δ when |Δ| = 5% (median checkout time in seconds vs. number of deltas for Naïve, PC, Greedy, and TC).

All experiments were conducted on a single machine with an Intel Core i7-4790 CPU (3.60 GHz, 8MB L3 cache), 32GB of memory, running Ubuntu 16.04 and the OpenJDK 64-bit server JVM (ver. 1.8.0_111). Our choice to write the query processor in Java was primarily based on getting quick development time while still being reasonably performant on large datasets. While using a low-level language (e.g., C) would reduce the absolute query execution times, it would not change our primary objective, which is to measure the relative speedup of our techniques compared to the baseline. All time measurements are recorded as wall-clock time. Unless otherwise stated, to measure response time, we run each query 10 times and consider the median. To account for the adaptive performance of some of the set operations, we repeat the above on 25 datasets with identical properties (described next) and report the median. As discussed in Section 5.3.4, our computations are CPU bound, and we did not find an appreciable difference in warm cache vs. cold cache settings; for consistency, we report results for a warm cache setting.

Datasets: Lacking access to real-world versioned datasets with sufficient and varied structure, we instead developed a synthetic data generator to generate datasets with very different characteristics for a wide variety of parameter values. This enables us to carefully study the performance of our techniques in various settings. Formally, every experiment setting is characterized by a 4-tuple, ⟨T, |A|, |Δ|, #Δ⟩, where |A| and |Δ| refer to the average number of records in a datafile and the average size of the deltas in the dataset (as a percentage of |A|).
T denotes the shape of the access tree that is used, and is one of: line-shaped (l), star-shaped (s), and line-and-star (ls); and #Δ refers to the number of deltas in the access tree. All records are 64-byte randomly generated strings.

  Parameter   Explanation              Values
  k           Query size               2, 4, 6, 8, 10
  |A|         Average datafile size    1 million (M), 2M, 3M, 4M, 5M
  |Δ|         Average delta size       1%, 2%, 3%, 4%, 5%
  #Δ          Number of deltas         10, 25, 50, 75, 100
  T           Shape of access tree     Line (l), Star (s), Line-and-star (ls)
Table 5.1: Possible values of parameters characterizing a synthetic dataset

Single datafile Checkout: We begin by evaluating the performance of PC, i.e., Algorithm 4, against two heuristics for the case of single datafile checkout. Figure 5.5 shows the median response time of this analysis (in milliseconds) on the vertical axis, and the horizontal axis is the number of deltas (#Δ) in the expression. The other parameters of the dataset are fixed at ⟨T = l, |A| = 3M, |Δ| = 5%⟩.

The LR heuristic simply evaluates the delta expression from left to right, starting with the materialized datafile. This is the standard heuristic used in prior delta-based storage engines, like git. On the other hand, the Greedy heuristic iteratively patches the two operands having the least estimated cost.

Figure 5.8: Access tree shapes; (a) Line, (b) Star, (c) Line-and-star.

We observe that in each instance, PC performs better than Greedy, which performs better than LR. Specifically, we note up to a 7.0-8.8X improvement in median response times when comparing PC with LR, and up to a 14% improvement when comparing with Greedy. The performance gap between LR and the other methods also increases slightly as the number of deltas goes up. This is because the left input of every patch operation in LR has a large size, in contrast to both Greedy and PC, which "balance" their inputs in a cost-based manner. Also, because we assume that every record in a datafile is equally likely to be modified and there is no set of "hot" records, i.e., records that are modified often, we observe that the intermediate result sizes continue to grow in Greedy and PC as well. We observe similar trends for other delta sizes and omit their results.

Next, we study the effect of varying the average datafile size on the response times. Figure 5.6 shows the result of this study on the dataset ⟨T = l, |Δ| = 1%, #Δ = 100⟩ when |A| is varied from 1 million records to 5 million records. In this case, we observe an 8.9-10.5X speedup when compared to LR, with the Greedy solution being approximately close to PC. Finally, although PC has cubic time complexity in the number of deltas, the solutions it finds are, in all cases, better than the alternatives even after taking optimization time into consideration. When #Δ = 100, the average time to find the optimal solution was 1.2ms.

Multiple datafile Checkout: We now evaluate the time taken to checkout k = 8 datafiles on the dataset ⟨T = ls, |Δ| = 5%, |A| = 1M⟩. We evaluate TC, i.e., Algorithm 5, by comparison against three approaches. The Naive approach simply performs a checkout of each datafile independently using LR. The second approach uses PC to checkout individual datafiles. Neither of these approaches takes into account the sharing of intermediate results. The third approach, called Greedy, shares the results of the largest sub-expressions as much as it can (e.g., for two datafiles, the result of the expression from the root of the access tree to their lowest common ancestor is always shared).
Figure 5.7 reports the median checkout time (in seconds) as the number of deltas (#Δ) in the access tree is varied. We observe that, overall, Greedy and TC have similar response times, and TC performs slightly better than Greedy in each case (between a 7.2-10.8% improvement). Also, when compared to Naive, we observe a 5.1-6.8X improvement in median response time. The average optimization time when #Δ = 300 was 18.4ms.

Intersect: In the following set of experiments, we compare the running time of evaluating I(𝒜k) using two algorithms. The baseline approach simply performs a checkout of all the datafiles in 𝒜k using TC, followed by their intersection. The second approach measures the performance of C&R, i.e., Algorithm 6.

Figure 5.9: Effect of access tree structure when |Δ| = 1%, #Δ = 100 (response time vs. query size k for (a) T=s, (b) T=l, (c) T=ls; the number on top of each bar is the speedup obtained).

Figure 5.10: Effect of query size when |Δ| = 1% (response time vs. number of deltas for (a) k=4 and (b) k=8).

Figure 5.11: Intersect – Effect of |A| (speedup vs. query size k).

Figure 5.12: Intersect – Effect of |Δ| (speedup vs. query size k).

Figure 5.13: Union – Effect of |Δ| (speedup vs. query size k).

Figure 5.14: t-Threshold – Effect of |Δ| (speedup vs. query size and threshold (k, t)).

Because C&R makes decisions based on the shape of the access tree, we first study the effect of varying the shape of the access tree on intersect performance. Figure 5.9 shows the median response time against the query size for the three types of access trees: line, star, and line-and-star. The other parameters of the dataset are ⟨|A| = 3M, |Δ| = 1%, #Δ = 100⟩. The numbers on top of each bar indicate the speedup obtained. We note speedups of up to 12X when using C&R. The speedup obtained for T = l is smaller than the others primarily due to the shape of the access tree: in a line, the smallest path between the root and a query datafile cannot be reduced using any of the transformation rules and must be contracted using PC.

In the next experiment, reported in Figure 5.10, we study the effect of varying the number of deltas in the access tree of I(𝒜k). Here, we use the dataset ⟨T = ls, |A| = 3M, |Δ| = 1%⟩ and vary #Δ; we report the results for k = 4, 8. As we can see, our techniques are particularly effective, giving a speedup of up to 16X and 25X for k = 4 and k = 8, respectively. The speedup decreases as the number of deltas increases, primarily due to larger intermediate delta sizes.

Figure 5.11 shows the speedup obtained when the average datafile size is varied between 1M and 5M records; the other dataset parameters are ⟨T = ls, |Δ| = 1%, #Δ = 50⟩. We observe that our techniques show significant benefit, obtaining up to a 17X speedup. Further, we note that datafile size does not affect C&R to a large degree. Figure 5.12 reports the speedup obtained when the average delta size in the dataset, |Δ|, is varied between 1% and 5% of the average datafile size; the other dataset parameters are ⟨T = ls, |A| = 3M, #Δ = 50⟩. We note a speedup of 2.8-16X when |Δ| = 1% that decreases gradually to 2-6X when |Δ| = 5%.
This confirms our hypothesis that if the deltas between the datafiles are small, significant improvements can be obtained by using the deltas in query execution in a more direct manner. When the deltas get large, the intermediate result sizes grow too, which results in a reduced speedup.

Union: The results for U(𝒜k) are similar to the intersection case, although with smaller speedup values. We report one such result in Figure 5.13: the effect of varying the query size (k) for datasets with different average delta sizes |Δ|. The other parameters of the dataset are ⟨T = ls, #Δ = 50, |A| = 3M⟩. We note a speedup of 1.6-8.6X when |Δ| = 1% that decreases gradually to 1.5-4.1X when |Δ| = 5%.

t-Threshold: We use the adaptive algorithm of [104] as a baseline for our t-threshold experiments. Similar to adaptive set intersection, this algorithm uses gallop search in order to find the position of an element r in a set. Moreover, it maintains a min-heap of size k − t + 1, containing at most one element per set, in order to select a "good" element to probe the other sets during each iteration. Figure 5.14 reports the effects of varying (k, t) across datasets with different delta sizes. The other parameters of the dataset are ⟨T = ls, #Δ = 50, |A| = 3M⟩. We observe a speedup of 3.5-5X when |Δ| = 1% that gradually reduces as the delta size increases. When |Δ| = 5%, we report a speedup of 2.1-3.1X. The overall speedup in this case is less than that obtained for the intersection or union queries because, unlike those two, the size of the intermediate results does not decrease when transforming lines and stars.

Experiments with Bitmap Deltas: We have also built support for a filtered index to answer intersection and union queries, and we show the results for an illustrative experiment. Akin to a relational database, a filtered index in DEX is suited to answer queries that always select from a finite "universal" set of records. In this case, we can encode a set of records using a bitmap, where the order of records is determined by their SHA1 value. The index creation step creates a bitmap of size equal to the universal set for each materialized datafile and two bitmaps for each delta in the storage graph. We can then use the bitwise AND (∧), OR (∨) and NOT (¬) operations to compute set intersection, union and difference. In this experiment, we use a compressed bitmap library called roaring bitmaps [105].

Figure 5.15 shows the effect of index size on the intersect query. Here, we measure the speedup vs. query size for index sizes ranging from 500K to 3M records. As expected, for small universal sets, we get largely improved speedup ratios (up to 1200X). With large universe sizes, there is, however, a penalty incurred when selecting the records themselves given the bitmap information.

Figure 5.15: Effect of bitmap size (speedup vs. query size k for universe sizes 0.5M-3M).

5.5.1 Comparisons with Temporal Indexing

In this section, we present a comparison of our approach with the temporal indexing techniques of Buneman et al. [24], and discuss how we reimplemented and compared against their approach. Buneman et al. [24] proposed an archiving technique based on identifying changes to (keyed) records across versions, specifically temporal versions of hierarchical data, that are then merged into one hierarchy represented in XML format. Because they also compared against a diff-based storage solution, we present a brief comparison to highlight the respective strengths and weaknesses of the two strategies.
Broadly stated, their scheme, henceforth referred to as BA, merges all hierarchical elements across versions into one hierarchy by identifying an element by its key and storing it only once, along with the sequence of version timestamps where the respective element appears. Answering a checkout query thus requires scanning the entire archive, and using the intervals to decide which elements belong to the answer. We reimplemented their technique in our framework, using either sorted lists or bitmaps to store the sequence of version ids where an element appears. We describe this next.

We represent a dataset of records as a one-level hierarchical document with all the records as children of the root node. When merging two datasets into a single archive, we identify the common records and only store them once. Due to the nonlinear nature of "version ids" (unlike timestamps) in our problem setting, we tried two different implementations to keep the set of version ids for an element/record: (1) a sorted list, or (2) a bitmap. In the sorted list implementation, the version ids associated with every record are stored in a sorted array, and during retrieval we use binary search to decide if the record is present in the desired version. In the bitmap implementation, a bitmap of size equal to the number of versions in the archive is used with each record to indicate the versions that the record is present in. We use the roaring bitmap library [105] to store these bitmaps. During retrieval, a simple scan through the archive can retrieve any version. We note that it is not clear how to extend some of the optimizations in Buneman et al., most notably "timestamp trees", which depend on the linearity of timestamps, to the nonlinear nature of version ids in a decentralized versioning/data lake scenario.

Buneman et al. compared the performance of their archiver against two approaches based on deltas: (i) "cumulative diff", where every version is stored as a delta against a common (typically the first) version, and (ii) "incremental diff", or "sequence-of-deltas", where every version is stored as a delta against the previous version, resulting in a line storage graph. However, cumulative diff had a large space overhead [24], and incremental diff results in large checkout times due to long delta chains.

We consider single (k = 1) and multiple (k = 4) datafile checkout on the dataset ⟨|Δ| = 5%, |A| = 1M⟩. Additionally, for BA, we vary the number of datafiles (N) in the archive as N = 10, 50, 100, 250, 500, 1000. The bitmap implementation gives superior performance (up to 19%) from N = 100 onwards, and we use that to report checkout time, while for N = 10, 50, we use the sorted list implementation. As noted previously, we can pack approximately N = 80K datafiles in a storage graph (with certain constraints) and get a delta chain of size at most 10 to checkout any single datafile; therefore, we set #Δ = 10, 25 for a fair comparison.

Table 5.2: Median checkout time (ms) in DEX and BA
           DEX (#Δ)         BA (archive size N)
           10      25       10     50     100    250    500    1000
  k = 1    125     550      97     388    659    1286   2677   5362
  k = 4    263     483      144    596    1110   2484   5160   9550

As we can see, BA performs better than a sequence-of-deltas approach for checkout queries. When k = 1, storing N = 50 datafiles gives better response times in BA than storing N = 25 datafiles in a sequence-of-deltas approach (demonstrated by k = 1, #Δ = 25).
However, checkout times for BA increase rapidly as the archive size grows, and DEX is vastly superior to BA under more reasonable assumptions about the storage graph (in the context of a versioning/data lake scenario, it is not clear how to extend some of the optimizations in [24] that depend on the linearity of timestamps).

In short, the main difference between DEX and the sequence-of-deltas approach (which [24] primarily compared against) is that we assume that the storage graph is constructed using a technique that avoids very long delta chains (e.g., the methods presented in Chapter 3, the "skip links"-based approach [65], techniques that balance storage and retrieval costs [106], the greedy heuristic used by git, etc.). We further note that BA suffers from three major limitations: (i) the entire archive must be read even when checking out a single version, (ii) adding a new version requires an expensive merge operation that scans the entire archive (unlike a delta-oriented storage engine where only a single delta may be added), and (iii) decentralization is much more difficult in BA (in theory one could maintain multiple archives and merge them periodically, but we are not aware of any work that has attempted that).

Chapter 6: Query Execution II: Declarative Queries

6.1 Introduction

The preceding chapter described a query processing architecture for a simple class of queries that compared data from multiple versions. As we showed, even these simple queries exhibit interesting and unexplored computational challenges, and the benefits of optimizing their execution can be tremendous (orders of magnitude in many cases). In this chapter, we take a step further and describe an architecture to execute a much richer class of queries from the VQUEL language described in Chapter 4. Specifically, we consider the query optimization challenges of executing a single-block select-project-join (SPJ) query on multiple versions.

Our motivation for studying this SQL-like API is that tabular data formats (CSV/TSV) are extremely common in data science workflows. Oftentimes, analyzing and inspecting past dataset versions can involve executing rich diff queries to understand how subsets of data have changed across versions. For example, a user might need to slice a few columns from a CSV file present in multiple past versions, as a result of updates from different sources, and only consider those records that appear in a majority of the versions. She may also want to do the same for a dataset that is logically a "join" of the records from two or more CSV files. Currently, users either use a combination of *nix utilities like sort, uniq, cut, etc., to accomplish these tasks, or they may use a number of open source tools such as Miller [107], csvkit [108], textql [109], etc., that provide a SQL-like API to quickly wrangle and work with table-like data formats. However, these ad hoc approaches suffer from the same limitations as discussed before: simply checking out the files in every version and evaluating the user-specified query is prone to a lot of wasted work, especially because of the overlap between the versions.

The key idea behind processing such rich queries in DEX is that we execute the query plan exactly once, regardless of the number of versions. Each record/tuple that is processed in the query plan is actually a data structure called a v-tuple.
Apart from storing data about the different columns required by the query, a v-tuple also keeps track of a version list, or version bitmap, of the versions (among the subset of the queried versions) that the tuple appears in. The potential performance benefit of this approach is that the physical query processing operators can operate in batch across the versions that are encoded with every v-tuple. For example, if a query is to be run on 50 versions, and if a column c has value 100, then a selection operator corresponding to c = 100 can filter out tuples across all 50 versions in one operation, possibly resulting in a 50-fold reduction in the number of tuples that have to be moved across the query plan.

This chapter presents the design of a query processing engine that we have built for executing SPJ queries over multiple versions. In summary, our contributions are as follows.

1. We propose a new framework to execute single-block select-project-join (SPJ) queries over multiple versions stored in a delta-based system. The key design decision of our approach is to execute the various query operators once for each unique tuple in the set of versions, rather than executing the query once for each version.

2. We propose using a new representation for tuples, called a v-tuple, during query processing. In addition to the regular fields of a tuple, a v-tuple also has information about which versions (among the set of queried versions) the respective tuple appears in. We develop algorithms to efficiently create such v-tuples from the delta representation and to join/aggregate them.

3. We implement our techniques in DEX using Apache Calcite, which is an open source, highly customizable engine for parsing and planning queries on data in a variety of formats. Specifically, we add new operators to Calcite for retrieving data (Scan) and processing data (Filter/Join).

4. We extensively evaluate the performance of our methods on multiple synthetic datasets. Our results show that DEX makes richer query processing viable in a dataset version control system, with significant benefits over version-at-a-time execution.

6.2 System Design

We review a few terms before describing our extensions to DEX to enable declarative query processing. Conceptually, the execution engine evaluates a query Q over a number of existing versions, say k, and outputs the result with each record being annotated with the respective version(s) it appears in. Recall that a datafile is a file whose contents are interpreted as a set of records. A version is a point-in-time snapshot of one or more datafiles, typically residing in a directory on the user's file system. A version is identified by a unique id, is immutable, and can be created by any user who has access to the repository. A version graph captures the version-level provenance that includes the derivation and transformation relationships, and metadata about the versions themselves. Nodes in a version graph correspond to versions, and edges capture relationships such as derivation, branching, transformation, etc., between two versions. Both nodes and edges have metadata that can be used to allow writing rich queries over the entire repository. In this chapter, however, we only make use of the node metadata to identify the relevant versions during query processing.
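The following minimal Java sketch illustrates the v-tuple idea described above: each distinct record carries a bitmap of the queried versions it appears in, so a selection predicate is evaluated once per distinct record rather than once per version. The class and field names are illustrative assumptions, not the actual DEX classes.

import java.util.*;

public class VTupleDemo {
    static class VTuple {
        final Object[] fields;      // values of the columns needed by the query
        final BitSet versions;      // bit i set => tuple present in queried version i
        VTuple(Object[] fields, BitSet versions) {
            this.fields = fields;
            this.versions = versions;
        }
        public String toString() { return Arrays.toString(fields) + " " + versions; }
    }

    // Filter: evaluate the predicate once per v-tuple; survivors keep their whole bitmap.
    static List<VTuple> filter(List<VTuple> in, java.util.function.Predicate<Object[]> p) {
        List<VTuple> out = new ArrayList<>();
        for (VTuple t : in) if (p.test(t.fields)) out.add(t);
        return out;
    }

    public static void main(String[] args) {
        BitSet all50 = new BitSet();
        all50.set(0, 50);                       // present in all 50 queried versions
        BitSet v0 = new BitSet();
        v0.set(0);                              // present only in the first queried version
        List<VTuple> scan = List.of(
            new VTuple(new Object[]{1, 100, "a"}, v0),
            new VTuple(new Object[]{4, 25, "d"}, all50));
        // One evaluation of "c = 100" removes the second tuple from all 50 versions at once.
        System.out.println(filter(scan, f -> (int) f[1] == 100));
    }
}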
6.2.1 Schema Specification

Since our focus is on executing queries on table-like data, when a datafile is added to the system, we also require the user to submit a schema file that specifies how to parse the text-based format into rows and columns. By default, each line is processed as an array of columns, with all values being of type string. The schema file can specify the data type for individual columns. Finally, the schema file must also designate a column, or a group of columns, as the primary key, which must contain a unique value that identifies each and every row of the datafile. At commit time (both the initial commit and any subsequent commits after updates), the datafile is converted into the Apache Parquet format for physical storage, as described in Section 6.2.3. Additionally, a Parquet file can also be given as an input at commit time. We require such rich schema information regarding the datafiles primarily to support column deltas, as discussed next.

6.2.2 Delta format

As noted before, the choice of the delta format has significant implications on what operations/queries can be run on it. When deciding on a delta format between two versions of a datafile, we had the following considerations:

1. the format should be compact,
2. creating and using the delta during query processing should be efficient, and
3. we should be able to read and work with only the parts of the delta that are essential.

Since every record is identified by a primary key, we can meet all of the above criteria by capturing changes to individual record fields, as outlined next. Suppose R1 and R2 are two versions of a datafile R, with every record having the schema ⟨k, c1, c2, …, ck⟩, where k is the primary key of R and c1, c2, …, ck are k columns. The delta (Δ) between them contains records of the form ⟨k, [c11, c12], [c21, c22], …, [ck1, ck2]⟩, where k is the primary key of a record that is different in one or more columns in the two versions, and every ci1 or ci2 is the respective value in R1 or R2 in a column where the record differs, or the special value "⟂" if the column values are identical. We record the value of the column in both versions, and therefore this is an undirected delta format. Figure 6.1 gives an example of this format. R1, R2, R3 are three versions of a datafile R, Δ1 is the delta between R1 and R2, and Δ2 is the delta between R2 and R3. Consider the record with primary key 1 in R1 and R2, i.e., R1 : ⟨1, 100, a⟩ and R2 : ⟨1, 100, b⟩. In Δ1, the delta record for this primary key is therefore Δ1 : ⟨1, [⟂,⟂], [a, b]⟩. Similarly, we have delta records for the primary keys 2, 3, 5, but not for 4, as that record is identical in the two versions. Note that for every record in the delta, there is at least one column where the values are not [⟂,⟂].

Next, we discuss how we compute the deltas between two versions of the same datafile. The differencing procedure works by finding records in the source and target datafiles that are logical "pairs". The schema-defined primary key is used to pair up records in the source and target datafiles. For a record pair, any difference in the content is output as a new record in the delta, having the schema described above. Such pairs are considered updates. Records in the source datafile that could not be paired are considered deletes, and we output one delta record for each such source record. Similarly, records in the target datafile that could not be paired are output as inserts.
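The following Java sketch illustrates the differencing procedure just described: rows are paired by primary key and, for each pair that differs, a delta record keeps an [oldValue, newValue] pair per changed column and a null (standing in for the ⟂ symbol) per unchanged column. Inserts and deletes (unpaired keys) are omitted for brevity; the class and method names are illustrative assumptions, not the DEX implementation.

import java.util.*;

public class ColumnDiff {
    // A delta record: primary key plus one [old, new] pair (or null) per column.
    record DeltaRecord(String key, String[][] columns) {
        public String toString() { return key + " " + Arrays.deepToString(columns); }
    }

    static List<DeltaRecord> diff(Map<String, String[]> source, Map<String, String[]> target) {
        List<DeltaRecord> delta = new ArrayList<>();
        for (Map.Entry<String, String[]> e : source.entrySet()) {
            String[] oldRow = e.getValue(), newRow = target.get(e.getKey());
            if (newRow == null) continue;                    // a delete; not shown here
            String[][] cols = new String[oldRow.length][];
            boolean changed = false;
            for (int i = 0; i < oldRow.length; i++) {
                if (!Objects.equals(oldRow[i], newRow[i])) {
                    cols[i] = new String[]{oldRow[i], newRow[i]};
                    changed = true;
                }                                            // else leave null (= unchanged)
            }
            if (changed) delta.add(new DeltaRecord(e.getKey(), cols));
        }
        return delta;
    }

    public static void main(String[] args) {
        Map<String, String[]> r1 = Map.of("1", new String[]{"100", "a"},
                                          "3", new String[]{"5", "c"},
                                          "4", new String[]{"25", "d"});
        Map<String, String[]> r2 = Map.of("1", new String[]{"100", "b"},
                                          "3", new String[]{"8", "d"},
                                          "4", new String[]{"25", "d"});
        diff(r1, r2).forEach(System.out::println);   // delta records for keys 1 and 3 only
    }
}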
6.2.3 Physical Representation

We now describe how the datafiles and deltas are persisted on disk. Once a datafile is committed in DEX, we parse it according to the schema file supplied and store it in the Apache Parquet format [110]. Apache Parquet is a state-of-the-art, open source columnar file format offering both high compression and high scan efficiency. Parquet is a PAX-like [111] format optimized for large data blocks and is vastly more efficient than text-based formats like CSV during storage and query processing [112]. This format is also suited to store the deltas, because it results in a more compact representation when only a small number of columns are modified across versions.

Figure 6.1: R1, R2, R3 are three versions of a datafile R; k is the primary key column. Three deltas are shown: Δ1 is the delta between R1 and R2, Δ2 is the delta between R2 and R3, and Δ3 is the delta between R1 and R3.
  R1: (1, 100, a), (2, 10, z), (3, 5, c), (4, 25, d), (5, 90, e)
  R2: (1, 100, b), (2, 20, z), (3, 8, d), (4, 25, d), (5, 100, e)
  R3: (1, 100, b), (2, 40, f), (3, 5, d), (4, 25, d), (5, 100, e)
  Δ1: ⟨1, [⟂,⟂], [a,b]⟩, ⟨2, [10,20], [⟂,⟂]⟩, ⟨3, [5,8], [c,d]⟩, ⟨5, [90,100], [⟂,⟂]⟩
  Δ2: ⟨2, [20,40], [z,f]⟩, ⟨3, [8,5], [⟂,⟂]⟩
  Δ3: ⟨1, [⟂,⟂], [a,b]⟩, ⟨2, [10,40], [z,f]⟩, ⟨3, [⟂,⟂], [c,d]⟩, ⟨5, [90,100], [⟂,⟂]⟩

6.2.4 Discussion

Input data may come in the form of CSV/TSV/JSON formats, which are text-based. Currently, the user has to ensure that the schema file is correct and that there are no errors during the conversion process. Automatic schema inference may be added later. In addition, such conversion can also be done in a post-hoc manner. If it is expected that new datafiles may benefit from richer query processing, a separate process can be run to convert the existing binary data to the Parquet format by specifying a schema file.

6.3 Query Execution

In this section, we describe the query processing ideas underlying our prototype implementation. We make heavy use of the terminology and algorithms from Section 5.3 and Section 5.4, and extend the ideas behind delta contraction, and line and star reductions, to be applicable in this setting. Thereafter, we outline the design of the other physical operators needed to execute the rest of the query.

Query execution follows the same two-phase optimization approach that we described in Section 5.3. Suppose we denote the query as Q(𝒜k), where Q denotes the SPJ part of the query, and 𝒜k = {R1, R2, …, Rk} is the set of datafiles required by the query. For simplicity of notation, we assume that all of 𝒜k are versions of the same datafile R. However, in the presence of a join, 𝒜k will contain versions of multiple datafiles, and the Scan step, described in detail in Section 6.3.2, is modified to account for each datafile separately.

The storage graph, 𝒢, described in detail in Section 5.2.3.1, indicates the delta decisions that have been made when storing all versions of a datafile. In the first phase of query execution, we identify all the relevant datafiles and deltas in 𝒢 that are necessary to execute Q(𝒜k). This is the problem of finding an access tree of Q(𝒜k), and we have described this in detail in Section 5.3.2.

In the second phase, we map the logical SPJ query onto a directed acyclic graph (DAG) of the physical operators described in the next few sections. Since in this work we address single-block queries, the mapping from the logical plan to the physical operators is straightforward, and an example is shown in Figure 6.2.

Figure 6.2: (a) Example query on k versions; (b) physical plan to execute the query.
  (a)  select R.a, S.z from R join S on R.b = S.x
       where R.c = "..." and S.y = "..."
       on-versions(V1, V2, ..., Vk)
  (b)  Scan Rk → Filter Rk and Scan Sk → Filter Sk, feeding a Join, followed by a Project.
The primary difference from the perspective of a classical query execution engine is that the physical operators described below are modified to create and process v-tuples instead of tuples. At the lowest step of the DAG, we have the Scan operator (Section 6.3.2), taking as input the respective access trees (for the two datafiles R and S in the query). The output of the Scan operator is a set of v-tuples for each datafile. Thereafter, the Filter operator applies all relevant predicates in the query. The Join operator (Section 6.3.3) reads the two sets of v-tuples and outputs a single set of v-tuples corresponding to the join result. The projection step removes any fields not needed by the query.

6.3.1 v-tuples

A v-tuple r, generated either as a result of the Scan step or any subsequent step in the query processing pipeline, contains data about the relevant columns required by the query, and a version bitmap indicating the versions where the tuple is present. Figure 6.3 shows a set of v-tuples, which are the output of the Scan step on three versions of the relation R, shown earlier in Figure 6.1. The tuple with the primary key 4 is unchanged in all three versions, and hence it has the bitmap [111]. On the other hand, the tuples with primary keys 2 and 3 are different in all three versions, and hence we have three v-tuples corresponding to each.

Figure 6.3: v-tuples for R1, R2, R3
  k   c1    c2   [v1 v2 v3]
  1   100   a    [1 0 0]
  1   100   b    [0 1 1]
  2   10    z    [1 0 0]
  2   20    z    [0 1 0]
  2   40    f    [0 0 1]
  3   5     c    [1 0 0]
  3   8     d    [0 1 0]
  3   5     d    [0 0 1]
  4   25    d    [1 1 1]
  5   90    e    [1 0 0]
  5   100   e    [0 1 1]

In principle, v-tuples are similar to the concept of "tuple bundles" in the Monte Carlo Database System (MCDB) [113]. MCDB is a prototype relational database designed to allow an analyst to attach arbitrary stochastic models to a database, thereby specifying, in addition to the ordinary relations, "random" relations that contain uncertain data. A tuple bundle, like a v-tuple, encapsulates the instantiations of a tuple over a set of many Monte Carlo iterations, with the goal of operating in batch across all Monte Carlo iterations. However, due to the nature of the application, the specific structure of the data inside tuple bundles is different from v-tuples, and we require new physical operators to generate v-tuples efficiently.

6.3.2 Scan Operator

The Scan operation is the workhorse of the query processing phase. The input to a Scan operation is the access tree of the datafile and a list of versions in the query. The output is a set of v-tuples across all the versions requested by the query. Suppose 𝒜k = {R1, R2, …, Rk} are versions of a datafile R. A naive implementation of the Scan operator would be to first perform a multiple datafile checkout (Section 5.4.1.2) of all of 𝒜k, followed by a grouping step that brings common records together, to create a set of v-tuples. However, this scheme is likely to be inefficient, as it involves materializing close to k copies of most of the tuples, only to merge all the tuples together (if most of the datafile does not change). When k is large, about 50-100 in our experiments, this scheme becomes a major performance bottleneck. We therefore use a different strategy, one that works with the deltas as much as possible.
We use the Contract and Reduce (C&R) algorithm, described in detail in Section 5.4.2 and in Algorithm 6, to generate v-tuples efficiently. Recall that the C&R algorithm takes as input an access tree and iteratively applies a set of transformations to generate the query result efficiently. In order for the algorithm to be applicable in this setting, we need to address three issues, which are the focus of the rest of this section.

First, we need to define the semantics of delta contraction for the delta format that we use. The delta contraction step takes two deltas, say Δ1 between R1 and R2, and Δ2 between R2 and R3, and combines them to create a single delta, say Δ3, between R1 and R3. This operation is useful when R2 is not required in the query. Second, we need to define reduction rules for line and star structures, as defined in Section 5.4.2, for our delta format. Due to the additional structure in the deltas, we extend the delta format to incorporate the result of reducing multiple deltas arranged in line/star configurations. This delta format is similar to the one described in Section 6.2.2 earlier, but instead of keeping two values per column in a reduced delta Δ, we keep 2 ≤ p(Δ) ≤ k values, where p(Δ) indicates the number of versions in V that are reduced by this delta. Third, we describe how to apply, or patch, the "last" delta to the materialized datafile in order to generate all the v-tuples.

6.3.2.1 Delta Contraction

Suppose we want to compute Δ = Δ1 ⊕ Δ2, where Δ1 and Δ2 are defined as above (see Figure 6.1 for an example). If a primary key k′ occurs in only one of Δ1 or Δ2, it is included as is in Δ; this indicates that the record was modified in only one of the two steps. If a primary key k′ is present in both Δ1 and Δ2, it indicates that the record was modified in both, and we resolve it as follows. For each column ci of primary key k′ appearing in the deltas, we have three possible scenarios:

• ci = [⟂,⟂] in both Δ1 and Δ2,
• ci = [⟂,⟂] in Δ1 and ci = [v, w] in Δ2, or vice versa, or
• ci = [x, y] in Δ1 and ci = [y, z] in Δ2,

where v, w, x, y, z are values from the domain of ci. In the first case, we can infer that the value in column ci has not changed between R1 and R3, and hence we set ci = [⟂,⟂] in Δ. In the second case, we can infer that the column value changed in only one of Δ1 or Δ2, and we set ci = [v, w] in Δ. In the third case, if x ≠ z, we set ci = [x, z] in Δ; otherwise we set ci = [⟂,⟂]. After all columns of a key k′ have been processed and set appropriately in Δ, we check whether there is at least one column whose value is not [⟂,⟂]. If no such column exists, k′ can be removed from Δ.
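A minimal sketch of this contraction rule follows, assuming (purely for illustration, not the on-disk format of Section 6.2.3) that a delta is a dictionary from primary key to a per-column map of [before, after] pairs, with BOT standing in for [⟂,⟂]:

    BOT = None  # stands in for the [⟂, ⟂] pair in the text

    def contract(delta1, delta2, columns):
        # Sketch of Δ = Δ1 ⊕ Δ2: delta1 maps R1 -> R2, delta2 maps R2 -> R3; the result maps R1 -> R3.
        out = {}
        for k in set(delta1) | set(delta2):
            if k in delta1 and k not in delta2:
                out[k] = dict(delta1[k]); continue     # modified in only one step: copy as is
            if k in delta2 and k not in delta1:
                out[k] = dict(delta2[k]); continue
            row = {}
            for c in columns:
                a = delta1[k].get(c, BOT)
                b = delta2[k].get(c, BOT)
                if a is BOT and b is BOT:
                    row[c] = BOT                       # unchanged between R1 and R3
                elif a is BOT:
                    row[c] = b                         # changed only between R2 and R3
                elif b is BOT:
                    row[c] = a                         # changed only between R1 and R2
                else:
                    x, _ = a
                    _, z = b
                    row[c] = BOT if x == z else [x, z]
            if any(v is not BOT for v in row.values()):
                out[k] = row                           # drop keys where every column reduced to ⟂
        return out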
Figure 6.4: Line and star transformations for the Scan operation ((a) line, (b) star).

6.3.2.2 Line/Star Structures

We first describe the transformation rule in the case of a line. Figure 6.4(a) shows an example where three datafiles, R1, R2, R3, are arranged in a line, and our goal is to generate v-tuples for all three. Similar to Section 5.4.4, we keep additional information with each reduced delta. We first set p(Δl) = 3, where Δl is the reduced delta. In general, p(Δl) = p(Δ1) + p(Δ2) − 1, where p(Δ) = 2 for a delta between two datafiles. This value keeps track of the number of datafiles reduced by the delta. Every column ci in Δl is an array of p(Δl) values. The main idea is to preserve the values (in the array) of the modified columns in every version, and to infer values, whenever possible, for columns that are ⟂ in the input deltas. There are two cases to consider when computing these values for a primary key k′.

Case 1: k′ is present in only one of Δ1 or Δ2. This is the case for primary keys 1 and 5 in our example. There are two scenarios. First, when a column ci contains the special value ⟂, ci in Δl can simply be set to an array of all ⟂; for instance, c1 for k′ = 1 and c2 for k′ = 5. Second, when the column contains values from the domain of ci. We discuss the case when k′ is present in Δ1. For instance, in Δ1, column c2 for k′ = 1 contains the values [a, b]. c2 in Δl is then set as follows: the first p(Δ1) = 2 values are taken from Δ1 and copied into the array, i.e., [a, b, ⟂], and the remaining p(Δ2) − 1 values are set to the last value in Δ1, i.e., [a, b, b]. The complementary rule handles the case when k′ is present in Δ2 instead.

Case 2: k′ is present in both Δ1 and Δ2. This is the case for primary keys 2 and 3 in our example. There are three scenarios, similar to the ones we discussed for delta contraction above:

1. ci is all ⟂ in both Δ1 and Δ2; then ci can be set to all ⟂ in Δl.
2. ci is ⟂ in one of Δ1 or Δ2; then the ⟂ values can be set to the appropriate value as in Case 1 above, depending on whether the ⟂ values are in Δ1 or Δ2 (e.g., c2 for k′ = 2, 3).
3. ci has values in both Δ1 and Δ2; then ci in Δl is simply a concatenation of the values in Δ1 and Δ2 (counting the last value in Δ1 and the first value in Δ2 only once, since they are guaranteed to be the same).

The transformation rule in the case of a star is similar to the rules for a line, with the only difference being the value that is copied into ci in Δs when either one or both of Δ1 and Δ2 contain values (not ⟂) in ci: in the case of a star, it is the first value in the column ci (this value will be the same in both Δ1 and Δ2). Figure 6.4(b) gives an example where R1, R2, R3 are arranged in a star, and Δs is the reduced delta of Δ1 and Δ3.
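The following sketch captures both the line and the star reduction under the conventions above. The dictionary-of-lists encoding (each modified column holds a list of p values, None standing for an all-⟂ column) and the star flag are assumptions for exposition only, not the DEX delta format.

    def reduce_pair(d1, p1, d2, p2, columns, star=False):
        # d1 and d2 are reduced deltas covering p1 and p2 versions and sharing one version,
        # so the result covers p1 + p2 - 1 versions. For a line the shared version is the
        # last of d1 / first of d2; for a star it is the first version of both.
        out = {}
        for k in set(d1) | set(d2):
            row = {}
            for c in columns:
                a = d1.get(k, {}).get(c)     # list of p1 values, or None for an all-⟂ column
                b = d2.get(k, {}).get(c)     # list of p2 values, or None
                if a is None and b is None:
                    row[c] = None            # stays ⟂ in the reduced delta
                elif b is None:
                    fill = a[0] if star else a[-1]
                    row[c] = a + [fill] * (p2 - 1)       # carry the shared value forward
                elif a is None:
                    row[c] = [b[0]] * (p1 - 1) + b       # carry the shared value backward
                else:
                    row[c] = a + b[1:]       # the shared version's value appears only once
            out[k] = row
        return out

    # For the line of Figure 6.4(a), reducing Δ1 and Δ2 for key 2 with
    # c1 = [10, 20] and [20, 40], c2 = None and [z, f] yields [10, 20, 40] and [z, z, f].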
6.3.2.3 Applying a Delta to a Materialized Datafile

The final step in the C&R algorithm is to apply the last delta to the materialized file in order to generate all the v-tuples. Continuing our example, we want to obtain the v-tuples in Figure 6.3 as the result of the operation R1 ⊕ Δl in the example of Figure 6.4(a). We perform a linear scan of the materialized file, R1, and the delta, Δl, resolving records based on their primary key. For every primary key k′ in R1, we output between 1 and p(Δl) = 3 v-tuples. If a primary key k′ does not appear in the delta, the record has not been modified, and it is output as a single v-tuple with a bitmap of p(Δl) 1s. For example, the record with primary key 4 does not appear in Δl, and hence it is output as (4, 25, d, [111]). On the other hand, if a primary key k′ appears in the delta, we consider every column ci whose values are not ⟂ and generate the v-tuples by resolving one column at a time. Specifically, consider the record with primary key 3 in Δl, i.e., (3, [5, 8, 5], [c, d, d]). We first resolve column c1. The goal is to identify all repeated values in c1 and output intermediate tuples such that there is one tuple per distinct value in c1. For instance, the value 5 is repeated in versions 1 and 3; after resolving, the partial result is {(3, 5, [c,⟂,d]), (3, 8, [⟂,d,⟂])}. Next, we resolve c2. For the first tuple in the intermediate result, (3, 5, [c,⟂,d]), there are no duplicate values in c2, and since this is the last column to be resolved, we can output two v-tuples, {(3, 5, c, [100]), (3, 5, d, [001])}. Similarly, for the second tuple, (3, 8, [⟂,d,⟂]), there are no duplicate values in c2, and we output {(3, 8, d, [010])}.

6.3.3 JOIN Operator

In this section, we describe hash join-based algorithms to perform the equi-join operation. Our choice to study hash-based methods, as opposed to sort-based methods, was influenced by the observation that, in general, they produce results faster than sort-based methods and have a smaller memory footprint [114]. A detailed study of the criteria under which sort-based algorithms become competitive in the context of DEX remains an area for future work.

6.3.4 Simple Hash Join

We first adapt the canonical hash join algorithm to our setting. Suppose the two inputs to be joined are R and S, such that |R| < |S|, and both R and S are sets of v-tuples. The algorithm has a build phase and a probe phase. At the start of the build phase, it allocates memory for the hash table. It then reads a v-tuple r ∈ R, hashes on the join key of r using a pre-defined hash function h(⋅), and writes the v-tuple r into the corresponding bucket. The build phase is complete when all the R v-tuples have been stored in the hash table.

Figure 6.5: Buckets in simple hash join (buckets for two keys, k1 and k2, in R).

During the probe phase, each v-tuple s ∈ S is hashed on the join key using the same hash function h(⋅), and the correct hash bucket for the join key is identified (after accounting for hash collisions). For each v-tuple r in that bucket, we perform a logical AND of the bitmaps of r and s, and if the result is non-zero, the concatenated v-tuple r ⋅ s is output with the new bitmap. Figure 6.5 shows an example with two hash buckets of R, for keys k1 and k2. During the probe phase, the bitmap of a probing tuple s1 is ANDed with every v-tuple in the bucket for key k1 to find join matches.

This example also illustrates a source of inefficiency in our simple adaptation. Note that the v-tuples in the bucket for key k2 have mostly 0s in their bitmaps. During the probe phase, when a tuple s2 matches on the join key, most of the bitmap AND operations result in bitmaps that are all 0s. We address this limitation next.
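For reference, the build and probe phases just described can be sketched as follows, reusing the illustrative VTuple record sketched earlier; this is a sketch of the bitmap-AND probe under those assumptions, not the DEX implementation.

    from collections import defaultdict

    def simple_hash_join(R, S, r_key, s_key):
        # Build on the smaller input R, probe with S, and AND the version bitmaps of matches.
        table = defaultdict(list)
        for r in R:                                    # build phase
            table[r_key(r)].append(r)
        for s in S:                                    # probe phase
            for r in table.get(s_key(s), []):
                bitmap = tuple(x & y for x, y in zip(r.bitmap, s.bitmap))
                if any(bitmap):                        # drop pairs that share no version
                    yield (r.values + s.values, bitmap)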
6.3.5 Version-aware Hash Join

The main bottleneck in the simple hash join method described above is the case where a bucket contains a large number of v-tuples whose bitmaps are mostly 0s. When probing for matches of a tuple s, the algorithm ends up scanning the entire bucket, only to find few matches. We propose a different bucket design that improves probe efficiency by exploiting the sparsity in such buckets.

Figure 6.6 shows the important elements of the new bucket design. The bucket consists of two pieces; suppose the bitmap in the v-tuples is of size k.

Figure 6.6: Improved bucket structure for sparse keys.

The first piece is an array of k − 1 integer values that serve as pointers into the bucket for the tuples of the k versions. The second piece is the tuples of the input R, partitioned by version: all tuples in V1 are stored first, followed by all tuples in V2, and so on. During the build phase, the bitmap of every v-tuple r ∈ R is inspected to determine the versions that r appears in. If r appears in j ≤ k versions, we create j copies of r, sans the bitmap, and store them in the correct partitions of the bucket. Once the build phase is complete, the bucket is scanned once to compute the counts/pointers in the first piece. During the probe phase, the bitmap of a tuple s is inspected to identify the versions that s appears in, and the array of pointers in the bucket is used to look up the relevant tuples. The primary benefit of this approach arises when, for a large number of v-tuples r, j ≪ k. Although the build phase uses more memory for such buckets, the probe phase can be performed efficiently by looking up only the required versions, instead of performing a linear scan of all entries in the bucket.
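A sketch of this version-partitioned bucket is shown below. For simplicity it stores k start offsets rather than the k − 1 pointers described above, and it reuses the illustrative VTuple record from earlier; both choices are assumptions for exposition, not the DEX layout.

    def build_sparse_bucket(vtuples, k):
        # Copy each v-tuple (without its bitmap) into the partition of every version it appears in,
        # then record the start offset of each version's partition.
        partitions = [[] for _ in range(k)]
        for r in vtuples:
            for v, bit in enumerate(r.bitmap):
                if bit:
                    partitions[v].append(r.values)
        tuples, offsets = [], []
        for part in partitions:
            offsets.append(len(tuples))        # start offset of this version's tuples
            tuples.extend(part)
        return offsets, tuples

    def probe_sparse_bucket(offsets, tuples, version, k):
        # Return only the tuples stored for one version, without scanning the whole bucket.
        end = offsets[version + 1] if version + 1 < k else len(tuples)
        return tuples[offsets[version]:end]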
6.3.6 Other Operators

6.3.6.1 Filter Operator

The Filter operator is identical to selection in a classical database system. The filter predicate is applied to every v-tuple, and if it evaluates to true, the tuple is output as is to the next step.

6.3.6.2 Project Operator

Similar to the selection operation, the project operation works with one v-tuple at a time, keeping only the respective columns and leaving the version bitmap of each v-tuple intact. Note that this also preserves the multiset semantics of the projection operation.

6.4 Evaluation

We now evaluate the benefits of using the C&R algorithm for the Scan operator and the new bucket design for the hash join in the Join operator. Our goal is to quantify the performance of both techniques compared to the respective baseline approaches. For the Scan operation, where the input is k datafiles, the baseline approach is the multiple datafile checkout operation described in Section 5.4.1.2, followed by a k-way merge to construct the v-tuples. For the Join operation, the baseline approach is the simple hash join method described in Section 6.3.4 above.

All experiments were conducted on a single machine with an Intel Core i7-4790 CPU (3.60 GHz, 8MB L3 cache) and 32GB of memory, running Ubuntu 16.04 and the OpenJDK 64-bit server JVM (version 1.8.0_111). All time measurements are recorded as wall-clock time. Unless otherwise stated, to measure response time we run each query 10 times and report the median.

6.4.1 Datasets

Lacking access to real-world versioned datasets with sufficient and varied structure, we instead developed a synthetic data generator, described in detail in Section 5.5, to generate datasets with different characteristics over a wide variety of parameter values. This enables us to carefully study the performance of our techniques in various settings. We describe the modifications that we made to the dataset generation process next.

Every dataset is characterized by a 3-tuple, ⟨|A|, |Δ|, #Δ⟩, where |A| and |Δ| refer to the average number of records in a datafile and the average number of records in the deltas (as a percentage of |A|), respectively, and #Δ refers to the number of deltas in the access tree. A record has four columns: a primary key column of 64-bit integer type, and three 64-byte string columns. When creating a new version of a datafile from an existing version, the dataset generator assigns a probability to every record according to a Zipfian distribution (with exponent 2); this probability indicates how likely the record is to be modified when creating the new version. Within a record, the column to be modified is chosen at random from the three string columns.

Figure 6.7: Scan performance over varying delta sizes; dataset parameters: |R| = 3M, #Δ = 50.

6.4.2 Results

We first compare the running time of Scan using the two methods. The input to the Scan operator is an access tree over k versions, say of a datafile R, and the output is a set of v-tuples. The first method simply performs a multiple datafile checkout of the k versions, followed by a k-way merge. The second method uses the C&R algorithm, applying the transformation and reduction rules described in Section 6.3.2. Figure 6.7 shows the speedup obtained by using C&R when the average delta size is varied between 1%, 2%, and 3%, with the other dataset parameters set to |R| = 3M and #Δ = 50. As we can see, C&R yields a 4X–6X speedup when the delta size is small, i.e., 1%. The speedup decreases as the size of the deltas increases, primarily due to larger intermediate delta sizes.

Next, Table 6.1 reports the running time of the two hash join methods when performing an equi-join operation. The first method, SHJ, denotes the simple hash join approach described in Section 6.3.4, and the second method, VHJ, denotes the version-aware hash join approach described in Section 6.3.5. We perform an equi-join on v-tuples representing k versions of two datafiles, R and S, where |R| = 100K and |S| = 1M.

k    SHJ (sec)  VHJ (sec)  Speedup
5    3.3        3.3        0%
25   19.4       11.6       61%
50   49.8       29.6       68%
75   103.2      59.7       73%
100  146.6      81.9       79%

Table 6.1: Join performance of the two hash join methods over a varying number of input versions of R and S; dataset parameters: |R| = 100K, |S| = 1M, |Δ| = 1%, #Δ = 50.

During the build phase, VHJ determines when to use the new bucket design, depending on the sparsity of the bucket for a given key. Specifically, two parameters are computed when deciding whether a bucket is sparse. The first parameter, b, is the sparsity of the bitmap in a v-tuple, measured relative to the size of the bitmap, i.e., the number of versions k; a v-tuple is said to be sparse when its b value is less than a specified threshold, b̄. The second parameter, n, is the relative number of sparse v-tuples in the bucket; the bucket is said to be sparse when its n value is greater than a specified threshold, n̄. Table 6.1 reports the runtime (in seconds) and the speedup obtained when we set b̄ = 5% and n̄ = 90%. When there are only 5 versions across which to perform the join, no buckets satisfy the sparsity criteria, and hence the run times are identical. We observe speedups of 61%–79%, increasing with the number of versions in the join, when using the different bucket design for sparse buckets.
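As an illustration, the sparsity test could be implemented along the following lines; b_thresh and n_thresh play the roles of the thresholds written as b̄ and n̄ above, and the function is a sketch under the VTuple encoding used earlier rather than the DEX code.

    def is_sparse_bucket(bitmaps, k, b_thresh=0.05, n_thresh=0.90):
        # A v-tuple is sparse if it is live in fewer than b_thresh * k versions;
        # a bucket is sparse if more than n_thresh of its v-tuples are sparse.
        sparse = sum(1 for bm in bitmaps if sum(bm) < b_thresh * k)
        return bool(bitmaps) and sparse / len(bitmaps) > n_thresh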
Our initial set of experiments thus shows significant benefits of the v-tuple approach for evaluating richer declarative queries over multiple versions. It indicates that we can both create v-tuples efficiently, by taking advantage of the delta representation when reading the input, and perform richer operations such as joins at acceptable overheads. As a next step, we plan to add additional operations, such as aggregation and user-defined functions, to further extend the class of queries that DEX can support.

Chapter 7: Conclusions

In this dissertation, we introduced a framework for building a dataset version control system. We presented a theoretical model for capturing and reasoning about the trade-offs that arise in the management of thousands of dataset versions. We demonstrated that, if the versions largely overlap in their contents, then by using delta encoding it is possible to both store and retrieve them effectively. Instead of applying delta encoding in an ad hoc manner, we studied different algorithms that helped us navigate the storage–recreation trade-off in a principled fashion. We also showed that it is possible to execute a variety of queries over multiple past dataset versions without having to first retrieve them in their entirety.

The technical contributions of this dissertation are (i) a formal study of the dataset versioning problem that considers the trade-off between storage cost and recreation cost in different ways, and provides a collection of polynomial-time algorithms for finding good solutions for large problem sizes; (ii) a cost-based optimization framework, along with a set of transformation rules based on the algebraic properties of the deltas, that finds efficient methods to evaluate checkout (retrieval), intersection, union, and t-threshold queries over multiple versions; and (iii) a new approach to efficiently execute single-block select-project-join queries over multiple dataset versions. We also introduced the DEX system, which demonstrates how the techniques in this dissertation can lead to significantly improved performance compared to the ad hoc techniques and version management systems prevalent today.

Bibliography

[1] James Cheney, Stephen Chong, Nate Foster, Margo Seltzer, and Stijn Vansummeren. Provenance: A future history. In OOPSLA, pages 957–964. ACM, 2009.
[2] Daniel J. Weitzner, Harold Abelson, Tim Berners-Lee, Joan Feigenbaum, James Hendler, and Gerald Jay Sussman. Information accountability. Communications of the ACM, 51(6):82–87, 2008.
[3] R. K. L. Ko, P. Jagadpramana, M. Mowbray, S. Pearson, M. Kirchberg, Q. Liang, and B. S. Lee. TrustCloud: A framework for accountability and trust in cloud computing. In IEEE World Congress on Services, July 2011.
[4] Pat Helland. Immutability changes everything. In CIDR, 2015.
[5] João Paulo and José Pereira. A survey and classification of storage deduplication systems. ACM Computing Surveys, 2014.
[6] Monya Baker. De novo genome assembly: what every biologist should know. Nature Methods, 9(4):333–337, 2012.
[7] http://git.kernel.org/cgit/git/git.git/tree/Documentation/technical/pack-heuristics.txt.
[8] http://comments.gmane.org/gmane.comp.version-control.git/189776.
[9] Sebastian Rönnau, Jan Scheffczyk, and Uwe M. Borghoff. Towards XML Version Control of Office Documents. In Proc. of the ACM Symposium on Document Engineering, pages 10–19, 2005.
[10] Michael Maddox, David Goehring, Aaron J. Elmore, Samuel Madden, Aditya G. Parameswaran, and Amol Deshpande. Decibel: The relational dataset branching system. PVLDB, 9(9):624–635, 2016.
[11] James Clifford, Curtis Dyreson, Tomás Isakowitz, Christian S. Jensen, and Richard Thomas Snodgrass. On the semantics of "now" in databases. ACM Transactions on Database Systems (TODS), 22(2):171–214, 1997.
[12] Gultekin Özsoyoğlu and Richard T. Snodgrass. Temporal and real-time databases: A survey. IEEE Trans. on Knowl. and Data Eng., pages 513–532, 1995.
[13] Abdullah Uz Tansel, James Clifford, Shashi Gadia, Sushil Jajodia, Arie Segev, and Richard Snodgrass.
Temporal databases: theory, design, and implementation. Benjamin-Cummings Publishing Co., Inc., 1993.
[14] David Lomet, Roger Barga, Mohamed F. Mokbel, German Shegalov, Rui Wang, and Yunyue Zhu. Immortal DB: Transaction time support for SQL server. In SIGMOD, pages 939–941, 2005.
[15] David Lomet, Mingsheng Hong, Rimma Nehme, and Rui Zhang. Transaction time indexing with version compression. In PVLDB, pages 870–881, 2008.
[16] David B. Lomet, Alan Fekete, Rui Wang, and Peter Ward. Multi-version concurrency via timestamp range conflict management. In IEEE 28th International Conference on Data Engineering (ICDE 2012), Washington, DC, USA (Arlington, Virginia), 1-5 April, 2012, pages 714–725, 2012. doi: 10.1109/ICDE.2012.10. URL http://dx.doi.org/10.1109/ICDE.2012.10.
[17] Yu Wu, Sushil Jajodia, and X. Sean Wang. Temporal database bibliography update. In Temporal Databases: research and practice, pages 338–366. Springer, 1998.
[18] Richard Snodgrass. The temporal query language tquel. ACM Trans. Database Syst., 12(2):247–298, June 1987. ISSN 0362-5915. doi: 10.1145/22952.22956. URL http://doi.acm.org/10.1145/22952.22956.
[19] Richard T. Snodgrass, Santiago Gomez, and L. Edwin McKenzie Jr. Aggregates in the temporal query language tquel. Knowledge and Data Engineering, IEEE Transactions on, 5(5):826–842, 1993.
[20] Michael Stonebraker, Lawrence A. Rowe, and Michael Hirohama. The implementation of postgres. IEEE Transactions on Knowledge & Data Engineering, (1):125–142, 1990.
[21] Antonin Guttman. R-trees: A dynamic index structure for spatial searching. In SIGMOD, pages 47–57, 1984.
[22] Using Oracle Flashback Technology. https://docs.oracle.com/cd/B28359_01/appdev.111/b28424/adfns_flashback.htm. Accessed: May 04, 2016.
[23] Oracle Total Recall with Oracle Database 11g Release 2. http://www.oracle.com/technetwork/database/focus-areas/storage/total-recall-whitepaper-171749.pdf. Accessed: November 11, 2016.
[24] Peter Buneman, Sanjeev Khanna, Keishi Tajima, and Wang Chiew Tan. Archiving scientific data. ACM Trans. Database Syst., 29:2–42, 2004. doi: 10.1145/974750.974752. URL http://doi.acm.org/10.1145/974750.974752.
[25] Torben Bach Pedersen and Christian S. Jensen. Multidimensional database technology. Computer, 34(12):40–46, 2001.
[26] Heum-Geun Kang and Chin-Wan Chung. Exploiting versions for on-line data warehouse maintenance in molap servers. In Proceedings of the 28th International Conference on Very Large Data Bases, VLDB '02, pages 742–753. VLDB Endowment, 2002. URL http://dl.acm.org/citation.cfm?id=1287369.1287433.
[27] P. Baumann, A. Dehmel, P. Furtado, R. Ritsch, and N. Widmann. The multidimensional database system rasdaman. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, SIGMOD '98, pages 575–577, New York, NY, USA, 1998. ACM. ISBN 0-89791-995-5. doi: 10.1145/276304.276386. URL http://doi.acm.org/10.1145/276304.276386.
[28] Paul G. Brown. Overview of scidb: Large scale array storage, processing and analysis. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD '10, pages 963–968, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0032-2. doi: 10.1145/1807167.1807271. URL http://doi.acm.org/10.1145/1807167.1807271.
[29] Michael Stonebraker, Jacek Becla, David J. DeWitt, Kian-Tat Lim, David Maier, Oliver Ratzesberger, and Stanley B. Zdonik. Requirements for science data bases and scidb. In CIDR 2009, Fourth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2009, Online Proceedings, 2009. URL http://www-db.cs.wisc.edu/cidr/cidr2009/Paper_26.pdf.
[30] Adam Seering, Philippe Cudré-Mauroux, Samuel Madden, and Michael Stonebraker. Efficient versioning for scientific array databases. In ICDE, pages 1013–1024, 2012.
[31] Shu-Yao Chien, Vassilis J. Tsotras, Carlo Zaniolo, and Donghui Zhang. Efficient complex query support for multiversion xml documents. In Advances in Database Technology—EDBT 2002, pages 161–178. Springer, 2002.
[32] Anders Björnerstedt and Christer Hultén. Object-oriented concepts, databases, and applications, chapter Version Control in an Object-oriented Architecture, pages 451–485. ACM, New York, NY, USA, 1989. ISBN 0-201-14410-7. doi: 10.1145/63320.66513. URL http://doi.acm.org/10.1145/63320.66513.
[33] Marios Hadjieleftheriou, George Kollios, Vassilis J. Tsotras, and Dimitrios Gunopulos. Efficient indexing of spatiotemporal objects. In Advances in Database Technology—EDBT 2002, pages 251–268. Springer, 2002.
[34] U. Khurana and A. Deshpande. Efficient snapshot retrieval over historical graph data. In Data Engineering (ICDE), 2013 IEEE 29th International Conference on, pages 997–1008, April 2013. doi: 10.1109/ICDE.2013.6544892.
[35] Sean Quinlan and Sean Dorward. Venti: A new approach to archival storage. In FAST, volume 2, pages 89–101, 2002.
[36] Benjamin Zhu, Kai Li, and R. Hugo Patterson. Avoiding the disk bottleneck in the data domain deduplication file system. In FAST, volume 8, pages 1–14, 2008.
[37] F. Douglis and A. Iyengar. Application-specific delta-encoding via resemblance detection. In USENIX ATC, 2003.
[38] Zan Ouyang, Nasir Memon, Torsten Suel, and Dimitre Trendafilov. Cluster-based delta compression of a collection of files. In WISE, 2002.
[39] Mun Choon Chan and Thomas Y. C. Woo. Cache-based compaction: A new technique for optimizing web transfer. In INFOCOM, 1999.
[40] Randal C. Burns and Darrell D. E. Long. In-place reconstruction of delta compressed files. In Proceedings of the seventeenth annual ACM symposium on Principles of Distributed Computing, pages 267–275. ACM, 1998.
[41] Purushottam Kulkarni, Fred Douglis, Jason D. LaVoie, and John M. Tracey. Redundancy elimination within large collections of files. In USENIX ATC, 2004.
[42] Dan R. K. Ports and Kevin Grittner. Serializable snapshot isolation in postgresql. Proceedings of the VLDB Endowment, 5(12):1850–1861, 2012.
[43] Scott Chacon and Ben Straub. Pro Git Book. https://git-scm.com/book/en/v2. Accessed: May 04, 2016.
[44] Git-Annex. https://git-annex.branchable.com/. Accessed: May 08, 2016.
[45] Git Large File Storage. https://git-lfs.github.com/. Accessed: May 08, 2016.
[46] Talel Abdessalem and Geneviève Jomier. Vql: A query language for multiversion databases. In Database Programming Languages, pages 160–179. Springer, 1997.
[47] Temporal Tables. https://msdn.microsoft.com/en-us/library/dn935015.aspx. Accessed: May 04, 2016.
[48] Jennifer Widom. Trio: A system for integrated management of data, accuracy, and lineage. Technical Report, 2004.
[49] Anderson Marinho, Leonardo Murta, Cláudia Werner, Vanessa Braganholo, Sérgio Manuel Serra da Cruz, Eduardo Ogasawara, and Marta Mattoso. Provmanager: a provenance management system for scientific workflows. Concurrency and Computation: Practice and Experience, 24(13):1513–1530, 2012.
[50] Leonardo Murta, Vanessa Braganholo, Fernando Chirigati, David Koop, and Juliana Freire. noWorkflow: Capturing and analyzing provenance of scripts. In Provenance and Annotation of Data and Processes, pages 71–83. Springer, 2014.
[51] Jihie Kim, Ewa Deelman, Yolanda Gil, Gaurang Mehta, and Varun Ratnakar. Provenance trails in the Wings/Pegasus system. Concurrency and Computation: Practice and Experience, 20(5):587–597, 2008.
[52] Marcin Wylot, Philippe Cudre-Mauroux, and Paul Groth. Executing provenance-enabled queries over web data. In Proceedings of the 24th International Conference on World Wide Web, pages 1275–1285. International World Wide Web Conferences Steering Committee, 2015.
[53] Manish Kumar Anand, Shawn Bowers, Timothy Mcphillips, and Bertram Ludäscher. Exploring scientific workflow provenance using hybrid queries over nested data and lineage graphs. In Scientific and Statistical Database Management, pages 237–254. Springer, 2009.
[54] Manish Kumar Anand, Shawn Bowers, and Bertram Ludäscher. Techniques for efficiently querying scientific workflow provenance graphs. In EDBT, volume 10, pages 287–298, 2010.
[55] David A. Holland, Uri Jacob Braun, Diana Maclean, Kiran-Kumar Muniswamy-Reddy, and Margo I. Seltzer. Choosing a data model and query language for provenance. In The 2nd International Provenance and Annotation Workshop. Springer, 2008.
[56] Grigoris Karvounarakis, Zachary G. Ives, and Val Tannen. Querying data provenance. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pages 951–962. ACM, 2010.
[57] Shawn Bowers. Scientific workflow, provenance, and data modeling challenges and approaches. Journal on Data Semantics, 1(1):19–30, 2012.
[58] Susan B. Davidson and Juliana Freire. Provenance and scientific workflows: challenges and opportunities. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1345–1350. ACM, 2008.
[59] Michael Stonebraker, Gerald Held, Eugene Wong, and Peter Kreps. The design and implementation of INGRES. ACM Transactions on Database Systems (TODS), 1(3):189–222, 1976.
[60] Carlo Zaniolo. The database language GEM. In ACM Sigmod Record, volume 13(4), pages 207–218. ACM, 1983.
[61] Bruno Becker, Stephan Gschwind, Thomas Ohler, Bernhard Seeger, and Peter Widmayer. An asymptotically optimal multiversion b-tree. The VLDB Journal—The International Journal on Very Large Data Bases, 5(4):264–275, 1996.
[62] Christian Plattner, Andreas Wapf, and Gustavo Alonso. Searching in time. In Proceedings of the 2006 ACM SIGMOD international conference on Management of data, pages 754–756. ACM, 2006.
[63] Betty Salzberg and Vassilis J. Tsotras. Comparison of access methods for time-evolving data. ACM Comput. Surv., 31(2):158–221, June 1999. ISSN 0360-0300. doi: 10.1145/319806.319816. URL http://doi.acm.org/10.1145/319806.319816.
[64] Khaled Jouini and Geneviève Jomier. Indexing multiversion databases. In Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management, CIKM '07, pages 915–918, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-803-9. doi: 10.1145/1321440.1321574. URL http://doi.acm.org/10.1145/1321440.1321574.
[65] Emad Soroush and Magdalena Balazinska. Time travel in a scientific array database. In ICDE, pages 98–109, 2013.
[66] Jeffrey C. Mogul, Fred Douglis, Anja Feldmann, and Balachander Krishnamurthy. Potential benefits of delta encoding and data compression for http. In SIGCOMM, pages 181–194, 1997.
[67] Josh MacDonald. File system support for delta compression. 2000.
[68] Anant P. Bhardwaj, Souvik Bhattacherjee, Amit Chavan, Amol Deshpande, Aaron Elmore, Samuel Madden, and Aditya Parameswaran. DataHub: Collaborative Data Science & Dataset Version Management at Scale. In CIDR, 2015.
[69] Shahram Ghandeharizadeh, Richard Hull, and Dean Jacobs. Heraclitus: Elevating deltas to be first-class citizens in a database programming language. ACM Trans. Database Syst., 21(3):370–426, 1996. ISSN 0362-5915. doi: 10.1145/232753.232801. URL http://doi.acm.org/10.1145/232753.232801.
[70] Timothy Griffin and Richard Hull. A framework for implementing hypothetical queries. In SIGMOD, pages 231–242, 1997.
[71] Nicholas E. Taylor and Zachary G. Ives. Reconciling while tolerating disagreement in collaborative data sharing. In SIGMOD, pages 13–24, 2006. ISBN 1-59593-434-0. doi: 10.1145/1142473.1142476. URL http://doi.acm.org/10.1145/1142473.1142476.
[72] Todd J. Green, Grigoris Karvounarakis, Zachary G. Ives, and Val Tannen. Update exchange with mappings and provenance. In PVLDB, pages 675–686, 2007. ISBN 978-1-59593-649-3. URL http://dl.acm.org/citation.cfm?id=1325851.1325929.
[73] Sanjay Agrawal, Surajit Chaudhuri, and Vivek R. Narasayya. Automated Selection of Materialized Views and Indexes in SQL Databases. In PVLDB, pages 496–505, 2000. URL http://dl.acm.org/citation.cfm?id=645926.671701.
[74] Zohreh Asgharzadeh Talebi, Rada Chirkova, Yahya Fathi, and Matthias Stallmann. Exact and inexact methods for selecting views and indexes for OLAP performance improvement. In EDBT, pages 311–322, 2008. doi: 10.1145/1353343.1353383. URL http://doi.acm.org/10.1145/1353343.1353383.
[75] Alon Y. Halevy. Answering queries using views: A survey. The VLDB Journal, 10(4):270–294, 2001. ISSN 0949-877X. doi: 10.1007/s007780100054. URL http://dx.doi.org/10.1007/s007780100054.
[76] Jonathan Goldstein and Per-Åke Larson. Optimizing queries using materialized views: A practical, scalable solution. In SIGMOD, pages 331–342, 2001. ISBN 1-58113-332-4. doi: 10.1145/375663.375706. URL http://doi.acm.org/10.1145/375663.375706.
[77] Zvi Galil and Giuseppe F. Italiano. Data structures and algorithms for disjoint set union problems. ACM Comput. Surv., pages 319–344, 1991. ISSN 0360-0300. doi: 10.1145/116873.116878. URL http://doi.acm.org/10.1145/116873.116878.
[78] Frank K. Hwang and Shen Lin. A simple algorithm for merging two disjoint linearly ordered sets. SIAM Journal on Computing, 1(1):31–39, 1972.
[79] Erik Demaine, Alejandro López-Ortiz, and J. Munro. Adaptive Set Intersections, Unions, and Differences. In SODA, pages 743–752, 2000. URL http://dl.acm.org/citation.cfm?id=338219.338634.
[80] Ricardo Baeza-Yates. A fast set intersection algorithm for sorted sequences. In Annual Symposium on Combinatorial Pattern Matching, pages 400–408. Springer, 2004.
[81] Peter Sanders and Frederik Transier. Intersection in integer inverted indices. In Proceedings of the Meeting on Algorithm Engineering & Experiments, pages 71–83, 2007.
[82] Bolin Ding and Arnd Christian König. Fast set intersection in memory. PVLDB, pages 255–266, 2011. ISSN 2150-8097. doi: 10.14778/1938545.1938550. URL http://dx.doi.org/10.14778/1938545.1938550.
[83] Erik D. Demaine, Alejandro López-Ortiz, and J. Ian Munro. Experiments on adaptive set intersections for text retrieval systems. In ALENEX, 2001.
[84] Jérémy Barbay, Alejandro López-Ortiz, Tyler Lu, and Alejandro Salinger. An experimental investigation of set intersection algorithms for text searching. ACM Journal of Experimental Algorithmics, 2010.
[85] Philip Bille, Anna Pagh, and Rasmus Pagh. Fast evaluation of union-intersection expressions. In ISAAC, pages 739–750, 2007.
[86] Taesung Lee, Jin-Woo Park, Sanghoon Lee, Seung-won Hwang, Sameh Elnikety, and Yuxiong He. Processing and optimizing main memory spatial-keyword queries. PVLDB, 9(3):132–143, 2015.
[87] Robert Endre Tarjan. Finding optimum branchings. Networks, 7(1):25–35, 1977.
[88] Guy Kortsarz and David Peleg. Approximating shallow-light trees. In SODA, 1997.
[89] Judit Bar-Ilan, Guy Kortsarz, and David Peleg. Generalized submodular cover problems and applications. Theoretical Computer Science, 250(1):179–200, 2001.
[90] Moses Charikar, Chandra Chekuri, To-yat Cheung, Zuo Dai, Ashish Goel, Sudipto Guha, and Ming Li. Approximation algorithms for directed steiner problems. Journal of Algorithms, 33(1):73–91, 1999.
[91] Samir Khuller, Balaji Raghavachari, and Neal Young. Balancing minimum spanning trees and shortest-path trees. Algorithmica, 14(4):305–321, 1995.
[92] https://www.kernel.org/pub/software/scm/git/docs/technical/pack-heuristics.txt.
[93] http://edmonds-alg.sourceforge.net/.
[94] http://svn.apache.org/repos/asf/subversion/trunk/notes/fsfs.
[95] http://svnbook.red-bean.com/en/1.8/svn.reposadmin.maint.html#svn.reposadmin.maint.diskspace.fsfspacking.
[96] http://svn.apache.org/repos/asf/subversion/trunk/notes/fsfs-improvements.txt.
[97] http://www.xmailserver.org/xdiff-lib.html.
[98] Peter T. Wood. Query languages for graph databases. SIGMOD Rec., 41(1):50–60, April 2012. ISSN 0163-5808.
[99] Eugene W. Myers. An O(ND) difference algorithm and its variations. Algorithmica, pages 251–266, 1986.
[100] Erik D. Demaine, Shay Mozes, Benjamin Rossman, and Oren Weimann. An optimal decomposition algorithm for tree edit distance. ACM Trans. Algorithms, 6(1):2:1–2:19, December 2009. ISSN 1549-6325. doi: 10.1145/1644015.1644017. URL http://doi.acm.org/10.1145/1644015.1644017.
[101] Fred Douglis and Arun Iyengar. Application-specific delta-encoding via resemblance detection. In USENIX Annual Technical Conference, General Track, pages 113–126, 2003.
[102] Git Packfiles. https://git-scm.com/book/en/v2/Git-Internals-Packfiles. Accessed: February 15, 2017.
[103] Richard M. Karp. Reducibility among combinatorial problems. In Complexity of computer computations. Springer, 1972.
[104] Jérémy Barbay and Claire Kenyon. Deterministic algorithm for the t-threshold set problem. In Algorithms and Computation, pages 575–584. Springer, 2003.
[105] Samy Chambi, Daniel Lemire, Owen Kaser, and Robert Godin. Better bitmap performance with roaring bitmaps. Software: Practice and Experience, 2015.
[106] Samir Khuller, Balaji Raghavachari, and Neal Young. Balancing minimum spanning trees and shortest-path trees. Algorithmica, 14(4):305–321, 1995.
[107] Miller. http://johnkerl.org/miller/doc/. Accessed: April 15, 2018.
[108] csvkit. https://csvkit.readthedocs.io/en/1.0.3/. Accessed: April 15, 2018.
[109] textql. https://github.com/dinedal/textql. Accessed: April 15, 2018.
[110] http://parquet.apache.org/documentation/latest/.
[111] Anastassia Ailamaki, David J. DeWitt, Mark D. Hill, and Marios Skounakis. Weaving relations for cache performance. In VLDB 2001, Proceedings of 27th International Conference on Very Large Data Bases, September 11-14, 2001, Roma, Italy, pages 169–180, 2001. URL http://www.vldb.org/conf/2001/P169.pdf.
[112] Marcel Kornacker, Alexander Behm, Victor Bittorf, Taras Bobrovytsky, Casey Ching, Alan Choi, Justin Erickson, Martin Grund, Daniel Hecht, Matthew Jacobs, Ishaan Joshi, Lenni Kuff, Dileep Kumar, Alex Leblang, Nong Li, Ippokratis Pandis, Henry Robinson, David Rorke, Silvius Rus, John Russell, Dimitris Tsirogiannis, Skye Wanderman-Milne, and Michael Yoder. Impala: A modern, open-source SQL engine for hadoop. In CIDR 2015, Seventh Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2015, Online Proceedings, 2015. URL http://cidrdb.org/cidr2015/Papers/CIDR15_Paper28.pdf.
[113] Ravi Jampani, Fei Xu, Mingxi Wu, Luis Leopoldo Perez, Chris Jermaine, and Peter J. Haas. The monte carlo database system: Stochastic analysis close to the data. ACM Trans. Database Syst., 36(3):18:1–18:41, 2011. doi: 10.1145/2000824.2000828. URL http://doi.acm.org/10.1145/2000824.2000828.
[114] Spyros Blanas and Jignesh M. Patel. Memory footprint matters: efficient equi-join algorithms for main memory data processing. In ACM Symposium on Cloud Computing, SOCC '13, Santa Clara, CA, USA, October 1-3, 2013, pages 19:1–19:16, 2013. doi: 10.1145/2523616.2523626. URL http://doi.acm.org/10.1145/2523616.2523626.