ABSTRACT

Title of dissertation: DELTA-BASED STORAGE AND QUERYING FOR VERSIONED DATASETS
Amit Chavan, Doctor of Philosophy, 2018
Dissertation directed by: Professor Amol Deshpande, Department of Computer Science

Data-driven methods and products are becoming increasingly common in a variety of communities, leading to a huge diversity of datasets being continuously generated, modified, and analyzed. An increasingly important consideration for the underlying data management systems is that all of these datasets and their versions over time need to be stored and queried for a variety of reasons including, but not limited to, reproducibility, collaboration, provenance, auditing, introspective analysis, and backups. However, most solutions today resort to highly ad hoc and manual version management and sharing techniques, which leads to friction when managing collaborative data science workflows, while also introducing inefficiencies.

In this dissertation, we introduce a framework for dataset version management, and address the systems building, operator design, and optimization challenges involved in building a dataset version control system. We describe the various challenges and solutions in the context of our system, called DEX, that we have developed to support increasingly complex version management tasks. We show how to use delta encoding, a key component in managing redundancy, to provide efficient storage and retrieval for thousands of dataset versions, and develop a formalism to understand the various trade-offs in a principled manner. We study the storage–recreation trade-off in detail and provide a suite of inexpensive heuristics to obtain high-quality solutions under different settings. In order to provide a rich interface to specify version management tasks, we design a new query language, called VQUEL, with the ability to query dataset versions and provenance in a unified manner. We study how assumptions on the delta format can help in the design of a logical algebra, which we then use to execute increasingly complex queries efficiently. A key characteristic of our query execution methods is that the computational cost is primarily dependent on the size and the number of deltas in the expression (typically small), and not on the input dataset versions (which can be very large). Finally, we demonstrate the effectiveness of our developed techniques by extensive evaluation of DEX on a mixture of real-world and synthetic datasets.

DELTA-BASED STORAGE AND QUERYING FOR VERSIONED DATASETS
by Amit Chavan

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2018

Advisory Committee:
Professor Amol Deshpande, Chair/Advisor
Professor Aravind Srinivasan, Co-advisor
Professor Louiqa Raschid, Dean’s Representative
Professor Daniel Abadi
Professor Mihai Pop

© Copyright by Amit Chavan 2018

Dedicated to the memories of my dear (late) grandmother, Kaku.

Acknowledgments

There are many people whose support, guidance, and friendship have made this thesis possible. I am forever indebted to my advisor, Amol Deshpande, who has taught me so much and been a great mentor during the last six years. Learning how to do successful research took me many years, and I am thankful for the skills that Amol has patiently taught me.
He taught me how to communicate ideas more effectively, gave me the freedom to investigate new directions, patiently listened and gave feedback on many of my half-baked ideas, and contributed many ideas to this work.

I am fortunate to have had many mentors on the path towards my Ph.D. I would like to thank Aravind Srinivasan for his support and guidance when I needed it the most. I am indebted to Rajiv Gandhi for introducing me to computer science research and helping me apply to graduate schools. The staff at the Computer Science Department, particularly Jennifer Story, have created a wonderful place, and I am happy to be a part of it. I am also grateful to Louiqa Raschid, Daniel Abadi, Mihai Pop, and Samir Khuller for their participation on my proposal and defense committees.

I’d also like to thank my collaborators, from whom I’ve learned so very much. In particular, the work described in Chapter 3 was done jointly with two other PhD students, Souvik Bhattacherjee and Silu Huang. I was primarily responsible for the design of the algorithms, jointly responsible for the theoretical results with Silu, and for the implementation and the experimental evaluation with Souvik. The work described in Chapter 4 was done jointly with Silu Huang, with equal contributions.

I have deeply enjoyed my interactions and friendships with the members of the database group and the Computer Science department at UMD. Thank you Souvik Bhattacherjee, Hui Miao, Abdul Quamar, Theodoros Rekatsinas, Manish Purohit, and Pan Xu for making graduate study both intellectually stimulating and incredibly enjoyable. I am also lucky to have a wonderful set of friends and housemates. An incomplete list of those who helped me on this path includes Sangeetha Venkatraman, Amey Bhangale, Kartik Nayak, Bhaskar Ramasubramanian, Ramakrishna Padmanabhan, Sudha Rao, and Meethu Malu.

My heartfelt gratitude to my family, without whom I would not have made it this far. My parents provided their love and support throughout my life, and encouraged me to do things the right way, without making any compromises on the quality of work. My sister, Deepti, sometimes believed in my goals and dreams even more than I did. Lastly, to Shruti, for understanding and supporting me in all my ups and downs during this journey. I’m fortunate to have had her boundless love and her selfless and unwavering support for the last six years. I look forward to our journey ahead, together.

Table of Contents

Dedication
Acknowledgements
List of Figures
1 Introduction
  1.1 Challenges in building a DVCS
  1.2 Dissertation Overview and Contributions
    1.2.1 Efficient Storage and Retrieval
    1.2.2 A Language to Query Provenance and Versions in a Unified Manner
    1.2.3 Delta-aware Query Execution
2 Related Work
  2.1 Enabling Multi-Versioned Storage
  2.2 Version Control Systems (VCS)
  2.3 Query Languages
  2.4 Query Execution
3 Storage–Recreation Tradeoff
  3.1 Introduction
  3.2 Problem Overview
    3.2.1 Essential Notations and Preliminaries
    3.2.2 Mapping to Graph Formulation
    3.2.3 ILP Formulation
  3.3 Computational Complexity
  3.4 Proposed Algorithms
    3.4.1 Local Move Greedy Algorithm
    3.4.2 Modified Prim’s Algorithm
    3.4.3 LAST Algorithm
    3.4.4 Git Heuristic
  3.5 Experiments
    3.5.1 Datasets
    3.5.2 Comparison with SVN and Git
    3.5.3 Experimental Results
4 A Unified Query Language for Provenance and Versioning
  4.1 Introduction
  4.2 Preliminaries
  4.3 Language Features
    4.3.1 Examples
    4.3.2 Syntactic sweetenings
    4.3.3 Aggregate operators
    4.3.4 Version graph traversal
    4.3.5 Extensions to fine-grained provenance
5 Query Execution I: Set-based Operations
  5.1 Introduction
  5.2 System Overview
    5.2.1 User Data Model
    5.2.2 Queries
    5.2.3 System Data Model
      5.2.3.1 Storage Graph
      5.2.3.2 Set-backed Deltas and Properties
  5.3 Query Execution Preliminaries
    5.3.1 Optimization Metrics
    5.3.2 Access Tree
    5.3.3 Search Space
    5.3.4 Cost and Cardinality Estimation
  5.4 Query Execution Algorithms
    5.4.1 Checkout Queries
      5.4.1.1 Single datafile Checkout
      5.4.1.2 Multiple datafile Checkout
    5.4.2 Intersection Queries
    5.4.3 Union
    5.4.4 t-Threshold
  5.5 Experimental Evaluation
    5.5.1 Comparisons with Temporal Indexing
6 Query Execution II: Declarative Queries
  6.1 Introduction
  6.2 System Design
    6.2.1 Schema Specification
    6.2.2 Delta format
    6.2.3 Physical Representation
    6.2.4 Discussion
  6.3 Query Execution
    6.3.1 v-tuples
    6.3.2 Scan Operator
      6.3.2.1 Delta Contraction
      6.3.2.2 Line/Star Structures
      6.3.2.3 Applying Delta to a Materialized datafile
    6.3.3 JOIN Operator
    6.3.4 Simple Hash Join
    6.3.5 Version-aware Hash Join
    6.3.6 Other Operators
      6.3.6.1 Filter Operator
      6.3.6.2 Project Operator
  6.4 Evaluation
    6.4.1 Datasets
    6.4.2 Results
7 Conclusions
Bibliography

List of Figures

1.1 System Architecture of DEX
1.2 Three solutions with different storage and recreation costs
3.1 Matrices corresponding to the example in Figure 1.2
3.2 Graph based formulation
3.3 A feasible storage graph
3.4 Illustration of Proof of Lemma 5
3.5 Illustration of Proof of Lemma 6
3.6 Illustration of Local Move Greedy Heuristic
3.7 Directed Graph G
3.8 Undirected Graph G
3.9 Illustration of Modified Prim’s algorithm in Figure 3.7
3.10 Illustration of LAST on Figure 3.8
3.11 Distribution of delta sizes
3.12 Results for the directed case, comparing the storage costs and total recreation costs
3.13 Results for the directed case, comparing the storage costs and maximum recreation costs
3.14 Results for the undirected case, comparing the storage costs and total recreation costs
3.15 Taking workload into account leads to better solutions
3.16 Running times of LMG
4.1 Conceptual data model for VQUEL
4.2 Example version graph
5.1 Example of access trees
5.2 An instance of the Tree Contraction algorithm
5.3 Line and star structures
5.4 Access tree during the progress of C&R
5.5 Effect of varying #Δ when |Δ| = 5%
5.6 Effect of varying |A|
5.7 Effect of varying #Δ when |Δ| = 5%
5.8 Access tree shapes; (a) Line, (b) Star, (c) Line-and-star
5.9 Effect of access tree structure when |Δ| = 1%, #Δ = 100
5.10 Effect of query size when |Δ| = 1%
5.11 Intersect – Effect of |A|
5.12 Intersect – Effect of |Δ|
5.13 Union – Effect of |Δ|
5.14 t-Thres. – Effect of |Δ|
5.15 Effect of bitmap size
6.1 Three datafiles and three column-wise deltas
6.2 (a) Example query on k versions, (b) physical plan to execute the query
6.3 v-tuples for R1, R2, R3
6.4 Line and star transformations for the Scan operation
6.5 Buckets in simple hash join
6.6 Improved bucket structure for sparse keys
6.7 Scan performance over varying delta sizes

Chapter 1: Introduction

Data-driven methods and products are becoming increasingly common in a variety of communities, leading to a huge diversity of datasets being continuously generated, modified, and analyzed. Data science teams, in a variety of domains, want to acquire new datasets and perform increasingly sophisticated analysis tasks in order to derive valuable insights. Such efforts are collaborative in nature and can often span multiple teams or organizations, where one team uses datasets generated by another team in their analysis tasks. Moreover, data science tasks are also iterative in nature – data is updated regularly, algorithms are improved, or new approaches are tried out to explore their impact. An update to a dataset, for instance, might lead to updating all the tasks or “pipelines” that depend on it. An increasingly important consideration for the underlying data management systems is that all of these datasets and their versions over time need to be stored and queried for a variety of reasons including auditing, provenance, transparency, accountability, introspective analysis, and backups [1, 2, 3, 4, 5]. We outline an example data science workflow that highlights the iterative and collaborative aspects of the activity.

Example 1. Genome assembly of a whole genome sequence dataset is a complex task — apart from huge computational demands, it is not always known a priori which tools and settings will work best on the available sequence data for an organism [6].
The process typically involves testing multiple tools, parameters, and approaches to produce the best possible assembly for downstream analysis. The assemblies are evaluated on a host of metrics (e.g., the N50 statistic), and the choice of which assembly is the best one is also not always clear. One potential sequence of steps might be: Sequenced reads (FastQ files) → Error correction tools (Quake, Sickle, etc.) → Input analysis, k-mer calculation (KmerGenie) → Assembly tool (SOAPdenovo, ABySS) → Assembly analysis and selection (QUAST).

A group of researchers may collaboratively try to analyze this data in various ways, building upon the work done by the others in the team, but also trying out different algorithms or tools. New data is also likely to be ingested at various points, either as updates/corrections to the existing data or as results of additional experiments. As one can imagine, the ad hoc nature of this process and the desire not to lose any intermediate synthesized result means that the researchers will be left with a large number of datasets and analyses, mostly with large overlaps between them, and complex derivational dependencies.

We observe that although there is a wealth of data science research addressing stand-alone data analysis issues or building integrated tools for analysis, the dataset management aspects of these tools are poor, requiring data scientists to use ad hoc mechanisms to record and reason about datasets. When people collaborate over data, it is not uncommon to have hundreds or thousands of versions of collected, curated, and derived datasets, at various degrees of structure (relational/JSON to fully unstructured). Ad hoc solutions to this problem can hardly adapt to satisfy the long-term tracking of all such complex data and metadata. Once people start doing data analysis, they need sustainable and scalable tools that allow them to spend little to no effort on maintaining, sharing, and tracking changes to datasets, and instead focus on scientific and knowledge discovery.

In this dissertation, we propose new methods and tools towards the goal of making the management of such datasets hassle-free, and address some of the fundamental problems in this space. We build a new dataset version control system (DVCS) to keep track of all the dataset versions and dataset provenance in one place, making it easy to analyze or query this information. By keeping all this information in one place, we can enable a rich set of functionality that can simplify the lives of data scientists, make it easier to identify and eliminate errors, and decrease the time to obtain actionable insights. A DVCS can keep track of all the datasets and their thousands of versions, storing them compactly, while still being able to retrieve them efficiently, and query them to enable rich introspection capabilities. The goal of this dissertation is to develop a formalism for managing a large number of dataset versions, i.e., storing and querying them, and to design algorithms that use the formalism to enable efficient execution of a large class of introspection queries over these versions.

1.1 Challenges in building a DVCS

The first challenge that we consider in this dissertation is how to efficiently store and retrieve the thousands of dataset versions. Because of the nature of the tasks that create these versions, not all versions are entirely different from one another, and there is massive redundancy in their contents.
If the difference, or delta, between two similar versions can be computed, it is possible to save on storage costs by keeping only one version and the delta, instead of two versions. The large number of overlapping versions makes it imperative to exploit such deltas to compactly store all of them. However, this gives rise to the storage–recreation tradeoff: the more storage we use, the faster it is to recreate or retrieve versions, while the less storage we use, the slower it is to recreate or retrieve versions.

The second challenge is that a versioning API that simply provides put and get facilities is very limiting, and is not a good fit for data scientists to understand and reason about data contained within versions. Tools like git/svn lack sophisticated query interfaces. For example, in the data science scenario mentioned above, one may wish to write a query that finds the intersection of a set of versions, representing, for instance, the final synthesized result of different pipelines. Identifying the design goals of a language that enables users to traverse, compare, and query the version metadata, version history, and the data itself, in a holistic manner, is an important and necessary problem for a DVCS.

The third challenge is designing efficient execution algorithms for the tasks mentioned above. For instance, users of a DVCS naturally want tools that help them reason about multiple versions of datasets, and as such, they will pose queries that often access more than one version at a time (e.g., for finding similarities among a set of versions). Existing solutions require users to first get/checkout, i.e., fully reconstruct, every version of a dataset or file involved in a query before running any queries on the data stored within them. This approach is less than ideal, particularly when the individual versions are large and the users need to access multiple versions for their analysis task. First, irrespective of the size of the query result, this approach entails creating all the input versions before query processing can begin, resulting in large memory and/or I/O usage. Second, it requires users to maintain another system to assist in executing the queries. Third, this approach fails to exploit the fact that most datasets evolve through changes that are small relative to the dataset sizes, and knowing about these changes can not only help us towards our storage goals, but also enable answering queries efficiently.

1.2 Dissertation Overview and Contributions

Given the challenges described above, the primary goal of this dissertation is to develop a robust framework to model the fundamental problems for dataset version management, and to design new techniques and algorithms that enable users to execute rich queries efficiently. We have built a prototype system, called DEX, whose architecture is shown in Figure 1.1. The techniques summarized below serve as building blocks of this architecture. The two main components that this dissertation focuses on are the Storage Graph Builder and the Query Processor module. The DEX system is built on top of git and has three major components: (a) a set of command line utilities, DEX CLI, written in Python, to allow the user to interact with the repository in the form of the standard add, commit, checkout, etc., commands (similar to git), (b) the Storage Graph Builder, which decides how best to store the collection of dataset versions, and (c) the Query Processor, written in Java, that executes user queries against the compressed representation.
DEX CLI passes version management tasks not pertaining to large datasets through to git; the user may specify a file to be managed by DEX through a flag to the add command, and any tasks pertaining to those files are sent to the Storage Graph Builder (in case of add or commit) or the Query Processor.

Figure 1.1: System Architecture of DEX

The main parts of the dissertation, as well as the associated chapters and published papers, are summarized below.

1.2.1 Efficient Storage and Retrieval

Given the high overlap and duplication among the datasets, it is attractive to consider using delta encoding to store the datasets in a compact manner. Delta encoding is a cornerstone of many no-overwrite storage systems that are focused on archiving and maintaining vast quantities of datasets (simply put, collections of files). Archival and backup systems often store multiple versions or snapshots of large datasets that have significant overlap across their contents using deltas.

At a high level, delta encoding consists of representing a target version as the mutation, or delta, from the content of a source version. Typically, source and target versions are selected such that they have a large overlap across their contents and hence their delta is small. Furthermore, the source version itself may be represented as a delta from another version, and so on, creating a “graph” of versions and deltas. The compressed storage is obtained by storing only a few select versions, commonly referred to as materialized versions, and deltas (instead of the versions they represent) in this graph, such that it is possible to re-create any version by walking the path of deltas starting from a materialized file and ending at the desired file. However, this gives rise to the storage–recreation tradeoff: the more storage we use, the faster it is to recreate or retrieve versions, while the less storage we use, the slower it is to recreate or retrieve versions. We illustrate this trade-off via an example.

Example 2. Figure 1.2(i) displays a version graph, indicating the derivation relationships among 5 versions. Let V1 be the original dataset. Say there are two teams collaborating on this dataset: team 1 modifies V1 to derive V2, while team 2 modifies V1 to derive V3. Then, V2 and V3 are merged and give V5. As presented in Figure 1.2, V1 is associated with ⟨10000, 10000⟩, indicating that V1’s storage cost and recreation cost are both 10000 when stored in its entirety (we note that these two are typically measured in different units – see the second challenge below); the edge (V1 → V3) is annotated with ⟨1000, 3000⟩, where 1000 is the storage cost for V3 when stored as the modification from V1 (we call this the delta of V3 from V1) and 3000 is the recreation cost for V3 given V1, i.e., the time taken to recreate V3 given that V1 has already been recreated.

One naive solution to store these datasets would be to store all of them in their entirety (Figure 1.2 (ii)). In this case, each version can be retrieved directly, but the total storage cost is rather large, i.e., 10000 + 10100 + 9700 + 9800 + 10120 = 49720. At the other extreme, only one version is stored in its entirety while the other versions are stored as modifications or deltas to that version, as shown in Figure 1.2 (iii).
The total storage cost here is much smaller (10000 + 200 + 1000 + 50 + 200 = 11450), but the recreation cost is large for V2, V3, V4 and V5. For instance, the path (V1 → V3 → V5) needs to be accessed in order to retrieve V5, and the recreation cost is 10000 + 3000 + 550 = 13550 > 10120. Figure 1.2 (iv) shows an intermediate solution that trades off increased storage for reduced recreation costs for some versions. Here we store versions V1 and V3 in their entirety and store modifications for the other versions. This solution exhibits a higher storage cost than solution (iii) but a lower one than (ii), and still results in significantly reduced retrieval costs for versions V3 and V5 over (iii).

Figure 1.2: (i) A version graph over 5 datasets – annotation ⟨a, b⟩ indicates a storage cost of a and a recreation cost of b; (ii, iii, iv) three possible storage graphs

Despite the fundamental nature of the storage–retrieval problem, there is surprisingly little prior work on formally analyzing this trade-off and on designing techniques for identifying effective storage solutions for a given collection of datasets. Version Control Systems (VCS) like Git, SVN, or Mercurial, despite their popularity, use fairly simple algorithms underneath, and are known to have significant limitations when managing large datasets [7, 8]. Much of the prior work in the literature focuses on a linear chain of versions, or on minimizing the storage cost while ignoring the recreation cost.

We address this problem in detail in Chapter 3. We show that simply using delta compression to minimize storage space leads to very high latencies while retrieving specific datasets. We also show that the delta compression heuristics used by popular VCSs like Git and SVN are ineffective at storing datasets, both in terms of resources consumed and the quality of the solution produced. Thus, to address the first challenge mentioned above, we propose novel problem formulations towards understanding the storage and recreation tradeoff in a principled manner. We formulate several optimization problems and show that most variations are NP-Hard. As a result, we design several efficient heuristics that are effective at exploring this trade-off, and present an extensive experimental evaluation over several synthetic and real-world workloads demonstrating the effectiveness of our algorithms at handling large problem sizes.

1.2.2 A Language to Query Provenance and Versions in a Unified Manner

To be truly useful for collaborative data science, we also need the ability to specify queries and analysis tasks over the versioning and provenance information in a unified manner. In the genome assembly example mentioned earlier, there is a wide range of queries that may be of interest. Simple queries include: (a) identifying versions based on the metadata information (e.g., authors); (b) identifying versions that were derived (directly or through a chain of derivations) from a specific outdated version; and (c) finding versions that differ from their predecessor version by a large number of records.
More complex queries include: (d) finding versions where the data within satisfies certain aggregation conditions; (e) finding the intersection of a set of versions (representing, e.g., the final synthesized results of different pipelines); and (f) finding versions that contain any records derived from a specific record in a version.

These examples illustrate some of the key requirements for a query language, namely the ability to:

• Traverse the version graph (i.e., version-level provenance information) and query the metadata associated with the versions and the derivation/update edges.
• Compare several versions to each other in a flexible manner.
• Run declarative queries over data contained in a version, to the extent allowable by the structure in the data.
• Query the tuple-level provenance information, when available, in conjunction with the version-level provenance information.

In Chapter 4, we describe a language, called VQUEL, that aims to provide these features. VQUEL is largely a generalization of the Quel language (while also introducing certain syntactic conveniences that Quel does not possess), and combines features from GEM and path-based query languages. This means that VQUEL is a full-fledged relational query language, and in addition, it enables the seamless querying of the data present in versions, versioning derivation relationships, as well as versioning metadata.

1.2.3 Delta-aware Query Execution

While delta encoding is an effective method to archive large amounts of data in many immutable data stores, many of these stores require users to “check out” complete file/dataset versions in order to manipulate them. In Chapter 5 and Chapter 6, we show that this approach is less than ideal for a variety of rich queries that compare or work with data from across multiple versions, and that significant performance benefits can be had if we make use of the pre-computed deltas during query processing. In particular, in these two chapters, we present a systematic study of the problem of supporting rich analysis queries over delta-oriented storage engines, and describe the Query Execution module of DEX. We focus on the storage design and implementation of DEX for a class of semi-structured datasets that we call datafiles, and for a class of basic queries that includes multi-version checkouts, intersections, unions, t-threshold, and single block select-project-join queries.

A datafile is a file whose contents can be seen as a set of records, i.e., the order of records within a datafile is immaterial, and no two records in a datafile are identical. Examples of such files include CSV files, JSON documents, and log files, to name a few; these constitute a large fraction of files in a typical data lake, and hence are particularly suited to our study. A common method to represent a delta between two datafiles is to maintain the “deletions” and “additions” of records required to go from one datafile to the other. If a datafile has additional structure, namely a user-supplied primary key, along with details on how to parse a record into its component fields or columns, we use a more compact column-based delta format.

In Chapter 5, we develop a general cost-based optimization framework based on key algebraic properties regarding the composition of the deltas. The result of this framework is a compact algebraic expression that confines query execution to a small set of deltas, as the sketch below illustrates.
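To make the set-of-records abstraction and the deletions/additions delta format concrete, the following Python fragment is a minimal, illustrative sketch (it is not the actual DEX implementation, and the names in it are hypothetical). It shows how a set-backed delta can recreate a target version from a source version, and how a simple query such as the intersection of two adjacent versions can be answered from the stored version and the delta alone.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SetDelta:
    """Delta from a source datafile to a target datafile,
    both viewed as sets of (hashable) records."""
    deletions: frozenset  # records in the source but not in the target
    additions: frozenset  # records in the target but not in the source

def diff(source: set, target: set) -> SetDelta:
    """Compute the set-backed delta that turns `source` into `target`."""
    return SetDelta(deletions=frozenset(source - target),
                    additions=frozenset(target - source))

def apply_delta(source: set, d: SetDelta) -> set:
    """Recreate the target version from the source version and the delta."""
    return (source - d.deletions) | d.additions

# Example: V2 is derived from V1 by a small change.
V1 = {("r1", 10), ("r2", 20), ("r3", 30)}
V2 = {("r1", 10), ("r2", 25), ("r3", 30), ("r4", 40)}
d12 = diff(V1, V2)

assert apply_delta(V1, d12) == V2

# The intersection of V1 and V2 can be computed without recreating V2:
# every record of V1 that the delta does not delete is also in V2.
assert V1 & V2 == V1 - d12.deletions
```

The last assertion is the kind of rewriting the cost-based framework generalizes: rather than materializing both input versions, the operator is pushed onto the (typically much smaller) delta.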
As a side effect, the computational cost is dependent on the size and the number of deltas in the expression (which are typically small), in contrast to the size of the input datasets. We develop optimal algorithms for executing single-file or multi-file checkout queries assuming reasonable restrictions on the evaluation plan search space. We also develop a series of intuitive transformation rules that help simplify the search space for intersection, union, and t-threshold queries, and use them in conjunction with cost-based solutions for base cases to develop effective search algorithms. We present a comprehensive evaluation against synthetic datasets of varying characteristics. Our results show that our methods perform exceedingly well compared to the baselines, even for simple queries like single-file checkouts.

In Chapter 6, we extend the above framework to execute single block select-project-join (SPJ) queries, such as those that can be formulated in VQUEL, over multiple versions stored in DEX. The key design decision of our approach is to execute the various query operators exactly once for each unique tuple in the set of versions, rather than executing the query once for each version. We propose using a new representation for tuples, called v-tuples, during query processing. In addition to holding information about the fields in a tuple, a v-tuple also stores information about which of the queried versions the respective tuple appears in. We show how to efficiently construct such v-tuples from the delta-encoded representation and present our modifications to the traditional Scan and Join physical operators in order to process v-tuples efficiently. We present an extensive evaluation of this approach on multiple synthetic datasets. Our results show that richer query processing is viable in DEX, with significant performance benefits over one-version-at-a-time execution.

Delta Representations and Tradeoffs: A key question in DEX is selecting the delta variant, i.e., the particular format/algorithm for computing the delta between two files. This is because different delta formats are appropriate for different types of files: a UNIX-style line-by-line diff is a common delta format for plain text files, while an XOR is more suited to numerical array-oriented data. Exploiting the structure in the data, if known, can often lead to better deltas (e.g., for XML [9], or relations [10]). Column-based deltas may be more appropriate when a large number of records are changed slightly, e.g., due to a schema change. Furthermore, a particular delta format may be directed or undirected: if a delta Δ between source file A and target file B is directed, it may only be used to recreate B given A, and not vice versa. An undirected delta between two files, on the other hand, can accept either file as source and recreate the other.

The desire to execute queries directly on deltas (as we do in this work) brings another dimension to this choice. There is an inherent tension between the amount of information stored in a delta and our ability to push query execution onto it. In this work, we pick different delta formats depending on the class of queries that the user wishes to support. Supporting richer queries and compact deltas requires us to make assumptions about the datafile schema, as follows:

• In Chapter 3, we consider storing and retrieving individual files (in their entirety). As such, we do not require any assumptions or restrictions on the file or delta format, and any suitable delta algorithm can be used.
However, we do require the storage cost and the access cost, or their estimates, of the resulting delta.

• In Chapter 5, we consider a class of set-based query operators on the past versions. Hence we require that every version of a datafile, at commit time, can be separated into a set of records. The delta format is then also based on the set-of-records abstraction, to aid in efficient query execution.

• In Chapter 6, we consider select-project-join queries across a large number of past versions. In order for such queries to make sense, the datafile, at commit time, must be separable into records, and every record must be separable into its constituent fields/columns. In order to use a compact column-based delta format, we also require the user to specify a column or a set of columns as the primary key in the datafile.

Chapter 2: Related Work

This section surveys the state of the art in managing versioned data. Our goal in this section is to give a brief survey of the many decades of work done in this area and to put our contributions in context. We begin with prior work performed towards the goal of supporting efficient storage of multiple versions of data, followed by a review of recent solutions in the Version Control Systems space. Thereafter, we summarize work done towards building richer interfaces to query such data and to execute those queries efficiently.

2.1 Enabling Multi-Versioned Storage

Relational Database Management Systems. Many database applications require that multiple versions of records be stored and retrieved, and as such, there has been extensive research on temporal and versioned databases and their applications. The effort to satisfy the diverse needs of these applications has led to a number of versioning solutions, e.g., [11, 12, 13, 14, 15, 16]. Much work, especially earlier papers, focused on theoretical foundations, not on practical considerations such as storage efficiency and indexing versioned data. We briefly review some of the work done in this area; for a detailed survey, we refer the reader to [12]. In addition, extensive bibliographies have also been compiled, see [17] as a starting point.

There was some conceptual temporal database work in the 1980s, see [18, 19] ([18] contains references to even earlier work), on developing temporal data models and temporal query languages. The most basic concepts that a relational temporal database is based upon are valid time and transaction time, considered orthogonal to each other. Valid time denotes the time period during which a fact is true with respect to the real world. Transaction time is the time when a fact is stored in the database. The first relational database system offering temporal functionality was Postgres [20]. Postgres used R-trees [21] to index historical data, with recent data residing in a B+Tree. This separation is important, as a general multi-attribute index like an R-tree has difficulty supporting data that is current and hence does not yet have an end time. The movement of data from the B+tree to the R-tree occurs lazily.

Transaction time functionality has also received some industrial interest, particularly from Oracle [22] and Microsoft. Oracle’s FlashBack queries allow the application to access prior transaction time states of the database and to retrieve all the versions of a row between two transaction times.
It also allows for “point in time” recovery, i.e., tables and databases can be rolled back to a previous transaction time, discarding all changes after that time; this functionality can be used to deal with bad user transactions. They do not index historical versions, however, so historical version queries must go through current time versions and then search backward “linearly” in time. More recently, Oracle supports the “Total Recall” feature in Oracle 11g [23]. Building on FlashBack, the Total Recall archive is read-only and supports long-term archiving of transaction time versions, including migration of the versions to archival media. A form of compression is supported to reduce the storage cost of retaining more extensive database history. Immortal DB, which was built into Microsoft SQL Server, integrated a temporal indexing technique called the TSB-tree [14, 15] to provide high performance access and update for both current and historical data.

Buneman et al. [24] proposed an archiving technique where all versions of the data are merged into one hierarchy. An element appearing in multiple versions is stored only once, along with a timestamp. This technique of storing versions is in contrast with techniques where retrieval of certain versions may require undoing the changes (unrolling the deltas). The hierarchical data and the resulting archive are represented in XML format, which enables the use of XML tools such as an XML compressor for compressing the archive. It was not, however, a full-fledged version control system representing an arbitrary graph of versions; rather, it focused on algorithms for compactly encoding a linear chain of versions.

MOLAP systems store data in multidimensional arrays [25] with a particular focus on aggregation queries. These systems exploit data structures to efficiently compute rollups. The MOLAP system in [26] supports versions to represent changes to the data sources that should be propagated to the data warehouse periodically. But the versioning system is designed to benefit the concurrency control mechanism in order to minimize contention between query and maintenance transactions.

Scientific Databases. In many fields of science, multidimensional arrays rather than flat tables are standard data types because data values are associated with coordinates in space and time. As a result, many specialized array-processing systems have emerged, e.g., [27, 28]. As noted in [29], an important requirement that scientists have for these systems is the ability to create, archive, and explore different versions of their arrays. None of the temporal database solutions mentioned above are a good fit here because (1) simulating arrays on top of relations can be inefficient [29], and (2) their internal data structures are not specialized for time travel over array data. Hence, a no-overwrite storage manager with efficient support for querying old versions of an array is a critical component of an array database management system. In recent work, Seering et al. [30] presented a disk-based versioning system using efficient delta encoding to minimize space consumption and retrieval time in array-based systems.

Other Data Models and Deduplication Schemes. Many solutions have been proposed to support multiple versions of complex data, e.g., XML [31], object-oriented [32], and spatio-temporal data [33]. Khurana and Deshpande [34] present an approach for managing historical graph data for large information networks, and for executing snapshot retrieval queries on them.
Quinlan and Dorward [35] propose an archival “deduplication” storage system that identifies duplicate blocks across files and stores them only once to reduce storage requirements. Zhu et al. [36] present several optimizations on the basic theme. Douglis and Iyengar [37] present several techniques to identify pairs of files that could be efficiently stored using delta compression even if there is no explicit derivation information known about the two files. Ouyang et al. [38] studied the problem of compressing a large collection of related files by performing a sequence of pairwise delta compressions. They proposed a suite of text clustering techniques to prune the graph of all pairwise delta encodings and find the optimal branching (i.e., MCA) that minimizes the total weight. Similar dictionary-based reference encoding techniques have been used by Chan and Woo [39] to efficiently represent a target web page in terms of additions/modifications to a small number of reference web pages. Burns and Long [40] present a technique for in-place re-construction of delta-compressed files using a graph-theoretic approach. Kulkarni et al. [41] present a more general technique that combines several different techniques to identify similar blocks among a collection of files, and use delta compression to reduce the total storage cost (ignoring the recreation costs).

Our Contributions. Most of the work in temporal relational database management systems has focused its efforts on efficiently storing a “linear chain” of versions, unlike our work, which requires supporting an arbitrary DAG of versions. Moreover, unlike our framework, which does not make any assumptions about how the versions were modified, many schemes assume knowledge of the specific records or cells that are updated. The general concept of multi-versioning has also been used extensively in commercial databases to provide snapshot isolation [42]. However, these methods only store enough history to preserve transactional semantics, whereas we preserve all historical branches and derivation relationships to ensure integrity of the version graph. Finally, prior efforts that have looked at the problem of minimizing the total storage cost for storing a collection of related files do not typically consider the recreation cost or the tradeoffs between the two. Their design also does not consider querying data in the past versions using rich query interfaces, such as the ones available in temporal databases.

2.2 Version Control Systems (VCS)

Version Control Systems (VCS) have a long history in Computer Science. A VCS records changes to a file or set of files over time so that any user can recall a specific version later. Versioning techniques such as forward and backward delta encoding and the use of multi-version B-trees have been implemented in various legacy systems. git [43] is one of the conventional version control systems and is believed to be faster and more disk-efficient than other similar version control systems. The major difference between git and any other VCS, such as Subversion (svn) or Concurrent Versions System (cvs), is the way git thinks about its data. While most other systems store information as a list of file-based changes, git thinks of its data more like a set of snapshots of a miniature filesystem and stores changes at the snapshot level. To be efficient, if a file has not changed, git doesn’t store the file again; instead, it stores just a link to the previous identical file it has already stored.
Despite their popularity, these systems largely use fairly simple algorithms underneath that are optimized to work with modest-sized source code files, and their on-disk structures are optimized to work with line-based diffs. These systems are known to have significant limitations when handling large files and large numbers of versions [8]. As a result, a variety of extensions like git-annex [44], Git Large File Storage [45], etc., have been developed to make them work reasonably well with large files. These extensions replace large files with text pointers inside Git, while storing the file contents on a remote server like GitHub.com or Amazon S3.

Our Contributions. The initial design of DEX is modeled after conventional version control software such as git. In particular, the concepts of a no-update model and of differencing stored files against each other for more efficient storage have both been explored extensively by such systems. We build upon this work to support a superset of the conventional version control API for large datasets. We show that the underlying algorithms in git and Subversion can be extremely inefficient, both in terms of storage used and resource (memory/IO) consumption.

2.3 Query Languages

Abdessalem and Jomier [46] introduce VQL, a language designed for querying data stored in multiversion databases. VQL is based on a first-order calculus and provides users with the ability to navigate through object versions modeled by the database. More recently, new temporal constructs have been pushed into the SQL standard by the main DBMS vendors [47]. This work, however, does not directly apply to our setting because the constructs assume a linear chain of versions — as noted earlier, we could have an arbitrary branching structure of versions.

While there has been substantial work on query languages for provenance, ranging from adaptations of SQL [48], Prolog [49, 50], and SPARQL [51, 52] to specialized languages such as QLP [53, 54], PQL [55], and ProQL [56] ([57], [58] have additional examples), much of this work centers on well-defined workflows and tuple-based provenance rather than collaborative settings where multiple users interact through a derivation graph of versions in an ad hoc manner.

Our Contributions. We design a new query language, called VQUEL, that is capable of querying dataset versions, dataset provenance (e.g., which datasets a given dataset was derived from), and record-level provenance (if available). Our design draws from constructs introduced in the historical Quel [59] and GEM [60] languages, neither of which had a temporal component.

2.4 Query Execution

There are a large number of proposed indexing techniques for temporal data, e.g., [33, 61, 62]. Salzberg and Tsotras [63] present a comprehensive survey of indexing structures for temporal databases. They also present a classification of the different queries that one may ask over a temporal database. Lomet et al. [14, 15] integrated a temporal indexing technique, the TSB-tree, into Immortal DB (which was built into Microsoft SQL Server) to serve as the core access method. The TSB-tree provides high performance access and update for both current and historical data. Jouini and Jomier [64] studied the problem of efficiently indexing data with “branched evolution”. The main contributions here are the extension of temporal index structures to data with branched evolution, and an analysis method that estimates the performance of the different index structures and provides guidelines for the selection of the most appropriate one.
Soroush and Balazinska [65] present an indexing technique to support “time travel” queries for scientific arrays. A key aspect of their technique is that they can support approximate queries that can quickly identify which versions are relevant to a user and return the approximate content of these versions.

Queries in delta-based storage. Delta encoding has been used in a variety of systems to provide trade-offs among time, space, and compression performance, e.g., to reduce data transfer time for text/HTTP objects [66], to reduce access time in a file system [67], and to store many versions of generated artifacts in source code control systems (e.g., git) or other types of data [10, 30, 68]. However, the focus of many of the existing delta encoding schemes has been to access the objects in their entirety, and to the best of our knowledge, they have not considered the tradeoff between storage and “computability over deltas”. Even version control systems that provide functionality to compare multiple objects, e.g., merge, diff, etc., first recreate all required files before operating upon them. Recently, [65] presented an indexing technique to support “time travel” queries for scientific arrays wherein they support approximate queries that can quickly identify which versions are relevant to a user and return the approximate content of these versions. However, they did not consider queries that compare the contents of two or more array versions.

Deltas and computing. The concept of making deltas “first-class citizens” was explored in Heraclitus [69]. To support “what-if” scenario analysis, they provided general-purpose constructs for creating, accessing, and combining deltas. In the specific realization of their paradigm for the relational model, deltas are a set of signed atoms where the positive atoms correspond to “insertions” and the negative atoms correspond to “deletions”. In addition, the deltas have structure and can be manipulated directly by constructs in user programs, e.g., to delete all records satisfying a predicate. In contrast, our use of deltas is at the physical level and not exposed to the users. They do not consider optimizing the different types of queries against a delta storage. Executing queries with hypothetical state updates was also considered in [70]. Here the state updates (or deltas) were allowed to be expressions, and the authors considered rewriting such queries into an optimized form based on their novel rules for substitution and the rules of relational algebra. Such rules are, however, not applicable in our setting. Record-based deltas were also used in [71, 72] to provide the capability of sharing data and updates among different participants. However, they focused on formalizing the semantics of the update exchange process, e.g., mapping updates across schemas and filtering them according to local trust policies, and the challenges introduced therein.

Connections to materialized views. Using pre-computed deltas to answer user queries is, at a high level, similar to the problems that have been considered in the context of materialized view and index selection to speed up query processing.
Broadly speaking, research in this area has focused on three issues: (i) determining the search space or class of views to consider for materialization, (ii) choosing a subset of views and indexes to materialize depending on various constraints like storage overhead, maintenance overhead, effectiveness on the query workload, etc. [73, 74], and (iii) quickly determining which views to consider to answer a given query [75, 76]. In our problem setting, set-backed deltas can be considered a form of materialized views (which can be used to reconstruct base relations), with our work addressing the problem of how to use such views to answer queries. We also design a logical algebra over the specific delta formats that allows us to combine a large number of deltas to answer one query.

Evaluating set expressions. Several algorithms and data structures have been proposed in the literature to solve union, intersection, and difference problems on sets [77, 78, 79, 80] by minimizing the number of comparisons required. Although the ordered list representation is the most common, some algorithms also consider representing sets in other data structures, e.g., skip lists [81], machine word-based representations [82], etc., to obtain additional speedup. A comparison of a few of these methods is available in [83, 84]. Speeding up set operations is largely orthogonal to our approach, and we can make use of some of these techniques as additional operators with the appropriate cost model. As mentioned earlier, in this work we use an adaptive set intersection algorithm that was shown to have reasonably good performance in [83] without requiring any preprocessing step. The larger problem here, however, is how to efficiently evaluate a set expression consisting of union, intersection, and difference. [85] consider evaluating union-intersection expressions in a worst-case efficient way for a non-comparison based model. However, their approach uses hash-based dictionaries, which would require an additional pre-processing step, and it remains an open problem whether their results can be extended to handle set difference. Recently, [86] showed that, for a similar cost model, a union-intersection expression can be rewritten to perform intersections before unions, often with reduced cost. Their approach, however, did not consider rewrites in the presence of set difference.

Chapter 3: Storage–Recreation Tradeoff

3.1 Introduction

In this chapter, we present a formal study of the problem of deciding how to jointly store a collection of dataset versions, provided along with a version or derivation graph. Specifically, we focus on the problem of trading off storage costs and recreation costs in a principled fashion. Aside from being able to handle the scale, both in terms of dataset sizes and the number of versions, there are several other considerations that make this problem challenging.

• Different application scenarios and constraints lead to many variations on the basic theme of balancing storage and recreation cost (see Table 3.1). The variations arise both out of different ways to reconcile the conflicting optimization goals, as well as because of the variations in how the differences between versions are stored and how versions are reconstructed. For example, some mechanisms for constructing differences between versions lead to symmetric differences (either version can be recreated from the other version) — we call this the undirected case. The scenario with asymmetric, one-way differences is referred to as the directed case.
• Similarly, the relationship between storage and recreation costs leads to significant variations across different settings. In some cases the recreation cost is proportional to the storage cost (e.g., if the system bottleneck lies in the I/O cost or network communication), but that may not be true when the system bottleneck is CPU computation. This is especially true for sophisticated differencing mechanisms where a compact derivation procedure might be known to generate one dataset from another.

• Another critical issue is that computing deltas for all pairs of versions is typically not feasible. Relying purely on the version graph may not be sufficient, and significant redundancies across datasets may be missed.

• Further, in many cases, we may have information about relative access frequencies indicating the relative likelihood of retrieving different datasets. Several baseline algorithms for solving this problem cannot be easily adapted to incorporate such access frequencies.

The key contributions of this chapter are as follows.

• We formally define and analyze the dataset versioning problem and consider several variations of the problem that trade off storage cost and recreation cost in different manners, under different assumptions about the differencing mechanisms and recreation costs (Section 3.2). Table 3.1 summarizes the problems and our results. We show that most of the variations of this problem are NP-Hard (Section 3.3).

            Storage Cost   Recreation Cost                 Undirected Case, Δ = Φ        Directed Case, Δ = Φ / Δ ≠ Φ
Problem 1   minimize C     Ri < ∞, ∀i                      PTime, Minimum Spanning Tree (all three cases)
Problem 2   C < ∞          minimize max{Ri | 1 ≤ i ≤ n}    PTime, Shortest Path Tree (all three cases)
Problem 3   C ≤ β          minimize ∑ni=1 Ri               NP-hard                       NP-hard, LMG Algorithm
Problem 4   C ≤ β          minimize max{Ri | 1 ≤ i ≤ n}    LAST Algorithm†               NP-hard, MP Algorithm
Problem 5   minimize C     ∑ni=1 Ri ≤ θ                    NP-hard                       NP-hard, LMG Algorithm
Problem 6   minimize C     max{Ri | 1 ≤ i ≤ n} ≤ θ         LAST Algorithm†               NP-hard, MP Algorithm

Table 3.1: Problem Variations With Different Constraints, Objectives and Scenarios.

• We provide two light-weight heuristics: one when there is a constraint on the average recreation cost, and one when there is a constraint on the maximum recreation cost; we also show how we can adapt a prior solution for balancing minimum spanning trees and shortest path trees for undirected graphs (Section 3.4).

• We implement the proposed algorithms in our prototype DEX system. We present an extensive experimental evaluation of these algorithms over several synthetic and real-world workloads, demonstrating the effectiveness of our algorithms at handling large problem sizes (Section 3.5).

3.2 Problem Overview

In this section, we first introduce essential notations and then present the various problem formulations. We then present a mapping of the basic problem to a graph-theoretic problem, and also describe an integer linear program to solve the problem optimally.

3.2.1 Essential Notations and Preliminaries

Version Graph. We let {V1, V2, …, Vn} be a collection of n versions. The derivation relationships between versions are represented or captured in the form of a version graph: a directed edge from Vi to Vj represents that Vj was derived from Vi (either through an update operation, or through an explicit transformation). Since branching and merging are permitted in DEX (admitting collaborative data science), the version graph is a DAG (directed acyclic graph) instead of a linear chain. For example, Figure 1.2 represents a version graph, where V2 and V3 are derived from V1 separately, and then merged to form V5.
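A version graph of this form can be represented directly as a DAG keyed by version identifiers. The following is a minimal sketch (not the representation used inside DEX; the class and method names are our own) of recording derivation edges, including merges with multiple parents:

# Minimal sketch of a version graph as a DAG; illustrative names only.
class VersionGraph:
    def __init__(self):
        self.parents = {}            # version id -> list of versions it was derived from

    def add_version(self, vid, derived_from=()):
        # One parent for an update/transformation, several parents for a merge.
        self.parents[vid] = list(derived_from)

    def roots(self):
        # Versions with no recorded parents (e.g., the initial version).
        return [v for v, ps in self.parents.items() if not ps]

# The portion of Figure 1.2 described in the text: V2 and V3 are derived
# from V1 and later merged to form V5.
g = VersionGraph()
g.add_version("V1")
g.add_version("V2", derived_from=["V1"])
g.add_version("V3", derived_from=["V1"])
g.add_version("V5", derived_from=["V2", "V3"])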
Storage and Recreation. Given a collection of versions, we need to reason about the storage cost, i.e., the space required to store the versions, and the recreation cost, i.e., the time taken to recreate or retrieve the versions. For a version Vi, we can either:

• Store Vi in its entirety: in this case, we denote the storage required to record version Vi fully by Δi,i. The recreation cost in this case is the time needed to retrieve this recorded version; we denote that by Φi,i. A version that is stored in its entirety is said to be materialized.

• Store a “delta” from Vj: in this case, we do not store Vi fully; we instead store its modifications from another version Vj. For example, we could record that Vi is just Vj but with the 50th tuple deleted. We refer to the information needed to construct version Vi from version Vj as the delta from Vj to Vi. The algorithm giving us the delta is called a differencing algorithm. The storage cost for recording the modifications from Vj, i.e., the size of the delta, is denoted by Δj,i. The recreation cost is the time needed to recreate the recorded version given that Vj has been recreated; this is denoted by Φj,i.

Thus the storage and recreation costs can be represented using two matrices Δ and Φ: the entries along the diagonal represent the costs for the materialized versions, while the off-diagonal entries represent the costs for deltas. From this point forward, we focus our attention on these matrices: they capture all the relevant information about the versions for managing and retrieving them.

Delta Variants. Notice that by changing the differencing algorithm, we can produce deltas of various types:

• for text files, UNIX-style diffs, i.e., line-by-line modifications between versions, are commonly used;
• we could have a listing of a program, script, SQL query, or command that generates version Vi from Vj;
• for some types of data, an XOR between the two versions can be an appropriate delta; and
• for tabular data (e.g., relational tables), recording the differences at the cell level is yet another type of delta.

Furthermore, the deltas could be stored compressed or uncompressed. The various delta variants lead to the various dimensions of the problem that we describe subsequently.

The reader may be wondering why we need to reason about two matrices Δ and Φ. In some cases, the two may be proportional to each other (e.g., if we are using uncompressed UNIX-style diffs). But in many cases, the storage cost of a delta and the recreation cost of applying that delta can be very different from each other, especially if the deltas are stored in a compressed fashion. Furthermore, while the storage cost is more straightforward to account for, in that it is proportional to the bytes required to store the deltas between versions, the recreation cost is more complicated: it could depend on the network bandwidth (if versions or deltas are stored remotely), the I/O bandwidth, and the computation costs (e.g., if decompression or running of a script is needed).
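To make one of these delta variants concrete, the short sketch below uses Python's difflib to compute a UNIX-style, line-by-line delta between two toy in-memory versions and measures its size; it is an illustration only (DEX's differencing machinery is not shown here), and the simplifying assumption that Φ is proportional to Δ is ours.

import difflib

def line_delta(src_lines, dst_lines):
    # A UNIX-style, line-by-line delta from one version to another.
    return list(difflib.unified_diff(src_lines, dst_lines, lineterm=""))

def size_in_bytes(lines):
    # Uncompressed size of a version or a delta, in bytes.
    return sum(len(line) + 1 for line in lines)

# Two toy versions of a small CSV dataset.
v1 = ["id,age", "1,34", "2,61", "3,29"]
v2 = ["id,age", "1,34", "3,29", "4,50"]        # one row deleted, one row added

delta_1_to_2 = line_delta(v1, v2)

# Diagonal entries of the matrices hold full-version costs; off-diagonal
# entries hold delta costs (here we only illustrate the Δ side).
print("Δ_2,2 =", size_in_bytes(v2), "  Δ_1,2 =", size_in_bytes(delta_1_to_2))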
Example 3 Figure 3.1 shows the matrices Δ and Φ based on the version graph in Figure 1.2. The annotation associated with the edge (Vi, Vj) in Figure 1.2 is essentially ⟨Δi,j, Φi,j⟩, whereas the vertex annotation for Vi is ⟨Δi,i, Φi,i⟩. If there is no edge from Vi to Vj in the version graph, we have two choices: we can either set the corresponding Δ and Φ entries to “−” (unknown) (as shown in the figure), or we can explicitly compute the values of those entries (by running a differencing algorithm). For instance, Δ3,2 = 1100 and Φ3,2 = 3200 are computed explicitly in the figure (the specific numbers reported here are fictitious and not the result of running any specific algorithm).

(i) Δ:
         V1      V2      V3      V4      V5
V1    10000     200    1000      --      --
V2      500   10100      --      50     800
V3       --    1100    9700      --     200
V4       --      --      --    9800     900
V5       --      --      --     800   10120

(ii) Φ:
         V1      V2      V3      V4      V5
V1    10000     200    3000      --      --
V2      600   10100      --     400    2500
V3       --    3200    9700      --     550
V4       --      --      --    9800    2500
V5       --      --      --    2300   10120

Figure 3.1: Matrices corresponding to the example in Figure 1.2 (with additional entries revealed beyond the ones given by the version graph).

Discussion. Before moving on to formally defining the basic optimization problem, we note several complications that present unique challenges in this scenario.

• Revealing entries in the matrix: Ideally, we would like to compute all pairwise Δ and Φ entries, so that we do not miss any significant redundancies among versions that are far from each other in the version graph. However, when the number of versions, denoted n, is large, computing all those entries can be very expensive (and typically infeasible), since this means computing deltas between all pairs of versions. Thus, we must reason with incomplete Δ and Φ matrices. Given a version graph, one option is to restrict our deltas to correspond to actual edges in the version graph; another option is to restrict our deltas to be between “close by” versions, with the understanding that versions close to each other in the version graph are more likely to be similar. Prior work has also suggested mechanisms (e.g., based on hashing) to find versions that are close to each other [37]. We assume that some mechanism to choose which deltas to reveal is provided to us.

• Multiple “delta” mechanisms: Given a pair of versions (Vi, Vj), there could be many ways of maintaining a delta between them, with different Δi,j, Φi,j costs. For example, we can store a program used to derive Vj from Vi, which could take longer to run (i.e., the recreation cost is higher) but is more compact (i.e., the storage cost is lower), or we can explicitly store the UNIX-style diffs between the two versions, with lower recreation costs but higher storage costs. For simplicity, we pick one delta mechanism; thus the matrices Δ, Φ have just one entry per (i, j) pair. Our techniques also apply to the more general scenario with small modifications.

• Branches: Both branching and merging are common in collaborative analysis, making the version graph a directed acyclic graph. In this chapter, we assume each version is either stored in its entirety or stored as a delta from a single other version, even if it is derived from two different datasets. Although it may be more efficient to allow a version to be stored as a delta from two other versions in some cases, representing such a storage solution requires more complex constructs, and both the problems of finding an optimal storage solution for a given problem instance and retrieving a specific version become much more complicated. Getting a better understanding of such constructs remains a rich area for future work.

Matrix Properties and Problem Dimensions. The storage cost matrix Δ may be symmetric or asymmetric depending on the specific differencing mechanism used for constructing deltas.
For example, the XOR differencing function results in a symmetric Δ matrix, since the delta from a version Vi to Vj is identical to the delta from Vj to Vi. UNIX-style diffs, where line-by-line modifications are listed, can either be two-way (symmetric) or one-way (asymmetric). The asymmetry may be quite large. For instance, it may be possible to represent the delta from Vi to Vj very compactly using a command like: delete all tuples with age > 60. However, the reverse delta from Vj to Vi is likely to be quite large, since all the tuples that were deleted from Vi would be a part of that delta. In this chapter, we consider both these scenarios. We refer to the scenario where Δ is symmetric as the undirected case and the scenario where Δ is asymmetric as the directed case.

A second issue is the relationship between Φ and Δ. In many scenarios, it may be reasonable to assume that Φ is proportional to Δ. This is generally true for deltas that contain detailed line-by-line or cell-by-cell differences. It is also true if the system bottleneck is network communication or I/O cost. In a large number of cases, however, it may be more appropriate to treat them as independent quantities with no overt or known relationship. For the proportional case, we assume that the proportionality constant is 1 (i.e., Φ = Δ); the problem statements, algorithms, and guarantees are unaffected by having a constant proportionality factor. The other case is denoted by Φ ≠ Δ. This leads us to identify three distinct cases with significantly diverse properties: (1) Scenario 1: undirected case, Φ = Δ; (2) Scenario 2: directed case, Φ = Δ; and (3) Scenario 3: directed case, Φ ≠ Δ.

Objective and Optimization Metrics. Given Δ and Φ, our goal is to find a good storage solution, i.e., we need to decide which versions to materialize and which versions to store as deltas from other versions. Let S = {(i1, j1), (i2, j2), ...} denote a storage solution. A pair with ik = jk indicates that the version Vik is materialized (i.e., stored explicitly in its entirety), whereas a pair (ik, jk) with ik ≠ jk indicates that we store a delta from Vik to Vjk. We require any solution we consider to be a valid solution, where it is possible to reconstruct any of the original versions. More formally, S is considered a valid solution if and only if for every version Vi, there exists a sequence of distinct versions Vl1, ..., Vlk = Vi such that (l1, l1), (l1, l2), (l2, l3), ..., (lk−1, lk) are contained in S (in other words, there is a version Vl1 that is materialized and can be used to recreate Vi through a chain of deltas). We can now formally define the optimization goals:

• Total Storage Cost (denoted C): The total storage cost for a solution S is simply the storage cost necessary to store all the materialized versions and the deltas: C = ∑(i,j)∈S Δi,j.

• Recreation Cost for Vi (denoted Ri): Let Vl1, ..., Vlk = Vi denote a sequence that can be used to reconstruct Vi. The cost of recreating Vi using that sequence is Φl1,l1 + Φl1,l2 + ... + Φlk−1,lk. The recreation cost for Vi is the minimum of these quantities over all sequences that can be used to recreate Vi.
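The cost definitions above translate directly into code. The following sketch (illustrative only; the dictionary-based encoding and all names are ours) computes the total storage cost C and the recreation costs Ri of a candidate storage solution, using a few of the entries from Figure 3.1 and assuming each version is stored exactly once, either in full or as a single delta:

# Δ and Φ as sparse dictionaries: (j, i) -> cost of storing / applying the delta
# from Vj to Vi; (i, i) -> cost of materializing / reading Vi in full.
delta = {(1, 1): 10000, (1, 2): 200, (1, 3): 1000, (2, 4): 50, (3, 5): 200}
phi   = {(1, 1): 10000, (1, 2): 200, (1, 3): 3000, (2, 4): 400, (3, 5): 550}

# A storage solution: V1 materialized, the remaining versions stored as deltas.
solution = {(1, 1), (1, 2), (1, 3), (2, 4), (3, 5)}

def total_storage_cost(solution, delta):
    return sum(delta[e] for e in solution)

def recreation_costs(solution, phi):
    parent = {i: j for (j, i) in solution}     # each version stored exactly once
    cost = {}
    def recreate(i):
        if i not in cost:
            j = parent[i]
            cost[i] = phi[(j, i)] if j == i else recreate(j) + phi[(j, i)]
        return cost[i]
    for i in parent:
        recreate(i)
    return cost

print(total_storage_cost(solution, delta))     # C
print(recreation_costs(solution, phi))         # {i: Ri}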
Problem Formulations. We now state the problem formulations that we consider in this chapter, starting with two base cases that represent two extreme points in the spectrum of possible problems.

Problem 1 (Minimizing Storage) Given Δ, Φ, find a valid solution S such that C is minimized.

Problem 2 (Minimizing Recreation) Given Δ, Φ, identify a valid solution S such that ∀i, Ri is minimized.

The above two formulations minimize either the storage cost or the recreation cost, without worrying about the other. It may appear that the second formulation is not well-defined and that we should instead aim to minimize the average recreation cost across all versions. However, the (simple) solution that minimizes the average recreation cost also naturally minimizes Ri for each version.

In the next two formulations, we want to minimize (a) the sum of recreation costs over all versions (∑i Ri), or (b) the maximum recreation cost across all versions (maxi Ri), under the constraint that the total storage cost C is smaller than some threshold β. These problems are relevant when the storage budget is limited.

Problem 3 (MinSum Recreation) Given Δ, Φ and a threshold β, identify S such that C ≤ β, and ∑i Ri is minimized.

Problem 4 (MinMax Recreation) Given Δ, Φ and a threshold β, identify S such that C ≤ β, and maxi Ri is minimized.

The next two formulations seek to instead minimize the total storage cost C given a constraint on the sum of recreation costs or on the maximum recreation cost. These problems are relevant when we want to reduce the storage cost, but must satisfy some constraints on the recreation costs.

Problem 5 (Minimizing Storage (Sum Recreation)) Given Δ, Φ and a threshold θ, identify S such that ∑i Ri ≤ θ, and C is minimized.

Problem 6 (Minimizing Storage (Max Recreation)) Given Δ, Φ and a threshold θ, identify S such that maxi Ri ≤ θ, and C is minimized.

3.2.2 Mapping to Graph Formulation

In this section, we map our problem to a graph problem, which allows us to adapt and modify algorithms for well-studied problems such as minimum spanning tree construction and delay-constrained scheduling. Given the matrices Δ and Φ, we can construct a directed, edge-weighted graph G = (V, E) representing the relationships among the different versions as follows. For each version Vi, we create a vertex Vi in G. In addition, we create a dummy vertex V0 in G. For each Vi, we add an edge V0 → Vi, and assign its edge weight as a tuple ⟨Δi,i, Φi,i⟩. Next, for each Δi,j ≠ ∞, we add an edge Vi → Vj with edge weight ⟨Δi,j, Φi,j⟩.

The resulting graph G is similar to the original version graph, but with several important differences. An edge in the version graph indicates a derivation relationship, whereas an edge in G simply indicates that it is possible to recreate the target version using the source version and the associated edge delta (in fact, ideally G is a complete graph). Unlike the version graph, G may contain cycles, and it also contains the special dummy vertex V0. Additionally, in the version graph, if a version Vi has multiple in-edges, it is the result of a user/application merging changes from multiple versions into Vi. In contrast, multiple in-edges in G capture the multiple choices that we have in recreating Vi from other versions.

Given graph G = (V, E), the goal of each of our problems is to identify a storage graph Gs = (Vs, Es), a subgraph of G, favorably balancing the total storage cost and the recreation cost for each version. Implicitly, we will store all versions and deltas corresponding to edges in this storage graph. (We explain this in the context of the example below.) We say a storage graph Gs is feasible for a given problem if (a) each version can be recreated based on the information contained or stored in Gs, and (b) the recreation cost or the total storage cost meets the constraint listed in each problem.
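The construction of G from Δ and Φ is mechanical. The sketch below (a simplified illustration using plain dictionaries rather than DEX's internal structures) builds the edge-weighted graph, including the dummy vertex V0, from which a storage graph can then be selected:

def build_graph(delta, phi, versions):
    # V0 -> Vi carries the materialization costs <Δi,i, Φi,i>;
    # Vi -> Vj carries the delta costs <Δi,j, Φi,j> for every known delta.
    edges = {}                      # (source, target) -> (storage cost, recreation cost)
    for i in versions:
        edges[(0, i)] = (delta[(i, i)], phi[(i, i)])
    for (i, j), d in delta.items():
        if i != j:                  # off-diagonal entries are deltas
            edges[(i, j)] = (d, phi[(i, j)])
    return edges

# Reusing the known entries of the matrices in Figure 3.1.
delta = {(1, 1): 10000, (2, 2): 10100, (3, 3): 9700, (4, 4): 9800, (5, 5): 10120,
         (1, 2): 200, (1, 3): 1000, (2, 4): 50, (2, 5): 800, (3, 5): 200}
phi   = {(1, 1): 10000, (2, 2): 10100, (3, 3): 9700, (4, 4): 9800, (5, 5): 10120,
         (1, 2): 200, (1, 3): 3000, (2, 4): 400, (2, 5): 2500, (3, 5): 550}

G = build_graph(delta, phi, versions=[1, 2, 3, 4, 5])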
Example 4 Given the matrices Δ and Φ in Figures 3.1(i) and 3.1(ii), the corresponding graph G is shown in Figure 3.2. Every version is reachable from V0. For example, edge (V0, V1) is weighted with ⟨Δ1,1, Φ1,1⟩ = ⟨10000, 10000⟩; edge (V2, V5) is weighted with ⟨Δ2,5, Φ2,5⟩ = ⟨800, 2500⟩. Figure 3.3 is a feasible storage graph given the graph G in Figure 3.2, where V1 and V3 are materialized (since the edges from V0 to V1 and V3 are present) while V2, V4 and V5 are stored as modifications of other versions.

Figure 3.2: Graph G
Figure 3.3: Storage Graph Gs

After mapping our problem into a graph setting, we have the following lemma.

Lemma 1 The optimal storage graph Gs = (Vs, Es) for all six problems listed above must be a spanning tree T rooted at the dummy vertex V0 in graph G.

Proof 1 Recall that a spanning tree of a graph G(V, E) is a subgraph of G that (i) includes all vertices of G, (ii) is connected, i.e., every vertex is reachable from every other vertex, and (iii) has no cycles. Any Gs must satisfy (i) and (ii) in order to ensure that a version Vi can be recreated from V0 by following the path from V0 to Vi. Conversely, if a subgraph satisfies (i) and (ii), it is a valid Gs according to our definition above. Regarding (iii), the presence of a cycle creates redundancy in Gs. Formally, given any subgraph that satisfies (i) and (ii), we can arbitrarily delete one edge from each of its cycles until the subgraph is cycle-free, while preserving (i) and (ii).

For Problems 1 and 2, we have the following observations. A minimum spanning tree is defined as a spanning tree of smallest weight, where the weight of a tree is the sum of all its edge weights. A shortest path tree is defined as a spanning tree where the path from the root to each vertex is a shortest path between those two vertices in the original graph: this simply consists of the edges explored in an execution of Dijkstra's shortest path algorithm.

Lemma 2 The optimal storage graph Gs for Problem 1 is a minimum spanning tree of G rooted at V0, considering only the weights Δi,j.

Lemma 3 The optimal storage graph Gs for Problem 2 is a shortest path tree of G rooted at V0, considering only the weights Φi,j.

3.2.3 ILP Formulation

We present an ILP formulation of the optimization problems described above. Here, we take Problem 6 as an example; the other problems are similar. Let xi,j be a binary variable for each edge (Vi, Vj) ∈ E, indicating whether edge (Vi, Vj) is in the storage graph or not. Specifically, x0,j = 1 indicates that version Vj is materialized, while xi,j = 1 indicates that the modification from version i to version j is stored, where i ≠ 0. Let ri be a continuous variable for each vertex Vi ∈ V, where r0 = 0; ri captures the recreation cost for version i (and must be ≤ θ).

minimize ∑(Vi,Vj)∈E xi,j × Δi,j, subject to:

1. ∑i xi,j = 1, ∀j
2. rj − ri ≥ Φi,j if xi,j = 1
3. ri ≤ θ, ∀i
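As one concrete way of writing this program down, the sketch below expresses the model for Problem 6 using the PuLP modeling library (one of several possible choices; the function and variable names are ours). The implication in constraint (2) is linearized with a large constant, anticipating the transformation discussed after Lemma 4 below; r0 = 0 and constraint (3) are encoded through variable bounds and an explicit equality.

from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

def problem6_ilp(edges, theta):
    # edges: (i, j) -> (delta_ij, phi_ij), with vertex 0 as the dummy root.
    vertices = {v for e in edges for v in e}
    big_m = 2 * theta                        # "sufficiently large" constant

    prob = LpProblem("min_storage_max_recreation", LpMinimize)
    x = {e: LpVariable(f"x_{e[0]}_{e[1]}", cat=LpBinary) for e in edges}
    r = {v: LpVariable(f"r_{v}", lowBound=0, upBound=theta) for v in vertices}

    # Objective: total storage cost of the chosen edges.
    prob += lpSum(x[e] * edges[e][0] for e in edges)
    prob += r[0] == 0

    # (1) every non-root vertex has exactly one incoming edge.
    for j in vertices - {0}:
        prob += lpSum(x[(i, jj)] for (i, jj) in edges if jj == j) == 1

    # (2) linearized: r_j >= r_i + phi_ij whenever x_ij = 1.
    for (i, j), (_, phi_ij) in edges.items():
        prob += r[j] - r[i] >= phi_ij - (1 - x[(i, j)]) * big_m

    # prob.solve() would invoke PuLP's default solver on the model.
    return prob, x, r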
Lemma 4 Problem 6 is equivalent to the optimization problem described above.

Proof 2 First, constraint 1 indicates that each vertex Vi, 1 ≤ i ≤ n, has one and only one incoming edge, as described in Lemma 1. Constraint 2 ensures that no cycle exists in Gs. This can be proven by contradiction: if there exists a cycle {Vk1, …, Vkl, Vk1}, then constraint 2 gives

rk2 − rk1 ≥ Φk1,k2,  rk3 − rk2 ≥ Φk2,k3,  …,  rk1 − rkl ≥ Φkl,k1.

Summing these inequalities yields 0 ≥ Φk1,k2 + Φk2,k3 + … + Φkl,k1, which is impossible since all Φ entries are positive. Thus, constraints 1 and 2 ensure that the resulting storage graph Gs is a spanning tree. Constraint 3 corresponds to the recreation cost constraint in Problem 6, but note that ri is not necessarily the recreation cost for version Vi.

First, any solution to Problem 6 satisfies all constraints listed in the integer linear program above by setting ri = Ri; hence the optimal objective value of the program is no larger than the minimum total storage cost in Problem 6. Next, we prove the opposite direction by contradiction. Suppose there exists a solution to the program above whose objective value is strictly smaller than the minimum total storage cost in Problem 6. According to constraints 2 and 3 and the spanning tree property (let the path from root V0 to Vj be {Vk1 = V0, Vk2, ..., Vkl, Vk(l+1) = Vj}):

rj ≥ rkl + Φkl,j ≥ ... ≥ r0 + ∑m=1..l Φkm,k(m+1),  and  θ ≥ rj,
⇒ θ ≥ r0 + ∑m=1..l Φkm,k(m+1).

Thus, the recreation cost constraint for each version Vj is fulfilled, i.e., the chosen edges form a feasible storage graph for Problem 6 with a smaller total storage cost. Hence, the assumed optimum for Problem 6 is not the minimum storage cost, which is a contradiction.

Note however that the general form of an ILP does not permit an if-then statement (as in (2) above). Instead, we can transform it to the general form with the aid of a large constant C. Thus, constraint 2 can be expressed as follows:

Φi,j + ri − rj ≤ (1 − xi,j) × C

where C is a “sufficiently large” constant such that no additional constraint is added to the model. For instance, C can be set to 2 ∗ θ here. On one hand, if xi,j = 1 ⇒ Φi,j + ri − rj ≤ 0. On the other hand, if xi,j = 0 ⇒ Φi,j + ri − rj ≤ C. Since C is “sufficiently large”, no additional constraint is added.

3.3 Computational Complexity

In this section, we study the complexity of the problems listed in Table 3.1 under different application scenarios.

Problem 1 and 2 Complexity. As discussed in Section 3.2, Problems 1 and 2 can be solved in polynomial time by directly applying a minimum spanning tree algorithm (Kruskal's algorithm or Prim's algorithm for undirected graphs; Edmonds' algorithm [87] for directed graphs) and Dijkstra's shortest path algorithm, respectively. Kruskal's algorithm has time complexity O(E log V), while Prim's algorithm also has time complexity O(E log V) when using a binary heap to implement the priority queue, and O(E + V log V) when using a Fibonacci heap. The running time of Edmonds' algorithm is O(EV) and can be reduced to O(E + V log V) with a faster implementation. Similarly, Dijkstra's algorithm for constructing the shortest path tree starting from the root has a time complexity of O(E log V) via a binary heap-based priority queue implementation and O(E + V log V) via a Fibonacci heap-based implementation.

Next, we show that Problems 5 and 6 are NP-hard even for the special case where Δ = Φ and Φ is symmetric. This will lead to hardness proofs for the other variants.

Triangle Inequality. The primary challenge that we encounter while demonstrating hardness is that our deltas must obey the triangle inequality: unlike other settings where deltas need not obey real constraints, in our case deltas represent actual modifications that can be stored, and so they must obey additional realistic constraints. This causes severe complications in proving hardness, often transforming the proofs from very simple to fairly challenging. Consider the scenario where Δ = Φ and Φ is symmetric. We take Δ as an example.
The triangle inequality can be stated as follows:

|Δp,q − Δq,w| ≤ Δp,w ≤ Δp,q + Δq,w
|Δp,p − Δp,q| ≤ Δq,q ≤ Δp,p + Δp,q

where p, q, w ∈ V and p ≠ q ≠ w. The first inequality states that the “delta” between two versions cannot exceed the total “deltas” of any two-hop path with the same starting and ending vertex, while the second inequality indicates that the “delta” between two versions must be at least one version's full storage cost minus the other version's full storage cost. Since each tuple and modification is recorded explicitly when Φ is symmetric, it is natural that these two inequalities hold.

Figure 3.4: Illustration of Proof of Lemma 5

Problem 6 Hardness. We now demonstrate hardness.

Lemma 5 Problem 6 is NP-hard when Δ = Φ and Φ is symmetric.

Proof 3 Here we prove NP-hardness using a reduction from the set cover problem. Recall that in the set cover problem, we are given m sets S = {s1, s2, ..., sm} and n items T = {t1, t2, ..., tn}, where each set si covers some items, and the goal is to pick k sets from S whose union is T while minimizing k. Given a set cover instance, we now construct an instance of Problem 6 that will provide a solution to the original set cover problem. The threshold we will use in Problem 6 will be (β + 1)α, where α, β are constants that are each greater than 2(m + n). (This is just to ensure that they are “large”.) We now construct the graph G(V, E) in the following way; we display the constructed graph in Figure 3.4. Our vertex set V is as follows:

• ∀si ∈ S, create a vertex si in V.
• ∀ti ∈ T, create a vertex ti in V.
• create an extra vertex v0, and two dummy vertices v1, v2 in V.

We add the two dummy vertices simply to ensure that v0 is materialized, as we will see later. We now define the storage cost for materializing each vertex in V in the following way:

• ∀si ∈ S, the cost is α.
• ∀ti ∈ T, the cost is (β + 1)α.
• for vertex v0, the cost is α.
• for vertices v1, v2, the cost is (β + 1)α.

(These are the numbers colored blue in the tree of Figure 3.4(b).) As we can see above, we have set the costs in such a way that the vertex v0 and the vertices corresponding to sets in S have low materialization cost, while the other vertices have high materialization cost: this is by design, so that we only end up materializing these vertices. Our edge set E is now as follows.

• we connect vertex v0 to each si with weight 1.
• we connect v0 to both v1 and v2, each with weight αβ.
• ∀si ∈ S, we connect si to tj with weight αβ when tj ∈ si, where β = |V|.

It is easy to show that our constructed graph G obeys the triangle inequality. Consider a solution to Problem 6 on the constructed graph G. We now demonstrate that that solution leads to a solution of the original set cover problem. Our proof proceeds in four key steps:

Step 1: The vertex v0 will be materialized, while v1, v2 will not be materialized. Assume the contrary—say v0 is not materialized in a solution to Problem 6. Then, both v1 and v2 must be materialized, because if they are not, then the recreation cost of v1 and v2 would be at least (β + 1)α + 1, violating the condition of Problem 6. However, we can avoid materializing v1 and v2, and instead keep the deltas to v0 and materialize v0, maintaining the recreation cost as is while reducing the storage cost. Thus v0 has to be materialized, while v1, v2 will not be materialized.
(Our reason for introducing v1, v2 is precisely to ensure that v0 is materialized, so that it can provide a basis for storing deltas to the sets si.)

Step 2: None of the ti will be materialized. Say a given ti is materialized in the solution to Problem 6. Then, either a set sj that is connected to ti in Figure 3.4(a) is also materialized, or not. In the former case, we can avoid materializing ti and instead add the delta from sj to ti, thereby reducing storage cost while keeping recreation cost fixed. In the latter case, pick any sj such that sj is connected to ti and is not materialized. Then, we must have the delta from v0 to sj as part of the solution. Here, we can replace that edge and the materialized ti with a materialized sj and the delta from sj to ti: this reduces the total storage cost while keeping the recreation cost fixed. Thus, in either case, we can improve the solution if any of the ti are materialized, rendering the statement false.

Step 3: For each si, either it is materialized, or the edge from v0 to si will be part of the storage graph. This step is easy to see: since none of the ti are materialized, either each si has to be materialized, or we must store a delta from v0.

Step 4: The sets si that are materialized correspond to a minimal set cover of the original problem. It is easy to see that for each tj we must have an si such that si covers tj and si is materialized, in order for the recreation cost constraint not to be violated for tj. Thus, the materialized si must form a set cover for the original problem. Furthermore, in order for the storage cost to be as small as possible, as few si as possible must be materialized (this is the only place we can save cost). Thus, the materialized si also correspond to a minimal set cover for the original problem. Hence, minimizing the total storage cost is equivalent to minimizing k in the set cover problem.

Note that while the reduction above uses a graph with only some edge weights (i.e., recreation costs of the deltas) known, a similar reduction can be derived for a complete graph with all edge weights known. Here, we simply use the shortest path in the graph reduction above as the edge weight for the missing edges. In that case, once again, the storage graph in the solution to Problem 6 will be identical to the storage graph described above.

Problem 5 Hardness: We now show that Problem 5 is NP-Hard as well. The general philosophy is similar to the proof of Lemma 5, except that we create c dummy vertices instead of the two dummy vertices v1, v2 in Lemma 5, where c is sufficiently large—this is to once again ensure that v0 is materialized.

Lemma 6 Problem 5 is NP-Hard when Δ = Φ and Φ is symmetric.

Figure 3.5: Illustration of Proof of Lemma 6

Proof 4 We prove NP-hardness using a reduction from the set cover problem. Recall that in the set cover decision problem, we are given m sets S = {s1, s2, ..., sm} and n items T = {t1, t2, ..., tn}, where each set si covers some items, and given a k, we ask whether there is a sub-collection of at most k sets from S whose union is T. Given a set cover instance, we now construct an instance of Problem 5 that will provide a solution to the original set cover decision problem.
The corresponding decision problem for Problem 5 is: given threshold + ( + 1) n + k + (m − k)( + 1) + ( + 1)c in Problem 5, 47 is the minimum total storage cost in the constructed graph G no bigger than + k + (m − k) + n + c. We now construct the graph G(V , E) in the following way; we display the constructed graph in Figure 3.5. Our vertex set V is as follows: • ∀si ∈ S, create a vertex si in V. • ∀ti ∈ T , create a vertex ti in V. • create an extra vertex v0, and c dummy vertices {v1, v2,… , vc} in V . We add the c dummy vertices simply to ensure that v0 is materialized, as we will see later. We now dene the storage cost for materializing each vertex in V in the following way: • ∀si ∈ S, the cost is . • ∀ti ∈ T , the cost is ( + 1) . • for vertex v0, the cost is . • for each vertex in {v1, v2,… , vc}, the cost is + 1. (These are the numbers colored blue in the tree of Figure 3.5.) As we can see above, we have set the costs in such a way that the vertex v0 and the vertices corresponding to sets in S have low materialization cost while the vertices corresponding to T have high materialization cost: this is by design so that we only end up materializing these vertices. Even though the costs of the dummy vertices is close to that of v0, si , we will show below that they will not be materialized either. Our edge set E is now as follows. • we connect vertex v0 to each si with weight 1. 48 • we connect v0 to vi , 1 ≤ i ≤ c each with weight 1. • ∀si ∈ S, we connect si to tj with weight when tj ∈ si , where = |V |. It is easy to show that our constructed graph G obeys the triangle inequality. Consider a solution to Problem 5 on the constructed graph G. We now demonstrate that that solution leads to a solution of the original set cover problem. Our proof proceeds in four key steps: Step 1: The vertex v0 will be materialized, while vi , 1 ≤ i ≤ c will not be materialized. Let’s examine the rst part of this observation, i.e., that v0 will be materialized. Assume the contrary. If v0 is not materialized, then at least one vi , 1 ≤ i ≤ c, or one of the si must be materialized, because if not, then the recreation cost of {v1, v2,… , vc} would be at least ( +2)c > ( +1)c+ +( +1) n+k +(m−k)( +1), violating the condition (exceeding total recreation cost threshold) of Problem 5. However we can avoid materializing this vi (or si), instead keep the delta from vi (or si) to v0 and materialize v0, reducing the recreation cost and the storage cost. Thus v0 has to be materialized. Furthermore, since v0 is materialized, ∀vi , 1 ≤ i ≤ c will not be materialized and instead we will retain the delta to v0, reducing the recreation cost and the storage cost. Hence, the rst step is complete. Step 2: None of the ti will be materialized. Say a given ti is materialized in the solution to Problem 5. Then, either we have a set sj where sj is connected to ti in Figure 3.5(a) also materialized, or not. Let us consider the former case. In the former case, we can avoid materializing ti , and instead add the delta from sj to ti , thereby reducing storage cost while keeping recreation cost xed. In the latter case, pick any sj such that sj is connected to ti and is not materialized. Then, we must have the delta from v0 to sj as part of the solution. 49 Here, we can replace that edge, and the materialized ti , with materialized sj , and the delta from sj to ti : this would reduce the total storage cost while keeping the recreation cost xed. 
Thus, in either case, we can improve the solution if any of the ti are materialized, rendering the statement false. Step 3: For each si , either it is materialized, or the edge from v0 to si will be part of the storage graph. This step is easy to see: since none of the ti are materialized, either each si has to be materialized, or we must store a delta from v0. Step 4: If the minimum total storage cost is no bigger than + k + (m − k) + n + c, then there exists a subset  ⊂ S such that ∪{F∈}F = T and | | ≤ k in the original set cover decision problem, and vice versa. Let’s examine the rst part. If the minimum total storage cost is no bigger than + k + (m − k) + n + c, then the storage cost for all si ∈ S must be no bigger than k + (m − k) since the storage cost for v0, {v1, v2,… , vc} and {t1, t2,… , tn} is , c and n respectively according to Step 1 and 2. This indicates that at most k si ∈ S is materialized (we let the set of materialized si be M and |M | ≤ k). Next, we prove that each tj is stored as the modication from the materialized si ∈ M . Suppose there exists one or more tj which is stored as the modication from si ∈ S − M , then the total recreation cost must be more than + (( + 1) n + 1) + k + (m − k)( + 1) + ( + 1)c, which exceeds the total recreation threshold. Thus, we have each tj ∈ T is stored as the modication from si ∈ M . Let  = M , we can obtain ∪{F∈}F = T and | | ≤ k. Thus, If the minimum total storage cost is no bigger than + k + (m − k) + n + c, then there exists a subset  ⊂ S such that ∪{F∈}F = T and | | ≤ k in the original set cover decision problem. Next let’s examine the second part. If there exists a subset ⊂ S such that ∪{F∈}F = T and | | ≤ k in the original set cover decision problem, then we can materialize each vertex 50 si ∈  as well as the extra vertex v0, connect v0 to {v1, v2,… , vc} as well as sj ∈ S −  , and connect tj to one si ∈  . The resulting total storage is + k + (m − k) + n + c and the total recreation cost equals to the threshold. Thus, if there exists a subset  ⊂ S such that ∪{F∈}F = T and | | ≤ k in the original set cover decision problem, then the minimum total storage cost is no bigger than + k + (m − k) + n + c. Thus, the decision problem in Problem 5 is equivalent to the decision problem in set cover problem. Once again, the problem is still hard if we use a complete graph as opposed to a graph where only some edge weights are known. Since Problem 4 swaps the constraint and goal compared to Problem 6, it is simi- larly NP-Hard. (Note that the decision versions of the two problems are in fact identical, and therefore the proof still applies.) Similarly, Problem 3 is also NP-Hard. Now that we have proved the NP-hard even in the special case where Δ = Φ and Φ is symmetric, we can conclude that Problem 3, 4, 5, 6, are NP-hard in a more general setting where Φ is not symmetric and Δ ≠ Φ, as listed in Table 3.1. Hop-Based Variants.So far, our focus has been on proving hardness for the special case where Δ = Φ and Δ is undirected. We now consider a dierent kind of special case, where the recreation cost of all pairs is the same, i.e., Φij = 1 for all i, j, while Δ ≠ Φ, and Δ is undirected. In this case, we call the recreation cost as the hop cost, since it is simply the minimum number of delta operations (or "hops") needed to reconstruct Vi . 
The reason why we bring up this variant is that it directly corresponds to a special case of the well-studied d-MinimumSteinerTree problem: given an undirected graph G = (V, E) and a subset of terminal vertices, find a tree of minimum weight that spans all the terminals while its diameter is bounded by d. The special case of the d-MinimumSteinerTree problem in which the terminal set is all of V, i.e., the minimum spanning tree problem with bounded diameter, directly corresponds to Problem 6 for the hop cost variant described above. The hardness for this special case was demonstrated by [88] using a reduction from the SAT problem:

Lemma 7 Problem 6 is NP-Hard when Δ ≠ Φ and Δ is symmetric, and Φi,j = 1 for all i, j.

Note that this proof crucially uses the fact that Δ ≠ Φ, unlike Lemmas 5 and 6; thus the proofs are incomparable (i.e., one does not subsume the other). For the hop-based variant, additional results on hardness of approximation are known by way of the d-MinimumSteinerTree problem [88, 89, 90]:

Lemma 8 ([88]) For any ε > 0, Problem 6 has no (1 − ε) ln n-approximation unless NP ⊆ DTIME(n^(log log n)).

Since the hop-based variant is a special case of the last column of Table 3.1, this indicates that Problem 6 for the most general case is similarly hard to approximate; we suspect similar results hold for the other problems as well. It remains to be seen if hardness of approximation can be demonstrated for the variants in the second- and third-last columns.

3.4 Proposed Algorithms

As discussed in Section 3.2, our different application scenarios lead to different problem formulations, spanning different constraints and objectives, and different assumptions about the nature of Φ and Δ. Given that we demonstrated in the previous section that these problems are NP-Hard, we focus on developing efficient heuristics. In this section, we present two novel heuristics: first, in Section 3.4.1, we present LMG, or the Local Move Greedy algorithm, tailored to the case when there is a bound or objective on the average recreation cost: thus, this applies to Problems 3 and 5. Second, in Section 3.4.2, we present MP, or Modified Prim's algorithm, tailored to the case when there is a bound or objective on the maximum recreation cost: thus, this applies to Problems 4 and 6. We present two variants of the MP algorithm tailored to two different settings. Then, we present two algorithms adapted from the literature to fit our problems — in Section 3.4.3, an approximation algorithm called LAST, and in Section 3.4.4, an algorithm called GitH, which is based on Git repack; we compare these against our algorithms in Section 3.5. Note that LAST and GitH do not explicitly optimize any objectives or constraints in the manner of LMG or MP, and thus the four algorithms are applicable under different settings: LMG and MP are applicable when there is a bound or constraint on the average or maximum recreation cost, while LAST and GitH are applicable when a “good enough” solution is needed. Furthermore, note that all these algorithms apply to both directed and undirected versions of the problems, and to the symmetric and asymmetric cases.

Figure 3.6: Illustration of Local Move Greedy Heuristic

3.4.1 Local Move Greedy Algorithm

The LMG algorithm is applicable when we have a bound or constraint on the average-case recreation cost.
We focus on the case where there is a constraint on the storage cost (Problem 3); the other case (Problem 5), where the constraint is instead on the total recreation cost, can be solved by repeated invocations of this procedure together with a binary search over the storage budget.

Outline. At a high level, the algorithm starts with the Minimum Spanning Tree (MST) as GS, and then greedily adds edges from the Shortest Path Tree (SPT) that are not present in GS, while GS respects the bound on storage cost.

Detailed Algorithm. The algorithm starts off with GS equal to the MST. The SPT naturally contains all the edges corresponding to complete versions. The basic idea of the algorithm is to replace deltas in GS with versions from the SPT that maximize the following ratio:

ρ = (reduction in sum of recreation costs) / (increase in storage cost)

This is simply the reduction in total recreation cost per unit addition of weight to the storage graph GS. Let Ecand consist of the edges in the SPT that are not present in GS (these precisely correspond to the versions that are not explicitly stored in the MST, and are instead computed via deltas in the MST). In each “round”, we pick the edge euv ∈ Ecand that maximizes ρ, and replace the previous edge eu′v into v. The reduction in the sum of the recreation costs is computed by adding up the reductions in recreation costs of all w ∈ GS that are descendants of v in the storage graph (including v itself). On the other hand, the increase in storage cost is simply the weight of euv minus the weight of eu′v. This process is repeated as long as the storage budget is not violated. We explain this with the means of an example.

Example 5 Figure 3.6(a) denotes the current GS. Node 0 corresponds to the dummy node. Now, we are considering replacing edge e14 with edge e04, that is, we are replacing a delta to version 4 with version 4 itself. Then, the denominator of ρ is simply Δ04 − Δ14. The numerator is the change in recreation costs of versions 4, 5, and 6 (notice that 5 and 6 are below 4 in the tree). This is actually simple to compute: it is simply three times the change in the recreation cost of version 4 (since the change affects all three versions equally). Thus, the numerator of ρ is simply 3 × (Φ01 + Φ14 − Φ04).

Complexity. In a given round, computing ρ for a given edge is O(|V|). This leads to an overall O(|V|^3) complexity, since we have up to |V| rounds and up to |V| edges in Ecand. However, if we are smart about this computation (by precomputing and maintaining across all rounds the number of nodes “below” every node), we can reduce the complexity of computing ρ for a given edge to O(1). This leads to an overall complexity of O(|V|^2). Algorithm 1 provides pseudocode of the described technique.

Algorithm 1: Local Move Greedy Heuristic
Input: Minimum Spanning Tree (MST), Shortest Path Tree (SPT), source vertex V0, space budget W
Output: A tree T with weight ≤ W rooted at V0 with minimal sum of access costs
1  Initialize T as the MST.
2  Let d(Vi) be the distance from V0 to Vi in T, and p(Vi) denote the parent of Vi in T. Let W(T) denote the storage cost of T. Let Ecand be the set of SPT edges not in T.
3  while W(T) < W do
4      (ρmax, ē) ← (0, ∅)
5      foreach euv ∈ Ecand do
6          compute ρe
7          if ρe > ρmax then
8              (ρmax, ē) ← (ρe, euv)
9          end
10     end
11     T ← T \ eu′v ∪ ē; Ecand ← Ecand \ ē
12     if Ecand = ∅ then
13         return T
14     end
15 end
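The mechanics of LMG can also be seen in the following compact Python sketch (our own simplification, not the DEX implementation; it recomputes costs naively rather than maintaining them incrementally): starting from the MST, it repeatedly materializes the version whose SPT edge gives the best ratio ρ of recreation-cost reduction to storage-cost increase, while the storage budget permits.

def lmg(mst_parent, delta, phi, budget):
    # mst_parent maps each version to its parent in the MST rooted at the dummy
    # vertex 0 (parent 0 means "materialized"); delta/phi map (u, v) to the
    # storage / recreation cost of edge u -> v, including the (0, v) entries.
    parent = dict(mst_parent)
    versions = list(parent)

    def recreation(v):
        cost, x = 0.0, v
        while True:
            u = parent[x]
            cost += phi[(u, x)]
            if u == 0:
                return cost
            x = u

    def in_subtree(w, v):
        # True if v lies on the path from w up to the root (w included).
        x = w
        while x != 0:
            if x == v:
                return True
            x = parent[x]
        return False

    storage = sum(delta[(parent[v], v)] for v in versions)
    while True:
        best = None
        for v in versions:
            u = parent[v]
            if u == 0:
                continue                                  # already materialized
            extra = delta[(0, v)] - delta[(u, v)]         # increase in storage cost
            saved = recreation(v) - phi[(0, v)]           # saving per affected version
            if storage + extra > budget or saved <= 0:
                continue
            n_below = sum(1 for w in versions if in_subtree(w, v))
            rho = (saved * n_below) / max(extra, 1e-9)
            if best is None or rho > best[0]:
                best = (rho, v, extra)
        if best is None:
            return parent
        _, v, extra = best
        parent[v] = 0                                     # materialize v
        storage += extra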
Access Frequencies. Note that the algorithm can easily take into account access frequencies of different versions and instead optimize for the total weighted recreation cost (weighted by access frequencies). The algorithm is similar, except that the numerator of ρ will capture the reduction in weighted recreation cost.

3.4.2 Modified Prim's Algorithm

Next, we introduce a heuristic algorithm based on Prim's algorithm for minimum spanning trees for Problem 6, where the goal is to reduce the total storage cost while the recreation cost for each version stays within the threshold θ; the solution for Problem 4 is similar.

Outline. At a high level, the algorithm is a variant of Prim's algorithm, greedily adding the version with the smallest storage cost and the corresponding edge to form a spanning tree T. Unlike Prim's algorithm, where the spanning tree simply grows, in this case even an edge already present in T could be removed in future iterations. At all stages, the algorithm maintains the invariant that the recreation cost of every version in T is bounded by θ.

Detailed Algorithm. At each iteration, the algorithm picks the version Vi with the smallest storage cost to be added to the tree. Once this version Vi is added, we consider adding deltas to all other versions Vj such that their recreation cost through Vi is within the constraint θ and the storage cost does not increase. Each version maintains a pair l(Vi) and d(Vi): l(Vi) denotes the marginal storage cost of Vi, while d(Vi) denotes the total recreation cost of Vi. At the start, l(Vi) is simply the storage cost of Vi in its entirety.

We now describe the algorithm in detail. The set X represents the version set of the current spanning tree T. Initially X = ∅. In each iteration, the version Vi with the smallest storage cost (l(Vi)) in the priority queue PQ is picked and added into the spanning tree T (lines 7–8). When Vi is added into T, we need to update the storage cost and recreation cost for all Vj that are neighbors of Vi. Notice that in Prim's algorithm, we do not need to consider neighbors that are already in T. However, in our scenario a better path to such a neighbor may be found, and this may result in an update (lines 10–17). For instance, if edge ⟨Vi, Vj⟩ can make Vj's storage cost smaller while the recreation cost for Vj does not increase, we can update p(Vj) = Vi as well as d(Vj), l(Vj), and T. For neighbors Vj ∉ T (lines 19–24), we update d(Vj), l(Vj), p(Vj) if edge ⟨Vi, Vj⟩ can make Vj's storage cost smaller and the recreation cost for Vj is no bigger than θ. Algorithm 2 terminates in |V| iterations since one version is added into X in each iteration.

Figure 3.7: Directed Graph G
Figure 3.8: Undirected Graph G
Figure 3.9: Illustration of Modified Prim's algorithm on the graph in Figure 3.7

Example 6 Say we operate on the graph G given by Figure 3.7, and let the threshold θ be 6. Each version Vi is associated with a pair ⟨l(Vi), d(Vi)⟩. Initially, version V0 is pushed into the priority queue. When V0 is dequeued, each neighbor Vj updates ⟨l(Vj), d(Vj)⟩ as shown in Figure 3.9(a). Notice that l(Vi), i ≠ 0, is at this point simply the storage cost for that version. For example, when considering edge (V0, V1), l(V1) = 3 and d(V1) = 3 are set since the recreation cost (if V1 is stored in its entirety) is smaller than the threshold θ, i.e., 3 < 6. Afterwards, versions V1, V2 and V3 are inserted into the priority queue.
Next, we dequeue V1 since l(V1) is the smallest among the versions in the priority queue, and add V1 to the spanning tree. We then update ⟨l(Vj), d(Vj)⟩ for all neighbors of V1; e.g., the recreation cost for version V2 will be 6 and the storage cost will be 2 when considering edge (V1, V2). Since 6 ≤ 6, (l(V2), d(V2)) is updated to (2, 6) as shown in Figure 3.9(b); however, ⟨l(V3), d(V3)⟩ will not be updated since the recreation cost is 3 + 4 > 6 when considering edge (V1, V3). Subsequently, version V2 is dequeued because it has the lowest l(V2), and is added to the tree, giving Figure 3.9(b). Version V3 is dequeued next. When V3 is dequeued from PQ, (l(V2), d(V2)) is updated again. This is because the storage cost for V2 can be reduced to 1 while the recreation cost remains 6 when considering edge (V3, V2), even though V2 is already in T, as shown in Figure 3.9(c). Eventually, we get the final answer in Figure 3.9(d).

Complexity. The complexity of the algorithm is the same as that of Prim's algorithm, i.e., O(|E| log |V|). Each edge is scanned once, and the priority queue needs to be updated once per edge in the worst case.

Algorithm 2: Modified Prim's Algorithm
Input: Graph G = (V, E), threshold θ
Output: Spanning Tree T = (VT, ET)
1  Let X be the version set of the current spanning tree T; initially T = ∅, X = ∅;
2  Let p(Vi) be the parent of Vi; l(Vi) denotes the storage cost from p(Vi) to Vi, and d(Vi) denotes the recreation cost from root V0 to version Vi;
3  Initially, d(V0) = l(V0) = 0 and ∀i ≠ 0, d(Vi) = l(Vi) = ∞;
4  Enqueue ⟨V0, (l(V0), d(V0))⟩ into priority queue PQ
5  (PQ is sorted by l(Vi));
6  while PQ ≠ ∅ do
7      ⟨Vi, (l(Vi), d(Vi))⟩ ← top(PQ), dequeue(PQ);
8      T = T ∪ ⟨Vi, p(Vi)⟩, X = X ∪ Vi;
9      for Vj ∈ (Vi's neighbors in G) do
10         if Vj ∈ X then
11             if (Φi,j + d(Vi)) ≤ d(Vj) and Δi,j ≤ l(Vj) then
12                 T = T − ⟨Vj, p(Vj)⟩;
13                 p(Vj) = Vi;
14                 T = T ∪ ⟨Vj, p(Vj)⟩; d(Vj) ← Φi,j + d(Vi);
15                 l(Vj) ← Δi,j;
16             end
17         end
18         else
19             if (Φi,j + d(Vi)) ≤ θ and Δi,j ≤ l(Vj) then
20                 d(Vj) ← Φi,j + d(Vi);
21                 l(Vj) ← Δi,j; p(Vj) = Vi;
22                 enqueue (or update) ⟨Vj, (l(Vj), d(Vj))⟩ in PQ;
23             end
24         end
25     end
26 end
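A minimal Python rendering of Modified Prim's algorithm above is given below (an illustrative simplification that re-scans vertices instead of maintaining a priority queue; all names are ours):

import math

def modified_prim(edges, versions, theta):
    # edges: (i, j) -> (delta_ij, phi_ij); `versions` includes the dummy vertex 0.
    INF = math.inf
    l = {v: INF for v in versions}       # marginal storage cost of v
    d = {v: INF for v in versions}       # recreation cost of v
    p = {v: None for v in versions}
    l[0] = d[0] = 0
    p[0] = 0
    in_tree = set()

    while len(in_tree) < len(versions):
        # pick the vertex outside the tree with the smallest marginal storage cost
        vi = min((v for v in versions if v not in in_tree), key=lambda v: l[v])
        if l[vi] == INF:
            raise ValueError("no feasible storage graph under threshold theta")
        in_tree.add(vi)
        for (a, vj), (dl, ph) in edges.items():
            if a != vi:
                continue
            if vj in in_tree:
                # a cheaper delta that does not worsen the recreation cost
                if ph + d[vi] <= d[vj] and dl <= l[vj]:
                    p[vj], d[vj], l[vj] = vi, ph + d[vi], dl
            else:
                # respect the per-version recreation threshold
                if ph + d[vi] <= theta and dl <= l[vj]:
                    p[vj], d[vj], l[vj] = vi, ph + d[vi], dl
    return p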
Let MST denote the minimum spanning tree of graph G and SP (V0, Vi) denote the shortest path from V0 to Vi in G. The algorithm starts with theMST and then conducts a depth-rst traversal inMST . Each nodeV keeps track of its path cost from root as well as its parent, denoted as d(Vi) and p(Vi) respectively. Given the approximation parameter , when visiting each node Vi , we rst check whether d(Vi) is bigger than × SP (V0, Vi) where SP stands for shortest path. If yes, we replace the path to Vi with the shortest path 61 from root to Vi inG and update d(Vi) as well as p(Vi). In addition, we keep updating d(Vi) and p(Vi) during depth rst traversal as stated in line 4-7 of Algorithm 3. Example 7 Figure 3.10 (a) is the minimum spanning tree (MST) rooted at node V0 of G in Figure 3.8. The approximation threshold is set to be 2. The algorithm starts with the MST and conducts a depth-rst traversal in the MST from root V0. When visiting node V2, d(V2) = 3 and the shortest path to node V2 is 3, thus 3 < 2 × 3. We continue to visit node V2 and V3. When visiting V3, d(V3) = 8 > 2 × 3 where 3 is the shortest path to V3 in G. Thus, d(V3) is set to be 3 and p(V3) is set to be node 0 by replacing with the shortest path ⟨V0, V3⟩ as shown in Figure 3.10 (b). Afterwards, the back-edge < V3, V1 > is traversed in MST. Since 3 + 2 < 6, where 3 is the current value of d(V3), 2 is the edge weight of (V3, V1) and 6 is the current value in d(V1), thus d(V1) is updated as 5 and p(V1) is updated as node V3. At last node V4 is visited, d(V4) is rst updated as 7 according to line 3-7. Since 7 < 2×4, lines 9-11 are not executed. Figure 3.10 (c) is the resulting spanning tree of the algorithm, where the recreation cost for each node is under the constraint and the total storage cost is 3 + 3 + 2 + 2 = 10. Complexity. The complexity of the algorithm is O(|E| log |V |). Given the minimum spanning tree and shortest path tree rooted at V0, Algorithm 3 is conducted via depth rst traversal on MST. It is easy to show that the complexity for Algorithm 3 is O(|V |). The time complexity for computing minimum spanning tree and shortest path tree is O(|E| log |V |) using heap-based priority queue. 62 Algorithm 3: Balance MST and Shortest Path Tree [91] Input : Graph G = (V , E), MST , SP Output : Spanning Tree T = (VT , ET ) 1 Initialize T as MST . Let d(Vi) be the distance from V0 to Vi in T and p(Vi) be the parent of Vi in T . 2 while DFS traversal on MST do 3 (Vi , Vj)← the edge currently in traversal; 4 if d(Vj) > d(Vi) + ei,j then 5 d(Vj)← (d(Vi) + ei,j); 6 p(Vj)← Vi ; 7 end 8 if d(Vj) > ∗ SP (V0, Vj) then 9 add shortest path (V0, Vj) into T ; 10 d(Vj)← SP (V0, Vj); 11 p(Vj)← V0; 12 end 13 end V0 V V0 0 3 3 3 3 3 3 V V V 31 2 V 1 V V V3 2 3 2 2 2 2 2 V3 V4 V V4 1 2 (a) (b) V4 (c) Figure 3.10: Illustration of LAST on Figure 3.8 3.4.4 Git Heuristic This heuristic is an adaptation of the current heuristic used by Git and we refer to it as GitH. We rst describe our understanding of the heuristic used by Git when a user runs git-repack, followed by a sketch of GitH. Git uses delta compression to reduce the amount of storage required to store a large 63 number of les (objects) that contain duplicated information. However, git’s algorithm for doing so is not clearly described anywhere. An old discussion with Linus has a sketch of the algorithm [92]. However there have been several changes to the heuristics used that don’t appear to be documented anywhere. 
3.4.4 Git Heuristic

This heuristic is an adaptation of the current heuristic used by Git, and we refer to it as GitH. We first describe our understanding of the heuristic used by Git when a user runs git-repack, followed by a sketch of GitH.

Git uses delta compression to reduce the amount of storage required to store a large number of files (objects) that contain duplicated information. However, Git's algorithm for doing so is not clearly described anywhere. An old discussion with Linus has a sketch of the algorithm [92]. However, there have been several changes to the heuristics used that don't appear to be documented anywhere. The following describes our understanding of the algorithm based on the latest Git source code.1 Here we focus on “repack”, where the decisions are made for a large group of objects. However, the same algorithm appears to be used for normal commits as well. Most of the algorithm code is in the file builtin/pack-objects.c.

1 Cloned from https://github.com/git/git on 5/11/2015, commit id: 8440f74997cf7958c7e8ec853f590828085049b8

Step 1: Sort the objects, first by “type”, then by “name hash”, and then by “size” (in decreasing order). The comparator is (line 1503): static int type_size_sort(const void *_a, const void *_b). Note the name hash is not a true hash; the pack_name_hash() function (pack-objects.h) simply creates a number from the last 16 non-whitespace characters, with the last characters counting the most (so all files with the same suffix, e.g., .c, will sort together).

Step 2: The next key function is ll_find_deltas(), which goes over the files in the sorted order. It maintains a list of W objects (W = window size, default 10) at all times. For the next object, say O, it finds the delta between O and each of the objects, say B, in the window; it chooses the object with the minimum value of: delta(B, O) / (max_depth − depth of B), where max_depth is a parameter (default 50), and the depth of B refers to the length of the delta chain between a root and B. The original algorithm appears to have only used delta(B, O) to make the decision, but the “depth bias” (the denominator) was added at a later point to prefer slightly larger deltas with smaller delta chains. The key lines for the above part:

• line 1812 (check each object in the window): ret = try_delta(n, m, max_depth, &mem_usage);
• lines 1617-1618 (depth bias): max_size = (uint64_t)max_size * (max_depth - src->depth) / (max_depth - ref_depth + 1);
• line 1678 (compute delta and compare size): delta_buf = create_delta(src->index, trg->data, trg_size, &delta_size, max_size);

create_delta() returns non-null only if the new delta being tried is smaller than the current delta (modulo the depth bias), specifically, only if the size of the new delta is less than the max_size argument. Note: lines 1682-1688 appear redundant given the depth bias calculations.

Step 3: Originally the window was just the last W objects before the object O under consideration. However, the current algorithm shuffles the objects in the window based on the choices made. Specifically, let b1, …, bW be the current objects in the window, and let the object chosen to delta against for O be bi. Then bi is moved to the end of the list, so the new list is: [b1, b2, …, bi−1, bi+1, …, bW, O, bi]. Then, when we move to the new object after O (say O′), we slide the window, and so the new window is: [b2, …, bi−1, bi+1, …, bW, O, bi, O′]. Small detail: the list is actually maintained as a circular buffer, so the list doesn't have to be physically “shifted” (moving bi to the end does involve a shift though). The relevant code here is lines 1854-1861.

Finally, we note that Git never considers/computes/stores a delta between two objects of different types, and it does the above in a multi-threaded fashion, by partitioning the work among a given number of threads. Each of the threads operates independently of the others.

GitH. GitH uses two parameters: w (window size) and d (max depth). We consider the versions in a non-increasing order of their sizes. The first version in this ordering is chosen as the root of the storage graph and has depth 0 (i.e., it is materialized). At all times, we maintain a sliding window containing at most w versions. For each version Vi after the first one, let Vl denote a version in the current window. We compute Δ′l,i = Δl,i/(d − dl), where dl is the depth of Vl (thus deltas with shallow depths are preferred over slightly smaller deltas with higher depths). We find the version Vj with the lowest value of this quantity and choose it as Vi's parent (as long as dj < d); the depth of Vi is then set to dj + 1. The sliding window is then modified: Vj is moved to the end of the window (so that it stays in the window longer), Vi is added to the window, and the version at the beginning of the window is dropped.

Complexity. The running time of the heuristic is O(|V| log |V| + w|V|), excluding the time to construct deltas.
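A condensed Python sketch of GitH as described above is given below (illustrative only; delta_size stands in for whatever differencing function is used, and the fallback of materializing a version when no window entry is usable is our own simplification):

from collections import deque

def gith(versions, size, delta_size, w=10, d=50):
    # versions: identifiers; size: v -> full size; delta_size(a, b): size of
    # the delta from a to b. Returns parent pointers and depths.
    order = sorted(versions, key=lambda v: size[v], reverse=True)   # non-increasing size
    root = order[0]
    parent = {root: root}
    depth = {root: 0}
    window = deque([root], maxlen=w)        # appending drops the oldest entry when full

    for vi in order[1:]:
        best, best_score = None, float("inf")
        for vl in window:
            if depth[vl] >= d:
                continue                                   # would exceed the max depth
            score = delta_size(vl, vi) / (d - depth[vl])   # depth-biased delta size
            if score < best_score:
                best, best_score = vl, score
        if best is None:
            parent[vi], depth[vi] = vi, 0                  # materialize if no usable base
        else:
            parent[vi], depth[vi] = best, depth[best] + 1
            window.remove(best)
            window.append(best)                            # keep the chosen base around longer
        window.append(vi)
    return parent, depth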
At all times, we maintain a sliding window containing at most w versions. For each version Vi after the first one, let Vl denote a version in the current window. We compute Δ′l,i = Δl,i/(d − dl), where dl is the depth of Vl (thus deltas with shallow depths are preferred over slightly smaller deltas with higher depths). We find the version Vj with the lowest value of this quantity and choose it as Vi's parent (as long as dj < d); the depth of Vi is then set to dj + 1. The sliding window is then modified as in Git: Vj is moved to the end of the window (so it stays in the window longer), Vi is added to the window, and the version at the beginning of the window is dropped.

Complexity. The running time of the heuristic is O(|V| log |V| + w|V|), excluding the time to construct deltas.

                               DC            LC            BF        LF
Number of versions             100010        100002        986       100
Number of deltas               18086876      2916768       442492    3562
Average version size (MB)      347.65        356.46        0.401     422.79
MCA-Storage Cost (GB)          1265.34       982.27        0.0250    2.2402
MCA-Sum Recreation Cost (GB)   11506437.83   29934960.95   0.9648    47.6046
MCA-Max Recreation Cost (GB)   257.6         717.5         0.0063    0.5998
SPT-Storage Cost (GB)          33953.84      34811.14      0.3854    41.2881
SPT-Sum Recreation Cost (GB)   33953.84      34811.14      0.3854    41.2881
SPT-Max Recreation Cost (GB)   0.524         0.55          0.0063    0.5091

Table 3.2: Dataset properties

Figure 3.11: Distribution of delta sizes in the datasets (each delta size scaled by the average version size in the dataset; y-axis: normalized delta values).

3.5 Experiments

In the following sections, we present an extensive evaluation of our designed algorithms using a combination of synthetic and derived real-world datasets. Apart from implementing the algorithms described above, we note that LMG and LAST require both SPT and MST as input. For both directed and undirected graphs, we use Dijkstra's algorithm to find the single-source shortest path tree (SPT). We use Prim's algorithm to find the minimum spanning tree for undirected graphs. For directed graphs, we use an implementation [93] of Edmonds' algorithm [87] for computing the min-cost arborescence (MCA). We ran all our experiments on a 2.2GHz Intel Xeon CPU E5-2430 server with 64GB of memory, running 64-bit Red Hat Enterprise Linux 6.5.

3.5.1 Datasets

We use four datasets: two synthetic and two derived from real-world source code repositories. Although there are many publicly available source code repositories with large numbers of commits (e.g., on GitHub), those repositories typically contain fairly small (source code) files, and further, the changes between versions tend to be localized and are typically very small; we expect dataset versions generated during collaborative data analysis to contain much larger datasets and to exhibit large changes between versions. We were unable to find any realistic workloads of that kind. Hence, we generated realistic dataset versioning workloads as follows. First, we wrote a synthetic version generator suite, driven by a small set of parameters, that is able to generate a variety of version histories and corresponding datasets. Second, we created two real-world datasets using publicly available forks of popular repositories on GitHub. We describe each of the two below.

Synthetic Datasets: Our synthetic dataset generation suite (which may be of independent interest to researchers working on version management) takes a two-step approach to generate a dataset, which we sketch below.
The first step is to generate a version graph with the desired structure, controlled by the following parameters:
• number of commits, i.e., the total number of versions.
• branch interval and probability, the number of consecutive versions after which a branch can be created, and the probability of creating a branch.
• branch limit, the maximum number of branches from any point in the version history. We choose a number in [1, branch limit] uniformly at random when we decide to create branches.
• branch length, the maximum number of commits in any branch. The actual length is a uniformly chosen integer between 1 and branch length.

Once a version graph is generated, the second step is to generate the appropriate versions and compute the deltas. The files in our synthetic dataset are ordered CSV files (containing tabular data) and we use deltas based on UNIX-style diffs. The previous step also annotates each edge (u, v) in the version graph with edit commands that can be used to produce v from u. Edit commands are a combination of the following six instructions: add or delete a set of consecutive rows, add or remove a column, and modify a subset of rows or columns. Using this, we generated two synthetic datasets (Table 3.2):
• Densely Connected (DC): This dataset is based on a "flat" version history, i.e., the number of branches is high, they occur often, and they have short lengths. For each version in this dataset, we compute the delta with all versions within a 10-hop distance in the version graph to populate additional entries in Δ and Φ.
• Linear Chain (LC): This dataset is based on a "mostly-linear" version history, i.e., the number of branches is low, they occur after large intervals, and they have longer lengths. For each version in this dataset, we compute the delta with all versions within a 25-hop distance in the version graph to populate Δ and Φ.

Real-world datasets: We use 986 forks of the Twitter Bootstrap repository and 100 forks of the Linux repository to derive our real-world workloads. For each repository, we check out the latest version in each fork and concatenate all files in it (by traversing the directory structure in lexicographic order). Thereafter, we compute deltas between all pairs of versions in a repository, provided the size difference between the versions under consideration is less than a threshold. We set this threshold to 100KB for the Twitter Bootstrap repository and 10MB for the Linux repository. This gives us two real-world datasets, Bootstrap Forks (BF) and Linux Forks (LF), with properties shown in Table 3.2.

3.5.2 Comparison with SVN and Git

We begin by evaluating the performance of two popular version control systems, SVN (v1.8.8) and Git (v1.7.1), using the LF dataset. We create an FSFS-type repository in SVN, which is more space efficient than a Berkeley DB-based repository [94]. We then import the entire LF dataset into the repository in a single commit. The amount of space occupied by the db/revs/ directory is around 8.5GB and it takes around 48 minutes to complete the import. We contrast this with the naive approach of applying gzip to the files, which results in a total compressed storage of 10.2GB. In the case of Git, we add and commit the files in the repository and then run git repack -a -d --depth=50 --window=50 on the repository (unlike git repack, svnadmin pack has a negligible effect on the storage cost, as it primarily aims to reduce disk seeks and the per-version disk usage penalty by concatenating files into a single "pack" [95, 96]). The size of the Git pack file is 202MB, although the repack consumes 55GB of memory and takes 114 minutes (for higher window sizes, Git fails to complete the repack as it runs out of memory).
In comparison, the solution found by the MCA algorithm occupies 516MB of compressed storage (2.24GB when uncompressed) when using UNIX diff for computing the deltas. To make a fair comparison with Git, we use xdiff from the LibXDiff library [97] for computing the deltas, which forms the basis of Git's delta computing routine. Using xdiff brings down the total storage cost to just 159MB. The total time taken is around 102 minutes; this includes the time taken to compute the deltas and then to find the MCA for the corresponding graph.

The main reason behind SVN's poor performance is its use of "skip-deltas" to ensure that at most O(log n) deltas are needed for reconstructing any version; that tends to lead it to repeatedly store redundant delta information, as a result of which the total space requirement increases significantly. The heuristic used by Git is much better than SVN's (Section 3.4.4). However, as we show later (Figure 3.12), our implementation of that heuristic (GitH) required more storage than LMG for guaranteeing similar recreation costs.

Figure 3.12: Results for the directed case, comparing the storage costs and total recreation costs (panels (a)–(d): datasets DC, LC, BF, and LF; algorithms LMG, MP, LAST, and GitH; x-axis: storage cost in GB, y-axis: sum of recreation costs in GB).

Figure 3.13: Results for the directed case, comparing the storage costs and maximum recreation costs (panels (a)–(b): datasets DC and LF; algorithms LMG, MP, and LAST; x-axis: storage cost in GB, y-axis: maximum recreation cost in GB).

Figure 3.14: Results for the undirected case, comparing the storage costs and total recreation costs (panels (a)–(c): datasets DC, LC, and BF) or maximum recreation costs (panel (d): dataset DC); algorithms LMG, MP, and LAST.

3.5.3 Experimental Results

Directed Graphs. We begin with a comprehensive evaluation of the three algorithms, LMG, MP, and LAST, on directed datasets. Given that all of these algorithms have parameters that can be used to trade off the storage cost and the total recreation cost, we compare them by plotting the different solutions they are able to find for the different values of their respective input parameters. Figure 3.12(a–d) shows four such plots; we run each of the algorithms with a range of different values for its input parameter and plot the storage cost and the total (sum) recreation cost for each of the solutions found.
We also show the minimum possible values for these two costs: the vertical dashed red line indicates the minimum storage cost required for storing the versions in the dataset, as found by MCA, and the horizontal one indicates the minimum total recreation cost, as found by SPT (equal to the sum of all version sizes).

The first key observation we make is that the total recreation cost decreases drastically if we allow a small increase in the storage budget over MCA. For example, for the DC dataset, the sum recreation cost for MCA is over 11PB (see Table 3.2), as compared to just 34TB for the SPT solution (which is the minimum possible). As we can see from Figure 3.12(a), a space budget of 1.1× the MCA storage cost reduces the sum of recreation costs by three orders of magnitude. Similar trends can be observed for the remaining datasets and across all the algorithms. We observe that LMG results in the best trade-off between the sum of recreation costs and the storage cost, with LAST performing fairly closely. An important takeaway here, especially given the amount of prior work that has focused purely on storage cost minimization, is that it is possible to construct balanced trees where the sum of recreation costs can be reduced and brought close to that of SPT while using only a fraction of the space that SPT needs.

We also ran the GitH heuristic on all four datasets with varying window and depth settings. For BF, we ran the algorithm with four different window sizes (50, 25, 20, 10) for a fixed depth of 10 and provided the GitH algorithm with all the deltas that it requested. For all other datasets, we ran GitH with an infinite window size but restricted it to choose from deltas that were available to the other algorithms (i.e., only deltas with sizes below a threshold); as we can see, the solutions found by GitH exhibited very good total recreation cost, but required significantly higher storage than the other algorithms. This is not surprising given that GitH is a greedy heuristic that makes choices in a somewhat arbitrary order.

In Figures 3.13(a–b), we plot the maximum recreation costs, instead of the sum of recreation costs across all versions, for two of the datasets (the other two datasets exhibited similar behavior). The MP algorithm found the best solutions here for all datasets, and we also observed that LMG and LAST both show plateaus for some datasets, where the maximum recreation cost did not change when the storage budget was increased. This is not surprising given that the basic MP algorithm tries to optimize for the storage cost given a bound on the maximum recreation cost, whereas both LMG and LAST focus on minimization of the storage cost, and one version with a high recreation cost is unlikely to affect that significantly.

Figure 3.15: Taking workload into account leads to better solutions (LMG vs. workload-aware LMG-W on the DC and LF datasets; x-axis: storage cost in GB, y-axis: sum of recreation costs in GB).

Undirected Graphs. We test the three algorithms on the undirected versions of three of the datasets (Figure 3.14). For DC and LC, undirected deltas between pairs of versions were obtained by concatenating the two directional deltas; for the BF dataset, we use UNIX diff itself to produce undirected deltas. Here again we observe that LMG consistently outperforms the other algorithms in terms of finding a good balance between the storage cost and the sum of recreation costs.
MP again shows the best results when trying to balance the maximum recreation cost and the total storage cost. Similar results were observed for the other datasets but are omitted due to space limitations.

Workload-aware Sum of Recreation Cost Optimization. In many cases, we may be able to estimate access frequencies for the various versions (from historical access patterns), and if available, we may want to take those into account when constructing the storage graph. The LMG algorithm can be easily adapted to take such information into account, whereas it is not clear how to adapt either LAST or MP in a similar fashion. In this experiment, we use LMG to compute a storage graph such that the sum of recreation costs is minimal given a space budget, while taking workload information into account. The workload here assigns a frequency of access to each version in the repository using a Zipfian distribution (with exponent 2); real-world access frequencies are known to follow such distributions. Given the workload information, the algorithm should find a storage graph whose (frequency-weighted) sum of recreation costs is lower than that obtained when the workload information is not taken into account (i.e., when all versions are assumed to be accessed equally frequently). Figure 3.15 shows the results for this experiment. As we can see, for the DC dataset, taking the access frequencies into account during optimization led to much better solutions than ignoring them. On the other hand, for the LF dataset, we did not observe a large difference.

Running Times. Here we evaluate the running times of the LMG algorithm. Recall that LMG takes MST (or MCA) and SPT as inputs. In Figure 3.16, we report the total running time as well as the time taken by LMG itself. We generated a set of version graphs as subsets of the graphs for the LC and DC datasets as follows: for a given number of versions n, we randomly choose a node and traverse the graph starting at that node in a breadth-first manner until we construct a subgraph with n versions. We generate 5 such subgraphs for increasing values of n and report the average running time for LMG; the storage budget for LMG is set to three times the space required by the MST (all our reported experiments with LMG use a smaller storage budget than that). The time taken by LMG on the DC dataset is more than on LC for the same number of versions; this is because DC has lower delta values than LC (see Fig. 3.2) and thus requires more edges from SPT to satisfy the storage budget.

Figure 3.16: Running times of LMG (panels: (a) directed, (b) undirected; x-axis: number of versions, y-axis: time in seconds; curves show LMG time and total time for the LC and DC datasets).

On the other hand, MP takes between 1 and 8 seconds on those datasets when the recreation cost bound is set to the maximum. Similar to LMG, LAST requires the MST/MCA and SPT as inputs; however, the running time of LAST itself is linear and it takes less than 1 second in all cases. Finally, the time taken by GitH on the LC and DC datasets for varying window sizes ranges from 35 seconds (window = 1000) to a little more than 120 minutes (window = 100000); note that this excludes the time for constructing the deltas. In summary, although LMG is inherently a more expensive algorithm than MP or LAST, it runs in reasonable time on large input sizes; we note that all of these times are likely to be dwarfed by the time it takes to construct deltas, even for moderately-sized datasets.

Comparison with ILP solutions.
Finally, we compare the quality of the solutions found by MP with the optimal solution found using the Gurobi Optimizer for Problem 6. We use the ILP formulation from Section 3.2.3 with a constraint on the maximum recreation cost (θ), and compare the optimal storage cost with that of the MP algorithm (which resulted in the solutions with the lowest maximum recreation costs in our evaluation). We use our synthetic dataset generation suite to generate three small datasets, with 15, 25 and 50 versions, denoted v15, v25 and v50 respectively, and compute deltas between all pairs of versions.

v15   θ     0.20   0.21   0.22   0.23   0.24
      ILP   0.36   0.36   0.22   0.22   0.22
      MP    0.36   0.36   0.23   0.23   0.23
v25   θ     0.63   0.66   0.69   0.72   0.75
      ILP   2.39   1.95   1.50   1.18   1.06
      MP    2.88   2.13   1.7    1.18   1.18
v50   θ     0.30   0.34   0.41   0.54   0.68
      ILP   1.43   1.10   0.83   0.66   0.60
      MP    1.59   1.45   1.06   0.91   0.82

Table 3.3: Comparing ILP and MP solutions for small datasets, given a bound on the max recreation cost, θ (in GB)

Table 3.3 reports the results of this experiment, across five θ values. The ILP turned out to be very difficult to solve, even for these very small problem sizes, and in many cases the optimizer did not finish and the reported numbers are the best solutions found by it. As we can see, the solutions found by MP are quite close to the ILP solutions for the small problem sizes for which we could get any solutions out of the optimizer. However, extrapolating from the (admittedly limited) data points, we expect that on large problem sizes, MP may be significantly worse than optimal for some variations on the problems (we note that the optimization problem formulations involving the max recreation cost are likely to turn out to be harder than the formulations that focus on the average recreation cost). Development of better heuristics and approximation algorithms with provable guarantees for the various problems that we introduce are rich areas for further research.

Chapter 4: A Unified Query Language for Provenance and Versioning

4.1 Introduction

In this chapter, we present our design of a version-aware query language, called VQUEL, capable of querying dataset versions, dataset provenance (e.g., which datasets a given dataset was derived from), and record-level provenance (if available). While git and svn have proved tremendously useful for collaborative source code management, their versioning API is based on a notion of files, not structured records, and as such is not a good fit for a scenario with a mix of structured and unstructured datasets; the versioning API is also not capable of allowing data scientists to reason about the data contained within versions and the relationships between the versions in a holistic manner. VQUEL draws from constructs introduced in the historical Quel [59] and GEM [60] languages, neither of which had a temporal component.

4.2 Preliminaries

Recall that in DEX, a version consists of one or more datasets that are semantically grouped together.
Every version is identified by an ID, is immutable, and any update to a version conceptually results in a new version with a different version ID (note that the physical data structures are not necessarily immutable, as we described in the previous chapter). New versions can also be created through the application of transformation programs to one or more existing versions. The version-level provenance that captures these processes is maintained as a version graph.

Figure 4.1: Conceptual data model for VQUEL (entities: Version, Relation, Record, Author): the notation "{T}" denotes a set of values of T; fields in the Record entity can be conceptually thought of as a union of all fields across records; other fields and entities (for instance Authors) are not shown to keep the discussion brief; for each entity, entries in the left and right columns denote the attribute name and type respectively.

Figure 4.1 shows a portion of the conceptual data model that we use to write queries against, and Figure 4.2 shows an example of a few versions along with the version graph connecting them. The data model consists of four essential tables: Version, Relation, File, and Record. Additional tables like Column and Author are required in DEX but are not essential for the purpose of this discussion. The difference between Relation and File is that a relation has a fixed schema for all its records (recorded in the Column table) while a file has no such requirement. To that effect, we denote the records in a relation as tuples.

Figure 4.2: An example version graph where circles denote versions; version V1 has two Relations, Employee and Department, each having a set of records, {E1, E2, E3} and {D1, D2} respectively; version V2 adds new records to both the Employee and Department relations and also adds a new File, Forms.csv. Edge annotations (not shown) are used to capture information about the derivation process itself, including references to transformation programs or scripts if needed.

The Version table maintains the information about the different versions in the database, including the "commit_id" (unique across the versions), and various attributes capturing metadata about the version, such as the creation time and author, as well as "commit_msg" and "creation_ts", representing the commit message and creation time respectively. There are four set-valued attributes called "relations", "files", "parents" and "children", recording the relations and files contained in the version, and the direct parents and children in the version graph respectively. The last two refer back to the Version table, whereas the first two refer to the Relation and File tables respectively. A tuple in the Relation table, in turn, records the information for a relation including its schema; we view the tuples in the relation as a set-valued attribute of this table itself, which allows us to locate a relation and then query the data inside it, as we will see in the next section. The Files table is analogous, but records information appropriate for an unstructured file.
Note that neither of these tables has a primary key but rather the at- tributes “name” and “full_path” serve as discriminators, and must be combined with the 81 version “id” to construct primary keys. The “changed” attribute is a derived (redundant) attribute that indicates whether the relation/le changed from the parent version, and is very useful for version-oriented queries. Finally, Record is a virtual table that can be conceptually thought of as a union of all tuples and records in all relations and les across the versions. The one exception are the “parents” and “children” attributes, which refer back to the Record table and can be used to refer to ne-grained provenance information within queries. This table is never directly referenced in the queries, but is depicted here for completeness. The prove- nance information must “obey” the version graph, e.g., in the example shown, records in version V2 can only have records in version V1 as parents. We note here that this data model is a high-level conceptual one mainly intended for ease of querying and aims to maximize data independence. For instance, although the ne-grained provenance information is conceptually maintained in the Record table here and can be queried using the “parents” and “children” attributes, the implementa- tion could maintain that information at schema-level wherever feasible to minimize the storage requirements. 4.3 Language Features VQUEL is largely a generalization of the Quel language (while also introducing certain syntactic conveniences that Quel does not possess), and combines features from GEM and path-based query languages. This means thatVQUEL is a full-edged relational query language, and in addition, it enables the seamless querying of the nested data 82 model described in the previous section, encoding versioning derivation relationships, as well as versioning metadata. VQUEL will be illustrated using example queries on the repository shown in Fig- ure 4.2, with certain deviations introduced when necessary. We will introduce the con- structs in VQUEL incrementally, starting from those present in Quel to the new ones designed for the setting in DEX. For ease of understanding, we rst present a version that is clear and easy to understand, but results in longer queries. In Section 4.3.2 we describe additional constructs to make the queries concise. 4.3.1 Examples We begin with some simpleVQUEL queries. Most of these queries are also straight- forward to write in SQL; the queries that cannot be written in SQL easily begin in Sec- tion 4.3.3. Here, we gradually introduce the constructs of VQUEL as a prelude to the more complex queries combining versioning and data. Query 1 Who is the author of version with id “v01”? 1 range of V is Version 2 retrieve V.author.name 3 where V.id = "v01" A VQUEL query has two elements: iterator setup (range above) and retrieval (retrieve above) of objects satisfying a predicate (where above). Iterators in VQUEL are similar to tuple variables in Quel, but more powerful, in the sense that they can iterate over objects at any level of our hierarchical data model. They are declared with a statement of the form: 1 range of is 83 The retrieve statement is used to select the object properties, and is of the form: 1 retrieve [into ][unique] 2 [where ] 3 [sort by [asc/desc] {, [asc/desc]}] The retrieve statement fetches all the object attributes specied in the target-list for those objects satisfying the where clause. Query 2 What commits did Alice make after January 01, 2015? 
1 range of V is Version 2 retrieve V.all 3 where V.author.name = "Alice" and V.creation_ts >= "01/01/2015" In Queries 1 and 2, note the use of GEM-style tuple-reference attributes, namely V.author , and the keyword all from Quel. The comparators =, !=, <, <=, > and >= are allowed in comparisons, and the logical connectives and, or, and not can be used to combine comparisons. Multiple iterators can be set up before a retrieval statement, and their respective sets can be dened as a function of previously declared iterators. The next example illustrates this idea. The rst range clause sets up an iterator V over all the versions. The second range clause denes an iterator over all relations inside a version. Query 3 List the commit timestamps of versions that contain the Employee relation. 1 range of V is Version 2 range of R is V.Relations 3 retrieve V.commit_ts 4 where R.name = "Employee" Query 4 Show the commit history of the Employee relation in reverse chronological order. 1 range of V is Version 2 range of R is V.Relations 3 retrieve V.creation_ts, V.author.name, V.commit_message 4 where R.name = "Employee" and R.changed = true 5 sort by V.creation_ts desc 84 Similarly, we can set up a range clause over tuples inside a relation. Analogous to a relational database, the user needs to be familiar with the schema to be able to pose such a query. Query 5 Show the history of the tuple with employee id “e01” from Employee relation. 1 range of V is Version 2 range of R is V.Relations 3 range of E is R.Tuples 4 retrieve E.all, V.commit_id, V.creation_ts 5 where E.employee_id = "e01" and R.name = "Employee" 6 sort by V.creation_ts 4.3.2 Syntactic sweetenings In this section, we introduce some shorthand constructs to keep the size of the queries small. These constructs are meant only for brevity, and each of them can be mapped to an equivalent query without using shorthands. The rst one is analogous to a lter operation over a set declaration: we can use predicates in the set declaration block of the range statement. For instance, in the follow- ing example, both queries iterate over the same set of versions. Note that the retrieve into clause in (b1) sets up a new iterator V over all the versions satisfying constraints in where clause. 1 (a1) range of V is Version(id = "v01") 2 3 (b1) range of T is Version 4 retrieve into V (T.all) 5 where T.id = "v01" The next example shows the principle in action on a query that would otherwise become quite long. Again, (a2) and (b2) below show identical queries written using the short notation (a) and their equivalent form (b). 85 Query 6 Find all Employee tuples in version “v01” that are dierent in version “v02”. 1 (a2) range of E1 is Version(id = "v01").Relations(name = "Employee").Tuples 2 range of E2 is Version(id = "v02").Relations(name = "Employee").Tuples 3 retrieve E1.all 4 where E1.employee_id = E2.employee_id and E1.all != E2.all 5 6 (b2) range of V1 is Version 7 range of R1 is V1.Relations 8 range of E1 is R1.Tuples 9 range of V2 is Version 10 range of R2 is V2.Relations 11 range of E2 is R2.Tuples 12 retrieve E1.all 13 where V1.id="v01" and R1.name="Employee" 14 and V2.id="v02" and R2.name="Employee" 15 and E1.employee_id = E2.employee_id and E1.all != E2.all 4.3.3 Aggregate operators The aggregate functions sum, avg, count, any, min and max are also provided in VQUEL. Any expression involving components of iterated entity attributes, constants and arith- metic symbols can be used as the argument of these functions. 
Due to the nested nature of iterators, we introduce the _all version of these operators, i.e. count_all, sum_all, etc. The general syntax of an aggregate expression is: 1 agg_op([/] [group by ] [where < predicate>]) This evaluates the agg_op on each group of of objects that satisfy the . We see two examples next. Query 7 For each version, count the number of relations inside it. 1 range of V is Version 2 range of R is V.Relations 3 retrieve V.id, count(R) Query 8 Find all versions containing precisely 100 Employees with last name “Smith”. 86 1 range of V is Version 2 range of E is V.Relations(name = "Employee").Tuples 3 retrieve V.commit_id 4 where count(E.employee_id where E.last_name = "Smith") = 100 In both queries above, the aggregation is performed only over objects at the innermost level of an iterator expression. In query 7, R is an iterator over relations inside a version V, and count iterates only over the innermost level of this iterator hierarchy, that is, R. Similarly, in query 8, the count expression only iterates over the tuples inside a relation inside a version. Notice that the latter query is not very easy to express in vanilla SQL: there is no easy way to use SQL to retrieve version numbers, which in a traditional non-versioned context would either be considered as schema-level information, or involve multiple joins depending on the level of normalization of the schema. VQUEL, on the other hand, allows us to set up the nested iterators that makes such queries very easy to express. The next two examples show the usage of count_all operator. The dierence from the count operator is that all the “parent” iterators are evaluated, instead of only the innermost iterator, to compute the value of the aggregate. Another way to reason about this behavior is that count has an implicit grouping list of attributes in its by clause: query 9 is identical to query 8. Query 9 Find all versions containing precisely 100 employees with last name “Smith”. 1 range of V is Version 2 range of R is V.Relations(name = "Employee") 3 range of E is R.Tuples 4 retrieve V.commit_id 5 where count_all(E.employee_id group by R, V where E.last_name = "Smith") = 100 Aggregates having a group by clause can also be used in the predicate to restrict the 87 results of the query. In query 9, the result of count_all for each group is compared against 100. Query 10 gives another example. Query 10 Find all versions containing precisely 100 tuples in all relations put together inside a version. 1 range of V is Version 2 range of R is V.Relations 3 range of T is R.Tuples 4 retrieve V.all 5 where count_all(T group by V) = 100 The next few examples show how we can use aggregate operators across a set of versions to answer a variety of questions about the data. Query 11 Among a group of versions, nd the version containing most tuples that satisfy a predicate. For instance, which version contains the most number of employees above age 50? 1 range of V is Version 2 range of E is V.Relations(name = "Employee").Tuples 3 retrieve into T (V.id as id, count(E.id where E.age > 50) as c) 4 retrieve T.id 5 where T.c = max(T.c) Up until now, for an iterator, we have been exploring “down” the hierarchy. We also provide appropriate functions, depending on the type of iterator, to refer to values of entities “up” in the hierarchy. In the next query, Version(T) is used to refer to the version attributes of tuples in T. Query 12 Which versions are such that the natural join between relations S and T has more than 100 tuples? 
88 1 range of V is Version 2 range of S is V.Relations(name = "S").Tuples 3 range of T is V.Relations(name = "T").Tuples 4 retrieve into Q(V.id as id, 5 count_all(S.id group by V where S.id = T.s_id and Version(S).id = Version(T).id) as c) 6 retrieve Q.id 7 where Q.c >= 100 4.3.4 Version graph traversal VQUEL has three constructs aimed at traversing the version graph. Each of these operate on a version at a time, specied over an iterator. • P(): Return the set of ancestor version of this version, until integer num- ber of hops in the version graph. If the number of hops is not specied, we go till the rst version. Duplicates are removed. • D(): Similar to P() except that it returns the descendant/derived versions. • N(): Similar to P() except that it returns the versions that are number of hops away. The next few queries illustrate these constructs. Notice once again that queries of this type are not very easy to express in SQL, which does not permit the easy traversal of graphs, or specication of path queries. The constructs we introduce are reminiscent of constructs in graph traversal languages [98]; these combined with the rest of the power of VQUEL enable some fairly challenging queries to be expressed rather easily. Query 13 Find all versions within 2 commits of “v01” which have less than 100 employees. 89 1 range of V is Version(id = "v01") 2 range of N is V.N(2) 3 range of E is N.Relations(name = "Employee").Tuples 4 retrieve N.all 5 where count(E) < 100 Query 14 Find all versions where the delta from the previous version is greater than 100 tuples. 1 range of V is Version 2 range of P is V.P(1) 3 retrieve unique V.all 4 where abs(count(V.Relations.Tuples) - count(P.Relations.Tuples)) > 100 Query 15 For each tuple in Employee relation as of version “v01”, nd the parent version where it rst appeared. 1 range of V is Version(id = "v01") 2 range of E is V.Relations(name = "Employee").Tuples 3 range of P is V.P() 4 range of PE is P.Relations(name = "Employee").Tuples 5 retrieve E.id, P.id 6 where E.employee_id = PE.employee_id and P.commit_ts = min(P.commit_ts) 4.3.5 Extensions to ne-grained provenance Finally, in some cases, we may have complete transparency into the operations performed by data scientists. In such cases, we can record, reason about, and access tuple-level provenance information. Here is an example of a query that can refer to tuple-level provenance: Query 16 For tuples in version “v01” in relation S that satisfy a predicate, say value of attribute attr = x, nd all parent tuples that they depend on. 1 range of E is Version(id = ‘‘v01’’).Relations(name = ‘‘S’’).Tuples 2 range of P is E.parents 3 retrieve E.id, P.id 4 where E.attr = x 90 Similar queries can be used to “walk up” the derivation path of given tuples, for example, to identify the origins of specic tuples. 91 Chapter 5: Query Execution I: Set-based Operations 5.1 Introduction As discussed in Chapter 3, DEX makes use of delta encoding to store past versions of datasets eciently on disk. Many archival and backup systems, including version control systems like git, SVN, etc., often store multiple versions or snapshots of large datasets or les that have signicant overlap across their contents using deltas. As a re- sult, there has been signicant work on various aspects of delta encoding-based storage systems: computing near-optimal deltas for a variety of data formats [99, 100], quickly nding ideal les to delta from [101], and supporting delta storage in le systems, scien- tic databases, network transport, etc. 
[5, 67]. However, existing delta-oriented storage engines oer limited or no support for querying the data stored within them; the pri- mary query type supported by those engines is checkout, i.e., reconstructing a specic version of a dataset or a le. With such storage engines becoming ecient and main- stream, there is an increasing desire and opportunity to perform rich analysis queries over the historical information contained within such data stores. The queries of inter- est include auditing or provenance queries over the datasets (e.g., identify the datasets where a particular property holds), analyzing the evolution of a dataset over time (i.e., temporal analytics), and comparing results of SQL-like queries over dierent versions 92 of the same dataset (obtained through, e.g., applying dierent analysis pipelines to the same initial dataset). However, other delta-oriented storage engines of today require users to “check out” complete le/dataset versions in order to manipulate them. This approach is less than ideal particularly when the individual versions are large and the users need to ac- cess multiple versions for their analysis task. In this chapter, we present computation- ally cheap methods to evaluate a query by pushing down query execution to the level of deltas. 5.2 System Overview We begin with a brief description of the user-facing data model before describing the dif- ferent types of queries that we support. Thereafter, we describe the system data model, i.e., the physical organization of data, and the primitives used by the system to evaluate the queries. 5.2.1 User Data Model We recall a few important denitions for ease of exposition. The user data model in DEX has two main abstractions – datafile, and version – that form the basis of all user interactions. As mentioned earlier, a datafile is a le whose contents are interpreted as set of records. The user species a record separator when a datafile is added in the system. Within a datafile, we consider a record as an unstructured sequence of bytes. The only 93 constraint we impose, however, is that a datafile cannot contain identical records: two records are said to be identical if they both have the same sequence of bytes. For instance, textual at les such as CSV or logs can be seen as containing one record per line. A version is a point-in-time snapshot of one or more datafiles typically residing in a directory on the user’s le system. A version, identied by a unique ID, is immutable, and can be created at any point in time by any user who has access to the repository. In addition to datafiles and versions, DEX also captures the version-level prove- nance – derivation and transformation relationships among the set of all versions – in a data structure called the version graph. Nodes in a version graph correspond to versions and edges capture relationships such as derivation, branching, transformation, etc, between two versions. Since a version graph is typically much smaller than the datafile contents, it can be kept and traversed in memory to identify the versions that are referenced in a query. We use the following notation to formalize the above discussion. Let  be the set of all versions. Each version V ∈  contains a nite number of datafiles, say, V = {A1,… , At}. Let = {A1,… , An} be the set of all datafiles across all versions. Note that it is possible for a datafile to be present in more than one version – this happens when the said datafile is not modied in the respective versions. 
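To make these abstractions concrete before continuing with the notation, here is a minimal sketch in Python (illustrative names only, not DEX's actual API), under the assumptions stated above: records are opaque byte strings, a datafile holds no duplicate records, and a version is an immutable snapshot referencing one or more datafiles.

from dataclasses import dataclass

# A record is an opaque sequence of bytes; a datafile is modeled as a set of
# records, which mirrors the "no identical records" constraint.
Record = bytes
Datafile = frozenset  # frozenset of Record

def make_datafile(raw: bytes, record_sep: bytes = b"\n") -> Datafile:
    """Split a file's contents into records using the user-specified separator."""
    return frozenset(r for r in raw.split(record_sep) if r)

@dataclass(frozen=True)
class Version:
    version_id: str        # unique, immutable identifier
    datafiles: frozenset   # IDs of the datafiles contained in this snapshot

# The same (unchanged) datafile may be referenced by several versions.
A1 = make_datafile(b"r1\nr2\nr3")
v1 = Version("v01", frozenset({"A1", "A2"}))
v2 = Version("v02", frozenset({"A1", "A3"}))  # A1 unmodified, so it is shared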
The set of datafiles that appear in a version are kept track of as metadata in the corresponding node of the version graph. Let Aa = {r1,… , rm} be the set of records contained in datafile Aa. As mentioned before, no two records in a datafile are identical, i.e., ri ≠ rj , ∀ri , rj ∈ Aa. 94 5.2.2 Queries We now describe the semantics of each of the core operations that are the primary focus of this chapter. Checkout: Checkouts are the primary mechanism for reading o older versions of a dataset. Any version or any set of datafiles can be checked out, and the result is copied to the location suggested by the user (typically, it will be a directory on the user’s machine). When a checkout query is issued, the version graph is consulted to identify the set of datafiles that comprise it. Specically, the checkout operation takes as input a set of k ≥ 1 datafiles k = {Ax1 ,… , Axk} ⊂  and outputs k les, one for each datafile. Henceforth, we use the notation Checkout(k) to denote the checkout operation. Intersect: The intersect operation is an important operation when comparing the con- tents of a datafile that was modied across multiple versions. Similar to set intersec- tion, given a set of k ≥ 2 datafiles k = {Ax1 ,… , Axk} ⊂ , the intersect operation outputs a single datafile containing records that appear in all datafiles in k , i.e., {r ∶ r ∈ Ax1 ∧⋯ ∧ r ∈ Axk}. We use the notation I (k) to denote the intersect operation. Union: The union operation, denoted by U (k), returns a single datafile containing records that appear in any of the datafiles in k , i.e., {r ∶ r ∈ Ax1 ∨⋯ ∨ r ∈ Axk}. t-Threshold: Given as input a set of k ≥ 3 datafiles k and an integer 1 < t < k, the t- threshold operation, denoted by Tt(k), returns a single datafile that contains records appearing in at least t of the datafiles in k . This generalizes the above operations – t = 1 and t = k correspond to union and intersection respectively. 95 Although the above set of operations is intended as a starting point for investigat- ing the nascent topic of query processing over deltas, these operations already enable many interesting queries. For example, comparing the results of intersection, union and/or t-threshold across the versions of an evolving dataset can provide insights into the evolution process (e.g., properties of the records that change frequently vs those that remain static). Intersection or t-threshold across the results of dierent machine learn- ing pipelines on the same input dataset can help us identify which types of records are dicult to predict correctly, which can help an analyst steer the training process. Fur- ther, t-threshold can return, for each record, a bitmap indicating the versions to which it belongs; depending on the semantics of the versions being queried, that information could be used for a variety of purposes including correlation analysis, anomaly detection, and visualizations. Finally, if specic analyses of interest are known in advance, materi- alized views (e.g., projections, results of aggregate queries or joins) can be computed in advance as the dataset versions are ingested; by exploiting the overlaps, these material- ized views could be persisted cheaply in the storage engine itself. Although this requires a priori planning, the benets at the time of querying could be tremendous. Dening and automatically materializing such views remains a rich area for future work. 
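Since all four operations are plain set algebra over record sets, their intended semantics can be illustrated with a short sketch (hypothetical function names; datafiles modeled as Python sets); note how t-threshold generalizes union (t = 1) and intersection (t = k).

from collections import Counter
from functools import reduce

def intersect(datafiles):
    # I(.): records that appear in every datafile.
    return reduce(set.intersection, (set(a) for a in datafiles))

def union(datafiles):
    # U(.): records that appear in at least one datafile.
    return reduce(set.union, (set(a) for a in datafiles))

def t_threshold(datafiles, t):
    # Tt(.): records that appear in at least t of the k datafiles.
    counts = Counter(r for a in datafiles for r in set(a))
    return {r for r, c in counts.items() if c >= t}

# t = 1 reduces to union and t = k to intersection.
A = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}]
assert t_threshold(A, 1) == union(A)
assert t_threshold(A, len(A)) == intersect(A) == {3}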
5.2.3 System Data Model Next, we discuss the storage graph and the delta encoding scheme used in DEX to store the versions of datafiles on disk. Thereafter, we describe few properties of the deltas and discuss methods of combining them that will be useful in subsequent sections. 96 5.2.3.1 Storage Graph Let  = (V , E) be a storage graph (see Figure 5.1 for an example). Note that this graph is dierent from version graph, described in section 5.2.1. While the version graph captures derivation or transformation relationships between versions of datasets, the storage graph represents information at the granularity of datafiles (encompassing all versions) and is meant to indicate delta relationships between them. Moreover, the storage graph is used by internal query execution routines and, unlike version graph, is not intended to be exposed to the end user. The vertex set V of the storage graph cap- tures all unique datafiles across all versions, and a special empty datafile, A0. Thus, V = A0 ∪. An edge e(Ai , Aj) ∈ E represents the delta between datafiles Ai and Aj , and the edge set E represents the deltas that are chosen to store all datafiles. The weight of the edge we represents the storage cost (size in bytes) of the delta. For an edge e(A0, Ai), we represents the storage cost of Ai in its entirety (i.e., Ai is materialized). We require that  be a connected graph so that it is possible to reconstruct any of the datafiles in. Specically, a path fromA0 toAi indicates the materialized datafile (one following A0 on the path) and the sequence of deltas to apply in order to recreate Ai . Thus, to store all the datafiles in , it is sucient to store only the materialized datafiles in  and all the deltas in E. Prior systems have made use of the storage graph representation [30, 65, 102], al- beit with dierent monikers, to model a delta based solution to store data versions. The storage graph also generalizes the sequence-of-deltas model where the versions are or- 97 dered according to a certain criteria, e.g., timestamp, le size, etc., and every version ex- cept the rst is stored as a delta against the previous one. The sequence-of-deltas model, although conceptually simple, has the downside that the retrieval time grows linearly with the number of versions stored. The storage graph representation addresses this limitation by allowing multiple versions to be derived from one version. For instance, if we require that every datafile derives 3 datafiles not derived by others, we can pack approximately 80K datafiles and have a maximum delta sequence of length 10. 5.2.3.2 Set-backed Deltas and Properties The delta format that we consider in this chapter, called Set-backed Deltas, is an undi- rected delta format, similar to the standard UNIX line-by-line di. A set-backed delta Δ between a source datafileAi and a target datafileAj , is a set of two datafiles, Δ− and Δ+, that correspond to “deletions” and “insertions” respectively. Δ− is the set of records that are present in Ai but not in Aj , while Δ+ is the set of records that are not present in Ai but present in Aj . Δ can also be used to reconstruct Ai from Aj by exchanging Δ− and Δ+. When using set-backed deltas, we require them to be consistent [69], i.e., a delta does not contain the same record in Δ− and Δ+. This does not preclude updates to a record, including schema changes, since an update can be recorded as deleting the old record and adding a new record. Denition 1 (Consistent Delta) A delta is said to be consistent if Δ− ∩ Δ+ = ∅. 
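As an illustration of the set-backed delta format (a sketch with hypothetical names, not DEX's on-disk encoding), a delta is simply the pair of set differences between the source and target datafiles, and consistency in the sense of Definition 1 then holds by construction.

def make_delta(src: set, tgt: set):
    # Set-backed delta from src to tgt: (deletions, insertions).
    return src - tgt, tgt - src

def is_consistent(delta) -> bool:
    # Definition 1: the deletion and insertion sets must be disjoint.
    minus, plus = delta
    return not (minus & plus)

A1 = {b"r1", b"r2", b"r3"}
A2 = {b"r2", b"r3", b"r4"}
delta = make_delta(A1, A2)   # ({b"r1"}, {b"r4"})
assert is_consistent(delta)
# Swapping the two sets gives the delta in the other direction (A2 to A1),
# which is what makes this format undirected.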
Because datafiles and deltas are sets, we will often make use of the following three standard operations on sets – union (∪), intersection (∩), and difference (−).

Figure 5.1: (a) A storage graph over datafiles A1,…, A12; nodes shaded in blue (A1, A3) indicate materialized datafiles, and edge annotations indicate the disk size of the delta; (b) access tree for Q(A12), which is the shortest path from A0 to A12; (c) access tree for Q(A6, A8, A9, A12), which is the minimum cost Steiner tree for the terminals {A0, A6, A8, A9, A12}.

Continuing the example, when we use Δ to construct Aj from Ai, we call this operation patching Ai using Δ, and denote it as Aj = Ai ⊕ Δ.

Definition 2 (Patch) Ai ⊕ Δ = (Ai − Δ−) ∪ Δ+.

Observation 1 If Δ is consistent, Ai ⊕ Δ = (Ai − Δ−) ∪ Δ+ = (Ai ∪ Δ+) − Δ−.

Next, we describe another important property of set-deltas, called contraction. Intuitively, delta contraction corresponds to combining two deltas into a single delta such that the new delta has the same effect as applying the individual deltas. Formally, if A1, A2, A3 are three datafiles and Δ1 = Δ(A1, A2), Δ2 = Δ(A2, A3), we use the patch operator as before to represent contraction as follows.

Definition 3 (Delta Contraction) Δ = Δ1 ⊕ Δ2, where Δ− = (Δ−1 − Δ+2) ∪ Δ−2 and Δ+ = (Δ+1 − Δ−2) ∪ Δ+2.   (5.2.1)

Although delta contraction, as defined above, can be applied to two arbitrary deltas, the result is well-defined only if the target datafile of Δ1 is the same as the source datafile of Δ2. The result Δ has the same source datafile as Δ1 and derives the target datafile of Δ2. This definition can be generalized to a sequence of deltas: the contraction of a sequence of deltas Δ1,…, Δm is the result of the operation Δ1 ⊕ ⋯ ⊕ Δm. Given the above properties, we can infer that:

Observation 2 If Δ1 and Δ2 are consistent, then their contraction, Δ = Δ1 ⊕ Δ2, is also consistent.

Observation 3 The patch operation is associative, i.e., (Δ1 ⊕ Δ2) ⊕ Δ3 = Δ1 ⊕ (Δ2 ⊕ Δ3).

Although some of these observations might seem straightforward, formalizing them is crucial to argue the correctness of the transformations that we perform later.

5.3 Query Execution Preliminaries

We begin with a more formal treatment of the query optimization problem, first discussing the optimization metrics of interest and introducing the two-phase optimization approach that we take. We then briefly discuss the issues of cost and cardinality estimation and the search space of query evaluation plans.

Given a query Q(𝒜k), where Q is one of {Checkout, I, U, Tt} (Section 5.2.2), against a given storage graph 𝒢, there are two somewhat independent stages in the overall query execution. First, we need to identify all the relevant datafiles and deltas in 𝒢 that are necessary to execute Q(𝒜k). We refer to this problem as finding an access tree of Q(𝒜k), and describe it in detail in Section 5.3.2. Second, given an access tree, we need to devise an efficient evaluation plan that describes exactly what operations are used to compute the result of Q(𝒜k). This plan is represented as a delta expression: an algebraic expression where the operands are datafiles and deltas from the storage graph 𝒢, and the operations are patch and primitive set operations. During this stage, we also consider the problem of finding a good ordering of evaluating the different operations in the delta expression.
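Since every evaluation plan is ultimately composed of patch and contraction steps, a minimal, self-contained sketch of Definitions 2 and 3 (deltas modeled as pairs of Python sets; illustrative names only) may help make such delta expressions concrete.

def patch(datafile: set, delta):
    # Definition 2: A ⊕ Δ = (A − Δ−) ∪ Δ+.
    minus, plus = delta
    return (datafile - minus) | plus

def contract(d1, d2):
    # Definition 3: contract Δ1 (A1 to A2) and Δ2 (A2 to A3) into one A1-to-A3 delta.
    m1, p1 = d1
    m2, p2 = d2
    return (m1 - p2) | m2, (p1 - m2) | p2

# Checking Definition 3 and Observation 3 (associativity) on a tiny example.
A1, A2, A3 = {1, 2, 3}, {2, 3, 4}, {3, 4, 5}
d12 = (A1 - A2, A2 - A1)   # delta from A1 to A2
d23 = (A2 - A3, A3 - A2)   # delta from A2 to A3
assert patch(A1, contract(d12, d23)) == A3
assert patch(patch(A1, d12), d23) == A3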
We describe the techniques for each query Q ∈ {Checkout, I , U , Tt} in Section 5.4. 5.3.1 Optimization Metrics To be able to develop a systematic cost-based approach to query execution, we rst need to identify appropriate optimization metrics and cost models. It is unfortunately dicult to develop a single cost metric that captures the costs of the two stages discussed above, which also makes it hard to do joint optimization across them. Because the backend store is likely to be relatively expensive to access (we expect it to be distributed in general), we would like to minimize the amount of data that is read from the backend store; this also reduces the network I/O. Once the data has been gathered, however, the dierent ways to evaluate a query can have very dierent CPU costs and wall-clock time. Hence, 101 for the second phase, we would prefer to use a metric that tracks the CPU cost. We adopt a two-phase approach in DEX inspired by this. We rst nd the best “access tree” that minimizes the total amount of data that needs to be read (in bytes) from the backend store. In other words, we identify the set of datafiles and deltas that have the smallest total size, that are sucient to reconstruct the required datafiles. We then search for the best evaluation plan according to a cost model that estimates the CPU resources needed by the plan. We discuss the specics in further detail in Section 5.3.4 when we discuss the operator implementations. We do not explicitly account for disk access costs during the second phase for sev- eral reasons. First, although the overall storage graph and the delta sizes in total are expected to be very large, the access tree for any given query is typically much smaller and the deltas constituting that will typically t in the memory of a powerful machine. More importantly, most of our algorithms (Section 5.3.4) access the deltas sequentially (while reading and writing), and thus even if the deltas were disk resident (or interme- diate results needed to be written to disk), the CPU and/or the memory bandwidth is still the main bottleneck. One exception here is binary search or gallop search (that an intersection operation might employ) where our approach might underestimate the cost of an intersection in case of extreme skew. However, our cost estimation procedure can be easily modied to account for that case. Moreover, the deltas are typically stored in a compressed fashion on disk, thereby making it necessary to uncompress them by read- ing them once into memory, and further making the overall computation CPU-bound. 102 5.3.2 Access Tree Given a query Q(k), an access tree, Q = (VQ , EQ) is a subgraph of  such that: (i) A0 ∪k ⊆ VQ ⊆ V , and (ii) Q is a tree, i.e., a connected graph with no cycles. The rst condition implies that all datafiles required by the query are part of the access tree. The second condition ensures that we have a valid and minimal solution: (i) Valid: because Q is connected, there exists at least one path between A0 and Axi , which denotes the materialized datafile and the sequence of deltas to apply to reconstruct Axi , (ii) Minimal: because Q is a tree, for every Axi ∈ k , Q contains exactly one path from A0 to Axi . We dene the cost of an access tree as the sum of weights of all edges in it, i.e., C(Q) = ∑e∈EQ we . When the edge weights correspond to the sizes of the deltas, this def- inition captures the cost metric mentioned above. 
To address the problem of identifying the least cost access tree, we consider two cases, k = 1 and k > 1. We refer to these as single datafile access and multiple datafile access respectively. Single datafile Access: When k = 1, 1 = {Ax1}. Any A0 to Ax1 path in  satises the conditions of an access tree. Thus, nding the least cost access tree amounts to nding the shortest path between A0 and Ax1 , and we use the classical Dijkstra’s algorithm. Multiple datafile Access: When k > 1, the problem of nding a low cost access tree is equivalent to nding a Steiner Tree [103]. Here, the set of nodes A0 ∪ k act as terminals and our objective is to nd a minimum cost Steiner tree that contains all of them. This problem is -Hard, i.e., arbitrarily good approximations cannot be achieved in polynomial time (unless  = ). In this work, we use the classical 2- 103 approximation algorithm, which nds a tree with cost at most 2 times the optimal. Example 8 Consider the query Checkout({A6, A8, A9, A12}) on the storage graph in Fig- ure 5.1(a). Figure 5.1(c) shows the least cost access tree for this query. 5.3.3 Search Space Cost-based optimization requires us dene the search space of potential, equivalent plans. The search space that we use in this work revolves around two equivalences: (i) associativity of the patch operation, and (ii) De Morgan’s laws for set theory. We can thus generate equivalent evaluation plans by repeatedly applying those equivalence rules. Unfortunately the number of dierent evaluation plans is very large, even with just the rst rule (Section 5.4.1). Unlike relational query optimization, the set of po- tential intermediate results is not easy to dene either, and thus this problem does not seem amenable to dynamic programming-style algorithms used there. We instead take a hybrid approach where we use a series of heuristic transformation rules, based on De Morgan’s laws, to simplify the expressions, and use a dynamic programming-based al- gorithm (that exploits the associativity of patch) to optimize the sub-expressions in the simplied expression. Apart from generating alternative query expressions using logical equivalence rules, it is also possible to expand the search space of candidate plans by considering the impact of physical access structures on the data, e.g., secondary indexes. For in- stance, B-Trees on datales or deltas can be helpful when records are ltered on some attribute, bloom lters on deltas can help in evaluating queries like set dierence, and 104 so on. Additional considerations also arise when a join result is required across multiple versions – the delta chains for the dierent sets of datales (corresponding to the dif- ferent relations) may not be “aligned” and the access tree selection will have to consider possibility of “joint” optimizations. Understanding this search space further, especially for richer queries involving joins and aggregates, remains a rich area for future work. 5.3.4 Cost and Cardinality Estimation The cost of executing any of the set operations mentioned so far depends on the physical datafile format and the specic implementation of the operation. Since there exist several implementations for the set operations, there exist several cost functions. In DEX, the primary method of storing a datafile is clustered storage. In this method, records are stored in a sorted manner based on a suitable derived key (e.g., SHA1). 
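A small sketch of what clustered storage looks like (hypothetical helper names; SHA1 of the record bytes serves as the derived key): records are kept sorted by key, which is what enables the merge- and search-based set algorithms whose costs are outlined next.

import hashlib

def record_key(record: bytes) -> str:
    # Derived key for a record; here, SHA1 of the record bytes.
    return hashlib.sha1(record).hexdigest()

def cluster(records) -> list:
    # Clustered storage: unique records kept sorted by their derived key.
    return sorted(set(records), key=record_key)

def merge_intersect(a: list, b: list) -> list:
    # Linear-scan intersection of two key-sorted record lists (one of the
    # standard evaluation strategies over this storage format).
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        ka, kb = record_key(a[i]), record_key(b[j])
        if ka == kb:
            out.append(a[i]); i += 1; j += 1
        elif ka < kb:
            i += 1
        else:
            j += 1
    return out

A1 = cluster([b"r1", b"r2", b"r3"])
A2 = cluster([b"r2", b"r3", b"r4"])
assert merge_intersect(A1, A2) == cluster([b"r2", b"r3"])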
There are several algorithms for evaluating a set expression between two or more operands based on this storage format, and we outline our choices next along with their respective costs. To keep the discussion simple, we describe the algorithms and their respective cost functions when all input data for a specific operation fits in memory and there is no paging of intermediate results to disk. Even if some deltas are large enough to require using disk, most of the algorithms below access the deltas sequentially and thus can be used with small modifications. We note that our optimization algorithms are largely agnostic to the specific choices for operator implementations, and can be used as long as the costs of the operations can be estimated.

Intersection: To compute the intersection of l datafiles, A1, …, Al, we use an adaptive algorithm introduced in [83] called Small Adaptive (SA). SA first sorts the set of input datafiles according to their size. For each element in the smallest datafile, SA performs a gallop search on the second smallest datafile. A gallop search consists of two stages. In the first stage, we determine a range in which the element would reside if it were in the datafile. This range is found by identifying the first exponent j such that the element at position 2^j is greater than the searched element. In the second stage, a binary search is performed in the range (2^(j−1), 2^j) to find if the element exists. If found, a new gallop search is performed in the remaining l − 2 datafiles to determine if the element is present in the intersection; otherwise, a new search is performed. After this step, each datafile has an examined range (from the beginning to the position returned by the current gallop search) and an unexamined range. SA then selects the two datafiles with the smallest unexamined ranges and repeats the process until one of the datafiles has been fully examined. Because intersections only make sets smaller, as the algorithm progresses with several sets, the time to do each intersection effectively reduces. In particular, as pointed out in [83], the algorithm benefits largely if the set sizes vary widely, and performs poorly if the set sizes are all roughly the same. Since one gallop search takes O(log i) time, where i is the index where the element would be in the datafile, we can model the worst-case cost of intersection as,

C∩(A1, …, Al) = l |A1| log(|Al|/|A1|),   (5.3.1)

where A1 and Al are the smallest and largest datafiles respectively.

Union: To take a union of l datafiles {A1, …, Al}, we perform a linear scan over all lists to merge them, and output the result.

C∪(A1, …, Al) = |A1| + ⋯ + |Al|.   (5.3.2)

Set Difference: To compute the set difference A1 − A2, we choose the better among the following two based on input sizes: perform a linear scan over both datafiles and use a merging algorithm, or, for each element in A1, perform a gallop search on A2, including the element in the output if the search fails. This can be captured using the cost function,

C−(A1, A2) = min{|A1| + |A2|, |A1| log(|A2|/|A1|)}.   (5.3.3)

Patch: This is a binary operation where the two inputs are either (i) a datafile (M) and a delta (Δ), or (ii) two deltas (Δ1 and Δ2). In the first case, the output datafile can be computed by performing one linear scan over each of M, Δ+ and Δ−, and evaluating Definition 2, making the cost function,

C⊕(M, Δ) = |M| + |Δ|.   (5.3.4)

Typically, |M| > |Δ−|, and we use the linear scan approach to compute the set difference. In the second case, the output Δ can be computed by evaluating Definition 3.
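As a concrete illustration, here is a minimal Java sketch of the intersection, union, and set-difference cost functions above (Equations 5.3.1-5.3.3). It is not the DEX cost model itself; the class and method names and the choice of log base 2 are assumptions of the sketch.

public class SetOpCosts {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // C_intersect(A1,...,Al) = l * |A1| * log(|Al| / |A1|),
    // where sizes[] is assumed sorted so sizes[0] is the smallest datafile.
    // (A real estimator would guard the degenerate case of equal-size inputs.)
    static double intersectCost(long[] sizes) {
        int l = sizes.length;
        return l * sizes[0] * log2((double) sizes[l - 1] / sizes[0]);
    }

    // C_union(A1,...,Al) = |A1| + ... + |Al| (one merging pass over all inputs).
    static double unionCost(long[] sizes) {
        double c = 0;
        for (long s : sizes) c += s;
        return c;
    }

    // C_difference(A1, A2) = min(|A1| + |A2|, |A1| * log(|A2| / |A1|)):
    // merge-based scan vs. one gallop search per element of A1
    // (the gallop option is only meaningful when |A1| <= |A2|).
    static double differenceCost(long a1, long a2) {
        return Math.min((double) a1 + a2, a1 * log2((double) a2 / a1));
    }

    public static void main(String[] args) {
        System.out.println(intersectCost(new long[]{1_000, 50_000, 80_000}));
        System.out.println(unionCost(new long[]{1_000, 50_000, 80_000}));
        System.out.println(differenceCost(1_000_000, 50_000));
    }
}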
Note that the datafiles of Δ2 are scanned twice, once to compute Δ+ and once to compute Δ−. Thus, the cost function is given as,

C⊕(Δ1, Δ2) = |Δ1| + 2|Δ2|.   (5.3.5)

Cardinality Estimation: Because we restrict the search space as discussed in Section 5.3.3, we require intermediate result size estimates only when two deltas are patched. Let Δ = Δ1 ⊕ Δ2, where Δ1 and Δ2 are the deltas between three datafiles A1, A2, A3 arranged in a chain (Δ1 between A1 and A2, and Δ2 between A2 and A3). Let x = |Δ−| and y = |Δ+|. We want to estimate x and y. By definition, Δ is a consistent delta between A1 and A3. Therefore, |A3| = |A1| − x + y. Since |A1| and |A3| are known, we can estimate x from y, or vice versa. From Definition 3 we can obtain intervals for both x and y as,

x ∈ [max(0, |Δ1−| − |Δ2+|) + |Δ2−|, |Δ1−| + |Δ2−|],
y ∈ [max(0, |Δ1+| − |Δ2−|) + |Δ2+|, |Δ1+| + |Δ2+|].

We estimate the quantity with the smaller interval, where the value is chosen uniformly at random from the corresponding interval.

5.4 Query Execution Algorithms

Next we present a series of algorithms for cost-based optimization for each of the different query types.

5.4.1 Checkout Queries

Let Checkout(𝒜k) and 𝒯Q denote a checkout query and its access tree, respectively. We first consider the case when k = 1 (single datafile checkout), followed by the case k > 1 (multiple datafile checkout).

5.4.1.1 Single datafile Checkout

Recall that the access tree 𝒯Q, when k = 1, is the shortest path from A0 to Ax1 in 𝒢. The delta expression for single datafile checkout is therefore of the form,

ℰ : M ⊕ Δ1 ⊕ Δ2 ⊕ ⋯ ⊕ Δm,

where M is the materialized datafile.

Evaluation Algorithms: Since the ⊕ operation is associative, we can evaluate ℰ in multiple ways by changing the placement of "parentheses". For instance, one method is to evaluate the expression from left to right, i.e., ℰ : (((M ⊕ Δ1) ⊕ Δ2) ⊕ ⋯ ⊕ Δm). Alternately, we can evaluate the expression from right to left, or in any arbitrary fashion that repeatedly combines two operands at a time, until we are left with the result. These evaluation methods, in general, will have varying costs. The total number of evaluation orders is equivalent to the classical problem of counting the number of ways of associating m applications of a binary operator, and is given by the (m − 1)th Catalan number, which is Ω(4^m/m^(3/2)). Note that a greedy algorithm that iteratively combines the two deltas having the least cost is not always the optimal strategy.

Example 9 Consider the expression ℰ : Δ1 ⊕ Δ2 ⊕ Δ3, where the deltas are such that |Δ1| = x, |Δ2| ≃ |Δ3| = y, x ≪ y. Intuitively, the deltas Δ2, Δ3 are larger compared to Δ1 and they are such that they almost "undo" each other. The greedy algorithm will pick the plan (Δ1 ⊕ Δ2) ⊕ Δ3 with estimated cost ≈ 2x + 5y, while the optimal plan Δ1 ⊕ (Δ2 ⊕ Δ3) has cost ≈ x + 2y + 2ε, where ε = |Δ2 ⊕ Δ3|.

For the sake of completeness, we have reproduced the classical dynamic programming algorithm to select the (estimated) best evaluation order below. We call this the path contraction (PC) algorithm. We use PC extensively in subsequent sections to determine the best evaluation order to combine a sequence of deltas. The runtime of PC is Θ(m^3), where m is the number of deltas. The input to PC is a materialized datafile, say M, followed by a sequence of deltas, say Δ1, …, Δm, and the output is the datafile represented by the corresponding delta expression, M ⊕ Δ1 ⊕ ⋯ ⊕ Δm.
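The following Java sketch illustrates the cardinality estimation procedure above for Δ = Δ1 ⊕ Δ2: it computes the feasible intervals for x = |Δ−| and y = |Δ+|, picks the quantity with the smaller interval uniformly at random, and derives the other from |A3| = |A1| − x + y. The class and method names are illustrative, not the DEX implementation.

import java.util.concurrent.ThreadLocalRandom;

public class DeltaCardinality {
    // Returns {estimated |Delta^-|, estimated |Delta^+|}.
    static long[] estimate(long d1Minus, long d1Plus, long d2Minus, long d2Plus,
                           long sizeA1, long sizeA3) {
        long xLo = Math.max(0, d1Minus - d2Plus) + d2Minus, xHi = d1Minus + d2Minus;
        long yLo = Math.max(0, d1Plus - d2Minus) + d2Plus,  yHi = d1Plus + d2Plus;
        long x, y;
        if (xHi - xLo <= yHi - yLo) {          // x has the smaller interval
            x = uniform(xLo, xHi);
            y = sizeA3 - sizeA1 + x;           // from |A3| = |A1| - x + y
        } else {
            y = uniform(yLo, yHi);
            x = sizeA1 - sizeA3 + y;
        }
        return new long[]{x, y};
    }

    static long uniform(long lo, long hi) {
        return lo + (hi > lo ? ThreadLocalRandom.current().nextLong(hi - lo + 1) : 0);
    }

    public static void main(String[] args) {
        long[] est = estimate(50, 80, 40, 30, 1_000_000, 1_000_020);
        System.out.println("|Delta^-| ~ " + est[0] + ", |Delta^+| ~ " + est[1]);
    }
}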
The algorithm uses the property that if the optimal solution splits the contraction of a path of length m into two sub-paths, then the contraction of each of the two sub-paths must be optimal (otherwise, we could improve the solution for the sub-paths to improve the overall solution). For 0 ≤ i ≤ j ≤ m, let C[i, j] denote the minimum cost to contract the sequence Δi, …, Δj, D[i, j] denote the estimated size of the corresponding intermediate result, and S[i, j] denote how to best split the sequence. For notational convenience only, if M is present, Δ0 = M. The pseudocode of PC is shown in Algorithm 4.

As discussed in Section 5.3.3, we have syntactically restricted the space of alternative evaluation plans for checkout by only considering the associativity of the patch operation. Although additional transformations could be used to expand the search space, we could not identify any such transformation rules for checkout that were effective outside of pathological cases.

Algorithm 4: Path Contraction (PC)
Input: A materialized datafile M (optional); Δ1, …, Δm
Result: Minimum cost to evaluate M ⊕ Δ1 ⊕ ⋯ ⊕ Δm
1 for l ← 2 to m do
2   for i ← 0 to m − l + 1 do
3     j ← i + l − 1
4     C[i, j] ← min_{i≤k<j} { C[i, k] + C[k+1, j] + C⊕(D[i, k], D[k+1, j]) }
5     S[i, j] ← the k achieving the minimum; D[i, j] ← estimated size of patching D[i, k] and D[k+1, j]
6 return C[0, m] and the evaluation order encoded by S

5.4.1.2 Multiple datafile Checkout

When k > 1, the access tree 𝒯Q is no longer a single path, and we must additionally decide which partial results to share across the k requested datafiles. Consider the first node Bi (starting from the root M) that has l > 1 children. Let B0, …, Bi−1 be the intermediate nodes on the M−Bi path. Let 𝒯Q(Bj), 0 ≤ j < i, be the access tree rooted at Bj (this tree is equivalent to deleting the nodes M, B0, …, Bj−1 from 𝒯Q). For example, Figure 5.2(b) shows two components: (i) the path 𝒫(Bi−1) (above), and (ii) the access tree 𝒯Q(Bi−1) (below). Let split(𝒯Q) denote the operation that splits 𝒯Q at Bi into l access trees, one for each child of Bi. This is shown in Figure 5.2(c). Let split-par(𝒯Q) denote the operation that splits 𝒯Q at Bi into l access trees, one for each child of Bi, but this time preserving the parent sequence of deltas in each split. This is shown in Figure 5.2(d).

Figure 5.2: An instance of the Tree Contraction algorithm; (a) is an access tree for the query Checkout(A1, A2, A3); (b) shows the path 𝒫(Bi−1) and the subtree 𝒯Q(Bi−1); (c) shows split(𝒯Q); (d) shows split-par(𝒯Q).

Algorithm 5: Tree Contraction (TC)
Input: Access tree 𝒯Q
1 Apply PC for each path 𝒫(Ai), Ai ∈ 𝒜k. Memoize C[], S[] and D[].
2 return Best-Subexp({}, 𝒯Q)
3 Procedure Best-Subexp(𝒟, 𝒯Q)
  Input: Deltas 𝒟; access tree 𝒯Q
4   if 𝒯Q is a path then return best solution for {𝒟 ∪ 𝒯Q}
5   forall 0 ≤ j < i do
6     Δ𝒫(Bj) ← estimated delta for the sequence 𝒫(Bj)
7     𝒟′ ← 𝒟 ∪ Δ𝒫(Bj)
8     cost_j ← Best-Subexp(𝒟′, 𝒯Q(Bj))
9   Δ𝒫(Bi) ← estimated delta for the sequence 𝒫(Bi)
10  𝒟′ ← 𝒟 ∪ Δ𝒫(Bi)
11  cost_split ← Σ_{𝒯Q′ ∈ split(𝒯Q)} Best-Subexp(𝒟′, 𝒯Q′)
12  cost_split-par ← Σ_{𝒯Q′ ∈ split-par(𝒯Q)} Best-Subexp(𝒟, 𝒯Q′)
13  return the best of the above costs

Before we conclude this discussion, it will be helpful to understand, as the following example shows, why a simple greedy strategy of always sharing the largest possible expression (from left to right) is not always optimal.

Example 11 Consider an access tree for checking out A3 and A4, in which M is the materialized datafile, M is connected to A1 by Δ1, A1 is connected to A2 by Δ2, and A3 and A4 are each stored as a delta (Δ3 and Δ4, respectively) from A2. The instance is constructed such that Δ2, Δ3, Δ4 are large (say, y ≈ |Δ2| ≈ |Δ3| ≈ |Δ4|) and Δ3, Δ4 "undo" most of the changes done by Δ2. Δ1 is a small independent set of changes, say, x = |Δ1|, x ≪ y. The greedy strategy will force us to share Δ′ = Δ1 ⊕ Δ2 with |Δ′| ≈ x + y. Thus the cost of the greedy strategy is ≈ 3x + 8y. On the other hand, evaluating Δ1 ⊕ (Δ2 ⊕ Δ3), Δ1 ⊕ (Δ2 ⊕ Δ4) incurs a cost ≈ 2x + 6y.
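To make the PC dynamic program concrete, here is a minimal Java sketch of the recurrence in Algorithm 4. For brevity it applies the delta-delta patch cost |left| + 2|right| uniformly and uses a crude additive size estimate for intermediate results (DEX would use Equation 5.3.4 for the datafile-delta case and the cardinality estimator of Section 5.3.4); the class name and this simplification are assumptions of the sketch, not the DEX implementation.

public class PathContraction {
    long[][] C, D;
    int[][] S;

    // sizes[0] may be the materialized datafile M, sizes[1..m] the deltas.
    PathContraction(long[] sizes) {
        int n = sizes.length;
        C = new long[n][n];
        D = new long[n][n];
        S = new int[n][n];
        for (int i = 0; i < n; i++) D[i][i] = sizes[i];
        for (int len = 2; len <= n; len++) {
            for (int i = 0; i + len - 1 < n; i++) {
                int j = i + len - 1;
                C[i][j] = Long.MAX_VALUE;
                for (int k = i; k < j; k++) {
                    long patch = D[i][k] + 2 * D[k + 1][j];     // simplified C_patch(left, right)
                    long cost = C[i][k] + C[k + 1][j] + patch;
                    if (cost < C[i][j]) {
                        C[i][j] = cost;
                        S[i][j] = k;
                        D[i][j] = D[i][k] + D[k + 1][j];        // crude intermediate-size estimate
                    }
                }
            }
        }
    }

    // Parenthesization of the optimal plan, e.g. "((M . D1) . (D2 . D3))".
    String plan(String[] names, int i, int j) {
        if (i == j) return names[i];
        int k = S[i][j];
        return "(" + plan(names, i, k) + " . " + plan(names, k + 1, j) + ")";
    }

    public static void main(String[] args) {
        long[] sizes = {3_000_000, 5_000, 150_000, 150_000};   // M, D1, D2, D3
        String[] names = {"M", "D1", "D2", "D3"};
        PathContraction pc = new PathContraction(sizes);
        System.out.println(pc.plan(names, 0, sizes.length - 1)
                           + "  cost=" + pc.C[0][sizes.length - 1]);
    }
}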
5.4.2 Intersection Queries

Given an intersect query I(𝒜k) and its access tree 𝒯Q, a straightforward method, which we treat as a baseline, is to first use TC to perform Checkout(𝒜k), followed by the intersection. This approach, however, only considers the associativity of patch and the sharing of sub-expressions in order to find a good evaluation order. We now develop a set of transformation rules on the access tree that allow us to compute partial intersection results using only the deltas. Since a delta between two datafiles already captures a notion of difference between them, we leverage this information and avoid redundant computation while finding the intersection.

The transformation rules are based on identifying two simple structures in the access tree 𝒯Q, called line and star (Figure 5.3). In each figure, we use boxes to denote datafiles in 𝒜k and circles to denote other datafiles. Also, if a box or circle is filled, it denotes a materialized datafile.

Figure 5.3: (a) A line of two or more datafiles; (b) a line when the materialized datafile M is not a part of the query input; (c) a star.

Line Access Trees: Consider the query I(A1, A2) with the datafiles arranged as in Figure 5.3(a). Here, A1 is the materialized datafile while A2 is stored as a delta from A1. It is easy to see that: R = A1 ∩ A2 = A1 − Δ1−. In general, for the query I(𝒜k) with the datafiles arranged as in Figure 5.3(a), the result R is computed as,

R = I(𝒜k) = A1 − (Δ1− ∪ ⋯ ∪ Δk−1−).   (5.4.1)

Note that the above equality does not hold if there are other datafiles in 𝒯Q, even if 𝒯Q is a line. We use this equality to introduce our first transformation rule, which "reduces" the deltas in a line structure to a single delta that gives the result for the intersect query. Conceptually, this reduced delta acts as a delta between two datafiles: the same materialized datafile as in the line and a (new) datafile representing the intersection result.

T1: If Δ1, …, Δk−1 are the deltas in the line, then the reduced delta Δl for the intersect query is composed as,

Δl− = Δ1− ∪ Δ2− ∪ ⋯ ∪ Δk−1−;  Δl+ = ∅.

This transformation rule significantly reduces the amount of data that needs to be subsequently processed. To handle the case when the materialized datafile is not a part of the line, as in Figure 5.3(b), we use a two-step approach. First, treating A1 as if it were the materialized datafile, we use rule T1 to compute the reduced delta Δl. Second, we contract Δ and Δl, since they share the datafile A1. The result is computed as R = M ⊕ Δ ⊕ Δl.

Next, we discuss how to evaluate eq. (5.4.1). Consider the identity X − (Y ∪ Z) = (X − Y) − Z for three sets X, Y, Z. If |Y|, |Z| ≪ |X|, observe that X − (Y ∪ Z) will often have less cost than (X − Y) − Z. Intuitively, if we do the set difference first, then |X − Y| will be comparable to |X| and will end up being scanned again. Specifically, under the cost model stated in Section 5.3.4, when |X| > 3 max(|Y|, |Z|), performing X − (Y ∪ Z) will result in a reduced cost. We therefore use the following greedy heuristic when evaluating eq. (5.4.1).

H1: Let 𝒟 = {Δ1−, …, Δk−1−} and R = M. We iteratively perform the following until 𝒟 is empty: let Δ′ be the largest-size delta in 𝒟. If |R| > 3|Δ′|, we replace the largest two deltas in 𝒟 by their union; else, we set R = R − Δ′ and remove Δ′ from 𝒟.

Star Access Trees: Consider the query I(A1, A2) with the datafiles arranged as in Figure 5.3(c).
Here, M is the materialized datafile and A1 and A2 are stored as deltas from M. We have that:

R = A1 ∩ A2 = (M − (Δ1− ∪ Δ2−)) ∪ (Δ1+ ∩ Δ2+).

To see why, recall that Δi− indicates the set of records to be removed from M to get Ai. Hence, no record in Δi− can be a part of the intersection result. Additionally, new records (that do not exist in M) can be added only if they belong to all of the Δi+. In general, for the query I(𝒜k) with the datafiles arranged as in Figure 5.3(c), the result R is computed as,

R = I(𝒜k) = (M − (∪_{i=1}^{k} Δi−)) ∪ (∩_{i=1}^{k} Δi+).   (5.4.2)

The result R is written in terms of the materialized datafile M. This leads us to our second transformation rule, which "reduces" the deltas in a star structure to a single delta that gives the result for the intersect query. Conceptually, this reduced delta acts as a delta between M and the intersection result.

T2: If Δ1, …, Δk are the deltas in the star, then the reduced delta Δs for the intersect query is composed as,

Δs− = ∪_{i=1}^{k} Δi−;  Δs+ = ∩_{i=1}^{k} Δi+.

We use H1 to evaluate Equation (5.4.2). Since none of the Δi+ can help reduce intermediate result sizes, the intersection of the Δi+'s can be done independently. Finally, we also make the following observation.

Observation 4 Δl and Δs are consistent.

Arbitrary Access Trees: We develop an algorithm, called Contract and Reduce (C&R), that puts the above two techniques together for arbitrary access trees. With minor modifications, the same algorithm can be used for other types of queries, and hence we describe its general form. The pseudocode for C&R is shown in Algorithm 6.

Starting with a query Q ∈ {I, U, Tt} and its access tree 𝒯Q as inputs, C&R iteratively evaluates partial delta expressions, effectively reducing the size of 𝒯Q. Each iteration of the algorithm has two phases: a contract phase and a reduce phase. In the contract phase, we identify all maximal continuous delta paths: a path where all nodes, except the start and end nodes, have exactly 2 neighbors, and none of the intermediate nodes is a part of 𝒜k. Each path should be of length > 2 and be the longest possible. Every such path is then contracted to a single delta using PC. Specifically, if Δ1, …, Δu is the sequence of deltas on the path between two nodes Ax and Ay in 𝒯Q, we use PC to find the best order to evaluate Δ = Δ1 ⊕ ⋯ ⊕ Δu, execute the operations, and replace the sequence by the delta Δ between Ax and Ay. In the reduce phase, we find all lines and stars in 𝒯Q and reduce them according to the appropriate transformation rules: T1/T3 for lines and T2/T4/T5 for stars. Each transformation takes as input two or more deltas, either in a line or star configuration, and replaces them by a single delta. Note that if all paths are contracted and the number of deltas in 𝒯Q is more than 1, there will be at least one reduction to perform.

The algorithm ends when there is only one delta remaining in 𝒯Q. At this point, we simply apply the delta to the materialized datafile in 𝒯Q and return the result. We illustrate the behaviour of the algorithm with the help of an example.

Example 12 Consider the query I(A6, A8, A9, A12) with the access tree as in Figure 5.4(a). In the first iteration, during the contract phase, we compute the deltas: (1) Δc1 = Δ1 ⊕ Δ3 ⊕ Δ6, (2) Δc2 = Δ4 ⊕ Δ7, and (3) Δc3 = Δ8 ⊕ Δ9 (Figure 5.4(b)). In the reduce phase, we reduce A6 and A12, which are arranged in a line (Figure 5.4(c)). This reduction puts A9 and Q(A6, A12) in a star, which is then reduced as in Figure 5.4(d).
In the next iteration, during the contract phase, we compute Δc4 = Δ2 ⊕ Δs1 (Figure 5.4(e)). In the reduce phase, we reduce the star formed by A8 and Q(A6, A9, A12), which leaves the access tree with a single delta.

Figure 5.4: Access tree during the progress of C&R.

Algorithm 6: Contract and Reduce (C&R)
Data: Query Q ∈ {I, U, Tt} and access tree 𝒯Q
1 while 𝒯Q contains more than one delta do
2   Contract Phase:
3     P ← list of maximal continuous delta paths of length > 2 in 𝒯Q
4     Contract all paths in P using PC and update 𝒯Q
5     if 𝒯Q becomes a line/star then
6       return R ← result using H1
7   Reduce Phase:
8     L ← list of line structures in 𝒯Q
9     Reduce each line in L based on Q, i.e., apply one of T1/T3, and update 𝒯Q
10    S ← list of star structures in 𝒯Q
11    Reduce each star in S based on Q, i.e., apply one of T2/T4/T5, and update 𝒯Q
   /* At this point 𝒯Q contains one delta. */
12 return R ← Root(𝒯Q) ⊕ Δ

5.4.3 Union

In this section, we give transformation rules for the line and star structures for the query U(𝒜k). We can then use C&R with these rules to evaluate arbitrary access tree structures.

Line: Consider the query U(𝒜k) with the datafiles arranged as in Figure 5.3(a). Then: R = A1 ∪ (∪_{i=1}^{k−1} Δi+). The transformation rule for a line can therefore be stated as,

T3: If Δ1, …, Δk−1 are the deltas in the line, then the reduced delta Δl for the union query is composed as,

Δl− = ∅;  Δl+ = ∪_{i=1}^{k−1} Δi+.

Star: If the datafiles are arranged as shown in Figure 5.3(c), we have that:

R = U(𝒜k) = (M − (∩_{i=1}^{k} Δi−)) ∪ (∪_{i=1}^{k} Δi+).

To see why, note that since Δi− indicates the set of records to be removed from M to get Ai, if a record is absent in the union, it must have been present in all Δi−. New records that are added in any Δi+ are a part of the union result. Then:

T4: If Δ1, …, Δk are the deltas in the star, then the reduced delta Δs for the union query is composed as,

Δs− = ∩_{i=1}^{k} Δi−;  Δs+ = ∪_{i=1}^{k} Δi+.

We conclude this section by mentioning that, similar to the intersection case, Δl and Δs for the union query are consistent.

5.4.4 t-Threshold

In order to evaluate a t-threshold query Tt(𝒜k), we make use of multiset-backed deltas during intermediate query execution, instead of the set-backed deltas that we have been using so far. This introduces two important issues during the execution of C&R that are the main focus of this section. First, we need to re-define the semantics of delta contraction in this new setting. Second, a line cannot be reduced in a straightforward manner as before. We begin with some definitions, describe the transformation rule for a star, and then discuss each of the two issues in detail.

A multiset, unlike a set, allows multiple instances of any of its elements. We represent a multiset as A = {(r, c) : c ∈ N≥1}, where r is an element, N≥1 is the set of natural numbers, and c is referred to as the multiplicity of r. A set is a multiset with all multiplicities equal to 1. Consider the following two operations concerning multisets.

Definition 4 (Multiset Union) Multiset union, denoted by A = A1 ⊎ A2, returns a multiset containing the elements that occur either in A1 or in A2, where A1 and A2 are either multisets or sets. The multiplicity of an element in A is the sum of its multiplicities in A1 and A2.

Definition 5 (Multiset Restrict) If A is a multiset, A_{c≤p} is the set of elements in A with multiplicity at most p.
Similarly, A_{c≥p} is the set of elements with multiplicity at least p.

We now describe how to evaluate Tt(𝒜k) when 𝒯Q is a star.

Star: Consider the query R = T3(𝒜4), i.e., find all records that appear in at least 3 of {A1, A2, A3, A4}, when they are arranged in a star (Figure 5.3(c)). Suppose the delta Δ is such that Δ− = Δ1− ⊎ Δ2− ⊎ Δ3− ⊎ Δ4− and Δ+ = Δ1+ ⊎ Δ2+ ⊎ Δ3+ ⊎ Δ4+. Here, Δ− and Δ+ are multisets, and Δ is a multiset-backed delta. Then:

R = T3(𝒜4) = (M − Δ−_{c≥2}) ∪ Δ+_{c≥3}.

To understand why, consider a record (r, c) ∈ Δ−. This indicates that r ∈ M and c of the 4 deltas, {Δ1−, Δ2−, Δ3−, Δ4−}, ask to delete r. As long as c ≥ 2, r will be absent from at least 2 of {A1, A2, A3, A4}, and hence cannot appear in at least 3 of them. Similarly, consider a record (s, c) ∈ Δ+. This indicates that s ∉ M and c of the 4 deltas, {Δ1+, Δ2+, Δ3+, Δ4+}, ask to add s. As long as c ≥ 3, s will be present in at least 3 of {A1, A2, A3, A4}. More generally, the transformation rule for a star can be stated as below.

T5: If Δ1, …, Δk are the deltas in the star, then the reduced delta Δs for the t-threshold query is composed as,

Δs− = ⊎_{i=1}^{k} Δi−;  Δs+ = ⊎_{i=1}^{k} Δi+.

We note the following: (i) With every multiset-backed delta, we keep an integer value 2 ≤ p(Δ) ≤ k which indicates the number of datafiles in 𝒜k that are reduced by this delta. For instance, p(Δ) = 4 in the previous example. In general, the access tree will have several disjoint stars (or lines), which are reduced at different times in the evaluation process. We discuss how p(Δ) is used shortly. (ii) The deltas Δs−_{c≥k−t+1} and Δs+_{c≥t} are consistent.

We now describe the semantics of the patch operation in the presence of multiset deltas.

Delta Contraction: Algorithm 7 computes the result of Δ = Δx ⊕ Δy, where Δy is a multiset delta. Note that, due to the nature of C&R, only the last delta in any sequence of deltas can be a multiset delta. The result delta Δ is also a multiset delta. The main idea here is to preserve the semantics of two values: (i) the multiplicity c of a record r, and (ii) the number of datafiles that are reduced due to the delta, p(Δ). Since this is a patch operation, we can simply set p(Δ) ← p(Δy). Recall that a record (r′, c′) ∈ Δy− indicates that c′ of the p(Δy) deltas ask us to remove r′. Now consider a record r ∈ Δx+. If r ∉ Δy−, then we can add (r, p(Δy)) to Δ+. However, if r ∈ Δy−, then we need to "fix" the multiplicity of r, i.e., add (r, p(Δy) − c) to Δ+, where c is the multiplicity of r in Δy−. The other case is similar.

Algorithm 7: Patch operation for multiset-based deltas
Data: Set-backed delta Δx, and multiset-based delta Δy
Result: Multiset delta Δ = Δx ⊕ Δy
1 Initialize Δ ← Δy, p ← p(Δ) ← p(Δy)
2 for r ∈ Δx+ do
3   if r ∈ Δ− then
4     Remove (r, c) from Δ−; Add (r, p − c) to Δ+
5   else
6     Add (r, p) to Δ+
7 for r ∈ Δx− do
8   if r ∈ Δ+ then
9     Remove (r, c) from Δ+; Add (r, p − c) to Δ−
10  else
11    Add (r, p) to Δ−
12 return Δ

We use Algorithm 7 in place of the patch operator defined in Section 5.3.4 when one of the operands is a multiset delta. Its estimated cost is modeled as C⊕(Δx, Δy) = |Δx| + 2|Δy|, and we can use PC during the contract phase as before.

Line: Consider the query R = T2(𝒜4), with the datafiles A1, A2, A3, A4 arranged in a line (A1 materialized, and Δi the delta between Ai and Ai+1). Consider a record r such that r ∈ Δ1+ and r ∈ Δ3−. Although r is present in two opposite deltas, r ∈ R. More generally, in the case of t-threshold queries, simply knowing whether a record is in Δi− or Δi+ is not sufficient to conclude if it is present in the result.
We also require knowledge of the "position" of the delta containing the record on the line. Alternately, we can reduce the line by considering deltas in right-to-left order, using the following simple modification to Algorithm 7. Suppose that we know how to contract Δ2 ⊕ Δ3 to obtain a multiset delta Δy. We show how to modify Algorithm 7 to compute Δ = Δ1 ⊕ Δy. The central idea is again to set the record multiplicities and p(Δ) correctly. Note that p(Δy) = 2, as it reduces A3 and A4. Since Δ is also meant to reduce A2, we set p(Δ) = p(Δy) + 1 (line 1). Consider a record r ∈ Δ1+. If r ∉ Δy−, then we can add (r, p(Δy) + 1) to Δ+ (line 6). On the other hand, if r ∈ Δy−, we add (r, p(Δy) − c + 1) to Δ+ (line 4), where c is the multiplicity of r in Δy−. The other case is similar.

Figure 5.5: Effect of varying #Δ when |Δ| = 5% (median response time in milliseconds vs. number of deltas for LR, Greedy, and PC).

Figure 5.6: Effect of varying |A| (response time vs. average datafile size for LR, Greedy, and PC).

5.5 Experimental Evaluation

In this section, we present a comprehensive evaluation of our DEX prototype. The key takeaway from our study is that pushing down computation to the deltas can lead to significant savings, an order of magnitude in many cases. Surprisingly, even for a single datafile checkout, we see large benefits in the computational time. We also show, through an illustrative experiment (Section 5.5), that using auxiliary data structures like bitmaps can increase the benefits many-fold, indicating that this is a rich direction for future work.

Figure 5.7: Effect of varying #Δ when |Δ| = 5% (median checkout time in seconds vs. number of deltas for Naïve, PC, Greedy, and TC).

All experiments were conducted on a single machine with an Intel Core i7-4790 CPU (3.60 GHz, 8MB L3 cache), 32GB of memory, running Ubuntu 16.04 and the OpenJDK 64-bit server JVM (ver. 1.8.0_111). Our choice to write the query processor in Java was primarily based on getting quick development time while still being reasonably performant on large datasets. While using a low-level language (e.g., C) would reduce the absolute query execution times, it would not change our primary objective, which is to measure the relative speedup of our techniques compared to the baseline. All time measurements are recorded as wall-clock time. Unless otherwise stated, to measure response time, we run each query 10 times and consider the median. To account for the adaptive performance of some of the set operations, we repeat the above on 25 datasets with identical properties (described next) and report the median. As discussed in Section 5.3.4, our computations are CPU bound, and we did not find an appreciable difference in warm cache vs. cold cache settings; for consistency, we report results for a warm cache setting.

Datasets: Lacking access to real-world versioned datasets with sufficient and varied structure, we instead developed a synthetic data generator to generate datasets with very different characteristics for a wide variety of parameter values. This enables us to carefully study the performance of our techniques in various settings. Formally, every experiment setting is characterized by a 4-tuple, ⟨T, |A|, |Δ|, #Δ⟩, where |A| and |Δ| refer to the average number of records in a datafile and the average size of the deltas in the dataset (as a percentage of |A|).
T denotes the shape of the access tree that is used, and is one of: line-shaped (l), star-shaped (s), and line-and-star (ls); and #Δ refers to the number of deltas in the access tree. All records are 64-byte randomly generated strings.

  Parameter   Explanation              Values
  k           Query size               2, 4, 6, 8, 10
  |A|         Average datafile size    1 million (M), 2M, 3M, 4M, 5M
  |Δ|         Average delta size       1%, 2%, 3%, 4%, 5%
  #Δ          Number of deltas         10, 25, 50, 75, 100
  T           Shape of access tree     Line (l), Star (s), Line-and-star (ls)
Table 5.1: Possible values of parameters characterizing a synthetic dataset

Single datafile Checkout: We begin by evaluating the performance of PC, i.e., Algorithm 4, against two heuristics for the case of single datafile checkout. Figure 5.5 shows the median response time of this analysis (in milliseconds) on the vertical axis, and the horizontal axis is the number of deltas (#Δ) in the expression. The other parameters of the dataset are fixed at ⟨T = l, |A| = 3M, |Δ| = 5%⟩.

The LR heuristic simply evaluates the delta expression from left to right, starting with the materialized datafile. This is the standard heuristic used in prior delta-based storage engines, like git. On the other hand, the Greedy heuristic iteratively patches the two operands having the least estimated cost.

Figure 5.8: Access tree shapes; (a) Line, (b) Star, (c) Line-and-star.

We observe that in each instance, PC performs better than Greedy, which performs better than LR. Specifically, we note up to a 7.0-8.8X improvement in median response times when comparing PC with LR, and up to a 14% improvement when comparing with Greedy. The performance gap between LR and the other methods also increases slightly as the number of deltas goes up. This is because the left input of every patch operation in LR has a large size, in contrast to both Greedy and PC, which "balance" their inputs in a cost-based manner. Also, because we assume that every record in a datafile is equally likely to be modified and there is no set of "hot" records, i.e., records that are modified often, we observe that the intermediate result sizes continue to grow in Greedy and PC as well. We observe similar trends for other delta sizes and omit their results.

Next, we study the effect of varying the average datafile size on the response times. Figure 5.6 shows the result of this study on the dataset ⟨T = l, |Δ| = 1%, #Δ = 100⟩ when |A| is varied from 1 million records to 5 million records. In this case, we observe an 8.9-10.5X speedup when compared to LR, with the Greedy solution being approximately close to PC. Finally, although PC has cubic time complexity in the number of deltas, the solutions it finds are, in all cases, better than the alternatives even after taking optimization time into consideration. When #Δ = 100, the average time to find the optimal solution was 1.2ms.

Multiple datafile Checkout: We now evaluate the time taken to checkout k = 8 datafiles on the dataset ⟨T = ls, |Δ| = 5%, |A| = 1M⟩. We evaluate TC, i.e., Algorithm 5, by comparison against three approaches. The Naive approach simply performs a checkout of each datafile independently using LR. The second approach uses PC to checkout individual datafiles. Neither of these approaches takes into account the sharing of intermediate results. The third approach, called Greedy, shares the results of the largest sub-expressions as much as it can (e.g., for two datafiles, the result of the expression from the root of the access tree to their lowest common ancestor is always shared).
Figure 5.7 reports the median checkout time (in seconds) as the number of deltas (#Δ) in the access tree is varied. We observe that, overall, Greedy and TC have similar response times, and TC performs slightly better than Greedy in each case (between a 7.2-10.8% improvement). Also, when compared to Naive, we observe a 5.1-6.8X improvement in median response time. The average optimization time when #Δ = 300 was 18.4ms.

Intersect: In the following set of experiments, we compare the running time of evaluating I(𝒜k) using two algorithms. The baseline approach simply performs a checkout of all the datafiles in 𝒜k using TC, followed by their intersection. The second approach measures the performance of C&R, i.e., Algorithm 6.

Figure 5.9: Effect of access tree structure when |Δ| = 1%, #Δ = 100 (response time vs. query size k for (a) T=s, (b) T=l, (c) T=ls; the number on top of each bar is the speedup obtained).

Figure 5.10: Effect of query size when |Δ| = 1% (response time vs. number of deltas for (a) k=4 and (b) k=8).

Figure 5.11: Intersect – Effect of |A| (speedup vs. query size k).

Figure 5.12: Intersect – Effect of |Δ| (speedup vs. query size k).

Figure 5.13: Union – Effect of |Δ| (speedup vs. query size k).

Figure 5.14: t-Threshold – Effect of |Δ| (speedup vs. query size and threshold (k, t)).

Because C&R makes decisions based on the shape of the access tree, we first study the effect of varying the shape of the access tree on intersect performance. Figure 5.9 shows the median response time against the query size for the three types of access trees: line, star, and line-and-star. The other parameters of the dataset are ⟨|A| = 3M, |Δ| = 1%, #Δ = 100⟩. The numbers on top of each bar indicate the speedup obtained. We note speedups of up to 12X when using C&R. The speedup obtained for T = l is smaller than the others primarily due to the shape of the access tree: in a line, the smallest path between the root and a query datafile cannot be reduced using any of the transformation rules and must be contracted using PC.

In the next experiment, reported in Figure 5.10, we study the effect of varying the number of deltas in the access tree of I(𝒜k). Here, we use the dataset ⟨T = ls, |A| = 3M, |Δ| = 1%⟩ and vary #Δ; we report the results for k = 4, 8. As we can see, our techniques are particularly effective, giving a speedup of up to 16X and 25X for k = 4 and k = 8, respectively. The speedup decreases as the number of deltas increases, primarily due to larger intermediate delta sizes.

Figure 5.11 shows the speedup obtained when the average datafile size is varied between 1M and 5M records; the other dataset parameters are ⟨T = ls, |Δ| = 1%, #Δ = 50⟩. We observe that our techniques show significant benefit, obtaining up to a 17X speedup. Further, we note that datafile size does not affect C&R to a large degree. Figure 5.12 reports the speedup obtained when the average delta size in the dataset, |Δ|, is varied between 1% and 5% of the average datafile size; the other dataset parameters are ⟨T = ls, |A| = 3M, #Δ = 50⟩. We note a speedup of 2.8-16X when |Δ| = 1% that decreases gradually to 2-6X when |Δ| = 5%.
This confirms our hypothesis that if the deltas between the datafiles are small, significant improvements can be obtained by using the deltas in query execution in a more direct manner. When the deltas get large, the intermediate result sizes grow too, which results in a reduced speedup.

Union: The results for U(𝒜k) are similar to the intersection case, although with smaller speedup values. We report one such result in Figure 5.13: the effect of varying the query size (k) for datasets with different average delta sizes |Δ|. The other parameters of the dataset are ⟨T = ls, #Δ = 50, |A| = 3M⟩. We note a speedup of 1.6-8.6X when |Δ| = 1% that decreases gradually to 1.5-4.1X when |Δ| = 5%.

t-Threshold: We use the adaptive algorithm of [104] as a baseline for our t-threshold experiments. Similar to adaptive set intersection, this algorithm uses gallop search in order to find the position of an element r in a set. Moreover, it maintains a min-heap of size k − t + 1, containing at most one element per set, in order to select a "good" element to probe the other sets during each iteration. Figure 5.14 reports the effects of varying (k, t) across datasets with different delta sizes. The other parameters of the dataset are ⟨T = ls, #Δ = 50, |A| = 3M⟩. We observe a speedup of 3.5-5X when |Δ| = 1% that gradually reduces as the delta size increases. When |Δ| = 5%, we report a speedup of 2.1-3.1X. The overall speedup in this case is less than that obtained for the intersection or union queries because, unlike those two, the size of the intermediate results does not decrease when transforming lines and stars.

Experiments with Bitmap Deltas: We have also built support for a filtered index to answer intersection and union queries, and we show the results for an illustrative experiment. Akin to a relational database, a filtered index in DEX is suited to answer queries that always select from a finite "universal" set of records. In this case, we can encode a set of records using a bitmap, where the order of records is determined by their SHA1 value. The index creation step creates a bitmap of size equal to the universal set for each materialized datafile and two bitmaps for each delta in the storage graph. We can then use the bitwise AND (∧), OR (∨) and NOT (¬) operations to compute set intersection, union and difference. In this experiment, we use a compressed bitmap library called roaring bitmaps [105].

Figure 5.15 shows the effect of index size on the intersect query. Here, we measure the speedup vs. query size for index sizes ranging from 500K to 3M records. As expected, for small universal sets, we get largely improved speedup ratios (up to 1200X). With large universe sizes, there is, however, a penalty incurred when selecting the records themselves given the bitmap information.

Figure 5.15: Effect of bitmap size (speedup vs. query size k for universe sizes 0.5M-3M).

5.5.1 Comparisons with Temporal Indexing

In this section, we present a comparison of our approach with the temporal indexing techniques of Buneman et al. [24], and discuss how we reimplemented and compared against their approach. Buneman et al. [24] proposed an archiving technique based on identifying changes to (keyed) records across versions, specifically temporal versions of hierarchical data, that are then merged into one hierarchy represented in XML format. Because they also compared against a diff-based storage solution, we present a brief comparison to highlight the respective strengths and weaknesses of the two strategies.
Broadly stated, their scheme, henceforth referred to as BA, merges all hierarchical elements across versions into one hierarchy by identifying an element by its key and storing it only once, along with the sequence of version timestamps where the respective element appears. Answering a checkout query thus requires scanning the entire archive, and using the intervals to decide which elements belong to the answer. We reimplemented their technique in our framework, using either sorted lists or bitmaps to store the sequence of version ids where an element appears. We describe this next.

We represent a dataset of records as a one-level hierarchical document with all the records as children of the root node. When merging two datasets into a single archive, we identify the common records and only store them once. Due to the nonlinear nature of "version ids" (unlike timestamps) in our problem setting, we tried two different implementations to keep the set of version ids for an element/record: (1) a sorted list, or (2) a bitmap. In the sorted list implementation, the version ids associated with every record are stored in a sorted array, and during retrieval we use binary search to decide if the record is present in the desired version. In the bitmap implementation, a bitmap of size equal to the number of versions in the archive is used with each record to indicate the versions that the record is present in. We use the roaring bitmap library [105] to store these bitmaps. During retrieval, a simple scan through the archive can retrieve any version. We note that it is not clear how to extend some of the optimizations in Buneman et al., most notably "timestamp trees", which depend on the linearity of timestamps, to the nonlinear nature of version ids in a decentralized versioning/data lake scenario.

Buneman et al. compared the performance of their archiver against two approaches based on deltas: (i) "cumulative diff", where every version is stored as a delta against a common (typically the first) version, and (ii) "incremental diff", or "sequence-of-deltas", where every version is stored as a delta against the previous version, resulting in a line storage graph. However, cumulative diff had a large space overhead [24], and incremental diff results in large checkout times due to long delta chains.

We consider single (k = 1) and multiple (k = 4) datafile checkout on the dataset ⟨|Δ| = 5%, |A| = 1M⟩. Additionally, for BA, we vary the number of datafiles (N) in the archive as N = 10, 50, 100, 250, 500, 1000. The bitmap implementation gives superior performance (up to 19%) from N = 100 onwards, and we use that to report checkout time, while for N = 10, 50, we use the sorted list implementation. As noted previously, we can pack approximately N = 80K datafiles in a storage graph (with certain constraints) and get a delta chain of size at most 10 to checkout any single datafile; therefore, we set #Δ = 10, 25 for a fair comparison.

Table 5.2: Median checkout time (ms) in DEX and BA
           DEX (#Δ)         BA (archive size N)
           10      25       10     50     100    250    500    1000
  k = 1    125     550      97     388    659    1286   2677   5362
  k = 4    263     483      144    596    1110   2484   5160   9550

As we can see, BA performs better than a sequence-of-deltas approach for checkout queries. When k = 1, storing N = 50 datafiles gives better response times in BA than storing N = 25 datafiles in a sequence-of-deltas approach (demonstrated by k = 1, #Δ = 25).
However, checkout times for BA increase rapidly as the archive size grows, and DEX is vastly superior to BA under more reasonable assumptions about the storage graph (in the context of a versioning/data lake scenario, it is not clear how to extend some of the optimizations in [24] that depend on the linearity of timestamps).

In short, the main difference between DEX and the sequence-of-deltas approach (which [24] primarily compared against) is that we assume that the storage graph is constructed using a technique that avoids very long delta chains (e.g., the methods presented in Chapter 3, the "skip links"-based approach [65], techniques that balance storage and retrieval costs [106], the greedy heuristic used by git, etc.). We further note that BA suffers from three major limitations: (i) the entire archive must be read even when checking out a single version, (ii) adding a new version requires an expensive merge operation that scans the entire archive (unlike a delta-oriented storage engine where only a single delta may be added), and (iii) decentralization is much more difficult in BA (in theory one could maintain multiple archives and merge them periodically, but we are not aware of any work that has attempted that).

Chapter 6: Query Execution II: Declarative Queries

6.1 Introduction

The preceding chapter described a query processing architecture for a simple class of queries that compared data from multiple versions. As we showed, even these simple queries exhibit interesting and unexplored computational challenges, and the benefits of optimizing their execution can be tremendous (orders of magnitude in many cases). In this chapter, we take a step further and describe an architecture to execute a much richer class of queries from the VQUEL language described in Chapter 4. Specifically, we consider the query optimization challenges of executing a single-block select-project-join (SPJ) query on multiple versions.

Our motivation for studying this SQL-like API is that tabular data formats (CSV/TSV) are extremely common in data science workflows. Oftentimes, analyzing and inspecting past dataset versions can involve executing rich diff queries to understand how subsets of data have changed across versions. For example, a user might need to slice a few columns from a CSV file present in multiple past versions, as a result of updates from different sources, and only consider those records that appear in a majority of the versions. She may also want to do the same for a dataset that is logically a "join" of the records from two or more CSV files. Currently, users either use a combination of *nix utilities like sort, uniq, cut, etc., to accomplish these tasks, or they may use a number of open source tools such as Miller [107], csvkit [108], textql [109], etc., that provide a SQL-like API to quickly wrangle and work with table-like data formats. However, these ad hoc approaches suffer from the same limitations as discussed before: simply checking out the files in every version and evaluating the user-specified query is prone to a lot of wasted work, especially because of the overlap between the versions.

The key idea behind processing such rich queries in DEX is that we execute the query plan exactly once, regardless of the number of versions. Each record/tuple that is processed in the query plan is actually a data structure called a v-tuple.
Apart from storing data about the different columns required by the query, a v-tuple also keeps track of a version list, or version bitmap, of the versions (among the subset of the queried versions) that the tuple appears in. The potential performance benefit of this approach is that the physical query processing operators can operate in batch across the versions that are encoded with every v-tuple. For example, if a query is to be run on 50 versions, and if a column c has value 100, then a selection operator corresponding to c = 100 can filter out tuples across all 50 versions in one operation, possibly resulting in a 50-fold reduction in the number of tuples that have to be moved across the query plan.

This chapter presents the design of a query processing engine that we have built for executing SPJ queries over multiple versions. In summary, our contributions are as follows.

1. We propose a new framework to execute single-block select-project-join (SPJ) queries over multiple versions stored in a delta-based system. The key design decision of our approach is to execute the various query operators once for each unique tuple in the set of versions, rather than executing the query once for each version.

2. We propose using a new representation for tuples, called a v-tuple, during query processing. In addition to the regular fields of a tuple, a v-tuple also has information about which versions (among the set of queried versions) the respective tuple appears in. We develop algorithms to efficiently create such v-tuples from the delta representation and to join/aggregate them.

3. We implement our techniques in DEX using Apache Calcite, which is an open source, highly customizable engine for parsing and planning queries on data in a variety of formats. Specifically, we add new operators to Calcite for retrieving data (Scan) and processing data (Filter/Join).

4. We extensively evaluate the performance of our methods on multiple synthetic datasets. Our results show that DEX makes richer query processing viable in a dataset version control system, with significant benefits over version-at-a-time execution.

6.2 System Design

We review a few terms before describing our extensions to DEX to enable declarative query processing. Conceptually, the execution engine evaluates a query Q over a number of existing versions, say k, and outputs the result with each record being annotated with the respective version(s) it appears in. Recall that a datafile is a file whose contents are interpreted as a set of records. A version is a point-in-time snapshot of one or more datafiles, typically residing in a directory on the user's file system. A version is identified by a unique id, is immutable, and can be created by any user who has access to the repository. A version graph captures the version-level provenance that includes the derivation and transformation relationships, and metadata about the versions themselves. Nodes in a version graph correspond to versions, and edges capture relationships such as derivation, branching, transformation, etc., between two versions. Both nodes and edges have metadata that can be used to allow writing rich queries over the entire repository. In this chapter, however, we only make use of the node metadata to identify the relevant versions during query processing.
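The following minimal Java sketch illustrates the v-tuple idea described above: each distinct record carries a bitmap of the queried versions it appears in, so a selection predicate is evaluated once per distinct record rather than once per version. The class and field names are illustrative assumptions, not the actual DEX classes.

import java.util.*;

public class VTupleDemo {
    static class VTuple {
        final Object[] fields;      // values of the columns needed by the query
        final BitSet versions;      // bit i set => tuple present in queried version i
        VTuple(Object[] fields, BitSet versions) {
            this.fields = fields;
            this.versions = versions;
        }
        public String toString() { return Arrays.toString(fields) + " " + versions; }
    }

    // Filter: evaluate the predicate once per v-tuple; survivors keep their whole bitmap.
    static List<VTuple> filter(List<VTuple> in, java.util.function.Predicate<Object[]> p) {
        List<VTuple> out = new ArrayList<>();
        for (VTuple t : in) if (p.test(t.fields)) out.add(t);
        return out;
    }

    public static void main(String[] args) {
        BitSet all50 = new BitSet();
        all50.set(0, 50);                       // present in all 50 queried versions
        BitSet v0 = new BitSet();
        v0.set(0);                              // present only in the first queried version
        List<VTuple> scan = List.of(
            new VTuple(new Object[]{1, 100, "a"}, v0),
            new VTuple(new Object[]{4, 25, "d"}, all50));
        // One evaluation of "c = 100" removes the second tuple from all 50 versions at once.
        System.out.println(filter(scan, f -> (int) f[1] == 100));
    }
}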
6.2.1 Schema Specification

Since our focus is on executing queries on table-like data, when a datafile is added to the system, we also require the user to submit a schema file that specifies how to parse the text-based format into rows and columns. By default, each line is processed as an array of columns, with all values being of type string. The schema file can specify the data type for individual columns. Finally, the schema file must also designate a column, or a group of columns, as the primary key, which must contain a unique value that identifies each and every row of the datafile. At commit time (both the initial commit and any subsequent commits after updates), the datafile is converted into the Apache Parquet format for physical storage, as described in Section 6.2.3. Additionally, a Parquet file can also be given as an input at commit time. We require such rich schema information regarding the datafiles primarily to support column deltas, as discussed next.

6.2.2 Delta format

As noted before, the choice of the delta format has significant implications on what operations/queries can be run on it. When deciding on a delta format between two versions of a datafile, we had the following considerations:

1. the format should be compact,
2. creating and using the delta during query processing should be efficient, and
3. we should be able to read and work with only the parts of the delta that are essential.

Since every record is identified by a primary key, we can meet all of the above criteria by capturing changes to individual record fields, as outlined next. Suppose R1 and R2 are two versions of a datafile R, with every record having the schema ⟨k, c1, c2, …, ck⟩, where k is the primary key of R and c1, c2, …, ck are k columns. The delta (Δ) between them contains records of the form ⟨k, [c11, c12], [c21, c22], …, [ck1, ck2]⟩, where k is the primary key of a record that is different in one or more columns in the two versions, and every ci1 or ci2 is the respective value in R1 or R2 in a column where the record differs, or the special value "⟂" if the column values are identical. We record the value of the column in both versions, and therefore this is an undirected delta format. Figure 6.1 gives an example of this format. R1, R2, R3 are three versions of a datafile R, Δ1 is the delta between R1 and R2, and Δ2 is the delta between R2 and R3. Consider the record with primary key 1 in R1 and R2, i.e., R1 : ⟨1, 100, a⟩ and R2 : ⟨1, 100, b⟩. In Δ1, the delta record for this primary key is therefore Δ1 : ⟨1, [⟂,⟂], [a, b]⟩. Similarly, we have delta records for the primary keys 2, 3, 5, but not for 4, as that record is identical in the two versions. Note that for every record in the delta, there is at least one column where the values are not [⟂,⟂].

Next, we discuss how we compute the deltas between two versions of the same datafile. The differencing procedure works by finding records in the source and target datafiles that are logical "pairs". The schema-defined primary key is used to pair up records in the source and target datafiles. For a record pair, any difference in the content is output as a new record in the delta, having the schema described above. Such pairs are considered updates. Records in the source datafile that could not be paired are considered deletes, and we output one delta record for each such source record. Similarly, records in the target datafile that could not be paired are output as inserts.
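The following Java sketch illustrates the differencing procedure just described: rows are paired by primary key and, for each pair that differs, a delta record keeps an [oldValue, newValue] pair per changed column and a null (standing in for the ⟂ symbol) per unchanged column. Inserts and deletes (unpaired keys) are omitted for brevity; the class and method names are illustrative assumptions, not the DEX implementation.

import java.util.*;

public class ColumnDiff {
    // A delta record: primary key plus one [old, new] pair (or null) per column.
    record DeltaRecord(String key, String[][] columns) {
        public String toString() { return key + " " + Arrays.deepToString(columns); }
    }

    static List<DeltaRecord> diff(Map<String, String[]> source, Map<String, String[]> target) {
        List<DeltaRecord> delta = new ArrayList<>();
        for (Map.Entry<String, String[]> e : source.entrySet()) {
            String[] oldRow = e.getValue(), newRow = target.get(e.getKey());
            if (newRow == null) continue;                    // a delete; not shown here
            String[][] cols = new String[oldRow.length][];
            boolean changed = false;
            for (int i = 0; i < oldRow.length; i++) {
                if (!Objects.equals(oldRow[i], newRow[i])) {
                    cols[i] = new String[]{oldRow[i], newRow[i]};
                    changed = true;
                }                                            // else leave null (= unchanged)
            }
            if (changed) delta.add(new DeltaRecord(e.getKey(), cols));
        }
        return delta;
    }

    public static void main(String[] args) {
        Map<String, String[]> r1 = Map.of("1", new String[]{"100", "a"},
                                          "3", new String[]{"5", "c"},
                                          "4", new String[]{"25", "d"});
        Map<String, String[]> r2 = Map.of("1", new String[]{"100", "b"},
                                          "3", new String[]{"8", "d"},
                                          "4", new String[]{"25", "d"});
        diff(r1, r2).forEach(System.out::println);   // delta records for keys 1 and 3 only
    }
}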
6.2.3 Physical Representation

We now describe how the datafiles and deltas are persisted on disk. Once a datafile is committed in DEX, we parse it according to the schema file supplied and store it in the Apache Parquet format [110]. Apache Parquet is a state-of-the-art, open source columnar file format offering both high compression and high scan efficiency. Parquet is a PAX-like [111] format optimized for large data blocks and is vastly more efficient than text-based formats like CSV during storage and query processing [112]. This format is also suited to store the deltas, because it results in a more compact representation when only a small number of columns are modified across versions.

Figure 6.1: R1, R2, R3 are three versions of a datafile R; k is the primary key column. Three deltas are shown: Δ1 is the delta between R1 and R2, Δ2 is the delta between R2 and R3, and Δ3 is the delta between R1 and R3.
  R1: (1, 100, a), (2, 10, z), (3, 5, c), (4, 25, d), (5, 90, e)
  R2: (1, 100, b), (2, 20, z), (3, 8, d), (4, 25, d), (5, 100, e)
  R3: (1, 100, b), (2, 40, f), (3, 5, d), (4, 25, d), (5, 100, e)
  Δ1: ⟨1, [⟂,⟂], [a,b]⟩, ⟨2, [10,20], [⟂,⟂]⟩, ⟨3, [5,8], [c,d]⟩, ⟨5, [90,100], [⟂,⟂]⟩
  Δ2: ⟨2, [20,40], [z,f]⟩, ⟨3, [8,5], [⟂,⟂]⟩
  Δ3: ⟨1, [⟂,⟂], [a,b]⟩, ⟨2, [10,40], [z,f]⟩, ⟨3, [⟂,⟂], [c,d]⟩, ⟨5, [90,100], [⟂,⟂]⟩

6.2.4 Discussion

Input data may come in the form of CSV/TSV/JSON formats, which are text-based. Currently, the user has to ensure that the schema file is correct and that there are no errors during the conversion process. Automatic schema inference may be added later. In addition, such conversion can also be done in a post-hoc manner. If it is expected that new datafiles may benefit from richer query processing, a separate process can be run to convert the existing binary data to the Parquet format by specifying a schema file.

6.3 Query Execution

In this section, we describe the query processing ideas underlying our prototype implementation. We make heavy use of the terminology and algorithms from Section 5.3 and Section 5.4, and extend the ideas behind delta contraction, and line and star reductions, to be applicable in this setting. Thereafter, we outline the design of the other physical operators needed to execute the rest of the query.

Query execution follows the same two-phase optimization approach that we described in Section 5.3. Suppose we denote the query as Q(𝒜k), where Q denotes the SPJ part of the query, and 𝒜k = {R1, R2, …, Rk} is the set of datafiles required by the query. For simplicity of notation, we assume that all of 𝒜k are versions of the same datafile R. However, in the presence of a join, 𝒜k will contain versions of multiple datafiles, and the Scan step, described in detail in Section 6.3.2, is modified to account for each datafile separately.

The storage graph, 𝒢, described in detail in Section 5.2.3.1, indicates the delta decisions that have been made when storing all versions of a datafile. In the first phase of query execution, we identify all the relevant datafiles and deltas in 𝒢 that are necessary to execute Q(𝒜k). This is the problem of finding an access tree of Q(𝒜k), and we have described this in detail in Section 5.3.2.

In the second phase, we map the logical SPJ query onto a directed acyclic graph (DAG) of the physical operators described in the next few sections. Since in this work we address single-block queries, the mapping from the logical plan to the physical operators is straightforward, and an example is shown in Figure 6.2.

Figure 6.2: (a) Example query on k versions; (b) physical plan to execute the query.
  (a)  select R.a, S.z from R join S on R.b = S.x
       where R.c = "..." and S.y = "..."
       on-versions(V1, V2, ..., Vk)
  (b)  Scan Rk → Filter Rk and Scan Sk → Filter Sk, feeding a Join, followed by a Project.
The primary difference from the perspective of a classical query execution engine is that the physical operators described below are modified to create and process v-tuples instead of tuples. At the lowest step of the DAG, we have the Scan operator (Section 6.3.2), taking as input the respective access trees (for the two datafiles R and S in the query). The output of the Scan operator is a set of v-tuples for each datafile. Thereafter, the Filter operator applies all relevant predicates in the query. The Join operator (Section 6.3.3) reads the two sets of v-tuples and outputs a single set of v-tuples corresponding to the join result. The projection step removes any fields not needed by the query.

6.3.1 v-tuples

A v-tuple r, generated either as a result of the Scan step or any subsequent step in the query processing pipeline, contains data about the relevant columns required by the query, and a version bitmap indicating the versions where the tuple is present. Figure 6.3 shows a set of v-tuples, which are the output of the Scan step on three versions of the relation R, shown earlier in Figure 6.1. The tuple with the primary key 4 is unchanged in all three versions, and hence it has the bitmap [111]. On the other hand, the tuples with primary keys 2 and 3 are different in all three versions, and hence we have three v-tuples corresponding to each.

Figure 6.3: v-tuples for R1, R2, R3
  k   c1    c2   [v1 v2 v3]
  1   100   a    [1 0 0]
  1   100   b    [0 1 1]
  2   10    z    [1 0 0]
  2   20    z    [0 1 0]
  2   40    f    [0 0 1]
  3   5     c    [1 0 0]
  3   8     d    [0 1 0]
  3   5     d    [0 0 1]
  4   25    d    [1 1 1]
  5   90    e    [1 0 0]
  5   100   e    [0 1 1]

In principle, v-tuples are similar to the concept of "tuple bundles" in the Monte Carlo Database System (MCDB) [113]. MCDB is a prototype relational database designed to allow an analyst to attach arbitrary stochastic models to a database, thereby specifying, in addition to the ordinary relations, "random" relations that contain uncertain data. A tuple bundle, like a v-tuple, encapsulates the instantiations of a tuple over a set of many Monte Carlo iterations, with the goal of operating in batch across all Monte Carlo iterations. However, due to the nature of the application, the specific structure of the data inside tuple bundles is different from v-tuples, and we require new physical operators to generate v-tuples efficiently.

6.3.2 Scan Operator

The Scan operation is the workhorse of the query processing phase. The input to a Scan operation is the access tree of the datafile and a list of versions in the query. The output is a set of v-tuples across all the versions requested by the query. Suppose 𝒜k = {R1, R2, …, Rk} are versions of a datafile R. A naive implementation of the Scan operator would be to first perform a multiple datafile checkout (Section 5.4.1.2) of all of 𝒜k, followed by a grouping step that brings common records together, to create a set of v-tuples. However, this scheme is likely to be inefficient, as it involves materializing close to k copies of most of the tuples, only to merge all the tuples together (if most of the datafile does not change). When k is large, about 50-100 in our experiments, this scheme becomes a major performance bottleneck. We therefore use a different strategy, one that works with the deltas as much as possible.
We use the Contract and Reduce (C&R) algorithm, described in detail in Section 5.4.2 and in Algorithm 6, to generate v-tuples efficiently. Recall that the C&R algorithm takes as input an access tree and iteratively applies a set of transformations to generate the query result efficiently. In order for the algorithm to be applicable in this setting, we need to address three issues, which are the focus of the rest of this section.

First, we need to define the semantics of delta contraction for the delta format that we use. The delta contraction step takes two deltas, say Δ1 between R1 and R2, and Δ2 between R2 and R3, and combines them to create a single delta, say Δ3, between R1 and R3. This operation is useful when R2 is not required in the query. Second, we need to define reduction rules for line and star structures, as defined in Section 5.4.2, for our delta format. Due to the additional structure in the deltas, we extend the delta format to incorporate the result of reducing multiple deltas arranged in line/star configurations. This delta format is similar to the one described in Section 6.2.2 earlier, but instead of keeping two values per column in a reduced delta Δ, we keep 2 ≤ p(Δ) ≤ k values, where p(Δ) indicates the number of versions in V that are reduced by this delta. Third, we describe how to apply, or patch, the "last" delta to the materialized datafile in order to generate all the v-tuples.

6.3.2.1 Delta Contraction

Suppose we want to compute Δ = Δ1 ⊕ Δ2, where Δ1 and Δ2 are defined as above (see Figure 6.1 for an example). If a primary key k′ occurs in only one of Δ1 or Δ2, it is included as is in Δ; this indicates that the record was modified in only one of the two steps. If a primary key k′ is present in both Δ1 and Δ2, it indicates that the record was modified in both, and we resolve it as follows. For each column ci of primary key k′ appearing in the deltas, we have three possible scenarios:

• ci = [⟂,⟂] in both Δ1 and Δ2,
• ci = [⟂,⟂] in Δ1 and ci = [v, w] in Δ2, or vice versa, or
• ci = [x, y] in Δ1 and ci = [y, z] in Δ2,

where v, w, x, y, z are values from the domain of ci. In the first case, we can infer that the value in column ci has not changed between R1 and R3, and hence we set ci = [⟂,⟂] in Δ. In the second case, we can infer that the column value changed in only one of Δ1 or Δ2, and we set ci = [v, w] in Δ. In the third case, if x ≠ z, we set ci = [x, z] in Δ; otherwise we set ci = [⟂,⟂]. After all columns of a key k′ have been processed and set appropriately in Δ, we check whether there is at least one column whose value is not [⟂,⟂]. If no such column exists, k′ can be removed from Δ.
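A minimal sketch of this contraction rule follows, assuming (purely for illustration, not the on-disk format of Section 6.2.3) that a delta is a dictionary from primary key to a per-column map of [before, after] pairs, with BOT standing in for [⟂,⟂]:

    BOT = None  # stands in for the [⟂, ⟂] pair in the text

    def contract(delta1, delta2, columns):
        # Sketch of Δ = Δ1 ⊕ Δ2: delta1 maps R1 -> R2, delta2 maps R2 -> R3; the result maps R1 -> R3.
        out = {}
        for k in set(delta1) | set(delta2):
            if k in delta1 and k not in delta2:
                out[k] = dict(delta1[k]); continue     # modified in only one step: copy as is
            if k in delta2 and k not in delta1:
                out[k] = dict(delta2[k]); continue
            row = {}
            for c in columns:
                a = delta1[k].get(c, BOT)
                b = delta2[k].get(c, BOT)
                if a is BOT and b is BOT:
                    row[c] = BOT                       # unchanged between R1 and R3
                elif a is BOT:
                    row[c] = b                         # changed only between R2 and R3
                elif b is BOT:
                    row[c] = a                         # changed only between R1 and R2
                else:
                    x, _ = a
                    _, z = b
                    row[c] = BOT if x == z else [x, z]
            if any(v is not BOT for v in row.values()):
                out[k] = row                           # drop keys where every column reduced to ⟂
        return out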
Figure 6.4: Line and star transformations for the Scan operation ((a) line, (b) star).

6.3.2.2 Line/Star Structures

We first describe the transformation rule in the case of a line. Figure 6.4(a) shows an example where three datafiles, R1, R2, R3, are arranged in a line, and our goal is to generate v-tuples for all three. Similar to Section 5.4.4, we keep additional information with each reduced delta. We first set p(Δl) = 3, where Δl is the reduced delta. In general, p(Δl) = p(Δ1) + p(Δ2) − 1, where p(Δ) = 2 for a delta between two datafiles. This value keeps track of the number of datafiles reduced by the delta. Every column ci in Δl is an array of p(Δl) values. The main idea is to preserve the values (in the array) of the modified columns in every version, and to infer values, whenever possible, for columns that are ⟂ in the input deltas. There are two cases to consider when computing these values for a primary key k′.

Case 1: k′ is present in only one of Δ1 or Δ2. This is the case for primary keys 1 and 5 in our example. There are two scenarios. First, when a column ci contains the special value ⟂, ci in Δl can simply be set to an array of all ⟂; for instance, c1 for k′ = 1 and c2 for k′ = 5. Second, when the column contains values from the domain of ci. We discuss the case when k′ is present in Δ1. For instance, in Δ1, column c2 for k′ = 1 contains the values [a, b]. c2 in Δl is then set as follows: the first p(Δ1) = 2 values are taken from Δ1 and copied into the array, i.e., [a, b, ⟂], and the remaining p(Δ2) − 1 values are set to the last value in Δ1, i.e., [a, b, b]. The complementary rule handles the case when k′ is present in Δ2 instead.

Case 2: k′ is present in both Δ1 and Δ2. This is the case for primary keys 2 and 3 in our example. There are three scenarios, similar to the ones we discussed for delta contraction above:

1. ci is all ⟂ in both Δ1 and Δ2; then ci can be set to all ⟂ in Δl.
2. ci is ⟂ in one of Δ1 or Δ2; then the ⟂ values can be set to the appropriate value as in Case 1 above, depending on whether the ⟂ values are in Δ1 or Δ2 (e.g., c2 for k′ = 2, 3).
3. ci has values in both Δ1 and Δ2; then ci in Δl is simply a concatenation of the values in Δ1 and Δ2 (counting the last value in Δ1 and the first value in Δ2 only once, since they are guaranteed to be the same).

The transformation rule in the case of a star is similar to the rules for a line, with the only difference being the value that is copied into ci in Δs when either one or both of Δ1 and Δ2 contain values (not ⟂) in ci: in the case of a star, it is the first value in the column ci (this value will be the same in both Δ1 and Δ2). Figure 6.4(b) gives an example where R1, R2, R3 are arranged in a star, and Δs is the reduced delta of Δ1 and Δ3.
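The following sketch captures both the line and the star reduction under the conventions above. The dictionary-of-lists encoding (each modified column holds a list of p values, None standing for an all-⟂ column) and the star flag are assumptions for exposition only, not the DEX delta format.

    def reduce_pair(d1, p1, d2, p2, columns, star=False):
        # d1 and d2 are reduced deltas covering p1 and p2 versions and sharing one version,
        # so the result covers p1 + p2 - 1 versions. For a line the shared version is the
        # last of d1 / first of d2; for a star it is the first version of both.
        out = {}
        for k in set(d1) | set(d2):
            row = {}
            for c in columns:
                a = d1.get(k, {}).get(c)     # list of p1 values, or None for an all-⟂ column
                b = d2.get(k, {}).get(c)     # list of p2 values, or None
                if a is None and b is None:
                    row[c] = None            # stays ⟂ in the reduced delta
                elif b is None:
                    fill = a[0] if star else a[-1]
                    row[c] = a + [fill] * (p2 - 1)       # carry the shared value forward
                elif a is None:
                    row[c] = [b[0]] * (p1 - 1) + b       # carry the shared value backward
                else:
                    row[c] = a + b[1:]       # the shared version's value appears only once
            out[k] = row
        return out

    # For the line of Figure 6.4(a), reducing Δ1 and Δ2 for key 2 with
    # c1 = [10, 20] and [20, 40], c2 = None and [z, f] yields [10, 20, 40] and [z, z, f].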
6.3.2.3 Applying a Delta to a Materialized Datafile

The final step in the C&R algorithm is to apply the last delta to the materialized file in order to generate all the v-tuples. Continuing our example, we want to obtain the v-tuples in Figure 6.3 as the result of the operation R1 ⊕ Δl in the example of Figure 6.4(a). We perform a linear scan of the materialized file, R1, and the delta, Δl, resolving records based on their primary key. For every primary key k′ in R1, we output between 1 and p(Δl) = 3 v-tuples. If a primary key k′ does not appear in the delta, the record has not been modified, and it is output as a single v-tuple with a bitmap of p(Δl) 1s. For example, the record with primary key 4 does not appear in Δl, and hence it is output as (4, 25, d, [111]). On the other hand, if a primary key k′ appears in the delta, we consider every column ci whose values are not ⟂ and generate the v-tuples by resolving one column at a time. Specifically, consider the record with primary key 3 in Δl, i.e., (3, [5, 8, 5], [c, d, d]). We first resolve column c1. The goal is to identify all repeated values in c1 and output intermediate tuples such that there is one tuple per distinct value in c1. For instance, the value 5 is repeated in versions 1 and 3; after resolving, the partial result is {(3, 5, [c,⟂,d]), (3, 8, [⟂,d,⟂])}. Next, we resolve c2. For the first tuple in the intermediate result, (3, 5, [c,⟂,d]), there are no duplicate values in c2, and since this is the last column to be resolved, we can output two v-tuples, {(3, 5, c, [100]), (3, 5, d, [001])}. Similarly, for the second tuple, (3, 8, [⟂,d,⟂]), there are no duplicate values in c2, and we output {(3, 8, d, [010])}.

6.3.3 JOIN Operator

In this section, we describe hash join-based algorithms to perform the equi-join operation. Our choice to study hash-based methods, as opposed to sort-based methods, was influenced by the observation that, in general, they produce results faster than sort-based methods and have a smaller memory footprint [114]. A detailed study of the criteria under which sort-based algorithms become competitive in the context of DEX remains an area for future work.

6.3.4 Simple Hash Join

We first adapt the canonical hash join algorithm to our setting. Suppose the two inputs to be joined are R and S, such that |R| < |S|, and both R and S are sets of v-tuples. The algorithm has a build phase and a probe phase. At the start of the build phase, it allocates memory for the hash table. It then reads a v-tuple r ∈ R, hashes on the join key of r using a pre-defined hash function h(⋅), and writes the v-tuple r into the corresponding bucket. The build phase is complete when all the R v-tuples have been stored in the hash table.

Figure 6.5: Buckets in simple hash join (buckets for two keys, k1 and k2, in R).

During the probe phase, each v-tuple s ∈ S is hashed on the join key using the same hash function h(⋅), and the correct hash bucket for the join key is identified (after accounting for hash collisions). For each v-tuple r in that bucket, we perform a logical AND of the bitmaps of r and s, and if the result is non-zero, the concatenated v-tuple r ⋅ s is output with the new bitmap. Figure 6.5 shows an example with two hash buckets of R, for keys k1 and k2. During the probe phase, the bitmap of a probing tuple s1 is ANDed with every v-tuple in the bucket for key k1 to find join matches.

This example also illustrates a source of inefficiency in our simple adaptation. Note that the v-tuples in the bucket for key k2 have mostly 0s in their bitmaps. During the probe phase, when a tuple s2 matches on the join key, most of the bitmap AND operations result in bitmaps that are all 0s. We address this limitation next.
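For reference, the build and probe phases just described can be sketched as follows, reusing the illustrative VTuple record sketched earlier; this is a sketch of the bitmap-AND probe under those assumptions, not the DEX implementation.

    from collections import defaultdict

    def simple_hash_join(R, S, r_key, s_key):
        # Build on the smaller input R, probe with S, and AND the version bitmaps of matches.
        table = defaultdict(list)
        for r in R:                                    # build phase
            table[r_key(r)].append(r)
        for s in S:                                    # probe phase
            for r in table.get(s_key(s), []):
                bitmap = tuple(x & y for x, y in zip(r.bitmap, s.bitmap))
                if any(bitmap):                        # drop pairs that share no version
                    yield (r.values + s.values, bitmap)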
6.3.5 Version-aware Hash Join

The main bottleneck in the simple hash join method described above is the case where a bucket contains a large number of v-tuples whose bitmaps are mostly 0s. When probing for matches of a tuple s, the algorithm ends up scanning the entire bucket, only to find few matches. We propose a different bucket design that improves probe efficiency by exploiting the sparsity in such buckets.

Figure 6.6 shows the important elements of the new bucket design. The bucket consists of two pieces; suppose the bitmap in the v-tuples is of size k.

Figure 6.6: Improved bucket structure for sparse keys.

The first piece is an array of k − 1 integer values that serve as pointers into the bucket for the tuples of the k versions. The second piece is the tuples of the input R, partitioned by version: all tuples in V1 are stored first, followed by all tuples in V2, and so on. During the build phase, the bitmap of every v-tuple r ∈ R is inspected to determine the versions that r appears in. If r appears in j ≤ k versions, we create j copies of r, sans the bitmap, and store them in the correct partitions of the bucket. Once the build phase is complete, the bucket is scanned once to compute the counts/pointers in the first piece. During the probe phase, the bitmap of a tuple s is inspected to identify the versions that s appears in, and the array of pointers in the bucket is used to look up the relevant tuples. The primary benefit of this approach arises when, for a large number of v-tuples r, j ≪ k. Although the build phase uses more memory for such buckets, the probe phase can be performed efficiently by looking up only the required versions, instead of performing a linear scan of all entries in the bucket.
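A sketch of this version-partitioned bucket is shown below. For simplicity it stores k start offsets rather than the k − 1 pointers described above, and it reuses the illustrative VTuple record from earlier; both choices are assumptions for exposition, not the DEX layout.

    def build_sparse_bucket(vtuples, k):
        # Copy each v-tuple (without its bitmap) into the partition of every version it appears in,
        # then record the start offset of each version's partition.
        partitions = [[] for _ in range(k)]
        for r in vtuples:
            for v, bit in enumerate(r.bitmap):
                if bit:
                    partitions[v].append(r.values)
        tuples, offsets = [], []
        for part in partitions:
            offsets.append(len(tuples))        # start offset of this version's tuples
            tuples.extend(part)
        return offsets, tuples

    def probe_sparse_bucket(offsets, tuples, version, k):
        # Return only the tuples stored for one version, without scanning the whole bucket.
        end = offsets[version + 1] if version + 1 < k else len(tuples)
        return tuples[offsets[version]:end]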
6.3.6 Other Operators

6.3.6.1 Filter Operator

The Filter operator is identical to selection in a classical database system. The filter predicate is applied to every v-tuple, and if it evaluates to true, the tuple is output as is to the next step.

6.3.6.2 Project Operator

Similar to the selection operation, the project operation works with one v-tuple at a time, keeping only the respective columns and leaving the version bitmap of each v-tuple intact. Note that this also preserves the multiset semantics of the projection operation.

6.4 Evaluation

We now evaluate the benefits of using the C&R algorithm for the Scan operator and the new bucket design for the hash join in the Join operator. Our goal is to quantify the performance of both techniques compared to the respective baseline approaches. For the Scan operation, where the input is k datafiles, the baseline approach is the multiple datafile checkout operation described in Section 5.4.1.2, followed by a k-way merge to construct the v-tuples. For the Join operation, the baseline approach is the simple hash join method described in Section 6.3.4 above.

All experiments were conducted on a single machine with an Intel Core i7-4790 CPU (3.60 GHz, 8MB L3 cache) and 32GB of memory, running Ubuntu 16.04 and the OpenJDK 64-bit server JVM (version 1.8.0_111). All time measurements are recorded as wall-clock time. Unless otherwise stated, to measure response time we run each query 10 times and report the median.

6.4.1 Datasets

Lacking access to real-world versioned datasets with sufficient and varied structure, we instead developed a synthetic data generator, described in detail in Section 5.5, to generate datasets with different characteristics over a wide variety of parameter values. This enables us to carefully study the performance of our techniques in various settings. We describe the modifications that we made to the dataset generation process next.

Every dataset is characterized by a 3-tuple, ⟨|A|, |Δ|, #Δ⟩, where |A| and |Δ| refer to the average number of records in a datafile and the average number of records in the deltas (as a percentage of |A|), respectively, and #Δ refers to the number of deltas in the access tree. A record has four columns: a primary key column of 64-bit integer type, and three 64-byte string columns. When creating a new version of a datafile from an existing version, the dataset generator assigns a probability to every record according to a Zipfian distribution (with exponent 2); this probability indicates how likely the record is to be modified when creating the new version. Within a record, the column to be modified is chosen at random from the three string columns.

Figure 6.7: Scan performance over varying delta sizes; dataset parameters: |R| = 3M, #Δ = 50.

6.4.2 Results

We first compare the running time of Scan using the two methods. The input to the Scan operator is an access tree over k versions, say of a datafile R, and the output is a set of v-tuples. The first method simply performs a multiple datafile checkout of the k versions, followed by a k-way merge. The second method uses the C&R algorithm, applying the transformation and reduction rules described in Section 6.3.2. Figure 6.7 shows the speedup obtained by using C&R when the average delta size is varied between 1%, 2%, and 3%, with the other dataset parameters set to |R| = 3M and #Δ = 50. As we can see, C&R yields a 4X–6X speedup when the delta size is small, i.e., 1%. The speedup decreases as the size of the deltas increases, primarily due to larger intermediate delta sizes.

Next, Table 6.1 reports the running time of the two hash join methods when performing an equi-join operation. The first method, SHJ, denotes the simple hash join approach described in Section 6.3.4, and the second method, VHJ, denotes the version-aware hash join approach described in Section 6.3.5. We perform an equi-join on v-tuples representing k versions of two datafiles, R and S, where |R| = 100K and |S| = 1M.

k    SHJ (sec)  VHJ (sec)  Speedup
5    3.3        3.3        0%
25   19.4       11.6       61%
50   49.8       29.6       68%
75   103.2      59.7       73%
100  146.6      81.9       79%

Table 6.1: Join performance of the two hash join methods over a varying number of input versions of R and S; dataset parameters: |R| = 100K, |S| = 1M, |Δ| = 1%, #Δ = 50.

During the build phase, VHJ determines when to use the new bucket design, depending on the sparsity of the bucket for a given key. Specifically, two parameters are computed when deciding whether a bucket is sparse. The first parameter, b, is the sparsity of the bitmap in a v-tuple, measured relative to the size of the bitmap, i.e., the number of versions k; a v-tuple is said to be sparse when its b value is less than a specified threshold, b̄. The second parameter, n, is the relative number of sparse v-tuples in the bucket; the bucket is said to be sparse when its n value is greater than a specified threshold, n̄. Table 6.1 reports the runtime (in seconds) and the speedup obtained when we set b̄ = 5% and n̄ = 90%. When there are only 5 versions across which to perform the join, no buckets satisfy the sparsity criteria, and hence the run times are identical. We observe speedups of 61%–79%, increasing with the number of versions in the join, when using the different bucket design for sparse buckets.
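As an illustration, the sparsity test could be implemented along the following lines; b_thresh and n_thresh play the roles of the thresholds written as b̄ and n̄ above, and the function is a sketch under the VTuple encoding used earlier rather than the DEX code.

    def is_sparse_bucket(bitmaps, k, b_thresh=0.05, n_thresh=0.90):
        # A v-tuple is sparse if it is live in fewer than b_thresh * k versions;
        # a bucket is sparse if more than n_thresh of its v-tuples are sparse.
        sparse = sum(1 for bm in bitmaps if sum(bm) < b_thresh * k)
        return bool(bitmaps) and sparse / len(bitmaps) > n_thresh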
Our initial set of experiments thus shows significant benefits of the v-tuple approach for evaluating richer declarative queries over multiple versions. It indicates that we can both create v-tuples efficiently, by taking advantage of the delta representation when reading the input, and perform richer operations such as joins at acceptable overheads. As a next step, we plan to add additional operations, such as aggregation and user-defined functions, to further extend the class of queries that DEX can support.

Chapter 7: Conclusions

In this dissertation, we introduced a framework for building a dataset version control system. We presented a theoretical model for capturing and reasoning about the trade-offs that arise in the management of thousands of dataset versions. We demonstrated that, if the versions largely overlap in their contents, then by using delta encoding it is possible to both store and retrieve them effectively. Instead of applying delta encoding in an ad hoc manner, we studied different algorithms that helped us navigate the storage–recreation trade-off in a principled fashion. We also showed that it is possible to execute a variety of queries over multiple past dataset versions without having to first retrieve them in their entirety.

The technical contributions of this dissertation are (i) a formal study of the dataset versioning problem that considers the trade-off between storage cost and recreation cost in different ways, and provides a collection of polynomial-time algorithms for finding good solutions for large problem sizes; (ii) a cost-based optimization framework, along with a set of transformation rules based on the algebraic properties of the deltas, that finds efficient methods to evaluate checkout (retrieval), intersection, union, and t-threshold queries over multiple versions; and (iii) a new approach to efficiently execute single-block select-project-join queries over multiple dataset versions. We also introduced the DEX system, which demonstrates how the techniques in this dissertation can lead to significantly improved performance compared to the ad hoc techniques and version management systems prevalent today.

Bibliography

[1] James Cheney, Stephen Chong, Nate Foster, Margo Seltzer, and Stijn Vansummeren. Provenance: A future history. In OOPSLA, pages 957–964. ACM, 2009.
[2] Daniel J. Weitzner, Harold Abelson, Tim Berners-Lee, Joan Feigenbaum, James Hendler, and Gerald Jay Sussman. Information accountability. Communications of the ACM, 51(6):82–87, 2008.
[3] R. K. L. Ko, P. Jagadpramana, M. Mowbray, S. Pearson, M. Kirchberg, Q. Liang, and B. S. Lee. TrustCloud: A framework for accountability and trust in cloud computing. In IEEE World Congress on Services, July 2011.
[4] Pat Helland. Immutability changes everything. In CIDR, 2015.
[5] João Paulo and José Pereira. A survey and classification of storage deduplication systems. ACM Computing Surveys, 2014.
[6] Monya Baker. De novo genome assembly: what every biologist should know. Nature Methods, 9(4):333–337, 2012.
[7] http://git.kernel.org/cgit/git/git.git/tree/Documentation/technical/pack-heuristics.txt.
[8] http://comments.gmane.org/gmane.comp.version-control.git/189776.
[9] Sebastian Rönnau, Jan Scheffczyk, and Uwe M. Borghoff. Towards XML Version Control of Office Documents. In Proc. of the ACM Symposium on Document Engineering, pages 10–19, 2005.
[10] Michael Maddox, David Goehring, Aaron J. Elmore, Samuel Madden, Aditya G. Parameswaran, and Amol Deshpande. Decibel: The relational dataset branching system. PVLDB, 9(9):624–635, 2016.
[11] James Clifford, Curtis Dyreson, Tomás Isakowitz, Christian S. Jensen, and Richard Thomas Snodgrass. On the semantics of "now" in databases. ACM Transactions on Database Systems (TODS), 22(2):171–214, 1997.
[12] Gultekin Özsoyoğlu and Richard T. Snodgrass. Temporal and real-time databases: A survey. IEEE Trans. on Knowl. and Data Eng., pages 513–532, 1995.
[13] Abdullah Uz Tansel, James Clifford, Shashi Gadia, Sushil Jajodia, Arie Segev, and Richard Snodgrass.
Temporal databases: theory, design, and implementation. Benjamin-Cummings Publishing Co., Inc., 1993.
[14] David Lomet, Roger Barga, Mohamed F. Mokbel, German Shegalov, Rui Wang, and Yunyue Zhu. Immortal DB: Transaction time support for SQL server. In SIGMOD, pages 939–941, 2005.
[15] David Lomet, Mingsheng Hong, Rimma Nehme, and Rui Zhang. Transaction time indexing with version compression. In PVLDB, pages 870–881, 2008.
[16] David B. Lomet, Alan Fekete, Rui Wang, and Peter Ward. Multi-version concurrency via timestamp range conflict management. In IEEE 28th International Conference on Data Engineering (ICDE 2012), Washington, DC, USA (Arlington, Virginia), 1-5 April, 2012, pages 714–725, 2012. doi: 10.1109/ICDE.2012.10. URL http://dx.doi.org/10.1109/ICDE.2012.10.
[17] Yu Wu, Sushil Jajodia, and X. Sean Wang. Temporal database bibliography update. In Temporal Databases: research and practice, pages 338–366. Springer, 1998.
[18] Richard Snodgrass. The temporal query language tquel. ACM Trans. Database Syst., 12(2):247–298, June 1987. ISSN 0362-5915. doi: 10.1145/22952.22956. URL http://doi.acm.org/10.1145/22952.22956.
[19] Richard T. Snodgrass, Santiago Gomez, and L. Edwin McKenzie Jr. Aggregates in the temporal query language tquel. Knowledge and Data Engineering, IEEE Transactions on, 5(5):826–842, 1993.
[20] Michael Stonebraker, Lawrence A. Rowe, and Michael Hirohama. The implementation of postgres. IEEE Transactions on Knowledge & Data Engineering, (1):125–142, 1990.
[21] Antonin Guttman. R-trees: A dynamic index structure for spatial searching. In SIGMOD, pages 47–57, 1984.
[22] Using Oracle Flashback Technology. https://docs.oracle.com/cd/B28359_01/appdev.111/b28424/adfns_flashback.htm. Accessed: May 04, 2016.
[23] Oracle Total Recall with Oracle Database 11g Release 2. http://www.oracle.com/technetwork/database/focus-areas/storage/total-recall-whitepaper-171749.pdf. Accessed: November 11, 2016.
[24] Peter Buneman, Sanjeev Khanna, Keishi Tajima, and Wang Chiew Tan. Archiving scientific data. ACM Trans. Database Syst., 29:2–42, 2004. doi: 10.1145/974750.974752. URL http://doi.acm.org/10.1145/974750.974752.
[25] Torben Bach Pedersen and Christian S. Jensen. Multidimensional database technology. Computer, 34(12):40–46, 2001.
[26] Heum-Geun Kang and Chin-Wan Chung. Exploiting versions for on-line data warehouse maintenance in molap servers. In Proceedings of the 28th International Conference on Very Large Data Bases, VLDB '02, pages 742–753. VLDB Endowment, 2002. URL http://dl.acm.org/citation.cfm?id=1287369.1287433.
[27] P. Baumann, A. Dehmel, P. Furtado, R. Ritsch, and N. Widmann. The multidimensional database system rasdaman. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, SIGMOD '98, pages 575–577, New York, NY, USA, 1998. ACM. ISBN 0-89791-995-5. doi: 10.1145/276304.276386. URL http://doi.acm.org/10.1145/276304.276386.
[28] Paul G. Brown. Overview of scidb: Large scale array storage, processing and analysis. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD '10, pages 963–968, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0032-2. doi: 10.1145/1807167.1807271. URL http://doi.acm.org/10.1145/1807167.1807271.
[29] Michael Stonebraker, Jacek Becla, David J. DeWitt, Kian-Tat Lim, David Maier, Oliver Ratzesberger, and Stanley B. Zdonik. Requirements for science data bases and scidb. In CIDR 2009, Fourth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2009, Online Proceedings, 2009. URL http://www-db.cs.wisc.edu/cidr/cidr2009/Paper_26.pdf.
[30] Adam Seering, Philippe Cudré-Mauroux, Samuel Madden, and Michael Stonebraker. Efficient versioning for scientific array databases. In ICDE, pages 1013–1024, 2012.
[31] Shu-Yao Chien, Vassilis J. Tsotras, Carlo Zaniolo, and Donghui Zhang. Efficient complex query support for multiversion xml documents. In Advances in Database Technology—EDBT 2002, pages 161–178. Springer, 2002.
[32] Anders Björnerstedt and Christer Hultén. Object-oriented concepts, databases, and applications, chapter Version Control in an Object-oriented Architecture, pages 451–485. ACM, New York, NY, USA, 1989. ISBN 0-201-14410-7. doi: 10.1145/63320.66513. URL http://doi.acm.org/10.1145/63320.66513.
[33] Marios Hadjieleftheriou, George Kollios, Vassilis J. Tsotras, and Dimitrios Gunopulos. Efficient indexing of spatiotemporal objects. In Advances in Database Technology—EDBT 2002, pages 251–268. Springer, 2002.
[34] U. Khurana and A. Deshpande. Efficient snapshot retrieval over historical graph data. In Data Engineering (ICDE), 2013 IEEE 29th International Conference on, pages 997–1008, April 2013. doi: 10.1109/ICDE.2013.6544892.
[35] Sean Quinlan and Sean Dorward. Venti: A new approach to archival storage. In FAST, volume 2, pages 89–101, 2002.
[36] Benjamin Zhu, Kai Li, and R. Hugo Patterson. Avoiding the disk bottleneck in the data domain deduplication file system. In FAST, volume 8, pages 1–14, 2008.
[37] F. Douglis and A. Iyengar. Application-specific delta-encoding via resemblance detection. In USENIX ATC, 2003.
[38] Zan Ouyang, Nasir Memon, Torsten Suel, and Dimitre Trendafilov. Cluster-based delta compression of a collection of files. In WISE, 2002.
[39] Mun Choon Chan and Thomas Y. C. Woo. Cache-based compaction: A new technique for optimizing web transfer. In INFOCOM, 1999.
[40] Randal C. Burns and Darrell D. E. Long. In-place reconstruction of delta compressed files. In Proceedings of the seventeenth annual ACM symposium on Principles of Distributed Computing, pages 267–275. ACM, 1998.
[41] Purushottam Kulkarni, Fred Douglis, Jason D. LaVoie, and John M. Tracey. Redundancy elimination within large collections of files. In USENIX ATC, 2004.
[42] Dan R. K. Ports and Kevin Grittner. Serializable snapshot isolation in postgresql. Proceedings of the VLDB Endowment, 5(12):1850–1861, 2012.
[43] Scott Chacon and Ben Straub. Pro Git Book. https://git-scm.com/book/en/v2. Accessed: May 04, 2016.
[44] Git-Annex. https://git-annex.branchable.com/. Accessed: May 08, 2016.
[45] Git Large File Storage. https://git-lfs.github.com/. Accessed: May 08, 2016.
[46] Talel Abdessalem and Geneviève Jomier. Vql: A query language for multiversion databases. In Database Programming Languages, pages 160–179. Springer, 1997.
[47] Temporal Tables. https://msdn.microsoft.com/en-us/library/dn935015.aspx. Accessed: May 04, 2016.
[48] Jennifer Widom. Trio: A system for integrated management of data, accuracy, and lineage. Technical Report, 2004.
[49] Anderson Marinho, Leonardo Murta, Cláudia Werner, Vanessa Braganholo, Sérgio Manuel Serra da Cruz, Eduardo Ogasawara, and Marta Mattoso. Provmanager: a provenance management system for scientific workflows. Concurrency and Computation: Practice and Experience, 24(13):1513–1530, 2012.
[50] Leonardo Murta, Vanessa Braganholo, Fernando Chirigati, David Koop, and Juliana Freire. noWorkflow: Capturing and analyzing provenance of scripts. In Provenance and Annotation of Data and Processes, pages 71–83. Springer, 2014.
[51] Jihie Kim, Ewa Deelman, Yolanda Gil, Gaurang Mehta, and Varun Ratnakar. Provenance trails in the Wings/Pegasus system. Concurrency and Computation: Practice and Experience, 20(5):587–597, 2008.
[52] Marcin Wylot, Philippe Cudre-Mauroux, and Paul Groth. Executing provenance-enabled queries over web data. In Proceedings of the 24th International Conference on World Wide Web, pages 1275–1285. International World Wide Web Conferences Steering Committee, 2015.
[53] Manish Kumar Anand, Shawn Bowers, Timothy Mcphillips, and Bertram Ludäscher. Exploring scientific workflow provenance using hybrid queries over nested data and lineage graphs. In Scientific and Statistical Database Management, pages 237–254. Springer, 2009.
[54] Manish Kumar Anand, Shawn Bowers, and Bertram Ludäscher. Techniques for efficiently querying scientific workflow provenance graphs. In EDBT, volume 10, pages 287–298, 2010.
[55] David A. Holland, Uri Jacob Braun, Diana Maclean, Kiran-Kumar Muniswamy-Reddy, and Margo I. Seltzer. Choosing a data model and query language for provenance. In The 2nd International Provenance and Annotation Workshop. Springer, 2008.
[56] Grigoris Karvounarakis, Zachary G. Ives, and Val Tannen. Querying data provenance. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pages 951–962. ACM, 2010.
[57] Shawn Bowers. Scientific workflow, provenance, and data modeling challenges and approaches. Journal on Data Semantics, 1(1):19–30, 2012.
[58] Susan B. Davidson and Juliana Freire. Provenance and scientific workflows: challenges and opportunities. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1345–1350. ACM, 2008.
[59] Michael Stonebraker, Gerald Held, Eugene Wong, and Peter Kreps. The design and implementation of INGRES. ACM Transactions on Database Systems (TODS), 1(3):189–222, 1976.
[60] Carlo Zaniolo. The database language GEM. In ACM Sigmod Record, volume 13(4), pages 207–218. ACM, 1983.
[61] Bruno Becker, Stephan Gschwind, Thomas Ohler, Bernhard Seeger, and Peter Widmayer. An asymptotically optimal multiversion b-tree. The VLDB Journal—The International Journal on Very Large Data Bases, 5(4):264–275, 1996.
[62] Christian Plattner, Andreas Wapf, and Gustavo Alonso. Searching in time. In Proceedings of the 2006 ACM SIGMOD international conference on Management of data, pages 754–756. ACM, 2006.
[63] Betty Salzberg and Vassilis J. Tsotras. Comparison of access methods for time-evolving data. ACM Comput. Surv., 31(2):158–221, June 1999. ISSN 0360-0300. doi: 10.1145/319806.319816. URL http://doi.acm.org/10.1145/319806.319816.
[64] Khaled Jouini and Geneviève Jomier. Indexing multiversion databases. In Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management, CIKM '07, pages 915–918, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-803-9. doi: 10.1145/1321440.1321574. URL http://doi.acm.org/10.1145/1321440.1321574.
[65] Emad Soroush and Magdalena Balazinska. Time travel in a scientific array database. In ICDE, pages 98–109, 2013.
[66] Jeffrey C. Mogul, Fred Douglis, Anja Feldmann, and Balachander Krishnamurthy. Potential benefits of delta encoding and data compression for http. In SIGCOMM, pages 181–194, 1997.
[67] Josh MacDonald. File system support for delta compression. 2000.
[68] Anant P. Bhardwaj, Souvik Bhattacherjee, Amit Chavan, Amol Deshpande, Aaron Elmore, Samuel Madden, and Aditya Parameswaran. DataHub: Collaborative Data Science & Dataset Version Management at Scale. In CIDR, 2015.
[69] Shahram Ghandeharizadeh, Richard Hull, and Dean Jacobs. Heraclitus: Elevating deltas to be first-class citizens in a database programming language. ACM Trans. Database Syst., 21(3):370–426, 1996. ISSN 0362-5915. doi: 10.1145/232753.232801. URL http://doi.acm.org/10.1145/232753.232801.
[70] Timothy Griffin and Richard Hull. A framework for implementing hypothetical queries. In SIGMOD, pages 231–242, 1997.
[71] Nicholas E. Taylor and Zachary G. Ives. Reconciling while tolerating disagreement in collaborative data sharing. In SIGMOD, pages 13–24, 2006. ISBN 1-59593-434-0. doi: 10.1145/1142473.1142476. URL http://doi.acm.org/10.1145/1142473.1142476.
[72] Todd J. Green, Grigoris Karvounarakis, Zachary G. Ives, and Val Tannen. Update exchange with mappings and provenance. In PVLDB, pages 675–686, 2007. ISBN 978-1-59593-649-3. URL http://dl.acm.org/citation.cfm?id=1325851.1325929.
[73] Sanjay Agrawal, Surajit Chaudhuri, and Vivek R. Narasayya. Automated Selection of Materialized Views and Indexes in SQL Databases. In PVLDB, pages 496–505, 2000. URL http://dl.acm.org/citation.cfm?id=645926.671701.
[74] Zohreh Asgharzadeh Talebi, Rada Chirkova, Yahya Fathi, and Matthias Stallmann. Exact and inexact methods for selecting views and indexes for OLAP performance improvement. In EDBT, pages 311–322, 2008. doi: 10.1145/1353343.1353383. URL http://doi.acm.org/10.1145/1353343.1353383.
[75] Alon Y. Halevy. Answering queries using views: A survey. The VLDB Journal, 10(4):270–294, 2001. ISSN 0949-877X. doi: 10.1007/s007780100054. URL http://dx.doi.org/10.1007/s007780100054.
[76] Jonathan Goldstein and Per-Åke Larson. Optimizing queries using materialized views: A practical, scalable solution. In SIGMOD, pages 331–342, 2001. ISBN 1-58113-332-4. doi: 10.1145/375663.375706. URL http://doi.acm.org/10.1145/375663.375706.
[77] Zvi Galil and Giuseppe F. Italiano. Data structures and algorithms for disjoint set union problems. ACM Comput. Surv., pages 319–344, 1991. ISSN 0360-0300. doi: 10.1145/116873.116878. URL http://doi.acm.org/10.1145/116873.116878.
[78] Frank K. Hwang and Shen Lin. A simple algorithm for merging two disjoint linearly ordered sets. SIAM Journal on Computing, 1(1):31–39, 1972.
[79] Erik Demaine, Alejandro López-Ortiz, and J. Munro. Adaptive Set Intersections, Unions, and Differences. In SODA, pages 743–752, 2000. URL http://dl.acm.org/citation.cfm?id=338219.338634.
[80] Ricardo Baeza-Yates. A fast set intersection algorithm for sorted sequences. In Annual Symposium on Combinatorial Pattern Matching, pages 400–408. Springer, 2004.
[81] Peter Sanders and Frederik Transier. Intersection in integer inverted indices. In Proceedings of the Meeting on Algorithm Engineering & Experiments, pages 71–83, 2007.
[82] Bolin Ding and Arnd Christian König. Fast set intersection in memory. PVLDB, pages 255–266, 2011. ISSN 2150-8097. doi: 10.14778/1938545.1938550. URL http://dx.doi.org/10.14778/1938545.1938550.
[83] Erik D. Demaine, Alejandro López-Ortiz, and J. Ian Munro. Experiments on adaptive set intersections for text retrieval systems. In ALENEX, 2001.
[84] Jérémy Barbay, Alejandro López-Ortiz, Tyler Lu, and Alejandro Salinger. An experimental investigation of set intersection algorithms for text searching. ACM Journal of Experimental Algorithmics, 2010.
[85] Philip Bille, Anna Pagh, and Rasmus Pagh. Fast evaluation of union-intersection expressions. In ISAAC, pages 739–750, 2007.
[86] Taesung Lee, Jin-Woo Park, Sanghoon Lee, Seung-won Hwang, Sameh Elnikety, and Yuxiong He. Processing and optimizing main memory spatial-keyword queries. PVLDB, 9(3):132–143, 2015.
[87] Robert Endre Tarjan. Finding optimum branchings. Networks, 7(1):25–35, 1977.
[88] Guy Kortsarz and David Peleg. Approximating shallow-light trees. In SODA, 1997.
[89] Judit Bar-Ilan, Guy Kortsarz, and David Peleg. Generalized submodular cover problems and applications. Theoretical Computer Science, 250(1):179–200, 2001.
[90] Moses Charikar, Chandra Chekuri, To-yat Cheung, Zuo Dai, Ashish Goel, Sudipto Guha, and Ming Li. Approximation algorithms for directed steiner problems. Journal of Algorithms, 33(1):73–91, 1999.
[91] Samir Khuller, Balaji Raghavachari, and Neal Young. Balancing minimum spanning trees and shortest-path trees. Algorithmica, 14(4):305–321, 1995.
[92] https://www.kernel.org/pub/software/scm/git/docs/technical/pack-heuristics.txt.
[93] http://edmonds-alg.sourceforge.net/.
[94] http://svn.apache.org/repos/asf/subversion/trunk/notes/fsfs.
[95] http://svnbook.red-bean.com/en/1.8/svn.reposadmin.maint.html#svn.reposadmin.maint.diskspace.fsfspacking.
[96] http://svn.apache.org/repos/asf/subversion/trunk/notes/fsfs-improvements.txt.
[97] http://www.xmailserver.org/xdiff-lib.html.
[98] Peter T. Wood. Query languages for graph databases. SIGMOD Rec., 41(1):50–60, April 2012. ISSN 0163-5808.
[99] Eugene W. Myers. An O(ND) difference algorithm and its variations. Algorithmica, pages 251–266, 1986.
[100] Erik D. Demaine, Shay Mozes, Benjamin Rossman, and Oren Weimann. An optimal decomposition algorithm for tree edit distance. ACM Trans. Algorithms, 6(1):2:1–2:19, December 2009. ISSN 1549-6325. doi: 10.1145/1644015.1644017. URL http://doi.acm.org/10.1145/1644015.1644017.
[101] Fred Douglis and Arun Iyengar. Application-specific delta-encoding via resemblance detection. In USENIX Annual Technical Conference, General Track, pages 113–126, 2003.
[102] Git Packfiles. https://git-scm.com/book/en/v2/Git-Internals-Packfiles. Accessed: February 15, 2017.
[103] Richard M. Karp. Reducibility among combinatorial problems. In Complexity of computer computations. Springer, 1972.
[104] Jérémy Barbay and Claire Kenyon. Deterministic algorithm for the t-threshold set problem. In Algorithms and Computation, pages 575–584. Springer, 2003.
[105] Samy Chambi, Daniel Lemire, Owen Kaser, and Robert Godin. Better bitmap performance with roaring bitmaps. Software: Practice and Experience, 2015.
[106] Samir Khuller, Balaji Raghavachari, and Neal Young. Balancing minimum spanning trees and shortest-path trees. Algorithmica, 14(4):305–321, 1995.
[107] Miller. http://johnkerl.org/miller/doc/. Accessed: April 15, 2018.
[108] csvkit. https://csvkit.readthedocs.io/en/1.0.3/. Accessed: April 15, 2018.
[109] textql. https://github.com/dinedal/textql. Accessed: April 15, 2018.
[110] http://parquet.apache.org/documentation/latest/.
[111] Anastassia Ailamaki, David J. DeWitt, Mark D. Hill, and Marios Skounakis. Weaving relations for cache performance. In VLDB 2001, Proceedings of 27th International Conference on Very Large Data Bases, September 11-14, 2001, Roma, Italy, pages 169–180, 2001. URL http://www.vldb.org/conf/2001/P169.pdf.
[112] Marcel Kornacker, Alexander Behm, Victor Bittorf, Taras Bobrovytsky, Casey Ching, Alan Choi, Justin Erickson, Martin Grund, Daniel Hecht, Matthew Jacobs, Ishaan Joshi, Lenni Kuff, Dileep Kumar, Alex Leblang, Nong Li, Ippokratis Pandis, Henry Robinson, David Rorke, Silvius Rus, John Russell, Dimitris Tsirogiannis, Skye Wanderman-Milne, and Michael Yoder. Impala: A modern, open-source SQL engine for hadoop. In CIDR 2015, Seventh Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2015, Online Proceedings, 2015. URL http://cidrdb.org/cidr2015/Papers/CIDR15_Paper28.pdf.
[113] Ravi Jampani, Fei Xu, Mingxi Wu, Luis Leopoldo Perez, Chris Jermaine, and Peter J. Haas. The monte carlo database system: Stochastic analysis close to the data. ACM Trans. Database Syst., 36(3):18:1–18:41, 2011. doi: 10.1145/2000824.2000828. URL http://doi.acm.org/10.1145/2000824.2000828.
[114] Spyros Blanas and Jignesh M. Patel. Memory footprint matters: efficient equi-join algorithms for main memory data processing. In ACM Symposium on Cloud Computing, SOCC '13, Santa Clara, CA, USA, October 1-3, 2013, pages 19:1–19:16, 2013. doi: 10.1145/2523616.2523626. URL http://doi.acm.org/10.1145/2523616.2523626.