ABSTRACT

Title of dissertation: PROVENANCE MANAGEMENT FOR COLLABORATIVE DATA SCIENCE WORKFLOWS
Hui Miao, Doctor of Philosophy, 2018
Dissertation directed by: Professor Amol Deshpande, Department of Computer Science

Collaborative data science activities are becoming pervasive in a variety of communities, and are often conducted in teams, with people of different expertise performing back-and-forth modeling and analysis on time-evolving datasets. Current data science systems mainly focus on specific steps in the process such as training machine learning models, scaling to large data volumes, or serving the data or the models, while the issues of end-to-end data science lifecycle management are largely ignored. Such issues include, for example, tracking provenance and derivation history of models, identifying data processing pipelines and keeping track of their evolution, analyzing unexpected behaviors and monitoring the project health, and providing the ability to reason about specific analysis results. We address these challenges by ingesting, managing, and analyzing rich provenance information generated during data science projects, and using it to enable users to easily publish, share, and discover data analytics projects.

We first describe the design of our unified provenance and metadata management system, called ProvDB. We adopt a schema-later approach and use a flexible graph-based provenance representation model that combines the core concepts in version control and provenance management. We describe several ingestion mechanisms for this provenance model and show how heterogeneous data analysis environments can be served with natural extensions to this framework. We also describe a set of novel features of the system including graph queries for retrospective provenance, fileviews for data transformations, introspective queries for debugging, and continuous monitoring queries for anomaly detection.

We then illustrate how to support the deep learning modeling lifecycle via the extensibility mechanism in ProvDB. We describe techniques to compactly store and efficiently query the rich set of data artifacts generated during the deep learning modeling lifecycle. We also describe a high-level domain specific language that helps raise the abstraction level during model exploration and enumeration and accelerate the modeling process.

Lastly, we propose graph query operators and develop efficient evaluation techniques to address the verbose and evolving nature of such provenance graphs. First, we introduce a graph segmentation operator, which queries the provenance of a collection of user-given vertices (e.g., versioned files, author names) via flexible boundary criteria. Second, we propose a graph summarization operator to aggregate the results of multiple segmentation operations, and allow multi-resolution interaction with the aggregation result to understand similar and abnormal behaviors in those segments.

PROVENANCE MANAGEMENT FOR COLLABORATIVE DATA SCIENCE WORKFLOWS

by Hui Miao

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2018

Advisory Committee: Professor Amol Deshpande, Chair/Advisor; Professor Larry S. Davis; Professor Alan Sussman; Professor Richard Marciano; Professor Héctor Corrada Bravo

© Copyright by Hui Miao 2018

Acknowledgments

It has been a long journey to complete and defend this dissertation.
I owe my gratitude to all the people who helped me and walked beside me on this fulfilling journey.

First and foremost, I would like to express my greatest appreciation to my advisor, Professor Amol Deshpande, a great advisor and mentor, who trained me as a researcher with the ability to identify important systems issues and work on challenging problems. Amol gave me the freedom to explore new topics, propose ideas, and pick the ones that I was interested in even when they carried high risk, while trusting me when I entered new fields and helping me stay on the right track that led to concrete results. He taught me how to elaborate problem formulations and showed me how to present ideas to the research community. I am very grateful for his invaluable time spent on numerous discussions and on manuscript editing, even during late nights and weekends. I feel very fortunate to have worked with him as my advisor.

I would also like to thank the other committee members, Professor Larry S. Davis, Professor Alan Sussman, Professor Richard Marciano and Professor Héctor Corrada Bravo, for their insightful comments and critiques on this dissertation. My special thanks go to Larry for supporting the collaboration with his computer vision lab on modeling lifecycle and repository discovery. The discussions with computer vision practitioners were very fruitful and led to parts of this dissertation.

I would like to thank Professor Lise Getoor, who was my co-advisor during her time at the University of Maryland. I was fortunate to be mentored by her and to be part of her LINQS group. I am very grateful to her for her invaluable time and constant support during the first three years of my graduate study. The projects and collaboration in LINQS gave me an in-depth understanding of large-scale inference in relational learning and knowledge graph construction.

I had a wonderful summer internship experience at Google Research during my graduate study. I would like to thank my mentor Christopher Olston, and the other team members, Alon Halevy, Steven Whang, Neoklis Polyzotis, Sunita Sarawagi, Natalya Noy, Sudip Roy and Xiao Yu. Christopher taught me to select impactful problems, avoid premature optimizations and deliver high quality projects. The GOODS experience on searching enterprise datasets broadened my vision and influenced the area that I worked on later in the dissertation.

Looking back, the time spent in Maryland was enriching and enjoyable due to many friends and collaborators. I wish to thank the many members of the database group, Souvik Bhattacherjee, Amit Chavan, Udayan Khurana, K. Ashwin Kumar, Jayanta Mondal, Walaa Eldin Moustafa, Abdul Quamar, Theodoros Rekatsinas, Virinchi Srinivas, and Konstantinos Xirogiannopoulos, for all the joyful time spent and interesting discussions had together. I would also like to thank the many friends and collaborators I worked with in A.V. Williams: Stephen Bach, Ruofei Du, Shobeir Fakhraei, Shi Feng, Peixin Gao, Hua He, Bert Huang, Ang Li, Hao Li, Xiangyang Liu, Ben London, Alex Memory, Shangfu Peng, Arti Ramesh, Jay Pujara, Ramakrishna Padmanabhan, Sheng Yang, and Chen Zhao. I have immensely enjoyed interacting and collaborating with them throughout my graduate study.

I owe a great debt of gratitude to my family, Xiaolin Xi, Xiaoqin Zhan and Xiran Miao. This dissertation would not have been finished without their constant love, trust and support during the journey. Words cannot express my gratitude. All of them have sacrificed a lot in order to let me pursue my interests.
Portions of this work were supported by NSF grants 1513972 and 1513443. iv Table of Contents Acknowledgements ii List of Tables ix List of Figures x 1 Introduction 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Challenges & Opportunities . . . . . . . . . . . . . . . . . . . . . . . 3 1.2.1 Provenance Representation . . . . . . . . . . . . . . . . . . . . 4 1.2.2 Ingestion Mechanism . . . . . . . . . . . . . . . . . . . . . . . 6 1.2.3 Designing Query Facilities . . . . . . . . . . . . . . . . . . . . 7 1.2.4 System Efficiency Issues . . . . . . . . . . . . . . . . . . . . . 9 1.2.5 Discovering Repositories & Learning From Others . . . . . . . 11 1.3 Approach & Organization . . . . . . . . . . . . . . . . . . . . . . . . 12 2 Related Work 16 2.1 Provenance Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.1.1 Workflow Provenance . . . . . . . . . . . . . . . . . . . . . . . 16 2.1.2 Data Provenance . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.1.3 Our Contribution . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.2 Collaborative Data Science Systems . . . . . . . . . . . . . . . . . . . 20 2.2.1 Data Management for Collaborative Analytics . . . . . . . . . 20 2.2.2 Version Control Systems for Data Science . . . . . . . . . . . 22 2.2.3 Our Contribution . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.3 Machine Learning Systems . . . . . . . . . . . . . . . . . . . . . . . . 24 2.3.1 Modern Machine Learning Software Systems . . . . . . . . . . 24 2.3.2 Modeling Pipeline & Lifecycle Management Systems . . . . . 25 2.3.3 Our Contribution . . . . . . . . . . . . . . . . . . . . . . . . . 26 v 3 ProvDB System Design for Collaborative Analytics 28 3.1 Unified Data Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.1.1 Conceptual Data Model . . . . . . . . . . . . . . . . . . . . . 29 3.1.2 Physical Property Graph Data Model . . . . . . . . . . . . . . 32 3.2 Provenance Ingestion . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.2.1 Shell command-based Ingestion Framework . . . . . . . . . . . 34 3.2.2 User Annotations . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.2.3 File Views . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.2.4 Extension Modules . . . . . . . . . . . . . . . . . . . . . . . . 37 3.3 Query Facilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.3.1 Queries over Version/Workflow Graph and Properties . . . . . 39 3.3.2 Reasoning about Pipelines . . . . . . . . . . . . . . . . . . . . 40 3.3.3 Introspective Diffs: Shallow vs Deep “Diff” Queries . . . . . . 40 3.3.4 Continuous Monitoring or Anomaly Detection . . . . . . . . . 41 3.4 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4 ModelHub: Managing Deep Learning Projects on ProvDB 46 4.1 Motivation & Approach . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.1.1 DNN Modeling Lifecycle and Challenges . . . . . . . . . . . . 47 4.1.2 ModelHub Approach . . . . . . . . . . . . . . . . . . . . . . . 50 4.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.2.1 Deep Neural Networks . . . . . . . . . . . . . . . . . . . . . . 52 4.2.2 Modeling Data Artifacts . . . . . . . . . . . . . . . . . . . . . 55 4.2.3 Model Adjustment . . . . . . . . . . . . . . . . . . . . . . . . 56 4.2.4 Model Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . 
56 4.3 ModelHub System Overview . . . . . . . . . . . . . . . . . . . . . . . 57 4.3.1 Data Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.3.1.1 DNN Model . . . . . . . . . . . . . . . . . . . . . . . 59 4.3.1.2 VCS Data Model . . . . . . . . . . . . . . . . . . . . 60 4.3.2 Query Facilities . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.3.2.1 Model Exploration Queries . . . . . . . . . . . . . . 61 4.3.2.2 Model Enumeration Queries . . . . . . . . . . . . . . 63 4.3.3 ModelHub Implementation . . . . . . . . . . . . . . . . . . . . 66 4.4 Parameter archival storage (PAS) . . . . . . . . . . . . . . . . . . . . 67 4.4.1 Weight Parameters & Query Types of Interest . . . . . . . . . 67 4.4.2 Parameters As Segmented Float Matrices . . . . . . . . . . . . 70 4.4.2.1 Float Data Type Schemes . . . . . . . . . . . . . . . 70 4.4.2.2 Bytewise Segmentation for Float Matrices . . . . . . 72 4.4.2.3 Delta Encoding Across Snapshots . . . . . . . . . . . 73 4.4.3 Optimal Parameter Archival Storage . . . . . . . . . . . . . . 73 4.4.3.1 Constrained Spanning Tree Problem . . . . . . . . . 80 4.4.3.2 PAS-MT . . . . . . . . . . . . . . . . . . . . . . . . 81 4.4.3.3 PAS-PT . . . . . . . . . . . . . . . . . . . . . . . . . 83 4.4.4 Model Evaluation Scheme in PAS . . . . . . . . . . . . . . . . 86 vi 4.5 Evaluation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 4.5.1 Dataset Description . . . . . . . . . . . . . . . . . . . . . . . . 89 4.5.1.1 Real World Dataset . . . . . . . . . . . . . . . . . . 89 4.5.1.2 Synthetic Datasets . . . . . . . . . . . . . . . . . . . 90 4.5.2 Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . 91 4.5.2.1 Float Representation & Accuracy . . . . . . . . . . . 91 4.5.2.2 Delta Encoding & Compression Ratio Gain . . . . . 92 4.5.2.3 Optimal Parameter Archival Storage . . . . . . . . . 94 4.5.2.4 Retrieval Performance . . . . . . . . . . . . . . . . . 96 4.5.2.5 Progressive Query Evaluation . . . . . . . . . . . . . 97 4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 5 Querying Collaborative Analytics Lifecycle Provenance 99 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 5.2 Challenges & Desiderata . . . . . . . . . . . . . . . . . . . . . . . . . 103 5.2.1 Motivating Example . . . . . . . . . . . . . . . . . . . . . . . 103 5.2.2 Provenance Model . . . . . . . . . . . . . . . . . . . . . . . . 105 5.2.3 Provenance Queries & Challenges . . . . . . . . . . . . . . . . 111 5.2.3.1 Segmentation . . . . . . . . . . . . . . . . . . . . . . 112 5.2.3.2 Summarization . . . . . . . . . . . . . . . . . . . . . 114 5.3 Segmentation Operation . . . . . . . . . . . . . . . . . . . . . . . . . 116 5.3.1 Semantics of Segmentation (PgSeg) . . . . . . . . . . . . . . 117 5.3.1.1 Source (Vsrc) and Destination Entities (Vdst) . . . . . 118 5.3.1.2 Induced Vertices Vind . . . . . . . . . . . . . . . . . . 119 5.3.1.3 Boundary Criteria B . . . . . . . . . . . . . . . . . . 124 5.3.1.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . 125 5.3.2 Query Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 126 5.3.2.1 Overview: Two-Step Approach . . . . . . . . . . . . 126 5.3.2.2 Induce Step . . . . . . . . . . . . . . . . . . . . . . . 126 5.3.2.3 Adjust Step . . . . . . . . . . . . . . . . . . . . . . . 135 5.3.2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . 135 5.4 Summarization Operation . . . . . . . . . . . . . . . . . . . . . . . . 
136 5.4.1 Semantics of Summarization (PgSum) . . . . . . . . . . . . . 137 5.4.1.1 Property Aggregations & Provenance Types . . . . . 137 5.4.1.2 Provenance Summary Graph (Psg) . . . . . . . . . . 140 5.4.1.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . 142 5.4.2 Query Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 142 5.5 System Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 145 5.6 Evaluation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 5.6.1 Dataset Description . . . . . . . . . . . . . . . . . . . . . . . . 148 5.6.1.1 Provenance Graphs & PgSeg Queries . . . . . . . . 148 5.6.1.2 Similar Segments & PgSum Queries . . . . . . . . . 150 5.6.2 Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . 151 5.6.2.1 Segmentation Operator . . . . . . . . . . . . . . . . 151 5.6.2.2 Summarization Operator . . . . . . . . . . . . . . . . 156 vii 5.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 6 Discovering Hosted Analytics Projects 162 6.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 6.1.1 Model Discovery . . . . . . . . . . . . . . . . . . . . . . . . . 164 6.1.2 Enriched Project Repository . . . . . . . . . . . . . . . . . . . 166 6.2 Model Discovery System Vision . . . . . . . . . . . . . . . . . . . . . 167 6.2.1 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . 168 6.2.2 Compare & Rank Models . . . . . . . . . . . . . . . . . . . . 171 6.2.3 Process & Ensemble Returned Models . . . . . . . . . . . . . 175 6.3 Evaluation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178 6.3.1 Dataset Description . . . . . . . . . . . . . . . . . . . . . . . . 178 6.3.2 Evaluation Result . . . . . . . . . . . . . . . . . . . . . . . . . 179 6.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182 7 Conclusions 183 Bibliography 186 viii List of Tables 3.1 ProvDB Ingestion Methods . . . . . . . . . . . . . . . . . . . . . . . 34 4.1 Popular CNN Models for Object Recognition . . . . . . . . . . . . . . 54 4.2 A list of key dlv utilities. . . . . . . . . . . . . . . . . . . . . . . . . 58 4.3 Float Representation Scheme Trade-offs . . . . . . . . . . . . . . . . . 71 4.4 Recreation Cost of a Snapshot si Cr(PV , si) in a plan PV . . . . . . . 77 4.5 Real World DNN Models used in the Experiment Study . . . . . . . . 90 4.6 Delta Performance For Lossless & Lossy Schemes, 32-bits . . . . . . . 94 4.7 Recreation Performance Comparison of Storage Plans . . . . . . . . . 96 5.1 Summary of Provenance Graph Datasets (PG) . . . . . . . . . . . . . 149 ix List of Figures 1.1 ProvDB High-level System Architecture . . . . . . . . . . . . . . . . 13 3.1 Conceptual Data Model . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.2 Example Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.3 Provenance Property Graph . . . . . . . . . . . . . . . . . . . . . . . 33 3.4 Diff Artifacts (Result logging files for two deep neural networks) . . . 43 3.5 Cypher Query to Find Related Changes via Derivations . . . . . . . . 44 4.1 Deep Learning Modeling Lifecycle . . . . . . . . . . . . . . . . . . . . 47 4.2 Anatomy of A DNN Model (LeNet) . . . . . . . . . . . . . . . . . . . 53 4.3 ModelHub System Architecture . . . . . . . . . . . . . . . . . . . . 57 4.4 Relationships of Model Versions and Snapshots . . . . . . . . . . . . 68 4.5 A Matrix Storage Graph Example . . . . . . . . . . . . . . . . . . . . 
75 4.6 Optimal Matrix Storage Plan without Constraints . . . . . . . . . . . 75 4.7 Optimal Matrix Storage Plan Constrained by C ψir (s1) ≤ 3∧C ψir (s2) ≤ 6 76 4.8 Compression-Accuracy Tradeoff for Float Representation Schemes . . 92 4.9 Compression Performance for Different Delta Schemes & Models . . . 93 4.10 Comparing PAS Archival Storage Algorithms for SD . . . . . . . . . 95 4.11 Progressive Evaluation Query Processing Using High-Order Bytes . . 97 5.1 A Data Analytics Project Lifecycle Example & Associated Provenance104 5.2 Illustration of the W3C PROV Data Model . . . . . . . . . . . . . . 106 5.3 Provenance Graph for the Lifecycle Example (Some edges and prop- erties are not shown) . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 5.4 Segmentation Query Examples . . . . . . . . . . . . . . . . . . . . . . 113 5.5 Summarization Query Examples . . . . . . . . . . . . . . . . . . . . . 115 5.6 SimProv Normal Form, SimProv→ Re. Lg ⊆ A×E ; Rg, La, Ra ⊆ A×A; Lu ⊆ E ×A; Ru, Le, Re, Qd ⊆ E × E . . . . . . . . . . . . 129 5.7 Proposed SimProv Rewriting, SimProv → Ee. Aa ⊆ A × A; Ee ⊆ E × E . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 5.8 Position of Proposed Graph Query Operators in ProvDB . . . . . . 145 5.9 Comparing Cyper Query 5.1, CflrB, SimProvAlg and SimProvTst Efficiency by Varying Graph Size N . . . . . . . . . . . . . . . . . . . 152 5.10 Evaluate Performance for Different Project Stereotypes by Varying λi 154 x 5.11 Impact of Input Size λi on CflrB, SimProvAlg and SimProvTst 154 5.12 Effectiveness of SimProvAlg and SimProvTst with Early Stopping155 5.13 Studying PgSum on Segments at Different Stages of a Project by Varying Concentration α . . . . . . . . . . . . . . . . . . . . . . . . . 157 5.14 Studying PgSum on Segments with Different Complexity by Varying Activity Types k . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 5.15 Studying PgSum on Segments with Different Size by Varying Num- ber of Activities n . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 5.16 Studying PgSum Effectiveness by Varying |S| . . . . . . . . . . . . . 159 6.1 Growth of Github Repositories using iPyhont Notebooks (based on Github Weekly Dump Dataset on Google Big Query) . . . . . . . . . 164 6.2 Enriched Information for Data Science Projects . . . . . . . . . . . . 166 6.3 Overview of ModelHub Discovery Pipeline . . . . . . . . . . . . . . . 170 6.4 Finetuned VGG Models . . . . . . . . . . . . . . . . . . . . . . . . . 180 6.5 Correlation Between Aligned Distance and Prediction Results . . . . 181 xi Chapter 1: Introduction 1.1 Motivation Collaborative data science activities are becoming pervasive in a variety of communities, and are often conducted in teams, with people of different exper- tise performing back-and-forth modeling and analysis on time-evolving datasets. Current data science systems mainly focus on specific steps in the process such as supporting and accelerating training machine learning models in different data processing frameworks and management systems, scaling to large data volumes by exploiting distributed optimization schemes and system architectures, or serving the data or the models to satisfy demanding service level agreements, while the issues of end-to-end data science lifecycle management are largely ignored. 
Such issues include, for example, tracking provenance and derivation history of models, identifying data processing pipelines and keeping track of their evolution, analyzing unexpected behaviors and monitoring the project health, and providing the ability to reason about specific analysis results. Addressing these issues in a collaborative data science environment is especially challenging because the process of collaborative data science is often ad hoc, typically featuring highly unstructured datasets, an amalgamation of different tools and techniques, significant back-and-forth among the members of a team, and trial-and-error to identify the right analysis tools, algorithms, models, and parameters.

More specifically, many modern data analytics activities are ad hoc and reactive to emerging day-to-day problems rather than locking into a stable platform on well-established business models or scientific workflows. There is no easy way to capture and reason about ad hoc data science pipelines, many of which are spread across a collection of cleaning and modeling efforts in analysis scripts and the back-and-forth steps in transient command line histories. Metadata or provenance information about how datasets were generated, including the programs or scripts used for generating them and/or the values of any crucial parameters, is often lost. Similarly, it is hard to keep track of any dependencies between the datasets. As most datasets and analysis scripts evolve over time, there is also a need to keep track of their versions over time; using version control systems like git can help to some extent, but those do not provide sufficiently rich introspection capabilities to deal with unexpected situations. Moreover, even in a managed version control repository, the learning curve for new team members is high, and understanding and reproducing others' work is challenging; e.g., many issues may occur and one often has to contact the author when reproducing results using an open source repository, as the derivation history is not present.

We argue that provenance and metadata about the versioned artifacts and derivations during the analysis process are the key to solving the challenges that are pervasive in ad hoc data analytics activities. Due to the lack of platform support for capturing and analyzing such provenance and metadata information, data scientists are required to manually keep track of, and act upon, such information, which is not only tedious, but error-prone. For example, data scientists must manually keep track of which derived datasets need to be updated when a source dataset changes. They often use spreadsheets to list which parameter combinations have been tried out during the development of a machine learning model. Debugging becomes significantly harder; e.g., a small change in an analysis script may have a significant impact on the final result, but identifying that change may be non-trivial, especially in a collaborative setting. It is similarly challenging to identify which input records are most relevant to a particular output record. Repeatability can often be very difficult, even for the same researcher, because of an amalgamation of constantly evolving tools and datasets being used, and because of a lack of easy-to-use mechanisms to keep track of the parameter values used during analysis or modeling.
Critical errors that cannot be easily identified may be hidden in the mess of datasets and analysis scripts; e.g., a data scientist may erroneously be training on the test dataset due to an inadvertent mistake while creating the testing and training datasets.

1.2 Challenges & Opportunities

In this dissertation, we explore the challenges and the opportunities in unified management of versioned artifacts and all kinds of provenance and metadata about collaborative data analytics activities that get generated during the analysis process; this includes the information about the different versions of the datasets that are created, derivation metadata about how derived artifacts were generated, fine-granularity provenance information, dependencies across datasets, and so on. Our hypothesis is that by combining all this information into one place, and making it easy to analyze or query this information, we can enable a rich set of functionality that can simplify the lives of data scientists, make it easier to identify and eliminate errors, decrease the time to obtain actionable insights, and accelerate the process of moving on to other data analytics activities. This is hardly a new observation, and there has been much prior work on capturing and analyzing provenance in a variety of communities. However, there is still a lack of practical systems that treat different kinds of provenance and metadata information in a unified manner, and that can be easily integrated in the workflow of a data science project. At the same time, the widespread use of data science has brought to the forefront several crucial challenges, such as ethics, transparency, and reproducibility, and we posit that fine-granularity provenance is key to addressing those challenges.

There are, however, several crucial systems requirements and conceptual challenges that must be addressed to fully exploit those opportunities.

1.2.1 Provenance Representation

It is hard to define a unified schema a priori for multiple types of information at various granularities to support very diverse analytics processes, as different users of different analytics workflows may wish to capture and analyze different types of such data. In a collaborative analytics project, such information includes:

• The raw datasets stored in files, which range from structured datasets (e.g., database files, CSV) to semi-structured (e.g., graphs, JSON, Jupyter notebooks) to highly unstructured (e.g., text datasets, audio, images);

• The derivation metadata that captures versioning information about how files within or across versions were created and the dependencies among them; a program or a script normally changes files and generates versions, but such programs and scripts themselves may have different versions;

• Record-level provenance information that is used to connect records across different versions, and structured results of running a data analysis pipeline or an experiment;

• The details about the work pipelines of a team, such as task assignments among the team members and project conventions about certain tasks, which often evolve during the process and exist implicitly in the team activities.

Existing solutions cannot represent all of this information. On one hand, there are data model standardization efforts for general-purpose provenance capture, but they do not treat project artifacts and their versions as first class citizens. On the other hand, there are popular version control systems, such as git.
Although they can manage the artifact versions and derivations, metadata and provenance about the files are out of their scope; even within the versions, what activities occurred and the derivation dependencies among files are not captured.

1.2.2 Ingestion Mechanism

Even if a data model can be designed, how to ingest the rich information with as little effort as possible from the team is particularly challenging.

In practice, collaborative analytics teams use an amalgamation of software libraries, tools and distributed systems. There are many important aspects of information about how a piece of data was derived that are typically lost by the tools. An ingestion mechanism should be able to capture the information in different circumstances, including:

• Practitioners often tune models with different hyperparameter settings to improve the results, and try different dataset transformations for the model input. The dataset transformations and the model configuration scripts are two pieces of continuously evolving artifacts, and it can be hard to keep track of the relationships between them;

• When analysis scripts take user-specified parameters as command line options, which are rarely carefully recorded, it is very difficult to repeat an analysis or to understand the origins of a specific dataset, even for the task owner herself;

• Fine-grained provenance information, i.e., keeping track of which input records generated which output records, is also usually difficult to capture without significant effort on the users' part.

Moreover, unless specifically required (e.g., for auditing), capturing metadata and provenance is far from the main project interests of a typical data science team. The captured information has very high long-term value, but less short-term value compared with working on tasks related to the main project objective. Both of these observations lead to the following requirements for the ingestion mechanism:

• We must be able to capture the information with minimal involvement from the users, otherwise the system is unlikely to be used in practice;

• The types of tools that the user can use in the pipeline should not be constrained, due to the rapidly evolving nature of modeling paradigms;

• We should have extensible provenance ingestion mechanisms, in order to adapt to different analytics processes and tool environments. For instance, a feature engineering modeling practice is very different from an end-to-end learning practice using deep neural networks, as they have different modeling artifacts, tuning best practices, and corresponding practitioners' work styles and conventions. Different extensions would be used for different workloads.

1.2.3 Designing Query Facilities

Perhaps conceptually the most difficult challenge is to develop query facilities and/or declarative abstractions that make it easy and powerful to exploit this data. First, there is a wide range of general queries and provenance analysis tasks that are of interest. For example, such queries include:

• Identifying or retrieving versions of project artifacts;

• Asking for the lineage of a particular artifact version or a record in that version;

• Comparing multiple lineages to show their differences;

• Finding all project artifact versions associated with specific metadata.

These queries could be written in a general query language (e.g., SQL, SPARQL, Cypher), once the information is captured in a corresponding structured data model that is well understood by the query issuers.
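As a purely illustrative sketch (not the query interface developed later in this dissertation), a lineage query of this kind could be posed against a property-graph store such as Neo4j roughly as follows; the Snapshot label, the INPUT_TO/GENERATED relationship types, and the property names are hypothetical placeholders for whatever schema the provenance is captured in.

```python
# Hypothetical lineage query against a Neo4j-style property graph; the labels,
# relationship types, and properties below are illustrative, not ProvDB's schema.
from neo4j import GraphDatabase

LINEAGE_QUERY = """
MATCH (s:Snapshot {path: $path, version: $version})
MATCH p = (a:Snapshot)-[:INPUT_TO|GENERATED*1..10]->(s)
RETURN DISTINCT a.path AS path, a.version AS version, length(p) AS depth
ORDER BY depth
"""

def upstream_lineage(uri, auth, path, version):
    """Return all snapshots that (transitively) contributed to path@version."""
    driver = GraphDatabase.driver(uri, auth=auth)
    try:
        with driver.session() as session:
            result = session.run(LINEAGE_QUERY, path=path, version=version)
            return [(r["path"], r["version"], r["depth"]) for r in result]
    finally:
        driver.close()

if __name__ == "__main__":
    rows = upstream_lineage("bolt://localhost:7687", ("neo4j", "password"),
                            path="results/accuracy.log", version="v12")
    for path, version, depth in rows:
        print(f"{'  ' * depth}{path} @ {version}")
```

Even this single ancestry question already requires an explicit variable-length path pattern, prior knowledge of the relationship types involved, and a sensible traversal bound.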
However, such queries often involve versions and path expressions which, written in a general query language, tend to be verbose and difficult to compose. More importantly, the assumption that the user is an expert with a full understanding of the underlying workflow is hardly true in practice, due to the complex derivations and the ad hoc and collaborative nature of such projects. Second, different development environments have their own modeling processes and the users may ask questions at different levels; for example, a modeler using deep learning may ask about network architectures and hyperparameters, while a feature engineering modeler may ask about selected features and regularization methods. At the same time, there are many potential advanced tasks that would be modeled over such data, including:

• Explanation queries where we are looking for the origins of a piece of data or a holistic view of a modeling artifact;

• Introspection queries that attempt to identify flaws in the data science process (e.g., p-value hacking);

• Continuous monitoring to identify issues during deployment of a data science pipeline (e.g., concept drifts where a learned model doesn't fit new data; changes to input data formats).

Last but not least, the analytics process is evolving, and there is often no workflow skeleton as seen in other provenance systems, such as scientific workflows and business processes. The data science workflow consists of repetitive trails which may have subtle differences, some of which include potential errors. The captured provenance and metadata tend to be verbose, and the users, who are team members or outsiders, often only have partial knowledge of the project artifacts and work steps. In that situation, designing an easy-to-use query facility to identify the precise and concise information needed by the users is very challenging; some examples include: a) identifying the most relevant part in the lineage by just specifying the artifacts that the user knows about; b) identifying the common and abnormal pipelines among a set of artifact derivations. We should be able to induce contributing artifacts that otherwise may not be specified by users having only partial understanding of a project.

1.2.4 System Efficiency Issues

We expect many efficiency and optimization issues will arise as the variety of the captured metadata and the volume of the captured data increase. This is especially expected to be an issue in the following aspects: a) storing the versions of different types of modeling artifacts, e.g., float numbers and matrices; b) answering graph queries on progressively evolving workflows in the collaborative team.

In the analytics workflow, due to many repetitive steps, there are often very similar artifacts that have a large storage footprint. One example is the dataset, which is often split into training and testing splits and created as copies; this not only wastes storage space, but also requires careful attention to make sure that the two resultant datasets are disjoint. This can be even trickier if k-fold cross-validation is being used. Another example is the artifact of trained model parameters. Popular modeling methods such as deep learning use billions of float parameters, resulting in model artifacts taking hundreds of GB to store. As the model is trained many times with different network architectures and hyperparameters, many model artifact versions are generated.
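To make the redundancy concrete, the following minimal sketch (an illustration only, not the archival storage scheme developed later in this dissertation) keeps the first weight snapshot in full and stores each subsequent snapshot as a compressed bytewise delta against its predecessor; because consecutive checkpoints of the same model usually differ only slightly, the deltas tend to compress much better than full copies.

```python
# Illustrative delta-based archiving of model weight snapshots (not ProvDB's
# or ModelHub's actual storage format): snapshot 0 is stored in full, later
# snapshots as compressed XOR deltas of their raw bytes vs. the previous one.
import zlib
import numpy as np

def archive(snapshots):
    """snapshots: list of float32 numpy arrays with identical shape."""
    blobs, prev = [], None
    for snap in snapshots:
        raw = np.ascontiguousarray(snap, dtype=np.float32).tobytes()
        if prev is None:
            payload = raw                                   # first snapshot in full
        else:                                               # bytewise delta vs. previous
            payload = np.bitwise_xor(np.frombuffer(raw, dtype=np.uint8),
                                     np.frombuffer(prev, dtype=np.uint8)).tobytes()
        blobs.append(zlib.compress(payload, 6))
        prev = raw
    return blobs

def restore(blobs, shape):
    """Recreating snapshot i replays deltas 0..i (the recreation-cost side of the tradeoff)."""
    out, acc = [], None
    for blob in blobs:
        raw = np.frombuffer(zlib.decompress(blob), dtype=np.uint8)
        acc = raw if acc is None else np.bitwise_xor(acc, raw)
        out.append(acc.view(np.float32).reshape(shape))
    return out
```

The flip side, visible in the restore function, is that recreating a later snapshot means replaying a chain of deltas, which is exactly the storage versus recreation-cost tradeoff that an archival storage plan has to balance.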
How to deal with the storage footprint of repetitive trails is a challenge when designing new archiving algorithms. Existing systems (e.g., git and GitHub) use a single versioning algorithm for all types of files and discourage versioning important analytics artifacts, such as datasets and trained weights.

Moreover, the derivations among artifacts form graphs, and the evolving nature of analytics workflows leads to verbose and large graphs. Query types of interest in workflows typically involve expensive path queries on the collected graphs, which still have limited support in modern graph databases. Designing efficient query evaluation algorithms for important query types of interest is a challenge and a very important issue.

1.2.5 Discovering Repositories & Learning From Others

In the "Big Data" era, datasets are collected blindly in different domains and industries by logging user and system behavior or labeling data via crowd-sourcing, and there is a large demand to conduct data science projects to find value in them. With more experienced practitioners and better system support, more and more data science projects are built and shared online. For example, there are tens of thousands of hosted Github projects using Jupyter notebooks; deep learning models have been shared by the community, starting with publishing models on authors' websites. Now training systems tend to have models hosted at a central portal for practitioners, e.g., Caffe Model Zoo.

However, in current hosting services, identifying a repository which is related to an analytics task is very difficult. For example, for a data science practitioner who is working on her own project and wishes to find references, it is not feasible to use available systems to answer queries such as:

• 'What projects used a dataset similar to mine on a classification problem? (e.g., US census data, 256x256 images)';

• 'Show me a set of diverse projects which explore this specific dataset or use this particular modeling method (e.g., random forest)';

• 'Find or ensemble a model from projects having high recall but reasonable accuracy on a given validation dataset'.

The main reason that the current systems cannot answer such queries is that they view an analytics repository as a collection of files, but not as the unified data and provenance model we discuss in this dissertation. By hosting the enriched repository with the provenance and derivations among project artifacts, we can open up the exciting opportunity to answer those discovery queries, facilitate learning from others, and accelerate the data science project lifecycle.

1.3 Approach & Organization

In this dissertation, we address the aforementioned challenges in a systematic manner. We take a system building approach and design a tool-agnostic system (ProvDB) to ingest, manage and query provenance information, and to allow the users to publish, share, and discover data analytics projects.

Collaborative systems are typically centered around the concept of versions, and supported by distributed version control systems (DVCS). Recent research extends version control systems (e.g., git) originally proposed for collaborative software development to support large datasets [1]. In these types of systems, a version is often immutable and any update to a version conceptually results in a new version with a different version ID.

ProvDB is a stand-alone system (Figure 1.1), designed to be used in conjunction with a version control system (DVCS) like git or DataHub [2].
ProvDB provides a workflow-aware version control command-line toolkit that integrates with the DVCS, which handles the actual version management tasks, including supporting the standard checkout, commit, merge, etc., functionality.

[Figure 1.1: ProvDB High-level System Architecture. Local ProvDB repository clients connect to a global ProvDB server; a provenance management layer (provenance ingestion framework, workflow-aware version control toolkit, GUI, templated provenance queries, domain specific extensions, analytical queries, model discovery, repository publishing and sharing services) sits on top of a data storage layer holding metadata (version graph, workflow graph, storage graph) and raw data (versioned datasets, models, scripts, results, files).]

Using ProvDB, we envision a number of local DVCS "repositories", each corresponding to a team of researchers collaborating closely together. The contents of each repository will typically be replicated across a number of machines as different researchers "check out" the repository contents to work on them. Since we leverage a concrete DVCS (e.g., git) for keeping these in sync, the repository contents are available as files for the users to operate upon; the users can run tools as before on those after checking them out, including distributed toolkits like Hadoop or Spark.

In ProvDB, broadly, the data maintained across the system can be categorized into:

• Raw data that the users can directly access and analyze, including the datasets, analysis scripts, and any derived artifacts such as trained models;

• Metadata or provenance information transparently maintained by the system, and used for answering queries over the versioning or provenance information.

To capture provenance information, ProvDB organizes metadata (version graphs, workflow graphs, storage graphs) and raw data (versioned datasets, models, scripts, result files, etc.) in a unified graph data model and provides a set of ingestion mechanisms to capture provenance. On the managed provenance data, ProvDB supports a rich set of query facilities, including asking questions about data lineages, and introspective queries that compare two versions of a model and explain the differences. We describe the detailed design of these components in Chapter 3.

In order to better serve users who use specific modeling approaches, we design an extension mechanism in ProvDB. With a concrete modeling paradigm, such as feature engineering or deep learning, a clearer lifecycle can be defined, and we can identify the steps that can be automated to accelerate the modeling process. In Chapter 4, we illustrate our study on extending ProvDB for deep learning modelers on computer vision tasks. We build a deep learning lifecycle management system, ModelHub, to manage artifact versions and the provenance of modeling steps in a deep learning lifecycle. In that chapter, we show what kind of provenance information is ingested, how a domain-specific query facility can be used to accelerate the modeling process, and when a general version control storage module is inappropriate.

Once the provenance is managed by systems like ProvDB and ModelHub, in Chapter 5, we study how to fully exploit the ingested provenance and help data science project teams. As collaborative analytics projects often have unstable lifecycles resulting in evolving and verbose provenance graphs, it is common that team members only have partial knowledge of the provenance graph.
Without a predefined workflow skeleton and full understanding of contributing artifacts and steps, it is difficult to write queries and explore the provenance graph using modern graph databases and query languages. In that chapter, we formulate graph query operators for collaborative analytics workflows and propose efficient query evaluation techniques on a modern property graph backend.

In Chapter 6, we examine the opportunities on the server side of lifecycle management systems, and envision a model discovery service. Given the presence of large collections of data science projects uploaded by different groups of data scientists and hosted centrally by a system, model discovery is the problem of identifying relevant projects for a data science practitioner who is working on her own project and wishes to find references. In this chapter, we propose a system around an information retrieval approach, and decompose the discovery task into three major steps: project query and matching, model comparison and ranking, and processing and building ensembles with returned models. Then we describe our system vision, and present opportunities, challenges and techniques for each step.

Chapter 2: Related Work

This chapter surveys the state of the art in provenance systems and collaborative systems, as well as model management efforts in the database community. We begin with prior work on provenance systems towards the goal of supporting scientific workflow systems as well as database systems, followed by a review of recent solutions in the version control systems space. Thereafter, we summarize work done towards building systems for data analytics and associated lifecycles.

2.1 Provenance Systems

Provenance information is important in computer-aided tasks. The goal of a provenance system is to capture and manage the origin and history of the various objects in its scope, and the derivation and passage of an object through its various owners. The prior work can be roughly categorized into two types: workflow (or coarse-grained) provenance [3, 4] and data (or fine-grained) provenance [5].

2.1.1 Workflow Provenance

Many workflow provenance systems have been proposed in the scientific applications domain over the years, with some of the prominent systems being Kepler [6], Taverna [7], Galaxy [8], iPlant [9], VisTrails [10], Chimera [11], and Pegasus [12]. A scientific workflow is typically defined for complex data processing tasks, and includes a precise definition of the interactions among a collection of computation steps and human-machine interaction steps. Workflow provenance information includes [13]:

• Prospective information about the definition of the workflow,

• Retrospective information captured during the execution of the workflow,

• Metadata about steps and datasets in a workflow,

• Input/output lineages among steps.

The provenance information captured varies in different systems; it may include a complete record of the sequence of steps taken in a workflow to produce all data points, while in other cases, it may only entail a record of the versions of software used, as well as the models and makes of hardware equipment used in the workflow.

As in other data management systems (here the data being managed is the provenance information mentioned earlier), a workflow provenance system consists of three key components: a) an ingestion mechanism for capturing the provenance data, b) a representation schema for the provenance information, and c) a storage and query system infrastructure [3]. The workflow provenance systems can be distinguished by their scope and design in these components. First, the ingestion mechanism either explicitly requires the workflow itself to use callbacks to input prospective and retrospective provenance, or is implicitly implemented at lower levels (e.g., compilers, OS) to capture the retrospective provenance (e.g., [14–17]). Second, the representation schema is either workflow-specific or generic, covering partial or thorough prospective and retrospective aspects; there are modern provenance data models standardized over a long period of time as a result of consolidation for scientific workflows [18] and the Web [19]. Third, the infrastructure is dependent on the representation language, ranging from a specialized ontology language (OWL) to XML dialects stored as files to tuples stored in relational database tables.

There are lines of research to improve different aspects of workflow systems, including proposing query languages to utilize provenance [20–22], operators to transform provenance for ease of use [23–26], storing large provenance graphs efficiently [27, 28], and publishing provenance without disclosing important information [29, 30].

2.1.2 Data Provenance

On the other hand, fine-grained provenance is often not discussed in workflow systems, but has rather been a topic of focus in the database community. In dataflow systems where the operators are written in a declarative language (e.g., SQL, Pig Latin, Spark), data provenance at the record level can be captured if needed [5, 31–35]. The fine-grained data provenance system is often built as an auxiliary system component inside a dataflow engine (e.g., SQL database, Spark system). As the dataflow program (e.g., SQL query, Spark program) itself is essentially a clear definition of the prospective provenance, data provenance focuses on retrospective provenance at the level of individual items in query results. In this setting, verbose logging of the derivations of query results is often not affordable due to its large size. A line of work [5] has focused on asking why, how, and where questions about the query results when the detailed execution log is not available.

2.1.3 Our Contribution

Compared with our setting, workflow provenance systems often center around creating, automating, and monitoring a well-defined workflow or data analysis pipeline. However, they cannot easily handle fast-changing pipelines, and typically are not suitable for ad hoc collaborative data science workflows where clear established pipelines may not exist except in the final, stable versions. Moreover, these systems typically do not support the entire range of tools or systems that the users may want to use, they impose a high overhead on the user time and can substantially increase the development time, and they often require using specific computational environments. Further, many of these systems require centralized storage and computation, which may not be an option for large modeling artifacts such as datasets and trained weights.

For collaborative data science workflows, our work aims to combine versioned modeling artifacts and their provenance together on top of a version control system, to provide a uniform platform for collaborative data science workflows. First, we address the challenges in developing a unified data model and tool-agnostic ingestion mechanism for versioned modeling artifacts and their metadata and provenance
Second, given a concrete modeling paradigm (e.g., deep learning), we advance the version storage techniques and provenance query facilities (Chapter 4). Third, as there is no stable skeleton and the users have partial knowledge of the underlying workflow, we develop new provenance query operators (Chapter 5). In terms of data provenance systems, our work is complementary to, and can utilize, those prior techniques to capture the provenance information itself. In many situations, analysis is not performed using dataflow programs (e.g., SQL), therefore similar problem settings cannot be established in collaborative data scientific work- flows without strong assumptions on tool preferences. In our work (Chapter 3), we provide dataflow-based tools to let users write dataflow programs instead of using scripts that may lose prospective provenance, and thus enable prior data provenance work to be used in data science workflows. More importantly, we focus primarily on how to exploit that information for providing richer introspection capabilities on en- tities in data science workflows, which consists of heterogeneous artifacts at different granularities in contrast to low-granularity tuples output by dataflow programs. 2.2 Collaborative Data Science Systems 2.2.1 Data Management for Collaborative Analytics Alongside the emergence of big data in many domains, many data man- agement systems have been designed focusing on specific aspects of collaborative analytics. These include building public collaborative systems to share datasets 20 and analysis findings among non-expert users (Google Fusion tables [36]), address- ing data integration issues during sharing of datasets (Orchestra [37]), manag- ing shared SQL queries (CQMS [38], SQLShare [39]) and literate programming notebooks (LabBook [40]), organizing hosted data repositories and domain-specific datasets to be accessible by applications (CKAN [41], Quandl [42], Factual [43]), providing hosted data science platforms (OpenML [44], Domino [45], Amazon Sage- Maker [46], Google Datalab [47]) and data publishing tools (DataMarket [48]), and enterprise dataset and metadata management systems (GOODS [49], Ground [50], Collibra [51], Apache Atlas [52], Alation [53]). Most of them do not focus on artifacts and their provenance in the context of data science activities, while there are two projects sharing similar views that aim to improve collaborative data science workflows by reducing the cost of metadata collection and management. LabBook [40], a social data science notebook, uses a property graph to manage metadata captured during collaborative analytics and features a web-based application architecture for analyzing the metadata. However, LabBook does not treat versioning as a first-class construct, and does not focus on developing passive provenance ingestion mechanisms or sophisticated querying abstractions as we do here. Ground [50] is a data context service to manage all the information that informs the use of data. It has a general data model and architec- ture to import from and export to other systems. However, metadata ingestion and useful high-level query facilities are left to the users. 
2.2.2 Version Control Systems for Data Science

For data science, a wide range of analytic packages like SAS [54], Excel [55], R [56], Matlab [57], and Mahout [58], or data science toolkits such as IPython [59], Scikit-Learn [60], and Pandas [61], are frequently used for performing the analysis itself; however, those lack comprehensive data management or collaboration capabilities. In practice, version control systems (VCS), such as git and svn, were originally proposed for collaborative open source software development. Now VCS and hosted platforms (e.g., GitHub, GitLab, Bitbucket) have become the de facto collaboration platforms for data scientists and many other collaborative projects, e.g., for sharing datasets and IPython/Jupyter notebooks, writing articles, publishing open course materials, etc. In particular, VCS provide transparent support for versioning and sharing, while imposing no constraints on what types of tools can be used for the data processing itself. Thus VCS and the hosted platforms built around them are much more appropriate for the day-to-day needs of a collaborative data science team.

In terms of the underlying provenance data model and provenance ingestion mechanism, though these systems keep version lineage among committed artifacts and use commit messages to log derivation details, they are typically too "low-level", and have very little support for capturing higher-level workflows or for keeping track of the operations being performed or any kind of provenance information. The versioning APIs supported by these systems are based on a notion of files, and are not capable of allowing data scientists to reason about the data contained within versions and the relationships between the versions in a holistic manner.

Storage Problem & Query Language in VCS. Recently, in a series of efforts [1, 2, 62–64], systems and techniques have been proposed to improve the storage infrastructure as well as the query interfaces of VCS. First, dataset versioning and sharing were identified as an important problem in the context of collaborative data science [2]. The storage problem in VCS was later formally studied in [1], which shows that popular VCS (e.g., git) use fairly simple algorithms underneath that are optimized to work with modest-sized source code and have significant limitations when handling large files (e.g., datasets) and large numbers of versions. The storage-retrieval tradeoff of the problem under the delta-based versioning strategy is explored and connected to the balanced minimum spanning tree problem [65]. Next, on a VCS data model, a query language [62] for retrieving versioned data items is proposed, and query evaluation techniques on versioned datasets [63] and versioned relational tables [64] are studied.

2.2.3 Our Contribution

Our system can be seen as a new type of workflow-aware version control system with provenance management capabilities to aid collaborative activities during a data science project lifecycle. Compared with VCS and prior work (e.g., DataHub, git), we enhance the versioning data model with workflow provenance and artifacts' metadata information in a unified manner (Chapter 3), and provide rich query facilities spanning different stages of a lifecycle that are tailored for data scientists (e.g., introspection, monitoring, artifact understanding) (Chapter 3).
Within a concrete 23 modeling lifecycle (e.g., deep learning), we explore how to build a specialized system by unifying data and provenance management in order to aid data scientists and accelerate the lifecycle (Chapter 4). Moreover, with the unified model, we open the opportunity to explore new research problems, e.g., discovering relevant data science repositories shared in hosted services (Chapter 6). In terms of VCS storage and query models, our work shows that the rich set of artifacts (e.g., learned weights) in collaborative data analytics require more general formulation of the VCS storage problems, as well as refined domain-specific language and query facilities over managed modeling artifacts (e.g., explore existing models, and enumerate new models) compared with those in prior work in this aspect (Chapter 4). 2.3 Machine Learning Systems 2.3.1 Modern Machine Learning Software Systems Nowadays, heterogeneous datasets (e.g., logs, text, images, graphs) are col- lected in many different domains by logging user and system behavior or labeling data via crowd-sourcing. The sizes of the datasets are increasing (e.g., web clicks of an e-commerce site, graphs representing an entire social network), the complex- ity of the models is much higher (e.g., millions of features, billions of dimension float weights), and the data-driven algorithmic design-making scenarios have become more common. Traditional analytics packages like SAS [54], Excel [55], R [56], and Matlab [57], or business intelligence and data mining support in commercial rela- 24 tional database systems, are not sufficient for the needs of modern analytics in terms of a) the size of the input data and the complexity of the machine learning models, b) the heterogeneous data types and the accessibility from expert and non-expert users performing data analytics. In the industry and academia, there has been increasing work on develop- ing general-purpose machine learning software systems. On one hand, to address the scalability w.r.t. the increasing model complexity and dataset sizes, there are lines of work proposing software frameworks [58, 66] on top of popular distributed dataflow platforms (e.g., Hadoop, Spark), or novel computation abstractions [67–69] to support training phases of machine learning models. On the other hand, there are single-machine software toolkits developed by supporting multi-dimensional ar- rays and scientific computing operations [70, 71], integrating data transformation capabilities [61], providing trending modeling methods [72], supporting popular pro- gramming languages, and enabling literate programming notebooks [59]. Because many datasets and analytics are conducted in databases, in database community, there are lines of work to support modern machine learning [73], includ- ing pushing predictive models and optimization routines into databases [74,75], and accelerating learning using database join optimization methods [76]. 2.3.2 Modeling Pipeline & Lifecycle Management Systems There is also emerging interest in developing general-purpose systems for han- dling different aspects of model lifecycle management. Multiple industry tutorials 25 on machine learning and systems discuss the modeling pipelines in practice [77,78], which are iterative processes consisting of preparing and understanding datasets, training and tuning models, and serving and monitoring model performances. 
Accelerating the lifecycle by improving tasks spanning the process, e.g., understanding modeling artifacts and collaborating efficiently with others, will deliver results faster; such tasks have drawn significant attention from the database and systems communities.

First, there are lines of work focusing on important aspects spanning the lifecycle, including accelerating iterative updating and training of models [79], serving predictive models online [80, 81], managing and querying developed models [82–84], organizing provenance and metadata generated in the lifecycle [50, 84, 85], hosting and discovering reference models [44, 86], and assisting collaboration (Section 2.2.1). Second, there are also platform efforts to support the entire lifecycle [87, 88] for expert modelers. Several products and research efforts go further with the goal of using data provided by the user and automatically selecting good models [89, 90].

2.3.3 Our Contribution

In terms of machine learning software systems, apart from building or training machine learning models, versioning, collaboration and provenance of the modeling process are largely ignored and left to the users. Our work is complementary to them to a large extent; our focus is on lifecycle management when a data-driven project team consisting of data analysts/scientists at different skill levels is trying to collaboratively develop a model. ProvDB can be used as the provenance data management layer for most such systems (Chapter 3, Chapter 5).

Our work overlaps with the model pipeline and lifecycle management systems. In contrast with existing research, our work features tool-neutral and extensible provenance ingestion mechanisms without tool or lifecycle assumptions (Chapter 3). We also make contributions in this space by extending our provenance management approach to the deep learning modeling process, which has not been studied in the literature (Chapter 4). Moreover, we further discuss important provenance query primitives for the iterative and repetitive data science lifecycle, which are not seen in any of the existing work in this space (Chapter 5).

Chapter 3: ProvDB System Design for Collaborative Analytics

In this chapter, we present an in-depth design of the core technical components of the ProvDB system proposed in Chapter 1. First, we adopt a schema-later approach and show a flexible data model that combines various elements of data files, model artifacts, versions, workflow derivations, and possible metadata. Next, we explain how the rich provenance data is ingested and how heterogeneous data analysis environments can be served well with natural extensions. Then, a set of general query facilities that are orthogonal to specific environments are described and illustrated in detail. Finally, we show the usefulness of the system by a case study on a deep learning project managed by ProvDB.

3.1 Unified Data Model

To encompass a large variety of situations, our goal is to have a flexible data model that reflects versioning and workflow pipelines, and supports addition of arbitrary metadata or provenance information. Though the version control system has a clean model, the analysis steps between the versions are missing in modern systems, and metadata of the versioned files are not addressed. Based on a versioning data model, we use a fixed "base schema" (Figure 3.1) to capture the information about the versions, the different data artifacts, and so on, while allowing arbitrary properties to be added to the various entities.
We map this logical data model to a property graph model (Figure 3.3), which we use as our physical data model to store the information. Compared with similar models proposed in past work [13, 40], our model differs primarily in the explicit modeling of versions.

3.1.1 Conceptual Data Model

We view a data science project as a working directory with a set of artifacts (files), and a development lifecycle as a series of actions (shell commands, edits, transformation programs) which perform create/read/update/delete (CRUD) operations in the working directory.

Figure 3.1: Conceptual Data Model (the entity types Version, Artifact, Snapshot, Derivation, Record, and Property, with ResultFile, DataFile, and ScriptFile as artifact subtypes, together with their attributes and relationships).

More specifically: an artifact is a file, which the user modifies, runs, and talks about with peers. Artifacts can be tagged as belonging to one of three different types: ResultFile, DataFile, ScriptFile, which helps with formulating appropriate queries.

A version is a checkpoint of the project; in our case, this refers to a physical commit created via git. ProvDB has explicit versions and implicit versions; the former are created when a user explicitly issues a commit command, whereas the latter are created at provenance ingestion time when the user runs commands in the project directory.

Snapshots are checkpointed versions of an artifact, and capture its evolution. ProvDB monitors file changes during the lifecycle, and emits changed (CUD) artifacts as new snapshots; the previous snapshot of the same artifact before the change is called its parent. The contents of a snapshot are modeled as records, to allow fine-grained record-level provenance (some files, e.g., binary files, would be modeled as having a single record).

Derivations capture the transformation context to the extent possible. If a derivation is performed by running a program or a script, then the information about it is captured along with any arguments. Derivation edges may also be created when the system notices that one or more artifacts have changed (e.g., an edit made using an editor, or a script run outside the ProvDB context).

Finally, properties are used to encode any additional information about the snapshots or the derivations, as key-value pairs (where values are often time series or JSON documents themselves). Provenance ingestion tools (discussed in the next section) generate these properties. In addition to the information about programs or scripts and their arguments, properties may include any information captured by parsing shell scripts or analysis scripts themselves. Properties are also used to extract statistics about the data within the snapshots as well, so that they can be seamlessly queried. This starts blurring the distinction between data and metadata to some extent; we make a cleaner distinction for concrete environments, and the users are allowed to annotate and distinguish between these two.
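To make the entity types above concrete, the following is a minimal sketch of the conceptual data model as Python dataclasses; the class and field names are illustrative approximations of the base schema in Figure 3.1, not its exact definition.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Property:                # arbitrary key-value metadata produced by ingestors
        name: str
        value: str                 # often a JSON document or a time series
        ingestor: str              # which ingestion mechanism generated it

    @dataclass
    class Snapshot:                # a checkpointed state of one artifact
        artifact_path: str
        action: str                # one of "C", "U", "D"
        parent: Optional["Snapshot"] = None
        properties: List[Property] = field(default_factory=list)

    @dataclass
    class Derivation:              # the command or script that produced new snapshots
        command: str
        from_version: str
        to_version: str
        properties: List[Property] = field(default_factory=list)

    @dataclass
    class Version:                 # an explicit or implicit git commit of the project
        commit_id: str
        is_usr_commit: bool
        parents: List[str] = field(default_factory=list)
        snapshots: List[Snapshot] = field(default_factory=list)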
Example 1 Suppose a user starts an analysis using a script file script1 and a data file datafile1 by copying them to a repository. She first tries out script1 on datafile1, and a result file result1 with m records is generated in the directory. She opens result1 in an IDE and finds a number format issue, after which she uses vim to modify script1 and runs it again on datafile1; then each record in result1 is changed. In Figure 3.2, assuming the system has a commit at the end of each command in the shell, we show the versions, artifacts, snapshots, and derivations. Between the versions, the command is captured as a Derivation, whose properties would be the command line arguments (i.e., options, parameters). Each version includes a set of snapshots associated with an artifact. As shown in the figure, result1 is an artifact across three versions. It is worth pointing out that some of the derivation context from the IDE when result1 was first opened may be important as well, and cannot be captured simply by looking at command lines. If there is a change before a derivation, ProvDB detects it and marks the derivation as missing provenance.

Figure 3.2: Example Workflow (versions V1 through V5 connected by derivations d1–d4 for the commands cp script1 datafile1 ., sh ./script1 datafile1, vim script1, and sh ./script1 datafile1, with the snapshots and records of the artifacts script1, datafile1, and result1 in each version).

3.1.2 Physical Property Graph Data Model

To actually store the conceptual data model, in addition to a relational database, we map the logical data model (with the exception of Record) above into a property graph data model, allowing graph traversal queries and visual exploration over the stored information easily. Nodes of the property graph are of types Version, Artifact, etc., whereas the edges capture the parent-child and composition relationships. In addition, a Derivation is handled as an edge between Versions. The edge properties are used to store the detailed parameters and command options.

Example 2 For the data model instance in Figure 3.2, we show the actual physical property graph in Figure 3.3. The shapes of the nodes reflect the entity types in Figure 3.2. Between an artifact and its snapshots (e.g., result1 and result_s1), the edge captures a composition relationship, while between snapshots (e.g., result_s1 and result_s2), the parent-child lineage is stored across versions. For instance, since the artifact result1 is unchanged in version V4, the snapshot parent of result_s2 in V5 is the snapshot result_s1 in V3.

It is worth pointing out that by providing entities across versions, provenance queries can be asked directly from the perspective of an artifact and its changes across snapshots, without knowing the versions. For instance: compare the metadata of the top-3 ResultFile snapshots of a classification result artifact.

Figure 3.3: Provenance Property Graph (Version nodes V1–V5 linked by parent and derivation edges d1–d4, Artifact nodes script1, datafile1, and result1 linked to their Snapshot nodes, and parent edges between snapshots across versions).
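As an illustration of how such a property graph can be queried, the sketch below runs a Cypher traversal over a Neo4j backend using the official Python driver. The node labels, relationship types, and property keys (Artifact, Snapshot, HAS, PARENT, name, id) are assumptions made for the example rather than the exact ProvDB schema, and the connection parameters are placeholders for a local Neo4j instance.

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    # Trace the snapshot lineage of an artifact (e.g., result1) across versions,
    # without naming the versions explicitly.
    LINEAGE_QUERY = """
    MATCH (a:Artifact {name: $name})-[:HAS]->(s:Snapshot)
    OPTIONAL MATCH (s)-[:PARENT]->(p:Snapshot)
    RETURN s.id AS snapshot, p.id AS parent
    ORDER BY snapshot
    """

    with driver.session() as session:
        for record in session.run(LINEAGE_QUERY, name="result1"):
            print(record["snapshot"], "<-", record["parent"])
    driver.close()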
3.2 Provenance Ingestion

The rich provenance information needs to be ingested, which is particularly challenging in a situation where the work environment is open-world and very diverse. ProvDB captures provenance information or other metadata opportunistically, and features a suite of mechanisms that can capture provenance for different types of operations. Users can easily add provenance ingestion mechanisms, to capture both more types of information and richer information. The ingestion mechanisms are listed in Table 3.1.

    Ingestion Method        Brief Description
    Shell-based Ingestion   General framework for the command-line work environment.
    User Annotations        GUI for the users to add important information for peers.
    File Views              Capture fine-grained record-level provenance.
    Extension Modules       Ingestors for popular modeling tool chains.

Table 3.1: ProvDB Ingestion Methods

In this section, we briefly describe the ingestion mechanisms that ProvDB currently supports, which include a general-purpose UNIX shell-based ingestion framework, ingestion of DVCS versioning information, and a mechanism called file views which is intended to both simplify workflows and aid in fine-grained provenance capture.

3.2.1 Shell Command-based Ingestion Framework

The provenance ingestion framework is centered around the UNIX command-line shell (e.g., bash, zsh). We provide a special command called provdb ingest that users can prefix to any other command, and that triggers provenance ingestion. Each run of the command results in the creation of a new implicit version, which allows us to capture the changes at a fine granularity. These implicit versions are kept separate from the explicit versions created by a user through use of git commit, and are not visible to the users.

A collection of ingestors is invoked by matching the command against a set of regular expressions, registered a priori along with the ingestors. ProvDB schedules ingestors to run before/during/after execution of the user command, and expects each ingestor to return a JSON property graph consisting of a set of key-value pairs denoting properties of the snapshots or derivations. An ingestor can also provide record-level provenance information, if it is able to generate such information.

A default ingestor handles arbitrary commands by parsing them following the POSIX standard (IEEE 1003.1-2001) to annotate the utility, options, option arguments, and operands. For example, provdb ingest 'mkdir -p dir' is parsed as utility mkdir, option -p, and operand dir. Concatenations of commands are decomposed and ingested separately, while a command with pipes is treated as a single command. If an external tool has been used to make any edits (e.g., a text editor), an implicit version is created the next time provdb is run, and the derivation information is recorded as missing.
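The following is a minimal sketch of what such a default command ingestor could look like in Python; the function name and the returned property keys are illustrative assumptions, and real POSIX utility-argument parsing (e.g., options that take arguments) is more involved than this simplified version.

    import shlex

    def default_ingestor(cmd: str) -> dict:
        """Split a shell command and annotate utility, options, and operands."""
        tokens = shlex.split(cmd)
        utility, rest = tokens[0], tokens[1:]
        options, operands = [], []
        for tok in rest:
            if tok.startswith("-") and tok != "-":
                options.append(tok.lstrip("-"))   # simplified: ignores option arguments
            else:
                operands.append(tok)
        # Returned as flat key-value properties to attach to the derivation edge.
        return {"utility": utility, "options": options, "operands": operands}

    print(default_ingestor("mkdir -p dir"))
    # {'utility': 'mkdir', 'options': ['p'], 'operands': ['dir']}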
3.2.2 User Annotations

Apart from the plugin framework, the ProvDB GUI allows users to organize, add, and annotate properties, along with other query facilities. The user can annotate project properties, such as usage descriptions for collaborations on artifacts, or notes explaining the rationale for a particular derivation. A user can also annotate a property as a parameter and add a range/step to its domain, which turns a derivation into a template and enables batch runs of an experiment. For example, a grid search of a template derivation on a start snapshot can be configured directly in the UI. Maintaining such user annotations (and file views, discussed next) as the datasets evolve is a complicated issue in itself [91].

3.2.3 File Views

ProvDB provides a functionality called file views to assist dataset transformations and to ingest provenance among data files. Analogous to views in relational databases, a file view defines a virtual file as a transformation over an existing file. A file view can be defined either: (a) as a script or a sequence of commands (e.g., sort | uniq -c, which is equivalent to an aggregate count view), or (b) as an SQL query where the input files are treated as tables. For instance, the following query counts, per label, the rows that a classifier predicts wrongly compared with the ground truth.

    provdb fileview -c -n='results.csv' -q='
        select t._c2 as label, count(*) as err_cnt
        from {testfile.csv} as t, {predfile.csv} as r
        where t._c0 = r._c0 and t._c2 != r._c2
        group by t._c2'

The SQL feature is implemented by loading the input files into an in-memory sqlite database and executing the query against it. Instead of creating a view, the same syntax can be used for creating a new file, saving a user from coding similar functionality.

File views serve as an example of a functionality that can help make the ad hoc process of data science more structured. Aside from making it easier to track dependencies, SQL-based file views also enable capturing record-level provenance by drawing upon techniques developed over the years for provenance in databases.

3.2.4 Extension Modules

ProvDB also supports several specialized ingestion plugins and configurations to cover important data science workflows. In particular, it has an ingestor capable of ingesting provenance information from runs of the caffe deep learning framework; it not only ingests the learning hyperparameters from the configuration file, but also the accuracy and loss scores by iteration from the result logging file. Providing parsers for scripts written in popular data science tools such as Jupyter, scikit-learn and pandas is an on-going effort, built upon prior work [17].

Though the ProvDB prototype is designed to be used in a command-line environment, ingesting provenance within other development environments such as different IDEs requires a white-box approach, such as providing IDE callbacks and adding features to them, which is out of scope for the prototype. However, the conceptual and physical data models work similarly.

3.3 Query Facilities

Based on the unified data model ingested by the various mechanisms ProvDB supports, we identify a set of query workloads and discuss how ProvDB implements them. As an overview, ProvDB's query facilities are as follows:

• Provenance queries about the Version/Workflow Graph and Properties parts of the data model.

• Workflow queries on derivations among nodes in the version graph.

• Introspective queries, such as diffing similar artifacts and derivation pipelines in the data science processes.

• Monitoring queries on a given property of a series of versions, for automatically detecting problems during deployment.

• Apart from the general query facilities, using the ProvDB extension modules, domain-specific high-level queries can be answered by extensions; for example, a deep learning extension (Chapter 4) allows users to ask about the network architectures and hyperparameters used in an experiment, while a feature engineering extension could expose queries on selected feature sets.

In the rest of this section, we discuss each type of query facility, and show how our current prototype supports them with a case study. In the next chapter, we illustrate the design of a deep learning plugin subsystem in ProvDB.

3.3.1 Queries over Version/Workflow Graph and Properties

In a collaborative workflow, provenance queries to identify what revision and which author last modified a line in an artifact are common (e.g., git blame).
ProvDB allows such queries at various levels (version, artifact, snapshot, record) and also allows querying the properties associated with the different entities (e.g., details of what parameters have been used, temporal orders of commands, etc). In fact, all the information exposed in the property graph can be directly queried using the Neo4j Cypher query language, which supports graph traversal queries and aggregation queries. The latter types of queries are primarily limited by the amount of context and properties that can be automatically ingested. ProvDB currently supports ingestors for several popular frameworks, including a program analysis ingestor for scikit-learn which extracts the scikit-learn APIs used in a program, and a hyper- parameter and result-table ingestor for caffe for deep learning (the hyper-parameter ingestor extracts experiment parameter metadata from caffe commands and ar- guments, while the results-table ingestor extracts optimization errors and accuracy metrics from training logs). Availability of this information allows users to ask more meaningful queries like: what scikit-learn script files contain a specific sequence of commands; what is the learning accuracy curve of a caffe model artifact; enumerate all different parameter combinations that have been tried out for a given learning task, and so on. Many such queries naturally result in one or more time series of values (e.g., 39 properties of an artifact over time as it evolves, results of “diff” queries discussed below); ProvDB supports a uniform visual interface for plotting such time series data, and comparing different time series. 3.3.2 Reasoning about Pipelines Similar to a workflow management system, we define a pipeline to be a se- quence of derivation edges. A pipeline can be annotated by the user by brows- ing the workflow graph and marking the start and the end edges of the pipeline. Pipelines would also be inferred automatically by the system (e.g., via pattern min- ing techniques) if the ingested history is large and clean. ProvDB UI allows a user to browse and reuse pipelines present in the system. Being able to reason about pipelines has the potential to hugely simplify the lives of data scientists, by allowing them to learn from others and also helping them avoid mistakes (e.g., omission of a crucial intermediate step). Moreover, it also makes lifecycle automations possible, such as re-invoking an old pipeline on an old artifact to verify the results, or invoking a pipeline on a different snapshot with different parameters, or schedule a cron job. 3.3.3 Introspective Diffs: Shallow vs Deep “Diff” Queries “Diff” is a first-class operator in ProvDB, and can be used for finding differ- ences at various different levels. Specifically, given a pair of nodes (corresponding to two snapshots) in the property graph, a shallow diff operation, by default, focuses on the ingested properties of the two snapshots, which are likely to contain the crucial 40 differences in most cases. It attempts to “join” the two sets of properties as best as it can, and highlights the differences; in case of time-series properties, it also allows users to generate plots so they can more easily understand the differences. 
For example, for two result table artifacts that may represent the outputs of two different runs of the same script (e.g., model training logs), a line-by-line diff may be useless because of irrelevant and minor numerical differences; however, by plotting the two sets of results against each other, a user can more quickly spot important trends (e.g., that a specific parameter value leads to quicker convergence). The shallow diff operator also allows differencing the contents of the two files line-by-line if so desired.

A deep diff compares the ancestors of the two target snapshots by tracing back their derivations to the common ancestor. It aligns the snapshots along the two paths, and shows the differences between each pair of aligned snapshots. For example, in a prediction workflow, a user may have tried out different prediction models and configurations to identify the best model; using ProvDB, she can start from two result table artifacts, and ask a deep diff query to compare how they are derived.

3.3.4 Continuous Monitoring or Anomaly Detection

On ingested properties of artifacts and derivations, ProvDB provides a monitoring and alerting subsystem to aid the user during the development lifecycle. We envision two main use cases for this functionality. (a) First, it can be used to detect any major changes to the properties of an evolving dataset – e.g., a large change in the distribution of values in a dataset may be cause for taking remedial action. (b) Second, in most applications, there is usually a need to "deploy" an analysis script or a trained model against live incoming data; it is important to keep track of how well the model or the script is behaving and catch any problems as soon as possible (e.g., changing input data properties; higher error rates than expected). ProvDB supports analysis of historical data (as described above) and simple alert queries that can monitor a property of an evolving artifact.

To elaborate, a property belonging to an artifact can be registered as a monitoring property. Once a new snapshot of the same artifact is committed, the property of the new snapshot is ingested and compared against the alert conditions, which are configured beforehand by the users. Numerical properties of a series of snapshots naturally form a time series. We support a limited set of time series models, such as moving averages and standard deviation envelopes, to be configured as adaptive alert conditions.

3.4 Case Study

Example 3 We show the ProvDB Web GUI using a caffe deep learning project. In this project, 41 deep neural networks are created for a face classification task. The user tries out models by editing and training them.

Figure 3.4: Diff Artifacts (result logging files for two deep neural networks).

In Figure 3.4, an introspection query asks how different two trained models (model-0 and model-9) are. Using the GUI, the user filters artifacts and diffs their result logging files. In the right-side query result pane, the ingested properties are diffed. The caffe ingestor properties are numerical time series; using the provided charting tool, the user plots the training loss and accuracy against the iteration number. From the results, we can see that model-9 does not train well in the beginning, but ends up with similar accuracy. To understand why, a deep diff between the two can be issued in the GUI, and complex Cypher queries can be used as well.
In Figure 3.5, the query finds previous deriva- tions and shared snapshots, which are training config files; more introspection can be done by finding changed hyperparameters. 43 Figure 3.5: Cypher Query to Find Related Changes via Derivations 3.5 Conclusion In this chapter, we presented ProvDB, its high level system architecture and our major design decisions for simplifying lifecycle management of ad hoc, collabo- rative analysis workflows that are becoming prevalent in most application domains today. We showed that a large amount of provenance and metadata information can be captured passively, and argue analyzing it in novel ways can immensely simplify the day-to-day processes undertaken by data analysts. Our ProvDB prototype uses git and Neo4j, which provides a variety of provenance ingestion mechanisms and the ability to query, analyze, and monitor the captured provenance informa- 44 tion. Our experience with using this prototype for a deep learning workflow (for a computer vision task) showed that even with limited functionality, it can sim- plify the bookkeeping tasks and make it easy to compare the effects of different hyperparameters and neural network structures. 45 Chapter 4: ModelHub: Managing Deep Learning Projects on ProvDB Deep learning models [92], also called deep neural networks (DNN), have im- proved state-of-the-art results in many important fields, and have been the subject of much research in recent years, leading to the development of several systems for facilitating deep learning. Current systems, however, mainly focus on model build- ing and training phases, while the issues of data management, model sharing, and lifecycle management are largely ignored. Deep learning modeling lifecycle generates a rich set of data artifacts, such as learned parameters and training logs, and com- prises of several frequently conducted tasks, e.g., to understand the model behaviors and to try out new models. Dealing with such artifacts and tasks is cumbersome and largely left to the users. In this chapter, we study the deep learning modeling practice and illustrate the system design of ModelHub, a domain-specific extension of ProvDB, to manage the rich set of modeling artifacts and their provenance over the lifecycle. We first list the challenges of deep learning lifecycle management, derive the system requirements for ModelHub, and illustrate its major components in Section 4.1, followed by introducing background on related topics in DNN modeling lifecycle in Section 4.2. We present an overview of ModelHub, and discuss the declarative interfaces in 46 Section 4.3. Then we describe the parameter archival store (PAS) in Section 4.4 and present an experimental evaluation in Section 4.5. 4.1 Motivation & Approach 4.1.1 DNN Modeling Lifecycle and Challenges Compared with the traditional approach of feature engineering followed by model training [79], deep learning is an end-to-end learning approach, i.e., the fea- tures are not given by a human but are learned in an automatic manner from the input data. Moreover, the features are complex and have a hierarchy along with the network representation. This requires less domain expertise and experience from the modeler, but understanding and explaining the learned models is difficult; why even well-studied models work so well is still a mystery and under active research. Thus, when developing new models, changing the learned model (especially its network structure and hyper-parameters) becomes an empirical search task. 
In Figure 4.1, we show a typical deep learning modeling lifecycle (we present an overview of deep neural networks in the next section).

Figure 4.1: Deep Learning Modeling Lifecycle (create or update a model from reference models, train/test the model, evaluate it, and serve it, using the data and labels throughout; if the accuracy is unsatisfactory, repeat).

Given a prediction task, a modeler often starts from well-known models that have been successful in similar task domains; she then specifies input training data and output loss functions, and repeatedly adjusts the DNN operators and connections like Lego bricks, tunes model hyper-parameters, trains and evaluates the model, and repeats this loop until prediction accuracy does not improve. Due to a lack of understanding about why models work, the adjustments and tuning inside the loop are driven by heuristics, e.g., adjusting hyper-parameters that appear to have a significant impact on the learned weights, applying novel layers or tricks seen in recent empirical studies, and so on. Thus, many similar models are trained and compared, and a series of model variants needs to be explored and developed. Due to the expensive learning/training phase, each iteration of the modeling loop takes a long period of time and produces many (checkpointed) snapshots of the model. As we noted above, this is a common workflow across many other ML models as well.

Current systems (Caffe [72], Theano, Torch, TensorFlow [69], etc.) mainly focus on the model building and training phases, while the issues of data management, model sharing, and lifecycle management are largely ignored. Modelers are required to write external imperative scripts, edit configurations by hand, and manually maintain a manifest of the model variations that have been tried out; not only are these tasks irrelevant to the modeling objective, but they are also challenging and nontrivial due to the complexity of the model as well as the large footprints of the learned models. More specifically, the tasks and data artifacts in the modeling lifecycle expose several systems and data management challenges, which include:

• Understanding & Comparing Models: It is difficult to keep track of the many models developed and/or understand the differences amongst them. Differences among both the metadata about the model (training sample, hyperparameters, network structure, etc.), as well as the actual learned parameters, are of interest. It is common to see a modeler write all configurations in a spreadsheet to keep track of temporary folders of inputs, setup scripts, snapshots and logs, which is not only cumbersome but also error-prone.

• Repetitive Adjusting of Models: The development lifecycle itself has time-consuming repetitive sub-steps, such as adding a layer at different places to adjust a model, searching through a set of hyper-parameters for the different variations, reusing learned weights to train models, etc., which currently have to be performed manually.

• Model Versioning: Similar models are possibly trained and run multiple times, reusing others' weights as initialization (fine-tuning). Maintaining the different model versions generated over time and their relationships can help with identifying errors and concept drifts, comparing models over new inputs, and potentially reverting back to a previous model. Even for a single learned model, storing the different checkpointed snapshots can help with "warm-start" and can provide important insights into the training processes.
• Parameter Archiving : The storage footprint of deep learning models tends to be very large. Recent top-ranked models in the ImageNet task have billions of floating-point parameters and require hundreds of MBs to store one snap- shot during training. Due to resource constraints, the modeler has to limit 49 the number of snapshots, even drop all snapshots of a model at the cost of retraining when needed. • Reasoning about Model Results: Another key data artifact that often needs to be reasoned about is the results of running a learned model on the training or testing dataset. By comparing the results across different models, a modeler can get insights into difficult training examples or understand correlations between specific adjustments and the performance. 4.1.2 ModelHub Approach We extend ProvDB and build the ModelHub system to address these chal- lenges. The ModelHub system is not meant to replace popular training-focused DNN systems, but rather designed to be used with them to accelerate modeling tasks and manage the rich set of lifecycle artifacts. It consists of three key components: 1. DLV: a model versioning system to store, query and aid in understanding the models and their versions. It extends the ProvDB shell command ingestion mechanism by adding deep learning artifact extractors, as well as lifecycle- specific command line suites. 2. DQL: In addition to general ProvDB query facilities, we propose a model network adjustment and hyper-parameter tuning domain specific language to serve as an abstraction layer to help modelers focus on the creation of the models. 50 3. ModelHub: a hosted deep learning model sharing system to exchange DLV repositories and enable publishing, discovering and reusing models from others. Comparing with other deep learning systems and provenance management systems, the key features and innovative design highlights of ModelHub are: • We use a git-like VCS as a familiar UI to let the modeler manage and explore the created models in a repository, and an SQL-like model enumeration DSL to aid modelers in making and examining multiple model adjustments easily. • Behind the declarative constructs, ModelHub manages different artifacts in a split back-end storage: structured data, such as network structure, training logs of a model, lineages of different model versions, output results, are stored in a relational database, while learned float-point parameters of a model are viewed as a set of float matrices and managed in a read-optimized archival storage (PAS). • Parameters dominate the storage footprint and floats are well-known at being difficult to compress. We study PAS implementation thoroughly under the context of DNN query workload and advocate a segmented approach to store the learned parameters, where the low-order bytes are stored independently of the high-order bytes. We also develop novel model evaluation schemes to use high order bytes solely and progressively uncompress less-significant chunks if needed to ensure the correctness of an inference query. • Due to the different utility of developed models, archiving versioned models 51 using parameter matrix deltas exhibits a new type of dataset versioning prob- lem which not only optimizes between storage and access tradeoff but also has model-level constraints. • Finally, the VCS model repository design extends naturally to a collabora- tive format and online system which contain rich model lineages and enables sharing, reusing, reproducing DNN models. 
4.2 Background

To support our design decisions, we overview the artifacts and common task practices in the DNN modeling lifecycle.

4.2.1 Deep Neural Networks

A deep learning model is a deep neural network (DNN) consisting of many layers having nonlinear activation functions that are capable of representing complex transformations between input data and the desired output. Let D denote a data domain and O denote a prediction label domain (e.g., D may be a set of images; O may be the names of the set of objects we wish to recognize, i.e., labels). As with any prediction model, a DNN is a mapping function f : D → O that minimizes a certain loss function L, and is of the following form (the chain d → f0 → f1 → · · · → fn → l̂d is evaluated layer by layer):

    f0 = σ0(W0 d + b0),        d ∈ D
    fi = σi(Wi fi−1 + bi),     0 < i ≤ n
    L(fn, ld),                 ld ∈ O

Here i denotes the layer number, (Wi, bi) are learnable weight and bias parameters in layer i, and σi is an activation function that non-linearly transforms the result of the previous layer (common activation functions include sigmoid, ReLU, etc.). Given a learned model and an input d, applying f0, f1, ..., fn in order gives us the prediction label for that input data. In the training phase, the model parameters are learned by minimizing L(fn, ld), typically done through iterative methods such as stochastic gradient descent.

Figure 4.2 shows a classic convolutional DNN, LeNet. LeNet was proposed to solve a prediction task from handwritten images to digit labels {0 · · · 9}. In the figure, a cube represents an intermediate tensor, while the dotted lines are unit transformations between tensors. More formally, a layer, Li : (W, H, X) ↦ Y, is a function which defines a data transformation from tensor X to tensor Y. W are the parameters which are learned from the data, and H are the hyperparameters which are given beforehand. A layer is non-parametric if W = ∅.

Figure 4.2: Anatomy of a DNN Model (LeNet) (the layers conv1, pool1, conv2, pool2, ip1, ip2 with their shapes and hyperparameters; the pool layers are non-parametric).

In the computer vision community, the layers defining transformations are considered building blocks of a DNN model, and are referred to using conventional names, such as full layer, convolution layer, pool layer, normalization layer, etc. The chain is often called the network architecture. The LeNet architecture has two convolution layers, each followed by a pool layer, and two full layers, shown with layer shapes and hyperparameters in Figure 4.2. Moreover, winning models in recent ILSVRC (ImageNet Large Scale Visual Recognition Challenge) competitions are shown in Table 4.1, with their architectures described by a composition of common layers in regular expression syntax to illustrate the similarities (note that activation functions and detailed connections are omitted).

    Network       Architecture (in regular expression)            |W| (flops)
    LeNet [93]    (LconvLpool){2}Lip{2}                            4.31 × 10^5
    AlexNet [94]  (LconvLpool){2}(Lconv{2}Lpool){2}Lip{3}          6 × 10^7
    VGG [95]      (Lconv{2}Lpool){2}(Lconv{4}Lpool){3}Lip{3}       1.96 × 10^10
    ResNet [96]   (LconvLpool)(Lconv){150}LpoolLip                 1.13 × 10^10

Table 4.1: Popular CNN Models for Object Recognition

DNN models are learned from massive data based on some architecture, and modern successful computer vision DNN architectures consist of a large number of float weight parameters, shown in Table 4.1, resulting in large storage footprints (GBs) and long training times (often weeks). Furthermore, the training process is often checkpointed and variations of models need to be explored, leading to many model copies.
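To make the layered formulation at the start of this subsection concrete, the following is a minimal NumPy sketch of the forward computation f0, f1, ..., fn; the layer sizes and the choice of ReLU and softmax activations are illustrative, and not tied to any particular model discussed here.

    import numpy as np

    def relu(x):
        return np.maximum(x, 0.0)

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    rng = np.random.default_rng(0)
    sizes = [784, 500, 10]                      # input dim -> hidden -> labels
    params = [(rng.normal(scale=0.01, size=(m, n)), np.zeros(m))
              for n, m in zip(sizes[:-1], sizes[1:])]

    def forward(d):
        """Apply f_0, ..., f_n in order and return the predicted label."""
        f = d
        for i, (W, b) in enumerate(params):
            z = W @ f + b
            f = relu(z) if i < len(params) - 1 else softmax(z)
        return int(np.argmax(f))                # the predicted label for input d

    print(forward(rng.random(784)))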
4.2.2 Modeling Data Artifacts

Unlike many other prediction methods, DNN modeling results in a very large number of weight parameters, a rich set of hyperparameters, and learning measurements, which are used in unique ways in practice, resulting in a mixture of structured data, files, and binary floating-point artifacts:

• Non-convexity & Hyperparameters: A DNN model is typically non-convex, and {W} is a local optimum of the underlying loss-minimization problem. The optimization procedure employs many tricks to reach a solution quickly [97]. The set of hyperparameters (e.g., learning rate, momentum) w.r.t. the optimization algorithm needs to be maintained.

• Iterations & Measurements: Models are trained iteratively and checkpointed periodically due to the long running times. A set of learning measurements is collected in various logs, including objective loss values and accuracy scores.

• Fine-tuning & Snapshots: Well-known models are often learned from massive real-world data (ImageNet), and require large amounts of resources to train; when prediction tasks do not vary much (e.g., animal recognition vs. dog recognition), the model parameters are reused as initializations and adjusted using new data; this is often referred to as fine-tuning. On the other hand, not all snapshots can simply be deleted, as the convergence is not monotonic.

• Provenance & Arbitrary Files: Alternate ways to construct architectures or to set hyperparameters lead to human-in-the-loop model adjustments. Initialization, preprocessing schemes, and hand-crafted scripts are crucial provenance information to explore models and reproduce results.

4.2.3 Model Adjustment

In a modeling lifecycle for a prediction task, the update-train-evaluate loop is repeated in daily work, and many model variations are adjusted and trained. In general, once the data and loss are determined, model adjustment can be done in two orthogonal steps: a) network architecture adjustments, where layers are dropped or added and layer function templates are varied, and b) hyperparameter selections, which affect the behavior of the optimization algorithms. There is much work on search strategies to enumerate and explore both.

4.2.4 Model Sharing

Due to the good generalizability, long training times, and verbose hyperparameters required for large DNN models, there is a need to share the trained models. Jia et al. [72] built an online venue (Caffe Model Zoo) to share models. Briefly, Model Zoo is part of a github repository¹ with a markdown file edited collaboratively. To publish models, modelers add an entry with links to download trained parameters in caffe format. Apart from the caffe community, similar initiatives are in place for other training systems.

¹ Caffe Model Zoo: https://github.com/BVLC/caffe/wiki/Model-Zoo

4.3 ModelHub System Overview

We show the ModelHub architecture, including the key components and their interactions, in Figure 4.3. At a high level, the ModelHub functionality is divided among a local component and a remote component. The local functionality includes the integration with popular DNN systems such as caffe, torch, tensorflow, etc., on a local machine or a cluster.
The remote functionality includes sharing of models, and their versions, among different groups of users. We primarily focus on the local functionality in this chapter.

On the local system side, DLV is a version control system (VCS) implemented as a command-line tool (dlv) that serves as an interface to interact with the rest of the local and remote components. Use of a specialized VCS instead of a general-purpose VCS such as git or svn allows us to better portray and query the internal structure of the artifacts generated in a modeling lifecycle, such as network definitions, training logs, binary weights, and relationships between models. The key utilities of dlv are listed in Table 4.2, grouped by their purpose; we explain these in further detail in Section 4.3.2.

Figure 4.3: ModelHub System Architecture (the ModelHub client with its DLV and DQL modules, including the command-line and UI toolkit and the DQL parser and optimizer; a model learning module wrapping caffe; and a local DLV repository consisting of a git repo, a metadata catalog, and the parameter archival store (PAS); on the server side, hosted DLV repositories support publishing, searching, and reusing models).

DQL is a DSL we propose to assist modelers in deriving new models; the DQL query parser and optimizer components in the figure are used to support this language. The model learning module interacts with external deep learning tools that the modeler uses for training and testing. They are essentially wrappers on specific DNN systems that extract and reproduce modeling artifacts. Finally, the ModelHub service is a hosted toolkit to support publishing, discovering and reusing models, and serves a similar role for DNN models as github does for software development or DataHub for data science [2].

    Type                  Command   Description
    model version         init      Initialize a dlv repository.
    management            add       Add model files to be committed.
                          commit    Commit the added files.
                          copy      Scaffold a model from an old one.
                          archive   Archive models in the repository.
    model exploration     list      List models and related lineages.
                          desc      Describe a particular model.
                          diff      Compare multiple models.
                          eval      Evaluate a model with given data.
    model enumeration     query     Run a DQL clause.
    remote interaction    publish   Publish a model to ModelHub.
                          search    Search models in ModelHub.
                          pull      Download from ModelHub.

Table 4.2: A list of key dlv utilities.

4.3.1 Data Model

ModelHub works with two data models: a conceptual DNN model, and a data model for the versions in a DLV repository.

4.3.1.1 DNN Model

A DNN model can be understood in different ways, as one can tell from the different model creation APIs in popular deep learning systems. In the formulation mentioned in Section 4.1, if we view a function fi as a node and the dependency relationship (fi, fi−1) as an edge, it becomes a directed acyclic graph (DAG). Depending on the granularity of the functions in the DAG, either at the tensor operator level (add, multiply) or at a logical composition of those operators (convolution layer, full layer), it forms different DAGs. In ModelHub, we consider a DNN model as a composition of unit operators (layers), as often adopted by computer vision models. The main reason for this decision is that we focus on productivity improvement in the lifecycle, rather than implementation efficiencies of training and testing.

4.3.1.2 VCS Data Model

When managing DNN models in the VCS repository, a model version represents the contents in a single version.
It consists of a network definition, a collection of weights (each of which is a value assignment for the weight parameters), a set of extracted metadata (such as hyper-parameter, accuracy and loss generated in the training phase), and a collection of files used together with the model instance (e.g., scripts, datasets). In addition, we enforce that a model version must be associated with a human readable name for better utility, which reflects the logical groups of a series of improvement efforts over a DNN model in practice. In the implementation, model versions can be viewed as the following relation model version(name, id, N, W, M, F), where id is part of the primary key of model versions and is auto-generated to distinguish model versions with the same name. In brief, N,W,M,F are the network definition, weight values, extracted metadata and associated files respectively. The DAG, N, is stored as two tables: Node(id, node, A), where A is a list of attributes such as layer name, and Edge(from, to). W is managed in our learned parameter storage (PAS, Section 4.4). M , the metadata, captures the provenance information of training and testing a particular model; it is extracted from training logs by the wrapper module, and includes the hyperpa- rameters when training a model, the loss and accuracy measures at some iterations, as well as dynamic parameters in the optimization process, such as learning rate at some iterations. Finally, F is file list marked to be associated with a model version, including data files, scripts, initial configurations, and etc. Besides a set 60 of model versions, the lineage of the model versions are captured using a separate parent(base, derived, commit) relation. All of these relations are maintained/up- dated in a relational backend when the modeler runs the different dlv commands that update the repository. 4.3.2 Query Facilities Once the DNN models and their relationships are managed in DLV, the modeler can interact with them easily. The query facilities we provide can be categorized into two types: a) model exploration queries and b) model enumeration queries. 4.3.2.1 Model Exploration Queries Model exploration queries interact with the models in a repository, and are used to understand a particular model, to query lineages of the models, and to compare several models. For usability, we design it as query templates via dlv sub-command, similar to other VCS. List Models & Related Lineages By default, the query lists all versions of all models including their commit de- scriptions and parent versions; it also takes options, such as showing results for a particular model, or limiting the number of versions to be listed. dlv list [--model_name] [--commit_msg] [--last] Describe Model dlv desc shows the extracted metadata from a model version, such as the net- 61 work definition, learnable parameters, execution footprint (memory and runtime), activations of convolutional DNNs, weight matrices, and evaluation results across iterations. Note the activation is the intermediate output of a DNN model in com- puter vision and often used as an important tool to understand the model. The output formats are a result of discussions with computer vision modelers to deliver tools that fit their needs. In addition to printing to console, the query supports HTML output for displaying the images and visualizing the weight distribution. 
dlv desc [--model_name | --version] [--output] Compare Models dlv diff takes a list of model names or version ids and allows the modeler to compare the DNN models. Most of desc components are aligned and returned in the query result side by side. dlv diff [--model_names | --versions] [--output] Evaluate Model dlv eval runs test phase of the managed models with an optional config specifying different data or changes in the current hyper-parameters. The main usages of exploration query are two-fold: 1) for the users to get familiar with a new model, 2) for the user to test known models on different data or settings. The query returns the accuracy and optionally the activations. It is worth pointing out that complex evaluations can be done via model enumeration queries in DQL. dlv eval [--model_name | --versions] [--config] 62 4.3.2.2 Model Enumeration Queries Model enumeration queries are used to explore variations of currently available models in a repository by changing network structures or tuning hyper-parameters. There are several operations that need to be done in order to derive new models: 1) Select models from the repository to improve; 2) Slice particular models to get reusable components; 3) Construct new models by mutating the existing ones; 4) Try the new models on different hyper-parameters and pick good ones to save and work with. When enumerating models, we also want to stop exploration of bad models early. To support this rich set of requirements, we propose the DQL domain specific language, that can be executed using “dlv query”. Challenges of designing the language are: a) the data model is a mix of relational and the graph data models and b) the enumeration includes hyper-parameter tuning as well as network structure mutations, which are very different operations. We describe the language and show the key operators and constructs along with a set of examples (Query 4.1∼4.4) to show how requirements are met. select m1 where m1.name like "alexnet_%" and m1.creation_time > "2015 -11 -22" and m1["conv [1,3,5]"].next has POOL("MAX") Query 4.1: DQL select query to pick the models. 63 slice m2 from m1 where m1.name like "alexnet -origin%" mutate m2.input = m1["conv1"] and m2.output = m1["fc7"] Query 4.2: DQL slice query to get a sub-network. construct m2 from m1 where m1.name like "alexnet -avgv1%" and m1["conv*($1)"].next has POOL("AVG") mutate m1["conv*($1)"]. insert = RELU("relu$1") Query 4.3: DQL construct query to derive more models on existing ones. evaluate m from "query3" with config = "path to config" vary config.base_lr in [0.1, 0.01, 0.001] and config.net["conv*"].lr auto and config.input_data in ["path1", "path2"] keep top(5, m["loss"], 100) Query 4.4: DQL evaluate query to enumerate models with different network architectures, search hyper-parameters, and eliminate models. 64 Key Operators We adopt the standard SQL syntax to interact with the repository. DQL views the repository as a single model version table. A model version instance is a DAG, which can be viewed as object types in modern SQL conventions. In DQL, attributes can be referenced using attribute names (e.g. m1.name, m1.creation_time, m2.input, m2.output), while navigating the internal structures of the DAG, i.e. the Node and Edge EDB, we provide a regexp style selector operator on a model version to access individual DNN nodes, e.g. m1["conv[1,3,5]"] in Query 4.1 filters the nodes in m1. 
Once the selector operator returns a set of nodes, prev and next attributes of the node allow 1-hop traversal in the DAG. Note that POOL("MAX") is one of the standard built-in node templates for condition clauses. Using SPJ operators with object type attribute access and the selector operator, we allow relational queries to be mixed with graph traversal conditions. To retrieve reusable components in a DAG, and mutate it to get new mod- els, we provide slice, construct and mutate operators. Slice originates in pro- gramming analysis research; given a start and an end node, it returns a subgraph including all paths from the start to the end and the connections which are needed to produce the output. Construct can be found in graph query languages such as SPARQL to create new graphs. We allow construct to derive new DAGs by using selected nodes to insert nodes by splitting an outgoing edge or to delete an outgoing edge connecting to another node. Mutate limits the places where insert and delete can occur. For example, Query 4.2 and 4.3 generate reusable subgraphs and new graphs. Query 4.2 slices a sub-network from matching models between convolution 65 layer ‘conv1’ and full layer ‘fc7’, while Query 4.3 derives new models by appending a ReLU layer after all convolution layers followed by an average pool. All queries can be nested. Finally, evaluate can be used to try out new models, with potential for early out if expectations are not reached. We separate the network enumeration compo- nent from the hyper-parameter turning component; while network enumeration can be nested in the from clause, we introduce a with operator to take an instance of a tuning config template, and a vary operator to express the combination of activated multi-dimensional hyper-parameters and search strategies. auto is keyword imple- mented using default search strategies (currently grid search). To stop early and let the user control the stopping logic, we introduce a keep operator to take a rule consisting of stopping condition templates, such as top-k of the evaluated models, or accuracy threshold. Query 4.4 evaluates the models constructed and tries com- binations of at least three different hyper-parameters, and keeps the top 5 models w.r.t. the loss after 100 iterations. 4.3.3 ModelHub Implementation On the local side, the implementation of ModelHub maintains the data model in multiple back-ends and utilizes git to manage the arbitrary file diffs. Var- ious queries are decomposed and sent to different backends and chained accordingly. On the other hand, as the model repository is standalone, we host the repositories as a whole in a ModelHub service. The modeler can use the dlv publish to push 66 the repository for archiving, collaborating or sharing, and use dlv search and dlv pull to discover and reuse remote models. We envision such a form of collaboration can facilitate a learning environment, as all versions in the lifecycle are accessible and understandable with ease. The online video2 highlights the interactions with the prototype and illustrate the described concepts in this section. 4.4 Parameter archival storage (PAS) Modeling lifecycle for DNNs, and machine learning models in general, is cen- tered around the learned parameters, whose storage footprint can be very large. The goal of PAS is to maintain a large number of learned models as compactly as possible, without compromising on the query performance. 
Before introducing our design, we first discuss the queries of interest, and some key properties of the model artifacts. We then describe different options to store a single float matrix, and to construct deltas (differences) between two matrices. We then formulate the optimal version graph storage problem, discuss how it differs from the prior work, and present algorithms for solving it. Finally, we develop a novel approximate model evaluation technique, suitable for the segmented storage technique that PAS uses.

² https://youtu.be/noEXdahlj_4

4.4.1 Weight Parameters & Query Types of Interest

We illustrate the key weight parameter artifacts and the relationships among them in Figure 4.4, and also explain some of the notation used in this section.

Figure 4.4: Relationships of Model Versions and Snapshots (each model version consists of a series of checkpointed snapshots s1, s2, s3, ..., the last of which is the latest snapshot; the versions themselves are connected in a version graph).

At a high level, the predecessor-successor relationships between all the developed models are captured as a version graph. These relationships are user-specified and conceptual in nature, and the interpretation is left to the user (i.e., an edge vi → vj indicates that vj was an updated version of the model that the user checked in after vi, but the nature of this update is irrelevant for storage purposes). A model version vi itself consists of a series of snapshots, s1, ..., sn, which represent checkpoints during the training process (most systems will take such snapshots due to the long running times of the iterations). We refer to the last or the best checkpointed snapshot sn as the latest snapshot of vi, and denote it by svi.

One snapshot, in turn, consists of intermediate data X and trained parameters W (e.g., in Figure 4.2, the model has 431080 parameters for W, and 19694 · b dimensions for X, where b is the minibatch size). Since X is useful only if training needs to be resumed, only W is stored in PAS. Outside of a few rare exceptions, W can always be viewed as a collection of float matrices, Rm×n, m ≥ 1, n ≥ 1, which encode the weights on the edges from the outputs of the neurons in one layer to the inputs of the neurons in the next layer. Thus, we treat a float matrix as a first-class data type in PAS³.

The retrieval queries of interest are dictated by the operations that are done on these stored models, which include: (a) testing a model, (b) reusing weights to fine-tune other models, (c) comparing parameters of different models, (d) comparing the results of different models on a dataset, and (e) model exploration queries (Section 4.3.2). Most of these operations require execution of group retrieval queries, where all the weight matrices in a specific snapshot need to be retrieved. This is different from the range queries seen in array databases (e.g., SciDB), and also has unique characteristics that influence the storage and retrieval algorithms.

• Similarity among Fine-tuned Models: Although the non-convexity of the training algorithm and differences in network architectures across models lead to non-correlated parameters, the widely-used fine-tuning practices (Section 4.2) generate model versions with similar parameters, resulting in efficient delta encoding schemes.

• Co-usage constraints: Prior work on versioning and retrieval [1] has focused on retrieving a single artifact stored in its entirety.
However, we would like to store the different matrices in a snapshot independently of each other, but we must retrieve them together. These co-usage constraints make the prior algorithms inapplicable, as we discuss later.

• Low Precision Tolerance: DNNs are well-known for their tolerance to low-precision floating point numbers, both during training and evaluation. Further, many types of queries (e.g., visualization and comparisons) do not require retrieving the full-precision weights.

• Unbalanced Access Frequencies: Not all snapshots are used frequently. The latest snapshots with the best testing accuracy are used in most cases; the checkpointed snapshots have limited uses, such as debugging and comparisons.

4.4.2 Parameters As Segmented Float Matrices

4.4.2.1 Float Data Type Schemes

Although binary (1/-1) or ternary (1/0/-1) matrices are sometimes used in DNNs, in general PAS handles real-number weights. Due to the different usages of snapshots, PAS offers a handful of float representations that let the user trade off storage efficiency against lossiness via dlv. In Table 4.3, we list the schemes supported in PAS:

1. Float Point: DNNs are typically trained with single-precision (32-bit) floats. This scheme uses the standard IEEE 754 floating point encoding to store the weights with sign, exponent, and mantissa bits. The IEEE half-precision format (16 bits) and TensorFlow's truncated 16-bit format [69] are supported as well and can be used if desired.

2. Fixed Point: Fixed point encoding uses a single global exponent per matrix, and each number stores only a sign and a mantissa using k bits in total. This is a lossy scheme, as trailing positions are dropped and at most 2^k distinct values can be expressed. The entropy of the matrix also drops considerably, aiding compression.

3. Quantization: Similarly, PAS supports quantization using k bits, k ≤ 8, allowing 2^k possible values. The quantization can be done in a random or uniform manner by analyzing the value distribution, and a coding table is used to map the integer codes stored in the matrices back to values. This is most useful for snapshots whose weights are primarily used for fine-tuning or initialization.

Scheme         Param. Bits   Compress    Lossiness   Usage
Float Point    64/32/16      Fair        Lossless    Latest
Fixed Point    32/16/8       Good        Good        Latest
Quantization   8/k           Excellent   Poor        Other

Table 4.3: Float Representation Scheme Trade-offs

The float point schemes presented here are not new, and are used in DNN systems in practice [98–100]. As a lifecycle management tool, PAS lets experienced users select among these schemes rather than deleting snapshots due to resource constraints. Our evaluation shows the storage/accuracy tradeoffs of these schemes.

4.4.2.2 Bytewise Segmentation for Float Matrices

One challenge for PAS is the high entropy of float numbers in floating-point representations, which makes them very hard to compress. The compression ratios reported in related work for scientific floating-point datasets (e.g., simulations) are very low, and state-of-the-art compression schemes do not work well for DNN parameters either. By exploiting the low-precision tolerance of DNNs, we adopt bytewise decomposition from prior work [101, 102] and extend it to our context to store the float matrices. The basic idea is to separate the high-order bits from the low-order mantissa bits, storing a float matrix in multiple chunks: the first chunk consists of the 8 high-order bits of each value, and the rest are segmented one byte per chunk.
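As a concrete (if simplified) illustration, the following numpy sketch shows a fixed-point encoding with a single per-matrix exponent and the bytewise decomposition of a float32 matrix. It is not the PAS implementation (which is written in C++); the function names, the choice of k, and the little-endian assumption are ours.

import numpy as np

def to_fixed_point(mat, k=16):
    # Lossy fixed-point encoding: one global exponent per matrix, and a
    # k-bit signed mantissa (sign + k-1 value bits) per weight.
    exp = int(np.ceil(np.log2(np.abs(mat).max() + 1e-12)))
    scale = 2.0 ** (exp - (k - 1))            # value of one mantissa unit
    codes = np.round(mat / scale).astype(np.int32)
    return codes, exp                          # decode: codes * scale

def byte_segments(mat):
    # Bytewise decomposition of a float32 matrix: chunk 0 holds the
    # high-order byte of every value (sign and exponent bits, low entropy),
    # the remaining chunks hold one lower-order byte each.
    b = mat.astype(np.float32).view(np.uint8).reshape(-1, 4)
    # on little-endian platforms byte 3 is the high-order byte
    return [np.ascontiguousarray(b[:, 3 - i]) for i in range(4)]

W = np.random.randn(256, 128).astype(np.float32)
codes, exp = to_fixed_point(W, k=16)
chunks = byte_segments(W)   # chunks[0] compresses well with zlib; chunks[3] barely does

Compressing chunks[0] with a standard compressor typically does much better than compressing the raw matrix, which is the effect the segmented layout exploits.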
One major advantage is that the high-order bits have low entropy, so standard compression schemes (e.g., zlib) are effective for them. Apart from its simplicity, the key benefits of the segmented approach are two-fold: (a) it allows offloading the low-order bytes to remote storage, and (b) PAS queries can read the high-order bytes only, in exchange for tolerating small errors. Comparison and exploration queries (dlv desc, dlv diff) can easily tolerate such errors, and, as we show later in the chapter, dlv eval queries can also be made tolerant to them.

4.4.2.3 Delta Encoding Across Snapshots

We observed that, due to the non-convexity of training, even re-training the same model with slightly different initializations results in very different parameters. However, the parameters from checkpointed snapshots of the same or similar models tend to be close to each other. Furthermore, across model versions, fine-tuned models generated using fixed initializations from another model often have similar parameters. These observations naturally suggest the use of delta encoding between checkpointed snapshots within one model version and between the latest snapshots across model versions; i.e., instead of storing all matrices in their entirety, we can store some in their entirety and others as differences from those. Two possible delta functions (denoted ⊕) are arithmetic subtraction and bitwise XOR. We find that the compression footprints are similar regardless of the direction in which the diff is applied. We study the delta operators on real models in Section 4.5.

4.4.3 Optimal Parameter Archival Storage

Given the above background, we next address the question of how best to store a collection of model versions, so that the total storage footprint occupied by the large segmented float matrices is minimized while the retrieval performance is not compromised. This recreation/storage tradeoff sits at the core of any version control system. In recent work [1], the authors study six variants of this problem and show the NP-hardness of most of them. However, their techniques cannot be directly applied in PAS, primarily because their approach cannot handle the group retrieval (co-usage) constraints. We first introduce the necessary notation, discuss the differences from prior work, and present the new techniques we developed for PAS.

In Figure 4.4, a model version v ∈ V consists of time-ordered checkpointed snapshots, Sv = {s1, ..., sn}. Each snapshot si consists of a named list of float matrices Mv,i = {mk} representing the learned parameters. All matrices in the repository, M = ⋃_{v∈V} ⋃_{si∈Sv} Mv,i, are the parameter artifacts to archive. Each matrix m ∈ M is either stored directly, or is recovered through another matrix m′ ∈ M via the delta operator ⊕, i.e., m = m′ ⊕ d, where d is the delta computed using one of the techniques discussed above. In the latter case, the matrix d is stored instead of m. To unify the two cases, we introduce an empty matrix ν0 and define m ⊕ ν0 = m for all m ∈ M.

Definition 1 (Matrix Storage Graph) Given a repository of model versions V, let ν0 be an empty matrix, and let V = M ∪ {ν0} be the set of all parameter matrices. We denote by E = {mi ⊕ mj} ∪ {mi ⊕ ν0} the available deltas between all pairs of matrices.
Abusing notation somewhat, we also treat E as the set of all edges in a graph whose vertices are V. Finally, let GV(V, E, cs, cr) denote the matrix storage graph of V, where the edge weights cs, cr : E → R+ are the storage cost and the recreation cost of an edge, respectively.

Definition 2 (Matrix Storage Plan) Any connected subgraph of GV(V, E) is called a matrix storage plan for V, denoted by PV(VP, EP), where VP = V and EP ⊆ E.

Example 4 In Figure 4.5, we show a matrix storage graph for a repository with two snapshots, s1 = {m1, m2} and s2 = {m3, m4, m5}. The weights associated with an edge e = (ν0, mi) reflect the cost of materializing the matrix mi and retrieving it directly. On the other hand, for an edge between two matrices, e.g., e = (m2, m5), the weights denote the storage cost of the corresponding delta and the recreation cost of applying that delta. In Figures 4.6 and 4.7, two matrix storage plans are shown.

Figure 4.5: A Matrix Storage Graph Example
Figure 4.6: Optimal Matrix Storage Plan without Constraints
Figure 4.7: Optimal Matrix Storage Plan Constrained by Cr^ψi(s1) ≤ 3 ∧ Cr^ψi(s2) ≤ 6

For a matrix storage plan PV(VP, EP), PAS stores all its edges and is able to recreate any matrix mi by following a path from ν0. The total storage cost of PV, denoted Cs(PV), is simply the sum of the edge storage costs:

    Cs(PV) = Σ_{e ∈ EP} cs(e)

Computation of the average snapshot recreation cost is more involved and depends on the retrieval scheme used:

• Independent scheme: recreates each matrix mi one by one by following the shortest path Υν0,mi from ν0 to mi. In that case, the recreation cost is simply the sum of the recreation costs of all the edges along the shortest path.

• Parallel scheme: accesses all matrices of a snapshot in parallel (using multiple threads); the longest shortest path from ν0 defines the recreation cost of the snapshot.

• Reusable scheme: considers caching deltas along the way, i.e., if the paths from ν0 to two different matrices overlap, then the shared computation is done only once. In that case, we need to construct the lowest-cost Steiner tree T_{PV,si} involving ν0 and the matrices in the snapshot. However, because multiple large matrices need to be kept in memory simultaneously, the memory consumption of this scheme can be large.

Retrieval Scheme     Recreation Cost Cr^ψ(PV, si)              Solution of Prob. 1
Independent (ψi)     Σ_{mj ∈ si} Σ_{ek ∈ Υν0,mj} cr(ek)        Spanning tree
Parallel (ψp)        max_{mj ∈ si} Σ_{ek ∈ Υν0,mj} cr(ek)      Spanning tree
Reusable (ψr)        Σ_{ek ∈ T_{PV,si}} cr(ek)                 Subgraph

Table 4.4: Recreation Cost Cr(PV, si) of a Snapshot si in a Plan PV

PAS can be configured to use any of these options during actual query execution. However, solving the storage optimization problem under the Reusable scheme is nearly impossible: since the Steiner tree problem is NP-hard, just computing the cost of a solution becomes intractable, making it hard even to compare two different storage solutions. Hence, during the storage optimization process, PAS only supports the Independent and Parallel schemes.

In the example above, the edges are shown as undirected, indicating that the deltas are symmetric. In general, we allow directed deltas to handle asymmetric delta functions, and also allow multiple directed edges between the same two matrices.
The latter can be used to capture different options for storing the delta; e.g., we may have one edge corresponding to a remote storage option, where 77 the storage cost is lower and the recreation cost is higher; whereas another edge (between the same two matrices) may correspond to a local SSD storage option, where the storage cost is the highest and the recreation cost is the lowest. Our algorithms can thus automatically choose the appropriate storage option for different deltas. Similarly, PAS is able to make decisions at the level of byte segments of float matrices, by treating them as separate matrices that need to be retrieved together in some cases, and not in other cases. This, combined with the ability to incorporate different storage options, is a powerful generalization that allows PAS to make decisions at a very fine granularity. Given this notation, we can now state the problem formally. Since there are multiple optimization metrics, we assume that constraints on the retrieval costs are provided and ask to minimize the storage. Problem 1 (Optimal Parameter Archival Storage) Given a matrix storage graph GV (V , E , cs, cr), let θi be the snapshot recreation cost budget for each si ∈ S. Un- der a retrieval scheme ψ, find a matrix storage plan P ∗V that minimizes the total storage cost, while satisfying recreation constraints, i.e.: minimize Cs(PV ); s.t. ∀si ∈ S, C ψr (PV , si) ≤ θi PV Example 5 In Figure 4.6, without any recreation constraints, we show the best storage plan, which is the minimum spanning tree based on cs of the matrix stor- age graph, C (P ) = 19. Under independent scheme ψ , C ψis V i r (PV , s1) = 3 and C ψir (PV , s2) = 7.5. In Figure 4.7, after adding two constraints θ1 = 3 and θ2 = 6, 78 we shows an optimal storage plan P ∗V satisfying all constraints. The storage cost increases, Cs(P ∗) = 24, while C ψi(P ∗V r V , s1) = 3 and C ψir (P ∗V , s2) = 6. Although this problem variation might look similar to the ones considered in recent work [1], none of the variations studied there can handle the co-usage constraints (i.e., the constraints on simultaneously retrieving a group of versioned data artifacts). One way to enforce such constraints is to treat the entire snapshot as a single data artifact that is stored together; however, that may force us to use an overall suboptimal solution because we would not be able to choose the most appropriate delta at the level of individual matrices. Another option would be to sub-divide the retrieval budget for a snapshot into constraints on individual matrices in the snapshot. As our experiments show, that can lead to significantly higher storage utilization. Thus the formulation above is a strict generalization of the formulations considered in that prior work. Theorem 1 Optimal Parameter Archival Storage Problem is NP-hard for all re- trieval schemes in Table 4.4. Proof 1 We reduce Prob.5 in [1] to the independent scheme ψi, and Prob.6 to the parallel scheme ψp in [1], by mapping each datasets as vertices in storage graph, and introducing a snapshot holding all matrices with recreation bound Θg. For reuse scheme ψr, it is at least as hard as weighted set cover problem if reducing a set to an edge e with storage cost cs(e) as weight, an item to an vertex in GV (V , E), and set recreation budget Θg =∞. 79 Lemma 1 The optimal solution for Problem 1 is a spanning tree when retrieval scheme ψ is independent or parallel. 
Proof 2 Suppose we have a non-tree solution PV that satisfies the constraints and minimizes the objective. Note that the parallel and independent schemes are based on the shortest paths Υν0,m in PV from ν0 to each matrix m, so the union of these shortest paths forms a shortest path tree. If we remove the edges that are not in this shortest path tree from the plan, obtaining P′V, we get a lower objective Cs(P′V) while still satisfying all recreation constraints, which is a contradiction.

Note that the above lemma does not hold for the reusable scheme (ψr): snapshot Steiner trees satisfying different recreation constraints may share intermediate nodes, resulting in a subgraph solution. Lemma 1 shows that P∗V is a spanning tree and connects our problem to a class of constrained minimum spanning tree problems.

4.4.3.1 Constrained Spanning Tree Problem

In Problem 1, minimizing the storage cost while ignoring the recreation constraints leads to a minimum spanning tree (MST) of the matrix storage graph, whereas the snapshot recreation constraints are best satisfied by a shortest path tree (SPT). Such problems are often referred to as constrained spanning tree problems [103] or shallow-light tree constructions [104], and have been studied in areas other than dataset versioning, such as VLSI design. Khuller et al. [65] propose an algorithm called LAST to construct such a "balanced" spanning tree in an undirected graph G. LAST starts with a minimum spanning tree of the given graph, traverses it in a DFS manner, and adjusts the tree by changing parents to ensure that the path length in the constructed solution is within (1+ε) times the shortest path in G, i.e., Cr(T, vi) ≤ (1+ε)·Cr(Υν0,vi, vi), while the total storage cost is within (1+2/ε) times that of the MST. In our problem, the co-usage constraints among the matrices of each snapshot form hyperedges over the matrix storage graph, making the problem more difficult.

In the rest of the discussion, we adapt meta-heuristics for constrained MST problems to develop two algorithms: the first (PAS-MT) is based on an iterative refinement scheme, where we start from an MST and adjust it to satisfy the constraints; the second is a priority-based tree construction algorithm (PAS-PT), which adds nodes one by one and encodes the heuristic in the priority function. Both algorithms target the parallel and independent schemes, and can also find feasible solutions for the reusable scheme. Due to the large memory footprints of intermediate matrices, we do not discuss algorithms for improving reusable-scheme solutions.

4.4.3.2 PAS-MT

The algorithm starts with T as the MST of GV(V, E), and iteratively adjusts T to satisfy the violated snapshot recreation constraints, U = {si | Cr(T, si) > θi}, by swapping one edge at a time. We denote by pi the parent of vi, with (pi, vi) ∈ T and p0 = φ, and by Di the successors of vi in T. A swap operation on (pi, vi) to an edge (vs, vi) ∈ E − T changes the parent of vi to vs in T.

Lemma 2 A swap operation on vi changes the storage cost Cs(T) by cs(pi, vi) − cs(vs, vi), and changes the recreation costs of vi and its successors Di by Cr(T, vi) − Cr(T, vs) − cr(vs, vi).

The proof follows from the definitions of Cs and Cr by inspection. When selecting edges in E − T, we choose the one with the largest marginal gain for the unsatisfied constraints: Eq. 4.1 sums the gain in recreation cost over all matrices in the same snapshot si (for the independent scheme), while Eq. 4.2 uses the max change instead (for the parallel scheme).
The actual formula used is somewhat more complex and handles negative denominators:

ψi:  max_{(vs,vi) ∈ E−T}  [ Σ_{sk ∈ U} Σ_{vj ∈ sk ∩ Di} (Cr(T, vi) − Cr(T, vs) − cr(vs, vi)) ] / [ cs(vs, vi) − cs(pi, vi) ]        (4.1)

ψp:  max_{(vs,vi) ∈ E−T}  [ Σ_{sk ∈ U} (Cr(T, vi) − Cr(T, vs) − cr(vs, vi)) ] / [ cs(vs, vi) − cs(pi, vi) ]                        (4.2)

The algorithm iteratively swaps edges, and stops when all recreation constraints are satisfied or no edge yields a positive gain. A single step examines |E − T| edges and |U| unsatisfied constraints, and there are at most |E| steps; thus the complexity is bounded by O(|E|^2 |S|). The pseudo-code is shown in Algorithm 1.

Input: GV(V, E, cs, cr), snapshots S, recreation cost budgets {θi ≥ 0 | si ∈ S}.
Output: A spanning tree T satisfying the constraints {Cr(T, si) ≤ θi}
 1: let T = MST of GV(V, E)
 2: while unsatisfied constraints U = {si | Cr(T, si) > θi} ≠ ∅ do
 3:   for each edge esi = (vs, vi) ∈ E − T do
 4:     calculate gain(esi) with Eq. 4.1 (Eq. 4.2 for scheme ψp)
 5:   end for
 6:   find e′si = arg max {gain(esi)}
 7:   break if gain(e′si) ≤ 0
 8:   swap (pi, vi) with e′si: T = (T − {(pi, vi)}) ∪ {e′si}
 9: end while
10: return T unless U ≠ ∅

Algorithm 1: PAS-MT

4.4.3.3 PAS-PT

This algorithm constructs a solution by "growing" a tree, starting from an empty tree. In the matrix storage graph GV(V, E, cs, cr), due to the co-usage constraints, previous tree-growth algorithms [1, 65] no longer work, as adding one node at a time cannot determine whether a group constraint is satisfied. Instead, PAS-PT examines the edges of GV(V, E) in increasing order of storage cost cs; a priority queue Q maintains all the candidate edges and is initially populated with all edges incident on ν0. At any point, the edges in Q are the ones that connect a vertex in T to a vertex outside T. Using an edge eij = (vi, vj) (s.t. vi ∈ VT ∧ vj ∈ V − VT) popped from Q, the algorithm tries to add vj to T with the minimum storage increment cs(eij). Before adding vj, it examines whether the constraints of the affected groups sa (s.t. vj ∈ sa) are satisfied, using actual and estimated recreation costs for the vertices {vk ∈ sa} in VT and V − VT, respectively; if vk ∈ VT, the actual recreation cost Cr(T, vk) is used, otherwise its lower bound, cr(ν0, vk), is used as an estimate. We refer to this estimate for sa as Ĉr(T, sa). Once an edge eij is added to T, the inner edges IjT = {(vk, vj) | vk ∈ VT} of the newly added vj are dequeued from Q, while the outer edges OjT = {(vj, vk) | vk ∈ V − VT} are enqueued. If the storage cost of an existing vertex vk in T can be improved (i.e., Cs(T, vk) > cs(vk, vj)) and its recreation cost does not increase (i.e., Cr(T, vk) ≥ Cr(T, vj) + cr(vk, vj)), then the parent pk of vk in T is changed to vj via the swap operation, decreasing the storage cost without increasing the affected groups' recreation costs.

The algorithm stops if Q is empty and T is a spanning tree. In the case when Q is empty but VT ⊂ V, an adjustment operation on T is performed to increase the storage cost and satisfy the group recreation constraints. For each vu ∈ V − VT, we attach it to ν0; then, in each unsatisfied group si that vu belongs to, we would ideally choose a set {vg} ⊆ si ∩ T whose parents in T are changed, such that the increase in storage cost is minimized while the recreation constraint becomes satisfied. The optimal adjustment itself can be viewed as a knapsack problem with an extra acyclicity constraint on T, which is NP-hard. Instead, we use the same heuristic in Eq.
4.1 to adjust vg ∈ si ∩ T one by one by swapping its parent pg to vs until the group constraints cannot improved. Similarly, the parallel scheme ψp uses Eq. 4.2 for the adjustment operation. The complexity of this algorithm is O(|E|2|S|). The pseudo-code is shown in Algorithm 2. 84 Input: GV (V, E , cs, cr), snapshots S, recreation cost {θi ≥ 0 | si ∈ S}. Output: A spanning tree T satisfying constraints {Cr(T, si) ≤ θi} 1: let T = ∅ and Q be a priority queue of edges based on cs 2: push {(ν0, vi) | vi ∈ V} in Q 3: while Q 6= ∅ do 4: pop eij = (vi, vj) from Q; let T = T ∪ {eij} e 5: let constraints satisfaction flag be Θ ijsatisfy = true 6: for each snapshot constraint sa ∈ {s | s ∈ S ∧ vj ∈ s} do 7: estimate recreation cost Ĉr(T, sa) e 8: Θ ijsatisfy = false and break if Ĉr(T, sa) > θa 9: end for e 10: if Θ ijsatisfy is false, then T = T − {eij} and goto line 3 11: pop inner edges of v jj IT = {(vk, vj) | vk ∈ T} from Q j 12: push outer edges OE−T = {(vj , vk) | vk ∈ E − T} to Q 13: for (vk, vj) ∈ T , change pk improves Cs, and no worse Cr do 14: swap (pk, vk) ∈ T with (vj , vk) 15: end for 16: end while 17: if T is not a spanning tree then 18: for each vu ∈ V − VT , do T = T ∪ {e0u = (ν0, vu)} 19: adjust T using PAS-MT heuristic. 20: end if 21: return T if T is a matrix storage plan Algorithm 2: PAS-PT 85 4.4.4 Model Evaluation Scheme in PAS Model evaluation, i.e., applying a DNN forward on a data point to get the prediction result, is a common task to explore, debug and understand models. Given a PAS storage plan, an dlv eval query requires uncompressing and applying deltas along the path to the model. We develop a novel model evaluation scheme utilizing the segmented design, that progressively accesses the low-order segments only when necessary, and guarantees no errors for arbitrary data points. The basic intuition is that: when retrieving segmented parameters, we know the minimum and maximum values of the parameters (since higher order bytes are retrieved first). If the prediction result is the same for the entire range of those values, then we do not need to access the lower order bytes. However, considering the high dimensions of parameters, non-linearity of the DNN model, unknown full precision value when issuing the query, it is not clear if this is feasible. We define the problem formally, and illustrate the determinism condition that we use to develop our algorithm. Our technique is inspired from theoretical stability analysis in numerical analysis. We make the formulation general to be applicable to other prediction functions. The basic assumption is that the prediction function returns a vector showing relative strengths of the classification labels, then the dimension index with the maximum value is used as the predicted label. Problem 2 (Parameter Perturbation Error Determination) Given a predic- tion function F(d,W ) : Rm × Rn 7→ Rc, where d is the data and W are the learned weights, the prediction result cd is the dimension index with the highest value in the 86 output o ∈ Rc. When W value is uncertain, i.e., each wi ∈ W in known to be in the range [wi,min, wi,max], determine whether cd can be ascertained without error. When W is uncertain, the output o is uncertain as well. However, if we can bound the individual entries in o, then the following condition is an applicable necessary condition for determining error: Lemma 3 Let oi ∈ o vary in range [oi,min, oi,max]. If ∃k such that ∀i, ok,min > oi,max, then prediction result cd is k. 
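As a rough illustration of how this condition can be checked, the following numpy sketch propagates elementwise parameter ranges through a small fully connected ReLU network and applies the test of Lemma 3 to the output ranges. The layer shapes, helper names, and the uniform perturbation width eps are illustrative assumptions; the procedure PAS actually uses is described next.

import numpy as np

def interval_linear(W_lo, W_hi, x_lo, x_hi, b_lo, b_hi):
    # Range of sum_j W[k, j] * x[j] + b[k] when every W[k, j], x[j], b[k]
    # is only known to lie in an interval: take min/max over the four
    # endpoint products, term by term, then sum over j.
    cands = np.stack([W_lo * x_lo, W_lo * x_hi, W_hi * x_lo, W_hi * x_hi])
    return cands.min(axis=0).sum(axis=1) + b_lo, cands.max(axis=0).sum(axis=1) + b_hi

def certain_prediction(o_lo, o_hi):
    # Lemma 3: the label is determined iff some class's lower bound beats
    # every other class's upper bound.
    k = int(np.argmax(o_lo))
    return (o_lo[k] > np.delete(o_hi, k).max()), k

rng = np.random.default_rng(0)
d = rng.standard_normal(64)                       # the data point is known exactly
W1, b1 = rng.standard_normal((32, 64)), rng.standard_normal(32)
W2, b2 = rng.standard_normal((10, 32)), rng.standard_normal(10)
eps = 1e-3                                        # range implied by the dropped low-order bytes

h_lo, h_hi = interval_linear(W1 - eps, W1 + eps, d, d, b1 - eps, b1 + eps)
h_lo, h_hi = np.maximum(h_lo, 0.0), np.maximum(h_hi, 0.0)   # ReLU is monotone
o_lo, o_hi = interval_linear(W2 - eps, W2 + eps, h_lo, h_hi, b2 - eps, b2 + eps)
ok, label = certain_prediction(o_lo, o_hi)
# if not ok, the lower-order byte segments must be fetched and the model re-evaluated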
Next, we illustrate a query procedure that, given a data point d, evaluates a DNN under weight perturbations and determines the output perturbation on the fly. Recall that a DNN is a nested function (Section 4.2); we derive the output perturbation while evaluating the model, propagating the ranges step by step:

    x_{0,k} = Σ_j W_{0,k,j} d_j + b_{0,k}
    x_{0,k,min} = Σ_j min{W_{0,k,j} d_j} + min{b_{0,k}}
    x_{0,k,max} = Σ_j max{W_{0,k,j} d_j} + max{b_{0,k}}

Next, the activation function σ0 is applied. Most common activation functions are monotone functions R → R (e.g., sigmoid, ReLU), while pooling layers apply min, max, or avg over several dimensions. It is therefore easy to derive the range of the activation output, [f_{0,k,min}, f_{0,k,max}]. During the evaluation query, instead of a single actual output value per dimension, we carry a pair of bounds, since the actual parameter values are not available; the nonlinearity may shrink or widen the range.

Now the output perturbation at fi can be calculated similarly, except that both W and f_{i−1} are now uncertain:

    x_{i,k} = Σ_j W_{i,k,j} f_{i−1,j} + b_{i,k}
    x_{i,k,min} = Σ_j min{W_{i,k,j} f_{i−1,j}} + min{b_{i,k}}
    x_{i,k,max} = Σ_j max{W_{i,k,j} f_{i−1,j}} + max{b_{i,k}}

Applying these steps iteratively up to the last layer, we can then apply Lemma 3, the error-determinism condition, to check whether the result is guaranteed correct. If not, the lower-order segments of the float matrices are retrieved, and the evaluation is re-performed. This progressive evaluation technique dramatically improves the utility of PAS, as we further illustrate in our experimental evaluation. Note that other types of queries (e.g., matrix plots, activation plots, visualizations) can often be executed without retrieving the lower-order bytes either.

4.5 Evaluation Study

ModelHub is designed to work with a variety of deep learning backends; our prototype interfaces with caffe [72] through a ProvDB ingestor extension that can extract caffe training logs and read and write parameters for training. We have also built a custom layer in caffe to support progressive queries. Similar to the ProvDB shell ingestor, the dlv command-line suite is implemented as a Ruby gem, using git as the internal VCS and sqlite3 and PAS as backends to manage the set of heterogeneous artifacts in the local client. PAS is built in C++ and compiled with gcc 5.4.0. All experiments are conducted on an Ubuntu Linux 16.04 machine with an 8-core 3.0GHz AMD FX-380 processor, 16GB memory, and an NVIDIA GTX 970 GPU. We use zlib for compression; unless otherwise mentioned, the compression level is set to 6. When wrapping and modifying caffe, the code base version is rc3.

In this section, we present a comprehensive evaluation with real-world and synthetic datasets, aimed at examining our design decisions, the different configurations of PAS, and the performance of the archiving and progressive query evaluation techniques proposed in the earlier sections.

4.5.1 Dataset Description

4.5.1.1 Real World Dataset

To study the performance of the PAS design decisions, we use a collection of shared caffe models, listed in Table 4.5, published in the caffe repository or the Model Zoo. In brief, LeNet-5 [93] is a convolutional DNN with 431k parameters; the reference model has a 0.88% error rate on MNIST. AlexNet [94] is a medium-sized model with 61 million parameters, while VGG-16 [95] has 1.9 billion parameters. Both AlexNet and VGG-16 are tested on the ILSVRC-2012 dataset; the downloaded models have 43.1% and 31.6% top-1 error rates, respectively.
Besides, to study the delta performance on model repositories under different workloads (i.e., retraining, fine-tuning): we use VGG-16/19 and CNN-S/M/F [105], a set of similar models developed by VGG authors to study model variations. They are similar to VGG-16, and retrained from 89 Network |W | (flops) Size Purpose LeNet-5 4.31× 105 5MB Small model AlexNet 6× 107 200MB Medium model VGG-16 [95] 1.96× 1010 500MB Large model CNN-S/M/F [105] 1.13× 1010 199/300MB Similar models VGG-Salient 1.96× 1010 500MB Fine-tuning Table 4.5: Real World DNN Models used in the Experiment Study scratch; for fine-tuning, we use VGG-Salient [106] a fine-tuning VGG model which only changes last full layer. 4.5.1.2 Synthetic Datasets Lacking sufficiently fine-grained real-world repositories of models, to evaluate performance of parameter archiving algorithms, we developed an automatic modeler to enumerate models and hyperparameters to produce a dlv repository. We gener- ated a synthetic dataset (SD): simulating a modeler who is enumerating models to solve a face recognition task, and fine-tuning a trained VGG. SD results in similar DNNs and relatively similar parameters across the models. As retraining is always used, SD1 has a set of different models w.r.t. network architecture and parameters, while finetuning practice used in SD2 results in similar network architecture, and more similar parameters. The datasets are shared online4. To elaborate, the automation is driven by a state machine that applies model- 4Dataset Details: http://www.cs.umd.edu/~hui/code/modelhub 90 ing practices from the real world. For SD1, the modeler mutates the network archi- tecture intensively by inserting/deleting layers and changing layer shapes, as well as by updating optimization related hyperparameters. While in SD2, the modeler updates the VGG network architecture slightly and changes VGG object recognition goal to a face prediction task (prediction labels changed from 1000 to 100, so the last layer is changed); various fine-tuning hyperparameter alternations are applied by mimicking practice [107]. SD in total has 54 model versions, each of which has 10 snapshots. A snapshot has 16 parametric layers and a total of 1.96× 1010 floats. 4.5.2 Evaluation Results 4.5.2.1 Float Representation & Accuracy We show the effect of different float encoding schemes on compression and accuracy in Figure 4.8; this is a tradeoff that the user often needs to consider when configuring ModelHub to save a model. In Figure 4.8, for each scheme, we plot the average compression ratio versus the average accuracy drop when applying PAS float schemes on the three real-world models. Here, random and uniform denote two standard quantization schemes. As we can see, we can get very high compression ratios (a factor of 20 or so) without a significant loss in accuracy, which may be acceptable in many scenarios. 91 Figure 4.8: Compression-Accuracy Tradeoff for Float Representation Schemes 4.5.2.2 Delta Encoding & Compression Ratio Gain Next we study the usefulness of delta encoding in real-world models in the fol- lowing scenarios: a) Similar : latest snapshots across similar models (CNN-S/M/F, VGG-16); b) Fine-tuning : fine-tuning models (VGG-16, VGG-Salient); and c) Snapshots : snapshots for the same VGG models in SD between iterations. 
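The mechanics of this comparison can be sketched in a few lines of numpy: store the matrix as-is, or store an arithmetic-subtraction or bitwise-XOR delta against a nearby snapshot, and compare the zlib-compressed sizes. The synthetic matrices below only illustrate the procedure (their closeness is an assumption standing in for fine-tuning), not the numbers reported in Figure 4.9.

import zlib
import numpy as np

def zlib_size(arr, level=6):
    # zlib-compressed size of the raw bytes, as used by PAS
    return len(zlib.compress(arr.tobytes(), level))

rng = np.random.default_rng(1)
base = rng.standard_normal((512, 512)).astype(np.float32)                    # e.g., a latest snapshot
tuned = (base + 1e-3 * rng.standard_normal((512, 512))).astype(np.float32)   # a nearby snapshot

candidates = {
    "Materialize": tuned,                                                     # store the matrix as-is
    "Delta-SUB":   tuned - base,                                              # arithmetic subtraction
    "Delta-XOR":   np.bitwise_xor(tuned.view(np.uint32), base.view(np.uint32)),
}
for name, mat in candidates.items():
    print(name, zlib_size(mat))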
In Figure 4.9, for different delta schemes, namely, storing original matrices (Material- ize), arithmetic subtraction (Delta-SUB), and bitwise XOR diff (Delta-XOR), the comparison is shown (i.e., we show the results of compressing the resulting matrices using zlib). The figure shows the numbers under lossless compression scheme (float 32), which has the largest storage footprint. As we can see, delta scheme is not always good, due to the non-convexity and high entropy of parameters. For models under similar architectures, storing 92 Figure 4.9: Compression Performance for Different Delta Schemes & Models materialized original parameters is often better than applying delta encoding. With fine-tuning and nearby snapshots, the delta is always better, and arithmetic subtrac- tion is consistently better than bitwise XOR. We saw similar results for many other models. These findings are useful for PAS implementation decisions, where we only perform delta between nearby snapshots in a single model, or for the fine-tuning setting among different models. Table 4.6 shows the delta encoding results when using two lossy schemes, fixed point conversion and normalization, for fine-tuned VGG datasets, but without re- ducing the number of bits used (i.e., we still use 32 bits to store the numbers). Normalization refers to adding a sufficiently large number to all the floats so that the radixes and the signs are aligned, whereas fixed point conversion uses a sin- gle exponent for all the numbers in a matrix. As we can see, for both of these, delta encoding can result in sigificant gains. Introducing additional lossiness, e.g., 93 Schemes Configuration Materialize Delta-SUB Float Number Lossless 92.83% 86.39% Representation Lossless, bytewise 83.85% 76.89% Fix point 72.43% 57.15% Fix point, bytewise 58.68% 49.34% After Lossless 68.06% 47.69% Normalization Lossless, bytewise 56.15% 36.60% Fix point 69.11% 48.94% Fix point, bytewise 55.36% 36.88% Table 4.6: Delta Performance For Lossless & Lossy Schemes, 32-bits through using fewer bits, further improves the performance, but at the expense of significantly higher accuracy loss. 4.5.2.3 Optimal Parameter Archival Storage Figure 4.10 shows the results of comparing PAS-PT, PAS-MT and the base- line LAST [65] for the optimal parameter archival problem. Using dataset SD, we derive nearby snapshot deltas as well as model-wise deltas among the latest snap- shots. To compare with LAST clearly, we vary the recreation threshold using a scalar α to mimic a full precision archiving problem instance with different con- straints, i.e., Cr(T, si) ≤ α · Cr(SPT, si). The SPT for SD is 22.77Gb and the MST is 15.44Gb. In Figure 4.10, the left y-axis denotes the storage cost (Cs) while the 94 Figure 4.10: Comparing PAS Archival Storage Algorithms for SD right y-axis is the recreation cost (Cr). As we can see, in most cases, PAS-MT and PT find much better storage solu- tions that are very close to the MST (the best possible) by exploiting the recreation thresholds. In contrast, LAST, which cannot handle group constraints, returns worse storage plans and cannot utilize the recreation constraints fully. Between MT and PT, since MT starts from the MST and adjusts it, when the constraints are tight (i.e., α < 1.5), MT cannot alter it to very different trees and the recre- ation constraints are underutilized; however, PT can exploit the constraints when selecting edges to grow the tree. 
On the other hand, when the threshold is loose (α ∈ [1.5, 2]), MT’s edge swapping strategy is able to refine MST extensively, while PT prunes edges early and cannot find solutions close to MST. When the constraints 95 Storage Plan Query Independent (s) Parallel (s) Materialization Full 3.49 2.16 Min Storage Full 8.47 4.85 PAS (α = 1.6) Full 8.1 4.59 2 bytes 3.19 0.38 1 byte 1.60 0.18 Table 4.7: Recreation Performance Comparison of Storage Plans continue to loosen, both PAS algorithms find good plans, while LAST can only do so at very late stages (α > 3). In practice, the best option might be to execute both algorithms and pick the best solution for a given setting. 4.5.2.4 Retrieval Performance Next we show the retrieval performance for PAS storage plans using the SD dataset. The main query type of interest is snapshot retrieval, which would retrieve all segments of a snapshot or, for a partial retrieval query, the high-order segments. In Table 4.7, the average recreation time of a snapshot for a moderate PAS storage plan (α = 1.6) is compared with the two extreme cases, full materialization, and minimum storage without recreation constraints. As we can see, PAS is not only able to find good solutions which satisfy recreation constraints, but also supports flexible access schemes. Under partial access of high order bytes, the query times for segmented snapshots are better than uncompressing the fully materialized model. 96 4.5.2.5 Progressive Query Evaluation We study the efficiency of the progressive evaluation technique using pertur- bation error determination scheme on real-world models (LeNet, AlexNet, VGG16) and their corresponding datasets. The original parameters are 4-byte floats, which are archived in segments in PAS. We modify caffe implementation of involved layers and pass two additional blobs (min/max errors) between layers. The pertur- bation error determination algorithm uses high order segments, and answers eval query on the test dataset. The algorithm determines whether top-k (1 or 5) result needs lower order bytes (i.e., matched index value range overlaps with k + 1 index value range). The result is summarized in Figure 4.11. The y-axis shows the error rate. The x-axis shows the percentage of data that needs to be retrieved (i.e., 2 bytes or 1 byte per float). As one can see, the prediction errors requiring full pre- Figure 4.11: Progressive Evaluation Query Processing Using High-Order Bytes 97 cision lower-order bytes are very small. The less high-order bytes used, higher the chance of potential errors. The consistent result of progressive query evaluation on real models supports our design decision of segmented float storage. 4.6 Conclusion In this chapter, we described how to build domain-specific ProvDB extensions to address some of the key data management challenges in learning, managing, and adjusting deep learning models. ModelHub attempts to address those challenges in a systematic fashion. The goals of ModelHub are multi-fold: (a) to make it easy for a user to explore the space of potential models by tweaking the network architecture and/or the hyperparameter values, (b) to minimize the burden in keeping track of the metadata including the accuracy scores and the fine-grained results, and (c) to compactly store a large number of models and constituent snapshots without compromising on query or retrieval performance. 
We presented several high-level abstractions, including a command-line version management tool and a domain- specific language, for addressing the first two goals. Anecdotal experience with our early users suggests that both of those are effective at simplifying the model exploration tasks. We also developed a read-optimized parameter archival storage for storing the learned weight parameters, and designed novel algorithms for storage optimization and for progressive query evaluation. Extensive experiments on real- world and synthetic models verify the design decisions we made and demonstrate the advantages of proposed techniques. 98 Chapter 5: Querying Collaborative Analytics Lifecycle Provenance To support non-intrusive and extensible provenance ingestion mechanisms to collect rich information, ProvDB and other lifecycle management systems often use a graph data model (e.g., property graph) and query languages (e.g., Cypher) to represent and manipulate the stored provenance. However, due to the schema-later nature of the metadata, multiple versions of the same files, unfamiliar artifacts intro- duced by team members, and enormous provenance records collected continuously, querying the ingested provenance graph and utilizing it to a large extent is very challenging for the users. As the provenance graph is verbose and evolving, and the users only have partial knowledge of it, just using standard graph languages makes it very difficult to compose queries and utilize the valuable information. These ob- servations echo the development of provenance systems in other domains, such as cybersecurity, where a clear workflow is not present and query issuers need to deal with verbose and evolving provenance graphs. In this chapter, we propose general provenance graph query operators on stan- dard PROV graph data model to address the verboseness and evolving nature of such provenance graphs. We mainly focus on querying workflow provenance for data science lifecycles which are collaboration pipelines consisting of versioned artifacts 99 and parametrized derivation steps. We discussed studies on data provenance for data science systems in Chapter 2. In Section. 5.1, we first overview the issues of querying provenance graphs on modern graph databases, and motivate the problem. Then in Section 5.2, we review the standard PROV data model, and point out the challenges and desiderata of querying collaborative analytics provenance graphs. In Section 5.3 and 5.4, we introduce graph query operators to induce and summarize provenance subgraphs of interest by allowing the users to only have partial knowl- edge of the lifecycle. We show the semantics of such queries and propose efficient evaluation techniques on top of a property graph data store. Next, the implementa- tion of the operators and their position in ProvDB are illustrated in Section 5.5. In Section 5.6, we evaluate our query methods extensively on a variety of provenance graph datasets and show the effectiveness and efficiency of the proposed methods. 5.1 Introduction As mentioned in early chapters, provenance management has been identified as an important problem for the prosperous data science activities [50,85]. In general, capturing provenance allows the practitioners introspect the data analytics trajec- tories, monitor the ongoing modeling activities, and communicate the practice with others [85]. 
Compared with well-established data provenance systems for databases [5], and scientific workflow systems for e-science [3], building provenance systems for data science faces an unstable data science lifecycle that is often ad hoc, typically 100 featuring highly unstructured datasets, an amalgamation of different tools and tech- niques, significant back-and-forth among team members, and trial-and-error to iden- tify the right analysis tools, models, and parameters. Schema-later approaches and graph data model are often used to capture the lifecycle, versioned artifacts and associated rich information [50, 85], which also echoes the modern provenance data model standardization over a long period of time as a result of consolidation for scientific workflows [18] and the Web [19]. Though there is an enormous potential value of data science lifecycle prove- nance and high hype to reproduce the results or accelerate the modeling process, the evolving and verbose nature of the captured provenance graphs makes them difficult to store and manipulate. Depending on the granularity, storing the graphs could take dozens of GBs within several minutes [108]. More importantly, given the evolv- ing lifecycle and verbose provenance graphs, it is difficult to write general queries to explore the graph and utilize it, because there are no predefined workflows, i.e., the pipelines change as the project evolves, and because of arbitrary steps (e.g., trial and error) in the modeling process. Though storing the provenance graph in a graph database seems like a natural choice, most of the provenance query types of interest involve paths [109], and require returning paths instead of answering questions about reachability [21], which are beyond the capability of the pattern matching query (BPM) and regular path query (RPQ) support in popular modern graph databases [110, 111]. For example, answering ‘how is today’s result file gen- erated from today’s data file’ requires a segment of the provenance graph including not only the mentioned files but also other files that are not on the path and the 101 user may not know at all (e.g., ‘a configuration file’); answering ‘how do the team members typically generate the result file from the data file?’ requires summarizing several of the query results of the above query and visualize at different resolutions. Lack of proper query facilities in modern graph databases not only limits the value of lifecycle provenance systems for data science, but also of other provenance systems. Provenance queries have specialized query types of interest [21, 109], and provenance systems often implement specialized storage systems [112] and query interfaces [22,23] on their own [3]. Recent works on provenance graphs in the prove- nance community propose various graph transformations for different tasks, which are essentially different template queries from the graph querying perspective, such as grouping nodes together to handle publishing policies [24], summarizing verbose graphs by node types to understand commonalities and outliers [25], segmenting provenance graphs via declarative languages to support feature extractions for cy- bersecurity [26]. We attempt to draw connections between provenance graph query types of interest and modern graph databases capabilities while building a prove- nance system to aid the data analytics lifecycle. 
In this chapter, we propose two graph operators for common provenance queries to let the user explore the evolving provenance graph without fully un- derstanding the underlying provenance graph structure. The operators not only help our purpose in the context of data science but also the applications using stan- dard provenance data models [18,19] that have no clear workflow skeletons and are verbose in nature [25,26,108,113]. First, we introduce a flexible graph segmentation operator, which queries the 102 provenance of a collection of user-given nodes (e.g., versioned file snapshots, au- thors, an entered command) given boundary criteria (e.g., hops, timestamps, path patterns). We show the semantics of such a query in a context free language, and discuss the connection with RPQ and propose efficient evaluation techniques on top of a property graph store. Second, we propose a graph summarization operator for aggregating the segmentation results, which allows multi-resolution interaction with the query results to understand similar and abnormal behaviors in those segments. 5.2 Challenges & Desiderata Next, we give motivating examples, and introduce standard provenance data model and adaptations in our context. Then we summarize the typical provenance query types of interest and analyze them from the perspective of graph queries. 5.2.1 Motivating Example In the lifecycle of a data analytics project [?,77,78,83], given specific datasets (e.g., face images and labels) and a goal (e.g., prediction function from a face to a label with high accuracy), the data scientists in a team collaborate with each other and try different models repetitively. Using a lifecycle provenance management system [50,85], details of the project progress, versions of the artifacts and associated provenance are captured and managed. In Example 6, we use a classification task using neural networks to illustrate the system background, provenance model and query type. 103 Alice commit V1 Alice commit V2 Bob commit V3 Activity Entity Property Activity Entity Property Activity Entity Property copy dataset url: http://. create model ref: vgg16 update model ann: AVG create solver iter: 20000 update solver lr: 0.01 train opt: -gpu train opt: -gpu train opt: -gpu logs acc: 0.7 logs acc: 0.5 logs acc: 0.75 weight exp: v1 weight exp: v2 weight exp: v3 Figure 5.1: A Data Analytics Project Lifecycle Example & Associated Provenance Example 6 In Figure 5.1, Alice and Bob work together on a classification task to predict face ids given an image. Alice starts the project and creates a neural net- work by modifying a popular model. She downloads the dataset and edits the model definitions and solver hyperparameters, then invokes the training program with spe- cific command options. After training the first model, she examines the accuracy in the log file, annotates the weight files, then commits a version using git. As the accuracy of the first model is not ideal, she changes the neural network by editing the model definition, trains it again and derives new log files and weight parame- ters. However the accuracy drops, and she turns to Bob for help. Bob examines what she did, trains a new model following some best practices by editing the solver configuration in version v1, and commits a better model. Behind the scene, a lifecycle management system tracks user activities, man- ages project artifacts (e.g., datasets, models, solvers) and ingests provenance. 
In the Figure 5.1 tables, we show ingested information in detail: a) history of user activ- ities (e.g., the first train command uses model v1 and solver v1 and generates logs v1 and weights v1), b) versions and changes of entities (e.g., weights v1, v2 and v3) 104 and derivations among those entities (e.g., model v2 is derived from model v1), and c) provenance records as associated properties to activities and entities, ingested via provenance system ingestors (e.g., dataset is copied from some url, Alice changes a pool layer type to AVG in v2, accuracy in logs v3 is 0.75). 5.2.2 Provenance Model The ingested provenance of the project lifecycle naturally forms a provenance graph, which is a directed acyclic graph1 and encodes information with multiple aspects, such as a version graph representing the artifact changes, a workflow graph reflecting the derivations of those artifact versions, and a conceptual model graph showing the involvement of problem solving methods in the project [50, 85]. To represent the provenance graph and keep our discussion general to other provenance systems, we choose the W3C PROV data model [114], which is a standard inter- change model for different provenance systems. Different aspects of the provenance graph are supported via a rich set of query facilities (Section 5.5) on top of the PROV data model. The full PROV data model is complex in order to satisfy application needs for different domains [114]. For simplicity, we use the core subset of it, which is shown in Figure 5.2. There are three types of vertices (V) in the provenance graph: • Entities (E): are the project artifacts (e.g., files, datasets, scripts) which the users work on and talk about in a project, and the underlying lifecycle man- 1In our system, we use versioning to avoid cyclic self-derivations of the same entity and over- written entity generations by some activity. 105 Agent wasAttributedTo wasAssociatedWith wasDerivedFrom used Entity Activity wasGeneratedBy Figure 5.2: Illustration of the W3C PROV Data Model agement system manages their provenance. • Activities (A): are the system or user actions (e.g., train, git commit, cron jobs) which act upon or with entities over a period of time, [ti, tj). • Agents (U): are the parties who are responsible for some activity (e.g., a team member, a system component). Among the vertices, we focus our discussion to five types of directed edges2 (E): • ‘used’ (U⊆ A× E): An activity started at time ti often uses some entities. • ‘wasGeneratedBy’ (G⊆ E× A): Then some entities would be generated by the same activity at time tj (tj ≥ ti). • ‘wasAssociatedWith’ (S⊆ A× U): An activity is always associated with some agent during its period of execution. 2There are 13 type of relationships among Entity, Activity and Agent. The proposed techniques in the chapter can be extended naturally to support more relation types in other provenance systems. 106 For instance, in Example 6, the activity train was associated with Alice, used a set of artifacts (model, solver, and dataset) and generated other artifacts (logs, weights). • ‘wasAttributedTo’ (A⊆ E× U): Besides an entity’s presence would be at- tributed to some agent, e.g., the dataset in Example 6 was added from external sources and attributed to Alice. • ’wasDerivedFrom’ (D⊆ E× E): An entity would be derived from another entity, such as different versions of the same artifact (e.g., different model versions in v1 and v2 in Figure 5.1). 
In the provenance graph, both vertices and edges have a label to encode their vertex type in {E , A, U} or edge type in {U , G , S , A, D}. All other provenance records are modeled as properties, ingested by a set of configured project ingestors during the period of activity executions and represented as key-value pairs. PROV standard defines various serializations of the concept model, such as RDF, XML, and JSON [19]. In our system, we use a physical property graph data model to store it, as it is more natural for the users to think of the artifacts as nodes when writing queries using Cypher or Gremlin. It is also more compact than RDF graph for the large amount of provenance records, which are treated as literal nodes. We discuss implementation details in Section 5.5. As a summary, we formally define the provenance graph used in the rest of the chapter. Definition 3 (Provenance Graph) Provenance in a data analytics project is rep- resented as a directed acyclic graph, G(V,E, λv, λe, σ, ω), where vertices have three types, V= E∪ A∪ U , and edges have five types, E= U∪ G∪ S∪ A∪ D. Label 107 functions, λv: V 7→ {E , A, U}, and λe: E7→ {U , G , S , A, D} are total functions as- sociating each vertex and each edge to its type. Given a project, we refer to the set of property types as P and their values as V, then vertex properties σ: V× P7→ V and edge properties ω: E× P7→ V are partial functions from vertex/edge and property type to some value. Example 7 Using the PROV data model, in Figure 5.3, we show the corresponding provenance graph of the project lifecycle listed in Figure 5.1. Vertex shapes follow their type in Figure 5.2. Names of the vertices (e.g., ‘model-v1’, ‘train-v3’, ‘Alice’) are made by using their representative properties (i.e., project artifact names for en- tities, operation names for activities, and first names for agents) and suffixed using the version ids to distinguish different snapshots. Activity vertices are ordered from left to right w.r.t. the temporal order of their executions. We label the edges using their types and show a subset of the edges in Figure 5.1 to illustrate usages of five relationship types. Note there are many snapshots of the same artifact in different versions, and between the versions, we maintain derivation edges ‘wasDerivedFrom’ (D) for efficient versioning storage. The figure shows the provenance of those en- tities in all three versions. The property records are shown as white rectangles but not treated as vertices in the property graph model. 108 109 Alice Bob wasAssociatedWith wasAssociatedWith wasAssociatedWith wasAssociatedWith wasAssociatedWith wasAttributedTo train-v1 update-v2 train-v2 update-v3 train-v3 opt: -gpu acc: 0.75 used used used used used used used used used used wasAttributedTo used wasAttributedTo wasGeneratedBy wasGeneratedBy wasGeneratedBy wasGeneratedBy wasGeneratedBy wasGeneratedBy wasGeneratedBy wasGeneratedBy dataset-v1 model-v1 solver-v1 log-v1 weight-v1 model-v2 log-v2 weight-v2 solver-v3 log-v3 weight-v3 wasDerivedFrom wasDerivedFrom Figure 5.3: Provenance Graph for the Lifecycle Example (Some edges and properties are not shown) Characteristics The provenance graph has the following characteristics, which we need to consider when designing query facilities: • Versioned Artifact: Each entity is a point-in-time snapshot of some artifact in the project. 
For instance, the query ‘accuracy of this version of the model’ discusses a particular snapshot of the model artifact, while ‘what are the common updates for solver before train’ refer to the artifact but not an individual snapshot. R1: The query facilities need to support both aspects in the graph. • Evolving Workflows: Data analytics lifecycle is explorative and collabora- tive in nature, so there is no static workflow skeleton, and no clear boundaries for individual runs in contrast with workflow systems [3]. For instance, the modeling methods may change (e.g., from SVM to neural networks), the data processing steps may vary (e.g., split, transform or merge data files), and the user-committed versions may be mixed with code changes, error fixes, thus may not serve as boundaries of provenance queries for entities. R2: The query facility for snapshots should not assume workflow skeleton and should allow flexible boundary conditions. • Partial Knowledge in Collaboration: Each team member may work on and be familiar with a subset of artifacts and activities, and may use dif- ferent tools or approaches, e.g., in Example 6, Alice and Bob use different approaches to improve accuracy. When querying retrospective provenance of 110 the snapshots attributed to other members or understanding activity process over team behaviors, the user may only have partial knowledge at query time, thus may find it difficult to compose the right graph query. R3: The query facility should support provenance queries with partial information reflecting users’ understanding and induce correct result. • Verboseness for Usage: In practice, the provenance graph would be very verbose for humans to use and in large volume for the system to store. With similar goals to recent research efforts [21, 24–26], our system aims to let the users understand the essence of provenance at their preference level by trans- forming the provenance graph. R4: The query facility should be able to ag- gregate snapshots derivations, not only to reflect the commonalities but also anomalies among derivations. 5.2.3 Provenance Queries & Challenges Despite the exchange data model, the provenance standards (e.g., PROV, OPM) do not describe query models, as different systems have their own application- level meanings of those nodes [18,19]. Many provenance systems focus on ingestion methods and rely on standard query language (e.g., SQL, SPARQL, Cypher) pro- vided by the backend DBMS to let the user manipulate the underlying informa- tion [3]. However, general queries to express provenance retrieval tend to be very complex. To improve usability, a few systems provide novel query facilities [10, 23], and some of them propose special query languages [21, 22]. Recent provenance 111 systems which adopt W3C PROV data model naturally use graph stores as back- ends [108, 113]; while the usability of standard graph query language often cannot satisfy the needs [21,110], a set of graph manipulation techniques is often proposed to utilize the provenance [24–26]. Along the same lines, we aim to draw connections between the research to improve provenance graph usages with graph query techniques. By observing the characteristics of the provenance graph in analytics lifecycle and identifying the requirements for the query facilities (Section 5.2.2), we propose two graph opera- tors (i.e., segmentation and summarization) for general provenance graphs in PROV data model. We first illustrate the queries in examples and emphasize the differences from prior work. 
We defer their formal discussion to Sections 5.3 and 5.4.

5.2.3.1 Segmentation

A very important type of provenance query is querying the ancestors and descendants of entities; such queries go by different names (e.g., reachability, lineage) and have subtle differences in formulation (returning only true/false versus constructing paths; supporting regular path expressions). In our context, the users introspect the lifecycle and identify issues by analyzing dependencies among snapshots. The lack of a workflow skeleton and of clear boundaries makes these queries over the provenance graph more difficult. Also note that the user may not be able to specify all entities of interest in a query, due to partial knowledge. We propose a segmentation operator that allows the user to specify sets of source and destination entities; the operator induces the other important, unknown entities in the query result. We also allow a set of boundary criteria to address the verboseness of the returned graph.

Figure 5.4: Segmentation Query Examples (Query 1: src {dataset-v1}, dst {weight-v2}; Query 2: src {dataset-v1}, dst {log-v3}; boundary: exclude A, D edges, extend two activities)

Example 8 In Figure 5.4, we show two examples of the provenance graph segmentation query. In Query 1 (Q1), Bob was interested in what Alice did in version v2. He did not know the details of the activities and the entities Alice touched; instead he set {dataset, weight} as the querying entities to see how the weight in Alice's version v2 was connected to the dataset. To exclude actions in earlier commits (e.g., v1), he set the boundary as two activities away from the querying entities. In the figure, the system found connections among the querying entities and included the vertices within the boundaries. After interpreting the result, Bob knew that Alice updated the model definitions in model. On the other hand, Alice would ask a query to understand how Bob improved the accuracy and learn from him. In Query 2 (Q2), instead of the learned weight, the log entity with which the accuracy property is associated is used as a querying entity along with dataset. The result showed that Bob only updated the solver configuration and did not use her new model committed in v2.

In contrast with similar queries in scientific workflow provenance systems [3, 17], those systems' processes are predefined in template workflow skeletons, and multiple executions generate different instance-level provenance run graphs. By taking advantage of the workflow skeleton, there are lines of research on advanced ancestry query processing, such as defining user views over the skeleton to aid queries on verbose run graphs [23], executing reachability queries on the run graphs efficiently [115], storing run graphs generated by the workflow skeletons compactly [27], and using workflow visualization as examples to ease query construction [10].
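Purely for illustration, Q1 and Q2 can be written down as plain (V_src, V_dst, B) tuples; the sketch below uses hypothetical Python names for a client-side representation and is not ProvDB's actual query interface.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Boundary:
    exclude_edge_labels: List[str] = field(default_factory=list)  # e.g., A, D edges
    expand_activities: int = 0        # "k activities away" expansion

@dataclass
class PgSegQuery:
    src: List[str]                    # V_src, identified here by snapshot names
    dst: List[str]                    # V_dst
    boundary: Boundary

# Query 1 from Figure 5.4: how was weight-v2 derived from dataset-v1?
q1 = PgSegQuery(src=["dataset-v1"], dst=["weight-v2"],
                boundary=Boundary(exclude_edge_labels=["A", "D"],
                                  expand_activities=2))
# Query 2: the same question for Bob's log-v3.
q2 = PgSegQuery(src=["dataset-v1"], dst=["log-v3"],
                boundary=Boundary(exclude_edge_labels=["A", "D"],
                                  expand_activities=2))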
5.2.3.2 Summarization

In workflow systems, querying the workflow skeleton (a.k.a. prospective provenance) is an important use case (e.g., business processes [20]) and is included in the provenance challenge [109]. In our context, even though a static workflow skeleton is not present, summarizing activity commonalities and identifying abnormal behaviors are very useful query capabilities. However, general graph summarization techniques [116-118] are not applicable to provenance graphs due to the constraints of the data model definitions [24, 25, 119]. Inspired by graph analytics work [116, 117, 120] and graph manipulations specific to provenance graphs [24, 25], we propose a summarization operator with multi-resolution capabilities for provenance graphs. To support different aspects of the provenance, we design the summarization operator to be performed over query results of the segmentation operator; to ensure the rigidness of provenance entities, we use path constraints as vertex identifiers [25, 121].

Figure 5.5: Summarization Query Examples (Query 3: summarize Q1 & Q2; property aggregation: Entity 'filename', Activity 'command'; provenance type: neighborhood k = 1)

Example 9 We show a summarization query example in Figure 5.5. An outsider to the team (e.g., an auditor, a new team member, or a project manager) wanted an overview of the activities in the project. Segmentation queries (e.g., Q1, Q2 in Figure 5.4) only show individual trails of the analytics process at the snapshot level. The outsider issued a summarization query, Query 3 (Q3), by specifying the aggregation over the three types of vertices and defining the provenance types. The query result merges Q1 and Q2 into a summary graph GS according to the query. In the figure, vertex names in GS are suffixed with their provenance types to show alternative generation processes, while edges are labeled with their frequency of appearance among vertices in GS. The query issuer can change the query conditions to derive various GS at different resolutions.

In contrast with other summarization work [118], our operator is designed for provenance graphs, which include multiple types of nodes rather than a single node type [120, 122]; it works on query results rather than the entire graph structure [116]; the summarization requirements are specific to provenance graphs rather than general ones [117]; and we consider aggregating graph structure and property values together, which, to the best of our knowledge, has not been studied before.

In the following sections, we describe the proposed query models in detail. In Section 5.3, we introduce the segmentation operator (PgSeg) for snapshot-level retrospective analysis of the project activities. It returns an induced subgraph of the evolving workflow while allowing the users to have only partial knowledge of the project. In Section 5.4, a summarization operator (PgSum) is described for queries at the artifact level. It merges a set of segmentation results and generates a prospective overview for identifying commonalities and abnormalities at a chosen resolution via user-defined aggregations.

5.3 Segmentation Operation

Among the snapshots, the collected provenance graph describes important ancestry relationships, which form 'the heart of provenance data' [21]. Often lineages w.r.t. a query, or a run graph trace w.r.t. a workflow, are used to formulate ancestry queries in relational databases or scientific workflows [22].
However, in our context, there are no clear boundaries of logical runs, nor query scopes that cleanly define the input and the output. Though a provenance graph can be collected, the key obstacle is the lack of formalisms to analyze the verbose information. In a similar situation for querying script provenance [17], Prolog was used to traverse the graph imperatively, which may be overkill and requires an additional skill set from team members. In our system, we design PgSeg to let users who may have only partial knowledge query retrospective provenance. The PgSeg semantics induce a connected subgraph that shows the ancestry relationships (e.g., lineage) among the entities of interest and includes other causal and participating vertices within a specified boundary that is adjustable by the users. Next we first define the elements of the operator and the query semantics, followed by query evaluation techniques.

5.3.1 Semantics of Segmentation (PgSeg)

At a high level, we view the PgSeg operator as a 3-tuple query (V_src, V_dst, B) on a provenance graph G, asking how a set of source entities V_src ⊆ E is involved in generating a set of destination entities V_dst ⊆ E. PgSeg induces vertices (entities, activities, and agents) V_ind that show the detailed generation process and satisfy certain boundary criteria B. It returns a connected subgraph S(V_S, E_S) ⊆ G, where V_S = V_src ∪ V_dst ∪ V_ind and E_S = E ∩ (V_S × V_S).

When discussing the elements of PgSeg below, we use the following notation for paths in G. A path π_{v_0,v_n} connecting vertices v_0 and v_n is a vertex-edge alternating sequence ⟨v_0, e_1, v_1, ..., v_{n-1}, e_n, v_n⟩, where n ≥ 1, ∀i ∈ [0, n] v_i ∈ V, and ∀j ∈ (0, n] e_j = (v_{j-1}, v_j) ∈ E. Given a path π_{v_0,v_n}, we define its path segment π̂_{v_0,v_n} by simply dropping v_0 and v_n from the beginning and end of its path sequence, i.e., ⟨e_1, v_1, ..., v_{n-1}, e_n⟩.

A path label function τ maps a path π or path segment π̂ to a word by concatenating the labels of the elements in its sequence order. Unless specifically mentioned, the label of each element (vertex or edge) is derived via λ_v(v) and λ_e(e). For example, from a to c there is a path π_{a,c} = ⟨a, e_a, b, e_b, c⟩, where a, c ∈ E, b ∈ A, e_a ∈ G, and e_b ∈ U; its path label is τ(π_{a,c}) = E G A U E, and its path segment label is τ(π̂_{a,c}) = G A U.

For ease of describing path patterns, for the ancestry edges (used, wasGeneratedBy), i.e., e_k = (v_i, v_j) with label λ_e(e_k) = U or λ_e(e_k) = G, we introduce its virtual inverse edge e_k^{-1} = (v_j, v_i) with the inverse label λ_e(e_k^{-1}) = U^{-1} or λ_e(e_k^{-1}) = G^{-1}, respectively. An inverse path is defined by reversing the sequence, e.g., π_{a,c}^{-1} = ⟨c, e_b^{-1}, b, e_a^{-1}, a⟩, with τ(π_{a,c}^{-1}) = E U^{-1} A G^{-1} E and τ(π̂_{a,c}^{-1}) = U^{-1} A G^{-1}.
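To make the path-label notation concrete, the following small sketch computes τ for paths, path segments, and inverse paths over label sequences; it is a toy helper under our own naming, not part of the system.

def path_label(vertex_labels, edge_labels):
    """tau(pi): interleave vertex and edge labels, e.g. E,G,A,U,E -> 'EGAUE'."""
    out = [vertex_labels[0]]
    for e, v in zip(edge_labels, vertex_labels[1:]):
        out.append(e)
        out.append(v)
    return "".join(out)

def segment_label(vertex_labels, edge_labels):
    """tau(pi-hat): drop the first and last vertex of the path sequence."""
    return path_label(vertex_labels, edge_labels)[1:-1]

def inverse_segment_label(vertex_labels, edge_labels):
    """Label of the inverse path segment: reverse the sequence and invert U/G."""
    inv = {"U": "U-1", "G": "G-1", "U-1": "U", "G-1": "G"}
    rev_v = list(reversed(vertex_labels))
    rev_e = [inv[e] for e in reversed(edge_labels)]
    return segment_label(rev_v, rev_e)

# pi_{a,c} = <a, e_a, b, e_b, c> with a, c in E, b in A, e_a in G, e_b in U:
print(path_label(["E", "A", "E"], ["G", "U"]))              # EGAUE
print(segment_label(["E", "A", "E"], ["G", "U"]))           # GAU
print(inverse_segment_label(["E", "A", "E"], ["G", "U"]))   # U-1AG-1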
Next we discuss the PgSeg semantics and our rationale in detail.

5.3.1.1 Source (V_src) and Destination Entities (V_dst)

Provenance is about the entities. In a project, users know the committed snapshots (e.g., data files, scripts, and their metadata) better than the detailed processes generating them. When writing a PgSeg query, we assume the user believes V_src may be ancestor entities w.r.t. V_dst. PgSeg then reasons about the connectivity among V_src and V_dst and shows the other vertices and the generation process that the user may not know about and cannot write a query with. Note that the user may not know the order in which the entities came into existence either, so we allow V_src and V_dst to overlap, or even be identical. In the latter case, the query issuer could be a program [26] that is not familiar with the generation process at all.

5.3.1.2 Induced Vertices V_ind

Given V_src and V_dst, we refer to V_ind as the induced vertices (entities, activities, and agents) contributing to the generation process. Which vertices should be in V_ind is the core question. The set should reflect the generation process precisely and concisely, in order to help the user introspect the generation details and make decisions.

Prior work on inducing subgraphs from a set of vertices does not fit our needs. First, a lineage query would generate all ancestors of V_dst, which is neither concise nor even precise: siblings of V_dst and siblings of entities along the paths may be excluded, as they do not have a path from V_dst or to V_src in G. Second, at the other extreme, a provenance subgraph induced from some paths [22] or all paths [24] among vertices in V_src ∪ V_dst only includes vertices on the paths, and thus excludes other contributing ancestors of V_dst. Moreover, quantitative techniques from domains other than provenance cannot be applied directly either, such as keyword search over graph data [123], which likewise does not assume that users have full knowledge of the graph, lets users match vertices via keywords, and then induces a connected subgraph among the keyword vertices. However, these techniques often use tree structures (e.g., least common ancestor, Steiner tree) connecting V_src ∪ V_dst and are not aware of provenance domain knowledge, so they cannot reflect the ancestry relationships precisely.

Instead of defining V_ind quantitatively, we define PgSeg qualitatively by a set of domain rules: (a) to be precise, PgSeg includes other participating vertices that are not in the lineage and not on the paths among V_src ∪ V_dst; (b) to be concise, PgSeg uses the path shapes between V_src and V_dst given by the user as a heuristic to filter the ancestry lineage subgraph. We define and categorize the rules that generate subsets of V_ind as follows:

1. Vertices on Direct Path (V_ind^{C1}): Activities and entities along any direct path π_{v_j,v_i} between an entity v_i ∈ V_src and an entity v_j ∈ V_dst are the most important ancestry information. They help the user answer classic provenance questions, such as reachability, i.e., whether there exists a path, and workflow steps, i.e., if there is a path, which activities occurred. We refer to the entities and activities on such direct paths as V_ind^{C1}, defined as follows:

   V_ind^{C1} = ⋃_{v_i ∈ V_src, v_j ∈ V_dst} { v_k | ∃ π_{v_j,v_i} : v_k ∈ π̂_{v_j,v_i} }

Note that not only the shortest paths are of interest; all such paths π_{v_j,v_i} in the DAG G should be derived.

2. Vertices on Similar Path (V_ind^{C2}): Though V_ind^{C1} is important, due to the partial knowledge of the user, considering only the direct paths may miss important ancestry information, including: a) the entities generated together with V_dst, b) the entities used together with V_src, and c) more importantly, other entities and activities that are not on the direct paths but contribute to the derivations. The contributing vertices are particularly relevant to the query in our context, as a data analytics project consists of many back-and-forth, repetitive, and similar steps, such as preparing data in alternative ways, adjusting model templates, and evaluating experiments. To define the induction scope, on the one hand, all ancestors w.r.t. V_dst in the lineage subgraph could be returned; however, that is very verbose and not concise to interpret. On the other hand, it is also difficult to let the user specify all the details of what should or should not be returned.
Here we use a heuristic: induce ancestors that are not on the direct path but contribute to V_dst in a similar way, i.e., the path labels from V_ind^{C2} to V_dst are the same as those of some directed path from V_src. In other words, one can think of it as a radius concept [26] for slicing the ancestry subgraph w.r.t. V_dst, except that the radius is measured not by the number of hops away from V_dst but by path patterns between the V_dst and V_src entities specified in the user query. We first formulate the path pattern as a context-free language [124], L(SimProv); then V_ind^{C2} can be defined as an L-constrained reachability query from V_src via V_dst over G, accepting only path labels in the language.

A context-free grammar (CFG) over a provenance graph G and a PgSeg query Q is a 6-tuple (Σ, N, P, S, G, Q), where

• Σ = {E, A, U} ∪ {U, G, S, A, D} ∪ V_dst is the alphabet, consisting of the vertex labels and edge labels in G and the V_dst vertex identifiers (e.g., id in Neo4j) in Q;

• N is a set of non-terminals, and S is the start symbol;

• P is the set of production rules. Each production rule of the form l → (Σ ∪ N)* defines an acceptable way of concatenating the RHS words for the LHS non-terminal l.

Given a CFG and a non-terminal l_i ∈ N as the start symbol, a context-free language (CFL), L(l_i), is defined as the set of all finite words over Σ derivable by applying its production rules. The following CFG defines a language L(SimProv) that describes the heuristic path-segment pattern for the induced vertex set. The production rules expand from some v_j ∈ V_dst in both directions to reach v_i and v_k, such that the concatenated path π_{v_i,v_k} has the destination v_j in the middle:

   SimProv → G^{-1} E SimProv E G  |  U^{-1} A SimProv A U  |  G^{-1} v_j G    ∀ v_j ∈ V_dst

Now we can use L(SimProv) to define V_ind^{C2}: for any vertex v_k in V_ind^{C2}, there should be at least one path from a v_i ∈ V_src going through a v_j ∈ V_dst and v_k and then reaching some vertex v_t, such that the path segment label τ(π̂_{v_i,v_t}) is a word in L(SimProv):

   V_ind^{C2} = ⋃_{v_i ∈ V_src} { v_k | ∃ π_{v_i,v_t} : τ(π̂_{v_i,v_t}) ∈ L(SimProv) ∧ v_k ∈ π_{v_i,v_t} }

Using a CFG allows us to express the heuristic properly; note that L(SimProv) cannot be described by regular expressions over the path (segment) label, as it can be viewed as a palindrome language [124] (a small sketch of a membership check for L(SimProv) is given after this list). Moreover, it allows us to extend the query easily by using other label functions: for example, instead of λ_v(v) and λ_e(e), whose domains are PROV types, using the property values σ(v, p_i) or ω(e, p_j) in G allows us to describe interesting constraints, e.g., that the induced path should use the same commands as the path from V_src to V_dst, or that the matched entities on both sides of the path should be attributed to the same agent. For the former case, we can simply modify the second production rule of the CFG as follows:

   SimProv → U^{-1} σ(a_i, p_0) SimProv σ(a_j, p_0) U    s.t. a_i, a_j ∈ A ∧ p_0 = 'command' ∧ σ(a_i, p_0) = σ(a_j, p_0)

This is a powerful generalization that allows PgSeg to constrain the induction scope by describing repetitive and similar ancestry paths at a very fine granularity.

3. Entities Generated By Activities on Path (V_ind^{C3}): As mentioned earlier, the sibling entities generated together with V_dst may not be induced from directed paths. The same applies to the siblings of entities induced in V_ind^{C1} and V_ind^{C2}, if those siblings do not have paths to V_dst. We refer to these entities as V_ind^{C3} and define them as:

   V_ind^{C3} = ⋃_{v_i ∈ V'} { v | (v, v_i) ∈ G ∧ v ∉ V_ind^{C1} ∪ V_ind^{C2} },   where V' = (V_ind^{C1} ∪ V_ind^{C2}) ∩ A

4. Involved Agents (V_ind^{C4}): Finally, the agents in the provenance graph may be important in some situations, e.g., identifying from the derivation who made a mistake, similar to git blame in version control settings. On a provenance graph G, agents can be derived easily:

   V_ind^{C4} = ⋃_{v_i ∈ V'} { v_u | v_u ∈ U ∧ (v_i, v_u) ∈ S ∪ A },   where V' = V_src ∪ V_dst ∪ V_ind^{C1} ∪ V_ind^{C2} ∪ V_ind^{C3}

Note that we induce not only the agents of V_dst and V_src, but also those of the other induced vertices.
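The sketch referenced above checks whether a tokenized path-segment label belongs to L(SimProv) by peeling matching pairs off both ends until the base production G^{-1} v_j G remains; it is an illustration of the grammar only, and the token strings are our own encoding.

def in_simprov(tokens, dst_ids):
    """Membership check for a tokenized path-segment label in L(SimProv).

    tokens:  e.g. ["U-1", "A", "G-1", "v7", "G", "A", "U"]
    dst_ids: identifiers of V_dst vertices, e.g. {"v7"}
    """
    # Base production: G-1 v_j G for some v_j in V_dst.
    if len(tokens) == 3:
        return tokens[0] == "G-1" and tokens[1] in dst_ids and tokens[2] == "G"
    # Recursive productions peel a matching pair off both ends:
    #   G-1 E  SimProv  E G      or      U-1 A  SimProv  A U
    if len(tokens) >= 7:
        head, tail = tokens[:2], tokens[-2:]
        if (head == ["G-1", "E"] and tail == ["E", "G"]) or \
           (head == ["U-1", "A"] and tail == ["A", "U"]):
            return in_simprov(tokens[2:-2], dst_ids)
    return False

print(in_simprov(["G-1", "v7", "G"], {"v7"}))                          # True
print(in_simprov(["U-1", "A", "G-1", "v7", "G", "A", "U"], {"v7"}))    # True
print(in_simprov(["U-1", "A", "G-1", "v9", "G", "A", "U"], {"v7"}))    # False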
In the rest of the discussion, we assume that a) the provenance graph is stored in a backend property graph store, with constant time complexity to access an arbitrary vertex or edge by its primary identifier; and b) given a vertex, both its incoming and outgoing edges can be accessed equally, with linear time complexity w.r.t. the in- or out-degree. In our implementation (Section 5.5), we use Neo4j as our storage backend, which satisfies these conditions: both nodes and relationships are accessed via their id.

5.3.2.2 Induce Step

Given V_src and V_dst, PgSeg induces V_ind, which consists of the four categories above. We mainly focus our discussion on inducing vertices on direct and similar paths, as the other two types, i.e., sibling entities and related agents, can be derived in a straightforward manner by scanning the 1-hop neighborhoods of the first two result sets.

Cypher:
The definition of vertices on similar paths requires a context-free language and cannot be expressed by a regular language. When developing the system, we realized that it can be decomposed into two regular-language path segments and expressed using path variables [125, 126]. We handcrafted the Cypher query shown in Query 5.1. The query uses V_src (b) and V_dst (e1) to return all directed paths for V_ind^{C1} via a path variable (p1), and uses the Cypher with clause to hold the results. The second match finds the other half of the SimProv pattern via path variable p2, which is then joined with p1 by comparing node-by-node and edge-by-edge conditions to induce V_ind^{C2}.

match p1 = (b:E)<-[:U|G*]-(e1:E)
where id(b) in [0, 1] and id(e1) in [30, 31]
with p1
match p2 = (c:E)<-[:U|G*]-(e2:E)
where id(e2) in [30, 31]
  and extract(x in nodes(p1) | labels(x)[0]) = extract(x in nodes(p2) | labels(x)[0])
  and extract(x in relationships(p1) | type(x)) = extract(x in relationships(p2) | type(x))
return p2;

Query 5.1: Cypher Q1 for L(SimProv), V_src = {0, 1}, V_dst = {30, 31}

If we do not need to check properties, we can use length(p1) = length(p2) instead of the two extract clauses. However, as shown later in the evaluation (Section 5.6), Neo4j takes more than 12 hours to return results even for very small graphs with about a hundred vertices. Because regular path queries (RPQ) with path variables are not supported well in modern graph query languages and graph databases [111, 125], we develop our own PgSeg algorithms for provenance graphs.

CFL-reachability:
Given a vertex v and a CFL L, the problem of finding all vertices {u} such that there is a path π_{v,u} with label τ(π_{v,u}) ∈ L is often referred to as the single-source CFL-reachability (CFLR) problem, or the single-source L-transitive-closure problem [127, 128]. The all-pairs version, which aims to find all such pairs of vertices connected by an L path, has the same complexity. As V_src may be all vertices, we do not distinguish between the two in the rest of the discussion. Though the problem was first studied in our community [127], there has been little follow-up and little support in the context of modern graph databases. CFLR finds its main application in program analysis and is recognized as a general formulation for many program analysis tasks [128]: on graph representations of programs, tasks such as program slicing and pointer analysis can be described with a CFL that specifies path patterns. The state-of-the-art CFLR algorithm [129] solves the problem in O(n³/log n) time and O(n²) space w.r.t. the number of vertices in the graph.
It is based on a classic cubic-time dynamic programming scheme [128, 130] that derives production facts non-repetitively via graph traversal and uses the method of four Russians [131] during the traversal. In the rest of the discussion, we refer to it as CflrB. We first describe the algorithm briefly and then present improvement techniques for L(SimProv) on provenance graphs.

Given a CFG, CflrB works on its normal form [124], where each production has at most two RHS symbols, i.e., A → BC or A → B. We show the SimProv normal form in Figure 5.6.

   r0: Qd → v_j  ∀ v_j ∈ V_dst       r1: Lg → G^{-1} Qd | G^{-1} Re       r2: Rg → Lg G
   r3: La → A Rg                     r4: Ra → La A                        r5: Lu → U^{-1} Ra
   r6: Ru → Lu U                     r7: Le → E Ru                        r8: Re → Le E

Figure 5.6: SimProv Normal Form, SimProv → Re. Lg ⊆ A × E; Rg, La, Ra ⊆ A × A; Lu ⊆ E × A; Ru, Le, Re, Qd ⊆ E × E.

At a high level, the algorithm traverses the graph and uses the grammar as a guide to find new production facts N(i, j), where N is an LHS nonterminal, i and j are graph vertices, and a found fact N(i, j) denotes that there is a path from i to j whose path label satisfies N. To elaborate, similar to BFS, it uses a worklist W (a queue) to track newly found facts N(i, j), and a fast-set data structure H, with O(n/log n) time for set difference/union and O(1) for insert, to memorize the found facts. In the beginning, all facts F(i, j) from all single-RHS-symbol rules F → A are enqueued; in the SimProv case (r0 in Figure 5.6), each v_j ∈ V_dst is added to W as Qd(v_j, v_j). From W, the algorithm processes one fact F(i, j) at a time until W is empty. When processing a dequeued fact F(i, j), if F appears in any rule of the following forms:

   N(i, j) → F(i, j)       N(i, v) → F(i, j) A(j, v)       N(u, j) → A(u, i) F(i, j)

the new LHS facts N(i, v) are derived by the set difference {v ∈ A(j, v)} \ {v ∈ N(i, v)}, or N(u, j) by {u ∈ A(u, i)} \ {u ∈ N(u, j)}, in H. The new facts of N are then added to H, to avoid repetition, and to W, to be explored later. Once W is empty, the start-symbol facts L(i, j) in H include all vertex pairs (i, j) that have a path whose label satisfies L. If the paths themselves are needed, a parent table is used, similar to BFS. In SimProv (Figure 5.6), the start symbol is Re; for each v_i ∈ V_src, the Re(v_i, v_t) facts include all v_t such that between them there is a path with τ(π̂_{i,t}) ∈ L(SimProv).
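The following Python sketch captures the worklist scheme just described for a grammar in normal form, minus the four-Russians fast-set optimization and the parent table; the fact and rule encodings are our own simplifications, not the CflrB implementation.

from collections import deque

def cflr(edges, unary_rules, binary_rules, seed_facts):
    """Worklist-style CFL-reachability over labeled facts (label, u, v).

    edges:        base facts taken from graph edges (and vertex-label self-loops)
    unary_rules:  dict B -> set of N        for productions N -> B
    binary_rules: dict (B, C) -> set of N   for productions N -> B C
    seed_facts:   extra starting facts, e.g. ("Qd", v_j, v_j) for v_j in V_dst
    """
    facts = set(edges) | set(seed_facts)          # H: facts found so far
    work = deque(facts)                           # W: facts still to process
    by_src, by_dst = {}, {}                       # index facts by endpoints
    for f in facts:
        by_src.setdefault((f[0], f[1]), set()).add(f)
        by_dst.setdefault((f[0], f[2]), set()).add(f)

    def add(fact):
        if fact not in facts:
            facts.add(fact)
            work.append(fact)
            by_src.setdefault((fact[0], fact[1]), set()).add(fact)
            by_dst.setdefault((fact[0], fact[2]), set()).add(fact)

    while work:
        lbl, i, j = work.popleft()
        for n in unary_rules.get(lbl, ()):            # N -> B with B(i, j) found
            add((n, i, j))
        for (b, c), heads in binary_rules.items():    # N -> B C
            if lbl == b:                              # B(i, j) . C(j, v) => N(i, v)
                for (_, _, v) in by_src.get((c, j), set()).copy():
                    for n in heads:
                        add((n, i, v))
            if lbl == c:                              # B(u, i) . C(i, j) => N(u, j)
                for (_, u, _) in by_dst.get((b, i), set()).copy():
                    for n in heads:
                        add((n, u, j))
    return facts

# For SimProv, the base facts would be the graph's U and G edges, their virtual
# inverses U-1 and G-1, and E/A vertex-label self-loops; the rules would be those
# of the normal form in Figure 5.6, with start symbol Re.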
L(SimProv)-reachability on PROV:
Next we study the performance of CflrB for SimProv on a PROV graph and show that the fast-set method is not suitable for PROV graphs. We then further exploit grammar and PROV-graph properties: instead of the normal form, we rewrite the grammar to allow several pruning strategies, and we propose a linear-time algorithm if |V_dst| can be viewed as a constant.

Lemma 4 CflrB solves L(SimProv)-reachability on a PROV graph in O(|A||E|²/log|A| + |E||A|²/log|E|) time if using the fast set. Otherwise, it solves it in O(|G||E| + |U||A|) time.

Proof 3 On the SimProv normal form (Figure 5.6), for i ∈ [1, 8], CflrB derives r_i LHS facts from an r_{i-1} LHS fact dequeued from W (note it also derives r1 facts from r8 facts). For i ∈ {1, 2}, r_i(u, v) uses G edges in the graph during the derivation, e.g., from an r8 LHS fact Re to r1: Lg(u, v) → G^{-1}(u, k) Re(k, v). As Re(k, v) can only be in the worklist W once, each 3-tuple (u, k, v) is formed only once on the RHS, and there are at most |G||E| such 3-tuples. To make sure Lg(u, v) has not been found before, H is checked. If not using the fast set but an O(1)-time procedure per instance (u, k, v), it takes O(|G||E|) to produce the LHS facts; on the other hand, if using a fast set on u's domain A, then for each Re(k, v), O(|A|/log|A|) time is required, so it takes O(|A||E|²/log|A|) in total. Applying a similar analysis to r5 and r6, which use U edges to derive new facts, we see that it takes O(|E||A|²/log|E|) with the fast set and O(|U||A|) without it. Finally, r3, r4 and r7, r8 can be viewed as following a vertex self-loop edge and do not affect the complexity result.

In our context, the PROV graph is often sparse, and both the number of entities an activity uses and the number it generates can be viewed as small constants; however, the domain sizes of A and E are potentially large. The lemma shows a quadratic-time scheme for L(SimProv)-reachability if we can view the average in-/out-degree as a constant. Note that the quadratic time complexity is not surprising, as SimProv is a linear CFG, i.e., there is at most one nonterminal on the RHS of each production rule; the CFLR time complexity for any linear grammar on a general graph G(V, E) has been shown to be O(|V||E|) via a transformation to general transitive closures [127].

Rewriting SimProv:
Most CFLR algorithms require the normal form mentioned earlier. However, under the normal form, the algorithm a) introduces more worklist entries, and b) misses important grammar properties. Instead, we rewrite SimProv as shown in Figure 5.7 and propose SimProvAlg and SimProvTst by adjusting CflrB.

   r'1: Ee → v_j  ∀ v_j ∈ V_dst
   r'2: Aa → G^{-1} Ee G | U^{-1} Aa U | A Aa A | E Ee E

Figure 5.7: Proposed SimProv Rewriting, SimProv → Ee. Aa ⊆ A × A; Ee ⊆ E × E.

Compared with standard normal forms, r'1 and r'2 have more than two RHS symbols. SimProvAlg utilizes the rewritten grammar and PROV-graph properties to improve CflrB. Moreover, SimProvTst solves L(SimProv)-reachability on a PROV graph in linear time and sublinear space when viewing |V_dst| as a constant. The properties of the rewritten grammar, and how SimProvAlg and SimProvTst utilize them, are described below; they can also be used in other CFLR problems:

1. Reduction of Worklist tuples: Note that r'2 in Figure 5.7 combines rules r1 and r2 of the normal form, i.e., Rg(a1, a2) → Lg(a1, e2) G(e2, a2) and Lg(a1, e2) → G^{-1}(a1, e1) Re(e1, e2), into the single derivation Aa(a1, a2) → G^{-1}(a1, e1) Ee(e1, e2) G(e2, a2). Instead of enqueueing Lg(a1, e2) and then Rg(a1, a2), SimProvAlg adds Aa(a1, a2) to W directly. In the normal form, other instances may also derive Aa(a1, a2), i.e., in the presence of Lg(a1, ek) and G(ek, a2); in the worst case, CflrB enqueues |E| Lg facts in W that later find the same fact Rg(a1, a2). It is worth mentioning that in SimProvAlg, because Aa(a1, a2) can now be derived from many Ee(ei, ej) in r'2, before adding it to W we need to check whether it is already in W. In SimProvAlg, we use two pairs of bitmaps for Ee and Aa, for W and H respectively; the space cost is O(|E|²/log|E| + |A|²/log|A|). Compressed bitmaps could be used to improve space usage at the cost of non-constant-time random reads and writes.

2. Symmetric property: In the rewritten grammar, both nonterminals Ee and Aa are symmetric, i.e., Ee(ei, ej) implies Ee(ej, ei) and Aa(ai, aj) implies Aa(aj, ai), which does not hold in the normal form. Intuitively, Ee(e1, e2) means that some path label from e1 to a v_j ∈ V_dst is the same as some path label from e2 to v_j.
Using the symmetric property, SimProvAlg can apply a straightforward pruning strategy: only process (ei, ej) in H and W if id(ei) ≤ id(ej), and (ai, aj) only if id(ai) ≤ id(aj). It can also apply an early stopping rule: for any Aa(ai, aj) where both ai's and aj's order of being precedes all of the PgSeg query's V_src entities, we do not need to proceed further. Note that the early stopping rule is specific to SimProv and PROV graphs; when solving general CFLR, even the single-source version cannot take source information into account, and we must evaluate until the end. Though neither strategy used by SimProvAlg improves the worst-case time complexity, as shown later in the empirical studies (Section 5.6) they are very useful on realistic PROV graphs.

3. Transitive property: By definition, SimProv is not transitive, i.e., Ee(e1, e2) and Ee(e2, e3) do not imply Ee(e1, e3). This is because a PgSeg query allows multiple V_dst entities, so Ee(e1, e2) and Ee(e2, e3) may be found for different v_j ∈ V_dst. However, if we evaluate each v_j ∈ V_dst separately, then Ee and Aa are transitive, which leads to a linear algorithm, SimProvTst, for each v_j: instead of maintaining Ee(ei, ej) or Aa(ai, aj) tuples in H and W, we can use a set [e]_m or [a]_n to represent an equivalence class at iteration m or n, where any pair in the set is a fact of Ee or Aa, respectively. If at iteration m the current W holds a set [e]_m, then Aa(a1, a2) → G^{-1}(a1, e1) Ee(e1, e2) G(e2, a2) is used to infer the next W (a set [a]_{m+1}); otherwise W must hold a set [a]_m, and Ee(e1, e2) → U^{-1}(e1, a1) Aa(a1, a2) U(a2, e2) is used to infer the next equivalence class [e]_{m+1} as the next W. In the first case, as there are at most |G| possible (a, e) tuples, the step takes O(|G|) time; in the latter case, the step similarly takes O(|U|) time. The algorithm returns the vertices in any equivalence class [v_i]_m such that v_i ∈ V_src. Overall, because there are multiple V_dst vertices, the algorithm runs in O(|V_dst||G| + |V_dst||U|) time and O(|E|/log|E| + |A|/log|A|) space. The early stopping rule can be applied as well: instead of a pair of activities as in SimProvAlg, in SimProvTst all activities in an equivalence class [a]_m are compared with the entities in V_src in terms of their order of being. The pruning strategy is not necessary, as all pairs are represented compactly in an equivalence class.

Theorem 2 SimProvTst solves L(SimProv)-reachability on a PROV graph in O(|G| + |U|) time, if |V_dst| can be viewed as a constant.
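As a rough illustration of the per-destination, equivalence-class view behind SimProvTst, the sketch below alternates between activity and entity classes following the G and U edges; it omits the order-of-being early-stopping rule and the bitmap representation, and the function and argument names are our own assumptions.

def simprov_tst(generated_by, used_by, v_src, v_dst):
    """Per-destination propagation of entity/activity equivalence classes.

    generated_by: dict entity -> set of activities that generated it (G edges)
    used_by:      dict activity -> set of entities it used           (U edges)
    Returns entity classes that reach some V_src entity via SimProv-shaped paths.
    """
    result = set()
    for vj in v_dst:
        entity_class = {vj}                 # [e]_0 = {v_j}
        seen_e, seen_a = {vj}, set()
        while entity_class:
            # Aa step: activities that generated some entity in the class.
            activity_class = {a for e in entity_class
                              for a in generated_by.get(e, ())} - seen_a
            seen_a |= activity_class
            # Ee step: entities used by some activity in the class.
            entity_class = {e for a in activity_class
                            for e in used_by.get(a, ())} - seen_e
            seen_e |= entity_class
            if entity_class & set(v_src):   # the class pairs a V_src entity with
                result |= entity_class      # its similarly-derived counterparts
    return result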
5.3.2.3 Adjust Step

Once the induced graph S(V_S, E_S) is derived, the adjust step applies the boundary criteria to filter existing vertices and to retrieve more vertices. Compared with the induce step, applying boundary criteria is rather straightforward. For the exclusion constraints B_v and B_e, if present, we apply them linearly to the vertices and edges in S(V_S, E_S). For B_x, we traverse the backend store from the specified entities for 2k hops through G^{-1} and U^{-1} edges to their ancestor activities and entities. To support back-and-forth interaction, we cache the induced graph instead of inducing it multiple times. We expect k to be a small constant in our context, as the generated graph is meant for human users to interpret; otherwise, a reachability index is needed. For other settings where the two-step approach is not ideal, the exclusion constraints B_v and B_e can be evaluated together with induction using CflrB, SimProvAlg, and SimProvTst with small modifications: in CflrB, the label function F_v for B_v can be applied at r0, r3, r4, r7, r8 on A or E, while F_e for B_e can be applied at the remaining rules involving U and G; for SimProvAlg and SimProvTst, F_v and F_e can be applied together at r'1 and r'2.

5.3.2.4 Discussion

We focus on ad-hoc query evaluation schemes. As of now, the granularity of provenance is at the level of command executions, so the number of activities is constrained by the project members' work rate. In case the PROV graph becomes extremely large, indexing techniques and incremental algorithms would be more practical.

5.4 Summarization Operation

In a collaborative analytics project, the provenance graph collected from the repetitive modeling trials records the verbose steps of various pipelines and reflects the different roles and work routines of the team members. Using PgSeg, users can navigate to their segments of interest, which may concern similar pipeline steps. For example, the query result of a single PgSeg(V_src, V_dst, B), e.g., 'yesterday's input data and prediction result', shows a pipeline subgraph S1 describing how V_dst (the prediction result) was derived from V_src (the input data), together with other induced vertices V_ind (e.g., modeling steps). As there is no skeleton for the pipeline, given a different tuple (V²_src, V²_dst, B²) that produces another segment S2, e.g., 'today's input data and model result', the pipeline subgraph S2 may or may not be the same as S1. Given a set of segments, our design goal for PgSum is to produce a precise and concise provenance summary graph, Psg, which not only allows the users to see the commonality among those segments of interest (e.g., yesterday's and today's pipelines are almost the same), but also lets them understand alternative routines (e.g., an old step excluded from today's pipeline). Though no workflow skeleton is defined, with this ability Psg enables the users to reason about prospective provenance in the evolving workflows of analytics projects.

One way to combine PgSeg segment graphs is to use context-free graph grammars (CFGG) [20], which are able to capture recursive substructures. However, without a predefined workflow-skeleton CFGG, and due to the workflow noise resulting from the nature of analytics workloads, inferring a minimum CFGG from a set of subgraphs is not only an intractable problem but also likely to lead to complex graph grammars that are more difficult for users to understand [132]. Instead, we view the task as graph summarization, by grouping vertices and edges in the set of segments into a Psg.

5.4.1 Semantics of Summarization (PgSum)

Although many graph summarization schemes have been proposed over the years [118], they are aware of neither the provenance domain desiderata [25] nor the meaning of PgSeg segments. Most of that work focuses on finding smaller representations of a very large graph via methods such as compression [133], attribute aggregation [116], and bisimulation [134], while a few works aim at combining a set of query-returned trees [135] or graphs [117] into a compact representation. Our PgSum operator falls into the latter category and is tailored for PgSeg segments, which consist of similar or alternative steps among a set of entities of interest. A PgSum query takes a 2-tuple (K, R_k) as input, which describes the level of detail of the vertices and constrains the rigidness of the provenance; it then outputs a minimum provenance summary graph (Psg).
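For illustration, a PgSum query such as Q3 in Example 9 can be captured by a small (K, R_k) tuple; the sketch below uses hypothetical Python names and is not the system's templated query interface.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class PgSumQuery:
    # K: property aggregation, i.e., which properties to keep per vertex type.
    keep_props: Dict[str, List[str]] = field(default_factory=dict)
    # R_k: provenance type, i.e., radius of the neighborhood used as vertex type.
    neighborhood_k: int = 1
    # Segments to summarize (results of earlier PgSeg queries).
    segments: List[str] = field(default_factory=list)

# Query 3 from Figure 5.5: aggregate entities by filename and activities by
# command, ignore agent identities, and use 1-hop neighborhoods as types.
q3 = PgSumQuery(keep_props={"E": ["filename"], "A": ["command"], "U": []},
                neighborhood_k=1,
                segments=["Q1", "Q2"])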
5.4.1.1 Property Aggregations & Provenance Types

Given a set of segments S, each of which, S_i, is a PgSeg result, to combine vertices and edges across the segments we first introduce two concepts: a) property aggregation (K) and b) provenance type (R_k), which PgSum takes as input and which allow the user to obfuscate vertex details while retaining structural constraints.

Property Aggregation (K):
Similar to an attribute-aggregation summarization query on a general graph [116], depending on the granularity level of interest, not all details of a vertex are interesting to the user, and some properties should be omitted so that vertices can be combined. For example, the user may care neither about who performs an activity nor about an activity's detailed configuration; in the former case, all agent-type vertices, regardless of their property values (e.g., name), should be indistinguishable, and in the latter case, activity vertices of the same type but with different configuration settings (e.g., training parameters) in various PgSeg segments should be viewed as if the same thing (e.g., a training activity) had happened. Formally, the property aggregation K is a 3-tuple (K_E, K_A, K_U), where each element is a subset of the PROV graph property types, i.e., K_E, K_A, K_U ⊆ P (Definition 3). When used in a PgSum query, it discards the other, irrelevant properties of each vertex type, e.g., properties of entity (E) vertices in P \ K_E are ignored. For example, in Figure 5.5, K_E = {'filename'}, K_A = {'command'}, and K_U = ∅, so properties such as the version of an entity, the details of an activity, and the names of the agents are ignored.

Provenance Type (R_k):
In contrast with general-purpose graphs, in a provenance graph vertices with identical labels and property values may still be very different [25]. For example, two otherwise identical activities that use different numbers of inputs or generate different entities should be considered different. In [25], Moreau proposes using a recursive definition over a vertex's k-hop neighborhood to assign a vertex type for later aggregation. However, that definition ignores the input/output degrees of vertices, and the recursive definition is exponential w.r.t. k; it is worth mentioning that the former issue occurs in bisimulation-based methods as well [134]. We extend the idea of preserving the provenance meaning of a vertex and use the k-hop local neighborhood of a vertex to capture its provenance type: given a PgSeg segment S(V_S, E_S) and a constant k ≥ 0, the provenance type R_k(v) is a function that maps a vertex v ∈ V_S to its k-hop neighborhood subgraph in its segment S, R_k(v) ⊆ S. For example, in Figure 5.5, k = 1, so the provenance type of a vertex is its 1-hop neighborhood; the vertices with labels 'update', 'model', and 'solver' each have two different provenance types (marked 't1', 't2'). Note that one can generalize the definition of R_k(v) to a subgraph within the k-hop neighborhood of v satisfying a pattern matching query, which has been proposed in [121] with application to entity matching, where, similar to provenance graphs, the vertex properties alone are not enough to represent the conceptual identity of a vertex.
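As a simplified illustration of K and R_k together (reusing the toy ProvGraph sketched after Definition 3), the following computes a canonical signature of a vertex's k-hop neighborhood under a property aggregation; the definition in the text is isomorphism-based, so this string-signature version is only an approximation, and the function names are ours.

def vertex_key(g, v, keep_props):
    """Vertex label plus the property values retained under aggregation K."""
    vert = g.vertices[v]
    kept = tuple(sorted((p, vert.props.get(p, ""))
                        for p in keep_props.get(vert.label, [])))
    return (vert.label, kept)

def provenance_type(g, v, keep_props, k=1):
    """Approximate R_k(v): a canonical signature of v's k-hop neighborhood."""
    if k == 0:
        return vertex_key(g, v, keep_props)
    out = sorted(("out", e.label, provenance_type(g, e.dst, keep_props, k - 1))
                 for e in g.out_edges.get(v, []))
    inc = sorted(("in", e.label, provenance_type(g, e.src, keep_props, k - 1))
                 for e in g.in_edges.get(v, []))
    return (vertex_key(g, v, keep_props), tuple(out), tuple(inc))

# Two vertices can only be grouped by PgSum if (among the other conditions below)
# their provenance_type signatures agree.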
Vertex Equivalence Relation (≡_κ^k):
Given a set of segments S = {S_i(V_{S_i}, E_{S_i})}, denoting the union of their vertices as V_S = ⋃_i V_{S_i}, with the user-specified property aggregation K and provenance type R_k, we define a binary relation ≡_κ^k over V_S × V_S such that for each vertex pair (v_i, v_j) ∈ ≡_κ^k:

a) the vertex labels are the same, i.e., λ_v(v_i) = λ_v(v_j);

b) all property values in K are equal, i.e., ∀p ∈ K, σ(v_i, p) = σ(v_j, p);

c) R_k(v_i) and R_k(v_j) are graph isomorphic w.r.t. the vertex labels and the properties in K, i.e., there is a bijection f between V_i ∈ R_k(v_i) and V_j ∈ R_k(v_j) such that f(v_m) = v_n only if 1. λ_v(v_m) = λ_v(v_n), 2. ∀p ∈ K, σ(v_m, p) = σ(v_n, p), and 3. ∀(v_m, v_t) ∈ E_i, (v_n, f(v_t)) ∈ E_j.

It is easy to see by inspection that ≡_κ^k is an equivalence relation on V_S. Using ≡_κ^k, we can derive a partition P_{≡_κ^k} of V_S such that each set in the partition is an equivalence class under ≡_κ^k, denoted [v], with [v_i] ∩ [v_j] = ∅ and ⋃_i [v_i] = V_S. For each [v], we can define a canonical label, e.g., the smallest vertex id, for comparing vertices. In other words, the vertices in each equivalence class [v] under ≡_κ^k are the homogeneous candidates that can be merged by PgSum. The definition not only allows users to specify property aggregations K to obfuscate unnecessary details at different resolutions, but also lets them set the provenance type R_k in order to preserve local structures and ensure the provenance meaning of a merged vertex.

5.4.1.2 Provenance Summary Graph (Psg)

Next, we introduce the output summary graph of the PgSum operator, Psg.

Desiderata:
Due to the nature of provenance, the produced Psg should be precise, i.e., we should preserve the paths that exist in one or more segments while not introducing any path that does not exist in any segment. On the other hand, Psg should be concise; the more vertices we can merge, the better the summarization result is considered to be. In addition, as a summary, to show the commonality and the rareness of a path among the segments, we annotate each edge with its frequency of appearance in the segments.

Minimum Psg:
PgSum combines segment vertices within their equivalence classes and ensures that the paths in the output summary graph satisfy the above conditions. Next we define a valid summary graph. Given a set of segments S = {S_i(V_{S_i}, E_{S_i})} and a PgSum(K, R_k) query, a provenance summary graph, Psg(M, E, ρ, γ), is a directed acyclic graph where:

• Each μ ∈ M represents a subset of an equivalence class, μ ⊆ [v], w.r.t. ≡_κ^k over V_S, and one segment vertex v can be in only one Psg vertex μ, i.e., ∀μ_m, μ_n ∈ M, μ_m ∩ μ_n = ∅; the vertex label function ρ: M → P_{≡_κ^k} maps a Psg vertex to its equivalence class;

• An edge e_{m,n} = (μ_m, μ_n) ∈ E exists iff there is a corresponding edge in some segment, i.e., ∃S_i: (μ_m × μ_n) ∩ E_{S_i} ≠ ∅; the edge label function γ: E → [0, 1] annotates the edge's frequency over the segments, i.e., γ(e_{m,n}) = |{S_i | (μ_m × μ_n) ∩ E_{S_i} ≠ ∅}| / |S|;

• There is a path π_{m,n} from μ_m to μ_n iff there exist S_i, v_s ∈ μ_m ∩ V_{S_i}, and v_t ∈ μ_n ∩ V_{S_i} such that there is a path π_{s,t} from v_s to v_t in S_i and their path labels are the same, τ(π_{m,n}) = τ(π_{s,t}). Note that for Psg we use the canonical label of each equivalence class (e.g., the smallest vertex id) as the vertex label in τ.

It is easy to see that ⋃_i S_i, the union of all segments in S, is a valid Psg. However, we are interested in a concise summary with fewer vertices. The best Psg one can obtain is the optimal solution of the following problem.
Problem 3 (Minimum Psg) Given a set of segments S and a PgSum(K, R_k) query, find a Psg(M, E, ρ, γ) with minimum |M|.

5.4.1.3 Discussion

Though requiring that all paths in the Psg exist in some segment may look strict and may limit the compactness of the result, the PgSum operator allows the property aggregation (K) and the provenance type (R_k) to be used to tune the compactness of the Psg. Due to the rigidness and the utility of provenance, allowing paths in the summary that do not exist in any segment would cause misinterpretation of the provenance and thus would not be suitable for our context. In situations where extra paths in the summary graph are not an issue, problems with objectives such as minimizing the number of introduced extra paths, or minimizing the description length, are interesting ones to explore further.

5.4.2 Query Evaluation

Given S = {S_i(V_{S_i}, E_{S_i})}, after applying K and R_k, the Psg g0 = ⋃_i S_i is a labeled graph and contains all paths in the segments; to find a smaller Psg, we have to merge vertices of V_S = ⋃_i V_{S_i} in g0 while keeping the Psg invariant, i.e., not introducing new paths. In order to describe the merging conditions, we introduce trace equivalence relations in a Psg:

• in-trace equivalence (≃_in^t): If for every path π_{a,u} ending at u ∈ V_S there is a path π_{b,v} ending at v ∈ V_S with the same label, i.e., τ(π_{a,u}) = τ(π_{b,v}), we say u is in-trace dominated by v, denoted u ≤_in^t v. u and v are in-trace equivalent, written u ≃_in^t v, iff u ≤_in^t v ∧ v ≤_in^t u.

• out-trace equivalence (≃_out^t): Similarly, if for every path starting at u there is a path starting at v with the same label, we say u is out-trace dominated by v, written u ≤_out^t v. u and v are out-trace equivalent, i.e., u ≃_out^t v, iff u ≤_out^t v ∧ v ≤_out^t u.

Lemma 5 Merging u into v does not introduce new paths if and only if 1) u ≃_in^t v, or 2) u ≃_out^t v, or 3) u ≤_in^t v ∧ u ≤_out^t v.

The lemma defines a partial order over the vertices in V_S. By applying it, we can merge vertices in a Psg greedily until no such pair exists, thereby deriving a minimal Psg. However, checking in-/out-trace equivalence is PSPACE-complete [136], which implies that the decision and optimization versions of the minimum Psg problem are PSPACE-complete.

Theorem 3 Minimum Psg is PSPACE-complete.

Instead of checking trace equivalence, we use simulation relations as its approximation [137, 138]; simulation is less strict than bisimulation and can be computed efficiently, in O(|V_S||E_S|) time on the Psg g0 [137]. A vertex u is in-simulate dominated by a vertex v, written u ≤_in^s v, if i) their labels in the Psg are the same, i.e., ρ(u) = ρ(v), and ii) for each parent p_u of u, there is a parent p_v of v such that p_u ≤_in^s p_v. We say u and v in-simulate each other, u ≃_in^s v, iff u ≤_in^s v ∧ v ≤_in^s u. Similarly, u is out-simulate dominated by v (u ≤_out^s v) if ρ(u) = ρ(v) and for each child c_u of u there is a child c_v of v such that c_u ≤_out^s c_v; u and v out-simulate each other (u ≃_out^s v) iff they out-simulate dominate each other. Note that a binary relation r_a approximates r_b if (e_i, e_j) ∈ r_a implies (e_i, e_j) ∈ r_b [138]. In other words, if (u, v) in-/out-simulate each other, then (u, v) are in-/out-trace equivalent. By using simulation instead of trace equivalence in Lemma 5 as the merge condition, we can ensure the invariant.

Lemma 6 If 1) u ≃_in^s v, or 2) u ≃_out^s v, or 3) u ≤_in^s v ∧ u ≤_out^s v, then merging u into v does not introduce new paths.
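For illustration, the out-simulation preorder used in Lemma 6 can be computed with a naive fixpoint; the sketch below is quadratic rather than the O(|V_S||E_S|) algorithm of [137] used by the system, and the names are ours.

def out_simulation(children, label):
    """Naive fixpoint for the out-simulation preorder (pairs (u, v) with u <=s_out v).

    children: dict vertex -> list of child vertices
    label:    dict vertex -> vertex label in the summary graph (rho)
    """
    verts = list(label)
    # Start from all same-label pairs and iteratively remove violations.
    rel = {(u, v) for u in verts for v in verts if label[u] == label[v]}
    changed = True
    while changed:
        changed = False
        for (u, v) in list(rel):
            ok = all(any((cu, cv) in rel for cv in children.get(v, []))
                     for cu in children.get(u, []))
            if not ok:
                rel.discard((u, v))
                changed = True
    return rel

# In-simulation is the same computation on the reversed graph (parents instead of
# children); Lemma 6 then licenses merging u into v when the two preorders hold in
# the stated combinations.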
We develop the PgSum algorithm by using the partial order derived from the Lemma 6 merge conditions in a Psg (initialized as g0) to merge the vertices. To compute ≤_in^s and ≤_out^s, we apply the similarity checking algorithm in [137] twice, in O(|V_S||E_S|) time. From Lemma 6, we can ensure that no new path is introduced, and the merging operation does not remove paths, so the PgSum algorithm finds a valid Psg. Note that, unlike Lemma 5, the reverse of Lemma 6 does not hold, so we may not be able to find the minimum Psg: there may be pairs (u, v) that are trace equivalent but not in simulation.

5.5 System Implementation

The proposed operators and techniques are implemented in ProvDB (Chapter 3) for data science project lifecycles. We highlight the positions of the proposed operators within ProvDB in Figure 5.8.

Figure 5.8: Position of Proposed Graph Query Operators in ProvDB

As mentioned in Chapter 3, ProvDB collects provenance via a git-like command line interface (CLI) called provdb, which is meant to be prefixed onto the command line executions of day-to-day commands (e.g., provdb ingest 'cp model.config new model.config') by the users. The context of the executed command (Activity, User) and the changes of project artifacts (Entity) occurring before/during/after the execution can therefore be captured by provdb and stored in a property graph store; currently we use Neo4j. On the other hand, ProvDB can replace git and act as a data version control system (DVCS) to manage the versioned project artifacts; it is tailored for analytics project artifacts, such as large model files consisting mainly of floating-point numbers (Chapter 4). Once the versions of the project artifacts and the provenance of the team activities are managed by ProvDB, the data science team members and stakeholders can analyze the lifecycle provenance using the rich set of query facilities in ProvDB (Chapter 3). The segmentation and summarization operators proposed in this chapter are implemented with a templated query interface in the front end:

• For the segmentation operator, the users can identify snapshots of interest via a search interface and specify boundary criteria within the HTML GUI;

• For the summarization operator, previous segmentation results are selectable in the GUI, and the property aggregation and the provenance type can be specified accordingly.

The returned results of both operators can be visualized and interacted with via d3.js. The query execution engine for the proposed operators, and the algorithms described in Sections 5.3 and 5.4, are implemented in Java, using Neo4j in embedded mode for better performance.

Discussion
Note that the design decision of using a general-purpose native graph backend (Neo4j) may not be ideal for high-performance provenance ingestion, as the volume of ingested provenance records can be very large in some applications; e.g., whole-system provenance recording at the kernel level [108, 112] generates GBs of data in minutes. Support for flexible and high-performance graph ingestion on modern graph databases, together with efficient query evaluation, remains an open question [139]. We leave supporting similar operators for general PROV graphs to future work.
The techniques proposed in this chapter focus on enabling better utilization of the ingested provenance information via novel query facilities, and they are orthogonal to the storage layers of provenance systems.

5.6 Evaluation Study

In this section, we study the proposed operators and techniques comprehensively. We first evaluate the efficiency of the proposed techniques for the segmentation operator by varying provenance graph properties and comparing with the state-of-the-art CFLR algorithm [129, 130] and the handcrafted Cypher query. We then study the effectiveness of the summarization operator on segmentation results with different degrees of stability and compare with a summarization technique for graph query results [117]. All experiments are conducted on an Ubuntu Linux 16.04 machine with an 8-core 3.0GHz AMD FX-380 processor and 16GB of memory. For the backend property graph store, we use Neo4j 3.2.5 community edition in embedded mode and access it via its Java APIs. The ProvDB query operators are implemented in Java in order to work with the Neo4j APIs. To limit the performance impact of Neo4j, we always use the node id to seek nodes, which can be done in constant time in Neo4j's physical storage. Unless specifically mentioned, the page cache for Neo4j is set to 2GB, the JVM version is 1.8.0_25, and -Xmx is set to 4GB.

5.6.1 Dataset Description

Unless lifecycle provenance management systems (e.g., Ground [50], ProvDB) are used by practitioners for a long period of time, it is difficult to obtain real-world provenance graphs from data science teams. Though using a VCS (e.g., git) is common practice, VCS repositories only contain versions of artifacts, not the activities that occurred between commits. Publicly available real-world PROV provenance graph datasets in various application domains [25] are very small (KBs). We instead develop several synthetic PROV graph generators to examine different aspects of the proposed operators. The datasets and the generators are available online at http://www.cs.umd.edu/~hui/code/provdbquery.

5.6.1.1 Provenance Graphs & PgSeg Queries

To study the efficiency of PgSeg, we generate a provenance graph dataset (PG) for collaborative analytics projects by mimicking a group of project members performing a sequence of activities in a lifecycle management system. Each project artifact has many versions, and each version is an entity in the graph. An activity uses one or more input entities and produces one or more output entities. To elaborate, given N, the target number of vertices in the output graph, we introduce |U| = ⌊log(N)⌋ agents. To determine who performs the next activity, we use a Zipf distribution with skew sw to model their work rates. Each activity is associated with an agent, uses 1 + m input entities, and generates 1 + n output entities; m and n are generated from two Poisson distributions with means λi and λo to model different input and output sizes. In total, the generator produces |A| = ⌊N/(2 + λo)⌋ activities, so that at the end of generation the sum of entities |E|, activities |A|, and agents |U| is close to N. The m input entities are picked from existing entities; the probability of an entity being selected is modeled as the pmf of a Zipf distribution with skew se over its rank in the reverse order of creation. If se is large, an activity tends to pick the most recently generated entities, while if se is small, earlier entities have a better chance of being selected. The n output entities are always new entities, each being either the first version of an artifact or a new version of an existing artifact; for the latter, we add a derivation edge to an ancestor entity chosen uniformly. We use the following default parameter values: sw = 1.2, λi = 2, λo = 2, and se = 1.5. In Table 5.1, we show information about the generated provenance graphs for varying N ∈ [50, 100, 500, 1000, 5000, 10000, 50000].

Name   PG50   PG100   PG500   PG1k   PG5k   PG10k   PG50k
|E|      37      88     343    796    3676    7580   37749
|A|      13      26     126    251    1251    2501   12501
|U|      37      74     381    726    3784    7473   37532
|G|      33      84     342    794    3670    7579   37743

Table 5.1: Summary of Provenance Graph Datasets (PG)
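The following sketch mirrors the sampling logic of the PG generator described above (Zipf work rates, Poisson in-/out-degrees, Zipf-skewed input selection); it simplifies away artifact versioning and derivation edges, and its names and structure are our own assumptions rather than the released generator.

import numpy as np

def generate_pg(n_target, sw=1.2, lam_in=2, lam_out=2, se=1.5, seed=0):
    """Toy version of the PG generator's sampling steps."""
    rng = np.random.default_rng(seed)
    n_agents = max(1, int(np.log(n_target)))
    agent_w = 1.0 / np.arange(1, n_agents + 1) ** sw      # Zipf work rates
    agent_w /= agent_w.sum()

    entities, activities, edges = ["e0"], [], []           # a seed entity
    while len(entities) + len(activities) + n_agents < n_target:
        act = f"a{len(activities)}"
        activities.append((act, int(rng.choice(n_agents, p=agent_w))))
        # Inputs: 1 + Poisson(lam_in) existing entities, newest ones favored (skew se).
        n_in = min(1 + rng.poisson(lam_in), len(entities))
        rank_w = 1.0 / np.arange(1, len(entities) + 1) ** se
        rank_w /= rank_w.sum()
        for r in rng.choice(len(entities), size=n_in, replace=False, p=rank_w):
            edges.append(("U", act, entities[len(entities) - 1 - int(r)]))
        # Outputs: 1 + Poisson(lam_out) brand-new entities (derivation edges omitted).
        for _ in range(1 + rng.poisson(lam_out)):
            e = f"e{len(entities)}"
            entities.append(e)
            edges.append(("G", e, act))                     # e wasGeneratedBy act
    return entities, activities, edges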
On the provenance graphs in PG, we pick pairs (V_src, V_dst) as the PgSeg queries to evaluate. Unless specifically mentioned, given a PG dataset, V_src consists of the first two entities and V_dst of the last two entities, as they are always connected by some path, which makes the query the most challenging PgSeg instance. In one evaluation, we vary V_src to show the effectiveness of the proposed pruning strategy.

5.6.1.2 Similar Segments & PgSum Queries

In order to study the effectiveness of PgSum, we design a synthetic generator (Sd) with the ability to vary the shapes of conceptually similar provenance graph segments. In brief, the intuition is that the stability of the underlying pipelines tends to differ at different stages of a project, which can affect the effectiveness of the summarization operator; e.g., at the beginning of a project, many activities (e.g., clean, plot, train) happen one after another in no particular order, while at later stages there are more likely to be stable pipelines, i.e., an activity type (e.g., preprocessing) is always followed by another activity type (e.g., train). For PgSum, the former case is more challenging than the latter. In detail, we model a segment as a Markov chain with k states and a transition matrix M ∈ [0, 1]^{k×k} among the states. Each row of the transition matrix is generated from a Dirichlet prior with concentration parameter vector α⃗, i.e., the i-th row is a categorical distribution for state i, and each M_{ij} represents the probability of moving to state j, i.e., of picking an activity of type j. We set a single α for the vector α⃗; for higher α the transitions tend toward a uniform distribution, while for lower α the probability mass is more concentrated, i.e., activities are picked from fewer types. Given a transition matrix, we generate a set of segments S, each of which consists of n activities labeled with the k types, derived step by step using the transition matrix. For the input/output entities and edges of each activity, we use λi, λo, and se in the same way as in PG, and all introduced entities have the same equivalence class. We vary α, |S|, k, and n to study the effectiveness of PgSum on different sets of segments. A PgSum query is applied to each S and produces a Psg; the effects of property aggregation and provenance types are reflected in the label assignment process above.
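A minimal sketch of the Sd generator's core, assuming a Dirichlet-sampled transition matrix and per-step Poisson input/output counts; the entity wiring and the equivalence-class assignment of the released generator are omitted.

import numpy as np

def generate_segments(n_segments=10, k=5, n_acts=20, alpha=0.1,
                      lam_in=2, lam_out=2, seed=0):
    """Activity-type sequences drawn from a Markov chain with Dirichlet(alpha) rows."""
    rng = np.random.default_rng(seed)
    M = rng.dirichlet([alpha] * k, size=k)      # k x k transition matrix
    segments = []
    for _ in range(n_segments):
        state = int(rng.integers(k))
        seg = []
        for _ in range(n_acts):
            state = int(rng.choice(k, p=M[state]))    # next activity type
            seg.append((state, 1 + int(rng.poisson(lam_in)),
                        1 + int(rng.poisson(lam_out))))
        segments.append(seg)                    # (type, #inputs, #outputs) per step
    return segments

# Lower alpha concentrates each row of M on a few successor types (stable
# pipelines); higher alpha makes transitions near-uniform (noisy early-stage work).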
5.6.2 Evaluation Results

5.6.2.1 Segmentation Operator

Next, we show the evaluation results for the PgSeg algorithms. We compare our algorithms SimProvAlg and SimProvTst with the state-of-the-art general context-free language reachability algorithm, CflrB [129], which uses bit-based set operations to improve on Reps' dynamic programming algorithm [130]. We implement the fast set using a Java BitSet in order to have constant random access time, which is not the case for compressed bitmap alternatives. We also compare with the Cypher query mentioned in Section 5.3.2 in Neo4j.

Varying Graph Size N

In Figure 5.9, we study the scalability of all algorithms. The x axis denotes N of the PG graph, while the y axis shows the runtime in seconds to finish the PgSeg query. Note that both axes are log-scale. As we can see, SimProvAlg and SimProvTst run at least one order of magnitude faster than CflrB on all PG datasets. The main reasons are the utilization of the properties of the grammar and the efficient pruning strategies.

Figure 5.9: Comparing Cypher Query 5.1, CflrB, SimProvAlg and SimProvTst Efficiency by Varying Graph Size N

Note that CflrB runs out of memory on PG50k because its worklist grows much faster than SimProvAlg's, as CflrB uses normal forms and introduces an extra level. SimProvAlg runs slightly faster than SimProvTst for small instances but becomes much slower for large instances; e.g., on PG50k it is 3x slower than SimProvTst for the query. The reason SimProvAlg is slightly faster on small instances is that SimProvTst runs |Vdst| times on the graph and each run's performance gain is not large enough. As the size of the graph instance increases, the advantage of SimProvTst from using the transitivity property becomes significant.

On the other hand, the Cypher query can only return a correct result for the very small graph PG50 and takes orders of magnitude longer. Surprisingly, even for PG100, it runs over 12 hours and we have to terminate it. From its query profiler, we learn that Neo4j uses a path variable to hold all paths and joins them later, which is exponential w.r.t. the path length and the average out-degree. Due to the expressiveness of the query language, grammar properties cannot be used by the query planner.

Varying Input Selection Skew se

Next, in Figure 5.10, we study the effect of different selection behaviors on PG10k. The x axis is se and the y axis is the runtime in seconds in log-scale. In practice, some analytics activities tend to try many model alternatives to get the best performance for an analytics task, e.g., through a grid search over hyperparameters or by changing a neural network architecture, while in other analytics activities there are long chains of data transformation pipelines, e.g., feature engineering efforts. The former types of projects tend to always take an early entity as input (e.g., dataset, label), while the latter tend to take new entities (i.e., the output of the previous pipeline step) as inputs. Tuning se in opposite directions can mimic those project behaviors, as it tunes the probability of earlier entities being selected as inputs. In Figure 5.10, we vary se from 1.1 to 2.1, and the result is quite stable for SimProvAlg, SimProvTst and CflrB, which implies the query formulation and techniques can be applied to different project types with similar performance.

Varying Activity Input Mean λi

We study the effect of varying the density of the graph in Figure 5.11 on PG10k. The x axis varies the mean λi of the number of input entities. The y axis shows the runtime in seconds.
With a larger λi, the number of used edges |U| increases linearly, and thus the algorithms' runtimes increase as well. In Figure 5.11, we see that SimProvAlg grows much more slowly than CflrB: due to the pruning strategies, the growth in worklist utilization is avoided in SimProvAlg. SimProvTst further improves on SimProvAlg by utilizing the transitivity property.

Figure 5.10: Evaluating Performance for Different Project Stereotypes by Varying se
Figure 5.11: Impact of Input Size λi on CflrB, SimProvAlg and SimProvTst

Effectiveness of Early Stopping

The above evaluations all use the most challenging PgSeg query on the start and end entities. In practice, we expect users to ask queries whose results they can understand via simple visualization. CflrB and general CFLR algorithms do not have early stopping properties, whereas SimProvAlg and SimProvTst use the temporal constraints of the provenance graph to stop growing the result early. In Figure 5.12, we vary the Vsrc of a PgSeg query and study the performance on PG50k. The x axis is the starting position among all the entities, e.g., x = 20 means Vsrc is selected at the 20th percentile of entities ranked by creation order. The y axis is the runtime in seconds.

Figure 5.12: Effectiveness of SimProvAlg and SimProvTst with Early Stopping

As we can see, the shorter the gap between Vsrc and Vdst, the shorter the SimProvAlg and SimProvTst runtimes. By utilizing this property of PROV graphs, we get better performance empirically even though the worst-case complexity does not change.

5.6.2.2 Summarization Operator

Given a set of segments S = {S_i(V_{S_i}, E_{S_i})}, PgSum generates a precise summary graph Psg(M, E) by definition. Here we study its effectiveness in terms of conciseness. We use the compaction ratio defined as cr = |M| / |⋃_i V_{S_i}|. As there are few graph query result summarization techniques available, in our study we compare with pSum [117], which is designed for summarizing a set of graphs returned by keyword search graph queries. pSum works on undirected graphs and preserves the paths among keyword pairs, and was shown to be more effective than summarization techniques for one large graph, e.g., SNAP [116]. To make pSum work on PgSeg segments, we introduce a conceptual (start, end) vertex pair as the keyword vertices, and let the start vertex connect to all vertices in S having in-degree 0, and similarly let the end vertex connect to all vertices having out-degree 0. In the rest of the experiments, by default, α = 0.1, k = 5, n = 20 and |S| = 10, and the y-axis denotes the compaction ratio cr in all figures.
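The adaptation of pSum is mechanical; the sketch below (our own helper using networkx, not part of pSum's code) wraps a segment with the conceptual start/end keyword vertices described above so that an undirected, keyword-pair-preserving summarizer can be applied:

import networkx as nx

def add_keyword_endpoints(segment: nx.DiGraph) -> nx.DiGraph:
    """Wrap a PgSeg segment with conceptual (start, end) keyword vertices for pSum."""
    g = segment.copy()
    g.add_node("__start__")
    g.add_node("__end__")
    for v in segment.nodes:
        if segment.in_degree(v) == 0:      # sources: connect from the start keyword vertex
            g.add_edge("__start__", v)
        if segment.out_degree(v) == 0:     # sinks: connect to the end keyword vertex
            g.add_edge(v, "__end__")
    return g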
Varying Transition Concentration α

In Figure 5.13, we change the concentration parameter to mimic segment sets at various stages of a project with different stability. The x axis denotes the value of α in log-scale. As α increases, the transition probabilities tend toward a uniform distribution; in other words, the pipeline is less stable, paths are more likely to differ, and the vertex pairs that could be merged become infrequent.

Figure 5.13: Studying PgSum on Segments at Different Stages of a Project by Varying Concentration α

As we can see, the PgSum algorithm always performs better than pSum, and the generated Psg is about half the size of the result produced by pSum, as pSum cannot combine some 'tin and 'tout pairs, which are important for workflow graphs. This finding is consistent across the other experiments.

Varying Activity Types k

Next, in Figure 5.14, we vary the number of possible transition states, which reflects the complexity of the underlying pipeline. It can also be viewed as the effect of using property aggregations on activities (e.g., distinguishing commands with the same name but different options). Increasing k leads to more distinct path labels, as shown in Figure 5.14, and it makes the summarization less effective. Note that when varying k, the number of activities n in a segment is set to 20, so the effect of k on the compaction ratio tends to disappear as k increases.

Figure 5.14: Studying PgSum on Segments with Different Complexity by Varying Activity Types k

Varying Segment Size n

We vary the size of each segment n while fixing α and k to study the performance of PgSum. Intuitively, the larger the segment is, the more intermediate vertices there are, and the intermediate vertices are less likely to satisfy the merging conditions. As shown in Figure 5.15, the compaction ratio increases as the input instances become more difficult.

Figure 5.15: Studying PgSum on Segments with Different Sizes by Varying the Number of Activities n

Varying Number of Segments |S|

With all the shape parameters set (α = 0.25), we increase the number of similar segments. As the segments are generated from the same transition matrix, they tend to have similar paths. As shown in Figure 5.16, the compaction ratio becomes better when more segments are given as input.

Figure 5.16: Studying PgSum Effectiveness by Varying |S|

5.7 Conclusion

In this chapter, we described the key challenges of querying provenance graphs generated in evolving workflows without predefined skeletons, such as the ones collected by lifecycle management systems in collaborative analytics projects. At query time, the users only have partial knowledge about the ingested provenance, due to the schema-later nature of the properties, multiple versions of the same files, unfamiliar artifacts introduced by team members, and the enormous provenance records collected continuously. Using only a standard graph query model, it is very difficult to compose queries and utilize the valuable information. We presented two high-level graph query operators to address the verbose and evolving nature of such provenance graphs. First, we introduced a graph segmentation operator that allows the users to provide only the vertices they are familiar with and then induces a subgraph representing the retrospective provenance of the vertices of interest. We formulated the semantics of such a query in a context-free language, and developed efficient algorithms on top of a property graph backend. Second, we described a graph summarization operator that combines the results of multiple segmentation queries to help the users understand similar and abnormal behaviors in those conceptually similar segments with multi-resolution capabilities.
Extensive experiments on synthetic provenance graphs with different project characteristics show that the operators and evaluation techniques are effective and efficient. The operators are also applicable for querying provenance graphs generated in other scenarios without workflow skeletons, such as cybersecurity and system diagnosis.

Chapter 6: Discovering Hosted Analytics Projects

Alongside the development of systems for scalable machine learning and collaborative data science activities, there is an increasing trend toward publicly shared data science projects, hosted in general or dedicated hosting services, such as GitHub, DataHub, Model Zoo, etc. The artifacts of the hosted projects are rich and include not only text files, but also versioned datasets, trained models, project documents, etc. With lifecycle management systems like ProvDB, a shared project will even include provenance of the lifecycle and a rich set of metadata about those artifacts. Given the fast pace and expectations of data science activities, we argue that model discovery, i.e., finding relevant data science projects to reuse, is an important task.

In this chapter, we study the discovery task and explore system designs and research problems. Proper investigation of this issue requires a large number of hosted projects, which our ProvDB system does not yet have. On the other hand, in general collaborative systems, such as GitHub, there is no clean structured data model for data science projects. Instead of structured queries, we propose a system vision based on an information retrieval approach, and decompose the discovery task into three major steps: 1) project query and matching, 2) model comparison and ranking, and 3) processing and building ensembles with the returned models. We first describe the motivation and desiderata of the model discovery problem and overview ongoing efforts to enrich project repositories in Section 6.1. Then we describe our system vision, and present opportunities, challenges and techniques for each step in Section 6.2. In Section 6.3, we illustrate preliminary evaluation results of the proposed techniques.

6.1 Motivation

In the "Big Data" era, datasets are collected blindly in different domains and industries by logging user and system behavior or labeling data via crowd-sourcing, and there is a large demand for data science projects that find value in them. With more experienced practitioners and better system support to utilize the collected data and help data science activities, more and more data science projects are built and shared. For example, there are tens of thousands of hosted GitHub projects using IPython/Jupyter notebooks (Figure 6.1) and the numbers have grown rapidly in the past few years; in 2016, GitHub made 'Jupyter Notebook' a supported language type in the search engine for its repositories. For predictive models, due to the widely used finetuning techniques and long training times, deep learning models have been shared by the community, initially by publishing models on authors' websites; now training systems tend to host models at a central portal for practitioners, e.g., Caffe Model Zoo.
Along the same lines, ProvDB and ModelHub, described in previous chapters of this dissertation, also include a hosting service for data science projects with comprehensive model revision histories and provenance information generated during the process. In feature-engineering-based modeling practice, users tend to try many combinations of features; recent commercial platforms save the produced models and their context to help future modeling activities.

Figure 6.1: Growth of GitHub Repositories using IPython Notebooks (based on the GitHub Weekly Dump Dataset on Google BigQuery)

6.1.1 Model Discovery

Given the presence of large collections of data science projects uploaded by different groups of data scientists and hosted centrally by a system, the problem of model discovery is underserved by existing hosting services and is not well studied by machine learning lifecycle management tools. Model discovery is the problem of identifying relevant projects for a data science practitioner who is working on her own project and wants to find references, by asking queries such as:

• 'What projects used a dataset similar to mine on a classification problem (e.g., US census data, 256x256 images)?',
• 'Show me a set of diverse projects which explore this specific dataset or use this particular modeling method (e.g., random forest)',
• 'Find or ensemble a model from projects having high recall but reasonable accuracy on a given validation dataset'.

Current hosting systems, such as GitHub, cannot answer such queries, as they do not distinguish data science projects from open source software repositories or data scientists from software developers, and they do not understand modeling artifacts and the associated metadata and provenance in the lifecycle.

Example 10 For example, a modeler wishes to identify others' GitHub projects that predict mortgage default rates using a linear regression method in Python. Using today's systems, the user could issue 'site:github.com predict mortgage default rate linear regression python' on a modern search engine. However, general web search treats each URL separately and does not index a repository as a whole, so the returned results cannot rank the most relevant repository properly; instead, individual files, such as a dataset or a code snippet with some of the search keywords in its comments, are returned. On the other hand, if the user uses GitHub directly, it currently matches queries against repository names, and no repository may be returned.

Category                            Examples
Physical file                       Data file properties, schema, size, author, content; script AST, runtime profiling, library dependencies, etc.
Logical artifact                    Model type, metric value, hyperparameters of a model, user annotations on a model, etc.
Steps in a physical file            System IO calls, library methods (e.g., logistic regression, pandas dataframe), DNN network architecture, and their dependencies, etc.
Steps in a logical artifact         Analysis step type (data loading, cleaning, feature engineering), step input, output, transitions, etc.
Relations among physical files      File versions, file dependencies in the execution history, system library dependency, etc.
Relations among logical artifacts   Logical artifact versions, e.g., earlier version for a model, lineages from modeling branches, etc.
Human collaborations                Users' contributions, communications, notes, comments, cognitive annotations, etc.
Figure 6.2: Enriched Information for Data Science Projects

We envision that, with a properly designed model discovery system, a hosted data science project site should be able to help practitioners answer those queries, accelerate the modeling lifecycle, and deliver results faster.

6.1.2 Enriched Project Repository

Data science projects are ad hoc, exploratory and have many iterations. Without proper data and lifecycle management systems for the modeling process, they are no more than a collection of files, and possibly file histories if a version control system is used.

Similar to ProvDB described in earlier chapters, there is recently a line of work proposing what we call lifecycle management systems (LMS), which capture provenance metadata and manage modeling artifacts and lifecycles beyond the files, so that practitioners can ask detailed queries about the artifacts and the workflow [15, 17, 40, 50, 82, 83, 85] and potentially accelerate or even automate some steps in the lifecycle. Though the focus and the provenance data model vary across these systems, we argue that a rich representation of a data science project can be derived and stored. Such information, for modeling artifacts, includes the modeling method types, parameters, hyperparameters, input data and output result files, etc.; for project workflows, it includes file-level dependencies, function dependency graphs in scripts, and temporal dependencies among different versions of files. We summarize the enriched information captured in different works in Figure 6.2. The enriched information from these ongoing efforts opens opportunities for novel approaches to the model discovery problem. In the rest of the discussion, we assume a data science project is enriched and managed by such an LMS, so that the project includes files as well as the rich information captured by such systems.

6.2 Model Discovery System Vision

In this section, we present a system design to approach the model discovery problem. We refer to the system as ModelHub Discovery. To simplify the discussion, we focus on predictive models, i.e., given an observation dataset with labels, the goal is to find a prediction function that fits the observations. In the current prototype development, we also assume the models are calls to existing library routines with user-provided parameters, without special customization or self-implementation (for example, writing one's own logistic regression implementation in Python due to project-specific requirements). We argue that the impact of this assumption is acceptable, as it covers many data science activities and it could be addressed during the development of LMS.

6.2.1 System Overview

Data Model

ModelHub Discovery is a server-side service, designed to be used with client-side LMS, in particular the ModelHub and ProvDB systems discussed in earlier chapters. The detailed data model for a project is described in Figure 3.1: it uses a schema-later approach with a property graph data model for ingested information and a minimal schema for version control. It has logical views (i.e., models and artifacts that lift file snapshots) and heterogeneous backends for different artifacts (e.g., a binary store for weights, a graph store for properties, a version index, etc.). When discussing techniques, we keep them general enough to be used with other LMS.

Query Model

An important design decision we made for the query facility is supporting NLP keyword queries as a first-class citizen, as in information retrieval (IR) systems.
Using NLP queries, practitioners can easily describe project background, goals, datasets and methods using keywords and documents. We chose this option over category-based project organization or structured queries with ontological conditions because:

• There are many text files (e.g., READMEs, comments) and symbols (e.g., library methods) in a project; however, different teams have their own dialects and preferences, and there is a lack of a unified ontology or schema over data science projects;
• In structured queries, the relevance ranking of matched projects needs to be provided by the users;
• Many structural properties (e.g., accuracy, loss) of a model are project-specific, so a query condition over them across all projects does not make sense; they are more meaningful as adjunctive filters to match similar projects.

Besides NLP queries, we envision a query facility by datasets. In DataHub, datasets are hosted with identifiers, versions and possibly schemas. When a user needs to explore a dataset, which might be public, proprietary, or hosted in DataHub, allowing her to use dataset information, such as structured metadata (i.e., name, columns) or samples of the dataset (i.e., a set of rows), to query hosted projects is a very useful navigation facility. However, it requires the LMS to capture detailed file dependencies for a good user experience.

System Architecture

We present the preliminary design of the ModelHub Discovery system in Figure 6.3, showing the query processing pipelines and the data serving components. Following an information retrieval approach, we represent each project as a document using a bag-of-words model, which includes words from its text artifacts, function symbols, code comments, etc. The vector-space model [140] is then used, and the project is indexed using the Apache Lucene full-text indexing engine. The project catalog is a graph store that combines enriched information from lifecycle management systems; it is used in adjunctive conditions to filter the matched projects and to constrain their relevance to the query.

Figure 6.3: Overview of the ModelHub Discovery Pipeline

To deduplicate and rank the returned projects and produce diverse results, a comparison and ranking module is proposed. It uses pairwise similarities of projects and prunes query results. In Section 6.2.2, we discuss the similarity problem among models and propose a novel strategy, which also allows embedding models as vectors, accelerates all-pairs similarity search, and enables learning-to-rank strategies. When the returned models solve the same problem on a dataset (e.g., deep learning models for ImageNet), we discuss the possibility of incorporating model ensemble techniques as a post-processing step (Section 6.2.3), in order to support cases where the users have additional constraints in their projects, such as precision-recall, accuracy-fairness, or accuracy-resource trade-offs.
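As a rough illustration of the query-and-match step, the sketch below builds bag-of-words vectors over per-project "documents" and ranks projects by cosine similarity against a keyword query. It is not the ProvDB/Lucene implementation: scikit-learn's TF-IDF vectorizer is used as a stand-in for the Lucene index, and the toy project documents are invented.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# One synthetic "document" per project: README text, code comments, library symbols, etc.
projects = {
    "repo-a": "predict mortgage default rate sklearn LogisticRegression pandas read_csv",
    "repo-b": "image classification vgg16 caffe finetune imagenet",
    "repo-c": "census income prediction random forest cross validation",
}

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(projects.values())

query = "mortgage default prediction logistic regression"
scores = cosine_similarity(vectorizer.transform([query]), doc_matrix).ravel()
ranked = sorted(zip(projects, scores), key=lambda x: -x[1])
print(ranked)   # repositories ranked by relevance to the keyword query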
6.2.2 Compare & Rank Models

Model Similarity

A query potentially returns many projects with similar or even duplicated models. As one can observe on GitHub, a popular IPython/Jupyter notebook may be forked hundreds of times. In deep learning, finetuning is a common practice, as a popular model trained on massive data can be transferred to other tasks; the weights are reused and finetuned, resulting in similar models.

Returning similar models affects usability dramatically. Deduplication is common practice in information retrieval systems. As the projects come from different teams, in order to compare projects we need a similarity model for models. However, it is challenging to determine when two models are similar. Just looking at the associated model properties, such as accuracy and confidence, is not meaningful, as the measurements are specific to data points. To characterize a model, aside from the enriched information, the input datasets I, the output result tables O, and the trained parameters P can characterize model performance well. O is typically a structured table (data, label), while P is often float vectors, matrices or tensors, all of which are ordered multi-dimensional sequences. Due to this ordering, similar models may be obfuscated; e.g., the same dataset rows or columns may vary across different projects, while the parameter dimensions may refer to different input dimensions. To address this, we propose an alignment operation that processes a pair of models in the best way in order to develop a similarity method.

Alignment Operation

For ease of illustration, we focus on 2D matrices for P, as vector and tensor alignment can be formulated similarly, and I and O can be handled using similar ideas. The basic idea is that, given two matrices A and B, we permute B's rows and columns to determine the most similar B′ w.r.t. a given distance function, e.g., Euclidean distance, string edit distance, etc. As the matrices to be aligned do not necessarily have the same dimensions, we first define a permutation matrix capable of adapting to differences in dimensions.

Definition 4 (Permutation Matrix) Let a permutation ψ = (ψ1, ψ2, · · · , ψn) ∈ Ψ, where Ψ is the set of all possible n! permutations. Given two positive integers s, t ∈ N+, the permutation matrix of ψ is a matrix P(ψ, s, t) ∈ [0, 1]^{s×t}, where

\[ P(\psi, s, t)_{i,j} = \begin{cases} 1 & \text{if } i \le n \text{ and } \psi_i = j, \\ 0 & \text{otherwise.} \end{cases} \]

Using the permutation matrix, a permutation can be applied via matrix multiplication to reorder a matrix by rows or columns.

Example 11 A row permutation ψ = (2, 1) of A ∈ R^{2×4} to R^{3×4}:

\[ P(\psi, 3, 2) \times A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \\ 0 & 0 \end{pmatrix} \times \begin{pmatrix} 8 & 2 & 5 & 3 \\ 4 & 1 & 2 & 2 \end{pmatrix} = \begin{pmatrix} 4 & 1 & 2 & 2 \\ 8 & 2 & 5 & 3 \\ 0 & 0 & 0 & 0 \end{pmatrix} \]

A column permutation φ = (1, 3, 2, 4) of A ∈ R^{2×4} to R^{2×3}:

\[ A \times P(\phi, 4, 3) = \begin{pmatrix} 8 & 2 & 5 & 3 \\ 4 & 1 & 2 & 2 \end{pmatrix} \times \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix} = \begin{pmatrix} 8 & 5 & 2 \\ 4 & 2 & 1 \end{pmatrix} \]

With the permutation matrix, given two matrices and a distance function, we formulate the alignment problem as follows:

Problem 4 (Matrix Alignment) Given two real matrices A ∈ R^{m×n} and B ∈ R^{s×t}, we want to find two permutations that reorder B to A, ψ = (ψ1, ψ2, · · · , ψs) ∈ Ψrow and φ = (φ1, φ2, · · · , φt) ∈ Φcol, such that

\[ \psi^*, \phi^* = \operatorname*{arg\,min}_{\psi \in \Psi_{row},\; \phi \in \Phi_{col}} D\big(A - P(\psi, m, s) \times B \times P(\phi, t, n)\big) \]

where D : R^{m×n} → R is a distance function. We denote the best aligned matrix as Δ*_{A,B} = A − P(ψ*, m, s) × B × P(φ*, t, n).
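The following sketch makes Definition 4 and Problem 4 concrete for tiny matrices by enumerating both permutation spaces (a brute-force sketch in 0-indexed form, assuming numpy and the Frobenius norm for D; realistic instances require the approximation algorithms discussed next):

import itertools
import numpy as np

def perm_matrix(psi, rows, cols):
    """P(psi, rows, cols) from Definition 4: entry (i, psi_i) is 1 for every position i of psi that fits."""
    P = np.zeros((rows, cols))
    for i, j in enumerate(psi):
        if i < rows and j < cols:
            P[i, j] = 1.0
    return P

def align(A, B):
    """Brute-force solution of Problem 4: best row/column permutations of B w.r.t. A."""
    (m, n), (s, t) = A.shape, B.shape
    best = (np.inf, None, None)
    for psi in itertools.permutations(range(s)):          # row permutation of B
        for phi in itertools.permutations(range(t)):      # column permutation of B
            delta = A - perm_matrix(psi, m, s) @ B @ perm_matrix(phi, t, n)
            dist = np.linalg.norm(delta)                   # D = Frobenius (L2) norm
            if dist < best[0]:
                best = (dist, psi, phi)
    return best

A = np.array([[8, 2, 5, 3], [4, 1, 2, 2]])
B = np.array([[1, 1, 1], [8, 5, 2], [4, 2, 1], [2, 2, 2]])
print(align(A, B))   # distance sqrt(13) with psi = (1, 2, 0, 3), phi = (0, 2, 1), i.e., Example 12 0-indexed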
Example 12 Let D be the ‖·‖2 norm and consider the two matrices

\[ A = \begin{pmatrix} 8 & 2 & 5 & 3 \\ 4 & 1 & 2 & 2 \end{pmatrix}, \qquad B = \begin{pmatrix} 1 & 1 & 1 \\ 8 & 5 & 2 \\ 4 & 2 & 1 \\ 2 & 2 & 2 \end{pmatrix}. \]

The permutations ψ* = (2, 3, 1, 4) and φ* = (1, 3, 2) minimize the distance function:

\[ D(\Delta^*_{A,B}) = \left\| A - P(\psi^*, 2, 4) \times B \times P(\phi^*, 3, 4) \right\|_2 = \left\| \begin{pmatrix} 8 & 2 & 5 & 3 \\ 4 & 1 & 2 & 2 \end{pmatrix} - \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} 1 & 1 & 1 \\ 8 & 5 & 2 \\ 4 & 2 & 1 \\ 2 & 2 & 2 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \right\|_2 = \sqrt{13} \]

Complexity Analysis

As the solutions of this matrix alignment problem lie in two permutation spaces, it is not easy to solve. A reduction from the graph edit distance problem (GED) [141] can be used to show that it is NP-hard. Similar to the state-of-the-art heuristics for GED [141], bipartite matching-based approximations can be used to find good solutions.

Alignment Property & Its Usages

Interestingly, the alignment operation can also be used to construct a distance metric for computing diverse results efficiently. We define the alignment distance of two matrices after alignment as A(A, B) = D(Δ*_{A,B}) + D(Δ*_{B,A}) + ε(A, B), where ε is D applied to the natively aligned matrices, i.e., aligned by indices with 0 padding.

Lemma 7 If the distance D is a metric, then A is also a metric.

Since the alignment distance A is a metric, an embedding space of models can be computed using approaches such as multidimensional scaling (MDS) [142]. The central idea of MDS is to learn an inner product space in which the distances between points are preserved or approximated. The objective function is defined by

\[ \min_{x} \sum_{i,j} \big( \|x_i - x_j\| - d(i, j) \big)^2 \]

where d(i, j) is the distance between model i and model j after alignment, and xi, xj are their embedding vectors respectively. The solution of MDS produces a vector for each project model, which allows us to find similar and diverse models efficiently in online settings [143], and even enables the hosting service to use the model vector representation to improve query performance by learning to rank from click logs.
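As a minimal illustration of this embedding step (using scikit-learn's MDS with a precomputed dissimilarity matrix; the tiny distance matrix below is fabricated and stands in for pairwise alignment distances A(·,·)):

import numpy as np
from sklearn.manifold import MDS

# d[i][j]: alignment distance between model i and model j; values made up for illustration.
d = np.array([[0.0, 1.2, 3.0, 2.4],
              [1.2, 0.0, 2.1, 1.8],
              [3.0, 2.1, 0.0, 0.9],
              [2.4, 1.8, 0.9, 0.0]])

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
x = mds.fit_transform(d)    # one embedding vector per model
# ||x[i] - x[j]|| now approximates d[i, j]; the vectors can be indexed for
# similarity / diversity search or fed into a learning-to-rank model.
print(x)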
Discussion & Challenges

The unstructured nature of project artifacts and workflows makes it challenging to identify similar projects and return diverse results. In addition to using alignment and scaling to reason about parameters and input-output result tables, we also need to develop techniques to analyze the workflows, the program scripts, and the datasets, both to compare different projects and to rank them. There is also a need to develop interactive techniques that allow a user to navigate through the returned models to quickly identify the most suitable models for her needs.

6.2.3 Process & Ensemble Returned Models

Constrained Query Results

In many scenarios, users are interested in the most accurate predictive models for their specific tasks. Instead of returning the best single model from the hosted projects, it is likely that a combination of multiple models performs better than any single model. Ensemble modeling is a widely used practice and has been studied extensively [144]. Model ensembling could greatly benefit from a hosted service having many projects with a diverse set of ideas.

At the same time, users may have multiple search criteria for a model, such as constraints on precision-recall, accuracy-fairness, accuracy-resource, etc. For example:

• Instead of the most accurate model, users may query for a model which performs similarly to a specified precision-recall curve [145].
• Due to the possibility of disparate impact, a returned model should satisfy fairness requirements [146].
• The performance of a predictive model is also measured by the resources it requires, e.g., time consumption and memory cost. It is often the case that under limited resources, e.g., on mobile systems, users may compromise by asking for any model with accuracy above a given lower bound.

All of these constrained query scenarios require processing the returned models to meet the user's demand. We explore supporting model ensembles, an important case with well-established methodologies.

Ensemble Models

The question for model assembly is how to generate one model from K selected models from repositories that are applicable to the same task, in order to satisfy the user-provided measurements. This can be achieved by two meta-approaches: 1) creating a new model by parallel-connecting all of the models, whose output is a weighted average or vote over the outputs of all the models, and 2) creating a (tree-structured) cascade or decision tree of models such that each node contains one of the models and a decision function that decides whether to use the output of the current model or of the remaining models. Both of these approaches can be realized by learning the averaging weights, voting weights or decision functions with the given evaluation measurements of the single models, so as to approximate the user-desired measurements.

The system problem of model assembly is to incrementally find the models that are to be used in the final ensemble model. Caruana et al. [147] study the problem of ensemble selection from a model library and propose a general approach: 1) initialize the ensemble as empty; 2) pick the model that maximizes the ensemble's performance and add it to the ensemble; 3) repeat step 2 until constraints are reached or all models are selected; 4) output the generated ensemble.
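A minimal sketch of this greedy selection loop is shown below (our own simplified rendering of the approach, not Caruana et al.'s code: models are represented by their predictions on a validation set, the ensemble output is an unweighted average, models may be selected repeatedly, and score is any user-supplied quality metric, higher being better):

import numpy as np

def greedy_ensemble(predictions, y_val, score, max_size=10):
    """Greedy ensemble selection in the spirit of Caruana et al. [147].

    predictions: dict mapping model id -> np.ndarray of validation predictions
    score: callable(y_true, y_pred) -> float, higher is better
    """
    ensemble, best_so_far = [], -np.inf
    while len(ensemble) < max_size:
        candidate, candidate_score = None, best_so_far
        for model_id, pred in predictions.items():
            # Score the ensemble that would result from adding this model.
            avg = np.mean([predictions[m] for m in ensemble] + [pred], axis=0)
            s = score(y_val, avg)
            if s > candidate_score:
                candidate, candidate_score = model_id, s
        if candidate is None:      # no model improves the ensemble; stop
            break
        ensemble.append(candidate)
        best_so_far = candidate_score
    return ensemble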
Discussion & Challenges

However, to build such a system, a discovery system should be able to validate models and, better still, train the models in different projects. In practice, only a subset of projects may have such a property. The ModelHub Discovery prototype is currently built on top of ModelHub, which at present only contains deep learning projects; the model training and testing procedures in each repository are well understood, which allows us to explore this direction. On the other hand, model ensembling is also an important task for automatic model selection services [89, 147]; compared with discovering hosted projects, they deal with finding the best model via predefined prediction methods given a dataset and a goal. Identifying scalable system methods for this task is useful for improving practitioners' productivity.

6.3 Evaluation Study

In this section, we study the usefulness of the model alignment technique for identifying similar models. Note that it is a challenge to find ground truth in real-world repositories, as the prediction results need to be reproduced from the selected repositories. Instead, we finetune a VGG deep learning model to derive a synthetic model dataset.

6.3.1 Dataset Description

Fine-tuning is a popular modeling practice in deep learning, used when the user has an existing model properly trained on a large dataset (such as ImageNet) and wants to apply it to a new dataset which can have a different set of prediction labels. The motivation of fine-tuning is that a well-trained model should generalize well to various data, so it becomes unnecessary to do end-to-end training of every model, which may consume a large amount of time.

We start from the VGG-16 network, which was originally trained on the ImageNet dataset, and fine-tune the model on the CASIA face recognition dataset. The CASIA dataset has 10575 face categories, of which we sample 1000 as the last-layer prediction output. The way to fine-tune a model is to replace the last fully connected layer with a new one and set small or even zero learning rates for the existing layers. The newly mutated model trained on the new dataset will converge much faster than end-to-end training. It is often unnecessary to fine-tune early layers, but it is necessary to set small non-zero learning rates for some later existing layers. By enumerating different pivot layers for fine-tuning, our fine-tuning model generator replicates this behavior.

To elaborate, each time we first choose a pivot parametric layer in VGG-16, before which the parameters are fixed and after which the parametric layers are retrained. The layer-wise learning rate and weight decay scalars are set to (1, 2) and (1, 0) for (weight, bias), respectively. For the last fully connected layer (fc8 in VGG-16, fc8_casia1k in the new models), the scalars are set to (10, 10) for (weight, bias). Once the pivot layer is chosen, we enumerate the global optimization hyperparameters: learning rate in [10^-3, 10^-4, 5 × 10^-5], and weight decay in [2 × 10^-4, 5 × 10^-4]. In total, there are 54 different model configurations. Each of them is retrained for 10000 iterations and checkpointed every 1000 iterations. The models with their fine-tuning configurations and validation accuracies are shown in Figure 6.4. Each oval node is a model labeled with its generation order and accuracy. The path from the root to a model shows the detailed retraining configuration. Red-circled models have the top 3 validation accuracies.

Figure 6.4: Finetuned VGG Models

6.3.2 Evaluation Result

On the generated models, we first compute pairwise weight distances (L2) between model pairs. Then we derive their prediction performance differences. Note that each model produces a vector in [0, 1]^1000 from the last layer, which we refer to as its result table. On the testing dataset, we compute pairwise cosine distances between the model results.

In Figure 6.5, we show the relationship between the model weight alignment distances and the prediction performance distances; the x axis shows the alignment distance and the y axis the prediction performance distance. We normalized both to [0, 1], so that the closer to 0, the more similar the models are. As we can see from the figure, smaller x (alignment distance) tends to come with smaller y (prediction result table difference) as well. The Pearson correlation between x and y is 0.8224. This preliminary evaluation shows that the model similarity technique we propose would work for some modeling practices.

Figure 6.5: Correlation Between Aligned Distance and Prediction Results

In a model discovery service, the performance result is often not known; given the observed correlation between model weight alignment distance and performance result difference, a service implementation can use the alignment distance as a signal to improve query results.
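The analysis behind Figure 6.5 can be reproduced along the following lines (a sketch; the helper names are ours, and the weight vectors are assumed to already be aligned to a common shape):

import numpy as np
from itertools import combinations
from scipy.spatial.distance import cosine
from scipy.stats import pearsonr

def alignment_vs_prediction_correlation(weights, preds):
    """weights[i]: flattened, aligned parameters of model i; preds[i]: its [0, 1]^1000 result table."""
    w_dist, p_dist = [], []
    for i, j in combinations(range(len(weights)), 2):
        w_dist.append(np.linalg.norm(weights[i] - weights[j]))  # L2 weight distance after alignment
        p_dist.append(cosine(preds[i], preds[j]))                # prediction result table difference
    # Pearson correlation is unaffected by the min-max normalization used only for plotting Figure 6.5.
    return pearsonr(w_dist, p_dist)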
6.4 Conclusion

In this chapter, we explored systems research opportunities for enabling data scientists to query and reuse data science projects hosted in a central service. For a hosted service storing enriched repositories from lifecycle management tools, we presented our vision of querying managed data science projects. Instead of querying structured stores, we chose an information retrieval approach in order to better serve the needs of practitioners, and described a search service allowing the user to query using project requirements, such as goals, datasets, and modeling methods. We outlined the system architecture and proposed several challenging problems in building it, including developing model similarity methods and model assembly for constrained queries. Because there are not yet many real-world repositories with well-extracted information, we conducted a preliminary evaluation using synthetic models generated following best practices, and showed the potential value and performance of the proposed techniques.

Chapter 7: Conclusions

In this dissertation, we studied the practice of data science and explored the issues in end-to-end lifecycle management of collaborative analytics workflows. We took a systems approach to unify the provenance and project artifact management for a collaborative analytics team, and studied how to track the provenance and derivation history of models, query modeling artifacts and processing pipelines, analyze unexpected behaviors, and search hosted repositories.

We first described ProvDB in Chapter 3, which is a general provenance ingestion and graph-based storage system that introduces a command line toolkit to wrap user commands and capture static and runtime information when users execute scripts. The capturing is done via a set of ingestor plugins, such as UNIX POSIX ingestors for basic commands (e.g., mv source and target), deep learning training tool ingestors (e.g., training accuracy and loss), core data science library ingestors (e.g., scikit-learn usages in a script), etc. It also features a rich set of general query facilities tailored for data science lifecycles, such as introspecting the project artifacts and pipelines and monitoring the ongoing modeling activities.

By extending the ProvDB ingestion mechanism, query facilities and storage backend, we described the ModelHub system in Chapter 4. It includes a specialized
The segmentation operator is able to induce a certain scope in the lineages of user-specified vertices to get insight about the derivation relationships among the vertices of interest. The summarization operator can combine similar segments together to show the common and alternative pipelines among those segments. In the real world, there are more and more collaborative analysis projects being shared online. In Chapter 6, we presented our vision of a model discovery service, a system that enables data scientists to query and reuse data science projects hosted in 184 a central service, such as GitHub or ModelHub. Assuming many projects’ reposito- ries are enrinched by lifecycle management tools, we outlined a system architecture and proposed several challenging problems in building it, including developing model similarity methods and model assembly for constrained queries. 185 Bibliography [1] Souvik Bhattacherjee, Amit Chavan, Silu Huang, Amol Deshpande, and Aditya G. Parameswaran. Principles of dataset versioning: Exploring the recreation/storage tradeoff. Proceedings of the VLDB Endowment, 8(12):1346– 1357, 2015. [2] Anant P. Bhardwaj, Souvik Bhattacherjee, Amit Chavan, Amol Deshpande, Aaron J. Elmore, Samuel Madden, and Aditya G. Parameswaran. Datahub: Collaborative data science & dataset version management at scale. In CIDR 2015, Seventh Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2015, Online Proceedings, 2015. [3] Juliana Freire, David Koop, Emanuele Santos, and Cláudio T. Silva. Prove- nance for computational tasks: A survey. Computing in Science and Engi- neering, 10(3):11–21, 2008. [4] Susan B. Davidson and Juliana Freire. Provenance and scientific workflows: challenges and opportunities. In Proceedings of the ACM SIGMOD Interna- tional Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, pages 1345–1350, 2008. [5] James Cheney, Laura Chiticariu, and Wang Chiew Tan. Provenance in databases: Why, how, and where. Foundations and Trends in Databases, 1(4):379–474, 2009. [6] Bertram Ludäscher, Ilkay Altintas, Chad Berkley, Dan Higgins, Efrat Jaeger, Matthew Jones, Edward A Lee, Jing Tao, and Yang Zhao. Scientific workflow management and the Kepler system. Concurrency and Computation: Practice and Experience, 18(10):1039–1065, 2006. [7] Tom Oinn, Mark Greenwood, Matthew Addis, M. Nedim Alpdemir, Justin Ferris, Kevin Glover, Carole Goble, Antoon Goderis, Duncan Hull, Darren Marvin, Peter Li, Phillip Lord, Matthew R. Pocock, Martin Senger, Robert Stevens, Anil Wipat, and Chris Wroe. Taverna: lessons in creating a workflow environment for the life sciences. Concurrency and Computation: Practice and Experience, 18(10):1067–1100, 2006. 186 [8] Belinda Giardine, Cathy Riemer, Ross C Hardison, Richard Burhans, Laura Elnitski, Prachi Shah, Yi Zhang, Daniel Blankenberg, Istvan Albert, James Taylor, Webb Miller, W. James Kent, and Anton Nekrutenko. Galaxy: a platform for interactive large-scale genome analysis. Genome research, 15(10):1451–1455, 2005. 
[9] Stephen A Goff, Matthew Vaughn, Sheldon McKay, Eric Lyons, Ann E Sta- pleton, Damian Gessler, Naim Matasci, Liya Wang, Matthew Hanlon, Andrew Lenards, Andy Muir, Nirav Merchant, Sonya Lowry, Stephen Mock, Matthew Helmke, Adam Kubach, Martha Narro, Nicole Hopkins, David Micklos, Uwe Hilgert, Michael Gonzales, Chris Jordan, Edwin Skidmore, Rion Dooley, John Cazes, Robert McLay, Zhenyuan Lu, Shiran Pasternak, Lars Koesterke, William H Piel, Ruth Grene, Christos Noutsos, Karla Gendler, Xin Feng, Chunlao Tang, Monica Lent, Seung-Jin Kim, Kristian Kvilekval, B S Man- junath, Val Tannen, Alexandros Stamatakis, Michael Sanderson, Stephen M Welch, Karen A Cranston, Pamela Soltis, Doug Soltis, Brian O’Meara, Cecile Ane, Tom Brutnell, Daniel J Kleibenstein, Jeffery W White, James Leebens- Mack, Michael J Donoghue, Edgar P Spalding, Todd J Vision, Christopher R Myers, David Lowenthal, Brian J Enquist, Brad Boyle, Ali Akoglu, Greg An- drews, Sudha Ram, Doreen Ware, Lincoln Stein, and Dan Stanzione. The iPlant collaborative: cyberinfrastructure for plant biology. Frontiers in plant science, 2, 2011. [10] Louis Bavoil, Steven P. Callahan, Carlos E. Scheidegger, Huy T. Vo, Patricia Crossno, Cláudio T. Silva, and Juliana Freire. Vistrails: Enabling interactive multiple-view visualizations. In 16th IEEE Visualization Conference, VIS 2005, Minneapolis, MN, USA, October 23-28, 2005, pages 135–142, 2005. [11] Ian T. Foster, Jens-S. Vöckler, Michael Wilde, and Yong Zhao. Chimera: Avir- tual data system for representing, querying, and automating data derivation. In Proceedings of the 14th International Conference on Scientific and Statisti- cal Database Management, July 24-26, 2002, Edinburgh, Scotland, UK, pages 37–46, 2002. [12] Ewa Deelman, Gurmeet Singh, Mei-Hui Su, James Blythe, Yolanda Gil, Carl Kesselman, Gaurang Mehta, Karan Vahi, G. Bruce Berriman, John Good, Anastasia C. Laity, Joseph C. Jacob, and Daniel S. Katz. Pegasus: A frame- work for mapping complex scientific workflows onto distributed systems. Sci- entific Programming, 13(3):219–237, 2005. [13] Yong Zhao, Michael Wilde, and Ian T. Foster. Applying the virtual data provenance model. In Provenance and Annotation of Data, International Provenance and Annotation Workshop, IPAW 2006, Chicago, IL, USA, May 3-5, 2006, Revised Selected Papers, pages 148–161. 187 [14] Philip J. Guo and Margo Seltzer. BURRITO: wrapping your lab notebook in computational infrastructure. In 4th Workshop on the Theory and Practice of Provenance, TaPP’12, Boston, MA, USA, June 14-15, 2012, 2012. [15] Fernando Seabra Chirigati, Dennis E. Shasha, and Juliana Freire. Reprozip: Using provenance to support computational reproducibility. In 5th Workshop on the Theory and Practice of Provenance, TaPP’13, Lombard, IL, USA, April 2-3, 2013, 2013. [16] Timothy M. McPhillips, Tianhong Song, Tyler Kolisnik, Steve Aulenbach, Khalid Belhajjame, Kyle Bocinsky, Yang Cao, Fernando Chirigati, Saumen C. Dey, Juliana Freire, Deborah N. Huntzinger, Christopher Jones, David Koop, Paolo Missier, Mark Schildhauer, Christopher R. Schwalm, Yaxing Wei, James Cheney, Mark Bieda, and Bertram Ludäscher. YesWorkflow: A user-oriented, language-independent tool for recovering workflow information from scripts. International Journal of Digital Curation, 10(1):298–313, 2015. [17] Leonardo Murta, Vanessa Braganholo, Fernando Chirigati, David Koop, and Juliana Freire. noWorkflow: Capturing and analyzing provenance of scripts. 
In Provenance and Annotation of Data and Processes - 5th International Prove- nance and Annotation Workshop, IPAW 2014, Cologne, Germany, June 9-13, 2014., pages 71–83. [18] Luc Moreau, Ben Clifford, Juliana Freire, Joe Futrelle, Yolanda Gil, Paul Groth, Natalia Kwasnikowska, Simon Miles, Paolo Missier, Jim Myers, Beth Plale, Yogesh Simmhan, Eric Stephan, and Jan Van den Bussche. The open provenance model core specification (v1.1). Future Generation Computer Sys- tems, 27(6):743 – 756, 2011. [19] Luc Moreau and Paul Groth. PROV-overview. W3C note, W3C, 2013. http://www.w3.org/TR/2013/NOTE-prov-overview-20130430/. [20] Catriel Beeri, Anat Eyal, Simon Kamenkovich, and Tova Milo. Querying business processes. In Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 12-15, 2006, pages 343–354, 2006. [21] David A. Holland, Uri Jacob Braun, Diana Maclean, Kiran-Kumar Muniswamy-Reddy, and Margo I. Seltzer. Choosing a data model and query language for provenance. In The 2nd International Provenance and Annota- tion Workshop. Springer, 2008. [22] Manish Kumar Anand, Shawn Bowers, and Bertram Ludäscher. Techniques for efficiently querying scientific workflow provenance graphs. In EDBT 2010, 13th International Conference on Extending Database Technology, Lausanne, Switzerland, March 22-26, 2010, Proceedings, pages 287–298, 2010. 188 [23] Olivier Biton, Sarah C. Boulakia, Susan B. Davidson, and Carmem S. Hara. Querying and managing provenance through user views in scientific work- flows. In Proceedings of the 24th International Conference on Data Engineer- ing, ICDE 2008, April 7-12, 2008, Cancún, México, pages 1072–1081, 2008. [24] Paolo Missier, Jeremy Bryans, Carl Gamble, Vasa Curcin, and Roxana Dánger. Provabs: Model, policy, and tooling for abstracting PROV graphs. In Prove- nance and Annotation of Data and Processes - 5th International Provenance and Annotation Workshop, IPAW 2014, Cologne, Germany, June 9-13, 2014, pages 3–15, 2014. [25] Luc Moreau. Aggregation by provenance types: A technique for summarising provenance graphs. In Proceedings Graphs as Models, GaM@ETAPS 2015, London, UK, 11-12 April 2015., pages 129–144, 2015. [26] Rui Abreu, Dave Archer, Erin Chapman, James Cheney, Hoda Eldardiry, and Adria Gascón. Provenance segmentation. In 8th Workshop on the Theory and Practice of Provenance, TaPP’16, Washington, D.C., USA, June 8-9, 2016, 2016. [27] Manish Kumar Anand, Shawn Bowers, Timothy M. McPhillips, and Bertram Ludäscher. Efficient provenance storage over nested data collections. In EDBT 2009, 12th International Conference on Extending Database Technology, Saint Petersburg, Russia, March 24-26, 2009, Proceedings, pages 958–969, 2009. [28] Yulai Xie, Kiran-Kumar Muniswamy-Reddy, Dan Feng, Yan Li, and Darrell D. E. Long. Evaluation of a hybrid approach for efficient provenance storage. TOS, 9(4):14:1–14:29, 2013. [29] Saumen C. Dey, Daniel Zinn, and Bertram Ludäscher. ProPub: Towards a declarative approach for publishing customized, policy-aware provenance. In Proceedings of the 23rd International Conference, SSDBM 2011, Portland, OR, USA, July 20-22, 2011, pages 225–243, 2011. [30] James Cheney and Roly Perera. An analytical survey of provenance sanitiza- tion. In Provenance and Annotation of Data and Processes - 5th International Provenance and Annotation Workshop, IPAW 2014, Cologne, Germany, June 9-13, 2014. Revised Selected Papers, pages 113–126, 2014. [31] Peter Buneman, Sanjeev Khanna, and Wang Chiew Tan. 
Why and where: A characterization of data provenance. In Proceedings of the 8th International Conference on Database Theory, ICDT 2001, London, UK, January 4-6, 2001, pages 316–330, 2001. [32] Jennifer Widom. Trio: A system for integrated management of data, accuracy, and lineage. In CIDR 2005, Second Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2005, Online Proceed- ings, pages 262–276, 2005. 189 [33] Peter Buneman, Adriane Chapman, and James Cheney. Provenance man- agement in curated databases. In Proceedings of the ACM SIGMOD Inter- national Conference on Management of Data, Chicago, Illinois, USA, June 27-29, 2006, pages 539–550, 2006. [34] Todd J. Green, Gregory Karvounarakis, and Val Tannen. Provenance semir- ings. In Proceedings of the Twenty-Sixth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 11-13, 2007, Beijing, China, pages 31–40, 2007. [35] Matteo Interlandi, Kshitij Shah, Sai Deep Tetali, Muhammad Ali Gulzar, Seunghyun Yoo, Miryung Kim, Todd Millstein, and Tyson Condie. Titian: Data provenance support in Spark. Proceedings of the VLDB Endowment, 9(3), 2015. [36] Jayant Madhavan, Sreeram Balakrishnan, Kathryn Brisbin, Hector Gonzalez, Nitin Gupta, Alon Y. Halevy, Karen Jacqmin-Adams, Heidi Lam, Anno Lan- gen, Hongrae Lee, Rod McChesney, Rebecca Shapley, and Warren Shen. Big Data Storytelling Through Interactive Maps. IEEE Data Eng. Bull., 35(2):46– 54, 2012. [37] Zachary G. Ives, Nitin Khandelwal, Aneesh Kapur, and Murat Cakir. OR- CHESTRA: rapid, collaborative sharing of dynamic data. In CIDR 2005, Second Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2005, Online Proceedings, pages 107–118, 2005. [38] Nodira Khoussainova, Magdalena Balazinska, Wolfgang Gatterbauer, YongChul Kwon, and Dan Suciu. A case for A collaborative query man- agement system. In CIDR 2009, Fourth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2009, Online Pro- ceedings, 2009. [39] Bill Howe, Garrett Cole, Emad Souroush, Paraschos Koutris, Alicia Key, Nodira Khoussainova, and Leilani Battle. Database-as-a-service for long-tail science. In Proceedings of the 23rd International Conference, SSDBM 2011, Portland, OR, USA, July 20-22, 2011, pages 480–489, 2011. [40] Eser Kandogan, Mary Roth, Peter M. Schwarz, Joshua Hui, Ignacio Terriz- zano, Christina Christodoulakis, and Renée J. Miller. Labbook: Metadata- driven social collaborative data analysis. In 2015 IEEE International Con- ference on Big Data, Big Data 2015, Santa Clara, CA, USA, October 29 - November 1, 2015, pages 431–440. [41] http://ckan.org. CKAN: an Open-Source Data Portal (retrieved June 1, 2014). [42] http://www.quandl.com. Quandl (retrieved June 1, 2014). 190 [43] http://www.factual.com. Factual Inc. (retrieved June 1, 2014). [44] Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. Openml: networked science in machine learning. SIGKDD Explorations, 15(2):49–60, 2013. [45] http://www.dominoup.com. Domino Data Lab (retrieved June 1, 2014). [46] https://aws.amazon.com/sagemaker. Amazon SageMaker (retrieved May 1, 2018). [47] https://cloud.google.com/datalab/. Google Datalab (retrieved May 1, 2018). [48] http://www.datamarket.com. Data Market Inc. (retrieved June 1, 2014). [49] Alon Y. Halevy, Flip Korn, Natalya Fridman Noy, Christopher Olston, Neoklis Polyzotis, Sudip Roy, and Steven Euijong Whang. Goods: Organizing google’s datasets. 
In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 795–806, 2016. [50] Joseph M. Hellerstein, Vikram Sreekanti, Joseph E. Gonzalez, James Dalton, Akon Dey, Sreyashi Nag, Krishna Ramachandran, Sudhanshu Arora, Arka Bhattacharyya, Shirshanka Das, Mark Donsky, Gabriel Fierro, Chang She, Carl Steinbach, Venkat Subramanian, and Eric Sun. Ground: A data context service. In CIDR 2017, 8th Biennial Conference on Innovative Data Systems Research, Chaminade, CA, USA, January 8-11, 2017, Online Proceedings, 2017. [51] https://www.collibra.com. Collibra (retrieved May 1, 2018). [52] http://atlas.apache.org. Apache Atlas (retrieved May 1, 2018). [53] https://alation.com. Alation (retrieved May 1, 2018). [54] http://www.sas.com. SAS Business Analytics Software (retrieved Oct. 14, 2013). [55] http://office.microsoft.com/en-us/excel/. Microsoft Excel (retrieved Oct. 14, 2013). [56] http://www.r-project.org. The R Project (retrieved Oct. 14, 2013). [57] http://www.mathworks.com. Mathworks Matlab (retrieved Oct. 14, 2013). [58] http://mahout.apache.org. Apache Mahout Machine Learning Library (re- trieved Oct. 14, 2013). [59] http://ipython.org. IPython (retrieved June 1, 2014). 191 [60] http://scikit-learn.org. Scikit-Learn: Machine Learning in Python (re- trieved June 1, 2014). [61] http://pandas.pydata.org. Python Data Analysis Library (retrieved June 1, 2014). [62] Amit Chavan, Silu Huang, Amol Deshpande, Aaron J. Elmore, Samuel Mad- den, and Aditya G. Parameswaran. Towards a unified query language for provenance and versioning. 2015. [63] Amit Chavan and Amol Deshpande. DEX: query execution in a delta-based storage system. pages 171–186, 2017. [64] Silu Huang, Liqi Xu, Jialin Liu, Aaron J. Elmore, and Aditya G. Parameswaran. OrpheusDB: Bolt-on versioning for relational databases. PVLDB, 10(10):1130–1141, 2017. [65] Samir Khuller, Balaji Raghavachari, and Neal E. Young. Balancing minimum spanning trees and shortest-path trees. Algorithmica, 14(4):305–321, 1995. [66] Evan R. Sparks, Ameet Talwalkar, Virginia Smith, Jey Kottalam, Xinghao Pan, Joseph E. Gonzalez, Michael J. Franklin, Michael I. Jordan, and Tim Kraska. MLI: an API for distributed machine learning. In 2013 IEEE 13th International Conference on Data Mining, Dallas, TX, USA, December 7-10, 2013, pages 1187–1192, 2013. [67] Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. Distributed GraphLab: A framework for machine learning in the cloud. Proceedings of the VLDB Endowment, 5(8):716–727, 2012. [68] Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In 11th USENIX Sym- posium on Operating Systems Design and Implementation, OSDI ’14, Broom- field, CO, USA, October 6-8, 2014., pages 583–598, 2014. [69] Mart́ın Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jef- frey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek Gor- don Murray, Benoit Steiner, Paul A. Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Sys- tems Design and Implementation, OSDI 2016, Savannah, GA, USA, Novem- ber 2-4, 2016., pages 265–283, 2016. 
[70] http://www.numpy.org. NumPy (retrieved June 1, 2014).
[71] http://www.scipy.org. SciPy (retrieved June 1, 2014).
[72] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross B. Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, MM ’14, Orlando, FL, USA, November 03 - 07, 2014.
[73] Arun Kumar, Matthias Boehm, and Jun Yang. Data management in machine learning: Challenges, techniques, and systems. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017, pages 1717–1722, 2017.
[74] Mert Akdere, Ugur Çetintemel, Matteo Riondato, Eli Upfal, and Stanley B. Zdonik. The case for predictive database systems: Opportunities and challenges. In CIDR 2011, Fifth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 9-12, 2011, Online Proceedings, pages 167–174, 2011.
[75] Xixuan Feng, Arun Kumar, Benjamin Recht, and Christopher Ré. Towards a unified architecture for in-RDBMS analytics. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, May 20-24, 2012, pages 325–336, 2012.
[76] Arun Kumar, Jeffrey F. Naughton, and Jignesh M. Patel. Learning generalized linear models over normalized data. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015, pages 1969–1984, 2015.
[77] Vineet Chaoji, Rajeev Rastogi, and Gourav Roy. Machine learning in the real world. Proceedings of the VLDB Endowment, 9(13):1597–1600, 2016.
[78] Neoklis Polyzotis, Sudip Roy, Steven Euijong Whang, and Martin Zinkevich. Data management challenges in production machine learning. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017, pages 1723–1726, 2017.
[79] Ce Zhang, Arun Kumar, and Christopher Ré. Materialization optimizations for feature selection workloads. ACM Transactions on Database Systems (TODS), 41(1):2, 2016.
[80] Daniel Crankshaw, Peter Bailis, Joseph E. Gonzalez, Haoyuan Li, Zhao Zhang, Michael J. Franklin, Ali Ghodsi, and Michael I. Jordan. The missing piece in complex analytics: Low latency, scalable model management and serving with Velox. In CIDR 2015, Seventh Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2015, Online Proceedings, 2015.
[81] Christopher Olston, Noah Fiedel, Kiril Gorovoy, Jeremiah Harmsen, Li Lao, Fangwei Li, Vinu Rajashekhar, Sukriti Ramesh, and Jordan Soyke. Tensorflow-serving: Flexible, high-performance ML serving. In NIPS Workshop on ML Systems (LearningSys), 2017.
[82] Manasi Vartak, Harihar Subramanyam, Wei-En Lee, Srinidhi Viswanathan, Saadiyah Husnoo, Samuel Madden, and Matei Zaharia. ModelDB: a system for machine learning model management. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics, page 14. ACM, 2016.
[83] Hui Miao, Ang Li, Larry S. Davis, and Amol Deshpande. Towards unified data and lifecycle management for deep learning. In 33rd IEEE International Conference on Data Engineering, ICDE 2017, San Diego, CA, USA, April 19-22, 2017, pages 571–582, 2017.
[84] Sebastian Schelter, Joos-Hendrik Boese, Johannes Kirschnick, Thoralf Klein, and Stephan Seufert. Automatically tracking metadata and provenance of machine learning experiments. In NIPS Workshop on ML Systems (LearningSys), 2017.
[85] Hui Miao, Amit Chavan, and Amol Deshpande. ProvDB: Lifecycle management of collaborative analysis workflows. In Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics, HILDA@SIGMOD 2017, Chicago, IL, USA, May 14, 2017, pages 7:1–7:6, 2017.
[86] Hui Miao, Ang Li, Larry S. Davis, and Amol Deshpande. On model discovery for hosted data science projects. In Proceedings of the 1st Workshop on Data Management for End-To-End Machine Learning, DEEM@SIGMOD 2017, Chicago, IL, USA, May 14 - 19, 2017, 2017.
[87] Denis Baylor, Eric Breck, Heng-Tze Cheng, Noah Fiedel, Chuan Yu Foo, Zakaria Haque, Salem Haykal, Mustafa Ispir, Vihan Jain, Levent Koc, Chiu Yuen Koo, Lukasz Lew, Clemens Mewald, Akshay Naresh Modi, Neoklis Polyzotis, Sukriti Ramesh, Sudip Roy, Steven Euijong Whang, Martin Wicke, Jarek Wilkiewicz, Xin Zhang, and Martin Zinkevich. TFX: A tensorflow-based production-scale machine learning platform. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13 - 17, 2017, pages 1387–1395, 2017.
[88] Evan R. Sparks, Shivaram Venkataraman, Tomer Kaftan, Michael J. Franklin, and Benjamin Recht. KeystoneML: Optimizing pipelines for large-scale advanced analytics. In 33rd IEEE International Conference on Data Engineering, ICDE 2017, San Diego, CA, USA, April 19-22, 2017, pages 535–546, 2017.
[89] https://cloud.google.com/prediction. Google Prediction API (retrieved Aug. 20, 2016).
[90] Arun Kumar, Robert McCann, Jeffrey F. Naughton, and Jignesh M. Patel. Model selection management systems: The next frontier of advanced analytics. SIGMOD Record, 44(4):17–22, 2015.
[91] Randy H. Katz. Toward a unified framework for version modeling in engineering databases. ACM Computing Surveys (CSUR), 22(4):375–409, 1990.
[92] Yann LeCun, Yoshua Bengio, and Geoffrey E. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[93] Yann LeCun, Bernhard E. Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne E. Hubbard, and Lawrence D. Jackel. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems 2, NIPS Conference, Denver, Colorado, USA, November 27-30, 1989, pages 396–404, 1989.
[94] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, NIPS Conference, Lake Tahoe, Nevada, USA, December 3-6, 2012, pages 1106–1114, 2012.
[95] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of the Third International Conference on Learning Representations (ICLR), 2015.
[96] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778, 2016.
[97] Léon Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pages 421–436. Springer, 2012.
[98] Vincent Vanhoucke, Andrew Senior, and Mark Z. Mao. Improving the speed of neural networks on CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, 2011.
[99] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In Proceedings of the Fourth International Conference on Learning Representations (ICLR), 2016.
[100] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Training deep neural networks with low precision multiplications. arXiv preprint arXiv:1412.7024, 2014.
[101] Eric R. Schendel, Ye Jin, Neil Shah, Jackie Chen, Choong-Seock Chang, Seung-Hoe Ku, Stéphane Ethier, Scott Klasky, Robert Latham, Robert B. Ross, and Nagiza F. Samatova. ISOBAR preconditioner for effective and high-throughput lossless data compression. In IEEE 28th International Conference on Data Engineering (ICDE 2012), Washington, DC, USA (Arlington, Virginia), 1-5 April, 2012, pages 138–149, 2012.
[102] Souvik Bhattacherjee, Amol Deshpande, and Alan Sussman. Pstore: an efficient storage framework for managing scientific data. In Conference on Scientific and Statistical Database Management, SSDBM ’14, Aalborg, Denmark, June 30 - July 02, 2014, pages 25:1–25:12, 2014.
[103] Narsingh Deo and Nishit Kumar. Computation of constrained spanning trees: A unified approach. In Network Optimization, 1997.
[104] Andrew B. Kahng and Gabriel Robins. On optimal interconnections for VLSI, volume 301. Springer Science & Business Media, 1994.
[105] Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, BMVC 2014, Nottingham, UK, September 1-5, 2014, 2014.
[106] Jianming Zhang, Shugao Ma, Mehrnoosh Sameki, Stan Sclaroff, Margrit Betke, Zhe L. Lin, Xiaohui Shen, Brian L. Price, and Radomír Mech. Salient object subitizing. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 4045–4054, 2015.
[107] Ross B. Girshick. Fast R-CNN. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 1440–1448, 2015.
[108] Adam M. Bates, Dave Tian, Kevin R. B. Butler, and Thomas Moyer. Trustworthy whole-system provenance for the Linux kernel. In 24th USENIX Security Symposium, USENIX Security 15, Washington, D.C., USA, August 12-14, 2015, pages 319–334, 2015.
[109] Provenance Challenge. http://twiki.ipaw.info. Accessed: 2017-07.
[110] Pablo Barceló Baeza. Querying graph databases. In Proceedings of the 32nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2013, New York, NY, USA - June 22 - 27, 2013, pages 175–188, 2013.
[111] Oskar van Rest, Sungpack Hong, Jinha Kim, Xuming Meng, and Hassan Chafi. PGQL: a property graph query language. In Proceedings of the Fourth International Workshop on Graph Data Management Experiences and Systems, Redwood Shores, CA, USA, June 24 - 24, 2016, page 7, 2016.
[112] Kiran-Kumar Muniswamy-Reddy, David A. Holland, Uri Braun, and Margo I. Seltzer. Provenance-Aware Storage Systems. In USENIX Annual Technical Conference, General Track, pages 43–56, 2006.
[113] Adam M. Bates, Wajih Ul Hassan, Kevin R. B. Butler, Alin Dobra, Bradley Reaves, Patrick T. Cable II, Thomas Moyer, and Nabil Schear. Transparent web service auditing via network provenance functions. In Proceedings of the 26th International Conference on World Wide Web, WWW 2017, Perth, Australia, April 3-7, 2017, pages 887–895, 2017.
[114] Paolo Missier and Luc Moreau. PROV-DM: The PROV data model. W3C recommendation, W3C, 2013. http://www.w3.org/TR/2013/REC-prov-dm-20130430/.
[115] Zhuowei Bao, Susan B. Davidson, Sanjeev Khanna, and Sudeepa Roy. An optimal labeling scheme for workflow provenance using skeleton labels. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, June 6-10, 2010, pages 711–722, 2010.
[116] Yuanyuan Tian, Richard A. Hankins, and Jignesh M. Patel. Efficient aggregation for graph summarization. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, pages 567–580, 2008.
[117] Yinghui Wu, Shengqi Yang, Mudhakar Srivatsa, Arun Iyengar, and Xifeng Yan. Summarizing answer graphs induced by keyword queries. Proceedings of the VLDB Endowment, 6(14):1774–1785, 2013.
[118] Arijit Khan, Sourav S. Bhowmick, and Francesco Bonchi. Summarizing static and dynamic big graphs. Proceedings of the VLDB Endowment, 10(12):1981–1984, 2017.
[119] Luc Moreau, James Cheney, and Paolo Missier. Constraints of the PROV data model. W3C recommendation, W3C, 2013. http://www.w3.org/TR/2013/REC-prov-constraints-20130430/.
[120] Peixiang Zhao, Xiaolei Li, Dong Xin, and Jiawei Han. Graph cube: on warehousing and OLAP multidimensional networks. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, Athens, Greece, June 12-16, 2011, pages 853–864, 2011.
[121] Wenfei Fan, Zhe Fan, Chao Tian, and Xin Luna Dong. Keys for graphs. Proceedings of the VLDB Endowment, 8(12):1590–1601, 2015.
[122] Zhengkui Wang, Qi Fan, Huiju Wang, Kian-Lee Tan, Divyakant Agrawal, and Amr El Abbadi. Pagrol: Parallel graph OLAP over large-scale attributed graphs. In IEEE 30th International Conference on Data Engineering, ICDE 2014, Chicago, IL, USA, March 31 - April 4, 2014, pages 496–507, 2014.
[123] Haixun Wang and Charu C. Aggarwal. A survey of algorithms for keyword search on graph data. In Managing and Mining Graph Data, pages 249–273, 2010.
[124] John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation. Pearson Addison Wesley, 3rd edition, 2007.
[125] Peter T. Wood. Query languages for graph databases. SIGMOD Rec., 41(1):50–60, April 2012.
[126] Pablo Barceló, Leonid Libkin, Anthony Widjaja Lin, and Peter T. Wood. Expressive languages for path queries over graph-structured data. ACM Trans. Database Syst., 37(4):31:1–31:46, 2012.
[127] Mihalis Yannakakis. Graph-theoretic methods in database theory. In Proceedings of the Ninth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, April 2-4, 1990, Nashville, Tennessee, USA, pages 230–242, 1990.
[128] Thomas W. Reps. Program analysis via graph reachability. In ILPS ’97, Proceedings of the 1997 International Symposium on Logic Programming, Port Jefferson, Long Island, NY, USA, October 13-16, 1997, pages 5–19, 1997.
[129] Swarat Chaudhuri. Subcubic algorithms for recursive state machines. In Proceedings of the 35th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2008, San Francisco, California, USA, January 7-12, 2008, pages 159–169, 2008.
[130] David Melski and Thomas W. Reps. Interconvertibility of a class of set constraints and context-free-language reachability. Theor. Comput. Sci., 248(1-2):29–98, 2000.
[131] V. L. Arlazarov, E. A. Dinic, M. A. Kronrod, and I. A. Faradzev. On economical construction of the transitive closure of a directed graph. Doklady Akademii Nauk SSSR, 194(3):487, 1970.
[132] Istvan Jonyer, Lawrence B. Holder, and Diane J. Cook. MDL-based context-free graph grammar induction and applications. International Journal on Artificial Intelligence Tools, 13(1):65–79, 2004.
[133] Saket Navlakha, Rajeev Rastogi, and Nisheeth Shrivastava. Graph summarization with bounded error. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, pages 419–432, 2008.
[134] Raghav Kaushik, Pradeep Shenoy, Philip Bohannon, and Ehud Gudes. Exploiting local similarity for indexing paths in graph-structured data. In Proceedings of the 18th International Conference on Data Engineering, San Jose, CA, USA, February 26 - March 1, 2002.
[135] Yu Huang, Ziyang Liu, and Yi Chen. Query biased snippet generation in XML search. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, pages 315–326, 2008.
[136] Larry J. Stockmeyer and Albert R. Meyer. Word problems requiring exponential time: Preliminary report. In Proceedings of the 5th Annual ACM Symposium on Theory of Computing, April 30 - May 2, 1973, Austin, Texas, USA, pages 1–9, 1973.
[137] Monika Rauch Henzinger, Thomas A. Henzinger, and Peter W. Kopke. Computing simulations on finite and infinite graphs. In 36th Annual Symposium on Foundations of Computer Science, Milwaukee, Wisconsin, 23-25 October 1995, pages 453–462, 1995.
[138] Tova Milo and Dan Suciu. Index structures for path expressions. In Database Theory - ICDT ’99, 7th International Conference, Jerusalem, Israel, January 10-12, 1999, Proceedings, pages 277–295, 1999.
[139] Ruslan Mavlyutov, Carlo Curino, Boris Asipov, and Philippe Cudré-Mauroux. Dependency-driven analytics: A compass for uncharted data oceans. In CIDR 2017, 8th Biennial Conference on Innovative Data Systems Research, Chaminade, CA, USA, January 8-11, 2017, Online Proceedings, 2017.
[140] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to information retrieval. Cambridge University Press, 2008.
[141] Kaspar Riesen. Structural Pattern Recognition with Graph Edit Distance: Approximation Algorithms and Applications. Springer, 2016.
[142] Ingwer Borg and Patrick J. F. Groenen. Modern multidimensional scaling: Theory and applications. Springer Science & Business Media, 2005.
[143] Roberto J. Bayardo, Yiming Ma, and Ramakrishnan Srikant. Scaling up all pairs similarity search. In Proceedings of the 16th International Conference on World Wide Web, WWW 2007, Banff, Alberta, Canada, May 8-12, 2007, pages 131–140, 2007.
[144] Zhi-Hua Zhou. Ensemble methods: foundations and algorithms. CRC Press, 2012.
[145] Brandyn A. White, Andrew E. Miller, and Larry S. Davis. Classifier-as-a-service: Online query of cascades and operating points. In Workshop on Big Data Meets Computer Vision 2012, co-located with NIPS 2012, pages 1–5, 2012.
[146] Michael Feldman, Sorelle A. Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. Certifying and removing disparate impact. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13, 2015, pages 259–268, 2015.
[147] Rich Caruana, Alexandru Niculescu-Mizil, Geoff Crew, and Alex Ksikes. Ensemble selection from libraries of models. In Machine Learning, Proceedings of the Twenty-first International Conference (ICML 2004), Banff, Alberta, Canada, July 4-8, 2004, 2004.