ABSTRACT Title of dissertation: HISTORICAL GRAPH DATA MANAGEMENT Udayan Khurana, Doctor of Philosophy, 2015 Dissertation directed by: Professor Amol Deshpande Department of Computer Science Over the last decade, we have witnessed an increasing interest in temporal analysis of information networks such as social networks or citation networks. Finding temporal interaction patterns, visualizing the evolution of graph properties, or even simply com- paring them across time, has proven to add significant value in reasoning over networks. However, because of the lack of underlying data management support, much of the work on large-scale graph analytics to date has largely focused on the study of static properties of graph snapshots. Unfortunately, a static view of interactions between entities is often an oversimplification of several complex phenomena like the spread of epidemics, informa- tion diffusion, formation of online communities, and so on. In the absence of appropriate support, an analyst today has to manually navigate the added temporal complexity of large evolving graphs, making the process cumbersome and ineffective. In this dissertation, I address the key challenges in storing, retrieving, and analyzing large historical graphs. In the first part, I present DeltaGraph, a novel, extensible, highly tunable, and distributed hierarchical index structure that enables compact recording of the historical information, and that supports efficient retrieval of historical graph snapshots. I present analytical models for estimating required storage space and snapshot retrieval times which aid in choosing the right parameters for a specific scenario. I also present optimizations such as partial materialization and columnar storage to speed up snapshot retrieval. In the second part, I present Temporal Graph Index that builds upon DeltaGraph to support version-centric retrieval such as a node’s 1-hop neighborhood history, along with snapshot reconstruction. It provides high scalability, employing careful partitioning, distribution, and replication strategies that effectively deal with temporal and topological skew, typical of temporal graph datasets. In the last part of the dissertation, I present Temporal Graph Analysis Framework that enables analysts to effectively express a variety of complex historical graph analysis tasks using a set of novel temporal graph operators and to execute them in an efficient and scalable manner on a cloud. My proposed solutions are engineered in the form of a framework called the Historical Graph Store, designed to facilitate a wide variety of large-scale historical graph analysis tasks. Historical Graph Data Management by Udayan Khurana Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2015 Advisory Committee: Professor Amol Deshpande Chair/Advisor Professor Jimmy Lin Dean’s Representative Professor Tudor Dumitras¸ Committee Member Dr. Catherine Plaisant Committee Member Professor Mihai Pop Committee Member c© Copyright by Udayan Khurana 2015 Dedication To my beloved family and friends. To all the data that my code crunched. To all the moments that didn’t count in the end. To the hope of a happier planet. ii Acknowledgments My ability to write this dissertation has been in part due to the help, guidance and support from several people. I am grateful to all of them for being a part of my life. 
Foremost, I want to sincerely thank my advisor Amol Deshpande. His guidance over the course of graduate school greatly helped me in all aspects of my doctoral research. Working with him continuously raised my expectations which was instrumental in the growth of my skills as a researcher. Also, his patience on several occasions helped me work my way through tough times. I also acknowledge my defense committee members Jimmy Lin, Catherine Plaisant, Mihai Pop and Tudor Dumitras for their constructive suggestions and critiques. My dis- cussions with professors Ben Shneiderman and Nick Roussopoulos over the years were also fruitful in several ways. I am thankful to the department staff, Jenny Story and Fatima Bangura in particular, for their help in all things bureaucratic. My internships at IBM helped me gain a different perspective on research and for that I would like to thank my mentors Srinivasan Parthasarathy and Deepak Turaga. I cherish the time spent with my database lab-mates Ashwin, Theodoros, Jayanta, Waala, Abdul, Hui, Souvik and Amit. Jokes, caricatures and conversations we shared, apart from all the help in form of critiques, motivation and suggestions will be thoroughly missed. During the long time I spent in Maryland and Washington DC, I met many people who touched my life and made this place feel like home. In particular, I feel fortunate to have met Alice, Hong, Abhishek, Kostas, Sheetal, Sushant, Nakul, Pang, Parul, Kleoniki, Nof and Vikas. I don’t want to miss a shout out to those who were miles or oceans away, iii but always cared – Sumeet, Mubin, Medha, Joseph, Sandy and Anand. I want to thank my brother Sukant for always inspiring me with his passion for science. Above all, I want to thank my parents Sarita and Vinod, for being there for me, unconditionally and endlessly. I hope I succeed in imbibing a tiny fraction of the academic quality and personal integrity I see in both of them. iv Table of Contents List of Figures 1 List of Tables 6 1 Introduction 7 1.1 Examples of Temporal Networks and Analysis . . . . . . . . . . . . . . . 10 1.2 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 1.2.1 Storing and fetching temporal graphs . . . . . . . . . . . . . . . 18 1.2.2 Scalable Analytics . . . . . . . . . . . . . . . . . . . . . . . . . 22 1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 1.3.1 DeltaGraph: Reconstructing Past States of a Graph . . . . . . . . 24 1.3.2 TGI: Generalized Temporal Graph Indexing at Scale . . . . . . . 26 1.3.3 Temporal Graph Analytics . . . . . . . . . . . . . . . . . . . . . 27 1.4 Graph Data Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 1.5 Organization of the dissertation . . . . . . . . . . . . . . . . . . . . . . . 29 2 Background 31 2.1 Temporal Graph Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.1.1 Dynamic Social Network Analysis . . . . . . . . . . . . . . . . . 31 2.1.2 Temporal Network Visualization . . . . . . . . . . . . . . . . . . 32 2.2 Temporal Data Management . . . . . . . . . . . . . . . . . . . . . . . . 33 2.2.1 Temporal Queries and Storage . . . . . . . . . . . . . . . . . . . 34 2.2.2 Version Management . . . . . . . . . . . . . . . . . . . . . . . . 35 2.3 Graph Data Management . . . . . . . . . . . . . . . . . . . . . . . . . . 36 2.3.1 Graph Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 2.3.2 Graph Query and Processing Frameworks . . . . . . . . . . . . . 38 2.3.3 Temporal Graph Data Stores . 
. . . . . . . . . . . . . . . . . . . 38 3 Historical Graph Snapshot Retrieval 40 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.3 DeltaGraph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 v 3.3.1 Anatomy of DeltaGraph . . . . . . . . . . . . . . . . . . . . . . 47 3.3.2 Snapshot Queries . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.3.3 Accessibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.4 Snapshot Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.4.1 Single-point Snapshot Retrieval . . . . . . . . . . . . . . . . . . 52 3.4.2 Multi-point Snapshot Retrieval . . . . . . . . . . . . . . . . . . . 53 3.4.3 Composite Time-expression Graph Retrieval . . . . . . . . . . . 55 3.4.4 Speeding up Snapshot Queries using Prefetching . . . . . . . . . 57 3.4.5 Extensibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 3.5 DeltaGraph Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 3.5.1 Model of Graph Dynamics . . . . . . . . . . . . . . . . . . . . . 64 3.5.2 Differential Functions . . . . . . . . . . . . . . . . . . . . . . . 65 3.5.3 Space and Time Estimation Models . . . . . . . . . . . . . . . . 66 3.5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 3.6 Construction and Maintenance . . . . . . . . . . . . . . . . . . . . . . . 70 3.6.1 Batch Construction . . . . . . . . . . . . . . . . . . . . . . . . . 70 3.6.2 Configuring DeltaGraph . . . . . . . . . . . . . . . . . . . . . . 72 3.6.3 Updates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 3.6.3.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . 74 3.6.3.2 Update Procedure . . . . . . . . . . . . . . . . . . . . 76 3.7 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 3.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 4 Generalized Historical Graph Storage in the Cloud 92 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 4.2 Temporal Graph Index . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 4.2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 4.2.2 Prior Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . 98 4.2.3 TGI Description . . . . . . . . . . . . . . . . . . . . . . . . . . 100 4.2.4 Scalable TGI design . . . . . . . . . . . . . . . . . . . . . . . . 102 4.2.5 Partitioning and Replication . . . . . . . . . . . . . . . . . . . . 106 4.2.6 Dynamic Graph Partitioning . . . . . . . . . . . . . . . . . . . . 109 4.2.7 Fetching Graph Primitives . . . . . . . . . . . . . . . . . . . . . 113 4.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 5 Temporal Graph Analytics 125 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 5.1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 5.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 5.3 Memory Efficient Loading of Multiple Versions . . . . . . . . . . . . . . 129 5.3.1 GraphPool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 5.3.2 Clean-up of a graph from memory . . . . . . . . . . . . . . . . . 
132 5.3.3 Visual Analytics on Historical Graph Snapshots . . . . . . . . . . 132 5.3.3.1 Features . . . . . . . . . . . . . . . . . . . . . . . . . 133 vi 5.3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 5.4 Temporal Analytics Framework (TAF) . . . . . . . . . . . . . . . . . . . 140 5.4.1 Temporal Graph Analysis Library . . . . . . . . . . . . . . . . . 140 5.4.2 System Implementation . . . . . . . . . . . . . . . . . . . . . . 144 5.4.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154 6 Conclusion 156 6.1 Insights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 6.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 6.2.1 Unified Temporal Graph Processing Framework . . . . . . . . . . 160 6.2.2 Discovering Graphs of Interest . . . . . . . . . . . . . . . . . . . 161 6.2.3 A Framework for Finding Temporal Patterns in Evolving Graphs . 163 6.2.4 Shared Computation Across Graph Versions . . . . . . . . . . . . 164 6.2.5 Estimating Changes in Graph Properties . . . . . . . . . . . . . . 164 Bibliography 165 vii List of Figures 1.1 Dynamic network analysis (e.g., understanding how “communities” evolve in a social network, how centrality scores change in co-authorship net- works, etc.) can lend important insights into social, cultural, and natural phenomena. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.2 The plot was constructed using our system over the DBLP network, and shows the evolution of the nodes ranked in top 25 in 2004. . . . . . . . . 13 1.3 Performing a snapshot-based evolutionary analysis with and without the support for efficient snapshot retrieval. . . . . . . . . . . . . . . . . . . . 17 1.4 A temporal graph can be represented across two different dimensions - time and entity. The chart lists retrieval tasks, graph operations, example queries at different granularities of time and entity size. . . . . . . . . . . 20 1.5 Historical Graph Store. . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.1 Temporal relational database indexing techniques. . . . . . . . . . . . . . 43 1 3.2 Interval tree is a hierarchical data structure that indexes intervals and can be potentially used to perform snapshot retrieval queries on temporal datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.3 DeltaGraphs with 4 leaves, leaf-eventlist size L, arity 2. ∆(Si, Sj) de- notes delta needed to construct Si from Sj . . . . . . . . . . . . . . . . . . 48 3.4 The (Java) code snippet shows an example program that retrieves several graphs, and operates upon them. . . . . . . . . . . . . . . . . . . . . . . 51 3.5 Example plans for singlepoint and multipoint retrieval on the DeltaGraph shown in Figure 3.3(a). . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.6 Two different options for retrieving difference snapshot queries, t1− t2 or t2 − t1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 3.7 Query expression tree for t1 ∩ t2 − t3 ∩ t4. . . . . . . . . . . . . . . . . . 58 3.8 The extensibility framework of DeltaGraph is meant for users or program- mers to execute specific historical queries efficiently by only supplying a logic through user defined functions while the framework takes care of the temporal aspects of the implementation. . . . . . . . . . . . . . . . . 
62 3.9 Different valid DeltaGraphs for input E, configurations, df = f(); k = 3; l. All of them have the same number of nodes at each corresponding level, but the structure is different. . . . . . . . . . . . . . . . . . . . . . 75 3.10 Different Packed DeltaGraphs . . . . . . . . . . . . . . . . . . . . . . . 77 3.11 Summary of Dataset 1 over time. . . . . . . . . . . . . . . . . . . . . . . 80 3.12 Summary of Dataset 4 over time. . . . . . . . . . . . . . . . . . . . . . . 81 2 3.13 Comparing DeltaGraph and Copy+Log. Int and Bal denote the Intersec- tion and Balanced functions, respectively. . . . . . . . . . . . . . . . . . 82 3.14 Performance of different DeltaGraph configurations vs. Interval Tree for Dataset 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 3.15 Benefits due to multi-core parallelism for snapshot retrieval on a single machine with single disk. . . . . . . . . . . . . . . . . . . . . . . . . . . 84 3.16 Multipoint query execution vs multiple singlepoint queries. . . . . . . . . 85 3.17 Retrieval with and without attributes (Dataset 2) . . . . . . . . . . . . . . 86 3.18 Effect of varying arity (Dataset 1) . . . . . . . . . . . . . . . . . . . . . 87 3.19 Effect of varying eventlist sizes (Dataset 1) . . . . . . . . . . . . . . . . 87 3.20 Effect of varying eventlist sizes on snapshot retrieval times (Dataset 4) . . 88 3.21 Snapshot retrieval performance on a DeltaGraph with different eventlist sizes for different time periods (Dataset 2). . . . . . . . . . . . . . . . . . 88 3.22 Comparison of snapshot retrieval performance for differential functions on Dataset 1. (a) Intersection and Balanced; (b) Different Mixed function configurations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 3.23 DeltaGraph snapshot retrieval using single machine, local configurations of Kyotocabinet and Cassandra (Dataset 4). . . . . . . . . . . . . . . . . 90 4.1 Temporal Graph Index representation. . . . . . . . . . . . . . . . . . . . 101 3 4.2 The graph history is divided into non-overlapping periods of time. Such division is based on time intervals after which the locality-based graph partitioning is updated. It is also used as a partial key for data chunking and placement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 4.3 Graph partitioning using min-cut strategy along with 1-hop replication and the use of auxiliary micro-deltas improves 1-hop neighborhood re- trieval performance without affecting the performance of snapshot or node retrieval. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 4.4 Snapshot retrieval times for varying parallel fetch factor (c), on Dataset 1; m = 4; r = 1, ps = 500. . . . . . . . . . . . . . . . . . . . . . . . . . 116 4.5 Snapshot retrieval times for Dataset 4; m = 6; r = 1, c = 1, ps = 500. . . 117 4.6 Snapshot retrieval times across different m and r values on Dataset 1. . . 118 4.7 Snapshot retrieval across various parameters. . . . . . . . . . . . . . . . 119 4.8 Node version retrieval across various parameters for Dataset 1. . . . . . . 121 4.9 Node version retrieval for Dataset 4; m = 6; r = 1, c = 1, ps = 500. . . . 122 4.10 Experiments on partitioning type and replication; growing data size. . . . 123 5.1 (a) Graphs at times tcurrent, t1 and t2; (b) GraphPool consisting of overlaid graphs; (c) GraphID-Bit Mapping Table . . . . . . . . . . . . . . . . . . 130 5.2 System Architecture: HiNGE, DeltaGraph and GraphPool. . . . . . . . . 
133 5.3 Temporal Exploration using time-slider . . . . . . . . . . . . . . . . . . 135 5.4 Temporal Evolution of node 12 in green over three points in time: Evolu- tion of node’s neighborhood and PageRank. . . . . . . . . . . . . . . . . 136 4 5.5 (a) Node Search (b) Subgraph Pattern Search . . . . . . . . . . . . . . . 137 5.6 Cumulative GraphPool memory consumption. . . . . . . . . . . . . . . . 138 5.7 Effect of materialization . . . . . . . . . . . . . . . . . . . . . . . . . . 139 5.8 SoN: A set of nodes can be abstracted as a 3 dimensional array with tem- poral, node and attribute dimensions. . . . . . . . . . . . . . . . . . . . 141 5.9 Examples of analytics using the TAF Python API. . . . . . . . . . . . . . 146 5.10 Incremental computation using different methods. Both examples com- pute counts of nodes with a specific label in subgraphs over time. . . . . . 148 5.11 Pattern matching in temporal subgraphs. . . . . . . . . . . . . . . . . . . 149 5.12 Using the optional timepoint specification function with evolution and comparison queries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150 5.13 A pictorial representation of the parallel fetch operation between the TGI cluster and the analytics framework cluster. The numbers in circles indi- cate the relative order and the arrowheads indicate the direction of infor- mation or data flow. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 5.14 TAF computation times for Local Clustering Coefficient on varying graph sizes (N=node count) using different cluster sizes. . . . . . . . . . . . . . 153 5.15 Label counting in several 2-hop neighborhoods through version (Node- ComputeTemporal) and incremental (NodeComputeDelta) computation, respectively. We report cumulative time taken (excluding fetch time) over varying version counts; 2 Spark workers were used for Wikipedia dataset. 154 5 List of Tables 1.1 Examples of temporal networks. . . . . . . . . . . . . . . . . . . . . . . 13 3.1 Options for node attribute retrieval. Similar options exist for edge at- tribute retrieval. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.2 A list of Differential Functions . . . . . . . . . . . . . . . . . . . . . . . 66 4.1 Comparison of access costs for different retrieval tasks (snapshot, static vertex, vertex versions, 1-hop neighborhoods and 1-hop neighborhood versions) and index storage on various temporal indexes (Log, Copy, Copy+Log, Node-centric and DeltaGraph). . . . . . . . . . . . . . . . . 99 6 Chapter 1: Introduction In recent years, we have witnessed an increasing abundance of observational data de- scribing various types of information networks, including social networks, biological net- works, citation networks, financial transaction networks, communication networks, to name a few. There is much work on analyzing such networks to understand various social and natural phenomena like: “how the entities in a network interact”, “how information spreads”, “what are the most important (central) entities”, and “what are the key build- ing blocks of a network”, and so on. Such networks, which model the interconnections and interactions between real-world objects, are mathematically represented as graphs1. Analysis of networks is extensively based on graph theory, which provides a wide range of mathematical tools for the study of properties of graphs. 
Graph theory covers prob- lems such as finding the shortest path between two nodes, determining cliques and mo- tifs, measuring network flow between two points, performing graph coloring, determining reachability and many others [37, 137]; it also describes metrics such as graph density, diameter, clustering coefficient, pagerank, to characterize the state of a graph. Graph min- ing applies the principles and methods of graph theory for useful discovery and prediction through pattern search, anomaly detection, dense subgraph detection, classification, etc. It finds popular application in several domains such as biology, physics, transportation 1Here on, we use the terms graph and network interchangeably. 7 engineering, targeted advertising, drug discovery and cyber security. Aggarwal et al. [19] and Chakrabarti et al. [41] are good references for a detailed coverage of techniques and applications of graph mining. With the increasing availability of the digital trace of such networks over time, the topic of network analysis has naturally extended its scope to temporal analysis of networks, which has the potential to lend much better insights into various phenomena, especially those relating to the temporal or evolutionary aspects of the network. For ex- ample, we may want to know: “which analytical model best captures network evolution”, “how information spreads over time”, “who are the people with the steepest increase in centrality measures over a period of time”, “what is the average monthly density of the network since 1997”, “how the clusters in the network evolve over time” etc. Historical queries like, “who had the highest PageRank centrality in a citation network in 1960”, “which year amongst 2001 and 2004 had the smallest network diameter”, “how many new triangles have been formed in the network over the last year”, also involve the tem- poral aspect of the network. More generally a network analyst may want to process the historical 2 trace of a network in different, usually unpredictable, ways to gain insights into various phenomena. Efforts to perform systematic analysis and visual exploration of such networks over time has prompted work on taxonomies and attempts to define process of analyses for temporal networks [21, 72]. Numerous graph data management systems have been developed over the last decade, 2We use the terms temporal graph and historical graph interchangeably, referring to the chronology of a network’s activity from the past, and possibly on-going at the present. We apply the same lack of distinction between historical data management versus temporal data management, unless otherwise spec- ified. However, we refer to historical queries as a subset of the temporal queries; the former being static graph queries in the context of a snapshot from the past, whereas the latter being a wider variety including evolution, comparison, historical types, amongst many others. 8 including specialized graph database systems like Neo4j [10], Titan [15], amongst sev- eral others, and large-scale graph processing frameworks such as Pregel [90], Giraph, GraphLab [88], and GraphX [56]. However much of the work to date, especially on cloud-scale graph data management systems, focuses on managing and analyzing a sin- gle (typically, current) static snapshot of the data. In the real world, however, interactions are a dynamic affair and any graph that abstracts a real-world process changes over time. 
For instance, in online social media, the friendship network on Facebook or the “follows” network on Twitter change steadily over time, whereas the “mentions” or the “retweet” networks change much more rapidly. Dynamic cellular networks in biology, evolving ci- tation networks in publications, dynamic financial transactional networks, are few other examples of such data. There is recent work on streaming analytics over dynamic graph data [45, 39], but that work typically focuses on analyzing only the recent activity in the network (typically over a sliding window). In this work, our focus is on providing the ability to analyze and to reason over the entire history of the changes to a graph. There are many different kinds of relevant his- torical analyses. For example, an analyst may wish to study the evolution of well-studied static graph properties such as centrality measures, density, conductance, etc., over time. Another approach is through the search and discovery of temporal patterns, where the events that constitute the pattern are spread out over time. Comparative analysis, such as juxtaposition of a statistic over time, or perhaps, computing aggregates such as max or mean over time, possibly gives another style of knowledge discovery into temporal graphs. Most of all, a primitive notion of simply being able to access past states of the graphs and performing simple static graph analysis, perhaps, empowers a data scientist 9 with the capacity to perform analysis in arbitrary and unconventional patterns. Supporting such a diverse set of temporal analytics and querying over large vol- umes of historical graph data requires addressing several data management challenges. Specifically, we must develop techniques for storing the historical information in a com- pact manner, while allowing a user to retrieve graph snapshots as of any time point in the past or the evolution history of a specific node or a specific neighborhood. Further the data must be stored and queried in a distributed fashion to handle the increasing scale of the data. We must also develop an expressive, high-level, easy-to-use programming framework that will allow users to specify complex temporal graph analysis tasks, while ensuring that the specified tasks can be executed efficiently in a data-parallel fashion across a cluster. In this chapter, we proceed by providing a few examples of temporal network analytics from different domains (Section 1.1), followed by the motivation for the work presented in this dissertation and the challenges in providing effective data man- agement solutions for temporal network analytics(Section 1.2), the contributions of this dissertation (Section 1.3), data model for time-evolving graphs (Section 1.4), and finally, an outline of the rest of the dissertation (Section 1.5). 1.1 Examples of Temporal Networks and Analysis Table 1.1 lists examples of different kinds of temporal networks, such as human interac- tion networks, biological networks, and others. The temporal component of the network can be based on the change in relationship between two entities, change in the degree of a relationship, or a change in the existence of an entity itself. We present some specific 10 instances of temporal network analysis in a variety of application domains below: • Organizational Sociology: Sociologists have long studied inter-organizational al- liances using dynamic network models. For instance, Gulati et al. 
[60] model the likelihood of a future alliance between industrial organizations or the stability of an existing one, based on dynamic interconnections in a larger network formed over time. Factors such as mutual alliances, and combined centrality over time are important factors in deciding strategic search for alliances. • International Finance: A study of the patterns of flow of Foreign Direct Invest- ment (FDI) in Taiwanese electronic firms [44] shows how certain small firms suc- cessfully internationalized themselves by a gradual process of exploiting factors in their networks. The study shows common temporal trends with different firms with respect to network activity including strategic growth of their respective networks. • Epidemiology: Network modeling has been extensively used in the study and pre- vention of epidemics as a basis for capturing spatial proximity. In fact, birth of modern epidemiology is credited to John Snow’s investigation of the 1854 cholera epidemic of London. Snow [121] elaborated the role of water supply network in the city with empirical evidence of locality and spread of the disease, leading to the discovery that cholera is water-borne. In recent years, there has been a consid- erable focus on the temporal aspect of networks in epidemiology, in two different ways. The first one is the dynamic nature of networks, i.e., structural changes over time. The other is the epidemic dynamics on network structures, i.e, propagation of infections. Gross et al. [59], present a model for spread of infection based upon a 11 ti t j t k Figure 1.1: Dynamic network analysis (e.g., understanding how “communities” evolve in a social network, how centrality scores change in co-authorship networks, etc.) can lend important insights into social, cultural, and natural phenomena. changing model of a people network. It exhibits the change of epidemic threshold on different rates of rewiring of links in a simple graph model with constant nodes and number of links. It shows how the adaptive dynamic nature of a physical con- tact network (along with infection probability) can lead to the state of the epidemic - healthy, oscillatory, bistable or endemic. There have been several other models that explain epidemic growth based on a random dynamic network [128, 134]. • Cellular Networks: Analysis of social group or community evolution determine crucial characteristics about the community itself. By studying temporal overlap in community structures on scientific collaboration networks and cellular call net- works, Palla et al. [101], report that large communities with more dynamic mem- bership last longer than the ones with more rigid membership patterns. However, in the case of smaller communities, the ones with less adaptive membership last longer. Also, their analysis provides a correlation between the time commitment of 12 2000 2001 2002 2003 2004 Year 0 100 200 300 400 500 R a n k (a cc or di ng to P ag eR an k) Figure 1.2: The plot was constructed using our system over the DBLP network, and shows the evolution of the nodes ranked in top 25 in 2004. 
Category Source Entitiy Relationship Temporal Aspect Communication Email Person Email sent Timestamp of email sent Facebook Person Friendship Start (and end) of friend- ship Twitter Person Retweet Retweet timestamp Telephone Person Conversation Call duration Biological Interactome Protein molecule P-P interaction Order of regulatory or physical relationship Metabolism Enzyme, Substrate Biochemical reaction Activity duration Cell Signal- ing Pathway Protein, Lipid, etc. Activation, in- hibition Reaction time or order Neural Neuron Synapse Spiking / information flow Transportation Road map Intersection Road segment Traffic delaysPostal routes Post-office, destination Transfer time Load variance; frequency Financial transactional Bank Account Transfer Time of transfer Retail Merchant, customer Payment Purchase or payment time Computer WWW Webpage Hyperlink Editing of referencesP2P/ Cluster Computer Client-server Transfer time Table 1.1: Examples of temporal networks. 13 members towards a community and the longevity of the community itself. • Scientific Collaboration: Temporal network analysis of the history of bibliographic datasets provides interesting insights into the evolution of collaborative networks or areas of research amongst other things. Erten [50] analyzed a subset of the ACM’s digital library to find interesting facts such as: while the typical size of the giant connected component in most domains of science is 80-90% of the collaborative graph, in computing it is just 49%; the average number of co-authors in mathemat- ics has only grown from 1 in 1935 to 1.5 in 1995, exhibiting a slower growth for the same than other fields; in computing, since the mid-nineties, there has been a steady rise of the collaboration networks on topics like data encryption, database management, pattern recognition, but a decline for topics like data structures and symbolic and algebraic manipulation. • Biology: Several biological processes involving interactions between entities are being increasingly modeled as networks. In functional genomics, the classical view of protein molecule functioning focused on the actions of a single protein molecule; with recent advances in the field, the “post-genomic” view of protein function- ing focusing on the protein as component of a network of interactive proteins, has gained prominance [49]. Przytycka et al. [105] provide an overview of the recent progress in dynamic protein network modeling and interesting experimental obser- vations. Other areas involving a shift of paradigm towards dynamic networks in bi- ology include cell signaling pathways, gene regulation networks, protein substrate networks and several others. All biological processes are dynamic and involve on- 14 going change of entity states and their interaction. Analysis of temporal network activity benefits the study and discovery of various biochemical phenomena. For instance, Han et al. [61] study the role of hubs in yeast protein–protein interaction networks for different cellular properties such as robustness of the network. Given the temporal interaction patterns, i.e., whether a hub interacts with other proteins at the same time or at different times, the hubs are classified into two different categories. Each of the two categories are found to contribute towards different functions in the protein-protein interaction network. Taylor et al. [127] show that a dynamic model of human protein interaction network (interactome) helps predict breast cancer outcome. 
They show that the biochemical changes, observed by the changing structure of the interactome, convey the relevant information about the disease outcome, which may not be possible in the case of a static protein interaction network. • Online Social Networks: Online social networks such as Facebook, Twitter, Digg, Flickr and others are useful in studying the nature of several aspects of information diffusion as their digital trace is recorded and can be subsequently analyzed. Ler- man et al. [84] study the spread of interest in news stories through analyzing the patterns of voting for a story on Digg and retweeting on Twitter. Bakshy et al. [28] study the role of structural components such as few strong links compared to many weak links in the spread of a story on Facebook; they also talk about the differences in the patterns of sharing a story by those who have been exposed to it by their friends and those who haven’t been. In both of these works, temporal clustering 15 and aggregation was performed along with structural network analysis, in order to study the flow or spread of information. 1.2 Challenges In the previous section, we saw that temporal network modeling and analysis finds utility across a wide variety of domains. In order to support a broad range of network analysis tasks, we require a graph data management system at the backend capable of low-cost storage and efficient retrieval of the historical network information, in addition to pro- viding the correct framework for expressing and executing analytical queries, temporal or otherwise. However, the existing solutions for graph data management lack adequate techniques for temporal annotation, or for storage and retrieval of large-scale historical changes to the graph. Hence, it is challenging, often prohibitively so, to perform historical graph analysis at scale due to lack of supporting temporal graph management techniques. For instance, a typical analysis of an evolving graph property through snapshot-based computation over different time points, would require access to several past states of the graph. Figure 1.3(a) illustrates the (simplified) process as of today, where in the absence of a temporal graph store, the analyst is responsible for the work of iteratively filtering the temporal graph data to extract relevant snapshot subsets and ingest them into a “static graph store”. This is both burdensome on the analyst as well as time-consuming due to repetitive, often overlapping, ingestion of data. Whereas, while using a “temporal graph store”, the analyst would gain access to snapshots promptly on the call of a function, as illustrated in Figure 1.3(b), saving effort and time in the analysis process. 16 Temporal Graph Data Filter for Snapshot at time t Graph Data for Snapshot at t Load into Graph Store Graph Engine, e.g., Apache Giraph Static Graph Computation Repeat for next timepoint repeat for t ∈ {t 1 , t 2 ...} (a) Analysis process using a static graph store Temporal Graph Data Temporal Graph Index Temporal Graph Indexing Fetch Graph at t Static Graph Computation repeat for t ∈ {t 1 , t 2 ...} Repeat for next timepoint (b) Analysis process using a temporal graph store Figure 1.3: Performing a snapshot-based evolutionary analysis with and without the sup- port for efficient snapshot retrieval. The work presented in this dissertation is motivated with the broad goal of enabling an analyst 3 to perform a wide variety of temporal graph data analytics for large evolving networks. 
Specifically, we found the following aspects of historical graph data manage- ment as essential abstractions that needed efficient solutions - (a) compact storage of large graph histories, (b) efficient query time reconstruction of past states of graph or version histories of nodes or neighborhoods, (c) being able to systematically express a variety of temporal graph analysis tasks, hidden from the physical handling of the temporal aspects of data, and (d) a scalable and elastic platform to run such analyses tasks. Next in this section, we discuss some of the challenges in building an effective 3Analysts or a domain experts are typically not equipped with computational skills such as those related to big data management. 17 historical graph datastore. We first talk about the challenging aspects related to storage and retrieval, followed by issues in performing analytical tasks. 1.2.1 Storing and fetching temporal graphs The fundamental challenge in storing large evolving graphs is to deal with the complex- ities of temporal change, as well as the large size of graphs, put together. The primary objective for indexing graph histories is to be able to efficiently fetch (sub-)graph prim- itives such as snapshots, node or neighborhood versions, effectively while processing analytical queries. Let us first briefly discuss some of the characteristics of static graph storage and temporal indexing, separately, as well as when put together. • Compact storage versus fast access: The goal of temporal indexing is twofold. Firstly, it is desired that the storage be compact, i.e., it should incur only a small amount of redundancy, at worst. Sec- ondly, the index should provide fast reconstruction of graph primitives at query time, such as past snapshots, or versions of nodes or neighborhoods. In the known literature on temporal indexing, we see a natural trade-off between compactness and reconstruction efficiency. While a detailed survey of these techniques is presented in Chapter 2 and Chapter 3, two contrasting approaches highlight this duality. First is a “Log” approach, that stores all the changes in a dataset in a chronological order. Without redundancy, this approach occupies minimal space resources. How- ever, reconstruction of, let us say, a past state of the dataset (snapshot), is expensive because there is only a chronological access to any state of the dataset, and no direct 18 access. This implies that all changes that occurred to the dataset, say attribute val- ues, which are no longer reflected in the snapshot of interest, are information that must be read even if not required to construct the snapshot. On the other hand, a “Copy” approach maintains a distinct copy of the dataset at each change point. This provides direct access to a past snapshot with optimal resource consumption. How- ever, the amount of space consumed is very high because an object that doesn’t change across versions is stored multiple times. Conquering this trade-off is the foremost challenge in temporal indexing. We state this challenge as, “designing a temporal index structure whose storage consumption is comparable to the Log approach and whose access efficiency is comparable to the Copy approach.” • Time-centric versus entity-centric indexing: When put together in the same context, temporal data indexing and graph indexing pose further challenges. A temporal graph dataset is based around two dimensions, i.e., time and structural entity, where the latter represents nodes or edges. 
If we were to use a storage structure that is primarily indexed by time, such as the Copy approach, it makes it efficient to fetch (snapshots or subsets of) states of the dataset at a desired point in time. However, if the query is based around history of a node, an index based on direct access to states of a dataset is not very helpful. Using it as such implies evaluating all states of the dataset, rendering it highly inefficient, much worse than the Log approach. Alternatively, consider an entity-based tem- poral index, which separately stores the changes for each entity in a chronological order. While it is suitable for a “node version” type of retrieval, it becomes highly 19 size time node neighborhood graph point interval Snapshot Multipoint Snapshot Subgraph versions community evolution evolution of graph density shortest paths Subgraph local clustering coefficient Static vertex Vertex history vertex connections degree evolution diameter, density pagerank betweenness centrality comparing diameter across time most central node last year compare local clustering coeff Which are X's most interacted contacts until 1995? How many citations did I have in 2012? Whether X or Y has a higher knit cluster around them? Visualize evolution of this community of investors. What is the average number of friends for a person? Has the degree of separation increased or decreased in the last 1 year? Figure 1.4: A temporal graph can be represented across two different dimensions - time and entity. The chart lists retrieval tasks (black), graph operations (red), example queries (magenta) at different granularities of time and entity size. inefficient to use it for snapshot retrieval or even neighborhood retrieval. Such a du- ality of the primacy of time versus the entities in indexing is another fundamental challenge to the problem we address. • Coping with changing topology in a dynamic graph: Storing large static graphs in distributed fashion involves careful graph partitioning and replication. However, partitioning graphs does not necessarily provide an ideal solution to distributed graph storage. This is due to the inherent interconnected nature of graphs and a skewed distribution of these connections, i.e., edges, across the set of node pairs. For reasonably dense networks, even optimal partitioning incurs a large number of edge cuts. The poor locality means that for retrieving 20 typical graph patterns such as k-hop neighborhoods, such distributed indexes do not scale very well. In recent years, few systems such as Titan [15] have focused on development of systems for storage of large (static) graphs. Graph partitioning remains an open, hard problem without many effective solutions that work with a wide range of graphs. The problem of graph partitioning is compounded in the case of changing graphs. As the structure of the graph evolves, so does the best available partitioning over time. While the graph partition may be frequently updated to maintain a good qual- ity of partitioning, it implies heavy amounts of book-keeping, i.e., keeping track of the node–partition mappings, which considerably increases the work in query processing. Because the issue of graph partitioning is essential to any distributed graph management system, the problem of “efficient graph partitioning over evolv- ing graphs” is a fundamental challenge in the context of distributed historical graph data management. 
• Optimal granularity of storage for different queries: Another question pertains to the strategy for physical storage of different compo- nents of a graph snapshot. If the snapshot is stored as a single chunk (ignoring the distributed context as of now), it is perhaps the ideal scenario for snapshot retrieval as the entire data to be fetched is stored in a chunk on a disk. However, when this style of storage is used for retrieving individual nodes or 1-hop neighborhoods, it will require fetching the entire snapshot, the bulk of which is not needed. An al- ternative is the strategy to store the snapshot in smaller chunks. While this will 21 make fetching smaller portions easier, it also means that fetching a snapshot means looking up for multiple chunks, i.e., multiple seeks, which makes it inefficient. Also, small partitions imply replication of edges (cut during partitioning) in two partitions. Overall increase in space of stored objects also signifies higher retrieval cost. Hence, finding the appropriate physical storage is another challenge in graph indexing. 1.2.2 Scalable Analytics State-of-the art static network analysis routines cover a wide range of tasks from analyzing graph structure to node properties. Temporal network analytics aims to extend that notion to a temporal dimension. For instance, there is a desire amongst analysts to find the evolution of a certain graph or node metric over time. Additionally, performing temporal aggregates, comparison, or search over time, are other things of prime interest in the realm of temporal graph analytics. In order to define an appropriate framework for it, the following aspects need to be explored: • Appropriate abstractions for distributed, scalable in-memory analytics: The major challenge in executing scalable analytics is the parallelization of those tasks for the benefit of fitting the required data into main memory, as well as reduc- ing analysis latencies. Large-scale (static) graph processing frameworks such as Pregel or GraphX do this in the case of static graphs through partitioning the graph by vertices and using a particular message-passing technique underneath to com- municate between different nodes. Graph algorithms typically involve accessing 22 neighbor’s information in 1-hop or further distance, which implies a lot of network communication and synchronization. While there is no single model or paradigm of in-memory distributed graph computing accepted as a standard, and it remains an open systems problem, “vertex centric, message-passing” frameworks are con- sidered the current state of the art. The added temporal factor makes this more complicated in the context of tempo- ral graph analytics. The primary question regarding the analytical model is, “what is/are the appropriate abstraction(s) to logically and physically divide a large tem- poral graph for scalable in-memory analytics?” Whether we partition the data by nodes or by time, or both, needs to be answered appropriately before describing operations on temporal graph data. • Systematically expressing temporal graph analytics: A wide variety of analytics on temporal graphs is of interest to analysts and do- main specialists. The basis of such analytical tasks lies in graph theory for static graphs. Evolution, comparison or search over time are some of the temporal ex- tension flavors to the static graph quantities. 
It needs to be determined, “What is the appropriate set of operations that provide sufficient expressive power to an- alysts/programmers to write various temporal graph analytical tasks?” In other words, “what is the appropriate union of temporal logic and graph algebra?” An additional engineering challenge is to be able to utilize the tools and technology built over the years for static graph analysis, i.e., algorithm libraries such as JUNG, SNAP etc., and cloud computing frameworks such as Spark or MapReduce. 23 1.3 Contributions In this dissertation, we present a set of novel techniques to address the different prob- lems in historical graph data management, with the goal of enabling large-scale temporal graph analytics. Broadly, we address the issues of compact storage, reconstruction of past states or histories of nodes, and appropriate framework for large-scale cluster analy- sis. The contributions are summarized under three different categories in Sections 1.3.1 (DeltaGraph), 1.3.2 (Temporal Graph Index) and 1.3.3 (Temporal Graph Analytics). The different concepts thus developed, have been employed to engineer a system called the Historical Graph Store (HGS). A high level diagrammatic organization of HGS can be seen in Figure 1.5. It consists of two different components. Temporal Graph Index, is a persistent and distributed index. It indexes the entire trace of a given historical network (Index Manager) and provides fast access to snapshots, node versions, etc., at query time (Query Manager). Temporal Graph Analytics Framework provides a library to specify a variety of analytical tasks, on top of an in-memory cluster-based execution platform. Al- ternatively, a visual analytics tool, HiNGE, based on GraphPool (an in-memory, compact placeholder for hundreds of graphs simultaneously), may be used. While the cluster- based system is more scalable, HiNGE is memory efficient and can take advantage of traditional graph libraries such as JUNG [99]. 1.3.1 DeltaGraph: Reconstructing Past States of a Graph In order to compactly store the entire historical trace of a network and efficiently recon- struct past states of a network at query time, we present a novel index structure called 24 TEMPORAL GRAPH INDEX (a) Framework to specify graph extraction and analysis. Operators - Select, Timeslice, Filter, Map, MapDelta Operands - Set of Nodes (SON), TGraph, … Temporal Graph History TEMPORAL GRAPH ANALYSIS FRAMEWORK (b) Apache Spark based parallel execution on RDDs Persistent, distributed, compact graph history QUERY MANAGER: Fetches Snapshots, Node Version History, Historical Neighborhood States or Versions, ... INDEX MANAGER: Creates TGI through partitioning, replication, hierarchical temporal aggregation and version chaining. RDD RDD RDD RDD GraphPool Compact, in- memory (100s of) graph snapshots HiNGE Visual network analysis GraphX JUNG Figure 1.5: Historical Graph Store. DeltaGraph [73]. It is a hierarchical, distributed, highly tunable and extensible structure geared towards fetching 100s of graph snapshots from disk to memory in milliseconds. The flexibility of DeltaGraph enables us to control the access latencies and storage con- sumption by tuning simple parameters. While its primary use is in snapshot retrieval, we also developed an extensibility framework which helps us to conveniently create dif- ferent temporal indexes. These temporal indexes can be used to answer specific queries on graph snapshots directly, bypassing the need of explicit snapshot retrieval. 
Evolu- tionary analysis requires fetching multiple graph snapshots from different points in time. DeltaGraph’s hierarchical structure lends itself into optimizing simultaneous retrieval of 25 multiple snapshots, exploiting commonality between them. Our comprehensive experi- mental evaluation shows that our system can retrieve historical snapshots containing up to millions of nodes and edges in several 100’s of milliseconds or less, often an order of magnitude faster than prior techniques like interval trees, and that the execution time penalties of our in-memory data structure are minimal. 1.3.2 TGI: Generalized Temporal Graph Indexing at Scale Although DeltaGraph accomplishes the task of compact storage of historical networks and provides optimal snapshot retrieval, it doesn’t support efficient means of answering version-centric queries and its design using snapshot-centric deltas is not entirely suitable for a cloud-oriented elastic store. we developed Temporal Graph Index (TGI) [75], an index that builds on DeltaGraph and addresses those two concerns. It compactly stores the entire history of a graph by appropriately partitioning and encoding the differences over time (smaller deltas). These deltas are organized to optimize the retrieval of several temporal graph primitives such as neighborhood versions, node histories, and graph snap- shots. TGI is designed to use a distributed key-value store to store the partitioned deltas, and can thus leverage the scalability afforded by those systems (our implementation uses Apache Cassandra key-value store). TGI is a tunable index structure, and we investigate the impact of tuning the different parameters through an extensive empirical evaluation. TGI provides elastic scalability and a wider variety of graph primitive fetching compared to DeltaGraph. We also show a qualitative comparison between the different indexes, including these two indexes for retrieval of various types of graph primitives. 26 1.3.3 Temporal Graph Analytics Since the typical evolutionary analysis queries require fetching 100s of graph snapshots, we need an efficient way to store all of them in memory simultaneously. we developed an in-memory structure called GraphPool to store 100s of those graphs in an overlaid manner, exploiting the overlap due to temporal consistency. GraphPool makes it possible to store many snapshots in memory which would otherwise be infeasible. We present a distributed implementation of GraphPool to demonstrate its effectiveness in performing evolutionary and comparative analysis over large evolving graphs. The system exposes a generic programmatic interface as well as an interactive visual extension called, HiNGE. Secondly, We present a Temporal Graph Analysis Framework (TAF), which provides an expressive library to specify a wide range of temporal graph analysis tasks and to execute them at scale in a cluster environment. The library is based on a novel set of temporal graph operators that enable a user to analyze the history of a graph in a variety of manners. The execution engine itself is based on Apache Spark [142], a large-scale in-memory cluster computing framework. 1.4 Graph Data Model In this section, we describe a general data model for evolving graphs used in this disserta- tion. We also consider derivatives of this data model at certain places which are explicitly specified in that context. 
The most basic model of a graph over a period of time is as a col- lection of graph snapshots, one corresponding to each time instance (assuming discrete time). Each such graph snapshot contains a set of nodes and a set of edges. The nodes and 27 edges are assigned unique ids at the time of their creation, which are not re-assigned after deletion of the components (a deletion followed by a re-insertion results in assignment of a new id). A node or an edge may be associated with a list of attribute-value pairs; the list of attribute names is not fixed a priori and new attributes may be added at any time. Additionally an edge contains the information about whether it is a directed edge or an undirected edge. An event is defined as the record of an atomic activity in the network. An event could pertain to either the creation or deletion of an edge or node, or change in an at- tribute value of a node or an edge. Alternatively, an event can express the occurrence of a transient edge or node that is valid only for that time instance instead of an interval (e.g., a “message” from a node to another node). Being atomic refers to the fact that the activity can not be logically broken down further into smaller events. Hence, an event always cor- responds to a single timepoint. So, the valid time interval of an edge, [ts, te], is expressed by two different events, edge addition and deletion events at ts and te respectively. The exact contents of an event depend on the event type; examples of a new-edge event (NE) can be seen below, and an update node-attribute event (UNA) as well. (a) {NE, N:23, N:4590, directed:no, 11/29/03 10:10} (b) {UNA, N:23, ‘job’, old:‘..’, new:‘..’, 11/29/07 17:00} We treat events as bidirectional, i.e., they could be applied to a database snapshot in either direction of time. For example, say that at times tk−1 and tk, the graph snapshots areGk−1 and Gk respectively. If E is the set of all events at time tk, we have that: Gk = Gk−1 + E, Gk−1 = Gk − E 28 where the + and − operators denote application of the events in E in the forward and the backward direction. All events are recorded in the direction of evolving time, i.e., going ahead in time. A list of chronologically organized events is called an eventlist. 1.5 Organization of the dissertation The rest of the dissertation is organized as follows: • Chapter 2 discusses references to the relevant background work related to tem- poral network analysis, graph datastores, and temporal data management. We talk about the cross-disciplinary demand for temporal graph analysis, followed by the limitations and the lack of convergence in existing graph datastores and temporal data management techniques. • Chapter 3 describes snapshot retrieval on historical graphs using the DeltaGraph. It contains the design, rationale and analysis of DeltaGraph, a hierarchical persis- tent index structure, that captures temporal redundancy for compact storage and provides almost direct access to network states of one or more points in the past. • Chapter 4 talks about the Temporal Graph Index which provides elastic storage of temporal graphs for a wide-variety of graph primitive retrieval. It builds on the DeltaGraph, but also provides support for version-style retrieval, such as neighbor- hood evolution. It presents a novel distributed index design that scales temporal or topological skew of a dataset, which is typical of the temporal graphs. 29 • Chapter 5 covers two essential aspects of temporal graph analysis through. 
1.5 Organization of the dissertation

The rest of the dissertation is organized as follows:

• Chapter 2 discusses the relevant background work related to temporal network analysis, graph datastores, and temporal data management. We talk about the cross-disciplinary demand for temporal graph analysis, followed by the limitations and the lack of convergence in existing graph datastores and temporal data management techniques.

• Chapter 3 describes snapshot retrieval on historical graphs using the DeltaGraph. It contains the design, rationale, and analysis of DeltaGraph, a hierarchical persistent index structure that captures temporal redundancy for compact storage and provides almost direct access to network states at one or more points in the past.

• Chapter 4 talks about the Temporal Graph Index, which provides elastic storage of temporal graphs for a wide variety of graph primitive retrieval. It builds on the DeltaGraph, but also provides support for version-style retrieval, such as neighborhood evolution. It presents a novel distributed index design that scales under the temporal and topological skew typical of temporal graph datasets.

• Chapter 5 covers two essential aspects of temporal graph analysis. It describes a compact way to simultaneously retain hundreds of graphs in memory for analysis using the GraphPool. The latter part of the chapter describes the design and specification of the Temporal Graph Analysis Framework, which provides a library and an execution platform for large-scale temporal graph analytics.

• Chapter 6 concludes the dissertation with a summary of the observations and contributions, and a description of future problems related to the topic.

Chapter 2: Background

2.1 Temporal Graph Analysis

There has been an increasing interest in dynamic network analysis over the last decade, fueled by the increasing availability of large volumes of temporally annotated network data. The social network analysis (SNA) and information visualization (InfoViz) communities in particular have shown a great deal of interest in temporal graphs. Following is an account of the recent work in these areas.

2.1.1 Dynamic Social Network Analysis

Many works have focused on designing analytical models that capture how a network evolves, with a primary focus on social networks and the Web (see, e.g., [18, 85, 79]). There is also much work on understanding how communities evolve, identifying key individuals, and locating hidden groups in dynamic networks. Berger-Wolf et al. [32, 125], Tang et al. [123], and Greene et al. [58] address the problem of community evolution in dynamic networks. McCulloh and Carley [92] present techniques for social change detection. Asur et al. [26] present a framework for characterizing the complex behavioral patterns of individuals and communities over time. In a recent work, Ahn et al. [21] present an exhaustive taxonomy of temporal visualization tasks. Biologists are interested in discovering historical events leading to a known state of a biological network (e.g., [96]). The evolution of shortest paths in dynamic graphs has been studied by Huo et al. [67], Ren et al. [110], and Xuan et al. [62]. Change in PageRank over evolving graphs [48, 27], and the study of change in centrality of vertices, path lengths of vertex pairs, etc. [103], also lie under the larger umbrella of temporal graph analysis. Barrat et al. [29] provide a good reference for studying several dynamic processes modeled over graphs. Holme et al. [66] provide a description of several networks from different domains that are temporal or dynamic in nature. Our goal in this work is to build a graph data management system that can efficiently and scalably support these types of dynamic network analysis tasks over large volumes of data in real-time.

2.1.2 Temporal Network Visualization

Visualization is a principal tool in the exploration and analysis of networks. Several network visualization tools, such as Gephi [30] and NodeXL [64], are widely used for exploration and analysis of graph datasets. In recent years, there has been an emphasis on the development of techniques to aid the analysis of temporal graphs. These have found application in studying the evolution of social networks, the Web graph, biological networks, transmission networks, etc. Hence, assessing the general data management requirements for such tools is a key motivation in abstracting the correct problems in historical graph data management. Erten et al. [50] present a visual analysis of an evolving dataset from the computing literature through properties like diameter, connected components, collaborators, co-authors, and topic trends over time. Using visualization of different snapshots, Toyoda et al.
[131] study events like the appearance of pages and the relationships between different pages for a historical dataset of a Web archive. More such examples can be found in a survey by Kerracher et al. [72]. Different visualization techniques are useful in the process of gaining insights into the temporal nature of graphs. For instance, NetEvViz [76] extends NodeXL to study evolving networks; it uses an edge color coding scheme for comparative study of network growth. Ahn et al. [22] focus on different states of graph components in order to visualize the temporal dynamics of a social network. Alternate representations to node-link diagrams have been proposed in order to study the temporal nature of social networks, such as TimeMatrix [141], which uses a matrix-based visualization to better capture temporal aggregations and overlays. A temporal evolution analysis technique based on capturing visual consistency across snapshots can be seen in the work of Xu et al. [140]. A survey by Beck et al. [31] lists several other approaches for visual analysis of dynamic graphs.

2.2 Temporal Data Management

There is a vast body of literature on temporal relational databases, starting with the early work in the 1980s on developing temporal data models and temporal query languages. We will not attempt to present an exhaustive survey of that work, but instead refer the reader to several surveys and books on the topic [36, 119, 100, 124, 47, 120, 112]. The most basic concepts that a relational temporal database is based upon are valid time and transaction time, considered orthogonal to each other. Valid time denotes the time period during which a fact is true with respect to the real world. Transaction time is the time when a fact is stored in the database. A valid-time temporal database permits correction of a previously incorrectly stored fact [119], unlike transaction-time databases where an inquiry into the past may yield the previously known, perhaps incorrect version of a fact. Under this nomenclature, our data management system is based on valid time: for us, the time the information was entered into the database is not critical; our focus is on the time period when the information is true.

Salzberg and Tsotras [112] present a comprehensive survey of temporal data indexing techniques, and discuss two extreme approaches to supporting snapshot retrieval queries, referred to as the Copy and Log approaches. While the Copy approach relies on storing a new copy of the snapshot upon every point of change in the database, the Log approach relies on recording only the changes, annotated by time. Their hybrid is often referred to as the Copy+Log approach. Other data structures, such as interval trees [25] and segment trees [34], can also be used for storing temporal information. Temporal aggregation in scientific array databases [122] and time-series indexing [97, 40, 136] are other related topics of interest, but the challenges there are significantly different.

2.2.1 Temporal Queries and Storage

From a querying perspective, both valid-time and transaction-time databases can be treated as simply collections of intervals; however, indexing structures that assume transaction times can often be simpler since they do not need to support arbitrary inserts or deletes into the index. Salzberg and Tsotras [112] present a comprehensive survey of indexing structures for temporal databases. They also present a classification of the different queries that one may ask over a temporal database.
Under their notation, our focus in this work is on both the valid timeslice query, where the goal is to retrieve all the entities and their attribute values that are valid as of a specific time point, and the valid pure-key range query, where the goal is to fetch an entity's states over an interval of time. The snapshot index [132] is an I/O-optimal solution to the problem of snapshot retrieval for transaction-time databases. There has also been work on performing time travel operations such as computing temporal joins and temporal aggregates, retrieving version history, etc. [70, 35, 71]. Recently, there was a proposal to add a temporal dimension to the TPC-H benchmark [24]. Many industrial systems, such as Microsoft's ImmortalDB [87], Oracle's Flashback feature [108], and most recently, IBM's DB2 [113], have incorporated one or more of these temporal operations for relational databases.

2.2.2 Version Management

Most scientific datasets are semistructured in nature and can be effectively represented in XML format [38]. Lam and Wong [83] use complete deltas, which can be traversed in either direction of time, for efficient retrieval. Other systems store the current version as a snapshot and the historical versions as deltas from the current version [91]; for such a system, the deltas only need to be unidirectional. Ghandeharizadeh et al. [54] provide a formalism on deltas, which includes a delta arithmetic. All these approaches assume unique node identifiers in order to merge deltas with other deltas or snapshots. Buneman et al. [38] propose merging all the versions of the database into one single hierarchical data structure for efficient retrieval. In a recent work, Seering et al. [115] presented a disk-based versioning system that uses efficient delta encoding to minimize space consumption and retrieval time in array-based systems. Bhattacharjee et al. [33] systematically explore the trade-off between storage and recreation costs using compressed delta storage for different versions of a dataset. However, none of that prior work focuses on snapshot retrieval in general graph databases, or proposes techniques that can flexibly exploit the memory-resident information.

2.3 Graph Data Management

In recent years, there has been much work on graph storage and graph processing systems, and numerous commercial and open-source systems have been designed to address various aspects of graph data management. Some examples include Neo4J, AllegroGraph [17], Titan [15], GBase [68], Pregel [90], Giraph [1], GPS [111], Pegasus [69], GraphChi [80], GraphX [56], GraphLab [88], Cassovary [5], and Trinity [117]. These systems use a variety of different models for representation, storage, and querying, and there is a lack of standardized or widely accepted models for the same.

2.3.1 Graph Storage

The massive scale of graph datasets and the computational complexity of graph algorithms make graph partitioning and cluster computing a common theme across modern graph databases. Different graph stores use varying representations for the persistent storage of graphs. For instance, GBase stores a distributed sparse adjacency matrix representation of the graph in HDFS files in order to facilitate query processing using Hadoop [2]. Titan, on the other hand, uses a representation built on distributed key-value stores such as Cassandra [82], HBase [3], or BerkeleyDB [98]. Other examples of key-value store based graph stores are CloudGraph [6], VertexDB [16], and Redis Graph [11].
Certain stores such as Neo4J and Titan also feature main-memory caching to speed up graph retrieval from disk. Query-time graph fetch performance depends, on the one hand, on the nature of storage, partitioning, chunking, etc.; on the other hand, it depends on the nature of the queries served by the graph store. In general, graph queries follow fairly arbitrary co-access patterns, and hence none of the data stores optimize their storage layouts for a broad spectrum of graph queries. For instance, GraphChi ensures serial I/O for a specific pattern of queries for graph mining. Columnar storage is another characteristic increasingly being adopted by various graph stores. It is particularly helpful in decoupling the structural aspects of a graph from the attributes, thereby ensuring lighter fetch payloads as well as a greater amount of compression. Columnar storage fits naturally with several key-value stores that are designed to handle it effectively.

Relational databases are generally deemed unsuitable for storing graph data: the inherently traversal-oriented nature of graph queries leads to a large number of expensive joins if the underlying graph data is kept in a relational format. However, certain semi-structured data formats, such as XML (Extensible Markup Language) and RDF (Resource Description Framework), are able to capture graphs through their data models. RDF stores such as AllegroGraph are able to provide an effective storage base for certain kinds of graph queries with the support of the query languages described later. The two prominent reasons for the partial success of RDF and XML stores for graph data are, respectively, more effective joins in the context of graph traversal and the capability to handle a fluid schema.

2.3.2 Graph Query and Processing Frameworks

Most graph querying happens through programmatic access to graphs in languages such as Java, Python, or C++. Graph libraries such as Blueprints [4] and SNAP [86] provide a rich set of implementations of graph-theoretic algorithms. SPARQL [104, 13] is a language used to search for patterns in linked data; it works on an underlying RDF representation of graphs. T-SPARQL [57] is a temporal extension of SPARQL. He et al. [65] provide a language for finding sub-graph patterns using a graph as a query primitive. Gremlin [8] is a graph traversal language over the property graph data model, and has been adopted by several open-source systems. For large-scale graph analysis, perhaps the most popular framework is the vertex-centric programming framework, adopted by Giraph, GraphLab, GraphX, and several other systems; there have also been several proposals for richer and more expressive programming frameworks in recent years. However, most of these prior systems largely focus on analyzing a single snapshot of the graph data, with very little support for handling dynamic graphs, if any.

2.3.3 Temporal Graph Data Stores

There is also prior work on temporal RDF data and temporal XML data. Motik [95] presents a logic-based approach to representing valid time in RDF and OWL. Several works (e.g., [106, 126]) have considered the problems of subgraph pattern matching or SPARQL query evaluation over temporally annotated RDF data. There has been some work on expressing graph queries in a Datalog-based syntax [116, 129]. There is also much work on version management in XML data stores.

A few recent papers address the issues of storage and retrieval in dynamic graphs. G* [81] stores multiple snapshots compactly by utilizing commonalities.
Chronos [63, 94] is an in-memory system for processing dynamic graphs, with the objective of shared storage and computation for overlapping snapshots. Ghrab et al. [55] provide a system for network analytics through labeling of graph components. Gedik et al. [53] describe a block-oriented and cache-enabled system that exploits spatio-temporal locality to solve temporal neighborhood queries. Koloniari et al. [77] also utilize caching to fetch selective portions of temporal graphs, which they refer to as partial views. LLAMA [89] uses multi-versioned arrays to represent a mutating graph, but its focus is primarily on in-memory representation.

Chapter 3: Historical Graph Snapshot Retrieval

3.1 Introduction

Snapshot retrieval, or the reconstruction of a past state of the dataset, is perhaps the single most important retrieval primitive for historical network analytics. An evolutionary analysis task, such as finding the evolution of the PageRank of the central entities in a network, involves fetching hundreds of snapshots from different time points in the past. Once they are in memory, we need to run a PageRank routine to compute the required values and plot them. Our ability to perform this style of historical analysis efficiently depends on our ability to quickly fetch hundreds of those snapshots from a persistent datastore into memory. We observe that this crucial task is prevalent across a wide variety of temporal analysis tasks such as community evolution and the growth of network characteristics like density, diameter, etc.; it also serves as the basis for answering specific historical queries such as finding the graph clustering coefficient one year ago. An abstract solution to this problem provides a foundation for the task of storage and retrieval for large historical networks.

In this chapter, we present a novel, user-extensible, highly tunable, and distributed hierarchical index structure called DeltaGraph, which compactly stores the entire historical trace of a network and supports efficient retrieval of historical graph snapshots for single-site or parallel processing. Along with the original graph data, DeltaGraph can also maintain and index auxiliary information; this functionality can be used to extend the structure to efficiently execute queries like subgraph pattern matching over historical data. We develop analytical models for both the storage space needed and the snapshot retrieval times to aid in choosing the right construction parameters for a specific scenario. We expose a general programmatic API to process and analyze the snapshots retrieved from DeltaGraph.

Graph Data Model: In this chapter, we follow the data model described in Section 1.4 of the introduction. To recap, we view an evolving graph as a set of snapshots, one per time point, where time is assumed to be a discrete quantity. A change between two different snapshots is described by an ordered (chronological) set of events, each of which describes an atomic change in the network.

Chapter Outline: In the rest of the chapter, we begin by providing a background on existing alternatives for snapshot retrieval from the temporal relational database, geospatial data management, and computational geometry literature in Section 3.2. In Section 3.3, we describe the details of the DeltaGraph design and give an overview of retrieval queries and the accessibility library. Next, we describe the algorithms for performing various snapshot queries in a principled and planned manner in Section 3.4.
We also describe query speedup through prefetching, and the auxiliary framework for extending DeltaGraph beyond snapshot queries. In Section 3.5, we discuss the impact of different parameters on the behavior of DeltaGraph, and present the analytical models for estimating space and query times for given index configurations. In Section 3.6, we discuss the details of the construction and update algorithms for the index. We present experiments supporting the efficiency of DeltaGraph in Section 3.7 and conclude the chapter in Section 3.8.

3.2 Background

An optimal solution to answering snapshot retrieval queries is the external interval tree, presented by Arge and Vitter [25]. Their proposed index structure uses optimal space on disk, and supports updates in optimal (logarithmic) time. Segment trees [34] can also be used to solve this problem, but may store some intervals in a duplicated manner and hence use more space. Tsotras and Kangelaris [132] present the snapshot index, an I/O-optimal solution to the problem for transaction-time databases. Salzberg and Tsotras [112] also discuss two extreme approaches to supporting snapshot retrieval queries, called the Copy and Log approaches. In the Copy approach, a snapshot of the database is stored at each transaction state, the primary benefit being fast retrieval times; however, the space requirements make this approach infeasible in practice. The other extreme is the Log approach, where only and all the changes to the database are recorded, annotated by time. While this approach is space-optimal and supports O(1)-time updates (for transaction-time databases), answering a query may require scanning the entire list of changes and takes a prohibitive amount of time. A mix of those two approaches, called Copy+Log, where a subset of the snapshots is explicitly stored in addition to the changes, is often a better idea. These are illustrated in Figure 3.1.

[Figure 3.1: Temporal relational database indexing techniques: (a) Copy, with snapshots capturing "copies" of the graph at t0, . . . , t5; (b) Log, with events E1, . . . , E12 describing changes in the graph; (c) Copy+Log, combining snapshots and events.]

[Figure 3.2: An interval tree holds intervals and can potentially be used to index a temporal dataset using valid-time durations. Each node consists of: (1) a "center point"; (2) a list of all elements (intervals) that span the center point, sorted by start times and by end times; (3) pointers to left and right subtrees that correspond to intervals that end before the node's center point and those that start after it, respectively.]
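To make the Copy+Log idea concrete, the following is a minimal, illustrative Python sketch (with hypothetical names, reusing the Snapshot/Event helpers from the sketch in Section 1.4; it is not drawn from any of the cited systems). It materializes a full copy every k events while logging every change, so that a snapshot query replays at most k events on top of the nearest earlier copy.

import bisect
import copy

class CopyPlusLogStore:
    """Hypothetical Copy+Log store: a full copy every k events plus a complete event log."""
    def __init__(self, initial, k=1000):
        self.k = k
        self.log = []                                  # all events, in chronological order
        self.copies = {0: copy.deepcopy(initial)}      # number of events applied -> stored copy
        self.current = initial

    def record(self, event):
        apply_forward(self.current, event)             # helper from the Section 1.4 sketch
        self.log.append(event)
        if len(self.log) % self.k == 0:                # periodically store a full copy
            self.copies[len(self.log)] = copy.deepcopy(self.current)

    def snapshot_at(self, t):
        times = [e.time for e in self.log]
        n = bisect.bisect_right(times, t)              # events with time <= t
        base = max(i for i in self.copies if i <= n)   # nearest earlier stored copy
        g = copy.deepcopy(self.copies[base])
        for e in self.log[base:n]:                     # replay the remaining events
            apply_forward(g, e)
        return g

In this sketch, the pure Copy and Log approaches correspond to the extremes k = 1 and an unbounded k, respectively.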
We found these (and other prior) approaches to be insufficient and inflexible for our needs, for several reasons. First, they do not efficiently support multipoint queries, which we expect to be very commonly used in evolutionary analysis and which need to be optimized by avoiding duplicate reads and repeated processing of the events. Second, to cater to the needs of a variety of different applications, we need the index structure to be highly tunable, and to allow trading off different resources and user requirements (including memory, disk usage, and query latencies). Ideally, we would also like to control the distribution of average snapshot retrieval times over the history, i.e., we should be able to reduce the retrieval times for more recent snapshots at the expense of increasing them for the older snapshots (while keeping the utilization of the other resources the same), or vice versa. For achieving low latencies, the index structure should support flexible prefetching of portions of the index into memory, and should avoid processing any events that are not needed by the query (e.g., if only the network structure is needed, then we should not have to process any events pertaining to the node or edge attributes). Finally, we would like the index structure to be able to support different persistent storage options, ranging from a hard disk to the cloud; most of the previously proposed index structures are optimized primarily for disks.

3.3 DeltaGraph

Our proposed index data structure, DeltaGraph, is a rooted, directed, largely hierarchical graph whose lowest level corresponds to the snapshots of the network over time (which are not explicitly stored), and whose interior nodes correspond to graphs constructed by combining the lower-level graphs (these are typically not valid graphs as of any specific time point). The information stored with each edge, called an edge delta, is sufficient to construct the graph corresponding to the target node from the graph corresponding to the source node, and thus a specific snapshot can be created by traversing any path from the root to that snapshot. While conceptually simple, DeltaGraph is a very powerful, extensible, general, and tunable index structure that enables trading off the different resources and user requirements as per a specific application's needs. By appropriately tuning the DeltaGraph construction parameters, we can trade decreased snapshot retrieval times for increased disk storage requirements. One key parameter of the DeltaGraph enables us to control the distribution of average snapshot retrieval times over history. Portions of the DeltaGraph can be prefetched, allowing us to trade increased memory utilization for reduced query times. Many of these decisions can be made at run-time, enabling us to adaptively respond to changing resource characteristics or user requirements. DeltaGraph is highly extensible, providing a user the opportunity to define additional indexes to be stored on edge deltas in order to perform specific operations more efficiently. Finally, DeltaGraph utilizes several other optimizations, including a column-oriented storage layout to minimize the amount of data that needs to be fetched to answer a query, and multi-query optimization to simultaneously retrieve many snapshots.

DeltaGraph naturally enables distributed storage and processing to scale to large graphs. The edge deltas can be stored in a distributed fashion through the use of horizontal partitioning, and the historical snapshots can be loaded in parallel onto a set of machines in a partitioned fashion; in general, the two partitionings need not be aligned, but for computational efficiency, we currently require that they be aligned. Horizontal partitioning also results in lower snapshot retrieval latencies since the different deltas needed for reconstruction can be fetched in parallel. We have built a DeltaGraph prototype in Java along with a Python interface.
We used Kyoto Cabinet [9], a disk-based key-value store, as the back-end engine to store the deltas and eventlists; in the distributed mode, we run one instance on each machine. Our design decision to use a key-value store at the back end was motivated by the flexibility, the fast retrieval times, and the scalability afforded by such systems; since we only require a simple get/put interface from the storage engine, we can easily plug in other cloud-based, distributed key-value stores like HBase [3] or Cassandra [82]. Finally, we note that our proposed index generalizes beyond graph data and can be used for efficient snapshot retrieval in temporal relational databases as well. In fact, the DeltaGraph treats the network as a collection of objects and does not exploit any properties of the graphical structure of the data.

3.3.1 Anatomy of DeltaGraph

Figure 3.3(a) shows a simple DeltaGraph, where the nodes S1, . . . , S4 correspond to four historical snapshots of the graph, spaced L events apart (equal spacing is not a requirement, but simplifies analysis). We call these nodes leaves, even though there are bidirectional edges between them, as shown in the figure. The interior nodes of the DeltaGraph correspond to graphs that are constructed from their children by applying what we call a differential function, denoted f(). For an interior node Sp with children Sc1 , . . . , Sck (we abuse the notation somewhat to let Sp denote both the interior node and the graph corresponding to it), we have that Sp = f(Sc1 , . . . , Sck). The simplest differential function is perhaps the Intersection function; we discuss other differential functions in the next section. The graphs Sp are not explicitly stored in the DeltaGraph. Rather, we only store the delta information with the edges. Specifically, the directed edge from Sp to Sci is associated with a delta ∆(Sci , Sp), which allows construction of Sci from Sp. It contains the elements that should be deleted from Sp (i.e., Sp − Sci) and those that should be added to Sp (i.e., Sci − Sp). The bidirectional edges among the leaves also store similar deltas; here the deltas are simply the eventlists (denoted E1, E2, E3 in Figure 3.3), called leaf-eventlists. For a leaf-eventlist E, we denote by [Estart, Eend) the time interval that it corresponds to. For convenience, we add a special root node, called the super-root, at the top of the DeltaGraph, which is associated with a null graph (S8 in Figure 3.3); we call the children of the super-root the roots. A DeltaGraph can simultaneously have multiple hierarchies that use different differential functions (Figure 3.3(b)); this can be used to improve query latencies at the expense of a higher space requirement.

[Figure 3.3: DeltaGraphs with 4 leaves (S1, . . . , S4), leaf-eventlist size L, and arity 2: (a) a single hierarchy with differential function f; (b) two hierarchies, with differential functions f1 and f2, sharing the same leaves and super-root. ∆(Si, Sj) denotes the delta needed to construct Si from Sj.]

The deltas and the leaf-eventlists are given unique ids in the DeltaGraph structure, and are stored in a columnar fashion, by separating out the structure information from the attribute information.
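To make the roles of deltas and the differential function concrete, the following is a minimal, illustrative Python sketch (hypothetical, simplified structures, not the actual implementation; the real deltas are partitioned, columnar, and stored in a key-value store as described next). A graph is flattened to a set of elements (nodes, edges, attribute values), a delta records what to add and what to delete, and a leaf snapshot is materialized by applying the deltas along a path from the super-root down to that leaf.

def make_delta(target, source):
    """Delta(target, source): what to add to and delete from `source` to obtain `target`."""
    return {'add': target - source, 'delete': source - target}

def apply_delta(graph, delta):
    """Apply a delta to a graph, represented as a set of graph elements."""
    return (graph - delta['delete']) | delta['add']

def intersection_function(children):
    """The simplest differential function: an interior node is the intersection of its children."""
    result = set(children[0])
    for child in children[1:]:
        result &= child
    return result

def materialize(path_deltas):
    """Reconstruct a leaf snapshot by applying deltas along a root-to-leaf path,
    e.g., Delta(S7, S8), Delta(S5, S7), Delta(S1, S5) in Figure 3.3(a)."""
    graph = set()                      # the super-root corresponds to the null graph
    for delta in path_deltas:
        graph = apply_delta(graph, delta)
    return graph

For example, starting from the empty graph at the super-root S8 and applying ∆(S7, S8), ∆(S5, S7), and ∆(S1, S5) in order yields the snapshot S1.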
For simplicity, we assume here a separation of a delta ∆ into three components: (1) ∆struct, (2) ∆nodeattr, and (3) ∆edgeattr. For a leaf-eventlist E, we have an additional component, Etransient, where the transient events are stored. Finally, the deltas and the eventlists are stored in a persistent, distributed key-value store, the key being 〈partition id, delta id, c〉, where c ∈ {∆struct, ∆nodeattr, . . . , Etransient} specifies which of the components is being fetched or stored, and partition id specifies the partition. Each delta or eventlist is typically partitioned and stored in a distributed manner. Based on the node id of the concerned node(s), each event, edge, node, and attribute is assigned a partition using a hash function hp, such that partition id = hp(node id). In a setup with k distributed storage units, all the deltas and eventlists are likely to have k partitions each.

3.3.2 Snapshot Queries

We differentiate between a singlepoint snapshot query and a multipoint snapshot query. An example of the first is: "Retrieve the graph as of January 2, 1995." A multipoint snapshot query, on the other hand, requires us to simultaneously retrieve multiple historical snapshots (e.g., "Retrieve the graphs as of every Sunday between 1994 and 2004"). We also support more complex snapshot queries where a TimeExpression or a time interval is specified instead. Any snapshot query can specify whether it requires only the structure of the graph, or a specified subset of the node or edge attributes, or all attributes.

3.3.3 Accessibility

The system can be used through a programmatic API, an interactive visual interface for analytics, or a declarative high-level query language. The programmatic API is the best way to interact with the system in a tightly coupled manner. Specifically, the following is a list of some of the retrieval functions that we support in our programmatic API.

GetHistGraph(Time t, String attr options): In the basic singlepoint graph retrieval call, the first parameter indicates the time; the second parameter indicates the attribute information to be fetched, as a string formed by concatenating the sub-options listed in Table 3.1. For example, to specify that all node attributes except salary, and the edge attribute name, should be fetched, we would use: attr options = "+node:all-node:salary+edge:name".

GetHistGraphs(List