Abstract
 Title of dissertation:
 Measuring and Improving the Readability of
 Network Visualizations
 Cody Dunne, Doctor of Philosophy, 2013
 Directed by: Professor Ben Shneiderman
 Department of Computer Science
 Network data structures have been used extensively for modeling entities and
 their ties across such diverse disciplines as Computer Science, Sociology, Bioinfor-
 matics, Urban Planning, and Archeology. Analyzing networks involves understand-
 ing the complex relationships between entities as well as any attributes, statistics,
 or groupings associated with them. The widely used node-link visualization excels
 at showing the topology, attributes, and groupings simultaneously. However, many
 existing node-link visualizations are difficult to extract meaning from because of
 (1) the inherent complexity of the relationships, (2) the number of items designers
 try to render in limited screen space, and (3) for every network there are many
 potential unintelligible or even misleading visualizations. Automated layout algo-
 rithms have helped, but frequently generate ineffective visualizations even when
 used by expert analysts. Past work, including my own described herein, have
 shown there can be vast improvements in network visualizations, but no one can
 yet produce readable and meaningful visualizations for all networks.
 Since there is no single way to visualize all networks effectively, in this disser-
tation I investigate three complimentary strategies. First, I introduce a technique
 called motif simplification that leverages the repeating patterns or motifs in a
 network to reduce visual complexity. I replace common, high-payoff motifs with
 easily understandable glyphs that require less screen space, can reveal otherwise
 hidden relationships, and improve user performance on many network analysis
 tasks. Next, I present new Group-in-a-Box layouts that subdivide large, dense
 networks using attribute- or topology-based groupings. These layouts take group
 membership into account to more clearly show the ties within groups as well as the
 aggregate relationships between groups. Finally, I develop a set of readability
 metrics to measure visualization effectiveness and localize areas needing improve-
 ment. I detail optimization recommendations for specific user tasks, in addition to
 leveraging the readability metrics in a user-assisted layout optimization technique.
 This dissertation contributes an understanding of why some node-link visualiza-
 tions are difficult to read, what measures of readability could help guide designers
 and users, and several promising strategies for improving readability which demon-
 strate that progress is possible. This work also opens several avenues of research,
 both technical and in user education.
Measuring and Improving the Readability of Network
 Visualizations
 by
 Cody Dunne
 Dissertation submitted to the Faculty of the Graduate School of the
 University of Maryland, College Park, in partial fulfillment
 of the requirements for the degree of
 Doctor of Philosophy
 2013
 Advisory Committee:
 Professor Ben Shneiderman, Chair
 Professor Bonnie Dorr
 Assistant Professor Jon Froehlich
 Professor Amitabh Varshney
 Associate Professor Alan Neustadtl, Dean?s Representative
c Copyright by
 Cody Dunne
 2013
To my loving family
 ii
Acknowledgments
 I would like to first start by thanking my advisor, Ben Shneiderman, whose as-
 sistance was instrumental in helping me prepare this dissertation. Ben is a truly
 amazing graduate mentor, and he has helped me navigate the complexities of a
 research career and shaped me into a valuable member of the scientific community.
 Ben is a great professor with creativity, depth and breadth of knowledge, and vi-
 sion; and his enthusiasm and encouragement constantly inspires the people around
 him. Ben helped me set iterative goals while keeping the big picture in mind, was
 always available to talk, encouraged collaboration with domain experts to develop
 innovative dissertation topics, and kept the dissertation in mind as the main goal
 for every project. He constantly dedicated his time, money, and connections to
 ensure I had opportunities to do interesting research and present the results. I was
 very pleased that my nomination of him for the University of Maryland?s Graduate
 Faculty Mentor of the Year Award was selected as one of only four in 2013.
 However, I have also had a large support network of other mentors and col-
 laborators throughout my dissertation work. My committee members have played
 a pivotal role, both in my dissertation work and my job search, and I would like
 to thank them in particular: Bonnie Dorr, Jon Froehlich, Amitabh Varshney, and
 iii
iv
 Alan Neustadtl. Other collaborators I would like to thank are Catherine Plaisant,
 Marc Smith, Nathalie Henry Riche, Derek Hansen, Tony Capone, Ping Wang,
 Leah Findlater, Adam Perer, Anne Rose, Seth Powsner, Manuel Freire, Elizabeth
 Bonsignore, Eduarda Mendes Rodrigues, Natasa Milic-Frayling, Judith Klavans,
 Robert Gove, Saif Mohammad, Bongshin Lee, Ron Metoyer, George Robertson,
 Snigdha Chaturvedi, Zahra Ashktorab, Rajan Zacharia, Puneet Sharma, Udayan
 Khurana, Krist Wongsuphasawat, Darya Filippova, Awalin Sopan, Alex Quinn,
 Peter Fontana, Nick Gramsky, Rose Kirby, Emre Sefer, Meirav Taieb-Maimon,
 Andreea Olea, Eylul Dogruel Vladimir Barash, Eric Gleave, Dana Rotman, Ryan
 Blue, Adam Fuchs, Kyle King, Aaron Schulman, Yiyan Liu, and many, many more.
 I also appreciate the support of many funding sources for my dissertation work,
 including the Social Media Research Foundation; the Connected Action Consulting
 Group; National Science Foundation grants SBE 0915645, IIS 0705832, and IIS
 0968521; HHS SHARP grant 10510592; Microsoft External Research; Microsoft
 Research; the National Cancer Institute, and several University of Maryland travel
 grants.
Contents
 Acknowledgments iii
 Contents v
 1 Introduction 1
 1.1 Motif Simplification to Reduce Complexity . . . . . . . . . . . . . . 8
 1.2 Meta-Layouts for Subdividing Networks . . . . . . . . . . . . . . . 10
 1.3 Measuring Network Visualization Readability . . . . . . . . . . . . 14
 1.4 Exploration Environment . . . . . . . . . . . . . . . . . . . . . . . . 16
 1.5 Specific Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 17
 1.6 Dissertation Roadmap . . . . . . . . . . . . . . . . . . . . . . . . . 19
 2 Related work 21
 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
 2.2 Network Visualization & Analysis . . . . . . . . . . . . . . . . . . . 22
 2.3 Measuring Node-Link Visualization Readability . . . . . . . . . . . 31
 2.4 Motif Simplification . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
 2.5 Meta-Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
 2.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
 2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
 3 Applied Network Visualization 48
 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
 3.2 NodeXL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
 3.2.1 Contributions to NodeXL . . . . . . . . . . . . . . . . . . . 50
 3.2.2 NodeXL Interface . . . . . . . . . . . . . . . . . . . . . . . . 52
 3.3 Applying Network Visualization to Real Problems . . . . . . . . . . 53
 3.3.1 The Importance of Network Topology and Filtering . . . . . 54
 3.3.2 The Importance of Node & Edge Attributes . . . . . . . . . 59
 3.3.3 The Importance of Statistics and Algorithms . . . . . . . . . 64
 3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
 v
Contents vi
 4 Motif Simplification to Reduce Complexity 74
 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
 4.1.1 Chapter Overview . . . . . . . . . . . . . . . . . . . . . . . . 75
 4.2 Network Motif Simplification . . . . . . . . . . . . . . . . . . . . . . 76
 4.2.1 Glyph Design . . . . . . . . . . . . . . . . . . . . . . . . . . 77
 4.2.2 Motif Detection Algorithms . . . . . . . . . . . . . . . . . . 86
 4.2.3 NodeXL Implementation . . . . . . . . . . . . . . . . . . . . 94
 4.3 Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
 4.3.1 U.S. Senate Voting Patterns in 2007 . . . . . . . . . . . . . . 97
 4.3.2 Lostpedia Wiki Edits . . . . . . . . . . . . . . . . . . . . . . 102
 4.3.3 Ravelry Forums . . . . . . . . . . . . . . . . . . . . . . . . . 105
 4.3.4 VOSON Web Crawl . . . . . . . . . . . . . . . . . . . . . . 105
 4.3.5 Patient Discharge Summaries . . . . . . . . . . . . . . . . . 113
 4.3.6 Larger Networks . . . . . . . . . . . . . . . . . . . . . . . . 124
 4.4 Initial Usability Study . . . . . . . . . . . . . . . . . . . . . . . . . 124
 4.5 Controlled Experiment . . . . . . . . . . . . . . . . . . . . . . . . . 126
 4.5.1 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
 4.5.2 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
 4.5.3 Participants . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
 4.5.4 Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
 4.5.5 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
 4.5.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
 4.5.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
 4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
 5 Meta-Layouts for Subdividing Networks 140
 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
 5.1.1 Chapter Overview . . . . . . . . . . . . . . . . . . . . . . . . 144
 5.2 Grouping Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . 146
 5.2.1 Clustering to Identify Structural Components . . . . . . . . 146
 5.2.2 Grouping to Find Attribute Relationships . . . . . . . . . . 147
 5.2.3 Advanced and Combined Approaches . . . . . . . . . . . . . 150
 5.3 Midichlorian-Directed Layout . . . . . . . . . . . . . . . . . . . . . 151
 5.4 Group-in-a-Box Meta-Layouts . . . . . . . . . . . . . . . . . . . . . 159
 5.4.1 Treemap Layout . . . . . . . . . . . . . . . . . . . . . . . . . 160
 5.4.2 Croissant-Donut Layout . . . . . . . . . . . . . . . . . . . . 162
 5.4.3 Force-Directed Layout . . . . . . . . . . . . . . . . . . . . . 170
 5.4.4 Showing Edges Between Groups . . . . . . . . . . . . . . . . 181
 5.4.5 Dividing the Problem . . . . . . . . . . . . . . . . . . . . . . 181
 5.5 Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
Contents vii
 5.5.1 Continent-Holding Strategies in Risk . . . . . . . . . . . . . 188
 5.5.2 Finding Regional Innovation Clusters . . . . . . . . . . . . . 195
 5.5.3 Patient Discharge Summaries . . . . . . . . . . . . . . . . . 206
 5.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . 213
 5.6.1 Pilot Study . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
 5.6.2 Readability Measures . . . . . . . . . . . . . . . . . . . . . . 214
 5.6.3 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
 5.6.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
 5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
 6 Measuring Network Visualization Readability 226
 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
 6.1.1 Chapter Overview . . . . . . . . . . . . . . . . . . . . . . . . 231
 6.2 Readability Metrics in SocialAction . . . . . . . . . . . . . . . . . . 232
 6.2.1 Case Study: Alberta Politics Newsgroup . . . . . . . . . . . 235
 6.2.2 Case Study: New Testament Name Co-Occurrence . . . . . . 240
 6.3 Readability Metrics in NodeXL . . . . . . . . . . . . . . . . . . . . 242
 6.4 Specific Readability Metrics . . . . . . . . . . . . . . . . . . . . . . 244
 6.4.1 Node-Node Overlap @n . . . . . . . . . . . . . . . . . . . . . 245
 6.4.2 Global Readability Metric @n . . . . . . . . . . . . . . . . . 248
 6.4.3 Node Readability Metric @nj2Nn . . . . . . . . . . . . . . . . 249
 6.4.4 Edge Crossing @c . . . . . . . . . . . . . . . . . . . . . . . . 249
 6.4.5 Global Readability Metric @c . . . . . . . . . . . . . . . . . 251
 6.4.6 Edge Readability Metric @ei2Ecei . . . . . . . . . . . . . . . . . 252
 6.4.7 Node Readability Metric @nj2Ncnj . . . . . . . . . . . . . . . . 253
 6.4.8 Edge Tunnel . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
 6.4.9 Edge Crossing Angle @eca . . . . . . . . . . . . . . . . . . . 257
 6.4.10 Angular Resolution (min) @arm . . . . . . . . . . . . . . . . 258
 6.4.11 Global Readability Metric @arm . . . . . . . . . . . . . . . . 259
 6.4.12 Node Readability Metric @nj2Narm . . . . . . . . . . . . . . . . 259
 6.4.13 Angular Resolution (avg) @ara . . . . . . . . . . . . . . . . . 259
 6.4.14 Global Readability Metric @ara . . . . . . . . . . . . . . . . 260
 6.4.15 Node Readability Metric @nj2Nara . . . . . . . . . . . . . . . . 260
 6.4.16 Visualization Coverage Metric @vc . . . . . . . . . . . . . . . 260
 6.4.17 Group Overlap . . . . . . . . . . . . . . . . . . . . . . . . . 262
 6.4.18 Additional Readability Metrics . . . . . . . . . . . . . . . . 265
 6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
Contents viii
 7 Conclusion and Future Directions 271
 7.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
 7.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
 7.2.1 Motif Simplification . . . . . . . . . . . . . . . . . . . . . . . 274
 7.2.2 Group-in-a-Box Layouts . . . . . . . . . . . . . . . . . . . . 284
 7.2.3 Readability Metrics . . . . . . . . . . . . . . . . . . . . . . . 286
 7.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
 Bibliography 291
List of Figures
 1.1 A node-link visualization of relationships among Twitter users men-
 tioning the hashtag ?#WIN09?, which was used by participants at a
 network science conference in September 2009. Each Twitter user is
 represented by a node containing its image, and edges between users
 indicate follow, mention, or reply relationships. The force-directed
 layout used to position the nodes highlights interesting patterns of
 connectivity like the two large communities of researchers. From
 Fig. 3.1 of the NodeXL book [HDS10, p. 33]. . . . . . . . . . . . . . 2
 1.2 Different visualizations of the same network, with (a) obscuring the
 topology while (b) and (c) are more understandable with less edge
 crossings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
 1.3 An experimental comparison of six layout algorithms on the same so-
 cial network produced widely different layouts. The top row layouts
 performed well, though bottom row layouts are difficult to extract
 meaning from. From [HJ06]. . . . . . . . . . . . . . . . . . . . . . . 6
 1.4 Fan, connector, and clique motifs (top) and their glyphs (bottom). . 8
 1.5 A bipartite network of Lostpedia of wiki edits (a) and a simplified
 version using glyphs for fan and connector motifs (b). . . . . . . . . 9
 1.6 Pennsylvania innovation relationships during 1990 (main compo-
 nent) collected by Christopher Scott Dempwolf. Nodes are laid
 out using the Harel-Koren FMS layout [HK02a] and topologic clus-
 ters found using the Clauset-Newman-Moore algorithm [CNM04]
 are shown using node color and shape. See Section 5.5.2 for more
 details and analyses of this network. . . . . . . . . . . . . . . . . . . 11
 1.7 The network for the board game Risk, where nodes are countries
 and edges indicate legal movements. Nodes are laid out using Harel-
 Koren FMS [HK02a], clustered and colored using the Clauset-Newman-
 Moore topologic clustering algorithm [CNM04]. Inter-group edges
 are combined into thick meta-edges. (a) shows the initial visualiza-
 tion, while the others show the three Group-in-a-Box (GIB) layout
 variants. See Section 5.5.1 for more details and analysis. . . . . . . 12
 ix
List of Figures x
 1.8 We can eliminate the node occlusion and edge tunnels that make
 the central overlapping group in Fig. 1.8a so hard to understand
 by zooming out and increasing the the spring lengths of the layout
 algorithm (Fig. 1.8b). . . . . . . . . . . . . . . . . . . . . . . . . . . 14
 1.9 NodeXL showing the readability metrics dialog (foreground), the
 nodes in the worksheet with edge crossing and node overlap metric
 columns, and visualization where nodes and edges are colored red-
 to-black by the edge crossing metric. The worst offenders are shown
 in red. The network shown represents the legal moves in the board
 game Risk from Fig. 1.7a. . . . . . . . . . . . . . . . . . . . . . . . 15
 2.1 The Pajek social network analysis tool [BM98] showing the main
 core subgraph extracted from Internet routing data. . . . . . . . . . 23
 2.2 The Cytoscape biologic network analysis tool [Sha+03]. . . . . . . . 24
 2.3 NodeTrix [HFM07] showing an overview of research in information
 visualization from the InfoVis ?04 contest. . . . . . . . . . . . . . . 25
 2.4 NVSS [SA06] showing citations from two Circuit Court cases in
 1991-1993 to 19 Supreme Court cases and two other Circuit Court
 cases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
 2.5 GraphDice [Bez+10] showing the InfoVis 2004 contest bibliographic
 network. The left shows the plot matrix window and the right shows
 the selected plot. The right view animates between selected plots. . 28
 2.6 ManyNets [Fre+10] displaying the distributions of various statistics
 across subgraphs (rows). . . . . . . . . . . . . . . . . . . . . . . . . 28
 2.7 PivotGraph [Wat06] showing communication between aggregations
 of men and women (columns) and various locations (rows). . . . . . 29
 2.8 The main NetLens [Kan+06] interface here is showing ACM SIGCHI
 conference papers on the left and authors on the right. . . . . . . . 30
 2.9 Simple rule-based drawing optimizations shown in Figure 2.3.1 of
 [Sug02, p. 14]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
 2.10 Greedy graph summarization technique applied to the CRN-10k
 graph. From [NRS08]. . . . . . . . . . . . . . . . . . . . . . . . . . 35
 2.11 An interesting motif found in the protein-protein interaction net-
 work of S. cerevisae, a species of yeast. It appears 27,720 times,
 though these motifs all overlap and share the same set of 29 nodes.
 From [GK07]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
 2.12 In MAVIsto [KSS06], matches for a particular motif like the feed-
 forward loop are laid out aligned the same direction and highlighted.
 The bar chart shows how frequently particular motifs occur above
 expected levels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
List of Figures xi
 2.13 Large Graph Layout [Ada+04] rendering of the internal structure of
 the Internet (as of 2005). From opte.org/maps . . . . . . . . . . . . 40
 2.14 An experimental comparison of six layout algorithms on a random
 grid and Sierpinski triangle dataset, discussed in [HJ06]. . . . . . . 42
 2.15 An experimental comparison of six layout algorithms on a social
 network dataset produced widely different layouts. From [HJ06]. . . 43
 2.16 Spatially ordered Treemap [WD08] of the London tube network.
 Stations (squares) are colored by the lines they serve. . . . . . . . . 44
 2.17 DICON [Cao+11] showing Treemap-like icons for clusters. . . . . . 44
 2.18 Increasing strength of edge bundling going left to right. From [Hol06]. 44
 3.1 The NodeXL [Smi+10] workspace. The dual pane view of net-
 work data and metrics (left pane) with node-link visualization (right
 pane) provide an integrated snapshot of statistics and visualiza-
 tion, along with built-in functions and controls that support ex-
 ploration and discovery. Individual worksheets separate network
 analysis tasks into separate categories, closely aligned with topol-
 ogy and attribute-based tasks, such as ?Edges?, ?Vertices? (nodes),
 and ?Groups.? The social network shown reflects voting patterns of
 U.S. senators, analyses of which are detailed in [PS08a; PS09], as
 well as Sections 3.3.1 and 4.3.1. . . . . . . . . . . . . . . . . . . . . 49
 3.2 Relationships between cancer research, awareness, and outreach in
 DC, MD, VA, and WV. The different colors represent each of the
 states in the region. (a) shows the network with the CIS ego node
 circled in green, while (b) shows the same network after removing
 the CIS node and laying it out again. The resulting visualization
 shows the remaining group structure and connections more clearly. . 55
 3.3 2007 U.S. Senate voting network, showing all 4950 links. The net-
 work is visualized inside the NodeXL network analysis tool as part
 of Excel. The highlighted red edges show the Akaka?Allard and
 Akaka?Baucus ties. . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
 3.4 NetGrok?s [Blu+08] elements include a node-link visualization (up-
 per left), a time-line histogram (lower left), a filter panel (upper
 right), and details on demand (lower right). . . . . . . . . . . . . . 60
 3.5 NetGrok?s [Blu+08] treemap layout arranges computers by the num-
 ber of connections they have and colors them by the bandwidth used.
 Communications between computers are shown using highlighting
 on mouseover. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
List of Figures xii
 3.6 GraphTrail [Dun+12a] showing three views of ACM SIGCHI con-
 ference publications, based on both the authors and their connected
 papers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
 3.7 A GraphTrail [Dun+12a] analysis showing two parallel exploration
 paths, the top examining Georgia Tech (GT) publication and cita-
 tion patterns and the bottom comparing Microsoft Research (MS).
 They start at the ROOT chart that contains all the papers in the
 dataset. Charts in each path are numbered in order of creation (e.g.,
 1, GT2, GT3, etc.), and the user interactions are shown with stars.
 The MERGED chart is the union of both branches? results. The
 user moved the mouse over the final parent link in the GT path
 (circled), highlighting the chain of actions up to the root. . . . . . . 62
 3.8 These line charts show the impact of treemaps (TM/green), cone
 trees (CT/red), and hyperbolic trees (HT/blue) in terms of trade
 press articles, academic papers, and patents. (a) shows the number
 of publications per year by type of publication for each innovation
 and (b) shows the number of citations to papers and patents by year
 for each innovation. Note that the sharp fall in patent figures in the
 faded area may be due to the average 32-month USPTO processing
 time in 2005-2008. From [Shn+12]. . . . . . . . . . . . . . . . . . . 65
 3.9 NetVisia [Gov+11b] visualization of the clustered heat map of the
 degree values for the STICK business intelligence term co-occurrence
 data from 2005, filtered to show only nodes with degrees between
 45 and 491. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
 3.10 After removing edges with low weight we can see the structure the
 network backbone. Isolate category pairs are drawn in a ring around
 the main connected component and singletons are staggered in the
 corners. Each node is colored by its semantic orientation (red for
 negative, blue for positive) and edges are colored by their weight,
 from red to blue. Node shape also codes semantic orientation, with
 triangles positive and circles negative. Size codes the magnitude
 the semantic orientation, with the largest nodes representing the
 extremes. Node labels are shown for nodes in isolates and those in
 the top 20 for betweenness centrality. From [MDD09]. . . . . . . . . 67
 3.11 The main views of ASE [Dun+12b] are displayed and labeled here:
 Reference Management (1?4), Citation Network Statistics & Visu-
 alization (5?6), Citation Context (7), Multi-Document Summaries
 (8), and Full Text with hyperlinked citations. . . . . . . . . . . . . 69
List of Figures xiii
 3.12 Algorithmically found communities in ASE [Dun+12b] are shown
 using convex hulls in the node-link visualization. When selected,
 all the citation context is shown in the top-right, along with an
 automatically generated summary of the overall context (bottom-
 right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
 4.1 From left to right: fan, connector, and clique motifs. . . . . . . . . . 74
 4.2 A 2-connector motif with three simplified glyph variants: diamond,
 crescent, and tapered diamond. . . . . . . . . . . . . . . . . . . . . 78
 4.3 A 3-connector motif and its glyph. . . . . . . . . . . . . . . . . . . 79
 4.4 Three fan motifs and two glyph variants of each. . . . . . . . . . . . 80
 4.5 Three 2-connector motifs and their glyphs. . . . . . . . . . . . . . . 81
 4.6 4-, 5-, and 6-clique motifs and their glyphs. . . . . . . . . . . . . . . 81
 4.7 Glyphs for fan, clique, and connector motif overlap. . . . . . . . . . 84
 4.8 The standard NodeXL workspace, showing U.S. Senate voting pat-
 terns from 2007. The left view shows the worksheets that store the
 network and its attributes, while the right pane shows a node-link
 visualization of the network. . . . . . . . . . . . . . . . . . . . . . . 94
 4.9 U.S. Senate 2007 co-voting network at 65% and 70% agreement cut-
 offs, simplified using clique motif glyphs. Key features are visible,
 such as the moderate Republican clique around McCain with ?wild-
 cards? at the periphery. . . . . . . . . . . . . . . . . . . . . . . . . 98
 4.10 U.S. Senate 2007 co-voting network at 80% and 85% agreement
 cutoffs, simplified using clique motif glyphs. The east-coast liberals
 and the Blue Dog Democrats separate at 80%. We see the network
 decompose at higher cutoffs. . . . . . . . . . . . . . . . . . . . . . . 99
 4.11 U.S. Senate 2007 co-voting network at 90% and 95% agreement
 cutoffs, simplified using clique motif glyphs. We see the Republican
 party fragment, with only the two senators from Georgia remaining
 at 95% agreement. . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
 4.12 A bipartite network of Lostpedia wiki edits showing wiki pages as
 boxes and their associated editors as discs. . . . . . . . . . . . . . . 103
 4.13 The Lostpedia wiki edits after being simplified using fan and con-
 nector motif glyphs. . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
List of Figures xiv
 4.14 This network of relationships between Ravelry forums and their
 users was created by a student in Derek Hansen?s Communities of
 Practice class. In (a), three forums represented in blue are con-
 nected to contributers, and the contributers are sized and colored
 by the number of completed projects. Edge width is based on the
 number of posts by each user. This version was adapted from Fig.
 9.10 of the NodeXL book [HSS11, p. 139]. (b) shows a simplified
 version of this network, where the fan and connector motifs have
 been replaced by representative glyphs. The glyphs are sized by the
 number of nodes they replace and colored according to the average
 node attribute value. Likewise, aggregate edges between glyphs are
 sized and colored by the average of the edge weights of the edges
 they replace. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
 4.15 This drawing represents the network of web pages connected to vo-
 son.anu.edu.au obtained by a web crawl. I modified it from Fig.
 12.9 of the NodeXL book [HSS11, p. 192]. A similar graph for wiki
 structure is shown on p. 259. The layout is done using Fruchterman-
 Reingold [FR91] in NodeXL, and head nodes for the fans of singly-
 connected nodes are shown in blue. . . . . . . . . . . . . . . . . . . 108
 4.16 A web crawl starting at voson.anu.edu.au, modified from Fig. 12.9
 of the NodeXL book [HSS11, p. 192], and laid out using the Harel-
 Koren FMS layout [HK02a]. . . . . . . . . . . . . . . . . . . . . . . 109
 4.17 Web crawl network with each fan and connector motif shown in a
 distinct color and shape. . . . . . . . . . . . . . . . . . . . . . . . . 110
 4.18 Web crawl network with nodes colored by their eigenvector centrality.111
 4.19 Web crawl network with fan and connector motifs simplified and
 colored by underlying eigenvector centrality. . . . . . . . . . . . . . 112
 4.20 Patients related to concepts from their medical discharge reports.
 This subnetwork focuses on the concepts ?hops5325? and ?orch7323?
 (orange discs) and their associated patients (purple triangles) and
 concepts (blue discs). The network is laid out using the Harel-Koren
 FMS layout algorithm [HK02a]. . . . . . . . . . . . . . . . . . . . . 114
 4.21 Patients and concepts from Fig. 4.20 after applying fan and connec-
 tor motif simplification. . . . . . . . . . . . . . . . . . . . . . . . . . 115
 4.22 Simplified patient and concept network from Fig. 4.21 with fans of
 20 or more concepts highlighted. This shows groups of concepts
 that are uniquely associated with a single patient. Edges from these
 fans to their associated patient, as well as the patient themselves,
 are highlighted too. . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
List of Figures xv
 4.23 Patient and concept network of only the patients connected to the
 large highlighted fans from Fig. 4.22, as well as any associated con-
 cepts. The initial ?hops5325? concept is on the far right, connected
 to only two patients. . . . . . . . . . . . . . . . . . . . . . . . . . . 118
 4.24 Patient and concept network from Fig. 4.23 after applying mo-
 tif simplification. The connector motif which contains the initial
 ?hops5325? concept and three other concepts is highlighted in or-
 ange. These four concepts are only connected to two patients. . . . 119
 4.25 Patients and concepts from the original simplified view in Fig. 4.21.
 Connector motifs of concepts connected to at least 20 patients are
 highlighted. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
 4.26 Patients and concepts from Fig. 4.20, after drilling down to only
 those patients connected to our original ?hops3525? and ?orch7323?,
 as well as two other Hazardous or Poisonous Substances: ?hops5323?
 and ?hops5324?. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
 4.27 A simplified view of the patients and concepts in Fig. 4.26, which
 highlights the aggregate patient relationships between the concepts. 123
 4.28 Bar charts showing performance for Task 1: ?About how many nodes
 are in the network?? The left chart shows the time spent answering
 the question while the right chart shows the error in the node count
 estimate. In this chart, and in the following ones, error bars indicate
 one standard deviation and asterisks show the level of significance
 of the statistical test (?*?, ?**?, and ?**? denote p<0.10, 0.05, and
 0.01 respectively). Negative numbers, if present, show the number
 of users that skipped the question or ran out of time. . . . . . . . . 130
 4.29 Bar charts showing performance for Task 2: ?Which individual node
 would we remove to disconnect the most nodes from the main net-
 work?? The left chart shows the time spent while the right chart
 shows the accuracy at selecting the correct node. . . . . . . . . . . 130
 4.30 Bar charts showing performance for Task 3: ?Which is the largest
 ( fan | connector | clique ) motif and how many nodes does it con-
 tain?? The left charts show the results for fans, the middle for
 connectors, and the right for cliques. . . . . . . . . . . . . . . . . . 131
 4.31 Bar charts showing performance for Task 4: ?Which node has the
 label ?XXX?? (where XXX was a name or number)? The left charts
 are for plainly visible nodes, while the right show labels hidden inside
 a simplified glyph. . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
List of Figures xvi
 4.32 Bar charts showing performance for Task 5: ?What is the length
 of the shortest path between the two highlighted nodes?? The left
 chart shows the time spent while the right chart shows the error at
 estimating path length. . . . . . . . . . . . . . . . . . . . . . . . . . 133
 4.33 Bar charts showing performance for Task 6: ?Which of the two
 highlighted nodes has more neighbors?? The left chart shows the
 time spent while the right chart shows the accuracy at selecting the
 correct node. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
 4.34 Bar charts showing performance for Task 7: ? How many common
 neighbors are shared by the two highlighted nodes?? The left chart
 shows the time spent while the right chart shows the error in the
 shared neighbor count estimate. . . . . . . . . . . . . . . . . . . . . 134
 4.35 Bar charts showing performance for Task 8: ?Which of two pairs
 of nodes has more common neighbors?? The left chart shows the
 time spent while the right chart shows the accuracy at selecting the
 correct pair of nodes. . . . . . . . . . . . . . . . . . . . . . . . . . . 134
 5.1 Co-appearance network in Les Mis?rables, originally compiled by
 Knuth [Knu93] and made into an edge list by Newman and Girvan
 [NG04]. Available in the NodeXL format from nodexl.codeplex.com/wikipage?title=NodeXL%20Teaching%20Resources141
 5.2 Co-appearance network in Les Mis?rables from Fig. 5.1, after using
 the squarified Treemap Group-in-a-Box layout. Each box shows a
 cluster found using the Wakita-Tsurumi algorithm [WT07]. Inter-
 group edges are hidden to better show internal cluster topology.
 This visualization highlights the structure of each group, such as
 the Javert & Fantine cluster and the Thenardier cluster. . . . . . . 143
 5.3 The U.S. Senate co-voting network for 2007 in shown here, with
 nodes for individual senators colored by their parties (blue Democrats,
 red Republicans, orange Independents), sized by betweenness cen-
 trality, and laid out using Furuchterman-Reingold [FR91]. Edges
 tie senators together and are weighted by their percent of voting
 agreement. Only those edges with at least 50% agreement are shown.149
 5.4 2007 U.S. Senators grouped by their regional affiliation into meta-
 nodes. Aggregate meta-edges show the number of senators between
 the two groups that vote the same way on bills at least 50% of the
 time. Collapsed from the network in Fig. 5.3. . . . . . . . . . . . . 150
 5.5 Graph summarization of the human protein interaction network
 from the HPRD database drawn with the Prefuse Force-Directed
 Layout with a global anti-gravity coefficient of 9 10 6. . . . . . . . 152
List of Figures xvii
 5.6 Same summarized human protein interaction network as Fig. 5.5,
 but clustered using Newman?s heuristic with convex hulls surround-
 ing each cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
 5.7 Same summarized, clustered human protein interaction network as
 Fig. 5.6, but using a global anti-gravity coefficient of 9 10 5 and
 zoomed in on the main connected component. Clusters are sepa-
 rated somewhat using the Vizster meta-layout modification to the
 Prefuse force-directed layout, resulting in less cluster overlap. . . . . 156
 5.8 Same summarized, clustered human protein interaction network as
 Fig. 5.7, with clusters separated further using the Midichlorian-
 Directed Layout. The internal structure of these clusters is more
 visible, as well as the inter-cluster relationships. . . . . . . . . . . . 158
 5.9 2007 U.S. Senators grouped by their regional affiliation. From [Rod+11].
 See Section 4.3.1 for more on this dataset. . . . . . . . . . . . . . . 161
 5.10 The basic principle behind the Donut variant of the Croissant-Donut
 layout is to place the most connected group in the center of the
 screen, then placing the other groups around its perimeter based on
 their connectedness (number of other groups they are connected to). 164
 5.11 The basic principle behind the Croissant variant of the Croissant-
 Donut layout is to place the most connected group in the top of
 the screen, then place the other groups around the other three sides
 based on their connectedness (number of other groups they are con-
 nected to). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
 5.12 A Donut-favoring network & groups, shown in the Treemap layout. 167
 5.13 A Donut-favoring network & groups, shown in the Donut layout. . . 167
 5.14 A Croissant-favoring network & groups, shown in the Treemap layout.168
 5.15 A Croissant-favoring network & groups, shown in the Croissant layout.168
 5.16 The Force-Directed GIB layout explicitly positions groups based on
 their aggregate connections, showing group relationships clearly at
 the expense of additional screen space. . . . . . . . . . . . . . . . . 170
 5.17 Group box positions after running the Harel-Koren FMS layout
 [HK02a] on the group relationship network of innovations in Penn-
 sylvania (see Section 5.5.2 for dataset details). Edges between groups
 are hidden. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
 5.18 An original network visualization (left), the same visualization af-
 ter removing node-node overlap with the PRISM algorithm [GN98]
 (center), and after removing node-node overlap with the solve_VPSC
 algorithm [DMS06; DMS07]. solve_VPSC maintains orthogonal or-
 dering but can result in highly skewed visualizations. From [GH09]. 175
List of Figures xviii
 5.19 An original network visualization (left), the same visualization af-
 ter removing node-node overlap with the PRISM algorithm [GN98]
 (center), and after removing node-node overlap with the solve_VPSC
 algorithm [DMS06; DMS07]. solve_VPSC maintains orthogonal or-
 dering but can result in highly skewed visualizations. From [GH09]. 176
 5.20 Network and groups from Fig. 5.17, using a different initial set of
 positions from the Harel-Koren FMS layout [HK02a] and after ad-
 justing box positions using the PRISM overlap removal technique
 [GH09]. In this case I chose an initial space-filling factor of 20%.
 The red lines map the original group positions, represented by col-
 ored shapes, to the final box positions. There is generally little
 movement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
 5.21 Network and groups from Fig. 5.17, using a different initial set of
 positions from the Harel-Koren FMS layout [HK02a] and after ad-
 justing box positions using the PRISM overlap removal technique
 [GH09]. In this case I chose an initial space-filling factor of 50%. The
 red lines map the original group positions, represented by colored
 shapes, to the final box positions. There is a substantial amount of
 movement, and while most of it preserves group relationships the
 largest groups get shoved to the periphery. . . . . . . . . . . . . . 179
 5.22 Three ways to show edges between groups in a Group-in-a-Box lay-
 out. From top to bottom: show all underlying edges, hide all un-
 derlying edges, and use aggregate meta-edges. . . . . . . . . . . . . 182
 5.23 The NodeXL Group-in-a-Box user interface. The right graph pane
 shows a Force-Directed Group-in-a-Box layout of the Risk network,
 which is described further in Section 5.5.1. The left Edges worksheet
 shows some of the edges connecting the nodes in the network. The
 Layout Options dialog in the foreground allows users to select their
 desired Group-in-a-Box layout, the size of group boxes, how to treat
 inter-group edges, and whether to use a separate grid layout for
 groups with few edges instead of the chosen main layout. . . . . . . 183
 5.24 The Cytoscape biologic network analysis tool [Sha+03], currently
 showing the human protein interaction network after applying graph
 summarization [NRS08]. Disconnected components are laid out in-
 dividually, sorted by screen space used, and striped into rows with
 each row height set by the tallest component. This can waste sub-
 stantial screen space when components have drastically different sizes.185
 5.25 Three simple groups in the NodeXL squarified treemap layout demon-
 strating how window aspect ratio can cause three groups to be laid
 out in a row. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
List of Figures xix
 5.26 The box and board for the game Risk. The board consists of 42
 countries in six continents. From boardgamegeek.com/image/1466865/risk188
 5.27 The network for the board game Risk, where nodes are countries
 and edges indicate valid movements. Nodes are laid out using Harel-
 Koren FMS [HK02a], clustered and colored using Clauset-Newman-
 Moore [CNM04]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
 5.28 Risk network from Fig. 5.27, shown using the Treemap GIB layout
 with combined inter-group edges. . . . . . . . . . . . . . . . . . . . 191
 5.29 Risk network from Fig. 5.27, shown using the Croissant variant of
 the Croissant-Donut GIB layout with combined inter-group edges. . 192
 5.30 Risk network from Fig. 5.27, shown using the Force-Directed GIB
 layout with combined inter-group edges. The initial space-filling
 factor is 20%. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
 5.31 Pennsylvania innovation relationships during 1990 (main compo-
 nent) collected by Christopher Scott Dempwolf. Nodes are laid out
 using the Harel-Koren FMS layout [HK02a] and I used link bundling
 as well as categorical coloring for node and link types. Gray nodes
 represent inventors; orange are firms; red are federal agencies; royal
 blue are PA DCED / Ben Franklin agencies; lime are universities.
 Red ties (lines) are SBIR / STTR funding; purple ties are patent
 relationships; aqua ties are state funding; blue ties are explicit re-
 lationships between patents; light green ties are technology-based
 relationships. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
 5.32 The innovation network from Fig. 5.31, with clusters found using
 the Clauset-Newman-Moore algorithm [CNM04] shown using node
 color and shape. Because of the dense, intermingled clusters it is
 difficult to understand the network and cluster structure. In this
 figure the edges are shown as straight lines. . . . . . . . . . . . . . . 198
 5.33 The innovation network from Fig. 5.31, with nodes grouped into
 boxes by the clusters found using the Clauset-Newman-Moore al-
 gorithm [CNM04], laid out using the Treemap GIB layout sized by
 their degree, and arranged inside boxes using the Harel-Koren FMS
 layout [HK02a]. Edge opacity is based on the tie strength and edges
 are bundled. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
 5.34 The visualization from Fig. 5.33 after replacing inter-group edges
 with meta-edges that represent the aggregate relationships between
 each pair of groups. . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
 5.35 The visualization from Fig. 5.33 after hiding inter-group edges. . . . 201
 5.36 The visualization from Fig. 5.33 after hiding inter-group edges and
 filtering to only the largest groups. . . . . . . . . . . . . . . . . . . 202
List of Figures xx
 5.37 The visualization from Fig. 5.33, but using the Croissant-Donut
 Donut GIB layout instead of the Treemap. Inter-group edges are
 visible and straight. While we can see some of the groups well,
 many of the smaller groups in the corners have high aspect ratios. . 204
 5.38 The visualization from Fig. 5.33, but using the Force-Directed GIB
 layout instead of the Treemap. Inter-group edges are visible and
 straight. All the groups have low aspect ratios, and aggregate con-
 nections between the large groups are more visible. The initial
 space-filling factor is 50%. . . . . . . . . . . . . . . . . . . . . . . . 205
 5.39 Patients and concepts related to the ?hops5325? and ?orch7323?
 medications from Fig. 4.20. Nodes are grouped using the Clauset-
 Newman-Moore topologic clustering algorithm [CNM04] and col-
 ored accordingly. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
 5.40 Patients, concepts, and clusters from Fig. 5.39, shown in the Treemap
 Group-in-a-Box layout. Our ego concepts, ?hops5325? and ?orch7323?,
 are shown in orange in the largest clusters. . . . . . . . . . . . . . . 208
 5.41 Patients, concepts, and clusters from Fig. 5.39, shown in the Croissant-
 Donut Group-in-a-Box layout. In this case the Croissant variant was
 chosen automatically. Our ego concepts, ?hops5325? and ?orch7323?,
 are shown in orange in the largest clusters. . . . . . . . . . . . . . . 210
 5.42 Patients, concepts, and clusters from Fig. 5.39, shown in the Force-
 Directed Group-in-a-Box layout. Our ego concepts, ?hops5325? and
 ?orch7323?, are shown in orange in the largest clusters. The initial
 space-filling factor is 50%. . . . . . . . . . . . . . . . . . . . . . . . 211
 5.43 Patients, concepts, and clusters from Fig. 5.39, shown in the Force-
 Directed Group-in-a-Box layout but without the group boxes. The
 underlying edges are visible. The motif simplification technique
 from Chapter 4 is applied as well. . . . . . . . . . . . . . . . . . . 212
 6.1 Different visualizations of the same network with many (a), few (b),
 and no (c) edge crossings. . . . . . . . . . . . . . . . . . . . . . . . 227
 6.2 In the Planarity online game (www.planarity.net), users start with
 a planar network: one that can be embedded in two dimensions
 using straight edges with no crossings. Given a random network
 layout like (a) users try to manually eliminate crossings. The goal
 is to create a planar drawing like (b), which is the same network
 run through NodeXL?s [Smi+10] Harel-Koren FMS layout [HK02a]. 228
List of Figures xxi
 6.3 SocialAction with the integrated Network Drawing Readability Met-
 ric framework rapidly shows problem areas in the network drawing
 highlighted in red and listed in a ranked table. It is currently show-
 ing a subset of the reply relationships within the Alberta Politics dis-
 cussion newsgroup, and the network drawing has been optimized for
 the node occlusion and edge tunnel readability metrics. The steps in
 SocialAction?s Systematic Yet Flexible framework are shown along
 the top. The Network Readability panel (middle-left) shows node
 or edge readability metrics as well as global ones. The Rank Nodes
 panel at the far left ranks nodes by the edge crossing readability
 metric and provides the color scale for the Network pane. . . . . . . 234
 6.4 Ranking and coloring with the node occlusion node RM shows areas
 of high occlusion in red. To reduce occlusion we can relax the layout
 by increasing default spring lengths ((a), (b), (d)). Note that this
 is not the same as merely increasing the size of the drawing: the
 adjustment of the parameters of the layout algorithm results in a
 somewhat different layout as well. We can also use shorter unique,
 trimmed, or simplified labels ((c) & (e)), in addition to hand-tuning
 node position as a final step. Note that color scales may change
 between figures as the worst nodes become better. Counts listed
 are node occlusion (NO), edge tunnels (ET), and edge crossings (EC).237
 6.5 Using the node RM for edge tunnels, users can see areas with edge
 tunnels in red (a) and manually adjust the layout to remove them
 (b). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
 6.6 Likewise, the node RM for edge crossings shows users areas with lots
 of crossings (a) and lets them hand tune the layout to reduce them
 ((b)?(d)). Fig. 6.1 gives a prime example for how minimizing edge
 crossings can greatly improve the readability of a drawing. Unfortu-
 nately, minimizing the number of edge crossings for less structured
 networks often results in an asymmetric drawing like (d) in which the
 centrality and angular resolution of many nodes is reduced, decreas-
 ing their perceived importance. For larger, less structured networks
 a balance must be struck between the number of edge crossings and
 the impact of further minimization on the spatial layout of the draw-
 ing. Note that color scales may change between figures as the worst
 nodes become better. Metrics listed are node occlusion (NO), edge
 tunnels (ET), and edge crossings (EC). . . . . . . . . . . . . . . . . 239
List of Figures xxii
 6.7 Name co-appearance network from the New Testament. (a) is the
 original New York Times/ManyEyes visualization, while (b) shows
 the same network in SocialAction [PS06]. (c) shows the clusters
 found by Newman?s fast heuristic [New04] using convex hulls, and I
 optimized the layout using the node-node overlap and edge crossing
 metrics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
 6.8 NodeXL showing the readability metrics dialog box (foreground),
 the nodes in the worksheet with their associated edge crossing and
 node overlap metric columns, and the graph pane where nodes and
 edges are colored by the edge crossing metric on a red-black scale.
 Nodes causing the most edge crossings are colored in bright red, as
 are edges with the most crossings. The network shown represents
 the legal moves in the board game Risk (see Section 5.5.1 for details).243
 6.9 We can eliminate the node occlusion that makes the central overlap-
 ping group in Fig. 6.9a so hard to understand by zooming out and
 increasing the the spring lengths of the layout algorithm (Fig. 6.9b). 247
 6.10 In Fig. 6.10a it is difficult to tell which edges connect to which nodes
 because of the number of edge tunnels. By zooming out and hand
 tuning the layout (Fig. 6.10b) we can completely eliminate edge
 tunnels (but not crossings). . . . . . . . . . . . . . . . . . . . . . . 255
 6.11 In edge tracing tasks such as finding the length of the shortest path
 between the bottom right and top left nodes in Fig. 6.11a, increas-
 ing the edge crossing angles approaching 90 degrees (Fig. 6.11b)
 improves user path finding performance. . . . . . . . . . . . . . . . 257
 7.1 Examples of how to show edge directionality in a fan motif glyph.
 The arrows around the fans are not part of the glyph, and are only
 presented here to highlight which sector corresponds to which direc-
 tion of edges. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
 7.2 Variants of the directed fan motif glyph with different numbers leaf
 nodes and number of directed edges in each of the three types (from
 head, to head, and reciprocated). . . . . . . . . . . . . . . . . . . . 278
List of Tables
 5.1 Overall network properties for the networks in our dataset. . . . . . 219
 5.2 Performance comparison of the two proposed approaches: CD-GIB
 and FD-GIB with the baseline ST-GIB layout. All figures reported
 above are median values computed for the complete dataset. . . . . 219
 xxiii
Chapter 1
 Introduction
 Networks have long been common data structures in Computer Science, but have
 only recently exploded into popular culture. Publishers like the New York Times
 now frequently including elaborate and interesting networks with their articles.1
 Online communities like Facebook, Twitter, Flickr, MySpace, and YouTube (to
 name only a handful) enjoyed enormous growth over the last few years and provide
 rich datasets of interpersonal relationships, which social scientists are now fervently
 exploring. Networks have also found applications in such diverse disciplines as
 bioinformatics, scientometrics, urban planning, politics, and archeology.
 Analysis of these datasets requires knowledge of the connectivity, clusters, and
 centrality of the nodes: tasks which necessitate relationship visualizations. Sta-
 tistical analysis and conventional visualization tools like bar and pie charts are
 often inadequate when faced with these varied and oftentimes immense datasets.
 www.visualcomplexity.com and its associated book [Lim13] provide many beauti-
 ful alternative visualizations for these data which are surveyed by [Ari08; SA06],
 1http://www.nytimes.com/2009/03/29/technology/internet/29face.html
 1
2
 Figure 1.1: A node-link visualization of relationships among Twitter users men-
 tioning the hashtag ?#WIN09?, which was used by participants at a network sci-
 ence conference in September 2009. Each Twitter user is represented by a node
 containing its image, and edges between users indicate follow, mention, or reply
 relationships. The force-directed layout used to position the nodes highlights in-
 teresting patterns of connectivity like the two large communities of researchers.
 From Fig. 3.1 of the NodeXL book [HDS10, p. 33].
 but one enduring technique in particular models relationships using a node-link
 visualization, where nodes in the network represent entities and the links or edges
 indicate ties connecting them [BMK96]. An example node-link visualization is
 shown in Fig. 1.1, which displays relationships among Twitter users at the WIN09
 conference and how they separate into two distinct communities of researchers.
 This network of interactions between people is called a social network and the
3
 resulting visualization is called a sociogram by sociologists [Mor53], graph drawing
 by graph theorists, and a node-link visualization by other network researchers
 including myself.
 Node-link visualizations have a long history, but only in the last few decades
 have we seen their frequent application as a network exploration tool. For exam-
 ple, Fisher, Smith, and Welser [FSW06] and Welser et al. [Wel+07] successfully
 used node-link visualizations to detect common social roles in online discussion
 newsgroups such as answer person and discussion person. Node-link visualizations
 have also been applied to the study of relationships between political blogs dur-
 ing the 2004 U.S. Presidential Election, showing the division between liberal and
 conservative communities as well as their internal interactions [AG05]. A similar
 large application is to map the entire Internet [CBB00]. These techniques have
 made inroads into many other domains as well. Urban planners have used node-
 link visualizations to understand networks of innovation (Section 5.5.2, [Dem12]),
 and, similarly, scientometricians use them for measuring and analyzing scientific
 publishing (Section 3.3.3, [Hen+07]). In biology and medicine, node-link visu-
 alizations are used to help explore protein-protein interaction networks [Kel+03]
 and to visualize patient conditions and treatments (Section 4.3.5, Section 5.5.3).
 Even archeology now uses node-link visualizations for looking at the relationships
 between dig sites and artifacts (Section 3.3.2, [Bru12]).
4
 (a) (b) (c)
 Figure 1.2: Different visualizations of the same network, with (a) obscuring the
 topology while (b) and (c) are more understandable with less edge crossings.
 However, there are a huge array of possible layouts of the nodes and edges
 in any given network, many of which can create misleading or incomprehensible
 visualizations [Bra+99]. Even eight nodes can be laid out in a way that obscures
 the network topology, as displayed in Fig. 1.2. In this case, edge crossings caused
 by the layout make paths difficult to follow, but other problems can be caused
 by nodes overlapping or edges tunneling underneath nodes without connecting to
 them, to name only a few of many potential readability issues. Visualizations of
 relational structures like networks are only useful to the degree they ?effectively
 convey information to the people that use them? [Bat+98]. What?s more, there is
 no ?best? layout for a network as different layouts can highlight different features
 of the network being studied [BMK96]. In fact, the spatial layout of nodes in the
 node-link visualization can have a profound impact on the detection of communi-
 ties in the network and the perceived importance of individual actors [MBK97].
5
 Hence, significant thought must be given to properly laying out networks so that
 network analysts will be able to understand and effectively communicate data
 such as clusters in the network, the paths between them, and the importance of
 individual actors.
 As manual layout of nodes in the node-link visualization is incredibly time con-
 suming to do well, a lot of effort has been put into developing automated network
 layout algorithms. There are many layout algorithms that can be used, includ-
 ing variants of the spring embedder [Ead84] such as the popular Fruchterman-
 Reingold force-directed algorithm [FR91] (used in Fig. 1.1), the Prefuse gravita-
 tional N-Body approach [HCL05], the Harel-Koren fast multi-scale (FMS) algo-
 rithm [HK02a], the high-dimensional embedding (HDE) approach of Harel and
 Koren [HK02c], the algebraic multigrid method (ACE) of Koren, Carmel, and
 Harel [KCH03], and FM3 by Hachul and J?nger [HJ05]. These force-directed algo-
 rithms are used frequently in practice. A 2006 census of the layout algorithms used
 for the first 100 examples on visualcomplexity.com showed that over a third used
 force-directed algorithms, with another third using geographic placements [SA06].
 Even with these layout techniques, many existing node-link visualizations of
 networks are not easily readable, or at least difficult to extract meaning from.
 Several factors contribute to this problem, including that the inherently complex
 relationships in large, dense networks are often difficult to perceive even with mas-
6
 Figure 1.3: An experimental comparison of six layout algorithms on the same social
 network produced widely different layouts. The top row layouts performed well,
 though bottom row layouts are difficult to extract meaning from. From [HJ06].
 sive displays. Also, as shown in Fig. 1.3, layout algorithms can produce vastly
 different results for the same network depending on the heuristics they use. The
 spatial layout of a network visualization is critical to what we perceive from it,
 meaning that for every network and user task there are many potential unintelligi-
 ble or even misleading visualizations. Moreover, end users are completely unwilling
 to experiment with layout parameters to improve the layout after the initial view
 [Bar+08]. Even expert analysts who have the experience required to tweak the
 layout algorithm and further optimize the layout manually can have difficulties
7
 with large networks. Many researchers, including myself, have shown that there
 can be vast improvements in network visualizations by using alternate approaches
 to layout [HK02c; HJ05], aggregation [Wat06; Dun+12a], and filtering [SD12].
 However, many challenges remain.
 My dissertation work contributes to this space and focuses on three compli-
 mentary approaches for helping users explore network datasets. First, I introduce
 a technique called motif simplification which helps users reduce visual com-
 plexity by replacing repeating patterns with representative glyphs (Section 1.1).
 These glyphs require less screen space, better present the core interesting parts of
 the network, and improve user task performance. Second, I present new Group-
 in-a-Box layouts to segment dense networks using attribute- or topology-based
 groupings (Section 1.2). These group-aware layouts can better display the rela-
 tionships within groups as well as between them. Finally, I develop a set of what I
 term readability metrics to measure the effectiveness of node-link visualizations,
 both to analyze the utility of layout algorithms but also to interactively guide user
 improvement of the layout (Section 1.3). I implemented each of these techniques
 in the free and open source NodeXL [Smi+10] network analyst tool, so that they
 could be easily used by novice network analysts (Section 1.4). These three tech-
 niques and their associated NodeXL implementations are discussed in the following
 sections, which provide an overview of the chapters in this dissertation.
1.1 Motif Simplification to Reduce Complexity 8
 Figure 1.4: Fan, connector, and clique motifs (top) and their glyphs (bottom).
 1.1 Motif Simplification to Reduce Complexity
 Many complex networks are littered with recurring topologic patterns or motifs,
 either because of the network structure or data collection methods. Three of these
 motifs are shown in the top row of Fig. 1.4. Regardless of their cause, some
 frequently expressed motifs contain little information compared to the space they
 occupy in the visualization. My dissertation helps address this problem with a
 new technique called motif simplification, in which common repeating motifs
 are replaced with compact yet meaningful glyphs. I focus on the three frequently
 occurring and high-payoff motifs shown in Fig. 1.4: fans of nodes with a single
 neighbor, connectors that link a set of anchor nodes, and cliques of completely
 connected nodes. My research contributes efficient algorithms for motif detection,
 the design of representative and combineable glyphs, as well as guidelines and an
 iterative process for creating glyphs for additional motifs.
1.1 Motif Simplification to Reduce Complexity 9
 (a) (b)
 Figure 1.5: A bipartite network of Lostpedia of wiki edits (a) and a simplified
 version using glyphs for fan and connector motifs (b).
 I evaluated motif simplification first with several domain experts in sociology,
 political science, medical informatics, and the U.S. Department of the Treasury to
 understand the effectiveness of the technique for real-world analyses. I followed this
 with a task-based controlled study of 36 participants analyzing networks up to 3958
 nodes, which determined the magnitude of any performance differences between
 using plain and simplified views. One example network from this study is shown
 in Fig. 1.5, in which a network of wiki editors connected to the pages they edit is
 shown in a node-link visualization with 513 nodes (left) and the simplified view with
 only 17 nodes and glyphs (right). These studies showed that motif simplification
 (1) reduces screen space used and layout effort, (2) can reveal hidden relationships,
1.2 Meta-Layouts for Subdividing Networks 10
 and (3) is quite beneficial for many network analysis tasks both in the time users
 took and their accuracy/error. Unlike other approaches, motif simplification is
 able to achieve these benefits while maintaining user awareness of the underlying
 topology. Please see Chapter 4 for more details on motif simplification.
 1.2 Meta-Layouts for Subdividing Networks
 In contrast to motif simplification, in which functionally equivalent nodes and
 edges are replaced by representative glyphs, I have also explored the use of meta-
 layouts that highlight more general topology- or attribute-based groupings of the
 network. These groups can be difficult to understand using the standard tools of
 color, shape, or convex hulls ? as evidenced by the dense, intermingled topologic
 clusters shown in Fig. 1.6. In this visualization, it is difficult to understand the
 size of each group, its internal structure, and its ties to other groups. My meta-
 layouts are designed to make all these group features easier to discern. First, the
 Midichlorian-Directed Layout is a modified force-directed layout algorithm
 that reduces spring forces between nodes in separate groups. This causes groups
 to spread apart and be more clearly analyzed, but at the expense of substantial
 screen space required. Next, I present several Group-in-a-Box layouts that
 display groups individually to more clearly show membership, topology, and inter-
 group relationships. We have one such layout in NodeXL [Smi+10] that segments
1.2 Meta-Layouts for Subdividing Networks 11
 Figure 1.6: Pennsylvania innovation relationships during 1990 (main component)
 collected by Christopher Scott Dempwolf. Nodes are laid out using the Harel-Koren
 FMS layout [HK02a] and topologic clusters found using the Clauset-Newman-
 Moore algorithm [CNM04] are shown using node color and shape. See Section 5.5.2
 for more details and analyses of this network.
1.2 Meta-Layouts for Subdividing Networks 12
 (a) Standard node-link visualization (b) Treemap GIB layout
 (c) Croissant-Donut GIB layout (d) Force-Directed GIB layout
 Figure 1.7: The network for the board game Risk, where nodes are countries
 and edges indicate legal movements. Nodes are laid out using Harel-Koren FMS
 [HK02a], clustered and colored using the Clauset-Newman-Moore topologic clus-
 tering algorithm [CNM04]. Inter-group edges are combined into thick meta-edges.
 (a) shows the initial visualization, while the others show the three Group-in-a-Box
 (GIB) layout variants. See Section 5.5.1 for more details and analysis.
1.2 Meta-Layouts for Subdividing Networks 13
 groups using aTreemap [Rod+11; SD12], which is space-filling but often separates
 related groups, drawing long edges which overlap other groups unnecessarily. This
 is visible in Fig. 1.7b as the crossing and overlapping meta-edges that represent
 the combined inter-group edges.
 I present several variants to more clearly show group relationships, each best
 suited to a range of topologies. The Croissant Group-in-a-Box layout, shown
 in Fig. 1.7c, puts the largest group at the top and wraps the remainder around three
 sides based on their connectivity. This effectively displays large groups, though
 more smaller groups are better shown using the Donut Group-in-a-Box layout
 (not shown here) which places the largest group in the center and arranges others
 around the perimeter. Finally, the Force-Directed Group-in-a-Box layout
 (Fig. 1.7d) arranges groups based on their aggregate ties and eliminates any overlap
 of their boxes. The NodeXL [Smi+10] implementation automatically picks the
 best approach for the given data to better show disconnected components, few
 groups, or different distributions of group sizes and connectedness. Several case
 studies and experiments demonstrate that Group-in-a-Box layouts more clearly
 show (1) topology within groups, (2) group membership and size, and (3) aggregate
 relationships between groups. Group-in-a-Box layouts are particularly effective for
 large networks, where high density and finite screen space limit effective network
 visualizations. I cover my work on meta-layouts extensively in Chapter 5.
1.3 Measuring Network Visualization Readability 14
 (a) Tight layout (b) Relaxed layout
 Figure 1.8: We can eliminate the node occlusion and edge tunnels that make the
 central overlapping group in Fig. 1.8a so hard to understand by zooming out and
 increasing the the spring lengths of the layout algorithm (Fig. 1.8b).
 1.3 Measuring Network Visualization Readability
 My user studies, case studies, and experiments demonstrate the utility of motif
 simplification and Group-in-a-Box layouts for network visualization, but I am also
 interested in improving the effectiveness of general node-link visualizations. By
 quantifying the readability of a layout, we can guide analysts in making improve-
 ments and feed the results in automatic layout algorithms. Past work by Purchase
 and Leonard [PL96; Pur02] as well as Ware et al. [War+02] provides definitions for
 several of what I call global readability metrics (also called aesthetic criteria),
 which measure detrimental features like edge crossings (see Fig. 1.2) and rate the
 layout as a whole. However, a single value is not enough to direct users to problem
 areas of the layout, which part of my dissertation addresses by introducing local
 readability metrics for individual nodes and edges. Moreover, I introduce sev-
1.3 Measuring Network Visualization Readability 15
 Figure 1.9: NodeXL showing the readability metrics dialog (foreground), the nodes
 in the worksheet with edge crossing and node overlap metric columns, and visu-
 alization where nodes and edges are colored red-to-black by the edge crossing
 metric. The worst offenders are shown in red. The network shown represents the
 legal moves in the board game Risk from Fig. 1.7a.
 eral new global metrics to detect readability problems like node overlap and edges
 tunneling under nodes. These readability issues are visible on the left of Fig. 1.8.
 I leverage these metrics in a new method for user-assisted layout improvement,
 which is shown in Fig. 1.9. My approach is to incrementally update the readability
 metrics in real-time as users manipulate the layout, and provide immediate visual
 feedback to users showing how they are affecting readability. As there are trade-offs
1.4 Exploration Environment 16
 when optimizing specific readability metrics, I include a survey of the related liter-
 ature studying each of these metrics and their effect on user task performance. My
 evaluations indicate that these readability metrics help users create more effective
 node-link visualizations, and I plan to release both the metrics and layout improve-
 ment tool as part of NodeXL [Smi+10]. This work aims to raise user awareness of
 network visualization readability issues, and applying my optimization technique
 will guide users in creating more effective network visualizations.
 1.4 Exploration Environment
 I implemented each of these three approaches in a scalable environment for network
 exploration and improvement, made publicly available as part of the free and
 open source NodeXL network analysis tool [Smi+10]. NodeXL is popular and
 actively developed, has over 184,000 downloads, and has been taught in over 25
 introductory courses on network analysis and visualization. I have been involved
 with the project for five years, first running user studies and then as an advisor and
 developer. By releasing my work in NodeXL, it immediately becomes available to
 help the novice users who need it the most. Motif simplification is now available
 and visible in the publicly shipping tool, and my Group-in-a-Box layouts will be
 shortly. The readability metrics and associated interactive layout improvement
 technique are implemented but hidden as they are not yet ready for public use.
1.5 Specific Contributions 17
 1.5 Specific Contributions
 The specific contributions of this dissertation are as follows:
  Motif Simplification
 ? A technique for simplifying node-link visualizations by replacing com-
 mon network motifs with representative glyphs,
 ? A set of design guidelines for these glyphs to show the motif contents
 and underlying attributes,
 ? The design of glyphs for fans, connectors, and cliques,
 ? Algorithms for detecting these three motifs,
 ? A supporting task-based study with 36 participants, and
 ? A free and open source implementation as part of NodeXL.
  Meta-Layouts
 ? A meta-layout called the Midichlorian-Directed Layout which spreads
 groups apart in a standard node-link visualization;
 ? A Croissant-Donut Group-in-a-Box layout that places subnetworks in
 boxes arranged using a Donut or Croissant pattern, and balances space-
 filling properties with showing group relationships;
1.5 Specific Contributions 18
 ? A Force-Directed Group-in-a-Box layout that places subnetworks in
 boxes arranged by their connectivity, and shows group relationships
 well at the expense of additional screen space;
 ? A set of automatic choices that are made for the user to better show
 disconnected components, few groups, or different distributions of group
 sizes and connectedness;
 ? Supporting case studies and an experiment on Twitter networks; and
 ? A free and open source implementation as part of NodeXL.
  Readability Metrics
 ? New global readability metrics to help understand different aspects of
 network visualization readability,
 ? Local readability metrics for individual nodes and edges to help users
 identify problem areas and fix them,
 ? A method for user-assisted layout improvement that provides real-time
 metric feedback to users in a ranked list and with a color scale,
 ? Implementations of readability metrics and the layout improvement
 technique in SocialAction and NodeXL, and
 ? A survey of work on readability metrics and evaluations of their effec-
 tiveness on various network analysis tasks.
1.6 Dissertation Roadmap 19
 This dissertation is aimed at helping researchers, tool designers, and network
 analysts. For researchers, my work demonstrates that progress is possible in im-
 proving node-link visualization readability and contributes to the literature an
 improved understanding of why some network visualizations are difficult to read.
 For designers of network analysis tools, I detail specific techniques they can imple-
 ment and give guidance as to what measures of readability could help users create
 more effective visualizations. For analysts, I hope to raise awareness that the im-
 ages they share or publish could be of higher quality, so that readers could extract
 relevant information. Furthermore, I provide an implementation of my techniques
 analysts can apply, so as to improve the utility of their network visualizations
 through layout changes and meaningful aggregations. My three strategies are
 complementary and applicable to many types of networks and user explorations.
 The techniques can be applied separately or in combinations based on the type of
 network and tasks involved, with different methods better for highlighting certain
 characteristics.
 1.6 Dissertation Roadmap
 The remainder of this dissertation is broken into several parts. First, in Chapter 2
 I discuss prior work done on network exploration, measuring readability, analyzing
 motifs, meta-layouts, and visualization evaluation. Next, in Chapter 3 I detail
1.6 Dissertation Roadmap 20
 the NodeXL network analysis tool [Smi+10] in which many of my dissertation
 contributions are implemented, as well as several applications of network analysis
 to problems in diverse domains. These applications helped guide my dissertation
 research. Then, Chapter 4 covers the motif simplification approach for reducing
 complexity by combining functionally equivalent nodes and edges. Moving on,
 Chapter 5 describes the meta-layout and Group-in-a-Box approaches for subdi-
 viding complex networks into manageable yet meaningful pieces. Chapter 6 then
 discusses techniques for understanding and improving the readability of a standard
 node-link network visualization. Finally, I conclude and discuss future directions
 in Chapter 7. Parts of this work have already been published [DS13; SD12] or
 are currently under submission [Cha+13], in addition to the many domain-specific
 publications discussed in Chapter 3.
Chapter 2
 Related work
 2.1 Introduction
 The field of network analyses and visualization is extensive. In this chapter I
 provide an overview of general network visualization principles, as well as detailed
 discussion of the techniques most relevant to my dissertation contributions. First,
 in Section 2.2 I detail general techniques for network visualization and analysis,
 including alternatives to the standard node-link visualization that have various
 tradeoffs. I have chosen to focus my work on improving node-link visualizations
 as they are the best visualization for understanding the overall structure of a
 network and for many important path-based tasks [HF07]. Moreover, they are
 incredibly widely used [Ari08; SA06] and the only network visualization available
 in common analysis tools like NodeXL [Smi+09] (Section 3.2), Gephi [BHJ09],
 Cytoscape [Sha+03] (Fig. 2.2), Pajek [BM98], and GUESS [Ada06].
 Next, I describe the current techniques for measuring the readability of node-
 link visualizations in Section 2.3. These techniques form the basis for my work on
 21
2.2 Network Visualization & Analysis 22
 readability metrics, which I use to help users both understand and improve the
 readability of their node-link visualizations. Third, in Section 2.4, I cover work
 with similar goals as my motif simplification technique. This includes approaches
 for aggregating, clustering, or filtering networks based on topology or attributes, in
 addition to detecting frequently occurring motifs in networks. Moving on, I detail
 techniques for taking groups or subnetworks into account when computing layouts
 in Section 2.5 and contrast these with my Group-in-a-Box meta-layouts. Some of
 my techniques I can evaluate empirically using simulations, but in many cases it is
 important to put them in front of real users to determine real-world utility. I relate
 common evaluation techniques for these kinds of studies in Section 2.6. Finally, I
 summarize the novelty of my approaches in Section 2.7.
 2.2 Network Visualization & Analysis
 The area of network analysis is currently of great interest to the community, and
 many systems have been developed to visualize and analyze networks. There are
 several general visualization frameworks that can be extended programmatically
 to create arbitrary visualizations of networks or other datasets, such as the Info-
 Vis Toolkit [Fek04], Prefuse [HCL05], and JUNG [OM+03]. Traditionally, dedi-
 cated network analysis tools have focused on two specific kinds of visualizations:
 node-link and matrix representations. Node-link visualizations excel at showing
2.2 Network Visualization & Analysis 23
 Figure 2.1: The Pajek social network analysis tool [BM98] showing the main core
 subgraph extracted from Internet routing data.
 network topology, especially in sparse social networks. Most general-purpose and
 domain-specific network analysis tools incorporate node-link visualizations, includ-
 ing NodeXL [Smi+09] (Section 3.2), Gephi [BHJ09], Cytoscape [Sha+03] (Fig. 2.2),
 Pajek [BM98] (Fig. 2.1), GUESS [Ada06], and SocialAction [PS06]. I focus my ef-
 forts on improving the utility of node-link visualizations both because of their
 effectiveness at showing overall network topology, as well as their wide usage.
 Matrix representations are less frequently used, but are better suited to espe-
 cially dense networks. MatrixExplorer [HF06], TimeMatrix [YEL10], and Matrix
2.2 Network Visualization & Analysis 24
 Figure 2.2: The Cytoscape biologic network analysis tool [Sha+03].
 Zoom [AH04] are prime examples of matrix visualizations. Whether a matrix or
 node-link representation is better suited for a specific network depends substan-
 tially on the size and characteristics of that network. Node-link visualizations
 are favored in all cases for path-finding tasks [GFC04] and both show the overall
 topology of small networks quite well, but readability becomes an issue when con-
 fronted with more than a few thousand nodes. Several recent tools like MatLink
 [HF07] and NodeTrix [HFM07] (Fig. 2.3) have worked to integrate the matrix and
 node-link representations to combine their strengths. However, the node-link vi-
 sualizations I focus on remain the most widely used as well as the most effective
 network overview visualization.
2.2 Network Visualization & Analysis 25
 Figure 2.3: NodeTrix [HFM07] showing an overview of research in information
 visualization from the InfoVis ?04 contest.
2.2 Network Visualization & Analysis 26
 Social network datasets such as scientific collaboration networks or friendship
 networks often contain multiple types of nodes and edges (i.e., heterogeneous),
 and multiple attributes on nodes or edges (i.e., multivariate). In node-link visual-
 izations, multiple attributes can be encoded using size, color, shape, opacity, etc
 [Mac86]. In particular, [Bla+09] recently attempted to represent multiple types of
 edges in node-link visualizations using texture and animation. However, it remains
 challenging to identify patterns and extract trends by solely relying on these visual
 encodings. My implementations in NodeXL [Smi+10] provide all these state-of-
 the-art attribute encodings for the node-link visualization, excluding animation.
 The motif simplification approach even shows underlying color or size informa-
 tion in the representative glyph for a motif. Unfortunately, effective attribute
 exploration requires alternate visualizations, like those I discuss in the following
 paragraphs and Section 3.3.2.
 Various hybrid network visualizations attempt to combine topology and multi-
 variate data more effectively into a single visualization such as the scatter plots of
 nodes connected by edges in Semantic Substrates [SA06] (Fig. 2.4) or GraphDice
 [Bez+10] (Fig. 2.5). Other hybrid approaches provide a visualization of topology
 on top of node aggregates, such as overlaying edges on Treemaps [Fek+03], com-
 bining Treemaps with node-link visualizations [ZCM05] or matrix representations
 for dense clusters within an aggregate node-link visualization [HFM07] (Fig. 2.3).
2.2 Network Visualization & Analysis 27
 Figure 2.4: NVSS [SA06] showing citations from two Circuit Court cases in 1991-
 1993 to 19 Supreme Court cases and two other Circuit Court cases.
 However, performing analysis of networks with many attributes remains a challenge
 with these representations, not to mention the difficulty for network overview and
 topology-based tasks.
 There have been some recent attempts to specifically handle the attributes in
 multivariate networks. For example, ManyNets [Fre+10] (Fig. 2.6) allows users to
 partition networks according to attributes or topological properties, supporting fast
 comparison of the partition statistics, though it is difficult to extract patterns and
2.2 Network Visualization & Analysis 28
 Figure 2.5: GraphDice [Bez+10] showing the InfoVis 2004 contest bibliographic
 network. The left shows the plot matrix window and the right shows the selected
 plot. The right view animates between selected plots.
 Figure 2.6: ManyNets [Fre+10] displaying the distributions of various statistics
 across subgraphs (rows).
2.2 Network Visualization & Analysis 29
 Figure 2.7: PivotGraph [Wat06] showing communication between aggregations of
 men and women (columns) and various locations (rows).
 to identify relationships between the attributes. In contrast, PivotGraph [Wat06]
 (Fig. 2.7) aggregates nodes by attribute and indicates relationships between the
 aggregates using edges. However, it does not allow users to drill-down to see the
 details of the network and does not support comparing more than two attributes.
 Nor does it allow multiple types of nodes or edges (heterogeneous networks).
 There have been many efforts to visualize heterogeneous and multivariate net-
 works. General faceted browsing systems such as FacetLens [Lee+09] can be used
 on networks with multiple types of nodes and multiple attributes. Nodes are
2.2 Network Visualization & Analysis 30
 Figure 2.8: The main NetLens [Kan+06] interface here is showing ACM SIGCHI
 conference papers on the left and authors on the right.
 grouped by their attribute values (i.e., facets) and users can pivot between node
 types, but only from a single node to its connected nodes. FacetLens helps users
 extract patterns and trends in the node attributes, but it does not explicitly rep-
 resent the relationships between nodes. NetLens [Kan+06] (Fig. 2.8) is well suited
 to handle content-actor networks with two node types. It uses two coordinated
 views, each containing nodes aggregated according to their attributes. Users can
 explore the network by filtering in one view and pivoting from their filtered sub-
 set to connected nodes in the other view. NetLens allows for complex analysis
2.3 Measuring Node-Link Visualization Readability 31
 scenarios and extraction of trends and patterns in multivariate content-actor net-
 works, but is limited to two node types at a time. Alternatively, my GraphTrail
 approach, which I discuss in Section 3.3.2, supports attribute exploration across
 many different node and edge types.
 All these techniques I have discussed are effective for exploring networks based
 on their attributes, especially for heterogeneous networks. However, none of them
 are as effective as standard node-link visualizations for showing the overall topo-
 logic structure of a network and for helping users perform path-based tasks. How-
 ever, these visualizations can be combined with node-link diagrams in a multiple
 coordinated view system [NS00; BWK00], with brushing and liking to highlight
 the same data in each view. One example tool is Network Workbench [NWB06],
 which provides an impressive array of statistics, modeling, scientometric, and vi-
 sualization algorithms for analyzing bibliometric datasets. Unfortunately these
 visualizations lack brushing and linking and are weakly integrated into the rest of
 the exploration process. Examples of systems that do a better job of this include
 my GraphTrail and Action Science Explorer (Sections 3.3.2 and 3.3.3).
 2.3 Measuring Node-Link Visualization Readability
 There is a substantial body of work aimed at developing and, more recently, em-
 pirically verifying the correctness of a wide variety of readability metrics (RMs),
2.3 Measuring Node-Link Visualization Readability 32
 Figure 2.9: Simple rule-based drawing optimizations shown in Figure 2.3.1 of
 [Sug02, p. 14].
2.3 Measuring Node-Link Visualization Readability 33
 or, as they are often called, aesthetic criteria. Sugiyama?s book [Sug02] includes
 a figure showing several simple rule-based drawing optimizations, replicated here
 in Fig. 2.9. Excellent overviews of RMs for general graphs can also be found in
 [Bat+98; War04; Bat+94; BFN85]. RMs specific for trees and UML diagrams are
 described in [WS79] and [Eic03], respectively. The first standard and numerical
 definitions of many specific RMs were given by Purchase and Leonard [PL96] and
 were elaborated on by Purchase [Pur02] who developed seven specific RM formulas.
 These will form the basis for much of my work.
 Previous work in this area primarily deals with RMs for the entire graph draw-
 ing, giving, for example, a count of the total number of edge crossings. I name
 such RMs for the entire drawing as global readability metrics, or global RMs,
 and have developed several that are not included in the literature. Section 6.4
 provides a detailed background for several global RMs, including Edge Crossing,
 Edge Crossing Angle, and my new Node-Node Overlap, Node-Edge Overlap, and
 Group Overlap metrics. Several other global RMs are discussed there in less detail,
 though many have citations to prior work in the area. These serve as excellent
 measures for how understandable the whole graph drawing is, but do not provide
 the level of specificity needed to direct users to problem areas. To address this
 problem, I augment several existing and my new global RMs with novel local
 readability metrics for individual nodes and edges.
2.4 Motif Simplification 34
 Several layout algorithms try to directly satisfy readability metrics, such as
 using simulated annealing to distribute nodes evenly, make edge-lengths uniform,
 minimize edge-crossings, and keep nodes from coming near edges [DH96]. However,
 most layout algorithms use simple heuristics instead. Moreover, no sufficiently
 fast automatic layout techniques exist to leverage these metrics to create better
 general node-link visualizations. Rather than try to combine these metrics in a
 computationally expensive layout algorithm, I develop an assistive user feedback
 technique to help users optimize their layout manually using local RM calculations.
 2.4 Motif Simplification
 We can reduce the visualization complexity by showing an aggregate version of the
 network, based on any number of criteria. NetLens [Kan+06] (Fig. 2.8) groups
 nodes by their attributes and can pivot between connected groups of two different
 types, while PivotGraph [Wat06] (Fig. 2.7) uses attribute groupings but shows ties
 between aggregates using arcs. One of my techniques, GraphTrail (Section 3.3.2),
 combines these approaches with familiar charts, arc diagrams, and a many-to-many
 pivot between several node types. However, these approaches focus on attribute
 comparisons at the expense of showing topology, as I discussed in Section 2.2.
 Alternatively, my motif simplification approach retains all topology information in
 the overview visualization by using glyphs for specific motifs.
2.4 Motif Simplification 35
 Figure 2.10: Greedy graph summarization technique applied to the CRN-10k
 graph. From [NRS08].
 Instead of attribute aggregation, we can use a hierarchical topologic clustering
 to show a topologic overview in a network of meta-nodes like ASK-GraphView
 [AHK06] or van Ham & van Wijk [HW04]. Rather than letting meta-nodes over-
 lap, van Ham & van Wijk used semantic fisheye views to show clusters as merg-
 ing spheres. Other approaches to creating overview networks include graph sum-
 marization [NRS08] (Fig. 2.10) and aggregating nodes by shared neighbor sets
 [LSS12]. Liao, Shi, and Sun [LSS12] also provide a topologic clustering tool, and a
 level of detail option to split meta-nodes apart to better see the underlying topol-
 ogy. ManyNets [Fre+10] (Fig. 2.6) takes a different approach, showing statistical
 comparisons of a network partitioned by topology, attributes, or time. These tech-
 niques can show the aggregated topology of networks with hundreds of thousands
 of nodes, but not the underlying topology which is important for users to under-
2.4 Motif Simplification 36
 stand the network structure. Often this is because of the ambiguous nature of
 clustering algorithms, in contrast to the exact motif detection algorithms I de-
 veloped for motif simplification. Moreover, these tools do not present aggregate
 attribute information on nodes, unlike my motif glyphs.
 Alternatively, we can filter to an important subset using a metric for node
 importance. Skeletal images [Her+99] highlights high-metric nodes, and replaces
 filtered trees with triangles that take the same space. Motif simplification, instead,
 aims to reduce the space required by the network in the visualization and allow
 additional layouts. Tsigkas, Thonnard, and Tzovaras [TTT12] similarly filtered a
 security network of events and features on a domain-specific metric, while includ-
 ing a way to aggregate the events joining a subset of features into meta-edges.
 However, the aggregation is limited to ties between two feature types and obscures
 the number of connecting nodes and edges.
 My approach is to instead aggregate the network by the frequently occurring
 motifs it contains. While the fan, connector and clique motifs I target are quite
 prominent in social network datasets, there are many other motifs of interest,
 especially for biologists. Motif census (counting the kinds of motifs) and analysis
 is used extensively to analyze the behavior of complex biologic networks, looking
 for repeated patterns that indicate underlying processes. For example, Milo et al.
 [Mil+02] used an approach that finds motifs that appeared more frequently than
2.4 Motif Simplification 37
 Figure 2.11: An interesting motif found in the protein-protein interaction network
 of S. cerevisae, a species of yeast. It appears 27,720 times, though these motifs all
 overlap and share the same set of 29 nodes. From [GK07].
 expected in suitably random networks. They provide an extensive chart of motifs
 of three or four nodes, and describe their frequency in various biologic networks.
 Also, Zhu, Gerstein, and Snyder [ZGS07] provides an overview of the use of network
 motifs for analyzing biologic networks. Luscombe et al. [Lus+04] and Ye et al.
 [Ye+05] both demonstrate the applications of motif analysis for understanding
 biologic processes. To look for motifs larger than three or four nodes, Grochow
 and Kellis [GK07] developed a technique called symmetry-breaking that quickly
 finds motifs of various sizes. In applying their algorithm to the protein-protein
 interaction network of S. cerevisae, a species of yeast, they discovered one motif
 that appeared 27,720 times but does not appear at all in suitably created random
 ensembles. This motif, shown in Fig. 2.11, is composed of various overlapping
 combinations of 29 nodes that represent cellular transcription machinery. For my
 three motifs, I had to develop my own algorithms to scale well to large motifs.
2.4 Motif Simplification 38
 Figure 2.12: In MAVIsto [KSS06], matches for a particular motif like the feed-
 forward loop are laid out aligned the same direction and highlighted. The bar
 chart shows how frequently particular motifs occur above expected levels.
 Knowledge of the motifs present in a network can help predict behavior and the
 ?structural signatures? of individual entities [Wel+07], but visualizing these motifs
 effectively is challenging. Huang et al. [Hua+05] detect motifs with fewer than
 five nodes and draw transparent convex hulls to highlight them. Similarly, Klukas,
 Schreiber, and Schw?bbermeyer [KSS06] take the matches to a chosen 3?5 node
 motif and color them within the overall visualization and draw them identically
 to be easily spotted (Fig. 2.12). While highlighting the motifs can help biologists
2.5 Meta-Layout 39
 spot the locations of particular processes, it does little to reduce the clutter of a
 complex network drawing and can even reduce the readability. Instead, my motif
 simplification work directly tries to reduce this clutter by replacing motifs with
 representative glyphs.
 In contrast to motif simplification, current approaches to reducing complex-
 ity aggregate nodes based on their attributes, topology, or metrics but do not
 provide visible indications on the meta-nodes showing the underlying topology.
 Moreover, these algorithms usually pay little attention to the motifs present and
 create a grouping with ambiguous topology. While current tools can highlight
 small detected motifs, there are few techniques for providing a graphical overview
 or summary of them. More importantly, I know of no approaches other than motif
 simplification that leverage the motifs present to reduce the visual complexity of
 the network visualization.
 2.5 Meta-Layout
 Much of the work on meta-layouts has focused on so-called multiscale layouts,
 which attempt to take more structure of the graph into account for the layout than
 plain force-directed techniques. For example, Large Graph Layout [Ada+04] iter-
 atively moves down a minimum spanning tree placing children on spheres around
 parents. This results in beautiful static images such as the map of the Internet in
2.5 Meta-Layout 40
 Figure 2.13: Large Graph Layout [Ada+04] rendering of the internal structure of
 the Internet (as of 2005). From opte.org/maps
2.5 Meta-Layout 41
 Fig. 2.13, though it is hard to see topology and near impossible to see attributes
 at the scale of networks they tackle.
 Other examples of multiscale layouts include a Cytoscape plugin by Salmela,
 Nevalainen, and Aittokallio [SNA08], the Harel-Koren FMS layout [HK02a] used in
 NodeXL [Smi+10], and many others (e.g., [HJ05; Wal01; Won+08; GGK04]). One
 effective approach, the Lin-Log layout [Noa04], takes explicit cluster membership
 into account when computing the node positions. Hachul and J?nger [HJ06] pro-
 vide an experimental comparison of six multiscale layouts on various toy datasets,
 such as the random grid and Sierpinski triangle shown in Fig. 2.14, as well as some
 real-world ones like the social network in Fig. 2.15. These multi-scale layouts can
 show the overall topology of the network well if they use enough screen space, but
 this ?zooming out? prevents them from displaying internal group ties clearly. None
 of them, including the Lin-Log layout [Noa04] which takes clusters into account,
 highlight group sizes and internal structures as well as my Group-in-a-Box layouts.
 One interesting meta-layout is a modification to Treemaps that attempts to
 map the boxes to known geographic locations [WD08]. These spatially ordered
 Treemaps can be effective for visualizing geographic data like the London tube
 network (Fig. 2.16). This could be potentially modified to use the the relative
 relationships of the groups rather than the geography. However, I chose to allow
 some screen space to be ?wasted? to show the ties between groups more clearly
2.5 Meta-Layout 42
 Figure 2.14: An experimental comparison of six layout algorithms on a random
 grid and Sierpinski triangle dataset, discussed in [HJ06].
2.5 Meta-Layout 43
 Figure 2.15: An experimental comparison of six layout algorithms on a social
 network dataset produced widely different layouts. From [HJ06].
 instead of using a Treemap algorithm. Another meta-layout is DICON [Cao+11]
 (Fig. 2.17), which uses Treemap-like icons to represent clusters. In addition, it uses
 a layout algorithm for the icons that generate similar icons for similar clusters. This
 approach would potentially do well with hierarchically clustered networks instead
 of a one-level hierarchy like the Group-in-a-Box layouts use, but does not display
 the internal group structure nearly as well as the Group-in-a-Box layouts.
 One option is to use edge bundling rather than aggregating the underlying
 edges like I commonly do with my Group-in-a-Box layouts (e.g., Fig. 1.7). Since
 large numbers of links that span a graph drawing can undermine readability, there
 has been a strong attraction to edge bundling to reduce clutter [Hol06; Pup+11].
2.5 Meta-Layout 44
 Figure 2.16: Spatially ordered Treemap [WD08] of the London tube network. Sta-
 tions (squares) are colored by the lines they serve.
 Figure 2.17: DICON [Cao+11] showing Treemap-like icons for clusters.
 Figure 2.18: Increasing strength of edge bundling going left to right. From [Hol06].
2.6 Evaluation 45
 NodeXL [Smi+10] currently supports several levels of edge bundling, and an ex-
 ample of these increasing levels is shown in Fig. 2.18. The initial view is attractive,
 but the bundles seem to obscure rather than highlight the strength of relationships
 among the clusters. However, the option is available to users.
 2.6 Evaluation
 Evaluating the effectiveness of complex creativity and exploration tools can be chal-
 lenging. Simple usability issues can be collected as participants express confusion
 or difficulties, and can even be iteratively used to improve the system throughout
 the user study [Med+02; Med+05]. I applied these techniques in the development
 of my three network visualization improvement approaches. However, the scope of
 the features used and the intellectual effort required for exploration render quan-
 titative laboratory techniques infeasible for capturing many important aspects of
 the tool usage [CC00]. For a recent overview of these techniques, see [Lam+11;
 PFG08].
 One way that individual tools can be analyzed and compared with others is
 based on the insights into the data users find with them, where what constitutes an
 insight is rigorously defined [Nor06; SND05; Sar+06]. Alternatively, Shneiderman
 and Plaisant [SP06] make the argument that qualitative evaluation methods are
 becoming common, accepted, and effective techniques for analyzing visual analytics
2.7 Summary 46
 tools. Excellent examples of these qualitative evaluation techniques for longitudinal
 studies are demonstrated by [PS09; PS08a; SS06].
 For my work, I predominantly use more conventional task-based studies and
 experimental evaluations, as the approaches I am suggesting are more directly
 comparable to the current state of the art node-link visualizations. Lee et al.
 [Lee+06] provide a task taxonomy for network visualization, which I leverage in my
 studies (e.g., see my evaluation of motif simplification in Section 4.5). The tasks I
 chose are also used in many recent papers evaluating network visualizations [HF07;
 SA06; GFC04]. Also, there is a substantial amount of work on user perception for
 experimental metric-based studies, including [Pur02; War+02; Hua07b; BMK96].
 However, this is beyond the scope of my work.
 2.7 Summary
 There are many approaches for visualizing networks, the most common being
 node-link visualizations which are very effective for visualizing the overall network
 topology. Unfortunately the effectiveness and perceived meaning of a node-link
 visualization is highly dependent on the layout of nodes and edges. Readability
 metrics exist to quantify the effectiveness of a static drawing, but do not iden-
 tify specific problem locations. While several layout algorithms try to directly or
 indirectly optimize for these metrics, they are often marginally effective or only
2.7 Summary 47
 useful for specific tasks for which they are optimized. Moreover, there are no user-
 controllable layout algorithms or assisted layout techniques based on the metrics.
 My work contributes new global readability metrics, as well as local readability
 metrics to direct users to problem areas. I leverage these local readability met-
 rics to create an interactive layout improvement technique that guides users using
 visual metric feedback.
 It is challenging to use node-link visualizations to analyze large, multivariate,
 and/or heterogeneous networks, and one of the most effective approaches is to
 use aggregation by topology or node and edge attributes. Effective aggregation is
 difficult to do well while preserving the underlying aggregate topology. Aggregating
 by network motifs has not been explored yet, nor has using representative glyphs for
 the resulting meta-nodes. While aggregation by toplologic and attribute clustering
 has been done in node-link visualizations, the resulting groups have only been used
 to improve the layout of inter-group relationships. My Group-in-a-Box layouts can
 show inter-group relationships, but also group size by their bounding regions as
 well as internal group structure.
Chapter 3
 Applied Network Visualization
 3.1 Introduction
 This chapter serves two purposes. First, it describes in detail NodeXL [Smi+10],
 which is a free and open source network analysis tool that drops into Microsoft
 Excel. I cover why I chose to implement many of my dissertation techniques as part
 of NodeXL, as well as my many contributions to NodeXL?s design, development,
 and evaluation (Section 3.2). Second, this chapter provides an overview of some of
 my work on applying network analysis principles to various domains and real-world
 problems (Section 3.3). It is from these applications that I gained an understanding
 of what approaches are effective for displaying networks visually, which interaction
 techniques are useful for exploring them, and what major challenges remained.
 Moreover, I learned about the necessity for designing exploration tools for end
 user tasks, as well as how to leverage powerful Computer Science and statistics
 techniques and present the algorithmic results to users. These lessons guided my
 dissertation work, and will continue to assist me in my future design challenges.
 48
3.2 NodeXL 49
 Figure 3.1: The NodeXL [Smi+10] workspace. The dual pane view of network
 data and metrics (left pane) with node-link visualization (right pane) provide an
 integrated snapshot of statistics and visualization, along with built-in functions and
 controls that support exploration and discovery. Individual worksheets separate
 network analysis tasks into separate categories, closely aligned with topology and
 attribute-based tasks, such as ?Edges?, ?Vertices? (nodes), and ?Groups.? The
 social network shown reflects voting patterns of U.S. senators, analyses of which
 are detailed in [PS08a; PS09], as well as Sections 3.3.1 and 4.3.1.
 3.2 NodeXL
 NodeXL [Smi+09; HSS11; Smi+10], shown in Fig. 3.1, is a free and open source
 network analysis add-in for Excel 2007/2010/2013. NodeXL is tailored to provide
3.2 NodeXL 50
 powerful features while still being easy to learn. The Excel integration allows rapid
 data processing using standard formulas and macros, but NodeXL also provides
 calculators for network statistics, automatic layout algorithms, visual attribute
 encodings, dynamic filters, direct manipulation, coordinated views, and importers
 from online social networks and common network file formats like GraphML, Pajek,
 and UCINET. These importers are especially important for helping novice users
 collect datasets that are of interest to them like Twitter keyword searches, their
 Facebook network, or their personal email collection.
 NodeXL is widely used in many disciplines and has a full-time developer as
 well as a team of volunteer advisors and developers. Over 25 introductory courses
 on network analysis have used NodeXL and its companion book [HSS11] as part
 of their curriculum,1 due mainly to its ease of use, open source nature, and design
 focus on novice users. I myself have taught several tutorials on using NodeXL for
 network collection and analysis.
 3.2.1 Contributions to NodeXL
 I have been involved with the NodeXL project since 2008 as an advisor, developer,
 and by running exploratory user studies that show that novice network analysts can
 effectively explore datasets with NodeXL [Bon+09]. Moreover, many of the tech-
 1nodexl.codeplex.com/wikipage?title=NodeXL%20Teaching
 %20Resources
3.2 NodeXL 51
 niques I present in this dissertation are implemented and made publicly available
 in NodeXL. My motif simplification approach detailed in Chapter 4 is currently
 shipping in NodeXL for anyone to use and build upon. Of the Group-in-a-Box
 layouts I have worked on (Chapter 5), the Treemap GIB layout is already available
 in NodeXL. The Croissant-Donut and Force-Directed variants have been imple-
 mented and I will push them to the trunk shortly. Finally, some of my readability
 metrics and the assistive layout improvement tool (Chapter 6) are implemented as
 a hidden feature and may be released in the future when we can devote additional
 time to readying them for public consumption.
 I chose to develop my techniques within NodeXL for several reasons. First,
 NodeXL is a high quality network analysis tool with a large, active, and expanding
 user base. It has over 184,000 downloads and is on an increasing trajectory. More-
 over, there are about 660 query results for ?NodeXL? on Google Scholar, many
 of which are papers applying NodeXL to network analysis challenges in various
 domains. Second, given its role as a teaching tool, many NodeXL users generally
 have little prior knowledge about network visualization readability. I believe that
 these novice users will particularly benefit from my readability-improving tech-
 niques. Moreover, the NodeXL codebase is separated into the classes necessary
 for the interactive Excel template and a disjoint set of generally applicable code
 that is packaged as a separate C# network analysis library. Users of this library
3.2 NodeXL 52
 have access to many of the algorithms behind my techniques without having to
 do the implementations themselves. Finally, NodeXL?s free availability and open
 source license encourages collaboration and provides a reference implementation
 for future users interested in applying or evaluating my techniques.
 3.2.2 NodeXL Interface
 The basic interface of NodeXL is shown in Fig. 3.1. The left side provides several
 worksheets in an Excel workbook that represents the network: one each for the
 nodes, edges, groups, group members, and overall metrics. Each worksheet has
 several columns, including basic information about the network like the nodes and
 edges between them. Additionally, there are places to insert columns for node or
 edge attributes and calculated metrics, as well as columns that control the visual
 display of each network item. These include color, shape, size, label, tooltip,
 display position, and the like. Any of these visual properties can be automatically
 filled based on the metric or attribute columns using a special autofill dialog.
 Moreover, standard Excel formulas or macros can be used for arbitrary calculations
 and scales within the tool. The Excel ribbon is customized with a new tab for many
 of the common operations users perform on networks, including the autofill feature.
 The visualization pane shown in the right of Fig. 3.1 displays a node-link vi-
 sualization based on the network in the workbook. Whenever the contents of
3.3 Applying Network Visualization to Real Problems 53
 the workbook is updated, the visualization pane can be refreshed using a button.
 The pane also provides users with several automatic layout algorithms to arrange
 the network, and any automatic or manual adjustments to the node positions are
 stored in the workbook as well. Moreover, the contents of the visualization can
 be filtered using a dynamic filters dialog. Additional windows can be opened for
 filtering the visible network, autofilling visual property columns based on metrics
 or attributes, and running automated analyses of several networks sequentially.
 The worksheet view and the visualization pane are connected using brushing,
 where any selection in one is reflected in the other. Clicking a node in the visual-
 ization or dragging a box around several causes the associated rows to be selected
 in the nodes worksheet. Likewise, any incident edges are selected in the edges
 worksheet. The reverse is also true. Any nodes or edges selected in the worksheets
 are highlighted in the visualization pane as well.
 3.3 Applying Network Visualization to Real Problems
 While much of my work has been on NodeXL [Smi+10], I have worked exten-
 sively with target users from several domains on visualizing and analyzing their
 real-world networks. I have been involved in network analysis projects for six
 years, and I strive to solve real problems by initiating contact with domain experts
 across many disciplines. I design and build visual analytics tools that have helped
3.3 Applying Network Visualization to Real Problems 54
 urban planners [SD12], political scientists [DS13], health care professionals, the
 U.S. Treasury, and many others described below. This work has helped me gain
 an understanding of the effectiveness of various visualization and interaction ap-
 proaches, as well as what major research challenges remained. Moreover, it helped
 me to realize the importance of keeping the end users and the tasks they wish to
 accomplish in mind throughout the design process. The tasks end users wish to
 perform drastically impacts the effectiveness of any chosen visualization and inter-
 action techniques. Often, some of the best breakthroughs for the end users came
 when I could integrate powerful Computer Science and statistical algorithms and
 present the results within the visualization or a coordinated view in the tool.
 3.3.1 The Importance of Network Topology and Filtering
 For some network analyses users are only interested in the topology of the relation-
 ships and not any additional attributes. For one such exploration, I visualized the
 relationships between 750 organizations that are engaged in cancer research, aware-
 ness, and outreach. The data used to create the network was collected through a
 survey of these organizations by a central agency, the Cancer Information Service
 (CIS) of the National Cancer Institute (NCI). Due to this selection method, the
 CIS played a central role in each of the networks, connected to each of the surveyed
 organizations. Many network datasets suffer from similar selection mechanisms,
3.3 Applying Network Visualization to Real Problems 55
 (a)
 (b)
 Figure 3.2: Relationships between cancer research, awareness, and outreach in DC,
 MD, VA, and WV. The different colors represent each of the states in the region.
 (a) shows the network with the CIS ego node circled in green, while (b) shows the
 same network after removing the CIS node and laying it out again. The resulting
 visualization shows the remaining group structure and connections more clearly.
3.3 Applying Network Visualization to Real Problems 56
 only showing the ego network of a person?s Facebook friends, related replies or
 mentions on Twitter, or a set of connected web sites in a web crawl. In these sorts
 of ego-centric datasets, simple filters like removing the ego of the network can sub-
 stantially improve the resulting visualization. For example, Fig. 3.2 demonstrates
 how removing the completely connected CIS ego node from the network for one
 region can substantially improve the layout and readability of the remaining nodes,
 with no loss of information.
 Some networks have large numbers of nodes and edges which can obscure mean-
 ingful groups or network items with interesting attribute values. Filtering can
 be applied to node values to remove incidental nodes of specific types or with
 low metric values, leaving only key actors. User-controlled dynamic query filters
 [AWS92; WS92] have demonstrated their value in successful commercial products
 that deal with multivariate data, such as Spotfire [Spo] and Tableau [Tab]. Dy-
 namic query filters are even more valuable in network visualizations, where the
 clutter of nodes and links can severely inhibit readability. NodeXL, discussed in
 detail in Section 3.2, supports filters on node values, link values, graph metrics,
 layout positions, and many other attributes.
 Filtering is a well-established technique for multivariate data, as shown in scat-
 tergrams, but the variety of filters in many networks means careful thought is
 needed to produce effective results. Furthermore, scattergram filtering typically
3.3 Applying Network Visualization to Real Problems 57
 Figure 3.3: 2007 U.S. Senate voting network, showing all 4950 links. The net-
 work is visualized inside the NodeXL network analysis tool as part of Excel. The
 highlighted red edges show the Akaka?Allard and Akaka?Baucus ties.
 leaves the remaining markers in place, but in networks, layout methods interact
 with filtering, so thoughtful exploration is needed.
 The power of attribute or metric filtering is shown in an example network of
 U.S. Senate voting patterns from 2007.2 The similarity in voting patterns (from
 0.0 to 1.0) is an attribute of each one of the 4950 links connecting the 100 Senator
 nodes. The naive visualization produces a thickly connected graph (Fig. 3.3), but
 filtering the similarity values to show only those with values above 0.65 produces
 a revealing portrait (Fig. 3.1). The force-directed layout shows the willingness of
 the three Republican Senators Snowe, Collins, and Specter (center, in red) to vote
 2Data provided by Chris Wilson of Slate magazine available in the NodeXL template format
 at nodexl.codeplex.com/wikipage?title=NodeXL%20Teaching
 %20Resources
3.3 Applying Network Visualization to Real Problems 58
 in support of their Democrat colleagues (top-right, in blue). One of these, Arlen
 Specter, later switched his affiliation to the Democrats in 2009. However, apart
 from the party groups and these moderates, not much of the network structure is
 visible inside the dense party clusters. This data is further explored in Section 4.3.1.
 As these filtering operations omit information from the visualization, it be-
 comes important to keep track of what was omitted. While my GraphTrail ap-
 proach detailed in Section 3.3.2 was designed to present the history of exploration
 automatically, most network analysis tools do not give you any indication that
 data has been removed. This prompted me to think about ways that nodes in
 larger datasets could be automatically filtered, but displayed in such a way as to
 notify the user what filtering has taken place and display the underlying node dis-
 tributions. This is especially important for ego-centric datasets like social network
 crawls or web crawls like discussed in Section 4.3.4, where there can be an enor-
 mous amount of peripheral data that can obscure the core relationships. This line
 of thought helped guide me in the creation of the fan and connector motif simplifi-
 cation approaches described in Chapter 4. Similarly, in the Senate example above
 the importance of edge filtering was highlighted to me. The clique motifs simplifi-
 cation technique I develop in Chapter 4 is based on this kind of edge filtering, and
 I even apply it to the same Senate dataset in Section 4.3.1.
3.3 Applying Network Visualization to Real Problems 59
 3.3.2 The Importance of Node & Edge Attributes
 Some network datasets and analysis tasks require less focus on the topology and
 more on the node and edge attributes. One of my studies focused on the network
 of relationships formed by IP traffic on a local area network (LAN) [Blu+08].
 The visualization tool we designed, NetGrok, is targeted at system administrators
 monitoring the status of their LAN. While the LAN topology was important for
 users to view, the topology of the connections with remote machines was less likely
 to be observed in the packet capture or relevant to the users. We focused instead
 on showing changes in communication patterns that could indicate malicious or
 erroneous behavior on the LAN.
 The approach we developed for this challenge focused on presenting aggrega-
 tions of the connection attributes over time such as the bandwidth used and total
 number of connections. Two of the views of NetGrok are shown in Figs. 3.4 and 3.5.
 In the node-link view (Fig. 3.4), the relationships between computers on the LAN
 are shown using a force-directed layout in an inner circle, while remote computers
 are arranged in a hash layout based on one of their attributes: their IP address.
 Connections to external computers were hidden by default due to their number
 and relatively low meaning, but shown on demand. An alternate view replaced
 the node-link visualization with a treemap as in Fig. 3.5, where each relationships
 are similarly shown on demand.
3.3 Applying Network Visualization to Real Problems 60
 Figure 3.4: NetGrok?s [Blu+08] elements include a node-link visualization (upper
 left), a time-line histogram (lower left), a filter panel (upper right), and details on
 demand (lower right).
 While individual nodes and their relationships can be of interest, in many cases
 it is the groups of nodes and their aggregate relationships that are more useful to
 study. One of my previous projects as an intern at Microsoft Research, called
 GraphTrail [Dun+12a; RLD] (Fig. 3.6), was targeted at more general networks
 and aimed to explore networks by aggregating node and edge attributes in stan-
 dard charts. For example, the bars in the bar chart in Fig. 3.6 each represent an
 aggregate of nodes and the arcs along the bottom show the aggregate relation-
3.3 Applying Network Visualization to Real Problems 61
 Figure 3.5: NetGrok?s [Blu+08] treemap layout arranges computers by the number
 of connections they have and colors them by the bandwidth used. Communications
 between computers are shown using highlighting on mouseover.
 Figure 3.6: GraphTrail [Dun+12a] showing three views of ACM SIGCHI conference
 publications, based on both the authors and their connected papers.
3.3 Applying Network Visualization to Real Problems 62
 Figure 3.7: A GraphTrail [Dun+12a] analysis showing two parallel exploration
 paths, the top examining Georgia Tech (GT) publication and citation patterns and
 the bottom comparing Microsoft Research (MS). They start at the ROOT chart
 that contains all the papers in the dataset. Charts in each path are numbered in
 order of creation (e.g., 1, GT2, GT3, etc.), and the user interactions are shown
 with stars. The MERGED chart is the union of both branches? results. The user
 moved the mouse over the final parent link in the GT path (circled), highlighting
 the chain of actions up to the root.
 ships between them. Similarly, the matrix chart on the far right show aggregate
 citations between authors and even to themselves along the diagonal. In addition,
 GraphTrail provides a pivoting mechanism to explore connected aggregates of the
 network across node types.
 One of the main benefits of GraphTrail is an infinite canvas that aggregates
 can be dragged to and dropped to create new charts for filtered subsets. Moreover,
 data can be dragged from several charts into one target, creating the union of those
 sub-networks. This intuitive data filtering is augmented with parent links, which
3.3 Applying Network Visualization to Real Problems 63
 indicate the source(s) of the data for each chart. On mouseover, the parent links
 highlight the entire provenance of that specific data all the way back to the root
 chart, in addition to a text tooltip indicating the operation performed. An example
 of this exploration history view is shown in Fig. 3.7. Exposing the analysis process
 in this way enables users to utilize their spatial memory while visual and textual
 feedback helps them track their interactions.
 I compared GraphTrail with three tools with similar goals: NetLens [Kan+06]
 (Fig. 2.8), PaperLens [Lee+05], and FacetLens [Lee+09]. From this I determined
 that GraphTrail could make all the findings reported for the other tools, as well as
 several additional ones that were not discoverable in the others. Moreover, a three-
 month field study with a team of archeologists and a lab study demonstrated that
 GraphTrail improves insight discovery, analysis comprehension, exploration recall,
 and sharing analyses with others. Prior to using GraphTrail, the archaeologists
 had been using Cytoscape [Sha+03] to explore slices of the network with one or
 two node types, and GraphTrail greatly assisted their explorations by allowing
 more interactive exploration and exposing the exploration history. This approach
 may be a first step on the way to asynchronous collaboration for network analysis.
 Both NetGrok and GraphTrail were designed to primarily display attribute
 information, with the underlying topology available on demand or in aggregate.
 These two approaches are highly effective for certain tasks, such as monitoring a
3.3 Applying Network Visualization to Real Problems 64
 computer network (NetGrok) or exploring the attributes of a network while pre-
 serving the data provenance (GraphTrail). However, neither are particularly good
 at showing the overall topology and path information that would be available in a
 node-link visualization. Through these projects I began to understand the breadth
 of visualization techniques for networks, and that it is often difficult to build gen-
 eral tools for all kinds of analysis tasks. My dissertation work has primarily focused
 on helping users perform topology-based tasks, though my increased awareness of
 the importance of attribute values guided the design of the motif simplification
 glyphs (Chapter 4) and Group-in-a-Box aggregation techniques (Chapter 5).
 3.3.3 The Importance of Statistics and Algorithms
 Much of my applied work in network analysis has been in text analytics and sci-
 entometrics, the science of measuring and analyzing science. My work on scien-
 tometrics focuses on measuring the impact of scientific publications, patents, and
 trade press articles and how they affect innovation.
 One example is a study I did comparing the trajectory of three information
 visualization innovations: treemaps, cone trees, and hyperbolic trees [Shn+12].
 I collected and analyzed academic publications, patents, and trade press articles
 over the almost two decades after the techniques were proposed. While node-link
 visualizations were useful, I found that for this task line charts were a more effective
3.3 Applying Network Visualization to Real Problems 65
 (a)
 (b)
 Figure 3.8: These line charts show the impact of treemaps (TM/green), cone
 trees (CT/red), and hyperbolic trees (HT/blue) in terms of trade press articles,
 academic papers, and patents. (a) shows the number of publications per year by
 type of publication for each innovation and (b) shows the number of citations to
 papers and patents by year for each innovation. Note that the sharp fall in patent
 figures in the faded area may be due to the average 32-month USPTO processing
 time in 2005-2008. From [Shn+12].
3.3 Applying Network Visualization to Real Problems 66
 Figure 3.9: NetVisia [Gov+11b] visualization of the clustered heat map of the
 degree values for the STICK business intelligence term co-occurrence data from
 2005, filtered to show only nodes with degrees between 45 and 491.
 network representation of what we wanted to see: changes in statistics over time.
 Two examples are shown in Fig. 3.8, where the citation network is displayed as
 several line charts that show aggregates of nodes and edges over time. Our paper
 [Shn+12] shows additional examples using scatterplots.
 I expanded these techniques to use clustered matrix diagrams for NetVisia
 [Gov+11b], including clustering nodes by metrics and by topology. An example of
 this is shown in Fig. 3.9 for business intelligence terms and their co-occurrences. In
 this case it was both statistics and hierarchical clusters of related terms that were
 of interest. These tasks were much more easily performed with line and matrix
 visualizations, and reinforced my belief that tasks and statistics of interest should
 guide tool and visualization design.
3.3 Applying Network Visualization to Real Problems 67
 Figure 3.10: After removing edges with low weight we can see the structure the
 network backbone. Isolate category pairs are drawn in a ring around the main
 connected component and singletons are staggered in the corners. Each node is
 colored by its semantic orientation (red for negative, blue for positive) and edges
 are colored by their weight, from red to blue. Node shape also codes semantic
 orientation, with triangles positive and circles negative. Size codes the magnitude
 the semantic orientation, with the largest nodes representing the extremes. Node
 labels are shown for nodes in isolates and those in the top 20 for betweenness
 centrality. From [MDD09].
3.3 Applying Network Visualization to Real Problems 68
 I also investigated using networks to model relationships between words or
 word categories. As a way to understand the behavior of a new sentiment analysis
 technique, I developed node-link visualizations of the semantic relationships be-
 tween thesaurus categories [MDD09]. After algorithmically determining antonym
 relationships between categories of the Macquarie Thesaurus, I was able to show
 the relationships between categories of words as well as the semantic orientation of
 individual categories using color Fig. 3.10. The density of this network was quite
 high, with 812 nodes connected by 27,155 antonym edges, and thus necessitated
 substantial filtering and labeling only the most significant nodes. An interesting
 aspect of Fig. 3.10 is the large number of disconnected components in a ring around
 the center, representing small groups of related thesaurus categories. Moreover,
 there are many completely disconnected categories laid out in the corners of the
 visualization. At the time, NodeXL [Smi+10] had no way of handling these discon-
 nected nodes and this layout took an enormous amount of my time to hand-tune.
 This kind of rote, manual correction helped me understand the necessity of tech-
 niques for handling disconnected components, such as the Group-in-a-Box layout
 algorithms I describe in Chapter 5 which would make this task automatic today.
 Another project I was involved with focused on creating a literature exploration
 and analysis tool called Action Science Explorer (ASE) [Dun+12b; Gov+11a].
 ASE was designed to support exploring a collection of papers so as to aid users
3.3 Applying Network Visualization to Real Problems 69
 Figure 3.11: The main views of ASE [Dun+12b] are displayed and labeled here:
 Reference Management (1?4), Citation Network Statistics & Visualization (5?6),
 Citation Context (7), Multi-Document Summaries (8), and Full Text with hyper-
 linked citations.
 in rapidly creating summaries of unfamiliar research domains. It incorporated (1)
 bibliometric lexical link mining to create a citation network for a field and context
 for each citation, (2) automatic summarization techniques to extract key points
 from papers, and (3) potent network analysis and visualization tools to aid in the
 exploration relationships. ASE, shown in Fig. 3.11, presents the academic litera-
 ture for a field using many different modalities: tables of papers, full texts, text
3.3 Applying Network Visualization to Real Problems 70
 Figure 3.12: Algorithmically found communities in ASE [Dun+12b] are shown
 using convex hulls in the node-link visualization. When selected, all the citation
 context is shown in the top-right, along with an automatically generated summary
 of the overall context (bottom-right).
 summaries, and visualizations of the citation network and the groups it contains.
 Each view of the underlying data is coordinated such that papers selected in one
 view are highlighted in the others, providing additional metadata, text summaries,
 and statistical measure rankings about them. Users can filter by rankings or via
 search queries, highlighting the matching results in all views.
 ASE represented a major collaboration with several experts in Natural Lan-
 guage Processing, who were interested in (1) understanding the effectiveness of
 their link mining and multi-document summarization approaches and (2) being
 able to apply these algorithms to real tasks and present the results to users. An
 example of the multi-document summaries ASE can compute is shown in Fig. 3.12
3.4 Summary 71
 for a selected topologic cluster of papers. Our collaborations helped them improve
 the effectiveness of the summarization algorithm, as well as develop a prototype
 tool that will guide developers of literature exploration systems to integrate such
 Natural Language Processing techniques.
 From all these collaborations I have gained an improved understanding of how
 algorithms and statistics can be brought to bear on network analysis tasks. The
 various attribute- and topology-based clustering algorithms especially can be used
 to create the groups for my Group-in-a-Box layouts (Chapter 5). Moreover, if there
 are any text associated with nodes or edges like the Tweets in a Twitter keyword
 network, this text can be analyzed to present additional information to the user as
 part of the group box labels or in additional coordinated views. The results of a
 statistics algorithm can be shown using color coding or the like, and then displayed
 in aggregate within my motif glyphs (Chapter 4).
 3.4 Summary
 NodeXL [Smi+10] is a free and open source Excel template for network analysis.
 It provides powerful features while still being easy to learn, and avoids the pre-
 processing and programming steps required by many existing tools. The Excel in-
 tegration brings standard formulas and macros, but we also include calculators for
 network statistics, layout algorithms, visual attribute encodings, dynamic filters,
3.4 Summary 72
 direct manipulation, coordinated views, and much more. NodeXL is widely used in
 many disciplines and taught in over 25 introductory courses on network analysis.
 I have been involved in the design, evaluation, and development of NodeXL, and
 have integrated the techniques presented in this dissertation as part of the shipping
 product. As my research focuses on improving network visualization readability,
 it is especially beneficial for the introductory users NodeXL targets.
 In addition to my work on NodeXL [Smi+10], I have been involved in the appli-
 cation of network analysis and visualization techniques to problems across several
 domains. In the various domains I have worked in, several different network prop-
 erties have been important to display. Working with real users helps inform the
 design process, as the tasks, statistics, and algorithms relevant to them dramat-
 ically affect the choice of visualization and interaction techniques. In domains
 where network topology was most important to show, filtering the network by at-
 tributes or statistics was critical. The limitations of filtering techniques helped
 guide my development of motif simplification (Chapter 4). Moreover, when show-
 ing topology there is a major challenge in finding an effective and simple layout
 for the nodes that avoids readability problems, while at the same time highlighting
 the necessary structures for the task at hand. This provided me with motivation
 to investigate the use of readability metrics for understanding these issues and
 improving the layout (Chapter 6). In other cases, the attributes of the nodes or
3.4 Summary 73
 edges in the network were more important, and I developed specialized visualiza-
 tions depending on the user tasks. While my dissertation work is focused primarily
 on topology-based tasks, I gained an understanding of the importance of showing
 attribute or statistics information. This informed the design of my motif simplifica-
 tion glyphs and Group-in-a-Box aggregation techniques (Chapter 5). Other forays
 into domains such as Natural Language Processing helped me to understand the
 necessity of Group-in-a-Box layouts, even for handling simple disconnected com-
 ponents. These explorations helped shape the rest of my dissertation work, as well
 as my future design challenges.
Chapter 4
 Motif Simplification to Reduce Complexity
 4.1 Introduction
 One way to reduce the complexity of node-link network visualizations is the use
 of aggregation, specifically by aggregating common network structures or subnet-
 works called motifs. Large, complex network visualizations often have large motifs
 repeated throughout because of either the network structure or how the data was
 collected. Regardless of their cause, some frequently occurring motifs contain little
 information compared to the space they occupy in the visualization. Existing tools
 may highlight certain motifs, allow users to filter them out, or replace groups of
 similar nodes with meta-nodes (e.g., see Section 5.2.2 and Fig. 5.4) ? but each of
 these approaches has the serious limitation of obscuring the underlying topology.
 Figure 4.1: From left to right: fan, connector, and clique motifs.
 74
4.1 Introduction 75
 I improve on these approaches with motif simplification, in which network
 motifs are automatically replaced with compact, representative glyphs. Well-
 designed glyphs have several benefits: they (1) require less screen space and layout
 effort, (2) are easier to understand in the context of the network, (3) can reveal
 otherwise hidden relationships, and (4) preserve as much underlying information
 as possible. In this chapter I discuss three high-payoff motifs that plague network
 analysts, shown in Fig. 4.1: fans, connectors, and cliques. I contribute the
 design of representative and combinable glyphs for these motifs, algorithms for de-
 tecting them, and a supporting task-based controlled study with 36 participants.
 These techniques are all implemented and made publicly available as part of the
 free and open source NodeXL network analysis tool [Smi+10].
 4.1.1 Chapter Overview
 Specifically, the contributions of this chapter are:
  A technique for simplifying node-link visualizations by replacing common
 network motifs with representative glyphs,
  A set of design guidelines for these glyphs to show the motif contents and
 underlying attributes,
  The design of glyphs for fans, connectors, and cliques,
  Algorithms for detecting these three motifs,
4.2 Network Motif Simplification 76
  A supporting task-based study with 36 participants,
  A free and open source implementation as part of NodeXL.
 Parts of this chapter have been published [DS13] as well as featured in an
 overview paper on novel network analysis techniques in NodeXL [SD12]. I first
 describe the basics of Motif Simplification (Section 4.2), including glyph design
 (Section 4.2.1), motif detection algorithms (Section 4.2.2), and details about the
 NodeXL implementation (Section 4.2.3). I next demonstrate the utility of motif
 simplification in several case studies (Section 4.3), a usability study (Section 4.4),
 and a controlled experiment (Section 4.5). I end with a summary in Section 4.6.
 4.2 Network Motif Simplification
 Many common network motifs present little meaningful information, yet can dom-
 inate much of the display space and obscure interesting topology. I believe that
 replacing these motifs with representative glyphs will create more effective visual-
 izations as there will be far fewer nodes and edges for layout algorithms and users
 to consider. I have chosen three motifs for my foray into motif simplification:
  A fan motif consists of a head node connected to leaf nodes with no
 other neighbors. As there may be hundreds of leaves, replacing all the leaves
 and their links to the head with a fan glyph can dramatically reduce the
 network size.
4.2 Network Motif Simplification 77
  A D-connector motif consists of functionally equivalent span nodes that
 solely link a set of D anchor nodes. Replacing span nodes and their links
 with a connector glyph can aid in connectivity comparisons.
  A D-clique motif consists of a set of D member nodes in which each pair
 is connected by at least one link. Cliques are common in biologic or similarity
 networks, where swapping for a clique glyph can highlight subgroup ties.
 These motifs are prime simplification candidates for several reasons. For one,
 these motifs are quite common in the network datasets I have encountered in
 several disciplines. While simple to understand on their own, these motifs can
 account for much of the visual complexity of a node-link visualization. The fan
 motifs especially can dominate the diagram. While connector motifs usually occupy
 less space than the fans, they are hard to detect and can contribute substantial
 complexity. In the densest networks, such as similarity scores, overall relationships
 can be hidden in a tangled hairball of overlapping clique motifs as in Fig. 4.9a.
 4.2.1 Glyph Design
 For each motif, careful thought must be given to how to represent the simplified
 version. Arbitrary motifs can be shown as a simple meta-node (e.g.,
 L
 ), possibly
 with embedded images that show a small node-link visualization of the underlying
 subnetwork. However, a specially designed representative glyph for a motif can
4.2 Network Motif Simplification 78
 Figure 4.2: A 2-connector motif with three simplified glyph variants: diamond,
 crescent, and tapered diamond.
 make it easier to understand aggregate topology and attributes with only minimal
 additional visual clutter. I went through several designs for each of my motif
 glyphs, some of which are discussed below.
 4.2.1.1 Motif Topology
 Foremost each glyph must be representative of the underlying subnetwork topology
 so that the aggregate relationships in the network can still be understood. As I
 aim to reduce visual clutter, I must use a small, easily-distinguishable glyph rather
 than heavy-weight visualizations. An effective way to differentiate the glyphs is to
 use unique shapes to identify each type, ideally that correspond to the underlying
 topology.
 Several example shapes for a connector motif are shown in Fig. 4.2. The dia-
 mond is a straightforward representation of the outline made by the motif topology,
 is discernible at scale, and has geometric properties that allow easy area scaling
4.2 Network Motif Simplification 79
 Figure 4.3: A 3-connector motif and its glyph.
 and subdivision. However, they are often used with other shapes for categori-
 cal attribute coding. The crescent is not, but my user study indicated that its
 asymmetry was visually jarring and that it had poor edge connector properties
 (Section 4.4). I finally chose a symmetric tapered diamond: unique enough to be
 distinguishable and representative yet symmetric and connectable. I use the same
 shape regardless of the number of anchor nodes so as to reduce the shape corpus
 required (Fig. 4.3). The clique motifs were originally represented with a tapered
 square to indicate the link density, but it was easily confused with the connector
 motif and has since been replaced with a rounded X (Fig. 4.6). Like the connector
 motif, the same shape is used for any number of clique members. For the fan
 motifs, I chose a sector of a circle (Fig. 4.4), as it represented the fan of leaf nodes
 commonly seen in node-link visualizations.
4.2 Network Motif Simplification 80
 Figure 4.4: Three fan motifs and two glyph variants of each.
 4.2.1.2 Contained Nodes
 In addition to the topology, it is helpful to show information about the nodes
 contained in the motif. What information we want to show impacts the display
 mechanism we choose for it. Most useful would be a count of the nodes in the motif.
 This quantitative value is best expressed by position [Mac86], though in node-link
 visualizations this is reserved for showing ties. The next best choices would be
 length, angle, or area [Mac86]. For the fan motif, I scale the angle of the sector
 linearly between 10?120 by the number of contained nodes, which also linearly
4.2 Network Motif Simplification 81
 Figure 4.5: Three 2-connector motifs and their glyphs.
 Figure 4.6: 4-, 5-, and 6-clique motifs and their glyphs.
4.2 Network Motif Simplification 82
 scales its area (Fig. 4.4). I chose this range after tests using smaller ranges (20?90 )
 did not reveal enough size variation. The vertical alignment eases area comparisons
 and eases glyph subdivision to show edge directionality or attributes. I also scale
 the area of the other motifs linearly by the number of nodes (Figs. 4.5 and 4.6).
 Designers of future motif glyphs should ensure the shape is still discernible at its
 minimum size while not so large at its maximum to occlude edges unnecessarily.
 We may also wish to show quantitative attributes or statistics of the under-
 lying nodes. Showing all the values or their distribution would require complex
 embedded charts or focusable tooltips. Instead, I show a function of the values
 such as mean (used for these examples), sum, min, or variance. As size is reserved
 for node count, we are left with the less effective color saturation, color hue, and
 density/opacity [Mac86]. While these are less effective encodings, the maximum
 deviation reported for quantitative tasks is only 13% [CM85]. Glyphs demonstrat-
 ing these quantitative attribute or statistic encodings are shown in Figs. 4.4 to 4.6,
 using the same color scale as the underlying nodes in the network. Categorical
 attributes are more challenging to display without subdividing glyphs or embed-
 ding visualizations, increasing the visual clutter. Finally, text attributes such as
 labels would help reveal the contents of the motif. While a glyph can show a small
 label, it is challenging to compute a representative one. Instead, I discuss later
 how interactivity can reveal the underlying nodes.
4.2 Network Motif Simplification 83
 4.2.1.3 Connecting Edges
 Nodes contained within a motif may have connecting edges, and when the motif is
 simplified these edges are re-routed to link to the glyph instead. This can result in
 duplicate, overlapping edges in straight-line drawings, as with the connector motif
 in Fig. 4.5. As with nodes, it is useful to show the number of duplicate edges and
 any attributes they may have. The edges could be drawn independently as curves
 of varying arcs, stacked in slices with scaled area, or use the edge distribution
 visualizations from [Mur08]; but again I strive to avoid visual clutter and show
 aggregate relationships clearly.
 I aggregate these duplicate edges into meta-edges, with width and thus area rep-
 resenting a function of the underlying edges such as the number of edges (Figs. 4.9
 to 4.11), the average of an attribute value (Figs. 4.4 and 4.5), etc. There are options
 for showing categorical attributes or labels, but these require cluttered embedded
 visualizations or interactivity. In some cases there are no attributes on the edges
 to encode, and showing even edge count would be a redundant. One example is
 the fan motif, in which the number of edges equals the already-encoded number
 of leaf nodes (in an undirected network without duplicates). Example fan glyphs
 without meta-edges are shown in the center column of Fig. 4.4.
 Alas, glyph shape impacts how edges connect to them. Ideally, each glyph lies
 along a straight line with connecting edges so paths can be traced easily. For the
4.2 Network Motif Simplification 84
 2-connector motif, a crescent would suffice if its corners were aligned along the
 path (Fig. 4.2). However, for connectors with three or more anchors my users
 reported that crescents make edges difficult to follow. Symmetric shapes like the
 tapered diamond and rounded X are better suited for many connecting edges.
 4.2.1.4 Motif Overlap
 Figure 4.7: Glyphs for fan, clique, and connector motif overlap.
 Often motifs are non-overlapping and easily transformed into glyphs, though
 many motifs do not have this luxury. When detecting motifs I can choose a non-
 overlapping set to display, but motif glyphs will be more effective at reducing
 complexity when they can be combined to show overlapping motifs. The design
 of any motif glyphs must thus take overlaps into account. Among my three mo-
 tifs, fans are the most immune to overlap. The fan leaves have too few edges to
 participate in the other motifs, though the fan head can be a connector anchor or
 clique member. As a clique glyph replaces all the clique members, I must exclude
 the fan head from the fan glyph to allow this combination. Similarly, a connector
 anchor can be a clique member, which requires its exclusion from the connector
4.2 Network Motif Simplification 85
 glyph. Two example overlaps are shown in Fig. 4.7 and more on overlap handling
 is discussed later in Section 4.2.2.4.
 4.2.1.5 Glyph Interactivity
 While the motif glyphs I described can be effective for simplifying a network, I
 would like to make sure that they are easily understandable and investigable. One
 important aspect of this is to ensure that users can switch between the original
 and simplified views interactively. Users can simplify the entire network, or only
 a selected subset of motifs. Likewise, users can expand the entire network to see
 the original visualization, or only expand a selected glyph they are interested in
 exploring. I expose the contents of each glyph with tooltips. It would be possible
 to expand on this and show details for a glyph via a heavyweight focusable tooltip
 that contains a chart of attribute distributions or a list of node labels.
 Direct manipulation of the motif glyphs and underlying nodes is an effective
 way of exploring the network. Users can adjust node or glyph placement manually,
 as well as highlight incident edges or adjacent nodes through simple context menus.
 Additionally, automatic layout algorithms are available for laying out the simplified
 network. An ideal layout algorithm would take the shape and size of the glyphs
 into account, in addition to the number of edges in any meta-edges.
4.2 Network Motif Simplification 86
 4.2.2 Motif Detection Algorithms
 General motif detection can be accomplished with approaches like symmetry-
 breaking [GK07], but custom algorithms are more effective for specific motifs that
 can vary substantially in size. I have implemented algorithms to detect fan, con-
 nector, and clique motifs of all sizes. I refer the interested reader to view and
 utilize my C# source code.1 I use the terminology of a network or graph G with
 a set of nodes G:nodes, and each node n has a set of adjacent nodes n:neighbors.
 The size of each of these node sets, say s, is denoted as jsj.
 4.2.2.1 Fan Motifs
 My approach to detecting all the fan motifs in a network is detailed in Algorithm 1,
 which has a run time complexity of O(jG:nodesj  average neighbor count). Av-
 erage neighbor count is usually relatively small and can be considered a bounded
 constant, so this technique should scale well. However, I recently came upon an al-
 ternate, faster algorithm with linear time complexity shown in Algorithm 2, though
 it has not yet been implemented in NodeXL and is not discussed further here.
 The current algorithm (Algorithm 1) first passes through all the nodes in the
 network, searching for potential fan heads. Each fan head must have two or more
 neighbors to exclude the degenerate barbell case (Line 3), though this criteria could
 be increased to find larger fans. For each potential fan head, I then search through
 1nodexl.codeplex.com/SourceControl/changeset/view/70521#1208172
4.2 Network Motif Simplification 87
 Algorithm 1 Fan motif detection algorithm.
 Time complexity: O(jG.nodesj  average neighbor count)
 1: procedure DetectFans
 2: for all n 2 G.nodes do
 3: if jn.neighborsj  2 then
 4: leaves  f;g
 5: for all nbr 2 n.neighbors do
 6: if jnbr.neighborsj = 1 then
 7: leaves.add(nbr)
 8: if jleavesj  2 then
 9: RecordFan(n, leaves)
 10: end procedure
 11: procedure RecordFan(head, leaves)
 12:    . Record a given fan motif
 13: end procedure
 Algorithm 2 Alternate fan motif detection algorithm.
 Time complexity: O(jG.nodesj)
 1: procedure DetectFans
 2: fans  MaphNode, ListhNodeii
 3: for all n 2 G.nodes do
 4: if jn.neighborsj = 1 then
 5: head  n.neighbors[0]
 6: if head =2 fans then
 7: fans[head]  ListhNodei
 8: fans[head].add(n)
 9: for all head, leaves 2 fans do
 10: if jleavesj  2 then
 11: RecordFan(head, leaves)
 12: end procedure
 13: procedure RecordFan(head, leaves)
 14:    . Record a given fan motif
 15: end procedure
4.2 Network Motif Simplification 88
 the set of its neighbors to find any leaf nodes connected only to it (Line 5). Each
 of these leaf nodes are added to the set of potential leaves. If two or more leaves
 are found in the neighbor set, the fan motif is acceptable and recorded (Line 8).
 The differing neighbor count criteria for head and leaf nodes in Algorithm 1
 prohibits any overlapping motifs from being detected. However, please note that
 I am using jn:neighborsj to show the size of the neighbor set of n, which may
 differ from n?s degree if there are overlapping edges. For example, in a network
 with directed edges a leaf node may have two overlapping edges connecting it
 to the head node, one for each direction. Moreover, an undirected network with
 several edge types may have overlapping edges of differing types. Some algorithms
 for computing degree would return higher values in these cases than the actual
 number of neighboring nodes.
 4.2.2.2 Connector Motifs
 Connectors have an dimension, denoted D, that indicates the number of anchors
 it has. D can be any integer two or greater, though the frequency of the motifs
 generally decreases proportional toD. My algorithm for detecting connector motifs
 of all dimensions is shown in Algorithms 3 and 4, and takes parameters D-min and
 D-max to indicate the range of dimensions to search for. The run time complexity
 of this algorithm is also O(jG:nodesj  average neighbor count). Again, average
 neighbor count can be considered a bounded constant.
4.2 Network Motif Simplification 89
 Algorithm 3 Part 1/2 of the D-Connector motif detection algorithm which finds
 potential motifs and filters out invalid ones. [D-min, D-max] is the range of dimen-
 sions of the connector motifs to find (the number of anchors). Time complexity:
 O(jG.nodesj  average neighbor count). See also Algorithm 4.
 1: procedure DetectConnectors(D-min, D-max)
 2: found  MaphString, Connectori
 3: detectLoop:
 4: for all n 2 G.nodes do
 5: if jn.neighborsj 2 [D-min, D-max] then
 6: for all nbr 2 n.neighbors do
 7: if jnbr.neighborsj < 2 then
 8: continue detectLoop
 9: AddSpan(n.neighbors.sorted, n, found)
 10: out  f;g
 11: used  MaphNode, Connectori
 12: filterLoop:
 13: for all c 2 found.values do
 14: if jc.spannersj  2 then
 15: for all s 2 c.spanners do
 16: if s 2 used.keys then
 17: c0  used[s]
 18: cTotal  jc.spannersj + jc.anchorsj
 19: c0Total  jc0.spannersj + jc0.anchorsj
 20: if jc.spannersj > jc0.spannersj or
 (jc.spannersj = jc0.spannersj and
 cTotal  c0total) then
 21: out.remove(c0)
 22: used.removeAll(c0.spanners)
 23: used.removeAll(c0.anchors)
 24: AddConnector(out, used, c)
 25: continue filterLoop
 26: AddConnector(out, used, c)
 27: for all c 2 out do
 28: RecordConnector(c.anchors, c.spanners)
 29: end procedure
4.2 Network Motif Simplification 90
 Algorithm 4 Part 2/2 of the D-Connector motif detection algorithm. This part
 contains procedures and a class needed for Algorithm 3.
 30: procedure AddSpan(anchors, spanner, found)
 31: key  string(anchors)
 32: if key =2 found then
 33: found[key]  new Connector(anchors)
 34: found[key].spanners.add(spanner)
 35: end procedure
 36: class Connector
 37: anchors  f;g, spanners  f;g
 38: procedure Connector(new-anchors)
 39: anchors  new-anchors
 40: end procedure
 41: end class
 42: procedure AddConnector(out, used, c)
 43: out.add(c)
 44: for all spanner 2 c.spanners do
 45: used[spanner]  c
 46: for all anchor 2 c.anchors do
 47: used[anchor]  c
 48: end procedure
 49: procedure RecordConnector(anchors, spanners)
 50:    . Record a given connector motif
 51: end procedure
 Connector motifs are not as straightforward to detect as fan motifs, despite the
 algorithms having the same run time complexity. First, a pass is made through
 all nodes searching for span nodes with sets of neighbors that could be anchors
 and creating or adding to a map of keys to possible motifs. An additional pass
 is required to traverse the potential motifs and remove those with only one span
 node, as well as remove all but the most desirable of any overlapping motifs. I
4.2 Network Motif Simplification 91
 choose motifs to keep first by the number of spanners, then by the total number
 of anchors and spanners, then arbitrarily.
 The algorithm is broken into several procedures and a class to store the details
 for each potential connector motif. The detect loop in the algorithm (Algorithm 3,
 Line 3) passes through all nodes in the network, searching for potential span nodes.
 Each span node must have between D-min and D-max neighbors, which must be
 anchor nodes. I require a minimum of two span nodes for the connector motif, so
 each anchor node must have two or more neighbors itself (Line 7). At least two of
 the neighbors are span nodes, but the remainder can be connections to the main
 network or other anchor nodes in the motif. If all the anchor nodes check out, the
 span node is added to a connector motif (Algorithm 3, Line 9) using the AddSpan
 procedure (Algorithm 4, Line 30). This motif can be new or an existing one with
 the same set of anchors. All existing motifs are stored in a map (Algorithm 3, Line
 2), using a string representation of the anchors as a key and an instance of the
 Connector class (Algorithm 4, Line 36) as the associated value. This allows
 speedy lookup of each potential motif given a sorted anchor set. Note that the
 anchor set and its string representation must be sorted so as to avoid having motifs
 with identical anchor sets but the anchors were found in a different order.
 After searching for all potential span nodes, Algorithm 3 requires an additional
 pass over the detected connector motifs to ensure that (1) they have two or more
4.2 Network Motif Simplification 92
 span nodes and (2) they do not overlap with other connector motifs. The filter
 loop on Line 12 goes through each potential Connector instance in the map to
 verify that they pass these two criteria. The first criteria, the minimum number
 of span nodes, could be increased if only larger higher payoff motifs are of interest
 (Line 7, 14). An example I have found that matches the second criteria, connector
 motif overlap, is a ring of four nodes A B  C  D A isolated from the rest of
 the network. In this case it is unclear whether to choose A & C or B & D as the
 2-connector motif anchors, as I do not allow overlap.
 As there may be other examples of overlap that need to be caught, I chose
 a general overlap detection approach that compares each span node s in a motif
 to all span and anchor nodes in already detected motifs (Algorithm 3, Lines 15
 ? 26). If there is no overlap with existing motifs, the potential Connector
 c is stored (Line 26) using the AddConnector procedure (Algorithm 4, Line
 42). However, if one of the span nodes s of a potential Connector c is also a
 span or anchor node of an already found motif c0, I then compare their sizes. I
 choose to keep the motif that has the greatest number of spanners, and if they
 are equal I choose the one with more total anchors and spanners. If both values
 are equal I keep the first detected. If the prior motif c0 is to be replaced, I must
 first remove its spanners and anchors from the map (Algorithm 3, Lines 21?23).
 After passing the minimum span count and overlap ranking checks, the detected
4.2 Network Motif Simplification 93
 connector motif c is then stored (Algorithm 3, Line 24) using the AddConnector
 procedure (Algorithm 4, Line 42). As part of this, the spanners and anchors are
 all added to the map of used nodes and associated with their Connector. All
 this bookkeeping process prevents a potential connector motif from overlapping
 with more than one that was already found. Finally, I record the remaining non-
 overlapping and valid connector motifs (Algorithm 3, Line 27).
 4.2.2.3 Clique Motifs
 To find all cliques in the graph I use the Tomita et al. algorithm [TTT06], which has
 a run time complexity of O(3jG:nodesj=3). However, this algorithm has high memory
 requirements and for especially large graphs a new linear-storage algorithm by
 Eppstein and Strash may be faster or required [ES11]. Unfortunately cliques in
 general can have high amounts of overlap. I use a greedy heuristic that chooses
 the largest non-overlapping clique motifs to keep that has a time complexity of
 O(number of motifs  average motif size). This works well on the networks I have
 analyzed, but may be insufficient for studying dense networks.
 4.2.2.4 Resolving Motif Overlap
 When computing motifs, not only can motifs of a type overlap (e.g., cliques), but in
 general the various types can overlap with each other as well. While my design for
 fan and connector motifs prevents ambiguous overlap and allows easy combinations
4.2 Network Motif Simplification 94
 (Fig. 4.7), the choice of which cliques to simplify can impact user perception of
 the network. To effectively pick a disjoint set of motifs to keep I would have to
 rate each motif by desirability and solve the set packing problem, one of Karp?s 21
 NP-complete problems [Kar72]. Not only is this problem computationally hard to
 solve exactly, it is also difficult to approximate, hence my use of heuristics.
 4.2.3 NodeXL Implementation
 Figure 4.8: The standard NodeXL workspace, showing U.S. Senate voting patterns
 from 2007. The left view shows the worksheets that store the network and its
 attributes, while the right pane shows a node-link visualization of the network.
4.2 Network Motif Simplification 95
 I have implemented a reference implementation of my motif simplification ap-
 proach and made it publicly available as part of the NodeXL network analysis
 tool [Smi+10; Smi+09]. Given that many NodeXL users generally have little prior
 knowledge about network visualization readability, I believe that they will partic-
 ularly benefit from my interactive motif simplification techniques.
 I have integrated my motif simplifications into the standard NodeXL groups
 infrastructure, which stores groups using two worksheets: (1) Groups which con-
 tains a row for each group and its attributes, and (2) Group Vertices where each
 row maps an individual grouped node to its associated group. These worksheets
 can be populated automatically in a variety of manners, including detection of
 topological clusters, exact-value attribute groupings, connected components, and
 now my three network motifs. The NodeXL group model allows for nodes that are
 in no group at all, which is important for motif simplification as not every node in
 the network is part of a motif. Note however that this group model does not allow
 overlapping groups, which means that special care must be given to the definition
 of what members of each motif constitute the group in the worksheets.
 In the group worksheets users can interactively edit the labels, attributes, visual
 encoding, and membership of specific groups; remove groups completely; or even
 create custom sets of groups by editing the worksheets or visual interaction with
 the node-link visualization. Moreover, automated statistics can be computed for
4.2 Network Motif Simplification 96
 each group and added to the Groups worksheet, including node & edge counts,
 geodesic distances, and graph density; as well as the number of edges between
 pairs of groups in a special Group Edges worksheet.
 After the groups have been computed or entered into the worksheets manually,
 users can display them in the visualization pane. When users select a group in the
 worksheet, all its member nodes are selected in the visualization. Likewise, for any
 nodes selected in the visualization users can select any groups in the worksheet
 that contain them using the ribbon menu. By default, groups are shown in their
 original expanded form based on the current layout algorithm, with categorical
 color and shape coding so as to distinguish them from each other. However, users
 can switch between the original expanded form and an alternate collapsed form
 for specific selected groups or all groups. This is done using the context menu in
 the visualization pane or the ribbon groups menu.
 The default collapsed form for groups is a meta-node representation of the same
 categorically coded shape with a plus sign inside to indicate its status (e.g.,
 L
 ),
 sized proportional to the number of nodes the group contains and with any asso-
 ciated label next to it. However, the groups for my motifs use their representative
 glyphs that were described in Section 4.2.1. When a collapsed group is selected in
 the visualization pane it is also selected in the Groups worksheet, and its position
 in the visualization can be adjusted with the mouse. These collapsed representa-
4.3 Case Studies 97
 tions are by default colored using the same categorical coloring as for the expanded
 version so the association between views can be easily identified. Through an op-
 tion in the groups menu, users can switch from the default categorical colors and
 shapes to the underlying node attribute encodings the user specified. This updates
 all collapsed motifs so that they show the aggregate attribute information about
 the underlying nodes they represent.
 4.3 Case Studies
 I explored several networks of interest using motif simplification, in several cases
 while helping domain experts analyze their data. Overall, motif simplification
 resulted in vastly reduced network size, reducing the visual complexity faced by
 the user and easing automatic and manual layout tasks.
 4.3.1 U.S. Senate Voting Patterns in 2007
 The power of clique motif simplification is shown in an example network of U.S.
 Senate voting patterns from 2007, originally discussed in Section 3.3.1. Fig. 4.9a,
 like Fig. 4.8, highlights the bridge-building nature of three Republican senators in
 the middle of the visualization. However, further insights are not readily visible
 in the tangled hairball of each party except, perhaps, that the two independent
 senators vote with the Democrats.
4.3 Case Studies 98
 (a) 65% (b) 65% simplified
 (c) 70% (d) 70% simplified
 Figure 4.9: U.S. Senate 2007 co-voting network at 65% and 70% agreement cutoffs,
 simplified using clique motif glyphs. Key features are visible, such as the moderate
 Republican clique around McCain with ?wildcards? at the periphery.
4.3 Case Studies 99
 (a) 80% (b) 80% simplified
 (c) 85% (d) 85% MS
 Figure 4.10: U.S. Senate 2007 co-voting network at 80% and 85% agreement cut-
 offs, simplified using clique motif glyphs. The east-coast liberals and the Blue Dog
 Democrats separate at 80%. We see the network decompose at higher cutoffs.
4.3 Case Studies 100
 (a) 90% (b) 90% MS
 (c) 95% (d) 95% simplified
 Figure 4.11: U.S. Senate 2007 co-voting network at 90% and 95% agreement cut-
 offs, simplified using clique motif glyphs. We see the Republican party fragment,
 with only the two senators from Georgia remaining at 95% agreement.
4.3 Case Studies 101
 After simplifying cliques, several additional features are visible (Fig. 4.9b).
 There are three completely connected groups: one with 48 Democrats, the two
 independents, and a Republican (Snowe); another with 42 Republicans; and a 4-
 clique of Collins, Smith, McCain, and Specter. I worked with a political scientist
 studying at the University of Wyoming to see if these cliques highlighted known
 behavior, and, in fact, they did. The 4-clique represents moderate Republican
 bridge builders that were often decisive votes, though they have stronger ties to
 the Republican clique. The only Senator not in a clique is Coburn, a staunch
 Republican on contentious issues but who often votes his heart.
 I increased the cutoff to 0.70 and ran the layout again (Fig. 4.9c). However, the
 simplified version (Fig. 4.9d) has become quite intriguing. While the Democrats
 and Independents still form a 50-clique, a few members trickled out of Republi-
 can cliques. Snowe returns to the middle with high connectivity with her former
 Democrat clique. Collins and Specter also move to the center, replaced in the
 McCain clique by Coleman and Lugar ? more moderates. The corner outliers are
 known wildcards that do not follow the party.
 Extending this process to higher cutoffs, we begin to see party fragmentation,
 led by the Republicans (Fig. 4.10). At 0.80 the network bisects (Fig. 4.10a), and
 the Democrats split into three cliques and a solitary Nelson, a Blue Dog moderate
 (Fig. 4.10b). The top right 4-clique is the east-coast liberals, while the left 4-
4.3 Case Studies 102
 clique are moderates. The Republicans splinter further, and by 0.95 only the two
 Senators from Georgia remain (Fig. 4.11d).
 All told, the political scientist was impressed that motif simplification could
 highlight many of the features he was already aware of. That the simplified network
 highlights these known features helps validate the design of the clique motif glyphs,
 as well as the greedy heuristic for choosing which non-overlapping set of cliques to
 simplify. Moreover, several new insights came from analyzing these visualizations
 and then checking other sources like Wikipedia and Politico to provide additional
 evidence for the pattern.
 4.3.2 Lostpedia Wiki Edits
 An example of overlapping motif simplification is shown in Fig. 4.12, which rep-
 resent the bipartite network for the Lostpedia wiki community collected by Beth
 Foss. Boxes with labels show wiki pages, linked to the colored discs representing
 their associated editors. The editors are colored and sized according to two mea-
 sures of their activity in the wiki. Fig. 4.12 shows the initial network, while the
 Fig. 4.13 shows a simplified version. By combining fan and connector glyphs, I only
 have 13 nodes to lay out and compare instead of the original 513, only 23 edges
 instead of 586, and use a fraction of the screen space. While these simplifications
 are not entirely necessary to understand such a small and well-arranged diagram,
4.3 Case Studies 103
 Figure 4.12: A bipartite network of Lostpedia wiki edits showing wiki pages as
 boxes and their associated editors as discs.
4.3 Case Studies 104
 Figure 4.13: The Lostpedia wiki edits after being simplified using fan and connector
 motif glyphs.
4.3 Case Studies 105
 they are effective at showing aggregate relationships like the large number of highly
 active main page editors.
 4.3.3 Ravelry Forums
 Another straightforward example I investigated is shown in Fig. 4.14, which I
 adapted from Fig. 9.10 of the NodeXL book [HSS11, p. 139]. Fig. 4.14a represents
 the bipartite network for the Ravelry communities collected by a student in Derek
 Hansen?s Communities of Practice class. Three forum nodes shown as small blue
 discs are connected by the contributers posting in them, with some contributers
 posting in only one forum and others posting in two. After simplifying the fan and
 connector motifs present in the network, I created the representation displayed in
 Fig. 4.14b. Note that the connector glyph used here is the older diamond shape.
 While these simplifications are not necessary to understand such a small and well-
 arranged drawing, they are easy to understand.
 4.3.4 VOSON Web Crawl
 A larger dataset I encountered is shown in Fig. 4.15, which I modified from the
 NodeXL book, Fig. 12.9 [HSS11, p. 192]. This network of 3958 web pages and
 4380 hyperlinks was collected by crawling sites connected to voson.anu.edu.au.
 It is immediately evident that large fans of nodes dominate the periphery, in in
 part because the NodeXL [Smi+10] implementation of the Fruchterman-Reingold
4.3 Case Studies 106
 (a)
 (b)
 Figure 4.14: This network of relationships between Ravelry forums and their users
 was created by a student in Derek Hansen?s Communities of Practice class. In (a),
 three forums represented in blue are connected to contributers, and the contributers
 are sized and colored by the number of completed projects. Edge width is based
 on the number of posts by each user. This version was adapted from Fig. 9.10 of
 the NodeXL book [HSS11, p. 139]. (b) shows a simplified version of this network,
 where the fan and connector motifs have been replaced by representative glyphs.
 The glyphs are sized by the number of nodes they replace and colored according
 to the average node attribute value. Likewise, aggregate edges between glyphs are
 sized and colored by the average of the edge weights of the edges they replace.
4.3 Case Studies 107
 layout [FR91] tends to draw elliptical layouts within a rectangular space. However,
 the fans tend to dominate the visualization regardless of the layout. For example,
 Fig. 4.16 shows the same graph using the Harel-Koren FMS layout [HK02a].
 My manual calculations using Gimp showed that 21% of the screen space in
 Fig. 4.15 is wasted as blank space in the corners, with 33% showing the core
 network with its connector motifs, and the remaining 46% used to show the fan
 motifs. Calculating only for the elliptical visualization region, approximately 58%
 of the space available is used to show the fan motifs. This is a substantial amount of
 area dedicated to showing a very common structure in network datasets obtained
 by crawling web sites or using surveys. Moreover, these fans do not show any
 information besides the rough number of nodes they contain. The fans in Fig. 4.15
 vary from 17 to 852 nodes, but due to overlap this can be hard to see.
 Some of the overlap between motifs and and with other nodes is not visible
 in the original image, but there is substantial overlap in the bottom-right and
 many of the smaller fans are spread in several directions or hidden in the interior.
 Some of this is visible in Fig. 4.17, where I have colored and shaped each of the
 network motifs distinctly. You can see in the bottom-right that the large light
 green and dark green fans overlap substantially, while many of the smaller fans
 are spread in several directions or hidden. Moreover, many of the fans overlap and
 obscure other more important nodes that are not participating in any fan, such
4.3 Case Studies 108
 Figure 4.15: This drawing represents the network of web pages connected to vo-
 son.anu.edu.au obtained by a web crawl. I modified it from Fig. 12.9 of the
 NodeXL book [HSS11, p. 192]. A similar graph for wiki structure is shown on p.
 259. The layout is done using Fruchterman-Reingold [FR91] in NodeXL, and head
 nodes for the fans of singly-connected nodes are shown in blue.
4.3 Case Studies 109
 Figure 4.16: A web crawl starting at voson.anu.edu.au, modified from Fig. 12.9 of
 the NodeXL book [HSS11, p. 192], and laid out using the Harel-Koren FMS layout
 [HK02a].
4.3 Case Studies 110
 Figure 4.17: Web crawl network with each fan and connector motif shown in a
 distinct color and shape.
4.3 Case Studies 111
 Figure 4.18: Web crawl network with nodes colored by their eigenvector centrality.
4.3 Case Studies 112
 Figure 4.19: Web crawl network with fan and connector motifs simplified and
 colored by underlying eigenvector centrality.
 as a huge 2-connector motif with 50 purple span nodes in the bottom-right. This
 2-connector motif, as well as the several others connecting parts of the web page
 network together, are quite hard to detect among the clutter.
 I then simplified these fan and connector motifs, going from 3958 nodes to 559
 and 4380 edges to 765, creating a much less cluttered visualization (Fig. 4.19).
 After simplification, it became evident that the large connector motif is the linked
 the web sites for the Summer Doctoral Programme at the Oxford Internet Institute
4.3 Case Studies 113
 and the National Center for eSocial Science. Applying a layout algorithm to the
 simplified network would result in a new layout that makes more effective use of the
 newfound space. This visualization is much clearer at presenting (1) the size and
 membership of the various fans motifs and (2) the large connector motifs connecting
 pairs of fan heads. Moreover, it appears to have minimal loss of information and
 visual clutter compared to the original.
 4.3.5 Patient Discharge Summaries
 Another complex network to which I have applied motif simplification maps the
 connections between medical patients and concepts related to their care. These
 concepts have been extracted from the patient discharge summaries, and include
 any associated symptoms, diseases, drugs, and procedures. They were provided
 by Todd Johnson, director of Biomedical Informatics at the University of Ken-
 tucky. The goal in analyzing this dataset was to see if motif simplification would
 help medical researchers understand overall patient trends, such as comparing the
 efficacy of competing treatments for the same condition.
 Dr. Johnson suggested that I investigate two medication concepts in the
 anonymized network, ?hops5325? and ?orch7323?, where ?hops? stands for Haz-
 ardous or Poisonous Substance and ?orch? indicates Organic Chemical. I extracted
 from the overall network only those patients connected to ?hops5325? and/or
4.3 Case Studies 114
 Figure 4.20: Patients related to concepts from their medical discharge reports.
 This subnetwork focuses on the concepts ?hops5325? and ?orch7323? (orange discs)
 and their associated patients (purple triangles) and concepts (blue discs). The
 network is laid out using the Harel-Koren FMS layout algorithm [HK02a].
4.3 Case Studies 115
 Figure 4.21: Patients and concepts from Fig. 4.20 after applying fan and connector
 motif simplification.
4.3 Case Studies 116
 ?orch7323?, as well as any additional concepts associated with those patients (a
 2-degree subnetwork). This resulted in 433 patients connected to 4701 concepts,
 including ?hops5325? and ?orch7323?. Fig. 4.20 shows a node-link visualization of
 this subnetwork using the Harel-Koren FMS layout [HK02a]. The two ego con-
 cepts ?hops5325? and ?orch7323? are shown large and in orange, other concepts are
 blue, and the patients are purple triangles. This initial view does not show much
 structure, aside from ?orch7323? being more central to the network and connected
 to more of the patients. Applying motif simplification, specifically the fan and con-
 nector motifs, reduces the complexity somewhat but not spectacularly (Fig. 4.21).
 The exact reduction is from 5134 nodes to 2695 nodes and 439 motif glyphs, and
 from 31,518 edges to 28,375 edges and meta-edges.
 Now that we have the motifs, I can use them to highlight or drill down into
 interesting patterns. Fig. 4.22 shows the largest fan motifs highlighted in red, where
 each fan has at least 20 concepts and up to 42 for the largest. These concepts are
 unique to a single patient, and the patients and their connections to the fans are
 highlighted in red as well. A medical researcher may be interested in exploring these
 singleton concept groups and drilling down to them or, alternatively, filtering them
 out to see the more common patterns. In this case I drill down to show only those
 patients and their connected concepts, displayed in Fig. 4.23 without simplification.
 ?hops5325? is peripheral to this network, only connected to two patients on the
4.3 Case Studies 117
 Figure 4.22: Simplified patient and concept network from Fig. 4.21 with fans of
 20 or more concepts highlighted. This shows groups of concepts that are uniquely
 associated with a single patient. Edges from these fans to their associated patient,
 as well as the patient themselves, are highlighted too.
4.3 Case Studies 118
 Figure 4.23: Patient and concept network of only the patients connected to the
 large highlighted fans from Fig. 4.22, as well as any associated concepts. The initial
 ?hops5325? concept is on the far right, connected to only two patients.
4.3 Case Studies 119
 Figure 4.24: Patient and concept network from Fig. 4.23 after applying motif
 simplification. The connector motif which contains the initial ?hops5325? concept
 and three other concepts is highlighted in orange. These four concepts are only
 connected to two patients.
4.3 Case Studies 120
 right. In the simplified view (Fig. 4.24), ?hops5325? is in a connector motif with
 three other concepts that are only connected to those two patients: ?orch7268?,
 ?hlca5025?, and ?hlca5238?. Interestingly, only one of these patients is connected
 to ?orch7323?. Another pattern of note is the large connector motif on the left,
 which consists of 36 concepts associated with two other patients who are connected
 to ?orch7323?. These concepts are ?aapp155?; ?dsyn 2382, 2732, 2842, 3006, 3092,
 3171, 3464, 3576, 3577, 3817, 3837, 3927, 4009, 4261, 4528, and 4827?; ?lbpr 5981,
 5990, and 6419?; ?mobd 6668, 6673, 6688, 6690, and 6715?; ?orch 7921, 8368, and
 8369?; ?patf 8787, 8818, and 8983?; ?phsu9097?; and ?topp 10357, 10429, and 10856?.
 An alternate kind of exploration is visible in Fig. 4.25, where I have highlighted
 connectors of concepts connected to at least 20 patients. These small connectors
 consist of two or more concepts that occur with many patients in the exact same
 way, but the connectors each have different sets of the 433 patients as anchors.
 The true power of motif simplification becomes evident when I drill down to
 only show the patients connected to four specific concepts. I chose our original
 ?hops3525? and ?orch7323?, as well as two other Hazardous or Poisonous Sub-
 stances: ?hops5323? and ?hops5324?. The node-link visualization of these rela-
 tionships is partially understandable (Fig. 4.26), but after applying motif simpli-
 fication the aggregate patient relationships between the concepts are much more
 clear (Fig. 4.26). Note that here the motifs consist of patients, not concepts.
4.3 Case Studies 121
 Figure 4.25: Patients and concepts from the original simplified view in Fig. 4.21.
 Connector motifs of concepts connected to at least 20 patients are highlighted.
4.3 Case Studies 122
 Figure 4.26: Patients and concepts from Fig. 4.20, after drilling down to only those
 patients connected to our original ?hops3525? and ?orch7323?, as well as two other
 Hazardous or Poisonous Substances: ?hops5323? and ?hops5324?.
4.3 Case Studies 123
 Figure 4.27: A simplified view of the patients and concepts in Fig. 4.26, which
 highlights the aggregate patient relationships between the concepts.
 It is immediately visible that two patients are connected to all four concepts
 and one patient is shared between only the ?hops? concepts. Another 7 patients
 connect ?orch7323? and ?hops5325? while 67 connect ?hops5323?, ?hops5324?, and
 ?orch7323?. Of course, 339 patients only have ?orch7323? as a concept while only
 17 are only connected to ?hops5325?.
 Overall, I believe that motif simplification can definitely help medical researchers
 understand the relationships between patients and a small number of concepts, as
 in Fig. 4.27. For larger datasets with thousands of concepts, the motifs seem to
 highlight particularly unusual connections like large groups of concepts associated
 with one patient or a few patients. To understand these relationships in detail, the
 motifs can be used to drill down to the relevant parts of the network. For additional
 analyses of this network using Group-in-a-Box layouts, see Section 5.5.3.
4.4 Initial Usability Study 124
 4.3.6 Larger Networks
 I analyzed several other large networks not pictured here. One was a network of
 innovation and funding ties with 7124 nodes and 16,109 edges. Another showed
 acquisitions of JP Morgan Chase, with 5766 nodes and 6752 edges. Both were
 visualized interactively with no performance issues, and had drastic reductions in
 complexity with motif simplification.
 4.4 Initial Usability Study
 I invited four individuals from our lab to use the motif simplification techniques
 inside NodeXL in order to understand any usability issues and general ease of use.
 I asked them to analyze three networks: Lostpedia wiki edits (Section 4.3.2), the
 VOSON web crawl (Section 4.3.4), and a network of innovation in Pennsylvania
 used as a Group-in-a-Box layout case study (Section 5.5.2). These participants
 had varying backgrounds, including Computer Science, Information Studies, and
 Economics. They also had varying education, including a recent undergraduate
 student, two graduate students, and a professor. All had little or no experience
 with NodeXL and none with motif simplification.
 After an initial hands-on training session I invited participants to explore the
 networks and recorded anything they had difficulty with or mentioned. Their
 explorations ranged from 45?60 minutes. Overall they were excited by the motif
4.4 Initial Usability Study 125
 simplifications, and were especially eager to change to the simplified version in
 the VOSON example. One of them stated about the original VOSON view, ?I?m
 overwhelmed, ... this is like one of those vision tests at the eye doctor?, but when
 asked to switch to the simplified view emphatically stated, ?Yes please!?. Asked
 afterward about her overall impression of motif simplification, one participant said,
 ?I like it because it makes more sense. For specific nodes it is easier to look at
 the spreadsheet side?. No participant detected the bottom-right connector motif
 hidden in the VOSON fan motifs, but did immediately in the simplified view.
 There were several issues the participants encountered. First, they wanted to
 simplify all repeating patterns they saw, not just my defined motifs. One even did
 the simplification manually using standard meta-nodes. Next, they were unsure
 about the design of the crescent connector motif used at the time. They did not
 understand why edges connected to the arch in several places instead of only the
 corners, and had difficulty comparing connector glyph size exactly. A few even
 confused the connector glyphs with overlapping or odd fan glyphs. I revised my
 glyph design based on this feedback to more effectively allow these analyses, as
 discussed in Section 4.2.1.1.
 In spite of these challenges, participants strongly appreciated the benefits of
 simplifying complex networks and expressed enthusiasm for integration of the
 glyphs in node-link visualizations. By replacing the common repeating motifs
4.5 Controlled Experiment 126
 with representative glyphs, many nuances of the network are revealed. When one
 participant was looking for relationships, she stated, ?I could only look at two at a
 time?. This seems to indicate that the simplified view will help users understand
 larger relationships in the network, as glyphs allow comparisons of larger subsets
 of the network and reduce the number of analyses.
 4.5 Controlled Experiment
 The usability testing guided any necessary interface revisions. Then, I ran a con-
 trolled experiment to determine the effect motif simplification has on user perfor-
 mance across several common network visualization tasks.
 4.5.1 Tasks
 I chose a varied set of tasks relating to topology, attributes, and overviews from a
 taxonomy [Lee+06], which demonstrates how all complex tasks can be seen as a
 series of low-level tasks. These tasks are also used in many recent papers evaluating
 network visualizations [HF07; SA06; GFC04]. I asked:
 1. About how many nodes are in the network?
 2. Which individual node would we remove to disconnect the most nodes from
 the main network?
4.5 Controlled Experiment 127
 3. Which is the largest ( fan | connector | clique ) motif and how many nodes
 does it contain?
 4. Which node has the label ?XXX?? (where XXX was a name or number)
 5. What is the length of the shortest path between the two highlighted nodes?
 6. Which of the two highlighted nodes has more neighbors?
 7. How many common neighbors are shared by the two highlighted nodes?
 8. Which of two pairs of nodes has more common neighbors?
 4.5.2 Data
 Current random network generators do not produce realistic data [HF07], which
 I confirmed trying to generate several networks with similar characteristics. Thus
 I chose to use three interesting networks produced by actual users solving their
 own problems. Lostpedia wiki edits (Section 4.3.2), U.S Senate voting patterns
 (Section 4.3.1), and the VOSON web crawl (Section 4.3.4).
 4.5.3 Participants
 I began with a pilot study with two participants from my lab, in which the tutorial
 and format of the questions were refined. I then recruited 36 students from my
 university (19 males, 17 females) using mailing lists and in-class announcements.
 The participants were mostly graduate students, half from Computer Science and
4.5 Controlled Experiment 128
 the balance from eight other departments. 9 had used network visualization tools
 and an overlapping 9 had seen motif simplification, though none had used it. As
 I could not generate sufficiently varied datasets with similar properties, I used a
 between subjects design. I randomly divided participants into two groups which
 had similar distributions of gender, department, grade level, and experience.
 4.5.4 Procedure
 Each 45-minute session began with 5-10 minutes of training on the tool and for the
 specific tasks, followed by about 35 minutes for answering a total of 31 questions
 across the three networks and eight tasks. Each participant received the same
 order of questions and visualizations. The control group was provided with an
 interactive node-link visualization in which they could select nodes along with
 their incident edges, as well as move the nodes. The treatment group received
 a simplified version of each new visualization, with additional interactive tooltips
 and the ability to expand and collapse the motifs. Each visualization is presented
 consistently, originally computed using the Harel-Koren FMS layout [HK02a].
 As in [GFC04; HF07], users were given one minute to answer each question,
 told to answer as quickly and accurately as possible, and that they could skip if
 they could not answer a question. The evaluator spoke each question, gave the
 participant time to ask for clarification, then revealed the next visualization in turn
4.5 Controlled Experiment 129
 and began the timer. Participants were told how well they performed at the end
 of the study. Users were given $10 plus a $15 bonus for the fastest, most accurate
 participant in each group.
 4.5.5 Analysis
 The recorded data was analyzed in several ways. As is common with response
 time data, the response times were not normally distributed so were normalized
 using a log transformation. The two groups were then compared using a t-test.
 Answers to questions consisted of a categorical answer (a specific item), which was
 recorded correct or not, and/or an integer answer. For questions with categorical
 answers, the groups were compared with Fisher?s exact test instead of the chi-
 square test as none of the statistically significant group-by-correct matrices had
 expected values of five or higher in all four cells. For numeric answers I computed
 error = (answer  truth)=truth, skipping any questions that had an incorrect
 categorical answer the integer answer depended on, and compared error across
 groups using a t-test.
 4.5.6 Results
 Here I report only the statistically significant findings, though all the analyses are
 shown in Figs. 4.28 to 4.35. I expect overview tasks like identifying the maximal
 motif of a type would be easier with the less visual complexity of a simplified
4.5 Controlled Experiment 130
 Figure 4.28: Bar charts showing performance for Task 1: ?About how many nodes
 are in the network?? The left chart shows the time spent answering the question
 while the right chart shows the error in the node count estimate. In this chart, and
 in the following ones, error bars indicate one standard deviation and asterisks show
 the level of significance of the statistical test (?*?, ?**?, and ?**? denote p<0.10, 0.05,
 and 0.01 respectively). Negative numbers, if present, show the number of users
 that skipped the question or ran out of time.
 Figure 4.29: Bar charts showing performance for Task 2: ?Which individual node
 would we remove to disconnect the most nodes from the main network?? The left
 chart shows the time spent while the right chart shows the accuracy at selecting
 the correct node.
4.5 Controlled Experiment 131
 (a) Time spent finding the largest motif.
 (b) Accuracy at selecting the largest motif.
 (c) Error in estimating the size of the largest motif.
 Figure 4.30: Bar charts showing performance for Task 3: ?Which is the largest
 ( fan | connector | clique ) motif and how many nodes does it contain?? The left
 charts show the results for fans, the middle for connectors, and the right for cliques.
4.5 Controlled Experiment 132
 (a) Time spent finding a label.
 (b) Accuracy at finding the label.
 Figure 4.31: Bar charts showing performance for Task 4: ?Which node has the
 label ?XXX?? (where XXX was a name or number)? The left charts are for plainly
 visible nodes, while the right show labels hidden inside a simplified glyph.
4.5 Controlled Experiment 133
 Figure 4.32: Bar charts showing performance for Task 5: ?What is the length of
 the shortest path between the two highlighted nodes?? The left chart shows the
 time spent while the right chart shows the error at estimating path length.
 Figure 4.33: Bar charts showing performance for Task 6: ?Which of the two high-
 lighted nodes has more neighbors?? The left chart shows the time spent while the
 right chart shows the accuracy at selecting the correct node.
4.5 Controlled Experiment 134
 Figure 4.34: Bar charts showing performance for Task 7: ? How many common
 neighbors are shared by the two highlighted nodes?? The left chart shows the time
 spent while the right chart shows the error in the shared neighbor count estimate.
 Figure 4.35: Bar charts showing performance for Task 8: ?Which of two pairs of
 nodes has more common neighbors?? The left chart shows the time spent while
 the right chart shows the accuracy at selecting the correct pair of nodes.
4.5 Controlled Experiment 135
 network. This was true for all three motifs across all three networks (Fig. 4.30).
 Cliques, the epitomical clusters, were found in the two networks they occurred in
 faster (p<0.01, -20.82s), more accurately (p<0.01, 92% vs. 23.5%), and with fewer
 people giving up (3 vs. 0). Moreover, in the Senate network there was higher
 accuracy in size estimates (p<0.05, 0% vs. -28% error), which could be true for
 the web network but I could not measure it as not one control participant detected
 the maximal 5-clique. Fans were found in both the networks they occurred in
 faster (p<0.01, mean -7.77s) and their size was approximated more closely (p<0.01,
 2% vs. -62% error). In the large web network the maximal fan was also found
 more frequently (p<0.01, 95% vs. 35%). Connectors were detected in both their
 networks faster as well (p<0.01, mean -17.13s). In the web network the largest
 connector was found more frequently (p<0.01, 79% vs. 6%), and in the wiki
 network its size was estimated more precisely (p<0.1, -5% vs. -17% error).
 These results show that using glyphs for motifs makes the motifs easier to detect
 and measure, but how does simplifying motifs affect the rest of the network? I
 hypothesized that estimating the number of nodes would be easier in the simplified,
 interactive view. As Fig. 4.28 shows, my participants could indeed gauge the size of
 all three networks with significantly more accuracy (p<0.01, -8% vs. -47% error),
 but for the wiki and web networks users took longer to do so (p<0.01, 21.82s).
 How about finding a specific node by its label? Logically reducing the number of
4.5 Controlled Experiment 136
 visual items makes finding a label easier. My results in Fig. 4.31 show that finding
 labels that are not in motifs is significantly faster (p<0.01, -19.93s), they are found
 more frequently except in the Senate case (p<0.01, 97.5% vs. 14.5%), and fewer
 users give up or run out of time (12 did on the plain wiki and web networks). I
 only saw worse search time for labels in motifs for the Senate clique case (p<0.05,
 15.29s), with no significant differences in accuracy.
 What about topology-based tasks? It seems that with fewer items on the
 screen tracing edges would be easier. For some questions it did turn out better,
 like finding the node to cut (Fig. 4.29) in the web network correctly (p<0.05, 53%
 vs. 18%) and the accuracy of the shortest path length (Fig. 4.32) between two
 clique members in the Senate network (p<0.05, -7% vs. 22% error). For others
 topology questions, the results were mixed to poor. Shortest path length time and
 accuracy (Fig. 4.32) worsened in the web network (p<0.1, 10.06s & 20% vs. 1%
 error). Comparing the number of neighbors (Fig. 4.33) was slower on the wiki
 (p<0.01, 10.89s) and senate (p<0.05, 9.26s) networks, and the choice accuracy
 dropped for the senate (p<0.1, 53% vs. 82%) and web (p<0.1, 68% vs 76%).
 Lastly, the shared neighbor count tasks (Fig. 4.34) were slower in the web network
 (p<0.01, 11.73s), and reduced accuracy in the wiki network (p<0.1, -21% vs. -
 10%). There were no significant differences in the task to find which of two pairs
 of nodes has more common neighbors (Fig. 4.35).
4.5 Controlled Experiment 137
 4.5.7 Discussion
 Overall it appears that motif simplification is beneficial for many analysis tasks.
 Naturally identifying maximal motifs is faster, more accurate, and I can estimate
 their sizes more accurately when I have glyphs and interaction. Counting nodes
 in the network turned out to be slower, but more accurate when using the glyphs.
 Finding unsimplified labels became much quicker, while simplified labels were only
 slower in one case. Finally, it seems like topology-based tasks are a mixed bag.
 Finding cut nodes is more accurate, but path-based tasks were better and worse in
 different circumstances. Comparing the number of neighbors and shared neighbors
 turned out slower and less accurate in a few cases, while counting them was more
 error-prone.
 I have already implemented additional features to increase user performance
 on topologic tasks. When I ran the study I did not yet use the sized meta-edges
 that are shown in Figs. 4.9 to 4.11. With this simple modification, I believe we
 can show much of the aggregate connectivity. However, user education is likely the
 most promising way to improve the glyph performance. Many participants had
 difficulty understanding the topology inside the collapsed glyphs.
 It is important to note that the participants generally had little to no experience
 with network analysis, nor did they necessarily have any interest in or knowledge of
 the networks they were analyzing. Despite these limitations, I found significantly
4.6 Summary 138
 better task performance with the simplified view in many cases. With more than
 the 5-10 minutes of training provided in this study, user performance would likely
 improve on many of the tasks.
 4.6 Summary
 Analyzing networks involves understanding the complex relationships between en-
 tities, as well as any attributes they may have. The widely used node-link visual-
 izations excel at this task, but many are difficult to extract meaning from because
 of the inherent complexity of the relationships and limited screen space. To help
 address this problem I introduce a technique called motif simplification, in which
 common patterns of nodes and links are replaced with compact and meaningful
 glyphs. Well-designed glyphs have several benefits: they (1) require less screen
 space and layout effort, (2) are easier to understand in the context of the network,
 (3) can reveal otherwise hidden relationships, and (4) preserve as much underlying
 information as possible. I tackle three frequently occurring and high-payoff mo-
 tifs: fans of nodes with a single neighbor, connectors that link a set of anchor
 nodes, and cliques of completely connected nodes. I contribute design guidelines
 for motif glyphs; example glyphs for the fan, connector, and clique motifs; and
 algorithms for detecting these motifs. I have also developed a free and open source
 reference implementation, made publicly available as part of NodeXL [Smi+10].
4.6 Summary 139
 With case studies and a controlled study I demonstrate the effectiveness of
 motif simplification as well as areas to focus on for improving glyph design. Motif
 simplification can result in substantial reductions in visual complexity, allowing
 easier understanding and manipulation of large network visualizations. There are
 several avenues for exploration opened up by this work, including additional glyphs
 for other common motif types, algorithms and glyphs for fuzzy motifs, and methods
 for showing edge directionality within glyphs. Now that motif simplification is
 available to all users of NodeXL, my hope is that it becomes commonly used as a
 first step when dealing with large, complex networks. It is particularly suited for
 simplifying data collected in an egocentric fashion, such as web spiders and crawls
 of social media websites.
Chapter 5
 Meta-Layouts for Subdividing Networks
 5.1 Introduction
 Visualizing a network?s topology in a node-link visualization can be useful for seeing
 its overall structure and tracking individual relationships or paths. However, with
 large, dense networks it can be challenging for a user to understand this structure
 due to the high number of edges and the resulting visual clutter. The large num-
 ber of edge crossings and tightly packed nodes in visualizations of these networks
 can be difficult for the human eye to comprehend, though automated techniques
 can aid understanding. Various automatic techniques can algorithmically group
 related nodes together based on (1) the topology of the network [CNM04; WT07;
 GN02], (2) any attributes the nodes have [Llo82], or (3) some combination of both
 [Nav+09]. Topologic clustering finds groups of nodes such that the connections
 within groups (referred to as the intra-group edges) are tighter than those between
 groups (called the inter-group edges). Another popular method is to group the
 nodes based on some common attribute such as geographical location or interests,
 140
5.1 Introduction 141
 Figure 5.1: Co-appearance network in Les Mis?rables, originally
 compiled by Knuth [Knu93] and made into an edge list by New-
 man and Girvan [NG04]. Available in the NodeXL format from
 nodexl.codeplex.com/wikipage?title=NodeXL%20Teaching%20Resources
 or a clustering of several attributes. As the nodes in a community tend to behave
 similarly or share characteristics, it can be useful to study individual communities.
 Regardless of the source of a grouping, a persistent problem is that of displaying
 the results of a grouping in the network visualization. Displaying the groups using
 node color or shape alone (like in Fig. 5.1) can be challenging, especially if the
 groups are intermingled in a complex network visualization (e.g., Fig. 5.32). As the
 network layout does not take group membership into account when placing nodes,
 it can cause groups to be occluded within the visualization and loss of information
5.1 Introduction 142
 about the structure of clusters and their relationships [Rod+11]. Meta-nodes can
 show aggregate relationships, but hide the internal structure of the groups.
 One approach showing these groups in the layout is to try to visually separate
 groups of nodes in the final visualization, such as in the Lin-Log layout [Noa04].
 However, it is hard to understand the relationships between groups in these layouts
 and these visualizations use much more screen space than regular force-directed
 layouts. Moreover, force-directed layouts in general and these types of group-aware
 layouts in particular require substantial parameter adjustment to work across a
 range of datasets [Bar+08]. It can be challenging to balance the various forces
 acting on nodes, especially as the networks increase in size. Furthermore, as noted
 by Barsky et al. [Bar+08] when working with immunologists, domain experts using
 network analysis tools can be completely unwilling to tweak layout parameters in
 order to obtain the best visualization.
 I present several new approaches for showing node groupings using meta-
 layouts, which take take an underlying grouping into account when placing nodes
 in the node-link visualization. The first, the Midichlorian-Directed Layout, is
 a modified force-directed layout that varies attractive forces between nodes based
 on group membership. Next, rather than using node-link visualizations and force-
 directed layouts of network topology alone, I describe several Group-in-a-Box
 meta-layouts that augment topology visualizations with the group memberships
5.1 Introduction 143
 Figure 5.2: Co-appearance network in Les Mis?rables from Fig. 5.1, after using the
 squarified Treemap Group-in-a-Box layout. Each box shows a cluster found using
 the Wakita-Tsurumi algorithm [WT07]. Inter-group edges are hidden to better
 show internal cluster topology. This visualization highlights the structure of each
 group, such as the Javert & Fantine cluster and the Thenardier cluster.
 of the underlying nodes. These Group-in-a-Box layouts draw a separate box for
 each group, sized according to the number of nodes in the group. The subnetwork
 the group represents is then laid out within the box, independent of the rest of the
 network. An example Group-in-a-Box layout for the Les Mis?rables co-appearance
 network from Fig. 5.1 is shown in Fig. 5.2.
 I detail three Group-in-a-Box (GIB) layouts, each with a unique way of laying
 out the group boxes. First, I describe the squarified Treemap GIB layout,
 created by my colleagues on the NodeXL team [Rod+11]. Next, I move to the two
5.1 Introduction 144
 Croissant-Donut GIB layouts: the Donut, which places the most connected
 group in the center of the visualization and wraps the other group boxes around
 it in a space-filling manner, and the Croissant, which places the most connected
 group in the top of the visualization and similarly wraps the other group boxes
 around it. The Croissant-Donut layouts were created in conjunction with three
 graduate students I mentored for a course project [Cha+13]. Finally, I discuss a
 Force-Directed GIB layout I created which arranges the group boxes based on
 the aggregate connections between groups. I algorithmically choose which Group-
 in-a-Box layout to use depending on the disconnected components present in the
 visualization, number of groups, and distribution of group sizes. I evaluate these
 Group-in-a-Box layouts through several case studies and an empirical study of
 309 of Twitter scrapes, which demonstrates the effectiveness and trade-offs of the
 various layouts. These Group-in-a-Box layouts have several benefits: (1) they
 optimize the layout of relationships within groups, (2) they highlight aggregate
 relationships between groups, and (3) it is easier to see group membership and
 size. These layouts are publicly available as part of NodeXL [Smi+10].
 5.1.1 Chapter Overview
 Specifically, the contributions of this chapter are:
  A meta-layout called the Midichlorian-Directed Layout which spreads groups
5.1 Introduction 145
 apart in a standard node-link visualization;
  A Croissant-Donut Group-in-a-Box layout that places subnetworks in boxes
 arranged using a Donut or Croissant pattern, and balances space-filling prop-
 erties with showing group relationships;
  A Force-Directed Group-in-a-Box layout that places subnetworks in boxes
 arranged by their connectivity, and shows group relationships well at the
 expense of additional screen space;
  A set of automatic choices that are made for the user to better show discon-
 nected components, few groups, or different distributions of group sizes and
 connectedness;
  Supporting case studies and an experiment on Twitter networks; and
  A free and open source implementation as part of NodeXL.
 Parts of this chapter have been published in an overview paper on novel network
 analysis techniques in NodeXL [SD12] or are under submission [Cha+13]. I first
 discuss various automatic techniques for grouping the nodes in the network that
 I will be able to leverage in my meta-layouts (Section 5.2). Next, I cover my
 preliminary work on Midichlorian-Directed Layouts in Section 5.3, then move on
 to the three Group-in-a-Box layouts in Section 5.4. I then describe evaluations of
 the Group-in-a-Box approach using case studies (Section 5.5) and an experimental
 study on 309 Twitter scrapes (Section 5.6). I end by summarizing in Section 5.7.
5.2 Grouping Techniques 146
 5.2 Grouping Techniques
 Before my meta-layouts can be applied, we first have to create meaningful group-
 ings of the nodes in the network. Various automatic techniques can algorithmically
 group related nodes together based on (1) the topology of the network, (2) any
 attributes the nodes have, or (3) some combination of both. The choice of which
 technique to use for grouping the nodes depends on the target analysis task.
 5.2.1 Clustering to Identify Structural Components
 Understanding the complexity of human anatomy is often facilitated by decompos-
 ing into subsystems such as circulatory, muscular, skeleton, neural, digestive, etc.
 These decompositions favor functional structures over physical adjacency. Since
 networks represent complex phenomena, clustering by connectivity into functional
 subsystems often proves to be beneficial. An example of this topologic cluster-
 ing is shown in Fig. 5.1, which displays the network of characters in Les Mis?rables.
 This co-appearance network shows the relatedness among characters. Edge thick-
 ness shows the number of scenes in which pairs of characters appear, while node
 size shows the number of scenes for each character. Nodes are colored based on
 their automatically detected topologic clusters. Clustering is often used as an ex-
 ploratory data analysis method to discover unexpected inclusions within a known
 cluster, unexpected separation into other clusters, or surprising clusters.
5.2 Grouping Techniques 147
 There are many topology-based clustering techniques, usually directed at find-
 ing groups of nodes that are more tightly connected with each other than with
 nodes outside the group. NodeXL implements the Clauset-Newman-Moore [CNM04],
 Wakita-Tsurumi [WT07], and Girvan-Newman [GN02] clustering algorithms, which
 all result in mutually exclusive cluster membership. The NodeXL implementations
 currently work only on undirected graphs, but additions to support directed and
 weighted graphs are planned. The effectiveness of such clusterings can be deter-
 mined using metrics such as modularity [NG04], which is roughly the number of
 edges within groups minus the expected number in an equivalent random network.
 However, verifying the quality of a clustering outcome is often hampered by the
 lack of a ground truth.
 5.2.2 Grouping to Find Attribute Relationships
 Instead of highlighting individual structural features like topologic clustering, at-
 tribute aggregation can display overall topology and attribute patterns. Nodes
 may represent people, places, documents, or roles, which are readily understand-
 able in small networks. However, with thousands or millions of nodes, analysts
 may gain insights by replacing nodes of a common type with a single group node,
 e.g. author nodes in a scientific citation network might be grouped by their cur-
 rent institution into a single node for each institution. This node could be sized by
5.2 Grouping Techniques 148
 the number of authors, thereby showing the productive institutions and revealing
 the degree of collaboration across institutions. Simplifying a million-node author
 network into a 3000 node institution network removes information, but reveals
 important patterns.
 Attribute-based node aggregation has been leveraged by several tools to un-
 derstand overall relationships at the expense of showing the underlying topology
 explicitly. PivotGraph [Wat06] groups nodes based on the intersection of a pair of
 attributes, and arranges the meta-node for each group on a grid with each attribute
 as an axis. Aggregate links between groups are shown with arcs. Similarly, my
 GraphTrail (Section 3.3.2) groups nodes by attribute into standard charts, where
 the groups can be further filtered, merged, or used to pivot to connected groups
 of other node types. One advantage of this aggregation is a dramatic reduction
 in screen space required, a fact leveraged by GraphTrail to show the exploration
 history directly integrated into the network analysis canvas. Identical value group-
 ing can be used to show the relationships between semantic groups as well as the
 relationships within them, for example with semantic substrates [SA06].
 NodeXL allows grouping nodes into meta-nodes by their attributes. As an
 example, Fig. 5.3 shows U.S. Senate co-voting patterns. Nodes are colored by the
 party affiliation attribute: red for Republicans, blue for Democrats, and orange
 for independents. Fig. 5.4 shows the same network with senators grouped by their
5.2 Grouping Techniques 149
 Figure 5.3: The U.S. Senate co-voting network for 2007 in shown here, with nodes
 for individual senators colored by their parties (blue Democrats, red Republi-
 cans, orange Independents), sized by betweenness centrality, and laid out using
 Furuchterman-Reingold [FR91]. Edges tie senators together and are weighted by
 their percent of voting agreement. Only those edges with at least 50% agreement
 are shown.
5.2 Grouping Techniques 150
 Figure 5.4: 2007 U.S. Senators grouped by their regional affiliation into meta-
 nodes. Aggregate meta-edges show the number of senators between the two groups
 that vote the same way on bills at least 50% of the time. Collapsed from the
 network in Fig. 5.3.
 regional affiliations into meta-nodes. Grouping multiple nodes into a single meta-
 node can produce measurable improvements in readability.
 5.2.3 Advanced and Combined Approaches
 Additional ways to group nodes by their attributes include the ubiquitous k-means
 clustering algorithm [Llo82], which can be used to cluster nodes by similar attribute
 values or sets of attribute values. This provides ways to create ?fuzzy? groups
 with related, but not identical, node attributes. Another approach called VI-Cut
 [Nav+09] combines hierarchical clustering with topologic clustering. Navlakha et
5.3 Midichlorian-Directed Layout 151
 al. use node attributes to create attribute-driven cuts of a hierarchical topology
 clustering, specifically focusing on biologic networks and predicting operational
 taxonomic units based on hierarchy of sequences and annotations. NodeXL does
 not currently support non-exact attribute clustering, but the results of these algo-
 rithms can be easily copied into the groups worksheets.
 5.3 Midichlorian-Directed Layout
 The first meta-layout I developed is a modified force-directed layout that takes
 group membership into account when computing forces between nodes, reducing
 the spring forces between nodes in separate groups. This approach is called the
 Midichlorian-Directed Layout (MDL), in reference to how individuals in the
 fictional Star Wars universe have varying levels of Force sensitivity depending on
 their midichlorian count. To paraphrase Darth Vader, ?The Force is strong with
 this [cluster].? This approach was developed in conjunction with Darya Filippova.1
 Our motivation for creating such a layout is that our current techniques for
 showing group membership, like displaying convex hulls on a node-link visualiza-
 tion, can be challenging to interpret. Figs. 5.5 and 5.6 show how challenging this
 can be with even simplified biologic networks. The network in these images is the
 the human protein interaction network obtained from the HPRD database, simpli-
 1http://www.cs.cmu.edu/~dfilippo/
5.3 Midichlorian-Directed Layout 152
 Figure 5.5: Graph summarization of the human protein interaction network from
 the HPRD database drawn with the Prefuse Force-Directed Layout with a global
 anti-gravity coefficient of 9 10 6.
5.3 Midichlorian-Directed Layout 153
 Figure 5.6: Same summarized human protein interaction network as Fig. 5.5, but
 clustered using Newman?s heuristic with convex hulls surrounding each cluster.
5.3 Midichlorian-Directed Layout 154
 fied using graph summarization [NRS08] down to 3312 nodes and 4746 edges. The
 groups shown with convex hulls in Fig. 5.6 are computed using Newman?s fast com-
 munity finding heuristic [New04]. Notice how the substantial occlusion among the
 clusters prevents getting an accurate cluster count and limits the viewer?s ability
 to see relationships between them.
 We based our approach for showing group membership more clearly on the
 interactive Force-Directed Layout provided by Prefuse [HCL05], which is a physics
 simulation with three main forces:
 1. Nodes exert anti-gravity on each other to enforce spacing following an inverse
 square law. This is computed using the efficient O(n log(n)) time approxi-
 mation of the gravitational n-body problem of [BH86] which uses a quad tree
 to find accurate local interactions while aggregating body masses.
 2. Edges are modeled by springs that pull connected nodes together with glob-
 ally constant spring coefficients and length.
 3. Drag forces for nodes similar to air resistance are used to prevent oscillations.
 At each timestep the forces are updated and the new node position and velocity
 is calculated by integrating over the timestep with the 4th-order Runge-Kutta
 method or, optionally, the faster but less accurate Euler Forward Method. Both
 of these integration techniques are described in [Pre+93].
5.3 Midichlorian-Directed Layout 155
 Algorithm 5 Force-directed layout algorithm
 addSpringForces()
 addRepulsiveForces()
 for every node u do
 for every node v do
 calculateForce()
 integrateForce() //Runge-Kutta integration
 assignPosition(v)
 Algorithm 6 addSpringForces() function used in MDL
 for every node u do
 for every node v do
 if u,v are connected then
 if u,v in the same community then
 a = sharedNeighbors(u,v)
 k = a * 10; //tighten the spring
 else
 k = a / 10; //relax the spring
 Our modified algorithm was inspired by the Vizster layout algorithm [HB05].
 In Vizster, the edge spring coefficient between the adjacent nodes varied based on
 the minimum degree of the two nodes incident on that edge (see Algorithm 5).
 This way nodes with few neighbors are drawn closely together while nodes with
 many neighbors are spaced farther apart to improve readability. We preserved this
 behavior, but with smaller weight on the node degree since our primary goal was
 to highlight group membership.
 The Vizster algorithm [HB05] did not explicitly use the community structure
 which resulted in overlapping communities, as seen in Fig. 5.7. Instead of let-
 ting the minimum node degree derive the community structure in the network,
5.3 Midichlorian-Directed Layout 156
 Figure 5.7: Same summarized, clustered human protein interaction network as
 Fig. 5.6, but using a global anti-gravity coefficient of 9 10 5 and zoomed in on
 the main connected component. Clusters are separated somewhat using the Vizster
 meta-layout modification to the Prefuse force-directed layout, resulting in less clus-
 ter overlap.
5.3 Midichlorian-Directed Layout 157
 we decided to exploit the community information as produced by the Newman?s
 heuristic [HB05]. For each pair of nodes that shared a cluster, we increased the
 spring coefficient to bring the nodes together. Likewise, for each pair of nodes in
 different clusters we decreased the spring coefficient and let the nodes drift apart.
 We wanted to control how close the nodes within the cluster got so we made the
 edge spring coefficient proportional to the number of neighbors shared by the u,v.
 This addition decreases convergence time and brings the densely interconnected
 nodes together. Our algorithm is shown in Algorithm 6.
 NodeXL was still in early development when this work was conducted. Instead,
 we implemented the Midichlorian-Directed Layout in SocialAction [PS06; PS08a;
 PS08b]. We chose SocialAction because of it?s ability to handle online, interactive,
 and animated layouts through its use of the Prefuse toolkit [HCL05]. In order to
 easily compare the effects of various force-directed layouts, we wanted to be able
 to dynamically change layouts and layout parameters while preserving the mental
 map [Mis+95] users had of the network. Preserving this mental map is important
 so users can understand changes to the network [MB04; PHG07]. We implemented
 a GUI to enable these interactive layout algorithm switches and animating between
 them.
 Fig. 5.7 shows the result of applying the Vizster SocNet layout [HB05], which
 provides some spacing between clusters but not enough for everything to be read-
5.3 Midichlorian-Directed Layout 158
 Figure 5.8: Same summarized, clustered human protein interaction network as
 Fig. 5.7, with clusters separated further using the Midichlorian-Directed Layout.
 The internal structure of these clusters is more visible, as well as the inter-cluster
 relationships.
5.4 Group-in-a-Box Meta-Layouts 159
 able. Contrast this with MDL in Fig. 5.8, which has almost no cluster occlusion
 and in which the connections between clusters are clearly visible. Moreover, us-
 ing the GUI to switch layouts and the animated group separation allows users
 to see the effect of the grouping immediately, supplementing any convex hulls or
 color/shape coding.
 However, the MDL approach leaves a lot to be desired. It is still challenging to
 see the aggregate relationships between groups and their relative sizes. Moreover,
 the large screen space required for laying out the groups separately severely limits
 how much detail can be seen.
 5.4 Group-in-a-Box Meta-Layouts
 Modified layout algorithms are not enough to help analysts see groups or clusters
 in the network clearly. Instead, we on the NodeXL team have chosen to show each
 group individually in its own region of the screen, bounded by a box sized according
 to the number of nodes it contains, and laid out on its own. These Group-in-a-
 Box layouts reveal internal group relationships, make clear which nodes are part of
 which groups and how many nodes a group contains, and with effective positioning
 of the boxes can show aggregate relationships between groups.
 I discuss three Group-in-a-Box (GIB) approaches that have different trade-
 offs in how space-filling they are and how well they show relationships between
5.4 Group-in-a-Box Meta-Layouts 160
 groups. The first is the Treemap GIB Layout, which completely fills the screen
 space with roughly square boxes. Next, I detail the Croissant-Donut GIB lay-
 out, which comes in two variants for networks with varying group characteristics:
 the Donut and the Croissant. The Croissant-Donut layouts use most of the
 screen space, while showing group relationships more clearly than the Treemap
 layout. Finally, present a Force-Directed GIB layout that arranges boxes by
 the group relationships, highlighting the group connectivity at the expense of addi-
 tional screen space. These layouts are implemented in NodeXL [Smi+10] and can
 be selected by the user depending on the network they are trying to analyze and
 their visualization goals. Moreover, I automatically select the most effective layout
 for the user based on a set of criteria about the network and group structure.
 5.4.1 Treemap Layout
 The Treemap Group-in-a-Box layout [Rod+11] subdivides the available screen
 space using a treemap [Shn92; JS91]. Shneiderman [Shn92] employed a slice and
 dice method for representing hierarchical information in a space-filling manner,
 which could result in boxes with high aspect ratios. Instead, the NodeXL team
 uses the squarified treemap approach of [BHJVW00] which maintains a low
 aspect ratio for the boxes. The boxes created by the Treemap layout have area
 proportional to the number of nodes they contain. For our purposes, it is impor-
5.4 Group-in-a-Box Meta-Layouts 161
 Figure 5.9: 2007 U.S. Senators grouped by their regional affiliation. From
 [Rod+11]. See Section 4.3.1 for more on this dataset.
 tant to keep a low aspect ratio because narrow boxes would not be as effective
 at displaying the structure of the group laid out inside it, in addition to being
 difficult to see their area and thus understand the number of nodes they contain.
 While NodeXL does not currently support hierarchical grouping, the Treemap GIB
 layout could be easily extended to visualize nested clusters.
 Fig. 5.9 provides an example of this approach, where the 2007 U.S. Senate
 co-voting network is segmented into five geographic regions the senators represent
 (an attribute-based grouping). See Section 4.3.1 for more on this dataset. The
5.4 Group-in-a-Box Meta-Layouts 162
 cross-region edges are hidden in this example. We can see that the South region
 has the most Senators, while the Pacific is the fewest. Moreover, we can clearly
 see the internal structure of the groups, such as the general division of each region
 into the Democrats on one side and the Republicans on the other. In each region,
 we can also see any moderates that bridge the parties, such as Collins, Snowe, and
 Specter in the Northeast region.
 However, the use of a Treemap layout for the boxes can end up placing highly
 connected clusters in different regions of the screen, with any connecting edges
 being drawn across the intermediate clusters (see Section 5.5 for examples). It can
 be hard to discern whether these long edges connect to nodes in the intermediate
 clusters, or are merely drawn overlapping. This ambiguity, in addition to the added
 clutter of these long ties, makes it difficult to analyze the relationships between
 clusters and draw meaningful conclusion. Our other layouts take the aggregate
 group relationships into account when determining where to place the boxes, so as
 to alleviate this issue.
 5.4.2 Croissant-Donut Layout
 The Croissant-Donut Group-in-a-Box layout [Cha+13] my students and I devel-
 oped tries to balance the space-filling attributes of the Treemap GIB layout with
 showing more of the underlying relationships between groups. The layout takes the
5.4 Group-in-a-Box Meta-Layouts 163
 overall group relationships into account when choosing where to place the group
 boxes. We came up with two complementary approaches, the Donut and the
 Croissant, each targeted at representing specific types of networks. We choose
 between these two approaches automatically depending on network and group
 properties (see Section 5.4.5 for more details). For these algorithms, we will be
 making use of a per-group metric we call connectedness, defined as the number
 of other groups a group is connected to.
 5.4.2.1 Donut Layout
 The Donut layout begins by placing the most connected group in the center of the
 screen. The other groups can be arranged either by their size to reduce wasted
 space or by their aggregate connections to other groups, so highly connected groups
 are adjacent. In this discussion I will use the latter, where groups are wrapped
 around the periphery in decreasing order of their connectedness.
 The area occupied by each group is calculated as:
 area(Group)=alpha screen_area jGroup.nodesj=jGraph.nodesj
 where Group.nodes refers to the nodes in the group, Graph.nodes refers to
 the nodes in the entire network, and alpha is the initial space-filling factor, which
 starts at alpha = 1:0. Note in Fig. 5.10 the area of each group, proportional to
 the number of nodes it contains, does not necessarily decrease as connectedness
 decreases.
5.4 Group-in-a-Box Meta-Layouts 164
 Figure 5.10: The basic principle behind the Donut variant of the Croissant-Donut
 layout is to place the most connected group in the center of the screen, then placing
 the other groups around its perimeter based on their connectedness (number of
 other groups they are connected to).
 This process for the Donut GIB layout is illustrated in Fig. 5.10 for a network
 with 7 groups, listed on the left in decreasing order of connectedness. The steps
 are marked in sequential order as Steps (i) to (viii). The figure also shows which
 groups remain to be placed at the bottom of each step. Initially in Step (i), we
 place G1 the most connected group at the center with an aspect ratio proportional
 to that of the screen. This is represented as a blue box in Step (ii), also known as
 the ?donut hole?. Placing G1 divides the screen into two horizontal (H1 and H2)
 and two vertical (V1 and V2) empty boxes. The remaining groups will be arranged
 in these boxes alternating in the sequence H1, H2, V1, V2.
 Since we only know the areas and not the dimensions of the groups, we use the
 orientation of the empty boxes to determine the group?s width or height. While
 placing a group in a horizontal empty box, we set its height to be same as that of
5.4 Group-in-a-Box Meta-Layouts 165
 the horizontal empty box. Its width is then determined by dividing its area by the
 height. Using the dimensions calculated above, the group is finally placed in the
 horizontal empty space box aligned with the empty box?s left side. For example,
 in Fig. 5.10, Step (ii) has a horizontal empty box labeled H1. The result of placing
 a group, G2, in H1 is shown in Step (iii) creating a new smaller H1 in (iii). It is
 easy to see that G2 and H1 of Step (ii) have same heights and they have a common
 left edge. Similarly, while placing a group in a vertical free box, we set its width
 to be same as that of the empty box and its height is determined using its width
 and area. The group and the vertical empty box share the top edges of the empty
 box. See placement of G4 in V1 in Steps (iv) and (v) of Fig. 5.10 for an example.
 After placing G1, we place the next group, G2, in H1; followed by G3 in H2,
 G4 in V1 and G5 in V2 (Steps (ii) to (vi)). Step (vi) shows that we have G6 and
 G7 left. So starting again at H1, we place G6 in H1 (Step(vii)). We then try to
 place G7 in H2, but H2 had no space left. So, we move on to V1 and place G7
 there (Step (viii)). No groups remain to be placed at Step (viii).
 The method proposed above is not space filling which might result in a situation
 where the algorithm still has some groups to place on the screen but none of the
 empty boxes are big enough. In such a situation, we restart the algorithm with
 alpha = 0:9 previous_alpha and repeat this process until all the groups get
 placed on the screen.
5.4 Group-in-a-Box Meta-Layouts 166
 Figure 5.11: The basic principle behind the Croissant variant of the Croissant-
 Donut layout is to place the most connected group in the top of the screen, then
 place the other groups around the other three sides based on their connectedness
 (number of other groups they are connected to).
 5.4.2.2 Croissant Layout
 As in the Donut layout, the Croissant layout sorts groups in decreasing order
 of connectedness and computes initial areas that can be decreased iteratively if
 needed. The box placement is similar, but instead of placing the most connected
 group at the center of the visualization, it is positioned at the top forming the
 ?croissant hole? shown in Fig. 5.11. The rest of the groups are placed around
 the remaining three sides, in one horizontal and two vertical empty boxes instead
 of two each, namely H2, V1 and V2 (Step (ii)). Groups are placed around G1
 alternating in the sequence H2, V1 and V2 (Steps (iii) to (viii)).
5.4 Group-in-a-Box Meta-Layouts 167
 Figure 5.12: A Donut-favoring network & groups, shown in the Treemap layout.
 Figure 5.13: A Donut-favoring network & groups, shown in the Donut layout.
5.4 Group-in-a-Box Meta-Layouts 168
 Figure 5.14: A Croissant-favoring network & groups, shown in the Treemap layout.
 Figure 5.15: A Croissant-favoring network & groups, shown in the Croissant layout.
5.4 Group-in-a-Box Meta-Layouts 169
 5.4.2.3 Comparing the Donut and Croissant Variants
 In general, the Donut variant of the Croissant-Donut layout is more effective when
 there are lots of small groups in the network. However, when the network contains
 one or two big clusters and a few small clusters, it can result in a lot of wasted
 space. In those cases, the Croissant layout performs better. We choose between
 these two approaches automatically depending on network and group properties
 (see Section 5.4.5 for more details).
 Figs. 5.12 to 5.15 illustrate these approaches for two separate networks in com-
 parison with the Treemap GIB layout, using combined meta-edges in place of the
 original inter-group edges. One caveat here is that many of the smallest groups
 have been filtered out in the Croissant-Donut versions as they do not show large
 numbers of small groups effectively. The first two figures show a network that is
 more suitable for the Donut layout, originally in a Treemap (Fig. 5.12) and then
 the same network in the Donut layout (Fig. 5.13). The meta-edges connecting
 groups in the Treemap layout suffer from an abundance of overlaps and crossings,
 especially near highly-connected groups like G3 and G5. However, the Donut lay-
 out positions these groups so that there are no crossings near the center of groups,
 with the remaining crossings around the group edges much easier to follow. The
 next two figures show a network that is more appropriate for the Croissant layout,
 first in a Treemap (Fig. 5.14) then in the Croissant layout (Fig. 5.15). Again, we
5.4 Group-in-a-Box Meta-Layouts 170
 Figure 5.16: The Force-Directed GIB layout explicitly positions groups based on
 their aggregate connections, showing group relationships clearly at the expense of
 additional screen space.
 see excessive overlap and crossings near the center of G2 that limit readability,
 while the Croissant layout version of almost completely eliminates the problem.
 While this approach does well at balancing a space-filling layout with showing
 the ties between groups, my Force-Directed Layout (described in the next section)
 chooses to use more white space to show those ties more clearly. These trade-offs
 are empirically verified in Section 5.6.
 5.4.3 Force-Directed Layout
 The Force-Directed Group-in-a-Box layout is my approach for explicitly showing
 the inter-group relationships in the visualization. The boxes are positioned using
 a standard force-directed layout run on the aggregate network, where the nodes
 represent entire groups and the edges between them represent the aggregate con-
 nections between a pair of groups. The overall concept is illustrated in Fig. 5.16.
5.4 Group-in-a-Box Meta-Layouts 171
 I then draw the group boxes on this initial layout centered at the group node?s
 position, followed by a step to remove the overlap created by all these boxes. This
 layout has the benefit of clearly showing the aggregate topology, but at the cost of
 more wasted screen space. However, this problem can be reduced by using effective
 overlap-reduction techniques that minimize the additional screen space required.
 5.4.3.1 Initial Configuration
 The first step of the force-directed group-in-a-box layout is setting how much of
 the screen space to use to show groups initially. Groups are represented using
 squares, sized according to the number of nodes they contain. My experiments
 with setting the initial space-filling factor ranging from 20% to 100% point to a
 general trade-off between how space-filling the resulting visualization is and how
 well the final group positions represent the actual group relationships. For more
 details on this trade-off, see Section 5.4.3.3.
 5.4.3.2 Generate Initial Group Box Positions
 The first task is to position the groups according to their connectivity with each
 other. I create a new network showing the group relationships, with one meta-
 node for each group and a combined meta-edge joining connected groups. Then
 I compute a set of initial node positions using NodeXL?s implementation of the
 Harel-Koren fast multi-scale (FMS) layout [HK02a].
5.4 Group-in-a-Box Meta-Layouts 172
 I chose to use the Harel-Koren FMS layout [HK02a] because it was implemented
 in NodeXL already, was sufficiently fast, and produced good results. However,
 future implementers may wish to use faster or more effective layout algorithms.
 According to experimental evaluations of several best-of-breed layout algorithms
 carried out by Hachul and J?nger [HJ06; HJ07], two good choices would be the
 high-dimensional embedding (HDE) approach of Harel and Koren [HK02c] or the
 algebraic multigrid method (ACE) of Koren, Carmel, and Harel [KCH03]. Hachul
 and J?nger report that HDE, followed by ACE, was the fastest algorithm for many
 test cases. However, if these layouts produce ineffective visualizations Hachul and
 J?nger suggest using their FM3 algorithm [HJ05; Hac05] to get comparable or
 better results while still having a reasonable run time. The FM3 layout may be
 complex to implement, and was the focus of Stefan Hachul?s dissertation [Hac05].
 My current implementation using the Harel-Koren FMS layout [HK02a] does
 not use meta-edge weight when calculating the layout. While this technique has
 proven effective in the examples I have explored, I would suggest that future im-
 plementers use this meta-edge weight as part of the layout algorithm to pull more
 strongly connected groups closer together and separate poorly connected groups
 further. The HDE algorithm [HK02c] can be modified for visualizing weighted
 networks [Hac05, p.22], and both ACE [KCH03] and FM3 [HJ05; Hac05] can be
 applied to weighted networks.
5.4 Group-in-a-Box Meta-Layouts 173
 Another issue is that NodeXL?s implementation of the Harel-Koren FMS layout
 [HK02a] does not presently handle multiple disconnected components well, and
 lays them out individually in the same regions. Thus, care should be taken to
 ensure to apply the force-directed Group-in-a-Box layout only on networks without
 disconnected components or isolates. See Section 5.4.5 for a general solution to
 the disconnected component problem.
 5.4.3.3 Remove Group Box Overlap
 After the initial group box positions are determined, I have to contend with the
 fact that the layout algorithm is unaware of the group boxes. If I draw each box
 centered at its position from the layout algorithm I get a visualization like Fig. 5.17
 with substantial amount of box overlap. This initial overlap can be reduced by
 using smaller boxes, but at a significant cost in wasted space. Instead, I try to
 eliminate the overlaps while retaining as much of the structural information from
 the layout as possible and minimizing additional area required. Naturally, worse
 overlap in the initial visualization leads to a less effective resulting layout.
 While the problem of creating a minimum-area layout adjustment is NP-complete,
 there are several effective node-node overlap removal algorithms that can be applied
 to these group boxes. I use the PRoxImity Stress Model (PRISM) algorithm of
 Gansner and Hu [GH09], which iteratively computes box overlap along the edges of
 a Delaunay triangulation and adjusts those edge lengths accordingly to remove the
5.4 Group-in-a-Box Meta-Layouts 174
 Figure 5.17: Group box positions after running the Harel-Koren FMS layout
 [HK02a] on the group relationship network of innovations in Pennsylvania (see
 Section 5.5.2 for dataset details). Edges between groups are hidden.
5.4 Group-in-a-Box Meta-Layouts 175
 Figure 5.18: An original network visualization (left), the same visualization after
 removing node-node overlap with the PRISM algorithm [GN98] (center), and after
 removing node-node overlap with the solve_VPSC algorithm [DMS06; DMS07].
 solve_VPSC maintains orthogonal ordering but can result in highly skewed visu-
 alizations. From [GH09].
 overlap. According to Gansner and Hu?s evaluations, the PRISM approach scales
 up well to large networks while maintaining a good tradeoff between preserving
 the network shape and limiting the area required by the adjusted visualization.
 Several other alternatives exist, though they have various problems with scal-
 ing up or preserving the network shape. For example, the scan line approach of
 Dwyer, Marriott, and Stuckey; Dwyer, Marriott, and Stuckey [DMS06; DMS07]
 is a quadratic programming algorithm removes overlaps and maintains orthogonal
5.4 Group-in-a-Box Meta-Layouts 176
 Figure 5.19: An original network visualization (left), the same visualization after
 removing node-node overlap with the PRISM algorithm [GN98] (center), and after
 removing node-node overlap with the solve_VPSC algorithm [DMS06; DMS07].
 solve_VPSC maintains orthogonal ordering but can result in highly skewed visu-
 alizations. From [GH09].
5.4 Group-in-a-Box Meta-Layouts 177
 ordering. However, the visualization can become highly skewed as you can see in
 the right side of Figs. 5.18 and 5.19. For details on many other node-node overlap
 removal techniques see Section 6.4.1.
 The effects of the box removal algorithm are illustrated in for a large network
 with an initial space-filling factor of 20% (Fig. 5.20) and 50% (Fig. 5.21). In these
 two figures, the initial positions for each group chosen by the layout algorithm
 (Section 5.4.3.2) are shown using colored circles, squares, diamonds, and triangles.
 The boxes, however, are drawn centered around their final non-overlapping posi-
 tions. For groups with substantial movement, I have drawn a red line connecting
 the initial position shape to the center of the final box position. As the layout al-
 gorithm does not take the box size into account, there can be a substantial amount
 of adjustment required to reach the final positions. In Fig. 5.20, where the initial
 space-filling factor was 20% of the screen, there is relatively little movement and
 the groups retain their relative positions to each other, except for a few crossing
 movements in the bottom-left. On the other hand, in Fig. 5.21 where the initial
 space-filling factor is 50%, I see much more group movement.
 In general, the the box overlap removal algorithm keeps the relative positions
 of the groups intact, but there can be large groups that get shoved out to the
 periphery with a high initial space-filling factor. This layout is more space-filling,
 but at the expense of obscuring some of the group relationship information. This
5.4 Group-in-a-Box Meta-Layouts 178
 Figure 5.20: Network and groups from Fig. 5.17, using a different initial set of
 positions from the Harel-Koren FMS layout [HK02a] and after adjusting box posi-
 tions using the PRISM overlap removal technique [GH09]. In this case I chose an
 initial space-filling factor of 20%. The red lines map the original group positions,
 represented by colored shapes, to the final box positions. There is generally little
 movement.
5.4 Group-in-a-Box Meta-Layouts 179
 Figure 5.21: Network and groups from Fig. 5.17, using a different initial set of
 positions from the Harel-Koren FMS layout [HK02a] and after adjusting box posi-
 tions using the PRISM overlap removal technique [GH09]. In this case I chose an
 initial space-filling factor of 50%. The red lines map the original group positions,
 represented by colored shapes, to the final box positions. There is a substantial
 amount of movement, and while most of it preserves group relationships the largest
 groups get shoved to the periphery.
5.4 Group-in-a-Box Meta-Layouts 180
 problem could potentially be alleviated by reducing the amount of overlap removed
 in each iteration of the algorithm, allowing the large groups to slowly push small
 ones out of the way instead of shoving past them to the periphery. This parameter
 is referred to as smax in [GH09] and must be larger than 1.0. I am currently using
 their suggested default value smax = 1:5, but experimentation could be useful.
 5.4.3.4 Finalize the Layout
 I do have the initial space-filling factor which sets how much of the screen to use
 to display boxes, but because the box overlap removal technique usually needs to
 adjust the boxes outward the final screen space required can increase. If the layout
 has expanded outside of the available screen space, I then scale the new layout to
 fit in in the available space. Each box is scaled down using the ratio of the layout
 space required to the screen space, maintaining its aspect ratio.
 As most layout algorithms, including the one I chose, use non-deterministic
 heuristics to place the nodes, it is possible to get a poor initial layout. For ex-
 ample, all the nodes can be placed on a diagonal line or the like. In those cases,
 the resulting visualization after overlap removal preserves that diagonal line and
 requires a significant amount of screen space. As with general force-directed lay-
 outs, the user can choose to run the layout again (or several times) to get the most
 effective force-directed group-in-a-box layout.
5.4 Group-in-a-Box Meta-Layouts 181
 5.4.4 Showing Edges Between Groups
 After running a Group-in-a-Box layout, we have an option as to how to show inter-
 group edges. NodeXL [Smi+10] currently supports three techniques: showing the
 actual underlying edges, hiding them completely, or replacing all the edges between
 each pair of groups with a meta-edge sized proportional to the number of edges
 it represents. These options are shown in Fig. 5.22, as well as many other images
 in this chapter. In addition to straight-line edges, NodeXL can draw curved or
 bundled edges that may reduce complexity (e.g., Fig. 5.31). However, because the
 group in each box is laid out independently of the rest of the network, showing
 the underlying edges explicitly often results in additional edge crossings and other
 poor layout characteristics. Depending on the user?s target analysis tasks, using
 the combined meta-edges or hiding edges completely can be more effective.
 5.4.5 Dividing the Problem
 We now have three Group-in-a-Box layouts at our disposal: Treemap, Croissant-
 Donut, and Force-Directed. Users can choose the variant most suitable to their
 task using the Layout Options dialog in NodeXL Fig. 5.23, depending on whether
 they wish to have a highly space-filling layout (Treemap), one that highlights
 group relationships (Force-Directed), or somewhere in between (Croissant-Donut).
 However, the onus is not entirely on the user to pick the best layout for their
5.4 Group-in-a-Box Meta-Layouts 182
 Figure 5.22: Three ways to show edges between groups in a Group-in-a-Box layout.
 From top to bottom: show all underlying edges, hide all underlying edges, and use
 aggregate meta-edges.
5.4 Group-in-a-Box Meta-Layouts 183
 Figure 5.23: The NodeXL Group-in-a-Box user interface. The right graph pane
 shows a Force-Directed Group-in-a-Box layout of the Risk network, which is de-
 scribed further in Section 5.5.1. The left Edges worksheet shows some of the edges
 connecting the nodes in the network. The Layout Options dialog in the foreground
 allows users to select their desired Group-in-a-Box layout, the size of group boxes,
 how to treat inter-group edges, and whether to use a separate grid layout for groups
 with few edges instead of the chosen main layout.
5.4 Group-in-a-Box Meta-Layouts 184
 network. I make several choices for the user algorithmically, so as to reduce novice
 user difficulties and speed up the analysis process.
 5.4.5.1 Disconnected Components
 First, I find any disconnected components in the network and lay each component
 out individually in a rectangular screen region using the Treemap Group-in-a-
 Box layout by default. This ensures that disconnected parts of the network are
 not drawn on top of each other, as most layout algorithms assume a single con-
 nected component. Moreover, the Treemap layout more effectively subdivides the
 screen space for the various group sizes, compared to NodeXL?s current option of
 putting components below a certain size in small same-sized boxes at the bottom
 of the screen. This also is more space-filling than the approach used by Cytoscape
 [Sha+03], which lays the components out individually, sorts the components by
 screen space used, and stripes them in rows where each row is as tall as its tallest
 component (Fig. 5.24). After the top-level component Treemap layout, if the user
 is using a regular Group-in-a-Box layout, each component?s region will be further
 subdivided using the chosen algorithm if it contains any groups. This can result in
 a two-level Treemap, where the top level divides the network by components and
 the second divides components by any groups present in the groups worksheet.
 Alternatively, a Force-Directed or Croissant-Donut Group-in-a-Box layout can be
 used for the second level.
5.4 Group-in-a-Box Meta-Layouts 185
 Figure 5.24: The Cytoscape biologic network analysis tool [Sha+03], currently
 showing the human protein interaction network after applying graph summariza-
 tion [NRS08]. Disconnected components are laid out individually, sorted by screen
 space used, and striped into rows with each row height set by the tallest compo-
 nent. This can waste substantial screen space when components have drastically
 different sizes.
5.4 Group-in-a-Box Meta-Layouts 186
 Figure 5.25: Three simple groups in the NodeXL squarified treemap layout demon-
 strating how window aspect ratio can cause three groups to be laid out in a row.
 5.4.5.2 Number of Groups
 The second automatic layout choice deals with the number of groups present in a
 component. If there is only one group (the complete subnetwork), no Group-in-a-
 Box layout is used. If there are two groups, I choose the Treemap layout to divide
 the space in half proportionally. For three or more groups, I use the user?s selected
 Group-in-a-Box layout. Ideally, a squarified Treemap layout would allow perfect
 representation of relationships between three groups without edges unnecessarily
 crossing group boundaries. However, high aspect ratio layout spaces like those
 possible when resizing the NodeXL graph pane can result in poor Treemap layouts
 for showing three-way relationships (Fig. 5.25).
5.5 Case Studies 187
 5.4.5.3 Distribution of Group Sizes
 Between the two variants of the Croissant-Donut Group-in-a-Box layout, the Donut
 should be preferred when there are many small groups in the network. Alterna-
 tively, if there are one or two big clusters and a few small clusters the Croissant
 layout will provide a more space-filling layout. We have defined a measure calledG-
 skewness to measure this property, defined as the fraction of the network?s nodes
 present in the two groups with the highest connectedness (see Section 5.4.2)
 for a definition of connectedness). We have empirically determined cutoffs of G-
 skewness that we use to automatically choose the Donut or the Croissant variant
 depending on this group structure:
  Case1: jGroupsj  3 and G-skewness < 0:1 ? Use the Treemap layout
  Case2: jGroupsj > 3 and 0:1  G-skewness  0:45 ? Use the Donut layout
  Case3: jGroupsj > 3 and G-skewness > 0:45 ? Use the Croissant layout
 5.5 Case Studies
 I explored several real-world networks with the three Group-in-a-Box layouts to
 determine their effectiveness. Examples of these case studies are detailed below.
 Two of these studies in particular involved extensive collaboration with domain
 experts to solve real-world problems: the innovation network in Section 5.5.2 and
 the medical informatics network in Section 5.5.3.
5.5 Case Studies 188
 Figure 5.26: The box and board for the game Risk. The board consists of 42
 countries in six continents. From boardgamegeek.com/image/1466865/risk
 5.5.1 Continent-Holding Strategies in Risk
 One small network that may be meaningful to a broad audience (of geeks at least)
 is that of the board game Risk. Risk is a turn-based strategy war game from
 Hasbro, originally released in 1957. The game is played on a political map of Earth
 with forty-two countries grouped into six continents. On their turn, users collect
 armies based on the countries and continents they occupy, then attempt to capture
 countries adjacent to the ones they occupy with combat a matter of attrition
5.5 Case Studies 189
 resolved via dice rolling. The game board and pieces are shown in Fig. 5.26.
 I created a network from the game board, where nodes represent countries and
 edges between them indicate valid movements across country borders.
 This network is shown in Fig. 5.27, where the nodes are laid out using the
 Harel-Koren FMS layout [HK02a] and clustered/colored using Clauset-Newman-
 Moore [CNM04]. From this visualization we can see the expected segmentation
 into continents, which are generally more insular, with specific routes to other con-
 tinents. For example, we can see the red South America on the left, the light green
 Australia in the bottom-right, and dark green North America at the top. Holding
 these three continents can be quite beneficial, as they provide troop bonuses and
 have limited access. However, we can see in the center that the purple cluster
 is a combination of Europe and Africa, or EuroAfrica. These two continents are
 so tightly connected along the Mediterranean that they are considered as one by
 Clauset-Newman-Moore, indicating correctly that they are harder to hold. More-
 over, we see the Middle East is clustered into the bottom-right of EuroAfrica,
 although it is part of light-blue Asia in the game. As any Risk aficionado or ?The
 Princess Bride? fan can tell you, ?never get involved in a land war in Asia? ? and
 this clustering result indicates one of the reasons why!
 I also looked at this network using the three Group-in-a-Box layouts, and the
 results are shown in Figs. 5.28 to 5.30. As expected, the Treemap GIB layout
5.5 Case Studies 190
 Figure 5.27: The network for the board game Risk, where nodes are countries
 and edges indicate valid movements. Nodes are laid out using Harel-Koren FMS
 [HK02a], clustered and colored using Clauset-Newman-Moore [CNM04].
5.5 Case Studies 191
 Figure 5.28: Risk network from Fig. 5.27, shown using the Treemap GIB layout
 with combined inter-group edges.
5.5 Case Studies 192
 Figure 5.29: Risk network from Fig. 5.27, shown using the Croissant variant of the
 Croissant-Donut GIB layout with combined inter-group edges.
5.5 Case Studies 193
 Figure 5.30: Risk network from Fig. 5.27, shown using the Force-Directed GIB
 layout with combined inter-group edges. The initial space-filling factor is 20%.
5.5 Case Studies 194
 (Fig. 5.28) uses the space exceptionally well while maintaining small aspect ratios
 for the group boxes. The structure of each cluster is far more readable than in the
 original node-link visualization (Fig. 5.27). The combined inter-group edges show
 the strength of connection between purple EuroAfrica and light blue Asia ? further
 solidifying our concerns about holding them in the game. The rest of the groups
 are only joined by a single route of attack. However, we can see the unfortunate
 placement of red South America next to light green Australia and light blue Asia
 ? neither of which it as any connection to.
 Moving to the Croissant-Donut GIB layout, we see that this network was as-
 signed the Croissant variant (Fig. 5.29). There is some wasted space along the
 periphery, and the group box aspect ratios are worse than in the Treemap GIB
 layout (Fig. 5.28). On the plus side, light green Australia is only next to light
 blue Asia, its sole tie to the world, and red South America is by purple EuroAfrica
 which it connects to. South America also has a tie to dark green North America
 on the left, but is unfortunately placed on the other side of the visualization.
 Finally, we look at the network using the Force-Directed GIB layout with an
 initial space-filling factor of 20%, shown in Fig. 5.30. Because of the reduced
 group box size, the labels are smaller and group internal structure is less readable.
 However, the ties between clusters are now explicitly clear based on their locations
 and the lack of meta-edge crossings or overlaps.
5.5 Case Studies 195
 Overall, this case study illustrates the trade-offs inherent in the the three tech-
 niques while at the same time highlighting effective strategies for the game Risk.
 Clusters of countries and their internal legal movements are most clear using the
 Treemap GIB layout, while the Force-Directed GIB layout highlights the move-
 ments possible between the clusters. The Croissant-Donut GIB layout strikes a
 balance between these two extremes.
 5.5.2 Finding Regional Innovation Clusters
 One of the goals of urban planners is to understand the relationships behind inno-
 vation and how the ties between organizations, individuals, and funding agencies
 affect growth. Christopher Scott Dempwolf,2 a researcher in the School of Architec-
 ture, Planning and Preservation at the University of Maryland, has been working
 to model innovation based on patent ties, federal and state funding, and physical
 locations. I introduced Dempwolf to NodeXL and helped guide several of his net-
 work analyses, including one of Pennsylvania innovations in 1990. He was keen on
 detecting technology and talent clusters which could be positively influenced. The
 network he collected included patent ties, federal funding from SBIR/STTR, and
 state funding via the DCED and Ben Franklin Technology Partners.
 An initial visualization of this network is shown in Fig. 5.31, which uses the
 Harel-Koren FMS layout [HK02a], link bundling, and categorical coloring for node
 2http://www.terpconnect.umd.edu/~dempy
5.5 Case Studies 196
 Figure 5.31: Pennsylvania innovation relationships during 1990 (main component)
 collected by Christopher Scott Dempwolf. Nodes are laid out using the Harel-
 Koren FMS layout [HK02a] and I used link bundling as well as categorical coloring
 for node and link types. Gray nodes represent inventors; orange are firms; red
 are federal agencies; royal blue are PA DCED / Ben Franklin agencies; lime are
 universities. Red ties (lines) are SBIR / STTR funding; purple ties are patent re-
 lationships; aqua ties are state funding; blue ties are explicit relationships between
 patents; light green ties are technology-based relationships.
5.5 Case Studies 197
 and link types. While quite beautiful, this visualization is not particularly effective.
 Some large structures are easily distinguishable, like the cauliflower-shaped groups
 of gray inventors and a few large orange enterprises. However, the overall structures
 and relationships are hard to interpret.
 Dempwolf was interested in technology and talent clusters, so to try to pick
 these features out of this large network I applied the Clauset-Newman-Moore clus-
 tering algorithm [CNM04]. The algorithm finds clusters of nodes that link to
 each other more frequently than outside the cluster, which, in this case, represents
 clusters of entities with similarities in patented technology. With a node-link visu-
 alization alone it can be challenging to see group membership, size, and aggregate
 relationships using solely the standard color or shape coding as in Fig. 5.32. I
 applied the Treemap GIB layout to make these features explicitly visible by laying
 out each detected cluster individually (Fig. 5.33).
 In analyzing this visualization, we discovered many expected clusters around
 specific Pennsylvania counties and local enterprises. For example, the bottom-left
 cluster of Fig. 5.33 is the Pittsburgh metro area, containing the orange Westing-
 house Electric. The Pittsburgh cluster is highly connected (via faint, small links)
 to the Montgomery county cluster to its right, another large metro area. An
 unexpected exception to the location grouping is the top-left pharmaceutical and
 medical cluster, composed of several companies, universities, HHS, and an interest-
5.5 Case Studies 198
 Figure 5.32: The innovation network from Fig. 5.31, with clusters found using the
 Clauset-Newman-Moore algorithm [CNM04] shown using node color and shape.
 Because of the dense, intermingled clusters it is difficult to understand the network
 and cluster structure. In this figure the edges are shown as straight lines.
5.5 Case Studies 199
 Figure 5.33: The innovation network from Fig. 5.31, with nodes grouped into boxes
 by the clusters found using the Clauset-Newman-Moore algorithm [CNM04], laid
 out using the Treemap GIB layout sized by their degree, and arranged inside
 boxes using the Harel-Koren FMS layout [HK02a]. Edge opacity is based on the
 tie strength and edges are bundled.
5.5 Case Studies 200
 Figure 5.34: The visualization from Fig. 5.33 after replacing inter-group edges with
 meta-edges that represent the aggregate relationships between each pair of groups.
 ing arrangement of inventors in several connected fans. Dempwolf was completely
 unaware of this cluster, which was immediately visible with the Treemap GIB
 layout. These sorts of meaningful structures were mostly hidden in the original
 visualization (Fig. 5.31).
5.5 Case Studies 201
 Figure 5.35: The visualization from Fig. 5.33 after hiding inter-group edges.
5.5 Case Studies 202
 Figure 5.36: The visualization from Fig. 5.33 after hiding inter-group edges and
 filtering to only the largest groups.
 However, the Treemap GIB layout can place highly connected groups of nodes
 far apart in the treemap, such as the two adjacent counties in Fig. 5.33 that are
 placed in the top-right and bottom-left corners. This makes it difficult to see
 aggregate relationships, with the edges stretched across many other groups they
 are not connected to. I attempted to show these aggregate relationships explicitly
 using a single aggregate edge between groups instead of the plethora of small ones,
 but the results were not encouraging (Fig. 5.34). In this example, there are too
 many connections between the various groups to be able to discern individual
5.5 Case Studies 203
 ones. I hid all inter-group edges entirely (Fig. 5.35), which showed internal group
 structure more clearly, then drilled down to only the largest groups in the network
 to create Fig. 5.36. This kind of filtered, labeled visualization would be especially
 good for presenting the results.
 Of course, I wanted to see how the other Group-in-a-Box variants handled this
 large, complex network. When I used the Fitted-Rectangles layout, our algorithms
 chose to use the Donut variant (Fig. 5.37). We can see the largest groups and their
 connections fairly well, but edges cross unrelated groups in some cases and many
 of the boxes have high aspect ratios. The smallest groups, which are shown as
 slices in the corners, have extremely high aspect ratios and should be filtered out.
 Alternatively, we could explore a combination with the Treemap GIB approach
 that subdivides corners when aspect ratios become to high.
 My Force-Directed GIB approach, on the other hand, retains very good aspect
 ratios for the groups (Fig. 5.38). Moreover, the most tightly connected groups are
 placed near each other with the edges generally overlapping few other boxes. How-
 ever, some of the largest groups are pushed to the periphery and thus their edges
 are drawn across unrelated groups. Some parameter tweaking may be necessary
 in the overlap reduction algorithm to avoid this. Also, there is much more wasted
 screen space than the Treemap GIB layout (Fig. 5.33) and the Croissant-Donut
 Donut GIB layout (Fig. 5.37).
5.5 Case Studies 204
 Figure 5.37: The visualization from Fig. 5.33, but using the Croissant-Donut Donut
 GIB layout instead of the Treemap. Inter-group edges are visible and straight.
 While we can see some of the groups well, many of the smaller groups in the
 corners have high aspect ratios.
5.5 Case Studies 205
 Figure 5.38: The visualization from Fig. 5.33, but using the Force-Directed GIB
 layout instead of the Treemap. Inter-group edges are visible and straight. All the
 groups have low aspect ratios, and aggregate connections between the large groups
 are more visible. The initial space-filling factor is 50%.
5.5 Case Studies 206
 All told, Dempwolf found that these clusters accurately represented specific eco-
 nomic development opportunities that could be influenced to increase employment.
 According to him, ?This approach gives you a list of firms to go talk to and specific
 things to talk with them about. It also identifies specific talent clusters. These
 are things that traditional industry cluster analysis has never done.? More details
 of Dempwolf?s use of NodeXL for identifying high-priority economic development
 targets are available in his slide deck3, as well as his dissertation [Dem12].
 5.5.3 Patient Discharge Summaries
 I also applied the three Group-in-a-Box meta-layouts to the network of patients and
 concepts from their discharge reports, originally discussed in Section 4.3.5 as a case
 study for motif simplification (Chapter 4). After applying the Clauset-Newman-
 Moore topologic clustering algorithm [CNM04] to the network from Fig. 4.20, the
 standard color-coding approach produced the visualization shown in Fig. 5.39. The
 many densely connected clusters here are difficult to interpret. Note that standard
 clustering algorithms may not be as effective for analyzing networks with multiple
 node types, like this one of patients and concepts.
 The Group-in-a-Box layouts, on the other hand, nicely segment these clusters.
 First, the Treemap GIB layout shown in Fig. 5.40 enables us to see the internal
 structure of each cluster. We have large clusters around our two egos in the net-
 3http://portal.sliderocket.com/ATWBE/Using-SNA-to-find-and-manage-RICs
5.5 Case Studies 207
 Figure 5.39: Patients and concepts related to the ?hops5325? and ?orch7323? med-
 ications from Fig. 4.20. Nodes are grouped using the Clauset-Newman-Moore
 topologic clustering algorithm [CNM04] and colored accordingly.
5.5 Case Studies 208
 Figure 5.40: Patients, concepts, and clusters from Fig. 5.39, shown in the Treemap
 Group-in-a-Box layout. Our ego concepts, ?hops5325? and ?orch7323?, are shown
 in orange in the largest clusters.
5.5 Case Studies 209
 work, the concepts ?hops5325? and ?orch7323? shown in orange. There is another
 large cluster to the top-right, as well as several smaller ones. Each of these clusters
 consist of several patients and a range of concepts associated with them. However,
 the Treemap layout prevents us from seeing the ties between clusters easily.
 The Croissant-Donut layout, in this case choosing the Croissant variant, is
 shown in Fig. 5.41. This layout does somewhat better at removing the overlap
 of the meta-edges between groups though has worse aspect ratios for the group
 boxes. The pure Force-Directed approach, shown in Fig. 5.42, does even better at
 showing the group ties and maintains square group boxes, though group internal
 structure is a bit less discernable than in the Treemap layout.
 One interesting combination is to use one of the Group-in-a-Box layouts with
 the motif simplification techniques I presented in Chapter 4. I combined the node
 positions given by the Force-Directed GIB layout with the simplified motif glyphs,
 resulting in the visualization in Fig. 5.43. Due to technical limitations in the imple-
 mentation, these approaches are not completely complimentary. For example, the
 edges between groups are shown and the group boxes have disappeared. However,
 the group and node positions are maintained. We can see which groups have large
 fan and connector motifs of similar concepts and could drill into them on a per-
 group basis. Future development, especially the inclusion of hierarchical or nested
 groups in NodeXL, could enable more effective combinations of these approaches.
5.5 Case Studies 210
 Figure 5.41: Patients, concepts, and clusters from Fig. 5.39, shown in the Croissant-
 Donut Group-in-a-Box layout. In this case the Croissant variant was chosen au-
 tomatically. Our ego concepts, ?hops5325? and ?orch7323?, are shown in orange in
 the largest clusters.
5.5 Case Studies 211
 Figure 5.42: Patients, concepts, and clusters from Fig. 5.39, shown in the Force-
 Directed Group-in-a-Box layout. Our ego concepts, ?hops5325? and ?orch7323?,
 are shown in orange in the largest clusters. The initial space-filling factor is 50%.
5.5 Case Studies 212
 Figure 5.43: Patients, concepts, and clusters from Fig. 5.39, shown in the Force-
 Directed Group-in-a-Box layout but without the group boxes. The underlying
 edges are visible. The motif simplification technique from Chapter 4 is applied as
 well.
5.6 Experimental Results 213
 5.6 Experimental Results
 In this section I compare the performance of the proposed Group-in-a-Box methods
 with the baseline ST-GIB on 309 Twitter networks downloaded from the NodeXL
 Graph Gallery [Smi+13]. I also describe an initial user study that was conducted
 to compare the usefulness of such GIB approaches. These studies were conducted
 by my students and me [Cha+13].
 5.6.1 Pilot Study
 The meta-layout methods proposed in this dissertation are based on the assump-
 tion that the existing ST-GIB layout is not good enough for understanding inter-
 group relations and that there is a need for methods that consider inter-group
 edges while arranging groups. To validate this hypothesis, we conducted an initial
 user study to compare the CD-GIB approach with the ST-GIB approach.
 We recruited 9 participants who self reported that they have dealt with network
 data previously. The experiment followed a within subjects design, where subjects
 were given a set of tasks and asked to use the ST-GIB and the CD-GIB layouts to
 answer questions. The order of experimental conditions was counterbalanced by
 alternating the order in which the two layouts were presented. The tasks presented
 to the users were derived from questions that may arise about a network with
 regards to the relationship between the various groups. The tasks asked users to
5.6 Experimental Results 214
 count the number of outgoing combined edges from a list of groups, to find the
 group which had the maximum number of adjacent groups, to find the number of
 groups connected to a list of pair of groups, and to ascertain whether there was an
 edge between a series of pairs of groups. Following each task, users were asked to
 rate each layout on a scale from 0 to 9 based on the the layout?s usefulness.
 Based on this experiment, CD-GIB received an average score of 6:94 1:47 and
 the ST-GIB layout received an average score of 4:61  1:59. These results were
 encouraging as they demonstrated a need for better layout algorithms that would
 assist the user in understanding the relationships in a network better.
 5.6.2 Readability Measures
 Our initial evaluations of manually analyzing the results of the three algorithms
 were encouraging. However, for a more robust and formal evaluation, we quantify
 the usefulness of the a GIB method on the basis of the following network readability
 metrics. A good layout would occupy as much of the screen space as possible; have a
 Mean Group-Box Aspect Ratio close to 1.0 for a clearer intra-cluster visualization;
 and have a low Edge-Box-Overlap for a better inter-cluster visualization.
 5.6.2.1 Edge-Box-Overlap(G)
 Discernibility of inter-group edges depends on a number of factors. Most of these
 factors become critically important for particularly long edges which run from one
5.6 Experimental Results 215
 end of the screen to another. Visually following a long edge from source group-
 box to destination group-box can be cognitively challenging especially if the edge
 overlaps with several other group boxes ?on its way?.
 Therefore, for a given inter-group edge, e, of a network or graph, G, we de-
 fine the edge overlap, Overlap(G; e), as the count of the number of group boxes
 (excluding the source and the destination group boxes) which intersect with the
 edge. A group box and an edge are said to be intersecting if the edge intersects
 with at least one of the four boundaries of the group box. For example, the edge
 overlap for the combined inter-group edge connecting G4 and G7 in Fig. 5.13 is 2
 because it intersects boxes G1 and G2. Total Edge-Box-Overlap for the network,
 G, is then defined as:
 Edge-Box-Overlap(G) =
 P
 e2E Overlap(G; e) we
 Max Overlap(G)
 (5.1)
 where
  E =Set of all inter-group edges in the network G
  we =Weight of an edge, e
 When the inter-group edges are ?combined? in nature (see Section 5.4.4), we as-
 sume a straight line between the centers of the concerned groups and compute
 Overlap(G; e) using this straight line to represent the inter-group edge, e. Edge-
5.6 Experimental Results 216
 Box-Overlap(G) is then computed by aggregating Overlap(G; e) for all the com-
 bined inter-group edges in the network, G. Here, the weight of a combined inter-
 group edge is simply the sum of weights of the constituent inter-group edges.
 For the sake of comparison, for a given network, we compute an upper bound
 to the Overlap, Max Overlap(G), as IE  (N  2), where IE = Total number
 of inter-group edges times and N=Total number of groups in the network and use
 it to normalize the observed overlap.
 5.6.2.2 Screen Space wasted
 In the current problem setting, the size and shape of the screen, where the group
 boxes have to be arranged, is predetermined. The layout algorithms should, there-
 fore, attempt to use as much of this space as possible. The space filling property
 is important because a layout which wastes more space basically assigns smaller
 areas to the group boxes (than a space-filling layout) and thus compromises on
 the clarity of intra-group cluster visualization. For example, the visualization in
 Fig. 5.30 is less space filling than that in Fig. 5.29. This happens because it assigns
 lesser screen area to group-boxes and so visualization of intra-group contents of,
 say EuroAfrica, is more difficult in Fig. 5.30 than in Fig. 5.29.
 ?Screen Space wasted? is defined as the percentage of screen space that was not
 occupied by any of the group boxes. Since the focus of this paper is arrangement
 of group boxes and not the nodes within the group, any white space inside a group
5.6 Experimental Results 217
 box is not considered as ?wasted?. Note, of course, that we are not truly wasting
 the space: we are often using to show aggregate topology.
 5.6.2.3 Mean Group-Box Aspect Ratio
 As mentioned earlier, thin elongated rectangles make analyzing their content diffi-
 cult and so it is desirable for a GIB approach to produce ?squarified? group boxes
 that have aspect ratios closer to 1.0. Also, in the three GIB approaches compared
 here, a group?s area is representative of its size. A typical user could exploit this
 property to compare group sizes based on their areas. Since visually comparing
 sizes of squares is easier than comparing sizes of rectangles, a better GIB algorithm
 should produce group boxes that are more ?square? in shape.
 Given a clustered network laid out using a GIB approach, we measure this
 property using a mean of aspect ratios of the group boxes. Defining aspect ratio
 of a box as the ratio of its width and height, the Mean Group-Box Aspect Ratio
 can be expressed as:
 Mean Group-Box Aspect Ratio =
 PN
 i=1 ai
 N
 (5.2)
 where
  ai = aspect ratio of the ith group
  N = Total number of groups in the network
5.6 Experimental Results 218
 5.6.2.4 Time taken
 This is defined as the time taken to layout the clustered network using the method
 under consideration, as determined by code surrounding the algorithm.
 5.6.3 Dataset
 We compared the performance of Squarified-Treemap, Croissant-Donut and Force-
 Directed GIBs on 309 Twitter networks. The networks each show the results of a
 search for tweets matching a certain word or hashtag. The nodes are Twitter users
 and the edges are created between any two users who mention, retweet, or reply to
 each other. These networks were collected by Marc Smith from Connected Action
 Consulting4 and are published on the NodeXL Graph Gallery [Smi+13].
 Table 5.1 describes some overall network metrics for the networks in our dataset.
 Since reporting values for individual networks was not feasible, I detail the mean,
 standard deviation, minimum, maximum and median values. All the networks
 were preprocessed to contain only the largest connected component, and the table
 reports its statistics. This was done to avoid the numerous uninteresting discon-
 nected singleton groups that exist in many social network datasets. The networks
 were then clustered using the Clauset-Newman-Moore algorithm [CNM04].
 4http://www.connectedaction.net/
5.
 6
 Experimen
 ta
 l
 Resu
 lt
 s
 21
 9
 Network Property Mean Standard Deviation Minimum Maximum Median
 Total number of Nodes 547.30 271.54 12.00 1462.00 541.00
 Total number of Edges 7820.80 7982.11 30.00 40352.00 5438.00
 Network Density ( 10 2) 1.25 1.24 0.07 9.04 0.83
 Network Modularity 0.27 0.03 0.15 0.38 0.27
 Average Geodesic Distance 3.04 0.69 1.72 7.31 3.00
 Total number of Groups 11.38 5.42 2.00 30.00 10.00
 Average Group size 52.52 35.34 4.00 236.50 43.38
 Total number of inter-group edges 1630.83 2315.98 2.00 14858.00 898.00
 Table 5.1: Overall network properties for the networks in our dataset.
 Property/Measure ST-GIB CD-GIB FD-GIB
 CD-GIB Experiments
 Donut always Croissant always
 Edge-Box-Overlap ( 10 2) 5.42 5.12 1.77 5.36 5.31
 Screen Space Wasted 0.00 2.04 58.72 17.50 2.03
 Time taken 811.00 744.00 951.00 765.00 739.00
 Mean Group-Box Aspect Ratio 1.05 2.06 1.00 3.47 2.04
 Table 5.2: Performance comparison of the two proposed approaches: CD-GIB and FD-GIB with the baseline
 ST-GIB layout. All figures reported above are median values computed for the complete dataset.
5.6 Experimental Results 220
 5.6.4 Results
 Each network in our dataset, after clustering, were laid out using each of the
 three GIB layouts and the various performance measures described above were
 computed. The aspect ratio of the screen was kept at 1.0 for all experiments and the
 inter-group edges were combined. After arranging the boxes on the screen space,
 the nodes belonging to individual groups were laid out within the corresponding
 group box using the Harel and Koren FMS layout [HK01].
 Table 5.2 presents the results of our experiments. For a given readability mea-
 sure, the highlighed cells represent the best performing method among the three
 GIB approaches. The columns titled ?Donut always? and ?Croissant always? present
 intermediate results corresponding to the two layout possibilities for the CD-GIB
 method (Section 5.4.2). The actual results for the CD-GIB layout are listed in
 the column titled ?CD-GIB? after automatically selecting the appropriate layout
 as described in Section 5.4.5.3. I also performed Student?s t-tests on these results
 and discuss the statistically significant (p<0.01) differences between treatments.
 From Table 5.2, we can see that the FD-GIB leads to very little edge-box-
 overlap (1:77  10 2) followed by CD-GIB (5:12  10 2), while ST-GIB leads to
 maximum overlap of 5:42  10 2. The statistical test revealed that the values
 obtained for FD-GIB were significantly different from others. However, the reduced
 overlap comes at the cost of an increased amount of space wasted. FD-GIB wastes
5.6 Experimental Results 221
 almost 59% of the screen space while ST-GIB wastes no space at all because of the
 use of highly space-filling treemap algorithm for laying out the group boxes. The
 space wasted by CD-GIB is 2% which is comparable to that of ST-GIB because
 like ST-GIB CD-GIB tries to ?pack? boxes next to each other. On the other hand,
 FD-GIB, lays out the boxes using one of the force-directed layouts which are not
 space-filling by nature. With regards to space wasted, all three methods were
 significantly different from each other according to the t-test results. The table
 also compares the three methods based on the time taken in milliseconds to lay
 out the complete network after clustering. We see that the time taken for all
 three methods are comparable with CD-GIB being the fastest (744ms); ST-GIB
 slightly slower with 811ms and FD-GIB taking 951ms. According to the t-test, the
 performance of FD-GIB was significantly different from others.
 Finally, Table 5.2 compares the three methods based on the aspect ratio of
 their group boxes. Since each network contains several group boxes each with a
 different aspect ratio, we compute Mean Group-Box Aspect Ratio as defined above
 and compare the median values over the complete dataset. We see that the aspect
 ratio for ST-GIB and FD-GIB are almost 1.0 and the difference between them was
 not statistically significant. However, group boxes in CD-GIB approach suffer from
 poor aspect ratio (median value of 2.06) which was worse than that of FD-GIB
 and ST-GIB and this result was statistically significant. CD-GIB leads to poor
5.6 Experimental Results 222
 aspect ratio because unlike ST-GIB and FD-GIB, this approach does not try to
 produce squarified rectangles. It instead determines one dimension of the group
 boxes using the corresponding dimension of the free-space boxes (in which it is
 being placed) available around the ?donut hole? or the ?croissant hole?. Most of the
 free-space boxes are huge rectangles of white space. Hence, if the group contains
 small number of nodes, its area would be small, but one of its dimensions would be
 same as the dimension of the free-space box leading to thin elongated rectangles.
 Table 5.2 also contains an intermediate result of comparing Donut and Crois-
 sant layouts on the same dataset. For this experiment, disregarding the paradigm
 presented in Section 5.4.5.3, all networks in the dataset were laid out using the
 Donut layout and the Croissant Layout separately. As seen from the table, the
 performances of Donut and Croissant are close in terms of overlap and time taken.
 Croissant outperforms Donut slightly in term of time taken while Donut beats
 Croissant for the overlap measure. For the other two measures, Croissant is sta-
 tistically significantly better than Donut for screen wastage and group-box aspect
 ratio. However, comparing these columns with the ?CD-GIB? column, which is ob-
 tained by selecting either Donut or Croissant layout for each network based on our
 paradigm, we see that CD-GIB seems to be benefiting from the strength of both
 the alternatives. This justifies our use of the paradigm for Donut vs. Croissant
 selection heuristic.
5.7 Summary 223
 5.7 Summary
 This chapter discusses meta-layouts, which leverage disjoint node groupings in
 order to dissect a network into more manageable, yet meaningful subnetworks
 that are displayed individually. The first meta-layout, called the Midichlorian-
 Directed Layout, uses a standard force-directed layout algorithm that has been
 modified so that groups are less strongly attracted to each other. Thus, the groups
 in the network float apart and are more easily understood in isolation. However,
 this approach requires substantial screen space and in dense areas of the network
 groups can still overlap. This makes it difficult to measure group sizes and their
 aggregate relationships. Moreover, the high scaling required means that individual
 nodes are challenging to see, much less read the labels of.
 To improve on this situation we developed three Group-in-a-Box (GIB) lay-
 outs that segment a network using the results of a topologic clustering or attribute
 grouping. Each group is laid out individually in a rectangular region of the screen,
 and we size each region according to the number of nodes it contains. The first
 layout, the Treemap GIB layout created by the NodeXL team [Rod+11], uses a
 squarified treemap algorithm [BHJVW00] to subdivide the screen space into group
 boxes with low aspect ratios, as shown in Fig. 5.9. The layout is completely space-
 filling, but can cause edge readability problems when large groups are positioned
 at opposite corners as in the innovation network in Fig. 5.33.
5.7 Summary 224
 The second layout, the Croissant-Donut GIB Layout, maintains much of
 the space-filling property of the Treemap but can show the group relationships
 more clearly. It comes in two variants: the Donut and the Croissant. In the
 Donut variant, the most connected group is placed in the center of the visualization
 and other groups are wrapped around its periphery (Fig. 5.13). Alternatively, the
 Croissant variant puts the most connected group at the top and places the other
 groups around its three sides (Fig. 5.15). The Donut layout is more effective at
 showing many small groups, while the Croissant is better for a few large groups.
 Our code chooses which of the two to use automatically depending on the distri-
 bution of group sizes. The Croissant-Donut layouts fill most of the visualization
 space while showing relationships more clearly, but aspect ratios can get especially
 high for small groups.
 Finally, I developed a Force-Directed GIB Layout that positions groups
 according to their aggregate relationships, followed by an overlap removal step
 that ensures the boxes do not intersect (Fig. 5.30). The overlap removal algorithm
 maintains the relative positions of groups while minimizing the additional space
 required. The resulting visualization requires a substantial amount of screen space,
 but uses the extra space to clearly show the relationships between groups. At the
 same time, the low aspect ratios of the group boxes helps offset their smaller size
 for showing internal group structure.
5.7 Summary 225
 We have also developed a few ways to automatically choose which to use de-
 pending on the network and group properties. By nesting the Group-in-a-Box
 layouts I can handle disconnected components better than other approaches. More-
 over, for certain numbers of groups and distributions of group sizes I pick the best
 layout for the user. Finally, I present several case studies and an experimental
 study to help validate the effectiveness of these layout techniques.
 Each of these Group-in-a-Box layouts have been implemented and made pub-
 licly available in NodeXL [Smi+10]. While the Croissant-Donut and Force-Directed
 GIB approaches have only recently been added, the Treemap layout has been avail-
 able since 2010 and is used extensively by users. Looking at the NodeXL Graph
 Gallery [Smi+13], most of the visualizations presented and almost all of those by
 Marc Smith (from Connected Action Consulting and leader of the NodeXL project)
 use the Treemap GIB layout. Dr. Smith intends to transition to the Force-Directed
 approach immediately for his work. This demonstrates the utility of these tech-
 niques for segmenting real networks into manageable, meaningful pieces ? especially
 in web environments where display space is limited and overviews are particularly
 useful. Moreover, the improved defaults for placing disconnected components will
 help all users of NodeXL, which has been downloaded more than 166,000 times
 and is used extensively for introductory network analysis courses.
Chapter 6
 Measuring Network Visualization Readability
 6.1 Introduction
 The results of applying force-directed layout algorithms can vary greatly depending
 on the size and topology of the network, and the layout generated is highly depen-
 dent on the algorithm used. Each algorithm attempts to find an optimal layout of
 the network, often according to a set of readability metrics (RMs) or heuris-
 tics. Readability metrics are measures of how understandable the network drawing
 is, based on artifacts such as the number of edge crossings or overlapping nodes in
 the drawing [DS09]. Traditionally these RMs have been called aesthetic crite-
 ria [PL96; Pur02], though several recent papers describe network visualizations in
 terms of readability instead of aesthetics ([GFC04; HBF08; Bon+09]). I call them
 readability metrics because of the ambiguity implied by the word ?aesthetic?. I am
 not concerned as much with how visually pleasing a particular network drawing is;
 instead I am interested in how well it communicates the underlying data. However,
 some of the most informative visualizations are also the most beautiful.
 226
6.1 Introduction 227
 (a) (b) (c)
 Figure 6.1: Different visualizations of the same network with many (a), few (b),
 and no (c) edge crossings.
 Optimizing the layout for specific readability metrics, or RMs, can lead to much
 more understandable drawings. For example, Figs. 6.1 and 6.2 show how reducing
 edge crossings can lead to more straightforward representations. Optimizing for
 RMs has been shown to promote many common analysis tasks, though it does not
 guarantee the resulting drawing is understandable. The particular RMs that the
 layout algorithms optimize intentionally or indirectly through heuristics may not
 be the correct ones for the tasks users are trying to accomplish. There are often
 substantial trade-offs in task performance when different RMs are optimized, and
 can result in ineffective, unintelligible, or even misleading drawings. For example,
 after reducing the number of edge crossings in a large drawing the spatial layout
 is oftentimes substantially distorted, and it can alter a viewer?s perception of the
 importance and centrality of individual nodes (see Section 6.2 and Fig. 6.6d for an
 example of this effect).
6.1 Introduction 228
 (a) (b)
 Figure 6.2: In the Planarity online game (www.planarity.net), users start with a
 planar network: one that can be embedded in two dimensions using straight edges
 with no crossings. Given a random network layout like (a) users try to manually
 eliminate crossings. The goal is to create a planar drawing like (b), which is the
 same network run through NodeXL?s [Smi+10] Harel-Koren FMS layout [HK02a].
 Additionally, as the optimization of many RMs is NP-hard [Bat+98], these
 techniques often produce suboptimal network drawings. The International Sym-
 posium on Graph Drawing has met annually for two decades working to improve
 automated network layout algorithms and RMs, among other things, but I believe
 that state of the art automated layout algorithms alone are insufficient to con-
 sistently produce understandable network drawings. Additional post-processing
 algorithms can improve the layout, but are limited in how much they can modify
 the layout. The layout algorithms available to end users depends on the network
 analysis tool being used, and post-processing techniques are rarely included and
 have difficulties with evolving networks.
6.1 Introduction 229
 Users can be made aware of the common problems RMs measure, or even
 quantitative values for RMs to optimize manually. However, current RMs only
 provide overall measures for the drawing without any means for focusing user
 attention on the problem areas. Users are not provided with any indication of where
 to start their manual improvements and how effective they have been. Seasoned
 network analysts develop an ingrained understanding of proper layout techniques
 and will adjust the spatial layout accordingly, but novice users are left to fend for
 themselves. Even expert users have difficulty applying their layout techniques to
 networks over a few hundred nodes. Furthermore, users may not be aware of the
 optimization trade-offs of particular metrics and how it affects task performance.
 Part of my dissertation work was to develop new readability metrics to mea-
 sure the effectiveness of node-link visualizations, including a set of novel node
 & edge readability metrics that provide more localized identification of where
 improvement is needed. As there are trade-offs when optimizing readability met-
 rics, I provide a survey of the related literature studying these trade-offs and the
 effect of specific metrics on user task performance. I also provide the design and
 implementation of an interactive optimization technique that provides users with
 visual metric feedback, helping them optimizing their drawings. This work aims to
 raise user awareness of network visualization readability issues, and applying these
 techniques will guide users in creating more effective node-link visualizations.
6.1 Introduction 230
 Instead of focusing only on purely automated network layout, I advocate raising
 user awareness of the importance of readability metrics for their network drawings
 and providing users with computer-assisted layout manipulation tools. Taking
 up where the automated layout leaves off, my tool gives users real-time feedback
 as to how their movement of nodes affect the RMs and provide local placement
 suggestions for the RMs users wish to optimize. I believe that this approach will
 provide users, and network analysts in particular, tools and guidelines that will
 allow them to create more understandable network drawings that more accurately
 highlight features of interest like communities within social networks.
 To enable this I detail several new readability metrics on a [0,1] continuous
 scale. Additionally, I define novel node & edge readability metrics to pro-
 vide more localized identification of where improvement is needed. The metrics
 can be used by a user to motivate improvement of the network drawing, either
 by hand, through immediate feedback techniques, or automatic improvement by
 feeding RM results back into a layout algorithm. I describe the trade-offs inherent
 in optimizing individual metrics as well as recommended metric optimizations for
 particular tasks. Several of the RMs and the interactive improvement techniques
 are implemented in SocialAction, a research network analysis tool that combines
 statistics with network analysis [PS06; PS08a; PS08b]. I have also begun inte-
 grating the metrics and improvement technique in NodeXL [Smi+10], a network
6.1 Introduction 231
 analysis template for Excel 2007/2010/2013, in order to direct users towards poor
 areas of the drawing and provide real-time readability metric feedback as users
 manipulate nodes and edges. The interaction functionality includes ranking and
 highlighting of nodes and edges by their metrics.
 6.1.1 Chapter Overview
 Specifically, the contributions of this chapter are:
  New global readability metrics to help understand different aspects of net-
 work visualization readability,
  Local readability metrics for individual nodes and edges to help users identify
 problem areas and fix them,
  A method for user-assisted layout improvement that provides real-time met-
 ric feedback to users in a ranked list and with a color scale,
  Implementations of readability metrics and the layout improvement tech-
 nique in SocialAction and NodeXL, and
  A survey of work on readability metrics and evaluations of their effectiveness
 on various network analysis tasks.
 This chapter is divided into several sections as follows. First, I describe the
 idea behind the user-assisted layout improvement technique and the SocialAction
6.2 Readability Metrics in SocialAction 232
 implementation in Section 6.2. This includes two case studies of the effectiveness
 of the approach. Next, I cover the NodeXL implementation in Section 6.3. Then I
 go into detail about specific readability metrics in Section 6.4 including a survey of
 their history and evaluations of their effectiveness. Finally I conclude in Section 6.5.
 6.2 Readability Metrics in SocialAction
 Several readability metrics (RMs) exist that measure the suitability of a network
 drawing as a whole, providing a single quantitative measure for the entire drawing.
 While these metrics can aid users in understanding that there is a problem, they
 do not highlight where the problems are occurring. To do so, we can provide
 additional attributes for both nodes and edges in the network that describe how
 these individual components affect the global understanding. I call these node
 readability metrics and edge readability metrics, or node RMs and edge
 RMs for short. This is an extension of the idea of individual node and edge
 metrics espoused in [HMM00]. Several of my metrics are detailed in Section 6.4,
 along with their individual motivations, including: node-node overlap, edge
 crossing, and node-edge overlap.
 I have implemented a prototype of the RM framework inside of SocialAction, a
 tool that uses attribute ranking and multiple coordinated views to help users sys-
 tematically explore various statistical measures for social network analysis [PS06;
6.2 Readability Metrics in SocialAction 233
 PS08a; PS08b]. In SocialAction, users can rank nodes and edges using ordered
 lists of the chosen attribute and simultaneously visually code the node-edge draw-
 ing using the ranking. Nodes remain in their original positions as users change
 the ranked attributes, which prevents the users from losing their mental map of
 the network. By combining multiple coordinated views with rapid transitions be-
 tween statistical social network analysis measures and additional node and edge
 attribute rankings, SocialAction affords network analysts a quick understanding of
 the network properties. Extreme-valued nodes and edges are highlighted particu-
 larly effectively through the combination of ranked lists and visual coding.
 I leveraged this attribute ranking system by incorporating preliminary node
 and edge RMs into SocialAction as node and edge attributes. Like any statistical
 measure or additional attributes in the dataset, users can now rank nodes and
 edges based on their individual RMs, highlighting problem areas in the network
 drawing. This allows them to rapidly flip between RM rankings and identify areas
 that would benefit from hand-tuning of the layout.
 Users can then utilize the interactive features of SocialAction which allow them
 to drag nodes or groups of nodes to new positions, attempting to manually op-
 timize the RMs. Node and edge RMs are computed in real-time for the nodes
 being dragged, and many global RMs can be selectively updated with these local
 computations to shortcut the computational complexity a complete recalculation
6.2 Readability Metrics in SocialAction 234
 Figure 6.3: SocialAction with the integrated Network Drawing Readability Metric
 framework rapidly shows problem areas in the network drawing highlighted in
 red and listed in a ranked table. It is currently showing a subset of the reply
 relationships within the Alberta Politics discussion newsgroup, and the network
 drawing has been optimized for the node occlusion and edge tunnel readability
 metrics. The steps in SocialAction?s Systematic Yet Flexible framework are shown
 along the top. The Network Readability panel (middle-left) shows node or edge
 readability metrics as well as global ones. The Rank Nodes panel at the far left
 ranks nodes by the edge crossing readability metric and provides the color scale
 for the Network pane.
 requires. This allows users to see how their movement of nodes affects both global
 and node RMs simultaneously, both in a Network Readability panel as well as
 real-time updating of the ranked list and color scale of the node-edge drawing.
 Moreover, users can switch between individual RMs and statistical measures while
 maintaining the same network layout and preserving any hand tuning they have
 already accomplished.
 Fig. 6.3 shows the SocialAction interface displaying a node-link visualization of
 reply relationships within a subset the Alberta Politics discussion newsgroup for
6.2 Readability Metrics in SocialAction 235
 which the node occlusion and edge tunnel readability metrics have been minimized.
 Across the top are the steps in SocialAction?s Systematic Yet Flexible framework,
 which allows for a guided and all-encompassing while still flexible approach to social
 network analysis, along with the Attribute Nodes panel for categorical coloring
 and the Network Readability panel (shown along the middle-left). The Network
 Readability panel shows the node or edge readability metrics for the selected items,
 as well as global readability metrics. The Rank Nodes panel (far left) shows a
 ranking of nodes by the edge crossing readability metric in decreasing order, with
 a filtering slider at the bottom. The large Network panel shows the node-edge
 drawing with color coding of nodes by their ranking in the Rank Nodes panel,
 with nodes having many edge crossings colored bright red. These are candidates
 for movement or resizing to reduce the number of edge crossings.
 6.2.1 Case Study: Alberta Politics Newsgroup
 The following figures demonstrate manual optimization of a network drawing. Un-
 derneath each figure are counts for the number of node occlusions (NO), edge
 tunnels (ET), and edge crossings (EC). Counts can usually be made available as
 tooltips, but for the RMs to be useful they must be independent of the network
 size, and are thus scaled to the continuous range from [0,1]. This requirement is
 made evident from the global count of 2954 edge crossings in the Alberta Politics
6.2 Readability Metrics in SocialAction 236
 discussion group network. Also note that figures which show a progression of draw-
 ings being optimized for a RM may change color scale, as the worst nodes become
 better. This relative scale is better at highlighting maximal existing metric values.
 Users can manipulate their drawings in order to minimize node occlusion using
 the node RM for it as a guide (Fig. 6.4). Coloring is scaled by the node RM, with
 bright red drawing user attention to areas of high occlusion. By relaxing the layout
 slider in SocialAction we can eliminate node occlusion entirely for this subset of
 the Alberta Politics dataset (Figs. 6.4a, 6.4b and 6.4d). This increases the default
 spring length used by the layout algorithm, allowing clusters of nodes to spread
 out and resulting in a larger drawing. Some networks, especially dense ones, may
 require manual tweaking. Another way to minimize occlusion is to reduce the size
 of labels. One way is to move from a full label to a distinctive yet concise one
 (Figs. 6.4c and 6.4e, though numeric ones are difficult to remember). Other ways
 include minimizing text margins in the nodes or font size.
 To reduce the number of edge tunnels in the drawing, users can rank and color
 by the node RM for local edge tunnels. Figs. 6.5a and 6.5b show a user removing
 edge tunnels by tuning node placement. This is easier for loosely connected nodes
 but can be difficult in dense areas. To reduce edge tunnels, we may have to
 increase the number of edge crossings. For manually tweaking the position of
 poorly connected nodes the local edge tunnel RM seems more useful. However,
6.2 Readability Metrics in SocialAction 237
 (a) NO:14, ET:70,
 EC:180
 (b) NO:4, ET:26, EC:159 (c) NO:1, ET:25,
 EC:180
 (d) NO:0, ET:14, EC:157 (e) NO:0, ET:12, EC:159
 Figure 6.4: Ranking and coloring with the node occlusion node RM shows areas
 of high occlusion in red. To reduce occlusion we can relax the layout by increasing
 default spring lengths ((a), (b), (d)). Note that this is not the same as merely
 increasing the size of the drawing: the adjustment of the parameters of the layout
 algorithm results in a somewhat different layout as well. We can also use shorter
 unique, trimmed, or simplified labels ((c) & (e)), in addition to hand-tuning node
 position as a final step. Note that color scales may change between figures as the
 worst nodes become better. Counts listed are node occlusion (NO), edge tunnels
 (ET), and edge crossings (EC).
6.2 Readability Metrics in SocialAction 238
 (a) NO:0,ET:14,EC:157 (b) NO:0,ET:0,EC:155
 Figure 6.5: Using the node RM for edge tunnels, users can see areas with edge
 tunnels in red (a) and manually adjust the layout to remove them (b).
 the triggered edge tunnel RM is better suited for moving highly connected nodes as
 it shows the effect a node has on its region of the drawing. As with node occlusion,
 one way of reducing edge tunnels is to shrink nodes.
 Similarly, Figs. 6.6b to 6.6d show a user removing edge crossings using the node
 RM for it. This is often a harder RM to minimize, as it is not always obvious how
 moving a node will eventually affect the total count. The process often involves
 trial and error, as well as multiple passes through each region of the drawing.
 Moreover, most social networks are not planar networks and can?t be represented
 without edge crossings. One of the easiest approaches is to pull tightly connected
 nodes near the edge farther out as in Fig. 6.6c, so that less central nodes can be
 placed between its connected edges. This has the unfortunate effect of significantly
6.2 Readability Metrics in SocialAction 239
 (a) Edge crossing NO:0,ET:0,EC:155 (b) Edge crossings removed (1/3)
 NO:0,ET:0,EC:114
 (c) Edge crossings removed (2/3)
 NO:0,ET:0,EC:90
 (d) Edge crossings removed (3/3)
 NO:0,ET:0,EC:85
 Figure 6.6: Likewise, the node RM for edge crossings shows users areas with lots of
 crossings (a) and lets them hand tune the layout to reduce them ((b)?(d)). Fig. 6.1
 gives a prime example for how minimizing edge crossings can greatly improve the
 readability of a drawing. Unfortunately, minimizing the number of edge crossings
 for less structured networks often results in an asymmetric drawing like (d) in which
 the centrality and angular resolution of many nodes is reduced, decreasing their
 perceived importance. For larger, less structured networks a balance must be struck
 between the number of edge crossings and the impact of further minimization on
 the spatial layout of the drawing. Note that color scales may change between
 figures as the worst nodes become better. Metrics listed are node occlusion (NO),
 edge tunnels (ET), and edge crossings (EC).
6.2 Readability Metrics in SocialAction 240
 worsening the angular resolution and spatial layout RMs, which can make the node
 seem less important or central than it is.
 Improving individual RMs can be beneficial for other RMs as well, though often
 there are tradeoffs between them users may have to weigh. Which RMs should be
 improved thus depends on what users are trying to convey with their drawings.
 Thus, it is imperative that users of network drawing software be made aware of
 which RMs their layout algorithms attempt to optimize and the effects various
 layout techniques have on how much of the underlying data is effectively conveyed.
 6.2.2 Case Study: New Testament Name Co-Occurrence
 In 2008 The New York Times published a node-link visualization of the co-occurrence
 of names appearing in the New Testament,1 shown in Fig. 6.7a. It used a force-
 directed layout drawn by IBM?s ManyEyes tool.2 While interesting, I believed that
 the drawing had substantial readability problems that could be improved by using
 my metrics.
 After loading the same dataset into SocialAction, the default force-directed
 layout rendered a quite similar drawing (Fig. 6.7b). After applying topological
 clustering using Newman?s fast heuristic [New04] and showing the clusters using
 convex hulls, much of the underlying structure could be discerned. I further im-
 1http://www.nytimes.com/imagepages/2008/08/31/business/31novelCA02ready.html
 2http://www-958.ibm.com/software/data/cognos/manyeyes/
6.2 Readability Metrics in SocialAction 241
 (a) (b) NO:23,ET:283,EC:2104
 (c) NO:0,ET:154,EC:2032
 Figure 6.7: Name co-appearance network from the New Testament. (a) is the orig-
 inal New York Times/ManyEyes visualization, while (b) shows the same network
 in SocialAction [PS06]. (c) shows the clusters found by Newman?s fast heuristic
 [New04] using convex hulls, and I optimized the layout using the node-node overlap
 and edge crossing metrics.
6.3 Readability Metrics in NodeXL 242
 proved on this layout by optimizing for the node-node overlap and edge crossing
 metrics, resulting in the drawing in Fig. 6.7c.
 One advantage of this new drawing (Fig. 6.7c) is that the separate clusters
 of individuals are much easier to discern than in the original drawing. It is also
 much easier to understand pivotal relationships that bridge the groups, like Peter.
 Moreover, there are no overlapping labels, though the zoom is lower. The main
 disadvantage of this drawing is in the kind of visceral reaction people may have to
 the movement of the Jesus node towards the periphery, with its group of connected
 singletons in the top left. Studies have shown that reducing the angular resolution
 of high-importance nodes like Jesus do not significantly impact task performance,
 however these kinds of modifications can substantially impact user perception of
 less important nodes.
 6.3 Readability Metrics in NodeXL
 I have begun implementing the readability metrics and automatic improvement
 technique inside NodeXL [Smi+10]. Fig. 6.8 shows the NodeXL interface with the
 readability metrics dialog in the foreground. The dialog allows the user to select
 which global, node, and edge metrics to calculate. Then the user can calculate the
 metrics on demand and optionally have NodeXL continue updating the metrics
 incrementally as the user manipulates the node-link visualization in the graph
6.3 Readability Metrics in NodeXL 243
 Figure 6.8: NodeXL showing the readability metrics dialog box (foreground), the
 nodes in the worksheet with their associated edge crossing and node overlap metric
 columns, and the graph pane where nodes and edges are colored by the edge
 crossing metric on a red-black scale. Nodes causing the most edge crossings are
 colored in bright red, as are edges with the most crossings. The network shown
 represents the legal moves in the board game Risk (see Section 5.5.1 for details).
 pane on the right. On the left side we can see the node worksheet, which has two
 additional columns populated for the calculated edge crossing and node overlap
 metrics. In this case, the edge crossing metric column has been used to color the
 nodes on a red-black scale to highlight nodes that cause edge crossing problems.
 Similarly, the edge worksheet (not visible) has column for edge crossings as well
 which was used to color the edges in the node-link visualization. With these tools
6.4 Specific Readability Metrics 244
 the user can immediately find the problem areas and make manual improvement
 with real-time color feedback.
 6.4 Specific Readability Metrics
 This section discusses several specific readability metrics (RMs), including the
 motivation for their use and the formulas I have created to quantify them. For more
 background and an introduction to my approach, see Sections 6.1 and 6.2. The
 following sections each deal with a specific metric I considered, and Section 6.4.18
 gives a brief overview of additional RMs that I have not yet implemented but
 appear valuable.
 As per [Pur02], each RM is scaled appropriately to a continuous scale from
 [0,1] where 1 indicates the positive maximum of the RM. This allows us to assign
 graph readability requirements to particular drawings based on the content and
 information we want the impart. For example, a journal may recommend 0% node
 occlusion, <2% edge tunneling, and <5% edge crossing to publish a node-link
 visualization, while having different suggestions for UML diagrams or other kinds
 of graphs. However, there are many useful graph drawings that violate these limits
 and they should not be eliminated based solely on the RMs.
 In these formulas I use a notation similar to that of [Pur02], where the graph has
 n nodes and m edges, indexed using subscripts. Using a technique called bends
6.4 Specific Readability Metrics 245
 promotion [Pur02], we can convert a polyline edge into several new straight line
 edges denoted m0 and replace the bends in the edges with new nodes denoted n0.
 6.4.1 Node-Node Overlap @n
 Euclid defined a point as that which has no part. Historically, graph layout algo-
 rithms were designed around these abstract graphs [LE02], with nodes taking
 up little or no space [WS79; Mis+95; LEN05]. However, practical graphs like
 sociograms or UML diagrams represent nodes using text, shapes, colors, pictures,
 and size [LE02]. Classical algorithms can thus frequently result in nodes with
 non-zero width and height overlapping one another in the graph drawing.
 This node-node overlap, also called overplotting, is contrary to accepted
 graph readability guidelines [Sug02], including those for trees [WS79] and UML
 diagrams [Eic03]. Moreover, areas of the drawing with high occlusion make it very
 difficult for the viewer to get an accurate count of the number of individual nodes
 in a cluster to get a sense of its scale. These problems can be reduced somewhat,
 but not entirely, through the use of a halo or fog effect around nodes to help
 distinguish them from each other.
 Many force-directed layout algorithms include node-node repulsive forces or
 equivalent constructs, including variants of the spring embedder [Ead84] such
 the popular Fruchterman-Reingold force-directed algorithm [FR91] and more scal-
6.4 Specific Readability Metrics 246
 able gravitational N-Body approaches like provided by Prefuse [HCL05] using
 the Barnes-Hut force calculation algorithm [BH86]. However, force-directed ap-
 proaches cannot usually guarantee all overlaps will be removed while the area and
 shape of the drawing are preserved because they rely on overly large repulsive
 forces or post-processing [GH09]. One notable exception is [HK02b].
 There have also been many algorithms developed for removing node-node over-
 laps using post-processing after an initial layout algorithm. These include variants
 of the force-scan method [EL92; Mis+95; LE02; Hay+02; HL03; LEN05], con-
 strained optimization [Mar+03; DMS06; DMS07], and force-directed approaches
 [LMR98; GN98; Hua+07]. One of the most effective approaches appears to be the
 PRoxImity Stress Model (PRISM) algorithm of Gansner and Hu [GH09], which is
 discussed in detail in Section 5.4.3.3 in the context of removing group box over-
 lap in a Group-in-a-Box layout and compared to Dwyer, Marriott, and Stuckey?s
 solve_VPSC algorithm [DMS06; DMS07].
 One option proposed by Li, Eades, and Nikolov [LEN05] is varying the edge
 lengths in a standard force-directed layout. While this preserves the orthogonal
 ordering well, it has scaling issues and can require excessive space [GH09]. An al-
 ternative is the Voronoi cluster busting algorithm of Lyons, Meijer, and Rappaport
 [LMR98] and used by Gansner and North [GN98] for their layout. This algorithm
 iteratively forms a Voronoi diagram for the layout and moves nodes to the center
6.4 Specific Readability Metrics 247
 (a) Tight layout (b) Relaxed layout
 Figure 6.9: We can eliminate the node occlusion that makes the central overlapping
 group in Fig. 6.9a so hard to understand by zooming out and increasing the the
 spring lengths of the layout algorithm (Fig. 6.9b).
 of their Voronoi cells. This roughly maintains the network shape, but loses much
 of the layout structure and again expands to take up a lot of screen space [GH09].
 Another interesting approach by Imamichi et al. [Ima+09] for 3D visualizations
 assumes labels extend from spherical nodes, models these masses with a set of
 spheres, and solves the sphere packing problem. This allows for arbitrary rotation
 and translation, but is not as suitable to 2D rectangles.
 Despite two decades of research into algorithms for node-node overlap removal,
 most widely used network visualization tools fail to properly reduce occlusion.
 Examples include Pajek [BM98], a common social network analysis tool, as well as
 our NodeXL [Smi+10]. In a recent user study [HHE06c] the authors had to hand
 tune the diagrams produced by Pajek to avoid occlusion. Fig. 6.9 shows how node
 occlusion can be eliminated by zooming out and increasing default spring lengths,
 at the cost of decreasing perceived clustering.
6.4 Specific Readability Metrics 248
 Node Occlusion Readability Metrics: I am not aware of any suitable ex-
 isting readability metrics for node occlusion. I suggest a global RM proportional
 to the number of uniquely distinguishable items in the graph drawing, where an
 item can be either a node or a connected mass of overlapping nodes. On a con-
 tinuous scale from 0 to 1, 1 indicates that every node is uniquely distinguishable
 from its neighbors (possibly including a spacing requirement) and 0 indicates that
 all nodes in the graph drawing are overlapping, creating one large connected mass.
 Similarly, a node RM can be proportional to the ratio of the node?s representation
 area (possibly including a spacing requirement) that is obscured by other nodes.
 Naturally there is no edge RM for node occlusion, however node occlusion is usu-
 ally grouped in the literature with edge tunneling (Section 6.4.8), which provides
 additional RMs.
 6.4.2 Global Readability Metric @n
 a = area
  
n[
 j=1
 bounds(nj)
 !
 (6.1)
 amax =
 nX
 j=1
 area(bounds(nj)) (6.2)
 @n =
 a
 amax
 (6.3)
6.4 Specific Readability Metrics 249
 6.4.3 Node Readability Metric @nj2Nn
 The regularized intersection of rectangles P and Q, denoted P \ Q, is the
 closure of the interior of the standard intersection P \Q. Regularization is used to
 remove lower-dimensional ?dangling? components (for instance, lines in 2D draw-
 ings) [Mou04].
 aj = area
  
n[
 k=1
 bounds(nj) \
  bounds(nk)
 !
 (6.4)
 @njn = 1 
aj
 area(bounds(nj))
 (6.5)
 6.4.4 Edge Crossing @c
 The number of edge crossings or intersections is the most widely accepted RM
 in the literature. In 1953, Moreno [Mor53] wrote, ?The fewer the number of lines
 crossing, the better the sociogram." Edge crossings is listed as an important general
 RM in many books on graph drawing, including [Bat+98; Sug02; War04], as well
 as for automated UML diagram layout [Eic03]. As with the Node-Node Overlap
 metric, the effect of edge crossings can be somewhat mitigated with a halo, fog,
 or border effect around the edges to help distinguish them from each other. Sub-
 stantial work has also been done in the design of graph drawing algorithms that
 specifically reduce the number of edge crossings, such as [STT81; ES90; FR91;
 CP96; DH96; Mut97].
6.4 Specific Readability Metrics 250
 Purchase?s seminal RM comparison user study identified edge crossings as hav-
 ing the greatest impact on human understanding of general graphs of the five
 RMs she studied [Pur97]. This finding has been empirically validated in [PCJ96;
 Pur98; PCA02]. These studies focus on edge tracing tasks like finding the length
 of the shortest path between two nodes, though use a global count of the num-
 ber of edge crossings. [War+02] suggests the number of edge crossings along the
 relevant edges is more important than a global measure. Additional evidence for
 the importance of edge crossing comes from [KA02], which deals with visualizing
 ordered sets. Moreover, user preference studies identify minimizing edge cross-
 ings as the most important RM for UML diagrams [PAC02; PCA02] as well as
 for node-link visualizations [HHE06a], and when given the option of improving on
 an initial force-directed or random layout, users created graph drawings with 60%
 fewer edge crossings on average [HR08]. [KA02] theorizes that crossed lines could
 be salient properties which distract the user?s visual system from the relationships
 the drawing was designed to convey.
 However, [Mut97] suggests that allowing some edge crossings can sometimes re-
 sult in more readable graph drawings and recent literature points to restricting edge
 crossing angles being almost as effective as reducing edge crossings (Section 6.4.9).
 Furthermore, recent research on node-link visualizations comparing edge tracing
 tasks like finding groups to node importance tasks indicates that while reducing
6.4 Specific Readability Metrics 251
 edge crossings improves edge tracing task performance and user preference, it has
 little effect on node importance tasks [HHE06b; HHE05; HHE07]. This was further
 verified in eye tracking studies [Hua06; Hua07b; HEH08]. They postulate that this
 indicates the effects of edge crossings can vary depending on the situation. Fur-
 ther discussion of the cognitive load imposed by edge crossings quantified using
 eye tracking is in [K0?4; HHE06c; Hua07a; HEH08]. Fig. 6.1 demonstrates how
 reducing edge crossings can lead to a much more understandable drawing.
 6.4.5 Global Readability Metric @c
 I take from [Pur02] the global RM for edge crossings (@c) based on c, the number
 of pairwise edge crossings in the drawing. Scaling by an approximate upper bound
 for the number of crossings in the drawing, I can produce a metric over [0; 1].
 call =
 m0X
 i=1
 (i 1) =
 m0(m0  1)
 2
 (6.6)
 cimpossible =
 1
 2
 n0X
 j=1
 deg(nj)(deg(nj) 1) (6.7)
 cmx = call  cimpossible (6.8)
 @c = 1 
8
 >><
 >>:
 c
 cmx
 if cmx > 0
 0 otherwise
 (6.9)
6.4 Specific Readability Metrics 252
 Here, deg(nj) is the degree of node nj. First, I calculate call, the number of
 crossings if every pair of edges intersect. Of those, I remove cimpossible, the impos-
 sible intersections of edges connected to the same node in a straight-line drawing.
 This leaves us with cmx, a (probably high) upper bound to the number of crossings
 in the drawing. Scaling c by cmx and subtracting from 1 I get the global RM for
 edge crossings @c. I can report all c crossings of m0 edges in O(m0 logm0 + c) time
 and O(m0) space [Mul91] rather than testing all cmx pairs. cmx can be computed in
 O(n0) time, though only needs to be calculated once. If the graph topology is dy-
 namically changing, only those nodes with modified degree ( n0) need to be used
 to recalculate cimpossible in O( n0) time and the added or removed edges must be
 fed back into the calculation of c. Similarly, if the layout is dynamically changing,
 then c must be updated for all edges whose location has changed. See [Mou04]
 for a discussion of various algorithms for line segment intersection reporting. The
 ability to use precomputed results to only test the modified edges for intersec-
 tions naturally depends on the choice of algorithm, though some like [Mul91] are
 iterative and seem particularly suited for the addition of new edges.
 6.4.6 Edge Readability Metric @ei2Ecei
 My edge RM for edge crossings (@ei2Ecei ) is defined for any edge ei based on the
 number of pairwise edge crossings cei between it and any other edge in the drawing.
6.4 Specific Readability Metrics 253
 Scaling as before, I can produce a metric over [0; 1]. With this metric I can identify
 the edges with the most crossings in the drawing.
 ceiall = m
 0  1 (6.10)
 ceiimpossible = deg(src(ei)) + deg(tar(ei)) 2 (6.11)
 ceimx = c
 ei
 all  c
 ei
 impossible (6.12)
 = m0  deg(src(ei)) deg(tar(ei)) + 1 (6.13)
 @ei2Ecei = 1 
8
 >><
 >>:
 cei
 c
 ei
 mx
 if ceimx > 0
 0 otherwise
 (6.14)
 ceiall is the number of edges ei could intersect in the drawing, of which I can
 remove the impossible intersections ceiimpossible. Edges that have the same source
 or target node as ei (src(ei) and tar(ei), respectively) cannot intersect ei in a
 straight-line drawing. Thus I have ceimx, an upper bound to the number of edges
 crossing ei. Scaling cei by ceimx and subtracting from 1 I get the edge RM for edge
 crossings.
 6.4.7 Node Readability Metric @nj2Ncnj
 My node RM for edge crossings (@nj2Ncnj ) is defined for any node nj based on c
 nj ,
 the sum of the number of crossings its connected edges have (triggered crossings).
6.4 Specific Readability Metrics 254
 Again, I scale to a continuous metric scale of [0; 1]. This allows us to identify the
 nodes whose positions are the cause of many edge crossings.
 cnj =
 X
 ei2edges(nj)
 cei (6.15)
 cnjmx =
 X
 ei2edges(nj)
 ceimx (6.16)
 =
 X
 ei2edges(nj)
 m0 + 1 deg(src(ei)) deg(tar(ei)) (6.17)
 =
 X
 ei2edges(nj)
 m0 + 1 deg(nj) deg(adj(nj; ei)) (6.18)
 = deg(nj)(m
 0 + 1 deg(nj)) (6.19)
  
X
 ei2edges(nj)
 deg(adj(nj; ei)) (6.20)
 @nj2Ncnj = 1 
8
 >><
 >>:
 cnj
 c
 nj
 mx
 if cnjmx > 0
 0 otherwise
 (6.21)
 Here edges(nj) is the set of all edges connected to node nj. I define an upper
 bound to the number of edge crossings of connected edges cnjmx as the sum of the
 individual edge upper bounds ceimx from the edge RM. For all connected edges, I can
 pick the current node nj as either the source or the target, and use the adjacent
 node along edge ei, denoted adj(nj; ei), as the other. As deg(nj) = jedges(nj)j, I
 get the formula for cnjmx. Again scaling cnj by c
 nj
 mx and subtracting from 1 I get the
6.4 Specific Readability Metrics 255
 (a) Original layout (b) After removing
 edge tunnels
 Figure 6.10: In Fig. 6.10a it is difficult to tell which edges connect to which nodes
 because of the number of edge tunnels. By zooming out and hand tuning the
 layout (Fig. 6.10b) we can completely eliminate edge tunnels (but not crossings).
 node RM for edge crossings.
 6.4.8 Edge Tunnel
 There is little literature dealing with nodes occluding edges and vice versa, and it is
 often lumped together with node occlusion (Section 6.4.1). Because of the limited
 definitions available for this RM, I will call the specific case of a node occluding an
 edge an edge tunnel. The reverse can be called an edge bridge, but as many
 modern graph drawing tools (e.g. SocialAction [PS06], NodeXL [Smi+10]) draw
 nodes with higher priority than edges I am ignoring this case.
 Both cases are accounted for by the simulated annealing graph drawing algo-
 rithm from [DH96], which incorporates the distance between every node and edge
 in a fine-tuning step. [Sug02] calls avoiding edge tunnels a basic rule, and for
 UML diagrams, [Eic03] specifies that nodes should not be too close to edges un-
6.4 Specific Readability Metrics 256
 less they are connected or a more important RM forces their proximity. However,
 many algorithms do not take this into account, including [LEN05] and the com-
 monly used Fruchterman-Reingold algorithm [FR91]. Even tools using algorithms
 that remove edge tunnels are not guaranteed to do so. The excellent user study
 [War+02] used 200 generated graph drawings with 42 nodes each, of which the
 results from 7 graph drawings had to be excluded from the final analysis because
 of unexpected edge tunnels that implied nonexistent connections. The standard
 users of graph drawing tools are more likely to overlook such problems than RM
 researchers. Fig. 6.10 shows how zooming out and hand tuning a layout to reduce
 edge tunnels allows for a much clearer picture of the network topology.
 Edge Tunnel Readability Metrics: The global RM for edge tunnels can be
 built upon the global RM for edge crossings (Section 6.4.4), comparing the number
 of edge tunnels in the graph drawing to an appropriate upper bound. A simple
 edge RM is thus an appropriate scale of the number of edge tunnels that edge
 has. Local edge tunnels is defined as a node RM for the number of edges that
 tunnel under that node. An second node RM for triggered edge tunnels, the
 edge tunnels of all edges connected to that node, can be specified in terms of the
 combined edge RMs for those edges.
6.4 Specific Readability Metrics 257
 (a) Original layout (b) After making edge
 crossings more perpen-
 dicular
 Figure 6.11: In edge tracing tasks such as finding the length of the shortest path
 between the bottom right and top left nodes in Fig. 6.11a, increasing the edge
 crossing angles approaching 90 degrees (Fig. 6.11b) improves user path finding
 performance.
 6.4.9 Edge Crossing Angle @eca
 The impact of edge crossing angles was first introduced as a global RM by
 [War+02], which is based on a neurophysiological view of the user. Ware et al.
 claim rapid early-stage neural processing causes certain features to ?pop out? to
 users, and that these neurons are coarsely tuned when examining angles, roughly
 between +/- 30 degrees. Though they did not find the impact of edge crossing an-
 gles to be significant, they did find that another angular measure, path continuity,
 was. This neurophysiological view supplies an explanation for the results of [HE05;
 Hua06; Hua07b; Hua07a; HHE08], which use an eye tracking user study to verify
 that the angle of edge crossings has a significant impact on user response time for
 edge tracing tasks. Moreover, response time significantly decreased as the cross-
6.4 Specific Readability Metrics 258
 ing angle tended towards 90%, though tended to level off or even slightly increase
 beyond 70%. This is attibuted to extra back-and-forth eye movements around ac-
 cute crossings. However, as the size of the graph increases creating longer searching
 paths, the impact of even near-perpendicular crossings can build up and become
 significant [Hua07b]. See Fig. 6.11 for a demonstration of how more perpendicular
 edge crossing angles promote path finding tasks.
 Edge Crossing Angle Readability Metrics: I believe the global RM for
 angular resolutioncan be modified to incorporate the average deviation of edge
 crossing angles from the ideal angle of  70 degrees instead. [War+02] uses the av-
 erage cosine crossing angle as their global RM metric, and my planned experiments
 with these metrics may suggest that modification as well. The associated edge RM
 follows simply by removing the sum over all nodes and the relevant scaling. The
 node RM is somewhat harder to define, though it can be based on the combining
 the edge RMs for the node?s connected edges.
 6.4.10 Angular Resolution (min) @arm
 The angular resolution RM refers to the minimum or average angle formed by all
 the edges incident to an individual node. This section discusses both but defines
 the minimum metric. [STT81] and [For+93] dealt with this early on, and [Pur02]
 defines a minimum angle metric called @m. [Pur97] found this metric had no effect
6.4 Specific Readability Metrics 259
 on path finding tasks, but it was found significant for recognizing actor status by
 [HHE06b].
 6.4.11 Global Readability Metric @arm
 d =
 1
 n
 nX
 j=1
 dnj (6.22)
 =
 1
 n
 nX
 j=1
  
 
 
 
#j   jmin
 #j
  
 
 
 (6.23)
 @arm = 1 d (6.24)
 6.4.12 Node Readability Metric @nj2Narm
 dnj =
  
 
 
 
#j   jmin
 #j
  
 
 
 (6.25)
 #j =
 360 
deg(vj)
 (6.26)
 @njarm = 1 d
 nj (6.27)
 6.4.13 Angular Resolution (avg) @ara
 This metric is similar to the minimum angular resolution RM discussed in Sec-
 tion 6.4.10 and is described there.
6.4 Specific Readability Metrics 260
 6.4.14 Global Readability Metric @ara
 d =
 1
 n
 nX
 j=1
 dnj (6.28)
 =
 1
 n
 nX
 j=1
 0
 @
 1
 deg(nj)
 deg(nj)X
 i=1
  
 
 
 
#j   i;(i+1)%deg(nj)
 #j
  
 
 
 
1
 A (6.29)
 @ara = 1 d (6.30)
 6.4.15 Node Readability Metric @nj2Nara
 dnj =
 1
 deg(nj)
 deg(nj)X
 i=1
  
 
 
 
#j   i;(i+1)%deg(nj)
 #j
  
 
 
 (6.31)
 @njarm = 1 d
 nj (6.32)
 where #j is the same as in Section 6.4.10.
 6.4.16 Visualization Coverage Metric @vc
 The visualization coverage or ink metric denoted @vc is my attempt to quantify
 the amount of screen space used by the visual items in a visualization compared to
 the entire space available. It is formulated as the area occupied by all visual items
 divided by the area of the screen space. The objective of this metric is to measure
 the amount of theoretically available screen space, so as to quantify the reduction
6.4 Specific Readability Metrics 261
 in in ink presented to the user after filtering (Section 3.3.1) or motif simplification
 (Chapter 4). It can also measure the reduction in ink by using aggregate edges (or
 no edges) between groups in the Group-in-a-Box layouts (Chapter 5).
 Here I use a notation of a network or graph G with jG:nodesj nodes and
 jG:edgesj edges and a network visualization V (G). Each individual node n 2
 G:nodes and edge e 2 G:edges is indexed using subscripts (e.g., ni; ej). For any
 node, edge, or visualization k, bounds(k) indicates a bounding shape b for that
 item in the visualization, and area(b) denotes the area of that bounding shape.
 The visualization coverage metric @vc is defined as follows:
 bn =
 [
 n2G.nodes
 bounds(n) (6.33)
 be =
 [
 e2G.edges
 bounds(e) (6.34)
 a = area(bn [ be) (6.35)
 namax = argmax
 ni2G.nodes
 area(bounds(ni)) (6.36)
 eamax = argmax
 ej2G.edges
 area(bounds(ej)) (6.37)
 a = max(namax; eamax) (6.38)
6.4 Specific Readability Metrics 262
 amax = area(bounds(V (G))) (6.39)
 @vc =
 a a 
amax
 (6.40)
 First, a union is computed of all the node bounding shapes and edge bounding
 shapes in the visualization, including all meta-nodes and meta-edges. In order
 for the metric to have a range of [0; 1], this area a must have the maximum node
 or edge area a subtracted from it. This quantity is then divided by the total
 visualization area.
 6.4.17 Group Overlap
 In Algorithm 7, I describe an algorithm for counting the number of overlaps be-
 tween groups (sets) of nodes in the network and the remaining nodes. It first
 computes a convex hull for each group, then finds the number of nodes outside
 the group that overlap with the convex hull. The objective is to measure how the
 original layout of the group affects users? perceptions of group membership, and
 how an alternate layouts improve on these perceptions. This measure is applicable
 to both motif simplification (Chapter 4) and meta-layout (Chapter 5).
 I believe that convex hulls are more appropriate for this measure than alterna-
 tives like concave hulls because (1) convex hulls more accurately model the way
 users perceive regions of the network, and (2) it is more efficient to find inter-
6.4 Specific Readability Metrics 263
 Algorithm 7 Calculate the number of group-node overlaps for each group
 1: groups = set of all groups, where each group is a set of points (xi; yi)
 2: hullCounts = [];
 3: for all g 2 groups do
 4: count = 0
 5: hull = grahamScan(g)
 6: for all node 2 G:nodes j node =2 g do
 7: if intersects(hull; node) then
 8: count = count+ 1
 9: hullCounts:add(count)
 return hullCounts
 sections between convex polygons than simple polygons [Mou04]. Additionally,
 colored convex hulls are often used to show network group structure (e.g., [PS06]).
 Two functions are called in Algorithm 7 which we assume are defined elsewhere.
 The first, grahamScan(S), is the Graham scan algorithm3 for computing a convex
 hull of a finite set of points in O(n log n) time, where n is the number of points
 (nodes), in this case jgj. The second, intersects(a; b), computes the intersection of
 two convex polygons in O(log n) time, where n is the count of the nodes in a and
 b [DK83; Mou04].
 The time complexity of Algorithm 7 is derived below, where jnodejj is the
 number of sides of the polygon representing a particular node nodej. The other
 uses of jsj indicate the size of the enclosed set s. E.g., jgij is the number of nodes
 3http://en.wikipedia.org/wiki/Graham_scan
6.4 Specific Readability Metrics 264
 in the set gi.
 time = timea + timeb (6.41)
 timea =
 jgroupsjX
 i=1
 jgij log jgij; where gi 2 groups (6.42)
 timeb = jgroupsjjnodesj log(max
 i
 (j hull(gi)j) + max
 j
 (jnodejj)) (6.43)
 As 8i; j hull(gi)j  jgij  maxi(jgij)  jnodesj, and as maxj(jnodejj) is a con-
 stant for the highest degree polygon used as a node shape,
 timea  jgroupsjmax
 i
 (jgij) log max
 i
 (jgij) (6.44)
  jgroupsjjnodesj log max
 i
 (jgij) (6.45)
 timeb = O(jgroupsjjnodesj log max
 i
 (jgij)) (6.46)
 time = O(jgroupsjjnodesj log max
 i
 (jgij) (6.47)
 Thus, the time complexity of Algorithm 7 is given in Eq. (6.47). As jgroupsj  
jnodesj and maxi(jgij)  jnodesj, another (much worse) upper bound would be
 jnodesj2 log jnodesj.
6.4 Specific Readability Metrics 265
 6.4.18 Additional Readability Metrics
 There are many potential RMs that can be taken into account to produce effective
 graph drawings, and each impacts how understandable the final product is and
 how successfully it imparts the author?s message. Many that I am investigating
 for standardization and inclusion in my framework are briefly discussed below.
 Node Size: The size of nodes in the graph drawing can significantly affect
 node occlusion, edge tunneling, and the ability of users to see shapes and colors
 as well as read labels. I suggest outlining four size constraints depending on the
 amount of information to be displayed. Displaying the location of the node only
 requires representing a point, while adding properties like color and shape to in-
 dicate additional attributes requires more space to be identifiable. Nodes must be
 even larger yet in order to display meaningful text labels within the node, which
 are dealt with more in the following two RMs.
 Node Label Distinctiveness: In many graph drawings node labels must
 be truncated to limit node occlusion and edge tunneling. As it is important to
 have uniquely identifiable and meaningful labels, users should attempt to remove
 common prefixes (e.g. ?Department of? in an organization network). A RM for
 assessing the distinctiveness of individual labels in the drawing would draw atten-
 tion to these problems, but must be flexible enough to accommodate unexpected
 prefixes. A potential solution might be found through the use of suffix trees.
6.4 Specific Readability Metrics 266
 Text Legibility: Similarly, the text must be sized and formatted appropriately
 so that it is readable in the final drawing. If this is not possible, the text should
 be removed to reduce node occlusion, edge tunneling, and the size of the graph. A
 common measure for this is the angle subtended by the text from the users point
 of view, though this may be difficult to translate into a RM.
 Node Color & Shape Variance: As users have substantial difficulty in-
 terpreting a graph drawing using too many distinct shapes or colors to represent
 attributes, a RM should be defined that indicates the difficulty of keeping those
 combinations in memory. This might limit the publication of drawings with exces-
 sive shape and color coding.
 Edge Bends: [ES90] stated that edges in a graph drawing should be as
 straight as possible. While the examples here deal with only straight-line drawings,
 edges with bends can be very useful for some types of graphs like UML diagrams.
 [Pur02] defines a RM for edge bends, while [Pur97] found that they have an impact
 on path finding tasks.
 Path Continuity: How continuous a path is is inversely related to the number
 and size of its bends. [War+02] defines continuation at a node as ?the angular
 deviation from a straight line of the two edges on the shortest path which emanate
 from the node.? The sum of these deviations provides the basis for a path continuity
 RM. Their user study found path continuity to be significant for path finding tasks.
6.4 Specific Readability Metrics 267
 Geometric-path tendency: A path between two nodes in a graph drawing
 can ?become harder to follow when many branches of the path go toward the target
 node? [Hua07b]. This is known as the geometric path tendency. Though a RM is
 not obvious, developing one may result in graph drawings better suited for edge
 tracing tasks.
 Orthogonality: [Pur02] defines a RM for orthogonality using measures for
 the extent nodes and edges in the graph drawing follow the points and lines of an
 imaginary Cartesian grid. Orthogonality is important for some kinds of drawings,
 especially those of UML class diagrams [PAC02] and other hierarchical structures.
 However, it is unimportant and can even be misleading for node-link visualizations,
 as by placing nodes along imaginary lines the visualization implies to viewers that
 horizontally or vertically adjusted nodes are related [KA02]. Node and edge RMs
 for orthogonality would likely be of limited use.
 Symmetry: [LNS85] observed that a graph drawing is ?good? when it displays
 as many symmetries as possible. This was verified by [Pur97] and a RM for axial
 symmetry is provided by [Pur02]. Like for orthogonality, node and edge RMs for
 symmetry are of limited value.
 Spatial Layout & Grouping: The spatial layout of nodes in a graph drawing
 has a substantial impact on the ability of users to ascertain the importance of actors
 in the network as well as identifying groups or communities of them [MBK97]. A
6.5 Summary 268
 RM for this might compare how effectively the visual grouping of nodes in the
 graph drawing conveys groupings found via a community algorithm that operates
 only on the structure of the graph.
 Edge Length: The most common algorithms for node-link visualization layout
 are the many variations of the spring embedder [Ead84], which attempt to reduce
 the variance of intra-node distances in the graph drawing. However, [HR08] found
 that users prefer to space clusters of nodes proportional to number of connecting
 edges between them. This might lend credence to a RM that analyzes the strength
 of relationships between clusters and compares that to the actual visible separation,
 though optimizing the RM would be difficult when using spring or force based
 layout algorithms.
 Path Branches: The number of edges branching from shortest paths within
 the graph drawing can also have an affect on path finding tasks [War+02]. A global
 RM might compute the number of branches along each shortest path in the graph
 drawing as a measure of the general difficulty of edge tracing tasks.
 6.5 Summary
 My user studies, case studies, and experiments demonstrate the utility of motif
 simplification and Group-in-a-Box layouts for network visualization, but I am also
 interested in improving the effectiveness of general node-link visualizations. By
6.5 Summary 269
 quantifying the readability of a layout we can guide analysts in making improve-
 ments and in the future feed the results in automatic layout algorithms. Past
 work provides definitions for several global readability metrics, which measure
 detrimental features like edge crossings and rate the layout as a whole. However,
 a single value is not enough to direct users to problem areas of the layout, which I
 address by introducing local readability metrics for individual nodes and edges.
 Moreover, I introduce several new global metrics to detect readability problems like
 node-node overlap and edges tunneling under nodes (node-edge overlap).
 I leverage these metrics in a new method for user-assisted layout improvement.
 By computing the metrics in real-time as users manipulate the layout, I provide
 immediate visual feedback to users as they optimize their visualization, showing
 how they are affecting readability. As there are trade-offs when optimizing specific
 readability metrics, I include a survey of the related literature studying each of
 these metrics and their effect on user task performance. My evaluations indicate
 that these readability metrics help users create more effective node-link visual-
 izations, and I plan to release both the metrics and layout improvement tool as
 part of NodeXL [Smi+10]. These metrics and the improvement technique were
 additionally implemented as part of SocialAction [PS06; PS08a; PS08b], though I
 have not made this code publicly available due to the research prototype nature
 of SocialAction.
6.5 Summary 270
 This work aims to raise user awareness of network visualization readability
 issues, and applying my optimization technique will guide users in creating more
 effective network visualizations. I believe that many currently published networks
 could be substantially improved with a few modest refinements based on these
 readability metrics. While no set of requirements can fully capture all effective
 network drawings, I believe that applying select RMs for the task at hand will
 improve most network authors? output. These principles will need refinement to
 deal with large networks where node aggregation, edge bundles, and cluster markers
 may be necessary to allow users to make scalable comparisons.
Chapter 7
 Conclusion and Future Directions
 7.1 Conclusion
 My dissertation contributes techniques for understanding and improving the read-
 ability of node-link network visualizations. First, I present motif simplification,
 a technique for reducing the complexity of node-link visualizations. With motif
 simplification, common repeating network motifs are replaced with easily under-
 standable motif glyphs that require less space, are easier to understand, and reveal
 hidden relationships. While users must learn the visual language of motifs and
 glyphs, there is a dramatic payoff in the usability and readability of the visualiza-
 tion. I contribute design guidelines for motif glyphs; designs of glyphs to replace
 the high-payoff fan, connector, and clique motifs common in networks; as well as
 algorithms to identify these motifs. I have also developed a free and open source
 reference implementation, made publicly available as part of NodeXL [Smi+10],
 and I present results from a controlled study of 36 participants that demonstrates
 the benefit of motif simplification for many common network analysis tasks.
 271
7.1 Conclusion 272
 An important part of network analysis is understanding the community struc-
 tures that are present, and highlighting these features can provide immediate in-
 sights during an exploration. Standard approaches for showing communities using
 color, shape, convex hulls, or layout algorithms do not sufficiently expose commu-
 nity membership, internal structure, and inter-community relationships. I address
 this problem with three meta-layouts that subdivide complex networks based on
 their community structure. The first, the Midichlorian-Directed Layout, uses a
 force-directed layout to visually separate clusters. The other two Group-in-a-Box
 (GIB) layouts display each community laid out individually within its own box,
 sized according to the number of nodes therein. The Fitted-Rectangles GIB layout
 arranges the boxes to optimize the space used while still showing inter-community
 relationships. The Force-Directed GIB layout, alternatively, arranges community
 boxes based on their aggregate ties at the cost of additional space. My implemen-
 tation in NodeXL [Smi+10] automatically chooses the most appropriate Group-in-
 a-Box layout to best show disconnected components and different numbers or sizes
 of communities. Several case studies and an experimental study of 309 Twitter
 networks demonstrate the utility of the proposed layouts, especially for presenting
 the aggregate relationships between communities.
 Third, my dissertation contributes a set of global and local readability metrics
 to help users understand and improve their node-link network visualizations. The
7.1 Conclusion 273
 global metrics can be used to evaluate the effectiveness of a particular layout of the
 node-link visualization. Additionally, the local metrics are implemented within the
 analysis tool to help users identify problem areas in the visualization using color
 coding, and the metrics and associated colors are updated in real time as users
 manipulate the visualization. This provides them with immediate feedback as to
 how they are affecting the visualization?s readability. The basics of this technique
 are implemented in NodeXL [Smi+10] and SocialAction [PS06], another tool for
 network analysis. This work provides an improved understanding of node-link
 visualization readability, the trade-offs when optimizing for specific tasks, and
 techniques users can use to improve their visualization. My hope is that it will
 encourage developers to take network visualization readability into account when
 designing analysis tools, as well as help educate users about these issues.
 The three techniques I present can be used together or individually to help cre-
 ate more effective visualizations, especially for novice users. For example, a user
 could apply a Group-in-a-Box layout to highlight the clusters in the network, which
 are then laid out individually using motif simplification. The user could then use
 the interactive readability metric improvement tool to optimize the layout for pre-
 sentation. The reference implementation of these techniques in NodeXL [Smi+10]
 will be particularly useful for novice users, as NodeXL is frequently used for teach-
 ing introductory courses on network analysis. It is my hope that these strategies for
7.2 Future Directions 274
 improving visualizations of large, complex networks will demonstrate that progress
 is possible, and will provide several starting points for other researchers exploring
 additional ways to visualize networks.
 7.2 Future Directions
 This dissertation opens up several interesting avenues of research on node-link
 network visualizations. Below I detail specific opportunities for leveraging my work
 on motif simplification, Group-in-a-Box meta-layouts, and readability metrics to
 handle even larger and more complex datasets.
 7.2.1 Motif Simplification
 My studies indicate that motif simplification is an effective way of reducing node-
 link visualization complexity, but it does pose several challenges and opens up
 many avenues for future work. These include better education amd explanation of
 the motifs and their associated glyphs, but also additional techniques for showing
 more of the underlying network information and scaling to larger datasets. At
 the cost of having larger and more complex glyphs, additional details like directed
 edges, approximate topology, and node attribute distributions can be exposed.
7.2 Future Directions 275
 7.2.1.1 Visual Complexity and Education
 The visual complexity of multiple glyphs can require time for users to understand
 and train their eyes/mind to recognize them. As such, I have tried to keep the
 visual lexicon as small as possible, for example by using the same connector motif
 glyph for any number of anchors instead of creating many variants (Fig. 4.3). I
 also made several changes after the initial pilot study to improve user perception,
 including changing the crescent connector motif glyph to a more effective tapered
 diamond glyph (Fig. 4.2). However, my task-based study showed that users had
 still had difficulties with topology-based tasks when using motif simplification (Sec-
 tion 4.5). Part of this can possibly be attributed to the loss of edge information
 that occurred before I started using sized meta-edges between motifs (e.g., Figs. 4.9
 to 4.11), which I have not yet tested.
 I believe the main issue, though was that participants were only given a few
 minutes to understand the basics of node-link network diagrams as well as any
 translations between motifs and their associated glyphs. While participants had
 a legend available to them throughout the study, it did not seem to be enough
 to ensure users understood the translations. User education is likely the most
 promising way to improve the glyph performance, either through additional pre-
 liminary training or time spent using the techniques and becoming comfortable
 with the translations. Currently in NodeXL there may be the extra effort required
7.2 Future Directions 276
 to learn the motif concepts and interpret the glyphs, which may deter some users,
 but simplification is a user choice which can be reversed at any time.
 Another option would be to use more heavyweight glyphs that expose more of
 the underlying information to the user. Several of these approaches are discussed
 below for showing edge directionality, approximate motifs, arbitrary motifs, and
 attribute distributions. However, I have tried to strike a balance between showing
 the underlying information and maintaining a small visual lexicon, as well as keep-
 ing glyphs small and understandable at a distance. Heavyweight glyphs expose
 more, but at a substantial cost of visual clutter and space required.
 7.2.1.2 Edge Directionality
 Many networks have the added complexity of edge directionality, which is impor-
 tant for some tasks like determining information flow and trust analysis. For tasks
 on directed networks like path-finding, the underlying edge directionality needs
 to be taken into account in the glyph design so as to show these flows. I began
 working on this problem and developed an effective technique for subdividing fan
 glyphs without requiring any labels or annotations to show directionality. An ex-
 ample of this is shown in Fig. 7.1, with extra arrows around the edges that are not
 part of the glyphs.
 The example directed fan motif in Fig. 7.1 is divided into three representatively
 sized sectors, each representing a different directionality of edges: towards the
7.2 Future Directions 277
 Figure 7.1: Examples of how to show edge directionality in a fan motif glyph. The
 arrows around the fans are not part of the glyph, and are only presented here to
 highlight which sector corresponds to which direction of edges.
 head node, towards leaf nodes, or in both directions (reciprocated ties). The
 directionality of each sector can be shown with small arrows inside the sectors, but
 this requires a much larger glyph to be readable at a distance. Instead, I chose
 to arrange the sectors at different angles around the head node. The left glyph in
 Fig. 7.1 shows only edges pointing in one direction and that are not reciprocated.
 Both sectors are aligned vertically, with the incoming edge sector growing clockwise
 from vertical and the outgoing sector growing counter-clockwise from vertical. If
 only one direction of edges exist, say those pointing from the head to leaves, only
 that sector would be drawn. This technique for growing the sectors in different
 directions from vertical makes the directionality of the edges immediately clear,
 without requiring labels.
 If there are reciprocated edges I propose the right glyph in Fig. 7.1. In this glyph
 there is a third sector for the reciprocated edges in the center, which grows evenly
7.2 Future Directions 278
 Figure 7.2: Variants of the directed fan motif glyph with different numbers leaf
 nodes and number of directed edges in each of the three types (from head, to head,
 and reciprocated).
 in both directions from vertical. Then, instead of the solo-directed edges growing
 from vertical they grow from the edge of the central glyph. If there are no edges in
 one of the three sectors, it is not drawn at all and there is no extra border, again
 maintaining the directionality information solely in vertical alignment. Several
 variants for different configurations of edges are shown in Fig. 7.2. With this design
 the original size information of the fan glyph can be retained and directionality
 shown, all without labels.
 This kind of subdivided design worked well for the fan glyphs, but is not as easy
 for things like the connector and clique motifs. Connectors are especially difficult
7.2 Future Directions 279
 because they can have virtually any number of anchors, and thus combinations of
 edge directionalities. While a 2-Connector will have 32 = 9 different combinations
 of edges (in-in, in-out, in-reciprocated, etc.), a 3-Connector will have 33 = 27 and
 a 4-Connector 34 = 81. When we get to the 70-Connector that showed up in the
 medical records example (Section 4.3.5), there are 2:5 1033 different combinations
 to show! It seems like displaying all the potential flows through a connector will
 be challenging. Instead, each meta-edge can be subdivided into three proportion-
 ally sized parts to show some of the directionality information. However, at this
 point we are creating a new heavyweight encoding for every glyph and it becomes
 difficult to keep the visual lexicon small, which is why I decided not to pursue this
 route. Cliques could be somewhat easier, but would likely require embedding a
 flow visualization or asymmetric adjacency matrix inside the motif glyph. This
 would be similar to the approach presented by NodeTrix [HFM07].
 7.2.1.3 Approximate Topology
 One of the best ways to scale up motif simplification to larger networks is to use
 approximate topology for the simplification instead of requiring exact motifs. We
 then return to the problem of displaying this ambiguity to the user, the basis for
 the exact motif simplification approach in the first place. Moreover, we have the
 problem of detecting these ?fuzzy? motifs or functionally equivalent bits of the
 network. Almost-cliques are perhaps the most studied motif of the bunch and are
7.2 Future Directions 280
 used as the basis for some clustering algorithms. Instead of showing the present
 edges in an almost-clique like normal, it could be better to instead show the absence
 of specific edges in the motif. These absences can be represented as light cuts across
 a regular polygon glyph that shows a complete clique. Alternatively, an adjacency
 matrix can be embedded in the clique glyph, again either showing the underlying
 edges (as in NodeTrix [HFM07]) or showing their absence.
 Fan and connector motifs are perhaps a bit trickier to show. The presence
 of additional edges in a fan motif, connecting the leaf nodes, or in a connector
 motif, linking the span nodes, could be shown using various styles or textures for
 the glyph components. For example, connections between the fan leaves could be
 shown with a curved outer line for the sector like in the basic glyph, but when there
 are no connections that sector has a jagged appearance. One approach for finding
 ?fuzzy? fan motifs would be to look for any trees in the network, which could be
 detected by iteratively applying the linear time algorithm detailed in Algorithm 2.
 These trees would have to be simplified into a staggered glyph to show the depth
 of its various parts, and in that case maintaining the area scaling to show node
 count would be difficult.
 7.2.1.4 Arbitrary Motifs
 A similar problem is how to detect and represent arbitrary motifs for the user.
 These motifs can be user-specified, like those of known interest to biologists, or
7.2 Future Directions 281
 automatically generated using motif census tools. See Section 2.4 for an extensive
 discussion of motif census techniques, as well as the current state-of-the-art tech-
 niques for displaying the resulting motifs. Current motif census and visualization
 approaches are used in bioinformatics, but only find small motifs with little sim-
 plification payoff and do not create truly simplified displays. The main problem
 to solve would be automatically generating effective and distinguishable glyphs.
 Motif simplification would be more generally applicable if we can develop a tech-
 nique for detecting new kinds of motifs automatically and suggesting ones that will
 have a high payoff if simplified. A motif census tool could be created that makes
 recommendations for specific motif simplifications to target based on readability
 metrics for the original and reduced visualizations. One heavyweight display ap-
 proach would be to embed small node-link visualizations of some representative
 topology inside the meta-nodes. The latest version of Cytoscape [Sha+03] will
 now show an exact subnetwork visualization inside a meta-node, but it would be
 better to automatically create a small, representative version to display.
 7.2.1.5 Attribute Distributions
 The current motif glyphs show a single aggregate measure of the underlying node
 attributes, such as their average, on the same color scale as used for the nodes.
 While this provides some information, it is not enough to identify unusual outliers
 or distributions of attribute values. With a more heavyweight glyph, this distri-
7.2 Future Directions 282
 bution could be shown with small box-and-whisker charts or the like. Perhaps a
 bit simpler would be to use proportionally sized stripes of color to show categori-
 cal attributes or bins of attributes. Alternatively, the glyph could be subdivided
 into distinct sized sections for each attribute bin. While these approaches would
 highlight underlying attributes better, they do come at a substantial cost of screen
 space and visual complexity.
 7.2.1.6 Overlap Handling
 While the underlying topology of an individual motif is unambiguous, in some
 cases the choice of which motifs to simplify can lead to different overviews. The
 fan and connector motifs prevent ambiguous overlap, but clique motifs can overlap
 each other substantially. I use a heuristic that picks the largest non-overlapping
 clique to simplify. A more effective, but computationally hard, approach would
 be to rate each motif by desirability and find the optimal set of motifs by solving
 the NP-complete set-packing problem [Kar72]. This could result in overall better
 simplifications, as well as more confidence in having meaningful results.
 7.2.1.7 Layout Algorithms
 One of the common results of motif simplification is having the simplified network
 be rather dense. Most layout heuristics do not handle dense networks as well
 as sparse ones, though it is computationally easier than running on the original
7.2 Future Directions 283
 network. Moreover, especially with the heavyweight glyphs I discuss above, it
 becomes important to take the glyph size and shape into account. We could apply
 an overlap removal post-processing step as in Section 5.4.3.3, but it is better to take
 the node size and shape into account in the layout algorithm. This algorithm should
 also take the aggregate strength of any meta-edges into account to ensure that
 things like tightly linked anchors of a connector motif are brought close together.
 7.2.1.8 Interaction Techniques
 An interesting interactive technique that could be leveraged is semantic zooming,
 where more details are revealed as the user zooms in on the network. Similar to
 Google Maps, features are revealed only when they do not add undue complexity
 to the display. Instead of expanding and collapsing glyphs on demand, glyphs
 would be expanded automatically when there is enough screen space available to
 present them well. This could be combined with ?fuzzy? summarization [NRS08] or
 backbone-generation [Won+08] techniques to get further reductions in complexity,
 at the cost of losing some information about the topology. All these overview ap-
 proaches would be especially effective for web-based network visualizations, which
 have a space premium and significant performance issues with even small networks.
7.2 Future Directions 284
 7.2.2 Group-in-a-Box Layouts
 The Group-in-a-Box layouts I have discussed could benefit from several improve-
 ments. First, better automatic parameter selection and layout choice techniques
 could get users to good results faster without trial and error. Moreover, better
 layout algorithms could be applied to get the initial group positions. Finally, ad-
 ditional interaction techniques could let users explore the groups in the network
 individually.
 7.2.2.1 Automatic Parameter Selection
 Currently, the initial space-filling factor used in the Force-Directed Group-in-a-
 Box layout (Section 5.4.3.1) is hard-coded in NodeXL at 50%. A more effective
 approach might iteratively lower that value if the box overlap removal step caused
 too much movement of the group boxes. Alternatively, the layout could run several
 times to correct for mistakes like the groups being placed in poor positions initially,
 which can cause substantial overlap or degenerate cases like a single line.
 7.2.2.2 Layout Algorithm Improvements
 The layout algorithm I currently use for the Force-Directed Group-in-a-Box layout
 is the Harel-Koren FMS layout [HK02a]. One problem with the implementation
 is that it does not take the aggregate meta-edge strength into account yet when
 positioning the group boxes, an issue I plan to address as soon as I have time. A
7.2 Future Directions 285
 more substantial step would be to try this approach using other effective layout
 algorithms like the high-dimensional embedding (HDE) approach of Harel and
 Koren [HK02c] or the algebraic multigrid method (ACE) of Koren, Carmel, and
 Harel [KCH03]. The FM3 algorithm [HJ05; Hac05] seems to produce particularly
 good results, but may be slower and difficult to implement. HDE, ACE, and FM3
 should all be able to utilize the meta-edge weight between groups.
 7.2.2.3 Evaluation
 My students and I are currently conducting an empirical evaluation of the Group-
 in-a-Box layouts on thousands of Twitter scrapes (Section 5.6), but more work is
 definitely needed to quantify how useful these meta-layouts are. Additional task-
 based studies could help quantify the benefits of the Group-in-a-Box approach and
 any potential pitfalls that have not been exposed through my case studies and
 explorations.
 7.2.2.4 Automatic Layout Choice
 In some cases, I choose which Group-in-a-Box layout to use based on the number
 of connected components, groups, and certain group properties (Section 5.4.5).
 Despite this, it would be good to extend this work to completely automate the
 Group-in-a-Box layout choice. One way to do this would be to run each layout,
 quantify its utility using readability metrics, and choose the best one. Alter-
7.2 Future Directions 286
 natively, studies like our empirical analysis of Group-in-a-Box layouts on Twitter
 networks may provide sufficient data to automatically choose the best layout based
 on network and group statistics. Similarly, the best clustering algorithm for a net-
 work could be found by comparing how effective each clustering algorithm is when
 the results are displayed in the Force-Directed Group-in-a-Box layout. This would
 be quantified by using the readability metrics.
 7.2.2.5 Interaction Techniques
 Instead of displaying all the groups on the screen at the same time, interactive
 techniques could help users drill into particular groups. The original Treemap tool1
 and now Spotfire [Spo] allow users to drill into a Treemap interactively, showing
 only one box on a level. This same kind of interactive drill-down can be applied to
 any of the Group-in-a-Box layouts, and would be especially effective for hierarchical
 clusterings. An alternate technique like continuously variable zoom [Dil+94] would
 let users see one group in more of the screen space, while minimizing other groups
 to take up less space.
 7.2.3 Readability Metrics
 There are several ways forward for work on the readability metrics. Initially, there
 is a need for local node and edge versions of current global metrics that I did
 1http://www.cs.umd.edu/hcil/treemap/
7.2 Future Directions 287
 not cover as part of my work. As more metrics are developed, they should be
 evaluated for user task performance and integrated into a visual taxonomy for the
 user, which can then be used to help users choose the metrics to optimize. These
 optimizations could be done manually with color-coding assistance like I do now,
 but also using a snap-to-local-maxima or fully automatic approach.
 7.2.3.1 Additional Local Metrics
 There are many existing global readability metrics that I have not created local
 node and edge versions for, many of which are these are listed in Section 6.4.18.
 The development of additional local metrics would provide users with more ways
 to understand the effectiveness of their node-link visualizations, as well as ways to
 improve those visualizations. While there are many studies looking at the utility
 of metrics like edge crossings (Section 6.4.4), many metrics are not as well studied.
 With any new metrics, it becomes important to quantify how well it maps to user
 task performance.
 7.2.3.2 Metric-Task Taxonomy and User Interface
 It would be useful to document the results of new metric studies, as well as the large
 corpus of studies I detail in Section 6.4, in a metric-by-task taxonomy that can be
 presented visually to the user. While NodeXL will currently let users select which
 metrics to optimize, the user may not be aware of which metrics they should use
7.2 Future Directions 288
 for particular tasks. This taxonomy interface would let users select a path-finding
 task, for example, and be given the appropriate metrics to optimize.
 7.2.3.3 Automatic Metric Optimization
 Once a metric-by-task user interface exists, we can then enable the user to select
 several of the relevant metrics to optimize. While my current implementations
 only show the user highlighting for one metric at a time, we could create a linear
 or weighted combination of the metrics to display. More interestingly, we could
 feed this combined metric into a snap-to-local-maxima tool, or even an automatic
 layout algorithm that finds the perfect layout for that user-defined energy function.
 Simulated annealing [Met+53; KGV83] may be a good approach for a fully
 automated layout. Simulated annealing is an optimization strategy originating
 in statistical mechanics [Met+53] that has since been rewritten more generally
 [KGV83], and can be applied to many classical combinatorial problems. Surveys
 of the method and uses of simulated annealing can be found in [Haj85; Joh+91;
 JP87; LA87]. Earlier work has used simulated annealing for network layouts with a
 hard-coded energy function, based on metrics such as evenly-spaced nodes, uniform
 edge lengths, edge crossings, edge tunnels [DH96] or even to show group members
 proximally [Bar+08]. We could build on this to optimize the user-defined energy
 function that was created using the metric-by-task user interface. The running
 time of this approach would likely be slow (O(N2E)), with memory required about
7.3 Summary 289
 O(max(N2; E2)), but would produce ?perfect? layouts for a given set of metrics.
 7.3 Summary
 Network data structures have been used extensively in recent years for modeling
 entities and their ties for many diverse disciplines. Analyzing networks involves
 understanding the complex relationships between entities as well as any attributes,
 statistics, or groupings associated with them. The omnipresent node-link visual-
 ization excels at showing network topology and features simultaneously, but many
 node-link visualizations are not easily readable or difficult to extract meaning from
 because of inherent network complexity or size. Moreover, for every network there
 are many potential unintelligible or even misleading visualizations.
 In this dissertation I discuss strategies to help users create more effective node-
 link visualizations, all implemented in the NodeXL network analysis tool [Smi+10].
 I first introduce a technique called motif simplification that leverages the repeating
 patterns or motifs in a network to reduce visual complexity and increase readabil-
 ity. I then discuss meta-layout algorithms that take attribute- or topology-based
 groupings into account, so as to more clearly show the ties within groups and
 the aggregate relationships between groups. Finally, I detail readability metrics
 to quantify the effectiveness of node-link visualizations, localize areas needing im-
 provement, and be fed into assistive layout tools.
7.3 Summary 290
 Each of these thrusts of my work opens up new avenues of research on network
 visualization. The motif simplification work can be expanded to show additional
 topology and attribute information, as well as arbitrary patterns in the network.
 My Group-in-a-Box layouts would benefit from advanced layout algorithms, in
 addition to automatic parameter and layout selection techniques. Finally, future
 work could develop local node and edge readability metrics for existing global
 metrics, and implement a visual metric-by-task taxonomy tool that would feed
 into automatic layout algorithms.
Bibliography
 [Ada+04] Alex T Adai, Shailesh V Date, Shannon Wieland, and Edward M
 Marcotte. LGL: Creating a map of protein function with an algo-
 rithm for visualizing very large biological networks . In: Journal of
 Molecular Biology 340.1 (2004), pp. 179?190. doi: 10.1016/j.jmb.
 2004.04.047 (cit. on pp. 39, 40).
 [Ada06] Eytan Adar. GUESS: a language and interface for graph exploration.
 In: CHI ?06: Proc. SIGCHI Conference on Human Factors in Com-
 puting Systems. 2006, pp. 791?800. doi: 10.1145/1124772.1124889
 (cit. on pp. 21, 23).
 [AG05] Lada A. Adamic and Natalie Glance. The political blogosphere and
 the 2004 U.S. election: Divided they blog . In: LinkKDD ?05: Proc.
 3rd International Workshop on Link Discovery. 2005, pp. 36?43.
 doi: 10.1145/1134271.1134277 (cit. on p. 3).
 [AH04] James Abello and Frank van Ham. Matrix Zoom: A visual interface
 to semi-external graphs . In: INFOVIS ?04: Proc. IEEE Symposium
 on Information Visualization. INFOVIS ?04. 2004, pp. 183?190. doi:
 10.1109/INFVIS.2004.46 (cit. on p. 24).
 [AHK06] James Abello, Frank van Ham, and Neeraj Krishnan. ASK-GraphView:
 a large scale graph visualization system. In: TVCG: IEEE Transac-
 tions on Visualization and Computer Graphics 12.5 (2006), pp. 669?
 676. doi: 10.1109/TVCG.2006.120 (cit. on p. 35).
 [Ari08] Aleks Aris. Visualizing and exploring networks using Semantic Sub-
 strates . PhD thesis. University of Maryland, Department of Com-
 puter Science, 2008 (cit. on pp. 1, 21).
 [AWS92] Christopher Ahlberg, Christopher Williamson, and Ben Shneider-
 man. Dynamic queries for information exploration: an implemen-
 tation and evaluation. In: CHI ?92: Proc. SIGCHI Conference on
 Human Factors in Computing Systems. 1992, pp. 619?626. doi: 10.
 1145/142750.143054 (cit. on p. 56).
 291
BIBLIOGRAPHY 292
 [Bar+08] Aaron Barsky, Tamara Munzner, Jennifer Gardy, and Robert Kin-
 caid. Cerebral: Visualizing multiple experimental conditions on a
 graph with biological context. In: TVCG: IEEE Transactions on Vi-
 sualization and Computer Graphics 14.6 (2008), pp. 1253?1260. doi:
 10.1109/TVCG.2008.117 (cit. on pp. 6, 142, 288).
 [Bat+94] Giuseppe Di Battista, Peter Eades, Roberto Tamassia, and Ioannis
 G. Tollis. Algorithms for drawing graphs: An annotated bibliography .
 In: Computational Geometry 4 (1994), pp. 235?282. doi: 10.1016/
 0925-7721(94)00014-X (cit. on p. 33).
 [Bat+98] Giuseppe Di Battista, Peter Eades, Roberto Tamassia, and Ioannis
 G. Tollis. Graph drawing: Algorithms for the visualization of graphs .
 Ed. by Laura Steele. Prentice Hall, 1998 (cit. on pp. 4, 33, 228, 249).
 [Bez+10] Anastasia Bezerianos, Fanny Chevalier, Pierre Dragicevic, Niklas
 Elmqvist, and Jean-Daniel Fekete. GraphDice: A system for explor-
 ing multivariate social networks . In: EuroVis ?10: Proc. 2010 Euro-
 graphics/IEEE Symposium on Visualization. 2010. doi: 10.1111/
 j.1467-8659.2009.01687.x (cit. on pp. 26, 28).
 [BFN85] Carlo Batini, L. Furlani, and Enrico Nardelli. What is a good dia-
 gram? A pragmatic approach. In: ER ?85: Proc. 4th International
 Conference on the Entity-Relationship Approach to Software Engi-
 neering. 1985, pp. 312?319 (cit. on p. 33).
 [BH86] Josh Barnes and Piet Hut. A hierarchical O(N log N) force-calculation
 algorithm. In: Nature 324.6096 (1986), pp. 446?449. doi: 10.1038/
 324446a0 (cit. on pp. 154, 246).
 [BHJ09] Mathieu Bastian, Sebastien Heymann, and Mathieu Jacomy. Gephi:
 An open source software for exploring and manipulating networks .
 In: ICWSM ?09: Proc. International AAAI Conference on Weblogs
 and Social Media. 2009 (cit. on pp. 21, 23).
 [BHJVW00] Mark Bruls, Kees Huizing, and Jarke J. Van Wijk. Squarified Treemaps .
 In: Proc. Joint Eurographics and IEEE TCVG symposium on Visu-
 alization. 2000, pp. 33?42 (cit. on pp. 160, 223).
 [Bla+09] Jorik Blaas, Charl Botha, Edward Grundy, Mark Jones, Robert
 Laramee, and Frits Post. Smooth graphs for visual exploration of
 higher-order state transitions . In: TVCG: IEEE Transactions on Vi-
 sualization and Computer Graphics 15.6 (2009), pp. 969?976. doi:
 10.1109/TVCG.2009.181 (cit. on p. 26).
BIBLIOGRAPHY 293
 [Blu+08] Ryan Blue, Cody Dunne, Adam Fuchs, Kyle King, and Aaron Schul-
 man. Visualizing real-time network resource usage. In: VizSec ?08:
 Proc. 5th international workshop on Visualization for Computer Se-
 curity. 2008, pp. 119?135. doi: 10.1007/978-3-540-85933-8_12
 (cit. on pp. 59?61).
 [BM98] Vladimir Batagelj and Andrej Mrvar. Pajek - Program for large net-
 work analysis . In: Connections 21 (1998), pp. 47?57 (cit. on pp. 21,
 23, 247).
 [BMK96] Jim Blythe, Cathleen McGrath, and David Krackhardt. The ef-
 fect of graph layout on inference from social network data. In: GD
 ?95: Proc. 3rd International Symposium on Graph Drawing. GD ?95.
 1996, pp. 40?51. doi: 10.1007/BFb0021783 (cit. on pp. 2, 4, 46).
 [Bon+09] Elizabeth M. Bonsignore, Cody Dunne, Dana Rotman, Marc Smith,
 Tony Capone, Derek L. Hansen, and Ben Shneiderman. First steps
 to NetViz Nirvana: Evaluating social network analysis with NodeXL.
 In: CSE ?09: Proc. 2009 International Conference on Computational
 Science and Engineering. Vol. 4. 2009, pp. 332?339. doi: 10.1109/
 CSE.2009.120 (cit. on pp. 50, 226).
 [Bra+99] Ulrik Brandes, Patrick Kenis, J???rg Raab, Volker Schneider, and
 Dorothea Wagner. Explorations into the visualization of policy net-
 works . In: Journal of Theoretical Politics 11.11 (1999), pp. 75?106.
 doi: 10.1177/0951692899011001004 (cit. on p. 4).
 [Bru12] Tom Brughmans. Thinking through networks: a review of formal net-
 work methods in archaeology . English. In: Journal of Archaeological
 Method and Theory (2012), pp. 1?40. doi: 10.1007/s10816-012-
 9133-8 (cit. on p. 3).
 [BWK00] Michelle Q. Wang Baldonado, Allison Woodruff, and Allan Kuchin-
 sky. Guidelines for using multiple views in information visualization.
 In: AVI ?00: Proc. 2000 working conference on Advanced Visual In-
 terfaces. 2000, pp. 110?119. doi: 10.1145/345513.345271 (cit. on
 p. 31).
 [Cao+11] Nan Cao, Gotz, D., Sun, J., and Huamin Qu. DICON: Interac-
 tive visual analysis of multidimensional clusters . In: TVCG: IEEE
 Transactions on Visualization and Computer Graphics 17.12 (2011),
 pp. 2581?2590. doi: 10.1109/TVCG.2011.188 (cit. on pp. 43, 44).
 [CBB00] Bill Cheswick, Hal Burch, and Steve Branigan. Mapping and vi-
 sualizing the internet . In: Proc. 2000 USENIX Annual Technical
 Conference. 2000, pp. 1?12 (cit. on p. 3).
BIBLIOGRAPHY 294
 [CC00] Chaomei Chen and Mary P. Czerwinski. Empirical evaluation of in-
 formation visualizations: An introduction. In: International Journal
 of Human-Computer Studies 53.5 (2000), pp. 631?635. doi: 10.
 1006/ijhc.2000.0421 (cit. on p. 45).
 [Cha+13] Snigdha Chaturvedi, Zahra Ashktorab, Cody Dunne, Rajan Zacharia,
 and Ben Shneiderman. Group-in-a-Box layouts for visualizing net-
 work communities and their ties. Under submission. 2013 (cit. on
 pp. 20, 144, 145, 162, 213).
 [CM85] William S. Cleveland and Robert McGill. Graphical perception and
 graphical methods for analyzing scientific data. In: Science 229.4716
 (1985), pp. 828?833. doi: 10.1126/science.229.4716.828 (cit. on
 p. 82).
 [CNM04] Aaron Clauset, Mark E. J. Newman, and Cristopher Moore. Find-
 ing community structure in very large networks . In: Physical Re-
 view E: Statistical, Nonlinear, and Soft Matter Physics 70 (6 2004),
 p. 066111. doi: 10.1103/PhysRevE.70.066111 (cit. on pp. 11, 12,
 140, 147, 189, 190, 197?199, 206, 207, 218).
 [CP96] Michael K. Coleman and D. Stott Parker. Aesthetics-based graph
 layout for human consumption. In: Software: Practice and Experi-
 ence 26.12 (1996), pp. 1415?1438. doi: 10.1002/(SICI)1097-
 024X(199612)26:12<1415::AID- SPE69> 3.3.CO;2- G (cit. on
 p. 249).
 [Dem12] Christopher Scott Dempwolf. Network models of regional innovation
 clusters and their impact on economic growth. PhD thesis. Univer-
 sity of Maryland, College Park, 2012 (cit. on pp. 3, 206).
 [DH96] Ron Davidson and David Harel. Drawing graphs nicely using sim-
 ulated annealing . In: TOG: ACM Transactions on Graphics 15.4
 (1996), pp. 301?331. doi: 10.1145/234535.234538 (cit. on pp. 34,
 249, 255, 288).
 [Dil+94] John Dill, Lyn Bartram, Albert Ho, and Frank Henigman. A con-
 tinuously variable zoom for navigating large hierarchical networks .
 In: Proc. IEEE SMC ?94. Vol. 1. 1994, pp. 386?390. doi: 10.1109/
 ICSMC.1994.399869 (cit. on p. 286).
 [DK83] David P. Dobkin and David G. Kirkpatrick. Fast detection of poly-
 hedral intersection. In: Theoretical Computer Science 27.3 (1983),
 pp. 241?253. doi: 10.1016/0304-3975(82)90120-7 (cit. on p. 263).
BIBLIOGRAPHY 295
 [DMS06] Tim Dwyer, Kim Marriott, and Peter Stuckey. Fast node overlap
 removal . In: GD ?05: Proc. 13th International Symposium on Graph
 Drawing. 2006, pp. 153?164. doi: 10.1007/11618058_15 (cit. on
 pp. 175, 176, 246).
 [DMS07] Tim Dwyer, Kim Marriott, and Peter Stuckey. Fast node overlap
 removal?correction. In:GD ?06: Proc. 14th International Symposium
 on Graph Drawing. 2007, pp. 446?447. doi: 10.1007/978-3-540-
 70904-6_44 (cit. on pp. 175, 176, 246).
 [DS09] Cody Dunne and Ben Shneiderman. Improving graph drawing read-
 ability by incorporating readability metrics: A software tool for net-
 work analysts . Human-Computer Interaction Lab Tech Report HCIL-
 2009-13. University of Maryland, 2009 (cit. on p. 226).
 [DS13] Cody Dunne and Ben Shneiderman. Motif simplification: improv-
 ing network visualization readability with fan, connector, and clique
 glyphs . In: CHI ?13: Proc. SIGCHI Conference on Human Factors in
 Computing Systems. CHI ?13. 2013, pp. 3247?3256. doi: 10.1145/
 2470654.2466444 (cit. on pp. 20, 54, 76).
 [Dun+12a] Cody Dunne, Nathalie Henry Riche, Bongshin Lee, Ronald A. Metoyer,
 and George G. Robertson. GraphTrail: Analyzing large multivari-
 ate, heterogeneous networks while supporting exploration history . In:
 CHI ?12: Proc. SIGCHI Conference on Human Factors in Comput-
 ing Systems. 2012, pp. 1663?1672. doi: 10.1145/2207676.2208293
 (cit. on pp. 7, 60?62).
 [Dun+12b] Cody Dunne, Ben Shneiderman, Robert Gove, Judith Klavans, and
 Bonnie Dorr. Rapid understanding of scientific paper collections:
 Integrating statistics, text analytics, and visualization. In: JASIST:
 Journal of the American Society for Information Science and Tech-
 nology 63.12 (2012), pp. 2351?2369. doi: 10.1002/asi.22652 (cit.
 on pp. 68?70).
 [Ead84] Peter Eades. A heuristic for graph drawing . In: CN: Congressus Nu-
 merantium 42 (1984), pp. 149?160 (cit. on pp. 5, 245, 268).
 [Eic03] Holger Eichelberger. Nice class diagrams admit good design? In:
 SoftVis ?03: Proc. 2003 ACM Symposium on Software Visualization.
 2003, pp. 159?216. doi: 10.1145/774833.774857 (cit. on pp. 33,
 245, 249, 255).
 [EL92] Peter Eades and Wei Lai. Algorithms for disjoint node images. In:
 ACSC ?92: Proc. 15th Australian Computer Science Conference.
 1992, pp. 253?265 (cit. on p. 246).
BIBLIOGRAPHY 296
 [ES11] David Eppstein and Darren Strash. Listing all maximal cliques in
 large sparse real-world graphs . In: SEA ?11: Proc. 10th International
 Symposium on Experimental Algorithms. Vol. 6630. 2011, pp. 364?
 375. doi: 10.1007/978-3-642-20662-7_31 (cit. on p. 93).
 [ES90] Peter Eades and Kozo Sugiyama. How to draw a directed graph. In:
 Journal of Information Processing 13.4 (1990), pp. 424?437 (cit. on
 pp. 249, 266).
 [Fek+03] Jean-Daniel Fekete, DavidWang, Niem Dang, Aleks Aris, and Cather-
 ine Plaisant. Overlaying graph links on Treemaps . In: Information
 Visualization Symposium Poster Compendium. INFOVIS. 2003, pp. 82?
 83 (cit. on p. 26).
 [Fek04] Jean-Daniel Fekete. The InfoVis Toolkit . In: INFOVIS ?04: Proc.
 IEEE symposium on Information Visualization. INFOVIS ?04. 2004,
 pp. 167?174. doi: 10.1109/INFVIS.2004.64 (cit. on p. 22).
 [For+93] Michael Formann, Torben Hagerup, James Haralambides, Michael
 Kaufmann, Frank Thomson Leighton, Antonios Symvonis, EmoWelzl,
 and Gerhard J. Woeginger. Drawing graphs in the plane with high
 resolution. In: SIAM Journal on Computing 22.5 (1993), pp. 1035?
 1052. doi: 10.1137/0222063 (cit. on p. 258).
 [FR91] Thomas M. J. Fruchterman and Edward M. Reingold. Graph draw-
 ing by force-directed placement . In: SPE: Software: Practice and Ex-
 perience 21.11 (1991), pp. 1129?1164. doi: 10.1002/spe.4380211102
 (cit. on pp. 5, 107, 108, 149, 245, 249, 256).
 [Fre+10] Manuel Freire, Catherine Plaisant, Ben Shneiderman, and Jen Gol-
 beck. ManyNets: An interface for multiple network analysis and
 visualization. In: CHI ?10: Proc. SIGCHI Conference on Human
 Factors in Computing Systems. 2010, pp. 213?222. doi: 10.1145/
 1753326.1753358 (cit. on pp. 27, 28, 35).
 [FSW06] Danyel Fisher, Marc Smith, and Howard T. Welser. You are who you
 talk to: Detecting roles in Usenet newsgroups . In: HICSS ?06: Proc.
 39th Annual Hawaii International Conference on System Sciences.
 2006, p. 59.2. doi: 10.1109/HICSS.2006.536 (cit. on p. 3).
 [GFC04] Mohammad Ghoniem, Jean-Daniel Fekete, and Philippe Castagliola.
 A comparison of the readability of graphs using node-link and matrix-
 based representations . In: INFOVIS ?04: Proc. IEEE Symposium on
 Information Visualization. INFOVIS ?04. 2004, pp. 17?24. doi: 10.
 1109/INFVIS.2004.1 (cit. on pp. 24, 46, 126, 128, 226).
BIBLIOGRAPHY 297
 [GGK04] Pawel Gajer, Michael T. Goodrich, and Stephen G. Kobourov. A
 multi-dimensional approach to force-directed layouts of large graphs .
 In: Computational Geometry: Theory and Applications 29.1 (2004),
 pp. 3?18. doi: 10.1016/j.comgeo.2004.03.014 (cit. on p. 41).
 [GH09] Emden Gansner and Yifan Hu. Efficient node overlap removal using
 a proximity stress model . In: GD ?08: Proc. 16th International Sym-
 posium on Graph Drawing. 2009, pp. 206?217. doi: 10.1007/978-
 3-642-00219-9_20 (cit. on pp. 173, 175, 176, 178?180, 246, 247).
 [GK07] Joshua Grochow and Manolis Kellis. Network motif discovery using
 Subgraph Enumeration and Symmetry-Breaking . In: RECOMB ?07:
 Proc. 11th iInternational conference on Research in Computational
 Molecular Biology. 2007, pp. 92?106. doi: 10.1007/978-3-540-
 71681-5_7 (cit. on pp. 37, 86).
 [GN02] Michelle Girvan and Mark E. J. Newman. Community structure in
 social and biological networks . In: PNAS: Proc. National Academy
 of Sciences of the United States of America 99.12 (2002), pp. 7821?
 7826. doi: 10.1073/pnas.122653799 (cit. on pp. 140, 147).
 [GN98] Emden Gansner and Stephen North. Improved force-directed lay-
 outs . In: GD ?98: Proc. 6th International Symposium on Graph
 Drawing. 1998, pp. 364?373. doi: 10.1007/3-540-37623-2_28
 (cit. on pp. 175, 176, 246).
 [Gov+11a] Robert Gove, Cody Dunne, Ben Shneiderman, Judith Klavans, and
 Bonnie Dorr. Evaluating visual and statistical exploration of scien-
 tific literature networks . In: VL/HCC ?11: Proc. 2011 IEEE Sym-
 posium on Visual Languages and Human-Centric Computing. 2011,
 pp. 217?224. doi: 10.1109/VLHCC.2011.6070403 (cit. on p. 68).
 [Gov+11b] Robert Gove, Nick Gramsky, Rose Kirby, Emre Sefer, Awalin Sopan,
 Cody Dunne, Ben Shneiderman, and Meirav Taieb-Maimon. NetVisia:
 Heat map & matrix visualization of dynamic social network statistics
 & content . In: SocialCom ?11: Proc. 2011 IEEE 3rd International
 Conference on Social Computing. 2011, pp. 19?26. doi: 10.1109/
 PASSAT/SocialCom.2011.216 (cit. on p. 66).
 [Hac05] Stefan Hachul. A potential-field-based multilevel algorithm for draw-
 ing large graphs . PhD thesis. Universit?t zu K?ln, 2005 (cit. on
 pp. 172, 285).
BIBLIOGRAPHY 298
 [Haj85] Bruce Hajek. A tutorial survey of theory and applications of simu-
 lated annealing . In: CDC ?85: Proc. 24th IEEE Conference on De-
 cision and Control. Vol. 24. 1985, pp. 755?760. doi: 10.1109/CDC.
 1985.268599 (cit. on p. 288).
 [Hay+02] Kunihiko Hayashi, Michiko Inoue, Toshimitsu Masuzawa, and Hideo
 Fujiwara. A layout adjustment problem for disjoint rectangles pre-
 serving orthogonal order . In: Systems and Computers in Japan 33.2
 (2002), pp. 31?42. doi: 10.1002/scj.1104 (cit. on p. 246).
 [HB05] Jeffrey Heer and Danah Boyd. Vizster: Visualizing online social net-
 works . In: INFOVIS ?05: Proc. IEEE Symposium on Information Vi-
 sualization. INFOVIS ?05. 2005, pp. 32?39. doi: 10.1109/INFVIS.
 2005.1532126 (cit. on pp. 155, 157).
 [HBF08] Nathalie Henry, Anastasia Bezerianos, and Jean-Daniel Fekete. Im-
 proving the readability of clustered social networks using node dupli-
 cation. In: TVCG: IEEE Transactions on Visualization and Com-
 puter Graphics 14.6 (2008), pp. 1317?1324. doi: 10.1109/TVCG.
 2008.141 (cit. on p. 226).
 [HCL05] Jeffrey Heer, Stuart K. Card, and James A. Landay. Prefuse: A
 toolkit for interactive information visualization. In: CHI ?05: Proc.
 SIGCHI Conference on Human Factors in Computing Systems. 2005,
 pp. 421?430. doi: 10.1145/1054972.1055031 (cit. on pp. 5, 22, 154,
 157, 246).
 [HDS10] Derek Hansen, Cody Dunne, and Ben Shneiderman. Analyzing social
 media networks with NodeXL. Proc. 27th Annual Human-Computer
 Interaction Lab Symposium. 2010 (cit. on p. 2).
 [HE05] Weidong Huang and Peter Eades. How people read graphs . In: APVis
 ?05: Proc. 2005 Asia-Pacific Symposium on Information Visualisa-
 tion. 2005, pp. 51?58 (cit. on p. 257).
 [HEH08] Weidong Huang, Peter Eades, and Seok-Hee Hong. Beyond time and
 error: A cognitive approach to the evaluation of graph drawings . In:
 BELIV ?08: Proc. 2008 conference on BEyond time and errors: novel
 evaLuation methods for Information Visualization. 2008, pp. 1?8.
 doi: 10.1145/1377966.1377970 (cit. on p. 251).
 [Hen+07] Nathalie Henry, Howard Goodell, Niklas Elmqvist, and Jean-Daniel
 Fekete. 20 years of four HCI conferences: A visual exploration. In:
 International Journal of Human-Computer Interaction 23.3 (2007),
 pp. 239?285. doi: 10.1080/10447310701702402 (cit. on p. 3).
BIBLIOGRAPHY 299
 [Her+99] Ivan Herman, M. Scott Marshall, Guy Melan?on, D. J. Duke, Maylis.
 Delest, and J.-P. Domenger. Skeletal images as visual cues in graph
 visualization. In: Data Visualization ?99: Proc. joint Eurographics
 and IEEE TVCG Symposium on Visualization. 1999, pp. 13?22 (cit.
 on p. 36).
 [HF06] Nathalie Henry and Jean-Daniel Fekete. MatrixExplorer: A dual-
 representation system to explore social networks . In: TVCG: IEEE
 Transactions on Visualization and Computer Graphics 12.5 (2006),
 pp. 677?684. doi: 10.1109/TVCG.2006.160 (cit. on p. 23).
 [HF07] Nathalie Henry and Jean-Daniel Fekete. MatLink: Enhanced ma-
 trix visualization for analyzing social networks . In: INTERACT ?07:
 Proc. 11th IFIP TC 13 International Conference on Human-computer
 interaction. 2007, pp. 288?302. doi: 10.1007/978-3-540-74800-
 7_24 (cit. on pp. 21, 24, 46, 126?128).
 [HFM07] Nathalie Henry, Jean-Daniel Fekete, and Michael J. McGuffin. Node-
 Trix: A hybrid visualization of social networks . In: TVCG: IEEE
 Transactions on Visualization and Computer Graphics 13.6 (2007),
 pp. 1302?1309. doi: 10.1109/TVCG.2007.70582 (cit. on pp. 24?26,
 279, 280).
 [HHE05] Weidong Huang, Seok-Hee Hong, and Peter Eades. Layout effects:
 Comparison of sociogram drawing conventions . Tech. rep. 575. Uni-
 versity of Sydney, 2005 (cit. on p. 251).
 [HHE06a] Weidong Huang, Seok-Hee Hong, and Peter Eades. How people read
 sociograms: A questionnaire study . In: APVis ?06: Proc. 2006 Asia-
 Pacific Symposium on Information Visualisation. 2006, pp. 199?206
 (cit. on p. 250).
 [HHE06b] Weidong Huang, Seok-Hee Hong, and Peter Eades. Layout effects
 on sociogram perception. In: GD ?05: Proc. 13th International Sym-
 posium on Graph Drawing. Vol. 3843/2006. Lecture Notes in Com-
 puter Science. 2006, pp. 262?273. doi: 10.1007/11618058_24 (cit.
 on pp. 251, 259).
 [HHE06c] Weidong Huang, Seok-Hee Hong, and Peter Eades. Predicting graph
 reading performance: A cognitive approach. In: APVis ?06: Proc.
 2006 Asia-Pacific Symposium on Information Visualisation. 2006,
 pp. 207?216. doi: 10.1145/1151903.1151933 (cit. on pp. 247, 251).
BIBLIOGRAPHY 300
 [HHE07] Weidong Huang, Seok-Hee Hong, and Peter Eades. Effects of so-
 ciogram drawing conventions and edge crossings in social network
 visualizations . In: JGAA: Journal of Graph Algorithms and Appli-
 cations 11.2 (2007), pp. 397?429 (cit. on p. 251).
 [HHE08] Weidong Huang, Seok-Hee Hong, and Peter Eades. Effects of cross-
 ing angles . In: PacificVIS ?08: Proc. 2008 IEEE Pacific Visualiza-
 tion Symposium. 2008, pp. 41?46. doi: 10.1109/PACIFICVIS.2008.
 4475457 (cit. on p. 257).
 [HJ05] Stefan Hachul and Michael J?nger. Drawing large graphs with a
 potential-field-based multilevel algorithm. In: GD ?04: Proc. 12th In-
 ternational Symposium on Graph Drawing. Vol. 3383/2005. Lecture
 Notes in Computer Science. 2005, pp. 285?295. doi: 10.1007/978-
 3-540-31843-9_29 (cit. on pp. 5, 7, 41, 172, 285).
 [HJ06] Stefan Hachul and Michael J?nger. An experimental comparison of
 fast algorithms for drawing general large graphs . In: GD ?05: Proc.
 13th International Symposium on Graph Drawing. Vol. 3843/2006.
 Lecture Notes in Computer Science. 2006, pp. 235?250. doi: 10.
 1007/11618058_22 (cit. on pp. 6, 41?43, 172).
 [HJ07] Stefan Hachul and Michael J?nger. Large-graph layout algorithms
 at work: an experimental study . In: JGAA: Journal of Graph Algo-
 rtihms and Applications 11.2 (2007), pp. 345?369. doi: 10.7155/
 jgaa.00150 (cit. on p. 172).
 [HK01] David Harel and Yehuda Koren. A fast multi-scale method for draw-
 ing large graphs . In: GD ?00: Proc. 8th International Symposium on
 Graph Drawing. 2001, pp. 235?287. doi: 10.1007/3-540-44541-
 2_18 (cit. on p. 220).
 [HK02a] David Harel and Yehuda Koren. A fast multi-scale method for draw-
 ing large graphs . In: JGAA: Journal of Graph Algorithms and Ap-
 plications 6.3 (2002), pp. 179?202. doi: 10.7155/jgaa.00051 (cit.
 on pp. 5, 11, 12, 41, 107, 109, 114, 116, 128, 171?174, 178, 179, 189,
 190, 195, 196, 199, 228, 284).
 [HK02b] David Harel and Yehuda Koren. Drawing graphs with non-uniform
 vertices . In: AVI ?02: Proc. Working Conference on Advanced Visual
 Interfaces. 2002, pp. 157?166. doi: 10.1145/1556262.1556288 (cit.
 on p. 246).
BIBLIOGRAPHY 301
 [HK02c] David Harel and Yehuda Koren. Graph drawing by high-dimensional
 embedding . In: GD ?02: Proc. 10th International Symposium on
 Graph Drawing. Vol. 2528. Lecture Notes in Computer Science. 2002,
 pp. 207?219. doi: 10.1007/3-540-36151-0_20 (cit. on pp. 5, 7,
 172, 285).
 [HL03] Xiaodi Huang and Wei Lai. Force-Transfer: A new approach to re-
 moving overlapping nodes in graph layout . In: ACSC ?03: Proc. 26th
 Australasian Computer Science Conference. 2003, pp. 349?358 (cit.
 on p. 246).
 [HMM00] Ivan Herman, Guy Melan?on, and M. Scott Marshall. Graph visu-
 alization and navigation in information visualization: a survey . In:
 TVCG: IEEE Transactions on Visualization and Computer Graph-
 ics 6.1 (2000), pp. 24?43. doi: 10 . 1109 / 2945 . 841119 (cit. on
 p. 232).
 [Hol06] Danny Holten. Hierarchical edge bundles: visualization of adjacency
 relations in hierarchical data. In: TVCG: IEEE Transactions on
 Visualization and Computer Graphics 12.5 (2006), pp. 741?748. doi:
 10.1109/TVCG.2006.147 (cit. on pp. 43, 44).
 [HR08] Frank van Ham and Bernice E. Rogowitz. Perceptual organization
 in user-generated graph layouts . In: TVCG: IEEE Transactions on
 Visualization and Computer Graphics 14.6 (2008), pp. 1333?1339.
 doi: 10.1109/TVCG.2008.155 (cit. on pp. 250, 268).
 [HSS11] Derek Hansen, Ben Shneiderman, and Mark Smith. Analyzing social
 media networks with NodeXL: Insights from a connected world . Ed.
 by Mary James and David Bevans. Morgan Kaufmann, 2011 (cit. on
 pp. 49, 50, 105, 106, 108, 109).
 [Hua+05] Weidong Huang, Colin Murray, Xiaobin Shen, Le Song, Ying Xin
 Wu, and Lanbo Zheng. Visualisation and analysis of network motifs .
 In: INFOVIS ?05: Proc. 9th International Conference on Informa-
 tion Visualisation. 2005, pp. 697?702. doi: 10.1109/IV.2005.138
 (cit. on p. 38).
 [Hua+07] Xiaodi Huang, Wei Lai, A. S. M. Sajeev, and Junbin Gao. A new
 algorithm for removing node overlapping in graph visualization. In:
 Information Sciences 177.14 (2007), pp. 2821?2844. doi: 10.1016/
 j.ins.2007.02.016 (cit. on p. 246).
 [Hua06] Weidong Huang. An eye tracking study into the effects of graph lay-
 out . Tech. rep. University of Sydney, 2006 (cit. on pp. 251, 257).
BIBLIOGRAPHY 302
 [Hua07a] Weidong Huang. Beyond time and error: A cognitive approach to the
 evaluation of graph visualizations . PhD thesis. University of Sydney,
 2007 (cit. on pp. 251, 257).
 [Hua07b] Weidong Huang. Using eye tracking to investigate graph layout ef-
 fects . In: APVis ?07: Proc. 2007 Asia-Pacific Symposium on Infor-
 mation Visualisation. 2007, pp. 97?100. doi: 10.1109/APVIS.2007.
 329282 (cit. on pp. 46, 251, 257, 258, 267).
 [HW04] Frank van Ham and Jarke J. van Wijk. Interactive visualization of
 small world graphs . In: INFOVIS ?04: Proc. IEEE Symposium on
 Information Visualization. INFOVIS ?04. 2004, pp. 199?206. doi:
 10.1109/INFVIS.2004.43 (cit. on p. 35).
 [Ima+09] Takashi Imamichi, Yohei Arahori, Jaeseong Gim, Seok-Hee Hong,
 and Hiroshi Nagamochi. Removing node overlaps using multi-sphere
 scheme. In: GD ?08: Proc. 16th International Symposium on Graph
 Drawing. 2009, pp. 296?301. doi: 10.1007/978-3-642-00219-9_28
 (cit. on p. 247).
 [Joh+91] David S. Johnson, Cecilia R. Aragon, Lyle A. McGeoch, and Cather-
 ine Schevon. Optimization by simulated annealing: An experimental
 evaluation; Part II, graph coloring and number partitioning . In: Op-
 erations Research 39 (3 1991), pp. 378?406. doi: 10.1287/opre.
 39.3.378 (cit. on p. 288).
 [JP87] David S. Johnson and Henry O. Pollak. Hypergraph planarity and the
 complexity of drawing Venn diagrams . In: Journal of Graph Theory
 11.3 (1987), pp. 309?325. doi: 10.1002/jgt.3190110306 (cit. on
 p. 288).
 [JS91] Brian Johnson and Ben Shneiderman. Tree-Maps: A space-filling
 approach to the visualization of hierarchical information structures .
 In: VIS ?91: Proc. 1991 IEEE Conference on Visualization. 1991,
 pp. 284?291. doi: 10.1109/VISUAL.1991.175815 (cit. on p. 160).
 [K0?4] Christof K?rner. Sequential processing in comprehension of hierar-
 chical graphs . In: Applied Cognitive Psychology 18.4 (2004), pp. 467?
 480. doi: 10.1002/acp.997 (cit. on p. 251).
 [KA02] Christof K?rner and Dietrich Albert. Speed of comprehension of visu-
 alized ordered sets. In: Journal of Experimental Psychology: Applied
 8.1 (2002), pp. 57?71. doi: 10.1037/1076-898X.8.1.57 (cit. on
 pp. 250, 267).
BIBLIOGRAPHY 303
 [Kan+06] Hyunmo Kang, Catherine Plaisant, Bongshin Lee, and Benjamin B.
 Bederson. NetLens: Iterative exploration of content-actor network
 data. In: VAST ?06: Proc. IEEE Symposium on Visual Analytics
 Science And Technology. 2006, pp. 91?98. doi: 10.1109/VAST.
 2006.261426 (cit. on pp. 30, 34, 63).
 [Kar72] Richard M. Karp. Reducibility among combinatorial problems . In:
 Complexity of Computer Computations. 1972, pp. 85?103 (cit. on
 pp. 94, 282).
 [KCH03] Yehuda Koren, Liran Carmel, and David Harel. Drawing huge graphs
 by algebraic multigrid optimization. In: Multiscale Modeling & Sim-
 ulation 1.4 (2003), pp. 645?673. doi: 10.1137/S154034590241370X
 (cit. on pp. 5, 172, 285).
 [Kel+03] Brian P Kelley, Roded Sharan, Richard M Karp, Taylor Sittler,
 David E Root, Brent R Stockwell, and Trey Ideker. Conserved path-
 ways within bacteria and yeast as revealed by global protein net-
 work alignment. In: PNAS: Proc. National Academy of Sciences of
 the United States of America 100.20 (2003), pp. 11394?11399. doi:
 10.1073/pnas.1534710100 (cit. on p. 3).
 [KGV83] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by
 simulated annealing . In: Science 220.4598 (1983), pp. 671?680. doi:
 10.1126/science.220.4598.671 (cit. on p. 288).
 [Knu93] Donald Ervin Knuth. The Stanford GraphBase: A platform for com-
 binatorial computing. Addison-Wesley, 1993 (cit. on p. 141).
 [KSS06] Christian Klukas, Falk Schreiber, and Henning Schw?bbermeyer.
 Coordinated perspectives and enhanced force-directed layout for the
 analysis of network motifs . In: APVis ?06: Proc. 2006 Asia-Pacific
 Symposium on Information Visualisation. 2006, pp. 39?48 (cit. on
 p. 38).
 [LA87] Peter J. M. Laarhoven and Emile H. L. Aarts. Simulated annealing:
 theory and applications . D. Reidel Publishing Company, 1987 (cit.
 on p. 288).
 [Lam+11] Heidi Lam, Enrico Bertini, Petra Isenberg, Catherine Plaisant, and
 Sheelagh Carpendale. Empirical studies in information visualiza-
 tion: Seven scenarios . In: TVCG: IEEE Transactions on Visual-
 ization and Computer Graphics PP.99 (2011), p. 1. doi: 10.1109/
 TVCG.2011.279 (cit. on p. 45).
BIBLIOGRAPHY 304
 [LE02] Wei Lai and Peter Eades. Removing edge-node intersections in draw-
 ings of graphs . In: Information Processing Letters 81.2 (2002), pp. 105?
 110. doi: 10.1016/S0020-0190(01)00194-6 (cit. on pp. 245, 246).
 [Lee+05] Bongshin Lee, Mary Czerwinski, George Robertson, and Benjamin
 B. Bederson. Understanding research trends in conferences using Pa-
 perLens . In: CHI EA ?05:: Proc. CHI ?05 Extended Abstracts on
 Human Factors in Computing Systems. 2005, pp. 1969?1972. doi:
 10.1145/1056808.1057069 (cit. on p. 63).
 [Lee+06] Bongshin Lee, Catherine Plaisant, Cynthia Sims Parr, Jean-Daniel
 Fekete, and Nathalie Henry. Task taxonomy for graph visualization.
 In: BELIV ?06: Proc. 2006 AVI workshop on BEyond time and er-
 rors: novel evaLuation methods for Information Visualization. 2006,
 pp. 1?5. doi: 10.1145/1168149.1168168 (cit. on pp. 46, 126).
 [Lee+09] Bongshin Lee, Greg Smith, George G. Robertson, Mary Czerwin-
 ski, and Desney S. Tan. FacetLens: Exposing trends and relation-
 ships to support sensemaking within faceted datasets . In: CHI ?09:
 Proc. SIGCHI Conference on Human Factors in Computing Sys-
 tems. 2009, pp. 1293?1302. doi: 10.1145/1518701.1518896 (cit.
 on pp. 29, 63).
 [LEN05] Wanchun Li, Peter Eades, and Nikola S. Nikolov. Using spring algo-
 rithms to remove node overlapping . In: APVis ?05: Proc. 2005 Asia-
 Pacific Symposium on Information Visualisation. 2005, pp. 131?140
 (cit. on pp. 245, 246, 256).
 [Lim13] Manuel Lima. Visual Complexity: Mapping Patterns of Information.
 Princeton Architectural Press, 2013 (cit. on p. 1).
 [Llo82] Stuart P. Lloyd. Least squares quantization in PCM . In: IEEE Trans-
 actions on Information Theory 28.2 (1982), pp. 129?137. doi: 10.
 1109/TIT.1982.1056489 (cit. on pp. 140, 150).
 [LMR98] Kelly A. Lyons, Henk Meijer, and David Rappaport. Algorithms
 for cluster busting in anchored graph drawing . In: JGAA: Journal
 of Graph Algorithms and Applications 2.1 (1998), pp. 1?24 (cit. on
 p. 246).
 [LNS85] R. J. Lipton, S. C. North, and J. S. Sandberg. A method for drawing
 graphs . In: SCG ?85: Proc. 1st Annual Symposium on Computational
 Geometry. 1985, pp. 153?160. doi: 10.1145/323233.323254 (cit.
 on p. 267).
BIBLIOGRAPHY 305
 [LSS12] Qi Liao, Lei Shi, and Xiaohua Sun. Anomaly analysis and visual-
 ization through compressed graphs. In: LDAV ?12: Proc. IEEE Sym-
 posium on Large-Scale Data Analysis and VIsualization Poster Ses-
 sion. 2012 (cit. on p. 35).
 [Lus+04] Nicholas M Luscombe, M. Madan Babu, Haiyuan Yu, Michael Sny-
 der, Sarah A Teichmann, and Mark Gerstein. Genomic analysis of
 regulatory network dynamics reveals large topological changes. In:
 Nature 431.7006 (2004), pp. 308?312. doi: 10.1038/nature02782
 (cit. on p. 37).
 [Mac86] Jock Mackinlay. Automating the design of graphical presentations
 of relational information. In: TOG: ACM Transactions on Graphics
 5.2 (1986), pp. 110?141. doi: 10.1145/22949.22950 (cit. on pp. 26,
 80, 82).
 [Mar+03] Kim Marriott, Peter Stuckey, Vincent Tam, and Weiqing He. Re-
 moving node overlapping in graph layout using constrained optimiza-
 tion. In: Constraints 8.2 (2003), pp. 143?171. doi: 10.1023/A:
 1022371615202 (cit. on p. 246).
 [MB04] Cathleen McGrath and Jim Blythe. Do you see what I want you to
 see? The effects of motion and spatial layout on viewers? perceptions
 of graph structure. In: JOSS: Journal of Social Structure 5.2 (2004)
 (cit. on p. 157).
 [MBK97] Cathleen McGrath, Jim Blythe, and David Krackhardt. The effect of
 spatial arrangement on judgments and errors in interpreting graphs .
 In: SN: Social Networks 19.3 (1997), pp. 223?242. doi: 10.1016/
 S0378-8733(96)00299-7 (cit. on pp. 4, 267).
 [MDD09] Saif Mohammad, Cody Dunne, and Bonnie Dorr. Generating high-
 coverage semantic orientation lexicons from overtly marked words
 and a thesaurus . In: EMNLP ?09: Proc. 2009 conference on Em-
 pirical Methods in Natural Language Processing. 2009, pp. 599?608
 (cit. on pp. 67, 68).
 [Med+02] Michael C. Medlock, Dennis Wixon, Mark Terrano, Ramon L. Romero,
 and Bill Fulton. Using the RITE method to improve products: A def-
 inition and a case study . In: Proc. Usability Professional???s Asso-
 ciation 2002. 2002 (cit. on p. 45).
 [Med+05] Michael C. Medlock, Dennis Wixon, Mike McGee, and Dan Welsh.
 Cost-justifying usability: An update for an Internet age. In: 2005.
 Chap. 17, pp. 489?517 (cit. on p. 45).
BIBLIOGRAPHY 306
 [Met+53] Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosen-
 bluth, Augusta H. Teller, and Edward Teller. Equation of state
 calculations by fast computing machines . In: Journal of Chemical
 Physics 21.6 (1953), pp. 1087?1092. doi: 10.1063/1.1699114 (cit.
 on p. 288).
 [Mil+02] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U.
 Alon. Network motifs: Simple building blocks of complex networks.
 In: Science 298.5594 (2002), pp. 824?827. doi: 10.1126/science.
 298.5594.824 (cit. on p. 36).
 [Mis+95] Kazuo Misue, Peter Eades, Wei Lai, and Kozo Sugiyama. Layout
 adjustment and the mental map. In: Journal of Visual Languages &
 Computing 6.2 (1995), pp. 183?210. doi: 10.1006/jvlc.1995.1010
 (cit. on pp. 157, 245, 246).
 [Mor53] Jacob L. Moreno. Who shall survive? Foundations of sociometry,
 group psychotherapy and sociodrama. Beacon House, 1953, p. 141
 (cit. on pp. 3, 249).
 [Mou04] David M. Mount. Geometric intersection. In: The Handbook of Dis-
 crete and Computational Geometry. 2nd ed. 2004, pp. 857?876 (cit.
 on pp. 249, 252, 263).
 [Mul91] Kentan Mulmuley. A fast planar partition algorithm, II . In: Journal
 of the ACM 38.1 (1991), pp. 74?103. doi: 10.1145/102782.102785
 (cit. on p. 252).
 [Mur08] Scott Murray. Visualizing network relationship. Tech. rep. Massachusetts
 College of Art and Design, 2008 (cit. on p. 83).
 [Mut97] Petra Mutzel. An alternative method to crossing minimization on hi-
 erarchical graphs . In: SIAM Journal on Optimization. Vol. 1190/1997.
 Lecture Notes in Computer Science. 1997, pp. 318?333. doi: 10.
 1007/3-540-62495-3_57 (cit. on pp. 249, 250).
 [Nav+09] Saket Navlakha, James White, Niranjan Nagarajan, Mihai Pop, and
 Carl Kingsford. Finding biologically accurate clusterings in hierar-
 chical tree decompositions using the variation of information. In:
 RECOMB ?09: Proc. 14th Annual international conference on Re-
 search in Computational Molecular Biology. 2009 (cit. on pp. 140,
 150).
 [New04] Mark E. J. Newman. Fast algorithm for detecting community struc-
 ture in networks . In: Physical Review E: Statistical, Nonlinear, and
 Soft Matter Physics 69.6 (2004), p. 066133. doi: 10.1103/PhysRevE.
 69.066133 (cit. on pp. 154, 240, 241).
BIBLIOGRAPHY 307
 [NG04] Mark E. J. Newman and Michelle Girvan. Finding and evaluating
 community structure in networks . In: Physical Review E: Statistical,
 Nonlinear, and Soft Matter Physics 69.2 Pt 2 (2004), p. 026113. doi:
 10.1103/PhysRevE.69.026113 (cit. on pp. 141, 147).
 [Noa04] Andreas Noack. An energy model for visual graph clustering . In:
 GD ?03: Proc. 11th International Symposium on Graph Drawing.
 Vol. 2912/2004. Lecture Notes in Computer Science. 2004, pp. 425?
 436. doi: 10.1007/978-3-540-24595-7_40 (cit. on pp. 41, 142).
 [Nor06] Chris North. Toward measuring visualization insight . In:CGA: IEEE
 Computer Graphics and Applications 26.3 (2006), pp. 6?9. doi: 10.
 1109/MCG.2006.70 (cit. on p. 45).
 [NRS08] Saket Navlakha, Rajeev Rastogi, and Nisheeth Shrivastava. Graph
 summarization with bounded error . In: SIGMOD ?08: Proc. 2008
 ACM SIGMOD international conference on Management of data.
 SIGMOD ?08. 2008, pp. 419?432. doi: 10.1145/1376616.1376661
 (cit. on pp. 35, 154, 185, 283).
 [NS00] Chris North and Ben Shneiderman. Snap-together visualization: can
 users construct and operate coordinated visualizations? In: Interna-
 tional Journal of Human-Computer Studies 53.5 (2000), pp. 715?
 739. doi: DOI:10.1006/ijhc.2000.0418 (cit. on p. 31).
 [NWB06] NWB Team. Network Workbench [Software] . 2006 (cit. on p. 31).
 [OM+03] Joshua O? Madadhain, Danyel Fisher, Scott White, and Yan-Biao
 Boey. The JUNG (Java Universal Network/Graph) framework . Tech.
 rep. UCI-ICS 03-17. University of California, Irvine, 2003 (cit. on
 p. 22).
 [PAC02] Helen C. Purchase, Jo-Anne Allder, and David Carrington. Graph
 layout aesthetics in UML diagrams: User preferences . In: JGAA:
 Journal of Graph Algorithms and Applications 6.3 (2002), pp. 255?
 279 (cit. on pp. 250, 267).
 [PCA02] Helen C. Purchase, David Carrington, and Jo-Anne Allder. Em-
 pirical evaluation of aesthetics-based graph layout . In: Empirical
 Software Engineering 7.3 (2002), pp. 233?255. doi: 10.1023/A:
 1016344215610 (cit. on p. 250).
 [PCJ96] Helen C. Purchase, Robert F. Cohen, and Murray James. Validat-
 ing graph drawing aesthetics . In: GD ?95: Proc. 3rd International
 Symposium on Graph Drawing. Vol. 1027/1996. Lecture Notes in
 Computer Science. 1996, pp. 435?446. doi: 10.1007/BFb0021827
 (cit. on p. 250).
BIBLIOGRAPHY 308
 [PFG08] Catherine Plaisant, Jean-Daniel Fekete, and Georges Grinstein. Pro-
 moting insight-based evaluation of visualizations: From contest to
 benchmark repository . In: TVCG: IEEE Transactions on Visualiza-
 tion and Computer Graphics 14.1 (2008), pp. 120?134. doi: 10.
 1109/TVCG.2007.70412 (cit. on p. 45).
 [PHG07] Helen C. Purchase, Eve Hoggan, and Carsten G?rg. How important
 is the ?mental map?? ? An empirical investigation of a dynamic
 graph layout algorithm. In: GD ?06: Proc. 14th International Sym-
 posium on Graph Drawing. Vol. 4372. Lecture notes in Computer
 Science. 2007, pp. 184?195. doi: 10.1007/978-3-540-70904-6_19
 (cit. on p. 157).
 [PL96] Helen C. Purchase and David Leonard.Graph drawing aesthetic met-
 rics. Tech. rep. 361. Key Centre for Software Technology, Dept. of
 Computer Science, University of Queensland, 1996 (cit. on pp. 14,
 33, 226).
 [Pre+93] William H. Press, Saul A. Teukolsky, William T. Vetterling, and
 Brian P. Flannery. Numerical recipes in FORTRAN; the art of sci-
 entific computing . 2nd ed. Cambridge University Press, 1993 (cit. on
 p. 154).
 [PS06] Adam Perer and Ben Shneiderman. Balancing systematic and flex-
 ible exploration of social networks . In: TVCG: IEEE Transactions
 on Visualization and Computer Graphics 12.5 (2006), pp. 693?700.
 doi: 10.1109/TVCG.2006.122 (cit. on pp. 23, 157, 230, 232, 241,
 255, 263, 269, 273).
 [PS08a] Adam Perer and Ben Shneiderman. Integrating statistics and vi-
 sualization: Case studies of gaining clarity during exploratory data
 analysis . In: CHI ?08: Proc. SIGCHI Conference on Human Factors
 in Computing Systems. 2008, pp. 265?274. doi: 10.1145/1357054.
 1357101 (cit. on pp. 46, 49, 157, 230, 233, 269).
 [PS08b] Adam Perer and Ben Shneiderman. Systematic yet flexible discov-
 ery: Guiding domain experts through exploratory data analysis . In:
 IUI ?08: Proc. 13th International Conference on Intelligent User In-
 terfaces. 2008, pp. 109?118. doi: 10.1145/1378773.1378788 (cit.
 on pp. 157, 230, 233, 269).
 [PS09] Adam Perer and Ben Shneiderman. Integrating statistics and vi-
 sualization for exploratory power: From long-term case studies to
 design guidelines . In: CGA: IEEE Computer Graphics and Applica-
BIBLIOGRAPHY 309
 tions 29.3 (2009), pp. 39?51. doi: 10.1109/MCG.2009.44 (cit. on
 pp. 46, 49).
 [Pup+11] Sergey Pupyrev, Lev Nachmanson, Sergey Bereg, and Alexander
 E. Holroyd. Edge routing with ordered bundles . In: GD ?11: Proc.
 19th International Symposium on Graph Drawing. 2011, pp. 136?
 147. doi: 10.1007/978-3-642-25878-7_14 (cit. on p. 43).
 [Pur02] Helen C. Purchase. Metrics for graph drawing aesthetics . In: JVLC:
 Journal of Visual Languages & Computing 13 (2002), pp. 501?516.
 doi: 10.1006/jvlc.2002.0232 (cit. on pp. 14, 33, 46, 226, 244,
 245, 251, 258, 266, 267).
 [Pur97] Helen C. Purchase. Which aesthetic has the greatest effect on human
 understanding? In: GD ?97: Proc. 5th International Symposium on
 Graph Drawing. Vol. 1353/1997. Lecture Notes in Computer Science.
 1997, pp. 248?261. doi: 10.1007/3- 540- 63938- 1_67 (cit. on
 pp. 250, 258, 266, 267).
 [Pur98] Helen C. Purchase. The effects of graph layout . In: OZCHI ?08: Proc.
 2008 Australasian Computer Human Interaction Conference. 1998,
 pp. 80?86. doi: 10.1109/OZCHI.1998.732199 (cit. on p. 250).
 [RLD] Nathalie Riche, Bongshin Lee, and Cody Dunne. Interactive visu-
 alization for exploring multi-modal, multi-relational, and multivari-
 ate graph data. English. U.S. Patent Application 13/041474 (cit. on
 p. 60).
 [Rod+11] Eduarda Mendes Rodrigues, Natasa Milic-Frayling, Marc Smith,
 Ben Shneiderman, and Derek Hansen. Group-in-a-Box layout for
 multi-faceted analysis of communities . In: SocialCom ?11: Proc. 2011
 IEEE 3rd International Conference on Social Computing. 2011, pp. 354?
 361. doi: 10.1109/PASSAT/SocialCom.2011.139 (cit. on pp. 13,
 142, 143, 160, 161, 223).
 [SA06] Ben Shneiderman and Aleks Aris. Network visualization by Seman-
 tic Substrates . In: TVCG: IEEE Transactions on Visualization and
 Computer Graphics 12.5 (2006), pp. 733?740. doi: 10.1109/TVCG.
 2006.166 (cit. on pp. 1, 5, 21, 26, 27, 46, 126, 148).
 [Sar+06] Purvi Saraiya, Chris North, Vy Lam, and Karen A. Duca.An insight-
 based longitudinal study of visual analytics . In: TVCG: IEEE Trans-
 actions on Visualization and Computer Graphics 12.6 (6 2006),
 pp. 1511?1522. doi: 10.1109/TVCG.2006.85 (cit. on p. 45).
BIBLIOGRAPHY 310
 [SD12] Ben Shneiderman and Cody Dunne. Interactive network exploration
 to derive insights: Filtering, clustering, grouping, and simplification.
 In: GD ?12: Proc. 20th International Symposium on Graph Drawing.
 Vol. 7704. Keynote. 2012, pp. 2?18. doi: 10.1007/978-3-642-
 36763-2_2 (cit. on pp. 7, 13, 20, 54, 76, 145).
 [Sha+03] Paul Shannon, Andrew Markiel, Owen Ozier, Nitin S. Baliga, Jonathan
 T. Wang, Daniel Ramage, Nada Amin, Benno Schwikowski, and
 Trey Ideker. Cytoscape: A software environment for integrated mod-
 els of biomolecular interaction networks. In: Genome Research 13.11
 (2003), pp. 2498?2504. doi: 10.1101/gr.1239303 (cit. on pp. 21,
 23, 24, 63, 184, 185, 281).
 [Shn+12] Ben Shneiderman, Cody Dunne, Puneet Sharma, and Ping Wang.
 Innovation trajectories for information visualizations: Comparing
 treemaps, cone trees, and hyperbolic trees . In: IVS: Information Vi-
 sualization 11.2 (2012), pp. 87?105. doi: 10.1177/1473871611424815
 (cit. on pp. 64?66).
 [Shn92] Ben Shneiderman. Tree visualization with Tree-Maps: 2-D space-
 filling approach. In: ACM Trans. Graph. 11.1 (1992), pp. 92?99.
 doi: 10.1145/102377.115768 (cit. on p. 160).
 [Smi+09] Marc Smith, Ben Shneiderman, Natasa Milic-Frayling, Eduarda Mendes
 Rodrigues, Vladimir Barash, Cody Dunne, Tony Capone, Adam
 Perer, and Eric Gleave. Analyzing (social media) networks with NodeXL.
 In: C&T ?09: Proc. 4th International Conference on Communities
 and Technologies. 2009, pp. 255?264. doi: 10 . 1145 / 1556460 .
 1556497 (cit. on pp. 21, 23, 49, 95).
 [Smi+10] Marc Smith, Ben Shneiderman, Natasa Milic-Frayling, Eduarda M.
 Rodrigues, Vladimir Barash, Cody Dunne, Tony Capone, Adam
 Perer, and Eric Gleave. NodeXL: A free and open network overview,
 discovery and exploration add-in for Excel 2007/2010 . Social Media
 Research Foundation. 2010. url: http://nodexl.codeplex.com
 (cit. on pp. 7, 10, 13, 16, 20, 26, 41, 45, 48, 49, 53, 68, 71, 72, 75, 95,
 105, 138, 144, 160, 181, 225, 228, 230, 242, 247, 255, 269, 271?273,
 289).
 [Smi+13] Marc Smith, Ben Shneiderman, Natasa Milic-Frayling, Eduarda M.
 Rodrigues, Vladimir Barash, Cody Dunne, Tony Capone, Adam
 Perer, and Eric Gleave. NodeXLGraph Gallery . Social Media Re-
 search Foundation. 2013. url: http://nodexlgraphgallery.org/
 (cit. on pp. 213, 218, 225).
BIBLIOGRAPHY 311
 [SNA08] Pekka Salmela, Olli S. Nevalainen, and Tero Aittokallio. A multi-
 level graph layout algorithm for Cytoscape bioinformatics software
 platform. Tech. rep. 861. Turku Centre for Computer Science, 2008
 (cit. on p. 41).
 [SND05] Purvi Saraiya, Chris North, and Karen A. Duca. An insight-based
 methodology for evaluating bioinformatics visualizations . In: TVCG:
 IEEE Transactions on Visualization and Computer Graphics 11.4
 (2005), pp. 443?456. doi: 10.1109/TVCG.2005.53 (cit. on p. 45).
 [SP06] Ben Shneiderman and Catherine Plaisant. Strategies for evaluat-
 ing information visualization tools: Multi-dimensional in-depth long-
 term case studies . In: BELIV ?06: Proc. 2006 AVI workshop on BE-
 yond time and errors: novel evaLuation methods for Information
 Visualization. 2006, pp. 1?7. doi: 10.1145/1168149.1168158 (cit.
 on p. 45).
 [Spo] Spotfire. spotfire.tibco.com (cit. on pp. 56, 286).
 [SS06] Jinwook Seo and Ben Shneiderman. Knowledge discovery in high-
 dimensional data: case studies and a user survey for the rank-by-
 feature framework . In: TVCG: IEEE Transactions on Visualization
 and Computer Graphics 12.3 (2006), pp. 311?322. doi: 10.1109/
 TVCG.2006.50 (cit. on p. 46).
 [STT81] Kozo Sugiyama, Shojiro Tagawa, and Mitsuhiko Toda. Methods for
 visual understanding of hierarchical system structures . In: IEEE
 Transactions on Systems, Man and Cybernetics 11.2 (1981), pp. 109?
 125. doi: 10.1109/TSMC.1981.4308636 (cit. on pp. 249, 258).
 [Sug02] Kozo Sugiyama. Graph drawing and applications for software and
 knowledge engineers . Vol. 11. Series on Software Engineering and
 Knowledge Engineering. World Scientific Publishing Company, 2002
 (cit. on pp. 32, 33, 245, 249, 255).
 [Tab] Tableau. www.tableausoftware.com (cit. on p. 56).
 [TTT06] Etsuji Tomita, Akira Tanaka, and Haruhisa Takahashi. The worst-
 case time complexity for generating all maximal cliques and compu-
 tational experiments . In: Theoretical Computer Science 363.1 (2006),
 pp. 28?42. doi: 10.1016/j.tcs.2006.06.015 (cit. on p. 93).
 [TTT12] Orestis Tsigkas, Olivier Thonnard, and Dimitrios Tzovaras. Visual
 spam campaigns analysis using abstract graphs representation. In:
 VizSEC ?12:: Proc. 9th International Symposium on Visualization
 for Cyber Security. VizSec ?12. 2012, pp. 64?71. doi: 10.1145/
 2379690.2379699 (cit. on p. 36).
BIBLIOGRAPHY 312
 [Wal01] C. Walshaw. A multilevel algorithm for force-directed graph drawing .
 In: GD ?00: Proc. 8th International Symposium on Graph Drawing.
 2001, pp. 31?55. doi: 10.1007/3-540-44541-2_17 (cit. on p. 41).
 [War+02] ColinWare, Helen C. Purchase, Linda Colpoys, and Matthew McGill.
 Cognitive measurements of graph aesthetics . In: IVS: Information
 Visualization 1.2 (2002), pp. 103?110. doi: 10.1057/palgrave.
 ivs.9500013 (cit. on pp. 14, 46, 250, 256?258, 266, 268).
 [War04] Colin Ware. Information visualization: perception for design. Mor-
 gan Kaufmann Publishers Inc., 2004 (cit. on pp. 33, 249).
 [Wat06] Martin Wattenberg. Visual exploration of multivariate graphs . In:
 CHI ?06: Proc. SIGCHI Conference on Human Factors in Comput-
 ing Systems. CHI ?06. 2006, pp. 811?819. doi: 10.1145/1124772.
 1124891 (cit. on pp. 7, 29, 34, 148).
 [WD08] Jo Wood and Jason Dykes. Spatially ordered treemaps . In: TVCG:
 IEEE Transactions on Visualization and Computer Graphics 14.6
 (2008), pp. 1348?1355. doi: 10 . 1109 / TVCG . 2008 . 165 (cit. on
 pp. 41, 44).
 [Wel+07] Howard T. Welser, Eric Gleave, Danyel Fisher, and Marc Smith.
 Visualizing the signatures of social roles in online discussion groups .
 In: JOSS: Journal of Social Structure 8.2 (2007) (cit. on pp. 3, 38).
 [Won+08] Pak Chung Wong, Harlan Foote, Patrick Mackey, George Chin,
 Heidi Sofia, and Jim Thomas. A dynamic multiscale magnifying tool
 for exploring large sparse graphs . In: IVS: Information Visualization
 7.2 (2008), pp. 105?117. doi: 10.1057/palgrave.ivs.9500177 (cit.
 on pp. 41, 283).
 [WS79] Charles Wetherell and Alfred Shannon. Tidy drawings of trees . In:
 IEEE Transactions on Software Engineering SE-5.5 (1979), pp. 514?
 520. doi: 10.1109/TSE.1979.234212 (cit. on pp. 33, 245).
 [WS92] Christopher Williamson and Ben Shneiderman. The dynamic Home-
 Finder: evaluating dynamic queries in a real-estate information ex-
 ploration system. In: SIGIR ?92: Proc. 15th annual international
 ACM SIGIR conference on research and development in informa-
 tion retrieval. 1992, pp. 338?346. doi: 10.1145/133160.133216
 (cit. on p. 56).
 [WT07] Ken Wakita and Toshiyuki Tsurumi. Finding community structure
 in mega-scale social networks: [extended abstract] . In: WWW ?07:
 Proc. 16th international conference on World Wide Web. 2007, pp. 1275?
 1276. doi: 10.1145/1242572.1242805 (cit. on pp. 140, 143, 147).
BIBLIOGRAPHY 313
 [Ye+05] Ping Ye, Brian D Peyser, Forrest A Spencer, and Joel S Bader.
 Commensurate distances and similar motifs in genetic congruence
 and protein interaction networks in yeast. In: BMC Bioinformatics
 6 (2005), p. 270. doi: 10.1186/1471-2105-6-270 (cit. on p. 37).
 [YEL10] Ji Soo Yi, Niklas Elmqvist, and Seungyoon Lee. TimeMatrix: Ana-
 lyzing temporal social networks using interactive matrix-based visual-
 izations . In: IJHCI: International Journal of Human-Computer In-
 teraction 26.11?12 (2010), pp. 1031?1051. doi: 10.1080/10447318.
 2010.516722 (cit. on p. 23).
 [ZCM05] Shengdong Zhao, Mark H. Chignell, and Michael J. McGuffin. Elas-
 tic Hierarchies: Combining treemaps and node-link diagrams . In:
 INFOVIS ?05: Proc. IEEE symposium on Information Visualiza-
 tion. INFOVIS ?05. 2005, pp. 57?64. doi: 10.1109/INFVIS.2005.
 1532129 (cit. on p. 26).
 [ZGS07] Xiaowei Zhu, Mark Gerstein, and Michael Snyder. Getting connected:
 Analysis and principles of biological networks. In: Genes Develop-
 ment 21.9 (2007), pp. 1010?1024. doi: 10.1101/gad.1528707 (cit.
 on p. 37).