ABSTRACT

Title of Dissertation: REPRESENTATION LEARNING FOR LARGE-SCALE GRAPHS
Yu Jin, Doctor of Philosophy, 2023
Dissertation Directed by: Professor Joseph F. JaJa, Department of Electrical and Computer Engineering

Graphs are widely used to model object relationships in real-world applications such as biology, neuroscience, and communication networks. Traditional graph analysis tools focus on extracting key graph patterns that characterize graphs, which are then used in downstream tasks such as prediction. Common graph characteristics include global and local graph measurements such as the clustering coefficient, global efficiency, characteristic path length, and diameter. Many applications involve high-dimensional, large-scale graphs for which existing methods, which rely on a small number of graph characteristics, cannot efficiently perform graph tasks. A major research challenge is to learn graph representations that can be used to efficiently perform graph tasks as required by a wide range of applications. In this thesis, we have developed a number of novel methods to tackle the challenges associated with processing and representing large-scale graphs.

In the first part, we propose a general graph coarsening framework that maps large graphs into smaller ones while preserving important structural graph properties. Based on spectral graph theory, we define a novel distance function that measures the differences between the graph spectra of the original and coarse graphs. We show that the proposed spectral distance sheds light on the structural differences in the graph coarsening process. In addition, we propose graph coarsening algorithms that aim to minimize the spectral distance, with provably strong bounds. Experiments show that the proposed algorithms outperform previous graph coarsening methods in applications such as graph classification and stochastic block recovery tasks.

In the second part, we propose a new graph neural network paradigm that improves the expressiveness of the best known graph representations. Graph neural network (GNN) models have recently been introduced to solve challenging graph-related problems. Most GNN models follow the message-passing paradigm, where node information is propagated through edges and graph representations are formed by aggregating node representations. Despite their successes, message-passing GNN models are limited in their expressive power and fail to capture basic characteristic properties of graphs. In our work, we represent graphs as the composition of sequence representations. Through the design of sequence sampling and modeling techniques, the proposed graph representations achieve provably powerful expressiveness while maintaining permutation invariance. Empirical results show that the proposed model achieves superior performance in real-world graph classification tasks.

In the third part, we develop a fast implementation of spectral clustering methods on CPU-GPU platforms. Spectral clustering is one of the most popular graph clustering algorithms and has achieved state-of-the-art performance in a wide range of applications. However, existing implementations in commonly used software platforms such as Matlab and Python do not scale well for many of the emerging Big Data applications. We present a fast implementation of the spectral clustering algorithm on a CPU-GPU heterogeneous platform.
Our implementation takes advantage of the computational power of the multi-core CPU and the massive multithreading capabilities of GPUs. We show that the new implementation achieves significantly accelerated computation speeds compared with previous implementations on a wide range of tasks.

In the fourth part, we study structural brain networks derived from Diffusion Tensor Imaging (DTI) data. The processing of DTI data, coupled with the use of modern tractographic methods, reveals white matter fiber connectivity at a relatively high resolution; this allows us to model the brain as a structural network which encodes pairwise connectivity strengths between brain voxels. We have developed an iterative method to delineate the brain cortex into fine-grained connectivity-based brain parcellations. This allows us to map the initial large-scale brain network into a relatively small weighted graph that preserves the essential structural connectivity information. We show that graph representations based on the brain networks from the new brain parcellations are more powerful in discriminating between different population groups, compared with existing brain parcellations.

REPRESENTATION LEARNING FOR LARGE-SCALE GRAPHS

by Yu Jin

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2023

Advisory Committee:
Professor Joseph F. JaJa, Chair/Advisor
Professor Jonathan Simon
Professor Behtash Babadi
Professor Rajeev Barua
Professor Luiz Pessoa, Dean's Representative

© Copyright by Yu Jin 2023

Acknowledgments

I would like to express my most sincere gratitude to Professor Joseph JaJa for his mentorship and guidance over this long journey. His wisdom and depth of knowledge have helped me become a better researcher and a better person. With his trust and encouragement, I was able to freely pursue a wide range of challenging research topics. I am grateful for his patience and flexibility, which supported me through the most puzzling years. I want to thank Dr. Andreas Loukas, who supported me in extending the work on graph coarsening. His enthusiasm and persistence in graph research have encouraged me to persistently tackle challenging and fundamental research problems. I would like to thank the committee members Professor Jonathan Simon, Professor Behtash Babadi, Professor Rajeev Barua, and Professor Luiz Pessoa for providing valuable feedback on the thesis and presentation. I want to thank the Department of ECE for fostering a diverse environment in which to learn about cutting-edge technologies from different domains. Lastly, I want to thank my wife, Yun Zhou, and my parents, Wu Jin and Liumei Zhou, for their love and support.

Table of Contents

Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Abbreviations

Chapter 1: Introduction
  1.1 Motivation
  1.2 Contributions of this Thesis
  1.3 Outline of Thesis

Chapter 2: Background
  2.1 Preliminaries
  2.2 Graph Coarsening
    2.2.1 Overview
    2.2.2 Graph Coarsening Methods
    2.2.3 Applications
  2.3 Spectral Graph Theory
    2.3.1 Overview
    2.3.2 Relationship with Graph Structures
    2.3.3 Applications
  2.4 Graph Neural Network
    2.4.1 Overview
    2.4.2 Permutation Invariance
    2.4.3 Examples of GNN Models
    2.4.4 Expressiveness of GNN
    2.4.5 Limitations
  2.5 Structural Brain Networks
    2.5.1 Overview
    2.5.2 Diffusion Tensor Imaging Processing
    2.5.3 Brain Parcellation Scheme
    2.5.4 Construction of Structural Brain Networks
    2.5.5 Graph Theoretical Analysis

Chapter 3: Graph Coarsening with Preserved Spectral Properties
  3.1 Introduction
  3.2 Related Work
  3.3 Preliminaries
    3.3.1 Graph Coarsening
    3.3.2 Graph Lifting
  3.4 Spectral Distance
    3.4.1 Properties of the Coarse Laplacian Spectrum
    3.4.2 Spectral Distance
    3.4.3 Relation to Graph Coarsening
  3.5 Algorithms
    3.5.1 Multilevel Graph Coarsening
    3.5.2 Spectral Graph Coarsening
  3.6 Experiments
    3.6.1 Graph Classification with Coarse Graphs
    3.6.2 Block Recovery in the Stochastic Block Model
  3.7 Conclusion

Chapter 4: Powerful Graph Neural Networks with Sequence Modeling
  4.1 Introduction
  4.2 Related Work
  4.3 Preliminaries
    4.3.1 Notations
    4.3.2 Graph and Sequence Models
  4.4 Designing Graph Neural Networks with Sequence Modeling
    4.4.1 Overview
    4.4.2 Express Graphs as Combinations of Sequences
    4.4.3 Sequence Modeling
    4.4.4 Expressive Power of SeqGNN
    4.4.5 Complexity
    4.4.6 Practical Techniques
    4.4.7 Training and Inference
  4.5 Experiments
    4.5.1 Datasets
    4.5.2 Baseline Models
    4.5.3 Model Configuration
    4.5.4 Main Results
    4.5.5 Analysis of Expressive Power
  4.6 Conclusion

Chapter 5: Fast Spectral Clustering in GPU-CPU Platforms
  5.1 Introduction
  5.2 Overview of Spectral Clustering Algorithm
  5.3 Environment Setup
    5.3.1 The Heterogeneous System
    5.3.2 CUDA Platform
    5.3.3 ARPACK Software
    5.3.4 OpenBLAS
  5.4 Implementation
    5.4.1 Data Preprocessing
    5.4.2 Parallel Eigensolvers
    5.4.3 Parallel k-means Clustering
  5.5 Evaluation
    5.5.1 Datasets
    5.5.2 Environment and Software
    5.5.3 Performance Analysis
  5.6 Conclusion

Chapter 6: Connectivity-based Brain Parcellations
  6.1 Introduction
  6.2 Data and Preprocessing Steps
    6.2.1 Data Acquisition
    6.2.2 Nonlinear Registration
    6.2.3 Probabilistic Tractography
  6.3 Our Approach
    6.3.1 Connectivity Profile
    6.3.2 Spatially Constrained Similarity Graph
    6.3.3 Minimum Graph-Cut Problem and Iterative Refinement
  6.4 Reproducibility and Stability Analysis
    6.4.1 Parcellation Similarity Metrics
    6.4.2 Stability and Reproducibility of Subject Parcellations
    6.4.3 Group Consistency and Atlas Generation
  6.5 Discriminative Analysis
    6.5.1 Similarity-based Analysis
    6.5.2 Connectome Analysis
  6.6 Conclusion

Chapter 7: Concluding Remarks and Future Work

Appendix A: Supplementary Materials of Chapter 3
  A.1 Proof of Propositions 3.4.1 and 3.4.2
    A.1.1 Proof of Proposition 3.4.1
    A.1.2 Proof of Proposition 3.4.2
  A.2 Proof of Corollary 3.5.1
  A.3 Proof of Theorem 3.5.2
  A.4 Additional Material for Experiments
    A.4.1 Graph Classification Dataset
    A.4.2 Definition of Normalized Mutual Information

Bibliography

List of Tables

2.1  Demographics of NKI Sample
3.1  Classification accuracy on coarse graphs that are five times smaller
3.2  Recovery Accuracy of Block Structures from Random Graphs in the Stochastic Block Model
3.3  Statistics of the graph benchmark datasets
4.1  Comparison between MPNN and SeqGNN
4.2  Results (measured by accuracy, %) on TUDataset
5.1  CPU and GPU specifics
5.2  Datasets
5.3  Running Time of Spectral Clustering on DTI Dataset
5.4  Running Time of Spectral Clustering on FB Dataset
5.5  Running Time of Spectral Clustering on Syn200 Dataset
5.6  Running Time of Spectral Clustering on dblp Dataset
5.7  Comparison Between Data Communication Time and Computation Time
6.1  Subject Demographics
6.2  NMI Between Parcellations After 1st Iteration
6.3  NMI Between Parcellations After 2nd Iteration
6.4  NMI Between Parcellations After 3rd Iteration
6.5  NMI Between Parcellations After 4th Iteration
6.6  Dice's Coefficient Between Parcellations After 1st Iteration
6.7  Dice's Coefficient Between Parcellations After 2nd Iteration
6.8  Dice's Coefficient Between Parcellations After 3rd Iteration
6.9  Dice's Coefficient Between Parcellations After 4th Iteration
6.10 NMI Between Parcellations in Consecutive Iteration Stages
6.11 Dice's Coefficient Between Parcellations in Consecutive Iteration Stages
6.12 Average Similarity Between Parcellations of Different Subjects within the NC Group
6.13 Average Similarity Between Parcellations of Subjects within the NC Group and Randomly Generated Parcellation
6.14 P-value and T-statistic for Gender Study within the NC Group
6.15 P-value and T-statistic for Age Study within the NC Group
6.16 P-value and T-statistic for Schizophrenic Study
6.17 P-value and T-statistic of Pairwise Connectivity Between Normal Control and Schizophrenic Groups
A.1  Statistics of the graph benchmark datasets

List of Figures

2.1  Left: seed region (the JHU white matter atlas). Middle: target region (the AAL mask). Right: the result of probabilistic tractography, which models the distribution of neuron fiber bundles; the intensity of each voxel represents the number of streamlines passing through that voxel. All figures are imaged in the axial plane.
3.1  Left: an example illustrating the graph coarsening process. The original graph is a random graph sampled from the stochastic block model with 50 nodes and 10 predefined blocks. The coarse graph is obtained from the predefined partitions. Right: eigenvalues and eigenvectors of the normalized Laplacian matrices of the original, coarse, and lifted graphs. The eigenvalues of the coarsened graphs align with the eigenvalues of the original graphs, and the eigenvectors indicate the block membership information.
4.1  Relationships between graphs, node permutations, and node sequences. The node permutation is an ordering of the graph nodes, and the node sequence contains the edge information between consecutive nodes.
4.2  The overall framework of the SeqGNN model. (1) The first step is to represent graphs as a set of node sequences. A node sequence consists of a permutation of the graph nodes and the associated edge information between consecutive nodes. The node sequences and the associated weights are determined by the sequence sampling methods and the specific graph structures. (2) The second step is to learn the sequence representations, which are further combined to form the graph representations.
4.3  Example: (a) Regular graphs that cannot be distinguished by MPNN and 1-WL. (b) With k layers of message passing, node representations of MPNN models contain information from nodes that are k hops away. The information flow to compute the node representations can be represented as rooted subtrees in which the root node absorbs the information flowing from the leaf nodes [1, 2]; due to the anonymity of graph nodes, the subtrees are indistinguishable for regular graphs. (c) SeqGNN formulates the graph as the composition of sequence representations with node permutation and edge information. Despite the anonymity of graph nodes, the sequences can still distinguish the two graphs because of differences in the edge patterns.
4.4  Graph coarsening techniques to reduce the complexity
4.5  Classification accuracy under different numbers of convolutional layers
5.1  CUDA Program Model
5.2  Parallel Implementation of Spectral Clustering
5.3  Time Costs of Spectral Clustering on DTI Dataset
5.4  Time Costs of Spectral Clustering on FB Dataset
5.5  Time Costs of Spectral Clustering on Syn200 Dataset
5.6  Time Costs of Spectral Clustering on dblp Dataset
6.1  32 neighbors of a voxel within a sphere of radius r = 2. Note that each voxel represents a 1.718mm×1.718mm×3mm brain volume; the figure shows the symbolic neighbors of voxels rather than the actual volume size.
6.2  Brain segmentations used to define connectivity profiles
6.3  Subject reproducibility after each iteration
6.4  Atlas and confidence map for the NC group. Note that for the confidence map, the grey scale represents the ratio of overlapped regions.
6.5  Parcellation with 5 regions
6.6  Binary maps where entries in red have p-values < 0.05 and < 0.00005, respectively, in terms of connectivity strengths between the two population groups using our 40-region parcellations
6.7  Binary map where entries in red have p-values < 0.05 and < 0.00005, respectively, between connectivity strengths of the two population groups using the AAL atlas
List of Abbreviations

BLAS  Basic Linear Algebra Subprograms
BSR  Block Compressed Sparse Row Format
CIN  Cell Isomorphism Network
CNN  Convolutional Neural Network
COO  Coordinate Format
CPU  Central Processing Unit
CSC  Compressed Sparse Column Format
CSR  Compressed Sparse Row Format
DGCNN  Deep Graph Convolutional Neural Network
dMRI  Diffusion Magnetic Resonance Imaging
DTI  Diffusion Tensor Imaging
EM  Edge Matching
FC  Functional Connectivity
fMRI  Functional Magnetic Resonance Imaging
GAT  Graph Attention Network
GCN  Graph Convolutional Network
GIN  Graph Isomorphism Network
GNN  Graph Neural Network
GPU  Graphics Processing Unit
IPCC  Human Inferior Parietal Cortex Complex
IRAM  Implicitly Restarted Arnoldi Method
LAPACK  Linear Algebra PACKage
LSTM  Long Short-Term Memory Network
LV  Local Variation
MDL  Minimum Description Length
MGC  Multilevel Graph Coarsening
MKL  Math Kernel Library
MNI  Montreal Neurological Institute
MPI  Message Passing Interface
MPNN  Message Passing Graph Neural Network
MRI  Magnetic Resonance Imaging
NC  Normal Control
Ncut  Normalized Cut
NetLSD  Network Laplacian Spectral Descriptor
NMI  Normalized Mutual Information
PCIe  Peripheral Component Interconnect Express
PNA  Principal Neighbourhood Aggregation
PPGN  Provably Powerful Graph Network
ROI  Region of Interest
RNN  Recurrent Neural Network
SC  Spectral Clustering
SD  Spectral Distance
SeqGNN  Graph Neural Network with Sequence Modeling
SGC  Spectral Graph Coarsening
SGD  Stochastic Gradient Descent
SIMD  Single Instruction, Multiple Data
SIN  Simplicial Isomorphism Network
SM  Streaming Multiprocessor
SP  Streaming Processor
SZ  Schizophrenia
TPJ  Temporoparietal Junction Area
WL  Weisfeiler-Lehman Isomorphism Test

Chapter 1: Introduction

1.1 Motivation

Graphs are widely used to model object relationships in real-world applications. There have been extensive studies that propose the use of various graph features to help analyze and understand the complex structures of graphs across a number of domains. Most of the proposed graph features are based on the statistics of elementary graph structures such as paths, walks, trees, and cuts, which are then used for subsequent machine learning tasks. However, predefined, hand-crafted graph features are not flexible enough to handle graph prediction tasks for a wide range of applications. Moreover, the computation of many graph features does not scale well with graph size, which causes high computational costs, especially for the analysis of large-scale graphs. In this introduction, we describe the main problems addressed in this thesis.

Graph Coarsening. To analyze large graphs, a common technique is to coarsen the graph into a smaller graph that reduces the computational load while attempting to maintain the important structural properties of the original graph. The coarsening process significantly reduces the dimensionality of the original large graphs and accelerates subsequent graph processing tasks. Graph statistics such as the characteristic path length, global efficiency, and clustering coefficient can be efficiently computed on the coarsened graphs. Although various graph coarsening methods have been proposed, there is no overall consensus on the criteria for graph coarsening, which directly determine the statistical effectiveness of subsequent network analysis.

Spectral Methods. Spectral methods are emerging as powerful tools in machine learning and network science.
Graph spectra are represented by the eigenvalues and eigenvectors of matrices associated with the graph, such as the adjacency and Laplacian matrices. The field of spectral graph theory has established a number of interesting results on the fundamental relationship between graph structural properties and graph spectra. In particular, related work shows that the graph spectrum contains rich information about important graph characteristics such as connectivity, bipartiteness, the number of connected components, the diameter, and the average path length [3]. We introduce a novel framework in this thesis that tightly links graph coarsening and spectral graph theory.

Spectral Clustering Algorithm. There are a number of algorithms that include the computation of the graph spectrum as one of their major computational components to solve problems in a wide variety of applications. An example is the spectral clustering algorithm, one of the most popular clustering algorithms, which aims to find densely-connected clusters within graphs. Note that graph spectral features, in addition to the common network statistics, are used in graph representations, which can achieve good prediction capability. A significant hurdle for using graph spectra in applications with large-scale graphs is the computational cost of computing the whole set of eigenvalues and eigenvectors. Graphics Processing Units (GPUs) are well known for their parallel processing capabilities, which can accelerate large-scale matrix computations. We provide in this thesis a GPU-accelerated implementation for computing the spectrum, which significantly outperforms previous implementations.

Graph Neural Networks. Deep learning models have achieved great successes in a wide range of applications including computer vision, natural language processing, audio processing, and many others. Recently, deep learning models have been introduced to solve graph-related tasks such as node classification and graph classification. A notable model is the graph neural network (GNN) model, which learns node and graph representations to solve important graph tasks. Most GNN models follow a message-passing paradigm where node representations are formed by iteratively aggregating information from neighboring nodes, followed by a graph pooling function to generate graph representations [4, 5, 6, 7]. Despite the success of message-passing GNN models in many graph-related tasks, recent studies have shown that GNN models have limited expressive power and fail to capture even simple graph characteristics. For example, GNN graph representations cannot distinguish certain pairs of non-isomorphic regular graphs. In this thesis, we develop a new approach that achieves more powerful expressiveness while maintaining permutation invariance.

Structural Brain Networks. An important investigation conducted in this thesis is the study of large-scale brain networks derived from Diffusion Tensor Imaging (DTI) data. The objective is to learn characteristic graph representations which can achieve high prediction performance in applications involving, for example, brain disorders. Diffusion Magnetic Resonance Imaging (MRI) exploits the anisotropic diffusion of water molecules in the brain to enable the estimation of the brain's anatomical fiber tracts at a relatively high resolution.
Tractographic methods can be used to generate a whole-brain anatomical connectivity matrix where each matrix element provides an estimate of the connectivity strength between the corresponding voxels. The resulting brain connectivity graphs are quite large, involving hundreds of thousands of nodes and billions of edges. A common method to address this challenge is to build structural brain networks using a predefined whole-brain parcellation (defining Regions of Interest, ROIs), where the nodes of the network represent the brain regions and the edge weights capture the connectivity strengths between the corresponding brain regions. Traditional methods build structural brain networks from existing anatomical brain atlases such as Automated Anatomical Labelling (AAL) [8], which do not consider the underlying connectivity information. There is potential to develop more effective brain parcellations using structural information derived from the DTI data.

1.2 Contributions of this Thesis

In this thesis, we make significant, original contributions to all the problems mentioned above.

In the first part, we propose an effective graph coarsening method which coarsens large graphs while provably preserving some important graph structural properties. Graph coarsening is a common technique used to simplify complex graphs by reducing the number of nodes and edges while trying to preserve some of the main structural properties of the original graph. A key challenge is to determine what specific graph properties are to be preserved by coarse graphs. We propose a new graph coarsening method that attempts to preserve the graph spectral properties. We start by relating the spectrum of the original graph to that of the coarsened graph. Based on this insight, we define a novel distance function that measures the differences between the graph spectra of the original and coarsened graphs. We show that the proposed spectral distance captures in a significant way the structural differences arising during the graph coarsening process. We also propose graph coarsening algorithms that aim to minimize the spectral distance. Experiments show that the proposed algorithms can outperform previous graph coarsening methods in applications such as graph classification and stochastic block recovery tasks.

In the second part, we propose a new graph neural network paradigm that improves the expressiveness of graph representations. We represent graphs as weighted combinations of sequence representations. We show that, through the design of sequence sampling and modeling techniques, the proposed graph representations achieve provably powerful expressiveness while maintaining the permutation invariance property. Empirical results show that the proposed graph representations achieve superior performance in real-world graph classification tasks.

The third part focuses on the acceleration of spectral clustering methods, leveraging the computational advantages of CPUs and GPUs. Spectral clustering is one of the most effective graph clustering algorithms and is widely used in a number of applications. However, existing implementations in commonly used software platforms such as Matlab and Python do not scale well for many of the emerging Big Data applications. We present a fast implementation of the spectral clustering algorithm on a CPU-GPU heterogeneous platform. Our implementation takes advantage of the computational power of the multi-core CPU and the massive multithreading capabilities of GPUs.
We show that the new implementation achieves significantly accelerated computation speeds compared with previous implementations in a number of important applications.

The fourth part focuses on the study of structural brain networks with the objective of determining graph representations based on DTI data. We develop an iterative method to delineate the brain cortex into fine-grained connectivity-based brain parcellations. This allows the mapping of the initial large-scale brain network into a relatively small weighted graph, which preserves the essential structural connectivity information. We show that graph representations based on the brain networks from the new brain parcellations are more powerful in discriminating between a number of population groups.

1.3 Outline of Thesis

The rest of the thesis is organized as follows. In Chapter 2, we give the background of the research work conducted in this thesis. In Chapter 3, we propose a new graph coarsening method which coarsens large graphs into smaller ones with preserved graph spectral properties. In Chapter 4, we propose a new graph neural network model that improves the expressiveness of graph representations. In Chapter 5, we present a fast implementation of spectral clustering for large-scale graphs. In Chapter 6, we study graph representations of structural brain networks, and we conclude in Chapter 7 by discussing several potential directions for future research.

Chapter 2: Background

2.1 Preliminaries

We denote a graph as G = (V, W, H), with V the set of graph nodes with n = |V|, W ∈ R^{n×n} the (possibly weighted) adjacency matrix, and H ∈ R^{n×d} the matrix holding the node features. We denote by v_i ∈ V the node indexed by i, by w(i) ∈ R^n the vector of edge weights associated with v_i, and by d(i) = ∑_{j=1}^{n} W(i, j) the node degree. In addition, we use h_i to denote the node representation of node v_i (corresponding to the i-th row of H) and h_G to denote the graph-level representation.

2.2 Graph Coarsening

2.2.1 Overview

Graph coarsening is a common graph processing technique widely used in a number of applications involving large-scale graphs. Graph coarsening simplifies large graphs by reducing the number of nodes and edges while trying to preserve some of the main properties of the original graph. The goal is to accelerate graph processing while maintaining similar performance on graph processing tasks.

One of the key challenges is to define the coarsening criteria based on which the graph nodes and edges are reduced. The coarsening criteria directly determine the coarsening process and the quality of the resulting coarsened graphs. Researchers have proposed a number of different coarsening criteria, each resulting in an optimization that preserves certain graph properties. For example, Loukas et al. proposed to preserve the action of the graph Laplacian with respect to an (eigen-)space of fixed dimension, arguing that this suffices to capture the global properties of graphs relevant to partitioning and spectral clustering [9]. Durfee et al. proposed to preserve the all-pairs effective resistance [10], while Garg and Jaakkola defined a cost based on the theory of optimal transport [11]. Saket et al. suggested a Minimum Description Length (MDL) principle relevant to unweighted graphs [12]. Most previous graph coarsening methods are specific to particular applications; the question of how to define an application-independent graph coarsening framework remains a challenge.
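Before turning to specific methods, the following minimal sketch (Python/NumPy; the function name and the hand-picked partition are ours, and whether induced self-loops are kept or dropped is a convention that varies across schemes) illustrates the mechanics shared by most coarsening procedures: a node partition is fixed, and each group is collapsed into a super-node whose edge weights are the sums of the original inter-group weights. The hard part, on which the methods below and Chapter 3 differ, is how the partition is chosen.

```python
import numpy as np

def coarsen_by_partition(W, partition):
    """Collapse each group of nodes in `partition` into a single super-node.

    W         : (n, n) symmetric weighted adjacency matrix
    partition : list of index lists, one list per super-node
    Returns the (k, k) coarse adjacency matrix with summed inter-group weights.
    """
    n, k = W.shape[0], len(partition)
    # Membership matrix P: P[c, i] = 1 if node i belongs to super-node c.
    P = np.zeros((k, n))
    for c, group in enumerate(partition):
        P[c, group] = 1.0
    W_coarse = P @ W @ P.T           # sums edge weights between groups
    np.fill_diagonal(W_coarse, 0.0)  # drop self-loops from intra-group edges (a convention choice)
    return W_coarse

# Example: a 6-node path graph coarsened into 3 super-nodes.
W = np.zeros((6, 6))
for i in range(5):
    W[i, i + 1] = W[i + 1, i] = 1.0
print(coarsen_by_partition(W, [[0, 1], [2, 3], [4, 5]]))
```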
2.2.2 Graph Coarsening Methods

There are a number of graph coarsening methods designed to optimize for different graph properties, some of which are summarized below.

Heavy Edge Matching. In this method, graph nodes are collapsed based on the heaviest edge (i.e., the edge with the largest weight) connecting them. The process is repeated iteratively until a certain graph size is reached [13].

Algebraic Multigrid Coarsening. This method is based on the multigrid method used in numerical linear algebra. It involves constructing a hierarchy of coarser graphs from the original graph using matrix operations, which makes it possible to solve linear systems efficiently [14].

Community Detection-based Coarsening. This method first clusters graph nodes into densely-connected communities and merges graph nodes belonging to the same community into a single node. The method is particularly useful for finding community structures within graphs [15].

2.2.3 Applications

The following are examples of applications where graph coarsening methods have been used.

Graph Partitioning. Graph coarsening is often used as a preprocessing step in graph partitioning, which aims to divide a graph into smaller subgraphs while minimizing some cost function related to the number of edges cut. The coarsened graph can be partitioned more efficiently, and the resulting partition can then be projected back onto the original graph.

Community Detection. Graph coarsening can be used to find communities or clusters within the graph. By merging closely connected nodes, the coarser graph can help reveal the underlying community structures more efficiently [16, 17].

Multigrid Methods. Graph coarsening plays a central role in algebraic multigrid methods as well as the related class of multilevel incomplete LU factorizations. The idea is to project the original problem to an "equivalent" problem on the coarse mesh, and the solution is then interpolated back to the fine-level grid [14]. Graph coarsening significantly accelerates the processing speed of multigrid methods.

Machine Learning. Graph-based machine learning methods, such as Graph Convolutional Networks (GCNs), have included graph coarsening as one of the core components in building end-to-end machine learning models for specific graph tasks. One benefit of graph coarsening is to reduce the computational time and memory requirements for large graphs [7]. Moreover, graph coarsening can often improve the convergence speed and performance for specific tasks [18].

2.3 Spectral Graph Theory

2.3.1 Overview

We introduce some basic concepts from spectral graph theory, which we believe offers a more robust foundation for understanding our work on graph coarsening. In essence, spectral graph theory explores the fundamental relationship between the structural properties of graphs and the spectra of matrices associated with these graphs, typically either the adjacency matrix or the Laplacian matrix. Many tutorials and several textbooks have been written about spectral graph theory; see for example [3].

In this thesis, we focus on the normalized Laplacian matrix, whose eigenvalues are closely related to a number of topological invariants of the graph. The normalized Laplacian of a graph is defined as

\[ I - D^{-1} W \tag{2.1} \]

where W is the weight matrix of the graph and D is the diagonal matrix whose diagonal entries are defined by D_{i,i} = ∑_j W_{i,j}.
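As a concrete illustration of Eq. (2.1), the sketch below builds the normalized Laplacian from a weighted adjacency matrix (a minimal NumPy sketch; the function name is ours, and it assumes every node has positive degree, consistent with the graphs considered in this thesis).

```python
import numpy as np

def normalized_laplacian(W):
    """Return I - D^{-1} W for a weighted adjacency matrix W, as in Eq. (2.1).

    Assumes no isolated nodes, so every diagonal entry of D is positive.
    """
    d = W.sum(axis=1)                            # degrees D_{i,i} = sum_j W_{i,j}
    return np.eye(W.shape[0]) - W / d[:, None]   # row-wise division applies D^{-1}
```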
For our purposes, the spectrum of the graph is defined as the sequence of eigenvalues of the normalized graph Laplacian, ordered as 0 = λ_1 ≤ λ_2 ≤ ... ≤ λ_n ≤ 2.

2.3.2 Relationship with Graph Structures

Extensive studies have shown that the spectrum is closely related to the graph structure. For example, the multiplicity of the eigenvalue 0 corresponds to the number of connected components in the graph, and the largest eigenvalue λ_n is equal to 2 if and only if the graph is bipartite [3]. Moreover, the second smallest eigenvalue λ_2 is of special importance and has been shown to be closely related to the convergence rate of random walks on the graph, "chemical conductance," Cheeger's constant, the graph diameter, and graph quasi-randomness [3]. Other eigenvalues also play an important role in capturing graph structural properties, a role that has not yet been fully explored.

2.3.3 Applications

While we focus in this thesis on the application of spectral graph theory to graph coarsening, other major applications include:

• Spectral Clustering. Spectral clustering is a widely used technique that groups similar data points by analyzing the eigenvectors of the graph's Laplacian matrix [19].

• Image Segmentation. In computer vision, spectral graph theory can be applied to segment images into meaningful regions by representing the image as a graph and using eigenvectors to identify boundaries between different regions [20, 21].

• Dimensionality Reduction. Spectral methods for dimensionality reduction use the spectral (eigenvalue) decomposition of matrices derived from the data in order to transform the data into a lower-dimensional space. These methods are primarily based on constructing a similarity graph from the data, where each data point is represented as a node and the edge weights represent the similarity between data points. From this graph, a matrix representation is derived, often the graph Laplacian, but sometimes the adjacency or degree matrix. The eigenvectors corresponding to the smallest eigenvalues of this matrix are then used to represent the data in a lower-dimensional space. Notable examples are Laplacian Eigenmaps, Locally Linear Embedding (LLE), and Isomap [22, 23, 24, 25].

2.4 Graph Neural Network

2.4.1 Overview

Graph neural network (GNN) models have emerged as a powerful tool to solve graph-related problems such as node classification, link prediction, and graph classification [4, 5, 7, 26, 27, 28]. Most GNN models follow a message-passing paradigm where node representations are formed by iteratively aggregating information from neighboring nodes, followed by a graph pooling function to generate graph representations [4, 5, 6, 7]. Starting with the initial node features H^{(0)} = H, a message-passing graph neural network iteratively updates node representations by aggregating information from neighboring nodes. The following functions characterize the graph convolutional process [4, 6, 9]:

\[ h_v^{(i)} = \mathrm{Aggregate}^{(i)}\big(h_v^{(i-1)}, m_v^{(i)}\big), \qquad m_v^{(i)} = \mathrm{Msg}^{(i)}\big(\{h_u^{(i-1)} : u \in \mathcal{N}(v)\}\big) \tag{2.2} \]

where h_u^{(i)} is the representation of node u at iteration i and Msg is the message function that aggregates neighborhood information.
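As an illustration of Eq. (2.2), the sketch below implements one update round with a simple choice of both functions: Msg computes a weighted sum of the neighbor representations, and Aggregate applies a linear map and a nonlinearity to the concatenation of the old representation and the message (a minimal NumPy sketch with names of our own choosing; concrete Msg/Aggregate choices used by popular models are listed in Section 2.4.3).

```python
import numpy as np

def message_passing_step(H, W, Theta, sigma=np.tanh):
    """One round of Eq. (2.2) with illustrative Msg and Aggregate functions.

    H     : (n, d) node representations h_v^{(i-1)}
    W     : (n, n) weighted adjacency matrix; N(v) = nonzero entries of row v
    Theta : (2d, d') learnable weights of the Aggregate step
    Msg       : m_v = weighted sum of neighbor representations
    Aggregate : h_v = sigma([h_v, m_v] @ Theta)
    """
    M = W @ H                                             # messages m_v^{(i)}
    return sigma(np.concatenate([H, M], axis=1) @ Theta)  # updated h_v^{(i)}
```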
The graph representation h_G is obtained by applying the pooling (readout) function to the node representations:

\[ h_G = \mathrm{Readout}\big(\{h_v^{(k)} : v \in \mathcal{V}\}\big) \tag{2.3} \]

2.4.2 Permutation Invariance

Most GNN models need to satisfy the property of permutation invariance, that is, the graph representations remain invariant under any permutation applied to the inputs. We denote by π = {v_1, v_2, ..., v_n} a node permutation, and G_π = (V_π, W_π, H_π) is the graph whose node indices, weight matrix, and feature matrix are permuted according to π. Permutation invariance is expressed as follows:

\[ \mathrm{GNN}(H, W) = \mathrm{GNN}(H_\pi, W_\pi), \quad \forall \pi \in \Pi. \tag{2.4} \]

2.4.3 Examples of GNN Models

There are many variants of GNN models with different message, aggregation, and readout functions. Some notable examples are as follows.

Graph Convolutional Network (GCN). GCN is one of the most popular GNN models and borrows the idea of the Convolutional Neural Network (CNN) into the graph domain. The GCN model leverages a graph convolutional function to aggregate information from neighboring nodes and applies a pooling function to form graph representations from node representations. GCN models achieve high performance on node and graph classification tasks [28, 29].

Graph Sample and Aggregation (GraphSAGE). GraphSAGE extends the GCN framework by introducing a more flexible, scalable, and inductive learning approach. GraphSAGE learns to generate embeddings for nodes by sampling and aggregating information from their local neighborhoods, which is particularly useful for learning on large graphs or graphs with nodes unseen during training [30].

Graph Attention Network (GAT). GAT models introduce attention mechanisms into GNNs, which enables them to weigh the importance of neighboring nodes differently when aggregating information. Through the attention mechanism, the message function can focus on the most relevant neighbors and capture more complex graph structures. GAT models have been applied to various tasks, including node classification, link prediction, and graph classification [31].

2.4.4 Expressiveness of GNN

The expressiveness of a graph representation model is its ability to capture the intricate structure of graph data and encode this information into node or graph representations. For GNN models, the expressiveness is determined by the model architecture and the specific message, aggregation, and readout functions. The number of message propagation steps can have a significant influence on the expressiveness [9, 32]; GNN models with a large number of propagation steps can capture long-range dependencies and interactions [33, 34, 35].

Test of Isomorphism. The expressiveness of GNN models is often evaluated by their ability to distinguish non-isomorphic graphs. Isomorphic graphs are pairs of graphs with the same structure that may differ only in the labeling of their vertices and edges, up to a node permutation. Two graphs G_1 = (V_1, W_1, H_1) and G_2 = (V_2, W_2, H_2) are said to be isomorphic when there exists a permutation π such that the edges and node features are identical after applying the node permutation, i.e., W_{1,π} = W_2 and H_{1,π} = H_2. The ability to distinguish non-isomorphic graphs is an important component in understanding the expressiveness of graph representation models.

Weisfeiler-Lehman Tests of Isomorphism. One of the most powerful and widely used tests of graph isomorphism is the Weisfeiler-Lehman Test of Isomorphism (WL) [36, 37].
The WL test is a refinement-based method that iteratively updates node labels using local neighborhood information until a stopping condition is reached. The 1-dimensional WL test (1-WL) updates the node labels based on the immediate neighbors during the labeling process. The detailed steps of 1-WL are as follows (a code sketch of the refinement appears at the end of Section 2.4):

• Initialization: Assign an initial label to each node in each graph.

• Iterative Labeling: In each iteration, a new label is computed for each node based on its current label and the multiset of labels of its neighbors.

• Stopping Condition: The process continues until no new labels are generated or until a predetermined number of iterations is reached.

• Decision: After the last iteration, the nodes of the two graphs are sorted according to their final labels. If the resulting label sequences differ, the two graphs are certainly non-isomorphic; if they agree, the graphs are deemed possibly isomorphic, since the test may fail to separate some non-isomorphic graphs.

Higher-dimensional Weisfeiler-Lehman isomorphism tests (k-WL) refine labels over k-tuples of nodes rather than individual nodes, which results in a stronger capability to distinguish non-isomorphic graphs. Cai et al. show that (k + 1)-WL has strictly stronger expressive power than k-WL, i.e., there are non-isomorphic graphs which can be distinguished by (k + 1)-WL but not by k-WL [38].

Recent studies have established key results on the relationship between the expressiveness of GNNs and the hierarchy of Weisfeiler-Lehman (WL) isomorphism tests. They show that the expressiveness of the most powerful message-passing GNN models does not exceed 1-WL [1, 39]. As 1-WL cannot distinguish between certain graphs that are not isomorphic, such as regular graphs, GNN models suffer from limited expressiveness. How to design graph representations whose expressiveness exceeds that of the WL tests remains a challenge.

2.4.5 Limitations

Despite the successes GNN models have achieved across a number of applications, they suffer from limitations that affect their effectiveness. Some of these limitations are listed next.

• Scalability. The complexity of graph convolutions and the memory requirements can grow significantly with the size of the graph and the number of iterations. This makes it challenging to apply GNNs to real-world problems that involve large-scale graphs.

• Expressiveness. As mentioned previously, the expressiveness of GNN models is limited by 1-WL, which cannot distinguish certain non-isomorphic graphs, such as pairs of regular graphs [6].

• Over-smoothing. As the number of convolutional layers increases, node representations become indistinguishable, resulting in a loss of node-specific information and reduced discriminative power [40].

A number of methods have been proposed to address these limitations and improve GNN expressiveness and applicability to a wider range of problems. For example, approximate methods and graph sampling methods have been introduced to address the scalability problem [?]. To address the limitation of expressiveness, higher-order graph neural network models have been proposed with provably higher expressiveness than 1-WL [41, 42]. To tackle the over-smoothing problem, various methods have been proposed, such as DropEdge, which randomly drops edges of the input graph during training, reducing the information flow between nodes and thus mitigating over-smoothing [43].
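To make the 1-WL refinement described in Section 2.4.4 concrete, the following is a minimal sketch of the iterative relabeling on a single graph (plain Python; the function name is ours, and compressing each (label, neighbor-multiset) signature with hash() is just one convenient way to implement the relabeling step).

```python
def wl_refine(adj, labels, num_iters=3):
    """1-WL color refinement on a single graph.

    adj    : dict mapping each node to the list of its neighbors
    labels : dict mapping each node to an initial (hashable) label
    Two graphs whose sorted label multisets differ after refinement are certainly
    non-isomorphic; identical multisets do not prove isomorphism.
    """
    for _ in range(num_iters):
        # New label: current label together with the multiset of neighbor labels,
        # compressed with hash() so labels stay small (any consistent relabeling works).
        new = {v: hash((labels[v], tuple(sorted(labels[u] for u in adj[v]))))
               for v in adj}
        # Each new partition refines the previous one, so an unchanged number of
        # distinct labels means the coloring is already stable.
        if len(set(new.values())) == len(set(labels.values())):
            break
        labels = new
    return labels

# A triangle and a 3-node path are separated after one round of refinement:
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1]}
init = {v: 0 for v in range(3)}
same = sorted(wl_refine(triangle, init).values()) == sorted(wl_refine(path, init).values())
print(same)  # False: the label multisets differ, so the two graphs are not isomorphic
```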
2.5 Structural Brain Networks

2.5.1 Overview

Diffusion Magnetic Resonance Imaging (MRI) exploits the anisotropic diffusion of water molecules in the brain to enable the estimation of the brain's anatomical fiber tracts at a relatively high resolution. In particular, tractographic methods can be used to generate a whole-brain anatomical connectivity matrix where each element provides an estimate of the connectivity strength between the corresponding voxels. Structural brain networks are built using this connectivity information and a predefined brain parcellation, where the nodes of the network represent the brain regions and the edge weights capture the connectivity strengths between the corresponding brain regions.

Table 2.1: Demographics of NKI Sample

Age Group   4-9  10-19  20-29  30-39  40-49  50-59  60-69  70-85  Total
Female        4     18     19      8      5      6      9     10     79
Male          5     16     24      9     26      8      5      4     97
Total         9     34     43     17     31     14     14     14    176

2.5.2 Diffusion Tensor Imaging Processing

2.5.2.1 Data Collection

The diffusion MRI dataset used in this research is taken from the publicly available Nathan Kline Institute (NKI)/Rockland dataset (http://fcon_1000.projects.nitrc.org/indi/pro/nki.html), which consists of data for individuals whose ages range from 4 to 85. The diffusion MRI was performed using a SIEMENS MAGNETOM TrioTim syngo MR B15 system. The high-angular resolution diffusion imaging protocol was used to assess white matter integrity as measured by fractional anisotropy. Diffusion tensor data were collected using a single-shot, echo-planar, single refocusing spin-echo, T2-weighted sequence with a spatial resolution of 2.0×2.0×2.0 mm. The sequence parameters were: TE/TR = 91/10000 ms, FOV = 256 mm, axial slice orientation with 58 slices, 64 isotropically distributed diffusion-weighted directions, two diffusion weighting values (b = 0 and 1000 s/mm²), and six b = 0 images. These parameters were calculated using an optimization technique that maximizes the contrast-to-noise ratio for FA measurements. For each subject, the image data consist of 76 volumes of 3D images of dimensions 128×128×53, each voxel representing a 2.0mm×2.0mm×2.0mm brain volume.

Figure 2.1: Left: seed region, which is the JHU white matter atlas. Middle: target region, which is the AAL mask. Right: the result of probabilistic tractography, which models the distribution of neuron fiber bundles; the intensity of each voxel represents the number of streamlines passing through that voxel. All figures are imaged in the axial plane.

2.5.2.2 Nonlinear Registration

The diffusion images of all subjects are registered to Montreal Neurological Institute (MNI) standard space using the nonlinear registration package FNIRT in the FSL software [44]. The nonlinear registration process generates warping coefficients that balance the similarity between the diffusion image and the standard MNI152 image against the smoothness of the warping coefficients.

2.5.2.3 Probabilistic Tractographic Methods

After a set of standard preprocessing steps, probabilistic tractography is used to model crossing fiber distributions for each voxel through the BEDPOSTX package in FSL [45]. Probabilistic tractography is processed through the diffusion toolbox in FSL [46]. The standard JHU white matter atlas is specified as the seed region. The AAL brain atlas mask is specified as the target region, which covers the whole brain cortex. We generate 50 streamlines from every voxel in the seed region.
These streamlines are propagated following the crossing fiber distributions computed during the preprocessing step. A curvature threshold is enforced to eliminate unqualified streamlines. The distance correction option is set to correct for the fact that the connectivity distribution drops off as the travel distance increases. The tractography output is a structural connectivity network modeled as a weighted graph where each node is a voxel in the target region space and each edge weight corresponds to the relative connectivity strength, measured as the number of streamlines connecting the corresponding pair of voxels. Using this connectivity information and a set of predefined regions of interest (ROIs) or whole-brain parcellations, structural brain networks can be built and analyzed. Figure 2.1 shows the seed and target regions as well as the tractography result of a subject.

2.5.3 Brain Parcellation Scheme

The brain parcellation scheme takes as input the probabilistic tractography results represented as a connectivity matrix, which is a large sparse matrix with millions of nodes and billions of edges. A k-region brain parcellation is a segmentation that partitions the brain's grey matter into k spatially contiguous regions, such that the connectivity profiles of the voxels in each region are as similar as possible. The parcellations are expected to be consistent among members of a structurally homogeneous population sample. Notable brain parcellations include the following:

• Brodmann's Areas. Proposed by Korbinian Brodmann in 1909, Brodmann's areas divide the brain based on cytoarchitectonic properties, i.e., the distribution and arrangement of neurons in the cortex [47].

• Automated Anatomical Labeling (AAL). The AAL atlas is a widely used atlas introduced by Tzourio-Mazoyer et al. in 2002. It divides the brain based on anatomical landmarks into 116 regions (90 cortical and subcortical, and 26 cerebellar) [8]. It is commonly used for voxel-based morphometry and functional fMRI studies [48, 49].

• Harvard-Oxford Atlas. The Harvard-Oxford Atlas is a probabilistic atlas that provides a standard space for neuroimaging analyses. It is based on the segmentation of structural brain images from 48 subjects [50].

2.5.4 Construction of Structural Brain Networks

Given a brain parcellation and the connectivity information revealed by probabilistic tractography, the structural brain network is built where the nodes represent the regions of the parcellation and the edges reflect the connectivity strength between the corresponding regions. The edge weights are defined as follows:

\[ W(R_i, R_j) = \sum_{v_a \in R_i,\, v_b \in R_j} W(v_a, v_b) \tag{2.5} \]

where W(v_a, v_b) represents the number of streamlines connecting the two voxels as generated by the tractographic results.

2.5.5 Graph Theoretical Analysis

Graph theoretical analysis is applied to the structural brain networks to extract graph patterns [51]. Many global and local graph-theoretic measurements have been introduced to characterize structural brain networks [52, 53, 54]. The following are some of the most commonly used graph measurements:

• Characteristic path length (CPL). The characteristic path length tries to capture network integration and is computed as the average of the shortest path lengths between all pairs of vertices:

\[ \mathrm{CPL} = \frac{1}{N(N-1)} \sum_{i \ne j} d(i, j) \tag{2.6} \]

where d(i, j) is the length of the shortest path between nodes i and j.

• Global efficiency (E_global).
The global efficiency of a graph is the average of the inverse shortest path lengths between all pairs of vertices, and hence tries to capture how well pairs of nodes are connected [55]:

E_global = \frac{1}{N(N-1)} \sum_{i \neq j} \frac{1}{d(i, j)}    (2.7)

• Clustering coefficient. The clustering coefficient tries to capture network segregation and is defined as the average of the local clustering coefficients over all nodes:

C = \frac{1}{N} \sum_i C_i    (2.8)

where the local clustering coefficient at node i is defined as

C_i = \frac{2 \Gamma_i}{\deg_i (\deg_i - 1)}    (2.9)

and \Gamma_i is the number of triangles around node i,

\Gamma_i = \frac{1}{2} \sum_{j,h} u_{i,j} u_{i,h} u_{j,h}    (2.10)

Note that \deg_i > 0 in our case, since isolated nodes are removed from the network.

• Sparsity ratio. The sparsity ratio is defined as the ratio of the number of edges to the number of possible edges between nodes, that is,

Sparsity = \frac{\sum_{i,j} u_{i,j}}{k(k-1)}    (2.11)

A more detailed description of graph-theoretic measures commonly used to analyze structural brain networks can be found in [53]. These traditional graph-theoretical measurements summarize the complex structure of structural brain networks [56]. However, as the graph size increases, such measurements cannot fully capture the complex network patterns.

Chapter 3: Graph Coarsening with Preserved Spectral Properties

3.1 Introduction

Graphs are widely used to represent object relationships in real-world applications. As many applications involve large-scale graphs with complex structures, it is generally hard to explore and analyze the key properties directly on the large graphs. Hence, graph coarsening techniques have been commonly used to facilitate this process [57, 58]. Generally speaking, the aim of any graph reduction scheme is to reduce the number of nodes and edges of a graph while ensuring that the "essential properties" of the original graph are preserved.

The question of which properties should be preserved remains open, but there is significant evidence that they should relate to the spectrum of a graph operator, such as the adjacency or normalized Laplacian matrices [59, 60]. A long list of theorems in spectral graph theory shows that the combinatorial properties of a graph are aptly captured by its spectrum. As such, graphs with similar spectra are generally regarded as sharing similar global and local structure [61, 62]. Based on this realization, modern graph sparsification techniques [63, 64, 65] have moved on from previously considered objectives, such as cut and shortest-path distance preservation, and now aim to find sparse spectrally similar graphs.

In contrast to graph sparsification, there has been little progress towards attaining spectrum preservation guarantees in coarsening. The foremost roadblock seems to lie in defining what spectral similarity should entail for graphs of different sizes. The original and coarse graphs have different numbers of eigenvalues and eigenvectors, which prohibits a direct comparison. To circumvent this issue, recent works have considered restricting the guarantees to a subset of the spectrum [9, 66]. However, focusing only on a subset of eigenvalues and eigenvectors also means that important information about the graph spectrum is ignored.

In this chapter, we start by reconsidering the fundamental spectral distance metric [64, 67, 68, 69], which compares two graphs by means of a norm of their eigenvalue differences. This metric seems inappropriate at first, as it requires the two graphs to have the same number of eigenvalues.
However, we find that, in the context of coarsening, this difficulty can be circumvented by substituting the coarse graph with its lifted counterpart: the latter contains the same information as the former while also having the correct number of eigenvalues. Our analysis shows that the proposed distance naturally captures the changes a graph undergoes during the coarsening process. In particular, when the coarsening merges nodes that have similar connections to the rest of the graph, the spectral distance is provably small. By merging similarly connected nodes, the nodes and edges of the coarse graph are able to represent the connectivity patterns of the original graph, thus preserving its structural and connectivity information.

Our contributions in this chapter are summarized as follows:

• We show how the spectral distance [64, 67, 68, 69], though originally restricted to graphs of the same size, can be utilized to measure how similar a graph is to its coarsened counterpart.

• We examine how the new spectral distance captures the structural changes occurring during the graph coarsening process.

• We present two coarsening algorithms that provably minimize the spectral distance.

• We experimentally show that the proposed methods outperform other graph coarsening algorithms on two graph-related tasks.

All the proofs can be found in the Appendix.

3.2 Related Work

Recent works have proposed to coarsen graphs by preserving the spectral properties of the matrix representations of graphs [9, 10, 60, 66, 70]. For example, Loukas et al. proposed to preserve the action of the graph Laplacian with respect to an (eigen-)space of fixed dimension, arguing that this suffices to capture the global properties of graphs relevant to partitioning and spectral clustering [9]. Durfee et al. proposed to preserve the all-pairs effective resistance [10]. Garg and Jaakkola defined a cost based on the theory of optimal transport [11]. Saket et al. suggested a Minimum Description Length (MDL) principle relevant to unweighted graphs [12]. Most of these distance functions are specific to particular applications; the question of how to define an application-independent graph coarsening framework remains a challenge.

There is a sizable literature dealing with the characterization of graphs in terms of their spectral properties [64, 71, 72]. Previous work defined distance functions based on Laplacian eigenvalues that measure differences between graphs [64, 67]. Spielman and Teng introduced a notion of spectral similarity for two graphs in their graph sparsification framework [59, 65]. Recently, Tsitsulin et al. proposed an efficient graph feature extractor, based on the Laplacian spectrum, for comparing large graphs [71]. Dong [72] uses spectral densities to visualize and estimate meaningful information about graph structures. Nevertheless, despite the popularity of spectral methods, the graph spectrum remains little explored in the context of graph coarsening.

3.3 Preliminaries

Let G = (V, E, W) be a graph, with V a set of N = |V| nodes, E a set of M = |E| edges, and W ∈ R^{N x N} the weighted adjacency matrix. For a node v_i, we denote by w(i) ∈ R^N the vector of weights of the edges incident on v_i, and by d(i) = \sum_{j=1}^{N} W(i, j) the degree of v_i. The graphs considered in this work are weighted, undirected, and have no isolated nodes (i.e., d(i) > 0 for all v_i).
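To make this notation concrete, the following minimal NumPy sketch builds a small weighted adjacency matrix and computes the quantities d(i) and w(i) used throughout the chapter; the toy graph and variable names are illustrative and are not taken from the thesis code.

```python
import numpy as np

# Toy weighted, undirected graph on N = 4 nodes (illustrative only).
W = np.array([[0., 1., 2., 0.],
              [1., 0., 1., 0.],
              [2., 1., 0., 3.],
              [0., 0., 3., 0.]])

d = W.sum(axis=1)          # degrees d(i) = sum_j W(i, j)
assert np.all(d > 0)       # no isolated nodes, as assumed in the text

w_0 = W[0]                 # w(0): weights of the edges incident on v_0
N = W.shape[0]             # number of nodes |V|
M = int(np.count_nonzero(np.triu(W, 1)))   # number of edges |E|
print(N, M, d, w_0 / d[0]) # normalized weights w(i)/d(i) reappear in Section 3.4.3
```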
The combinatorial and normalized Laplacians of G are defined as

L = D - W    and    \mathcal{L} = I_N - D^{-1/2} W D^{-1/2},    (3.1)

respectively, where I_N is the N x N identity matrix and D is the diagonal degree matrix with D(i, i) = d(i).

3.3.1 Graph Coarsening

The coarse graph G_c = (V_c, E_c, W_c) with n = |V_c| nodes is obtained from the original graph G by first selecting a set of non-overlapping partitions P = {S_1, S_2, ..., S_n} that covers all the nodes in V. Each partition S_p corresponds to a "super-node" denoted by s_p, and the "super-edge" connecting two super-nodes has weight W_c(p, q) equal to the accumulated edge weight between nodes in the corresponding partitions S_p and S_q:

W_c(p, q) = w(S_p, S_q) := \sum_{v_i \in S_p, v_j \in S_q} W(i, j)    (3.2)

Figure 3.1: Left: an example illustrating the graph coarsening process. The original graph is a random graph sampled from a stochastic block model with 50 nodes and 10 predefined blocks; the coarse graph is obtained from the predefined partitions. Right: eigenvalues and eigenvectors (1st, 3rd, and 20th) of the normalized Laplacian matrices of the original, coarse, and lifted graphs. The eigenvalues of the coarse graph align with the eigenvalues of the original graph, and the eigenvectors indicate the block membership information.

Let P ∈ R^{n x N} be the matrix whose columns are partition indicator vectors:

P(p, i) = 1 if v_i ∈ S_p, and 0 otherwise.

It is then well known that the weight matrix W_c of the coarse graph G_c satisfies W_c = P W P^T. The definition of the coarse Laplacian matrices follows directly: L_c = D_c - W_c and \mathcal{L}_c = I_n - D_c^{-1/2} W_c D_c^{-1/2}. Similarly to the adjacency matrix, the combinatorial Laplacian of the coarse graph can be obtained by the formula L_c = P L P^T. The same, however, does not hold for the normalized Laplacian, as in general P \mathcal{L} P^T ≠ \mathcal{L}_c.

3.3.2 Graph Lifting

We define G_l = (V, E_l, W_l) to be the graph lifted from the coarse graph G_c with respect to the set of non-overlapping partitions P. In graph lifting, each node s_p of the coarse graph is lifted to |S_p| nodes, and nodes in the lifted graph are connected by edges whose weight is equal to the coarse edge weight normalized by the sizes of the partitions. Specifically, for any v_i ∈ S_p and v_j ∈ S_q we have:

W_l(i, j) = \frac{w(S_p, S_q)}{|S_p| |S_q|} = \frac{\sum_{v_{i'} \in S_p, v_{j'} \in S_q} W(i', j')}{|S_p| |S_q|} = \frac{W_c(p, q)}{|S_p| |S_q|}.    (3.3)

When S_p = S_q = S, the weight W_l(i, j) equals the total weight of the edges in the subgraph induced by S, normalized by |S|^2. It easily follows that if W(i, j) is the same for every v_i, v_j ∈ S, then also W_l(i, j) = W(i, j); i.e., in-partition weights are exactly preserved by successive coarsening and lifting in this case.

The above combinatorial definition can be expressed in algebraic form in terms of the pseudo-inverse P^+ of P (i.e., P P^+ = I), whose elements are:

P^+(j, p) = \frac{1}{|S_p|} if v_j ∈ S_p, and 0 otherwise.

With this in place, the adjacency matrices of the lifted and coarse graphs are connected by the following relations:

W_l = P^+ W_c P^{\mp}    and    W_c = P W_l P^T,

where P^{\mp} = (P^+)^T denotes the transposed pseudo-inverse. The following equation reveals that lifting preserves the connectivity up to a projection onto the partitions:

W_l = P^+ W_c P^{\mp} = P^+ P W P^T P^{\mp} = \Pi W \Pi^T = \Pi W \Pi,    (3.4)

where \Pi = P^+ P is a projection matrix, with \Pi \Pi = P^+ P P^+ P = P^+ P = \Pi.
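The coarsening and lifting relations above are easy to verify numerically. The following hedged NumPy sketch builds P, P^+, and \Pi for a toy partition and checks W_c = P W P^T and relation (3.4); the graph, partition, and helper names are illustrative only.

```python
import numpy as np

def coarsening_matrices(partitions, N):
    """Build P, its pseudo-inverse P+, and the projection Pi = P+ P
    for a list of node-index partitions (illustrative helper)."""
    n = len(partitions)
    P = np.zeros((n, N))
    P_plus = np.zeros((N, n))
    for p, S in enumerate(partitions):
        P[p, S] = 1.0
        P_plus[S, p] = 1.0 / len(S)
    return P, P_plus, P_plus @ P

# Toy graph and partition (illustrative only).
W = np.array([[0., 2., 1., 0.],
              [2., 0., 1., 0.],
              [1., 1., 0., 3.],
              [0., 0., 3., 0.]])
partitions = [[0, 1], [2], [3]]

P, P_plus, Pi = coarsening_matrices(partitions, W.shape[0])
W_c = P @ W @ P.T                        # coarse adjacency, eq. (3.2)
W_l = P_plus @ W_c @ P_plus.T            # lifted adjacency, eq. (3.3)
assert np.allclose(W_l, Pi @ W @ Pi)     # relation (3.4)
assert np.allclose(W_c, P @ W_l @ P.T)   # coarsening the lifted graph recovers W_c
```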
The lifted Laplacian matrices are given by

L_l = P^+ L_c P^{\mp}    and    \mathcal{L}_l = C^T \mathcal{L}_c C,    (3.5)

where C ∈ R^{n x N} is the normalized coarsening matrix whose entries are given by

C(p, i) = \frac{1}{\sqrt{|S_p|}} if v_i ∈ S_p, and 0 otherwise,

such that C^T = C^+ and C^T C = P^+ P = \Pi. In this manner, we have

L_c = P L_l P^T    and    \mathcal{L}_c = C \mathcal{L}_l C^T.    (3.6)

For a more in-depth discussion of the mathematics of graph coarsening and graph lifting, we refer the interested reader to [9].

3.4 Spectral Distance

We start by briefly reviewing some basic facts about the spectrum associated with the Laplacian matrix of a coarse graph. We then demonstrate how to exploit these properties in order to render the classical spectral distance metric amenable to (coarse) graphs of different sizes.

3.4.1 Properties of the Coarse Laplacian Spectrum

Denote the eigenvalues and eigenvectors of the normalized Laplacian matrices by λ and u, respectively, with \mathcal{L} = U Λ U^T, where the i-th column of U corresponds to u_i and Λ = diag(λ). The eigenvalues are ordered in non-decreasing order.

Property 3.4.1 (Interlacing; Section 5.3 in [73]). The normalized Laplacian eigenvalues of the original and coarse graphs satisfy

λ(i) ≤ λ_c(i) ≤ λ(i + N - n)    for all i = 1, ..., n.

Property 3.4.1 is a general interlacing inequality that captures the pairwise differences between the eigenvalues of the original and coarse graph Laplacians [3, 74]. Since it holds for any graph and any coarsening, the inequality will, in some cases, be loose.

Property 3.4.2 (Eigenvalue Preservation). The normalized Laplacian eigenvalues of the lifted graph contain all eigenvalues of the coarse graph, together with the additional eigenvalue 1 with multiplicity N - n.

Property 3.4.3 (Eigenvector Preservation). The eigenvectors of the coarse graph lifted by C^T, i.e., u_l = C^T u_c, are eigenvectors of \mathcal{L}_l.

Proof. We start by noticing that the projection matrix \Pi acts as an identity matrix with respect to the lifted normalized Laplacian, \mathcal{L}_l = \Pi \mathcal{L}_l \Pi, since \mathcal{L}_l = C^T \mathcal{L}_c C = C^T C \mathcal{L}_l C^T C = \Pi \mathcal{L}_l \Pi. Now, consider the following chain of eigenvalue equations:

\mathcal{L}_c u_c = λ_c u_c
C \mathcal{L}_l C^T u_c = λ_c u_c
C^T C \mathcal{L}_l C^T u_c = λ_c C^T u_c
\Pi \mathcal{L}_l \Pi C^T u_c = λ_c C^T u_c
\mathcal{L}_l C^T u_c = λ_c C^T u_c

Note that in the fourth step we used the relation C^T = C^T C C^T = \Pi C^T, which holds due to the properties of the Moore-Penrose pseudo-inverse. Thus, the vectors C^T u_c are eigenvectors of \mathcal{L}_l with the corresponding eigenvalues of the coarse graph. To show that there are N - n additional eigenvalues equal to 1, observe that I_N - \mathcal{L}_l = D_l^{-1/2} W_l D_l^{-1/2} is a rank-n matrix, because, in the lifted graph, nodes within the same partition have exactly the same edge weights. Hence I_N - \mathcal{L}_l contains the eigenvalue 0 with multiplicity N - n, and correspondingly \mathcal{L}_l contains the eigenvalue 1 with multiplicity N - n.

Properties 3.4.2 and 3.4.3 state that the action of lifting preserves most spectral properties of the coarse graph. Thus, we may use the lifted graph as a proxy to define the distance function [75]. Figure 3.1 shows an example illustrating the graph coarsening process as well as its effect on the graph spectrum.

3.4.2 Spectral Distance

In the following, we propose two notions of spectral distance to quantify the difference between original and coarse graphs. We first use the lifted graph as the "proxy" of the coarse graph and define the full spectral distance:

Definition 3.4.4. The full spectral distance between the graphs G and G_c is defined as

SD_full(G, G_c) = \| λ - λ_l \|_1 = \sum_{i=1}^{N} | λ(i) - λ_l(i) |,

where the vectors λ and λ_l contain the eigenvalues of the original and lifted graphs.
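As a quick numerical sanity check of Property 3.4.2, the following self-contained NumPy sketch coarsens a toy graph, lifts it back, and verifies that the lifted normalized Laplacian spectrum equals the coarse spectrum plus the eigenvalue 1 repeated N - n times; the toy graph, partition, and helper names are illustrative only.

```python
import numpy as np

def norm_laplacian(W):
    """Normalized Laplacian I - D^{-1/2} W D^{-1/2} of a weighted graph."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.eye(len(d)) - D_inv_sqrt @ W @ D_inv_sqrt

# Toy graph on N = 4 nodes and a partition merging nodes 0 and 1 (illustrative).
W = np.array([[0., 2., 1., 1.],
              [2., 0., 1., 1.],
              [1., 1., 0., 3.],
              [1., 1., 3., 0.]])
partitions = [[0, 1], [2], [3]]
N, n = W.shape[0], len(partitions)

P = np.zeros((n, N)); P_plus = np.zeros((N, n))
for p, S in enumerate(partitions):
    P[p, S] = 1.0
    P_plus[S, p] = 1.0 / len(S)

W_c = P @ W @ P.T                 # coarse adjacency, eq. (3.2)
W_l = P_plus @ W_c @ P_plus.T     # lifted adjacency, eq. (3.3)

lam_c = np.linalg.eigvalsh(norm_laplacian(W_c))   # coarse spectrum
lam_l = np.linalg.eigvalsh(norm_laplacian(W_l))   # lifted spectrum

# Property 3.4.2: lifted spectrum = coarse spectrum plus (N - n) eigenvalues 1.
expected = np.sort(np.concatenate([lam_c, np.ones(N - n)]))
assert np.allclose(np.sort(lam_l), expected)
```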
As the original and lifted graphs have the same number of nodes, we may directly use a norm to measure the pairwise differences between their eigenvalues. On the flip side, the definition requires computing all eigenvalues of the original graph regardless of the coarse graph size, which is computationally expensive, especially for large graphs. This limitation motivates us to define a partial spectral distance that retains only a subset of the terms in the full spectral distance.

Let k_1 and k_2 be defined as k_1 = argmax_i { i : λ_c(i) < 1 } and k_2 = N - n + k_1. We expand the full spectral distance into three terms as follows:

SD_full(G, G_c) = \sum_{i=1}^{N} | λ(i) - λ_l(i) |
= \sum_{i=1}^{k_1} | λ(i) - λ_l(i) | + \sum_{i=k_1+1}^{k_2} | λ(i) - λ_l(i) | + \sum_{i=k_2+1}^{N} | λ(i) - λ_l(i) |    (3.7)
= \sum_{i=1}^{k_1} | λ(i) - λ_c(i) | + \sum_{i=k_1+1}^{k_2} | λ(i) - 1 | + \sum_{i=k_2+1}^{N} | λ(i) - λ_c(i - N + n) |    (3.8)

The last equality follows from Property 3.4.2, which states that λ_l contains the eigenvalues of the coarse graph as well as the eigenvalue 1 with multiplicity N - n. The eigenvalues λ_l satisfy:

λ_l(i) = λ_c(i) for i ≤ k_1,    λ_l(i) = 1 for k_1 + 1 ≤ i ≤ k_2,    λ_l(i) = λ_c(i - N + n) for i > k_2.

With this in place, we define the partial spectral distance to be equal to the full spectral distance minus the N - n terms for which λ_l = 1:

Definition 3.4.5. The partial spectral distance between the graphs G and G_c is defined as

SD_part(G, G_c) = \sum_{i=1}^{k} | λ(i) - λ_c(i) | + \sum_{i=k+1}^{n} | λ_c(i) - λ(i + N - n) |,

where k = argmax_i { i : λ_c(i) < 1 }.

For the partial spectral distance, we only need to compute n rather than N eigenvalues of the normalized Laplacian of the original graph, which significantly reduces the computational cost when n ≪ N. The full and partial spectral distances are related by

SD_full(G, G_c) = SD_part(G, G_c) + \sum_{i=k_1+1}^{k_2} | λ(i) - 1 |.

The excluded terms \sum_{i=k_1+1}^{k_2} | λ(i) - 1 | measure how close the corresponding original Laplacian eigenvalues are to 1. The two definitions are equivalent when the normalized Laplacian \mathcal{L} of the original graph contains the eigenvalue 1 with multiplicity N - n. This condition is equivalent to asserting that the adjacency matrix W is singular, with the eigenvalue 0 having algebraic multiplicity N - n [76, 77]. We have observed empirically that, when coarsening merges nodes that have similar connections, the adjacency matrix has eigenvalues close to 0. In such situations, the terms of the full spectral distance that are excluded by the partial spectral distance are almost 0, and the partial distance closely approximates the full one.

Note that both definitions of spectral distance are proper distance metrics over the space of graph Laplacian eigenvalues. However, the spectral distance is not able to distinguish graphs with the same sets of Laplacian eigenvalues (referred to as cospectral graphs [61]). Thus, there can exist multiple coarse graphs corresponding to the same spectral distance.

3.4.3 Relation to Graph Coarsening

To illustrate the connections between the spectral distance and graph coarsening, we first consider the ideal case in which the merged nodes within each partition have the same normalized edge weights:

Proposition 3.4.1. Let the graph G_c be obtained by coarsening G with respect to a set of partitions P = {S_1, S_2, ..., S_n}. If P is selected such that every node in a partition has the same normalized edge weights,

\frac{w(i)}{d(i)} = \frac{w(j)}{d(j)}    for all v_i, v_j ∈ S and S ∈ P,    (3.9)

then SD_full(G, G_c) = 0 and SD_part(G, G_c) = 0.

Therefore, the ideal graph coarsening attains a minimal (full and partial) spectral distance.
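As an illustration of Definitions 3.4.4 and 3.4.5, the following NumPy sketch implements both distances directly from sorted eigenvalue vectors; the function name and the demo eigenvalues are made up for illustration, and in practice the inputs would be eigenvalues computed as in the earlier sketches.

```python
import numpy as np

def spectral_distances(lam, lam_c):
    """Full and partial spectral distances (Definitions 3.4.4 and 3.4.5).

    lam   : sorted eigenvalues of the original normalized Laplacian (length N)
    lam_c : sorted eigenvalues of the coarse normalized Laplacian   (length n)
    """
    N, n = len(lam), len(lam_c)
    k = int(np.sum(lam_c < 1.0))          # k = argmax{i : lambda_c(i) < 1}
    # Partial distance: head terms plus tail terms (Definition 3.4.5).
    sd_part = (np.abs(lam[:k] - lam_c[:k]).sum()
               + np.abs(lam_c[k:] - lam[k + N - n:]).sum())
    # Full distance adds the middle terms |lambda(i) - 1| (eqs. (3.7)-(3.8)).
    sd_full = sd_part + np.abs(lam[k:k + N - n] - 1.0).sum()
    return sd_full, sd_part

# Tiny demo with made-up eigenvalue vectors (N = 6, n = 3), illustrative only.
lam   = np.array([0.0, 0.30, 0.80, 1.00, 1.40, 1.70])
lam_c = np.array([0.0, 0.35, 1.45])
print(spectral_distances(lam, lam_c))     # approximately (0.9, 0.3)
```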
We next provide a more general result on how the spectral distance captures the structural changes in the graph coarsening framework. Consider the basic coarsening in which the coarse graph is formed by merging one pair of nodes (i.e., n = N - 1). In this setting, we prove the following:

Proposition 3.4.2. Suppose the graph G_c is obtained from G by merging a pair of nodes v_a and v_b. If the normalized edge weights of the merged nodes satisfy

\left\| \frac{w(a)}{d(a)} - \frac{w(b)}{d(b)} \right\|_1 ≤ ε,

then the spectral distance between the original and coarse graphs is bounded by

SD_full(G, G_c) ≤ N ε    and    SD_part(G, G_c) ≤ n ε.

The above proposition states that the spectral distance is bounded by the discrepancy between the normalized edge weights of the merged nodes. The bound implies that minimizing the discrepancy of normalized edge weights within partitions results in a bounded spectral perturbation.

3.5 Algorithms

We propose two graph coarsening algorithms that produce coarse graphs with small spectral distance. The first follows from Proposition 3.4.2 in that the coarse graphs are formed by iteratively merging graph nodes with similar normalized edge weights. The second algorithm is inspired by spectral clustering: we apply k-means clustering to selected normalized Laplacian eigenvectors to find the graph partitions and the corresponding coarse graphs. Though different, both algorithms are shown to generate coarse graphs with bounded spectral distance.

3.5.1 Multilevel Graph Coarsening

The Multilevel Graph Coarsening (MGC) algorithm iteratively merges pairs of nodes that share similar connections. During each iteration, MGC searches for the pair of nodes with the most similar normalized edge weights and merges them into a super-node. To reduce the computational cost, we constrain the candidate pairs of graph nodes to be within 2-hop distance. We denote by N_i the set of nodes that are within 2-hop distance from node v_i. The pseudo-code of MGC is presented in Algorithm 1.

Algorithm 1 Multilevel Graph Coarsening (MGC)
1: Input: graph G = (V, E, W) and target size n of the coarse graph.
2: s ← N
3: while s > n do
4:   for v_i ∈ V_s do
5:     for v_j ∈ N_i do
6:       d_s(i, j) = \| w(i)/d(i) - w(j)/d(j) \|_1
7:     end for
8:   end for
9:   (i_min, j_min) ← argmin_{i,j} d_s(i, j)
10:  s ← s - 1
11:  Merge nodes v_{i_min} and v_{j_min} to form the coarse graph G_s.
12: end while
13: return G_n = (V_n, E_n, W_n)

Analysis. The following corollary bounds the spectral distance achieved by the MGC algorithm:

Corollary 3.5.1. Suppose the graph G_c is coarsened from G by iteratively merging pairs of nodes v_{a_s} and v_{b_s} for s = N, N-1, ..., n+1. If the normalized edge weights of the merged nodes satisfy

\left\| \frac{w(a_s)}{d(a_s)} - \frac{w(b_s)}{d(b_s)} \right\|_1 ≤ ε_s,

then the spectral distance between the original and coarse graphs is bounded by

SD_full(G, G_c) ≤ N \sum_{s=n+1}^{N} ε_s    and    SD_part(G, G_c) ≤ n \sum_{s=n+1}^{N} ε_s.

The bound is a direct corollary of Proposition 3.4.2.

Time complexity. The time complexity of MGC is O(M(N + n)(N - n)), which is derived as follows. In each iteration, the computational cost of the 1-norm in line 6 is O(s). The time complexity of the while loop in line 3 is therefore O(\sum_{s=n}^{N} s \cdot M) = O(M \frac{n+N}{2} (N - n)) = O(M(N + n)(N - n)). When n ≈ N, the complexity reduces to O(MN); on the other hand, for n ≪ N the complexity becomes O(MN^2).

3.5.2 Spectral Graph Coarsening

The Spectral Graph Coarsening (SGC) algorithm identifies the coarsening partitions by attempting to minimize the k-means cost of the rows of Laplacian eigenvectors.
Different from traditional spectral clustering, we select the eigenvectors corresponding to the head and tail eigenvalues, mirroring the definition of the partial spectral distance in Definition 3.4.5. The procedure is described in Algorithm 2. Notice that, since k_1 is unknown at the start, the SGC algorithm iterates over the possible combinations of eigenvectors and selects the coarsening with the minimum k-means cost.

Algorithm 2 Spectral Graph Coarsening (SGC)
1: Input: graph G = (V, E, W), eigenvectors U of the normalized Laplacian \mathcal{L}, target size n.
2: if λ(N) ≤ 1 then
3:   k_1 ← n    ▷ Spectral Clustering
4: else
5:   k_1 ← argmin_k { k : λ(k) ≤ 1, k ≤ n, λ(N - n + k + 1) > 1 }    ▷ Iterative Spectral Coarsening
6: end if
7: k_2 ← N - n + k_1
8: while k_1 ≤ n do
9:   U_{k_1} ← [U(1 : k_1); U(k_2 + 1 : N)]
10:  Apply the k-means clustering algorithm to the rows of U_{k_1} to obtain graph partitions P*_{k_1} that optimize the k-means cost
     F(U_{k_1}, P*_{k_1}) = \sum_{i=1}^{N} \| r(i) - \frac{1}{|S_i|} \sum_{j \in S_i} r(j) \|^2,
     where r(i) is the i-th row of U_{k_1} and S_i is the partition containing v_i.
11:  k_1 ← k_1 + 1, k_2 ← N - n + k_1
12: end while
13: return the coarse graph G_c generated with respect to the partitions with minimum k-means clustering cost, P* = argmin_{k_1} F(U_{k_1}, P*_{k_1})

Analysis. The following theorem relates the partial spectral distance to the k-means cost:

Theorem 3.5.2. Let the coarse graph G_c be obtained from Algorithm 2 with graph partition P*. Suppose that the graph coarsening is consistent, i.e., \mathcal{L}_c = C \mathcal{L} C^T, and that the k-means cost satisfies F(U, P*) < 1. Then the partial spectral distance is bounded by

SD_part(G, G_c) ≤ (n + 2) F(U, P*) + \frac{4 \sqrt{F(U, P*)}}{1 - F(U, P*)}.

The theorem states that the spectral distance is bounded by the k-means clustering cost. Further, when the graph eigenvectors reveal well-separated clusters and the k-means cost is small, the spectral distance is correspondingly small. The main assumption posed is that \mathcal{L}_c = C \mathcal{L} C^T, which may not hold for some graphs. For situations where this assumption is not met, the claim can readily be reworked to hold for the combinatorial Laplacian, for which the relation L_c = P L P^T always holds.

Time complexity. Excluding the one-time partial sparse eigenvalue decomposition, which takes roughly O(R(Mn + Nn^2)) time using Lanczos iterations with R restarts on a graph with M edges (we need the n smallest and n largest eigenvalues and eigenvectors) [78], the time complexity of SGC is O(KTNn^2), where K ≤ n is the number of times the while loop is executed and O(TNn^2) is the complexity of the k-means clustering, with T bounding the number of k-means iterations.

Table 3.1: Classification accuracy on coarse graphs that are five times smaller.

Datasets | MUTAG | ENZYMES | NCI1  | NCI109 | PROTEINS | PTC
EM       | 78.90 | 18.92   | 62.81 | 61.35  | 63.72    | 48.56
LV       | 79.01 | 24.68   | 63.59 | 60.49  | 62.72    | 50.24
METIS    | 77.62 | 24.79   | 59.74 | 61.64  | 63.70    | 49.34
SC       | 80.37 | 24.40   | 63.14 | 62.57  | 64.08    | 50.16
SGC      | 80.34 | 29.19   | 63.94 | 63.69  | 64.70    | 52.76
MGC      | 81.53 | 30.89   | 66.07 | 63.55  | 65.26    | 52.28
Original | 86.58 | 37.32   | 66.39 | 64.93  | 66.60    | 53.72

3.6 Experiments

We proceed to empirically evaluate the proposed graph coarsening algorithms on tasks involving real-world and synthetic graphs. Our first experiment considers the classification of coarsened graphs, whereas the second examines how well one may recover the block structure of graphs sampled from the stochastic block model. We show that our graph coarsening algorithms, which optimize the spectral distance, yield minimal classification accuracy degradation and can recover the block structures with high accuracy.
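Before turning to the experimental results, the following hedged Python sketch illustrates the core step of Algorithm 2 for a fixed k_1: it forms the normalized Laplacian, stacks the head and tail eigenvectors, and clusters their rows with k-means. The toy graph, parameter choices, and helper names are illustrative; this is not the released implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def sgc_partition(W, n, k1):
    """Core step of Algorithm 2 for a fixed k1: k-means on the rows of the
    selected head/tail eigenvectors of the normalized Laplacian (sketch)."""
    N = W.shape[0]
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_norm = np.eye(N) - D_inv_sqrt @ W @ D_inv_sqrt
    lam, U = np.linalg.eigh(L_norm)            # ascending eigenvalues; lam could
                                               # guide the choice of k1 (lines 2-6)
    k2 = N - n + k1
    U_sel = np.hstack([U[:, :k1], U[:, k2:]])  # head and tail eigenvectors
    km = KMeans(n_clusters=n, n_init=10).fit(U_sel)
    partitions = [np.where(km.labels_ == p)[0] for p in range(n)]
    return partitions, km.inertia_             # inertia_ plays the role of F

# Illustrative usage on a small ring-plus-random graph; Algorithm 2 would loop
# over the admissible values of k1 and keep the partition with minimum cost.
N = 20
W = np.zeros((N, N))
for i in range(N):                             # ring guarantees no isolated nodes
    W[i, (i + 1) % N] = W[(i + 1) % N, i] = 1.0
rng = np.random.default_rng(0)
extra = np.triu((rng.random((N, N)) < 0.15).astype(float), 1)
W = np.maximum(W, extra + extra.T)
parts, cost = sgc_partition(W, n=10, k1=5)
```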
Code for both experiments is publicly available at https://github.com/yuj-umd/spectral-coarsening.

Baseline Algorithms. We compare our methods with the following graph coarsening and partitioning algorithms:

• Edge Matching (EM). The coarse graphs are formed by maximum-weight matching, with the weight of the pair (i, j) calculated as W(i, j) / max{d(i), d(j)} [79].

• Local Variation (LV). Local variation methods coarsen a graph in a manner that approximately preserves a subset of its spectrum [9]. Here, we used the neighborhood-based variant and aimed to preserve the first max(10, n) eigenvectors and eigenvalues. Alternative choices for the preserved eigenspace may yield different results.

• METIS. This is a standard graph partitioning algorithm based on multi-level partitioning schemes that are widely used in various domains, such as finite element methods and VLSI design [13].

• Spectral Clustering (SC). Spectral clustering is a widely used graph clustering algorithm that finds densely connected graph partitions determined from the eigenvectors of the graph Laplacian [19]. For a review of recent results on the fast approximation of SC, see [78].

Note that, to apply graph partitioning algorithms for coarsening purposes, we coarsen the graphs with respect to the resulting graph partitions following the standard coarsening process of equation (3.2).

3.6.1 Graph Classification with Coarse Graphs

Graph classification is a well-studied graph machine learning problem with a variety of applications in material design, drug discovery, and computational neuroscience [6, 71, 80, 81]. However, some graph classifiers do not scale to large graphs, such as those encountered in social network analysis and computational neuroscience [81, 82]. Graph coarsening can reduce the sizes of the graphs in a dataset, which accelerates the training and inference of graph classification models. However, if the coarsening is not done carefully, it can also result in a loss of useful information and, thus, of classification accuracy. In the following, we quantify the effect of different coarsening choices on graph classification. We utilize various graph coarsening methods to reduce the size of the graphs in the datasets before passing them to the graph classifier. We then evaluate the quality of graph coarsening based on the classification accuracy drop (as compared to the same classifier on the original graphs).

Table 3.2: Recovery accuracy (NMI) of block structures from random graphs in the stochastic block model.

p, q      | Type           | EM     | LV     | METIS  | SC     | MGC    | SGC
0.2, 0.01 | Assortative    | 0.1819 | 0.3076 | 0.7792 | 0.7845 | 0.3664 | 0.7845
          | Disassortative | 0.0956 | 0.1071 | 0.0815 | 0.0877 | 0.1093 | 0.0850
          | Mixed          | 0.1052 | 0.1944 | 0.2389 | 0.3335 | 0.6062 | 0.7107
0.5, 0.1  | Assortative    | 0.1015 | 0.1902 | 0.7820 | 0.7930 | 0.2868 | 0.7930
          | Disassortative | 0.0854 | 0.1068 | 0.0602 | 0.0788 | 0.1474 | 0.7901
          | Mixed          | 0.0848 | 0.2241 | 0.2883 | 0.4074 | 0.7343 | 0.7699
0.8, 0.3  | Assortative    | 0.0823 | 0.1139 | 0.5596 | 0.6532 | 0.1172 | 0.6532
          | Disassortative | 0.0836 | 0.0976 | 0.0776 | 0.1342 | 0.7784 | 0.7931
          | Mixed          | 0.0888 | 0.1503 | 0.2929 | 0.3909 | 0.5428 | 0.7209

Evaluation. We coarsen the graph samples until n = N/5, i.e., until their number of nodes is reduced by a factor of five. The classification performance is evaluated using 10-fold cross-validation, in accordance with previous works [5, 6, 71].

Datasets. We use six standard graph classification datasets [80, 83, 84]. Each dataset contains a set of variable-sized graphs stemming from a variety of applications.
The graph statistics can be found in Table 3.3.

The graph classifier. We use the Network Laplacian Spectral Descriptor (NetLSD) combined with a 1-NN classifier as the graph classification method [71]. NetLSD has been shown to be an efficient graph feature extractor that achieves state-of-the-art classification performance [71]. Note that NetLSD extracts graph features that depend only on the graph structure; it does not consider node and edge features.

Table 3.3: Statistics of the graph benchmark datasets.

Datasets    | MUTAG | ENZYMES | NCI1  | NCI109 | PROTEINS | PTC
Sample size | 188   | 600     | 4110  | 4127   | 1108     | 344
Average |V| | 17.93 | 32.63   | 29.87 | 29.68  | 39.06    | 14.29
Average |E| | 19.79 | 62.14   | 32.3  | 32.13  | 72.70    | 14.69
# classes   | 2     | 6       | 2     | 2      | 2        | 2

Results. Table 3.1 shows the graph classification performance on coarse graphs. In all cases, the proposed graph coarsening algorithms yield better classification accuracy than the alternative methods. Interestingly, for four out of the six datasets (NCI1, NCI109, PROTEINS, and PTC) there is almost no degradation in classification accuracy induced by coarsening, even though the graphs in the coarse dataset are five times smaller; this, we believe, is an encouraging result.

3.6.2 Block Recovery in the Stochastic Block Model

In this experiment, we test whether the coarsening algorithms can be used to recover the block structure of random graphs sampled from stochastic block models. The stochastic block model is a random graph model that is commonly used to evaluate graph partitioning and clustering algorithms [85, 86]. The model is parameterized by a probability matrix B ∈ [0, 1]^{n x n}, with graph nodes in blocks i and j being connected with probability B(i, j). Random graphs are generated from the stochastic block model by sampling the upper triangular entries W(i, j) in accordance with the edge probabilities; the lower triangular entries are then set as W(j, i) = W(i, j). We parameterize B with p and q as follows:

• Assortative. The diagonal entries of B are p and the off-diagonal entries are q.

• Disassortative. The diagonal entries of B are q and the off-diagonal entries are p.

• Mixed. The entries of B are randomly assigned p or q (each with probability 1/2).

Evaluation. We evaluate the performance of the graph coarsening algorithms by measuring the discrepancy between the recovered graph partitions and the ground-truth blocks. We use the Normalized Mutual Information (NMI) to measure the agreement between two graph partitions; the definition of NMI can be found in the supplementary material. For each stochastic block model setting, we set N = 200 and n = 10, with 20 nodes in each partition. We repeat each experiment 10 times and report the average NMI achieved by each method. We compare our graph coarsening algorithms with the graph coarsening and partitioning algorithms mentioned earlier.

Table 3.2 reports the average NMI in three different stochastic block model configurations. Our proposed methods outperform the other methods in almost all cases. In particular, our methods achieve high recovery accuracy in the disassortative and mixed settings, where traditional graph partitioning algorithms fail to recover the blocks accurately. The EM and LV coarsening algorithms are not optimized for block recovery and thus exhibit far worse performance on this task.

3.7 Conclusion

In this chapter, we propose a new framework for graph coarsening.
We leverage the spectral properties of normalized Laplacian matrices to define a new notion of graph distance that quantifies the differences between original and coarse graphs. We argue that the proposed spectral distance naturally captures the structural changes in the graph coarsening process, and we propose graph coarsening algorithms that guarantee that the coarse graphs exhibit a bounded spectral distance. Experiments show that our proposed methods outperform other graph coarsening algorithms on graph classification and block recovery tasks.

Chapter 4: Powerful Graph Neural Networks with Sequence Modeling

4.1 Introduction

Recently, graph neural network (GNN) models have emerged as a powerful tool for tackling challenging graph-related problems such as graph classification, link prediction, and node classification [4, 5, 7, 26, 27, 28]. Most GNN models follow the message-passing paradigm, in which node information is propagated along edges to form new node representations [4]. The final graph representation is obtained by applying a pooling function over all the node representations.

Various message-passing GNN (MPNN) models have been proposed to address graph-related problems [87, 88]. For example, Xu et al. proposed to use multi-set functions for node aggregation and graph pooling [6, 89]. Gilmer et al. proposed a general message-passing scheme utilizing Set2Set to obtain graph representations [4, 90]. Gao and Ji proposed a top-k graph pooling function that downsamples the graph nodes [91].

Despite their recent success, MPNN models suffer from a fundamental limitation on their expressiveness. As pointed out by previous studies, the expressive power of standard MPNN models does not exceed that of the 1-dimensional Weisfeiler-Lehman (WL) isomorphism test [6, 39]. Therefore, MPNN models cannot distinguish pairs of graphs that the 1-dimensional WL test cannot distinguish. For example, MPNN models cannot distinguish regular graphs with different structures. Note that k-regular graphs are graphs in which every node has degree k, yet such graphs can have different structures; one example is shown in Figure 4.3. More generally, Garg et al. recently showed that MPNN models cannot capture certain graph properties, which hampers their effectiveness in a wide range of applications [9, 92, 93, 94].

To tackle the above problem, previous studies proposed to augment the node representations with additional features in order to improve model expressiveness. For example, Sato et al. and Abboud et al. proposed to assign random node identifiers to improve expressiveness [2, 95]. However, MPNN models with random node identifiers usually do not maintain equivariance and invariance, which are key properties in GNN model design. Bouritsas et al. used subgraph counts as additional features [94]; such manual feature engineering requires domain-specific knowledge to achieve optimal performance for a given application [94].

Recent work introduced permutation-sensitive functions to formulate powerful permutation-invariant graph functions [30, 96, 97, 98]. For example, in the GraphSAGE model, Hamilton et al. used a Long Short-Term Memory (LSTM) aggregator on sampled sets of neighbors to extract expressive representations [30]. Murphy et al. proposed Relational Pooling to build powerful graph representations with permutation-sensitive functions.
However, previous work has limitations: the LSTM aggregator in the GraphSAGE model selects node permutations at random, which does not satisfy permutation invariance, and Relational Pooling requires expensive computation over all n! possible node permutations, which makes it difficult to generalize to applications with large-scale graphs.

In this work, we propose SeqGNN, a new graph neural network paradigm that improves the expressiveness of GNN models. Leveraging the rich expressiveness of sequence models, SeqGNN represents graphs as the composition of sequence representations. With appropriately designed sequence sampling and modeling techniques, SeqGNN models can form expressive graph representations while maintaining permutation invariance. In addition, SeqGNN can distinguish non-isomorphic graphs that cannot be distinguished by MPNN. We further show that SeqGNN models achieve competitive performance in real-world graph classification tasks.

Figure 4.1: Relationships between graphs, node permutations, and node sequences ((a) graph, (b) permutation, (c) sequence). A node permutation is an ordering of the graph nodes, and the node sequence additionally contains the edge information between consecutive nodes.

4.2 Related Work

Message-Passing GNN. Message-passing GNN models have been the primary paradigm of GNN models. Previous work proposed various convolutional and pooling functions that achieve excellent performance in a variety of domains [6, 28, 29, 30, 39, 99, 100]. For example, Xu et al. proposed to use multi-set functions for graph aggregation and pooling [6]. Gilmer et al. proposed a general message-passing scheme utilizing Set2Set to obtain graph representations [4, 90].

Expressive Power of GNN. Xu et al. and Morris et al. first pointed out that MPNN models cannot exceed the expressive power of the 1-WL isomorphism test. Abboud et al. and Sato et al. showed that associating graph nodes with extra node features, which can be as simple as random initializations, can improve the expressive power [2, 101]. Loukas et al. theoretically proved that, when the nodes are equipped with unique node identifiers, GNN models can approximate any Turing-computable function over connected attributed graphs [32]. Maron et al. proposed higher-order graph networks that are more powerful than MPNN but suffer from high computational complexity [41, 42].

Figure 4.2: The overall framework of the SeqGNN model. (1) The first step is to represent the graph as a set of node sequences. Each node sequence consists of a permutation of the graph nodes together with the associated edge information between consecutive nodes; the node sequences and their associated weights are determined by the sequence sampling method and the specific graph structure. (2) The second step is to learn the sequence representations, which are further combined to form the graph representation.

Sequence Modeling. Sequence modeling is a type of machine learning task that involves predicting or understanding a sequence of data points. Sequence data consist of data points where each element has a particular position in the sequence, and the arrangement of the elements carries significant information. Sequence modeling methods have been extensively studied in the fields of natural language processing and temporal data analysis [102, 103]. Recurrent Neural Networks (RNNs) and their variants can encode variable-sized sequence inputs into fixed-size representations [102, 103].
Recently, attention-based models such as transformers have been effectively applied to a wide range of domains [100, 104].

4.3 Preliminaries

4.3.1 Notations

Graph. A graph is represented as G = (V, W, H), with V the set of graph nodes, n = |V|, W ∈ R^{n x n} the adjacency matrix, and H ∈ R^{n x d} the node feature matrix. We denote by v_i ∈ V the node indexed by i and by N_s(v_i) the set of neighbors of v_i within s hops (we write N(v_i) = N_1(v_i)). We use h_i to denote the representation of node v_i and h_G to denote the graph-level representation.

Permutation. We denote a permutation of the integers from 1 to n as a list of node indices π = {v_1, v_2, ..., v_n}, and the corresponding permutation matrix as P_π. We use Π to denote the set of all possible permutations, with |Π| = n!, and π^{-1} to denote the inverse permutation of π. A graph permuted by π is denoted as G_π = (V_π, W_π, H_π), where the node indices, weight matrix, and feature matrix are permuted according to π as V_π, W_π = P_π W P_π^T, and H_π = P_π H.

Sequence. We denote a sequence as x = {x_1, x_2, ..., x_n}, where the elements are constructed from the graph G and a node permutation π. The element x_i contains the node information of v_i as well as other structural information, such as the edge information between v_i and v_{i+1}. We use X ∈ R^{n x D} to denote the sequence, where X_{i,:} contains the element at position i. The relationship among the graph, the permutation, and the sequence is illustrated in Figure 4.1. We use a function SeqDec : (G, π) → R^{n x D} to denote this relationship:

X_π = SeqDec(G, π)    (4.1)

Figure 4.3: Example. (a) Regular graphs that cannot be distinguished by MPNN and the 1-WL test. (b) With k layers of message passing, the node representations of MPNN models contain information from nodes that are up to k hops away; the information flow used to compute the node representations can be represented as rooted subtrees, where the root node absorbs the information propagated from the leaf nodes [1, 2]. Due to the anonymity of the graph nodes, the subtrees are indistinguishable for regular graphs. (c) SeqGNN formulates the graph as the composition of sequence representations with node permutations and edge information. Despite the anonymity of the graph nodes, the sequences can still distinguish the two graphs because of differences in their edge patterns.

4.3.2 Graph and Sequence Models

Graph Neural Network. GNN models compute graph representations from the node features H and the graph structure W, denoted as GNN : (R^{n x d}, R^{n x n}) → R^d. In particular, MPNN, one of the most popular classes of GNN models, iteratively updates the node representations by aggregating information from neighboring nodes, and the graph representation is obtained by applying a pooling function to the resulting node representations [4, 6, 9],

MPNN(H, W) = Pool(h_v, v ∈ V)    (4.2)

where h_u^i is the representation of node u at it