ABSTRACT

Title of Dissertation: REPRESENTATION LEARNING FOR LARGE-SCALE GRAPHS
Yu Jin, Doctor of Philosophy, 2023
Dissertation Directed by: Professor Joseph F. JaJa, Department of Electrical and Computer Engineering

Graphs are widely used to model object relationships in real-world applications such as biology, neuroscience, and communication networks. Traditional graph analysis tools focus on extracting key graph patterns that characterize graphs, which are then used in downstream tasks such as prediction. Common graph characteristics include global and local graph measurements such as the clustering coefficient, global efficiency, characteristic path length, and diameter. Many applications involve high-dimensional, large-scale graphs for which existing methods, which rely on a small number of graph characteristics, cannot efficiently perform graph tasks. A major research challenge is to learn graph representations that can be used to efficiently perform graph tasks as required by a wide range of applications. In this thesis, we have developed a number of novel methods to tackle the challenges associated with processing and representing large-scale graphs.

In the first part, we propose a general graph coarsening framework that maps large graphs into smaller ones while preserving important structural graph properties. Based on spectral graph theory, we define a novel distance function that measures the differences between the graph spectra of the original and coarse graphs. We show that the proposed spectral distance sheds light on the structural differences in the graph coarsening process. In addition, we propose graph coarsening algorithms that aim to minimize the spectral distance, with provably strong bounds. Experiments show that the proposed algorithms outperform previous graph coarsening methods in applications such as graph classification and stochastic block recovery tasks.

In the second part, we propose a new graph neural network paradigm that improves the expressiveness of the best known graph representations. Graph neural network (GNN) models have recently been introduced to solve challenging graph-related problems. Most GNN models follow the message-passing paradigm, where node information is propagated through edges and graph representations are formed by aggregating node representations. Despite their successes, message-passing GNN models are limited in their expressive power and fail to capture basic characteristic properties of graphs. In our work, we represent graphs as the composition of sequence representations. Through the design of sequence sampling and modeling techniques, the proposed graph representations achieve provably powerful expressiveness while maintaining permutation invariance. Empirical results show that the proposed model achieves superior performance in real-world graph classification tasks.

In the third part, we develop a fast implementation of spectral clustering methods on CPU-GPU platforms. Spectral clustering is one of the most popular graph clustering algorithms and has achieved state-of-the-art performance in a wide range of applications. However, existing implementations in commonly used software platforms such as Matlab and Python do not scale well for many of the emerging Big Data applications. We present a fast implementation of the spectral clustering algorithm on a CPU-GPU heterogeneous platform.
Our implementation takes advantage of the computational power of the multi-core CPU and the massive multithreading capabilities of GPUs. We show that the new implementation achieves significantly accelerated computation speeds compared with previous implementations on a wide range of tasks.

In the fourth part, we study structural brain networks derived from Diffusion Tensor Imaging (DTI) data. The processing of DTI data, coupled with the use of modern tractographic methods, reveals white matter fiber connectivity at a relatively high resolution; this allows us to model the brain as a structural network which encodes pairwise connectivity strengths between brain voxels. We have developed an iterative method to delineate the brain cortex into fine-grained connectivity-based brain parcellations. This allows us to map the initial large-scale brain network into a relatively small weighted graph that preserves the essential structural connectivity information. We show that graph representations based on the brain networks from the new brain parcellations are more powerful in discriminating between different population groups, compared with existing brain parcellations.

REPRESENTATION LEARNING FOR LARGE-SCALE GRAPHS

by Yu Jin

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2023

Advisory Committee:
Professor Joseph F. JaJa, Chair/Advisor
Professor Jonathan Simon
Professor Behtash Babadi
Professor Rajeev Barua
Professor Luiz Pessoa, Dean's Representative

© Copyright by Yu Jin 2023

Acknowledgments

I would like to express my most sincere gratitude to Professor Joseph JaJa for his mentorship and guidance over this long journey. His wisdom and depth of knowledge have helped me become a better researcher and a better person. With his trust and encouragement, I was able to freely pursue a wide range of challenging research topics. I am grateful for his patience and flexibility, which supported me through the most puzzling years. I want to thank Dr. Andreas Loukas, who supported me in extending the work on graph coarsening. His enthusiasm and persistence in graph research have encouraged me to persistently tackle challenging and fundamental research problems. I would like to thank the committee members Professor Jonathan Simon, Professor Behtash Babadi, Professor Rajeev Barua, and Professor Luiz Pessoa for providing valuable feedback on the thesis and presentation. I want to thank the Department of ECE for fostering a diverse environment in which to learn about cutting-edge technologies from different domains. Lastly, I want to thank my wife, Yun Zhou, and my parents, Wu Jin and Liumei Zhou, for their love and support.

Table of Contents

Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Abbreviations

Chapter 1: Introduction
  1.1 Motivation
  1.2 Contributions of this Thesis
  1.3 Outline of Thesis

Chapter 2: Background
  2.1 Preliminaries
  2.2 Graph Coarsening
    2.2.1 Overview
    2.2.2 Graph Coarsening Methods
    2.2.3 Applications
  2.3 Spectral Graph Theory
    2.3.1 Overview
    2.3.2 Relationship with Graph Structures
    2.3.3 Applications
  2.4 Graph Neural Network
    2.4.1 Overview
    2.4.2 Permutation Invariance
    2.4.3 Examples of GNN Models
    2.4.4 Expressiveness of GNN
    2.4.5 Limitations
  2.5 Structural Brain Networks
    2.5.1 Overview
    2.5.2 Diffusion Tensor Imaging Processing
    2.5.3 Brain Parcellation Scheme
    2.5.4 Construction of Structural Brain Networks
    2.5.5 Graph Theoretical Analysis

Chapter 3: Graph Coarsening with Preserved Spectral Properties
  3.1 Introduction
  3.2 Related Work
  3.3 Preliminaries
    3.3.1 Graph Coarsening
    3.3.2 Graph Lifting
  3.4 Spectral Distance
    3.4.1 Properties of the Coarse Laplacian Spectrum
    3.4.2 Spectral Distance
    3.4.3 Relation to Graph Coarsening
  3.5 Algorithms
    3.5.1 Multilevel Graph Coarsening
    3.5.2 Spectral Graph Coarsening
  3.6 Experiments
    3.6.1 Graph Classification with Coarse Graphs
    3.6.2 Block Recovery in the Stochastic Block Model
  3.7 Conclusion

Chapter 4: Powerful Graph Neural Networks with Sequence Modeling
  4.1 Introduction
  4.2 Related Work
  4.3 Preliminaries
    4.3.1 Notations
    4.3.2 Graph and Sequence Models
  4.4 Designing Graph Neural Networks with Sequence Modeling
    4.4.1 Overview
    4.4.2 Express Graphs as Combinations of Sequences
    4.4.3 Sequence Modeling
    4.4.4 Expressive Power of SeqGNN
    4.4.5 Complexity
    4.4.6 Practical Techniques
    4.4.7 Training and Inference
  4.5 Experiments
    4.5.1 Datasets
    4.5.2 Baseline Models
    4.5.3 Model Configuration
    4.5.4 Main Results
    4.5.5 Analysis of Expressive Power
  4.6 Conclusion

Chapter 5: Fast Spectral Clustering in GPU-CPU Platforms
  5.1 Introduction
  5.2 Overview of Spectral Clustering Algorithm
  5.3 Environment Setup
    5.3.1 The Heterogeneous System
    5.3.2 CUDA Platform
    5.3.3 ARPACK Software
    5.3.4 OpenBLAS
  5.4 Implementation
    5.4.1 Data Preprocessing
    5.4.2 Parallel Eigensolvers
    5.4.3 Parallel k-means Clustering
  5.5 Evaluation
    5.5.1 Datasets
    5.5.2 Environment and Software
    5.5.3 Performance Analysis
  5.6 Conclusion

Chapter 6: Connectivity-based Brain Parcellations
  6.1 Introduction
  6.2 Data and Preprocessing Steps
    6.2.1 Data Acquisition
    6.2.2 Nonlinear Registration
    6.2.3 Probabilistic Tractography
  6.3 Our Approach
    6.3.1 Connectivity Profile
    6.3.2 Spatially Constrained Similarity Graph
    6.3.3 Minimum Graph-Cut Problem and Iterative Refinement
  6.4 Reproducibility and Stability Analysis
    6.4.1 Parcellation Similarity Metrics
    6.4.2 Stability and Reproducibility of Subject Parcellations
    6.4.3 Group Consistency and Atlas Generation
  6.5 Discriminative Analysis
    6.5.1 Similarity-based Analysis
    6.5.2 Connectome Analysis
  6.6 Conclusion

Chapter 7: Concluding Remarks and Future Work

Appendix A: Supplementary Materials of Chapter 3
  A.1 Proof of Propositions 3.4.1 and 3.4.2
    A.1.1 Proof of Proposition 3.4.1
    A.1.2 Proof of Proposition 3.4.2
  A.2 Proof of Corollary 3.5.1
  A.3 Proof of Theorem 3.5.2
  A.4 Additional Material for Experiments
    A.4.1 Graph Classification Dataset
    A.4.2 Definition of Normalized Mutual Information

Bibliography

List of Tables

2.1  Demographics of NKI Sample
3.1  Classification accuracy on coarse graphs that are five times smaller
3.2  Recovery Accuracy of Block Structures from Random Graphs in the Stochastic Block Model
3.3  Statistics of the graph benchmark datasets
4.1  Comparison between MPNN and SeqGNN
4.2  Results (measured by accuracy, %) on TUDataset
5.1  CPU and GPU specifics
5.2  Datasets
5.3  Running Time of Spectral Clustering on DTI Dataset
5.4  Running Time of Spectral Clustering on FB Dataset
5.5  Running Time of Spectral Clustering on Syn200 Dataset
5.6  Running Time of Spectral Clustering on dblp Dataset
5.7  Comparison Between Data Communication Time and Computation Time
6.1  Subject Demographics
6.2  NMI Between Parcellations After 1st Iteration
6.3  NMI Between Parcellations After 2nd Iteration
6.4  NMI Between Parcellations After 3rd Iteration
6.5  NMI Between Parcellations After 4th Iteration
6.6  Dice's Coefficient Between Parcellations After 1st Iteration
6.7  Dice's Coefficient Between Parcellations After 2nd Iteration
6.8  Dice's Coefficient Between Parcellations After 3rd Iteration
6.9  Dice's Coefficient Between Parcellations After 4th Iteration
6.10 NMI Between Parcellations in Consecutive Iteration Stages
6.11 Dice's Coefficient Between Parcellations in Consecutive Iteration Stages
6.12 Average Similarity Between Parcellations of Different Subjects within the NC Group
6.13 Average Similarity Between Parcellations of Subjects within the NC Group and Randomly Generated Parcellation
6.14 P-value and T-statistic for Gender Study within the NC Group
6.15 P-value and T-statistic for Age Study within the NC Group
6.16 P-value and T-statistic for Schizophrenic Study
6.17 P-value and T-statistic of Pairwise Connectivity Between Normal Control and Schizophrenic Groups
A.1  Statistics of the graph benchmark datasets

List of Figures

2.1  Left: seed region (the JHU white matter atlas). Middle: target region (the AAL mask). Right: the result of probabilistic tractography, which models the distribution of neuron fiber bundles; the intensity of each voxel represents the number of streamlines passing through that voxel. All figures are imaged in the axial plane.
3.1  Left: an example illustrating the graph coarsening process. The original graph is a random graph sampled from the stochastic block model with 50 nodes and 10 predefined blocks. The coarse graph is obtained from the predefined partitions. Right: eigenvalues and eigenvectors of the normalized Laplacian matrices of the original, coarse, and lifted graphs. The eigenvalues of the coarsened graphs align with the eigenvalues of the original graphs, and the eigenvectors indicate the block membership information.
4.1  Relationships between graphs, node permutations, and node sequences. The node permutation is an ordering of the graph nodes, and the node sequence contains the edge information between consecutive nodes.
4.2  The overall framework of the SeqGNN model. (1) The first step is to represent graphs as a set of node sequences. A node sequence consists of a permutation of the graph nodes and the associated edge information between consecutive nodes. The node sequences and the associated weights are determined by the sequence sampling methods and the specific graph structures. (2) The second step is to learn the sequence representations, which are further combined to form the graph representations.
4.3  Example: (a) Regular graphs that cannot be distinguished by MPNN and 1-WL. (b) With k layers of message passing, node representations of MPNN models contain information from nodes that are k hops away. The information flow to compute the node representations can be represented as rooted subtrees in which the root node absorbs the information flowing from the leaf nodes [1, 2]; due to the anonymity of graph nodes, the subtrees are indistinguishable for regular graphs. (c) SeqGNN formulates the graph as the composition of sequence representations with node permutation and edge information. Despite the anonymity of graph nodes, the sequences can still distinguish the two graphs because of differences in the edge patterns.
4.4  Graph coarsening techniques to reduce the complexity
4.5  Classification accuracy under different numbers of convolutional layers
5.1  CUDA Program Model
5.2  Parallel Implementation of Spectral Clustering
5.3  Time Costs of Spectral Clustering on DTI Dataset
5.4  Time Costs of Spectral Clustering on FB Dataset
5.5  Time Costs of Spectral Clustering on Syn200 Dataset
5.6  Time Costs of Spectral Clustering on dblp Dataset
6.1  32 neighbors of a voxel within a sphere of radius r = 2. Note that each voxel represents a 1.718mm×1.718mm×3mm brain volume; the figure shows the symbolic neighbors of voxels rather than the actual volume size.
6.2  Brain segmentations used to define connectivity profiles
6.3  Subject reproducibility after each iteration
6.4  Atlas and confidence map for the NC group. Note that for the confidence map, the grey scale represents the ratio of overlapped regions.
6.5  Parcellation with 5 regions
6.6  Binary maps where entries in red have p-values < 0.05 and < 0.00005, respectively, in terms of connectivity strengths between the two population groups using our 40-region parcellations
6.7  Binary map where entries in red have p-values < 0.05 and < 0.00005, respectively, between connectivity strengths of the two population groups using the AAL atlas
List of Abbreviations

BLAS  Basic Linear Algebra Subprograms
BSR  Block Compressed Sparse Row Format
CIN  Cell Isomorphism Network
CNN  Convolutional Neural Network
COO  Coordinate Format
CPU  Central Processing Unit
CSC  Compressed Sparse Column Format
CSR  Compressed Sparse Row Format
DGCNN  Deep Graph Convolutional Neural Network
dMRI  Diffusion Magnetic Resonance Imaging
DTI  Diffusion Tensor Imaging
EM  Edge Matching
FC  Functional Connectivity
fMRI  Functional Magnetic Resonance Imaging
GAT  Graph Attention Network
GCN  Graph Convolutional Network
GIN  Graph Isomorphism Network
GNN  Graph Neural Network
GPU  Graphics Processing Unit
IPCC  Human Inferior Parietal Cortex Complex
IRAM  Implicitly Restarted Arnoldi Method
LAPACK  Linear Algebra PACKage
LSTM  Long Short-Term Memory Network
LV  Local Variation
MDL  Minimum Description Length
MGC  Multilevel Graph Coarsening
MKL  Math Kernel Library
MNI  Montreal Neurological Institute
MPI  Message Passing Interface
MPNN  Message Passing Graph Neural Network
MRI  Magnetic Resonance Imaging
NC  Normal Control
Ncut  Normalized Cut
NetLSD  Network Laplacian Spectral Descriptor
NMI  Normalized Mutual Information
PCIe  Peripheral Component Interconnect Express
PNA  Principal Neighbourhood Aggregation
PPGN  Provably Powerful Graph Network
ROI  Region of Interest
RNN  Recurrent Neural Network
SC  Spectral Clustering
SD  Spectral Distance
SeqGNN  Graph Neural Network with Sequence Modeling
SGC  Spectral Graph Coarsening
SGD  Stochastic Gradient Descent
SIMD  Single Instruction, Multiple Data
SIN  Simplicial Isomorphism Network
SM  Streaming Multiprocessor
SP  Streaming Processor
SZ  Schizophrenia
TPJ  Temporoparietal Junction Area
WL  Weisfeiler-Lehman Isomorphism Test

Chapter 1: Introduction

1.1 Motivation

Graphs are widely used to model object relationships in real-world applications. There have been extensive studies that propose the use of various graph features to help analyze and understand the complex structures of graphs across a number of domains. Most of the proposed graph features are based on the statistics of elementary graph structures such as paths, walks, trees, and cuts, which are then used for subsequent machine learning tasks. However, predefined, hand-crafted graph features are not flexible enough to handle graph prediction tasks for a wide range of applications. Moreover, the computation of many graph features does not scale well with graph size, which causes high computational costs, especially for the analysis of large-scale graphs. In this introduction, we describe the main problems addressed in this thesis.

Graph Coarsening. To analyze large graphs, a common technique is to coarsen the graph into a smaller graph that reduces the computational load while attempting to maintain the important structural properties of the original graph. The coarsening process significantly reduces the dimensionality of the original large graphs and accelerates subsequent graph processing tasks. Graph statistics such as the characteristic path length, global efficiency, and clustering coefficient can be efficiently computed on the coarsened graphs. Although various graph coarsening methods have been proposed, there is no overall consensus on the criteria for graph coarsening, which directly determine the statistical effectiveness of subsequent network analysis.

Spectral Methods. Spectral methods are emerging as powerful tools in machine learning and network science.
Graph spectra are represented by the eigenvalues and eigenvectors of matrices associated with the graph, such as the adjacency and Laplacian matrices. The field of spectral graph theory has established a number of interesting results on the fundamental relationship between graph structural properties and graph spectra. In particular, related work shows that the graph spectrum contains rich information about important graph characteristics such as connectivity, bipartiteness, the number of connected components, the diameter, and the average path length [3]. We introduce a novel framework in this thesis that tightly links graph coarsening and spectral graph theory.

Spectral Clustering Algorithm. There are a number of algorithms that include the computation of the graph spectrum as one of their major computational components to solve problems in a wide variety of applications. An example is the spectral clustering algorithm, one of the most popular clustering algorithms, which aims to find densely-connected clusters within graphs. Note that graph spectral features, in addition to the common network statistics, are used in graph representations, which can achieve good prediction capability. A significant hurdle for using graph spectra in applications with large-scale graphs is the computational cost of computing the whole set of eigenvalues and eigenvectors. Graphics Processing Units (GPUs) are well known for their parallel processing capabilities, which can accelerate large-scale matrix computations. We provide in this thesis a GPU-accelerated implementation for computing the spectrum, which significantly outperforms previous implementations.

Graph Neural Networks. Deep learning models have achieved great successes in a wide range of applications including computer vision, natural language processing, audio processing, and many others. Recently, deep learning models have been introduced to solve graph-related tasks such as node classification and graph classification. A notable model is the graph neural network (GNN) model, which learns node and graph representations to solve important graph tasks. Most GNN models follow a message-passing paradigm where node representations are formed by iteratively aggregating information from neighboring nodes, followed by a graph pooling function to generate graph representations [4, 5, 6, 7]. Despite the success of message-passing GNN models in many graph-related tasks, recent studies have shown that GNN models have limited expressive power and fail to capture even simple graph characteristics. For example, GNN graph representations cannot distinguish certain pairs of non-isomorphic regular graphs. In this thesis, we develop a new approach that achieves more powerful expressiveness while maintaining permutation invariance.

Structural Brain Networks. An important investigation conducted in this thesis is the study of large-scale brain networks derived from Diffusion Tensor Imaging (DTI) data. The objective is to learn characteristic graph representations which can achieve high prediction performance in applications involving, for example, brain disorders. Diffusion Magnetic Resonance Imaging (MRI) exploits the anisotropic diffusion of water molecules in the brain to enable the estimation of the brain's anatomical fiber tracts at a relatively high resolution.
Tractographic methods can be used to generate a whole-brain anatomical connectivity matrix where each matrix element provides an estimate of the connectivity strength between the corresponding voxels. The resulting brain connectivity graphs are quite large, involving hundreds of thousands of nodes and billions of edges. A common method to address this challenge is to build structural brain networks using a predefined whole-brain parcellation (defining Regions of Interest, ROIs), where the nodes of the network represent the brain regions and the edge weights capture the connectivity strengths between the corresponding brain regions. Traditional methods build structural brain networks from existing anatomical brain atlases such as Automated Anatomical Labelling (AAL) [8], which do not consider the underlying connectivity information. There is potential to develop more effective brain parcellations using structural information derived from the DTI data.

1.2 Contributions of this Thesis

In this thesis, we make significant, original contributions to all the problems mentioned above.

In the first part, we propose an effective graph coarsening method which coarsens large graphs while provably preserving some important graph structural properties. Graph coarsening is a common technique used to simplify complex graphs by reducing the number of nodes and edges while trying to preserve some of the main structural properties of the original graph. A key challenge is to determine what specific graph properties are to be preserved by coarse graphs. We propose a new graph coarsening method that attempts to preserve the graph spectral properties. We start by relating the spectrum of the original graph to that of the coarsened graph. Based on this insight, we define a novel distance function that measures the differences between the graph spectra of the original and coarsened graphs. We show that the proposed spectral distance captures in a significant way the structural differences arising during the graph coarsening process. We also propose graph coarsening algorithms that aim to minimize the spectral distance. Experiments show that the proposed algorithms can outperform previous graph coarsening methods in applications such as graph classification and stochastic block recovery tasks.

In the second part, we propose a new graph neural network paradigm that improves the expressiveness of graph representations. We represent graphs as weighted combinations of sequence representations. We show that, through the design of sequence sampling and modeling techniques, the proposed graph representations achieve provably powerful expressiveness while maintaining the permutation invariance property. Empirical results show that the proposed graph representations achieve superior performance in real-world graph classification tasks.

The third part focuses on the acceleration of spectral clustering methods, leveraging the computational advantages of CPUs and GPUs. Spectral clustering is one of the most effective graph clustering algorithms and is widely used in a number of applications. However, existing implementations in commonly used software platforms such as Matlab and Python do not scale well for many of the emerging Big Data applications. We present a fast implementation of the spectral clustering algorithm on a CPU-GPU heterogeneous platform. Our implementation takes advantage of the computational power of the multi-core CPU and the massive multithreading capabilities of GPUs.
We show that the new implementation achieves significantly accelerated computation speeds compared with previous implementations in a number of important applications.

The fourth part focuses on the study of structural brain networks with the objective of determining graph representations based on DTI data. We develop an iterative method to delineate the brain cortex into fine-grained connectivity-based brain parcellations. This allows the mapping of the initial large-scale brain network into a relatively small weighted graph, which preserves the essential structural connectivity information. We show that graph representations based on the brain networks from the new brain parcellations are more powerful in discriminating between a number of population groups.

1.3 Outline of Thesis

The rest of the thesis is organized as follows. In Chapter 2, we give the background of the research work conducted in this thesis. In Chapter 3, we propose a new graph coarsening method which coarsens large graphs into smaller ones with preserved graph spectral properties. In Chapter 4, we propose a new graph neural network model that improves the expressiveness of graph representations. In Chapter 5, we present a fast implementation of spectral clustering for large-scale graphs. In Chapter 6, we study graph representations of structural brain networks, and we conclude in Chapter 7 by discussing several potential directions for future research.

Chapter 2: Background

2.1 Preliminaries

We denote a graph as G = (V, W, H), with V the set of graph nodes with n = |V|, W ∈ R^{n×n} the (possibly weighted) adjacency matrix, and H ∈ R^{n×d} the matrix holding the node features. We denote by v_i ∈ V the node indexed by i, by w(i) ∈ R^n the vector of edge weights associated with v_i, and by d(i) = ∑_{j=1}^{n} W(i, j) the node degree. In addition, we use h_i to denote the node representation of node v_i (corresponding to the i-th row of H) and h_G to denote the graph-level representation.

2.2 Graph Coarsening

2.2.1 Overview

Graph coarsening is a common graph processing technique widely used in a number of applications involving large-scale graphs. Graph coarsening simplifies large graphs by reducing the number of nodes and edges while trying to preserve some of the main properties of the original graph. The goal is to accelerate graph processing while maintaining similar performance on graph processing tasks.

One of the key challenges is to define the coarsening criteria based on which the graph nodes and edges are reduced. The coarsening criteria directly determine the coarsening process and the quality of the resulting coarsened graphs. Researchers have proposed a number of different coarsening criteria, each resulting in an optimization that preserves certain graph properties. For example, Loukas et al. proposed to preserve the action of the graph Laplacian with respect to an (eigen-)space of fixed dimension, arguing that this suffices to capture the global properties of graphs relevant to partitioning and spectral clustering [9]. Durfee et al. proposed to preserve the all-pairs effective resistance [10], while Garg and Jaakkola defined a cost based on the theory of optimal transport [11]. Saket et al. suggested a Minimum Description Length (MDL) principle relevant to unweighted graphs [12]. Most previous graph coarsening methods are specific to particular applications; the question of how to define an application-independent graph coarsening framework remains a challenge.
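Before turning to specific methods, the following minimal sketch (Python/NumPy; the function name and the hand-picked partition are ours, and whether induced self-loops are kept or dropped is a convention that varies across schemes) illustrates the mechanics shared by most coarsening procedures: a node partition is fixed, and each group is collapsed into a super-node whose edge weights are the sums of the original inter-group weights. The hard part, on which the methods below and Chapter 3 differ, is how the partition is chosen.

```python
import numpy as np

def coarsen_by_partition(W, partition):
    """Collapse each group of nodes in `partition` into a single super-node.

    W         : (n, n) symmetric weighted adjacency matrix
    partition : list of index lists, one list per super-node
    Returns the (k, k) coarse adjacency matrix with summed inter-group weights.
    """
    n, k = W.shape[0], len(partition)
    # Membership matrix P: P[c, i] = 1 if node i belongs to super-node c.
    P = np.zeros((k, n))
    for c, group in enumerate(partition):
        P[c, group] = 1.0
    W_coarse = P @ W @ P.T           # sums edge weights between groups
    np.fill_diagonal(W_coarse, 0.0)  # drop self-loops from intra-group edges (a convention choice)
    return W_coarse

# Example: a 6-node path graph coarsened into 3 super-nodes.
W = np.zeros((6, 6))
for i in range(5):
    W[i, i + 1] = W[i + 1, i] = 1.0
print(coarsen_by_partition(W, [[0, 1], [2, 3], [4, 5]]))
```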
2.2.2 Graph Coarsening Methods

There are a number of graph coarsening methods designed to optimize for different graph properties, some of which are summarized below.

Heavy Edge Matching. In this method, graph nodes are collapsed based on the heaviest edge (i.e., the edge with the largest weight) connecting them. The process is repeated iteratively until a certain graph size is reached [13].

Algebraic Multigrid Coarsening. This method is based on the multigrid method used in numerical linear algebra. It involves constructing a hierarchy of coarser graphs from the original graph using matrix operations, which makes it possible to solve linear systems efficiently [14].

Community Detection-based Coarsening. This method first clusters graph nodes into densely-connected communities and merges graph nodes belonging to the same community into a single node. The method is particularly useful for finding community structures within graphs [15].

2.2.3 Applications

The following are examples of applications where graph coarsening methods have been used.

Graph Partitioning. Graph coarsening is often used as a preprocessing step in graph partitioning, which aims to divide a graph into smaller subgraphs while minimizing some cost function related to the number of edges cut. The coarsened graph can be partitioned more efficiently, and the resulting partition can then be projected back onto the original graph.

Community Detection. Graph coarsening can be used to find communities or clusters within the graph. By merging closely connected nodes, the coarser graph can help reveal the underlying community structures more efficiently [16, 17].

Multigrid Methods. Graph coarsening plays a central role in algebraic multigrid methods as well as the related class of multilevel incomplete LU factorizations. The idea is to project the original problem to an "equivalent" problem on the coarse mesh, and the solution is then interpolated back to the fine-level grid [14]. Graph coarsening significantly accelerates the processing speed of multigrid methods.

Machine Learning. Graph-based machine learning methods, such as Graph Convolutional Networks (GCNs), have included graph coarsening as one of the core components in building end-to-end machine learning models for specific graph tasks. One benefit of graph coarsening is to reduce the computational time and memory requirements for large graphs [7]. Moreover, graph coarsening can often improve the convergence speed and performance for specific tasks [18].

2.3 Spectral Graph Theory

2.3.1 Overview

We introduce some basic concepts from spectral graph theory, which we believe offers a more robust foundation for understanding our work on graph coarsening. In essence, spectral graph theory explores the fundamental relationship between the structural properties of graphs and the spectra of matrices associated with these graphs, typically either the adjacency matrix or the Laplacian matrix. Many tutorials and several textbooks have been written about spectral graph theory; see for example [3].

In this thesis, we focus on the normalized Laplacian matrix, whose eigenvalues are closely related to a number of topological invariants of the graph. The normalized Laplacian of a graph is defined as

\[ I - D^{-1} W \tag{2.1} \]

where W is the weight matrix of the graph and D is the diagonal matrix whose diagonal entries are defined by D_{i,i} = ∑_j W_{i,j}.
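As a concrete illustration of Eq. (2.1), the sketch below builds the normalized Laplacian from a weighted adjacency matrix (a minimal NumPy sketch; the function name is ours, and it assumes every node has positive degree, consistent with the graphs considered in this thesis).

```python
import numpy as np

def normalized_laplacian(W):
    """Return I - D^{-1} W for a weighted adjacency matrix W, as in Eq. (2.1).

    Assumes no isolated nodes, so every diagonal entry of D is positive.
    """
    d = W.sum(axis=1)                            # degrees D_{i,i} = sum_j W_{i,j}
    return np.eye(W.shape[0]) - W / d[:, None]   # row-wise division applies D^{-1}
```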
For our purposes, the spectrum of the graph is defined as the sequence of eigenvalues of the normalized graph Laplacian, ordered as 0 = λ_1 ≤ λ_2 ≤ ... ≤ λ_n ≤ 2.

2.3.2 Relationship with Graph Structures

Extensive studies have shown that the spectrum is closely related to the graph structure. For example, the multiplicity of the eigenvalue 0 corresponds to the number of connected components in the graph, and the largest eigenvalue λ_n is equal to 2 if and only if the graph is bipartite [3]. Moreover, the second smallest eigenvalue λ_2 is of special importance and has been shown to be closely related to the convergence rate of random walks on the graph, "chemical conductance," Cheeger's constant, the graph diameter, and graph quasi-randomness [3]. Other eigenvalues also play an important role in capturing graph structural properties, a role that has not yet been fully explored.

2.3.3 Applications

While we focus in this thesis on the application of spectral graph theory to graph coarsening, other major applications include:

• Spectral Clustering. Spectral clustering is a widely used technique that groups similar data points by analyzing the eigenvectors of the graph's Laplacian matrix [19].

• Image Segmentation. In computer vision, spectral graph theory can be applied to segment images into meaningful regions by representing the image as a graph and using eigenvectors to identify boundaries between different regions [20, 21].

• Dimensionality Reduction. Spectral methods for dimensionality reduction use the spectral (eigenvalue) decomposition of matrices derived from the data in order to transform the data into a lower-dimensional space. These methods are primarily based on constructing a similarity graph from the data, where each data point is represented as a node and the edge weights represent the similarity between data points. From this graph, a matrix representation is derived, often the graph Laplacian, but sometimes the adjacency or degree matrix. The eigenvectors corresponding to the smallest eigenvalues of this matrix are then used to represent the data in a lower-dimensional space. Notable examples are Laplacian Eigenmaps, Locally Linear Embedding (LLE), and Isomap [22, 23, 24, 25].

2.4 Graph Neural Network

2.4.1 Overview

Graph neural network (GNN) models have emerged as a powerful tool to solve graph-related problems such as node classification, link prediction, and graph classification [4, 5, 7, 26, 27, 28]. Most GNN models follow a message-passing paradigm where node representations are formed by iteratively aggregating information from neighboring nodes, followed by a graph pooling function to generate graph representations [4, 5, 6, 7]. Starting with the initial node features H^{(0)} = H, a message-passing graph neural network iteratively updates node representations by aggregating information from neighboring nodes. The following functions characterize the graph convolutional process [4, 6, 9]:

\[ h_v^{(i)} = \mathrm{Aggregate}^{(i)}\big(h_v^{(i-1)}, m_v^{(i)}\big), \qquad m_v^{(i)} = \mathrm{Msg}^{(i)}\big(\{h_u^{(i-1)} : u \in \mathcal{N}(v)\}\big) \tag{2.2} \]

where h_u^{(i)} is the representation of node u at iteration i and Msg is the message function that aggregates neighborhood information.
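As an illustration of Eq. (2.2), the sketch below implements one update round with a simple choice of both functions: Msg computes a weighted sum of the neighbor representations, and Aggregate applies a linear map and a nonlinearity to the concatenation of the old representation and the message (a minimal NumPy sketch with names of our own choosing; concrete Msg/Aggregate choices used by popular models are listed in Section 2.4.3).

```python
import numpy as np

def message_passing_step(H, W, Theta, sigma=np.tanh):
    """One round of Eq. (2.2) with illustrative Msg and Aggregate functions.

    H     : (n, d) node representations h_v^{(i-1)}
    W     : (n, n) weighted adjacency matrix; N(v) = nonzero entries of row v
    Theta : (2d, d') learnable weights of the Aggregate step
    Msg       : m_v = weighted sum of neighbor representations
    Aggregate : h_v = sigma([h_v, m_v] @ Theta)
    """
    M = W @ H                                             # messages m_v^{(i)}
    return sigma(np.concatenate([H, M], axis=1) @ Theta)  # updated h_v^{(i)}
```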
The graph representation h_G is obtained by applying the pooling (readout) function to the node representations:

\[ h_G = \mathrm{Readout}\big(\{h_v^{(k)} : v \in \mathcal{V}\}\big) \tag{2.3} \]

2.4.2 Permutation Invariance

Most GNN models need to satisfy the property of permutation invariance, that is, the graph representations remain invariant under any permutation applied to the inputs. We denote by π = {v_1, v_2, ..., v_n} a node permutation, and G_π = (V_π, W_π, H_π) is the graph whose node indices, weight matrix, and feature matrix are permuted according to π. Permutation invariance is expressed as follows:

\[ \mathrm{GNN}(H, W) = \mathrm{GNN}(H_\pi, W_\pi), \quad \forall \pi \in \Pi. \tag{2.4} \]

2.4.3 Examples of GNN Models

There are many variants of GNN models with different message, aggregation, and readout functions. Some notable examples are as follows.

Graph Convolutional Network (GCN). GCN is one of the most popular GNN models and borrows the idea of the Convolutional Neural Network (CNN) into the graph domain. The GCN model leverages a graph convolutional function to aggregate information from neighboring nodes and applies a pooling function to form graph representations from node representations. GCN models achieve high performance on node and graph classification tasks [28, 29].

Graph Sample and Aggregation (GraphSAGE). GraphSAGE extends the GCN framework by introducing a more flexible, scalable, and inductive learning approach. GraphSAGE learns to generate embeddings for nodes by sampling and aggregating information from their local neighborhoods, which is particularly useful for learning on large graphs or graphs with nodes unseen during training [30].

Graph Attention Network (GAT). GAT models introduce attention mechanisms into GNNs, which enables them to weigh the importance of neighboring nodes differently when aggregating information. Through the attention mechanism, the message function can focus on the most relevant neighbors and capture more complex graph structures. GAT models have been applied to various tasks, including node classification, link prediction, and graph classification [31].

2.4.4 Expressiveness of GNN

The expressiveness of a graph representation model is its ability to capture the intricate structure of graph data and encode this information into node or graph representations. For GNN models, the expressiveness is determined by the model architecture and the specific message, aggregation, and readout functions. The number of message propagation steps can have a significant influence on the expressiveness [9, 32]; GNN models with a large number of propagation steps can capture long-range dependencies and interactions [33, 34, 35].

Test of Isomorphism. The expressiveness of GNN models is often evaluated by their ability to distinguish non-isomorphic graphs. Isomorphic graphs are pairs of graphs with the same structure that may differ only in the labeling of their vertices and edges, up to a node permutation. Two graphs G_1 = (V_1, W_1, H_1) and G_2 = (V_2, W_2, H_2) are said to be isomorphic when there exists a permutation π such that the edges and node features are identical after applying the node permutation, i.e., W_{1,π} = W_2 and H_{1,π} = H_2. The ability to distinguish non-isomorphic graphs is an important component in understanding the expressiveness of graph representation models.

Weisfeiler-Lehman Tests of Isomorphism. One of the most powerful and widely used tests of graph isomorphism is the Weisfeiler-Lehman Test of Isomorphism (WL) [36, 37].
The WL test is a refinement-based method that iteratively updates node labels using local neighborhood information until a stopping condition is reached. The 1-dimensional WL test (1-WL) updates the node labels based on the immediate neighbors during the labeling process. The detailed steps of 1-WL are as follows (a code sketch of the refinement appears at the end of Section 2.4):

• Initialization: Assign an initial label to each node in each graph.

• Iterative Labeling: In each iteration, a new label is computed for each node based on its current label and the multiset of labels of its neighbors.

• Stopping Condition: The process continues until no new labels are generated or until a predetermined number of iterations is reached.

• Decision: After the last iteration, the nodes of the two graphs are sorted according to their final labels. If the resulting label sequences differ, the two graphs are certainly non-isomorphic; if they agree, the graphs are deemed possibly isomorphic, since the test may fail to separate some non-isomorphic graphs.

Higher-dimensional Weisfeiler-Lehman isomorphism tests (k-WL) refine labels over k-tuples of nodes rather than individual nodes, which results in a stronger capability to distinguish non-isomorphic graphs. Cai et al. show that (k + 1)-WL has strictly stronger expressive power than k-WL, i.e., there are non-isomorphic graphs which can be distinguished by (k + 1)-WL but not by k-WL [38].

Recent studies have established key results on the relationship between the expressiveness of GNNs and the hierarchy of Weisfeiler-Lehman (WL) isomorphism tests. They show that the expressiveness of the most powerful message-passing GNN models does not exceed 1-WL [1, 39]. As 1-WL cannot distinguish between certain graphs that are not isomorphic, such as regular graphs, GNN models suffer from limited expressiveness. How to design graph representations whose expressiveness exceeds that of the WL tests remains a challenge.

2.4.5 Limitations

Despite the successes GNN models have achieved across a number of applications, they suffer from limitations that affect their effectiveness. Some of these limitations are listed next.

• Scalability. The complexity of graph convolutions and the memory requirements can grow significantly with the size of the graph and the number of iterations. This makes it challenging to apply GNNs to real-world problems that involve large-scale graphs.

• Expressiveness. As mentioned previously, the expressiveness of GNN models is limited by 1-WL, which cannot distinguish certain non-isomorphic graphs, such as pairs of regular graphs [6].

• Over-smoothing. As the number of convolutional layers increases, node representations become indistinguishable, resulting in a loss of node-specific information and reduced discriminative power [40].

A number of methods have been proposed to address these limitations and improve GNN expressiveness and applicability to a wider range of problems. For example, approximate methods and graph sampling methods have been introduced to address the scalability problem [?]. To address the limitation of expressiveness, higher-order graph neural network models have been proposed with provably higher expressiveness than 1-WL [41, 42]. To tackle the over-smoothing problem, various methods have been proposed, such as DropEdge, which randomly drops edges of the input graph during training, reducing the information flow between nodes and thus mitigating over-smoothing [43].
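To make the 1-WL refinement described in Section 2.4.4 concrete, the following is a minimal sketch of the iterative relabeling on a single graph (plain Python; the function name is ours, and compressing each (label, neighbor-multiset) signature with hash() is just one convenient way to implement the relabeling step).

```python
def wl_refine(adj, labels, num_iters=3):
    """1-WL color refinement on a single graph.

    adj    : dict mapping each node to the list of its neighbors
    labels : dict mapping each node to an initial (hashable) label
    Two graphs whose sorted label multisets differ after refinement are certainly
    non-isomorphic; identical multisets do not prove isomorphism.
    """
    for _ in range(num_iters):
        # New label: current label together with the multiset of neighbor labels,
        # compressed with hash() so labels stay small (any consistent relabeling works).
        new = {v: hash((labels[v], tuple(sorted(labels[u] for u in adj[v]))))
               for v in adj}
        # Each new partition refines the previous one, so an unchanged number of
        # distinct labels means the coloring is already stable.
        if len(set(new.values())) == len(set(labels.values())):
            break
        labels = new
    return labels

# A triangle and a 3-node path are separated after one round of refinement:
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1]}
init = {v: 0 for v in range(3)}
same = sorted(wl_refine(triangle, init).values()) == sorted(wl_refine(path, init).values())
print(same)  # False: the label multisets differ, so the two graphs are not isomorphic
```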
2.5 Structural Brain Networks

2.5.1 Overview

Diffusion Magnetic Resonance Imaging (MRI) exploits the anisotropic diffusion of water molecules in the brain to enable the estimation of the brain's anatomical fiber tracts at a relatively high resolution. In particular, tractographic methods can be used to generate a whole-brain anatomical connectivity matrix where each element provides an estimate of the connectivity strength between the corresponding voxels. Structural brain networks are built using this connectivity information and a predefined brain parcellation, where the nodes of the network represent the brain regions and the edge weights capture the connectivity strengths between the corresponding brain regions.

Table 2.1: Demographics of NKI Sample

Age Group   4-9  10-19  20-29  30-39  40-49  50-59  60-69  70-85  Total
Female        4     18     19      8      5      6      9     10     79
Male          5     16     24      9     26      8      5      4     97
Total         9     34     43     17     31     14     14     14    176

2.5.2 Diffusion Tensor Imaging Processing

2.5.2.1 Data Collection

The diffusion MRI dataset used in this research is taken from the publicly available Nathan Kline Institute (NKI)/Rockland dataset (http://fcon_1000.projects.nitrc.org/indi/pro/nki.html), which consists of data for individuals whose ages range from 4 to 85. The diffusion MRI was performed using a SIEMENS MAGNETOM TrioTim syngo MR B15 system. The high-angular resolution diffusion imaging protocol was used to assess white matter integrity as measured by fractional anisotropy. Diffusion tensor data were collected using a single-shot, echo-planar, single refocusing spin-echo, T2-weighted sequence with a spatial resolution of 2.0×2.0×2.0 mm. The sequence parameters were: TE/TR = 91/10000 ms, FOV = 256 mm, axial slice orientation with 58 slices, 64 isotropically distributed diffusion-weighted directions, two diffusion weighting values (b = 0 and 1000 s/mm²), and six b = 0 images. These parameters were calculated using an optimization technique that maximizes the contrast-to-noise ratio for FA measurements. For each subject, the image data consist of 76 volumes of 3D images of dimensions 128×128×53, each voxel representing a 2.0mm×2.0mm×2.0mm brain volume.

Figure 2.1: Left: seed region, which is the JHU white matter atlas. Middle: target region, which is the AAL mask. Right: the result of probabilistic tractography, which models the distribution of neuron fiber bundles; the intensity of each voxel represents the number of streamlines passing through that voxel. All figures are imaged in the axial plane.

2.5.2.2 Nonlinear Registration

The diffusion images of all subjects are registered to Montreal Neurological Institute (MNI) standard space using the nonlinear registration package FNIRT in the FSL software [44]. The nonlinear registration process generates warping coefficients that balance the similarity between the diffusion image and the standard MNI152 image against the smoothness of the warping coefficients.

2.5.2.3 Probabilistic Tractographic Methods

After a set of standard preprocessing steps, probabilistic tractography is used to model crossing fiber distributions for each voxel through the BEDPOSTX package in FSL [45]. Probabilistic tractography is processed through the diffusion toolbox in FSL [46]. The standard JHU white matter atlas is specified as the seed region. The AAL brain atlas mask is specified as the target region, which covers the whole brain cortex. We generate 50 streamlines from every voxel in the seed region.
These streamlines are propagated following the crossing fiber distributions computed during the preprocessing step. A curvature threshold is enforced to eliminate unqualified streamlines. The distance correction option is set to correct for the fact that the connectivity distribution drops off as the travel distance increases. The tractography output is a structural connectivity network modeled as a weighted graph where each node is a voxel in the target region space and each edge weight corresponds to the relative connectivity strength, measured as the number of streamlines connecting the corresponding pair of voxels. Using this connectivity information and a set of predefined regions of interest (ROIs) or whole-brain parcellations, structural brain networks can be built and analyzed. Figure 2.1 shows the seed and target regions as well as the tractography result of a subject.

2.5.3 Brain Parcellation Scheme

The brain parcellation scheme takes as input the probabilistic tractography results represented as a connectivity matrix, which is a large sparse matrix with millions of nodes and billions of edges. A k-region brain parcellation is a segmentation that partitions the brain's grey matter into k spatially contiguous regions, such that the connectivity profiles of the voxels in each region are as similar as possible. The parcellations are expected to be consistent among members of a structurally homogeneous population sample. Notable brain parcellations include the following:

• Brodmann's Areas. Proposed by Korbinian Brodmann in 1909, Brodmann's areas divide the brain based on cytoarchitectonic properties, i.e., the distribution and arrangement of neurons in the cortex [47].

• Automated Anatomical Labeling (AAL). The AAL atlas is a widely used atlas introduced by Tzourio-Mazoyer et al. in 2002. It divides the brain based on anatomical landmarks into 116 regions (90 cortical and subcortical, and 26 cerebellar) [8]. It is commonly used for voxel-based morphometry and functional fMRI studies [48, 49].

• Harvard-Oxford Atlas. The Harvard-Oxford Atlas is a probabilistic atlas that provides a standard space for neuroimaging analyses. It is based on the segmentation of structural brain images from 48 subjects [50].

2.5.4 Construction of Structural Brain Networks

Given a brain parcellation and the connectivity information revealed by probabilistic tractography, the structural brain network is built where the nodes represent the regions of the parcellation and the edges reflect the connectivity strength between the corresponding regions. The edge weights are defined as follows:

\[ W(R_i, R_j) = \sum_{v_a \in R_i,\, v_b \in R_j} W(v_a, v_b) \tag{2.5} \]

where W(v_a, v_b) represents the number of streamlines connecting the two voxels as generated by the tractographic results.

2.5.5 Graph Theoretical Analysis

Graph theoretical analysis is applied to the structural brain networks to extract graph patterns [51]. Many global and local graph-theoretic measurements have been introduced to characterize structural brain networks [52, 53, 54]. The following are some of the most commonly used graph measurements:

• Characteristic path length (CPL). The characteristic path length tries to capture network integration and is computed as the average of the shortest path lengths between all pairs of vertices:

\[ \mathrm{CPL} = \frac{1}{N(N-1)} \sum_{i \ne j} d(i, j) \tag{2.6} \]

where d(i, j) is the length of the shortest path between nodes i and j.

• Global efficiency (E_global).
The global efficiency of a graph is the average of the inverse shortest path lengths between all pairs of vertices, and hence tries to capture how well pairs of nodes are connected [55]:

E_global = \frac{1}{N(N-1)} \sum_{i \neq j} \frac{1}{d(i, j)}    (2.7)

• Clustering coefficient. The clustering coefficient tries to capture network segregation and is defined as the average of the local clustering coefficients over all nodes:

C = \frac{1}{N} \sum_i C_i    (2.8)

where the local clustering coefficient at node i is defined as

C_i = \frac{2 \Gamma_i}{\deg_i (\deg_i - 1)}    (2.9)

and \Gamma_i is the number of triangles around node i,

\Gamma_i = \frac{1}{2} \sum_{j,h} u_{i,j} u_{i,h} u_{j,h}    (2.10)

Note that \deg_i > 0 in our case, since isolated nodes are removed from the network.

• Sparsity ratio. The sparsity ratio is defined as the ratio of the number of edges to the number of possible edges between nodes, that is,

Sparsity = \frac{\sum_{i,j} u_{i,j}}{k(k-1)}    (2.11)

A more detailed description of graph-theoretic measures commonly used to analyze structural brain networks can be found in [53]. These traditional graph-theoretical measurements summarize the complex structure of structural brain networks [56]. However, as the graph size increases, such measurements cannot fully capture the complex network patterns.

Chapter 3: Graph Coarsening with Preserved Spectral Properties

3.1 Introduction

Graphs are widely used to represent object relationships in real-world applications. As many applications involve large-scale graphs with complex structures, it is generally hard to explore and analyze the key properties directly on the large graphs. Hence, graph coarsening techniques have been commonly used to facilitate this process [57, 58]. Generally speaking, the aim of any graph reduction scheme is to reduce the number of nodes and edges of a graph while ensuring that the "essential properties" of the original graph are preserved.

The question of which properties should be preserved remains open, but there is significant evidence that they should relate to the spectrum of a graph operator, such as the adjacency or normalized Laplacian matrices [59, 60]. A long list of theorems in spectral graph theory shows that the combinatorial properties of a graph are aptly captured by its spectrum. As such, graphs with similar spectra are generally regarded as sharing similar global and local structure [61, 62]. Based on this realization, modern graph sparsification techniques [63, 64, 65] have moved on from previously considered objectives, such as cut and shortest-path distance preservation, and now aim to find sparse spectrally similar graphs.

In contrast to graph sparsification, there has been little progress towards attaining spectrum preservation guarantees in coarsening. The foremost roadblock seems to lie in defining what spectral similarity should entail for graphs of different sizes. The original and coarse graphs have different numbers of eigenvalues and eigenvectors, which prohibits a direct comparison. To circumvent this issue, recent works have considered restricting the guarantees to a subset of the spectrum [9, 66]. However, focusing only on a subset of eigenvalues and eigenvectors also means that important information about the graph spectrum is ignored.

In this chapter, we start by reconsidering the fundamental spectral distance metric [64, 67, 68, 69], which compares two graphs by means of a norm of their eigenvalue differences. This metric seems inappropriate at first, as it requires the two graphs to have the same number of eigenvalues.
However, we find that, in the context of coarsening, this difficulty can be circumvented by substituting the coarse graph with its lifted counterpart: the latter contains the same information as the former while also having the correct number of eigenvalues. Our analysis shows that the proposed distance naturally captures the changes a graph undergoes during the coarsening process. In particular, when the coarsening merges nodes that have similar connections to the rest of the graph, the spectral distance is provably small. By merging similarly connected nodes, the nodes and edges of the coarse graph are able to represent the connectivity patterns of the original graph, thus preserving its structural and connectivity information.

Our contributions in this chapter are summarized as follows:

• We show how the spectral distance [64, 67, 68, 69], though originally restricted to graphs of the same size, can be utilized to measure how similar a graph is to its coarsened counterpart.

• We examine how the new spectral distance captures the structural changes occurring during the graph coarsening process.

• We present two coarsening algorithms that provably minimize the spectral distance.

• We experimentally show that the proposed methods outperform other graph coarsening algorithms on two graph-related tasks.

All the proofs can be found in the Appendix.

3.2 Related Work

Recent works have proposed to coarsen graphs by preserving the spectral properties of the matrix representations of graphs [9, 10, 60, 66, 70]. For example, Loukas et al. proposed to preserve the action of the graph Laplacian with respect to an (eigen-)space of fixed dimension, arguing that this suffices to capture the global properties of graphs relevant to partitioning and spectral clustering [9]. Durfee et al. proposed to preserve the all-pairs effective resistance [10]. Garg and Jaakkola defined a cost based on the theory of optimal transport [11]. Saket et al. suggested a Minimum Description Length (MDL) principle relevant to unweighted graphs [12]. Most of these distance functions are specific to particular applications; the question of how to define an application-independent graph coarsening framework remains a challenge.

There is a sizable literature dealing with the characterization of graphs in terms of their spectral properties [64, 71, 72]. Previous work defined distance functions based on Laplacian eigenvalues that measure differences between graphs [64, 67]. Spielman and Teng introduced a notion of spectral similarity for two graphs in their graph sparsification framework [59, 65]. Recently, Tsitsulin et al. proposed an efficient graph feature extractor, based on the Laplacian spectrum, for comparing large graphs [71]. Dong [72] uses spectral densities to visualize and estimate meaningful information about graph structures. Nevertheless, despite the popularity of spectral methods, the graph spectrum remains little explored in the context of graph coarsening.

3.3 Preliminaries

Let G = (V, E, W) be a graph, with V a set of N = |V| nodes, E a set of M = |E| edges, and W ∈ R^{N x N} the weighted adjacency matrix. For a node v_i, we denote by w(i) ∈ R^N the vector of weights of the edges incident on v_i, and by d(i) = \sum_{j=1}^{N} W(i, j) the degree of v_i. The graphs considered in this work are weighted, undirected, and have no isolated nodes (i.e., d(i) > 0 for all v_i).
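To make this notation concrete, the following minimal NumPy sketch builds a small weighted adjacency matrix and computes the quantities d(i) and w(i) used throughout the chapter; the toy graph and variable names are illustrative and are not taken from the thesis code.

```python
import numpy as np

# Toy weighted, undirected graph on N = 4 nodes (illustrative only).
W = np.array([[0., 1., 2., 0.],
              [1., 0., 1., 0.],
              [2., 1., 0., 3.],
              [0., 0., 3., 0.]])

d = W.sum(axis=1)          # degrees d(i) = sum_j W(i, j)
assert np.all(d > 0)       # no isolated nodes, as assumed in the text

w_0 = W[0]                 # w(0): weights of the edges incident on v_0
N = W.shape[0]             # number of nodes |V|
M = int(np.count_nonzero(np.triu(W, 1)))   # number of edges |E|
print(N, M, d, w_0 / d[0]) # normalized weights w(i)/d(i) reappear in Section 3.4.3
```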
The combinatorial and normalized Laplacians of G are defined as

L = D - W    and    \mathcal{L} = I_N - D^{-1/2} W D^{-1/2},    (3.1)

respectively, where I_N is the N x N identity matrix and D is the diagonal degree matrix with D(i, i) = d(i).

3.3.1 Graph Coarsening

The coarse graph G_c = (V_c, E_c, W_c) with n = |V_c| nodes is obtained from the original graph G by first selecting a set of non-overlapping partitions P = {S_1, S_2, ..., S_n} that covers all the nodes in V. Each partition S_p corresponds to a "super-node" denoted by s_p, and the "super-edge" connecting two super-nodes has weight W_c(p, q) equal to the accumulated edge weight between nodes in the corresponding partitions S_p and S_q:

W_c(p, q) = w(S_p, S_q) := \sum_{v_i \in S_p, v_j \in S_q} W(i, j)    (3.2)

Figure 3.1: Left: an example illustrating the graph coarsening process. The original graph is a random graph sampled from a stochastic block model with 50 nodes and 10 predefined blocks; the coarse graph is obtained from the predefined partitions. Right: eigenvalues and eigenvectors (1st, 3rd, and 20th) of the normalized Laplacian matrices of the original, coarse, and lifted graphs. The eigenvalues of the coarse graph align with the eigenvalues of the original graph, and the eigenvectors indicate the block membership information.

Let P ∈ R^{n x N} be the matrix whose columns are partition indicator vectors:

P(p, i) = 1 if v_i ∈ S_p, and 0 otherwise.

It is then well known that the weight matrix W_c of the coarse graph G_c satisfies W_c = P W P^T. The definition of the coarse Laplacian matrices follows directly: L_c = D_c - W_c and \mathcal{L}_c = I_n - D_c^{-1/2} W_c D_c^{-1/2}. Similarly to the adjacency matrix, the combinatorial Laplacian of the coarse graph can be obtained by the formula L_c = P L P^T. The same, however, does not hold for the normalized Laplacian, as in general P \mathcal{L} P^T ≠ \mathcal{L}_c.

3.3.2 Graph Lifting

We define G_l = (V, E_l, W_l) to be the graph lifted from the coarse graph G_c with respect to the set of non-overlapping partitions P. In graph lifting, each node s_p of the coarse graph is lifted to |S_p| nodes, and nodes in the lifted graph are connected by edges whose weight is equal to the coarse edge weight normalized by the sizes of the partitions. Specifically, for any v_i ∈ S_p and v_j ∈ S_q we have:

W_l(i, j) = \frac{w(S_p, S_q)}{|S_p| |S_q|} = \frac{\sum_{v_{i'} \in S_p, v_{j'} \in S_q} W(i', j')}{|S_p| |S_q|} = \frac{W_c(p, q)}{|S_p| |S_q|}.    (3.3)

When S_p = S_q = S, the weight W_l(i, j) equals the total weight of the edges in the subgraph induced by S, normalized by |S|^2. It easily follows that if W(i, j) is the same for every v_i, v_j ∈ S, then also W_l(i, j) = W(i, j); i.e., in-partition weights are exactly preserved by successive coarsening and lifting in this case.

The above combinatorial definition can be expressed in algebraic form in terms of the pseudo-inverse P^+ of P (i.e., P P^+ = I), whose elements are:

P^+(j, p) = \frac{1}{|S_p|} if v_j ∈ S_p, and 0 otherwise.

With this in place, the adjacency matrices of the lifted and coarse graphs are connected by the following relations:

W_l = P^+ W_c P^{\mp}    and    W_c = P W_l P^T,

where P^{\mp} = (P^+)^T denotes the transposed pseudo-inverse. The following equation reveals that lifting preserves the connectivity up to a projection onto the partitions:

W_l = P^+ W_c P^{\mp} = P^+ P W P^T P^{\mp} = \Pi W \Pi^T = \Pi W \Pi,    (3.4)

where \Pi = P^+ P is a projection matrix, with \Pi \Pi = P^+ P P^+ P = P^+ P = \Pi.
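The coarsening and lifting relations above are easy to verify numerically. The following hedged NumPy sketch builds P, P^+, and \Pi for a toy partition and checks W_c = P W P^T and relation (3.4); the graph, partition, and helper names are illustrative only.

```python
import numpy as np

def coarsening_matrices(partitions, N):
    """Build P, its pseudo-inverse P+, and the projection Pi = P+ P
    for a list of node-index partitions (illustrative helper)."""
    n = len(partitions)
    P = np.zeros((n, N))
    P_plus = np.zeros((N, n))
    for p, S in enumerate(partitions):
        P[p, S] = 1.0
        P_plus[S, p] = 1.0 / len(S)
    return P, P_plus, P_plus @ P

# Toy graph and partition (illustrative only).
W = np.array([[0., 2., 1., 0.],
              [2., 0., 1., 0.],
              [1., 1., 0., 3.],
              [0., 0., 3., 0.]])
partitions = [[0, 1], [2], [3]]

P, P_plus, Pi = coarsening_matrices(partitions, W.shape[0])
W_c = P @ W @ P.T                        # coarse adjacency, eq. (3.2)
W_l = P_plus @ W_c @ P_plus.T            # lifted adjacency, eq. (3.3)
assert np.allclose(W_l, Pi @ W @ Pi)     # relation (3.4)
assert np.allclose(W_c, P @ W_l @ P.T)   # coarsening the lifted graph recovers W_c
```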
The lifted Laplacian matrices are given by

L_l = P^+ L_c P^{\mp}    and    \mathcal{L}_l = C^T \mathcal{L}_c C,    (3.5)

where C ∈ R^{n x N} is the normalized coarsening matrix whose entries are given by

C(p, i) = \frac{1}{\sqrt{|S_p|}} if v_i ∈ S_p, and 0 otherwise,

such that C^T = C^+ and C^T C = P^+ P = \Pi. In this manner, we have

L_c = P L_l P^T    and    \mathcal{L}_c = C \mathcal{L}_l C^T.    (3.6)

For a more in-depth discussion of the mathematics of graph coarsening and graph lifting, we refer the interested reader to [9].

3.4 Spectral Distance

We start by briefly reviewing some basic facts about the spectrum associated with the Laplacian matrix of a coarse graph. We then demonstrate how to exploit these properties in order to render the classical spectral distance metric amenable to (coarse) graphs of different sizes.

3.4.1 Properties of the Coarse Laplacian Spectrum

Denote the eigenvalues and eigenvectors of the normalized Laplacian matrices by λ and u, respectively, with \mathcal{L} = U Λ U^T, where the i-th column of U corresponds to u_i and Λ = diag(λ). The eigenvalues are ordered in non-decreasing order.

Property 3.4.1 (Interlacing; Section 5.3 in [73]). The normalized Laplacian eigenvalues of the original and coarse graphs satisfy

λ(i) ≤ λ_c(i) ≤ λ(i + N - n)    for all i = 1, ..., n.

Property 3.4.1 is a general interlacing inequality that captures the pairwise differences between the eigenvalues of the original and coarse graph Laplacians [3, 74]. Since it holds for any graph and any coarsening, the inequality will, in some cases, be loose.

Property 3.4.2 (Eigenvalue Preservation). The normalized Laplacian eigenvalues of the lifted graph contain all eigenvalues of the coarse graph, together with the additional eigenvalue 1 with multiplicity N - n.

Property 3.4.3 (Eigenvector Preservation). The eigenvectors of the coarse graph lifted by C^T, i.e., u_l = C^T u_c, are eigenvectors of \mathcal{L}_l.

Proof. We start by noticing that the projection matrix \Pi acts as an identity matrix with respect to the lifted normalized Laplacian, \mathcal{L}_l = \Pi \mathcal{L}_l \Pi, since \mathcal{L}_l = C^T \mathcal{L}_c C = C^T C \mathcal{L}_l C^T C = \Pi \mathcal{L}_l \Pi. Now, consider the following chain of eigenvalue equations:

\mathcal{L}_c u_c = λ_c u_c
C \mathcal{L}_l C^T u_c = λ_c u_c
C^T C \mathcal{L}_l C^T u_c = λ_c C^T u_c
\Pi \mathcal{L}_l \Pi C^T u_c = λ_c C^T u_c
\mathcal{L}_l C^T u_c = λ_c C^T u_c

Note that in the fourth step we used the relation C^T = C^T C C^T = \Pi C^T, which holds due to the properties of the Moore-Penrose pseudo-inverse. Thus, the vectors C^T u_c are eigenvectors of \mathcal{L}_l with the corresponding eigenvalues of the coarse graph. To show that there are N - n additional eigenvalues equal to 1, observe that I_N - \mathcal{L}_l = D_l^{-1/2} W_l D_l^{-1/2} is a rank-n matrix, because, in the lifted graph, nodes within the same partition have exactly the same edge weights. Hence I_N - \mathcal{L}_l contains the eigenvalue 0 with multiplicity N - n, and correspondingly \mathcal{L}_l contains the eigenvalue 1 with multiplicity N - n.

Properties 3.4.2 and 3.4.3 state that the action of lifting preserves most spectral properties of the coarse graph. Thus, we may use the lifted graph as a proxy to define the distance function [75]. Figure 3.1 shows an example illustrating the graph coarsening process as well as its effect on the graph spectrum.

3.4.2 Spectral Distance

In the following, we propose two notions of spectral distance to quantify the difference between original and coarse graphs. We first use the lifted graph as the "proxy" of the coarse graph and define the full spectral distance:

Definition 3.4.4. The full spectral distance between the graphs G and G_c is defined as

SD_full(G, G_c) = \| λ - λ_l \|_1 = \sum_{i=1}^{N} | λ(i) - λ_l(i) |,

where the vectors λ and λ_l contain the eigenvalues of the original and lifted graphs.
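As a quick numerical sanity check of Property 3.4.2, the following self-contained NumPy sketch coarsens a toy graph, lifts it back, and verifies that the lifted normalized Laplacian spectrum equals the coarse spectrum plus the eigenvalue 1 repeated N - n times; the toy graph, partition, and helper names are illustrative only.

```python
import numpy as np

def norm_laplacian(W):
    """Normalized Laplacian I - D^{-1/2} W D^{-1/2} of a weighted graph."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.eye(len(d)) - D_inv_sqrt @ W @ D_inv_sqrt

# Toy graph on N = 4 nodes and a partition merging nodes 0 and 1 (illustrative).
W = np.array([[0., 2., 1., 1.],
              [2., 0., 1., 1.],
              [1., 1., 0., 3.],
              [1., 1., 3., 0.]])
partitions = [[0, 1], [2], [3]]
N, n = W.shape[0], len(partitions)

P = np.zeros((n, N)); P_plus = np.zeros((N, n))
for p, S in enumerate(partitions):
    P[p, S] = 1.0
    P_plus[S, p] = 1.0 / len(S)

W_c = P @ W @ P.T                 # coarse adjacency, eq. (3.2)
W_l = P_plus @ W_c @ P_plus.T     # lifted adjacency, eq. (3.3)

lam_c = np.linalg.eigvalsh(norm_laplacian(W_c))   # coarse spectrum
lam_l = np.linalg.eigvalsh(norm_laplacian(W_l))   # lifted spectrum

# Property 3.4.2: lifted spectrum = coarse spectrum plus (N - n) eigenvalues 1.
expected = np.sort(np.concatenate([lam_c, np.ones(N - n)]))
assert np.allclose(np.sort(lam_l), expected)
```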
As the original and lifted graphs have the same number of nodes, we may directly use a norm to measure the pairwise differences between their eigenvalues. On the flip side, the definition requires computing all eigenvalues of the original graph regardless of the coarse graph size, which is computationally expensive, especially for large graphs. This limitation motivates us to define a partial spectral distance that retains only a subset of the terms in the full spectral distance.

Let k_1 and k_2 be defined as k_1 = argmax_i { i : λ_c(i) < 1 } and k_2 = N - n + k_1. We expand the full spectral distance into three terms as follows:

SD_full(G, G_c) = \sum_{i=1}^{N} | λ(i) - λ_l(i) |
= \sum_{i=1}^{k_1} | λ(i) - λ_l(i) | + \sum_{i=k_1+1}^{k_2} | λ(i) - λ_l(i) | + \sum_{i=k_2+1}^{N} | λ(i) - λ_l(i) |    (3.7)
= \sum_{i=1}^{k_1} | λ(i) - λ_c(i) | + \sum_{i=k_1+1}^{k_2} | λ(i) - 1 | + \sum_{i=k_2+1}^{N} | λ(i) - λ_c(i - N + n) |    (3.8)

The last equality follows from Property 3.4.2, which states that λ_l contains the eigenvalues of the coarse graph as well as the eigenvalue 1 with multiplicity N - n. The eigenvalues λ_l satisfy:

λ_l(i) = λ_c(i) for i ≤ k_1,    λ_l(i) = 1 for k_1 + 1 ≤ i ≤ k_2,    λ_l(i) = λ_c(i - N + n) for i > k_2.

With this in place, we define the partial spectral distance to be equal to the full spectral distance minus the N - n terms for which λ_l = 1:

Definition 3.4.5. The partial spectral distance between the graphs G and G_c is defined as

SD_part(G, G_c) = \sum_{i=1}^{k} | λ(i) - λ_c(i) | + \sum_{i=k+1}^{n} | λ_c(i) - λ(i + N - n) |,

where k = argmax_i { i : λ_c(i) < 1 }.

For the partial spectral distance, we only need to compute n rather than N eigenvalues of the normalized Laplacian of the original graph, which significantly reduces the computational cost when n ≪ N. The full and partial spectral distances are related by

SD_full(G, G_c) = SD_part(G, G_c) + \sum_{i=k_1+1}^{k_2} | λ(i) - 1 |.

The excluded terms \sum_{i=k_1+1}^{k_2} | λ(i) - 1 | measure how close the corresponding original Laplacian eigenvalues are to 1. The two definitions are equivalent when the normalized Laplacian \mathcal{L} of the original graph contains the eigenvalue 1 with multiplicity N - n. This condition is equivalent to asserting that the adjacency matrix W is singular, with the eigenvalue 0 having algebraic multiplicity N - n [76, 77]. We have observed empirically that, when coarsening merges nodes that have similar connections, the adjacency matrix has eigenvalues close to 0. In such situations, the terms of the full spectral distance that are excluded by the partial spectral distance are almost 0, and the partial distance closely approximates the full one.

Note that both definitions of spectral distance are proper distance metrics over the space of graph Laplacian eigenvalues. However, the spectral distance is not able to distinguish graphs with the same sets of Laplacian eigenvalues (referred to as cospectral graphs [61]). Thus, there can exist multiple coarse graphs corresponding to the same spectral distance.

3.4.3 Relation to Graph Coarsening

To illustrate the connections between the spectral distance and graph coarsening, we first consider the ideal case in which the merged nodes within each partition have the same normalized edge weights:

Proposition 3.4.1. Let the graph G_c be obtained by coarsening G with respect to a set of partitions P = {S_1, S_2, ..., S_n}. If P is selected such that every node in a partition has the same normalized edge weights,

\frac{w(i)}{d(i)} = \frac{w(j)}{d(j)}    for all v_i, v_j ∈ S and S ∈ P,    (3.9)

then SD_full(G, G_c) = 0 and SD_part(G, G_c) = 0.

Therefore, the ideal graph coarsening attains a minimal (full and partial) spectral distance.
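As an illustration of Definitions 3.4.4 and 3.4.5, the following NumPy sketch implements both distances directly from sorted eigenvalue vectors; the function name and the demo eigenvalues are made up for illustration, and in practice the inputs would be eigenvalues computed as in the earlier sketches.

```python
import numpy as np

def spectral_distances(lam, lam_c):
    """Full and partial spectral distances (Definitions 3.4.4 and 3.4.5).

    lam   : sorted eigenvalues of the original normalized Laplacian (length N)
    lam_c : sorted eigenvalues of the coarse normalized Laplacian   (length n)
    """
    N, n = len(lam), len(lam_c)
    k = int(np.sum(lam_c < 1.0))          # k = argmax{i : lambda_c(i) < 1}
    # Partial distance: head terms plus tail terms (Definition 3.4.5).
    sd_part = (np.abs(lam[:k] - lam_c[:k]).sum()
               + np.abs(lam_c[k:] - lam[k + N - n:]).sum())
    # Full distance adds the middle terms |lambda(i) - 1| (eqs. (3.7)-(3.8)).
    sd_full = sd_part + np.abs(lam[k:k + N - n] - 1.0).sum()
    return sd_full, sd_part

# Tiny demo with made-up eigenvalue vectors (N = 6, n = 3), illustrative only.
lam   = np.array([0.0, 0.30, 0.80, 1.00, 1.40, 1.70])
lam_c = np.array([0.0, 0.35, 1.45])
print(spectral_distances(lam, lam_c))     # approximately (0.9, 0.3)
```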
We next provide a more general result on how the spectral distance captures the structural changes in the graph coarsening framework. Consider the basic coarsening in which the coarse graph is formed by merging one pair of nodes (i.e., n = N - 1). In this setting, we prove the following:

Proposition 3.4.2. Suppose the graph G_c is obtained from G by merging a pair of nodes v_a and v_b. If the normalized edge weights of the merged nodes satisfy

\left\| \frac{w(a)}{d(a)} - \frac{w(b)}{d(b)} \right\|_1 ≤ ε,

then the spectral distance between the original and coarse graphs is bounded by

SD_full(G, G_c) ≤ N ε    and    SD_part(G, G_c) ≤ n ε.

The above proposition states that the spectral distance is bounded by the discrepancy between the normalized edge weights of the merged nodes. The bound implies that minimizing the discrepancy of normalized edge weights within partitions results in a bounded spectral perturbation.

3.5 Algorithms

We propose two graph coarsening algorithms that produce coarse graphs with small spectral distance. The first follows from Proposition 3.4.2 in that the coarse graphs are formed by iteratively merging graph nodes with similar normalized edge weights. The second algorithm is inspired by spectral clustering: we apply k-means clustering to selected normalized Laplacian eigenvectors to find the graph partitions and the corresponding coarse graphs. Though different, both algorithms are shown to generate coarse graphs with bounded spectral distance.

3.5.1 Multilevel Graph Coarsening

The Multilevel Graph Coarsening (MGC) algorithm iteratively merges pairs of nodes that share similar connections. During each iteration, MGC searches for the pair of nodes with the most similar normalized edge weights and merges them into a super-node. To reduce the computational cost, we constrain the candidate pairs of graph nodes to be within 2-hop distance. We denote by N_i the set of nodes that are within 2-hop distance from node v_i. The pseudo-code of MGC is presented in Algorithm 1.

Algorithm 1 Multilevel Graph Coarsening (MGC)
1: Input: graph G = (V, E, W) and target size n of the coarse graph.
2: s ← N
3: while s > n do
4:   for v_i ∈ V_s do
5:     for v_j ∈ N_i do
6:       d_s(i, j) = \| w(i)/d(i) - w(j)/d(j) \|_1
7:     end for
8:   end for
9:   (i_min, j_min) ← argmin_{i,j} d_s(i, j)
10:  s ← s - 1
11:  Merge nodes v_{i_min} and v_{j_min} to form the coarse graph G_s.
12: end while
13: return G_n = (V_n, E_n, W_n)

Analysis. The following corollary bounds the spectral distance achieved by the MGC algorithm:

Corollary 3.5.1. Suppose the graph G_c is coarsened from G by iteratively merging pairs of nodes v_{a_s} and v_{b_s} for s = N, N-1, ..., n+1. If the normalized edge weights of the merged nodes satisfy

\left\| \frac{w(a_s)}{d(a_s)} - \frac{w(b_s)}{d(b_s)} \right\|_1 ≤ ε_s,

then the spectral distance between the original and coarse graphs is bounded by

SD_full(G, G_c) ≤ N \sum_{s=n+1}^{N} ε_s    and    SD_part(G, G_c) ≤ n \sum_{s=n+1}^{N} ε_s.

The bound is a direct corollary of Proposition 3.4.2.

Time complexity. The time complexity of MGC is O(M(N + n)(N - n)), which is derived as follows. In each iteration, the computational cost of the 1-norm in line 6 is O(s). The time complexity of the while loop in line 3 is therefore O(\sum_{s=n}^{N} s \cdot M) = O(M \frac{n+N}{2} (N - n)) = O(M(N + n)(N - n)). When n ≈ N, the complexity reduces to O(MN); on the other hand, for n ≪ N the complexity becomes O(MN^2).

3.5.2 Spectral Graph Coarsening

The Spectral Graph Coarsening (SGC) algorithm identifies the coarsening partitions by attempting to minimize the k-means cost of the rows of Laplacian eigenvectors.
Different from traditional spectral clustering, we select the eigenvectors corresponding to the head and tail eigenvalues, mirroring the definition of the partial spectral distance in Definition 3.4.5. The procedure is described in Algorithm 2. Notice that, since k_1 is unknown at the start, the SGC algorithm iterates over the possible combinations of eigenvectors and selects the coarsening with the minimum k-means cost.

Algorithm 2 Spectral Graph Coarsening (SGC)
1: Input: graph G = (V, E, W), eigenvectors U of the normalized Laplacian \mathcal{L}, target size n.
2: if λ(N) ≤ 1 then
3:   k_1 ← n    ▷ Spectral Clustering
4: else
5:   k_1 ← argmin_k { k : λ(k) ≤ 1, k ≤ n, λ(N - n + k + 1) > 1 }    ▷ Iterative Spectral Coarsening
6: end if
7: k_2 ← N - n + k_1
8: while k_1 ≤ n do
9:   U_{k_1} ← [U(1 : k_1); U(k_2 + 1 : N)]
10:  Apply the k-means clustering algorithm to the rows of U_{k_1} to obtain graph partitions P*_{k_1} that optimize the k-means cost
     F(U_{k_1}, P*_{k_1}) = \sum_{i=1}^{N} \| r(i) - \frac{1}{|S_i|} \sum_{j \in S_i} r(j) \|^2,
     where r(i) is the i-th row of U_{k_1} and S_i is the partition containing v_i.
11:  k_1 ← k_1 + 1, k_2 ← N - n + k_1
12: end while
13: return the coarse graph G_c generated with respect to the partitions with minimum k-means clustering cost, P* = argmin_{k_1} F(U_{k_1}, P*_{k_1})

Analysis. The following theorem relates the partial spectral distance to the k-means cost:

Theorem 3.5.2. Let the coarse graph G_c be obtained from Algorithm 2 with graph partition P*. Suppose that the graph coarsening is consistent, i.e., \mathcal{L}_c = C \mathcal{L} C^T, and that the k-means cost satisfies F(U, P*) < 1. Then the partial spectral distance is bounded by

SD_part(G, G_c) ≤ (n + 2) F(U, P*) + \frac{4 \sqrt{F(U, P*)}}{1 - F(U, P*)}.

The theorem states that the spectral distance is bounded by the k-means clustering cost. Further, when the graph eigenvectors reveal well-separated clusters and the k-means cost is small, the spectral distance is correspondingly small. The main assumption posed is that \mathcal{L}_c = C \mathcal{L} C^T, which may not hold for some graphs. For situations where this assumption is not met, the claim can readily be reworked to hold for the combinatorial Laplacian, for which the relation L_c = P L P^T always holds.

Time complexity. Excluding the one-time partial sparse eigenvalue decomposition, which takes roughly O(R(Mn + Nn^2)) time using Lanczos iterations with R restarts on a graph with M edges (we need the n smallest and n largest eigenvalues and eigenvectors) [78], the time complexity of SGC is O(KTNn^2), where K ≤ n is the number of times the while loop is executed and O(TNn^2) is the complexity of the k-means clustering, with T bounding the number of k-means iterations.

Table 3.1: Classification accuracy on coarse graphs that are five times smaller.

Datasets | MUTAG | ENZYMES | NCI1  | NCI109 | PROTEINS | PTC
EM       | 78.90 | 18.92   | 62.81 | 61.35  | 63.72    | 48.56
LV       | 79.01 | 24.68   | 63.59 | 60.49  | 62.72    | 50.24
METIS    | 77.62 | 24.79   | 59.74 | 61.64  | 63.70    | 49.34
SC       | 80.37 | 24.40   | 63.14 | 62.57  | 64.08    | 50.16
SGC      | 80.34 | 29.19   | 63.94 | 63.69  | 64.70    | 52.76
MGC      | 81.53 | 30.89   | 66.07 | 63.55  | 65.26    | 52.28
Original | 86.58 | 37.32   | 66.39 | 64.93  | 66.60    | 53.72

3.6 Experiments

We proceed to empirically evaluate the proposed graph coarsening algorithms on tasks involving real-world and synthetic graphs. Our first experiment considers the classification of coarsened graphs, whereas the second examines how well one may recover the block structure of graphs sampled from the stochastic block model. We show that our graph coarsening algorithms, which optimize the spectral distance, yield minimal classification accuracy degradation and can recover the block structures with high accuracy.
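Before turning to the experimental results, the following hedged Python sketch illustrates the core step of Algorithm 2 for a fixed k_1: it forms the normalized Laplacian, stacks the head and tail eigenvectors, and clusters their rows with k-means. The toy graph, parameter choices, and helper names are illustrative; this is not the released implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def sgc_partition(W, n, k1):
    """Core step of Algorithm 2 for a fixed k1: k-means on the rows of the
    selected head/tail eigenvectors of the normalized Laplacian (sketch)."""
    N = W.shape[0]
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_norm = np.eye(N) - D_inv_sqrt @ W @ D_inv_sqrt
    lam, U = np.linalg.eigh(L_norm)            # ascending eigenvalues; lam could
                                               # guide the choice of k1 (lines 2-6)
    k2 = N - n + k1
    U_sel = np.hstack([U[:, :k1], U[:, k2:]])  # head and tail eigenvectors
    km = KMeans(n_clusters=n, n_init=10).fit(U_sel)
    partitions = [np.where(km.labels_ == p)[0] for p in range(n)]
    return partitions, km.inertia_             # inertia_ plays the role of F

# Illustrative usage on a small ring-plus-random graph; Algorithm 2 would loop
# over the admissible values of k1 and keep the partition with minimum cost.
N = 20
W = np.zeros((N, N))
for i in range(N):                             # ring guarantees no isolated nodes
    W[i, (i + 1) % N] = W[(i + 1) % N, i] = 1.0
rng = np.random.default_rng(0)
extra = np.triu((rng.random((N, N)) < 0.15).astype(float), 1)
W = np.maximum(W, extra + extra.T)
parts, cost = sgc_partition(W, n=10, k1=5)
```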
Code for both experiments is publicly available at https://github.com/yuj-umd/spectral-coarsening.

Baseline Algorithms. We compare our methods with the following graph coarsening and partitioning algorithms:

• Edge Matching (EM). The coarse graphs are formed by maximum-weight matching, with the weight of the pair (i, j) calculated as W(i, j) / max{d(i), d(j)} [79].

• Local Variation (LV). Local variation methods coarsen a graph in a manner that approximately preserves a subset of its spectrum [9]. Here, we used the neighborhood-based variant and aimed to preserve the first max(10, n) eigenvectors and eigenvalues. Alternative choices for the preserved eigenspace may yield different results.

• METIS. This is a standard graph partitioning algorithm based on multi-level partitioning schemes that are widely used in various domains, such as finite element methods and VLSI design [13].

• Spectral Clustering (SC). Spectral clustering is a widely used graph clustering algorithm that finds densely connected graph partitions determined from the eigenvectors of the graph Laplacian [19]. For a review of recent results on the fast approximation of SC, see [78].

Note that, to apply graph partitioning algorithms for coarsening purposes, we coarsen the graphs with respect to the resulting graph partitions following the standard coarsening process of equation (3.2).

3.6.1 Graph Classification with Coarse Graphs

Graph classification is a well-studied graph machine learning problem with a variety of applications in material design, drug discovery, and computational neuroscience [6, 71, 80, 81]. However, some graph classifiers do not scale to large graphs, such as those encountered in social network analysis and computational neuroscience [81, 82]. Graph coarsening can reduce the sizes of the graphs in a dataset, which accelerates the training and inference of graph classification models. However, if the coarsening is not done carefully, it can also result in a loss of useful information and, thus, of classification accuracy. In the following, we quantify the effect of different coarsening choices on graph classification. We utilize various graph coarsening methods to reduce the size of the graphs in the datasets before passing them to the graph classifier. We then evaluate the quality of graph coarsening based on the classification accuracy drop (as compared to the same classifier on the original graphs).

Table 3.2: Recovery accuracy (NMI) of block structures from random graphs in the stochastic block model.

p, q      | Type           | EM     | LV     | METIS  | SC     | MGC    | SGC
0.2, 0.01 | Assortative    | 0.1819 | 0.3076 | 0.7792 | 0.7845 | 0.3664 | 0.7845
          | Disassortative | 0.0956 | 0.1071 | 0.0815 | 0.0877 | 0.1093 | 0.0850
          | Mixed          | 0.1052 | 0.1944 | 0.2389 | 0.3335 | 0.6062 | 0.7107
0.5, 0.1  | Assortative    | 0.1015 | 0.1902 | 0.7820 | 0.7930 | 0.2868 | 0.7930
          | Disassortative | 0.0854 | 0.1068 | 0.0602 | 0.0788 | 0.1474 | 0.7901
          | Mixed          | 0.0848 | 0.2241 | 0.2883 | 0.4074 | 0.7343 | 0.7699
0.8, 0.3  | Assortative    | 0.0823 | 0.1139 | 0.5596 | 0.6532 | 0.1172 | 0.6532
          | Disassortative | 0.0836 | 0.0976 | 0.0776 | 0.1342 | 0.7784 | 0.7931
          | Mixed          | 0.0888 | 0.1503 | 0.2929 | 0.3909 | 0.5428 | 0.7209

Evaluation. We coarsen the graph samples until n = N/5, i.e., until their number of nodes is reduced by a factor of five. The classification performance is evaluated using 10-fold cross-validation, in accordance with previous works [5, 6, 71].

Datasets. We use six standard graph classification datasets [80, 83, 84]. Each dataset contains a set of variable-sized graphs stemming from a variety of applications.
The graph statistics can be found in Table 3.3.

The graph classifier. We use the Network Laplacian Spectral Descriptor (NetLSD) combined with a 1-NN classifier as the graph classification method [71]. NetLSD has been shown to be an efficient graph feature extractor that achieves state-of-the-art classification performance [71]. Note that NetLSD extracts graph features that depend only on the graph structure; it does not consider node and edge features.

Table 3.3: Statistics of the graph benchmark datasets.

Datasets    | MUTAG | ENZYMES | NCI1  | NCI109 | PROTEINS | PTC
Sample size | 188   | 600     | 4110  | 4127   | 1108     | 344
Average |V| | 17.93 | 32.63   | 29.87 | 29.68  | 39.06    | 14.29
Average |E| | 19.79 | 62.14   | 32.3  | 32.13  | 72.70    | 14.69
# classes   | 2     | 6       | 2     | 2      | 2        | 2

Results. Table 3.1 shows the graph classification performance on coarse graphs. In all cases, the proposed graph coarsening algorithms yield better classification accuracy than the alternative methods. Interestingly, for four out of the six datasets (NCI1, NCI109, PROTEINS, and PTC) there is almost no degradation in classification accuracy induced by coarsening, even though the graphs in the coarse dataset are five times smaller; this, we believe, is an encouraging result.

3.6.2 Block Recovery in the Stochastic Block Model

In this experiment, we test whether the coarsening algorithms can be used to recover the block structure of random graphs sampled from stochastic block models. The stochastic block model is a random graph model that is commonly used to evaluate graph partitioning and clustering algorithms [85, 86]. The model is parameterized by a probability matrix B ∈ [0, 1]^{n x n}, with graph nodes in blocks i and j being connected with probability B(i, j). Random graphs are generated from the stochastic block model by sampling the upper triangular entries W(i, j) in accordance with the edge probabilities; the lower triangular entries are then set as W(j, i) = W(i, j). We parameterize B with p and q as follows:

• Assortative. The diagonal entries of B are p and the off-diagonal entries are q.

• Disassortative. The diagonal entries of B are q and the off-diagonal entries are p.

• Mixed. The entries of B are randomly assigned p or q (each with probability 1/2).

Evaluation. We evaluate the performance of the graph coarsening algorithms by measuring the discrepancy between the recovered graph partitions and the ground-truth blocks. We use the Normalized Mutual Information (NMI) to measure the agreement between two graph partitions; the definition of NMI can be found in the supplementary material. For each stochastic block model setting, we set N = 200 and n = 10, with 20 nodes in each partition. We repeat each experiment 10 times and report the average NMI achieved by each method. We compare our graph coarsening algorithms with the graph coarsening and partitioning algorithms mentioned earlier.

Table 3.2 reports the average NMI in three different stochastic block model configurations. Our proposed methods outperform the other methods in almost all cases. In particular, our methods achieve high recovery accuracy in the disassortative and mixed settings, where traditional graph partitioning algorithms fail to recover the blocks accurately. The EM and LV coarsening algorithms are not optimized for block recovery and thus exhibit far worse performance on this task.

3.7 Conclusion

In this chapter, we propose a new framework for graph coarsening.
We leverage the spectral properties of normalized Laplacian matrices to define a new notion of graph distance that quantifies the differences between original and coarse graphs. We argue that the proposed spectral distance naturally captures the structural changes in the graph coarsening process, and we propose graph coarsening algorithms that guarantee that the coarse graphs exhibit a bounded spectral distance. Experiments show that our proposed methods outperform other graph coarsening algorithms on graph classification and block recovery tasks.

Chapter 4: Powerful Graph Neural Networks with Sequence Modeling

4.1 Introduction

Recently, graph neural network (GNN) models have emerged as a powerful tool for tackling challenging graph-related problems such as graph classification, link prediction, and node classification [4, 5, 7, 26, 27, 28]. Most GNN models follow the message-passing paradigm, in which node information is propagated along edges to form new node representations [4]. The final graph representation is obtained by applying a pooling function over all the node representations.

Various message-passing GNN (MPNN) models have been proposed to address graph-related problems [87, 88]. For example, Xu et al. proposed to use multi-set functions for node aggregation and graph pooling [6, 89]. Gilmer et al. proposed a general message-passing scheme utilizing Set2Set to obtain graph representations [4, 90]. Gao and Ji proposed a top-k graph pooling function that downsamples the graph nodes [91].

Despite their recent success, MPNN models suffer from a fundamental limitation on their expressiveness. As pointed out by previous studies, the expressive power of standard MPNN models does not exceed that of the 1-dimensional Weisfeiler-Lehman (WL) isomorphism test [6, 39]. Therefore, MPNN models cannot distinguish pairs of graphs that the 1-dimensional WL test cannot distinguish. For example, MPNN models cannot distinguish regular graphs with different structures. Note that k-regular graphs are graphs in which every node has degree k, yet such graphs can have different structures; one example is shown in Figure 4.3. More generally, Garg et al. recently showed that MPNN models cannot capture certain graph properties, which hampers their effectiveness in a wide range of applications [9, 92, 93, 94].

To tackle the above problem, previous studies proposed to augment the node representations with additional features in order to improve model expressiveness. For example, Sato et al. and Abboud et al. proposed to assign random node identifiers to improve expressiveness [2, 95]. However, MPNN models with random node identifiers usually do not maintain equivariance and invariance, which are key properties in GNN model design. Bouritsas et al. used subgraph counts as additional features [94]; such manual feature engineering requires domain-specific knowledge to achieve optimal performance for a given application [94].

Recent work introduced permutation-sensitive functions to formulate powerful permutation-invariant graph functions [30, 96, 97, 98]. For example, in the GraphSAGE model, Hamilton et al. used a Long Short-Term Memory (LSTM) aggregator on sampled sets of neighbors to extract expressive representations [30]. Murphy et al. proposed Relational Pooling to build powerful graph representations with permutation-sensitive functions.
However, previous work has limitations: the LSTM aggregator in the GraphSAGE model selects node permutations at random, which does not satisfy permutation invariance, and Relational Pooling requires expensive computation over all n! possible node permutations, which makes it difficult to generalize to applications with large-scale graphs.

In this work, we propose SeqGNN, a new graph neural network paradigm that improves the expressiveness of GNN models. Leveraging the rich expressiveness of sequence models, SeqGNN represents graphs as the composition of sequence representations. With appropriately designed sequence sampling and modeling techniques, SeqGNN models can form expressive graph representations while maintaining permutation invariance. In addition, SeqGNN can distinguish non-isomorphic graphs that cannot be distinguished by MPNN. We further show that SeqGNN models achieve competitive performance in real-world graph classification tasks.

Figure 4.1: Relationships between graphs, node permutations, and node sequences ((a) graph, (b) permutation, (c) sequence). A node permutation is an ordering of the graph nodes, and the node sequence additionally contains the edge information between consecutive nodes.

4.2 Related Work

Message-Passing GNN. Message-passing GNN models have been the primary paradigm of GNN models. Previous work proposed various convolutional and pooling functions that achieve excellent performance in a variety of domains [6, 28, 29, 30, 39, 99, 100]. For example, Xu et al. proposed to use multi-set functions for graph aggregation and pooling [6]. Gilmer et al. proposed a general message-passing scheme utilizing Set2Set to obtain graph representations [4, 90].

Expressive Power of GNN. Xu et al. and Morris et al. first pointed out that MPNN models cannot exceed the expressive power of the 1-WL isomorphism test. Abboud et al. and Sato et al. showed that associating graph nodes with extra node features, which can be as simple as random initializations, can improve the expressive power [2, 101]. Loukas et al. theoretically proved that, when the nodes are equipped with unique node identifiers, GNN models can approximate any Turing-computable function over connected attributed graphs [32]. Maron et al. proposed higher-order graph networks that are more powerful than MPNN but suffer from high computational complexity [41, 42].

Figure 4.2: The overall framework of the SeqGNN model. (1) The first step is to represent the graph as a set of node sequences. Each node sequence consists of a permutation of the graph nodes together with the associated edge information between consecutive nodes; the node sequences and their associated weights are determined by the sequence sampling method and the specific graph structure. (2) The second step is to learn the sequence representations, which are further combined to form the graph representation.

Sequence Modeling. Sequence modeling is a type of machine learning task that involves predicting or understanding a sequence of data points. Sequence data consist of data points where each element has a particular position in the sequence, and the arrangement of the elements carries significant information. Sequence modeling methods have been extensively studied in the fields of natural language processing and temporal data analysis [102, 103]. Recurrent Neural Networks (RNNs) and their variants can encode variable-sized sequence inputs into fixed-size representations [102, 103].
Recently, attention-based models such as transformers have been effectively applied to a wide range of domains [100, 104].

4.3 Preliminaries

4.3.1 Notations

Graph. A graph is represented as G = (V, W, H), with V the set of graph nodes, n = |V|, W ∈ R^{n x n} the adjacency matrix, and H ∈ R^{n x d} the node feature matrix. We denote by v_i ∈ V the node indexed by i and by N_s(v_i) the set of neighbors of v_i within s hops (we write N(v_i) = N_1(v_i)). We use h_i to denote the representation of node v_i and h_G to denote the graph-level representation.

Permutation. We denote a permutation of the integers from 1 to n as a list of node indices π = {v_1, v_2, ..., v_n}, and the corresponding permutation matrix as P_π. We use Π to denote the set of all possible permutations, with |Π| = n!, and π^{-1} to denote the inverse permutation of π. A graph permuted by π is denoted as G_π = (V_π, W_π, H_π), where the node indices, weight matrix, and feature matrix are permuted according to π as V_π, W_π = P_π W P_π^T, and H_π = P_π H.

Sequence. We denote a sequence as x = {x_1, x_2, ..., x_n}, where the elements are constructed from the graph G and a node permutation π. The element x_i contains the node information of v_i as well as other structural information, such as the edge information between v_i and v_{i+1}. We use X ∈ R^{n x D} to denote the sequence, where X_{i,:} contains the element at position i. The relationship among the graph, the permutation, and the sequence is illustrated in Figure 4.1. We use a function SeqDec : (G, π) → R^{n x D} to denote this relationship:

X_π = SeqDec(G, π)    (4.1)

Figure 4.3: Example. (a) Regular graphs that cannot be distinguished by MPNN and the 1-WL test. (b) With k layers of message passing, the node representations of MPNN models contain information from nodes that are up to k hops away; the information flow used to compute the node representations can be represented as rooted subtrees, where the root node absorbs the information propagated from the leaf nodes [1, 2]. Due to the anonymity of the graph nodes, the subtrees are indistinguishable for regular graphs. (c) SeqGNN formulates the graph as the composition of sequence representations with node permutations and edge information. Despite the anonymity of the graph nodes, the sequences can still distinguish the two graphs because of differences in their edge patterns.

4.3.2 Graph and Sequence Models

Graph Neural Network. GNN models compute graph representations from the node features H and the graph structure W, denoted as GNN : (R^{n x d}, R^{n x n}) → R^d. In particular, MPNN, one of the most popular classes of GNN models, iteratively updates the node representations by aggregating information from neighboring nodes, and the graph representation is obtained by applying a pooling function to the resulting node representations [4, 6, 9],

MPNN(H, W) = Pool(h_v, v ∈ V)    (4.2)

where h_u^i is the representation of node u at it