ABSTRACT

Title of Dissertation: UNDERSTANDING AND ENHANCING MACHINE LEARNING MODELS WITH THEORETICAL FOUNDATIONS
Zhengmian Hu, Doctor of Philosophy, 2024
Dissertation Directed by: Professor Heng Huang, Department of Computer Science

Machine learning has become a key driver of many contemporary technological advancements. With its empirical success comes an urgent need for theoretical research that explains and complements these practical achievements. This includes understanding the empirical success of machine learning, especially deep learning, and aiding the design of better algorithms in terms of performance, efficiency, and security. This dissertation aims to advance the understanding and practical development of machine learning through three interrelated research directions, while emphasizing reliable theoretical guarantees throughout.

In the first part, we study deep learning theory under overparameterization. The core objects of study are the Conjugate Kernel and the Neural Tangent Kernel, which have deep connections to the training dynamics of deep learning. Based on the analysis of these kernels, we prove several new concentration results characterizing the trainability and generalization of infinitely wide neural networks.

In the second part, we focus on training algorithms. On one hand, we propose new algorithms to improve learning efficiency, including a new underdamped Langevin MCMC method called ALUM, for which we prove that its complexity reaches the theoretical lower bound. On the other hand, we propose new theoretical tools to analyze existing algorithms and obtain tighter convergence results. For ProxSkip, our analysis shows that it still achieves an improvement in communication complexity, from sublinear to linear convergence, under a stochastic oracle. We also generalize the concept of Lipschitz smoothness for tighter non-convex optimization analysis.

In the third part, we develop new Monte Carlo methods for large language models (LLMs) to improve their efficiency and security. We develop unbiased watermarking techniques to protect model outputs and propose an Accelerated Speculative Sampling method for faster inference. We also investigate the trade-off between watermark strength and inference sampling efficiency, pointing out the conflict between the two.

UNDERSTANDING AND ENHANCING MACHINE LEARNING MODELS WITH THEORETICAL FOUNDATIONS

by Zhengmian Hu

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2024

Advisory Committee:
Professor Heng Huang, Chair/Advisor
Professor Furong Huang
Professor Tianyi Zhou
Professor Ang Li
Professor Min Wu, Dean's Representative

© Copyright by Zhengmian Hu 2024

To my parents.

Acknowledgments

First and foremost, I would like to express my sincere gratitude to my PhD advisor, Dr. Heng Huang, for his invaluable encouragement, guidance, and unwavering support throughout my PhD journey. His belief in my potential and his spirit of hard work have been a constant source of inspiration throughout my academic journey. I appreciate his insightful advice, patience, and the freedom and resources he provided, which were instrumental in shaping my research. I would also like to thank my committee members, Dr. Furong Huang, Dr. Tianyi Zhou, Dr. Ang Li, and Dr. Min Wu, for their gracious commitment to serve on my PhD dissertation committee. Moreover, I would also like to thank Dr.
Furong Huang, Dr. Tianyi Zhou, and Dr. Mohammad Hajiaghayi for serving on my preliminary exam committee. I am truly grateful for the time and effort they have dedicated to my academic growth. Thanks to all members in Huang Lab, including Dr. Feihu Huang, Dr. Bin Gu, Dr. Hongchang Gao, Dr. Yanfu Zhang, Dr. Lei Luo, Dr. Ziyi Chen, Dr. Kamran Ghasedi, Dr. Zhouyuan Huo, Dr. An Xu, Dr. Runxue Bao, Dr. Shangqian Gao, Dr. Chao Li, Dr. Peiran Yu, Dr. Zhenyi Wang, Dr. Junfeng Guo, Guodong Liu, Wenhan Xian, Xiaoqian Dou, Jiaping Zhu, Alireza Ganjdanesh, Junyi Li, Yihan Wu, Xidong Wu, Lichang Chen, Yanshuo Chen, Ruibo Chen, Chenxi Liu, Qi He, Tianyi Xiong, Yan Wen, and Tong Zheng, for all the great times we had together. I feel so lucky to work with so many excellent iii colleagues. I am especially thankful for the deep expertise in optimization that Huang Lab possesses, as being immersed in this environment has greatly benefited my research in optimization. I extend my thanks to my roommates and close collaborators, Xidong Wu and Yan- shuo Chen, for listening and offering me advice, and to my other collaborators, including Gang Wu, Jianhui Sun, Hongyang Zhang and Huimin Wu. To all those mentioned and many others who have supported me along the way, thank you for being part of this incredible journey. Finally, I want to thank my family for the education they have provided me and their continuing and unconditional support. iv Table of Contents Dedication ii Acknowledgements iii Table of Contents v List of Tables ix List of Figures xii List of Abbreviations xv Chapter 1: Introduction 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 I Better Understanding Deep Learning with Overparame- terization 9 Chapter 2: Analyzing the Conjugate Kernel and Neural Tangent Kernel 10 2.1 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.2 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.3 Feedforward Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.4 Residual Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.5 Overview of Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.7 Additional Theoretical Result . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.8 Additional Explanation on Tensor Diagram . . . . . . . . . . . . . . . . . . 35 2.9 Additional Experiment Result . . . . . . . . . . . . . . . . . . . . . . . . . 37 2.10 Proof of Main Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 Chapter 3: Bridging the Gap Between Empirical Risk Minimization and Bayesian Approaches in Overparameterized Neural Networks 60 3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 3.2 Training by Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 3.3 Transformative Bayesian Learning . . . . . . . . . . . . . . . . . . . . . . . 68 v 3.4 Connections between Bayesian Learning and Optimization . . . . . . . . . 70 3.5 Interpolation of Optimization and Bayesian Learning . . . . . . . . . . . . 71 3.6 Overparameterized Neural Network . . . . . . . . . . . . . . . . . . . . . . 74 3.7 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 3.8 Related Works . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . 82 3.9 Discussion on the strength of assumption . . . . . . . . . . . . . . . . . . . 84 3.10 Additional Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 3.11 Analysis of Infinitely Wide Network . . . . . . . . . . . . . . . . . . . . . . 88 3.12 Additional Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . 101 3.13 Additional Discussion on Algorithms . . . . . . . . . . . . . . . . . . . . . 101 II Analyzing and Enhancing Learning Algorithms 109 Chapter 4: Optimal Underdamped Langevin MCMC Method 110 4.1 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 4.2 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 4.3 AcceLerated ULD-MCMC (ALUM) methods . . . . . . . . . . . . . . . . . 115 4.4 Theoretical analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 4.5 Information-based complexity . . . . . . . . . . . . . . . . . . . . . . . . . 124 4.6 Optimality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 4.7 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 4.8 Extra discussions on ULD . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 4.9 Proof of convergence result and discretization error . . . . . . . . . . . . . 143 4.10 Proof of information-based complexity lower bound . . . . . . . . . . . . . 174 4.11 Extra discussion on experiments . . . . . . . . . . . . . . . . . . . . . . . . 190 Chapter 5: Tighter analysis for Proxskip 204 5.1 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207 5.2 New Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209 5.3 Continuous Limit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212 5.4 Main Idea of Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 5.5 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218 5.6 ProxSkip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227 5.7 SProxSkip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240 5.8 ProxSkip-VR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248 5.9 GradSkip+ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255 Chapter 6: Generalizing Lipschitz Smoothness for Tighter Analysis of Nonconvex Optimization 267 6.1 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268 6.2 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270 6.3 Beyond Lipschitz Smoothness . . . . . . . . . . . . . . . . . . . . . . . . . 271 6.4 Lookahead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274 6.5 Local SGD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281 vi 6.6 Negative Curvature for Stochastic Compositional Optimization . . . . . . . 286 6.7 Comparison to related works . . . . . . . . . . . . . . . . . . . . . . . . . . 291 6.8 Basic lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292 6.9 Lookahead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299 6.10 Local SGD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305 6.11 Negative curvature for stochastic compositional optimization . . . . . . . . 311 6.12 Additional experiments . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . 313 6.13 Experiment setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314 III Enhancing Large Language Model with Monte Carlo 317 Chapter 7: Unbiased Watermark for Large Language Models 318 7.1 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320 7.2 Warm up: undetectability in a simplified toy environment . . . . . . . . . . 322 7.3 Watermarking with unbiased reweighting . . . . . . . . . . . . . . . . . . . 324 7.4 Statistical hypothesis testing for watermark detection . . . . . . . . . . . . 331 7.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335 7.6 Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338 7.7 Detailed definition and additional proofs . . . . . . . . . . . . . . . . . . . 341 7.8 Additional discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353 7.9 Likelihood-agnostic watermark score . . . . . . . . . . . . . . . . . . . . . 355 7.10 Detailed experiment setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 358 7.11 More experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360 7.12 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365 7.13 Ethics Consideration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366 Chapter 8: Accelerated Speculative Sampling Based on Tree Monte Carlo 370 8.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373 8.2 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375 8.3 Tree Monte Carlo (TMC) . . . . . . . . . . . . . . . . . . . . . . . . . . . 377 8.4 Accelerated Speculative Sampling (ASpS) . . . . . . . . . . . . . . . . . . 411 8.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423 8.6 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425 8.7 Full Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 428 Chapter 9: Inevitable Trade-off between Watermark Strength and Speculative Sam- pling Efficiency for Language Models 434 9.1 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436 9.2 Two Reweight Framework for Accelerated Generation of Watermarked Tokens439 9.3 No-go Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441 9.4 Algorithms for Maintaining Watermark Strength or Sampling Efficiency . . 447 9.5 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 450 9.6 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452 9.7 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456 vii 9.8 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460 9.9 U Score, DeltaGubel Reweight, and Gamma Reweight . . . . . . . . . . . . 462 9.10 Extension to Variants of Speculative Sampling . . . . . . . . . . . . . . . . 464 9.11 Ethical Consideration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465 9.12 Limitation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466 9.13 Additional Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . 467 Bibliography 480 viii List of Tables 2.1 Verification of CK and NTK of a three layers feedforward network. . . . . 31 2.2 Verification of CK and NTK of a two branches residual network. 
Every branch is a feedforward network with one hidden layer. . . . . . . . . . . . 32 2.3 The p-value of KS test on CK and NTK of the feedforward network. . . . . 33 2.4 The p-value of KS test on CK and NTK of the residual network. . . . . . . 33 2.5 Verification of CK and NTKs of a three layer feedforward network. . . . . 37 2.6 Verification of CK and NTKs of a two branches residual network. Every branch is a feedforward network with one hidden layer. . . . . . . . . . . . 38 2.7 Classification with fixed NTK and random NTK . . . . . . . . . . . . . . . 50 4.1 Summary of gradient complexity of sampling methods, which is defined as number of gradient evaluation of ∇fi(x) needed to sample from m-strongly- log-concave distributions up to ε √ d/m accuracy in 2-Wasserstein distance where ε ≤ 1 is the target accuracy, d is the dimension. ULA, LPM, RMM, and ALUM are full gradient methods, therefore, gradient complexities for them are sample size N times the iteration complexities. Only dependence on d, ε, and N are shown below. The dependence of batch size b is made clear in Table 4.3. The dependence of condition number κ is discussed in Section 4.6. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 4.2 Iteration complexities for full gradient ALUM on both sampling and approx- imation problem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 4.3 Gradient complexity for SAGA-ALUM and SVRG-ALUM. . . . . . . . . . 124 4.4 The summary of different datasets used in our experiments. . . . . . . . . 193 5.1 Different stochastic local methods for strongly convex optimization. . . . . 218 7.1 Performance of different watermarking methods on TS and MT. We use F1 scores of BERTScore and scale BERTScore and ROUGE-1 with a factor of 100. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336 7.2 Mean and variance of score per token for different reweighting methods and different tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337 7.3 Text sampled from OPT-6.7B, with and without watermarks. For ”No water- mark” (NW), the score is computed based on δ-reweight. When watermarks are included, the corresponding reweighting function is used for computing score. The optimal perturbation strengths d obtained by grid search are 0.9, 0.0, 0.0 for three outputs respectively. . . . . . . . . . . . . . . . . . . 337 ix 7.4 Additional result about the performance of different watermarking methods on TS. We scale BERTScore and ROUGE with a factor of 100. . . . . . . . 360 7.5 Additional result about the performance of different watermarking methods on MT. We scale BERTScore with a factor of 100. . . . . . . . . . . . . . . 361 7.6 Score per token when the estimated token distribution is computed from a different temperature than the real token distribution. . . . . . . . . . . . . 362 7.7 Score per token when the estimated token distribution is computed from a different top k than the real token distribution. . . . . . . . . . . . . . . . 362 7.8 Score per token when the estimated token distribution is computed with and without input. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362 7.9 Score per token when the estimated token distribution is computed from a different model than the real token distribution. . . . . . . . . . . . . . . . 362 7.10 Mean and variance of score per token for δ-reweight based on Gumbel trick on different tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
363 7.11 Additional result with T5 for translation tasks and LLaMA 2 [1] for sum- marization and poem generation. . . . . . . . . . . . . . . . . . . . . . . . 363 7.12 AUC for different watermarking detection methods under different pertur- bation strength. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364 7.13 AUC for different watermarking detection methods under different deletion portion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365 8.1 Single Token Generation with Speculative Sampling . . . . . . . . . . . . . 374 8.2 An Illustration of Speculative Sampling Coupling . . . . . . . . . . . . . . 375 8.3 An Illustration of Accelerated Speculative Sampling . . . . . . . . . . . . . 375 8.4 Desired Joint Distribution in a Toy Example . . . . . . . . . . . . . . . . . 376 8.5 Summary of accepted token numbers per step improvement by ASpS over SpS. n is the number of reference token. . . . . . . . . . . . . . . . . . . . 424 8.6 Summary of accepted token numbers per step (ATN) and per token time (PTN) improvement and perplexity (PPL) change for ASpS over SpS. n is the number of reference token. We use LLaMa-7b model as target model and LLaMa-68m model as reference model on machine translation task. . . 428 8.7 Summary of accepted token numbers per step (ATN) and per token time (PTN) improvement and perplexity (PPL) change for ASpS over SpS. n is the number of reference token. We use LLaMa-7b model as target model and LLaMa-68m model as reference model on text summarization task. . . 429 8.8 Summary of accepted token numbers per step (ATN) and per token time (PTN) improvement and perplexity (PPL) change for ASpS over SpS. n is the number of reference token. We use LLaMa-13b model as target model and LLaMa-68m model as reference model on open-ended text generation task. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429 8.9 Summary of accepted token numbers per step (ATN) and per token time (PTN) improvement and perplexity (PPL) change for ASpS over SpS. n is the number of reference token. We use Opt-13b model as target model and Opt-125m model as reference model on open-ended text generation task. . 429 x 8.10 Summary of accepted token numbers per step (ATN) and per token time (PTN) improvement and perplexity (PPL) change for ASpS over SpS. n is the number of reference token. We use Opt-6.7b model as target model and Opt-125m model as reference model on open-ended text generation task. . 430 8.11 Summary of accepted token numbers per step (ATN) and per token time (PTN) improvement and perplexity (PPL) change for ASpS over SpS. n is the number of reference token. We use GPT-Neo-2.7B model as target model and GPT-2 model as reference model on open-ended text generation task. . 430 8.12 Summary of accepted token numbers per step (ATN) and per token time (PTN) improvement and perplexity (PPL) change for ASpS over SpS. n is the number of reference token. We use GPT-2-xl model as target model and GPT-2 model as reference model on open-ended text generation task. . . . 430 8.13 Summary of accepted token numbers per step (ATN) and per token time (PTN) improvement and perplexity (PPL) change for ASpS over SpS. n is the number of reference token. We use LLaMa-7b model as target model and LLaMa-68m model as reference model on open-ended text generation task. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
431 8.14 Summary of accepted token numbers per step (ATN) and per token time (PTN) improvement and perplexity (PPL) change for ASpS over SpS. n is the number of reference token. We use LLaMa-2-7b model as target model and LLaMa-68m model as reference model on open-ended text generation task. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431 8.15 Summary of accepted token numbers per step (ATN) and per token time (PTN) improvement and perplexity (PPL) change for ASpS over SpS. n is the number of reference token. We use LLaMa-2-7b model as target model and LLaMa-68m model as reference model on machine translation task. . . 431 8.16 Summary of accepted token numbers per step (ATN) and per token time (PTN) improvement and perplexity (PPL) change for ASpS over SpS. n is the number of reference token. We use LLaMa-2-7b model as target model and LLaMa-68m model as reference model on text summarization task. . . 432 8.17 Summary of accepted token numbers per step (ATN) and per token time (PTN) improvement and perplexity (PPL) change for ASpS over SpS. n is the number of reference token. We use LLaMa-2-13b model as target model and LLaMa-68m model as reference model on open-ended text generation task. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432 9.1 Text summarization task with LLaMa-7b model [1] as target model and LLaMa-68m model [2] as reference model. . . . . . . . . . . . . . . . . . . 472 9.2 Text summarization task with LLaMa-13b model [1] as target model and LLaMa-68m model [2] as reference model. . . . . . . . . . . . . . . . . . . 474 9.3 Open-ended text generation task with LLaMa-7b model [1] as target model and LLaMa-68m model [2] as reference model. . . . . . . . . . . . . . . . . 476 9.4 Open-ended text generation task with LLaMa-13b model [1] as target model and LLaMa-68m model [2] as reference model. . . . . . . . . . . . . . . . . 478 xi List of Figures 2.1 Different regimes in the study of CK and NTK. Network width and depth are denoted as d and n. Black line and Blue line: CK and NTK converge to fixed values. Green line: For feedforward network, diagonal elements of CK and NTK of each parameter converge to log-normal distribution. The variance of the log value is proportional to d n . Diagonal elements of residual network NTK converges to the product of CK of a single feedforward network and a log normal distribution random variable. Red line: For feedforward network, the variance of CK and NTK is unbounded. For residual network, diagonal elements of CK converges to log-normal distribution, and diagonal elements of NTK converges in law to a log-normal distributed variable times CK of a feedforward network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.2 Distribution of CK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.3 Distribution of NTK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 2.4 Distribution of CK and NTK . . . . . . . . . . . . . . . . . . . . . . . . . 40 2.5 Joint distribution of logarithm NTK of weights from different layer. . . . . 40 2.6 Estimation of correlation coefficients between weight NTKs and bias NTKs. 42 2.7 Joint distribution of logarithm NTK of weights NTK and bias NTK. . . . . 43 2.8 Joint distribution of logarithm NTK of weights NTK for residual network with 20 branches. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 2.9 Distribution of CK and joint distribution of NTK from different layers. . . 
44 2.10 Distribution of CK and joint distribution of NTK from different layers. . . 45 3.2 Comparison of dynamics in parameter space and function space. . . . . . . 80 3.1 Hessian trace for one randomly initialized network (left) and the probability density of Hessian trace at initialization (right). . . . . . . . . . . . . . . . 80 3.3 Test accuracy for ensemble trained by GD and TransBL with β = exp(−40) for weight clipping (left). Distribution of weights in TransBL (middle). Trade-off between sampling efficiency and test accuracy (right). . . . . . . 82 3.4 The KL divergence, training and testing loss gap, and time dependence of weight distribution of experiment in Section 3.7.3. . . . . . . . . . . . . . 102 3.5 Bayesian learning process . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 3.6 Illsutration of TransBL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 3.7 Interpolation between optimization and Bayesian learning . . . . . . . . . 104 4.1 Trajectory error for different algorithms1. The vertical lines are the error bar. FG means full gradient, SG means stochastic gradient. . . . . . . . . . 129 4.2 Discretization error of full gradient methods on australian dataset.2 . . . . 131 xii 4.3 Discretization error for SVRG-ALUM and SAGA-ALUM with different batch sizes on australian dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 4.4 Estimated trajectory error of RMM with different segments number. . . . . 192 4.5 Logistic regression on phishing dataset. . . . . . . . . . . . . . . . . . . . . 195 4.6 Logistic regression on german dataset. . . . . . . . . . . . . . . . . . . . . 196 4.7 Logistic regression on mushrooms dataset. . . . . . . . . . . . . . . . . . . 197 4.8 Discretization error on australian dataset for different algorithms under dif- ferent step size.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201 4.9 Discretization error for SVRG-ALUM and SAGA-ALUM with different batch sizes on australian dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . 202 4.10 Sampling from posterior of Bayesian logistic regression model on australian dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202 4.11 Sampling from posterior of Bayesian logistic regression model on phishing dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202 4.12 Sampling from posterior of Bayesian logistic regression model on german dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203 4.13 Sampling from posterior of Bayesian logistic regression model on mushrooms dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203 5.1 Relationship between step size and decreasing speed of Lyapunov function when µ = 0.1 and L = 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207 5.2 ProxSkip experiment for comparing of different step sizes. . . . . . . . . . . 211 5.3 Convergence of ODEProx. . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 5.4 Illustration of old Lyapunov function . . . . . . . . . . . . . . . . . . . . . 215 5.5 Illustration of new Lyapunov function . . . . . . . . . . . . . . . . . . . . . 216 5.6 SProxSkip experiment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221 6.1 Upper bound based on lower smoothness. . . . . . . . . . . . . . . . . . . . 272 6.2 Exponential divergence for gradient flow on nonconvex objective function. . 
272 6.3 Similar to Figure 6.2 but with different initialization. . . . . . . . . . . . . 273 6.4 A counter example showing that Lookahead with too large horizon doesn’t converge. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274 6.5 Convergence of ResNet-18 on CIFAR-10. . . . . . . . . . . . . . . . . . . . 278 6.6 Different convergence conditions for local SGD. . . . . . . . . . . . . . . . 282 6.7 Estimated Hessian eigenspectrum for ResNet-18 at random initialization. . 286 6.8 Dynamics of minimum Hessian eigenvalue, function value, and gradient dur- ing training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287 6.9 Estimated minimum negative curvature for different networks. . . . . . . . 289 6.10 Dynamics of ratio of minimum and maximum Hessian eigenvalues. . . . . . 290 6.12.1Additional result for wide network. . . . . . . . . . . . . . . . . . . . . . . 313 6.12.2Additional result for lazy network. . . . . . . . . . . . . . . . . . . . . . . . 314 7.3.1 Illustration of δ-reweight. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326 7.3.2 Illustration of γ-reweight. . . . . . . . . . . . . . . . . . . . . . . . . . . . 326 7.4.1 Distribution of perplexity of output for TS and BLEU score for MT. . . . . 335 xiii 9.0.1 Taxonomy of watermarking and speculative sampling trade-offs in language models. The ideal case of maintaining both watermark strength and sam- pling efficiency is proven to be impossible by the no-go theorem. The pro- posed algorithms focus on maintaining either watermark strength or sam- pling efficiency. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437 9.5.1 Comparison of different methods. The x-axis shows the Average Accepted Tokens Per Step (AATPS) as a measure of speculative sampling efficiency, while y-axis shows the Average Negative Log P-value Per Token (ANLPPT) as a measure of watermark strength. The P-value is computed based on ei- ther a likelihood-based test using the maximin-LLR score (left) or a likelihood- agnostic test using the U score (right). Watermarking is performed using either the DeltaGumbel reweight (top) or the Gamma reweight (bottom). Error bars represent 3σ confidence intervals4. . . . . . . . . . . . . . . . . 453 9.13.1Text summarization task with LLaMa-13b model [1] as target model and LLaMa-68m model [2] as reference model. . . . . . . . . . . . . . . . . . . 468 9.13.2Open-ended text generation task with LLaMa-7b model [1] as target model and LLaMa-68m model [2] as reference model. . . . . . . . . . . . . . . . 469 9.13.3Open-ended text generation task with LLaMa-13b model [1] as target model and LLaMa-68m model [2] as reference model. . . . . . . . . . . . . . . . 
470

List of Abbreviations

ALUM AcceLerated ULD-MCMC
APGD Accelerated Proximal Gradient Descent
ASpS Accelerated Speculative Sampling
BART Bidirectional and Auto-Regressive Transformers
BLEU Bilingual Evaluation Understudy (score)
CK Conjugate Kernel
CNN-DM CNN/Daily Mail (dataset)
DNNs Deep Neural Networks
ERM Empirical Risk Minimization
FL Federated Learning
KL Kullback-Leibler (divergence)
LLM(s) Large Language Model(s)
LLR Log Likelihood Ratio
LPM Left Point Method
MBart Multilingual BART
MCMC Markov Chain Monte Carlo
MCTS Monte Carlo Tree Search
MGF Moment-Generating Function
MSE Mean Squared Error
MT Machine Translation
NAG Nesterov's Accelerated Gradient
NNGP Neural Network Gaussian Process
NTK Neural Tangent Kernel
ODE Ordinary Differential Equation
PAC Probably Approximately Correct
PGD Proximal Gradient Descent
ProxSkip-VR ProxSkip with Variance Reduction
ReLU Rectified Linear Unit
RMM Randomized Midpoint Method
ROUGE Recall-Oriented Understudy for Gisting Evaluation
SAGA Stochastic Average Gradient Algorithm
SDE Stochastic Differential Equation
SGD Stochastic Gradient Descent
SpS Speculative Sampling
SProxSkip Stochastic ProxSkip
SVRG Stochastic Variance Reduced Gradient
TMC Tree Monte Carlo
TransBL Transformative Bayesian Learning
TS Text Summarization
TV Total Variation (distance)
ULD Underdamped Langevin Dynamics
VA Variational Approximation
WMT Workshop on Statistical Machine Translation

Chapter 1: Introduction

1.1 Motivation

Machine learning is a technique that transforms massive amounts of data into predictive and decision-making tools. In recent years, the development of machine learning has been driven by model improvements and massively increased computing power, promoting social progress and scientific and technological advancements.

The explosion of data creates a need for effectively extracting patterns from big data, especially when the underlying mechanisms are complex, while avoiding overfitting. The vast amount of data provides us with more training samples, allowing us to build more complex and accurate models, thereby reducing model bias. However, big data also brings challenges such as data noise, which can lead to increased model variance. Neural networks have achieved a good balance in this situation. With their powerful representation capabilities and flexible structural design, neural networks have become a practical low-bias method that can effectively capture complex features and patterns in data. With appropriate regularization techniques and practical experience, the variance of neural networks can also be well controlled. Neural networks have demonstrated excellent scalability to large-scale data scenarios.

However, neural networks face challenges in terms of trainability and generalization ability. Theoretically, neural networks are non-convex, which can lead to local minima. The multi-layer structure also brings optimization difficulties, such as vanishing or exploding gradients. The trainability of neural networks lacks theoretical guarantees. Moreover, the generalization ability of neural networks lacks rigorous theoretical guarantees. Although neural networks have empirically shown good generalization performance on many tasks, we still do not have a complete theoretical framework to explain and predict this generalization ability. Traditional statistical learning theories, such as VC dimension theory and Rademacher complexity, have significant limitations in explaining the generalization behavior of neural networks.
Recent research on over-parameterization has shed light on these problems. As the width of neural networks increases, the loss landscape becomes smoother, local minima decrease, and gradient descent can find global minima, a phenomenon known as the "blessing of width." It allows us to characterize trainability and generalization ability under a rigorous mathematical framework, where the training dynamics is largely captured by the Neural Tangent Kernel. Part I of this dissertation discusses our contribution in this area.

The explosive growth of data poses challenges to training algorithms, which essentially aim to find the optimal model parameters by minimizing an objective function. Common objective functions include the empirical loss and PAC-Bayesian generalization bounds, the latter leading to Bayesian learning. Efficient training requires controlling computational complexity. Theoretical research on the complexity of training algorithms is crucial for providing performance guarantees, such as convergence rate, sample complexity, and communication complexity, for evaluating algorithm performance, and for comparing algorithms. The core results are upper and lower bounds on complexity, and pursuing matching upper and lower bounds is one of the research goals of complexity theory. Understanding algorithm convergence helps in understanding the strengths, weaknesses, applicability, and limitations of algorithms, guiding algorithm improvement and design. Part II of this dissertation discusses the progress in this area.

Inference efficiency is also important, especially in the context of large language models (LLMs), which have made significant breakthroughs in natural language processing. Efficient inference is crucial for applications such as dialogue systems, text generation, and question-answering systems. The probabilistic structure of LLMs, being auto-regressive models, allows for the application of advanced Monte Carlo methods to accelerate inference. Chapter 8 of this dissertation presents work in this direction, applying novel Tree Monte Carlo methods to further accelerate LLM inference.

While LLMs have achieved remarkable success in natural language processing, they also pose security and ethical challenges. LLMs can be maliciously exploited to generate false information, spread rumors, or engage in other activities that harm society. Due to the high readability and persuasiveness of the text generated by LLMs, this false information can mislead the public and have serious negative impacts. Therefore, preventing the misuse of LLMs while ensuring their performance is an urgent problem to be addressed. An important question is how to track and identify the content generated by LLMs to prevent their misuse. Watermarking techniques provide a promising approach to address this issue. We find that watermarking can be modeled as a Monte Carlo process, which enables us to effectively embed watermark information without affecting the performance of LLMs.

Furthermore, considering the inference cost of LLMs, how to ensure LLM watermarking while improving inference efficiency is an urgent problem to be solved. We find this to be inherently challenging: the embedding of watermarks and the acceleration of inference are fundamentally in conflict. Theoretically characterizing this inherent trade-off deepens understanding and provides new ideas and methods for solving the security and efficiency problems of LLMs. Part III of this dissertation discusses our contribution in this area of research.
1.2 Dissertation Outline

Better Understanding Deep Learning with Overparameterization. In Chapter 2, we investigate the distributions of the Conjugate Kernel (CK) and Neural Tangent Kernel (NTK) for ReLU networks with random initialization. We derive the precise distributions and moments of the diagonal elements of these kernels. For a feedforward network, these values converge in law to a log-normal distribution when the network depth d and width n simultaneously tend to infinity, and the variance of the log diagonal elements is proportional to d/n. For the residual network, in the limit where the number of branches m increases to infinity and the width n remains fixed, the diagonal elements of the Conjugate Kernel converge in law to a log-normal distribution whose log value has variance proportional to 1/n, and the diagonal elements of the NTK converge in law to a log-normally distributed variable times the conjugate kernel of one feedforward network. Our theoretical results suggest that the residual network remains trainable in the limit of infinite branches and fixed network width. We also conduct numerical experiments to validate the soundness of our theoretical analysis.

Chapter 3 proposes a novel algorithm, Transformative Bayesian Learning (TransBL), which bridges the gap between empirical risk minimization (ERM) and Bayesian learning for neural networks. We compare ERM, which optimizes with gradient descent, and Bayesian learning with importance sampling in terms of their generalization and computational complexity. We derive the first algorithm-dependent PAC-Bayesian generalization bound for infinitely wide networks based on an exact KL divergence between the trained posterior distribution obtained by infinitesimal step size gradient descent and a Gaussian prior. Moreover, we show how to transform gradient-based optimization into importance sampling by incorporating a weight. While Bayesian learning has better generalization, it suffers from low sampling efficiency. Optimization methods, on the other hand, have good sampling efficiency but poor generalization. Our proposed algorithm TransBL enables a trade-off between generalization and sampling efficiency.

Analyzing and Enhancing Learning Algorithms. In Chapter 4, we study the underdamped Langevin diffusion (ULD) with a strongly-convex potential consisting of a finite summation of N smooth components, and propose an efficient discretization method, which requires O(N + d^{1/3}N^{2/3}/ε^{2/3}) gradient evaluations to achieve ε-error (in √(E∥·∥_2^2) distance) for approximating the d-dimensional ULD. Moreover, we prove a lower bound on the gradient complexity of Ω(N + d^{1/3}N^{2/3}/ε^{2/3}), which indicates that our method is optimal in its dependence on N, ε, and d. In particular, we apply our method to sample from strongly-log-concave distributions and obtain gradient complexity better than all existing gradient-based sampling algorithms. Experimental results on both synthetic and real-world data show that our new method consistently outperforms the existing ULD approaches.

In Chapter 5, we provide a tighter analysis for ProxSkip, an algorithm that allows fewer proximal operator computations to solve composite optimization problems. We improve the existing decreasing speed of the Lyapunov function from O(p^2) to O(p) when p, the frequency of the proximal operator evaluations, is small enough. Our theoretical analysis also reveals the drawbacks of using large step sizes for gradient descent in ProxSkip when the proximal operator part is the bottleneck.
Our main motivation comes from the continuous limit, in which the original analysis of ProxSkip fails to guarantee convergence when both the step size γ and the frequency p tend to zero. We construct a counterexample to demonstrate why such counterintuitive behavior occurs for the original analysis, and then propose a novel Lyapunov function variant to construct a tighter analysis, avoiding the problem of the old one. Such a new Lyapunov function can be directly extended to many other variants of ProxSkip. When applied to the stochastic gradient setup, our analysis leads to an improved proximal operator complexity for SProxSkip, from O(√(1/(εµ^2)) log(1/ε)) to O(√κ log(1/ε)).

Negative and positive curvatures affect optimization in different ways. However, a lot of existing optimization theory is based on the Lipschitz smoothness assumption, which cannot differentiate between the two. In Chapter 6, we propose to use two separate assumptions for positive and negative curvatures, so that we can study the different implications of the two. We analyze the Lookahead and Local SGD methods as concrete examples. Both of them require multiple copies of model parameters and communication among them every certain period of time in order to prevent divergence. We show that the minimum communication frequency is inversely proportional to the negative curvature, and when the negative curvature becomes zero, we recover the existing theoretical results for convex optimization. Finally, both experimentally and theoretically, we demonstrate that modern neural networks have highly unbalanced positive/negative curvatures. Thus, an analysis based on separate positive and negative curvatures is more pertinent.

Enhancing Large Language Model with Monte Carlo. The recent advancements in large language models (LLMs) have sparked a growing apprehension regarding their potential misuse. One approach to mitigating this risk is to incorporate watermarking techniques into LLMs, allowing for the tracking and attribution of model outputs. This study examines a crucial aspect of watermarking: how significantly watermarks impact the quality of model-generated outputs. Previous studies have suggested a trade-off between watermark strength and output quality. However, our research in Chapter 7 demonstrates that it is possible to integrate watermarks without affecting the output probability distribution given an appropriate implementation. We refer to this type of watermark as an unbiased watermark. This has significant implications for the use of LLMs, as it becomes impossible for users to discern whether a service provider has incorporated watermarks or not. Furthermore, the presence of watermarks does not compromise the performance of the model in downstream tasks, ensuring that the overall utility of the language model is preserved. Our findings contribute to the ongoing discussion around responsible AI development, suggesting that unbiased watermarks can serve as an effective means of tracking and attributing model outputs without sacrificing output quality.

Speculative Sampling (SpS) has been introduced to speed up the inference of large language models (LLMs) by generating multiple tokens in a single forward pass under the guidance of a reference model, while preserving the original distribution. We observe that SpS can be derived through maximum coupling on the token distribution; a minimal sketch of this per-token coupling is given below.
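To make this maximum-coupling view concrete, the following sketch implements the standard per-token speculative sampling rule (accept a draft token x ∼ q with probability min(1, p(x)/q(x)), otherwise resample from the normalized residual). The function name and the toy distributions are illustrative only; this is the baseline per-token rule, not the accelerated tree-based method of Chapter 8.

```python
import numpy as np

def sps_single_token(p, q, rng):
    """Sample one token from target distribution p via a draft from reference q.

    This is the maximum-coupling construction: the output is exactly p-distributed,
    and the draft token is kept with probability sum_x min(p[x], q[x]).
    """
    x = rng.choice(len(q), p=q)                  # draft token proposed from q
    if rng.random() < min(1.0, p[x] / q[x]):
        return x                                 # accept the draft token
    residual = np.maximum(p - q, 0.0)            # otherwise resample from the residual
    return rng.choice(len(p), p=residual / residual.sum())

rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])                    # toy target (large-model) distribution
q = np.array([0.3, 0.5, 0.2])                    # toy reference (draft-model) distribution
samples = [sps_single_token(p, q, rng) for _ in range(100_000)]
print(np.bincount(samples) / len(samples))       # empirical frequencies approach p
```

Applying this rule one token at a time is the incremental coupling that the next paragraph argues is suboptimal on tree-structured generative distributions.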
However, we find that this approach is not optimal, as it applies maximum coupling incrementally for each new token rather than seeking a global maximum coupling, which yields a faster algorithm given the tree-space nature of LLM generative distributions. In Chapter 8, we shift our focus from distributions on a token space to those on a tree space. We propose a novel class of Tree Monte Carlo (TMC) methods, demonstrating their unbiasedness and convergence. As a particular instance of TMC, our new algorithm, Accelerated Speculative Sampling (ASpS), outperforms traditional SpS by generating more tokens per step on average, achieving faster inference while maintaining the original distribution.

Given that we have watermarking techniques that inject watermarks into the generated content without altering the output quality, and acceleration techniques, specifically speculative sampling, that leverage a draft model to speed up the sampling process while preserving the output distribution, we further investigate whether it is possible to simultaneously accelerate the sampling process and inject watermarks into the generated content. In Chapter 9, we find that the integration of watermarking and acceleration is non-trivial. We prove a no-go theorem, which states that it is impossible to simultaneously maintain the highest watermark strength and the highest sampling efficiency. Furthermore, we propose two methods that maintain either the sampling efficiency or the watermark strength, but not both. Our work provides a rigorous theoretical foundation for understanding the inherent trade-off between watermark strength and sampling efficiency in accelerating the generation of watermarked tokens for large language models. We also conduct numerical experiments to validate our theoretical findings and demonstrate the effectiveness of the proposed methods.

Part I
Better Understanding Deep Learning with Overparameterization

Chapter 2: Analyzing the Conjugate Kernel and Neural Tangent Kernel

Deep Neural Networks (DNNs) have been successfully applied to numerous applications such as computer vision [3, 4], natural language processing [5] and speech recognition [6], where DNNs often achieved superior performance compared to conventional machine learning algorithms and can also be used as a feature extractor (representation learning) for other methods.

Despite the extensive success, the DNN training process and the mechanism of generalization are not fully understood, due to the difficulties raised by the non-convexity of the loss function and the complication of optimization methods. One way to simplify the analysis of DNNs is to train the parameters in only one layer while keeping the parameters of the other layers fixed. Jarrett et al. [7] demonstrated that training only the last layer of a neural network can achieve competitive performance. Since all parameters of the previous layers are fixed, the outputs at the last hidden layer F(x, θ) of a non-linear DNN are considered as a random feature vector whose inner product is known as the Conjugate Kernel (CK) [8, 9, 10, 11]: Σ(x, y) = F(x, θ)^⊤ F(y, θ).

The other way to simplify the analysis of DNNs is to use an infinitesimal step size and conduct a continuous-time analysis [12, 13, 14]. The training process of the parameters θ and the neural network output F(x, θ) then turns into two first-order ordinary differential equations (ODEs):

dθ_i/dt = − (∂L/∂F(x, θ)) (∂F(x, θ)/∂θ_i),   (2.1)

dF(y, θ)/dt = − ∑_i (∂F(y, θ)/∂θ_i) (∂F(x, θ)/∂θ_i)^⊤ (∂L/∂F(x, θ)).   (2.2)
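As a quick numerical illustration of (2.1) and (2.2) (a toy sketch, not part of this chapter's experiments; the network size, scaling, and variable names are chosen arbitrarily), the following discretizes the parameter dynamics with a small step size and checks that the induced change of the output on a probe input matches the kernel form on the right-hand side of (2.2).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256                                    # hidden width of a toy two-layer ReLU net
W1 = rng.standard_normal((n, 3))           # first-layer weights
w2 = rng.standard_normal(n)                # output-layer weights
x = rng.standard_normal(3)                 # training input appearing in the loss
y = rng.standard_normal(3)                 # probe input for the output dynamics
target, eta = 1.0, 1e-4                    # small step size mimics the gradient flow

def value_and_grads(inp):
    h = np.maximum(W1 @ inp, 0.0)                              # hidden activations
    out = w2 @ h / np.sqrt(n)                                  # scalar output F(inp, theta)
    dW1 = np.outer(w2 * (W1 @ inp > 0), inp) / np.sqrt(n)      # dF/dW1
    dw2 = h / np.sqrt(n)                                       # dF/dw2
    return out, dW1, dw2

F_x, dW1_x, dw2_x = value_and_grads(x)
F_y, dW1_y, dw2_y = value_and_grads(y)
dL_dF = F_x - target                       # gradient of L = (F(x) - target)^2 / 2

# Discretized (2.1): one small gradient-descent step on the parameters.
W1 -= eta * dL_dF * dW1_x
w2 -= eta * dL_dF * dw2_x

# (2.2) predicts the output change -eta * K(y, x) * dL/dF, where
# K(y, x) = sum_i dF(y)/dtheta_i * dF(x)/dtheta_i.
K_yx = np.sum(dW1_y * dW1_x) + np.sum(dw2_y * dw2_x)
F_y_after = value_and_grads(y)[0]
print(F_y_after - F_y, -eta * K_yx * dL_dF)   # agreement up to O(eta^2)
```

The two printed numbers agree up to terms of order η², which is exactly the kernel gradient descent behaviour discussed next.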
The first ODE shows that the dynamics of the parameters θ follows gradient descent on a highly non-convex loss function L(F(x, θ)). However, the dynamics of the output F(y, θ) enjoys the form of kernel gradient descent, where the loss L is usually convex. The neural tangent kernel (NTK) is therefore defined as follows [15]:

K(x, y) = ∑_i (∂F(y, θ)/∂θ_i) (∂F(x, θ)/∂θ_i)^⊤.   (2.3)

Both CK and NTK depend on the inputs and parameters of the neural network. They are potentially complicated due to the randomness of the parameters at initialization and the changing of the parameters during the training process.

One line of research bypasses this complexity by conducting the analysis in the infinite-width limit with fixed depth. Neal [16], Williams [17], Lee et al. [18], de G. Matthews et al. [19] showed that with proper scaling, a neural network with random initialization converges to a Gaussian process in the infinite-width limit. Therefore all elements of the conjugate kernel converge to fixed values, and the CK is usually called the Neural Network Gaussian Process (NNGP) kernel.

The training dynamics under this regime also simplifies dramatically. Jacot et al. [15] showed that as the network width increases, the variance of the NTK for a randomly initialized DNN reduces to zero and the NTK remains fixed during the training phase.

Another similar regime is to increase the depth to infinity at a slower rate than the network width. Under this regime, DNNs largely preserve the property of fixed CK and NTK. Huang et al. [20] studied the limiting NTK of feedforward and residual networks by increasing widths to infinity first, and then increasing depths. Their result shows that the limiting NTK of the feedforward network degenerates into a delta function as the depth increases to infinity.

The fact that CK and NTK converge to fixed values in the infinite width limit reveals important properties of DNNs. First, it shows that gradient descent optimizes wide DNNs to a global minimum. During the training of an infinite width neural network, the change of parameters reduces to zero and the network behaves like a linear model. The training dynamics of the output is precisely a kernel regression. With proper initialization and a well-behaved loss function, the training loss will converge to zero. Second, it explains the generalization of an overparameterized network. The NTK governs the generalization property of training an infinitely wide neural network, while the CK is related to the generalization when we only train the last layer. The limiting CK and NTK only depend on the network depth, the scaling of the variance at each layer, and the choice of non-linearity function. The closed-form expression of CK and NTK allows directly evaluating an infinite width network and can be used for Kernel Regression or Kernel Support Vector Machines [21, 22].

However, the success of DNNs has not been fully explained by the fixed CK and NTK. The NTK of an infinite width network generally has no feature learning capability, because it is data-independent. The training of DNNs, on the other hand, is mostly considered a feature selection process. Arora et al. [21], Chizat et al. [23] demonstrate that linearized DNNs with a fixed NTK can have a substantial performance gap compared to finite width DNNs trained with gradient descent. The above limitations of existing DNN analysis methods motivate us to study the prior distribution of CK and NTK for finite width DNNs.
However, most existing works focus on controlling the deviation of finite width DNNs from the infinite width regime, rather than on particular properties associated with finite width. This chapter aims to study the CK and NTK for finite width networks and their limiting behaviour when they converge to a random variable. The main contributions of this chapter are summarized as follows:

• This chapter provides a new analysis of CK and NTK under finite width and depth conditions. The precise distributions of the diagonal elements of CK and NTK for every parameter are derived. Our results are applicable to both feedforward networks and residual networks with the ReLU activation function.

• This chapter derives upper bounds for every order of moments of the random CK and NTK, instead of only their variance as in previous works. Our upper bound on the r-th order moment is proportional to exp(\binom{r}{2} β), where the randomness coefficient β depends on the network settings. The fast-increasing higher order moments demonstrate the long tail nature of CK and NTK for finite width networks.

• Our new results can be generalized to the infinite width and depth case, where we show that the limiting distributions of CK and NTK with random initialization are closely related to the log-normal distribution. Our results endorse the existing practice of modeling activation and gradient norms in neural networks as log-normally distributed.

• We show that the randomness coefficient β for the residual network can remain bounded even in the limit of infinite depth and finite width. Therefore, an infinitely deep residual network with proper scaling is still trainable.

Figure 2.1: Different regimes in the study of CK and NTK. Network width and depth are denoted as d and n. Black line and Blue line: CK and NTK converge to fixed values. Green line: For the feedforward network, diagonal elements of CK and NTK of each parameter converge to a log-normal distribution; the variance of the log value is proportional to d/n. Diagonal elements of the residual network NTK converge to the product of the CK of a single feedforward network and a log-normally distributed random variable. Red line: For the feedforward network, the variance of CK and NTK is unbounded. For the residual network, diagonal elements of CK converge to a log-normal distribution, and diagonal elements of NTK converge in law to a log-normally distributed variable times the CK of a feedforward network.

A comparison of the properties of CK and NTK under the different regimes of interest is shown in Figure 2.1.

2.1 Related Works

Convergence of Wide Networks. For infinite width neural networks, the NTK is a fixed value during training and the training trajectory is the same as kernel regression [15, 24]. This is not true for any finite width network. However, neural networks with large but finite width still enjoy similar properties. More specifically, the NTK is almost fixed during training, and the training loss decreases to almost zero [21, 25, 26, 27]. These works characterize the size of the deviations around the limiting behaviour in certain overparameterized regimes.

Finite Width Correction. The research works along this direction focus on the deviation of the NTK from its expectation without an overparameterization assumption. Hanin and Nica [28] provided an upper bound on the variance of the diagonal elements of the NTK based on path counting. Littwin et al. [29] extended the result to networks with a residual structure. These results do not reveal the heavy tail nature of the prior distribution of CK and NTK, as only second order moments are discussed.
Dyer and Gur-Ari [30] presented bounds on general correlation functions in DNNs. However, most results are limited to linear networks or remain conjectures.

Log-Normal Distribution in Neural Networks. The log-normal distribution has been used to approximate the output rate in neuroscience [31, 32]. Hanin and Nica [33], Gurbuzbalaban and Hu [34] give theoretical guarantees that the norm of the feedforward network output converges to a log-normal distribution. In machine learning with artificial neural networks, the log-normal distribution is also observed to serve as a better approximation of the gradient distribution than the Gaussian distribution, and it inspires a new technique for quantized training in [35].

2.2 Preliminary

Feedforward Networks. We consider a feedforward network with fully connected layers and ReLU activation. Denoting the input as x_0, each layer is defined as:

y_i = σ_i W_i x_{i−1} + b_i,   i = 1, . . . , d,
x_i = ReLU(y_i),   i = 1, . . . , d − 1,   (2.4)

where f(x_0) = y_d is the output of the network and x_{d−1} is the output of the last hidden layer. The length of the vectors y_i and x_i is n_i. The parameters of the network are initialized as independent random variables (W_i)_{j,k} ∼ N(0, 1) and b_i = 0. Note that although b_i is set to 0 at initialization, this parameter is still trainable and thus contributes to the NTK. We assume that the output is a scalar and n_d = 1 while analyzing the NTK of the feedforward network, and require the output to be a vector with the same size as the input x_0 when the feedforward network is used as one branch in a residual network.

Residual Network. We consider a residual network where every residual block is a feedforward network with ReLU activation. With input x_0, for i = 0, · · · , m − 1, we have:

x_{i,0} = x_i,
y_{i,j} = σ_{i,j} W_{i,j} x_{i,j−1} + b_{i,j},   j = 1, . . . , d_i,
x_{i,j} = ReLU(y_{i,j}),   j = 1, . . . , d_i − 1,   (2.5)
x_{i+1} = x_i + y_{i,d_i},
y_out = σ_out W_out x_m,

where m is the total number of residual branches and f(x_0) = y_out is the output of the network. The length of the vectors y_{i,j} and x_{i,j} is n_{i,j}. The length of the vector x_i is n_i = n_{i,0} = n_{i,d_i} = n. The length of the output y_out is n_out = 1. The parameters are initialized in a similar way as for feedforward networks. The additional fully connected layer helps change the dimension of the output and simplifies the analysis of the NTK.

Conjugate Kernel. The conjugate kernel is defined as the inner product of the last hidden layer. More specifically, the CK is Σ(x_0, x̃_0) = x_{d−1}^⊤ x̃_{d−1} for feedforward networks and Σ(x_0, x̃_0) = x_m^⊤ x̃_m for the residual network. We also define the conjugate kernel of the output as the inner product of the output layer of the network: for feedforward networks Σ′(x_0, x̃_0) = y_d^⊤ ỹ_d, and for residual networks Σ′(x_0, x̃_0) = y_out^⊤ ỹ_out. These values arise in the analysis of the residual network NTK.

Neural Tangent Kernel. The NTK is defined as the inner product of the gradients with respect to the network parameters. Therefore, the NTK can be decomposed into a summation of smaller NTKs, each of which comes from training a unique parameter in the network. More specifically,

K(x_0, x̃_0) = ∑_θ K_θ(x_0, x̃_0),   K_θ(x_0, x̃_0) = (∂f(x_0)/∂θ) (∂f(x̃_0)/∂θ)^⊤,   (2.6)

where θ is the weight or bias of one layer. In this chapter, we consider the case where f(x_0) is a scalar, thus the NTK can be described as a scalar as well. But our results can be easily extended to the case where the network has a vector output, and the NTK K(x_0, x̃_0) is a matrix.

Norm of Gaussian Random Vector. We introduce two functions that will be used in
For an n-dimensional standard Gaussian vector w ∼ N (0, In×n), we define the following two functions regarding to even order moments of w and ReLU(w): G1(n, r) = E [ ∥w∥2r ] = r−1∏ t=0 (n+ 2t) (2.7) G2(n, r) = E [ ∥ReLU(w)∥2r ] = 1 2n n∑ k=0 ( n k ) G1(k, r) For large n, we have: G1(n, r) = nr ( 1 + 2 ( r 2 ) 1 n +O( 1 n2 ) ) (2.8) G2(n, r) = ( n 2 )r ( 1 + 5 ( r 2 ) 1 n +O( 1 n2 ) ) Log-Normal Distribution eZ , where Z follows Gaussian distribution, is said to follow log- normal distribution. When Z ∼ N (−β 2 , β), we have the r-th order moment as E[(eZ)r] = exp (( r 2 ) β ) . We use parameter β to characterize the randomness of such random variable. The log-normal distribution is heavy-tail in the sense that moment generating function E[exp ( teZ ) ] is infinite for all positive t. 2.3 Feedforward Networks We first introduce the precise distribution of diagonal elements of conjugate kernel. Theorem 1 (CK of feedforward network). The diagonal elements of conjugate kernel Σ(x0, x0) have the same distribution as ∥x0∥2∏d−1 k=1 σ 2 k∥ReLU(vk)∥2 at initialization where vk ∼ N (0, Ink×nk ) is the independent standard Gaussian random vector. 18 Moreover, Σ′(x0, x0) has the same distribution as ∥x0∥2σ2 d∥vd∥2∏d−1 k=1 σ 2 k∥ReLU(vk)∥2 at initialization. We can further calculate the moments of CK as follows. Corollary 2 (Moments of feedforward network CK). The r-th order moment of CK of d-layer feedforward network with ReLU activation is: E[Σ(x0, x0) r] =∥x0∥2r ( d−1∏ k=1 σ2r k G2(nk, r) ) =∥x0∥2rcr ( exp (( r 2 ) β ) + O( d∑ i=1 1 n2 i ) ) where r is a non-negative integer, c = ∏d−1 k=1 σ 2 k nk 2 and β = ∑d−1 k=1 5 nk . We can derive the convergence of distribution as follows. Theorem 3 (Limiting distribution of CK). With the same definition of c and β as Corol- lary 2, at the limit of n1, · · · , nd−1 → ∞ and c and β converge to fixed values, c → ĉ and β → β̂, the distribution of E[Σ(x0, x0) r] converges to a log-normal distribution in law. More specifically, the limit distribution is the same as ∥x0∥2ĉeZ where Z ∼ N (− β̂ 2 , β̂) is a Gaussian random variable. Remark 4. By comparing Corollary 2 and moments of log-normal distribution, we can easily see that all moments converge to moments of log-normal distribution in the same limit in Theorem 3. However, we point out that the convergence in law as stated in Theorem 3 is stronger than convergence of all moments. The reason is that the long tail of log-normal distribution makes itself not determined by its moments [36]. 19 Remark 5. The results in Corollary 2 and Theorem 3 can be easily generalized to Σ′(x0, x0). When we require nd also increases to infinity, values c and β in Corollary 2 and Theorem 3 should be changed to c = 2 ∏d k=1 σ 2 k nk 2 , β = 2 nd + ∑d−1 k=1 5 nk and the distribution of Σ′(x0, x0) still converges to log-normal distribution. When we have nd = 1, the limiting distribution of Σ′(x0, x0) is the same as ∥x0∥2σ2 d∥w∥2ĉeZ where w is a standard Gaussian variable and other values are the same. Therefore, the ac- tivation of one single neuron doesn’t converge to log-normal distribution. Next, we derive the distribution of diagonal NTK coming from trainable weights Wi: KWi (x0, x0) = ∑ j,k ∂f(x0) ∂(Wi)j,k ∂f(x0) ∂(Wi)j,k ⊤ . (2.9) This distribution is fully characterized by the connections between CK and NTK. Theorem 6 (NTK of weights). KWi (x0, x0) at initialization has the same distribution as conjugate kernel σ2 dΣ(x0, x0). Remark 7. 
For weights Wd at the last layer of feedforward network, KWd (x0, x̃0) = σ2 dΣ(x0, x̃0) always holds true. Theorem 6 shows that a similar relationship between NTK and CK exists for all other weights in the network when we only consider the distribution of diagonal elements at initialization. Remark 8. Since the distribution of KWi (x0, x0) is the same as σ2 dΣ(x0, x0), they also enjoy the same limiting behaviour. More specifically, at the limit of depth and width increasing to infinity at same rate, KWi (x0, x0) also converges to a log-normal distribution. This supports the observation and assumption in [35] that gradient magnitudes are log-normal distributed. 20 We then derive the distribution of diagonal NTK coming from trainable biases bi. Kbi (x0, x0) = ∑ j ∂f(x0) ∂(bi)j ∂f(x0) ∂(bi)j ⊤ . (2.10) Theorem 9 (NTK of each biases). Kbi (x0, x0) has the same distribution as ∏d−1 k=i σ 2 k+1∥ReLU(vk)∥2 where vk ∼ N (0, Ink×nk ). The moments are given by E[Kbi (x0, x0) r] =   d∏ k=i+1 σ2r k   d−1∏ k=i G2(nk, r) =cr ( exp (( r 2 ) β ) + O( d∑ i=1 1 n2 i ) ) where c = ∏d−1 k=i σ 2 k+1 nk 2 and β = ∑d−1 k=i 5 nk . This random variable also converges to a log-normal distribution in the limit that ni, · · · , nd−1 → ∞ and c, β converge to fixed values. We have derived the distributions and moments of weights NTKs and biases NTKs separately. However, the correlations between them are too convoluted to have a simple closed form solution. Therefore, we further adopt Jensen’s inequality to derive an upper bound of any moments of the whole NTK: K(x0, x0) = d∑ i=1 (KWi (x0, x0) +Kbi (x0, x0)). (2.11) 21 Corollary 10 (The upper bound of moments of NTK). E[K(x0, x0) r] (E[K(x0, x0)])r ≤ exp (( r 2 ) β ) + O( d∑ i=1 1 n2 i ) where β = ∑d−1 k=1 5 nk . By comparing the upper bound in Corollary 10 and moments of a log-normal distri- bution, we can conclude that the effective randomness coefficient is β = ∑d−1 k=1 5 nk . 2.4 Residual Networks We first give the distribution of diagonal CK of the residual networks. Theorem 11 (CK of residual network). The conjugate kernel of m-branches residual net- work Σ(x0, x0) has the same distribution as ∥x∥2∏m−1 i=0 ∥ê+ Vi∥2 where Vi = σi,dj vi,di di−1∏ j=1 σi,j∥ReLU(vi,j)∥, ê is a unit vector along any direction and vi,j ∼ N (0, Ini,j ,ni,j ). Remark 12. The random variable ∥Vi∥2 has the same distribution as Σ′(ê, ê) of a single branch in residual network when regarded as a separate feedforward network. The moments of CK can be derived as follows. The closed form expression can be found at section 2.7.1. 22 Corollary 13 (Moments of CK). We have E[Σ(x0, x0) r]=exp (( r + 4 n ( r 2 )) m−1∑ i=0 ci ) + O( m−1∑ i=0 c2 i ) where r is a positive integer, ci = 2 ∏di k=1 σ 2 i,k ni,k 2 . Next, we provide the limiting distribution of CK of residual networks. Theorem 14 (Limiting distribution of CK). Let ci = 2 ∏di k=1 σ 2 i,k ni,k 2 , βi = 2 ni,di + ∑di−1 k=1 5 ni,k and β = 4 n . At the limit of ci → 0, ∑m−1 i=0 ci → ĉ while βi remains bounded and β → β̂, the distribution of Σ(x0, x0) converges in law to eĉeZ, where Z ∼ N (− β̂ĉ 2 , β̂ĉ). Remark 15. Theorem 14 is irrelevant to the specific shape of any residual branch, but only requires the randomness coefficients βi for every branch to remain bounded. Therefore, this theorem is applicable to the residual network with infinite branches but every branch is a finite width and depth feedforward network. Remark 16. The randomness coefficient of CK for residual network is 4 n ∑m−1 i=0 ci and doesn’t directly depend on depth. 
If only the summation of ci is bounded, the randomness coefficient remains finite when depth increases to infinity. Therefore, the random feature at the last hidden layer of deep residual network is much stabler than that of a deep feedforward network. Remark 17. Theorem 14 implies that in order to keep an arbitrarily deep residual network trainable, it is necessary that 4 n ∑m−1 i=0 ci has an upper bound that is independent of branch number m. Similar phenomenon has been studied in previous works. Zhang et al. [37] observed that if ci of each layer is kept as a constant, the variance of network output will 23 increase as branches increase. Arpit et al. [38] proposed an initialization method that ensure ∑m−1 i=0 ci doesn’t increase with m. We then derive the distribution of diagonal NTK of weights and biases. Theorem 18 (NTK of each weights and biases). The diagonal NTK coming from weights in one layer KWi,j (x0, x0) has the same distribution as ∥x∥2σ2 out∥Vi∥2∏m−1 s ̸=i s=0 ∥ê+ Vs∥2. . The definition of Vi, vi,j and ê is the same as Theorem 11. The distribution of NTK coming from biases in one layer Kbi,j (x0, x0) is the same as σ2 out ∥∥∥V ′ i,j ∥∥∥ 2∏m−1 s=i+1 ∥ê+ Vs∥2 , where V ′ i,j = vi,di ∏di−1 k=j σi,k+1∥ReLU(vi,k)∥. We move the closed form expression of NTK moments into Section 2.7.1. Next, we derive the limiting distribution of NTKs and connect them with conjugate kernel of a feedforward network. Theorem 19 (Limiting distribution of weights NTK). With the same distribution of ci, βi and β as Theorem 14, at the limit of cs → 0 for all s ̸= i, ∑m−1 s=0 s ̸=i cs → ĉ while βi remains bounded for all s ̸= i and β → β̂, the random variable KWi,j (x0, x0) converges in law to σ2 oute ĉeZΣ′ i(ê, ê), where Σ′ i(ê, ê) is the conjugate kernel at the output layer of a feedforward network with the same shape as (i + 1)-th branch in the residual network and Z ∼ N (− β̂ĉ 2 , β̂ĉ). Theorem 20 (Limiting distribution of biases NTK). With the same distribution of ci, βi and β as Theorem 14, at the limit of cs → 0 for all s > i, ∑m−1 s>i cs → ĉ while βi remains bounded for all s > i and β → β̂, the random variable Kbi (x0, x0) converges in law to σ2 oute ĉeZΣ′ i,j(0, 0), where Z ∼ N (− β̂ĉ 2 , β̂ĉ) and Σ′ i,j(0, 0) is the conjugate kernel at 24 the output layer of a feedforward network with the same shape as (i + 1)-th branch in the residual network but the bias bj is initialized as a standard Gaussian vector. The above results on the weights NTK and biases NTK can also be combined to derive an upper bound on moments of the whole NTK. Corollary 21 (The upper bound of the total NTK). E[K(x0, x0) r] (E[K(x0, x0)])r ≤ exp (( r 2 )( max i βi + 4 n ∑ i ci )) + O   m−1∑ i=0 di∑ j=1 1 n2 i,j + m−1∑ i=0 c2 i   where βi = 2 ni,di + ∑di−1 k=1 5 ni,k and ci = 2 ∏di k=1 σ 2 i,k ni,k 2 . Remark 22. By comparing the upper bound in Corollary 21 and moments of a log-normal distribution, we conclude that the effective randomness coefficient is maxi βi + 4 n ∑ i ci. Note that βi only depends on the shape of a single branch, and effective randomness coefficient only depends on maximum βi. This clearly contrasts with the feedforward network where, β increases linearly with depth. The extra term 4 n ∑ i ci is the same as the randomness coefficient of CK. Therefore, with proper scaling, any order moments of NTK of residual network could remain bounded at infinite branches and finite width limit. 2.5 Overview of Proof Our analysis is based on the equivalent transform on random vectors and tensor networks. 
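Before turning to those proof techniques, a brief arithmetic sketch may help make the contrast drawn in Remark 22 concrete. The helper functions below are our own illustration with arbitrary widths, not part of the chapter's formal results; in particular, $\sum_i c_i$ is simply passed in as a fixed value, as if the per-branch scales $\sigma_{i,j}$ had been chosen to keep it constant.

```python
# Randomness coefficients from Corollary 10 (feedforward) and Corollary 21 (residual).
def beta_feedforward(hidden_widths):
    # beta = sum_{k=1}^{d-1} 5 / n_k grows linearly with depth at fixed width.
    return sum(5.0 / n for n in hidden_widths)

def beta_residual(branch_widths, n, c_total):
    # Effective coefficient max_i beta_i + (4 / n) * sum_i c_i; the first term depends
    # only on the shape of a single branch, not on the number of branches m.
    betas = [2.0 / ws[-1] + sum(5.0 / w for w in ws[:-1]) for ws in branch_widths]
    return max(betas) + 4.0 / n * c_total

print("feedforward, 100 layers of width 100:", beta_feedforward([100] * 99))   # ~4.95

branch = [100, 100]          # one hidden layer of width 100, branch output width n = 100
for m in (10, 100, 1000):    # assume the sigmas are scaled so that sum_i c_i stays at 1
    print(f"residual, m = {m:4d} branches:", beta_residual([branch] * m, n=100, c_total=1.0))
```

Under this scaling the residual coefficient stays at $\max_i \beta_i + 4/n = 0.07 + 0.04 = 0.11$ no matter how many branches are added, whereas the feedforward coefficient keeps growing with depth.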
We note that many previous works adopt a path-based approach to study neural networks with ReLU activation [28, 29, 39]. This method decomposes the network output into a sum over all possible paths connecting an input neuron to the output neuron, where each path carries the weights and activations from all layers. Compared with this sum-over-path method, our method is more intuitive and reveals more properties of the CK and NTK. We first present the method for calculating the CK and NTK of the feedforward network, and then generalize it to the residual network.
2.5.1 Conjugate Kernel
The last hidden layer of a feedforward ReLU network with biases initialized to 0 can be expressed as
$$
x_{d-1} = \left( \prod_{i=1}^{d-1} \sigma_i \right) \mathrm{ReLU}(W_{d-1}( \cdots (\mathrm{ReLU}(W_1 x_0)) \cdots )).
$$
We define an auxiliary random variable at each layer as $v_i = \frac{1}{\|x_{i-1}\|} W_i x_{i-1}$. If $\|x_{i-1}\| = 0$, we take $v_i$ to be a uniform random vector on the unit sphere. Then $v_i \sim \mathcal{N}(0, I_{n_i \times n_i})$, and each layer of the feedforward network can be expressed as
$$
y_i = \sigma_i v_i \|x_{i-1}\|, \qquad x_i = \sigma_i \mathrm{ReLU}(v_i) \|x_{i-1}\|.
\tag{2.12}
$$
Therefore, the norm of the last hidden layer is $\|x_{d-1}\| = \|x_0\| \prod_{i=1}^{d-1} \sigma_i \|\mathrm{ReLU}(v_i)\|$, and the norm of the network output is $\|y_d\| = \|v_d\| \|x_0\| \prod_{i=1}^{d} \sigma_i \prod_{i=1}^{d-1} \|\mathrm{ReLU}(v_i)\|$.
The output $y_i$ at each layer depends on the previous layer. However, we can show that the normalized outputs $v_i$ are mutually independent.
Lemma 23. The random vectors $v_i$ for $i = 1, \ldots, d$ are mutually independent.
Based on the above lemma, the distributions of $\Sigma(x_0, x_0)$ and $\Sigma'(x_0, x_0)$ in Theorem 1 follow easily.
2.5.2 Neural Tangent Kernel
In this section, we demonstrate our methods by proving the moments of the NTK coming from the weights. The proof is best expressed in the language of tensor network diagrams. Here we give a brief introduction to this notation; Section 2.8 gives more examples of how the tensor diagram notation corresponds to standard vector and matrix notation. We denote a vector $v$ in a diagram as a node labeled $v$: the node represents a tensor, and having only one edge coming out of the node indicates that it is a rank-1 tensor, i.e., a vector. Similarly, a matrix $V$ is represented as a node with two edges. The contraction of tensors, or matrix multiplication, is represented by an edge connecting two tensors; for example, for any two matrices $A$ and $B$, connecting their nodes by an edge represents the matrix product $AB$. For simplicity, the identity matrix is represented as an unbounded edge without a node attached to it. [40] provides a more comprehensive introduction to tensor network diagrams. The main benefit of tensor network diagrams is that traces, tensor products, and tensor contractions can be expressed in a simple and unified way without extra notation.
In a feedforward network with ReLU activation, $x_i = \mathrm{ReLU}(y_i)$ can be written in matrix form as $x_i = D_i y_i$, where $D_i$ is a diagonal matrix with $[D_i]_{j,j} = 1$ if $[y_i]_j > 0$ and $[D_i]_{j,j} = 0$ if $[y_i]_j < 0$; when $[y_i]_j = 0$ we take $[D_i]_{j,j}$ to be a Bernoulli random variable with $p = \frac{1}{2}$. Therefore the output of the feedforward network can be written as $y_d = \left( \prod_{j=1}^{d} \sigma_j \right) W_d D_{d-1} W_{d-1} \cdots D_1 W_1 x_0$. For compactness of the diagrams, we define $A = D_{d-1} W_{d-1} \cdots D_i$ and $B = D_{i-1} W_{i-1} \cdots D_1 W_1 x_0$ when we consider the NTK coming from the weights $W_i$ in the $i$-th layer. With the above definitions, the network output can be written as
$$
y_d = \left( \prod_{j=1}^{d} \sigma_j \right) W_d A W_i B,
$$
which in diagram form is a chain of nodes $W_d$, $A$, $W_i$, $B$ connected by contraction edges (see Section 2.8 for the correspondence with matrix notation). The weights $W_d$ at the last layer form a vector rather than a matrix, since we require a scalar output $y_d$.
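To make the representations above concrete, the following NumPy sketch (our own illustration, with arbitrary widths and $\sigma_i = 1$, not the chapter's code) checks numerically that the forward pass equals the masked matrix product $W_d D_{d-1} \cdots D_1 W_1 x_0$ and that the norm of the last hidden layer factorizes as $\|x_0\| \prod_i \|\mathrm{ReLU}(v_i)\|$.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small ReLU feedforward network at initialization (sigma_i = 1, biases = 0).
# The widths are illustrative choices, not the chapter's experimental settings.
widths = [8, 50, 50, 50, 1]                       # n_0, n_1, ..., n_d with n_d = 1
W = [rng.standard_normal((widths[i + 1], widths[i])) for i in range(len(widths) - 1)]
x0 = rng.standard_normal(widths[0])
x0 /= np.linalg.norm(x0)

# Forward pass, recording the ReLU masks D_i and the auxiliary v_i = W_i x_{i-1} / ||x_{i-1}||.
x, masks, relu_v_norms = x0, [], []
for i, Wi in enumerate(W):
    y = Wi @ x
    if i < len(W) - 1:                            # hidden layers
        v = y / np.linalg.norm(x)                 # per Section 2.5.1, an independent standard Gaussian
        masks.append(np.diag((y > 0).astype(float)))
        relu_v_norms.append(np.linalg.norm(np.maximum(v, 0.0)))
        x = np.maximum(y, 0.0)
    else:                                         # last layer is linear
        x = y

# Check 1: y_d equals the masked matrix product W_d D_{d-1} W_{d-1} ... D_1 W_1 x_0.
prod = W[-1]
for Wi, Di in zip(reversed(W[:-1]), reversed(masks)):
    prod = prod @ Di @ Wi
assert np.allclose(prod @ x0, x)

# Check 2: ||x_{d-1}|| = ||x_0|| * prod_i ||ReLU(v_i)||  (all sigma_i = 1 here).
x_hidden = x0
for Wi in W[:-1]:
    x_hidden = np.maximum(Wi @ x_hidden, 0.0)
assert np.isclose(np.linalg.norm(x_hidden), np.linalg.norm(x0) * np.prod(relu_v_norms))
print("masked-product and norm-factorization checks passed")
```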
The partial gradient with respect to Wi can be represented as follows: ∂yd ∂Wi = ( d∏ j=1 σj ) Wd A B The NTK is just the inner product of gradients, therefore has the following form: KWi (x0, x0) = ( d∏ j=1 σ2 j )  Wd A B Wd A B   In the above diagram, we highlight the contraction in the red box. The key idea in our method is substituting this part of the tensor network with the expectation of another random tensor network. Given a random matrix V where all elements are i.i.d. standard Gaussian variables. We have the expectation of V ⊗V , which is V tensor product with itself as EVi,jVk,l = Ii,kIj,l. This can be represented in the following diagram: E [( V V )] = 28 This relationship can be generalized to higher order with Isserlis’ theorem. For example: E     V V V V     = + + . Generally, the expectation of ⊗2r i=1V , which is 2r-th order tensor product of V to itself, can be represented as summation of (2r− 1)!! terms of tensor product of identity matrices, each of which corresponds to a pairing in a set of 2r elements. Based on Isserlis’ theorem, we can substitute the contraction part in the the diagram of KWi (x0, x0) r as follows: KWi (x0, x0) r = ∏d j=1 σ 2r j (2r − 1)!! EV ( Wd A V B )2r An extra term (2r−1)!! shows up in the above result. We next take the expectation of Wd. Since elements in Wd is also i.i.d. standard Gaussian variables, the expectation of ⊗2r i=1Wd also contains (2r − 1)!! terms of tensor product of identity matrices. Therefore, we have following result: E[KWi (x0, x0) r] = ( d∏ j=1 σ2r j ) E ( A V B A V B )r Recall that Σ(x0, x0) r = (∏d−1 j=1 σ 2r j ) ∥AWiB∥2r. By using similar auxiliary random variable in section 2.5.1, we have E[KWi (x0, x0) r] = σ2r d E[Σ(x0, x0) r]. For a network with only one hidden layer, the distribution of Σ(x0, x0) is determined by its moments according to Carleman’s condition. Therefore, the distributions of diagonal weights NTK and diagonal CK are same. For feedforward network with more than one 29 hidden layer, we need to set up auxiliary variable in backward manner, the specific proof is shown in Section 2.10. The biases NTK can be controlled in similar way. 2.5.3 Residual Network For conjugate kernel, we can adopt auxiliary random variables vi,j similar to Sec- tion 2.5.1. We only need to show vi,j are independent with each other to prove Theorem 11. The NTK can be controlled with same methods in Section 2.5.2. The only difference is that when we calculate the NTK of parameter in one branch, we need to remove the skip connection bypassing that branch in the tensor network. 2.6 Experimental Results In this section, we will validate our theoretical results with numerical experiments. 2.6.1 Verifying the Moments We sample 1000 three layers feedforward network and calculate the NTK and CK at initialization. All parameters are initialized in the distribution specified in Section 2.2. The hyper-parameters are set as n0 = n1 = n2 = 100, n2 = 1 and σi = 1 10 . The input x0 is sampled from a unit sphere. The theoretical prediction and estimation obtained in numerical experiment is shown in Table 2.1. We only show the moments for Σ(x0, x0) in this section. Moments of Σ′(x0, x0) and NTKs are moved to Section 2.9.1. 
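The moment check just described is straightforward to reproduce. The following sketch (our own re-implementation under the stated hyper-parameters, not the original experiment scripts) samples 1000 three-layer ReLU networks and compares the empirical moments of $\Sigma(x_0, x_0)$ with the closed form of Corollary 2, computing $G_2$ exactly from Eq. (2.7).

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)

def G1(n, r):
    # E[||w||^(2r)] for w ~ N(0, I_n), Eq. (2.7).
    out = 1.0
    for t in range(r):
        out *= n + 2 * t
    return out

def G2(n, r):
    # E[||ReLU(w)||^(2r)] for w ~ N(0, I_n), Eq. (2.7).
    return sum(comb(n, k) * G1(k, r) for k in range(n + 1)) / 2.0 ** n

# Setup of this subsection: d = 3 layers, hidden widths 100, sigma_i = 1/10, ||x_0|| = 1.
d, n, sigma, n_samples = 3, 100, 0.1, 1000
x0 = rng.standard_normal(n)
x0 /= np.linalg.norm(x0)

ck = np.empty(n_samples)
for s in range(n_samples):
    x = x0
    for _ in range(d - 1):                               # the d - 1 hidden layers
        x = np.maximum(sigma * rng.standard_normal((n, n)) @ x, 0.0)
    ck[s] = x @ x                                        # Sigma(x_0, x_0) = ||x_{d-1}||^2

# Corollary 2: E[Sigma^r] = ||x_0||^(2r) * prod_k sigma_k^(2r) G2(n_k, r).
for r in range(1, 5):
    predicted = (sigma ** (2 * r) * G2(n, r)) ** (d - 1)
    print(f"r = {r}: predicted {predicted:.3e}, estimated {np.mean(ck ** r):.3e}")
```

The predicted column reproduces the values reported in Table 2.1 (2.50e-1, 6.89e-2, 2.08e-2, 6.86e-3), and the Monte Carlo estimates should agree with them up to sampling error.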
We don’t include higher order moments because the long tail nature of CK distribution makes it harder to numerically estimate higher order 30 Moment Predicted Estimated EΣ(x0, x0)1 2.50e-1 2.52e-01 ± 2.55e-03 EΣ(x0, x0)2 6.89e-2 6.98e-02 ± 1.49e-03 EΣ(x0, x0)3 2.08e-2 2.13e-02 ± 7.43e-04 EΣ(x0, x0)4 6.86e-3 7.08e-03 ± 3.73e-04 Table 2.1: Verification of CK and NTK of a three layers feedforward network. moments. The theory and experiment results fit well in terms of first 4 order moments, which demonstrates the accuracy of our theoretical results. We also compare the moments of residual network in Table 2.2 by sampling 10000 residual network with ni,j = 100 and σi,j = 1 10 . Only CK moments are shown below and the full result can be found in Section 2.9.1. 2.6.2 Verifying the Log-Normal Distribution In the following experiment, we will verify our theoretical results on limiting distri- bution of diagonal CK and NTK. According to Theorem 3, the diagonal CK of feedforward network converges to the log-normal distribution in the infinite width and infinite depth limit. We demonstrate the convergence in Figure 2.2, in which we take three feedforward networks with different depth. The hyper-parameters including σi and ni are tuned to keep c = 1 and β ≈ 1. We reinitialize these networks 106 times and plot the empirical distribution of CK. Figures of empirical distribution for other settings can be found in section 2.9.2. We use Kolmogorov–Smirnov test to calculate the difference between the distributions of CK and NTK and their limiting distributions to validate Theorems 3, 6 and 9. Table 2.3 shows that the p-value for CK and NTK coming from weights and biases at first layer. 31 Moment Predicted Estimated EΣ(x0, x0)1 2.25e+0 2.25e+00 ± 4.09e-03 EΣ(x0, x0)2 5.23e+0 5.23e+00 ± 1.95e-02 EΣ(x0, x0)3 1.26e+1 1.26e+01 ± 7.33e-02 EΣ(x0, x0)4 3.13e+1 3.12e+01 ± 2.58e-01 Table 2.2: Verification of CK and NTK of a two branches residual network. Every branch is a feedforward network with one hidden layer. −3 −2 −1 0 1 2 log Σ(x0, x0) d=2 d=5 d=15 0.0 0.5 1.0 1.5 2.0 2.5 Σ(x0, x0) d=2 d=5 d=15 Figure 2.2: Distribution of CK The red line is the theoretical limiting distribution given c and β. The KS test takes 1000 samples of CK and NTK can calculate the maximum difference between the empirical distribution and the theoretical limiting distribution. The hyper- parameters are the same as neural networks used in Figure 2.2. Table 2.3 illustrates that when d = 2, we have strong evidence to support that the distributions of CK and NTK are not log-normal. However, when the depth increases to 5 and 15, it is hard to distinguish them from the log-normal distribution. We further verify the limiting behaviour in Theorems 19 and 20. For CK of residual network, we test the empirical distribution of Σ(x0, x0) against the limiting the log-normal distribution. The NTK of residual network is tested against another sample of eZΣ(x0, x0) where Σ(x0, x0) is CK of a finite sized feedforward network and eZ is independently sampled from the log normal distribution. The sample size in Table 2.4 is also 1000. We set the branch of residual network as the feedforward networks with one hidden layer and hyper- parameters are tuned so that ∑ i ci = 1 and β = 4 100 . 32 Depth Σ(x0, x0) KW1 (x0, x0) Kb1 (x0, x0) 2 1.60e-10 3.48e-09 3.48e-09 5 1.30e-01 1.44e-02 1.44e-02 15 7.13e-01 1.23e-01 1.23e-01 Table 2.3: The p-value of KS test on CK and NTK of the feedforward network. 
Branches Σ(x0, x0) KW0,1 Kb0,1 1 0.00e+00 8.28e-01 5.73e-01 5 1.65e-31 1.62e-04 3.96e-05 20 3.97e-05 7.76e-02 2.41e-01 100 1.36e-01 9.71e-02 1.34e-01 Table 2.4: The p-value of KS test on CK and NTK of the residual network. We note that the p-value for conjugate kernel in Table 2.4 gradually increases, which indicates the distribution of CK gradually converges to the log-normal distribution. The p-value for NTK in a residual network with only one branch is very high. This is expected because the variance of log-normal distribution coming from other branches is 0. We also conduct empirical study on the correlation between NTK from different parameters in Section 2.9.3. Although the joint probability is complicated, we can provide approximation on the correlation coefficient of logarithm NTKs. The statistics property of non diagonal elements of random CK and NTK are discussed in Section 2.9.4. In Section 2.9.5, the random gradient are used as random feature for classification on UCI dataset. 33 2.7 Additional Theoretical Result 2.7.1 Moments of residual network CK and NTKs Corollary 24 (Moments of CK). We have E[Σ(x0, x0) r] =∥x0∥2r m−1∏ i=0 ∑ 0≤r1,r2,r3 r1+r2+r3=r r2 is even ( r r1, r2, r3 ) 2r2(r2 − 1)!! di∏ k=1 σ2r1+r2 i,k ×G1(ni,di + r2, r1) di−1∏ k=1 G2(ni,k, r1 + r2 2 ) where r is a positive integer and we assume (−1)!! = 1. Theorem 25 (NTK of each weights). The moments of NTK coming from weights of resid- ual network with ReLU activation is E[KWs,t (x0, x0) r] =σ2r out∥x0∥2r   ds∏ k=1 σ2r s,k  G1(ns,ds , r) ds−1∏ k=1 G2(ns,k, r) m−1∏ i=0 i̸=s ∑ 0≤r1,r2,r3 r1+r2+r3=r r2 is even ( r r1, r2, r3 ) 2r2(r2 − 1)!! di∏ k=1 σ2r1+r2 i,k G1(ni,di + r2, r1) di−1∏ k=1 G2(ni,k, r1 + r2 2 ) where r is a positive integer and we assume (−1)!! == 0!! = 1. Theorem 26 (NTK of each biases). The moments of NTK coming from biases of residual 34 network with ReLU activation is E[Kbs,t (x0, x0) r] =σ2r out   ds∏ k=t+1 σ2r s,k  G1(ns,ds , r) ds−1∏ k=t G2(ns,k, r) m−1∏ i=s+1 ∑ 0≤r1,r2,r3 r1+r2+r3=r r2 is even ( r r1, r2, r3 ) 2r2(r2 − 1)!! di∏ k=1 σ2r1+r2 i,k G1(ni,di + r2, r1) di−1∏ k=1 G2(ni,k, r1 + r2 2 ) where r is a positive integer. 2.8 Additional Explanation on Tensor Diagram In this section we give the standard vector/matrix notation correspondences of all the formulas in Section 2.5.2. yd = ( d∏ j=1 σj ) Wd A Wi B yd = ( d∏ j=1 σj ) WdAWiB ∂yd ∂Wi = ( d∏ j=1 σj ) Wd A B ∂yd ∂[Wi]j,k = ( d∏ j=1 σj ) [A⊤W⊤ d ]jBk 35 KWi (x0, x0) = ( d∏ j=1 σ2 j )  Wd A B Wd A B  KWi (x0, x0) = ( d∏ j=1 σ2 j )∥∥∥A⊤W⊤ d ∥∥∥ 2∥B∥2 E [( V V )] = E [Vi,jVk,l] = δi,kδj,l E     V V V V     = + + E[Vi1,j1Vi2,j2Vi3,j3Vi4,j4] = δi1,i2δi3,i4δj1,j2δj3,j4 +δi1,i3δi2,i4δj1,j3δj2,j4 +δi1,i4δi2,i3δj1,j4δj2,j3 KWi (x0, x0) r = ∏d j=1 σ 2r j (2r − 1)!! EV ( Wd A V B )2r KWi (x0, x0) r = ∏d j=1 σ 2r j (2r − 1)!! EV (WdAV B)2r E[KWi (x0, x0) r] = ( d∏ j=1 σ2r j ) E ( A V B A V B )r E[KWi (x0, x0) r] = ( d∏ j=1 σ2r j ) E∥AV B∥2r 36 2.9 Additional Experiment Result 2.9.1 Other moments of CK and NTK Table 2.5: Verification of CK and NTKs of a three layer feedforward network. 
Moment Predicted Estimated EΣ(x0, x0)1 2.50e-1 2.52e-01 ± 2.55e-03 EΣ(x0, x0)2 6.89e-2 6.98e-02 ± 1.49e-03 EΣ(x0, x0)3 2.08e-2 2.13e-02 ± 7.43e-04 EΣ(x0, x0)4 6.86e-3 7.08e-03 ± 3.73e-04 EΣ′(x0, x0)1 2.50e-3 2.72e-03 ± 1.41e-04 EΣ′(x0, x0)2 2.07e-5 2.73e-05 ± 3.76e-06 EΣ′(x0, x0)3 3.12e-7 5.19e-07 ± 1.54e-07 EΣ′(x0, x0)4 7.21e-9 1.49e-08 ± 7.37e-09 EKW1 (x0, x0)1 2.50e-3 2.48e-03 ± 2.55e-05 EKW1 (x0, x0)2 6.89e-6 6.81e-06 ± 1.48e-07 EKW1 (x0, x0)3 2.08e-8 2.06e-08 ± 7.42e-10 EKW1 (x0, x0)4 6.86e-11 6.81e-11 ± 3.77e-12 EKb1 (x0, x0)1 2.50e-1 2.48e-01 ± 2.55e-03 EKb1 (x0, x0)2 6.89e-2 6.81e-02 ± 1.48e-03 EKb1 (x0, x0)3 2.08e-2 2.06e-02 ± 7.42e-04 EKb1 (x0, x0)4 6.86e-3 6.81e-03 ± 3.77e-04 EKW2 (x0, x0)1 2.50e-3 2.51e-03 ± 2.63e-05 EKW2 (x0, x0)2 6.89e-6 6.98e-06 ± 1.56e-07 EKW2 (x0, x0)3 2.08e-8 2.15e-08 ± 7.97e-10 EKW2 (x0, x0)4 6.86e-11 7.28e-11 ± 4.19e-12 EKb2 (x0, x0)1 5.00e-1 4.97e-01 ± 3.65e-03 EKb2 (x0, x0)2 2.62e-1 2.61e-01 ± 3.86e-03 EKb2 (x0, x0)3 1.44e-1 1.43e-01 ± 3.29e-03 EKb2 (x0, x0)4 8.29e-2 8.28e-02 ± 2.66e-03 EKW3 (x0, x0)1 2.50e-3 2.52e-03 ± 2.55e-05 EKW3 (x0, x0)2 6.89e-6 6.98e-06 ± 1.49e-07 EKW3 (x0, x0)3 2.08e-8 2.13e-08 ± 7.43e-10 EKW3 (x0, x0)4 6.86e-11 7.08e-11 ± 3.73e-12 EKb3 (x0, x0)1 1.00e+00 1.00e+00 ± 0.00e+00 EKb3 (x0, x0)2 1.00e+00 1.00e+00 ± 0.00e+00 EKb3 (x0, x0)3 1.00e+00 1.00e+00 ± 0.00e+00 EKb3 (x0, x0)4 1.00e+00 1.00e+00 ± 0.00e+00 37 Table 2.6: Verification of CK and NTKs of a two branches residual network. Every branch is a feedforward network with one hidden layer. Moment Predicted Estimated EΣ(x0, x0)1 2.25e+0 2.25e+00 ± 4.09e-03 EΣ(x0, x0)2 5.23e+0 5.23e+00 ± 1.95e-02 EΣ(x0, x0)3 1.26e+1 1.26e+01 ± 7.33e-02 EΣ(x0, x0)4 3.13e+1 3.12e+01 ± 2.58e-01 EΣ′(x0, x0)1 2.25e-2 2.24e-02 ± 3.29e-04 EΣ′(x0, x0)2 1.57e-3 1.58e-03 ± 5.47e-05 EΣ′(x0, x0)3 1.89e-4 1.92e-04 ± 1.26e-05 EΣ′(x0, x0)4 3.29e-5 3.24e-05 ± 3.45e-06 EKW0,1 (x0, x0)1 7.50e-3 7.51e-03 ± 2.27e-05 EKW0,1 (x0, x0)2 6.13e-5 6.15e-05 ± 3.89e-07 EKW0,1 (x0, x0)3 5.43e-7 5.49e-07 ± 5.69e-09 EKW0,1 (x0, x0)4 5.22e-9 5.30e-09 ± 8.38e-11 EKb0,1 (x0, x0)1 7.50e-1 7.51e-01 ± 2.27e-03 EKb0,1 (x0, x0)2 6.13e-1 6.15e-01 ± 3.89e-03 EKb0,1 (x0, x0)3 5.43e-1 5.49e-01 ± 5.69e-03 EKb0,1 (x0, x0)4 5.22e-1 5.30e-01 ± 8.38e-03 EKW0,2 (x0, x0)1 7.50e-3 7.53e-03 ± 2.26e-05 EKW0,2 (x0, x0)2 6.13e-5 6.18e-05 ± 3.89e-07 EKW0,2 (x0, x0)3 5.43e-7 5.52e-07 ± 5.71e-09 EKW0,2 (x0, x0)4 5.22e-9 5.33e-09 ± 8.44e-11 EKb0,2 (x0, x0)1 1.50e+0 1.50e+00 ± 2.89e-03 EKb0,2 (x0, x0)2 2.33e+0 2.33e+00 ± 9.18e-03 EKb0,2 (x0, x0)3 3.76e+0 3.76e+00 ± 2.30e-02 EKb0,2 (x0, x0)4 6.30e+0 6.28e+00 ± 5.41e-02 EKW1,1 (x0, x0)1 7.50e-3 7.50e-03 ± 2.25e-05 EKW1,1 (x0, x0)2 6.13e-5 6.13e-05 ± 3.84e-07 EKW1,1 (x0, x0)3 5.43e-7 5.44e-07 ± 5.59e-09 EKW1,1 (x0, x0)4 5.22e-9 5.23e-09 ± 8.27e-11 EKb1,1 (x0, x0)1 5.00e-1 4.99e-01 ± 1.33e-03 EKb1,1 (x0, x0)2 2.68e-1 2.67e-01 ± 1.46e-03 EKb1,1 (x0, x0)3 1.53e-1 1.53e-01 ± 1.31e-03 EKb1,1 (x0, x0)4 9.32e-2 9.27e-02 ± 1.16e-03 EKW1,2 (x0, x0)1 7.50e-3 7.50e-03 ± 2.22e-05 EKW1,2 (x0, x0)2 6.13e-5 6.11e-05 ± 3.78e-07 EKW1,2 (x0, x0)3 5.43e-7 5.41e-07 ± 5.46e-09 EKW1,2 (x0, x0)4 5.22e-9 5.17e-09 ± 7.84e-11 EKb1,2 (x0, x0)1 1.00e+00 9.99e-01 ± 1.41e-03 EKb1,2 (x0, x0)2 1.02e+00 1.02e+00 ± 2.90e-03 EKb1,2 (x0, x0)3 1.06e+00 1.06e+00 ± 4.58e-03 EKb1,2 (x0, x0)4 1.12e+00 1.12e+00 ± 6.61e-03 EKWout(x0, x0)1 2.25e-2 2.25e-02 ± 4.09e-05 EKWout(x0, x0)2 5.23e-4 5.23e-04 ± 1.95e-06 EKWout(x0, x0)3 1.26e-5 1.26e-05 ± 7.33e-08 EKWout(x0, x0)4 3.13e-7 3.12e-07 ± 2.58e-09 38 2.9.2 The limiting log normal distribution We 
compare the distribution of random NTK for finite width feedforward network in Figure 2.3. The red lines in the figure are the limiting log-normal distribution as defined in Theorems 6 and 9. As depth increases, the distribution of weight NTK and bias NTK gradually approach the limiting distribution. −3 −2 −1 0 1 2 logKW0 (x0, x0) d=2 d=5 d=15 −2 0 2 4 6 logKb0 (x0, x0) d=2 d=5 d=15 Figure 2.3: Distribution of NTK The red line is the theoretical limiting distribution given c and β. We then demonstrate the convergence of distribution of CK and NTK for residual network in fig. 2.4. The red lines are computed according to theorems 19 and 20. Note that the red line in right figure is only similar but not identical to the probability density of Gaussian distribution. As the number of branches increases, the CK converge to a log- normal distribution. As for NTK for weights in first layer of first branch, when m = 1, the NTK of residual network is exactly the NTK of a feedforward network, therefore the curve fit perfectly. When m ≥ 2, the real distribution is not same as limiting distribution but converge to it as branches increase. 39 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 log Σ(x0, x0) m=1 m=5 m=20 −3.0 −2.5 −2.0 −1.5 −1.0 −0.5 0.0 0.5 logKW0,1 (x0, x0) m=1 m=5 m=20 Figure 2.4: Distribution of CK and NTK The red line is the theoretical limiting distribution given c and β. 2.9.3 Correlation between CK and NTKs The joint distribution of CK and NTK coming from different parameters are non- trivial. We conduct phenomenological analysis on their correlation coefficient. Correlation between weight NTKs We first focus on correlation between weight NTKs. The CK is included in this case since KWd (x0, x̃0) = σ2 dΣ(x0, x̃0) always holds true. −4 −2 0 2 log(KW1 (x0, x0)) −4 −2 0 2 lo g (K W 2 (x 0 ,x 0 )) (a) layer 1 vs layer 2 −4 −2 0 2 log(KW1 (x0, x0)) −6 −4 −2 0 2 lo g (K W 5 (x 0 ,x 0 )) (b) layer 1 vs layer 5 −4 −2 0 2 log(KW1 (x0, x0)) −5 −4 −3 −2 −1 0 1 2 3 lo g (K W 1 5 (x 0 ,x 0 )) (c) layer 1 vs layer 15 −6 −4 −2 0 2 log(KW5 (x0, x0)) −5 −4 −3 −2 −1 0 1 2 3 lo g (K W 6 (x 0 ,x 0 )) (d) layer 5 vs layer 6 Figure 2.5: Joint distribution of logarithm NTK of weights from different layer. We consider d = 15 layer feedforward network with a uniform width n = 70 for each layer. The correlation between logarithm of several weights NTK are shown in Figure 2.5. By induction, we get the following pattern: The closer the two weights are in the feed forward network, the more relevant the 40 logarithm of their NTK. We give an explanation of this phenomenon based on random tensor network. Without loss of generality, we assume i < j and consider two NTK KWi (x0, x0) and KWj (x0, x0). We can describe NTKs as follows where a = ∏ i σ 2 i is a constant, C = Di−1Wi−1 . . . D1W1x0, B = Dj−1 . . .Wi+1Di and A = WdDd−1 . . .Wj+1Dj. KWi (x0, x0) = a   A C A C B B Wj Wj   KWj (x0, x0) = a ( A C A C B B Wi Wi ) . Let the singular value decomposition of B be B = U⊤SV , we have KWi (x0, x0) = a ∥ ∥ ∥SUW⊤ j A ⊤ ∥ ∥ ∥ 2∥C∥2 KWj (x0, x0) = a∥A∥2∥SVWiC∥2. We further let wi = 1 ∥C∥ WiC and wj = 1 ∥A∥ W⊤ j A ⊤ so KWi (x0, x0) = a∥A∥2∥C∥2∥SUwj∥2 KWj (x0, x0) = a∥A∥2∥C∥2∥SV wi∥2. Wi and Wj are dependent, therefore wi and wj are also independent. We assume that most co-variance between logarithm of weights NTK comes from the identical terms ∥A∥2 and ∥C∥2. The distributions of these values are similar to log normal distribution, and the 41 random coefficient β is proportional to d n . 
Cov[logKWi (x0, x0), logKWj (x0, x0)] ≈V ar[log ∥A∥2] + V ar[log ∥C∥2] ∝d− j n + i− 1 n = d− 1 n − j − i n We can further estimate correlation coefficient as Cor[KWi (x0, x0), KWj (x0, x0)] ≈ 1− j − i n j− i is just the distance between two weights in the feedforward network. Therefore, the correlation between logarithm of weights NTK decreases with the distance between layers of these weights. However, the above estimation is only qualitative. It tends to underestimate the correlation as shown in Figure 2.6a because the co-variance from ∥SUwj∥2 and ∥SV wi∥2 are neglected. 0.4 0.6 0.8 Correlation Coefficient 0.0 0.2 0.4 0.6 0.8 E s t im a t io n x=y (a) Correlation coefficients between weight NTKs from different layers. 1 3 5 7 9 11 13 15 Layer 0.0 0.2 0.4 0.6 0.8 1.0 C o rr el a ti o n C o effi ci en t estimate (b) Correlation coefficients between weight NTK and bias NTK in same layer. Figure 2.6: Estimation of correlation coefficients between weight NTKs and bias NTKs. 42 Correlation between weight NTK and bias NTK We next show the correlation between weight NTK KWi (x0, x0) and bias NTK Kbi (x0, x0), where Wi and bi are weight and bias from same layer. −4 −2 0 2 log(KW1 (x0, x0)) −2 0 2 4 6 lo g (K b 1 (x 0 ,x 0 )) (a) layer 1 −4 −2 0 2 log(KW2 (x0, x0)) −1 0 1 2 3 4 5 6 7 lo g (K b 2 (x 0 ,x 0 )) (b) layer 2 −4 −2 0 2 log(KW8 (x0, x0)) 0 1 2 3 4 5 lo g (K b 8 (x 0 ,x 0 )) (c) layer 8 −4 −2 0 2 log(KW14 (x0, x0)) 2.0 2.5 3.0 3.5 4.0 lo g (K b 1 4 (x 0 ,x 0 )) (d) layer 14 Figure 2.7: Joint distribution of logarithm NTK of weights NTK and bias NTK. According to Figure 2.7, the weight NTK and bias NTK from shallower layers in feedforward network comes with higher correlation. Based on similar argument as previous section, we have KWi (x0, x0) = a∥A∥2∥B∥2 and Kbj (x0, x0) = a∥A∥2 where A = Di−1 . . . D1W1x0. The co-variance and variance of weights NTK and variance of bias NTK can be estimated to be proportional to d− i, d− 1 and d − i respectively. The correlation can be further estimated as √ d−i d−1 . Experiment in Figure 2.6b shows that this this estimation is almost tight. Correlation between weight NTK of residual network The pattern that closer weights in the network have higher correlation for logarithm NTK still applies to residual network. However, one main difference is that correlation of weights NTK across branches are con- trolled by variance σi,j of the network. With same definition of ci in Theorem 14, we have the correlation increases with ∑ i ci as shown in Figure 2.8. 43 −3.5 −3.0 −2.5 −2.0 −1.5 −1.0 log(KW0,1 (x0, x0)) −3.5 −3.0 −2.5 −2.0 −1.5 −1.0 lo g (K W 1 , 1 (x 0 ,x 0 )) (a) ∑ i ci = 1 10 11 12 13 14 15 16 log(KW0,1 (x0, x0)) 10 11 12 13 14 15 16 lo g (K W 1 , 1 (x 0 ,x 0 )) (b) ∑ i ci = 20 Figure 2.8: Joint distribution of logarithm NTK of weights NTK for residual network with 20 branches. 2.9.4 Non Diagonal Elements −1 0 1 2 3 4 5 Σ(x0, x ′ 0 ) d=2 d=5 d=15 (a) −10 −8 −6 −4 −2 0 2 4 log Σ(x0, x ′ 0 ) d=2 d=5 d=15 (b) −6 −4 −2 0 2 log(KW1 (x0, x0)) −6 −4 −2 0 2 lo g (K W 2 (x 0 ,x 0 )) (c) −6 −4 −2 0 2 log(KW1 (x0, x0)) −6 −4 −2 0 2 lo g (K W 1 0 (x 0 ,x 0 )) (d) Figure 2.9: Distribution of CK and joint distribution of NTK from different layers. We pick two unit length vector x0 and x′ 0 with angle π 3 . The distribution of CK and NTK is plotted in Figure 2.9. In logarithmic chart, we discard the negative part of the kernel. The network settings are same as Figure 2.2 where randomness coefficient β ≈ 1. The distribution has long tail similar to the diagonal case. 
However, non-diagonal elements of random kernel is not guaranteed to be positive. The non-diagonal elements of NTK from weights in different layers also shows similar pattern as diagonal elements. Figures 2.9a and 2.9b show that NTKs from weights in 44 different layer are more irrelevant when two weights are further. We next show the random NTK on a unit circle for feedforward network and residual network. We set the network width as n = 250 for each network. 10 independently sampled NTKs and their mean are shown in Figure 2.10. As the depth of feedforward network increases, the NTK becomes increasingly noisy and the mean degenerate into a constant plus a delta function. Therefore the NTK for deep feedforward network behave like a white noise. The residual network in Figure 2.10c has same depth as the deep feedforward network, but the NTK is much less noisy and the mean NTK remains informative. 0.0 0.5 1.0 1.5 2.0 2.5 3.0 θ 1.5 2.0 2.5 3.0 3.5 4.0 K (x 0 ,x ′ 0 ) (a) Feedforward Network (d=5) 0.0 0.5 1.0 1.5 2.0 2.5 3.0 θ 0 25 50 75 100 125 150 175 K (x 0 ,x ′ 0 ) (b) Feedforward Network (d=100) 0.0 0.5 1.0 1.5 2.0 2.5 3.0 θ 5000 6000 7000 8000 9000 10000 11000 K (x 0 ,x ′ 0 ) (c) Residual Network (d=5,m=20) Figure 2.10: Distribution of CK and joint distribution of NTK from different layers. 2.9.5 Classification with random NTK Following the setting of [22], we consider classification on UCI dataset based on SVM with random NTK and compare the result with NTK of infinite width network. The depth of the finite width network is set to be same as the infinite width network. The width of the feedforward network are tuned to achieve β ≈ 1. The average of 100 random NTK are sampled to reduce the variance. The result is shown in table 2.7. 45 Task Fixed NTK (β = 0) Random NTK (β ≈ 1) average 81.95% 81.54% abalone 65.49% 65.52% acute-inflammation 100.00% 100.00% acute-nephritis 100.00% 100.00% arrhythmia 71.90% 72.12% balance-scale 98.40% 97.28% balloons 100.00% 100.00% bank 89.87% 89.58% blood 79.28% 78.21% breast-cancer 73.59% 73.24% breast-cancer-wisc 97.29% 97.43% breast-cancer-wisc-diag 96.83% 97.18% breast-cancer-wisc-prog 80.61% 84.18% breast-tissue 70.19% 71.15% car 98.55% 97.92% cardiotocography-10clases 86.11% 85.03% cardiotocography-3clases 93.08% 92.84% chess-krvkp 99.12% 99.03% congressional-voting 61.93% 61.93% conn-bench-sonar-mines-rocks 85.10% 84.62% 46 contrac 53.60% 54.55% credit-approval 86.48% 86.19% cylinder-bands 78.32% 78.32% dermatology 96.98% 96.98% echocardiogram 85.61% 85.61% ecoli 86.61% 86.31% energy-y1 95.05% 89.84% energy-y2 87.89% 88.80% fertility 88.00% 88.00% flags 50.00% 50.52% glass 68.40% 66.98% haberman-survival 73.68% 73.68% heart-cleveland 59.54% 57.89% heart-hungarian 85.96% 85.27% heart-switzerland 47.58% 44.35% heart-va 33.50% 31.00% hepatitis 82.05% 83.97% ilpd-indian-liver 72.77% 71.92% ionosphere 94.89% 93.75% iris 97.30% 97.97% led-display 74.20% 75.40% lenses 83.33% 87.50% 47 libras 85.83% 85.28% low-res-spect 93.80% 93.98% lung-cancer 53.12% 46.88% lymphography 87.16% 85.81% mammographic 81.04% 81.88% molec-biol-promoter 88.46% 86.54% molec-biol-splice 87.01% 85.32% musk-1 87.82% 88.87% oocytes merluccius nucleus 4d 84.12% 81.18% oocytes merluccius states 2f 92.94% 91.86% oocytes trisopterus nucleus 2f 86.62% 85.09% oocytes trisopterus states 5b 94.41% 92.21% ozone 97.16% 97.16% parkinsons 91.84% 93.88% pima 75.52% 75.91% pittsburg-bridges-MATERIAL 91.35% 91.35% pittsburg-bridges-REL-L 73.08% 72.12% pittsburg-bridges-SPAN 71.74% 70.65% 
pittsburg-bridges-T-OR-D 84.00% 85.00% pittsburg-bridges-TYPE 70.19% 71.15% planning 71.11% 71.11% plant-margin 83.81% 83.62% 48 plant-shape 73.19% 71.81% plant-texture 84.94% 84.69% post-operative 72.73% 72.73% primary-tumor 52.13% 54.88% seeds 92.79% 92.79% semeion 95.67% 94.22% spambase 94.52% 94.46% statlog-australian-credit 68.02% 68.02% statlog-german-credit 77.60% 77.00% statlog-heart 88.06% 88.06% statlog-image 98.05% 98.01% statlog-vehicle 83.41% 79.86% steel-plates 77.78% 76.80% synthetic-control 99.50% 99.50% teaching 59.87% 55.92% tic-tac-toe 99.79% 98.54% titanic 78.95% 78.95% trains 87.50% 87.50% vertebral-column-2clases 86.04% 84.42% vertebral-column-3clases 84.09% 82.47% waveform 86.64% 86.50% waveform-noise 86.48% 85.96% 49 wine 97.73% 98.30% wine-quality-red 65.62% 64.81% wine-quality-white 67.22% 64.77% yeast 59.77% 59.84% zoo 96.00% 99.00% Table 2.7: Classification with fixed NTK and random NTK 2.10 Proof of Main Theory 2.10.1 Proof of Theorem 1 According to section 2.5.1, we have