ABSTRACT

Title of Dissertation: UNDERSTANDING AND ENHANCING MACHINE LEARNING MODELS WITH THEORETICAL FOUNDATIONS
Zhengmian Hu, Doctor of Philosophy, 2024
Dissertation Directed by: Professor Heng Huang, Department of Computer Science

Machine learning has become a key driver of many contemporary technological advancements. With its empirical success comes an urgent need for theoretical research that explains and complements these practical achievements. This includes understanding the empirical success of machine learning, especially deep learning, and aiding the design of better algorithms in terms of performance, efficiency, and security. This dissertation aims to advance the understanding and practical development of machine learning through three interrelated research directions, while emphasizing reliable theoretical guarantees throughout.

In the first part, we study deep learning theory under overparameterization. The core objects of study are the Conjugate Kernel and the Neural Tangent Kernel, which have deep connections to the training dynamics of deep learning. Based on the analysis of these kernels, we prove several new concentration results characterizing the trainability and generalization of infinitely wide neural networks.

In the second part, we focus on training algorithms. On one hand, we propose new algorithms to improve learning efficiency, including a new underdamped Langevin MCMC method called ALUM, for which we prove that its complexity reaches the theoretical lower bound. On the other hand, we propose new theoretical tools to analyze existing algorithms and obtain tighter convergence results. For ProxSkip, our analysis shows that it still achieves an improvement in communication complexity, from sublinear to linear convergence, under a stochastic oracle. We also generalize the concept of Lipschitz smoothness for tighter non-convex optimization analysis.

In the third part, we develop new Monte Carlo methods for large language models (LLMs) to improve their efficiency and security. We develop unbiased watermarking techniques to protect model outputs and propose an Accelerated Speculative Sampling method for faster inference. We also investigate the trade-off between watermark strength and inference sampling efficiency, pointing out the conflict between the two.

UNDERSTANDING AND ENHANCING MACHINE LEARNING MODELS WITH THEORETICAL FOUNDATIONS

by Zhengmian Hu

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2024

Advisory Committee:
Professor Heng Huang, Chair/Advisor
Professor Furong Huang
Professor Tianyi Zhou
Professor Ang Li
Professor Min Wu, Dean's Representative

© Copyright by Zhengmian Hu 2024

To my parents.

Acknowledgments

First and foremost, I would like to express my sincere gratitude to my PhD advisor, Dr. Heng Huang, for his invaluable encouragement, guidance, and unwavering support throughout my PhD journey. His belief in my potential and his spirit of hard work have been a constant source of inspiration throughout my academic journey. I appreciate his insightful advice, patience, and the freedom and resources he provided, which were instrumental in shaping my research. I would also like to thank my committee members, Dr. Furong Huang, Dr. Tianyi Zhou, Dr. Ang Li, and Dr. Min Wu, for their gracious commitment to serve on my PhD dissertation committee. Moreover, I would also like to thank Dr.
Furong Huang, Dr. Tianyi Zhou, and Dr. Mohammad Hajiaghayi for serving on my preliminary exam committee. I am truly grateful for the time and effort they have dedicated to my academic growth. Thanks to all members in Huang Lab, including Dr. Feihu Huang, Dr. Bin Gu, Dr. Hongchang Gao, Dr. Yanfu Zhang, Dr. Lei Luo, Dr. Ziyi Chen, Dr. Kamran Ghasedi, Dr. Zhouyuan Huo, Dr. An Xu, Dr. Runxue Bao, Dr. Shangqian Gao, Dr. Chao Li, Dr. Peiran Yu, Dr. Zhenyi Wang, Dr. Junfeng Guo, Guodong Liu, Wenhan Xian, Xiaoqian Dou, Jiaping Zhu, Alireza Ganjdanesh, Junyi Li, Yihan Wu, Xidong Wu, Lichang Chen, Yanshuo Chen, Ruibo Chen, Chenxi Liu, Qi He, Tianyi Xiong, Yan Wen, and Tong Zheng, for all the great times we had together. I feel so lucky to work with so many excellent iii colleagues. I am especially thankful for the deep expertise in optimization that Huang Lab possesses, as being immersed in this environment has greatly benefited my research in optimization. I extend my thanks to my roommates and close collaborators, Xidong Wu and Yan- shuo Chen, for listening and offering me advice, and to my other collaborators, including Gang Wu, Jianhui Sun, Hongyang Zhang and Huimin Wu. To all those mentioned and many others who have supported me along the way, thank you for being part of this incredible journey. Finally, I want to thank my family for the education they have provided me and their continuing and unconditional support. iv Table of Contents Dedication ii Acknowledgements iii Table of Contents v List of Tables ix List of Figures xii List of Abbreviations xv Chapter 1: Introduction 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 I Better Understanding Deep Learning with Overparame- terization 9 Chapter 2: Analyzing the Conjugate Kernel and Neural Tangent Kernel 10 2.1 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.2 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.3 Feedforward Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.4 Residual Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.5 Overview of Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.7 Additional Theoretical Result . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.8 Additional Explanation on Tensor Diagram . . . . . . . . . . . . . . . . . . 35 2.9 Additional Experiment Result . . . . . . . . . . . . . . . . . . . . . . . . . 37 2.10 Proof of Main Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 Chapter 3: Bridging the Gap Between Empirical Risk Minimization and Bayesian Approaches in Overparameterized Neural Networks 60 3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 3.2 Training by Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 3.3 Transformative Bayesian Learning . . . . . . . . . . . . . . . . . . . . . . . 68 v 3.4 Connections between Bayesian Learning and Optimization . . . . . . . . . 70 3.5 Interpolation of Optimization and Bayesian Learning . . . . . . . . . . . . 71 3.6 Overparameterized Neural Network . . . . . . . . . . . . . . . . . . . . . . 74 3.7 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 3.8 Related Works . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . 82 3.9 Discussion on the strength of assumption . . . . . . . . . . . . . . . . . . . 84 3.10 Additional Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 3.11 Analysis of Infinitely Wide Network . . . . . . . . . . . . . . . . . . . . . . 88 3.12 Additional Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . 101 3.13 Additional Discussion on Algorithms . . . . . . . . . . . . . . . . . . . . . 101 II Analyzing and Enhancing Learning Algorithms 109 Chapter 4: Optimal Underdamped Langevin MCMC Method 110 4.1 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 4.2 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 4.3 AcceLerated ULD-MCMC (ALUM) methods . . . . . . . . . . . . . . . . . 115 4.4 Theoretical analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 4.5 Information-based complexity . . . . . . . . . . . . . . . . . . . . . . . . . 124 4.6 Optimality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 4.7 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 4.8 Extra discussions on ULD . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 4.9 Proof of convergence result and discretization error . . . . . . . . . . . . . 143 4.10 Proof of information-based complexity lower bound . . . . . . . . . . . . . 174 4.11 Extra discussion on experiments . . . . . . . . . . . . . . . . . . . . . . . . 190 Chapter 5: Tighter analysis for Proxskip 204 5.1 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207 5.2 New Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209 5.3 Continuous Limit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212 5.4 Main Idea of Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 5.5 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218 5.6 ProxSkip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227 5.7 SProxSkip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240 5.8 ProxSkip-VR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248 5.9 GradSkip+ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255 Chapter 6: Generalizing Lipschitz Smoothness for Tighter Analysis of Nonconvex Optimization 267 6.1 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268 6.2 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270 6.3 Beyond Lipschitz Smoothness . . . . . . . . . . . . . . . . . . . . . . . . . 271 6.4 Lookahead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274 6.5 Local SGD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281 vi 6.6 Negative Curvature for Stochastic Compositional Optimization . . . . . . . 286 6.7 Comparison to related works . . . . . . . . . . . . . . . . . . . . . . . . . . 291 6.8 Basic lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292 6.9 Lookahead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299 6.10 Local SGD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305 6.11 Negative curvature for stochastic compositional optimization . . . . . . . . 311 6.12 Additional experiments . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . 313 6.13 Experiment setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314 III Enhancing Large Language Model with Monte Carlo 317 Chapter 7: Unbiased Watermark for Large Language Models 318 7.1 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320 7.2 Warm up: undetectability in a simplified toy environment . . . . . . . . . . 322 7.3 Watermarking with unbiased reweighting . . . . . . . . . . . . . . . . . . . 324 7.4 Statistical hypothesis testing for watermark detection . . . . . . . . . . . . 331 7.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335 7.6 Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338 7.7 Detailed definition and additional proofs . . . . . . . . . . . . . . . . . . . 341 7.8 Additional discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353 7.9 Likelihood-agnostic watermark score . . . . . . . . . . . . . . . . . . . . . 355 7.10 Detailed experiment setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 358 7.11 More experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360 7.12 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365 7.13 Ethics Consideration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366 Chapter 8: Accelerated Speculative Sampling Based on Tree Monte Carlo 370 8.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373 8.2 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375 8.3 Tree Monte Carlo (TMC) . . . . . . . . . . . . . . . . . . . . . . . . . . . 377 8.4 Accelerated Speculative Sampling (ASpS) . . . . . . . . . . . . . . . . . . 411 8.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423 8.6 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425 8.7 Full Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 428 Chapter 9: Inevitable Trade-off between Watermark Strength and Speculative Sam- pling Efficiency for Language Models 434 9.1 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436 9.2 Two Reweight Framework for Accelerated Generation of Watermarked Tokens439 9.3 No-go Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441 9.4 Algorithms for Maintaining Watermark Strength or Sampling Efficiency . . 447 9.5 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 450 9.6 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452 9.7 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456 vii 9.8 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460 9.9 U Score, DeltaGubel Reweight, and Gamma Reweight . . . . . . . . . . . . 462 9.10 Extension to Variants of Speculative Sampling . . . . . . . . . . . . . . . . 464 9.11 Ethical Consideration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465 9.12 Limitation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466 9.13 Additional Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . 467 Bibliography 480 viii List of Tables 2.1 Verification of CK and NTK of a three layers feedforward network. . . . . 31 2.2 Verification of CK and NTK of a two branches residual network. 
Every branch is a feedforward network with one hidden layer. . . . . . . . . . . . 32 2.3 The p-value of KS test on CK and NTK of the feedforward network. . . . . 33 2.4 The p-value of KS test on CK and NTK of the residual network. . . . . . . 33 2.5 Verification of CK and NTKs of a three layer feedforward network. . . . . 37 2.6 Verification of CK and NTKs of a two branches residual network. Every branch is a feedforward network with one hidden layer. . . . . . . . . . . . 38 2.7 Classification with fixed NTK and random NTK . . . . . . . . . . . . . . . 50 4.1 Summary of gradient complexity of sampling methods, which is defined as number of gradient evaluation of ∇fi(x) needed to sample from m-strongly- log-concave distributions up to ε √ d/m accuracy in 2-Wasserstein distance where ε ≤ 1 is the target accuracy, d is the dimension. ULA, LPM, RMM, and ALUM are full gradient methods, therefore, gradient complexities for them are sample size N times the iteration complexities. Only dependence on d, ε, and N are shown below. The dependence of batch size b is made clear in Table 4.3. The dependence of condition number κ is discussed in Section 4.6. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 4.2 Iteration complexities for full gradient ALUM on both sampling and approx- imation problem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 4.3 Gradient complexity for SAGA-ALUM and SVRG-ALUM. . . . . . . . . . 124 4.4 The summary of different datasets used in our experiments. . . . . . . . . 193 5.1 Different stochastic local methods for strongly convex optimization. . . . . 218 7.1 Performance of different watermarking methods on TS and MT. We use F1 scores of BERTScore and scale BERTScore and ROUGE-1 with a factor of 100. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336 7.2 Mean and variance of score per token for different reweighting methods and different tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337 7.3 Text sampled from OPT-6.7B, with and without watermarks. For ”No water- mark” (NW), the score is computed based on δ-reweight. When watermarks are included, the corresponding reweighting function is used for computing score. The optimal perturbation strengths d obtained by grid search are 0.9, 0.0, 0.0 for three outputs respectively. . . . . . . . . . . . . . . . . . . 337 ix 7.4 Additional result about the performance of different watermarking methods on TS. We scale BERTScore and ROUGE with a factor of 100. . . . . . . . 360 7.5 Additional result about the performance of different watermarking methods on MT. We scale BERTScore with a factor of 100. . . . . . . . . . . . . . . 361 7.6 Score per token when the estimated token distribution is computed from a different temperature than the real token distribution. . . . . . . . . . . . . 362 7.7 Score per token when the estimated token distribution is computed from a different top k than the real token distribution. . . . . . . . . . . . . . . . 362 7.8 Score per token when the estimated token distribution is computed with and without input. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362 7.9 Score per token when the estimated token distribution is computed from a different model than the real token distribution. . . . . . . . . . . . . . . . 362 7.10 Mean and variance of score per token for δ-reweight based on Gumbel trick on different tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
363 7.11 Additional result with T5 for translation tasks and LLaMA 2 [1] for sum- marization and poem generation. . . . . . . . . . . . . . . . . . . . . . . . 363 7.12 AUC for different watermarking detection methods under different pertur- bation strength. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364 7.13 AUC for different watermarking detection methods under different deletion portion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365 8.1 Single Token Generation with Speculative Sampling . . . . . . . . . . . . . 374 8.2 An Illustration of Speculative Sampling Coupling . . . . . . . . . . . . . . 375 8.3 An Illustration of Accelerated Speculative Sampling . . . . . . . . . . . . . 375 8.4 Desired Joint Distribution in a Toy Example . . . . . . . . . . . . . . . . . 376 8.5 Summary of accepted token numbers per step improvement by ASpS over SpS. n is the number of reference token. . . . . . . . . . . . . . . . . . . . 424 8.6 Summary of accepted token numbers per step (ATN) and per token time (PTN) improvement and perplexity (PPL) change for ASpS over SpS. n is the number of reference token. We use LLaMa-7b model as target model and LLaMa-68m model as reference model on machine translation task. . . 428 8.7 Summary of accepted token numbers per step (ATN) and per token time (PTN) improvement and perplexity (PPL) change for ASpS over SpS. n is the number of reference token. We use LLaMa-7b model as target model and LLaMa-68m model as reference model on text summarization task. . . 429 8.8 Summary of accepted token numbers per step (ATN) and per token time (PTN) improvement and perplexity (PPL) change for ASpS over SpS. n is the number of reference token. We use LLaMa-13b model as target model and LLaMa-68m model as reference model on open-ended text generation task. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429 8.9 Summary of accepted token numbers per step (ATN) and per token time (PTN) improvement and perplexity (PPL) change for ASpS over SpS. n is the number of reference token. We use Opt-13b model as target model and Opt-125m model as reference model on open-ended text generation task. . 429 x 8.10 Summary of accepted token numbers per step (ATN) and per token time (PTN) improvement and perplexity (PPL) change for ASpS over SpS. n is the number of reference token. We use Opt-6.7b model as target model and Opt-125m model as reference model on open-ended text generation task. . 430 8.11 Summary of accepted token numbers per step (ATN) and per token time (PTN) improvement and perplexity (PPL) change for ASpS over SpS. n is the number of reference token. We use GPT-Neo-2.7B model as target model and GPT-2 model as reference model on open-ended text generation task. . 430 8.12 Summary of accepted token numbers per step (ATN) and per token time (PTN) improvement and perplexity (PPL) change for ASpS over SpS. n is the number of reference token. We use GPT-2-xl model as target model and GPT-2 model as reference model on open-ended text generation task. . . . 430 8.13 Summary of accepted token numbers per step (ATN) and per token time (PTN) improvement and perplexity (PPL) change for ASpS over SpS. n is the number of reference token. We use LLaMa-7b model as target model and LLaMa-68m model as reference model on open-ended text generation task. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
431 8.14 Summary of accepted token numbers per step (ATN) and per token time (PTN) improvement and perplexity (PPL) change for ASpS over SpS. n is the number of reference token. We use LLaMa-2-7b model as target model and LLaMa-68m model as reference model on open-ended text generation task. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431 8.15 Summary of accepted token numbers per step (ATN) and per token time (PTN) improvement and perplexity (PPL) change for ASpS over SpS. n is the number of reference token. We use LLaMa-2-7b model as target model and LLaMa-68m model as reference model on machine translation task. . . 431 8.16 Summary of accepted token numbers per step (ATN) and per token time (PTN) improvement and perplexity (PPL) change for ASpS over SpS. n is the number of reference token. We use LLaMa-2-7b model as target model and LLaMa-68m model as reference model on text summarization task. . . 432 8.17 Summary of accepted token numbers per step (ATN) and per token time (PTN) improvement and perplexity (PPL) change for ASpS over SpS. n is the number of reference token. We use LLaMa-2-13b model as target model and LLaMa-68m model as reference model on open-ended text generation task. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432 9.1 Text summarization task with LLaMa-7b model [1] as target model and LLaMa-68m model [2] as reference model. . . . . . . . . . . . . . . . . . . 472 9.2 Text summarization task with LLaMa-13b model [1] as target model and LLaMa-68m model [2] as reference model. . . . . . . . . . . . . . . . . . . 474 9.3 Open-ended text generation task with LLaMa-7b model [1] as target model and LLaMa-68m model [2] as reference model. . . . . . . . . . . . . . . . . 476 9.4 Open-ended text generation task with LLaMa-13b model [1] as target model and LLaMa-68m model [2] as reference model. . . . . . . . . . . . . . . . . 478 xi List of Figures 2.1 Different regimes in the study of CK and NTK. Network width and depth are denoted as d and n. Black line and Blue line: CK and NTK converge to fixed values. Green line: For feedforward network, diagonal elements of CK and NTK of each parameter converge to log-normal distribution. The variance of the log value is proportional to d n . Diagonal elements of residual network NTK converges to the product of CK of a single feedforward network and a log normal distribution random variable. Red line: For feedforward network, the variance of CK and NTK is unbounded. For residual network, diagonal elements of CK converges to log-normal distribution, and diagonal elements of NTK converges in law to a log-normal distributed variable times CK of a feedforward network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.2 Distribution of CK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.3 Distribution of NTK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 2.4 Distribution of CK and NTK . . . . . . . . . . . . . . . . . . . . . . . . . 40 2.5 Joint distribution of logarithm NTK of weights from different layer. . . . . 40 2.6 Estimation of correlation coefficients between weight NTKs and bias NTKs. 42 2.7 Joint distribution of logarithm NTK of weights NTK and bias NTK. . . . . 43 2.8 Joint distribution of logarithm NTK of weights NTK for residual network with 20 branches. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 2.9 Distribution of CK and joint distribution of NTK from different layers. . . 
44 2.10 Distribution of CK and joint distribution of NTK from different layers. . . 45 3.2 Comparison of dynamics in parameter space and function space. . . . . . . 80 3.1 Hessian trace for one randomly initialized network (left) and the probability density of Hessian trace at initialization (right). . . . . . . . . . . . . . . . 80 3.3 Test accuracy for ensemble trained by GD and TransBL with β = exp(−40) for weight clipping (left). Distribution of weights in TransBL (middle). Trade-off between sampling efficiency and test accuracy (right). . . . . . . 82 3.4 The KL divergence, training and testing loss gap, and time dependence of weight distribution of experiment in Section 3.7.3. . . . . . . . . . . . . . 102 3.5 Bayesian learning process . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 3.6 Illsutration of TransBL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 3.7 Interpolation between optimization and Bayesian learning . . . . . . . . . 104 4.1 Trajectory error for different algorithms1. The vertical lines are the error bar. FG means full gradient, SG means stochastic gradient. . . . . . . . . . 129 4.2 Discretization error of full gradient methods on australian dataset.2 . . . . 131 xii 4.3 Discretization error for SVRG-ALUM and SAGA-ALUM with different batch sizes on australian dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 4.4 Estimated trajectory error of RMM with different segments number. . . . . 192 4.5 Logistic regression on phishing dataset. . . . . . . . . . . . . . . . . . . . . 195 4.6 Logistic regression on german dataset. . . . . . . . . . . . . . . . . . . . . 196 4.7 Logistic regression on mushrooms dataset. . . . . . . . . . . . . . . . . . . 197 4.8 Discretization error on australian dataset for different algorithms under dif- ferent step size.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201 4.9 Discretization error for SVRG-ALUM and SAGA-ALUM with different batch sizes on australian dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . 202 4.10 Sampling from posterior of Bayesian logistic regression model on australian dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202 4.11 Sampling from posterior of Bayesian logistic regression model on phishing dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202 4.12 Sampling from posterior of Bayesian logistic regression model on german dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203 4.13 Sampling from posterior of Bayesian logistic regression model on mushrooms dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203 5.1 Relationship between step size and decreasing speed of Lyapunov function when µ = 0.1 and L = 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207 5.2 ProxSkip experiment for comparing of different step sizes. . . . . . . . . . . 211 5.3 Convergence of ODEProx. . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 5.4 Illustration of old Lyapunov function . . . . . . . . . . . . . . . . . . . . . 215 5.5 Illustration of new Lyapunov function . . . . . . . . . . . . . . . . . . . . . 216 5.6 SProxSkip experiment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221 6.1 Upper bound based on lower smoothness. . . . . . . . . . . . . . . . . . . . 272 6.2 Exponential divergence for gradient flow on nonconvex objective function. . 
272 6.3 Similar to Figure 6.2 but with different initialization. . . . . . . . . . . . . 273 6.4 A counter example showing that Lookahead with too large horizon doesn’t converge. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274 6.5 Convergence of ResNet-18 on CIFAR-10. . . . . . . . . . . . . . . . . . . . 278 6.6 Different convergence conditions for local SGD. . . . . . . . . . . . . . . . 282 6.7 Estimated Hessian eigenspectrum for ResNet-18 at random initialization. . 286 6.8 Dynamics of minimum Hessian eigenvalue, function value, and gradient dur- ing training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287 6.9 Estimated minimum negative curvature for different networks. . . . . . . . 289 6.10 Dynamics of ratio of minimum and maximum Hessian eigenvalues. . . . . . 290 6.12.1Additional result for wide network. . . . . . . . . . . . . . . . . . . . . . . 313 6.12.2Additional result for lazy network. . . . . . . . . . . . . . . . . . . . . . . . 314 7.3.1 Illustration of δ-reweight. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326 7.3.2 Illustration of γ-reweight. . . . . . . . . . . . . . . . . . . . . . . . . . . . 326 7.4.1 Distribution of perplexity of output for TS and BLEU score for MT. . . . . 335 xiii 9.0.1 Taxonomy of watermarking and speculative sampling trade-offs in language models. The ideal case of maintaining both watermark strength and sam- pling efficiency is proven to be impossible by the no-go theorem. The pro- posed algorithms focus on maintaining either watermark strength or sam- pling efficiency. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437 9.5.1 Comparison of different methods. The x-axis shows the Average Accepted Tokens Per Step (AATPS) as a measure of speculative sampling efficiency, while y-axis shows the Average Negative Log P-value Per Token (ANLPPT) as a measure of watermark strength. The P-value is computed based on ei- ther a likelihood-based test using the maximin-LLR score (left) or a likelihood- agnostic test using the U score (right). Watermarking is performed using either the DeltaGumbel reweight (top) or the Gamma reweight (bottom). Error bars represent 3σ confidence intervals4. . . . . . . . . . . . . . . . . 453 9.13.1Text summarization task with LLaMa-13b model [1] as target model and LLaMa-68m model [2] as reference model. . . . . . . . . . . . . . . . . . . 468 9.13.2Open-ended text generation task with LLaMa-7b model [1] as target model and LLaMa-68m model [2] as reference model. . . . . . . . . . . . . . . . 469 9.13.3Open-ended text generation task with LLaMa-13b model [1] as target model and LLaMa-68m model [2] as reference model. . . . . . . . . . . . . . . . 
470

List of Abbreviations

ALUM AcceLerated ULD-MCMC
APGD Accelerated Proximal Gradient Descent
ASpS Accelerated Speculative Sampling
BART Bidirectional and Auto-Regressive Transformers
BLEU Bilingual Evaluation Understudy (score)
CK Conjugate Kernel
CNN-DM CNN/Daily Mail (dataset)
DNNs Deep Neural Networks
ERM Empirical Risk Minimization
FL Federated Learning
KL Kullback-Leibler (divergence)
LLM(s) Large Language Model(s)
LLR Log Likelihood Ratio
LPM Left Point Method
MBart Multilingual BART
MCMC Markov Chain Monte Carlo
MCTS Monte Carlo Tree Search
MGF Moment-Generating Function
MSE Mean Squared Error
MT Machine Translation
NAG Nesterov's Accelerated Gradient
NNGP Neural Network Gaussian Process
NTK Neural Tangent Kernel
ODE Ordinary Differential Equation
PAC Probably Approximately Correct
PGD Proximal Gradient Descent
ProxSkip-VR ProxSkip with Variance Reduction
ReLU Rectified Linear Unit
RMM Randomized Midpoint Method
ROUGE Recall-Oriented Understudy for Gisting Evaluation
SAGA Stochastic Average Gradient Algorithm
SDE Stochastic Differential Equation
SGD Stochastic Gradient Descent
SpS Speculative Sampling
SProxSkip Stochastic ProxSkip
SVRG Stochastic Variance Reduced Gradient
TMC Tree Monte Carlo
TransBL Transformative Bayesian Learning
TS Text Summarization
TV Total Variation (distance)
ULD Underdamped Langevin Dynamics
VA Variational Approximation
WMT Workshop on Statistical Machine Translation

Chapter 1: Introduction

1.1 Motivation

Machine learning is a technique that transforms massive amounts of data into predictive and decision-making tools. In recent years, the development of machine learning has been driven by model improvements and massively increased computing power, promoting social progress and scientific and technological advancements.

The explosion of data creates a need for effectively extracting patterns from big data, especially when the underlying mechanisms are complex, while avoiding overfitting. The vast amount of data provides us with more training samples, allowing us to build more complex and accurate models, thereby reducing model bias. However, big data also brings challenges such as data noise, which can lead to increased model variance. Neural networks have achieved a good balance in this situation. With their powerful representation capabilities and flexible structural design, neural networks have become a practical low-bias method that can effectively capture complex features and patterns in data. With appropriate regularization techniques and practical experience, the variance of neural networks can also be well controlled. Neural networks have demonstrated excellent scalability to large-scale data scenarios.

However, neural networks face challenges in terms of trainability and generalization ability. Theoretically, neural networks are non-convex, which can lead to local minima. The multi-layer structure also brings optimization difficulties, such as vanishing or exploding gradients. The trainability of neural networks lacks theoretical guarantees. Moreover, the generalization ability of neural networks lacks rigorous theoretical guarantees. Although neural networks have empirically shown good generalization performance on many tasks, we still do not have a complete theoretical framework to explain and predict this generalization ability. Traditional statistical learning theories, such as VC dimension theory and Rademacher complexity, have significant limitations in explaining the generalization behavior of neural networks.
Recent research on over-parameterization has shed light on these problems. As the width of neural networks increases, the loss landscape becomes smoother, local minima decrease, and gradient descent can find global minima, a phenomenon known as the "blessing of width." It allows us to characterize trainability and generalization ability under a rigorous mathematical framework, where the training dynamics is largely captured by the Neural Tangent Kernel. Part I of this dissertation discusses our contribution in this area.

The explosive growth of data poses challenges to training algorithms, which essentially aim to find the optimal model parameters by minimizing an objective function. Common objective functions include the empirical loss and PAC-Bayesian generalization bounds, the latter leading to Bayesian learning. Efficient training requires controlling computational complexity. Theoretical research on the complexity of training algorithms is crucial for providing performance guarantees, such as convergence rate, sample complexity, and communication complexity, for evaluating algorithm performance, and for comparing algorithms. The core results are upper and lower bounds on complexity, and pursuing matching upper and lower bounds is one of the research goals of complexity theory. Understanding algorithm convergence helps in understanding the strengths, weaknesses, applicability, and limitations of algorithms, guiding algorithm improvement and design. Part II of this dissertation discusses the progress in this area.

Inference efficiency is also important, especially in the context of large language models (LLMs), which have made significant breakthroughs in natural language processing. Efficient inference is crucial for applications such as dialogue systems, text generation, and question-answering systems. The probabilistic structure of LLMs, being auto-regressive models, allows for the application of advanced Monte Carlo methods to accelerate inference. Chapter 8 of this dissertation presents work in this direction, applying novel Tree Monte Carlo methods to further accelerate LLM inference.

While LLMs have achieved remarkable success in natural language processing, they also pose security and ethical challenges. LLMs can be maliciously exploited to generate false information, spread rumors, or engage in other activities that harm society. Due to the high readability and persuasiveness of the text generated by LLMs, this false information can mislead the public and have serious negative impacts. Therefore, preventing the misuse of LLMs while ensuring their performance is an urgent problem to be addressed. An important question is how to track and identify the content generated by LLMs to prevent their misuse. Watermarking techniques provide a promising approach to address this issue. We find that watermarking can be modeled as a Monte Carlo process, which enables us to effectively embed watermark information without affecting the performance of LLMs.

Furthermore, considering the inference cost of LLMs, how to ensure LLM watermarking while improving inference efficiency is an urgent problem to be solved. We find this to be inherently challenging: the embedding of watermarks and the acceleration of inference are fundamentally in conflict. Theoretically characterizing this inherent trade-off deepens understanding and provides new ideas and methods for solving the security and efficiency problems of LLMs. Part III of this dissertation discusses our contribution in this area of research.
1.2 Dissertation Outline

Better Understanding Deep Learning with Overparameterization. In Chapter 2, we investigate the distributions of the Conjugate Kernel (CK) and Neural Tangent Kernel (NTK) for ReLU networks with random initialization. We derive the precise distributions and moments of the diagonal elements of these kernels. For a feedforward network, these values converge in law to a log-normal distribution when the network depth d and width n simultaneously tend to infinity, and the variance of the log diagonal elements is proportional to d/n. For the residual network, in the limit where the number of branches m increases to infinity and the width n remains fixed, the diagonal elements of the Conjugate Kernel converge in law to a log-normal distribution whose log value has variance proportional to 1/n, and the diagonal elements of the NTK converge in law to a log-normally distributed variable times the conjugate kernel of one feedforward network. Our theoretical results suggest that the residual network remains trainable in the limit of infinite branches and fixed network width. We also conduct numerical experiments to validate the soundness of our theoretical analysis.

Chapter 3 proposes a novel algorithm, Transformative Bayesian Learning (TransBL), which bridges the gap between empirical risk minimization (ERM) and Bayesian learning for neural networks. We compare ERM, which optimizes with gradient descent, and Bayesian learning with importance sampling in terms of their generalization and computational complexity. We derive the first algorithm-dependent PAC-Bayesian generalization bound for infinitely wide networks based on an exact KL divergence between the trained posterior distribution obtained by infinitesimal step size gradient descent and a Gaussian prior. Moreover, we show how to transform gradient-based optimization into importance sampling by incorporating a weight. While Bayesian learning has better generalization, it suffers from low sampling efficiency. Optimization methods, on the other hand, have good sampling efficiency but poor generalization. Our proposed algorithm TransBL enables a trade-off between generalization and sampling efficiency.

Analyzing and Enhancing Learning Algorithms. In Chapter 4, we study the underdamped Langevin diffusion (ULD) with a strongly-convex potential consisting of a finite summation of N smooth components, and propose an efficient discretization method, which requires O(N + d^{1/3}N^{2/3}/ε^{2/3}) gradient evaluations to achieve ε-error (in √(E∥·∥_2^2) distance) for approximating the d-dimensional ULD. Moreover, we prove a lower bound on the gradient complexity of Ω(N + d^{1/3}N^{2/3}/ε^{2/3}), which indicates that our method is optimal in its dependence on N, ε, and d. In particular, we apply our method to sample from strongly-log-concave distributions and obtain gradient complexity better than all existing gradient-based sampling algorithms. Experimental results on both synthetic and real-world data show that our new method consistently outperforms the existing ULD approaches.

In Chapter 5, we provide a tighter analysis for ProxSkip, an algorithm that allows fewer proximal operator computations to solve composite optimization problems. We improve the existing decreasing speed of the Lyapunov function from O(p^2) to O(p) when p, the frequency of the proximal operator evaluations, is small enough. Our theoretical analysis also reveals the drawbacks of using large step sizes for gradient descent in ProxSkip when the proximal operator part is the bottleneck.
Our main motivation comes from the continuous limit, in which the original analysis of ProxSkip fails to guarantee convergence when both the step size γ and the frequency p tend to zero. We construct a counterexample to demonstrate why such counterintuitive behavior occurs for the original analysis, and then propose a novel Lyapunov function variant to construct a tighter analysis, avoiding the problem of the old one. Such a new Lyapunov function can be directly extended to many other variants of ProxSkip. When applied to the stochastic gradient setup, our analysis leads to an improved proximal operator complexity for SProxSkip, from O(√(1/(εµ^2)) log(1/ε)) to O(√κ log(1/ε)).

Negative and positive curvatures affect optimization in different ways. However, a lot of existing optimization theory is based on the Lipschitz smoothness assumption, which cannot differentiate between the two. In Chapter 6, we propose to use two separate assumptions for positive and negative curvatures, so that we can study the different implications of the two. We analyze the Lookahead and Local SGD methods as concrete examples. Both of them require multiple copies of model parameters and communication among them every certain period of time in order to prevent divergence. We show that the minimum communication frequency is inversely proportional to the negative curvature, and when the negative curvature becomes zero, we recover the existing theoretical results for convex optimization. Finally, both experimentally and theoretically, we demonstrate that modern neural networks have highly unbalanced positive/negative curvatures. Thus, an analysis based on separate positive and negative curvatures is more pertinent.

Enhancing Large Language Model with Monte Carlo. The recent advancements in large language models (LLMs) have sparked a growing apprehension regarding their potential misuse. One approach to mitigating this risk is to incorporate watermarking techniques into LLMs, allowing for the tracking and attribution of model outputs. This study examines a crucial aspect of watermarking: how significantly watermarks impact the quality of model-generated outputs. Previous studies have suggested a trade-off between watermark strength and output quality. However, our research in Chapter 7 demonstrates that it is possible to integrate watermarks without affecting the output probability distribution given an appropriate implementation. We refer to this type of watermark as an unbiased watermark. This has significant implications for the use of LLMs, as it becomes impossible for users to discern whether a service provider has incorporated watermarks or not. Furthermore, the presence of watermarks does not compromise the performance of the model in downstream tasks, ensuring that the overall utility of the language model is preserved. Our findings contribute to the ongoing discussion around responsible AI development, suggesting that unbiased watermarks can serve as an effective means of tracking and attributing model outputs without sacrificing output quality.

Speculative Sampling (SpS) has been introduced to speed up the inference of large language models (LLMs) by generating multiple tokens in a single forward pass under the guidance of a reference model, while preserving the original distribution. We observe that SpS can be derived through maximum coupling on the token distribution; a minimal sketch of this per-token coupling is given below.
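To make this maximum-coupling view concrete, the following sketch implements the standard per-token speculative sampling rule (accept a draft token x ∼ q with probability min(1, p(x)/q(x)), otherwise resample from the normalized residual). The function name and the toy distributions are illustrative only; this is the baseline per-token rule, not the accelerated tree-based method of Chapter 8.

```python
import numpy as np

def sps_single_token(p, q, rng):
    """Sample one token from target distribution p via a draft from reference q.

    This is the maximum-coupling construction: the output is exactly p-distributed,
    and the draft token is kept with probability sum_x min(p[x], q[x]).
    """
    x = rng.choice(len(q), p=q)                  # draft token proposed from q
    if rng.random() < min(1.0, p[x] / q[x]):
        return x                                 # accept the draft token
    residual = np.maximum(p - q, 0.0)            # otherwise resample from the residual
    return rng.choice(len(p), p=residual / residual.sum())

rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])                    # toy target (large-model) distribution
q = np.array([0.3, 0.5, 0.2])                    # toy reference (draft-model) distribution
samples = [sps_single_token(p, q, rng) for _ in range(100_000)]
print(np.bincount(samples) / len(samples))       # empirical frequencies approach p
```

Applying this rule one token at a time is the incremental coupling that the next paragraph argues is suboptimal on tree-structured generative distributions.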
However, we find that this approach is not optimal, as it applies maximum coupling incrementally for each new token rather than seeking a global maximum coupling, which yields a faster algorithm given the tree-space nature of LLM generative distributions. In Chapter 8, we shift our focus from distributions on a token space to those on a tree space. We propose a novel class of Tree Monte Carlo (TMC) methods, demonstrating their unbiasedness and convergence. As a particular instance of TMC, our new algorithm, Accelerated Speculative Sampling (ASpS), outperforms traditional SpS by generating more tokens per step on average, achieving faster inference while maintaining the original distribution.

Given that we have watermarking techniques that inject watermarks into the generated content without altering the output quality, and acceleration techniques, specifically speculative sampling, that leverage a draft model to speed up the sampling process while preserving the output distribution, we further investigate whether it is possible to simultaneously accelerate the sampling process and inject watermarks into the generated content. In Chapter 9, we find that the integration of watermarking and acceleration is non-trivial. We prove a no-go theorem, which states that it is impossible to simultaneously maintain the highest watermark strength and the highest sampling efficiency. Furthermore, we propose two methods that maintain either the sampling efficiency or the watermark strength, but not both. Our work provides a rigorous theoretical foundation for understanding the inherent trade-off between watermark strength and sampling efficiency in accelerating the generation of watermarked tokens for large language models. We also conduct numerical experiments to validate our theoretical findings and demonstrate the effectiveness of the proposed methods.

Part I
Better Understanding Deep Learning with Overparameterization

Chapter 2: Analyzing the Conjugate Kernel and Neural Tangent Kernel

Deep Neural Networks (DNNs) have been successfully applied to numerous applications such as computer vision [3, 4], natural language processing [5] and speech recognition [6], where DNNs often achieved superior performance compared to conventional machine learning algorithms and can also be used as a feature extractor (representation learning) for other methods.

Despite the extensive success, the DNN training process and the mechanism of generalization are not fully understood, due to the difficulties raised by the non-convexity of the loss function and the complication of optimization methods. One way to simplify the analysis of DNNs is to train the parameters in only one layer while keeping the parameters of the other layers fixed. Jarrett et al. [7] demonstrated that training only the last layer of a neural network can achieve competitive performance. Since all parameters of the previous layers are fixed, the outputs at the last hidden layer F(x, θ) of a non-linear DNN are considered as a random feature vector whose inner product is known as the Conjugate Kernel (CK) [8, 9, 10, 11]: Σ(x, y) = F(x, θ)^⊤ F(y, θ).

The other way to simplify the analysis of DNNs is to use an infinitesimal step size and conduct a continuous-time analysis [12, 13, 14]. The training process of the parameters θ and the neural network output F(x, θ) then turns into two first-order ordinary differential equations (ODEs):

dθ_i/dt = − (∂L/∂F(x, θ)) (∂F(x, θ)/∂θ_i),   (2.1)

dF(y, θ)/dt = − ∑_i (∂F(y, θ)/∂θ_i) (∂F(x, θ)/∂θ_i)^⊤ (∂L/∂F(x, θ)).   (2.2)
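As a quick numerical illustration of (2.1) and (2.2) (a toy sketch, not part of this chapter's experiments; the network size, scaling, and variable names are chosen arbitrarily), the following discretizes the parameter dynamics with a small step size and checks that the induced change of the output on a probe input matches the kernel form on the right-hand side of (2.2).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256                                    # hidden width of a toy two-layer ReLU net
W1 = rng.standard_normal((n, 3))           # first-layer weights
w2 = rng.standard_normal(n)                # output-layer weights
x = rng.standard_normal(3)                 # training input appearing in the loss
y = rng.standard_normal(3)                 # probe input for the output dynamics
target, eta = 1.0, 1e-4                    # small step size mimics the gradient flow

def value_and_grads(inp):
    h = np.maximum(W1 @ inp, 0.0)                              # hidden activations
    out = w2 @ h / np.sqrt(n)                                  # scalar output F(inp, theta)
    dW1 = np.outer(w2 * (W1 @ inp > 0), inp) / np.sqrt(n)      # dF/dW1
    dw2 = h / np.sqrt(n)                                       # dF/dw2
    return out, dW1, dw2

F_x, dW1_x, dw2_x = value_and_grads(x)
F_y, dW1_y, dw2_y = value_and_grads(y)
dL_dF = F_x - target                       # gradient of L = (F(x) - target)^2 / 2

# Discretized (2.1): one small gradient-descent step on the parameters.
W1 -= eta * dL_dF * dW1_x
w2 -= eta * dL_dF * dw2_x

# (2.2) predicts the output change -eta * K(y, x) * dL/dF, where
# K(y, x) = sum_i dF(y)/dtheta_i * dF(x)/dtheta_i.
K_yx = np.sum(dW1_y * dW1_x) + np.sum(dw2_y * dw2_x)
F_y_after = value_and_grads(y)[0]
print(F_y_after - F_y, -eta * K_yx * dL_dF)   # agreement up to O(eta^2)
```

The two printed numbers agree up to terms of order η², which is exactly the kernel gradient descent behaviour discussed next.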
The first ODE shows that the dynamics of the parameters θ follows gradient descent on a highly non-convex loss function L(F(x, θ)). However, the dynamics of the output F(y, θ) enjoys the form of kernel gradient descent, where the loss L is usually convex. The neural tangent kernel (NTK) is therefore defined as follows [15]:

K(x, y) = ∑_i (∂F(y, θ)/∂θ_i) (∂F(x, θ)/∂θ_i)^⊤.   (2.3)

Both CK and NTK depend on the inputs and parameters of the neural network. They are potentially complicated due to the randomness of the parameters at initialization and the changing of the parameters during the training process.

One line of research bypasses this complexity by conducting the analysis in the infinite-width limit with fixed depth. Neal [16], Williams [17], Lee et al. [18], de G. Matthews et al. [19] showed that with proper scaling, a neural network with random initialization converges to a Gaussian process in the infinite-width limit. Therefore all elements of the conjugate kernel converge to fixed values, and the CK is usually called the Neural Network Gaussian Process (NNGP) kernel.

The training dynamics under this regime also simplifies dramatically. Jacot et al. [15] showed that as the network width increases, the variance of the NTK for a randomly initialized DNN reduces to zero and the NTK remains fixed during the training phase.

Another similar regime is to increase the depth to infinity at a slower rate than the network width. Under this regime, DNNs largely preserve the property of fixed CK and NTK. Huang et al. [20] studied the limiting NTK of feedforward and residual networks by increasing widths to infinity first, and then increasing depths. Their result shows that the limiting NTK of the feedforward network degenerates into a delta function as the depth increases to infinity.

The fact that CK and NTK converge to fixed values in the infinite width limit reveals important properties of DNNs. First, it shows that gradient descent optimizes wide DNNs to a global minimum. During the training of an infinite width neural network, the change of parameters reduces to zero and the network behaves like a linear model. The training dynamics of the output is precisely a kernel regression. With proper initialization and a well-behaved loss function, the training loss will converge to zero. Second, it explains the generalization of an overparameterized network. The NTK governs the generalization property of training an infinitely wide neural network, while the CK is related to the generalization when we only train the last layer. The limiting CK and NTK only depend on the network depth, the scaling of the variance at each layer, and the choice of non-linearity function. The closed-form expression of CK and NTK allows directly evaluating an infinite width network and can be used for Kernel Regression or Kernel Support Vector Machines [21, 22].

However, the success of DNNs has not been fully explained by the fixed CK and NTK. The NTK of an infinite width network generally has no feature learning capability, because it is data-independent. The training of DNNs, on the other hand, is mostly considered a feature selection process. Arora et al. [21], Chizat et al. [23] demonstrate that linearized DNNs with a fixed NTK can have a substantial performance gap compared to finite width DNNs trained with gradient descent. The above limitations of existing DNN analysis methods motivate us to study the prior distribution of CK and NTK for finite width DNNs.
However, most existing works focus on controlling the deviation of finite width DNNs from the infinite width regime, rather than on particular properties associated with finite width. This chapter aims to study the CK and NTK for finite width networks and their limiting behaviour when they converge to a random variable. The main contributions of this chapter are summarized as follows:

• This chapter provides a new analysis of CK and NTK under finite width and depth conditions. The precise distributions of the diagonal elements of CK and NTK for every parameter are derived. Our results are applicable to both feedforward networks and residual networks with the ReLU activation function.

• This chapter derives upper bounds for every order of moments of the random CK and NTK, instead of only their variance as in previous works. Our upper bound on the r-th order moment is proportional to exp(\binom{r}{2} β), where the randomness coefficient β depends on the network settings. The fast-increasing higher order moments demonstrate the long tail nature of CK and NTK for finite width networks.

• Our new results can be generalized to the infinite width and depth case, where we show that the limiting distributions of CK and NTK with random initialization are closely related to the log-normal distribution. Our results endorse the existing practice of modeling activation and gradient norms in neural networks as log-normally distributed.

• We show that the randomness coefficient β for the residual network can remain bounded even in the limit of infinite depth and finite width. Therefore, an infinitely deep residual network with proper scaling is still trainable.

Figure 2.1: Different regimes in the study of CK and NTK. Network width and depth are denoted as d and n. Black line and Blue line: CK and NTK converge to fixed values. Green line: For the feedforward network, diagonal elements of CK and NTK of each parameter converge to a log-normal distribution; the variance of the log value is proportional to d/n. Diagonal elements of the residual network NTK converge to the product of the CK of a single feedforward network and a log-normally distributed random variable. Red line: For the feedforward network, the variance of CK and NTK is unbounded. For the residual network, diagonal elements of CK converge to a log-normal distribution, and diagonal elements of NTK converge in law to a log-normally distributed variable times the CK of a feedforward network.

A comparison of the properties of CK and NTK under the different regimes of interest is shown in Figure 2.1.

2.1 Related Works

Convergence of Wide Networks. For infinite width neural networks, the NTK is a fixed value during training and the training trajectory is the same as kernel regression [15, 24]. This is not true for any finite width network. However, neural networks with large but finite width still enjoy similar properties. More specifically, the NTK is almost fixed during training, and the training loss decreases to almost zero [21, 25, 26, 27]. These works characterize the size of the deviations around the limiting behaviour in certain overparameterized regimes.

Finite Width Correction. The research works along this direction focus on the deviation of the NTK from its expectation without an overparameterization assumption. Hanin and Nica [28] provided an upper bound on the variance of the diagonal elements of the NTK based on path counting. Littwin et al. [29] extended the result to networks with a residual structure. These results do not reveal the heavy tail nature of the prior distribution of CK and NTK, as only second order moments are discussed.
Dyer and Gur-Ari [30] presented bounds on general correlation functions in DNNs. However, most results are limited to linear networks or remain conjectures.

Log-Normal Distribution in Neural Networks. The log-normal distribution has been used to approximate the output rate in neuroscience [31, 32]. Hanin and Nica [33], Gurbuzbalaban and Hu [34] give theoretical guarantees that the norm of the feedforward network output converges to a log-normal distribution. In machine learning with artificial neural networks, the log-normal distribution is also observed to serve as a better approximation of the gradient distribution than the Gaussian distribution, and it inspires a new technique for quantized training in [35].

2.2 Preliminary

Feedforward Networks. We consider a feedforward network with fully connected layers and ReLU activation. Denoting the input as x_0, each layer is defined as:

y_i = σ_i W_i x_{i−1} + b_i,   i = 1, . . . , d,
x_i = ReLU(y_i),   i = 1, . . . , d − 1,   (2.4)

where f(x_0) = y_d is the output of the network and x_{d−1} is the output of the last hidden layer. The length of the vectors y_i and x_i is n_i. The parameters of the network are initialized as independent random variables (W_i)_{j,k} ∼ N(0, 1) and b_i = 0. Note that although b_i is set to 0 at initialization, this parameter is still trainable and thus contributes to the NTK. We assume that the output is a scalar and n_d = 1 while analyzing the NTK of the feedforward network, and require the output to be a vector with the same size as the input x_0 when the feedforward network is used as one branch in a residual network.

Residual Network. We consider a residual network where every residual block is a feedforward network with ReLU activation. With input x_0, for i = 0, · · · , m − 1, we have:

x_{i,0} = x_i,
y_{i,j} = σ_{i,j} W_{i,j} x_{i,j−1} + b_{i,j},   j = 1, . . . , d_i,
x_{i,j} = ReLU(y_{i,j}),   j = 1, . . . , d_i − 1,   (2.5)
x_{i+1} = x_i + y_{i,d_i},
y_out = σ_out W_out x_m,

where m is the total number of residual branches and f(x_0) = y_out is the output of the network. The length of the vectors y_{i,j} and x_{i,j} is n_{i,j}. The length of the vector x_i is n_i = n_{i,0} = n_{i,d_i} = n. The length of the output y_out is n_out = 1. The parameters are initialized in a similar way as for feedforward networks. The additional fully connected layer helps change the dimension of the output and simplifies the analysis of the NTK.

Conjugate Kernel. The conjugate kernel is defined as the inner product of the last hidden layer. More specifically, the CK is Σ(x_0, x̃_0) = x_{d−1}^⊤ x̃_{d−1} for feedforward networks and Σ(x_0, x̃_0) = x_m^⊤ x̃_m for the residual network. We also define the conjugate kernel of the output as the inner product of the output layer of the network: for feedforward networks Σ′(x_0, x̃_0) = y_d^⊤ ỹ_d, and for residual networks Σ′(x_0, x̃_0) = y_out^⊤ ỹ_out. These values arise in the analysis of the residual network NTK.

Neural Tangent Kernel. The NTK is defined as the inner product of the gradients with respect to the network parameters. Therefore, the NTK can be decomposed into a summation of smaller NTKs, each of which comes from training a unique parameter in the network. More specifically,

K(x_0, x̃_0) = ∑_θ K_θ(x_0, x̃_0),   K_θ(x_0, x̃_0) = (∂f(x_0)/∂θ) (∂f(x̃_0)/∂θ)^⊤,   (2.6)

where θ is the weight or bias of one layer. In this chapter, we consider the case where f(x_0) is a scalar, thus the NTK can be described as a scalar as well. But our results can be easily extended to the case where the network has a vector output, and the NTK K(x_0, x̃_0) is a matrix.

Norm of Gaussian Random Vector. We introduce two functions that will be used in
For an n-dimensional standard Gaussian vector w ∼ N (0, In×n), we define the following two functions regarding to even order moments of w and ReLU(w): G1(n, r) = E [ ∥w∥2r ] = r−1∏ t=0 (n+ 2t) (2.7) G2(n, r) = E [ ∥ReLU(w)∥2r ] = 1 2n n∑ k=0 ( n k ) G1(k, r) For large n, we have: G1(n, r) = nr ( 1 + 2 ( r 2 ) 1 n +O( 1 n2 ) ) (2.8) G2(n, r) = ( n 2 )r ( 1 + 5 ( r 2 ) 1 n +O( 1 n2 ) ) Log-Normal Distribution eZ , where Z follows Gaussian distribution, is said to follow log- normal distribution. When Z ∼ N (−β 2 , β), we have the r-th order moment as E[(eZ)r] = exp (( r 2 ) β ) . We use parameter β to characterize the randomness of such random variable. The log-normal distribution is heavy-tail in the sense that moment generating function E[exp ( teZ ) ] is infinite for all positive t. 2.3 Feedforward Networks We first introduce the precise distribution of diagonal elements of conjugate kernel. Theorem 1 (CK of feedforward network). The diagonal elements of conjugate kernel Σ(x0, x0) have the same distribution as ∥x0∥2∏d−1 k=1 σ 2 k∥ReLU(vk)∥2 at initialization where vk ∼ N (0, Ink×nk ) is the independent standard Gaussian random vector. 18 Moreover, Σ′(x0, x0) has the same distribution as ∥x0∥2σ2 d∥vd∥2∏d−1 k=1 σ 2 k∥ReLU(vk)∥2 at initialization. We can further calculate the moments of CK as follows. Corollary 2 (Moments of feedforward network CK). The r-th order moment of CK of d-layer feedforward network with ReLU activation is: E[Σ(x0, x0) r] =∥x0∥2r ( d−1∏ k=1 σ2r k G2(nk, r) ) =∥x0∥2rcr ( exp (( r 2 ) β ) + O( d∑ i=1 1 n2 i ) ) where r is a non-negative integer, c = ∏d−1 k=1 σ 2 k nk 2 and β = ∑d−1 k=1 5 nk . We can derive the convergence of distribution as follows. Theorem 3 (Limiting distribution of CK). With the same definition of c and β as Corol- lary 2, at the limit of n1, · · · , nd−1 → ∞ and c and β converge to fixed values, c → ĉ and β → β̂, the distribution of E[Σ(x0, x0) r] converges to a log-normal distribution in law. More specifically, the limit distribution is the same as ∥x0∥2ĉeZ where Z ∼ N (− β̂ 2 , β̂) is a Gaussian random variable. Remark 4. By comparing Corollary 2 and moments of log-normal distribution, we can easily see that all moments converge to moments of log-normal distribution in the same limit in Theorem 3. However, we point out that the convergence in law as stated in Theorem 3 is stronger than convergence of all moments. The reason is that the long tail of log-normal distribution makes itself not determined by its moments [36]. 19 Remark 5. The results in Corollary 2 and Theorem 3 can be easily generalized to Σ′(x0, x0). When we require nd also increases to infinity, values c and β in Corollary 2 and Theorem 3 should be changed to c = 2 ∏d k=1 σ 2 k nk 2 , β = 2 nd + ∑d−1 k=1 5 nk and the distribution of Σ′(x0, x0) still converges to log-normal distribution. When we have nd = 1, the limiting distribution of Σ′(x0, x0) is the same as ∥x0∥2σ2 d∥w∥2ĉeZ where w is a standard Gaussian variable and other values are the same. Therefore, the ac- tivation of one single neuron doesn’t converge to log-normal distribution. Next, we derive the distribution of diagonal NTK coming from trainable weights Wi: KWi (x0, x0) = ∑ j,k ∂f(x0) ∂(Wi)j,k ∂f(x0) ∂(Wi)j,k ⊤ . (2.9) This distribution is fully characterized by the connections between CK and NTK. Theorem 6 (NTK of weights). KWi (x0, x0) at initialization has the same distribution as conjugate kernel σ2 dΣ(x0, x0). Remark 7. 
For weights Wd at the last layer of feedforward network, KWd (x0, x̃0) = σ2 dΣ(x0, x̃0) always holds true. Theorem 6 shows that a similar relationship between NTK and CK exists for all other weights in the network when we only consider the distribution of diagonal elements at initialization. Remark 8. Since the distribution of KWi (x0, x0) is the same as σ2 dΣ(x0, x0), they also enjoy the same limiting behaviour. More specifically, at the limit of depth and width increasing to infinity at same rate, KWi (x0, x0) also converges to a log-normal distribution. This supports the observation and assumption in [35] that gradient magnitudes are log-normal distributed. 20 We then derive the distribution of diagonal NTK coming from trainable biases bi. Kbi (x0, x0) = ∑ j ∂f(x0) ∂(bi)j ∂f(x0) ∂(bi)j ⊤ . (2.10) Theorem 9 (NTK of each biases). Kbi (x0, x0) has the same distribution as ∏d−1 k=i σ 2 k+1∥ReLU(vk)∥2 where vk ∼ N (0, Ink×nk ). The moments are given by E[Kbi (x0, x0) r] =   d∏ k=i+1 σ2r k   d−1∏ k=i G2(nk, r) =cr ( exp (( r 2 ) β ) + O( d∑ i=1 1 n2 i ) ) where c = ∏d−1 k=i σ 2 k+1 nk 2 and β = ∑d−1 k=i 5 nk . This random variable also converges to a log-normal distribution in the limit that ni, · · · , nd−1 → ∞ and c, β converge to fixed values. We have derived the distributions and moments of weights NTKs and biases NTKs separately. However, the correlations between them are too convoluted to have a simple closed form solution. Therefore, we further adopt Jensen’s inequality to derive an upper bound of any moments of the whole NTK: K(x0, x0) = d∑ i=1 (KWi (x0, x0) +Kbi (x0, x0)). (2.11) 21 Corollary 10 (The upper bound of moments of NTK). E[K(x0, x0) r] (E[K(x0, x0)])r ≤ exp (( r 2 ) β ) + O( d∑ i=1 1 n2 i ) where β = ∑d−1 k=1 5 nk . By comparing the upper bound in Corollary 10 and moments of a log-normal distri- bution, we can conclude that the effective randomness coefficient is β = ∑d−1 k=1 5 nk . 2.4 Residual Networks We first give the distribution of diagonal CK of the residual networks. Theorem 11 (CK of residual network). The conjugate kernel of m-branches residual net- work Σ(x0, x0) has the same distribution as ∥x∥2∏m−1 i=0 ∥ê+ Vi∥2 where Vi = σi,dj vi,di di−1∏ j=1 σi,j∥ReLU(vi,j)∥, ê is a unit vector along any direction and vi,j ∼ N (0, Ini,j ,ni,j ). Remark 12. The random variable ∥Vi∥2 has the same distribution as Σ′(ê, ê) of a single branch in residual network when regarded as a separate feedforward network. The moments of CK can be derived as follows. The closed form expression can be found at section 2.7.1. 22 Corollary 13 (Moments of CK). We have E[Σ(x0, x0) r]=exp (( r + 4 n ( r 2 )) m−1∑ i=0 ci ) + O( m−1∑ i=0 c2 i ) where r is a positive integer, ci = 2 ∏di k=1 σ 2 i,k ni,k 2 . Next, we provide the limiting distribution of CK of residual networks. Theorem 14 (Limiting distribution of CK). Let ci = 2 ∏di k=1 σ 2 i,k ni,k 2 , βi = 2 ni,di + ∑di−1 k=1 5 ni,k and β = 4 n . At the limit of ci → 0, ∑m−1 i=0 ci → ĉ while βi remains bounded and β → β̂, the distribution of Σ(x0, x0) converges in law to eĉeZ, where Z ∼ N (− β̂ĉ 2 , β̂ĉ). Remark 15. Theorem 14 is irrelevant to the specific shape of any residual branch, but only requires the randomness coefficients βi for every branch to remain bounded. Therefore, this theorem is applicable to the residual network with infinite branches but every branch is a finite width and depth feedforward network. Remark 16. The randomness coefficient of CK for residual network is 4 n ∑m−1 i=0 ci and doesn’t directly depend on depth. 
If only the summation of ci is bounded, the randomness coefficient remains finite when depth increases to infinity. Therefore, the random feature at the last hidden layer of deep residual network is much stabler than that of a deep feedforward network. Remark 17. Theorem 14 implies that in order to keep an arbitrarily deep residual network trainable, it is necessary that 4 n ∑m−1 i=0 ci has an upper bound that is independent of branch number m. Similar phenomenon has been studied in previous works. Zhang et al. [37] observed that if ci of each layer is kept as a constant, the variance of network output will 23 increase as branches increase. Arpit et al. [38] proposed an initialization method that ensure ∑m−1 i=0 ci doesn’t increase with m. We then derive the distribution of diagonal NTK of weights and biases. Theorem 18 (NTK of each weights and biases). The diagonal NTK coming from weights in one layer KWi,j (x0, x0) has the same distribution as ∥x∥2σ2 out∥Vi∥2∏m−1 s ̸=i s=0 ∥ê+ Vs∥2. . The definition of Vi, vi,j and ê is the same as Theorem 11. The distribution of NTK coming from biases in one layer Kbi,j (x0, x0) is the same as σ2 out ∥∥∥V ′ i,j ∥∥∥ 2∏m−1 s=i+1 ∥ê+ Vs∥2 , where V ′ i,j = vi,di ∏di−1 k=j σi,k+1∥ReLU(vi,k)∥. We move the closed form expression of NTK moments into Section 2.7.1. Next, we derive the limiting distribution of NTKs and connect them with conjugate kernel of a feedforward network. Theorem 19 (Limiting distribution of weights NTK). With the same distribution of ci, βi and β as Theorem 14, at the limit of cs → 0 for all s ̸= i, ∑m−1 s=0 s ̸=i cs → ĉ while βi remains bounded for all s ̸= i and β → β̂, the random variable KWi,j (x0, x0) converges in law to σ2 oute ĉeZΣ′ i(ê, ê), where Σ′ i(ê, ê) is the conjugate kernel at the output layer of a feedforward network with the same shape as (i + 1)-th branch in the residual network and Z ∼ N (− β̂ĉ 2 , β̂ĉ). Theorem 20 (Limiting distribution of biases NTK). With the same distribution of ci, βi and β as Theorem 14, at the limit of cs → 0 for all s > i, ∑m−1 s>i cs → ĉ while βi remains bounded for all s > i and β → β̂, the random variable Kbi (x0, x0) converges in law to σ2 oute ĉeZΣ′ i,j(0, 0), where Z ∼ N (− β̂ĉ 2 , β̂ĉ) and Σ′ i,j(0, 0) is the conjugate kernel at 24 the output layer of a feedforward network with the same shape as (i + 1)-th branch in the residual network but the bias bj is initialized as a standard Gaussian vector. The above results on the weights NTK and biases NTK can also be combined to derive an upper bound on moments of the whole NTK. Corollary 21 (The upper bound of the total NTK). E[K(x0, x0) r] (E[K(x0, x0)])r ≤ exp (( r 2 )( max i βi + 4 n ∑ i ci )) + O   m−1∑ i=0 di∑ j=1 1 n2 i,j + m−1∑ i=0 c2 i   where βi = 2 ni,di + ∑di−1 k=1 5 ni,k and ci = 2 ∏di k=1 σ 2 i,k ni,k 2 . Remark 22. By comparing the upper bound in Corollary 21 and moments of a log-normal distribution, we conclude that the effective randomness coefficient is maxi βi + 4 n ∑ i ci. Note that βi only depends on the shape of a single branch, and effective randomness coefficient only depends on maximum βi. This clearly contrasts with the feedforward network where, β increases linearly with depth. The extra term 4 n ∑ i ci is the same as the randomness coefficient of CK. Therefore, with proper scaling, any order moments of NTK of residual network could remain bounded at infinite branches and finite width limit. 2.5 Overview of Proof Our analysis is based on the equivalent transform on random vectors and tensor networks. 
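Before turning to those proof techniques, a brief arithmetic sketch may help make the contrast drawn in Remark 22 concrete. The helper functions below are our own illustration with arbitrary widths, not part of the chapter's formal results; in particular, $\sum_i c_i$ is simply passed in as a fixed value, as if the per-branch scales $\sigma_{i,j}$ had been chosen to keep it constant.

```python
# Randomness coefficients from Corollary 10 (feedforward) and Corollary 21 (residual).
def beta_feedforward(hidden_widths):
    # beta = sum_{k=1}^{d-1} 5 / n_k grows linearly with depth at fixed width.
    return sum(5.0 / n for n in hidden_widths)

def beta_residual(branch_widths, n, c_total):
    # Effective coefficient max_i beta_i + (4 / n) * sum_i c_i; the first term depends
    # only on the shape of a single branch, not on the number of branches m.
    betas = [2.0 / ws[-1] + sum(5.0 / w for w in ws[:-1]) for ws in branch_widths]
    return max(betas) + 4.0 / n * c_total

print("feedforward, 100 layers of width 100:", beta_feedforward([100] * 99))   # ~4.95

branch = [100, 100]          # one hidden layer of width 100, branch output width n = 100
for m in (10, 100, 1000):    # assume the sigmas are scaled so that sum_i c_i stays at 1
    print(f"residual, m = {m:4d} branches:", beta_residual([branch] * m, n=100, c_total=1.0))
```

Under this scaling the residual coefficient stays at $\max_i \beta_i + 4/n = 0.07 + 0.04 = 0.11$ no matter how many branches are added, whereas the feedforward coefficient keeps growing with depth.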
We note that many previous works adopt a path-based approach to study neural networks with ReLU activation [28, 29, 39]. This method decomposes the network output into a sum over all possible paths connecting an input neuron to the output neuron, where each path carries the weights and activations from all layers. Compared with this sum-over-path method, our method is more intuitive and reveals more properties of the CK and NTK. We first present the method for calculating the CK and NTK of the feedforward network, and then generalize it to the residual network.
2.5.1 Conjugate Kernel
The last hidden layer of a feedforward ReLU network with biases initialized to 0 can be expressed as
$$
x_{d-1} = \left( \prod_{i=1}^{d-1} \sigma_i \right) \mathrm{ReLU}(W_{d-1}( \cdots (\mathrm{ReLU}(W_1 x_0)) \cdots )).
$$
We define an auxiliary random variable at each layer as $v_i = \frac{1}{\|x_{i-1}\|} W_i x_{i-1}$. If $\|x_{i-1}\| = 0$, we take $v_i$ to be a uniform random vector on the unit sphere. Then $v_i \sim \mathcal{N}(0, I_{n_i \times n_i})$, and each layer of the feedforward network can be expressed as
$$
y_i = \sigma_i v_i \|x_{i-1}\|, \qquad x_i = \sigma_i \mathrm{ReLU}(v_i) \|x_{i-1}\|.
\tag{2.12}
$$
Therefore, the norm of the last hidden layer is $\|x_{d-1}\| = \|x_0\| \prod_{i=1}^{d-1} \sigma_i \|\mathrm{ReLU}(v_i)\|$, and the norm of the network output is $\|y_d\| = \|v_d\| \|x_0\| \prod_{i=1}^{d} \sigma_i \prod_{i=1}^{d-1} \|\mathrm{ReLU}(v_i)\|$.
The output $y_i$ at each layer depends on the previous layer. However, we can show that the normalized outputs $v_i$ are mutually independent.
Lemma 23. The random vectors $v_i$ for $i = 1, \ldots, d$ are mutually independent.
Based on the above lemma, the distributions of $\Sigma(x_0, x_0)$ and $\Sigma'(x_0, x_0)$ in Theorem 1 follow easily.
2.5.2 Neural Tangent Kernel
In this section, we demonstrate our methods by proving the moments of the NTK coming from the weights. The proof is best expressed in the language of tensor network diagrams. Here we give a brief introduction to this notation; Section 2.8 gives more examples of how the tensor diagram notation corresponds to standard vector and matrix notation. We denote a vector $v$ in a diagram as a node labeled $v$: the node represents a tensor, and having only one edge coming out of the node indicates that it is a rank-1 tensor, i.e., a vector. Similarly, a matrix $V$ is represented as a node with two edges. The contraction of tensors, or matrix multiplication, is represented by an edge connecting two tensors; for example, for any two matrices $A$ and $B$, connecting their nodes by an edge represents the matrix product $AB$. For simplicity, the identity matrix is represented as an unbounded edge without a node attached to it. [40] provides a more comprehensive introduction to tensor network diagrams. The main benefit of tensor network diagrams is that traces, tensor products, and tensor contractions can be expressed in a simple and unified way without extra notation.
In a feedforward network with ReLU activation, $x_i = \mathrm{ReLU}(y_i)$ can be written in matrix form as $x_i = D_i y_i$, where $D_i$ is a diagonal matrix with $[D_i]_{j,j} = 1$ if $[y_i]_j > 0$ and $[D_i]_{j,j} = 0$ if $[y_i]_j < 0$; when $[y_i]_j = 0$ we take $[D_i]_{j,j}$ to be a Bernoulli random variable with $p = \frac{1}{2}$. Therefore the output of the feedforward network can be written as $y_d = \left( \prod_{j=1}^{d} \sigma_j \right) W_d D_{d-1} W_{d-1} \cdots D_1 W_1 x_0$. For compactness of the diagrams, we define $A = D_{d-1} W_{d-1} \cdots D_i$ and $B = D_{i-1} W_{i-1} \cdots D_1 W_1 x_0$ when we consider the NTK coming from the weights $W_i$ in the $i$-th layer. With the above definitions, the network output can be written as
$$
y_d = \left( \prod_{j=1}^{d} \sigma_j \right) W_d A W_i B,
$$
which in diagram form is a chain of nodes $W_d$, $A$, $W_i$, $B$ connected by contraction edges (see Section 2.8 for the correspondence with matrix notation). The weights $W_d$ at the last layer form a vector rather than a matrix, since we require a scalar output $y_d$.
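To make the representations above concrete, the following NumPy sketch (our own illustration, with arbitrary widths and $\sigma_i = 1$, not the chapter's code) checks numerically that the forward pass equals the masked matrix product $W_d D_{d-1} \cdots D_1 W_1 x_0$ and that the norm of the last hidden layer factorizes as $\|x_0\| \prod_i \|\mathrm{ReLU}(v_i)\|$.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small ReLU feedforward network at initialization (sigma_i = 1, biases = 0).
# The widths are illustrative choices, not the chapter's experimental settings.
widths = [8, 50, 50, 50, 1]                       # n_0, n_1, ..., n_d with n_d = 1
W = [rng.standard_normal((widths[i + 1], widths[i])) for i in range(len(widths) - 1)]
x0 = rng.standard_normal(widths[0])
x0 /= np.linalg.norm(x0)

# Forward pass, recording the ReLU masks D_i and the auxiliary v_i = W_i x_{i-1} / ||x_{i-1}||.
x, masks, relu_v_norms = x0, [], []
for i, Wi in enumerate(W):
    y = Wi @ x
    if i < len(W) - 1:                            # hidden layers
        v = y / np.linalg.norm(x)                 # per Section 2.5.1, an independent standard Gaussian
        masks.append(np.diag((y > 0).astype(float)))
        relu_v_norms.append(np.linalg.norm(np.maximum(v, 0.0)))
        x = np.maximum(y, 0.0)
    else:                                         # last layer is linear
        x = y

# Check 1: y_d equals the masked matrix product W_d D_{d-1} W_{d-1} ... D_1 W_1 x_0.
prod = W[-1]
for Wi, Di in zip(reversed(W[:-1]), reversed(masks)):
    prod = prod @ Di @ Wi
assert np.allclose(prod @ x0, x)

# Check 2: ||x_{d-1}|| = ||x_0|| * prod_i ||ReLU(v_i)||  (all sigma_i = 1 here).
x_hidden = x0
for Wi in W[:-1]:
    x_hidden = np.maximum(Wi @ x_hidden, 0.0)
assert np.isclose(np.linalg.norm(x_hidden), np.linalg.norm(x0) * np.prod(relu_v_norms))
print("masked-product and norm-factorization checks passed")
```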
The partial gradient with respect to Wi can be represented as follows: ∂yd ∂Wi = ( d∏ j=1 σj ) Wd A B The NTK is just the inner product of gradients, therefore has the following form: KWi (x0, x0) = ( d∏ j=1 σ2 j )  Wd A B Wd A B   In the above diagram, we highlight the contraction in the red box. The key idea in our method is substituting this part of the tensor network with the expectation of another random tensor network. Given a random matrix V where all elements are i.i.d. standard Gaussian variables. We have the expectation of V ⊗V , which is V tensor product with itself as EVi,jVk,l = Ii,kIj,l. This can be represented in the following diagram: E [( V V )] = 28 This relationship can be generalized to higher order with Isserlis’ theorem. For example: E     V V V V     = + + . Generally, the expectation of ⊗2r i=1V , which is 2r-th order tensor product of V to itself, can be represented as summation of (2r− 1)!! terms of tensor product of identity matrices, each of which corresponds to a pairing in a set of 2r elements. Based on Isserlis’ theorem, we can substitute the contraction part in the the diagram of KWi (x0, x0) r as follows: KWi (x0, x0) r = ∏d j=1 σ 2r j (2r − 1)!! EV ( Wd A V B )2r An extra term (2r−1)!! shows up in the above result. We next take the expectation of Wd. Since elements in Wd is also i.i.d. standard Gaussian variables, the expectation of ⊗2r i=1Wd also contains (2r − 1)!! terms of tensor product of identity matrices. Therefore, we have following result: E[KWi (x0, x0) r] = ( d∏ j=1 σ2r j ) E ( A V B A V B )r Recall that Σ(x0, x0) r = (∏d−1 j=1 σ 2r j ) ∥AWiB∥2r. By using similar auxiliary random variable in section 2.5.1, we have E[KWi (x0, x0) r] = σ2r d E[Σ(x0, x0) r]. For a network with only one hidden layer, the distribution of Σ(x0, x0) is determined by its moments according to Carleman’s condition. Therefore, the distributions of diagonal weights NTK and diagonal CK are same. For feedforward network with more than one 29 hidden layer, we need to set up auxiliary variable in backward manner, the specific proof is shown in Section 2.10. The biases NTK can be controlled in similar way. 2.5.3 Residual Network For conjugate kernel, we can adopt auxiliary random variables vi,j similar to Sec- tion 2.5.1. We only need to show vi,j are independent with each other to prove Theorem 11. The NTK can be controlled with same methods in Section 2.5.2. The only difference is that when we calculate the NTK of parameter in one branch, we need to remove the skip connection bypassing that branch in the tensor network. 2.6 Experimental Results In this section, we will validate our theoretical results with numerical experiments. 2.6.1 Verifying the Moments We sample 1000 three layers feedforward network and calculate the NTK and CK at initialization. All parameters are initialized in the distribution specified in Section 2.2. The hyper-parameters are set as n0 = n1 = n2 = 100, n2 = 1 and σi = 1 10 . The input x0 is sampled from a unit sphere. The theoretical prediction and estimation obtained in numerical experiment is shown in Table 2.1. We only show the moments for Σ(x0, x0) in this section. Moments of Σ′(x0, x0) and NTKs are moved to Section 2.9.1. 
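The moment check just described is straightforward to reproduce. The following sketch (our own re-implementation under the stated hyper-parameters, not the original experiment scripts) samples 1000 three-layer ReLU networks and compares the empirical moments of $\Sigma(x_0, x_0)$ with the closed form of Corollary 2, computing $G_2$ exactly from Eq. (2.7).

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)

def G1(n, r):
    # E[||w||^(2r)] for w ~ N(0, I_n), Eq. (2.7).
    out = 1.0
    for t in range(r):
        out *= n + 2 * t
    return out

def G2(n, r):
    # E[||ReLU(w)||^(2r)] for w ~ N(0, I_n), Eq. (2.7).
    return sum(comb(n, k) * G1(k, r) for k in range(n + 1)) / 2.0 ** n

# Setup of this subsection: d = 3 layers, hidden widths 100, sigma_i = 1/10, ||x_0|| = 1.
d, n, sigma, n_samples = 3, 100, 0.1, 1000
x0 = rng.standard_normal(n)
x0 /= np.linalg.norm(x0)

ck = np.empty(n_samples)
for s in range(n_samples):
    x = x0
    for _ in range(d - 1):                               # the d - 1 hidden layers
        x = np.maximum(sigma * rng.standard_normal((n, n)) @ x, 0.0)
    ck[s] = x @ x                                        # Sigma(x_0, x_0) = ||x_{d-1}||^2

# Corollary 2: E[Sigma^r] = ||x_0||^(2r) * prod_k sigma_k^(2r) G2(n_k, r).
for r in range(1, 5):
    predicted = (sigma ** (2 * r) * G2(n, r)) ** (d - 1)
    print(f"r = {r}: predicted {predicted:.3e}, estimated {np.mean(ck ** r):.3e}")
```

The predicted column reproduces the values reported in Table 2.1 (2.50e-1, 6.89e-2, 2.08e-2, 6.86e-3), and the Monte Carlo estimates should agree with them up to sampling error.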
We don’t include higher order moments because the long tail nature of CK distribution makes it harder to numerically estimate higher order 30 Moment Predicted Estimated EΣ(x0, x0)1 2.50e-1 2.52e-01 ± 2.55e-03 EΣ(x0, x0)2 6.89e-2 6.98e-02 ± 1.49e-03 EΣ(x0, x0)3 2.08e-2 2.13e-02 ± 7.43e-04 EΣ(x0, x0)4 6.86e-3 7.08e-03 ± 3.73e-04 Table 2.1: Verification of CK and NTK of a three layers feedforward network. moments. The theory and experiment results fit well in terms of first 4 order moments, which demonstrates the accuracy of our theoretical results. We also compare the moments of residual network in Table 2.2 by sampling 10000 residual network with ni,j = 100 and σi,j = 1 10 . Only CK moments are shown below and the full result can be found in Section 2.9.1. 2.6.2 Verifying the Log-Normal Distribution In the following experiment, we will verify our theoretical results on limiting distri- bution of diagonal CK and NTK. According to Theorem 3, the diagonal CK of feedforward network converges to the log-normal distribution in the infinite width and infinite depth limit. We demonstrate the convergence in Figure 2.2, in which we take three feedforward networks with different depth. The hyper-parameters including σi and ni are tuned to keep c = 1 and β ≈ 1. We reinitialize these networks 106 times and plot the empirical distribution of CK. Figures of empirical distribution for other settings can be found in section 2.9.2. We use Kolmogorov–Smirnov test to calculate the difference between the distributions of CK and NTK and their limiting distributions to validate Theorems 3, 6 and 9. Table 2.3 shows that the p-value for CK and NTK coming from weights and biases at first layer. 31 Moment Predicted Estimated EΣ(x0, x0)1 2.25e+0 2.25e+00 ± 4.09e-03 EΣ(x0, x0)2 5.23e+0 5.23e+00 ± 1.95e-02 EΣ(x0, x0)3 1.26e+1 1.26e+01 ± 7.33e-02 EΣ(x0, x0)4 3.13e+1 3.12e+01 ± 2.58e-01 Table 2.2: Verification of CK and NTK of a two branches residual network. Every branch is a feedforward network with one hidden layer. −3 −2 −1 0 1 2 log Σ(x0, x0) d=2 d=5 d=15 0.0 0.5 1.0 1.5 2.0 2.5 Σ(x0, x0) d=2 d=5 d=15 Figure 2.2: Distribution of CK The red line is the theoretical limiting distribution given c and β. The KS test takes 1000 samples of CK and NTK can calculate the maximum difference between the empirical distribution and the theoretical limiting distribution. The hyper- parameters are the same as neural networks used in Figure 2.2. Table 2.3 illustrates that when d = 2, we have strong evidence to support that the distributions of CK and NTK are not log-normal. However, when the depth increases to 5 and 15, it is hard to distinguish them from the log-normal distribution. We further verify the limiting behaviour in Theorems 19 and 20. For CK of residual network, we test the empirical distribution of Σ(x0, x0) against the limiting the log-normal distribution. The NTK of residual network is tested against another sample of eZΣ(x0, x0) where Σ(x0, x0) is CK of a finite sized feedforward network and eZ is independently sampled from the log normal distribution. The sample size in Table 2.4 is also 1000. We set the branch of residual network as the feedforward networks with one hidden layer and hyper- parameters are tuned so that ∑ i ci = 1 and β = 4 100 . 32 Depth Σ(x0, x0) KW1 (x0, x0) Kb1 (x0, x0) 2 1.60e-10 3.48e-09 3.48e-09 5 1.30e-01 1.44e-02 1.44e-02 15 7.13e-01 1.23e-01 1.23e-01 Table 2.3: The p-value of KS test on CK and NTK of the feedforward network. 
Branches Σ(x0, x0) KW0,1 Kb0,1 1 0.00e+00 8.28e-01 5.73e-01 5 1.65e-31 1.62e-04 3.96e-05 20 3.97e-05 7.76e-02 2.41e-01 100 1.36e-01 9.71e-02 1.34e-01 Table 2.4: The p-value of KS test on CK and NTK of the residual network. We note that the p-value for conjugate kernel in Table 2.4 gradually increases, which indicates the distribution of CK gradually converges to the log-normal distribution. The p-value for NTK in a residual network with only one branch is very high. This is expected because the variance of log-normal distribution coming from other branches is 0. We also conduct empirical study on the correlation between NTK from different parameters in Section 2.9.3. Although the joint probability is complicated, we can provide approximation on the correlation coefficient of logarithm NTKs. The statistics property of non diagonal elements of random CK and NTK are discussed in Section 2.9.4. In Section 2.9.5, the random gradient are used as random feature for classification on UCI dataset. 33 2.7 Additional Theoretical Result 2.7.1 Moments of residual network CK and NTKs Corollary 24 (Moments of CK). We have E[Σ(x0, x0) r] =∥x0∥2r m−1∏ i=0 ∑ 0≤r1,r2,r3 r1+r2+r3=r r2 is even ( r r1, r2, r3 ) 2r2(r2 − 1)!! di∏ k=1 σ2r1+r2 i,k ×G1(ni,di + r2, r1) di−1∏ k=1 G2(ni,k, r1 + r2 2 ) where r is a positive integer and we assume (−1)!! = 1. Theorem 25 (NTK of each weights). The moments of NTK coming from weights of resid- ual network with ReLU activation is E[KWs,t (x0, x0) r] =σ2r out∥x0∥2r   ds∏ k=1 σ2r s,k  G1(ns,ds , r) ds−1∏ k=1 G2(ns,k, r) m−1∏ i=0 i̸=s ∑ 0≤r1,r2,r3 r1+r2+r3=r r2 is even ( r r1, r2, r3 ) 2r2(r2 − 1)!! di∏ k=1 σ2r1+r2 i,k G1(ni,di + r2, r1) di−1∏ k=1 G2(ni,k, r1 + r2 2 ) where r is a positive integer and we assume (−1)!! == 0!! = 1. Theorem 26 (NTK of each biases). The moments of NTK coming from biases of residual 34 network with ReLU activation is E[Kbs,t (x0, x0) r] =σ2r out   ds∏ k=t+1 σ2r s,k  G1(ns,ds , r) ds−1∏ k=t G2(ns,k, r) m−1∏ i=s+1 ∑ 0≤r1,r2,r3 r1+r2+r3=r r2 is even ( r r1, r2, r3 ) 2r2(r2 − 1)!! di∏ k=1 σ2r1+r2 i,k G1(ni,di + r2, r1) di−1∏ k=1 G2(ni,k, r1 + r2 2 ) where r is a positive integer. 2.8 Additional Explanation on Tensor Diagram In this section we give the standard vector/matrix notation correspondences of all the formulas in Section 2.5.2. yd = ( d∏ j=1 σj ) Wd A Wi B yd = ( d∏ j=1 σj ) WdAWiB ∂yd ∂Wi = ( d∏ j=1 σj ) Wd A B ∂yd ∂[Wi]j,k = ( d∏ j=1 σj ) [A⊤W⊤ d ]jBk 35 KWi (x0, x0) = ( d∏ j=1 σ2 j )  Wd A B Wd A B  KWi (x0, x0) = ( d∏ j=1 σ2 j )∥∥∥A⊤W⊤ d ∥∥∥ 2∥B∥2 E [( V V )] = E [Vi,jVk,l] = δi,kδj,l E     V V V V     = + + E[Vi1,j1Vi2,j2Vi3,j3Vi4,j4] = δi1,i2δi3,i4δj1,j2δj3,j4 +δi1,i3δi2,i4δj1,j3δj2,j4 +δi1,i4δi2,i3δj1,j4δj2,j3 KWi (x0, x0) r = ∏d j=1 σ 2r j (2r − 1)!! EV ( Wd A V B )2r KWi (x0, x0) r = ∏d j=1 σ 2r j (2r − 1)!! EV (WdAV B)2r E[KWi (x0, x0) r] = ( d∏ j=1 σ2r j ) E ( A V B A V B )r E[KWi (x0, x0) r] = ( d∏ j=1 σ2r j ) E∥AV B∥2r 36 2.9 Additional Experiment Result 2.9.1 Other moments of CK and NTK Table 2.5: Verification of CK and NTKs of a three layer feedforward network. 
Moment Predicted Estimated EΣ(x0, x0)1 2.50e-1 2.52e-01 ± 2.55e-03 EΣ(x0, x0)2 6.89e-2 6.98e-02 ± 1.49e-03 EΣ(x0, x0)3 2.08e-2 2.13e-02 ± 7.43e-04 EΣ(x0, x0)4 6.86e-3 7.08e-03 ± 3.73e-04 EΣ′(x0, x0)1 2.50e-3 2.72e-03 ± 1.41e-04 EΣ′(x0, x0)2 2.07e-5 2.73e-05 ± 3.76e-06 EΣ′(x0, x0)3 3.12e-7 5.19e-07 ± 1.54e-07 EΣ′(x0, x0)4 7.21e-9 1.49e-08 ± 7.37e-09 EKW1 (x0, x0)1 2.50e-3 2.48e-03 ± 2.55e-05 EKW1 (x0, x0)2 6.89e-6 6.81e-06 ± 1.48e-07 EKW1 (x0, x0)3 2.08e-8 2.06e-08 ± 7.42e-10 EKW1 (x0, x0)4 6.86e-11 6.81e-11 ± 3.77e-12 EKb1 (x0, x0)1 2.50e-1 2.48e-01 ± 2.55e-03 EKb1 (x0, x0)2 6.89e-2 6.81e-02 ± 1.48e-03 EKb1 (x0, x0)3 2.08e-2 2.06e-02 ± 7.42e-04 EKb1 (x0, x0)4 6.86e-3 6.81e-03 ± 3.77e-04 EKW2 (x0, x0)1 2.50e-3 2.51e-03 ± 2.63e-05 EKW2 (x0, x0)2 6.89e-6 6.98e-06 ± 1.56e-07 EKW2 (x0, x0)3 2.08e-8 2.15e-08 ± 7.97e-10 EKW2 (x0, x0)4 6.86e-11 7.28e-11 ± 4.19e-12 EKb2 (x0, x0)1 5.00e-1 4.97e-01 ± 3.65e-03 EKb2 (x0, x0)2 2.62e-1 2.61e-01 ± 3.86e-03 EKb2 (x0, x0)3 1.44e-1 1.43e-01 ± 3.29e-03 EKb2 (x0, x0)4 8.29e-2 8.28e-02 ± 2.66e-03 EKW3 (x0, x0)1 2.50e-3 2.52e-03 ± 2.55e-05 EKW3 (x0, x0)2 6.89e-6 6.98e-06 ± 1.49e-07 EKW3 (x0, x0)3 2.08e-8 2.13e-08 ± 7.43e-10 EKW3 (x0, x0)4 6.86e-11 7.08e-11 ± 3.73e-12 EKb3 (x0, x0)1 1.00e+00 1.00e+00 ± 0.00e+00 EKb3 (x0, x0)2 1.00e+00 1.00e+00 ± 0.00e+00 EKb3 (x0, x0)3 1.00e+00 1.00e+00 ± 0.00e+00 EKb3 (x0, x0)4 1.00e+00 1.00e+00 ± 0.00e+00 37 Table 2.6: Verification of CK and NTKs of a two branches residual network. Every branch is a feedforward network with one hidden layer. Moment Predicted Estimated EΣ(x0, x0)1 2.25e+0 2.25e+00 ± 4.09e-03 EΣ(x0, x0)2 5.23e+0 5.23e+00 ± 1.95e-02 EΣ(x0, x0)3 1.26e+1 1.26e+01 ± 7.33e-02 EΣ(x0, x0)4 3.13e+1 3.12e+01 ± 2.58e-01 EΣ′(x0, x0)1 2.25e-2 2.24e-02 ± 3.29e-04 EΣ′(x0, x0)2 1.57e-3 1.58e-03 ± 5.47e-05 EΣ′(x0, x0)3 1.89e-4 1.92e-04 ± 1.26e-05 EΣ′(x0, x0)4 3.29e-5 3.24e-05 ± 3.45e-06 EKW0,1 (x0, x0)1 7.50e-3 7.51e-03 ± 2.27e-05 EKW0,1 (x0, x0)2 6.13e-5 6.15e-05 ± 3.89e-07 EKW0,1 (x0, x0)3 5.43e-7 5.49e-07 ± 5.69e-09 EKW0,1 (x0, x0)4 5.22e-9 5.30e-09 ± 8.38e-11 EKb0,1 (x0, x0)1 7.50e-1 7.51e-01 ± 2.27e-03 EKb0,1 (x0, x0)2 6.13e-1 6.15e-01 ± 3.89e-03 EKb0,1 (x0, x0)3 5.43e-1 5.49e-01 ± 5.69e-03 EKb0,1 (x0, x0)4 5.22e-1 5.30e-01 ± 8.38e-03 EKW0,2 (x0, x0)1 7.50e-3 7.53e-03 ± 2.26e-05 EKW0,2 (x0, x0)2 6.13e-5 6.18e-05 ± 3.89e-07 EKW0,2 (x0, x0)3 5.43e-7 5.52e-07 ± 5.71e-09 EKW0,2 (x0, x0)4 5.22e-9 5.33e-09 ± 8.44e-11 EKb0,2 (x0, x0)1 1.50e+0 1.50e+00 ± 2.89e-03 EKb0,2 (x0, x0)2 2.33e+0 2.33e+00 ± 9.18e-03 EKb0,2 (x0, x0)3 3.76e+0 3.76e+00 ± 2.30e-02 EKb0,2 (x0, x0)4 6.30e+0 6.28e+00 ± 5.41e-02 EKW1,1 (x0, x0)1 7.50e-3 7.50e-03 ± 2.25e-05 EKW1,1 (x0, x0)2 6.13e-5 6.13e-05 ± 3.84e-07 EKW1,1 (x0, x0)3 5.43e-7 5.44e-07 ± 5.59e-09 EKW1,1 (x0, x0)4 5.22e-9 5.23e-09 ± 8.27e-11 EKb1,1 (x0, x0)1 5.00e-1 4.99e-01 ± 1.33e-03 EKb1,1 (x0, x0)2 2.68e-1 2.67e-01 ± 1.46e-03 EKb1,1 (x0, x0)3 1.53e-1 1.53e-01 ± 1.31e-03 EKb1,1 (x0, x0)4 9.32e-2 9.27e-02 ± 1.16e-03 EKW1,2 (x0, x0)1 7.50e-3 7.50e-03 ± 2.22e-05 EKW1,2 (x0, x0)2 6.13e-5 6.11e-05 ± 3.78e-07 EKW1,2 (x0, x0)3 5.43e-7 5.41e-07 ± 5.46e-09 EKW1,2 (x0, x0)4 5.22e-9 5.17e-09 ± 7.84e-11 EKb1,2 (x0, x0)1 1.00e+00 9.99e-01 ± 1.41e-03 EKb1,2 (x0, x0)2 1.02e+00 1.02e+00 ± 2.90e-03 EKb1,2 (x0, x0)3 1.06e+00 1.06e+00 ± 4.58e-03 EKb1,2 (x0, x0)4 1.12e+00 1.12e+00 ± 6.61e-03 EKWout(x0, x0)1 2.25e-2 2.25e-02 ± 4.09e-05 EKWout(x0, x0)2 5.23e-4 5.23e-04 ± 1.95e-06 EKWout(x0, x0)3 1.26e-5 1.26e-05 ± 7.33e-08 EKWout(x0, x0)4 3.13e-7 3.12e-07 ± 2.58e-09 38 2.9.2 The limiting log normal distribution We 
compare the distribution of random NTK for finite width feedforward network in Figure 2.3. The red lines in the figure are the limiting log-normal distribution as defined in Theorems 6 and 9. As depth increases, the distribution of weight NTK and bias NTK gradually approach the limiting distribution. −3 −2 −1 0 1 2 logKW0 (x0, x0) d=2 d=5 d=15 −2 0 2 4 6 logKb0 (x0, x0) d=2 d=5 d=15 Figure 2.3: Distribution of NTK The red line is the theoretical limiting distribution given c and β. We then demonstrate the convergence of distribution of CK and NTK for residual network in fig. 2.4. The red lines are computed according to theorems 19 and 20. Note that the red line in right figure is only similar but not identical to the probability density of Gaussian distribution. As the number of branches increases, the CK converge to a log- normal distribution. As for NTK for weights in first layer of first branch, when m = 1, the NTK of residual network is exactly the NTK of a feedforward network, therefore the curve fit perfectly. When m ≥ 2, the real distribution is not same as limiting distribution but converge to it as branches increase. 39 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 log Σ(x0, x0) m=1 m=5 m=20 −3.0 −2.5 −2.0 −1.5 −1.0 −0.5 0.0 0.5 logKW0,1 (x0, x0) m=1 m=5 m=20 Figure 2.4: Distribution of CK and NTK The red line is the theoretical limiting distribution given c and β. 2.9.3 Correlation between CK and NTKs The joint distribution of CK and NTK coming from different parameters are non- trivial. We conduct phenomenological analysis on their correlation coefficient. Correlation between weight NTKs We first focus on correlation between weight NTKs. The CK is included in this case since KWd (x0, x̃0) = σ2 dΣ(x0, x̃0) always holds true. −4 −2 0 2 log(KW1 (x0, x0)) −4 −2 0 2 lo g (K W 2 (x 0 ,x 0 )) (a) layer 1 vs layer 2 −4 −2 0 2 log(KW1 (x0, x0)) −6 −4 −2 0 2 lo g (K W 5 (x 0 ,x 0 )) (b) layer 1 vs layer 5 −4 −2 0 2 log(KW1 (x0, x0)) −5 −4 −3 −2 −1 0 1 2 3 lo g (K W 1 5 (x 0 ,x 0 )) (c) layer 1 vs layer 15 −6 −4 −2 0 2 log(KW5 (x0, x0)) −5 −4 −3 −2 −1 0 1 2 3 lo g (K W 6 (x 0 ,x 0 )) (d) layer 5 vs layer 6 Figure 2.5: Joint distribution of logarithm NTK of weights from different layer. We consider d = 15 layer feedforward network with a uniform width n = 70 for each layer. The correlation between logarithm of several weights NTK are shown in Figure 2.5. By induction, we get the following pattern: The closer the two weights are in the feed forward network, the more relevant the 40 logarithm of their NTK. We give an explanation of this phenomenon based on random tensor network. Without loss of generality, we assume i < j and consider two NTK KWi (x0, x0) and KWj (x0, x0). We can describe NTKs as follows where a = ∏ i σ 2 i is a constant, C = Di−1Wi−1 . . . D1W1x0, B = Dj−1 . . .Wi+1Di and A = WdDd−1 . . .Wj+1Dj. KWi (x0, x0) = a   A C A C B B Wj Wj   KWj (x0, x0) = a ( A C A C B B Wi Wi ) . Let the singular value decomposition of B be B = U⊤SV , we have KWi (x0, x0) = a ∥ ∥ ∥SUW⊤ j A ⊤ ∥ ∥ ∥ 2∥C∥2 KWj (x0, x0) = a∥A∥2∥SVWiC∥2. We further let wi = 1 ∥C∥ WiC and wj = 1 ∥A∥ W⊤ j A ⊤ so KWi (x0, x0) = a∥A∥2∥C∥2∥SUwj∥2 KWj (x0, x0) = a∥A∥2∥C∥2∥SV wi∥2. Wi and Wj are dependent, therefore wi and wj are also independent. We assume that most co-variance between logarithm of weights NTK comes from the identical terms ∥A∥2 and ∥C∥2. The distributions of these values are similar to log normal distribution, and the 41 random coefficient β is proportional to d n . 
Cov[logKWi (x0, x0), logKWj (x0, x0)] ≈V ar[log ∥A∥2] + V ar[log ∥C∥2] ∝d− j n + i− 1 n = d− 1 n − j − i n We can further estimate correlation coefficient as Cor[KWi (x0, x0), KWj (x0, x0)] ≈ 1− j − i n j− i is just the distance between two weights in the feedforward network. Therefore, the correlation between logarithm of weights NTK decreases with the distance between layers of these weights. However, the above estimation is only qualitative. It tends to underestimate the correlation as shown in Figure 2.6a because the co-variance from ∥SUwj∥2 and ∥SV wi∥2 are neglected. 0.4 0.6 0.8 Correlation Coefficient 0.0 0.2 0.4 0.6 0.8 E s t im a t io n x=y (a) Correlation coefficients between weight NTKs from different layers. 1 3 5 7 9 11 13 15 Layer 0.0 0.2 0.4 0.6 0.8 1.0 C o rr el a ti o n C o effi ci en t estimate (b) Correlation coefficients between weight NTK and bias NTK in same layer. Figure 2.6: Estimation of correlation coefficients between weight NTKs and bias NTKs. 42 Correlation between weight NTK and bias NTK We next show the correlation between weight NTK KWi (x0, x0) and bias NTK Kbi (x0, x0), where Wi and bi are weight and bias from same layer. −4 −2 0 2 log(KW1 (x0, x0)) −2 0 2 4 6 lo g (K b 1 (x 0 ,x 0 )) (a) layer 1 −4 −2 0 2 log(KW2 (x0, x0)) −1 0 1 2 3 4 5 6 7 lo g (K b 2 (x 0 ,x 0 )) (b) layer 2 −4 −2 0 2 log(KW8 (x0, x0)) 0 1 2 3 4 5 lo g (K b 8 (x 0 ,x 0 )) (c) layer 8 −4 −2 0 2 log(KW14 (x0, x0)) 2.0 2.5 3.0 3.5 4.0 lo g (K b 1 4 (x 0 ,x 0 )) (d) layer 14 Figure 2.7: Joint distribution of logarithm NTK of weights NTK and bias NTK. According to Figure 2.7, the weight NTK and bias NTK from shallower layers in feedforward network comes with higher correlation. Based on similar argument as previous section, we have KWi (x0, x0) = a∥A∥2∥B∥2 and Kbj (x0, x0) = a∥A∥2 where A = Di−1 . . . D1W1x0. The co-variance and variance of weights NTK and variance of bias NTK can be estimated to be proportional to d− i, d− 1 and d − i respectively. The correlation can be further estimated as √ d−i d−1 . Experiment in Figure 2.6b shows that this this estimation is almost tight. Correlation between weight NTK of residual network The pattern that closer weights in the network have higher correlation for logarithm NTK still applies to residual network. However, one main difference is that correlation of weights NTK across branches are con- trolled by variance σi,j of the network. With same definition of ci in Theorem 14, we have the correlation increases with ∑ i ci as shown in Figure 2.8. 43 −3.5 −3.0 −2.5 −2.0 −1.5 −1.0 log(KW0,1 (x0, x0)) −3.5 −3.0 −2.5 −2.0 −1.5 −1.0 lo g (K W 1 , 1 (x 0 ,x 0 )) (a) ∑ i ci = 1 10 11 12 13 14 15 16 log(KW0,1 (x0, x0)) 10 11 12 13 14 15 16 lo g (K W 1 , 1 (x 0 ,x 0 )) (b) ∑ i ci = 20 Figure 2.8: Joint distribution of logarithm NTK of weights NTK for residual network with 20 branches. 2.9.4 Non Diagonal Elements −1 0 1 2 3 4 5 Σ(x0, x ′ 0 ) d=2 d=5 d=15 (a) −10 −8 −6 −4 −2 0 2 4 log Σ(x0, x ′ 0 ) d=2 d=5 d=15 (b) −6 −4 −2 0 2 log(KW1 (x0, x0)) −6 −4 −2 0 2 lo g (K W 2 (x 0 ,x 0 )) (c) −6 −4 −2 0 2 log(KW1 (x0, x0)) −6 −4 −2 0 2 lo g (K W 1 0 (x 0 ,x 0 )) (d) Figure 2.9: Distribution of CK and joint distribution of NTK from different layers. We pick two unit length vector x0 and x′ 0 with angle π 3 . The distribution of CK and NTK is plotted in Figure 2.9. In logarithmic chart, we discard the negative part of the kernel. The network settings are same as Figure 2.2 where randomness coefficient β ≈ 1. The distribution has long tail similar to the diagonal case. 
However, non-diagonal elements of random kernel is not guaranteed to be positive. The non-diagonal elements of NTK from weights in different layers also shows similar pattern as diagonal elements. Figures 2.9a and 2.9b show that NTKs from weights in 44 different layer are more irrelevant when two weights are further. We next show the random NTK on a unit circle for feedforward network and residual network. We set the network width as n = 250 for each network. 10 independently sampled NTKs and their mean are shown in Figure 2.10. As the depth of feedforward network increases, the NTK becomes increasingly noisy and the mean degenerate into a constant plus a delta function. Therefore the NTK for deep feedforward network behave like a white noise. The residual network in Figure 2.10c has same depth as the deep feedforward network, but the NTK is much less noisy and the mean NTK remains informative. 0.0 0.5 1.0 1.5 2.0 2.5 3.0 θ 1.5 2.0 2.5 3.0 3.5 4.0 K (x 0 ,x ′ 0 ) (a) Feedforward Network (d=5) 0.0 0.5 1.0 1.5 2.0 2.5 3.0 θ 0 25 50 75 100 125 150 175 K (x 0 ,x ′ 0 ) (b) Feedforward Network (d=100) 0.0 0.5 1.0 1.5 2.0 2.5 3.0 θ 5000 6000 7000 8000 9000 10000 11000 K (x 0 ,x ′ 0 ) (c) Residual Network (d=5,m=20) Figure 2.10: Distribution of CK and joint distribution of NTK from different layers. 2.9.5 Classification with random NTK Following the setting of [22], we consider classification on UCI dataset based on SVM with random NTK and compare the result with NTK of infinite width network. The depth of the finite width network is set to be same as the infinite width network. The width of the feedforward network are tuned to achieve β ≈ 1. The average of 100 random NTK are sampled to reduce the variance. The result is shown in table 2.7. 45 Task Fixed NTK (β = 0) Random NTK (β ≈ 1) average 81.95% 81.54% abalone 65.49% 65.52% acute-inflammation 100.00% 100.00% acute-nephritis 100.00% 100.00% arrhythmia 71.90% 72.12% balance-scale 98.40% 97.28% balloons 100.00% 100.00% bank 89.87% 89.58% blood 79.28% 78.21% breast-cancer 73.59% 73.24% breast-cancer-wisc 97.29% 97.43% breast-cancer-wisc-diag 96.83% 97.18% breast-cancer-wisc-prog 80.61% 84.18% breast-tissue 70.19% 71.15% car 98.55% 97.92% cardiotocography-10clases 86.11% 85.03% cardiotocography-3clases 93.08% 92.84% chess-krvkp 99.12% 99.03% congressional-voting 61.93% 61.93% conn-bench-sonar-mines-rocks 85.10% 84.62% 46 contrac 53.60% 54.55% credit-approval 86.48% 86.19% cylinder-bands 78.32% 78.32% dermatology 96.98% 96.98% echocardiogram 85.61% 85.61% ecoli 86.61% 86.31% energy-y1 95.05% 89.84% energy-y2 87.89% 88.80% fertility 88.00% 88.00% flags 50.00% 50.52% glass 68.40% 66.98% haberman-survival 73.68% 73.68% heart-cleveland 59.54% 57.89% heart-hungarian 85.96% 85.27% heart-switzerland 47.58% 44.35% heart-va 33.50% 31.00% hepatitis 82.05% 83.97% ilpd-indian-liver 72.77% 71.92% ionosphere 94.89% 93.75% iris 97.30% 97.97% led-display 74.20% 75.40% lenses 83.33% 87.50% 47 libras 85.83% 85.28% low-res-spect 93.80% 93.98% lung-cancer 53.12% 46.88% lymphography 87.16% 85.81% mammographic 81.04% 81.88% molec-biol-promoter 88.46% 86.54% molec-biol-splice 87.01% 85.32% musk-1 87.82% 88.87% oocytes merluccius nucleus 4d 84.12% 81.18% oocytes merluccius states 2f 92.94% 91.86% oocytes trisopterus nucleus 2f 86.62% 85.09% oocytes trisopterus states 5b 94.41% 92.21% ozone 97.16% 97.16% parkinsons 91.84% 93.88% pima 75.52% 75.91% pittsburg-bridges-MATERIAL 91.35% 91.35% pittsburg-bridges-REL-L 73.08% 72.12% pittsburg-bridges-SPAN 71.74% 70.65% 
pittsburg-bridges-T-OR-D 84.00% 85.00% pittsburg-bridges-TYPE 70.19% 71.15% planning 71.11% 71.11% plant-margin 83.81% 83.62% 48 plant-shape 73.19% 71.81% plant-texture 84.94% 84.69% post-operative 72.73% 72.73% primary-tumor 52.13% 54.88% seeds 92.79% 92.79% semeion 95.67% 94.22% spambase 94.52% 94.46% statlog-australian-credit 68.02% 68.02% statlog-german-credit 77.60% 77.00% statlog-heart 88.06% 88.06% statlog-image 98.05% 98.01% statlog-vehicle 83.41% 79.86% steel-plates 77.78% 76.80% synthetic-control 99.50% 99.50% teaching 59.87% 55.92% tic-tac-toe 99.79% 98.54% titanic 78.95% 78.95% trains 87.50% 87.50% vertebral-column-2clases 86.04% 84.42% vertebral-column-3clases 84.09% 82.47% waveform 86.64% 86.50% waveform-noise 86.48% 85.96% 49 wine 97.73% 98.30% wine-quality-red 65.62% 64.81% wine-quality-white 67.22% 64.77% yeast 59.77% 59.84% zoo 96.00% 99.00% Table 2.7: Classification with fixed NTK and random NTK 2.10 Proof of Main Theory 2.10.1 Proof of Theorem 1 According to section 2.5.1, we have