ABSTRACT

Title of Dissertation: SPECTRAL METHODS FOR NEURAL NETWORK DESIGNS
Jiahao Su, Doctor of Philosophy, 2022
Dissertation Directed by: Assistant Professor Furong Huang, Department of Computer Science

Neural networks are general-purpose function approximators. Given a problem, engineers or scientists select a hypothesis space of functions with specific properties by designing the network architecture. However, mainstream designs are often ad hoc and can suffer from numerous undesired properties. Most prominently, the network architectures are gigantic, and most parameters are redundant while consuming computational resources. Furthermore, the learned networks are sensitive to adversarial perturbations and tend to underestimate the predictive uncertainty. We aim to understand and address these problems using spectral methods: while these undesired properties are hard to interpret from the network parameters in the original domain, we can establish their relationship when we represent the parameters in a spectral domain. These relationships allow us to design networks with certified properties via the spectral representation of parameters.

SPECTRAL METHODS FOR NEURAL NETWORK DESIGNS

by Jiahao Su

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2022

Advisory Committee:
Assistant Professor Furong Huang, Chair/Advisor
Associate Professor Behtash Babadi
Professor Nikhil Chopra
Associate Professor Thomas A. Goldstein
Professor Min Wu

(c) Copyright by Jiahao Su, 2022

Acknowledgments

Throughout my doctoral research, I have received a great deal of support and assistance. I owe my gratitude to all the people who have made this dissertation possible.

First and foremost, I would like to thank my advisor, Professor Furong Huang, who helped me in my most difficult time and guided me through my doctoral study. She has given me invaluable opportunities to work on exciting and challenging problems over the last five years. Her research passion has inspired and will continue to inspire me in my career. I am very fortunate and proud to be my advisor's first Ph.D. graduate.

I would also like to thank my intern manager, Professor Anima Anandkumar, and my intern mentor and long-term collaborator, Dr. Wonmin Byeon. It would have been impossible to complete my dissertation without their deep insights into my research.

I am deeply grateful to all my collaborators and colleagues at the University of Maryland. There were memorable moments when I started my doctoral research working with Jingling Li, Xuchen You, and Yanchao Sun. Milan Cvitkovic provided me the initial idea for my first conference publication. My collaboration with Shiqi Wang, Xiaoyu Liu, Tahseen Rabbani, Chenghao Deng, and Evan Wang has been highly fruitful.

I owe my deepest thanks to my friends, especially those pursuing their doctoral degrees; their encouragement was crucial for me to finish this dissertation. I would like to express my gratitude to my high school friends Xinshi Chen, Zelun Luo, Siqi Yang, Boyang Zhang, Haici Zhang, Rui Zhang, and Yuliang Zou, and my college friends Jinghui Chen, Yang Chen, Shupeng Gui, Zecheng He, Xu Peng, and Pan Zhong.

Lastly, I would like to thank my parents for their unconditional love and support. It would not have been possible to start my Ph.D. study without their financial support. Likewise, it would not have been possible to continue my research career without their encouragement.
Unfortunately, due to the COVID-19 pandemic, I have not seen my parents in the last three years. I miss you, dad and mom.

Table of Contents

Acknowledgements
Table of Contents
Chapter 1: Introduction
Chapter 2: Compact Architecture Designs by Tensor Representations
  2.1 Overview
  2.2 Related Works
  2.3 Introduction to Tensor Representations
  2.4 Tensorial Neural Networks (TNNs)
    2.4.1 Tensorial Layers versus Traditional Layers
    2.4.2 Relationship between NNs and TNNs
  2.5 Algorithms for TNNs
    2.5.1 Prediction in TNNs
    2.5.2 Learning in TNNs
    2.5.3 Compression via Knowledge Distillation
  2.6 Interpretation of Existing Compact Architectures
  2.7 Experimental Results
    2.7.1 Knowledge Distillation
    2.7.2 Learning from Scratch
  2.8 Conclusion
Chapter 3: AutoTNN: A Framework for Representing and Learning Tensorial Neural Networks
  3.1 Overview
  3.2 Related Works
  3.3 Tensor Operations and einsum
    3.3.1 Representations of Multilinear Operations
    3.3.2 From einsum to conv_einsum
    3.3.3 Compact Neural Networks via conv_einsum
  3.4 Algorithms
    3.4.1 Atomic Operations
    3.4.2 Optimal Sequencer
    3.4.3 Checkpointing
  3.5 Experimental Results
    3.5.1 Accuracy and Memory Results
    3.5.2 Runtime Results
  3.6 Conclusion
Chapter 4: Higher-order RNN for Spatio-temporal Learning
  4.1 Overview
  4.2 Related Works
  4.3 Background: Convolutional LSTM and Higher-order LSTM
  4.4 Methodology: Convolutional Tensor-Train LSTM
    4.4.1 Extending ConvLSTM to Higher-orders
    4.4.2 Designing an Effective and Efficient Higher-order ConvLSTM
  4.5 Experimental Results
    4.5.1 Experimental Setup
    4.5.2 Analysis of Empirical Results
  4.6 Discussions
  4.7 Supplemental Materials for Empirical Studies
    4.7.1 Preprocessing Module
    4.7.2 Model Architectures
    4.7.3 Training, Evaluation, and Hyper-parameters Selection
    4.7.4 Datasets
    4.7.5 Ablation Studies
  4.8 Conclusion
Chapter 5: Efficient Learning of Bayesian Quantized Networks
  5.1 Overview
  5.2 Bayesian Neural Networks
    5.2.1 The Prediction Problem in BNNs
    5.2.2 The Learning Problem in BNNs
    5.2.3 Learning in BNNs: An Alternative ELBO
    5.2.4 Prediction in BNNs: A Tensor Approach
  5.3 Related Works and Discussions
  5.4 Experimental Results
    5.4.1 Experimental Setup
    5.4.2 Analysis of Empirical Results
  5.5 Conclusion
Chapter 6: ARMA Nets: Economical Expansion of Receptive Fields
  6.1 Overview
  6.2 Related Works
  6.3 ARMA Neural Networks
    6.3.1 ARMA Layer
    6.3.2 Effective Receptive Field
    6.3.3 Prediction and Learning of ARMA Layers
    6.3.4 Stability of ARMA Layers
  6.4 Experimental Results
  6.5 Supplemental Materials for Empirical Studies
    6.5.1 Visualization of Effective Receptive Fields
    6.5.2 Multi-frame Video Prediction
    6.5.3 Medical Image Segmentation
    6.5.4 Image Classification
  6.6 Conclusion
Chapter 7: Scaling-up Diverse Orthogonal Convolutional Networks by a Paraunitary Framework
  7.1 Overview
  7.2 Related Works
  7.3 Designing Orthogonal Convolutions via Paraunitary Systems
    7.3.1 Achieving Orthogonal Convolutions by Paraunitary Systems
    7.3.2 Parameterization of Paraunitary Systems
    7.3.3 Separable Orthogonal 2D-Convolutions
  7.4 Unifying Orthogonal Convolutions Variants as Paraunitary Systems
  7.5 Learning Deep Orthogonal Networks with Lipschitz Bounds
  7.6 Experimental Results
    7.6.1 Exact Orthogonality
    7.6.2 Adversarial Robustness
    7.6.3 Scaling-up Deep Orthogonal Networks with Lipschitz Bounds
  7.7 Supplementary Materials
    7.7.1 Pseudo Code for SC-Fac Algorithm
    7.7.2 Setups and Additional Results for Empirical Studies
    7.7.3 Orthogonal Convolutions for Residual Flows
  7.8 Conclusion
Chapter 8: Conclusion
Bibliography

Chapter 1: Introduction

Deep learning has achieved unprecedented performance in a wide range of applications, most notably computer vision [1, 2] and natural language processing [3, 4]. The most crucial factor in these successes is the growing depth of modern neural networks [5, 6], along with enormous effort in developing techniques to learn deep models [7, 8]. However, mainstream architectures and techniques are often ad hoc, leading to numerous mysterious properties, many of which are undesirable. For example, deep networks typically have more parameters than training examples, and most of these parameters are redundant while consuming computational resources [9]. Furthermore, the learned models are sensitive to adversarial perturbations [10] and tend to make overconfident predictions [11].

The lack of principled methods is attributable to the implicit relationship between network parameters and properties: it is hard to interpret how a parameter impacts the overall properties. In this dissertation, we develop principled network architectures toward addressing this problem using spectral methods: while it is hard to relate network properties to parameters in the original space, we can make the relationship explicit in a carefully chosen spectral domain.

In this chapter, we briefly review the undesirable properties of modern neural networks, including over-parameterization, uncertainty miscalibration, and sensitivity to perturbations. Then we discuss the equivalence between a network architecture in deep learning and a hypothesis set in machine learning. Our goal in architecture design is to eliminate undesirable hypotheses from the hypothesis set. The following chapters present how to design network architectures (i.e., choose hypothesis sets) to avoid these undesirable properties.

Network architectures and hypothesis sets. Abstractly, a deep network is an adaptive function that approximates a certain mapping
$$ y = \mathrm{NN}(x;\, \theta), \qquad (1.1) $$
where $x, y$ are the input/output of the network $\mathrm{NN}$, and $\theta$ denotes the network parameters. There are two steps to obtain a particular function.

(1) In the first step, we pick a network architecture, which defines the space of all possible mappings:
$$ \mathcal{H} = \{\mathrm{NN}(\cdot\,;\, \theta)\}_{\theta \in \Theta}, \qquad (1.2) $$
where $\mathcal{H}$ is known as the hypothesis set, and $\Theta$ is the parameter space. When the architecture is known from the context, $\mathcal{H}$ and $\Theta$ are two equivalent notions.

(2) In the second step, we choose one mapping $h^\star$ among all candidates in $\mathcal{H}$ by minimizing a loss function $L$:
$$ \theta^\star = \arg\min_{\theta \in \Theta} L(\mathcal{D}, \theta), \qquad (1.3) $$
where $\mathcal{D}$ is a dataset. Equivalently, we have $h^\star = \mathrm{NN}(\cdot\,;\, \theta^\star)$.
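To make the two-step recipe of Equations (1.1)-(1.3) concrete, here is a minimal NumPy sketch (an illustration written for this edit, not code from the dissertation; the linear hypothesis set, synthetic data, and plain gradient descent are all illustrative assumptions): step (1) fixes a hypothesis set by fixing an architecture, and step (2) selects one hypothesis by minimizing a squared loss over the parameter space.

```python
import numpy as np

# Step (1): pick an architecture, i.e., a hypothesis set H = {NN(.; theta)}.
# Here the "network" is a single linear layer, so Theta = R^{8 x 1}.
def nn(x, theta):
    return x @ theta

# Step (2): choose one hypothesis by minimizing L(D, theta) over Theta.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))                                   # dataset D: inputs
y = X @ rng.normal(size=(8, 1)) + 0.01 * rng.normal(size=(256, 1))  # targets

theta = np.zeros((8, 1))
lr = 0.1
for _ in range(200):                      # plain gradient descent on the squared loss
    grad = 2.0 / len(X) * X.T @ (nn(X, theta) - y)
    theta -= lr * grad

h_star = lambda x: nn(x, theta)           # the selected hypothesis h* = NN(.; theta*)
print("training loss:", float(np.mean((h_star(X) - y) ** 2)))
```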
Our research focuses on the first step: we try to choose the hypothesis set $\mathcal{H}$ such that we can readily find a desirable model $h$. Since the relationships between the parameter space and the network properties are generally implicit, we investigate spectral methods, i.e., transforms $\Theta \to \mathcal{F}$ such that in the transformed space $\mathcal{F}$ the relationships become explicit.

Figure 1.1: Architecture of a deep residual network [5].

In practice, the building blocks of deep networks are usually simple (see Figure 1.1). Typically, a deep network is a cascade of multiple blocks, where each block consists of interleaving linear and nonlinear layers. Despite this simplicity, a composition of simple blocks can lead to many undesirable global properties, including over-parameterization, miscalibrated uncertainties, and sensitivity to perturbations.

Over-parameterization. The increasing performance of modern neural networks is credited to the evolving deep architectures [1, 2, 5] (see Figure 1.2). However, their parameter counts grow with the depth, which in turn leads to an explosive growth of computation, as shown in Figure 1.3. The high demand for computational resources makes these models very difficult to deploy on constrained devices like smartphones or embedded systems.

Figure 1.2: Increasing performance and growing depth of modern neural networks on ImageNet classification (top-5 error), from AlexNet (8 layers) to VGG (19 layers), GoogLeNet (22 layers), and ResNet (152 layers) [5]. Source: Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep Residual Learning for Image Recognition," CVPR 2016.

Figure 1.3: Increasing computational complexity of modern neural networks [12].

Uncertainty miscalibration. Calibrated prediction, i.e., predicting probability estimates that reflect the true correctness likelihood, is important in uncertainty-sensitive applications such as healthcare. However, deep networks are typically poorly calibrated [11]. One mysterious property of these deep networks is that they tend to give over-confident predictions even for out-of-distribution data (cf. Figure 1.4a). From a different perspective, the negative log-likelihood, which measures the level of uncertainty, is more prone to overfitting than the predictive error (cf. Figure 1.4b).

Figure 1.4: Uncertainty miscalibration of modern neural networks [11]. (a) Overconfident predictions for out-of-distribution data. (b) Easy overfitting of the negative log-likelihood (NLL).

Figure 1.5: Modern neural networks are sensitive to adversarial perturbations [10].

Sensitivity to perturbations. Deep networks are susceptible to small perturbations: an imperceptible perturbation to the input can flip the prediction completely. The perturbation can be artificial noise added to the input, as shown in Figure 1.5 [10], or geometric transforms such as translation, rotation, and scaling, as shown in Figure 1.6 [13]. This problem raises concerns about the use of deep networks in security-sensitive applications.

Figure 1.6: Modern neural networks are sensitive to geometric perturbations [13].

Organization of the dissertation. The rest of the dissertation is organized as follows. In Chapter 2 and Chapter 3, we introduce tensor representations, which allow for concise descriptions of multilinear operations. In Chapter 2, we present a framework of compact layer designs based on tensor representations.
With this framework, we derive a family of compact yet expressive convolutional networks, which maintain high performance after significantly reducing the number of model parameters. In Chapter 3, we develop an automatic library that can build and efficiently train tensorial neural networks (whose layers adopt a tensor representation). In Chapter 4, we use tensor representations to develop the first higher-order recurrent network for spatio-temporal learning. This model achieves state-of-the-art performance among recent approaches while using the fewest parameters. In Chapter 5, we introduce Bayesian neural networks, which explicitly characterize uncertainties by learning the posterior distribution of the network parameters. We then propose an efficient sampling-free approach to learn Bayesian quantized networks based on tensor operations; our method nicely bridges the learning of calibrated Bayesian models and compact quantized networks. In Chapter 6 and Chapter 7, we show that the convolutional layer in neural networks is equivalent to the MIMO filter bank in signal processing. In Chapter 6, we introduce ARMA layers based on IIR filter banks for dense prediction networks, so that we can expand their receptive fields economically. In Chapter 7, we propose orthogonal convolutional layers for robust networks based on paraunitary filter banks, so that we can ensure exact orthogonality easily. We conclude the dissertation in Chapter 8 and discuss potential future work.

Chapter 2: Compact Architecture Designs by Tensor Representations

2.1 Overview

Modern neural networks [1, 2, 6, 14, 15, 16] achieve unprecedented performance on many difficult learning problems at the cost of requiring excessive model parameters for deeper and wider architectures. The vast number of model parameters is a practical obstacle to deploying neural networks on constrained devices, such as smartphones and IoT devices. Thus a fundamental problem in deep learning is to design neural networks with compact architectures that maintain expressive power comparable to large models. Two complementary approaches are common for this purpose: one compresses pre-trained models while preserving their performance as much as possible [9]; the other aims to develop compact neural architectures such as inception modules [16], interleaved group convolutions [17], and bottleneck blocks [15, 18]. Since linear layers (i.e., fully-connected and convolutional layers) comprise almost all parameters and computation, the common goal of both approaches is to reduce the expense of the linear operations.

Motivated by the tensor decomposition of linear layers [19, 20, 21], we propose a framework of tensorial layers that outlines the design space of low-rank factorizations; the framework simultaneously allows compression of pre-trained models and exploration of better network architectures. Our proposed tensorial layers extend the linear operations of matrix multiplications (in fully-connected layers) and multi-channel convolutions (in convolutional layers) to multilinear operations with multiple kernels. To characterize these layers, we introduce a novel suite of generalized tensor algebra that extends linear operations on low-order tensors to multilinear ones on higher-order tensors (cf. Section 2.3). We name a neural network composed of tensorial layers a tensorial neural network (TNN), which by definition generalizes the traditional neural network (NN):
if we restrict the multilinear operations in tensorial layers to matrix multiplications or multi-channel convolutions, the TNN reduces to a traditional NN. Unlike traditional NNs, which may flatten the data into low-order tensors (e.g., from videos to frames), TNNs allow for data with arbitrary order. Quite the opposite, TNNs deliberately reshape the data into higher-order tensors and use higher-order weight kernels in each layer. In this higher-order space, TNNs can achieve high expressive power with a smaller number of parameters.

Figure 2.1: A toy example of invariant structures. The periodic structure [1, 2, 3, 1, 2, 3, 1, 2, 3] and the modulated structure [1, 1, 1, 2, 2, 2, 3, 3, 3] are exposed by exploiting the low-rank structure of the reshaped matrix.

To understand the benefit of higher-order space, we illustrate with a toy example in Figure 2.1. Consider a vector with periodic structure [1, 2, 3, 1, 2, 3, 1, 2, 3] or with modulated structure [1, 1, 1, 2, 2, 2, 3, 3, 3]. Representing either vector naively requires 9 parameters, and the vector by itself cannot be further compressed by factorization. However, we can reshape the vector into a higher-order object, for instance, the matrix [1, 1, 1; 2, 2, 2; 3, 3, 3]. Since all columns of this matrix are the same, we can decompose this rank-1 matrix into an outer product of two vectors without losing information. Therefore, only 6 parameters are needed to represent the original length-9 vector. Intuitively, it is easier to represent higher-order tensors in a factorized form than low-order ones.

To use TNNs in practice, we need to address both the prediction and learning problems in tensorial layers. (1) Prediction with a TNN is similar to a traditional NN: the input passes through all layers in a feedforward manner. In a TNN, each layer involves a generalized tensor operation between the higher-order input and multiple weight kernels, followed by an activation function such as ReLU. (2) To provide a practical solution to the learning problem, we derive efficient backpropagation rules [22] for a broad family of tensorial layers using the newly introduced tensor algebra. We can then efficiently learn TNNs using first-order optimization methods such as stochastic gradient descent (SGD).

Although we could build and train TNNs from scratch, we can also use them to compress pre-trained NNs, as tensorial layers naturally identify both low-rank and invariant structures in the original kernels of the linear layers (Figure 2.1). Given a pre-trained NN $g^q \in \mathcal{G}^q$ with $q$ parameters, we may compress it to a TNN $h^p \in \mathcal{H}^p$ with $p$ parameters, as depicted in Figure 2.7. This process involves two steps: (1) data tensorization, reshaping the input into a higher-order tensor; and (2) knowledge distillation, mapping a NN to a TNN using layer-wise data reconstruction.

We demonstrate the expressive power of TNNs by conducting experiments on several benchmark image classification datasets. Our algorithm compresses ResNet-32 on the CIFAR-10 dataset by 10x with a degradation of only 1.92% (achieving an accuracy of 91.28%). Experiments on LeNet-5, VGG, ResNet, and Wide-ResNet consistently verify that our tensorial neural networks outperform the state-of-the-art low-rank architectures under the same compression rate (with 5% test accuracy improvement on CIFAR-10 using sequential knowledge distillation, and on ImageNet when trained from scratch).
Contributions. In summary, we make the following contributions:

1. We propose a framework of tensorial layers, which extends the special linear operations in traditional neural networks to general multilinear operations. This results in tensorial neural networks (TNNs) with compact architecture designs in higher-order space.

2. We introduce a system of generalized tensor algebra, with which we derive efficient prediction and learning in TNNs. In particular, we are the first to derive and analyze backpropagation for generalized tensor operations.

3. We develop an effective algorithm to compress pre-trained models into TNNs, exploiting low-rank and invariant structures in the parameter space.

4. We provide interpretations of well-known network architectures with our proposed tensorial layers, explaining why these architectures are empirically successful. Our framework provides a principled way to design structured weight matrices/tensors (see examples in Figures 2.8 and 2.9).

2.2 Related Works

Tensor networks are widely used in quantum physics [23], numerical analysis [24], and machine learning [25, 26]. [27, 28] use tensor networks to establish the expressive power of convolutional and recurrent neural networks. Recently, [29] combined tensor networks with genetic algorithms to search for efficient layer designs. Unlike our work, the search space in [29] only includes low-order tensors. Moreover, their method does not apply knowledge distillation to pre-trained models to produce more compact architectures.

Model compression of neural networks. Existing approaches for neural network compression can be roughly grouped into the following categories: low-rank factorization, design of compact filters, knowledge distillation, as well as pruning, quantization, and encoding.

1. Low-rank factorization. Various factorizations have been proposed to reduce the number of parameters in linear layers. Pioneering works propose flattening/unfolding the parameters of convolutional layers into matrices (known as matricization), followed by dictionary learning or matrix decomposition [30, 31, 32]. Subsequently, [20] and [21] show that it is possible to compress these parameter structures directly using tensor decompositions (e.g., CP or Tucker decomposition [33]). The groundbreaking works [19, 34] demonstrate that low-order parameter structures can be efficiently compressed via tensor-train decomposition [35] by first reshaping the structures into higher-order tensors. This idea was later extended in two directions: tensor-train decomposition is used to compress LSTM/GRU layers in recurrent neural networks [36], higher-order recurrent neural networks [37, 38], and 3D convolutional layers [39]; other decompositions are also explored for better compression, such as tensor-ring decomposition [40] and block-term decomposition [41].

2. Pruning, quantization, and encoding. The pioneering work [42] proposes a three-step pipeline to compress a pre-trained model by pruning uninformative connections, quantizing the remaining weights, and encoding the discretized parameters. These ideas are complementary to low-rank factorization: [43] combines pruning with low-rank factorization, and [44] combines quantization with low-rank factorization.

3. Knowledge distillation. This process aims to transfer information from a pre-trained teacher network to a smaller student network. [45, 46] propose to train the student network with the teacher network's logits (the vector before the softmax layer).
[47] extends this idea so that the outputs from both networks match at each layer, via an affine transformation.

Our approach synergizes two of the above methods: (1) it uses knowledge distillation to project a pre-trained network onto the set of TNNs with low-rank tensor structures, and (2) it exploits low-rank structures, which naturally correspond to compact architecture designs (structured connections) that allow for efficient evaluation using generalized tensor operations. Since other compression methods such as pruning and quantization complement our approach, they may further improve performance when combined with it.

2.3 Introduction to Tensor Representations

Notations. Lower case letters (e.g., $v$), upper case letters (e.g., $M$), and calligraphic letters (e.g., $\mathcal{T}$) denote vectors, matrices, and multi-dimensional arrays (tensors), respectively. We say that the array $\mathcal{T} \in \mathbb{R}^{I_0 \times \cdots \times I_{m-1}}$ is an $m$-order tensor. Furthermore, the $k$-th coordinate of the entries of $\mathcal{T}$ corresponds to the $k$-th mode of $\mathcal{T}$, and $I_k$ is referred to as the dimension of $\mathcal{T}$ along mode $k$. By fixing all indices of $\mathcal{T}$ except the one corresponding to mode $k$, we obtain the mode-$k$ fibers of $\mathcal{T}$, where the vector $\mathcal{T}_{i_0, \ldots, i_{k-1}, :, i_{k+1}, \ldots, i_{m-1}} \in \mathbb{R}^{I_k}$ denotes the mode-$k$ fiber of $\mathcal{T}$ indexed by $(i_0, \ldots, i_{k-1}, i_{k+1}, \ldots, i_{m-1})$.

Table 2.1: Summary of tensor operations. In this table, $\mathcal{X} \in \mathbb{R}^{I_0 \times \cdots \times I_{m-1}}$ and $\mathcal{Y} \in \mathbb{R}^{J_0 \times \cdots \times J_{n-1}}$ are the input tensors, and $\mathcal{T}^{(1)}$, $\mathcal{T}^{(2)}$, $\mathcal{T}^{(3)}$ are the outputs of mode-$(I_k, J_l)$ tensor contraction, convolution, and batch product, respectively. Both mode-$(I_k, J_l)$ tensor contraction and mode-$(I_k, J_l)$ tensor batch product are legal only if $I_k = J_l$. As a result, $\mathcal{T}^{(1)}$ is an $(m+n-2)$-order tensor, while $\mathcal{T}^{(2)}$ and $\mathcal{T}^{(3)}$ are $(m+n-1)$-order tensors.

- Tensor contraction, $\mathcal{T}^{(1)} = \mathcal{X} \times_{J_l}^{I_k} \mathcal{Y}$: the inner product of the mode-$k$ fibers of $\mathcal{X}$ with the mode-$l$ fibers of $\mathcal{Y}$,
  $\mathcal{T}^{(1)}_{i_0,\ldots,i_{k-1},i_{k+1},\ldots,i_{m-1},j_0,\ldots,j_{l-1},j_{l+1},\ldots,j_{n-1}} = \big\langle \mathcal{X}_{i_0,\ldots,i_{k-1},:,i_{k+1},\ldots,i_{m-1}},\ \mathcal{Y}_{j_0,\ldots,j_{l-1},:,j_{l+1},\ldots,j_{n-1}} \big\rangle$.
- Tensor convolution, $\mathcal{T}^{(2)} = \mathcal{X} *_{J_l}^{I_k} \mathcal{Y}$: the convolution of the mode-$k$ fibers of $\mathcal{X}$ with the mode-$l$ fibers of $\mathcal{Y}$,
  $\mathcal{T}^{(2)}_{i_0,\ldots,i_{k-1},:,i_{k+1},\ldots,i_{m-1},j_0,\ldots,j_{l-1},j_{l+1},\ldots,j_{n-1}} = \mathcal{X}_{i_0,\ldots,i_{k-1},:,i_{k+1},\ldots,i_{m-1}} * \mathcal{Y}_{j_0,\ldots,j_{l-1},:,j_{l+1},\ldots,j_{n-1}}$.
- Tensor batch product, $\mathcal{T}^{(3)} = \mathcal{X} \circ_{J_l}^{I_k} \mathcal{Y}$: the Hadamard product of the mode-$k$ fibers of $\mathcal{X}$ with the mode-$l$ fibers of $\mathcal{Y}$,
  $\mathcal{T}^{(3)}_{i_0,\ldots,i_{k-1},r,i_{k+1},\ldots,i_{m-1},j_0,\ldots,j_{l-1},j_{l+1},\ldots,j_{n-1}} = \mathcal{X}_{i_0,\ldots,i_{k-1},r,i_{k+1},\ldots,i_{m-1}}\, \mathcal{Y}_{j_0,\ldots,j_{l-1},r,j_{l+1},\ldots,j_{n-1}}$.

Tensor diagrams. In Figure 2.2, we introduce tensor diagrams, graphical representations of multi-dimensional arrays [23, 24]. In a tensor diagram, each node represents an array (scalar, vector, matrix, or higher-order tensor), with its order denoted by the number of legs extending from the node. Each leg corresponds to one tensor mode, whose dimension is associated with a positive integer. Notice that tensor diagrams are ordering-agnostic, e.g., a matrix $M \in \mathbb{R}^{I \times J}$ and its transpose $M^\top \in \mathbb{R}^{J \times I}$ have the same diagram.

Figure 2.2: Tensor diagrams of a scalar $a \in \mathbb{R}$, a vector $v \in \mathbb{R}^{I}$, a matrix $M \in \mathbb{R}^{I \times J}$, and a 3rd-order tensor $\mathcal{T} \in \mathbb{R}^{I \times J \times K}$.

Primitives of tensor operations. In Figure 2.3 and Table 2.1, we introduce the primitives for generalized tensor operations on arbitrary-order tensors, extending the matrix/vector operations.
In tensor diagrams, an operation is a (hyper-)edge that links the legs of the input tensors, where the edge shape denotes the type of operation: a solid line stands for tensor contraction, a dashed line represents tensor convolution, and a curved line is for tensor batch product. We illustrate these three operations with examples of 3rd-order tensors $\mathcal{X}$ and $\mathcal{Y}$ in Figure 2.3 and define them on higher-order tensors in Table 2.1.

Figure 2.3: Primitives of generalized tensor operations. In the diagrams, $\mathcal{X} \in \mathbb{R}^{I_0 \times I_1 \times I_2}$ and $\mathcal{Y} \in \mathbb{R}^{J_0 \times J_1 \times J_2}$ are input tensors, and $\mathcal{T}^{(1)} \in \mathbb{R}^{I_1 \times I_2 \times J_0 \times J_2}$, $\mathcal{T}^{(2)} \in \mathbb{R}^{I_0' \times I_1 \times I_2 \times J_0 \times J_2}$, $\mathcal{T}^{(3)} \in \mathbb{R}^{I_0 \times I_1 \times I_2 \times J_0 \times J_2}$ are the outputs of the corresponding operations: (a) mode-$(I_0, J_1)$ tensor contraction, $\mathcal{X} \times_{J_1}^{I_0} \mathcal{Y} \triangleq \mathcal{T}^{(1)}$ with $\mathcal{T}^{(1)}_{i_1,i_2,j_0,j_2} = \sum_r \mathcal{X}_{r,i_1,i_2}\, \mathcal{Y}_{j_0,r,j_2}$; (b) mode-$(I_0, J_1)$ tensor convolution, $\mathcal{X} *_{J_1}^{I_0} \mathcal{Y} \triangleq \mathcal{T}^{(2)}$ with $\mathcal{T}^{(2)}_{:,i_1,i_2,j_0,j_2} = \mathcal{X}_{:,i_1,i_2} * \mathcal{Y}_{j_0,:,j_2}$; (c) mode-$(I_0, J_1)$ tensor batch product, $\mathcal{X} \circ_{J_1}^{I_0} \mathcal{Y} \triangleq \mathcal{T}^{(3)}$ with $\mathcal{T}^{(3)}_{r,i_1,i_2,j_0,j_2} = \mathcal{X}_{r,i_1,i_2}\, \mathcal{Y}_{j_0,r,j_2}$. Similar definitions apply to general mode-$(I_k, J_l)$ tensor operations.

Generalized tensor operations. Generalized tensor operations take two or more tensors as inputs and carry out one or more primitive operations on those tensors. In Figure 2.4, we illustrate three non-primitive generalized tensor operations. We refer to the primitive tensor operations in Figure 2.3 as single-edge-double-node operations; similarly, the three generalized tensor operations in Figure 2.4 are called multi-edge-double-node, single-edge-multi-node, and multi-edge-multi-node operations, respectively. Given a generalized tensor operation formed from more than one primitive operation, we may evaluate the primitives in any order and obtain the same result. However, evaluating the primitives in one order may require substantially more floating-point operations (FLOPs) than in another. While it is NP-hard to obtain the best order (the one requiring the fewest FLOPs) [48], an exhaustive search is practical if the number of input tensors is small [49].

Figure 2.4: Examples of generalized tensor operations: (a) a multi-edge-double-node operation, $\mathcal{T}^{(1)}_{i_1,j_0,:} = \sum_r \mathcal{X}_{r,i_1,:} * \mathcal{Y}_{j_0,r,:}$, known as a 1D-convolutional layer; (b) a single-edge-multi-node operation, $\mathcal{T}^{(2)}_{i_1,j_1,k_1} = \sum_r \mathcal{X}_{r,i_1}\, \mathcal{Y}_{r,j_1}\, \mathcal{Z}_{r,k_1}$, known as the CP decomposition; (c) a multi-edge-multi-node operation, $\mathcal{T}^{(3)}_{i_1,j_0,k_2} = \sum_{\{r_l\}} \mathcal{X}_{r_0,i_1,r_1}\, \mathcal{Y}_{j_0,r_0,r_2}\, \mathcal{Z}_{r_1,r_2,k_2}$, known as the tensor-ring decomposition.

2.4 Tensorial Neural Networks (TNNs)

In this section, we introduce tensorial neural networks, a type of neural network whose layers (called tensorial layers) are tensor networks. Tensorial layers generalize traditional fully-connected/convolutional layers, as the transformations of these layers are themselves primitive/generalized tensor operations. For example, a fully-connected layer, which involves a matrix-vector product, is equivalent to a contraction (cf. Figure 2.3a), and we will see that a convolutional layer is equivalent to a generalized tensor operation (cf. Figure 2.5a). Our primary focus is on tensorial layers that extend the traditional convolutional layer, since a fully-connected layer is simply a $1 \times 1$ convolutional layer.
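As a concrete companion to Table 2.1 and Figure 2.3, the following NumPy sketch (an illustration written for this edit, not code from the dissertation; shapes are arbitrary) implements the three primitives for the 3rd-order example with $k = 0$ on $\mathcal{X}$ and $l = 1$ on $\mathcal{Y}$.

```python
import numpy as np

X = np.random.randn(4, 5, 6)   # I0=4, I1=5, I2=6
Y = np.random.randn(7, 4, 8)   # J0=7, J1=4, J2=8  (note J1 = I0 = 4)

# Mode-(I0, J1) tensor contraction:
#   T1[i1,i2,j0,j2] = sum_r X[r,i1,i2] * Y[j0,r,j2]
T1 = np.einsum('rab,crd->abcd', X, Y)
assert np.allclose(T1, np.tensordot(X, Y, axes=([0], [1])))

# Mode-(I0, J1) tensor batch product: the shared index r is kept, not summed.
#   T3[r,i1,i2,j0,j2] = X[r,i1,i2] * Y[j0,r,j2]
T3 = np.einsum('rab,crd->rabcd', X, Y)

# Mode-(I0, J1) tensor convolution: full 1D convolution of the mode-0 fibers
# of X with the mode-1 fibers of Y, giving a new mode of size I0 + J1 - 1.
I0, I1, I2 = X.shape
J0, J1, J2 = Y.shape
T2 = np.zeros((I0 + J1 - 1, I1, I2, J0, J2))
for i1 in range(I1):
    for i2 in range(I2):
        for j0 in range(J0):
            for j2 in range(J2):
                T2[:, i1, i2, j0, j2] = np.convolve(X[:, i1, i2], Y[j0, :, j2])

print(T1.shape, T2.shape, T3.shape)   # (5,6,7,8) (7,5,6,7,8) (4,5,6,7,8)
```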
2.4.1 Tensorial Layers versus Traditional Layers

Each convolutional layer in a convolutional neural network (CNN) is given by a compound operation applied to a 3rd-order input tensor and a 4th-order weight tensor (cf. Figure 2.5a). In contrast, each tensorial layer in a TNN corresponds to an arbitrary generalized tensor operation applied to a higher-order input tensor and multiple weight tensors (cf. Figure 2.5b). We describe both types of layers in more detail below.

Traditional convolutional layer. A traditional 2D-convolutional layer is parameterized by a 4th-order weight kernel $\mathcal{K} \in \mathbb{R}^{H \times W \times S \times T}$, where $H$ (resp. $W$) is the height (resp. width) of the filter, and $S$ (resp. $T$) is the number of input (resp. output) channels. Such a layer maps a 3rd-order input tensor $\mathcal{U} \in \mathbb{R}^{X \times Y \times S}$ to another 3rd-order output tensor $\mathcal{V} \in \mathbb{R}^{X' \times Y' \times T}$, where $X$ (resp. $Y$) is the height (resp. width) of the input feature maps, and $X'$ (resp. $Y'$) is the height (resp. width) of the output feature maps. We can write this convolutional layer concisely using our generalized tensor operations:
$$ \mathcal{V} = \mathcal{U} \left( *_{H}^{X} \; *_{W}^{Y} \; \times_{S}^{S} \right) \mathcal{K}. \qquad (2.1) $$
A convolutional layer is thus a multi-edge-double-node operation, where multiple primitive tensor operations apply to different modes. Specifically, there are two tensor convolutions, one on the modes with dimensions $X, H$ and the other on the modes with dimensions $Y, W$, and a tensor contraction further applies on the modes with dimension $S$.

Figure 2.5: Comparison between a convolutional layer and a tensorial layer. (a) The traditional convolutional layer, a building block for CNNs: $\mathcal{V} = \mathcal{U} \,(*_{H}^{X}\ *_{W}^{Y}\ \times_{S}^{S})\, \mathcal{K}$. (b) The mTT-convolutional layer, a building block for TNN-mTT: $\mathcal{U}^{(1)} = \mathcal{U}' \times_{S_0}^{S_0} \mathcal{K}^{(0)}$, $\mathcal{U}^{(\ell+1)} = \mathcal{U}^{(\ell)} \,(\times_{R_{\ell-1}}^{R_{\ell-1}}\ \times_{S_\ell}^{S_\ell})\, \mathcal{K}^{(\ell)}$, and $\mathcal{V}' = \mathcal{U}^{(m)} \,(*_{H}^{X}\ *_{W}^{Y}\ \times_{R_{m-1}}^{R_{m-1}})\, \mathcal{K}^{(m)}$.

Tensorial layers. Tensorial layers apply a generalized tensor operation to an input tensor and multiple weight kernels. We illustrate several tensorial layers in Figure 2.5 and Figure 2.6. In Figure 2.5b, we illustrate a tensorial layer inspired by the tensor-train (TT) layer [35]. We will refer to this layer as an mTT-convolutional layer (the letter "m" is for "modified"; this layer is slightly different from the one in [35]). An mTT layer takes an $(m+2)$-order input tensor $\mathcal{U}'$ and returns an output tensor $\mathcal{V}'$ of the same order. This layer has $(m+1)$ kernels $\{\mathcal{K}_i\}_{i=0}^{m}$ as parameters, in order to preserve the multi-dimensional structure of $\mathcal{U}'$. Mode $i$ of $\mathcal{U}'$ contracts with its corresponding kernel $\mathcal{K}_i$, and interactions between modes are captured by contractions between adjacent kernels (e.g., $\mathcal{K}_i$ and $\mathcal{K}_{i+1}$). These contractions are crucial for modeling multi-dimensional transformations with high expressive power. Thus, an mTT-convolutional layer enables processing of a higher-order input. We refer to a network with mTT-convolutional layers as a TNN-mTT. In Figures 2.6a to 2.6c, we develop other tensorial layers inspired by the tensor-ring (TR), canonical polyadic (CP), and Tucker (TK) tensor decompositions [33, 40]; we refer to the corresponding networks as TNN-mTR, TNN-mCP, and TNN-mTK networks, respectively.
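To see Equation (2.1) as executable arithmetic, here is a small NumPy sketch (an illustration written for this edit, not code from the dissertation; it assumes stride 1, no padding, NumPy >= 1.20 for sliding_window_view, and the cross-correlation convention common in deep learning): the sliding windows realize the two tensor convolutions over $(X, H)$ and $(Y, W)$, and the einsum realizes the contraction over $S$.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

X_, Y_, S, T, H, W = 10, 12, 3, 5, 3, 3
U = np.random.randn(X_, Y_, S)        # input feature maps  (X x Y x S)
K = np.random.randn(H, W, S, T)       # weight kernel       (H x W x S x T)

# windows[x', y', s, h, w] = U[x' + h, y' + w, s]
windows = sliding_window_view(U, (H, W), axis=(0, 1))
V = np.einsum('xyshw,hwst->xyt', windows, K)   # V[x', y', t], Eq. (2.1)
print(V.shape)                                  # (8, 10, 5)
```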
Figure 2.6: Variants of tensorial layers. (a) The mCP-convolutional layer, a building block for TNN-mCP: $\mathcal{U}^{(1)} = \mathcal{U}' \times_{S_0}^{S_0} \mathcal{K}^{(0)}$, $\mathcal{U}^{(\ell+1)} = \mathcal{U}^{(\ell)} \,(\times_{R}^{R}\ \times_{S_\ell}^{S_\ell})\, \mathcal{K}^{(\ell)}$, $\mathcal{V}' = \mathcal{U}^{(m)} \,(*_{H}^{X}\ *_{W}^{Y}\ \times_{R}^{R})\, \mathcal{K}^{(m)}$. (b) The mTK-convolutional layer, a building block for TNN-mTK: $\mathcal{U}^{(0)} = \mathcal{U}' \times_{S_0}^{S_0} \mathcal{P}^{(0)} \cdots \times_{S_{m-1}}^{S_{m-1}} \mathcal{P}^{(m-1)}$, $\mathcal{U}^{(1)} = \mathcal{U}^{(0)} \,(*_{H}^{X}\ *_{W}^{Y}\ \times_{R_0^s}^{R_0^s} \cdots \times_{R_{m-1}^s}^{R_{m-1}^s})\, \mathcal{C}$, $\mathcal{V}' = \mathcal{U}^{(1)} \times_{R_0^t}^{R_0^t} \mathcal{Q}^{(0)} \cdots \times_{R_{m-1}^t}^{R_{m-1}^t} \mathcal{Q}^{(m-1)}$. (c) The mTR-convolutional layer, a building block for TNN-mTR: $\mathcal{U}^{(1)} = \mathcal{U}' \times_{S_0}^{S_0} \mathcal{K}^{(0)}$, $\mathcal{U}^{(\ell+1)} = \mathcal{U}^{(\ell)} \,(\times_{R_{\ell-1}}^{R_{\ell-1}}\ \times_{S_\ell}^{S_\ell})\, \mathcal{K}^{(\ell)}$, $\mathcal{V}' = \mathcal{U}^{(m)} \,(*_{H}^{X}\ *_{W}^{Y}\ \times_{R_{m-1}}^{R_{m-1}}\ \times_{R_m}^{R_m})\, \mathcal{K}^{(m)}$.

2.4.2 Relationship between NNs and TNNs

Interpretation via generalized tensor decompositions. We can use a tensorial layer to approximate a higher-order linear layer (fully-connected or convolutional). Suppose $\mathcal{U}$, $\mathcal{K}$, and $\mathcal{V}$ in Equation (2.1) are reshaped into higher-order tensors $\mathcal{U}'$, $\mathcal{K}'$, and $\mathcal{V}'$, such that the input/output channels are indexed by $m$ modes (i.e., $\mathcal{U}' \in \mathbb{R}^{X \times Y \times S_0 \times \cdots \times S_{m-1}}$, $\mathcal{K}' \in \mathbb{R}^{H \times W \times S_0 \times \cdots \times S_{m-1} \times T_0 \times \cdots \times T_{m-1}}$, and $\mathcal{V}' \in \mathbb{R}^{X' \times Y' \times T_0 \times \cdots \times T_{m-1}}$, where $S = \prod_{i=0}^{m-1} S_i$ and $T = \prod_{i=0}^{m-1} T_i$). We then have the following relationship between $\mathcal{U}'$, $\mathcal{K}'$, and $\mathcal{V}'$:
$$ \mathcal{V}' = \mathcal{U}' \left( *_{H}^{X} \; *_{W}^{Y} \; \times_{S_0}^{S_0} \cdots \times_{S_{m-1}}^{S_{m-1}} \right) \mathcal{K}'. \qquad (2.2) $$
For a TNN-mTT tensorial layer, the kernels $\{\mathcal{K}_i\}_{i=0}^{m}$ correspond to factors of $\mathcal{K}'$ when $\mathcal{K}'$ can be represented with a modified tensor-train decomposition:
$$ \mathcal{K}' \triangleq \mathrm{mTT}\big(\{\mathcal{K}_i\}_{i=0}^{m}\big) = \mathcal{K}_0 \times_{R_0}^{R_0} \mathcal{K}_1 \times_{R_1}^{R_1} \cdots \times_{R_{m-1}}^{R_{m-1}} \mathcal{K}_m. \qquad (2.3) $$
This motivates us to compress a linear layer into a tensorial layer and, more broadly, to compress a traditional NN into a compact TNN. In Section 2.5, we study the relevant compression algorithms in detail.

Hypothesis sets of NNs and TNNs. Suppose the class of traditional NNs and our proposed TNNs share the same architecture (i.e., only the tensor operation in each layer is different). We illustrate the relations between their hypothesis sets in Figure 2.7. Let $\mathcal{G}^q$ and $\mathcal{H}^q$ denote the classes of functions that can be represented by NNs and TNNs, respectively, both with at most $q$ parameters. (1) TNNs generalize NNs: formally, for any $q > 0$, $\mathcal{G}^q \subseteq \mathcal{H}^q$ holds. (2) NNs can be mapped to TNNs with fewer parameters, and thus TNNs can be used for compression of NNs: formally, there exists $p \leq q$ such that $\mathcal{H}^p \supseteq \mathcal{G}^q$.

Figure 2.7: Relationship between traditional NNs and TNNs. Suppose the classes of NNs and TNNs have the same architecture (i.e., only the tensor operation at each layer is different), and $f$ is the target concept. (1) Learning a NN with $q$ parameters yields the $g^q$ that is closest to $f$ in $\mathcal{G}^q$, while learning a TNN with $q$ parameters yields the $h^q$ that is closest to $f$ in $\mathcal{H}^q$; evidently, $h^q$ is closer to $f$ than $g^q$. (2) Compressing a pre-trained NN $g^q \in \mathcal{G}^q$ to NNs with $p$ parameters ($p \leq q$) yields the $g^p$ that is closest to $g^q$ in $\mathcal{G}^p$, while compressing $g^q$ to TNNs with $p$ parameters yields the $h^p$ that is closest to $g^q$ in $\mathcal{H}^p$; evidently, the compressed TNN $h^p$ is closer to $g^q$ than the compressed NN $g^p$.
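To illustrate the parameter saving behind Equation (2.3), here is a NumPy sketch (written for this edit, not from the dissertation) that reconstructs a reshaped kernel from a chain of factors; the spatial modes $H, W$ are omitted, and the core layout (previous rank, input mode, output mode, next rank) and the mode sizes are illustrative assumptions in the standard TT-matrix convention rather than the dissertation's exact mTT layout.

```python
import numpy as np

S, T, R = [4, 4, 4], [4, 4, 4], [1, 3, 3, 1]     # m = 3 channel modes, TT ranks
cores = [np.random.randn(R[i], S[i], T[i], R[i + 1]) for i in range(3)]

# K'[s0,s1,s2,t0,t1,t2] = sum over the rank modes of the chained cores.
K = np.einsum('astb,buvc,cwxd->suwtvx', *cores)

full_params = int(np.prod(S)) * int(np.prod(T))  # dense reshaped kernel: 4096
tt_params = sum(c.size for c in cores)           # factored kernel:        240
print(K.shape, full_params, tt_params)
```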
2.5 Algorithms for TNNs

In this section, we investigate practical algorithms for TNNs. We first develop prediction and backpropagation algorithms for TNNs, which allow us to train a TNN from scratch. We then consider algorithms that distill a compact TNN from a pre-trained model.

2.5.1 Prediction in TNNs

Prediction with a TNN is similar to that of a traditional neural network: the input passes through the layers in a feedforward manner. Each layer in a TNN applies a generalized tensor operation to the input and multiple weight kernels, followed by a nonlinear function such as ReLU. While it is difficult in general to determine the most efficient order in which to evaluate the primitives of a generalized tensor operation, we derive efficient orders for all TNN architectures introduced in this chapter. For example, we can efficiently evaluate each mTT-convolutional layer as follows:
$$ \mathcal{U}_1 = \mathcal{U}' \times_{S_0}^{S_0} \mathcal{K}_0, \qquad (2.4a) $$
$$ \mathcal{U}_{i+1} = \mathcal{U}_i \left( \times_{R_{i-1}}^{R_{i-1}} \; \times_{S_i}^{S_i} \right) \mathcal{K}_i, \qquad (2.4b) $$
$$ \mathcal{V}' = \mathcal{U}_m \left( *_{H}^{X} \; *_{W}^{Y} \; \times_{R_{m-1}}^{R_{m-1}} \right) \mathcal{K}_m. \qquad (2.4c) $$
Here $\mathcal{U}'$ is the layer input, $\mathcal{V}'$ is the output, and the tensors $\{\mathcal{U}_i\}_{i=1}^{m}$ are intermediate results. We provide efficient strategies for performing the forward pass for the other tensorial layers in Figure 2.6 and summarize the complexity (the number of FLOPs and the amount of parameter storage required) of each forward pass in Table 2.2.

2.5.2 Learning in TNNs

To train a TNN via stochastic gradient descent, we derive backpropagation rules for each tensorial layer displayed in Figures 2.5 and 2.6. To derive such rules, we consider the partial derivatives of a loss function $L$ with respect to the input ($\partial L / \partial \mathcal{U}'$) and the kernel factors (e.g., $\{\partial L / \partial \mathcal{K}_i\}_{i=0}^{m}$ in an mTT-convolutional layer), given $\partial L / \partial \mathcal{V}'$. As for the forward pass, we develop efficient strategies for executing backpropagation for each type of tensorial layer. For an mTT-convolutional layer, an efficient strategy for performing backpropagation is
$$ \frac{\partial L}{\partial \mathcal{U}_m} = \frac{\partial L}{\partial \mathcal{V}'} \left( *_{H}^{X'\top} \; *_{W}^{Y'\top} \right) \mathcal{K}_m, \qquad (2.5a) $$
$$ \frac{\partial L}{\partial \mathcal{K}_m} = \frac{\partial L}{\partial \mathcal{V}'} \left( *_{X}^{X'\top} \; *_{Y}^{Y'\top} \; \times_{T_0}^{T_0} \cdots \times_{T_{m-1}}^{T_{m-1}} \right) \mathcal{U}_m, \qquad (2.5b) $$
$$ \frac{\partial L}{\partial \mathcal{U}_i} = \frac{\partial L}{\partial \mathcal{U}_{i+1}} \left( \times_{R_i}^{R_i} \; \times_{T_i}^{T_i} \right) \mathcal{K}_i, \qquad (2.5c) $$
$$ \frac{\partial L}{\partial \mathcal{K}_i} = \frac{\partial L}{\partial \mathcal{U}_{i+1}} \left( \times_{X}^{X} \; \times_{Y}^{Y} \; \times_{T_0}^{T_0} \cdots \times_{T_{i-1}}^{T_{i-1}} \; \times_{S_{i+1}}^{S_{i+1}} \cdots \times_{S_{m-1}}^{S_{m-1}} \right) \mathcal{U}_i, \qquad (2.5d) $$
$$ \frac{\partial L}{\partial \mathcal{U}'} = \frac{\partial L}{\partial \mathcal{U}_1} \left( \times_{R_0}^{R_0} \; \times_{T_0}^{T_0} \right) \mathcal{K}_0, \qquad (2.5e) $$
$$ \frac{\partial L}{\partial \mathcal{K}_0} = \frac{\partial L}{\partial \mathcal{U}_1} \left( \times_{X}^{X} \; \times_{Y}^{Y} \; \times_{S_1}^{S_1} \cdots \times_{S_{m-1}}^{S_{m-1}} \right) \mathcal{U}', \qquad (2.5f) $$
where $*^\top$ denotes a transposed convolution. We derive efficient backpropagation strategies for the other tensorial layers in [50] and summarize their complexities in Table 2.2.

Table 2.2: The number of parameters and operations required by various tensorial layers. This table shows the special case when $X = Y = X' = Y' = D$, $S = T = N$, $H = W = k$, and $D \gg k$ (see Section 2.4 for notation). Remark: the number of FLOPs does not accurately reflect the actual running time on GPUs, as the existing CUDA library cannot fully utilize the degree of parallelism in general tensor operations.

| Layer    | O(params.)                   | O(forward ops.)               | O(backward ops., input)       | O(backward ops., params.)     |
| original | $k^2N^2$                     | $k^2N^2D^2$                   | $k^2N^2D^2$                   | $N^2D^4$                      |
| mCP      | $(mN^{2/m} + k^2)R$          | $(mN^{1+1/m} + k^2N)RD^2$     | $(mN^{1+1/m} + k^2N)RD^2$     | $(mN^{1+1/m} + D^2N)RD^2$     |
| mTK      | $(2mN^{1/m} + k^2R^{2m-1})R$ | $(2mN + k^2R^{2m-1})RD^2$     | $(2mN + k^2R^{2m-1})RD^2$     | $(2mN + R^{2m-1}D^2)RD^2$     |
| mTT      | $(mN^{2/m}R + k^2)R$         | $(mN^{1+1/m}R + k^2N)RD^2$    | $(mN^{1+1/m}R + k^2N)RD^2$    | $(mN^{1+1/m}R + D^2N)RD^2$    |
| mTR      | $(mN^{2/m} + k^2)R^2$        | $(mN^{1+1/m} + k^2N)R^2D^2$   | $(mN^{1+1/m} + k^2N)R^2D^2$   | $(mN^{1+1/m} + D^2N)R^2D^2$   |

Learning from scratch (Learn-Scratch). We can train any TNN from scratch (referred to as Learning from Scratch, or Learn-Scratch for short), given suitable algorithms for the forward and backward passes. Since a TNN is formed by replacing each layer of a traditional NN with a tensorial layer, Learn-Scratch is as straightforward as training a traditional NN, but it is inefficient if we have a pre-trained reference NN.
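The evaluation-order issue behind Equation (2.4) and Table 2.2 can be made concrete with a small NumPy sketch (written for this edit, not from the dissertation; it assumes a 1x1 layer with $m = 2$, so the spatial convolutions are omitted, and uses illustrative sizes and a TT-matrix-style core layout): the same generalized tensor operation is computed once by materializing the full kernel and once by contracting the input with one factor at a time, and np.einsum_path reports FLOP estimates for candidate orders.

```python
import numpy as np

D, n, R = 32, 8, 4                       # spatial size, per-mode channels, rank
U = np.random.randn(D * D, n, n)         # input reshaped to (pixels, S0, S1)
K0 = np.random.randn(n, n, R)            # factor 0: (S0, T0, R0)
K1 = np.random.randn(R, n, n)            # factor 1: (R0, S1, T1)

# Order 1: reconstruct the full kernel K'[s0,s1,t0,t1], then contract with it.
K_full = np.einsum('sar,rtb->stab', K0, K1)
V1 = np.einsum('pst,stab->pab', U, K_full)

# Order 2 (spirit of Eq. (2.4)): contract the input with the factors
# sequentially, never forming the dense kernel.
U1 = np.einsum('pst,sar->ptar', U, K0)   # contract over S0
V2 = np.einsum('ptar,rtb->pab', U1, K1)  # contract over R0 and S1
assert np.allclose(V1, V2)

# FLOP estimates for the candidate contraction orders of the whole operation.
print(np.einsum_path('pst,sar,rtb->pab', U, K0, K1, optimize='optimal')[1])
```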
2.5.3 Compression via Knowledge Distillation

Suppose we aim to compress a pre-trained neural network $g^q \in \mathcal{G}^q$ to a model with $p$ parameters, where $p \ll q$. As illustrated in Figure 2.7, $\mathcal{H}^p$ is a broader class of networks than $\mathcal{G}^p$, and hence our goal is to obtain the $h^p \in \mathcal{H}^p$ that is, in some sense, closest to $g^q$, rather than the analogous $g^p \in \mathcal{G}^p$. We expect that searching for such an $h^p$ yields a network that outperforms the analogous $g^p$ in terms of predictive accuracy.

Intuitively, we aim to "project" a pre-trained NN $g \in \mathcal{G}^q$ onto a TNN $h^\star \in \mathcal{H}^p$. (Note that we omit the superscripts on $g$ and $h$ to simplify notation.) Denote the input to $g$ as $\mathcal{U}$, and let $\mathcal{U}'$ be a reshaped version of $\mathcal{U}$ (so that $\mathcal{U}'$ may serve as an input for $h$). Our goal is to find $h^\star$ such that
$$ h^\star = \arg\min_{h \in \mathcal{H}^p} \mathrm{dist}\big(h(\mathcal{U}'),\, g(\mathcal{U})\big), \qquad (2.6) $$
where $\mathrm{dist}(\cdot, \cdot)$ denotes any distance(-like) metric (e.g., the square of the $\ell_2$ distance) between the network outputs (the logits in classification problems). Solving Equation (2.6) is known as knowledge distillation; this process "distills" the knowledge from $g$ and "instills" it into $h^\star$ [46].

Because the class $\mathcal{H}^p$ of TNNs is so vast, in practice we minimize the objective in Equation (2.6) over a much smaller class of TNNs. Concretely, given the input data $\mathcal{U}$ and $g \in \mathcal{G}^q$, we minimize the objective over the class of TNN-mTTs, TNN-mTRs, TNN-mCPs, and TNN-mTKs, where we assign each of these models a pre-specified number of layers, kernels per layer, and kernel dimensions, with a total of $p$ parameters. Given a model in the selected class of TNNs, let $\{\mathcal{K}^{(i)}_\ell\}_{i=0}^{m}$ denote the set of $(m+1)$ kernels of the $\ell$-th layer of that model (replace $m$ with $2m$ for TNN-mTKs). Our goal is now to search for kernels $\{\mathcal{K}^{(i)}_\ell\}_{i,\ell}$ for all $L$ layers of the TNN, such that these kernels construct a TNN $h$ that is a good approximation of $g$. Specifically, we aim to solve
$$ \{\mathcal{K}^{(i)\star}_\ell\}_{i,\ell} = \arg\min_{\{\mathcal{K}^{(i)}_\ell\}_{i,\ell}} \mathrm{dist}\big(g(\mathcal{U}),\ h(\mathcal{U}'; \{\mathcal{K}^{(i)}_\ell\}_{i,\ell})\big). \qquad (2.7) $$
Here, dist denotes a distance metric, which we take to be the squared $\ell_2$ distance in this work. In what follows, we discuss three different approaches for solving Equation (2.7).

Layer-wise decomposition (Layer-Decomp). Given the relationship between TNNs and NNs (cf. Section 2.4.2), we may solve Equation (2.7) in two steps: (1) for each layer (e.g., layer $\ell$), we reshape the original kernel $\mathcal{K}^{(\ell)}$ of $g$ into a higher-order tensor $\mathcal{K}'^{(\ell)}$, and (2) we solve for $\{\mathcal{K}^{(i)}_\ell\}_i$ such that applying the corresponding tensor operation to those kernels produces the best approximation of $\mathcal{K}'^{(\ell)}$ (we assume that $\mathcal{K}^{(\ell)}$ is reshaped so that the dimensions of $\mathcal{K}'^{(\ell)}$ match those of the approximation). For an mTT-convolutional layer, the second step amounts to solving the following optimization problem:
$$ \{\mathcal{K}^{(i)\star}_\ell\}_i = \arg\min_{\{\mathcal{K}^{(i)}_\ell\}_i} \big\| \mathcal{K}'^{(\ell)} - \mathrm{mTT}(\{\mathcal{K}^{(i)}_\ell\}_i) \big\|^2, \quad \forall \ell \in [L], \qquad (2.8) $$
where $\mathrm{mTT}(\{\mathcal{K}^{(i)}_\ell\}_i)$ denotes the result of the generalized tensor operation in Figure 2.5b applied to $\{\mathcal{K}^{(i)}_\ell\}_i$ (we can formulate similar problems for the other tensorial layers). Typically, one solves Equation (2.8) via an alternating least squares method [51], since Equation (2.8) reduces to a least-squares problem if we fix all but one kernel in each layer. However, such a method typically does not yield accurate solutions to Equation (2.7). Thus, we usually only use it to initialize the parameters for the more advanced approaches below.
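For intuition on what a layer-wise initialization like Equation (2.8) computes, here is a minimal NumPy sketch in the spirit of the sequential-SVD (TT-SVD) construction of [35] (written for this edit, not code from the dissertation; the fixed rank, toy kernel sizes, and core layout are illustrative assumptions, and the dissertation's Layer-Decomp would refine such an initialization, e.g., with alternating least squares).

```python
import numpy as np

def tt_svd(T, rank):
    """Sequential truncated SVDs: an m-order tensor -> list of 3rd-order cores."""
    dims = T.shape
    cores, r_prev = [], 1
    C = T.reshape(dims[0], -1)
    for k in range(len(dims) - 1):
        C = C.reshape(r_prev * dims[k], -1)
        U, s, Vt = np.linalg.svd(C, full_matrices=False)
        r = min(rank, len(s))
        cores.append(U[:, :r].reshape(r_prev, dims[k], r))
        C = s[:r, None] * Vt[:r]          # carry the remainder to the next mode
        r_prev = r
    cores.append(C.reshape(r_prev, dims[-1], 1))
    return cores

def tt_to_full(cores):
    full = cores[0]
    for G in cores[1:]:
        full = np.tensordot(full, G, axes=([-1], [0]))
    return full[0, ..., 0]                # drop the two boundary rank-1 modes

K = np.random.randn(4, 4, 4, 4)           # a reshaped layer kernel (toy sizes)
cores = tt_svd(K, rank=3)
err = np.linalg.norm(tt_to_full(cores) - K) / np.linalg.norm(K)
print([G.shape for G in cores], round(err, 3))
```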
End-to-end knowledge distillation (E2E-KD). A classic knowledge distillation algorithm minimizes the $\ell_2$ distance between the outputs of $h$ and $g$. Formally, we can formulate the optimization problem as
$$ \{\mathcal{K}^{(i)\star}_\ell\}_{i,\ell} = \arg\min_{\{\mathcal{K}^{(i)}_\ell\}_{i,\ell}} \big\| h(\mathcal{U}') - g(\mathcal{U}) \big\|^2. \qquad (2.9) $$
However, solving this problem has two main drawbacks: (1) backpropagation is very expensive, as it requires end-to-end gradient flow through the TNN; and (2) the optimization is unstable when the factors of all layers are solved for simultaneously. To avoid these challenges, we propose decomposing the problem into a sequence of $L$ sub-problems.

Sequential knowledge distillation (Seq-KD). For the $\ell$-th sub-problem, we obtain the kernels $\{\mathcal{K}^{(i)}_\ell\}_i$ by minimizing the $\ell_2$ distance between the intermediate results of the $\ell$-th layers of $g$ and $h$, i.e., $\mathcal{V}^{(\ell)}$ and $\mathcal{V}'^{(\ell)}$:
$$ \{\mathcal{K}^{(i)\star}_\ell\}_i = \arg\min_{\{\mathcal{K}^{(i)}_\ell\}_i} \big\| \mathcal{V}^{(\ell)} - \mathcal{V}'^{(\ell)} \big\|^2, \quad \forall \ell \in [L], \qquad (2.10) $$
where the input to the $\ell$-th layer is the output of its previous layer, i.e., $\mathcal{U}^{(\ell)} = \mathcal{V}^{(\ell-1)}$ and $\mathcal{U}'^{(\ell)} = \mathcal{V}'^{(\ell-1)}$. We can use SGD to solve each sub-problem once we derive the backpropagation rule for the generalized tensor operation used in the $\ell$-th layer of $h$. Since the $\ell$-th sub-problem depends on the result of its predecessor, the algorithm requires us to solve the sub-problems sequentially from the bottom layers to the upper ones.

2.6 Interpretation of Existing Compact Architectures

Recent advances in compact architecture design such as Inception [16], Xception [52], interleaved group convolutions [17], and bottleneck structures [15, 18] propose to group multiple primitive operations into modules. We show that we can express all such modules using the framework of tensorial layers (with minor modifications).

Interleaved group modules. The critical idea in interleaved group modules is to divide and branch the input into several blocks and to constrain each block's connections so as to avoid computation across blocks. The architectures of tensorial layers utilize a similar strategy: for example, the tensorial layer in Figure 2.8b has the same architecture as the network in Figure 2.8a, where each length-nine input branches into three blocks and connections exist only within each block. This idea of grouping operations plays a vital role in the development of Inception [16] and Xception [52].

Figure 2.8: An interleaved group module without nonlinearity (a) is expressed as a tensorial layer (b).

Bottleneck modules. A bottleneck structure forces a model to adopt a compact representation by constructing a narrow bottleneck (with fewer hidden units) in the middle of each module. Such modules correspond to the low-rank structures used in tensorial layers, as illustrated by the following example with matrices: consider a weight matrix $W \in \mathbb{R}^{S \times S}$ and its low-rank decomposition $W = PQ$ (with $P \in \mathbb{R}^{S \times R}$ and $Q \in \mathbb{R}^{R \times S}$). This model requires an input vector $u \in \mathbb{R}^{S}$ to first be multiplied by $P$ and then by $Q$ during a forward pass. Therefore, the input $u$ is mapped into a low-dimensional space $\mathbb{R}^{R}$ after being multiplied by $P$, resulting in a bottleneck in this two-step module. In practice, the bottleneck module in [15, 18] can be represented by tensor diagrams (cf. Figure 2.9), whose input with $kN$ channels is first mapped to a structure with $N$ channels by the kernel $\mathcal{K}_0$.

Figure 2.9: A bottleneck module without nonlinearity is expressed as a Tucker decomposition of the original layer.
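To make the bottleneck/low-rank correspondence explicit, here is a small NumPy sketch (written for this edit, not code from the dissertation; the sizes, the near-low-rank construction of $W$, and the use of a truncated SVD are illustrative assumptions): factoring $W \approx PQ$ replaces one wide matrix product with two narrow ones and shrinks the parameter count from $S^2$ to $2SR$.

```python
import numpy as np

S, R = 256, 16
rng = np.random.default_rng(0)
# A weight matrix that is approximately low-rank, as the bottleneck view assumes.
W = rng.normal(size=(S, R)) @ rng.normal(size=(R, S)) + 0.01 * rng.normal(size=(S, S))

# Truncated SVD gives the best rank-R factorization W ~ P @ Q.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
P = U[:, :R] * s[:R]            # S x R  (compress into the bottleneck)
Q = Vt[:R]                      # R x S  (expand back out)

u = rng.normal(size=(S,))       # treat u as a row vector, as in the text
full_out = u @ W
bottleneck_out = (u @ P) @ Q    # u @ P lives in the R-dimensional bottleneck

print("parameters:", W.size, "->", P.size + Q.size)
print("relative error:",
      np.linalg.norm(full_out - bottleneck_out) / np.linalg.norm(full_out))
```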
Discussion of compact architecture designs. The two examples above illustrate one way of designing compact tensorial layers. This design process starts with a traditional layer (fully-connected or convolutional), followed by (optional) reshaping and some tensor decomposition of the (reshaped) kernel. Consequently, the original layer transforms into a tensorial layer with a compact structure. We can also design novel architectures from scratch by, for example, using tensor networks as building blocks for other architectures (cf. Section 2.3). One recent attempt that applies this methodology is [37], where tensor-train networks are used to introduce multilinear operations into an RNN.

2.7 Experimental Results

We divide this section into two subsections. In Section 2.7.1, we use pre-trained models to evaluate the effectiveness of our compression algorithms (cf. Section 2.5.3). In Section 2.7.2, we demonstrate that our tensorial neural networks can be trained from scratch (i.e., without reference models) on a wide range of datasets and backbone models. In both scenarios, we show that our TNNs maintain high accuracy, even when they use significantly fewer parameters than traditional neural networks.

Considerations for TNN experiments. There are three items we consider when designing the experiments with TNNs that follow. (1) Kernel reshaping. We refer to an architecture whose kernels are reshaped into higher-order tensors (before performing a low-rank kernel factorization) as a TNN; we refer to an architecture containing factorized kernels but without reshaping as a NN. Although the latter is also a TNN, we still call it a NN, as the resulting architecture (after low-rank factorization) consists only of low-order operations (i.e., matrix multiplications and multi-channel convolutions), as in traditional neural networks. In what follows, we compare the performance of TNNs to that of NNs. (2) Types of tensor networks. Existing NN baselines are networks that do not involve any kernel reshaping and use classical kernel decompositions, e.g., SVD [30, 31], CP [20, 31], and TK [21]. Therefore, we refer to these architectures as NN-SVD, NN-CP, and NN-TK architectures, where the suffix denotes the type of kernel decomposition. As discussed in Section 2.4, we may use kernel reshaping and other types of decompositions to obtain TNNs, which achieve better expressive power than NNs (cf. Figure 2.7). Consequently, we refer to the architectures that involve reshaped kernels as TNN-mCPs, TNN-mTTs, TNN-mTRs, etc. (3) Training or compression strategy. We train the above models either via knowledge distillation or from scratch. To distinguish these two strategies, we use the term compression for knowledge distillation (i.e., when there exists a pre-trained reference network to compress). We use the term TNN-based compression (TNN-C) for training the TNN-mCPs, TNN-mTTs, TNN-mTRs, etc. via knowledge distillation, and the term low-rank compression (NN-C) for the analogous process of training the NN-SVDs, NN-CPs, NN-TKs, NN-TTs, etc.

2.7.1 Knowledge Distillation

In this subsection, we evaluate the knowledge distillation algorithms of Section 2.5.3, namely layer-wise decomposition (Layer-Decomp), end-to-end knowledge distillation (E2E-KD), and sequential knowledge distillation (Seq-KD). We conduct extensive experiments on compressing the convolutional layers of ResNet-32 on CIFAR-10, aiming to identify the best strategy for combining these algorithms.
W T2 H S2 KC K2 R S0 T1 K0 K1 T0 S1 (c) Canonical polyadic (CP) (d) mCP (m = 3) S0 T2 P0 Q2 H Rs0 H Rt2 S Rs Rt T S1 R s 1 R t 1 T1 S C T P1 C Q1K K K s t W R R2 W 0 (e) Tucker (TK) P2 Q0 S2 T0 (f) mTK (m = 3) S0 S1 S2 H S H W T R0 R1 R2 K0 K1 K2 KC Rs R Rt KS KH KW KT T0 S1 T2 W (g) Tensor-train (TT) (h) mTT (m = 3) (i) Tensor-ring (TR) (j) mTR (m = 3) Figure 2.10: Diagrams of tensor decompositions. 32 to figure out the best strategy for combining these algorithms. Experimental Setup. We find that Layer-Decomp is merely better than random guesses in our experiments (see the test errors in Figure 2.11 at the beginning), Therefore, we can only use Layer-Decomp as initialization for E2E-KD and Seq-KD. With both algorithms, all layers are compressed uniformly at the same compression rate except for the first and last layers. Therefore, the compression rate is both layer-wise and (approximately) global. (We investigate the non-uniform allocation of all parameters across layers, but empirical results show that uniform assignment performs the best.) For all experiments, we use Adam optimizer with an initial learning rate of 10?3, which decays by 10 every 50 epochs. Our algorithm achieves 5% higher accuracies than the baselines on CIFAR-10 using ResNet-32. The results from Table 2.3 demonstrate that our TNNs maintain high accuracies even after the pre-trained networks are highly compressed. Given a pre- trained ResNet-32 and compression rate of 10%, the NN-CP with E2E-KD reduces the original accuracy from 93.2% to 86.93%; while the TNN-mCP with Seq-KD maintains the accuracy as 91.28% with the same compression rate ? a performance loss of 2% with only 10% of the number of parameters. Furthermore, TNN-C achieves further aggressive compression ? a performance loss of 6% with only 2% of the number of parameters. We observe similar trends (higher compression and accuracy) for TNN-mTT. The structure of the mTK decomposition makes TNN-mTK less effective with very high compression, since the decomposition poses a narrow bottleneck, which may lose necessary information. Increasing the network size to 20% of the original provides reasonable performance on 33 CIFAR-10 for TNN-mTK. Compression rate Compression rate Architect. Architect. 5% 10% 20% 40% 2% 5% 10% 20% NN-SVD [30, 31] 83.09 87.27 89.58 90.85 TNN-TR [53]? - 80.80? - 90.60 NN-CP [20, 31] 84.02 86.93 88.75 88.75 TNN-mCP 85.7 89.86 91.28 - NN-TK [21] 83.57 86.00 88.03 89.35 TNN-TK 61.06 71.34 81.59 87.11 NN-TT [34]? 77.44 82.92 84.13 86.64 TNN-mTT 78.95 84.26 87.89 - ?Cited from [53], the accuracy of 80.8% is achieved by 6.67% compression rate. ?The architecture is proposed as a baseline in [34]. Table 2.3: Test accuracy of ResNet-32 on CIFAR10. We compare end-to-end knowl- edge distillation (E2E-KD) using low-rank compression (NN-C) against sequen- tial knowledge distillation (Seq-KD) with TNN-based compression (TNN-C). The original ResNet-32 achieves 93.2% test accuracy with 0.46M parameters [5]. TNN-based compression, sequential knowledge distillation, or both? Table 2.3 shows that TNN-C with Seq-KD outperforms NN-C with traditional E2E-KD. Now we address the following question: is one factor (Seq-KD or TNN-C) primarily responsible for increased performance, or is the benefit due to synergy between the two? (1) We present the accuracies of different compression methods in Table 2.5. Other than at very high compression rate (5% column in Table 2.5), Seq-KD consistently outper- forms E2E-KD. 
In addition, Seq-KD converges faster than E2E-KD; Figure 2.11 plots the test error over the number of gradient updates for various compression methods. (2) We present the effect of different architectures on accuracy in Table 2.4, Table 2.7, and Table 2.8. (2.1) First, we compare TNNs with NNs via Seq-KD. Interestingly, as demonstrated in Table 2.4, if TNN-based compression is used, the test accuracy is restored for even very low compression rates (TNN-mTK remains an exception for aggressive compression due to the extreme bottleneck structure that we previously discussed). (2.2) Second, we compare TNNs with NNs via Learn-Scratch. As demonstrated in Table 2.7 and Table 2.8, TNNs outperform NNs trained using Learn-Scratch under the same number of parameters. These results confirm that TNNs are more flexible than traditional NNs, as TNNs allow exploitation of invariant structures in the original parameters: such structures are exploited by our proposed TNN-based compression (TNN-C), but not by low-rank compression (NN-C). Therefore, our results show that TNN-C and Seq-KD are symbiotic, and both are necessary to obtain high accuracy with significant compression.

                  Compression rate                  Compression rate
Architecture      5%      10%     Architecture      5%      10%
NN-CP [20, 31]    83.19   88.50   TNN-mCP           89.86   91.28
NN-TK [21]        80.11   86.73   TNN-mTK           71.34   81.59
NN-TT [34]        80.77   87.08   TNN-mTT           84.26   87.89
Table 2.4: Test accuracy for ResNet-32 on CIFAR-10. We compare baseline NNs (NN-C) against our TNNs (TNN-C), where all models are compressed via sequential knowledge distillation (Seq-KD). The original ResNet-32 achieves 93.2% accuracy with 0.46M parameters [5].

                  Compression rate
                  5%              10%             20%             40%
Architecture      Seq     E2E     Seq     E2E     Seq     E2E     Seq     E2E
NN-SVD [30, 31]   74.04   83.09   85.28   87.27   89.74   89.58   91.83   90.85
NN-CP [20, 31]    83.19   84.02   88.50   86.93   90.72   88.75   89.75   88.75
NN-TK [21]        80.11   83.57   86.75   86.00   89.55   88.03   91.30   89.35
NN-TT [34]        80.77   77.44   87.08   82.92   89.14   84.13   91.21   86.64
Table 2.5: Test accuracy of ResNet-32 on CIFAR-10. We compare sequential knowledge distillation (Seq-KD) against end-to-end knowledge distillation (E2E-KD) using NN-C. The original ResNet-32 achieves 93.2% accuracy with 0.46M parameters [5].

Convergence rate. Compared to end-to-end knowledge distillation (E2E-KD), an ancillary benefit of sequential knowledge distillation (Seq-KD) is that it is much faster and leads to more stable convergence. Figure 2.11 plots the compression error over the number of gradient updates for various methods (this experiment is for NN-C at a 10% compression rate). There are three salient points. First, Seq-KD has a very high error in the beginning while the "early" blocks are tuned (and the rest of the network is left unchanged at the values obtained after tensor decomposition); however, once the final block is tuned (around $2 \times 10^{11}$ gradient updates in the figure), the error drops to nearly its minimum immediately. Second, E2E-KD requires 50-100% more gradient updates to achieve stable performance. Finally, the result also shows that, for each block, Seq-KD converges very quickly (and nearly monotonically), which results in the stair-step pattern, since extra tuning of a block does not improve (or appreciably reduce) performance.

Figure 2.11: Test error curves for sequential knowledge distillation (Seq-KD) vs. end-to-end knowledge distillation (E2E-KD) on ResNet-32 for CIFAR-10 (test error vs. number of gradient updates; curves: cp-seq, tt-seq, tk-seq, cp-e2e, tk-e2e, tt-e2e). Both use layer-wise decomposition (Layer-Decomp) for initialization.
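The exact objectives for E2E-KD and Seq-KD are defined in Section 2.5.3. As a rough illustration of the sequential schedule discussed above, the sketch below assumes one plausible form of block-wise distillation: each student block starts from its Layer-Decomp initialization and is then tuned, one block at a time, to regress the corresponding teacher block's outputs. The helper names, the MSE objective, and the schedule are assumptions for illustration only.

    import torch
    import torch.nn as nn

    def sequential_kd(teacher_blocks, student_blocks, loader, epochs_per_block=2, lr=1e-3):
        """Tune student blocks one at a time against teacher block outputs (assumed Seq-KD form)."""
        mse = nn.MSELoss()
        for b, student_block in enumerate(student_blocks):
            opt = torch.optim.Adam(student_block.parameters(), lr=lr)
            for _ in range(epochs_per_block):
                for x, _ in loader:
                    with torch.no_grad():
                        feats = x
                        for teacher_block in teacher_blocks[:b]:
                            feats = teacher_block(feats)      # teacher features feeding block b
                        target = teacher_blocks[b](feats)     # teacher output of block b
                    loss = mse(student_block(feats), target)
                    opt.zero_grad()
                    loss.backward()
                    opt.step()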
2.7.2 Learning from Scratch

While it is beneficial to have a pre-trained model as a reference (see Table 2.6 for a comparison), there are scenarios where knowledge distillation is not applicable: (1) the pre-trained model is simply not available; (2) the model is so deep that sequential knowledge distillation is too expensive; (3) we aim to learn TNNs with even higher expressive power than NNs. In this subsection, we verify the performance of our TNNs when trained from scratch for a wide range of backbone models and datasets.

                  Seq-KD                    Learn-Scratch
Architecture      2%      5%      10%       2%      5%      10%
TNN-mCP           85.70   89.86   91.28     81.41   82.12   82.93
TNN-mTK           61.60   71.34   81.59     60.65   61.46   65.75
TNN-mTT           78.95   84.26   87.89     79.95   81.82   83.08
Table 2.6: Test accuracy for ResNet-32 on CIFAR-10. We compare sequential knowledge distillation (Seq-KD) against learning from scratch (Learn-Scratch) using our TNNs. The original ResNet-32 achieves 93.2% accuracy with 0.46M parameters [5].

Wide-ResNet for CIFAR-100. To demonstrate that TNNs are compatible with other backbones (in addition to ResNet), we evaluate our TNNs with a Wide-ResNet backbone [14] on the CIFAR-100 dataset. As shown in Table 2.7, our TNNs (in particular TNN-mTT), when trained from scratch, already outperform other state-of-the-art low-rank factorization-based methods.

Compression rate     0.5%      1%        2%        5%
NN-TT [34]           37.02%    54.65%    52.69%    51.42%
NN-CP [20, 31]       40.74%    58.04%    56.9%     64.83%
Compression rate     0.33%     0.5%      0.66%     1%
TNN-mTT              61.67%    65.36%    66.82%    68.83%
Table 2.7: Test accuracy of Wide-ResNet-28-10 on CIFAR-100. We compare baseline NNs against our TNNs by training all models from scratch (i.e., without reference models). The original model achieves 81.25% accuracy with 36.5M parameters [14].

ResNet for ImageNet-2012. To show that our TNNs scale to large datasets, we evaluate their performance on the ImageNet-2012 dataset with a ResNet-50 backbone. The results in Table 2.8 show that our TNNs significantly outperform the low-rank factorization-based methods at each compression rate. Furthermore, our TNNs maintain very high accuracy given less than 10% of the parameters of the original ResNet-50.

                  Compression rate
Architecture      1%        2%        5%        10%       20%       50%
NN-CP             57.86%    64.17%    69.37%    71.52%    72.08%    72.44%
NN-TT             56.82%    62.23%    65.54%    66.21%    66.90%    66.92%
NN-TR             56.59%    62.97%    69.59%    71.61%    73.04%    73.21%
                  Compression rate
Architecture      0.5%      1%        5%        10%       20%       50%
TNN-mCP           72.65%    73.76%    74.03%    75.00%    75.31%    77.31%
TNN-mTT           69.27%    73.04%    73.51%    73.50%    73.87%    74.14%
TNN-mTR           67.49%    73.23%    74.12%    75.01%    75.32%    75.16%
Table 2.8: Top-1 test accuracy of ResNet-50 on ImageNet. We compare baseline NNs against our TNNs, where all models are trained from scratch. The original ResNet-50 model achieves 78.03% top-1 test accuracy with 25.6M parameters [5].

VGG, ResNet and Wide-ResNet with Full Parameters. While we use TNNs mostly for model compression in this chapter, one remaining question is the performance of TNNs when they have the same number of parameters as the original model. To answer this question, we train TNN-mTT from scratch with the architectures VGG-16 [2], ResNet-34 [15], and WRN-28-10 [14] on CIFAR-10. As shown in Table 2.9, TNNs (without hyper-parameter optimization) match or outperform their original models (where the hyper-parameters are highly optimized) when their numbers of parameters are the same.
            VGG                   WRN                   ResNet
Accuracy    TNN-mTT   NN          TNN-mTT   NN          TNN-mTT   NN
Training    100%      100%        100%      100%        100%      100%
Test        93.68%    92.64%      95.09%    95.83%      91.79%    92.49%
Table 2.9: Performance of TNNs vs. their NN counterparts on CIFAR-10. NN stands for the uncompressed model proposed by the original paper. All models are trained from scratch (i.e., without reference models).

2.8 Conclusion

In this chapter, we introduced a new suite of generalized tensor algebra, which provides systematic notations for generalized tensor operations (a.k.a. tensor networks). Based on these generalized tensor operations, we developed a family of tensorial layers, extending the existing fully-connected/convolutional layers in traditional neural networks. We constructed tensorial neural networks (TNNs) using tensorial layers as building blocks, and empirically showed that our TNNs maintain high predictive performance even when they contain significantly fewer parameters than traditional neural networks. Our experiments on LeNet-5, VGG, ResNet, and Wide-ResNet consistently verified that our TNNs outperform the state-of-the-art low-rank architectures under the same compression rate.

Chapter 3: AutoTNN: A Framework for Representing and Learning Tensorial Neural Networks

3.1 Overview

Modern neural networks are expressive over various learning problems, but at the cost of increased width and depth. State-of-the-art convolutional models, for example, can contain up to several billion parameters [54, 55]. Unfortunately, the size and training costs of such networks are at odds with a rapidly emerging industrial and academic interest in performing learning tasks on low-fidelity hosts such as IoT and mobile devices [56, 57]. An increasingly popular approach for generating compact yet expressive models is to use tensorial neural networks (TNNs) [20, 50, 58, 59], which factorize each layer's weight tensor into several smaller factors. As a result, each TNN layer is a multilinear operation among its input and weight factors.

One particular method by which a TNN can arise is reshaping a network weight into a higher-order tensor. The reshaped tensor takes a factorized form, such as canonical polyadic (CP), Tucker (TK), and tensor-train (TT) decompositions. Factorization of a reshaped tensor is an efficient way of representing underlying properties such as periodicity and modularity invariances/similarities, which often exist in neural network models [20, 34, 50, 53] and are preserved in the factorization. Factorization is especially useful for low-rank, higher-order reshaped weights, since we can capture abundant structural information without much representation redundancy. Figure 3.1 depicts this scenario using the example of a vector displaying periodicity.

Figure 3.1: Tensor reshaping. The figure shows how reshaping can help reduce the number of parameters while preserving critical structural properties. A vector of length 15 displays periodicity in its entries (represented here by repeating colors) except for a few entries. Reshaping it into a 3 x 5 matrix and then taking a factorized tensor approximation (rank-1 SVD) a) reduces the number of parameters to store and b) culls out artifacts (non-periodic elements) while representing periodicity sans redundancy.

In addition to the lighter weight, TNNs preserve the same level of predictive power over various backbone networks and tasks.
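The sketch below reproduces the reshaping argument of Figure 3.1 numerically; the period length, the 3 x 5 reshape, and the position of the corrupted entry are assumed values chosen only to mirror the figure's setup.

    import numpy as np

    # A length-15 vector that repeats the pattern [1, 2, 3], except for one corrupted entry.
    v = np.tile([1.0, 2.0, 3.0], 5)
    v[7] += 0.5                                  # a non-periodic "artifact"

    # Reshape into a 3 x 5 matrix (one period per column) and take a rank-1 SVD approximation.
    M = v.reshape(5, 3).T
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    M1 = s[0] * np.outer(U[:, 0], Vt[0, :])

    print(M.size, U[:, 0].size + Vt[0, :].size)  # 15 stored entries vs. 3 + 5 factor entries
    print(np.round(M1.T.reshape(-1), 2))         # the artifact is smoothed; the periodic structure remains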
However, in contrast to the availability of high- performance libraries for convolutional neural networks (CNNs) (e.g., NVIDIA?s cuDNN for GPUs, Intel?s MKL for CPUs), efficient library-based solutions are not available for training TNNs, especially those based on CNNs. In this work, we introduce an open-source library and a suite of meta-algorithms, col- lectively termed as AutoTNN, to efficiently construct and train TNNs. Given a backbone network, AutoTNN will tensorize its layers, compress the tensorial weights, and conduct end-to-end training on the resulting TNN. In particular, the user does not need to inde- pendently develop any part of the TNN architecture. AutoTNN interprets the forward and backward pass of a TNN as a generalized 41 einsum graph/sequence. Here, ?generalized? means including convolutions, an operation not supported by any existing einsum implementation but handled by our meta-function conv einsum. A lack of support for convolution operations in einsum prevented the eval- uation of tensorized CNNs via well-known algorithms such as netcon. However, deriving tensor decompositions with convolutions is nontrivial as (a) the factorization involving a convolution is challenging to represent as an einsum string, and (b) backpropagation rules for such structure designs were not derived in any prior work. conv einsum evaluates these tensor operations with an optimal sequence, rather than naively performing all the tensor operations from left to right, resulting in a vastly reduced/improved number of floating point operations (FLOPs) during training and inference, a mechanism we refer to as the optimal sequencer. In totality, AutoTNN addresses computational and memory overhead challenges asso- ciated with automated TNN construction and training. In conjunction with conv einsum, the optimal sequencer, and checkpointing, this work provides a framework and library which significantly improves the efficiency of TNN design and deployment. Contributions. In summary, our contributions include: 1. We develop an open-source library AutoTNN and framework for designing and train- ing a TNN given a backbone network and specified tensor decomposition. 2. Our algorithm conv einsum optimizes the training process. In particular, conv einsum interprets forward and backward passes as sequences of multilinear operations, in- cluding (multi-way) convolutions, extending the classical einsum. 42 3. Furthermore, conv einsum utilizes the library opt-einsum [60], which generalizes the netcon algorithm [49], to efficiently determine an optimal order of these operations. 4. We exploit a mechanism referred to as checkpointing [61] which can balance the computational cost of a forward or backward pass of a TNN with the memory cost of large intermediate products during evaluation. 5. Our experiments demonstrate the expressiveness of TNNs with AutoTNN over various tasks, including video classification, speech recognition, and image classification. 3.2 Related Works Low-rank factorization. Various types of low-rank factorization have been proposed to reduce the number of parameters in linear layers. Pioneering works proposed to flatten/un- fold the parameters in convolutional layers into matrices (known as matricization), followed by (sparse) dictionary learning or matrix decomposition [30, 31, 32]. Subsequently, [20] and [21] showed that it is possible to compress the parameters directly by standard tensor de- compositions (in particular, CP decomposition or Tucker decomposition [33]). 
Further groundbreaking works [19, 34] demonstrated that the low-order weights could be efficiently compressed by the tensor-train decomposition [35] by first reshaping the parameters into higher-order tensors. This paradigm was later extended in two directions: (1) the tensor- train decomposition is used to compress LSTM/GRU layers in recurrent neural networks (RNNs) [36] and higher-order recurrent neural networks [37, 38]; and (2) other decompo- sitions are explored for better compression, such as the tensor-ring decomposition [40] and block-term decomposition [41]. 43 Other compression methods. TNNs belong to the large family of low-rank approx- imation methods, which complements other compression techniques such as quantization and pruning. Many papers have justified that combining these different lines of the works may render more competitive model compression. For instance, [44] verify that TNNs with quantization achieve SOTA results on 3D tasks, and [43] demonstrate that TNNs with pruning obtain SOTA image classification. This work, however, is chiefly concerned with the automated development and benchmarking of TNNs without any other compression methods. For future work, we will investigate incorporating quantization, pruning, and knowledge distillation techniques directly into AutoTNN. Existing libraries. Various libraries support tensor operations. (1) Pytorch [62] sup- ports specialized tensor operations commonly used in neural networks, including various convolutional layers and the einsum function. However, it is non-trivial to implement arbitrary tensor operations optimally. (2) TensorLy [63] supports common tensorial op- erations across various platforms, including Pytorch and TensorFlow. However, the library does not support the construction of arbitrary tensor networks. (3) NumPy [64] is a gen- eral computation library that has an optimal sequencer in its einsum function, but it does not support GPUs, nor does it support convolutions. (4) Einops [65] extends einsum to GPUs. To the best of our knowledge, no existing library supports general tensor networks on GPUs, and our work aims to close this gap. (5) Gnetcon [66] attempts to extend einsum to convolutions but does not support multi-way convolution between a collection of modes with more than two differing dimensions, nor was it fully integrated into an end- to-end training framework. (6) Tensor Comprehensions (TC) [67] optimizes tensorial 44 computations through translation of generalized ?Einstein? notations of tensorial sequences (including multi-way convolutions). However, TC cannot tensorize a network ? the user needs to manually define their network architecture, whereas AutoTNN can automatically tensorize and compress a backbone network. Furthermore, defining tensor operation with more than two tensors in the language of TC requires a hand-coded sequence of binary operations. In contrast, our conv einsum can handle arbitrary tensor networks by faithfully adapting the syntax of einsum. 3.3 Tensor Operations and einsum In this section, we outline multi-linear operations common to TNNs. Many of these operations are systematically expressible and computable via the popular notational frame- work and function einsum, first introduced by the Python library NumPy [64]. Therefore, we will first review the essences of tensor operations and their corresponding einsum repre- sentations. With these concepts in hand, we then formally introduce TNN. Notations. 
We use lower case letters (e.g., $v$) to denote vectors, upper case letters (e.g., $M$) to denote matrices, and calligraphic letters (e.g., $\mathcal{T}$) to denote tensors. For a tensor $\mathcal{T} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, we refer to a specific entry by the subscript notation $\mathcal{T}_{i_1 i_2 \ldots i_N}$, where $1 \le i_n \le I_n$ for $1 \le n \le N$. Furthermore, we refer to $N$ as the order, a particular index $n$ as a mode, and the magnitude of a mode $I_n$ as its dimension size. For example, a tensor $\mathcal{T} \in \mathbb{R}^{3 \times 4 \times 5}$ is a 3rd-order tensor that has dimension size 4 at its second mode.

3.3.1 Representations of Multilinear Operations

The einsum function allows definitions of multilinear operations via string inputs. In this subsection, we highlight an example of a multilinear operation involving three primitive operations (contraction, batch product, outer product). Consider two 3rd-order tensors $\mathcal{T}^{(1)} \in \mathbb{R}^{B \times C \times I}$, $\mathcal{T}^{(2)} \in \mathbb{R}^{B \times C \times J}$, and a multilinear operation between these two tensors:

$\mathcal{T}_{b,i,j} = \sum_{c=1}^{C} \mathcal{T}^{(1)}_{b,c,i} \, \mathcal{T}^{(2)}_{b,c,j}.$   (3.1)

We can denote the operation above in einsum as

    T = einsum('bci,bcj->bij', T1, T2)

where the string in the quotation marks precisely specifies the operation and is known as an einsum string. In this string, the letter 'c' indicates a contraction, since it appears in both inputs but not the output; the letter 'b' denotes a batch product, since it appears in both inputs and the output; lastly, the letters 'i' and 'j' represent an outer product, as each appears in one of the two inputs and both appear in the output.

3.3.2 From einsum to conv_einsum

While many popular libraries (e.g., NumPy, TensorFlow, PyTorch) implement einsum, none of them support convolutions, despite the fact that convolution is (multi)linear and ubiquitous in modern neural networks. Therefore, we generalize einsum to a meta-function conv_einsum, which handles convolution as a primitive operation.

Tensor convolution generalizes the convolution on vectors to higher-order tensors. For instance, given two tensors $\mathcal{T}^{(1)} \in \mathbb{R}^{X \times B \times C}$ and $\mathcal{T}^{(2)} \in \mathbb{R}^{L \times D \times E}$, we can define a convolution between the modes with dimension sizes $X$ and $L$. The operation returns a 5th-order tensor $\mathcal{T} \in \mathbb{R}^{X' \times B \times C \times D \times E}$, with its entries calculated as

$\mathcal{T}_{:,b,c,d,e} = \mathcal{T}^{(1)}_{:,b,c} * \mathcal{T}^{(2)}_{:,d,e},$   (3.2)

where $*$ denotes a convolution between two vectors. Note that the dimensions $X$ and $L$ can be different, and the output dimension $X'$ depends on the boundary condition (e.g., a standard convolution yields $X' = X + L - 1$). We write Equation (3.2) in conv_einsum as

    T = conv_einsum('lbc,lde->lbcde|l', T1, T2)

In this scheme, the same letter 'l' is used for different modes, even if their dimension sizes may differ. Furthermore, the placement of 'l' to the right of the pipe delimiter indicates that conv_einsum performs convolution on the corresponding modes. Notice that a letter for convolution appears in all inputs, the output, and after the delimiter.

We can use conv_einsum to represent multilinear operations on more than two inputs. For instance, consider three tensors $\mathcal{X} \in \mathbb{R}^{B \times F \times S \times H \times W}$, $\mathcal{K}^{(1)} \in \mathbb{R}^{F \times G \times K \times K}$, $\mathcal{K}^{(2)} \in \mathbb{R}^{S \times T \times K \times K}$, and an operation

$\mathcal{Y}_{b,g,t,:,:} = \sum_{f=1}^{F} \sum_{s=1}^{S} \mathcal{X}_{b,f,s,:,:} * \mathcal{K}^{(1)}_{f,g,:,:} * \mathcal{K}^{(2)}_{s,t,:,:}$   (3.3)

that leads to an output tensor $\mathcal{Y} \in \mathbb{R}^{B \times G \times T \times H' \times W'}$. This is known as interleaved group convolution [17]. In conv_einsum, it writes as

    Y = conv_einsum('bfshw,fghw,sthw->bgthw|hw', X, K1, K2)

We will discuss how to evaluate a conv_einsum with more than two inputs in Section 3.4.2.
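The primitive operations above can be checked directly with NumPy; the tensor sizes below are arbitrary, and np.convolve stands in for the vector convolution $*$ of Equation (3.2) (conv_einsum itself is our library and is not assumed here).

    import numpy as np

    B, C, I, J = 2, 3, 4, 5
    T1 = np.random.rand(B, C, I)
    T2 = np.random.rand(B, C, J)

    # Equation (3.1): 'c' is contracted, 'b' is a batch product, 'i' and 'j' form an outer product.
    T = np.einsum('bci,bcj->bij', T1, T2)
    print(T.shape)                                    # (2, 4, 5)

    # Check one entry against the explicit summation in Equation (3.1).
    assert np.isclose(T[1, 2, 3], sum(T1[1, c, 2] * T2[1, c, 3] for c in range(C)))

    # The mode convolution of Equation (3.2), restricted to two vectors: a 'full'
    # convolution of lengths X and L returns a vector of length X + L - 1.
    x, l = np.random.rand(6), np.random.rand(4)
    print(np.convolve(x, l).shape)                    # (9,)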
3.3.3 Compact Neural Networks via conv_einsum

In this subsection, we formulate various network layers in terms of conv_einsum.

Standard convolutional networks. We review the standard 2D-convolutional layer in neural networks. Such a layer is parameterized by a 4th-order tensor $\mathcal{W} \in \mathbb{R}^{T \times S \times H \times W}$, which maps a 3rd-order tensor $\mathcal{X} \in \mathbb{R}^{S \times H' \times W'}$ to a 3rd-order tensor $\mathcal{Y} \in \mathbb{R}^{T \times H' \times W'}$:

    Y = conv_einsum('bshw,tshw->bthw|hw', X, W)

Note that a neural network typically computes its inputs in mini-batches, so the conv_einsum string contains an additional letter 'b' to index examples in a mini-batch. Since convolutional layers include fully-connected layers as a particular case when $H = W = 1$, we focus on designs of convolutional layers for the remainder of this subsection.

Tensorial neural networks. Convolutional layers motivate the importance and usage of TNNs, since their structures benefit from reshaping and tensorial decomposition. Numerous works propose to design tensorial layers where the (reshaped) convolution kernel $\mathcal{W}$ is factorized using tensor decompositions [20, 21, 34, 50, 68]. Our proposed conv_einsum can handle these types of designs. Here, we present two representative efficient tensorial convolutional layer designs based on the CP decomposition [33].

(1) In a CP convolutional layer [20], the kernel $\mathcal{W}$ is factorized into 4 factors $W^{(1)} \in \mathbb{R}^{R \times T}$, $W^{(2)} \in \mathbb{R}^{R \times S}$, $W^{(3)} \in \mathbb{R}^{R \times H}$, $W^{(4)} \in \mathbb{R}^{R \times W}$ such that

    W = conv_einsum('rt,rs,rh,rw->tshw', W1, W2, W3, W4)

Plugging this decomposition into the 2D-convolutional layer, we obtain the following conv_einsum string for this layer:

    Y = conv_einsum('bshw,rt,rs,rh,rw->bthw|hw', X, W1, W2, W3, W4)

(2) In a reshaped CP convolutional layer [50], the convolution kernel $\mathcal{W} \in \mathbb{R}^{T \times S \times H \times W}$ is first reshaped into a higher-order tensor $\mathcal{W} \in \mathbb{R}^{T_1 \times \cdots \times T_M \times S_1 \times \cdots \times S_M \times H \times W}$ such that $T = \prod_{m=1}^{M} T_m$ and $S = \prod_{m=1}^{M} S_m$, and then factorized into $(M+1)$ tensors $\mathcal{W}^{(m)} \in \mathbb{R}^{R \times T_m \times S_m}$ with $\mathcal{W}^{(0)} \in \mathbb{R}^{R \times H \times W}$. For example, when $M = 3$:

    W = conv_einsum('r(t1)(s1),r(t2)(s2),r(t3)(s3),rhw->(t1)(t2)(t3)(s1)(s2)(s3)hw', W1, W2, W3, W0)

We can write the layer's conv_einsum string as

    Y = conv_einsum('b(s1)(s2)(s3)hw,r(t1)(s1),r(t2)(s2),r(t3)(s3),rhw->b(t1)(t2)(t3)hw|hw', X, W1, W2, W3, W0)

For both layers, $R$ is the rank of the CP decomposition, which controls the number of parameters (i.e., the compression rate) of the layer.
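As a point of reference, the CP convolutional layer above can also be evaluated with plain PyTorch primitives as a chain of small convolutions, in the spirit of [20]: a 1x1 convolution over input channels, two depthwise convolutions along height and width, and a final 1x1 convolution over output channels. The sketch below is one such pairwise order with hypothetical sizes; conv_einsum would choose the cheapest order automatically, and PyTorch's conv2d computes cross-correlation with zero padding rather than the full convolution used in Equation (3.2).

    import torch
    import torch.nn.functional as F

    B, S, T, R = 8, 16, 32, 4            # batch, input/output channels, CP rank (hypothetical)
    H = W = 3                            # filter height/width
    X  = torch.randn(B, S, 28, 28)       # input feature map
    W1 = torch.randn(R, T)               # factor over output channels ('rt')
    W2 = torch.randn(R, S)               # factor over input channels  ('rs')
    W3 = torch.randn(R, H)               # factor over filter height   ('rh')
    W4 = torch.randn(R, W)               # factor over filter width    ('rw')

    Z = F.conv2d(X, W2.reshape(R, S, 1, 1))                                  # 1x1 conv: S -> R channels
    Z = F.conv2d(Z, W3.reshape(R, 1, H, 1), groups=R, padding=(H // 2, 0))   # depthwise conv along height
    Z = F.conv2d(Z, W4.reshape(R, 1, 1, W), groups=R, padding=(0, W // 2))   # depthwise conv along width
    Y = F.conv2d(Z, W1.t().reshape(T, R, 1, 1))                              # 1x1 conv: R -> T channels
    print(Y.shape)                                                           # torch.Size([8, 32, 28, 28])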
3.4 Algorithms

The last section presented how conv_einsum represents general multilinear operations in neural networks. In this section, we develop a suite of algorithms to efficiently implement conv_einsum. We organize this section as follows: (1) in Section 3.4.1, we develop an algorithm to reduce a conv_einsum function with two inputs to a collection of atomic PyTorch operations, which allows us to reuse GPU-optimized utilities in PyTorch to complete the computation; (2) in Section 3.4.2, we derive an optimal sequencer which automatically decomposes a conv_einsum function with an arbitrary number of inputs into a sequence of conv_einsum operations with two inputs; and (3) in Section 3.4.3, we utilize a checkpointing technique to further reduce the memory overhead of our implementation.

3.4.1 Atomic Operations

In this subsection, we show that any conv_einsum function can be realized by GPU-optimized PyTorch utilities, namely einsum and convNd (e.g., conv1d, conv2d). In particular, any two-input conv_einsum with convolution can be realized via a convNd. To understand why this is possible, we analyze the conv_einsum string for the conv1d function:

    Y = conv_einsum('bsh,tsh->bth|h', X, W)

where 't' stands for the output channel, 's' the input channel, 'h' the length of the features/filters, and 'b' the batch size. Now, we can categorize these letters in terms of primitive operations. (1) The letter 'h' is a convolution, appearing in both inputs and the output; (2) the letter 's' is a contraction, appearing in both inputs but not the output; (3a) the letter 't' is an outer product, appearing in the first input and the output; (3b) the letter 'b' is another outer product, appearing in the second input and the output.

The conv1d function covers almost all mixtures of compound operations in which each operation type appears at most once, which we refer to as an atomic operation. However, there are two cases that are not covered: (4) a batch product that appears in both inputs and the output; and (5) a self-contraction that occurs in only one input. Fortunately, we can readily address these two edge cases. For (4), the function conv1d supports a group-convolution option, which effectively extends it to

    Y = conv_einsum('gtsh,bgsh->bgth|h', X, W)

where 'g' stands for the filter group. In terms of tensor operations, it is a batch product, which appears in both inputs and the output. For (5), such a letter can be eliminated by summing over the corresponding index in pre-processing.

Multiple letters with the same operation type. Now we address the scenario where multiple letters have the same operation type. For example, if there are two different letters in a conv_einsum string designated for convolution, we can use conv2d instead of conv1d. Notice that conv2d realizes a conv_einsum such as

    Y = conv_einsum('gtshw,bgshw->bgthw|hw', X, W)

where 'g' stands for the filter group and 'h'/'w' represent the height/width, respectively. In principle, we can use a convNd function to compute a conv_einsum function with N letters for convolution (though convNd for N >= 4 requires a custom implementation). For non-convolution letters, all letters with the same type can be merged into one letter by pre-processing (i.e., the corresponding modes are reshaped into one compound mode), and the letter is converted back to multiple letters by post-processing (i.e., the compound mode is reshaped back to its corresponding modes).
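The mapping between an atomic conv_einsum and PyTorch's conv1d can be checked directly with standard PyTorch (sizes below are arbitrary; conv1d computes cross-correlation with zero padding, so the boundary handling differs slightly from a full convolution, and the group example assumes the pre-processing reshape described above).

    import torch
    import torch.nn.functional as F

    # conv_einsum('bsh,tsh->bth|h', X, W): 'h' convolution, 's' contraction, 't'/'b' outer products.
    B, S, T, H, h = 2, 3, 4, 10, 5
    X = torch.randn(B, S, H)               # input  [batch, in-channels, length]
    W = torch.randn(T, S, h)               # weight [out-channels, in-channels, filter length]
    Y = F.conv1d(X, W, padding=h - 1)      # padding h-1 yields length H + h - 1, as in a full convolution
    print(Y.shape)                         # torch.Size([2, 4, 14])

    # The batch-product edge case (a letter 'g' in both inputs and the output) maps to groups,
    # after merging the 'g' and channel modes into a single channel axis.
    G = 2
    Xg = torch.randn(B, G * S, H)
    Wg = torch.randn(G * T, S, h)
    Yg = F.conv1d(Xg, Wg, padding=h - 1, groups=G)
    print(Yg.shape)                        # torch.Size([2, 8, 14])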
3.4.2 Optimal Sequencer

Our conv_einsum leverages APIs in the existing open-source opt-einsum library for NumPy [60]. The opt-einsum library can determine the FLOPs-optimal evaluation order of tensor networks involving contractions, outer products, and batch products. Our conv_einsum introduces convolution handling to opt-einsum. However, due to the complexity of our convolutional functionality and the core APIs of opt-einsum, we only present a high-level overview of the basis of conv_einsum.

(a) Tensor sequence generation:

    A = np.random.rand(4, 7, 9)
    B = np.random.rand(10, 5)
    C = np.random.rand(5, 4, 2)
    D = np.random.rand(6, 8, 9, 2)
    print(conv_einsum.contract_path('ijk,jl,lmq,njpq->ijknp|j', A, B, C, D))

(b) An optimal sequence of paths:

    Complete sequence:    ijk,jl,lmq,njpq->ijknp|j
    Naive FLOP count:     4.212e+05
    Optimized FLOP count: 2.056e+05
    Largest intermediate: 1.944e+05 elements
    -------------------------------------------------
    current               remaining
    -------------------------------------------------
    lmq,jl->qj            ijk,njpq,qj->ijknp|j
    qj,njpq->jnp|j        ijk,jnp->ijknp|j
    jnp,ijk->ijknp|j      ijknp->ijknp|j

Figure 3.2: conv_einsum sample code. The figure depicts the generation via NumPy of a set of tensors coalesced into one tensor sequence. The sequence is represented as a string and submitted to the optimal sequencer of conv_einsum for path analysis; the output of the analysis is presented in Figure 3.2b. (a) A tensor sequence over a collection of tensors A, B, C, and D, involving contractions, convolutions, and batch products, is analyzed; we store and print the optimal sequence of paths, which is stored in a string array path_info. (b) The code, leveraging opt-einsum with our added support for convolutions, displays the optimal sequence of paths for the conv_einsum string submitted in Figure 3.2a, along with information about the naive left-to-right cost vs. the cost of taking the suggested path and the size/cost of the largest intermediate.

The opt-einsum library relies on the well-known netcon [49] algorithm for determining an optimal order of evaluations. netcon was designed to handle operation sequences in tensor networks. For example, for $A \in \mathbb{R}^{I \times J \times K}$, $B \in \mathbb{R}^{J \times L}$, $C \in \mathbb{R}^{L \times M}$, one might be interested in the optimal order of evaluation of the tensor

$\mathcal{T} = \sum_{l=1}^{L} \sum_{j=1}^{J} A_{:,j,:} B_{j,l} C_{l,:}.$   (3.4)

Let us momentarily suppress the index and summation notation of Equation (3.4), i.e., let $(AB) \triangleq \sum_{j=1}^{J} A_{:,j,:} B_{j,l}$ and $(BC) \triangleq \sum_{l=1}^{L} B_{j,l} C_{l,:}$. The possible paths we may take to arrive at $\mathcal{T}$ include $(AB) \to (AB)C$, or $(BC) \to A(BC)$. Each of these paths has a predictable contraction cost, i.e., the number of multiplications/additions (FLOPs), dependent on the dimensions of the tensor modes involved in each intermediate product. We can organize all the possible paths for this example, and for any contraction sequence in general, into a tree. Each node is associated with the contraction cost of forming the product represented by that node. netcon efficiently traverses such path trees and determines the FLOPs-optimal path in a fast manner, even though this is an NP-hard problem in general [48]. Furthermore, netcon supports cost-capping, avoiding traversal down a branch if the resulting intermediate product exceeds some FLOPs cap $c$.

Figure 3.3: Optimal sequencer example. In this figure, conv_einsum deploys the optimal sequencer to analyze the path tree of an abstract, well-defined tensor sequence $A \ast_1 B \ast_2 C \ast_3 D \ast_4 E$, where $\ast_i$ for $1 \le i \le 4$ is any collection of multilinear operations, including convolutions, batch products, contractions, and outer products. The tree traversal strategy is a fusion of netcon and our tnn-cost API. The green path indicates a potentially viable optimal path, whereas the red path indicates a path which satisfies the cost cap $c$ at each node. While the red path satisfies the cost cap, it may result in more FLOPs overall compared to the complete green path.

In particular, netcon is capable of handling all types of multilinear sequences except for those involving convolutions. For example, consider the tensor $\mathcal{T}_{p,:,q,r,t} = \sum_{n=1}^{N} B_{n,p}\,(A_{n,:,r} * C_{:,q})\,D_{r,t}$ with a convolution. We may equivalently compute $\mathcal{T}$ as

$\mathcal{T}_{p,:,q,r,t} = \Big( \sum_{n=1}^{N} (B_{n,p} A_{n,:,r}) * C_{:,q} \Big) D_{r,t} = \sum_{n=1}^{N} B_{n,p} \big( (A_{n,:,r} D_{r,t}) * C_{:,q} \big).$   (3.5)
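conv_einsum's contract_path belongs to our library, but the underlying opt-einsum machinery it generalizes can be exercised on its own for convolution-free sequences such as Equation (3.4); the shapes below are arbitrary.

    import numpy as np
    import opt_einsum as oe

    A = np.random.rand(40, 7, 90)      # A_{ijk}
    B = np.random.rand(7, 5)           # B_{jl}
    C = np.random.rand(5, 6)           # C_{lm}

    # Ask opt-einsum for the FLOPs-optimal pairwise order of evaluating 'ijk,jl,lm->ikm'.
    path, info = oe.contract_path('ijk,jl,lm->ikm', A, B, C)
    print(info)                        # naive vs. optimized FLOP counts, largest intermediate, chosen steps
    T = oe.contract('ijk,jl,lm->ikm', A, B, C, optimize=path)
    print(T.shape)                     # (40, 90, 6)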
Our conv_einsum extends the netcon paradigm to handle convolutions simply by replacing the contraction cost function with a more general TNN cost, which adds the cost (FLOPs-wise) of the convolutions (if present) within an intermediate product at each node. We refer to this generalized scheme as the optimal sequencer. Figure 3.3 depicts the optimal sequencer analyzing the path tree of an abstract tensor sequence. The action of the optimal sequencer is crucial for trimming the training time of our TNNs.

Modification of the cost model for training. The (netcon-based) opt-einsum sequencer only considers forward computation in tensor networks. However, in a neural network setting, we also need to consider the backpropagation computation. Specifically, given two inputs $\mathcal{T}^{(1)}, \mathcal{T}^{(2)}$, which interact through an atomic operation $f$ resulting in an output tensor $\mathcal{T} = f(\mathcal{T}^{(1)}, \mathcal{T}^{(2)})$, the opt-einsum sequencer will calculate the cost of computing $\mathcal{T}$ without any concern for the associated backpropagation calculations. However, the backpropagation algorithm needs to compute $\partial L / \partial \mathcal{T}^{(1)} = g_1(\partial L / \partial \mathcal{T}, \mathcal{T}^{(2)})$ and $\partial L / \partial \mathcal{T}^{(2)} = g_2(\mathcal{T}^{(1)}, \partial L / \partial \mathcal{T})$, where $g_1$ and $g_2$ are gradient calculations dependent on $f$. Therefore, we modify the cost from $\mathrm{cost}(f)$ to $\mathrm{cost}(f) + \mathrm{cost}(g_1) + \mathrm{cost}(g_2)$. For instance, consider $f$ as a standard 2D-convolution, where the operation between the input $\mathcal{T}^{(1)} \in \mathbb{R}^{B \times S \times X \times Y}$ and the weight $\mathcal{T}^{(2)} \in \mathbb{R}^{T \times S \times H \times W}$ leads to the output $\mathcal{T} \in \mathbb{R}^{B \times T \times X' \times Y'}$. We have $\mathrm{cost}(f) = O(BHWXYTS)$ for the forward pass, and $\mathrm{cost}(g_1) = O(BHWX'Y'TS)$, $\mathrm{cost}(g_2) = O(BXYX'Y'TS)$ for the backward pass. In order to achieve optimal scheduling, we modify the cost function of opt-einsum to consider all three costs, and this modification is inherited by the optimal sequencer of conv_einsum.

3.4.3 Checkpointing

TNNs use composite tensor operations (a.k.a. tensor networks) to design compact network layers. However, since we evaluate a tensor network pairwise, computing a tensor network with $N$ inputs leads to $(N - 1)$ intermediate results. Therefore, if we use an automatic gradient function, we will need to save these $(N - 1)$ intermediates in memory, causing high memory overhead. To avoid storing the intermediate products, we rely on gradient checkpointing [61], which recomputes the gradient during the backward pass rather than saving all intermediate results in memory. We can think of this mechanism as a trade-off between memory and computation. The total memory used by a neural network consists of static and dynamic memory. Static memory depends on the size of the model and some fixed costs built into PyTorch, while dynamic memory depends on the computational graph saved in memory. Usually, when training a neural network, the model caches all values of the activation neurons in the forward pass and reuses them in the backward-pass calculation. Gradient checkpointing avoids any activation caching in the forward pass and thus can effectively relieve any potential memory overflow in that phase.
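PyTorch exposes this mechanism through torch.utils.checkpoint; the toy network below (three chained einsum contractions standing in for a pairwise-evaluated tensor network) is an assumption for illustration and not AutoTNN's internal implementation.

    import torch
    from torch.utils.checkpoint import checkpoint

    def tensor_network(A, B, C, D):
        AB = torch.einsum('ij,jk->ik', A, B)      # intermediate 1
        ABC = torch.einsum('ik,kl->il', AB, C)    # intermediate 2
        return torch.einsum('il,lm->im', ABC, D)

    A, B, C, D = (torch.randn(64, 64, requires_grad=True) for _ in range(4))

    # With checkpointing, the intermediates AB and ABC are not kept for the backward pass;
    # they are recomputed when gradients are needed, trading computation for memory.
    out = checkpoint(tensor_network, A, B, C, D, use_reentrant=False)
    out.sum().backward()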
3.5 Experimental Results

In this section, we compare the training and inference run-times of AutoTNN against the prevalent PyTorch. We demonstrate that AutoTNN provides an efficient solution for TNN deployment in different tasks across various domains. We use an NVIDIA GeForce RTX 2080 Ti for all tasks.

Tasks. We test AutoTNN on a range of tasks under different network compression rates: (1) a classic two-stream convolutional neural network [69] is used for a video classification task, trained on the UCF-101 dataset [70]; ResNet-101 [5] was chosen as the ConvNet for both the spatial and temporal streams, pre-trained on ImageNet [71], and the two-stream network is adapted from [69]. (2) An automatic speech recognition task using the Conformer architecture [72], which incorporates convolution modules between the attention modules and the feed-forward modules of a Transformer model [4]; we train the model on the LibriSpeech dataset [73]. (3) An image classification task trained on the CIFAR-10 [74] dataset using the classic ResNet-34 [5] architecture; state-of-the-art accuracy requires the usage of knowledge distillation (cf. [50]).

Baselines. We compare AutoTNN against two baselines across all tasks: the PyTorch implementation with and without checkpointing. In particular, we compare the usage of conv_einsum to optimally evaluate forward/backward tensor sequences against PyTorch with and without checkpointing to demonstrate the benefits of conv_einsum.

3.5.1 Accuracy and Memory Results

(1) TNNs demonstrate competitive accuracy under aggressive compression. Table 3.1 shows the test/training accuracy of TNNs using a reshaped CP decomposition of their tensor weights under different compression rates in three machine learning tasks, namely video classification (VC), automatic speech recognition (ASR), and image classification (IC). A compression rate (CR) of x% indicates that the size of the TNN model is x% of the original/baseline model size. As shown in Table 3.1, for instance, a TNN using only 10% of the size of the original backbone model (i.e., a ResNet-34 with 21M parameters) maintains 98% of the baseline performance in an IC task on the CIFAR-10 benchmark dataset.

Table 3.1: TNN performance under various model scales for diverse machine learning tasks. Automatic speech recognition (ASR) on LibriSpeech is measured by Word Error Rate (WER; the lower, the better). Image classification (IC) on CIFAR-10 (results from [50]) is measured by top-1 precision. Video classification (VC) on UCF-101 is measured by top-1 accuracy.

Compression Rate (CR)    IC       ASR     VC
Original                 93.2     2.1     88.98
100%                     -        2.08    89.00
50%                      -        2.29    88.61
20%                      -        2.36    88.10
10%                      91.28    2.43    87.63
5%                       89.86    3.01    86.62
2%                       85.70    3.76    86.41

(2) Naive PyTorch implementation of TNNs suffers from high intermediary memory costs during training; AutoTNN significantly reduces these costs. In TNNs, some intermediate data objects can be prohibitively large during the training computation and thus challenging to fit into memory. In Table 3.2, we present the maximal batch size allowed for two large-scale tasks under different compression rates: video classification (VC) and automatic speech recognition (ASR). We observe that if the size of a TNN matches the original backbone model size (i.e., CR = 100%), the maximal allowed batch size in a PyTorch implementation is 0 without checkpointing. Even if we compress the model to 1% of the original number of parameters, the maximal allowed batch size is still limited, making the computation infeasible or too slow. The non-optimal evaluation order used by PyTorch results in large intermediate objects that do not fit into memory. As indicated in Table 3.2, even incorporating checkpointing into the PyTorch implementation does not help much with the problem. Although checkpointing avoids saving intermediate results in memory during the forward pass and instead recomputes them in the backward pass, the order of evaluation in the forward pass remains unchanged.
If an intermediary computation causes an overflow, this issue is likely to persist even in a checkpointing implementation; only the optimal sequencer can help with such a dilemma. On average, the most cost-parsimonious path found by the optimal sequencer will contain smaller intermediate products than a naive left-to-right PyTorch evaluation.

3.5.2 Runtime results

Our conv_einsum implements an optimal sequencer that evaluates a tensorial forward or backward pass in an order which incurs the minimum number of FLOPs. Additionally, conv_einsum uses checkpointing to avoid memory overflow.

(1) The optimal sequencer used in conv_einsum significantly improves the runtime efficiency of training and testing in TNNs.

Table 3.2: Maximum batch size for a speech and a video task. The maximal batch size allowed for data under varying compression rates and using different libraries on (1) an automatic speech recognition task on LibriSpeech and (2) a video classification task for spatial (S) and temporal (T) streams of a two-stream network on UCF-101. conv_einsum allows for larger batch sizes by efficiently evaluating tensorial forward and backward passes. 'ckpt' stands for checkpointing.

Automatic speech recognition task
CR      conv_einsum   PyTorch w/ ckpt   PyTorch w/o ckpt
1%      14            8                 6
2%      14            8                 6
5%      12            6                 4
10%     10            4                 2
20%     8             2                 0
50%     6             2                 0
100%    4             1                 0

Video classification task
        conv_einsum   PyTorch w/ ckpt   PyTorch w/o ckpt
CR      S     T       S     T           S     T
1%      20    30      2     4           1     2
2%      20    30      2     4           0     0
5%      20    30      2     4           0     0
10%     18    30      1     2           0     0
20%     16    27      0     0           0     0
50%     12    22      0     0           0     0
100%    4     14      0     0           0     0

Figure 3.4: Runtime comparison between conv_einsum and PyTorch for an image classification task ((a) RCP Train, (b) RCP Test; seconds per epoch vs. compression rate). An RCP-TNN (R = 3) is trained on the CIFAR-10 dataset. Runtimes are averaged over 3 random runs, and error bars are denoted by the shaded areas.

Figure 3.5: Run-time comparison between conv_einsum and PyTorch for a speech recognition task ((a) CP Train, (b) CP Test; seconds per epoch vs. compression rate). A CP-TNN is trained on the LibriSpeech dataset. Runtimes are averaged over 3 random runs, and error bars are denoted by the shaded areas.

Figure 3.6: Run-time comparison between conv_einsum and PyTorch for a video classification task ((a) RCP Train (spatial), (b) RCP Test (spatial), (c) RCP Train (temporal), (d) RCP Test (temporal); seconds per epoch vs. compression rate). An RCP-TNN (R = 3) is trained on the UCF-101 dataset. All tests were run using the maximum allowable batch size.
PyTorch with checkpointing was only able to run without memory overflow for compression rates of 1%-10%, and PyTorch without checkpointing was only able to run without memory overflow for a compression rate of 1%. Runtimes are averaged over 3 random runs, and error bars are denoted by the shaded areas.

In Figures 3.4 to 3.6, we compare the training and test times between the conv_einsum and PyTorch (with and without checkpointing) implementations over a wide range of model scales and using different forms of tensor decomposition. The IC and VC tasks use RCP decompositions, while the ASR task uses a standard CP decomposition. We observe that conv_einsum universally outperforms the baselines. In the VC task, we use the maximal allowable batch size for each model size, while in the ASR task, we compare implementations using the same batch size. Furthermore, in Table 3.3 we show that conv_einsum outperforms PyTorch in the IC task under different tensor decompositions. When the memory requirement is the bottleneck of a task (such as VC), checkpointing helps accelerate the runtime by alleviating potential memory overflows and allowing larger batches. On the other hand, when the batch sizes are the same (as in ASR), checkpointing itself trades computational complexity for space, thus increasing the overall runtime. In either scenario, conv_einsum achieves the fastest runtimes in all tasks compared to the PyTorch implementations with and without checkpointing.

(2) AutoTNN works for weight tensors of different sizes. AutoTNN serves as a general, efficient library solution and is tensor-structure-agnostic. The networks we have experimented with contain weights of vastly differing sizes. As shown in Figures 3.4 to 3.6, AutoTNN exhibits competitive results against existing PyTorch solutions (with or without checkpointing) over various tensor sizes.

(3) AutoTNN works for a variety of tensor forms. conv_einsum is both data-agnostic and structure-agnostic. Given any sequence of multilinear tensor operations, conv_einsum computes the optimal sequence with the least number of FLOPs. As a result, conv_einsum is a universal solution for training any TNN. Table 3.3 shows the results of the image classification task on CIFAR-10 using different forms of tensor decomposition. We observe that conv_einsum outperforms PyTorch with/without checkpointing in all cases.

Table 3.3: Run-time (seconds per epoch) comparison between conv_einsum and PyTorch implementations of TNNs using different tensor decomposition forms in the image classification task on the CIFAR-10 dataset.

              conv_einsum         PyTorch w/o ckpt     PyTorch w/ ckpt
Tensor Form   Train    Test       Train    Test        Train    Test
RCP           14       2.14       22       2.57        29       2.61
RTR           8        1.41       16       1.86        16       1.86
RTT           8        1.37       16       1.61        16       1.68
RTK           6        1.34       17       1.47        17       1.54

3.6 Conclusion

In this chapter, we introduced an open-source library, AutoTNN, which can build and efficiently train TNNs. Our AutoTNN is competitive against, and in many cases superior to, standard PyTorch. For future work, we plan to further accelerate training and test times by incorporating parallel computation paradigms into the intra-layer tensor computations. Since our AutoTNN relies on the PyTorch backend, we can also use TensorRT to accelerate AutoTNN further. Additionally, we will investigate incorporating quantization, pruning, and knowledge distillation techniques directly into AutoTNN.
63 Chapter 4: Higher-order RNN for Spatio-temporal Learning 4.1 Overview While computer vision has achieved remarkable successes in many realms, e.g., image classification, many real-life tasks remain out-of-reach for current deep learning systems, such as prediction from complex spatio-temporal data. Such data naturally arises in a wide range of applications such as autonomous driving, robot control [75], visual perception tasks such as action recognition [76] or object tracking [77], and weather prediction [78]. This kind of video understanding problems is challenging, since they require learning spatial- temporal representations that capture both content and dynamics simultaneously. Learning from (video) sequences. Most state-of-the-art video models are based on recurrent neural networks (RNNs), typically some variations of Convolutional LSTM (Con- vLSTM) where spatio-temporal information is encoded explicitly in each cell [78, 79, 80, 81]. These RNNs are first-order Markovian models in nature, meaning that the hidden states are updated using information from the previous time step only, resulting in an intrinsic difficulty in capturing long-range temporal correlations. 64 Incorporating higher-order correlations. For 1D sequence modeling, [82] and [37] proposed higher-order generalizations of RNNs for long-term forecasting problems. Higher- order RNNs explicitly incorporate an extended history of previous states in each update, which requires higher-order tensors to characterize the transition function (instead of a tran- sition matrix as in the first-order RNNs). However, this typically leads to an exponential blow-up in the function?s complexity. This problem is more pronounced when generalizing ConvLSTM to higher-order, and no prior work explores these generalizations. Scaling up with tensor methods. To avoid the exponential blow-up in the complexity of transition function, tensor decompositions [83] have been investigated in higher-order RNNs [37]. Tensor decomposition avoids the exponential growth of model complexity and introduces an information bottleneck that facilitates effective representation learning. This bottleneck restricts the information that is passed on from one sub-system to another in a learning system [84, 85]. Previously, low-rank tensor factorization has been used to improve various deep network architectures [20, 21, 86]. However, its application has not been explored in the context of spatio-temporal LSTMs. The only approach that leveraged tensor factorization for compact higher-order LSTMs [37] considers exclusively sequence forecasting, which does not naturally extend to general spatio-temporal data. Generalizing ConvLSTM to higher-orders. When extending to higher-order, we aim to design a transition function that can leverage all previous hidden states and satisfies three properties: (1) The operations preserve the spatial structure of the hidden states; (2) The receptive field increases with time. In other words, the longer the temporal correlation is 65 captured, the larger the spatial context should be; (3) Finally, space and time complexities grow at most linearly with the number of times steps. Because previous transition functions in higher-order RNNs were for one-dimensional sequence, when directly extended to spatio- temporal data, they do not satisfy all three properties. A direct extension fails to preserve the spatial stricture or increases the complexity exponentially. Contributions. 
In this chapter, we propose a higher-order convolutional LSTM model for complex spatio-temporal data that satisfies all three properties. Our model incorporates a long history of states in each update while preserving their spatial structure using convolutional operations. Directly constructing such a model leads to an exponential growth of parameters in both the spatial and temporal dimensions. Instead, our model is made computationally tractable via a novel convolutional tensor-train decomposition, which recursively performs a convolutional factorization of the kernels across time. Besides parameter reduction, this factorization introduces an information bottleneck enabling the learning of better representations. As a result, it achieves better results than previous works with only a fraction of the parameters.

We empirically demonstrate our model's performance on several challenging tasks, including early activity recognition and video prediction. We report an absolute increase of 8% accuracy over the state-of-the-art [81] for early activity recognition on the Something-Something V2 dataset. Our model outperforms both 3D-CNN and ConvLSTM by a large margin. We also report a new state-of-the-art in multi-step video prediction on both the Moving-MNIST-2 and KTH datasets.

4.2 Related Works

Tensor decompositions. Tensor decompositions such as CP, Tucker, or tensor-train [33, 35] are widely used for dimensionality reduction [25] and learning probabilistic models [83]. These tensor factorization techniques have also been widely used in deep learning to improve performance, speed up computation, and compress deep neural networks [19, 20, 21, 50, 58, 87], recurrent networks [36, 88], and Transformers [89]. [36] proposed tensor-train RNNs to compress both the input-state and state-state matrices within each cell with TTD by reshaping the matrices into tensors, and showed improvements for video classification. Departing from prior works that rely on existing tensor decompositions, we propose a novel convolutional tensor-train decomposition (CTTD) designed for efficient and compact higher-order convolutional recurrent networks. Unlike [36], we aim to compress a higher-order ConvLSTM, rather than a first-order fully-connected LSTM. We further propose the convolutional tensor-train decomposition to preserve spatial structure after compression.

Spatio-temporal prediction models. Prior prediction models have focused on predicting short-term video [90, 91] or decomposing motion and contents [92, 93, 94, 95]. Many of these works use ConvLSTM as a base module, which deploys 2D convolutions in the LSTM to efficiently exploit spatio-temporal information. Some works modified the standard ConvLSTM to better capture spatio-temporal correlations [79, 80]. [91] demonstrated strong performance using a deep ConvLSTM network as a baseline, and we adopt this base architecture in this chapter.

4.3 Background: Convolutional LSTM and Higher-order LSTM

In this section, we briefly review Long Short-Term Memory (LSTM) and its generalizations: Convolutional LSTM (ConvLSTM) for spatio-temporal modeling, and higher-order LSTM (HO-LSTM) for learning long-term dynamics.

Long Short-Term Memory [96]. LSTM is a first-order Markovian model widely used in 1D sequence learning. At each step, an LSTM cell updates its hidden state $h^{(t)}$ and cell state $c^{(t)}$ using the immediately previous states $\{h^{(t-1)}, c^{(t-1)}\}$ and the current input $x^{(t)}$:

$[i^{(t)}; f^{(t)}; \tilde{c}^{(t)}; o^{(t)}] = \sigma(W x^{(t)} + K h^{(t-1)}),$   (4.1a)
$c^{(t)} = c^{(t-1)} \circ f^{(t)} + \tilde{c}^{(t)} \circ i^{(t)}, \quad h^{(t)} = o^{(t)} \circ \tanh(c^{(t)}),$   (4.1b)
where $\sigma(\cdot)$ denotes a sigmoid applied to the input gate $i^{(t)}$, forget gate $f^{(t)}$, and output gate $o^{(t)}$, and a tanh applied to the cell memory $\tilde{c}^{(t)}$ and cell state $c^{(t)}$; $\circ$ denotes the element-wise product. LSTMs have two major restrictions: (a) only 1D sequences can be modeled, not spatio-temporal data such as videos; (b) as first-order models, they have difficulty capturing long-term dynamics.

Convolutional LSTM (ConvLSTM) [78, 97]. ConvLSTM addresses limitation (a) by extending LSTM to model spatio-temporal structures within each cell, i.e., the states, cell memory, gates, and parameters are all encoded as high-dimensional tensors:

$[\mathcal{I}^{(t)}; \mathcal{F}^{(t)}; \tilde{\mathcal{C}}^{(t)}; \mathcal{O}^{(t)}] = \sigma(\mathcal{W} * \mathcal{X}^{(t)} + \mathcal{K} * \mathcal{H}^{(t-1)}),$   (4.2)

where $*$ denotes convolution between states and parameters, as in convolutional layers.

Higher-order LSTM [37, 82]. HO-LSTM is a higher-order Markovian generalization of the basic LSTM, which partially addresses limitation (b) in modeling long-term dynamics. Specifically, HO-LSTM explicitly incorporates more previous states in each update, replacing the first step in LSTM by

$[i^{(t)}; f^{(t)}; \tilde{c}^{(t)}; o^{(t)}] = \sigma\big(W x^{(t)} + \phi(h^{(t-1)}, \cdots, h^{(t-N)})\big),$   (4.3)

where $\phi$ combines the $N$ previous states $\{h^{(t-1)}, \cdots, h^{(t-N)}\}$ and $N$ is the order of the HO-LSTM. For example, $\phi$ can be a linear function [82] or a polynomial function [37]:

Linear: $\phi\big(h^{(t-1)}, \cdots, h^{(t-N)}; T^{(1)}, \cdots, T^{(N)}\big) = \sum_{i=1}^{N} T^{(i)} h^{(t-i)}.$   (4.4)
Polynomial: $\phi\big(h^{(t-1)}, \cdots, h^{(t-N)}; \mathcal{T}\big) = \big\langle \mathcal{T},\, h^{(t-1)} \otimes \cdots \otimes h^{(t-N)} \big\rangle.$   (4.5)

While a linear function requires numbers of parameters and operations growing linearly in $N$, a polynomial function has space/computational complexity exponential in $N$ if implemented naively.

Notations. To facilitate the reading, we provide a table of notations in Table 4.1.

Symbol                          Meaning                                        Value or Size
$H$                             Height of feature map                          -
$W$                             Width of feature map                           -
$C_{in}$                        # of input channels                            -
$C_{out}$                       # of output channels                           -
$t$                             Current time step                              -
$\mathcal{W}$                   Weights for $\mathcal{X}^{(t)}$                $[K \times K \times 4C_{out} \times C_{in}]$
$\mathcal{X}^{(t)}$             Input features                                 $[H \times W \times C_{in}]$
$\mathcal{H}^{(t)}$             Hidden state                                   $[H \times W \times C_{out}]$
$\mathcal{C}^{(t)}$             Cell state                                     $[H \times W \times C_{out}]$
$\mathcal{I}^{(t)}$             Input gate                                     $[H \times W \times C_{out}]$
$\mathcal{F}^{(t)}$             Forget gate                                    $[H \times W \times C_{out}]$
$\tilde{\mathcal{C}}^{(t)}$     Cell memory                                    $[H \times W \times C_{out}]$
$\mathcal{O}^{(t)}$             Output gate                                    $[H \times W \times C_{out}]$
$\Phi$                          Mapping function for higher-order RNN          -
$M$                             Order of the higher-order RNN                  $M \ge N$
$N$                             Order of the CTTD                              -
$K$                             Initial filter size                            $K^{(0)} = K$
$K^{(i)}$                       Filter size in $\mathcal{K}^{(i)}$             -
$C^{(i)}$                       # channels in $\tilde{\mathcal{H}}^{(i)}$      $C^{(0)} = 4C_{out}$
$\mathcal{G}^{(i)}$             Factors in the CTTD                            $[K^{(0)} \times K^{(0)} \times C^{(i)} \times C^{(i-1)}]$
$D$                             Size of the sliding window                     $D = M - N + 1$
$\mathcal{P}^{(i)}$             Preprocessing kernel                           $[D \times K \times K \times C_{out} \times C^{(i)}]$
$\tilde{\mathcal{H}}^{(i)}$     Pre-processed hidden state                     $[H \times W \times C^{(i)}]$
$\mathcal{K}^{(i)}$             Weights for $\tilde{\mathcal{H}}^{(i)}$        $[K^{(i)} \times K^{(i)} \times C^{(i)} \times C^{(0)}]$
Table 4.1: Table of notations.

4.4 Methodology: Convolutional Tensor-Train LSTM

In this section, we detail the challenges and requirements for a higher-order ConvLSTM. We then introduce our model and motivate the design of each module by these requirements.

4.4.1 Extending ConvLSTM to Higher-orders

We can express a general higher-order ConvLSTM by combining several previous states when computing the gates at each step:

$[\mathcal{I}^{(t)}; \mathcal{F}^{(t)}; \tilde{\mathcal{C}}^{(t)}; \mathcal{O}^{(t)}] = \sigma\big(\mathcal{W} * \mathcal{X}^{(t)} + \Phi(\mathcal{H}^{(t-1)}, \cdots, \mathcal{H}^{(t-N)})\big).$   (4.6)

As discussed in Section 4.1, the function $\Phi$ should satisfy three properties:

1. The operations in $\Phi$ preserve the spatial structures in the hidden states $\mathcal{H}^{(t)}$'s, e.g., satisfy the translation-equivariance property.

2. The size of the receptive field for $\mathcal{H}^{(t-i)}$ increases with $i$, the time gap from the current step ($i = 1, 2, \cdots, N$). In other words, the longer the temporal correlation captured, the larger the considered spatial context should be.

3. Both space and time complexities grow at most linearly with the number of time steps $N$, i.e., $O(N)$.
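For reference, the first-order ConvLSTM update of Equations (4.1b) and (4.2) can be written in a few lines of PyTorch. The sketch below is the standard baseline cell, not our Conv-TT-LSTM, and fusing $\mathcal{W}$ and $\mathcal{K}$ into a single convolution over the concatenated input and hidden state is an implementation shortcut assumed here.

    import torch
    import torch.nn as nn

    class ConvLSTMCell(nn.Module):
        """A minimal first-order ConvLSTM cell following Equation (4.2)."""
        def __init__(self, in_channels, hidden_channels, kernel_size=3):
            super().__init__()
            # One convolution produces the 4 * C_out gate maps [I; F; C~; O].
            self.gates = nn.Conv2d(in_channels + hidden_channels, 4 * hidden_channels,
                                   kernel_size, padding=kernel_size // 2)

        def forward(self, x, h, c):
            i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
            i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
            g = torch.tanh(g)                        # cell memory
            c_next = f * c + i * g                   # element-wise update, Equation (4.1b)
            h_next = o * torch.tanh(c_next)
            return h_next, c_next

    cell = ConvLSTMCell(in_channels=3, hidden_channels=8)
    x = torch.randn(2, 3, 16, 16)                    # [batch, C_in, H, W]
    h = torch.zeros(2, 8, 16, 16)                    # hidden and cell states [batch, C_out, H, W]
    c = torch.zeros(2, 8, 16, 16)
    h, c = cell(x, h, c)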
Limitations of previous approaches. While it is possible to construct a function $\Phi$ by extending the linear function in Equation (4.4) or the polynomial function in Equation (4.5) to the tensor case, none of these extensions satisfy all three properties. While the polynomial function with tensor-train decomposition [37] meets requirement (3), the operations do not preserve the spatial structures in the hidden states. On the other hand, augmenting the linear function with convolutions leads to a function

$\Phi\big(\mathcal{H}^{(t-1)}, \cdots, \mathcal{H}^{(t-N)}; \mathcal{K}^{(1)}, \cdots, \mathcal{K}^{(N)}\big) = \sum_{i=1}^{N} \mathcal{K}^{(i)} * \mathcal{H}^{(t-i)}$   (4.7)

which does not satisfy requirement (2) if all $\mathcal{K}^{(i)}$ contain filters of the same size $K$. An immediate remedy is to expand $\mathcal{K}^{(i)}$ such that its filter size $K^{(i)}$ grows linearly in $i$. However, the function would then require $O(N^3)$ space and computational complexity, violating requirement (3).

4.4.2 Designing an Effective and Efficient Higher-order ConvLSTM

In order to satisfy all three requirements (1)-(3) introduced above and enable efficient learning/inference, we propose a novel convolutional tensor-train decomposition (CTTD) that leverages a tensor-train structure [35] to jointly express the convolutional kernels $\{\mathcal{K}^{(1)}, \cdots, \mathcal{K}^{(N)}\}$ in Equation (4.7) as a series of smaller factors $\{\mathcal{G}^{(1)}, \cdots, \mathcal{G}^{(N)}\}$ while maintaining their spatial structures.

Convolutional Tensor-Train module. Concretely, let $\mathcal{K}^{(i)}$ be the $i$-th kernel in Equation (4.7), of size $[K^{(i)} \times K^{(i)} \times C^{(i)} \times C^{(0)}]$, where $K^{(i)} = i\,[K^{(1)} - 1] + 1$ is the filter size that increases linearly with $i$; $K^{(1)}$ is the initial filter size; $C^{(i)}$ is the number of channels in $\mathcal{H}^{(t-i)}$; and $C^{(0)}$ is the number of channels for the output of the function $\Phi$ (thus $C^{(0)} = 4\,C_{out}$, where $C_{out}$ is the number of channels of the higher-order ConvLSTM). The CTTD factorizes $\mathcal{K}^{(i)}$ using a subset of factors $\{\mathcal{G}^{(1)}, \cdots, \mathcal{G}^{(i)}\}$ up to index $i$ such that

$\mathcal{K}^{(i)}_{:,:,c_i,c_0} = \mathrm{CTTD}\big(\{\mathcal{G}^{(j)}\}_{j=1}^{i}\big)_{:,:,c_i,c_0} = \sum_{c_{i-1}=1}^{C^{(i-1)}} \cdots \sum_{c_1=1}^{C^{(1)}} \mathcal{G}^{(i)}_{:,:,c_i,c_{i-1}} * \cdots * \mathcal{G}^{(1)}_{:,:,c_1,c_0},$   (4.8)

where $\mathcal{G}^{(i)}$ has size $[K^{(1)} \times K^{(1)} \times C^{(i)} \times C^{(i-1)}]$. The number of factors $N$ is known as the order of the decomposition, and the ranks of the decomposition $\{C^{(1)}, \cdots, C^{(N-1)}\}$ are the numbers of channels of the convolutional kernels.
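For intuition, the sketch below materializes the kernels $\mathcal{K}^{(i)}$ of Equation (4.8) naively in NumPy by chaining 'full' 2D convolutions of the factors and contracting the shared channel indices; the factor sizes and ranks are hypothetical, and the actual CTT module described next never materializes these kernels.

    import numpy as np
    from scipy.signal import convolve2d

    def cttd_kernels(factors):
        """Build K^(1), ..., K^(N) from CTTD factors G^(i) of shape [k, k, C_i, C_{i-1}].

        The returned K^(i) has shape [i*(k-1)+1, i*(k-1)+1, C_i, C_0], matching Equation (4.8)."""
        kernels = [factors[0]]                         # K^(1) = G^(1)
        for G in factors[1:]:
            K_prev = kernels[-1]
            k_out = G.shape[0] + K_prev.shape[0] - 1   # a 'full' convolution grows the filter size
            C_i, C_0 = G.shape[2], K_prev.shape[3]
            K_next = np.zeros((k_out, k_out, C_i, C_0))
            for ci in range(C_i):
                for c0 in range(C_0):
                    for cm in range(G.shape[3]):       # contract the shared channel index
                        K_next[:, :, ci, c0] += convolve2d(
                            G[:, :, ci, cm], K_prev[:, :, cm, c0], mode="full")
            kernels.append(K_next)
        return kernels

    # Three 3x3 factors with hypothetical channel sizes C^(0..3) = 4, 8, 8, 8.
    G = [np.random.randn(3, 3, 8, 4), np.random.randn(3, 3, 8, 8), np.random.randn(3, 3, 8, 8)]
    print([k.shape for k in cttd_kernels(G)])          # [(3, 3, 8, 4), (5, 5, 8, 4), (7, 7, 8, 4)]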
Using a long history is therefore prohibitive, since a large tensor order leads to gradient vanishing/exploding when computing Equation (4.9). (b) All ranks C^(i) are equal to the number of channels in H^(t), which prevents the use of lower ranks to further reduce the model complexity. To address both issues, we develop a preprocessing module that reduces the number of steps and channels in the previous hidden states before CTT. Suppose the number of steps M is no less than the tensor order N (i.e., M ≥ N); the preprocessing collects neighboring steps with a sliding window and reduces them into an intermediate result with C^(i) channels:

H̃^(i) = P^(i) * [H^(t−i); ...; H^(t−i+N−M)],   (4.10)

where P^(i) represents a convolutional layer that maps the concatenation [·] into H̃^(i).

Figure 4.1: Convolutional Tensor-Train LSTM. The preprocessing module first groups the previous hidden states into overlapping sets with a sliding window and subsequently reduces the number of channels in each group using a convolutional layer. Next, the convolutional tensor-train module takes the results, aggregates their spatio-temporal information, and computes the gates for the LSTM update. The diagram visualizes a Conv-TT-LSTM with one channel; when the Conv-TT-LSTM has multiple channels, the addition also accumulates the results from all channels.

Convolutional Tensor-Train LSTM. By combining all the above modules, we obtain our proposed Conv-TT-LSTM, illustrated in Figure 4.1 and expressed as

[I^(t); F^(t); C̃^(t); O^(t)] = σ(W * X^(t) + CTT(H̃^(1), ..., H̃^(N); G^(1), ..., G^(N))).   (4.11)

This final design has several advantages: it drastically reduces the number of parameters, making the higher-order ConvLSTM even more compact than the first-order ConvLSTM; the low-rank constraint acts as an implicit regularizer, leading to more generalizable models; and the tensor-train structure inherently encodes the correlations arising from the natural flow of time [37].

4.5 Experimental Results

In this section, we empirically evaluate our approach on two different tasks, video prediction and early activity recognition, and find that it outperforms existing methods.

4.5.1 Experimental Setup

Evaluation. For video prediction, the model predicts every pixel in the frame. We test our models on the KTH human action dataset [98] with resolution 128 × 128 and the Moving-MNIST-2 dataset [76] with resolution 64 × 64. We train all models to predict 10 future frames given 10 input frames and test them by predicting 10 to 40 frames recursively. For early activity recognition, we evaluate our approach on the Something-Something V2 dataset. Following [81], we use the subset of 41 categories defined by [99] (Table 7). We train the prediction model to predict the following 10 frames given 25%-50% of the frames and jointly classify the activity using the learned representations of the prediction model.

Model architecture. In all video prediction experiments, we use 12 recurrent layers. For early activity recognition, we follow the framework in [81]. The prediction model consists of a two-layer 2D-convolutional encoder and decoder with eight recurrent layers between them.
The classifier, which contains two 2D-convolutional layers and one fully-connected layer, takes the output of the last recurrent layer and returns a label. We explain the detailed architecture in Section 4.7.2.

Hyper-parameter selection. We validate the hyper-parameters of our Conv-TT-LSTM through a wide grid search on the validation set. Specifically, we consider a base filter size K ∈ {3, 5}, decomposition order N ∈ {1, 2, 3, 5}, tensor ranks C^(i) ∈ {4, 8, 16}, and number of hidden steps M ∈ {1, 3, 5}. Section 4.7.3 contains the details of our hyper-parameter search.

Efficient implementation. Two versions of the implementation are available: the original and an optimized version. In the optimized version, we use multi-threading and the NVIDIA apex library [100] to accelerate the implementation. Furthermore, we adopt fused kernels to speed up the Adam optimizer [8] and TorchScript to fuse multiplications and additions. Lastly, we use affinity binding to reduce the communication cost between GPUs and CPUs. These modifications speed up training by up to four times. Both versions are available online: https://github.com/NVlabs/conv-tt-lstm.

4.5.2 Analysis of Empirical Results

Multi-frame video prediction: KTH action dataset. First, we test our model on human action videos. In Table 4.2, we report the evaluation for both 20- and 40-frame prediction, and Figure 4.2 (right) shows the model comparison in terms of SSIM vs. LPIPS and model size. (1) Our model is consistently better than the ConvLSTM baseline for both 20- and 40-frame prediction. (2) While our proposed Conv-TT-LSTMs achieve lower SSIM than the state-of-the-art models in 20-frame prediction, they outperform all previous models in LPIPS for both 20- and 40-frame prediction. Figure 4.3 (right) shows a visual comparison of our model, the ConvLSTM baseline, PredRNN++ [80], and E3D-LSTM [81]. Our model produces sharper frames and better preserves the human silhouettes, although slight artifacts (shifting) appear over time. We believe such artifacts may vanish with a different loss function or additional techniques that help per-pixel motion prediction.

Figure 4.2: SSIM vs. LPIPS scores on the Moving-MNIST-2 (left) and KTH action (right) datasets. The bubble size indicates the model size. Higher SSIM and lower LPIPS indicate better prediction quality. On both datasets and for both metrics, our approach reaches significantly better performance than other methods while having only a fraction of the parameters.

Multi-frame video prediction: Moving-MNIST-2 dataset. We also evaluate our model on the Moving-MNIST-2 dataset and show that it predicts the digits almost correctly in terms of structure and motion (see Figure 4.3). Table 4.2 reports the average statistics for 10- and 30-frame prediction, and Figure 4.2 (left) shows the comparison of SSIM vs. LPIPS and model size. Our Conv-TT-LSTM models (1) consistently outperform the ConvLSTM baseline for both 10- and 30-frame prediction with fewer parameters, and (2) outperform previous approaches in terms of SSIM and LPIPS (especially on 30-frame prediction) with less than one fifth of the model parameters.
We reproduce PredRNN++ [80] and E3D-LSTM [81] from their source code [101, 102]. We find that (1) PredRNN++ and E3D-LSTM output vague and blurry digits in long-term prediction (especially after 20 steps), while (2) our Conv-TT-LSTM produces sharp and realistic digits over all steps. A visual comparison is shown in Figure 4.3.

Figure 4.3: 30-frame prediction on Moving-MNIST-2 (left) and 20-frame prediction on the KTH action dataset (right), given 10 input frames. The first frames (t = 1, 11) are animations; Adobe Reader is required to view the animation. Our method generates both semantically plausible and visually crisp images compared to the other approaches.

KTH action (10 → 20 and 10 → 40):
Method | PSNR | SSIM | LPIPS | PSNR | SSIM | LPIPS | # Params. | # FLOPs | Time(m)
ConvLSTM [78] | 23.58 | 0.712 | - | 22.85 | 0.639 | - | 7.58M | 106.6G | -
MCNET [93] | 25.95 | 0.804 | - | - | - | - | - | - | -
PredRNN++ [80] (retrained [101]) | 28.62 | 0.888 | 228.9 | 26.94 | 0.865 | 279.0 | 15.05M | - | -
E3D-LSTM [81] (retrained [102]) | 27.92 | 0.893 | 298.4 | 26.55 | 0.878 | 328.8 | 41.94M | - | -
ConvLSTM (baseline) | 28.21 | 0.903 | 137.1 | 26.01 | 0.876 | 201.3 | 3.97M | 55.83G | 28.9
Conv-TT-LSTM (Ours) | 28.36 | 0.907 | 133.4 | 26.11 | 0.882 | 191.2 | 2.69M | 37.83G | 74.8

Moving-MNIST-2 (10 → 10 and 10 → 30):
Method | MSE | SSIM | LPIPS | MSE | SSIM | LPIPS | # Params. | # FLOPs | Time(m)
ConvLSTM [78] | 25.22 | 0.713 | - | 38.13 | 0.595 | - | 7.58M | 30.32G | -
VPN [103] | 15.65 | 0.870 | - | 31.64 | 0.620 | - | - | - | -
PredRNN++ [80] (retrained [101]) | 10.29 | 0.913 | 59.51 | 20.53 | 0.834 | 139.9 | 15.05M | - | -
E3D-LSTM [81] (pretrained [102]) | 20.23 | 0.869 | 76.12 | 32.37 | 0.803 | 150.3 | 41.94M | - | -
ConvLSTM (baseline) | 18.17 | 0.882 | 67.13 | 33.08 | 0.806 | 140.1 | 3.97M | 15.88G | 14.8
Conv-TT-LSTM (Ours) | 12.96 | 0.915 | 40.54 | 25.81 | 0.840 | 90.38 | 2.69M | 10.76G | 29.6

Table 4.2: Evaluation of multi-step prediction on the KTH action (top) and Moving-MNIST-2 (bottom) datasets. Higher PSNR/SSIM and lower MSE/LPIPS values indicate better predictive results. # FLOPs denotes the number of multiplications for one-step prediction per sample, and Time(m) is the wall-clock time (in minutes) required to train the model for one epoch (10,000 samples).

Figure 4.4: Examples of early activity recognition on the Something-Something V2 dataset; percentages indicate the confidence of the Correct/Wrong prediction. For "Pushing [something] from right to left", given the front 25% / 50% of frames: 3D-CNN predicts Wrong (100%) / Wrong (100%), ConvLSTM Wrong (84%) / Wrong (46%), and Conv-TT-LSTM Wrong (18%) / Correct (100%). For "Pulling [something] from right to left": 3D-CNN Wrong (67%) / Wrong (100%), ConvLSTM Wrong (84%) / Correct (61%), and Conv-TT-LSTM Wrong (18%) / Correct (98%).

Model | Input | Dropping | Holding | MovingLR | MovingRL | Picking | Poking | Pouring | Putting | Showing | Tearing
3D-CNN | 25% | 8.5 | 4.7 | 25.8 | 32.6 | 7.5 | 2.9 | 1.9 | 10.3 | 14.0 | 14.5
ConvLSTM | 25% | 8.5 | 7.0 | 27.4 | 38.8 | 16.8 | 5.9 | 1.9 | 12.0 | 7.0 | 21.2
Conv-TT-LSTM | 25% | 11.5 | 4.7 | 33.9 | 40.8 | 16.8 | 5.9 | 5.7 | 13.6 | 20.9 | 26.0
3D-CNN | 50% | 14.6 | 11.6 | 45.2 | 57.1 | 16.8 | 8.8 | 11.3 | 17.4 | 16.3 | 26.0
ConvLSTM | 50% | 21.5 | 7.0 | 43.5 | 47.0 | 15.9 | 14.7 | 5.7 | 20.7 | 16.3 | 30.8
Conv-TT-LSTM | 50% | 24.6 | 11.6 | 56.5 | 57.1 | 27.6 | 5.9 | 13.2 | 25.5 | 37.2 | 46.2

Table 4.3: Per-activity accuracy of early activity recognition on the Something-Something V2 dataset. We used 41 categories for training. For per-activity evaluation, the 41 categories are grouped into ten similar activities [99].
Our model substantially outperforms 3D-CNN and ConvLSTM on activities with long-term dynamics, such as Moving or Tearing, while achieving only marginal improvement on static activities such as Holding or Pouring.

Early activity recognition: Something-Something V2 dataset. To demonstrate that our Conv-TT-LSTM-based prediction model can learn efficient representations from videos, we evaluate the models on early activity recognition on the Something-Something V2 dataset. In this task, a model observes only a small fraction (25%-50%) of the frames and learns to predict the future frames. Based on the learned representations of the beginning frames, the model predicts the overall activity of the full video. Intuitively, the learned representation encodes the future information required for frame prediction, and the better the quality of the representations, the higher the classification accuracy. As shown in Table 4.3 and Table 4.4, our Conv-TT-LSTM model consistently outperforms the baseline ConvLSTM and 3D-CNN models, as well as E3D-LSTM [81], under different ratios of input frames. Our experimental setup and architecture follow [81].

Model | Front 25% | Front 50%
3D-CNN* | 9.11 | 10.30
E3D-LSTM* [81] | 14.59 | 22.73
3D-CNN | 13.26 | 20.72
ConvLSTM | 15.46 | 21.97
Conv-TT-LSTM (ours) | 19.53 | 30.05

Table 4.4: Early activity recognition on the Something-Something V2 dataset using the 41 categories as in [81]. (*) indicates the result reported by [81].

Setting | MSE (×10^-3) | SSIM | LPIPS
CTTD with 1×1 filters (similar to standard TTD), single order | 31.52 | 0.810 | 148.7
CTTD with 1×1 filters, order 3 | 34.84 | 0.800 | 151.2
CTTD with 5×5 filters, single order | 33.08 | 0.806 | 140.1
CTTD with 5×5 filters, order 3 | 28.88 | 0.831 | 104.1

Table 4.5: Ablation studies of Conv-TT-LSTM on the Moving-MNIST-2 dataset. The models are tested for 10-to-30-frame prediction.

4.6 Discussions

In this section, we further justify the importance of our proposed modules, namely the convolutional tensor-train decomposition (CTTD) and the preprocessing module. We also discuss the computational complexity of our model and the difficulties of spatio-temporal learning with Transformers [4].

Importance of encoding higher-order correlations in a convolutional manner. Two key differences between CTTD and existing low-rank decompositions are the higher-order decomposition and the convolutional operations. To verify their impact, we compare the performance of two ablated models against our CTTD-based model in Table 4.5. "Single order" means that the higher-order model is replaced with a first-order model (tensor order = 1). By replacing the 5×5 filters with 1×1 filters, the convolutions are removed and the CTTD reduces to a standard tensor-train decomposition. The results show a decrease in performance: the ablated models at best match the ConvLSTM baseline, demonstrating that both the higher-order structure and the convolutional operations are necessary.

Importance of the preprocessing module. There are other ways to incorporate the previous hidden states into the CTT module: one is to reduce the number of channels while keeping the number of steps; another is to reuse the concatenation of all previous states for each input to CTT. Unfortunately, the former fails due to the gradient vanishing/exploding problem, while the latter has a tube-shaped receptive field that fails to distinguish recent steps from those in the remote history.

Computational complexity. Table 4.2 provides the number of FLOPs for all models. Our Conv-TT-LSTM model has lower computational complexity and fewer parameters than the other models under comparison.
The efficiency is enabled by a linear-time algorithm for the CTT module in Equation (4.9), derived in [38]. Note that lower FLOPs do not necessarily lead to faster computation, since the convolutional tensor-train module is inherently sequential, as shown in Table 4.2.

Transformer for spatio-temporal learning. The Transformer [4] is a popular predictive model based on the attention mechanism, which is very successful in natural language processing [104]. However, the Transformer has prohibitive limitations for video understanding due to its excessive memory and computation requirements. While language modeling only involves temporal attention, video understanding requires attention over the spatial dimensions as well [105]. Moreover, since the attention mechanism does not preserve spatial structures by design, the Transformer additionally requires auxiliary components, including an autoregressive module and multi-resolution upscaling, when applied to spatial data [105, 106, 107]. Our Conv-TT-LSTM incorporates a broad spatio-temporal context through a compact, efficient, and structure-preserving operator, without additional components.

4.7 Supplemental Materials for Empirical Studies

The supplementary material provides implementation details for all experiments and presents additional ablation studies of our model. We demonstrate that our Conv-TT-LSTM model outperforms the regular ConvLSTM under varying settings.

4.7.1 Preprocessing Module

In Section 4.4.2, we use a sliding window to concatenate consecutive states in the preprocessing module (Equation (4.10)). In the discussion (Section 4.6), we argue that other possible approaches are less effective than our sliding window approach at preserving spatio-temporal structure. Here, we discuss an alternative approach previously proposed for non-convolutional higher-order RNNs [37], which we name the fixed window approach, and compare the two approaches in terms of computational complexity, temporal structure preservation, and predictive performance.

Fixed window approach. With the fixed window approach, the M previous steps {H^(t−1), ..., H^(t−M)} are first concatenated into a single tensor, which is then repeatedly mapped to the N inputs {H̃^(1), ..., H̃^(N)} of the CTT module. For comparison, we list the equations for both approaches:

Fixed Window (FW):    H̃^(i) = P^(i) * [H^(t−1); ...; H^(t−M)],   (4.12a)
Sliding Window (SW):  H̃^(i) = P^(i) * [H^(t−i); ...; H^(t−i+N−M)].   (4.12b)

We also illustrate the two approaches in Figure 4.5.

Figure 4.5: Variations of the preprocessing module: (a) the sliding window approach; (b) the fixed window approach (alternative).

4.7.2 Model Architectures

Multi-frame video prediction. All experiments use a 12-layer ConvLSTM / Conv-TT-LSTM with 32 channels for the first and last 3 layers and 48 channels for the 6 layers in the middle. A convolutional layer is applied on top of all recurrent layers to compute the predicted frames, followed by an extra sigmoid layer for the KTH action dataset. Following [91], we add two skip connections performing concatenation over channels between layers (3, 9) and (6, 12). An illustration of the network architecture is included in Figure 4.6a. All convolutional kernels are initialized by Xavier's normalized initializer [108], and the initial hidden/cell states are initialized as zeros.
Figure 4.6: Network architectures for video prediction ((a) prediction model) and early activity recognition ((b) recognition model).

Early activity recognition. Following [81], the network consists of four modules: a 2D-CNN encoder, a video prediction network, a 2D-CNN decoder, and a 3D-CNN classifier, as shown in Figure 4.6b. (1) The 2D-CNN encoder has two 2-strided 2D-convolutional layers with 64 channels, which reduce the resolution from 224×224 to 56×56. (2) The 2D-CNN decoder contains two 2-strided transposed 2D-convolutional layers with 64 channels, which restore the resolution from 56×56 to 224×224. (3) The video prediction network is a miniature version of Figure 4.6a, where the number of layers in each block is reduced to 2. In the experiments, we evaluate three realizations of each layer: ConvLSTM, Conv-TT-LSTM, or a masked 3D-convolutional layer. (4) The 3D-CNN classifier takes the last 16 frames of the input and predicts a label over the 41 categories. The classifier contains two 2-strided 3D-convolutional layers with 128 channels, each followed by a 3D-pooling layer. These layers reduce the resolution from 56×56 to 7×7, and the output is fed into a two-layer perceptron with 512 units to produce the predicted label.

4.7.3 Training, Evaluation, and Hyper-parameters Selection

Training strategy. We argue for a careful choice of learning schedule and gradient clipping to facilitate training. Specifically, various scheduling techniques, including learning rate decay, scheduled sampling [109], and curriculum learning with varying weighting factors, are used during training. (1) For video prediction, we use learning rate decay along with scheduled sampling, where scheduled sampling starts if the model does not improve in validation loss for a few epochs. (2) For early activity recognition, we combine learning rate decay with weighting factor decay, where the weighting factor decreases linearly on a plateau (reduced by a fixed step and clipped at zero). (3) We also found gradient clipping essential for higher-order models. We train all models with the Adam optimizer [8]. In initial experiments, we found that our models are unstable at a high learning rate of 1e-3 but learn poorly at a low learning rate of 1e-4. Consequently, we use a learning rate of 1e-3 with gradient clipping (clipping value 1) for all experiments.

Evaluation metrics. We use two traditional metrics, SSIM [110] and MSE (or PSNR), and the recently proposed deep-learning-based metric LPIPS [111], which measures the similarity between features from different layers. Since MSE (or PSNR) is based on pixel-wise differences, it favors vague and blurry predictions and is thus not a proper measure of perceptual similarity. While SSIM was originally proposed to address this problem, [111] shows that their LPIPS metric aligns better with human perception.

Hyper-parameters selection. Table 4.6 summarizes the search values for the hyper-parameters of Conv-TT-LSTM. (1) For the filter size K, we found that models with the larger filter size K = 5 consistently outperform those with K = 3. (2) For the learning rate, we found that our models are unstable at a high learning rate such as 10^-3 but learn poorly at a low learning rate of 10^-4. Consequently, we use gradient clipping with a learning rate of 10^-3 and a clipping value of 1 for all experiments.
(3) While the performance typically increases as the order grows, the model suffers from gradient instability during training with a high order, e.g., N = 5. Therefore, we choose order N = 3 for all Conv-TT-LSTM models. (4)(5) For small ranks C^(i) and numbers of steps M, the performance increases monotonically with C^(i) and M, but it plateaus when they are increased further; we therefore settle on C^(i) = 8 for all i and M = 5 for all experiments.

Filter size K | Learning rate | Order of CTTD N | Ranks of CTTD C^(i) | Time steps M
{3, 5} | {10^-4, 5×10^-4, 10^-3} | {1, 2, 3, 5} | {4, 8, 16} | {1, 3, 5}

Table 4.6: Hyper-parameter search values for the Conv-TT-LSTM experiments.

4.7.4 Datasets

Moving-MNIST-2 dataset. We generate the Moving-MNIST-2 dataset by moving two 28×28 digits from the MNIST dataset within a 64×64 black canvas. The digits are placed at random initial locations, move with constant velocity, and bounce when they reach the boundary. Following [80], we generate 10,000 videos for training, 3,000 for validation, and 5,000 for test with the default parameters of the generator (the Python code for the Moving-MNIST-2 generator is publicly available online [112]).

KTH action dataset. The KTH action dataset [98] contains videos of 25 individuals performing six types of actions against a simple background. Our experimental setup follows [80]: persons 1-16 are used for training and 17-25 for testing, and each frame is resized to 128×128 pixels. We train all models to predict 10 frames given 10 input frames. We randomly select 20 contiguous frames from the training videos as a sample and group every 10,000 samples into one epoch to apply the learning strategy explained at the beginning of this section.

Something-Something V2 dataset. The Something-Something V2 dataset [99] is a benchmark for activity recognition, which can be downloaded online (https://20bn.com/datasets/something-something). Following [81], we use the official subset of 41 categories, which contains 55,111 training videos and 7,518 test videos. The video length ranges between 2 and 6 seconds at 24 frames per second (fps). We reserve 10% of the training videos for validation and use the remaining 90% to optimize the models.

4.7.5 Ablation Studies

With the ablation studies, we show that our proposed Conv-TT-LSTM consistently improves over ConvLSTM regardless of the architecture, loss function, and learning schedule used.

Layers | Sched. | Loss | Model | MSE (10→30) | SSIM (10→30) | LPIPS (10→30) | Params.
4 | SS | ℓ1+ℓ2 | ConvLSTM | 37.19 | 0.791 | 184.2 | 11.48M
4 | SS | ℓ1+ℓ2 | Conv-TT-LSTM (FW) | 31.46 | 0.819 | 112.5 | 5.65M
12 | TF | ℓ1+ℓ2 | ConvLSTM | 33.96 | 0.805 | 184.4 | 3.97M
12 | TF | ℓ1+ℓ2 | Conv-TT-LSTM (FW) | 30.27 | 0.827 | 118.2 | 2.65M
12 | SS | ℓ1 | ConvLSTM | 36.95 | 0.802 | 135.1 | 3.97M
12 | SS | ℓ1 | Conv-TT-LSTM (FW) | 34.84 | 0.807 | 128.4 | 2.65M
12 | SS | ℓ1+ℓ2 | ConvLSTM | 33.08 | 0.806 | 140.1 | 3.97M
12 | SS | ℓ1+ℓ2 | Conv-TT-LSTM (FW) | 28.88 | 0.831 | 104.1 | 2.65M
12 | SS | ℓ1+ℓ2 | Conv-TT-LSTM (SW) | 25.81 | 0.840 | 90.38 | 2.69M

Table 4.7: Evaluation of ConvLSTM and our Conv-TT-LSTM under ablated settings. FW stands for the fixed window approach and SW for the sliding window approach; for the learning schedule, TF denotes teacher forcing and SS denotes scheduled sampling. The experiments show that (1) our Conv-TT-LSTM improves upon ConvLSTM under all settings; (2) our current learning approach is optimal within the search space; (3) the sliding window approach outperforms the fixed window approach under the optimal experimental setting.
Specifically, we perform three ablation studies on our experimental setting by (1) reducing the number of layers from 12 to 4 (as in [78] and [80]); (2) changing the loss function from ℓ1 + ℓ2 to ℓ1 only; and (3) disabling scheduled sampling and using teacher forcing during training. We compare the performance of our proposed Conv-TT-LSTM against the ConvLSTM baseline in these ablated settings in Table 4.7. The results show that our proposed Conv-TT-LSTM consistently outperforms ConvLSTM in all settings, i.e., the improvement is not limited to the specific setting adopted in this chapter. These ablation studies further show that our configuration is optimal for predictive learning on the Moving-MNIST-2 dataset.

4.8 Conclusion

In this chapter, we proposed a fully-convolutional higher-order LSTM model for spatio-temporal data. To make the approach computationally and memory feasible, we proposed a novel convolutional tensor-train decomposition that jointly parameterizes the convolutions and naturally encodes temporal dependencies. The result is a compact model that outperforms prior work on video prediction on the Something-Something V2, Moving-MNIST-2, and KTH action datasets.

Chapter 5: Efficient Learning of Bayesian Quantized Networks

5.1 Overview

A Bayesian approach to deep learning considers the network's parameters as random variables and seeks to infer their posterior distribution given the training data. Models trained this way, called Bayesian neural networks (BNNs) [113], in principle have well-calibrated uncertainties when they make predictions, which is essential in scenarios such as active learning and reinforcement learning [114]. Furthermore, the posterior distribution over the model parameters provides valuable information for compressing a neural network.

There are three main challenges in using BNNs. (1) Intractable posterior: computing and storing the exact posterior distribution over the network weights is intractable due to the complexity and high dimensionality of deep networks. (2) Prediction: a forward pass (a.k.a. probabilistic propagation) in a BNN cannot be performed exactly, since the distribution of the hidden activations at each layer is intractable to compute. (3) Learning: the classic evidence lower bound (ELBO) objective for training BNNs is not amenable to backpropagation, as the ELBO is not an explicit function of the output of probabilistic propagation.

These challenges are typically addressed either by making simplifying assumptions about the distributions of the parameters and activations, or by using sampling-based approaches, which are expensive and unreliable (likely to overestimate the uncertainty in predictions). Our goal is to propose a sampling-free method that uses probabilistic propagation to learn BNNs deterministically.

A seemingly unrelated area of deep learning research is that of quantized neural networks (QNNs), which offer computational and memory efficiency compared to continuous-valued models. QNNs, like BNNs, face challenges in training, though for different reasons: (4.1) the non-differentiable activation function is not amenable to backpropagation; (4.2) gradient updates cease to be meaningful since the model parameters in QNNs are coarsely quantized.
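To make challenge (4.1) concrete, the following minimal sketch (a generic illustration of the straight-through estimator [132], not part of the BQN method proposed in this chapter) shows that the sign activation has zero gradient almost everywhere, and the common workaround of replacing its derivative in the backward pass.

```python
import torch

class SignSTE(torch.autograd.Function):
    """Sign activation with a straight-through estimator (STE) backward pass.

    Forward: quantize to {-1, +1} (the sign(0)=0 edge case is ignored here).
    The true derivative is zero almost everywhere, so naive backpropagation
    gives no learning signal; the STE passes the incoming gradient through,
    clipped to the region |x| <= 1 as is common practice.
    """
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)

x = torch.randn(4, requires_grad=True)
SignSTE.apply(x).sum().backward()   # gradients flow despite the non-differentiable sign
print(x.grad)
```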
In this work, we combine the ideas of BNNs and QNNs in a novel way that addresses the challenges (1)-(4) above in training both types of models. We propose Bayesian quantized networks (BQNs), models that (like QNNs) have quantized parameters and activations, over which they learn (like BNNs) categorical posterior distributions. BQNs have several appealing properties:

- BQNs solve challenge (1) by using categorical distributions for their parameters.
- BQNs can be trained via sampling-free backpropagation and stochastic gradient ascent of a differentiable ELBO, which addresses challenges (2), (3), and (4).
- BQNs leverage efficient tensor operations for probabilistic propagation, further addressing challenge (2). We show the equivalence between probabilistic propagation in BQNs and tensor contractions [33], and introduce a rank-1 CP tensor decomposition (mean-field approximation) that speeds up the forward pass in BQNs.
- BQNs provide a tunable trade-off between computational resources and model complexity: a finer quantization allows for a more complex distribution at the cost of more computation.
- Sampling from BQNs provides an alternative way to obtain deterministic QNNs.

In our experiments, we demonstrate the expressive power of BQNs. We show that BQNs trained with our sampling-free method have much better-calibrated uncertainty than the state-of-the-art Bootstrap ensemble of quantized neural networks (E-QNN) trained as in [115]. More impressively, our trained BQNs achieve log-likelihood comparable to a Gaussian Bayesian neural network (BNN) trained with stochastic gradient variational Bayes (SGVB) [116] (Gaussian BNNs are expected to perform better than BQNs since they allow continuous random variables). We further verify that BQNs can easily be used to compress (Bayesian) neural networks and to obtain deterministic QNNs. Finally, we evaluate the effect of the mean-field approximation in BQNs by comparing against Monte-Carlo realizations, where no approximation is used. We show that our sampling-free probabilistic propagation achieves similar accuracy and log-likelihood, justifying the use of the mean-field approximation in BQNs.

Contributions. In summary, we make the following contributions:

1. We propose an alternative evidence lower bound (ELBO) for Bayesian neural networks such that optimization of the variational objective is compatible with the backpropagation algorithm.
2. We introduce Bayesian quantized networks (BQNs), establish a duality between BQNs and hierarchical tensor networks, and show that prediction in a BQN is equivalent to a series of tensor contractions.
3. We derive a sampling-free approach for both learning and inference in BQNs using probabilistic propagation (analytic inference), achieving better-calibrated uncertainty for the learned models.
4. We develop a set of fast algorithms for efficient learning and prediction in BQNs.

5.2 Bayesian Neural Networks

Notation. We use bold letters such as θ to denote random variables, and non-bold letters such as θ to denote their realizations. We abbreviate Pr[θ = θ] as Pr[θ] and use bold letters in an equation if the equality holds for arbitrary realizations. For example, Pr[x, y] = Pr[y|x] Pr[x] means Pr[x = x, y = y] = Pr[y = y | x = x] Pr[x = x] for all x ∈ X, y ∈ Y.

Problem setting. Given a dataset D = {(x_n, y_n)}_{n=1}^{N} of N data points, we aim to learn a neural network with model parameters θ that predicts the output y ∈ Y based on the input x ∈ X.
(1) We first solve the learning problem to find an approximate posterior distribution Q(θ; φ) over θ with parameters φ such that Q(θ; φ) ≈ Pr[θ|D]. (2) We then solve the prediction problem to compute the predictive distribution Pr[y|x, D] for an arbitrary input x = x given Q(θ; φ). For notational simplicity, we omit the conditioning on D and write Pr[y|x, D] as Pr[y|x] in what follows.

To address the prediction and learning problems in BNNs, we analyze these models in their general form as probabilistic graphical models. Let h^(l), θ^(l), and h^(l+1) denote the inputs, model parameters, and (hidden) outputs of the l-th layer, respectively. We assume (1) that the θ^(l) are layer-wise independent, i.e., Q(θ; φ) = Π_{l=0}^{L−1} Q(θ^(l); φ^(l)), and (2) that the h^(l) satisfy the Markov property, i.e., Pr[h^(l+1) | h^(:l), θ^(:l)] = Pr[h^(l+1) | h^(l), θ^(l)].

5.2.1 The Prediction Problem in BNNs

Computing the predictive distribution Pr[y|x, D] in a BNN requires marginalizing over the random variable θ. The hierarchical structure of BNNs allows this marginalization to be computed sequentially in multiple steps. The predictive distribution Pr[h^(l+1)|x] can be obtained from that of the preceding layer, Pr[h^(l)|x]:

P(h^(l+1); ψ^(l+1)) = Pr[h^(l+1) | x] = ∫_{h^(l), θ^(l)} Pr[h^(l+1) | h^(l), θ^(l)] Q(θ^(l); φ^(l)) P(h^(l); ψ^(l)) dh^(l) dθ^(l),   (5.1)

where P(h^(l); ψ^(l)) denotes the parameterized form of Pr[h^(l) | x]. This iterative process of computing the predictive distributions layer by layer is known as probabilistic propagation [117, 118, 119]. With this approach, we explicitly compute and store each intermediate result Pr[h^(l)|x] in its parameterized form P(h^(l); ψ^(l)) (the conditioning on x is hidden in ψ^(l), i.e., ψ^(l) is a function of x). Therefore, probabilistic propagation is a deterministic process that computes ψ^(l+1) as a function of ψ^(l) and φ^(l), denoted ψ^(l+1) = g^(l)(ψ^(l), φ^(l)).

Challenge in probabilistic propagation. If the hidden variables h^(l) are continuous, Equation (5.1) generally cannot be evaluated in closed form, since it is difficult to find a family of parameterized distributions P for h^(l) such that h^(l+1) remains in P under the operations of a neural network layer. Most existing methods therefore use approximations at each layer of probabilistic propagation. In Section 5.2.4, we show that this issue can be (partly) addressed by treating the h^(l) as discrete random variables, as in a BQN.

5.2.2 The Learning Problem in BNNs

Objective function. A standard approach to finding a good approximation Q(θ; φ) is variational inference, which finds φ* such that the KL-divergence from Q(θ; φ) to Pr[θ|D] is minimized, i.e., φ* = arg min_φ KL(Q(θ; φ) || Pr[θ|D]). This is equivalent to maximizing the negative KL-divergence, generally known as the evidence lower bound (ELBO) L(φ):

max_φ L(φ) = −KL(Q(θ; φ) || Pr[θ|D]) = Σ_{n=1}^{N} L_n(φ) + R(φ),   (5.2)
where L_n(φ) = E_Q[log Pr[y_n | x_n, θ]] and R(φ) = E_Q[log Pr[θ]] + H(Q).

Sampling-free probabilistic backpropagation. Optimization in neural networks relies heavily on gradient-based methods, where the partial derivatives ∂L(φ)/∂φ of the objective L(φ) w.r.t. the parameters φ are obtained by backpropagation. Formally, if the output produced by a neural network is given by a (sub-)differentiable function g(φ), and the objective L(g(φ)) is an explicit function of g(φ) (and not just an explicit function of φ), then the partial derivatives can be computed by the chain rule:

∂L(g(φ))/∂φ = (∂L(g(φ))/∂g(φ)) · (∂g(φ)/∂φ).   (5.3)
The learning problem can then be (approximately) solved by first-order methods, typically stochastic gradient descent/ascent. Notice that (1) for classification, the function g(φ) returns the probabilities after the softmax function, not the categorical label; and (2) an additional regularizer R(φ) on the parameters does not cause difficulty in backpropagation, provided that ∂R(φ)/∂φ is easy to compute.

Challenge in sampling-free probabilistic backpropagation. Learning BNNs is not amenable to standard backpropagation, because the ELBO objective L(φ) in Equation (5.4b) is not an explicit function of the predictive distribution g(φ) in Equation (5.4a) (i.e., L(φ) cannot easily be written as L(g(φ))):

g_n(φ) = E_Q[Pr[y_n | x_n, θ]] = ∫_θ Pr[y_n | x_n, θ] Q(θ; φ) dθ,   (5.4a)
L_n(φ) = E_Q[log Pr[y_n | x_n, θ]] = ∫_θ log(Pr[y_n | x_n, θ]) Q(θ; φ) dθ.   (5.4b)

Although L_n(φ) is a function of φ, it is not an explicit function of g_n(φ). Consequently, the chain rule in Equation (5.3), on which backpropagation is based, is not directly applicable.

5.2.3 Learning in BNNs: An Alternative ELBO

Alternative evidence lower bound. We make learning in BNNs amenable to backpropagation by developing a lower bound L̲_n(φ) ≤ L_n(φ) such that ∂L̲_n(φ)/∂φ can be obtained by the chain rule (i.e., L̲_n(φ) is an explicit function of the results of the forward pass). With L̲_n(φ) in hand, we can (approximately) find φ* by maximizing the alternative objective via gradient-based methods:

φ* = arg max_φ L̲(φ) = arg max_φ ( R(φ) + Σ_{n=1}^{N} L̲_n(φ) ).   (5.5)

Theorem 5.1 provides one such feasible L̲_n(φ), which depends only on the second-to-last output h^(L−1).

Theorem 5.1 (Alternative evidence lower bound). Define each term L̲_n(φ) of L̲(φ) in Equation (5.5) as

L̲_n(φ) := E_{h^(L−1) ∼ P(·; ψ^(L−1)), θ^(L−1) ∼ Q} [ log Pr[y_n | h^(L−1), θ^(L−1)] ];   (5.6)

then L̲_n(φ) is a lower bound of L_n(φ), i.e., L̲_n(φ) ≤ L_n(φ). Equality L̲_n(φ) = L_n(φ) holds if h^(L−1) is deterministic given the input x and all parameters before the last layer, θ^(:L−2).

Analytic form of L̲_n(φ). While the lower bound in Theorem 5.1 applies to BNNs with arbitrary distributions P on the hidden variables h and Q on the model parameters θ, and to any problem setting (e.g., classification or regression), in practice sampling-free probabilistic backpropagation requires that L̲_n(φ) can be analytically evaluated (or further lower bounded) in terms of ψ^(L−1) and φ^(L−1). This task is nontrivial, since it requires a redesign of the output layer, i.e., of the function Pr[y | h^(L−1), θ^(L−1)]. For ease of exposition, we present only the classification case in this section; the regression case can be found in [120]. Since L̲_n(φ) involves the last layer only, we omit the superscripts/subscripts of h^(L−1), θ^(L−1), ψ^(L−1), x_n, y_n and denote them as h, θ, ψ, x, y for simplicity.

Theorem 5.2 (Analytic form of L̲_n(φ) for classification). Let h ∈ R^K (with K the number of classes) be the pre-activations of a softmax layer (a.k.a. logits), and let θ = s ∈ R_+ be a scaling factor that adjusts their scale such that

Pr[y = c | h, s] = exp(h_c / s) / Σ_{k=1}^{K} exp(h_k / s).   (5.7)

Suppose the logits {h_k}_{k=1}^{K} are pairwise independent (which holds under the mean-field approximation), each h_k follows a Gaussian distribution h_k ∼ N(μ_k, σ_k) (so ψ = {μ_k, σ_k}_{k=1}^{K}), and s is a deterministic parameter. Then L̲_n(φ) is further lower bounded as

L̲_n(φ) ≥ μ_c / s − log( Σ_{k=1}^{K} exp( μ_k / s + σ_k / (2 s^2) ) ).   (5.8)

The proof of Theorem 5.2 and its regression counterpart can be found in [120].
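As a sanity check of Equation (5.8), the following snippet (an illustrative sketch, not the released implementation; function and variable names are hypothetical) computes the analytic lower bound from the Gaussian logit means and variances and compares it against a Monte-Carlo estimate of the expected log-softmax, which it should never exceed.

```python
import torch

def classification_bound(mu, var, label, s=1.0):
    """Lower bound of E[log softmax_c(h / s)] from Eq. (5.8),
    for independent Gaussian logits h_k ~ N(mu_k, var_k); var_k is the
    variance (the sigma_k of the chapter)."""
    # Uses the Gaussian moment generating function:
    # E[exp(h_k / s)] = exp(mu_k / s + var_k / (2 s^2)).
    return mu[label] / s - torch.logsumexp(mu / s + var / (2 * s ** 2), dim=0)

# Monte-Carlo check of the bound on random Gaussian logits (s = 1).
torch.manual_seed(0)
K, label = 5, 2
mu, var = torch.randn(K), torch.rand(K)
h = mu + var.sqrt() * torch.randn(100000, K)           # samples of the logits
mc = torch.log_softmax(h, dim=1)[:, label].mean()      # estimate of E[log softmax_c(h)]
print(float(classification_bound(mu, var, label)), "<=", float(mc))
```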
5.2.4 Prediction in BNNs: A Tensor Approach

While Section 5.2.3 provides a general solution to learning in BNNs, it relies on the ability to perform probabilistic propagation efficiently. To address this, we introduce Bayesian quantized networks (BQNs), BNNs in which both the hidden units h^(l) and the model parameters θ^(l) take discrete values, along with a set of novel algorithms for efficient sampling-free probabilistic propagation.

Probabilistic propagation as tensor contractions. For simplicity, we assume that activations and model parameters take values from the same set Q, and denote the degree of quantization by D = |Q| (e.g., Q = {−1, 1}, D = 2).

Lemma 5.3 (Probabilistic propagation in BQNs). After quantization, each step of probabilistic propagation in Equation (5.1) becomes a finite sum instead of an integral:

P(h^(l+1); ψ^(l+1)) = Σ_{h^(l), θ^(l)} Pr[h^(l+1) | h^(l), θ^(l)] Q(θ^(l); φ^(l)) P(h^(l); ψ^(l)),   (5.9)

and a categorically distributed h^(l) results in h^(l+1) being categorical as well. The equation holds without any assumption on the operation Pr[h^(l+1) | h^(l), θ^(l)] of the neural network.

Note that all distributions in Equation (5.9) are represented as high-order tensors: suppose there are I input units, J output units, and K model parameters at the l-th layer; then h^(l) ∈ Q^I, θ^(l) ∈ Q^K, and h^(l+1) ∈ Q^J, and their distributions are characterized by P(h^(l); ψ^(l)) ∈ R^(D^I), Q(θ^(l); φ^(l)) ∈ R^(D^K), P(h^(l+1); ψ^(l+1)) ∈ R^(D^J), and Pr[h^(l+1) | h^(l), θ^(l)] ∈ R^(D^J × D^I × D^K), respectively. Therefore, each step of probabilistic propagation is a tensor contraction of three tensors, which establishes the duality between BQNs and hierarchical tensor networks [121]. Since tensor contractions are differentiable w.r.t. all of their inputs, BQNs circumvent the difficulties in training QNNs [122, 123], whose outputs are not differentiable w.r.t. the discrete parameters. This result is not surprising: if we view learning in QNNs as an integer programming (IP) problem, solving its Bayesian counterpart is equivalent to relaxing the problem into a continuous optimization problem [124].

Complexity of exact propagation. The computational complexity of evaluating Equation (5.9) exactly is exponential in the number of random variables, O(D^(I+J+K)), which is intractable for a quantized neural network of any reasonable size. We thus seek approximations to Equation (5.9).

Approximate propagation via tensor decomposition. We propose a principled approximation that reduces the computational complexity of probabilistic propagation in BQNs using tensor CP decomposition, which factors an intractable high-order probability tensor into tractable lower-order factors [24]. In this chapter, we consider the simplest rank-1 tensor CP decomposition, in which the joint distributions P and Q are fully factorized into products of their marginal distributions; this is equivalent to the mean-field approximation [125]. With a rank-1 CP decomposition of P(h^(l); ψ^(l)) for all l ∈ [L], the tensor contraction in (5.9) reduces to a standard Tucker contraction [33]:

P(h_j^(l+1); ψ_j^(l+1)) ≈ Σ_{h^(l), θ^(l)} Pr[h_j^(l+1) | h^(l), θ^(l)] Π_k Q(θ_k^(l); φ_k^(l)) Π_i P(h_i^(l); ψ_i^(l)),   (5.10)

where each ψ_i^(l) or φ_k^(l) parameterizes a single categorical variable. In our implementation, we store the parameters in log-space, i.e., Q(θ_k^(l) = Q(d)) = exp(φ_k^(l)(d)) / Σ_{q=1}^{D} exp(φ_k^(l)(q)).

Fan-in number and the complexity of approximate propagation.
In a practical model, an output unit h_j^(l+1) of the l-th layer depends (conditionally) only on a subset of the input units {h_i^(l)} and model parameters {θ_k^(l)}, according to the connectivity pattern of the layer. We denote the sets of input units and parameters on which h_j^(l+1) depends as I_j^(l+1) and M_j^(l+1), and define the fan-in number E of the layer as max_j ( |I_j^(l+1)| + |M_j^(l+1)| ). The approximate propagation reduces the computational complexity from O(D^(I+J+K)) to O(J D^E), which is linear in the number of output units J if we assume the fan-in number E to be a constant (i.e., E is not proportional to I).

Fast algorithms for approximate propagation. Different types of network layers have different fan-in numbers E, and for layers with E greater than a small constant, Equation (5.10) is inefficient, since its complexity grows exponentially in E. We therefore devise fast(er) algorithms to further lower the complexity.

- Small fan-in layers: direct tensor contraction. If E is small, we implement the approximate propagation through the tensor contraction in Equation (5.10). The computational complexity is O(J D^E), as discussed previously.
- Medium fan-in layers: discrete Fourier transform. If E is medium, we implement approximate propagation through the fast Fourier transform, since the summation of discrete random variables is equivalent to the convolution of their probability mass functions. With the fast Fourier transform, the computational complexity is reduced to O(J E^2 D log(ED)).
- Large fan-in layers: Lyapunov central limit theorem. In a typical linear layer, the fan-in E is large, and a super-quadratic algorithm based on the fast Fourier transform is still computationally expensive. We therefore adopt a faster algorithm based on the Lyapunov central limit theorem (CLT), which further reduces the computational complexity to O(J E D).

Remarks. Depending on the fan-in number E, we adopt CLT for linear layers with sufficiently large E, such as fully-connected and convolutional layers; DFT for layers with medium E, such as average-pooling and depth-wise layers; and direct tensor contraction for layers with small E, such as shortcut layers and nonlinear layers.

5.3 Related Works and Discussions

Probabilistic neural networks and Bayesian neural networks. These models consider the weights to be random variables and aim to learn their distributions. To distinguish two families of such models, we call a model a Bayesian neural network if the distributions are learned using a prior-posterior framework (i.e., via Bayesian inference) [116, 117, 118, 119, 126, 127], and a probabilistic neural network otherwise [128, 129, 130]. In particular, our work is closely related to natural-parameters networks (NPN) [128], which consider both weights and activations to be random variables from the exponential family. Since the categorical distribution (over quantized values) belongs to the exponential family, our BQN can be interpreted as a categorical NPN, but we learn the distributions via Bayesian inference.

For Bayesian neural networks, various types of approaches have been proposed to learn the posterior distribution over the model parameters.

(1) Sampling-free assumed density filtering (ADF), including EBP [117] and PBP [118], is an online algorithm that (approximately) updates the posterior distribution by Bayes' rule for each observation. If the model parameters θ are Gaussian distributed, [131] shows that the Bayes' rule can be computed in analytic form based on
∂log(g_n(φ))/∂φ, and EBP [117] derives a similar rule for Bernoulli parameters in binary classification. Notice that ADF is compatible with backpropagation:

∂log(g_n(φ))/∂φ = (1 / g_n(φ)) · ∂g_n(φ)/∂φ,   (5.11)

assuming that g_n(φ) can be (approximately) computed by sampling-free probabilistic propagation as in Section 5.2.1. However, this approach has two significant limitations: (a) the Bayes' rule needs to be derived case by case, and analytic rules for most common scenarios are not yet known; (b) it is not compatible with modern optimization methods (such as SGD or Adam), since the optimization is solved analytically for each data point, making it challenging to scale to large models.

(2) Sampling-based variational inference (SVI) formulates an optimization problem and solves it approximately via stochastic gradient descent (SGD). The most popular method is stochastic gradient variational Bayes (SGVB), which approximates L_n(φ) by an average over multiple samples [116, 126, 127]. Before each step of learning or prediction, a number of independent samples of the model parameters {θ_s}_{s=1}^{S} are drawn according to the current estimate of Q, i.e., θ_s ∼ Q, by which the predictive function g_n(φ) and the loss L_n(φ) can be approximated as

g_n(φ) ≈ (1/S) Σ_{s=1}^{S} Pr[y_n | x_n, θ_s] = (1/S) Σ_{s=1}^{S} f_n(θ_s),   (5.12a)
L_n(φ) ≈ (1/S) Σ_{s=1}^{S} log(Pr[y_n | x_n, θ_s]) = (1/S) Σ_{s=1}^{S} log(f_n(θ_s)),   (5.12b)

where f_n(θ) = Pr[y_n | x_n, θ] denotes the predictive function for a specific realization θ of the model parameters. The gradients of L_n(φ) can then be approximated as

∂L_n(φ)/∂φ ≈ (1/S) Σ_{s=1}^{S} (∂L_n(φ)/∂f_n(θ_s)) · (∂f_n(θ_s)/∂θ_s) · (∂θ_s/∂φ).   (5.13)

This approach has multiple drawbacks: (a) repeated sampling suffers from high variance, besides being computationally expensive in both the learning and prediction phases; (b) while g_n(φ) is differentiable w.r.t. φ, f_n(θ) may not be differentiable w.r.t. θ; one such example is quantized neural networks, whose backpropagation is approximated by the straight-through estimator [132]; (c) the partial derivatives ∂θ_s/∂φ are difficult to compute, requiring complicated reparameterization tricks [133, 134].

(3) Deterministic variational inference (DVI). Our approach is most similar to [135], which observes that if Pr[h^(l+1) | h^(l), θ^(l)] is a Dirac function, i.e., the underlying model is deterministic, then the objective reduces to

L̲_n(φ) := E_{h^(L−1) ∼ P, θ^(L−1) ∼ Q} [ log Pr[y_n | h^(L−1), θ^(L−1)] ],   (5.14)

which coincides with our bound in Equation (5.6). Our approach considers a wider scope of problem settings, in which the model may be stochastic, i.e., Pr[h^(l+1) | h^(l), θ^(l)] is an arbitrary function. Furthermore, [135] considers the case where all parameters θ are Gaussian distributed, whose sampling-free probabilistic propagation requires complicated approximations [129].

Quantized neural networks. These models can be categorized into two classes: (1) partially quantized networks, where only the weights are discretized [42, 136]; and (2) fully quantized networks, where both weights and hidden units are quantized [122, 123, 137, 138, 139]. While both classes provide compact, low-precision neural network models, fully quantized networks additionally enjoy fast computation via specialized bit-wise operations. In general, quantized neural networks are difficult to train due to their non-differentiability; gradient descent by backpropagation is approximated by either straight-through estimators [132] or probabilistic methods [140, 141, 142]. Unlike these works, we focus on Bayesian learning of fully quantized networks in this chapter.
The optimization of quantized neural networks typically requires a dedicated loss function, learning schedule, and initialization. For example, [142] considers pre-training a continuous-valued neural network as the initialization. Since our approach learns from scratch (with a uniform initialization), its performance can be inferior to prior works in terms of absolute accuracy.

Tensor networks and tensorial neural networks. Tensor networks (TNs) are widely used in numerical analysis [24], quantum physics [23], and recently machine learning [25, 26] to model interactions among multi-dimensional random objects. Various tensorial neural networks (TNNs) [50, 143] have been proposed that reduce the size of neural networks by replacing the linear layers with TNs. Recently, [121] pointed out the duality between probabilistic graphical models (PGMs) and TNs, i.e., there exists a bijection between PGMs and TNs. We advance this line of thinking by connecting hierarchical Bayesian models (e.g., Bayesian neural networks) and hierarchical TNs.

5.4 Experimental Results

5.4.1 Experimental Setup

In this section, we demonstrate the effectiveness of BQNs on the MNIST, Fashion-MNIST, KMNIST, and CIFAR10 classification datasets. We evaluate our BQNs with both multi-layer perceptron (MLP) and convolutional neural network (CNN) models. During training, each image is augmented by a random shift within 2 pixels (with additional random flipping for CIFAR10); no augmentation is used at test time. In the experiments, we consider a class of quantized neural networks with both binary weights and activations (i.e., Q = {−1, 1}) and sign activations σ(·) = sign(·). For BQNs, the distribution parameters φ are initialized by Xavier's uniform initializer, and all models are trained with the Adam optimizer [8] for 100 epochs (300 epochs for CIFAR10) with batch size 100 and an initial learning rate of 10^-2, which decays by a factor of 0.98 per epoch.

Training objective of BQNs. To allow for a customized level of uncertainty in the learned Bayesian models, we introduce a regularization coefficient λ into the alternative ELBO proposed in Equation (5.5) (i.e., a lower bound of the likelihood), and train BQNs by maximizing the following objective:

L̲(φ) = Σ_{n=1}^{N} L̲_n(φ) + λ R(φ) = λ ( (1/λ) Σ_{n=1}^{N} L̲_n(φ) + R(φ) ),   (5.15)

where λ controls the uncertainty level, i.e., the importance of the prior relative to the training set.

Baselines. (1) We compare our BQN against the baseline Bootstrap ensemble of quantized neural networks (E-QNN). Each member of the ensemble is trained in a non-Bayesian way [115], and the members jointly make predictions by averaging their logits. Note that [115] is chosen over other QNN training methods as the baseline because it trains QNNs from random initialization, making it a fair comparison to our approach. (2) To exhibit the effectiveness of our BQN, we further compare against a continuous-valued Bayesian neural network (abbreviated BNN) with Gaussian parameters. The model is trained with stochastic gradient variational Bayes (SGVB) augmented by the local re-parameterization trick [116]. Since the BNN allows continuous parameters (unlike the BQN with quantized parameters), its predictive error is expected to be lower than that of the BQN.

Evaluation of BQNs. While 0-1 test error is a popular metric for predictive performance, it is too coarse a metric to assess the uncertainty in decision making (for example, it does not account for how badly wrong the incorrect predictions are).
Therefore, we mainly use the negative log-likelihood (NLL) to measure predictive performance in the experiments. Once a BQN is trained (i.e., an approximate posterior Q(θ) is learned), we consider three modes to evaluate the behavior of the model: (1) analytic inference (AI), (2) Monte-Carlo (MC) sampling, and (3) maximum a posteriori (MAP) estimation:

1. In analytic inference (AI, i.e., our proposed method), we analytically integrate over Q(θ) to obtain the predictive distribution, as in the training phase. Since the exact NLL is not accessible with probabilistic propagation (which is why we propose the alternative ELBO in Equation (5.5)), we report an upper bound of the NLL.

2. In MC sampling, we draw S sets of model parameters independently from the posterior, θ_s ∼ Q(θ) for s ∈ [S]; forward propagation is performed as in a (non-Bayesian) quantized neural network for each set θ_s, followed by an average over the model outputs. The difference between analytic inference and MC sampling is used to evaluate (a) the effect of the mean-field approximation and (b) the tightness of our proposed alternative ELBO.

3. MAP estimation is similar to MC sampling, except that there is only one set of model parameters, θ* = arg max_θ Q(θ). We exhibit our model's ability to compress a BNN by comparing the MAP estimate of our BQN with a non-Bayesian QNN.

5.4.2 Analysis of Empirical Results

Methods | NLL (10^-3) MNIST | % Err. MNIST | NLL (10^-3) KMNIST | % Err. KMNIST
E-QNN on MLP | 546.6±157.9 | 3.30±0.65 | 2385.6±432.3 | 17.88±1.86
BQN on MLP | 130.0±3.5 | 2.49±0.08 | 457.7±13.8 | 13.41±0.12
E-QNN on CNN | 425.3±61.8 | 0.85±0.13 | 3755.7±465.1 | 11.49±1.16
BQN on CNN | 41.8±1.6 | 0.85±0.06 | 295.5±1.4 | 9.95±0.15

Methods | NLL (10^-3) Fashion-MNIST | % Err. Fashion-MNIST | NLL (10^-3) CIFAR10 | % Err. CIFAR10
E-QNN on MLP | 2529.4±276.7 | 13.02±0.81 | N/A | N/A
BQN on MLP | 417.3±8.1 | 9.99±0.20 | N/A | N/A
E-QNN on CNN | 1610.7±158.4 | 3.02±0.37 | 7989.7±600.2 | 15.92±0.72
BQN on CNN | 209.5±2.8 | 4.65±0.15 | 530.6±23.0 | 13.74±0.47

Table 5.1: Comparison of the performance of BQNs against the baseline E-QNN. Each E-QNN is an ensemble of 10 networks trained individually that make predictions jointly. We report both NLL (which accounts for prediction uncertainty) and 0-1 test error (which does not). All numbers are averaged over 10 runs with different seeds; the standard deviation follows the ± sign.

Expressive power and uncertainty calibration in BQNs. We report the performance of all evaluations of our BQN models against the E-QNN in Table 5.1. (1) Compared to E-QNNs, our BQNs have significantly lower NLL and smaller predictive error (except on Fashion-MNIST with the CNN architecture). (2) As shown in Figure 5.1, BQNs impressively achieve NLL comparable to the continuous-valued BNN, with slightly higher test error. Since our model parameters only take values in {−1, 1}, a small degradation in predictive accuracy is expected.

Evaluation of the mean-field approximation and tightness of the alternative ELBO. If analytic inference (by probabilistic propagation) were computed exactly, the evaluation metrics would equal those of MC sampling (with infinitely many samples).
Figure 5.1: Comparison of the predictive performance of our BQNs against the E-QNN and the non-quantized BNN trained by SGVB on a CNN. Negative log-likelihood (NLL), which accounts for uncertainty, and 0-1 test error, which does not, are displayed. [Panels: (a)-(d) NLL and (e)-(h) test error on MNIST, Fashion-MNIST, KMNIST, and CIFAR10; horizontal axis: λ, the level of model uncertainty.]

Figure 5.2: Illustration of the mean-field approximation and the tightness of the alternative ELBO on CNNs. The performance gap between our analytic inference and Monte-Carlo sampling is displayed. [Panels: (a)-(c) NLL and (d)-(f) test error on MNIST, Fashion-MNIST, and KMNIST; curves: Monte-Carlo sampling, analytic inference, and their difference; horizontal axis: λ, the level of model uncertainty.]

We can therefore evaluate the approximations used in probabilistic propagation, namely the mean-field approximation in Equation (5.10) and the relaxation of the original ELBO in Equation (5.5), by measuring the gap between analytic inference and MC sampling. As shown in Figure 5.2, these gaps are small in all scenarios, which justifies the approximations we use in BQNs. To further decouple the two factors, the mean-field approximation and the relaxation of the original ELBO, we vary the regularization coefficient λ in the learning objective. (1) For λ = 0 (where the prior term is removed), the models are forced to become deterministic during training. Since deterministic models involve no mean-field approximation in the forward pass, the gap between analytic inference and MC sampling reflects the tightness of our alternative ELBO. (2) As λ increases, the gap increases slightly as well, which shows that the mean-field approximation becomes slightly less accurate as the learned uncertainty of the model grows.

Methods | NLL(10^-3) MNIST | % Err. MNIST | NLL(10^-3) KMNIST | % Err. KMNIST | NLL(10^-3) Fashion-MNIST | % Err. Fashion-MNIST
QNN-MLP | 522.4±42.2 | 4.14±0.25 | 2019.1±281.2 | 19.56±1.97 | 2427.1±193.5 | 15.67±1.19
BQN-MLP | 137.60±4.40 | 3.69±0.09 | 464.60±12.80 | 14.79±0.21 | 461.30±13.40 | 12.89±0.17
QNN-CNN | 497.4±139.5 | 1.08±0.2 | 4734.5±1697.2 | 14.2±2.29 | 1878.3±223.8 | 3.88±0.33
BQN-CNN | 30.3±1.6 | 0.92±0.07 | 293.6±4.4 | 10.82±0.37 | 179.1±4.4 | 5.00±0.11

Table 5.2: Deterministic model compression through direct training of a QNN [115] vs. MAP estimation in our proposed BQN. All numbers are averaged over 10 runs with different seeds; the standard deviation follows the ± sign.

Compression of neural networks via BQNs.
One advantage of BQNs over continuous-valued BNNs is that we can obtain deterministic QNNs for free, since a BQN can be interpreted as an ensemble of infinitely many QNNs (each a realization of the posterior distribution). (1) One simple approach is to set the model parameters to their MAP estimates, which compresses a given BQN to 1/64 of its original size (and uses the same number of bits as a single QNN). (2) MC sampling can be interpreted as another approach to compress a BQN, which reduces the original size to S/64 of it (as an ensemble of S QNNs). In Table 5.2 and Table 5.3, we compare the models obtained by both approaches with their counterparts trained from scratch as in [115] (a single QNN for MAP estimation, and an E-QNN for MC sampling). Our compressed models outperform their counterparts (in NLL) for both approaches. We attribute this to two factors: (a) QNNs are not trained in a Bayesian way, so their uncertainty is not well calibrated; and (b) non-differentiable QNNs are unstable to train. Our compression approaches via BQNs address both problems simultaneously.

Methods    | MNIST: NLL (10⁻³) | MNIST: % Err. | KMNIST: NLL (10⁻³) | KMNIST: % Err. | Fashion-MNIST: NLL (10⁻³) | Fashion-MNIST: % Err.
E-QNN-MLP  | 546.60 ± 157.90 | 3.30 ± 0.65 | 2385.60 ± 432.30 | 17.88 ± 1.86 | 2529.40 ± 276.70 | 13.02 ± 0.81
BQN-MLP    | 108.9 ± 2.6     | 2.73 ± 0.09 | 429.50 ± 11.60   | 13.83 ± 0.12 | 385.30 ± 5.10    | 10.81 ± 0.44
E-QNN-CNN  | 425.3 ± 61.80   | 0.85 ± 0.13 | 3755.70 ± 465.10 | 11.49 ± 1.16 | 1610.70 ± 158.40 | 3.02 ± 0.37
BQN-CNN    | 29.2 ± 0.6      | 0.87 ± 0.04 | 286.3 ± 2.7      | 10.56 ± 0.14 | 174.5 ± 3.6      | 4.82 ± 0.13

Table 5.3: Bayesian model compression through direct training of an E-QNN vs. Monte Carlo sampling from our proposed BQN. Each ensemble consists of 5 quantized neural networks, and for a fair comparison we use 5 samples in the Monte Carlo evaluation. All numbers are averages over 10 runs with different seeds; the standard deviation follows the ± sign.

5.5 Conclusion

We present a sampling-free, backpropagation-compatible, variational approach for learning Bayesian quantized neural networks (BQNs). We develop a suite of algorithms for efficient inference in BQNs so that our approach scales to large models. We evaluate our BQNs by Monte Carlo sampling, which shows that our approach learns a proper posterior distribution over QNNs. Furthermore, we show that our approach can produce (ensembles of) QNNs by taking the maximum a posteriori estimate of, or sampling from, the posterior distribution.

Chapter 6: ARMA Nets: Economical Expansion of Receptive Fields

6.1 Overview

Convolutional layers in neural networks have many successful applications in machine learning tasks. Each output neuron encodes an input region of the network, whose size is measured by the effective receptive field (ERF) [144]. A large ERF that admits sufficient global information is needed to make accurate predictions; however, a simple stack of convolutional layers does not effectively expand the ERF. Convolutional neural networks (CNNs) typically encode global information by adding downsampling (pooling) layers, which coarsely aggregate global information, and a fully-connected classification layer then reduces the entire feature map to an output label. Downsampling and fully-connected layers are suitable for image classification, where only a single prediction is needed. They are less effective, due to the potential loss of information, in dense prediction tasks such as semantic segmentation and video prediction, where every pixel requires a prediction. Therefore, it is crucial to introduce mechanisms that enlarge the ERF without too much information loss.
Naive approaches to expanding the ERF, such as deepening the network or enlarging the filter size, drastically increase the model complexity, which results in expensive computation, difficulty in optimization, and susceptibility to overfitting. Advanced architectures have been proposed to expand the ERF, including encoder-decoder networks [145], dilated convolutional networks [146, 147], and non-local networks [148]. However, encoder-decoder networks can lose high-frequency information through their downsampling layers; dilated convolutional networks can suffer from the gridding effect while their ERF expansion remains limited; and non-local networks are expensive in training and inference.

We introduce a novel autoregressive-moving-average (ARMA) layer that enables an adaptive receptive field through explicit interconnections among its output neurons. Our ARMA layer realizes these interconnections via extra convolutions on the output neurons, on top of the convolutions on the input neurons as in a traditional convolutional layer. We prove that an ARMA network can have an arbitrarily large ERF, thus capturing global information, with minimal extra parameters at each layer. Consequently, an ARMA network can flexibly enlarge its ERF to leverage global knowledge without reducing the spatial resolution. Moreover, ARMA networks are complementary to the architectures above, including encoder-decoder networks, dilated convolutional networks, and non-local networks.

A significant challenge in ARMA networks lies in the complex computations needed in both forward and backward propagation: simple convolution operations are not applicable, since the output neurons are influenced by their neighbors and are thus interrelated. Another challenge in ARMA networks is instability: the additional interconnections among the output neurons could recursively amplify the outputs and drive them to infinity. We address both challenges in this chapter.

Contributions. In summary, our contributions include:

1. We introduce a novel ARMA layer, a plug-and-play module that substitutes for convolutional layers in neural networks, allowing flexible tuning of their ERF to adapt to task requirements and improving performance on dense prediction problems.

2. We recognize and address the problems of computation and instability in ARMA layers: (1) to reduce computational complexity, we develop FFT-based algorithms for both the forward and backward passes; (2) to guarantee stability, we propose a separable ARMA layer and a re-parameterization mechanism that ensures the layer operates in a stable region.

3. We successfully apply ARMA layers in the ConvLSTM network [78] for pixel-level multi-frame video prediction and in the U-Net model [145] for medical image segmentation. ARMA networks substantially outperform the corresponding baselines on both tasks, suggesting that our proposed ARMA layer is a general and useful building block for dense prediction problems.

6.2 Related Works

Dilated convolution [149] enlarges the receptive field by upsampling the filter coefficients with zeros. Unlike the encoder-decoder structure, dilated convolution preserves the spatial resolution and is thus widely used in dense prediction problems, including semantic segmentation [146, 150, 151] and object detection [152, 153]. However, dilated convolution creates gridding artifacts if its input contains frequencies higher than the upsampling rate [147], and the resulting local information inconsistency hampers the performance of dilated convolutional networks [154].
Such artifacts can be alleviated by an additional anti-aliasing layer [147], a group interacting layer [154], or spatial pyramid pooling [155].

Deformable convolution [156, 157, 158] allows the filter shape (i.e., the locations of the incoming pixels) to be learnable. While deformable convolution adjusts the filter shape, our ARMA layer expands the effective filter size adaptively.

Non-local attention networks [148] insert non-local attention blocks between the convolutional layers. A non-local attention block computes a weighted sum of all input neurons for each output neuron, similar to the attention mechanism [4]. In practice, non-local attention blocks are computationally expensive, so they are typically inserted between layers with low-resolution features. In contrast, our ARMA layers are economical and can be used throughout the network.

Encoder-decoder structured networks [145, 150] pair upsampling and downsampling layers to maintain the resolution and introduce a skip-connection between each pair to preserve high-frequency information. Since the shortcut bypasses the downsampling/upsampling layers, the network has a small receptive field for high-frequency components. A potential solution is to augment upsampling with a non-local attention block [159] or an ARMA layer.

Spatial recurrent neural networks [97, 160, 161, 162, 163, 164] are used to expand the receptive field or learn the affinity between neighboring pixels. However, almost all prior works consider nonlinear recurrent neural networks, where the gating mechanism prohibits an efficient parallel algorithm. Quasi-recurrent neural networks [165] partially address this problem by decoupling the linear operations and parallelizing them using convolutions. In contrast, our proposed ARMA layer is a linear recurrent neural network, allowing for efficient evaluation using the FFT.

6.3 ARMA Neural Networks

In this section, we introduce a novel autoregressive-moving-average (ARMA) layer and analyze its ability to expand the effective receptive field (ERF) of neural networks. The analysis is further verified by visualizing the ERF under varying network depth and strength of the autoregressive coefficients. We then address the problems of computation and stability in the ARMA layer.

6.3.1 ARMA Layer

A traditional convolutional layer is essentially a moving-average model [166]:

    Y_{:,:,t} = Σ_{s=1}^{S} W_{:,:,t,s} ∗ X_{:,:,s},   (6.1)

where the moving-average coefficients W ∈ R^{K_m×K_m×T×S} are parameterized by a 4th-order kernel (K_m is the filter size, and S, T are the numbers of input and output channels), the subscript ':' denotes all elements along the specified coordinate, and ∗ denotes the convolution between a channel and a filter.

As motivated in the introduction, we introduce a novel ARMA layer that enables an adaptive receptive field by introducing explicit interconnections among its output neurons, as illustrated in Figure 6.1. Our ARMA layer realizes these interconnections via extra convolutions on the outputs, on top of the convolutions on the inputs as in a traditional convolutional layer. As a result, in an ARMA layer each output neuron can be affected by a faraway input pixel through the interconnections among the output neurons, and thus receives global information.

Figure 6.1: Diagrams of 1D convolutional layers. (a) Traditional convolutional layer. (b) ARMA layer. In (b), the ARMA layer introduces explicit interconnections among output neurons, so each output neuron can receive information from all input neurons.

Formally, we define an ARMA layer in Definition 6.1.
Definition 6.1 (ARMA layer). An ARMA layer is parameterized by a moving-average kernel (coefficients) W ∈ R^{K_m×K_m×S×T} and an autoregressive kernel (coefficients) A ∈ R^{K_a×K_a×T}. It receives an input X ∈ R^{I_1×I_2×S} and returns an output Y ∈ R^{I′_1×I′_2×T} according to an ARMA model:

    A_{:,:,t} ∗ Y_{:,:,t} = Σ_{s=1}^{S} W_{:,:,t,s} ∗ X_{:,:,s}.   (6.2)

Remarks: (1) The ARMA layer maintains the shift-invariance property, since the output interconnections are realized by convolutions. (2) The ARMA layer reduces to a traditional layer if the autoregressive kernel A represents an identity mapping. (3) The ARMA layer is a plug-and-play module that can replace any convolutional layer, adding K_a²T extra parameters, which is negligible compared to the K_w²ST parameters of a traditional convolutional layer. (4) Unlike a traditional layer, computing Equation (6.2) and its backpropagation are nontrivial; they are studied in Section 6.3.3.

Our ARMA layer is complementary to the dilated convolutional layer, the deformable convolutional layer, the non-local attention block, and the encoder-decoder architecture. For instance, a dilated ARMA layer, illustrated in Figure 6.2d, removes the gridding effect caused by dilated convolution: the autoregressive kernel acts as an anti-aliasing filter.

Figure 6.2: Diagrams of 2D convolutional layers. (a) Traditional convolutional layer. (b) ARMA layer. (c) Dilated convolutional layer. (d) Dilated ARMA layer. In (b), each output neuron receives its neighbors' receptive fields. In (d), ARMA's autoregression fills the gaps created by the dilated convolution.

The motivation for introducing the ARMA layer is to enlarge the effective input region of each network output without increasing the filter size or network depth, thus avoiding the difficulties of training wider or deeper models. As illustrated in Figure 6.2, each output neuron in a traditional convolutional layer (Figure 6.2a) only receives information from a small input region (the filter size). An ARMA layer enlarges this small local region to a larger one (Figure 6.2b): it enables an output neuron to receive information from a faraway input neuron through the connections to its neighbors. In Section 6.3.2, we formally introduce the concept of effective receptive field (ERF) to characterize the size of this input region. Moreover, we show in Theorem 6.3 that an ARMA network can have an arbitrarily large ERF with a single extra parameter at each layer.
6.3.2 Effective Receptive Field

The effective receptive field (ERF) measures the contribution of each input pixel to an output neuron [144]. In this subsection, we analyze the ERF of an L-layer network with ARMA layers versus traditional convolutional layers. Formally, consider an output at (i_1, i_2); the impact of the input pixel at (i_1 − p_1, i_2 − p_2) (i.e., L layers and (p_1, p_2) pixels away) is measured by the gradient amplitude g(i_1, i_2, p_1, p_2) = |∂Y^{(L)}_{i_1,i_2,t} / ∂X^{(1)}_{i_1−p_1, i_2−p_2, s}| (where superscripts index the layers), i.e., how much an output neuron changes as an input pixel is perturbed.

Definition 6.2 (Effective receptive field, ERF). Consider an L-layer network with an S-channel input X^{(1)} ∈ R^{I_1×I_2×S} and a T-channel output Y^{(L)} ∈ R^{I_1×I_2×T}. Its effective receptive field is defined as the empirical distribution of the normalized gradient maps:

    ERF(p_1, p_2) = (1 / (I_1 I_2 S T)) Σ_{s,t,i_1,i_2} [ g(i_1, i_2, p_1, p_2) / Σ_{j_1,j_2} g(i_1, i_2, j_1, j_2) ].

To measure the size of the ERF, we define its radius r(ERF) as the standard deviation of this empirical distribution:

    r(ERF)² = Σ_{p_1,p_2} (p_1² + p_2²) ERF(p_1, p_2) − [ Σ_{p_1,p_2} √(p_1² + p_2²) ERF(p_1, p_2) ]².   (6.3)

Notice that the ERF depends simultaneously on the model parameters and on a specified input to the network, i.e., it is both model-dependent and data-dependent. Therefore, it is generally intractable to compute the ERF analytically for a practical neural network. We follow [144] and estimate the radius with a simplified linear network; that paper empirically verifies that such an estimate is accurate and can be used to guide filter designs.

Theorem 6.3 (ERF radius of a linear ARMA network). Consider an L-layer linear network, where the ℓ-th layer computes

    y^{(ℓ)}_i − a^{(ℓ)} y^{(ℓ)}_{i−1} = Σ_{p=0}^{K^{(ℓ)}−1} [(1 − a^{(ℓ)}) / K^{(ℓ)}] y^{(ℓ−1)}_{i − d^{(ℓ)} p}

(i.e., the moving-average coefficients are uniform with length K^{(ℓ)} and dilation d^{(ℓ)}, and the autoregressive coefficients a^{(ℓ)} = {1, −a^{(ℓ)}} have length 2). Suppose 0 ≤ a^{(ℓ)} < 1 for all ℓ ∈ [L]. Then the ERF radius of the network satisfies

    r(ERF)²_ARMA = Σ_{ℓ=1}^{L} [ d^{(ℓ)2} (K^{(ℓ)2} − 1) / 12 + a^{(ℓ)} / (1 − a^{(ℓ)})² ].   (6.4)

We prove Theorem 6.3 in [167]. If the coefficients of all layers are identical, e.g., K^{(ℓ)} = K, d^{(ℓ)} = d, a^{(ℓ)} = a, the radius reduces to r(ERF)_ARMA = √( L [ d²(K² − 1)/12 + a/(1 − a)² ] ). Moreover, if a = 0 and d = 1, the ARMA layers reduce to traditional convolutional layers, and the ERF of the linear CNN has radius r(ERF)_CNN = √( L (K² − 1)/12 ), as shown in [144].

Remarks: (1) Compared with a (dilated) CNN, an ARMA network can have an arbitrarily large ERF with one extra parameter a at each layer. When the autoregressive coefficient a is large (e.g., a > 1 − 1/(dK)), the second term a/(1 − a)² dominates the radius, and the ERF is substantially larger than that of a CNN; in particular, the radius tends to infinity as a approaches 1. (2) An ARMA network can adaptively adjust its ERF through the learnable parameter a. As a gets smaller (e.g., a < 1 − 1/(dK)), the second term becomes comparable to or smaller than the first term, and the effect of the expanded ERF diminishes; in particular, if a = 0, an ARMA network reduces to a CNN.
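Equation (6.4) can be evaluated directly. The short sketch below, using illustrative values of K, d, a, and L (not taken from the experiments), compares the radius of a linear ARMA network with that of the corresponding linear CNN (a = 0, d = 1).

```python
import math

def erf_radius_arma(L, K, d=1, a=0.0):
    """ERF radius of an L-layer linear ARMA network with identical layers
    (Theorem 6.3): r^2 = L * (d^2 (K^2 - 1) / 12 + a / (1 - a)^2)."""
    per_layer = d ** 2 * (K ** 2 - 1) / 12 + a / (1 - a) ** 2
    return math.sqrt(L * per_layer)

def erf_radius_cnn(L, K):
    """Special case a = 0, d = 1: the linear-CNN radius of [144]."""
    return erf_radius_arma(L, K, d=1, a=0.0)

# illustrative values (not from the experiments)
for a in (0.0, 0.6, 0.8, 0.9):
    print(f"a = {a:.1f}: r_ARMA = {erf_radius_arma(L=5, K=3, a=a):.2f}")
print(f"CNN baseline: r_CNN = {erf_radius_cnn(L=5, K=3):.2f}")
```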
Figure 6.3: Visualization of the effective receptive field of linear ARMA networks for network depths L = 1, 3, 5 and magnitudes of the autoregressive coefficients a = 0.0 (CNN), 0.6, 0.8, 0.9.

Visualization of the ERF. Theorem 6.3 shows analytically that the ERF radius increases with the network depth and with the magnitude of the autoregressive coefficients. We now verify this analysis by simulating linear ARMA networks with varying depth and autoregressive-coefficient magnitude. As shown in Figure 6.3, the ERF radius increases as the autoregressive coefficients get larger; when the autoregressive coefficients are zero, an ARMA network reduces to a traditional convolutional network. The simulation results also indicate that ARMA's ability to expand the ERF increases with the network depth. In conclusion, an ARMA network can have a large ERF even when the network is shallow, and its ability to expand the ERF grows as the network gets deeper.

6.3.3 Prediction and Learning of ARMA Layers

In an ARMA layer, each neuron is influenced by its neighbors from all directions (see Figure 6.2b). As a result, no neuron can be evaluated before its neighboring neurons: to compute Equation (6.2), we need to solve a set of linear equations to obtain all output values simultaneously. (1) The standard solver based on Gaussian elimination is too expensive to be practical, so we need a more efficient solution. (2) Furthermore, solvers for linear equations typically do not support automatic differentiation, so we have to derive the backward equations analytically. (3) Finally, we also need an efficient algorithm to evaluate these backpropagation equations. In this section, we address all three problems.

Decomposition of an ARMA layer. We decompose the ARMA layer in Equation (6.2) into a moving-average (MA) layer and an autoregressive (AR) layer:

    MA layer:  T_{:,:,t} = Σ_{s=1}^{S} W_{:,:,t,s} ∗ X_{:,:,s},   (6.5a)
    AR layer:  A_{:,:,t} ∗ Y_{:,:,t} = T_{:,:,t},   (6.5b)

where T ∈ R^{I′_1×I′_2×T} is the intermediate result.

Difficulty in computing the AR layer. While the MA layer in Equation (6.5a) is simply a traditional convolutional layer as defined in Equation (6.1), solving the AR layer in Equation (6.5b) is nontrivial. In principle, the linear equations in the AR layer can be solved by Gaussian elimination in time O((I_1² + I_2²) I_1 I_2 T), which is cubic in the dimensions and too expensive in practice.

Computing the AR layer in the frequency domain. We propose to use frequency-domain division to solve the deconvolution problem in the AR layer [168]. Since convolution in the spatial domain corresponds to an element-wise product in the frequency domain, we first transform A and T into their frequency representations Â and T̂, compute Ŷ (the frequency representation of Y) by element-wise division, and then reconstruct the output Y with an inverse Fourier transform of Ŷ.

Layer | # params        | # FLOPs                    | r(ERF)²
Conv. | K_w² C²         | O(I² K_w² C²)              | O(L K_w²)
ARMA  | K_w² C² + K_a² C | O(K_w² I² C² + I² log(I) C) | O(L K_w² + L a/(1 − a)²)

Table 6.1: Comparison between a traditional convolutional layer and an ARMA layer. An ARMA layer achieves a large gain in ERF radius with a small overhead of extra parameters and FLOPs; with a single extra parameter a (thus K_a = 2), the ERF radius can be arbitrarily large. For simplicity, we assume all heights and widths are equal, I_1 = I_2 = I′_1 = I′_2 = I, and that the input and output channels are identical, S = T = C.

Computational overhead. ARMA trades a small overhead in extra parameters and computation for a large gain in ERF radius, as shown in Table 6.1. With the fast Fourier transform (FFT), the FLOPs required by the extra autoregressive layer are O(log(max(I_1, I_2)) I_1 I_2 T). Importantly, the extra computation introduced by an ARMA layer is smaller than that of a non-local attention block [148], which requires O(I_1² I_2² T) FLOPs.

Backpropagation. Deriving the backpropagation for an ARMA layer is nontrivial: the rule for the MA layer is conventional, but that for the AR layer is not. In Theorem 6.4, we show that the backpropagation of an AR layer can itself be computed with two ARMA models.

Theorem 6.4 (Backpropagation of an AR layer). Given an AR layer A_{:,:,t} ∗ Y_{:,:,t} = T_{:,:,t} and its output gradient ∂L/∂Y, the gradients ∂L/∂A and ∂L/∂T (from which ∂L/∂W and ∂L/∂X follow through the MA layer) can be obtained from two ARMA models:

    A^⊤_{:,:,t} ∗ (∂L/∂A_{:,:,t}) = −Y^⊤_{:,:,t} ∗ (∂L/∂Y_{:,:,t}),   (6.6a)
    A^⊤_{:,:,t} ∗ (∂L/∂T_{:,:,t}) = ∂L/∂Y_{:,:,t},   (6.6b)

where A^⊤_{:,:,t} and Y^⊤_{:,:,t} are the transposed images of A_{:,:,t} and Y_{:,:,t} (e.g., A^⊤_{i_1,i_2,t} = A_{−i_1,−i_2,t}).

We prove Theorem 6.4 in [167]. Since the backpropagation is characterized by ARMA models, it can be evaluated efficiently using the FFT, analogously to Equation (6.5).
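As a concrete single-channel illustration of the frequency-domain solution of the AR layer in Equation (6.5b), the NumPy sketch below solves A ∗ Y = T by element-wise division of 2D FFTs, assuming circular boundary conditions. It is only a sketch of the idea, not the batched multi-channel implementation used in the experiments.

```python
import numpy as np

def ar_layer_solve(a_filter, t):
    """Solve A * y = t for one channel via FFT division (circular convolution).

    a_filter : (Ka, Ka) autoregressive filter, assumed stable (nonvanishing
               frequency response); its tap at offset (0, 0) sits at index (0, 0).
               Taps at negative offsets would be placed with wrap-around (np.roll).
    t        : (I1, I2) intermediate feature map produced by the MA layer.
    """
    i1, i2 = t.shape
    a_full = np.zeros((i1, i2))
    ka1, ka2 = a_filter.shape
    a_full[:ka1, :ka2] = a_filter           # embed the small filter in the grid
    a_hat = np.fft.fft2(a_full)             # frequency response of the AR part
    y_hat = np.fft.fft2(t) / a_hat          # element-wise division
    return np.real(np.fft.ifft2(y_hat))

# sanity check: an identity AR filter (delta) reduces the layer to Y = T
t = np.random.randn(8, 8)
delta = np.zeros((3, 3))
delta[0, 0] = 1.0
assert np.allclose(ar_layer_solve(delta, t), t)
```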
6.3.4 Stability of ARMA Layers

An ARMA model with arbitrary coefficients is not always stable. For example, the model y_i − a y_{i−1} = x_i is unstable if |a| > 1: given an input x with x_0 = 1 and x_i = 0 for all i ≠ 0, the output recursively amplifies itself as y_0 = 1, y_1 = a, ..., y_i = a^i, and diverges to infinity.

Stability constraints for an ARMA layer. The key to the stability of an ARMA layer is therefore to constrain its autoregressive coefficients so that the output cannot repeatedly amplify itself. To derive such constraints, we propose a special design, the separable ARMA layer, inspired by separable filters [168].

Definition 6.5 (Separable ARMA layer). A separable ARMA layer is parameterized by a moving-average kernel W ∈ R^{K_w×K_w×S×T} and T × Q sets of autoregressive filters {(f^{(q)}_{:,t}, g^{(q)}_{:,t})}_{q=1}^{Q}, t ∈ [T]. It takes an input X ∈ R^{I_1×I_2×S} and returns an output Y ∈ R^{I′_1×I′_2×T} via

    (f^{(1)}_{:,t} ∗ ⋯ ∗ f^{(Q)}_{:,t}) ⊗ (g^{(1)}_{:,t} ∗ ⋯ ∗ g^{(Q)}_{:,t}) ∗ Y_{:,:,t} = Σ_{s=1}^{S} W_{:,:,t,s} ∗ X_{:,:,s},   (6.7)

where the filters f^{(q)}_{:,t}, g^{(q)}_{:,t} ∈ R³ and ⊗ denotes the outer product of two 1D filters.

Remarks: Each autoregressive filter A_{:,:,t} is designed to be separable, i.e., A_{:,:,t} = F_{:,t} ⊗ G_{:,t}, so it is characterized by two 1D filters F_{:,t} and G_{:,t}. By the fundamental theorem of algebra [169], any 1D filter can be represented as a composition of length-3 filters. Therefore, F_{:,t} and G_{:,t} can be further factorized as F_{:,t} = f^{(1)}_{:,t} ∗ f^{(2)}_{:,t} ∗ ⋯ ∗ f^{(Q)}_{:,t} and G_{:,t} = g^{(1)}_{:,t} ∗ g^{(2)}_{:,t} ∗ ⋯ ∗ g^{(Q)}_{:,t}. In summary, each A_{:,:,t} is characterized by Q pairs of length-3 autoregressive filters (f^{(q)}_{:,t}, g^{(q)}_{:,t})_{q=1}^{Q}.

Figure 6.4: Illustration of the re-parameterization mechanism. For each channel t, (a) the two-dimensional filter A_{:,:,t} is parameterized as an outer product of two 1D filters F_{:,t} and G_{:,t}; (b) F_{:,t} is parameterized as a convolution f^{(1)}_{:,t} ∗ ⋯ ∗ f^{(Q)}_{:,t}, and similarly G_{:,t} as g^{(1)}_{:,t} ∗ ⋯ ∗ g^{(Q)}_{:,t}; (c) each constrained pair (f^{(q)}_{−1,t}, f^{(q)}_{1,t}) is re-parameterized by unconstrained parameters (α^f_{q,t}, β^f_{q,t}), and similarly (g^{(q)}_{−1,t}, g^{(q)}_{1,t}) by (α^g_{q,t}, β^g_{q,t}); (d) the final parameters for unconstrained optimization are (f^{(q)}_{0,t}, α^f_{q,t}, β^f_{q,t}, g^{(q)}_{0,t}, α^g_{q,t}, β^g_{q,t})_{q=1}^{Q}.

Theorem 6.6 (Constraints for a stable separable ARMA layer). A sufficient condition for the separable ARMA layer (Definition 6.5) to be stable (i.e., for the output to be bounded for any bounded input) is

    |f^{(q)}_{−1,t} + f^{(q)}_{1,t}| < f^{(q)}_{0,t},   |g^{(q)}_{−1,t} + g^{(q)}_{1,t}| < g^{(q)}_{0,t},   ∀q ∈ [Q], t ∈ [T].   (6.8)

We prove Theorem 6.6 in [167] using standard techniques from linear system theory based on the Z-transform.

Ensuring stability via re-parameterization. In principle, the stability constraints of Theorem 6.6 could be enforced through constrained optimization. However, a constrained optimization algorithm, such as projected gradient descent [170], is more expensive because it requires an extra projection step, and it is also harder to make converge. To avoid these challenges, we introduce a re-parameterization mechanism that removes the constraints needed to guarantee stability.
Theorem 6.7 (Re-parameterization). For a separable ARMA layer as in Definition 6.5, re-parameterize each constrained tuple (f^{(q)}_{−1,t}, f^{(q)}_{1,t}, g^{(q)}_{−1,t}, g^{(q)}_{1,t}) by unconstrained learnable parameters (α^f_{q,t}, β^f_{q,t}, α^g_{q,t}, β^g_{q,t}): each pair (f^{(q)}_{−1,t}, f^{(q)}_{1,t}) is obtained by scaling by f^{(q)}_{0,t} and applying a fixed rotation with entries ±√2/2 to (α^f_{q,t}, tanh(β^f_{q,t})), and similarly (g^{(q)}_{−1,t}, g^{(q)}_{1,t}) is obtained from (α^g_{q,t}, tanh(β^g_{q,t})) scaled by g^{(q)}_{0,t} (Equation (6.9), illustrated in Figure 6.4). Then the layer is stable for unconstrained {(f^{(q)}_{0,t}, α^f_{q,t}, β^f_{q,t}, g^{(q)}_{0,t}, α^g_{q,t}, β^g_{q,t})_{q=1}^{Q}}_{t=1}^{T}.

In practice, we set f^{(q)}_{0,t} = g^{(q)}_{0,t} = 1 (since the scale can be learned by the moving-average kernel) and only store and optimize each tuple (α^f_{q,t}, β^f_{q,t}, α^g_{q,t}, β^g_{q,t}). In other words, each autoregressive filter A_{:,:,t} is constructed from (α^f_{q,t}, β^f_{q,t}, α^g_{q,t}, β^g_{q,t})_{q=1}^{Q} on the fly during training or inference.

Figure 6.5: Learning curves (training accuracy versus iterations) with and without the re-parameterization mechanism. The ARMA network with a VGG-11 backbone is trained on the CIFAR-10 dataset.

Experimental demonstration of re-parameterization. To verify that this mechanism is essential for stable training, we train a VGG-11 network [2] on the CIFAR-10 dataset, with all convolutional layers replaced by ARMA layers whose autoregressive coefficients are initialized to zero. We compare the learning curves with and without the re-parameterization in Figure 6.5. With our proposed re-parameterization mechanism, which guarantees the stability of the network, training converges quickly; without it, naive training of the ARMA network never converges and quickly produces NaN errors. The experiment thus verifies that Theorem 6.7 is effective in guaranteeing stability.

6.4 Experimental Results

We apply our ARMA networks to two dense prediction problems, pixel-level video prediction and semantic segmentation, to demonstrate their effectiveness. (1) We incorporate our ARMA layers into U-Nets [145, 171] for semantic segmentation and into the ConvLSTM network [78, 91] for video prediction, and show that the resulting ARMA U-Net and ARMA-LSTM models uniformly outperform the baselines on both tasks. (2) We then interpret the varying performance of ARMA networks on different tasks by visualizing histograms of the learned autoregressive coefficients. We include the detailed setups (datasets, model architectures, training strategies, and evaluation metrics) and visualizations in Section 6.5 for reproducibility.

Semantic segmentation on biomedical images. We evaluate our ARMA U-Net on the lesion segmentation task of the ISIC 2018 challenge [172], comparing against a baseline U-Net [145] and a non-local U-Net [171] (a U-Net with non-local attention blocks).

Model          | params  | ACC           | SE            | SP            | PC            | F1            | JS
U-Net [145]    | 3.453M  | 0.946 ± 0.003 | 0.884 ± 0.019 | 0.977 ± 0.005 | 0.857 ± 0.020 | 0.842 ± 0.009 | 0.754 ± 0.011
NL U-Net [171] | 4.403M  | 0.945 ± 0.003 | 0.877 ± 0.017 | 0.973 ± 0.004 | 0.844 ± 0.014 | 0.831 ± 0.012 | 0.741 ± 0.013
ARMA U-Net     | 3.455M  | 0.955 ± 0.003 | 0.896 ± 0.011 | 0.972 ± 0.005 | 0.873 ± 0.011 | 0.861 ± 0.007 | 0.780 ± 0.009
NL ARMA U-Net  | 4.405M  | 0.960 ± 0.002 | 0.909 ± 0.009 | 0.968 ± 0.004 | 0.870 ± 0.011 | 0.870 ± 0.006 | 0.790 ± 0.008

Table 6.2: Semantic segmentation on the ISIC dataset. For all metrics (ACC, SE, SP, PC, F1, and JS), higher values indicate better performance. The reported numbers are averages over 10 runs with different seeds.
ARMA networks outperform both baselines in almost all metrics. As shown in Table 6.2, our (non-local) ARMA U-Net outperforms both the U-Net and the non-local U-Net on all metrics except specificity (SP). Furthermore, we find that the synergy of non-local attention and ARMA layers achieves the best results overall.

Pixel-level video prediction. We evaluate our ARMA-LSTM network on the Moving-MNIST-2 dataset [173] with different moving velocities, comparing against the baseline Conv-LSTM network [78, 91] and its augmentations with dilated convolutions and non-local attention blocks [148]. As shown in the visualizations in subsection 6.5.2, the dilated ARMA-LSTM does not exhibit the gridding artifacts of the dilated Conv-LSTM; that is, ARMA removes the gridding artifacts.

Model               | MA | AR | dil. | params | original speed: PSNR / SSIM | 2X speed: PSNR / SSIM | 3X speed: PSNR / SSIM
Conv-LSTM (size 3)  | 3  | 1  | 1    | 0.887M | 18.24 / 0.867 | 16.62 / 0.827 | 15.81 / 0.810
Conv-LSTM (size 5)  | 5  | 1  | 1    | 2.462M | 19.58 / 0.901 | 17.61 / 0.856 | 16.99 / 0.841
Dilated Conv-LSTM   | 3  | 2  | 2    | 0.887M | 19.16 / 0.893 | 17.92 / 0.858 | 17.48 / 0.846
Dilated ARMA-LSTM   | 3  | 3  | 2    | 0.893M | 19.72 / 0.904 | 18.05 / 0.870 | 17.65 / 0.855
ARMA-LSTM (size 3)  | 3  | 2  | 1    | 0.893M | 19.72 / 0.899 | 18.73 / 0.881 | 18.13 / 0.869

Table 6.3: Multi-frame video prediction on the Moving-MNIST-2 dataset with three different speeds. MA and AR denote the sizes of the moving-average and autoregressive kernels, respectively, and dil. the dilation of the moving-average kernel. Higher PSNR and SSIM indicate better performance; all results are averaged over 10 predicted frames.

ARMA networks outperform non-local blocks. As shown in Table 6.4, our ARMA-LSTM with kernel size 3 × 3 outperforms the Conv-LSTM networks augmented with non-local blocks. However, the non-local mechanism does not always improve the baselines or our models. When both ARMA-LSTM and Conv-LSTM are combined with non-local blocks, our model achieves better performance than the non-ARMA baselines.

Model      | MA | AR | dil. | Original: PSNR / SSIM | Non-local: PSNR / SSIM
Conv-LSTM  | 3  | 1  | 1    | 18.24 / 0.867 | 19.45 / 0.895
Conv-LSTM  | 5  | 1  | 1    | 19.58 / 0.901 | 19.18 / 0.891
ARMA-LSTM  | 3  | 3  | 1    | 19.72 / 0.899 | 19.62 / 0.897

Table 6.4: Comparison with non-local networks on video prediction. The original networks are the same as in Table 6.3. Each non-local network additionally inserts two non-local blocks into the corresponding base network.

Figure 6.6: Histograms of the autoregressive coefficients in ARMA networks trained for image classification and for video prediction (x-axis: coefficient value; y-axis: percentage).

Interpretation via autoregressive coefficients. Figure 6.6 compares the histograms of the trained autoregressive coefficients between video prediction and image classification to explain why ARMA networks achieve impressive performance in dense prediction (subsection 6.5.4 reports the performance of ARMA networks on image classification with VGG and ResNet baselines).

1. The histograms demonstrate how ARMA networks adaptively learn autoregressive coefficients according to the task. As motivated in the introduction, dense prediction such as video prediction requires each layer to have a large receptive field to capture global information.

2. The large autoregressive coefficients in the video prediction model suggest that the overall ERF is significantly expanded. In the image classification model, global information is already aggregated by pooling (downsampling) layers and a fully-connected classification layer.
Therefore, the ARMA layers automatically learn nearly zero autoregressive coefficients.

6.5 Supplemental Materials for Empirical Studies

In this section, we explain the detailed setups (datasets, model architectures, learning strategies, and evaluation metrics) of all experiments and provide additional visualizations of the results.

6.5.1 Visualization of Effective Receptive Fields

In the simulations of Section 6.3.2, all linear networks have 32 channels and 64 × 64 feature maps at each layer. The filter size for both the moving-average and autoregressive coefficients is set to 3 × 3: each moving-average kernel W is initialized with Xavier's method, while each autoregressive kernel A is initialized randomly within the stable region of Section 6.3.4, with the magnitude of the autoregressive coefficients set by a, for all t ∈ [T] and q ∈ [Q]. We compute each heat map in Figure 6.3 as an average of 32 gradient maps from different channels.

6.5.2 Multi-frame Video Prediction

Datasets and metrics. We generate the Moving-MNIST-2 dataset by moving two digits of size 28 × 28, drawn from the MNIST dataset, within a 64 × 64 black canvas [173]. Each digit starts from a random initial location, moves with constant velocity inside the canvas, and bounces independently when it reaches the boundary. In addition to the default velocity of the public generator [173], we increase the velocity to 2× and 3× to test all models on videos with stronger motion. For each velocity, we generate 10,000 videos for the training set, 3,000 for the validation set, and 5,000 for the test set, each containing 20 frames. We train all models to predict the next 10 frames given 10 input frames, and we evaluate their performance with the mean squared error (MSE), peak signal-to-noise ratio (PSNR), and structural similarity (SSIM) [110].
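For reference, a simplified generator following the description above might look as follows. This is only a sketch: it assumes a pre-loaded array `mnist_digits` of 28 × 28 images, uses an illustrative velocity range, and omits the 2× / 3× speed variants and the train/validation/test splitting of the public generator [173].

```python
import numpy as np

def make_moving_mnist_video(mnist_digits, rng, num_frames=20, canvas=64, digit=28):
    """Render one Moving-MNIST-2 video: two digits moving with constant
    velocity that bounce independently off the canvas boundary."""
    video = np.zeros((num_frames, canvas, canvas), dtype=np.float32)
    limit = canvas - digit
    for _ in range(2):                                   # two digits per video
        img = mnist_digits[rng.integers(len(mnist_digits))]
        pos = rng.uniform(0, limit, size=2)              # random start location
        vel = rng.uniform(-3, 3, size=2)                 # constant velocity (illustrative)
        for t in range(num_frames):
            pos += vel
            for k in range(2):                           # bounce at the boundary
                if pos[k] < 0 or pos[k] > limit:
                    vel[k] = -vel[k]
                    pos[k] = np.clip(pos[k], 0, limit)
            r, c = int(pos[0]), int(pos[1])
            video[t, r:r + digit, c:c + digit] = np.maximum(
                video[t, r:r + digit, c:c + digit], img)
    return video
```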
Figure 6.7: Backbone architectures for video prediction. (a) Conv-LSTM. (b) Non-local Conv-LSTM.

Model architectures. (1) Baselines. The backbone architecture consists of a stack of 12 Conv-LSTM modules, each containing 32 units (channels). Following [91], two skip connections that perform channel concatenation are added between modules (3, 9) and (6, 12). An additional traditional convolutional layer is applied on top of all recurrent layers to compute the predicted frames. The backbone architecture is illustrated in Figure 6.7a. In the baseline networks, we consider three different convolutions at each layer: (a) traditional convolution with filter size 3 × 3; (b) traditional convolution with filter size 5 × 5; and (c) 2-dilated convolution with filter size 3 × 3. (2) ARMA networks. Our ARMA networks use the same backbone architecture as the baselines and replace their convolutional layers with ARMA layers. For all ARMA models, we set the filter size of both the moving-average and autoregressive parts to 3 × 3. In the ARMA networks, we consider two different convolutions at each layer: (a) the moving-average part is a traditional convolution; (b) we further consider a 2-dilated convolution in the moving-average part. (3) Non-local networks. In the non-local networks, we additionally insert two non-local blocks into the backbone architecture, as illustrated in Figure 6.7b. In each non-local block, we use the embedded Gaussian as the non-local operation [148], and we replace batch normalization with instance normalization, which is compatible with recurrent neural networks. In the non-local networks, we consider three types of convolutions at each layer: (a)(b) traditional convolutions with filter sizes 3 × 3 and 5 × 5; (c) an ARMA layer with 3 × 3 moving-average and autoregressive filters.

Visualization of the predictions. We visualize the predictions of the different models under the three moving velocities in Figures 6.8, 6.9, and 6.10. Notice that autoregression removes the gridding artifacts of dilated convolutions: since each neuron receives information from all pixels in a local region (Figure 6.2d), adjacent neurons no longer receive information from disjoint sets of pixels. Moreover, for videos with higher speed the advantage of our ARMA layer is more pronounced, as expected from ARMA's ability to expand the ERF.

Figure 6.8: Prediction on Moving-MNIST-2 (original speed) by traditional (K_w = 3 and K_w = 5), 2-dilated (K_w = 3), 2-dilated ARMA (K_w = 3, K_a = 3), and ARMA (K_w = 3, K_a = 3) models. The first row contains the last 3 input frames and the 10 ground-truth frames for the models to predict.

Figure 6.9: Prediction on Moving-MNIST-2 (2× speed). The first row contains the last 3 input frames and the 10 ground-truth frames for the models to predict.

Figure 6.10: Prediction on Moving-MNIST-2 (3× speed). The first row contains the last 3 input frames and the 10 ground-truth frames for the models to predict.

Figure 6.11: Per-frame performance (MSE, PSNR, SSIM over 10 time steps) of our ARMA and dilated ARMA networks versus the Conv-LSTM and dilated Conv-LSTM baselines on Moving-MNIST-2 (original speed). Lower MSE (in 10⁻³) and higher PSNR/SSIM indicate better performance.

Figure 6.12: Per-frame performance comparison of our ARMA and dilated ARMA networks versus the Conv-LSTM and dilated Conv-LSTM baselines on Moving-MNIST-2 (2× speed). Lower MSE (in 10⁻³) and higher PSNR/SSIM indicate better performance.

Figure 6.13: Per-frame performance comparison of our ARMA and dilated ARMA networks versus the Conv-LSTM and dilated Conv-LSTM baselines on Moving-MNIST-2 (3× speed). Lower MSE (in 10⁻³) and higher PSNR/SSIM indicate better performance.
6.5.3 Medical Image Segmentation

Dataset and metrics. For all experiments, we use the dataset from ISIC 2018: Skin Lesion Analysis Towards Melanoma Detection [174], available at https://challenge2018.isic-archive.com/task1/training/. In this task, a model predicts a binary mask indicating the location of the primary skin lesion in each input image. The dataset consists of 2,594 images, and we resize each image to 224 × 224. We split the dataset into training, validation, and test sets with ratios 0.7, 0.1, and 0.2, respectively. All models are evaluated with the following metrics: AC = (TP + TN)/(TP + TN + FP + FN), SE = TP/(TP + FN), SP = TN/(TN + FP), PC = TP/(TP + FP), F1 = 2·PC·SE/(PC + SE), and JS = |GT ∩ SR|/|GT ∪ SR|, where TP stands for true positives, TN for true negatives, FP for false positives, FN for false negatives, GT for the ground-truth mask, and SR for the predicted mask.
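These metrics are simple functions of the confusion-matrix counts. A direct implementation for a pair of binary masks could look like the following sketch (a hypothetical helper, not the evaluation code used for Table 6.2).

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Compute AC, SE, SP, PC, F1, and JS for binary masks pred (SR) and gt (GT)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    tn = np.sum(~pred & ~gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    ac = (tp + tn) / (tp + tn + fp + fn)
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    pc = tp / (tp + fp)
    f1 = 2 * pc * se / (pc + se)
    js = np.sum(pred & gt) / np.sum(pred | gt)   # |GT intersect SR| / |GT union SR|
    return {"AC": ac, "SE": se, "SP": sp, "PC": pc, "F1": f1, "JS": js}
```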
Figure 6.14: Backbone architectures for semantic segmentation. (a) U-Net architecture. (b) Non-local U-Net architecture.

Model architectures. (1) Baselines. We use U-Net [145] and non-local U-Net [171] as baseline models. U-Net has a contracting path to capture context and a symmetric expanding path that enables precise localization; the architecture is illustrated in Figure 6.14a. Non-local U-Net additionally contains global aggregation blocks based on the self-attention operator, which aggregate global information without a deep encoder for biomedical image segmentation, as illustrated in Figure 6.14b. (2) Our architectures. We replace all traditional convolutional layers with ARMA layers in both U-Net and non-local U-Net.

Training strategy. All models are trained with the Adam optimizer [8] and the binary cross-entropy (BCE) loss. For the initial learning rate, we search from 10⁻² to 10⁻⁵ and choose 10⁻³ for U-Net and 10⁻² for non-local U-Net. The learning rate is decayed by 0.98 every epoch. During training, each image is randomly augmented by rotation, cropping, shifting, color jitter, and normalization, following the public source code at https://github.com/LeeJunHyun/Image_Segmentation.

Figure 6.15: Predictive results of U-Nets with and without ARMA layers (columns: input image, ground truth, U-Net with/without ARMA, non-local U-Net with/without ARMA).

6.5.4 Image Classification

Model architectures and datasets. We replace the traditional convolutional layers with ARMA layers in three benchmark architectures for image classification: AlexNet [1], VGG-11 [2], and ResNet-18 [5, 15]. We apply our proposed ARMA networks to the CIFAR10 and CIFAR100 datasets. Both datasets have 50,000 training examples and 10,000 test examples; we use 5,000 examples from the training set for validation, leaving 45,000 examples for training.

Training strategy. We train all models with the cross-entropy loss and the SGD optimizer, with batch size 128, learning rate 0.1, weight decay 0.0005, and momentum 0.9. For CIFAR10, the models are trained for 300 epochs and we halve the learning rate every 30 epochs. For CIFAR100, the models are trained for 200 epochs and we divide the learning rate by 5 at the 60th, 120th, and 160th epochs.

Results. The experimental results are summarized in Table 6.5. ARMA models achieve comparable or slightly better results than the benchmark architectures. Replacing the traditional convolutional layer with our proposed ARMA layer slightly boosts VGG-11 and ResNet-18 by 0.01%-0.1% in accuracy. Since image classification tasks do not require convolutional layers to have large receptive fields, the learned autoregressive coefficients concentrate around 0, as shown in Figure 6.6. Consequently, ARMA networks effectively reduce to traditional convolutional neural networks and therefore achieve comparable results.

Dataset  | AlexNet: Conv. | AlexNet: ARMA | VGG-11: Conv. | VGG-11: ARMA | ResNet-18: Conv. | ResNet-18: ARMA
CIFAR10  | 86.30 ± 0.29 | 85.67 ± 0.19 | 91.57 ± 0.59 | 91.57 ± 0.73 | 95.01 ± 0.15 | 95.07 ± 0.13
CIFAR100 | 58.99 ± 0.37 | 57.43 ± 0.24 | 68.25 ± 0.11 | 68.36 ± 1.67 | 73.71 ± 0.23 | 73.72 ± 0.52

Table 6.5: Image classification on CIFAR10 and CIFAR100. We report accuracy (%) and standard deviations over 10 runs with different seeds.

6.6 Conclusion

In this chapter, we propose a novel ARMA layer capable of adaptively expanding a network's effective receptive field. Compared to a traditional convolutional layer, an ARMA layer has additional interconnections among output neurons realized by convolutions. We address the computation and instability problems in ARMA layers, thus allowing ARMA layers to replace traditional convolutional layers at minimal extra cost. We show empirically that ARMA networks consistently improve performance on dense prediction tasks.

Our model is closely related to techniques in signal processing and machine learning. First, an ARMA layer is equivalent to an IIR filter bank in signal processing [175], which opens the opportunity to improve ARMA layers using filter-bank theory. Alternatively, we can interpret the autoregressive (AR) part as a learnable spectral normalization [176] applied after the moving-average (MA) part; following this relationship, we can further investigate how to better learn the AR part according to the spectrum of the MA part. Finally, the ARMA layer is a linear recurrent neural network whose recurrent propagation is over the spatial domain (Section 6.2); an interesting future direction is therefore to use ARMA layers to accelerate the computation in RNNs.

Chapter 7: Scaling-up Diverse Orthogonal Convolutional Networks by a Paraunitary Framework

7.1 Overview

Convolutional neural networks, whose deployment has witnessed extensive empirical success, still exhibit limitations that are not thoroughly studied. First, deep convolutional networks are in general difficult to train, and their high performance relies heavily on techniques that are not fully understood, such as skip-connections [5], batch normalization [7], and specialized initialization [108]. Second, they are sensitive to imperceptible perturbations, including adversarial attacks [10] and geometric transformations [13]. Finally, a precise characterization of their generalizability is still under active investigation [177, 178]. Orthogonal networks, which have a "flat" spectrum with all singular values of each linear layer equal to 1 (so the output norm ‖y‖ equals the input norm ‖x‖ for every x), alleviate all of the problems above.
As shown in recent works, by enforcing orthogonality in neural networks we obtain (1) easier optimization [179, 180]: since each orthogonal layer preserves the gradient norm during backpropagation, an orthogonal network is free from gradient vanishing/exploding problems; (2) robustness against perturbations [181, 182, 183]: since each orthogonal layer is 1-Lipschitz, an orthogonal network will not amplify an input perturbation enough to flip the output prediction; and (3) better generalizability [178]: a network's generalization error is positively related to the standard deviation of each linear layer's singular values, so encouraging orthogonality lowers the network's generalization error.
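Property (1) is easy to see numerically for a single orthogonal matrix: both the forward pass y = Wx and the backward pass ∂L/∂x = W^⊤ ∂L/∂y preserve the Euclidean norm, so stacking such layers neither shrinks nor amplifies gradients, and 1-Lipschitzness follows from ‖W(x₁ − x₂)‖ = ‖x₁ − x₂‖. A minimal NumPy check, with a randomly generated orthogonal matrix unrelated to the parameterizations discussed later:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
W, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal matrix

x = rng.standard_normal(n)
grad_y = rng.standard_normal(n)                    # incoming gradient dL/dy

y = W @ x                                          # forward pass
grad_x = W.T @ grad_y                              # backward pass

assert np.isclose(np.linalg.norm(y), np.linalg.norm(x))          # norm preserved
assert np.isclose(np.linalg.norm(grad_x), np.linalg.norm(grad_y))  # gradient norm preserved
```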
Despite these benefits, enforcing orthogonality in convolutional networks is challenging. To avoid a strict constraint, orthogonal initialization (dynamical isometry) [184, 185, 186] and orthogonal regularization [180, 187] are often adopted for the gradient vanishing/exploding problems. However, since they do not enforce strict orthogonality (and Lipschitzness), these methods are unsuitable for applications that require strict Lipschitzness, such as adversarial robustness [181] and residual flows [188].

Our goal is to enforce exact orthogonality in state-of-the-art deep convolutional networks without expensive overhead. We identify three main challenges. Challenge I: achieving exact orthogonality throughout training. Prior works such as orthogonal regularization [187] and reshaped kernel orthogonality [189, 190], while enjoying algorithmic simplicity, fail to meet the requirement of exact orthogonality. Note also that enforcing the constraint during training is necessary, since post-training orthogonalization can substantially alter the weight values and jeopardize performance. Challenge II: avoiding expensive computations. An efficient algorithm is crucial for scalability to large networks. Existing work based on projected gradient descent [191], however, requires an expensive projection after each update. For instance, the projection step in [191] computes an SVD and flattens the spectrum to enforce orthogonality, which costs O(size(feature) × channels³) for a convolutional layer. Challenge III: scaling up to state-of-the-art deep convolutional networks. Many variants of the standard convolutional layer, including dilated, strided, and group convolutions, are essential for state-of-the-art networks; however, none of the existing methods proposes mechanisms to orthogonalize these variants. This lack of techniques limits the broad application of orthogonal convolutional layers to state-of-the-art networks.

We resolve Challenges I, II, and III by proposing a parameterization of orthogonal convolutions. First, using the convolution theorem [192] (spatial convolution is equivalent to a spectral product), we reduce the problem of designing orthogonal convolutions to constructing unitary matrices at all frequencies, i.e., paraunitary systems [175]. Further using a complete factorization theorem for paraunitary systems, we obtain a spatial-domain parameterization of all orthogonal 1D-convolutions and separable 2D-convolutions (no existing work achieves a complete parameterization of all orthogonal 2D-convolutions), which attains high expressiveness, computational/memory efficiency, and exact orthogonality. Since no previous approach achieved exact orthogonality, we are the first to show how essential exact orthogonality is in different types of networks: we observe that exact orthogonality in deeper architectures (with more than 10 layers) is beneficial for robust performance.

Furthermore, we unify orthogonal convolution variants (dilated, strided, and group convolutions) as paraunitary systems, allowing us to orthogonalize these variants with the same parameterization used for standard convolutions. (No previous work presents mechanisms for learning orthogonal versions of these variants.) Since these variants are crucial in advanced architectures, our work makes it possible to generate the orthogonal counterparts of these architectures and to investigate their performance, which was not possible before. Combined with our study of skip-connections and initialization, we scale orthogonal networks to deep architectures, including ResNet and ShuffleNet, substantially outperforming their shallower counterparts (Section 7.6). Finally, we show how to deploy our orthogonal networks in Residual Flows [188], a flow-based generative model that requires strict Lipschitzness (Section 7.7.3).

Contributions. In summary, we have made the following contributions:

1. We establish the equivalence between orthogonal convolutions in the spatial domain and paraunitary systems in the spectral domain. Consequently, we interpret and unify all existing approaches as implicit designs of paraunitary systems.

2. Based on a factorization theorem for paraunitary systems, we propose a complete parameterization of orthogonal 1D-convolutions and separable 2D-convolutions, which attains high expressiveness, exact orthogonality (to machine epsilon), and computational/memory efficiency (less than 50% of the memory of previous methods). These features are crucial for learning deep orthogonal networks with state-of-the-art performance.

3. We prove that orthogonality for various convolutional layers (strided, dilated, group) is also entirely characterized by paraunitary systems. Consequently, our parameterization easily extends to these variants, ensuring their exact orthogonality, completeness, and efficiency.

4. We study the design considerations (choices of skip-connection, initialization, depth, width, kernel size) for orthogonal networks and show that orthogonal networks can scale to deep architectures (e.g., ResNet, ShuffleNet).

7.2 Related Works

Dynamical isometry [184, 185, 186, 193] aims to address the gradient vanishing/exploding problems of deep vanilla networks with orthogonal initialization. These works focus on understanding the interplay between initialization methods and various nonlinear activations. However, they do not guarantee orthogonality (and Lipschitzness) after training and are thus unsuitable for applications that require strict Lipschitz bounds, such as adversarial robustness [181, 182] and residual flows [188, 194].

Learning orthogonality has three typical families of methods: regularization, parameterization (i.e., mapping unconstrained parameters to the feasible set with a surjective function), and projected gradient descent (PGD) / Riemannian gradient descent (RGD). While the regularization approach is approximate, the latter two learn exact orthogonality. Among these approaches, PGD/RGD requires modification of the optimizer, whereas the other two are compatible with standard optimizers. (1) For orthogonal matrices, various regularization methods are proposed in [195] and [196].
Alternatively, numerous parameterization methods exist, including Householder reflections [197], Givens rotations [198], the Cayley transform [199], the matrix exponential [200], and algorithmic unrolling [181, 201]. Lastly, [189] proposes PGD via singular value clipping, and [202, 203] consider RGD. (2) For orthogonal convolutions, some existing works learn orthogonality for the flattened matrix [189, 190, 196] or for each output channel [204]. However, these methods do not lead to orthogonality (norm preservation) of the operation as a whole. [191] proposes PGD via singular value clipping and masking; however, singular value decomposition is expensive, and masking can lead to only approximate orthogonality. To the best of our knowledge, there is no exact PGD or RGD for orthogonal convolutions. Alternatively, recent works adopt parameterizations using block convolutions [182], the Cayley transform [183], or the convolution exponential [205].

Paraunitary systems are extensively studied in filter banks and wavelets [175, 206, 207]. Classic theory shows that 1D-paraunitary systems are completely characterized by a spectral factorization (see Chapter 14 of [175] or Chapter 5 of [206]), but not all multi-dimensional (MD) paraunitary systems admit a factorized form (see Chapter 8 of [207]). While the complete characterization of MD-paraunitary systems is known in theory (and requires solving a system of nonlinear equations) [208, 209], most practical constructions use separable paraunitary systems [207] or special classes of non-separable paraunitary systems [210]. The equivalence between orthogonal convolutions and paraunitary systems thus opens the opportunity to apply these classic theories to the design of orthogonal convolutions.

7.3 Designing Orthogonal Convolutions via Paraunitary Systems

Designing an orthogonal convolutional layer {h_{t,s} : y_t = Σ_{s=1}^{S} h_{t,s} ∗ x_s}_{t=1,s=1}^{T,S} (where s, t index the input/output channels) in the spatial domain is challenging. In terms of the matrix-vector product, a convolutional layer has a block-circulant weight matrix, with its (t, s)-th block Cir(h_{t,s}) ∈ R^{N×N} given by

    Cir(h_{t,s}) = [ h_{t,s}[1]   h_{t,s}[N]   ⋯   h_{t,s}[2]
                     h_{t,s}[2]   h_{t,s}[1]   h_{t,s}[N]   ⋯
                     ⋮            ⋱            ⋱            ⋮
                     h_{t,s}[N]   ⋯   h_{t,s}[2]   h_{t,s}[1] ].   (7.1)

Therefore, the layer is orthogonal if the block-circulant matrix [Cir(h_{t,s})]_{t=1,s=1}^{T,S} is orthogonal. However, it is not obvious how to enforce orthogonality in a block-circulant matrix.

7.3.1 Achieving Orthogonal Convolutions by Paraunitary Systems

We propose a novel design of orthogonal convolutions from a spectral perspective, motivated by the convolution theorem (Theorem 7.1). For simplicity, we group the entries at the same location into a vector/matrix, e.g., we denote {x_s[n]}_{s=1}^{S} by x[n] ∈ R^S and {h_{t,s}[n]}_{t=1,s=1}^{T,S} by h[n] ∈ R^{T×S}.

Theorem 7.1 (Convolution theorem [192]). A convolutional layer h: y[i] = Σ_n h[n] x[i − n] in the spatial domain is equivalent to matrix-vector products in the spectral domain, i.e., Y(z) = H(z) X(z), ∀z ∈ C. Here X(z) = Σ_{n=0}^{N−1} x[n] z^{−n}, Y(z) = Σ_{n=0}^{N−1} y[n] z^{−n}, and H(z) = Σ_{n=−L}^{L} h[n] z^{−n} denote the Z-transforms of the input, output, and kernel, respectively, where N is the length of x and y, and [−L, L] is the span of the filter h.

The convolution theorem states that a convolutional layer is a matrix-vector product in the spectral domain. If the transfer matrix H(z) is unitary at z = e^{jω} for all frequencies ω ∈ [0, 2π) (j is the imaginary unit), the layer h is orthogonal.
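For intuition, the single-channel sketch below connects Equation (7.1) with Theorem 7.1: multiplying by Cir(h) performs circular convolution, which the DFT diagonalizes, so the matrix-vector form and the spectral product agree. (Indexing is 0-based here, whereas Equation (7.1) is written with 1-based indices.)

```python
import numpy as np

def circulant(h):
    """N x N circulant matrix of a length-N tap sequence h (cf. Equation (7.1))."""
    N = len(h)
    C = np.zeros((N, N))
    for j in range(N):
        C[:, j] = np.roll(h, j)           # column j is h shifted down by j
    return C

rng = np.random.default_rng(0)
N = 8
h = np.zeros(N)
h[:3] = rng.standard_normal(3)            # a short filter, zero-padded to length N
x = rng.standard_normal(N)

y_matrix = circulant(h) @ x                                    # matrix-vector form
y_spectral = np.real(np.fft.ifft(np.fft.fft(h) * np.fft.fft(x)))  # spectral product
assert np.allclose(y_matrix, y_spectral)
```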
As our major novelty, we design orthogonal convolutions by constructing a transfer matrix H(e^{jω}) that is unitary at all frequencies ω ∈ [0, 2π), known as a paraunitary system [175]. The paraunitary theorem (Theorem 7.2) shows that a convolutional layer is orthogonal in the spatial domain if and only if it is paraunitary in the spectral domain.
Model design in the spatial domain. Following Equation (7.3), we obtain a complete design of orthogonal 1D-convolutions: using learnable (column-)orthogonal matrices ({U^{(ℓ)}}_{ℓ=−L}^{−1}, Q, {U^{(ℓ)}}_{ℓ=1}^{L}), we parameterize a size-(L + L + 1) convolution as cascaded convolutions of the following filters in the spatial domain:

    \bigl\{[\, I − U^{(ℓ)} U^{(ℓ)\top},\; U^{(ℓ)} U^{(ℓ)\top} \,]\bigr\}_{ℓ=−L}^{−1}, \;\; Q, \;\; \bigl\{[\, U^{(ℓ)} U^{(ℓ)\top},\; I − U^{(ℓ)} U^{(ℓ)\top} \,]\bigr\}_{ℓ=1}^{L}.    (7.4)

Figure 7.1 visualizes our design of orthogonal convolutional layers; each block denotes a convolution, and its filter coefficients are displayed in the block. In practice, we compose all (L + L + 1) filters into a single filter for the orthogonal convolution, which not only increases computational parallelism but also avoids storing intermediate outputs between filters.

Figure 7.1: Complete design of a 1D orthogonal convolution as a cascade of convolutions. The filter coefficients are depicted in each block, where Q is orthogonal and the U^{(ℓ)}'s are column-orthogonal.

With a complete factorization of paraunitary systems, we reduce the problem of designing orthogonal convolutions to that of designing orthogonal matrices.

Parameterization for orthogonal matrices. There are various parameterizations of orthogonal matrices, including Björck orthogonalization [181, 182], the Cayley transform [199, 212], and the exponential map [200]. We follow the GeoTorch library (https://github.com/Lezcano/geotorch), which adopts a modified version of the exponential map due to its efficiency, exactness, and completeness. The exponential map is a surjective mapping from a skew-symmetric matrix A to a special orthogonal matrix U (i.e., det(U) = 1) with U = exp(A) = I + A + A²/2 + ⋯, which is computed up to machine precision [213]. To parameterize all orthogonal matrices, GeoTorch introduces an orthogonal matrix V in U = V exp(A), where V is (randomly) generated at initialization and fixed during training. Finally, observe that the strictly upper-triangular entries uniquely determine a skew-symmetric matrix. We thus have an end-to-end pipeline, shown in Figure 7.2, which parameterizes orthogonal convolutions by unconstrained upper-triangular entries of skew-symmetric matrices.

Figure 7.2: SC-Fac: a pipeline for designing orthogonal convolutional layers. (1) An orthogonal convolution h[n] is equivalent to a paraunitary system H(z) in the spectral domain (Theorem 7.1). (2) The paraunitary system H(z) is a product of factors characterized by (column-)orthogonal matrices ({U^{(ℓ)}}_{ℓ=−L}^{−1}, Q, {U^{(ℓ)}}_{ℓ=1}^{L}) (Equation (7.3), Theorem 7.2). (3) These orthogonal matrices are parameterized by skew-symmetric matrices A, {A^{(ℓ)}} using the exponential map, e.g., Q = exp(A).
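The following sketch illustrates the last stage of this pipeline: parameterizing an orthogonal matrix U = V exp(A) by the unconstrained strictly upper-triangular entries of a skew-symmetric matrix A. It is our own minimal illustration of the pattern that GeoTorch implements; the class name and interface are hypothetical.

# Exponential-map parameterization of an orthogonal matrix from
# unconstrained parameters (the upper-triangular entries of A).
import torch

class Orthogonal(torch.nn.Module):
    def __init__(self, n):
        super().__init__()
        self.n = n
        self.params = torch.nn.Parameter(torch.zeros(n * (n - 1) // 2))
        V, _ = torch.linalg.qr(torch.randn(n, n))   # fixed at initialization
        self.register_buffer("V", V)

    def forward(self):
        A = torch.zeros(self.n, self.n, device=self.params.device)
        iu = torch.triu_indices(self.n, self.n, offset=1)
        A[iu[0], iu[1]] = self.params
        A = A - A.T                                 # skew-symmetric: A^T = -A
        return self.V @ torch.matrix_exp(A)         # exactly orthogonal

layer = Orthogonal(5)
U = layer()
print(torch.dist(U.T @ U, torch.eye(5)).item())      # ~1e-7, machine precision

Consistent with Algorithm 1 in Section 7.7.1, a column-orthogonal U^{(ℓ)} can then be obtained by keeping the first few columns of such a matrix.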
7.3.3 Separable Orthogonal 2D-Convolutions

As 2D-convolutions are widely used in convolutional networks, we extend our orthogonal 1D-convolution to the 2D setting. Analogous to the 1D case, a 2D-convolutional layer is orthogonal if and only if its transfer matrix H(z_1, z_2) is paraunitary (i.e., H(z_1, z_2) is unitary for all |z_1| = 1, |z_2| = 1).

Construction of orthogonal 2D-convolutions. Using two orthogonal 1D-convolutions, we readily obtain a complete design of separable orthogonal 2D-convolutions, where H(z_1, z_2) = H_1(z_1) H_2(z_2) is a product of two 1D-paraunitary systems. As a result, we can parameterize a separable orthogonal 2D-convolution with filter size (L_1 + L_1 + 1) × (L_2 + L_2 + 1) as a composition of two orthogonal 1D-convolutions with learnable (column-)orthogonal matrices ({U_1^{(ℓ)}}_{ℓ=−L_1}^{−1}, Q_1, {U_1^{(ℓ)}}_{ℓ=1}^{L_1}) and ({U_2^{(ℓ)}}_{ℓ=−L_2}^{−1}, Q_2, {U_2^{(ℓ)}}_{ℓ=1}^{L_2}). Since our method relies on separability and the complete factorization of 1D-paraunitary systems, we call it Separable Complete Factorization (SC-Fac). (See Section 7.7.1 for pseudo-code.)

Benefits compared to other methods. (1) Easier analysis. While BCOP [203] and our SC-Fac are complete in the 1D case, neither is complete in the 2D case. However, since separability reduces the design to the 1D case, it makes the analysis of various types of convolutional layers easier, as we will see in Section 7.4. (2) Efficient inference. Since CayleyConv [183] and SOC [205] define the convolution implicitly (as an infinite-length filter), the coefficients of H(z_1, z_2) cannot be saved for repeated inference. In contrast, SC-Fac has the same inference expense as a normal convolutional layer after a one-time computation of the coefficients. (3) Efficient training. As shown in [211], SC-Fac also has the lowest computational complexity among all approaches. (4) Exact orthogonality. Lastly, as shown in Table 7.2 (Section 7.6.1), SC-Fac achieves exact orthogonality (up to machine precision), while previous approaches are approximate to varying degrees.

7.4 Unifying Orthogonal Convolution Variants as Paraunitary Systems

Various convolutional layers (strided, dilated, and group convolutions) are widely used in neural networks. However, it is not apparent how to enforce their orthogonality, as the convolution theorem (Theorem 7.1) only holds for standard convolutions. Previous approaches only deal with standard convolutions [182, 183, 191]; thus orthogonality for state-of-the-art architectures has not been studied before. We address this limitation by modifying the convolution theorem for each variant of convolutional layer, which allows us to design these variants using paraunitary systems.

Table 7.1: Various types of convolutions. For each type, we present the modified Z-transforms Y(z), H(z), and X(z) such that Y(z) = H(z)X(z) holds. For dilated and strided convolutions, X_{r|R}(z) is the (r, R)-polyphase component of X(z), with which we define the stacked transforms X_{[R]}(z) ≜ [X_{0|R}(z)^⊤, …, X_{(R−1)|R}(z)^⊤]^⊤ and X̄_{[R]}(z) ≜ [X_{0|R}(z), …, X_{(R−1)|R}(z)] (and analogously for H and Y). For group convolution, h_g is the filter of the g-th group with H_g(z) its Z-transform; we stack the groups into block-diagonal matrices h_{{G}}[n] = blkdiag({h_g[n]}) and H_{{G}}(z) = blkdiag({H_g(z)}).

  Type          Spatial representation              Y(z)         H(z)          X(z)
  Standard      y[i] = Σ_{n∈Z} h[n] x[i−n]          Y(z)         H(z)          X(z)
  R-Dilated     y[i] = Σ_{n∈Z} h_{↑R}[n] x[i−n]     Y(z)         H(z^R)        X(z)
  ↓R-Strided    y[i] = Σ_{n∈Z} h[n] x[Ri−n]         Y(z)         H̄_{[R]}(z)    X_{[R]}(z)
  ↑R-Strided    y[i] = Σ_{n∈Z} h[n] x_{↑R}[i−n]     Y_{[R]}(z)   H_{[R]}(z)    X(z)
  G-Group       y[i] = Σ_{n∈Z} h_{{G}}[n] x[i−n]    Y(z)         H_{{G}}(z)    X(z)

Theorem 7.1 (Convolution and paraunitary theorems for various convolutions). Strided, dilated, and group convolutions can be unified in the spectral domain as Y(z) = H(z)X(z), where Y(z), H(z), X(z) are modified Z-transforms of y, h, x; we instantiate the equation for each variant in Table 7.1. Furthermore, a convolution is orthogonal if and only if H(z) is paraunitary.
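As a quick sanity check of the R-dilated row of Table 7.1 (our own sketch with illustrative sizes, not part of the dissertation's code), up-sampling a filter by R in the spatial domain indeed turns its transfer matrix H(z) into H(z^R) on a discrete frequency grid:

# Check the R-dilated row of Table 7.1: the dilated filter's transfer
# matrix equals H(z^R) on the length-N frequency grid z_k = exp(j 2 pi k / N).
import torch

torch.manual_seed(0)
C, K, R, N = 2, 3, 2, 12             # channels, taps, dilation, grid length
h = torch.randn(C, C, K)

h_up = torch.zeros(C, C, K * R)      # h_{up R}[n] = h[n / R] for n = 0 (mod R)
h_up[:, :, ::R] = h

Hz  = torch.fft.fft(h,    n=N, dim=-1)          # H(z_k)
Hup = torch.fft.fft(h_up, n=N, dim=-1)          # transfer matrix of the dilated filter

idx = (torch.arange(N) * R) % N                  # H(z^R) sampled on the same grid
print(torch.allclose(Hup, Hz[:, :, idx], atol=1e-5))   # True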
In Table 7.1, we formulate strided, dilated, and group convolutions in the spatial domain, interpreting them as up-sampled or down-sampled variants of a standard convolution. We now introduce the notions of up-sampling and down-sampling precisely. Given a sequence x, its up-sampled sequence x_{↑R} with sampling rate R is defined as x_{↑R}[n] ≜ x[n/R] for n ≡ 0 (mod R) (and zero otherwise). On the other hand, its (r, R)-polyphase component x_{r|R} is the r-th down-sampled sequence with sampling rate R, defined as x_{r|R}[n] ≜ x[nR + r]. Figure 7.3 illustrates x_{↑R} and x_{r|R} for sampling rate R = 2. The Z-transforms of x_{↑R} and x_{r|R} are denoted X_{↑R}(z) and X_{r|R}(z), respectively.

Figure 7.3: Up/down-sampling. In (a), the sequence x[n] is up-sampled into the sequence x_{↑2}[n]. In (b), the sequence x[n] is down-sampled into two sequences: x_{0|2}[n] with the even entries and x_{1|2}[n] with the odd entries.

Now we are ready to interpret the convolution variants. (1) Strided convolution is used to adjust the feature resolution: a strided convolution (↓R-strided) decreases the resolution by down-sampling after a standard convolution, while a transposed strided convolution (↑R-strided) increases the resolution by up-sampling before a standard convolution. (2) Dilated convolution increases the receptive field without extra parameters: an R-dilated convolution up-samples its filter before convolving it with the input. (3) Group convolution reduces parameters and computation and is thus widely used in efficient architectures: a G-group convolution divides the input/output channels into G groups and restricts connections to within each group. According to Theorem 7.1, a convolution is orthogonal if and only if its modified Z-transform H(z) is paraunitary.

7.5 Learning Deep Orthogonal Networks with Lipschitz Bounds

In this section, we switch our focus from layer design to network design. In particular, we study how to scale up deep orthogonal networks with Lipschitz bounds.

Lipschitz networks [181, 182, 183], whose Lipschitz bounds are imposed by their architectures, have been proposed as competitive candidates for guaranteeing robustness in deep learning. A Lipschitz network consists of orthogonal layers and GroupSort activations, both of which are 1-Lipschitz and gradient-norm preserving. Given a Lipschitz constant L, a Lipschitz network f can compute a certified radius for each input from its output margin. Formally, denote the output margin of an input x with label c as

    M_f(x) \triangleq \max\bigl(0,\, f(x)_c − \max_{i \neq c} f(x)_i\bigr),    (7.5)

i.e., the difference between the correct logit and the second-largest logit. Then the prediction is robust to any perturbation ε, i.e., f(x + ε) = f(x) = c, for all ε with ‖ε‖_2 < M_f(x) / (√2 L).
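To make this certificate concrete, the sketch below (ours; the 1/√2 factor follows the standard Lipschitz-margin argument used in prior Lipschitz-network work and should be read as our assumption here) computes certified l2 radii from a batch of logits and a Lipschitz constant:

# Margin-based certified radius for an L-Lipschitz classifier (Eq. 7.5).
import torch

def certified_radius(logits: torch.Tensor, labels: torch.Tensor, lip: float):
    top2 = logits.topk(2, dim=1).values
    correct = logits.gather(1, labels[:, None]).squeeze(1)
    runner_up = torch.where(top2[:, 0] == correct, top2[:, 1], top2[:, 0])
    margin = (correct - runner_up).clamp(min=0)     # M_f(x) in Eq. (7.5)
    return margin / (2 ** 0.5 * lip)

logits = torch.tensor([[4.0, 1.0, 0.5], [2.0, 2.5, 1.0]])
labels = torch.tensor([0, 0])
print(certified_radius(logits, labels, lip=1.0))     # misclassified input: radius 0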
The choice of initialization matters in deep networks (> 10 layers), while all methods perform similarly in shallow networks, as shown in Table 7.5 (Section 7.7.2). (4) Receptive field and down-sampling. Previous works [182, 183] use a larger kernel size and no stride for Lipschitz networks. In Table 7.4 (right), we study the effects of kernel/dilation size and of the down-sampling type for orthogonal convolutions. We find that average pooling as down-sampling consistently outperforms strided convolutions. Furthermore, a larger kernel size helps to boost the performance.

Table 7.4: (Left) Comparison of skip-connection types on WideResNet22-10 (kernel size equals 5). (Right) Comparison of receptive-field and down-sampling choices on WideResNet10-10. The symbols ✓ and ✗ indicate whether average pooling or a strided convolution is used for down-sampling. For a "slim" strided convolution, we set kernel size = stride; for a "wide" one, kernel size = stride × kernel size′, where kernel size′ is the kernel size of the main branch.

  (Left)                         Test Acc.
  Skip type                      Clean    PGD
  ConvNet (w/o skip)             69.59    59.22
  ShuffleNet (concat)            75.21    66.00
  ResNet (add)                   87.82    76.46

  (Right)  Receptive Field       Down-Sampling        Test Acc.
           Kernel   Dilation     Pool     Stride      Clean    PGD
           3        1            ✗        slim        80.70    68.81
           3        1            ✗        wide        82.36    70.36
           3        1            ✓        ✗           84.54    71.71
           3        2            ✓        ✗           81.53    70.07
           5        1            ✓        ✗           84.09    74.29
           5        2            ✓        ✗           81.28    70.58

(5) Run-time and memory comparison. We find that previously proposed orthogonal convolutions such as CayleyConv, BCOP, and RKO require more GPU memory and computation time than SC-Fac. Therefore, we could not scale them up due to memory constraints (for 22 and 32 layers, using a Tesla V100 32G). In order to scale up Lipschitz networks, an economical implementation of orthogonal convolution is crucial. As shown in Figure 7.5, for deep and wide architectures, our SC-Fac is the most computationally and memory-efficient method and the only method that scales to a width factor of 10 on WideResNet22. Missing numbers in Figure 7.5 and Table 7.6 (Section 7.7.2) are due to the large memory requirement.

Figure 7.5: Run-time and memory comparison using WideResNet22 on a Tesla V100 32G (train time, inference time, and memory versus the width factor, for a normal network, SC-Fac, Cayley, BCOP, and RKO). The x-axis indicates the width factor (channels = base channels × factor). Our SC-Fac is the most computationally and memory-efficient method for wide architectures and the only method that scales to a width factor of 10 on WideResNet22. We also compare with an ordinary network with regular convolutions and ReLU activations; note that SC-Fac has the same inference speed as a regular convolution, and the remaining overhead is from the GroupSort activations.

In summary, additive skip-connections are still essential for deep orthogonal networks. Due to the orthogonality constraints, it is helpful to increase the depth/width of the network; however, this significantly increases the memory requirement, so a cheap implementation (like SC-Fac) is desirable. Finally, we find that a larger kernel size and down-sampling based on average pooling are helpful, unlike standard practice in deep networks.

7.7 Supplementary Materials

7.7.1 Pseudo Code for SC-Fac Algorithm

Algorithm 1: Separable Complete Factorization (SC-Fac)
  Input: Number of channels C, kernel size K = 2L + 1, and skew-symmetric matrices {A_d^{(ℓ)}} with A_d^{(ℓ)} ∈ R^{C×C}, ℓ ∈ [−L, L], d ∈ {1, 2}.
  Output: A paraunitary system H ∈ R^{C×C×K×K}.
  Initialization: Sample N_d^{(ℓ)} from {1, …, C} uniformly, ∀ℓ ∈ [−L, L], d ∈ {1, 2}.
  /* Iterate over the vertical/horizontal dimensions */
  for d = 1 to 2 do
      /* 1) Compute orthogonal matrices from skew-symmetric matrices */
      for ℓ = −L to L do            /* iterate over filter locations */
          if ℓ = 0 then
              Q_d ← matrix_exp(A_d^{(0)})                       // use matrix_exp() in GeoTorch [223]
          else
              U_d^{(ℓ)} ← select(matrix_exp(A_d^{(ℓ)}), cols = N_d^{(ℓ)})   // keep the first cols columns
          end if
      end for
      /* 2) Compose a 1D paraunitary system from the orthogonal matrices */
      H_d ← Q_d
      for ℓ = 1 to L do
          H_d ← conv1d(H_d, [U_d^{(ℓ)} U_d^{(ℓ)⊤}, I − U_d^{(ℓ)} U_d^{(ℓ)⊤}])
          H_d ← conv1d([I − U_d^{(−ℓ)} U_d^{(−ℓ)⊤}, U_d^{(−ℓ)} U_d^{(−ℓ)⊤}], H_d)
      end for
  end for
  /* 3) Compose a 2D paraunitary system from the two 1D paraunitary systems */
  H ← Compose(H_1, H_2)             // i.e., H_{:,:,i,j} = (H_2)_{:,:,j} (H_1)_{:,:,i}, where H_1, H_2 are of size C × C × K
  return H
We include the pseudo-code for the separable complete factorization (Section 7.3) in Algorithm 1 and for diverse orthogonal convolutions (Section 7.4) in Algorithm 2.

Algorithm 2: Construct Diverse Orthogonal Convolutions from Paraunitary Systems
  Input: Number of base channels C, kernel size K = R(2L + 1), stride R, dilation D, number of groups G.
  Output: An orthogonal kernel W ∈ R^{T×S×K×K}.
  Set K′ ← K/R, the number of input channels S ← GC/R², and the number of output channels T ← GC.
  for g = 0 to G − 1 do
      /* 1) Construct orthogonal convolutions from paraunitary systems */
      Initialize skew-symmetric matrices {{A_d^{(ℓ,g)}}_{ℓ=−L}^{L}}_{d=1}^{2} for the current g
      H_g ← SC-Fac(C, K′, {{A_d^{(ℓ,g)}}_{ℓ=−L}^{L}}_{d=1}^{2})            // Algorithm 1
      H_g ← reshape(H_g, (C, C, K′, K′) → (C/R², C, K, K))
  end for
  /* 2) Concatenate orthogonal convolutions from different groups */
  W ← concatenate({H_g}_{g=0}^{G−1}, dim = 0)
  return W     // the filter for input channel s and output channel t is W_{t,s,:,:} ∈ R^{K×K}

The pseudo-code in Algorithm 1 consists of three parts: (1) first, we obtain orthogonal matrices from skew-symmetric matrices using the matrix exponential (we use the GeoTorch library [223] for the function matrix_exp in our implementation); (2) subsequently, we construct two 1D paraunitary systems from these orthogonal matrices; and (3) lastly, we compose the two 1D paraunitary systems into one 2D paraunitary system. The pseudo-code in Algorithm 2 consists of two parts: (4) first, we reshape each paraunitary system into an orthogonal convolution depending on the stride; and (5) second, we concatenate the orthogonal kernels of the different groups and return the output.
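To complement the pseudo-code, here is a small runnable sketch (ours, heavily simplified from Algorithm 1: a single group, one positive and one negative factor that share the same column-orthogonal matrix) that composes a 1D paraunitary kernel in the spatial domain, as in step 2, and checks that the resulting circular convolution preserves the l2 norm:

# Compose a 1D paraunitary kernel by convolving the 2-tap factors of
# Eq. (7.4) with an orthogonal center tap Q, then check norm preservation.
import torch
import torch.nn.functional as F

def compose(Hd, taps):
    # Convolve the kernel Hd (C, C, k) with a 2-tap factor (C, C, 2) along the
    # filter dimension while contracting the channel dimensions.
    k = Hd.shape[-1]
    out = torch.zeros(Hd.shape[0], Hd.shape[1], k + 1)
    for i in range(2):
        out[:, :, i:i + k] += torch.einsum('ab,bck->ack', taps[:, :, i], Hd)
    return out

torch.manual_seed(0)
C = 4
Q, _ = torch.linalg.qr(torch.randn(C, C))
U = torch.linalg.qr(torch.randn(C, 2))[0]            # column-orthogonal
P = U @ U.T                                          # projector
I = torch.eye(C)

H = Q.unsqueeze(-1)                                  # start from the single-tap kernel Q
H = compose(H, torch.stack([P, I - P], dim=-1))      # factor for a positive location
H = compose(H, torch.stack([I - P, P], dim=-1))      # factor for a negative location

x = torch.randn(1, C, 32)
x_pad = torch.cat([x, x[:, :, :H.shape[-1] - 1]], dim=-1)   # circular padding
y = F.conv1d(x_pad, H)
print(torch.norm(y).item(), torch.norm(x).item())            # equal up to ~1e-5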
7.7.2 Setups and Additional Results for Empirical Studies

Network architectures. For fair comparisons, we follow the architectures of [183] for KW-Large, ResNet9, and WideResNet10-10 (i.e., shallow networks). We set the group size of the GroupSort activations to 2 in all experiments. For networks deeper than 10 layers, we implement the architectures by modifying the official PyTorch implementation of ResNet (https://github.com/pytorch/vision/blob/master/torchvision/models/). It is crucial to replace the global pooling before the fully-connected layer with an average pooling of window size 4; for the average pooling, we multiply the output by the window size to maintain 1-Lipschitzness. Other architectures, including ShuffleNet and the plain convolutional network (ConvNet), are further modified from the ResNet, where only the skip-connections are changed or removed. We use the widen factor to indicate the channel number: we set the number of channels at each layer to the base channels multiplied by the widen factor, and the base channels are 16, 32, and 64 for the three groups of residual blocks.

Learning strategies. We use the CIFAR-10 dataset for all our experiments. We normalize all input images to [0, 1], followed by standard augmentation, including random cropping and horizontal flipping. We use the Adam optimizer with a maximum learning rate of 10⁻² coupled with a piecewise triangular learning-rate scheduler. We initialize all our SC-Fac layers as permutation matrices: (1) we select the number of columns for each pair U^{(ℓ)}, U^{(−ℓ)} uniformly from {1, …, T} at initialization (the number is fixed during training); (2) for ℓ > 0, we sample the entries of U^{(ℓ)} uniformly with respect to the Haar measure; and (3) for ℓ < 0, we set U^{(−ℓ)} = Q U^{(ℓ)} according to Proposition 7.2.

Multi-class hinge loss. Following previous works on Lipschitz networks [181, 182, 183], we adopt the multi-class hinge loss in training. For each model, we perform a grid search over the margins ε₀ ∈ {1×10⁻³, 2×10⁻³, 5×10⁻³, 1×10⁻², 2×10⁻², 5×10⁻², 0.1, 0.2, 0.5} and report the best performance in terms of robust accuracy. Notice that the margin ε₀ controls the trade-off between clean and robust accuracy, as shown in Figure 7.6.

Figure 7.6: Effect of the hinge-loss margin ε₀ for WideResNet22-10 (train, test, and PGD accuracy versus ε₀, in units of 10⁻³). The figure shows a trade-off between clean and robust accuracy across margins: the training and test accuracy increase with a larger margin, but the robust accuracy decreases beyond ε₀ = 0.1.

Initialization methods. In Proposition 7.2, we show how to initialize our orthogonal convolutional layers as orthogonal matrices. In Table 7.5, we study different initialization methods, including identical, permutation, uniform, and torus [199, 221]. We find that permutation works best for WideResNet22-10, while all methods perform similarly on the shallower WideResNet10-10. Therefore, we use permutation initialization for all other experiments.

Table 7.5: Effect of the initialization method on WideResNet (kernel size 5).

  Initialization       WideResNet10-10           WideResNet22-10
                       Clean (%)   PGD (%)       Clean (%)   PGD (%)
  uniform              83.58       73.20         87.55       75.71
  torus                82.40       72.50         88.12       75.43
  permutation          83.18       73.16         87.82       76.46
  identical            83.29       73.49         87.82       75.49

Network depth and width. Exact orthogonality is criticized for harming the expressive power of neural networks, and we find that increasing the network depth/width can partially compensate for this loss. In Table 7.6, we study the impact of network depth/width on predictive performance. As shown, deeper/wider architectures consistently improve both the clean and robust accuracy of our implementation. However, the best robust accuracy is achieved by a 22-layer network, since we cannot afford a wide architecture at 34 layers.

Table 7.6: Comparison of different depths and widths on WideResNet (kernel size 5). Some numbers are missing due to the large memory requirement (on a Tesla V100 32G). The width factor indicates channels = base channels × factor.

  10 layers          Clean (%)                            PGD with ε = 36/255 (%)
  Width              1      3      6      8      10       1      3      6      8      10
  Ours               79.96  84.17  84.96  84.61  84.09    65.92  69.70  72.18  72.51  74.29
  Cayley             77.88  82.14  82.56  85.53  85.01    66.65  73.06  74.33  75.66  76.13
  RKO                81.37  83.55  84.67  85.18  84.62    70.55  74.44  76.41  76.65  77.02

  22 layers
  Ours               79.90  82.22  87.21  88.10  87.82    67.95  70.88  74.30  75.12  76.46
  Cayley             79.11  84.82  85.85  -      -        69.79  65.61  74.81  -      -
  RKO                82.71  84.19  84.33  84.55  -        72.40  74.36  75.66  76.41  -

  34 layers
  Ours               81.24  88.17  88.92  -      -        69.21  71.85  75.09  -      -
  Cayley             82.46  84.29  -      -      -        71.27  74.73  -      -      -
  RKO                81.51  83.24  83.92  -      -        71.38  73.84  75.03  -      -

7.7.3 Orthogonal Convolutions for Residual Flows

In this subsection, we first review the class of flow-based generative models [224, 225].
We focus on the invertible residual network [194], a flow-based model that relies on Lipschitz residual blocks, and its extended version, Residual Flow [188]. We then show how to construct an improved Residual Flow using our orthogonal convolutions.

Flow-based models. Given an observable vector x ∈ R^D and a latent vector z ∈ R^D, we define a bijective mapping f : R^D → R^D from the latent vector z to an observation x = f(z). We further define the inverse of f as F = f⁻¹, with which we express the likelihood of x in terms of that of z as

    \ln p_X(x) = \ln p_Z(z) + \ln |\det J_F(x)|,    (7.7)

where p_X is the data distribution, p_Z is the base distribution (usually a normal distribution), and J_F(x) is the Jacobian of F at x. In practice, the bijective mapping f is composed of a sequence of K bijective mappings, f = f_K ∘ ⋯ ∘ f_1, where each f_k is called a flow. Since the inverse mapping F = F_1 ∘ ⋯ ∘ F_K transforms the data distribution p_X into a normal distribution p_Z, flow-based models are also known as normalizing flows. Accordingly, we rewrite Equation (7.7) as

    \ln p_X(x) = \ln p_Z(z) + \sum_{k=1}^{K} \ln |\det J_{F_k}(x)|.    (7.8)

In a flow-based model, we require efficient computation of (a) each bijective mapping f_k, (b) its inverse mapping F_k = f_k⁻¹, and (c) the corresponding log-determinant \ln |\det J_{F_k}(·)|.

Invertible residual networks (i-ResNets). [194] propose a flow-based model based on the residual network (ResNet). A block in a ResNet is defined as F(x) = x + g(x), where g is a convolutional network. In [194], the authors prove that F is a bijective mapping if g is contractive (Lip(g) < 1), and that its inverse can be computed by fixed-point iterations:

    x_{k+1} = y − g(x_k),    (7.9)

where y = F(x) is the output of F and the iteration is initialized at x_0 := y. From the Banach fixed-point theorem, we have

    \| x − x_k \|_2 \le \frac{\mathrm{Lip}(g)^k}{1 − \mathrm{Lip}(g)} \| x_1 − x_0 \|_2,    (7.10)

i.e., the convergence rate is exponential in the number of iterations, and a smaller Lipschitz constant yields faster convergence. Moreover, the log-determinant can be computed as

    \ln p_X(x) = \ln p_Z(z) + \mathrm{tr}\bigl(\ln(I + J_g(x))\bigr)    (7.11)
               = \ln p_Z(z) + \sum_{k=1}^{\infty} \frac{(−1)^{k+1}}{k} \mathrm{tr}\bigl([J_g(x)]^k\bigr),    (7.12)

where the infinite sum is approximated by truncation and the trace is efficiently estimated using the Hutchinson trace estimator tr(A) = E_{v∼N(0,I)}[v^⊤ A v]. To constrain the Lipschitz constant, i-ResNet uses spectral normalization on each linear layer. Moreover, to improve optimization stability, i-ResNet changes the activation function from ReLU to ELU, ensuring that the nonlinear activations have continuous derivatives.

As summarized in [194], two problems remain in this model: (1) the estimator of the log-determinant is biased and inefficient; (2) designing and learning networks under a Lipschitz constraint is challenging: one needs to constrain each linear layer in the block instead of being able to control the Lipschitz constant of the block directly.

Table 7.7: Comparison of flow-based models on the MNIST dataset. We report performance in bits per dimension (bpd), where a smaller number indicates better performance.

  Model                            MNIST
  Glow [218]                       1.05
  FFJORD [226]                     0.99
  i-ResNet [194]                   1.05
  Residual Flow [188]              0.97
  SC-Fac Residual Flow (ours)      0.896

Residual Flow. [188] addresses problem (1) by proposing an unbiased Russian-roulette estimator for Equation (7.12):

    \mathrm{tr}\bigl(\ln(I + J_g(x))\bigr) = \mathbb{E}_{n,v}\!\left[ \sum_{k=1}^{n} \frac{(−1)^{k+1}}{k} \, \frac{v^\top [J_g(x)]^k v}{P(N \ge k)} \right],    (7.13)

where n ∼ p(N) and v ∼ N(0, I).
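The following toy sketch (ours) illustrates the two computations above with a linear residual branch g(x) = Wx whose spectral norm is scaled below one, so that J_g = W is explicit: (a) inverting F(x) = x + g(x) with the fixed-point iteration of Equation (7.9), and (b) estimating the log-determinant with the truncated series of Equation (7.12) and Hutchinson probes. In practice, i-ResNet and Residual Flow use spectrally-normalized convolutional networks and vector-Jacobian products instead of an explicit matrix.

# Toy i-ResNet block with a linear g(x) = W x, Lip(g) = ||W||_2 = 0.5.
import torch

torch.manual_seed(0)
D = 8
W = torch.randn(D, D)
W = 0.5 * W / torch.linalg.matrix_norm(W, ord=2)
g = lambda x: x @ W.T

# (a) Invert y = F(x) = x + g(x) by iterating x_{k+1} = y - g(x_k), x_0 = y.
x_true = torch.randn(D)
y = x_true + g(x_true)
x = y.clone()
for _ in range(40):
    x = y - g(x)                                    # Eq. (7.9)
print('inversion error:', torch.norm(x - x_true).item())    # tiny

# (b) ln|det J_F| = tr(ln(I + W)) via the truncated series with Hutchinson probes.
exact = torch.logdet(torch.eye(D) + W)
v = torch.randn(1000, D)                            # Hutchinson probe vectors
est, Wkv = 0.0, v
for k in range(1, 11):
    Wkv = Wkv @ W.T                                 # rows become (W^k v_i)^T
    est += (-1) ** (k + 1) / k * (v * Wkv).sum(dim=1).mean()
print('log-det:', exact.item(), 'estimate:', est.item())     # close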
Residual Flow further changes the activation from ELU to LipSwish. The LipSwish activation avoids derivative saturation, which occurs when the second derivative is zero over a large region. However, problem (2) remains unresolved.

Residual flows with orthogonal convolutions. We propose to address problem (2) by replacing spectral normalization with orthogonal convolutions (SC-Fac). Note that orthogonal convolutions directly control the Lipschitz constant of a ResNet block. We keep all other components unchanged; in particular, we use LipSwish instead of GroupSort, as GroupSort suffers from derivative saturation. We evaluate our model on the MNIST dataset, and Table 7.7 shows that our model substantially improves the performance over the original Residual Flow. We display some images generated by our model in Figure 7.7.

Figure 7.7: Random samples from the SC-Fac Residual Flow trained on MNIST.

7.8 Conclusion

In this chapter, we present a paraunitary framework for orthogonal convolutions. Specifically, we establish the equivalence between orthogonal convolutions in the spatial domain and paraunitary systems in the spectral domain. Therefore, any design of orthogonal convolutions is implicitly constructing a paraunitary system. We further show that the orthogonality of convolution variants (strided, dilated, and group convolutions) is also fully characterized by paraunitary systems. In summary, paraunitary systems are all we need to ensure orthogonality for diverse types of convolutions.

Based on the complete factorization of 1D paraunitary systems, we develop the first exact and complete design of separable orthogonal 2D-convolutions. Our versatile design allows us to study the design principles of orthogonal convolutional networks. Consequently, we scale orthogonal networks to deeper architectures, substantially outperforming their shallower counterparts. In our experiments, we observe that exact orthogonality plays a crucial role in learning deep Lipschitz networks. In the future, we plan to investigate other use cases in which exact orthogonality is essential.

Chapter 8: Conclusion

This dissertation investigates spectral methods to design neural networks with desirable properties (or without undesirable properties). We divide the dissertation into three modules.

In the first module, we apply tensor representations to interpret and design multilinear operations. In Chapter 2, we introduce a framework to design compact convolutional layers, which maintain high expressive power with significantly fewer parameters. Then, in Chapter 3, we develop an automatic library that can effectively denote tensor representations, with which we can efficiently learn multilinear operations. Lastly, in Chapter 4, we propose the first higher-order recurrent layer for spatio-temporal learning, which achieves the state of the art in long-term prediction tasks. There are two potential directions to further advance multilinear operations in neural networks. It will be interesting to understand and design attention modules from the perspective of tensor representations, as attention mechanisms have been gaining popularity over convolutions and recurrences in recent years. We also plan to develop fused algorithms to further accelerate the evaluation of tensor representations, avoiding the serial computation of binary operations.

In the second module (Chapter 5), we develop an efficient scheme to model uncertainties using Bayesian quantized networks.
Using this scheme, we convert the intractable computation in Bayesian neural networks into tractable tensor operations and develop fast algorithms for these operations. On the other hand, learning a quantized neural network is an integer programming problem, since the weights only take discrete values. However, the problem becomes continuous under our scheme, since the probability vector over the discrete weights is continuous. Therefore, our scheme also provides an alternative way to learn compact quantized networks. In the future, it will be interesting to investigate the statistical properties of our design and of other approaches for learning Bayesian neural networks. Since a quantitative analysis of the trade-off between bias and variance is lacking, different methods can currently only be compared by their final performance; it is therefore desirable to characterize their properties so that we know their applicability in context and can further improve these approaches.

In the last module, we show that a convolutional layer is equivalent to a multi-input multi-output (MIMO) filter bank, which allows us to specify the properties of convolutional layers using filter bank theory. In Chapter 6, we propose ARMA layers based on infinite impulse response (IIR) filter banks, with which we can economically expand the receptive field of a convolutional layer. In Chapter 7, we propose orthogonal convolutions based on paraunitary filter banks, with which we can easily ensure exact orthogonality. In this module, we have mainly focused on the properties of convolutional layers per se. As a future direction, it will be interesting to investigate the interplay between the properties of convolutional layers and network architectures. For instance, it will be interesting to understand the interactions between ARMA layers and self-attention modules in expanding receptive fields. As another example, we plan to investigate how orthogonal convolutions collaborate with other components in constructing invertible neural networks.

Bibliography

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25:1097–1105, 2012.

[2] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[3] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. Advances in neural information processing systems, 27, 2014.

[4] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.

[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[6] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.

[7] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pages 448–456. PMLR, 2015.

[8] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
[9] Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282, 2017. [10] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014. [11] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321?1330. PMLR, 2017. 175 [12] Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. An analysis of deep neural network models for practical applications. arXiv preprint arXiv:1605.07678, 2016. [13] Aharon Azulay and Yair Weiss. Why do deep convolutional networks generalize so poorly to small image transformations? Journal of Machine Learning Research, 20(184):1?25, 2019. [14] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision Conference 2016. British Machine Vision Association, 2016. [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630? 645. Springer, 2016. [16] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-first AAAI conference on artificial intelligence, 2017. [17] Ting Zhang, Guo-Jun Qi, Bin Xiao, and Jingdong Wang. Interleaved group convolu- tions. In Proceedings of the IEEE international conference on computer vision, pages 4373?4382, 2017. [18] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013. [19] Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P Vetrov. Ten- sorizing neural networks. Advances in neural information processing systems, 28, 2015. [20] Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Vic- tor Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP- decomposition. arXiv preprint arXiv:1412.6553, 2014. [21] Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, and Dongjun Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530, 2015. [22] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning represen- tations by back-propagating errors. nature, 323(6088):533?536, 1986. [23] Roma?n Oru?s. A practical introduction to tensor networks: Matrix product states and projected entangled pair states. Annals of Physics, 349:117?158, 2014. [24] Lars Grasedyck, Daniel Kressner, and Christine Tobler. A literature survey of low- rank tensor approximation techniques. GAMM-Mitteilungen, 36(1):53?78, 2013. [25] Andrzej Cichocki, Namgil Lee, Ivan Oseledets, Anh-Huy Phan, Qibin Zhao, and Danilo P Mandic. Tensor networks for dimensionality reduction and large-scale op- timization: Part 1 low-rank tensor decompositions. Foundations and Trends?R in Machine Learning, 9(4-5):249?429, 2016. 176 [26] Andrzej Cichocki, Anh-Huy Phan, Qibin Zhao, Namgil Lee, Ivan Oseledets, Masashi Sugiyama, Danilo P Mandic, et al. Tensor networks for dimensionality reduction and large-scale optimization: Part 2 applications and future perspectives. Foundations and Trends?R in Machine Learning, 9(6):431?673, 2017. [27] Nadav Cohen and Amnon Shashua. Convolutional rectifier networks as generalized tensor decompositions. In International Conference on Machine Learning, pages 955? 963, 2016. 
[28] Valentin Khrulkov, Alexander Novikov, and Ivan Oseledets. Expressive power of recurrent neural networks. In International Conference on Learning Representations, 2018. [29] Kohei Hayashi, Taiki Yamaguchi, Yohei Sugawara, and Shin-ichi Maeda. Exploring unexplored tensor network decompositions for convolutional neural networks. Ad- vances in Neural Information Processing Systems, 32, 2019. [30] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014. [31] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. Ad- vances in neural information processing systems, 27, 2014. [32] Xiangyu Zhang, Jianhua Zou, Xiang Ming, Kaiming He, and Jian Sun. Efficient and accurate approximations of nonlinear convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1984?1992, 2015. [33] Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM review, 51(3):455?500, 2009. [34] Timur Garipov, Dmitry Podoprikhin, Alexander Novikov, and Dmitry Vetrov. Ul- timate tensorization: compressing convolutional and FC layers alike. arXiv preprint arXiv:1611.03214, 2016. [35] Ivan V Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Comput- ing, 33(5):2295?2317, 2011. [36] Yinchong Yang, Denis Krompass, and Volker Tresp. Tensor-train recurrent neural networks for video classification. In International Conference on Machine Learning, pages 3891?3900. PMLR, 2017. [37] Rose Yu, Stephan Zheng, Anima Anandkumar, and Yisong Yue. Long-term forecast- ing using tensor-train RNNs. arXiv preprint arXiv:1711.00073, 2017. [38] Jiahao Su, Wonmin Byeon, Jean Kossaifi, Furong Huang, Jan Kautz, and Anima Anandkumar. Convolutional tensor-train LSTM for spatio-temporal learning. Ad- vances in Neural Information Processing Systems, 33:13714?13726, 2020. 177 [39] Dingheng Wang, Guangshe Zhao, Guoqi Li, Lei Deng, and Yang Wu. Compressing 3d-CNNs based on tensor train decomposition. Neural Networks, 131:215?230, 2020. [40] Qibin Zhao, Guoxu Zhou, Shengli Xie, Liqing Zhang, and Andrzej Cichocki. Tensor ring decomposition. arXiv preprint arXiv:1606.05535, 2016. [41] Jinmian Ye, Guangxi Li, Di Chen, Haiqin Yang, Shandian Zhe, and Zenglin Xu. Block-term tensor neural networks. Neural Networks, 2020. [42] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015. [43] Saurabh Goyal, Anamitra Roy Choudhury, and Vivek Sharma. Compression of deep neural networks by combining pruning and low rank decomposition. In 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 952?958. IEEE, 2019. [44] Donghyun Lee, Dingheng Wang, Yukuan Yang, Lei Deng, Guangshe Zhao, and Guoqi Li. QTT-net: Quantized tensor train neural networks for 3d object and video recog- nition. Neural Networks, 2021. [45] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? Advances in neural information processing systems, 27, 2014. [46] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015. [47] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. 
arXiv preprint arXiv:1412.6550, 2014. [48] Chi-Chung Lam, P Sadayappan, and Rephael Wenger. On optimizing a class of multi-dimensional loops with reductions for parallel execution. Parallel Processing Letters, 7(2):157?168, 1997. [49] Robert NC Pfeifer, Jutho Haegeman, and Frank Verstraete. Faster identification of optimal contraction sequences for tensor networks. Physical Review E, 90(3):033315, 2014. [50] Jiahao Su, Jingling Li, Xiaoyu Liu, Teresa Ranadive, Christopher Coley, Tai-Ching Tuan, and Furong Huang. Compact neural architecture designs by tensor represen- tations. Frontiers in Artificial Intelligence, 5, 2022. [51] Pierre Comon, Xavier Luciani, and Andre? LF De Almeida. Tensor decompositions, alternating least squares and other tales. Journal of Chemometrics: A Journal of the Chemometrics Society, 23(7-8):393?405, 2009. 178 [52] Franc?ois Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251?1258, 2017. [53] Wenqi Wang, Yifan Sun, Brian Eriksson, Wenlin Wang, and Vaneet Aggarwal. Wide compression: Tensor ring nets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9329?9338, 2018. [54] Asifullah Khan, Anabia Sohail, Umme Zahoora, and Aqsa Saeed Qureshi. A survey of the recent architectures of deep convolutional neural networks. Artificial Intelligence Review, 53(8):5455?5516, 2020. [55] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Pra- fulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information pro- cessing systems, 33:1877?1901, 2020. [56] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273?1282. PMLR, 2017. [57] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aure?lien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends?R in Machine Learning, 14(1?2):1?210, 2021. [58] Jean Kossaifi, Aran Khanna, Zachary Lipton, Tommaso Furlanello, and Anima Anandkumar. Tensor contraction layers for parsimonious deep nets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 26?32, 2017. [59] Jean Kossaifi, Zachary C Lipton, Arinbjorn Kolbeinsson, Aran Khanna, Tommaso Furlanello, and Anima Anandkumar. Tensor regression networks. Journal of Machine Learning Research, 21(123):1?21, 2020. [60] G Daniel, Johnnie Gray, et al. Opt\ einsum-a python package for optimizing contrac- tion order for einsum-like expressions. Journal of Open Source Software, 3(26):753, 2018. [61] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016. [62] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32:8026?8037, 2019. [63] Jean Kossaifi, Yannis Panagakis, Anima Anandkumar, and Maja Pantic. Tensorly: Tensor learning in python. Journal of Machine Learning Research, 20(26):1?6, 2019. 
179 [64] Charles R Harris, K Jarrod Millman, Ste?fan J van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J Smith, et al. Array programming with numpy. Nature, 585(7825):357?362, 2020. [65] Alex Rogozhnikov and Cristian Garcia. Einops. https://github.com/arogozhnikov/einops, 2020. [66] Alexander Reustle, Tahseen Rabbani, and Furong Huang. Fast GPU convolution for CP-decomposed tensorial neural networks. In Proceedings of SAI Intelligent Systems Conference, pages 468?487. Springer, 2020. [67] Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. Tensor comprehensions: Framework-agnostic high-performance machine learning ab- stractions. arXiv preprint arXiv:1802.04730, 2018. [68] Po-An Wang and Chi-Jen Lu. Tensor decomposition via simultaneous power iteration. In International Conference on Machine Learning, pages 3665?3673, 2017. [69] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. Advances in neural information processing systems, 27, 2014. [70] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012. [71] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211?252, 2015. [72] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al. Conformer: Convolution-augmented transformer for speech recognition. Proc. Interspeech 2020, pages 5036?5040, 2020. [73] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 5206?5210. IEEE, 2015. [74] Alex Krizhevsky. Learning multiple layers of features from tiny images. Master?s thesis, University of Toronto, 2009. [75] Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2786?2793. IEEE, 2017. 180 [76] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learn- ing of video representations using LSTMs. In International conference on machine learning, pages 843?852. PMLR, 2015. [77] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei- Fei, and Silvio Savarese. Social LSTM: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE conference on computer vision and pattern recog- nition, pages 961?971, 2016. [78] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang- chun Woo. Convolutional LSTM network: A machine learning approach for precipi- tation nowcasting. Advances in neural information processing systems, 28, 2015. [79] Yunbo Wang, Mingsheng Long, Jianmin Wang, Zhifeng Gao, and Philip S Yu. Pre- drnn: Recurrent neural networks for predictive learning using spatiotemporal LSTMs. Advances in neural information processing systems, 30, 2017. [80] Yunbo Wang, Zhifeng Gao, Mingsheng Long, Jianmin Wang, and S Yu Philip. 
Pre- drnn++: Towards a resolution of the deep-in-time dilemma in spatiotemporal pre- dictive learning. In International Conference on Machine Learning, pages 5123?5132. PMLR, 2018. [81] Yunbo Wang, Lu Jiang, Ming-Hsuan Yang, Li-Jia Li, Mingsheng Long, and Li Fei- Fei. Eidetic 3d LSTM: A model for video prediction and beyond. In International conference on learning representations, 2018. [82] Rohollah Soltani and Hui Jiang. Higher order recurrent neural networks. arXiv preprint arXiv:1605.00064, 2016. [83] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Tel- garsky. Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research, 15(1):2773?2832, 2014. [84] Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pages 1?5. IEEE, 2015. [85] Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research, 19(1):1947?1980, 2018. [86] Jean Kossaifi, Adrian Bulat, Georgios Tzimiropoulos, and Maja Pantic. T-net: Parametrizing fully convolutional nets with a single high-order tensor. In Proceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7822?7831, 2019. [87] Yongxin Yang and Timothy Hospedales. Deep multi-task representation learning: A tensor factorisation approach. arXiv preprint arXiv:1605.06391, 2016. 181 [88] Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura. Compressing recurrent neu- ral network with tensor train. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 4451?4458. IEEE, 2017. [89] Xindian Ma, Peng Zhang, Shuai Zhang, Nan Duan, Yuexian Hou, Ming Zhou, and Dawei Song. A tensorized transformer for language modeling. Advances in Neural Information Processing Systems, 32, 2019. [90] William Lotter, Gabriel Kreiman, and David Cox. Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104, 2016. [91] Wonmin Byeon, Qin Wang, Rupesh Kumar Srivastava, and Petros Koumoutsakos. Contextvp: Fully context-aware video prediction. In Proceedings of the European Conference on Computer Vision (ECCV), pages 753?769, 2018. [92] Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physi- cal interaction through video prediction. Advances in neural information processing systems, 29, 2016. [93] Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, and Honglak Lee. De- composing motion and content for natural video sequence prediction. arXiv preprint arXiv:1706.08033, 2017. [94] Emily L Denton et al. Unsupervised learning of disentangled representations from video. Advances in neural information processing systems, 30, 2017. [95] Jun-Ting Hsieh, Bingbin Liu, De-An Huang, Li F Fei-Fei, and Juan Carlos Niebles. Learning to decompose and disentangle representations for video prediction. Advances in neural information processing systems, 31, 2018. [96] Sepp Hochreiter and Ju?rgen Schmidhuber. Long short-term memory. Neural compu- tation, 9(8):1735?1780, 1997. [97] Marijn F Stollenga, Wonmin Byeon, Marcus Liwicki, and Juergen Schmidhuber. Par- allel multi-dimensional LSTM, with application to fast biomedical volumetric image segmentation. Advances in neural information processing systems, 28, 2015. [98] Christian Schuldt, Ivan Laptev, and Barbara Caputo. Recognizing human actions: a local SVM approach. 
In Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004., volume 3, pages 32?36. IEEE, 2004. [99] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The? something something? video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pages 5842?5850, 2017. 182 [100] Github repo. https://github.com/NVIDIA/apex, 2018. [Online; accessed 20- October-2020]. [101] Github repo. https://github.com/Yunbo426/predrnn-pp, 2019. [Online; accessed 20-October-2020]. [102] Github repo. https://github.com/google/e3d_lstm, 2019. [Online; accessed 20- October-2020]. [103] Nal Kalchbrenner, Aa?ron Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Video pixel networks. In International Conference on Machine Learning, pages 1771?1779. PMLR, 2017. [104] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre- training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Com- putational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171?4186, 2019. [105] Dirk Weissenborn, Oscar Ta?ckstro?m, and Jakob Uszkoreit. Scaling autoregressive video models. In International Conference on Learning Representations, 2019. [106] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In International Conference on Machine Learning, pages 4055?4064. PMLR, 2018. [107] Jacob Menick and Nal Kalchbrenner. Generating high fidelity images with sub- scale pixel networks and multidimensional upscaling. In International Conference on Learning Representations, 2018. [108] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feed- forward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249?256. JMLR Workshop and Conference Proceedings, 2010. [109] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sam- pling for sequence prediction with recurrent neural networks. Advances in neural information processing systems, 28, 2015. [110] Zhou Wang, Alan C Bovik, Hamid R Sheikh, Eero P Simoncelli, et al. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600?612, 2004. [111] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586?595, 2018. 183 [112] Github repo. https://github.com/jthsieh/DDPAE-video-prediction/blob/ master/data/moving_mnist.py, 2018. [Online; accessed 20-October-2020]. [113] Hao Wang and Dit-Yan Yeung. Towards bayesian deep learning: A framework and some existing methods. IEEE Transactions on Knowledge and Data Engineering, 28(12):3395?3408, 2016. [114] Yarin Gal. Uncertainty in deep learning. PhD thesis, University of Cambridge, 2016. [115] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Ben- gio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. 
arXiv preprint arXiv:1602.02830, 2016.
[116] Kumar Shridhar, Felix Laumann, and Marcus Liwicki. A comprehensive guide to Bayesian convolutional neural network with variational inference. arXiv preprint arXiv:1901.02731, 2019.
[117] Daniel Soudry, Itay Hubara, and Ron Meir. Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. Advances in Neural Information Processing Systems, 27, 2014.
[118] José Miguel Hernández-Lobato and Ryan Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pages 1861–1869, 2015.
[119] Soumya Ghosh, Francesco Maria Delle Fave, and Jonathan Yedidia. Assumed density filtering methods for learning Bayesian neural networks. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[120] Jiahao Su, Milan Cvitkovic, and Furong Huang. Sampling-free learning of Bayesian quantized neural networks. In International Conference on Learning Representations, 2019.
[121] Elina Robeva and Anna Seigal. Duality of graphical models and tensor networks. Information and Inference: A Journal of the IMA, 2017.
[122] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. Advances in Neural Information Processing Systems, 28, 2015.
[123] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
[124] David P Williamson and David B Shmoys. The design of approximation algorithms. Cambridge University Press, 2011.
[125] Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.
[126] Alex Graves. Practical variational inference for neural networks. Advances in Neural Information Processing Systems, 24, 2011.
[127] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
[128] Hao Wang, Xingjian Shi, and Dit-Yan Yeung. Natural-parameter networks: A class of probabilistic neural networks. Advances in Neural Information Processing Systems, 29, 2016.
[129] Alexander Shekhovtsov and Boris Flach. Feed-forward propagation in probabilistic neural networks with categorical and max layers. In International Conference on Learning Representations, 2018.
[130] Jochen Gast and Stefan Roth. Lightweight probabilistic deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3369–3378, 2018.
[131] Thomas Peter Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, Massachusetts Institute of Technology, 2001.
[132] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
[133] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
[134] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
[135] Anqi Wu, Sebastian Nowozin, Edward Meeds, Richard E Turner, José Miguel Hernández-Lobato, and Alexander L Gaunt. Deterministic variational inference for robust Bayesian neural networks. In International Conference on Learning Representations, 2018.
[136] Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.
[137] Minje Kim and Paris Smaragdis. Bitwise neural networks. arXiv preprint arXiv:1601.06071, 2016.
[138] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
[139] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18(1):6869–6898, 2017.
[140] Steve K Esser, Rathinakumar Appuswamy, Paul Merolla, John V Arthur, and Dharmendra S Modha. Backpropagation for energy-efficient neuromorphic computing. Advances in Neural Information Processing Systems, 28, 2015.
[141] Oran Shayer, Dan Levi, and Ethan Fetaya. Learning discrete weights using the local reparameterization trick. In International Conference on Learning Representations, 2018.
[142] Jorn WT Peters and Max Welling. Probabilistic binary neural networks. arXiv preprint arXiv:1809.03368, 2018.
[143] Elizabeth Newman, Lior Horesh, Haim Avron, and Misha Kilmer. Stable tensor neural networks for rapid deep learning. arXiv preprint arXiv:1811.06569, 2018.
[144] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. Understanding the effective receptive field in deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 4898–4906, 2016.
[145] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[146] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[147] Fisher Yu, Vladlen Koltun, and Thomas Funkhouser. Dilated residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 472–480, 2017.
[148] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
[149] Matthias Holschneider, Richard Kronland-Martinet, Jean Morlet, and Ph Tchamitchian. A real-time algorithm for signal analysis with the help of the wavelet transform. In Wavelets, pages 286–297. Springer, 1990.
[150] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[151] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2017.
[152] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, pages 379–387, 2016.
[153] Yanghao Li, Yuntao Chen, Naiyan Wang, and Zhaoxiang Zhang. Scale-aware trident networks for object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 6054–6063, 2019.
[154] Zhengyang Wang and Shuiwang Ji. Smoothed dilated convolutions for improved dense prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2486–2495, 2018.
[155] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
[156] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 764–773, 2017.
[157] Yunho Jeon and Junmo Kim. Active convolution: Learning the shape of convolution for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4201–4209, 2017.
[158] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable ConvNets v2: More deformable, better results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9308–9316, 2019.
[159] Ozan Oktay, Jo Schlemper, Loic Le Folgoc, Matthew Lee, Mattias Heinrich, Kazunari Misawa, Kensaku Mori, Steven McDonagh, Nils Y Hammerla, Bernhard Kainz, et al. Attention U-Net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999, 2018.
[160] Wonmin Byeon, Thomas M Breuel, Federico Raue, and Marcus Liwicki. Scene labeling with LSTM recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3547–3555, 2015.
[161] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
[162] Nal Kalchbrenner, Ivo Danihelka, and Alex Graves. Grid long short-term memory. arXiv preprint arXiv:1507.01526, 2015.
[163] Sifei Liu, Jinshan Pan, and Ming-Hsuan Yang. Learning recursive filters for low-level vision via a hybrid neural network. In European Conference on Computer Vision, pages 560–576. Springer, 2016.
[164] Sifei Liu, Shalini De Mello, Jinwei Gu, Guangyu Zhong, Ming-Hsuan Yang, and Jan Kautz. Learning affinity via spatial propagation networks. In Advances in Neural Information Processing Systems, pages 1520–1530, 2017.
[165] James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. Quasi-recurrent neural networks. arXiv preprint arXiv:1611.01576, 2016.
[166] George EP Box, Gwilym M Jenkins, Gregory C Reinsel, and Greta M Ljung. Time series analysis: Forecasting and control. John Wiley & Sons, 2015.
[167] Jiahao Su, Shiqi Wang, and Furong Huang. ARMA nets: Expanding receptive field for dense prediction. Advances in Neural Information Processing Systems, 33:17696–17707, 2020.
[168] Jae S Lim. Two-dimensional signal and image processing. Prentice-Hall, Inc., 1990.
[169] Alan V Oppenheim, John R Buck, and Ronald W Schafer. Discrete-time signal processing. Prentice-Hall, 2014.
[170] Dimitri P Bertsekas. Convex optimization algorithms. Athena Scientific, 2015.
[171] Zhengyang Wang, Na Zou, Dinggang Shen, and Shuiwang Ji. Non-local U-Nets for biomedical image segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 6315–6322, 2020.
[172] Webpage. https://challenge2018.isic-archive.com. [Online; accessed 29-May-2020].
[173] GitHub repository. https://github.com/jthsieh/DDPAE-video-prediction/blob/master/data/moving_mnist.py. [Online; accessed 05-Jan-2020].
[174] Noel CF Codella, David Gutman, M Emre Celebi, Brian Helba, Michael A Marchetti, Stephen W Dusza, Aadi Kalloo, Konstantinos Liopyris, Nabin Mishra, Harald Kittler, et al. Skin lesion analysis toward melanoma detection: A challenge at the 2017 International Symposium on Biomedical Imaging (ISBI), hosted by the International Skin Imaging Collaboration (ISIC). In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pages 168–172. IEEE, 2018.
[175] PP Vaidyanathan. Multirate systems and filter banks. Prentice-Hall, Inc., 1993.
[176] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
[177] Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In International Conference on Learning Representations, 2018.
[178] Kui Jia, Shuai Li, Yuxin Wen, Tongliang Liu, and Dacheng Tao. Orthogonal deep neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[179] Jiong Zhang, Qi Lei, and Inderjit Dhillon. Stabilizing gradients for deep neural networks via efficient SVD parameterization. In International Conference on Machine Learning, pages 5806–5814. PMLR, 2018.
[180] Haozhi Qi, Chong You, Xiaolong Wang, Yi Ma, and Jitendra Malik. Deep isometric learning for visual recognition. In International Conference on Machine Learning, pages 7824–7835. PMLR, 2020.
[181] Cem Anil, James Lucas, and Roger Grosse. Sorting out Lipschitz function approximation. In International Conference on Machine Learning, pages 291–301. PMLR, 2019.
[182] Qiyang Li, Saminul Haque, Cem Anil, James Lucas, Roger B Grosse, and Jörn-Henrik Jacobsen. Preventing gradient attenuation in Lipschitz constrained convolutional networks. In Advances in Neural Information Processing Systems, pages 15390–15402, 2019.
[183] Asher Trockman and J Zico Kolter. Orthogonalizing convolutional layers with the Cayley transform. In International Conference on Learning Representations, 2021.
[184] Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep learning through dynamical isometry: Theory and practice. Advances in Neural Information Processing Systems, 30, 2017.
[185] Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. The emergence of spectral universality in deep networks. In International Conference on Artificial Intelligence and Statistics, pages 1924–1932. PMLR, 2018.
[186] Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. In International Conference on Machine Learning, pages 5393–5402. PMLR, 2018.
[187] Jiayun Wang, Yubei Chen, Rudrasis Chakraborty, and Stella X Yu. Orthogonal convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11505–11515, 2020.
[188] Ricky TQ Chen, Jens Behrmann, David K Duvenaud, and Jörn-Henrik Jacobsen. Residual flows for invertible generative modeling. Advances in Neural Information Processing Systems, 32, 2019.
[189] Kui Jia, Dacheng Tao, Shenghua Gao, and Xiangmin Xu. Improving training of deep neural networks via singular value bounding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4344–4352, 2017.
[190] Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. In International Conference on Machine Learning, pages 854–863. PMLR, 2017.
[191] Hanie Sedghi, Vineet Gupta, and Philip M Long. The singular values of convolutional layers. In International Conference on Learning Representations, 2019.
[192] Alan V Oppenheim, Alan S Willsky, and S Hamid Nawab. Signals & Systems (2nd Ed.). Prentice-Hall, Inc., 1996.
[193] Minmin Chen, Jeffrey Pennington, and Samuel Schoenholz. Dynamical isometry and a mean field theory of RNNs: Gating enables signal propagation in recurrent neural networks. In International Conference on Machine Learning, pages 873–882. PMLR, 2018.
[194] Jens Behrmann, Will Grathwohl, Ricky TQ Chen, David Duvenaud, and Jörn-Henrik Jacobsen. Invertible residual networks. In International Conference on Machine Learning, pages 573–582. PMLR, 2019.
[195] Di Xie, Jiang Xiong, and Shiliang Pu. All you need is beyond a good init: Exploring better solution for training extremely deep convolutional neural networks with orthonormality and modulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6176–6185, 2017.
[196] Nitin Bansal, Xiaohan Chen, and Zhangyang Wang. Can we gain more from orthogonality regularizations in training deep networks? Advances in Neural Information Processing Systems, 31, 2018.
[197] Zakaria Mhammedi, Andrew Hellicar, Ashfaqur Rahman, and James Bailey. Efficient orthogonal parametrisation of recurrent neural networks using Householder reflections. In International Conference on Machine Learning, pages 2401–2409. PMLR, 2017.
[198] Victor Dorobantu, Per Andre Stromhaug, and Jess Renteria. DizzyRNN: Reparameterizing recurrent neural networks for norm-preserving backpropagation. arXiv preprint arXiv:1612.04035, 2016.
[199] Kyle Helfrich, Devin Willmott, and Qiang Ye. Orthogonal recurrent neural networks with scaled Cayley transform. In International Conference on Machine Learning, pages 1969–1978. PMLR, 2018.
[200] Mario Lezcano-Casado and David Martínez-Rubio. Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group. In International Conference on Machine Learning, pages 3794–3803. PMLR, 2019.
[201] Lei Huang, Li Liu, Fan Zhu, Diwen Wan, Zehuan Yuan, Bo Li, and Ling Shao. Controllable orthogonalization in training DNNs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6429–6438, 2020.
[202] Eugene Vorontsov, Chiheb Trabelsi, Samuel Kadoury, and Chris Pal. On orthogonality and learning recurrent networks with long term dependencies. In International Conference on Machine Learning, pages 3570–3578, 2017.
[203] Jun Li, Fuxin Li, and Sinisa Todorovic. Efficient Riemannian optimization on the Stiefel manifold via the Cayley transform. In International Conference on Learning Representations, 2019.
[204] Sheng Liu, Xiao Li, Yuexiang Zhai, Chong You, Zhihui Zhu, Carlos Fernandez-Granda, and Qing Qu. Convolutional normalization: Improving deep convolutional network robustness and training. Advances in Neural Information Processing Systems, 34, 2021.
[205] Sahil Singla and Soheil Feizi. Skew orthogonal convolutions. In International Conference on Machine Learning, pages 9756–9766. PMLR, 2021.
[206] Gilbert Strang and Truong Nguyen. Wavelets and filter banks. SIAM, 1996.
[207] Yuan-Pei Lin and PP Vaidyanathan. Theory and design of two-dimensional filter banks: A review. Multidimensional Systems and Signal Processing, 7(3–4):263–330, 1996.
[208] Shankar Venkataraman and Bernard C Levy. A comparison of design methods for 2D FIR orthogonal perfect reconstruction filter banks. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 42(8):525–536, 1995.
[209] Jianping Zhou. Multidimensional multirate systems: Characterization, design, and applications. University of Illinois at Urbana-Champaign, 2005.
[210] Barry Hurley and Ted Hurley. Paraunitary matrices. arXiv preprint arXiv:1205.0703, 2012.
[211] Jiahao Su, Wonmin Byeon, and Furong Huang. Scaling-up diverse orthogonal convolutional networks with a paraunitary framework. arXiv preprint arXiv:2106.09121, 2021.
[212] Kehelwala DG Maduranga, Kyle E Helfrich, and Qiang Ye. Complex unitary recurrent neural networks using scaled Cayley transform. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4528–4535, 2019.
[213] Nicholas J Higham. The scaling and squaring method for the matrix exponential revisited. SIAM Review, 51(4):747–764, 2009.
[214] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018.
[215] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114. PMLR, 2019.
[216] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
[217] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
[218] Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. Advances in Neural Information Processing Systems, 31, 2018.
[219] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.
[220] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pages 116–131, 2018.
[221] Mikael Henaff, Arthur Szlam, and Yann LeCun. Recurrent orthogonal networks and long-memory tasks. In International Conference on Machine Learning, pages 2034–2042. PMLR, 2016.
[222] Eric Wong, Frank Schmidt, Jan Hendrik Metzen, and J Zico Kolter. Scaling provable adversarial defenses. Advances in Neural Information Processing Systems, 31, 2018.
[223] Mario Lezcano Casado. Trivializations for gradient-based optimization on manifolds. Advances in Neural Information Processing Systems, 32, 2019.
[224] George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57):1–64, 2021.
[225] Ivan Kobyzev, Simon Prince, and Marcus Brubaker. Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[226] Will Grathwohl, Ricky TQ Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. In International Conference on Learning Representations, 2018.