ABSTRACT

Title of Dissertation: CONTROL THEORY-INSPIRED ACCELERATION OF THE GRADIENT-DESCENT METHOD: CENTRALIZED AND DISTRIBUTED

Kushal Chakrabarti
Doctor of Philosophy, 2022

Dissertation Directed by: Professor Nikhil Chopra
Department of Electrical and Computer Engineering

Mathematical optimization problems are prevalent across various disciplines in science and engineering. Particularly in electrical engineering, convex and non-convex optimization problems are well-known in signal processing, estimation, control, and machine learning research. In many of these contemporary applications, the data points are dispersed over several sources. Restrictions such as industrial competition, administrative regulations, and user privacy have motivated significant research on distributed optimization algorithms for solving such data-driven modeling problems. The traditional gradient-descent method can solve optimization problems with differentiable cost functions. However, the speed of convergence of the gradient-descent method and its accelerated variants is highly influenced by the conditioning of the optimization problem being solved. Specifically, when the cost is ill-conditioned, these methods (i) require many iterations to converge and (ii) are highly unstable against process noise. In this dissertation, we propose novel optimization algorithms, inspired by control-theoretic tools, that can significantly attenuate the influence of the problem's conditioning. First, we consider solving the linear regression problem in a distributed server-agent network. We propose the Iteratively Pre-conditioned Gradient-Descent (IPG) algorithm to mitigate the deleterious impact of the data points' conditioning on the convergence rate. We show that the IPG algorithm has an improved rate of convergence in comparison to both the classical and the accelerated gradient-descent methods.
We further study the robustness of IPG against system noise and extend the idea of iterative pre-conditioning to stochastic settings, where the server updates the estimate based on a randomly selected data point at every iteration. In the same distributed environment, we present theoretical results on the local convergence of IPG for solving convex optimization problems. Next, we consider solving a system of linear equations in peer-to-peer multi-agent networks and propose a decentralized pre-conditioning technique. The proposed algorithm converges linearly, with an improved convergence rate compared to decentralized gradient-descent. Considering the practical scenario where the computations performed by the agents are corrupted, or a communication delay exists between them, we study the robustness guarantees of the proposed algorithm and a variant of it. We apply the proposed algorithm to solving decentralized state estimation problems. Further, we develop a generic framework for adaptive gradient methods that solve non-convex optimization problems. Here, we model the adaptive gradient methods in a state-space framework, which allows us to exploit control-theoretic methodology in analyzing Adam and its prominent variants. We then utilize the classical transfer function paradigm to propose new variants of a few existing adaptive gradient methods. Applications on benchmark machine learning tasks demonstrate our proposed algorithms' efficiency. Our findings suggest further exploration of the existing tools from control theory in complex machine learning problems. The dissertation concludes by showing that the potential of the previously mentioned idea of IPG goes beyond solving generic optimization problems, through the development of a novel distributed beamforming algorithm and a novel observer for nonlinear dynamical systems, where IPG's robustness serves as a foundation in our designs.
The proposed IPG for distributed beamforming (IPG-DB) facilitates rapid establishment of communication links with far-field targets while jamming potential adversaries, without assuming any feedback from the receivers, subject to unknown multipath fading in realistic environments. The proposed IPG observer utilizes a non-symmetric pre-conditioner, like IPG, as an approximation of the inverse Jacobian of the observability mapping, such that it asymptotically replicates the Newton observer with the additional advantage of enhanced robustness against measurement noise. Empirical results are presented, demonstrating the efficiency of both of these methods compared to the existing methodologies.

CONTROL THEORY-INSPIRED ACCELERATION OF THE GRADIENT-DESCENT METHOD: CENTRALIZED AND DISTRIBUTED

by

Kushal Chakrabarti

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2022

Advisory Committee:
Professor Nikhil Chopra, Chair/Advisor
Professor Nuno Martins
Professor Richard J. La
Professor André Tits
Professor Pratap Tokekar

© Copyright by Kushal Chakrabarti 2022

to my beloved Ishita

Acknowledgments

This dissertation is one of my proudest possessions, and it would not be in its final form without the help of others. First and foremost, I would like to express my sincere gratitude to my advisor, Professor Nikhil Chopra, for giving me an invaluable opportunity to work on challenging and extremely interesting projects over the past five years. He has always made himself available for help and advice, and there has never been an occasion when I have knocked on his door and he has not given me time. It has been a pleasure to work with and learn from such an extraordinary individual.
Thanks are due to Professor Nuno Martins, Professor Richard La, Professor André Tits, and Professor Pratap Tokekar for agreeing to serve on my thesis committee and for sparing their invaluable time to review the manuscript. Thanks also to the other faculty members and teaching assistants at the university for the coursework that established my foundation for research. My colleagues have enriched my professional life in many ways and deserve a special mention. In particular, a significant part of this research was carried out in collaboration with Nirupam Gupta, Jeffrey N. Twigg, Fikadu T. Dagefu, and Amrit S. Bedi. I would like to acknowledge financial support from the Petroleum Institute at UAE, the United States Department of Agriculture, and the Army Research Laboratory for parts of the projects discussed herein.

I want to thank my friends here for making my stay memorable. My warmest thanks go to my family for their support, love, encouragement, and patience. And lastly, in Einstein's words: Thanks to those who said no; because of them, I did it myself.

Table of Contents

Dedication . . . ii
Acknowledgements . . . iii
Table of Contents . . . v
List of Tables . . . ix
List of Figures . . . xi

Chapter 1: Introduction . . . 1

Chapter 2: Distributed Linear Regression in Server-Agent Network . . . 8
2.1 Introduction . . . 8
  2.1.1 Background on gradient-descent method . . . 10
  2.1.2 Related work . . . 11
  2.1.3 System noise . . . 12
  2.1.4 Stochastic settings . . . 13
  2.1.5 Summary of our contributions . . . 15
2.2 Proposed algorithm: Iteratively Pre-Conditioned Gradient-Descent (IPG) . . . 18
  2.2.1 Motivation for IPG . . . 18
  2.2.2 Steps in each iteration t . . . 19
  2.2.3 Algorithm complexity . . . 21
  2.2.4 Convergence guarantees . . . 22
2.3 Comparisons with the existing methods . . . 25
  2.3.1 The gradient-descent method . . . 26
2.4 Robustness of the IPG method . . . 27
  2.4.1 Notation, assumption, and prior results . . . 27
  2.4.2 Robustness against observation noise . . . 29
  2.4.3 Robustness against process noise . . . 31
2.5 SGD with iterative pre-conditioning . . . 34
  2.5.1 Steps in each iteration t . . . 35
  2.5.2 Notation and assumptions . . . 37
  2.5.3 Convergence guarantees . . . 40
2.6 Experimental results . . . 41
  2.6.1 Stochastic settings . . . 44
2.7 Summary . . . 49

Chapter 3: Decentralized Linear Regression in Peer-to-Peer Network . . . 52
3.1 Introduction . . . 52
  3.1.1 Background on decentralized gradient-descent . . . 53
  3.1.2 Related work . . . 54
  3.1.3 Communication delay . . . 56
  3.1.4 Summary of our contributions . . . 57
3.2 Proposed algorithm . . . 59
  3.2.1 Convergence guarantee . . . 61
  3.2.2 Robustness against computational process noise . . . 64
3.3 Comparison with decentralized gradient-descent . . . 66
3.4 Multiple-solutions case . . . 68
3.5 Directed-graph case . . . 69
3.6 Application: decentralized state estimation . . . 70
3.7 Proposed algorithm in presence of communication delay . . . 74
  3.7.1 Convergence guarantee . . . 76
3.8 Experimental results . . . 77
  3.8.1 Comparison with decentralized Kalman filter . . . 80
  3.8.2 In presence of delay . . . 81
3.9 Summary . . . 82

Chapter 4: Distributed Convex Optimization in Server-Agent Network . . . 84
4.1 Introduction . . . 84
  4.1.1 Summary of our contributions . . . 89
4.2 Proposed algorithm: Iteratively Pre-conditioned Gradient-descent (IPG) . . . 90
  4.2.1 Steps in each iteration t . . . 91
  4.2.2 Convergence guarantees . . . 92
4.3 Experimental results . . . 97
  4.3.1 Distributed noisy quadratic model . . . 97
  4.3.2 Distributed logistic regression . . . 99
4.4 Summary and future work . . . 102

Chapter 5: Non-Convex Optimization . . . 104
5.1 Introduction . . . 104
  5.1.1 Related work . . . 110
  5.1.2 Our contributions . . . 112
5.2 Proposed algorithm: Generalized AdaGrad (G-AdaGrad) . . . 117
  5.2.1 Description of G-AdaGrad . . . 117
  5.2.2 Convergence of G-AdaGrad . . . 119
5.3 State-model of Adam and AdaBelief . . . 122
5.4 A general adaptive gradient algorithm . . . 124
5.5 Continuous-time AdaBound . . . 129
  5.5.1 State-space model of AdaBound . . . 129
  5.5.2 Convergence of AdaBound . . . 132
5.6 Convergence analysis of Nadam . . . 132
5.7 Convergence analysis of RAdam . . . 133
5.8 Convergence of rescaled gradient flow . . . 134
5.9 Proposed algorithm: AdamSSM . . . 135
5.10 Proposed algorithm: NadamSSM . . . 137
5.11 Proposed algorithm: MAdamSSM . . . 138
5.12 Experimental results . . . 140
  5.12.1 G-AdaGrad . . . 141
  5.12.2 AdamSSM . . . 142
  5.12.3 NadamSSM . . . 146
  5.12.4 MAdamSSM . . . 148
5.13 Summary and discussions . . . 150

Chapter 6: Distributed Beamforming . . . 156
6.1 Introduction . . . 156
  6.1.1 Related works . . . 156
  6.1.2 Summary of our contributions . . . 158
6.2 Problem formulation . . . 159
6.3 Proposed algorithm: Iteratively Pre-conditioned Gradient-descent for Distributed Beamforming . . . 161
  6.3.1 Steps in each iteration t ≥ 0 . . . 165
6.4 Experimental results . . . 168
6.5 Summary . . . 170

Chapter 7: Nonlinear Observer . . . 172
7.1 Introduction . . . 172
  7.1.1 Prior works . . . 173
  7.1.2 Summary of our contributions . . . 176
7.2 Problem formulation . . . 177
7.3 Proposed observer . . . 178
  7.3.1 Steps in each sampling instant k ≥ N . . . 180
  7.3.2 Step-size selection . . . 181
  7.3.3 Relation with extended Kalman filter . . . 183
7.4 Experimental results . . . 183
7.5 Summary . . . 189

Chapter 8: Summary and Future Work . . . 191
8.1 Completed work . . . 191
8.2 Future work . . . 193

Appendix A: Proofs of the Theoretical Results . . . 196
A.1 Proof of Theorem 2.1 . . . 196
A.2 Proof of Corollary 2.1 . . . 202
A.3 Proof of Lemma 2.1 . . . 203
A.4 Proof of Theorem 2.2 . . . 203
A.5 Proof of Theorem 2.3 . . . 204
A.6 Proof of Theorem 2.4 . . . 207
A.7 Proof of Theorem 2.5 . . . 210
  A.7.1 Preliminary results . . . 210
  A.7.2 Notations . . . 213
  A.7.3 Convergence of the pre-conditioner matrix . . . 214
  A.7.4 Proof of the theorem . . . 219
A.8 Proof of Lemma 3.1 . . . 231
A.9 Proof of Theorem 3.1 . . . 233
A.10 Proof of Theorem 3.2 . . . 237
A.11 Proof of Theorem 3.3 . . . 238
A.12 Proof of Theorem 3.4 . . . 242
A.13 Proof of Lemma 3.2 . . . 243
A.14 Proof of Lemma 3.3 . . . 244
A.15 Proof of Theorem 3.5 . . . 248
A.16 Proof of Lemma 4.1 . . . 249
A.17 Proof of Theorem 4.1 . . . 250
A.18 Proof of Theorem 4.2 . . . 255
A.19 Proof of Theorem 5.2 . . . 255
A.20 Proof of Theorem 5.3 . . . 261
A.21 Proof of Theorem 5.4 . . . 264
A.22 Proof of Theorem 5.5 . . . 266
A.23 Proof of Theorem 5.6 . . . 267

Bibliography . . . 269

List of Tables

2.1 Comparisons between the theoretical convergence rates of different algorithms for distributed linear regression. . . . 16
2.2 Additional notation for analysis of IPSG. . . . 38
2.3 The parameters used in different algorithms for their minimum convergence rate on distributed linear regression experiments. . . . 48
2.4 The number of iterations required by different algorithms to attain relative estimation error ϵ_tol on distributed linear regression experiments. . . . 48
2.5 Comparisons between the final estimation errors lim_{t→∞} ∥x(t) − x∗∥ for different algorithms on distributed linear regression experiments. . . . 49
2.6 The parameters used in different stochastic algorithms on distributed linear regression experiments. . . . 50
2.7 Comparisons between the number of iterations required by different stochastic algorithms to attain the specified values of the relative estimation error ϵ_tol = ∥x(t) − x∗∥ / ∥x(0) − x∗∥ on distributed linear regression experiments. . . . 51
3.1 Required notations for solving decentralized linear equations subject to communication delay. . . . 75
3.2 Comparisons between the performance of different algorithms on decentralized state prediction experiments. . . . 81
4.1 Comparison between convergence rate and per-iteration computational complexity of different algorithms for solving distributed convex optimization problems; t is the number of iterations. . . . 88
4.2 The parameters used in different algorithms on distributed convex optimization experiments. . . . 100
4.3 Comparisons between the number of iterations and total floating-point operations (flops) required by different algorithms to attain estimation error ∥x(t) − x∗∥ = 0.05 for the NQM. . . . 100
4.4 Comparisons between the number of iterations and total floating-point operations (flops) required by different algorithms to attain ∥g(t)∥ = 0.02 for binary classification on MNIST data, subject to process noise. . . . 101
5.1 Comparisons between best training accuracy, best test accuracy, and the number of training epochs required to achieve these accuracies for different algorithms on the image classification task with ResNet34. . . . 149
5.2 Comparisons between best training accuracy, best test accuracy, and the number of training epochs required to achieve these accuracies for different algorithms on the image classification task with VGG11. . . . 150
5.3 Comparisons between best training set perplexity, best test set perplexity, and the number of training epochs required to achieve these perplexities for different algorithms on the language modeling task with 1-layer LSTM. . . . 150
5.4 Comparisons between best training set perplexity, best test set perplexity, and the number of training epochs required to achieve these perplexities for different algorithms on the language modeling task with 2-layer LSTM. . . . 151
5.5 Comparisons between best training set perplexity, best test set perplexity, and the number of training epochs required to achieve these perplexities for different algorithms on the language modeling task with 3-layer LSTM. . . . 151
5.6 Comparisons between the mean (and std.) of the best training and test accuracies, and the number of training epochs required to achieve them, for the MAdamSSM (proposed) and MAdam algorithms over five runs. . . . 152
5.7 Comparisons between the best training and test accuracies, and the number of training epochs required to achieve them, for the MAdamSSM (proposed) and MAdam algorithms applied to train ResNet34 with CIFAR-10 data over five runs. . . . 152

List of Figures

2.1 Distributed system architecture. . . . 8
2.2 ∥x(t) − x∗∥ under Algorithm 1 with different initializations on "ash608". α = 0.1, δ = 1, β = 0. . . . 41
2.3 ∥x(t) − x∗∥ under Algorithm 1 with different initializations on "gr_30_30". α = 3 × 10⁻³, δ = 0.4, β = 0. . . . 42
2.4 ∥x(t) − x∗∥ in absence of noise for different algorithms on "bcsstm07". . . . 42
2.5 ∥x(t) − x∗∥ in absence of noise for different algorithms on "can_229". . . . 43
2.6 ∥x(t) − x∗∥ in presence of observation noise for different algorithms on "ash608". . . . 44
2.7 ∥x(t) − x∗∥ in presence of process noise for different algorithms on "ash608". . . . 44
2.8 ∥x(t) − x∗∥ in presence of observation noise for different algorithms on "gr_30_30". . . . 45
2.9 ∥x(t) − x∗∥ in presence of process noise for different algorithms on "gr_30_30". . . . 45
2.10 ∥x(t) − x∗∥ for different stochastic algorithms on "MNIST". . . . 46
2.11 ∥x(t) − x∗∥ for different stochastic algorithms on "illc1850". . . . 47
3.1 Error in estimating the true initial state z(0) in presence of computational process noise, for different algorithms. . . . 79
3.2 Error in estimating the system states z(t) in presence of process noise and measurement noise, for Algorithm 3 and DKF. . . . 80
3.3 ∥x_i(t) − x∗∥ under (a) Algorithm 4 and a projection-based algorithm, (b) Algorithm 4, (c) the PI consensus algorithm. . . . 82
4.1 ∥x(t) − x∗∥ for different algorithms on the noisy quadratic model. . . . 97
4.2 ∥g(t)∥ for different algorithms on MNIST. . . . 99
5.1 Decision boundary in the a1–a2 plane, obtained from training a linear regression model for classification of digit-1 and digit-5 using G-AdaGrad. The data points from the MNIST training set are plotted in the a1–a2 plane. . . . 141
5.2 Decision boundary in the a1–a2 plane, obtained from training a linear regression model for classification of digit-1 and digit-5 using G-AdaGrad. The data points from the MNIST test set are plotted in the a1–a2 plane. . . . 141
5.3 (1/2) ∥Ax(t) − b∥² − f∗ of linear regression for classifying digit-1 and digit-5 from the MNIST dataset with G-AdaGrad. . . . 141
5.4 Training loss of the logistic regression model for classifying digit-1 and digit-5 from the MNIST dataset with G-AdaGrad. . . . 142
5.5 Test set accuracy for the image classification task on the CIFAR-10 dataset with the ResNet34 architecture trained with different algorithms. . . . 143
5.6 Test set accuracy for the image classification task on the CIFAR-10 dataset with the VGG11 architecture trained with different algorithms. . . . 143
5.7 Test set perplexity for the language modeling task on the PTB dataset with the 1-layer LSTM architecture trained with different algorithms. . . . 144
5.8 Test set perplexity for the language modeling task on the PTB dataset with the 3-layer LSTM architecture trained with different algorithms. . . . 146
5.9 Test set perplexity for the language modeling task on the PTB dataset with the 2-layer LSTM architecture trained with different algorithms. . . . 147
5.10 Second raw moment estimate of the gradient along dimension i = 811, for the image classification task on the CIFAR-10 dataset with the VGG11 architecture trained with the Adam and the proposed AdamSSM algorithms. . . . 154
5.11 Second raw moment estimate of the gradient along dimension i = 1728, for the image classification task on the CIFAR-10 dataset with the VGG11 architecture trained with the Adam and the proposed AdamSSM algorithms. . . . 155
6.1 Cartesian coordinates of n = 19 beamforming agents in the synthetic problem. . . . 168
6.2 Array factors constructed by the GD algorithm for solving the synthetic beamforming problem, after different numbers of iterations t. . . . 169
6.3 Array factors {|AF_i|, i = 1, …, 49} constructed by the IPG-DB algorithm for solving the synthetic beamforming problem, after different numbers of iterations t. . . . 169
6.4 A section of Los Angeles used as the environment for the beamforming problem. . . . 170
6.5 CAD model of the section of Los Angeles from Figure 6.4. . . . 170
6.6 Cartesian coordinates of n = 5 beamforming agents in the LA urban environment problem. . . . 171
6.7 Array factors for the electric field along the z-direction constructed by the IPG-DB algorithm for solving the beamforming problem in the LA urban environment, initially (t = 0) and after t = 25000 iterations. . . . 171
7.1 State estimation error of different observers for the bio-reactor system (7.13)-(7.16) with measurement noise standard deviation 0.01. . . . 184
7.2 State estimation error of different observers for the bio-reactor system (7.13)-(7.16) with measurement noise standard deviation 0.01. . . . 185
7.3 State estimation error of different observers for the bio-reactor system (7.13)-(7.16) with measurement noise standard deviation 0.001. . . . 186
7.4 State estimation error of different observers for the bio-reactor system (7.13)-(7.16) with measurement noise standard deviation 0.001. . . . 187
7.5 State estimation error of different observers for the inverted pendulum-cart system (7.17)-(7.21) with measurement noise standard deviation 0.1. . . . 188
7.6 State estimation error of different observers for the inverted pendulum-cart system (7.17)-(7.21) with measurement noise standard deviation 0.01. . . . 189
7.7 State estimation error of different observers for the inverted pendulum-cart system (7.17)-(7.21) with measurement noise standard deviation 0.1. . . . 189
7.8 State estimation error of different observers for the inverted pendulum-cart system (7.17)-(7.21) with measurement noise standard deviation 0.01. . . . 190
7.9 True system trajectory and its estimate for the system (7.22)-(7.24) obtained by the IPG observer. . . . 190
7.10 True system trajectory and its estimate for the system (7.22)-(7.24) obtained by the LPV observer. . . . 190

Chapter 1: Introduction

Mathematical optimization is a common sub-problem in various disciplines, such as modeling, estimation, control systems, wireless communication, data analysis, finance, and operations research. Especially with the recent data-driven technological advancements, optimization is ubiquitous across applications. A simple optimization technique known as least-squares dates back to 1795, when it was invented by the young scientist Karl Friedrich Gauss [1] for studying planetary motion using telescopic measurement data. Since then, several new optimization methodologies have been developed. The most attractive feature of optimization lies in the constructive formulation of a problem, followed by the availability of sophisticated and thoroughly studied tools to solve the formulated problem reliably and efficiently. Therefore, the focus of this dissertation is on developing novel tools for solving a class of optimization problems, aimed at balancing the decisive triad of performance, efficiency, and reliability.

Specifically, this dissertation considers unconstrained optimization problems, where the objective is to minimize a smooth, differentiable cost function. The classical gradient-descent algorithm, attributed to Cauchy (1847), is the basic first-order optimization algorithm for solving such problems. The gradient-descent algorithm is iterative: the idea is to repeatedly refine an estimate of the minimum point by taking steps in the direction opposite to the gradient of the cost function at the current estimate [2]. In principle, the gradient-descent method eventually converges to a minimum point of the cost.
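To make the update rule concrete, the following is a minimal sketch of classical gradient-descent applied to a small quadratic cost; the example matrix, step size, and iteration count are illustrative choices, not taken from the dissertation.

```python
import numpy as np

def gradient_descent(grad, x0, step_size, num_iters):
    """Classical gradient-descent: repeatedly step opposite the gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - step_size * grad(x)
    return x

# Minimize f(x) = 0.5 * x^T A x - b^T x, whose minimum point solves A x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_min = gradient_descent(lambda x: A @ x - b, x0=[0.0, 0.0],
                         step_size=0.1, num_iters=500)
```

For a quadratic cost with Hessian A, this iteration converges for step sizes below 2/λmax(A), and the number of iterations needed to reach a given accuracy grows with the condition number λmax(A)/λmin(A) — the ill-conditioning issue that motivates the pre-conditioning techniques developed in the later chapters.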
However, the speed of convergence of the gradient-descent method is highly influenced by the conditioning of the optimization problem being solved. Specifically, when the cost is ill-conditioned, this method (i) requires many iterations to converge and (ii) is highly unstable against process noise. This dissertation aims to develop novel optimization algorithms, inspired by classical tools from control theory, that can significantly attenuate this fundamental influence of the problem's conditioning.

In many contemporary applications, the data points are dispersed over several sources (or agents). Due to industrial competition, administrative regulations, and user privacy, it is nearly impossible to collect the individual data from the different agents at a single machine [3]. In recent years, these restrictions have motivated a significant amount of research on distributed algorithms for solving data-driven modeling problems. In most distributed algorithms, an individual agent is not required to transmit its raw data points [3, 4]. In the aforementioned cases, the data itself is distributed over multiple sources or agents. Moreover, with the availability of a large amount of data, it is sometimes necessary to distribute the processing task over multiple machines or agents due to the inherent complexity of the problem in terms of computational and/or memory requirements [5]. Therefore, apart from centralized optimization problems, this dissertation also considers solving optimization problems with a differentiable cost function in a distributed manner.

In Chapters 2-5, we propose our algorithms for solving specific optimization problems in centralized settings and in distributed networks of multiple agents. In Chapters 6-7, we go beyond solving generic optimization problems by considering the distributed beamforming problem and the nonlinear observer problem, wherein we develop new techniques for these two problems by leveraging an idea from Chapter 4.
In Chapter 2, we consider the multi-agent linear least-squares problem in a server-agent network architecture. The system comprises multiple agents, each with a set of local data points. The agents are connected to a server, and there is no inter-agent communication. The agents’ goal is to compute a linear model that optimally fits the collective data. The agents, however, cannot share their data points. In principle, the agents can solve this problem by collaborating with the server using the server-agent network variant of the classical gradient-descent method. However, when the data points are ill-conditioned, the gradient-descent method requires a large number of iterations to converge. We propose a novel iterative pre-conditioning technique to mitigate the deleterious impact of the data points’ conditioning on the convergence rate of the gradient-descent method. The idea of iterative pre-conditioning is inspired by adaptive control techniques. We rigorously show that our proposed algorithm converges with an improved rate of convergence in comparison to both the classical and the accelerated gradient-descent methods. In the presence of system uncertainties, we characterize the proposed algorithm’s robustness against noise. We further extend our algorithm to the stochastic settings, where the server updates the estimate of a minimum point based on a single randomly selected data point at every iteration instead of the full batch of data points. Through numerical experiments on benchmark least-squares problems, we validate our theoretical findings. This work was carried out under the PSIM Project, supported by the Petroleum Institute, Khalifa University of Science and Technology, Abu Dhabi, UAE. The research works in this chapter have been published in [6–9].

In Chapter 3, we consider solving the linear least-squares problem in peer-to-peer multi-agent networks.
Again, each agent has a set of local data points, and its goal is to compute a linear model that fits the collective data points. In principle, the agents can apply the decentralized gradient-descent method. However, when the data matrix is ill-conditioned, gradient-descent requires many iterations to converge and is unstable against system noise. We propose a decentralized pre-conditioning technique to address these issues with the decentralized gradient-descent method. We show that the proposed algorithm has an improved convergence rate compared to the decentralized gradient-descent. Considering the practical scenario where the computations performed by the agents are corrupted, we also study the robustness guarantee of the proposed algorithm. Additionally, we apply the proposed algorithm for solving decentralized linear state estimation problems. The empirical results show our proposed state predictor’s favorable convergence rate and robustness against system noise compared to prominent decentralized algorithms. When the communication links between agents are subject to potentially large but constant delays, we utilize a similar decentralized pre-conditioning technique to propose a delay-tolerant algorithm that solves the linear equations in the decentralized setting. Empirical results show that the accuracy of the solution obtained by the proposed algorithm is better than or comparable to the existing methods. This work was partially supported by the USDA grant 2020-68012-31085 and partially carried out under the PSIM Project, supported by the Petroleum Institute, Khalifa University of Science and Technology, Abu Dhabi, UAE. The research works in this chapter have been published in [10, 11].

In Chapter 4, we study a distributed multi-agent convex optimization problem. The system architecture is the same as in Chapter 2. However, the cost function here is convex and not limited to quadratic cost.
The agents’ goal is to learn a parameter vector that optimizes the aggregate of their local costs without revealing their local data points. We extend the idea of iterative pre-conditioning, proposed in Chapter 2, to convex cost functions and a class of non-convex functions. We present theoretical results on the local convergence of the proposed algorithm. We demonstrate our algorithm’s superior performance compared to prominent distributed algorithms for solving real logistic regression problems and emulating neural network training via a noisy quadratic model, thereby signifying the proposed algorithm’s efficiency for distributively solving non-convex optimization problems. The further potential of the IPG idea is investigated later in Chapters 6-7.

In Chapter 5, we consider solving smooth non-convex optimization problems, with a focus on deep neural network models. We aim to develop a generic framework for adaptive gradient-descent algorithms for solving non-convex problems in the continuous-time domain. Particularly, we model the adaptive gradient methods in a state-space framework, which allows us to exploit standard control-theoretic tools in analyzing existing prominent adaptive methods. We rigorously show that a few other algorithms that are not encompassed by the proposed generic framework can also be modeled and analyzed using a similar technique. The analyses of all these fast gradient algorithms are unified by a common underlying proof sketch, relying upon Barbalat’s lemma. Control-theoretic tools can also aid in developing novel adaptive gradient algorithms, as we show by proposing new variants of three existing adaptive gradient methods, using the concept of transfer functions. Applications on benchmark machine learning tasks demonstrate the proposed algorithms’ efficiency. The findings in this chapter suggest further exploration of the existing tools from control theory in complex machine learning problems.
Part of the research works in this chapter have been published in [12] and accepted for presentation at [13, 14].

Distributed multi-agent beampattern matching inherently enables covert communication by constructively combining the transmitted signals at target receivers and forming nulls towards the adversaries. Most existing beamforming methodologies achieve this by relying on partial or full-state real-time feedback from the receivers. Chapter 6 proposes a novel distributed beamforming technique that does not assume any feedback from the receivers or channel parameters such as multipath fading. The proposed algorithm works in a server-agent architecture of the beamforming agents, eliminating the need for the receivers’ feedback. Our algorithm is built on the classical gradient-descent (GD) method. However, when the problem is ill-conditioned, GD requires many iterations to converge and is unstable against system noise. We propose an iterative pre-conditioning technique, founded upon the IPG algorithm proposed in Chapter 4, to mitigate the deleterious effects of the data points’ conditioning on the convergence rate, facilitating a rapid establishment of communication links with far-field targets. The empirical results demonstrate the proposed beamforming algorithm’s favorable convergence rate and robustness against unknown multipath fading in realistic environments. This work is part of an ArtIAMAS project (https://artiamas.umd.edu/) and is partially funded by ARL Grant No. W911NF2120076. The results in this chapter have been accepted for presentation at [15].

Nonlinear observers are required for process monitoring, fault detection, signal reconstruction, and control in various applications. The previously proposed Newton observer has fast exponential convergence and applies to a wide class of problems. However, the Newton observer lacks robustness against measurement noise due to the computation of a matrix inverse at each step.
In Chapter 7, we propose a novel observer for discrete-time plant dynamics with sampled measurements to alleviate the impact of measurement noise. The key to the proposed observer is an iterative pre-conditioning technique for the gradient-descent method, proposed in Chapter 4, for solving general optimization problems. The proposed observer utilizes a non-symmetric pre-conditioner to approximate the observability mapping’s inverse Jacobian, so that it asymptotically replicates the Newton observer with the additional advantage of enhanced robustness against measurement noise. Our observer applies to a wide class of nonlinear systems, as it is not contingent upon linearization or any specific structure in the plant nonlinearity and the measurement mapping. Its improved robustness against measurement noise compared to the prominent nonlinear observers is demonstrated through empirical results. Its relation with the extended Kalman filter is also discussed.

Finally, Chapter 8 summarizes the completed work and presents the future research directions resulting from this dissertation. Detailed proofs of the theoretical results in this dissertation are presented in Appendix A.

Chapter 2: Distributed Linear Regression in Server-Agent Network

2.1 Introduction

Figure 2.1: Distributed system architecture.

In this chapter, we consider the multi-agent distributed linear least-squares problem. The nomenclature distributed here refers to the data being distributed across multiple agents. Specifically, we consider a system comprising multiple agents, each agent having a set of local data points. The agents can communicate bidirectionally with a central server, as shown in Fig. 2.1. However, there is no inter-agent communication, and the agents cannot share their local data points with the server. The agents’ goal is to generate a linear mathematical model that optimally fits the data points held by all the agents.
To solve this problem, as an individual agent cannot access the collective data points, the agents collaborate with the central server. Henceforth, we will refer to the above-described system architecture as a server-agent network. Also, throughout this chapter, the system is assumed to be synchronous.

To be precise, we consider m agents in the system. Each agent i has a set of n_i data points, represented by the rows of an (n_i × d)-dimensional real-valued local data matrix A^i, and the elements of an n_i-dimensional real-valued local observation vector b^i. Typically, n_i > d for each i. The goal for the agents is to compute a parameter vector x^* ∈ R^d such that

    x^* \in X^* = \arg\min_{x \in \mathbb{R}^d} \sum_{i=1}^{m} \frac{1}{2} \left\| A^i x - b^i \right\|^2,    (2.1)

where ‖·‖ denotes the Euclidean norm. For each agent i, we define a local cost function

    F^i(x) = \frac{1}{2} \left\| A^i x - b^i \right\|^2, \quad \forall x \in \mathbb{R}^d.    (2.2)

Note that solving the optimization problem (2.1) is equivalent to computing a minimum point of the aggregate cost function \sum_{i=1}^{m} F^i(x). An algorithm that enables the agents to jointly solve the above problem in the architecture of Fig. 2.1 without sharing their data points is defined as a distributed algorithm.

Common applications of the above linear least-squares problem include linear regression, state estimation, and hypothesis testing [16], [17]. Also, a wide range of supervised machine learning problems can be modeled as a linear least-squares problem, such as supply chain demand forecasting [18], prediction of online user input actions [19], and the problem of selecting sparse linear solvers [20]. In many contemporary applications, the data points are dispersed over several sources (or agents). Due to industrial competition, administrative regulations, and user privacy, it is nearly impossible to collect the individual data from different agents at a single machine (or server) [3].
These restrictions have motivated a significant amount of research in recent years on distributed server-agent algorithms for solving data-driven modeling problems, such as (2.1) above [3], [4]. Herein lies our motivation to improve upon the state-of-the-art methods for solving (2.1) in a server-agent network. To be able to present our key contributions, we review below the server-agent version of the traditional gradient-descent method [2].

2.1.1 Background on gradient-descent method

The gradient-descent (GD) method is an iterative algorithm wherein the server maintains an estimate of a minimum point, defined by (2.1), and updates it iteratively using the gradients of the individual agents’ local cost functions. To be precise, for each iteration t = 0, 1, ..., let x(t) ∈ R^d denote the estimate maintained by the server. The initial estimate x(0) may be chosen arbitrarily from R^d. For each iteration t, the server broadcasts x(t) to all the agents. Each agent i computes the gradient, denoted by g^i(t), of its local cost function F^i(x) at x(t). Specifically, for all i ∈ {1, ..., m} and for all t ∈ {0, 1, ...},

    g^i(t) = \nabla F^i(x(t)) = \left( A^i \right)^T \left( A^i x(t) - b^i \right).    (2.3)

Here, (·)^T denotes the transpose. The agents send their computed gradients {g^i(t), i = 1, ..., m} to the server. Upon receiving the gradients, the server updates x(t) to x(t+1) according to

    x(t+1) = x(t) - \delta \sum_{i=1}^{m} g^i(t), \quad t \in \{0, 1, \ldots\},    (2.4)

where δ is a positive scalar real value, commonly referred to as the step-size. Let g(t) denote the sum of all the agents’ gradients, that is, g(t) = \sum_{i=1}^{m} g^i(t). So, g(t) is the gradient of the aggregate cost defined in (2.1). Substituting from above in (2.4), we obtain that the convergence of the above server-agent version of the GD method is identical to its centralized counterpart with cost function \sum_{i=1}^{m} F^i(x). It is known that for a small enough step-size δ, the gradients {g(t), t = 0, 1, ...} converge linearly to 0_d, the d-dimensional zero vector.
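As a concrete illustration, the server-agent GD iteration (2.3)-(2.4) can be sketched in a few lines of NumPy. The toy data, the number of agents, and the step-size choice below are hypothetical, and the message passing between the server and the agents is simulated by plain function calls.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: m = 3 agents, each with n_i = 5 points in d = 2 dimensions.
m, n_i, d = 3, 5, 2
A = [rng.standard_normal((n_i, d)) for _ in range(m)]
b = [rng.standard_normal(n_i) for _ in range(m)]

def agent_gradient(i, x):
    """Gradient g^i(t) of the local cost F^i at x, per (2.3)."""
    return A[i].T @ (A[i] @ x - b[i])

# Server iteration (2.4): broadcast x(t), sum the returned gradients, take a step.
x = np.zeros(d)                                   # arbitrary initial estimate x(0)
lam1 = np.linalg.norm(sum(Ai.T @ Ai for Ai in A), 2)
delta = 1.0 / lam1                                # step-size small enough for convergence
for t in range(2000):
    g = sum(agent_gradient(i, x) for i in range(m))
    x = x - delta * g

print(np.linalg.norm(g))                          # aggregate gradient is now near zero
```

With a step-size below 2/λ₁(AᵀA), the simulated aggregate gradient decays linearly to zero, consistent with the convergence behavior described above.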
Specifically, for δ small enough, there exists µ ∈ [0, 1) such that ‖g(t)‖ ≤ µ^t ‖g(0)‖ for all t ≥ 0 [21]. Commonly, the value µ is referred to as the convergence rate [22]. The smaller the value of µ, the faster the convergence, and vice-versa.

We propose an iterative pre-conditioning technique that improves upon the GD method’s convergence rate in a server-agent network. Specifically, in each iteration, the server multiplies the aggregate of the agents’ gradients g(t) by a pre-conditioner matrix K before updating the local estimates. However, unlike the classical pre-conditioning techniques [21], the server iteratively updates the pre-conditioner matrix K. Hence, the name iterative pre-conditioning.

2.1.2 Related work

In his seminal work, Nesterov showed that the use of momentum can significantly accelerate the GD method [23]. Recently, there has been work on the applicability of the NAG method to the server-agent network, such as [5] and references therein. Azizan-Ruhi et al. [5] have proposed the APC method, a combination of the NAG method with a projection operation. Azizan-Ruhi et al. have shown through experiments that their APC method converges faster compared to the variants of the NAG method and the HBM [24]. However, they do not provide any theoretical guarantee for the improvement in the convergence speed. Also, Azizan-Ruhi et al. only consider a degenerate case of the optimization problem (2.1) where the set of linear equations A^i x = b^i, i = 1, ..., m, has a unique solution. We consider a more general setting wherein the minimum value of the aggregate cost function \sum_{i=1}^{m} F^i(x) may not be zero, and the solution of the optimization problem (2.1) may not be unique. The HBM is another momentum-based accelerated variant of the GD method [24].
For the case when the optimization problem (2.1) has a unique solution, both these accelerated methods, namely the NAG method and the HBM, are known to converge linearly with a rate of convergence smaller than that of the above traditional GD method [25], [26]. Newton’s method converges quadratically, hence superlinearly [22]. However, Newton’s method does not apply to the server-agent networked system unless the agents share their local data points with the server. A quasi-Newton method, on the other hand, such as BFGS, applies to server-agent networks [22]. However, similar to Newton’s method, BFGS also requires the solution of the optimization problem (2.1) to be unique. We consider a setting where the solution of the problem (2.1) may not be unique.

2.1.3 System noise

These distributed algorithms are iterative, wherein the server maintains an estimate of a solution defined by (2.1), which is updated iteratively using the gradients of the individual agents’ local cost functions defined in (2.2). In an ideal scenario with no noise, these algorithms converge to an optimal regression parameter defined in (2.1). Practical systems, however, inevitably suffer from uncertainties or noise [27, 28]. Specifically, we consider two types of additive system noise: 1) observation noise, and 2) process noise. The observation noise, as the name suggests, models the uncertainties in the local data points observed by the agents [29]. The process noise models the uncertainties or jitters in the computation process due to hardware failures, quantization errors, or noisy communication links [28].

In this dissertation, we empirically show that the approximation error, or the robustness, of our method in the presence of the above system noises compares favorably to all the other aforementioned prominent distributed algorithms. Besides empirical results, we also present formal analyses of the proposed method’s robustness when negatively impacted by additive system noises.
Robustness analyses of some of the other aforementioned distributed algorithms are in [30, 31].

2.1.4 Stochastic settings

We also consider solving the distributed linear least-squares problem using stochastic algorithms. In particular, each agent i has n local data points, represented by a local data matrix A^i and a local observation vector b^i of dimensions n × d and n × 1, respectively. Thus, for all i ∈ {1, ..., m}, A^i ∈ R^{n×d} and b^i ∈ R^n. For each agent i, we define a local cost function F^i : R^d → R such that for a given parameter vector x ∈ R^d,

    F^i(x) = \frac{1}{2n} \left\| A^i x - b^i \right\|^2.    (2.5)

The agents’ objective is to compute an optimal parameter vector x^* ∈ R^d such that

    x^* \in \arg\min_{x \in \mathbb{R}^d} \frac{1}{m} \sum_{i=1}^{m} F^i(x).    (2.6)

There are several theoretical and practical reasons for solving the distributed problem (2.6) using stochastic methods rather than batched optimization methods, particularly when the number of data points is abundant [32]. The basic prototype of the stochastic optimization methods that solve (2.6) is the traditional stochastic gradient descent (SGD) method [32]. Several accelerated variants of the stochastic gradient descent algorithm have been proposed in the past decade [33–38]. A few such well-known methods are adaptive gradient descent (AdaGrad) [33], adaptive momentum estimation (Adam) [34], and AMSGrad [37]. These algorithms are stochastic, wherein the server maintains an estimate of a solution defined by (2.6), which is refined iteratively by the server using the stochastic gradients computed by a randomly chosen agent. In particular, Adam has been demonstrated to compare favorably with other stochastic optimization algorithms for a wide range of optimization problems. However, Adam updates the current estimate effectively based on only a window of the past gradients, due to the exponentially decaying term present in its estimate-updating equation, which leads to poor convergence in many problems [37].
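The single-data-point stochastic update underlying the SGD prototype for (2.6) can be sketched as follows; the ground-truth data model, step-size, and iteration count below are illustrative assumptions, with one random agent and one random local data point queried per iteration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data generated from a ground-truth model (names are illustrative).
m, n, d = 4, 20, 3
x_true = np.array([1.0, -2.0, 0.5])
A = [rng.standard_normal((n, d)) for _ in range(m)]
b = [Ai @ x_true + 0.1 * rng.standard_normal(n) for Ai in A]

# Stochastic iterations: at each step the server queries one random agent,
# which returns a gradient based on a single random local data point.
x = np.zeros(d)
eta = 0.01                           # small constant step-size
for t in range(30000):
    i = rng.integers(m)              # server picks a random agent
    j = rng.integers(n)              # agent picks a random local data point
    a = A[i][j]
    x = x - eta * a * (a @ x - b[i][j])

# x should land in a small neighborhood of the minimizer of (2.6).
x_star, *_ = np.linalg.lstsq(np.vstack(A), np.concatenate(b), rcond=None)
print(np.linalg.norm(x - x_star))
```

With a constant step-size, the iterates settle in a neighborhood of the minimizer whose radius scales with the step-size and the gradient noise, which is the behavior the stochastic analysis later in this chapter quantifies.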
A recently proposed variant of Adam is the AMSGrad algorithm, which proposes to fix Adam’s convergence issue by incorporating a “long-term memory” of the past gradients.

In this dissertation, we propose a stochastic iterative pre-conditioning technique for improving the rate of convergence of the distributed stochastic gradient descent method when solving the linear least-squares problem (2.6) in distributed networks. The idea of iterative pre-conditioning in the deterministic (batched data) case has been mentioned in Section 2.1.1, wherein the server updates the estimate using the sum of the agents’ gradients multiplied with a suitable iterative pre-conditioning matrix. Updating the pre-conditioning matrix depends on the entire dataset at each iteration. We extend that idea to the stochastic settings, where the server updates both the estimate and the iterative pre-conditioning matrix based on a randomly chosen agent’s stochastic gradient at every iteration. Each agent computes its stochastic gradient based on a single randomly chosen data point from its local set of data points. Using real-world datasets, we empirically show that the proposed algorithm converges in fewer iterations compared to the aforementioned state-of-the-art distributed methods. Besides empirical results, we also present a formal analysis of the proposed algorithm’s convergence in stochastic settings.

2.1.5 Summary of our contributions

In this chapter, we propose an iterative pre-conditioning technique for improving the rate of convergence of the traditional gradient-descent method when solving the linear least-squares problem in a server-agent network. We present below a summary of our key contributions and comparisons with other prominent algorithms applicable to the server-agent network architecture.

2.1.5.1 In deterministic settings

• We show that our algorithm, in general, converges linearly.
For the special case when the solution of (2.1) is unique, the convergence of our algorithm is superlinear. Please refer to Section 2.2.4 for details.

• We show that our algorithm has a favorable rate of convergence compared to the prominent distributed linear least-squares algorithms applicable to the server-agent network: the gradient-descent (GD) method, Nesterov’s accelerated gradient-descent (NAG) method, the heavy-ball method (HBM), and the accelerated projection-consensus method (APC) [5]. These comparisons are summarized in Table 2.1. Please refer to Section 2.3 for details.

- In the particular case when the solution of (2.1) is unique, our algorithm converges superlinearly, which is comparable only to the BFGS method [22]. The convergence of the other aforementioned algorithms, on the other hand, is only linear [5], [25].

- For the general case when the least-squares problem (2.1) has multiple solutions, both our algorithm and the GD method converge linearly. In this case, our algorithm provably improves upon the explicit rate of convergence of GD. Specifically, our algorithm converges exponentially faster compared to the GD method. The explicit linear rate of convergence of the NAG and HBM methods relies on the uniqueness of the solution [25], [26]. For multiple solutions, these methods’ convergence rate is available only in terms of the order of iteration numbers [26], [39]. However, not only the order of the number of iterations but also the coefficient associated with the order contributes to characterizing the convergence rate of an algorithm. Hence, we provide the explicit convergence rate of our algorithm. Among the other aforementioned algorithms, the convergence of the APC method and the BFGS method requires uniqueness of the solution [22], [5].

• We validate our theoretical results through numerical experiments, presented in Section 2.6, on benchmark real-world datasets.
Table 2.1: Comparisons between the theoretical convergence rates of different algorithms for distributed linear regression.

Algorithm     Condition for convergence        Explicit rate (unique solution)   Explicit rate (multiple solutions)
Algorithm 1   any number of optimal solutions  superlinear                       linear, faster than GD
GD            any number of optimal solutions  linear [5], [25]                  linear
NAG, HBM      unique optimal solution          linear [5], [25]                  not known [26], [39]
APC           unique optimal solution          linear [5]                        need not converge [22]
BFGS          unique optimal solution          superlinear [22]                  need not converge [22]

2.1.5.2 In the presence of system noise

Here, we make two key contributions, summarized as follows.

• In Section 2.4, we theoretically characterize the robustness guarantees of the IPG method against both the observation and the process noises.

• In Section 2.6, we empirically show the improved robustness of the IPG method in comparison to the state-of-the-art algorithms.

2.1.5.3 In stochastic settings

• We present a formal convergence analysis of our proposed stochastic algorithm. Our convergence result can be informally summarized as follows. Suppose the solution of problem (2.6) is unique, and the variances of the stochastic gradients computed by the agents are bounded. In that case, our proposed algorithm, i.e., Algorithm 2, converges linearly in expectation to a proximity of the solution of problem (2.6). The approximation error is proportional to the algorithm’s step-size and the variances of the stochastic gradients. Note that, as in the deterministic settings, our algorithm converges superlinearly to the exact solution when the gradient noise is zero. Formal details are presented in Theorem 2.5 in Section 2.5.3.

• Using real-world datasets, we empirically show that our proposed stochastic algorithm’s convergence rate is superior to that of the state-of-the-art stochastic methods when distributively solving linear least-squares problems.
These datasets comprise

- four benchmark datasets from the SuiteSparse Matrix Collection;

- a subset of the “cleveland” dataset from the UCI Machine Learning Repository, which contains binary classification data of whether a patient has heart failure or not, based on 13 features;

- a subset of the “MNIST” dataset for classification of the handwritten digits one and five.

Please refer to Section 2.6 for further details.

2.2 Proposed algorithm: Iteratively Pre-Conditioned Gradient-Descent (IPG)

In this section, we present our algorithm, its computational complexity, and its convergence properties in the deterministic settings, i.e., with full-batch data and when there is no noise in the system.

2.2.1 Motivation for IPG

The proposed algorithm is similar to the gradient-descent method described in Section 2.1.1. However, a notable difference is that in our algorithm, the server multiplies the aggregate of the gradients received from the agents by a pre-conditioner matrix. The server uses the pre-conditioned aggregate of the agents’ gradients to update its current estimate. In the literature, this technique is commonly referred to as pre-conditioning [40]. When the matrix A^T A is non-singular, the best pre-conditioner matrix for the gradient-descent method is the inverse Hessian matrix (A^T A)^{-1}, resulting in Newton’s method, which converges superlinearly. However, (A^T A)^{-1} cannot be computed directly in a distributed setting, as it requires the agents to send their local data points to the server. Thus, instead of a constant pre-conditioner matrix (A^T A)^{-1}, we propose a distributed scheme where the server iteratively updates the pre-conditioning matrix in such a way that it converges to (A^T A)^{-1} (ref. Lemma A.1). This specific iterative update rule of the pre-conditioning matrix is described later in Step 4 of the next subsection. Thus, our algorithm eventually converges to Newton’s method and has a superlinear convergence rate, as shown later in Section 2.2.4.
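To see why (A^T A)^{-1} is the ideal pre-conditioner, note that for the quadratic cost (2.1) a gradient step pre-multiplied by (A^T A)^{-1} is exactly a Newton step, which lands on the minimizer in a single iteration. The sketch below checks this on hypothetical full-rank data; it is centralized, for illustration only, since this inverse is not computable distributively.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical full-rank stacked data matrix A and observation vector b.
A = rng.standard_normal((30, 4))
b = rng.standard_normal(30)

H_inv = np.linalg.inv(A.T @ A)       # ideal (but centralized) pre-conditioner
x = rng.standard_normal(4)           # arbitrary initialization
g = A.T @ (A @ x - b)                # aggregate gradient of the quadratic cost
x = x - H_inv @ g                    # one pre-conditioned step = one Newton step

# For a quadratic cost, a single Newton step lands exactly on the minimizer.
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(x - x_star))
```

Algebraically, x − (AᵀA)⁻¹(AᵀA x − Aᵀb) = (AᵀA)⁻¹Aᵀb, which is the unique least-squares solution, independent of the initialization.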
Herein lies the motivation for our proposed algorithm.

The idea of iterative pre-conditioning is inspired by simple adaptive control techniques [41]. In adaptive control theory, the goal is to design a feedback control law, when the system parameters are unknown, so that the system state is driven to an equilibrium point. Analogous to such techniques, the proposed algorithm “adapts” to the unknown inverse Hessian matrix (A^T A)^{-1} by tuning the iterative pre-conditioner matrix (the “feedback gain”), which drives the estimate (system state) to a solution of (2.1). At the same time, the pre-conditioner matrix, which is an estimate of the unknown parameter (A^T A)^{-1}, also converges to the true parameter (A^T A)^{-1}. Thus, in the language of adaptive control theory, we have both parameter convergence and state convergence.

In each iteration t ∈ {0, 1, ...}, the server maintains an estimate x(t) of a minimum point of (2.1), and a pre-conditioner matrix K(t) ∈ R^{d×d}. The initial estimate x(0) and the pre-conditioner matrix K(0) are chosen arbitrarily from R^d and R^{d×d}, respectively. For each iteration t ≥ 0, the algorithm steps are presented below.

2.2.2 Steps in each iteration t

In each iteration t, the algorithm comprises four steps, executed collaboratively by the server and the agents. Before initiating the iterative process, the server chooses three non-negative scalar real-valued parameters α, δ, and β, whose specific values are presented later in Section 2.2.4. The parameter β is broadcast to all the agents.

• Step 1: The server sends the estimate x(t) and the matrix K(t) to each agent i.

• Step 2.1: Each agent i computes the gradient g^i(t), as defined in (2.3).

• Step 2.2: Each agent i computes a set of vectors { R^i_j(t) : j = 1, ..., d } such that for each j = 1, ...
, d,

    R^i_j(t) = \left( \left( A^i \right)^T A^i + \frac{\beta}{m} I \right) k_j(t) - \frac{1}{m} e_j,    (2.7)

where I denotes the (d × d)-dimensional identity matrix, and e_j and k_j(t) denote the j-th columns of the matrices I and K(t), respectively.

• Step 3: Each agent i sends the gradient g^i(t) and the set of vectors { R^i_j(t), j = 1, ..., d } to the server.

• Step 4: The server updates the matrix K(t) as

    k_j(t+1) = k_j(t) - \alpha \sum_{i=1}^{m} R^i_j(t), \quad j = 1, \ldots, d.    (2.8)

Finally, the server updates the estimate x(t) to x(t+1) such that

    x(t+1) = x(t) - \delta K(t+1) \sum_{i=1}^{m} g^i(t).    (2.9)

The parameter δ is a non-negative real value, commonly referred to as the step-size. Our algorithm is summarized below in Algorithm 1.

Algorithm 1 Iterative pre-conditioning for the gradient-descent method in a server-agent network.
1: The server initializes x(0) ∈ R^d, K(0) ∈ R^{d×d} and selects the parameters α > 0, δ > 0, and β ≥ 0.
2: for t = 0, 1, 2, ... do
3:   The server sends x(t) and K(t) to each agent i ∈ {1, ..., m}.
4:   Each agent i computes the gradient g^i(t) as defined by (2.3), and a set of vectors { R^i_j(t), j = 1, ..., d } as defined by (2.7).
5:   Each agent i sends g^i(t) and the set { R^i_j(t), j = 1, ..., d } to the server.
6:   The server updates K(t) to K(t+1) as defined by (2.8).
7:   The server updates the estimate x(t) to x(t+1) as defined by (2.9).
8: end for

2.2.3 Algorithm complexity

We now present the computational complexity of Algorithm 1, in terms of the total number of floating-point operations (flops) required per iteration.

For each iteration t, each agent i computes the gradient g^i(t), defined in (2.3), and d vectors {R^i_j(t) : j = 1, ..., d}, defined in (2.7). Computation of g^i(t) requires two matrix-vector multiplications, namely A^i x(t) and (A^i)^T (A^i x(t) - b^i), in that order. As A^i is an (n_i × d)-dimensional matrix and x(t) is a d-dimensional vector, computation of the gradient g^i(t) requires O(n_i d) flops.
From (2.7), computation of each vector R^i_j(t) requires two matrix-vector multiplications, namely A^i k_j(t) and (A^i)^T (A^i k_j(t)), in that order. As A^i is an (n_i × d)-dimensional matrix, and both vectors k_j(t) and A^i k_j(t) are of dimension d, computation of each R^i_j(t) requires O(n_i d) flops. Thus, the net computation of the d vectors {R^i_j(t) : j = 1, ..., d} requires O(n_i d^2) flops. Therefore, the computational complexity of Algorithm 1 for each agent i is d parallel computations of O(n_i d) flops each, for every iteration. Note that the computation of each member in the set {R^i_j(t) : j = 1, ..., d} is independent of the others. Hence, agent i can compute the d vectors {R^i_j(t) : j = 1, ..., d} in parallel.

For each iteration t, the server computes the matrix K(t+1), defined in (2.8), and the vector x(t+1), defined in (2.9). Computing K(t+1) only requires O(d) floating-point additions, and can be ignored. In (2.9), the computation of K(t+1) \sum_{i=1}^{m} g^i(t) requires only one matrix-vector multiplication between the (d × d)-dimensional matrix K(t+1) and the d-dimensional vector \sum_{i=1}^{m} g^i(t). Therefore, the computational complexity for the server is O(d^2) flops for each iteration.

Communication complexity: For each iteration t, each agent sends a vector g^i(t) ∈ R^d and d vectors R^i_j(t) ∈ R^d, which means (d^2 + d) real scalar values are sent to the server by each agent. Thus, the communication complexity of Algorithm 1 for each agent is O(d^2), compared to O(d) for the GD method, which is a drawback of Algorithm 1.

Next, we present convergence results for Algorithm 1.

2.2.4 Convergence guarantees

To be able to present the formal convergence guarantees of Algorithm 1, we note below some elementary facts and introduce some notation.

• We define the collective data matrix and the collective observation vector, respectively, to be

    A = \left[ (A^1)^T, \ldots, (A^m)^T \right]^T, \quad b = \left[ (b^1)^T, \ldots, (b^m)^T \right]^T.

• Note that the matrix product A^T A is positive semi-definite.
Thus, if β > 0 then (A^T A + βI) is positive definite and, therefore, invertible. Let K_β = (A^T A + βI)^{-1}, and let λ_1, . . . , λ_d denote the eigenvalues of the matrix A^T A, ordered such that λ_1 ≥ . . . ≥ λ_d ≥ 0.

• The rank of the matrix A^T A is denoted by r. Note that r = d if and only if the matrix A^T A is full rank. In general, when A^T A is not the trivial zero matrix, 1 ≤ r ≤ d. Note that if r < d then λ_1 ≥ . . . ≥ λ_r > λ_{r+1} = . . . = λ_d = 0.

• For a matrix M ∈ R^{d×d}, ∥M∥_F denotes its Frobenius norm [42]. Specifically, if m_{ij} denotes the (i, j)-th element of the matrix M, then ∥M∥_F = √(Σ_{i=1}^d Σ_{j=1}^d m_{ij}^2).

For each agent i, recall from (2.2) that the cost function F^i(x) is convex and differentiable. Thus, the aggregate cost function Σ_{i=1}^m F^i(x) is also convex. Therefore, x* ∈ X* if and only if [43] ∇Σ_{i=1}^m F^i(x*) = 0_d, where 0_d denotes the d-dimensional zero vector. Recall, from Section 2.1.1, that

g(t) = ∇Σ_{i=1}^m F^i(x(t)) = Σ_{i=1}^m ∇F^i(x(t)) = Σ_{i=1}^m g^i(t). (2.10)

We now define the parameters that determine the convergence rate of Algorithm 1. Let

µ* = (λ_1 − λ_r) / (λ_1 + λ_r + 2(λ_1 λ_r / β)), (2.11)
ϱ = (λ_1 − λ_d) / (λ_1 + λ_d + 2β). (2.12)

Since λ_1, λ_r, β > 0, (2.11)-(2.12) imply that µ*, ϱ < 1. We define the optimal step-size parameter

δ* = 2 / (λ_1/(λ_1 + β) + λ_r/(λ_r + β)). (2.13)

The key result on the convergence of Algorithm 1 is presented next.

Theorem 2.1. Consider Algorithm 1. Suppose 0 < α < 2/(λ_1 + β) and 0 < δ < 2(λ_1 + β)/λ_1. Then,
1. there exist non-negative real values µ and ρ, with µ* ≤ µ < 1 and ϱ ≤ ρ < 1, such that for each iteration t ≥ 0,
∥g(t+1)∥ ≤ (µ + δ λ_1 ∥K(0) − K_β∥_F ρ^{t+1}) ∥g(t)∥; (2.14)
2. lim_{t→∞} ∥g(t)∥ = 0;
3. if δ = δ* then (2.14) holds with µ = µ*.

As ρ < 1, (2.14) of Theorem 2.1 implies that lim_{t→∞} ∥g(t+1)∥/∥g(t)∥ ≤ µ < 1, since ρ^{t+1} → 0 as t → ∞. Thus, Theorem 2.1 implies that the sequence of gradients {g(t)}_{t≥0} converges linearly to 0_d with convergence rate no worse than µ.
Since g(t) is linearly related to x(t), as seen from (2.10) and (2.3), linear convergence of {g(t)}_{t≥0} to 0_d implies linear convergence of the sequence of estimates {x(t)}_{t≥0} to a point in X*.

Superlinear convergence: Consider the special case when x* is the unique solution of the least-squares problem (2.1), i.e., the unique minimum point of the aggregate cost function Σ_{i=1}^m F^i(x). In this case, the matrix A^T A is full-rank and, therefore, r = d. We show below that Algorithm 1 with parameter β = 0 converges superlinearly to x*. Specifically, we obtain the following corollary of Theorem 2.1. Recall, from (2.12), that when β = 0 then ϱ = (λ_1 − λ_d)/(λ_1 + λ_d) < 1.

Corollary 2.1. Consider Algorithm 1 with β = 0. If x* defined by (2.1) is unique, and the parameter α satisfies the condition stated in Theorem 2.1, then for δ = 1 there exists a non-negative real value ρ ∈ [ϱ, 1) such that
1. for each iteration t ≥ 0, ∥g(t+1)∥ ≤ λ_1 ∥K(0) − K_β∥_F ρ^{t+1} ∥g(t)∥;
2. lim_{t→∞} ∥x(t+1) − x*∥ / ∥x(t) − x*∥ = 0.

Remarks: Since the number of data points n_i held by each agent i is not necessarily the same across the agents, it may be of interest to minimize the weighted aggregate cost function Σ_{i=1}^m (n_i / (2 Σ_{i=1}^m n_i)) ∥A^i x − b^i∥^2, where each local cost function F^i is weighted by the fraction of the data points held by agent i relative to the total number of data points of all the agents combined. We note that the results in this section remain valid for this weighted cost function.

2.3 Comparisons with the existing methods

In this section, we present comparisons between the rates of convergence of Algorithm 1 and the other prominent server-agent algorithms reviewed in Section 2.1. The algorithms are elaborated in [44]. Unique optimal solution: For the special case when the solution of the distributed least-squares problem (2.1) is unique, as shown in Section 2.2.4, Algorithm 1 converges superlinearly, which is comparable only to the BFGS method.
The rest of the algorithms, namely GD, NAG, HBM, and APC, only converge linearly [5]. Multiple optimal solutions: In general, when the solution of the distributed least-squares problem (2.1) is not unique, we show below that Algorithm 1 converges exponentially faster than the GD method. It should be noted that the convergence of BFGS and APC, and the explicit convergence rates of NAG and HBM, can only be guaranteed for the special case when the solution of (2.1) is unique; see [5], [25], [26] and the references therein. Otherwise, the convergence rates of NAG and HBM are known only in terms of the order of the iteration number [26], [39]. However, in practice, the constant factor that is often ignored in order-of-convergence analysis can be quite critical. Thus, we have provided an explicit convergence rate of Algorithm 1 in Theorem 2.1. The above comparisons are succinctly summarized in Table 2.1 (ref. Page 16). A detailed comparison with the classical GD method is presented below.

2.3.1 The gradient-descent method

Consider the GD algorithm in server-agent networks, described in Section 2.1.1. To the best of our knowledge, in the open literature, the explicit rate of convergence of GD is mentioned only when the solution of (2.1) is unique [2], [25], [26]. We present below, formally in Lemma 2.1, the explicit convergence rate of GD for the general case. We define the parameter

µ_GD = (λ_1 − λ_r) / (λ_1 + λ_r),

where λ_1 and λ_r are, respectively, the largest and the smallest non-zero eigenvalues of A^T A.

Lemma 2.1. Consider the gradient-descent algorithm in a server-agent network as presented in Section 2.1.1. In (2.4), if δ ∈ (0, 2/λ_1) then there exists µ with µ_GD ≤ µ < 1 such that, for each iteration t ≥ 0, ∥g(t+1)∥ ≤ µ ∥g(t)∥.

We show formally below, in Theorem 2.2, that Algorithm 1 converges faster than the GD method. Note that, in the special case when all the non-zero eigenvalues of the matrix A^T A are equal, i.e.
A^T A = kI for some k ≥ 0, both the GD algorithm and Algorithm 1 solve the optimization problem (2.1) in just one iteration. Theorem 2.2 below presents the case when λ_1 > λ_r.

Theorem 2.2. Consider Algorithm 1, and suppose that λ_1 > λ_r. If β > 0 then there exist a positive finite integer τ and two positive finite real values c and r, with r < 1, such that

∥g(t+1)∥ ≤ c (r µ_GD)^{t+1} ∥g(0)∥, ∀t > τ.

Consider the best possible rate of convergence for the GD algorithm, that is, substitute µ = µ_GD in Lemma 2.1. In that case, the gradients for the GD algorithm in a server-agent network satisfy ∥g(t+1)∥ ≤ (µ_GD)^{t+1} ∥g(0)∥ for all t ≥ 0. From Theorem 2.2, the gradients' norms for Algorithm 1 satisfy ∥g(t+1)∥ ≤ c (r µ_GD)^{t+1} ∥g(0)∥ for all t > τ. Thus, assuming that both algorithms are identically initialized with some x(0), there exists a finite integer T such that the ratio between the upper bounds on the gradients of Algorithm 1 and of GD in a server-agent network is c r^{t+1} for iterations t > T, where r < 1. This statement implies that, after a finite number of iterations, Algorithm 1 is guaranteed to have a smaller error bound than GD with identical initialization of x(0) and arbitrary initialization of the iterative pre-conditioning matrix K(0). More importantly, this error bound of Algorithm 1 decreases to zero at an exponentially faster rate than that of GD.

2.4 Robustness of the IPG method

In this section, we present our key results on the robustness of the IPG method [6], described in Algorithm 1 for the noise-free case, in the presence of additive system noises: the observation noise and the process noise. We introduce below some notation and our main assumption, and review a pivotal prior result.

2.4.1 Notation, assumption, and prior results

Assumption 2.1. Assume that the matrix A^T A is full rank.

Note that Assumption 2.1 holds true if and only if the matrix A^T A is positive definite, with λ_d > 0.
Under Assumption 2.1, A^T A is invertible, and we let K* = (A^T A)^{-1}. Note, from (2.2), that the Hessian of the aggregate cost function Σ_{i=1}^m F^i(x) is equal to A^T A for all x ∈ R^d. Thus, under Assumption 2.1, when A^T A is positive definite, the aggregate cost function has a unique minimum point. Equivalently, the solution of the least-squares problem defined by (2.1) is unique.

We review below, in Lemma 2.2, a prior result that is pivotal for our key results presented later in this section. We let

ϱ = (λ_1 − λ_d)/(λ_1 + λ_d). (2.15)

For each iteration t, recall the pre-conditioner matrix K(t) in Algorithm 1, and define

K̃(t) = K(t) − K*, ∀t ≥ 0. (2.16)

Let k̃_j(t) denote the j-th column of K̃(t), and let k*_j denote the j-th column of the matrix K*. Lemma 2.2 below states sufficient conditions under which the sequence of pre-conditioner matrices {K(t), t = 0, 1, . . .} in Algorithm 1 converges linearly to K*.

Lemma 2.2. Consider Algorithm 1 with α ∈ (0, 2/λ_1). If Assumption 2.1 holds true, then there exists a real value ρ ∈ [ϱ, 1) such that, for all j ∈ {1, . . . , d},

∥k̃_j(t+1)∥ ≤ ρ ∥k̃_j(t)∥, ∀t ≥ 0. (2.17)

We first present, in Section 2.4.2 below, the robustness of the IPG method against observation noise, followed by its robustness against process noise in Section 2.4.3.

2.4.2 Robustness against observation noise

Following the literature [29], we model observation noise as follows. Each agent i observes a corrupted observation vector b^i_o, instead of the true observation vector b^i. Specifically,

b^i_o = b^i + w^i_b, i ∈ {1, . . . , m}, (2.18)

where w^i_b ∈ R^{n_i} is a random vector. Let E[·] denote the expectation of a function of the random vectors {w^i_b, i = 1, . . . , m}, and let ∥·∥_1 denote the l_1-norm [45].

Assumption 2.2. Assume that there exists η < ∞ such that

E[∥w^i_b∥_1] ≤ η, i ∈ {1, . . . , m}.
(2.19)

In the presence of the above observation noise, in each iteration t of Algorithm 1, each agent i sends to the server a corrupted gradient g^i(t), defined by (2.20) below instead of (2.3). Specifically, for all i and t,

g^i(t) = (A^i)^T (A^i x(t) − b^i_o). (2.20)

Due to the above corruption in the gradient computation, Algorithm 1 no longer converges to an exact solution, defined by (2.1), but rather to an approximation. Theorem 2.3 below presents a key result on the robustness of Algorithm 1 against the above additive observation noise. Recall from Section 2.4.1 that, under Assumption 2.1, the solution x* of the regression problem (2.1) is unique. For each iteration t, we define the estimation error

z(t) = x(t) − x*. (2.21)

For a matrix M with columns M_1, . . . , M_d, its Frobenius norm [45] is defined to be ∥M∥_F = √(Σ_{j=1}^d ∥M_j∥^2). Recall the definition of ϱ from (2.15).

Theorem 2.3. Let {z(t)}_{t=0}^∞ be constructed by Algorithm 1 with parameters α < 2/λ_1 and δ ≤ 1, in the presence of the additive observation noise defined in (2.18) and with the gradients {g^i(t), i = 1, . . . , m} for each iteration t defined by (2.20). If Assumptions 2.1-2.2 hold, then there exists ρ ∈ [ϱ, 1) such that, for all t ≥ 0,

E[∥z(t+1)∥] ≤ (1 − δ + δ λ_1 ∥K̃(0)∥_F ρ^{t+1}) ∥z(t)∥ + δ η m √λ_1 ∥K̃(0)∥_F ρ^{t+1} + δ η m √(1/λ_d). (2.22)

Additionally, lim_{t→∞} E[∥z(t)∥] ≤ δ η m √(1/λ_d).

Since ρ ∈ [0, 1) and δ ∈ (0, 1], Theorem 2.3 implies that the IPG method, under the influence of the additive observation noise (2.18), converges in expectation to within a distance of δ η m √(1/λ_d) from the true solution x* of the regression problem defined in (2.1). Since the gradient-descent (GD) method (with a constant step-size) is a special case of the IPG method with K(t) = I for all t, from the proof of Theorem 2.3 we obtain that the final error of GD is bounded by δ η m √λ_1.
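The steady-state behavior characterized by Theorem 2.3 can be checked numerically on a toy problem. In the sketch below, the m agents' sums are collapsed into the collective quantities (so m = 1), A is diagonal so that λ_1 and λ_d can be read off directly, and the noise w is a fixed vector with ∥w∥_1 = η = 0.3; all of these choices are illustrative assumptions, not the chapter's experimental setup.

```python
def ipg_noisy(diag_a, b_obs, alpha, delta, T):
    """IPG on A = diag(diag_a) with gradient g = A^T (A x - b_obs); beta = 0."""
    lam = [a * a for a in diag_a]                # eigenvalues of A^T A
    x = [0.0] * len(diag_a)
    kappa = [0.0] * len(diag_a)                  # diagonal of K(t), K(0) = 0
    for _ in range(T):
        # pre-conditioner update (2.8): k_j <- k_j - alpha ((A^T A) k_j - e_j)
        kappa = [k - alpha * (l * k - 1.0) for k, l in zip(kappa, lam)]
        # estimate update (2.9) with the (possibly corrupted) gradient (2.20)
        g = [a * (a * xi - bo) for a, xi, bo in zip(diag_a, x, b_obs)]
        x = [xi - delta * k * gi for xi, k, gi in zip(x, kappa, g)]
    return x

diag_a = [3.0, 1.0, 0.5]                         # lambda_1 = 9, lambda_d = 0.25
b_true = [a * 1.0 for a in diag_a]               # b = A x* with x* = (1, 1, 1)
w = [0.1, -0.1, 0.1]                             # fixed noise, ||w||_1 = eta = 0.3
b_obs = [bt + wi for bt, wi in zip(b_true, w)]
x_noisy = ipg_noisy(diag_a, b_obs, alpha=0.1, delta=1.0, T=800)
x_clean = ipg_noisy(diag_a, b_true, alpha=0.1, delta=1.0, T=800)
err_noisy = sum((xi - 1.0) ** 2 for xi in x_noisy) ** 0.5
err_clean = sum((xi - 1.0) ** 2 for xi in x_clean) ** 0.5
bound = 1.0 * 0.3 * 1 * (1.0 / 0.25) ** 0.5      # delta * eta * m * sqrt(1/lambda_d)
```

The noiseless run converges to x*, while the noisy run settles at a non-zero error floor that remains below δ η m √(1/λ_d) = 0.6, in line with the theorem.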
2.4.3 Robustness against process noise

In the presence of process noise [27, 28], in each iteration t of Algorithm 1 (the IPG method in the noise-free case), the server computes corrupted values of both the pre-conditioner matrix K(t) and the current estimate x(t), described formally as follows. In each iteration t, and for each j ∈ {1, . . . , d}, instead of computing k_j(t) (the j-th column of K(t)) accurately, the server computes

k^o_j(t) = k_j(t) + w^k_j(t), (2.23)

where w^k_j(t) ∈ R^d. Similarly, instead of computing x(t) accurately, the server computes a corrupted estimate

x^o(t) = x(t) + w^x(t), (2.24)

where w^x(t) ∈ R^d. Together, the vectors ζ(t) = {w^x(t), w^k_j(t), j = 1, . . . , d} are referred to as additive process noise. For each iteration t, let E_{ζ(t)}[·] denote the expectation of a function of the random vectors ζ(t), and let E_t[·] denote the total expectation of a function of the random vectors {ζ(0), . . . , ζ(t)}.

Assumption 2.3. Assume that the random vectors {w^x(t), w^k_j(t), j = 1, . . . , d} are mutually independent for all t, and that there exists ω < ∞ such that, for all t and j,

E_{ζ(t)}[∥w^k_j(t)∥_1] ≤ ω, and E_{ζ(t)}[∥w^x(t)∥_1] ≤ ω. (2.25)

In the presence of the above process noise, Algorithm 1 is modified as follows. In each iteration t, each agent i sends to the server a gradient g^i(t) and d vectors R^i_1(t), . . . , R^i_d(t), defined by (2.26) and (2.27) below, respectively, instead of (2.3) and (2.7). Specifically, for all i and t,

g^i(t) = (A^i)^T (A^i x^o(t) − b^i), (2.26)

and, for j = 1, . . . , d,

R^i_j(t) = (A^i)^T A^i k^o_j(t) − (1/m) e_j. (2.27)

Similarly, in each iteration t, the server now computes K(t+1) whose j-th column is defined as follows, instead of (2.8): for all j ∈ {1, . . . , d},

k_j(t+1) = k^o_j(t) − α Σ_{i=1}^m R^i_j(t). (2.28)

Recall that k^o_j(t+1), defined by (2.23), is the corrupted value of k_j(t+1). Instead of (2.9), the updated estimate x(t+1) is defined by

x(t+1) = x^o(t) − δ K^o(t+1) Σ_{i=1}^m g^i(t), ∀t.
(2.29)

Recall that x^o(t+1), defined by (2.24), is the corrupted value of x(t+1). To present our key result, in Theorem 2.4 below, on the robustness of the IPG method against the above process noise, we introduce some notation. From Lemma 2.2, recall that ρ ∈ [ϱ, 1). For each iteration t, we define

K̃^o(t) = [k̃^o_1(t), . . . , k̃^o_d(t)] = K^o(t) − K*, and (2.30)
u(t) = 1 − δ + δ λ_1 (ρ^t ∥K̃(0)∥ + ω √d Σ_{i=0}^t ρ^i). (2.31)

Additionally, we let

ρ_bd = ∥K̃(0)∥ / (∥K̃(0)∥ + ω √d), and ω_bd = (1 − ρ) / (λ_1 √d). (2.32)

Theorem 2.4 below characterizes the convergence of Algorithm 1 with the modifications (2.26)-(2.29) in the presence of process noise. Recall from the definition (2.21) that z(t) = x(t) − x* for all t. Upon substituting x(t) from (2.24), we obtain that z(t) = x^o(t) − w^x(t) − x*. We let z^o(t) = x^o(t) − x*. Thus, for each iteration t,

z^o(t) = z(t) + w^x(t). (2.33)

Theorem 2.4. Let {z^o(t)}_{t=0}^∞ be constructed by Algorithm 1 with parameters α < 2/λ_1 and δ ≤ 1, in the presence of the additive process noise defined in (2.23)-(2.24) and with the modifications defined in (2.26)-(2.29). If Assumptions 2.1 and 2.3 hold, then there exists ρ ∈ [ϱ, 1) such that

E_t[∥z^o(t)∥] ≤ Π_{k=1}^t u(k) ∥z(0)∥ + (1 + u(t) + . . . + Π_{k=1}^t u(k)) ω, ∀t. (2.34)

Moreover, if

ρ < ρ_bd and ω < ω_bd, (2.35)

then lim_{t→∞} E_t[∥z^o(t)∥] < ω / (δ (1 − (ω/ω_bd))).

Note that, from the proof of Theorem 2.4, when K(t) = I for all t, the asymptotic estimation error of the traditional GD method under the above process noise is bounded by ω / (1 − ∥I − δ A^T A∥).

2.5 SGD with iterative pre-conditioning

In this section, we present our algorithm for solving (2.6) in the stochastic setting. Our algorithm follows the basic prototype of the stochastic gradient-descent method in distributed settings. However, unlike the traditional distributed stochastic gradient descent, the server in our algorithm multiplies the stochastic gradients received from the agents by a stochastic pre-conditioner matrix.
These pre-conditioned stochastic gradients are then used by the server to update the current estimate. To present the algorithm, we introduce some notation. An individual data point of an agent is represented by a row vector a of dimension 1 × d and a scalar observation b; thus, a ∈ R^{1×d} and b ∈ R. For each data point (a, b), we define the individual cost function f : R^d → R such that, for a given x ∈ R^d,

f(x; a, b) = (1/2) (a x − b)^2, (2.36)

and the gradient of the individual cost function f as

g(x; a, b) = ∇_x f(x; a, b) = a^T (a x − b). (2.37)

In each iteration t ∈ {0, 1, . . .}, the server maintains an estimate x(t) of a minimum point of (2.6) and a stochastic pre-conditioner matrix K(t) ∈ R^{d×d}. The initial estimate x(0) and the pre-conditioner matrix K(0) are chosen arbitrarily from R^d and R^{d×d}, respectively. The algorithm steps for each iteration t = 0, 1, . . . are presented below.

2.5.1 Steps in each iteration t

Before initiating the iterations, the server chooses a positive scalar real-valued parameter β and broadcasts it to all the agents. We number the agents in order from 1 to m. In each iteration t, the proposed algorithm comprises the four steps described below. These steps are executed collaboratively by the server and the agents, without requiring any agent to share its local data points. For each iteration t, the server also chooses two positive scalar real-valued parameters α and δ.

• Step 1: The server sends the estimate x(t) and the pre-conditioner matrix K(t) to each agent i ∈ {1, . . . , m}.

• Step 2: Each agent i ∈ {1, . . . , m} chooses a data point (a^{i_t}, b^{i_t}) uniformly at random from its local data points (A^i, b^i). Note that a^{i_t} and b^{i_t} are, respectively, a row of the input matrix A^i and the corresponding entry of the output vector b^i of agent i. The data points are selected independently and identically distributed (i.i.d.) across iterations.
Based on the selected data point (a^{i_t}, b^{i_t}), each agent i then computes a stochastic gradient, denoted by g^{i_t}(t), which is defined as

g^{i_t}(t) = g(x(t); a^{i_t}, b^{i_t}). (2.38)

In the same step, each agent i ∈ {1, . . . , m} computes a set of vectors {h^{i_t}_j(t) : j = 1, . . . , d}:

h^{i_t}_j(t) = h_j(k_j(t); a^{i_t}, b^{i_t}), (2.39)

where the function h_j : R^d → R^d is defined below. Let I denote the (d × d)-dimensional identity matrix, and let e_j and k_j(t) denote the j-th columns of the matrices I and K(t), respectively. For each column j ∈ {1, . . . , d} of K(t) and each individual data point (a, b), we define

h_j(k_j; a, b) = (a^T a + βI) k_j − e_j. (2.40)

• Step 3: Each agent i ∈ {1, . . . , m} sends the stochastic gradient g^{i_t}(t) and the set of stochastic vectors {h^{i_t}_j(t), j = 1, . . . , d} to the server.

• Step 4: The server draws an i.i.d. sample ζ_t uniformly at random from the set of agents {1, . . . , m} and updates the matrix K(t) to K(t+1) using the vectors received from agent ζ_t, such that, for each j ∈ {1, . . . , d},

k_j(t+1) = k_j(t) − α h^{ζ_t}_j(t). (2.41)

Finally, the server updates the estimate x(t) to x(t+1) using the stochastic gradient received from agent ζ_t, such that

x(t+1) = x(t) − δ K(t+1) g^{ζ_t}(t). (2.42)

The parameter δ is a non-negative real value, commonly referred to as the step-size. These steps of our algorithm are summarized in Algorithm 2.

Algorithm 2 Iteratively Pre-Conditioned Stochastic Gradient-descent (IPSG).
1: The server initializes x(0) ∈ R^d, K(0) ∈ R^{d×d}, β > 0 and chooses the parameters α > 0 and δ > 0.
2: Steps in each iteration t ∈ {0, 1, 2, . . .}:
3: The server sends x(t) and K(t) to all the agents.
4: Each agent i ∈ {1, . . . , m} uniformly selects an i.i.d. data point (a^{i_t}, b^{i_t}) from its local data points (A^i, b^i).
5: Each agent i ∈ {1, . . . , m} sends to the server a stochastic gradient g^{i_t}(t), defined in (2.38), and d stochastic vectors h^{i_t}_1(t), . . . , h^{i_t}_d(t), defined in (2.39).
6: The server uniformly draws an i.i.d. sample ζ_t from the set of agents {1, . . . , m}.
7: The server updates K(t) to K(t+1), as defined by (2.41).
8: The server updates the estimate x(t) to x(t+1), as defined by (2.42).

Next, we present the formal convergence guarantees of Algorithm 2. We begin by introducing some notation and our main assumptions.

2.5.2 Notation and assumptions

For each iteration t ≥ 0, we define the following.

• Let E_{ζ_t}[·] and, for each agent i ∈ {1, . . . , m}, E_{i_t}[·] denote the conditional expectations of a function of the random variables ζ_t and i_t, respectively, given the current estimate x(t) and the current pre-conditioner K(t).

• Let I_t = {i_t, i = 1, . . . , m} ∪ {ζ_t} and E_{I_t}[·] = E_{1_t, . . . , m_t, ζ_t}(·).

Table 2.2: Additional notation for the analysis of IPSG.

A = [(a^1)^T, . . . , (a^N)^T]^T
K_β = ((1/N) A^T A + βI)^{-1}
C_1 = max_i ∥(a^i)^T a^i − (1/N) A^T A∥
ρ = (1/N) Σ_{i=1}^N ∥I − α((a^i)^T a^i + βI)∥
s_1 ≥ . . . ≥ s_d ≥ 0: eigenvalues of the positive semi-definite matrix A^T A
Λ_i and λ_i: the largest and the smallest eigenvalues of (a^i)^T a^i
L = β + max_{i=1,...,N} Λ_i
µ = 1 − (2 α s_d / N)(1 − α L)
σ^2 = max_{j=1,...,d} (1/N) Σ_{i=1}^N ∥((a^i)^T a^i + βI) K_β e_j − e_j∥^2
C_3 = α N σ^2 / (s_d (1 − α L))
C_2 = (α/N) Σ_{i=1}^N ∥(a^i)^T a^i − (1/N) A^T A∥ ∥K_β∥
ϱ = ∥I − α((1/N) A^T A + βI)∥
C_5(t) = (2 C_1 E_2 s_1 / N)(∥K_β∥ + ∥K̃(0)∥ ϱ^t)
C_6(t) = 2 s_d / (s_d + Nβ) − (2 s_1 / N) ∥K̃(0)∥ ϱ^{t+1}
C_7(t) = 2 C_1 E_1 (∥K_β∥ + ∥K̃(0)∥ ϱ^t)
C_8(t) = (V_2 + 1)(s_1^2 / N)(d C_3 + ∥K_β∥^2 + 2 C_2 ∥K_β∥ Σ_{j=0}^t ρ^j + ∥K̃(0)∥_F^2 µ^{t+1} + 2 ∥K_β∥ ∥K̃(0)∥ ρ^{t+1}) + 0.5
z(t) = x(t) − x*
R_1(t) = 1 + δ^2 C_8(t) + α δ C_5(t) − δ C_6(t)
R_2(t) = δ^2 V_1 N (d C_3 + ∥K_β∥^2 + 2 C_2 ∥K_β∥ Σ_{j=0}^t ρ^j + ∥K̃(0)∥_F^2 µ^{t+1} + 2 ∥K_β∥ ∥K̃(0)∥ ρ^{t+1}) + (1/2) α^2 C_7(t)^2

• Let E_t[·] denote the total expectation of a function of the random variables {I_0, . . . , I_t}, given the initial estimate x(0) and the initial pre-conditioner matrix K(0). Specifically, E_t[·] = E_{I_0, . . . , I_t}(·), t ≥ 0.
(2.43)

• Define the conditional variance of the stochastic gradient g^{i_t}(t), which is a function of the random variable i_t, given the current estimate x(t) and the current pre-conditioner K(t), as

V_{i_t}[g^{i_t}(t)] = E_{i_t}[∥g^{i_t}(t) − E_{i_t}[g^{i_t}(t)]∥^2] = E_{i_t}[∥g^{i_t}(t)∥^2] − ∥E_{i_t}[g^{i_t}(t)]∥^2. (2.44)

Additional notation is listed in Table 2.2. Among this notation, K_β is an approximation of the inverse Hessian matrix, to which the sequence of pre-conditioner matrices converges in expectation. R_1(t) and R_2(t) characterize the estimation error after t iterations. The other notation in Table 2.2 is required to define R_1(t) and R_2(t). We make the following assumption on the rank of the matrix A^T A.

Assumption 2.4. Assume that the matrix A^T A is full rank.

Note that Assumption 2.4 holds true if and only if the matrix A^T A is positive definite, with s_d > 0. As the Hessian of the aggregate cost function Σ_{i=1}^m F^i(x) is equal to A^T A for all x (see (2.5)), under Assumption 2.4 the aggregate cost function has a unique minimum point. Equivalently, the solution of the distributed least-squares problem defined by (2.6) is unique. We also assume, as formally stated in Assumption 2.5 below, that the variance of the stochastic gradient of each agent is bounded. This is a standard assumption in the analysis of stochastic algorithms [32].

Assumption 2.5. Assume that there exist two non-negative real scalar values V_1 and V_2 such that, for each iteration t = 0, 1, . . . and each agent i ∈ {1, . . . , m},

V_{i_t}[g^{i_t}(t)] ≤ V_1 + V_2 ∥Σ_{i=1}^m ∇F^i(x(t))/m∥^2.

Next, we present our key result on the convergence of Algorithm 2.

2.5.3 Convergence guarantees

Theorem 2.5. Consider Algorithm 2 with parameters β > 0, α < min{N/s_d, 1/L, 2/((s_1/N) + β)} and δ > 0. If Assumptions 2.4 and 2.5 are satisfied, then there exist two non-negative real scalar values E_1 ≥ √(V_1 N) and E_2 ≥ √(V_2 N) such that the following statements hold true.

1.
If the step-size δ is sufficiently small, then there exists a non-negative integer T < ∞ such that, for any iteration t ≥ T, R_1(t) is positive and less than 1.

2. For an arbitrary time step t ≥ 0, given the estimate x(t) and the matrix K(t),

E_t[∥z(t+1)∥^2] ≤ R_1(t) ∥z(t)∥^2 + R_2(t). (2.45)

3. Given arbitrary choices of the initial estimate x(0) ∈ R^d and the pre-conditioner matrix K(0) ∈ R^{d×d},

lim_{t→∞} E_t[∥z(t+1)∥^2] ≤ δ^2 V_1 N (d C_3 + ∥K_β∥^2 + 2 C_2 ∥K_β∥ / (1 − ρ)) + 2 α^2 (C_1 E_1 ∥K_β∥)^2.

The implications of Theorem 2.5 are as follows.

• According to Parts (1) and (2) of Theorem 2.5, for small enough values of the parameter α and the step-size δ, since R_1(t) ∈ (0, 1) after finitely many iterations, Algorithm 2 converges linearly in expectation to a neighborhood of the minimum point x* of the least-squares problem (2.6).

• According to Part (3) of Theorem 2.5, the neighborhood of x*, to which the estimates of Algorithm 2 converge in expectation, is O(V_1). In other words, the expected "distance" between the minimum point x* of (2.6) and the final estimate of Algorithm 2 is proportional to the variance of the stochastic gradients at the minimum point.

2.6 Experimental results

In this section, we first present experimental results comparing Algorithm 1 with other state-of-the-art methods. We consider five benchmark collective data matrices A, namely "ash608", "bcsstm07", "can 229", "gr 30 30", and "qc324", from the SuiteSparse Matrix Collection (http://sparse.tamu.edu/). The true value of the collective observation is b = A x*, where x* is a d-dimensional vector with all entries equal to 1. The rows of (A, b) are distributed among m = 10 agents. As A^T A is positive definite except for "can 229", the problem (2.1) has a unique solution x* for all the datasets except "can 229", which has multiple solutions.

Figure 2.2: ∥x(t) − x*∥ under Algorithm 1 with different initializations on "ash608". α = 0.1, δ = 1, β = 0.
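The data setup just described (b = A x* with x* the all-ones vector, and the rows of (A, b) split contiguously among m agents) can be sketched as follows. The 6 × 3 matrix here is a hypothetical stand-in, since fetching the SuiteSparse matrices is not shown.

```python
def make_problem(A, m):
    """Return b = A 1 and contiguous per-agent row blocks (A^i, b^i)."""
    b = [sum(row) for row in A]                  # A times the all-ones vector x*
    q, r = divmod(len(A), m)                     # first r agents get one extra row
    blocks, start = [], 0
    for i in range(m):
        size = q + (1 if i < r else 0)
        blocks.append((A[start:start + size], b[start:start + size]))
        start += size
    return b, blocks

# Stand-in for a SuiteSparse matrix: 6 rows, d = 3 columns, full column rank.
A = [[2.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0],
     [0.0, 0.0, 0.5], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
b, blocks = make_problem(A, m=3)
```

Each agent i then holds its block (A^i, b^i) privately; concatenating the blocks recovers the collective pair (A, b).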
We simulate Algorithm 1 with several initializations on the datasets "ash608" and "gr 30 30". For each of these datasets, we consider three sets of initializations (x(0), K(0)): each entry of x(0) and K(0) is zero; each entry of x(0) is selected uniformly at random within (−3, 3) and each entry of K(0) is zero; and each entry of x(0) and K(0) is selected uniformly at random within (−3, 3) and (0, 0.01), respectively. We observe that Algorithm 1 converges to x* irrespective of the initial choices of x(0) and K(0) (ref. Fig. 2.2-2.3).

Figure 2.3: ∥x(t) − x*∥ under Algorithm 1 with different initializations on "gr 30 30". α = 3 × 10^{-3}, δ = 0.4, β = 0.

The experimental results have been compared with the other distributed algorithms mentioned in Section 2.3. The parameters of all of these algorithms are chosen such that the respective algorithms achieve their optimal (smallest worst-case) convergence rates (ref. Table 2.3). For Algorithm 1, these optimal parameter values are given by α = 2/(λ_1 + λ_d + 2β) and δ* in (2.13). The optimal parameter expressions for the algorithms GD, NAG, HBM, and APC can be found in [5], [25]. The step-size parameter for BFGS is selected using backtracking [22]. Note that evaluating the optimal tuning parameters of any of these algorithms requires knowledge of the smallest and largest eigenvalues of A^T A.

Figure 2.4: ∥x(t) − x*∥ in the absence of noise for different algorithms on "bcsstm07".

We compare the number of iterations needed by these algorithms to reach a relative estimation error, defined as ϵ_tol = ∥x(t) − x*∥/∥x*∥ when the solution is unique, or ϵ_tol = ∥A x(t) − b∥/∥A x(0) − b∥ when there are multiple solutions. Clearly, Algorithm 1 is the fastest among the algorithms except BFGS, significantly so on the datasets "bcsstm07" and "gr 30 30" (Table 2.4). Thus, our theoretical claim of improvements over these methods is corroborated by the above experimental results.
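A miniature version of this iteration-count comparison can be reproduced in a few lines. The sketch below uses a synthetic diagonal A^T A with condition number 100 (so the updates decouple into scalar recursions), optimally tuned GD, and IPG with β = 0, δ = 1, K(0) = 0; the eigenvalues and tolerance are illustrative choices, not one of the SuiteSparse datasets.

```python
LAMS = [4.0, 1.0, 0.04]             # eigenvalues of A^T A; condition number 100
D = len(LAMS)

def grad(x):
    """Gradient of the least-squares cost at x, with x* = (1, ..., 1)."""
    return [l * (xi - 1.0) for l, xi in zip(LAMS, x)]

def gd_step(x, state):
    """Optimally tuned GD: delta = 2 / (lambda_1 + lambda_d)."""
    delta = 2.0 / (LAMS[0] + LAMS[-1])
    return [xi - delta * gi for xi, gi in zip(x, grad(x))], state

def ipg_step(x, kappa):
    """IPG with beta = 0, delta = 1, alpha = 1/lambda_1, K(0) = 0 (diagonal case)."""
    kappa = kappa or [0.0] * D
    alpha = 1.0 / LAMS[0]
    kappa = [k - alpha * (l * k - 1.0) for k, l in zip(kappa, LAMS)]
    return [xi - k * gi for xi, k, gi in zip(x, kappa, grad(x))], kappa

def iters_to_tol(step, tol=1e-6, cap=5000):
    """Iterations until the relative error ||x - x*|| / ||x*|| drops below tol."""
    x, state = [0.0] * D, None
    for t in range(1, cap + 1):
        x, state = step(x, state)
        if (sum((xi - 1.0) ** 2 for xi in x) / D) ** 0.5 < tol:
            return t
    return cap

gd_iters = iters_to_tol(gd_step)    # linear rate (kappa - 1)/(kappa + 1): hundreds
ipg_iters = iters_to_tol(ipg_step)  # superlinear: tens
```

On this toy problem GD needs several hundred iterations while IPG needs a few dozen, mirroring the gap reported in Table 2.4.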
Observation noise: We add uniformly distributed random noise vectors, drawn from (−0.25, 0.25) and (−0.15, 0.15) respectively, to the true output vectors of the datasets "ash608" and "gr 30 30". The algorithm parameters are chosen such that each algorithm achieves its smallest convergence rate [5, 44] (ref. Table 2.3). Each iterative algorithm is run until the changes in its subsequent estimates are less than 10^{-4} over 20 consecutive iterations. We note the final estimation errors of the different algorithms in Table 2.5. We observe that the final estimation error of the IPG method is either comparable or favorable to all the other algorithms, for each dataset.

Figure 2.5: ∥x(t) − x*∥ in the absence of noise for different algorithms on "can 229".

Process noise: We simulate the algorithms by adding noise to the iterated variables. For the algorithms GD, NAG, HBM, and IPG, the process noise has been generated by rounding off each entry of all the iterated variables in the respective algorithms to four decimal places. However, this rounding off does not generate the same values of the noise w for the APC and BFGS algorithms. Therefore, for APC, we add uniformly distributed random numbers in the range (0, 5 × 10^{-5}) for both datasets and, similarly, for BFGS, we add uniformly distributed random numbers in the ranges (0, 9 × 10^{-5}) and (0, 2 × 10^{-6}), respectively, for the datasets "ash608" and "gr 30 30". The final estimation errors of the different algorithms are noted in Table 2.5. We observe that the final error of IPG is less than that of all the other algorithms. Also, we observe that the estimation error of the BFGS algorithm on the dataset "ash608" grows unbounded after 360 iterations. The cause of this instability is the violation of the non-singularity of the approximated Hessian matrix, which is a necessary condition for the convergence of BFGS [22].

Figure 2.6: ∥x(t) − x*∥ in the presence of observation noise for different algorithms on "ash608".
2.6.1 Stochastic settings

We conduct experiments with different collective data points (A, b). Four of these datasets are from the benchmark datasets available in the SuiteSparse Matrix Collection, namely "abtaha1", "abtaha2", "gre 343", and "illc1850". The fifth dataset, "cleveland", is from the UCI Machine Learning Repository [46]. The sixth and final dataset is the "MNIST" [47] dataset.

Figure 2.7: ∥x(t) − x*∥ in the presence of process noise for different algorithms on "ash608".

For the first four aforementioned datasets, the problem is set up as follows. Consider the particular dataset "abtaha1". Here, the matrix A has 14596 rows and d = 209 columns. The collective output vector b is such that b = A x*, where x* is a 209-dimensional vector all of whose entries are unity. The data points, represented by the rows of the matrix A, and the corresponding observations, represented by the elements of the vector b, are divided amongst m = 4 agents numbered from 1 to 4. Since the matrix A for this particular dataset has 14596 rows and 209 columns, each of the four agents has a data matrix A^i of dimension 3649 × 209 and an observation vector b^i of dimension 3649. The data points of the other three datasets, "abtaha2", "gre 343", and "illc1850", are similarly distributed among m = 4, m = 7, and m = 10 agents, respectively.

Figure 2.8: ∥x(t) − x*∥ in the presence of observation noise for different algorithms on "gr 30 30".

For the fifth dataset, 212 arbitrary instances from the "cleveland" dataset have been selected. This dataset has 13 numeric attributes, each corresponding to a column of the matrix A, and a target class (whether the patient has heart disease or not), which corresponds to the output vector b. Since the attributes in the matrix A have different units, each of its columns is shifted by the mean value of that column and then divided by the standard deviation of that column.
Finally, a 212-dimensional column vector of all ones is appended to this pre-processed matrix. This is our final input matrix A, of dimension (212 × 14). The collective data points (A, b) are then distributed among m = 4 agents, in the manner described earlier.

Figure 2.9: ∥x(t) − x*∥ in the presence of process noise for different algorithms on "gr 30 30".

From the training examples of the "MNIST" dataset, we select 1500 arbitrary instances labeled as either the digit one or the digit five. For each instance, we calculate two quantities, namely the average intensity of the image and the average symmetry of the image [48]. Let the column vectors a_1 and a_2 respectively denote the average intensity and the average symmetry of those 1500 instances. Then, our input matrix before pre-processing is [a_1, a_2, a_1.^2, a_1.*a_2, a_2.^2], where (.*) represents element-wise multiplication and (.^2) represents element-wise squaring. This raw input matrix is then pre-processed as described earlier for the "cleveland" dataset. Finally, a 1500-dimensional column vector of all ones is appended to this pre-processed matrix. This is our final input matrix A, of dimension (1500 × 6). The collective data points (A, b) are then distributed among m = 10 agents, in the manner already described for the other datasets. As the matrix A^T A is positive definite in each of these cases, the optimization problem (2.6) has a unique solution x* for all of these datasets.

Figure 2.10: ∥x(t) − x*∥ for different stochastic algorithms on "MNIST".

We compare the performance of our proposed IPSG method (Algorithm 2) on the aforementioned datasets with the other stochastic algorithms mentioned in Section 2.1.4. Specifically, these algorithms are stochastic gradient descent (SGD) [32], adaptive gradient descent (AdaGrad) [33], adaptive momentum estimation (Adam) [34], and AMSGrad [37], in the distributed network architecture of Fig. 2.1.
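For reference, the IPSG iteration (2.38)-(2.42) being compared here can be sketched compactly. The dataset below is a small synthetic consistent system (b = A x*), and the parameters, seed, and iteration count are illustrative assumptions rather than the tuned values used in these experiments.

```python
import random

def ipsg(agents, alpha, delta, beta, T, seed=0):
    """agents: list of (rows, obs) pairs (A^i, b^i); returns the estimate x(T)."""
    rng = random.Random(seed)
    d = len(agents[0][0][0])
    x = [0.0] * d
    K = [[0.0] * d for _ in range(d)]            # K(0) = 0, stored by rows
    for _ in range(T):
        # Step 2: each agent samples one local data point uniformly at random
        samples = []
        for rows, obs in agents:
            idx = rng.randrange(len(rows))
            samples.append((rows[idx], obs[idx]))
        # Step 4: the server draws an agent zeta_t uniformly and uses its vectors
        a, b = samples[rng.randrange(len(agents))]
        ax = sum(ai * xi for ai, xi in zip(a, x))
        # pre-conditioner update (2.41), with h_j of (2.40):
        # k_j <- k_j - alpha [(a^T a + beta I) k_j - e_j]
        for j in range(d):
            kj = [K[r][j] for r in range(d)]
            akj = sum(ai * ki for ai, ki in zip(a, kj))   # scalar a k_j
            for r in range(d):
                h = a[r] * akj + beta * kj[r] - (1.0 if r == j else 0.0)
                K[r][j] = kj[r] - alpha * h
        # estimate update (2.42) with the stochastic gradient g = a^T (a x - b)
        x = [x[r] - delta * sum(K[r][c] * a[c] for c in range(d)) * (ax - b)
             for r in range(d)]
    return x

# Synthetic consistent data: b = A x* with x* = (1, -1, 2), m = 2 agents.
agents = [([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]], [1.0, -1.0, 0.0]),
          ([[0.0, 0.0, 1.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]], [2.0, 3.0, 1.0])]
x_hat = ipsg(agents, alpha=0.1, delta=0.05, beta=0.1, T=5000)
err = sum((u - v) ** 2 for u, v in zip(x_hat, [1.0, -1.0, 2.0])) ** 0.5
```

Because the data here is consistent, the stochastic-gradient variance vanishes at x*, so the estimate settles close to x* even with a constant step-size.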
These algorithms are implemented with different combinations of the respective algorithm parameters on the individual datasets. The parameter combinations are described below. IPSG: The optimal (smallest) convergence rate of the