ABSTRACT

Title of Dissertation: CONTROL THEORY-INSPIRED ACCELERATION OF THE GRADIENT-DESCENT METHOD: CENTRALIZED AND DISTRIBUTED

Kushal Chakrabarti
Doctor of Philosophy, 2022

Dissertation Directed by: Professor Nikhil Chopra
Department of Electrical and Computer Engineering

Mathematical optimization problems are prevalent across various disciplines in science and engineering. Particularly in electrical engineering, convex and non-convex optimization problems are well-known in signal processing, estimation, control, and machine learning research. In many of these contemporary applications, the data points are dispersed over several sources. Restrictions such as industrial competition, administrative regulations, and user privacy have motivated significant research on distributed optimization algorithms for solving such data-driven modeling problems. The traditional gradient-descent method can solve optimization problems with differentiable cost functions. However, the speed of convergence of the gradient-descent method and its accelerated variants is highly influenced by the conditioning of the optimization problem being solved. Specifically, when the cost is ill-conditioned, these methods (i) require many iterations to converge and (ii) are highly unstable against process noise. In this dissertation, we propose novel optimization algorithms, inspired by control-theoretic tools, that can significantly attenuate the influence of the problem's conditioning. First, we consider solving the linear regression problem in a distributed server-agent network. We propose the Iteratively Pre-conditioned Gradient-Descent (IPG) algorithm to mitigate the deleterious impact of the data points' conditioning on the convergence rate. We show that the IPG algorithm has an improved rate of convergence in comparison to both the classical and the accelerated gradient-descent methods.
We further study the robustness of IPG against system noise and extend the idea of iterative pre-conditioning to stochastic settings, where the server updates the estimate based on a randomly selected data point at every iteration. In the same distributed environment, we present theoretical results on the local convergence of IPG for solving convex optimization problems. Next, we consider solving a system of linear equations in peer-to-peer multi-agent networks and propose a decentralized pre-conditioning technique. The proposed algorithm converges linearly, with an improved convergence rate compared to decentralized gradient-descent. Considering the practical scenario where the computations performed by the agents are corrupted, or a communication delay exists between them, we study the robustness guarantees of the proposed algorithm and a variant of it. We apply the proposed algorithm to solving decentralized state estimation problems. Further, we develop a generic framework for adaptive gradient methods that solve non-convex optimization problems. Here, we model the adaptive gradient methods in a state-space framework, which allows us to exploit control-theoretic methodology in analyzing Adam and its prominent variants. We then utilize the classical transfer function paradigm to propose new variants of a few existing adaptive gradient methods. Applications on benchmark machine learning tasks demonstrate our proposed algorithms' efficiency. Our findings suggest further exploration of the existing tools from control theory in complex machine learning problems. The dissertation concludes by showing that the potential of the previously mentioned idea of IPG goes beyond solving generic optimization problems, through the development of a novel distributed beamforming algorithm and a novel observer for nonlinear dynamical systems, where IPG's robustness serves as a foundation in our designs.
The proposed IPG for distributed beamforming (IPG-DB) facilitates rapid establishment of communication links with far-field targets while jamming potential adversaries, without assuming any feedback from the receivers, subject to unknown multipath fading in realistic environments. The proposed IPG observer utilizes a non-symmetric pre-conditioner, like IPG, as an approximation of the inverse Jacobian of the observability mapping, such that it asymptotically replicates the Newton observer with the additional advantage of enhanced robustness against measurement noise. Empirical results are presented, demonstrating the efficiency of both of these methods compared to the existing methodologies.

CONTROL THEORY-INSPIRED ACCELERATION OF THE GRADIENT-DESCENT METHOD: CENTRALIZED AND DISTRIBUTED

by

Kushal Chakrabarti

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2022

Advisory Committee:
Professor Nikhil Chopra, Chair/Advisor
Professor Nuno Martins
Professor Richard J. La
Professor André Tits
Professor Pratap Tokekar

© Copyright by Kushal Chakrabarti 2022

to my beloved Ishita

Acknowledgments

This dissertation is one of my proudest possessions, and it would not be in its final form without the help of others. First and foremost, I would like to express my sincere gratitude to my advisor, Professor Nikhil Chopra, for giving me an invaluable opportunity to work on challenging and extremely interesting projects over the past five years. He has always made himself available for help and advice, and there has never been an occasion when I have knocked on his door and he has not given me time. It has been a pleasure to work with and learn from such an extraordinary individual.
Thanks are due to Professor Nuno Martins, Professor Richard La, Professor André Tits, and Professor Pratap Tokekar for agreeing to serve on my thesis committee and for sparing their invaluable time to review the manuscript. Thanks also to the other faculty members and teaching assistants at the university for the coursework that established my foundation for research. My colleagues have enriched my professional life in many ways and deserve a special mention. In particular, a significant part of this research was carried out in collaboration with Nirupam Gupta, Jeffrey N. Twigg, Fikadu T. Dagefu, and Amrit S. Bedi. I would like to acknowledge financial support from the Petroleum Institute at UAE, the United States Department of Agriculture, and the Army Research Laboratory for parts of the projects discussed herein.

I want to thank my friends here for making my stay memorable. My warmest thanks go to my family for their support, love, encouragement, and patience. And lastly, in Einstein's words: Thanks to those who said no; because of them, I did it myself.

Table of Contents

Dedication . . . ii
Acknowledgements . . . iii
Table of Contents . . . v
List of Tables . . . ix
List of Figures . . . xi

Chapter 1: Introduction . . . 1

Chapter 2: Distributed Linear Regression in Server-Agent Network . . . 8
2.1 Introduction . . . 8
  2.1.1 Background on gradient-descent method . . . 10
  2.1.2 Related work . . . 11
  2.1.3 System noise . . . 12
  2.1.4 Stochastic settings . . . 13
  2.1.5 Summary of our contributions . . . 15
2.2 Proposed algorithm: Iteratively Pre-Conditioned Gradient-Descent (IPG) . . . 18
  2.2.1 Motivation for IPG . . . 18
  2.2.2 Steps in each iteration t . . . 19
  2.2.3 Algorithm complexity . . . 21
  2.2.4 Convergence guarantees . . . 22
2.3 Comparisons with the existing methods . . . 25
  2.3.1 The gradient-descent method . . . 26
2.4 Robustness of the IPG method . . . 27
  2.4.1 Notation, assumption, and prior results . . . 27
  2.4.2 Robustness against observation noise . . . 29
  2.4.3 Robustness against process noise . . . 31
2.5 SGD with iterative pre-conditioning . . . 34
  2.5.1 Steps in each iteration t . . . 35
  2.5.2 Notation and assumptions . . . 37
  2.5.3 Convergence guarantees . . . 40
2.6 Experimental results . . . 41
  2.6.1 Stochastic settings . . . 44
2.7 Summary . . . 49

Chapter 3: Decentralized Linear Regression in Peer-to-Peer Network . . . 52
3.1 Introduction . . . 52
  3.1.1 Background on decentralized gradient-descent . . . 53
  3.1.2 Related work . . . 54
  3.1.3 Communication delay . . . 56
  3.1.4 Summary of our contributions . . . 57
3.2 Proposed algorithm . . . 59
  3.2.1 Convergence guarantee . . . 61
  3.2.2 Robustness against computational process noise . . . 64
3.3 Comparison with decentralized gradient-descent . . . 66
3.4 Multiple-solutions case . . . 68
3.5 Directed-graph case . . . 69
3.6 Application: decentralized state estimation . . . 70
3.7 Proposed algorithm in presence of communication delay . . . 74
  3.7.1 Convergence guarantee . . . 76
3.8 Experimental results . . . 77
  3.8.1 Comparison with decentralized Kalman filter . . . 80
  3.8.2 In presence of delay . . . 81
3.9 Summary . . . 82

Chapter 4: Distributed Convex Optimization in Server-Agent Network . . . 84
4.1 Introduction . . . 84
  4.1.1 Summary of our contributions . . . 89
4.2 Proposed algorithm: Iteratively Pre-conditioned Gradient-descent (IPG) . . . 90
  4.2.1 Steps in each iteration t . . . 91
  4.2.2 Convergence guarantees . . . 92
4.3 Experimental results . . . 97
  4.3.1 Distributed noisy quadratic model . . . 97
  4.3.2 Distributed logistic regression . . . 99
4.4 Summary and future work . . . 102

Chapter 5: Non-Convex Optimization . . . 104
5.1 Introduction . . . 104
  5.1.1 Related work . . . 110
  5.1.2 Our contributions . . . 112
5.2 Proposed algorithm: Generalized AdaGrad (G-AdaGrad) . . . 117
  5.2.1 Description of G-AdaGrad . . . 117
  5.2.2 Convergence of G-AdaGrad . . . 119
5.3 State-model of Adam and AdaBelief . . . 122
5.4 A general adaptive gradient algorithm . . . 124
5.5 Continuous-time AdaBound . . . 129
  5.5.1 State-space model of AdaBound . . . 129
  5.5.2 Convergence of AdaBound . . . 132
5.6 Convergence analysis of Nadam . . . 132
5.7 Convergence analysis of RAdam . . . 133
5.8 Convergence of rescaled gradient flow . . . 134
5.9 Proposed algorithm: AdamSSM . . . 135
5.10 Proposed algorithm: NadamSSM . . . 137
5.11 Proposed algorithm: MAdamSSM . . . 138
5.12 Experimental results . . . 140
  5.12.1 G-AdaGrad . . . 141
  5.12.2 AdamSSM . . . 142
  5.12.3 NadamSSM . . . 146
  5.12.4 MAdamSSM . . . 148
5.13 Summary and discussions . . . 150

Chapter 6: Distributed Beamforming . . . 156
6.1 Introduction . . . 156
  6.1.1 Related works . . . 156
  6.1.2 Summary of our contributions . . . 158
6.2 Problem formulation . . . 159
6.3 Proposed algorithm: Iteratively Pre-conditioned Gradient-descent for Distributed Beamforming . . . 161
  6.3.1 Steps in each iteration t ≥ 0 . . . 165
6.4 Experimental results . . . 168
6.5 Summary . . . 170

Chapter 7: Nonlinear Observer . . . 172
7.1 Introduction . . . 172
  7.1.1 Prior works . . . 173
  7.1.2 Summary of our contributions . . . 176
7.2 Problem formulation . . . 177
7.3 Proposed observer . . . 178
  7.3.1 Steps in each sampling instant k ≥ N . . . 180
  7.3.2 Step-size selection . . . 181
  7.3.3 Relation with extended Kalman filter . . . 183
7.4 Experimental results . . . 183
7.5 Summary . . . 189

Chapter 8: Summary and Future Work . . . 191
8.1 Completed work . . . 191
8.2 Future work . . . 193

Appendix A: Proofs of the Theoretical Results . . . 196
A.1 Proof of Theorem 2.1 . . . 196
A.2 Proof of Corollary 2.1 . . . 202
A.3 Proof of Lemma 2.1 . . . 203
A.4 Proof of Theorem 2.2 . . . 203
A.5 Proof of Theorem 2.3 . . . 204
A.6 Proof of Theorem 2.4 . . . 207
A.7 Proof of Theorem 2.5 . . . 210
  A.7.1 Preliminary results . . . 210
  A.7.2 Notations . . . 213
  A.7.3 Convergence of the pre-conditioner matrix . . . 214
  A.7.4 Proof of the theorem . . . 219
A.8 Proof of Lemma 3.1 . . . 231
A.9 Proof of Theorem 3.1 . . . 233
A.10 Proof of Theorem 3.2 . . . 237
A.11 Proof of Theorem 3.3 . . . 238
A.12 Proof of Theorem 3.4 . . . 242
A.13 Proof of Lemma 3.2 . . . 243
A.14 Proof of Lemma 3.3 . . . 244
A.15 Proof of Theorem 3.5 . . . 248
A.16 Proof of Lemma 4.1 . . . 249
A.17 Proof of Theorem 4.1 . . . 250
A.18 Proof of Theorem 4.2 . . . 255
A.19 Proof of Theorem 5.2 . . . 255
A.20 Proof of Theorem 5.3 . . . 261
A.21 Proof of Theorem 5.4 . . . 264
A.22 Proof of Theorem 5.5 . . . 266
A.23 Proof of Theorem 5.6 . . . 267

Bibliography . . . 269

List of Tables

2.1 Comparisons between the theoretical convergence rates of different algorithms for distributed linear regression. . . . 16
2.2 Additional notation for analysis of IPSG. . . . 38
2.3 The parameters used in different algorithms for their minimum convergence rate on distributed linear regression experiments. . . . 48
2.4 The number of iterations required by different algorithms to attain relative estimation error ϵ_tol on distributed linear regression experiments. . . . 48
2.5 Comparisons between the final estimation errors lim_{t→∞} ∥x(t) − x∗∥ for different algorithms on distributed linear regression experiments. . . . 49
2.6 The parameters used in different stochastic algorithms on distributed linear regression experiments. . . . 50
2.7 Comparisons between the number of iterations required by different stochastic algorithms to attain the specified values of the relative estimation error ϵ_tol = ∥x(t) − x∗∥ / ∥x(0) − x∗∥ on distributed linear regression experiments. . . . 51
3.1 Required notations for solving decentralized linear equations subject to communication delay. . . . 75
3.2 Comparisons between the performance of different algorithms on decentralized state prediction experiments. . . . 81
4.1 Comparison between convergence rate and per-iteration computational complexity of different algorithms for solving distributed convex optimization problems; t is the number of iterations. . . . 88
4.2 The parameters used in different algorithms on distributed convex optimization experiments. . . . 100
4.3 Comparisons between the number of iterations and total floating-point operations (flops) required by different algorithms to attain estimation error ∥x(t) − x∗∥ = 0.05 for the NQM. . . . 100
4.4 Comparisons between the number of iterations and total floating-point operations (flops) required by different algorithms to attain ∥g(t)∥ = 0.02 for binary classification on MNIST data, subject to process noise. . . . 101
5.1 Comparisons between best training accuracy, best test accuracy, and the number of training epochs required to achieve these accuracies for different algorithms on the image classification task with ResNet34. . . . 149
5.2 Comparisons between best training accuracy, best test accuracy, and the number of training epochs required to achieve these accuracies for different algorithms on the image classification task with VGG11. . . . 150
5.3 Comparisons between best training set perplexity, best test set perplexity, and the number of training epochs required to achieve these perplexities for different algorithms on the language modeling task with 1-layer LSTM. . . . 150
5.4 Comparisons between best training set perplexity, best test set perplexity, and the number of training epochs required to achieve these perplexities for different algorithms on the language modeling task with 2-layer LSTM. . . . 151
5.5 Comparisons between best training set perplexity, best test set perplexity, and the number of training epochs required to achieve these perplexities for different algorithms on the language modeling task with 3-layer LSTM. . . . 151
5.6 Comparisons between the mean (and std.) of the best training and test accuracies, and the number of training epochs required to achieve them, for the MAdamSSM (proposed) and MAdam algorithms over five runs. . . . 152
5.7 Comparisons between the best training and test accuracies, and the number of training epochs required to achieve them, for the MAdamSSM (proposed) and MAdam algorithms applied to train ResNet34 with CIFAR-10 data over five runs. . . . 152

List of Figures

2.1 Distributed system architecture. . . . 8
2.2 ∥x(t) − x∗∥ under Algorithm 1 with different initializations on "ash608". α = 0.1, δ = 1, β = 0. . . . 41
2.3 ∥x(t) − x∗∥ under Algorithm 1 with different initializations on "gr_30_30". α = 3 × 10⁻³, δ = 0.4, β = 0. . . . 42
2.4 ∥x(t) − x∗∥ in absence of noise for different algorithms on "bcsstm07". . . . 42
2.5 ∥x(t) − x∗∥ in absence of noise for different algorithms on "can_229". . . . 43
2.6 ∥x(t) − x∗∥ in presence of observation noise for different algorithms on "ash608". . . . 44
2.7 ∥x(t) − x∗∥ in presence of process noise for different algorithms on "ash608". . . . 44
2.8 ∥x(t) − x∗∥ in presence of observation noise for different algorithms on "gr_30_30". . . . 45
2.9 ∥x(t) − x∗∥ in presence of process noise for different algorithms on "gr_30_30". . . . 45
2.10 ∥x(t) − x∗∥ for different stochastic algorithms on "MNIST". . . . 46
2.11 ∥x(t) − x∗∥ for different stochastic algorithms on "illc1850". . . . 47
3.1 Error in estimating the true initial state z(0) in presence of computational process noise, for different algorithms. . . . 79
3.2 Error in estimating the system states z(t) in presence of process noise and measurement noise, for Algorithm 3 and DKF. . . . 80
3.3 ∥x_i(t) − x∗∥ under (a) Algorithm 4 and a projection-based algorithm, (b) Algorithm 4, (c) the PI consensus algorithm. . . . 82
4.1 ∥x(t) − x∗∥ for different algorithms on the noisy quadratic model. . . . 97
4.2 ∥g(t)∥ for different algorithms on MNIST. . . . 99
5.1 Decision boundary in the a1–a2 plane, obtained from training a linear regression model for classification of digit-1 and digit-5 using G-AdaGrad. The data points from the MNIST training set are plotted in the a1–a2 plane. . . . 141
5.2 Decision boundary in the a1–a2 plane, obtained from training a linear regression model for classification of digit-1 and digit-5 using G-AdaGrad. The data points from the MNIST test set are plotted in the a1–a2 plane. . . . 141
5.3 (1/2) ∥Ax(t) − b∥² − f∗ of linear regression for classifying digit-1 and digit-5 from the MNIST dataset with G-AdaGrad. . . . 141
5.4 Training loss of the logistic regression model for classifying digit-1 and digit-5 from the MNIST dataset with G-AdaGrad. . . . 142
5.5 Test set accuracy for the image classification task on the CIFAR-10 dataset with the ResNet34 architecture trained with different algorithms. . . . 143
5.6 Test set accuracy for the image classification task on the CIFAR-10 dataset with the VGG11 architecture trained with different algorithms. . . . 143
5.7 Test set perplexity for the language modeling task on the PTB dataset with the 1-layer LSTM architecture trained with different algorithms. . . . 144
5.8 Test set perplexity for the language modeling task on the PTB dataset with the 3-layer LSTM architecture trained with different algorithms. . . . 146
5.9 Test set perplexity for the language modeling task on the PTB dataset with the 2-layer LSTM architecture trained with different algorithms. . . . 147
5.10 Second raw moment estimate of the gradient along dimension i = 811, for the image classification task on the CIFAR-10 dataset with the VGG11 architecture trained with the Adam and the proposed AdamSSM algorithms. . . . 154
5.11 Second raw moment estimate of the gradient along dimension i = 1728, for the image classification task on the CIFAR-10 dataset with the VGG11 architecture trained with the Adam and the proposed AdamSSM algorithms. . . . 155
6.1 Cartesian coordinates of n = 19 beamforming agents in the synthetic problem. . . . 168
6.2 Array factors constructed by the GD algorithm for solving the synthetic beamforming problem, after different numbers of iterations t. . . . 169
6.3 Array factors {|AF_i|, i = 1, …, 49} constructed by the IPG-DB algorithm for solving the synthetic beamforming problem, after different numbers of iterations t. . . . 169
6.4 A section of Los Angeles used as the environment for the beamforming problem. . . . 170
6.5 CAD model of the section of Los Angeles from Figure 6.4. . . . 170
6.6 Cartesian coordinates of n = 5 beamforming agents in the LA urban environment problem. . . . 171
6.7 Array factors for the electric field along the z-direction constructed by the IPG-DB algorithm for solving the beamforming problem in the LA urban environment, initially (t = 0) and after t = 25000 iterations. . . . 171
7.1 State estimation error of different observers for the bio-reactor system (7.13)-(7.16) with measurement noise standard deviation 0.01. . . . 184
7.2 State estimation error of different observers for the bio-reactor system (7.13)-(7.16) with measurement noise standard deviation 0.01. . . . 185
7.3 State estimation error of different observers for the bio-reactor system (7.13)-(7.16) with measurement noise standard deviation 0.001. . . . 186
7.4 State estimation error of different observers for the bio-reactor system (7.13)-(7.16) with measurement noise standard deviation 0.001. . . . 187
7.5 State estimation error of different observers for the inverted pendulum-cart system (7.17)-(7.21) with measurement noise standard deviation 0.1. . . . 188
7.6 State estimation error of different observers for the inverted pendulum-cart system (7.17)-(7.21) with measurement noise standard deviation 0.01. . . . 189
7.7 State estimation error of different observers for the inverted pendulum-cart system (7.17)-(7.21) with measurement noise standard deviation 0.1. . . . 189
7.8 State estimation error of different observers for the inverted pendulum-cart system (7.17)-(7.21) with measurement noise standard deviation 0.01. . . . 190
7.9 True system trajectory and its estimate for the system (7.22)-(7.24) obtained by the IPG observer. . . . 190
7.10 True system trajectory and its estimate for the system (7.22)-(7.24) obtained by the LPV observer. . . . 190

Chapter 1: Introduction

Mathematical optimization is a common sub-problem in various disciplines, such as modeling, estimation, control systems, wireless communication, data analysis, finance, and operations research. Especially with the recent data-driven technological advancements, optimization is ubiquitous across applications. A simple optimization technique known as least-squares dates back to 1795, when it was invented by the young scientist Karl Friedrich Gauss [1] for studying planetary motion using telescopic measurement data. Since then, several new optimization methodologies have been developed. The most attractive feature of optimization lies in the constructive formulation of a problem, followed by the availability of sophisticated and thoroughly studied tools to solve the formulated problem reliably and efficiently. Therefore, the focus of this dissertation is on developing novel tools for solving a class of optimization problems, aimed at balancing the decisive triad of performance, efficiency, and reliability.

Specifically, this dissertation considers unconstrained optimization problems, where the objective is to minimize a smooth, differentiable cost function. The classical gradient-descent algorithm, attributed to Cauchy (1847), is the basic first-order optimization algorithm for solving such problems. The gradient-descent algorithm is iterative: the idea is to repeatedly refine an estimate of the minimum point by taking steps in the direction opposite to the gradient of the cost function at the current estimate [2]. In principle, the gradient-descent method eventually converges to a minimum point of the cost.
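To make the update rule concrete, the following is a minimal sketch of classical gradient-descent applied to a small quadratic cost; the example matrix, step size, and iteration count are illustrative choices, not taken from the dissertation.

```python
import numpy as np

def gradient_descent(grad, x0, step_size, num_iters):
    """Classical gradient-descent: repeatedly step opposite the gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - step_size * grad(x)
    return x

# Minimize f(x) = 0.5 * x^T A x - b^T x, whose minimum point solves A x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_min = gradient_descent(lambda x: A @ x - b, x0=[0.0, 0.0],
                         step_size=0.1, num_iters=500)
```

For a quadratic cost with Hessian A, this iteration converges for step sizes below 2/λmax(A), and the number of iterations needed to reach a given accuracy grows with the condition number λmax(A)/λmin(A) — the ill-conditioning issue that motivates the pre-conditioning techniques developed in the later chapters.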
However, the speed of convergence of the gradient-descent method is highly influenced by the conditioning of the optimization problem being solved. Specifically, when the cost is ill-conditioned, this method (i) requires many iterations to converge and (ii) is highly unstable against process noise. This dissertation aims to develop novel optimization algorithms, inspired by classical tools from control theory, that can significantly attenuate this fundamental influence of the problem's conditioning.

In many contemporary applications, the data points are dispersed over several sources (or agents). Due to industrial competition, administrative regulations, and user privacy, it is nearly impossible to collect the individual data from the different agents at a single machine [3]. In recent years, these restrictions have motivated a significant amount of research on distributed algorithms for solving data-driven modeling problems. In most distributed algorithms, an individual agent is not required to transmit its raw data points [3, 4]. In the aforementioned cases, the data itself is distributed over multiple sources or agents. Moreover, with the availability of a large amount of data, it is sometimes necessary to distribute the processing task over multiple machines or agents due to the inherent complexity of the problem in terms of computational and/or memory requirements [5]. Therefore, apart from centralized optimization problems, this dissertation also considers solving optimization problems with a differentiable cost function in a distributed manner.

In Chapters 2-5, we propose our algorithms for solving specific optimization problems in centralized settings and in distributed networks of multiple agents. In Chapters 6-7, we go beyond solving generic optimization problems by considering the distributed beamforming problem and the nonlinear observer problem, wherein we develop new techniques for these two problems by leveraging an idea from Chapter 4.
In Chapter 2, we consider the multi-agent linear least-squares problem in a server-agent network architecture. The system comprises multiple agents, each with a set of local data points. The agents are connected to a server, and there is no inter-agent communication. The agents’ goal is to compute a linear model that optimally fits the collective data. The agents, however, cannot share their data points. In principle, the agents can solve this problem by collaborating with the server using the server-agent network variant of the classical gradient-descent method. However, when the data points are ill-conditioned, the gradient-descent method requires a large number of iterations to converge. We propose a novel iterative pre-conditioning technique to mitigate the deleterious impact of the data points’ conditioning on the convergence rate of the gradient-descent method. The idea of iterative pre-conditioning is inspired by adaptive control techniques. We rigorously show that our proposed algorithm converges with an improved rate of convergence in comparison to both the classical and the accelerated gradient-descent methods. In the presence of system uncertainties, we characterize the proposed algorithm’s robustness against noise. We further extend our algorithm to the stochastic settings, where the server updates the estimate of a minimum point based on a single randomly selected data point at every iteration instead of the full batch of data points. Through numerical experiments on benchmark least-squares problems, we validate our theoretical findings. This work was carried out under the PSIM Project, supported by the Petroleum Institute, Khalifa University of Science and Technology, Abu Dhabi, UAE. The research works in this chapter have been published in [6–9].

In Chapter 3, we consider solving the linear least-squares problem in peer-to-peer multi-agent networks.
Again, each agent has a set of local data points, and its goal is to compute a linear model that fits the collective data points. In principle, the agents can apply the decentralized gradient-descent method. However, when the data matrix is ill-conditioned, gradient-descent requires many iterations to converge and is unstable against system noise. We propose a decentralized pre-conditioning technique to address these issues with the decentralized gradient-descent method. We show that the proposed algorithm has an improved convergence rate compared to the decentralized gradient-descent. Considering the practical scenario where the computations performed by the agents are corrupted, we also study the robustness guarantee of the proposed algorithm. Additionally, we apply the proposed algorithm for solving decentralized linear state estimation problems. The empirical results show our proposed state predictor’s favorable convergence rate and robustness against system noise compared to prominent decentralized algorithms. When the communication links between agents are subject to potentially large but constant delays, we utilize a similar decentralized pre-conditioning technique to propose a delay-tolerant algorithm that solves the linear equations in the decentralized setting. Empirical results show that the accuracy of the solution obtained by the proposed algorithm is better than or comparable to the existing methods. This work was partially supported by the USDA grant 2020-68012-31085 and partially carried out under the PSIM Project, supported by the Petroleum Institute, Khalifa University of Science and Technology, Abu Dhabi, UAE. The research works in this chapter have been published in [10, 11].

In Chapter 4, we study a distributed multi-agent convex optimization problem. The system architecture is the same as in Chapter 2. However, the cost function here is convex and not limited to quadratic cost.
The agents’ goal is to learn a parameter vector that optimizes the aggregate of their local costs without revealing their local data points. We extend the idea of iterative pre-conditioning, proposed in Chapter 2, to convex cost functions and a class of non-convex functions. We present theoretical results on the local convergence of the proposed algorithm. We demonstrate our algorithm’s superior performance compared to prominent distributed algorithms for solving real logistic regression problems and emulating neural network training via a noisy quadratic model, thereby signifying the proposed algorithm’s efficiency for distributively solving non-convex optimization problems. The further potential of the IPG idea is investigated later in Chapters 6-7.

In Chapter 5, we consider solving smooth non-convex optimization problems, with a focus on deep neural network models. We aim to develop a generic framework for adaptive gradient-descent algorithms for solving non-convex problems in the continuous-time domain. Particularly, we model the adaptive gradient methods in a state-space framework, which allows us to exploit standard control-theoretic tools in analyzing existing prominent adaptive methods. We rigorously show that a few other algorithms that are not encompassed by the proposed generic framework can also be modeled and analyzed using a similar technique. The analyses of all these fast gradient algorithms are unified by a common underlying proof sketch, relying upon Barbalat’s lemma. Control-theoretic tools can also aid in developing novel adaptive gradient algorithms, as we show by proposing new variants of three existing adaptive gradient methods, using the concept of transfer functions. Applications on benchmark machine learning tasks demonstrate the proposed algorithms’ efficiency. The findings in this chapter suggest further exploration of the existing tools from control theory in complex machine learning problems.
Part of the research works in this chapter have been published in [12] and accepted for presentation at [13, 14].

Distributed multi-agent beampattern matching inherently enables covert communication by constructively combining the transmitted signals at target receivers and forming nulls towards the adversaries. Most existing beamforming methodologies achieve this by relying on partial or full-state real-time feedback from the receivers. Chapter 6 proposes a novel distributed beamforming technique that does not assume any feedback from the receivers or channel parameters such as multipath fading. The proposed algorithm works in a server-agent architecture of the beamforming agents, eliminating the need for the receivers’ feedback. Our algorithm is built on the classical gradient-descent (GD) method. However, when the problem is ill-conditioned, GD requires many iterations to converge and is unstable against system noise. We propose an iterative pre-conditioning technique, founded upon the IPG algorithm proposed in Chapter 4, to mitigate the deleterious effects of the data points’ conditioning on the convergence rate, facilitating a rapid establishment of communication links with far-field targets. The empirical results demonstrate the proposed beamforming algorithm’s favorable convergence rate and robustness against unknown multipath fading in realistic environments. This work is part of an ArtIAMAS project (https://artiamas.umd.edu/) and is partially funded by ARL Grant No. W911NF2120076. The results in this chapter have been accepted for presentation at [15].

Nonlinear observers are required for process monitoring, fault detection, signal reconstruction, and control in various applications. The previously proposed Newton observer has fast exponential convergence and applies to a wide class of problems. However, the Newton observer lacks robustness against measurement noise due to the computation of a matrix inverse at each step.
In Chapter 7, we propose a novel observer for discrete-time plant dynamics with sampled measurements to alleviate the impact of measurement noise. The key to the proposed observer is an iterative pre-conditioning technique for the gradient-descent method, proposed in Chapter 4, for solving general optimization problems. The proposed observer utilizes a non-symmetric pre-conditioner to approximate the observability mapping’s inverse Jacobian, so that it asymptotically replicates the Newton observer with the additional advantage of enhanced robustness against measurement noise. Our observer applies to a wide class of nonlinear systems, as it is not contingent upon linearization or any specific structure in the plant nonlinearity and the measurement mapping. Its improved robustness against measurement noise compared to the prominent nonlinear observers is demonstrated through empirical results. Its relation with the extended Kalman filter is also discussed.

Finally, Chapter 8 summarizes the completed work and presents the future research directions resulting from this dissertation. Detailed proofs of the theoretical results in this dissertation are presented in Appendix A.

Chapter 2: Distributed Linear Regression in Server-Agent Network

2.1 Introduction

Figure 2.1: Distributed system architecture.

In this chapter, we consider the multi-agent distributed linear least-squares problem. The nomenclature distributed here refers to the data being distributed across multiple agents. Specifically, we consider a system comprising multiple agents, each agent having a set of local data points. The agents can communicate bidirectionally with a central server, as shown in Fig. 2.1. However, there is no inter-agent communication, and the agents cannot share their local data points with the server. The agents’ goal is to generate a linear mathematical model that optimally fits the data points held by all the agents.
To solve this problem, as an individual agent cannot access the collective data points, the agents collaborate with the central server. Henceforth, we will refer to the above-described system architecture as a server-agent network. Also, throughout this chapter, the system is assumed to be synchronous.

To be precise, we consider m agents in the system. Each agent i has a set of n_i data points, represented by the rows of an (n_i × d)-dimensional real-valued local data matrix A^i, and the elements of an n_i-dimensional real-valued local observation vector b^i. Typically, n_i > d for each i. The goal for the agents is to compute a parameter vector x^* ∈ R^d such that

    x^* \in X^* = \arg\min_{x \in \mathbb{R}^d} \sum_{i=1}^{m} \frac{1}{2} \left\| A^i x - b^i \right\|^2,    (2.1)

where ‖·‖ denotes the Euclidean norm. For each agent i, we define a local cost function

    F^i(x) = \frac{1}{2} \left\| A^i x - b^i \right\|^2, \quad \forall x \in \mathbb{R}^d.    (2.2)

Note that solving the optimization problem (2.1) is equivalent to computing a minimum point of the aggregate cost function \sum_{i=1}^{m} F^i(x). An algorithm that enables the agents to jointly solve the above problem in the architecture of Fig. 2.1 without sharing their data points is defined as a distributed algorithm.

Common applications of the above linear least-squares problem include linear regression, state estimation, and hypothesis testing [16], [17]. Also, a wide range of supervised machine learning problems can be modeled as a linear least-squares problem, such as supply chain demand forecasting [18], prediction of online user input actions [19], and the problem of selecting sparse linear solvers [20]. In many contemporary applications, the data points are dispersed over several sources (or agents). Due to industrial competition, administrative regulations, and user privacy, it is nearly impossible to collect the individual data from different agents at a single machine (or server) [3].
These restrictions have motivated a significant amount of research in recent years on distributed server-agent algorithms for solving data-driven modeling problems, such as (2.1) above [3], [4]. Herein lies our motivation to improve upon the state-of-the-art methods for solving (2.1) in a server-agent network. To be able to present our key contributions, we review below the server-agent version of the traditional gradient-descent method [2].

2.1.1 Background on gradient-descent method

The gradient-descent (GD) method is an iterative algorithm wherein the server maintains an estimate of a minimum point, defined by (2.1), and updates it iteratively using the gradients of the individual agents’ local cost functions. To be precise, for each iteration t = 0, 1, ..., let x(t) ∈ R^d denote the estimate maintained by the server. The initial estimate x(0) may be chosen arbitrarily from R^d. For each iteration t, the server broadcasts x(t) to all the agents. Each agent i computes the gradient, denoted by g^i(t), of its local cost function F^i(x) at x(t). Specifically, for all i ∈ {1, ..., m} and for all t ∈ {0, 1, ...},

    g^i(t) = \nabla F^i(x(t)) = \left( A^i \right)^T \left( A^i x(t) - b^i \right).    (2.3)

Here, (·)^T denotes the transpose. The agents send their computed gradients {g^i(t), i = 1, ..., m} to the server. Upon receiving the gradients, the server updates x(t) to x(t+1) according to

    x(t+1) = x(t) - \delta \sum_{i=1}^{m} g^i(t), \quad t \in \{0, 1, \ldots\},    (2.4)

where δ is a positive scalar real value, commonly referred to as the step-size. Let g(t) denote the sum of all the agents’ gradients, that is, g(t) = \sum_{i=1}^{m} g^i(t). So, g(t) is the gradient of the aggregate cost defined in (2.1). Substituting from above in (2.4), we obtain that the convergence of the above server-agent version of the GD method is identical to its centralized counterpart with cost function \sum_{i=1}^{m} F^i(x). It is known that for a small enough step-size δ, the gradients {g(t), t = 0, 1, ...} converge linearly to 0_d, the d-dimensional zero vector.
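As a concrete illustration, the server-agent GD iteration (2.3)-(2.4) can be sketched in a few lines of NumPy. The toy data, the number of agents, and the step-size choice below are hypothetical, and the message passing between the server and the agents is simulated by plain function calls.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: m = 3 agents, each with n_i = 5 points in d = 2 dimensions.
m, n_i, d = 3, 5, 2
A = [rng.standard_normal((n_i, d)) for _ in range(m)]
b = [rng.standard_normal(n_i) for _ in range(m)]

def agent_gradient(i, x):
    """Gradient g^i(t) of the local cost F^i at x, per (2.3)."""
    return A[i].T @ (A[i] @ x - b[i])

# Server iteration (2.4): broadcast x(t), sum the returned gradients, take a step.
x = np.zeros(d)                                   # arbitrary initial estimate x(0)
lam1 = np.linalg.norm(sum(Ai.T @ Ai for Ai in A), 2)
delta = 1.0 / lam1                                # step-size small enough for convergence
for t in range(2000):
    g = sum(agent_gradient(i, x) for i in range(m))
    x = x - delta * g

print(np.linalg.norm(g))                          # aggregate gradient is now near zero
```

With a step-size below 2/λ₁(AᵀA), the simulated aggregate gradient decays linearly to zero, consistent with the convergence behavior described above.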
Specifically, for δ small enough, there exists µ ∈ [0, 1) such that ‖g(t)‖ ≤ µ^t ‖g(0)‖ for all t ≥ 0 [21]. Commonly, the value µ is referred to as the convergence rate [22]. The smaller the value of µ, the faster the convergence, and vice-versa.

We propose an iterative pre-conditioning technique that improves upon the GD method’s convergence rate in a server-agent network. Specifically, in each iteration, the server multiplies the aggregate of the agents’ gradients g(t) by a pre-conditioner matrix K before updating the local estimates. However, unlike the classical pre-conditioning techniques [21], the server iteratively updates the pre-conditioner matrix K. Hence, the name iterative pre-conditioning.

2.1.2 Related work

In his seminal work, Nesterov showed that the use of momentum can significantly accelerate the GD method [23]. Recently, there has been work on the applicability of the NAG method to the server-agent network, such as [5] and references therein. Azizan-Ruhi et al. [5] have proposed the APC method, a combination of the NAG method with a projection operation. Azizan-Ruhi et al. have shown through experiments that their APC method converges faster compared to the variants of the NAG method and the HBM [24]. However, they do not provide any theoretical guarantee for the improvement in the convergence speed. Also, Azizan-Ruhi et al. only consider a degenerate case of the optimization problem (2.1) where the set of linear equations A^i x = b^i, i = 1, ..., m, has a unique solution. We consider a more general setting wherein the minimum value of the aggregate cost function \sum_{i=1}^{m} F^i(x) may not be zero, and the solution of the optimization problem (2.1) may not be unique. The HBM is another momentum-based accelerated variant of the GD method [24].
For the case when the optimization problem (2.1) has a unique solution, both these accelerated methods, namely the NAG method and the HBM, are known to converge linearly with a rate of convergence smaller than that of the above traditional GD method [25], [26]. Newton’s method converges quadratically, hence superlinearly [22]. However, Newton’s method does not apply to the server-agent networked system unless the agents share their local data points with the server. A quasi-Newton method, on the other hand, such as BFGS, applies to server-agent networks [22]. However, similar to Newton’s method, BFGS also requires the solution of the optimization problem (2.1) to be unique. We consider a setting where the solution of the problem (2.1) may not be unique.

2.1.3 System noise

These distributed algorithms are iterative, wherein the server maintains an estimate of a solution defined by (2.1), which is updated iteratively using the gradients of the individual agents’ local cost functions defined in (2.2). In an ideal scenario with no noise, these algorithms converge to an optimal regression parameter defined in (2.1). Practical systems, however, inevitably suffer from uncertainties or noise [27, 28]. Specifically, we consider two types of additive system noise: 1) observation noise, and 2) process noise. The observation noise, as the name suggests, models the uncertainties in the local data points observed by the agents [29]. The process noise models the uncertainties or jitters in the computation process due to hardware failures, quantization errors, or noisy communication links [28].

In this dissertation, we empirically show that the approximation error, or the robustness, of our method in the presence of the above system noises compares favorably to all the other aforementioned prominent distributed algorithms. Besides empirical results, we also present formal analyses of the proposed method’s robustness when negatively impacted by additive system noises.
Robustness analyses of some of the other aforementioned distributed algorithms are in [30, 31].

2.1.4 Stochastic settings

We also consider solving the distributed linear least-squares problem using stochastic algorithms. In particular, each agent i has n local data points, represented by a local data matrix A^i and a local observation vector b^i of dimensions n × d and n × 1, respectively. Thus, for all i ∈ {1, ..., m}, A^i ∈ R^{n×d} and b^i ∈ R^n. For each agent i, we define a local cost function F^i : R^d → R such that for a given parameter vector x ∈ R^d,

    F^i(x) = \frac{1}{2n} \left\| A^i x - b^i \right\|^2.    (2.5)

The agents’ objective is to compute an optimal parameter vector x^* ∈ R^d such that

    x^* \in \arg\min_{x \in \mathbb{R}^d} \frac{1}{m} \sum_{i=1}^{m} F^i(x).    (2.6)

There are several theoretical and practical reasons for solving the distributed problem (2.6) using stochastic methods rather than batched optimization methods, particularly when the number of data points is abundant [32]. The basic prototype of the stochastic optimization methods that solve (2.6) is the traditional stochastic gradient descent (SGD) method [32]. Several accelerated variants of the stochastic gradient descent algorithm have been proposed in the past decade [33–38]. A few such well-known methods are adaptive gradient descent (AdaGrad) [33], adaptive momentum estimation (Adam) [34], and AMSGrad [37]. These algorithms are stochastic, wherein the server maintains an estimate of a solution defined by (2.6), which is refined iteratively by the server using the stochastic gradients computed by a randomly chosen agent. In particular, Adam has been demonstrated to compare favorably with other stochastic optimization algorithms for a wide range of optimization problems. However, Adam updates the current estimate effectively based on only a window of the past gradients, due to the exponentially decaying term present in its estimate-updating equation, which leads to poor convergence in many problems [37].
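The single-data-point stochastic update underlying the SGD prototype for (2.6) can be sketched as follows; the ground-truth data model, step-size, and iteration count below are illustrative assumptions, with one random agent and one random local data point queried per iteration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data generated from a ground-truth model (names are illustrative).
m, n, d = 4, 20, 3
x_true = np.array([1.0, -2.0, 0.5])
A = [rng.standard_normal((n, d)) for _ in range(m)]
b = [Ai @ x_true + 0.1 * rng.standard_normal(n) for Ai in A]

# Stochastic iterations: at each step the server queries one random agent,
# which returns a gradient based on a single random local data point.
x = np.zeros(d)
eta = 0.01                           # small constant step-size
for t in range(30000):
    i = rng.integers(m)              # server picks a random agent
    j = rng.integers(n)              # agent picks a random local data point
    a = A[i][j]
    x = x - eta * a * (a @ x - b[i][j])

# x should land in a small neighborhood of the minimizer of (2.6).
x_star, *_ = np.linalg.lstsq(np.vstack(A), np.concatenate(b), rcond=None)
print(np.linalg.norm(x - x_star))
```

With a constant step-size, the iterates settle in a neighborhood of the minimizer whose radius scales with the step-size and the gradient noise, which is the behavior the stochastic analysis later in this chapter quantifies.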
A recently proposed variant of Adam is the AMSGrad algorithm, which proposes to fix Adam’s convergence issue by incorporating a “long-term memory” of the past gradients.

In this dissertation, we propose a stochastic iterative pre-conditioning technique for improving the rate of convergence of the distributed stochastic gradient descent method when solving the linear least-squares problem (2.6) in distributed networks. The idea of iterative pre-conditioning in the deterministic (batched data) case has been mentioned in Section 2.1.1, wherein the server updates the estimate using the sum of the agents’ gradients multiplied with a suitable iterative pre-conditioning matrix. Updating the pre-conditioning matrix depends on the entire dataset at each iteration. We extend that idea to the stochastic settings, where the server updates both the estimate and the iterative pre-conditioning matrix based on a randomly chosen agent’s stochastic gradient at every iteration. Each agent computes its stochastic gradient based on a single randomly chosen data point from its local set of data points. Using real-world datasets, we empirically show that the proposed algorithm converges in fewer iterations compared to the aforementioned state-of-the-art distributed methods. Besides empirical results, we also present a formal analysis of the proposed algorithm’s convergence in stochastic settings.

2.1.5 Summary of our contributions

In this chapter, we propose an iterative pre-conditioning technique for improving the rate of convergence of the traditional gradient-descent method when solving the linear least-squares problem in a server-agent network. We present below a summary of our key contributions and comparisons with other prominent algorithms applicable to the server-agent network architecture.

2.1.5.1 In deterministic settings

• We show that our algorithm, in general, converges linearly.
For the special case when the solution of (2.1) is unique, the convergence of our algorithm is superlinear. Please refer to Section 2.2.4 for details.

• We show that our algorithm has a favorable rate of convergence compared to the prominent distributed linear least-squares algorithms applicable to the server-agent network: the gradient-descent (GD) method, Nesterov’s accelerated gradient-descent (NAG) method, the heavy-ball method (HBM), and the accelerated projection-consensus method (APC) [5]. These comparisons are summarized in Table 2.1. Please refer to Section 2.3 for details.

- In the particular case when the solution of (2.1) is unique, our algorithm converges superlinearly, which is comparable only to the BFGS method [22]. The convergence of the other aforementioned algorithms, on the other hand, is only linear [5], [25].

- For the general case when the least-squares problem (2.1) has multiple solutions, both our algorithm and the GD method converge linearly. In this case, our algorithm provably improves upon the explicit rate of convergence of GD. Specifically, our algorithm converges exponentially faster compared to the GD method. The explicit linear rate of convergence of the NAG and HBM methods relies on the uniqueness of the solution [25], [26]. For multiple solutions, these methods’ convergence rate is available only in terms of the order of iteration numbers [26], [39]. However, not only the order of the number of iterations but also the coefficient associated with the order contributes to characterizing the convergence rate of an algorithm. Hence, we provide the explicit convergence rate of our algorithm. Among the other aforementioned algorithms, the convergence of the APC method and the BFGS method requires uniqueness of the solution [22], [5].

• We validate our theoretical results through numerical experiments, presented in Section 2.6, on benchmark real-world datasets.
Table 2.1: Comparisons between the theoretical convergence rates of different algorithms for distributed linear regression.

Algorithm     Condition for convergence        Explicit rate (unique solution)   Explicit rate (multiple solutions)
Algorithm 1   any number of optimal solutions  superlinear                       linear, faster than GD
GD            any number of optimal solutions  linear [5], [25]                  linear
NAG, HBM      unique optimal solution          linear [5], [25]                  not known [26], [39]
APC           unique optimal solution          linear [5]                        need not converge [22]
BFGS          unique optimal solution          superlinear [22]                  need not converge [22]

2.1.5.2 In the presence of system noise

Here, we make two key contributions, summarized as follows.

• In Section 2.4, we theoretically characterize the robustness guarantees of the IPG method against both the observation and the process noises.

• In Section 2.6, we empirically show the improved robustness of the IPG method in comparison to the state-of-the-art algorithms.

2.1.5.3 In stochastic settings

• We present a formal convergence analysis of our proposed stochastic algorithm. Our convergence result can be informally summarized as follows. Suppose the solution of problem (2.6) is unique, and the variances of the stochastic gradients computed by the agents are bounded. In that case, our proposed algorithm, i.e., Algorithm 2, converges linearly in expectation to a proximity of the solution of problem (2.6). The approximation error is proportional to the algorithm’s step-size and the variances of the stochastic gradients. Note that, as in the deterministic settings, our algorithm converges superlinearly to the exact solution when the gradient noise is zero. Formal details are presented in Theorem 2.5 in Section 2.5.3.

• Using real-world datasets, we empirically show that our proposed stochastic algorithm’s convergence rate is superior to that of the state-of-the-art stochastic methods when distributively solving linear least-squares problems.
These datasets comprise

- four benchmark datasets from the SuiteSparse Matrix Collection;

- a subset of the “cleveland” dataset from the UCI Machine Learning Repository, which contains binary classification data of whether a patient has heart failure or not, based on 13 features;

- a subset of the “MNIST” dataset for classification of the handwritten digits one and five.

Please refer to Section 2.6 for further details.

2.2 Proposed algorithm: Iteratively Pre-Conditioned Gradient-Descent (IPG)

In this section, we present our algorithm, its computational complexity, and its convergence properties in the deterministic settings, i.e., with full-batch data and when there is no noise in the system.

2.2.1 Motivation for IPG

The proposed algorithm is similar to the gradient-descent method described in Section 2.1.1. However, a notable difference is that in our algorithm, the server multiplies the aggregate of the gradients received from the agents by a pre-conditioner matrix. The server uses the pre-conditioned aggregate of the agents’ gradients to update its current estimate. In the literature, this technique is commonly referred to as pre-conditioning [40]. When the matrix A^T A is non-singular, the best pre-conditioner matrix for the gradient-descent method is the inverse Hessian matrix (A^T A)^{-1}, resulting in Newton’s method, which converges superlinearly. However, (A^T A)^{-1} cannot be computed directly in a distributed setting, as it requires the agents to send their local data points to the server. Thus, instead of a constant pre-conditioner matrix (A^T A)^{-1}, we propose a distributed scheme where the server iteratively updates the pre-conditioning matrix in such a way that it converges to (A^T A)^{-1} (ref. Lemma A.1). This specific iterative update rule of the pre-conditioning matrix is described later in Step 4 of the next subsection. Thus, our algorithm eventually converges to Newton’s method and has a superlinear convergence rate, as shown later in Section 2.2.4.
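To see why (A^T A)^{-1} is the ideal pre-conditioner, note that for the quadratic cost (2.1) a gradient step pre-multiplied by (A^T A)^{-1} is exactly a Newton step, which lands on the minimizer in a single iteration. The sketch below checks this on hypothetical full-rank data; it is centralized, for illustration only, since this inverse is not computable distributively.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical full-rank stacked data matrix A and observation vector b.
A = rng.standard_normal((30, 4))
b = rng.standard_normal(30)

H_inv = np.linalg.inv(A.T @ A)       # ideal (but centralized) pre-conditioner
x = rng.standard_normal(4)           # arbitrary initialization
g = A.T @ (A @ x - b)                # aggregate gradient of the quadratic cost
x = x - H_inv @ g                    # one pre-conditioned step = one Newton step

# For a quadratic cost, a single Newton step lands exactly on the minimizer.
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(x - x_star))
```

Algebraically, x − (AᵀA)⁻¹(AᵀA x − Aᵀb) = (AᵀA)⁻¹Aᵀb, which is the unique least-squares solution, independent of the initialization.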
Herein lies the motivation for our proposed algorithm.

The idea of iterative pre-conditioning is inspired by simple adaptive control techniques [41]. In adaptive control theory, the goal is to design a feedback control law, when the system parameters are unknown, so that the system state is driven to an equilibrium point. Analogous to such techniques, the proposed algorithm “adapts” to the unknown inverse Hessian matrix (A^T A)^{-1} by tuning the iterative pre-conditioner matrix (the “feedback gain”), which drives the estimate (system state) to a solution of (2.1). At the same time, the pre-conditioner matrix, which is an estimate of the unknown parameter (A^T A)^{-1}, also converges to the true parameter (A^T A)^{-1}. Thus, in the language of adaptive control theory, we have both parameter convergence and state convergence.

In each iteration t ∈ {0, 1, ...}, the server maintains an estimate x(t) of a minimum point of (2.1), and a pre-conditioner matrix K(t) ∈ R^{d×d}. The initial estimate x(0) and the pre-conditioner matrix K(0) are chosen arbitrarily from R^d and R^{d×d}, respectively. For each iteration t ≥ 0, the algorithm steps are presented below.

2.2.2 Steps in each iteration t

In each iteration t, the algorithm comprises four steps, executed collaboratively by the server and the agents. Before initiating the iterative process, the server chooses three non-negative scalar real-valued parameters α, δ, and β, whose specific values are presented later in Section 2.2.4. The parameter β is broadcast to all the agents.

• Step 1: The server sends the estimate x(t) and the matrix K(t) to each agent i.

• Step 2.1: Each agent i computes the gradient g^i(t), as defined in (2.3).

• Step 2.2: Each agent i computes a set of vectors { R^i_j(t) : j = 1, ..., d } such that for each j = 1, ...
, d,

    R^i_j(t) = \left( \left( A^i \right)^T A^i + \frac{\beta}{m} I \right) k_j(t) - \frac{1}{m} e_j,    (2.7)

where I denotes the (d × d)-dimensional identity matrix, and e_j and k_j(t) denote the j-th columns of the matrices I and K(t), respectively.

• Step 3: Each agent i sends the gradient g^i(t) and the set of vectors { R^i_j(t), j = 1, ..., d } to the server.

• Step 4: The server updates the matrix K(t) as

    k_j(t+1) = k_j(t) - \alpha \sum_{i=1}^{m} R^i_j(t), \quad j = 1, \ldots, d.    (2.8)

Finally, the server updates the estimate x(t) to x(t+1) such that

    x(t+1) = x(t) - \delta K(t+1) \sum_{i=1}^{m} g^i(t).    (2.9)

The parameter δ is a non-negative real value, commonly referred to as the step-size. Our algorithm is summarized below in Algorithm 1.

Algorithm 1 Iterative pre-conditioning for the gradient-descent method in a server-agent network.
1: The server initializes x(0) ∈ R^d, K(0) ∈ R^{d×d} and selects the parameters α > 0, δ > 0, and β ≥ 0.
2: for t = 0, 1, 2, ... do
3:   The server sends x(t) and K(t) to each agent i ∈ {1, ..., m}.
4:   Each agent i computes the gradient g^i(t) as defined by (2.3), and a set of vectors { R^i_j(t), j = 1, ..., d } as defined by (2.7).
5:   Each agent i sends g^i(t) and the set { R^i_j(t), j = 1, ..., d } to the server.
6:   The server updates K(t) to K(t+1) as defined by (2.8).
7:   The server updates the estimate x(t) to x(t+1) as defined by (2.9).
8: end for

2.2.3 Algorithm complexity

We now present the computational complexity of Algorithm 1, in terms of the total number of floating-point operations (flops) required per iteration.

For each iteration t, each agent i computes the gradient g^i(t), defined in (2.3), and d vectors {R^i_j(t) : j = 1, ..., d}, defined in (2.7). Computation of g^i(t) requires two matrix-vector multiplications, namely A^i x(t) and (A^i)^T (A^i x(t) - b^i), in that order. As A^i is an (n_i × d)-dimensional matrix and x(t) is a d-dimensional vector, computation of the gradient g^i(t) requires O(n_i d) flops.
From (2.7), computation of each vector R^i_j(t) requires two matrix-vector multiplications, namely A^i k_j(t) and (A^i)^T (A^i k_j(t)), in that order. As A^i is an (n_i × d)-dimensional matrix, and both vectors k_j(t) and A^i k_j(t) are of dimension d, computation of each R^i_j(t) requires O(n_i d) flops. Thus, the net computation of the d vectors {R^i_j(t) : j = 1, ..., d} requires O(n_i d^2) flops. Therefore, the computational complexity of Algorithm 1 for each agent i is d parallel computations of O(n_i d) flops each, for every iteration. Note that the computation of each member in the set {R^i_j(t) : j = 1, ..., d} is independent of the others. Hence, agent i can compute the d vectors {R^i_j(t) : j = 1, ..., d} in parallel.

For each iteration t, the server computes the matrix K(t+1), defined in (2.8), and the vector x(t+1), defined in (2.9). Computing K(t+1) only requires O(d) floating-point additions, and can be ignored. In (2.9), the computation of K(t+1) \sum_{i=1}^{m} g^i(t) requires only one matrix-vector multiplication between the (d × d)-dimensional matrix K(t+1) and the d-dimensional vector \sum_{i=1}^{m} g^i(t). Therefore, the computational complexity for the server is O(d^2) flops for each iteration.

Communication complexity: For each iteration t, each agent sends a vector g^i(t) ∈ R^d and d vectors R^i_j(t) ∈ R^d, which means (d^2 + d) real scalar values are sent to the server by each agent. Thus, the communication complexity of Algorithm 1 for each agent is O(d^2), compared to O(d) for the GD method, which is a drawback of Algorithm 1.

Next, we present convergence results for Algorithm 1.

2.2.4 Convergence guarantees

To be able to present the formal convergence guarantees of Algorithm 1, we note below some elementary facts and introduce some notation.

• We define the collective data matrix and the collective observation vector, respectively, to be

    A = \left[ (A^1)^T, \ldots, (A^m)^T \right]^T, \quad b = \left[ (b^1)^T, \ldots, (b^m)^T \right]^T.

• Note that the matrix product A^T A is positive semi-definite.
Thus, if β > 0 then (A^T A + βI) is positive definite and, therefore, invertible. Let K_β = (A^T A + βI)^{-1}, and let λ_1, . . . , λ_d denote the eigenvalues of the matrix A^T A, ordered such that λ_1 ≥ . . . ≥ λ_d ≥ 0.

• The rank of the matrix A^T A is denoted by r. Note that r = d if and only if the matrix A^T A is full rank. In general, when A^T A is not the trivial zero matrix, 1 ≤ r ≤ d. Note that if r < d then λ_1 ≥ . . . ≥ λ_r > λ_{r+1} = . . . = λ_d = 0.

• For a matrix M ∈ R^{d×d}, ∥M∥_F denotes its Frobenius norm [42]. Specifically, if m_{ij} denotes the (i, j)-th element of the matrix M, then ∥M∥_F = √(Σ_{i=1}^d Σ_{j=1}^d m_{ij}^2).

For each agent i, recall from (2.2) that the cost function F^i(x) is convex and differentiable. Thus, the aggregate cost function Σ_{i=1}^m F^i(x) is also convex. Therefore, x* ∈ X* if and only if [43] ∇Σ_{i=1}^m F^i(x*) = 0_d, where 0_d denotes the d-dimensional zero vector. Recall, from Section 2.1.1, that

g(t) = ∇Σ_{i=1}^m F^i(x(t)) = Σ_{i=1}^m ∇F^i(x(t)) = Σ_{i=1}^m g^i(t). (2.10)

We now define the parameters that determine the convergence rate of Algorithm 1. Let

µ* = (λ_1 − λ_r) / (λ_1 + λ_r + 2(λ_1 λ_r / β)), (2.11)
ϱ = (λ_1 − λ_d) / (λ_1 + λ_d + 2β). (2.12)

Since λ_1, λ_r, β > 0, (2.11)-(2.12) imply that µ*, ϱ < 1. We define the optimal step-size parameter

δ* = 2 / (λ_1/(λ_1 + β) + λ_r/(λ_r + β)). (2.13)

The key result on the convergence of Algorithm 1 is presented next.

Theorem 2.1. Consider Algorithm 1. Suppose 0 < α < 2/(λ_1 + β) and 0 < δ < 2(λ_1 + β)/λ_1. Then,
1. there exist non-negative real values µ and ρ, with µ* ≤ µ < 1 and ϱ ≤ ρ < 1, such that for each iteration t ≥ 0,
∥g(t+1)∥ ≤ (µ + δ λ_1 ∥K(0) − K_β∥_F ρ^{t+1}) ∥g(t)∥; (2.14)
2. lim_{t→∞} ∥g(t)∥ = 0;
3. if δ = δ* then (2.14) holds with µ = µ*.

As ρ < 1, (2.14) of Theorem 2.1 implies that lim_{t→∞} ∥g(t+1)∥/∥g(t)∥ ≤ µ < 1, since ρ^{t+1} → 0 as t → ∞. Thus, Theorem 2.1 implies that the sequence of gradients {g(t)}_{t≥0} converges linearly to 0_d with convergence rate no worse than µ.
Since g(t) is linearly related to x(t), as seen from (2.10) and (2.3), linear convergence of {g(t)}_{t≥0} to 0_d implies linear convergence of the sequence of estimates {x(t)}_{t≥0} to a point in X*.

Superlinear convergence: Consider the special case when x* is the unique solution of the least-squares problem (2.1), i.e., the unique minimum point of the aggregate cost function Σ_{i=1}^m F^i(x). In this case, the matrix A^T A is full-rank and, therefore, r = d. We show below that Algorithm 1 with parameter β = 0 converges superlinearly to x*. Specifically, we obtain the following corollary of Theorem 2.1. Recall, from (2.12), that when β = 0 then ϱ = (λ_1 − λ_d)/(λ_1 + λ_d) < 1.

Corollary 2.1. Consider Algorithm 1 with β = 0. If x* defined by (2.1) is unique, and the parameter α satisfies the condition stated in Theorem 2.1, then for δ = 1 there exists a non-negative real value ρ ∈ [ϱ, 1) such that
1. for each iteration t ≥ 0, ∥g(t+1)∥ ≤ λ_1 ∥K(0) − K_β∥_F ρ^{t+1} ∥g(t)∥;
2. lim_{t→∞} ∥x(t+1) − x*∥ / ∥x(t) − x*∥ = 0.

Remarks: Since the number of data points n_i held by each agent i is not necessarily the same across the agents, it may be of interest to minimize the weighted aggregate cost function Σ_{i=1}^m (n_i / (2 Σ_{i=1}^m n_i)) ∥A^i x − b^i∥^2, where each local cost function F^i is weighted by the fraction of the data points held by agent i relative to the total number of data points of all the agents combined. We note that the results in this section remain valid for this weighted cost function.

2.3 Comparisons with the existing methods

In this section, we present comparisons between the rates of convergence of Algorithm 1 and the other prominent server-agent algorithms reviewed in Section 2.1. The algorithms are elaborated in [44]. Unique optimal solution: For the special case when the solution of the distributed least-squares problem (2.1) is unique, as shown in Section 2.2.4, Algorithm 1 converges superlinearly, which is comparable only to the BFGS method.
The rest of the algorithms, namely GD, NAG, HBM, and APC, only converge linearly [5]. Multiple optimal solutions: In general, when the solution of the distributed least-squares problem (2.1) is not unique, we show below that Algorithm 1 converges exponentially faster than the GD method. It should be noted that the convergence of BFGS and APC, and the explicit convergence rates of NAG and HBM, can only be guaranteed for the special case when the solution of (2.1) is unique; see [5], [25], [26] and the references therein. Otherwise, the convergence rates of NAG and HBM are known only in terms of the order of the iteration number [26], [39]. However, in practice, the constant factor that is often ignored in order-of-convergence analysis can be quite critical. Thus, we have provided an explicit convergence rate of Algorithm 1 in Theorem 2.1. The above comparisons are succinctly summarized in Table 2.1 (ref. Page 16). A detailed comparison with the classical GD method is presented below.

2.3.1 The gradient-descent method

Consider the GD algorithm in server-agent networks, described in Section 2.1.1. To the best of our knowledge, in the open literature, the explicit rate of convergence of GD is mentioned only when the solution of (2.1) is unique [2], [25], [26]. We present below, formally in Lemma 2.1, the explicit convergence rate of GD for the general case. We define the parameter

µ_GD = (λ_1 − λ_r) / (λ_1 + λ_r),

where λ_1 and λ_r are, respectively, the largest and the smallest non-zero eigenvalues of A^T A.

Lemma 2.1. Consider the gradient-descent algorithm in a server-agent network as presented in Section 2.1.1. In (2.4), if δ ∈ (0, 2/λ_1) then there exists µ with µ_GD ≤ µ < 1 such that, for each iteration t ≥ 0, ∥g(t+1)∥ ≤ µ ∥g(t)∥.

We show formally below, in Theorem 2.2, that Algorithm 1 converges faster than the GD method. Note that, in the special case when all the non-zero eigenvalues of the matrix A^T A are equal, i.e.
A^T A = kI for some k ≥ 0, both the GD algorithm and Algorithm 1 solve the optimization problem (2.1) in just one iteration. Theorem 2.2 below presents the case when λ_1 > λ_r.

Theorem 2.2. Consider Algorithm 1, and suppose that λ_1 > λ_r. If β > 0 then there exist a positive finite integer τ and two positive finite real values c and r, with r < 1, such that

∥g(t+1)∥ ≤ c (r µ_GD)^{t+1} ∥g(0)∥, ∀t > τ.

Consider the best possible rate of convergence for the GD algorithm, that is, substitute µ = µ_GD in Lemma 2.1. In that case, the gradients for the GD algorithm in a server-agent network satisfy ∥g(t+1)∥ ≤ (µ_GD)^{t+1} ∥g(0)∥ for all t ≥ 0. From Theorem 2.2, the gradients' norms for Algorithm 1 satisfy ∥g(t+1)∥ ≤ c (r µ_GD)^{t+1} ∥g(0)∥ for all t > τ. Thus, assuming that both algorithms are identically initialized with some x(0), there exists a finite integer T such that the ratio between the upper bounds on the gradients of Algorithm 1 and of GD in a server-agent network is c r^{t+1} for iterations t > T, where r < 1. This statement implies that, after a finite number of iterations, Algorithm 1 is guaranteed to have a smaller error bound than GD with identical initialization of x(0) and arbitrary initialization of the iterative pre-conditioning matrix K(0). More importantly, this error bound of Algorithm 1 decreases to zero at an exponentially faster rate than that of GD.

2.4 Robustness of the IPG method

In this section, we present our key results on the robustness of the IPG method [6], described in Algorithm 1 for the noise-free case, in the presence of additive system noises: the observation noise and the process noise. We introduce below some notation and our main assumption, and review a pivotal prior result.

2.4.1 Notation, assumption, and prior results

Assumption 2.1. Assume that the matrix A^T A is full rank.

Note that Assumption 2.1 holds true if and only if the matrix A^T A is positive definite, with λ_d > 0.
Under Assumption 2.1, A^T A is invertible, and we let K* = (A^T A)^{-1}. Note, from (2.2), that the Hessian of the aggregate cost function Σ_{i=1}^m F^i(x) is equal to A^T A for all x ∈ R^d. Thus, under Assumption 2.1, when A^T A is positive definite, the aggregate cost function has a unique minimum point. Equivalently, the solution of the least-squares problem defined by (2.1) is unique.

We review below, in Lemma 2.2, a prior result that is pivotal for our key results presented later in this section. We let

ϱ = (λ_1 − λ_d)/(λ_1 + λ_d). (2.15)

For each iteration t, recall the pre-conditioner matrix K(t) in Algorithm 1, and define

K̃(t) = K(t) − K*, ∀t ≥ 0. (2.16)

Let k̃_j(t) denote the j-th column of K̃(t), and let k*_j denote the j-th column of the matrix K*. Lemma 2.2 below states sufficient conditions under which the sequence of pre-conditioner matrices {K(t), t = 0, 1, . . .} in Algorithm 1 converges linearly to K*.

Lemma 2.2. Consider Algorithm 1 with α ∈ (0, 2/λ_1). If Assumption 2.1 holds true, then there exists a real value ρ ∈ [ϱ, 1) such that, for all j ∈ {1, . . . , d},

∥k̃_j(t+1)∥ ≤ ρ ∥k̃_j(t)∥, ∀t ≥ 0. (2.17)

We first present, in Section 2.4.2 below, the robustness of the IPG method against observation noise, followed by its robustness against process noise in Section 2.4.3.

2.4.2 Robustness against observation noise

Following the literature [29], we model observation noise as follows. Each agent i observes a corrupted observation vector b^i_o, instead of the true observation vector b^i. Specifically,

b^i_o = b^i + w^i_b, i ∈ {1, . . . , m}, (2.18)

where w^i_b ∈ R^{n_i} is a random vector. Let E[·] denote the expectation of a function of the random vectors {w^i_b, i = 1, . . . , m}, and let ∥·∥_1 denote the l_1-norm [45].

Assumption 2.2. Assume that there exists η < ∞ such that

E[∥w^i_b∥_1] ≤ η, i ∈ {1, . . . , m}.
(2.19)

In the presence of the above observation noise, in each iteration t of Algorithm 1, each agent i sends to the server a corrupted gradient g^i(t), defined by (2.20) below instead of (2.3). Specifically, for all i and t,

g^i(t) = (A^i)^T (A^i x(t) − b^i_o). (2.20)

Due to the above corruption in the gradient computation, Algorithm 1 no longer converges to an exact solution, defined by (2.1), but rather to an approximation. Theorem 2.3 below presents a key result on the robustness of Algorithm 1 against the above additive observation noise. Recall from Section 2.4.1 that, under Assumption 2.1, the solution x* of the regression problem (2.1) is unique. For each iteration t, we define the estimation error

z(t) = x(t) − x*. (2.21)

For a matrix M with columns M_1, . . . , M_d, its Frobenius norm [45] is defined to be ∥M∥_F = √(Σ_{j=1}^d ∥M_j∥^2). Recall the definition of ϱ from (2.15).

Theorem 2.3. Let {z(t)}_{t=0}^∞ be constructed by Algorithm 1 with parameters α < 2/λ_1 and δ ≤ 1, in the presence of the additive observation noise defined in (2.18) and with the gradients {g^i(t), i = 1, . . . , m} for each iteration t defined by (2.20). If Assumptions 2.1-2.2 hold, then there exists ρ ∈ [ϱ, 1) such that, for all t ≥ 0,

E[∥z(t+1)∥] ≤ (1 − δ + δ λ_1 ∥K̃(0)∥_F ρ^{t+1}) ∥z(t)∥ + δ η m √λ_1 ∥K̃(0)∥_F ρ^{t+1} + δ η m √(1/λ_d). (2.22)

Additionally, lim_{t→∞} E[∥z(t)∥] ≤ δ η m √(1/λ_d).

Since ρ ∈ [0, 1) and δ ∈ (0, 1], Theorem 2.3 implies that the IPG method, under the influence of the additive observation noise (2.18), converges in expectation to within a distance of δ η m √(1/λ_d) from the true solution x* of the regression problem defined in (2.1). Since the gradient-descent (GD) method (with a constant step-size) is a special case of the IPG method with K(t) = I for all t, from the proof of Theorem 2.3 we obtain that the final error of GD is bounded by δ η m √λ_1.
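The steady-state behavior characterized by Theorem 2.3 can be checked numerically on a toy problem. In the sketch below, the m agents' sums are collapsed into the collective quantities (so m = 1), A is diagonal so that λ_1 and λ_d can be read off directly, and the noise w is a fixed vector with ∥w∥_1 = η = 0.3; all of these choices are illustrative assumptions, not the chapter's experimental setup.

```python
def ipg_noisy(diag_a, b_obs, alpha, delta, T):
    """IPG on A = diag(diag_a) with gradient g = A^T (A x - b_obs); beta = 0."""
    lam = [a * a for a in diag_a]                # eigenvalues of A^T A
    x = [0.0] * len(diag_a)
    kappa = [0.0] * len(diag_a)                  # diagonal of K(t), K(0) = 0
    for _ in range(T):
        # pre-conditioner update (2.8): k_j <- k_j - alpha ((A^T A) k_j - e_j)
        kappa = [k - alpha * (l * k - 1.0) for k, l in zip(kappa, lam)]
        # estimate update (2.9) with the (possibly corrupted) gradient (2.20)
        g = [a * (a * xi - bo) for a, xi, bo in zip(diag_a, x, b_obs)]
        x = [xi - delta * k * gi for xi, k, gi in zip(x, kappa, g)]
    return x

diag_a = [3.0, 1.0, 0.5]                         # lambda_1 = 9, lambda_d = 0.25
b_true = [a * 1.0 for a in diag_a]               # b = A x* with x* = (1, 1, 1)
w = [0.1, -0.1, 0.1]                             # fixed noise, ||w||_1 = eta = 0.3
b_obs = [bt + wi for bt, wi in zip(b_true, w)]
x_noisy = ipg_noisy(diag_a, b_obs, alpha=0.1, delta=1.0, T=800)
x_clean = ipg_noisy(diag_a, b_true, alpha=0.1, delta=1.0, T=800)
err_noisy = sum((xi - 1.0) ** 2 for xi in x_noisy) ** 0.5
err_clean = sum((xi - 1.0) ** 2 for xi in x_clean) ** 0.5
bound = 1.0 * 0.3 * 1 * (1.0 / 0.25) ** 0.5      # delta * eta * m * sqrt(1/lambda_d)
```

The noiseless run converges to x*, while the noisy run settles at a non-zero error floor that remains below δ η m √(1/λ_d) = 0.6, in line with the theorem.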
2.4.3 Robustness against process noise

In the presence of process noise [27, 28], in each iteration t of Algorithm 1 (the IPG method in the noise-free case), the server computes corrupted values of both the pre-conditioner matrix K(t) and the current estimate x(t), described formally as follows. In each iteration t, and for each j ∈ {1, . . . , d}, instead of computing k_j(t) (the j-th column of K(t)) accurately, the server computes

k^o_j(t) = k_j(t) + w^k_j(t), (2.23)

where w^k_j(t) ∈ R^d. Similarly, instead of computing x(t) accurately, the server computes a corrupted estimate

x^o(t) = x(t) + w^x(t), (2.24)

where w^x(t) ∈ R^d. Together, the vectors ζ(t) = {w^x(t), w^k_j(t), j = 1, . . . , d} are referred to as additive process noise. For each iteration t, let E_{ζ(t)}[·] denote the expectation of a function of the random vectors ζ(t), and let E_t[·] denote the total expectation of a function of the random vectors {ζ(0), . . . , ζ(t)}.

Assumption 2.3. Assume that the random vectors {w^x(t), w^k_j(t), j = 1, . . . , d} are mutually independent for all t, and that there exists ω < ∞ such that, for all t and j,

E_{ζ(t)}[∥w^k_j(t)∥_1] ≤ ω, and E_{ζ(t)}[∥w^x(t)∥_1] ≤ ω. (2.25)

In the presence of the above process noise, Algorithm 1 is modified as follows. In each iteration t, each agent i sends to the server a gradient g^i(t) and d vectors R^i_1(t), . . . , R^i_d(t), defined by (2.26) and (2.27) below, respectively, instead of (2.3) and (2.7). Specifically, for all i and t,

g^i(t) = (A^i)^T (A^i x^o(t) − b^i), (2.26)

and, for j = 1, . . . , d,

R^i_j(t) = (A^i)^T A^i k^o_j(t) − (1/m) e_j. (2.27)

Similarly, in each iteration t, the server now computes K(t+1) whose j-th column is defined as follows, instead of (2.8): for all j ∈ {1, . . . , d},

k_j(t+1) = k^o_j(t) − α Σ_{i=1}^m R^i_j(t). (2.28)

Recall that k^o_j(t+1), defined by (2.23), is the corrupted value of k_j(t+1). Instead of (2.9), the updated estimate x(t+1) is defined by

x(t+1) = x^o(t) − δ K^o(t+1) Σ_{i=1}^m g^i(t), ∀t.
(2.29)

Recall that x^o(t+1), defined by (2.24), is the corrupted value of x(t+1). To present our key result, in Theorem 2.4 below, on the robustness of the IPG method against the above process noise, we introduce some notation. From Lemma 2.2, recall that ρ ∈ [ϱ, 1). For each iteration t, we define

K̃^o(t) = [k̃^o_1(t), . . . , k̃^o_d(t)] = K^o(t) − K*, and (2.30)
u(t) = 1 − δ + δ λ_1 (ρ^t ∥K̃(0)∥ + ω √d Σ_{i=0}^t ρ^i). (2.31)

Additionally, we let

ρ_bd = ∥K̃(0)∥ / (∥K̃(0)∥ + ω √d), and ω_bd = (1 − ρ) / (λ_1 √d). (2.32)

Theorem 2.4 below characterizes the convergence of Algorithm 1 with the modifications (2.26)-(2.29) in the presence of process noise. Recall from the definition (2.21) that z(t) = x(t) − x* for all t. Upon substituting x(t) from (2.24), we obtain that z(t) = x^o(t) − w^x(t) − x*. We let z^o(t) = x^o(t) − x*. Thus, for each iteration t,

z^o(t) = z(t) + w^x(t). (2.33)

Theorem 2.4. Let {z^o(t)}_{t=0}^∞ be constructed by Algorithm 1 with parameters α < 2/λ_1 and δ ≤ 1, in the presence of the additive process noise defined in (2.23)-(2.24) and with the modifications defined in (2.26)-(2.29). If Assumptions 2.1 and 2.3 hold, then there exists ρ ∈ [ϱ, 1) such that

E_t[∥z^o(t)∥] ≤ Π_{k=1}^t u(k) ∥z(0)∥ + (1 + u(t) + . . . + Π_{k=1}^t u(k)) ω, ∀t. (2.34)

Moreover, if

ρ < ρ_bd and ω < ω_bd, (2.35)

then lim_{t→∞} E_t[∥z^o(t)∥] < ω / (δ (1 − (ω/ω_bd))).

Note that, from the proof of Theorem 2.4, when K(t) = I for all t, the asymptotic estimation error of the traditional GD method under the above process noise is bounded by ω / (1 − ∥I − δ A^T A∥).

2.5 SGD with iterative pre-conditioning

In this section, we present our algorithm for solving (2.6) in the stochastic setting. Our algorithm follows the basic prototype of the stochastic gradient-descent method in distributed settings. However, unlike the traditional distributed stochastic gradient descent, the server in our algorithm multiplies the stochastic gradients received from the agents by a stochastic pre-conditioner matrix.
These pre-conditioned stochastic gradients are then used by the server to update the current estimate. To present the algorithm, we introduce some notation. An individual data point of an agent is represented by a row vector a of dimension 1 × d and a scalar observation b; thus, a ∈ R^{1×d} and b ∈ R. For each data point (a, b), we define the individual cost function f : R^d → R such that, for a given x ∈ R^d,

f(x; a, b) = (1/2) (a x − b)^2, (2.36)

and the gradient of the individual cost function f as

g(x; a, b) = ∇_x f(x; a, b) = a^T (a x − b). (2.37)

In each iteration t ∈ {0, 1, . . .}, the server maintains an estimate x(t) of a minimum point of (2.6) and a stochastic pre-conditioner matrix K(t) ∈ R^{d×d}. The initial estimate x(0) and the pre-conditioner matrix K(0) are chosen arbitrarily from R^d and R^{d×d}, respectively. The algorithm steps for each iteration t = 0, 1, . . . are presented below.

2.5.1 Steps in each iteration t

Before initiating the iterations, the server chooses a positive scalar real-valued parameter β and broadcasts it to all the agents. We number the agents in order from 1 to m. In each iteration t, the proposed algorithm comprises the four steps described below. These steps are executed collaboratively by the server and the agents, without requiring any agent to share its local data points. For each iteration t, the server also chooses two positive scalar real-valued parameters α and δ.

• Step 1: The server sends the estimate x(t) and the pre-conditioner matrix K(t) to each agent i ∈ {1, . . . , m}.

• Step 2: Each agent i ∈ {1, . . . , m} chooses a data point (a^{i_t}, b^{i_t}) uniformly at random from its local data points (A^i, b^i). Note that a^{i_t} and b^{i_t} are, respectively, a row of the input matrix A^i and the corresponding entry of the output vector b^i of agent i. The data points are selected independently and identically distributed (i.i.d.) across iterations.
Based on the selected data point (a^{i_t}, b^{i_t}), each agent i then computes a stochastic gradient, denoted by g^{i_t}(t), which is defined as

g^{i_t}(t) = g(x(t); a^{i_t}, b^{i_t}). (2.38)

In the same step, each agent i ∈ {1, . . . , m} computes a set of vectors {h^{i_t}_j(t) : j = 1, . . . , d}:

h^{i_t}_j(t) = h_j(k_j(t); a^{i_t}, b^{i_t}), (2.39)

where the function h_j : R^d → R^d is defined below. Let I denote the (d × d)-dimensional identity matrix, and let e_j and k_j(t) denote the j-th columns of the matrices I and K(t), respectively. For each column j ∈ {1, . . . , d} of K(t) and each individual data point (a, b), we define

h_j(k_j; a, b) = (a^T a + βI) k_j − e_j. (2.40)

• Step 3: Each agent i ∈ {1, . . . , m} sends the stochastic gradient g^{i_t}(t) and the set of stochastic vectors {h^{i_t}_j(t), j = 1, . . . , d} to the server.

• Step 4: The server draws an i.i.d. sample ζ_t uniformly at random from the set of agents {1, . . . , m} and updates the matrix K(t) to K(t+1) using the vectors received from agent ζ_t, such that, for each j ∈ {1, . . . , d},

k_j(t+1) = k_j(t) − α h^{ζ_t}_j(t). (2.41)

Finally, the server updates the estimate x(t) to x(t+1) using the stochastic gradient received from agent ζ_t, such that

x(t+1) = x(t) − δ K(t+1) g^{ζ_t}(t). (2.42)

The parameter δ is a non-negative real value, commonly referred to as the step-size. These steps of our algorithm are summarized in Algorithm 2.

Algorithm 2 Iteratively Pre-Conditioned Stochastic Gradient-descent (IPSG).
1: The server initializes x(0) ∈ R^d, K(0) ∈ R^{d×d}, β > 0 and chooses the parameters α > 0 and δ > 0.
2: Steps in each iteration t ∈ {0, 1, 2, . . .}:
3: The server sends x(t) and K(t) to all the agents.
4: Each agent i ∈ {1, . . . , m} uniformly selects an i.i.d. data point (a^{i_t}, b^{i_t}) from its local data points (A^i, b^i).
5: Each agent i ∈ {1, . . . , m} sends to the server a stochastic gradient g^{i_t}(t), defined in (2.38), and d stochastic vectors h^{i_t}_1(t), . . . , h^{i_t}_d(t), defined in (2.39).
6: The server uniformly draws an i.i.d. sample ζ_t from the set of agents {1, . . . , m}.
7: The server updates K(t) to K(t+1), as defined by (2.41).
8: The server updates the estimate x(t) to x(t+1), as defined by (2.42).

Next, we present the formal convergence guarantees of Algorithm 2. We begin by introducing some notation and our main assumptions.

2.5.2 Notation and assumptions

For each iteration t ≥ 0, we define the following.

• Let E_{ζ_t}[·] and, for each agent i ∈ {1, . . . , m}, E_{i_t}[·] denote the conditional expectations of a function of the random variables ζ_t and i_t, respectively, given the current estimate x(t) and the current pre-conditioner K(t).

• Let I_t = {i_t, i = 1, . . . , m} ∪ {ζ_t} and E_{I_t}[·] = E_{1_t, . . . , m_t, ζ_t}(·).

Table 2.2: Additional notation for the analysis of IPSG.

A = [(a^1)^T, . . . , (a^N)^T]^T
K_β = ((1/N) A^T A + βI)^{-1}
C_1 = max_i ∥(a^i)^T a^i − (1/N) A^T A∥
ρ = (1/N) Σ_{i=1}^N ∥I − α((a^i)^T a^i + βI)∥
s_1 ≥ . . . ≥ s_d ≥ 0: eigenvalues of the positive semi-definite matrix A^T A
Λ_i and λ_i: the largest and the smallest eigenvalues of (a^i)^T a^i
L = β + max_{i=1,...,N} Λ_i
µ = 1 − (2 α s_d / N)(1 − α L)
σ^2 = max_{j=1,...,d} (1/N) Σ_{i=1}^N ∥((a^i)^T a^i + βI) K_β e_j − e_j∥^2
C_3 = α N σ^2 / (s_d (1 − α L))
C_2 = (α/N) Σ_{i=1}^N ∥(a^i)^T a^i − (1/N) A^T A∥ ∥K_β∥
ϱ = ∥I − α((1/N) A^T A + βI)∥
C_5(t) = (2 C_1 E_2 s_1 / N)(∥K_β∥ + ∥K̃(0)∥ ϱ^t)
C_6(t) = 2 s_d / (s_d + Nβ) − (2 s_1 / N) ∥K̃(0)∥ ϱ^{t+1}
C_7(t) = 2 C_1 E_1 (∥K_β∥ + ∥K̃(0)∥ ϱ^t)
C_8(t) = (V_2 + 1)(s_1^2 / N)(d C_3 + ∥K_β∥^2 + 2 C_2 ∥K_β∥ Σ_{j=0}^t ρ^j + ∥K̃(0)∥_F^2 µ^{t+1} + 2 ∥K_β∥ ∥K̃(0)∥ ρ^{t+1}) + 0.5
z(t) = x(t) − x*
R_1(t) = 1 + δ^2 C_8(t) + α δ C_5(t) − δ C_6(t)
R_2(t) = δ^2 V_1 N (d C_3 + ∥K_β∥^2 + 2 C_2 ∥K_β∥ Σ_{j=0}^t ρ^j + ∥K̃(0)∥_F^2 µ^{t+1} + 2 ∥K_β∥ ∥K̃(0)∥ ρ^{t+1}) + (1/2) α^2 C_7(t)^2

• Let E_t[·] denote the total expectation of a function of the random variables {I_0, . . . , I_t}, given the initial estimate x(0) and the initial pre-conditioner matrix K(0). Specifically, E_t[·] = E_{I_0, . . . , I_t}(·), t ≥ 0.
(2.43)

• Define the conditional variance of the stochastic gradient g^{i_t}(t), which is a function of the random variable i_t, given the current estimate x(t) and the current pre-conditioner K(t), as

V_{i_t}[g^{i_t}(t)] = E_{i_t}[∥g^{i_t}(t) − E_{i_t}[g^{i_t}(t)]∥^2] = E_{i_t}[∥g^{i_t}(t)∥^2] − ∥E_{i_t}[g^{i_t}(t)]∥^2. (2.44)

Additional notation is listed in Table 2.2. Among this notation, K_β is an approximation of the inverse Hessian matrix, to which the sequence of pre-conditioner matrices converges in expectation. R_1(t) and R_2(t) characterize the estimation error after t iterations. The other notation in Table 2.2 is required to define R_1(t) and R_2(t). We make the following assumption on the rank of the matrix A^T A.

Assumption 2.4. Assume that the matrix A^T A is full rank.

Note that Assumption 2.4 holds true if and only if the matrix A^T A is positive definite, with s_d > 0. As the Hessian of the aggregate cost function Σ_{i=1}^m F^i(x) is equal to A^T A for all x (see (2.5)), under Assumption 2.4 the aggregate cost function has a unique minimum point. Equivalently, the solution of the distributed least-squares problem defined by (2.6) is unique. We also assume, as formally stated in Assumption 2.5 below, that the variance of the stochastic gradient of each agent is bounded. This is a standard assumption in the analysis of stochastic algorithms [32].

Assumption 2.5. Assume that there exist two non-negative real scalar values V_1 and V_2 such that, for each iteration t = 0, 1, . . . and each agent i ∈ {1, . . . , m},

V_{i_t}[g^{i_t}(t)] ≤ V_1 + V_2 ∥Σ_{i=1}^m ∇F^i(x(t))/m∥^2.

Next, we present our key result on the convergence of Algorithm 2.

2.5.3 Convergence guarantees

Theorem 2.5. Consider Algorithm 2 with parameters β > 0, α < min{N/s_d, 1/L, 2/((s_1/N) + β)} and δ > 0. If Assumptions 2.4 and 2.5 are satisfied, then there exist two non-negative real scalar values E_1 ≥ √(V_1 N) and E_2 ≥ √(V_2 N) such that the following statements hold true.

1.
If the step-size δ is sufficiently small, then there exists a non-negative integer T < ∞ such that, for any iteration t ≥ T, R_1(t) is positive and less than 1.

2. For an arbitrary time step t ≥ 0, given the estimate x(t) and the matrix K(t),

E_t[∥z(t+1)∥^2] ≤ R_1(t) ∥z(t)∥^2 + R_2(t). (2.45)

3. Given arbitrary choices of the initial estimate x(0) ∈ R^d and the pre-conditioner matrix K(0) ∈ R^{d×d},

lim_{t→∞} E_t[∥z(t+1)∥^2] ≤ δ^2 V_1 N (d C_3 + ∥K_β∥^2 + 2 C_2 ∥K_β∥ / (1 − ρ)) + 2 α^2 (C_1 E_1 ∥K_β∥)^2.

The implications of Theorem 2.5 are as follows.

• According to Parts (1) and (2) of Theorem 2.5, for small enough values of the parameter α and the step-size δ, since R_1(t) ∈ (0, 1) after finitely many iterations, Algorithm 2 converges linearly in expectation to a neighborhood of the minimum point x* of the least-squares problem (2.6).

• According to Part (3) of Theorem 2.5, the neighborhood of x*, to which the estimates of Algorithm 2 converge in expectation, is O(V_1). In other words, the expected "distance" between the minimum point x* of (2.6) and the final estimate of Algorithm 2 is proportional to the variance of the stochastic gradients at the minimum point.

2.6 Experimental results

In this section, we first present experimental results comparing Algorithm 1 with other state-of-the-art methods. We consider five benchmark collective data matrices A, namely "ash608", "bcsstm07", "can 229", "gr 30 30", and "qc324", from the SuiteSparse Matrix Collection (http://sparse.tamu.edu/). The true value of the collective observation is b = A x*, where x* is a d-dimensional vector with all entries equal to 1. The rows of (A, b) are distributed among m = 10 agents. As A^T A is positive definite except for "can 229", the problem (2.1) has a unique solution x* for all the datasets except "can 229", which has multiple solutions.

Figure 2.2: ∥x(t) − x*∥ under Algorithm 1 with different initializations on "ash608". α = 0.1, δ = 1, β = 0.
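The data setup just described (b = A x* with x* the all-ones vector, and the rows of (A, b) split contiguously among m agents) can be sketched as follows. The 6 × 3 matrix here is a hypothetical stand-in, since fetching the SuiteSparse matrices is not shown.

```python
def make_problem(A, m):
    """Return b = A 1 and contiguous per-agent row blocks (A^i, b^i)."""
    b = [sum(row) for row in A]                  # A times the all-ones vector x*
    q, r = divmod(len(A), m)                     # first r agents get one extra row
    blocks, start = [], 0
    for i in range(m):
        size = q + (1 if i < r else 0)
        blocks.append((A[start:start + size], b[start:start + size]))
        start += size
    return b, blocks

# Stand-in for a SuiteSparse matrix: 6 rows, d = 3 columns, full column rank.
A = [[2.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0],
     [0.0, 0.0, 0.5], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
b, blocks = make_problem(A, m=3)
```

Each agent i then holds its block (A^i, b^i) privately; concatenating the blocks recovers the collective pair (A, b).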
We simulate Algorithm 1 with several initializations on the datasets "ash608" and "gr 30 30". For each of these datasets, we consider three sets of initializations (x(0), K(0)): each entry of x(0) and K(0) is zero; each entry of x(0) is selected uniformly at random within (−3, 3) and each entry of K(0) is zero; and each entry of x(0) and K(0) is selected uniformly at random within (−3, 3) and (0, 0.01), respectively. We observe that Algorithm 1 converges to x* irrespective of the initial choices of x(0) and K(0) (ref. Fig. 2.2-2.3).

Figure 2.3: ∥x(t) − x*∥ under Algorithm 1 with different initializations on "gr 30 30". α = 3 × 10^{-3}, δ = 0.4, β = 0.

The experimental results have been compared with the other distributed algorithms mentioned in Section 2.3. The parameters of all of these algorithms are chosen such that the respective algorithms achieve their optimal (smallest worst-case) convergence rates (ref. Table 2.3). For Algorithm 1, these optimal parameter values are given by α = 2/(λ_1 + λ_d + 2β) and δ* in (2.13). The optimal parameter expressions for the algorithms GD, NAG, HBM, and APC can be found in [5], [25]. The step-size parameter for BFGS is selected using backtracking [22]. Note that evaluating the optimal tuning parameters of any of these algorithms requires knowledge of the smallest and largest eigenvalues of A^T A.

Figure 2.4: ∥x(t) − x*∥ in the absence of noise for different algorithms on "bcsstm07".

We compare the number of iterations needed by these algorithms to reach a relative estimation error, defined as ϵ_tol = ∥x(t) − x*∥/∥x*∥ when the solution is unique, or ϵ_tol = ∥A x(t) − b∥/∥A x(0) − b∥ when there are multiple solutions. Clearly, Algorithm 1 is the fastest among the algorithms except BFGS, significantly so on the datasets "bcsstm07" and "gr 30 30" (Table 2.4). Thus, our theoretical claim of improvements over these methods is corroborated by the above experimental results.
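A miniature version of this iteration-count comparison can be reproduced in a few lines. The sketch below uses a synthetic diagonal A^T A with condition number 100 (so the updates decouple into scalar recursions), optimally tuned GD, and IPG with β = 0, δ = 1, K(0) = 0; the eigenvalues and tolerance are illustrative choices, not one of the SuiteSparse datasets.

```python
LAMS = [4.0, 1.0, 0.04]             # eigenvalues of A^T A; condition number 100
D = len(LAMS)

def grad(x):
    """Gradient of the least-squares cost at x, with x* = (1, ..., 1)."""
    return [l * (xi - 1.0) for l, xi in zip(LAMS, x)]

def gd_step(x, state):
    """Optimally tuned GD: delta = 2 / (lambda_1 + lambda_d)."""
    delta = 2.0 / (LAMS[0] + LAMS[-1])
    return [xi - delta * gi for xi, gi in zip(x, grad(x))], state

def ipg_step(x, kappa):
    """IPG with beta = 0, delta = 1, alpha = 1/lambda_1, K(0) = 0 (diagonal case)."""
    kappa = kappa or [0.0] * D
    alpha = 1.0 / LAMS[0]
    kappa = [k - alpha * (l * k - 1.0) for k, l in zip(kappa, LAMS)]
    return [xi - k * gi for xi, k, gi in zip(x, kappa, grad(x))], kappa

def iters_to_tol(step, tol=1e-6, cap=5000):
    """Iterations until the relative error ||x - x*|| / ||x*|| drops below tol."""
    x, state = [0.0] * D, None
    for t in range(1, cap + 1):
        x, state = step(x, state)
        if (sum((xi - 1.0) ** 2 for xi in x) / D) ** 0.5 < tol:
            return t
    return cap

gd_iters = iters_to_tol(gd_step)    # linear rate (kappa - 1)/(kappa + 1): hundreds
ipg_iters = iters_to_tol(ipg_step)  # superlinear: tens
```

On this toy problem GD needs several hundred iterations while IPG needs a few dozen, mirroring the gap reported in Table 2.4.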
Observation noise: We add uniformly distributed random noise vectors, drawn from (−0.25, 0.25) and (−0.15, 0.15) respectively, to the true output vectors of the datasets "ash608" and "gr 30 30". The algorithm parameters are chosen such that each algorithm achieves its smallest convergence rate [5, 44] (ref. Table 2.3). Each iterative algorithm is run until the changes in its subsequent estimates are less than 10^{-4} over 20 consecutive iterations. We note the final estimation errors of the different algorithms in Table 2.5. We observe that the final estimation error of the IPG method is either comparable or favorable to all the other algorithms, for each dataset.

Figure 2.5: ∥x(t) − x*∥ in the absence of noise for different algorithms on "can 229".

Process noise: We simulate the algorithms by adding noise to the iterated variables. For the algorithms GD, NAG, HBM, and IPG, the process noise has been generated by rounding off each entry of all the iterated variables in the respective algorithms to four decimal places. However, this rounding off does not generate the same values of the noise w for the APC and BFGS algorithms. Therefore, for APC, we add uniformly distributed random numbers in the range (0, 5 × 10^{-5}) for both datasets and, similarly, for BFGS, we add uniformly distributed random numbers in the ranges (0, 9 × 10^{-5}) and (0, 2 × 10^{-6}), respectively, for the datasets "ash608" and "gr 30 30". The final estimation errors of the different algorithms are noted in Table 2.5. We observe that the final error of IPG is less than that of all the other algorithms. Also, we observe that the estimation error of the BFGS algorithm on the dataset "ash608" grows unbounded after 360 iterations. The cause of this instability is the violation of the non-singularity of the approximated Hessian matrix, which is a necessary condition for the convergence of BFGS [22].

Figure 2.6: ∥x(t) − x*∥ in the presence of observation noise for different algorithms on "ash608".
2.6.1 Stochastic settings

We conduct experiments with different collective data points (A, b). Four of these datasets are from the benchmark datasets available in the SuiteSparse Matrix Collection, namely "abtaha1", "abtaha2", "gre 343", and "illc1850". The fifth dataset, "cleveland", is from the UCI Machine Learning Repository [46]. The sixth and final dataset is the "MNIST" [47] dataset.

Figure 2.7: ∥x(t) − x*∥ in the presence of process noise for different algorithms on "ash608".

For the first four aforementioned datasets, the problem is set up as follows. Consider the particular dataset "abtaha1". Here, the matrix A has 14596 rows and d = 209 columns. The collective output vector b is such that b = A x*, where x* is a 209-dimensional vector all of whose entries are unity. The data points, represented by the rows of the matrix A, and the corresponding observations, represented by the elements of the vector b, are divided amongst m = 4 agents numbered from 1 to 4. Since the matrix A for this particular dataset has 14596 rows and 209 columns, each of the four agents has a data matrix A^i of dimension 3649 × 209 and an observation vector b^i of dimension 3649. The data points of the other three datasets, "abtaha2", "gre 343", and "illc1850", are similarly distributed among m = 4, m = 7, and m = 10 agents, respectively.

Figure 2.8: ∥x(t) − x*∥ in the presence of observation noise for different algorithms on "gr 30 30".

For the fifth dataset, 212 arbitrary instances from the "cleveland" dataset have been selected. This dataset has 13 numeric attributes, each corresponding to a column of the matrix A, and a target class (whether the patient has heart disease or not), which corresponds to the output vector b. Since the attributes in the matrix A have different units, each of its columns is shifted by the mean value of that column and then divided by the standard deviation of that column.
Finally, a 212-dimensional column vector of all ones is appended to this pre-processed matrix. This is our final input matrix A, of dimension (212 × 14). The collective data points (A, b) are then distributed among m = 4 agents, in the manner described earlier.

Figure 2.9: ∥x(t) − x*∥ in the presence of process noise for different algorithms on "gr 30 30".

From the training examples of the "MNIST" dataset, we select 1500 arbitrary instances labeled as either the digit one or the digit five. For each instance, we calculate two quantities, namely the average intensity of the image and the average symmetry of the image [48]. Let the column vectors a_1 and a_2 respectively denote the average intensity and the average symmetry of those 1500 instances. Then, our input matrix before pre-processing is [a_1, a_2, a_1.^2, a_1.*a_2, a_2.^2], where (.*) represents element-wise multiplication and (.^2) represents element-wise squaring. This raw input matrix is then pre-processed as described earlier for the "cleveland" dataset. Finally, a 1500-dimensional column vector of all ones is appended to this pre-processed matrix. This is our final input matrix A, of dimension (1500 × 6). The collective data points (A, b) are then distributed among m = 10 agents, in the manner already described for the other datasets. As the matrix A^T A is positive definite in each of these cases, the optimization problem (2.6) has a unique solution x* for all of these datasets.

Figure 2.10: ∥x(t) − x*∥ for different stochastic algorithms on "MNIST".

We compare the performance of our proposed IPSG method (Algorithm 2) on the aforementioned datasets with the other stochastic algorithms mentioned in Section 2.1.4. Specifically, these algorithms are stochastic gradient descent (SGD) [32], adaptive gradient descent (AdaGrad) [33], adaptive momentum estimation (Adam) [34], and AMSGrad [37], in the distributed network architecture of Fig. 2.1.
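For reference, the IPSG iteration (2.38)-(2.42) being compared here can be sketched compactly. The dataset below is a small synthetic consistent system (b = A x*), and the parameters, seed, and iteration count are illustrative assumptions rather than the tuned values used in these experiments.

```python
import random

def ipsg(agents, alpha, delta, beta, T, seed=0):
    """agents: list of (rows, obs) pairs (A^i, b^i); returns the estimate x(T)."""
    rng = random.Random(seed)
    d = len(agents[0][0][0])
    x = [0.0] * d
    K = [[0.0] * d for _ in range(d)]            # K(0) = 0, stored by rows
    for _ in range(T):
        # Step 2: each agent samples one local data point uniformly at random
        samples = []
        for rows, obs in agents:
            idx = rng.randrange(len(rows))
            samples.append((rows[idx], obs[idx]))
        # Step 4: the server draws an agent zeta_t uniformly and uses its vectors
        a, b = samples[rng.randrange(len(agents))]
        ax = sum(ai * xi for ai, xi in zip(a, x))
        # pre-conditioner update (2.41), with h_j of (2.40):
        # k_j <- k_j - alpha [(a^T a + beta I) k_j - e_j]
        for j in range(d):
            kj = [K[r][j] for r in range(d)]
            akj = sum(ai * ki for ai, ki in zip(a, kj))   # scalar a k_j
            for r in range(d):
                h = a[r] * akj + beta * kj[r] - (1.0 if r == j else 0.0)
                K[r][j] = kj[r] - alpha * h
        # estimate update (2.42) with the stochastic gradient g = a^T (a x - b)
        x = [x[r] - delta * sum(K[r][c] * a[c] for c in range(d)) * (ax - b)
             for r in range(d)]
    return x

# Synthetic consistent data: b = A x* with x* = (1, -1, 2), m = 2 agents.
agents = [([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]], [1.0, -1.0, 0.0]),
          ([[0.0, 0.0, 1.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]], [2.0, 3.0, 1.0])]
x_hat = ipsg(agents, alpha=0.1, delta=0.05, beta=0.1, T=5000)
err = sum((u - v) ** 2 for u, v in zip(x_hat, [1.0, -1.0, 2.0])) ** 0.5
```

Because the data here is consistent, the stochastic-gradient variance vanishes at x*, so the estimate settles close to x* even with a constant step-size.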
These algorithms are implemented with different combinations of the respective algorithm parameters on the individual datasets. The parameter combinations are described below. IPSG: The optimal (smallest) convergence rate of the