ABSTRACT

Title of dissertation: CONSENSUS, PREDICTION AND OPTIMIZATION IN DIRECTED NETWORKS
Van Sy Mai, Doctor of Philosophy, 2017
Dissertation directed by: Professor Eyad H. Abed, Dept. of Electrical and Computer Engineering

This dissertation develops theory and algorithms for distributed consensus in multi-agent networks. The models considered are opinion dynamics models based on the well-known DeGroot model. We study the following three related topics: consensus of networks with leaders, consensus prediction, and distributed optimization.

First, we revisit the problem of agreement seeking in a weighted directed network in the presence of leaders. We develop new sufficient conditions that are weaker than existing conditions for guaranteeing consensus for both fixed and switching network topologies, emphasizing the importance not only of persistent connectivity between the leader and the followers but also of the strength of the connections. We then study the problem of a leader aiming to maximize its influence on the opinions of the network agents through targeted connection with a limited number of agents, possibly in the presence of another leader having a competing opinion. We reveal fundamental properties of leader influence defined in terms of either the transient behavior or the achieved steady state opinions of the network agents. In particular, not only is the degree of this influence a supermodular set function, but its continuous relaxation is also convex for any strongly connected directed network. These results pave the way for developing efficient approximation algorithms admitting certain quality certifications, which, when combined, can provide effective tools and better analysis for optimal influence spreading in large networks.

Second, we introduce and investigate problems of network monitoring and consensus prediction. Here, an observer, without exact knowledge of the network, seeks to determine in the shortest possible time the asymptotic agreement value by monitoring a subset of the agents. We uncover a fundamental limit on the minimum required monitoring time for the case of a single observed node, and analyze the case of multiple observed nodes. We provide conditions for achieving the limit in the former case and develop algorithms toward achieving conjectured bounds in the latter through local observation and local computation.

Third, we study a distributed optimization problem where a network of agents seeks to minimize the sum of the agents' individual objective functions while each agent may be associated with a separate local constraint. We develop new distributed algorithms for solving this problem. In these algorithms, consensus prediction is employed as a means to achieve fast convergence rates, possibly in finite time. An advantage of our distributed optimization algorithms is that they work under milder assumptions on the network weight matrix than those commonly made in the literature. Most distributed algorithms require undirected networks. Consensus-based algorithms can apply to directed networks under the assumption that the network weight matrix is doubly stochastic (i.e., both row stochastic and column stochastic), or in some recent literature only column stochastic. Our algorithms work for directed networks and only require row stochasticity, a mild assumption. Doubly stochastic or column stochastic weight matrices can be hard to arrange locally, especially in broadcast-based communication.
We achieve the simplification to the row stochastic assumption through a distributed rescaling technique. Next, we develop a unified convergence analysis of a distributed projected subgradient algorithm and its variation that can be applied to both unconstrained and constrained problems without assuming boundedness or commonality of the local constraint sets.

CONSENSUS, PREDICTION AND OPTIMIZATION IN DIRECTED NETWORKS

by Van Sy Mai

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2017

Advisory Committee:
Professor Eyad H. Abed, Chair/Advisor
Professor Richard J. La
Professor P. S. Krishnaprasad
Professor André L. Tits
Associate Professor Nikhil Chopra

© Copyright by Van Sy Mai 2017

Acknowledgments

First and foremost, I would like to thank my advisor, Professor Eyad Abed, for his invaluable guidance and unstinting support over the past five years. It has been an honor to work with and learn from him, without whom this thesis would not have been possible. I am also grateful to Professor P. S. Krishnaprasad, Professor Richard La, Professor André Tits and Professor Nikhil Chopra for agreeing to serve on my thesis committee and providing me with insightful comments and suggestions to broaden my research from various angles. I would also like to thank my Master's thesis advisor, Professor Suchin Arunsawatwong, for encouraging me to pursue a PhD degree and helping me with the application process.

Throughout my years at the University of Maryland, I have been fortunate to have many wonderful friends who deserve a special mention. I would like to thank Dzung Ta, Sanmeet Narula and Devon Harbaugh for their help and company during the first year of my graduate life in the US. Many thanks also go to my office-mates Bhaskar Ramasubramanian, Alborz Alavian, and James Ferlez for their friendship, feedback and support. My interaction with Dipankar Maity has been very fruitful and he deserves special thanks. My days in Maryland would not have been enjoyable without my Vietnamese friends, including Chanh Kieu, My Le, Khoa Trinh, and especially Sean Lam's family, and I would like to thank them all.

I would like to acknowledge the financial support from the Air Force Office of Scientific Research through MURI AFOSR Grant #FA9550-09-1-0538 and the Department of Electrical and Computer Engineering at the University of Maryland.

Last but never least, I am most grateful to my family: my parents and brother for having always stood by and believed in me through my whole life, and especially my wonderful and loving wife and our beloved daughter for their incredible support, endless patience and unconditional encouragement. Words, by no means, can express the gratitude I owe them.

Table of Contents

List of Tables
List of Figures
List of Abbreviations

1 Introduction
  1.1 Motivation and Thesis Objectives
    1.1.1 Consensus and Information Sharing Model
    1.1.2 Network Asymmetry
  1.2 Main Problems and Thesis Contributions
    1.2.1 Network with Leaders
    1.2.2 Consensus Prediction
    1.2.3 Distributed Optimization
  1.3 Literature Survey
    1.3.1 Consensus in Networks with Leaders
    1.3.2 Consensus Prediction
    1.3.3 Distributed Optimization
  1.4 Thesis Organization
  1.5 Notation and Mathematical Background
    1.5.1 Notation and Definitions
    1.5.2 Convergence of DeGroot Model

I Consensus Network with Leaders

2 Opinion Dynamics with Persistent Leaders
  2.1 Introduction
  2.2 Problem Formulation
  2.3 Opinion Dynamics with One Leader
  2.4 Opinion Dynamics with Two Leaders
  2.5 Conclusion and Extensions

3 Optimizing Leader Influence in Networks through Selection of Direct Followers
  3.1 Introduction
  3.2 Problem Formulation and Related Works
    3.2.1 Formulation of Influence Optimization Problem for the Single Leader Case
    3.2.2 Formulation of Influence Optimization Problem in the Presence of a Competing Leader
    3.2.3 Comparison to Previous Work
      3.2.3.1 Single leader case
      3.2.3.2 Multiple leaders case
      3.2.3.3 Our Contributions
  3.3 Special Cases K = 1, 2: Optimal Solutions
    3.3.1 Single Agent Selection
    3.3.2 Two-Agent Selection
  3.4 General Case: Convexification Approach
    3.4.1 Convexity of Relaxation
    3.4.2 Numerical Methods
  3.5 Supermodularity and Greedy Algorithms
    3.5.1 Supermodularity Results
    3.5.2 Greedy Algorithms and Ratio Bounds
  3.6 Numerical Examples
    3.6.1 Example 1: Small Network with One Leader
    3.6.2 Example 2: Medium-Size Network with Two Leaders
  3.7 Closing Discussion
    3.7.1 Application to Friedkin's Model
    3.7.2 Further Convexity Results
    3.7.3 Towards Relaxing Strong Connectivity Assumption

II Consensus Prediction by Observer

4 Consensus Prediction in Minimum Time
  4.1 Introduction
  4.2 Problem Statement and Previous Results
    4.2.1 Problem Description
    4.2.2 Previous Results on Consensus in Finite Time
  4.3 Shortest Time Prediction of Consensus and Local Computation of Minimal Polynomials
    4.3.1 Optimality of (Di + 1)
    4.3.2 Local Computation of qi
  4.4 Toward Minimizing Observation Time
    4.4.1 Observed Nodes with Identical Minimal Polynomials
    4.4.2 Observed Nodes with Different Minimal Polynomials
  4.5 Numerical Examples
    4.5.1 Example 1: Network with Identical Minimal Polynomials
    4.5.2 Example 2: Network with Different Minimal Polynomials
  4.6 Toward Selecting Observed Nodes
    4.6.1 When qi = qj?
    4.6.2 Bounds on deg(qi)
  4.7 Limitations and Future Work

III Distributed Optimization

5 Local Prediction for Enhanced Convergence of Distributed Optimization Algorithms
  5.1 Introduction
  5.2 Problem Statement and Background
    5.2.1 Problem Statement
    5.2.2 Subgradient Methods
    5.2.3 Finite-Time Consensus Using Minimal Polynomials
  5.3 Distributed Subgradient Optimization Using Finite Time Consensus
    5.3.1 Main Algorithm
    5.3.2 Extensions of the Algorithm 5.1
  5.4 Finite-Time Optimization for Quadratic Cost Functions
    5.4.1 Ratio-Consensus based Algorithm
    5.4.2 Gradient-based Algorithm
  5.5 On Minimal Value of κ and Performance Limits of Distributed Subgradient Methods
    5.5.1 Minimal Value of κ
    5.5.2 Performance Limit of Distributed Subgradient Methods
  5.6 Simulations
    5.6.1 Example 1: Network of 5 agents with differentiable cost functions having Lipschitz continuous gradient
    5.6.2 Example 2: Network of 200 agents with ℓ1 cost functions
  5.7 Concluding Remarks

6 Distributed Optimization over Directed Graphs with Row Stochasticity and Constraint Regularity
  6.1 Introduction
  6.2 Problem Formulation and Proposed Algorithms
  6.3 Basic Relations and Convergence Result
  6.4 Rate of Convergence
  6.5 Numerical Example
  6.6 Conclusions and Extensions

7 Conclusions
  7.1 Summary of Results
  7.2 Directions for Future Work

A Omitted Proofs
  A.1 Known Matrix Results
  A.2 Omitted Proofs in Chapter 3
    A.2.1 Proof of Theorem 3.3.1
    A.2.2 Proof of Theorem 3.3.2
    A.2.3 Proof of Theorem 3.3.4
    A.2.4 Proof of Lemma 3.5.3
    A.2.5 Proof of Lemma 3.5.5
  A.3 Omitted Proofs in Chapter 5
    A.3.1 Proof of Theorem 5.3.5
    A.3.2 Proof of Theorem 5.3.12
    A.3.3 Proof of Theorem 5.3.14
    A.3.4 Proof of Extension to Row Stochastic Weight Matrix
    A.3.5 Proof of Theorem 5.5.1
    A.3.6 Proof of Lemma 5.4.3
    A.3.7 Proof of Theorem 5.4.4
    A.3.8 Proof of Theorem 5.4.5
    A.3.9 Distributed Evaluation of Global Cost Function and Algorithm Local Termination
  A.4 Omitted Proofs in Chapter 6
    A.4.1 Proof of Theorem 6.3.6 for Algorithm 6.2

Bibliography

List of Tables

3.1 Comparison results for the network in Example 1 (∗ denotes an optimal value). In the last column, JK^PRlxd1(2) denotes JK^PRlxd1 (respectively, JK^PRlxd2).
4.1 Observation times using Algorithm 4.1.
4.2 Observation time for each node to compute the consensus value in Example 2.
4.3 Optimal time T∗ when the observer can choose any m nodes.

List of Figures

1.1 A directed network of 5 agents.
1.2 A network in the presence of 2 leaders T and Q.
1.3 A network with an observer.
1.4 A network of five agents that try to minimize the sum of individual cost functions.
3.1 Network in Example 1.
3.2 K∗ = {7, 8, 15, 25}
3.3 K = {7, 13, 16, 25}
3.4 K = {8, 13, 16, 25}
3.5 Alg. 3.1 every 5 time steps
3.6 Upper bounds (solid lines) and lower bounds (dashed line) on J∗; the global lower bound J(Vα) holds for any K. The ratio bound (1 − JGU)/(1 − f∗PRlxd) (shown by a dotted line) is at least 90% for K ≥ 90.
3.7 CPU run times (s) in 4 schemes. The Interior Point Method takes approximately 0.21 s per iteration.
4.1 Network example 2. Self weights are not shown.
5.1 Network topology in Example 1.
5.2 Network responses for Example 1 with convex cost functions having Lipschitz continuous gradient using Algorithms (5.32) and (5.37). Left: for any i ∈ V, si(t) (solid lines) converges to the optimal solution (dashed line) and xi(t) reaches a limit cycle of period κ. In the top-left figure, ◦ represents s̄(kκ) of the centralized subgradient method implemented as (5.19).
Right: objective error comparisons with DPS using step size γ(t) = a/t^b, where (blue) solid lines correspond to a = 0.01, (green) dashed lines a = 0.05, (black) dotted lines a = 0.1, and (cyan) dash-dotted ones a = 0.2. For each a, we plot the results for b = 0.5 and 1. The results from our algorithm are shown in red circles ◦. The algorithm terminates locally for all the agents at t = 186 with the relative error of the global cost function guaranteed to be less than ε = 10^-6.
5.3 Network responses for Example 1 with quadratic cost functions when using Algorithm 5.3 with κ = 7, x(0) = c, and with 4 values of γ.
5.4 Responses of the network in Example 2. Dashed line: optimal solution. (a)-(b): Algorithm (5.32)-(5.33), where the sub-figure within (a) is a zoom-in of the period [400, 800]; (c): algorithm by Olshevsky (2016) with a constant step size β = 1/(L0 √(NT)); and (d): Distributed Subgradient Method (5.4) with γ(t) = 1/√t.
6.1 Directed communication graph of the network example.
6.2 Performance of Algorithms 6.1, 6.2, and DSP methods with and without the reweighting technique. Reweighting means that for each i ∈ V, πi is known to agent i in advance and zii(t) = πi for all t ≥ 0. Here, s(t) = PX(x̄(t)).

List of Symbols and Abbreviations

E          Edge set of a graph
G          Communication graph
L          Laplacian matrix
N          Number of agents in the network
N0         Set of all nonnegative integers
R (R+)     Set of all (nonnegative) real numbers
t          Discrete time
V          Node set of a graph
W, [wij]   Weight matrix
xi(t), xi(t)   Opinion (scalar) or state vector of agent i at time t
Z (Z+)     Set of all (nonnegative) integers
π          The normalized left Perron eigenvector of the weight matrix W

AEP        Almost Equitable Partition
DPS        Distributed Projected Subgradient
DSM        Distributed Subgradient Method
IPM        Interior Point Method
LCM        Least Common Multiple
LTI        Linear Time-Invariant
PGM        Projected Gradient Method

Chapter 1: Introduction

Networks are ubiquitous in physical, biological, and engineered systems. Depending on the particular domain and on the network nodes and their interconnections, networks can display interesting characteristics and can achieve a variety of functions. Networks research has seen significantly increasing interest over the past several decades, owing mainly to the realization that applications are wide-ranging and that these applications can prove to be both practical and valuable for society. While a large body of literature has arisen, our understanding of the characteristics, features and dynamics of networks is still in the early stages of development. In this thesis, we aim to contribute to this important and growing field by pursuing several directions in the general area of network consensus. Among the topics we pursue is the development of new conditions and algorithms for reaching and/or predicting agreement among agents in a network. To this end, a number of theories will be brought to bear on several problems of interest under various scenarios with regard to network connectivity. We also introduce and study problems of network monitoring and consensus prediction, and apply our results to distributed optimization.
1.1 Motivation and Thesis Objectives

1.1.1 Consensus and Information Sharing Model

There has been much interest in problems of distributed computation and cooperative control, where a group of agents aims to achieve a global objective without resorting to a centralized coordination entity and possibly in the presence of limited computing capability and/or energy resources. In this realm, network consensus is a basic problem, which concerns processes by which a collection of agents, through their local interactions, tries to reach a common goal or decision. This problem has been studied extensively in recent years and has found applications in many areas such as opinion dynamics and learning in social networks, distributed optimization, multi-vehicle rendezvous, formation control and sensor fusion, to name a few (see, e.g., [1–13]). Extensive surveys and tutorials can be found in [14–16].

Historically, consensus in one form or another has long been observed in both natural and social networks and systems. For example, a flock of birds flies in a certain shape with a common velocity, a swarm of fireflies blinks in unison, and a group of people achieves agreement after repeatedly exchanging opinions with one another. The discovery of network consensus can be traced to circa 1665, when Christiaan Huygens observed the synchrony of two pendulum clocks mounted next to each other on the same support, a phenomenon now referred to as coupled oscillations [17]. Thus, studying mechanisms for network consensus is key to understanding collective behaviors of both natural and social systems, and to building engineering networks as well.

Many recent efforts have also been devoted to the case of networks with more than one type of agent, including, e.g., leaders and followers, stubborn or even adversarial agents, which appear naturally in real world networks and systems (see, e.g., [5, 12, 13, 18–23] and references therein). The notion of leader is also useful in the study of control of networked systems, where leaders serve as agents that directly receive control inputs and network connections are paths for control action propagating to the other agents [24]. Network controllability and its dual notion, observability, have also been studied extensively in recent years [25–31]. However, a closely related problem, namely network consensus prediction, has scarcely been considered. In this thesis, we will investigate this problem in detail.

The idea behind consensus can also serve as a mechanism for information sharing/diffusing in the design of many distributed multi-agent algorithms, including distributed optimization, where a group of agents with limited communication tries to solve a global optimization problem. This problem arises in many applications such as distributed estimation in sensor networks [32–34], distributed resource allocation [35, 36], and large-scale machine learning and statistical inference [37–39], and is becoming more urgent in the new era of "Big Data".

1.1.2 Network Asymmetry

For distributed systems, communication is vital to system performance as it is the backbone for information to flow from one agent to another, and hence the only platform for each agent to contribute to or get involved in the global objective of the network. As a result, special attention to communications is needed in the study and design of distributed systems; in fact, it is one of the main aspects that distinguishes them from centralized ones.
In general, communications can be categorized as undirected or directed. In many applications, it is possible that inter-agent communications are undirected, i.e., when two individuals communicate, each receives information from the other, and moreover, they can have some agreement on how to use that information. However, there are many other scenarios where communications are directed due to, e.g., communication constraints arising from various sources, including physical network connections or hardware capability of system components. This is clearly the case when we consider the effect of a leader or stubborn agent in the network. Another practical example is an ad hoc Wireless Sensor Network, where there may not be a pre-existing communication infrastructure at the time of deployment and directed communications can arise as a consequence of the geometric network layout, nonuniform transmission power limits, or even sensor mobility. We identify directed communication as a type of network asymmetry; see Figure 1.1 for an example of a directed network with 5 agents.

Figure 1.1: A directed network of 5 agents.

Although directed communication schemes include undirected schemes as a special case (and thus apply to a much larger range of situations), a large body of literature in the field of consensus and distributed computation and optimization focuses on undirected networks. Such networks are more amenable to mathematical analysis, especially when using tools that assume network symmetry (such as symmetric nonnegative Laplacian matrices, for instance). Analysis of algorithms and problems with directed communications is generally very involved and often requires new tools and techniques. Developing algorithms that apply to networks with directed communication schemes is the prime subject of this thesis. In particular, we are interested in scenarios where network asymmetry poses challenges in the analysis and may hinder system performance.

Among various models in the literature, the DeGroot model [1] is one of the most often used for its simplicity and ability to exhibit consensus behaviors. Here, each agent in the network repeatedly updates its opinion as a weighted average of the opinions of its immediate neighbors, including itself. This update scheme simply gives rise to a row stochastic weight matrix, which under mild conditions guarantees consensus in the network. The agreement value depends on this weight matrix and the agents' initial opinions. However, in many situations, including various distributed optimization algorithms, an averaging scheme is desired instead; to this end, the weight matrix is often required to satisfy a balancedness condition, namely double stochasticity. In such a case, this condition means that any agent needs to know to whom it sends information and/or to regulate the way the recipient uses the information. This can be ensured fairly easily and locally in undirected networks, but is very hard and costly to ensure in many distributed systems such as wireless sensor/ad-hoc networks and especially networks with broadcast-based communications. A few algorithms proposed recently for directed networks employ column stochastic matrices by requiring that each agent at any time is aware of which neighbors receive the information sent to them by the agent. A row stochastic matrix is much easier to implement in these applications/communication environments, but yields unsatisfactory performance when used with those algorithms.
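To see why row stochasticity is the natural condition under directed, broadcast-based communication, consider the following minimal Python sketch (illustrative only; the construction is standard and not specific to this thesis). Each agent normalizes the weights on the information it receives, a purely local operation, whereas column (and hence double) stochasticity constrains what all recipients of an agent's broadcasts do with them.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5

# Directed adjacency with self-loops: A[i, j] = 1 means agent i hears agent j.
A = (rng.random((N, N)) < 0.4).astype(float)
np.fill_diagonal(A, 1.0)

# Row stochastic weights: agent i normalizes over its own in-neighbors,
# using only locally available information.
W = A / A.sum(axis=1, keepdims=True)

print(np.allclose(W.sum(axis=1), 1.0))  # True: rows sum to 1 by construction
print(np.allclose(W.sum(axis=0), 1.0))  # generally False: columns do not
```

Making the columns also sum to one would require each sender to know exactly who receives its broadcasts and how they weight them, which is the coordination burden alluded to above.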
Therefore, in this thesis, we will mostly deal with row-stochastic matrices as the main source of network asymmetry.

1.2 Main Problems and Thesis Contributions

In this research, we consider the following three main topics and the associated problems identified below. Unlike most existing work in the literature, where network symmetry or balancedness is assumed, we address the general case of weighted directed graphs. Thus, one of the main technical contributions that we emphasize is a set of tools and techniques developed to overcome network asymmetry in various problems and applications of consensus.

1.2.1 Network with Leaders

In the first part of the thesis, we consider a DeGroot model with the presence of external media nodes, representing leaders or sources of news, often having constant opinion values. See Figure 1.2 for an illustrative example.

Figure 1.2: A network in the presence of 2 leaders T and Q.

First, when consensus is the main goal of a leader, we are interested in finding conditions under which the whole network will eventually agree with that leader's opinion for any initial opinions of the agents. Indeed, we will determine how strong the connections between the leaders and the followers, as well as those among the followers, should be to ensure that this agreement can be achieved asymptotically.

• When there is only one leader, we derive new sufficient conditions for guaranteeing consensus with the leader for both fixed and switching topologies. These conditions emphasize the persistence of the connectivity between the leader and the followers and are the mildest so far, covering many existing results in the literature.

• In the presence of more than one leader, we show that only those that are persistent matter. In particular, when only one leader is persistent, we provide conditions under which the network converges to the state of this leader.

• A technical contribution lies in the tool that we develop to prove the results above, namely, a result on the convergence of an infinite product of nonnegative substochastic matrices.

Second, we study the problem of a leader that aims to influence the opinions of agents in a directed network through connecting with a limited number of the agents. The leader's goal is to select this set of agents, referred to as direct followers, to achieve the greatest possible influence on the opinions of agents throughout the network. Here, when there is only one leader and consensus is guaranteed a priori, the influence of that leader is characterized through the transient error of the network, and thus is able to take into account both the network structure and the opinion dynamics evolving on it. When, on the other hand, there is a second leader (or a stubborn agent) with a competing opinion and consensus is not achievable, the influence of the first leader is measured in terms of the steady state error of the network. Compared to existing work, not only are our problem settings and formulations more natural (and thus likely to be of more value for practical applications), but our technical results are also much stronger. In particular,

• We prove the supermodularity property of the objective function capturing the leader influence in both cases, and the convexity of its continuous relaxation for general directed networks. Here, the convexity result is novel; the supermodularity result generalizes existing results in the literature but is proved using a different technique.
• We then develop greedy algorithms that are theoretically guaranteed to have a lower bound on the approximation ratio. The new convexity result allows us to benefit from efficient (customized) numerical solvers to obtain practically comparable solutions. We demonstrate through numerical examples that the two approaches can be combined to provide effective tools and better analysis for optimal design of influence spreading in diffusive networks.

1.2.2 Consensus Prediction

In this part of the dissertation, we introduce and study the problem of consensus prediction in a network whose dynamics are described by a DeGroot model. In particular, we assume that there is an observer who can monitor the states of a group of agents, but might not have accurate knowledge of the underlying communication graph and the associated weight matrix; see Figure 1.3.

Figure 1.3: A network with an observer.

We want to answer the following questions: For any initial opinions of the agents, how can the observer determine the consensus value, if it exists, by using a finite number of observations? What is the minimum number of observations needed? And, if the observer has more information about the network structure, how can the observation time be minimized over possible choices of observed nodes? Our main contributions in this topic are as follows:

• We reveal an intrinsic relation between the consensus value and network data; namely, if the consensus value can be computed at a particular time for any initial opinions, then it can be expressed as a linear combination of available observation data with associated coefficients depending on the weight matrix.

• We derive a fundamental limit on the monitoring time for the case of a single observed node, below which the observer with limited knowledge about the network is not able to determine the consensus value regardless of the method used. We provide sufficient conditions for achieving this limit.

• We provide a conjecture and analysis for the case of multiple observed nodes and develop algorithms toward achieving the conjectured bounds through local observations and computations. We show that with certain knowledge about the network structure, the observer can answer a few questions regarding the optimal monitoring time.

1.2.3 Distributed Optimization

In our work on distributed optimization, we seek interaction rules for a network of agents which result in the network collaboratively solving a global optimization problem. The interactions among agents must be local, without a central coordination unit, and the objective function is the sum of the local costs of all the agents; see Figure 1.4.

Figure 1.4: A network of five agents that try to minimize the sum of individual cost functions.

Under the conditions that the underlying communication graph is directed and the weight matrix is only row stochastic, we design algorithms for the agents to collaboratively solve this problem in a distributed manner and/or with fast convergence rates.

• We first study the use of consensus prediction for enhancing convergence of distributed optimization algorithms.
The resulting algorithms are the first that possess the following useful features: (i) they are distributed but behave similarly to the centralized gradient methods except on a slower time-scale (including finite time convergence for quadratic cost functions), (ii) all the agents are able to locally stop updating at the same time with the same estimate of the optimal solution, and (iii) the theoretical convergence scales at most linearly with the network size in general (and thus is the best so far).

• We provide a unified analysis for distributed projected subgradient methods with nonidentical local constraint sets. To deal with network asymmetry, we introduce a rescaling technique to the original distributed projected subgradient methods by incorporating an additional consensus step which aims to provide each agent with an estimate of the corresponding element of the left Perron eigenvector of the weight matrix.

• We present another algorithm that also uses the rescaling technique above but is able to achieve linear convergence under a stronger assumption on the local objective cost functions.

1.3 Literature Survey

One of the very first mathematical models used for studying network consensus is the DeGroot model [1], described as follows. Consider a group of N agents and let xi(t) ∈ [0, 1] denote the opinion of agent i at time t; here time is discrete, i.e., t ∈ Z+. Each agent has an initial opinion xi(0). At any time t ≥ 0, each agent observes the opinions of its neighbors and naively updates its own opinion according to

    xi(t+1) = Σ_{j=1}^{N} wij xj(t),   i = 1, . . . , N,   (1.1)

where wij indicates the weight that agent i places on agent j's opinion. Here wij ≠ 0 implies that agent i is able to obtain the opinion of agent j at time t. Thus, W = [wij] ∈ R^{N×N} is often called the weight matrix (or trust matrix). In this regard, it is natural to represent the communication network among the agents using graphs, where xi is the state of node i and wij the weight or strength of link (i, j); for this reason, the terms agent and node are used interchangeably. The network achieves consensus if for any initial opinions, it holds that lim_{t→∞} |xi(t) − xj(t)| = 0 for all i, j. Thus, a network following (1.1) is often called a consensus network.

This model was introduced in [1] for the synchronous and time-invariant case, where the author studied the process of reaching agreement among a group of experts; it was used in [40] for studying coordination of a group of particles, and extended in [2, 41–43] to the case of asynchronous and time-varying networks in the context of distributed decision making and parallel computing. Since then, a vast literature on network consensus has developed. Models that generalize or are related to (1.1) are also numerous, including, for example, agents with high order linear dynamics [44, 45] and nonlinear dynamics (possibly with nonlinear coupling among agents) [40, 46–51], where consensus (in terms of the agents' outputs) is also known as synchronization. Convergence under various assumptions on the communication graph has also been studied, including, for example, directed information flow [6, 8], link/node failure and noises [33], communication time-delays [47, 52], fixed or switching network topology [5, 9], and quantization [53, 54].
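As a concrete illustration of (1.1) and of the convergence result recalled in Section 1.5.2 below, the following minimal Python sketch (illustrative only, not from the thesis) simulates the DeGroot update with a row stochastic W and checks that all opinions converge to π'x(0), where π is the left Perron eigenvector of W.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5

# Random positive weights, rows normalized: W is row stochastic, and the
# (complete) graph is strongly connected, so consensus is guaranteed.
W = rng.random((N, N))
W /= W.sum(axis=1, keepdims=True)

x0 = rng.random(N)                 # initial opinions x(0)
x = x0.copy()
for t in range(200):               # iterate (1.1): x(t+1) = W x(t)
    x = W @ x

# Left Perron eigenvector pi of W (pi' W = pi', pi' 1 = 1).
evals, evecs = np.linalg.eig(W.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

print(x)        # all entries (numerically) equal
print(pi @ x0)  # ... and equal to pi' x(0), the predicted consensus value
```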
In this thesis, however, we will mostly focus on the first-order linear model (1.1), which is relatively simple but instructive enough for a basic study of consensus dynamics and particularly suitable for various distributed computation and optimization algorithms.

1.3.1 Consensus in Networks with Leaders

Many recent efforts have also studied networks with more than one type of agent, including, e.g., leaders and followers, stubborn or even adversarial agents (see, e.g., [5, 12, 13, 18–23]). In general, for the DeGroot model, consensus can still be achieved asymptotically with a single leader, but not if there are multiple leaders such that at least two of them are uncooperative. In this subsection, we review known conditions for consensus in the literature, where a leader is included in the network as a special agent with constant opinion.

For the case of a fixed network with a constant weight matrix, a necessary and sufficient condition for consensus is that the graph is rooted [9]. However, such a condition is still an open question for a time-varying interaction topology. In this case, conditions for consensus are those ensuring convergence to a rank-one matrix of an infinite product

    lim_{t→∞} W(t) W(t−1) · · · W(0),

where each W(t) is a stochastic matrix. This is also a well studied problem in the theory of non-homogeneous Markov chains. Therefore, many results and tools in matrix theory and Markov chain theory can be appealed to. It has been shown [9, 55] that a necessary condition is that the union graph over an infinite interval is rooted. This condition, though, is far from being sufficient; a counterexample can be found in [8]. Therefore, to derive sufficient conditions, more assumptions on the network connectivity and the weight matrix are required. For example, the authors in [5] rely on Wolfowitz's theorem [56] on the convergence of infinite products of stochastic matrices W(t) belonging to a finite set. This condition is relaxed in [8, 55, 57, 58] so that the matrices can belong to an infinite set. However, these works require that all the self-weights and other nonzero link weights are uniformly bounded below by a positive number and that either the weight matrix at any time has a symmetric zero/nonzero structure (i.e., the graph is undirected) or the union of the interaction graphs over any period of some fixed length is strongly connected. Similarly, the limiting behavior of products of random stochastic matrices is also studied in [59] assuming the cut-balanced property of the sequence of the matrices, which is in the same spirit as having symmetric zero/nonzero structures. We will show that for the DeGroot model in the presence of a leader, the strong connectivity and symmetric structure conditions on the weight matrix can be relaxed. However, we take a different approach, in that we develop consensus conditions directly for the model with leaders.

Besides deriving conditions for guaranteeing consensus, the problem of selecting an optimal subset of agents in the network to influence is also of interest in many practical applications, often known as leader selection and optimal stubborn agent placement [20, 60–67]. By considering different measures of influence or centrality in the network, there have been various approaches to solve the associated optimization problem. For example, [68] considers the problem of minimizing the convergence rate of a consensus network.
The authors show the connection between the convergence rate and the maximum distance from the leader to the followers and then apply a combinatorial optimization method to solve the problem approximately. In [69], the authors consider the problem of minimizing the total system error in a noisy network and derive systematic solutions for some special cases. In [62], the authors consider a characterization quantifying both the transient and the steady states of the agents' opinions, assuming that all the regular agents have the same initial opinions and that any direct follower replaces its own opinion by that of the leader. In [70], the authors use a continuous-time model and consider the problem of leader selection in order to minimize the convergence error, defined as the ℓp-norm of the distance between the followers' states and the convex hull of the leader states. By replacing the convergence error with an upper bound that is independent of the initial states of the network, [70] is able to employ a supermodular optimization approach.

Our work in this topic departs from this literature in many respects. First, we drop the assumption that the network is undirected and allow the underlying network to be directed. Second, we allow selected direct follower nodes (i.e., agents directly connected to the leader) to follow inter-agent dynamics like any other agents, rather than forcing them to adopt the leader's opinion instantaneously. Third, we allow the agents in the network to have different initial opinions, and the leader can assign different weights to the network agents. Finally, and more importantly, although continuous relaxation and greedy heuristics have been employed in dealing with influence maximization problems, our theoretical results on convexity and supermodularity are considerably stronger than existing results. We achieve this without assuming any symmetry or resorting to random walk theory. This not only provides a deeper understanding of diffusive processes but also can be used for a broad range of applications.

1.3.2 Consensus Prediction

The topic of consensus prediction is useful in network monitoring and security but has not received much attention. The existing literature mostly focuses on the observability problem for networked multi-agent systems. The problem we consider differs from the observability problem in the sense that we are concerned with the final value instead of trying to recover the initial conditions of all the agents. Moreover, here, the observer might not know the network structure or the weight matrix. Our analysis builds on a recent method for reaching consensus in finite time by employing the minimal polynomial of each agent [71, 72]. This method is concerned only with consensus predictability, and does not address optimality. Although predicting the agreement value of a consensus network is the main goal here, our approach makes a contribution to the topic of network identification through application of realization theory [73–75] to distributed networked systems.

1.3.3 Distributed Optimization

Consensus also plays an important role in distributed optimization, where a group of agents with limited communication tries to solve a global optimization problem in which the objective function is the sum of (possibly nonsmooth) local objectives of the agents and the global constraint set is the intersection of local constraints.
Tsitsiklis [2], among others, pioneered research on distributed computation over networks and the interplay between network dynamics and performance of decentralized algorithms in the context of networked control. Specifically, in [2] the problem of achieving consensus in system (1.1) was studied and then used as a subroutine for performing estimation and solving a class of optimization problems in a distributed manner. In this connection, the consensus step is utilized to deal with the fact that the agents have incomplete knowledge about the optimization problem. It was this idea that triggered the development of many consensus-based distributed algorithms; see, e.g., [32, 42, 76–85] and references therein.

Well known among these is the class of distributed (sub)gradient methods, which possesses many practically desirable characteristics, including simplicity of implementation and generally weak assumptions on the local cost functions as well as the network topology. Major limitations of algorithms in this category are also well studied. First, the convergence of many algorithms depends on the choice of step size sequences. When a constant step size is used, both Distributed Gradient Descent and Distributed Subgradient methods only yield convergence to a neighborhood of the optimal solution and of the optimal value [78, 86]. This occurs even under stronger assumptions on the local objective functions such as strong convexity and Lipschitz continuous gradients, and is thus one of the main differences between these methods and their centralized counterparts. This motivates the use of particular diminishing or adaptive step sizes to achieve asymptotic convergence. However, the convergence rate can be very slow (compared to that of the centralized method), depending on the step size sequence, whose appropriate selection is not trivial. Nesterov's acceleration technique can be employed [87] to speed up the convergence. Second, many incremental subgradient methods require all the agents to construct a closed cycle in order to pass an estimate of the solution around the network; see, e.g., [32, 88, 89]. Third, even when asymptotic convergence is guaranteed, it is not obvious how each agent can locally decide when to stop the algorithm without affecting other agents' estimates. Put differently, there are no simple criteria for all the agents to stop at the same time while also sharing the same estimate of an optimal solution. This is also true for most (if not all) other distributed optimization methods.

When all the local cost functions are quadratic, many other consensus-based algorithms can outperform those in the subgradient class. For example, the ratio consensus method can be used to solve the problem without constraints and converges exponentially [34, 90]. Based on this method, [91] proposed a Newton-Raphson-like algorithm which also converges asymptotically for a class of functions having continuous, strictly positive and bounded second derivatives, assuming a sufficiently small discretization step. Recently, much attention has also been given to decentralized Alternating Direction Method of Multipliers (ADMM) type methods with fast convergence in both theory and practice [39, 43, 92].

Most existing methods in distributed optimization (including those mentioned above) require the network to be undirected so that neighboring agents exchange information in both directions, increasing the possibility of reaching some agreement.
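To make the preceding limitations concrete, the following Python sketch (an illustrative textbook form, not an algorithm from this thesis) implements the classic distributed subgradient iteration xi(t+1) = Σj wij xj(t) − γ(t) gi(t) on an undirected ring, with local costs fi(x) = |x − ai| whose sum is minimized at the median of the ai. The symmetric averaging weights used here are doubly stochastic, which is precisely the kind of requirement discussed next.

```python
import numpy as np

N = 5
a = np.array([-2.0, 0.5, 1.0, 3.0, 4.0])   # local data; sum of |x - a_i| is minimized at median(a) = 1.0

# Doubly stochastic weights on an undirected ring: 1/3 to self and to each neighbor.
W = np.zeros((N, N))
for i in range(N):
    W[i, i] = W[i, (i + 1) % N] = W[i, (i - 1) % N] = 1.0 / 3.0

x = np.zeros(N)                            # local estimates x_i(0)
for t in range(20000):
    g = np.sign(x - a)                     # subgradient of f_i at x_i
    x = W @ x - g / np.sqrt(t + 1.0)       # consensus step + diminishing subgradient step

print(x)                                   # entries cluster near the optimum 1.0
```

With the diminishing step size γ(t) = 1/√(t+1), the iterates approach the optimum only slowly, illustrating the step-size sensitivity noted above; with a merely row stochastic W, the same iteration would instead be biased toward a π-weighted optimum, which motivates the rescaling technique developed later in this thesis.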
Many methods employing subgradient and consensus steps require the weight matrix associated with the network to be column stochastic or even doubly stochastic, which may be hard to arrange in directed networks, especially in a broadcast-based communication environment. A recent reweighting technique introduced in [93, 94] allows row stochastic matrices but assumes knowledge of the graph, that is, of the stationary distribution of the weight matrix and of the number of agents in the network. Thus, a fully distributed algorithm employing only row stochastic weight matrices has not been available in the field of distributed optimization thus far. In our work, we will develop such an algorithm. Moreover, known convergence analysis of distributed subgradient methods varies according to whether the problem is unconstrained or constrained, and whether the local constraint sets, usually compact, are identical or nonidentical. Thus, there is a lack of a unified convergence analysis for those scenarios, and hence developing such an analysis is also one of the goals of this thesis.

1.4 Thesis Organization

The remainder of the thesis is organized as follows. This introductory chapter ends with notation, definitions and mathematical background, including well known results for the DeGroot model. Our main results are presented in three parts corresponding to the three topics.

Part I of the thesis is concerned with the influence of a leader on the opinions of the agents in a directed network whose dynamics follow the DeGroot model. Specifically, in Chapter 2, we develop various sufficient conditions for guaranteeing consensus of all the network agents to the leader opinions in many scenarios: static and dynamic network topologies, with one leader or two competing leaders. Then in Chapter 3, we are concerned with the problem of optimizing the influence of a leader on the opinions of the agents in the case of a fixed network topology. We derive various joint centrality measures for a group of followers in different settings, and then develop theory and approximation algorithms for obtaining suboptimal solutions in large networks.

Part II, which consists of Chapter 4, deals with problems related to an observer that seeks to predict the consensus value of a network by monitoring the opinions of a group of agents. This setting can be seen as the dual to that in Part I, where the leader injects information/control into the network. We make use of a central tool in functional analysis, namely the Hahn-Banach Theorem, to prove the optimality of the degree of the minimal polynomial of a node as a tight lower bound on the observation time over all possible approaches to determine the consensus value if only that node is monitored. We then develop analysis and distributed algorithms for the case of multiple observed nodes. We also discuss optimal selection of observed nodes using graph theory.

Part III of the thesis is concerned with distributed optimization. In Chapter 5, we employ the consensus prediction method presented in Chapter 4 as an acceleration technique for enhancing convergence of the distributed gradient method in terms of correctness and speed. In the special case where the local objective functions are quadratic, we show that finite time convergence can be achieved. We also discuss a performance limit of distributed optimization and compare it with our algorithms.
In Chapter 6, we introduce a new technique that enables many distributed optimization algorithms to work with directed networks and row stochastic weight matrices. We then develop a unified analysis of the convergence, as well as the convergence rate, of a distributed subgradient algorithm and its variation, which can be applied to both unconstrained problems and constrained ones, possibly with nonidentical and unbounded local constraint sets. Finally, conclusions and directions for future research are given in Chapter 7.

1.5 Notation and Mathematical Background

1.5.1 Notation and Definitions

Notation: We use boldface characters and symbols to denote vectors, for example, x = [x1, x2, . . . , xm]^T ∈ R^m, xi = [xi1, xi2, . . . , xim]^T ∈ R^m, 1 = [1, 1, . . . , 1]^T and ei = [0, . . . , 0, 1i, 0, . . . , 0]^T (with the 1 in the i-th position). For a vector x, ‖x‖1, ‖x‖2 (or just ‖x‖) and ‖x‖∞ denote its 1-norm, 2-norm, and ∞-norm, respectively. We also denote by diag(x) the diagonal matrix whose diagonal elements are the elements of vector x.

For a matrix A, A^T denotes its transpose, A† its pseudo-inverse, [A]ij or Aij the ij-th element, rank(A) the rank, tr(A) the trace, ρ(A) its spectral radius, ‖A‖ the (induced) 2-norm of A, and |A| the matrix composed of the absolute values of the elements of A, i.e., [|A|]ij = |[A]ij| for all i, j. We also use A(i) and A(j), respectively, to denote the i-th row and the j-th column of A.

Sets are denoted by calligraphic upper case letters. For a given set A, |A| or card(A) denotes its cardinality, and χA denotes the associated indicator function. The degree of a polynomial q is denoted by deg(q).

Basic Notions: A matrix A = [aij] is nonnegative (positive) if aij ≥ 0 (aij > 0) for all i, j. If A − B is a nonnegative matrix, we write A ≥ B. A nonnegative square matrix A is row stochastic (or simply stochastic) if A1 = 1, (row) substochastic if A1 ≤ 1, column stochastic if A^T 1 = 1, and doubly stochastic if it is both row and column stochastic. A square matrix A is called an M-matrix if (i) all the off-diagonal elements are nonpositive, i.e., aij ≤ 0 for all i ≠ j, and (ii) it can be expressed as A = sI − B, where B is a nonnegative matrix such that ρ(B) ≤ s.

A directed graph G = (V, E) consists of a finite set of nodes V = {1, 2, . . . , N} and a set E ⊆ V × V of edges, where an ordered pair (i, j) ∈ E indicates that agent i receives information on the state of agent j. A directed path is a sequence of edges of the form (i1, i2), (i2, i3), . . . , (ik−1, ik). A simple path is a path without any node repeated. Node i is said to be reachable from node j if there exists a path from j to i. Each node is reachable from itself (i.e., self-loops are permitted). For node i, Ni = {j ∈ V : (i, j) ∈ E} is the set of in-neighbors (or neighbors for short), and |Ni| is the degree (also in-degree) of node i. Graph G is connected (or weakly connected) if it cannot be partitioned into 2 separate groups that have no paths connecting them. Graph G is strongly connected if each node is reachable from any other node. A tree is a graph that has a node called the root from which all the other nodes are reachable. The diameter of a connected graph G, denoted by diam(G), is the length of the longest path among all simple paths.

Let f : R^m → R be a convex function. The domain of f is denoted by dom(f). We denote by ∂f(x) the subdifferential of f at x ∈ dom(f), i.e., the set of all subgradients of f at x:

    ∂f(x) = {g ∈ R^m : f(y) − f(x) ≥ g^T (y − x), ∀y ∈ dom(f)}.   (1.2)
A differentiable function f is called strongly convex with parameter µ > 0 if for any x, y ∈ dom(f),

    f(y) − f(x) ≥ ∇f(x)^T (y − x) + (µ/2) ‖y − x‖².   (1.3)

1.5.2 Convergence of DeGroot Model

This subsection presents convergence conditions for the DeGroot model (1.1) that will serve as a basic result for our development in the sequel. Consider a leaderless network consisting of N agents denoted by V = {1, 2, . . . , N}. The underlying communication is characterized by a directed graph G = (V, E). The update of each agent i's opinion at any time t ≥ 0 (here t denotes time, which can take any nonnegative integer value) can also be given as follows:

    xi(t+1) = ( Σ_{j∈Ni} wij xj(t) ) / ( Σ_{j∈Ni} wij ),   xi(0) = x0i ∈ R,   (1.4)

where wij ∈ [0, ∞) quantifies the unnormalized weight that agent i places on agent j's opinion, and recall that Ni denotes the set of node i's immediate neighbors (including itself). The weight matrix of the network is denoted W := [wij] ∈ R^{N×N}, with the inter-nodal influence parameters wij > 0 when there is a direct link from agent j to agent i and wij = 0 if no such link exists.

Definition 1.5.1. Consider the DeGroot model (1.1). The network achieves consensus if for any initial opinions, it holds that lim_{t→∞} |xi(t) − xj(t)| = 0 for all i, j = 1, . . . , N.

In most chapters (except Chapter 2), we will make the following assumptions on the communication graph G and the weight matrix W, which are usually imposed to ensure consensus of the network.

Assumption 1.5.2. The network G = (V, E) is a fixed and strongly connected directed graph.

Assumption 1.5.3. The matrix W = [wij] is row stochastic and satisfies wij > 0 if (i, j) ∈ E, wii > 0 for some i ∈ V, and wij = 0 otherwise.

Assumption 1.5.3 means that the zero-nonzero structure of the weight matrix W reflects the network structure. Moreover, W is now a normalized weight matrix. Thus, we can also express (1.4) compactly as

    x(t+1) = W x(t).   (1.5)

It is well known (see, e.g., [1, 5, 14, 15, 49]) that under the assumptions above, the network achieves consensus. In fact, W is irreducible and represents an ergodic Markov chain. Let π denote the stationary distribution of W, i.e., π is the left eigenvector of W corresponding to the eigenvalue 1 and satisfying the condition 1^T π = 1. The following result is the well known Perron-Frobenius theorem for irreducible matrices (see, e.g., [95]), which lays the foundation for theories of Markov chains and network consensus (see, e.g., [14, 15] and references therein).

Theorem 1.5.4. (see, e.g., [95]) If W is row stochastic and irreducible, then
1. W has spectral radius ρ(W) = 1, which is also a simple eigenvalue.
2. π^T W = π^T and π is a strictly positive vector.
3. lim_{t→∞} W^t = 1π^T exists. The convergence rate is geometric and determined by the second largest eigenvalue of W.

Corollary 1.5.5. Under Assumptions 1.5.2 and 1.5.3, the network (1.4) achieves consensus and

    lim_{t→∞} xi(t) = π^T x(0),   ∀i ∈ V.   (1.6)

Moreover, the convergence rate is geometric.

Clearly, the consensus value depends on the weight matrix W and the initial opinions x(0). We also often define a weighted Laplacian matrix L = [lij] ∈ R^{N×N} satisfying εL = I − W for some ε > 0. The following identities are immediate:

    π^T L = 0^T,   L1 = 0.

Moreover, all the eigenvalues of L are positively stable except only one at the origin.

Part I: Consensus Network with Leaders

Chapter 2: Opinion Dynamics with Persistent Leaders

Abstract: This chapter revisits the problem of agreement seeking in a network of agents under the influence of leaders.
Part I: Consensus Networks with Leaders

Chapter 2: Opinion Dynamics with Persistent Leaders

Abstract: This chapter revisits the problem of agreement seeking in a network of agents under the influence of leaders. The persistence of the effect of the leader (or leaders) on the opinions of the network agents is characterized by the total weight that the agents place on the leader's information over time. If this weight is infinite, the leader is called persistent. We describe the asymptotic behavior of network opinions relative to the state of a persistent leader in both cases of fixed and switching network topologies. We also show that only persistent leaders are able to drive the network to the leader's constant state.

2.1 Introduction

It is widely known that both the communication graph and the influence structure of a network play important roles in reaching consensus. The former indicates whom an agent interacts with, while the latter determines the weights on the information that he receives from others. In most existing work on network consensus with or without leaders, the weights are usually assumed to be constant or to vary in a compact set bounded away from zero (see, e.g., [1, 5, 13, 15] and references therein). In practice, however, the connections between agents may be transient, and the weights may fluctuate over a broad range and even diminish with time. Recently, the notion of persistent graphs was studied in [55], where it was shown that persistent links are crucial in seeking an agreement. However, the work [55] considered networks without a leader and required stronger conditions than just the persistence of the agent interconnections. In this chapter, we study networks in the presence of leaders and show that consensus can still be achieved under milder requirements.

A great number of recent efforts have also been devoted to the consensus problem in networks with leaders (see, e.g., [5, 12, 13, 20] and references therein). A general conclusion is that consensus cannot be achieved when the leaders have competing opinions. This is mainly because only persistent leaders were considered. In this chapter, we show that network agreement can still be reached in the presence of competing leaders, provided that only one of them is persistent. This also distinguishes our results from existing works in this setting.

The main contributions of this work are as follows. First, we derive new sufficient conditions for guaranteeing agreement in networks with a leader, for both fixed and switching topologies. These conditions emphasize the persistence of the connectivity between the leader and the followers and, to the best of our knowledge, are the mildest available, covering many existing results in the literature. Second, we show that in a network with more than one leader, only those leaders that are persistent matter. In particular, when there is only one persistent leader, we provide conditions under which the network converges to the state of this leader. Most of the results in this chapter were first presented in [96].

The rest of the chapter is organized as follows. In Section 2.2, we describe the problem of interest in detail. Sections 2.3 and 2.4 present the main results for networks with a single leader and two leaders, respectively. Finally, discussion and future work are given in Section 2.5.

2.2 Problem Formulation

Consider a set of N agents or nodes interacting over a communication network. The topology of the network at time t ∈ Z+ is described by a graph G(t) = (V, E(t)). Let x_i(t) ∈ [0, 1] denote the opinion of node i at time t. At the initial time t = 0, each agent has an initial opinion x_i(0).
Suppose that at each time t, every agent synchronously obtains the opinions of his neighbors and naively updates his own opinion following the DeGroot discrete-time model [1]

$$x_i(t+1) = \sum_{j\in\mathcal{V}} w_{ij}(t)x_j(t), \qquad \forall i \in \mathcal{V}, \qquad (2.1)$$

where w_ij(t) ≥ 0 indicates the weight that agent i puts on agent j's opinion at time t. Here W(t) = [w_ij(t)] ∈ R^{N×N} represents the weight matrix (or trust matrix) at time t. We will assume that W(t) is a row stochastic matrix for any t, i.e., W(t) is nonnegative and W(t)1 = 1.

Now consider the above network under the effect of an external media node, representing a leader or a source of news with a constant opinion value T ∈ [0, 1]. Although such a node is often thought of as a stubborn node indistinguishable from the others, we consider it separately from that context, as we want to look at the network from the point of view of a leader and investigate its effect on the opinions of the other regular agents, conventionally called followers. To this end, assume that the leader can connect to some of the followers and persuade them to trust its opinion T with trust levels α_i(t) ∈ [0, 1], ∀i ∈ V. Here, α_i(t) = 0 means distrust of, or unawareness of, the leader, and α_i(t) = 1 implies absolute trust. The update rule is then given by

$$x_i(t+1) = \alpha_i(t)T + (1-\alpha_i(t))\sum_{j\in\mathcal{V}} w_{ij}(t)x_j(t), \qquad \forall i \in \mathcal{V}. \qquad (2.2)$$

In matrix form, (2.2) reads

x(t + 1) = α(t)T + Γ(t)W(t)x(t),  (2.3)

where α(t) = [α_1(t), ..., α_N(t)]^T and Γ(t) = I − diag(α(t)), with I ∈ R^{N×N} being the identity matrix.

We are interested in finding conditions under which the network eventually agrees with the leader's opinion. Indeed, we will determine how strong the connections between the leader and the followers should be to ensure that this agreement is achieved asymptotically. We will also extend these results to the case of a network with more than one leader. The following notions will be used in this chapter; see, e.g., [8, 55].

Definition 2.2.1. Consider a time-varying graph G(t) = (V, E(t)) with an associated time-dependent weight matrix W(t).

• Link (i, j) is called persistent if Σ_{t≥0} w_ij(t) = ∞.
• Node i is persistent if Σ_{t≥0} Σ_{k=1}^N w_ki(t) = ∞.
• The persistent graph G_∞ induced by {G(t), W(t), ∀t ∈ Z+} is the graph containing all the persistent links.
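The update (2.2)-(2.3) is straightforward to simulate. The following minimal sketch implements one step of (2.3) in NumPy; the weight matrix, the constant trust levels, and the leader opinion are illustrative assumptions, not data from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T_op = 4, 1.0
W = rng.random((N, N))
W /= W.sum(axis=1, keepdims=True)          # row stochastic trust matrix

def step(x, alpha, W, T=T_op):
    """One iteration of (2.3): x(t+1) = alpha*T + (I - diag(alpha)) W x(t)."""
    Gamma = np.eye(len(x)) - np.diag(alpha)
    return alpha * T + Gamma @ (W @ x)

x = rng.random(N)                           # initial opinions in [0, 1]
for t in range(500):
    alpha = np.full(N, 0.05)                # constant trust in the leader
    x = step(x, alpha, W)
print(x)                                    # all entries approach T
```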
2.3 Opinion Dynamics with One Leader

This section studies the convergence of opinions under different assumptions on the connection topology of the followers' network and the connectivity of the leader with the followers. First, it is easy to see that if there exists t* ∈ Z+ such that x(t*) = T1 (e.g., α(t* − 1) = 1), then x(t) = T1 for all t ≥ t*, i.e., the network converges to T in finite time (at most t* steps) for any initial opinions x(0). Second, it is also known that when W and α are constant, then x(t) → T1 for any x(0) if the extended graph including the leader and the followers contains a spanning tree rooted at the leader (see, e.g., [9]). Third, consider the case when the leader's effect lasts only for an interval [0, t_0], e.g., a campaign period, with α(t) = 0 for all t > t_0. Suppose that W is fixed and the network is strongly connected. Then there exists π ∈ R^N such that lim_{t→∞} W^t = 1π^T (cf. Section 1.5). As a result, lim_{t→∞} x(t) = 1π^T x(t_0), i.e., the consensus value may differ from T. If α(t) ≡ α for all t ∈ [0, t_0], then it can be verified that

$$\lim_{t\to\infty} \mathbf{x}(t) = \mathbf{1}\bigl(T - \boldsymbol{\pi}^T(\Gamma W)^{t_0+1}(T\mathbf{1} - \mathbf{x}(0))\bigr). \qquad (2.4)$$

Since ΓW is strictly substochastic and irreducible, it follows that ρ(ΓW) < 1 (see, e.g., [97, Thm 1.1, p. 24]), and thus lim_{t_0→∞}(ΓW)^{t_0} = 0. Therefore, x(∞) = T1 as t_0 → ∞, i.e., the network reaches consensus at the leader's state when the leader is persistent.

Next, we allow α(t) to be time-varying; the limit lim_{t→∞} α(t) need not exist. Some related works are [5, 57] and [59]. Notice that the results in [5] rely on Wolfowitz's theorem [56] on the convergence of infinite products of stochastic matrices belonging to a finite set. This condition is relaxed in [57] so that the matrices can belong to an infinite set. However, the work [57] requires symmetry of the zero/nonzero structure of these matrices. Similarly, the limiting behavior of products of random stochastic matrices is studied in [59] assuming the cut-balanced property of the sequence of these matrices. This property is in the same spirit as having symmetric zero/nonzero structures. Here, we need not impose those conditions. The following condition suffices to ensure the asymptotic convergence of the network to the leader's opinion.

Theorem 2.3.1. (One Leader, Arbitrary Graph) Consider system (2.2) and suppose that the weights on the leader satisfy

$$\sum_{t\ge 0}\ \min_{i\in\mathcal{V}} \alpha_i(t) = \infty. \qquad (2.5)$$

Then x(t) → T1 as t → ∞ for any initial opinion x(0).

Proof. The proof follows the method presented in [46]. Let x̃(t) = [x(t)^T  T]^T. Equation (2.3) is equivalent to

$$\tilde{\mathbf{x}}(t+1) = \tilde{W}(t)\tilde{\mathbf{x}}(t), \qquad \tilde{W}(t) := \begin{bmatrix} \Gamma(t)W & \boldsymbol{\alpha}(t) \\ \mathbf{0}^T & 1 \end{bmatrix}. \qquad (2.6)$$

Let h(t) = max_{1≤i,j≤N+1} (x̃_i(t) − x̃_j(t)). Obviously, h(t) ≥ 0 for all t ≥ 0. Now

$$h(t+1) = \max_{1\le i,j\le N+1}\bigl(\tilde{x}_i(t+1) - \tilde{x}_j(t+1)\bigr) = \max_{1\le i,j\le N+1}\ \sum_{1\le k\le N+1} (\tilde{w}_{ik} - \tilde{w}_{jk})\,\tilde{x}_k(t).$$

Denote h̃_ij(t) = Σ_{k=1}^{N+1} (w̃_ik − w̃_jk) x̃_k(t). Then

$$\tilde{h}_{ij}(t) = \sum_{k}\bigl(\tilde{w}_{ik} - \min(\tilde{w}_{ik},\tilde{w}_{jk})\bigr)\tilde{x}_k(t) - \sum_{k}\bigl(\tilde{w}_{jk} - \min(\tilde{w}_{ik},\tilde{w}_{jk})\bigr)\tilde{x}_k(t) \le \sum_{k}\bigl(\tilde{w}_{ik} - \min(\tilde{w}_{ik},\tilde{w}_{jk})\bigr)\max_l \tilde{x}_l(t) - \sum_{k}\bigl(\tilde{w}_{jk} - \min(\tilde{w}_{ik},\tilde{w}_{jk})\bigr)\min_l \tilde{x}_l(t).$$

Rearranging terms and using the fact that Σ_{k=1}^{N+1} w̃_ik = 1 for i = 1, ..., N+1 yields

$$\tilde{h}_{ij}(t) \le \Bigl(\max_l \tilde{x}_l(t) - \min_l \tilde{x}_l(t)\Bigr)\Bigl(1 - \sum_{1\le k\le N+1}\min(\tilde{w}_{ik}, \tilde{w}_{jk})\Bigr).$$

Therefore,

$$h(t+1) \le h(t)\Bigl(1 - \min_{1\le i,j\le N+1}\ \sum_{1\le k\le N+1}\min(\tilde{w}_{ik},\tilde{w}_{jk})\Bigr). \qquad (2.7)$$

Now, for any i, j ∈ V,

$$\sum_{1\le k\le N+1}\min(\tilde{w}_{ik},\tilde{w}_{jk}) \ \ge\ \min(\tilde{w}_{i,N+1}, \tilde{w}_{j,N+1}) = \min(\alpha_i(t), \alpha_j(t)) \ \ge\ \min_{k\in\mathcal{V}}\alpha_k(t). \qquad (2.8)$$

Define α̲(t) := min_{k∈V} α_k(t). It follows immediately from (2.7) that, for any t ≥ 0,

$$h(t+1) \le h(t)\bigl(1 - \underline{\alpha}(t)\bigr) \le h(t)e^{-\underline{\alpha}(t)} \le h(0)e^{-\sum_{s=0}^{t}\underline{\alpha}(s)}. \qquad (2.9)$$

The second inequality follows from the fact that 1 − z ≤ e^{−z} for all z ≥ 0. The assumption that Σ_{s=0}^∞ α̲(s) = ∞ implies that lim_{t→∞} h(t) = 0, hence x_i(t) → T for all i ∈ V.

Clearly, the result above holds for any structure of the network and weight matrix (even in the time-varying case), as it relies merely on condition (2.5), which means that every node in the network persistently trusts the leader (in the sense that Σ_{t≥0} α_i(t) = ∞ for all i ∈ V). Note that the notion of persistent graphs, that is, graphs consisting of links satisfying Σ_{t≥0} w_ij(t) = ∞, was also studied in [55]. However, to guarantee global agreement, [55] further requires that there exist a* > 0 and T* > 0 such that Σ_{s=t}^{t+T*−1} w_ij(s) ≥ a* for all t ≥ 0 and for all persistent links (i, j). This condition is stronger than the condition of being a persistent link. Therefore, the results in [55] cannot be applied in this case. One can notice that condition (2.5) is rather strong.
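The role of condition (2.5) can be illustrated numerically. In the following minimal sketch (the network and trust profiles are illustrative assumptions), the non-summable profile α_i(t) = 1/(t+1) drives the opinions to T, whereas the summable profile α_i(t) = 1/(t+1)^2 leaves a residual gap:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T_op = 5, 1.0
W = rng.random((N, N)); W /= W.sum(axis=1, keepdims=True)

def run(alpha_of_t, steps=20000):
    # Iterate (2.2) with a common trust level alpha_of_t(t) for every agent.
    x = rng.random(N)
    for t in range(steps):
        a = alpha_of_t(t)
        x = a * T_op + (1 - a) * (W @ x)
    return np.max(np.abs(x - T_op))

print(run(lambda t: 1.0 / (t + 1)))        # ~0: sum diverges, consensus to T
print(run(lambda t: 1.0 / (t + 1) ** 2))   # stays bounded away from 0
```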
In practice, there are many situations where ensuring this condition may be costly, since the leader needs to approach every agent in the network directly, an infinite number of times. This is usually not the most practical advertising strategy either. In fact, the leader should exploit the connections between the followers to advertise its opinion. Therefore, below we relax this condition by imposing requirements on the network structure.

Assumption 2.3.2. The graph G is fixed and strongly connected. The weight matrix W is fixed and has positive diagonal elements, that is, w_ii > 0 for all i ∈ V.

Theorem 2.3.3. (One Leader, Strongly Connected Graph, Fixed Weight) Consider system (2.2) and let Assumption 2.3.2 hold. Suppose the weights α_i satisfy

$$\sum_{t\ge 0}\ \max_{i\in\mathcal{V}} \alpha_i(t) = \infty. \qquad (2.10)$$

Then x(t) → T1 as t → ∞ for any x(0).

Before giving the proof, we make a few remarks. First, Theorems 2.3.3 and 2.3.1 show the importance of persistent links (including constant weights as a special case) in shaping the final opinion. Second, since the network size is finite, (2.10) is equivalent to the condition that at least one follower persistently trusts the leader, even if its trust level fades away. This condition holds for many plausible specifications of α_i, e.g., α_i(t) = c/(t+1)^γ for any c ∈ (0, 1] and γ ∈ [0, 1].

It is tempting to follow the proof of Theorem 2.3.1, which is primarily based on inequalities (2.7) and (2.8). However, under condition (2.10) this technique is no longer applicable. Consider, e.g., a connected undirected network with N = 5 and

$$W = \begin{bmatrix} w_{11} & w_{12} & 0 & 0 & 0 \\ w_{21} & w_{22} & w_{23} & 0 & 0 \\ 0 & w_{32} & w_{33} & w_{34} & 0 \\ 0 & 0 & w_{43} & w_{44} & w_{45} \\ 0 & 0 & 0 & w_{54} & w_{55} \end{bmatrix}.$$

Assume that α_i(t) = 0 for all i ∈ V \ {3} and all t ≥ 0, and that Σ_{t≥0} α_3(t) = ∞. It can be seen that

min_{1≤i,j≤N+1} Σ_{1≤k≤N+1} min(w̃_ik, w̃_jk) = 0.

Thus, from (2.7) we can only obtain that h(t+1) ≤ h(t) for all t ≥ 0, which is not enough to ensure the convergence of h(t) to 0.

Moreover, since α(t) is not restricted to belong to a finite set, the results in [5] cannot be used. Further, since neither W nor W̃(t) (see (2.6)) is required to have a symmetric zero/nonzero structure, the results in [57, 59] and [98] are not applicable. The following proof uses the results presented in [99] on the convergence of infinite products of substochastic matrices. Notice that [99, Theorem 6.2] requires that the smallest row sums of all the matrices be uniformly bounded away from, and below, 1. Here, we require milder conditions. The following results are needed to proceed with the proof of Theorem 2.3.3.

Lemma 2.3.4. [99] Let M_i ∈ R^{n×n}, i = 1, ..., m, be any m substochastic matrices. Then the product P = Π_{i=1}^m M_i is also a substochastic matrix.

Lemma 2.3.5. [99] Let M_i ∈ R^{n×n}, i = 1, ..., n−1, be irreducible substochastic matrices with positive diagonals. Then the product P = Π_{i=1}^{n−1} M_i is a strictly positive matrix, i.e., P_ij > 0 for all i, j. Further, let m = min{[M_k]_ij | i, j ∈ [1, n], k ∈ [1, n−1], [M_k]_ij > 0}; then min_{i,j} P_ij ≥ m^{n−1}.

For any matrix M, let r_i(M) := Σ_j M_ij, i.e., the i-th row sum of the matrix.

Lemma 2.3.6. Suppose that M_i ∈ R^{n×n}, i = 1, ..., m, satisfy min_i r_i(M_1) ≤ r_1 and max_i r_i(M_k) ≤ r̄_k for k = 2, ..., m. Then min_i r_i(M_1 ··· M_m) ≤ r_1 r̄_2 ··· r̄_m.

Proof. If min_i r_i(M_1) ≤ r_1 and max_i r_i(M_2) ≤ r̄_2, then r_i(M_1 M_2) = Σ_j [M_1]_ij r_j(M_2) ≤ Σ_j [M_1]_ij r̄_2 = r_i(M_1) r̄_2. Thus, min_i r_i(M_1 M_2) ≤ r_1 r̄_2. By induction, it holds that min_i r_i(M_1 ··· M_m) ≤ r_1 r̄_2 ··· r̄_m provided max_i r_i(M_k) ≤ r̄_k, k = 2, ..., m.

Lemma 2.3.7. Let U, V, D_1 and D_2 be nonnegative matrices with appropriate dimensions.
If D_1 ≤ D_2, then ‖U D_1 V‖_∞ ≤ ‖U D_2 V‖_∞.

Proof. Let U^{(i)} denote the i-th row of U and V_{(j)} the j-th column of V. Since U, V, D_1 and D_2 are nonnegative, it follows that 0 ≤ U^{(i)} D_1 V_{(j)} ≤ U^{(i)} D_2 V_{(j)} for all i, j. Thus 0 ≤ U D_1 V ≤ U D_2 V, and hence ‖U D_1 V‖_∞ ≤ ‖U D_2 V‖_∞.

The proof of Theorem 2.3.3 is presented next.

Proof. Defining ξ(t) := x(t) − T1, the update rule (2.3) can be expressed as follows:

ξ(t + 1) = A(t)ξ(t),  A(t) := Γ(t)W,  (2.11)

where Γ(t) = I − diag(α(t)). We need to show that lim_{t→∞} ξ(t) = 0 for any ξ(0), or equivalently,

$$\lim_{s\to\infty} \Bigl\|\prod_{0\le t\le s} A(t)\Bigr\|_\infty = 0, \qquad (2.12)$$

where Π_{0≤t≤s} A(t) := A(s) ··· A(0). Note that although A(t) is substochastic for all t ≥ 0 (hence ρ(A(t)) ≤ 1), this does not automatically imply that Π_{t≥0} A(t) = 0; this is true even if every A(t) is strictly substochastic. (For example, the sequence

$$a_i = \tfrac{1}{2}\underbrace{\sqrt{2+\sqrt{2+\cdots+\sqrt{2}}}}_{i\ \text{nested radicals}}$$

satisfies a_i ∈ (0, 1) for all i ≥ 1, but Π_{i=1}^∞ a_i = 2/π.)

Take any η ∈ (0, 1) and define

Ã(t) = Γ̃(t)W,  Γ̃(t) = I − diag(ηα(t)).

Obviously, Γ(t) ≤ Γ̃(t) for all t ≥ 0. Applying Lemma 2.3.7, we have

‖Π_{0≤t≤s} A(t)‖_∞ ≤ ‖Π_{0≤t≤s} Ã(t)‖_∞, ∀s ≥ 0.

Thus, the following condition suffices for (2.12):

$$\lim_{s\to\infty}\Bigl\|\prod_{0\le t\le s}\tilde{A}(t)\Bigr\|_\infty = 0. \qquad (2.13)$$

Next, define

$$B(s) := \prod_{sN_1\le t\le (s+1)N_1-1} \tilde{A}(t), \qquad N_1 := N - 1, \qquad (2.14)$$

and note the following.

(i) There exists b > 0 such that min_{i,j} B_ij(s) ≥ b for all s ≥ 0 and all i, j ∈ V. This can be shown as follows. Let

w = min{w_ij | i, j ∈ V, w_ij > 0}.  (2.15)

Then min{Ã_ij(t) | i, j ∈ V, Ã_ij(t) > 0} ≥ (1 − η)w. Note that Ã(t) is irreducible, since the network is strongly connected (by assumption). Thus, by Lemma 2.3.5, we have

min_{i,j} B_ij(s) ≥ ((1 − η)w)^{N_1} =: b.  (2.16)

(ii) max_i r_i(B(s)) ≤ 1, since B(s) is substochastic for any s (cf. Lemma 2.3.4).

(iii) min_i r_i(B(s)) ≤ 1 − δ((s+1)N_1 − 1), where δ(t) := η max_{i∈V} α_i(t) for all t ≥ 0. This can be obtained by using Lemma 2.3.6 with min_i r_i(Ã((s+1)N_1 − 1)) ≤ 1 − δ((s+1)N_1 − 1) and max_i r_i(Ã(t)) ≤ 1 for t = sN_1, ..., (s+1)N_1 − 2.

Now, let r_{j*}(B(s)) denote the smallest row sum of B(s). Using the above results yields, for all i ∈ V,

$$r_i\bigl(B(s{+}1)B(s)\bigr) = B_{ij^*}(s{+}1)\,r_{j^*}(B(s)) + \sum_{j\ne j^*} B_{ij}(s{+}1)\,r_j(B(s)) \le B_{ij^*}(s{+}1)\bigl[1-\delta((s{+}1)N_1-1)\bigr] + 1 - B_{ij^*}(s{+}1) \le 1 - b\,\delta((s{+}1)N_1-1) \le e^{-b\,\delta((s{+}1)N_1-1)},$$

where the three inequalities use (ii)-(iii), (ii), and (i), respectively, and the last step follows from the fact that 1 − z ≤ e^{−z} for all z ≥ 0. Thus,

$$\Bigl\|\prod_{0\le t\le (2m+2)N_1-1}\tilde{A}(t)\Bigr\|_\infty = \Bigl\|\prod_{0\le s\le m}B(2s{+}1)B(2s)\Bigr\|_\infty \le \prod_{0\le s\le m}\bigl\|B(2s{+}1)B(2s)\bigr\|_\infty \le e^{-b\sum_{s=0}^{m}\delta((2s+1)N_1-1)}. \qquad (2.17)$$

If Σ_{s=0}^∞ δ((2s+1)N_1 − 1) = ∞, then the right side of (2.17) decays to 0 as m → ∞, and (2.13) follows immediately. Therefore, it remains to show that this is also the case when (2.10) holds, or equivalently, when Σ_{t≥0} δ(t) = ∞. To this end, let

Δ_i^m := Σ_{0≤j≤m} δ(i + 2jN_1), ∀i ∈ [0, 2N_1 − 1],

and note that

$$\sum_{0\le t\le 2(m+1)N_1-1}\delta(t) = \sum_{0\le s\le 2N_1-1}\Delta_s^m. \qquad (2.18)$$

Now, let (2.10) hold. We claim that there must exist k ∈ [0, 2N_1 − 1] such that Δ_k^∞ = ∞. This can be shown by contradiction: if Δ_i^∞ < ∞ for all i ∈ [0, 2N_1 − 1], then Σ_{s=0}^{2N_1−1} Δ_s^∞ < ∞; taking the limit of both sides of (2.18) as m → ∞ then yields Σ_{t≥0} δ(t) < ∞, which contradicts (2.10). Thus the claim holds. Now, if k = N_1 − 1, i.e., Δ_{N_1−1}^∞ = ∞, then we obtain the desired result; otherwise we can redefine B(s) := Π_{t=sN_1+k}^{(s+1)N_1+k−1} Ã(t) and follow the same steps as above to show that Π_{t≥k} Ã(t) = 0 and thus Π_{t≥0} Ã(t) = 0. This completes the proof.
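As a quick sanity check of Lemma 2.3.5, which underlies step (i) of the proof above, the following sketch multiplies n − 1 irreducible substochastic matrices with positive diagonals (a hand-built example, not taken from the text) and verifies that the product is strictly positive with entries at least m^{n−1}:

```python
import numpy as np

n = 6

def ring_substochastic(n):
    # Directed cycle plus self-loops: irreducible, positive diagonal,
    # every row sum equal to 0.72 < 1, hence substochastic.
    M = np.zeros((n, n))
    for i in range(n):
        M[i, i] = 0.36
        M[i, (i + 1) % n] = 0.36
    return M

P = np.eye(n)
for _ in range(n - 1):               # product of n-1 such matrices
    P = P @ ring_substochastic(n)

m = 0.36                             # smallest positive entry of the factors
print((P > 0).all())                 # True: the product is strictly positive
print(P.min(), m ** (n - 1))         # P.min() attains the bound m^(n-1) here
```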
Note that we can take N_1 = d_0 in (2.14), where d_0 denotes the diameter of the graph G, and then repeat the above proof using a slight modification of Lemma 2.3.5 applied to Ã(t) = Γ̃(t)W. In fact, P = Π_{k≤i≤k+d_0−1} Ã(i) is strictly positive for any k. The proof of this is not much different from that of Lemma 2.3.5 and is thus skipped here. Note also that d_0 = N − 1 in the worst case. The above proof also allows us to estimate the ε-convergence time for some choices of α_i(·), as follows.

Corollary 2.3.8. (ε-Convergence Time) Let d_0 be the diameter of the graph G and let w be defined as in (2.15). Given any number ε > 0, it holds that ‖x(t) − T1‖_∞ ≤ ε‖x(0) − T1‖_∞ if t > 2d_0 m, where

(i) $m = \exp\bigl(\frac{2d_0}{(0.5w)^{d_0}}\log\varepsilon^{-1}\bigr)$ if max_i α_i(τ) = 1/(τ+1), or

(ii) $m = \frac{1}{\bar{\alpha}\,((1-\bar{\alpha})w)^{d_0}}\log\varepsilon^{-1}$ if max_i α_i(τ) = ᾱ ∈ (0, 1) for all τ ≥ 0.

Proof. The proof follows from (2.17) and the fact that Σ_{t=1}^k 1/t > ln(k+1).

The following result is a straightforward extension of Theorem 2.3.3 to the case of time-varying weight matrices.

Assumption 2.3.9. (Strong Weight) The weight matrix W(t) is row stochastic and satisfies

a) w_ii(t) ≥ γ for all i ∈ V, for some γ ∈ (0, 1);
b) w_ij(t) ∈ {0} ∪ [γ, 1) for all i, j ∈ V, i ≠ j.

Theorem 2.3.10. (One Leader, Strongly Connected Graph, Time-varying Weight) Consider system (2.2) and suppose that G(t) is strongly connected for all t ≥ 0 and that Assumption 2.3.9 holds. If (2.10) holds, then x(t) → T1 for any x(0).

Proof. The only difference between Theorems 2.3.10 and 2.3.3 is that here A(t) = Γ(t)W(t). However, under Assumption 2.3.9, we can choose w = γ and b = ((1−η)γ)^{N−1} in (2.15) and (2.16), respectively, and then follow the same steps as in the proof of Theorem 2.3.3.

Note that if there exists t* ∈ Z+ such that Π_{t=0}^{t*} A(t) = 0 (e.g., α(t*) = 1), then the network converges to T in at most t* + 1 time steps for any initial opinion x(0). Thus, condition (2.10) used in Theorems 2.3.3 and 2.3.10 is only sufficient, not necessary.

The strong connectivity requirement can be further relaxed to the existence of a spanning tree, provided that a root node trusts T persistently. In the following, we assume that node 1 is always a root node.

Theorem 2.3.11. (One Leader, Spanning Tree Graph, Time-varying Weight) Suppose that the graph G(t) is a directed spanning tree whose root is at node 1 for all t ≥ 0. Let Assumption 2.3.9 hold. Then x(t) → T1 for any x(0) if

$$\sum_{t\ge 0}\alpha_1(t) = \infty. \qquad (2.19)$$

The intuition for this result is as follows. Condition (2.19) means that there is an infinite information flow from the leader into node 1. Since node 1 is the root of the tree, this flow reaches every node of the network, and thus consensus can be achieved. The idea of the proof follows that of Theorem 2.3.3 with some modifications.

Proof. For simplicity, assume that α_i(t) = 0 for all t ≥ 0 and all i = 2, ..., N. The proof follows the same line as that of Theorem 2.3.3. Recall (2.11)-(2.13) and note that Ã(t) is a substochastic matrix for any t; specifically, r_1(Ã(t)) ≤ 1 and r_i(Ã(t)) = 1 for i = 2, ..., N. Let d_0 denote the diameter of the tree. It can be proved that if any matrices A_1, ..., A_{d_0} satisfy the conditions on Ã(t), then P := Π_{i=1}^{d_0} A_i satisfies

P_{i1} ≥ b = ((1−η)γ)^{d_0}, i = 1, ..., N.  (2.20)

Let B(s) = Π_{t=sd_0}^{(s+1)d_0−1} Ã(t). It follows that, for all i ∈ V,

$$r_i\bigl(B(s{+}1)B(s)\bigr) = B_{i1}(s{+}1)\,r_1(B(s)) + \sum_{2\le j\le N}B_{ij}(s{+}1)\,r_j(B(s)) \le B_{i1}(s{+}1)\bigl(1-\eta\alpha_1((s{+}1)d_0-1)\bigr) + 1 - B_{i1}(s{+}1) \le 1 - b\,\eta\,\alpha_1((s{+}1)d_0-1) \le e^{-b\eta\alpha_1((s+1)d_0-1)},$$

where the second-to-last inequality follows from (2.20).
The rest of the proof follows that of Theorem 2.3.3.

We can further relax the condition on the connectivity of the network by employing the notion of bounded connectivity times (see, e.g., [100]). Before stating this result, we need the following lemma.

Lemma 2.3.12. [5] Let m ≥ 2 be a positive integer and let A_1, A_2, ..., A_m ∈ R^{n×n} be nonnegative matrices with positive diagonal elements satisfying 0 < µ ≤ [A_i]_jj ≤ ρ for all i, j. Then

$$A_1A_2\cdots A_m \ \ge\ \Bigl(\frac{\mu^2}{2\rho}\Bigr)^{m-1}(A_1 + A_2 + \cdots + A_m).$$

As a consequence of this lemma, if the union of all graphs associated with the A_i is a spanning tree, then the graph associated with the product A_1 A_2 ··· A_m is also a spanning tree. For any integers t ≥ 0 and N_0 > 0, define G_{[N_0]}(t) = (V, ∪_{k=t}^{t+N_0−1} E(k)) as the union of a sequence of graphs over the interval [t, t+N_0). We state the following result.

Theorem 2.3.13. (One Leader, Periodically Spanning Tree Graph) Consider system (2.2) and let Assumption 2.3.9-a) hold. Suppose there exists N_0 > 0 such that for all t, the graph G_{[N_0]}(t) admits a spanning tree whose root is at node 1 and whose edges satisfy

$$\sum_{t\le k\le t+N_0-1} w_{ij}(k) \ \ge\ \gamma N_0. \qquad (2.21)$$

If (2.19) holds, then x(t) → T1 for any x(0).

Proof. (Sketch) Let G_t^{N_0} denote a spanning tree in the union graph G_{[N_0]}(t) whose root is at node 1 and whose edges satisfy condition (2.21). This condition implies that during any interval of length N_0, and for any (i, j) ∈ G_t^{N_0}, there exists at least one time t*_ij such that w_ij(t*_ij) ≥ γ. Denote by d_0 the maximum diameter of G_{[N_0]}(t) over all t. Since the self-weight w_ii of each agent is bounded away from 0 by γ, every node in the network is reachable from node 1 in at most d_0 N_0 steps. Thus, the first column of the matrix Π_{t=kd_0N_0}^{(k+1)d_0N_0−1} Ã(t) is positive and bounded away from 0 by the positive number b = ((1−η)γ)^{d_0N_0}. Therefore, one can follow the same steps as in the proof of Theorem 2.3.11 to conclude the result.

A closely related work is Proposition 3.3 in [55]. In that paper, the authors studied the problem of ε-agreement in persistent graphs without a leader. Here, we consider the presence of a leader (or a source of news) in the network. It can be seen that under the assumptions of Theorem 2.3.13, the link between the leader and agent 1, and those between the agents in the graph G_t^{N_0}, are persistent. However, one cannot use the result in [55] to prove Theorem 2.3.13, because that result assumes that there exist a* > 0 and T* > 0 such that Σ_{s=t}^{t+T*−1} w_ij(s) ≥ a* for all t ≥ 0 and all persistent links (i, j). This condition is indeed equivalent to (2.21). However, we do not require this condition on the connections between the leader and the followers.

2.4 Opinion Dynamics with Two Leaders

In this section, we investigate the case where there are two leaders (or two sources of news) with different opinions T and Q. Assume that all the nodes in the network can be influenced by the two leaders with trust levels α_i(t), β_i(t) ∈ [0, 1], ∀i ∈ V. The update rule is now given by

x(t + 1) = α(t)T + β(t)Q + Γ(t)W(t)x(t),  (2.22)

where α(t) = [α_1(t), ..., α_N(t)]^T, β(t) = [β_1(t), ..., β_N(t)]^T and Γ(t) = I − diag(α(t) + β(t)).

In general, when both T and Q are persistent and the weight matrix W is time-varying, network agreement need not be achieved, and the agents' opinions may not converge. Interesting results on opinion disagreement and fluctuation can be found in, e.g., [12, 13] and [20]. In the case where α(t) ≡ α, β(t) ≡ β, and W(t) ≡ W, the opinions converge to a fixed vector x_∞ satisfying (I − ΓW)x_∞ = αT + βQ.
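For the constant-parameter case just described, the limiting opinions can be computed directly by solving this linear system. The following minimal sketch does so for an illustrative network; all numerical values are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T_op, Q_op = 5, 0.0, 1.0
W = rng.random((N, N)); W /= W.sum(axis=1, keepdims=True)
alpha = np.array([0.3, 0.0, 0.0, 0.0, 0.0])   # only agent 1 listens to T
beta  = np.array([0.0, 0.0, 0.2, 0.0, 0.1])   # agents 3 and 5 listen to Q
Gamma = np.eye(N) - np.diag(alpha + beta)

# Fixed point of (2.22): (I - Gamma W) x_inf = alpha*T + beta*Q.
x_inf = np.linalg.solve(np.eye(N) - Gamma @ W, alpha * T_op + beta * Q_op)
print(x_inf)   # steady-state opinions lie strictly between T and Q
```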
In the following, we consider the case when only T is persistent.

Assumption 2.4.1. The weights α and β satisfy Σ_{t=0}^∞ max_{i∈V} α_i(t) = ∞ and Σ_{t=0}^∞ max_{i∈V} β_i(t) < ∞.

Note also that if there exists t_β ∈ Z+ such that β_i(t) = 0 for all i ∈ V and all t ≥ t_β, then we can immediately invoke the results of the previous section, since after time t_β there is only one persistent leader. In what follows, we allow the presence of leader Q to last for an infinite time.

Theorem 2.4.2. (Two Leaders, Strongly Connected Graph, Time-varying Weight) Consider system (2.22) with two leaders. Suppose that G(t) is strongly connected for all t ≥ 0 and that Assumptions 2.3.9 and 2.4.1 hold. Then x(t) → T1 for any x(0).

Proof. Again, let ξ(t) = x(t) − T1. Then system (2.22) becomes

ξ(t + 1) = Γ(t)W(t)ξ(t) + (Q − T)β(t).  (2.23)

Let A(t) = Γ(t)W(t) and u(t) = (Q − T)β(t). We note the following:

(i) From Theorem 2.3.10, it can be verified that the unforced system ξ(t+1) = A(t)ξ(t) with ξ(0) = x(0) − T1 is asymptotically stable. In fact, letting Φ(t, l) denote the transition matrix Φ(t, l) := A(t−1)A(t−2) ··· A(l), it follows that lim_{t→∞} Φ(t, l) = 0 for all l ∈ Z+.

(ii) By Assumption 2.4.1, u is absolutely summable and hence bounded.

We now show that lim_{t→∞} ξ(t) = 0. The solution to (2.23) is given by

$$\boldsymbol{\xi}(t) = \Phi(t,0)\boldsymbol{\xi}(0) + \sum_{0\le k\le t-1}\Phi(t,k+1)\,\mathbf{u}(k). \qquad (2.24)$$

Let δ > 0 be given. If ‖ξ(0)‖_∞ = 0, then ‖Φ(t, 0)‖_∞ ‖ξ(0)‖_∞ = 0 for all t. Otherwise, from fact (i) we have lim_{t→∞} Φ(t, 0) = 0. Thus, there exists N_1 ∈ Z+ such that

‖Φ(t, 0)‖_∞ ‖ξ(0)‖_∞ ≤ δ/3, ∀t ≥ N_1.  (2.25)

Next, from (ii) we have Σ_{t≥0} ‖u(t)‖_∞ ≤ |Q − T| Σ_{t≥0} ‖β(t)‖_∞ < ∞. Therefore, there exists N_2 ∈ Z+ such that N_2 ≥ N_1 and

Σ_{k≥t} ‖u(k)‖_∞ ≤ δ/3, ∀t ≥ N_2.  (2.26)

Let ū = sup_t ‖u(t)‖_∞. From (i), there exists N_3 ∈ Z+ sufficiently large such that

‖Φ(t, i)‖_∞ ≤ δ/(3ūN_2), ∀i ≤ N_2, ∀t ≥ N_2 + N_3.  (2.27)

Now, for any t ≥ N_2 + N_3, it follows from (2.24) that

‖ξ(t)‖_∞ ≤ ‖Φ(t, 0)‖_∞ ‖ξ(0)‖_∞ + Σ_{0≤k≤t−1} ‖Φ(t, k+1)‖_∞ ‖u(k)‖_∞.  (2.28)

Using (2.25), (2.26) and (2.27), and noting that ‖Φ(t, k+1)‖_∞ ≤ 1 for any k ≤ t−1, we have the following for all t ≥ N_2 + N_3:

$$\|\boldsymbol{\xi}(t)\|_\infty \le \frac{\delta}{3} + \sum_{k=0}^{N_2-1}\|\Phi(t,k{+}1)\|_\infty\|\mathbf{u}(k)\|_\infty + \sum_{k=N_2}^{t-1}\|\Phi(t,k{+}1)\|_\infty\|\mathbf{u}(k)\|_\infty \le \frac{\delta}{3} + \sum_{k=0}^{N_2-1}\frac{\delta}{3\bar{u}N_2}\,\bar{u} + \sum_{k=N_2}^{\infty}\|\mathbf{u}(k)\|_\infty \le \delta. \qquad (2.29)$$

Since δ > 0 is chosen arbitrarily, (2.29) proves that ξ(t) → 0 as t → ∞.

It should be noted that the result of Theorem 2.4.2 remains valid when there are two or more nonpersistent leaders in the network. This result implies that only persistent leaders matter. It is possible to relax the assumption on the network connectivity, but we do not pursue this direction further.

2.5 Conclusion and Extensions

This chapter revisited the agreement seeking problem in networks with leaders, which has received a fair amount of recent attention. We developed various new sufficient conditions for guaranteeing consensus to the persistent leader's opinion. We pointed out the important role of persistent connectivity between the leader and the other agents in the network. In the following, we discuss possible extensions of our results.

First, note that model (2.2) does not explicitly include any delay. However, the framework presented in this chapter can be extended to take information delays into account as follows. Consider a generalized version of (2.2), given by

$$x_i(t+1) = \alpha_i(t)T + (1-\alpha_i(t))\sum_{j\in\mathcal{V}} w_{ij}(t)\,x_j(t-\tau_{ij}(t)), \qquad \forall i \in \mathcal{V}, \qquad (2.30)$$

where the delay functions τ_ij are assumed to be uniformly bounded, i.e., there exists τ ≥ 1 such that τ_ij(t) ∈ [0, τ−1] for all (i, j) ∈ E.
The idea is to consider an extended network G_τ composed of the original graph G and τ − 1 copies of it, each copy being a one-step delayed version of the previous one. The state of G_τ is [x(t)^T, x_{−1}(t)^T, ..., x_{−(τ−1)}(t)^T]^T, where x_{−k}(t) = x(t − k). Note that if G(t) is strongly connected for any t, then every node i ∈ V is a root of a spanning tree in the union graph (G_τ)_{[τ]}(t). Thus, Theorem 2.3.11 can be applied to this union graph to establish consensus reachability of the original network G. In this connection, we can conclude that consensus to the leader is also robust to bounded delays.

Second, we can also consider the case where the leader's state is time-varying:

$$T(t+1) = T(t) + u(t), \qquad x_i(t+1) = \alpha_i(t)T(t) + (1-\alpha_i(t))\sum_{j\in\mathcal{V}} w_{ij}(t)x_j(t), \quad \forall i \in \mathcal{V}.$$

Define the tracking error ξ(t) := x(t) − T(t)1. Then

ξ(t + 1) = Γ(t)W(t)ξ(t) − 1u(t),  (2.31)

which is a linear time-varying system with input u. Note that the unforced system ξ(t+1) = Γ(t)W(t)ξ(t) is asymptotically stable under suitable conditions as in Section 2.3. Therefore, we can invoke stability results for linear time-varying systems to derive consensus conditions. One such result is the following, whose proof follows the same lines as that of Theorem 2.4.2 in Section 2.4 and is thus omitted.

Proposition 2.5.1. Consider system (2.31) and let Assumption 2.3.2 (or 2.3.9) hold. If Σ_{t=0}^∞ |u(t)| < ∞, then consensus is achieved, i.e., lim_{t→∞} |x(t) − T(t)1| = 0 for any x(0).

Remark 2.5.2. Note that an equivalent characterization of consensus is that the size of the convex hull of the states of all the agents (including the leader) must shrink to 0 in the limit. In [8], the author imposes the condition of strict convex hull shrinking. The above result shows that the leader need not move into the convex hull of the states of the regular agents at any time step in order for consensus to be achieved. Also, the convex hull need not shrink monotonically. This result could also give a hint toward reducing the gap between necessary and sufficient conditions for consensus.

Chapter 3: Optimizing Leader Influence in Networks through Selection of Direct Followers

Abstract: This chapter considers the problem of a leader that aims to influence the opinions of agents in a directed network through connecting with a limited number of the agents. The aim is to select this set of agents, referred to as direct followers, so as to achieve the greatest possible influence on the opinions of agents throughout the network. Direct followers are simply agents that the leader decides to connect to, and the influence then spreads through the network's natural inter-agent dynamics. The problem of optimally influencing a network in the presence of another leader with a competing opinion is also considered. The problems with a single leader and in the presence of a competing leader are unified into a general combinatorial optimization problem, for which two heuristic approaches are developed. The first approach is based on a convex relaxation scheme, possibly in combination with the ℓ1-norm regularization technique, and the second is based on a greedy selection strategy. The main technical novelties of this work are the establishment of supermodularity of the objective function and convexity of its continuous relaxation.
As a result, the greedy approach is guaranteed to yield a lower bound on the approximation ratio that is sharper than 1 − 1/e, while the convex approach can benefit from efficient (customized) numerical solvers to obtain solutions of practically comparable quality, possibly with faster computation times, especially for large networks. The two approaches can be combined to provide effective tools and better analysis for the optimal design of influence spreading in diffusive networks. Numerical examples are given to illustrate the usefulness of the approaches. In these examples, the approximation ratio can be made to reach 90% or higher, depending on the number of direct followers.

3.1 Introduction

The notion of a leader is introduced in many settings to represent a special agent who has the ability to affect the states (or opinions) of other regular agents, conventionally called followers, while its own state is uninfluenced by others (in this sense, a leader is also termed a stubborn agent elsewhere). A great number of recent efforts have been devoted to the consensus problem in networks with leaders (see, e.g., [5, 13, 18, 20, 101, 102] and references therein). In most cases, a leader is assumed to have a limited number of connections with other agents, due to restrictions on, e.g., the budget, communication power, or channels of the leader. This gives rise to the problem of the leader choosing whom to influence directly so that the overall network performs as well as possible (in some sense) under the restriction on the leader's connectivity.

This chapter deals with problems related to a leader selecting a limited number of agents with which to communicate in a directed network. The aim of the leader is to achieve maximum influence on the opinions of agents throughout the network. The network agents that the leader selects to communicate with are referred to as direct followers. Network agents all update their opinions dynamically based on their current opinions and on the opinions of their immediate neighbors. Thus, through its connections with the direct followers and the inherent network dynamics, the leader influences the opinions of agents throughout the network. The leader wishes to select the limited group of direct followers so as to maximize its influence on the network, in the sense that the opinions of agents throughout the network approach the opinion of the leader either as rapidly as possible over time or as closely as possible in the limit. In particular, we consider the following two problems:

• Problem (P1): Optimize the influence of a leader on the agents in a directed network whose opinion dynamics follow the well known DeGroot model. Here, the leader's goal is to select a limited number of direct followers to connect to, in order to influence all the agents to converge to its constant opinion as quickly as possible.

• Problem (P2): Optimize the influence of one leader in the presence of another leader (with a competing opinion) over a directed network of followers, under a connectivity constraint similar to that in (P1). Here, the influence of a leader is measured in terms of the distance between the leader's opinion and a weighted average of the steady state opinions of all the network agents.

We unify the two problems above into a more general combinatorial optimization problem, called (P), and develop two heuristics for approximately solving problem (P), namely:

• Convex relaxation, which can be treated effectively by available numerical algorithms and solvers.
Here, the convexity result is novel.

• Greedy algorithms, which can be carried out in polynomial time. Here, the supermodularity result is new and can be used to provide provable accuracy guarantees for the greedy solutions.

This chapter is related to a large body of literature on problems of leader selection, stubborn agent placement, and sensor selection (see, e.g., [20, 61-67, 103, 104] and references therein) but departs from this literature in many respects. First, we only ask that the underlying network be directed. Second, we allow selected direct follower nodes to follow inter-agent dynamics like any other agents, rather than forcing them to adopt the leader's opinion instantaneously. Third, we allow the agents in the network to have different initial opinions (which are taken into account explicitly in the context of problem (P1)), and the agents can be weighted differently by the leader. Finally, and most importantly, although continuous relaxation and greedy heuristics have been employed in dealing with influence maximization problems before, our theoretical results on convexity and supermodularity are considerably stronger than existing results, without assuming any symmetry or resorting to random walk theory. This not only provides a deeper understanding of diffusive processes but also can be used for a broad range of applications. More detailed comparisons are given in Section 3.2 after our problem formulations.

Other by-products of our analysis include: (i) a dynamic centrality measure in the context of problem (P1) (i.e., one in which the measure of effectiveness of the set of chosen agents can vary with time); (ii) a straightforward application to Friedkin's model [3] (where each agent is allowed to have stubbornness in retaining its initial opinion) in the context of problem (P2); (iii) an affirmative answer to a conjecture recently proposed in [105] on optimization of on-chip thermoelectric cooling systems; and (iv) a convexity result for the state trajectory of a class of bilinear discrete-time systems.

The remainder of the chapter proceeds as follows. In Section 3.2, we introduce our network models and the associated optimization problems of interest; related works are also reviewed. Our main results are given in Sections 3.3, 3.4 and 3.5. In Section 3.3, we provide exact solutions to problems (P1) and (P2) for the case of selecting one or two agents. The general case of selecting multiple agents is treated in Sections 3.4 and 3.5. Specifically, in Section 3.4, we establish the convexity of the relaxed and approximate problems and discuss associated numerical issues in applying convex solvers to these problems. In Section 3.5, we prove the supermodularity property of the original objective functions and present two greedy algorithms that admit provable approximation ratios. Next, simulation results are reported in Section 3.6 for two example networks, one of small size and the other much larger. Finally, further (convexity) results and applications to another opinion dynamics model are discussed in Section 3.7.

3.2 Problem Formulation and Related Works

This section proceeds as follows. In Subsection 3.2.1, we augment the DeGroot model with a single leader and formulate an associated problem of optimizing the leader's influence on the opinion dynamics of the network agents. Finally, in Subsection 3.2.2, we consider a model similar to that of Subsection 3.2.1, except that we further include another leader with a differing (constant) opinion.
Here, the influence of a leader is defined differently, but the optimization problem shares the same structure as in the previous setting.

Consider a leaderless network with N agents denoted by V = {1, 2, ..., N}. The underlying communication is characterized by a directed graph G = (V, E). The dynamics of each agent are described by the DeGroot model (1.4), which is repeated here for convenience. Let x_i(t) ∈ [0, 1] denote the state or opinion of node i at time t ∈ N_0; at the start, each agent has an initial state x_{0i} ∈ [0, 1]. At any time t > 0, each agent observes the opinions of its neighbors and updates its opinion as

$$x_i(t+1) = \sum_{j\in\mathcal{N}_i} w_{ij}x_j(t), \qquad x_i(0) = x_{0i}, \quad \forall i \in \mathcal{V}, \qquad (3.1)$$

where, recall, N_i denotes the set of node i's immediate neighbors (including itself) and W := [w_ij] ∈ R^{N×N} denotes the normalized weight matrix of the network. We make the following blanket assumption, which is simply the combination of Assumptions 1.5.2 and 1.5.3, presented here for convenience.

Assumption 3.2.1. (Network Connectivity and Weight Matrix) The graph G is fixed in time and strongly connected. The weight matrix W is fixed, row stochastic, and satisfies w_ij > 0 for (i, j) ∈ E, i ≠ j, and w_ij = 0 otherwise. Moreover, W has at least one positive diagonal element.

3.2.1 Formulation of Influence Optimization Problem for the Single Leader Case

Given a directed network G = (V, E) with dynamics as described above, we now consider the effect of an external leader, denoted by T ∉ V, seeking to connect to the network. The leader is assumed to have a constant opinion T ∈ [0, 1]. The relationship of the leader to the network G is as follows:

• For any agent i ∈ V, the weight α_i ∈ [0, ∞] that it would place on the leader's opinion T if the leader selects to connect to the agent is known. (In this chapter, we allow the weight to be ∞.) The connection would of course be directed from the leader to the network agent. (The reverse direction, from regular agents to the leader, would be pointless, as the leader's opinion is assumed fixed and cannot be influenced.) We refer to α := [α_1, ..., α_N]^T as the vector of potential trust of network agents in the leader, or simply the trust vector.

• The leader knows α but is only able to connect directly to up to K agents in the network G. The K agents that the leader elects to connect to are called direct followers and are cumulatively denoted in the sequel by the set K. Unless otherwise stated, the connection is established at time t = 0 and the set K remains fixed thereafter.

Note that α_i = 0 indicates lack of trust, or that agent i is not accessible to the leader, and α_i = ∞ (or α_i ≫ 1 > w_ij, ∀j ∈ N_i) indicates the highest possible level of trust of agent i in the leader's opinion. Without loss of generality, we make the following assumption, which means that the leader only connects to followers having nonzero trust levels. (Clearly, it would be pointless for the leader to connect to an agent that would place zero trust in its opinion.)

Assumption 3.2.2. (Positive Trust Selection) The set K satisfies K ≠ ∅ and K ⊆ V_α := {i ∈ V : α_i > 0}.

For each K, let the corresponding selection vector s_K be defined by [s_K]_i := χ_K(i), ∀i ∈ V. (Recall that χ_A is the indicator function of a set A.) Then the update rule (3.1) for agent i in the presence of the leader becomes

$$x_i(t+1) = \frac{[\mathbf{s}_\mathcal{K}]_i\alpha_i T + \sum_{j\in\mathcal{N}_i} w_{ij}x_j(t)}{[\mathbf{s}_\mathcal{K}]_i\alpha_i + 1}. \qquad (3.2)$$

Here, it is understood that x_i(t+1) = T if [s_K]_i = 1 and α_i = ∞.
In vector form, (3.2) can be expressed as

x(t + 1) = (I + diag(α_K))^{−1}(α_K T + W x(t)),  (3.3)

where α_K := s_K ∘ α. Here ∘ denotes the element-wise product (also known as the Hadamard product). The following result is well known (see, e.g., [5, 20, 55]).

Theorem 3.2.3. (Consensus to Leader's Opinion) Let Assumptions 3.2.1 (Network Connectivity and Weight Matrix) and 3.2.2 (Positive Trust Selection) hold. Then for any x(0) ∈ R^N, all network agents asymptotically achieve consensus at the leader's opinion, i.e., lim_{t→∞} x_i(t) = T, ∀i ∈ V. Moreover, the rate of convergence is exponential.

This theorem asserts that all network agents will adopt the leader's opinion asymptotically, regardless of their initial opinions. Note that asymptotic convergence can be ensured under conditions milder than Assumptions 3.2.1 and 3.2.2 (see, e.g., [5, 96] and Chapter 2). Although the initial opinion x(0) and the selection of the set K play no role in the final consensus value, which is the leader's state (as long as α_K ≠ 0), they clearly affect the manner in which the agents approach this agreement, i.e., the transient behavior of system (3.2). Thus, we turn our attention to the problem of choosing K direct followers so as to minimize the transient error and convergence time of the agents' opinions in the network. To capture this dynamic behavior, we consider the error vector ξ(t) := x(t) − T1, which obeys the dynamics

ξ(t + 1) = (I + diag(α_K))^{−1} W ξ(t).  (3.4)

Thus, for all t ≥ 0, ξ(t) is given by

ξ(t) = ((I + diag(α_K))^{−1} W)^t ξ_0.  (3.5)

Consensus regardless of initial condition is clearly equivalent to global asymptotic stability of the origin for (3.4), and since the system is linear and time-invariant, consensus is also equivalent to global exponential stability. Let L be the weighted Laplacian matrix given by

L := I − W.  (3.6)

We have the following facts on the spectrum of the state dynamics matrix in (3.4) and the spectrum of the weighted Laplacian matrix.

Lemma 3.2.4. (Spectrum) If Assumptions 3.2.1 and 3.2.2 hold, then

(i) ρ((I + diag(α_K))^{−1} W) < 1, and

(ii) ℜ(λ) > 0 for all λ ∈ σ(L + diag(α_K)).

Proof. It is well known that if A is an irreducible row substochastic matrix with at least one row sum less than one, then ρ(A) < 1 (see, e.g., [97, Thm 1.1, p. 24]). Using this result with A = (I + diag(α_K))^{−1}W yields part (i). Part (ii) follows immediately from an application of the Gershgorin Circle Theorem [95, p. 344] and [95, Cor. 6.2.9, p. 356], using the strong connectivity of G and noticing that at least one diagonal entry of L + diag(α_K) is shifted to the right compared with the corresponding entry of L.

Remark 3.2.5. Assertion (i) of the lemma is in fact equivalent to the result in Theorem 3.2.3 above (the constant linear system is exponentially stable). Part (ii) will be needed in defining our objective costs in the sequel.

Next, define ‖ξ_i‖_{l1} := Σ_{t=1}^∞ |ξ_i(t)| (which is well defined because of the exponential convergence of ξ) and consider the cumulative convergence error defined as

$$J^{\mathrm{total}}_{\mathcal{K}} = \sum_{i\in\mathcal{V}} b_i\|\xi_i\|_{l_1},$$

where b = [b_1, ..., b_N]^T ∈ R_+^N is a weight vector chosen by the leader, which we require to satisfy 1^T b = 1. The elements of the vector b are measures of the relative preferences that the leader places on the opinions of the network agents. Note that we do not include ξ_i(0) in ‖ξ_i‖_{l1}, since ξ_i(0) does not depend on the leader's selection of direct followers. We say that the selection K_1 is better than K_2 if J^total_{K_1} < J^total_{K_2}.
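Although the analysis below works with a closed-form upper bound, J^total_K itself can be estimated by direct simulation of (3.3), as in the following minimal sketch; the network data, trust vector, and candidate sets here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
N, T_op = 6, 1.0
W = rng.random((N, N)); W /= W.sum(axis=1, keepdims=True)
alpha = rng.uniform(0.5, 2.0, N)          # potential trust vector
b = np.full(N, 1.0 / N)                   # uniform preference vector
x0 = rng.random(N)

def J_total(K, steps=2000):
    # Truncated estimate of sum_{t>=1} b^T |x(t) - T*1| under selection K.
    aK = np.zeros(N); aK[list(K)] = alpha[list(K)]
    M = np.diag(1.0 / (1.0 + aK))          # (I + diag(alpha_K))^{-1}
    x = x0.copy(); cost = 0.0
    for t in range(steps):
        x = M @ (aK * T_op + W @ x)        # update (3.3)
        cost += b @ np.abs(x - T_op)
    return cost

print(J_total({0, 1}), J_total({2, 5}))   # the smaller, the better selection
```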
Roughly speaking, the smaller J^total_K is, the smaller the convergence time, i.e., the faster consensus is achieved. However, since computing J^total_K is nontrivial, we will work with an upper bound J^{(1)}_K obtained as follows:

$$J^{\mathrm{total}}_{\mathcal{K}} = \sum_{t\ge 1}\mathbf{b}^T|\boldsymbol{\xi}(t)| \overset{(3.5)}{=} \mathbf{b}^T\sum_{t\ge 0}\Bigl|\bigl((I+\mathrm{diag}(\boldsymbol{\alpha}_\mathcal{K}))^{-1}W\bigr)^{t}\boldsymbol{\xi}(1)\Bigr| \le \mathbf{b}^T\sum_{t\ge 0}\bigl((I+\mathrm{diag}(\boldsymbol{\alpha}_\mathcal{K}))^{-1}W\bigr)^{t}|\boldsymbol{\xi}(1)|$$
$$= \mathbf{b}^T\bigl(I-(I+\mathrm{diag}(\boldsymbol{\alpha}_\mathcal{K}))^{-1}W\bigr)^{-1}|\boldsymbol{\xi}(1)| = \mathbf{b}^T\bigl(\mathrm{diag}(W\mathbf{1}+\boldsymbol{\alpha}_\mathcal{K})-W\bigr)^{-1}\mathrm{diag}(W\mathbf{1}+\boldsymbol{\alpha}_\mathcal{K})\,|\boldsymbol{\xi}(1)| \le \mathbf{b}^T(L+\mathrm{diag}(\boldsymbol{\alpha}_\mathcal{K}))^{-1}|W\boldsymbol{\xi}_0| =: J^{(1)}_\mathcal{K}. \qquad (3.7)$$

Here the summability of the series uses Lemma 3.2.4, and the last inequality holds since, first, the inverse (L + diag(α_K))^{−1} exists by Lemma 3.2.4, part (ii), and, second, |ξ(1)| ≤ (I + diag(α_K))^{−1}|Wξ_0|. It can be verified that equality holds if either ξ_0 ≥ 0 or ξ_0 ≤ 0, i.e., if the leader's opinion T lies outside the convex hull of the agents' initial opinions {x_i(0), i ∈ V}. Therefore, J^{(1)}_K is a tight upper bound on J^total_K. The more influential the direct followers are, the smaller J^{(1)}_K is, and thus the faster consensus can be reached. Formally, in this work we consider the following problem:

(P1)  min_{K⊆V} J^{(1)}_K = b^T (L + diag(α_K))^{−1} |Wξ_0|  s.t. |K| ≤ K.  (3.8)

Remark 3.2.6. The objective function J^{(1)}_K is defined in such a way that it allows the leader T to (i) weight each agent in the network differently through the weight or preference vector b, (ii) take into account partial incentives or trust encoded in the vector α, and (iii) incorporate the role of the initial opinions of all the agents in the network. As a consequence of (iii), the leader may view J^{(1)}_K as the cost-to-go at the initial time, when the set of direct followers is first chosen, and define the cost at any time t as

J^{(1)}_K(t) = b^T (L + diag(α_K))^{−1} |Wξ(t)|.

With this time-dependent objective cost, one can imagine a policy that achieves improved performance by re-solving a similar optimization problem at regular intervals for new sets of direct followers. (This would entail having limited-term contracts with the direct followers selected at any time.) This is akin to a model predictive control strategy with infinite-horizon cost-to-go J^{(1)}_K(t) and with the control action being the sequence of sets of direct followers.

Remark 3.2.7. (Dynamic centrality measure for the degree of influence of a set of direct followers) Note that the reciprocal of J^{(1)}_K, denoted by C_K := 1/J^{(1)}_K, can be viewed as a measure of the effectiveness of the set K in spreading the leader's opinion. This can also be viewed in terms of the relative influence of the choice of one set of agents K versus another, or as a centrality measure of a set K of direct followers. A set K is more influential than K′ if C_K > C_{K′}. What is new about our centrality measure is that C_K can be taken as a dynamic centrality measure through the definition C_K(t) := 1/J^{(1)}_K(t), rather than as a fixed quantity, as are many existing centrality measures in the literature.

3.2.2 Formulation of Influence Optimization Problem in the Presence of a Competing Leader

Now we consider a model similar to the one above, except that there are two leaders with different opinions T and Q trying to influence the opinions of agents in network G. Let K, L ⊆ V denote the sets of nodes that are directly connected to T and Q, respectively. Each node in the network has potential trust levels α_i, β_i ∈ [0, ∞], ∀i ∈ V (we exclude the case where α_i = β_i = ∞ for some i ∈ V), and updates its opinion as follows:

$$x_i(t+1) = \frac{[\mathbf{s}_\mathcal{K}]_i\alpha_i T + [\mathbf{s}_\mathcal{L}]_i\beta_i Q + \sum_{j\in\mathcal{N}_i} w_{ij}x_j(t)}{[\mathbf{s}_\mathcal{K}]_i\alpha_i + [\mathbf{s}_\mathcal{L}]_i\beta_i + 1}, \qquad (3.9)$$

where s_K and s_L denote the selection vectors of T and Q, respectively.
In matrix form, (3.9) reads

x(t + 1) = (I + diag(α_K + β_L))^{−1}(α_K T + β_L Q + W x(t)),

where α_K := s_K ∘ α and β_L := s_L ∘ β. In our context, α and β are associated with the agents in the network and are assumed to be fixed over time. For given choices of K and L, the network G need not (and usually does not) reach consensus, even under a strong connectivity assumption. In fact, the opinions converge to a fixed vector x(∞) which depends only on α, β, and W, but not on x(0) (see (3.10) below). Thus, in this section we will measure the influence of each leader by examining the limiting opinion x(∞).

As we are interested in designing a competition strategy for one leader (T) in the presence of another (Q), without loss of generality suppose β ≠ 0 and s_L = 1 (i.e., the set of direct followers of Q is known to T). If α_K = 0, it is clear that x_i(∞) = Q, ∀i ∈ V, under the strong connectivity assumption on G, i.e., the whole network will eventually be out of favor with leader T. Therefore, we only consider α_K ≠ 0. Further, we assume that

card(α_K) ≤ K < card(α),

where K represents the maximum number of connections that T is allowed to establish, accounting for limited communication and/or budget. We are interested in the following problem: given knowledge of α, β, W and of the largest allowed number of connections K, to which nodes should leader T directly connect in order to achieve the greatest possible influence (in a sense made precise below) on the eventual opinions of the network agents?

Note that the limiting opinion vector x(∞) satisfies

x(∞) = (I + diag(α_K + β))^{−1}(βQ + α_K T + W x(∞)).

Thus,

x(∞) = (L_β + diag(α_K))^{−1}(βQ + α_K T),  (3.10)

where L_β := L + diag(β) = I + diag(β) − W, which is nonsingular under the strong connectivity assumption and the condition that α_K ≠ 0 and β ≠ 0 (cf. Lemma 3.2.4-ii). We are interested in the steady state error vector ξ(∞) := x(∞) − T1. Since (L_β + diag(α_K))^{−1}(β + α_K) = 1, it can be verified that

ξ(∞) = (L_β + diag(α_K))^{−1}β(Q − T).

To quantify the long term effect of T in the presence of Q, we define the following function operating on the set K:

J^{(2)}_K := b^T |ξ(∞)|,

where b ≥ 0 is a weight or preference vector indicating the relative importance to the leader T of the final opinion of each agent in the network. Since (L_β + diag(α_K)) is a nonsingular M-matrix (cf. Lemma 3.2.4-ii), it follows that (L_β + diag(α_K))^{−1} is a nonnegative matrix (see, e.g., Lemma A.1.3 in Appendix A.1). Thus,

J^{(2)}_K = b^T (L_β + diag(α_K))^{−1} β |Q − T|.

Without loss of generality, let T = 0 and Q = 1 represent the two competing opinions. We are interested in the following problem: given α, β, b, W and an integer K > 0, select K such that |K| ≤ K and the effect of T is maximized, i.e.,

(P2)  min_{K⊆V} J^{(2)}_K = b^T (L_β + diag(α_K))^{−1} β  s.t. |K| ≤ K.  (3.11)

This is a link creation problem (namely, selection of K) in which partial incentives are allowed (i.e., α, β ∈ [0, ∞]^N) and each agent can be weighted differently (through b). In the limiting case when the α_i, β_i are all either 0 or ∞, this problem reduces to the optimal stubborn agent placement and leader selection problems previously studied in the literature, which we recall below. First, we give a general problem formulation that covers the cases without and with a competing leader.

Remark 3.2.8. (A unified problem formulation) Except for some minor differences, problems (P1) and (P2) described in (3.8) and (3.11) are almost the same. Our aim is thus to develop methods that can be applied to both.
To this end, we embed these two problems in the following general one:

(P)  min_{K⊆V} J_K = b^T (L_β + diag(α_K))^{−1} c  s.t. |K| ≤ K,  (3.12)

where b, c and β are nonnegative vectors. The optimal value will be denoted by J*.

3.2.3 Comparison to Previous Work

3.2.3.1 Single leader case

The following model is widely used in the literature (see, e.g., [15, 62, 67, 106, 107]):

$$x_i(t+1) = \begin{cases} \tilde{\alpha}_i T + (1-\tilde{\alpha}_i)\sum_{j\in\mathcal{V}} w_{ij}x_j(t), & i \in \mathcal{K}, \\ \sum_{j\in\mathcal{V}} w_{ij}x_j(t), & i \in \mathcal{V}\setminus\mathcal{K}, \end{cases} \qquad (3.13)$$

which is equivalent to the one described in (3.2) with

$$\tilde{\alpha}_i = \frac{\alpha_i}{\alpha_i + 1}. \qquad (3.14)$$

Based on this model, the works [62, 67, 106] consider the following associated problem:

$$\min_{\mathcal{K}\subseteq\mathcal{V},\,|\mathcal{K}|\le K}\ \tilde{f}(\mathcal{K}) := \mathbf{1}^T(I - D_\mathcal{K}W)^{-1}\mathbf{1}, \qquad (3.15)$$

where D_K = I − diag(α̃_K), α̃ = 1, and f̃(K) represents the cumulative error over time of all the agents. Note that f̃(K) in (3.15) clearly corresponds to a special case of J^{(1)}_K with b = 1 and ξ_0 = 1. Thus, one may wonder why we use model (3.2) instead of (3.13). The main reason is that the former model allows us to obtain a much stronger convexity result than the latter. This is also one of the main contributions of our work.

To deal with problem (3.15), [62] uses a continuous relaxation of f̃ together with the ℓ1-norm regularization technique and proves element-wise convexity of the resulting objective function. This allows the authors to employ the coordinate descent approach. However, it is important to point out that the relaxed problem formulated in [62] is not necessarily convex; see Remark 3.4.1 below for an example. In [67], the supermodularity property of f̃(K) in (3.15) is proved, and a greedy heuristic [108] is used to yield approximate solutions with provable accuracy.

In [65], the authors use a continuous-time version of the DeGroot model and consider the problem of selecting a set of nodes to become leaders (instantaneously) so as to minimize the convergence error, defined as the l_p-norm of the distance between the followers' states and the convex hull of the leader states. By replacing the convergence error with an upper bound that is independent of the initial states of the network (and is loose in general), [65] proves the supermodularity property of the resulting bound based on a connection with random walk theory, and then employs the greedy approach in [108]. Kempe et al. [60] also formulate the problem of finding the influential nodes in a network as a discrete optimization problem with a submodular cost function and apply the greedy algorithm to obtain a (1 − 1/e)-approximate solution. However, the diffusion model in [60], called Independent Cascade, is fundamentally different from the opinion model considered here.

3.2.3.2 Multiple leaders case

In [64], the authors consider a linear stochastic model whose mean behavior is equivalent to the following deterministic model:

$$x_i(t) = \begin{cases} 0, & i \in \mathcal{V}_0, \\ 1, & i \in \mathcal{V}_1, \\ \sum_{j\in\mathcal{V}} w_{ij}x_j(t-1), & \text{otherwise}, \end{cases} \qquad (3.16)$$

where V_0, V_1 ⊂ V are two disjoint sets of stubborn agents. This model is a limiting case of (3.9) with α_i, β_i ∈ {0, ∞} (i.e., an agent becomes stubborn if directly connected to a leader). The optimal stubborn agent placement problem studied in [64] is defined as follows: for a given set V_0 with known locations in the network, choose K nodes from V \ V_0 to form the set V_1 so that the network bias toward V_1 in the limit is maximized, i.e.,

$$\max_{\mathcal{V}_1\subset\mathcal{V}}\ \Bigl\{\sum_{i\in\mathcal{V}} x_i(\infty) : |\mathcal{V}_1| = K,\ \mathcal{V}_0\ \text{fixed}\Bigr\}.$$

This problem is in fact similar to a special case of (3.11) with b = 1 and α_i, β_i ∈ {0, ∞}.
The authors prove submodularity of the objective function based on a connection with a random walk and then use the greedy algorithm [108] to (approximately) solve the problem. The work [66] considers a model similar to that in [64] and defines a measure of node centrality for a given set V_0 as

H(l) = Σ_{i∈V} x_i(∞ | V_1 = {l}).

The authors introduce a distributed message passing algorithm that enables each node l ∈ V \ V_0 to compute its own H(l). One of our optimality criteria is also able to subsume this centrality measure as a special case. More importantly, it is considered in a more general setting, and practical (centralized) algorithms are developed for the benefit of network designers and market competitors.

In [63], the following model, proposed by Friedkin and Johnsen [3], is considered:

$$x_i(t) = (1-\sigma_i)\sum_{j\in\mathcal{V}} w_{ij}x_j(t-1) + \sigma_i x_i(0).$$

Here σ_i ∈ [0, 1] reflects the level of stubbornness of each agent i ∈ V regarding its initial opinion. The paper deals with the problem of selecting K nodes so that if they become fully stubborn and their opinions are set to 1, then the limiting opinions of all the agents are, on average, as positive as possible, i.e.,

$$\max_{\mathcal{V}_1\subset\mathcal{V}}\ \Bigl\{\sum_{i\in\mathcal{V}} x_i(\infty) : |\mathcal{V}_1| = K,\ x_i(t) = 1, \forall t \ge 0, \forall i \in \mathcal{V}_1\Bigr\}.$$

The authors exploit a connection between this model and absorbing random walks to establish the submodularity of the cost function, and then rely on the greedy algorithm in [108] to approximate the optimal solution within the factor (1 − 1/e).

3.2.3.3 Our Contributions

Our work greatly generalizes and differs from the aforementioned works in both problem formulation and solution.

Regarding the problem formulation, it should be noted that our direct followers can have dynamics like any other network node, unlike the forceful/stubborn agents in those papers. This can be viewed in terms of the trust levels of the direct followers with respect to the leader's state being arbitrary in our work. Moreover, within the context of problem (P1), the agents' initial opinions need not be the same and are taken into account explicitly in the cost J^{(1)}_K, which is a tight upper bound on the cumulative error of all the agents over time. This also allows us to consider a time-varying objective cost J^{(1)}_K(t) and to update the set of direct followers K repeatedly so as to further improve the network performance. Furthermore, the agents can be weighted differently by the leader in contributing to the cost J_K. We believe that these settings are more natural and thus likely to be of more value for practical applications. Finally, the models considered here, i.e., (3.2) and (3.9), allow us to establish the convexity of a relaxed problem of (P), while neither (3.16) nor (3.13) does so; see also Remark 3.4.1 below.

Regarding the problem solution, although we adopt two well known heuristic approaches, namely the convex relaxation/approximation technique and the greedy selection strategy, the theoretical results presented here are much more general and stronger. In particular, our technical contributions include the establishment of the supermodularity property of the objective function in problem (P) and the convexity of its continuous relaxation; both results are based on M-matrix theory, which is completely different from the tools used in [63-65, 67].
First, we prove the convexity of our relaxed problem (in the usual sense instead of just element-wise) without assuming any kind of symmetry, which is of great benefit since it allows us to use much more effective numerical algorithms (e.g., gradient descent and Interior Point Methods) than the coordinate descent approach employed in [62]. Second, we derive a general matrix supermodularity inequality that can be used to prove supermodularity of J_K as well as of another type of cost function encountered in the literature (see Remark 3.5.6 below). Combining the supermodularity result with the notion of curvature of a submodular function [109], we prove that the well-known greedy algorithm [108] applied to our problem admits a theoretical approximation guarantee sharper than (1 − 1/e). In addition, we develop an improved version of this algorithm that is able to achieve better accuracy. Finally, in both approaches we derive upper and lower bounds on the optimal value which, when combined, provide a better assessment of the obtained approximate solutions. As will be demonstrated in our numerical examples, the approximation ratio can be certified to range from 70% to 100% depending on the value of K.

3.3 Special Cases K = 1, 2: Optimal Solutions

For any matrix A, let A_(i) and A^(j) denote the i-th column and the j-th row of A, respectively. Moreover, we will use A_ij, [A]_ij and a_ij interchangeably to refer to the (i,j)-th element of A.

3.3.1 Single Agent Selection

Because W is an irreducible row-stochastic matrix, 0 is a simple eigenvalue of L = I − W, associated with the right eigenvector 1. Let π ∈ R^N denote the left normalized eigenvector corresponding to this eigenvalue, such that π^T 1 = 1. It is known from Perron's theorem (see [95, Thm. 8.4.4]) that π is strictly positive under the strong connectivity assumption on the underlying communication graph.

Now let K be a singleton, i.e., K = 1. Then there are at most N possible choices for leader T. For problem (3.8), we have the following result.

Theorem 3.3.1 (Single agent selection for problem (P1) in (3.8)). Suppose b satisfies the normalization condition 1^T b = 1. For any k ∈ V, we have

J^(1)_{k} = p_k^T |ξ_0|,   (3.17)

with

p_k^T := (b^T L† − L†^(k)) + (α_k^{-1} + L†_kk − b^T L†_(k)) π^T/π_k.   (3.18)

Moreover, if b = 1/N, then

p_k^T = (α_k^{-1} + L†_kk) π^T/π_k − L†^(k).   (3.19)

Proof. See Appendix A.2.1.

Our next result characterizes the cost function corresponding to a single agent selection for problem (3.11).

Theorem 3.3.2 (Single agent selection for problem (P2) in (3.11)). Let P = L_β^{-1}. For any k ∈ V, we have

J^(2)_{k} = 1 − b^T P_(k)/(α_k^{-1} + P_kk).   (3.20)

Proof. See Appendix A.2.2.

As a result, when K = 1, optimal solutions to both problems (3.8) and (3.11) are given by k* = arg min_{i∈V} J_{i}. It should be noted that one only needs to evaluate L† and π, or L_β^{-1}, once (which requires O(N^3) operations), and then use (3.18) or (3.20) to compute the influence corresponding to each candidate follower. When N is large, this is far less computationally expensive than inverting the matrix (L + diag(α_K)) anew for each choice of K (each inversion costing O(N^3) operations). A small numerical sketch of this computation follows.
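The following minimal Python sketch illustrates this point for problem (P2) on a hypothetical 5-node network (all parameter values are illustrative assumptions, not data from this chapter): it evaluates J^(2)_{k} for every k from a single inversion P = L_β^{-1} via (3.20), and cross-checks the results against the definition (3.12).

```python
import numpy as np

# Hypothetical 5-node strongly connected network (illustration only).
rng = np.random.default_rng(0)
N = 5
A = (rng.random((N, N)) < 0.6).astype(float)
np.fill_diagonal(A, 1.0)                          # self-loops
A[np.arange(N), (np.arange(N) + 1) % N] = 1.0     # directed cycle => strong connectivity
W = A / A.sum(axis=1, keepdims=True)              # row-stochastic weights
L = np.eye(N) - W                                 # Laplacian L = I - W
alpha = np.full(N, 2.0)                           # assumed trust levels toward leader T
beta = np.full(N, 0.5)                            # assumed trust levels toward leader Q
b = np.full(N, 1.0 / N)                           # normalized weights, 1^T b = 1

L_beta = L + np.diag(beta)
P = np.linalg.inv(L_beta)                         # one O(N^3) inversion, reused for all k

# Formula (3.20): J^(2)_{k} = 1 - b^T P_(k) / (alpha_k^{-1} + P_kk)
J2 = np.array([1 - b @ P[:, k] / (1 / alpha[k] + P[k, k]) for k in range(N)])

for k in range(N):                                # sanity check against (3.12) with c = beta
    Y = L_beta + np.diag(alpha * (np.arange(N) == k))
    assert np.isclose(J2[k], b @ np.linalg.solve(Y, beta))

print("best single agent:", int(np.argmin(J2)))
```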
The cost J_{k} is decreasing in the trust level α_k (which enters through the term α_k^{-1} in (3.18) and (3.20)). This has a practical meaning: in a social network, an agent who is strongly influenced by his neighbors but is quite skeptical about new information (from the leader) may be less important in spreading the leader's opinion than one of his friends who is easier to persuade. Furthermore, J^(1)_{k} also depends linearly on |ξ_0|, the initial error of the whole network. As noted earlier in Remark 3.2.7, we can view J^(1)_{k} as the cost-to-go at the initial time, i.e., J^(1)_{k}(0). In this connection, it is easy to see that the cost at any time t is given by

J^(1)_{k}(t) = p_k^T |ξ(t)|.

This suggests that the centrality of each agent should be dynamic. That is, an agent may be the most important at some time but not at other times, depending not only on its position in the network structure but also on how it behaves over time. The significance of this is that if the leader is able to repeatedly compute J^(1)_{k}(t), then it can further improve the performance of the network by repeatedly selecting the informed agent.

Remark 3.3.3 (Connection of J^(1)_{k} with other centrality measures). Consider again the case where the graph G is undirected and L is symmetric, b = 1/N, and ξ_0 = 1/N. Note that π = 1/N and that L†π = 0 (see Lemma A.1.5). Hence,

J^(1)_{k} = α_k^{-1} + L†_kk.   (3.21)

Thus C^(1)_{k} is proportional to 1/L†_kk. It is interesting to note that in [110] the authors define the topological centrality of a node to be TC_k = 1/L†_kk, where L† is the pseudo-inverse of a Laplacian matrix L; see [110] for further details. Additionally, the notions of information centrality [111] and node certainty [112] can also be shown to be proportional to 1/L†_kk. Notice that [110, 111] define these notions only for undirected graphs where L is symmetric. Thus, when the graph G is undirected and L is symmetric, these centrality indices and our C_{k} are equivalent in ranking the importance of nodes in the network. Moreover, for undirected networks, the pseudo-inverse of the Laplacian also has a nice connection with the notion of resistance distance, namely

L†_kk = 1/IC_k − K_f/N^2,

where K_f = tr(L†) denotes the Kirchhoff index of the network and IC_k the information centrality [111] of node k, given by 1/IC_k = (1/N) ∑_j r_kj, with r_kj the topological distance between k and j. Therefore (taking ξ_0 = 1, which scales (3.21) by N),

J^(1)_{k} = N/α_k + N/IC_k − K_f/N = N/α_k + ∑_j r_kj − K_f/N.

As a consequence, if α_k = α_j for all k, j ∈ V, then the centrality C^(1)_{k} agrees with the information centrality. In particular, nodes with smaller total distance to all the others have higher centrality measures and are thus more important. It is, however, important to note that our measure C^(1)_{k} also depends proportionally on α_k, which makes more practical sense, since α_k represents the proclivity of agent k toward the leader's opinion. Moreover, C^(1)_{k} is not restricted to undirected graphs and symmetric L.

3.3.2 Two-Agent Selection

In this subsection, we derive an explicit expression for the joint centrality of any pair of agents. Let K = {i, j} ⊂ V, i ≠ j. For problem (3.8), we have the following result.

Theorem 3.3.4 (Two-agent selection for problem (P1) in (3.8)). Let b = 1/N. We have

J^(1)_{ij} = p_ij^T |ξ_0|,   (3.22)

where

p_ij^T = ((γ_ii γ_jj − γ_ij γ_ji)/∑γ_ij) π^T − ((γ_jj + γ_ji)/∑γ_ij) L†^(i) − ((γ_ii + γ_ij)/∑γ_ij) L†^(j),   (3.23)

and ∑γ_ij := γ_jj + γ_ij + γ_ii + γ_ji, with

γ_ii = (1/π_i)(L†_ii + 1/α_i),   γ_ji = −L†_ji/π_i,
γ_jj = (1/π_j)(L†_jj + 1/α_j),   γ_ij = −L†_ij/π_j.

Proof. See Appendix A.2.3.
Note that p_ij can also be expressed as

p_ij^T = ((γ_jj + γ_ji)/∑γ_ij) p_i^T + ((γ_ii + γ_ij)/∑γ_ij) p_j^T − ((γ_ii + γ_ij)(γ_jj + γ_ji)/∑γ_ij) π^T,

where p_i is given by (3.19) (so that p_i^T |ξ_0| = J^(1)_{i}). As a consequence of Theorem 3.3.4, we can determine the optimal pair at any time t as

K*(t) = arg min_{i,j∈V, i≠j} p_ij^T |ξ(t)|.   (3.24)

Remark 3.3.5 (A special case). Consider the case where L is symmetric, and let ξ_0 = 1 and α = ∞·1. Note that π ∈ span(1) and L†π = 0. Thus the cost J^(1)_{ij} reduces to

J^(1)_{ij} = (L†_ii L†_jj − L†_ij L†_ji)/(L†_ii + L†_jj − L†_ij − L†_ji).   (3.25)

Notice that the term in the denominator, L†_ii + L†_jj − L†_ij − L†_ji =: r_ij, is usually referred to as the resistance distance of the network measured at nodes i and j, which is identical to the topological distance between them. The term in the numerator can be expressed as

L†_ii L†_jj − L†_ij L†_ji = L†_ii L†_jj (1 − (cos†(i,j))^2),   where cos†(i,j) = L†_ij/√(L†_ii L†_jj).

Here, following [113], we use cos†(i,j) to measure how structurally similar the roles of i and j are. The cost now reads

J^(1)_{ij} = L†_ii L†_jj (1 − (cos†(i,j))^2)/r_ij.

By (3.21) and α = ∞·1, we have

C^(1)_{ij} = C^(1)_{i} C^(1)_{j} r_ij/(1 − (cos†(i,j))^2).

Obviously, the cost C^(1)_{ij} depends on the individual centralities C^(1)_{i}, C^(1)_{j}, the resistance distance r_ij and cos†(i,j) in a nonlinear fashion. However, we can loosely infer that, to minimize J_{ij}, the optimal selection should satisfy the following:

• Self-centrality: C^(1)_{i} and C^(1)_{j} should be large.
• Relative distance: r_ij should be large, i.e., i and j should be far apart.
• Topological similarity: cos†(i,j) should be large, i.e., i and j should have similar roles in the network.

Consider, for example, an unweighted undirected cycle graph where α_i is identical for every node. An optimal choice (i*, j*) is any two nodes at the farthest distance from each other. [Footnote 3: This can be seen as follows. Let N denote the number of nodes in the cycle. For any two nodes in the cycle, there are exactly two disjoint paths connecting them. Let x, y denote the lengths of the two paths, which satisfy x + y = N. Since all the nodes are identical to each other, the joint centrality depends only on the relative distance r_ij = (x^{-1} + y^{-1})^{-1} = xy/N ≤ (x+y)^2/(4N) = N/4. Therefore, r_ij is maximized when x = y = N/2 for even N, or (x, y) = ((N−1)/2, (N+1)/2) for odd N. This proves the claim.] As another example, consider a network consisting of two communities. A reasonable candidate for the optimal solution would be (i*, j*) where each node is the most influential in its own community.

Remark 3.3.6. In connection with other centrality measures (e.g., information centrality, topological centrality), the cost J^(1)_{ij} can be written as

J^(1)_{ij} = L†_ii L†_jj (1 − (cos†(i,j))^2)/r_ij
          = (1/IC_i − K_f/N^2)(1/IC_j − K_f/N^2)(1 − (cos†(i,j))^2)/r_ij
          = (R_i/N − K_f/N^2)(R_j/N − K_f/N^2)(1 − (cos†(i,j))^2)/r_ij,

where R_i = ∑_k r_ik is the sum of the resistance distances from node i to all the other nodes, which is reciprocal to the information centrality of node i.

The derivations of C^(1)_{ij} in the previous remarks are valid only under special assumptions on the trust vector α, the initial condition ξ(0), and the structure of the network and the Laplacian matrix L. More importantly, they indicate the importance of nodes only at the initial time 0. As the agents' opinions evolve with time, so do their influence measures with respect to the overall performance of the network. This can be captured by our time-varying objective function J^(1)_K(t) or C^(1)_K(t), which is defined for any initial condition and any trust and bias vectors. This is one of the main differences between our work and others.
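For illustration, the following minimal sketch brute-forces the optimal pair (3.24) at t = 0 for problem (P1) directly from the definition, reusing the hypothetical network variables (L, alpha, b) of the earlier sketch. Theorem 3.3.4 gives the same costs in closed form from a single pseudo-inverse; the direct sweep below is only for small networks.

```python
import numpy as np
from itertools import combinations

# J^(1)_K = b^T (L + diag(alpha_K))^{-1} |xi_0| per the definition (3.12) with beta = 0.
# For a strongly connected network, L + diag(alpha_K) is nonsingular once K is nonempty.
def J1(K, L, alpha, b, xi0):
    d = np.zeros(len(b))
    d[list(K)] = alpha[list(K)]
    return b @ np.linalg.solve(L + np.diag(d), np.abs(xi0))

N = len(b)                            # variables from the previous sketch assumed in scope
xi0 = np.ones(N)                      # initial error vector (T = 0, x(0) = 1, assumed)
best = min(combinations(range(N), 2), key=lambda K: J1(K, L, alpha, b, xi0))
print("optimal pair:", best, "cost:", J1(best, L, alpha, b, xi0))
```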
In a similar fashion, we can compute the cost associated with any pair of agents for problem (P2), as follows.

Theorem 3.3.7 (Two-agent selection for problem (P2) in (3.11)). Let P = L_β^{-1} and

ν_ii = α_i^{-1} + P_ii,   ν_ji = −P_ji,
ν_jj = α_j^{-1} + P_jj,   ν_ij = −P_ij.

We have

J^(2)_{ij} = 1 − (b^T P_(i) (ν_jj + ν_ij) + b^T P_(j) (ν_ii + ν_ji))/(ν_ii ν_jj − ν_ij ν_ji).   (3.26)

Proof. The proof is based on the rank-2 update of the matrix inverse via the Woodbury identity (A.1).

Of course, this objective function depends on the network structure and weight matrix (encoded in L) as well as on the trust vectors α, β. Although connections with other notions of centrality may not be easily inferred, there is a close relation between this cost function and the average voltage in a network of resistors. In particular, assume the graph G is undirected and consider the network of |E| resistors corresponding to the graph, with w_ij representing the conductance between nodes i and j. Let Q and T denote two voltage sources, and let α_i (resp. β_i) denote the conductance between node i and T (resp. Q) when there is a link, that is, when node i is selected by T (or Q). Then it can be seen that J^(2)_{ij} is the weighted average (with weights b) of the node voltages in the resistor network.

We close this section with the following remark. As we have seen, the joint centrality measure of a set becomes increasingly complicated to express as K grows. Moreover, for large networks, finding the optimal solution by sweeping through all possible combinations is a challenging or even impractical task. Therefore, we content ourselves with approximate solutions whenever they are attainable with certified quality. In this connection, we now develop two practical approaches to the general problem (P) in which lower and upper bounds on the optimal value can be obtained and used to assess approximate solutions.

3.4 General Case: Convexification Approach

In this section, we study the convexity of the continuous relaxation of J_K and discuss numerical methods that can be used to solve the relaxed or approximate problem. We emphasize that we impose no symmetry conditions on the Laplacian matrix L (or even on its structure).

3.4.1 Convexity of Relaxation

Consider problem (3.12), equivalently written as follows:

(P)   min_{s∈R^N} f(s) := b^T (L_β + diag(s ∘ α))^{-1} c
      s.t. s_i ∈ {0, 1} ∀i = 1, ..., N,   card(s) ≤ K,   (3.27)

where b, c ∈ R^N_+\{0}. Recall that L_β = L + diag(β). We will also use L_0 := L to signify the case of problem (3.8), i.e., β = 0. The optimal value of this problem is denoted by f*_P. First, this problem is clearly combinatorial in nature (hence nonconvex) and generally hard to solve, especially for large networks. We defer our discussion of the convexity of the objective function f for now and first discuss techniques to handle the cardinality constraint.

The first idea is to consider, instead of (P), a relaxed version (P_Rlxd) defined as follows:

(P_Rlxd)   min_{y∈R^N} f(y)   s.t. y ∈ [0, 1]^N and 1^T y ≤ K.   (3.28)

This is a continuous relaxation of (P). The optimal value of (P_Rlxd), denoted by f*_P_Rlxd, is clearly a lower bound for that of (P), i.e., f*_P_Rlxd ≤ f*_P. Of course, this lower bound is useful only if an optimal solution y_P_Rlxd is computable. In that case, if y_P_Rlxd is a binary vector, then it is also optimal for (P). However, a binary solution is not to be expected, as y_P_Rlxd tends to be dense. In general, we can use a simple projection onto the feasible set of problem (P) to obtain an approximation (e.g., rounding up to 1 the K largest elements of y_P_Rlxd and zeroing out the rest), resulting in an upper bound on f*_P, which we denote by f̄_P_Rlxd. A sketch of this rounding step follows.
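A minimal sketch of the rounding step (the relaxed solution shown is a hypothetical example):

```python
import numpy as np

def round_topk(y, K):
    """Project a relaxed solution onto the feasible set of (P):
    keep the K largest entries of y as 1 and zero out the rest."""
    s = np.zeros_like(y)
    s[np.argsort(y)[-K:]] = 1.0
    return s

# Hypothetical relaxed solution for N = 5, K = 2:
y_rlxd = np.array([0.12, 0.85, 0.30, 0.77, 0.05])
print(round_topk(y_rlxd, 2))   # -> [0. 1. 0. 1. 0.]
```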
Next, we consider another practical approximation using the well-known ℓ1-norm regularization technique. Here, we consider the problem

(P_Aprx)   min_{y∈R^N} g(y) := f(y) + μ 1^T y   s.t. y ∈ [0, 1]^N =: Ω,   (3.29)

where μ is a positive parameter whose role is to promote sparsity of the solution. (Note that if μ = 0, then clearly y = 1 is the global solution to this approximate problem (see also Theorem 3.4.3 below); increasing μ is a way to penalize the number of nonzero elements in the solution.) Let s*_P_Aprx be the binary vector corresponding to the K largest elements of a solution to problem (P_Aprx). Then f_P_Aprx := f(s*_P_Aprx) is clearly an upper bound on the optimal value of the original problem (P). As a result, the gap (f_P_Aprx − f*_P_Rlxd) can also be used to evaluate the quality of our approximations.

We now discuss the convexity of the function f, which is clearly pertinent to both problem (P_Rlxd) and problem (P_Aprx). Note that we do not assume any symmetry conditions on the Laplacian matrix L (or even on its structure), or on the nonnegative vectors b and c (trivial cases such as b = 0 or c = 0 are excluded). For somewhat similar cost functions that are convex under symmetry of L, see, e.g., [61, 104, 105].

As noted earlier, the function f̃ in (3.15) and the cost J^(1)_K defined for problem (P1) with b = 1 and ξ_0 = 1 are equivalent through the change of variables (3.14), so one may wonder whether the continuous relaxation of f̃ is convex. The following remark provides a negative answer to this question.

Remark 3.4.1 (Nonconvexity of the continuous relaxation of f̃ of (3.15)). We use a simple example with N = 2 to show that the continuous relaxation of f̃ of (3.15), formed as in (3.28), need not be convex. By abuse of notation, consider

f̃(s) = 1^T (I − (I − diag(s ∘ α̃)) W)^{-1} 1,   s ∈ Ω = [0, 1]^N.

Suppose

W = [0.1, 0.9; 0.5, 0.5]   and   α̃ = [0.8; 0.8].

We have

∇²f̃(1/2) = [6.9101, 16.6656; 16.6656, 22.2587],

which is not positive semidefinite, as it has a negative eigenvalue, namely λ = −3.7632. Thus, f̃ is nonconvex on Ω.
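This claim is easy to check numerically. The following sketch forms a finite-difference Hessian of f̃ at s = (1/2, 1/2) for the 2-node example above; the printed values should be compared with those reported in the remark.

```python
import numpy as np

W = np.array([[0.1, 0.9], [0.5, 0.5]])
a_t = np.array([0.8, 0.8])                      # alpha_tilde

def f_tilde(s):
    M = np.eye(2) - (np.eye(2) - np.diag(s * a_t)) @ W
    return np.ones(2) @ np.linalg.solve(M, np.ones(2))

s0, h = np.full(2, 0.5), 1e-4
H = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        ei, ej = np.eye(2)[i] * h, np.eye(2)[j] * h
        # forward-difference approximation of the mixed second partials
        H[i, j] = (f_tilde(s0 + ei + ej) - f_tilde(s0 + ei)
                   - f_tilde(s0 + ej) + f_tilde(s0)) / h**2

print(H)                       # compare with the Hessian reported in Remark 3.4.1
print(np.linalg.eigvalsh(H))   # the smallest eigenvalue should be negative
```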
In contrast to f̃, the continuous relaxation of J^(1)_K is convex on Ω. In fact, more is true; we establish below the convexity of f in (3.28), which is the main result of this subsection. The convexity proof relies on the following technical lemma. [Footnote 4: We thank Prof. Terrence Tao for the idea for the proof of Lemma 3.4.2.]

Lemma 3.4.2. Let A ∈ R^{N×N} be nonnegative and V ∈ R^{N×N} be diagonal. Then for each m ≥ 0, ∑_{i+j+k=m} A^i V A^j V A^k is a nonnegative matrix, where i, j, k are nonnegative integers.

Proof. By a change of variables, we have

∑_{i+j+k=m} A^i V A^j V A^k = ∑_{0≤q≤r≤m} A^q V A^{r−q} V A^{m−r}.

Writing V = diag(v_1, ..., v_n), the st-th entry of the matrix above is

∑_{0≤q≤r≤m} ∑_{1≤i,j≤n} [A^q]_si [A^{r−q}]_ij [A^{m−r}]_jt v_i v_j.   (3.30)

To simplify this expression, consider the graph generated by the matrix A, where a_ij denotes the weight of the directed edge i → j. [Footnote 5: The edge direction defined within this proof is in reverse order to our usual notation.] Let P_m denote the set of all walks of length m from node s_0 = s to s_m = t, i.e., those of the form

s = s_0 →(e_1) s_1 →(e_2) ... →(e_m) s_m = t,

where e_i = (s_{i−1} s_i) denotes the directed edge from node s_{i−1} to s_i. Now, for a fixed tuple (q, i, r, j), consider the collection P(q,i,r,j) ⊂ P_m of walks satisfying s_q = i and s_r = j (i.e., fixing positions q and r). Then the term under the double summation in (3.30) represents the total weight of all the walks [Footnote 6: A walk's weight is defined as the product of the weights of all the edges along the walk.] in P(q,i,r,j) multiplied by v_{s_q} v_{s_r}, i.e.,

[A^q]_si [A^{r−q}]_ij [A^{m−r}]_jt v_i v_j = ∑_{{e_k}_1^m ∈ P(q,i,r,j)} a_{e_1} a_{e_2} ... a_{e_m} v_{s_q} v_{s_r},

where we have assumed A = [a_kl]_{1≤k,l≤n}. Summing the right side of this relation over 1 ≤ i, j ≤ n yields the total weight of all the walks in P_m (each scaled by v_{s_q} v_{s_r}), namely ∑_{{e_k}_1^m ∈ P_m} a_{e_1} a_{e_2} ... a_{e_m} v_{s_q} v_{s_r}. As a result, (3.30) becomes

∑_{0≤q≤r≤m} ∑_{{e_k}_1^m ∈ P_m} a_{e_1} ... a_{e_m} v_{s_q} v_{s_r} = ∑_{{e_k}_1^m ∈ P_m} a_{e_1} ... a_{e_m} ∑_{0≤q≤r≤m} v_{s_q} v_{s_r}.   (3.31)

Note that a_{e_i} ≥ 0 for any i and that

2 ∑_{0≤q≤r≤m} v_{s_q} v_{s_r} = (∑_{0≤i≤m} v_{s_i})^2 + ∑_{0≤i≤m} v_{s_i}^2 ≥ 0.

It then follows that the right side of (3.31) is nonnegative, thereby completing the proof.

We are now ready to establish the convexity as well as other important properties of our objective functions.

Theorem 3.4.3 (Properties of objective function in (3.27)). For any b, c, α ∈ R^N_+\{0} and β ∈ R^N_+, let Ω be given as in (3.29) and consider f : R^N_+ → R ∪ {∞} defined in (3.27), i.e.,

f(y) := b^T (L_β + diag(y ∘ α))^{-1} c.   (3.32)

Then f is positive, convex and decreasing on Ω. It is smooth on Ω\{0}, with gradient ∇f and Hessian H given by

∇f(y) = −(Y^{-T} b) ∘ α ∘ (Y^{-1} c),   with Y := L_β + diag(y ∘ α),   (3.33)

and

H(y) = H̄(y) + H̄^T(y),   with H̄(y) := diag(α ∘ (Y^{-T} b)) Y^{-1} diag(α ∘ (Y^{-1} c)).   (3.34)

Moreover, H̄(y) is a nonnegative matrix and

0 ⪯ H(y) ⪯ L_f I,   with L_f := ρ(H(0)).   (3.35)

Furthermore, L_f ≤ N max_ij [H(0)]_ij.

Proof. Smoothness of f follows from its definition. Positivity follows from the assumptions b, c, β ≥ 0 and the fact that Y = L_β + diag(y ∘ α) is a nonsingular M-matrix whenever y ∈ Ω and β are not both equal to 0, which ensures that Y^{-1} is a nonnegative matrix (see Lemma A.1.3 in Appendix A.1). Hence f(y) = b^T Y^{-1} c ≥ 0 for all y ∈ Ω. Next, we find the first differential of f, namely,

df(y) = b^T dY^{-1} c = −b^T Y^{-1} diag(α) diag(Y^{-1} c) dy = −[(Y^{-T} b) ∘ α ∘ (Y^{-1} c)]^T dy,   (3.36)

where we have used the facts that dY^{-1} = −Y^{-1}(dY)Y^{-1}, dY = d(L_β + diag(y ∘ α)) = diag(dy ∘ α), and diag(x)y = diag(y)x = x ∘ y. Therefore, ∇f(y) = −(Y^{-T} b) ∘ α ∘ (Y^{-1} c). Since Y^{-1} ≥ 0_{N×N}, we have ∇f(y) ≤ 0, which implies that f is decreasing in y. In fact, a stronger statement holds: Y^{-1} = (L_β + diag(y ∘ α))^{-1} is nonnegative and decreasing in y. As a result, ‖∇f(y)‖_2 ≤ ‖∇f(0)‖_2 for all y ∈ Ω. When β ≠ 0, ‖∇f(0)‖_2 < ∞, thus f is Lipschitz continuous with parameter ‖∇f(0)‖_2 on Ω. Next, we find the second differential of f as follows:

d²f(y) = 2 b^T Y^{-1} diag(dy ∘ α) Y^{-1} diag(dy ∘ α) Y^{-1} c   (3.37)
       = 2 dy^T diag(α ∘ (Y^{-T} b)) Y^{-1} diag(α ∘ (Y^{-1} c)) dy
       = dy^T (H̄ + H̄^T) dy,   (3.38)

with H̄ defined as in (3.34). Thus, H = H̄ + H̄^T is the Hessian of f. Clearly, H̄ is nonnegative since Y^{-1}, α, b, c are so. This proves that H is also nonnegative. For convexity, it suffices to show that d²f given by (3.37) is positive semidefinite on Ω\{0}. Indeed, since b and c are nonnegative, we will prove that

Y^{-1} V Y^{-1} V Y^{-1} ≥ 0_{N×N},

where V = diag(dy ∘ α). Note that Y is a nonsingular M-matrix. Thus, by definition, Y = s(I − A) for some positive s and some nonnegative matrix A with ρ(A) < 1. Then we have Y^{-1} = s^{-1} ∑_{i≥0} A^i and hence

Y^{-1} V Y^{-1} V Y^{-1} = s^{-3} ∑_{i≥0} ∑_{j≥0} ∑_{k≥0} A^i V A^j V A^k = s^{-3} ∑_{m≥0} ∑_{i+j+k=m} A^i V A^j V A^k.   (3.39)
Now, by Lemma 3.4.2, ∑_{i+j+k=m} A^i V A^j V A^k ≥ 0_{N×N} for each m ≥ 0. Therefore, Y^{-1} V Y^{-1} V Y^{-1} ≥ 0, thereby proving the convexity of f. Next, to prove (3.35), we use the inequality

x^T H(y) x ≤ ρ(H(y)) x^T x,   ∀x ∈ R^N, y ∈ Ω,   (3.40)

which holds since ρ(H(y)) is the largest eigenvalue of the nonnegative (and symmetric) matrix H(y) (see Theorem A.1.1 in Appendix A.1). Note also that H(y) is decreasing in y ∈ Ω. Thus we have

0_{N×N} ≤ H(y) ≤ H(0) ≤ max_ij [H(0)]_ij 11^T.

Finally, by Theorem A.1.2 in Appendix A.1, we have ρ(H(y)) ≤ ρ(H(0)) = L_f ≤ max_ij [H(0)]_ij ρ(11^T) = N max_ij [H(0)]_ij.

Consider again the example in Remark 3.4.1 and choose W and α satisfying (3.14), e.g., W = W̃ and α = 4·1. With β = 0, b = c = 1, we have f(y) = 1^T (L + 4 diag(y))^{-1} 1 and

∇²f(1/2) = H(1/2) = [2.5952, 0.7958; 0.7958, 3.8131] ≻ 0.

Remark 3.4.4 (Lipschitz constant for problem (3.8)). When β = 0 we have L_β = L, which is singular. As a result, the Lipschitz constant L_f = ρ(H(0)) = ∞. Indeed, when α = β = 0, the agents' opinions converge to a consensus value that is unaffected by either T or Q.

The following result is an immediate consequence, whose proof is omitted.

Corollary 3.4.5 (Properties of g in (3.29)). The function g is smooth and convex over the constraint set Ω, with gradient

∇g(y) = ∇f(y) + μ1,   (3.41)

which is Lipschitz continuous with Lipschitz constant L_g = L_f. Moreover, if η := min_{y∈Ω} λ_min(H(y)) > 0, then g is strongly convex with parameter η.

It now becomes obvious that both problems (P_Rlxd) and (P_Aprx) are convex with a (possibly strongly) convex smooth cost function. Thus, they can be solved by various algorithms, including Interior Point Methods (IPMs) or the Projected Gradient Method (PGM) (see, e.g., [114–118]), provided that ∇f(y) can be evaluated efficiently (see Remark 3.4.11 below). We now remark on how to deal with the original problem (P) in connection with tuning the parameter μ in (P_Aprx).

Remark 3.4.6 (On selecting the regularization parameter μ). From the optimal solution ỹ* of problem (3.29) for a particular μ, we can obtain an approximate solution to the original problem (3.27) by choosing the nodes corresponding to the K largest entries of ỹ*. As μ increases, there (usually) exists μ̄ such that card(ỹ*) ≤ K. Once this value is found (which can be done fairly easily), μ can be tuned manually within the interval [0, μ̄] to find the best approximation.

We conclude this subsection with the following remark, showing an application of the analysis developed above.

Remark 3.4.7 (A proof of a conjecture in [105]). In a seemingly unrelated context, the authors of [105] study an on-chip active cooling system (based on super-lattice thin-film thermoelectric coolers) and the problem of minimizing the steady state temperature profile. The following conjecture was posed and supported by extensive simulations.

Conjecture 3.4.8 ([105]). Suppose H^{-1} ∈ R^{N×N} is a Stieltjes matrix. [Footnote 7: A Stieltjes matrix is a real symmetric positive definite M-matrix.] Then, for any 1 ≤ k, l ≤ N and z ∈ R^N, the following holds:

z^T diag(H_(k)) H diag(H_(l)) z ≥ 0.

Assuming this conjecture to be valid, the paper then shows the convexity of each element h_kl of the matrix H(x) = (G − xD)^{-1} as a function of x ∈ [0, x_m], where D is a diagonal matrix with at least one positive entry, G is an irreducible Stieltjes matrix, and x_m > 0 is such that G − xD is positive definite for all x ∈ [0, x_m]. This convexity result was proved in a later work [119] using results on the convexity of parameterized linear equations [120], but the conjecture itself has not yet been proved. We will prove the conjecture next.
Although our cost function does not resemble H(x) in the papers just mentioned, the analysis provided above can be used to give an affirmative answer to the conjecture, even under a weaker assumption, namely that H^{-1} is an M-matrix. Indeed, the proof below does not require a symmetry assumption.

Proof of Conjecture 3.4.8. Let V = diag(z). We have

z^T diag(H_(k)) H diag(H_(l)) z = H^(k) diag(z) H diag(z) H_(l) = e_k^T HVHVH e_l.

Since H^{-1} is an M-matrix, it follows that H^{-1} = s(I − A) for some s > 0 and some A ≥ 0_{N×N} with ρ(A) < 1. Thus H = s^{-1} ∑_{i≥0} A^i, and hence using the same expansion as in (3.39) yields s³ HVHVH = ∑_{m≥0} ∑_{i+j+k=m} A^i V A^j V A^k, which is nonnegative by Lemma 3.4.2. Therefore, e_k^T HVHVH e_l ≥ 0, as desired.

The foregoing proof suggests that the convexity results in our work can be useful in studying various applications, such as those considered in [105, 119].

3.4.2 Numerical Methods

We now discuss two numerical algorithms that can be used to solve problem (P_Aprx), namely the Projected Gradient Method and Interior Point Methods. Problem (P_Rlxd) can be treated similarly.

Theorem 3.4.9 (PGM). Consider the Projected Gradient Method applied to problem (3.29):

y^(t+1) = P_Ω[y^(t) − γ^(t) (∇f(y^(t)) + μ1)],   (3.42)

where P_Ω denotes the projection operator onto the constraint set Ω of (3.29) and γ^(t) the step size. If γ^(t) is chosen by the Armijo rule, then every limit point of {y^(t)} is an optimal solution to problem (3.29). If β ≠ 0, we can use any constant step size γ^(t) ≡ γ ∈ (0, 2/L_f). If η > 0, then for γ = 1/L_f, y^(t) converges linearly to the unique solution y* with rate (1 − η/L_f)^{1/2}.

Proof. The theorem follows from Propositions 2.3.1 and 2.3.2 in [115], and Theorem 2.2.8 in [117].

Remark 3.4.10 (On implementation of PGM when β = 0). This corresponds to problem (3.8). We have that f is well-defined and smooth on Ω\{0} (f(0) = ∞). As a result, given y^(0) ≠ 0, the level set Ω_0 = {y ∈ Ω : g(y) ≤ g(y^(0))} is a convex compact set excluding 0, over which g, ∇g and ∇²g = H are continuous. In particular, ∇g is Lipschitz continuous on Ω_0 with coefficient L⁰_g = max_{y∈Ω_0} ‖H(y)‖. Thus, we can replace P_Ω by P_{Ω_0}, or choose a step size ensuring that y^(t) ∈ Ω_0; the PGM iteration (3.42) then still works in this case (i.e., β = 0).

Remark 3.4.11 (On gradient evaluation). The gradient ∇f(y) involves inversion of Y = L_β + diag(y ∘ α), which usually costs O(N³) operations and O(N²) memory, and thus does not scale well with network size. Moreover, even if the underlying graph is sparse, this inversion can yield a dense matrix, so storing it could also be too expensive for very large networks. In such cases, one way to reduce those costs is to exploit the sparsity of the graph and the structure of the cost function. In particular, from (3.33) we have

∇f(y) = −u ∘ α ∘ v,   where u := Y^{-T} b, v := Y^{-1} c.   (3.43)

That is, u and v are respectively the solutions of the sparse linear equations Y^T u = b and Y v = c, for which many solvers/algorithms are available. For example, based on the diagonal dominance of the matrix Y, we can employ the power iteration. Specifically, consider the decomposition Y = D_y + E, where D_y and E denote the diagonal and off-diagonal parts of Y. It is clear from the structure of Y = L_β + diag(y ∘ α) that only D_y depends on y (hence the subscript y). Now consider u, which satisfies b = Y^T u = D_y u + E^T u. Since D_y is invertible, we then have u = −D_y^{-1} E^T u + D_y^{-1} b, which is a fixed point relation; under Assumptions 3.2.1 and 3.2.2, the right side defines a contraction mapping with contraction coefficient ρ(D_y^{-1} E^T) < 1. Therefore, we can use the following iteration to compute u:

u_{k+1} = −D_y^{-1} (E^T u_k − b).   (3.44)

It should be noted that (3.44) is highly scalable since (i) E is sparse and can be read off from L (or W), whose storage takes only O(|E|), where |E| is the number of directed edges in the graph, and (ii) each iteration also takes O(|E|) operations, as it involves only a multiplication of u_k by E^T and an element-wise scaling (after a subtraction of b) by the diagonal entries of D_y. Moreover, suppose (3.44) terminates after k_u iterations, yielding a convergence error proportional to ρ^{k_u}(D_y^{-1} E^T); then the running time to compute u is O(k_u |E|). Finally, v can be computed in the same manner, i.e., v_{k+1} = −D_y^{-1} (E v_k − c). A combined sketch of (3.42)-(3.44) follows.
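The following minimal sketch combines the projected gradient iteration (3.42) for (P_Aprx), where the projection onto Ω = [0,1]^N is an element-wise clip, with the matrix-free gradient evaluation of Remark 3.4.11. All inputs are assumed given; fixed iteration counts stand in for residual-based stopping tests, and this is an illustrative sketch rather than the exact implementation used in the experiments.

```python
import numpy as np
import scipy.sparse as sp

def grad_f(W, alpha, beta, b, c, y, inner=200):
    N = W.shape[0]
    Y = sp.eye(N) - W + sp.diags(beta + y * alpha)  # Y = L_beta + diag(y o alpha)
    d = Y.diagonal()                                # D_y
    E = Y - sp.diags(d)                             # off-diagonal part of Y
    u = np.zeros(N)
    v = np.zeros(N)
    for _ in range(inner):
        u = (b - E.T @ u) / d                       # fixed-point iteration (3.44)
        v = (c - E @ v) / d                         # same scheme for v
    return -u * alpha * v                           # gradient (3.43)

def pgm(W, alpha, beta, b, c, mu, gamma, iters=500):
    y = np.full(W.shape[0], 0.5)                    # interior starting point
    for _ in range(iters):
        g = grad_f(W, alpha, beta, b, c, y) + mu    # grad of g(y) = f(y) + mu*1^T y
        y = np.clip(y - gamma * g, 0.0, 1.0)        # projection onto [0,1]^N
    return y
```

For (P_Rlxd), the projection additionally involves the constraint 1^T y ≤ K and is slightly more involved; the box projection shown suffices for (P_Aprx).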
PGM belongs to the class of first-order methods, which require only gradient evaluations (and a projection step). Thus, it can be employed to deal with large networks. However, for networks that are not very large, other more efficient and sophisticated algorithms are available, such as primal-dual IPMs [114]. Here we note that each iteration of such a method involves computing the Newton direction, which requires O(N³) operations to evaluate the gradient ∇f and the Hessian matrix H, given respectively in (3.33) and (3.34). In practice, the method converges in very few iterations (say, a few tens), often much fewer than required by PGM.

3.5 Supermodularity and Greedy Algorithms

In this section, we develop an alternative approach to problem (P) based on the greedy strategy, for which approximation bounds on the suboptimal solutions can be established. To this end, we first prove that J_K in (3.12) is monotone and supermodular in the set variable K. In fact, more is true: the function f given by (3.32) is supermodular and monotone on Ω. For this, we will give two different proofs, as each has its own merit. As a result, problem (3.12) admits a (1 − 1/e) approximation algorithm [108]. We then develop an improved version of this algorithm that is able to achieve better approximate solutions.

3.5.1 Supermodularity Results

We now establish supermodularity of the objective function f, and thus of J_K. Our first approach relies on the results in the previous subsection and the following known result.

Theorem 3.5.1 (Topkis' Characterization Theorem [121, 122]). Let Ω = [x̲, x̄] be an interval in R^N and h : R^N → R be twice continuously differentiable on (some open set containing) Ω. Then h is supermodular on Ω if and only if ∂²h/∂x_i ∂x_j ≥ 0 for all x ∈ Ω and all i ≠ j. (There are no restrictions on ∂²h/∂x_i².)

As a consequence, we have the following.

Theorem 3.5.2 (Supermodularity of objective functions). Consider the function f in problem (P) and the set Ω defined in (P_Aprx). Then f is supermodular and monotone on Ω. Thus, the cost J_K is supermodular and monotone in K.

Proof. By Theorem 3.4.3, f in (3.32) is decreasing and its Hessian matrix H is element-wise nonnegative on Ω. Supermodularity of f then follows from Theorem 3.5.1. Thus, J_K is also supermodular, as it is the restriction of f to the vertices of Ω.

It should be pointed out that, unlike in problem (P2), the function f in (P1) is not defined at 0 ∈ Ω and thus is not twice continuously differentiable on any open set containing Ω. Therefore, the result above does not apply directly to problem (P1).
Next, we provide a second approach to proving the supermodularity result that avoids the technical problem above. This approach is based on the following two lemmas, the first of which is a matrix supermodularity inequality and the second a composition property. These results not only provide a deeper understanding of the influence process considered here, but are also useful in proving the supermodularity of another related cost function used in the literature.

Lemma 3.5.3 (Matrix supermodularity inequality). For any S ⊂ V, let Γ_S = diag(α_S). Then (L_β + Γ_S)^{-1} ∈ R^{N×N}_+ is nonincreasing and supermodular in S, i.e., the following matrix inequalities hold for any v, k ∈ V\S:

(L_β + Γ_S)^{-1} − (L_β + Γ_{S∪{v}})^{-1} ≥ (L_β + Γ_{S∪{k}})^{-1} − (L_β + Γ_{S∪{k,v}})^{-1} ≥ 0.   (3.45)

This result also holds if we replace L_β with L_0.

Proof. The proof relies on the Woodbury matrix identity and results from M-matrix theory. See Appendix A.2.4 for details.

This result seems to suggest that opinion diffusion and influence spreading processes inherently possess monotonicity and supermodularity properties.

Remark 3.5.4. We do not exclude the case S = ∅, since it can be seen that (L_β + Γ_∅)^{-1} = +∞ if β = 0.

Lemma 3.5.5 (Composition property). Suppose F : 2^V → R^{N×N} is decreasing and supermodular, and f : R^{N×N} → R is increasing and convex. Then the composition (f ∘ F) is nonincreasing and supermodular. [Footnote 8: Here, ∘ denotes the composition operator and should not be confused with the Hadamard product used in Section 3.4.]

Proof. This result is a straightforward extension of the standard case [121] in which F : 2^V → R and f : R → R. Details are omitted for brevity.

Now, using Lemmas 3.5.3 and 3.5.5 with F(K) = (L_β + Γ_K)^{-1}, f_1(X) = b^T X |ξ_0| and f_2(X) = b^T X β, we again conclude that J^(i)_K = (f_i ∘ F)(K), i = 1, 2, are nonincreasing and supermodular.

Remark 3.5.6. The authors of [104] consider the problem of selecting a number of agents as leaders (in their context) in order to minimize the overall variance in an undirected unweighted network subject to stochastic disturbances. It can be verified that the cost function in that paper is equivalent to tr((L + diag(α_K))^{-1}), which equals (f ∘ F)(K) with F(K) = (L + Γ_K)^{-1} and f(X) = tr(X). Using the result above, we can immediately conclude the supermodularity of this cost function; this was not established in [104].

3.5.2 Greedy Algorithms and Ratio Bounds

Having established supermodularity of the objective functions, we now introduce our greedy algorithms and show their ratio bounds. For convenience, J_S and J(S) are used interchangeably. Our first algorithm, whose output is denoted by K_G, is similar to the greedy algorithm in [108] and is described next.

Algorithm 3.1: Greedy Adding K_G
  Data: W, α, β, b and K
  1: Init: K_G ← ∅
  2: for i = 1 : K do
  3:   k*_i ← arg min{J(K_G ∪ {v}) : v ∉ K_G}
  4:   K_G ← K_G ∪ {k*_i}
  5: Output: K_G

Description of Algorithm 3.1: The idea is to start with an empty set K_G (line 1), then greedily add to K_G the node that most decreases the cost J_K (lines 2-4). The algorithm terminates after K sequential selections. A Python sketch of this procedure appears below.
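A minimal sketch of Algorithm 3.1 in its naive form (one linear solve per candidate; the rank-1 and power-iteration speedups are discussed in Remark 3.5.7 below). The cost evaluator follows the definition (3.12), and all inputs are assumed given.

```python
import numpy as np

def J(K, L_beta, alpha, b, c):
    """Evaluate J_K = b^T (L_beta + diag(alpha_K))^{-1} c per (3.12)."""
    d = np.zeros(len(b))
    d[list(K)] = alpha[list(K)]
    return b @ np.linalg.solve(L_beta + np.diag(d), c)

def greedy_adding(L_beta, alpha, b, c, K):
    KG = []
    for _ in range(K):                    # K sequential selections
        rest = [v for v in range(len(b)) if v not in KG]
        KG.append(min(rest, key=lambda v: J(KG + [v], L_beta, alpha, b, c)))
    return KG
```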
Remark 3.5.7 (Complexity of Algorithm 3.1). The number of function evaluations is KN − K(K−1)/2. Implemented naively, without exploiting the structure of the cost function, each evaluation requires O(N³) operations (due to matrix inversion), and thus the total cost would be O(KN⁴). We can use the following tricks to alleviate this computational burden.

• Rank-1 updates: At any iteration, let S denote the current set K_G and let P := (L_β + Γ_S)^{-1}. By the Woodbury identity (A.1), it can be verified that

(L_β + Γ_{S∪{v}})^{-1} = P − P_(v) P^(v)/(α_v^{-1} + P_vv).   (3.46)

Let ΔJ(v, S) := J(S) − J(S ∪ {v}). Then

ΔJ(v, S) = b^T P_(v) P^(v) c/(α_v^{-1} + P_vv).

As a result, knowing P, it requires O(N) operations to compute ΔJ(v, S) and hence O(N(N − |S|)) to find v* = arg max_{v∈V\S} ΔJ(v, S). The matrix (L_β + Γ_{S∪{v*}})^{-1} is then obtained from P by a rank-1 update (3.46), which is O(N²). Note that the initial case S = ∅ corresponds to P = L_β^{-1}, which takes O(N³) operations to compute. To sum up, using this scheme, the algorithm requires O(KN² + N³) operations, reduced from the naive approach by a factor of O(N). It also demands O(N²) of memory (mainly to store the matrix inverse).

• Power-iteration method: For a very large network, it may be too expensive to reserve O(N²) memory for storing the inverse matrix (L_β + Γ_S)^{-1}. In this case, one can exploit the sparsity structure of L_β in connection with the power-iteration method to overcome the memory issue, as shown in Remark 3.4.11. In particular, we can write J(S) = u_S^T c, where u_S = (L_β + Γ_S)^{-T} b can be computed using iteration (3.44). As before, let k_u denote the (average) number of iterations of (3.44) needed to achieve a certain accuracy of u. Then the algorithm takes O(KN k_u |E|) operations and O(|E|) memory.

Note that the same greedy algorithm using rank-1 updates was applied in [107] for the case of problem (3.8). Here, we use this algorithm for the more general problem (3.12) and provide proofs of the supermodularity of J_K and of the ratio bounding the incurred error, neither of which was included in [107].

Our result on the approximation guarantee of Algorithm 3.1 involves the notion of the curvature of a submodular function (see, e.g., [109]). Let Z(S) be nondecreasing submodular in S. Then

σ := 1 − min_{x∈P} (Z(P\{x}) − Z(P))/(Z(∅) − Z({x}))   (3.47)

is called the total curvature of Z with respect to the set P.

Theorem 3.5.8 ([109, Cor. 5.7]). Let Z(S) be a nondecreasing submodular function of S such that Z(∅) = 0. Let S_G and S* denote the greedy solution and the optimal solution to the problem max{Z(S) : S ⊆ P, |S| ≤ K}. Then

Z(S_G)/Z(S*) ≥ (1/σ)(1 − (1 − σ/K)^K) =: R_{σ,K},   (3.48)

where σ is the curvature of Z with respect to P.

To use this result, we need to consider the case β = 0 separately, since L_0 is singular and thus J(∅) = ∞.

Theorem 3.5.9 (Properties of Alg. 3.1). Let Assumptions 3.2.1 and 3.2.2 hold. Let K* denote an optimal solution to (3.12) and let K_G be the output of Algorithm 3.1. Let V_α = {i ∈ V : α_i ≠ 0}.

(i) Let v* = arg min_{v∈V_α} J({v}). If β = 0, then

(J({v*}) − J(K_G))/(J({v*}) − J(K*)) ≥ R_{σ,K−1},   (3.49)

where σ = 1 − min_{x∈V_α\{v*}} (J(V_α\{x}) − J(V_α))/(J({v*}) − J({v*, x})).

(ii) If β ≠ 0, then

(J(∅) − J(K_G))/(J(∅) − J(K*)) ≥ R_{σ,K},   (3.50)

where σ = 1 − min_{x∈V_α} (J(V_α\{x}) − J(V_α))/(J(∅) − J({x})).

Proof. (i) Define Z(S) := J({v*}) − J(S ∪ {v*}) for any S ⊆ V_α\{v*}. Then it can be verified that Z is nondecreasing and submodular with curvature σ, and Z(∅) = 0. Thus, applying Theorem 3.5.8 and rearranging terms yields (3.49). (ii) Similarly, (3.50) follows from Theorem 3.5.8 with σ the curvature of Z(S) := J(∅) − J(S) for any S ⊆ V_α.

Note that R_{σ,K} > (1/σ)(1 − e^{-σ}) > 1 − e^{-1} for any σ ∈ (0, 1) and K ≥ 1. Thus, in general, R_{σ,K} is tighter than the constant bound (1 − e^{-1}) established in [108] (and also used in [63, 65, 67]).

Remark 3.5.10 (Bounds on J(K*) by Alg. 3.1). Clearly, J(K_G) is an upper bound on J(K*), and (3.49) or (3.50) provides a lower bound.
We shall denote these bounds by J_GU and J_GL, respectively; e.g., J_GL = J(∅) − (J(∅) − J_GU)/R_{σ,K} for (3.50). Since J(K*) ≥ 0, the bound J_GL is useful only if J_GL ≥ 0, i.e., J_GU ≥ (1 − R_{σ,K}) J(∅) or J_GU ≥ (1 − R_{σ,K−1}) J({v*}).

In the following, we construct another algorithm (Algorithm 3.2, given and described below), which contains Algorithm 3.1 as a special case and is able to improve accuracy in practice. The idea is still to greedily select one "best" node at a time, but we additionally employ a particular swapping strategy: repeatedly replace a selected node in K by another node in V\K (or, more precisely, V_α\K) whenever the swap most decreases the objective function. This strategy is in fact a special case of the Interchange Heuristic [108], which was also employed in [103] and [104] for problems related to sensor placement and leader selection. Our algorithm differs from the aforementioned ones in that, instead of swapping whenever an improvement of the cost function occurs, we carry out swapping in the direction of the steepest-descent coordinate, which helps avoid an exponential number of exchanges. (As a side note, the supermodularity property and approximation bound for the greedy algorithm were not established in [103, 104]. Moreover, the convex analysis in these works is based on the symmetry of Laplacian matrices associated with undirected graphs.)

Algorithm 3.2: Greedy Swapping K_SM := GSwap(K_S0, M)
  Data: W, α, β, b, K_S0, and M
  1: for m = 1 : M or until K_Sm = K_Sm+1 do
  2:   S ← ∅, T = {t_1, t_2, ...} ← K_Sm−1
  3:   for i = 1 : K do
  4:     T ← T\{t_i}
  5:     t*_i ← arg min_{v∉S∪T} J(S ∪ {v} ∪ T)
  6:     S ← S ∪ {t*_i}
  7:   K_Sm ← S
  8: Output: K_Sm

Description of Algorithm 3.2: The algorithm starts with an arbitrary set K_S0 ⊆ V_α (assuming |K_S0| ∈ [0, K]) and works in a cyclic manner for a predetermined number of cycles M, or until K_Sm* = K_Sm*−1 for some m* (line 1). In the m-th cycle (lines 2-7), we revise the estimate K_Sm−1 from the previous cycle by updating each entry one after the other; that is, for i = 1, ..., K, we select t*_i ∈ R_i := V\{t*_1, ..., t*_{i−1}, t_{i+1}, ..., t_K} that minimizes the cost J(S ∪ {v} ∪ T) (line 5), i.e.,

t*_i = arg min_{v∈R_i} J({t*_1, ..., t*_{i−1}, v, t_{i+1}, ..., t_K}),

then add t*_i to S. We call this a greedy swapping step. Note that if i > |K_S0|, we allow {t_i} = ∅, in which case greedy swapping reduces to greedy adding (as in Algorithm 3.1). In essence, this algorithm is based on the cyclic coordinate descent method (also known as the Gauss-Seidel method). A sketch follows.
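A minimal sketch of Algorithm 3.2, reusing the (hypothetical) cost evaluator J from the Algorithm 3.1 sketch. Unfilled slots are treated as empty, so greedy swapping reduces to greedy adding for them, exactly as in the description above.

```python
import numpy as np

def gswap(KS0, M, K, L_beta, alpha, b, c):
    N = len(b)
    cur = list(KS0) + [None] * (K - len(KS0))     # pad up to K slots ({t_i} = empty)
    for _ in range(M):                            # at most M cycles
        prev = list(cur)
        for i in range(K):                        # revise each entry in turn
            others = [t for j, t in enumerate(cur) if j != i and t is not None]
            cands = [v for v in range(N) if v not in others]
            cur[i] = min(cands, key=lambda v: J(others + [v], L_beta, alpha, b, c))
        if cur == prev:                           # no entry changed: terminate
            break
    return cur
```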
Remark 3.5.11 (Entry search in Algorithm 3.2). In general, it is not computationally efficient to determine the optimal order in which the elements of the set K_S are revised in each cycle (so as to reduce the objective cost to the greatest extent possible). In this work, we use a cyclic selection scheme with the least possible complexity.

Remark 3.5.12 (Complexity of Alg. 3.2 with cyclic search). Each cycle (other than the first one) requires (KN − K²) function evaluations. That of the first cycle depends on |K_S0|, but is no more than KN − K(K−1)/2. Again, the naive approach takes O(MKN⁴) operations and O(N²) memory; but we can exploit the structure of the cost function to reduce these computational and memory costs, especially for large networks. Using the power-iteration method, we can avoid the O(N²) memory requirement, as shown in Remark 3.5.7. For networks that are not too large, where storage is not an issue, we can employ the Woodbury matrix identity (A.1) for rank-2 updates (since swapping involves two nodes). Specifically, suppose we want to check a possible swap of t ∈ T ∪ S =: P with some v ∈ V\P. Let P := (L_β + Γ_P)^{-1} and E_(tv) := [e_t, e_v]. Then it can be shown that

(L_β + Γ_{P\{t}∪{v}})^{-1} = P − P E_(tv) M_(tv)^{-1} E_(tv)^T P,   where M_(tv) := [P_tt − α_t^{-1}, P_tv; P_vt, P_vv + α_v^{-1}].   (3.51)

Thus, Δ₂J(−t, v, P) := J(P) − J(P ∪ {v}\{t}), the marginal gain of swapping t and v, can be computed as

Δ₂J(−t, v, P) = [b^T P_(t), b^T P_(v)] M_(tv)^{-1} [P^(t) c; P^(v) c],

which takes O(N) operations provided that P is known. Hence, finding v* = arg max_{v∈V\P} Δ₂J(−t_i, v, S) requires O(N(N − K)) operations, and if a swap is performed, the matrix (L_β + Γ_{P\{t_i}∪{v*}})^{-1} is then obtained from P by a rank-2 update (3.51), which takes O(N²). (Note that this swapping selection is also more computationally expensive than finding a merely improving swap, which is one of the reasons we opt for the greedy swapping strategy instead of the swapping method used in [103] and [104].) During each cycle, at most K swaps can be carried out, taking O(KN²) operations. For the initial cycle, if P is not supplied, then its computation costs at most O(N³). Thus, in general, for M cycles, Algorithm 3.2 takes O(MKN² + N³) operations. However, from our simulations, a good value of M is usually small (say 2-3) and does not scale with N.

Theorem 3.5.13 (Properties of Alg. 3.2). Let {K_Sm}_0^M denote the sequence of approximate solutions generated by Algorithm 3.2.

(i) If K_S0 = ∅, then K_S1 ≡ K_G, where K_G denotes the output of Algorithm 3.1.

(ii) For any m ≥ 0 and K_S0 ⊆ V_α, J(K_Sm+1) ≤ J(K_Sm). In fact, let m* denote the smallest index such that K_Sm* = K_Sm*+1; then J(K_Sm+1) < J(K_Sm) for all m < m*, and J(K_Sm+1) = J(K_Sm) for all m ≥ m*.

(iii) Let v* = arg min_{v∈V_α} J^(1)({v}). For any K_S0 ⊆ V_α,

(J^(1)({v*}) − J^(1)(K_Sm*))/(J^(1)({v*}) − J^(1)(K*)) > 1/2   and   (1 − J^(2)(K_Sm*))/(1 − J^(2)(K*)) > 1/2.

Proof. (i) Consider K_S0 = ∅ and the first cycle, i.e., m = 1. Then T = ∅ and S is initialized as empty. As a result, line 5 becomes t*_i = arg min_{v∈V\S} J(S ∪ {v}), which together with line 6 is exactly the greedy Algorithm 3.1. Therefore, K_S1 ≡ K_G, as desired.

(ii) Consider the m-th cycle. It follows from the algorithm that K_Sm−1 = {t_1, t_2, ..., t_K} (line 2). By the greedy choice of t*_i (line 5), it can be seen that

J(K_Sm) = J({t*_1, t*_2, ..., t*_K}) ≤ ... ≤ J({t*_1, t_2, ..., t_K}) ≤ J({t_1, t_2, ..., t_K}) = J(K_Sm−1).

Thus J(K_Sm−1) = J(K_Sm) if and only if all the inequalities in this relation become equalities, i.e., no further improvement on the objective can be made entry-wise. Hence, if K_Sm* = K_Sm*+1 for some m*, then J(K_Sm) = J(K_Sm*) for all m ≥ m*. The existence of m* clearly follows from the fact that the feasible set of K is finite (which comes from the finiteness of the network size).

(iii) For any submodular and nondecreasing function Z(S), it follows from [108, Thm. 5.1] that

(Z(S*) − Z(S_I))/(Z(S*) − Z(∅)) ≤ (K − 1)/(2K − 1) < 1/2,

where S* and S_I denote the optimal solution and an interchange solution (i.e., one admitting no further local improvement) to the problem max{Z(S) : S ⊆ P, |S| ≤ K}. Applying this result to our case, with Z(S) := J^(1)({v*}) − J^(1)(S ∪ {v*}) for S ⊆ V_α\{v*} (problem (P1)), or Z(S) := 1 − J^(2)(S) for S ⊆ V_α (problem (P2)), yields the desired results. Here, K_Sm* is an interchange solution for each K_S0 ⊆ V_α.

The ratio bound of 1/2 in part (iii) is smaller than the constant R_{σ,K} in Theorem 3.5.9, but holds for any initial set K_S0.
Note also that part (i) of this theorem asserts that Algorithm 3.1 can be obtained from Algorithm 3.2 by letting K_S0 = ∅ and M = 1. In this case, the performance of the latter algorithm is ensured to be no worse than that of the former. In fact, it is clear from part (ii) that better estimates are almost always attained when M > 1.

Corollary 3.5.14 (Approximation accuracy of Alg. 3.2 with K_S0 = ∅). For any m* ≥ 1, J(K_Sm*) ≤ J(K_G). Strict inequality holds if m* > 1.

Although we are not yet able to quantify this gain rigorously, our simulation results illustrate substantial improvement compared to Algorithm 3.1, even with small values of M.

Remark 3.5.15 (On implementation of Alg. 3.2). The following points are worth noting.

• Starting point: The algorithm works for an arbitrary choice of K_S0 and thus can be useful in practice to improve upon a good starting set K_S0, which may be available from, e.g., the convex relaxation approach or Algorithm 3.1.
• Local minimizer K_Sm*: When it is found, there are practical techniques to possibly escape this local minimizer at the expense of more computation time and power, e.g., random swapping of multiple nodes in K_Sm* with V\K_Sm*.
• Termination: We observed that even with a small M (say 2-3), the algorithm still finds a good approximation, especially from a good starting point. This may be attributable to the "diminishing returns" nature of the objective function, which yields significant improvements only in the first few cycles.

3.6 Numerical Examples

The simulations in this section were carried out in Matlab R2015b on a PC with an Intel Core i7 CPU @ 3.10 GHz and 12 GB of RAM.

3.6.1 Example 1: Small Network with One Leader

Consider the network depicted in Figure 3.1, where at every time step each agent updates its opinion by taking the average of its own opinion and those of its neighbors, i.e., w_ij = 1/|N_i| for all j ∈ N_i, i ∈ V. This network was also studied in [62, 107].

[Figure 3.1: Network in Example 1.]

Suppose there is an external leader with constant opinion T = 0 who wants to connect to a small number of agents so as to achieve fast consensus to its opinion; see Problem (P1). We revisit the problem of direct follower selection (with maximum level of trust) in [62], which corresponds to α = ∞, β = 0, x_0 = 1 and b = 1/N. Table 3.1 compares the simulation results of different approaches: (1) exhaustive search, which provides optimal solutions; (2) the coordinate descent method of [62]; (3) Algorithms 3.1 and 3.2; and (4) the convex relaxation (P_Rlxd) solved by an Interior Point Method. In the last case, we approximate α = 10³·1 to solve for y_P_Rlxd, and then choose K_P_Rlxd corresponding to the K largest elements of y_P_Rlxd. We further apply the greedy swapping algorithm to K_P_Rlxd; see the last column, where K_P_Rlxd1 = GSwap(K_P_Rlxd, 1) and K_P_Rlxd2 = GSwap(K_P_Rlxd1, 1). As observed from Table 3.1, algorithm GSwap takes very few cycles to converge to optimal solutions, except for the case K = 5, where it falls into a local minimizer. Usually, M = 2 is enough to obtain a good approximate solution, which is much improved over that generated by Algorithm 3.1 (and is exact in many cases).
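For concreteness, the following sketch constructs the averaging weights w_ij = 1/|N_i| used in this example, with a small hypothetical adjacency standing in for the 25-node network of Figure 3.1:

```python
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 3)]   # hypothetical undirected edges
N = 5
A = np.eye(N)                           # self-loops: each agent counts itself
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
W = A / A.sum(axis=1, keepdims=True)    # row i divided by |N_i|
assert np.allclose(W.sum(axis=1), 1)    # row-stochastic, as required
```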
Table 3.1: Comparison results for the network in Example 1 (* denotes an optimal value). In the last column, J_KPRlxd,1(2) denotes J_KPRlxd1 (J_KPRlxd2).

K | K* (exhaustive)           | J_K*  | J_K [62] | J_KG (Alg. 3.1) | J_KS2 | J_KSm* | m* | J_KPRlxd | J_KPRlxd,1(2)
1 | {13}                      | 44.16 | 180.32   | *               | *     | *      | 1  | 180.32   | * (*)
2 | {8, 19}                   | 13.36 | 28.96    | 23.37           | *     | *      | 2  | 28.96    | 16.54 (*)
3 | {8, 15, 25}               | 6.94  | 10.47    | 9.29            | *     | *      | 2  | 10.47    | * (*)
4 | {7, 8, 15, 25}            | 5.18  | 7.03     | 5.85            | *     | *      | 2  | 7.53     | 5.45 (*)
5 | {3, 7, 9, 15, 25}         | 3.53  | 3.83     | 4.09            | 4.06  | 4.06   | 2  | 6.57     | 3.82 (3.82)
6 | {3, 7, 9, 13, 16, 25}     | 2.22  | *        | 3.13            | 2.54  | *      | 3  | 5.61     | * (*)
7 | {3, 7, 9, 13, 16, 19, 25} | 1.36  | *        | 2.17            | *     | *      | 2  | 2.17     | * (*)

As for the implementation of the convex approach with regularization, there is no optimal rule for selecting μ, the sparsity-penalizing coefficient, other than trial and error (see also Remark 3.4.6). Note that the computational cost per cycle of the coordinate descent method in [62], which involves evaluations of the cost function's gradient and Hessian matrix at each coordinate, is roughly twice that of Algorithm 3.2 (which requires only function evaluations). In addition, Algorithm 3.2 converges after 2-3 cycles with a guaranteed accuracy, while the coordinate descent method can take many more cycles for each trial of μ (with no provable bound on accuracy). Furthermore, although the Interior Point Method applied to (P_Rlxd) also employs the gradient and Hessian matrix, it converges within a few iterations (10-20 in this example).

We simulate the network responses for the case K = 4; see Figures 3.2-3.5, where the fastest convergence is obtained when the leader repeatedly applies Algorithm 3.1 every T_p = 5 time steps.

[Figures 3.2-3.5: network responses x(t) versus time step for K = 4. Figure 3.2: optimal solution K* = {7, 8, 15, 25}; Figure 3.3: coordinate descent, K = {7, 13, 16, 25}; Figure 3.4: Algorithm 3.1, K = {8, 13, 16, 25}; Figure 3.5: Algorithm 3.1 repeated every T_p = 5 time steps.]

3.6.2 Example 2: Medium-Size Network with Two Leaders

Consider a directed network based on the largest strongly connected component of the Wikipedia vote network studied in [123]. [Footnote 9: Data available at: http://snap.stanford.edu/data/wiki-Vote.html] Thus, our network has N = 1300 nodes and 39456 edges. We generate the weight of each directed edge randomly in the interval (0, 1). Suppose that leader Q has selected the set V_β containing the first 50 nodes with the highest out-degrees and that β_i = 10⁶ for all i ∈ V_β (thus, they are in full support of Q). Suppose that leader T can connect to up to K nodes in V_α, which contains the first 1000 nodes that are not direct followers of Q (here "the first 1000 nodes" is understood in terms of the numbering sequence of the nodes). We also assume that α_i = 10 for all i ∈ V_α. In this example, we consider problem (P2) for different values of K ∈ [1, 200] using various schemes:

(i) Algorithm 3.1: the greedy algorithm with output K_G, providing J_GU = J(K_G) and J_GL as upper and lower bounds on J* (see Remark 3.5.10).
(ii) (P_Rlxd)+IPM: the relaxed problem (P_Rlxd) solved by the Interior Point Method in the OPTI toolbox [124], which gives f̄_P_Rlxd and f*_P_Rlxd as upper and lower bounds. [Footnote 10: Here, we let y^(0) = 0 and stop the algorithm if |f_i − f_{i−1}|/|f_i| ≤ 10⁻⁶.]
(iii) (P_Aprx)+IPM: the regularized problem (P_Aprx) solved by the Interior Point Method (with sparsity threshold set to 0.01). The output, denoted by K_P_Aprx, yields a corresponding cost f_P_Aprx =: J_P_Aprx, an upper bound on J*.
(iv) GSwap(K_P_Aprx, 1): applying one cycle of the greedy swapping algorithm to K_P_Aprx obtained from (iii).

The simulation results are shown in Figures 3.6 and 3.7.
Here, the upper bounds obtained by the greedy algorithm, GSwap(K_P_Aprx, 1) and the (P_Rlxd)+IPM scheme are almost the same, while the convex relaxation approach gives the best lower bounds, which help evaluate the approximation errors. In particular, using these bounds, we are able to conclude that the approximation ratio of the greedy solutions K_G (as well as that of GSwap(K_P_Aprx, 1) and (P_Rlxd)+IPM) satisfies

(1 − J(K_G))/(1 − J*) ≥ (1 − J_GU)/(1 − f*_P_Rlxd).

[Figure 3.6: Upper bounds (solid lines) and lower bounds (dashed line) on J*; the global lower bound J(V_α) holds for any K. The ratio bound (1 − J_GU)/(1 − f*_P_Rlxd) (dotted line) is at least 90% for K ≥ 90.]

This ratio bound, depicted by a dotted line in Figure 3.6, is clearly much higher than R_{σ,K} (here σ = 0.99) and the well-known theoretical ratio (1 − 1/e) ≈ 63.21% for the greedy algorithm. For example, the ratio bound is at least 90% for K ≥ 90. Regarding running time, note that the greedy algorithm scales linearly with K, while the convex approach does not; see Figure 3.7. As μ increases, |K_P_Aprx| decreases, and thus so does the running time of GSwap(K_P_Aprx, 1).

[Figure 3.7: CPU run times (s) of the four schemes versus K. The Interior Point Method takes approximately 0.21 s per iteration.]

3.7 Closing Discussion

This section provides further applications and results based on the analysis developed in the previous sections.

3.7.1 Application to Friedkin's Model

Consider Friedkin's model [3] in the presence of two leaders T and Q:

x_i(t+1) = (α_i T + β_i Q + σ_i x_i(0) + ∑_{j∈N_i} w_ij x_j(t))/(α_i + β_i + σ_i + 1),

where, as before, α_i and β_i denote the weights that agent i places on T and Q, respectively, and σ_i represents the stubbornness of agent i in keeping its initial opinion (or internal belief; see also [20, 63]). In matrix form,

x(t+1) = (I + diag(α + β + σ))^{-1} (αT + βQ + σ ∘ x(0) + W x(t)).   (3.52)

Again assuming Q = 1 and T = 0, the equilibrium of the system is given by

x(∞) = (L + diag(α + β + σ))^{-1} (β + σ ∘ x(0)).

Thus, we can define an associated influence optimization problem of "T against Q and the stubborn" as follows:

min_{K⊆V} { b^T x(∞) = b^T (L_{β+σ} + diag(α_K))^{-1} c̃ : |K| ≤ K },

where L_{β+σ} := L + diag(β + σ) and c̃ := β + diag(σ) x(0). Clearly, this problem fits into the general form (P) described in (3.12) and thus can be treated efficiently by the methods developed in this chapter.
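A minimal sketch of the steady-state computation above (all parameter values are hypothetical):

```python
import numpy as np

# Steady-state opinions under Friedkin's model with two leaders (T = 0, Q = 1):
# x(inf) = (L + diag(alpha + beta + sigma))^{-1} (beta + sigma o x(0)).
N = 4
W = np.full((N, N), 1.0 / N)                  # hypothetical row-stochastic W
L = np.eye(N) - W
alpha = np.array([5.0, 0.0, 0.0, 0.0])        # T connects to agent 0 only (assumed)
beta  = np.array([0.0, 2.0, 0.0, 0.0])        # Q connects to agent 1 only (assumed)
sigma = np.full(N, 0.5)                       # stubbornness levels (assumed)
x0 = np.array([0.2, 0.9, 0.4, 0.6])           # initial opinions (assumed)

x_inf = np.linalg.solve(L + np.diag(alpha + beta + sigma), beta + sigma * x0)
print(x_inf)                                  # steady-state opinions, all in (0, 1)
```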
3.7.2 Further Convexity Results

The following theorem builds on the convexity analysis in Section 3.4.

Theorem 3.7.1. Consider the systems (i) x(t+1) = A_u x(t), where A_u = A + diag(u) is a nonnegative matrix, and (ii) x(t+1) = B_u x(t), where B_u = (A + diag(u))^{-1} B, B is a nonnegative matrix, and A + diag(u) is a nonsingular M-matrix. For either system, if x(0) ≥ 0, then for all t ≥ 0, x_i(t) is convex in u.

Proof. [Sketch] For system (i), the conclusion follows from noting that the second differential satisfies d²A_u^{t+2} = 2 ∑_{i+j+k=t} A_u^i diag(du) A_u^j diag(du) A_u^k and applying Lemma 3.4.2. For (ii), let b_ij(u) denote the ij-th element of B_u. Similarly to the convexity proof of f in (3.32), it can be shown that b_ij(u) is positive, convex and decreasing in u. Thus [B_u^t]_ij, as a sum of products of such b_ij(u), is also positive, convex and decreasing in u.

It can be verified that the result for system (ii) in the statement of Theorem 3.7.1 can be applied to both models (3.2) and (3.9), as well as to (3.52). It is also interesting to note that the result for system (i) in the statement of Theorem 3.7.1 is closely related to [125, Lem. 3], which states that for a continuous-time system ẋ(t) = (M + diag(u)) x(t) with x(0) ≥ 0, the function u ↦ x_i(t) = e_i^T e^{(M+diag(u))t} x(0) is convex if M is a Metzler matrix. [Footnote 11: i.e., a matrix whose off-diagonal entries are nonnegative.] Thus, our result for system (i) can be viewed as the discrete-time counterpart. However, we remark that the proof technique developed here is totally different. Moreover, in general, not every real matrix has a real logarithm, let alone a unique one. The connection between these results and applications of Theorem 3.7.1 are left for future work.

3.7.3 Towards Relaxing the Strong Connectivity Assumption

Consider the case where the network G is fixed but not strongly connected. Without loss of generality, assume that G is weakly connected (i.e., not disconnected). Then, for each K, we decompose V = V_K ∪ V_K̄, where V_K denotes the set of agents in G that are reachable from K. Clearly, {T} ∪ V_K forms a spanning tree rooted at T. Moreover, there are directed links from V_K̄ to V_K but not vice versa. As a result, the agents in V_K̄ evolve independently of those in V_K and reach an equilibrium. Thus, the opinion of each agent in V_K also converges to a fixed point, which is a linear combination of T and the final opinions of the agents in V_K̄. In this regard, V_K̄ can be considered as a set of additional leaders for V_K besides T, and the analysis in this chapter can also be applied in this scenario.

Part II: Consensus Prediction by Observer

Chapter 4: Consensus Prediction in Minimum Time

Abstract: This chapter studies an observer that seeks to predict in minimal time the asymptotic agreement value of the agents in a network. The network is governed by the DeGroot opinion dynamics model. The observer can monitor the opinions of a group of agents, but might not have accurate knowledge of the underlying communication graph and the associated weight matrix. The work makes use of and builds on previous work on finite-time consensus to address this prediction problem. In particular, for the case of a single observed agent, a tight lower bound on the monitoring time is determined, below which an observer with limited knowledge about the network is not able to determine the consensus value regardless of the method used. This minimal prediction time can be achieved by employing the minimal polynomial associated with the observed agent. Next, for the general case of an observer with access to multiple agents, a similar bound is conjectured, and we develop algorithms toward achieving this bound through local observations and computations.

4.1 Introduction

In this chapter, we are concerned with the problem of predicting the consensus value of a network implementing a consensus protocol, where the agents exchange information according to a nearest-neighbor weighted averaging scheme. This problem is related to the finite-time consensus problem that has been investigated in the literature (see, e.g., [71, 72]). Building on these contributions, we investigate the minimal observation time that enables an observer to determine the consensus value by monitoring a set of agents in the network. This problem is useful in network monitoring and security.
Moreover, the algorithms developed in this work can also be used to allow the agents themselves to reach consensus in a time that is shorter than the best known results in the literature. As an application, in Chapter 5 we will demonstrate the use of consensus prediction in developing distributed optimization algorithms with many desirable features.

The contributions of this work are as follows. First, we reveal an intrinsic relation between the consensus value and the available observation data, based on which we (i) derive a fundamental limit on the monitoring time for the case of a single observed node, and (ii) provide a conjecture and analysis for the case of multiple observed nodes. Next, we develop algorithms toward achieving the conjectured bounds through local observations and computations.

The rest of the chapter is organized as follows. In Section 4.2 we describe the problem formulation and provide some background on the finite-time consensus protocol developed in [71]. In Section 4.3, we provide the main results on shortest-time prediction of consensus using the notion of a node's minimal polynomial as introduced in [71]. Then in Section 4.4 we develop algorithms for computing minimal polynomials in a distributed manner and in (sub)optimal time. We provide numerical examples and discuss problems for future work in Sections 4.5 and 4.7, respectively.

4.2 Problem Statement and Previous Results

4.2.1 Problem Description

Consider a network consisting of N agents denoted by $\mathcal{V} = \{1, 2, \ldots, N\}$, with the underlying communication characterized by a directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. Let $x_i(t)$ denote the state or opinion of node i at time $t \geq 0$; $x_i(0)$ represents the initial opinion. At any time t, each agent observes the opinions of its neighbors and updates its own opinion following the DeGroot model (1.4) as described in Chapter 1, namely:

$$x_i(t+1) = \sum_{j \in \mathcal{N}_i} w_{ij} x_j(t), \quad \forall t \geq 0, \; \forall i \in \mathcal{V}, \qquad (4.1)$$

where, recall, $\mathcal{N}_i$ denotes the set of agent i's neighbors (including itself) and $W = [w_{ij}]$ the weight matrix. In this chapter, the following is a blanket assumption; it is the combination of Assumptions 1.5.2 and 1.5.3, restated here for convenience.

Assumption 4.2.1. (Network Connectivity and Weight Matrix) The graph G is fixed and strongly connected. The weight matrix W is fixed, row-stochastic and satisfies $w_{ij} > 0$ for $(i, j) \in \mathcal{E}$, $i \neq j$, and $w_{ij} = 0$ otherwise. Moreover, W has at least one positive diagonal element.

Under this assumption, the network asymptotically achieves consensus:

$$\lim_{t\to\infty} x(t) = \mathbf{1}\pi^T x(0), \qquad (4.2)$$

where we recall that $\pi \in \mathbb{R}^N$ is the normalized left Perron eigenvector of W, i.e., $\pi^T W = \pi^T$ and $\mathbf{1}^T \pi = 1$; see Section 1.5.2 for details. Denote by $x^*$ the consensus value, i.e., $x^* = \pi^T x(0)$.

Our problems of interest are as follows. Suppose that there is an observer that might not know W but can monitor the states of m nodes in the network starting from the initial time t = 0. First, for any initial states x(0), how can the observer predict the consensus value $x^*$ in minimum time? Second, which nodes should be observed to minimize the number of time steps needed when more information on the network is available?

Let $\mathcal{O} \subset \mathcal{V}$ denote the set of m nodes selected by the observer. By the observation at time t we mean the vector $x_{\mathcal{O}}(t) \in \mathbb{R}^m$ that collects the states of the observed nodes at time t. The number of consecutive observations (starting from t = 0) that allows the observer to determine $x^*$ is called the observation time. We find it convenient to introduce the following information model.
Define $\Theta(t)$ as the accumulated information about the network that the observer possesses at time t. (Note that $\Theta(t)$ is an equivalence class.) Let $\Theta(-1)$ denote the initial knowledge and assume that the information dynamics satisfies $\Theta(t+1) = \Theta(t) \cup \{x_{\mathcal{O}}(t+1)\}$, implying that the observer accumulates information. As a result, at any time $t \geq 0$, the observer knows $x_{\mathcal{O}}(s)$ for all $s \in [0, t]$.

4.2.2 Previous Results on Consensus in Finite Time

We now recall the method in [71] that enables the agents to exactly calculate the consensus value after running the iteration (4.1) for only finitely many steps. The method is based on the concept of an individual node's minimal polynomial, given below. First, recall that for any square matrix $A \in \mathbb{R}^{n\times n}$, its associated minimal polynomial $q_A$ is the monic polynomial of least degree for which $q_A(A) = 0_{n\times n}$.

Definition 4.2.2. (Minimal polynomial of a node [71]) Given the weight matrix W, the minimal polynomial of node i, denoted by $q_i$, is the monic polynomial of least degree for which $e_i^T q_i(W) = 0_N^T$, where $e_i$ is the i-th standard unit basis vector.

The existence of $q_i$ follows from the fact that $q_W$ satisfies the condition $e_i^T q_W(W) = 0_N^T$. Moreover, $q_i$ is easily seen to be unique by virtue of being a monic polynomial of least degree satisfying this condition. Note also that the degrees $\deg(q_i)$ need not be the same for different $i \in \mathcal{V}$; however, it always holds that $\deg(q_i) \leq \deg(q_W)$, $i \in \mathcal{V}$. Important properties of $q_i$ are given below; see [71] for a proof.

Lemma 4.2.3. (Properties of the minimal polynomial of a node) For each $i \in \mathcal{V}$, $q_i$ divides $q_W$. Moreover, if $\mu$ is a simple eigenvalue of W whose associated eigenvector has all nonzero elements, then $\mu$ is a simple root of $q_i$.

As a consequence, when Assumption 4.2.1 holds, all the roots of $q_i$ are strictly inside the unit circle, except a single root at 1. Denoting $D_i := \deg(q_i) - 1$, the minimal polynomial $q_i$ can be expressed as

$$q_i(\xi) = (\xi - 1)\sum_{0 \le l \le D_i} a_l^{(i)} \xi^l, \qquad (4.3)$$

where $a^{(i)} = [a_0^{(i)}, a_1^{(i)}, \ldots, a_{D_i}^{(i)}]^T$ satisfies

$$a_{D_i}^{(i)} = 1, \qquad \sum_{0 \le l \le D_i} a_l^{(i)} \ne 0. \qquad (4.4)$$

This decomposition of $q_i$ will be useful in determining the consensus value at each node in finitely many iterations, as we briefly describe next; for a full development with all steps given in detail, the reader is referred to [71]. Recall from Definition 4.2.2 that $e_i^T q_i(W) = 0_N^T$. Thus, for $t \geq 0$,

$$0 = e_i^T q_i(W) x(t) \overset{(4.3)}{=} \sum_{0 \le l \le D_i+1} \big(a_{l-1}^{(i)} - a_l^{(i)}\big) e_i^T W^l x(t),$$

where $a_{-1}^{(i)} = a_{D_i+1}^{(i)} = 0$ for convenience of notation. Note that $e_i^T W^l x(t) = x_i(t+l)$. Thus, we have

$$\sum_{0 \le l \le D_i+1} c_l^{(i)} x_i(t+l) = 0, \quad \forall t \ge 0,$$

where $c_l^{(i)} := a_{l-1}^{(i)} - a_l^{(i)}$. Denote by $X_i(z)$ the z-transform of the signal $x_i$. Applying the unilateral z-transform to the equation above and invoking the time-shifting property yields

$$q_i(z) X_i(z) = \sum_{1 \le l \le D_i+1} c_l^{(i)} \sum_{0 \le j \le l-1} x_i(j) z^{l-j}. \qquad (4.5)$$

By the Final Value Theorem (see, e.g., [126]), we then have

$$\lim_{t\to\infty} x_i(t) = \lim_{z\to 1} (z-1) X_i(z) \overset{(4.3)\text{-}(4.5)}{=} \frac{\sum_{l=0}^{D_i} a_l^{(i)} x_i(l)}{\sum_{l=0}^{D_i} a_l^{(i)}}. \qquad (4.6)$$

Note that $\{x_i(0), \ldots, x_i(D_i)\}$ are consecutive state values of agent i. Thus (4.6) implies that agent i can find $\lim_{t\to\infty} x_i(t)$ after $D_i$ iterations of (4.1), provided that $a^{(i)}$ is known. By (4.2), this limit is the consensus value $x^* = \pi^T x(0)$.

Remark 4.2.4. The method presented above can be viewed as each agent being an observer with its own information model: $\Theta_i(t+1) = \Theta_i(t) \cup \{x_i(t)\}$ and $\Theta_i(-1) = \emptyset$. We remark, however, that in general, even in a distributed setting, more local information is available to each agent than just its own state, e.g., $\Theta_i(t+1) = \Theta_i(t) \cup \{x_{\mathcal{N}_i}(t)\}$ (recalling that $\mathcal{N}_i$ denotes the set of direct neighbors of agent i), and $\Theta_i(-1)$ might not be empty.
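The mechanics of (4.6) are easy to demonstrate numerically. The following sketch (our own illustration, not code from [71]) extracts $q_i$ from W by rank-testing powers of W, divides out the root at 1 as in (4.3), and then predicts $x^*$ from $D_i + 1$ samples of a single node:

```python
import numpy as np

# Sketch: compute the minimal polynomial q_i of node i from W, then predict
# the consensus value from D_i + 1 consecutive samples of x_i via (4.6).
rng = np.random.default_rng(1)
N = 6
W = rng.random((N, N)); W /= W.sum(axis=1, keepdims=True)   # row-stochastic, positive
x0 = rng.random(N)

def node_min_poly(W, i):
    """Monic coefficients [c_0, ..., c_deg] (ascending) of q_i: e_i^T q_i(W) = 0."""
    rows = [np.eye(len(W))[i]]              # e_i^T W^0
    while True:
        rows.append(rows[-1] @ W)           # append e_i^T W^k
        M = np.array(rows)
        if np.linalg.matrix_rank(M, tol=1e-9) < len(rows):
            # solve sum_k c_k e_i^T W^k = 0 with c monic in the highest power
            c = np.linalg.lstsq(M[:-1].T, -M[-1], rcond=None)[0]
            return np.append(c, 1.0)

i = 0
c = node_min_poly(W, i)                          # q_i, ascending powers
a = np.polydiv(c[::-1], [1.0, -1.0])[0][::-1]    # q_i(xi) / (xi - 1), cf. (4.3)
D = len(a) - 1                                   # D_i = deg(q_i) - 1
xs = [x0]
for _ in range(D):                               # D_i + 1 observations of node i
    xs.append(W @ xs[-1])
x_pred = sum(a[l] * xs[l][i] for l in range(D + 1)) / a.sum()   # formula (4.6)
pi = np.linalg.matrix_power(W, 500)[0]           # pi^T ~ any row of W^t, t large
print(x_pred, pi @ x0)                           # the two values agree
```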
Remark 4.2.5. Our setting of having just one observer is more general in the sense that the scenario above can be seen as a special case with appropriate choices of observed nodes $\mathcal{O}$ and information model $\Theta(t)$.

Remark 4.2.6. It is obvious from (4.6) that agent i (or the observer that monitors agent i) determines the consensus value $x^*$ as a linear combination of $D_i + 1$ consecutive observations of agent i's state. This is merely a consequence of the use of the minimal polynomial $q_i$, which by no means assures the optimality of $D_i + 1$ a priori. Hence, the following question is also of interest: among all possible methods that the observer may use to find $x^*$, which one is associated with the least observation time?

4.3 Shortest-Time Prediction of Consensus and Local Computation of Minimal Polynomials

This section deals with the question posed in the foregoing remark. It turns out that if $\mathcal{O} = \{i\}$, then the number $D_i + 1$ of observations is optimal for determining $x^*$, for any initial value of the network and over all possible methods. This optimal value can be achieved by using the minimal polynomial $q_i$ as in (4.6). We show this in detail, then present an optimality conjecture for the case of multiple observed nodes and discuss an idea for achieving this minimum observation time through the computation of minimal polynomials.

We first uncover an intrinsic relation between the consensus value and the observation data: if $x^*$ can be computed at some time $r \geq 0$, then $x^*$ is a linear combination of the available observation data, with coefficients depending on W.

Theorem 4.3.1. If $r \in \mathbb{Z}_+$ and $g : \mathbb{R}^{m(r+1)} \to \mathbb{R}$ are such that for any $x(0) \in \mathbb{R}^N$

$$x^* = g(x_{\mathcal{O}}(r), x_{\mathcal{O}}(r-1), \ldots, x_{\mathcal{O}}(0)), \qquad (4.7)$$

then there exist $\beta_0, \beta_1, \ldots, \beta_r \in \mathbb{R}^m$ such that $x^* = \sum_{i=0}^{r} \beta_i^T x_{\mathcal{O}}(i)$.

To prove this result, we make use of the linearity of the dynamical system (4.1) in conjunction with the following lemma, whose proof is an application of the Hahn-Banach theorem.

Lemma 4.3.2. [127, p. 188] Let $f_0, f_1, \ldots, f_n$ be linear functionals on a vector space V and suppose that $f_0(v) = 0$ for every $v \in V$ satisfying $f_i(v) = 0$ for $i = 1, 2, \ldots, n$. Then there are constants $\beta_1, \beta_2, \ldots, \beta_n$ such that $f_0 = \sum_{i=1}^{n} \beta_i f_i$.

Proof of Theorem 4.3.1: Let r and g satisfy (4.7). Define the functions $f_0, f_{i,t} : \mathbb{R}^N \to \mathbb{R}$ for any $t \geq 0$ and $i \in \mathcal{O}$ such that for all $v \in \mathbb{R}^N$

$$f_{i,t}(v) := e_i^T W^t v, \qquad f_0(v) := \lim_{t\to\infty} e_1^T W^t v. \qquad (4.8)$$

That is, if $x(0) = v$, then $f_{i,t}(v) = x_i(t)$ and $f_0(v) = \lim_{t\to\infty} x_1(t) = x^*$, since the network reaches consensus. Clearly, $f_0$ and the $f_{i,t}$ are linear functions on $\mathbb{R}^N$. Next, define $\Omega = \{v \in \mathbb{R}^N \mid f_{i,t}(v) = 0,\ 0 \le t \le r,\ i \in \mathcal{O}\}$. It can be verified that $\Omega$ is a subspace on which $x_{\mathcal{O}}(t) = 0$ for $0 \le t \le r$. We now consider $f_0$ on $\Omega$. It follows from (4.7) that for any $v \in \Omega$ and $\gamma \in \mathbb{R}$

$$f_0(v) = g(\mathbf{0}, \mathbf{0}, \ldots, \mathbf{0}) = f_0(\gamma v) = \gamma f_0(v), \qquad (4.9)$$

where the second equality holds since $\gamma v \in \Omega$, and the last by linearity of $f_0$. As a result, $f_0(v) = 0$ for any $v \in \Omega$. Therefore, by Lemma 4.3.2, we have $f_0 = \sum_{0 \le t \le r,\, i \in \mathcal{O}} \beta_{i,t} f_{i,t}$ for some constants $\beta_{i,t}$. This concludes the proof.

Next, we will employ Theorem 4.3.1 to assess the optimality of using minimal polynomials in consensus prediction.
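Before doing so, we note that the conclusion of Theorem 4.3.1 is easy to verify numerically. The sketch below (our own illustration, with generic random data) recovers coefficients $\beta_t$ for a single observed node with $r = D_i$ and checks that $x^* = \sum_t \beta_t x_i(t)$ for a fresh initial state:

```python
import numpy as np

# Numerical illustration of Theorem 4.3.1 with O = {i} and r = D_i:
# the functional pi^T lies in the span of e_i^T W^t, t = 0, ..., r.
rng = np.random.default_rng(4)
N, i = 6, 0
W = rng.random((N, N)); W /= W.sum(axis=1, keepdims=True)   # row-stochastic, positive
pi = np.linalg.matrix_power(W, 2000)[0]                     # pi^T, cf. (4.2)
r = N - 1                                 # for a generic dense W, D_i = N - 1
A = np.array([np.linalg.matrix_power(W, t)[i] for t in range(r + 1)])   # rows e_i^T W^t
beta = np.linalg.lstsq(A.T, pi, rcond=None)[0]              # pi^T = sum_t beta_t e_i^T W^t
x0 = rng.random(N)
obs = A @ x0                              # observations x_i(0), ..., x_i(r)
print(beta @ obs, pi @ x0)                # the two agree for every x(0)
```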
4.3.1 Optimality of $D_i + 1$

Our main result for the case of a single observed node is as follows.

Theorem 4.3.3. Suppose $\mathcal{O} = \{i\} \subset \mathcal{V}$ and $\Theta(t) = \Theta(t-1) \cup \{x_i(t)\}$ for all $t \geq 0$, where $\Theta(-1)$ may contain any information related to W. Then the observation time is always bounded below by $D_i + 1$, regardless of the method used. Furthermore, this bound can be achieved if $q_i \in \Theta(-1)$.

Proof: Suppose $\mathcal{O} = \{1\}$. We prove the lower bound by contradiction: suppose there exist a positive integer $r < D_1$ and a mapping $g : \mathbb{R}^r \to \mathbb{R}$ such that for any $x(0) \in \mathbb{R}^N$,

$$x^* = g(x_1(r), x_1(r-1), \ldots, x_1(0)). \qquad (4.10)$$

Here, g depends on $\Theta(-1)$. By Theorem 4.3.1, we conclude that there exist $\beta_0, \beta_1, \ldots, \beta_r$ such that $x^* = \sum_{i=0}^{r} \beta_i x_1(i)$. Without loss of generality, assume that $\beta_r \neq 0$ (otherwise, we consider $\beta_{r-1}$ and so on). Define $f_t(v) := e_1^T W^t v$ and $f_0(v) := \lim_{t\to\infty} e_1^T W^t v$ for any $v \in \mathbb{R}^N$. Then

$$f_0 = \sum_{0 \le i \le r} \beta_i f_i. \qquad (4.11)$$

Note that for all $t \in \mathbb{Z}_+$ and $v \in \mathbb{R}^N$, we have $f_0(Wv) = f_0(v)$ and $f_t(Wv) = f_{t+1}(v)$, which in view of (4.11) implies that

$$\sum_{0 \le i \le r} \beta_i f_{i+1} = \sum_{0 \le i \le r} \beta_i f_i \iff e_1^T W^{r+1} + \sum_{1 \le i \le r} \frac{\beta_{i-1} - \beta_i}{\beta_r} e_1^T W^i - \frac{\beta_0}{\beta_r} e_1^T = 0^T.$$

As a result, the polynomial $\tilde{q}_1$ given by $\tilde{q}_1(\xi) = \xi^{r+1} + \sum_{i=1}^{r} \beta_r^{-1}(\beta_{i-1} - \beta_i)\xi^i - \beta_r^{-1}\beta_0$ satisfies

$$e_1^T \tilde{q}_1(W) = 0, \quad \text{with } \deg(\tilde{q}_1) = r + 1 < D_1 + 1,$$

which contradicts the fact that the minimal polynomial $q_1$ of node 1 is of degree $D_1 + 1$. This shows that the observation time is always bounded below by $D_1 + 1$. It remains to show that this bound is achieved if $q_1 \in \Theta(-1)$; this is obvious in view of (4.6).

Remark 4.3.4. As we have shown, the shortest time $D_i + 1$ can be achieved if $q_i \in \Theta(D_i)$, in which case the coefficients $\beta_j$ in Theorem 4.3.1 can be determined from $q_i$ as $\beta_j = a_j^{(i)} / \sum_{l=0}^{D_i} a_l^{(i)}$. In the case where $\Theta(-1) = \emptyset$, $D_i$ and the $a_j^{(i)}$ for $j = 0, 1, \ldots, D_i - 1$ become $D_i + 1$ unknowns characterizing $q_i$, and therefore the observer would need $D_i + 1$ additional observations in order to determine these unknowns.

In the following, we consider the case $m \geq 2$ and are interested in quantifying the minimal observation time, conditioned on the initial information $\Theta(-1)$, in terms of the $q_i$. Although we have not yet been able to determine the minimum time, we conjecture the following.

Conjecture 4.3.5. Suppose the observer can monitor the states of a set $\mathcal{O}$ of m nodes and $\Theta(t) = \Theta(t-1) \cup \{x_{\mathcal{O}}(t)\}$ for all $t \geq 0$. Let $T_{\inf}$ denote the least observation time and let $D_{\min} = \min_{i\in\mathcal{O}} D_i$.
(i) If $\Theta(-1) = \{q_i, \forall i \in \mathcal{O}\}$, then $T_{\inf} = D_{\min} + 1$.
(ii) If $\Theta(-1) = \emptyset$, then $T_{\inf} \geq D_{\min} + 2 + \lceil D_{\min}/m \rceil$. (For any $x \in \mathbb{R}$, $\lceil x \rceil$ denotes the least integer greater than or equal to x.)

Remark 4.3.6. Case (i) can be reasoned as follows. Without loss of generality, let $\mathcal{O} = \{1, \ldots, m\}$ and $D_1 = D_{\min}$. Suppose $T_{\inf} \leq D_1$, i.e., $x^*$ can be found by time $t = D_1 - 1$. By Theorem 4.3.1, $x^*$ is then a linear combination of $\{x_i(k),\ 1 \le i \le m,\ 0 \le k \le D_1 - 1\}$. However, it follows from (4.6) that, at time $t = D_1 - 1$, we have m linear equations

$$x^* = \frac{\sum_{k=0}^{D_i} a_k^{(i)} x_i(k)}{\sum_{k=0}^{D_i} a_k^{(i)}}, \quad \forall i \in \mathcal{O},$$

with at least $m + 1$ unknowns, including $x^*$ and $x_{\mathcal{O}}(D_1)$. Thus, in general, $x^*$ is not computable up to time $t = D_1 - 1$.

Remark 4.3.7. Case (ii) of the conjecture is based on our development in the next section, where the idea is to demonstrate that the lower bound on $T_{\inf}$ can be achieved if $q_k$ with $k = \arg\min_{i\in\mathcal{O}} D_i$ can be computed from the observation data up to that time, and if "ideal conditions" (made precise later) hold.

Remark 4.3.8. With a different assumption on $\Theta(-1)$, it is possible that $T_{\inf} < D_{\min}$.
For example, if $\mathcal{O} = \mathcal{V}$ and $\{\pi\} \subseteq \Theta(-1)$, then $x^* = \pi^T x_{\mathcal{O}}(0)$, i.e., $T_{\inf} = 1$.

4.3.2 Local Computation of $q_i$

The minimal polynomial $q_i$ can be computed locally by agent i in many ways. First, let $c^{(i)} = [c_0^{(i)}, c_1^{(i)}, \ldots, c_{D_i}^{(i)}, 1]^T \in \mathbb{R}^{D_i+2}$ denote the vector of coefficients of $q_i$. Then it follows from the definition that

$$0 = e_i^T q_i(W) = \sum_{k=0}^{D_i+1} c_k^{(i)} e_i^T W^k = (c^{(i)})^T O_{D_i+2}^{(i)}, \qquad (4.12)$$

with $O_{D_i+2}^{(i)} = [e_i \;\; W^T e_i \;\; \cdots \;\; (W^{D_i+1})^T e_i]^T$. Observe that $O_{D_i+2}^{(i)}$ has the form of the observability matrix for the pair $(W, e_i^T)$. Therefore, the observer might be able to compute $c^{(i)}$ by constructing $O_k^{(i)}$ and increasing k until $O_k^{(i)}$ loses rank. This particular value of k equals $D_i + 2$, i.e., rank deficiency occurs at time $t = D_i + 1$. Moreover, it should be noted that the construction of $O_k^{(i)}$ need not require knowledge of the entire network. Specifically, let $\mathcal{N}_k^{(i)}$ denote the set of agents connected to node i through a path of length at most k. Then $O_k^{(i)}$ can be determined using the submatrix of W with columns and rows in $\mathcal{N}_k^{(i)}$. This requires an appropriate dynamic information model: $\Theta(t+1) = \Theta(t) \cup \{x_i(t), e_i^T W^t\}$.

A distributed algorithm for computing $q_i$ was also proposed in [71], where the network performs N runs of (4.1) with different initial conditions $x^{(1)}(0), x^{(2)}(0), \ldots, x^{(N)}(0)$, assumed to be linearly independent, each for $N + 1$ time steps. During each run, every node stores its own values. After N runs, every node is able to construct the matrix

$$X_{i,t} = \begin{bmatrix} x_i^{(1)}(0) & x_i^{(1)}(1) & \cdots & x_i^{(1)}(t+1) \\ x_i^{(2)}(0) & x_i^{(2)}(1) & \cdots & x_i^{(2)}(t+1) \\ \vdots & \vdots & \ddots & \vdots \\ x_i^{(N)}(0) & x_i^{(N)}(1) & \cdots & x_i^{(N)}(t+1) \end{bmatrix}$$

where $x_i^{(j)}(t)$ is the value of node i at time t in the j-th run. Then $D_i$ is the smallest positive integer for which $X_{i,D_i}$ is not of full column rank, and the coefficient vector of $q_i$, denoted by $c^{(i)}$, can be found from $X_{i,D_i} c^{(i)} = 0$; see [71] for details.

In [72] the authors presented another algorithm for computing $q_i$ which also uses solely agent i's state values but requires fewer time steps. In particular, let the network run (4.1) for at most $2N + 1$ time steps, starting from an almost arbitrary initial state $x(0)$ (excluding a set of Lebesgue measure zero in $\mathbb{R}^N$). Each node i constructs its Hankel matrix $H_{i,k}$, defined by setting $z_i(k+1) = x_i(k+1) - x_i(k)$ and

$$H_{i,k} := \begin{bmatrix} z_i(1) & z_i(2) & \cdots & z_i(k+1) \\ z_i(2) & z_i(3) & \cdots & z_i(k+2) \\ \vdots & \vdots & \ddots & \vdots \\ z_i(k+1) & z_i(k+2) & \cdots & z_i(2k+1) \end{bmatrix} \qquad (4.13)$$

and finds the first rank-defective matrix $H_{i,k}$ as k increases, namely $H_{i,D_i}$. Then $a^{(i)}$ is computed from $H_{i,D_i} a^{(i)} = 0$. Although it is hard to characterize the set of initial states $x(0)$ (of measure zero) for which this computation scheme fails to provide $a^{(i)}$, practical techniques to alleviate the problem are available; see, e.g., [128]. More importantly, this approach in general provides a minimum time of $2(D_i + 1)$ for consensus prediction in the scenario where $\Theta(t+1) = \Theta(t) \cup \{x_i(t)\}$ with $\Theta(-1) = \emptyset$. See [72] for further details; a minimal sketch of this scheme is given below.

We remark that the idea of using the Hankel matrix $H_{i,k}$ to compute $a^{(i)}$ in fact has its roots in realization theory [73-75]. Here, finding $q_i$ can be regarded as a network identification problem. Armed with this view, in the next section we build our algorithms on the previous approach by employing block-Hankel matrices in order to reduce the observation time needed to compute $q_i$.
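The promised sketch of the Hankel-based scheme of [72] follows (our own illustrative code, under the generic-initial-state assumption): node i uses only its own samples to find $D_i$, $a^{(i)}$, and then $x^*$.

```python
import numpy as np

# Sketch of the scheme of [72]: build H_{i,k} from one node's trajectory,
# find the first rank-defective H_{i,k} (k = D_i), solve H a = 0, predict x*.
rng = np.random.default_rng(2)
N = 6
W = rng.random((N, N)); W /= W.sum(axis=1, keepdims=True)   # row-stochastic
x = [rng.random(N)]
for _ in range(2 * N + 1):                 # at most 2N + 1 iterations suffice
    x.append(W @ x[-1])
xi = np.array([v[0] for v in x])           # observed node i = 0
z = np.diff(xi)                            # z_i(k+1) = x_i(k+1) - x_i(k)

def hankel(z, k):
    """H_{i,k} as in (4.13), built from z_i(1), ..., z_i(2k+1)."""
    return np.array([[z[r + c] for c in range(k + 1)] for r in range(k + 1)])

k = 1
while np.linalg.matrix_rank(hankel(z, k), tol=1e-9) == k + 1:
    k += 1                                 # first rank deficiency occurs at k = D_i
D = k
_, _, Vt = np.linalg.svd(hankel(z, D))     # a^(i) spans the null space of H_{i,D}
a = Vt[-1] / Vt[-1][-1]                    # normalize so that a_D = 1, cf. (4.4)
x_star = (a @ xi[:D + 1]) / a.sum()        # consensus prediction (4.6)
print(x_star)
```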
4.4 Toward Minimizing Observation Time

In this section, we present partial solutions to the problem described in Section 4.2.1. Clearly, when $m = 1$, it follows from Section 4.3 (see Theorem 4.3.3) that the solution is given by $\mathcal{O} = \arg\min_{i\in\mathcal{V}} D_i$. Thus, in the following we consider $m \geq 2$. Our main idea is to make use of the available information to construct block-Hankel matrices instead of (4.13). We will consider two cases: (1) the minimal polynomials of the observed nodes are identical, i.e., $q_i = q_j$ for all $i, j \in \mathcal{O}$; and (2) they are not, i.e., $q_i \neq q_j$ for some $i, j \in \mathcal{O}$. In either case, we do not assume that $q_i \in \Theta(-1)$ for $i \in \mathcal{O}$.

4.4.1 Observed Nodes with Identical Minimal Polynomials

Let $q_i =: q$, where $q(\xi) = (\xi - 1)\sum_{k=0}^{D} a_k \xi^k$ with $a_D = 1$ and $a := [a_0, a_1, \ldots, a_D]^T \in \mathbb{R}^{D+1}$, which is not assumed to be available to the observer at the initial time. For any subset $\mathcal{S} \subseteq \mathcal{V}$, define $z_{\mathcal{S}}(t) = x_{\mathcal{S}}(t) - x_{\mathcal{S}}(t-1)$ for $t \geq 1$. For any sequence $\{\mathcal{O}_1, \mathcal{O}_2, \ldots, \mathcal{O}_p\}$, define the block-Hankel matrix

$$M_{p,D}(\{\mathcal{O}_i\}) = \begin{bmatrix} z_{\mathcal{O}_1}(1) & z_{\mathcal{O}_1}(2) & \cdots & z_{\mathcal{O}_1}(D+1) \\ z_{\mathcal{O}_2}(2) & z_{\mathcal{O}_2}(3) & \cdots & z_{\mathcal{O}_2}(D+2) \\ \vdots & \vdots & \ddots & \vdots \\ z_{\mathcal{O}_p}(p) & z_{\mathcal{O}_p}(p+1) & \cdots & z_{\mathcal{O}_p}(D+p) \end{bmatrix} \qquad (4.14)$$

Important properties of this matrix are given next.

Theorem 4.4.1. There exist $p \in \mathbb{Z}_+$ and a sequence $\{\mathcal{O}_i \mid i = 1, \ldots, p,\ \mathcal{O}_i \subset \mathcal{O}\}$ such that the following hold:

$$\mathrm{rank}(M_{p,D}(\{\mathcal{O}_i\})) = \mathrm{rank}(M_{p,D-1}(\{\mathcal{O}_i\})) = D, \qquad (4.15)$$
$$M_{p,D}(\{\mathcal{O}_i\})\, a = 0. \qquad (4.16)$$

Proof. It follows from Section 4.3.2 that for all $i \in \mathcal{O}$,

$$\mathrm{rank}(H_{i,D}) = \mathrm{rank}(H_{i,D-1}) = D, \qquad H_{i,D}\, a = 0, \qquad (4.17)$$

where $H_{i,D} \in \mathbb{R}^{(D+1)\times(D+1)}$ is the Hankel matrix given by (4.13). Let $H_{\mathcal{O},D} = [H_{1,D}^T \; H_{2,D}^T \; \cdots \; H_{m,D}^T]^T$. It then follows from (4.17) that $\mathrm{rank}(H_{\mathcal{O},D}) = \mathrm{rank}(H_{\mathcal{O},D-1}) = D$ and $H_{\mathcal{O},D}\, a = 0$. Now choose $p = D + 1$, $\mathcal{O}_i = \mathcal{O}$ for $i = 1, \ldots, p$, and construct $M_{p,D}(\{\mathcal{O}_i\})$ as in (4.14). It is easy to see that $M_{p,D}(\{\mathcal{O}_i\})$ has the same rows as $H_{\mathcal{O},D}$, only in a different order. Thus the claim follows.

As a result, once a sequence $\{\mathcal{O}_i\}_{i=1}^{p}$ satisfying (4.15) is found, a can be determined from (4.16) and (4.4). Then the consensus value $x^*$ can be computed as in (4.6), i.e.,

$$x^* = \frac{\sum_{k=0}^{D} a_k x_i(k)}{\sum_{k=0}^{D} a_k}, \quad \forall i \in \mathcal{O}. \qquad (4.18)$$

It is important to note that the number of time steps needed to construct $M_{p,D}(\{\mathcal{O}_i\})$ in (4.16), denoted by $T_c$, is given by

$$T_c = D + p + 1. \qquad (4.19)$$

Clearly, $T_c \geq D + 2$. Now define $p^* := \min\{p : (4.15) \text{ holds}\}$ and $T^* := D + 1 + p^*$. Thus $T^*$ is the minimum time needed for the observer to compute $x^*$. Note that $p^*$ depends on the choice of $\{\mathcal{O}_i\}_1^{p^*}$. Our next result provides general bounds on this value.

Theorem 4.4.2. The following hold: $D + 1 \geq p^* \geq \lceil D/m \rceil$.

Proof. The first inequality follows from the choice of $p = D + 1$ and the sequence $\{\mathcal{O}_i\}$ used to construct a particular $M_{p,D}(\{\mathcal{O}_i\})$ in the proof of Theorem 4.4.1. We now show the lower bound. For any $p \in \mathbb{Z}_+$ and $\{\mathcal{O}_i\}$ satisfying (4.15), since $\mathrm{rank}(M_{p,D}(\{\mathcal{O}_i\})) = D$ and $\mathcal{O}_i \subseteq \mathcal{O}$ for all $i = 1, \ldots, p$, it follows that

$$D \le \sum_{1 \le i \le p} \mathrm{card}(\mathcal{O}_i) \le \sum_{1 \le i \le p} \mathrm{card}(\mathcal{O}) = pm.$$

Thus $p \geq \lceil D/m \rceil$, and the second inequality follows.

This result implies that $D + 1 + \lceil D/m \rceil \leq T^*$, which agrees with Conjecture 4.3.5; here the lower bound is smaller than that in case (ii) of the conjecture by 1, since achieving it implicitly assumes knowledge of D. We can obtain a sharper bound for any initial state $x(0)$, except for a set of Lebesgue measure zero, as follows.

Proposition 4.4.3. Suppose the self-weights satisfy $w_{ii} > 0$ for all $i \in \mathcal{O}$. Then for any initial state $x(0)$ except for a set of measure zero, the following hold:
(i) If $m \geq D + 1$, then $p^* = 1$.
(ii) If $m \leq D$, then $D + 1 - m \geq p^*$.

Proof. [Sketch] First, note that if $w_{ii} > 0$ for all $i \in \mathcal{O}$, then it can be seen that $\mathrm{rank}(M_{1,D}(\mathcal{O})) = \mathrm{rank}(M_{1,D-1}(\mathcal{O})) = \min(D, m)$ almost surely.
Thus, if $m \geq D + 1$, then $\mathrm{rank}(M_{1,D}(\mathcal{O})) = \mathrm{rank}(M_{1,D-1}(\mathcal{O})) = D$, and hence $p^* = 1$. Now, if $m \leq D$, choose $p = D + 1 - m$ and $\mathcal{O}_1 = \mathcal{O}$. There must exist $j \in \mathcal{O}$ such that if $\mathcal{O}_2 = \cdots = \mathcal{O}_p = \{j\}$, then $\mathrm{rank}(M_{p,D}(\{\mathcal{O}_i\})) = \mathrm{rank}(H_{j,D}) = \mathrm{rank}(H_{j,D-1}) = \mathrm{rank}(M_{p,D-1}(\{\mathcal{O}_i\}))$ for any $x(0)$ except for a set of measure zero. This implies that $p^* \leq p = D + 1 - m$.

Next, we note that the set of sequences $\{\mathcal{O}_i\}_1^p$ that satisfy (4.15) and achieve $p^*$ includes a special one, namely $\{\mathcal{O}_i \mid \mathcal{O}_i = \mathcal{O},\ \forall i = 1, \ldots, p\}$. In fact, by defining $M_{p,d} := M_{p,d}(\{\mathcal{O}_i \mid \mathcal{O}_i = \mathcal{O},\ \forall i = 1, \ldots, p\})$ for any $p, d \geq 1$, we have the following.

Theorem 4.4.4. Suppose $\{\mathcal{O}_i\}_{i=1}^{p}$ is a sequence such that $p \geq p^*$ and $\{\mathcal{O}_i\}_{i=1}^{p^*}$ satisfies (4.15). Then for any $d \geq D$,

$$\mathrm{rank}\big(M_{p,d}(\{\mathcal{O}_i\}_1^p)\big) = \mathrm{rank}(M_{p^*,D-1}) = D. \qquad (4.20)$$

Proof. The proof follows from (4.15) and the definition of $p^*$.

Condition (4.20) allows us to construct Algorithm 4.1 below, to be implemented by the observer to find a and $x^*$, assuming knowledge of D (in addition to the condition that $q_i = q_j$ for all $i, j \in \mathcal{O}$). Starting from $p = 1$, the observer repeatedly increases p and checks whether $\mathrm{rank}(M_{p,D}) = D$, i.e., whether $p^*$ has been found.

Algorithm 4.1: Compute a and $x^*$ for the case of identical minimal polynomials, with knowledge of D
  Data: the set $\mathcal{O}$, $m = \mathrm{card}(\mathcal{O})$, and D
  1: init $p \leftarrow 1$
  2: while $\mathrm{rank}(M_{p,D}) < D$ do
  3:   $p \leftarrow p + 1$
  4: compute a and $x^*$ using (4.16) and (4.18)

In the case where D is not available in advance, we can find D as the first value of d such that

$$d = \mathrm{rank}(M_{p,d-1}) = \mathrm{rank}(M_{p,d}) = \mathrm{rank}(M_{p+1,d}). \qquad (4.21)$$

Here, we want to find the first column-rank-defective matrix $M_{p,d}(\{\mathcal{O}\}^p)$ as p and d increase appropriately. Based on this condition, we propose Algorithm 4.2 to determine a and $x^*$ without knowledge of D.

Algorithm 4.2: Compute a and $x^*$ for the case of identical minimal polynomials, without knowing D
  Data: the set $\mathcal{O}$, $m = \mathrm{card}(\mathcal{O})$
  1: init $d \leftarrow 1$; $p \leftarrow 1$
  2: while (4.21) is not met do
  3:   increase d and/or p
  4: compute a and $x^*$ using (4.16) and (4.18)

Remark 4.4.5. Algorithm 4.2 requires observation time $T^* = D + 2 + p^* \geq D + 2 + \lceil D/m \rceil$, since it uses $M_{p+1,D}$. Moreover, when $m = 1$, the algorithm coincides with that of [72], summarized in Subsection 4.3.2 above.

4.4.2 Observed Nodes with Different Minimal Polynomials

For any $\mathcal{S} \subseteq \mathcal{O}$, let $q_{\mathcal{S}}$ denote the least common multiple of $\{q_i, i \in \mathcal{S}\}$, which can be regarded as the joint minimal polynomial of the set $\mathcal{S}$. Define $D_{\mathcal{S}} := \deg(q_{\mathcal{S}}) - 1$. Since 1 is a simple root of each $q_i$, it is also a simple root of $q_{\mathcal{S}}$. Hence, there exists $a \in \mathbb{R}^{D_{\mathcal{S}}+1}$ such that

$$q_{\mathcal{S}}(\xi) = (\xi - 1)\sum_{0 \le k \le D_{\mathcal{S}}} a_k \xi^k, \qquad \sum_{0 \le k \le D_{\mathcal{S}}} a_k \ne 0, \qquad a_{D_{\mathcal{S}}} = 1.$$

Here a also depends on $\mathcal{S}$. Now, using Algorithm 4.1 or 4.2 above, the observer can determine a and thus $x^*$ as if $q_i = q_j = q_{\mathcal{S}}$, which requires a minimum observation time denoted by $T^*(\mathcal{S})$. Therefore, for a given set $\mathcal{O}$, the (sub)optimal observation time is

$$T_{\mathcal{O}}^* = \min\{T^*(\mathcal{S}) : \mathcal{S} \subseteq \mathcal{O}\}. \qquad (4.22)$$

This is clearly a combinatorial problem, whose solution may be hard to find exactly, especially when $\Theta(-1) = \emptyset$. If $q_i \in \Theta(-1)$, then we can resort to a greedy algorithm. In any case, $T_{\mathcal{O}}^*$ is upper bounded by $\min_{i\in\mathcal{O}}\{T^*(\{i\})\}$ and by $T^*(\mathcal{O})$, which are easier to compute.

To conclude this section, we remark that in the algorithms in [71, 72], each agent i uses only its own opinion history to compute $x^*$, and thus the best observation time is $2D_i + 2$. Our results assert that if each agent i also functions as an observer, then the consensus value can be predicted in fewer iterations.
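For concreteness, the following is a compact sketch of Algorithm 4.1 (our own illustrative code), run on the ring network that will reappear in Example 4.5.1 below, for which D = 5 and the observed nodes have identical minimal polynomials:

```python
import numpy as np

# Sketch of Algorithm 4.1: grow the block-Hankel matrix M_{p,D} of (4.14),
# with O_1 = ... = O_p = O, until its rank reaches D; then recover a and x*.
rng = np.random.default_rng(3)
N, D = 10, 5                                 # ring of Example 4.5.1: D = 5
W = 0.8 * np.eye(N) + 0.1 * (np.roll(np.eye(N), 1, 1) + np.roll(np.eye(N), -1, 1))
O = [0, 1, 2]                                # m = 3 observed nodes
x = [rng.random(N)]
for _ in range(3 * N):
    x.append(W @ x[-1])
z = np.diff(np.array(x)[:, O], axis=0)       # z[k] = z_O(k+1), one row per time step

def M(p, d):
    """Block-Hankel M_{p,d} with all O_i = O, cf. (4.14)."""
    return np.vstack([z[i : i + d + 1, :].T for i in range(p)])

p = 1
while np.linalg.matrix_rank(M(p, D), tol=1e-9) < D:
    p += 1                                   # p* = smallest p with rank D
_, _, Vt = np.linalg.svd(M(p, D))
a = Vt[-1] / Vt[-1][-1]                      # M a = 0, normalized so a_D = 1
x_star = (a @ np.array(x)[:D + 1, O[0]]) / a.sum()    # prediction (4.18)
print(f"p* = {p}, observation time T = {D + 1 + p}, x* = {x_star:.4f}")
```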
4.5 Numerical Examples

4.5.1 Example 1: Network with Identical Minimal Polynomials

Consider a ring network of N = 10 agents with the circulant weight matrix

$$W = \begin{bmatrix} .8 & .1 & 0 & \cdots & 0 & .1 \\ .1 & .8 & .1 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ .1 & 0 & 0 & \cdots & .1 & .8 \end{bmatrix} \in \mathbb{R}^{10\times 10}$$

and with (randomly generated) initial opinions

$$x(0) = [0.9797, 0.2848, 0.5949, 0.9621, 0.1857, 0.1930, 0.3416, 0.9329, 0.3906, 0.2732]^T.$$

It can be seen that $\pi = \frac{1}{10}\mathbf{1}$, and thus the consensus value is $x^* = \pi^T x(0) = 0.5139$. Moreover, $q_i = q_j$ for all $i, j \in \mathcal{V}$, due to the symmetry of the network and the weight matrix.

First, consider the scenario where each node $i \in \mathcal{V}$ wishes to compute $x^*$ from its local information and $\Theta_i(-1) = \emptyset$. Using the algorithm in [72], any agent can find D = 5 and compute $x^*$ after $2D + 2 = 12$ time steps, whereas, by using Algorithm 4.2, each agent (having 2 neighbors, hence m = 3) can find D = 5 and compute $x^*$ after $T_c = 9$ time steps; see also Remark 4.4.5.

Next, consider the case where an observer knows D and can monitor m agents in the network. The results in Table 4.1 hold for any choice of $\mathcal{O}$. (Note also that $T_c \geq D + 2$; see (4.19).) Here m = D is the smallest number of observed nodes that also gives the minimum observation time.

Table 4.1: Observation times using Algorithm 4.1.
m     1   2   3   4   5   6   7   8   9   10
T_c   11  9   8   8   7   7   7   7   7   7

4.5.2 Example 2: Network with Different Minimal Polynomials

Consider the graph given in Figure 4.1 (from [71]). In this example,

$$q_1(\xi) = q_2(\xi) = q_3(\xi) = (\xi - 1)(\xi - \lambda_1)(\xi - \lambda_2)(\xi - \lambda_3),$$
$$q_4(\xi) = (\xi - \lambda_4)\, q_1(\xi), \qquad q_5(\xi) = q_6(\xi) = (\xi - \lambda_5)\, q_4(\xi).$$

Figure 4.1: Network example 2. Self-weights are not shown.

In Table 4.2, we compare the observation times obtained by the algorithm in [72] with those obtained by Algorithm 4.2, where each node monitors its neighbors and naively uses Algorithm 4.2. It is interesting to see that the observation time of node 1 is longer than that of node 2, although node 1 has more neighbors. The reason is that node 1 uses information from nodes 5 and 6, which have the largest observation time among all nodes (or, to be precise, the joint minimal polynomial of nodes 1, 5 and 6 is $q_6$, which is of highest degree).

Table 4.2: Observation time for each node to compute the consensus value in Example 2.
Node   T_c by [72]   T_c by Alg. 4.2   Observed nodes
1      8             8                 {1,2,4,5,6}
2      8             6                 {1,2,3}
3      8             7                 {2,3}
4      10            9                 {1,4}
5      12            10                {1,5,6}
6      12            10                {1,5,6}

Finally, suppose that the observer is able to select any m nodes to monitor. Let $\mathcal{O}^* = \arg\min_{\mathcal{O}\subseteq\mathcal{V},\, \mathrm{card}(\mathcal{O})=m} T_{\mathcal{O}}^*$, i.e., the set that gives the optimal observation time. The results in Table 4.3 were obtained by brute-force computation. Thus, the best choice is $\mathcal{O}^* = \{1, 2, 3\}$, which attains the minimum time $T^* = 6$ while monitoring only 3 nodes.

Table 4.3: Optimal time $T^*$ when the observer can choose any m nodes.
m   T*   O*
1   8    {1}, {2}, {3}
2   7    {1, 2}
3   6    {1, 2, 3}
4   6    {1, 2, 3, 4}
5   6    {1, 2, 3, 4, 5}
6   6    {1, 2, 3, 4, 5, 6}

4.6 Toward Selecting Observed Nodes

Recall that the second question in Section 4.2.1 concerns the optimal selection of observed nodes. Based on the previous section, a (sub)optimal solution to this problem is given by $\mathcal{O}^* = \arg\min_{\mathcal{O}\subseteq\mathcal{V},\, \mathrm{card}(\mathcal{O})=m} T_{\mathcal{O}}^*$. The exact optimal solution is not obvious and is left for future work. Here, instead, we give heuristic descriptions of $\mathcal{O}^*$, including: (A) the $q_i$ should be similar; (B) the $\deg(q_i)$ should be small; (C) $p^*$ should be close to $\frac{D_{\mathcal{O}^*}}{m}$. Note that the degree of $q_i$ and the relationships between $q_i$ and $q_j$ depend not only on the structure of the network, but also on the weight matrix.
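These dependencies can be probed numerically. By Proposition 4.6.1 below, $\deg(q_i)$ equals the rank of the observability matrix of the pair $(e_i^T, W)$ in (4.12); the following sketch (our own illustration) computes it for every node of the ring of Example 4.5.1:

```python
import numpy as np

def min_poly_degree(W, i, tol=1e-9):
    """deg(q_i): the rank of the observability matrix of (e_i^T, W), cf. (4.12)."""
    N = len(W)
    rows = [np.eye(N)[i]]
    for _ in range(N - 1):
        rows.append(rows[-1] @ W)          # e_i^T W^k, k = 0, ..., N-1
    return np.linalg.matrix_rank(np.array(rows), tol=tol)

# ring of Example 4.5.1: all degrees are equal, with deg(q_i) = 6, i.e., D_i = 5
N = 10
W = 0.8 * np.eye(N) + 0.1 * (np.roll(np.eye(N), 1, 1) + np.roll(np.eye(N), -1, 1))
print([min_poly_degree(W, i) for i in range(N)])
```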
To have a closer look at the minimal polynomial of a node, let us consider the following. Let $c^{(i)} = [c_0^{(i)}, c_1^{(i)}, \ldots, c_{D_i}^{(i)}, 1]^T \in \mathbb{R}^{D_i+2}$ be the vector of coefficients of $q_i$, i.e., $q_i(\xi) = \xi^{D_i+1} + \sum_{k=0}^{D_i} c_k^{(i)} \xi^k$. From the definition,

$$0^T = e_i^T q_i(W) = \sum_{k=0}^{D_i+1} c_k^{(i)} e_i^T W^k = (c^{(i)})^T \begin{bmatrix} e_i^T \\ e_i^T W \\ \vdots \\ e_i^T W^{D_i+1} \end{bmatrix} \qquad (4.23)$$

Thus, it can be shown that:

Proposition 4.6.1. $\deg(q_i)$ is the observability index of the pair $(e_i^T, W)$.

Let us revisit Example 2 of the previous section. We keep the network structure but consider the following two weight matrices: $W_1$ corresponds to equal neighbor weights, and $W_2$ is randomly generated:

$$W_1 = \begin{bmatrix} 1/5 & 1/5 & 0 & 1/5 & 1/5 & 1/5 \\ 1/3 & 1/3 & 1/3 & 0 & 0 & 0 \\ 0 & 1/2 & 1/2 & 0 & 0 & 0 \\ 1/2 & 0 & 0 & 1/2 & 0 & 0 \\ 1/3 & 0 & 0 & 0 & 1/3 & 1/3 \\ 1/3 & 0 & 0 & 0 & 1/3 & 1/3 \end{bmatrix} \qquad W_2 = \begin{bmatrix} .19 & .24 & 0 & .13 & .17 & .27 \\ .54 & .15 & .31 & 0 & 0 & 0 \\ 0 & .27 & .73 & 0 & 0 & 0 \\ .63 & 0 & 0 & .37 & 0 & 0 \\ .22 & 0 & 0 & 0 & .35 & .45 \\ .32 & 0 & 0 & 0 & .28 & .40 \end{bmatrix}$$

In the case of $W_1$, we have $q_1 = q_2 = q_3 = q_4$ with $\deg(q_1) = 5$, and $q_5 = q_6 = q_{W_1}$ with $\deg(q_5) = 6$. In the case of $W_2$, we have $q_1 = q_2 = \cdots = q_6 = q_{W_2}$ with $\deg(q_{W_2}) = 6$.

Clearly, changing the agents' weights can change the agents' minimal polynomials as well as their degrees. However, it is also apparent that certain properties of $q_i$ are pertinent to the network structure and are thus related to the concept of structural observability. This direction of investigation is left for future work.

In the following, we restrict ourselves to the class of undirected graphs and explore necessary and/or sufficient conditions, in graph-theoretic terms, for meeting descriptions (A)-(C) above. We also assume that the weight matrix is given by

$$W = I - \epsilon L \qquad (4.24)$$

where $L := D_{in} - A$ is the Laplacian matrix, A is the adjacency matrix, $D_{in} = \mathrm{diag}(A\mathbf{1})$ is the in-degree matrix, and $\epsilon \in (0, \min_i [D_{in}]_{ii}^{-1})$ (which ensures that W is a positive weight matrix). Next we introduce some graph notions [129].

Definition 4.6.2. (Automorphism) An automorphism of $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is a permutation $\psi$ of $\mathcal{V}$ such that

$$(\psi(i), \psi(j)) \in \mathcal{E} \iff (i, j) \in \mathcal{E}. \qquad (4.25)$$

Proposition 4.6.3. ([129]) Let A be the adjacency matrix of the graph G and $\psi$ a permutation on its node set $\mathcal{V}$. Associate with this permutation the permutation matrix P. Then $\psi$ is an automorphism of G if and only if $PA = AP$.

In the following, a partition of the graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is denoted by $\mathcal{C} = \{\mathcal{C}_1, \ldots, \mathcal{C}_k\}$ for some appropriate k, where the $\mathcal{C}_i$ are called cells.

Definition 4.6.4. (Almost Equitable Partition, AEP) Suppose $\mathcal{C} = \{\mathcal{C}_1, \ldots, \mathcal{C}_k\}$ is a partition of a graph G.
- C is said to be almost equitable if each node in $\mathcal{C}_i$ has the same number of neighbors in $\mathcal{C}_j$, for all $i, j \in \{1, \ldots, k\}$, $i \neq j$.
- C is said to be almost equitable w.r.t. node v if C is an AEP and $\{v\} \in \mathcal{C}$.
- The minimum AEP w.r.t. node v, denoted by $\mathcal{C}_v^*$, is an AEP such that $\{v\} \in \mathcal{C}$ and $\mathrm{card}(\mathcal{C}_v^*)$ is minimal.

Definition 4.6.5. (Distance Regular Graph) An undirected graph G is said to be regular if $\deg(i) = \deg(j)$ for all $i, j \in \mathcal{V}$. It is called distance-regular if it is regular and, for any pair of nodes $u, v \in \mathcal{V}$ with $\mathrm{dist}(u, v) = i$, $0 < i < \mathrm{diam}(\mathcal{G})$, there exist numbers $f_i$ and $g_i$ such that v has exactly $f_i$ neighbors at distance $i - 1$ from u and $g_i$ neighbors at distance $i + 1$ from u.

4.6.1 When Is $q_i = q_j$?

We have the following result.

Proposition 4.6.6. If there exists an automorphism $\psi$ of G such that $\psi(i) = j$, then $q_i = q_j$. The converse is not true.

Proof. Suppose there exists an automorphism $\psi$ of G such that $\psi(i) = j$.
Let P be the permutation matrix associated with $\psi$. By Proposition 4.6.3, we have $P^T A = A P^T$ and thus $P^T D_{in} = D_{in} P^T$. Then

$$W P^T = (I - \epsilon(D_{in} - A)) P^T = P^T - \epsilon(D_{in} P^T - A P^T) = P^T - \epsilon(P^T D_{in} - P^T A) = P^T W.$$

From this, it can be shown that $W^k P^T = P^T W^k$ for any integer $k \geq 1$. Note also that $P e_i = e_j$. Then, multiplying both sides of (4.23) by $P^T$ yields

$$0^T = (c^{(i)})^T \begin{bmatrix} e_i^T P^T \\ e_i^T W P^T \\ \vdots \\ e_i^T W^{D_i+1} P^T \end{bmatrix} = (c^{(i)})^T \begin{bmatrix} e_i^T P^T \\ e_i^T P^T W \\ \vdots \\ e_i^T P^T W^{D_i+1} \end{bmatrix} = (c^{(i)})^T \begin{bmatrix} e_j^T \\ e_j^T W \\ \vdots \\ e_j^T W^{D_i+1} \end{bmatrix}$$

Therefore, $e_j^T q_i(W) = 0$. Since $q_j$ is the minimal polynomial of node j, it follows that $q_j \mid q_i$. Next, we note that since $\psi$ is an automorphism, $\psi^{-1}$ exists and corresponds to the permutation matrix $P^T$. That is, $\psi^{-1}(j) = i$, or $P^T e_j = e_i$. Applying the same argument as above, we obtain $q_i \mid q_j$. Therefore, $q_i = q_j$. The converse is not true; see Section 4.5.2 for a counterexample, where $q_1 = q_2$ but there is no automorphism mapping node 1 to node 2.

This proposition only provides a sufficient condition. Consider again Example 4.5.2: using this proposition, we can only conclude that $q_5 = q_6$, but not that $q_1 = q_2 = q_3$. The result is useful when the graph is highly symmetric; e.g., for the ring example in Subsection 4.5.1, it immediately yields $q_i = q_j$ for all $i, j \in \mathcal{V}$. We will show in the next subsection that for distance regular graphs, all the agents' minimal polynomials are the same; see Proposition 4.6.9.

4.6.2 Bounds on $\deg(q_i)$

Using the fact that $\deg(q_i)$ is the observability index of the pair $(e_i^T, W)$, we can obtain bounds on the degree of $q_i$ as follows.

Theorem 4.6.7. [72, 130, 131] If G is connected and undirected, it holds that

$$\mathrm{diam}(\mathcal{G}, i) + 1 \le \deg(q_i) \le \mathrm{card}(\mathcal{C}_i^*) \qquad (4.26)$$

where $\mathrm{diam}(\mathcal{G}, i) = \max_{v\in\mathcal{V}} \mathrm{dist}(i, v)$ is the longest distance from node i to any other node, and $\mathcal{C}_i^*$ denotes the minimum almost equitable partition w.r.t. node i.

In some special cases, for example distance regular graphs, the upper and lower bounds coincide. Examples of distance regular graphs include cycles, hypercubes and complete graphs.

Theorem 4.6.8. [132] If G is a distance regular graph, then

$$\mathrm{diam}(\mathcal{G}) + 1 = \deg(q_i) = \mathrm{card}(\mathcal{C}_i^*), \quad \forall i \in \mathcal{V}. \qquad (4.27)$$

Based on this result, we have the following.

Proposition 4.6.9. If G is a distance regular graph, then

$$q_i = q_j, \quad \forall i, j \in \mathcal{V}. \qquad (4.28)$$

Proof. It is well known that the adjacency matrix A of a distance regular graph has $\mathrm{diam}(\mathcal{G}) + 1$ distinct eigenvalues [129]. Thus, so does W, since $W = I - \epsilon(D_{in} - A) = (I - \epsilon D_{in}) + \epsilon A$ and $D_{in} = f_0 I$ for some $f_0$. Moreover, the minimal polynomial $q_W$ of W is of degree $\mathrm{diam}(\mathcal{G}) + 1$ with $\mathrm{diam}(\mathcal{G}) + 1$ distinct roots. Now, by Theorem 4.6.8, $\deg(q_i) = D_i + 1 = \mathrm{diam}(\mathcal{G}) + 1$ for all $i \in \mathcal{V}$. Moreover, $q_i \mid q_W$ for all $i \in \mathcal{V}$. Therefore, $q_i = q_W$ for all $i \in \mathcal{V}$, and the proposition follows.

4.7 Limitations and Future Work

The main drawbacks of the method presented in this chapter are as follows. First, it can only be applied to networks with fixed topology and linear time-invariant dynamics. Second, the computational accuracy of the rank of a (block-)Hankel matrix does not scale well with the size of the matrix. Third, since the predicted value is a limit as time tends to infinity, the accuracy is sensitive to computational error, and the method may not work well in the presence of observation noise and/or communication delays. Extensions and further developments to overcome these limitations are thus important directions for future work.
Among other possible directions for future work, besides resolving the validity of Conjecture 4.3.5, we note a possible application of consensus prediction in network monitoring for misbehavior; see Section 7.2.

Part III: Distributed Optimization

Chapter 5: Local Prediction for Enhanced Convergence of Distributed Optimization Algorithms

Abstract: This chapter studies distributed optimization problems where a network of agents seeks to minimize the sum of their private cost functions. Algorithms are proposed that build on past consensus-based distributed optimization algorithms by incorporating a local predictive step as developed in Chapter 4. The algorithms involve the introduction of local optimization variables at the network nodes, alongside the original local node states. In the first algorithm, the local optimization variables are updated cyclically through a subgradient step, while the opinion variables follow a traditional consensus protocol periodically interrupted by a predictive consensus estimate reset operation. For convex cost functions with bounded subdifferentials, this algorithm is guaranteed to converge to within some range of the optimal value if a constant step size is used, or to the optimal value if a diminishing step size is in place. For differentiable cost functions whose sum is convex and has a Lipschitz continuous gradient, convergence to the optimal value can be ensured when using a constant step size, even if some of the individual cost functions are nonconvex. In addition, exponential convergence to the optimal solution is achieved when the private cost functions are further assumed to be strongly convex. In these cases, each optimization variable behaves like the centralized subgradient method, except on a slower time scale. The last two algorithms are specialized to the case of quadratic cost functions and converge in finite time to the optimal solution or to a neighborhood of arbitrarily small size. Simulation examples are given to illustrate the algorithms.

5.1 Introduction

We consider a network of N agents, without a central coordinating unit, aiming to cooperatively solve the global optimization problem

$$\min_{x \in X} F(x) := \sum_{i=1}^{N} f_i(x) \qquad (5.1)$$

where $f_i : \mathbb{R} \to \mathbb{R}$ represents the private cost function of agent i, X is a nonempty constraint set known to all the agents, and it is assumed that each agent is able to communicate with its direct neighbors.

Solving this problem in a distributed fashion calls for strategies of cooperation among all the agents in the network. In this regard, many distributed algorithms have been developed; see, e.g., [32, 42, 76-78, 80-85, 133] and references therein. Among them, the class of distributed (sub)gradient-based algorithms is well known for its simplicity of implementation and the generally mild assumptions imposed on the local cost functions and the network topology. In particular, this class requires each function $f_i$ to be (at least) convex, and usually with bounded subgradients or Lipschitz continuous gradient.

Major limitations of algorithms in this category are also well known. First, the convergence of many algorithms depends on the choice of step size sequence. When a constant step size is used, both Distributed Gradient Descent and Distributed Subgradient methods only yield convergence to a neighborhood of the optimal solution and of the optimal value [78, 86].
This occurs even if the $f_i$ are strongly convex and have Lipschitz continuous gradients, and is thus one of the main differences between these methods and their centralized counterparts. This motivates the use of particular diminishing or adaptive step sizes to achieve asymptotic convergence. However, the convergence rate can then be very slow (compared to that of the centralized method), depending on the step size sequence, whose appropriate selection is not trivial. Second, many incremental subgradient methods require the agents to construct a closed cycle in order to pass an estimate of the solution around the network; see, e.g., [32, 88, 89]. Third, even when asymptotic convergence is guaranteed, it is not obvious how each agent can locally decide when to stop the algorithm without affecting other agents' estimates. Put differently, there are no simple criteria for all the agents to stop at the same time while also sharing the same estimate of an optimal solution. This is also true for most (if not all) other distributed optimization methods.

When all the local cost functions are quadratic, many other consensus-based algorithms can outperform those in the subgradient class. For example, the ratio consensus method can be used to solve problem (5.1) without constraints and converges exponentially [34, 90]. Based on this method, [91] proposed a Newton-Raphson-like algorithm which also converges asymptotically for a class of functions having continuous, strictly positive and bounded second derivatives, assuming a sufficiently small discretization step.

Our main contributions are as follows. We propose and study a new distributed optimization technique for solving (5.1) on a fixed and directed network; the new technique involves the use of a distributed prediction scheme based on the node minimal polynomials introduced in Chapter 4. Specifically, we first present a distributed subgradient-type algorithm for the general setting of problem (5.1) but without constraints (i.e., $X = \mathbb{R}$), which we show has a convergence rate similar to that of the centralized (sub)gradient method. In fact, the convergence rate to the optimal value is $O(\ln(t)/\sqrt{t})$ under a diminishing step size for the case where the cost functions $f_i$ are convex and have bounded subgradients, while for the case in which the total cost function F is convex and has a Lipschitz continuous gradient, it is $O(1/t)$ under a constant step size. In the former case, an optimal choice of the step size can yield a rate $O(1/\sqrt{t})$, which is the best achievable for both centralized and distributed subgradient methods [117, 134]. In the latter case, if F is further assumed to be strongly convex, then we obtain exponential convergence both to the optimal value and to the optimal solution. The performance of our algorithm also resembles that of the centralized subgradient method in that all the agents, in finite time, agree on an identical estimate of a solution and continue to agree thereafter (possibly approaching the global optimal solution), solving problem (5.1) as if they all knew the global function F. Moreover, this algorithm is among the very few of gradient type that can deal with the case where some of the local cost functions are nonconvex, as long as the total cost F is convex and has a Lipschitz continuous gradient. Next, with some modifications, we also extend the algorithm to cope with constraints $x \in X$ and non-column-stochasticity of the weight matrix.
Finally, in the case where all the functions $f_i$ are quadratic and the problem is unconstrained, we construct two algorithms, one of which is ratio-consensus-like and converges in finite time (the fastest convergence achieved to date among distributed optimization methods), while the other is a gradient-based algorithm that achieves near finite-time convergence. The convergence times of our algorithms scale at most linearly with the network size. In fact, they are linear in the maximum degree of the agents' minimal polynomials, which ranges between the diameter and the size of the network.

For comparison, we report here known convergence rates/times of other distributed (sub)gradient methods. First, for convex cost functions with bounded subgradients, [135] shows that the distributed proximal-gradient method under a diminishing step size $O(1/\sqrt{t})$ achieves a rate of $O(\ln(t)/\sqrt{t})$. A similar convergence rate was obtained for the dual averaging method in [81] and for the subgradient-push method in [85]. The recent work [134] presents an algorithm with convergence time linear in the network size, and states that it is the best convergence time so far for problem (5.1) with non-differentiable convex functions. Our first algorithm admits an even better convergence time. Second, for convex cost functions with Lipschitz continuous gradient, the algorithm in [136] uses a second-order update at each iteration, which yields a convergence rate of $O(1/t)$ in terms of the best running violation of the first-order optimality condition under a fixed step size. Note that our first algorithm attains the same rate, but in terms of the objective error. Third, for convex cost functions with bounded and Lipschitz continuous gradients, [87] proposed two fast distributed gradient methods; one converges at rate $O(\ln(t)/t)$ under a diminishing step size, while the other achieves $O(1/t^2)$ convergence through the use of an inner consensus loop and Nesterov's acceleration technique (see [117, Chap. 2]). Finally, when the global cost function F is strongly convex and all $f_i$ have Lipschitz continuous gradients, the algorithm in [136] also converges at a linear rate.

The rest of the chapter is organized as follows. Section 5.2 contains the problem formulation and some background on subgradient methods and the finite-time consensus prediction introduced in Chapter 4. The main algorithm and convergence results for general cost functions (some possibly nonconvex) are given in Section 5.3. Performance limits of our algorithms are discussed in Section 5.5, followed by simulation examples in Section 5.6 to illustrate the algorithms. Finally, concluding remarks are given in Section 5.7. Most proofs are given in Appendix A.3.

5.2 Problem Statement and Background

5.2.1 Problem Statement

Consider a network consisting of N agents, where the underlying communication is characterized by a directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. The objective of all the agents is to solve problem (5.1), repeated here for convenience:

$$\min_{x \in X} F(x) = \sum_{i=1}^{N} f_i(x), \qquad (5.2)$$

where $X \subseteq \mathbb{R}$ is the constraint set. Note that here we take the variable x to be scalar for simplicity of notation; the case of a vector variable can be treated following the same steps. Let $F^*$ and $X^*$ denote the optimal value and the optimal solution set, respectively, i.e., $X^* = \{x^* \in X : F(x^*) = F^* := \min_{x\in X} F(x)\}$. The following is a blanket assumption.

Assumption 5.2.1. The set X is convex and $X^*$ is nonempty.
In our setting, agent i only has access to $f_i$ and to local information on its neighbors' opinions, and no central coordinating node is assumed to exist. Thus, the agents need to collaborate in a distributed manner to solve problem (5.1). This involves local iterative computation along with information diffusion. We are interested in the scenario where the communication graph G connecting the agents is directed and fixed. We make the following additional blanket assumption.

Assumption 5.2.2. (Network Connectivity and Weight Matrix) The graph G is fixed and strongly connected. The weight matrix W is fixed, row-stochastic and satisfies $w_{ij} > 0$ for $(i, j) \in \mathcal{E}$, $i \neq j$, and $w_{ij} = 0$ otherwise. Moreover, W has at least one positive diagonal element.

5.2.2 Subgradient Methods

Subgradient methods are the simplest numerical algorithms for solving problem (5.1) when each function $f_i$ is convex. Recall that for the unconstrained problem (i.e., $X = \mathbb{R}$), the centralized method is based on the iteration (see, e.g., [115, Chap. 6], [117, Chap. 3])

$$x(t+1) = x(t) - \gamma(t)\, g(x(t)), \qquad (5.3)$$

where $\gamma(t)$ is the step size at iteration t and $g(x(t))$ is a subgradient of F at $x(t)$, i.e., $g(x(t)) \in \partial F(x(t))$. In the special case where F is differentiable, $g(x(t)) = \nabla F(x(t))$ and (5.3) reduces to centralized gradient descent.

In the distributed setting described above, subgradient methods can take many forms (see, e.g., [76, 78]), one of which is as follows. Every agent has its own estimate of an optimal solution and updates it at each iteration by combining a consensus step with a local optimization step:

$$x_i(t+1) = \sum_{j \in \mathcal{N}_i} w_{ij} x_j(t) - \gamma(t)\, g_i(x_i(t)), \qquad (5.4)$$

where $\mathcal{N}_i = \{j \in \mathcal{V} : (i, j) \in \mathcal{E}\}$ is the set of in-neighbors of node i (including i itself), $\gamma(t)$ is the step size known to all agents, and $g_i(x_i(t)) \in \partial f_i(x_i(t))$. This algorithm is usually referred to as the Distributed Subgradient Method, or DSM for short. A modified version, the Distributed Projected Subgradient method or DPS (see, e.g., [137]), is used when the problem includes a global constraint $x \in X$. In DPS,

$$x_i(t+1) = P_X\Big(\sum_{j \in \mathcal{N}_i} w_{ij} x_j(t) - \gamma(t)\, g_i(x_i(t))\Big), \qquad (5.5)$$

where $P_X$ denotes the projection operator onto the set X (assuming that each agent is able to perform this operation). The convergence of these distributed methods depends on the step size sequence and the weight matrix W. Unlike the centralized version, these algorithms are only guaranteed to converge to an error neighborhood of the optimal solution when the $\nabla f_i$ are Lipschitz continuous and a constant step size is used (see, e.g., [78, 86]).

In the following, we develop new distributed algorithms based on (5.4) that can achieve faster convergence and may not require a diminishing step size; convergence can even occur in finite time for quadratic cost functions. Moreover, the convergence is similar to that of the centralized subgradient algorithm in the sense that the agents are able to stop the algorithm at the same time with identical estimates of an optimal solution. The main novelty introduced in this chapter is to take advantage of the finite-time consensus protocol introduced in [71] (described in the next subsection), resulting in a modified version of algorithm (5.4) that enjoys accelerated convergence in comparison to (5.4). We find that the new algorithm can even achieve the performance limit of distributed algorithms in many cases (see the discussion in Section 5.5).
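For reference, the baseline iteration (5.4) that we accelerate is only a few lines of code. The following toy sketch (our own illustration, with made-up data) runs DSM on $f_i(x) = |x - b_i|$, whose sum is minimized at the median of the $b_i$:

```python
import numpy as np

rng = np.random.default_rng(5)
N, T = 8, 2000
# a doubly stochastic mixing matrix on a ring (our toy choice)
W = 0.5 * np.eye(N) + 0.25 * (np.roll(np.eye(N), 1, 1) + np.roll(np.eye(N), -1, 1))
b = rng.normal(size=N)                # f_i(x) = |x - b_i|; F is minimized at median(b)
x = rng.normal(size=N)                # x_i(0): each agent's local estimate
for t in range(1, T + 1):
    gamma = 1.0 / np.sqrt(t)          # diminishing step size
    g = np.sign(x - b)                # local subgradients g_i(x_i(t))
    x = W @ x - gamma * g             # consensus step plus optimization step, as in (5.4)
print(x, np.median(b))                # the estimates cluster around a minimizer of F
```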
Another benefit of using the new protocol is that it can be implemented in a distributed manner for an arbitrary weight matrix (as long as the associated graph is strongly connected).

5.2.3 Finite-Time Consensus Using Minimal Polynomials

This subsection is a brief summary of Subsection 4.2.2. Consider the update iteration

$$x_i(t+1) = \sum_{j \in \mathcal{N}_i} w_{ij} x_j(t), \quad \forall i \in \mathcal{V}, \qquad (5.6)$$

or, in vector form, $x(t+1) = W x(t)$. From Subsection 1.5.2 in Chapter 1, we know that under Assumption 5.2.2

$$\exists \lim_{t\to\infty} W^t = \mathbf{1}\pi^T =: \Phi, \qquad (5.7)$$

where $\pi \in \mathbb{R}^N$ is the normalized left Perron eigenvector of W, that is, $\pi^T W = \pi^T$ and $\mathbf{1}^T \pi = 1$. Therefore, the network in (5.6) asymptotically reaches consensus:

$$\lim_{t\to\infty} x(t) = \Phi x(0). \qquad (5.8)$$

From Section 4.2.2 in Chapter 4, we also know that each agent i can locally compute the consensus value using its minimal polynomial $q_i$. In particular,

$$\lim_{t\to\infty} x_i(t) = \frac{\sum_{l=0}^{D_i} a_l^{(i)} x_i(l)}{\sum_{l=0}^{D_i} a_l^{(i)}}, \qquad (5.9)$$

where $D_i = \deg(q_i) - 1$ and $a^{(i)} = [a_0^{(i)}, a_1^{(i)}, \ldots, a_{D_i}^{(i)}]^T$ satisfies

$$q_i(\xi) = (\xi - 1)\sum_{l=0}^{D_i} a_l^{(i)} \xi^l, \qquad a_{D_i}^{(i)} = 1, \qquad \sum_{l=0}^{D_i} a_l^{(i)} \ne 0. \qquad (5.10)$$

Note also that $a^{(i)}$ can be computed locally and in finite time by agent i. We summarize the analysis above as follows.

Theorem 5.2.3. (Prediction of consensus value by minimal polynomial) Consider system (5.6) for $t = 0, \ldots, \bar{D} - 1$, where $\bar{D} = \max_{i\in\mathcal{V}} D_i$ with $\deg(q_i) = D_i + 1$. Let Assumption 5.2.2 hold. Then

$$\frac{\sum_{l=0}^{D_i} a_l^{(i)} x_i(l)}{\sum_{l=0}^{D_i} a_l^{(i)}} = e_i^T \Phi x(0), \quad \forall i \in \mathcal{V}, \qquad (5.11)$$

where $a^{(i)} = [a_0^{(i)}, a_1^{(i)}, \ldots, a_{D_i}^{(i)}]^T \in \mathbb{R}^{D_i+1}$ is given in (5.10).

In this regard, the node minimal polynomials $q_i$ can be viewed as a tool providing a shortcut to reaching consensus. This idea will be employed in this chapter to develop new distributed algorithms with desirable features such as behavior similar to that of centralized algorithms, distributed stopping criteria, and finite (or practically finite) time convergence.

5.3 Distributed Subgradient Optimization Using Finite Time Consensus

In this section, we return to optimization problem (5.1) and show how the minimal polynomials associated with the agents can be used to improve the convergence speed of the distributed subgradient method (5.4). To this end, we will assume that each agent $i \in \mathcal{V}$ knows its minimal polynomial $q_i$ and a common upper bound $\kappa$ on $\deg(q_i)$, i.e.,

$$\kappa \ge \deg(q_i), \quad \forall i \in \mathcal{V}. \qquad (5.12)$$

Note that the least possible value of $\kappa$ is always at most the number of agents N in the network; see Section 5.5.1 for further discussion. Therefore, $\kappa$ can be chosen to be N or any known upper bound on N.

We now consider problem (5.1) and the following possible assumptions, which will not be invoked together below.

Assumption 5.3.1. For each $i \in \mathcal{V}$, $f_i$ is convex and has bounded subgradients on X, i.e., there exists $L_i \in (0, \infty)$ such that $|g_i(x)| \leq L_i$ for all $g_i(x) \in \partial f_i(x)$ and all $x \in X$.

Assumption 5.3.2. For each $i \in \mathcal{V}$, $f_i$ is differentiable on the interior of X. Moreover, the function F is convex on X and the gradient $\nabla F$ is $L_{\nabla F}$-Lipschitz continuous for some $L_{\nabla F} \in (0, \infty)$.

The former assumption implies that F and the $f_i$ are convex and have bounded subgradients, while not necessarily being differentiable; the latter requires convexity of F while not requiring convexity of each $f_i$.

5.3.1 Main Algorithm

Our main idea is to combine the consensus prediction step offered by (5.9) with the distributed subgradient method (5.4).
Specifically, we propose an algorithm, called Finite-time consensus Aided Distributed Optimization (FADO), that performs the following three sequential steps in a cyclic manner: (i) $\kappa$ iterations of the usual consensus algorithm (5.6) (used to diffuse information through the network), followed by (ii) a prediction step using minimal polynomials, and then (iii) a (sub)gradient optimization step applied to the predicted consensus value obtained from step (ii). The detailed algorithm is as follows.

Algorithm 5.1. (Finite-time consensus Aided Distributed Optimization, FADO). Each agent $i \in \mathcal{V}$ initializes a pair of local variables $(s_i(0), x_i(0))$ in X and updates them for $t \geq 1$ according to

$$s_i(t) = \begin{cases} \dfrac{\sum_{l=0}^{D_i} a_l^{(i)} x_i(t-\kappa+l)}{\sum_{l=0}^{D_i} a_l^{(i)}} & \text{if } t = k\kappa, \qquad (5.13a) \\[2mm] s_i(t-1) & \text{else}, \qquad (5.13b) \end{cases}$$

$$x_i(t) = \begin{cases} s_i(t) - \gamma_k\, g_i(s_i(t)) & \text{if } t = k\kappa, \qquad (5.14a) \\[1mm] \sum_{j \in \mathcal{N}_i} w_{ij} x_j(t-1) & \text{else}, \qquad (5.14b) \end{cases}$$

where $\gamma_k$ is the step size at $t = k\kappa$.

Remark 5.3.3. (On implementation of (5.13a)) The foregoing form (5.13a) of the algorithm is quite useful for analysis, but in practice, instead of storing $D_i + 1$ values of $x_i$ to compute $s_i$, each agent can maintain a single memory register storing a running sum from $k\kappa$ to $(k+1)\kappa$, denoted by $y_i$, and update it as each new estimate $x_i(t)$ becomes available, as follows. Agent i sets $y_i(t) = \hat{a}_0^{(i)} x_i(t)$ at time $t = k\kappa$ and then updates this variable as $y_i(k\kappa + \tau + 1) = y_i(k\kappa + \tau) + \hat{a}_\tau^{(i)} x_i(k\kappa + \tau)$ for $\tau = 0, \ldots, D_i$, where $\hat{a}_\tau^{(i)} = a_\tau^{(i)} / \sum_{l=0}^{D_i} a_l^{(i)}$. Then $s_i((k+1)\kappa) = y_i(k\kappa + D_i + 1)$. Of course, each agent still needs to store the $D_i + 1$ normalized coefficients $\hat{a}_\tau^{(i)}$. Clearly, the information exchanged among the agents at each time involves only $x_i(t)$, not $s_i(t)$. Moreover, each agent i only needs to update $s_i(t)$ once every $\kappa$ time steps. Note that whenever $t = k\kappa$, (5.13a) must be carried out prior to (5.14a).

The next result asserts that by utilizing minimal polynomials as in (5.13a) above, we succeed in forcing the states $s_i(t)$ to be identical over the whole network after an initial time period of length $\kappa$, i.e., $s_i(t) = s_j(t)$ for all $t \geq \kappa$ and all $i, j \in \mathcal{V}$. In fact, the $s_i(t)$ will agree for all $t \geq 0$ if identically initialized.

Theorem 5.3.4. (Agreement of the $s_i$, $i \in \mathcal{V}$, after $\kappa$ steps) Consider (5.13)-(5.14) and let Assumption 5.2.2 hold. If $g_i$ is bounded for every $i \in \mathcal{V}$, then

$$s_i(t) = s_j(t), \quad \forall i, j \in \mathcal{V}, \; \forall t \ge \kappa. \qquad (5.15)$$

Proof. First, since $g_i$ is bounded for all $i \in \mathcal{V}$, $x_i(t)$ in (5.14) is well defined. Also, (5.13) is well defined since $\sum_{l=0}^{D_i} a_l^{(i)} \ne 0$; cf. (5.10). Next, for any $k \geq 0$, by (5.14b) we have

$$x(t) = W x(t-1), \quad \forall t = k\kappa + 1, \ldots, k\kappa + \kappa - 1. \qquad (5.16)$$

Then at time $t = (k+1)\kappa$, we have

$$s_i((k+1)\kappa) \overset{(5.13a)}{=} \frac{\sum_{l=0}^{D_i} a_l^{(i)} x_i(k\kappa + l)}{\sum_{l=0}^{D_i} a_l^{(i)}} = e_i^T \Phi\, x(k\kappa), \qquad (5.17)$$

where the last equality follows from Theorem 5.2.3 and (5.12). Here $\Phi$ is the consensus matrix defined in (5.7). Therefore,

$$s_i((k+1)\kappa) = e_i^T \mathbf{1}\pi^T x(k\kappa) = \pi^T x(k\kappa), \qquad (5.18)$$

which is independent of i. It remains to invoke (5.13b).
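To fix ideas, here is a compact, self-contained sketch of FADO (our own illustration, with made-up data) on the ring network of Example 4.5.1 with quadratic local costs $f_i(x) = (x - b_i)^2/2$, so the optimum of F is the mean of the $b_i$. The helper recomputes $a^{(i)}$ offline, as in Section 4.3.2:

```python
import numpy as np

def min_poly_quotient(W, i):
    """Coefficients a of q_i(xi)/(xi - 1) in ascending powers; cf. (5.10)."""
    rows = [np.eye(len(W))[i]]
    while True:
        rows.append(rows[-1] @ W)           # e_i^T W^k
        M = np.array(rows)
        if np.linalg.matrix_rank(M, tol=1e-9) < len(rows):
            c = np.append(np.linalg.lstsq(M[:-1].T, -M[-1], rcond=None)[0], 1.0)
            return np.polydiv(c[::-1], [1.0, -1.0])[0][::-1]

rng = np.random.default_rng(7)
N = 10
# ring of Example 4.5.1: doubly stochastic, identical minimal polynomials
W = 0.8 * np.eye(N) + 0.1 * (np.roll(np.eye(N), 1, 1) + np.roll(np.eye(N), -1, 1))
b = rng.normal(size=N)                      # f_i(x) = (x - b_i)^2 / 2, so x* = mean(b)

a = min_poly_quotient(W, 0)                 # same a^(i) at every node here
D = len(a) - 1
kappa = D + 1                               # kappa >= deg(q_i), cf. (5.12)
gamma = 0.5                                 # constant step in (0, 2N/L_gradF); cf. Thm 5.3.12

x = rng.normal(size=N)
s = x.copy()
hist = [x.copy()]
for t in range(1, 40 * kappa + 1):
    if t % kappa == 0:
        X = np.array(hist[-kappa:])         # x(t - kappa), ..., x(t - 1)
        s = (a @ X[: D + 1]) / a.sum()      # prediction step (5.13a), all nodes at once
        x = s - gamma * (s - b)             # subgradient step (5.14a): g_i(s_i) = s_i - b_i
    else:
        x = W @ x                           # consensus step (5.14b)
    hist.append(x.copy())
print(s[0], b.mean())                       # every s_i converges to the optimum
```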
In the rest of this subsection, we establish the convergence of the algorithm when problem (5.1) is unconstrained (i.e., $X = \mathbb{R}$) and the weight matrix is doubly stochastic. In the next subsection, we will show how to modify our algorithm so that these assumptions can be relaxed (i.e., the problem may have constraints and W may be only row-stochastic). Our first convergence result deals with convex cost functions with bounded subgradients.

Theorem 5.3.5. (Local cost functions with bounded subgradients) Consider problem (5.1) with $X = \mathbb{R}$. Let Assumptions 5.2.1, 5.2.2 and 5.3.1 hold. Assume further that W is doubly stochastic. Let all the agents perform (5.13)-(5.14). Let $\bar{s}(t) := s_i(t)$, $\forall i \in \mathcal{V}$, $\forall t \geq \kappa$, and $g(\bar{s}(k\kappa)) := \sum_{i=1}^{N} g_i(\bar{s}(k\kappa)) \in \partial F(\bar{s}(k\kappa))$. Then

$$\bar{s}((k+1)\kappa) = \bar{s}(k\kappa) - \frac{\gamma_k}{N}\, g(\bar{s}(k\kappa)). \qquad (5.19)$$

Also, let

$$\hat{s}_k := \frac{\sum_{\tau=1}^{k} \gamma_\tau \bar{s}(\tau\kappa)}{\sum_{\tau=1}^{k} \gamma_\tau}. \qquad (5.20)$$

We then have the following:

(i) If $\gamma_k \equiv \gamma > 0$, then for each $i \in \mathcal{V}$

$$\lim_{k\to\infty} F(\hat{s}_k) - F^* \le \frac{\gamma}{2N} L_F^2, \qquad L_F := \sum_{j\in\mathcal{V}} L_j. \qquad (5.21)$$

(ii) For a given number of iterations T, let $K = \lfloor T/\kappa \rfloor$. Let R be any number such that $R \geq \mathrm{dist}(\bar{s}(\kappa), X^*)$. Then with constant step size $\gamma_k \equiv \frac{NR}{L_F\sqrt{K}}$, we have

$$F(\hat{s}_K) - F^* \le \frac{R L_F}{\sqrt{K}}. \qquad (5.22)$$

(iii) If $\gamma_k > 0$, $\lim_{k\to\infty} \gamma_k = 0$, and $\sum_{k=1}^{\infty} \gamma_k = \infty$, then

$$\lim_{k\to\infty} F(\hat{s}_k) = F^*, \quad \forall i \in \mathcal{V}. \qquad (5.23)$$

In fact, if $\gamma_k = \frac{1}{\sqrt{k}}$, the convergence rate is $O(\frac{\ln k}{\sqrt{k}})$.

Proof. See Appendix A.3.1.

Remark 5.3.6. (On convergence and limit points) The auxiliary sequence $\{s_i(t)\}_{t=0}^{\infty}$ is generated so as to yield convergence of $F(\hat{s}_k)$ to $F^*$ (case (iii)) or to a neighborhood of the optimal value (cases (i) and (ii)), rather than convergence of $x_i(t)$ directly. In general, $x_i(t)$ does not converge, but rather reaches a limit cycle of period $\kappa$ whenever $s_i(t)$ converges (possibly also to a limit cycle). Note that by definition (5.20) and the fact that $\bar{s}(t) = s_i(t)$ for all $i \in \mathcal{V}$ and $t \geq \kappa$, $\hat{s}_k$ is a global variable available to all the agents. Moreover, each agent i can compute $\hat{s}_k$ using its local variable $s_i(t)$ and an augmented running sum $\Gamma_k \in \mathbb{R}$ in a recursive manner as follows:

$$\Gamma_{k+1} = \Gamma_k + \gamma_k, \qquad (5.24)$$
$$\hat{s}_{k+1} = \big(\Gamma_k \hat{s}_k + \gamma_{k+1}\, s_i((k+1)\kappa)\big)/\Gamma_{k+1}, \qquad (5.25)$$

where $\Gamma_0 = 0$ and $\hat{s}_0 = 0$ (here the subscript k denotes the iteration index on the slow time scale $k = \lfloor t/\kappa \rfloor$). When a constant step size $\gamma$ is used, the simplified update $\hat{s}_{k+1} = \big(k\hat{s}_k + s_i((k+1)\kappa)\big)/(k+1)$ suffices.

Remark 5.3.7. (Stopping criteria) By Theorem 5.3.4, all the agents are able to stop at the same time with the same estimate $s_i = \bar{s}$ (or $\hat{s}_k$) of the optimal solution if they use a common stopping criterion, e.g., running the algorithm for a predetermined number of iterations T as in Theorem 5.3.5(ii), or until one of the following holds (see Appendix A.3.9 for other criteria):

$$|s_i((k+1)\kappa) - s_i(k\kappa)| \le \epsilon \qquad (5.26)$$
$$|F(\hat{s}_{k+1}) - F(\hat{s}_k)| \le \epsilon. \qquad (5.27)$$

This is not the case for many other distributed algorithms, where consensus is only achieved as an asymptotic limit. Note that $F(\hat{s}_k)$ involves evaluation of the global cost function F at $\hat{s}_k$, and algorithm (5.13)-(5.14) does not provide it to each agent. However, we show in Appendix A.3.9 that it is possible for each agent, using augmented iterations, to locally compute $F(\hat{s}_k)$ at time $t = (k+1)\kappa$ for any $k \geq 1$.

Remark 5.3.8. (Connection with the convergence of the centralized method) In light of (5.15) and (5.19), our algorithm performs analogously to the centralized subgradient method (5.3), except on a slower time scale, where F is convex with subgradient g bounded in magnitude by $L_F = \sum_{j=1}^{N} L_j$. As a result, it inherits the performance guarantees of the centralized subgradient method, as shown in (5.21)-(5.23). For a detailed analysis of this centralized method, see, e.g., [117, Chap. 3] and [138]. It should be noted that

Remark 5.3.9.
Remark 5.3.9. (Step size design and objective bound in case (ii)) For a given T, the constant step size γ = NR/(L_F √K) depends on the constant R and the ratio L_F/N. First, it is clear that the smallest admissible value of R, namely dist(s̄(κ), X*), minimizes the error bound in (5.22), but it is of theoretical interest only, since it requires knowledge of the solution set X* (note that s̄(κ) = s_i(κ)). In practice, however, an upper (possibly loose) bound R may be inferred, especially when there is some restriction on the range of the global variable in (5.1); see also Subsection 5.3.2 and the examples in Section 5.6. Second, the ratio L_F/N is the average of the Lipschitz constants of the agents' local cost functions and thus can be computed locally in finite time. For example, running the same algorithm above with x_i(0) = L_i, by (5.18) and double stochasticity of W we obtain s_i(κ) = π^T x(0) = ∑_{i∈V} L_i / N. Alternatively, the agents can choose γ = R/(L_max √K) instead, where L_max := max_{i∈V} L_i, which can be computed in a distributed manner and in finite time using a max-consensus protocol [139]. In this case, the corresponding bound in (5.22) becomes

  F(ŝ_K) − F* ≤ (R/√K)(N L_max / 2 + L_F² / (2 N L_max)) ≤ R N L_max / √K,

where the first inequality is (A.22) and the last follows from L_F ≤ N L_max.
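The following is a minimal sketch of the max-consensus pass mentioned above for computing L_max (cf. [139]); the directed graph and the local constants below are illustrative. On a strongly connected digraph, each agent repeatedly replaces its value with the maximum over its in-neighbors, and all values equal the global maximum after at most diam(G) ≤ N − 1 steps.

import numpy as np

# hypothetical strongly connected digraph: 2 -> 0 -> 1 -> 2 (plus self-loops)
in_neighbors = {0: [0, 2], 1: [0, 1], 2: [1, 2]}
L = np.array([2.0, 20.0, 8.9])          # local Lipschitz constants L_i
for _ in range(len(L) - 1):             # N - 1 >= diam(G) steps suffice
    L = np.array([max(L[j] for j in in_neighbors[i]) for i in range(len(L))])
assert np.all(L == 20.0)                # every agent now holds L_max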
Remark 5.3.10. (On best convergence rate and scalability) It should be pointed out that the result in case (ii) of Theorem 5.3.5 demonstrates an improvement on the best analyses of distributed subgradient algorithms to date. To wit, recall that for any given number of iterations T we have Kκ ≤ T < (K+1)κ, and thus

  (1/N)(F(ŝ_K) − F*) ≤ R L_F / (N √K) ≤ R L_max √((K+1)/K) √(κ/T),   (5.28)

where the first inequality is (5.22). That is, the convergence rate of the (averaged) objective error is of order O(R L_max √(κ/T)); this also holds for γ = R/(L_max √K) (as shown earlier). In other words, the time it takes for the bound in (5.28) to drop below ε > 0 is O(κ/ε²). As N increases, this turns out to be the fastest rate achieved to date among distributed subgradient-based methods for convex cost functions with bounded subgradients. In this setting, the best known rate so far was demonstrated in [134], where the authors considered the problem of minimizing (1/N) ∑_{i=1}^{N} f_i(x) (which is why we need the factor 1/N on the left side of (5.28) for comparison). In that reference, the authors proposed a linear-time average consensus protocol and used it to design a new algorithm for solving problem (5.1). The consensus protocol there is a combination of a weighted averaging scheme based on the Metropolis rule [90] and an extrapolation step (similar in spirit to adding momentum to speed up iterative methods). Under a fixed step size β = 1/(L_max √(NT)), [134] showed that the aggregated objective error decays as O(R² L_max √(N/T)), i.e., it takes O(N/ε²) time steps to reach an ε error. Thus that algorithm scales linearly in the network size; hence the title "linear time." To implement this algorithm (specifically, the Metropolis rule), however, the graph needs to be undirected. Our algorithm, on the other hand, applies to directed graphs and requires O(κ/ε²) steps to reach an ε error. In general, without knowledge of the network topology, we can simply take κ to be N or a known upper bound on N; then our algorithm still possesses the fastest rate (similar to the algorithm in [134], where a common upper bound on N is also required). However, for certain graphs such as distance regular graphs, κ can be as small as the graph diameter, which can be small compared to the network size (see Section 5.5.1 for further discussion). In this connection, our algorithm's convergence time scales at most linearly in the network size. Note also that the centralized subgradient method converges to an ε error of the optimal value in O(1/ε²) time steps. Hence, our algorithm's performance lies between that of the centralized method and the best known rate for distributed ones. This makes sense intuitively, since our algorithm is not only distributed but also behaves like the centralized one, except on a time scale slowed down by a factor of κ.

Remark 5.3.11. (On number of subgradient evaluations) Another important aspect is the number of subgradient evaluations, since the computation of subgradients usually dominates the time needed to perform the optimization step. Within T time steps, our algorithm requires each agent to evaluate its subgradient T/κ times (e.g., T/N times when κ = N), whereas most, if not all, other distributed subgradient algorithms require T evaluations. Of course, the advantages of our algorithm rest on the assumption that all the agents are equipped with their own minimal polynomials. Although this seems restrictive at first, we remark that the agents' minimal polynomials can be computed (prior to the main algorithm's implementation) in a centralized or decentralized manner, or could be obtained on the fly and in finite time as well; see [71, 72, 140] for such algorithms and Section 5.6 for numerical examples.

Remark 5.3.8 also suggests that known results on the centralized gradient descent method can be used in a straightforward manner to show convergence of the algorithm when a smoothness condition is assumed on F (rather than on every f_i). In particular, when F is differentiable, convergence to the optimal value can be ensured with a sufficiently small constant step size, as shown next.

Theorem 5.3.12. (Convex global cost with Lipschitz gradient) Consider problem (5.1) with X = R. Let Assumptions 5.2.1, 5.2.2 and 5.3.2 hold. Assume further that W is doubly stochastic. Let all the agents synchronously perform (5.13)-(5.14) with γ_k ≡ γ ∈ (0, 2N/L_{∇F}). Then for each i ∈ V

  F(s_i(t)) − F* = O(κ/t), as t → ∞.   (5.29)

Proof. See Appendix A.3.2.

Remark 5.3.13. (Convergence comparison) As with the centralized gradient descent method, when F is convex and ∇F is Lipschitz continuous, the proposed algorithm converges to the optimal value without the need for a diminishing step size. This is a key difference between our algorithm and the distributed gradient descent method (5.4) and many others, which do not converge to the optimal value under a constant step size rule. Moreover, a running bound on the objective error is available in (A.27), bearing a resemblance to that of the centralized method except for being scaled by a factor of κ.

Faster convergence rates can be obtained when we further assume strong convexity of the global cost function F (but not of every individual cost f_i). In particular, the algorithm then achieves linear convergence rates to both the optimal value and the optimal solution. (Note that under Assumption 5.2.1 and strong convexity of F, there exists a unique x* ∈ X*.)

Theorem 5.3.14. (Strongly convex global cost with Lipschitz continuous gradient) Consider problem (5.1) with X = R. Let Assumptions 5.2.1, 5.2.2 and 5.3.2 hold.
Assume further that W is doubly stochastic and F is strongly convex with parameter µ > 0. If all the agents perform (5.13)-(5.14) with γ_k ≡ γ ∈ (0, 2N/(µ + L_{∇F})], then

  |s_i(kκ) − x*|² ≤ β^{k−1} |s_i(κ) − x*|²,   (5.30)
  F(s_i(kκ)) − F* ≤ (L_{∇F}/2) β^{k−1} |s_i(κ) − x*|²,   (5.31)

where β = 1 − 2γµL_{∇F} / (N(µ + L_{∇F})) ∈ (0, 1). Thus, F(s_i(t)) → F* and s_i(t) → x* linearly at rates β^{1/κ} and β^{1/(2κ)}, respectively.

Proof. See Appendix A.3.3.

5.3.2 Extensions of Algorithm 5.1

We now consider possible extensions of the algorithm to deal with two cases: (i) problem (5.1) is subject to a constraint x ∈ X for some convex set X, and (ii) the weight matrix W is only row stochastic.

Case (i): We assume X is closed and convex. To satisfy this constraint, we resort to the projection operator onto X, denoted P_X. In the special case X ⊂ R, X is just an interval, so the projection is simply a cut-off function. Assuming that all the agents know the set X, we modify our algorithm described above as follows. For any i ∈ V, initialize s_i(0) = x_i(0) ∈ X, and update for any t ≥ 1:

  s_i(t) = P_X( (∑_{l=0}^{D_i} a_l^{(i)} x_i(t−κ+l)) / (∑_{l=0}^{D_i} a_l^{(i)}) )  if t = kκ; s_i(t) = s_i(t−1)  otherwise,   (5.32)
  x_i(t) = s_i(t) − γ_k g_i(s_i(t))  if t = kκ; x_i(t) = ∑_{j∈N_i} w_ij x_j(t−1)  otherwise.   (5.33)

The key idea of this extension is that the same P_X is used by all the agents, forcing the modified algorithm to work in the same manner as the original one, as shown next. Define g_s(t) := [g_1(s_1(t)), …, g_N(s_N(t))]^T and s̄(t) := 1^T s(t)/N, and recall that Φ = 11^T/N. We have

  s_i(kκ + κ) = P_X( (∑_{l=0}^{D_i} a_l^{(i)} x_i(kκ+l)) / (∑_{l=0}^{D_i} a_l^{(i)}) )   [by (5.32)]
             = P_X( e_i^T Φ x(kκ) )   [by Thm. 5.2.3]
             = P_X( (1^T/N)( s(kκ) − γ_k g_s(kκ) ) )   [by (5.33)]
             = P_X( s̄(kκ) − (γ_k/N) ∑_{j=1}^{N} g_j(s_j(kκ)) ).   (5.34)

Thus, s_i(kκ) = s̄(kκ), ∀i ∈ V, k ≥ 1. Hence, (5.15) holds and

  s̄((k+1)κ) = P_X( s̄(kκ) − (γ_k/N) g(s̄(kκ)) ),   (5.35)

where g(s̄(kκ)) = ∑_i g_i(s̄(kκ)) ∈ ∂F(s̄(kκ)). Now (5.35) is the usual (centralized) projected (sub)gradient method, whose convergence results are not much different from those of the (sub)gradient method (see, e.g., [117, 138], [115, Chap. 2]). Therefore, the conclusions of Theorems 5.3.5, 5.3.12 and 5.3.14 still hold.

Case (ii): It should be noted that in distributed settings, a row stochastic weight matrix is much easier to implement than a column (or doubly) stochastic one, as each agent can individually decide the weights on the information received from its neighbors. When this is the case, most (if not all) subgradient-based methods do not converge to the optimal value, and our proposed algorithm above is no exception. However, it can be modified to overcome this by using the re-weighting technique of [93, 100], which requires that the value π_i be available to agent i for all i ∈ V. The modified algorithm is as follows:

  s_i(t) = (∑_{l=0}^{D_i} a_l^{(i)} x_i(t−κ+l)) / (∑_{l=0}^{D_i} a_l^{(i)})  if t = kκ; s_i(t) = s_i(t−1)  otherwise,   (5.36)
  x_i(t) = s_i(t) − γ_k g_i(s_i(t)) / (Nπ_i)  if t = kκ; x_i(t) = ∑_{j∈N_i} w_ij x_j(t−1)  otherwise.   (5.37)

The only difference between this extension and the original algorithm (5.13)-(5.14) is the scaling factor (Nπ_i)^{−1} of the subgradient in (5.37), where π_i > 0, ∀i ∈ V (see [95, Thm. 8.4.4]). Note that the factor N^{−1} is not crucial to the convergence of the algorithm; it appears merely to retain the conclusions of Theorems 5.3.5, 5.3.12 and 5.3.14 as stated; see Appendix A.3.4 for a detailed proof. Here we assume that the value π_i is available to agent i for all i ∈ V.
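As an illustration of the two extensions just described (applied together, as in Remark 5.3.15 below), the following sketch runs the combined updates (5.32) and (5.37) on a toy problem with weighted ℓ1 costs. All data are made up; W is only row stochastic; the prediction again uses the global annihilating polynomial char(W)/(ξ − 1) in place of per-node minimal polynomials; and π is computed here by an eigendecomposition purely for illustration.

import numpy as np

W = np.array([[0.6, 0.4, 0.0],
              [0.3, 0.4, 0.3],
              [0.5, 0.0, 0.5]])            # row stochastic, strongly connected
N = W.shape[0]
w_eig, V = np.linalg.eig(W.T)
pi = np.real(V[:, np.argmax(np.real(w_eig))])
pi /= pi.sum()                             # normalized left Perron eigenvector

coeffs, _ = np.polydiv(np.poly(W), [1.0, -1.0])
a = coeffs[::-1]                           # prediction coefficients

b, c = np.array([1.0, 2.0, 1.0]), np.array([-1.0, 0.5, 3.0])
proj = lambda v: np.clip(v, -2.0, 2.0)     # P_X for the interval X = [-2, 2]
kappa = N
x = proj(c.copy())
s = x.copy()
hist = np.zeros((N, kappa)); hist[:, 0] = x
for t in range(1, 400 * kappa + 1):
    k = t // kappa
    if t % kappa == 0:
        s = proj(hist @ a / a.sum())       # (5.32): project the prediction
        g = b * np.sign(s - c)             # subgradients of b_i |x - c_i|
        x = s - (1.0 / np.sqrt(k)) * g / (N * pi)   # (5.37): rescaled step
    else:
        x = W @ x
    hist[:, t % kappa] = x

print(s)   # approaches 0.5, the weighted median of c (which lies in X)

The rescaling by (Nπ_i)^{−1} is what makes the prediction s((k+1)κ) = π^T x(kκ) reduce to the centralized update s − (γ_k/N) ∑_i g_i, despite the nonuniform π.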
In fact, each agent can compute its corresponding entry of the vector π in finite time (possibly during the process of determining its own minimal polynomial). In particular, consider iteration (5.6) with initial condition x(0) = e_i for some i ∈ V, and suppose that Assumption 5.2.2 holds. Then

  lim_{t→∞} x(t) = lim_{t→∞} W^t x(0) = 1π^T e_i = π_i 1.   (5.38)

That is, the network of agents running (5.6) achieves consensus on x̄ = π_i. Therefore, by applying one of the finite-time consensus algorithms in, e.g., [71, 72] (see also Subsection 5.2.3), agent i can compute π_i in finite time. This is also the main idea employed in [141] for each agent to compute π.

Remark 5.3.15. When problem (5.1) is subject to both a global constraint x ∈ X and a row stochastic weight matrix W, we can combine (5.32) and (5.37) to form a new algorithm. The convergence proof for the resulting algorithm merely combines the proofs of the two cases above, and is thus omitted for brevity.

5.4 Finite-Time Optimization for Quadratic Cost Functions

Now we consider problem (5.1) without any constraint, i.e., X = R, and with quadratic cost functions

  f_i(x) = b_i (x − c_i)², ∀i ∈ V,   (5.39)

for some b_i, c_i ∈ R with b_i > 0. Clearly, the optimal solution is given by x* = (∑_{i=1}^{N} b_i c_i) / (∑_{i=1}^{N} b_i). Of course, our algorithm of the previous section is still applicable. Here we aim to achieve (near) finite-time convergence by capitalizing on the special form of the cost functions.

5.4.1 Ratio-Consensus Based Algorithm

Our first algorithm is based on the observation that x* can be expressed as the ratio of two average quantities, namely, x* = ((1/N) ∑_{i=1}^{N} b_i c_i) / ((1/N) ∑_{i=1}^{N} b_i). Thus, inspired by the idea of the ratio-consensus algorithm (see, e.g., [34, 90]), we construct the following finite-time algorithm:

Algorithm 5.2. (Finite-time Ratio-consensus) Let κ satisfy (5.12). Each agent i ∈ V initializes a pair of local variables (y_i, z_i) at time t = 0 as y_i(0) = b_i c_i, z_i(0) = b_i, and updates them according to

  y_i(t) = ∑_{j∈N_i} w_ij y_j(t−1), t = 1, …, κ,   (5.40)
  z_i(t) = ∑_{j∈N_i} w_ij z_j(t−1), t = 1, …, κ,   (5.41)
  x*_i = (∑_{l=0}^{D_i} a_l^{(i)} y_i(l)) / (∑_{l=0}^{D_i} a_l^{(i)} z_i(l)), ∀i ∈ V.   (5.42)

Note that (5.42) is evaluated only once, after the final consensus iteration. The next result is immediate.

Theorem 5.4.1. (Finite-time optimization for quadratic costs) Consider problem (5.1) with X = R and f_i given as in (5.39). Let Assumption 5.2.2 hold, and further let W be doubly stochastic. If the agents perform (5.40)-(5.42), then

  x*_i = x*, ∀i ∈ V.   (5.43)

Proof. We have Φ = 11^T/N by (5.7) and double stochasticity of W. Application of Theorem 5.2.3 yields ∑_{l=0}^{D_i} a_l^{(i)} y_i(l) = (1/N) 1^T y(0) ∑_{l=0}^{D_i} a_l^{(i)} and ∑_{l=0}^{D_i} a_l^{(i)} z_i(l) = (1/N) 1^T z(0) ∑_{l=0}^{D_i} a_l^{(i)}. The theorem then follows.

Although the idea of this algorithm is simple, to our knowledge it has not been presented elsewhere in the literature. It also shows an interesting connection with the finite-time behavior of the centralized gradient descent method. In particular, for a quadratic cost function, the centralized method converges in just one iteration by using the Newton step. In the distributed setting, our algorithm exhibits the same finite-time convergence behavior, except in κ steps. Here, at each time t = 1, …, κ, each agent exchanges its pair of variables (y_i, z_i) with its neighbors. Consequently, all the agents reach the optimal solution indirectly through diffusing the coefficients of their quadratic cost functions.
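The following minimal sketch of Algorithm 5.2 uses a toy doubly stochastic W and hypothetical cost coefficients; as before, the prediction coefficients are taken from char(W)/(ξ − 1), a valid though generally non-minimal choice, in place of per-node minimal polynomials.

import numpy as np

W = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
N = W.shape[0]
coeffs, _ = np.polydiv(np.poly(W), [1.0, -1.0])
a = coeffs[::-1]                       # a[l] multiplies the value at time l

b = np.array([3.0, 1.0, 2.0])          # f_i(x) = b_i (x - c_i)^2
c = np.array([0.0, 4.0, 3.0])
y = np.zeros((N, N)); z = np.zeros((N, N))
y[:, 0], z[:, 0] = b * c, b            # y_i(0) = b_i c_i, z_i(0) = b_i
for t in range(1, N):                  # consensus sweeps (5.40)-(5.41)
    y[:, t] = W @ y[:, t - 1]
    z[:, t] = W @ z[:, t - 1]

x_star = (y @ a) / (z @ a)             # ratio prediction (5.42), per agent
assert np.allclose(x_star, (b * c).sum() / b.sum())   # exact: 10/6 here
print(x_star)

The common factor ∑_l a_l cancels in the ratio, so each agent recovers x* exactly after only N − 1 consensus sweeps in this example.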
In the case that the agents do not want to reveal information about their private cost functions, it is still possible for each agent to use its minimal polynomial, in connection with exchanging estimates of the solution with its neighbors, to achieve very fast convergence. In the following, we derive such an algorithm.

5.4.2 Gradient-Based Algorithm

Our idea is as follows. Consider the distributed gradient method (5.4) applied to the problem with the quadratic costs (5.39):

  x_i(t+1) = ∑_{j∈N_i} w_ij x_j(t) − γ b_i (x_i(t) − c_i),   (5.44)

where γ is a constant and x_i(0) can be chosen arbitrarily. This iteration does not converge to the optimal solution (as is easily verified by contradiction: if x* = ∑_{j∈N_i} w_ij x* − γ b_i (x* − c_i), then x* = c_i, ∀i ∈ V), but rather to an O(γ)-neighborhood of x*, even when W is assumed to be doubly stochastic (see, e.g., [78, 86]):

  lim_{t→∞} x(t) = O(γ) ε + x* 1   (5.45)

for some ε ∈ R^N. Therefore, if each agent can predict its final value in finite time (possibly in the same manner as above), then by using a sufficiently small γ, the agents may employ very few prediction steps in conjunction with a finite number of iterations (5.44) in order to obtain a close estimate of the optimal solution. This idea is pursued in the following.

To this end, we first show how each agent can compute its final value in (5.44) in finite time and in a distributed manner. This is different from what is reported in Subsection 5.2.3, since (5.44) is not a consensus iteration. Eq. (5.44) represents a linear time-invariant system with constant inputs, which can also be expressed in vector form as

  x(t+1) = (W − γB) x(t) + γBc,   (5.46)

where B := diag([b_1, …, b_N]) and c := [c_1, …, c_N]^T. In this connection, it is clear that the convergence of iteration (5.44) depends on the system matrix (W − γB). We then have the following:

Theorem 5.4.2. (Stability condition) Let Assumption 5.2.2 hold and suppose that W has positive diagonal elements. Then there exists γ_0 > 0 such that

  ρ(W − γB) < 1   (5.47)

for any γ ∈ (0, γ_0). Moreover, we can take γ_0 = 2 min_{i∈V} w_ii / max_{i∈V} b_i.

Proof. Note that 2 min_{i∈V} w_ii / max_{i∈V} b_i > 0. By the Gershgorin circle theorem (see, e.g., [95, p. 344]), all the eigenvalues of (W − γB) are located in the union of the N discs

  ∪_{i=1}^{N} { z ∈ C : |z − w_ii + γ b_i| ≤ 1 − w_ii }.   (5.48)

As a result, ρ(W − γB) < 1 for any γ ∈ (0, 2 min_{i∈V} w_ii / max_{i∈V} b_i). The existence of γ_0 then follows.

Condition (5.47) guarantees that the system (5.46) is BIBO stable; thus the states converge to some fixed values, which are not necessarily equal. In fact, it follows from (5.46) that

  lim_{t→∞} x(t) = Φ_γ c, where Φ_γ := (I − (W − γB))^{−1} γB,   (5.49)

and I − (W − γB) is invertible because of (5.47). Let z(t)^T := [x(t)^T, c^T]. The system (5.46) can be described equivalently as

  z(t+1) = W̃ z(t), W̃ := [ W − γB  γB ; 0_{N×N}  I ].   (5.50)

For any i = 1, …, N, define q̃_i to be the monic polynomial of minimal degree such that e_i^T q̃_i(W̃) = 0^T, where e_i ∈ R^{2N} is the i-th standard unit vector. We will call q̃_i the minimal polynomial of node i in system (5.50).

Lemma 5.4.3. (Minimal polynomials in system (5.50)) Let γ satisfy (5.47). For each i ∈ V, there exists ã^{(i)} ∈ R^{D̃_i+1} such that

  q̃_i(ξ) = (ξ − 1) ∑_{j=0}^{D̃_i} ã_j^{(i)} ξ^j,  ã_{D̃_i}^{(i)} = 1,   (5.51)

where deg(q̃_i) = D̃_i + 1 ≤ 1 + N. Moreover, all the zeros of q̃_i are strictly inside the unit circle, except for a single zero at 1.

Proof. See Appendix A.3.6.

Clearly, q̃_i is of the same form as q_i in (5.10) and also has 1 as its only zero of maximum modulus.
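A quick numerical check of the step-size bound in Theorem 5.4.2 is given below, reusing the toy W and b from the earlier sketches: for γ below γ_0 = 2 min_i w_ii / max_i b_i, the spectral radius of W − γB indeed stays below 1, as the Gershgorin argument guarantees.

import numpy as np

W = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
b = np.array([3.0, 1.0, 2.0])
B = np.diag(b)
gamma0 = 2 * np.min(np.diag(W)) / np.max(b)      # = 1/3 for these data
for gamma in (0.25 * gamma0, 0.5 * gamma0, 0.99 * gamma0):
    rho = max(abs(np.linalg.eigvals(W - gamma * B)))
    print(f"gamma = {gamma:.4f}, rho(W - gamma*B) = {rho:.4f}")
    assert rho < 1.0                             # stability condition (5.47)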
Thus, q̃_i can also be computed locally and in finite time using the schemes presented in Section 4.3.2. Now we assume that all the agents know a common upper bound κ on deg(q̃_i). Note that deg(q̃_i) ≤ N + 1, ∀i ∈ V; thus κ can be chosen to be N + 1 or any upper bound on N + 1. After κ consecutive iterations of (5.44), each agent is able to determine the final value to which it would converge if all the agents followed (5.44) forever. Using the same arguments as in (5.9)-(5.10) of Subsection 5.2.3, we have

  lim_{t→∞} x_i(t) = (∑_{k=0}^{D̃_i} ã_k^{(i)} x_i(k)) / (∑_{k=0}^{D̃_i} ã_k^{(i)}), ∀i ∈ V.   (5.52)

Therefore, in the same spirit as Theorem 5.2.3, we can view the right side of (5.52) together with κ iterations of (5.44) as a realization of the operator Φ_γ given in (5.49). This realization is carried out in finite time, enabling us to construct the following algorithm:

Algorithm 5.3. (Near Finite-time Gradient-based Optimization) Each agent i ∈ V initializes a pair of local variables (s_i, x_i) at time t = 0 as s_i(0) = x_i(0) = c_i and updates them for t ≥ 1 according to

  s_i(t) = (∑_{k=0}^{D̃_i} ã_k^{(i)} x_i(t−κ+k)) / (∑_{k=0}^{D̃_i} ã_k^{(i)})  if t = kκ; s_i(t) = s_i(t−1)  otherwise,   (5.53)
  x_i(t) = s_i(t)  if t = kκ; x_i(t) = ∑_{j∈N_i} w_ij x_j(t−1) − γ b_i (x_i(t−1) − s_i(t−1))  otherwise,   (5.54)

where γ satisfies (5.47).

In the following, we show that by choosing γ appropriately, this algorithm achieves exponential convergence to the optimal solution. More importantly, the convergence rate is adjustable through γ.

Theorem 5.4.4. (Convergence of consensus matrix Φ_γ) Let Assumption 5.2.2 hold and let γ satisfy (5.47). Then Φ_γ given by (5.49) is a row stochastic and irreducible matrix, and has Bπ as a left Perron eigenvector. Moreover,

  lim_{k→∞} Φ_γ^k = 1 π^T B / (π^T B 1),   (5.55)

and the convergence is exponential, with rate determined by the second largest eigenvalue λ_2(Φ_γ).

Proof. See Appendix A.3.7.

Theorem 5.4.4 allows us to prove the convergence of (5.53)-(5.54).

Theorem 5.4.5. (Convergence for quadratic costs) Consider Algorithm (5.53)-(5.54) and let Assumption 5.2.2 hold. Assume further that W is doubly stochastic and has positive diagonal elements, and let γ satisfy (5.47). Then

  lim_{t→∞} x(t) = lim_{t→∞} s(t) = x* 1.   (5.56)

Moreover, the convergence is linear with rate |λ_2(Φ_γ)|^{1/κ}.

Proof. See Appendix A.3.8.

Although in this theorem the weight matrix W is assumed to be doubly stochastic, we note that the algorithm can be modified using the re-weighting trick so that W need only be row stochastic (see Remark 5.4.8 for details). Theorem 5.4.5 shows the effect of γ on the convergence of Algorithm (5.53)-(5.54) and the rate at which the agents' estimates converge linearly to the optimal solution. Specifically, γ must be chosen so as to first ensure stability of the algorithm (namely, condition (5.47) in Theorem 5.4.2), and then to accelerate convergence by reducing the rate |λ_2(Φ_γ)|^{1/κ}. The next remarks successively address these issues.
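The following is a minimal sketch of Algorithm 5.3, again on the toy data used above. For illustration, the per-node polynomials q̃_i are replaced by the global annihilating polynomial (ξ − 1)·char(W − γB), whose non-unit factor supplies the coefficients ã (so κ = N + 1 samples are used per prediction); within each cycle the iteration is (5.44) with c replaced by the current prediction s, which realizes Φ_γ applied to s.

import numpy as np

W = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
b = np.array([3.0, 1.0, 2.0]); c = np.array([0.0, 4.0, 3.0])
N = W.shape[0]
gamma = 0.05                            # satisfies (5.47); see previous sketch
M = W - gamma * np.diag(b)
a_tld = np.poly(M)[::-1]                # char(M) coefficients, increasing powers
kappa = N + 1                           # N + 1 samples per prediction

s, x = c.copy(), c.copy()
hist = np.zeros((N, kappa)); hist[:, 0] = x
for t in range(1, 6 * kappa + 1):
    if t % kappa == 0:
        s = hist @ a_tld / a_tld.sum()  # exact limit prediction, cf. (5.52)
        x = s.copy()                    # restart (5.54) from the prediction
    else:
        x = W @ x - gamma * b * (x - s) # damped consensus toward Phi_gamma s
    hist[:, t % kappa] = x

print(s, (b * c).sum() / b.sum())       # s -> x* = 10/6 after a few cycles

Each cycle maps s to Φ_γ s exactly, so s(kκ) = Φ_γ^k c, and with γ this small the response is nearly dead-beat (cf. Remark 5.4.7 below).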
Remark 5.4.6. (Distributed agreement on step size to ensure the stability condition) In order to guarantee convergence of the algorithm, all the agents must select the same step size γ satisfying (5.47). Such a value can be determined by all the agents in finite time and in a distributed manner as follows. Prior to running Algorithm (5.53)-(5.54), all the agents implement a max-consensus algorithm (see, e.g., [139]) to compute both b_M := max_{i∈V} b_i and a_M := −max_{i∈V}(−w_ii) = min_{i∈V} w_ii. Then each agent sets γ = ε · 2a_M/b_M, where ε ∈ (0, 1) is a constant known to all the agents. Clearly, 0 < γ < 2a_M/b_M = 2 min_{i∈V} w_ii / max_{i∈V} b_i ≤ γ_0, by Theorem 5.4.2. Moreover, the max-consensus algorithm converges after a finite number of iterations equal to the diameter of the graph. In case this number is unknown to the agents, any upper bound (e.g., N, the network size) can be used to terminate the algorithm.

Remark 5.4.7. (Fast convergence by choosing a small step size γ) Recall from Theorem 5.4.5 that x(t) converges linearly to x*1 with rate |λ_2(Φ_γ)|^{1/κ}, where it can be seen from (5.55) that

  |λ_2(Φ_γ)| = ρ( Φ_γ − 1 π^T B/(π^T B 1) ) = ρ( Φ_γ − 1 b^T/(b^T 1) ).   (5.57)

Here, π = 1/N since W is doubly stochastic. Moreover, by (5.45) and (5.49),

  Φ_γ c = O(γ) ε + x* 1 = O(γ) ε + 1 b^T c/(b^T 1),

where we have used the fact that x* = b^T c/(b^T 1). Rearranging terms yields (Φ_γ − 1 b^T/(b^T 1)) c = O(γ), which, in view of (5.57), implies that

  lim_{γ→0+} |λ_2(Φ_γ)| = 0.   (5.58)

Although Φ_γ is not defined at γ = 0, choosing γ small brings about a fast convergence rate. Moreover, when γ is sufficiently small that |λ_2(Φ_γ)|^{1/κ} is close to 0, the system (5.53)-(5.54) exhibits a near dead-beat response. When this is the case, all the agents may agree to perform the algorithm for kκ steps with a small integer k (e.g., 1, 2, 3), so that each agent obtains a close approximation to the optimal solution. In conjunction with Remark 5.4.6, all the agents may choose γ = min(θ, ε · 2a_M/b_M), where 0 < θ, ε ≪ 1 are supposedly known to every agent. A word of warning, however: too small a value of γ (say 10⁻¹⁶) could affect the convergence of the algorithm through computational round-off errors. This deserves further analysis in future work.

Remark 5.4.8. (Extension of Algorithm (5.53)-(5.54) for a row stochastic weight matrix) Assume that π_i is available to agent i. (Indeed, all the agents can cooperate to compute their corresponding π_i in finite time [141].) We modify (5.54) as follows:

  x_i(t) = s_i(t)  if t = kκ; x_i(t) = ∑_{j∈N_i} w_ij x_j(t−1) − γ b_i (x_i(t−1) − s_i(t−1)) / (Nπ_i)  otherwise,

and redefine Φ_γ = ( γ^{−1} B^{−1} S (I − W) + I )^{−1}, where S := diag(Nπ) and γ is such that ρ(W − γ S^{−1} B) < 1, in place of the stability condition (5.47). Note also that γ_0 in Theorem 5.4.2 can then be chosen as γ_0 = 2 min_{i∈V} Nπ_i w_ii / max_{i∈V} b_i. Under this condition, it can be verified that Φ_γ is still a valid consensus matrix with b = B1 as a left Perron eigenvector. Therefore,

  lim_{k→∞} s(kκ) = lim_{k→∞} Φ_γ^k c = 1 b^T c/(b^T 1) = x* 1,

i.e., the optimal solution is attained by every agent.

5.5 On the Minimal Value of κ and Performance Limits of Distributed Subgradient Methods

5.5.1 Minimal Value of κ

It is evident that the convergence speed of each algorithm presented in Sections 5.3 and 5.4 depends on the value κ, which is an upper bound on the degrees of all the minimal polynomials (q_i, i ∈ V). Indeed, a smaller κ corresponds to faster convergence. Thus, in the best scenario, κ = κ_min := max_{i∈V} deg(q_i). We note the following regarding this value. First, for general directed networks, κ_min satisfies

  diam(G) + 1 ≤ κ_min ≤ deg(q_W) ≤ N,   (5.59)

where diam(G) denotes the graph diameter. The lower bound can be shown, e.g., by application of [72, Thm. 3]; an alternative argument is given in the next subsection. The upper bounds follow from the definitions of minimal polynomials and the Cayley-Hamilton theorem. It is interesting to seek classes of graphs for which the lower bound is achieved. Clearly, this is the case for line graphs, since then diam(G) + 1 = N.
Next, we show that the lower bound can also be achieved by another class of graphs, namely distance regular graphs (see Definition 4.6.5), examples of which include cycles, hypercubes and complete graphs. References [129, 142] and a recent survey [143] provide further information on distance regular graphs. The next result asserts that in the setting of distance regular graphs, we can ensure that κ_min = diam(G) + 1. See Appendix A.3.5 for a proof.

Theorem 5.5.1. If G is distance regular and W = I − εL(G), where L(G) is the Laplacian matrix and ε > 0 satisfies ε(|N_i| − 1) ≤ 1, then diam(G) + 1 = deg(q_i), ∀i ∈ V.

Thus, for distance regular graphs (with W = I − εL(G)), the convergence times of our algorithms are linear in the network diameter rather than the network size. Next, we discuss the tightness of the upper bounds in (5.59). Note that the minimal polynomials q_i, and thus κ_min, depend explicitly on the weight matrix W. In fact, the zeros of q_i are the eigenvalues of W corresponding to the modes of system (5.6) that are observable from the output x_i (see [141, Sec. V]). This has two direct implications: (i) if the network (5.6) is observable from at least one node, then κ_min = N (a line graph falls into this case); (ii) an algorithm for computing q_i locally can be used as a means of verifying system observability or computing the graph spectrum in a distributed manner. This also suggests that observability theory might be useful in the problem of designing the weights so as to minimize κ_min. Finally, when all agents know the degree of their own minimal polynomial, they can compute κ_min in a distributed fashion by using a max-consensus algorithm [139] for a finite number of time steps upper bounded by diam(G).
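A small numerical check of Theorem 5.5.1 is sketched below on the 6-cycle, a distance regular graph of diameter 3, with ε = 0.2 (so that ε(|N_i| − 1) = 0.4 ≤ 1). The degree of node i's minimal polynomial is found as the smallest d for which the Krylov rows e_i^T W^0, …, e_i^T W^d become linearly dependent.

import numpy as np

N, eps = 6, 0.2
A = np.zeros((N, N))
for i in range(N):                       # cycle graph C_6
    A[i, (i + 1) % N] = A[i, (i - 1) % N] = 1.0
L = np.diag(A.sum(1)) - A                # graph Laplacian
W = np.eye(N) - eps * L                  # row stochastic: eps*(|N_i|-1) <= 1

for i in range(N):
    rows = [np.eye(N)[i]]                # e_i^T W^t, t = 0, 1, ...
    while True:
        rows.append(rows[-1] @ W)
        if np.linalg.matrix_rank(np.vstack(rows), tol=1e-9) < len(rows):
            break
    print(f"node {i}: deg(q_i) = {len(rows) - 1}")   # 4 = diam(G) + 1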
5.5.2 Performance Limit of Distributed Subgradient Methods

We now discuss how fast the convergence of a distributed subgradient method can be in comparison with its centralized counterpart and with our algorithm, and in connection with the network topology.

First, it is clear that for a general problem of the form (5.1) and a given network topology, diam(G) equals the smallest possible running time among all distributed algorithms, since this is the minimum time for information to travel from any node to all others in the network. (This also explains the lower bound in (5.59).) An intuitively simple algorithm that theoretically achieves this fastest convergence is based on communication flooding: assuming sufficient communication power and memory capacity at each agent, as well as availability of a closed-form characterization of each local cost function f_i and a closed-form solution to the global optimization problem, at each time step every agent broadcasts its function f_i (assumed to carry a unique identifier) together with all data received from its neighbors at the previous time step. As a result, at time diam(G) + 1, all the agents are able to determine the global cost function F and hence can determine the optimum independently. Of course, even assuming uniqueness of the optimal solution (so that the agents also reach consensus), this algorithm is far from being of any practical use. However, a similar behavior appears when applying our algorithms (5.40)-(5.42) and (5.53)-(5.54) to quadratic objective functions, where the closed-form solution to the global problem is simple (and the centralized Newton method solves it in one iteration). Specifically, (5.40)-(5.42) terminates in κ steps with the exact solution, while (5.53)-(5.54) can do so with arbitrarily small error by using a sufficiently small γ. As noted earlier, κ = diam(G) + 1 for certain graphs.

Second, suppose that the global objective function can be optimized by some centralized subgradient-based method with a certain convergence time. In the distributed setting, one should expect the corresponding distributed algorithm, when it converges, to be slowed down by at least a factor of diam(G), again due to the limit on information travel in discrete time. As noted in Remark 5.3.10, the best analysis of distributed subgradient methods for convex cost functions with bounded subgradients demonstrates O(N/ε²) convergence time, which is linear in the network size (see [134]), while that of the centralized counterpart is O(1/ε²). Our result of O(κ_min/ε²) convergence time is the first to bridge the gap between O(N/ε²) and O(diam(G)/ε²), which we reckon to be the limit of distributed subgradient methods. Of course, the tightness of these bounds depends on the network topology. For example, in a line graph they coincide. Complete graphs are at the other extreme, with κ_min = diam(G) + 1 = 2 for any network size; this makes sense, since agents in a complete network should be able to act unanimously.

5.6 Simulations

Next we give some simulation results to illustrate the algorithms proposed above. In these examples, each agent does not know its minimal polynomial in advance, but rather computes it using Algorithm 4 of Chapter 4 in connection with M consensus iterations (5.6), provided that M is sufficiently large, e.g., M ≥ 2 deg(q_i) + 1, ∀i ∈ V.

5.6.1 Example 1: Network of 5 agents with differentiable cost functions having Lipschitz continuous gradient

Consider the network and associated weight matrix shown in Figure 5.1:

  W = [ 0.7  0.3  0    0    0
        0.2  0.6  0    0    0.2
        0    0.3  0.4  0.3  0
        0    0    0.5  0.5  0
        0    0    0.4  0    0.6 ]

Figure 5.1: Network topology in example 1.

Let X = [−2, 2] and the local cost functions be

  f_1(x) = (x − 3)² + 2x,
  f_2(x) = (x⁴ − x³ + 2x²)/3,
  f_3(x) = (x − 0.1)⁴/6,
  f_4(x) = eˣ,
  f_5(x) = −2x² + 2x (nonconvex).

Note that F(x) = ∑_{i=1}^{5} f_i(x) is convex, but f_5 is not. In fact, F is strongly convex with Lipschitz continuous gradient. Here, x* = 0.7427 ∈ X and F* = 9.4812. We combine both extensions of the main algorithm as proposed in Section 5.3.2 to deal with the global constraint X and the row stochasticity of the weight matrix; in particular, (5.32) and (5.37) are employed (see Remark 5.3.15). In order to apply these iterations, the agents need to find their corresponding elements π_i of the normalized left Perron eigenvector π of the weight matrix. This can be done using the idea discussed at the end of Section 5.3.2, which we describe next. Prior to implementation of the optimization algorithm, let all the agents run the following 2N + 1 iterations (we assume that the network size N is available to the agents):

  p^{(i)}(t+1) = ∑_{j∈N_i} w_ij p^{(j)}(t), ∀i ∈ V, t = 0, …, 2N,   (5.60)

where p^{(i)}(0) = e_i ∈ R^N (i.e., the i-th unit vector). At time t = 2N + 1, each agent i has enough data, namely the sequence {[p^{(i)}]_i(t)}_{t=0}^{2N}, to compute a^{(i)} (equivalently, q_i) as well as π_i, as shown in Section 5.2.3. In particular,

  π = [0.20619, 0.30928, 0.20619, 0.12371, 0.15464]^T,
  a^{(i)} = [0.0192, −0.261, 1.1, −1.8, 1]^T, ∀i ∈ V.

In this case, κ_min = N. We will take κ = N + 1.
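A compact numerical sketch of this pre-processing step is given below: from the locally observed scalar sequence {[p^{(i)}]_i(t)}, each agent recovers its coefficient vector a^{(i)} (the minimal polynomial with the factor (ξ − 1) divided out) and then π_i via (5.38). Here the sequences are generated by powering W directly, whereas in the network they arrive by message passing; the rank/least-squares test is a centralized stand-in for Algorithm 4 of Chapter 4, and the printed values should reproduce those reported above.

import numpy as np

W = np.array([[0.7, 0.3, 0.0, 0.0, 0.0],
              [0.2, 0.6, 0.0, 0.0, 0.2],
              [0.0, 0.3, 0.4, 0.3, 0.0],
              [0.0, 0.0, 0.5, 0.5, 0.0],
              [0.0, 0.0, 0.4, 0.0, 0.6]])
N = W.shape[0]
for i in range(N):
    rows = [np.eye(N)[i]]                 # e_i^T W^t, t = 0, 1, ...
    for _ in range(N):
        rows.append(rows[-1] @ W)
    K = np.vstack(rows)                   # (N+1) x N Krylov matrix
    for d in range(1, N + 1):             # smallest d admitting a dependency
        q_low, *_ = np.linalg.lstsq(K[:d].T, -K[d], rcond=None)
        if np.allclose(K[:d].T @ q_low, -K[d], atol=1e-8):
            q = np.append(q_low, 1.0)     # monic q_i, increasing powers
            break
    a_dec, _ = np.polydiv(q[::-1], [1.0, -1.0])   # divide out the root at 1
    a = a_dec[::-1]                       # a^(i), increasing powers
    seq = K[: len(a), i]                  # x_i(t) for x(0) = e_i
    pi_i = a @ seq / a.sum()              # finite-time value of (5.38)
    print(np.round(a, 4), round(pi_i, 5))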
Next, to ensure convergence of the algorithm, the constant step size γ is chosen to satisfy the conditions of Theorem 5.3.12, which requires a Lipschitz constant L of the global gradient ∇F. We now show that all the agents can determine such a constant in finite time and in a distributed manner, and then locally select a suitable step size. To this end, suppose that the agents know Lipschitz constants l̄_i of their local gradients ∇f_i. If f_i is twice continuously differentiable, such an l̄_i := max_{x∈X} |∇²f_i(x)| can be found easily, especially when X is compact (which holds in this example). Clearly, ∇F is also Lipschitz continuous on X with constants L̄ and L̂ given by

  L̄ := ∑_{i∈V} l̄_i ≤ N max_{i∈V} l̄_i =: L̂.

On the one hand, by using a max-consensus protocol [139], all the agents can determine max_{i∈V} l̄_i and hence L̂ (since N is assumed known to every agent). However, L̂ could be a very loose bound, leading to a very small step size and thus reduced convergence speed. On the other hand, L̄ is usually a tighter bound, which can also be computed locally by using the minimal polynomials obtained earlier, as follows. After (5.60), let all the agents also perform the update (where t denotes the iteration index, not physical time)

  l_i(t+1) = ∑_{j∈N_i} w_ij l_j(t), ∀i ∈ V, t = 0, …, N,

where l_i(0) = π_i^{−1} l̄_i. Upon termination, each agent i can find

  (∑_{τ=0}^{D_i} a_τ^{(i)} l_i(τ)) / (∑_{τ=0}^{D_i} a_τ^{(i)}) = ∑_{i∈V} π_i l_i(0) = ∑_{i∈V} l̄_i = L̄,

where the first equality follows from Theorem 5.2.3. Now the agents can locally compute the step size γ = ε · 2N/L̄, where a common ε ∈ (0, 1) is known to the agents beforehand. Since L̄ is usually not the least Lipschitz constant, ε can be set to 1. (A small ε leads to a small step size, which can reduce the convergence speed.) In this example, {l̄_i} = {2, 20, 8.9, 7.4, 4} and we take γ = 2N/L̄ = 10/42.3. Moreover, we let every agent i compute F_k^{(i)} = (1/N) F(s_i(kκ)) and use a relative tolerance ε = 10⁻⁶ (see Appendix A.3.9 and criterion (A.42)) to locally terminate the main algorithm.

The simulation results of our algorithm are given in Fig. 5.2, with (randomly generated) initial conditions

  x(0) = [0.6238, 1.4262, −0.9162, 1.5838, −1.1648]^T.

As expected, the s_i (depicted by solid lines) become identical after the first κ steps and then converge to the optimal solution x* (shown by the dashed line), while the x_i reach limit cycles of period κ. Every agent locally decides to stop at t = 186 (i.e., k = 31), since each finds that |F_30^{(i)} − F_29^{(i)}| = 7.5253 × 10⁻⁷ |F_30^{(i)}| (noting that F_30^{(i)} is computed at time t = 31κ). Upon termination, s_i(186) = 0.7402 and F(s_i(186)) = 9.4813. We also carry out the centralized subgradient method (5.3), in the form of (5.19), using the same step size γ_k = γ and the starting point s̄(κ) = π^T x(0). The simulation result for s̄(kκ) is marked with ◦ in the top-left subplot of Fig. 5.2 and agrees with s_i(kκ). We further compare the objective error of our algorithm with that obtained from the DPS method (5.5) with diminishing, non-summable step sizes of the form γ(t) = a/t^b, where a = [0.01 : 0.05 : 0.5] and b = [0.5 : 0.1 : 1]. For this method, we denote s̄(t) = (1/N) ∑_i x_i(t). Results for a few samples of (a, b) are given in the right subplot of Fig. 5.2. Here we also scale the gradients in DPS by a factor of (Nπ_i)^{−1}, just as in the re-weighting technique of [93, 100] and in (5.37). Note that in general the DPS method is not guaranteed to converge when some f_i are nonconvex.
Moreover, the DSM (5.4) fails to converge if γ(t) is not selected carefully, e.g., a > 0.5 and b = 0.5. Clearly, our algorithm outperforms DPS in terms of convergence rate and the number of gradient evaluations.

Figure 5.2: Network responses for example 1 with convex cost functions having Lipschitz continuous gradient, using Algorithm (5.32) and (5.37). Left: for any i ∈ V, s_i(t) (solid lines) converges to the optimal solution (dashed line) and x_i(t) reaches a limit cycle of period κ; in the top-left panel, ◦ represents s̄(kκ) of the centralized subgradient method implemented as (5.19). Right: objective error comparison with DPS using step size γ(t) = a/t^b, where (blue) solid lines correspond to a = 0.01, (green) dashed lines to a = 0.05, (black) dotted lines to a = 0.1, and (cyan) dash-dotted lines to a = 0.2; for each a, results are plotted for b = 0.5 and 1. The results from our algorithm are shown as red circles ◦. The algorithm terminates locally for all the agents at t = 186, with the relative error of the global cost function guaranteed to be less than ε = 10⁻⁶.

With the same network and weight matrix as before, we now consider quadratic cost functions f_i(x) = b_i(x − c_i)², where c = [0, 4, 3, 1, 1]^T and b = [3, 1, 3, 3, 4]^T. Fig. 5.3 shows the performance of Algorithm (5.53)-(5.54) with different values of the step size γ. Clearly, sufficiently small γ yields near dead-beat responses.

Figure 5.3: Network responses for example 1 with quadratic cost functions when using Algorithm 5.3 with κ = 7, x(0) = c, and four values of γ (10⁻¹, 10⁻², 10⁻³, 10⁻⁶).

5.6.2 Example 2: Network of 200 agents with ℓ1 cost functions

Now we consider a set of N = 200 agents communicating over a ring graph, where N_i = {i, i ± 1, i ± 10} for all i ∈ V (indices taken modulo 200; e.g., if i + 10 > 200, by i + 10 we mean i + 10 − 200). Here, the graph is distance regular with diameter diam(G) = 10, and each agent has 5 neighbors (including itself). Assume that W = [w_ij] is such that w_ij = 1/|N_i| if j ∈ N_i and w_ij = 0 otherwise. In this case, by Theorem 5.5.1 (with ε = 0.2), we have deg(q_i) = diam(G) + 1 = 11, ∀i ∈ V. The local cost functions are f_i(x) = |x − c_i| with x ∈ X = [0, 100], and c_i = i if i ≤ 100 and c_i = 0.2(i − 100) otherwise. Thus the goal of the agents is to find the median value of {c_i}_{1}^{N}; in this case, x* = 16.9. The choice of c_i is motivated by the desire to have x* far from the mean µ_c of the elements of {c_i} (here µ_c = 30.3, which could be found by any averaging protocol).

Here, each local cost function is convex but nondifferentiable and has subgradients bounded by L_0 = 1. Therefore, we will use Algorithm (5.32)-(5.33), where each agent initializes x_i(0) = c_i. We also want to compare the performance of our algorithm with the one in [134]. Thus, we choose the step size by Theorem 5.3.5(ii), i.e., γ = (R/L_0)√(κ/T). In this example, we take the number of iterations T = 4N (as suggested by [134]), R = 100 (the size of X), and L_i = L_0 = 1. (A better choice of R can be found as follows: at time κ, s̄(κ) = s_i(κ) = ∑_{i=1}^{N} x_i(0)/N = µ_c; thus each agent can take R = max_{u∈X} |s_i(κ) − u| = 100 − s_i(κ) = 69.7, which improves the objective error bound in (5.22) by 30.3%.) Since deg(q_i) = 11, ∀i ∈ V, κ must satisfy κ ≥ 11. We suppose that each agent sets κ = 50. Hence, γ = 25.
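A quick check of the Example-2 targets (the construction of {c_i}, the median 16.9, and the mean 30.3 far from it) is sketched below; the minimizer of ∑_i |x − c_i| over R is the median of {c_i}, which here lies inside X = [0, 100].

import numpy as np

i = np.arange(1, 201)
c = np.where(i <= 100, i, 0.2 * (i - 100)).astype(float)
print(np.median(c))   # 16.9 = x*, the minimizer of sum_i |x - c_i|
print(np.mean(c))     # 30.3 = mu_c, far from the median by design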
In this example, we let each agent i reevaluate a^{(i)} (equivalently, its minimal polynomial) after every κ steps (note that κ > 2 deg(q_i) in this case); the reason is that we observe from simulations that the computation of minimal polynomials in large graphs is prone to numerical error. The results are given in Fig. 5.4. Algorithm (5.32)-(5.33) and the one in [134] do not converge asymptotically to the optimal solution but rather approach a solution neighborhood, since both use constant step sizes. Ours does so faster, and the size of the neighborhood is much smaller (thanks to the rate O(R L_max √(κ/T)), as compared with O(R² L_max √(N/T)) in [134]; see Remark 5.3.10). There are some small numerical errors in the simulation of s(t), but these do not cause instability in our algorithm. The DSM (5.4) with γ(t) = 1/√t admits asymptotic convergence, but at a very slow rate. Furthermore, it is not trivial to choose a "good" step size sequence or a stopping criterion with a performance guarantee (such as consensus and an objective error bound), and it is especially difficult to do so in a distributed fashion. In contrast, our algorithm allows efficient stopping criteria such as (5.26)-(5.27) (and (A.42) in Appendix A.3.9), or a predetermined number of iterations. Moreover, consensus (of the optimization variables s_i) is always guaranteed upon termination.

Figure 5.4: Responses of the network in example 2. Dashed line: optimal solution. (a)-(b): Algorithm (5.32)-(5.33), where the sub-figure within (a) is a zoom-in on the period [400, 800]; (c): the algorithm of [134] (Olshevsky, 2016) with constant step size β = 1/(L_0 √(NT)); (d): the Distributed Subgradient Method (5.4) with γ(t) = 1/√t.

5.7 Concluding Remarks

We have presented three fast algorithms for the distributed optimization problem (5.1) on a fixed, directed graph, with convergence time linear in the maximum degree of the agents' minimal polynomials rather than in the network size. From a broader view, our algorithms can be seen as a way of distributing the centralized subgradient method without sacrificing its convergence behavior, at the cost of the algorithm being slowed down by the larger time scale needed for diffusing information through the network.

Among possible directions for future work, we mention the problem of designing the weight matrix W for a given network topology so as to achieve the smallest possible κ_min. For large networks, κ_min could be large, and hence determining the exact minimal polynomial could be a challenging task for each agent in terms of memory storage and computational capability. In this scenario, the use of approximations of minimal polynomials or of other finite-time consensus protocols may be a problem deserving investigation. Another appealing direction is to adapt the algorithms to time-varying networks, as well as to networks influenced by noise and/or delays.
Chapter 6: Distributed Optimization over Directed Graphs with Row Stochasticity and Constraint Regularity

Abstract: This chapter deals with an optimization problem over a network of agents, where the cost function is the sum of the individual (possibly nonsmooth) objectives of the agents and the global constraint set is the intersection of local constraints; this problem is more general than that of the previous chapter. The main goals of this chapter are: (i) to remove the need for column stochasticity; (ii) to relax the compactness assumption; and (iii) to provide a unified convergence analysis. Specifically, assuming the communication graph to be fixed and directed and the weight matrix to be (only) row stochastic, a distributed projected subgradient algorithm and a variation of the algorithm are presented to solve the problem for cost functions that are convex and Lipschitz continuous. The key component of the algorithms is the adjustment of each agent's subgradient by an estimate of the agent's corresponding entry of the normalized left Perron eigenvector of the weight matrix. These estimates are obtained locally from an augmented consensus iteration using the same row stochastic weight matrix and requiring very limited global information about the network. Moreover, based on a regularity assumption on the local constraint sets, a unified analysis is given that can be applied both to unconstrained problems and to constrained ones without assuming compactness of the constraint sets. Finally, the convergence rate of the algorithms is studied in terms of the distance from each agent's available estimate to the global constraint set and an objective error defined on this set.

6.1 Introduction

As in Chapter 5, we consider a network of agents without a central coordination unit that is tasked with solving a global optimization problem in which the objective function is the sum of local costs of the agents, that is, F(x) = ∑_{i=1}^{N} f_i(x), where f_i : R^m → R represents the private objective of agent i and N is the number of agents in the network. In addition, each agent may be associated with a private constraint set. Many distributed optimization methods have been developed to address this problem; see, e.g., [32, 39, 42, 43, 76-81, 83, 94, 134, 144, 145] and references therein.

Although much research has been carried out in this problem area, most of the existing literature invokes the assumption that communication among agents is bidirectional, i.e., that for any pair of neighboring agents, each agent receives information from the other. This assumption further allows many distributed algorithms to employ doubly stochastic weight matrices, providing straightforward mechanisms for the agents to reach an optimal consensus. However, the double stochasticity assumption is not always practical in real-world applications. This is the case, for example, when agents have different communication ranges due to environmental effects or individual broadcast power limits.

In this work, we consider a more general case in which the communication among agents is not necessarily bidirectional, and thus is naturally represented by a directed graph. This scenario has recently been considered in [82, 93, 145]. A common idea in these works is the combination of a (sub)gradient distributed optimization algorithm with the Push-Sum protocol [146].
One essential requirement of this protocol is that each agent knows its out-degree exactly and/or controls its outgoing weights so that they sum to one, leading to a column stochastic weight matrix. The same requirement is imposed in [147], where the authors develop a distributed subgradient algorithm employing a weight balancing technique. Such a requirement, however, can be impractical in many other situations, especially when agents use a broadcast-based communication scheme and thus neither know their out-neighbors nor are able to adjust their outgoing weights (i.e., the weights that others put on their information). In wireless sensor networks, for instance, directed communication can arise as a consequence of geometric network layout or nonuniform power limits: each node can only send information to nodes lying within its coverage area, without receiving acknowledgment signals from them. A similar scenario may also arise during the operation of a network initially designed to implement a column stochastic weight matrix; for example, a node may encounter an unreliable or broken incoming channel and have no "cheap" and local means of informing the sender that it is not receiving packets (i.e., that it is logically not an out-neighbor of the sender). As a result, the performance of the network as a whole is not guaranteed in the presence of this malfunctioning communication link, even if the network remains strongly connected. Thus, networks relying on a column stochastic weight matrix may not be robust to link failure.

In comparison with a column stochastic weight matrix, one that is row stochastic is much easier to achieve in a distributed setting. Here, each agent can individually (and to some extent arbitrarily) decide the weights placed on the information it receives from its neighbors. Thus, if the weight matrix is required to be only row stochastic, there is no need for nodes to send acknowledgment signals. As an immediate but important consequence, a network requiring only a row stochastic weight matrix is more robust to link losses/jamming, and even to changes in the network structure. This makes row stochastic matrices suitable for reaching consensus in broadcast-based communication environments, for example ad hoc wireless networks. However, when a row stochastic matrix is used for distributed optimization, most (if not all) (sub)gradient-based algorithms fail to achieve an optimal solution, due to the nonuniform stationary distribution of the weight matrix (also known as its normalized left Perron eigenvector). In [93], the authors suggest a re-weighting technique that makes it possible to use a row stochastic matrix in distributed optimization. The same technique is employed in [94]. However, the implementation of the algorithms in [93, 94] assumes knowledge of the graph, namely the stationary distribution of the weight matrix and the number of agents in the network. Indeed, a fully distributed algorithm employing only row stochastic weight matrices has thus far not been available in the field of distributed optimization.

In this work, we achieve such algorithms under mild requirements on the available global network information and under the assumption that the network is strongly connected. More precisely, we present a distributed algorithm, and a variation on it, that use a row stochastic weight matrix and assume only that each agent knows an upper bound on the number of agents in the network. Our idea is as follows.
We let all the agents perform an augmented consensus protocol in order to estimate the stationary distribution of the weight matrix while updating their states using an iteration akin to that of the Distributed Projected Subgradient (DPS) method (see, e.g., [137, 148, 149]), except that the subgradient values are now scaled appropriately and locally by the agents. Here, the estimation step is implemented concurrently with the optimization step, so no network communication overhead is added. Moreover, although the algorithm is based on the projected subgradient method, we believe that its principle (i.e., the use of a particular augmented consensus) can be generalized to a class of distributed algorithms that use consensus and subgradient steps.

Another important contribution is our unified convergence analysis (together with the convergence rate), which applies both to unconstrained problems and to constrained problems with identical or nonidentical private constraint sets. Most existing works on subgradient-based methods assume the problem to be either unconstrained [78, 134, 145] or constrained with identical (often compact) constraint sets [81, 93, 137, 148, 150]. Nonidentical constraints are considered in [94, 137, 151], where the local constraint sets are assumed to be compact and their intersection to have a nonempty interior. In our work, we assume regularity of the constraint sets, which is weaker than requiring boundedness and allows the global constraint set to have an empty interior. We establish convergence of our algorithms to the optimal solution and demonstrate how the rate of convergence depends on the step size sequence, exhibiting similarity to that of the centralized subgradient approach. To the best of our knowledge, convergence rates of distributed subgradient methods have not been studied before for the case of nonidentical unbounded constraint sets (possibly with an empty-interior intersection).

Preliminary work along the lines of this chapter appeared in [150], where only one algorithm was presented and several proofs were omitted. In addition, it is assumed in [150] that all the local constraint sets are identical and compact, while in this chapter we consider nonidentical constraint sets and relax the compactness requirement, allowing for a broader class of applications. The current chapter further introduces a variation on the algorithm presented in [150] and presents a new convergence analysis that holds for both algorithms under these relaxations. Here the proof technique relies on the regularity assumption on the local constraint sets and is thus significantly different from that of [150]. Finally, the rate of convergence, which was not shown in [150], is studied here for both algorithms.

The rest of the chapter proceeds as follows. The problem formulation and proposed algorithms are given in Section 6.2. The convergence and the convergence rate of the algorithms are studied in Sections 6.3 and 6.4, respectively. Section 6.5 includes a numerical example to illustrate our findings. Concluding remarks are given in Section 6.6.

Additional Notation and Terminology: The projection of a vector x onto a nonempty closed convex set X ⊆ R^m is denoted by P_X(x), i.e., P_X(x) = arg min_{y∈X} ‖x − y‖, where, as usual, ‖·‖ denotes the 2-norm. We also denote by dist(x, X) the (Euclidean) distance from x to X, i.e., dist(x, X) = ‖x − P_X(x)‖.
The following inequality is known as the nonexpansiveness property:

  ‖P_X(x) − P_X(y)‖ ≤ ‖x − y‖, ∀x, y ∈ R^m.   (6.1)

We will employ the notion of regularity of the constraint sets, which plays an important role in the study of projection algorithms. This notion involves upper bounding the distance of a point to the intersection of a collection of closed convex sets in terms of its distance to each set (see [152, 153]). Recalled next is the definition needed here, stated in a finite-dimensional setting.

Definition 6.1.1. A collection of closed convex sets {X_i, i ∈ V} (with a nonempty intersection) is regular with respect to a nonempty set B ⊆ R^m if there exists a constant r_B ≥ 1 such that

  dist(x, ∩_{i∈V} X_i) ≤ r_B max_{i∈V} dist(x, X_i), ∀x ∈ B.   (6.2)

It is said to be regular if B = R^m.

For example, if the sets X_i are identical, then they are regular. By contrast, the two sets X_1 = {(x_1, x_2) : x_2 ≥ x_1²} and X_2 = {(x_1, x_2) : x_2 ≤ 0}, with X_1 ∩ X_2 = {0}, are not regular with respect to any ball centered at the origin: for x = (t, 0) with small t ≠ 0, dist(x, X_1 ∩ X_2) = |t| while max_i dist(x, X_i) ≤ t², so no finite constant r_B can satisfy (6.2).

6.2 Problem Formulation and Proposed Algorithms

Consider a network consisting of N agents where the underlying communication is characterized by a fixed directed graph G = (V, E). All agents share the objective of solving

  min_{x∈R^m} F(x) := ∑_{i∈V} f_i(x),  s.t.  x ∈ ∩_{i∈V} X_i =: X,   (6.3)

where each f_i : R^m → R is a convex function representing the private objective of agent i, and each X_i is a convex constraint set available only to agent i. Obviously, F is also convex. Let F* and X* denote the optimal value and the optimal solution set of the problem (i.e., X* = {x ∈ X : F(x) = F*}). Let

  U := conv( ∪_{i∈V} X_i ),   (6.4)

i.e., the convex hull of ∪_{i∈V} X_i. The following assumptions are adopted in the sequel.

Assumption 6.2.1. (Basic Problem Assumptions) Problem (6.3) satisfies the following:

(a) (Constraint sets) The sets X_i ⊆ R^m are closed and convex, and X ≠ ∅. Moreover, {X_i, i ∈ V} is regular with respect to U.

(b) (Bounded subdifferential) For any i ∈ V, f_i : R^m → R is convex with subdifferential bounded on U, i.e.,

  ∃ L_f ∈ (0, ∞): ‖g_i‖ ≤ L_f, ∀g_i ∈ ∂f_i(x), ∀x ∈ U.   (6.5)

(c) The solution set X* is nonempty.

Here, the regularity assumption on the collection of constraint sets is weaker than requiring boundedness, which allows us to consider a broader class of optimization problems. This assumption holds trivially with r_U = 1 when the constraint sets are identical. An unconstrained problem is a special case with X_i = R^m, ∀i ∈ V. The regularity assumption is also satisfied if the sets X_i are compact and X has a nonempty interior; such assumptions are used in [94, 137]. In fact, by [152, Cor. 2], one can deduce that r_U = ∑_{i=1}^{N} (D_U/δ)^i is a regularity constant, where D_U denotes the diameter of U and δ is the radius of a ball lying in U. Other important cases include the sets X_i being hyperplanes or half-spaces (see [153]). Note also that since each f_i is convex on R^m, Assumption 6.2.1(b) implies that each individual cost function f_i is L_f-Lipschitz continuous on U. This will be the case if all the X_i are compact, since then U is also compact. Assumption 6.2.1(c) is satisfied when, e.g., the sets X_i are closed and at least one of them is compact, since then X is compact. In general, however, we do not require compactness of the constraint sets.

In our setting, agent i only has access to f_i and to local information on its neighbors' opinions, and no central coordinating node is assumed to exist. Thus, the agents need to collaborate in a distributed manner to solve problem (6.3).
This involves local iterative computation along with information diffusion. We are interested in the scenario where the communication graph G connecting the agents is directed and fixed. We make the following additional blanket assumptions.

Assumption 6.2.2. (Connectivity) The network G = (V, E) is strongly connected.

Assumption 6.2.3. (Unique ID) The agents are labeled 1, 2, …, N and their messages carry a unique identifier of the sender. Moreover, all the agents know the value N (or an upper bound N' ≥ N).

Assumption 6.2.3 is only technical, implying that each agent can distinguish messages from its neighbors. This will be the case if media access control (MAC) addresses are used. Here, at any time slot, each agent exchanges its current state with its neighbors (in accordance with the directed network structure). Upon receiving the information from its neighbors (including itself), agent i incorporates knowledge of these states using a weighted averaging scheme. Thus, each edge (i, j) ∈ E is associated with a weight w_ij ≥ 0 (locally chosen by agent i). Let the weight matrix W = [w_ij] satisfy the following condition.

Assumption 6.2.4. (Weight Rule) The matrix W satisfies w_ii > 0 for i ∈ V, w_ij > 0 for (i, j) ∈ E, and w_ij = 0 otherwise. Moreover, W is row stochastic.

This assumption means that the zero-nonzero structure of the weight matrix W reflects the network structure. Note also that W has positive diagonal elements, reflecting that each agent has access to its own state. Further, W is irreducible under Assumption 6.2.2. Again we stress that, unlike existing algorithms in the literature, the weight matrix W is assumed to be only row stochastic, not doubly stochastic or column stochastic. As a result, each agent i controls the i-th row of W, independently of the others. This also gives each agent the freedom to decide the weights it places on its neighbors' information, and explains why row stochastic matrices are well suited to ad hoc wireless networks.

We now propose the following distributed algorithm to solve problem (6.3) under all the assumptions above.

Algorithm 6.1. At time t = 0, agent i initializes an estimate x_i(0) ∈ X_i and a variable z_i(0) = e_i ∈ R^N (or ∈ R^{N'} if only a bound N' ≥ N is available). For each time t ≥ 0, all agents update their states as follows:

  x_i(t+1) = P_{X_i}( ∑_{j∈V} w_ij x_j(t) − γ(t) g_i(t)/z_ii(t) ),   (6.6)
  z_i(t+1) = ∑_{j∈V} w_ij z_j(t).   (6.7)

Here, N_i is the set of node i's in-neighbors (including itself), γ(t) is a nonnegative step size (to be specified later), g_i(t) ∈ ∂f_i(∑_{j∈V} w_ij x_j(t)) is a subgradient of f_i, and z_i(t) = [z_i1, z_i2, …, z_iN]^T for each i ∈ V. Note that, because w_ii > 0, ∀i ∈ V (cf. Assumption 6.2.4), together with (6.7) and z_ii(0) = 1, it can be shown (later, in Lemma 6.3.3) that z_ii(t) > 0, ∀t ≥ 0, ∀i ∈ V. Thus (6.6) is well defined.
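The following is a minimal sketch of Algorithm 6.1 on a toy scalar instance (m = 1) with made-up data: f_i(x) = |x − c_i| with interval constraints X_i whose intersection is regular, and a merely row stochastic W. Each agent rescales its subgradient by z_ii(t), its running estimate of π_i.

import numpy as np

W = np.array([[0.6, 0.4, 0.0],
              [0.3, 0.4, 0.3],
              [0.5, 0.0, 0.5]])        # row stochastic, strongly connected
N = W.shape[0]
c = np.array([0.0, 2.0, 4.0])          # f_i(x) = |x - c_i|
lo = np.array([-5.0, 0.0, 1.0])        # X_i = [lo_i, hi_i]; X = [1, 3]
hi = np.array([3.0, 5.0, 3.0])

x = np.clip(c, lo, hi)                 # x_i(0) in X_i
Z = np.eye(N)                          # z_i(0) = e_i, stacked as rows
for t in range(1, 20001):
    gamma = 1.0 / t ** 0.75            # satisfies Assumption 6.2.5 below
    v = W @ x                          # consensus mix sum_j w_ij x_j(t)
    g = np.sign(v - c)                 # subgradient of |x - c_i| at v_i
    x = np.clip(v - gamma * g / np.diag(Z), lo, hi)   # (6.6) with P_{X_i}
    Z = W @ Z                          # (6.7): Z(t) = W^t, so z_ii -> pi_i
print(x)   # entries approach 2, the median of c, which lies in X = [1, 3]

Note that the estimation iterates z_ii(t) = [W^t]_ii converge geometrically to π_i (see Proposition 6.3.4 below), so the early-iteration scaling error is summable against a diminishing step size.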
(In fact, if initialized with zi(0) = pi, then it follows from (6.7) that zi(t) = pi for all t ≥ 0.) In this case, our 201 rescaling subgradient technique reduces to the reweighting scheme used in [93,94]. We also remark that the DPS method in [137] can be applied to time-varying networks but requires the weight matrix to be doubly stochastic at each time t. Further, for nonidentical constraint sets, [137] only considers complete graphs and assumes that the intersection set X has nonempty interior. Later, [94] extended the method to directed time-varying graphs possibly with (fixed and uniform) com- munication delays but still requires doubly stochastic weight matrices and compact constraint sets with nonempty interior. Thus, the results in these works are not readily applicable to cases where the Xi are unbounded (e.g., X = Rm) and/or X has an empty interior (e.g., an Xi includes linear equality constraints) and the weight matrix is only row stochastic. Another extension in [145] dealing with the unconstrained case employs column stochastic matrices. Algorithm 6.1 can be seen as an extension of DPS under the fixed network setting where only row stochastic weight matrices are used. Note also that here we assume the network is fixed during one run of the algorithm. Between any two consecutive runs, the network structure is allowed to change, and our algorithm need not be adjusted except each agent i may need to reselect new weights wij for its new neighbor set (which is a trivial task). Moreover, our development technique does not employ the compactness of the constraint sets as well as nonempty interior of their intersection. The following variation on Algorithm 6.1 will also be considered, where each agent takes the subgradient step first, followed by the consensus step: Algorithm 6.2. With the same initializations as in Algorithm 6.1, all agents update 202 their states according to xi(t+ 1) = PXi (∑ j∈V wij ( xj(t)− γ(t) gj(t) zjj(t) )) (6.8) zi(t+ 1) = ∑ j∈V wijzj(t), (6.9) where gj(t) ∈ ∂fj(xj(t)), i.e., a subgradient of fj at xj(t) (which differs from the subgradient used in (6.6) of Algorithm 6.1). It has been shown in [76] that the order of the optimization step and the consensus step in the original DPS method can be interchanged, which, if a constant step size is used, often gives a better convergence speed to a solution neighborhood [156]. Comparison between Algorithms 6.1 and 6.2, however, is out of the scope of this chapter. In this work, the following type of diminishing step size sequence will be used to ensure convergence of our algorithms to the optimal solution. For the convergence rate analysis, a less restrictive assumption will be employed. Assumption 6.2.5. (Step Size Rule) The step size sequence {γ(t)} is positive non- increasing and satisfies ∑∞ t=0 γ(t) =∞ and ∑∞ t=0 γ 2(t) <∞. There are many ways to choose the step size sequence γ(t) satisfying this assumption, e.g., γ(t) = c tθ ,∀t ≥ 1, for some constants c > 0 and θ ∈ (0.5, 1]. 6.3 Basic Relations and Convergence Result In this section, we simultaneously prove the convergence of both algorithms (6.6)- (6.7) and (6.8)-(6.9). We begin with a few basic results that will be used later. 203 First, besides the nonexpansiveness property (6.1), other properties of a pro- jection operator are given in the following lemma. Lemma 6.3.1. ([137]) Let Y ⊆ Rm be a nonempty closed convex set. Then for any x ∈ Rm and y ∈ Y , (a) (PY (x)− x)T (x− y) ≤ −‖PY (x)− x‖2. (b) ‖PY (x)− y‖2 ≤ ‖x− y‖2 − ‖PY (x)− x‖2. 
Second, the following lemma is a consequence of the convexity of the function ‖ · ‖2. Lemma 6.3.2. For any a1, . . . , aN ≥ 0 such that ∑N i=1 ai = 1, we have ‖ ∑N i=1 aixi‖2 ≤∑N i=1 ai‖xi‖2 for ∀xi ∈ Rm, i = 1, . . . , N . Next, we characterize the convergence of the power iteration of the weight matrix in the following lemma, which is a consequence of the Perron-Frobenius theorem (see, e.g., [95]). Lemma 6.3.3. (Convergence of power of weight matrix) Let Assumptions 6.2.2 (Connectivity) and 6.2.4 (Weight Rule) hold. Then limt→∞W t = 1piT , where pi > 0 is the normalized left Perron eigenvector of W . Moreover, the convergence is geometric with rate λ ∈ (|λ2(W )|, 1), where λ2(W ) is the second largest eigenvalue of W. Proof. Under Assumptions 6.2.2 and 6.2.4, W is an irreducible row stochastic matrix with positive diagonal entries, and thus primitive (i.e., W is irreducible and has only 204 one eigenvalue of maximum modulus; see, e.g., [95, Thm. 8.5.2 and Lem. 8.5.5]). The result now follows from [95, Thm. 8.5.1]. The next proposition, describing the convergence of the estimation step in (6.7), follows directly from the foregoing lemma and will be used in the sequel. Proposition 6.3.4. (Convergence of zii) Consider iteration (6.7). Let Assumptions (Connectivity) and 6.2.4 (Weight Rule) hold. Then for each λ ∈ (|λ2(W )|, 1), there exists C=C(λ,W ) > 0 such that the following hold for ∀i, j ∈ V and ∀t ≥ 0: |[W t]ji − pii| ≤ Cλt, |zii(t)− pii| ≤ Cλt. (6.10) Moreover, there exists η > 0 such that η−1 ≤ zii(t) ≤ 1, ∀t ≥ 0,∀i ∈ V . (6.11) Proof. Let Z(t) = [z1(t), z2(t), · · · , zN(t)]T . It follows from Algorithm 6.1 that for any t ≥ 0, Z(t+ 1) = WZ(t), Z(0) = I. Thus, Z(t) = W t,∀t ≥ 0. Hence, (6.10) follows by Lemma 6.3.3 for some C > 0 and λ ∈ (|λ2(W )|, 1). Next, for each i ∈ V , by (6.7), we have zii(t + 1) = ∑ j∈V wijzji(t), where zii(0) = 1, zji(0) = 0,∀j 6= i. Clearly, 1 ≥ zij(t) ≥ 0,∀i, j ∈ V ,∀t ≥ 0. Since limt→∞ zii(t) = pii > 0, there exists t0 ≥ 0 such that zii(t) ≥ pii/2,∀i ∈ V ,∀t > t0. Moreover, we have that zii(t0) ≥ wiizii(t0 − 1) ≥ . . . ≥ wt0ii zii(0) > 0 since wii > 0 (cf. Assumption 6.2.4). Therefore, zii(t) > 0 for any t ∈ [0, t0]. By taking η−1 = min{zii(t), pii/2,∀i ∈ V ,∀t ∈ [0, t0]}, 205 then (6.11) follows as desired. Remark 6.3.5. In the rest of the chapter, the parameters C, λ and η refer to the constants in Proposition 6.3.4. We now turn to iterations (6.6) and (6.8). Our next result describes a general relation on the overall evolution of the states of the agents in terms of their distances from any point v ∈ X as well as the weighted averaged state vector x¯(t), defined as x¯(t) := ∑ j∈V pijxj(t), ∀t ≥ 0. (6.12) This relation also involves the step size sequence γ(t) and an error term ( F (x¯(t))− F (v) ) , which in general is not the global objective error since x¯(t) may not be in X; it is so if the constraint sets {Xi, i ∈ V} are identical. Theorem 6.3.6. (Bound on evolution of xi) Let Assumptions 6.2.1 (Basic Problem Assumptions), 6.2.2 (Connectivity), 6.2.3 (Unique ID) and 6.2.4 (Weight Rule) be satisfied. 
Then for both Algorithms 6.1 and 6.2, the following holds for any v ∈ X and t ≥ 0: ∑ i∈V pii‖xi(t+ 1)− v‖2 ≤ (1 +D1λ2t) ∑ i∈V pii‖xi(t)− v‖2 − 2γ(t)(F (x¯(t))− F (v))−∑ i∈V pii‖φi(t)‖2 +D2γ(t) ∑ i∈V pii‖xi(t)− x¯(t)‖+D3γ2(t), (6.13) where D1 = NCLfη,D2 = 2Lfη, D3 = L 2 fη 2 +NLfCη, and φi(t) := PXi (∑ j∈V wijxj(t)− γ(t) gi(t) zii(t) ) − (∑ j∈V wijxj(t)− γ(t) gi(t) zii(t) ) (6.14) 206 for Algorithm 6.1 whereas for Algorithm 6.2, φi(t) is defined as φi(t) := PXi (∑ j∈V wij ( xj(t)− γ(t) gj(t) zjj(t) )) − ∑ j∈V wij ( xj(t)− γ(t) gj(t) zjj(t) ) (6.15) Proof. We provide here a proof for the case of Algorithm 6.1. The proof for Algo- rithm 6.2 is given in Appendix A.4.1. Let yi(t) := ∑ j∈V wijxj(t). By using (6.6) and the definition of φi(t) (cf. (6.14)), we have for any v ∈ X ⊆ Xi ‖xi(t+ 1)− v‖2 = ∥∥∥yi(t)− v − γ(t) gi(t) zii(t) + φi(t) ∥∥∥2 = ∥∥∥yi(t)− v − γ(t) gi(t) zii(t) ∥∥∥2 + ‖φi(t)‖2 + 2φi(t)T(yi(t)− v − γ(t) gi(t) zii(t) ) ≤ ∥∥∥yi(t)− v − γ(t) gi(t) zii(t) ∥∥∥2 − ‖φi(t)‖2, (6.16) where the last inequality follows from the fact that (cf. Lemma 6.3.1(a)) φi(t) T ( yi(t)− γ(t) gi(t) zii(t) − v) ≤ −‖φi(t)‖2. The first term on the right side of (6.16) equals ‖yi(t)− v‖2 + 2γ(t) zii(t) gi(t) T (v − yi(t)) + γ 2(t) z2ii(t) ‖gi(t)‖2. (6.17) We now derive an upper bound for each term in this sum. Rewriting yi(t) − v =∑ j∈V wij(xj(t)− v) then using Lemma 6.3.2 yields ‖yi(t)− v‖2 ≤ ∑ j∈V wij‖xj(t)− v‖2. (6.18) Next, ignoring the positive factor 2γ(t) zii(t) , the second term in (6.17) can be bounded as follows: gi(t) T (v − yi(t)) ≤ fi(v)− fi(yi(t)) ≤ fi(v)− fi(x¯(t)) + ∣∣fi(yi(t))− fi(x¯(t))∣∣ ≤ fi(v)− fi(x¯(t)) + Lf ∑ j∈V wij ‖xj(t)− x¯(t)‖ . (6.19) 207 where the first inequality holds since gi(t) ∈ ∂fi(yi(t)), the second follows from the triangle inequality, and the last one from Lf -Lipschitz continuity of fi over conv (⋃ i∈V Xi ) (cf. Assumption 6.2.1(b)) and the triangle inequality. By continuing (6.16) and using (6.17), (6.18), (6.19) and the conditions that ‖gi(t)‖ ≤ Lf and z−1ii (t) ≤ η,∀i ∈ V ,∀t ≥ 0, we have ‖xi(t+ 1)− v‖2 ≤ ∑ j∈V wij‖(xj(t)− v)‖2 + 2γ(t) zii(t) (fi(v)− fi(x¯(t)))− ‖φi(t)‖2 + 2Lf γ(t) zii(t) ∑ j∈V wij ‖xj(t)− x¯(t)‖+ γ2(t)L2fη2. (6.20) Multiplying both sides by pii then summing over i ∈ V yields ∑ i∈V pii‖xi(t+ 1)− v‖2 ≤ ∑ i∈V pii ∑ j∈V wij‖xj(t)− v‖2 + 2 ∑ i∈V piiγ(t) zii(t) (fi(v)− fi(x¯(t)))− ∑ i∈V pii‖φi(t)‖2 + 2Lf ∑ i∈V piiγ(t) zii(t) ∑ j∈V wij ‖xj(t)− x¯(t)‖+ γ2(t)L2fη2. (6.21) Now consider each term on the right side of (6.21). First, ∑ i∈V pii ∑ j∈V wij‖xj(t)− v‖2 = ∑ i∈V pii‖xi(t)− v‖2, (6.22) where we have used the fact that piTW = piT . Second, ∑ i∈V pii zii(t) ( fi(v)− fi(x¯(t)) ) = ∑ i∈V fi(v)− fi(x¯(t)) + ∑ i∈V ( pii zii(t) − 1)(fi(v)− fi(x¯(t))) ≤ F (v)− F (x¯(t)) + ∑ i∈V |zii(t)− pii| zii(t) |fi(x¯(t))− fi(v)| ≤ F (v)− F (x¯(t)) +NCLfηλt ‖x¯(t)− v‖ , (6.23) 208 where C > 0 and λ ∈ (0, 1) satisfy (6.10), and η satisfies (6.11). Third, we also have ∑ i∈V pii zii(t) ∑ j∈V wij ‖xj(t)− x¯(t)‖ ≤ η ∑ i,j∈V piiwij ‖xj(t)− x¯(t)‖ = η ∑ i∈V pii ‖xi(t)− x¯(t)‖ (6.24) Now, combining (6.21)-(6.24) yields ∑ i∈V pii‖xi(t+ 1)− v‖2 ≤ ∑ i∈V pii‖xi(t)− v‖2 − 2γ(t) (F (x¯(t))− F ∗) − ∑ i∈V pii‖φi(t)‖2 + 2γ(t)NCLfηλt‖x¯(t)− v‖ + 2γ(t)Lfη ∑ i∈V pii ‖xi(t)− x¯(t)‖+ γ2(t)L2fη2. (6.25) Finally, by writing x¯(t)−v = ∑i∈V pii(xi(t)−v) and then using the Cauchy-Schwarz inequality and Lemma 6.3.2, we have 2γ(t)λt‖x¯(t)− v‖ ≤ γ2(t) + λ2t‖ ∑ i∈V pii(xi(t)− v)‖2 ≤ γ2(t) + λ2t ∑ i∈V pii‖xi(t)− v‖2 Using this bound for (6.25) and then rearranging terms yields (6.13) as desired. 
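To make the preceding developments concrete, the following is a minimal simulation sketch of Algorithm 6.1 on a toy instance; the four-node directed cycle, weights, ℓ1 objectives, and box constraints are hypothetical stand-ins chosen only for illustration (they are not the setting of Section 6.5). It also exhibits the behavior asserted in Proposition 6.3.4: the diagonal iterates zii(t) approach pii geometrically.

```python
import numpy as np

# Minimal sketch of Algorithm 6.1 on a toy instance (hypothetical data).
# Directed 4-cycle with self-loops: W is row stochastic with positive
# diagonal (Assumption 6.2.4) but is neither doubly nor column stochastic.
N, m = 4, 2
a = np.array([0.5, 0.7, 0.6, 0.8])
W = np.zeros((N, N))
for i in range(N):
    W[i, i] = a[i]                 # self-weight w_ii > 0
    W[i, (i - 1) % N] = 1 - a[i]   # weight on the single in-neighbor

rng = np.random.default_rng(0)
c = rng.normal(size=(N, m))        # f_i(x) = ||x - c_i||_1
lo = -1.0 - rng.random(N)          # X_i = [lo_i, hi_i]^m; all boxes
hi = 1.0 + rng.random(N)           # contain [-1, 1]^m, so X is nonempty

x = np.array([np.clip(rng.normal(size=m), lo[i], hi[i]) for i in range(N)])
z = np.eye(N)                      # z_i(0) = e_i, hence Z(t) = W^t

for t in range(200_000):
    gamma = (t + 1) ** -0.75       # satisfies Assumption 6.2.5
    y = W @ x                      # rows: y_i = sum_j w_ij x_j
    g = np.sign(y - c)             # g_i(t), a subgradient of f_i at y_i
    x = np.array([np.clip(y[i] - gamma * g[i] / z[i, i], lo[i], hi[i])
                  for i in range(N)])            # update (6.6)
    z = W @ z                                    # update (6.7)

w_eig, v_eig = np.linalg.eig(W.T)
pi = np.real(v_eig[:, np.argmax(np.real(w_eig))])
pi /= pi.sum()                     # normalized left Perron eigenvector
print("max_i |z_ii - pi_i| :", np.abs(np.diag(z) - pi).max())
print("max disagreement    :", np.ptp(x, axis=0).max())
```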
Before proceeding further, it is worth highlighting the differences between this result, in particular (6.13), with that obtained from the usual DPS method [137] in the context of Algorithm 6.1. First, since the (normalized) left Perron eigenvector pi is nonuniform, we opt for employing the weighted average vectors x¯(t) (as well as∑ i∈V pii‖xi(t) − v‖2) instead of the exact average one. Of course, when pi = 1/N , i.e., W is doubly stochastic, the former vector reduces to the latter. Second, the term D1λ 2t ∑ i∈V pii‖xi(t)− v‖2 (or more precisely the term 2γ(t)NCLfηλt‖x¯(t)− v‖ in 209 (6.25)) arises as a consequence of each agent i using an estimate zii(t) of pii generated from the estimation step (6.7). Finally, since we do not require the constraint sets to be bounded or identical (or have a nonempty interior), the projection error φi is not guaranteed to be bounded a priori and the term ( F (x¯(t))− F (v)) does not reflect the global objective error (as x¯(t) need not be in X). Therefore, quantifying the behaviors of these terms and errors will be the main challenging task in analyzing the convergence as well as the convergence rates of our algorithms; this calls for new results that are more accessible than (6.13) as we develop in the sequel. We now provide some bounds on the terms ‖xi(t)−x¯(t)‖ and ‖φi(t)‖ appearing in (6.13) in terms of the step size sequence γ(t) and the total projection error β(t), defined as β(t) := ∑ i∈V ‖φi(t)‖, ∀t ≥ 0. (6.26) Theorem 6.3.7. Let Assumptions 6.2.1 (Basic Problem Assumptions), 6.2.2 (Con- nectivity), 6.2.3 (Unique ID), and 6.2.4 (Weight Rule) hold. The following hold for both Algorithms 6.1 and 6.2: (a) Let D4 := C ∑ j∈V ‖xj(0)‖. For any i ∈ V, ‖xi(t)− x¯(t)‖ ≤ D4λt +D1 ∑ 0≤s≤t−1 λt−sγ(s) + C ∑ 0≤s≤t−1 λt−sβ(s). (6.27) (b) Define θ(t) := γ(t) ∑ 0≤s≤t−1 λt−sβ(s), θ(0) = 0. (6.28) If {γ(t)} is nonincreasing, then θ(t+ 1) ≤ λθ(t) + λγ(t)β(t). (6.29) 210 Proof. (a) First, we express (6.6) and (6.8) in the form xi(t+ 1) = ∑ j∈V wijxj(t) + i(t), (6.30) where i(t) ∈ Rm is an error term. Then, we have xi(t) = ∑ j∈V [W t]ijxj(0) + ∑ 0≤s≤t−1 ∑ j∈V [W t−s]ijj(s). Since x¯(t) = ∑ j∈V pijxj(t) and pi TW = piT , it follows that x¯(t) = ∑ j∈V pijxj(0) + ∑ 0≤s≤t−1 ∑ j∈V pijj(s). Thus, the term ‖xi(t)− x¯(t)‖ can be expressed as∥∥∥∑ j∈V ( [W t]ij − pij ) xj(0) + t−1∑ s=0 ∑ j∈V ( [W t−s]ij − pij ) j(s) ∥∥∥ ≤ ∑ j∈V ∣∣[W t]ij − pij∣∣ ‖xj(0)‖+ t−1∑ s=0 ∑ j∈V ∣∣[W t−s]ij − pij∣∣ ‖j(s)‖. Hence, by using the bound in (6.10), we then have ‖xi(t)− x¯(t)‖ ≤ D4λt + C ∑ 0≤s≤t−1 λt−s ∑ j∈V ‖j(s)‖. (6.31) Now consider Algorithm 6.1, where it follows from (6.6) and (6.14) that i(t) = φi(t) − γ(t) gi(t)zii(t) . By using the triangle inequality and the facts that ‖gi(t)‖ ≤ Lf (cf. Assumption 6.2.1(b)) and that z−1ii ≤ η (see (6.11)), we obtain ‖i(t)‖ ≤ ‖φi(t)‖+ γ(t)Lfη, ∀i ∈ V . (6.32) Next, we show that this bound also holds for Algorithm 6.2. From (6.8) and (6.15) we have i(t) = φi(t)− γ(t) ∑ j∈V wij gj(t) zjj(t) . As a result, for ∀i ∈ V ‖i(t)‖ ≤ ‖φi(t)‖+ γ(t) ∑ j∈V wij ‖gj(t)‖ |zjj(t)| ≤ ‖φi(t)‖+ γ(t)Lfη. 211 By combining (6.32) and (6.31) and rearranging terms, we have ‖xi(t)− x¯(t)‖ ≤ D4λt + C ∑ 0≤s≤t−1 λt−s ( Nγ(s)Lfη + β(s) ) . (b) By using the definition of θ(t) and the monotonicity of {γ(t)}, we have θ(t+ 1) ≤ γ(t) ∑ 0≤s≤t λt+1−sβ(s) = λθ(t) + λγ(t)β(t), which concludes the proof. We note the following. First, it is clear from (6.27) that the effect of initial conditions on the differences between agents’ states vanishes exponentially. 
Second, one can view the last two terms on the right side of (6.27) as the convolutions of γ(t) and β(t) with λt. Thus, for the convergence of the algorithms, we expect these terms to decay to zero under a suitable choice of γ(t). For example, when limt→∞ γ(t) = 0, we show next that limt→∞ ∑t−1 s=0 λ t−sγ(s) = 0. However, whether this also implies limt→∞ ∑t−1 s=0 λ t−sβ(s) = 0 is inconclusive since β(t) depends on the agents’ states and the sets Xi. Finally, we introduced θ(t) in order to study the behavior of the term γ(t) ∑ i∈V pii‖xi(t)− x¯(t)‖ in (6.13). Corollary 6.3.8. In Theorem 6.3.7, if limt→∞ β(t) = 0, then limt→∞ θ(t) = 0. Additionally, if limt→∞ γ(t) = 0, then limt→∞ ∑ i∈V pii‖xi(t)− x¯(t)‖ = 0. Proof. Clearly, it suffices to prove that for any λ ∈ (0, 1) and any nonnegative sequence {β(t)}t≥0 satisfying limt→∞ β(t) = 0, limt→∞ ∑t s=0 λ t−sβ(s) = 0. This claim is stated in [137, Lem. 7]. Our next result is basically a consequence of Theorems 6.3.6 and 6.3.7 un- der the regularity assumption on the constraint sets. Specifically, we will apply 212 the bounds obtained in (6.27) and (6.29) to (6.13), and then select suitable associ- ated coefficients to generate a more accessible relation, which is key to proving the convergence as well as convergence rate of the algorithms. Theorem 6.3.9. Let Assumptions 6.2.1 (Basic Problem Assumptions), 6.2.2 (Con- nectivity), 6.2.3 (Unique ID), and 6.2.4 (Weight Rule) be satisfied. The following holds for both Algorithms 6.1 and 6.2 and for any nonincreasing step size sequence {γ(t)}: ∑ i∈V pii‖xi(t+ 1)− v‖2 + abθ(t+ 1) ≤ (1 +D1λ2t) ∑ i∈V pii‖xi(t)− v‖2 + abθ(t) − 2γ(t)(F (s(t))− F (v))−D6∑ i∈V ‖φi(t)‖2 +D24γ(t)λ t +D21γ(t) ∑ 0≤s≤t−1 λt−sγ(s) +D′3γ 2(t), (6.33) where s(t) = PX ( x¯(t) ) , pimin = mini∈V pii, b = √ pimin Nλ , a = D′2C (1−λ)b , D ′ 2 = D2 + 2LR pimin , R is a regularity constant of {Xi, i ∈ V}, D6 = pimin2 , D24 = D′2D4, D21 = D′2D1 and D′3 = 2D3+λa2 2 . Proof. By adding and subtracting F (s(t)) and using the Lipschitz continuity of F we have F (v)− F (x¯(t)) ≤ F (v)− F (s(t)) + Lf‖s(t)− x¯(t)‖. Now we find an upper bound on the term ‖s(t)−x¯(t)‖. By the regularity assumption of {Xi, i ∈ V}, there exists R such that dist(x, X) ≤ Rmaxi∈V dist(x, Xi), ∀x ∈ 213 conv(∪i∈VXi). As a result, we have ‖s(t)− x¯(t)‖ = dist(x¯(t), X) ≤ Rmax i∈V dist(x¯(t), Xi) ≤ R ∑ i∈V pii pimin dist(x¯(t), Xi) ≤ R ∑ i∈V pii pimin ‖xi − x¯(t)‖, (6.34) where the last inequality holds since dist(x¯(t), Xi) ≤ ‖xi− x¯(t)‖ (cf. Lem. 6.3.1(b)). Hence, F (v)− F (x¯(t)) ≤ F (v)− F (s(t)) + LfR pimin ∑ i∈V pii‖xi(t)− x¯(t)‖. Using this bound for (6.13), we then have ∑ i∈V pii‖xi(t+ 1)− v‖2 ≤ (1 +D1λ2t) ∑ i∈V pii‖xi(t)− v‖2 − 2γ(t)(F (s(t))− F (v))−∑ i∈V pii‖φi(t)‖2 + (D2 + 2LR pimin )γ(t) ∑ i∈V pii‖xi(t)− x¯(t)‖+D3γ2(t). Next, by adding abθ(t+1) to both sides of this relation and using the bounds (6.27) and (6.29), we further have ∑ i∈V pii‖xi(t+ 1)− v‖2 + abθ(t+ 1) ≤ (1 +D1λ2t) ∑ i∈V pii‖xi(t)− v‖2 + abθ(t) + ab(λ− 1)θ(t) + abλγ(t)β(t) − 2γ(t)(F (s(t))− F (v))−∑ i∈V pii‖φi(t)‖2 +D3γ2(t) +D24γ(t)λ t +D21γ(t) ∑ 0≤s≤t−1 λt−sγ(s) +D2Cθ(t). (6.35) 214 Now with the choice of a = D′2C (1−λ)b , the terms ab(λ− 1)θ(t) and D′2Cθ(t) cancel out. Further, by the Cauchy-Schwarz inequality, we have abγ(t)β(t) ≤ a 2γ2(t) + b2β2(t) 2 ≤ a 2γ2(t) 2 + b2n 2 ∑ i∈V ‖φi(t)‖2 = a2 2 γ2(t) + pimin 2λ ∑ i∈V ‖φi(t)‖2. The last equality holds since b2 = pimin Nλ . As a result, we have abλγ(t)β(t)− ∑ i∈V pii‖φi(t)‖2 ≤ λa 2 2 γ2(t)− pimin 2 ∑ i∈V ‖φi(t)‖2. 
It remains to apply the relations above to (6.35) and then rearrange terms to obtain (6.33). It should be noted that (6.33) holds uniformly on X since the constants Di are independent of the choice of v ∈ X. When restricted to X∗, we immediately have a relation between the (weighted average) distance (squared) from the optimal solution, i.e., ∑ i∈V pii‖xi(t)− v∗‖2, and the global objective error F (s(t))− F ∗ (as s(t) ∈ X), both of which are desired to converge under a suitable choice of step size sequence. We are now ready to give a convergence result that applies to both Algo- rithms (6.6)-(6.7) and (6.8)-(6.9), whose proof is based on the Theorem 6.3.9 and the following lemma. Lemma 6.3.10. ([157]) Let {vt}∞t=0, {ut}∞t=0, {bt}∞t=0 and {ct}∞t=0 be nonnegative sequences such that ∑∞ t=0 bt <∞, ∑∞ t=0 ct <∞ and for ∀t ≥ 0 vt+1 ≤ (1 + bt)vt − ut + ct. (6.36) 215 Then {vt} converges and ∑∞ t=0 ut <∞. Theorem 6.3.11. (Convergence to optimal solution) Let Assumptions 6.2.1-6.2.5 be satisfied. Then both Algorithms 6.1 and 6.2 yield convergence to the optimal solution, i.e., ∃x∗ ∈ X∗ : lim t→∞ xi(t) = x ∗, ∀i ∈ V . (6.37) Proof. The proof proceeds in two steps: (i) apply Lemma 6.3.10 to (6.33), and then (ii) prove convergence to the optimal solution. Step (i): Let x† be arbitrary in X∗ and define the nonnegative sequences {vt}, {ut}, {bt} and {ct} as follows: vt := ∑ i∈V pii‖xi(t)− x†‖2 + abθ(t), bt := D1λ2t, ut := 2γ(t)(F (s(t))− F ∗) +D6 ∑ i∈V ‖φi(t)‖2, ct := D24γ(t)λ t +D21γ(t) ∑ 0≤s≤t−1 λt−sγ(s) +D′3γ 2(t). By adding the nonnegative term D1λ 2tabθ(t) to the right hand side of (6.33), we obtain vt+1 ≤ (1 + bt)vt − ut + ct, ∀t ≥ 0. We now show that other conditions of Lemma 6.3.10 also hold, namely ∑∞ t=0 bt < ∞ and ∑∞t=0 ct < ∞. The former condition is obvious since λ ∈ (0, 1) implies that ∑∞ t=0 bt = (1 − λ2)−1. To prove the latter, consider each term in ct. First,∑ t≥0 γ 2(t) < ∞ by Assumption 6.2.5. Second, by the Cauchy-Schwarz inequality, 216 we have γ(t)λt ≤ (γ2(t) + λ2t)/2. Thus ∑ t≥0 γ(t)λt ≤ 1 2 ∑ t≥0 γ2(t) + 1 2 ∑ t≥0 λ2t <∞. (6.38) Third, by monotonicity of sequence {γ(t)} (cf. Assumption 6.2.5) the second term in ct can be bounded as follows: γ(t) ∑t−1 s=0 λ t−sγ(s) ≤∑t−1s=0 λt−sγ2(s) ≤∑ts=0 λt−sγ2(s). Thus, for any N ≥ 1 we have N∑ t=1 γ(t) t−1∑ s=0 λt−sγ(s) ≤ ∑ 0≤s≤t≤N λt−sγ2(s) ≤ N∑ s=0 γ2(s) ∞∑ t=s λt−s = ∑ 0≤s≤N γ2(s) 1− λ ≤ ∑ s≥0 γ 2(s) 1− λ <∞. This concludes that {ct} is summable as desired. Therefore, in view of Lemma 6.3.10, the following hold: ∃ lim t→∞ ∑ i∈V pii‖xi(t)− x†‖2 + abθ(t) =: δ ≥ 0 (6.39) ∑ t≥0 γ(t)(F (s(t))− F ∗) + D6 2 ∑ i∈V ‖φi(t)‖2 <∞. (6.40) Step (ii): First, by (6.40), we have limt→∞ ∑ i∈V ‖φi(t)‖2 = 0. Thus, limt→∞ β(t) = 0, which by Corollary 6.3.8 yields limt→∞ θ(t) = 0. It then follows from (6.39) that lim t→∞ ∑ i∈V pii‖xi(t)− x†‖2 = δ. (6.41) As a result, for each i ∈ V , {xi(t)}t≥0 is a bounded sequence. Thus so are {x¯(t)}t≥0 and {s(t)}t≥0. Next, since ∑ t≥0 γ(t) =∞, it then follows from (6.40) that lim inft→∞ F (s(t)) = F ∗. Thus, there exists a subsequence {s(tk)} ⊆ {s(t)} such that lim k→∞ F (s(tk)) = F ∗. (6.42) 217 Now since {s(tk)} is also a bounded sequence, there exists a convergent subsequence {s(tl)} ⊆ {s(tk)}. Denote liml→∞ s(tl) = x∗ for some x∗ ∈ X (since X is closed). We next show that x∗ ∈ X∗. By the continuity of F on Rm lim l→∞ F (s(tl)) = F (x ∗). (6.43) which in view of (6.42) implies that F (x∗) = F ∗. By convexity of F , we conclude that x∗ ∈ X∗. Since x† ∈ X∗ was chosen arbitrarily, we can let x† = x∗. 
Now it remains to show that δ = 0, which by (6.41) will then complete the proof. By the triangle and Cauchy-Schwarz inequalities ‖xi(t)− x∗‖2 ≤ (‖xi(t)− x¯(t)‖+ ‖x¯(t)− s(t)‖+ ‖s(t)− x∗‖)2 ≤ 3(‖xi(t)− x¯(t)‖2 + ‖x¯(t)− s(t)‖2 + ‖s(t)− x∗‖2). Next, since ‖x¯(t) − s(t)‖ ≤ R pimin ∑ i∈V pii‖xi − x¯(t)‖ (cf. (6.34)), we have ‖s(t) − x¯(t)‖2 ≤ R2 pi2min ∑ i∈V pii‖xi − x¯(t)‖2 by Lemma 6.3.2. As a result, 1 3 ‖xi(t)− x∗‖2 ≤ ‖xi(t)− x¯(t)‖2 + R 2 pi2min ∑ i∈V pii‖xi − x¯(t)‖2 + ‖s(t)− x∗‖2. Multiplying both sides by pii and summing over i ∈ V yields ∑ i∈V pii 3 ‖xi(t)− x∗‖2 ≤ R′ ∑ i∈V pii‖xi(t)− x¯(t)‖2 + ‖s(t)− x∗‖2, where R′ = 1+ R 2 pi2min . Taking lim inf as t→∞ both sides of this inequality and using (6.41) yield: δ 3 ≤ lim inf t→∞ ( R′ ∑ i∈V pii‖xi(t)− x¯(t)‖2 + ‖s(t)− x∗‖2 ) = lim inf t→∞ ‖s(t)− x∗‖2. (6.44) 218 Here we have used the superadditivity property of the limit inferior and the fact that limt→∞ ∑ i∈V pii‖xi(t)− x¯(t)‖2 = 0 since limt→∞ β(t) = 0 (see Corollary 6.3.8). Since the subsequence {s(tl)} converges to x∗, we have lim inft→∞ ‖s(t) − x∗‖ = 0, which in view of (6.44) implies that δ = 0. 6.4 Rate of Convergence We now discuss the convergence rate of our algorithms, which evidently depends on the choice of γ(t). Since the estimation step (6.7) converges exponentially, one should expect that the convergence rate of the objective error is equivalent to that of usual distributed subgradient methods in the case when the constraint sets are identical and/or compact. We emphasize, however, that such assumptions are relaxed in our work, i.e., the sets Xi can be nonidentical and unbounded. Moreover, the global constraint set X is also allowed to have an empty interior. Thus, for all i ∈ V , the agents’ estimates xi(t) as well as their weighted average x¯(t) need not be in the set X at any time t. As a result, local analysis around the optimal solution does not readily apply. In this work, to quantify the distance from the optimum, we propose to use a combined error term which involves (i) the distance from a local estimate x˜i(t) of each agent to some point s˜(t) ∈ X and (ii) the global objective error evaluated at s˜(t), i.e., F (s˜(t))− F ∗. Specifically, we define x˜i(t) := ∑t k=0 γ(k)xi(k)∑t k=0 γ(k) , s˜(t) := ∑t k=0 γ(k)s(k)∑t k=0 γ(k) . (6.45) Here, for each t ≥ 1, x˜i(t) is a convex combination of xi(0),xi(1), . . . ,xi(t), which 219 can be computed locally by agent i but might not be in X. In contrast, s˜(t) always belongs to X but is not directly available to each agent. The following theorem asserts that both errors ‖x˜i(t)− s˜(t)‖ and F (s˜(t))− F ∗ decay as O( ∑t k=0 γ 2(k)∑t k=0 γ(k) ). Theorem 6.4.1. (Convergence rate) Let Assumptions 6.2.1 (Basic Problem As- sumptions), 6.2.2 (Connectivity), 6.2.3 (Unique ID) and 6.2.4 (Weight Rule) hold. Let {γ(t)} be a nonnegative and nonincreasing sequence. Then for both Algorithms 6.1 and 6.2, the following holds for ∀t ≥ 0: C0‖x˜i(t)− s˜(t)‖+ F (s˜(t))− F ∗ ≤ C1 + C2 ∑t k=0 γ 2(k)∑t k=0 γ(k) , (6.46) where C0 = D6(1−λ) 2N(N+1)Cλ , some C1 > 0 and C2 = O( N4C2 (1−λ)2 e D1 1−λ2 ) as N → ∞ and λ→ 1 (recalling that D1 = NCLfη and D6 = pimin/2). Moreover, if {Xi, i ∈ V} are compact, the constant C2 is O( N4C2 (1−λ)2 ). The proof of Theorem 6.4.1 is structured in the following steps: (i) Use the bound (6.33) in Theorem 6.3.9 to upper estimate the sum t∑ k=0 2γ(k)(F (s(k))− F ∗) +D6 ∑ i∈V ‖φi(k)‖2 in terms of ∑ 0≤k≤t γ(k) and ∑t k=0 γ 2(k). 
(ii) Relate the left side of (6.46) to this sum by using the convexity of F and the bounds given in Theorem 6.3.7. (iii) Analyze the constants Ci. The following technical lemma will be used in Step (i) for the general case where {Xi,∀i ∈ V} are not necessarily bounded. 220 Lemma 6.4.2. For any D > 0 and λ ∈ (0, 1), it holds that 1 + D 1− λ ≤ ∏ t≥0 (1 +Dλt) ≤ e D1−λ . (6.47) Proof of Lemma 6.4.2. Note that for any T ≥ 1 we have 1 +D ∑ 0≤t≤T λt ≤ ∏ 0≤t≤T (1 +Dλt) ≤ eD ∑T t=0 λ t where the second inequality follows from the basic relation that 1 + x ≤ ex for any x ≥ 0. Taking the limit as T →∞ yields the desired result. Proof of Theorem 6.4.1. We proceed through the 3 steps described above. Step (i): Let the nonnegative sequences {vt}, {ut}, {bt} and {ct} be defined as in Step (i) of the proof of Theorem 6.3.11, i.e., vt := ∑ i∈V pii‖xi(t)− x∗‖2 + abθ(t), bt := D1λ2t, ut := 2γ(t)(F (s(t))− F ∗) +D6Φt, Φt := ∑ i∈V ‖φi(t)‖2 ct := D24γ(t)λ t +D21γ(t) ∑ 0≤s≤t−1 λt−sγ(s) +D′3γ 2(t). By using Theorem 6.3.9 and adding the nonnegative term btabθ(t) to the right hand side of (6.33), we have vt+1 ≤ (1 + bt)vt − ut + ct, ∀t ≥ 0, which then implies that vt+1 ≤ ∏ 0≤k≤t (1 + bk)v0 + ∑ 0≤k≤t (ck − uk) ∏ k+1≤s≤t (1 + bs). (6.48) 221 By Lemma 6.4.2, the following holds for any t, k ≥ 0 1 < ∏ k≤s≤t (1 + bs) < e D1 1−λ2 =: De. As a result, (6.48) implies that vt+1 ≤ Dev0 + ∑ 0≤k≤t Deck − ∑ 0≤k≤t uk, (6.49) from which by rearranging terms and using the fact that vt+1 ≥ 0, we have (recalling the definition of ut) ∑ 0≤k≤t γ(k)(F (s(k))− F ∗) + D6 2 Φk ≤ R1 +R2 ∑ 0≤k≤t ck, (6.50) where R1 = Dev0/2 and R2 = De/2. Next, we will derive an upper bound on the term ∑t k=0 ck based on the following estimates: ∑ 0≤k≤t γ(k)λk ≤ ∑ 0≤k≤t γ(0)λk ≤ γ(0) 1− λ, (6.51) and ∑ 0≤k≤t γ(k) ∑ 0≤s≤k λk−sγ(s) ≤ ∑ 0≤k≤t ∑ 0≤s≤k γ2(s)λk−s = ∑ 0≤s≤t γ2(s) ∑ s≤k≤t λk−s ≤ ∑t s=0 γ 2(s) 1− λ . (6.52) Hence, ∑ 0≤k≤t ck ≤ D24γ(0) 1− λ + ( D21 1− λ +D ′ 3) ∑ 0≤k≤t γ2(k). Therefore, ∑ 0≤k≤t γ(k)(F (s(k))− F ∗) + D6 2 Φk ≤M1 +M2 ∑ 0≤k≤t γ2(k), (6.53) 222 where M1 = R1 + R2D24γ(0) 1− λ , M2 = R2 ( D21 1− λ +D ′ 3 ) . Step (ii): Now we derive lower bounds on the left hand side of (6.53). Recall that s˜(t) = ∑t k=0 γ(k)s(k)/ ∑t k=0 γ(k). By convexity of F , we then have F (s˜(t))− F ∗ ≤ ∑t k=0 γ(k) ( F (s(k))− F ∗)∑t k=0 γ(k) . (6.54) Next, we will relate the term ‖x˜i(t)− s˜(t)‖ with ∑t k=0 Φk. By the triangle inequality, it can be shown that ‖x˜i(t)− s˜(t)‖ ≤ ∑t k=0 γ(k)‖xi(k)− s(k)‖∑t k=0 γ(k) . (6.55) We now quantify the numerator of the right hand side of (6.55). First, note that (cf. (6.27)) ‖xi(t)− x¯(t)‖ ≤ D4λt +D1 ∑ 0≤s≤t−1 λt−sγ(s) + C ∑ 0≤s≤t−1 λt−sβ(s). Second, let R be a regularity constant of {Xi, i ∈ V}. Then ‖s(t)− x¯(t)‖ = dist(x¯(t), X) ≤ Rmax i∈V dist(x¯(t), Xi) ≤ R ∑ i∈V dist(x¯(t), Xi) ≤ R ∑ i∈V ‖xi − x¯(t)‖. Thus, by the triangle inequality and the two previous relations, ‖xi(k)− s(k)‖ (N + 1)C ≤ D4 C λt + D1 C ∑ 0≤s≤t−1 λt−sγ(s) + ∑ 0≤s≤t−1 λt−sβ(s) 223 which implies that (see the definition of θ(t) in Theorem 6.3.7(b)) ∑ 0≤k≤t γ(k) ‖xi(k)− x¯(k)‖ (N + 1)C ≤ D4 C ∑ 0≤k≤t γ(k)λk + D1 C ∑ 0≤k≤t γ(k) ∑ 0≤s≤k−1 λk−sγ(s) + ∑ 0≤k≤t θ(k). (6.51)−(6.52) ≤ D4γ(0) (1− λ)C + D1 (1− λ)C ∑ 0≤s≤t γ2(s) + ∑ 0≤k≤t θ(k). (6.56) The last term can be bounded as follows. By (6.29) and noting that θ(0) = 0, we have ∑ 0≤k≤t θ(k) ≤ λ ∑ 0≤k≤t−1 θ(k) + λ ∑ 0≤k≤t−1 γ(k)β(k) ≤ λ ∑ 0≤k≤t θ(k) + λ ∑ 0≤k≤t γ2(k) 4 + β2(k), where we have used the fact that γβ ≤ γ2 4 +β2,∀γ, β ∈ R. Rearranging terms yields ∑ 0≤k≤t θ(k) ≤ λ 1− λ ∑ 0≤k≤t−1 γ2(k) 4 + β2(k). 
(6.57) Moreover, by the Cauchy-Schwarz inequality, we have ∑ 0≤k≤t β2(k) = ∑ 0≤k≤t (∑ i∈V ‖φi(k)‖ )2 ≤ ∑ 0≤k≤t N ∑ i∈V ‖φi(k)‖2 ≤ N ∑ 0≤k≤t Φk. (6.58) Using this bound and (6.57) for (6.56), we obtain C0 ∑ 0≤k≤t γ(k)‖xi(k)− x¯(k)‖ ≤M3 +M4 ∑ 0≤k≤t γ2(k) + D6 2 ∑ 0≤k≤t Φk, with C0 = D6(1− λ) 2N(N + 1)Cλ , M3 = D6D4γ(0) 2NλC , M4 = D6 2N (D1 λC + 1 4 ) . Combining the inequality above with (6.53) yields C0 ∑ 0≤k≤t γ(k)‖xi(k)− x¯(k)‖+ ∑ 0≤k≤t γ(k)(F (s(k))− F ∗) ≤ (M1 +M3) + (M2 +M4) ∑ 0≤k≤t γ2(k). 224 Let C1 := M1 + M3, C2 := M2 + M4. Dividing both sides by ∑t k=0 γ(k) and then using (6.54) and (6.55) yields (6.46) as desired, i.e., C0‖x˜i(t)− s˜(t)‖+ F (s˜(t))− F ∗ ≤ C1 + C2 ∑t k=0 γ 2(k)∑t k=0 γ(k) . Step (iii): We now discuss the constant associated with the convergence rate in terms of the network size and the spectral gap 1−λ. To this end, we assume, for simplicity that pi−1min = O(N) (in fact pimin ≤ 1N ) and that η = O(N). Then it can be verified that the dominant term is M2 in C2, which is O (N4C2R2 (1− λ)2 ) = O ( N4C2 (1− λ)2 e D1 1−λ2 ) . A better estimate can be obtained if we assume further that {Xi, i ∈ V} are compact. In this case, there exists DX > 0 such that ‖xi(t) − x∗‖2 ≤ DX ,∀i ∈ V ,∀t ≥ 0. Thus, by using Theorem 6.3.9, we have for any t ≥ 0 vt+1 ≤ vt + bt ∑ i∈V pii‖xi(t)− x†‖2 − ut + ct ≤ vt +DXbt − ut + ct ≤ v0 + ∑ 0≤k≤t DXbk − uk + ck ≤ v0 + DXD1 1− λ2 + ∑ 0≤k≤t ck − uk. Here we have used the facts that ∑ i∈V pii = 1 and t∑ k=0 bk = D1 t∑ k=1 λ2k ≤ D1 1− λ2 . As a result, ∑ 0≤k≤t uk ≤ v0 + D1DX 1− λ2 + ∑ 0≤k≤t ck, ∀t ≥ 0. (6.59) 225 Thus, we have that (6.50) still holds but with R1 = v0+ D1DX 1−λ2 and R2 = 1 (compared to R2 = De/2 as before). Hence, the constant C2 reduces to O( N4C2 (1−λ)2 ). We remark that the explicit formulas for C1 and C2 obtained in the proof are rather involved. Thus to simplify the estimate orders of C2, we have assumed that pi−1min = O(N) (in fact pimin ≤ 1N ) and that η = O(N). Note also that the spectral gap, defined as 1−|λ2(W )|, also affects the constant bounds since |λ2(W )| < λ < 1, signifying the importance of the strong connectivity assumption. This result demonstrates how the convergence property of the step size se- quence implies that of our algorithms; as a side note Assumption 6.2.5 is not needed for (6.46) to hold. In particular, convergence rate analysis now boils down to study- ing the behavior of the right side of (6.46); exactly the same task has been carried out thoroughly in the literature for centralized (projected) subgradient methods (see, e.g., [115,117,138]). Thus, we proceed no further than providing a few notable results and proving another convergence bound on the objective error in the case of identical constraint sets. Corollary 6.4.3. Let the assumptions of Theorem 6.4.1 be satisfied. Let E(t) = C1+C2 ∑t k=0 γ 2(k)∑t k=0 γ(k) . The following hold. (a) If γ(t) ≡ γ, then E(t) = C2γ + C1γt . If limt→∞ γ(t) = 0 and ∑ t≥0 γ(t) = ∞, then limt→∞E(t) = 0. (b) If {Xi, i ∈ V} are identical, then there exist C˜1>0, C˜2>0 such that F (x˜i(t))− F ∗ ≤ C˜1 + C˜2 ∑t k=0 γ 2(k)∑t k=0 γ(k) . (6.60) 226 Further, if γ(t) = O( 1√ t ) then F (x˜i(t))− F ∗ = O( ln t√t ). Proof. We only prove (6.60) in part (b). Note that xi(t) ∈ X for ∀t ≥ 0 and ∀i ∈ V . By Lipschitz continuity of F , we have F (x˜i(t))−F ∗ = F (x˜i(t))−F (s˜(t))+F (s˜(t))− F ∗ ≤ NLf‖x˜i(t)− s˜(t)‖+ F (s˜(t))− F ∗. It remains to use (6.46). 
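Since the right side of (6.46) is exactly the quantity studied for centralized subgradient methods, its behavior under the standard step size choices can be checked directly; the short sketch below (with C1 = C2 = 1 chosen arbitrarily, purely for illustration) reproduces the two regimes of Corollary 6.4.3.

```python
import numpy as np

# Behavior of the bound E(t) = (C1 + C2 * sum g^2) / (sum g) from
# Corollary 6.4.3 for two step size choices (C1 = C2 = 1, arbitrary).
T = np.arange(1, 10**6 + 1)

g = 1.0 / np.sqrt(T)                      # gamma(t) = 1/sqrt(t)
E_sqrt = (1 + np.cumsum(g**2)) / np.cumsum(g)

g_const = np.full_like(g, 0.01)           # gamma(t) = 0.01 (constant)
E_const = (1 + np.cumsum(g_const**2)) / np.cumsum(g_const)

for t in (10**2, 10**4, 10**6):
    print(t, E_sqrt[t - 1], np.log(t) / np.sqrt(t), E_const[t - 1])
# E_sqrt tracks O(log t / sqrt(t)) -> 0, while E_const levels off near
# C2 * gamma = 0.01, matching parts (a) and (b) of the corollary.
```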
Note that for unconstrained problems, the convergence rate of O(ln t/√t) is also achieved by recent distributed subgradient based methods such as Dual Averaging [81] or Subgradient-Push [145].

6.5 Numerical Example

Consider a machine learning problem involving the l1-norm regularized logistic loss function

min x∈X F(x) = ∑ 1≤i≤r log( 1 + exp( −li(pTi u + v) ) ) + µ‖u‖1

with variable x = [uT, v]T, u ∈ Rm, v ∈ R. Here, µ > 0 is a regularization parameter. The training set consists of r pairs (pi, li), where pi ∈ Rm is a feature vector and li ∈ {−1, 1} is the corresponding label. Suppose that x satisfies a linear equality constraint: X = {x ∈ Rm+1 : Aeq x = beq}, where Aeq ∈ Rq×(m+1) and beq ∈ Rq. In general, when the problem data is distributed or too large to store and/or process on a single machine, employing a network of machines provides a solution. This arises in many applications such as online social network data, wireless sensor networks, and cloud computing.

In our example, this problem is to be solved by a network of N = 9 nodes with the communication graph described in Fig. 6.1. We assume r = 500, m = 50 and q = 36, and select (pi, li), Aeq and beq based on normally distributed random numbers. We choose µ = 50.

Figure 6.1: Directed communication graph of the network example.

Suppose the problem data are distributed among the N nodes as follows: each node i stores a partition Pi of roughly r/N training pairs and a set of q/N equality constraints, referred to as (A(i)eq, b(i)eq). Thus, for each agent i ∈ V, the local cost function and constraint set are given by

fi(x) = ∑ j∈Pi log( 1 + exp( −lj(pTj u + v) ) ) + (µ/N)‖u‖1,
Xi = {x ∈ Rm+1 : A(i)eq x = b(i)eq}.

We assume that the weight matrix W = [wij] is such that wij = 1/|Ni| if j ∈ Ni and wij = 0 otherwise. We carry out simulations with Algorithms 6.1 and 6.2 using step size γ(t) = 1/(N2(t+1)), and with the usual DPS method (denoted DPS-(a)) and its variation DPS-(b) (i.e., with the order of the subgradient and consensus steps reversed) using step size γ′(t) = 1/(N(t+1)). Here γ(t) and γ′(t) differ by a factor N for the sake of comparison, since subgradients in our algorithms are scaled by 1/pii (which equals N if W is doubly stochastic). The initial state vectors are xi(0) = 0, ∀i ∈ V.

The simulation results, in terms of relative errors in the objective function and the optimal solution, are shown in Fig. 6.2, where F∗ and x∗ are obtained by solving the global problem using a centralized method. Clearly, both Algorithms 6.1 and 6.2 converge to the optimal solution and have similar performances, which are comparable to the DPS methods combined with the reweighting technique [93, 94], where knowledge of pi is assumed in advance (or equivalently, (6.6) and (6.8) with zii(t) = pii, ∀i ∈ V, ∀t ≥ 0). The usual DPS methods fail to converge to the optimal solution. We also consider the case where link 1 → 2 is lost. The reweighting technique requires the whole network to be reprogrammed with a new Perron eigenvector, which may not be available immediately. In contrast, our algorithms are unchanged except for node 2 adjusting its incoming link weights. Clearly, convergence is still achieved (since the network is still strongly connected) but slower, since the spectral gap decreases.

Figure 6.2: Performances of Algorithms 6.1 and 6.2 and of the DPS methods, with and without the reweighting technique, in terms of the relative objective error (F(s(t)) − F∗)/F∗ and the relative solution error maxi ‖xi(t) − x∗‖/‖x∗‖. Reweighting means that, for each i ∈ V, pii is known to agent i in advance and zii(t) = pii, ∀t ≥ 0. Here, s(t) = PX(x¯(t)).
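The local oracles used by each agent in this example admit a direct implementation; the following sketch is our own construction (names such as local_subgrad and local_proj are illustrative, and the random data stands in for the normally distributed data of the text).

```python
import numpy as np

# One possible realization of the local data and oracles in this example
# (names and structure are ours; data mirrors the random setup in the text).
rng = np.random.default_rng(1)
N, r, m, q, mu = 9, 500, 50, 36, 50.0
feat = rng.normal(size=(r, m))            # feature vectors p_j
lab = rng.choice([-1.0, 1.0], size=r)     # labels l_j
A_eq = rng.normal(size=(q, m + 1))        # full row rank a.s.; q < m + 1,
b_eq = rng.normal(size=q)                 # so X = {A_eq x = b_eq} is nonempty

parts = np.array_split(np.arange(r), N)   # node i stores the pairs in P_i
rows = np.array_split(np.arange(q), N)    # and q/N = 4 equality constraints

def local_subgrad(i, x):
    """A subgradient of f_i at x = [u; v]."""
    u, v = x[:m], x[m]
    s = lab[parts[i]] * (feat[parts[i]] @ u + v)
    w = -lab[parts[i]] / (1.0 + np.exp(s))        # d(log-loss)/ds
    g = np.empty(m + 1)
    g[:m] = feat[parts[i]].T @ w + (mu / N) * np.sign(u)
    g[m] = w.sum()
    return g

def local_proj(i, x):
    """Euclidean projection onto X_i = {x : A_eq^(i) x = b_eq^(i)}."""
    Ai, bi = A_eq[rows[i]], b_eq[rows[i]]
    return x - Ai.T @ np.linalg.solve(Ai @ Ai.T, Ai @ x - bi)
```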
6.6 Conclusions and Extensions

In this chapter, we have proposed two modified versions of the DPS method that require only a row stochastic weight matrix and studied their convergence and convergence rates. Moreover, our analysis does not invoke the compactness requirement usually imposed on the local constraint sets and is able to deal with various scenarios, including constrained/unconstrained problems and the sets Xi being bounded/unbounded or identical/nonidentical.

It is important to note the following.

First, it is possible to employ other eigenvector estimation schemes in place of (6.7) as long as zii(t) → pii sufficiently fast (e.g., satisfying (6.10)). These include any finite-time computation algorithm, e.g., [141]. Moreover, as we have seen from Section 6.4 and also the numerical example, the convergence of our algorithms is much slower than that of the estimation step (6.7). Therefore, it is also possible to have (6.7) run asynchronously with (6.6), for example at a slower time scale, to save communication bandwidth for exchanging the xi variables and/or to communicate the zi reliably and without errors, which is important for the scaling step used in (6.6).

Second, the convergence analysis developed in this chapter can be adapted either to relax the compactness requirement in other projected subgradient based methods (e.g., [94, 137]) or to accommodate regular constraint sets in other subgradient based algorithms (e.g., [134, 145]); this holds even when the network is time-varying, possibly with fixed communication delays.

Third, the idea of using the augmented iteration (6.7) to adjust (sub)gradient magnitudes as in (6.6) is not only applicable to distributed projected subgradient methods, but can also be employed to remove the condition that the weight matrix be doubly stochastic in some other existing distributed algorithms (using consensus and (sub)gradient steps). For example, we have observed through simulations that the gradient-based method proposed in [136, 144] can be modified in the same spirit and still retains fast convergence speed under a suitable constant step size. Based on this idea, we have recently proposed a new algorithm [158] that converges linearly under a strong convexity assumption on the cost functions. We now briefly introduce this algorithm.

Algorithm 6.3. For any t ≥ 0, each agent i maintains three vectors xi(t), yi(t) ∈ Rm and zi(t) ∈ RN and updates them as follows:

xi(t+1) = ∑ j∈Ni wij xj(t) − γ yi(t) (6.61)
zi(t+1) = ∑ j∈Ni wij zj(t) (6.62)
yi(t+1) = ∑ j∈Ni wij yj(t) + gi(t+1)/zii(t+1) − gi(t)/zii(t), (6.63)

where the initial estimate xi(0) ∈ Xi, yi(0) = ∇fi(xi(0)), zi(0) = ei ∈ RN, γ is a positive constant step size, and gi(t) = ∇fi(xi(t)).

Assumption 6.6.1. (Lipschitz continuous gradients and strong convexity) The functions fi are differentiable and strongly convex. Moreover, the gradients ∇fi are Lipschitz continuous.
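A minimal sketch of one round of these updates, written for the unconstrained case and assuming per-agent gradient oracles grad_f[i] (an interface of ours, not code from [158]):

```python
import numpy as np

# One round of Algorithm 6.3 (unconstrained case); grad_f is a list of
# local gradient oracles, one per agent -- an assumed interface.
def algorithm_6_3_step(W, x, y, z, g_prev, grad_f, gamma):
    x_new = W @ x - gamma * y                    # update (6.61)
    z_new = W @ z                                # update (6.62)
    g_new = np.array([grad_f[i](x_new[i]) for i in range(len(grad_f))])
    y_new = (W @ y + g_new / np.diag(z_new)[:, None]
                   - g_prev / np.diag(z)[:, None])   # update (6.63)
    return x_new, y_new, z_new, g_new

# Initialization per the text: x[i] arbitrary, z = np.eye(N) (z_i(0) = e_i),
# g_prev[i] = y[i] = grad_f[i](x[i]), and gamma a small positive constant.
```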
Theorem 6.6.2. ([158]) Suppose Xi = Rm, ∀i ∈ V, and let the agents implement Algorithm 6.3. Under Assumptions 6.2.2, 6.2.3, 6.2.4, and 6.6.1, there exist γ¯ > 0 and µ ∈ (0, 1) such that if γ ∈ (0, γ¯) then ‖xi(t) − x∗‖ = O(µt).

Estimates of γ¯ and µ are rather involved and conservative; see [158] for details. Note also that although selection of an appropriate step size γ requires global information about the network and the cost functions (thus centralized initialization), the implementation of the algorithm is distributed and exponential convergence is achieved.

Chapter 7: Conclusions

7.1 Summary of Results

This dissertation developed theory and algorithms that advance the state of the art in analysis and applications of distributed consensus in multi-agent networks where communications are broadcast-based and directed; hence the notion of network asymmetry.

Networks with Leaders: In the first part of the thesis, we considered a DeGroot model with an external media node, representing a leader, or truth, or a source of news having a constant opinion value.

First, when consensus is the main goal of the leader, we introduced the notion of a persistent leader and developed new sufficient conditions for guaranteeing convergence for both fixed and switching topologies and in the presence of other competing but nonpersistent leaders. We also demonstrated that the results can be readily extended to the case where the persistent leader's opinion is time-varying and to the case of communications with time-varying but bounded delays.

Second, we studied the problem of a leader that aims to maximize its influence on the opinions of agents in a directed network subject to the constraint that the number of direct followers selected is not more than K. When there is only one leader and consensus is guaranteed a priori, we characterized the influence of that leader through the transient error of the network; when there is a stubborn agent or a second leader with a competing opinion, so that consensus is not possible, we measured the leader's influence in terms of the steady state error of the network. We described the optimal solution for special cases, namely K = 1, 2, for which we introduced a few notions of centrality that can be useful for practical applications. Then, for a general K, we studied a general combinatorial problem encompassing many other existing problems in the literature. We proved the supermodularity property of the objective function and the convexity of its continuous relaxation for general directed networks, and then developed practical approaches for suboptimal solutions. We demonstrated through numerical examples that the two approaches can be combined to provide effective tools and better analysis for optimal design of influence spreading in diffusive networks.

Our analysis has been shown to be useful for various applications other than those considered here. In particular, the convexity analysis offers (i) an affirmative answer to a conjecture recently proposed in [105] on optimization of on-chip thermoelectric cooling systems and (ii) a convexity result for the state trajectory of a class of bilinear discrete-time systems. The supermodularity analysis can also be used for sensor selection problems.

Consensus Prediction: In this part, we introduced and studied the problem of consensus prediction in a network whose dynamics are described by a DeGroot model.
By an application of the Hahn-Banach theorem, we established a fundamental relation between the consensus value and the network data; that is, if the consensus value can be computed at a particular time for any initial opinions, then it can be expressed as a linear combination of available observation data. This allowed us to prove a tight lower bound on the monitoring time for the case of a single observed node, regardless of the method used by the observer. We also demonstrated that this bound can be achieved if the minimal polynomial associated with the observed node is available to the observer. For the case of multiple observed nodes, we proposed a conjecture on lower bounds of the monitoring time and developed algorithms toward achieving those bounds through local observations and computations. Our results in this direction can also be regarded as a data-driven method for network identification.

Distributed Optimization: First, we demonstrated that consensus prediction can be employed for enhancing convergence of the distributed gradient method for solving a distributed convex optimization problem on a strongly connected directed network. The convergence rates of our algorithms are similar to those of the centralized gradient method, including finite time convergence for the case of quadratic objective functions, though slower by a constant factor depending on the network structure and the weight matrix. The convergence times of these algorithms scale linearly with the network diameter for certain structures (e.g., distance regular graphs) and at most linearly with the network size in general.

Second, we proposed a rescaling technique that enables distributed subgradient algorithms to work with directed networks and row stochastic matrices instead of column or doubly stochastic ones. Based on a regularity assumption, we then developed unified analyses for the convergence and convergence rate of a distributed projected subgradient method that can be applied to both unconstrained problems and constrained ones with nonidentical (and possibly unbounded) local constraint sets. We also introduced another algorithm that uses the same rescaling technique but converges linearly under a stronger assumption on the local objective functions.

7.2 Directions for Future Work

In this dissertation, we only considered discrete-time models for consensus, prediction and distributed optimization. However, there is a wide range of applications where continuous-time models are appropriate. Thus, development of similar results for the continuous-time case will be useful.

Networks with Leaders: Use of the stability conditions developed for system (2.31) to study consensus conditions in the case of a leaderless network is worth exploring to reduce the gap between necessary and sufficient conditions for consensus. Note that in the latter, we may regard any agent, e.g., agent N, as a "leader" with u(t) = ∑ j∈NN wNj(t) (xj(t) − xN(t)).

It would also be interesting to investigate and design consensus protocols for the case where there are multiple persistent leaders with time-varying states and/or malicious agents. The intuition is that if the convex hull of all the leaders' states shrinks over time to a point and the effects of the malicious agents are non-persistent, then it is still possible to reach consensus asymptotically.

Extension of the consensus results developed in Chapter 2 to coordination and synchronization of multi-agent systems is also of interest.
The convexity and supermodularity results established in Chapter 3 find applications in various important problems, including sensor/actuator placement for observability/controllability in consensus networks. Moreover, in the case of two competing leaders, it is also important to study the game played by the two leaders on the network, assuming that each has a limited budget (e.g., number of direct followers).

Consensus Prediction: Besides resolving the validity of the main conjecture for the case of multiple observed nodes, which requires rigorous analysis beyond the argument presented in Remark 4.3.6, there are numerous problems worth exploring in this topic, including the following.

Coping with Noise and Delays: First, if the communication delays are fixed, then our results can be applied in a fairly straightforward manner, as the network is still a linear time-invariant system. The case of time-varying delays remains difficult. Second, in the presence of observation or communication noise, exact consensus prediction in finite time is impossible. Since communication noise can derail consensus, observation noise is more relevant to the current topic, in which case we need to estimate the joint minimal polynomial and predict the trend of the network states and/or the range of the consensus value. In this connection, (partial) realization theory [73–75, 159] and system identification techniques [160] might be brought to bear [161, 162].

Network Monitoring for Misbehavior: As we have seen, for a network of agents whose dynamics follow the time-invariant DeGroot model, it is possible to predict the future behavior of the observed nodes as well as the consensus value by using the minimal polynomials of these nodes, which in turn can be computed from observation data. This allows the observer to detect certain changes in the dynamics of a set of nodes, dubbed misbehavior, which may be caused by faults or attacks. In the case where only approximations of these minimal polynomials are available, possibly due to corrupted or noisy observations, one can still expect to capture the trend of the network response using these approximate polynomials. Consequently, certain types of faults in the network agents' dynamics may still be detected by the observer. Thus, characterizing the misbehavior detectable from local observation is an interesting and important direction to pursue. This has a close connection to the topic of distributed fault detection and identification in the literature [163–166].

Distributed Optimization: First, since our distributed algorithm FADO developed in Chapter 5 behaves in a similar manner to the centralized gradient method, we can apply acceleration techniques by Nesterov [117] and others to FADO in order to achieve better convergence rates. The problem of designing the weight matrix W for a given network topology so as to achieve the smallest possible κmin is also of interest as another way of speeding up the algorithms we presented in this chapter. Second, extending our algorithms in Chapter 6 to the case of switching communication graphs is also worth exploring. Finally, in all these algorithms, communication noise and delays were not considered. Therefore, these issues deserve more research as well as attention for applications in practice.

Appendix A: Omitted Proofs

A.1 Known Matrix Results

Theorem A.1.1. ([95, Thm. 8.3.1]) If A ∈ RN×N+, then ρ(A) is an eigenvalue of A and there exists x ∈ RN+\{0} such that Ax = ρ(A)x.

Theorem A.1.2. ([95, Thm.
8.1.18]) Let A,B ∈ RN×N . If |A| ≤ B, then ρ(A) ≤ ρ(|A|) ≤ ρ(B). Lemma A.1.3. ([167]) Let P ∈ RN×N be the inverse of a nonsingular M-matrix. Then P ≥ 0 and Pjk ≥ PjiP−1ii Pik, ∀i, j, k = 1, . . . , N. Theorem A.1.4. (Woodbury Matrix Identity [168, p. 258]) Let A ∈ Rn×n, B ∈ Rn×r, C ∈ Rr×r, D ∈ Rr×n. Then the following holds whenever any involved inverse exists: (A−BC−1D)−1 = A−1 + A−1B(C −DA−1B)−1DA−1 (A.1) Lemma A.1.5. ([169]) Let L ∈ RN×N be a Laplacian matrix. Suppose that 0 be a simple eigenvalue of L. Let z denote the left eigenvector associated with this 240 eigenvalue and let L† be the pseudo-inverse of L. Then 1TL† = 0, L†z = 0, L†L = I − 1 N 11T , LL† = I − 1‖z‖2 zz T . Lemma A.1.6. ([169, 170]) Let d, e ∈ RN . The Moore-Penrose pseudoinverse of the rank-1 update of a matrix F ∈ RN×N is given by (F + edT )† = F † +G where G = − 1‖w‖2 vw T − 1‖m‖2 mh T + 1 + dTF †e ‖m‖2‖w‖2 mw T and v = F †e,h = (F †)Td,w = (I − FF †)e and m = (I − F †F )d. A.2 Omitted Proofs in Chapter 3 A.2.1 Proof of Theorem 3.3.1 Suppose K = {k} ⊂ V , then we have J (1) {k} = b T (L+ αkeke T k ) −1|ξ0| Now applying Lemma A.1.6 (cf. Appendix A.1) with F = L, e = ek and d = αkek yields (L+ αkeke T k ) −1 = L† +G 241 where G = − 1‖w‖2 vw T − 1‖m‖2 mh T + 1 + dTL†e ‖m‖2‖w‖2 mw T w = (I − LL†)ek = 1‖pi‖2pipi Tek = pik ‖pi‖2pi m = (I − L†L)d = αk N 11Tek = αk N 1 v = L†ek h = (L†)Tαkek Thus, G = − 1 pi2k ‖pi‖2 L†ek pik ‖pi‖2pi T − 1 α2k N αk N 1αke T kL † + 1 + αke T kL †ek α2k N pi2k ‖pi‖2 αk N 1 pik ‖pi‖2pi T = − 1 pik L†ekpiT − 1eTkL† + 1 + αkL † kk αkpik 1piT (A.2) Note also that bT1 = 1. Then we have J (1) {k} = (b TL† − L†(k))|ξ0|+ (α−1k + L†kk − bTL†k) piT |ξ0| pik . Moreover, if b = 1/N , then by Lemma A.1.5 (cf. Appendix A.1) we have bTL† = 0T . Hence, (3.19) follows immediately. A.2.2 Proof of Theorem 3.3.2 By using Woodbury identity (A.1) (cf. Appendix A.1-Theorem A.1.4) and recalling that P = L−1β , we have J (2) (k) = b T (Lβ + αkeke T k ) −1β = bT ( P − Peke T kP α−1k + e T kPek ) β. (A.3) 242 Now since L1 = 0, we have Lβ1 = L1 + diag(β)1 = β. Left-multiplying both sides with P yields Pβ = 1. It remains to use this relation to simplify A.3. A.2.3 Proof of Theorem 3.3.4 Denote P = (L+ αieie T i ) −1. By Woodbury matrix identity (A.1) we have pTij = 1T N (L+ αieie T i + αjeje T j ) −1 = 1T N ( P − Peje T j P α−1j + e T j Pej ) = pTi ( I − eje T j P α−1j + e T j Pej ) (A.4) where we have used pTi = 1 N 1TP ; see Theorem 3.3.1. Next, by (A.2), P = (L+ αieie T i ) −1 = L† − 1 pii L†eipiT − 1eTi L† + 1 + αie T i L †ei αipii 1piT Then eTj P = L †(j) − L † ji pii piT − L†(i) + 1 + αiL † ii αipii piT = L†(j) − L†(i) + (γii + γji)piT As a result, we have 1 αj + eTj Pej = 1 αj + L†jj − L†ij + (γii + γji)uj = (γjj + γij + γii + γji)pij 243 Substituting this relation into (A.4) yields pTij = p T i − pTi eje T j P α−1j + e T j Pej = pTi − pTi ej pij ∑ γij eTj P = pTi − (γii + γij)pij pij ∑ γij ( L†(j) + γjipiT + pTi ) = pTi − γii + γij∑ γij ( γjjpi T − pTj + γjipiT + pTi ) = γjj + γji∑ γij pTi + γii + γij∑ γij pTj − (γii + γij)(γjj + γji)∑ γij piT = (γii + γij)(γjj + γji)∑ γij piT − γjj + γji∑ γij L†(i) − γii + γij∑ γij L†(j), where the third to last and the last equalities follow from the relation pTi = γiipi T − L†(i) (cf. Theorem 3.3.1). This completes the proof. A.2.4 Proof of Lemma 3.5.3 First, we show that (Lβ+ΓS)−1 is nonincreasing in S. Let DS = diag(W1+β+αS) and note that ρ ( D−1S W ) < 1 (cf. Lemma 3.2.4). By the absolutely convergent Neumann series (I −D−1S W )−1 = ∑∞ i=0(D −1 S W ) i. 
Thus we have (Lβ + ΓS)−1 = (DS −W )−1 = ∑ i≥0(D −1 S W ) iD−1S (A.5) which is clearly nonnegative. Moreover, for any T ⊆ V such that T ⊇ S, we have 0N×N ≤ D−1T ≤ D−1S , which together with (A.5) implies that (Lβ + ΓT )−1 ≤ (Lβ + ΓS)−1. Alternatively, we can also use the fact that f(y) = bT (Lβ + diag(y ◦α))−1c is a non-increasing function on Ω for any b, c ∈ RN+ (cf. Theorem 3.4.3) to conclude the monotonicity of (Lβ + ΓS)−1. This proves the second inequality in (3.45). 244 We now prove the first inequality in (3.45), that is, for any v, k ∈ V\S (Lβ + ΓS)−1 − (Lβ + ΓS∪{v})−1 ≥ (Lβ + ΓS∪{k})−1 − (Lβ + ΓS∪{k,v})−1 (A.6) Let P := (Lβ + ΓS)−1 and Q := (Lβ + ΓS∪{k})−1. By Woodbury identity (A.1), it can be shown that (Lβ + ΓS∪{v})−1 = P − P(v)P (v)(α−1v + Pvv)−1, (Lβ + ΓS∪{k,v})−1 = Q−Q(v)Q(v)(α−1v +Qvv)−1. Thus, (A.6) is equivalent to the following matrix inequality P(v)P (v)(α−1v + Pvv) −1 ≥ Q(v)Q(v)(α−1v +Qvv)−1. It suffices to show that this inequality holds element-wise, i.e., PivPvj α−1v + Pvv ≥ QivQvj α−1v +Qvv , ∀i, j ∈ V . (A.7) Note again by Woodbury identity that Q = (Lβ + ΓS∪{k})−1 = P − P(k)P (k)(α−1k + Pkk)−1, i.e., Qij = Pij−PikPkj/(α−1k +Pkk), ∀i, j ∈ V . Therefore, we have (A.7) is equivalent to α−1v +Qvv α−1v + Pvv PivPvj ≥ (Piv − PikPkv α−1k + Pkk )(Pvj − PvkPkj α−1k + Pkk ) or, by rearranging terms, PvkPkvPivPvj (α−1v + Pvv) + PikPkvPvkPkj (α−1k + Pkk) ≤ PikPkvPvj + PivPvkPkj. (A.8) 245 We now show that (A.8) holds. To this end, first note that P is the inverse of a nonsingular M-matrix. Thus, by Lemma A.1.3 and the fact that α−1v ≥ 0, we have Pik ≥ PivPvk/Pvv ≥ PivPvk/(α−1v + Pvv). Next, multiplying both sides of the above inequality with PkvPkj ≥ 0 yields PvkPkvPivPvj/(α −1 v + Pvv) ≤ PikPkvPvj. (A.9) Similarly we have PikPkvPvkPkj/(α −1 k + Pkk) ≤ PivPvkPkj. (A.10) Finally, adding (A.10) and (A.9) together results in (A.8), which then completes the proof. A.2.5 Proof of Lemma 3.5.5 Let φ = f ◦ F . We need to show that for any S ⊆ T ⊆ V φ(S) + φ(T ) ≤ φ(S ∪ T ) + φ(S ∩ T ). (A.11) First, since F is decreasing, we have F (S ∪ T ) ≤ F (S), F (T ) ≤ F (S ∩ T ). (A.12) As a result, φ(S∪T ) = f(F (S∪T )) ≤ f(F (S∩T )) = φ(S∩T ) since f is increasing. This proves that φ is nonincreasing. Next, we have that there exist a1, a2 ∈ [0, 1] such that F (S) = a1F (S ∪ T ) + (1− a1)F (S ∩ T ) (A.13) F (T ) = a1F (S ∪ T ) + (1− a2)F (S ∩ T ). (A.14) 246 Adding side-by-side of the above equations gives F (S) + F (T ) = (a1 + a2)F (S ∪ T ) + (2− a1 − a2)F (S ∩ T ), whose left side is less than F (S ∪ T ) + F (S ∩ T ) by supermodularity property of F . Then we have (a1 + a2)F (S ∪ T ) + (2− a1 − a2)F (S ∩ T ) ≤ F (S ∪ T ) + F (S ∩ T ) or, by rearranging terms, (1− a1 − a2)(F (S ∩ T )− F (S ∪ T )) ≤ 0N×N , from which, together with (A.12), we conclude that a1+a2 ≥ 1. Now using convexity of f and (A.13) we have φ(S) = f(F (S)) = f(a1F (S ∪ T ) + (1− a1)F (S ∩ T )) ≤ a1f ( F (S ∪ T ))+ (1− a1)f(F (S ∩ T )) = a1φ(S ∪ T ) + (1− a1)φ(S ∩ T ). Similarly, convexity of f and (A.14) imply φ(T ) ≤ a2φ(S ∪ T ) + (1− a2)φ(S ∩ T ). Adding two equations above side by side yields φ(S) + φ(T ) ≤ (a1 + a2)φ(S ∪ T ) + (2− a1 − a2)φ(S ∩ T ) = φ(S ∪ T ) + φ(S ∩ T ) + (a1 + a2 − 1) ( φ(S ∪ T )− φ(S ∩ T )) ≤ φ(S ∪ T ) + φ(S ∩ T ), where in the last inequality we have used the facts that a1 + a2 ≥ 1 and that φ is nonincreasing. Thus, (A.11) is proved. 247 A.3 Omitted Proofs in Chapter 5 A.3.1 Proof of Theorem 5.3.5 For t ≥ 0, let s(t) := [s1(t), . . . , sN(t)] T , s¯(t) := piT s(t), gs(t) := [g1(s1(t)), . . . 
, gN(sN(t))] T . It follows from Theorem 5.3.4 that s¯(t) = si(t),∀i ∈ V ,∀t ≥ κ. (A.15) Thus, for any k ≥ 0, we have s¯((k + 1)κ) = si((k + 1)κ) (5.18) = piTx(kκ) (5.14a) = piT ( s(kκ)− γkgs(kκ) ) , = s¯(kκ)− γkpiTgs(kκ). (A.16) Since W is doubly stochastic, we have pi = 1/N (see, e.g., [95]). Thus we have for any k ≥ 1 NpiTgs(kκ) (A.15) = N∑ i=1 gi(s¯(kκ)) = g(s¯(kκ)), (A.17) where g(s¯(kκ)) ∈ ∂F (s¯(kκ)). Thus, (A.16) becomes s¯((k + 1)κ) = s¯(kκ)− γkN−1g(s¯(kκ)), (A.18) which is the same as (5.19). Next, it is obvious that (A.18) is the usual centralized subgradient iteration (5.3) applied to problem (5.1), where F is convex with bounded subgradient |g| ≤∑N j=1 Lj = LF , by Assumption 5.3.1. Therefore, existing convergence results of the 248 centralized subgradient method apply; see, e.g., [117, Chap. 3]. Here, we provide analysis that suits our context to prove the main results. In particular, let s¯k := s¯((k + 1)κ), g¯k := g(s¯(kκ)), γ¯k := γk/N. Then for any x∗ ∈ X∗, we have |s¯k+1 − x∗|2 = |s¯k − x∗ − γ¯kg¯k|2 = |s¯k − x∗|2 − 2γ¯kg¯k(s¯k − x∗) + γ¯2k g¯2k ≤ |s¯k − x∗|2 − 2γ¯k(F¯k − F ∗) + γ¯2k g¯2k (A.19) ≤ |s¯1 − x∗|2 − 2 k∑ l=1 γ¯l(F¯l − F ∗) + k∑ l=1 γ¯2l g¯ 2 l , where F¯k := F (s¯(kκ)), and we have used the definition of subgradient in the first in- equality, and the last one follows from applying (A.19) recursively. Now, rearranging terms and using 0 ≤ |s¯k+1 − x∗|2 and |g| ≤ LF we have k∑ l=1 γ¯l(F¯l − F ∗) ≤ 1 2 (|s¯1 − x∗|2 + L2F k∑ l=1 γ¯2l ) . By the convexity of F , the left side is bounded below by( k∑ l=1 γ¯l )(∑k l=1 γ¯lF (s¯(lκ))∑k l=1 γ¯l − F ∗ ) ≥ ( k∑ l=1 γ¯l )( F (sˆk)− F ∗ ) Combining this and (A.22), we then have F (sˆk)− F ∗ ≤ |s¯1 − x ∗|2 + L2F ∑k l=1 γ¯ 2 l 2 ∑k l=1 γ¯l . (A.20) Now we consider different choices of step size γk. (i) For a constant step size γk ≡ γ, i.e., γ¯k ≡ γ/N,∀k ≥ 1, it follows from (A.20) that F (sˆk)− F ∗ ≤ N |s¯1 − x ∗|2 2kγ + L2Fγ 2N . (A.21) 249 Letting k →∞ yields (5.21). (ii) Since (A.21) holds true for any x∗ ∈ X∗, γ > 0 and k ∈ Z>0, for any given K ∈ Z>0 we have F (sˆK)− F ∗ ≤ NR 2 2Kγ + L2Fγ 2N , (A.22) Now we minimize the right hand side of (A.22) with respect to γ > 0. By application of Cauchy-Schwarz inequality NR 2 2Kγ + L2F γ 2N ≥ RLF√ K , where equality holds when γ = NR LF √ K . Thus, with this optimal step size, we have F (sˆK)− F ∗ ≤ RLF√K . (iii) For a non-summable but diminishing step size, it can be shown that the right hand side of (A.20) decays to 0 as k →∞; see [138] for such an argument. Finally, consider γk = 1√ k . It can be verified that ∑k l=1 γ¯ 2 l ≤ 1+ln(k)N2 and∑k l=1 γ¯l ≥ √ k 2N ,∀k ≥ 1. Using these bounds for (A.20), we obtain F (sˆk)− F ∗ ≤ N 2|s¯1 − x∗|2 + L2F (1 + ln(k)) N √ k , (A.23) which implies that F (sˆk)− F ∗ = O( ln(k)√k ) as k →∞. A.3.2 Proof of Theorem 5.3.12 Following the same line of proof as in Theorem 5.3.5 (see Appendix A.3.1), it can be shown that s¯((k + 1)κ) = s¯(kκ)− γN−1∇F (s¯(kκ)), (A.24) where s¯(t) = 1 N ∑N j=1 sj(t). Clearly, (A.24) is the standard centralized gradient descent method. Thus, by [117, Thm. 2.1.14] we have for any γ ∈ (0, 2N L∇F ) F (s¯(kκ))− F ∗ ≤ a1 a2 + ka3 (A.25) 250 where a1 = (F (s¯(κ)) − F ∗)(s¯(κ) − x∗)2, a2 = (s¯(κ) − x∗)2, and a3 = (F (s¯(κ)) − F ∗)(1− L∇F h 2 )h with h = γ N . As a result, F (si(kκ))− F ∗ = O(1/k), as k →∞. (A.26) Finally, for each t ≥ κ, there exist positive integers k ≥ 1 and l ∈ [0, κ − 1] such that t = kκ+ l. Then by (5.13b), F (si(t))− F ∗ = F (si(kκ))− F ∗ (A.25) ≤ a1 a2 + ka3 ≤ a1κ a2κ+ ta3 . 
A.3.3 Proof of Theorem 5.3.14

First, note that (A.24) still holds in this case, i.e., $\bar{s}((k+1)\kappa) = \bar{s}(k\kappa) - \frac{\gamma}{N}\nabla F(\bar{s}(k\kappa))$, where $\bar{s}(t) = \frac{1}{N}\sum_{j=1}^N s_j(t)$. Applying [117, Thm. 2.1.15] to this iteration (i.e., (A.24)) yields
$$|\bar{s}(k\kappa) - x^*|^2 \le \beta^{k-1}|\bar{s}(\kappa) - x^*|^2,$$
$$2\big(F(\bar{s}(k\kappa)) - F^*\big) \le L_{\nabla F}\beta^{k-1}|\bar{s}(\kappa) - x^*|^2.$$
Then (5.30) and (5.31) follow immediately since $s_i(t) = \bar{s}(t)$ for all $t \ge \kappa$ and $i \in \mathcal{V}$ (cf. Theorem 5.3.4). Next, for each $t \ge \kappa$, there exist integers $k \ge 1$ and $l \in [0, \kappa-1]$ such that $t = k\kappa + l$. Then, with $C := |\bar{s}(\kappa) - x^*|$, we have
$$|s_i(t) - x^*| \overset{(5.13b)}{=} |s_i(k\kappa) - x^*| \overset{(5.30)}{\le} C\beta^{\frac{k-1}{2}} = C\big(\beta^{\frac{1}{2\kappa}}\big)^{t-l-\kappa} \le C\beta^{-1}\big(\beta^{\frac{1}{2\kappa}}\big)^t, \tag{A.28}$$
where the last inequality holds since $l \le \kappa - 1$. Thus, $s_i(t) \to x^*$ linearly at rate $\beta^{\frac{1}{2\kappa}}$. Similarly, by using (5.31), it can be shown that $F(s_i(t)) \to F^*$ linearly at rate $\beta^{\frac{1}{\kappa}}$.

A.3.4 Proof of Extension to Row Stochastic Weight Matrix

First, notice that the proof of Theorem 5.3.4 does not make use of (5.14a). Thus, (5.15) still holds for (5.36)-(5.37). Following the proof of Theorem 5.3.5 (see Appendix A.3.1), we let $\bar{s}(t) := s_i(t)$ for $t \ge \kappa$ and $g_s(t) := [g_1(s_1(t)), \ldots, g_N(s_N(t))]^T$. Recall that $\Phi = \mathbf{1}\pi^T$. We then have
$$\bar{s}((k+1)\kappa) = s_i((k+1)\kappa) \overset{(5.18)}{=} \pi^T x(k\kappa) \overset{(5.37)}{=} \pi^T\big(s(k\kappa) - \gamma_k(N\,\mathrm{diag}(\pi))^{-1}g_s(k\kappa)\big) = \bar{s}(k\kappa) - \gamma_k N^{-1}\sum_{i=1}^N g_i(s_i(k\kappa)). \tag{A.29}$$
Since $s_i(k\kappa) = \bar{s}(k\kappa)$ for all $i \in \mathcal{V}$ (cf. Theorem 5.3.4), (A.29) becomes
$$\bar{s}((k+1)\kappa) = \bar{s}(k\kappa) - \gamma_k N^{-1}g(\bar{s}(k\kappa)), \tag{A.30}$$
where $g(\bar{s}(k\kappa)) = \sum_i g_i(\bar{s}(k\kappa)) \in \partial F(\bar{s}(k\kappa))$. Now (A.30) is the same as (5.19). Therefore, the same conclusions as in Theorems 5.3.5-5.3.14 hold for the convergence of the modified algorithm.
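The key step in (A.29) is that the rescaling $(N\,\mathrm{diag}(\pi))^{-1}$ cancels the nonuniform left Perron weights of a merely row stochastic $W$, so that applying $\pi^T$ to the rescaled subgradient stack returns the plain average $\frac{1}{N}\sum_i g_i$. Below is a minimal numerical check of this identity; note that $\pi$ is computed centrally here purely for illustration, whereas in the algorithm each agent obtains its entry of $\pi$ distributively.

```python
import numpy as np

# For a row stochastic (not doubly stochastic) W, pi weights agents
# unevenly; rescaling agent i's subgradient by 1/(N * pi_i) restores the
# uniform average used in (A.29)-(A.30).
rng = np.random.default_rng(2)
N = 5
W = rng.random((N, N))
W /= W.sum(axis=1, keepdims=True)                 # row stochastic only

# Left Perron vector: pi^T W = pi^T, normalized so that pi^T 1 = 1.
evals, evecs = np.linalg.eig(W.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

g = rng.normal(size=N)                            # stand-in subgradients g_i
biased = pi @ g                                   # what pi^T g_s would give
rescaled = pi @ (g / (N * pi))                    # pi^T (N diag(pi))^{-1} g_s
assert np.isclose(rescaled, g.mean())             # equals (1/N) sum_i g_i
print(f"pi^T g = {biased:.4f}  vs  rescaled = {rescaled:.4f} = mean(g)")
```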
A.3.5 Proof of Theorem 5.5.1

First, note that the condition $(|\mathcal{N}_i| - 1) \le 1$ is to ensure that $W$ is a nonnegative matrix, hence a valid weight matrix. Now we prove that $\mathrm{diam}(\mathcal{G}) + 1 = \deg(q_i)$ for all $i \in \mathcal{V}$. Define
$$\Omega^{(i)} := [e_i \ \ L(\mathcal{G})^T e_i \ \ \ldots \ \ (L(\mathcal{G})^{N-1})^T e_i]^T.$$
By [72, Prop. 1] we have $\deg(q_i) = D_i + 1 = \mathrm{rank}(\Omega^{(i)})$. Next, by application of [132, Prop. 5], if the graph $\mathcal{G}$ is distance regular, then $\mathrm{diam}(\mathcal{G}) + 1 = \mathrm{rank}(\Sigma^{(i)})$, where $\Sigma^{(i)}$ is the controllability matrix of the pair $(L(\mathcal{G}), e_i)$, computed as
$$\Sigma^{(i)} = [e_i \ \ L(\mathcal{G})e_i \ \ \ldots \ \ L(\mathcal{G})^{N-1}e_i] = [e_i^T;\ e_i^T L(\mathcal{G});\ \ldots;\ e_i^T L(\mathcal{G})^{N-1}]^T = (\Omega^{(i)})^T.$$
Here, the second equality follows from the symmetry of $L(\mathcal{G})$ (recalling that $\mathcal{G}$ is undirected). Thus, $\mathrm{diam}(\mathcal{G}) + 1 = \mathrm{rank}((\Omega^{(i)})^T) = \mathrm{rank}(\Omega^{(i)}) = \deg(q_i)$.

A.3.6 Proof of Lemma 5.4.3

For a given matrix $A \in \mathbb{R}^{N\times N}$, let $q_A$ denote its minimal polynomial (i.e., the monic polynomial of minimum degree such that $q_A(A) = 0$). It follows from the Cayley-Hamilton theorem that $\deg(q_A) \le N$. Let $J$ denote the Jordan canonical form of $W - \gamma B$, i.e., there exists a nonsingular matrix $S \in \mathbb{R}^{N\times N}$ such that $W - \gamma B = SJS^{-1}$. Since similar matrices have the same minimal polynomial ([95, Corollary 3.3.3]), we have $q_{(W-\gamma B)} = q_J$. Moreover, it can be verified that
$$\widetilde{W} = \begin{bmatrix} SJS^{-1} & -\gamma B \\ 0_{N\times N} & I \end{bmatrix} = \begin{bmatrix} S & \Phi_\gamma \\ 0 & I \end{bmatrix} \underbrace{\begin{bmatrix} J & 0 \\ 0 & I \end{bmatrix}}_{=:K} \begin{bmatrix} S^{-1} & -S^{-1}\Phi_\gamma \\ 0 & I \end{bmatrix}, \tag{A.31}$$
where $\Phi_\gamma$ is defined in (5.49). Thus, $K$ is the Jordan canonical form of $\widetilde{W}$. Under condition (5.47), i.e., $\rho(J) < 1$, the order of the largest Jordan block of $K$ corresponding to eigenvalue 1 is equal to 1. It then follows immediately (see, e.g., [95, Thm. 3.3.6]) that $q_K(\xi) = (\xi - 1)q_J(\xi)$. Consequently, we have
$$q_{\widetilde{W}}(\xi) = q_K(\xi) = (\xi - 1)q_J(\xi) = (\xi - 1)q_{(W-\gamma B)}(\xi). \tag{A.32}$$
Since $\tilde{q}_i \mid q_{\widetilde{W}}$ (see Lemma 4.2.3), we obtain
$$\deg(\tilde{q}_i) \le \deg(q_{\widetilde{W}}) \overset{(A.32)}{=} 1 + \deg(q_{(W-\gamma B)}) \le 1 + N.$$
We next show that $\tilde{q}_i(1) = 0$, which then clearly implies
$$\tilde{q}_i(\xi) = (\xi - 1)\sum_{j=0}^{\widetilde{D}_i} \tilde{a}^{(i)}_j \xi^j, \quad \tilde{a}^{(i)}_{\widetilde{D}_i} = 1,$$
for some $\tilde{a}^{(i)} \in \mathbb{R}^{\widetilde{D}_i + 1}$. To this end, recall that for any $i = 1, \ldots, N$ and the unit vector $e_i \in \mathbb{R}^N$, we have
$$0^T_{2N} = [e_i^T \ 0_N^T]\,\tilde{q}_i(\widetilde{W}) \overset{(A.31)}{=} [e_i^T \ 0_N^T]\begin{bmatrix} S & \Phi_\gamma \\ 0 & I\end{bmatrix}\begin{bmatrix}\tilde{q}_i(J) & 0 \\ 0 & \tilde{q}_i(1)I\end{bmatrix}\begin{bmatrix}S^{-1} & -S^{-1}\Phi_\gamma \\ 0 & I\end{bmatrix} = \big[e_i^T S\tilde{q}_i(J)S^{-1} \ \ \ \tilde{q}_i(1)e_i^T\Phi_\gamma - e_i^T S\tilde{q}_i(J)S^{-1}\Phi_\gamma\big]. \tag{A.33}$$
As a result,
$$\tilde{q}_i(1)e_i^T\Phi_\gamma = 0_N^T. \tag{A.34}$$
Since $\Phi_\gamma$ is invertible (see (5.49)), none of its rows is identical to $0_N^T$. Consequently, (A.34) implies that $\tilde{q}_i(1) = 0$. Finally, (A.32) implies that if $\rho(W - \gamma B) < 1$, then 1 is the only zero of maximum modulus of $q_{\widetilde{W}}$. The proof is completed by noting that $\tilde{q}_i$ divides $q_{\widetilde{W}}$.

A.3.7 Proof of Theorem 5.4.4

Consider again system (5.46) and assume that (5.47) holds. First we show that $\Phi_\gamma$ satisfies $\Phi_\gamma \mathbf{1} = \mathbf{1}$ and $\pi^T B\Phi_\gamma = \pi^T B$. Note that for any $A \in \mathbb{R}^{N\times N}$ such that $(I - A)$ is invertible, we have
$$(I - A)^{-1} - I = (I - A)^{-1}A = A(I - A)^{-1}. \tag{A.35}$$
Now let $E = \gamma B$ and note that $W\mathbf{1} = \mathbf{1}$ and $\pi^T W = \pi^T$. Then
$$\Phi_\gamma\mathbf{1} = [I - (W - E)]^{-1}E\mathbf{1} = [I - (W-E)]^{-1}W\mathbf{1} - [I-(W-E)]^{-1}(W-E)\mathbf{1} \overset{(A.35)}{=} [I-(W-E)]^{-1}\mathbf{1} - \big([I-(W-E)]^{-1} - I\big)\mathbf{1} = \mathbf{1},$$
$$\pi^T E\Phi_\gamma = \pi^T E[I-(W-E)]^{-1}E = \pi^T W[I-(W-E)]^{-1}E - \pi^T(W-E)[I-(W-E)]^{-1}E \overset{(A.35)}{=} \pi^T[I-(W-E)]^{-1}E - \pi^T\big([I-(W-E)]^{-1} - I\big)E = \pi^T E.$$
Therefore, $B\pi$ and $\mathbf{1}$ are left and right eigenvectors of $\Phi_\gamma$ corresponding to the eigenvalue 1, respectively.

Next, by Assumption 5.2.2 and condition (5.47), $I - (W - \gamma B)$ is an irreducible nonsingular M-matrix. Thus, $(I - (W-\gamma B))^{-1}$ is a strictly positive matrix (see, e.g., [171]). Therefore, $\Phi_\gamma = [I - (W-\gamma B)]^{-1}\gamma B$ is also strictly positive. Thus $\Phi_\gamma$ is also irreducible (i.e., the graph associated with $\Phi_\gamma$ is strongly connected; in fact it is complete), and 1 is a simple eigenvalue, corresponding to the spectral radius of $\Phi_\gamma$. In fact, by Perron's theorem for positive matrices (see, e.g., [95, Thm 8.2.11]), 1 is the unique eigenvalue of maximum modulus of $\Phi_\gamma$, and
$$\lim_{k\to\infty}\Phi_\gamma^k = \mathbf{1}\pi^T B/(\pi^T B\mathbf{1}).$$
Hence, $\Phi_\gamma$ is a valid weight matrix for consensus. Moreover, the convergence is exponential with rate $|\lambda_2(\Phi_\gamma)|$, where $\lambda_2$ is an eigenvalue of second largest modulus.

A.3.8 Proof of Theorem 5.4.5

By (5.52) and (5.49), we have the following, which is in the same spirit as Theorem 5.2.3:
$$\Big(\sum_{l=0}^{\widetilde{D}_i}\tilde{a}^{(i)}_l x_i(l)\Big)\Big/\Big(\sum_{l=0}^{\widetilde{D}_i}\tilde{a}^{(i)}_l\Big) = e_i^T\Phi_\gamma x(0) \quad \forall i \in \mathcal{V},$$
where $\tilde{a}^{(i)} = [\tilde{a}^{(i)}_0, \ldots, \tilde{a}^{(i)}_{\widetilde{D}_i}]^T \in \mathbb{R}^{\widetilde{D}_i+1}$ satisfies (5.51). Thus,
$$s((k+1)\kappa) = \Phi_\gamma s(k\kappa) = \Phi_\gamma^{k+1}c. \tag{A.36}$$
By Theorem 5.4.4, $\Phi_\gamma$ is a valid consensus matrix. Since $W$ is doubly stochastic, we have $\pi = \mathbf{1}/N$, and thus $b = B\mathbf{1} = [b_1, \ldots, b_N]^T$ is a left Perron eigenvector of $\Phi_\gamma$. Thus, by (5.55),
$$\lim_{k\to\infty}s(k\kappa) = \lim_{k\to\infty}\Phi_\gamma^k c = \mathbf{1}b^Tc/(b^T\mathbf{1}) = \mathbf{1}x^*. \tag{A.37}$$
This and (5.53)-(5.54) imply (5.56). Moreover, since $\Phi_\gamma^k$ converges exponentially as $k\to\infty$, so does $s(k\kappa)$, i.e., there exist $C > 0$ and $\lambda_2 \in (0,1)$ such that $\|s(k\kappa) - \mathbf{1}x^*\| \le C\lambda_2^k$ for all $k \ge 0$. Now for each $t \ge \kappa$, there exist integers $k \ge 1$ and $l \in [0, \kappa-1]$ such that $t = k\kappa + l$. We then have
$$\|s(t) - \mathbf{1}x^*\| \overset{(5.53)}{=} \|s(k\kappa) - \mathbf{1}x^*\| \le C\lambda_2^k \le C\lambda_2^{-1}\lambda_2^{t/\kappa}, \tag{A.38}$$
where the last inequality follows from the condition $l \in [0, \kappa-1]$. Therefore, $s(t)$ also converges exponentially, with rate $\lambda_2^{1/\kappa}$. This concludes the proof.
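The properties established in Theorem 5.4.4 are easy to verify numerically on a small example. The sketch below is our toy construction: a doubly stochastic circulant $W$ (so $\pi = \mathbf{1}/N$) and a hypothetical diagonal $B = \mathrm{diag}(b)$ of leader weights chosen so that condition (5.47) holds.

```python
import numpy as np

# Check: Phi_gamma = [I - (W - gamma*B)]^{-1} gamma*B is row stochastic,
# strictly positive, and Phi_gamma^k -> 1 pi^T B / (pi^T B 1).
N, gamma = 6, 0.5
W = 0.5 * np.eye(N) + 0.25 * (np.roll(np.eye(N), 1, 0) + np.roll(np.eye(N), -1, 0))
B = np.diag(np.full(N, 0.8))
assert np.max(np.abs(np.linalg.eigvals(W - gamma * B))) < 1   # condition (5.47)

Phi = np.linalg.inv(np.eye(N) - (W - gamma * B)) @ (gamma * B)
assert np.allclose(Phi @ np.ones(N), np.ones(N))              # Phi_gamma 1 = 1
assert (Phi > 0).all()                                        # strictly positive

pi = np.full(N, 1.0 / N)                                      # W doubly stochastic
limit = np.outer(np.ones(N), pi @ B) / (pi @ B @ np.ones(N))
assert np.allclose(np.linalg.matrix_power(Phi, 200), limit)   # Perron limit
print("second largest |eigenvalue| of Phi_gamma:",
      np.sort(np.abs(np.linalg.eigvals(Phi)))[-2])
```

The printed quantity is the modulus $|\lambda_2(\Phi_\gamma)|$ that governs the exponential rate in (A.38).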
A.3.9 Distributed Evaluation of Global Cost Function and Algorithm Local Termination

Here we provide an augmentation to our main algorithm that allows each agent $i$ to compute $F(\hat{s}_k)$ or $F(s_i(t))$. Besides (5.13)-(5.14), let all the agents also perform (5.24)-(5.25) (in order to compute $\hat{s}_k$) as well as the following for all $t \ge \kappa + 1$:
$$y_i(t) = \begin{cases} f_i(\hat{s}_k) & \text{if } t = k\kappa, \hspace{4em} \text{(A.39a)}\\ \sum_{j\in\mathcal{N}_i} w_{ij}\,y_j(t-1) & \text{if } t \ne k\kappa, \hspace{4em} \text{(A.39b)}\end{cases}$$
and
$$F^{(i)}_k = \frac{\sum_{\tau=0}^{D_i} a^{(i)}_\tau\, y_i(t - \kappa + \tau)}{\sum_{\tau=0}^{D_i} a^{(i)}_\tau}, \quad \text{if } t = k\kappa. \tag{A.40}$$

Claim: Let the assumptions of Theorem 5.3.5 hold. Then $F(\hat{s}_k) = NF^{(i)}_k$ for any $k \ge 1$ and all $i \in \mathcal{V}$.

Proof of Claim: First recall from Theorem 5.3.4 that $s_i(t) = s_j(t) =: \bar{s}(t)$ for all $t \ge \kappa$ and $i, j \in \mathcal{V}$. Thus, each agent can locally find $\hat{s}_k$ using (5.24)-(5.25). Next, by (A.39b), we have $y(t) = Wy(t-1)$ for all $t = k\kappa+1, \ldots, (k+1)\kappa - 1$. At time $t = (k+1)\kappa$, we have
$$\frac{\sum_{\tau=0}^{D_i} a^{(i)}_\tau\, y_i(k\kappa+\tau)}{\sum_{\tau=0}^{D_i} a^{(i)}_\tau} \overset{\text{(Thm. 5.2.3)}}{=} e_i^T\Phi\, y(k\kappa) = N^{-1}\mathbf{1}^T y(k\kappa) = N^{-1}\sum_{i=1}^N f_i(\hat{s}_k) = N^{-1}F(\hat{s}_k), \tag{A.41}$$
where the second equality follows from (5.7) with $\pi = \mathbf{1}/N$ and the third from (A.39a). Therefore the claim holds.

As a result, each agent can compute $F(\hat{s}_k)/N$ in a distributed manner (and hence $F(\hat{s}_k)$ if $N$ is known). Similarly, if (A.39a) is replaced by $y_i(t) = f_i(s_i(k\kappa))$ for $t = k\kappa$, the same argument holds for $F(s_i(k\kappa))$, that is, $F(s_i(k\kappa)) = NF^{(i)}_k$ for any $k \ge 1$ and all $i \in \mathcal{V}$. Therefore, all the agents can stop at the same time if they agree to use a common stopping criterion such as (5.27), which is based on absolute convergence error. Note that other criteria of the same type can also be employed; for example, the local relative convergence tolerance
$$|F^{(i)}_k - F^{(i)}_{k-1}| \le \epsilon\,|F^{(i)}_k| \tag{A.42}$$
is obviously equivalent to the global relative convergence tolerance $|F(\hat{s}_k) - F(\hat{s}_{k-1})| \le \epsilon\,|F(\hat{s}_k)|$, or $|F(s_i(k\kappa)) - F(s_i(k\kappa - \kappa))| \le \epsilon\,|F(s_i(k\kappa))|$ if $s_i$ is used instead.
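The mechanism behind (A.39)-(A.40) is finite-time recovery of an exact average from a fixed number of local averaging steps. The sketch below illustrates it centrally under our own simplifications: we use the characteristic polynomial of a symmetric doubly stochastic $W$ in place of each agent's locally identified coefficients $a^{(i)}$, and a global horizon $N$ in place of the per-agent $\kappa$.

```python
import numpy as np

# If q_W(x) = (x - 1) p(x) is the characteristic polynomial of W (equal to
# the minimal polynomial for generic W), then p(W) = p(1) * (1 pi^T), so a
# fixed linear combination of N consecutive iterates y, Wy, W^2 y, ...
# recovers pi^T y exactly -- the idea used in (A.40).
N = 5
rng = np.random.default_rng(3)
M = rng.random((N, N))
M = M + M.T                                       # symmetric positive weights
W = np.eye(N) - 0.9 * (np.diag(M.sum(1)) - M) / M.sum(1).max()  # doubly stoch.

f_vals = rng.normal(size=N)                       # stand-in for f_i(s_hat_k)
q = np.poly(W)                                    # characteristic polynomial
p = np.polydiv(q, np.array([1.0, -1.0]))[0]       # p(x) = q(x) / (x - 1)

ys = [f_vals]                                     # y evolves by (A.39b)
for _ in range(N - 1):
    ys.append(W @ ys[-1])

a = p[::-1]                                       # a_tau: coefficient of x^tau
recovered = sum(at * yt for at, yt in zip(a, ys)) / a.sum()
assert np.allclose(recovered, f_vals.mean())      # every agent gets F / N
print("every agent recovers F/N =", f_vals.mean())
```

Since $\pi = \mathbf{1}/N$ here, the recovered value at every agent is exactly $\frac{1}{N}\sum_i f_i$, i.e., $F(\hat{s}_k)/N$ as in the Claim above.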
A.4 Omitted Proofs in Chapter 6

A.4.1 Proof of Theorem 6.3.6 for Algorithm 6.2

Recall that for Algorithm 6.2, the projection error $\phi_i(t)$ is given by (6.15). Thus, for any $v \in X$, we have
$$\|x_i(t+1) - v\|^2 = \Big\|\sum_{j\in\mathcal{N}_i} w_{ij}\Big(x_j(t) - \gamma(t)\frac{g_j(t)}{z_{jj}(t)}\Big) - v + \phi_i(t)\Big\|^2.$$
Expanding the right side and using the fact that
$$\phi_i(t)^T\Big(\sum_{j\in\mathcal{V}} w_{ij}\Big(x_j(t) - \gamma(t)\frac{g_j(t)}{z_{jj}(t)}\Big) - v\Big) \le -\|\phi_i(t)\|^2$$
for any $v \in X \subseteq X_i$ (cf. Lemma 6.3.1(a)), we obtain
$$\|x_i(t+1) - v\|^2 \le \Big\|\sum_{j\in\mathcal{N}_i} w_{ij}\Big(x_j(t) - \gamma(t)\frac{g_j(t)}{z_{jj}(t)}\Big) - v\Big\|^2 - \|\phi_i(t)\|^2 \le \sum_{j\in\mathcal{N}_i} w_{ij}\Big\|x_j(t) - v - \gamma(t)\frac{g_j(t)}{z_{jj}(t)}\Big\|^2 - \|\phi_i(t)\|^2, \tag{A.43}$$
where (A.43) follows from $\sum_{j\in\mathcal{V}} w_{ij} = 1$ and Lemma 6.3.2. Hence,
$$\sum_{i\in\mathcal{V}}\pi_i\|x_i(t+1) - v\|^2 \le \sum_{i\in\mathcal{V}}\pi_i\sum_{j\in\mathcal{N}_i} w_{ij}\Big\|x_j(t) - v - \gamma(t)\frac{g_j(t)}{z_{jj}(t)}\Big\|^2 - \sum_{i\in\mathcal{V}}\pi_i\|\phi_i(t)\|^2 = \sum_{i\in\mathcal{V}}\pi_i\Big\|x_i(t) - v - \gamma(t)\frac{g_i(t)}{z_{ii}(t)}\Big\|^2 - \sum_{i\in\mathcal{V}}\pi_i\|\phi_i(t)\|^2, \tag{A.44}$$
where the equality in (A.44) holds since $\pi^T W = \pi^T$. Expanding the first term on the right side of (A.44) yields
$$\sum_{i\in\mathcal{V}}\pi_i\|x_i(t) - v\|^2 - 2\gamma(t)\sum_{i\in\mathcal{V}}\frac{\pi_i}{z_{ii}(t)}\,g_i(t)^T(x_i(t) - v) + \gamma^2(t)\sum_{i\in\mathcal{V}}\frac{\pi_i}{z_{ii}^2(t)}\|g_i(t)\|^2. \tag{A.45}$$
We now derive upper bounds for the last two terms in (A.45). First, by (6.5), (6.11) and the fact that $\sum_{i\in\mathcal{V}}\pi_i = 1$, we have
$$\gamma^2(t)\sum_{i\in\mathcal{V}}\frac{\pi_i}{z_{ii}^2(t)}\|g_i(t)\|^2 \le \gamma^2(t)L_f^2\eta^2. \tag{A.46}$$
Second, using the facts that $g_i(t) \in \partial f_i(x_i(t))$ and $f_i$ is Lipschitz continuous on $\mathrm{conv}(\cup_{i=1}^N X_i)$ (cf. Assumption 6.2.1(b)), it can be shown that
$$g_i(t)^T(v - x_i(t)) \le f_i(v) - f_i(\bar{x}(t)) + L_f\|x_i(t) - \bar{x}(t)\|. \tag{A.47}$$
As a result, the second term in (A.45) can be bounded as follows (ignoring the factor $2\gamma(t)$):
$$\sum_{i\in\mathcal{V}}-\frac{\pi_i}{z_{ii}(t)}\,g_i(t)^T(x_i(t) - v) \overset{(A.47)}{\le} \sum_{i\in\mathcal{V}}\frac{\pi_i}{z_{ii}(t)}\big(f_i(v) - f_i(\bar{x}(t)) + L_f\|x_i(t) - \bar{x}(t)\|\big) \overset{(6.11)}{\le} \sum_{i\in\mathcal{V}}\frac{\pi_i}{z_{ii}(t)}\big(f_i(v) - f_i(\bar{x}(t))\big) + L_f\eta\sum_{i\in\mathcal{V}}\pi_i\|x_i(t) - \bar{x}(t)\| \overset{(6.23)}{\le} F(v) - F(\bar{x}(t)) + NCL_f\eta\lambda^t\|\bar{x}(t) - v\| + L_f\eta\sum_{i\in\mathcal{V}}\pi_i\|x_i(t) - \bar{x}(t)\|. \tag{A.48}$$
Finally, returning to the argument in (A.44) and using (A.45) together with the bounds in (A.46) and (A.48), we obtain
$$\sum_{i\in\mathcal{V}}\pi_i\|x_i(t+1) - v\|^2 \le \sum_{i\in\mathcal{V}}\pi_i\|x_i(t) - v\|^2 - 2\gamma(t)\big(F(\bar{x}(t)) - F^*\big) + 2\gamma(t)NCL_f\eta\lambda^t\|\bar{x}(t) - v\| + 2\gamma(t)L_f\eta\sum_{i\in\mathcal{V}}\|x_i(t) - \bar{x}(t)\| + \gamma^2(t)L_f^2\eta^2,$$
which is the same as (6.25). Therefore, (6.13) readily follows with the same constants $D_i$ as in the case of Algorithm 6.1.
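One round of the projected update analyzed above can be sketched as follows. This is a simplified stand-in for Algorithm 6.2, on our own toy problem: we use the limiting rescaling $z_{ii} = N\pi_i$ directly instead of the algorithm's running estimates $z_{ii}(t)$, take $f_i(x) = |x - c_i|$, and take all local constraint sets $X_i$ equal to a common interval (the analysis allows them to differ).

```python
import numpy as np

rng = np.random.default_rng(4)
N, T = 8, 3000
W = rng.random((N, N))
W /= W.sum(1, keepdims=True)                       # row stochastic only
evals, evecs = np.linalg.eig(W.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])  # left Perron vector
pi /= pi.sum()
z = N * pi                                         # limiting rescaling z_ii

c = rng.normal(size=N)
lo, hi = -0.5, 0.5                                 # the common interval X_i
proj = lambda x: np.clip(x, lo, hi)                # projection onto X_i

x = rng.normal(size=N)
for t in range(1, T + 1):
    g = np.sign(x - c)                             # subgradient of f_i at x_i
    # x_i(t+1) = Proj_{X_i}( sum_j w_ij (x_j - gamma(t) g_j / z_jj) )
    x = proj(W @ (x - (0.5 / np.sqrt(t)) * g / z))

x_opt = np.clip(np.median(c), lo, hi)              # constrained optimum
print("spread:", np.ptp(x), " error:", abs(x.mean() - x_opt))
```

With the diminishing step size $\gamma(t) = 0.5/\sqrt{t}$, the iterates both reach consensus (small spread) and approach the constrained minimizer, consistent with the recursion bounded in (A.43)-(A.48).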
Bibliography

[1] M. H. DeGroot, "Reaching a consensus," J. American Statistical Association, vol. 69, no. 345, pp. 118–121, 1974.
[2] J. N. Tsitsiklis, "Problems in decentralized decision making and computation," Ph.D. dissertation, MIT, 1984.
[3] N. Friedkin and E. Johnsen, "Social influence and opinions," J. Mathematical Sociology, vol. 15, pp. 193–206, 1990.
[4] P. DeMarzo, D. Vayanos, and J. Zwiebel, "Persuasion bias, social influence, and unidimensional opinions," Quarterly J. Economics, vol. 118, no. 3, pp. 909–968, 2003.
[5] A. Jadbabaie, J. Lin, and A. S. Morse, "Coordination of groups of mobile autonomous agents using nearest neighbor rules," IEEE Trans. Autom. Control, vol. 48, no. 6, pp. 988–1001, 2003.
[6] J. Fax and R. Murray, "Information flow and cooperative control of vehicle formations," IEEE Trans. Autom. Control, vol. 49, no. 9, pp. 1465–1476, 2004.
[7] J. Cortes, S. Martinez, T. Karatas, and F. Bullo, "Coverage control for mobile sensing networks," IEEE Trans. Robot. Automat., vol. 20, no. 2, pp. 243–255, 2004.
[8] L. Moreau, "Stability of multiagent systems with time-dependent communication links," IEEE Trans. Autom. Control, vol. 50, pp. 169–182, 2005.
[9] W. Ren, R. W. Beard, and T. W. McLain, "Coordination variables and consensus building in multiple vehicle systems," in Cooperative Control. Springer, 2005, pp. 171–188.
[10] W. Ren and R. W. Beard, Distributed Consensus in Multi-vehicle Cooperative Control. Springer, 2008.
[11] B. Golub and M. O. Jackson, "Naive learning in social networks and the wisdom of crowds," American Economic J.: Microeconomics, pp. 112–149, 2010.
[12] D. Acemoglu and A. Ozdaglar, "Opinion dynamics and learning in social networks," Dynamic Games and Applications, vol. 1, no. 1, pp. 3–49, 2011.
[13] D. Acemoglu, G. Como, F. Fagnani, and A. Ozdaglar, "Opinion fluctuations and disagreement in social networks," Mathematics of Operations Research, vol. 38, no. 1, pp. 1–27, 2013.
[14] W. Ren, R. W. Beard, and E. M. Atkins, "Information consensus in multivehicle cooperative control," IEEE Control Syst. Mag., vol. 27, no. 2, pp. 71–82, 2007.
[15] R. Olfati-Saber, J. A. Fax, and R. M. Murray, "Consensus and cooperation in networked multi-agent systems," Proc. IEEE, vol. 95, no. 1, pp. 215–233, 2007.
[16] Y. Cao, W. Yu, W. Ren, and G. Chen, "An overview of recent progress in the study of distributed multi-agent coordination," IEEE Trans. Ind. Informat., vol. 9, no. 1, pp. 427–438, 2013.
[17] S. Strogatz, Sync: The Emerging Science of Spontaneous Order. Hyperion, 2003.
[18] M. Ji, A. Muhammad, and M. Egerstedt, "Leader-based multi-agent coordination: Controllability and optimal control," in Proc. American Control Conf., 2006, pp. 1358–1363.
[19] E. Yildiz, D. Acemoglu, A. E. Ozdaglar, A. Saberi, and A. Scaglione, "Discrete opinion dynamics with stubborn agents," 2011. [Online]. Available: http://dx.doi.org/10.2139/ssrn.1744113
[20] J. Ghaderi and R. Srikant, "Opinion dynamics in social networks: a local interaction game with stubborn agents," in Proc. American Control Conf. IEEE, 2013, pp. 1982–1987.
[21] A. Fagiolini, M. Pellinacci, G. Valenti, G. Dini, and A. Bicchi, "Consensus-based distributed intrusion detection for multi-robot systems," in Proc. IEEE Int. Conf. Robotics and Automation, 2008, pp. 120–127.
[22] S. Sundaram and C. N. Hadjicostis, "Distributed function calculation via linear iterative strategies in the presence of malicious agents," IEEE Trans. Autom. Control, no. 7, pp. 1495–1508, 2011.
[23] H. J. LeBlanc, H. Zhang, S. Sundaram, and X. Koutsoukos, "Consensus of multi-agent networks in the presence of adversaries using only local information," in Proc. 1st Int. Conf. High Confidence Networked Systems, 2012, pp. 1–10.
[24] M. Ji, A. Muhammad, and M. Egerstedt, "Leader-based multi-agent coordination: Controllability and optimal control," in Proc. American Control Conf. IEEE, 2006, 6 pp.
[25] F. Sorrentino, M. di Bernardo, F. Garofalo, and G. Chen, "Controllability of complex networks via pinning," Physical Rev. E, vol. 75, no. 4, p. 046103, 2007.
[26] A. Rahmani, M. Ji, M. Mesbahi, and M. Egerstedt, "Controllability of multi-agent systems from a graph-theoretic perspective," SIAM J. Control and Optimization, vol. 48, no. 1, pp. 162–186, 2009.
[27] Y.-Y. Liu, J.-J. Slotine, and A.-L. Barabási, "Controllability of complex networks," Nature, vol. 473, no. 7346, pp. 167–173, 2011.
[28] G. Parlangeli and G. Notarstefano, "On the reachability and observability of path and cycle graphs," IEEE Trans. Autom. Control, vol. 57, no. 3, pp. 743–748, 2012.
[29] F. Pasqualetti, S. Zampieri, and F. Bullo, "Controllability metrics, limitations and algorithms for complex networks," IEEE Trans. Control Netw. Syst., vol. 1, no. 1, pp. 40–52, 2014.
[30] A. Chapman, M. Nabi-Abdolyousefi, and M. Mesbahi, "Controllability and observability of network-of-networks via cartesian products," IEEE Trans. Autom. Control, vol. 59, no. 10, pp. 2668–2679, 2014.
[31] A. J. Whalen, S. N. Brennan, T. D. Sauer, and S. J. Schiff, "Observability and controllability of nonlinear networks: The role of symmetry," Physical Rev. X, vol. 5, no. 1, p. 011005, 2015.
[32] M. Rabbat and R. Nowak, "Distributed optimization in sensor networks," in Proc. 3rd Int. Sympo. Inf. Process. Sensor Networks. ACM, 2004, pp. 20–27.
[33] S. Kar and J. M. Moura, "Distributed consensus algorithms in sensor networks with imperfect communication: Link failures and channel noise," IEEE Trans. Signal Process., vol. 57, no. 1, pp. 355–369, 2009.
[34] S. Bolognani, S. Del Favero, L. Schenato, and D. Varagnolo, "Consensus-based distributed sensor calibration and least-square parameter identification in WSNs," Int. J. Robust and Nonlinear Control, vol. 20, no. 2, pp. 176–193, 2010.
[35] L. Xiao and S. Boyd, "Optimal scaling of a gradient method for distributed resource allocation," J. Optim. Theory App., vol. 129, no. 3, pp. 469–488, 2006.
[36] P. Di Lorenzo and S. Barbarossa, "A bio-inspired swarming algorithm for decentralized access in cognitive radio," IEEE Trans. Signal Process., vol. 59, no. 12, pp. 6160–6174, 2011.
[37] G. Mateos, J. A. Bazerque, and G. B. Giannakis, "Distributed sparse linear regression," IEEE Trans. Signal Process., vol. 58, no. 10, pp. 5262–5276, Oct 2010.
[38] P. A. Forero, A. Cano, and G. B. Giannakis, "Consensus-based distributed support vector machines," J. Machine Learning Research, vol. 11, no. May, pp. 1663–1707, 2010.
[39] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Found. Trends. Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
[40] T. Vicsek, A. Czirók, E. Ben-Jacob, I. Cohen, and O. Shochet, "Novel type of phase transition in a system of self-driven particles," Physical Rev. Lett., vol. 75, no. 6, p. 1226, 1995.
[41] V. Borkar and P. Varaiya, "Asymptotic agreement in distributed estimation," IEEE Trans. Autom. Control, vol. 27, no. 3, pp. 650–655, 1982.
[42] J. Tsitsiklis, D. Bertsekas, and M. Athans, "Distributed asynchronous deterministic and stochastic gradient optimization algorithms," IEEE Trans. Autom. Control, vol. 31, no. 9, pp. 803–812, 1986.
[43] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Inc., 1989.
[44] L. Scardovi and R. Sepulchre, "Synchronization in networks of identical linear systems," Automatica, vol. 45, no. 11, pp. 2557–2562, 2009.
[45] Z. Li, Z. Duan, G. Chen, and L. Huang, "Consensus of multiagent systems and synchronization of complex networks: A unified viewpoint," IEEE Trans. Circuits Syst. I, vol. 57, no. 1, pp. 213–224, 2010.
[46] U. Krause, "A discrete nonlinear and nonautonomous model of consensus formation," Commun. Difference Equations, pp. 227–236, 2000.
[47] N. Chopra and M. W. Spong, "Passivity-based control of multi-agent systems," in Advances in Robot Control. Springer, 2006, pp. 107–134.
[48] J. Zhou, J.-a. Lu, and J. Lu, "Adaptive synchronization of an uncertain complex dynamical network," IEEE Trans. Autom. Control, vol. 51, no. 4, pp. 652–656, 2006.
[49] M. Arcak, "Passivity as a design tool for group coordination," IEEE Trans. Autom. Control, vol. 52, no. 8, pp. 1380–1390, 2007.
[50] G.-B. Stan and R. Sepulchre, "Analysis of interconnected oscillators by dissipativity theory," IEEE Trans. Autom. Control, vol. 52, no. 2, pp. 256–270, 2007.
[51] N. Chopra and M. W. Spong, "On exponential synchronization of Kuramoto oscillators," IEEE Trans. Autom. Control, vol. 54, no. 2, pp. 353–357, 2009.
[52] R. Olfati-Saber and R. M. Murray, "Consensus problems in networks of agents with switching topology and time-delays," IEEE Trans. Autom. Control, vol. 49, no. 9, pp. 1520–1533, 2004.
[53] A. Kashyap, T. Başar, and R. Srikant, "Quantized consensus," Automatica, vol. 43, no. 7, pp. 1192–1203, 2007.
[54] S. Kar and J. M. Moura, "Distributed consensus algorithms in sensor networks: Quantized data and random link failures," IEEE Trans. Signal Process., vol. 58, no. 3, pp. 1383–1400, 2010.
[55] G. Shi and K. H. Johansson, "Persistent graphs and consensus convergence," in IEEE 51st Conf. Decision and Control, 2012, pp. 2046–2051.
[56] J. Wolfowitz, "Products of indecomposable, aperiodic, stochastic matrices," Proc. American Mathematical Society, vol. 14, no. 5, pp. 733–737, 1963.
[57] J. M. Hendrickx and V. D. Blondel, "Convergence of linear and non-linear versions of Vicsek's model," in Proc. 17th Int. Sympo. Mathematical Theory of Networks and Systems, 2005, pp. 1229–1240.
[58] V. Blondel, J. Hendrickx, A. Olshevsky, and J. Tsitsiklis, "Convergence in multiagent coordination, consensus, and flocking," in 44th IEEE Conf. Decision and Control / 2005 European Control Conf., 2005, pp. 2996–3000.
[59] B. Touri and A. Nedic, "Product of random stochastic matrices," IEEE Trans. Autom. Control, vol. 59, no. 2, pp. 437–448, Feb 2014.
[60] D. Kempe, J. Kleinberg, and É. Tardos, "Maximizing the spread of influence through a social network," in Proc. 9th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining. ACM, 2003, pp. 137–146.
[61] S. Patterson and B. Bamieh, "Leader selection for optimal network coherence," in Proc. 49th IEEE Conf. Decision and Control. IEEE, 2010, pp. 2692–2697.
[62] M. Fardad, F. Lin, X. Zhang, and M. R. Jovanovic, "On new characterizations of social influence in social networks," in Proc. American Control Conf., 2013, pp. 4777–4782.
[63] A. Gionis, E. Terzi, and P. Tsaparas, "Opinion maximization in social networks," in Proc. SIAM Int. Conf. Data Mining. SIAM, 2013, pp. 387–395.
[64] E. Yildiz, A. Ozdaglar, D. Acemoglu, A. Saberi, and A. Scaglione, "Binary opinion dynamics with stubborn agents," ACM Trans. Econ. Comp., vol. 1, no. 4, p. 19, 2013.
[65] A. Clark, B. Alomair, L. Bushnell, and R. Poovendran, "Minimizing convergence error in multi-agent systems via leader selection: A supermodular optimization approach," IEEE Trans. Autom. Control, vol. 59, no. 6, pp. 1480–1494, 2014.
[66] L. Vassio, F. Fagnani, P. Frasca, and A. Ozdaglar, "Message passing optimization of harmonic influence centrality," IEEE Trans. Control Netw. Syst., vol. 1, no. 1, pp. 109–120, 2014.
[67] V. S. Borkar, A. Karnik, J. Nair, and S. Nalli, "Manufacturing consent," IEEE Trans. Autom. Control, vol. 60, no. 1, pp. 104–117, 2015.
[68] G. Shi, K. C. Sou, H. Sandberg, and K. H. Johansson, "A graph-theoretic approach on optimizing informed-node selection in multi-agent tracking control," Physica D: Nonlinear Phenomena, vol. 267, pp. 104–111, 2014.
[69] K. Fitch and N. E. Leonard, "Information centrality and optimal leader selection in noisy networks," in IEEE 52nd Conf. Decision and Control. IEEE, 2013, pp. 7510–7515.
[70] A. Clark, B. Alomair, L. Bushnell, and R. Poovendran, "Minimizing convergence error in multi-agent systems via leader selection: A supermodular optimization approach," CoRR, vol. abs/1306.4949, 2013.
[71] S. Sundaram and C. N. Hadjicostis, "Finite-time distributed consensus in graphs with time-invariant topologies," in Proc. American Control Conf., 2007, pp. 711–716.
[72] Y. Yuan, G.-B. Stan, L. Shi, M. Barahona, and J. Gonçalves, "Decentralised minimum-time consensus," Automatica, vol. 49, no. 5, pp. 1227–1235, 2013.
[73] B. Ho and R. E. Kalman, "Effective construction of linear state-variable models from input/output functions," Automatisierungstechnik, vol. 14, no. 1-12, pp. 545–548, 1966.
[74] A. Tether, "Construction of minimal linear state-variable models from finite input-output data," IEEE Trans. Autom. Control, vol. 15, no. 4, pp. 427–436, 1970.
[75] L. Silverman, "Realization of linear dynamical systems," IEEE Trans. Autom. Control, vol. 16, no. 6, pp. 554–567, 1971.
[76] B. Johansson, T. Keviczky, M. Johansson, and K. H. Johansson, "Subgradient methods and consensus algorithms for solving convex optimization problems," in Proc. 47th IEEE Conf. Decision and Control, 2008, pp. 4185–4190.
[77] A. Jadbabaie, A. Ozdaglar, and M. Zargham, "A distributed Newton method for network optimization," in Proc. 48th IEEE Conf. Decision and Control / 28th Chinese Control Conf. IEEE, 2009, pp. 2736–2741.
[78] A. Nedic and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization," IEEE Trans. Autom. Control, vol. 54, no. 1, pp. 48–61, 2009.
[79] D. Jakovetic, J. Xavier, and J. M. Moura, "Cooperative convex optimization in networked systems: Augmented Lagrangian algorithms with directed gossip communication," IEEE Trans. Signal Process., vol. 59, no. 8, pp. 3889–3902, 2011.
[80] I. Lobel and A. Ozdaglar, "Distributed subgradient methods for convex optimization over random networks," IEEE Trans. Autom. Control, vol. 56, no. 6, pp. 1291–1306, 2011.
[81] J. C. Duchi, A. Agarwal, and M. J. Wainwright, "Dual averaging for distributed optimization: convergence analysis and network scaling," IEEE Trans. Autom. Control, vol. 57, no. 3, pp. 592–606, 2012.
[82] K. I. Tsianos, S. Lawlor, and M. G. Rabbat, "Consensus-based distributed optimization: Practical issues and applications in large-scale machine learning," in Proc. 50th Annu. Allerton Conf. Commun. Control Comp. IEEE, 2012, pp. 1543–1550.
[83] E. Wei and A. Ozdaglar, "On the O(1/k) convergence of asynchronous distributed alternating direction method of multipliers," arXiv preprint arXiv:1307.8254, 2013.
[84] B. Gharesifard and J. Cortés, "Distributed continuous-time convex optimization on weight-balanced digraphs," IEEE Trans. Autom. Control, vol. 59, no. 3, pp. 781–786, 2014.
[85] A. Nedic and A. Olshevsky, "Distributed optimization over time-varying directed graphs," IEEE Trans. Autom. Control, vol. 60, no. 3, pp. 601–615, March 2015.
[86] K. Yuan, Q. Ling, and W. Yin, "On the convergence of decentralized gradient descent," arXiv preprint arXiv:1310.7063, 2013.
[87] D. Jakovetic, J. Xavier, and J. M. Moura, "Fast distributed gradient methods," IEEE Trans. Autom. Control, vol. 59, no. 5, pp. 1131–1146, 2014.
[88] A. Nedic and D. P. Bertsekas, "Incremental subgradient methods for nondifferentiable optimization," SIAM J. Optim., vol. 12, no. 1, pp. 109–138, 2001.
[89] S. S. Ram, A. Nedic, and V. V. Veeravalli, "Incremental stochastic subgradient algorithms for convex optimization," SIAM J. Optim., vol. 20, no. 2, pp. 691–717, 2009.
[90] L. Xiao, S. Boyd, and S. Lall, "A scheme for robust distributed sensor fusion based on average consensus," in Proc. Int. Conf. Information Processing in Sensor Networks, 2005, pp. 63–70.
[91] F. Zanella, D. Varagnolo, A. Cenedese, G. Pillonetto, and L. Schenato, "Newton-Raphson consensus for distributed convex optimization," in Proc. 50th IEEE Conf. Decision and Control / 2011 European Control Conf., 2011, pp. 5917–5922.
[92] I. D. Schizas, A. Ribeiro, and G. B. Giannakis, "Consensus in ad hoc WSNs with noisy links – Part I: Distributed estimation of deterministic signals," IEEE Trans. Signal Process., vol. 56, no. 1, pp. 350–364, 2008.
[93] K. Tsianos and M. Rabbat, "Distributed dual averaging for convex optimization under communication delays," in Proc. American Control Conf., Jun. 2012, pp. 1067–1072.
[94] P. Lin, W. Ren, and Y. Song, "Distributed multi-agent optimization subject to nonidentical constraints and communication delays," Automatica, vol. 65, pp. 120–131, 2016.
[95] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 1985.
[96] V. S. Mai and E. H. Abed, "Opinion dynamics with persistent leaders," in IEEE 53rd Conf. Decision and Control, 2014, pp. 2907–2913.
[97] H. Minc, Nonnegative Matrices. John Wiley and Sons, New York, 1988.
[98] J. Lorenz, "A stabilization theorem for dynamics of continuous opinions," Physica A, vol. 355, no. 1, pp. 217–223, 2005.
[99] D. J. Hartfiel, Nonhomogeneous Matrix Products. World Scientific, 2002.
[100] A. Olshevsky and J. N. Tsitsiklis, "Convergence speed in distributed consensus and averaging," SIAM Rev., vol. 53, no. 4, pp. 747–772, 2011.
[101] W. Wang and J.-J. Slotine, "A theoretical study of different leader roles in networks," IEEE Trans. Autom. Control, vol. 51, no. 7, pp. 1156–1161, 2006.
[102] A. Rahmani, M. Ji, M. Mesbahi, and M. Egerstedt, "Controllability of multi-agent systems from a graph-theoretic perspective," SIAM J. Control Optim., vol. 48, no. 1, pp. 162–186, 2009.
[103] S. Joshi and S. Boyd, "Sensor selection via convex optimization," IEEE Trans. Signal Process., vol. 57, no. 2, pp. 451–462, 2009.
[104] F. Lin, M. Fardad, and M. R. Jovanović, "Algorithms for leader selection in large dynamical networks: Noise-corrupted leaders," in Proc. 50th IEEE Conf. Decision Control, European Control Conf., 2011, pp. 2932–2937.
[105] J. Long, S. O. Memik, and M. Grayson, "Optimization of an on-chip active cooling system based on thin-film thermoelectric coolers," in Proc. Conf. Design, Automation and Test in Europe, 2010, pp. 117–122.
[106] V. S. Borkar, J. Nair, and N. Sanketh, "Manufacturing consent," in 48th Allerton Conf. Commun. Control Comp., 2010, pp. 1550–1555.
[107] V. S. Mai and E. H. Abed, "Dynamic consensus measure and optimal selection of direct followers in multiagent networks," in Proc. American Control Conf. IEEE, 2016, pp. 2880–2885.
[108] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher, "An analysis of approximations for maximizing submodular set functions – I," Mathematical Programming, vol. 14, no. 1, pp. 265–294, 1978.
[109] M. Conforti and G. Cornuéjols, "Submodular set functions, matroids and the greedy algorithm: Tight worst-case bounds and some generalizations of the Rado-Edmonds theorem," Discrete Applied Mathematics, vol. 7, no. 3, pp. 251–274, 1984.
[110] G. Ranjan and Z.-L. Zhang, "Geometry of complex networks and topological centrality," Physica A: Statistical Mechanics and its Applications, vol. 392, no. 17, pp. 3833–3845, 2013.
[111] K. Stephenson and M. Zelen, "Rethinking centrality: Methods and examples," Social Networks, vol. 11, no. 1, pp. 1–37, 1989.
[112] I. Poulakakis, L. Scardovi, and N. E. Leonard, "Node classification in networks of stochastic evidence accumulators," arXiv preprint arXiv:1210.4235, 2012.
[113] M. Brand, "A random walks perspective on maximizing satisfaction and profit," in SDM. SIAM, 2005, pp. 12–19.
[114] S. J. Wright, Primal-Dual Interior-Point Methods. SIAM, 1997.
[115] D. P. Bertsekas, Nonlinear Programming. Athena Scientific, 1999.
[116] F. A. Potra and S. J. Wright, "Interior-point methods," J. Computational and Applied Mathematics, vol. 124, no. 1–2, pp. 281–302, 2000.
[117] Y. Nesterov, Introductory Lectures on Convex Optimization. Springer Science & Business Media, 2004, vol. 87.
[118] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[119] J. Long, D. Li, S. O. Memik, and S. Ulgen, "Theory and analysis for optimization of on-chip thermoelectric cooling systems," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 32, no. 10, pp. 1628–1632, 2013.
[120] A. Ganesan, S. R. Ross, and B. R. Barmish, "An extreme point result for convexity, concavity and monotonicity of parameterized linear equation solutions," Linear Algebra Appl., vol. 390, pp. 61–73, 2004.
[121] D. M. Topkis, "Minimizing a submodular function on a lattice," Operations Research, vol. 26, no. 2, pp. 305–321, 1978.
[122] P. Milgrom and J. Roberts, "Rationalizability, learning, and equilibrium in games with strategic complementarities," Econometrica, vol. 58, no. 6, pp. 1255–1277, 1990.
[123] J. Leskovec, D. Huttenlocher, and J. Kleinberg, "Predicting positive and negative links in online social networks," in Proc. 19th Int. Conf. World Wide Web. ACM, 2010, pp. 641–650.
[124] J. Currie and D. I. Wilson, "OPTI: Lowering the Barrier Between Open Source Optimizers and the Industrial MATLAB User," in Found. Comp.-Aided Process Operations, N. Sahinidis and J. Pinto, Eds., Savannah, Georgia, USA, 8–11 January 2012.
[125] P. Colaneri, R. H. Middleton, Z. Chen, D. Caporale, and F. Blanchini, "Convexity of the cost functional in an optimal control problem for a class of positive switched systems," Automatica, vol. 50, no. 4, pp. 1227–1234, 2014.
[126] B. P. Lathi, Linear Systems and Signals. Oxford University Press, 2009.
[127] D. G. Luenberger, Optimization by Vector Space Methods. John Wiley & Sons, 1997.
[128] T. Charalambous, Y. Yuan, T. Yang, W. Pan, C. N. Hadjicostis, and M. Johansson, "Distributed finite-time average consensus in digraphs in the presence of time delays," IEEE Trans. Control Netw. Syst., vol. 2, no. 4, pp. 370–381, 2015.
[129] C. D. Godsil, G. Royle, and C. Godsil, Algebraic Graph Theory. Springer New York, 2001, vol. 8.
[130] S. Martini, M. Egerstedt, and A. Bicchi, "Controllability analysis of multi-agent systems using relaxed equitable partitions," Int. J. Syst. Control Commun., vol. 2, no. 1-3, pp. 100–121, 2010.
[131] S. Zhang, M. Cao, and M. K. Camlibel, "Upper and lower bounds for controllable subspaces of networks of diffusively coupled agents," IEEE Trans. Autom. Control, vol. 59, no. 3, pp. 745–750, 2014.
[132] S. Zhang, M. K. Camlibel, and M. Cao, "Controllability of diffusively-coupled multi-agent systems with general and distance regular coupling topologies," in Proc. 50th IEEE Conf. Decision and Control / 2011 European Control Conf., 2011, pp. 759–764.
[133] J. Wang and N. Elia, "A control perspective for centralized and distributed convex optimization," in Proc. 50th IEEE Conf. Decision and Control / 2011 European Control Conf., 2011, pp. 3800–3805.
[134] A. Olshevsky, "Linear time average consensus on fixed graphs and implications for decentralized optimization and multi-agent control," arXiv preprint arXiv:1411.4186v6, 2016.
[135] I.-A. Chen, "Fast distributed first-order methods," Master's thesis, MIT, 2012.
[136] W. Shi, Q. Ling, G. Wu, and W. Yin, "EXTRA: An exact first-order algorithm for decentralized consensus optimization," SIAM J. Optim., vol. 25, no. 2, pp. 944–966, 2015.
[137] A. Nedic, A. Ozdaglar, and P. A. Parrilo, "Constrained consensus and optimization in multi-agent networks," IEEE Trans. Autom. Control, vol. 55, no. 4, pp. 922–938, 2010.
[138] S. Boyd, L. Xiao, and A. Mutapcic, "Subgradient methods," Lecture notes of EE392o, Stanford University, Autumn Quarter, 2003. [Online]. Available: http://web.mit.edu/6.976/www/notes/subgrad_method.pdf
[139] B. Nejad, S. Attia, and J. Raisch, "Max-consensus in a max-plus algebraic setting: The case of fixed communication topologies," in Int. Sympo. Info. Commun. Automation Tech., 2009, pp. 1–7.
[140] S. Sundaram and C. N. Hadjicostis, "Distributed function calculation and consensus using linear iterative strategies," IEEE J. Select. Areas Commun., vol. 26, no. 4, pp. 650–660, 2008.
[141] T. Charalambous, M. G. Rabbat, M. Johansson, and C. N. Hadjicostis, "Distributed finite-time computation of digraph parameters: Left-eigenvector, out-degree and spectrum," IEEE Trans. Control Netw. Syst., vol. 3, no. 2, pp. 137–148, 2016.
[142] A. E. Brouwer, A. M. Cohen, and A. Neumaier, Distance-Regular Graphs. New York: Springer-Verlag, 1989, vol. 18.
[143] E. R. van Dam, J. H. Koolen, and H. Tanaka, "Distance-regular graphs," Electronic J. Combinatorics, vol. DS22, 2014.
[144] J. Wang and N. Elia, "Control approach to distributed optimization," in Proc. 48th Annu. Allerton Conf. Commun. Control Comp., 2010, pp. 557–561.
[145] A. Nedic and A. Olshevsky, "Distributed optimization over time-varying directed graphs," IEEE Trans. Autom. Control, vol. 60, no. 3, pp. 601–615, March 2015.
[146] D. Kempe, A. Dobra, and J. Gehrke, "Gossip-based computation of aggregate information," in Proc. 44th Annu. IEEE Symp. Found. Comp. Sci., 2003, pp. 482–491.
[147] A. Makhdoumi and A. Ozdaglar, "Graph balancing for distributed subgradient methods over directed graphs," in Proc. 54th IEEE Conf. Decision and Control, 2015, pp. 1364–1371.
[148] S. S. Ram, A. Nedić, and V. V. Veeravalli, "Distributed stochastic subgradient projection algorithms for convex optimization," J. Optim. Theory Appl., vol. 147, no. 3, pp. 516–545, 2010.
[149] S. Lee and A. Nedic, "Distributed random projection algorithm for convex optimization," IEEE J. Sel. Topics Signal Process., vol. 7, no. 2, pp. 221–229, 2013.
[150] V. S. Mai and E. H. Abed, "Distributed optimization over weighted directed graphs using row stochastic matrix," in Proc. American Control Conf., 2016, pp. 7165–7170.
[151] I. Lobel, A. Ozdaglar, and D. Feijer, "Distributed multi-agent optimization with state-dependent communication," Mathematical Programming, vol. 129, no. 2, pp. 255–284, 2011.
[152] A. Hoffmann, "The distance to the intersection of two convex sets expressed by the distances to each of them," Mathematische Nachrichten, vol. 157, no. 1, pp. 81–98, 1992.
[153] H. H. Bauschke and J. M. Borwein, "On projection algorithms for solving convex feasibility problems," SIAM Rev., vol. 38, no. 3, pp. 367–426, 1996.
[154] Z. Qu, C. Li, and F. Lewis, "Cooperative control based on distributed estimation of network connectivity," in Proc. American Control Conf., 2011, pp. 3441–3446.
[155] A. Priolo, A. Gasparri, E. Montijano, and C. Sagues, "A decentralized algorithm for balancing a strongly connected weighted digraph," in Proc. American Control Conf., Jun. 2013, pp. 6547–6552.
[156] I. Matei and J. S. Baras, "A comparison between upper bounds on performance of two consensus-based distributed optimization algorithms," in Estimation and Control of Networked Systems, vol. 3, no. 1, 2012, pp. 168–173.
[157] H. Robbins and D. Siegmund, "A convergence theorem for nonnegative almost supermartingales and some applications," Methods in Statistics, pp. 233–257, 1971.
[158] C. Xi, V. S. Mai, E. H. Abed, and U. A. Khan, "Linear convergence in directed optimization with row-stochastic matrices," arXiv preprint arXiv:1611.06160, 2016.
[159] W. B. Gragg and A. Lindquist, "On the partial realization problem," Linear Algebra Appl., vol. 50, pp. 277–319, 1983.
[160] L. Ljung, System Identification. Wiley Online Library, 1999.
[161] S. Beheshti and M. A. Dahleh, "Noisy data and impulse response estimation," IEEE Trans. Signal Process., vol. 58, no. 2, pp. 510–521, 2010.
[162] Y. Yuan, G.-B. Stan, L. Shi, M. Barahona, and J. Gonçalves, "Minimal-time uncertain output final value of unknown DT-LTI systems with application to the decentralised network consensus problem," in Proc. Int. Sympo. Mathematical Theory of Netw. Syst., 2010.
[163] F. Pasqualetti, A. Bicchi, and F. Bullo, "Distributed intrusion detection for secure consensus computations," in Proc. 46th IEEE Conf. Decision and Control. IEEE, 2007, pp. 5594–5599.
[164] I. Shames, A. M. Teixeira, H. Sandberg, and K. H. Johansson, "Distributed fault detection for interconnected second-order systems," Automatica, vol. 47, no. 12, pp. 2757–2764, 2011.
[165] F. Pasqualetti, A. Bicchi, and F. Bullo, "Consensus computation in unreliable networks: A system theoretic approach," IEEE Trans. Autom. Control, vol. 57, no. 1, pp. 90–104, 2012.
[166] F. Pasqualetti, F. Dörfler, and F. Bullo, "Attack detection and identification in cyber-physical systems," IEEE Trans. Autom. Control, vol. 58, no. 11, pp. 2715–2729, 2013.
[167] J. McDonald, M. Neumann, H. Schneider, and M. Tsatsomeros, "Inverse M-matrix inequalities and generalized ultrametric matrices," Linear Algebra Appl., vol. 220, pp. 321–341, 1995.
[168] N. J. Higham, Accuracy and Stability of Numerical Algorithms. SIAM, 2002.
[169] A. Ben-Israel and T. N. Greville, Generalized Inverses: Theory and Applications. Springer, 2003, vol. 13.
[170] C. Meyer, Jr., "Generalized inversion of modified matrices," SIAM J. Applied Math., vol. 24, no. 3, pp. 315–323, 1973.
[171] R. Plemmons, "M-matrix characterizations. I – Nonsingular M-matrices," Linear Algebra Appl., vol. 18, no. 2, pp. 175–188, 1977.