ABSTRACT

Title of dissertation: CONSENSUS, PREDICTION AND OPTIMIZATION IN DIRECTED NETWORKS
Van Sy Mai, Doctor of Philosophy, 2017
Dissertation directed by: Professor Eyad H. Abed, Dept. of Electrical and Computer Engineering

This dissertation develops theory and algorithms for distributed consensus in multi-agent networks. The models considered are opinion dynamics models based on the well-known DeGroot model. We study the following three related topics: consensus of networks with leaders, consensus prediction, and distributed optimization.

First, we revisit the problem of agreement seeking in a weighted directed network in the presence of leaders. We develop new sufficient conditions that are weaker than existing conditions for guaranteeing consensus for both fixed and switching network topologies, emphasizing the importance not only of persistent connectivity between the leader and the followers but also of the strength of the connections. We then study the problem of a leader aiming to maximize its influence on the opinions of the network agents through targeted connection with a limited number of agents, possibly in the presence of another leader having a competing opinion. We reveal fundamental properties of leader influence defined in terms of either the transient behavior or the achieved steady state opinions of the network agents. In particular, not only is the degree of this influence a supermodular set function, but its continuous relaxation is also convex for any strongly connected directed network. These results pave the way for developing efficient approximation algorithms admitting certain quality certifications, which, when combined, can provide effective tools and better analysis for optimal influence spreading in large networks.

Second, we introduce and investigate problems of network monitoring and consensus prediction. Here, an observer, without exact knowledge of the network, seeks to determine in the shortest possible time the asymptotic agreement value by monitoring a subset of the agents. We uncover a fundamental limit on the minimum required monitoring time for the case of a single observed node, and analyze the case of multiple observed nodes. We provide conditions for achieving the limit in the former case and develop algorithms toward achieving conjectured bounds in the latter through local observation and local computation.

Third, we study a distributed optimization problem where a network of agents seeks to minimize the sum of the agents' individual objective functions while each agent may be associated with a separate local constraint. We develop new distributed algorithms for solving this problem. In these algorithms, consensus prediction is employed as a means to achieve fast convergence rates, possibly in finite time. An advantage of our distributed optimization algorithms is that they work under milder assumptions on the network weight matrix than those commonly made in the literature. Most distributed algorithms require undirected networks. Consensus-based algorithms can apply to directed networks under the assumption that the network weight matrix is doubly stochastic (i.e., both row stochastic and column stochastic), or in some recent literature only column stochastic. Our algorithms work for directed networks and only require row stochasticity, a mild assumption. Doubly stochastic or column stochastic weight matrices can be hard to arrange locally, especially in broadcast-based communication.
We achieve the simplification to the row stochastic assumption through a distributed rescaling technique. Next, we develop a unified convergence analysis of a distributed projected subgradient algorithm and its variation that can be applied to both unconstrained and constrained problems without assuming boundedness or commonality of the local constraint sets.

CONSENSUS, PREDICTION AND OPTIMIZATION IN DIRECTED NETWORKS

by Van Sy Mai

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2017

Advisory Committee:
Professor Eyad H. Abed, Chair/Advisor
Professor Richard J. La
Professor P. S. Krishnaprasad
Professor André L. Tits
Associate Professor Nikhil Chopra

© Copyright by Van Sy Mai 2017

Acknowledgments

First and foremost, I would like to thank my advisor, Professor Eyad Abed, for his invaluable guidance and unstinting support over the past five years. It has been an honor to work with and learn from him, without whom this thesis would not have been possible. I am also grateful to Professor P. S. Krishnaprasad, Professor Richard La, Professor André Tits and Professor Nikhil Chopra for agreeing to serve on my thesis committee and providing me with insightful comments and suggestions to broaden my research from various angles. I would also like to thank my Master's thesis advisor, Professor Suchin Arunsawatwong, for encouraging me to pursue a PhD degree and helping me with the application process.

Throughout my years at the University of Maryland, I have been fortunate to have many wonderful friends who deserve a special mention. I would like to thank Dzung Ta, Sanmeet Narula and Devon Harbaugh for their help and company during the first year of my graduate life in the US. Many thanks also go to my office-mates Bhaskar Ramasubramanian, Alborz Alavian, and James Ferlez for their friendship, feedback and support. My interaction with Dipankar Maity has been very fruitful and he deserves special thanks. My days in Maryland would not have been enjoyable without my Vietnamese friends, including Chanh Kieu, My Le, Khoa Trinh, and especially Sean Lam's family, and I would like to thank them all.

I would like to acknowledge the financial support from the Air Force Office of Scientific Research through MURI AFOSR Grant #FA9550-09-1-0538 and the Department of Electrical and Computer Engineering at the University of Maryland.

Last but never least, I am most grateful to my family: my parents and brother for having always stood by and believed in me through my whole life, and especially my wonderful and loving wife and our beloved daughter for their incredible support, endless patience and unconditional encouragement. Words, by no means, can express the gratitude I owe them.

Table of Contents

List of Tables
List of Figures
List of Abbreviations

1 Introduction
  1.1 Motivation and Thesis Objectives
    1.1.1 Consensus and Information Sharing Model
    1.1.2 Network Asymmetry
  1.2 Main Problems and Thesis Contributions
    1.2.1 Network with Leaders
    1.2.2 Consensus Prediction
    1.2.3 Distributed Optimization
  1.3 Literature Survey
    1.3.1 Consensus in Networks with Leaders
    1.3.2 Consensus Prediction
    1.3.3 Distributed Optimization
  1.4 Thesis Organization
  1.5 Notation and Mathematical Background
    1.5.1 Notation and Definitions
    1.5.2 Convergence of DeGroot Model

I Consensus Network with Leaders

2 Opinion Dynamics with Persistent Leaders
  2.1 Introduction
  2.2 Problem Formulation
  2.3 Opinion Dynamics with One Leader
  2.4 Opinion Dynamics with Two Leaders
  2.5 Conclusion and Extensions

3 Optimizing Leader Influence in Networks through Selection of Direct Followers
  3.1 Introduction
  3.2 Problem Formulation and Related Works
    3.2.1 Formulation of Influence Optimization Problem for the Single Leader Case
    3.2.2 Formulation of Influence Optimization Problem in the Presence of a Competing Leader
    3.2.3 Comparison to Previous Work
      3.2.3.1 Single leader case
      3.2.3.2 Multiple leaders case
      3.2.3.3 Our Contributions
  3.3 Special Cases K = 1, 2: Optimal Solutions
    3.3.1 Single Agent Selection
    3.3.2 Two-Agent Selection
  3.4 General Case: Convexification Approach
    3.4.1 Convexity of Relaxation
    3.4.2 Numerical Methods
  3.5 Supermodularity and Greedy Algorithms
    3.5.1 Supermodularity Results
    3.5.2 Greedy Algorithms and Ratio Bounds
  3.6 Numerical Examples
    3.6.1 Example 1: Small Network with One Leader
    3.6.2 Example 2: Medium-Size Network with Two Leaders
  3.7 Closing Discussion
    3.7.1 Application to Friedkin's Model
    3.7.2 Further Convexity Results
    3.7.3 Towards Relaxing Strong Connectivity Assumption

II Consensus Prediction by Observer

4 Consensus Prediction in Minimum Time
  4.1 Introduction
  4.2 Problem Statement and Previous Results
    4.2.1 Problem Description
    4.2.2 Previous Results on Consensus in Finite Time
  4.3 Shortest Time Prediction of Consensus and Local Computation of Minimal Polynomials
    4.3.1 Optimality of (Di + 1)
    4.3.2 Local Computation of qi
  4.4 Toward Minimizing Observation Time
    4.4.1 Observed Nodes with Identical Minimal Polynomials
    4.4.2 Observed Nodes with Different Minimal Polynomials
  4.5 Numerical Examples
    4.5.1 Example 1: Network with Identical Minimal Polynomials
    4.5.2 Example 2: Network with Different Minimal Polynomials
  4.6 Toward Selecting Observed Nodes
    4.6.1 When qi = qj?
    4.6.2 Bounds on deg(qi)
  4.7 Limitations and Future Work

III Distributed Optimization

5 Local Prediction for Enhanced Convergence of Distributed Optimization Algorithms
  5.1 Introduction
  5.2 Problem Statement and Background
    5.2.1 Problem Statement
    5.2.2 Subgradient Methods
    5.2.3 Finite-Time Consensus Using Minimal Polynomials
  5.3 Distributed Subgradient Optimization Using Finite Time Consensus
    5.3.1 Main Algorithm
    5.3.2 Extensions of the Algorithm 5.1
  5.4 Finite-Time Optimization for Quadratic Cost Functions
    5.4.1 Ratio-Consensus based Algorithm
    5.4.2 Gradient-based Algorithm
  5.5 On Minimal Value of κ and Performance Limits of Distributed Subgradient Methods
    5.5.1 Minimal Value of κ
    5.5.2 Performance Limit of Distributed Subgradient Methods
  5.6 Simulations
    5.6.1 Example 1: Network of 5 agents with differentiable cost functions having Lipschitz continuous gradient
    5.6.2 Example 2: Network of 200 agents with ℓ1 cost functions
  5.7 Concluding Remarks

6 Distributed Optimization over Directed Graphs with Row Stochasticity and Constraint Regularity
  6.1 Introduction
  6.2 Problem Formulation and Proposed Algorithms
  6.3 Basic Relations and Convergence Result
  6.4 Rate of Convergence
  6.5 Numerical Example
  6.6 Conclusions and Extensions

7 Conclusions
  7.1 Summary of Results
  7.2 Directions for Future Work

A Omitted Proofs
  A.1 Known Matrix Results
  A.2 Omitted Proofs in Chapter 3
    A.2.1 Proof of Theorem 3.3.1
    A.2.2 Proof of Theorem 3.3.2
    A.2.3 Proof of Theorem 3.3.4
    A.2.4 Proof of Lemma 3.5.3
    A.2.5 Proof of Lemma 3.5.5
  A.3 Omitted Proofs in Chapter 5
    A.3.1 Proof of Theorem 5.3.5
    A.3.2 Proof of Theorem 5.3.12
    A.3.3 Proof of Theorem 5.3.14
    A.3.4 Proof of Extension to Row Stochastic Weight Matrix
    A.3.5 Proof of Theorem 5.5.1
    A.3.6 Proof of Lemma 5.4.3
    A.3.7 Proof of Theorem 5.4.4
    A.3.8 Proof of Theorem 5.4.5
    A.3.9 Distributed Evaluation of Global Cost Function and Algorithm Local Termination
  A.4 Omitted Proofs in Chapter 6
    A.4.1 Proof of Theorem 6.3.6 for Algorithm 6.2

Bibliography

List of Tables

3.1 Comparison results for the network in Example 1 (∗ denotes an optimal value). In the last column, JK^PRlxd1(2) denotes JK^PRlxd1 (respectively, JK^PRlxd2).
4.1 Observation times using Algorithm 4.1.
4.2 Observation time for each node to compute the consensus value in Example 2.
4.3 Optimal time T∗ when the observer can choose any m nodes.

List of Figures

1.1 A directed network of 5 agents.
1.2 A network in the presence of 2 leaders T and Q.
1.3 A network with an observer.
1.4 A network of five agents that try to minimize the sum of individual cost functions.
3.1 Network in Example 1.
3.2 K∗ = {7, 8, 15, 25}
3.3 K = {7, 13, 16, 25}
3.4 K = {8, 13, 16, 25}
3.5 Alg. 3.1 every 5 time steps
3.6 Upper bounds (solid lines) and lower bounds (dashed line) on J∗; the global lower bound J(Vα) holds for any K. The ratio bound (1 − JGU)/(1 − f∗PRlxd) (shown by a dotted line) is at least 90% for K ≥ 90.
3.7 CPU run times (s) in 4 schemes. The Interior Point Method takes approximately 0.21 s per iteration.
4.1 Network example 2. Self weights are not shown.
5.1 Network topology in Example 1.
5.2 Network responses for Example 1 with convex cost functions having Lipschitz continuous gradient using Algorithms (5.32) and (5.37). Left: for any i ∈ V, si(t) (solid lines) converges to the optimal solution (dashed line) and xi(t) reaches a limit cycle of period κ. In the top-left figure, ◦ represents s̄(kκ) of the centralized subgradient method implemented as (5.19).
Right: objective error comparisons with DPS using step size γ(t) = a/t^b, where (blue) solid lines correspond to a = 0.01, (green) dashed lines a = 0.05, (black) dotted lines a = 0.1, and (cyan) dash-dotted ones a = 0.2. For each a, we plot the results for b = 0.5 and 1. The results from our algorithm are shown in red circles ◦. The algorithm terminates locally for all the agents at t = 186 with the relative error of the global cost function guaranteed to be less than ε = 10^-6.
5.3 Network responses for Example 1 with quadratic cost functions when using Algorithm 5.3 with κ = 7, x(0) = c, and with 4 values of γ.
5.4 Responses of the network in Example 2. Dashed line: optimal solution. (a)-(b): Algorithm (5.32)-(5.33), where the sub-figure within (a) is a zoom-in of the period [400, 800]; (c): algorithm by Olshevsky (2016) with a constant step size β = 1/(L0 √(NT)); and (d): Distributed Subgradient Method (5.4) with γ(t) = 1/√t.
6.1 Directed communication graph of the network example.
6.2 Performance of Algorithms 6.1, 6.2, and DSP methods with and without the reweighting technique. Reweighting means that for each i ∈ V, πi is known to agent i in advance and zii(t) = πi for all t ≥ 0. Here, s(t) = PX(x̄(t)).

List of Symbols and Abbreviations

E          Edge set of a graph
G          Communication graph
L          Laplacian matrix
N          Number of agents in the network
N0         Set of all nonnegative integers
R (R+)     Set of all (nonnegative) real numbers
t          Discrete time
V          Node set of a graph
W, [wij]   Weight matrix
xi(t), xi(t)   Opinion (scalar) or state vector of agent i at time t
Z (Z+)     Set of all (nonnegative) integers
π          The normalized left Perron eigenvector of the weight matrix W

AEP        Almost Equitable Partition
DPS        Distributed Projected Subgradient
DSM        Distributed Subgradient Method
IPM        Interior Point Method
LCM        Least Common Multiple
LTI        Linear Time-Invariant
PGM        Projected Gradient Method

Chapter 1: Introduction

Networks are ubiquitous in physical, biological, and engineered systems. Depending on the particular domain and on the network nodes and their interconnections, networks can display interesting characteristics and can achieve a variety of functions. Networks research has seen significantly increasing interest over the past several decades, owing mainly to the realization that applications are wide-ranging and that these applications can prove to be both practical and valuable for society. While a large body of literature has arisen, our understanding of the characteristics, features and dynamics of networks is still in the early stages of development. In this thesis, we aim to contribute to this important and growing field by pursuing several directions in the general area of network consensus. Among the topics we pursue is the development of new conditions and algorithms for reaching and/or predicting agreement among agents in a network. To this end, a number of theories will be brought to bear on several problems of interest under various scenarios with regard to network connectivity. We also introduce and study problems of network monitoring and consensus prediction, and apply our results to distributed optimization.
1.1 Motivation and Thesis Objectives

1.1.1 Consensus and Information Sharing Model

There has been much interest in problems of distributed computation and cooperative control, where a group of agents aims to achieve a global objective without resorting to a centralized coordination entity and possibly in the presence of limited computing capability and/or energy resources. In this realm, network consensus is a basic problem, which concerns processes by which a collection of agents, through their local interactions, tries to reach a common goal or decision. This problem has been studied extensively in recent years and has found applications in many areas such as opinion dynamics and learning in social networks, distributed optimization, multi-vehicle rendezvous, formation control and sensor fusion, to name a few (see, e.g., [1–13]). Extensive surveys and tutorials can be found in [14–16].

Historically, consensus in one form or another has long been observed in both natural and social networks and systems. For example, a flock of birds flies in a certain shape with a common velocity, a swarm of fireflies blinks in unison, and a group of people achieves agreement after repeatedly exchanging opinions with one another. The discovery of network consensus can be traced to circa 1665, when Christiaan Huygens observed the synchrony of two pendulum clocks mounted next to each other on the same support, a phenomenon now referred to as coupled oscillations [17]. Thus, studying mechanisms for network consensus is key to understanding collective behaviors of both natural and social systems, and to building engineering networks as well.

Many recent efforts have also been devoted to the case of networks with more than one type of agent, including, e.g., leaders and followers, stubborn or even adversarial agents, which appear naturally in real world networks and systems (see, e.g., [5, 12, 13, 18–23] and references therein). The notion of leader is also useful in the study of control of networked systems, where leaders serve as agents that directly receive control inputs and network connections are paths for control action propagating to the other agents [24]. Network controllability and its dual notion, observability, have also been studied extensively in recent years [25–31]. However, a closely related problem, namely network consensus prediction, has scarcely been considered. In this thesis, we will investigate this problem in detail.

The idea behind consensus can also serve as a mechanism for information sharing/diffusing in the design of many distributed multi-agent algorithms, including distributed optimization, where a group of agents with limited communication tries to solve a global optimization problem. This problem arises in many applications such as distributed estimation in sensor networks [32–34], distributed resource allocation [35, 36], and large-scale machine learning and statistical inference [37–39], and is becoming more urgent in the new era of "Big Data".

1.1.2 Network Asymmetry

For distributed systems, communication is vital to system performance as it is the backbone for information to flow from one agent to another, and hence the only platform for each agent to contribute to or get involved in the global objective of the network. As a result, special attention to communications is needed in the study and design of distributed systems; in fact, it is one of the main aspects that distinguishes them from centralized ones.
In general, communications can be categorized as undirected or directed. In many applications, it is possible that inter-agent communications are undirected, i.e., when two individuals communicate, each receives information from the other, and moreover, they can have some agreement on how to use that information. However, there are many other scenarios where communications are directed due to, e.g., communication constraints arising from various sources, including physical network connections or hardware capability of system components. This is clearly the case when we consider the effect of a leader or stubborn agent in the network. Another practical example is an ad hoc Wireless Sensor Network, where there may not be a pre-existing communication infrastructure at the time of deployment and directed communications can arise as a consequence of the geometric network layout, nonuniform transmission power limits, or even sensor mobility. We identify directed communication as a type of network asymmetry; see Figure 1.1 for an example of a directed network with 5 agents.

Figure 1.1: A directed network of 5 agents.

Although directed communication schemes include undirected schemes as a special case (and thus apply to a much larger range of situations), a large body of literature in the field of consensus and distributed computation and optimization focuses on undirected networks. Such networks are more amenable to mathematical analysis, especially when using tools that assume network symmetry (such as symmetric nonnegative Laplacian matrices, for instance). Analysis of algorithms and problems with directed communications is generally very involved and often requires new tools and techniques. Developing algorithms that apply to networks with directed communication schemes is the prime subject of this thesis. In particular, we are interested in scenarios where network asymmetry poses challenges in the analysis and may hinder system performance.

Among various models in the literature, the DeGroot model [1] is one of the most often used for its simplicity and ability to exhibit consensus behaviors. Here, each agent in the network repeatedly updates its opinion as a weighted average of the opinions of its immediate neighbors, including itself. This update scheme simply gives rise to a row stochastic weight matrix, which under mild conditions guarantees consensus in the network. The agreement value depends on this weight matrix and the agents' initial opinions. However, in many situations, including various distributed optimization algorithms, an averaging scheme is desired instead; to this end, the weight matrix is often required to satisfy a balancedness condition, namely double stochasticity. In such a case, this condition means that any agent needs to know to whom it sends information and/or to regulate the way the recipient uses the information. This can be ensured fairly easily and locally in undirected networks, but is very hard and costly to ensure in many distributed systems such as wireless sensor/ad-hoc networks and especially networks with broadcast-based communications. A few algorithms proposed recently for directed networks employ column stochastic matrices by requiring that each agent at any time is aware of which neighbors receive the information sent to them by the agent. A row stochastic matrix is much easier to implement in these applications/communication environments, but yields unsatisfactory performance when used with those algorithms.
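To see why row stochasticity is the natural condition under directed, broadcast-based communication, consider the following minimal Python sketch (illustrative only; the construction is standard and not specific to this thesis). Each agent normalizes the weights on the information it receives, a purely local operation, whereas column (and hence double) stochasticity constrains what all recipients of an agent's broadcasts do with them.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5

# Directed adjacency with self-loops: A[i, j] = 1 means agent i hears agent j.
A = (rng.random((N, N)) < 0.4).astype(float)
np.fill_diagonal(A, 1.0)

# Row stochastic weights: agent i normalizes over its own in-neighbors,
# using only locally available information.
W = A / A.sum(axis=1, keepdims=True)

print(np.allclose(W.sum(axis=1), 1.0))  # True: rows sum to 1 by construction
print(np.allclose(W.sum(axis=0), 1.0))  # generally False: columns do not
```

Making the columns also sum to one would require each sender to know exactly who receives its broadcasts and how they weight them, which is the coordination burden alluded to above.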
Therefore, in this thesis, we will mostly deal with row-stochastic matrices as the main source of network asymmetry.

1.2 Main Problems and Thesis Contributions

In this research, we consider the following three main topics and the associated problems identified below. Unlike most existing work in the literature, where network symmetry or balancedness is assumed, we address the general case of weighted directed graphs. Thus, one of the main technical contributions that we emphasize is a set of tools and techniques developed to overcome network asymmetry in various problems and applications of consensus.

1.2.1 Network with Leaders

In the first part of the thesis, we consider a DeGroot model with the presence of external media nodes, representing leaders or sources of news, often having constant opinion values. See Figure 1.2 for an illustrative example.

Figure 1.2: A network in the presence of 2 leaders T and Q.

First, when consensus is the main goal of a leader, we are interested in finding conditions under which the whole network will eventually agree with that leader's opinion for any initial opinions of the agents. Indeed, we will determine how strong the connections between the leaders and the followers, as well as those among the followers, should be to ensure that this agreement can be achieved asymptotically.

• When there is only one leader, we derive new sufficient conditions for guaranteeing consensus with the leader for both fixed and switching topologies. These conditions emphasize the persistence of the connectivity between the leader and the followers and are the mildest so far, covering many existing results in the literature.

• In the presence of more than one leader, we show that only those that are persistent matter. In particular, when only one leader is persistent, we provide conditions under which the network converges to the state of this leader.

• A technical contribution lies in the tool that we develop to prove the results above, namely, a result on the convergence of an infinite product of nonnegative substochastic matrices.

Second, we study the problem of a leader that aims to influence the opinions of agents in a directed network through connecting with a limited number of the agents. The leader's goal is to select this set of agents, referred to as direct followers, to achieve the greatest possible influence on the opinions of agents throughout the network. Here, when there is only one leader and consensus is guaranteed a priori, the influence of that leader is characterized through the transient error of the network, and thus is able to take into account both the network structure and the opinion dynamics evolving on it. When, on the other hand, there is a second leader (or a stubborn agent) with a competing opinion and consensus is not achievable, the influence of the first leader is measured in terms of the steady state error of the network. Compared to existing work, not only are our problem settings and formulations more natural (and thus likely to be of more value for practical applications), but our technical results are also much stronger. In particular,

• We prove the supermodularity property of the objective function capturing the leader influence in both cases, and the convexity of its continuous relaxation for general directed networks. Here, the convexity result is novel; the supermodularity result generalizes existing results in the literature but is proved using a different technique.
• We then develop greedy algorithms that are theoretically guaranteed to have a lower bound on the approximation ratio. The new convexity result allows us to benefit from efficient (customized) numerical solvers to obtain practically comparable solutions. We demonstrate through numerical examples that the two approaches can be combined to provide effective tools and better analysis for optimal design of influence spreading in diffusive networks.

1.2.2 Consensus Prediction

In this part of the dissertation, we introduce and study the problem of consensus prediction in a network whose dynamics are described by a DeGroot model. In particular, we assume that there is an observer who can monitor the states of a group of agents, but might not have accurate knowledge of the underlying communication graph and the associated weight matrix; see Figure 1.3.

Figure 1.3: A network with an observer.

We want to answer the following questions: For any initial opinions of the agents, how can the observer determine the consensus value, if it exists, by using a finite number of observations? What is the minimum number of observations needed? And, if the observer has more information about the network structure, how can the observation time be minimized over possible choices of observed nodes? Our main contributions in this topic are as follows:

• We reveal an intrinsic relation between the consensus value and network data; namely, if the consensus value can be computed at a particular time for any initial opinions, then it can be expressed as a linear combination of available observation data with associated coefficients depending on the weight matrix.

• We derive a fundamental limit on the monitoring time for the case of a single observed node, below which the observer with limited knowledge about the network is not able to determine the consensus value regardless of the method used. We provide sufficient conditions for achieving this limit.

• We provide a conjecture and analysis for the case of multiple observed nodes and develop algorithms toward achieving the conjectured bounds through local observations and computations. We show that with certain knowledge about the network structure, the observer can answer a few questions regarding the optimal monitoring time.

1.2.3 Distributed Optimization

In our work on distributed optimization, we seek interaction rules for a network of agents which result in the network collaboratively solving a global optimization problem. The interactions among agents must be local, without a central coordination unit, and the objective function is the sum of the local costs of all the agents; see Figure 1.4.

Figure 1.4: A network of five agents that try to minimize the sum of individual cost functions.

Under the conditions that the underlying communication graph is directed and the weight matrix is only row stochastic, we design algorithms for the agents to collaboratively solve this problem in a distributed manner and/or with fast convergence rates.

• We first study the use of consensus prediction for enhancing convergence of distributed optimization algorithms.
The resulting algorithms are the first that possess the following useful features: (i) they are distributed but behave similarly to the centralized gradient methods except on a slower time-scale (including finite time convergence for quadratic cost functions), (ii) all the agents are able to locally stop updating at the same time with the same estimate of the optimal solution, and (iii) the theoretical convergence scales at most linearly with the network size in general (and thus is the best so far).

• We provide a unified analysis for distributed projected subgradient methods with nonidentical local constraint sets. To deal with network asymmetry, we introduce a rescaling technique to the original distributed projected subgradient methods by incorporating an additional consensus step which aims to provide each agent with an estimate of the corresponding element of the left Perron eigenvector of the weight matrix.

• We present another algorithm that also uses the rescaling technique above but is able to achieve linear convergence under a stronger assumption on the local objective cost functions.

1.3 Literature Survey

One of the very first mathematical models used for studying network consensus is the DeGroot model [1], described as follows. Consider a group of N agents and let xi(t) ∈ [0, 1] denote the opinion of agent i at time t; here time is discrete, i.e., t ∈ Z+. Each agent has an initial opinion xi(0). At any time t ≥ 0, each agent observes the opinions of its neighbors and naively updates its own opinion according to

    xi(t+1) = Σ_{j=1}^{N} wij xj(t),   i = 1, . . . , N,   (1.1)

where wij indicates the weight that agent i places on agent j's opinion. Here wij ≠ 0 implies that agent i is able to obtain the opinion of agent j at time t. Thus, W = [wij] ∈ R^{N×N} is often called the weight matrix (or trust matrix). In this regard, it is natural to represent the communication network among the agents using graphs, where xi is the state of node i and wij the weight or strength of link (i, j); for this reason, the terms agent and node are used interchangeably. The network achieves consensus if for any initial opinions, it holds that lim_{t→∞} |xi(t) − xj(t)| = 0 for all i, j. Thus, a network following (1.1) is often called a consensus network.

This model was introduced in [1] for the synchronous and time-invariant case, where the author studied the process of reaching agreement among a group of experts; it was used in [40] for studying coordination of a group of particles, and extended in [2, 41–43] to the case of asynchronous and time-varying networks in the context of distributed decision making and parallel computing. Since then, a vast literature on network consensus has developed. Models that generalize or are related to (1.1) are also numerous, including, for example, agents with high order linear dynamics [44, 45] and nonlinear dynamics (possibly with nonlinear coupling among agents) [40, 46–51], where consensus (in terms of the agents' outputs) is also known as synchronization. Convergence under various assumptions on the communication graph has also been studied, including, for example, directed information flow [6, 8], link/node failure and noises [33], communication time-delays [47, 52], fixed or switching network topology [5, 9], and quantization [53, 54].
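As a concrete illustration of (1.1) and of the convergence result recalled in Section 1.5.2 below, the following minimal Python sketch (illustrative only, not from the thesis) simulates the DeGroot update with a row stochastic W and checks that all opinions converge to π'x(0), where π is the left Perron eigenvector of W.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5

# Random positive weights, rows normalized: W is row stochastic, and the
# (complete) graph is strongly connected, so consensus is guaranteed.
W = rng.random((N, N))
W /= W.sum(axis=1, keepdims=True)

x0 = rng.random(N)                 # initial opinions x(0)
x = x0.copy()
for t in range(200):               # iterate (1.1): x(t+1) = W x(t)
    x = W @ x

# Left Perron eigenvector pi of W (pi' W = pi', pi' 1 = 1).
evals, evecs = np.linalg.eig(W.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

print(x)        # all entries (numerically) equal
print(pi @ x0)  # ... and equal to pi' x(0), the predicted consensus value
```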
In this thesis, however, we will mostly focus on the first-order linear model (1.1), which is relatively simple but instructive enough for a basic study of consensus dynamics and particularly suitable for various distributed computation and optimization algorithms.

1.3.1 Consensus in Networks with Leaders

Many recent efforts have also studied networks with more than one type of agent, including, e.g., leaders and followers, stubborn or even adversarial agents (see, e.g., [5, 12, 13, 18–23]). In general, for the DeGroot model, consensus can still be achieved asymptotically with a single leader, but not if there are multiple leaders such that at least two of them are uncooperative. In this subsection, we review known conditions for consensus in the literature, where a leader is included in the network as a special agent with constant opinion.

For the case of a fixed network with a constant weight matrix, a necessary and sufficient condition for consensus is that the graph is rooted [9]. However, such a condition is still an open question for a time-varying interaction topology. In this case, conditions for consensus are those ensuring convergence to a rank-one matrix of an infinite product

    lim_{t→∞} W(t) W(t−1) · · · W(0),

where each W(t) is a stochastic matrix. This is also a well studied problem in the theory of non-homogeneous Markov chains. Therefore, many results and tools in matrix theory and Markov chain theory can be appealed to. It has been shown [9, 55] that a necessary condition is that the union graph over an infinite interval is rooted. This condition, though, is far from being sufficient; a counterexample can be found in [8]. Therefore, to derive sufficient conditions, more assumptions on the network connectivity and the weight matrix are required. For example, the authors in [5] rely on Wolfowitz's theorem [56] on the convergence of infinite products of stochastic matrices W(t) belonging to a finite set. This condition is relaxed in [8, 55, 57, 58] so that the matrices can belong to an infinite set. However, these works require that all the self-weights and other nonzero link weights are uniformly bounded below by a positive number and that either the weight matrix at any time has a symmetric zero/nonzero structure (i.e., the graph is undirected) or the union of the interaction graphs over any period of some fixed length is strongly connected. Similarly, the limiting behavior of products of random stochastic matrices is also studied in [59] assuming the cut-balanced property of the sequence of the matrices, which is in the same spirit as having symmetric zero/nonzero structures. We will show that for the DeGroot model in the presence of a leader, the strong connectivity and symmetric structure conditions on the weight matrix can be relaxed. However, we take a different approach, in that we develop consensus conditions directly for the model with leaders.

Besides deriving conditions for guaranteeing consensus, the problem of selecting an optimal subset of agents in the network to influence is also of interest in many practical applications, often known as leader selection and optimal stubborn agent placement [20, 60–67]. By considering different measures of influence or centrality in the network, there have been various approaches to solve the associated optimization problem. For example, [68] considers the problem of minimizing the convergence rate of a consensus network.
The authors show the connection between the convergence rate and the maximum distance from the leader to the followers and then apply a combinatorial optimization method to solve the problem approximately. In [69], the authors consider the problem of minimizing the total system error in a noisy network and derive systematic solutions for some special cases. In [62], the authors consider a characterization quantifying both the transient and the steady states of the agents' opinions, assuming that all the regular agents have the same initial opinions and that any direct follower replaces its own opinion by that of the leader. In [70], the authors use a continuous-time model and consider the problem of leader selection in order to minimize the convergence error, defined as the ℓp-norm of the distance between the followers' states and the convex hull of the leader states. By replacing the convergence error with an upper bound that is independent of the initial states of the network, [70] is able to employ a supermodular optimization approach.

Our work in this topic departs from this literature in many respects. First, we drop the assumption that the network is undirected and allow the underlying network to be directed. Second, we allow selected direct follower nodes (i.e., agents directly connected to the leader) to follow inter-agent dynamics like any other agents, rather than forcing them to adopt the leader's opinion instantaneously. Third, we allow the agents in the network to have different initial opinions, and the leader can assign different weights to the network agents. Finally, and more importantly, although continuous relaxation and greedy heuristics have been employed in dealing with influence maximization problems, our theoretical results on convexity and supermodularity are considerably stronger than existing results. We achieve this without assuming any symmetry or resorting to random walk theory. This not only provides a deeper understanding of diffusive processes but also can be used for a broad range of applications.

1.3.2 Consensus Prediction

The topic of consensus prediction is useful in network monitoring and security but has not received much attention. The existing literature mostly focuses on the observability problem for networked multi-agent systems. The problem we consider differs from the observability problem in the sense that we are concerned with the final value instead of trying to recover the initial conditions of all the agents. Moreover, here, the observer might not know the network structure or the weight matrix. Our analysis builds on a recent method for reaching consensus in finite time by employing the minimal polynomial of each agent [71, 72]. This method is concerned only with consensus predictability, and does not address optimality. Although predicting the agreement value of a consensus network is the main goal here, our approach makes a contribution to the topic of network identification through application of realization theory [73–75] to distributed networked systems.

1.3.3 Distributed Optimization

Consensus also plays an important role in distributed optimization, where a group of agents with limited communication tries to solve a global optimization problem in which the objective function is the sum of (possibly nonsmooth) local objectives of the agents and the global constraint set is the intersection of local constraints.
Tsitsiklis [2], among others, pioneered research on distributed computation over networks and the interplay between network dynamics and performance of decentralized algorithms in the context of networked control. Specifically, in [2] the problem of achieving consensus in system (1.1) was studied and then used as a subroutine for performing estimation and solving a class of optimization problems in a distributed manner. In this connection, the consensus step is utilized to deal with the fact that the agents have incomplete knowledge about the optimization problem. It was this idea that triggered the development of many consensus-based distributed algorithms; see, e.g., [32, 42, 76–85] and references therein.

Well known among these is the class of distributed (sub)gradient methods, which possesses many practically desirable characteristics, including simplicity of implementation and generally weak assumptions on the local cost functions as well as the network topology. Major limitations of algorithms in this category are also well studied. First, the convergence of many algorithms depends on the choice of step size sequences. When a constant step size is used, both Distributed Gradient Descent and Distributed Subgradient methods only yield convergence to a neighborhood of the optimal solution and of the optimal value [78, 86]. This occurs even under stronger assumptions on the local objective functions such as strong convexity and Lipschitz continuous gradients, and is thus one of the main differences between these methods and their centralized counterparts. This motivates the use of particular diminishing or adaptive step sizes to achieve asymptotic convergence. However, the convergence rate can be very slow (compared to that of the centralized method), depending on the step size sequence, whose appropriate selection is not trivial. Nesterov's acceleration technique can be employed [87] to speed up the convergence. Second, many incremental subgradient methods require all the agents to construct a closed cycle in order to pass an estimate of the solution around the network; see, e.g., [32, 88, 89]. Third, even when asymptotic convergence is guaranteed, it is not obvious how each agent can locally decide when to stop the algorithm without affecting other agents' estimates. Put differently, there are no simple criteria for all the agents to stop at the same time while also sharing the same estimate of an optimal solution. This is also true for most (if not all) other distributed optimization methods.

When all the local cost functions are quadratic, many other consensus-based algorithms can outperform those in the subgradient class. For example, the ratio consensus method can be used to solve the problem without constraints and converges exponentially [34, 90]. Based on this method, [91] proposed a Newton-Raphson-like algorithm which also converges asymptotically for a class of functions having continuous, strictly positive and bounded second derivatives, assuming a sufficiently small discretization step. Recently, much attention has also been given to decentralized Alternating Direction Method of Multipliers (ADMM) type methods with fast convergence in both theory and practice [39, 43, 92].

Most existing methods in distributed optimization (including those mentioned above) require the network to be undirected so that neighboring agents exchange information in both directions, increasing the possibility of reaching some agreement.
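To make the preceding limitations concrete, the following Python sketch (an illustrative textbook form, not an algorithm from this thesis) implements the classic distributed subgradient iteration xi(t+1) = Σj wij xj(t) − γ(t) gi(t) on an undirected ring, with local costs fi(x) = |x − ai| whose sum is minimized at the median of the ai. The symmetric averaging weights used here are doubly stochastic, which is precisely the kind of requirement discussed next.

```python
import numpy as np

N = 5
a = np.array([-2.0, 0.5, 1.0, 3.0, 4.0])   # local data; sum of |x - a_i| is minimized at median(a) = 1.0

# Doubly stochastic weights on an undirected ring: 1/3 to self and to each neighbor.
W = np.zeros((N, N))
for i in range(N):
    W[i, i] = W[i, (i + 1) % N] = W[i, (i - 1) % N] = 1.0 / 3.0

x = np.zeros(N)                            # local estimates x_i(0)
for t in range(20000):
    g = np.sign(x - a)                     # subgradient of f_i at x_i
    x = W @ x - g / np.sqrt(t + 1.0)       # consensus step + diminishing subgradient step

print(x)                                   # entries cluster near the optimum 1.0
```

With the diminishing step size γ(t) = 1/√(t+1), the iterates approach the optimum only slowly, illustrating the step-size sensitivity noted above; with a merely row stochastic W, the same iteration would instead be biased toward a π-weighted optimum, which motivates the rescaling technique developed later in this thesis.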
Many methods employing subgradient and consensus steps require the weight matrix associated with the network to be column stochastic or even doubly stochastic, which may be hard to arrange in directed networks, especially in a broadcast-based communication environment. A recent reweighting technique introduced in [93, 94] allows row stochastic matrices but assumes knowledge of the graph, that is, of the stationary distribution of the weight matrix and of the number of agents in the network. Thus, a fully distributed algorithm employing only row stochastic weight matrices has not been available in the field of distributed optimization thus far. In our work, we will develop such an algorithm. Moreover, known convergence analysis of distributed subgradient methods varies according to whether the problem is unconstrained or constrained, and whether the local constraint sets, usually compact, are identical or nonidentical. Thus, there is a lack of a unified convergence analysis for those scenarios, and hence developing such an analysis is also one of the goals of this thesis.

1.4 Thesis Organization

The remainder of the thesis is organized as follows. This introductory chapter ends with notation, definitions and mathematical background, including well known results for the DeGroot model. Our main results are presented in three parts corresponding to the three topics.

Part I of the thesis is concerned with the influence of a leader on the opinions of the agents in a directed network whose dynamics follow the DeGroot model. Specifically, in Chapter 2, we develop various sufficient conditions for guaranteeing consensus of all the network agents to the leader opinions in many scenarios: static and dynamic network topologies, with one leader or two competing leaders. Then in Chapter 3, we are concerned with the problem of optimizing the influence of a leader on the opinions of the agents in the case of a fixed network topology. We derive various joint centrality measures for a group of followers in different settings, and then develop theory and approximation algorithms for obtaining suboptimal solutions in large networks.

Part II, which consists of Chapter 4, deals with problems related to an observer that seeks to predict the consensus value of a network by monitoring the opinions of a group of agents. This setting can be seen as the dual to that in Part I, where the leader injects information/control into the network. We make use of a central tool in functional analysis, namely the Hahn-Banach Theorem, to prove the optimality of the degree of the minimal polynomial of a node as a tight lower bound on the observation time over all possible approaches to determine the consensus value if only that node is monitored. We then develop analysis and distributed algorithms for the case of multiple observed nodes. We also discuss optimal selection of observed nodes using graph theory.

Part III of the thesis is concerned with distributed optimization. In Chapter 5, we employ the consensus prediction method presented in Chapter 4 as an acceleration technique for enhancing convergence of the distributed gradient method in terms of correctness and speed. In the special case where the local objective functions are quadratic, we show that finite time convergence can be achieved. We also discuss a performance limit of distributed optimization and compare it with our algorithms.
In Chapter 6, we introduce a new technique that enables many distributed optimization algorithms to work with directed networks and row stochastic weight matrices. We then develop a unified analysis of the convergence, as well as the convergence rate, of a distributed subgradient algorithm and its variation, which can be applied to both unconstrained problems and constrained ones, possibly with nonidentical and unbounded local constraint sets. Finally, conclusions and directions for future research are given in Chapter 7.

1.5 Notation and Mathematical Background

1.5.1 Notation and Definitions

Notation: We use boldface characters and symbols to denote vectors, for example, x = [x1, x2, . . . , xm]^T ∈ R^m, xi = [xi1, xi2, . . . , xim]^T ∈ R^m, 1 = [1, 1, . . . , 1]^T and ei = [0, . . . , 0, 1i, 0, . . . , 0]^T (with the 1 in the i-th position). For a vector x, ‖x‖1, ‖x‖2 (or just ‖x‖) and ‖x‖∞ denote its 1-norm, 2-norm, and ∞-norm, respectively. We also denote by diag(x) the diagonal matrix whose diagonal elements are the elements of vector x.

For a matrix A, A^T denotes its transpose, A† its pseudo-inverse, [A]ij or Aij the ij-th element, rank(A) the rank, tr(A) the trace, ρ(A) its spectral radius, ‖A‖ the (induced) 2-norm of A, and |A| the matrix composed of the absolute values of the elements of A, i.e., [|A|]ij = |[A]ij| for all i, j. We also use A(i) and A(j), respectively, to denote the i-th row and the j-th column of A.

Sets are denoted by calligraphic upper case letters. For a given set A, |A| or card(A) denotes its cardinality, and χA denotes the associated indicator function. The degree of a polynomial q is denoted by deg(q).

Basic Notions: A matrix A = [aij] is nonnegative (positive) if aij ≥ 0 (aij > 0) for all i, j. If A − B is a nonnegative matrix, we write A ≥ B. A nonnegative square matrix A is row stochastic (or simply stochastic) if A1 = 1, (row) substochastic if A1 ≤ 1, column stochastic if A^T 1 = 1, and doubly stochastic if it is both row and column stochastic. A square matrix A is called an M-matrix if (i) all the off-diagonal elements are nonpositive, i.e., aij ≤ 0 for all i ≠ j, and (ii) it can be expressed as A = sI − B, where B is a nonnegative matrix such that ρ(B) ≤ s.

A directed graph G = (V, E) consists of a finite set of nodes V = {1, 2, . . . , N} and a set E ⊆ V × V of edges, where an ordered pair (i, j) ∈ E indicates that agent i receives information on the state of agent j. A directed path is a sequence of edges of the form (i1, i2), (i2, i3), . . . , (ik−1, ik). A simple path is a path without any node repeated. Node i is said to be reachable from node j if there exists a path from j to i. Each node is reachable from itself (i.e., self-loops are permitted). For node i, Ni = {j ∈ V : (i, j) ∈ E} is the set of in-neighbors (or neighbors for short), and |Ni| is the degree (also in-degree) of node i. Graph G is connected (or weakly connected) if it cannot be partitioned into 2 separate groups that have no paths connecting them. Graph G is strongly connected if each node is reachable from any other node. A tree is a graph that has a node called the root from which all the other nodes are reachable. The diameter of a connected graph G, denoted by diam(G), is the length of the longest path among all simple paths.

Let f : R^m → R be a convex function. The domain of f is denoted by dom(f). We denote by ∂f(x) the subdifferential of f at x ∈ dom(f), i.e., the set of all subgradients of f at x:

    ∂f(x) = {g ∈ R^m : f(y) − f(x) ≥ g^T (y − x), ∀y ∈ dom(f)}.   (1.2)
A differentiable function f is called strongly convex with parameter µ > 0 if for any x, y ∈ dom(f),

    f(y) − f(x) ≥ ∇f(x)^T (y − x) + (µ/2) ‖y − x‖².   (1.3)

1.5.2 Convergence of DeGroot Model

This subsection presents convergence conditions for the DeGroot model (1.1) that will serve as a basic result for our development in the sequel. Consider a leaderless network consisting of N agents denoted by V = {1, 2, . . . , N}. The underlying communication is characterized by a directed graph G = (V, E). The update of each agent i's opinion at any time t ≥ 0 (here t denotes time, which can take any nonnegative integer value) can also be given as follows:

    xi(t+1) = ( Σ_{j∈Ni} wij xj(t) ) / ( Σ_{j∈Ni} wij ),   xi(0) = x0i ∈ R,   (1.4)

where wij ∈ [0, ∞) quantifies the unnormalized weight that agent i places on agent j's opinion, and recall that Ni denotes the set of node i's immediate neighbors (including itself). The weight matrix of the network is denoted W := [wij] ∈ R^{N×N}, with the inter-nodal influence parameters wij > 0 when there is a direct link from agent j to agent i and wij = 0 if no such link exists.

Definition 1.5.1. Consider the DeGroot model (1.1). The network achieves consensus if for any initial opinions, it holds that lim_{t→∞} |xi(t) − xj(t)| = 0 for all i, j = 1, . . . , N.

In most chapters (except Chapter 2), we will make the following assumptions on the communication graph G and the weight matrix W, which are usually imposed to ensure consensus of the network.

Assumption 1.5.2. The network G = (V, E) is a fixed and strongly connected directed graph.

Assumption 1.5.3. The matrix W = [wij] is row stochastic and satisfies wij > 0 if (i, j) ∈ E, wii > 0 for some i ∈ V, and wij = 0 otherwise.

Assumption 1.5.3 means that the zero-nonzero structure of the weight matrix W reflects the network structure. Moreover, W is now a normalized weight matrix. Thus, we can also express (1.4) compactly as

    x(t+1) = W x(t).   (1.5)

It is well known (see, e.g., [1, 5, 14, 15, 49]) that under the assumptions above, the network achieves consensus. In fact, W is irreducible and represents an ergodic Markov chain. Let π denote the stationary distribution of W, i.e., π is the left eigenvector of W corresponding to the eigenvalue 1 and satisfying the condition 1^T π = 1. The following result is the well known Perron-Frobenius theorem for irreducible matrices (see, e.g., [95]), which lays the foundation for theories of Markov chains and network consensus (see, e.g., [14, 15] and references therein).

Theorem 1.5.4. (see, e.g., [95]) If W is row stochastic and irreducible, then
1. W has spectral radius ρ(W) = 1, which is also a simple eigenvalue.
2. π^T W = π^T and π is a strictly positive vector.
3. lim_{t→∞} W^t = 1π^T exists. The convergence rate is geometric and determined by the second largest eigenvalue of W.

Corollary 1.5.5. Under Assumptions 1.5.2 and 1.5.3, the network (1.4) achieves consensus and

    lim_{t→∞} xi(t) = π^T x(0),   ∀i ∈ V.   (1.6)

Moreover, the convergence rate is geometric.

Clearly, the consensus value depends on the weight matrix W and the initial opinions x(0). We also often define a weighted Laplacian matrix L = [lij] ∈ R^{N×N} satisfying εL = I − W for some ε > 0. The following identities are immediate:

    π^T L = 0^T,   L1 = 0.

Moreover, all the eigenvalues of L are positively stable except only one at the origin.

Part I: Consensus Network with Leaders

Chapter 2: Opinion Dynamics with Persistent Leaders

Abstract: This chapter revisits the problem of agreement seeking in a network of agents under the influence of leaders.
Part I: Consensus Networks with Leaders

Chapter 2: Opinion Dynamics with Persistent Leaders

Abstract: This chapter revisits the problem of agreement seeking in a network of agents under the influence of leaders. The persistence of the effect of the leader (or leaders) on the opinions of the network agents is characterized by the total weight that the agents place on the leader's information over time. If this weight is infinite, the leader is called persistent. We describe the asymptotic behavior of network opinions relative to the state of a persistent leader in both cases of fixed and switching network topologies. We also show that only persistent leaders are able to drive the network to the leader's constant state.

2.1 Introduction

It is widely known that both the communication graph and the influence structure of a network play important roles in reaching consensus. The former indicates whom an agent interacts with, while the latter determines the weights on the information that he receives from others. In most existing work on network consensus with or without leaders, the weights are usually assumed to be constant or to vary in a compact set bounded away from zero (see, e.g., [1, 5, 13, 15] and references therein). In practice, however, the connections between agents may be transient, and the weights may fluctuate over a broad range and even diminish with time. Recently, the notion of persistent graphs was studied in [55], where it was shown that persistent links are crucial in seeking an agreement. However, the work [55] considered networks without a leader and required stronger conditions than just the persistence of the agent interconnections. In this chapter, we study networks in the presence of leaders and show that consensus can still be achieved under milder requirements.

A great number of recent efforts have also been devoted to the consensus problem in networks with leaders (see, e.g., [5, 12, 13, 20] and references therein). A general conclusion is that consensus cannot be achieved when the leaders have competing opinions. This is mainly because only persistent leaders were considered. In this chapter, we show that network agreement can still be reached in the presence of competing leaders, provided that only one of them is persistent. This also distinguishes our results from existing works in this setting.

The main contributions of this work are as follows. First, we derive new sufficient conditions for guaranteeing agreement in networks with a leader, for both fixed and switching topologies. These conditions emphasize the persistence of the connectivity between the leader and the followers and, to the best of our knowledge, are the mildest available, covering many existing results in the literature. Second, we show that in a network with more than one leader, only those leaders that are persistent matter. In particular, when there is only one persistent leader, we provide conditions under which the network converges to the state of this leader. Most of the results in this chapter were first presented in [96].

The rest of the chapter is organized as follows. In Section 2.2, we describe the problem of interest in detail. Sections 2.3 and 2.4 present the main results for networks with a single leader and two leaders, respectively. Finally, discussion and future work are given in Section 2.5.

2.2 Problem Formulation

Consider a set of N agents or nodes interacting over a communication network. The topology of the network at time t ∈ Z+ is described by a graph G(t) = (V, E(t)). Let x_i(t) ∈ [0, 1] denote the opinion of node i at time t. At the initial time t = 0, each agent has an initial opinion x_i(0).
Suppose that at each time t, every agent synchronously obtains the opinions of his neighbors and naively updates his own opinion following the DeGroot discrete-time model [1]

$$x_i(t+1) = \sum_{j\in\mathcal{V}} w_{ij}(t)x_j(t), \qquad \forall i \in \mathcal{V}, \qquad (2.1)$$

where w_ij(t) ≥ 0 indicates the weight that agent i puts on agent j's opinion at time t. Here W(t) = [w_ij(t)] ∈ R^{N×N} represents the weight matrix (or trust matrix) at time t. We will assume that W(t) is a row stochastic matrix for any t, i.e., W(t) is nonnegative and W(t)1 = 1.

Now consider the above network under the effect of an external media node, representing a leader or a source of news with a constant opinion value T ∈ [0, 1]. Although such a node is often thought of as a stubborn node indistinguishable from the others, we consider it separately from that context, as we want to look at the network from the point of view of a leader and investigate its effect on the opinions of the other regular agents, conventionally called followers. To this end, assume that the leader can connect to some of the followers and persuade them to trust its opinion T with trust levels α_i(t) ∈ [0, 1], ∀i ∈ V. Here, α_i(t) = 0 means distrust of, or unawareness of, the leader, and α_i(t) = 1 implies absolute trust. The update rule is then given by

$$x_i(t+1) = \alpha_i(t)T + (1-\alpha_i(t))\sum_{j\in\mathcal{V}} w_{ij}(t)x_j(t), \qquad \forall i \in \mathcal{V}. \qquad (2.2)$$

In matrix form, (2.2) reads

x(t + 1) = α(t)T + Γ(t)W(t)x(t),  (2.3)

where α(t) = [α_1(t), ..., α_N(t)]^T and Γ(t) = I − diag(α(t)), with I ∈ R^{N×N} being the identity matrix.

We are interested in finding conditions under which the network eventually agrees with the leader's opinion. Indeed, we will determine how strong the connections between the leader and the followers should be to ensure that this agreement is achieved asymptotically. We will also extend these results to the case of a network with more than one leader. The following notions will be used in this chapter; see, e.g., [8, 55].

Definition 2.2.1. Consider a time-varying graph G(t) = (V, E(t)) with an associated time-dependent weight matrix W(t).

• Link (i, j) is called persistent if Σ_{t≥0} w_ij(t) = ∞.
• Node i is persistent if Σ_{t≥0} Σ_{k=1}^N w_ki(t) = ∞.
• The persistent graph G_∞ induced by {G(t), W(t), ∀t ∈ Z+} is the graph containing all the persistent links.
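The update (2.2)-(2.3) is straightforward to simulate. The following minimal sketch implements one step of (2.3) in NumPy; the weight matrix, the constant trust levels, and the leader opinion are illustrative assumptions, not data from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T_op = 4, 1.0
W = rng.random((N, N))
W /= W.sum(axis=1, keepdims=True)          # row stochastic trust matrix

def step(x, alpha, W, T=T_op):
    """One iteration of (2.3): x(t+1) = alpha*T + (I - diag(alpha)) W x(t)."""
    Gamma = np.eye(len(x)) - np.diag(alpha)
    return alpha * T + Gamma @ (W @ x)

x = rng.random(N)                           # initial opinions in [0, 1]
for t in range(500):
    alpha = np.full(N, 0.05)                # constant trust in the leader
    x = step(x, alpha, W)
print(x)                                    # all entries approach T
```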
2.3 Opinion Dynamics with One Leader

This section studies the convergence of opinions under different assumptions on the connection topology of the followers' network and the connectivity of the leader with the followers. First, it is easy to see that if there exists t* ∈ Z+ such that x(t*) = T1 (e.g., α(t* − 1) = 1), then x(t) = T1 for all t ≥ t*, i.e., the network converges to T in finite time (at most t* steps) for any initial opinions x(0). Second, it is also known that when W and α are constant, then x(t) → T1 for any x(0) if the extended graph including the leader and the followers contains a spanning tree rooted at the leader (see, e.g., [9]). Third, consider the case when the leader's effect lasts only for an interval [0, t_0], e.g., a campaign period, with α(t) = 0 for all t > t_0. Suppose that W is fixed and the network is strongly connected. Then there exists π ∈ R^N such that lim_{t→∞} W^t = 1π^T (cf. Section 1.5). As a result, lim_{t→∞} x(t) = 1π^T x(t_0), i.e., the consensus value may differ from T. If α(t) ≡ α for all t ∈ [0, t_0], then it can be verified that

$$\lim_{t\to\infty} \mathbf{x}(t) = \mathbf{1}\bigl(T - \boldsymbol{\pi}^T(\Gamma W)^{t_0+1}(T\mathbf{1} - \mathbf{x}(0))\bigr). \qquad (2.4)$$

Since ΓW is strictly substochastic and irreducible, it follows that ρ(ΓW) < 1 (see, e.g., [97, Thm 1.1, p. 24]), and thus lim_{t_0→∞}(ΓW)^{t_0} = 0. Therefore, x(∞) = T1 as t_0 → ∞, i.e., the network reaches consensus at the leader's state when the leader is persistent.

Next, we allow α(t) to be time-varying; the limit lim_{t→∞} α(t) need not exist. Some related works are [5, 57] and [59]. Notice that the results in [5] rely on Wolfowitz's theorem [56] on the convergence of infinite products of stochastic matrices belonging to a finite set. This condition is relaxed in [57] so that the matrices can belong to an infinite set. However, the work [57] requires symmetry of the zero/nonzero structure of these matrices. Similarly, the limiting behavior of products of random stochastic matrices is studied in [59] assuming the cut-balanced property of the sequence of these matrices. This property is in the same spirit as having symmetric zero/nonzero structures. Here, we need not impose those conditions. The following condition suffices to ensure the asymptotic convergence of the network to the leader's opinion.

Theorem 2.3.1. (One Leader, Arbitrary Graph) Consider system (2.2) and suppose that the weights on the leader satisfy

$$\sum_{t\ge 0}\ \min_{i\in\mathcal{V}} \alpha_i(t) = \infty. \qquad (2.5)$$

Then x(t) → T1 as t → ∞ for any initial opinion x(0).

Proof. The proof follows the method presented in [46]. Let x̃(t) = [x(t)^T  T]^T. Equation (2.3) is equivalent to

$$\tilde{\mathbf{x}}(t+1) = \tilde{W}(t)\tilde{\mathbf{x}}(t), \qquad \tilde{W}(t) := \begin{bmatrix} \Gamma(t)W & \boldsymbol{\alpha}(t) \\ \mathbf{0}^T & 1 \end{bmatrix}. \qquad (2.6)$$

Let h(t) = max_{1≤i,j≤N+1} (x̃_i(t) − x̃_j(t)). Obviously, h(t) ≥ 0 for all t ≥ 0. Now

$$h(t+1) = \max_{1\le i,j\le N+1}\bigl(\tilde{x}_i(t+1) - \tilde{x}_j(t+1)\bigr) = \max_{1\le i,j\le N+1}\ \sum_{1\le k\le N+1} (\tilde{w}_{ik} - \tilde{w}_{jk})\,\tilde{x}_k(t).$$

Denote h̃_ij(t) = Σ_{k=1}^{N+1} (w̃_ik − w̃_jk) x̃_k(t). Then

$$\tilde{h}_{ij}(t) = \sum_{k}\bigl(\tilde{w}_{ik} - \min(\tilde{w}_{ik},\tilde{w}_{jk})\bigr)\tilde{x}_k(t) - \sum_{k}\bigl(\tilde{w}_{jk} - \min(\tilde{w}_{ik},\tilde{w}_{jk})\bigr)\tilde{x}_k(t) \le \sum_{k}\bigl(\tilde{w}_{ik} - \min(\tilde{w}_{ik},\tilde{w}_{jk})\bigr)\max_l \tilde{x}_l(t) - \sum_{k}\bigl(\tilde{w}_{jk} - \min(\tilde{w}_{ik},\tilde{w}_{jk})\bigr)\min_l \tilde{x}_l(t).$$

Rearranging terms and using the fact that Σ_{k=1}^{N+1} w̃_ik = 1 for i = 1, ..., N+1 yields

$$\tilde{h}_{ij}(t) \le \Bigl(\max_l \tilde{x}_l(t) - \min_l \tilde{x}_l(t)\Bigr)\Bigl(1 - \sum_{1\le k\le N+1}\min(\tilde{w}_{ik}, \tilde{w}_{jk})\Bigr).$$

Therefore,

$$h(t+1) \le h(t)\Bigl(1 - \min_{1\le i,j\le N+1}\ \sum_{1\le k\le N+1}\min(\tilde{w}_{ik},\tilde{w}_{jk})\Bigr). \qquad (2.7)$$

Now, for any i, j ∈ V,

$$\sum_{1\le k\le N+1}\min(\tilde{w}_{ik},\tilde{w}_{jk}) \ \ge\ \min(\tilde{w}_{i,N+1}, \tilde{w}_{j,N+1}) = \min(\alpha_i(t), \alpha_j(t)) \ \ge\ \min_{k\in\mathcal{V}}\alpha_k(t). \qquad (2.8)$$

Define α̲(t) := min_{k∈V} α_k(t). It follows immediately from (2.7) that, for any t ≥ 0,

$$h(t+1) \le h(t)\bigl(1 - \underline{\alpha}(t)\bigr) \le h(t)e^{-\underline{\alpha}(t)} \le h(0)e^{-\sum_{s=0}^{t}\underline{\alpha}(s)}. \qquad (2.9)$$

The second inequality follows from the fact that 1 − z ≤ e^{−z} for all z ≥ 0. The assumption that Σ_{s=0}^∞ α̲(s) = ∞ implies that lim_{t→∞} h(t) = 0, hence x_i(t) → T for all i ∈ V.

Clearly, the result above holds for any structure of the network and weight matrix (even in the time-varying case), as it relies merely on condition (2.5), which means that every node in the network persistently trusts the leader (in the sense that Σ_{t≥0} α_i(t) = ∞ for all i ∈ V). Note that the notion of persistent graphs, that is, graphs consisting of links satisfying Σ_{t≥0} w_ij(t) = ∞, was also studied in [55]. However, to guarantee global agreement, [55] further requires that there exist a* > 0 and T* > 0 such that Σ_{s=t}^{t+T*−1} w_ij(s) ≥ a* for all t ≥ 0 and for all persistent links (i, j). This condition is stronger than the condition of being a persistent link. Therefore, the results in [55] cannot be applied in this case. One can notice that condition (2.5) is rather strong.
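The role of condition (2.5) can be illustrated numerically. In the following minimal sketch (the network and trust profiles are illustrative assumptions), the non-summable profile α_i(t) = 1/(t+1) drives the opinions to T, whereas the summable profile α_i(t) = 1/(t+1)^2 leaves a residual gap:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T_op = 5, 1.0
W = rng.random((N, N)); W /= W.sum(axis=1, keepdims=True)

def run(alpha_of_t, steps=20000):
    # Iterate (2.2) with a common trust level alpha_of_t(t) for every agent.
    x = rng.random(N)
    for t in range(steps):
        a = alpha_of_t(t)
        x = a * T_op + (1 - a) * (W @ x)
    return np.max(np.abs(x - T_op))

print(run(lambda t: 1.0 / (t + 1)))        # ~0: sum diverges, consensus to T
print(run(lambda t: 1.0 / (t + 1) ** 2))   # stays bounded away from 0
```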
In practice, there are many situations where ensuring this condition may be costly, since the leader needs to approach every agent in the network directly, an infinite number of times. This is usually not the most practical advertising strategy either. In fact, the leader should exploit the connections between the followers to advertise its opinion. Therefore, below we relax this condition by imposing requirements on the network structure.

Assumption 2.3.2. The graph G is fixed and strongly connected. The weight matrix W is fixed and has positive diagonal elements, that is, w_ii > 0 for all i ∈ V.

Theorem 2.3.3. (One Leader, Strongly Connected Graph, Fixed Weight) Consider system (2.2) and let Assumption 2.3.2 hold. Suppose the weights α_i satisfy

$$\sum_{t\ge 0}\ \max_{i\in\mathcal{V}} \alpha_i(t) = \infty. \qquad (2.10)$$

Then x(t) → T1 as t → ∞ for any x(0).

Before giving the proof, we make a few remarks. First, Theorems 2.3.3 and 2.3.1 show the importance of persistent links (including constant weights as a special case) in shaping the final opinion. Second, since the network size is finite, (2.10) is equivalent to the condition that at least one follower persistently trusts the leader, even if its trust level fades away. This condition holds for many plausible specifications of α_i, e.g., α_i(t) = c/(t+1)^γ for any c ∈ (0, 1] and γ ∈ [0, 1].

It is tempting to follow the proof of Theorem 2.3.1, which is primarily based on inequalities (2.7) and (2.8). However, under condition (2.10) this technique is no longer applicable. Consider, e.g., a connected undirected network with N = 5 and

$$W = \begin{bmatrix} w_{11} & w_{12} & 0 & 0 & 0 \\ w_{21} & w_{22} & w_{23} & 0 & 0 \\ 0 & w_{32} & w_{33} & w_{34} & 0 \\ 0 & 0 & w_{43} & w_{44} & w_{45} \\ 0 & 0 & 0 & w_{54} & w_{55} \end{bmatrix}.$$

Assume that α_i(t) = 0 for all i ∈ V \ {3} and all t ≥ 0, and that Σ_{t≥0} α_3(t) = ∞. It can be seen that

min_{1≤i,j≤N+1} Σ_{1≤k≤N+1} min(w̃_ik, w̃_jk) = 0.

Thus, from (2.7) we can only obtain that h(t+1) ≤ h(t) for all t ≥ 0, which is not enough to ensure the convergence of h(t) to 0.

Moreover, since α(t) is not restricted to belong to a finite set, the results in [5] cannot be used. Further, since neither W nor W̃(t) (see (2.6)) is required to have a symmetric zero/nonzero structure, the results in [57, 59] and [98] are not applicable. The following proof uses the results presented in [99] on the convergence of infinite products of substochastic matrices. Notice that [99, Theorem 6.2] requires that the smallest row sums of all the matrices be uniformly bounded away from, and below, 1. Here, we require milder conditions. The following results are needed to proceed with the proof of Theorem 2.3.3.

Lemma 2.3.4. [99] Let M_i ∈ R^{n×n}, i = 1, ..., m, be any m substochastic matrices. Then the product P = Π_{i=1}^m M_i is also a substochastic matrix.

Lemma 2.3.5. [99] Let M_i ∈ R^{n×n}, i = 1, ..., n−1, be irreducible substochastic matrices with positive diagonals. Then the product P = Π_{i=1}^{n−1} M_i is a strictly positive matrix, i.e., P_ij > 0 for all i, j. Further, let m = min{[M_k]_ij | i, j ∈ [1, n], k ∈ [1, n−1], [M_k]_ij > 0}; then min_{i,j} P_ij ≥ m^{n−1}.

For any matrix M, let r_i(M) := Σ_j M_ij, i.e., the i-th row sum of the matrix.

Lemma 2.3.6. Suppose that M_i ∈ R^{n×n}, i = 1, ..., m, satisfy min_i r_i(M_1) ≤ r_1 and max_i r_i(M_k) ≤ r̄_k for k = 2, ..., m. Then min_i r_i(M_1 ··· M_m) ≤ r_1 r̄_2 ··· r̄_m.

Proof. If min_i r_i(M_1) ≤ r_1 and max_i r_i(M_2) ≤ r̄_2, then r_i(M_1 M_2) = Σ_j [M_1]_ij r_j(M_2) ≤ Σ_j [M_1]_ij r̄_2 = r_i(M_1) r̄_2. Thus, min_i r_i(M_1 M_2) ≤ r_1 r̄_2. By induction, it holds that min_i r_i(M_1 ··· M_m) ≤ r_1 r̄_2 ··· r̄_m provided max_i r_i(M_k) ≤ r̄_k, k = 2, ..., m.

Lemma 2.3.7. Let U, V, D_1 and D_2 be nonnegative matrices with appropriate dimensions.
If D_1 ≤ D_2, then ‖U D_1 V‖_∞ ≤ ‖U D_2 V‖_∞.

Proof. Let U^{(i)} denote the i-th row of U and V_{(j)} the j-th column of V. Since U, V, D_1 and D_2 are nonnegative, it follows that 0 ≤ U^{(i)} D_1 V_{(j)} ≤ U^{(i)} D_2 V_{(j)} for all i, j. Thus 0 ≤ U D_1 V ≤ U D_2 V, and hence ‖U D_1 V‖_∞ ≤ ‖U D_2 V‖_∞.

The proof of Theorem 2.3.3 is presented next.

Proof. Defining ξ(t) := x(t) − T1, the update rule (2.3) can be expressed as follows:

ξ(t + 1) = A(t)ξ(t),  A(t) := Γ(t)W,  (2.11)

where Γ(t) = I − diag(α(t)). We need to show that lim_{t→∞} ξ(t) = 0 for any ξ(0), or equivalently,

$$\lim_{s\to\infty} \Bigl\|\prod_{0\le t\le s} A(t)\Bigr\|_\infty = 0, \qquad (2.12)$$

where Π_{0≤t≤s} A(t) := A(s) ··· A(0). Note that although A(t) is substochastic for all t ≥ 0 (hence ρ(A(t)) ≤ 1), this does not automatically imply that Π_{t≥0} A(t) = 0; this is true even if every A(t) is strictly substochastic. (For example, the sequence

$$a_i = \tfrac{1}{2}\underbrace{\sqrt{2+\sqrt{2+\cdots+\sqrt{2}}}}_{i\ \text{nested radicals}}$$

satisfies a_i ∈ (0, 1) for all i ≥ 1, but Π_{i=1}^∞ a_i = 2/π.)

Take any η ∈ (0, 1) and define

Ã(t) = Γ̃(t)W,  Γ̃(t) = I − diag(ηα(t)).

Obviously, Γ(t) ≤ Γ̃(t) for all t ≥ 0. Applying Lemma 2.3.7, we have

‖Π_{0≤t≤s} A(t)‖_∞ ≤ ‖Π_{0≤t≤s} Ã(t)‖_∞, ∀s ≥ 0.

Thus, the following condition suffices for (2.12):

$$\lim_{s\to\infty}\Bigl\|\prod_{0\le t\le s}\tilde{A}(t)\Bigr\|_\infty = 0. \qquad (2.13)$$

Next, define

$$B(s) := \prod_{sN_1\le t\le (s+1)N_1-1} \tilde{A}(t), \qquad N_1 := N - 1, \qquad (2.14)$$

and note the following.

(i) There exists b > 0 such that min_{i,j} B_ij(s) ≥ b for all s ≥ 0 and all i, j ∈ V. This can be shown as follows. Let

w = min{w_ij | i, j ∈ V, w_ij > 0}.  (2.15)

Then min{Ã_ij(t) | i, j ∈ V, Ã_ij(t) > 0} ≥ (1 − η)w. Note that Ã(t) is irreducible, since the network is strongly connected (by assumption). Thus, by Lemma 2.3.5, we have

min_{i,j} B_ij(s) ≥ ((1 − η)w)^{N_1} =: b.  (2.16)

(ii) max_i r_i(B(s)) ≤ 1, since B(s) is substochastic for any s (cf. Lemma 2.3.4).

(iii) min_i r_i(B(s)) ≤ 1 − δ((s+1)N_1 − 1), where δ(t) := η max_{i∈V} α_i(t) for all t ≥ 0. This can be obtained by using Lemma 2.3.6 with min_i r_i(Ã((s+1)N_1 − 1)) ≤ 1 − δ((s+1)N_1 − 1) and max_i r_i(Ã(t)) ≤ 1 for t = sN_1, ..., (s+1)N_1 − 2.

Now, let r_{j*}(B(s)) denote the smallest row sum of B(s). Using the above results yields, for all i ∈ V,

$$r_i\bigl(B(s{+}1)B(s)\bigr) = B_{ij^*}(s{+}1)\,r_{j^*}(B(s)) + \sum_{j\ne j^*} B_{ij}(s{+}1)\,r_j(B(s)) \le B_{ij^*}(s{+}1)\bigl[1-\delta((s{+}1)N_1-1)\bigr] + 1 - B_{ij^*}(s{+}1) \le 1 - b\,\delta((s{+}1)N_1-1) \le e^{-b\,\delta((s{+}1)N_1-1)},$$

where the three inequalities use (ii)-(iii), (ii), and (i), respectively, and the last step follows from the fact that 1 − z ≤ e^{−z} for all z ≥ 0. Thus,

$$\Bigl\|\prod_{0\le t\le (2m+2)N_1-1}\tilde{A}(t)\Bigr\|_\infty = \Bigl\|\prod_{0\le s\le m}B(2s{+}1)B(2s)\Bigr\|_\infty \le \prod_{0\le s\le m}\bigl\|B(2s{+}1)B(2s)\bigr\|_\infty \le e^{-b\sum_{s=0}^{m}\delta((2s+1)N_1-1)}. \qquad (2.17)$$

If Σ_{s=0}^∞ δ((2s+1)N_1 − 1) = ∞, then the right side of (2.17) decays to 0 as m → ∞, and (2.13) follows immediately. Therefore, it remains to show that this is also the case when (2.10) holds, or equivalently, when Σ_{t≥0} δ(t) = ∞. To this end, let

Δ_i^m := Σ_{0≤j≤m} δ(i + 2jN_1), ∀i ∈ [0, 2N_1 − 1],

and note that

$$\sum_{0\le t\le 2(m+1)N_1-1}\delta(t) = \sum_{0\le s\le 2N_1-1}\Delta_s^m. \qquad (2.18)$$

Now, let (2.10) hold. We claim that there must exist k ∈ [0, 2N_1 − 1] such that Δ_k^∞ = ∞. This can be shown by contradiction: if Δ_i^∞ < ∞ for all i ∈ [0, 2N_1 − 1], then Σ_{s=0}^{2N_1−1} Δ_s^∞ < ∞; taking the limit of both sides of (2.18) as m → ∞ then yields Σ_{t≥0} δ(t) < ∞, which contradicts (2.10). Thus the claim holds. Now, if k = N_1 − 1, i.e., Δ_{N_1−1}^∞ = ∞, then we obtain the desired result; otherwise we can redefine B(s) := Π_{t=sN_1+k}^{(s+1)N_1+k−1} Ã(t) and follow the same steps as above to show that Π_{t≥k} Ã(t) = 0 and thus Π_{t≥0} Ã(t) = 0. This completes the proof.
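As a quick sanity check of Lemma 2.3.5, which underlies step (i) of the proof above, the following sketch multiplies n − 1 irreducible substochastic matrices with positive diagonals (a hand-built example, not taken from the text) and verifies that the product is strictly positive with entries at least m^{n−1}:

```python
import numpy as np

n = 6

def ring_substochastic(n):
    # Directed cycle plus self-loops: irreducible, positive diagonal,
    # every row sum equal to 0.72 < 1, hence substochastic.
    M = np.zeros((n, n))
    for i in range(n):
        M[i, i] = 0.36
        M[i, (i + 1) % n] = 0.36
    return M

P = np.eye(n)
for _ in range(n - 1):               # product of n-1 such matrices
    P = P @ ring_substochastic(n)

m = 0.36                             # smallest positive entry of the factors
print((P > 0).all())                 # True: the product is strictly positive
print(P.min(), m ** (n - 1))         # P.min() attains the bound m^(n-1) here
```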
Note that we can take N_1 = d_0 in (2.14), where d_0 denotes the diameter of the graph G, and then repeat the above proof using a slight modification of Lemma 2.3.5 applied to Ã(t) = Γ̃(t)W. In fact, P = Π_{k≤i≤k+d_0−1} Ã(i) is strictly positive for any k. The proof of this is not much different from that of Lemma 2.3.5 and is thus skipped here. Note also that d_0 = N − 1 in the worst case. The above proof also allows us to estimate the ε-convergence time for some choices of α_i(·), as follows.

Corollary 2.3.8. (ε-Convergence Time) Let d_0 be the diameter of the graph G and let w be defined as in (2.15). Given any number ε > 0, it holds that ‖x(t) − T1‖_∞ ≤ ε‖x(0) − T1‖_∞ if t > 2d_0 m, where

(i) $m = \exp\bigl(\frac{2d_0}{(0.5w)^{d_0}}\log\varepsilon^{-1}\bigr)$ if max_i α_i(τ) = 1/(τ+1), or

(ii) $m = \frac{1}{\bar{\alpha}\,((1-\bar{\alpha})w)^{d_0}}\log\varepsilon^{-1}$ if max_i α_i(τ) = ᾱ ∈ (0, 1) for all τ ≥ 0.

Proof. The proof follows from (2.17) and the fact that Σ_{t=1}^k 1/t > ln(k+1).

The following result is a straightforward extension of Theorem 2.3.3 to the case of time-varying weight matrices.

Assumption 2.3.9. (Strong Weight) The weight matrix W(t) is row stochastic and satisfies

a) w_ii(t) ≥ γ for all i ∈ V, for some γ ∈ (0, 1);
b) w_ij(t) ∈ {0} ∪ [γ, 1) for all i, j ∈ V, i ≠ j.

Theorem 2.3.10. (One Leader, Strongly Connected Graph, Time-varying Weight) Consider system (2.2) and suppose that G(t) is strongly connected for all t ≥ 0 and that Assumption 2.3.9 holds. If (2.10) holds, then x(t) → T1 for any x(0).

Proof. The only difference between Theorems 2.3.10 and 2.3.3 is that here A(t) = Γ(t)W(t). However, under Assumption 2.3.9, we can choose w = γ and b = ((1−η)γ)^{N−1} in (2.15) and (2.16), respectively, and then follow the same steps as in the proof of Theorem 2.3.3.

Note that if there exists t* ∈ Z+ such that Π_{t=0}^{t*} A(t) = 0 (e.g., α(t*) = 1), then the network converges to T in at most t* + 1 time steps for any initial opinion x(0). Thus, condition (2.10) used in Theorems 2.3.3 and 2.3.10 is only sufficient, not necessary.

The strong connectivity requirement can be further relaxed to the existence of a spanning tree, provided that a root node trusts T persistently. In the following, we assume that node 1 is always a root node.

Theorem 2.3.11. (One Leader, Spanning Tree Graph, Time-varying Weight) Suppose that the graph G(t) is a directed spanning tree whose root is at node 1 for all t ≥ 0. Let Assumption 2.3.9 hold. Then x(t) → T1 for any x(0) if

$$\sum_{t\ge 0}\alpha_1(t) = \infty. \qquad (2.19)$$

The intuition for this result is as follows. Condition (2.19) means that there is an infinite information flow from the leader into node 1. Since node 1 is the root of the tree, this flow reaches every node of the network, and thus consensus can be achieved. The idea of the proof follows that of Theorem 2.3.3 with some modifications.

Proof. For simplicity, assume that α_i(t) = 0 for all t ≥ 0 and all i = 2, ..., N. The proof follows the same line as that of Theorem 2.3.3. Recall (2.11)-(2.13) and note that Ã(t) is a substochastic matrix for any t; specifically, r_1(Ã(t)) ≤ 1 and r_i(Ã(t)) = 1 for i = 2, ..., N. Let d_0 denote the diameter of the tree. It can be proved that if any matrices A_1, ..., A_{d_0} satisfy the conditions on Ã(t), then P := Π_{i=1}^{d_0} A_i satisfies

P_{i1} ≥ b = ((1−η)γ)^{d_0}, i = 1, ..., N.  (2.20)

Let B(s) = Π_{t=sd_0}^{(s+1)d_0−1} Ã(t). It follows that, for all i ∈ V,

$$r_i\bigl(B(s{+}1)B(s)\bigr) = B_{i1}(s{+}1)\,r_1(B(s)) + \sum_{2\le j\le N}B_{ij}(s{+}1)\,r_j(B(s)) \le B_{i1}(s{+}1)\bigl(1-\eta\alpha_1((s{+}1)d_0-1)\bigr) + 1 - B_{i1}(s{+}1) \le 1 - b\,\eta\,\alpha_1((s{+}1)d_0-1) \le e^{-b\eta\alpha_1((s+1)d_0-1)},$$

where the second-to-last inequality follows from (2.20).
The rest of the proof follows that of Theorem 2.3.3.

We can further relax the condition on the connectivity of the network by employing the notion of bounded connectivity times (see, e.g., [100]). Before stating this result, we need the following lemma.

Lemma 2.3.12. [5] Let m ≥ 2 be a positive integer and let A_1, A_2, ..., A_m ∈ R^{n×n} be nonnegative matrices with positive diagonal elements satisfying 0 < µ ≤ [A_i]_jj ≤ ρ for all i, j. Then

$$A_1A_2\cdots A_m \ \ge\ \Bigl(\frac{\mu^2}{2\rho}\Bigr)^{m-1}(A_1 + A_2 + \cdots + A_m).$$

As a consequence of this lemma, if the union of all graphs associated with the A_i is a spanning tree, then the graph associated with the product A_1 A_2 ··· A_m is also a spanning tree. For any integers t ≥ 0 and N_0 > 0, define G_{[N_0]}(t) = (V, ∪_{k=t}^{t+N_0−1} E(k)) as the union of a sequence of graphs over the interval [t, t+N_0). We state the following result.

Theorem 2.3.13. (One Leader, Periodically Spanning Tree Graph) Consider system (2.2) and let Assumption 2.3.9-a) hold. Suppose there exists N_0 > 0 such that for all t, the graph G_{[N_0]}(t) admits a spanning tree whose root is at node 1 and whose edges satisfy

$$\sum_{t\le k\le t+N_0-1} w_{ij}(k) \ \ge\ \gamma N_0. \qquad (2.21)$$

If (2.19) holds, then x(t) → T1 for any x(0).

Proof. (Sketch) Let G_t^{N_0} denote a spanning tree in the union graph G_{[N_0]}(t) whose root is at node 1 and whose edges satisfy condition (2.21). This condition implies that during any interval of length N_0, and for any (i, j) ∈ G_t^{N_0}, there exists at least one time t*_ij such that w_ij(t*_ij) ≥ γ. Denote by d_0 the maximum diameter of G_{[N_0]}(t) over all t. Since the self-weight w_ii of each agent is bounded away from 0 by γ, every node in the network is reachable from node 1 in at most d_0 N_0 steps. Thus, the first column of the matrix Π_{t=kd_0N_0}^{(k+1)d_0N_0−1} Ã(t) is positive and bounded away from 0 by the positive number b = ((1−η)γ)^{d_0N_0}. Therefore, one can follow the same steps as in the proof of Theorem 2.3.11 to conclude the result.

A closely related work is Proposition 3.3 in [55]. In that paper, the authors studied the problem of ε-agreement in persistent graphs without a leader. Here, we consider the presence of a leader (or a source of news) in the network. It can be seen that under the assumptions of Theorem 2.3.13, the link between the leader and agent 1, and those between the agents in the graph G_t^{N_0}, are persistent. However, one cannot use the result in [55] to prove Theorem 2.3.13, because that result assumes that there exist a* > 0 and T* > 0 such that Σ_{s=t}^{t+T*−1} w_ij(s) ≥ a* for all t ≥ 0 and all persistent links (i, j). This condition is indeed equivalent to (2.21). However, we do not require this condition on the connections between the leader and the followers.

2.4 Opinion Dynamics with Two Leaders

In this section, we investigate the case where there are two leaders (or two sources of news) with different opinions T and Q. Assume that all the nodes in the network can be influenced by the two leaders with trust levels α_i(t), β_i(t) ∈ [0, 1], ∀i ∈ V. The update rule is now given by

x(t + 1) = α(t)T + β(t)Q + Γ(t)W(t)x(t),  (2.22)

where α(t) = [α_1(t), ..., α_N(t)]^T, β(t) = [β_1(t), ..., β_N(t)]^T and Γ(t) = I − diag(α(t) + β(t)).

In general, when both T and Q are persistent and the weight matrix W is time-varying, network agreement need not be achieved, and the agents' opinions may not converge. Interesting results on opinion disagreement and fluctuation can be found in, e.g., [12, 13] and [20]. In the case where α(t) ≡ α, β(t) ≡ β, and W(t) ≡ W, the opinions converge to a fixed vector x_∞ satisfying (I − ΓW)x_∞ = αT + βQ.
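For the constant-parameter case just described, the limiting opinions can be computed directly by solving this linear system. The following minimal sketch does so for an illustrative network; all numerical values are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T_op, Q_op = 5, 0.0, 1.0
W = rng.random((N, N)); W /= W.sum(axis=1, keepdims=True)
alpha = np.array([0.3, 0.0, 0.0, 0.0, 0.0])   # only agent 1 listens to T
beta  = np.array([0.0, 0.0, 0.2, 0.0, 0.1])   # agents 3 and 5 listen to Q
Gamma = np.eye(N) - np.diag(alpha + beta)

# Fixed point of (2.22): (I - Gamma W) x_inf = alpha*T + beta*Q.
x_inf = np.linalg.solve(np.eye(N) - Gamma @ W, alpha * T_op + beta * Q_op)
print(x_inf)   # steady-state opinions lie strictly between T and Q
```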
In the following, we consider the case when only T is persistent.

Assumption 2.4.1. The weights α and β satisfy Σ_{t=0}^∞ max_{i∈V} α_i(t) = ∞ and Σ_{t=0}^∞ max_{i∈V} β_i(t) < ∞.

Note also that if there exists t_β ∈ Z+ such that β_i(t) = 0 for all i ∈ V and all t ≥ t_β, then we can immediately invoke the results of the previous section, since after time t_β there is only one persistent leader. In what follows, we allow the presence of leader Q to last for an infinite time.

Theorem 2.4.2. (Two Leaders, Strongly Connected Graph, Time-varying Weight) Consider system (2.22) with two leaders. Suppose that G(t) is strongly connected for all t ≥ 0 and that Assumptions 2.3.9 and 2.4.1 hold. Then x(t) → T1 for any x(0).

Proof. Again, let ξ(t) = x(t) − T1. Then system (2.22) becomes

ξ(t + 1) = Γ(t)W(t)ξ(t) + (Q − T)β(t).  (2.23)

Let A(t) = Γ(t)W(t) and u(t) = (Q − T)β(t). We note the following:

(i) From Theorem 2.3.10, it can be verified that the unforced system ξ(t+1) = A(t)ξ(t) with ξ(0) = x(0) − T1 is asymptotically stable. In fact, letting Φ(t, l) denote the transition matrix Φ(t, l) := A(t−1)A(t−2) ··· A(l), it follows that lim_{t→∞} Φ(t, l) = 0 for all l ∈ Z+.

(ii) By Assumption 2.4.1, u is absolutely summable and hence bounded.

We now show that lim_{t→∞} ξ(t) = 0. The solution to (2.23) is given by

$$\boldsymbol{\xi}(t) = \Phi(t,0)\boldsymbol{\xi}(0) + \sum_{0\le k\le t-1}\Phi(t,k+1)\,\mathbf{u}(k). \qquad (2.24)$$

Let δ > 0 be given. If ‖ξ(0)‖_∞ = 0, then ‖Φ(t, 0)‖_∞ ‖ξ(0)‖_∞ = 0 for all t. Otherwise, from fact (i) we have lim_{t→∞} Φ(t, 0) = 0. Thus, there exists N_1 ∈ Z+ such that

‖Φ(t, 0)‖_∞ ‖ξ(0)‖_∞ ≤ δ/3, ∀t ≥ N_1.  (2.25)

Next, from (ii) we have Σ_{t≥0} ‖u(t)‖_∞ ≤ |Q − T| Σ_{t≥0} ‖β(t)‖_∞ < ∞. Therefore, there exists N_2 ∈ Z+ such that N_2 ≥ N_1 and

Σ_{k≥t} ‖u(k)‖_∞ ≤ δ/3, ∀t ≥ N_2.  (2.26)

Let ū = sup_t ‖u(t)‖_∞. From (i), there exists N_3 ∈ Z+ sufficiently large such that

‖Φ(t, i)‖_∞ ≤ δ/(3ūN_2), ∀i ≤ N_2, ∀t ≥ N_2 + N_3.  (2.27)

Now, for any t ≥ N_2 + N_3, it follows from (2.24) that

‖ξ(t)‖_∞ ≤ ‖Φ(t, 0)‖_∞ ‖ξ(0)‖_∞ + Σ_{0≤k≤t−1} ‖Φ(t, k+1)‖_∞ ‖u(k)‖_∞.  (2.28)

Using (2.25), (2.26) and (2.27), and noting that ‖Φ(t, k+1)‖_∞ ≤ 1 for any k ≤ t−1, we have the following for all t ≥ N_2 + N_3:

$$\|\boldsymbol{\xi}(t)\|_\infty \le \frac{\delta}{3} + \sum_{k=0}^{N_2-1}\|\Phi(t,k{+}1)\|_\infty\|\mathbf{u}(k)\|_\infty + \sum_{k=N_2}^{t-1}\|\Phi(t,k{+}1)\|_\infty\|\mathbf{u}(k)\|_\infty \le \frac{\delta}{3} + \sum_{k=0}^{N_2-1}\frac{\delta}{3\bar{u}N_2}\,\bar{u} + \sum_{k=N_2}^{\infty}\|\mathbf{u}(k)\|_\infty \le \delta. \qquad (2.29)$$

Since δ > 0 is chosen arbitrarily, (2.29) proves that ξ(t) → 0 as t → ∞.

It should be noted that the result of Theorem 2.4.2 remains valid when there are two or more nonpersistent leaders in the network. This result implies that only persistent leaders matter. It is possible to relax the assumption on the network connectivity, but we do not pursue this direction further.

2.5 Conclusion and Extensions

This chapter revisited the agreement seeking problem in networks with leaders, which has received a fair amount of recent attention. We developed various new sufficient conditions for guaranteeing consensus to the persistent leader's opinion. We pointed out the important role of persistent connectivity between the leader and the other agents in the network. In the following, we discuss possible extensions of our results.

First, note that model (2.2) does not explicitly include any delay. However, the framework presented in this chapter can be extended to take information delays into account as follows. Consider a generalized version of (2.2), given by

$$x_i(t+1) = \alpha_i(t)T + (1-\alpha_i(t))\sum_{j\in\mathcal{V}} w_{ij}(t)\,x_j(t-\tau_{ij}(t)), \qquad \forall i \in \mathcal{V}, \qquad (2.30)$$

where the delay functions τ_ij are assumed to be uniformly bounded, i.e., there exists τ ≥ 1 such that τ_ij(t) ∈ [0, τ−1] for all (i, j) ∈ E.
The idea is to consider an extended network G_τ composed of the original graph G and τ − 1 copies of it, each copy being a one-step delayed version of the previous one. The state of G_τ is [x(t)^T, x_{−1}(t)^T, ..., x_{−(τ−1)}(t)^T]^T, where x_{−k}(t) = x(t − k). Note that if G(t) is strongly connected for any t, then every node i ∈ V is a root of a spanning tree in the union graph (G_τ)_{[τ]}(t). Thus, Theorem 2.3.11 can be applied to this union graph to establish consensus reachability of the original network G. In this connection, we can conclude that consensus to the leader is also robust to bounded delays.

Second, we can also consider the case where the leader's state is time-varying:

$$T(t+1) = T(t) + u(t), \qquad x_i(t+1) = \alpha_i(t)T(t) + (1-\alpha_i(t))\sum_{j\in\mathcal{V}} w_{ij}(t)x_j(t), \quad \forall i \in \mathcal{V}.$$

Define the tracking error ξ(t) := x(t) − T(t)1. Then

ξ(t + 1) = Γ(t)W(t)ξ(t) − 1u(t),  (2.31)

which is a linear time-varying system with input u. Note that the unforced system ξ(t+1) = Γ(t)W(t)ξ(t) is asymptotically stable under suitable conditions as in Section 2.3. Therefore, we can invoke stability results for linear time-varying systems to derive consensus conditions. One such result is the following, whose proof follows the same lines as that of Theorem 2.4.2 in Section 2.4 and is thus omitted.

Proposition 2.5.1. Consider system (2.31) and let Assumption 2.3.2 (or 2.3.9) hold. If Σ_{t=0}^∞ |u(t)| < ∞, then consensus is achieved, i.e., lim_{t→∞} |x(t) − T(t)1| = 0 for any x(0).

Remark 2.5.2. Note that an equivalent characterization of consensus is that the size of the convex hull of the states of all the agents (including the leader) must shrink to 0 in the limit. In [8], the author imposes the condition of strict convex hull shrinking. The above result shows that the leader need not move into the convex hull of the states of the regular agents at any time step in order for consensus to be achieved. Also, the convex hull need not shrink monotonically. This result could also give a hint toward reducing the gap between necessary and sufficient conditions for consensus.

Chapter 3: Optimizing Leader Influence in Networks through Selection of Direct Followers

Abstract: This chapter considers the problem of a leader that aims to influence the opinions of agents in a directed network through connecting with a limited number of the agents. The aim is to select this set of agents, referred to as direct followers, so as to achieve the greatest possible influence on the opinions of agents throughout the network. Direct followers are simply agents that the leader decides to connect to, and the influence then spreads through the network's natural inter-agent dynamics. The problem of optimally influencing a network in the presence of another leader with a competing opinion is also considered. The problems with a single leader and in the presence of a competing leader are unified into a general combinatorial optimization problem, for which two heuristic approaches are developed. The first approach is based on a convex relaxation scheme, possibly in combination with the ℓ1-norm regularization technique, and the second is based on a greedy selection strategy. The main technical novelties of this work are the establishment of supermodularity of the objective function and convexity of its continuous relaxation.
As a result, the greedy approach is guaranteed to yield a lower bound on the approximation ratio that is sharper than 1 − 1/e, while the convex approach can benefit from efficient (customized) numerical solvers to obtain solutions of practically comparable quality, possibly with faster computation times, especially for large networks. The two approaches can be combined to provide effective tools and better analysis for the optimal design of influence spreading in diffusive networks. Numerical examples are given to illustrate the usefulness of the approaches. In these examples, the approximation ratio can be made to reach 90% or higher, depending on the number of direct followers.

3.1 Introduction

The notion of a leader is introduced in many settings to represent a special agent who has the ability to affect the states (or opinions) of other regular agents, conventionally called followers, while its own state is uninfluenced by others (in this sense, a leader is also termed a stubborn agent elsewhere). A great number of recent efforts have been devoted to the consensus problem in networks with leaders (see, e.g., [5, 13, 18, 20, 101, 102] and references therein). In most cases, a leader is assumed to have a limited number of connections with other agents, due to restrictions on, e.g., the budget, communication power, or channels of the leader. This gives rise to the problem of the leader choosing whom to influence directly so that the overall network performs as well as possible (in some sense) under the restriction on the leader's connectivity.

This chapter deals with problems related to a leader selecting a limited number of agents with which to communicate in a directed network. The aim of the leader is to achieve maximum influence on the opinions of agents throughout the network. The network agents that the leader selects to communicate with are referred to as direct followers. Network agents all update their opinions dynamically based on their current opinions and on the opinions of their immediate neighbors. Thus, through its connections with the direct followers and the inherent network dynamics, the leader influences the opinions of agents throughout the network. The leader wishes to select the limited group of direct followers so as to maximize its influence on the network, in the sense that the opinions of agents throughout the network approach the opinion of the leader either as rapidly as possible over time or as closely as possible in the limit. In particular, we consider the following two problems:

• Problem (P1): Optimize the influence of a leader on the agents in a directed network whose opinion dynamics follow the well known DeGroot model. Here, the leader's goal is to select a limited number of direct followers to connect to, in order to influence all the agents to converge to its constant opinion as quickly as possible.

• Problem (P2): Optimize the influence of one leader in the presence of another leader (with a competing opinion) over a directed network of followers, under a connectivity constraint similar to that in (P1). Here, the influence of a leader is measured in terms of the distance between the leader's opinion and a weighted average of the steady state opinions of all the network agents.

We unify the two problems above into a more general combinatorial optimization problem, called (P), and develop two heuristics for approximately solving problem (P), namely:

• Convex relaxation, which can be treated effectively by available numerical algorithms and solvers.
Here, the convexity result is novel.

• Greedy algorithms, which can be carried out in polynomial time. Here, the supermodularity result is new and can be used to provide provable accuracy guarantees for the greedy solutions.

This chapter is related to a large body of literature on problems of leader selection, stubborn agent placement, and sensor selection (see, e.g., [20, 61-67, 103, 104] and references therein) but departs from this literature in many respects. First, we only ask that the underlying network be directed. Second, we allow selected direct follower nodes to follow inter-agent dynamics like any other agents, rather than forcing them to adopt the leader's opinion instantaneously. Third, we allow the agents in the network to have different initial opinions (which are taken into account explicitly in the context of problem (P1)), and the agents can be weighted differently by the leader. Finally, and most importantly, although continuous relaxation and greedy heuristics have been employed in dealing with influence maximization problems before, our theoretical results on convexity and supermodularity are considerably stronger than existing results, without assuming any symmetry or resorting to random walk theory. This not only provides a deeper understanding of diffusive processes but also can be used for a broad range of applications. More detailed comparisons are given in Section 3.2 after our problem formulations.

Other by-products of our analysis include: (i) a dynamic centrality measure in the context of problem (P1) (i.e., one in which the measure of effectiveness of the set of chosen agents can vary with time); (ii) a straightforward application to Friedkin's model [3] (where each agent is allowed to have stubbornness in retaining its initial opinion) in the context of problem (P2); (iii) an affirmative answer to a conjecture recently proposed in [105] on optimization of on-chip thermoelectric cooling systems; and (iv) a convexity result for the state trajectory of a class of bilinear discrete-time systems.

The remainder of the chapter proceeds as follows. In Section 3.2, we introduce our network models and the associated optimization problems of interest; related works are also reviewed. Our main results are given in Sections 3.3, 3.4 and 3.5. In Section 3.3, we provide exact solutions to problems (P1) and (P2) for the case of selecting one or two agents. The general case of selecting multiple agents is treated in Sections 3.4 and 3.5. Specifically, in Section 3.4, we establish the convexity of the relaxed and approximate problems and discuss associated numerical issues in applying convex solvers to these problems. In Section 3.5, we prove the supermodularity property of the original objective functions and present two greedy algorithms that admit provable approximation ratios. Next, simulation results are reported in Section 3.6 for two example networks, one of small size and the other much larger. Finally, further (convexity) results and applications to another opinion dynamics model are discussed in Section 3.7.

3.2 Problem Formulation and Related Works

This section proceeds as follows. In Subsection 3.2.1, we augment the DeGroot model with a single leader and formulate an associated problem of optimizing the leader's influence on the opinion dynamics of the network agents. Finally, in Subsection 3.2.2, we consider a model similar to that of Subsection 3.2.1, except that we further include another leader with a differing (constant) opinion.
Here, the influence of a leader is defined differently, but the optimization problem shares the same structure as in the previous setting.

Consider a leaderless network with N agents denoted by V = {1, 2, ..., N}. The underlying communication is characterized by a directed graph G = (V, E). The dynamics of each agent are described by the DeGroot model (1.4), which is repeated here for convenience. Let x_i(t) ∈ [0, 1] denote the state or opinion of node i at time t ∈ N_0; at the start, each agent has an initial state x_{0i} ∈ [0, 1]. At any time t > 0, each agent observes the opinions of its neighbors and updates its opinion as

$$x_i(t+1) = \sum_{j\in\mathcal{N}_i} w_{ij}x_j(t), \qquad x_i(0) = x_{0i}, \quad \forall i \in \mathcal{V}, \qquad (3.1)$$

where, recall, N_i denotes the set of node i's immediate neighbors (including itself) and W := [w_ij] ∈ R^{N×N} denotes the normalized weight matrix of the network. We make the following blanket assumption, which is simply the combination of Assumptions 1.5.2 and 1.5.3, presented here for convenience.

Assumption 3.2.1. (Network Connectivity and Weight Matrix) The graph G is fixed in time and strongly connected. The weight matrix W is fixed, row stochastic, and satisfies w_ij > 0 for (i, j) ∈ E, i ≠ j, and w_ij = 0 otherwise. Moreover, W has at least one positive diagonal element.

3.2.1 Formulation of Influence Optimization Problem for the Single Leader Case

Given a directed network G = (V, E) with dynamics as described above, we now consider the effect of an external leader, denoted by T ∉ V, seeking to connect to the network. The leader is assumed to have a constant opinion T ∈ [0, 1]. The relationship of the leader to the network G is as follows:

• For any agent i ∈ V, the weight α_i ∈ [0, ∞] that it would place on the leader's opinion T if the leader selects to connect to the agent is known. (In this chapter, we allow the weight to be ∞.) The connection would of course be directed from the leader to the network agent. (The reverse direction, from regular agents to the leader, would be pointless, as the leader's opinion is assumed fixed and cannot be influenced.) We refer to α := [α_1, ..., α_N]^T as the vector of potential trust of network agents in the leader, or simply the trust vector.

• The leader knows α but is only able to connect directly to up to K agents in the network G. The K agents that the leader elects to connect to are called direct followers and are cumulatively denoted in the sequel by the set K. Unless otherwise stated, the connection is established at time t = 0 and the set K remains fixed thereafter.

Note that α_i = 0 indicates lack of trust, or that agent i is not accessible to the leader, and α_i = ∞ (or α_i ≫ 1 > w_ij, ∀j ∈ N_i) indicates the highest possible level of trust of agent i in the leader's opinion. Without loss of generality, we make the following assumption, which means that the leader only connects to followers having nonzero trust levels. (Clearly, it would be pointless for the leader to connect to an agent that would place zero trust in its opinion.)

Assumption 3.2.2. (Positive Trust Selection) The set K satisfies K ≠ ∅ and K ⊆ V_α := {i ∈ V : α_i > 0}.

For each K, let the corresponding selection vector s_K be defined by [s_K]_i := χ_K(i), ∀i ∈ V. (Recall that χ_A is the indicator function of a set A.) Then the update rule (3.1) for agent i in the presence of the leader becomes

$$x_i(t+1) = \frac{[\mathbf{s}_\mathcal{K}]_i\alpha_i T + \sum_{j\in\mathcal{N}_i} w_{ij}x_j(t)}{[\mathbf{s}_\mathcal{K}]_i\alpha_i + 1}. \qquad (3.2)$$

Here, it is understood that x_i(t+1) = T if [s_K]_i = 1 and α_i = ∞.
In vector form, (3.2) can be expressed as

x(t + 1) = (I + diag(α_K))^{−1}(α_K T + W x(t)),  (3.3)

where α_K := s_K ∘ α. Here ∘ denotes the element-wise product (also known as the Hadamard product). The following result is well known (see, e.g., [5, 20, 55]).

Theorem 3.2.3. (Consensus to Leader's Opinion) Let Assumptions 3.2.1 (Network Connectivity and Weight Matrix) and 3.2.2 (Positive Trust Selection) hold. Then for any x(0) ∈ R^N, all network agents asymptotically achieve consensus at the leader's opinion, i.e., lim_{t→∞} x_i(t) = T, ∀i ∈ V. Moreover, the rate of convergence is exponential.

This theorem asserts that all network agents will adopt the leader's opinion asymptotically, regardless of their initial opinions. Note that asymptotic convergence can be ensured under conditions milder than Assumptions 3.2.1 and 3.2.2 (see, e.g., [5, 96] and Chapter 2). Although the initial opinion x(0) and the selection of the set K play no role in the final consensus value, which is the leader's state (as long as α_K ≠ 0), they clearly affect the manner in which the agents approach this agreement, i.e., the transient behavior of system (3.2). Thus, we turn our attention to the problem of choosing K direct followers so as to minimize the transient error and convergence time of the agents' opinions in the network. To capture this dynamic behavior, we consider the error vector ξ(t) := x(t) − T1, which obeys the dynamics

ξ(t + 1) = (I + diag(α_K))^{−1} W ξ(t).  (3.4)

Thus, for all t ≥ 0, ξ(t) is given by

ξ(t) = ((I + diag(α_K))^{−1} W)^t ξ_0.  (3.5)

Consensus regardless of initial condition is clearly equivalent to global asymptotic stability of the origin for (3.4), and since the system is linear and time-invariant, consensus is also equivalent to global exponential stability. Let L be the weighted Laplacian matrix given by

L := I − W.  (3.6)

We have the following facts on the spectrum of the state dynamics matrix in (3.4) and the spectrum of the weighted Laplacian matrix.

Lemma 3.2.4. (Spectrum) If Assumptions 3.2.1 and 3.2.2 hold, then

(i) ρ((I + diag(α_K))^{−1} W) < 1, and

(ii) ℜ(λ) > 0 for all λ ∈ σ(L + diag(α_K)).

Proof. It is well known that if A is an irreducible row substochastic matrix with at least one row sum less than one, then ρ(A) < 1 (see, e.g., [97, Thm 1.1, p. 24]). Using this result with A = (I + diag(α_K))^{−1}W yields part (i). Part (ii) follows immediately from an application of the Gershgorin Circle Theorem [95, p. 344] and [95, Cor. 6.2.9, p. 356], using the strong connectivity of G and noticing that at least one diagonal entry of L + diag(α_K) is shifted to the right compared with the corresponding entry of L.

Remark 3.2.5. Assertion (i) of the lemma is in fact equivalent to the result in Theorem 3.2.3 above (the constant linear system is exponentially stable). Part (ii) will be needed in defining our objective costs in the sequel.

Next, define ‖ξ_i‖_{l1} := Σ_{t=1}^∞ |ξ_i(t)| (which is well defined because of the exponential convergence of ξ) and consider the cumulative convergence error defined as

$$J^{\mathrm{total}}_{\mathcal{K}} = \sum_{i\in\mathcal{V}} b_i\|\xi_i\|_{l_1},$$

where b = [b_1, ..., b_N]^T ∈ R_+^N is a weight vector chosen by the leader, which we require to satisfy 1^T b = 1. The elements of the vector b are measures of the relative preferences that the leader places on the opinions of the network agents. Note that we do not include ξ_i(0) in ‖ξ_i‖_{l1}, since ξ_i(0) does not depend on the leader's selection of direct followers. We say that the selection K_1 is better than K_2 if J^total_{K_1} < J^total_{K_2}.
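Although the analysis below works with a closed-form upper bound, J^total_K itself can be estimated by direct simulation of (3.3), as in the following minimal sketch; the network data, trust vector, and candidate sets here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
N, T_op = 6, 1.0
W = rng.random((N, N)); W /= W.sum(axis=1, keepdims=True)
alpha = rng.uniform(0.5, 2.0, N)          # potential trust vector
b = np.full(N, 1.0 / N)                   # uniform preference vector
x0 = rng.random(N)

def J_total(K, steps=2000):
    # Truncated estimate of sum_{t>=1} b^T |x(t) - T*1| under selection K.
    aK = np.zeros(N); aK[list(K)] = alpha[list(K)]
    M = np.diag(1.0 / (1.0 + aK))          # (I + diag(alpha_K))^{-1}
    x = x0.copy(); cost = 0.0
    for t in range(steps):
        x = M @ (aK * T_op + W @ x)        # update (3.3)
        cost += b @ np.abs(x - T_op)
    return cost

print(J_total({0, 1}), J_total({2, 5}))   # the smaller, the better selection
```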
Roughly speaking, the smaller J^total_K is, the smaller the convergence time, i.e., the faster consensus is achieved. However, since computing J^total_K is nontrivial, we will work with an upper bound J^{(1)}_K obtained as follows:

$$J^{\mathrm{total}}_{\mathcal{K}} = \sum_{t\ge 1}\mathbf{b}^T|\boldsymbol{\xi}(t)| \overset{(3.5)}{=} \mathbf{b}^T\sum_{t\ge 0}\Bigl|\bigl((I+\mathrm{diag}(\boldsymbol{\alpha}_\mathcal{K}))^{-1}W\bigr)^{t}\boldsymbol{\xi}(1)\Bigr| \le \mathbf{b}^T\sum_{t\ge 0}\bigl((I+\mathrm{diag}(\boldsymbol{\alpha}_\mathcal{K}))^{-1}W\bigr)^{t}|\boldsymbol{\xi}(1)|$$
$$= \mathbf{b}^T\bigl(I-(I+\mathrm{diag}(\boldsymbol{\alpha}_\mathcal{K}))^{-1}W\bigr)^{-1}|\boldsymbol{\xi}(1)| = \mathbf{b}^T\bigl(\mathrm{diag}(W\mathbf{1}+\boldsymbol{\alpha}_\mathcal{K})-W\bigr)^{-1}\mathrm{diag}(W\mathbf{1}+\boldsymbol{\alpha}_\mathcal{K})\,|\boldsymbol{\xi}(1)| \le \mathbf{b}^T(L+\mathrm{diag}(\boldsymbol{\alpha}_\mathcal{K}))^{-1}|W\boldsymbol{\xi}_0| =: J^{(1)}_\mathcal{K}. \qquad (3.7)$$

Here the summability of the series uses Lemma 3.2.4, and the last inequality holds since, first, the inverse (L + diag(α_K))^{−1} exists by Lemma 3.2.4, part (ii), and, second, |ξ(1)| ≤ (I + diag(α_K))^{−1}|Wξ_0|. It can be verified that equality holds if either ξ_0 ≥ 0 or ξ_0 ≤ 0, i.e., if the leader's opinion T lies outside the convex hull of the agents' initial opinions {x_i(0), i ∈ V}. Therefore, J^{(1)}_K is a tight upper bound on J^total_K. The more influential the direct followers are, the smaller J^{(1)}_K is, and thus the faster consensus can be reached. Formally, in this work we consider the following problem:

(P1)  min_{K⊆V} J^{(1)}_K = b^T (L + diag(α_K))^{−1} |Wξ_0|  s.t. |K| ≤ K.  (3.8)

Remark 3.2.6. The objective function J^{(1)}_K is defined in such a way that it allows the leader T to (i) weight each agent in the network differently through the weight or preference vector b, (ii) take into account partial incentives or trust encoded in the vector α, and (iii) incorporate the role of the initial opinions of all the agents in the network. As a consequence of (iii), the leader may view J^{(1)}_K as the cost-to-go at the initial time, when the set of direct followers is first chosen, and define the cost at any time t as

J^{(1)}_K(t) = b^T (L + diag(α_K))^{−1} |Wξ(t)|.

With this time-dependent objective cost, one can imagine a policy that achieves improved performance by re-solving a similar optimization problem at regular intervals for new sets of direct followers. (This would entail having limited-term contracts with the direct followers selected at any time.) This is akin to a model predictive control strategy with infinite-horizon cost-to-go J^{(1)}_K(t) and with the control action being the sequence of sets of direct followers.

Remark 3.2.7. (Dynamic centrality measure for the degree of influence of a set of direct followers) Note that the reciprocal of J^{(1)}_K, denoted by C_K := 1/J^{(1)}_K, can be viewed as a measure of the effectiveness of the set K in spreading the leader's opinion. This can also be viewed in terms of the relative influence of the choice of one set of agents K versus another, or as a centrality measure of a set K of direct followers. A set K is more influential than K′ if C_K > C_{K′}. What is new about our centrality measure is that C_K can be taken as a dynamic centrality measure through the definition C_K(t) := 1/J^{(1)}_K(t), rather than as a fixed quantity, as are many existing centrality measures in the literature.

3.2.2 Formulation of Influence Optimization Problem in the Presence of a Competing Leader

Now we consider a model similar to the one above, except that there are two leaders with different opinions T and Q trying to influence the opinions of agents in network G. Let K, L ⊆ V denote the sets of nodes that are directly connected to T and Q, respectively. Each node in the network has potential trust levels α_i, β_i ∈ [0, ∞], ∀i ∈ V (we exclude the case where α_i = β_i = ∞ for some i ∈ V), and updates its opinion as follows:

$$x_i(t+1) = \frac{[\mathbf{s}_\mathcal{K}]_i\alpha_i T + [\mathbf{s}_\mathcal{L}]_i\beta_i Q + \sum_{j\in\mathcal{N}_i} w_{ij}x_j(t)}{[\mathbf{s}_\mathcal{K}]_i\alpha_i + [\mathbf{s}_\mathcal{L}]_i\beta_i + 1}, \qquad (3.9)$$

where s_K and s_L denote the selection vectors of T and Q, respectively.
In matrix form, (3.9) reads

x(t + 1) = (I + diag(α_K + β_L))^{−1}(α_K T + β_L Q + W x(t)),

where α_K := s_K ∘ α and β_L := s_L ∘ β. In our context, α and β are associated with the agents in the network and are assumed to be fixed over time. For given choices of K and L, the network G need not (and usually does not) reach consensus, even under a strong connectivity assumption. In fact, the opinions converge to a fixed vector x(∞) which depends only on α, β, and W, but not on x(0) (see (3.10) below). Thus, in this section we will measure the influence of each leader by examining the limiting opinion x(∞).

As we are interested in designing a competition strategy for one leader (T) in the presence of another (Q), without loss of generality suppose β ≠ 0 and s_L = 1 (i.e., the set of direct followers of Q is known to T). If α_K = 0, it is clear that x_i(∞) = Q, ∀i ∈ V, under the strong connectivity assumption on G, i.e., the whole network will eventually be out of favor with leader T. Therefore, we only consider α_K ≠ 0. Further, we assume that

card(α_K) ≤ K < card(α),

where K represents the maximum number of connections that T is allowed to establish, accounting for limited communication and/or budget. We are interested in the following problem: given knowledge of α, β, W and of the largest allowed number of connections K, to which nodes should leader T directly connect in order to achieve the greatest possible influence (in a sense made precise below) on the eventual opinions of the network agents?

Note that the limiting opinion vector x(∞) satisfies

x(∞) = (I + diag(α_K + β))^{−1}(βQ + α_K T + W x(∞)).

Thus,

x(∞) = (L_β + diag(α_K))^{−1}(βQ + α_K T),  (3.10)

where L_β := L + diag(β) = I + diag(β) − W, which is nonsingular under the strong connectivity assumption and the condition that α_K ≠ 0 and β ≠ 0 (cf. Lemma 3.2.4-ii). We are interested in the steady state error vector ξ(∞) := x(∞) − T1. Since (L_β + diag(α_K))^{−1}(β + α_K) = 1, it can be verified that

ξ(∞) = (L_β + diag(α_K))^{−1}β(Q − T).

To quantify the long term effect of T in the presence of Q, we define the following function operating on the set K:

J^{(2)}_K := b^T |ξ(∞)|,

where b ≥ 0 is a weight or preference vector indicating the relative importance to the leader T of the final opinion of each agent in the network. Since (L_β + diag(α_K)) is a nonsingular M-matrix (cf. Lemma 3.2.4-ii), it follows that (L_β + diag(α_K))^{−1} is a nonnegative matrix (see, e.g., Lemma A.1.3 in Appendix A.1). Thus,

J^{(2)}_K = b^T (L_β + diag(α_K))^{−1} β |Q − T|.

Without loss of generality, let T = 0 and Q = 1 represent the two competing opinions. We are interested in the following problem: given α, β, b, W and an integer K > 0, select K such that |K| ≤ K and the effect of T is maximized, i.e.,

(P2)  min_{K⊆V} J^{(2)}_K = b^T (L_β + diag(α_K))^{−1} β  s.t. |K| ≤ K.  (3.11)

This is a link creation problem (namely, selection of K) in which partial incentives are allowed (i.e., α, β ∈ [0, ∞]^N) and each agent can be weighted differently (through b). In the limiting case when the α_i, β_i are all either 0 or ∞, this problem reduces to the optimal stubborn agent placement and leader selection problems previously studied in the literature, which we recall below. First, we give a general problem formulation that covers the cases without and with a competing leader.

Remark 3.2.8. (A unified problem formulation) Except for some minor differences, problems (P1) and (P2) described in (3.8) and (3.11) are almost the same. Our aim is thus to develop methods that can be applied to both.
To this end, we embed these two problems in the following general one:

(P)  min_{K⊆V} J_K = b^T (L_β + diag(α_K))^{−1} c  s.t. |K| ≤ K,  (3.12)

where b, c and β are nonnegative vectors. The optimal value will be denoted by J*.

3.2.3 Comparison to Previous Work

3.2.3.1 Single leader case

The following model is widely used in the literature (see, e.g., [15, 62, 67, 106, 107]):

$$x_i(t+1) = \begin{cases} \tilde{\alpha}_i T + (1-\tilde{\alpha}_i)\sum_{j\in\mathcal{V}} w_{ij}x_j(t), & i \in \mathcal{K}, \\ \sum_{j\in\mathcal{V}} w_{ij}x_j(t), & i \in \mathcal{V}\setminus\mathcal{K}, \end{cases} \qquad (3.13)$$

which is equivalent to the one described in (3.2) with

$$\tilde{\alpha}_i = \frac{\alpha_i}{\alpha_i + 1}. \qquad (3.14)$$

Based on this model, the works [62, 67, 106] consider the following associated problem:

$$\min_{\mathcal{K}\subseteq\mathcal{V},\,|\mathcal{K}|\le K}\ \tilde{f}(\mathcal{K}) := \mathbf{1}^T(I - D_\mathcal{K}W)^{-1}\mathbf{1}, \qquad (3.15)$$

where D_K = I − diag(α̃_K), α̃ = 1, and f̃(K) represents the cumulative error over time of all the agents. Note that f̃(K) in (3.15) clearly corresponds to a special case of J^{(1)}_K with b = 1 and ξ_0 = 1. Thus, one may wonder why we use model (3.2) instead of (3.13). The main reason is that the former model allows us to obtain a much stronger convexity result than the latter. This is also one of the main contributions of our work.

To deal with problem (3.15), [62] uses a continuous relaxation of f̃ together with the ℓ1-norm regularization technique and proves element-wise convexity of the resulting objective function. This allows the authors to employ the coordinate descent approach. However, it is important to point out that the relaxed problem formulated in [62] is not necessarily convex; see Remark 3.4.1 below for an example. In [67], the supermodularity property of f̃(K) in (3.15) is proved, and a greedy heuristic [108] is used to yield approximate solutions with provable accuracy.

In [65], the authors use a continuous-time version of the DeGroot model and consider the problem of selecting a set of nodes to become leaders (instantaneously) so as to minimize the convergence error, defined as the l_p-norm of the distance between the followers' states and the convex hull of the leader states. By replacing the convergence error with an upper bound that is independent of the initial states of the network (and is loose in general), [65] proves the supermodularity property of the resulting bound based on a connection with random walk theory, and then employs the greedy approach in [108]. Kempe et al. [60] also formulate the problem of finding the influential nodes in a network as a discrete optimization problem with a submodular cost function and apply the greedy algorithm to obtain a (1 − 1/e)-approximate solution. However, the diffusion model in [60], called Independent Cascade, is fundamentally different from the opinion model considered here.

3.2.3.2 Multiple leaders case

In [64], the authors consider a linear stochastic model whose mean behavior is equivalent to the following deterministic model:

$$x_i(t) = \begin{cases} 0, & i \in \mathcal{V}_0, \\ 1, & i \in \mathcal{V}_1, \\ \sum_{j\in\mathcal{V}} w_{ij}x_j(t-1), & \text{otherwise}, \end{cases} \qquad (3.16)$$

where V_0, V_1 ⊂ V are two disjoint sets of stubborn agents. This model is a limiting case of (3.9) with α_i, β_i ∈ {0, ∞} (i.e., an agent becomes stubborn if directly connected to a leader). The optimal stubborn agent placement problem studied in [64] is defined as follows: for a given set V_0 with known locations in the network, choose K nodes from V \ V_0 to form the set V_1 so that the network bias toward V_1 in the limit is maximized, i.e.,

$$\max_{\mathcal{V}_1\subset\mathcal{V}}\ \Bigl\{\sum_{i\in\mathcal{V}} x_i(\infty) : |\mathcal{V}_1| = K,\ \mathcal{V}_0\ \text{fixed}\Bigr\}.$$

This problem is in fact similar to a special case of (3.11) with b = 1 and α_i, β_i ∈ {0, ∞}.
The authors prove submodularity of the objective function based on a connection with a random walk and then use the greedy algorithm [108] to (approximately) solve the problem. The work [66] considers a model similar to that in [64] and defines a measure of node centrality for a given set V_0 as

H(l) = Σ_{i∈V} x_i(∞ | V_1 = {l}).

The authors introduce a distributed message passing algorithm that enables each node l ∈ V \ V_0 to compute its own H(l). One of our optimality criteria is also able to subsume this centrality measure as a special case. More importantly, it is considered in a more general setting, and practical (centralized) algorithms are developed for the benefit of network designers and market competitors.

In [63], the following model, proposed by Friedkin and Johnsen [3], is considered:

$$x_i(t) = (1-\sigma_i)\sum_{j\in\mathcal{V}} w_{ij}x_j(t-1) + \sigma_i x_i(0).$$

Here σ_i ∈ [0, 1] reflects the level of stubbornness of each agent i ∈ V regarding its initial opinion. The paper deals with the problem of selecting K nodes so that if they become fully stubborn and their opinions are set to 1, then the limiting opinions of all the agents are, on average, as positive as possible, i.e.,

$$\max_{\mathcal{V}_1\subset\mathcal{V}}\ \Bigl\{\sum_{i\in\mathcal{V}} x_i(\infty) : |\mathcal{V}_1| = K,\ x_i(t) = 1, \forall t \ge 0, \forall i \in \mathcal{V}_1\Bigr\}.$$

The authors exploit a connection between this model and absorbing random walks to establish the submodularity of the cost function, and then rely on the greedy algorithm in [108] to approximate the optimal solution within the factor (1 − 1/e).

3.2.3.3 Our Contributions

Our work greatly generalizes and differs from the aforementioned works in both problem formulation and solution.

Regarding the problem formulation, it should be noted that our direct followers can have dynamics like any other network node, unlike the forceful/stubborn agents in those papers. This can be viewed in terms of the trust levels of the direct followers with respect to the leader's state being arbitrary in our work. Moreover, within the context of problem (P1), the agents' initial opinions need not be the same and are taken into account explicitly in the cost J^{(1)}_K, which is a tight upper bound on the cumulative error of all the agents over time. This also allows us to consider a time-varying objective cost J^{(1)}_K(t) and to update the set of direct followers K repeatedly so as to further improve the network performance. Furthermore, the agents can be weighted differently by the leader in contributing to the cost J_K. We believe that these settings are more natural and thus likely to be of more value for practical applications. Finally, the models considered here, i.e., (3.2) and (3.9), allow us to establish the convexity of a relaxed problem of (P), while neither (3.16) nor (3.13) does so; see also Remark 3.4.1 below.

Regarding the problem solution, although we adopt two well known heuristic approaches, namely the convex relaxation/approximation technique and the greedy selection strategy, the theoretical results presented here are much more general and stronger. In particular, our technical contributions include the establishment of the supermodularity property of the objective function in problem (P) and the convexity of its continuous relaxation; both results are based on M-matrix theory, which is completely different from the tools used in [63-65, 67].
First, we prove the convexity of our relaxed problem (in the usual sense instead of just element-wise) without assuming any kind of symmetry, which is of great benefit since it allows us to use much more effective numerical algorithms (e.g., gradient descent and Interior Point Methods) than the coordinate descent approach employed in [62]. Second, we derive a general matrix supermodularity inequality that can be used to prove supermodularity of J_K as well as of another type of cost function encountered in the literature (see Remark 3.5.6 below). Combining the supermodularity result with the notion of curvature of a submodular function [109], we prove that the well-known greedy algorithm [108] applied to our problem admits a theoretical approximation guarantee sharper than (1 − 1/e). In addition, we develop an improved version of this algorithm that is able to achieve better accuracy. Finally, in both approaches we derive upper and lower bounds on the optimal value which, when combined, provide a better assessment of the obtained approximate solutions. As will be demonstrated in our numerical examples, the approximation ratio can be certified to range from 70% to 100% depending on the value of K.

3.3 Special Cases K = 1, 2: Optimal Solutions

For any matrix A, let A_(i) and A^(j) denote the i-th column and the j-th row of A, respectively. Moreover, we will use A_ij, [A]_ij and a_ij interchangeably to refer to the (i,j)-th element of A.

3.3.1 Single Agent Selection

Because W is an irreducible row-stochastic matrix, 0 is a simple eigenvalue of L = I − W, associated with the right eigenvector 1. Let π ∈ R^N denote the left normalized eigenvector corresponding to this eigenvalue, such that π^T 1 = 1. It is known from Perron's theorem (see [95, Thm. 8.4.4]) that π is strictly positive under the strong connectivity assumption on the underlying communication graph.

Now let K be a singleton, i.e., K = 1. Then there are at most N possible choices for leader T. For problem (3.8), we have the following result.

Theorem 3.3.1 (Single agent selection for problem (P1) in (3.8)). Suppose b satisfies the normalization condition 1^T b = 1. For any k ∈ V, we have

J^(1)_{k} = p_k^T |ξ_0|,   (3.17)

with

p_k^T := (b^T L† − L†^(k)) + (α_k^{-1} + L†_kk − b^T L†_(k)) π^T/π_k.   (3.18)

Moreover, if b = 1/N, then

p_k^T = (α_k^{-1} + L†_kk) π^T/π_k − L†^(k).   (3.19)

Proof. See Appendix A.2.1.

Our next result characterizes the cost function corresponding to a single agent selection for problem (3.11).

Theorem 3.3.2 (Single agent selection for problem (P2) in (3.11)). Let P = L_β^{-1}. For any k ∈ V, we have

J^(2)_{k} = 1 − b^T P_(k)/(α_k^{-1} + P_kk).   (3.20)

Proof. See Appendix A.2.2.

As a result, when K = 1, optimal solutions to both problems (3.8) and (3.11) are given by k* = arg min_{i∈V} J_{i}. It should be noted that one only needs to evaluate L† and π, or L_β^{-1}, once (which requires O(N^3) operations), and then use (3.18) or (3.20) to compute the influence corresponding to each candidate follower. When N is large, this is far less computationally expensive than inverting the matrix (L + diag(α_K)) anew for each choice of K (each inversion costing O(N^3) operations). A small numerical sketch of this computation follows.
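The following minimal Python sketch illustrates this point for problem (P2) on a hypothetical 5-node network (all parameter values are illustrative assumptions, not data from this chapter): it evaluates J^(2)_{k} for every k from a single inversion P = L_β^{-1} via (3.20), and cross-checks the results against the definition (3.12).

```python
import numpy as np

# Hypothetical 5-node strongly connected network (illustration only).
rng = np.random.default_rng(0)
N = 5
A = (rng.random((N, N)) < 0.6).astype(float)
np.fill_diagonal(A, 1.0)                          # self-loops
A[np.arange(N), (np.arange(N) + 1) % N] = 1.0     # directed cycle => strong connectivity
W = A / A.sum(axis=1, keepdims=True)              # row-stochastic weights
L = np.eye(N) - W                                 # Laplacian L = I - W
alpha = np.full(N, 2.0)                           # assumed trust levels toward leader T
beta = np.full(N, 0.5)                            # assumed trust levels toward leader Q
b = np.full(N, 1.0 / N)                           # normalized weights, 1^T b = 1

L_beta = L + np.diag(beta)
P = np.linalg.inv(L_beta)                         # one O(N^3) inversion, reused for all k

# Formula (3.20): J^(2)_{k} = 1 - b^T P_(k) / (alpha_k^{-1} + P_kk)
J2 = np.array([1 - b @ P[:, k] / (1 / alpha[k] + P[k, k]) for k in range(N)])

for k in range(N):                                # sanity check against (3.12) with c = beta
    Y = L_beta + np.diag(alpha * (np.arange(N) == k))
    assert np.isclose(J2[k], b @ np.linalg.solve(Y, beta))

print("best single agent:", int(np.argmin(J2)))
```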
The cost J_{k} is decreasing in the trust level α_k (which enters through the term α_k^{-1} in (3.18) and (3.20)). This has a practical meaning: in a social network, an agent who is strongly influenced by his neighbors but is quite skeptical about new information (from the leader) may be less important in spreading the leader's opinion than one of his friends who is easier to persuade. Furthermore, J^(1)_{k} also depends linearly on |ξ_0|, the initial error of the whole network. As noted earlier in Remark 3.2.7, we can view J^(1)_{k} as the cost-to-go at the initial time, i.e., J^(1)_{k}(0). In this connection, it is easy to see that the cost at any time t is given by

J^(1)_{k}(t) = p_k^T |ξ(t)|.

This suggests that the centrality of each agent should be dynamic. That is, an agent may be the most important at some time but not at other times, depending not only on its position in the network structure but also on how it behaves over time. The significance of this is that if the leader is able to repeatedly compute J^(1)_{k}(t), then it can further improve the performance of the network by repeatedly selecting the informed agent.

Remark 3.3.3 (Connection of J^(1)_{k} with other centrality measures). Consider again the case where the graph G is undirected and L is symmetric, b = 1/N, and ξ_0 = 1/N. Note that π = 1/N and that L†π = 0 (see Lemma A.1.5). Hence,

J^(1)_{k} = α_k^{-1} + L†_kk.   (3.21)

Thus C^(1)_{k} is proportional to 1/L†_kk. It is interesting to note that in [110] the authors define the topological centrality of a node to be TC_k = 1/L†_kk, where L† is the pseudo-inverse of a Laplacian matrix L; see [110] for further details. Additionally, the notions of information centrality [111] and node certainty [112] can also be shown to be proportional to 1/L†_kk. Notice that [110, 111] define these notions only for undirected graphs where L is symmetric. Thus, when the graph G is undirected and L is symmetric, these centrality indices and our C_{k} are equivalent in ranking the importance of nodes in the network. Moreover, for undirected networks, the pseudo-inverse of the Laplacian also has a nice connection with the notion of resistance distance, namely

L†_kk = 1/IC_k − K_f/N^2,

where K_f = tr(L†) denotes the Kirchhoff index of the network and IC_k the information centrality [111] of node k, given by 1/IC_k = (1/N) ∑_j r_kj, with r_kj the topological distance between k and j. Therefore (taking ξ_0 = 1, which scales (3.21) by N),

J^(1)_{k} = N/α_k + N/IC_k − K_f/N = N/α_k + ∑_j r_kj − K_f/N.

As a consequence, if α_k = α_j for all k, j ∈ V, then the centrality C^(1)_{k} agrees with the information centrality. In particular, nodes with smaller total distance to all the others have higher centrality measures and are thus more important. It is, however, important to note that our measure C^(1)_{k} also depends proportionally on α_k, which makes more practical sense, since α_k represents the proclivity of agent k toward the leader's opinion. Moreover, C^(1)_{k} is not restricted to undirected graphs and symmetric L.

3.3.2 Two-Agent Selection

In this subsection, we derive an explicit expression for the joint centrality of any pair of agents. Let K = {i, j} ⊂ V, i ≠ j. For problem (3.8), we have the following result.

Theorem 3.3.4 (Two-agent selection for problem (P1) in (3.8)). Let b = 1/N. We have

J^(1)_{ij} = p_ij^T |ξ_0|,   (3.22)

where

p_ij^T = ((γ_ii γ_jj − γ_ij γ_ji)/∑γ_ij) π^T − ((γ_jj + γ_ji)/∑γ_ij) L†^(i) − ((γ_ii + γ_ij)/∑γ_ij) L†^(j),   (3.23)

and ∑γ_ij := γ_jj + γ_ij + γ_ii + γ_ji, with

γ_ii = (1/π_i)(L†_ii + 1/α_i),   γ_ji = −L†_ji/π_i,
γ_jj = (1/π_j)(L†_jj + 1/α_j),   γ_ij = −L†_ij/π_j.

Proof. See Appendix A.2.3.
Note that p_ij can also be expressed as

p_ij^T = ((γ_jj + γ_ji)/∑γ_ij) p_i^T + ((γ_ii + γ_ij)/∑γ_ij) p_j^T − ((γ_ii + γ_ij)(γ_jj + γ_ji)/∑γ_ij) π^T,

where p_i is given by (3.19) (so that p_i^T |ξ_0| = J^(1)_{i}). As a consequence of Theorem 3.3.4, we can determine the optimal pair at any time t as

K*(t) = arg min_{i,j∈V, i≠j} p_ij^T |ξ(t)|.   (3.24)

Remark 3.3.5 (A special case). Consider the case where L is symmetric, and let ξ_0 = 1 and α = ∞·1. Note that π ∈ span(1) and L†π = 0. Thus the cost J^(1)_{ij} reduces to

J^(1)_{ij} = (L†_ii L†_jj − L†_ij L†_ji)/(L†_ii + L†_jj − L†_ij − L†_ji).   (3.25)

Notice that the term in the denominator, L†_ii + L†_jj − L†_ij − L†_ji =: r_ij, is usually referred to as the resistance distance of the network measured at nodes i and j, which is identical to the topological distance between them. The term in the numerator can be expressed as

L†_ii L†_jj − L†_ij L†_ji = L†_ii L†_jj (1 − (cos†(i,j))^2),   where cos†(i,j) = L†_ij/√(L†_ii L†_jj).

Here, following [113], we use cos†(i,j) to measure how structurally similar the roles of i and j are. The cost now reads

J^(1)_{ij} = L†_ii L†_jj (1 − (cos†(i,j))^2)/r_ij.

By (3.21) and α = ∞·1, we have

C^(1)_{ij} = C^(1)_{i} C^(1)_{j} r_ij/(1 − (cos†(i,j))^2).

Obviously, the cost C^(1)_{ij} depends on the individual centralities C^(1)_{i}, C^(1)_{j}, the resistance distance r_ij and cos†(i,j) in a nonlinear fashion. However, we can loosely infer that, to minimize J_{ij}, the optimal selection should satisfy the following:

• Self-centrality: C^(1)_{i} and C^(1)_{j} should be large.
• Relative distance: r_ij should be large, i.e., i and j should be far apart.
• Topological similarity: cos†(i,j) should be large, i.e., i and j should have similar roles in the network.

Consider, for example, an unweighted undirected cycle graph where α_i is identical for every node. An optimal choice (i*, j*) is any two nodes at the farthest distance from each other. [Footnote 3: This can be seen as follows. Let N denote the number of nodes in the cycle. For any two nodes in the cycle, there are exactly two disjoint paths connecting them. Let x, y denote the lengths of the two paths, which satisfy x + y = N. Since all the nodes are identical to each other, the joint centrality depends only on the relative distance r_ij = (x^{-1} + y^{-1})^{-1} = xy/N ≤ (x+y)^2/(4N) = N/4. Therefore, r_ij is maximized when x = y = N/2 for even N, or (x, y) = ((N−1)/2, (N+1)/2) for odd N. This proves the claim.] As another example, consider a network consisting of two communities. A reasonable candidate for the optimal solution would be (i*, j*) where each node is the most influential in its own community.

Remark 3.3.6. In connection with other centrality measures (e.g., information centrality, topological centrality), the cost J^(1)_{ij} can be written as

J^(1)_{ij} = L†_ii L†_jj (1 − (cos†(i,j))^2)/r_ij
          = (1/IC_i − K_f/N^2)(1/IC_j − K_f/N^2)(1 − (cos†(i,j))^2)/r_ij
          = (R_i/N − K_f/N^2)(R_j/N − K_f/N^2)(1 − (cos†(i,j))^2)/r_ij,

where R_i = ∑_k r_ik is the sum of the resistance distances from node i to all the other nodes, which is reciprocal to the information centrality of node i.

The derivations of C^(1)_{ij} in the previous remarks are valid only under special assumptions on the trust vector α, the initial condition ξ(0), and the structure of the network and the Laplacian matrix L. More importantly, they indicate the importance of nodes only at the initial time 0. As the agents' opinions evolve with time, so do their influence measures with respect to the overall performance of the network. This can be captured by our time-varying objective function J^(1)_K(t) or C^(1)_K(t), which is defined for any initial condition and any trust and bias vectors. This is one of the main differences between our work and others.
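For illustration, the following minimal sketch brute-forces the optimal pair (3.24) at t = 0 for problem (P1) directly from the definition, reusing the hypothetical network variables (L, alpha, b) of the earlier sketch. Theorem 3.3.4 gives the same costs in closed form from a single pseudo-inverse; the direct sweep below is only for small networks.

```python
import numpy as np
from itertools import combinations

# J^(1)_K = b^T (L + diag(alpha_K))^{-1} |xi_0| per the definition (3.12) with beta = 0.
# For a strongly connected network, L + diag(alpha_K) is nonsingular once K is nonempty.
def J1(K, L, alpha, b, xi0):
    d = np.zeros(len(b))
    d[list(K)] = alpha[list(K)]
    return b @ np.linalg.solve(L + np.diag(d), np.abs(xi0))

N = len(b)                            # variables from the previous sketch assumed in scope
xi0 = np.ones(N)                      # initial error vector (T = 0, x(0) = 1, assumed)
best = min(combinations(range(N), 2), key=lambda K: J1(K, L, alpha, b, xi0))
print("optimal pair:", best, "cost:", J1(best, L, alpha, b, xi0))
```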
In a similar fashion, we can compute the cost associated with any pair of agents for problem (P2), as follows.

Theorem 3.3.7 (Two-agent selection for problem (P2) in (3.11)). Let P = L_β^{-1} and

ν_ii = α_i^{-1} + P_ii,   ν_ji = −P_ji,
ν_jj = α_j^{-1} + P_jj,   ν_ij = −P_ij.

We have

J^(2)_{ij} = 1 − (b^T P_(i) (ν_jj + ν_ij) + b^T P_(j) (ν_ii + ν_ji))/(ν_ii ν_jj − ν_ij ν_ji).   (3.26)

Proof. The proof is based on the rank-2 update of the matrix inverse via the Woodbury identity (A.1).

Of course, this objective function depends on the network structure and weight matrix (encoded in L) as well as on the trust vectors α, β. Although connections with other notions of centrality may not be easily inferred, there is a close relation between this cost function and the average voltage in a network of resistors. In particular, assume the graph G is undirected and consider the network of |E| resistors corresponding to the graph, with w_ij representing the conductance between nodes i and j. Let Q and T denote two voltage sources, and let α_i (resp. β_i) denote the conductance between node i and T (resp. Q) when there is a link, that is, when node i is selected by T (or Q). Then it can be seen that J^(2)_{ij} is the weighted average (with weights b) of the node voltages in the resistor network.

We close this section with the following remark. As we have seen, the joint centrality measure of a set becomes increasingly complicated to express as K grows. Moreover, for large networks, finding the optimal solution by sweeping through all possible combinations is a challenging or even impractical task. Therefore, we content ourselves with approximate solutions whenever they are attainable with certified quality. In this connection, we now develop two practical approaches to the general problem (P) in which lower and upper bounds on the optimal value can be obtained and used to assess approximate solutions.

3.4 General Case: Convexification Approach

In this section, we study the convexity of the continuous relaxation of J_K and discuss numerical methods that can be used to solve the relaxed or approximate problem. We emphasize that we impose no symmetry conditions on the Laplacian matrix L (or even on its structure).

3.4.1 Convexity of Relaxation

Consider problem (3.12), equivalently written as follows:

(P)   min_{s∈R^N} f(s) := b^T (L_β + diag(s ∘ α))^{-1} c
      s.t. s_i ∈ {0, 1} ∀i = 1, ..., N,   card(s) ≤ K,   (3.27)

where b, c ∈ R^N_+\{0}. Recall that L_β = L + diag(β). We will also use L_0 := L to signify the case of problem (3.8), i.e., β = 0. The optimal value of this problem is denoted by f*_P. First, this problem is clearly combinatorial in nature (hence nonconvex) and generally hard to solve, especially for large networks. We defer our discussion of the convexity of the objective function f for now and first discuss techniques to handle the cardinality constraint.

The first idea is to consider, instead of (P), a relaxed version (P_Rlxd) defined as follows:

(P_Rlxd)   min_{y∈R^N} f(y)   s.t. y ∈ [0, 1]^N and 1^T y ≤ K.   (3.28)

This is a continuous relaxation of (P). The optimal value of (P_Rlxd), denoted by f*_P_Rlxd, is clearly a lower bound for that of (P), i.e., f*_P_Rlxd ≤ f*_P. Of course, this lower bound is useful only if an optimal solution y_P_Rlxd is computable. In that case, if y_P_Rlxd is a binary vector, then it is also optimal for (P). However, a binary solution is not to be expected, as y_P_Rlxd tends to be dense. In general, we can use a simple projection onto the feasible set of problem (P) to obtain an approximation (e.g., rounding up to 1 the K largest elements of y_P_Rlxd and zeroing out the rest), resulting in an upper bound on f*_P, which we denote by f̄_P_Rlxd. A sketch of this rounding step follows.
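A minimal sketch of the rounding step (the relaxed solution shown is a hypothetical example):

```python
import numpy as np

def round_topk(y, K):
    """Project a relaxed solution onto the feasible set of (P):
    keep the K largest entries of y as 1 and zero out the rest."""
    s = np.zeros_like(y)
    s[np.argsort(y)[-K:]] = 1.0
    return s

# Hypothetical relaxed solution for N = 5, K = 2:
y_rlxd = np.array([0.12, 0.85, 0.30, 0.77, 0.05])
print(round_topk(y_rlxd, 2))   # -> [0. 1. 0. 1. 0.]
```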
Next, we consider another practical approximation using the well-known ℓ1-norm regularization technique. Here, we consider the problem

(P_Aprx)   min_{y∈R^N} g(y) := f(y) + μ 1^T y   s.t. y ∈ [0, 1]^N =: Ω,   (3.29)

where μ is a positive parameter whose role is to promote sparsity of the solution. (Note that if μ = 0, then clearly y = 1 is the global solution to this approximate problem (see also Theorem 3.4.3 below); increasing μ is a way to penalize the number of nonzero elements in the solution.) Let s*_P_Aprx be the binary vector corresponding to the K largest elements of a solution to problem (P_Aprx). Then f_P_Aprx := f(s*_P_Aprx) is clearly an upper bound on the optimal value of the original problem (P). As a result, the gap (f_P_Aprx − f*_P_Rlxd) can also be used to evaluate the quality of our approximations.

We now discuss the convexity of the function f, which is clearly pertinent to both problem (P_Rlxd) and problem (P_Aprx). Note that we do not assume any symmetry conditions on the Laplacian matrix L (or even on its structure), or on the nonnegative vectors b and c (trivial cases such as b = 0 or c = 0 are excluded). For somewhat similar cost functions that are convex under symmetry of L, see, e.g., [61, 104, 105].

As noted earlier, the function f̃ in (3.15) and the cost J^(1)_K defined for problem (P1) with b = 1 and ξ_0 = 1 are equivalent through the change of variables (3.14), so one may wonder whether the continuous relaxation of f̃ is convex. The following remark provides a negative answer to this question.

Remark 3.4.1 (Nonconvexity of the continuous relaxation of f̃ of (3.15)). We use a simple example with N = 2 to show that the continuous relaxation of f̃ of (3.15), formed as in (3.28), need not be convex. By abuse of notation, consider

f̃(s) = 1^T (I − (I − diag(s ∘ α̃)) W)^{-1} 1,   s ∈ Ω = [0, 1]^N.

Suppose

W = [0.1, 0.9; 0.5, 0.5]   and   α̃ = [0.8; 0.8].

We have

∇²f̃(1/2) = [6.9101, 16.6656; 16.6656, 22.2587],

which is not positive semidefinite, as it has a negative eigenvalue, namely λ = −3.7632. Thus, f̃ is nonconvex on Ω.
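This claim is easy to check numerically. The following sketch forms a finite-difference Hessian of f̃ at s = (1/2, 1/2) for the 2-node example above; the printed values should be compared with those reported in the remark.

```python
import numpy as np

W = np.array([[0.1, 0.9], [0.5, 0.5]])
a_t = np.array([0.8, 0.8])                      # alpha_tilde

def f_tilde(s):
    M = np.eye(2) - (np.eye(2) - np.diag(s * a_t)) @ W
    return np.ones(2) @ np.linalg.solve(M, np.ones(2))

s0, h = np.full(2, 0.5), 1e-4
H = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        ei, ej = np.eye(2)[i] * h, np.eye(2)[j] * h
        # forward-difference approximation of the mixed second partials
        H[i, j] = (f_tilde(s0 + ei + ej) - f_tilde(s0 + ei)
                   - f_tilde(s0 + ej) + f_tilde(s0)) / h**2

print(H)                       # compare with the Hessian reported in Remark 3.4.1
print(np.linalg.eigvalsh(H))   # the smallest eigenvalue should be negative
```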
In contrast to f̃, the continuous relaxation of J^(1)_K is convex on Ω. In fact, more is true; we establish below the convexity of f in (3.28), which is the main result of this subsection. The convexity proof relies on the following technical lemma. [Footnote 4: We thank Prof. Terrence Tao for the idea for the proof of Lemma 3.4.2.]

Lemma 3.4.2. Let A ∈ R^{N×N} be nonnegative and V ∈ R^{N×N} be diagonal. Then for each m ≥ 0, ∑_{i+j+k=m} A^i V A^j V A^k is a nonnegative matrix, where i, j, k are nonnegative integers.

Proof. By a change of variables, we have

∑_{i+j+k=m} A^i V A^j V A^k = ∑_{0≤q≤r≤m} A^q V A^{r−q} V A^{m−r}.

Writing V = diag(v_1, ..., v_n), the st-th entry of the matrix above is

∑_{0≤q≤r≤m} ∑_{1≤i,j≤n} [A^q]_si [A^{r−q}]_ij [A^{m−r}]_jt v_i v_j.   (3.30)

To simplify this expression, consider the graph generated by the matrix A, where a_ij denotes the weight of the directed edge i → j. [Footnote 5: The edge direction defined within this proof is in reverse order to our usual notation.] Let P_m denote the set of all walks of length m from node s_0 = s to s_m = t, i.e., those of the form

s = s_0 →(e_1) s_1 →(e_2) ... →(e_m) s_m = t,

where e_i = (s_{i−1} s_i) denotes the directed edge from node s_{i−1} to s_i. Now, for a fixed tuple (q, i, r, j), consider the collection P(q,i,r,j) ⊂ P_m of walks satisfying s_q = i and s_r = j (i.e., fixing positions q and r). Then the term under the double summation in (3.30) represents the total weight of all the walks [Footnote 6: A walk's weight is defined as the product of the weights of all the edges along the walk.] in P(q,i,r,j) multiplied by v_{s_q} v_{s_r}, i.e.,

[A^q]_si [A^{r−q}]_ij [A^{m−r}]_jt v_i v_j = ∑_{{e_k}_1^m ∈ P(q,i,r,j)} a_{e_1} a_{e_2} ... a_{e_m} v_{s_q} v_{s_r},

where we have assumed A = [a_kl]_{1≤k,l≤n}. Summing the right side of this relation over 1 ≤ i, j ≤ n yields the total weight of all the walks in P_m (each scaled by v_{s_q} v_{s_r}), namely ∑_{{e_k}_1^m ∈ P_m} a_{e_1} a_{e_2} ... a_{e_m} v_{s_q} v_{s_r}. As a result, (3.30) becomes

∑_{0≤q≤r≤m} ∑_{{e_k}_1^m ∈ P_m} a_{e_1} ... a_{e_m} v_{s_q} v_{s_r} = ∑_{{e_k}_1^m ∈ P_m} a_{e_1} ... a_{e_m} ∑_{0≤q≤r≤m} v_{s_q} v_{s_r}.   (3.31)

Note that a_{e_i} ≥ 0 for any i and that

2 ∑_{0≤q≤r≤m} v_{s_q} v_{s_r} = (∑_{0≤i≤m} v_{s_i})^2 + ∑_{0≤i≤m} v_{s_i}^2 ≥ 0.

It then follows that the right side of (3.31) is nonnegative, thereby completing the proof.

We are now ready to establish the convexity as well as other important properties of our objective functions.

Theorem 3.4.3 (Properties of objective function in (3.27)). For any b, c, α ∈ R^N_+\{0} and β ∈ R^N_+, let Ω be given as in (3.29) and consider f : R^N_+ → R ∪ {∞} defined in (3.27), i.e.,

f(y) := b^T (L_β + diag(y ∘ α))^{-1} c.   (3.32)

Then f is positive, convex and decreasing on Ω. It is smooth on Ω\{0}, with gradient ∇f and Hessian H given by

∇f(y) = −(Y^{-T} b) ∘ α ∘ (Y^{-1} c),   with Y := L_β + diag(y ∘ α),   (3.33)

and

H(y) = H̄(y) + H̄^T(y),   with H̄(y) := diag(α ∘ (Y^{-T} b)) Y^{-1} diag(α ∘ (Y^{-1} c)).   (3.34)

Moreover, H̄(y) is a nonnegative matrix and

0 ⪯ H(y) ⪯ L_f I,   with L_f := ρ(H(0)).   (3.35)

Furthermore, L_f ≤ N max_ij [H(0)]_ij.

Proof. Smoothness of f follows from its definition. Positivity follows from the assumptions b, c, β ≥ 0 and the fact that Y = L_β + diag(y ∘ α) is a nonsingular M-matrix whenever y ∈ Ω and β are not both equal to 0, which ensures that Y^{-1} is a nonnegative matrix (see Lemma A.1.3 in Appendix A.1). Hence f(y) = b^T Y^{-1} c ≥ 0 for all y ∈ Ω. Next, we find the first differential of f, namely,

df(y) = b^T dY^{-1} c = −b^T Y^{-1} diag(α) diag(Y^{-1} c) dy = −[(Y^{-T} b) ∘ α ∘ (Y^{-1} c)]^T dy,   (3.36)

where we have used the facts that dY^{-1} = −Y^{-1}(dY)Y^{-1}, dY = d(L_β + diag(y ∘ α)) = diag(dy ∘ α), and diag(x)y = diag(y)x = x ∘ y. Therefore, ∇f(y) = −(Y^{-T} b) ∘ α ∘ (Y^{-1} c). Since Y^{-1} ≥ 0_{N×N}, we have ∇f(y) ≤ 0, which implies that f is decreasing in y. In fact, a stronger statement holds: Y^{-1} = (L_β + diag(y ∘ α))^{-1} is nonnegative and decreasing in y. As a result, ‖∇f(y)‖_2 ≤ ‖∇f(0)‖_2 for all y ∈ Ω. When β ≠ 0, ‖∇f(0)‖_2 < ∞, thus f is Lipschitz continuous with parameter ‖∇f(0)‖_2 on Ω. Next, we find the second differential of f as follows:

d²f(y) = 2 b^T Y^{-1} diag(dy ∘ α) Y^{-1} diag(dy ∘ α) Y^{-1} c   (3.37)
       = 2 dy^T diag(α ∘ (Y^{-T} b)) Y^{-1} diag(α ∘ (Y^{-1} c)) dy
       = dy^T (H̄ + H̄^T) dy,   (3.38)

with H̄ defined as in (3.34). Thus, H = H̄ + H̄^T is the Hessian of f. Clearly, H̄ is nonnegative since Y^{-1}, α, b, c are so. This proves that H is also nonnegative. For convexity, it suffices to show that d²f given by (3.37) is positive semidefinite on Ω\{0}. Indeed, since b and c are nonnegative, we will prove that

Y^{-1} V Y^{-1} V Y^{-1} ≥ 0_{N×N},

where V = diag(dy ∘ α). Note that Y is a nonsingular M-matrix. Thus, by definition, Y = s(I − A) for some positive s and some nonnegative matrix A with ρ(A) < 1. Then we have Y^{-1} = s^{-1} ∑_{i≥0} A^i and hence

Y^{-1} V Y^{-1} V Y^{-1} = s^{-3} ∑_{i≥0} ∑_{j≥0} ∑_{k≥0} A^i V A^j V A^k = s^{-3} ∑_{m≥0} ∑_{i+j+k=m} A^i V A^j V A^k.   (3.39)
Now, by Lemma 3.4.2, ∑_{i+j+k=m} A^i V A^j V A^k ≥ 0_{N×N} for each m ≥ 0. Therefore, Y^{-1} V Y^{-1} V Y^{-1} ≥ 0, thereby proving the convexity of f. Next, to prove (3.35), we use the inequality

x^T H(y) x ≤ ρ(H(y)) x^T x,   ∀x ∈ R^N, y ∈ Ω,   (3.40)

which holds since ρ(H(y)) is the largest eigenvalue of the nonnegative (and symmetric) matrix H(y) (see Theorem A.1.1 in Appendix A.1). Note also that H(y) is decreasing in y ∈ Ω. Thus we have

0_{N×N} ≤ H(y) ≤ H(0) ≤ max_ij [H(0)]_ij 11^T.

Finally, by Theorem A.1.2 in Appendix A.1, we have ρ(H(y)) ≤ ρ(H(0)) = L_f ≤ max_ij [H(0)]_ij ρ(11^T) = N max_ij [H(0)]_ij.

Consider again the example in Remark 3.4.1 and choose W and α satisfying (3.14), e.g., W = W̃ and α = 4·1. With β = 0, b = c = 1, we have f(y) = 1^T (L + 4 diag(y))^{-1} 1 and

∇²f(1/2) = H(1/2) = [2.5952, 0.7958; 0.7958, 3.8131] ≻ 0.

Remark 3.4.4 (Lipschitz constant for problem (3.8)). When β = 0 we have L_β = L, which is singular. As a result, the Lipschitz constant L_f = ρ(H(0)) = ∞. Indeed, when α = β = 0, the agents' opinions converge to a consensus value that is unaffected by either T or Q.

The following result is an immediate consequence, whose proof is omitted.

Corollary 3.4.5 (Properties of g in (3.29)). The function g is smooth and convex over the constraint set Ω, with gradient

∇g(y) = ∇f(y) + μ1,   (3.41)

which is Lipschitz continuous with Lipschitz constant L_g = L_f. Moreover, if η := min_{y∈Ω} λ_min(H(y)) > 0, then g is strongly convex with parameter η.

It now becomes obvious that both problems (P_Rlxd) and (P_Aprx) are convex with a (possibly strongly) convex smooth cost function. Thus, they can be solved by various algorithms, including Interior Point Methods (IPMs) or the Projected Gradient Method (PGM) (see, e.g., [114–118]), provided that ∇f(y) can be evaluated efficiently (see Remark 3.4.11 below). We now remark on how to deal with the original problem (P) in connection with tuning the parameter μ in (P_Aprx).

Remark 3.4.6 (On selecting the regularization parameter μ). From the optimal solution ỹ* of problem (3.29) for a particular μ, we can obtain an approximate solution to the original problem (3.27) by choosing the nodes corresponding to the K largest entries of ỹ*. As μ increases, there (usually) exists μ̄ such that card(ỹ*) ≤ K. Once this value is found (which can be done fairly easily), μ can be tuned manually within the interval [0, μ̄] to find the best approximation.

We conclude this subsection with the following remark, showing an application of the analysis developed above.

Remark 3.4.7 (A proof of a conjecture in [105]). In a seemingly unrelated context, the authors of [105] study an on-chip active cooling system (based on super-lattice thin-film thermoelectric coolers) and the problem of minimizing the steady state temperature profile. The following conjecture was posed and supported by extensive simulations.

Conjecture 3.4.8 ([105]). Suppose H^{-1} ∈ R^{N×N} is a Stieltjes matrix. [Footnote 7: A Stieltjes matrix is a real symmetric positive definite M-matrix.] Then, for any 1 ≤ k, l ≤ N and z ∈ R^N, the following holds:

z^T diag(H_(k)) H diag(H_(l)) z ≥ 0.

Assuming this conjecture to be valid, the paper then shows the convexity of each element h_kl of the matrix H(x) = (G − xD)^{-1} as a function of x ∈ [0, x_m], where D is a diagonal matrix with at least one positive entry, G is an irreducible Stieltjes matrix, and x_m > 0 is such that G − xD is positive definite for all x ∈ [0, x_m]. This convexity result was proved in a later work [119] using results on the convexity of parameterized linear equations [120], but the conjecture itself has not yet been proved. We will prove the conjecture next.
Although our cost function does not resemble H(x) in the papers just mentioned, the analysis provided above can be used to give an affirmative answer to the conjecture, even under a weaker assumption, namely that H^{-1} is an M-matrix. Indeed, the proof below does not require a symmetry assumption.

Proof of Conjecture 3.4.8. Let V = diag(z). We have

z^T diag(H_(k)) H diag(H_(l)) z = H^(k) diag(z) H diag(z) H_(l) = e_k^T HVHVH e_l.

Since H^{-1} is an M-matrix, it follows that H^{-1} = s(I − A) for some s > 0 and some A ≥ 0_{N×N} with ρ(A) < 1. Thus H = s^{-1} ∑_{i≥0} A^i, and hence using the same expansion as in (3.39) yields s³ HVHVH = ∑_{m≥0} ∑_{i+j+k=m} A^i V A^j V A^k, which is nonnegative by Lemma 3.4.2. Therefore, e_k^T HVHVH e_l ≥ 0, as desired.

The foregoing proof suggests that the convexity results in our work can be useful in studying various applications, such as those considered in [105, 119].

3.4.2 Numerical Methods

We now discuss two numerical algorithms that can be used to solve problem (P_Aprx), namely the Projected Gradient Method and Interior Point Methods. Problem (P_Rlxd) can be treated similarly.

Theorem 3.4.9 (PGM). Consider the Projected Gradient Method applied to problem (3.29):

y^(t+1) = P_Ω[y^(t) − γ^(t) (∇f(y^(t)) + μ1)],   (3.42)

where P_Ω denotes the projection operator onto the constraint set Ω of (3.29) and γ^(t) the step size. If γ^(t) is chosen by the Armijo rule, then every limit point of {y^(t)} is an optimal solution to problem (3.29). If β ≠ 0, we can use any constant step size γ^(t) ≡ γ ∈ (0, 2/L_f). If η > 0, then for γ = 1/L_f, y^(t) converges linearly to the unique solution y* with rate (1 − η/L_f)^{1/2}.

Proof. The theorem follows from Propositions 2.3.1 and 2.3.2 in [115], and Theorem 2.2.8 in [117].

Remark 3.4.10 (On implementation of PGM when β = 0). This corresponds to problem (3.8). We have that f is well-defined and smooth on Ω\{0} (f(0) = ∞). As a result, given y^(0) ≠ 0, the level set Ω_0 = {y ∈ Ω : g(y) ≤ g(y^(0))} is a convex compact set excluding 0, over which g, ∇g and ∇²g = H are continuous. In particular, ∇g is Lipschitz continuous on Ω_0 with coefficient L⁰_g = max_{y∈Ω_0} ‖H(y)‖. Thus, we can replace P_Ω by P_{Ω_0}, or choose a step size ensuring that y^(t) ∈ Ω_0; the PGM iteration (3.42) then still works in this case (i.e., β = 0).

Remark 3.4.11 (On gradient evaluation). The gradient ∇f(y) involves inversion of Y = L_β + diag(y ∘ α), which usually costs O(N³) operations and O(N²) memory, and thus does not scale well with network size. Moreover, even if the underlying graph is sparse, this inversion can yield a dense matrix, so storing it could also be too expensive for very large networks. In such cases, one way to reduce those costs is to exploit the sparsity of the graph and the structure of the cost function. In particular, from (3.33) we have

∇f(y) = −u ∘ α ∘ v,   where u := Y^{-T} b, v := Y^{-1} c.   (3.43)

That is, u and v are respectively the solutions of the sparse linear equations Y^T u = b and Y v = c, for which many solvers/algorithms are available. For example, based on the diagonal dominance of the matrix Y, we can employ the power iteration. Specifically, consider the decomposition Y = D_y + E, where D_y and E denote the diagonal and off-diagonal parts of Y. It is clear from the structure of Y = L_β + diag(y ∘ α) that only D_y depends on y (hence the subscript y). Now consider u, which satisfies b = Y^T u = D_y u + E^T u. Since D_y is invertible, we then have u = −D_y^{-1} E^T u + D_y^{-1} b, which is a fixed point relation; under Assumptions 3.2.1 and 3.2.2, the right side defines a contraction mapping with contraction coefficient ρ(D_y^{-1} E^T) < 1. Therefore, we can use the following iteration to compute u:

u_{k+1} = −D_y^{-1} (E^T u_k − b).   (3.44)

It should be noted that (3.44) is highly scalable since (i) E is sparse and can be read off from L (or W), whose storage takes only O(|E|), where |E| is the number of directed edges in the graph, and (ii) each iteration also takes O(|E|) operations, as it involves only a multiplication of u_k by E^T and an element-wise scaling (after a subtraction of b) by the diagonal entries of D_y. Moreover, suppose (3.44) terminates after k_u iterations, yielding a convergence error proportional to ρ^{k_u}(D_y^{-1} E^T); then the running time to compute u is O(k_u |E|). Finally, v can be computed in the same manner, i.e., v_{k+1} = −D_y^{-1} (E v_k − c). A combined sketch of (3.42)-(3.44) follows.
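The following minimal sketch combines the projected gradient iteration (3.42) for (P_Aprx), where the projection onto Ω = [0,1]^N is an element-wise clip, with the matrix-free gradient evaluation of Remark 3.4.11. All inputs are assumed given; fixed iteration counts stand in for residual-based stopping tests, and this is an illustrative sketch rather than the exact implementation used in the experiments.

```python
import numpy as np
import scipy.sparse as sp

def grad_f(W, alpha, beta, b, c, y, inner=200):
    N = W.shape[0]
    Y = sp.eye(N) - W + sp.diags(beta + y * alpha)  # Y = L_beta + diag(y o alpha)
    d = Y.diagonal()                                # D_y
    E = Y - sp.diags(d)                             # off-diagonal part of Y
    u = np.zeros(N)
    v = np.zeros(N)
    for _ in range(inner):
        u = (b - E.T @ u) / d                       # fixed-point iteration (3.44)
        v = (c - E @ v) / d                         # same scheme for v
    return -u * alpha * v                           # gradient (3.43)

def pgm(W, alpha, beta, b, c, mu, gamma, iters=500):
    y = np.full(W.shape[0], 0.5)                    # interior starting point
    for _ in range(iters):
        g = grad_f(W, alpha, beta, b, c, y) + mu    # grad of g(y) = f(y) + mu*1^T y
        y = np.clip(y - gamma * g, 0.0, 1.0)        # projection onto [0,1]^N
    return y
```

For (P_Rlxd), the projection additionally involves the constraint 1^T y ≤ K and is slightly more involved; the box projection shown suffices for (P_Aprx).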
PGM belongs to the class of first-order methods, which require only gradient evaluations (and a projection step). Thus, it can be employed to deal with large networks. However, for networks that are not very large, other more efficient and sophisticated algorithms are available, such as primal-dual IPMs [114]. Here we note that each iteration of such a method involves computing the Newton direction, which requires O(N³) operations to evaluate the gradient ∇f and the Hessian matrix H, given respectively in (3.33) and (3.34). In practice, the method converges in very few iterations (say, a few tens), often much fewer than required by PGM.

3.5 Supermodularity and Greedy Algorithms

In this section, we develop an alternative approach to problem (P) based on the greedy strategy, for which approximation bounds on the suboptimal solutions can be established. To this end, we first prove that J_K in (3.12) is monotone and supermodular in the set variable K. In fact, more is true: the function f given by (3.32) is supermodular and monotone on Ω. For this, we will give two different proofs, as each has its own merit. As a result, problem (3.12) admits a (1 − 1/e) approximation algorithm [108]. We then develop an improved version of this algorithm that is able to achieve better approximate solutions.

3.5.1 Supermodularity Results

We now establish supermodularity of the objective function f, and thus of J_K. Our first approach relies on the results in the previous subsection and the following known result.

Theorem 3.5.1 (Topkis' Characterization Theorem [121, 122]). Let Ω = [x̲, x̄] be an interval in R^N and h : R^N → R be twice continuously differentiable on (some open set containing) Ω. Then h is supermodular on Ω if and only if ∂²h/∂x_i ∂x_j ≥ 0 for all x ∈ Ω and all i ≠ j. (There are no restrictions on ∂²h/∂x_i².)

As a consequence, we have the following.

Theorem 3.5.2 (Supermodularity of objective functions). Consider the function f in problem (P) and the set Ω defined in (P_Aprx). Then f is supermodular and monotone on Ω. Thus, the cost J_K is supermodular and monotone in K.

Proof. By Theorem 3.4.3, f in (3.32) is decreasing and its Hessian matrix H is element-wise nonnegative on Ω. Supermodularity of f then follows from Theorem 3.5.1. Thus, J_K is also supermodular, as it is the restriction of f to the vertices of Ω.

It should be pointed out that, unlike in problem (P2), the function f in (P1) is not defined at 0 ∈ Ω and thus is not twice continuously differentiable on any open set containing Ω. Therefore, the result above does not apply directly to problem (P1).
Next, we provide a second approach to proving the supermodularity result that avoids the technical problem above. This approach is based on the following two lemmas, the first of which is a matrix supermodularity inequality and the second a composition property. These results not only provide a deeper understanding of the influence process considered here, but are also useful in proving the supermodularity of another related cost function used in the literature.

Lemma 3.5.3 (Matrix supermodularity inequality). For any S ⊂ V, let Γ_S = diag(α_S). Then (L_β + Γ_S)^{-1} ∈ R^{N×N}_+ is nonincreasing and supermodular in S, i.e., the following matrix inequalities hold for any v, k ∈ V\S:

(L_β + Γ_S)^{-1} − (L_β + Γ_{S∪{v}})^{-1} ≥ (L_β + Γ_{S∪{k}})^{-1} − (L_β + Γ_{S∪{k,v}})^{-1} ≥ 0.   (3.45)

This result also holds if we replace L_β with L_0.

Proof. The proof relies on the Woodbury matrix identity and results from M-matrix theory. See Appendix A.2.4 for details.

This result seems to suggest that opinion diffusion and influence spreading processes inherently possess monotonicity and supermodularity properties.

Remark 3.5.4. We do not exclude the case S = ∅, since it can be seen that (L_β + Γ_∅)^{-1} = +∞ if β = 0.

Lemma 3.5.5 (Composition property). Suppose F : 2^V → R^{N×N} is decreasing and supermodular, and f : R^{N×N} → R is increasing and convex. Then the composition (f ∘ F) is nonincreasing and supermodular. [Footnote 8: Here, ∘ denotes the composition operator and should not be confused with the Hadamard product used in Section 3.4.]

Proof. This result is a straightforward extension of the standard case [121] in which F : 2^V → R and f : R → R. Details are omitted for brevity.

Now, using Lemmas 3.5.3 and 3.5.5 with F(K) = (L_β + Γ_K)^{-1}, f_1(X) = b^T X |ξ_0| and f_2(X) = b^T X β, we again conclude that J^(i)_K = (f_i ∘ F)(K), i = 1, 2, are nonincreasing and supermodular.

Remark 3.5.6. The authors of [104] consider the problem of selecting a number of agents as leaders (in their context) in order to minimize the overall variance in an undirected unweighted network subject to stochastic disturbances. It can be verified that the cost function in that paper is equivalent to tr((L + diag(α_K))^{-1}), which equals (f ∘ F)(K) with F(K) = (L + Γ_K)^{-1} and f(X) = tr(X). Using the result above, we can immediately conclude the supermodularity of this cost function; this was not established in [104].

3.5.2 Greedy Algorithms and Ratio Bounds

Having established supermodularity of the objective functions, we now introduce our greedy algorithms and show their ratio bounds. For convenience, J_S and J(S) are used interchangeably. Our first algorithm, whose output is denoted by K_G, is similar to the greedy algorithm in [108] and is described next.

Algorithm 3.1: Greedy Adding K_G
  Data: W, α, β, b and K
  1: Init: K_G ← ∅
  2: for i = 1 : K do
  3:   k*_i ← arg min{J(K_G ∪ {v}) : v ∉ K_G}
  4:   K_G ← K_G ∪ {k*_i}
  5: Output: K_G

Description of Algorithm 3.1: The idea is to start with an empty set K_G (line 1), then greedily add to K_G the node that most decreases the cost J_K (lines 2-4). The algorithm terminates after K sequential selections. A Python sketch of this procedure appears below.
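A minimal sketch of Algorithm 3.1 in its naive form (one linear solve per candidate; the rank-1 and power-iteration speedups are discussed in Remark 3.5.7 below). The cost evaluator follows the definition (3.12), and all inputs are assumed given.

```python
import numpy as np

def J(K, L_beta, alpha, b, c):
    """Evaluate J_K = b^T (L_beta + diag(alpha_K))^{-1} c per (3.12)."""
    d = np.zeros(len(b))
    d[list(K)] = alpha[list(K)]
    return b @ np.linalg.solve(L_beta + np.diag(d), c)

def greedy_adding(L_beta, alpha, b, c, K):
    KG = []
    for _ in range(K):                    # K sequential selections
        rest = [v for v in range(len(b)) if v not in KG]
        KG.append(min(rest, key=lambda v: J(KG + [v], L_beta, alpha, b, c)))
    return KG
```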
Remark 3.5.7 (Complexity of Algorithm 3.1). The number of function evaluations is KN − K(K−1)/2. Implemented naively, without exploiting the structure of the cost function, each evaluation requires O(N³) operations (due to matrix inversion), and thus the total cost would be O(KN⁴). We can use the following tricks to alleviate this computational burden.

• Rank-1 updates: At any iteration, let S denote the current set K_G and let P := (L_β + Γ_S)^{-1}. By the Woodbury identity (A.1), it can be verified that

(L_β + Γ_{S∪{v}})^{-1} = P − P_(v) P^(v)/(α_v^{-1} + P_vv).   (3.46)

Let ΔJ(v, S) := J(S) − J(S ∪ {v}). Then

ΔJ(v, S) = b^T P_(v) P^(v) c/(α_v^{-1} + P_vv).

As a result, knowing P, it requires O(N) operations to compute ΔJ(v, S) and hence O(N(N − |S|)) to find v* = arg max_{v∈V\S} ΔJ(v, S). The matrix (L_β + Γ_{S∪{v*}})^{-1} is then obtained from P by a rank-1 update (3.46), which is O(N²). Note that the initial case S = ∅ corresponds to P = L_β^{-1}, which takes O(N³) operations to compute. To sum up, using this scheme, the algorithm requires O(KN² + N³) operations, reduced from the naive approach by a factor of O(N). It also demands O(N²) of memory (mainly to store the matrix inverse).

• Power-iteration method: For a very large network, it may be too expensive to reserve O(N²) memory for storing the inverse matrix (L_β + Γ_S)^{-1}. In this case, one can exploit the sparsity structure of L_β in connection with the power-iteration method to overcome the memory issue, as shown in Remark 3.4.11. In particular, we can write J(S) = u_S^T c, where u_S = (L_β + Γ_S)^{-T} b can be computed using iteration (3.44). As before, let k_u denote the (average) number of iterations of (3.44) needed to achieve a certain accuracy of u. Then the algorithm takes O(KN k_u |E|) operations and O(|E|) memory.

Note that the same greedy algorithm using rank-1 updates was applied in [107] for the case of problem (3.8). Here, we use this algorithm for the more general problem (3.12) and provide proofs of the supermodularity of J_K and of the ratio bounding the incurred error, neither of which was included in [107].

Our result on the approximation guarantee of Algorithm 3.1 involves the notion of the curvature of a submodular function (see, e.g., [109]). Let Z(S) be nondecreasing submodular in S. Then

σ := 1 − min_{x∈P} (Z(P\{x}) − Z(P))/(Z(∅) − Z({x}))   (3.47)

is called the total curvature of Z with respect to the set P.

Theorem 3.5.8 ([109, Cor. 5.7]). Let Z(S) be a nondecreasing submodular function of S such that Z(∅) = 0. Let S_G and S* denote the greedy solution and the optimal solution to the problem max{Z(S) : S ⊆ P, |S| ≤ K}. Then

Z(S_G)/Z(S*) ≥ (1/σ)(1 − (1 − σ/K)^K) =: R_{σ,K},   (3.48)

where σ is the curvature of Z with respect to P.

To use this result, we need to consider the case β = 0 separately, since L_0 is singular and thus J(∅) = ∞.

Theorem 3.5.9 (Properties of Alg. 3.1). Let Assumptions 3.2.1 and 3.2.2 hold. Let K* denote an optimal solution to (3.12) and let K_G be the output of Algorithm 3.1. Let V_α = {i ∈ V : α_i ≠ 0}.

(i) Let v* = arg min_{v∈V_α} J({v}). If β = 0, then

(J({v*}) − J(K_G))/(J({v*}) − J(K*)) ≥ R_{σ,K−1},   (3.49)

where σ = 1 − min_{x∈V_α\{v*}} (J(V_α\{x}) − J(V_α))/(J({v*}) − J({v*, x})).

(ii) If β ≠ 0, then

(J(∅) − J(K_G))/(J(∅) − J(K*)) ≥ R_{σ,K},   (3.50)

where σ = 1 − min_{x∈V_α} (J(V_α\{x}) − J(V_α))/(J(∅) − J({x})).

Proof. (i) Define Z(S) := J({v*}) − J(S ∪ {v*}) for any S ⊆ V_α\{v*}. Then it can be verified that Z is nondecreasing and submodular with curvature σ, and Z(∅) = 0. Thus, applying Theorem 3.5.8 and rearranging terms yields (3.49). (ii) Similarly, (3.50) follows from Theorem 3.5.8 with σ the curvature of Z(S) := J(∅) − J(S) for any S ⊆ V_α.

Note that R_{σ,K} > (1/σ)(1 − e^{-σ}) > 1 − e^{-1} for any σ ∈ (0, 1) and K ≥ 1. Thus, in general, R_{σ,K} is tighter than the constant bound (1 − e^{-1}) established in [108] (and also used in [63, 65, 67]).

Remark 3.5.10 (Bounds on J(K*) by Alg. 3.1). Clearly, J(K_G) is an upper bound on J(K*), and (3.49) or (3.50) provides a lower bound.
We shall denote these bounds by J_GU and J_GL, respectively; e.g., J_GL = J(∅) − (J(∅) − J_GU)/R_{σ,K} for (3.50). Since J(K*) ≥ 0, the bound J_GL is useful only if J_GL ≥ 0, i.e., J_GU ≥ (1 − R_{σ,K}) J(∅) or J_GU ≥ (1 − R_{σ,K−1}) J({v*}).

In the following, we construct another algorithm (Algorithm 3.2, given and described below), which contains Algorithm 3.1 as a special case and is able to improve accuracy in practice. The idea is still to greedily select one "best" node at a time, but we additionally employ a particular swapping strategy: repeatedly replace a selected node in K by another node in V\K (or, more precisely, V_α\K) whenever the swap most decreases the objective function. This strategy is in fact a special case of the Interchange Heuristic [108], which was also employed in [103] and [104] for problems related to sensor placement and leader selection. Our algorithm differs from the aforementioned ones in that, instead of swapping whenever an improvement of the cost function occurs, we carry out swapping in the direction of the steepest-descent coordinate, which helps avoid an exponential number of exchanges. (As a side note, the supermodularity property and approximation bound for the greedy algorithm were not established in [103, 104]. Moreover, the convex analysis in these works is based on the symmetry of Laplacian matrices associated with undirected graphs.)

Algorithm 3.2: Greedy Swapping K_SM := GSwap(K_S0, M)
  Data: W, α, β, b, K_S0, and M
  1: for m = 1 : M or until K_Sm = K_Sm+1 do
  2:   S ← ∅, T = {t_1, t_2, ...} ← K_Sm−1
  3:   for i = 1 : K do
  4:     T ← T\{t_i}
  5:     t*_i ← arg min_{v∉S∪T} J(S ∪ {v} ∪ T)
  6:     S ← S ∪ {t*_i}
  7:   K_Sm ← S
  8: Output: K_Sm

Description of Algorithm 3.2: The algorithm starts with an arbitrary set K_S0 ⊆ V_α (assuming |K_S0| ∈ [0, K]) and works in a cyclic manner for a predetermined number of cycles M, or until K_Sm* = K_Sm*−1 for some m* (line 1). In the m-th cycle (lines 2-7), we revise the estimate K_Sm−1 from the previous cycle by updating each entry one after the other; that is, for i = 1, ..., K, we select t*_i ∈ R_i := V\{t*_1, ..., t*_{i−1}, t_{i+1}, ..., t_K} that minimizes the cost J(S ∪ {v} ∪ T) (line 5), i.e.,

t*_i = arg min_{v∈R_i} J({t*_1, ..., t*_{i−1}, v, t_{i+1}, ..., t_K}),

then add t*_i to S. We call this a greedy swapping step. Note that if i > |K_S0|, we allow {t_i} = ∅, in which case greedy swapping reduces to greedy adding (as in Algorithm 3.1). In essence, this algorithm is based on the cyclic coordinate descent method (also known as the Gauss-Seidel method). A sketch follows.
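A minimal sketch of Algorithm 3.2, reusing the (hypothetical) cost evaluator J from the Algorithm 3.1 sketch. Unfilled slots are treated as empty, so greedy swapping reduces to greedy adding for them, exactly as in the description above.

```python
import numpy as np

def gswap(KS0, M, K, L_beta, alpha, b, c):
    N = len(b)
    cur = list(KS0) + [None] * (K - len(KS0))     # pad up to K slots ({t_i} = empty)
    for _ in range(M):                            # at most M cycles
        prev = list(cur)
        for i in range(K):                        # revise each entry in turn
            others = [t for j, t in enumerate(cur) if j != i and t is not None]
            cands = [v for v in range(N) if v not in others]
            cur[i] = min(cands, key=lambda v: J(others + [v], L_beta, alpha, b, c))
        if cur == prev:                           # no entry changed: terminate
            break
    return cur
```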
Remark 3.5.11 (Entry search in Algorithm 3.2). In general, it is not computationally efficient to determine the optimal order in which the elements of the set K_S are revised in each cycle (so as to reduce the objective cost to the greatest extent possible). In this work, we use a cyclic selection scheme with the least possible complexity.

Remark 3.5.12 (Complexity of Alg. 3.2 with cyclic search). Each cycle (other than the first one) requires (KN − K²) function evaluations. That of the first cycle depends on |K_S0|, but is no more than KN − K(K−1)/2. Again, the naive approach takes O(MKN⁴) operations and O(N²) memory; but we can exploit the structure of the cost function to reduce these computational and memory costs, especially for large networks. Using the power-iteration method, we can avoid the O(N²) memory requirement, as shown in Remark 3.5.7. For networks that are not too large, where storage is not an issue, we can employ the Woodbury matrix identity (A.1) for rank-2 updates (since swapping involves two nodes). Specifically, suppose we want to check a possible swap of t ∈ T ∪ S =: P with some v ∈ V\P. Let P := (L_β + Γ_P)^{-1} and E_(tv) := [e_t, e_v]. Then it can be shown that

(L_β + Γ_{P\{t}∪{v}})^{-1} = P − P E_(tv) M_(tv)^{-1} E_(tv)^T P,   where M_(tv) := [P_tt − α_t^{-1}, P_tv; P_vt, P_vv + α_v^{-1}].   (3.51)

Thus, Δ₂J(−t, v, P) := J(P) − J(P ∪ {v}\{t}), the marginal gain of swapping t and v, can be computed as

Δ₂J(−t, v, P) = [b^T P_(t), b^T P_(v)] M_(tv)^{-1} [P^(t) c; P^(v) c],

which takes O(N) operations provided that P is known. Hence, finding v* = arg max_{v∈V\P} Δ₂J(−t_i, v, S) requires O(N(N − K)) operations, and if a swap is performed, the matrix (L_β + Γ_{P\{t_i}∪{v*}})^{-1} is then obtained from P by a rank-2 update (3.51), which takes O(N²). (Note that this swapping selection is also more computationally expensive than finding a merely improving swap, which is one of the reasons we opt for the greedy swapping strategy instead of the swapping method used in [103] and [104].) During each cycle, at most K swaps can be carried out, taking O(KN²) operations. For the initial cycle, if P is not supplied, then its computation costs at most O(N³). Thus, in general, for M cycles, Algorithm 3.2 takes O(MKN² + N³) operations. However, from our simulations, a good value of M is usually small (say 2-3) and does not scale with N.

Theorem 3.5.13 (Properties of Alg. 3.2). Let {K_Sm}_0^M denote the sequence of approximate solutions generated by Algorithm 3.2.

(i) If K_S0 = ∅, then K_S1 ≡ K_G, where K_G denotes the output of Algorithm 3.1.

(ii) For any m ≥ 0 and K_S0 ⊆ V_α, J(K_Sm+1) ≤ J(K_Sm). In fact, let m* denote the smallest index such that K_Sm* = K_Sm*+1; then J(K_Sm+1) < J(K_Sm) for all m < m*, and J(K_Sm+1) = J(K_Sm) for all m ≥ m*.

(iii) Let v* = arg min_{v∈V_α} J^(1)({v}). For any K_S0 ⊆ V_α,

(J^(1)({v*}) − J^(1)(K_Sm*))/(J^(1)({v*}) − J^(1)(K*)) > 1/2   and   (1 − J^(2)(K_Sm*))/(1 − J^(2)(K*)) > 1/2.

Proof. (i) Consider K_S0 = ∅ and the first cycle, i.e., m = 1. Then T = ∅ and S is initialized as empty. As a result, line 5 becomes t*_i = arg min_{v∈V\S} J(S ∪ {v}), which together with line 6 is exactly the greedy Algorithm 3.1. Therefore, K_S1 ≡ K_G, as desired.

(ii) Consider the m-th cycle. It follows from the algorithm that K_Sm−1 = {t_1, t_2, ..., t_K} (line 2). By the greedy choice of t*_i (line 5), it can be seen that

J(K_Sm) = J({t*_1, t*_2, ..., t*_K}) ≤ ... ≤ J({t*_1, t_2, ..., t_K}) ≤ J({t_1, t_2, ..., t_K}) = J(K_Sm−1).

Thus J(K_Sm−1) = J(K_Sm) if and only if all the inequalities in this relation become equalities, i.e., no further improvement on the objective can be made entry-wise. Hence, if K_Sm* = K_Sm*+1 for some m*, then J(K_Sm) = J(K_Sm*) for all m ≥ m*. The existence of m* clearly follows from the fact that the feasible set of K is finite (which comes from the finiteness of the network size).

(iii) For any submodular and nondecreasing function Z(S), it follows from [108, Thm. 5.1] that

(Z(S*) − Z(S_I))/(Z(S*) − Z(∅)) ≤ (K − 1)/(2K − 1) < 1/2,

where S* and S_I denote the optimal solution and an interchange solution (i.e., one admitting no further local improvement) to the problem max{Z(S) : S ⊆ P, |S| ≤ K}. Applying this result to our case, with Z(S) := J^(1)({v*}) − J^(1)(S ∪ {v*}) for S ⊆ V_α\{v*} (problem (P1)), or Z(S) := 1 − J^(2)(S) for S ⊆ V_α (problem (P2)), yields the desired results. Here, K_Sm* is an interchange solution for each K_S0 ⊆ V_α.

The ratio bound of 1/2 in part (iii) is smaller than the constant R_{σ,K} in Theorem 3.5.9, but holds for any initial set K_S0.
Note also that part (i) of this theorem asserts that Algorithm 3.1 can be obtained from Algorithm 3.2 by letting K_S0 = ∅ and M = 1. In this case, the performance of the latter algorithm is ensured to be no worse than that of the former. In fact, it is clear from part (ii) that better estimates are almost always attained when M > 1.

Corollary 3.5.14 (Approximation accuracy of Alg. 3.2 with K_S0 = ∅). For any m* ≥ 1, J(K_Sm*) ≤ J(K_G). Strict inequality holds if m* > 1.

Although we are not yet able to quantify this gain rigorously, our simulation results illustrate substantial improvement compared to Algorithm 3.1, even with small values of M.

Remark 3.5.15 (On implementation of Alg. 3.2). The following points are worth noting.

• Starting point: The algorithm works for an arbitrary choice of K_S0 and thus can be useful in practice to improve upon a good starting set K_S0, which may be available from, e.g., the convex relaxation approach or Algorithm 3.1.
• Local minimizer K_Sm*: When it is found, there are practical techniques to possibly escape this local minimizer at the expense of more computation time and power, e.g., random swapping of multiple nodes in K_Sm* with V\K_Sm*.
• Termination: We observed that even with a small M (say 2-3), the algorithm still finds a good approximation, especially from a good starting point. This may be attributable to the "diminishing returns" nature of the objective function, which yields significant improvements only in the first few cycles.

3.6 Numerical Examples

The simulations in this section were carried out in Matlab R2015b on a PC with an Intel Core i7 CPU @ 3.10 GHz and 12 GB of RAM.

3.6.1 Example 1: Small Network with One Leader

Consider the network depicted in Figure 3.1, where at every time step each agent updates its opinion by taking the average of its own opinion and those of its neighbors, i.e., w_ij = 1/|N_i| for all j ∈ N_i, i ∈ V. This network was also studied in [62, 107].

[Figure 3.1: Network in Example 1.]

Suppose there is an external leader with constant opinion T = 0 who wants to connect to a small number of agents so as to achieve fast consensus to its opinion; see Problem (P1). We revisit the problem of direct follower selection (with maximum level of trust) in [62], which corresponds to α = ∞, β = 0, x_0 = 1 and b = 1/N. Table 3.1 compares the simulation results of different approaches: (1) exhaustive search, which provides optimal solutions; (2) the coordinate descent method of [62]; (3) Algorithms 3.1 and 3.2; and (4) the convex relaxation (P_Rlxd) solved by an Interior Point Method. In the last case, we approximate α = 10³·1 to solve for y_P_Rlxd, and then choose K_P_Rlxd corresponding to the K largest elements of y_P_Rlxd. We further apply the greedy swapping algorithm to K_P_Rlxd; see the last column, where K_P_Rlxd1 = GSwap(K_P_Rlxd, 1) and K_P_Rlxd2 = GSwap(K_P_Rlxd1, 1). As observed from Table 3.1, algorithm GSwap takes very few cycles to converge to optimal solutions, except for the case K = 5, where it falls into a local minimizer. Usually, M = 2 is enough to obtain a good approximate solution, which is much improved over that generated by Algorithm 3.1 (and is exact in many cases).
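For concreteness, the following sketch constructs the averaging weights w_ij = 1/|N_i| used in this example, with a small hypothetical adjacency standing in for the 25-node network of Figure 3.1:

```python
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 3)]   # hypothetical undirected edges
N = 5
A = np.eye(N)                           # self-loops: each agent counts itself
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
W = A / A.sum(axis=1, keepdims=True)    # row i divided by |N_i|
assert np.allclose(W.sum(axis=1), 1)    # row-stochastic, as required
```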
Table 3.1: Comparison results for the network in Example 1 (* denotes an optimal value). In the last column, J_KPRlxd,1(2) denotes J_KPRlxd1 (J_KPRlxd2).

K | K* (exhaustive)           | J_K*  | J_K [62] | J_KG (Alg. 3.1) | J_KS2 | J_KSm* | m* | J_KPRlxd | J_KPRlxd,1(2)
1 | {13}                      | 44.16 | 180.32   | *               | *     | *      | 1  | 180.32   | * (*)
2 | {8, 19}                   | 13.36 | 28.96    | 23.37           | *     | *      | 2  | 28.96    | 16.54 (*)
3 | {8, 15, 25}               | 6.94  | 10.47    | 9.29            | *     | *      | 2  | 10.47    | * (*)
4 | {7, 8, 15, 25}            | 5.18  | 7.03     | 5.85            | *     | *      | 2  | 7.53     | 5.45 (*)
5 | {3, 7, 9, 15, 25}         | 3.53  | 3.83     | 4.09            | 4.06  | 4.06   | 2  | 6.57     | 3.82 (3.82)
6 | {3, 7, 9, 13, 16, 25}     | 2.22  | *        | 3.13            | 2.54  | *      | 3  | 5.61     | * (*)
7 | {3, 7, 9, 13, 16, 19, 25} | 1.36  | *        | 2.17            | *     | *      | 2  | 2.17     | * (*)

As for the implementation of the convex approach with regularization, there is no optimal rule for selecting μ, the sparsity-penalizing coefficient, other than trial and error (see also Remark 3.4.6). Note that the computational cost per cycle of the coordinate descent method in [62], which involves evaluations of the cost function's gradient and Hessian matrix at each coordinate, is roughly twice that of Algorithm 3.2 (which requires only function evaluations). In addition, Algorithm 3.2 converges after 2-3 cycles with a guaranteed accuracy, while the coordinate descent method can take many more cycles for each trial of μ (with no provable bound on accuracy). Furthermore, although the Interior Point Method applied to (P_Rlxd) also employs the gradient and Hessian matrix, it converges within a few iterations (10-20 in this example).

We simulate the network responses for the case K = 4; see Figures 3.2-3.5, where the fastest convergence is obtained when the leader repeatedly applies Algorithm 3.1 every T_p = 5 time steps.

[Figures 3.2-3.5: network responses x(t) versus time step for K = 4. Figure 3.2: optimal solution K* = {7, 8, 15, 25}; Figure 3.3: coordinate descent, K = {7, 13, 16, 25}; Figure 3.4: Algorithm 3.1, K = {8, 13, 16, 25}; Figure 3.5: Algorithm 3.1 repeated every T_p = 5 time steps.]

3.6.2 Example 2: Medium-Size Network with Two Leaders

Consider a directed network based on the largest strongly connected component of the Wikipedia vote network studied in [123]. [Footnote 9: Data available at: http://snap.stanford.edu/data/wiki-Vote.html] Thus, our network has N = 1300 nodes and 39456 edges. We generate the weight of each directed edge randomly in the interval (0, 1). Suppose that leader Q has selected the set V_β containing the first 50 nodes with the highest out-degrees and that β_i = 10⁶ for all i ∈ V_β (thus, they are in full support of Q). Suppose that leader T can connect to up to K nodes in V_α, which contains the first 1000 nodes that are not direct followers of Q (here "the first 1000 nodes" is understood in terms of the numbering sequence of the nodes). We also assume that α_i = 10 for all i ∈ V_α. In this example, we consider problem (P2) for different values of K ∈ [1, 200] using various schemes:

(i) Algorithm 3.1: the greedy algorithm with output K_G, providing J_GU = J(K_G) and J_GL as upper and lower bounds on J* (see Remark 3.5.10).
(ii) (P_Rlxd)+IPM: the relaxed problem (P_Rlxd) solved by the Interior Point Method in the OPTI toolbox [124], which gives f̄_P_Rlxd and f*_P_Rlxd as upper and lower bounds. [Footnote 10: Here, we let y^(0) = 0 and stop the algorithm if |f_i − f_{i−1}|/|f_i| ≤ 10⁻⁶.]
(iii) (P_Aprx)+IPM: the regularized problem (P_Aprx) solved by the Interior Point Method (with sparsity threshold set to 0.01). The output, denoted by K_P_Aprx, yields a corresponding cost f_P_Aprx =: J_P_Aprx, an upper bound on J*.
(iv) GSwap(K_P_Aprx, 1): applying one cycle of the greedy swapping algorithm to K_P_Aprx obtained from (iii).

The simulation results are shown in Figures 3.6 and 3.7.
Here, the upper bounds obtained by the greedy algorithm, GSwap(K_P_Aprx, 1) and the (P_Rlxd)+IPM scheme are almost the same, while the convex relaxation approach gives the best lower bounds, which help evaluate the approximation errors. In particular, using these bounds, we are able to conclude that the approximation ratio of the greedy solutions K_G (as well as that of GSwap(K_P_Aprx, 1) and (P_Rlxd)+IPM) satisfies

(1 − J(K_G))/(1 − J*) ≥ (1 − J_GU)/(1 − f*_P_Rlxd).

[Figure 3.6: Upper bounds (solid lines) and lower bounds (dashed line) on J*; the global lower bound J(V_α) holds for any K. The ratio bound (1 − J_GU)/(1 − f*_P_Rlxd) (dotted line) is at least 90% for K ≥ 90.]

This ratio bound, depicted by a dotted line in Figure 3.6, is clearly much higher than R_{σ,K} (here σ = 0.99) and the well-known theoretical ratio (1 − 1/e) ≈ 63.21% for the greedy algorithm. For example, the ratio bound is at least 90% for K ≥ 90. Regarding running time, note that the greedy algorithm scales linearly with K, while the convex approach does not; see Figure 3.7. As μ increases, |K_P_Aprx| decreases, and thus so does the running time of GSwap(K_P_Aprx, 1).

[Figure 3.7: CPU run times (s) of the four schemes versus K. The Interior Point Method takes approximately 0.21 s per iteration.]

3.7 Closing Discussion

This section provides further applications and results based on the analysis developed in the previous sections.

3.7.1 Application to Friedkin's Model

Consider Friedkin's model [3] in the presence of two leaders T and Q:

x_i(t+1) = (α_i T + β_i Q + σ_i x_i(0) + ∑_{j∈N_i} w_ij x_j(t))/(α_i + β_i + σ_i + 1),

where, as before, α_i and β_i denote the weights that agent i places on T and Q, respectively, and σ_i represents the stubbornness of agent i in keeping its initial opinion (or internal belief; see also [20, 63]). In matrix form,

x(t+1) = (I + diag(α + β + σ))^{-1} (αT + βQ + σ ∘ x(0) + W x(t)).   (3.52)

Again assuming Q = 1 and T = 0, the equilibrium of the system is given by

x(∞) = (L + diag(α + β + σ))^{-1} (β + σ ∘ x(0)).

Thus, we can define an associated influence optimization problem of "T against Q and the stubborn" as follows:

min_{K⊆V} { b^T x(∞) = b^T (L_{β+σ} + diag(α_K))^{-1} c̃ : |K| ≤ K },

where L_{β+σ} := L + diag(β + σ) and c̃ := β + diag(σ) x(0). Clearly, this problem fits into the general form (P) described in (3.12) and thus can be treated efficiently by the methods developed in this chapter.
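A minimal sketch of the steady-state computation above (all parameter values are hypothetical):

```python
import numpy as np

# Steady-state opinions under Friedkin's model with two leaders (T = 0, Q = 1):
# x(inf) = (L + diag(alpha + beta + sigma))^{-1} (beta + sigma o x(0)).
N = 4
W = np.full((N, N), 1.0 / N)                  # hypothetical row-stochastic W
L = np.eye(N) - W
alpha = np.array([5.0, 0.0, 0.0, 0.0])        # T connects to agent 0 only (assumed)
beta  = np.array([0.0, 2.0, 0.0, 0.0])        # Q connects to agent 1 only (assumed)
sigma = np.full(N, 0.5)                       # stubbornness levels (assumed)
x0 = np.array([0.2, 0.9, 0.4, 0.6])           # initial opinions (assumed)

x_inf = np.linalg.solve(L + np.diag(alpha + beta + sigma), beta + sigma * x0)
print(x_inf)                                  # steady-state opinions, all in (0, 1)
```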
3.7.2 Further Convexity Results

The following theorem builds on the convexity analysis in Section 3.4.

Theorem 3.7.1. Consider the systems (i) x(t+1) = A_u x(t), where A_u = A + diag(u) is a nonnegative matrix, and (ii) x(t+1) = B_u x(t), where B_u = (A + diag(u))^{-1} B, B is a nonnegative matrix, and A + diag(u) is a nonsingular M-matrix. For either system, if x(0) ≥ 0, then for all t ≥ 0, x_i(t) is convex in u.

Proof. [Sketch] For system (i), the conclusion follows from noting that the second differential satisfies d²A_u^{t+2} = 2 ∑_{i+j+k=t} A_u^i diag(du) A_u^j diag(du) A_u^k and applying Lemma 3.4.2. For (ii), let b_ij(u) denote the ij-th element of B_u. Similarly to the convexity proof of f in (3.32), it can be shown that b_ij(u) is positive, convex and decreasing in u. Thus [B_u^t]_ij, as a sum of products of such b_ij(u), is also positive, convex and decreasing in u.

It can be verified that the result for system (ii) in the statement of Theorem 3.7.1 can be applied to both models (3.2) and (3.9), as well as to (3.52). It is also interesting to note that the result for system (i) in the statement of Theorem 3.7.1 is closely related to [125, Lem. 3], which states that for a continuous-time system ẋ(t) = (M + diag(u)) x(t) with x(0) ≥ 0, the function u ↦ x_i(t) = e_i^T e^{(M+diag(u))t} x(0) is convex if M is a Metzler matrix. [Footnote 11: i.e., a matrix whose off-diagonal entries are nonnegative.] Thus, our result for system (i) can be viewed as the discrete-time counterpart. However, we remark that the proof technique developed here is totally different. Moreover, in general, not every real matrix has a real logarithm, let alone a unique one. The connection between these results and applications of Theorem 3.7.1 are left for future work.

3.7.3 Towards Relaxing the Strong Connectivity Assumption

Consider the case where the network G is fixed but not strongly connected. Without loss of generality, assume that G is weakly connected (i.e., not disconnected). Then, for each K, we decompose V = V_K ∪ V_K̄, where V_K denotes the set of agents in G that are reachable from K. Clearly, {T} ∪ V_K forms a spanning tree rooted at T. Moreover, there are directed links from V_K̄ to V_K but not vice versa. As a result, the agents in V_K̄ evolve independently of those in V_K and reach an equilibrium. Thus, the opinion of each agent in V_K also converges to a fixed point, which is a linear combination of T and the final opinions of the agents in V_K̄. In this regard, V_K̄ can be considered as a set of additional leaders for V_K besides T, and the analysis in this chapter can also be applied in this scenario.

Part II: Consensus Prediction by Observer

Chapter 4: Consensus Prediction in Minimum Time

Abstract: This chapter studies an observer that seeks to predict in minimal time the asymptotic agreement value of the agents in a network. The network is governed by the DeGroot opinion dynamics model. The observer can monitor the opinions of a group of agents, but might not have accurate knowledge of the underlying communication graph and the associated weight matrix. The work makes use of and builds on previous work on finite-time consensus to address this prediction problem. In particular, for the case of a single observed agent, a tight lower bound on the monitoring time is determined, below which an observer with limited knowledge about the network is not able to determine the consensus value regardless of the method used. This minimal prediction time can be achieved by employing the minimal polynomial associated with the observed agent. Next, for the general case of an observer with access to multiple agents, a similar bound is conjectured, and we develop algorithms toward achieving this bound through local observations and computations.

4.1 Introduction

In this chapter, we are concerned with the problem of predicting the consensus value of a network implementing a consensus protocol, where the agents exchange information according to a nearest-neighbor weighted averaging scheme. This problem is related to the finite-time consensus problem that has been investigated in the literature (see, e.g., [71, 72]). Building on these contributions, we investigate the minimal observation time that enables an observer to determine the consensus value by monitoring a set of agents in the network. This problem is useful in network monitoring and security.
Moreover, the algorithms developed in this work can also be used to allow the agents themselves to reach consensus in a time that is shorter than the best known results in the literature. As an application, in Chapter 5 we will demonstrate the use of consensus prediction in developing distributed optimization algorithms with many desirable features.

The contributions of this work are as follows. First, we reveal an intrinsic relation between the consensus value and the available observation data, based on which we (i) derive a fundamental limit on the monitoring time for the case of a single observed node, and (ii) provide a conjecture and analysis for the case of multiple observed nodes. Next, we develop algorithms toward achieving the conjectured bounds through local observations and computations.

The rest of the chapter is organized as follows. In Section 4.2 we describe the problem formulation and provide some background on the finite-time consensus protocol developed in [71]. In Section 4.3, we provide the main results on shortest-time prediction of consensus using the notion of a node's minimal polynomial as introduced in [71]. Then in Section 4.4 we develop algorithms for computing minimal polynomials in a distributed manner and in (sub)optimal time. We provide numerical examples and discuss problems for future work in Sections 4.5 and 4.7, respectively.

4.2 Problem Statement and Previous Results

4.2.1 Problem Description

Consider a network consisting of N agents denoted by $\mathcal{V} = \{1, 2, \ldots, N\}$, with the underlying communication characterized by a directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. Let $x_i(t)$ denote the state or opinion of node i at time $t \geq 0$; $x_i(0)$ represents the initial opinion. At any time t, each agent observes the opinions of its neighbors and updates its own opinion following the DeGroot model (1.4) as described in Chapter 1, namely:

$$x_i(t+1) = \sum_{j \in \mathcal{N}_i} w_{ij} x_j(t), \quad \forall t \geq 0, \; \forall i \in \mathcal{V}, \qquad (4.1)$$

where, recall, $\mathcal{N}_i$ denotes the set of agent i's neighbors (including itself) and $W = [w_{ij}]$ the weight matrix. In this chapter, the following is a blanket assumption; it is the combination of Assumptions 1.5.2 and 1.5.3, restated here for convenience.

Assumption 4.2.1. (Network Connectivity and Weight Matrix) The graph G is fixed and strongly connected. The weight matrix W is fixed, row-stochastic and satisfies $w_{ij} > 0$ for $(i, j) \in \mathcal{E}$, $i \neq j$, and $w_{ij} = 0$ otherwise. Moreover, W has at least one positive diagonal element.

Under this assumption, the network asymptotically achieves consensus:

$$\lim_{t\to\infty} x(t) = \mathbf{1}\pi^T x(0), \qquad (4.2)$$

where we recall that $\pi \in \mathbb{R}^N$ is the normalized left Perron eigenvector of W, i.e., $\pi^T W = \pi^T$ and $\mathbf{1}^T \pi = 1$; see Section 1.5.2 for details. Denote by $x^*$ the consensus value, i.e., $x^* = \pi^T x(0)$.

Our problems of interest are as follows. Suppose that there is an observer that might not know W but can monitor the states of m nodes in the network starting from the initial time t = 0. First, for any initial states x(0), how can the observer predict the consensus value $x^*$ in minimum time? Second, which nodes should be observed to minimize the number of time steps needed when more information on the network is available?

Let $\mathcal{O} \subset \mathcal{V}$ denote the set of m nodes selected by the observer. By the observation at time t we mean the vector $x_{\mathcal{O}}(t) \in \mathbb{R}^m$ that collects the states of the observed nodes at time t. The number of consecutive observations (starting from t = 0) that allows the observer to determine $x^*$ is called the observation time. We find it convenient to introduce the following information model.
Define $\Theta(t)$ as the accumulated information about the network that the observer possesses at time t. (Note that $\Theta(t)$ is an equivalence class.) Let $\Theta(-1)$ denote the initial knowledge and assume that the information dynamics satisfies $\Theta(t+1) = \Theta(t) \cup \{x_{\mathcal{O}}(t+1)\}$, implying that the observer accumulates information. As a result, at any time $t \geq 0$, the observer knows $x_{\mathcal{O}}(s)$ for all $s \in [0, t]$.

4.2.2 Previous Results on Consensus in Finite Time

We now recall the method in [71] that enables the agents to exactly calculate the consensus value after running the iteration (4.1) for only finitely many steps. The method is based on the concept of an individual node's minimal polynomial, given below. First, recall that for any square matrix $A \in \mathbb{R}^{n\times n}$, its associated minimal polynomial $q_A$ is the monic polynomial of least degree for which $q_A(A) = 0_{n\times n}$.

Definition 4.2.2. (Minimal polynomial of a node [71]) Given the weight matrix W, the minimal polynomial of node i, denoted by $q_i$, is the monic polynomial of least degree for which $e_i^T q_i(W) = 0_N^T$, where $e_i$ is the i-th standard unit basis vector.

The existence of $q_i$ follows from the fact that $q_W$ satisfies the condition $e_i^T q_W(W) = 0_N^T$. Moreover, $q_i$ is easily seen to be unique by virtue of being a monic polynomial of least degree satisfying this condition. Note also that the degrees $\deg(q_i)$ need not be the same for different $i \in \mathcal{V}$; however, it always holds that $\deg(q_i) \leq \deg(q_W)$, $i \in \mathcal{V}$. Important properties of $q_i$ are given below; see [71] for a proof.

Lemma 4.2.3. (Properties of the minimal polynomial of a node) For each $i \in \mathcal{V}$, $q_i$ divides $q_W$. Moreover, if $\mu$ is a simple eigenvalue of W whose associated eigenvector has all nonzero elements, then $\mu$ is a simple root of $q_i$.

As a consequence, when Assumption 4.2.1 holds, all the roots of $q_i$ are strictly inside the unit circle, except a single root at 1. Denoting $D_i := \deg(q_i) - 1$, the minimal polynomial $q_i$ can be expressed as

$$q_i(\xi) = (\xi - 1)\sum_{0 \le l \le D_i} a_l^{(i)} \xi^l, \qquad (4.3)$$

where $a^{(i)} = [a_0^{(i)}, a_1^{(i)}, \ldots, a_{D_i}^{(i)}]^T$ satisfies

$$a_{D_i}^{(i)} = 1, \qquad \sum_{0 \le l \le D_i} a_l^{(i)} \ne 0. \qquad (4.4)$$

This decomposition of $q_i$ will be useful in determining the consensus value at each node in finitely many iterations, as we briefly describe next; for a full development with all steps given in detail, the reader is referred to [71]. Recall from Definition 4.2.2 that $e_i^T q_i(W) = 0_N^T$. Thus, for $t \geq 0$,

$$0 = e_i^T q_i(W) x(t) \overset{(4.3)}{=} \sum_{0 \le l \le D_i+1} \big(a_{l-1}^{(i)} - a_l^{(i)}\big) e_i^T W^l x(t),$$

where $a_{-1}^{(i)} = a_{D_i+1}^{(i)} = 0$ for convenience of notation. Note that $e_i^T W^l x(t) = x_i(t+l)$. Thus, we have

$$\sum_{0 \le l \le D_i+1} c_l^{(i)} x_i(t+l) = 0, \quad \forall t \ge 0,$$

where $c_l^{(i)} := a_{l-1}^{(i)} - a_l^{(i)}$. Denote by $X_i(z)$ the z-transform of the signal $x_i$. Applying the unilateral z-transform to the equation above and invoking the time-shifting property yields

$$q_i(z) X_i(z) = \sum_{1 \le l \le D_i+1} c_l^{(i)} \sum_{0 \le j \le l-1} x_i(j) z^{l-j}. \qquad (4.5)$$

By the Final Value Theorem (see, e.g., [126]), we then have

$$\lim_{t\to\infty} x_i(t) = \lim_{z\to 1} (z-1) X_i(z) \overset{(4.3)\text{-}(4.5)}{=} \frac{\sum_{l=0}^{D_i} a_l^{(i)} x_i(l)}{\sum_{l=0}^{D_i} a_l^{(i)}}. \qquad (4.6)$$

Note that $\{x_i(0), \ldots, x_i(D_i)\}$ are consecutive state values of agent i. Thus (4.6) implies that agent i can find $\lim_{t\to\infty} x_i(t)$ after $D_i$ iterations of (4.1), provided that $a^{(i)}$ is known. By (4.2), this limit is the consensus value $x^* = \pi^T x(0)$.

Remark 4.2.4. The method presented above can be viewed as each agent being an observer with its own information model: $\Theta_i(t+1) = \Theta_i(t) \cup \{x_i(t)\}$ and $\Theta_i(-1) = \emptyset$. We remark, however, that in general, even in a distributed setting, more local information is available to each agent than just its own state, e.g., $\Theta_i(t+1) = \Theta_i(t) \cup \{x_{\mathcal{N}_i}(t)\}$ (recalling that $\mathcal{N}_i$ denotes the set of direct neighbors of agent i), and $\Theta_i(-1)$ might not be empty.
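The mechanics of (4.6) are easy to demonstrate numerically. The following sketch (our own illustration, not code from [71]) extracts $q_i$ from W by rank-testing powers of W, divides out the root at 1 as in (4.3), and then predicts $x^*$ from $D_i + 1$ samples of a single node:

```python
import numpy as np

# Sketch: compute the minimal polynomial q_i of node i from W, then predict
# the consensus value from D_i + 1 consecutive samples of x_i via (4.6).
rng = np.random.default_rng(1)
N = 6
W = rng.random((N, N)); W /= W.sum(axis=1, keepdims=True)   # row-stochastic, positive
x0 = rng.random(N)

def node_min_poly(W, i):
    """Monic coefficients [c_0, ..., c_deg] (ascending) of q_i: e_i^T q_i(W) = 0."""
    rows = [np.eye(len(W))[i]]              # e_i^T W^0
    while True:
        rows.append(rows[-1] @ W)           # append e_i^T W^k
        M = np.array(rows)
        if np.linalg.matrix_rank(M, tol=1e-9) < len(rows):
            # solve sum_k c_k e_i^T W^k = 0 with c monic in the highest power
            c = np.linalg.lstsq(M[:-1].T, -M[-1], rcond=None)[0]
            return np.append(c, 1.0)

i = 0
c = node_min_poly(W, i)                          # q_i, ascending powers
a = np.polydiv(c[::-1], [1.0, -1.0])[0][::-1]    # q_i(xi) / (xi - 1), cf. (4.3)
D = len(a) - 1                                   # D_i = deg(q_i) - 1
xs = [x0]
for _ in range(D):                               # D_i + 1 observations of node i
    xs.append(W @ xs[-1])
x_pred = sum(a[l] * xs[l][i] for l in range(D + 1)) / a.sum()   # formula (4.6)
pi = np.linalg.matrix_power(W, 500)[0]           # pi^T ~ any row of W^t, t large
print(x_pred, pi @ x0)                           # the two values agree
```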
Remark 4.2.5. Our setting of having just one observer is more general in the sense that the scenario above can be seen as a special case with appropriate choices of observed nodes $\mathcal{O}$ and information model $\Theta(t)$.

Remark 4.2.6. It is obvious from (4.6) that agent i (or the observer that monitors agent i) determines the consensus value $x^*$ as a linear combination of $D_i + 1$ consecutive observations of agent i's state. This is merely a consequence of the use of the minimal polynomial $q_i$, which by no means assures the optimality of $D_i + 1$ a priori. Hence, the following question is also of interest: among all possible methods that the observer may use to find $x^*$, which one is associated with the least observation time?

4.3 Shortest-Time Prediction of Consensus and Local Computation of Minimal Polynomials

This section deals with the question posed in the foregoing remark. It turns out that if $\mathcal{O} = \{i\}$, then the number $D_i + 1$ of observations is optimal for determining $x^*$, for any initial value of the network and over all possible methods. This optimal value can be achieved by using the minimal polynomial $q_i$ as in (4.6). We show this in detail, then present an optimality conjecture for the case of multiple observed nodes and discuss an idea for achieving this minimum observation time through the computation of minimal polynomials.

We first uncover an intrinsic relation between the consensus value and the observation data: if $x^*$ can be computed at some time $r \geq 0$, then $x^*$ is a linear combination of the available observation data, with coefficients depending on W.

Theorem 4.3.1. If $r \in \mathbb{Z}_+$ and $g : \mathbb{R}^{m(r+1)} \to \mathbb{R}$ are such that for any $x(0) \in \mathbb{R}^N$

$$x^* = g(x_{\mathcal{O}}(r), x_{\mathcal{O}}(r-1), \ldots, x_{\mathcal{O}}(0)), \qquad (4.7)$$

then there exist $\beta_0, \beta_1, \ldots, \beta_r \in \mathbb{R}^m$ such that $x^* = \sum_{i=0}^{r} \beta_i^T x_{\mathcal{O}}(i)$.

To prove this result, we make use of the linearity of the dynamical system (4.1) in conjunction with the following lemma, whose proof is an application of the Hahn-Banach theorem.

Lemma 4.3.2. [127, p. 188] Let $f_0, f_1, \ldots, f_n$ be linear functionals on a vector space V and suppose that $f_0(v) = 0$ for every $v \in V$ satisfying $f_i(v) = 0$ for $i = 1, 2, \ldots, n$. Then there are constants $\beta_1, \beta_2, \ldots, \beta_n$ such that $f_0 = \sum_{i=1}^{n} \beta_i f_i$.

Proof of Theorem 4.3.1: Let r and g satisfy (4.7). Define the functions $f_0, f_{i,t} : \mathbb{R}^N \to \mathbb{R}$ for any $t \geq 0$ and $i \in \mathcal{O}$ such that for all $v \in \mathbb{R}^N$

$$f_{i,t}(v) := e_i^T W^t v, \qquad f_0(v) := \lim_{t\to\infty} e_1^T W^t v. \qquad (4.8)$$

That is, if $x(0) = v$, then $f_{i,t}(v) = x_i(t)$ and $f_0(v) = \lim_{t\to\infty} x_1(t) = x^*$, since the network reaches consensus. Clearly, $f_0$ and the $f_{i,t}$ are linear functions on $\mathbb{R}^N$. Next, define $\Omega = \{v \in \mathbb{R}^N \mid f_{i,t}(v) = 0,\ 0 \le t \le r,\ i \in \mathcal{O}\}$. It can be verified that $\Omega$ is a subspace on which $x_{\mathcal{O}}(t) = 0$ for $0 \le t \le r$. We now consider $f_0$ on $\Omega$. It follows from (4.7) that for any $v \in \Omega$ and $\gamma \in \mathbb{R}$

$$f_0(v) = g(\mathbf{0}, \mathbf{0}, \ldots, \mathbf{0}) = f_0(\gamma v) = \gamma f_0(v), \qquad (4.9)$$

where the second equality holds since $\gamma v \in \Omega$, and the last by linearity of $f_0$. As a result, $f_0(v) = 0$ for any $v \in \Omega$. Therefore, by Lemma 4.3.2, we have $f_0 = \sum_{0 \le t \le r,\, i \in \mathcal{O}} \beta_{i,t} f_{i,t}$ for some constants $\beta_{i,t}$. This concludes the proof.

Next, we will employ Theorem 4.3.1 to assess the optimality of using minimal polynomials in consensus prediction.
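Before doing so, we note that the conclusion of Theorem 4.3.1 is easy to verify numerically. The sketch below (our own illustration, with generic random data) recovers coefficients $\beta_t$ for a single observed node with $r = D_i$ and checks that $x^* = \sum_t \beta_t x_i(t)$ for a fresh initial state:

```python
import numpy as np

# Numerical illustration of Theorem 4.3.1 with O = {i} and r = D_i:
# the functional pi^T lies in the span of e_i^T W^t, t = 0, ..., r.
rng = np.random.default_rng(4)
N, i = 6, 0
W = rng.random((N, N)); W /= W.sum(axis=1, keepdims=True)   # row-stochastic, positive
pi = np.linalg.matrix_power(W, 2000)[0]                     # pi^T, cf. (4.2)
r = N - 1                                 # for a generic dense W, D_i = N - 1
A = np.array([np.linalg.matrix_power(W, t)[i] for t in range(r + 1)])   # rows e_i^T W^t
beta = np.linalg.lstsq(A.T, pi, rcond=None)[0]              # pi^T = sum_t beta_t e_i^T W^t
x0 = rng.random(N)
obs = A @ x0                              # observations x_i(0), ..., x_i(r)
print(beta @ obs, pi @ x0)                # the two agree for every x(0)
```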
4.3.1 Optimality of $D_i + 1$

Our main result for the case of a single observed node is as follows.

Theorem 4.3.3. Suppose $\mathcal{O} = \{i\} \subset \mathcal{V}$ and $\Theta(t) = \Theta(t-1) \cup \{x_i(t)\}$ for all $t \geq 0$, where $\Theta(-1)$ may contain any information related to W. Then the observation time is always bounded below by $D_i + 1$, regardless of the method used. Furthermore, this bound can be achieved if $q_i \in \Theta(-1)$.

Proof: Suppose $\mathcal{O} = \{1\}$. We prove the lower bound by contradiction: suppose there exist a positive integer $r < D_1$ and a mapping $g : \mathbb{R}^r \to \mathbb{R}$ such that for any $x(0) \in \mathbb{R}^N$,

$$x^* = g(x_1(r), x_1(r-1), \ldots, x_1(0)). \qquad (4.10)$$

Here, g depends on $\Theta(-1)$. By Theorem 4.3.1, we conclude that there exist $\beta_0, \beta_1, \ldots, \beta_r$ such that $x^* = \sum_{i=0}^{r} \beta_i x_1(i)$. Without loss of generality, assume that $\beta_r \neq 0$ (otherwise, we consider $\beta_{r-1}$ and so on). Define $f_t(v) := e_1^T W^t v$ and $f_0(v) := \lim_{t\to\infty} e_1^T W^t v$ for any $v \in \mathbb{R}^N$. Then

$$f_0 = \sum_{0 \le i \le r} \beta_i f_i. \qquad (4.11)$$

Note that for all $t \in \mathbb{Z}_+$ and $v \in \mathbb{R}^N$, we have $f_0(Wv) = f_0(v)$ and $f_t(Wv) = f_{t+1}(v)$, which in view of (4.11) implies that

$$\sum_{0 \le i \le r} \beta_i f_{i+1} = \sum_{0 \le i \le r} \beta_i f_i \iff e_1^T W^{r+1} + \sum_{1 \le i \le r} \frac{\beta_{i-1} - \beta_i}{\beta_r} e_1^T W^i - \frac{\beta_0}{\beta_r} e_1^T = 0^T.$$

As a result, the polynomial $\tilde{q}_1$ given by $\tilde{q}_1(\xi) = \xi^{r+1} + \sum_{i=1}^{r} \beta_r^{-1}(\beta_{i-1} - \beta_i)\xi^i - \beta_r^{-1}\beta_0$ satisfies

$$e_1^T \tilde{q}_1(W) = 0, \quad \text{with } \deg(\tilde{q}_1) = r + 1 < D_1 + 1,$$

which contradicts the fact that the minimal polynomial $q_1$ of node 1 is of degree $D_1 + 1$. This shows that the observation time is always bounded below by $D_1 + 1$. It remains to show that this bound is achieved if $q_1 \in \Theta(-1)$; this is obvious in view of (4.6).

Remark 4.3.4. As we have shown, the shortest time $D_i + 1$ can be achieved if $q_i \in \Theta(D_i)$, in which case the coefficients $\beta_j$ in Theorem 4.3.1 can be determined from $q_i$ as $\beta_j = a_j^{(i)} / \sum_{l=0}^{D_i} a_l^{(i)}$. In the case where $\Theta(-1) = \emptyset$, $D_i$ and the $a_j^{(i)}$ for $j = 0, 1, \ldots, D_i - 1$ become $D_i + 1$ unknowns characterizing $q_i$, and therefore the observer would need $D_i + 1$ additional observations in order to determine these unknowns.

In the following, we consider the case $m \geq 2$ and are interested in quantifying the minimal observation time, conditioned on the initial information $\Theta(-1)$, in terms of the $q_i$. Although we have not yet been able to determine the minimum time, we conjecture the following.

Conjecture 4.3.5. Suppose the observer can monitor the states of a set $\mathcal{O}$ of m nodes and $\Theta(t) = \Theta(t-1) \cup \{x_{\mathcal{O}}(t)\}$ for all $t \geq 0$. Let $T_{\inf}$ denote the least observation time and let $D_{\min} = \min_{i\in\mathcal{O}} D_i$.
(i) If $\Theta(-1) = \{q_i, \forall i \in \mathcal{O}\}$, then $T_{\inf} = D_{\min} + 1$.
(ii) If $\Theta(-1) = \emptyset$, then $T_{\inf} \geq D_{\min} + 2 + \lceil D_{\min}/m \rceil$. (For any $x \in \mathbb{R}$, $\lceil x \rceil$ denotes the least integer greater than or equal to x.)

Remark 4.3.6. Case (i) can be reasoned as follows. Without loss of generality, let $\mathcal{O} = \{1, \ldots, m\}$ and $D_1 = D_{\min}$. Suppose $T_{\inf} \leq D_1$, i.e., $x^*$ can be found by time $t = D_1 - 1$. By Theorem 4.3.1, $x^*$ is then a linear combination of $\{x_i(k),\ 1 \le i \le m,\ 0 \le k \le D_1 - 1\}$. However, it follows from (4.6) that, at time $t = D_1 - 1$, we have m linear equations

$$x^* = \frac{\sum_{k=0}^{D_i} a_k^{(i)} x_i(k)}{\sum_{k=0}^{D_i} a_k^{(i)}}, \quad \forall i \in \mathcal{O},$$

with at least $m + 1$ unknowns, including $x^*$ and $x_{\mathcal{O}}(D_1)$. Thus, in general, $x^*$ is not computable up to time $t = D_1 - 1$.

Remark 4.3.7. Case (ii) of the conjecture is based on our development in the next section, where the idea is to demonstrate that the lower bound on $T_{\inf}$ can be achieved if $q_k$ with $k = \arg\min_{i\in\mathcal{O}} D_i$ can be computed from the observation data up to that time, and if "ideal conditions" (made precise later) hold.

Remark 4.3.8. With a different assumption on $\Theta(-1)$, it is possible that $T_{\inf} < D_{\min}$.
For example, if $\mathcal{O} = \mathcal{V}$ and $\{\pi\} \subseteq \Theta(-1)$, then $x^* = \pi^T x_{\mathcal{O}}(0)$, i.e., $T_{\inf} = 1$.

4.3.2 Local Computation of $q_i$

The minimal polynomial $q_i$ can be computed locally by agent i in many ways. First, let $c^{(i)} = [c_0^{(i)}, c_1^{(i)}, \ldots, c_{D_i}^{(i)}, 1]^T \in \mathbb{R}^{D_i+2}$ denote the vector of coefficients of $q_i$. Then it follows from the definition that

$$0 = e_i^T q_i(W) = \sum_{k=0}^{D_i+1} c_k^{(i)} e_i^T W^k = (c^{(i)})^T O_{D_i+2}^{(i)}, \qquad (4.12)$$

with $O_{D_i+2}^{(i)} = [e_i \;\; W^T e_i \;\; \cdots \;\; (W^{D_i+1})^T e_i]^T$. Observe that $O_{D_i+2}^{(i)}$ has the form of the observability matrix for the pair $(W, e_i^T)$. Therefore, the observer might be able to compute $c^{(i)}$ by constructing $O_k^{(i)}$ and increasing k until $O_k^{(i)}$ loses rank. This particular value of k equals $D_i + 2$, i.e., rank deficiency occurs at time $t = D_i + 1$. Moreover, it should be noted that the construction of $O_k^{(i)}$ need not require knowledge of the entire network. Specifically, let $\mathcal{N}_k^{(i)}$ denote the set of agents connected to node i through a path of length at most k. Then $O_k^{(i)}$ can be determined using the submatrix of W with columns and rows in $\mathcal{N}_k^{(i)}$. This requires an appropriate dynamic information model: $\Theta(t+1) = \Theta(t) \cup \{x_i(t), e_i^T W^t\}$.

A distributed algorithm for computing $q_i$ was also proposed in [71], where the network performs N runs of (4.1) with different initial conditions $x^{(1)}(0), x^{(2)}(0), \ldots, x^{(N)}(0)$, assumed to be linearly independent, each for $N + 1$ time steps. During each run, every node stores its own values. After N runs, every node is able to construct the matrix

$$X_{i,t} = \begin{bmatrix} x_i^{(1)}(0) & x_i^{(1)}(1) & \cdots & x_i^{(1)}(t+1) \\ x_i^{(2)}(0) & x_i^{(2)}(1) & \cdots & x_i^{(2)}(t+1) \\ \vdots & \vdots & \ddots & \vdots \\ x_i^{(N)}(0) & x_i^{(N)}(1) & \cdots & x_i^{(N)}(t+1) \end{bmatrix}$$

where $x_i^{(j)}(t)$ is the value of node i at time t in the j-th run. Then $D_i$ is the smallest positive integer for which $X_{i,D_i}$ is not of full column rank, and the coefficient vector of $q_i$, denoted by $c^{(i)}$, can be found from $X_{i,D_i} c^{(i)} = 0$; see [71] for details.

In [72] the authors presented another algorithm for computing $q_i$ which also uses solely agent i's state values but requires fewer time steps. In particular, let the network run (4.1) for at most $2N + 1$ time steps, starting from an almost arbitrary initial state $x(0)$ (excluding a set of Lebesgue measure zero in $\mathbb{R}^N$). Each node i constructs its Hankel matrix $H_{i,k}$, defined by setting $z_i(k+1) = x_i(k+1) - x_i(k)$ and

$$H_{i,k} := \begin{bmatrix} z_i(1) & z_i(2) & \cdots & z_i(k+1) \\ z_i(2) & z_i(3) & \cdots & z_i(k+2) \\ \vdots & \vdots & \ddots & \vdots \\ z_i(k+1) & z_i(k+2) & \cdots & z_i(2k+1) \end{bmatrix} \qquad (4.13)$$

and finds the first rank-defective matrix $H_{i,k}$ as k increases, namely $H_{i,D_i}$. Then $a^{(i)}$ is computed from $H_{i,D_i} a^{(i)} = 0$. Although it is hard to characterize the set of initial states $x(0)$ (of measure zero) for which this computation scheme fails to provide $a^{(i)}$, practical techniques to alleviate the problem are available; see, e.g., [128]. More importantly, this approach in general provides a minimum time of $2(D_i + 1)$ for consensus prediction in the scenario where $\Theta(t+1) = \Theta(t) \cup \{x_i(t)\}$ with $\Theta(-1) = \emptyset$. See [72] for further details; a minimal sketch of this scheme is given below.

We remark that the idea of using the Hankel matrix $H_{i,k}$ to compute $a^{(i)}$ in fact has its roots in realization theory [73-75]. Here, finding $q_i$ can be regarded as a network identification problem. Armed with this view, in the next section we build our algorithms on the previous approach by employing block-Hankel matrices in order to reduce the observation time needed to compute $q_i$.
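The promised sketch of the Hankel-based scheme of [72] follows (our own illustrative code, under the generic-initial-state assumption): node i uses only its own samples to find $D_i$, $a^{(i)}$, and then $x^*$.

```python
import numpy as np

# Sketch of the scheme of [72]: build H_{i,k} from one node's trajectory,
# find the first rank-defective H_{i,k} (k = D_i), solve H a = 0, predict x*.
rng = np.random.default_rng(2)
N = 6
W = rng.random((N, N)); W /= W.sum(axis=1, keepdims=True)   # row-stochastic
x = [rng.random(N)]
for _ in range(2 * N + 1):                 # at most 2N + 1 iterations suffice
    x.append(W @ x[-1])
xi = np.array([v[0] for v in x])           # observed node i = 0
z = np.diff(xi)                            # z_i(k+1) = x_i(k+1) - x_i(k)

def hankel(z, k):
    """H_{i,k} as in (4.13), built from z_i(1), ..., z_i(2k+1)."""
    return np.array([[z[r + c] for c in range(k + 1)] for r in range(k + 1)])

k = 1
while np.linalg.matrix_rank(hankel(z, k), tol=1e-9) == k + 1:
    k += 1                                 # first rank deficiency occurs at k = D_i
D = k
_, _, Vt = np.linalg.svd(hankel(z, D))     # a^(i) spans the null space of H_{i,D}
a = Vt[-1] / Vt[-1][-1]                    # normalize so that a_D = 1, cf. (4.4)
x_star = (a @ xi[:D + 1]) / a.sum()        # consensus prediction (4.6)
print(x_star)
```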
4.4 Toward Minimizing Observation Time

In this section, we present partial solutions to the problem described in Section 4.2.1. Clearly, when $m = 1$, it follows from Section 4.3 (see Theorem 4.3.3) that the solution is given by $\mathcal{O} = \arg\min_{i\in\mathcal{V}} D_i$. Thus, in the following we consider $m \geq 2$. Our main idea is to make use of the available information to construct block-Hankel matrices instead of (4.13). We will consider two cases: (1) the minimal polynomials of the observed nodes are identical, i.e., $q_i = q_j$ for all $i, j \in \mathcal{O}$; and (2) they are not, i.e., $q_i \neq q_j$ for some $i, j \in \mathcal{O}$. In either case, we do not assume that $q_i \in \Theta(-1)$ for $i \in \mathcal{O}$.

4.4.1 Observed Nodes with Identical Minimal Polynomials

Let $q_i =: q$, where $q(\xi) = (\xi - 1)\sum_{k=0}^{D} a_k \xi^k$ with $a_D = 1$ and $a := [a_0, a_1, \ldots, a_D]^T \in \mathbb{R}^{D+1}$, which is not assumed to be available to the observer at the initial time. For any subset $\mathcal{S} \subseteq \mathcal{V}$, define $z_{\mathcal{S}}(t) = x_{\mathcal{S}}(t) - x_{\mathcal{S}}(t-1)$ for $t \geq 1$. For any sequence $\{\mathcal{O}_1, \mathcal{O}_2, \ldots, \mathcal{O}_p\}$, define the block-Hankel matrix

$$M_{p,D}(\{\mathcal{O}_i\}) = \begin{bmatrix} z_{\mathcal{O}_1}(1) & z_{\mathcal{O}_1}(2) & \cdots & z_{\mathcal{O}_1}(D+1) \\ z_{\mathcal{O}_2}(2) & z_{\mathcal{O}_2}(3) & \cdots & z_{\mathcal{O}_2}(D+2) \\ \vdots & \vdots & \ddots & \vdots \\ z_{\mathcal{O}_p}(p) & z_{\mathcal{O}_p}(p+1) & \cdots & z_{\mathcal{O}_p}(D+p) \end{bmatrix} \qquad (4.14)$$

Important properties of this matrix are given next.

Theorem 4.4.1. There exist $p \in \mathbb{Z}_+$ and a sequence $\{\mathcal{O}_i \mid i = 1, \ldots, p,\ \mathcal{O}_i \subset \mathcal{O}\}$ such that the following hold:

$$\mathrm{rank}(M_{p,D}(\{\mathcal{O}_i\})) = \mathrm{rank}(M_{p,D-1}(\{\mathcal{O}_i\})) = D, \qquad (4.15)$$
$$M_{p,D}(\{\mathcal{O}_i\})\, a = 0. \qquad (4.16)$$

Proof. It follows from Section 4.3.2 that for all $i \in \mathcal{O}$,

$$\mathrm{rank}(H_{i,D}) = \mathrm{rank}(H_{i,D-1}) = D, \qquad H_{i,D}\, a = 0, \qquad (4.17)$$

where $H_{i,D} \in \mathbb{R}^{(D+1)\times(D+1)}$ is the Hankel matrix given by (4.13). Let $H_{\mathcal{O},D} = [H_{1,D}^T \; H_{2,D}^T \; \cdots \; H_{m,D}^T]^T$. It then follows from (4.17) that $\mathrm{rank}(H_{\mathcal{O},D}) = \mathrm{rank}(H_{\mathcal{O},D-1}) = D$ and $H_{\mathcal{O},D}\, a = 0$. Now choose $p = D + 1$, $\mathcal{O}_i = \mathcal{O}$ for $i = 1, \ldots, p$, and construct $M_{p,D}(\{\mathcal{O}_i\})$ as in (4.14). It is easy to see that $M_{p,D}(\{\mathcal{O}_i\})$ has the same rows as $H_{\mathcal{O},D}$, only in a different order. Thus the claim follows.

As a result, once a sequence $\{\mathcal{O}_i\}_{i=1}^{p}$ satisfying (4.15) is found, a can be determined from (4.16) and (4.4). Then the consensus value $x^*$ can be computed as in (4.6), i.e.,

$$x^* = \frac{\sum_{k=0}^{D} a_k x_i(k)}{\sum_{k=0}^{D} a_k}, \quad \forall i \in \mathcal{O}. \qquad (4.18)$$

It is important to note that the number of time steps needed to construct $M_{p,D}(\{\mathcal{O}_i\})$ in (4.16), denoted by $T_c$, is given by

$$T_c = D + p + 1. \qquad (4.19)$$

Clearly, $T_c \geq D + 2$. Now define $p^* := \min\{p : (4.15) \text{ holds}\}$ and $T^* := D + 1 + p^*$. Thus $T^*$ is the minimum time needed for the observer to compute $x^*$. Note that $p^*$ depends on the choice of $\{\mathcal{O}_i\}_1^{p^*}$. Our next result provides general bounds on this value.

Theorem 4.4.2. The following hold: $D + 1 \geq p^* \geq \lceil D/m \rceil$.

Proof. The first inequality follows from the choice of $p = D + 1$ and the sequence $\{\mathcal{O}_i\}$ used to construct a particular $M_{p,D}(\{\mathcal{O}_i\})$ in the proof of Theorem 4.4.1. We now show the lower bound. For any $p \in \mathbb{Z}_+$ and $\{\mathcal{O}_i\}$ satisfying (4.15), since $\mathrm{rank}(M_{p,D}(\{\mathcal{O}_i\})) = D$ and $\mathcal{O}_i \subseteq \mathcal{O}$ for all $i = 1, \ldots, p$, it follows that

$$D \le \sum_{1 \le i \le p} \mathrm{card}(\mathcal{O}_i) \le \sum_{1 \le i \le p} \mathrm{card}(\mathcal{O}) = pm.$$

Thus $p \geq \lceil D/m \rceil$, and the second inequality follows.

This result implies that $D + 1 + \lceil D/m \rceil \leq T^*$, which agrees with Conjecture 4.3.5; here the lower bound is smaller than that in case (ii) of the conjecture by 1, since achieving it implicitly assumes knowledge of D. We can obtain a sharper bound for any initial state $x(0)$, except for a set of Lebesgue measure zero, as follows.

Proposition 4.4.3. Suppose the self-weights satisfy $w_{ii} > 0$ for all $i \in \mathcal{O}$. Then for any initial state $x(0)$ except for a set of measure zero, the following hold:
(i) If $m \geq D + 1$, then $p^* = 1$.
(ii) If $m \leq D$, then $D + 1 - m \geq p^*$.

Proof. [Sketch] First, note that if $w_{ii} > 0$ for all $i \in \mathcal{O}$, then it can be seen that $\mathrm{rank}(M_{1,D}(\mathcal{O})) = \mathrm{rank}(M_{1,D-1}(\mathcal{O})) = \min(D, m)$ almost surely.
Thus, if $m \geq D + 1$, then $\mathrm{rank}(M_{1,D}(\mathcal{O})) = \mathrm{rank}(M_{1,D-1}(\mathcal{O})) = D$, and hence $p^* = 1$. Now, if $m \leq D$, choose $p = D + 1 - m$ and $\mathcal{O}_1 = \mathcal{O}$. There must exist $j \in \mathcal{O}$ such that if $\mathcal{O}_2 = \cdots = \mathcal{O}_p = \{j\}$, then $\mathrm{rank}(M_{p,D}(\{\mathcal{O}_i\})) = \mathrm{rank}(H_{j,D}) = \mathrm{rank}(H_{j,D-1}) = \mathrm{rank}(M_{p,D-1}(\{\mathcal{O}_i\}))$ for any $x(0)$ except for a set of measure zero. This implies that $p^* \leq p = D + 1 - m$.

Next, we note that the set of sequences $\{\mathcal{O}_i\}_1^p$ that satisfy (4.15) and achieve $p^*$ includes a special one, namely $\{\mathcal{O}_i \mid \mathcal{O}_i = \mathcal{O},\ \forall i = 1, \ldots, p\}$. In fact, by defining $M_{p,d} := M_{p,d}(\{\mathcal{O}_i \mid \mathcal{O}_i = \mathcal{O},\ \forall i = 1, \ldots, p\})$ for any $p, d \geq 1$, we have the following.

Theorem 4.4.4. Suppose $\{\mathcal{O}_i\}_{i=1}^{p}$ is a sequence such that $p \geq p^*$ and $\{\mathcal{O}_i\}_{i=1}^{p^*}$ satisfies (4.15). Then for any $d \geq D$,

$$\mathrm{rank}\big(M_{p,d}(\{\mathcal{O}_i\}_1^p)\big) = \mathrm{rank}(M_{p^*,D-1}) = D. \qquad (4.20)$$

Proof. The proof follows from (4.15) and the definition of $p^*$.

Condition (4.20) allows us to construct Algorithm 4.1 below, to be implemented by the observer to find a and $x^*$, assuming knowledge of D (in addition to the condition that $q_i = q_j$ for all $i, j \in \mathcal{O}$). Starting from $p = 1$, the observer repeatedly increases p and checks whether $\mathrm{rank}(M_{p,D}) = D$, i.e., whether $p^*$ has been found.

Algorithm 4.1: Compute a and $x^*$ for the case of identical minimal polynomials, with knowledge of D
  Data: the set $\mathcal{O}$, $m = \mathrm{card}(\mathcal{O})$, and D
  1: init $p \leftarrow 1$
  2: while $\mathrm{rank}(M_{p,D}) < D$ do
  3:   $p \leftarrow p + 1$
  4: compute a and $x^*$ using (4.16) and (4.18)

In the case where D is not available in advance, we can find D as the first value of d such that

$$d = \mathrm{rank}(M_{p,d-1}) = \mathrm{rank}(M_{p,d}) = \mathrm{rank}(M_{p+1,d}). \qquad (4.21)$$

Here, we want to find the first column-rank-defective matrix $M_{p,d}(\{\mathcal{O}\}^p)$ as p and d increase appropriately. Based on this condition, we propose Algorithm 4.2 to determine a and $x^*$ without knowledge of D.

Algorithm 4.2: Compute a and $x^*$ for the case of identical minimal polynomials, without knowing D
  Data: the set $\mathcal{O}$, $m = \mathrm{card}(\mathcal{O})$
  1: init $d \leftarrow 1$; $p \leftarrow 1$
  2: while (4.21) is not met do
  3:   increase d and/or p
  4: compute a and $x^*$ using (4.16) and (4.18)

Remark 4.4.5. Algorithm 4.2 requires observation time $T^* = D + 2 + p^* \geq D + 2 + \lceil D/m \rceil$, since it uses $M_{p+1,D}$. Moreover, when $m = 1$, the algorithm coincides with that of [72], summarized in Subsection 4.3.2 above.

4.4.2 Observed Nodes with Different Minimal Polynomials

For any $\mathcal{S} \subseteq \mathcal{O}$, let $q_{\mathcal{S}}$ denote the least common multiple of $\{q_i, i \in \mathcal{S}\}$, which can be regarded as the joint minimal polynomial of the set $\mathcal{S}$. Define $D_{\mathcal{S}} := \deg(q_{\mathcal{S}}) - 1$. Since 1 is a simple root of each $q_i$, it is also a simple root of $q_{\mathcal{S}}$. Hence, there exists $a \in \mathbb{R}^{D_{\mathcal{S}}+1}$ such that

$$q_{\mathcal{S}}(\xi) = (\xi - 1)\sum_{0 \le k \le D_{\mathcal{S}}} a_k \xi^k, \qquad \sum_{0 \le k \le D_{\mathcal{S}}} a_k \ne 0, \qquad a_{D_{\mathcal{S}}} = 1.$$

Here a also depends on $\mathcal{S}$. Now, using Algorithm 4.1 or 4.2 above, the observer can determine a and thus $x^*$ as if $q_i = q_j = q_{\mathcal{S}}$, which requires a minimum observation time denoted by $T^*(\mathcal{S})$. Therefore, for a given set $\mathcal{O}$, the (sub)optimal observation time is

$$T_{\mathcal{O}}^* = \min\{T^*(\mathcal{S}) : \mathcal{S} \subseteq \mathcal{O}\}. \qquad (4.22)$$

This is clearly a combinatorial problem, whose solution may be hard to find exactly, especially when $\Theta(-1) = \emptyset$. If $q_i \in \Theta(-1)$, then we can resort to a greedy algorithm. In any case, $T_{\mathcal{O}}^*$ is upper bounded by $\min_{i\in\mathcal{O}}\{T^*(\{i\})\}$ and by $T^*(\mathcal{O})$, which are easier to compute.

To conclude this section, we remark that in the algorithms in [71, 72], each agent i uses only its own opinion history to compute $x^*$, and thus the best observation time is $2D_i + 2$. Our results assert that if each agent i also functions as an observer, then the consensus value can be predicted in fewer iterations.
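For concreteness, the following is a compact sketch of Algorithm 4.1 (our own illustrative code), run on the ring network that will reappear in Example 4.5.1 below, for which D = 5 and the observed nodes have identical minimal polynomials:

```python
import numpy as np

# Sketch of Algorithm 4.1: grow the block-Hankel matrix M_{p,D} of (4.14),
# with O_1 = ... = O_p = O, until its rank reaches D; then recover a and x*.
rng = np.random.default_rng(3)
N, D = 10, 5                                 # ring of Example 4.5.1: D = 5
W = 0.8 * np.eye(N) + 0.1 * (np.roll(np.eye(N), 1, 1) + np.roll(np.eye(N), -1, 1))
O = [0, 1, 2]                                # m = 3 observed nodes
x = [rng.random(N)]
for _ in range(3 * N):
    x.append(W @ x[-1])
z = np.diff(np.array(x)[:, O], axis=0)       # z[k] = z_O(k+1), one row per time step

def M(p, d):
    """Block-Hankel M_{p,d} with all O_i = O, cf. (4.14)."""
    return np.vstack([z[i : i + d + 1, :].T for i in range(p)])

p = 1
while np.linalg.matrix_rank(M(p, D), tol=1e-9) < D:
    p += 1                                   # p* = smallest p with rank D
_, _, Vt = np.linalg.svd(M(p, D))
a = Vt[-1] / Vt[-1][-1]                      # M a = 0, normalized so a_D = 1
x_star = (a @ np.array(x)[:D + 1, O[0]]) / a.sum()    # prediction (4.18)
print(f"p* = {p}, observation time T = {D + 1 + p}, x* = {x_star:.4f}")
```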
4.5 Numerical Examples

4.5.1 Example 1: Network with Identical Minimal Polynomials

Consider a ring network of N = 10 agents with the circulant weight matrix

$$W = \begin{bmatrix} .8 & .1 & 0 & \cdots & 0 & .1 \\ .1 & .8 & .1 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ .1 & 0 & 0 & \cdots & .1 & .8 \end{bmatrix} \in \mathbb{R}^{10\times 10}$$

and with (randomly generated) initial opinions

$$x(0) = [0.9797, 0.2848, 0.5949, 0.9621, 0.1857, 0.1930, 0.3416, 0.9329, 0.3906, 0.2732]^T.$$

It can be seen that $\pi = \frac{1}{10}\mathbf{1}$, and thus the consensus value is $x^* = \pi^T x(0) = 0.5139$. Moreover, $q_i = q_j$ for all $i, j \in \mathcal{V}$, due to the symmetry of the network and the weight matrix.

First, consider the scenario where each node $i \in \mathcal{V}$ wishes to compute $x^*$ from its local information and $\Theta_i(-1) = \emptyset$. Using the algorithm in [72], any agent can find D = 5 and compute $x^*$ after $2D + 2 = 12$ time steps, whereas, by using Algorithm 4.2, each agent (having 2 neighbors, hence m = 3) can find D = 5 and compute $x^*$ after $T_c = 9$ time steps; see also Remark 4.4.5.

Next, consider the case where an observer knows D and can monitor m agents in the network. The results in Table 4.1 hold for any choice of $\mathcal{O}$. (Note also that $T_c \geq D + 2$; see (4.19).) Here m = D is the smallest number of observed nodes that also gives the minimum observation time.

Table 4.1: Observation times using Algorithm 4.1.
m     1   2   3   4   5   6   7   8   9   10
T_c   11  9   8   8   7   7   7   7   7   7

4.5.2 Example 2: Network with Different Minimal Polynomials

Consider the graph given in Figure 4.1 (from [71]). In this example,

$$q_1(\xi) = q_2(\xi) = q_3(\xi) = (\xi - 1)(\xi - \lambda_1)(\xi - \lambda_2)(\xi - \lambda_3),$$
$$q_4(\xi) = (\xi - \lambda_4)\, q_1(\xi), \qquad q_5(\xi) = q_6(\xi) = (\xi - \lambda_5)\, q_4(\xi).$$

Figure 4.1: Network example 2. Self-weights are not shown.

In Table 4.2, we compare the observation times obtained by the algorithm in [72] with those obtained by Algorithm 4.2, where each node monitors its neighbors and naively uses Algorithm 4.2. It is interesting to see that the observation time of node 1 is longer than that of node 2, although node 1 has more neighbors. The reason is that node 1 uses information from nodes 5 and 6, which have the largest observation time among all nodes (or, to be precise, the joint minimal polynomial of nodes 1, 5 and 6 is $q_6$, which is of highest degree).

Table 4.2: Observation time for each node to compute the consensus value in Example 2.
Node   T_c by [72]   T_c by Alg. 4.2   Observed nodes
1      8             8                 {1,2,4,5,6}
2      8             6                 {1,2,3}
3      8             7                 {2,3}
4      10            9                 {1,4}
5      12            10                {1,5,6}
6      12            10                {1,5,6}

Finally, suppose that the observer is able to select any m nodes to monitor. Let $\mathcal{O}^* = \arg\min_{\mathcal{O}\subseteq\mathcal{V},\, \mathrm{card}(\mathcal{O})=m} T_{\mathcal{O}}^*$, i.e., the set that gives the optimal observation time. The results in Table 4.3 were obtained by brute-force computation. Thus, the best choice is $\mathcal{O}^* = \{1, 2, 3\}$, which attains the minimum time $T^* = 6$ while monitoring only 3 nodes.

Table 4.3: Optimal time $T^*$ when the observer can choose any m nodes.
m   T*   O*
1   8    {1}, {2}, {3}
2   7    {1, 2}
3   6    {1, 2, 3}
4   6    {1, 2, 3, 4}
5   6    {1, 2, 3, 4, 5}
6   6    {1, 2, 3, 4, 5, 6}

4.6 Toward Selecting Observed Nodes

Recall that the second question in Section 4.2.1 concerns the optimal selection of observed nodes. Based on the previous section, a (sub)optimal solution to this problem is given by $\mathcal{O}^* = \arg\min_{\mathcal{O}\subseteq\mathcal{V},\, \mathrm{card}(\mathcal{O})=m} T_{\mathcal{O}}^*$. The exact optimal solution is not obvious and is left for future work. Here, instead, we give heuristic descriptions of $\mathcal{O}^*$, including: (A) the $q_i$ should be similar; (B) the $\deg(q_i)$ should be small; (C) $p^*$ should be close to $\frac{D_{\mathcal{O}^*}}{m}$. Note that the degree of $q_i$ and the relationships between $q_i$ and $q_j$ depend not only on the structure of the network, but also on the weight matrix.
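These dependencies can be probed numerically. By Proposition 4.6.1 below, $\deg(q_i)$ equals the rank of the observability matrix of the pair $(e_i^T, W)$ in (4.12); the following sketch (our own illustration) computes it for every node of the ring of Example 4.5.1:

```python
import numpy as np

def min_poly_degree(W, i, tol=1e-9):
    """deg(q_i): the rank of the observability matrix of (e_i^T, W), cf. (4.12)."""
    N = len(W)
    rows = [np.eye(N)[i]]
    for _ in range(N - 1):
        rows.append(rows[-1] @ W)          # e_i^T W^k, k = 0, ..., N-1
    return np.linalg.matrix_rank(np.array(rows), tol=tol)

# ring of Example 4.5.1: all degrees are equal, with deg(q_i) = 6, i.e., D_i = 5
N = 10
W = 0.8 * np.eye(N) + 0.1 * (np.roll(np.eye(N), 1, 1) + np.roll(np.eye(N), -1, 1))
print([min_poly_degree(W, i) for i in range(N)])
```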
To have a closer look at the minimal polynomial of a node, let us consider the following. Let $c^{(i)} = [c_0^{(i)}, c_1^{(i)}, \ldots, c_{D_i}^{(i)}, 1]^T \in \mathbb{R}^{D_i+2}$ be the vector of coefficients of $q_i$, i.e., $q_i(\xi) = \xi^{D_i+1} + \sum_{k=0}^{D_i} c_k^{(i)} \xi^k$. From the definition,

$$0^T = e_i^T q_i(W) = \sum_{k=0}^{D_i+1} c_k^{(i)} e_i^T W^k = (c^{(i)})^T \begin{bmatrix} e_i^T \\ e_i^T W \\ \vdots \\ e_i^T W^{D_i+1} \end{bmatrix} \qquad (4.23)$$

Thus, it can be shown that:

Proposition 4.6.1. $\deg(q_i)$ is the observability index of the pair $(e_i^T, W)$.

Let us revisit Example 2 of the previous section. We keep the network structure but consider the following two weight matrices: $W_1$ corresponds to equal neighbor weights, and $W_2$ is randomly generated:

$$W_1 = \begin{bmatrix} 1/5 & 1/5 & 0 & 1/5 & 1/5 & 1/5 \\ 1/3 & 1/3 & 1/3 & 0 & 0 & 0 \\ 0 & 1/2 & 1/2 & 0 & 0 & 0 \\ 1/2 & 0 & 0 & 1/2 & 0 & 0 \\ 1/3 & 0 & 0 & 0 & 1/3 & 1/3 \\ 1/3 & 0 & 0 & 0 & 1/3 & 1/3 \end{bmatrix} \qquad W_2 = \begin{bmatrix} .19 & .24 & 0 & .13 & .17 & .27 \\ .54 & .15 & .31 & 0 & 0 & 0 \\ 0 & .27 & .73 & 0 & 0 & 0 \\ .63 & 0 & 0 & .37 & 0 & 0 \\ .22 & 0 & 0 & 0 & .35 & .45 \\ .32 & 0 & 0 & 0 & .28 & .40 \end{bmatrix}$$

In the case of $W_1$, we have $q_1 = q_2 = q_3 = q_4$ with $\deg(q_1) = 5$, and $q_5 = q_6 = q_{W_1}$ with $\deg(q_5) = 6$. In the case of $W_2$, we have $q_1 = q_2 = \cdots = q_6 = q_{W_2}$ with $\deg(q_{W_2}) = 6$.

Clearly, changing the agents' weights can change the agents' minimal polynomials as well as their degrees. However, it is also apparent that certain properties of $q_i$ are pertinent to the network structure and are thus related to the concept of structural observability. This direction of investigation is left for future work.

In the following, we restrict ourselves to the class of undirected graphs and explore necessary and/or sufficient conditions, in graph-theoretic terms, for meeting descriptions (A)-(C) above. We also assume that the weight matrix is given by

$$W = I - \epsilon L \qquad (4.24)$$

where $L := D_{in} - A$ is the Laplacian matrix, A is the adjacency matrix, $D_{in} = \mathrm{diag}(A\mathbf{1})$ is the in-degree matrix, and $\epsilon \in (0, \min_i [D_{in}]_{ii}^{-1})$ (which ensures that W is a positive weight matrix). Next we introduce some graph notions [129].

Definition 4.6.2. (Automorphism) An automorphism of $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is a permutation $\psi$ of $\mathcal{V}$ such that

$$(\psi(i), \psi(j)) \in \mathcal{E} \iff (i, j) \in \mathcal{E}. \qquad (4.25)$$

Proposition 4.6.3. ([129]) Let A be the adjacency matrix of the graph G and $\psi$ a permutation on its node set $\mathcal{V}$. Associate with this permutation the permutation matrix P. Then $\psi$ is an automorphism of G if and only if $PA = AP$.

In the following, a partition of the graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is denoted by $\mathcal{C} = \{\mathcal{C}_1, \ldots, \mathcal{C}_k\}$ for some appropriate k, where the $\mathcal{C}_i$ are called cells.

Definition 4.6.4. (Almost Equitable Partition, AEP) Suppose $\mathcal{C} = \{\mathcal{C}_1, \ldots, \mathcal{C}_k\}$ is a partition of a graph G.
- C is said to be almost equitable if each node in $\mathcal{C}_i$ has the same number of neighbors in $\mathcal{C}_j$, for all $i, j \in \{1, \ldots, k\}$, $i \neq j$.
- C is said to be almost equitable w.r.t. node v if C is an AEP and $\{v\} \in \mathcal{C}$.
- The minimum AEP w.r.t. node v, denoted by $\mathcal{C}_v^*$, is an AEP such that $\{v\} \in \mathcal{C}$ and $\mathrm{card}(\mathcal{C}_v^*)$ is minimal.

Definition 4.6.5. (Distance Regular Graph) An undirected graph G is said to be regular if $\deg(i) = \deg(j)$ for all $i, j \in \mathcal{V}$. It is called distance-regular if it is regular and, for any pair of nodes $u, v \in \mathcal{V}$ with $\mathrm{dist}(u, v) = i$, $0 < i < \mathrm{diam}(\mathcal{G})$, there exist numbers $f_i$ and $g_i$ such that v has exactly $f_i$ neighbors at distance $i - 1$ from u and $g_i$ neighbors at distance $i + 1$ from u.

4.6.1 When Is $q_i = q_j$?

We have the following result.

Proposition 4.6.6. If there exists an automorphism $\psi$ of G such that $\psi(i) = j$, then $q_i = q_j$. The converse is not true.

Proof. Suppose there exists an automorphism $\psi$ of G such that $\psi(i) = j$.
Let P be the permutation matrix associated with $\psi$. By Proposition 4.6.3, we have $P^T A = A P^T$ and thus $P^T D_{in} = D_{in} P^T$. Then

$$W P^T = (I - \epsilon(D_{in} - A)) P^T = P^T - \epsilon(D_{in} P^T - A P^T) = P^T - \epsilon(P^T D_{in} - P^T A) = P^T W.$$

From this, it can be shown that $W^k P^T = P^T W^k$ for any integer $k \geq 1$. Note also that $P e_i = e_j$. Then, multiplying both sides of (4.23) by $P^T$ yields

$$0^T = (c^{(i)})^T \begin{bmatrix} e_i^T P^T \\ e_i^T W P^T \\ \vdots \\ e_i^T W^{D_i+1} P^T \end{bmatrix} = (c^{(i)})^T \begin{bmatrix} e_i^T P^T \\ e_i^T P^T W \\ \vdots \\ e_i^T P^T W^{D_i+1} \end{bmatrix} = (c^{(i)})^T \begin{bmatrix} e_j^T \\ e_j^T W \\ \vdots \\ e_j^T W^{D_i+1} \end{bmatrix}$$

Therefore, $e_j^T q_i(W) = 0$. Since $q_j$ is the minimal polynomial of node j, it follows that $q_j \mid q_i$. Next, we note that since $\psi$ is an automorphism, $\psi^{-1}$ exists and corresponds to the permutation matrix $P^T$. That is, $\psi^{-1}(j) = i$, or $P^T e_j = e_i$. Applying the same argument as above, we obtain $q_i \mid q_j$. Therefore, $q_i = q_j$. The converse is not true; see Section 4.5.2 for a counterexample, where $q_1 = q_2$ but there is no automorphism mapping node 1 to node 2.

This proposition only provides a sufficient condition. Consider again Example 4.5.2: using this proposition, we can only conclude that $q_5 = q_6$, but not that $q_1 = q_2 = q_3$. The result is useful when the graph is highly symmetric; e.g., for the ring example in Subsection 4.5.1, it immediately yields $q_i = q_j$ for all $i, j \in \mathcal{V}$. We will show in the next subsection that for distance regular graphs, all the agents' minimal polynomials are the same; see Proposition 4.6.9.

4.6.2 Bounds on $\deg(q_i)$

Using the fact that $\deg(q_i)$ is the observability index of the pair $(e_i^T, W)$, we can obtain bounds on the degree of $q_i$ as follows.

Theorem 4.6.7. [72, 130, 131] If G is connected and undirected, it holds that

$$\mathrm{diam}(\mathcal{G}, i) + 1 \le \deg(q_i) \le \mathrm{card}(\mathcal{C}_i^*) \qquad (4.26)$$

where $\mathrm{diam}(\mathcal{G}, i) = \max_{v\in\mathcal{V}} \mathrm{dist}(i, v)$ is the longest distance from node i to any other node, and $\mathcal{C}_i^*$ denotes the minimum almost equitable partition w.r.t. node i.

In some special cases, for example distance regular graphs, the upper and lower bounds coincide. Examples of distance regular graphs include cycles, hypercubes and complete graphs.

Theorem 4.6.8. [132] If G is a distance regular graph, then

$$\mathrm{diam}(\mathcal{G}) + 1 = \deg(q_i) = \mathrm{card}(\mathcal{C}_i^*), \quad \forall i \in \mathcal{V}. \qquad (4.27)$$

Based on this result, we have the following.

Proposition 4.6.9. If G is a distance regular graph, then

$$q_i = q_j, \quad \forall i, j \in \mathcal{V}. \qquad (4.28)$$

Proof. It is well known that the adjacency matrix A of a distance regular graph has $\mathrm{diam}(\mathcal{G}) + 1$ distinct eigenvalues [129]. Thus, so does W, since $W = I - \epsilon(D_{in} - A) = (I - \epsilon D_{in}) + \epsilon A$ and $D_{in} = f_0 I$ for some $f_0$. Moreover, the minimal polynomial $q_W$ of W is of degree $\mathrm{diam}(\mathcal{G}) + 1$ with $\mathrm{diam}(\mathcal{G}) + 1$ distinct roots. Now, by Theorem 4.6.8, $\deg(q_i) = D_i + 1 = \mathrm{diam}(\mathcal{G}) + 1$ for all $i \in \mathcal{V}$. Moreover, $q_i \mid q_W$ for all $i \in \mathcal{V}$. Therefore, $q_i = q_W$ for all $i \in \mathcal{V}$, and the proposition follows.

4.7 Limitations and Future Work

The main drawbacks of the method presented in this chapter are as follows. First, it can only be applied to networks with fixed topology and linear time-invariant dynamics. Second, the computational accuracy of the rank of a (block-)Hankel matrix does not scale well with the size of the matrix. Third, since the predicted value is a limit as time tends to infinity, the accuracy is sensitive to computational error, and the method may not work well in the presence of observation noise and/or communication delays. Extensions and further developments to overcome these limitations are thus important directions for future work.
Among other possible directions for future work, besides resolving the validity of Conjecture 4.3.5, we note a possible application of consensus prediction in network monitoring for misbehavior; see Section 7.2.

Part III: Distributed Optimization

Chapter 5: Local Prediction for Enhanced Convergence of Distributed Optimization Algorithms

Abstract: This chapter studies distributed optimization problems where a network of agents seeks to minimize the sum of their private cost functions. Algorithms are proposed that build on past consensus-based distributed optimization algorithms by incorporating a local predictive step as developed in Chapter 4. The algorithms involve the introduction of local optimization variables at the network nodes, alongside the original local node states. In the first algorithm, the local optimization variables are updated cyclically through a subgradient step, while the opinion variables follow a traditional consensus protocol periodically interrupted by a predictive consensus estimate reset operation. For convex cost functions with bounded subdifferentials, this algorithm is guaranteed to converge to within some range of the optimal value if a constant step size is used, or to the optimal value if a diminishing step size is in place. For differentiable cost functions whose sum is convex and has a Lipschitz continuous gradient, convergence to the optimal value can be ensured when using a constant step size, even if some of the individual cost functions are nonconvex. In addition, exponential convergence to the optimal solution is achieved when the private cost functions are further assumed to be strongly convex. In these cases, each optimization variable behaves like the centralized subgradient method, except on a slower time scale. The last two algorithms are specialized to the case of quadratic cost functions and converge in finite time to the optimal solution or to a neighborhood of arbitrarily small size. Simulation examples are given to illustrate the algorithms.

5.1 Introduction

We consider a network of N agents, without a central coordinating unit, aiming to cooperatively solve the global optimization problem

$$\min_{x \in X} F(x) := \sum_{i=1}^{N} f_i(x) \qquad (5.1)$$

where $f_i : \mathbb{R} \to \mathbb{R}$ represents the private cost function of agent i, X is a nonempty constraint set known to all the agents, and it is assumed that each agent is able to communicate with its direct neighbors.

Solving this problem in a distributed fashion calls for strategies of cooperation among all the agents in the network. In this regard, many distributed algorithms have been developed; see, e.g., [32, 42, 76-78, 80-85, 133] and references therein. Among them, the class of distributed (sub)gradient-based algorithms is well known for its simplicity of implementation and the generally mild assumptions imposed on the local cost functions and the network topology. In particular, this class requires each function $f_i$ to be (at least) convex, and usually with bounded subgradients or Lipschitz continuous gradient.

Major limitations of algorithms in this category are also well known. First, the convergence of many algorithms depends on the choice of step size sequence. When a constant step size is used, both Distributed Gradient Descent and Distributed Subgradient methods only yield convergence to a neighborhood of the optimal solution and of the optimal value [78, 86].
This occurs even if the $f_i$ are strongly convex and have Lipschitz continuous gradients, and is thus one of the main differences between these methods and their centralized counterparts. This motivates the use of particular diminishing or adaptive step sizes to achieve asymptotic convergence. However, the convergence rate can then be very slow (compared to that of the centralized method), depending on the step size sequence, whose appropriate selection is not trivial. Second, many incremental subgradient methods require the agents to construct a closed cycle in order to pass an estimate of the solution around the network; see, e.g., [32, 88, 89]. Third, even when asymptotic convergence is guaranteed, it is not obvious how each agent can locally decide when to stop the algorithm without affecting other agents' estimates. Put differently, there are no simple criteria for all the agents to stop at the same time while also sharing the same estimate of an optimal solution. This is also true for most (if not all) other distributed optimization methods.

When all the local cost functions are quadratic, many other consensus-based algorithms can outperform those in the subgradient class. For example, the ratio consensus method can be used to solve problem (5.1) without constraints and converges exponentially [34, 90]. Based on this method, [91] proposed a Newton-Raphson-like algorithm which also converges asymptotically for a class of functions having continuous, strictly positive and bounded second derivatives, assuming a sufficiently small discretization step.

Our main contributions are as follows. We propose and study a new distributed optimization technique for solving (5.1) on a fixed and directed network; the new technique involves the use of a distributed prediction scheme based on the node minimal polynomials introduced in Chapter 4. Specifically, we first present a distributed subgradient-type algorithm for the general setting of problem (5.1) but without constraints (i.e., $X = \mathbb{R}$), which we show has a convergence rate similar to that of the centralized (sub)gradient method. In fact, the convergence rate to the optimal value is $O(\ln(t)/\sqrt{t})$ under a diminishing step size for the case where the cost functions $f_i$ are convex and have bounded subgradients, while for the case in which the total cost function F is convex and has a Lipschitz continuous gradient, it is $O(1/t)$ under a constant step size. In the former case, an optimal choice of the step size can yield a rate $O(1/\sqrt{t})$, which is the best achievable for both centralized and distributed subgradient methods [117, 134]. In the latter case, if F is further assumed to be strongly convex, then we obtain exponential convergence both to the optimal value and to the optimal solution. The performance of our algorithm also resembles that of the centralized subgradient method in that all the agents, in finite time, agree on an identical estimate of a solution and continue to agree thereafter (possibly approaching the global optimal solution), solving problem (5.1) as if they all knew the global function F. Moreover, this algorithm is among the very few of gradient type that can deal with the case where some of the local cost functions are nonconvex, as long as the total cost F is convex and has a Lipschitz continuous gradient. Next, with some modifications, we also extend the algorithm to cope with constraints $x \in X$ and non-column-stochasticity of the weight matrix.
Finally, in the case where all the functions $f_i$ are quadratic and the problem is unconstrained, we construct two algorithms, one of which is ratio-consensus-like and converges in finite time (the fastest convergence achieved to date among distributed optimization methods), while the other is a gradient-based algorithm that achieves near finite-time convergence. The convergence times of our algorithms scale at most linearly with the network size. In fact, they are linear in the maximum degree of the agents' minimal polynomials, which ranges between the diameter and the size of the network.

For comparison, we report here known convergence rates/times of other distributed (sub)gradient methods. First, for convex cost functions with bounded subgradients, [135] shows that the distributed proximal-gradient method under a diminishing step size $O(1/\sqrt{t})$ achieves a rate of $O(\ln(t)/\sqrt{t})$. A similar convergence rate was obtained for the dual averaging method in [81] and for the subgradient-push method in [85]. The recent work [134] presents an algorithm with convergence time linear in the network size, and states that it is the best convergence time so far for problem (5.1) with non-differentiable convex functions. Our first algorithm admits an even better convergence time. Second, for convex cost functions with Lipschitz continuous gradient, the algorithm in [136] uses a second-order update at each iteration, which yields a convergence rate of $O(1/t)$ in terms of the best running violation of the first-order optimality condition under a fixed step size. Note that our first algorithm attains the same rate, but in terms of the objective error. Third, for convex cost functions with bounded and Lipschitz continuous gradients, [87] proposed two fast distributed gradient methods; one converges at rate $O(\ln(t)/t)$ under a diminishing step size, while the other achieves $O(1/t^2)$ convergence through the use of an inner consensus loop and Nesterov's acceleration technique (see [117, Chap. 2]). Finally, when the global cost function F is strongly convex and all $f_i$ have Lipschitz continuous gradients, the algorithm in [136] also converges at a linear rate.

The rest of the chapter is organized as follows. Section 5.2 contains the problem formulation and some background on subgradient methods and the finite-time consensus prediction introduced in Chapter 4. The main algorithm and convergence results for general cost functions (some possibly nonconvex) are given in Section 5.3. Performance limits of our algorithms are discussed in Section 5.5, followed by simulation examples in Section 5.6 to illustrate the algorithms. Finally, concluding remarks are given in Section 5.7. Most proofs are given in Appendix A.3.

5.2 Problem Statement and Background

5.2.1 Problem Statement

Consider a network consisting of N agents, where the underlying communication is characterized by a directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. The objective of all the agents is to solve problem (5.1), repeated here for convenience:

$$\min_{x \in X} F(x) = \sum_{i=1}^{N} f_i(x), \qquad (5.2)$$

where $X \subseteq \mathbb{R}$ is the constraint set. Note that here we take the variable x to be scalar for simplicity of notation; the case of a vector variable can be treated following the same steps. Let $F^*$ and $X^*$ denote the optimal value and the optimal solution set, respectively, i.e., $X^* = \{x^* \in X : F(x^*) = F^* := \min_{x\in X} F(x)\}$. The following is a blanket assumption.

Assumption 5.2.1. The set X is convex and $X^*$ is nonempty.
In our setting, agent i only has access to $f_i$ and to local information on its neighbors' opinions, and no central coordinating node is assumed to exist. Thus, the agents need to collaborate in a distributed manner to solve problem (5.1). This involves local iterative computation along with information diffusion. We are interested in the scenario where the communication graph G connecting the agents is directed and fixed. We make the following additional blanket assumption.

Assumption 5.2.2. (Network Connectivity and Weight Matrix) The graph G is fixed and strongly connected. The weight matrix W is fixed, row-stochastic and satisfies $w_{ij} > 0$ for $(i, j) \in \mathcal{E}$, $i \neq j$, and $w_{ij} = 0$ otherwise. Moreover, W has at least one positive diagonal element.

5.2.2 Subgradient Methods

Subgradient methods are the simplest numerical algorithms for solving problem (5.1) when each function $f_i$ is convex. Recall that for the unconstrained problem (i.e., $X = \mathbb{R}$), the centralized method is based on the iteration (see, e.g., [115, Chap. 6], [117, Chap. 3])

$$x(t+1) = x(t) - \gamma(t)\, g(x(t)), \qquad (5.3)$$

where $\gamma(t)$ is the step size at iteration t and $g(x(t))$ is a subgradient of F at $x(t)$, i.e., $g(x(t)) \in \partial F(x(t))$. In the special case where F is differentiable, $g(x(t)) = \nabla F(x(t))$ and (5.3) reduces to centralized gradient descent.

In the distributed setting described above, subgradient methods can take many forms (see, e.g., [76, 78]), one of which is as follows. Every agent has its own estimate of an optimal solution and updates it at each iteration by combining a consensus step with a local optimization step:

$$x_i(t+1) = \sum_{j \in \mathcal{N}_i} w_{ij} x_j(t) - \gamma(t)\, g_i(x_i(t)), \qquad (5.4)$$

where $\mathcal{N}_i = \{j \in \mathcal{V} : (i, j) \in \mathcal{E}\}$ is the set of in-neighbors of node i (including i itself), $\gamma(t)$ is the step size known to all agents, and $g_i(x_i(t)) \in \partial f_i(x_i(t))$. This algorithm is usually referred to as the Distributed Subgradient Method, or DSM for short. A modified version, the Distributed Projected Subgradient method or DPS (see, e.g., [137]), is used when the problem includes a global constraint $x \in X$. In DPS,

$$x_i(t+1) = P_X\Big(\sum_{j \in \mathcal{N}_i} w_{ij} x_j(t) - \gamma(t)\, g_i(x_i(t))\Big), \qquad (5.5)$$

where $P_X$ denotes the projection operator onto the set X (assuming that each agent is able to perform this operation). The convergence of these distributed methods depends on the step size sequence and the weight matrix W. Unlike the centralized version, these algorithms are only guaranteed to converge to an error neighborhood of the optimal solution when the $\nabla f_i$ are Lipschitz continuous and a constant step size is used (see, e.g., [78, 86]).

In the following, we develop new distributed algorithms based on (5.4) that can achieve faster convergence and may not require a diminishing step size; convergence can even occur in finite time for quadratic cost functions. Moreover, the convergence is similar to that of the centralized subgradient algorithm in the sense that the agents are able to stop the algorithm at the same time with identical estimates of an optimal solution. The main novelty introduced in this chapter is to take advantage of the finite-time consensus protocol introduced in [71] (described in the next subsection), resulting in a modified version of algorithm (5.4) that enjoys accelerated convergence in comparison to (5.4). We find that the new algorithm can even achieve the performance limit of distributed algorithms in many cases (see the discussion in Section 5.5).
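For reference, the baseline iteration (5.4) that we accelerate is only a few lines of code. The following toy sketch (our own illustration, with made-up data) runs DSM on $f_i(x) = |x - b_i|$, whose sum is minimized at the median of the $b_i$:

```python
import numpy as np

rng = np.random.default_rng(5)
N, T = 8, 2000
# a doubly stochastic mixing matrix on a ring (our toy choice)
W = 0.5 * np.eye(N) + 0.25 * (np.roll(np.eye(N), 1, 1) + np.roll(np.eye(N), -1, 1))
b = rng.normal(size=N)                # f_i(x) = |x - b_i|; F is minimized at median(b)
x = rng.normal(size=N)                # x_i(0): each agent's local estimate
for t in range(1, T + 1):
    gamma = 1.0 / np.sqrt(t)          # diminishing step size
    g = np.sign(x - b)                # local subgradients g_i(x_i(t))
    x = W @ x - gamma * g             # consensus step plus optimization step, as in (5.4)
print(x, np.median(b))                # the estimates cluster around a minimizer of F
```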
Another benefit of using the new protocol is that it can be implemented in a distributed manner for an arbitrary weight matrix (as long as the associated graph is strongly connected).

5.2.3 Finite-Time Consensus Using Minimal Polynomials

This subsection is a brief summary of Subsection 4.2.2. Consider the update iteration

$$x_i(t+1) = \sum_{j \in \mathcal{N}_i} w_{ij} x_j(t), \quad \forall i \in \mathcal{V}, \qquad (5.6)$$

or, in vector form, $x(t+1) = W x(t)$. From Subsection 1.5.2 in Chapter 1, we know that under Assumption 5.2.2

$$\exists \lim_{t\to\infty} W^t = \mathbf{1}\pi^T =: \Phi, \qquad (5.7)$$

where $\pi \in \mathbb{R}^N$ is the normalized left Perron eigenvector of W, that is, $\pi^T W = \pi^T$ and $\mathbf{1}^T \pi = 1$. Therefore, the network in (5.6) asymptotically reaches consensus:

$$\lim_{t\to\infty} x(t) = \Phi x(0). \qquad (5.8)$$

From Section 4.2.2 in Chapter 4, we also know that each agent i can locally compute the consensus value using its minimal polynomial $q_i$. In particular,

$$\lim_{t\to\infty} x_i(t) = \frac{\sum_{l=0}^{D_i} a_l^{(i)} x_i(l)}{\sum_{l=0}^{D_i} a_l^{(i)}}, \qquad (5.9)$$

where $D_i = \deg(q_i) - 1$ and $a^{(i)} = [a_0^{(i)}, a_1^{(i)}, \ldots, a_{D_i}^{(i)}]^T$ satisfies

$$q_i(\xi) = (\xi - 1)\sum_{l=0}^{D_i} a_l^{(i)} \xi^l, \qquad a_{D_i}^{(i)} = 1, \qquad \sum_{l=0}^{D_i} a_l^{(i)} \ne 0. \qquad (5.10)$$

Note also that $a^{(i)}$ can be computed locally and in finite time by agent i. We summarize the analysis above as follows.

Theorem 5.2.3. (Prediction of consensus value by minimal polynomial) Consider system (5.6) for $t = 0, \ldots, \bar{D} - 1$, where $\bar{D} = \max_{i\in\mathcal{V}} D_i$ with $\deg(q_i) = D_i + 1$. Let Assumption 5.2.2 hold. Then

$$\frac{\sum_{l=0}^{D_i} a_l^{(i)} x_i(l)}{\sum_{l=0}^{D_i} a_l^{(i)}} = e_i^T \Phi x(0), \quad \forall i \in \mathcal{V}, \qquad (5.11)$$

where $a^{(i)} = [a_0^{(i)}, a_1^{(i)}, \ldots, a_{D_i}^{(i)}]^T \in \mathbb{R}^{D_i+1}$ is given in (5.10).

In this regard, the node minimal polynomials $q_i$ can be viewed as a tool providing a shortcut to reaching consensus. This idea will be employed in this chapter to develop new distributed algorithms with desirable features such as behavior similar to that of centralized algorithms, distributed stopping criteria, and finite (or practically finite) time convergence.

5.3 Distributed Subgradient Optimization Using Finite Time Consensus

In this section, we return to optimization problem (5.1) and show how the minimal polynomials associated with the agents can be used to improve the convergence speed of the distributed subgradient method (5.4). To this end, we will assume that each agent $i \in \mathcal{V}$ knows its minimal polynomial $q_i$ and a common upper bound $\kappa$ on $\deg(q_i)$, i.e.,

$$\kappa \ge \deg(q_i), \quad \forall i \in \mathcal{V}. \qquad (5.12)$$

Note that the least possible value of $\kappa$ is always at most the number of agents N in the network; see Section 5.5.1 for further discussion. Therefore, $\kappa$ can be chosen to be N or any known upper bound on N.

We now consider problem (5.1) and the following possible assumptions, which will not be invoked together below.

Assumption 5.3.1. For each $i \in \mathcal{V}$, $f_i$ is convex and has bounded subgradients on X, i.e., there exists $L_i \in (0, \infty)$ such that $|g_i(x)| \leq L_i$ for all $g_i(x) \in \partial f_i(x)$ and all $x \in X$.

Assumption 5.3.2. For each $i \in \mathcal{V}$, $f_i$ is differentiable on the interior of X. Moreover, the function F is convex on X and the gradient $\nabla F$ is $L_{\nabla F}$-Lipschitz continuous for some $L_{\nabla F} \in (0, \infty)$.

The former assumption implies that F and the $f_i$ are convex and have bounded subgradients, while not necessarily being differentiable; the latter requires convexity of F while not requiring convexity of each $f_i$.

5.3.1 Main Algorithm

Our main idea is to combine the consensus prediction step offered by (5.9) with the distributed subgradient method (5.4).
Specifically, we propose an algorithm, called Finite-time consensus Aided Distributed Optimization (FADO), that performs the following three sequential steps in a cyclic manner: (i) $\kappa$ iterations of the usual consensus algorithm (5.6) (used to diffuse information through the network), followed by (ii) a prediction step using minimal polynomials, and then (iii) a (sub)gradient optimization step applied to the predicted consensus value obtained from step (ii). The detailed algorithm is as follows.

Algorithm 5.1. (Finite-time consensus Aided Distributed Optimization, FADO). Each agent $i \in \mathcal{V}$ initializes a pair of local variables $(s_i(0), x_i(0))$ in X and updates them for $t \geq 1$ according to

$$s_i(t) = \begin{cases} \dfrac{\sum_{l=0}^{D_i} a_l^{(i)} x_i(t-\kappa+l)}{\sum_{l=0}^{D_i} a_l^{(i)}} & \text{if } t = k\kappa, \qquad (5.13a) \\[2mm] s_i(t-1) & \text{else}, \qquad (5.13b) \end{cases}$$

$$x_i(t) = \begin{cases} s_i(t) - \gamma_k\, g_i(s_i(t)) & \text{if } t = k\kappa, \qquad (5.14a) \\[1mm] \sum_{j \in \mathcal{N}_i} w_{ij} x_j(t-1) & \text{else}, \qquad (5.14b) \end{cases}$$

where $\gamma_k$ is the step size at $t = k\kappa$.

Remark 5.3.3. (On implementation of (5.13a)) The foregoing form (5.13a) of the algorithm is quite useful for analysis, but in practice, instead of storing $D_i + 1$ values of $x_i$ to compute $s_i$, each agent can maintain a single memory register storing a running sum from $k\kappa$ to $(k+1)\kappa$, denoted by $y_i$, and update it as each new estimate $x_i(t)$ becomes available, as follows. Agent i sets $y_i(t) = \hat{a}_0^{(i)} x_i(t)$ at time $t = k\kappa$ and then updates this variable as $y_i(k\kappa + \tau + 1) = y_i(k\kappa + \tau) + \hat{a}_\tau^{(i)} x_i(k\kappa + \tau)$ for $\tau = 0, \ldots, D_i$, where $\hat{a}_\tau^{(i)} = a_\tau^{(i)} / \sum_{l=0}^{D_i} a_l^{(i)}$. Then $s_i((k+1)\kappa) = y_i(k\kappa + D_i + 1)$. Of course, each agent still needs to store the $D_i + 1$ normalized coefficients $\hat{a}_\tau^{(i)}$. Clearly, the information exchanged among the agents at each time involves only $x_i(t)$, not $s_i(t)$. Moreover, each agent i only needs to update $s_i(t)$ once every $\kappa$ time steps. Note that whenever $t = k\kappa$, (5.13a) must be carried out prior to (5.14a).

The next result asserts that by utilizing minimal polynomials as in (5.13a) above, we succeed in forcing the states $s_i(t)$ to be identical over the whole network after an initial time period of length $\kappa$, i.e., $s_i(t) = s_j(t)$ for all $t \geq \kappa$ and all $i, j \in \mathcal{V}$. In fact, the $s_i(t)$ will agree for all $t \geq 0$ if identically initialized.

Theorem 5.3.4. (Agreement of the $s_i$, $i \in \mathcal{V}$, after $\kappa$ steps) Consider (5.13)-(5.14) and let Assumption 5.2.2 hold. If $g_i$ is bounded for every $i \in \mathcal{V}$, then

$$s_i(t) = s_j(t), \quad \forall i, j \in \mathcal{V}, \; \forall t \ge \kappa. \qquad (5.15)$$

Proof. First, since $g_i$ is bounded for all $i \in \mathcal{V}$, $x_i(t)$ in (5.14) is well defined. Also, (5.13) is well defined since $\sum_{l=0}^{D_i} a_l^{(i)} \ne 0$; cf. (5.10). Next, for any $k \geq 0$, by (5.14b) we have

$$x(t) = W x(t-1), \quad \forall t = k\kappa + 1, \ldots, k\kappa + \kappa - 1. \qquad (5.16)$$

Then at time $t = (k+1)\kappa$, we have

$$s_i((k+1)\kappa) \overset{(5.13a)}{=} \frac{\sum_{l=0}^{D_i} a_l^{(i)} x_i(k\kappa + l)}{\sum_{l=0}^{D_i} a_l^{(i)}} = e_i^T \Phi\, x(k\kappa), \qquad (5.17)$$

where the last equality follows from Theorem 5.2.3 and (5.12). Here $\Phi$ is the consensus matrix defined in (5.7). Therefore,

$$s_i((k+1)\kappa) = e_i^T \mathbf{1}\pi^T x(k\kappa) = \pi^T x(k\kappa), \qquad (5.18)$$

which is independent of i. It remains to invoke (5.13b).
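To fix ideas, here is a compact, self-contained sketch of FADO (our own illustration, with made-up data) on the ring network of Example 4.5.1 with quadratic local costs $f_i(x) = (x - b_i)^2/2$, so the optimum of F is the mean of the $b_i$. The helper recomputes $a^{(i)}$ offline, as in Section 4.3.2:

```python
import numpy as np

def min_poly_quotient(W, i):
    """Coefficients a of q_i(xi)/(xi - 1) in ascending powers; cf. (5.10)."""
    rows = [np.eye(len(W))[i]]
    while True:
        rows.append(rows[-1] @ W)           # e_i^T W^k
        M = np.array(rows)
        if np.linalg.matrix_rank(M, tol=1e-9) < len(rows):
            c = np.append(np.linalg.lstsq(M[:-1].T, -M[-1], rcond=None)[0], 1.0)
            return np.polydiv(c[::-1], [1.0, -1.0])[0][::-1]

rng = np.random.default_rng(7)
N = 10
# ring of Example 4.5.1: doubly stochastic, identical minimal polynomials
W = 0.8 * np.eye(N) + 0.1 * (np.roll(np.eye(N), 1, 1) + np.roll(np.eye(N), -1, 1))
b = rng.normal(size=N)                      # f_i(x) = (x - b_i)^2 / 2, so x* = mean(b)

a = min_poly_quotient(W, 0)                 # same a^(i) at every node here
D = len(a) - 1
kappa = D + 1                               # kappa >= deg(q_i), cf. (5.12)
gamma = 0.5                                 # constant step in (0, 2N/L_gradF); cf. Thm 5.3.12

x = rng.normal(size=N)
s = x.copy()
hist = [x.copy()]
for t in range(1, 40 * kappa + 1):
    if t % kappa == 0:
        X = np.array(hist[-kappa:])         # x(t - kappa), ..., x(t - 1)
        s = (a @ X[: D + 1]) / a.sum()      # prediction step (5.13a), all nodes at once
        x = s - gamma * (s - b)             # subgradient step (5.14a): g_i(s_i) = s_i - b_i
    else:
        x = W @ x                           # consensus step (5.14b)
    hist.append(x.copy())
print(s[0], b.mean())                       # every s_i converges to the optimum
```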
In the rest of this subsection, we establish the convergence of the algorithm when problem (5.1) is unconstrained (i.e., $X = \mathbb{R}$) and the weight matrix is doubly stochastic. In the next subsection, we will show how to modify our algorithm so that these assumptions can be relaxed (i.e., the problem may have constraints and W may be only row-stochastic). Our first convergence result deals with convex cost functions with bounded subgradients.

Theorem 5.3.5. (Local cost functions with bounded subgradients) Consider problem (5.1) with $X = \mathbb{R}$. Let Assumptions 5.2.1, 5.2.2 and 5.3.1 hold. Assume further that W is doubly stochastic. Let all the agents perform (5.13)-(5.14). Let $\bar{s}(t) := s_i(t)$, $\forall i \in \mathcal{V}$, $\forall t \geq \kappa$, and $g(\bar{s}(k\kappa)) := \sum_{i=1}^{N} g_i(\bar{s}(k\kappa)) \in \partial F(\bar{s}(k\kappa))$. Then

$$\bar{s}((k+1)\kappa) = \bar{s}(k\kappa) - \frac{\gamma_k}{N}\, g(\bar{s}(k\kappa)). \qquad (5.19)$$

Also, let

$$\hat{s}_k := \frac{\sum_{\tau=1}^{k} \gamma_\tau \bar{s}(\tau\kappa)}{\sum_{\tau=1}^{k} \gamma_\tau}. \qquad (5.20)$$

We then have the following:

(i) If $\gamma_k \equiv \gamma > 0$, then for each $i \in \mathcal{V}$

$$\lim_{k\to\infty} F(\hat{s}_k) - F^* \le \frac{\gamma}{2N} L_F^2, \qquad L_F := \sum_{j\in\mathcal{V}} L_j. \qquad (5.21)$$

(ii) For a given number of iterations T, let $K = \lfloor T/\kappa \rfloor$. Let R be any number such that $R \geq \mathrm{dist}(\bar{s}(\kappa), X^*)$. Then with constant step size $\gamma_k \equiv \frac{NR}{L_F\sqrt{K}}$, we have

$$F(\hat{s}_K) - F^* \le \frac{R L_F}{\sqrt{K}}. \qquad (5.22)$$

(iii) If $\gamma_k > 0$, $\lim_{k\to\infty} \gamma_k = 0$, and $\sum_{k=1}^{\infty} \gamma_k = \infty$, then

$$\lim_{k\to\infty} F(\hat{s}_k) = F^*, \quad \forall i \in \mathcal{V}. \qquad (5.23)$$

In fact, if $\gamma_k = \frac{1}{\sqrt{k}}$, the convergence rate is $O(\frac{\ln k}{\sqrt{k}})$.

Proof. See Appendix A.3.1.

Remark 5.3.6. (On convergence and limit points) The auxiliary sequence $\{s_i(t)\}_{t=0}^{\infty}$ is generated so as to yield convergence of $F(\hat{s}_k)$ to $F^*$ (case (iii)) or to a neighborhood of the optimal value (cases (i) and (ii)), rather than convergence of $x_i(t)$ directly. In general, $x_i(t)$ does not converge, but rather reaches a limit cycle of period $\kappa$ whenever $s_i(t)$ converges (possibly also to a limit cycle). Note that by definition (5.20) and the fact that $\bar{s}(t) = s_i(t)$ for all $i \in \mathcal{V}$ and $t \geq \kappa$, $\hat{s}_k$ is a global variable available to all the agents. Moreover, each agent i can compute $\hat{s}_k$ using its local variable $s_i(t)$ and an augmented running sum $\Gamma_k \in \mathbb{R}$ in a recursive manner as follows:

$$\Gamma_{k+1} = \Gamma_k + \gamma_k, \qquad (5.24)$$
$$\hat{s}_{k+1} = \big(\Gamma_k \hat{s}_k + \gamma_{k+1}\, s_i((k+1)\kappa)\big)/\Gamma_{k+1}, \qquad (5.25)$$

where $\Gamma_0 = 0$ and $\hat{s}_0 = 0$ (here the subscript k denotes the iteration index on the slow time scale $k = \lfloor t/\kappa \rfloor$). When a constant step size $\gamma$ is used, the simplified update $\hat{s}_{k+1} = \big(k\hat{s}_k + s_i((k+1)\kappa)\big)/(k+1)$ suffices.

Remark 5.3.7. (Stopping criteria) By Theorem 5.3.4, all the agents are able to stop at the same time with the same estimate $s_i = \bar{s}$ (or $\hat{s}_k$) of the optimal solution if they use a common stopping criterion, e.g., running the algorithm for a predetermined number of iterations T as in Theorem 5.3.5(ii), or until one of the following holds (see Appendix A.3.9 for other criteria):

$$|s_i((k+1)\kappa) - s_i(k\kappa)| \le \epsilon \qquad (5.26)$$
$$|F(\hat{s}_{k+1}) - F(\hat{s}_k)| \le \epsilon. \qquad (5.27)$$

This is not the case for many other distributed algorithms, where consensus is only achieved as an asymptotic limit. Note that $F(\hat{s}_k)$ involves evaluation of the global cost function F at $\hat{s}_k$, and algorithm (5.13)-(5.14) does not provide it to each agent. However, we show in Appendix A.3.9 that it is possible for each agent, using augmented iterations, to locally compute $F(\hat{s}_k)$ at time $t = (k+1)\kappa$ for any $k \geq 1$.

Remark 5.3.8. (Connection with the convergence of the centralized method) In light of (5.15) and (5.19), our algorithm performs analogously to the centralized subgradient method (5.3), except on a slower time scale, where F is convex with subgradient g bounded in magnitude by $L_F = \sum_{j=1}^{N} L_j$. As a result, it inherits the performance guarantees of the centralized subgradient method, as shown in (5.21)-(5.23). For a detailed analysis of this centralized method, see, e.g., [117, Chap. 3] and [138]. It should be noted that

Remark 5.3.9.
Remark 5.3.9. (Step size design and objective bound in case (ii)) For a given T, the constant step size γ = NR/(L_F √K) depends on the constant R and the ratio L_F/N. First, it is clear that the smallest admissible value of R, namely dist(s̄(κ), X*), minimizes the error bound in (5.22), but it is of theoretical interest only, since it requires knowledge of the solution set X* (note that s̄(κ) = s_i(κ)). In practice, however, an upper (possibly loose) bound R may be inferred, especially when there is some restriction on the range of the global variable in (5.1); see also Subsection 5.3.2 and the examples in Section 5.6. Second, the ratio L_F/N is the average of the Lipschitz constants of the agents' local cost functions and thus can be computed locally in finite time. For example, running the same algorithm above with x_i(0) = L_i, by (5.18) and double stochasticity of W we obtain s_i(κ) = π^T x(0) = ∑_{i∈V} L_i / N. Alternatively, the agents can choose γ = R/(L_max √K) instead, where L_max := max_{i∈V} L_i, which can be computed in a distributed manner and in finite time using a max-consensus protocol [139]. In this case, the corresponding bound in (5.22) becomes

  F(ŝ_K) − F* ≤ (R/√K)(N L_max / 2 + L_F² / (2 N L_max)) ≤ R N L_max / √K,

where the first inequality is (A.22) and the last follows from L_F ≤ N L_max.
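The following is a minimal sketch of the max-consensus pass mentioned above for computing L_max (cf. [139]); the directed graph and the local constants below are illustrative. On a strongly connected digraph, each agent repeatedly replaces its value with the maximum over its in-neighbors, and all values equal the global maximum after at most diam(G) ≤ N − 1 steps.

import numpy as np

# hypothetical strongly connected digraph: 2 -> 0 -> 1 -> 2 (plus self-loops)
in_neighbors = {0: [0, 2], 1: [0, 1], 2: [1, 2]}
L = np.array([2.0, 20.0, 8.9])          # local Lipschitz constants L_i
for _ in range(len(L) - 1):             # N - 1 >= diam(G) steps suffice
    L = np.array([max(L[j] for j in in_neighbors[i]) for i in range(len(L))])
assert np.all(L == 20.0)                # every agent now holds L_max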
Remark 5.3.10. (On best convergence rate and scalability) It should be pointed out that the result in case (ii) of Theorem 5.3.5 demonstrates an improvement on the best analyses of distributed subgradient algorithms to date. To wit, recall that for any given number of iterations T we have Kκ ≤ T < (K+1)κ, and thus

  (1/N)(F(ŝ_K) − F*) ≤ R L_F / (N √K) ≤ R L_max √((K+1)/K) √(κ/T),   (5.28)

where the first inequality is (5.22). That is, the convergence rate of the (averaged) objective error is of order O(R L_max √(κ/T)); this also holds for γ = R/(L_max √K) (as shown earlier). In other words, the time it takes for the bound in (5.28) to drop below ε > 0 is O(κ/ε²). As N increases, this turns out to be the fastest rate achieved to date among distributed subgradient-based methods for convex cost functions with bounded subgradients. In this setting, the best known rate so far was demonstrated in [134], where the authors considered the problem of minimizing (1/N) ∑_{i=1}^{N} f_i(x) (which is why we need the factor 1/N on the left side of (5.28) for comparison). In that reference, the authors proposed a linear-time average consensus protocol and used it to design a new algorithm for solving problem (5.1). The consensus protocol there is a combination of a weighted averaging scheme based on the Metropolis rule [90] and an extrapolation step (similar in spirit to adding momentum to speed up iterative methods). Under a fixed step size β = 1/(L_max √(NT)), [134] showed that the aggregated objective error decays as O(R² L_max √(N/T)), i.e., it takes O(N/ε²) time steps to reach an ε error. Thus that algorithm scales linearly in the network size; hence the title "linear time." To implement this algorithm (specifically, the Metropolis rule), however, the graph needs to be undirected. Our algorithm, on the other hand, applies to directed graphs and requires O(κ/ε²) steps to reach an ε error. In general, without knowledge of the network topology, we can simply take κ to be N or a known upper bound on N; then our algorithm still possesses the fastest rate (similar to the algorithm in [134], where a common upper bound on N is also required). However, for certain graphs such as distance regular graphs, κ can be as small as the graph diameter, which can be small compared to the network size (see Section 5.5.1 for further discussion). In this connection, our algorithm's convergence time scales at most linearly in the network size. Note also that the centralized subgradient method converges to an ε error of the optimal value in O(1/ε²) time steps. Hence, our algorithm's performance lies between that of the centralized method and the best known rate for distributed ones. This makes sense intuitively, since our algorithm is not only distributed but also behaves like the centralized one, except on a time scale slowed down by a factor of κ.

Remark 5.3.11. (On number of subgradient evaluations) Another important aspect is the number of subgradient evaluations, since the computation of subgradients usually dominates the time needed to perform the optimization step. Within T time steps, our algorithm requires each agent to evaluate its subgradient T/κ times (e.g., T/N times when κ = N), whereas most, if not all, other distributed subgradient algorithms require T evaluations. Of course, the advantages of our algorithm rest on the assumption that all the agents are equipped with their own minimal polynomials. Although this seems restrictive at first, we remark that the agents' minimal polynomials can be computed (prior to the main algorithm's implementation) in a centralized or decentralized manner, or could be obtained on the fly and in finite time as well; see [71, 72, 140] for such algorithms and Section 5.6 for numerical examples.

Remark 5.3.8 also suggests that known results on the centralized gradient descent method can be used in a straightforward manner to show convergence of the algorithm when a smoothness condition is assumed on F (rather than on every f_i). In particular, when F is differentiable, convergence to the optimal value can be ensured with a sufficiently small constant step size, as shown next.

Theorem 5.3.12. (Convex global cost with Lipschitz gradient) Consider problem (5.1) with X = R. Let Assumptions 5.2.1, 5.2.2 and 5.3.2 hold. Assume further that W is doubly stochastic. Let all the agents synchronously perform (5.13)-(5.14) with γ_k ≡ γ ∈ (0, 2N/L_{∇F}). Then for each i ∈ V

  F(s_i(t)) − F* = O(κ/t), as t → ∞.   (5.29)

Proof. See Appendix A.3.2.

Remark 5.3.13. (Convergence comparison) As with the centralized gradient descent method, when F is convex and ∇F is Lipschitz continuous, the proposed algorithm converges to the optimal value without the need for a diminishing step size. This is a key difference between our algorithm and the distributed gradient descent method (5.4) and many others, which do not converge to the optimal value under a constant step size rule. Moreover, a running bound on the objective error is available in (A.27), bearing a resemblance to that of the centralized method except for being scaled by a factor of κ.

Faster convergence rates can be obtained when we further assume strong convexity of the global cost function F (but not of every individual cost f_i). In particular, the algorithm then achieves linear convergence rates to both the optimal value and the optimal solution. (Note that under Assumption 5.2.1 and strong convexity of F, there exists a unique x* ∈ X*.)

Theorem 5.3.14. (Strongly convex global cost with Lipschitz continuous gradient) Consider problem (5.1) with X = R. Let Assumptions 5.2.1, 5.2.2 and 5.3.2 hold.
Assume further that W is doubly stochastic and F is strongly convex with parameter µ > 0. If all the agents perform (5.13)-(5.14) with γ_k ≡ γ ∈ (0, 2N/(µ + L_{∇F})], then

  |s_i(kκ) − x*|² ≤ β^{k−1} |s_i(κ) − x*|²,   (5.30)
  F(s_i(kκ)) − F* ≤ (L_{∇F}/2) β^{k−1} |s_i(κ) − x*|²,   (5.31)

where β = 1 − 2γµL_{∇F} / (N(µ + L_{∇F})) ∈ (0, 1). Thus, F(s_i(t)) → F* and s_i(t) → x* linearly at rates β^{1/κ} and β^{1/(2κ)}, respectively.

Proof. See Appendix A.3.3.

5.3.2 Extensions of Algorithm 5.1

We now consider possible extensions of the algorithm to deal with two cases: (i) problem (5.1) is subject to a constraint x ∈ X for some convex set X, and (ii) the weight matrix W is only row stochastic.

Case (i): We assume X is closed and convex. To satisfy this constraint, we resort to the projection operator onto X, denoted P_X. In the special case X ⊂ R, X is just an interval, so the projection is simply a cut-off function. Assuming that all the agents know the set X, we modify our algorithm described above as follows. For any i ∈ V, initialize s_i(0) = x_i(0) ∈ X, and update for any t ≥ 1:

  s_i(t) = P_X( (∑_{l=0}^{D_i} a_l^{(i)} x_i(t−κ+l)) / (∑_{l=0}^{D_i} a_l^{(i)}) )  if t = kκ; s_i(t) = s_i(t−1)  otherwise,   (5.32)
  x_i(t) = s_i(t) − γ_k g_i(s_i(t))  if t = kκ; x_i(t) = ∑_{j∈N_i} w_ij x_j(t−1)  otherwise.   (5.33)

The key idea of this extension is that the same P_X is used by all the agents, forcing the modified algorithm to work in the same manner as the original one, as shown next. Define g_s(t) := [g_1(s_1(t)), …, g_N(s_N(t))]^T and s̄(t) := 1^T s(t)/N, and recall that Φ = 11^T/N. We have

  s_i(kκ + κ) = P_X( (∑_{l=0}^{D_i} a_l^{(i)} x_i(kκ+l)) / (∑_{l=0}^{D_i} a_l^{(i)}) )   [by (5.32)]
             = P_X( e_i^T Φ x(kκ) )   [by Thm. 5.2.3]
             = P_X( (1^T/N)( s(kκ) − γ_k g_s(kκ) ) )   [by (5.33)]
             = P_X( s̄(kκ) − (γ_k/N) ∑_{j=1}^{N} g_j(s_j(kκ)) ).   (5.34)

Thus, s_i(kκ) = s̄(kκ), ∀i ∈ V, k ≥ 1. Hence, (5.15) holds and

  s̄((k+1)κ) = P_X( s̄(kκ) − (γ_k/N) g(s̄(kκ)) ),   (5.35)

where g(s̄(kκ)) = ∑_i g_i(s̄(kκ)) ∈ ∂F(s̄(kκ)). Now (5.35) is the usual (centralized) projected (sub)gradient method, whose convergence results are not much different from those of the (sub)gradient method (see, e.g., [117, 138], [115, Chap. 2]). Therefore, the conclusions of Theorems 5.3.5, 5.3.12 and 5.3.14 still hold.

Case (ii): It should be noted that in distributed settings, a row stochastic weight matrix is much easier to implement than a column (or doubly) stochastic one, as each agent can individually decide the weights on the information received from its neighbors. When this is the case, most (if not all) subgradient-based methods do not converge to the optimal value, and our proposed algorithm above is no exception. However, it can be modified to overcome this by using the re-weighting technique of [93, 100], which requires that the value π_i be available to agent i for all i ∈ V. The modified algorithm is as follows:

  s_i(t) = (∑_{l=0}^{D_i} a_l^{(i)} x_i(t−κ+l)) / (∑_{l=0}^{D_i} a_l^{(i)})  if t = kκ; s_i(t) = s_i(t−1)  otherwise,   (5.36)
  x_i(t) = s_i(t) − γ_k g_i(s_i(t)) / (Nπ_i)  if t = kκ; x_i(t) = ∑_{j∈N_i} w_ij x_j(t−1)  otherwise.   (5.37)

The only difference between this extension and the original algorithm (5.13)-(5.14) is the scaling factor (Nπ_i)^{−1} of the subgradient in (5.37), where π_i > 0, ∀i ∈ V (see [95, Thm. 8.4.4]). Note that the factor N^{−1} is not crucial to the convergence of the algorithm; it appears merely to retain the conclusions of Theorems 5.3.5, 5.3.12 and 5.3.14 as stated; see Appendix A.3.4 for a detailed proof. Here we assume that the value π_i is available to agent i for all i ∈ V.
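As an illustration of the two extensions just described (applied together, as in Remark 5.3.15 below), the following sketch runs the combined updates (5.32) and (5.37) on a toy problem with weighted ℓ1 costs. All data are made up; W is only row stochastic; the prediction again uses the global annihilating polynomial char(W)/(ξ − 1) in place of per-node minimal polynomials; and π is computed here by an eigendecomposition purely for illustration.

import numpy as np

W = np.array([[0.6, 0.4, 0.0],
              [0.3, 0.4, 0.3],
              [0.5, 0.0, 0.5]])            # row stochastic, strongly connected
N = W.shape[0]
w_eig, V = np.linalg.eig(W.T)
pi = np.real(V[:, np.argmax(np.real(w_eig))])
pi /= pi.sum()                             # normalized left Perron eigenvector

coeffs, _ = np.polydiv(np.poly(W), [1.0, -1.0])
a = coeffs[::-1]                           # prediction coefficients

b, c = np.array([1.0, 2.0, 1.0]), np.array([-1.0, 0.5, 3.0])
proj = lambda v: np.clip(v, -2.0, 2.0)     # P_X for the interval X = [-2, 2]
kappa = N
x = proj(c.copy())
s = x.copy()
hist = np.zeros((N, kappa)); hist[:, 0] = x
for t in range(1, 400 * kappa + 1):
    k = t // kappa
    if t % kappa == 0:
        s = proj(hist @ a / a.sum())       # (5.32): project the prediction
        g = b * np.sign(s - c)             # subgradients of b_i |x - c_i|
        x = s - (1.0 / np.sqrt(k)) * g / (N * pi)   # (5.37): rescaled step
    else:
        x = W @ x
    hist[:, t % kappa] = x

print(s)   # approaches 0.5, the weighted median of c (which lies in X)

The rescaling by (Nπ_i)^{−1} is what makes the prediction s((k+1)κ) = π^T x(kκ) reduce to the centralized update s − (γ_k/N) ∑_i g_i, despite the nonuniform π.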
In fact, each agent can compute its corresponding entry of the vector π in finite time (possibly during the process of determining its own minimal polynomial). In particular, consider iteration (5.6) with initial condition x(0) = e_i for some i ∈ V, and suppose that Assumption 5.2.2 holds. Then

  lim_{t→∞} x(t) = lim_{t→∞} W^t x(0) = 1π^T e_i = π_i 1.   (5.38)

That is, the network of agents running (5.6) achieves consensus on x̄ = π_i. Therefore, by applying one of the finite-time consensus algorithms in, e.g., [71, 72] (see also Subsection 5.2.3), agent i can compute π_i in finite time. This is also the main idea employed in [141] for each agent to compute π.

Remark 5.3.15. When problem (5.1) is subject to both a global constraint x ∈ X and a row stochastic weight matrix W, we can combine (5.32) and (5.37) to form a new algorithm. The convergence proof for the resulting algorithm merely combines the proofs of the two cases above, and is thus omitted for brevity.

5.4 Finite-Time Optimization for Quadratic Cost Functions

Now we consider problem (5.1) without any constraint, i.e., X = R, and with quadratic cost functions

  f_i(x) = b_i (x − c_i)², ∀i ∈ V,   (5.39)

for some b_i, c_i ∈ R with b_i > 0. Clearly, the optimal solution is given by x* = (∑_{i=1}^{N} b_i c_i) / (∑_{i=1}^{N} b_i). Of course, our algorithm of the previous section is still applicable. Here we aim to achieve (near) finite-time convergence by capitalizing on the special form of the cost functions.

5.4.1 Ratio-Consensus Based Algorithm

Our first algorithm is based on the observation that x* can be expressed as the ratio of two average quantities, namely, x* = ((1/N) ∑_{i=1}^{N} b_i c_i) / ((1/N) ∑_{i=1}^{N} b_i). Thus, inspired by the idea of the ratio-consensus algorithm (see, e.g., [34, 90]), we construct the following finite-time algorithm:

Algorithm 5.2. (Finite-time Ratio-consensus) Let κ satisfy (5.12). Each agent i ∈ V initializes a pair of local variables (y_i, z_i) at time t = 0 as y_i(0) = b_i c_i, z_i(0) = b_i, and updates them according to

  y_i(t) = ∑_{j∈N_i} w_ij y_j(t−1), t = 1, …, κ,   (5.40)
  z_i(t) = ∑_{j∈N_i} w_ij z_j(t−1), t = 1, …, κ,   (5.41)
  x*_i = (∑_{l=0}^{D_i} a_l^{(i)} y_i(l)) / (∑_{l=0}^{D_i} a_l^{(i)} z_i(l)), ∀i ∈ V.   (5.42)

Note that (5.42) is evaluated only once, after the final consensus iteration. The next result is immediate.

Theorem 5.4.1. (Finite-time optimization for quadratic costs) Consider problem (5.1) with X = R and f_i given as in (5.39). Let Assumption 5.2.2 hold, and further let W be doubly stochastic. If the agents perform (5.40)-(5.42), then

  x*_i = x*, ∀i ∈ V.   (5.43)

Proof. We have Φ = 11^T/N by (5.7) and double stochasticity of W. Application of Theorem 5.2.3 yields ∑_{l=0}^{D_i} a_l^{(i)} y_i(l) = (1/N) 1^T y(0) ∑_{l=0}^{D_i} a_l^{(i)} and ∑_{l=0}^{D_i} a_l^{(i)} z_i(l) = (1/N) 1^T z(0) ∑_{l=0}^{D_i} a_l^{(i)}. The theorem then follows.

Although the idea of this algorithm is simple, to our knowledge it has not been presented elsewhere in the literature. It also shows an interesting connection with the finite-time behavior of the centralized gradient descent method. In particular, for a quadratic cost function, the centralized method converges in just one iteration by using the Newton step. In the distributed setting, our algorithm exhibits the same finite-time convergence behavior, except in κ steps. Here, at each time t = 1, …, κ, each agent exchanges its pair of variables (y_i, z_i) with its neighbors. Consequently, all the agents reach the optimal solution indirectly through diffusing the coefficients of their quadratic cost functions.
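The following minimal sketch of Algorithm 5.2 uses a toy doubly stochastic W and hypothetical cost coefficients; as before, the prediction coefficients are taken from char(W)/(ξ − 1), a valid though generally non-minimal choice, in place of per-node minimal polynomials.

import numpy as np

W = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
N = W.shape[0]
coeffs, _ = np.polydiv(np.poly(W), [1.0, -1.0])
a = coeffs[::-1]                       # a[l] multiplies the value at time l

b = np.array([3.0, 1.0, 2.0])          # f_i(x) = b_i (x - c_i)^2
c = np.array([0.0, 4.0, 3.0])
y = np.zeros((N, N)); z = np.zeros((N, N))
y[:, 0], z[:, 0] = b * c, b            # y_i(0) = b_i c_i, z_i(0) = b_i
for t in range(1, N):                  # consensus sweeps (5.40)-(5.41)
    y[:, t] = W @ y[:, t - 1]
    z[:, t] = W @ z[:, t - 1]

x_star = (y @ a) / (z @ a)             # ratio prediction (5.42), per agent
assert np.allclose(x_star, (b * c).sum() / b.sum())   # exact: 10/6 here
print(x_star)

The common factor ∑_l a_l cancels in the ratio, so each agent recovers x* exactly after only N − 1 consensus sweeps in this example.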
In the case that the agents do not want to reveal information about their private cost functions, it is still possible for each agent to use its minimal polynomial, in connection with exchanging estimates of the solution with its neighbors, to achieve very fast convergence. In the following, we derive such an algorithm.

5.4.2 Gradient-Based Algorithm

Our idea is as follows. Consider the distributed gradient method (5.4) applied to the problem with the quadratic costs (5.39):

  x_i(t+1) = ∑_{j∈N_i} w_ij x_j(t) − γ b_i (x_i(t) − c_i),   (5.44)

where γ is a constant and x_i(0) can be chosen arbitrarily. This iteration does not converge to the optimal solution (as is easily verified by contradiction: if x* = ∑_{j∈N_i} w_ij x* − γ b_i (x* − c_i), then x* = c_i, ∀i ∈ V), but rather to an O(γ)-neighborhood of x*, even when W is assumed to be doubly stochastic (see, e.g., [78, 86]):

  lim_{t→∞} x(t) = O(γ) ε + x* 1   (5.45)

for some ε ∈ R^N. Therefore, if each agent can predict its final value in finite time (possibly in the same manner as above), then by using a sufficiently small γ, the agents may employ very few prediction steps in conjunction with a finite number of iterations (5.44) in order to obtain a close estimate of the optimal solution. This idea is pursued in the following.

To this end, we first show how each agent can compute its final value in (5.44) in finite time and in a distributed manner. This is different from what is reported in Subsection 5.2.3, since (5.44) is not a consensus iteration. Eq. (5.44) represents a linear time-invariant system with constant inputs, which can also be expressed in vector form as

  x(t+1) = (W − γB) x(t) + γBc,   (5.46)

where B := diag([b_1, …, b_N]) and c := [c_1, …, c_N]^T. In this connection, it is clear that the convergence of iteration (5.44) depends on the system matrix (W − γB). We then have the following:

Theorem 5.4.2. (Stability condition) Let Assumption 5.2.2 hold and suppose that W has positive diagonal elements. Then there exists γ_0 > 0 such that

  ρ(W − γB) < 1   (5.47)

for any γ ∈ (0, γ_0). Moreover, we can take γ_0 = 2 min_{i∈V} w_ii / max_{i∈V} b_i.

Proof. Note that 2 min_{i∈V} w_ii / max_{i∈V} b_i > 0. By the Gershgorin circle theorem (see, e.g., [95, p. 344]), all the eigenvalues of (W − γB) are located in the union of the N discs

  ∪_{i=1}^{N} { z ∈ C : |z − w_ii + γ b_i| ≤ 1 − w_ii }.   (5.48)

As a result, ρ(W − γB) < 1 for any γ ∈ (0, 2 min_{i∈V} w_ii / max_{i∈V} b_i). The existence of γ_0 then follows.

Condition (5.47) guarantees that the system (5.46) is BIBO stable; thus the states converge to some fixed values, which are not necessarily equal. In fact, it follows from (5.46) that

  lim_{t→∞} x(t) = Φ_γ c, where Φ_γ := (I − (W − γB))^{−1} γB,   (5.49)

and I − (W − γB) is invertible because of (5.47). Let z(t)^T := [x(t)^T, c^T]. The system (5.46) can be described equivalently as

  z(t+1) = W̃ z(t), W̃ := [ W − γB  γB ; 0_{N×N}  I ].   (5.50)

For any i = 1, …, N, define q̃_i to be the monic polynomial of minimal degree such that e_i^T q̃_i(W̃) = 0^T, where e_i ∈ R^{2N} is the i-th standard unit vector. We will call q̃_i the minimal polynomial of node i in system (5.50).

Lemma 5.4.3. (Minimal polynomials in system (5.50)) Let γ satisfy (5.47). For each i ∈ V, there exists ã^{(i)} ∈ R^{D̃_i+1} such that

  q̃_i(ξ) = (ξ − 1) ∑_{j=0}^{D̃_i} ã_j^{(i)} ξ^j,  ã_{D̃_i}^{(i)} = 1,   (5.51)

where deg(q̃_i) = D̃_i + 1 ≤ 1 + N. Moreover, all the zeros of q̃_i are strictly inside the unit circle, except for a single zero at 1.

Proof. See Appendix A.3.6.

Clearly, q̃_i is of the same form as q_i in (5.10) and also has 1 as its only zero of maximum modulus.
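A quick numerical check of the step-size bound in Theorem 5.4.2 is given below, reusing the toy W and b from the earlier sketches: for γ below γ_0 = 2 min_i w_ii / max_i b_i, the spectral radius of W − γB indeed stays below 1, as the Gershgorin argument guarantees.

import numpy as np

W = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
b = np.array([3.0, 1.0, 2.0])
B = np.diag(b)
gamma0 = 2 * np.min(np.diag(W)) / np.max(b)      # = 1/3 for these data
for gamma in (0.25 * gamma0, 0.5 * gamma0, 0.99 * gamma0):
    rho = max(abs(np.linalg.eigvals(W - gamma * B)))
    print(f"gamma = {gamma:.4f}, rho(W - gamma*B) = {rho:.4f}")
    assert rho < 1.0                             # stability condition (5.47)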
Thus, q̃_i can also be computed locally and in finite time using the schemes presented in Section 4.3.2. Now we assume that all the agents know a common upper bound κ on deg(q̃_i). Note that deg(q̃_i) ≤ N + 1, ∀i ∈ V; thus κ can be chosen to be N + 1 or any upper bound on N + 1. After κ consecutive iterations of (5.44), each agent is able to determine the final value to which it would converge if all the agents followed (5.44) forever. Using the same arguments as in (5.9)-(5.10) of Subsection 5.2.3, we have

  lim_{t→∞} x_i(t) = (∑_{k=0}^{D̃_i} ã_k^{(i)} x_i(k)) / (∑_{k=0}^{D̃_i} ã_k^{(i)}), ∀i ∈ V.   (5.52)

Therefore, in the same spirit as Theorem 5.2.3, we can view the right side of (5.52) together with κ iterations of (5.44) as a realization of the operator Φ_γ given in (5.49). This realization is carried out in finite time, enabling us to construct the following algorithm:

Algorithm 5.3. (Near Finite-time Gradient-based Optimization) Each agent i ∈ V initializes a pair of local variables (s_i, x_i) at time t = 0 as s_i(0) = x_i(0) = c_i and updates them for t ≥ 1 according to

  s_i(t) = (∑_{k=0}^{D̃_i} ã_k^{(i)} x_i(t−κ+k)) / (∑_{k=0}^{D̃_i} ã_k^{(i)})  if t = kκ; s_i(t) = s_i(t−1)  otherwise,   (5.53)
  x_i(t) = s_i(t)  if t = kκ; x_i(t) = ∑_{j∈N_i} w_ij x_j(t−1) − γ b_i (x_i(t−1) − s_i(t−1))  otherwise,   (5.54)

where γ satisfies (5.47).

In the following, we show that by choosing γ appropriately, this algorithm achieves exponential convergence to the optimal solution. More importantly, the convergence rate is adjustable through γ.

Theorem 5.4.4. (Convergence of consensus matrix Φ_γ) Let Assumption 5.2.2 hold and let γ satisfy (5.47). Then Φ_γ given by (5.49) is a row stochastic and irreducible matrix, and has Bπ as a left Perron eigenvector. Moreover,

  lim_{k→∞} Φ_γ^k = 1 π^T B / (π^T B 1),   (5.55)

and the convergence is exponential, with rate determined by the second largest eigenvalue λ_2(Φ_γ).

Proof. See Appendix A.3.7.

Theorem 5.4.4 allows us to prove the convergence of (5.53)-(5.54).

Theorem 5.4.5. (Convergence for quadratic costs) Consider Algorithm (5.53)-(5.54) and let Assumption 5.2.2 hold. Assume further that W is doubly stochastic and has positive diagonal elements, and let γ satisfy (5.47). Then

  lim_{t→∞} x(t) = lim_{t→∞} s(t) = x* 1.   (5.56)

Moreover, the convergence is linear with rate |λ_2(Φ_γ)|^{1/κ}.

Proof. See Appendix A.3.8.

Although in this theorem the weight matrix W is assumed to be doubly stochastic, we note that the algorithm can be modified using the re-weighting trick so that W need only be row stochastic (see Remark 5.4.8 for details). Theorem 5.4.5 shows the effect of γ on the convergence of Algorithm (5.53)-(5.54) and the rate at which the agents' estimates converge linearly to the optimal solution. Specifically, γ must be chosen so as to first ensure stability of the algorithm (namely, condition (5.47) in Theorem 5.4.2), and then to accelerate convergence by reducing the rate |λ_2(Φ_γ)|^{1/κ}. The next remarks successively address these issues.
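The following is a minimal sketch of Algorithm 5.3, again on the toy data used above. For illustration, the per-node polynomials q̃_i are replaced by the global annihilating polynomial (ξ − 1)·char(W − γB), whose non-unit factor supplies the coefficients ã (so κ = N + 1 samples are used per prediction); within each cycle the iteration is (5.44) with c replaced by the current prediction s, which realizes Φ_γ applied to s.

import numpy as np

W = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
b = np.array([3.0, 1.0, 2.0]); c = np.array([0.0, 4.0, 3.0])
N = W.shape[0]
gamma = 0.05                            # satisfies (5.47); see previous sketch
M = W - gamma * np.diag(b)
a_tld = np.poly(M)[::-1]                # char(M) coefficients, increasing powers
kappa = N + 1                           # N + 1 samples per prediction

s, x = c.copy(), c.copy()
hist = np.zeros((N, kappa)); hist[:, 0] = x
for t in range(1, 6 * kappa + 1):
    if t % kappa == 0:
        s = hist @ a_tld / a_tld.sum()  # exact limit prediction, cf. (5.52)
        x = s.copy()                    # restart (5.54) from the prediction
    else:
        x = W @ x - gamma * b * (x - s) # damped consensus toward Phi_gamma s
    hist[:, t % kappa] = x

print(s, (b * c).sum() / b.sum())       # s -> x* = 10/6 after a few cycles

Each cycle maps s to Φ_γ s exactly, so s(kκ) = Φ_γ^k c, and with γ this small the response is nearly dead-beat (cf. Remark 5.4.7 below).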
Remark 5.4.6. (Distributed agreement on step size to ensure the stability condition) In order to guarantee convergence of the algorithm, all the agents must select the same step size γ satisfying (5.47). Such a value can be determined by all the agents in finite time and in a distributed manner as follows. Prior to running Algorithm (5.53)-(5.54), all the agents implement a max-consensus algorithm (see, e.g., [139]) to compute both b_M := max_{i∈V} b_i and a_M := −max_{i∈V}(−w_ii) = min_{i∈V} w_ii. Then each agent sets γ = ε · 2a_M/b_M, where ε ∈ (0, 1) is a constant known to all the agents. Clearly, 0 < γ < 2a_M/b_M = 2 min_{i∈V} w_ii / max_{i∈V} b_i ≤ γ_0, by Theorem 5.4.2. Moreover, the max-consensus algorithm converges after a finite number of iterations equal to the diameter of the graph. In case this number is unknown to the agents, any upper bound (e.g., N, the network size) can be used to terminate the algorithm.

Remark 5.4.7. (Fast convergence by choosing a small step size γ) Recall from Theorem 5.4.5 that x(t) converges linearly to x*1 with rate |λ_2(Φ_γ)|^{1/κ}, where it can be seen from (5.55) that

  |λ_2(Φ_γ)| = ρ( Φ_γ − 1 π^T B/(π^T B 1) ) = ρ( Φ_γ − 1 b^T/(b^T 1) ).   (5.57)

Here, π = 1/N since W is doubly stochastic. Moreover, by (5.45) and (5.49),

  Φ_γ c = O(γ) ε + x* 1 = O(γ) ε + 1 b^T c/(b^T 1),

where we have used the fact that x* = b^T c/(b^T 1). Rearranging terms yields (Φ_γ − 1 b^T/(b^T 1)) c = O(γ), which, in view of (5.57), implies that

  lim_{γ→0+} |λ_2(Φ_γ)| = 0.   (5.58)

Although Φ_γ is not defined at γ = 0, choosing γ small brings about a fast convergence rate. Moreover, when γ is sufficiently small that |λ_2(Φ_γ)|^{1/κ} is close to 0, the system (5.53)-(5.54) exhibits a near dead-beat response. When this is the case, all the agents may agree to perform the algorithm for kκ steps with a small integer k (e.g., 1, 2, 3), so that each agent obtains a close approximation to the optimal solution. In conjunction with Remark 5.4.6, all the agents may choose γ = min(θ, ε · 2a_M/b_M), where 0 < θ, ε ≪ 1 are supposedly known to every agent. A word of warning, however: too small a value of γ (say 10⁻¹⁶) could affect the convergence of the algorithm through computational round-off errors. This deserves further analysis in future work.

Remark 5.4.8. (Extension of Algorithm (5.53)-(5.54) for a row stochastic weight matrix) Assume that π_i is available to agent i. (Indeed, all the agents can cooperate to compute their corresponding π_i in finite time [141].) We modify (5.54) as follows:

  x_i(t) = s_i(t)  if t = kκ; x_i(t) = ∑_{j∈N_i} w_ij x_j(t−1) − γ b_i (x_i(t−1) − s_i(t−1)) / (Nπ_i)  otherwise,

and redefine Φ_γ = ( γ^{−1} B^{−1} S (I − W) + I )^{−1}, where S := diag(Nπ) and γ is such that ρ(W − γ S^{−1} B) < 1, in place of the stability condition (5.47). Note also that γ_0 in Theorem 5.4.2 can then be chosen as γ_0 = 2 min_{i∈V} Nπ_i w_ii / max_{i∈V} b_i. Under this condition, it can be verified that Φ_γ is still a valid consensus matrix with b = B1 as a left Perron eigenvector. Therefore,

  lim_{k→∞} s(kκ) = lim_{k→∞} Φ_γ^k c = 1 b^T c/(b^T 1) = x* 1,

i.e., the optimal solution is attained by every agent.

5.5 On the Minimal Value of κ and Performance Limits of Distributed Subgradient Methods

5.5.1 Minimal Value of κ

It is evident that the convergence speed of each algorithm presented in Sections 5.3 and 5.4 depends on the value κ, which is an upper bound on the degrees of all the minimal polynomials (q_i, i ∈ V). Indeed, a smaller κ corresponds to faster convergence. Thus, in the best scenario, κ = κ_min := max_{i∈V} deg(q_i). We note the following regarding this value. First, for general directed networks, κ_min satisfies

  diam(G) + 1 ≤ κ_min ≤ deg(q_W) ≤ N,   (5.59)

where diam(G) denotes the graph diameter. The lower bound can be shown, e.g., by application of [72, Thm. 3]; an alternative argument is given in the next subsection. The upper bounds follow from the definitions of minimal polynomials and the Cayley-Hamilton theorem. It is interesting to seek classes of graphs for which the lower bound is achieved. Clearly, this is the case for line graphs, since then diam(G) + 1 = N.
Next, we show that the lower bound can also be achieved by another class of graphs, namely distance regular graphs (see Definition 4.6.5), examples of which include cycles, hypercubes and complete graphs. References [129, 142] and a recent survey [143] provide further information on distance regular graphs. The next result asserts that in the setting of distance regular graphs, we can ensure that κ_min = diam(G) + 1. See Appendix A.3.5 for a proof.

Theorem 5.5.1. If G is distance regular and W = I − εL(G), where L(G) is the Laplacian matrix and ε > 0 satisfies ε(|N_i| − 1) ≤ 1, then diam(G) + 1 = deg(q_i), ∀i ∈ V.

Thus, for distance regular graphs (with W = I − εL(G)), the convergence times of our algorithms are linear in the network diameter rather than the network size. Next, we discuss the tightness of the upper bounds in (5.59). Note that the minimal polynomials q_i, and thus κ_min, depend explicitly on the weight matrix W. In fact, the zeros of q_i are the eigenvalues of W corresponding to the modes of system (5.6) that are observable from the output x_i (see [141, Sec. V]). This has two direct implications: (i) if the network (5.6) is observable from at least one node, then κ_min = N (a line graph falls into this case); (ii) an algorithm for computing q_i locally can be used as a means of verifying system observability or computing the graph spectrum in a distributed manner. This also suggests that observability theory might be useful in the problem of designing the weights so as to minimize κ_min. Finally, when all agents know the degree of their own minimal polynomial, they can compute κ_min in a distributed fashion by using a max-consensus algorithm [139] for a finite number of time steps upper bounded by diam(G).
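A small numerical check of Theorem 5.5.1 is sketched below on the 6-cycle, a distance regular graph of diameter 3, with ε = 0.2 (so that ε(|N_i| − 1) = 0.4 ≤ 1). The degree of node i's minimal polynomial is found as the smallest d for which the Krylov rows e_i^T W^0, …, e_i^T W^d become linearly dependent.

import numpy as np

N, eps = 6, 0.2
A = np.zeros((N, N))
for i in range(N):                       # cycle graph C_6
    A[i, (i + 1) % N] = A[i, (i - 1) % N] = 1.0
L = np.diag(A.sum(1)) - A                # graph Laplacian
W = np.eye(N) - eps * L                  # row stochastic: eps*(|N_i|-1) <= 1

for i in range(N):
    rows = [np.eye(N)[i]]                # e_i^T W^t, t = 0, 1, ...
    while True:
        rows.append(rows[-1] @ W)
        if np.linalg.matrix_rank(np.vstack(rows), tol=1e-9) < len(rows):
            break
    print(f"node {i}: deg(q_i) = {len(rows) - 1}")   # 4 = diam(G) + 1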
5.5.2 Performance Limit of Distributed Subgradient Methods

We now discuss how fast the convergence of a distributed subgradient method can be in comparison with its centralized counterpart and with our algorithm, and in connection with the network topology.

First, it is clear that for a general problem of the form (5.1) and a given network topology, diam(G) equals the smallest possible running time among all distributed algorithms, since this is the minimum time for information to travel from any node to all others in the network. (This also explains the lower bound in (5.59).) An intuitively simple algorithm that theoretically achieves this fastest convergence is based on communication flooding: assuming sufficient communication power and memory capacity at each agent, as well as availability of a closed-form characterization of each local cost function f_i and a closed-form solution to the global optimization problem, at each time step every agent broadcasts its function f_i (assumed to carry a unique identifier) together with all data received from its neighbors at the previous time step. As a result, at time diam(G) + 1, all the agents are able to determine the global cost function F and hence can determine the optimum independently. Of course, even assuming uniqueness of the optimal solution (so that the agents also reach consensus), this algorithm is far from being of any practical use. However, a similar behavior appears when applying our algorithms (5.40)-(5.42) and (5.53)-(5.54) to quadratic objective functions, where the closed-form solution to the global problem is simple (and the centralized Newton method solves it in one iteration). Specifically, (5.40)-(5.42) terminates in κ steps with the exact solution, while (5.53)-(5.54) can do so with arbitrarily small error by using a sufficiently small γ. As noted earlier, κ = diam(G) + 1 for certain graphs.

Second, suppose that the global objective function can be optimized by some centralized subgradient-based method with a certain convergence time. In the distributed setting, one should expect the corresponding distributed algorithm, when it converges, to be slowed down by at least a factor of diam(G), again due to the limit on information travel in discrete time. As noted in Remark 5.3.10, the best analysis of distributed subgradient methods for convex cost functions with bounded subgradients demonstrates O(N/ε²) convergence time, which is linear in the network size (see [134]), while that of the centralized counterpart is O(1/ε²). Our result of O(κ_min/ε²) convergence time is the first to bridge the gap between O(N/ε²) and O(diam(G)/ε²), which we reckon to be the limit of distributed subgradient methods. Of course, the tightness of these bounds depends on the network topology. For example, in a line graph they coincide. Complete graphs are at the other extreme, with κ_min = diam(G) + 1 = 2 for any network size; this makes sense, since agents in a complete network should be able to act unanimously.

5.6 Simulations

Next we give some simulation results to illustrate the algorithms proposed above. In these examples, each agent does not know its minimal polynomial in advance, but rather computes it using Algorithm 4 of Chapter 4 in connection with M consensus iterations (5.6), provided that M is sufficiently large, e.g., M ≥ 2 deg(q_i) + 1, ∀i ∈ V.

5.6.1 Example 1: Network of 5 agents with differentiable cost functions having Lipschitz continuous gradient

Consider the network and associated weight matrix shown in Figure 5.1:

  W = [ 0.7  0.3  0    0    0
        0.2  0.6  0    0    0.2
        0    0.3  0.4  0.3  0
        0    0    0.5  0.5  0
        0    0    0.4  0    0.6 ]

Figure 5.1: Network topology in example 1.

Let X = [−2, 2] and the local cost functions be

  f_1(x) = (x − 3)² + 2x,
  f_2(x) = (x⁴ − x³ + 2x²)/3,
  f_3(x) = (x − 0.1)⁴/6,
  f_4(x) = eˣ,
  f_5(x) = −2x² + 2x (nonconvex).

Note that F(x) = ∑_{i=1}^{5} f_i(x) is convex, but f_5 is not. In fact, F is strongly convex with Lipschitz continuous gradient. Here, x* = 0.7427 ∈ X and F* = 9.4812. We combine both extensions of the main algorithm as proposed in Section 5.3.2 to deal with the global constraint X and the row stochasticity of the weight matrix; in particular, (5.32) and (5.37) are employed (see Remark 5.3.15). In order to apply these iterations, the agents need to find their corresponding elements π_i of the normalized left Perron eigenvector π of the weight matrix. This can be done using the idea discussed at the end of Section 5.3.2, which we describe next. Prior to implementation of the optimization algorithm, let all the agents run the following 2N + 1 iterations (we assume that the network size N is available to the agents):

  p^{(i)}(t+1) = ∑_{j∈N_i} w_ij p^{(j)}(t), ∀i ∈ V, t = 0, …, 2N,   (5.60)

where p^{(i)}(0) = e_i ∈ R^N (i.e., the i-th unit vector). At time t = 2N + 1, each agent i has enough data, namely the sequence {[p^{(i)}]_i(t)}_{t=0}^{2N}, to compute a^{(i)} (equivalently, q_i) as well as π_i, as shown in Section 5.2.3. In particular,

  π = [0.20619, 0.30928, 0.20619, 0.12371, 0.15464]^T,
  a^{(i)} = [0.0192, −0.261, 1.1, −1.8, 1]^T, ∀i ∈ V.

In this case, κ_min = N. We will take κ = N + 1.
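A compact numerical sketch of this pre-processing step is given below: from the locally observed scalar sequence {[p^{(i)}]_i(t)}, each agent recovers its coefficient vector a^{(i)} (the minimal polynomial with the factor (ξ − 1) divided out) and then π_i via (5.38). Here the sequences are generated by powering W directly, whereas in the network they arrive by message passing; the rank/least-squares test is a centralized stand-in for Algorithm 4 of Chapter 4, and the printed values should reproduce those reported above.

import numpy as np

W = np.array([[0.7, 0.3, 0.0, 0.0, 0.0],
              [0.2, 0.6, 0.0, 0.0, 0.2],
              [0.0, 0.3, 0.4, 0.3, 0.0],
              [0.0, 0.0, 0.5, 0.5, 0.0],
              [0.0, 0.0, 0.4, 0.0, 0.6]])
N = W.shape[0]
for i in range(N):
    rows = [np.eye(N)[i]]                 # e_i^T W^t, t = 0, 1, ...
    for _ in range(N):
        rows.append(rows[-1] @ W)
    K = np.vstack(rows)                   # (N+1) x N Krylov matrix
    for d in range(1, N + 1):             # smallest d admitting a dependency
        q_low, *_ = np.linalg.lstsq(K[:d].T, -K[d], rcond=None)
        if np.allclose(K[:d].T @ q_low, -K[d], atol=1e-8):
            q = np.append(q_low, 1.0)     # monic q_i, increasing powers
            break
    a_dec, _ = np.polydiv(q[::-1], [1.0, -1.0])   # divide out the root at 1
    a = a_dec[::-1]                       # a^(i), increasing powers
    seq = K[: len(a), i]                  # x_i(t) for x(0) = e_i
    pi_i = a @ seq / a.sum()              # finite-time value of (5.38)
    print(np.round(a, 4), round(pi_i, 5))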
Next, to ensure convergence of the algorithm, the constant step size γ is chosen to satisfy the conditions of Theorem 5.3.12, which requires a Lipschitz constant L of the global gradient ∇F. We now show that all the agents can determine such a constant in finite time and in a distributed manner, and then locally select a suitable step size. To this end, suppose that the agents know Lipschitz constants l̄_i of their local gradients ∇f_i. If f_i is twice continuously differentiable, such an l̄_i := max_{x∈X} |∇²f_i(x)| can be found easily, especially when X is compact (which holds in this example). Clearly, ∇F is also Lipschitz continuous on X with constants L̄ and L̂ given by

  L̄ := ∑_{i∈V} l̄_i ≤ N max_{i∈V} l̄_i =: L̂.

On the one hand, by using a max-consensus protocol [139], all the agents can determine max_{i∈V} l̄_i and hence L̂ (since N is assumed known to every agent). However, L̂ could be a very loose bound, leading to a very small step size and thus reduced convergence speed. On the other hand, L̄ is usually a tighter bound, which can also be computed locally by using the minimal polynomials obtained earlier, as follows. After (5.60), let all the agents also perform the update (where t denotes the iteration index, not physical time)

  l_i(t+1) = ∑_{j∈N_i} w_ij l_j(t), ∀i ∈ V, t = 0, …, N,

where l_i(0) = π_i^{−1} l̄_i. Upon termination, each agent i can find

  (∑_{τ=0}^{D_i} a_τ^{(i)} l_i(τ)) / (∑_{τ=0}^{D_i} a_τ^{(i)}) = ∑_{i∈V} π_i l_i(0) = ∑_{i∈V} l̄_i = L̄,

where the first equality follows from Theorem 5.2.3. Now the agents can locally compute the step size γ = ε · 2N/L̄, where a common ε ∈ (0, 1) is known to the agents beforehand. Since L̄ is usually not the least Lipschitz constant, ε can be set to 1. (A small ε leads to a small step size, which can reduce the convergence speed.) In this example, {l̄_i} = {2, 20, 8.9, 7.4, 4} and we take γ = 2N/L̄ = 10/42.3. Moreover, we let every agent i compute F_k^{(i)} = (1/N) F(s_i(kκ)) and use a relative tolerance ε = 10⁻⁶ (see Appendix A.3.9 and criterion (A.42)) to locally terminate the main algorithm.

The simulation results of our algorithm are given in Fig. 5.2, with (randomly generated) initial conditions

  x(0) = [0.6238, 1.4262, −0.9162, 1.5838, −1.1648]^T.

As expected, the s_i (depicted by solid lines) become identical after the first κ steps and then converge to the optimal solution x* (shown by the dashed line), while the x_i reach limit cycles of period κ. Every agent locally decides to stop at t = 186 (i.e., k = 31), since each finds that |F_30^{(i)} − F_29^{(i)}| = 7.5253 × 10⁻⁷ |F_30^{(i)}| (noting that F_30^{(i)} is computed at time t = 31κ). Upon termination, s_i(186) = 0.7402 and F(s_i(186)) = 9.4813. We also carry out the centralized subgradient method (5.3), in the form of (5.19), using the same step size γ_k = γ and the starting point s̄(κ) = π^T x(0). The simulation result for s̄(kκ) is marked with ◦ in the top-left subplot of Fig. 5.2 and agrees with s_i(kκ). We further compare the objective error of our algorithm with that obtained from the DPS method (5.5) with diminishing, non-summable step sizes of the form γ(t) = a/t^b, where a = [0.01 : 0.05 : 0.5] and b = [0.5 : 0.1 : 1]. For this method, we denote s̄(t) = (1/N) ∑_i x_i(t). Results for a few samples of (a, b) are given in the right subplot of Fig. 5.2. Here we also scale the gradients in DPS by a factor of (Nπ_i)^{−1}, just as in the re-weighting technique of [93, 100] and in (5.37). Note that in general the DPS method is not guaranteed to converge when some f_i are nonconvex.
Moreover, the DSM (5.4) fails to converge if γ(t) is not selected carefully, e.g., a > 0.5 and b = 0.5. Clearly, our algorithm outperforms DPS in terms of convergence rate and the number of gradient evaluations.

Figure 5.2: Network responses for example 1 with convex cost functions having Lipschitz continuous gradient, using Algorithm (5.32) and (5.37). Left: for any i ∈ V, s_i(t) (solid lines) converges to the optimal solution (dashed line) and x_i(t) reaches a limit cycle of period κ; in the top-left panel, ◦ represents s̄(kκ) of the centralized subgradient method implemented as (5.19). Right: objective error comparison with DPS using step size γ(t) = a/t^b, where (blue) solid lines correspond to a = 0.01, (green) dashed lines to a = 0.05, (black) dotted lines to a = 0.1, and (cyan) dash-dotted lines to a = 0.2; for each a, results are plotted for b = 0.5 and 1. The results from our algorithm are shown as red circles ◦. The algorithm terminates locally for all the agents at t = 186, with the relative error of the global cost function guaranteed to be less than ε = 10⁻⁶.

With the same network and weight matrix as before, we now consider quadratic cost functions f_i(x) = b_i(x − c_i)², where c = [0, 4, 3, 1, 1]^T and b = [3, 1, 3, 3, 4]^T. Fig. 5.3 shows the performance of Algorithm (5.53)-(5.54) with different values of the step size γ. Clearly, sufficiently small γ yields near dead-beat responses.

Figure 5.3: Network responses for example 1 with quadratic cost functions when using Algorithm 5.3 with κ = 7, x(0) = c, and four values of γ (10⁻¹, 10⁻², 10⁻³, 10⁻⁶).

5.6.2 Example 2: Network of 200 agents with ℓ1 cost functions

Now we consider a set of N = 200 agents communicating over a ring graph, where N_i = {i, i ± 1, i ± 10} for all i ∈ V (indices taken modulo 200; e.g., if i + 10 > 200, by i + 10 we mean i + 10 − 200). Here, the graph is distance regular with diameter diam(G) = 10, and each agent has 5 neighbors (including itself). Assume that W = [w_ij] is such that w_ij = 1/|N_i| if j ∈ N_i and w_ij = 0 otherwise. In this case, by Theorem 5.5.1 (with ε = 0.2), we have deg(q_i) = diam(G) + 1 = 11, ∀i ∈ V. The local cost functions are f_i(x) = |x − c_i| with x ∈ X = [0, 100], and c_i = i if i ≤ 100 and c_i = 0.2(i − 100) otherwise. Thus the goal of the agents is to find the median value of {c_i}_{1}^{N}; in this case, x* = 16.9. The choice of c_i is motivated by the desire to have x* far from the mean µ_c of the elements of {c_i} (here µ_c = 30.3, which could be found by any averaging protocol).

Here, each local cost function is convex but nondifferentiable and has subgradients bounded by L_0 = 1. Therefore, we will use Algorithm (5.32)-(5.33), where each agent initializes x_i(0) = c_i. We also want to compare the performance of our algorithm with the one in [134]. Thus, we choose the step size by Theorem 5.3.5(ii), i.e., γ = (R/L_0)√(κ/T). In this example, we take the number of iterations T = 4N (as suggested by [134]), R = 100 (the size of X), and L_i = L_0 = 1. (A better choice of R can be found as follows: at time κ, s̄(κ) = s_i(κ) = ∑_{i=1}^{N} x_i(0)/N = µ_c; thus each agent can take R = max_{u∈X} |s_i(κ) − u| = 100 − s_i(κ) = 69.7, which improves the objective error bound in (5.22) by 30.3%.) Since deg(q_i) = 11, ∀i ∈ V, κ must satisfy κ ≥ 11. We suppose that each agent sets κ = 50. Hence, γ = 25.
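A quick check of the Example-2 targets (the construction of {c_i}, the median 16.9, and the mean 30.3 far from it) is sketched below; the minimizer of ∑_i |x − c_i| over R is the median of {c_i}, which here lies inside X = [0, 100].

import numpy as np

i = np.arange(1, 201)
c = np.where(i <= 100, i, 0.2 * (i - 100)).astype(float)
print(np.median(c))   # 16.9 = x*, the minimizer of sum_i |x - c_i|
print(np.mean(c))     # 30.3 = mu_c, far from the median by design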
In this example, we let each agent i reevaluate a^{(i)} (equivalently, its minimal polynomial) after every κ steps (note that κ > 2 deg(q_i) in this case); the reason is that we observe from simulations that the computation of minimal polynomials in large graphs is prone to numerical error. The results are given in Fig. 5.4. Algorithm (5.32)-(5.33) and the one in [134] do not converge asymptotically to the optimal solution but rather approach a solution neighborhood, since both use constant step sizes. Ours does so faster, and the size of the neighborhood is much smaller (thanks to the rate O(R L_max √(κ/T)), as compared with O(R² L_max √(N/T)) in [134]; see Remark 5.3.10). There are some small numerical errors in the simulation of s(t), but these do not cause instability in our algorithm. The DSM (5.4) with γ(t) = 1/√t admits asymptotic convergence, but at a very slow rate. Furthermore, it is not trivial to choose a "good" step size sequence or a stopping criterion with a performance guarantee (such as consensus and an objective error bound), and it is especially difficult to do so in a distributed fashion. In contrast, our algorithm allows efficient stopping criteria such as (5.26)-(5.27) (and (A.42) in Appendix A.3.9), or a predetermined number of iterations. Moreover, consensus (of the optimization variables s_i) is always guaranteed upon termination.

Figure 5.4: Responses of the network in example 2. Dashed line: optimal solution. (a)-(b): Algorithm (5.32)-(5.33), where the sub-figure within (a) is a zoom-in on the period [400, 800]; (c): the algorithm of [134] (Olshevsky, 2016) with constant step size β = 1/(L_0 √(NT)); (d): the Distributed Subgradient Method (5.4) with γ(t) = 1/√t.

5.7 Concluding Remarks

We have presented three fast algorithms for the distributed optimization problem (5.1) on a fixed, directed graph, with convergence time linear in the maximum degree of the agents' minimal polynomials rather than in the network size. From a broader view, our algorithms can be seen as a way of distributing the centralized subgradient method without sacrificing its convergence behavior, at the cost of the algorithm being slowed down by the larger time scale needed for diffusing information through the network.

Among possible directions for future work, we mention the problem of designing the weight matrix W for a given network topology so as to achieve the smallest possible κ_min. For large networks, κ_min could be large, and hence determining the exact minimal polynomial could be a challenging task for each agent in terms of memory storage and computational capability. In this scenario, the use of approximations of minimal polynomials or of other finite-time consensus protocols may be a problem deserving investigation. Another appealing direction is to adapt the algorithms to time-varying networks, as well as to networks influenced by noise and/or delays.
Chapter 6: Distributed Optimization over Directed Graphs with Row Stochasticity and Constraint Regularity

Abstract: This chapter deals with an optimization problem over a network of agents, where the cost function is the sum of the individual (possibly nonsmooth) objectives of the agents and the global constraint set is the intersection of local constraints; this problem is more general than that of the previous chapter. The main goals of this chapter are: (i) to remove the need for column stochasticity; (ii) to relax the compactness assumption; and (iii) to provide a unified convergence analysis. Specifically, assuming the communication graph to be fixed and directed and the weight matrix to be (only) row stochastic, a distributed projected subgradient algorithm and a variation of the algorithm are presented to solve the problem for cost functions that are convex and Lipschitz continuous. The key component of the algorithms is the adjustment of each agent's subgradient by an estimate of the agent's corresponding entry of the normalized left Perron eigenvector of the weight matrix. These estimates are obtained locally from an augmented consensus iteration using the same row stochastic weight matrix and requiring very limited global information about the network. Moreover, based on a regularity assumption on the local constraint sets, a unified analysis is given that can be applied both to unconstrained problems and to constrained ones without assuming compactness of the constraint sets. Finally, the convergence rate of the algorithms is studied in terms of the distance from each agent's available estimate to the global constraint set and an objective error defined on this set.

6.1 Introduction

As in Chapter 5, we consider a network of agents without a central coordination unit that is tasked with solving a global optimization problem in which the objective function is the sum of local costs of the agents, that is, F(x) = ∑_{i=1}^{N} f_i(x), where f_i : R^m → R represents the private objective of agent i and N is the number of agents in the network. In addition, each agent may be associated with a private constraint set. Many distributed optimization methods have been developed to address this problem; see, e.g., [32, 39, 42, 43, 76-81, 83, 94, 134, 144, 145] and references therein.

Although much research has been carried out in this problem area, most of the existing literature invokes the assumption that communication among agents is bidirectional, i.e., that for any pair of neighboring agents, each agent receives information from the other. This assumption further allows many distributed algorithms to employ doubly stochastic weight matrices, providing straightforward mechanisms for the agents to reach an optimal consensus. However, the double stochasticity assumption is not always practical in real-world applications. This is the case, for example, when agents have different communication ranges due to environmental effects or individual broadcast power limits.

In this work, we consider a more general case in which the communication among agents is not necessarily bidirectional, and thus is naturally represented by a directed graph. This scenario has recently been considered in [82, 93, 145]. A common idea in these works is the combination of a (sub)gradient distributed optimization algorithm with the Push-Sum protocol [146].
One essential requirement of this protocol is that each agent knows its out-degree exactly and/or controls its outgoing weights so that they sum to one, leading to a column stochastic weight matrix. The same requirement is imposed in [147], where the authors develop a distributed subgradient algorithm employing a weight balancing technique. Such a requirement, however, can be impractical in many other situations, especially when agents use a broadcast-based communication scheme and thus neither know their out-neighbors nor are able to adjust their outgoing weights (i.e., the weights that others put on their information). In wireless sensor networks, for instance, directed communication can arise as a consequence of geometric network layout or nonuniform power limits: each node can only send information to nodes lying within its coverage area, without receiving acknowledgment signals from them. A similar scenario may also arise during the operation of a network initially designed to implement a column stochastic weight matrix; for example, a node may encounter an unreliable or broken incoming channel and have no "cheap" and local means of informing the sender that it is not receiving packets (i.e., that it is logically not an out-neighbor of the sender). As a result, the performance of the network as a whole is not guaranteed in the presence of this malfunctioning communication link, even if the network remains strongly connected. Thus, networks relying on a column stochastic weight matrix may not be robust to link failure.

In comparison with a column stochastic weight matrix, one that is row stochastic is much easier to achieve in a distributed setting. Here, each agent can individually (and to some extent arbitrarily) decide the weights placed on the information it receives from its neighbors. Thus, if the weight matrix is required to be only row stochastic, there is no need for nodes to send acknowledgment signals. As an immediate but important consequence, a network requiring only a row stochastic weight matrix is more robust to link losses/jamming, and even to changes in the network structure. This makes row stochastic matrices suitable for reaching consensus in broadcast-based communication environments, for example ad hoc wireless networks. However, when a row stochastic matrix is used for distributed optimization, most (if not all) (sub)gradient-based algorithms fail to achieve an optimal solution, due to the nonuniform stationary distribution of the weight matrix (also known as its normalized left Perron eigenvector). In [93], the authors suggest a re-weighting technique that makes it possible to use a row stochastic matrix in distributed optimization. The same technique is employed in [94]. However, the implementation of the algorithms in [93, 94] assumes knowledge of the graph, namely the stationary distribution of the weight matrix and the number of agents in the network. Indeed, a fully distributed algorithm employing only row stochastic weight matrices has thus far not been available in the field of distributed optimization.

In this work, we achieve such algorithms under mild requirements on the available global network information and under the assumption that the network is strongly connected. More precisely, we present a distributed algorithm, and a variation on it, that use a row stochastic weight matrix and assume only that each agent knows an upper bound on the number of agents in the network. Our idea is as follows.
We let all the agents perform an augmented consensus protocol in order to estimate the stationary distribution of the weight matrix while updating their states using an iteration akin to that of the Distributed Projected Subgradient (DPS) method (see, e.g., [137, 148, 149]), except that the subgradient values are now scaled appropriately and locally by the agents. Here, the estimation step is implemented concurrently with the optimization step, so no network communication overhead is added. Moreover, although the algorithm is based on the projected subgradient method, we believe that its principle (i.e., the use of a particular augmented consensus) can be generalized to a class of distributed algorithms that use consensus and subgradient steps.

Another important contribution is our unified convergence analysis (together with the convergence rate), which applies both to unconstrained problems and to constrained problems with identical or nonidentical private constraint sets. Most existing works on subgradient-based methods assume the problem to be either unconstrained [78, 134, 145] or constrained with identical (often compact) constraint sets [81, 93, 137, 148, 150]. Nonidentical constraints are considered in [94, 137, 151], where the local constraint sets are assumed to be compact and their intersection to have a nonempty interior. In our work, we assume regularity of the constraint sets, which is weaker than requiring boundedness and allows the global constraint set to have an empty interior. We establish convergence of our algorithms to the optimal solution and demonstrate how the rate of convergence depends on the step size sequence, exhibiting similarity to that of the centralized subgradient approach. To the best of our knowledge, convergence rates of distributed subgradient methods have not been studied before for the case of nonidentical unbounded constraint sets (possibly with an empty-interior intersection).

Preliminary work along the lines of this chapter appeared in [150], where only one algorithm was presented and several proofs were omitted. In addition, it is assumed in [150] that all the local constraint sets are identical and compact, while in this chapter we consider nonidentical constraint sets and relax the compactness requirement, allowing for a broader class of applications. The current chapter further introduces a variation on the algorithm presented in [150] and presents a new convergence analysis that holds for both algorithms under these relaxations. Here the proof technique relies on the regularity assumption on the local constraint sets and is thus significantly different from that of [150]. Finally, the rate of convergence, which was not shown in [150], is studied here for both algorithms.

The rest of the chapter proceeds as follows. The problem formulation and proposed algorithms are given in Section 6.2. The convergence and the convergence rate of the algorithms are studied in Sections 6.3 and 6.4, respectively. Section 6.5 includes a numerical example to illustrate our findings. Concluding remarks are given in Section 6.6.

Additional Notation and Terminology: The projection of a vector x onto a nonempty closed convex set X ⊆ R^m is denoted by P_X(x), i.e., P_X(x) = arg min_{y∈X} ‖x − y‖, where, as usual, ‖·‖ denotes the 2-norm. We also denote by dist(x, X) the (Euclidean) distance from x to X, i.e., dist(x, X) = ‖x − P_X(x)‖.
The following inequality is known as the nonexpansiveness property:

  ‖P_X(x) − P_X(y)‖ ≤ ‖x − y‖, ∀x, y ∈ R^m.   (6.1)

We will employ the notion of regularity of the constraint sets, which plays an important role in the study of projection algorithms. This notion involves upper bounding the distance of a point to the intersection of a collection of closed convex sets in terms of its distance to each set (see [152, 153]). Recalled next is the definition needed here, stated in a finite-dimensional setting.

Definition 6.1.1. A collection of closed convex sets {X_i, i ∈ V} (with a nonempty intersection) is regular with respect to a nonempty set B ⊆ R^m if there exists a constant r_B ≥ 1 such that

  dist(x, ∩_{i∈V} X_i) ≤ r_B max_{i∈V} dist(x, X_i), ∀x ∈ B.   (6.2)

It is said to be regular if B = R^m.

For example, if the sets X_i are identical, then they are regular. By contrast, the two sets X_1 = {(x_1, x_2) : x_2 ≥ x_1²} and X_2 = {(x_1, x_2) : x_2 ≤ 0}, with X_1 ∩ X_2 = {0}, are not regular with respect to any ball centered at the origin: for x = (t, 0) with small t ≠ 0, dist(x, X_1 ∩ X_2) = |t| while max_i dist(x, X_i) ≤ t², so no finite constant r_B can satisfy (6.2).

6.2 Problem Formulation and Proposed Algorithms

Consider a network consisting of N agents where the underlying communication is characterized by a fixed directed graph G = (V, E). All agents share the objective of solving

  min_{x∈R^m} F(x) := ∑_{i∈V} f_i(x),  s.t.  x ∈ ∩_{i∈V} X_i =: X,   (6.3)

where each f_i : R^m → R is a convex function representing the private objective of agent i, and each X_i is a convex constraint set available only to agent i. Obviously, F is also convex. Let F* and X* denote the optimal value and the optimal solution set of the problem (i.e., X* = {x ∈ X : F(x) = F*}). Let

  U := conv( ∪_{i∈V} X_i ),   (6.4)

i.e., the convex hull of ∪_{i∈V} X_i. The following assumptions are adopted in the sequel.

Assumption 6.2.1. (Basic Problem Assumptions) Problem (6.3) satisfies the following:

(a) (Constraint sets) The sets X_i ⊆ R^m are closed and convex, and X ≠ ∅. Moreover, {X_i, i ∈ V} is regular with respect to U.

(b) (Bounded subdifferential) For any i ∈ V, f_i : R^m → R is convex with subdifferential bounded on U, i.e.,

  ∃ L_f ∈ (0, ∞): ‖g_i‖ ≤ L_f, ∀g_i ∈ ∂f_i(x), ∀x ∈ U.   (6.5)

(c) The solution set X* is nonempty.

Here, the regularity assumption on the collection of constraint sets is weaker than requiring boundedness, which allows us to consider a broader class of optimization problems. This assumption holds trivially with r_U = 1 when the constraint sets are identical. An unconstrained problem is a special case with X_i = R^m, ∀i ∈ V. The regularity assumption is also satisfied if the sets X_i are compact and X has a nonempty interior; such assumptions are used in [94, 137]. In fact, by [152, Cor. 2], one can deduce that r_U = ∑_{i=1}^{N} (D_U/δ)^i is a regularity constant, where D_U denotes the diameter of U and δ is the radius of a ball lying in U. Other important cases include the sets X_i being hyperplanes or half-spaces (see [153]). Note also that since each f_i is convex on R^m, Assumption 6.2.1(b) implies that each individual cost function f_i is L_f-Lipschitz continuous on U. This will be the case if all the X_i are compact, since then U is also compact. Assumption 6.2.1(c) is satisfied when, e.g., the sets X_i are closed and at least one of them is compact, since then X is compact. In general, however, we do not require compactness of the constraint sets.

In our setting, agent i only has access to f_i and to local information on its neighbors' opinions, and no central coordinating node is assumed to exist. Thus, the agents need to collaborate in a distributed manner to solve problem (6.3).
This involves local iterative computation along with information diffusion. We are interested in the scenario where the communication graph G connecting the agents is directed and fixed. We make the following additional blanket assumptions.

Assumption 6.2.2. (Connectivity) The network G = (V, E) is strongly connected.

Assumption 6.2.3. (Unique ID) The agents are labeled 1, 2, …, N and their messages carry a unique identifier of the sender. Moreover, all the agents know the value N (or an upper bound N' ≥ N).

Assumption 6.2.3 is only technical, implying that each agent can distinguish messages from its neighbors. This will be the case if media access control (MAC) addresses are used. Here, at any time slot, each agent exchanges its current state with its neighbors (in accordance with the directed network structure). Upon receiving the information from its neighbors (including itself), agent i incorporates knowledge of these states using a weighted averaging scheme. Thus, each edge (i, j) ∈ E is associated with a weight w_ij ≥ 0 (locally chosen by agent i). Let the weight matrix W = [w_ij] satisfy the following condition.

Assumption 6.2.4. (Weight Rule) The matrix W satisfies w_ii > 0 for i ∈ V, w_ij > 0 for (i, j) ∈ E, and w_ij = 0 otherwise. Moreover, W is row stochastic.

This assumption means that the zero-nonzero structure of the weight matrix W reflects the network structure. Note also that W has positive diagonal elements, reflecting that each agent has access to its own state. Further, W is irreducible under Assumption 6.2.2. Again we stress that, unlike existing algorithms in the literature, the weight matrix W is assumed to be only row stochastic, not doubly stochastic or column stochastic. As a result, each agent i controls the i-th row of W, independently of the others. This also gives each agent the freedom to decide the weights it places on its neighbors' information, and explains why row stochastic matrices are well suited to ad hoc wireless networks.

We now propose the following distributed algorithm to solve problem (6.3) under all the assumptions above.

Algorithm 6.1. At time t = 0, agent i initializes an estimate x_i(0) ∈ X_i and a variable z_i(0) = e_i ∈ R^N (or ∈ R^{N'} if only a bound N' ≥ N is available). For each time t ≥ 0, all agents update their states as follows:

  x_i(t+1) = P_{X_i}( ∑_{j∈V} w_ij x_j(t) − γ(t) g_i(t)/z_ii(t) ),   (6.6)
  z_i(t+1) = ∑_{j∈V} w_ij z_j(t).   (6.7)

Here, N_i is the set of node i's in-neighbors (including itself), γ(t) is a nonnegative step size (to be specified later), g_i(t) ∈ ∂f_i(∑_{j∈V} w_ij x_j(t)) is a subgradient of f_i, and z_i(t) = [z_i1, z_i2, …, z_iN]^T for each i ∈ V. Note that, because w_ii > 0, ∀i ∈ V (cf. Assumption 6.2.4), together with (6.7) and z_ii(0) = 1, it can be shown (later, in Lemma 6.3.3) that z_ii(t) > 0, ∀t ≥ 0, ∀i ∈ V. Thus (6.6) is well defined.
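The following is a minimal sketch of Algorithm 6.1 on a toy scalar instance (m = 1) with made-up data: f_i(x) = |x − c_i| with interval constraints X_i whose intersection is regular, and a merely row stochastic W. Each agent rescales its subgradient by z_ii(t), its running estimate of π_i.

import numpy as np

W = np.array([[0.6, 0.4, 0.0],
              [0.3, 0.4, 0.3],
              [0.5, 0.0, 0.5]])        # row stochastic, strongly connected
N = W.shape[0]
c = np.array([0.0, 2.0, 4.0])          # f_i(x) = |x - c_i|
lo = np.array([-5.0, 0.0, 1.0])        # X_i = [lo_i, hi_i]; X = [1, 3]
hi = np.array([3.0, 5.0, 3.0])

x = np.clip(c, lo, hi)                 # x_i(0) in X_i
Z = np.eye(N)                          # z_i(0) = e_i, stacked as rows
for t in range(1, 20001):
    gamma = 1.0 / t ** 0.75            # satisfies Assumption 6.2.5 below
    v = W @ x                          # consensus mix sum_j w_ij x_j(t)
    g = np.sign(v - c)                 # subgradient of |x - c_i| at v_i
    x = np.clip(v - gamma * g / np.diag(Z), lo, hi)   # (6.6) with P_{X_i}
    Z = W @ Z                          # (6.7): Z(t) = W^t, so z_ii -> pi_i
print(x)   # entries approach 2, the median of c, which lies in X = [1, 3]

Note that the estimation iterates z_ii(t) = [W^t]_ii converge geometrically to π_i (see Proposition 6.3.4 below), so the early-iteration scaling error is summable against a diminishing step size.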
(In fact, if initialized with zi(0) = pi, then it follows from (6.7) that zi(t) = pi for all t ≥ 0.) In this case, our 201 rescaling subgradient technique reduces to the reweighting scheme used in [93,94]. We also remark that the DPS method in [137] can be applied to time-varying networks but requires the weight matrix to be doubly stochastic at each time t. Further, for nonidentical constraint sets, [137] only considers complete graphs and assumes that the intersection set X has nonempty interior. Later, [94] extended the method to directed time-varying graphs possibly with (fixed and uniform) com- munication delays but still requires doubly stochastic weight matrices and compact constraint sets with nonempty interior. Thus, the results in these works are not readily applicable to cases where the Xi are unbounded (e.g., X = Rm) and/or X has an empty interior (e.g., an Xi includes linear equality constraints) and the weight matrix is only row stochastic. Another extension in [145] dealing with the unconstrained case employs column stochastic matrices. Algorithm 6.1 can be seen as an extension of DPS under the fixed network setting where only row stochastic weight matrices are used. Note also that here we assume the network is fixed during one run of the algorithm. Between any two consecutive runs, the network structure is allowed to change, and our algorithm need not be adjusted except each agent i may need to reselect new weights wij for its new neighbor set (which is a trivial task). Moreover, our development technique does not employ the compactness of the constraint sets as well as nonempty interior of their intersection. The following variation on Algorithm 6.1 will also be considered, where each agent takes the subgradient step first, followed by the consensus step: Algorithm 6.2. With the same initializations as in Algorithm 6.1, all agents update 202 their states according to xi(t+ 1) = PXi (∑ j∈V wij ( xj(t)− γ(t) gj(t) zjj(t) )) (6.8) zi(t+ 1) = ∑ j∈V wijzj(t), (6.9) where gj(t) ∈ ∂fj(xj(t)), i.e., a subgradient of fj at xj(t) (which differs from the subgradient used in (6.6) of Algorithm 6.1). It has been shown in [76] that the order of the optimization step and the consensus step in the original DPS method can be interchanged, which, if a constant step size is used, often gives a better convergence speed to a solution neighborhood [156]. Comparison between Algorithms 6.1 and 6.2, however, is out of the scope of this chapter. In this work, the following type of diminishing step size sequence will be used to ensure convergence of our algorithms to the optimal solution. For the convergence rate analysis, a less restrictive assumption will be employed. Assumption 6.2.5. (Step Size Rule) The step size sequence {γ(t)} is positive non- increasing and satisfies ∑∞ t=0 γ(t) =∞ and ∑∞ t=0 γ 2(t) <∞. There are many ways to choose the step size sequence γ(t) satisfying this assumption, e.g., γ(t) = c tθ ,∀t ≥ 1, for some constants c > 0 and θ ∈ (0.5, 1]. 6.3 Basic Relations and Convergence Result In this section, we simultaneously prove the convergence of both algorithms (6.6)- (6.7) and (6.8)-(6.9). We begin with a few basic results that will be used later. 203 First, besides the nonexpansiveness property (6.1), other properties of a pro- jection operator are given in the following lemma. Lemma 6.3.1. ([137]) Let Y ⊆ Rm be a nonempty closed convex set. Then for any x ∈ Rm and y ∈ Y , (a) (PY (x)− x)T (x− y) ≤ −‖PY (x)− x‖2. (b) ‖PY (x)− y‖2 ≤ ‖x− y‖2 − ‖PY (x)− x‖2. 
Second, the following lemma is a consequence of the convexity of the function ‖ · ‖2. Lemma 6.3.2. For any a1, . . . , aN ≥ 0 such that ∑N i=1 ai = 1, we have ‖ ∑N i=1 aixi‖2 ≤∑N i=1 ai‖xi‖2 for ∀xi ∈ Rm, i = 1, . . . , N . Next, we characterize the convergence of the power iteration of the weight matrix in the following lemma, which is a consequence of the Perron-Frobenius theorem (see, e.g., [95]). Lemma 6.3.3. (Convergence of power of weight matrix) Let Assumptions 6.2.2 (Connectivity) and 6.2.4 (Weight Rule) hold. Then limt→∞W t = 1piT , where pi > 0 is the normalized left Perron eigenvector of W . Moreover, the convergence is geometric with rate λ ∈ (|λ2(W )|, 1), where λ2(W ) is the second largest eigenvalue of W. Proof. Under Assumptions 6.2.2 and 6.2.4, W is an irreducible row stochastic matrix with positive diagonal entries, and thus primitive (i.e., W is irreducible and has only 204 one eigenvalue of maximum modulus; see, e.g., [95, Thm. 8.5.2 and Lem. 8.5.5]). The result now follows from [95, Thm. 8.5.1]. The next proposition, describing the convergence of the estimation step in (6.7), follows directly from the foregoing lemma and will be used in the sequel. Proposition 6.3.4. (Convergence of zii) Consider iteration (6.7). Let Assumptions (Connectivity) and 6.2.4 (Weight Rule) hold. Then for each λ ∈ (|λ2(W )|, 1), there exists C=C(λ,W ) > 0 such that the following hold for ∀i, j ∈ V and ∀t ≥ 0: |[W t]ji − pii| ≤ Cλt, |zii(t)− pii| ≤ Cλt. (6.10) Moreover, there exists η > 0 such that η−1 ≤ zii(t) ≤ 1, ∀t ≥ 0,∀i ∈ V . (6.11) Proof. Let Z(t) = [z1(t), z2(t), · · · , zN(t)]T . It follows from Algorithm 6.1 that for any t ≥ 0, Z(t+ 1) = WZ(t), Z(0) = I. Thus, Z(t) = W t,∀t ≥ 0. Hence, (6.10) follows by Lemma 6.3.3 for some C > 0 and λ ∈ (|λ2(W )|, 1). Next, for each i ∈ V , by (6.7), we have zii(t + 1) = ∑ j∈V wijzji(t), where zii(0) = 1, zji(0) = 0,∀j 6= i. Clearly, 1 ≥ zij(t) ≥ 0,∀i, j ∈ V ,∀t ≥ 0. Since limt→∞ zii(t) = pii > 0, there exists t0 ≥ 0 such that zii(t) ≥ pii/2,∀i ∈ V ,∀t > t0. Moreover, we have that zii(t0) ≥ wiizii(t0 − 1) ≥ . . . ≥ wt0ii zii(0) > 0 since wii > 0 (cf. Assumption 6.2.4). Therefore, zii(t) > 0 for any t ∈ [0, t0]. By taking η−1 = min{zii(t), pii/2,∀i ∈ V ,∀t ∈ [0, t0]}, 205 then (6.11) follows as desired. Remark 6.3.5. In the rest of the chapter, the parameters C, λ and η refer to the constants in Proposition 6.3.4. We now turn to iterations (6.6) and (6.8). Our next result describes a general relation on the overall evolution of the states of the agents in terms of their distances from any point v ∈ X as well as the weighted averaged state vector x¯(t), defined as x¯(t) := ∑ j∈V pijxj(t), ∀t ≥ 0. (6.12) This relation also involves the step size sequence γ(t) and an error term ( F (x¯(t))− F (v) ) , which in general is not the global objective error since x¯(t) may not be in X; it is so if the constraint sets {Xi, i ∈ V} are identical. Theorem 6.3.6. (Bound on evolution of xi) Let Assumptions 6.2.1 (Basic Problem Assumptions), 6.2.2 (Connectivity), 6.2.3 (Unique ID) and 6.2.4 (Weight Rule) be satisfied. 
Then for both Algorithms 6.1 and 6.2, the following holds for any v ∈ X and t ≥ 0: ∑ i∈V pii‖xi(t+ 1)− v‖2 ≤ (1 +D1λ2t) ∑ i∈V pii‖xi(t)− v‖2 − 2γ(t)(F (x¯(t))− F (v))−∑ i∈V pii‖φi(t)‖2 +D2γ(t) ∑ i∈V pii‖xi(t)− x¯(t)‖+D3γ2(t), (6.13) where D1 = NCLfη,D2 = 2Lfη, D3 = L 2 fη 2 +NLfCη, and φi(t) := PXi (∑ j∈V wijxj(t)− γ(t) gi(t) zii(t) ) − (∑ j∈V wijxj(t)− γ(t) gi(t) zii(t) ) (6.14) 206 for Algorithm 6.1 whereas for Algorithm 6.2, φi(t) is defined as φi(t) := PXi (∑ j∈V wij ( xj(t)− γ(t) gj(t) zjj(t) )) − ∑ j∈V wij ( xj(t)− γ(t) gj(t) zjj(t) ) (6.15) Proof. We provide here a proof for the case of Algorithm 6.1. The proof for Algo- rithm 6.2 is given in Appendix A.4.1. Let yi(t) := ∑ j∈V wijxj(t). By using (6.6) and the definition of φi(t) (cf. (6.14)), we have for any v ∈ X ⊆ Xi ‖xi(t+ 1)− v‖2 = ∥∥∥yi(t)− v − γ(t) gi(t) zii(t) + φi(t) ∥∥∥2 = ∥∥∥yi(t)− v − γ(t) gi(t) zii(t) ∥∥∥2 + ‖φi(t)‖2 + 2φi(t)T(yi(t)− v − γ(t) gi(t) zii(t) ) ≤ ∥∥∥yi(t)− v − γ(t) gi(t) zii(t) ∥∥∥2 − ‖φi(t)‖2, (6.16) where the last inequality follows from the fact that (cf. Lemma 6.3.1(a)) φi(t) T ( yi(t)− γ(t) gi(t) zii(t) − v) ≤ −‖φi(t)‖2. The first term on the right side of (6.16) equals ‖yi(t)− v‖2 + 2γ(t) zii(t) gi(t) T (v − yi(t)) + γ 2(t) z2ii(t) ‖gi(t)‖2. (6.17) We now derive an upper bound for each term in this sum. Rewriting yi(t) − v =∑ j∈V wij(xj(t)− v) then using Lemma 6.3.2 yields ‖yi(t)− v‖2 ≤ ∑ j∈V wij‖xj(t)− v‖2. (6.18) Next, ignoring the positive factor 2γ(t) zii(t) , the second term in (6.17) can be bounded as follows: gi(t) T (v − yi(t)) ≤ fi(v)− fi(yi(t)) ≤ fi(v)− fi(x¯(t)) + ∣∣fi(yi(t))− fi(x¯(t))∣∣ ≤ fi(v)− fi(x¯(t)) + Lf ∑ j∈V wij ‖xj(t)− x¯(t)‖ . (6.19) 207 where the first inequality holds since gi(t) ∈ ∂fi(yi(t)), the second follows from the triangle inequality, and the last one from Lf -Lipschitz continuity of fi over conv (⋃ i∈V Xi ) (cf. Assumption 6.2.1(b)) and the triangle inequality. By continuing (6.16) and using (6.17), (6.18), (6.19) and the conditions that ‖gi(t)‖ ≤ Lf and z−1ii (t) ≤ η,∀i ∈ V ,∀t ≥ 0, we have ‖xi(t+ 1)− v‖2 ≤ ∑ j∈V wij‖(xj(t)− v)‖2 + 2γ(t) zii(t) (fi(v)− fi(x¯(t)))− ‖φi(t)‖2 + 2Lf γ(t) zii(t) ∑ j∈V wij ‖xj(t)− x¯(t)‖+ γ2(t)L2fη2. (6.20) Multiplying both sides by pii then summing over i ∈ V yields ∑ i∈V pii‖xi(t+ 1)− v‖2 ≤ ∑ i∈V pii ∑ j∈V wij‖xj(t)− v‖2 + 2 ∑ i∈V piiγ(t) zii(t) (fi(v)− fi(x¯(t)))− ∑ i∈V pii‖φi(t)‖2 + 2Lf ∑ i∈V piiγ(t) zii(t) ∑ j∈V wij ‖xj(t)− x¯(t)‖+ γ2(t)L2fη2. (6.21) Now consider each term on the right side of (6.21). First, ∑ i∈V pii ∑ j∈V wij‖xj(t)− v‖2 = ∑ i∈V pii‖xi(t)− v‖2, (6.22) where we have used the fact that piTW = piT . Second, ∑ i∈V pii zii(t) ( fi(v)− fi(x¯(t)) ) = ∑ i∈V fi(v)− fi(x¯(t)) + ∑ i∈V ( pii zii(t) − 1)(fi(v)− fi(x¯(t))) ≤ F (v)− F (x¯(t)) + ∑ i∈V |zii(t)− pii| zii(t) |fi(x¯(t))− fi(v)| ≤ F (v)− F (x¯(t)) +NCLfηλt ‖x¯(t)− v‖ , (6.23) 208 where C > 0 and λ ∈ (0, 1) satisfy (6.10), and η satisfies (6.11). Third, we also have ∑ i∈V pii zii(t) ∑ j∈V wij ‖xj(t)− x¯(t)‖ ≤ η ∑ i,j∈V piiwij ‖xj(t)− x¯(t)‖ = η ∑ i∈V pii ‖xi(t)− x¯(t)‖ (6.24) Now, combining (6.21)-(6.24) yields ∑ i∈V pii‖xi(t+ 1)− v‖2 ≤ ∑ i∈V pii‖xi(t)− v‖2 − 2γ(t) (F (x¯(t))− F ∗) − ∑ i∈V pii‖φi(t)‖2 + 2γ(t)NCLfηλt‖x¯(t)− v‖ + 2γ(t)Lfη ∑ i∈V pii ‖xi(t)− x¯(t)‖+ γ2(t)L2fη2. (6.25) Finally, by writing x¯(t)−v = ∑i∈V pii(xi(t)−v) and then using the Cauchy-Schwarz inequality and Lemma 6.3.2, we have 2γ(t)λt‖x¯(t)− v‖ ≤ γ2(t) + λ2t‖ ∑ i∈V pii(xi(t)− v)‖2 ≤ γ2(t) + λ2t ∑ i∈V pii‖xi(t)− v‖2 Using this bound for (6.25) and then rearranging terms yields (6.13) as desired. 
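To make the preceding developments concrete, the following is a minimal simulation sketch of Algorithm 6.1 on a toy instance; the four-node directed cycle, weights, ℓ1 objectives, and box constraints are hypothetical stand-ins chosen only for illustration (they are not the setting of Section 6.5). It also exhibits the behavior asserted in Proposition 6.3.4: the diagonal iterates zii(t) approach pii geometrically.

```python
import numpy as np

# Minimal sketch of Algorithm 6.1 on a toy instance (hypothetical data).
# Directed 4-cycle with self-loops: W is row stochastic with positive
# diagonal (Assumption 6.2.4) but is neither doubly nor column stochastic.
N, m = 4, 2
a = np.array([0.5, 0.7, 0.6, 0.8])
W = np.zeros((N, N))
for i in range(N):
    W[i, i] = a[i]                 # self-weight w_ii > 0
    W[i, (i - 1) % N] = 1 - a[i]   # weight on the single in-neighbor

rng = np.random.default_rng(0)
c = rng.normal(size=(N, m))        # f_i(x) = ||x - c_i||_1
lo = -1.0 - rng.random(N)          # X_i = [lo_i, hi_i]^m; all boxes
hi = 1.0 + rng.random(N)           # contain [-1, 1]^m, so X is nonempty

x = np.array([np.clip(rng.normal(size=m), lo[i], hi[i]) for i in range(N)])
z = np.eye(N)                      # z_i(0) = e_i, hence Z(t) = W^t

for t in range(200_000):
    gamma = (t + 1) ** -0.75       # satisfies Assumption 6.2.5
    y = W @ x                      # rows: y_i = sum_j w_ij x_j
    g = np.sign(y - c)             # g_i(t), a subgradient of f_i at y_i
    x = np.array([np.clip(y[i] - gamma * g[i] / z[i, i], lo[i], hi[i])
                  for i in range(N)])            # update (6.6)
    z = W @ z                                    # update (6.7)

w_eig, v_eig = np.linalg.eig(W.T)
pi = np.real(v_eig[:, np.argmax(np.real(w_eig))])
pi /= pi.sum()                     # normalized left Perron eigenvector
print("max_i |z_ii - pi_i| :", np.abs(np.diag(z) - pi).max())
print("max disagreement    :", np.ptp(x, axis=0).max())
```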
Before proceeding further, it is worth highlighting the differences between this result, in particular (6.13), with that obtained from the usual DPS method [137] in the context of Algorithm 6.1. First, since the (normalized) left Perron eigenvector pi is nonuniform, we opt for employing the weighted average vectors x¯(t) (as well as∑ i∈V pii‖xi(t) − v‖2) instead of the exact average one. Of course, when pi = 1/N , i.e., W is doubly stochastic, the former vector reduces to the latter. Second, the term D1λ 2t ∑ i∈V pii‖xi(t)− v‖2 (or more precisely the term 2γ(t)NCLfηλt‖x¯(t)− v‖ in 209 (6.25)) arises as a consequence of each agent i using an estimate zii(t) of pii generated from the estimation step (6.7). Finally, since we do not require the constraint sets to be bounded or identical (or have a nonempty interior), the projection error φi is not guaranteed to be bounded a priori and the term ( F (x¯(t))− F (v)) does not reflect the global objective error (as x¯(t) need not be in X). Therefore, quantifying the behaviors of these terms and errors will be the main challenging task in analyzing the convergence as well as the convergence rates of our algorithms; this calls for new results that are more accessible than (6.13) as we develop in the sequel. We now provide some bounds on the terms ‖xi(t)−x¯(t)‖ and ‖φi(t)‖ appearing in (6.13) in terms of the step size sequence γ(t) and the total projection error β(t), defined as β(t) := ∑ i∈V ‖φi(t)‖, ∀t ≥ 0. (6.26) Theorem 6.3.7. Let Assumptions 6.2.1 (Basic Problem Assumptions), 6.2.2 (Con- nectivity), 6.2.3 (Unique ID), and 6.2.4 (Weight Rule) hold. The following hold for both Algorithms 6.1 and 6.2: (a) Let D4 := C ∑ j∈V ‖xj(0)‖. For any i ∈ V, ‖xi(t)− x¯(t)‖ ≤ D4λt +D1 ∑ 0≤s≤t−1 λt−sγ(s) + C ∑ 0≤s≤t−1 λt−sβ(s). (6.27) (b) Define θ(t) := γ(t) ∑ 0≤s≤t−1 λt−sβ(s), θ(0) = 0. (6.28) If {γ(t)} is nonincreasing, then θ(t+ 1) ≤ λθ(t) + λγ(t)β(t). (6.29) 210 Proof. (a) First, we express (6.6) and (6.8) in the form xi(t+ 1) = ∑ j∈V wijxj(t) + i(t), (6.30) where i(t) ∈ Rm is an error term. Then, we have xi(t) = ∑ j∈V [W t]ijxj(0) + ∑ 0≤s≤t−1 ∑ j∈V [W t−s]ijj(s). Since x¯(t) = ∑ j∈V pijxj(t) and pi TW = piT , it follows that x¯(t) = ∑ j∈V pijxj(0) + ∑ 0≤s≤t−1 ∑ j∈V pijj(s). Thus, the term ‖xi(t)− x¯(t)‖ can be expressed as∥∥∥∑ j∈V ( [W t]ij − pij ) xj(0) + t−1∑ s=0 ∑ j∈V ( [W t−s]ij − pij ) j(s) ∥∥∥ ≤ ∑ j∈V ∣∣[W t]ij − pij∣∣ ‖xj(0)‖+ t−1∑ s=0 ∑ j∈V ∣∣[W t−s]ij − pij∣∣ ‖j(s)‖. Hence, by using the bound in (6.10), we then have ‖xi(t)− x¯(t)‖ ≤ D4λt + C ∑ 0≤s≤t−1 λt−s ∑ j∈V ‖j(s)‖. (6.31) Now consider Algorithm 6.1, where it follows from (6.6) and (6.14) that i(t) = φi(t) − γ(t) gi(t)zii(t) . By using the triangle inequality and the facts that ‖gi(t)‖ ≤ Lf (cf. Assumption 6.2.1(b)) and that z−1ii ≤ η (see (6.11)), we obtain ‖i(t)‖ ≤ ‖φi(t)‖+ γ(t)Lfη, ∀i ∈ V . (6.32) Next, we show that this bound also holds for Algorithm 6.2. From (6.8) and (6.15) we have i(t) = φi(t)− γ(t) ∑ j∈V wij gj(t) zjj(t) . As a result, for ∀i ∈ V ‖i(t)‖ ≤ ‖φi(t)‖+ γ(t) ∑ j∈V wij ‖gj(t)‖ |zjj(t)| ≤ ‖φi(t)‖+ γ(t)Lfη. 211 By combining (6.32) and (6.31) and rearranging terms, we have ‖xi(t)− x¯(t)‖ ≤ D4λt + C ∑ 0≤s≤t−1 λt−s ( Nγ(s)Lfη + β(s) ) . (b) By using the definition of θ(t) and the monotonicity of {γ(t)}, we have θ(t+ 1) ≤ γ(t) ∑ 0≤s≤t λt+1−sβ(s) = λθ(t) + λγ(t)β(t), which concludes the proof. We note the following. First, it is clear from (6.27) that the effect of initial conditions on the differences between agents’ states vanishes exponentially. 
Second, one can view the last two terms on the right side of (6.27) as the convolutions of γ(t) and β(t) with λt. Thus, for the convergence of the algorithms, we expect these terms to decay to zero under a suitable choice of γ(t). For example, when limt→∞ γ(t) = 0, we show next that limt→∞ ∑t−1 s=0 λ t−sγ(s) = 0. However, whether this also implies limt→∞ ∑t−1 s=0 λ t−sβ(s) = 0 is inconclusive since β(t) depends on the agents’ states and the sets Xi. Finally, we introduced θ(t) in order to study the behavior of the term γ(t) ∑ i∈V pii‖xi(t)− x¯(t)‖ in (6.13). Corollary 6.3.8. In Theorem 6.3.7, if limt→∞ β(t) = 0, then limt→∞ θ(t) = 0. Additionally, if limt→∞ γ(t) = 0, then limt→∞ ∑ i∈V pii‖xi(t)− x¯(t)‖ = 0. Proof. Clearly, it suffices to prove that for any λ ∈ (0, 1) and any nonnegative sequence {β(t)}t≥0 satisfying limt→∞ β(t) = 0, limt→∞ ∑t s=0 λ t−sβ(s) = 0. This claim is stated in [137, Lem. 7]. Our next result is basically a consequence of Theorems 6.3.6 and 6.3.7 un- der the regularity assumption on the constraint sets. Specifically, we will apply 212 the bounds obtained in (6.27) and (6.29) to (6.13), and then select suitable associ- ated coefficients to generate a more accessible relation, which is key to proving the convergence as well as convergence rate of the algorithms. Theorem 6.3.9. Let Assumptions 6.2.1 (Basic Problem Assumptions), 6.2.2 (Con- nectivity), 6.2.3 (Unique ID), and 6.2.4 (Weight Rule) be satisfied. The following holds for both Algorithms 6.1 and 6.2 and for any nonincreasing step size sequence {γ(t)}: ∑ i∈V pii‖xi(t+ 1)− v‖2 + abθ(t+ 1) ≤ (1 +D1λ2t) ∑ i∈V pii‖xi(t)− v‖2 + abθ(t) − 2γ(t)(F (s(t))− F (v))−D6∑ i∈V ‖φi(t)‖2 +D24γ(t)λ t +D21γ(t) ∑ 0≤s≤t−1 λt−sγ(s) +D′3γ 2(t), (6.33) where s(t) = PX ( x¯(t) ) , pimin = mini∈V pii, b = √ pimin Nλ , a = D′2C (1−λ)b , D ′ 2 = D2 + 2LR pimin , R is a regularity constant of {Xi, i ∈ V}, D6 = pimin2 , D24 = D′2D4, D21 = D′2D1 and D′3 = 2D3+λa2 2 . Proof. By adding and subtracting F (s(t)) and using the Lipschitz continuity of F we have F (v)− F (x¯(t)) ≤ F (v)− F (s(t)) + Lf‖s(t)− x¯(t)‖. Now we find an upper bound on the term ‖s(t)−x¯(t)‖. By the regularity assumption of {Xi, i ∈ V}, there exists R such that dist(x, X) ≤ Rmaxi∈V dist(x, Xi), ∀x ∈ 213 conv(∪i∈VXi). As a result, we have ‖s(t)− x¯(t)‖ = dist(x¯(t), X) ≤ Rmax i∈V dist(x¯(t), Xi) ≤ R ∑ i∈V pii pimin dist(x¯(t), Xi) ≤ R ∑ i∈V pii pimin ‖xi − x¯(t)‖, (6.34) where the last inequality holds since dist(x¯(t), Xi) ≤ ‖xi− x¯(t)‖ (cf. Lem. 6.3.1(b)). Hence, F (v)− F (x¯(t)) ≤ F (v)− F (s(t)) + LfR pimin ∑ i∈V pii‖xi(t)− x¯(t)‖. Using this bound for (6.13), we then have ∑ i∈V pii‖xi(t+ 1)− v‖2 ≤ (1 +D1λ2t) ∑ i∈V pii‖xi(t)− v‖2 − 2γ(t)(F (s(t))− F (v))−∑ i∈V pii‖φi(t)‖2 + (D2 + 2LR pimin )γ(t) ∑ i∈V pii‖xi(t)− x¯(t)‖+D3γ2(t). Next, by adding abθ(t+1) to both sides of this relation and using the bounds (6.27) and (6.29), we further have ∑ i∈V pii‖xi(t+ 1)− v‖2 + abθ(t+ 1) ≤ (1 +D1λ2t) ∑ i∈V pii‖xi(t)− v‖2 + abθ(t) + ab(λ− 1)θ(t) + abλγ(t)β(t) − 2γ(t)(F (s(t))− F (v))−∑ i∈V pii‖φi(t)‖2 +D3γ2(t) +D24γ(t)λ t +D21γ(t) ∑ 0≤s≤t−1 λt−sγ(s) +D2Cθ(t). (6.35) 214 Now with the choice of a = D′2C (1−λ)b , the terms ab(λ− 1)θ(t) and D′2Cθ(t) cancel out. Further, by the Cauchy-Schwarz inequality, we have abγ(t)β(t) ≤ a 2γ2(t) + b2β2(t) 2 ≤ a 2γ2(t) 2 + b2n 2 ∑ i∈V ‖φi(t)‖2 = a2 2 γ2(t) + pimin 2λ ∑ i∈V ‖φi(t)‖2. The last equality holds since b2 = pimin Nλ . As a result, we have abλγ(t)β(t)− ∑ i∈V pii‖φi(t)‖2 ≤ λa 2 2 γ2(t)− pimin 2 ∑ i∈V ‖φi(t)‖2. 
It remains to apply the relations above to (6.35) and then rearrange terms to obtain (6.33). It should be noted that (6.33) holds uniformly on X since the constants Di are independent of the choice of v ∈ X. When restricted to X∗, we immediately have a relation between the (weighted average) distance (squared) from the optimal solution, i.e., ∑ i∈V pii‖xi(t)− v∗‖2, and the global objective error F (s(t))− F ∗ (as s(t) ∈ X), both of which are desired to converge under a suitable choice of step size sequence. We are now ready to give a convergence result that applies to both Algo- rithms (6.6)-(6.7) and (6.8)-(6.9), whose proof is based on the Theorem 6.3.9 and the following lemma. Lemma 6.3.10. ([157]) Let {vt}∞t=0, {ut}∞t=0, {bt}∞t=0 and {ct}∞t=0 be nonnegative sequences such that ∑∞ t=0 bt <∞, ∑∞ t=0 ct <∞ and for ∀t ≥ 0 vt+1 ≤ (1 + bt)vt − ut + ct. (6.36) 215 Then {vt} converges and ∑∞ t=0 ut <∞. Theorem 6.3.11. (Convergence to optimal solution) Let Assumptions 6.2.1-6.2.5 be satisfied. Then both Algorithms 6.1 and 6.2 yield convergence to the optimal solution, i.e., ∃x∗ ∈ X∗ : lim t→∞ xi(t) = x ∗, ∀i ∈ V . (6.37) Proof. The proof proceeds in two steps: (i) apply Lemma 6.3.10 to (6.33), and then (ii) prove convergence to the optimal solution. Step (i): Let x† be arbitrary in X∗ and define the nonnegative sequences {vt}, {ut}, {bt} and {ct} as follows: vt := ∑ i∈V pii‖xi(t)− x†‖2 + abθ(t), bt := D1λ2t, ut := 2γ(t)(F (s(t))− F ∗) +D6 ∑ i∈V ‖φi(t)‖2, ct := D24γ(t)λ t +D21γ(t) ∑ 0≤s≤t−1 λt−sγ(s) +D′3γ 2(t). By adding the nonnegative term D1λ 2tabθ(t) to the right hand side of (6.33), we obtain vt+1 ≤ (1 + bt)vt − ut + ct, ∀t ≥ 0. We now show that other conditions of Lemma 6.3.10 also hold, namely ∑∞ t=0 bt < ∞ and ∑∞t=0 ct < ∞. The former condition is obvious since λ ∈ (0, 1) implies that ∑∞ t=0 bt = (1 − λ2)−1. To prove the latter, consider each term in ct. First,∑ t≥0 γ 2(t) < ∞ by Assumption 6.2.5. Second, by the Cauchy-Schwarz inequality, 216 we have γ(t)λt ≤ (γ2(t) + λ2t)/2. Thus ∑ t≥0 γ(t)λt ≤ 1 2 ∑ t≥0 γ2(t) + 1 2 ∑ t≥0 λ2t <∞. (6.38) Third, by monotonicity of sequence {γ(t)} (cf. Assumption 6.2.5) the second term in ct can be bounded as follows: γ(t) ∑t−1 s=0 λ t−sγ(s) ≤∑t−1s=0 λt−sγ2(s) ≤∑ts=0 λt−sγ2(s). Thus, for any N ≥ 1 we have N∑ t=1 γ(t) t−1∑ s=0 λt−sγ(s) ≤ ∑ 0≤s≤t≤N λt−sγ2(s) ≤ N∑ s=0 γ2(s) ∞∑ t=s λt−s = ∑ 0≤s≤N γ2(s) 1− λ ≤ ∑ s≥0 γ 2(s) 1− λ <∞. This concludes that {ct} is summable as desired. Therefore, in view of Lemma 6.3.10, the following hold: ∃ lim t→∞ ∑ i∈V pii‖xi(t)− x†‖2 + abθ(t) =: δ ≥ 0 (6.39) ∑ t≥0 γ(t)(F (s(t))− F ∗) + D6 2 ∑ i∈V ‖φi(t)‖2 <∞. (6.40) Step (ii): First, by (6.40), we have limt→∞ ∑ i∈V ‖φi(t)‖2 = 0. Thus, limt→∞ β(t) = 0, which by Corollary 6.3.8 yields limt→∞ θ(t) = 0. It then follows from (6.39) that lim t→∞ ∑ i∈V pii‖xi(t)− x†‖2 = δ. (6.41) As a result, for each i ∈ V , {xi(t)}t≥0 is a bounded sequence. Thus so are {x¯(t)}t≥0 and {s(t)}t≥0. Next, since ∑ t≥0 γ(t) =∞, it then follows from (6.40) that lim inft→∞ F (s(t)) = F ∗. Thus, there exists a subsequence {s(tk)} ⊆ {s(t)} such that lim k→∞ F (s(tk)) = F ∗. (6.42) 217 Now since {s(tk)} is also a bounded sequence, there exists a convergent subsequence {s(tl)} ⊆ {s(tk)}. Denote liml→∞ s(tl) = x∗ for some x∗ ∈ X (since X is closed). We next show that x∗ ∈ X∗. By the continuity of F on Rm lim l→∞ F (s(tl)) = F (x ∗). (6.43) which in view of (6.42) implies that F (x∗) = F ∗. By convexity of F , we conclude that x∗ ∈ X∗. Since x† ∈ X∗ was chosen arbitrarily, we can let x† = x∗. 
Now it remains to show that δ = 0, which by (6.41) will then complete the proof. By the triangle and Cauchy-Schwarz inequalities ‖xi(t)− x∗‖2 ≤ (‖xi(t)− x¯(t)‖+ ‖x¯(t)− s(t)‖+ ‖s(t)− x∗‖)2 ≤ 3(‖xi(t)− x¯(t)‖2 + ‖x¯(t)− s(t)‖2 + ‖s(t)− x∗‖2). Next, since ‖x¯(t) − s(t)‖ ≤ R pimin ∑ i∈V pii‖xi − x¯(t)‖ (cf. (6.34)), we have ‖s(t) − x¯(t)‖2 ≤ R2 pi2min ∑ i∈V pii‖xi − x¯(t)‖2 by Lemma 6.3.2. As a result, 1 3 ‖xi(t)− x∗‖2 ≤ ‖xi(t)− x¯(t)‖2 + R 2 pi2min ∑ i∈V pii‖xi − x¯(t)‖2 + ‖s(t)− x∗‖2. Multiplying both sides by pii and summing over i ∈ V yields ∑ i∈V pii 3 ‖xi(t)− x∗‖2 ≤ R′ ∑ i∈V pii‖xi(t)− x¯(t)‖2 + ‖s(t)− x∗‖2, where R′ = 1+ R 2 pi2min . Taking lim inf as t→∞ both sides of this inequality and using (6.41) yield: δ 3 ≤ lim inf t→∞ ( R′ ∑ i∈V pii‖xi(t)− x¯(t)‖2 + ‖s(t)− x∗‖2 ) = lim inf t→∞ ‖s(t)− x∗‖2. (6.44) 218 Here we have used the superadditivity property of the limit inferior and the fact that limt→∞ ∑ i∈V pii‖xi(t)− x¯(t)‖2 = 0 since limt→∞ β(t) = 0 (see Corollary 6.3.8). Since the subsequence {s(tl)} converges to x∗, we have lim inft→∞ ‖s(t) − x∗‖ = 0, which in view of (6.44) implies that δ = 0. 6.4 Rate of Convergence We now discuss the convergence rate of our algorithms, which evidently depends on the choice of γ(t). Since the estimation step (6.7) converges exponentially, one should expect that the convergence rate of the objective error is equivalent to that of usual distributed subgradient methods in the case when the constraint sets are identical and/or compact. We emphasize, however, that such assumptions are relaxed in our work, i.e., the sets Xi can be nonidentical and unbounded. Moreover, the global constraint set X is also allowed to have an empty interior. Thus, for all i ∈ V , the agents’ estimates xi(t) as well as their weighted average x¯(t) need not be in the set X at any time t. As a result, local analysis around the optimal solution does not readily apply. In this work, to quantify the distance from the optimum, we propose to use a combined error term which involves (i) the distance from a local estimate x˜i(t) of each agent to some point s˜(t) ∈ X and (ii) the global objective error evaluated at s˜(t), i.e., F (s˜(t))− F ∗. Specifically, we define x˜i(t) := ∑t k=0 γ(k)xi(k)∑t k=0 γ(k) , s˜(t) := ∑t k=0 γ(k)s(k)∑t k=0 γ(k) . (6.45) Here, for each t ≥ 1, x˜i(t) is a convex combination of xi(0),xi(1), . . . ,xi(t), which 219 can be computed locally by agent i but might not be in X. In contrast, s˜(t) always belongs to X but is not directly available to each agent. The following theorem asserts that both errors ‖x˜i(t)− s˜(t)‖ and F (s˜(t))− F ∗ decay as O( ∑t k=0 γ 2(k)∑t k=0 γ(k) ). Theorem 6.4.1. (Convergence rate) Let Assumptions 6.2.1 (Basic Problem As- sumptions), 6.2.2 (Connectivity), 6.2.3 (Unique ID) and 6.2.4 (Weight Rule) hold. Let {γ(t)} be a nonnegative and nonincreasing sequence. Then for both Algorithms 6.1 and 6.2, the following holds for ∀t ≥ 0: C0‖x˜i(t)− s˜(t)‖+ F (s˜(t))− F ∗ ≤ C1 + C2 ∑t k=0 γ 2(k)∑t k=0 γ(k) , (6.46) where C0 = D6(1−λ) 2N(N+1)Cλ , some C1 > 0 and C2 = O( N4C2 (1−λ)2 e D1 1−λ2 ) as N → ∞ and λ→ 1 (recalling that D1 = NCLfη and D6 = pimin/2). Moreover, if {Xi, i ∈ V} are compact, the constant C2 is O( N4C2 (1−λ)2 ). The proof of Theorem 6.4.1 is structured in the following steps: (i) Use the bound (6.33) in Theorem 6.3.9 to upper estimate the sum t∑ k=0 2γ(k)(F (s(k))− F ∗) +D6 ∑ i∈V ‖φi(k)‖2 in terms of ∑ 0≤k≤t γ(k) and ∑t k=0 γ 2(k). 
(ii) Relate the left side of (6.46) to this sum by using the convexity of F and the bounds given in Theorem 6.3.7. (iii) Analyze the constants Ci. The following technical lemma will be used in Step (i) for the general case where {Xi,∀i ∈ V} are not necessarily bounded. 220 Lemma 6.4.2. For any D > 0 and λ ∈ (0, 1), it holds that 1 + D 1− λ ≤ ∏ t≥0 (1 +Dλt) ≤ e D1−λ . (6.47) Proof of Lemma 6.4.2. Note that for any T ≥ 1 we have 1 +D ∑ 0≤t≤T λt ≤ ∏ 0≤t≤T (1 +Dλt) ≤ eD ∑T t=0 λ t where the second inequality follows from the basic relation that 1 + x ≤ ex for any x ≥ 0. Taking the limit as T →∞ yields the desired result. Proof of Theorem 6.4.1. We proceed through the 3 steps described above. Step (i): Let the nonnegative sequences {vt}, {ut}, {bt} and {ct} be defined as in Step (i) of the proof of Theorem 6.3.11, i.e., vt := ∑ i∈V pii‖xi(t)− x∗‖2 + abθ(t), bt := D1λ2t, ut := 2γ(t)(F (s(t))− F ∗) +D6Φt, Φt := ∑ i∈V ‖φi(t)‖2 ct := D24γ(t)λ t +D21γ(t) ∑ 0≤s≤t−1 λt−sγ(s) +D′3γ 2(t). By using Theorem 6.3.9 and adding the nonnegative term btabθ(t) to the right hand side of (6.33), we have vt+1 ≤ (1 + bt)vt − ut + ct, ∀t ≥ 0, which then implies that vt+1 ≤ ∏ 0≤k≤t (1 + bk)v0 + ∑ 0≤k≤t (ck − uk) ∏ k+1≤s≤t (1 + bs). (6.48) 221 By Lemma 6.4.2, the following holds for any t, k ≥ 0 1 < ∏ k≤s≤t (1 + bs) < e D1 1−λ2 =: De. As a result, (6.48) implies that vt+1 ≤ Dev0 + ∑ 0≤k≤t Deck − ∑ 0≤k≤t uk, (6.49) from which by rearranging terms and using the fact that vt+1 ≥ 0, we have (recalling the definition of ut) ∑ 0≤k≤t γ(k)(F (s(k))− F ∗) + D6 2 Φk ≤ R1 +R2 ∑ 0≤k≤t ck, (6.50) where R1 = Dev0/2 and R2 = De/2. Next, we will derive an upper bound on the term ∑t k=0 ck based on the following estimates: ∑ 0≤k≤t γ(k)λk ≤ ∑ 0≤k≤t γ(0)λk ≤ γ(0) 1− λ, (6.51) and ∑ 0≤k≤t γ(k) ∑ 0≤s≤k λk−sγ(s) ≤ ∑ 0≤k≤t ∑ 0≤s≤k γ2(s)λk−s = ∑ 0≤s≤t γ2(s) ∑ s≤k≤t λk−s ≤ ∑t s=0 γ 2(s) 1− λ . (6.52) Hence, ∑ 0≤k≤t ck ≤ D24γ(0) 1− λ + ( D21 1− λ +D ′ 3) ∑ 0≤k≤t γ2(k). Therefore, ∑ 0≤k≤t γ(k)(F (s(k))− F ∗) + D6 2 Φk ≤M1 +M2 ∑ 0≤k≤t γ2(k), (6.53) 222 where M1 = R1 + R2D24γ(0) 1− λ , M2 = R2 ( D21 1− λ +D ′ 3 ) . Step (ii): Now we derive lower bounds on the left hand side of (6.53). Recall that s˜(t) = ∑t k=0 γ(k)s(k)/ ∑t k=0 γ(k). By convexity of F , we then have F (s˜(t))− F ∗ ≤ ∑t k=0 γ(k) ( F (s(k))− F ∗)∑t k=0 γ(k) . (6.54) Next, we will relate the term ‖x˜i(t)− s˜(t)‖ with ∑t k=0 Φk. By the triangle inequality, it can be shown that ‖x˜i(t)− s˜(t)‖ ≤ ∑t k=0 γ(k)‖xi(k)− s(k)‖∑t k=0 γ(k) . (6.55) We now quantify the numerator of the right hand side of (6.55). First, note that (cf. (6.27)) ‖xi(t)− x¯(t)‖ ≤ D4λt +D1 ∑ 0≤s≤t−1 λt−sγ(s) + C ∑ 0≤s≤t−1 λt−sβ(s). Second, let R be a regularity constant of {Xi, i ∈ V}. Then ‖s(t)− x¯(t)‖ = dist(x¯(t), X) ≤ Rmax i∈V dist(x¯(t), Xi) ≤ R ∑ i∈V dist(x¯(t), Xi) ≤ R ∑ i∈V ‖xi − x¯(t)‖. Thus, by the triangle inequality and the two previous relations, ‖xi(k)− s(k)‖ (N + 1)C ≤ D4 C λt + D1 C ∑ 0≤s≤t−1 λt−sγ(s) + ∑ 0≤s≤t−1 λt−sβ(s) 223 which implies that (see the definition of θ(t) in Theorem 6.3.7(b)) ∑ 0≤k≤t γ(k) ‖xi(k)− x¯(k)‖ (N + 1)C ≤ D4 C ∑ 0≤k≤t γ(k)λk + D1 C ∑ 0≤k≤t γ(k) ∑ 0≤s≤k−1 λk−sγ(s) + ∑ 0≤k≤t θ(k). (6.51)−(6.52) ≤ D4γ(0) (1− λ)C + D1 (1− λ)C ∑ 0≤s≤t γ2(s) + ∑ 0≤k≤t θ(k). (6.56) The last term can be bounded as follows. By (6.29) and noting that θ(0) = 0, we have ∑ 0≤k≤t θ(k) ≤ λ ∑ 0≤k≤t−1 θ(k) + λ ∑ 0≤k≤t−1 γ(k)β(k) ≤ λ ∑ 0≤k≤t θ(k) + λ ∑ 0≤k≤t γ2(k) 4 + β2(k), where we have used the fact that γβ ≤ γ2 4 +β2,∀γ, β ∈ R. Rearranging terms yields ∑ 0≤k≤t θ(k) ≤ λ 1− λ ∑ 0≤k≤t−1 γ2(k) 4 + β2(k). 
(6.57) Moreover, by the Cauchy-Schwarz inequality, we have ∑ 0≤k≤t β2(k) = ∑ 0≤k≤t (∑ i∈V ‖φi(k)‖ )2 ≤ ∑ 0≤k≤t N ∑ i∈V ‖φi(k)‖2 ≤ N ∑ 0≤k≤t Φk. (6.58) Using this bound and (6.57) for (6.56), we obtain C0 ∑ 0≤k≤t γ(k)‖xi(k)− x¯(k)‖ ≤M3 +M4 ∑ 0≤k≤t γ2(k) + D6 2 ∑ 0≤k≤t Φk, with C0 = D6(1− λ) 2N(N + 1)Cλ , M3 = D6D4γ(0) 2NλC , M4 = D6 2N (D1 λC + 1 4 ) . Combining the inequality above with (6.53) yields C0 ∑ 0≤k≤t γ(k)‖xi(k)− x¯(k)‖+ ∑ 0≤k≤t γ(k)(F (s(k))− F ∗) ≤ (M1 +M3) + (M2 +M4) ∑ 0≤k≤t γ2(k). 224 Let C1 := M1 + M3, C2 := M2 + M4. Dividing both sides by ∑t k=0 γ(k) and then using (6.54) and (6.55) yields (6.46) as desired, i.e., C0‖x˜i(t)− s˜(t)‖+ F (s˜(t))− F ∗ ≤ C1 + C2 ∑t k=0 γ 2(k)∑t k=0 γ(k) . Step (iii): We now discuss the constant associated with the convergence rate in terms of the network size and the spectral gap 1−λ. To this end, we assume, for simplicity that pi−1min = O(N) (in fact pimin ≤ 1N ) and that η = O(N). Then it can be verified that the dominant term is M2 in C2, which is O (N4C2R2 (1− λ)2 ) = O ( N4C2 (1− λ)2 e D1 1−λ2 ) . A better estimate can be obtained if we assume further that {Xi, i ∈ V} are compact. In this case, there exists DX > 0 such that ‖xi(t) − x∗‖2 ≤ DX ,∀i ∈ V ,∀t ≥ 0. Thus, by using Theorem 6.3.9, we have for any t ≥ 0 vt+1 ≤ vt + bt ∑ i∈V pii‖xi(t)− x†‖2 − ut + ct ≤ vt +DXbt − ut + ct ≤ v0 + ∑ 0≤k≤t DXbk − uk + ck ≤ v0 + DXD1 1− λ2 + ∑ 0≤k≤t ck − uk. Here we have used the facts that ∑ i∈V pii = 1 and t∑ k=0 bk = D1 t∑ k=1 λ2k ≤ D1 1− λ2 . As a result, ∑ 0≤k≤t uk ≤ v0 + D1DX 1− λ2 + ∑ 0≤k≤t ck, ∀t ≥ 0. (6.59) 225 Thus, we have that (6.50) still holds but with R1 = v0+ D1DX 1−λ2 and R2 = 1 (compared to R2 = De/2 as before). Hence, the constant C2 reduces to O( N4C2 (1−λ)2 ). We remark that the explicit formulas for C1 and C2 obtained in the proof are rather involved. Thus to simplify the estimate orders of C2, we have assumed that pi−1min = O(N) (in fact pimin ≤ 1N ) and that η = O(N). Note also that the spectral gap, defined as 1−|λ2(W )|, also affects the constant bounds since |λ2(W )| < λ < 1, signifying the importance of the strong connectivity assumption. This result demonstrates how the convergence property of the step size se- quence implies that of our algorithms; as a side note Assumption 6.2.5 is not needed for (6.46) to hold. In particular, convergence rate analysis now boils down to study- ing the behavior of the right side of (6.46); exactly the same task has been carried out thoroughly in the literature for centralized (projected) subgradient methods (see, e.g., [115,117,138]). Thus, we proceed no further than providing a few notable results and proving another convergence bound on the objective error in the case of identical constraint sets. Corollary 6.4.3. Let the assumptions of Theorem 6.4.1 be satisfied. Let E(t) = C1+C2 ∑t k=0 γ 2(k)∑t k=0 γ(k) . The following hold. (a) If γ(t) ≡ γ, then E(t) = C2γ + C1γt . If limt→∞ γ(t) = 0 and ∑ t≥0 γ(t) = ∞, then limt→∞E(t) = 0. (b) If {Xi, i ∈ V} are identical, then there exist C˜1>0, C˜2>0 such that F (x˜i(t))− F ∗ ≤ C˜1 + C˜2 ∑t k=0 γ 2(k)∑t k=0 γ(k) . (6.60) 226 Further, if γ(t) = O( 1√ t ) then F (x˜i(t))− F ∗ = O( ln t√t ). Proof. We only prove (6.60) in part (b). Note that xi(t) ∈ X for ∀t ≥ 0 and ∀i ∈ V . By Lipschitz continuity of F , we have F (x˜i(t))−F ∗ = F (x˜i(t))−F (s˜(t))+F (s˜(t))− F ∗ ≤ NLf‖x˜i(t)− s˜(t)‖+ F (s˜(t))− F ∗. It remains to use (6.46). 
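Since the right side of (6.46) is exactly the quantity studied for centralized subgradient methods, its behavior under the standard step size choices can be checked directly; the short sketch below (with C1 = C2 = 1 chosen arbitrarily, purely for illustration) reproduces the two regimes of Corollary 6.4.3.

```python
import numpy as np

# Behavior of the bound E(t) = (C1 + C2 * sum g^2) / (sum g) from
# Corollary 6.4.3 for two step size choices (C1 = C2 = 1, arbitrary).
T = np.arange(1, 10**6 + 1)

g = 1.0 / np.sqrt(T)                      # gamma(t) = 1/sqrt(t)
E_sqrt = (1 + np.cumsum(g**2)) / np.cumsum(g)

g_const = np.full_like(g, 0.01)           # gamma(t) = 0.01 (constant)
E_const = (1 + np.cumsum(g_const**2)) / np.cumsum(g_const)

for t in (10**2, 10**4, 10**6):
    print(t, E_sqrt[t - 1], np.log(t) / np.sqrt(t), E_const[t - 1])
# E_sqrt tracks O(log t / sqrt(t)) -> 0, while E_const levels off near
# C2 * gamma = 0.01, matching parts (a) and (b) of the corollary.
```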
Note that for unconstrained problems, the convergence rate of O(ln t/√t) is also achieved by recent distributed subgradient based methods such as Dual Averaging [81] or Subgradient-Push [145].

6.5 Numerical Example

Consider a machine learning problem involving the l1-norm regularized logistic loss function

min x∈X F(x) = ∑ 1≤i≤r log( 1 + exp( −li(pTi u + v) ) ) + µ‖u‖1

with variable x = [uT, v]T, u ∈ Rm, v ∈ R. Here, µ > 0 is a regularization parameter. The training set consists of r pairs (pi, li), where pi ∈ Rm is a feature vector and li ∈ {−1, 1} is the corresponding label. Suppose that x satisfies a linear equality constraint: X = {x ∈ Rm+1 : Aeq x = beq}, where Aeq ∈ Rq×(m+1) and beq ∈ Rq. In general, when the problem data is distributed or too large to store and/or process on a single machine, employing a network of machines provides a solution. This arises in many applications such as online social network data, wireless sensor networks, and cloud computing.

In our example, this problem is to be solved by a network of N = 9 nodes with the communication graph described in Fig. 6.1. We assume r = 500, m = 50 and q = 36, and select (pi, li), Aeq and beq based on normally distributed random numbers. We choose µ = 50.

Figure 6.1: Directed communication graph of the network example.

Suppose the problem data are distributed among the N nodes as follows: each node i stores a partition Pi of roughly r/N training pairs and a set of q/N equality constraints, referred to as (A(i)eq, b(i)eq). Thus, for each agent i ∈ V, the local cost function and constraint set are given by

fi(x) = ∑ j∈Pi log( 1 + exp( −lj(pTj u + v) ) ) + (µ/N)‖u‖1,
Xi = {x ∈ Rm+1 : A(i)eq x = b(i)eq}.

We assume that the weight matrix W = [wij] is such that wij = 1/|Ni| if j ∈ Ni and wij = 0 otherwise. We carry out simulations with Algorithms 6.1 and 6.2 using step size γ(t) = 1/(N2(t+1)), and with the usual DPS method (denoted DPS-(a)) and its variation DPS-(b) (i.e., with the order of the subgradient and consensus steps reversed) using step size γ′(t) = 1/(N(t+1)). Here γ(t) and γ′(t) differ by a factor N for the sake of comparison, since subgradients in our algorithms are scaled by 1/pii (which equals N if W is doubly stochastic). The initial state vectors are xi(0) = 0, ∀i ∈ V.

The simulation results, in terms of relative errors in the objective function and the optimal solution, are shown in Fig. 6.2, where F∗ and x∗ are obtained by solving the global problem using a centralized method. Clearly, both Algorithms 6.1 and 6.2 converge to the optimal solution and have similar performances, which are comparable to the DPS methods combined with the reweighting technique [93, 94], where knowledge of pi is assumed in advance (or equivalently, (6.6) and (6.8) with zii(t) = pii, ∀i ∈ V, ∀t ≥ 0). The usual DPS methods fail to converge to the optimal solution. We also consider the case where link 1 → 2 is lost. The reweighting technique requires the whole network to be reprogrammed with a new Perron eigenvector, which may not be available immediately. In contrast, our algorithms are unchanged except for node 2 adjusting its incoming link weights. Clearly, convergence is still achieved (since the network is still strongly connected) but slower, since the spectral gap decreases.

Figure 6.2: Performances of Algorithms 6.1 and 6.2 and of the DPS methods, with and without the reweighting technique, in terms of the relative objective error (F(s(t)) − F∗)/F∗ and the relative solution error maxi ‖xi(t) − x∗‖/‖x∗‖. Reweighting means that, for each i ∈ V, pii is known to agent i in advance and zii(t) = pii, ∀t ≥ 0. Here, s(t) = PX(x¯(t)).
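The local oracles used by each agent in this example admit a direct implementation; the following sketch is our own construction (names such as local_subgrad and local_proj are illustrative, and the random data stands in for the normally distributed data of the text).

```python
import numpy as np

# One possible realization of the local data and oracles in this example
# (names and structure are ours; data mirrors the random setup in the text).
rng = np.random.default_rng(1)
N, r, m, q, mu = 9, 500, 50, 36, 50.0
feat = rng.normal(size=(r, m))            # feature vectors p_j
lab = rng.choice([-1.0, 1.0], size=r)     # labels l_j
A_eq = rng.normal(size=(q, m + 1))        # full row rank a.s.; q < m + 1,
b_eq = rng.normal(size=q)                 # so X = {A_eq x = b_eq} is nonempty

parts = np.array_split(np.arange(r), N)   # node i stores the pairs in P_i
rows = np.array_split(np.arange(q), N)    # and q/N = 4 equality constraints

def local_subgrad(i, x):
    """A subgradient of f_i at x = [u; v]."""
    u, v = x[:m], x[m]
    s = lab[parts[i]] * (feat[parts[i]] @ u + v)
    w = -lab[parts[i]] / (1.0 + np.exp(s))        # d(log-loss)/ds
    g = np.empty(m + 1)
    g[:m] = feat[parts[i]].T @ w + (mu / N) * np.sign(u)
    g[m] = w.sum()
    return g

def local_proj(i, x):
    """Euclidean projection onto X_i = {x : A_eq^(i) x = b_eq^(i)}."""
    Ai, bi = A_eq[rows[i]], b_eq[rows[i]]
    return x - Ai.T @ np.linalg.solve(Ai @ Ai.T, Ai @ x - bi)
```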
6.6 Conclusions and Extensions

In this chapter, we have proposed two modified versions of the DPS method that require only a row stochastic weight matrix and studied their convergence and convergence rates. Moreover, our analysis does not invoke the compactness requirement usually imposed on the local constraint sets and is able to deal with various scenarios, including constrained/unconstrained problems and the sets Xi being bounded/unbounded or identical/nonidentical.

It is important to note the following.

First, it is possible to employ other eigenvector estimation schemes in place of (6.7) as long as zii(t) → pii sufficiently fast (e.g., satisfying (6.10)). These include any finite-time computation algorithm, e.g., [141]. Moreover, as we have seen from Section 6.4 and also the numerical example, the convergence of our algorithms is much slower than that of the estimation step (6.7). Therefore, it is also possible to have (6.7) run asynchronously with (6.6), for example at a slower time scale, to save communication bandwidth for exchanging the xi variables and/or to communicate the zi reliably and without errors, which is important for the scaling step used in (6.6).

Second, the convergence analysis developed in this chapter can be adapted either to relax the compactness requirement in other projected subgradient based methods (e.g., [94, 137]) or to accommodate regular constraint sets in other subgradient based algorithms (e.g., [134, 145]); this holds even when the network is time-varying, possibly with fixed communication delays.

Third, the idea of using the augmented iteration (6.7) to adjust (sub)gradient magnitudes as in (6.6) is not only applicable to distributed projected subgradient methods, but can also be employed to remove the condition that the weight matrix be doubly stochastic in some other existing distributed algorithms (using consensus and (sub)gradient steps). For example, we have observed through simulations that the gradient-based method proposed in [136, 144] can be modified in the same spirit and still retains fast convergence speed under a suitable constant step size. Based on this idea, we have recently proposed a new algorithm [158] that converges linearly under a strong convexity assumption on the cost functions. We now briefly introduce this algorithm.

Algorithm 6.3. For any t ≥ 0, each agent i maintains three vectors xi(t), yi(t) ∈ Rm and zi(t) ∈ RN and updates them as follows:

xi(t+1) = ∑ j∈Ni wij xj(t) − γ yi(t) (6.61)
zi(t+1) = ∑ j∈Ni wij zj(t) (6.62)
yi(t+1) = ∑ j∈Ni wij yj(t) + gi(t+1)/zii(t+1) − gi(t)/zii(t), (6.63)

where the initial estimate xi(0) ∈ Xi, yi(0) = ∇fi(xi(0)), zi(0) = ei ∈ RN, γ is a positive constant step size, and gi(t) = ∇fi(xi(t)).

Assumption 6.6.1. (Lipschitz continuous gradients and strong convexity) The functions fi are differentiable and strongly convex. Moreover, the gradients ∇fi are Lipschitz continuous.
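A minimal sketch of one round of these updates, written for the unconstrained case and assuming per-agent gradient oracles grad_f[i] (an interface of ours, not code from [158]):

```python
import numpy as np

# One round of Algorithm 6.3 (unconstrained case); grad_f is a list of
# local gradient oracles, one per agent -- an assumed interface.
def algorithm_6_3_step(W, x, y, z, g_prev, grad_f, gamma):
    x_new = W @ x - gamma * y                    # update (6.61)
    z_new = W @ z                                # update (6.62)
    g_new = np.array([grad_f[i](x_new[i]) for i in range(len(grad_f))])
    y_new = (W @ y + g_new / np.diag(z_new)[:, None]
                   - g_prev / np.diag(z)[:, None])   # update (6.63)
    return x_new, y_new, z_new, g_new

# Initialization per the text: x[i] arbitrary, z = np.eye(N) (z_i(0) = e_i),
# g_prev[i] = y[i] = grad_f[i](x[i]), and gamma a small positive constant.
```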
Theorem 6.6.2. ([158]) Suppose Xi = Rm, ∀i ∈ V, and let the agents implement Algorithm 6.3. Under Assumptions 6.2.2, 6.2.3, 6.2.4, and 6.6.1, there exist γ¯ > 0 and µ ∈ (0, 1) such that if γ ∈ (0, γ¯) then ‖xi(t) − x∗‖ = O(µt).

Estimates of γ¯ and µ are rather involved and conservative; see [158] for details. Note also that although selection of an appropriate step size γ requires global information about the network and the cost functions (thus centralized initialization), the implementation of the algorithm is distributed and exponential convergence is achieved.

Chapter 7: Conclusions

7.1 Summary of Results

This dissertation developed theory and algorithms that advance the state of the art in analysis and applications of distributed consensus in multi-agent networks where communications are broadcast-based and directed; hence the notion of network asymmetry.

Networks with Leaders: In the first part of the thesis, we considered a DeGroot model with an external media node, representing a leader, or truth, or a source of news having a constant opinion value.

First, when consensus is the main goal of the leader, we introduced the notion of a persistent leader and developed new sufficient conditions for guaranteeing convergence for both fixed and switching topologies and in the presence of other competing but nonpersistent leaders. We also demonstrated that the results can be readily extended to the case where the persistent leader's opinion is time-varying and to the case of communications with time-varying but bounded delays.

Second, we studied the problem of a leader that aims to maximize its influence on the opinions of agents in a directed network subject to the constraint that the number of direct followers selected is not more than K. When there is only one leader and consensus is guaranteed a priori, we characterized the influence of that leader through the transient error of the network; when there is a stubborn agent or a second leader with a competing opinion, so that consensus is not possible, we measured the leader's influence in terms of the steady state error of the network. We described the optimal solution for special cases, namely K = 1, 2, for which we introduced a few notions of centrality that can be useful for practical applications. Then, for a general K, we studied a general combinatorial problem encompassing many other existing problems in the literature. We proved the supermodularity property of the objective function and the convexity of its continuous relaxation for general directed networks, and then developed practical approaches for suboptimal solutions. We demonstrated through numerical examples that the two approaches can be combined to provide effective tools and better analysis for optimal design of influence spreading in diffusive networks.

Our analysis has been shown to be useful for various applications other than those considered here. In particular, the convexity analysis offers (i) an affirmative answer to a conjecture recently proposed in [105] on optimization of on-chip thermoelectric cooling systems and (ii) a convexity result for the state trajectory of a class of bilinear discrete-time systems. The supermodularity analysis can also be used for sensor selection problems.

Consensus Prediction: In this part, we introduced and studied the problem of consensus prediction in a network whose dynamics are described by a DeGroot model.
By an application of the Hahn-Banach theorem, we established a fundamental relation between the consensus value and the network data; that is, if the consensus value can be computed at a particular time for any initial opinions, then it can be expressed as a linear combination of available observation data. This allowed us to prove a tight lower bound on the monitoring time for the case of a single observed node, regardless of the method used by the observer. We also demonstrated that this bound can be achieved if the minimal polynomial associated with the observed node is available to the observer. For the case of multiple observed nodes, we proposed a conjecture on lower bounds of the monitoring time and developed algorithms toward achieving those bounds through local observations and computations. Our results in this direction can also be regarded as a data-driven method for network identification.

Distributed Optimization: First, we demonstrated that consensus prediction can be employed for enhancing convergence of the distributed gradient method for solving a distributed convex optimization problem on a strongly connected directed network. The convergence rates of our algorithms are similar to those of the centralized gradient method, including finite time convergence for the case of quadratic objective functions, though slower by a constant factor depending on the network structure and the weight matrix. The convergence times of these algorithms scale linearly with the network diameter for certain structures (e.g., distance regular graphs) and at most linearly with the network size in general.

Second, we proposed a rescaling technique that enables distributed subgradient algorithms to work with directed networks and row stochastic matrices instead of column or doubly stochastic ones. Based on a regularity assumption, we then developed unified analyses for the convergence and convergence rate of a distributed projected subgradient method that can be applied to both unconstrained problems and constrained ones with nonidentical (and possibly unbounded) local constraint sets. We also introduced another algorithm that uses the same rescaling technique but converges linearly under a stronger assumption on the local objective functions.

7.2 Directions for Future Work

In this dissertation, we only considered discrete-time models for consensus, prediction and distributed optimization. However, there is a wide range of applications where continuous-time models are appropriate. Thus, development of similar results for the continuous-time case will be useful.

Networks with Leaders: Use of the stability conditions developed for system (2.31) to study consensus conditions in the case of a leaderless network is worth exploring to reduce the gap between necessary and sufficient conditions for consensus. Note that in the latter, we may regard any agent, e.g., agent N, as a "leader" with u(t) = ∑ j∈NN wNj(t) (xj(t) − xN(t)).

It would also be interesting to investigate and design consensus protocols for the case where there are multiple persistent leaders with time-varying states and/or malicious agents. The intuition is that if the convex hull of all the leaders' states shrinks over time to a point and the effects of the malicious agents are non-persistent, then it is still possible to reach consensus asymptotically.

Extension of the consensus results developed in Chapter 2 to coordination and synchronization of multi-agent systems is also of interest.
The convexity and supermodularity results established in Chapter 3 find applications in various important problems, including sensor/actuator placement for observability/controllability in consensus networks. Moreover, in the case of two competing leaders, it is also important to study the game played by the two leaders on the network, assuming that each has a limited budget (e.g., number of direct followers).

Consensus Prediction: Besides resolving the validity of the main conjecture for the case of multiple observed nodes, which requires rigorous analysis beyond the argument presented in Remark 4.3.6, there are numerous problems worth exploring in this topic, including the following.

Coping with Noise and Delays: First, if the communication delays are fixed, then our results can be applied in a fairly straightforward manner, as the network is still a linear time-invariant system. The case of time-varying delays remains difficult. Second, in the presence of observation or communication noise, exact consensus prediction in finite time is impossible. Since communication noise can derail consensus, observation noise is more relevant to the current topic, in which case we need to estimate the joint minimal polynomial and predict the trend of the network states and/or the range of the consensus value. In this connection, (partial) realization theory [73–75, 159] and system identification techniques [160] might be brought to bear [161, 162].

Network Monitoring for Misbehavior: As we have seen, for a network of agents whose dynamics follow the time-invariant DeGroot model, it is possible to predict the future behavior of the observed nodes as well as the consensus value by using the minimal polynomials of these nodes, which in turn can be computed from observation data. This allows the observer to detect certain changes in the dynamics of a set of nodes, dubbed misbehavior, which may be caused by faults or attacks. In the case where only approximations of these minimal polynomials are available, possibly due to corrupted or noisy observations, one can still expect to capture the trend of the network response using these approximate polynomials. Consequently, certain types of faults in the network agents' dynamics may still be detected by the observer. Thus, characterizing the misbehavior detectable from local observation is an interesting and important direction to pursue. This has a close connection to the topic of distributed fault detection and identification in the literature [163–166].

Distributed Optimization: First, since our distributed algorithm FADO developed in Chapter 5 behaves in a similar manner to the centralized gradient method, we can apply acceleration techniques by Nesterov [117] and others to FADO in order to achieve better convergence rates. The problem of designing the weight matrix W for a given network topology so as to achieve the smallest possible κmin is also of interest as another way of speeding up the algorithms we presented in this chapter. Second, extending our algorithms in Chapter 6 to the case of switching communication graphs is also worth exploring. Finally, in all these algorithms, communication noise and delays were not considered. Therefore, these issues deserve more research as well as attention for applications in practice.

Appendix A: Omitted Proofs

A.1 Known Matrix Results

Theorem A.1.1. ([95, Thm. 8.3.1]) If A ∈ RN×N+, then ρ(A) is an eigenvalue of A and there exists x ∈ RN+\{0} such that Ax = ρ(A)x.

Theorem A.1.2. ([95, Thm.
8.1.18]) Let A,B ∈ RN×N . If |A| ≤ B, then ρ(A) ≤ ρ(|A|) ≤ ρ(B). Lemma A.1.3. ([167]) Let P ∈ RN×N be the inverse of a nonsingular M-matrix. Then P ≥ 0 and Pjk ≥ PjiP−1ii Pik, ∀i, j, k = 1, . . . , N. Theorem A.1.4. (Woodbury Matrix Identity [168, p. 258]) Let A ∈ Rn×n, B ∈ Rn×r, C ∈ Rr×r, D ∈ Rr×n. Then the following holds whenever any involved inverse exists: (A−BC−1D)−1 = A−1 + A−1B(C −DA−1B)−1DA−1 (A.1) Lemma A.1.5. ([169]) Let L ∈ RN×N be a Laplacian matrix. Suppose that 0 be a simple eigenvalue of L. Let z denote the left eigenvector associated with this 240 eigenvalue and let L† be the pseudo-inverse of L. Then 1TL† = 0, L†z = 0, L†L = I − 1 N 11T , LL† = I − 1‖z‖2 zz T . Lemma A.1.6. ([169, 170]) Let d, e ∈ RN . The Moore-Penrose pseudoinverse of the rank-1 update of a matrix F ∈ RN×N is given by (F + edT )† = F † +G where G = − 1‖w‖2 vw T − 1‖m‖2 mh T + 1 + dTF †e ‖m‖2‖w‖2 mw T and v = F †e,h = (F †)Td,w = (I − FF †)e and m = (I − F †F )d. A.2 Omitted Proofs in Chapter 3 A.2.1 Proof of Theorem 3.3.1 Suppose K = {k} ⊂ V , then we have J (1) {k} = b T (L+ αkeke T k ) −1|ξ0| Now applying Lemma A.1.6 (cf. Appendix A.1) with F = L, e = ek and d = αkek yields (L+ αkeke T k ) −1 = L† +G 241 where G = − 1‖w‖2 vw T − 1‖m‖2 mh T + 1 + dTL†e ‖m‖2‖w‖2 mw T w = (I − LL†)ek = 1‖pi‖2pipi Tek = pik ‖pi‖2pi m = (I − L†L)d = αk N 11Tek = αk N 1 v = L†ek h = (L†)Tαkek Thus, G = − 1 pi2k ‖pi‖2 L†ek pik ‖pi‖2pi T − 1 α2k N αk N 1αke T kL † + 1 + αke T kL †ek α2k N pi2k ‖pi‖2 αk N 1 pik ‖pi‖2pi T = − 1 pik L†ekpiT − 1eTkL† + 1 + αkL † kk αkpik 1piT (A.2) Note also that bT1 = 1. Then we have J (1) {k} = (b TL† − L†(k))|ξ0|+ (α−1k + L†kk − bTL†k) piT |ξ0| pik . Moreover, if b = 1/N , then by Lemma A.1.5 (cf. Appendix A.1) we have bTL† = 0T . Hence, (3.19) follows immediately. A.2.2 Proof of Theorem 3.3.2 By using Woodbury identity (A.1) (cf. Appendix A.1-Theorem A.1.4) and recalling that P = L−1β , we have J (2) (k) = b T (Lβ + αkeke T k ) −1β = bT ( P − Peke T kP α−1k + e T kPek ) β. (A.3) 242 Now since L1 = 0, we have Lβ1 = L1 + diag(β)1 = β. Left-multiplying both sides with P yields Pβ = 1. It remains to use this relation to simplify A.3. A.2.3 Proof of Theorem 3.3.4 Denote P = (L+ αieie T i ) −1. By Woodbury matrix identity (A.1) we have pTij = 1T N (L+ αieie T i + αjeje T j ) −1 = 1T N ( P − Peje T j P α−1j + e T j Pej ) = pTi ( I − eje T j P α−1j + e T j Pej ) (A.4) where we have used pTi = 1 N 1TP ; see Theorem 3.3.1. Next, by (A.2), P = (L+ αieie T i ) −1 = L† − 1 pii L†eipiT − 1eTi L† + 1 + αie T i L †ei αipii 1piT Then eTj P = L †(j) − L † ji pii piT − L†(i) + 1 + αiL † ii αipii piT = L†(j) − L†(i) + (γii + γji)piT As a result, we have 1 αj + eTj Pej = 1 αj + L†jj − L†ij + (γii + γji)uj = (γjj + γij + γii + γji)pij 243 Substituting this relation into (A.4) yields pTij = p T i − pTi eje T j P α−1j + e T j Pej = pTi − pTi ej pij ∑ γij eTj P = pTi − (γii + γij)pij pij ∑ γij ( L†(j) + γjipiT + pTi ) = pTi − γii + γij∑ γij ( γjjpi T − pTj + γjipiT + pTi ) = γjj + γji∑ γij pTi + γii + γij∑ γij pTj − (γii + γij)(γjj + γji)∑ γij piT = (γii + γij)(γjj + γji)∑ γij piT − γjj + γji∑ γij L†(i) − γii + γij∑ γij L†(j), where the third to last and the last equalities follow from the relation pTi = γiipi T − L†(i) (cf. Theorem 3.3.1). This completes the proof. A.2.4 Proof of Lemma 3.5.3 First, we show that (Lβ+ΓS)−1 is nonincreasing in S. Let DS = diag(W1+β+αS) and note that ρ ( D−1S W ) < 1 (cf. Lemma 3.2.4). By the absolutely convergent Neumann series (I −D−1S W )−1 = ∑∞ i=0(D −1 S W ) i. 
Thus we have (Lβ + ΓS)−1 = (DS −W )−1 = ∑ i≥0(D −1 S W ) iD−1S (A.5) which is clearly nonnegative. Moreover, for any T ⊆ V such that T ⊇ S, we have 0N×N ≤ D−1T ≤ D−1S , which together with (A.5) implies that (Lβ + ΓT )−1 ≤ (Lβ + ΓS)−1. Alternatively, we can also use the fact that f(y) = bT (Lβ + diag(y ◦α))−1c is a non-increasing function on Ω for any b, c ∈ RN+ (cf. Theorem 3.4.3) to conclude the monotonicity of (Lβ + ΓS)−1. This proves the second inequality in (3.45). 244 We now prove the first inequality in (3.45), that is, for any v, k ∈ V\S (Lβ + ΓS)−1 − (Lβ + ΓS∪{v})−1 ≥ (Lβ + ΓS∪{k})−1 − (Lβ + ΓS∪{k,v})−1 (A.6) Let P := (Lβ + ΓS)−1 and Q := (Lβ + ΓS∪{k})−1. By Woodbury identity (A.1), it can be shown that (Lβ + ΓS∪{v})−1 = P − P(v)P (v)(α−1v + Pvv)−1, (Lβ + ΓS∪{k,v})−1 = Q−Q(v)Q(v)(α−1v +Qvv)−1. Thus, (A.6) is equivalent to the following matrix inequality P(v)P (v)(α−1v + Pvv) −1 ≥ Q(v)Q(v)(α−1v +Qvv)−1. It suffices to show that this inequality holds element-wise, i.e., PivPvj α−1v + Pvv ≥ QivQvj α−1v +Qvv , ∀i, j ∈ V . (A.7) Note again by Woodbury identity that Q = (Lβ + ΓS∪{k})−1 = P − P(k)P (k)(α−1k + Pkk)−1, i.e., Qij = Pij−PikPkj/(α−1k +Pkk), ∀i, j ∈ V . Therefore, we have (A.7) is equivalent to α−1v +Qvv α−1v + Pvv PivPvj ≥ (Piv − PikPkv α−1k + Pkk )(Pvj − PvkPkj α−1k + Pkk ) or, by rearranging terms, PvkPkvPivPvj (α−1v + Pvv) + PikPkvPvkPkj (α−1k + Pkk) ≤ PikPkvPvj + PivPvkPkj. (A.8) 245 We now show that (A.8) holds. To this end, first note that P is the inverse of a nonsingular M-matrix. Thus, by Lemma A.1.3 and the fact that α−1v ≥ 0, we have Pik ≥ PivPvk/Pvv ≥ PivPvk/(α−1v + Pvv). Next, multiplying both sides of the above inequality with PkvPkj ≥ 0 yields PvkPkvPivPvj/(α −1 v + Pvv) ≤ PikPkvPvj. (A.9) Similarly we have PikPkvPvkPkj/(α −1 k + Pkk) ≤ PivPvkPkj. (A.10) Finally, adding (A.10) and (A.9) together results in (A.8), which then completes the proof. A.2.5 Proof of Lemma 3.5.5 Let φ = f ◦ F . We need to show that for any S ⊆ T ⊆ V φ(S) + φ(T ) ≤ φ(S ∪ T ) + φ(S ∩ T ). (A.11) First, since F is decreasing, we have F (S ∪ T ) ≤ F (S), F (T ) ≤ F (S ∩ T ). (A.12) As a result, φ(S∪T ) = f(F (S∪T )) ≤ f(F (S∩T )) = φ(S∩T ) since f is increasing. This proves that φ is nonincreasing. Next, we have that there exist a1, a2 ∈ [0, 1] such that F (S) = a1F (S ∪ T ) + (1− a1)F (S ∩ T ) (A.13) F (T ) = a1F (S ∪ T ) + (1− a2)F (S ∩ T ). (A.14) 246 Adding side-by-side of the above equations gives F (S) + F (T ) = (a1 + a2)F (S ∪ T ) + (2− a1 − a2)F (S ∩ T ), whose left side is less than F (S ∪ T ) + F (S ∩ T ) by supermodularity property of F . Then we have (a1 + a2)F (S ∪ T ) + (2− a1 − a2)F (S ∩ T ) ≤ F (S ∪ T ) + F (S ∩ T ) or, by rearranging terms, (1− a1 − a2)(F (S ∩ T )− F (S ∪ T )) ≤ 0N×N , from which, together with (A.12), we conclude that a1+a2 ≥ 1. Now using convexity of f and (A.13) we have φ(S) = f(F (S)) = f(a1F (S ∪ T ) + (1− a1)F (S ∩ T )) ≤ a1f ( F (S ∪ T ))+ (1− a1)f(F (S ∩ T )) = a1φ(S ∪ T ) + (1− a1)φ(S ∩ T ). Similarly, convexity of f and (A.14) imply φ(T ) ≤ a2φ(S ∪ T ) + (1− a2)φ(S ∩ T ). Adding two equations above side by side yields φ(S) + φ(T ) ≤ (a1 + a2)φ(S ∪ T ) + (2− a1 − a2)φ(S ∩ T ) = φ(S ∪ T ) + φ(S ∩ T ) + (a1 + a2 − 1) ( φ(S ∪ T )− φ(S ∩ T )) ≤ φ(S ∪ T ) + φ(S ∩ T ), where in the last inequality we have used the facts that a1 + a2 ≥ 1 and that φ is nonincreasing. Thus, (A.11) is proved. 247 A.3 Omitted Proofs in Chapter 5 A.3.1 Proof of Theorem 5.3.5 For t ≥ 0, let s(t) := [s1(t), . . . , sN(t)] T , s¯(t) := piT s(t), gs(t) := [g1(s1(t)), . . . 
, gN(sN(t))] T . It follows from Theorem 5.3.4 that s¯(t) = si(t),∀i ∈ V ,∀t ≥ κ. (A.15) Thus, for any k ≥ 0, we have s¯((k + 1)κ) = si((k + 1)κ) (5.18) = piTx(kκ) (5.14a) = piT ( s(kκ)− γkgs(kκ) ) , = s¯(kκ)− γkpiTgs(kκ). (A.16) Since W is doubly stochastic, we have pi = 1/N (see, e.g., [95]). Thus we have for any k ≥ 1 NpiTgs(kκ) (A.15) = N∑ i=1 gi(s¯(kκ)) = g(s¯(kκ)), (A.17) where g(s¯(kκ)) ∈ ∂F (s¯(kκ)). Thus, (A.16) becomes s¯((k + 1)κ) = s¯(kκ)− γkN−1g(s¯(kκ)), (A.18) which is the same as (5.19). Next, it is obvious that (A.18) is the usual centralized subgradient iteration (5.3) applied to problem (5.1), where F is convex with bounded subgradient |g| ≤∑N j=1 Lj = LF , by Assumption 5.3.1. Therefore, existing convergence results of the 248 centralized subgradient method apply; see, e.g., [117, Chap. 3]. Here, we provide analysis that suits our context to prove the main results. In particular, let s¯k := s¯((k + 1)κ), g¯k := g(s¯(kκ)), γ¯k := γk/N. Then for any x∗ ∈ X∗, we have |s¯k+1 − x∗|2 = |s¯k − x∗ − γ¯kg¯k|2 = |s¯k − x∗|2 − 2γ¯kg¯k(s¯k − x∗) + γ¯2k g¯2k ≤ |s¯k − x∗|2 − 2γ¯k(F¯k − F ∗) + γ¯2k g¯2k (A.19) ≤ |s¯1 − x∗|2 − 2 k∑ l=1 γ¯l(F¯l − F ∗) + k∑ l=1 γ¯2l g¯ 2 l , where F¯k := F (s¯(kκ)), and we have used the definition of subgradient in the first in- equality, and the last one follows from applying (A.19) recursively. Now, rearranging terms and using 0 ≤ |s¯k+1 − x∗|2 and |g| ≤ LF we have k∑ l=1 γ¯l(F¯l − F ∗) ≤ 1 2 (|s¯1 − x∗|2 + L2F k∑ l=1 γ¯2l ) . By the convexity of F , the left side is bounded below by( k∑ l=1 γ¯l )(∑k l=1 γ¯lF (s¯(lκ))∑k l=1 γ¯l − F ∗ ) ≥ ( k∑ l=1 γ¯l )( F (sˆk)− F ∗ ) Combining this and (A.22), we then have F (sˆk)− F ∗ ≤ |s¯1 − x ∗|2 + L2F ∑k l=1 γ¯ 2 l 2 ∑k l=1 γ¯l . (A.20) Now we consider different choices of step size γk. (i) For a constant step size γk ≡ γ, i.e., γ¯k ≡ γ/N,∀k ≥ 1, it follows from (A.20) that F (sˆk)− F ∗ ≤ N |s¯1 − x ∗|2 2kγ + L2Fγ 2N . (A.21) 249 Letting k →∞ yields (5.21). (ii) Since (A.21) holds true for any x∗ ∈ X∗, γ > 0 and k ∈ Z>0, for any given K ∈ Z>0 we have F (sˆK)− F ∗ ≤ NR 2 2Kγ + L2Fγ 2N , (A.22) Now we minimize the right hand side of (A.22) with respect to γ > 0. By application of Cauchy-Schwarz inequality NR 2 2Kγ + L2F γ 2N ≥ RLF√ K , where equality holds when γ = NR LF √ K . Thus, with this optimal step size, we have F (sˆK)− F ∗ ≤ RLF√K . (iii) For a non-summable but diminishing step size, it can be shown that the right hand side of (A.20) decays to 0 as k →∞; see [138] for such an argument. Finally, consider γk = 1√ k . It can be verified that ∑k l=1 γ¯ 2 l ≤ 1+ln(k)N2 and∑k l=1 γ¯l ≥ √ k 2N ,∀k ≥ 1. Using these bounds for (A.20), we obtain F (sˆk)− F ∗ ≤ N 2|s¯1 − x∗|2 + L2F (1 + ln(k)) N √ k , (A.23) which implies that F (sˆk)− F ∗ = O( ln(k)√k ) as k →∞. A.3.2 Proof of Theorem 5.3.12 Following the same line of proof as in Theorem 5.3.5 (see Appendix A.3.1), it can be shown that s¯((k + 1)κ) = s¯(kκ)− γN−1∇F (s¯(kκ)), (A.24) where s¯(t) = 1 N ∑N j=1 sj(t). Clearly, (A.24) is the standard centralized gradient descent method. Thus, by [117, Thm. 2.1.14] we have for any γ ∈ (0, 2N L∇F ) F (s¯(kκ))− F ∗ ≤ a1 a2 + ka3 (A.25) 250 where a1 = (F (s¯(κ)) − F ∗)(s¯(κ) − x∗)2, a2 = (s¯(κ) − x∗)2, and a3 = (F (s¯(κ)) − F ∗)(1− L∇F h 2 )h with h = γ N . As a result, F (si(kκ))− F ∗ = O(1/k), as k →∞. (A.26) Finally, for each t ≥ κ, there exist positive integers k ≥ 1 and l ∈ [0, κ − 1] such that t = kκ+ l. Then by (5.13b), F (si(t))− F ∗ = F (si(kκ))− F ∗ (A.25) ≤ a1 a2 + ka3 ≤ a1κ a2κ+ ta3 . 
A.3.3 Proof of Theorem 5.3.14

First, note that (A.24) still holds in this case, i.e., $\bar{s}((k+1)\kappa) = \bar{s}(k\kappa) - \frac{\gamma}{N}\nabla F(\bar{s}(k\kappa))$, where $\bar{s}(t) = \frac{1}{N}\sum_{j=1}^N s_j(t)$. Applying [117, Thm. 2.1.15] to this iteration (i.e., (A.24)) yields
$$|\bar{s}(k\kappa) - x^*|^2 \le \beta^{k-1}|\bar{s}(\kappa) - x^*|^2,$$
$$2\big(F(\bar{s}(k\kappa)) - F^*\big) \le L_{\nabla F}\beta^{k-1}|\bar{s}(\kappa) - x^*|^2.$$
Then (5.30) and (5.31) follow immediately since $s_i(t) = \bar{s}(t)$ for all $t \ge \kappa$ and $i \in \mathcal{V}$ (cf. Theorem 5.3.4). Next, for each $t \ge \kappa$, there exist integers $k \ge 1$ and $l \in [0, \kappa-1]$ such that $t = k\kappa + l$. Then, with $C := |\bar{s}(\kappa) - x^*|$, we have
$$|s_i(t) - x^*| \overset{(5.13b)}{=} |s_i(k\kappa) - x^*| \overset{(5.30)}{\le} C\beta^{\frac{k-1}{2}} = C\big(\beta^{\frac{1}{2\kappa}}\big)^{t-l-\kappa} \le C\beta^{-1}\big(\beta^{\frac{1}{2\kappa}}\big)^t, \tag{A.28}$$
where the last inequality holds since $l \le \kappa - 1$. Thus, $s_i(t) \to x^*$ linearly at rate $\beta^{\frac{1}{2\kappa}}$. Similarly, by using (5.31), it can be shown that $F(s_i(t)) \to F^*$ linearly at rate $\beta^{\frac{1}{\kappa}}$.

A.3.4 Proof of Extension to Row Stochastic Weight Matrix

First, notice that the proof of Theorem 5.3.4 does not make use of (5.14a). Thus, (5.15) still holds for (5.36)-(5.37). Following the proof of Theorem 5.3.5 (see Appendix A.3.1), we let $\bar{s}(t) := s_i(t)$ for $t \ge \kappa$ and $g_s(t) := [g_1(s_1(t)), \ldots, g_N(s_N(t))]^T$. Recall that $\Phi = \mathbf{1}\pi^T$. We then have
$$\bar{s}((k+1)\kappa) = s_i((k+1)\kappa) \overset{(5.18)}{=} \pi^T x(k\kappa) \overset{(5.37)}{=} \pi^T\big(s(k\kappa) - \gamma_k(N\,\mathrm{diag}(\pi))^{-1}g_s(k\kappa)\big) = \bar{s}(k\kappa) - \gamma_k N^{-1}\sum_{i=1}^N g_i(s_i(k\kappa)). \tag{A.29}$$
Since $s_i(k\kappa) = \bar{s}(k\kappa)$ for all $i \in \mathcal{V}$ (cf. Theorem 5.3.4), (A.29) becomes
$$\bar{s}((k+1)\kappa) = \bar{s}(k\kappa) - \gamma_k N^{-1}g(\bar{s}(k\kappa)), \tag{A.30}$$
where $g(\bar{s}(k\kappa)) = \sum_i g_i(\bar{s}(k\kappa)) \in \partial F(\bar{s}(k\kappa))$. Now (A.30) is the same as (5.19). Therefore, the same conclusions as in Theorems 5.3.5-5.3.14 hold for the convergence of the modified algorithm.
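The key step in (A.29) is that the rescaling $(N\,\mathrm{diag}(\pi))^{-1}$ cancels the nonuniform left Perron weights of a merely row stochastic $W$, so that applying $\pi^T$ to the rescaled subgradient stack returns the plain average $\frac{1}{N}\sum_i g_i$. Below is a minimal numerical check of this identity; note that $\pi$ is computed centrally here purely for illustration, whereas in the algorithm each agent obtains its entry of $\pi$ distributively.

```python
import numpy as np

# For a row stochastic (not doubly stochastic) W, pi weights agents
# unevenly; rescaling agent i's subgradient by 1/(N * pi_i) restores the
# uniform average used in (A.29)-(A.30).
rng = np.random.default_rng(2)
N = 5
W = rng.random((N, N))
W /= W.sum(axis=1, keepdims=True)                 # row stochastic only

# Left Perron vector: pi^T W = pi^T, normalized so that pi^T 1 = 1.
evals, evecs = np.linalg.eig(W.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

g = rng.normal(size=N)                            # stand-in subgradients g_i
biased = pi @ g                                   # what pi^T g_s would give
rescaled = pi @ (g / (N * pi))                    # pi^T (N diag(pi))^{-1} g_s
assert np.isclose(rescaled, g.mean())             # equals (1/N) sum_i g_i
print(f"pi^T g = {biased:.4f}  vs  rescaled = {rescaled:.4f} = mean(g)")
```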
A.3.5 Proof of Theorem 5.5.1

First, note that the condition $(|\mathcal{N}_i| - 1) \le 1$ is to ensure that $W$ is a nonnegative matrix, hence a valid weight matrix. Now we prove that $\mathrm{diam}(\mathcal{G}) + 1 = \deg(q_i)$ for all $i \in \mathcal{V}$. Define
$$\Omega^{(i)} := [e_i \ \ L(\mathcal{G})^T e_i \ \ \ldots \ \ (L(\mathcal{G})^{N-1})^T e_i]^T.$$
By [72, Prop. 1] we have $\deg(q_i) = D_i + 1 = \mathrm{rank}(\Omega^{(i)})$. Next, by application of [132, Prop. 5], if the graph $\mathcal{G}$ is distance regular, then $\mathrm{diam}(\mathcal{G}) + 1 = \mathrm{rank}(\Sigma^{(i)})$, where $\Sigma^{(i)}$ is the controllability matrix of the pair $(L(\mathcal{G}), e_i)$, computed as
$$\Sigma^{(i)} = [e_i \ \ L(\mathcal{G})e_i \ \ \ldots \ \ L(\mathcal{G})^{N-1}e_i] = [e_i^T;\ e_i^T L(\mathcal{G});\ \ldots;\ e_i^T L(\mathcal{G})^{N-1}]^T = (\Omega^{(i)})^T.$$
Here, the second equality follows from the symmetry of $L(\mathcal{G})$ (recalling that $\mathcal{G}$ is undirected). Thus, $\mathrm{diam}(\mathcal{G}) + 1 = \mathrm{rank}((\Omega^{(i)})^T) = \mathrm{rank}(\Omega^{(i)}) = \deg(q_i)$.

A.3.6 Proof of Lemma 5.4.3

For a given matrix $A \in \mathbb{R}^{N\times N}$, let $q_A$ denote its minimal polynomial (i.e., the monic polynomial of minimum degree such that $q_A(A) = 0$). It follows from the Cayley-Hamilton theorem that $\deg(q_A) \le N$. Let $J$ denote the Jordan canonical form of $W - \gamma B$, i.e., there exists a nonsingular matrix $S \in \mathbb{R}^{N\times N}$ such that $W - \gamma B = SJS^{-1}$. Since similar matrices have the same minimal polynomial ([95, Corollary 3.3.3]), we have $q_{(W-\gamma B)} = q_J$. Moreover, it can be verified that
$$\widetilde{W} = \begin{bmatrix} SJS^{-1} & -\gamma B \\ 0_{N\times N} & I \end{bmatrix} = \begin{bmatrix} S & \Phi_\gamma \\ 0 & I \end{bmatrix} \underbrace{\begin{bmatrix} J & 0 \\ 0 & I \end{bmatrix}}_{=:K} \begin{bmatrix} S^{-1} & -S^{-1}\Phi_\gamma \\ 0 & I \end{bmatrix}, \tag{A.31}$$
where $\Phi_\gamma$ is defined in (5.49). Thus, $K$ is the Jordan canonical form of $\widetilde{W}$. Under condition (5.47), i.e., $\rho(J) < 1$, the order of the largest Jordan block of $K$ corresponding to eigenvalue 1 is equal to 1. It then follows immediately (see, e.g., [95, Thm. 3.3.6]) that $q_K(\xi) = (\xi - 1)q_J(\xi)$. Consequently, we have
$$q_{\widetilde{W}}(\xi) = q_K(\xi) = (\xi - 1)q_J(\xi) = (\xi - 1)q_{(W-\gamma B)}(\xi). \tag{A.32}$$
Since $\tilde{q}_i \mid q_{\widetilde{W}}$ (see Lemma 4.2.3), we obtain
$$\deg(\tilde{q}_i) \le \deg(q_{\widetilde{W}}) \overset{(A.32)}{=} 1 + \deg(q_{(W-\gamma B)}) \le 1 + N.$$
We next show that $\tilde{q}_i(1) = 0$, which then clearly implies
$$\tilde{q}_i(\xi) = (\xi - 1)\sum_{j=0}^{\widetilde{D}_i} \tilde{a}^{(i)}_j \xi^j, \quad \tilde{a}^{(i)}_{\widetilde{D}_i} = 1,$$
for some $\tilde{a}^{(i)} \in \mathbb{R}^{\widetilde{D}_i + 1}$. To this end, recall that for any $i = 1, \ldots, N$ and the unit vector $e_i \in \mathbb{R}^N$, we have
$$0^T_{2N} = [e_i^T \ 0_N^T]\,\tilde{q}_i(\widetilde{W}) \overset{(A.31)}{=} [e_i^T \ 0_N^T]\begin{bmatrix} S & \Phi_\gamma \\ 0 & I\end{bmatrix}\begin{bmatrix}\tilde{q}_i(J) & 0 \\ 0 & \tilde{q}_i(1)I\end{bmatrix}\begin{bmatrix}S^{-1} & -S^{-1}\Phi_\gamma \\ 0 & I\end{bmatrix} = \big[e_i^T S\tilde{q}_i(J)S^{-1} \ \ \ \tilde{q}_i(1)e_i^T\Phi_\gamma - e_i^T S\tilde{q}_i(J)S^{-1}\Phi_\gamma\big]. \tag{A.33}$$
As a result,
$$\tilde{q}_i(1)e_i^T\Phi_\gamma = 0_N^T. \tag{A.34}$$
Since $\Phi_\gamma$ is invertible (see (5.49)), none of its rows is identical to $0_N^T$. Consequently, (A.34) implies that $\tilde{q}_i(1) = 0$. Finally, (A.32) implies that if $\rho(W - \gamma B) < 1$, then 1 is the only zero of maximum modulus of $q_{\widetilde{W}}$. The proof is completed by noting that $\tilde{q}_i$ divides $q_{\widetilde{W}}$.

A.3.7 Proof of Theorem 5.4.4

Consider again system (5.46) and assume that (5.47) holds. First we show that $\Phi_\gamma$ satisfies $\Phi_\gamma \mathbf{1} = \mathbf{1}$ and $\pi^T B\Phi_\gamma = \pi^T B$. Note that for any $A \in \mathbb{R}^{N\times N}$ such that $(I - A)$ is invertible, we have
$$(I - A)^{-1} - I = (I - A)^{-1}A = A(I - A)^{-1}. \tag{A.35}$$
Now let $E = \gamma B$ and note that $W\mathbf{1} = \mathbf{1}$ and $\pi^T W = \pi^T$. Then
$$\Phi_\gamma\mathbf{1} = [I - (W - E)]^{-1}E\mathbf{1} = [I - (W-E)]^{-1}W\mathbf{1} - [I-(W-E)]^{-1}(W-E)\mathbf{1} \overset{(A.35)}{=} [I-(W-E)]^{-1}\mathbf{1} - \big([I-(W-E)]^{-1} - I\big)\mathbf{1} = \mathbf{1},$$
$$\pi^T E\Phi_\gamma = \pi^T E[I-(W-E)]^{-1}E = \pi^T W[I-(W-E)]^{-1}E - \pi^T(W-E)[I-(W-E)]^{-1}E \overset{(A.35)}{=} \pi^T[I-(W-E)]^{-1}E - \pi^T\big([I-(W-E)]^{-1} - I\big)E = \pi^T E.$$
Therefore, $B\pi$ and $\mathbf{1}$ are left and right eigenvectors of $\Phi_\gamma$ corresponding to the eigenvalue 1, respectively.

Next, by Assumption 5.2.2 and condition (5.47), $I - (W - \gamma B)$ is an irreducible nonsingular M-matrix. Thus, $(I - (W-\gamma B))^{-1}$ is a strictly positive matrix (see, e.g., [171]). Therefore, $\Phi_\gamma = [I - (W-\gamma B)]^{-1}\gamma B$ is also strictly positive. Thus $\Phi_\gamma$ is also irreducible (i.e., the graph associated with $\Phi_\gamma$ is strongly connected; in fact it is complete), and 1 is a simple eigenvalue, corresponding to the spectral radius of $\Phi_\gamma$. In fact, by Perron's theorem for positive matrices (see, e.g., [95, Thm 8.2.11]), 1 is the unique eigenvalue of maximum modulus of $\Phi_\gamma$, and
$$\lim_{k\to\infty}\Phi_\gamma^k = \mathbf{1}\pi^T B/(\pi^T B\mathbf{1}).$$
Hence, $\Phi_\gamma$ is a valid weight matrix for consensus. Moreover, the convergence is exponential with rate $|\lambda_2(\Phi_\gamma)|$, where $\lambda_2$ is an eigenvalue of second largest modulus.

A.3.8 Proof of Theorem 5.4.5

By (5.52) and (5.49), we have the following, which is in the same spirit as Theorem 5.2.3:
$$\Big(\sum_{l=0}^{\widetilde{D}_i}\tilde{a}^{(i)}_l x_i(l)\Big)\Big/\Big(\sum_{l=0}^{\widetilde{D}_i}\tilde{a}^{(i)}_l\Big) = e_i^T\Phi_\gamma x(0) \quad \forall i \in \mathcal{V},$$
where $\tilde{a}^{(i)} = [\tilde{a}^{(i)}_0, \ldots, \tilde{a}^{(i)}_{\widetilde{D}_i}]^T \in \mathbb{R}^{\widetilde{D}_i+1}$ satisfies (5.51). Thus,
$$s((k+1)\kappa) = \Phi_\gamma s(k\kappa) = \Phi_\gamma^{k+1}c. \tag{A.36}$$
By Theorem 5.4.4, $\Phi_\gamma$ is a valid consensus matrix. Since $W$ is doubly stochastic, we have $\pi = \mathbf{1}/N$, and thus $b = B\mathbf{1} = [b_1, \ldots, b_N]^T$ is a left Perron eigenvector of $\Phi_\gamma$. Thus, by (5.55),
$$\lim_{k\to\infty}s(k\kappa) = \lim_{k\to\infty}\Phi_\gamma^k c = \mathbf{1}b^Tc/(b^T\mathbf{1}) = \mathbf{1}x^*. \tag{A.37}$$
This and (5.53)-(5.54) imply (5.56). Moreover, since $\Phi_\gamma^k$ converges exponentially as $k\to\infty$, so does $s(k\kappa)$, i.e., there exist $C > 0$ and $\lambda_2 \in (0,1)$ such that $\|s(k\kappa) - \mathbf{1}x^*\| \le C\lambda_2^k$ for all $k \ge 0$. Now for each $t \ge \kappa$, there exist integers $k \ge 1$ and $l \in [0, \kappa-1]$ such that $t = k\kappa + l$. We then have
$$\|s(t) - \mathbf{1}x^*\| \overset{(5.53)}{=} \|s(k\kappa) - \mathbf{1}x^*\| \le C\lambda_2^k \le C\lambda_2^{-1}\lambda_2^{t/\kappa}, \tag{A.38}$$
where the last inequality follows from the condition $l \in [0, \kappa-1]$. Therefore, $s(t)$ also converges exponentially, with rate $\lambda_2^{1/\kappa}$. This concludes the proof.
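The properties established in Theorem 5.4.4 are easy to verify numerically on a small example. The sketch below is our toy construction: a doubly stochastic circulant $W$ (so $\pi = \mathbf{1}/N$) and a hypothetical diagonal $B = \mathrm{diag}(b)$ of leader weights chosen so that condition (5.47) holds.

```python
import numpy as np

# Check: Phi_gamma = [I - (W - gamma*B)]^{-1} gamma*B is row stochastic,
# strictly positive, and Phi_gamma^k -> 1 pi^T B / (pi^T B 1).
N, gamma = 6, 0.5
W = 0.5 * np.eye(N) + 0.25 * (np.roll(np.eye(N), 1, 0) + np.roll(np.eye(N), -1, 0))
B = np.diag(np.full(N, 0.8))
assert np.max(np.abs(np.linalg.eigvals(W - gamma * B))) < 1   # condition (5.47)

Phi = np.linalg.inv(np.eye(N) - (W - gamma * B)) @ (gamma * B)
assert np.allclose(Phi @ np.ones(N), np.ones(N))              # Phi_gamma 1 = 1
assert (Phi > 0).all()                                        # strictly positive

pi = np.full(N, 1.0 / N)                                      # W doubly stochastic
limit = np.outer(np.ones(N), pi @ B) / (pi @ B @ np.ones(N))
assert np.allclose(np.linalg.matrix_power(Phi, 200), limit)   # Perron limit
print("second largest |eigenvalue| of Phi_gamma:",
      np.sort(np.abs(np.linalg.eigvals(Phi)))[-2])
```

The printed quantity is the modulus $|\lambda_2(\Phi_\gamma)|$ that governs the exponential rate in (A.38).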
A.3.9 Distributed Evaluation of Global Cost Function and Algorithm Local Termination

Here we provide an augmentation to our main algorithm that allows each agent $i$ to compute $F(\hat{s}_k)$ or $F(s_i(t))$. Besides (5.13)-(5.14), let all the agents also perform (5.24)-(5.25) (in order to compute $\hat{s}_k$) as well as the following for all $t \ge \kappa + 1$:
$$y_i(t) = \begin{cases} f_i(\hat{s}_k) & \text{if } t = k\kappa, \hspace{4em} \text{(A.39a)}\\ \sum_{j\in\mathcal{N}_i} w_{ij}\,y_j(t-1) & \text{if } t \ne k\kappa, \hspace{4em} \text{(A.39b)}\end{cases}$$
and
$$F^{(i)}_k = \frac{\sum_{\tau=0}^{D_i} a^{(i)}_\tau\, y_i(t - \kappa + \tau)}{\sum_{\tau=0}^{D_i} a^{(i)}_\tau}, \quad \text{if } t = k\kappa. \tag{A.40}$$

Claim: Let the assumptions of Theorem 5.3.5 hold. Then $F(\hat{s}_k) = NF^{(i)}_k$ for any $k \ge 1$ and all $i \in \mathcal{V}$.

Proof of Claim: First recall from Theorem 5.3.4 that $s_i(t) = s_j(t) =: \bar{s}(t)$ for all $t \ge \kappa$ and $i, j \in \mathcal{V}$. Thus, each agent can locally find $\hat{s}_k$ using (5.24)-(5.25). Next, by (A.39b), we have $y(t) = Wy(t-1)$ for all $t = k\kappa+1, \ldots, (k+1)\kappa - 1$. At time $t = (k+1)\kappa$, we have
$$\frac{\sum_{\tau=0}^{D_i} a^{(i)}_\tau\, y_i(k\kappa+\tau)}{\sum_{\tau=0}^{D_i} a^{(i)}_\tau} \overset{\text{(Thm. 5.2.3)}}{=} e_i^T\Phi\, y(k\kappa) = N^{-1}\mathbf{1}^T y(k\kappa) = N^{-1}\sum_{i=1}^N f_i(\hat{s}_k) = N^{-1}F(\hat{s}_k), \tag{A.41}$$
where the second equality follows from (5.7) with $\pi = \mathbf{1}/N$ and the third from (A.39a). Therefore the claim holds.

As a result, each agent can compute $F(\hat{s}_k)/N$ in a distributed manner (and hence $F(\hat{s}_k)$ if $N$ is known). Similarly, if (A.39a) is replaced by $y_i(t) = f_i(s_i(k\kappa))$ for $t = k\kappa$, the same argument holds for $F(s_i(k\kappa))$, that is, $F(s_i(k\kappa)) = NF^{(i)}_k$ for any $k \ge 1$ and all $i \in \mathcal{V}$. Therefore, all the agents can stop at the same time if they agree to use a common stopping criterion such as (5.27), which is based on absolute convergence error. Note that other criteria of the same type can also be employed; for example, the local relative convergence tolerance
$$|F^{(i)}_k - F^{(i)}_{k-1}| \le \epsilon\,|F^{(i)}_k| \tag{A.42}$$
is obviously equivalent to the global relative convergence tolerance $|F(\hat{s}_k) - F(\hat{s}_{k-1})| \le \epsilon\,|F(\hat{s}_k)|$, or $|F(s_i(k\kappa)) - F(s_i(k\kappa - \kappa))| \le \epsilon\,|F(s_i(k\kappa))|$ if $s_i$ is used instead.
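The mechanism behind (A.39)-(A.40) is finite-time recovery of an exact average from a fixed number of local averaging steps. The sketch below illustrates it centrally under our own simplifications: we use the characteristic polynomial of a symmetric doubly stochastic $W$ in place of each agent's locally identified coefficients $a^{(i)}$, and a global horizon $N$ in place of the per-agent $\kappa$.

```python
import numpy as np

# If q_W(x) = (x - 1) p(x) is the characteristic polynomial of W (equal to
# the minimal polynomial for generic W), then p(W) = p(1) * (1 pi^T), so a
# fixed linear combination of N consecutive iterates y, Wy, W^2 y, ...
# recovers pi^T y exactly -- the idea used in (A.40).
N = 5
rng = np.random.default_rng(3)
M = rng.random((N, N))
M = M + M.T                                       # symmetric positive weights
W = np.eye(N) - 0.9 * (np.diag(M.sum(1)) - M) / M.sum(1).max()  # doubly stoch.

f_vals = rng.normal(size=N)                       # stand-in for f_i(s_hat_k)
q = np.poly(W)                                    # characteristic polynomial
p = np.polydiv(q, np.array([1.0, -1.0]))[0]       # p(x) = q(x) / (x - 1)

ys = [f_vals]                                     # y evolves by (A.39b)
for _ in range(N - 1):
    ys.append(W @ ys[-1])

a = p[::-1]                                       # a_tau: coefficient of x^tau
recovered = sum(at * yt for at, yt in zip(a, ys)) / a.sum()
assert np.allclose(recovered, f_vals.mean())      # every agent gets F / N
print("every agent recovers F/N =", f_vals.mean())
```

Since $\pi = \mathbf{1}/N$ here, the recovered value at every agent is exactly $\frac{1}{N}\sum_i f_i$, i.e., $F(\hat{s}_k)/N$ as in the Claim above.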
A.4 Omitted Proofs in Chapter 6

A.4.1 Proof of Theorem 6.3.6 for Algorithm 6.2

Recall that for Algorithm 6.2, the projection error $\phi_i(t)$ is given by (6.15). Thus, for any $v \in X$, we have
$$\|x_i(t+1) - v\|^2 = \Big\|\sum_{j\in\mathcal{N}_i} w_{ij}\Big(x_j(t) - \gamma(t)\frac{g_j(t)}{z_{jj}(t)}\Big) - v + \phi_i(t)\Big\|^2.$$
Expanding the right side and using the fact that
$$\phi_i(t)^T\Big(\sum_{j\in\mathcal{V}} w_{ij}\Big(x_j(t) - \gamma(t)\frac{g_j(t)}{z_{jj}(t)}\Big) - v\Big) \le -\|\phi_i(t)\|^2$$
for any $v \in X \subseteq X_i$ (cf. Lemma 6.3.1(a)), we obtain
$$\|x_i(t+1) - v\|^2 \le \Big\|\sum_{j\in\mathcal{N}_i} w_{ij}\Big(x_j(t) - \gamma(t)\frac{g_j(t)}{z_{jj}(t)}\Big) - v\Big\|^2 - \|\phi_i(t)\|^2 \le \sum_{j\in\mathcal{N}_i} w_{ij}\Big\|x_j(t) - v - \gamma(t)\frac{g_j(t)}{z_{jj}(t)}\Big\|^2 - \|\phi_i(t)\|^2, \tag{A.43}$$
where (A.43) follows from $\sum_{j\in\mathcal{V}} w_{ij} = 1$ and Lemma 6.3.2. Hence,
$$\sum_{i\in\mathcal{V}}\pi_i\|x_i(t+1) - v\|^2 \le \sum_{i\in\mathcal{V}}\pi_i\sum_{j\in\mathcal{N}_i} w_{ij}\Big\|x_j(t) - v - \gamma(t)\frac{g_j(t)}{z_{jj}(t)}\Big\|^2 - \sum_{i\in\mathcal{V}}\pi_i\|\phi_i(t)\|^2 = \sum_{i\in\mathcal{V}}\pi_i\Big\|x_i(t) - v - \gamma(t)\frac{g_i(t)}{z_{ii}(t)}\Big\|^2 - \sum_{i\in\mathcal{V}}\pi_i\|\phi_i(t)\|^2, \tag{A.44}$$
where the equality in (A.44) holds since $\pi^T W = \pi^T$. Expanding the first term on the right side of (A.44) yields
$$\sum_{i\in\mathcal{V}}\pi_i\|x_i(t) - v\|^2 - 2\gamma(t)\sum_{i\in\mathcal{V}}\frac{\pi_i}{z_{ii}(t)}\,g_i(t)^T(x_i(t) - v) + \gamma^2(t)\sum_{i\in\mathcal{V}}\frac{\pi_i}{z_{ii}^2(t)}\|g_i(t)\|^2. \tag{A.45}$$
We now derive upper bounds for the last two terms in (A.45). First, by (6.5), (6.11) and the fact that $\sum_{i\in\mathcal{V}}\pi_i = 1$, we have
$$\gamma^2(t)\sum_{i\in\mathcal{V}}\frac{\pi_i}{z_{ii}^2(t)}\|g_i(t)\|^2 \le \gamma^2(t)L_f^2\eta^2. \tag{A.46}$$
Second, using the facts that $g_i(t) \in \partial f_i(x_i(t))$ and $f_i$ is Lipschitz continuous on $\mathrm{conv}(\cup_{i=1}^N X_i)$ (cf. Assumption 6.2.1(b)), it can be shown that
$$g_i(t)^T(v - x_i(t)) \le f_i(v) - f_i(\bar{x}(t)) + L_f\|x_i(t) - \bar{x}(t)\|. \tag{A.47}$$
As a result, the second term in (A.45) can be bounded as follows (ignoring the factor $2\gamma(t)$):
$$\sum_{i\in\mathcal{V}}-\frac{\pi_i}{z_{ii}(t)}\,g_i(t)^T(x_i(t) - v) \overset{(A.47)}{\le} \sum_{i\in\mathcal{V}}\frac{\pi_i}{z_{ii}(t)}\big(f_i(v) - f_i(\bar{x}(t)) + L_f\|x_i(t) - \bar{x}(t)\|\big) \overset{(6.11)}{\le} \sum_{i\in\mathcal{V}}\frac{\pi_i}{z_{ii}(t)}\big(f_i(v) - f_i(\bar{x}(t))\big) + L_f\eta\sum_{i\in\mathcal{V}}\pi_i\|x_i(t) - \bar{x}(t)\| \overset{(6.23)}{\le} F(v) - F(\bar{x}(t)) + NCL_f\eta\lambda^t\|\bar{x}(t) - v\| + L_f\eta\sum_{i\in\mathcal{V}}\pi_i\|x_i(t) - \bar{x}(t)\|. \tag{A.48}$$
Finally, returning to the argument in (A.44) and using (A.45) together with the bounds in (A.46) and (A.48), we obtain
$$\sum_{i\in\mathcal{V}}\pi_i\|x_i(t+1) - v\|^2 \le \sum_{i\in\mathcal{V}}\pi_i\|x_i(t) - v\|^2 - 2\gamma(t)\big(F(\bar{x}(t)) - F^*\big) + 2\gamma(t)NCL_f\eta\lambda^t\|\bar{x}(t) - v\| + 2\gamma(t)L_f\eta\sum_{i\in\mathcal{V}}\|x_i(t) - \bar{x}(t)\| + \gamma^2(t)L_f^2\eta^2,$$
which is the same as (6.25). Therefore, (6.13) readily follows with the same constants $D_i$ as in the case of Algorithm 6.1.
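One round of the projected update analyzed above can be sketched as follows. This is a simplified stand-in for Algorithm 6.2, on our own toy problem: we use the limiting rescaling $z_{ii} = N\pi_i$ directly instead of the algorithm's running estimates $z_{ii}(t)$, take $f_i(x) = |x - c_i|$, and take all local constraint sets $X_i$ equal to a common interval (the analysis allows them to differ).

```python
import numpy as np

rng = np.random.default_rng(4)
N, T = 8, 3000
W = rng.random((N, N))
W /= W.sum(1, keepdims=True)                       # row stochastic only
evals, evecs = np.linalg.eig(W.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])  # left Perron vector
pi /= pi.sum()
z = N * pi                                         # limiting rescaling z_ii

c = rng.normal(size=N)
lo, hi = -0.5, 0.5                                 # the common interval X_i
proj = lambda x: np.clip(x, lo, hi)                # projection onto X_i

x = rng.normal(size=N)
for t in range(1, T + 1):
    g = np.sign(x - c)                             # subgradient of f_i at x_i
    # x_i(t+1) = Proj_{X_i}( sum_j w_ij (x_j - gamma(t) g_j / z_jj) )
    x = proj(W @ (x - (0.5 / np.sqrt(t)) * g / z))

x_opt = np.clip(np.median(c), lo, hi)              # constrained optimum
print("spread:", np.ptp(x), " error:", abs(x.mean() - x_opt))
```

With the diminishing step size $\gamma(t) = 0.5/\sqrt{t}$, the iterates both reach consensus (small spread) and approach the constrained minimizer, consistent with the recursion bounded in (A.43)-(A.48).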
Bibliography

[1] M. H. DeGroot, "Reaching a consensus," J. American Statistical Association, vol. 69, no. 345, pp. 118–121, 1974.
[2] J. N. Tsitsiklis, "Problems in decentralized decision making and computation," Ph.D. dissertation, MIT, 1984.
[3] N. Friedkin and E. Johnsen, "Social influence and opinions," J. Mathematical Sociology, vol. 15, pp. 193–206, 1990.
[4] P. DeMarzo, D. Vayanos, and J. Zwiebel, "Persuasion bias, social influence, and unidimensional opinions," Quarterly J. Economics, vol. 118, no. 3, pp. 909–968, 2003.
[5] A. Jadbabaie, J. Lin, and A. S. Morse, "Coordination of groups of mobile autonomous agents using nearest neighbor rules," IEEE Trans. Autom. Control, vol. 48, no. 6, pp. 988–1001, 2003.
[6] J. Fax and R. Murray, "Information flow and cooperative control of vehicle formations," IEEE Trans. Autom. Control, vol. 49, no. 9, pp. 1465–1476, 2004.
[7] J. Cortes, S. Martinez, T. Karatas, and F. Bullo, "Coverage control for mobile sensing networks," IEEE Trans. Robot. Automat., vol. 20, no. 2, pp. 243–255, 2004.
[8] L. Moreau, "Stability of multiagent systems with time-dependent communication links," IEEE Trans. Autom. Control, vol. 50, pp. 169–182, 2005.
[9] W. Ren, R. W. Beard, and T. W. McLain, "Coordination variables and consensus building in multiple vehicle systems," in Cooperative Control. Springer, 2005, pp. 171–188.
[10] W. Ren and R. W. Beard, Distributed Consensus in Multi-vehicle Cooperative Control. Springer, 2008.
[11] B. Golub and M. O. Jackson, "Naive learning in social networks and the wisdom of crowds," American Economic J.: Microeconomics, pp. 112–149, 2010.
[12] D. Acemoglu and A. Ozdaglar, "Opinion dynamics and learning in social networks," Dynamic Games and Applications, vol. 1, no. 1, pp. 3–49, 2011.
[13] D. Acemoglu, G. Como, F. Fagnani, and A. Ozdaglar, "Opinion fluctuations and disagreement in social networks," Mathematics of Operations Research, vol. 38, no. 1, pp. 1–27, 2013.
[14] W. Ren, R. W. Beard, and E. M. Atkins, "Information consensus in multivehicle cooperative control," IEEE Control Syst. Mag., vol. 27, no. 2, pp. 71–82, 2007.
[15] R. Olfati-Saber, J. A. Fax, and R. M. Murray, "Consensus and cooperation in networked multi-agent systems," Proc. IEEE, vol. 95, no. 1, pp. 215–233, 2007.
[16] Y. Cao, W. Yu, W. Ren, and G. Chen, "An overview of recent progress in the study of distributed multi-agent coordination," IEEE Trans. Ind. Informat., vol. 9, no. 1, pp. 427–438, 2013.
[17] S. Strogatz, Sync: The Emerging Science of Spontaneous Order. Hyperion, 2003.
[18] M. Ji, A. Muhammad, and M. Egerstedt, "Leader-based multi-agent coordination: Controllability and optimal control," in Proc. American Control Conf., 2006, pp. 1358–1363.
[19] E. Yildiz, D. Acemoglu, A. E. Ozdaglar, A. Saberi, and A. Scaglione, "Discrete opinion dynamics with stubborn agents," 2011. [Online]. Available: http://dx.doi.org/10.2139/ssrn.1744113
[20] J. Ghaderi and R. Srikant, "Opinion dynamics in social networks: a local interaction game with stubborn agents," in Proc. American Control Conf. IEEE, 2013, pp. 1982–1987.
[21] A. Fagiolini, M. Pellinacci, G. Valenti, G. Dini, and A. Bicchi, "Consensus-based distributed intrusion detection for multi-robot systems," in Proc. IEEE Int. Conf. Robotics and Automation, 2008, pp. 120–127.
[22] S. Sundaram and C. N. Hadjicostis, "Distributed function calculation via linear iterative strategies in the presence of malicious agents," IEEE Trans. Autom. Control, no. 7, pp. 1495–1508, 2011.
[23] H. J. LeBlanc, H. Zhang, S. Sundaram, and X. Koutsoukos, "Consensus of multi-agent networks in the presence of adversaries using only local information," in Proc. 1st Int. Conf. High Confidence Networked Systems, 2012, pp. 1–10.
[24] M. Ji, A. Muhammad, and M. Egerstedt, "Leader-based multi-agent coordination: Controllability and optimal control," in Proc. American Control Conf. IEEE, 2006, 6 pp.
[25] F. Sorrentino, M. di Bernardo, F. Garofalo, and G. Chen, "Controllability of complex networks via pinning," Physical Rev. E, vol. 75, no. 4, p. 046103, 2007.
[26] A. Rahmani, M. Ji, M. Mesbahi, and M. Egerstedt, "Controllability of multi-agent systems from a graph-theoretic perspective," SIAM J. Control and Optimization, vol. 48, no. 1, pp. 162–186, 2009.
[27] Y.-Y. Liu, J.-J. Slotine, and A.-L. Barabási, "Controllability of complex networks," Nature, vol. 473, no. 7346, pp. 167–173, 2011.
[28] G. Parlangeli and G. Notarstefano, "On the reachability and observability of path and cycle graphs," IEEE Trans. Autom. Control, vol. 57, no. 3, pp. 743–748, 2012.
[29] F. Pasqualetti, S. Zampieri, and F. Bullo, "Controllability metrics, limitations and algorithms for complex networks," IEEE Trans. Control Netw. Syst., vol. 1, no. 1, pp. 40–52, 2014.
[30] A. Chapman, M. Nabi-Abdolyousefi, and M. Mesbahi, "Controllability and observability of network-of-networks via cartesian products," IEEE Trans. Autom. Control, vol. 59, no. 10, pp. 2668–2679, 2014.
[31] A. J. Whalen, S. N. Brennan, T. D. Sauer, and S. J. Schiff, "Observability and controllability of nonlinear networks: The role of symmetry," Physical Rev. X, vol. 5, no. 1, p. 011005, 2015.
[32] M. Rabbat and R. Nowak, "Distributed optimization in sensor networks," in Proc. 3rd Int. Sympo. Inf. Process. Sensor Networks. ACM, 2004, pp. 20–27.
[33] S. Kar and J. M. Moura, "Distributed consensus algorithms in sensor networks with imperfect communication: Link failures and channel noise," IEEE Trans. Signal Process., vol. 57, no. 1, pp. 355–369, 2009.
[34] S. Bolognani, S. Del Favero, L. Schenato, and D. Varagnolo, "Consensus-based distributed sensor calibration and least-square parameter identification in WSNs," Int. J. Robust and Nonlinear Control, vol. 20, no. 2, pp. 176–193, 2010.
[35] L. Xiao and S. Boyd, "Optimal scaling of a gradient method for distributed resource allocation," J. Optim. Theory App., vol. 129, no. 3, pp. 469–488, 2006.
[36] P. Di Lorenzo and S. Barbarossa, "A bio-inspired swarming algorithm for decentralized access in cognitive radio," IEEE Trans. Signal Process., vol. 59, no. 12, pp. 6160–6174, 2011.
[37] G. Mateos, J. A. Bazerque, and G. B. Giannakis, "Distributed sparse linear regression," IEEE Trans. Signal Process., vol. 58, no. 10, pp. 5262–5276, Oct 2010.
[38] P. A. Forero, A. Cano, and G. B. Giannakis, "Consensus-based distributed support vector machines," J. Machine Learning Research, vol. 11, no. May, pp. 1663–1707, 2010.
[39] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Found. Trends. Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
[40] T. Vicsek, A. Czirók, E. Ben-Jacob, I. Cohen, and O. Shochet, "Novel type of phase transition in a system of self-driven particles," Physical Rev. Lett., vol. 75, no. 6, p. 1226, 1995.
[41] V. Borkar and P. Varaiya, "Asymptotic agreement in distributed estimation," IEEE Trans. Autom. Control, vol. 27, no. 3, pp. 650–655, 1982.
[42] J. Tsitsiklis, D. Bertsekas, and M. Athans, "Distributed asynchronous deterministic and stochastic gradient optimization algorithms," IEEE Trans. Autom. Control, vol. 31, no. 9, pp. 803–812, 1986.
[43] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Inc., 1989.
[44] L. Scardovi and R. Sepulchre, "Synchronization in networks of identical linear systems," Automatica, vol. 45, no. 11, pp. 2557–2562, 2009.
[45] Z. Li, Z. Duan, G. Chen, and L. Huang, "Consensus of multiagent systems and synchronization of complex networks: A unified viewpoint," IEEE Trans. Circuits Syst. I, vol. 57, no. 1, pp. 213–224, 2010.
[46] U. Krause, "A discrete nonlinear and nonautonomous model of consensus formation," Commun. Difference Equations, pp. 227–236, 2000.
[47] N. Chopra and M. W. Spong, "Passivity-based control of multi-agent systems," in Advances in Robot Control. Springer, 2006, pp. 107–134.
[48] J. Zhou, J.-a. Lu, and J. Lu, "Adaptive synchronization of an uncertain complex dynamical network," IEEE Trans. Autom. Control, vol. 51, no. 4, pp. 652–656, 2006.
[49] M. Arcak, "Passivity as a design tool for group coordination," IEEE Trans. Autom. Control, vol. 52, no. 8, pp. 1380–1390, 2007.
[50] G.-B. Stan and R. Sepulchre, "Analysis of interconnected oscillators by dissipativity theory," IEEE Trans. Autom. Control, vol. 52, no. 2, pp. 256–270, 2007.
[51] N. Chopra and M. W. Spong, "On exponential synchronization of Kuramoto oscillators," IEEE Trans. Autom. Control, vol. 54, no. 2, pp. 353–357, 2009.
[52] R. Olfati-Saber and R. M. Murray, "Consensus problems in networks of agents with switching topology and time-delays," IEEE Trans. Autom. Control, vol. 49, no. 9, pp. 1520–1533, 2004.
[53] A. Kashyap, T. Başar, and R. Srikant, "Quantized consensus," Automatica, vol. 43, no. 7, pp. 1192–1203, 2007.
[54] S. Kar and J. M. Moura, "Distributed consensus algorithms in sensor networks: Quantized data and random link failures," IEEE Trans. Signal Process., vol. 58, no. 3, pp. 1383–1400, 2010.
[55] G. Shi and K. H. Johansson, "Persistent graphs and consensus convergence," in IEEE 51st Conf. Decision and Control, 2012, pp. 2046–2051.
[56] J. Wolfowitz, "Products of indecomposable, aperiodic, stochastic matrices," Proc. American Mathematical Society, vol. 14, no. 5, pp. 733–737, 1963.
[57] J. M. Hendrickx and V. D. Blondel, "Convergence of linear and non-linear versions of Vicsek's model," in Proc. 17th Int. Sympo. Mathematical Theory of Networks and Systems, 2005, pp. 1229–1240.
[58] V. Blondel, J. Hendrickx, A. Olshevsky, and J. Tsitsiklis, "Convergence in multiagent coordination, consensus, and flocking," in 44th IEEE Conf. Decision and Control / 2005 European Control Conf., 2005, pp. 2996–3000.
[59] B. Touri and A. Nedic, "Product of random stochastic matrices," IEEE Trans. Autom. Control, vol. 59, no. 2, pp. 437–448, Feb 2014.
[60] D. Kempe, J. Kleinberg, and É. Tardos, "Maximizing the spread of influence through a social network," in Proc. 9th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining. ACM, 2003, pp. 137–146.
[61] S. Patterson and B. Bamieh, "Leader selection for optimal network coherence," in Proc. 49th IEEE Conf. Decision and Control. IEEE, 2010, pp. 2692–2697.
[62] M. Fardad, F. Lin, X. Zhang, and M. R. Jovanovic, "On new characterizations of social influence in social networks," in Proc. American Control Conf., 2013, pp. 4777–4782.
[63] A. Gionis, E. Terzi, and P. Tsaparas, "Opinion maximization in social networks," in Proc. SIAM Int. Conf. Data Mining. SIAM, 2013, pp. 387–395.
[64] E. Yildiz, A. Ozdaglar, D. Acemoglu, A. Saberi, and A. Scaglione, "Binary opinion dynamics with stubborn agents," ACM Trans. Econ. Comp., vol. 1, no. 4, p. 19, 2013.
[65] A. Clark, B. Alomair, L. Bushnell, and R. Poovendran, "Minimizing convergence error in multi-agent systems via leader selection: A supermodular optimization approach," IEEE Trans. Autom. Control, vol. 59, no. 6, pp. 1480–1494, 2014.
[66] L. Vassio, F. Fagnani, P. Frasca, and A. Ozdaglar, "Message passing optimization of harmonic influence centrality," IEEE Trans. Control Netw. Syst., vol. 1, no. 1, pp. 109–120, 2014.
[67] V. S. Borkar, A. Karnik, J. Nair, and S. Nalli, "Manufacturing consent," IEEE Trans. Autom. Control, vol. 60, no. 1, pp. 104–117, 2015.
[68] G. Shi, K. C. Sou, H. Sandberg, and K. H. Johansson, "A graph-theoretic approach on optimizing informed-node selection in multi-agent tracking control," Physica D: Nonlinear Phenomena, vol. 267, pp. 104–111, 2014.
[69] K. Fitch and N. E. Leonard, "Information centrality and optimal leader selection in noisy networks," in IEEE 52nd Conf. Decision and Control. IEEE, 2013, pp. 7510–7515.
[70] A. Clark, B. Alomair, L. Bushnell, and R. Poovendran, "Minimizing convergence error in multi-agent systems via leader selection: A supermodular optimization approach," CoRR, vol. abs/1306.4949, 2013.
[71] S. Sundaram and C. N. Hadjicostis, "Finite-time distributed consensus in graphs with time-invariant topologies," in Proc. American Control Conf., 2007, pp. 711–716.
[72] Y. Yuan, G.-B. Stan, L. Shi, M. Barahona, and J. Gonçalves, "Decentralised minimum-time consensus," Automatica, vol. 49, no. 5, pp. 1227–1235, 2013.
[73] B. Ho and R. E. Kalman, "Effective construction of linear state-variable models from input/output functions," Automatisierungstechnik, vol. 14, no. 1-12, pp. 545–548, 1966.
[74] A. Tether, "Construction of minimal linear state-variable models from finite input-output data," IEEE Trans. Autom. Control, vol. 15, no. 4, pp. 427–436, 1970.
[75] L. Silverman, "Realization of linear dynamical systems," IEEE Trans. Autom. Control, vol. 16, no. 6, pp. 554–567, 1971.
[76] B. Johansson, T. Keviczky, M. Johansson, and K. H. Johansson, "Subgradient methods and consensus algorithms for solving convex optimization problems," in Proc. 47th IEEE Conf. Decision and Control, 2008, pp. 4185–4190.
[77] A. Jadbabaie, A. Ozdaglar, and M. Zargham, "A distributed Newton method for network optimization," in Proc. 48th IEEE Conf. Decision and Control / 28th Chinese Control Conf. IEEE, 2009, pp. 2736–2741.
[78] A. Nedic and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization," IEEE Trans. Autom. Control, vol. 54, no. 1, pp. 48–61, 2009.
[79] D. Jakovetic, J. Xavier, and J. M. Moura, "Cooperative convex optimization in networked systems: Augmented Lagrangian algorithms with directed gossip communication," IEEE Trans. Signal Process., vol. 59, no. 8, pp. 3889–3902, 2011.
[80] I. Lobel and A. Ozdaglar, "Distributed subgradient methods for convex optimization over random networks," IEEE Trans. Autom. Control, vol. 56, no. 6, pp. 1291–1306, 2011.
[81] J. C. Duchi, A. Agarwal, and M. J. Wainwright, "Dual averaging for distributed optimization: convergence analysis and network scaling," IEEE Trans. Autom. Control, vol. 57, no. 3, pp. 592–606, 2012.
[82] K. I. Tsianos, S. Lawlor, and M. G. Rabbat, "Consensus-based distributed optimization: Practical issues and applications in large-scale machine learning," in Proc. 50th Annu. Allerton Conf. Commun. Control Comp. IEEE, 2012, pp. 1543–1550.
[83] E. Wei and A. Ozdaglar, "On the O(1/k) convergence of asynchronous distributed alternating direction method of multipliers," arXiv preprint arXiv:1307.8254, 2013.
[84] B. Gharesifard and J. Cortés, "Distributed continuous-time convex optimization on weight-balanced digraphs," IEEE Trans. Autom. Control, vol. 59, no. 3, pp. 781–786, 2014.
[85] A. Nedic and A. Olshevsky, "Distributed optimization over time-varying directed graphs," IEEE Trans. Autom. Control, vol. 60, no. 3, pp. 601–615, March 2015.
[86] K. Yuan, Q. Ling, and W. Yin, "On the convergence of decentralized gradient descent," arXiv preprint arXiv:1310.7063, 2013.
[87] D. Jakovetic, J. Xavier, and J. M. Moura, "Fast distributed gradient methods," IEEE Trans. Autom. Control, vol. 59, no. 5, pp. 1131–1146, 2014.
[88] A. Nedic and D. P. Bertsekas, "Incremental subgradient methods for nondifferentiable optimization," SIAM J. Optim., vol. 12, no. 1, pp. 109–138, 2001.
[89] S. S. Ram, A. Nedic, and V. V. Veeravalli, "Incremental stochastic subgradient algorithms for convex optimization," SIAM J. Optim., vol. 20, no. 2, pp. 691–717, 2009.
[90] L. Xiao, S. Boyd, and S. Lall, "A scheme for robust distributed sensor fusion based on average consensus," in Proc. Int. Conf. Information Processing in Sensor Networks, 2005, pp. 63–70.
[91] F. Zanella, D. Varagnolo, A. Cenedese, G. Pillonetto, and L. Schenato, "Newton-Raphson consensus for distributed convex optimization," in Proc. 50th IEEE Conf. Decision and Control / 2011 European Control Conf., 2011, pp. 5917–5922.
[92] I. D. Schizas, A. Ribeiro, and G. B. Giannakis, "Consensus in ad hoc WSNs with noisy links – Part I: Distributed estimation of deterministic signals," IEEE Trans. Signal Process., vol. 56, no. 1, pp. 350–364, 2008.
[93] K. Tsianos and M. Rabbat, "Distributed dual averaging for convex optimization under communication delays," in Proc. American Control Conf., Jun. 2012, pp. 1067–1072.
[94] P. Lin, W. Ren, and Y. Song, "Distributed multi-agent optimization subject to nonidentical constraints and communication delays," Automatica, vol. 65, pp. 120–131, 2016.
[95] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 1985.
[96] V. S. Mai and E. H. Abed, "Opinion dynamics with persistent leaders," in IEEE 53rd Conf. Decision and Control, 2014, pp. 2907–2913.
[97] H. Minc, Nonnegative Matrices. John Wiley and Sons, New York, 1988.
[98] J. Lorenz, "A stabilization theorem for dynamics of continuous opinions," Physica A, vol. 355, no. 1, pp. 217–223, 2005.
[99] D. J. Hartfiel, Nonhomogeneous Matrix Products. World Scientific, 2002.
[100] A. Olshevsky and J. N. Tsitsiklis, "Convergence speed in distributed consensus and averaging," SIAM Rev., vol. 53, no. 4, pp. 747–772, 2011.
[101] W. Wang and J.-J. Slotine, "A theoretical study of different leader roles in networks," IEEE Trans. Autom. Control, vol. 51, no. 7, pp. 1156–1161, 2006.
[102] A. Rahmani, M. Ji, M. Mesbahi, and M. Egerstedt, "Controllability of multi-agent systems from a graph-theoretic perspective," SIAM J. Control Optim., vol. 48, no. 1, pp. 162–186, 2009.
[103] S. Joshi and S. Boyd, "Sensor selection via convex optimization," IEEE Trans. Signal Process., vol. 57, no. 2, pp. 451–462, 2009.
[104] F. Lin, M. Fardad, and M. R. Jovanović, "Algorithms for leader selection in large dynamical networks: Noise-corrupted leaders," in Proc. 50th IEEE Conf. Decision Control, European Control Conf., 2011, pp. 2932–2937.
[105] J. Long, S. O. Memik, and M. Grayson, "Optimization of an on-chip active cooling system based on thin-film thermoelectric coolers," in Proc. Conf. Design, Automation and Test in Europe, 2010, pp. 117–122.
[106] V. S. Borkar, J. Nair, and N. Sanketh, "Manufacturing consent," in 48th Allerton Conf. Commun. Control Comp., 2010, pp. 1550–1555.
[107] V. S. Mai and E. H. Abed, "Dynamic consensus measure and optimal selection of direct followers in multiagent networks," in Proc. American Control Conf. IEEE, 2016, pp. 2880–2885.
[108] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher, "An analysis of approximations for maximizing submodular set functions – I," Mathematical Programming, vol. 14, no. 1, pp. 265–294, 1978.
[109] M. Conforti and G. Cornuéjols, "Submodular set functions, matroids and the greedy algorithm: Tight worst-case bounds and some generalizations of the Rado-Edmonds theorem," Discrete Applied Mathematics, vol. 7, no. 3, pp. 251–274, 1984.
[110] G. Ranjan and Z.-L. Zhang, "Geometry of complex networks and topological centrality," Physica A: Statistical Mechanics and its Applications, vol. 392, no. 17, pp. 3833–3845, 2013.
[111] K. Stephenson and M. Zelen, "Rethinking centrality: Methods and examples," Social Networks, vol. 11, no. 1, pp. 1–37, 1989.
[112] I. Poulakakis, L. Scardovi, and N. E. Leonard, "Node classification in networks of stochastic evidence accumulators," arXiv preprint arXiv:1210.4235, 2012.
[113] M. Brand, "A random walks perspective on maximizing satisfaction and profit," in SDM. SIAM, 2005, pp. 12–19.
[114] S. J. Wright, Primal-Dual Interior-Point Methods. SIAM, 1997.
[115] D. P. Bertsekas, Nonlinear Programming. Athena Scientific, 1999.
[116] F. A. Potra and S. J. Wright, "Interior-point methods," J. Computational and Applied Mathematics, vol. 124, no. 1–2, pp. 281–302, 2000.
[117] Y. Nesterov, Introductory Lectures on Convex Optimization. Springer Science & Business Media, 2004, vol. 87.
[118] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[119] J. Long, D. Li, S. O. Memik, and S. Ulgen, "Theory and analysis for optimization of on-chip thermoelectric cooling systems," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 32, no. 10, pp. 1628–1632, 2013.
[120] A. Ganesan, S. R. Ross, and B. R. Barmish, "An extreme point result for convexity, concavity and monotonicity of parameterized linear equation solutions," Linear Algebra Appl., vol. 390, pp. 61–73, 2004.
[121] D. M. Topkis, "Minimizing a submodular function on a lattice," Operations Research, vol. 26, no. 2, pp. 305–321, 1978.
[122] P. Milgrom and J. Roberts, "Rationalizability, learning, and equilibrium in games with strategic complementarities," Econometrica, vol. 58, no. 6, pp. 1255–1277, 1990.
[123] J. Leskovec, D. Huttenlocher, and J. Kleinberg, "Predicting positive and negative links in online social networks," in Proc. 19th Int. Conf. World Wide Web. ACM, 2010, pp. 641–650.
[124] J. Currie and D. I. Wilson, "OPTI: Lowering the Barrier Between Open Source Optimizers and the Industrial MATLAB User," in Found. Comp.-Aided Process Operations, N. Sahinidis and J. Pinto, Eds., Savannah, Georgia, USA, 8–11 January 2012.
[125] P. Colaneri, R. H. Middleton, Z. Chen, D. Caporale, and F. Blanchini, "Convexity of the cost functional in an optimal control problem for a class of positive switched systems," Automatica, vol. 50, no. 4, pp. 1227–1234, 2014.
[126] B. P. Lathi, Linear Systems and Signals. Oxford University Press, 2009.
[127] D. G. Luenberger, Optimization by Vector Space Methods. John Wiley & Sons, 1997.
[128] T. Charalambous, Y. Yuan, T. Yang, W. Pan, C. N. Hadjicostis, and M. Johansson, "Distributed finite-time average consensus in digraphs in the presence of time delays," IEEE Trans. Control Netw. Syst., vol. 2, no. 4, pp. 370–381, 2015.
[129] C. D. Godsil, G. Royle, and C. Godsil, Algebraic Graph Theory. Springer New York, 2001, vol. 8.
[130] S. Martini, M. Egerstedt, and A. Bicchi, "Controllability analysis of multi-agent systems using relaxed equitable partitions," Int. J. Syst. Control Commun., vol. 2, no. 1-3, pp. 100–121, 2010.
[131] S. Zhang, M. Cao, and M. K. Camlibel, "Upper and lower bounds for controllable subspaces of networks of diffusively coupled agents," IEEE Trans. Autom. Control, vol. 59, no. 3, pp. 745–750, 2014.
[132] S. Zhang, M. K. Camlibel, and M. Cao, "Controllability of diffusively-coupled multi-agent systems with general and distance regular coupling topologies," in Proc. 50th IEEE Conf. Decision and Control / 2011 European Control Conf., 2011, pp. 759–764.
[133] J. Wang and N. Elia, "A control perspective for centralized and distributed convex optimization," in Proc. 50th IEEE Conf. Decision and Control / 2011 European Control Conf., 2011, pp. 3800–3805.
[134] A. Olshevsky, "Linear time average consensus on fixed graphs and implications for decentralized optimization and multi-agent control," arXiv preprint arXiv:1411.4186v6, 2016.
[135] I.-A. Chen, "Fast distributed first-order methods," Master's thesis, MIT, 2012.
[136] W. Shi, Q. Ling, G. Wu, and W. Yin, "EXTRA: An exact first-order algorithm for decentralized consensus optimization," SIAM J. Optim., vol. 25, no. 2, pp. 944–966, 2015.
[137] A. Nedic, A. Ozdaglar, and P. A. Parrilo, "Constrained consensus and optimization in multi-agent networks," IEEE Trans. Autom. Control, vol. 55, no. 4, pp. 922–938, 2010.
[138] S. Boyd, L. Xiao, and A. Mutapcic, "Subgradient methods," Lecture notes of EE392o, Stanford University, Autumn Quarter, 2003. [Online]. Available: http://web.mit.edu/6.976/www/notes/subgrad_method.pdf
[139] B. Nejad, S. Attia, and J. Raisch, "Max-consensus in a max-plus algebraic setting: The case of fixed communication topologies," in Int. Sympo. Info. Commun. Automation Tech., 2009, pp. 1–7.
[140] S. Sundaram and C. N. Hadjicostis, "Distributed function calculation and consensus using linear iterative strategies," IEEE J. Select. Areas Commun., vol. 26, no. 4, pp. 650–660, 2008.
[141] T. Charalambous, M. G. Rabbat, M. Johansson, and C. N. Hadjicostis, "Distributed finite-time computation of digraph parameters: Left-eigenvector, out-degree and spectrum," IEEE Trans. Control Netw. Syst., vol. 3, no. 2, pp. 137–148, 2016.
[142] A. E. Brouwer, A. M. Cohen, and A. Neumaier, Distance-Regular Graphs. New York: Springer-Verlag, 1989, vol. 18.
[143] E. R. van Dam, J. H. Koolen, and H. Tanaka, "Distance-regular graphs," Electronic J. Combinatorics, vol. DS22, 2014.
[144] J. Wang and N. Elia, "Control approach to distributed optimization," in Proc. 48th Annu. Allerton Conf. Commun. Control Comp., 2010, pp. 557–561.
[145] A. Nedic and A. Olshevsky, "Distributed optimization over time-varying directed graphs," IEEE Trans. Autom. Control, vol. 60, no. 3, pp. 601–615, March 2015.
[146] D. Kempe, A. Dobra, and J. Gehrke, "Gossip-based computation of aggregate information," in Proc. 44th Annu. IEEE Symp. Found. Comp. Sci., 2003, pp. 482–491.
[147] A. Makhdoumi and A. Ozdaglar, "Graph balancing for distributed subgradient methods over directed graphs," in Proc. 54th IEEE Conf. Decision and Control, 2015, pp. 1364–1371.
[148] S. S. Ram, A. Nedić, and V. V. Veeravalli, "Distributed stochastic subgradient projection algorithms for convex optimization," J. Optim. Theory Appl., vol. 147, no. 3, pp. 516–545, 2010.
[149] S. Lee and A. Nedic, "Distributed random projection algorithm for convex optimization," IEEE J. Sel. Topics Signal Process., vol. 7, no. 2, pp. 221–229, 2013.
[150] V. S. Mai and E. H. Abed, "Distributed optimization over weighted directed graphs using row stochastic matrix," in Proc. American Control Conf., 2016, pp. 7165–7170.
[151] I. Lobel, A. Ozdaglar, and D. Feijer, "Distributed multi-agent optimization with state-dependent communication," Mathematical Programming, vol. 129, no. 2, pp. 255–284, 2011.
[152] A. Hoffmann, "The distance to the intersection of two convex sets expressed by the distances to each of them," Mathematische Nachrichten, vol. 157, no. 1, pp. 81–98, 1992.
[153] H. H. Bauschke and J. M. Borwein, "On projection algorithms for solving convex feasibility problems," SIAM Rev., vol. 38, no. 3, pp. 367–426, 1996.
[154] Z. Qu, C. Li, and F. Lewis, "Cooperative control based on distributed estimation of network connectivity," in Proc. American Control Conf., 2011, pp. 3441–3446.
[155] A. Priolo, A. Gasparri, E. Montijano, and C. Sagues, "A decentralized algorithm for balancing a strongly connected weighted digraph," in Proc. American Control Conf., Jun. 2013, pp. 6547–6552.
[156] I. Matei and J. S. Baras, "A comparison between upper bounds on performance of two consensus-based distributed optimization algorithms," in Estimation and Control of Networked Systems, vol. 3, no. 1, 2012, pp. 168–173.
[157] H. Robbins and D. Siegmund, "A convergence theorem for nonnegative almost supermartingales and some applications," Methods in Statistics, pp. 233–257, 1971.
[158] C. Xi, V. S. Mai, E. H. Abed, and U. A. Khan, "Linear convergence in directed optimization with row-stochastic matrices," arXiv preprint arXiv:1611.06160, 2016.
[159] W. B. Gragg and A. Lindquist, "On the partial realization problem," Linear Algebra Appl., vol. 50, pp. 277–319, 1983.
[160] L. Ljung, System Identification. Wiley Online Library, 1999.
[161] S. Beheshti and M. A. Dahleh, "Noisy data and impulse response estimation," IEEE Trans. Signal Process., vol. 58, no. 2, pp. 510–521, 2010.
[162] Y. Yuan, G.-B. Stan, L. Shi, M. Barahona, and J. Gonçalves, "Minimal-time uncertain output final value of unknown DT-LTI systems with application to the decentralised network consensus problem," in Proc. Int. Sympo. Mathematical Theory of Netw. Syst., 2010.
[163] F. Pasqualetti, A. Bicchi, and F. Bullo, "Distributed intrusion detection for secure consensus computations," in Proc. 46th IEEE Conf. Decision and Control. IEEE, 2007, pp. 5594–5599.
[164] I. Shames, A. M. Teixeira, H. Sandberg, and K. H. Johansson, "Distributed fault detection for interconnected second-order systems," Automatica, vol. 47, no. 12, pp. 2757–2764, 2011.
[165] F. Pasqualetti, A. Bicchi, and F. Bullo, "Consensus computation in unreliable networks: A system theoretic approach," IEEE Trans. Autom. Control, vol. 57, no. 1, pp. 90–104, 2012.
[166] F. Pasqualetti, F. Dörfler, and F. Bullo, "Attack detection and identification in cyber-physical systems," IEEE Trans. Autom. Control, vol. 58, no. 11, pp. 2715–2729, 2013.
[167] J. McDonald, M. Neumann, H. Schneider, and M. Tsatsomeros, "Inverse M-matrix inequalities and generalized ultrametric matrices," Linear Algebra Appl., vol. 220, pp. 321–341, 1995.
[168] N. J. Higham, Accuracy and Stability of Numerical Algorithms. SIAM, 2002.
[169] A. Ben-Israel and T. N. Greville, Generalized Inverses: Theory and Applications. Springer, 2003, vol. 13.
[170] C. Meyer, Jr., "Generalized inversion of modified matrices," SIAM J. Applied Math., vol. 24, no. 3, pp. 315–323, 1973.
[171] R. Plemmons, "M-matrix characterizations. I – Nonsingular M-matrices," Linear Algebra Appl., vol. 18, no. 2, pp. 175–188, 1977.