ABSTRACT

Title of Dissertation: LEARNING IN LARGE MULTI-AGENT SYSTEMS
Semih Kara, Doctor of Philosophy, 2024
Dissertation Directed by: Professor Nuno C. Martins, Department of Electrical and Computer Engineering

In this dissertation, we study a framework of large-scale multi-agent strategic interactions. The agents are nondescript and use a learning rule to repeatedly revise their strategies based on their payoffs. Within this setting, our results are structured around three main themes: (i) guaranteed learning of Nash equilibria, (ii) the inverse problem, i.e., estimating the payoff mechanism from the agents' strategy choices, and (iii) applications to the placement of electric vehicle charging stations.

In the traditional setup, the agents' inter-revision times follow independent and identical exponential distributions. We expand on this by allowing these intervals to depend on the agents' strategies or to have Erlang distributions. These extensions enhance the framework's modeling capabilities, enabling it to address problems such as task allocation with varying service times or multiple stages. We also explore a third generalization, concerning the accessibility among strategies. The majority of the existing literature assumes that the agents can transition between any two strategies, whereas we allow only certain alternatives to be accessible from certain others. This adjustment further improves the framework's modeling capabilities, such as by incorporating constraints on strategy switching related to spatial and informational factors. For all of these extensions, we use Lyapunov's method and passivity-based techniques to find conditions on the revision rates, learning rule, and payoff mechanism that ensure the agents learn to play a Nash equilibrium of the payoff mechanism.

For our second class of problems, we adopt a multi-agent inverse reinforcement learning perspective. Here, we assume that the learning rule is known but, unlike in existing work, the payoff mechanism is unknown. We propose a method to estimate the unknown payoff mechanism from sample path observations of the populations' strategy profile. Our approach is two-fold: we estimate the agents' strategy transitioning probabilities, which we then use – along with the known learning rule – to obtain a payoff mechanism estimate. Our findings regarding the estimation of transitioning probabilities are general, while for the second step, we focus on linear payoff mechanisms and three well-known learning rules (Smith, replicator, and Brown-von Neumann-Nash). Additionally, under certain assumptions, we show that we can use the payoff mechanism estimate to predict the Nash equilibria of the unknown mechanism and forecast the strategy profile induced by other rules.

Lastly, we contribute to a traffic simulation tool by integrating electric vehicles, their charging behaviors, and charging stations. This simulation tool is based on spatial-queueing principles and, although less detailed than some microscopic simulators, it runs much faster and accurately represents traffic rules. Using this tool, we identify optimal charging station locations (on real roadway networks) that minimize overall traffic congestion.

LEARNING IN LARGE MULTI-AGENT SYSTEMS

by Semih Kara

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2024

Advisory Committee: Professor Nuno C. Martins, Chair/Advisor Professor P. S.
Krishnaprasad Professor Richard La Professor Kaiqing Zhang Professor Nikhil Chopra © Copyright by Semih Kara 2024 To my beloved parents: Mine and Tarık. ii Acknowledgments I am immensely thankful to my advisor, Prof. Nuno Martins, whose guidance has transformed my chaotic intellectual ambitions into focused research. From him, I have gained a wealth of knowledge, but above all, I cherish the lessons on being a part of academia. I can vividly remember countless instances where he encouraged me to delve deeply into problems, take the time to ponder, and pursue questions that sparked my interest; after all, that was what made academia special. He also went above and beyond to ensure that I was funded, provided with necessary research materials, and had the opportunity to attend conferences. Going back to my first research manuscripts, I can only imagine the pain that I put him through with my horrible writing, which he tirelessly helped me refine through numerous meticulous meetings. I believe that it is difficult to find mentors like him in this day and age, so I feel very fortunate that our paths crossed. I would also like to thank my dissertation committee: Prof. P. S. Krishnaprasad, Prof. Richard La, Prof. Kaiqing Zhang and Prof. Nikhil Chopra. Not only they have been extremely accommodating, but also they posed thought-provoking questions that fostered stimulating discussions. I wish to emphasize my sincere gratitude to Prof. Krishnaprasad for his generosity in offering me a desk at his lab in my first year, engaging in numerous academic discussions with me and guiding me to intriguing problems and references, enriching my academic journey. Furthermore, I am indebted to Prof. Murat Arcak for his guidance and support. I sincerely value his positivity, iii willingness to help, and our collaborative work – including my interactions with his research team at UC Berkeley: Can Kızılkale, Yasin Sönmez, and Alex Kurzhanskiy. The classes that I took at UMD was another defining part of my academic journey, as well as my personal growth. Learning from Prof. Krishnaprasad, Prof. Andre Tits, Prof. Alexander Barg, Prof. Armand Makowski, Prof. Mark Freidlin, Prof. John Baras, Prof. Richard La, and Prof. Radu Balan was a true pleasure. I cherish having the opportunity to benefit from the vast experience and intellectual capabilities of these individuals. Moreover, I am grateful for my undergraduate education at Bilkent University. There, I had the chance to build the fundamentals of my knowledge and developed an interest in dynamical systems and game theory. My interactions with Prof. Serdar Yüksel, Prof. Ömer Morgül, Prof. Hitay Özbay, Prof. Orhan Arıkan, and Prof. Bülent Özgüler were all pivotal during this period. Also, I feel very fortunate for the friendship and mentoring of Prof. Kemal Yıldız. Learning game theory from him was a pure joy, and his door was always open for a chat, whether academic or not. Overall, I consider myself extremely lucky to have known and interacted with so many exceptional scholars. This dissertation would also not be possible without the unwavering love and support of my family and friends. Throughout my life, my sister Simge, grandfather Halit, aunt Ayşen, and uncle Craig have always stood by me, shaping me into the person I am today. While I had to be away from my family during my PhD, I found a strong support system among my friends, who became like an extended family. I treasure the countless laughs with Batuhan and Özde. 
I never had a dull moment with them, and our bond only grew stronger with time. I am grateful for the friendships of Elif, Berk, Ferda, Olga, and Diego, with whom I shared numerous adventures and moments of happiness. Mario, Agata, and Michael also hold a special place in my heart, with our iv memories full of joy, great food, and smiles. Soccer has been a significant outlet for me during my PhD, and I am thankful to have had amazing teammates, who are also great friends: Mustafa, Cameron, Arda, Bilal, Rüştü, Olsan, Utku, Adam, and Tri. Moreover, my longtime friends from Ankara – Yunus, Cem, Tibet, Can, Deniz, Bartu, Oskay, and Ece – have consistently emphasized their love and encouragement for me. Having such wonderful friendships is one of my greatest achievements, and I am left speechless by the extent of support and affection I have received from them over the years. Finally, my deepest gratitude are to my wife, Ece Yegane, and my parents, Mine and Tarık Kara. To my parents: I inherited a love of learning from you, which has enriched my life since childhood. Seeing how you approached research and teaching (especially my father’s approach to mathematics), it was natural for me to aspire to intellectual exploration. You also supported me unconditionally in pursuing my aspirations and cheered for me at every step. I believe you are as loving and supportive as parents can be, and I am eternally grateful to have you in my life. To my beautiful and loving wife Ece: Your unconditional support, understanding, and love have been my rock throughout this journey. Not only your sacrifices and care have enabled me to focus on my research many, many times, but you also spark joy to my life, making it a million times better. Thank you for being my biggest supporter, being my best friend, always finding fun things to do, discovering great restaurants to try, convincing me to foster cats, finding our apartment, cooking the best dishes, helping me discover what I want, playing games with me, being my skiing buddy; the list goes on and on and on... I am the luckiest person alive, because I get to be loved by you. Looking back, I am truly touched to see what an amazing group of people that I had the chance to know. v Table of Contents Dedication ii Acknowledgements iii Table of Contents vi List of Tables x List of Figures xi List of Abbreviations xiii Chapter 1: Introduction 1 1.1 Learning Nash Equilibria Under Generalizations . . . . . . . . . . . . . . . . . . 1 1.1.1 Learning Under Strategy-Dependent Revision Rates . . . . . . . . . . . . 2 1.1.2 Learning Under Erlang Distributed Inter-Revision Intervals . . . . . . . . 3 1.1.3 Learning Under Constrained Strategy-Switching . . . . . . . . . . . . . 3 1.2 The Inverse Problem: Learning the Game from Agents’ Actions . . . . . . . . . 4 1.3 Applications to Charging Station Placement . . . . . . . . . . . . . . . . . . . . 5 Chapter 2: Framework Description 6 2.1 The Social State and the Population State . . . . . . . . . . . . . . . . . . . . . 6 2.2 The Payoffs, the Payoff Mechanism, and Nash Equilibria . . . . . . . . . . . . . 7 2.3 The Revision Paradigm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3.1 Revision Times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3.2 The Learning Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3.3 The Social State As a Markov Process . . . . . . . . . . . . . . . . . . . 13 2.4 Deterministic Mean-Field Approximation . . . . . . . . . . . . . . . . . . . . . 
14 2.5 Methods for Analyzing the Mean Closed Loop . . . . . . . . . . . . . . . . . . . 16 2.5.1 Conditions on the EDM . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.5.2 Conditions on the Payoff Mechanism . . . . . . . . . . . . . . . . . . . 20 2.5.3 Ensuring the Global Asymptotic Stability of the Mean Closed Loop . . . 24 Chapter 3: Pairwise Comparison Rules Under Strategy-Dependent Revision Rates 26 vi 3.1 Strategy-Dependent Revision Rates . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.3 Stability Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.3.1 Nash Stationarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.3.2 Passivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.3.3 Numerical Evaluation of the Worst-Case Revision Rate Ratio Bound for the Smith Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.3.4 Global Asymptotic Stability Guarantees . . . . . . . . . . . . . . . . . . 38 3.4 Numerical Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.4.1 A DRAM Market HPG . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.4.2 A Memoryless Payoff Mechanism . . . . . . . . . . . . . . . . . . . . . 40 3.4.3 Dynamics Under the Smith Rule . . . . . . . . . . . . . . . . . . . . . . 41 3.4.4 A Dynamic Payoff Mechanism . . . . . . . . . . . . . . . . . . . . . . . 42 3.4.5 Dynamics Under the Smith Rule and the Smoothed HPG . . . . . . . . . 43 3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 Chapter 4: Excess Payoff Rules Under Strategy-Dependent Revision Rates 46 4.1 EPT Rules, Recap of Existing Results, and Motivation . . . . . . . . . . . . . . . 47 4.2 The EDM With Strategy-Dependent Revision Rates and Problem Description . . 48 4.3 Performance of EPT Rules Under Strategy-Dependent Revision Rates and RM- EPT Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.4 (NS) and (PC) Properties of Sign Preserving RM-EPT Rules . . . . . . . . . . . 54 4.5 Numerical Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 Chapter 5: Learning Nash Equilibria Under Erlang Clocks 58 5.1 Existing Work and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 5.2 Erlang Distributed Inter-Revision Intervals and Problem Description . . . . . . . 60 5.2.1 Erlang Inter-Revision Intervals . . . . . . . . . . . . . . . . . . . . . . . 61 5.2.2 Erlang Evolutionary Dynamics . . . . . . . . . . . . . . . . . . . . . . . 61 5.2.3 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 5.3 Convergence Under Potential Games . . . . . . . . . . . . . . . . . . . . . . . . 66 5.4 Convergence Under Strictly Contractive Games . . . . . . . . . . . . . . . . . . 66 5.5 Numerical Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 5.5.1 A Congestion Game Example . . . . . . . . . . . . . . . . . . . . . . . 68 5.5.2 A Rock-Paper-Scissors (RPS) Game Example . . . . . . . . . . . . . . . 69 5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 Chapter 6: Learning Nash Equilibria Under Constrained Strategy Switching 72 6.1 Incomplete Strategy Graphs, Related Work and Gaps . . . . . . . . . . . . . . . 
72 6.2 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 6.3 Stability Guarantees Under Constrained Strategy Switching . . . . . . . . . . . . 79 6.3.1 Analysis Under a Fixed Regularization Parameter . . . . . . . . . . . . . 79 6.3.2 Updating the Regularization Parameter . . . . . . . . . . . . . . . . . . . 80 6.3.3 Asymptotic Stability of xNE . . . . . . . . . . . . . . . . . . . . . . . . 82 vii 6.4 Numerical Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 Chapter 7: Learning Population Games 86 7.1 Informal Problem Description, Motivation, and Contributions . . . . . . . . . . . 87 7.1.1 Informal Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . 87 7.1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 7.1.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 7.2 Additional Notation and Definitions . . . . . . . . . . . . . . . . . . . . . . . . 92 7.3 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 7.4 Estimating the Payoff Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 7.4.1 Estimating Strategy Switching Probabilities . . . . . . . . . . . . . . . . 96 7.4.2 Learning the Payoff Matrix From Strategy Switching Estimates . . . . . . 101 7.5 Practical Relevance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 7.5.1 Finite-Horizon Prediction of the Population State . . . . . . . . . . . . . 110 7.5.2 Numerical Results on Predicting the Population State . . . . . . . . . . . 111 7.5.3 Estimating the Nash Equilibria . . . . . . . . . . . . . . . . . . . . . . . 112 7.5.4 Numerical Results on Estimating Nash Equilibria . . . . . . . . . . . . . 114 7.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 Chapter 8: An Application to Optimal Electric Vehicle Charging Station Placement 116 8.1 The Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 8.2 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 8.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 Appendix A: Auxiliary Results for Chapter 3 125 A.1 Proof of Theorem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 A.2 Proof of Proposition 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 Appendix B: Auxiliary Results for Chapter 4 134 B.1 Proof of Theorem 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 B.2 Proof of Theorem 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 Appendix C: Auxiliary Results for Chapter 5 138 C.1 Proof of Theorem 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 C.2 Proof of Corollary 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 C.3 Proof of Theorem 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 C.3.1 Auxiliary Results for the Proof of Theorem 6 . . . . . . . . . . . . . . . 143 Appendix D: Auxiliary Results for Chapter 6 147 D.1 Proof of Lemma 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 D.2 Proof of Lemma 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 D.3 Proof of Theorem 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. 148 Appendix E: Auxiliary Results for Chapter 7 152 E.1 Proof of Lemma 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 viii E.2 Proof of Theorem 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 E.2.1 Proof Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154 E.2.2 Main Body of the Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 E.2.3 Auxiliary Results for the Proof of Theorem 8 . . . . . . . . . . . . . . . 158 E.3 Proof of Theorem 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166 E.3.1 Estimating the Row Differences Under the Smith Protocol . . . . . . . . 167 E.3.2 Estimating the Row Differences Under the Replicator Protocol . . . . . . 173 E.3.3 Estimating the Row Differences Under the BNN Protocol . . . . . . . . . 177 E.4 Proof of Theorem 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184 E.5 Results on the Estimation of Nash Equilibria . . . . . . . . . . . . . . . . . . . . 188 E.5.1 Proof of Theorem 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189 E.5.2 Essential Equilibria of F . . . . . . . . . . . . . . . . . . . . . . . . . . 191 Bibliography 196 ix List of Tables 2.1 The Smith, replicator and BNN revision protocols. . . . . . . . . . . . . . . . . . 13 7.1 Estimates and actual values of strategy switching probabilities at the 55-th segment.101 7.2 Estimates of strategy switching probabilities. . . . . . . . . . . . . . . . . . . . . 109 7.3 Row differences of Θ and Θ̂. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 E.1 Summary of the notation related to the partitioning process. . . . . . . . . . . . . 154 x List of Figures 2.1 Congestion game example with one origin/destination pair: (a) depicts the topology of the links and (b) illustrates the 3 strategies representing all possible routes from the origin to the destination. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2 The revision timeline near t∗. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3 The agents’ strategic revisions and its influence on X and P. . . . . . . . . . . . 14 2.4 Diagram representing the mean closed loop as the positive feedback interconnection of the PDM and the EDM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.1 Comparing λ̄Smith with the lower-bound in (3.9). . . . . . . . . . . . . . . . . . . 38 3.2 Trajectories of distributions of DRAM buyers on the manufacturers under the HPG and Smith rule. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.3 Time domain plots of the distributions of DRAM buyers on the manufacturers under the HPG and Smith rule. . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.4 Time domain plots of the PDM’s state and distributions of DRAM buyers on the manufacturers under the smoothed HPG and Smith rule. . . . . . . . . . . . . . . 44 4.1 Trajectories of x for the numerical example. . . . . . . . . . . . . . . . . . . . . 56 5.1 Possible sub-strategy transitions. . . . . . . . . . . . . . . . . . . . . . . . . . . 62 5.2 Trajectories of x̄ for the congestion game example. . . . . . . . . . . . . . . . . 69 5.3 Trajectories of x̄ for the RPS game example. . . . . . . . . . . . . . . . . . . . . 71 6.1 The agents’ learning rule and F as a feedback system. . . . . . . . . . . . . . . . 78 6.2 Coordinator implementation of KL-regularization. . . . . . . . . . . . . . . . . . 79 6.3 Transmitter configurations and the resulting G. . . . . . . . . . 
. . . . . . . . . 83 6.4 Time domain plots of x and y. . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 7.1 First 3 segments of X from the simulation. . . . . . . . . . . . . . . . . . . . . . 100 7.2 Illustration of S for the congestion game in Example 1: (a) shows S on the simplex, (b) shows the regions in which the strategy offering the smallest payoff is different (the numbers indicate which strategy offers the lowest payoff). As a side note, the point where the lines intersect is the Nash equilibrium of the game. 106 7.3 Sample paths from the simulation: (a) displays X1 [0,T 1], (b) displays X2 [0,T 2], and (c) displays X3 [0,T 3]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 xi 7.4 Simulations of the population states resulting fromF and F̂ : (a) shows a close-up of the trajectories (b) displays the trajectories on the simplex. . . . . . . . . . . . 111 8.1 Spatial-queuing structure of a link. . . . . . . . . . . . . . . . . . . . . . . . . . 117 8.2 Illustration of a charging station. . . . . . . . . . . . . . . . . . . . . . . . . . . 119 8.3 Area of interest in College Park, exported from OpenStreetMap. The figure in (b) highlights the roads that the vehicles can take. . . . . . . . . . . . . . . . . . . . 120 8.4 Roadway network extracted from the OpenStreetMap data displayed in Figure 8.3. 121 8.5 Roadway network with the candidate station locations, and the origin-destination nodes highlighted. Green nodes represent station locations, the blue node is the origin, and the red one is the destination. . . . . . . . . . . . . . . . . . . . . . . 122 8.6 Simulation results for 10 vehicles. . . . . . . . . . . . . . . . . . . . . . . . . . 122 8.7 Simulation results for 100 vehicles. . . . . . . . . . . . . . . . . . . . . . . . . . 123 xii List of Abbreviations EPT Excess payoff target EV Electric vehicle Iid Independent and identically distributed KL Kullback-Leibler (NS) Nash stationarity PATH Partners for Advanced Transportation Technology (PC) Positive correlation PC Pairwise comparison RM Rate-modified Uhc Upper hemicontinuous xiii Chapter 1: Introduction In this dissertation, we examine a framework of noncooperative strategic interactions among large numbers of agents; finding applications in various fields, such as traffic management [1, 2], distributed task allocation [3], distributed extremum seeking [4,5], communication networks [6], and many more. In this framework [7,8], the agents are indistinguishable and select strategies from a common finite set. Each strategy has a payoff, assigned by a payoff mechanism that couples the agents’ strategy choices. In response to these payoffs, the agents repeatedly revise their strategies using a learning rule. With the fundamental objective of ensuring the scalability, efficiency, and robustness of such systems, we study three types of problems: • Learning Nash equilibria under generalizations of the framework. • Learning the payoff mechanism from the agents’ strategy choices. • Applications to the placement of electric vehicle charging stations. 1.1 Learning Nash Equilibria Under Generalizations A well-studied problem in game theory, including the setting above, is whether the agents can learn to play a Nash equilibrium (a strategy profile at which no agent has incentive to 1 unilaterally deviate from its strategy) by following simple heuristics [7, 9]. 
This question not only provides an evolutionary basis and convenient computational algorithms for Nash equilibria, but also validates the use of these heuristics as distributed control algorithms. Nonetheless, in many cases, we do not have control over the agents' strategic behavior (e.g., traffic and epidemics). Therefore, to have performance guarantees for such cases, we would like the guarantees on learning Nash equilibria to be as comprehensive as possible. Motivated by this line of thinking, we analyze three extensions of the existing framework: strategy-dependent revision rates, Erlang distributed inter-revision intervals, and constrained strategy switching over graphs.

1.1.1 Learning Under Strategy-Dependent Revision Rates

First, we expand upon the traditional framework of population games by introducing a new element: the dependency of the agents' revision rates on their strategies. Existing stability results in this context have been limited to pairwise comparison learning rules and required the payoff mechanism to be a memoryless potential game. Consequently, our aim is to broaden these findings to encompass a wider range of payoff mechanisms and learning rules. These payoff mechanism types include dynamic mechanisms that are so-called δ-antidissipative or memoryless games that are weighted contractive [10]. As for the learning rules, the impartial pairwise comparison [11] and excess payoff target [12] classes will be central to our results. We demonstrate that, under these types of payoff mechanisms, learning rules that belong to the impartial pairwise comparison class or to an extension of the excess payoff target class lead to convergence to a Nash equilibrium. In some cases, this convergence is contingent upon the revision rates satisfying certain bounds.

1.1.2 Learning Under Erlang Distributed Inter-Revision Intervals

In this generalization, we again focus on the agents' inter-revision intervals. However, instead of allowing them to depend on the agents' strategies, we now consider them to follow independent and identical Erlang distributions. Existing results in the population games literature [7] assume exponentially distributed inter-revision intervals. Nevertheless, this assumption is inadequate for scenarios where each strategy involves a sequence of sub-tasks (sub-strategies) that must be completed before a new revision time occurs. Our work on this extension proposes a methodology for such cases, assuming that the duration of a sub-strategy is exponentially distributed, resulting in Erlang distributed inter-revision intervals. We provide guarantees for learning Nash equilibria under the assumption that the agents' learning rule belongs to the pairwise comparison class.

1.1.3 Learning Under Constrained Strategy-Switching

In our third and final consideration of generalized learning of Nash equilibria, we study restricted strategy switching over graphs. While most existing work [7, 8] assumes unrestricted switching between any two strategies, we focus here on scenarios where only certain strategies can be reached from others. This restriction is captured by a strategy graph, which, while connected, may be incomplete. Using Lyapunov's method, we demonstrate that adjustments based on the Kullback-Leibler divergence, applied to either the payoffs or the learning rules, lead to the strategy profile's near-global convergence to the Nash equilibrium.
This result holds for strictly concave potential games with interior Nash equilibria. We underscore the practical implications of our findings and validate them with a numerical example.

1.2 The Inverse Problem: Learning the Game from Agents' Actions

The first set of problems we address pertains to understanding the agents' learning behavior. In contrast, the second subject we investigate takes a completely different approach: here, we explore whether we can estimate the underlying payoff mechanism from observations of the agents' strategy choices, which falls under the realm of multi-agent inverse reinforcement learning [13, 14]. In this problem, we again assume that the learning rule is known. However, unlike existing studies, we assume that the payoff mechanism is unknown. We develop a method to learn the unknown payoff mechanism from sample path observations of the agents' strategy choices. Our approach consists of two main parts: (i) we provide probabilistic guarantees for estimating the agents' probabilities of transitioning among strategies; (ii) leveraging the known learning rule and the results from part (i), we derive an estimate of the payoff mechanism. Our results in part (i) are general, while in part (ii) we focus on linear payoff mechanisms and three well-known learning rules (the so-called Smith, replicator, and Brown-von Neumann-Nash rules). Additionally, under certain assumptions, we demonstrate that this estimate can be used to predict the Nash equilibria of the unknown payoff mechanism and forecast the population state induced by other learning rules.

1.3 Applications to Charging Station Placement

Finally, we focus on an application closely aligned with our framework, aiming to develop a simulation-based methodology to identify optimal locations for electric vehicle charging stations, with the goal of minimizing total traffic congestion. To address this challenge, we enhance an open-source traffic simulator [15] developed by the UC Berkeley PATH Program. Originally designed for analyzing traffic in wildfire evacuation scenarios, this tool has demonstrated its effectiveness in simulating traffic on a mesoscopic scale [16–18]. To tailor it to our purposes, we incorporate EV models and charging stations into this simulator. Subsequently, we employ the simulator to perform a basic case study on the placement of EV charging stations in College Park. Our findings illustrate how fluctuations in demand and other factors can impact the optimal locations for these stations. Our results on this subject are relatively new, and we discuss intriguing avenues for future research to build upon these findings.

Chapter 2: Framework Description

This thesis primarily focuses on a framework of large-scale multi-agent strategic interactions, commonly known as population games [7]. In this framework, a large number of agents, referred to as the society, are partitioned into populations, where agents within each population are nondescript. At any time, each agent follows a single strategy from a finite set of alternatives. The strategies have payoffs that couple the agents' choices and quantify their benefits. The agents repeatedly revise their strategies following a learning rule that reflects the boundedly rational, myopic behavior of their populations. The resulting process representing the society's strategy profile is referred to as the social state.
Provided that the number of agents in each population is large, the social state can be approximated by the trajectories of a nonlinear dynamical system known as the mean closed loop. The remainder of this chapter provides the mathematical details of this framework; we refer to [7, Chapter 10], [8], and the references therein for a more in-depth exposition.

2.1 The Social State and the Population State

Consider a collection of N ∈ ℕ agents, partitioned into a finite number of populations l ∈ {1, ..., L}, with N^l agents in population l. At any time t ≥ 0, each agent of any population l follows a single strategy from a finite set V^l := {1, ..., n^l}. The agents that belong to the same population are nondescript, meaning that the only way to tell them apart is through their strategies. Hence, the strategy profile of population l at time t can be described by the so-called population state X^l(t), whose entries are the proportions of agents in population l selecting the available strategies. Notice that X^l(t) takes values in the following discrete simplex:

$$\mathbb{X}^l := \left\{ \xi^l \in \mathbb{R}^{n^l} \;\middle|\; N^l \xi^l_i \in \{0, \ldots, N^l\} \ \text{and} \ \sum_{j=1}^{n^l} \xi^l_j = 1 \right\}.$$

The so-called social state is the concatenation of the states of all populations, X := (X^1, ..., X^L). Thus, X takes values in 𝕏 := 𝕏^1 × · · · × 𝕏^L.

2.2 The Payoffs, the Payoff Mechanism, and Nash Equilibria

Each strategy has a payoff that couples the agents' strategy choices, describing the reward that an agent would receive upon choosing it. We denote the payoff at time t of strategy i for population l by P^l_i(t), and denote the payoff vectors of population l and the society by P^l := (P^l_1, ..., P^l_{n^l}) and P := (P^1, ..., P^L), respectively. As X contains all the relevant information about the agents' strategy choices, the payoffs can be expressed in terms of X. More specifically, a payoff mechanism assigns P as a causal function of X. The simplest mechanism is memoryless, in which case P(t) is determined by a map (referred to as a game) applied to X(t).

[Figure 2.1: Congestion game example with one origin/destination pair: (a) depicts the topology of the links (origin O, destination D, links 1–5) and (b) illustrates the 3 strategies representing all possible routes from the origin to the destination.]

Definition 1 (Game) A memoryless payoff mechanism is a globally Lipschitz continuous and continuously differentiable function F : 𝕏 → ℝ^n, acting as F : X(t) ↦ P(t), where n := n^1 + · · · + n^L. Such an F is referred to as a game (or population game).

Example 1 Linear versions of congestion games, originally proposed by [19] (also see [7, Section 2.2.2]), are an example of a well-studied class of games. Consider the graph in Figure 2.1a. In this graph, O denotes the origin, D denotes the destination, links represent roads, and arrows on links represent the direction in which an agent choosing the link travels. There is a single origin-destination pair, and the agents can choose to travel from the origin to the destination via one of the 3 routes in Figure 2.1b. Accordingly, these routes constitute the strategies available to the agents. An agent using a link e ∈ {1, ..., 5} experiences a delay due to the link's utilization. We assume that this dependence is linear, given by $c_e \sum_{i \in \Phi_e} \xi_i$. Here, Φ_e is the set of routes that contain link e, ξ_i is the percentage of agents traversing route i, and c_e is a positive constant quantifying how well link e accommodates traffic.
The payoff of using route (or strategy) i is the negative of the sum of delays on the links that form route i. Hence, there is a single population, and the payoffs of using the routes under the population state value ξ are given by

$$\begin{bmatrix} \mathcal{F}_1(\xi) \\ \mathcal{F}_2(\xi) \\ \mathcal{F}_3(\xi) \end{bmatrix} = - \begin{bmatrix} c_1 + c_2 & 0 & c_1 \\ 0 & c_4 + c_5 & c_5 \\ c_1 & c_5 & c_1 + c_3 + c_5 \end{bmatrix} \begin{bmatrix} \xi_1 \\ \xi_2 \\ \xi_3 \end{bmatrix},$$

where Figure 2.1b displays which route corresponds to which strategy.

More generally, the payoff mechanism is a dynamical system, called the payoff dynamics model (PDM), with input X and output P. As discussed in [8, 20], PDMs incorporate behaviors such as delays, pricing inertia, and agent-level learning, and are also used to isolate long-term trends [21].

Definition 2 (Payoff Dynamics Model) At any t ≥ 0, a payoff dynamics model (PDM) assigns P(t) as

$$\dot{Q}(t) = \mathcal{G}\big(Q(t), X(t)\big), \qquad P(t) = \mathcal{H}\big(Q(t), X(t)\big), \tag{2.1}$$

where G : ℝ^m × 𝕏 → ℝ^m is globally Lipschitz continuous, H : ℝ^m × 𝕏 → ℝ^n is continuously differentiable and globally Lipschitz continuous, and m is a positive integer. In addition, Q(0) takes values in some compact set Q₀ ⊆ ℝ^m, and there is a game F_{G,H} that equals H in the stationary regime, in the sense that the following implication holds for all γ ∈ ℝ^m and ξ ∈ 𝕏:

$$\mathcal{G}(\gamma, \xi) = 0 \;\Rightarrow\; \mathcal{H}(\gamma, \xi) = \mathcal{F}_{\mathcal{G},\mathcal{H}}(\xi). \tag{2.2}$$

Remark 1 Games are specific PDM instances, because any game F can be obtained as a PDM by choosing G(γ, ξ) = 0 and H(γ, ξ) = F(ξ) for all ξ ∈ 𝕏 and γ ∈ ℝ^m (see [8] for further information on how PDMs generalize games).

The analysis in this thesis presumes, as is the case in [8, 20], that Q remains in a bounded set Q. Notice that this is guaranteed whenever the PDM is input-to-state stable [22], because X and Q(0) take values in the bounded sets 𝕏 and Q₀, respectively. Furthermore, in combination with Q and X being bounded, the fact that H is Lipschitz continuous ensures that P remains in a bounded set ℙ := ℙ^1 × · · · × ℙ^L.

An important concept in game theory is the Nash equilibria, i.e., the set of strategy profiles at which no agent has an incentive to unilaterally deviate from its strategy [23]. In our context, the Nash equilibria are defined for a (population) game F as follows.

Definition 3 (Nash equilibria) For any F : ∆ → ℝ^n, the Nash equilibria of F are

$$\mathrm{NE}(\mathcal{F}) := \left\{ \xi \in \Delta \;\middle|\; \xi^l_i > 0 \Rightarrow i \in \operatorname*{argmax}_{j \in \mathcal{V}^l} \{\mathcal{F}^l_j(\xi)\} \right\},$$

where $\Delta^l := \{\xi^l \in \mathbb{R}^{n^l}_{\geq 0} \mid \sum_{i=1}^{n^l} \xi^l_i = 1\}$ is the standard (n^l − 1)-simplex and ∆ = ∆^1 × · · · × ∆^L. In other words, at a Nash equilibrium, a positive proportion of a population plays a strategy only if that strategy offers the highest payoff. Alternatively, defining the correspondences

$$b^l_{\mathcal{F}}(\xi) := \operatorname*{argmax}_{j \in \mathcal{V}^l} \{\mathcal{F}^l_j(\xi)\} \quad \text{and} \quad B^l_{\mathcal{F}}(\xi) := \{\eta^l \in \Delta^l \mid \eta^l_i > 0 \Rightarrow i \in b^l_{\mathcal{F}}(\xi)\},$$

we can write NE(F) as the set of fixed points of B_F := (B^1_F, ..., B^L_F). Every population game has at least one Nash equilibrium [7, Theorem 2.1.1].
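To make Example 1 and Definition 3 concrete, the following Python sketch evaluates the congestion game's payoff map F(ξ) = −Aξ and tests the Nash equilibrium condition of Definition 3 for a candidate population state. The link constants c₁, ..., c₅ and the numerical tolerance are hypothetical placeholders chosen only for illustration; this is not code from the thesis.

```python
import numpy as np

# Hypothetical link constants c_1, ..., c_5 (for illustration only).
c = np.array([1.0, 2.0, 1.5, 2.0, 1.0])

# Payoff matrix of Example 1: F(xi) = -A @ xi, rows indexed by routes 1-3.
A = np.array([
    [c[0] + c[1], 0.0,         c[0]],
    [0.0,         c[3] + c[4], c[4]],
    [c[0],        c[4],        c[0] + c[2] + c[4]],
])

def F(xi):
    """Congestion game payoffs (negative route delays) at population state xi."""
    return -A @ xi

def is_nash(xi, tol=1e-6):
    """Definition 3: every strategy used by a positive fraction of agents
    must attain the maximal payoff (checked up to a numerical tolerance)."""
    p = F(xi)
    return bool(np.all(p[xi > tol] >= p.max() - tol))

xi = np.array([0.5, 0.3, 0.2])   # a candidate population state on the simplex
print(F(xi), is_nash(xi))
```

Any state accepted by this check satisfies the condition of Definition 3 only up to the chosen tolerance.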
2.3 The Revision Paradigm

The agents repeatedly revise their strategies in response to X and P. This revision process has two main components: (i) the times at which the agents revise their strategies, and (ii) a function – called the learning rule – that specifies how a revising agent decides on its subsequent strategy.

2.3.1 Revision Times

The revision times of each agent follow a Poisson process. Specifically, the agents possess identical and independent Poisson processes with rate λ > 0, and an agent's revision times are defined as the jump times of its process. This construction is equivalent to assuming that the time between any two consecutive revisions of an agent (the inter-revision interval) is an exponential random variable with rate λ. These random variables are independent across all agents in all populations. In Chapters 3, 4 and 5, we will explore extensions of this assumption by permitting the distributions of the inter-revision intervals to depend on the agents' strategies or to follow Erlang distributions.

2.3.2 The Learning Rule

Having specified the revision times, we now describe how a revising agent behaves. Given a revision opportunity, an agent decides on its subsequent strategy according to a probability distribution characterized by a learning rule (or revision protocol). The rule of each population l is a Lipschitz continuous function ρ^l : ∆^l × ℝ^{n^l} → ℝ^{n^l × n^l}_{≥0} that satisfies $\sum_{j=1}^{n^l} \rho^l_{ij}(\xi^l, \pi^l) = 1$ for all i ∈ V^l, ξ^l ∈ ∆^l and π^l ∈ ℙ^l. Broadly speaking, given that an agent in population l following strategy i is revising its strategy, ρ^l_{ij}(ξ^l, π^l) is the probability that the agent switches to strategy j, where ξ^l and π^l are respectively the values of X^l and P^l immediately prior to the revision.

More precisely, consider that an agent revises its strategy at time t̄ > 0. It follows from the construction of the revision times that, with probability 1, there exists a t* > 0 between t̄ and the previous revision time of the society. Let us denote the strategy at t* of the revising agent by i, which we refer to as its origin strategy. Then, for any j ∈ V^l the probability of the revising agent's subsequent strategy being j is ρ^l_{ij}(X^l(t*), P^l(t*)). The realization of this lottery becomes the revising agent's strategy at time t̄. Figure 2.2 illustrates the revision timeline near t*.

[Figure 2.2: The revision timeline near t*. The society's previous revision occurs at t, so X is constant over [t, t̄); successive society-wide revision times are separated by exp(λN) intervals, and the revision at t̄ results in a switch from i to j with probability (1/L) X^l_i(t*) ρ^l_{ij}(X^l(t*), P^l(t*)).]

Three well-studied examples of learning rules are the Smith, replicator, and Brown-von Neumann-Nash (BNN) rules. We present these rules in Table 2.1, in which [·]₊ = max(·, 0). Moreover, in this table, i, j ∈ V^l are such that i ≠ j, and C is a constant satisfying $C \geq \max_{\xi^l \in \Delta^l,\, \pi^l \in \mathbb{P}^l,\, i \in \mathcal{V}^l} \sum_{j \in \mathcal{V}^l \setminus \{i\}} \rho^l_{ij}(\xi^l, \pi^l)$. The Smith rule was proposed by [24] to study traffic assignment problems. An agent following the Smith rule switches to a strategy with positive probability if and only if that strategy offers a higher payoff than the agent's origin strategy. The replicator rule stems from the seminal paper of [25] and is closely related to earlier models from mathematical biology, such as those of genetics, prebiotic evolution, animal behavior, and population ecology [26]. An agent following the replicator rule not only considers the payoff differences between its origin strategy and the target strategies, but also the percentage of agents playing the target strategies, which constitutes an imitation component. The BNN rule was introduced by [27] to analyze symmetric zero-sum games and was later used by [28] to show the existence of Nash equilibria in normal-form games. In contrast to the Smith and replicator protocols, an agent following the BNN rule compares the payoffs of the target strategies with the population average of the payoffs.

Name                          | Definition
Smith                         | ρ^Smith_{ij}(ξ, π) := [π_j − π_i]₊ / C
Replicator                    | ρ^Rep_{ij}(ξ, π) := ξ_j [π_j − π_i]₊ / C
Brown-von Neumann-Nash (BNN)  | ρ^BNN_{ij}(ξ, π) := [π_j − ξᵀπ]₊ / C

Table 2.1: The Smith, replicator and BNN revision protocols.
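The protocols in Table 2.1 translate directly into code. The Python sketch below (a minimal illustration, not part of the thesis) expresses each rule as a switch-probability function of (ξ, π); the normalizing constant C and the sample values in the demo are placeholders, with C assumed large enough to satisfy the bound stated above.

```python
import numpy as np

C = 10.0  # placeholder normalizer; any C satisfying the bound above works

def smith(i, j, xi, pi):
    """Smith rule: switch i -> j with probability [pi_j - pi_i]_+ / C."""
    return max(pi[j] - pi[i], 0.0) / C

def replicator(i, j, xi, pi):
    """Replicator rule: payoff comparison weighted by the share xi_j (imitation)."""
    return xi[j] * max(pi[j] - pi[i], 0.0) / C

def bnn(i, j, xi, pi):
    """BNN rule: compares the target payoff with the population-average payoff."""
    return max(pi[j] - xi @ pi, 0.0) / C

xi = np.array([0.5, 0.3, 0.2])          # hypothetical population state
pi = np.array([-1.4, -1.2, -1.6])       # hypothetical payoff vector
print(smith(0, 1, xi, pi), replicator(0, 1, xi, pi), bnn(0, 1, xi, pi))
```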
2.3.3 The Social State As a Markov Process

The agents repeatedly update their strategies, as illustrated in Figure 2.3, causing X to vary over time. When the payoff mechanism is memoryless, X is a right-continuous pure-jump Markov process with state space 𝕏, jump rate λN, and transition probabilities from ξ to ζ given below [7]:

$$\mathbb{P}\big(X(\bar{t}) = \zeta \mid X(t^*) = \xi\big) = \begin{cases} \dfrac{1}{L}\, \xi^l_i\, \rho^l_{ij}\big(\xi^l, \mathcal{F}^l(\xi)\big) & \text{if } \zeta - \xi = \dfrac{e_{l,j} - e_{l,i}}{N^l},\ l \in \{1, \ldots, L\},\ i, j \in \mathcal{V}^l, \\[4pt] 0 & \text{otherwise}, \end{cases}$$

where e_{l,i} is the (i + ∑_{k=1}^{l−1} n^k)-th basis vector in ℝ^n, and t̄ and t* are as in Section 2.3.2.

[Figure 2.3: The agents' strategic revisions and their influence on X and P. An agent uses the revision protocol ρ^l to revise its strategy as a function of X^l and P^l; the social state X might vary with the outcome of the revision, and the payoff mechanism then assigns the updated P.]

Similarly, the analysis in [8] shows that if the payoff mechanism is a PDM, then the pair (X, Q̌) is a right-continuous pure-jump Markov process, where Q̌ is the piecewise constant approximation of Q that only gets updated at the revision times. We refer to [8] for the details of this case.
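Section 2.3.3 suggests a direct way to simulate the social state: society-wide revisions occur at rate λN, the revising agent's strategy is i with probability X_i, and the agent then switches according to its learning rule. The single-population Python sketch below does this for the Smith rule and the congestion game of Example 1; the constants (link coefficients, λ, N, C, horizon) are hypothetical and the code is illustrative only, not the simulation code used in later chapters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Congestion game of Example 1 with hypothetical link constants.
c = np.array([1.0, 2.0, 1.5, 2.0, 1.0])
A = np.array([[c[0] + c[1], 0.0, c[0]],
              [0.0, c[3] + c[4], c[4]],
              [c[0], c[4], c[0] + c[2] + c[4]]])
F = lambda xi: -A @ xi                       # memoryless payoff mechanism
C, lam, N, T = 10.0, 1.0, 500, 30.0          # normalizer, rate, agents, horizon

counts = np.array([N, 0, 0])                 # all agents start on strategy 1
t = 0.0
while t < T:
    t += rng.exponential(1.0 / (lam * N))    # society-wide revisions occur at rate lam*N
    xi = counts / N
    p = F(xi)
    i = rng.choice(3, p=xi)                  # the revising agent follows strategy i w.p. X_i
    probs = np.maximum(p - p[i], 0.0) / C    # Smith switch probabilities [p_j - p_i]_+/C
    probs[i] = 1.0 - probs.sum()             # remaining mass: keep the current strategy
    j = rng.choice(3, p=probs)
    counts[i] -= 1
    counts[j] += 1

print(counts / N)                            # empirical strategy distribution at time T
```

For large N, repeated runs of this simulation concentrate around a deterministic trajectory, which motivates the mean-field approximation discussed next.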
A question of practical relevance is whether the agents can learn to play a Nash equilibrium of F or FG,H (recall that FG,H is the “stationary game” of the PDM), meaning whether X converges to such equilibrium. Through their result in Remark 4, [29] establishes that we can address this question by analyzing the mean closed loop. Specifically, ifO is globally asymptotically stable and Ox = NE(FG,H), then the stationary distributions of X are close to NE(FG,H), provided that there is a large number of agents in every population. This insight greatly simplifies the task of deriving convergence properties for X, as achieving global asymptotic stability guarantees for the mean closed loop is typically easier. 2.5 Methods for Analyzing the Mean Closed Loop Having established the importance of the mean closed loop, the focus shifts to analyzing it. Observe that the mean closed loop can be interpreted as the positive feedback interconnection of two systems: The PDM and (2.3c), where (2.3c) (with state and output x and input p) is commonly referred to as the evolutionary dynamics model (EDM). This feedback interpretation, which we illustrate in Figure 2.4, yields a significant benefit: It enables us to utilize passivity-based techniques and analyze the PDM and the EDM independently. This approach yields (seperate) conditions on the EDM and the PDM, which, when combined, 16 Payoff Dynamics Model (PDM) q̇(t) = G(q(t), x(t)) p(t) = H(q(t), x(t)) Evolutionary Dynamics Model (EDM) ẋ(t) = V(x(t), p(t)) p x Figure 2.4: Diagram representing the mean closed loop as the positive feedback interconnection of the PDM and the EDM. have been proven to ensure the global asymptotic stability of Nash equilibria. In the following sections, we present some of these conditions, which will be instrumental in our forthcoming analysis. 2.5.1 Conditions on the EDM An important condition on the EDM is called Nash stationarity [12]. Definition 4 (Nash Stationarity) The EDM is said to be Nash stationary (NS) if for every l ∈ {1, . . . , L}, ξl ∈ ∆l and πl ∈ Pl the following holds: V l(ξl, πl) = 0 ⇐⇒ ξl ∈ argmax η∈∆l ηTπl. (NS) Observe that (NS) is a condition on the equilibria of the EDM and establishes an important connection between NE(FG,H) and the equilibria of the mean closed loop: When the PDM is stationary, the mean closed loop is in equilibrium if and only if the x-component of this equilibrium belongs to NE(FG,H). In the case where the payoff mechanism is memoryless, this means that the mean closed loop is stationary if and only if x is in NE(F). 17 Another condition, known as positive correlation, pertains to the non-equilibrium behavior of the EDM [12]. Definition 5 (Positive Correlation) The EDM is said to satisfy positive correlation (PC) if for every l ∈ {1, . . . , L}, ξl ∈ ∆l and πl ∈ Pl the following holds: V l(ξl, πl) ̸= 0⇒ V l(ξl, πl)Tπl > 0. (PC) Essentially, when the payoff mechanism is memoryless, (PC) implies that ẋ andF(x) make an acute angle as long as x is not stationary. In the context of population games, the concepts of (NS) and (PC) were solidified by [12, 32]. In their work, [21] highlighted the resemblance between (PC) and the standard system- theoretic passivity concept [33]. Expanding on this observation, [21] introduced the concept of δ-passivity. Subsequent extensions were made by [8, 10], with the latest being in [10], which is the form of δ-passivity that we will use in our discussion. Definition 6 (δ-Passivity) Given l ∈ {1, . . . 
Definition 6 (δ-Passivity) Given l ∈ {1, ..., L}, the EDM for population l is δ-passive if there are functions 𝒮^l : ∆^l × ℝ^{n^l} → ℝ_{≥0} and S^l : ∆^l × ℝ^{n^l} → ℝ_{≥0} such that the following holds for all ξ^l ∈ ∆^l and π^l, ν^l ∈ ℝ^{n^l}:

$$\frac{\partial \mathcal{S}^l(\xi^l, \pi^l)}{\partial \xi^l}\, \mathcal{V}^l(\xi^l, \pi^l) + \frac{\partial \mathcal{S}^l(\xi^l, \pi^l)}{\partial \pi^l}\, \nu^l \leq -\mathsf{S}^l(\xi^l, \pi^l) + \mathcal{V}^l(\xi^l, \pi^l)^T \nu^l, \tag{2.6a}$$
$$\mathcal{S}^l(\xi^l, \pi^l) = 0 \Leftrightarrow \mathcal{V}^l(\xi^l, \pi^l) = 0, \tag{2.6b}$$
$$\mathsf{S}^l(\xi^l, \pi^l) = 0 \Leftrightarrow \mathcal{V}^l(\xi^l, \pi^l) = 0. \tag{2.6c}$$

Following the convention in [8, 20], we will refer to 𝒮^l as a δ-storage function.

Now, we introduce some well-studied classes of learning rules that ensure the (PC) and δ-passivity of the EDM. These classes will be crucial for our upcoming results.

Definition 7 (Pairwise Comparison Learning Rules) A learning rule ρ^l is said to belong to the pairwise comparison (PC) class if for all i ∈ V^l, j ∈ V^l \ {i}, ξ^l ∈ ∆^l and π^l ∈ ℙ^l it can be written as

$$\rho^l_{ij}(\xi^l, \pi^l) = \phi^l_{ij}(\pi^l),$$

where φ^l_{ij} : ℝ^{n^l} → ℝ_{≥0} satisfies sign preservation in the sense that φ^l_{ij}(π^l) > 0 if π^l_j > π^l_i and φ^l_{ij}(π^l) = 0 if π^l_j ≤ π^l_i. If, in addition, ρ^l can be written as ρ^l_{ij}(ξ^l, π^l) = φ^l_j(π^l_j − π^l_i), then it is said to be an impartial pairwise comparison (IPC) rule, where φ^l_j : ℝ → ℝ_{≥0} is again sign preserving for all j ∈ V^l \ {i}.

Essentially, an agent following a PC rule can only switch to strategies that offer greater payoffs. Their desirable incentive properties [7, 11] and inherently fully decentralized operation make the PC class applicable to many engineering problems. The Smith rule, which we presented in Section 2.3.2, belongs to the PC class. Note that we will refer to positive correlation as (PC) and pairwise comparison as PC.

A second important class of learning rules is the excess payoff target class [12].

Definition 8 (Excess Payoff Target Learning Rules) A learning rule ρ^l is said to be an excess payoff target (EPT) rule if it can be written for all i ∈ V^l, j ∈ V^l \ {i}, ξ^l ∈ ∆^l and π^l ∈ ℙ^l as

$$\rho^l_{ij}(\xi^l, \pi^l) = \varphi^l_j(\hat{\pi}^l), \tag{2.7a}$$
$$\hat{\pi}^l := \pi^l - \mathbf{1}\,(\pi^l)^T \xi^l. \tag{2.7b}$$

Here, π̂^l is called the excess payoff of population l and φ^l : ℝ^{n^l} → ℝ^{n^l}_{≥0} is a function that satisfies acuteness, which requires that φ^l(η)^T η > 0 for all η ∈ int(ℝ^{n^l}_*) := int(ℝ^{n^l} \ int(ℝ^{n^l}_{≤0})) [12]. Intuitively, acuteness ensures that an agent following an EPT rule is more likely to switch to strategies with payoffs that are higher than the population-average payoff. An example of an EPT rule is the Brown-von Neumann-Nash (BNN) rule, which is obtained by setting φ^l_j(π̂^l) = max{0, π̂^l_j} (see Section 2.3.2).

Remark 5 Below are some of the existing findings on (NS), (PC), and δ-passivity linked to PC and EPT protocols, which will be pertinent to our upcoming analysis:
• If ρ^l is a PC rule, then the EDM of population l satisfies (NS) and (PC) [11].
• If ρ^l is an IPC rule, then the EDM of population l satisfies δ-passivity [8].
• If ρ^l is an EPT rule, then the EDM of population l satisfies (NS) and (PC) [12].

2.5.2 Conditions on the Payoff Mechanism

In line with the δ-passivity approach, in addition to the conditions for the EDM, we must also establish conditions for the payoff mechanism. The first two of these conditions are valid when the payoff mechanism is memoryless. Additionally, for simplicity, we present these conditions assuming a single population, which is how we will use them.

Definition 9 (Potential Games) Assume that there is a single population, i.e., L = 1. The game F is said to be a potential game if there is a continuously differentiable function f : ℝ^n → ℝ satisfying ∇f = F. We refer to f as the game's potential.
Originally formulated in the context of normal form games, potential games were extended to our (population games) framework in [32]. In such games, a single scalar-valued function can capture all essential information about payoffs that influences agents' incentives. The presence of this function, known as the potential function, underpins the numerous attractive features of potential games [7, Chapter 3.1]. One such feature is related to distributed optimization.

Remark 6 When F is a (single population) potential game, NE(F) coincides with the points that satisfy the Karush-Kuhn-Tucker conditions for the problem of maximizing f over ∆. Hence, if f is concave, then NE(F) = argmax_{ξ∈∆} f(ξ). Moreover, if f is strictly concave, then argmax_{ξ∈∆} f(ξ) is unique. We will denote this unique maximizer as x^NE.

A second type of game is called contractive (also called stable in earlier work [34]).

Definition 10 (Contractive Games) Assume that there is a single population. A game F is said to be contractive if

$$(\zeta - \xi)^T \big(\mathcal{F}(\zeta) - \mathcal{F}(\xi)\big) \leq 0$$

for all ζ, ξ ∈ ∆. If the above inequality holds strictly whenever ξ ≠ ζ, then F is said to be strictly contractive.

Note that when F is differentiable, the above condition is equivalent to η^T DF(ξ) η ≤ 0 being satisfied for all η ∈ T∆ := {ν ∈ ℝ^n | ∑_{i=1}^n ν_i = 0} and ξ ∈ ∆, where DF denotes the Jacobian of F. For strictly contractive and differentiable F, we have η^T DF(ξ) η < 0, in which case we define

$$\gamma := -\max_{\xi \in \Delta,\, \eta \in T\Delta} \eta^T D\mathcal{F}(\xi)\eta, \qquad \bar{\gamma} := -\min_{\xi \in \Delta,\, \eta \in T\Delta} \eta^T D\mathcal{F}(\xi)\eta.$$

Similar to potential games, contractive games also have many desirable properties, such as convexity of NE(F) [34]. The authors in [10] extend the contractivity definition to multi-population games by introducing weighted contractivity. However, we omit presenting this definition since we do not utilize it. We note that the classes of potential and contractive games do not contain one another. For instance, the 123-coordination game [7, Example 3.1.5] is potential, but not contractive, and the "good" rock-paper-scissors (RPS) game [7, Example 3.3.2] is strictly contractive, but not potential.

Finally, we present the passivity condition associated with the PDM. The original form of this condition (akin to δ-passivity of the EDM) is known as δ-antipassivity [8]. However, in this thesis, we adopt an expanded version termed δ-antidissipativity, introduced by [10] to accommodate a wider range of payoff mechanisms (see, for instance, the mixed autonomy congestion game in [10]).

Definition 11 (δ-Antidissipativity) Given positive constants w^1, ..., w^L, a PDM is said to be δ-antidissipative with weights w^1, ..., w^L if there are functions Q : ℝ^m × ∆ → ℝ_{≥0} and R : ℝ^m × ∆ → ℝ_{≥0} such that the following holds for all γ ∈ ℝ^m, ξ ∈ ∆ and ν ∈ T∆:

$$\frac{\partial \mathcal{Q}(\gamma, \xi)}{\partial \gamma}\, \mathcal{G}(\gamma, \xi) + \frac{\partial \mathcal{Q}(\gamma, \xi)}{\partial \xi}\, \nu \leq -\mathcal{R}(\gamma, \xi) - \psi^T \Pi \psi, \tag{2.8a}$$
$$\mathcal{Q}(\gamma, \xi) = 0 \Leftrightarrow \mathcal{G}(\gamma, \xi) = 0, \tag{2.8b}$$
$$\mathcal{R}(\gamma, \xi) = 0 \Leftrightarrow \mathcal{G}(\gamma, \xi) = 0, \tag{2.8c}$$

where T∆ = T∆^1 × · · · × T∆^L, the vector ψ is given by

$$\psi := \begin{bmatrix} \dfrac{\partial \mathcal{H}(\gamma,\xi)}{\partial \gamma}\, \mathcal{G}(\gamma, \xi) + \dfrac{\partial \mathcal{H}(\gamma,\xi)}{\partial \xi}\, \nu \\[6pt] \nu \end{bmatrix},$$

and Π is the block matrix with blocks Π₁₁ = Π₂₂ = 0, Π₁₂ = Π₂₁ = (1/2)W, in which W is the block-diagonal matrix W := diag(w^1 I_{n^1×n^1}, ..., w^L I_{n^L×n^L}).

Remark 7 If a PDM is δ-antidissipative with weights w^l = 1 for all l ∈ {1, ..., L}, then it is δ-antipassive as defined and studied in [8] (see [10, Remark 8]). Numerous examples of δ-antipassive PDMs can be found in [8].
For the case when the PDM is a game F, δ-antidissipativity with weights {w^1, ..., w^L} reduces to the condition that

$$\sum_{l=1}^{L} w^l \big(\mathcal{F}^l(\xi) - \mathcal{F}^l(\eta)\big)^T (\xi^l - \eta^l) \leq 0 \tag{2.9}$$

holds for all ξ, η ∈ ∆. Observe that (2.9) coincides with contractivity (see Definition 10) when the weights are identical, and can be viewed, more generally, as weighted contractivity [10] with respect to the block-diagonal matrix W.

2.5.3 Ensuring the Global Asymptotic Stability of the Mean Closed Loop

By employing Lyapunov's method and passivity-based techniques, various combinations of the conditions outlined in Sections 2.5.1 and 2.5.2 have been demonstrated to ensure the global asymptotic stability of NE(F) or NE(F_{G,H}). Below, we recap some of these guarantees.

Remark 8 (Guarantees on the Global Asymptotic Stability of Nash Equilibria) Assume that there is a single population and the payoff mechanism is memoryless. If F is a potential game and the EDM satisfies (NS) and (PC), then NE(F) is globally asymptotically stable under the mean closed loop [32]. On the other hand, if the payoff mechanism is a δ-antipassive PDM and the EDM of each population is δ-passive, then the equilibria of the mean closed loop are globally asymptotically stable. If, in addition, the EDM of each population is also (NS), then ξ* ∈ NE(F_{G,H}) for every equilibrium point (ξ*, γ*) of the mean closed loop [10]. In light of Remark 7, this result also implies that the δ-passivity and (NS) of the EDM guarantee the global asymptotic stability of NE(F) whenever the payoff mechanism is a memoryless contractive game.

As an example, consider a single population of agents interacting through the congestion game in Example 1. From Remark 5, we know that the Smith rule (which belongs to the PC class) yields an EDM that satisfies (NS) and (PC). Moreover, the congestion game is a potential game with

$$f(\xi) = -\Big( \tfrac{1}{2}(c_1 + c_2)\xi_1^2 + \tfrac{1}{2}(c_4 + c_5)\xi_2^2 + \tfrac{1}{2}(c_1 + c_3 + c_5)\xi_3^2 + c_1 \xi_1 \xi_3 + c_5 \xi_2 \xi_3 \Big)$$

(in fact, every congestion game is a potential game [7, Example 3.1.2]). Therefore, if the agents follow the Smith rule, then (as we outline in Remark 8) it follows from [32] that the distribution of the agents over the routes converges to NE(F). Note from Remark 6 that NE(F) = argmax_{ξ∈∆} f(ξ), which yields the following Nash equilibria:

$$\mathrm{NE}(\mathcal{F}) = \left\{ \frac{1}{d}\Big( c_1 c_5 + c_3 c_4 + c_3 c_5 + c_4 c_5,\;\; c_1 c_2 + c_1 c_3 + c_2 c_3 + c_1 c_5,\;\; c_2 c_4 - c_1 c_5 \Big) \right\},$$

where d := c₁c₂ + c₁c₃ + c₂c₃ + c₁c₅ + c₂c₄ + c₃c₄ + c₃c₅ + c₄c₅.
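As a sanity check on this example, one can integrate the mean dynamic (2.5) under the Smith rule with a forward-Euler scheme and compare the limit with the closed-form Nash equilibrium above. The Python sketch below does so for hypothetical link constants; the step size, horizon, revision rate λ, and normalizer C are arbitrary illustrative choices, not values prescribed by the thesis.

```python
import numpy as np

c1, c2, c3, c4, c5 = 1.0, 2.0, 1.5, 2.0, 1.0        # hypothetical link constants
A = np.array([[c1 + c2, 0.0, c1],
              [0.0, c4 + c5, c5],
              [c1, c5, c1 + c3 + c5]])
F = lambda x: -A @ x
lam, C, dt = 1.0, 10.0, 0.01

x = np.array([1.0, 0.0, 0.0])                        # initial population state
for _ in range(20000):                               # forward-Euler integration of (2.5)
    p = F(x)
    rho = np.maximum(p[None, :] - p[:, None], 0.0) / C   # rho[i, j] = [p_j - p_i]_+/C (Smith)
    inflow = lam * (x @ rho)                         # inflow_i  = sum_j x_j rho[j, i]
    outflow = lam * x * rho.sum(axis=1)              # outflow_i = x_i sum_j rho[i, j]
    x = x + dt * (inflow - outflow)

d = c1*c2 + c1*c3 + c2*c3 + c1*c5 + c2*c4 + c3*c4 + c3*c5 + c4*c5
x_ne = np.array([c1*c5 + c3*c4 + c3*c5 + c4*c5,
                 c1*c2 + c1*c3 + c2*c3 + c1*c5,
                 c2*c4 - c1*c5]) / d
print(x, x_ne)   # the Euler trajectory should end up near the closed-form Nash equilibrium
```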
Crucially, this inhibits the framework’s applicability in many practical settings. To illustrate, consider a decentralized task allocation problem. Imagine a system with a stream of large number of tasks, each categorized into one of several types. Suppose that there are entities that serve these tasks, which we interpret as the agents. At any given time, each agent handles a single job, requiring a service time that is independent of other tasks’ service times and follows an exponential distribution. After completing a task, an agent can opt to work on a 26 task of a different type. Each type of task has a payoff that quantifies the benefit that an agent would receive upon selecting it. A meaningful objective would be to devise learning rules and a payoff structure that guide the agents to allocate themselves optimally across task types, in a decentralized manner. However, applying results from the population games framework to this problem is feasible only if each task type’s service time has the same rate, which is often not the case in reality. To facilitate the framework’s applicability in a wider range of settings, such as the problem above, we allow for the agents’ revision rates to depend on their strategies. We assume that for each l ∈ {1, . . . , L} and i ∈ V l a positive constant λli characterizes the rate at which the agents in population l currently following strategy i revise their strategies, and define the vector λl = (λl1, . . . , λ l nl ) as the revision rates of population l. The examples below give a more concrete account of the prelevance of this extension. Example 2 A motivating example of application of our framework, which we will be invoking throughout this chapter to illustrate our contributions, is that of a “hassle vs. price” game (HPG). In this example, each agent operates a machine that uses a component that must be replaced when it fails. There are several manufacturers that make the component to varying degrees of reliability. Specifically, each component has an exponentially distributed lifetime and its failure rate depends on the manufacturer. The available strategies are the manufacturers, and the payoff of each strategy combines two non-positive terms: (i) a hassle (disruption) cost that increases with the failure rate and (ii) the price of the component, which is higher for more reliable manufacturers. The revision opportunity time occurs when the component fails and the agent must decide based on the available information, such as the current payoffs ascribed to the 27 strategies, whether to keep the current strategy (buy again from the same manufacturer) or follow a different strategy (decide on another manufacturer to buy from). The agents are partitioned into populations, each uniquely associated to a machine type and/or the undertaking for which the machine is used. The game F , with its i-th component for population l specified below, is an example of a memoryless payoff mechanism for the HPG: F li (ξ) := HPG −βlλli︸ ︷︷ ︸ hassle (replacement) cost − Ci ( Di(ξ) )︸ ︷︷ ︸ component price . (3.1) The game’s components have the following meaning: • β1, . . . , βL are positive constants quantifying the costs of replacing a component for the respective population. • {1, . . . , κ} is the set of available manufacturers (this is also the strategy set equally available to all1 populations). • λl1, . . . , λ l κ are the failure rates of the components for the l-th population according to manufacturer, which we assume are ordered as λl1 > . . . 
> λlκ > 0 (manufacturer κ makes the most reliable components). • D : X→ [0, d̄]κ gives the (effective) demand from each manufacturer for all i ∈ {1, . . . , κ} and ξ ∈ ∆ as Di(ξ) := L∑ l=1 αlξli. (3.2) 1This means that all populations have the same strategy set and same number of strategies (n1 = · · · = nL = κ). 28 Here, α1, . . . , αL are positive constants that quantify the relative weight of each population on the demand. These constants may reflect, for instance, the relative sizes of the populations. Finally, Ci : R≥0 → [ci,∞) is a continuously differentiable surjective function (of the demand) that quantifies the cost of a component made by the i-th manufacturer. When the HPG above satisfies the conditions in the remark below, it is weighted contractive. Remark 9 (Properties of C for Example 2 and Example 4) We assume that C in equations (3.1) and (3.10) have the following properties: a) C1, . . . , Cκ are increasing. b) More reliable components are more expensive, i.e., if i > j then Ci(δ) > Cj(δ) for all δ ∈ [0, d̄]. Remark 10 When C satisfies Assumption 9.a, we can show, by following an approach analogous to that of [10, Section IV.A], that theF in Example 2 is weighted contractive with {w1, . . . , wL} = {α1, . . . , αL}. Moreover, by appropriately modifying the steps in the proof of [10, Proposition 3], we can also show that the PDM example that we will present in Section satisfies δ-antidissipativity with {w1, . . . , wL} = {α1, . . . , αL}. In economic theory, Assumption 9.a is referred to as demand-pull inflation [35] that occurs when the supply of a product is limited2, the manufacturer discounts the price when the demand is weak (and gradually eliminates the discount as demand rises), or when the manufacturer raises the price with increasing demand as a way to increase profits when the product becomes popular. 2Factors restricting supply may include scarcity of raw materials, when manufacturer strategically opts to limit production to keep prices up (as dynamic random-access memory manufacturers have been doing in the last 3 years), difficulty in ramping up production fast enough to meet demand and sanctions to name a few. 29 Higher cost (decrease in payoff) for a strategy with higher demand, as measured by the portion of the population following it, is common in many other applications, such as congestion games. Additional examples of contractive and weighted contractive games can be found respectively in [7] and [10], while other instances of δ-antipassive and δ-antidissipative PDMs can be found respectively in [20], [8], [21] and [10]. Below, we present another setting that advocates strategy- dependent revision rates. Example 3 We could model the effect of the contract value on employee turnover in a way that would lead to another example analogous to Example 2. In such an example, a population’s agents would be the businesses wishing to hire and retain an employee for a specific job type. Each population would comprise businesses with comparable characteristics from the employees’ viewpoint, such as location, structure and size. The strategies available to a population’s agents would be the different types of contracts they can offer. In this case, Ci in (3.1) would determine the cost of contract i as a function of the demand. Cheaper contracts offering worse benefits and/or lower salary would lead to a higher turnover rate (quantified by λli) and associated increased cost for retraining and rehiring (quantified by βliλ l i). 
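Before formalizing the revision-time construction, we include a small numerical sanity check of the weighted-contractivity property claimed in Remark 10. The sketch below (written in Python; all parameter values are chosen purely for illustration and are not taken from the examples above) samples pairs of social states and verifies that the weighted sum in (2.9), with weights w_l = α_l, is never positive for an HPG payoff of the form (3.1) with increasing cost curves.

import numpy as np

rng   = np.random.default_rng(0)
alpha = np.array([1.0, 1.5])                    # population weights; also the w_l in (2.9)
beta  = np.array([1.5, 0.8])                    # hassle-cost constants beta^l
lam   = np.array([[2.0, 1.5, 1.0],              # failure rates lambda^l_i (row l, column i)
                  [3.0, 2.5, 1.2]])
c0    = np.array([1.0, 1.1, 1.3])               # baseline component prices

def C(d):
    # Increasing cost curves (Assumption 9.a): baseline price plus a quadratic demand term.
    return c0 + d**2

def F(x):
    # Payoff (3.1); x has shape (L, kappa) and D_i(x) = sum_l alpha_l * x^l_i as in (3.2).
    d = alpha @ x
    return -beta[:, None] * lam - C(d)[None, :]

def random_state():
    # One point of the product of simplices Delta^1 x Delta^2.
    return rng.dirichlet(np.ones(3), size=2)

worst = max(
    float(np.sum(alpha[:, None] * (F(x) - F(y)) * (x - y)))
    for x, y in ((random_state(), random_state()) for _ in range(10000))
)
print(worst <= 1e-9)   # True: the weighted sum in (2.9) stays nonpositive on all sampled pairs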
To allow for strategy-dependent revision rates, we replace the characterization of the revision times given in Section 2.3.1 as follows: Given a population l ∈ {1, …, L}, we endow each agent of population l with an arrival process in which the inter-revision times are distributed exponentially with rate λ^l_i, where i is the strategy of the agent within the inter-revision period, and we define the revision times of the agents as the jump times of their arrival processes.

From this construction, it follows that the probability that some agent of population l currently following the i-th strategy is allowed to revise its strategy within an infinitesimal time interval of duration δ is δ λ^l_i N^l X^l_i(t*), where t* is in the interval and precedes the revision opportunity time (see Section 2.3.2 for the definition of t*). Moreover, note that the event that a revision opportunity occurs for a given agent during this period is conditionally independent, given its own current strategy, of the revisions of all other agents. The social state X resulting from this extension is again a continuous-time pure-jump Markov process with state space X and transition probabilities from ξ to ζ for any ξ, ζ ∈ X given at any revision time t̄ by

P(X(\bar{t}) = \zeta \mid X(t^*) = \xi) =
\begin{cases}
\dfrac{\lambda^l_i \xi^l_i N^l}{\sum_{l'=1}^{L}\sum_{i'=1}^{n_{l'}} \lambda^{l'}_{i'} \xi^{l'}_{i'} N^{l'}}\, \rho^l_{ij}(\xi^l, \pi^l) & \text{if } \zeta - \xi = \dfrac{e_{l,j} - e_{l,i}}{N^l},\ l \in \{1,\dots,L\},\ i, j \in V^l,\ i \neq j, \\[2ex]
1 - \dfrac{1}{\sum_{l'=1}^{L}\sum_{i'=1}^{n_{l'}} \lambda^{l'}_{i'} \xi^{l'}_{i'} N^{l'}} \displaystyle\sum_{l=1}^{L}\sum_{i=1}^{n_l}\sum_{\substack{j=1 \\ j \neq i}}^{n_l} \lambda^l_i \xi^l_i N^l \rho^l_{ij}(\xi^l, \pi^l) & \text{if } \zeta = \xi, \\[2ex]
0 & \text{otherwise.}
\end{cases}

Just like in the conventional framework, the mean-field approximation results discussed in Section 2.4 can be extended to X with strategy-dependent revision rates, leading to a slightly modified EDM given for all t ≥ 0, l ∈ {1, …, L}, i ∈ V^l, ξ^l ∈ ∆^l and π^l ∈ P^l by

V^l_i(\xi^l, \pi^l) = \sum_{\substack{j=1 \\ j \neq i}}^{n_l} \xi^l_j \lambda^l_j \rho^l_{ji}(\xi^l, \pi^l) - \sum_{\substack{j=1 \\ j \neq i}}^{n_l} \xi^l_i \lambda^l_i \rho^l_{ij}(\xi^l, \pi^l).    (3.3)

Observe that, in contrast to the single λ factor in the traditional EDM (see (2.3c)), the EDM arising from the strategy-dependent revision rates features λ^l_i factors that scale ρ^l_{ij}. In the mean-field limit, we can interpret ξ^l_i λ^l_i ρ^l_{ij}(ξ^l, π^l) as the rate of flow from strategy i to strategy j in population l, incorporating the revision rate λ^l_i.

3.2 Problem Formulation

The mean closed loop with strategy-dependent revision rates is equivalent (in terms of equilibrium and stability properties) to the traditional mean closed loop with EDMs induced by learning rules of the form

\tilde{\rho}^l_{ij}(\xi^l, \pi^l) := \frac{\lambda^l_i}{\sum_{k=1}^{n_l} \lambda^l_k}\, \rho^l_{ij}(\xi^l, \pi^l).

This implies that, fundamentally, we can incorporate the effects of strategy-dependent revision rates into ρ. However, achieving this would necessitate ρ^l_{ij} ≠ ρ^l_{kj}, a condition that the IPC and EPT rules cannot accommodate (see Definitions 7 and 8). Consequently, it is crucial to note that the stability results associated with IPC and EPT rules no longer hold when we allow for strategy-dependent revision rates.

Remark 11 As opposed to the IPC and EPT classes, PC rules allow for the incorporation of λ^l_i. That is, for any PC rule ρ^l, the learning rule given by \tilde{\rho}^l_{ij}(ξ^l, π^l) := λ^l_i / (Σ_{k=1}^{n_l} λ^l_k) ρ^l_{ij}(ξ^l, π^l) also belongs to the PC class. Therefore, the stability results associated with PC rules carry over to the strategy-dependent case. In particular, the EDM resulting from strategy-dependent PC rules satisfies (NS) and (PC), which guarantees the global asymptotic stability of NE(F) under the mean closed loop for potential F (see Remarks 5 and 8).
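To make the modified EDM concrete, the following minimal sketch (Python; single population, with an illustrative payoff and illustrative rates that are not taken from the text) integrates (3.3) by forward Euler under the Smith rule ρ_{ij}(ξ, π) = [π_j − π_i]_+. Since the Smith rule is a PC rule and the assumed game is potential and contractive, Remark 11 predicts convergence to the game's unique Nash equilibrium, which for these parameters is (0.2, 0.6, 0.2).

import numpy as np

lam = np.array([5.0, 4.0, 3.0])            # strategy-dependent revision rates lambda_i
A   = np.diag([2.0, 1.0, 1.5])             # F(x) = b - A x: an assumed potential, strictly contractive game
b   = np.array([1.0, 1.2, 0.9])
x   = np.array([0.6, 0.3, 0.1])            # initial mean social state
dt  = 1e-2

for _ in range(20000):
    pi  = b - A @ x                                        # current payoffs
    rho = np.maximum(pi[None, :] - pi[:, None], 0.0)       # Smith rule: rho[i, j] = [pi_j - pi_i]_+
    np.fill_diagonal(rho, 0.0)
    inflow  = (x * lam) @ rho                              # sum_j x_j * lambda_j * rho_ji
    outflow = x * lam * rho.sum(axis=1)                    # x_i * lambda_i * sum_j rho_ij
    x = x + dt * (inflow - outflow)                        # forward-Euler step of (3.3)

print(np.round(x, 3))                                      # approaches (0.2, 0.6, 0.2)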
Having seen that some of the fundamental stability guarantees in the standard framework fail when we allow for strategy-dependent λ (specifically, those for the IPC and EPT rules), a natural next step is to ask whether we can find new conditions that ensure the global asymptotic stability of Nash equilibria. In the remainder of this chapter, we address this question for IPC rules, while we consider EPT rules in the next chapter. As in Section 2.5.1, we will denote an arbitrary IPC rule by ϕ, so the learning rules that we study have the form ρ^l_{ij}(ξ^l, π^l) = ϕ^l_j(π^l_j − π^l_i) (see, again, Definition 7). Similarly to the traditional case, we will use δ-passivity techniques to establish such guarantees for the IPC class. Therefore, following the stability results discussed in Remark 8, we will assume a δ-antidissipative PDM and derive conditions for the strategy-dependent IPC EDM (3.3) to ensure that it is (NS) and δ-passive. Hence, our refined problem is to determine guarantees for the (NS) and δ-passivity of the EDM resulting from IPC rules and strategy-dependent revision rates.

3.3 Stability Analysis

We start by defining the following worst-case ratios, which quantify the relative discrepancies of a population's revision rates.

Definition 12 (Worst-Case Revision Rate Ratios) Given l ∈ {1, …, L} and the revision rates λ^l_1, …, λ^l_{n_l}, we define the worst-case revision rate ratio for the l-th population as

\lambda^l_R := \max \left\{ \frac{\lambda^l_i}{\lambda^l_j} \;\middle|\; i, j \in V^l \right\}.    (3.4)

Notice that λ^l_R ≥ 1 holds by definition and λ^l_R = 1 if and only if the revision rates of the l-th population are identical.

We proceed with our analysis by first establishing Nash stationarity and then investigating δ-passivity. The worst-case revision rate ratios will be critical in the δ-passivity step.

3.3.1 Nash Stationarity

Recall from Remark 11 that the (NS) and (PC) properties of PC rules persist in the strategy-dependent revision rate case. Additionally, for any IPC rule ϕ^l, the function \tilde{\rho}^l_{ij}(ξ^l, π^l) := λ^l_i / (Σ_{k=1}^{n_l} λ^l_k) ϕ^l_j(π^l_j − π^l_i) is a PC rule. This implies that the EDM resulting from any combination of strategy-dependent revision rates and an IPC rule is a PC EDM. Consequently, the (NS) property of the PC EDM extends to the EDM resulting from any IPC rule and strategy-dependent λ. More precisely, we have the following lemma, which we state without proof because it follows immediately from this reasoning.

Lemma 1 Given l ∈ {1, …, L}, if the l-th population follows an IPC rule, then the resulting EDM of population l is (NS) for any positive revision rates λ^l_1, …, λ^l_{n_l}.

3.3.2 Passivity

Now, we investigate δ-passivity. Inspired by the Lyapunov and δ-storage functions introduced respectively in [34] and [21], we choose the following δ-storage function: Given a population l ∈ {1, …, L} with a learning rule ϕ^l of the IPC class, we set our δ-storage function to be S^l : ∆^l × R^{n_l} → R_{≥0} specified by

S^l(\xi^l, \pi^l) := \sum_{i=1}^{n_l} \lambda^l_i \xi^l_i \left( \sum_{k=1}^{n_l} \psi^l_k(\pi^l_k - \pi^l_i) \right),    (3.5a)

where

\psi^l_k(\pi^l_k - \pi^l_i) := \int_0^{\pi^l_k - \pi^l_i} \phi^l_k(s)\, ds.    (3.5b)

Denoting Σ_{k=1}^{n_l} ψ^l_k(π^l_k − π^l_i) by γ^l_i(π^l), we can write S^l more compactly as

S^l(\xi^l, \pi^l) = \sum_{i=1}^{n_l} \lambda^l_i \xi^l_i \gamma^l_i(\pi^l).    (3.5c)

From our analysis of S^l, we derive the following theorem, which presents a condition guaranteeing the δ-passivity of the EDM induced by IPC rules under strategy-dependent revision rates.
Theorem 1 Given l ∈ {1, …, L}, consider that the l-th population follows an IPC rule specified by ϕ^l and has worst-case revision rate ratio λ^l_R (see (3.4)). The EDM for population l is δ-passive if (i) n_l = 2, or (ii) n_l ≥ 3 and the following inequality holds:

\lambda^l_R < \bar{\lambda}_{\phi^l}(n_l),    (3.6)

where \bar{\lambda}_{\phi^l} is determined from ϕ^l as

\bar{\lambda}_{\phi^l}(n_l) := \min_{k \in V^l}\; \inf_{\pi^l \in \mathbb{R}^{n_l}} \left\{ \frac{\gamma^l_k(\pi^l) \sum_{i=1}^{n_l} \phi^l_i(\pi^l_i - \pi^l_k)}{\sum_{i=1}^{n_l} \phi^l_i(\pi^l_i - \pi^l_k)\, \gamma^l_i(\pi^l)} \right\}.    (3.7a)

Although (to avoid cluttered notation) we do not indicate it explicitly in (3.7a), the infimum is computed subject to the following constraint on π^l:

\sum_{i=1}^{n_l} \phi^l_i(\pi^l_i - \pi^l_k)\, \gamma^l_i(\pi^l) \neq 0.    (3.7b)

We prove Theorem 1 in Appendix A by showing that S^l satisfies (2.6).

Remark 12 (When to compute (3.7)) According to Theorem 1, an IPC EDM is always δ-passive for a population with two strategies (n_l = 2), irrespective of the revision rates. Hence, only when n_l ≥ 3 does one need to compute (3.7) to test whether (3.6) holds. Notably, in many cases (3.7) can be computed numerically. To see an application of (3.6), consider the HPG with 5 manufacturers and suppose that the agents of population l follow the Smith rule. In the upcoming Section 3.3.3, we compute \bar{\lambda}_{\phi^l}(5) to be 4.65 for the Smith rule. Consequently, in this setting, (3.6) requires the worst-case ratio of the failure rates of the component when used by population l (calculated via (3.4)) to be less than 4.65.

Although in many cases (3.7) can be computed numerically, this computation can be cumbersome. To facilitate bypassing the computation of (3.7), in Proposition 1 we present a simple lower bound for \bar{\lambda}_{\phi^l}(n_l) that is valid for IPC rules satisfying the following assumption.

Assumption 1 For a non-decreasing map \bar{\phi}^l : R → R_{≥0}, the following holds for all π ∈ R and i ∈ V^l:

\phi^l_i(\pi) = \bar{\phi}^l(\pi).    (3.8)

Proposition 1 Consider that a population l ∈ {1, …, L} (with n_l ≥ 3) follows an IPC rule. If this rule satisfies Assumption 1, then it holds for all n_l ≥ 3 that

\bar{\lambda}_{\phi^l}(n_l) \ge \frac{n_l - 1}{n_l - 2}.    (3.9)

We prove Proposition 1 in Appendix A. This proof also provides an alternative way to compute \bar{\lambda}_{\phi^l}(n_l) for the case in which Assumption 1 holds (see (A.8)).

Remark 13 We can conclude from (3.9) that, for learning rules satisfying the conditions of Proposition 1, \bar{\lambda}_{\phi^l}(n_l) is strictly greater than 1, which, according to Theorem 1, affords some δ-passivity robustness with respect to λ^l_R. This fact is in contrast to previous results establishing δ-passivity for IPC rules.

We emphasize that Assumption 1 is critical in obtaining Proposition 1, and the following counterexample illustrates why.

Counterexample 1 Consider that n_l = 3 and population l adopts an IPC rule specified by ϕ^l_1(·) = [·]_+^2 and ϕ^l_2(·) = ϕ^l_3(·) = [·]_+. This learning rule violates Assumption 1 and, as we proceed to show, it infringes (3.9) with \bar{\lambda}_{\phi^l}(3) = 1. To do so, we use the following inequality, obtained by setting π^l_1 = 0, π^l_2 = −ε and π^l_3 = −ε + ε^{7/4}, with ε > 0, when computing the infimum in (3.7a):

\bar{\lambda}_{\phi^l}(3) \le \lim_{\varepsilon \to 0^+} \frac{(2\varepsilon^3 + 3\varepsilon^{7/2})(\varepsilon^2 + \varepsilon^{7/4})}{2(\varepsilon - \varepsilon^{7/4})^3\, \varepsilon^{7/4}} = 1.

3.3.3 Numerical Evaluation of the Worst-Case Revision Rate Ratio Bound for the Smith Rule

Let us denote \bar{\lambda}_{\phi^l} for the Smith rule by \bar{\lambda}_{Smith}. In Figure 3.1, we plot the values of \bar{\lambda}_{Smith}(n_l) for n_l ∈ {3, …, 10}, which we calculated by computing (3.7) numerically. Figure 3.1 also displays the lower bound in (3.9) for \bar{\lambda}_{Smith}. Note that, since the Smith rule satisfies Assumption 1, the lower bound in (3.9) holds for \bar{\lambda}_{Smith} for any n_l ≥ 3.
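The following sketch (Python) indicates how (3.7) can be evaluated numerically for the Smith rule, for which ϕ^l_i(s) = [s]_+ and hence ψ^l_i(s) = [s]_+^2/2. It is only a crude Monte-Carlo search: random sampling of π^l yields an upper estimate of the infimum in (3.7a), and the values reported in Figure 3.1 were obtained with a more careful optimization. For n_l = 3, the estimate comes out close to the value 9.44 shown in Figure 3.1.

import numpy as np

rng = np.random.default_rng(0)

def smith_bar_lambda(n, samples=100000):
    # Smith rule: phi_i(s) = [s]_+, so psi_i(s) = [s]_+^2 / 2 and gamma_i(p) = sum_k psi(p_k - p_i).
    best = np.inf
    for _ in range(samples):
        p = rng.uniform(-1.0, 1.0, size=n)                            # random payoff vector
        diffs = p[None, :] - p[:, None]                               # diffs[i, k] = p_k - p_i
        gamma = 0.5 * np.sum(np.maximum(diffs, 0.0) ** 2, axis=1)     # gamma_i(p)
        for k in range(n):
            phi = np.maximum(p - p[k], 0.0)                           # phi_i(p_i - p_k)
            den = float(phi @ gamma)
            if den > 1e-12:                                           # constraint (3.7b)
                best = min(best, gamma[k] * phi.sum() / den)
    return best

print(smith_bar_lambda(3))   # close to 9.44, the value reported in Figure 3.1 for n_l = 3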
The plots in Figure 3.1 illustrate that the lower bound in (3.9) may be conservative, a consequence of it being valid for a large subclass of IPC rules. Notably, from the values of \bar{\lambda}_{Smith} plotted in Figure 3.1, we observe that the Smith rule yields a δ-passive EDM even if the revision rates vary by a multiplicative factor of 9.44 when n_l = 3. For the n_l = 9 case, the Smith EDM remains δ-passive even when the revision rates vary by a multiplicative factor of 3.1.

Figure 3.1: Comparing \bar{\lambda}_{Smith}(n_l) with the lower bound (n_l − 1)/(n_l − 2) in (3.9), on a logarithmic scale, for n_l ∈ {3, …, 10}; the computed values of \bar{\lambda}_{Smith} are 9.44, 5.89, 4.65, 4, 3.59, 3.3, 3.1 and 2.93 for n_l = 3, …, 10.

3.3.4 Global Asymptotic Stability Guarantees

As we outlined in Remark 8, [10] proves that δ-passivity and (NS) suffice to draw conclusions about the equilibrium stability of the mean closed loop. Consequently, our findings regarding (NS) and δ-passivity for IPC rules under strategy-dependent revision rates directly lead to the theorem below. We state the theorem without proof, as it follows directly from [10, Theorem 2] in conjunction with Lemma 1 and Theorem 1.

Theorem 2 Consider that the payoff mechanism is a PDM and that each population follows an IPC rule. If these learning rules satisfy the conditions of Theorem 1 and the PDM is δ-antidissipative (see Definition 11), then the set of equilibria of the mean closed loop is globally asymptotically stable. In addition, if (ξ*, γ*) is an equilibrium point of the mean closed loop, then ξ* ∈ NE(F_{G,H}).

Recall from Remark 7 that, for the case when the payoff mechanism is memoryless, the δ-antidissipativity requirement reduces to weighted contractivity. This, in combination with Theorem 2, leads to the corollary below.

Corollary 1 Consider that the payoff mechanism is a game F and that each population follows an IPC rule. If these learning rules satisfy the conditions of Theorem 1 and F is weighted contractive (see Remark 7), then the set of equilibria of the mean closed loop is NE(F) and is globally asymptotically stable.

We note that Corollary 1 generalizes the earliest stability results for IPC rules, established in [34, Theorem 7.1], in two ways. In comparison to [34, Theorem 7.1], which presumes that the game is contractive and the revision rates are identical within each population, Corollary 1 allows for weighted contractive games and it contends with unequal revision rates so long as they satisfy the conditions of the corollary. The stability theorems in [10] allow for weighted contractive games, but that article lacks the results needed to consider the case in which the revision rates within each population are different.

3.4 Numerical Examples

As industrial-grade data-driven processing centers and vehicle-to-vehicle networks become more prevalent, the life cycles of the dynamic random-access memories (DRAMs) used in these applications emerge as important benchmarks. To provide examples of how our results can come into play, we examine the HPG, introduced in Example 2, in the context of the DRAM market.

3.4.1 A DRAM Market HPG

Two classes of systems in which DRAMs are commonly used are industrial and automotive systems, which we refer to as classes 1 and 2, respectively. Hence, there are two populations, where we use population l to represent the collection of agents that utilize DRAMs in class l.
We assume that there are 3 manufacturers producing DRAMs, with failure rates in these utilization classes given by λ^1_1 = 5, λ^2_1 = 10, λ^1_2 = 4, λ^2_2 = 9, λ^1_3 = 3 and λ^2_3 = 5, where λ^l_i is the failure rate of DRAMs produced by manufacturer i when utilized in class l. Moreover, we let the replacement costs for industrial and automotive DRAMs be β^1 = 2 and β^2 = 1, respectively.

3.4.2 A Memoryless Payoff Mechanism

We assume that the payoffs are determined by the game F with the structure presented in Example 2. We specify the component price from manufacturer i ∈ {1, 2, 3}, which is the C_i in (3.1), as the sum of a fixed production cost, C_{0i}, and a term reflecting the demand-pull inflation, C_{pi}. To represent the demand-pull inflation on the cost, we use a quadratic term C_{pi}(D_i(ξ)) = (D_i(ξ))^2 = (α^1 ξ^1_i + α^2 ξ^2_i)^2, where α^l is in proportion to the share of class l in the DRAM market. Finally, we set α^1 = 1 and α^2 = 2, and the fixed DRAM production costs to be C_{01} = 1, C_{02} = 1.2 and C_{03} = 1.5, which completes the construction of F (as in (3.1)) for our DRAM market HPG.³ Notice that this F satisfies Assumption 9; hence, as stated in Remark 10, it is a weighted contractive game.

³We would like to clarify that the functions and parameters selected in this section are for illustration purposes; they are not estimated from data.

Figure 3.2: Trajectories of the distributions of DRAM buyers over the manufacturers under the HPG and Smith rule, plotted on the simplex with vertices (1, 0, 0), (0, 1, 0) and (0, 0, 1) and initial condition (2/3, 1/6, 1/6); (a) industrial-grade users, (b) automotive-grade users.

3.4.3 Dynamics Under the Smith Rule

Now we illustrate an application of Corollary 1. Consider the mean closed loop formed by the F constructed above (in Sections 3.4.1 and 3.4.2) and the EDM in which all populations follow the Smith rule with the revision rates specified in Section 3.4.1. Assume that initially the buyers are distributed over the manufacturers according to x^1(0) = x^2(0) = (2/3, 1/6, 1/6). Since the failure rates satisfy the conditions of Theorem 1 and F is weighted contractive, we can invoke Corollary 1 to conclude that x converges to NE(F), which in this case is the singleton {ξ*} with (ξ^1)* = (0, 1, 0) and (ξ^2)* = (0, 0, 1). For this example, the trajectory and the time-domain plot of x are portrayed in Figure 3.2 and Figure 3.3, respectively. We note that the way in which the trajectories in Figure 3.2 are portrayed is a common way to represent the trajectory of the mean social state when the populations have 3 strategies (see, for instance, [7, Chapter 2.3.4]).

Figure 3.3: Time-domain plots of the distributions of DRAM buyers over the manufacturers under the HPG and Smith rule; (a) industrial-grade users (x^1), (b) automotive-grade users (x^2).

3.4.4 A Dynamic Payoff Mechanism

Now, we consider a payoff mechanism that can be viewed as a dynamic extension of Example 2. Our construction of this PDM parallels that in [10, Section VI.A].

Example 4 Given a positive time constant a and parameters as defined in Example 2, the following defines the "smoothed" HPG PDM:

a\,\dot{q}(t) = -q(t) + \begin{bmatrix} C_1\big(D_1(x(t))\big) \\ \vdots \\ C_\kappa\big(D_\kappa(x(t))\big) \end{bmatrix},    (3.10a)
p^l_i(t) = -\beta^l \lambda^l_i - q_i(t).    (3.10b)

Here q(0) ∈ Q = Q_0 := [0, d̄]^κ.
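As an implementation note, (3.10) amounts to a first-order low-pass filter driven by the current component costs. The sketch below (Python; the parameter values echo Sections 3.4.1 and 3.4.2, while the time constant and step size are illustrative assumptions) performs one forward-Euler step of this filter and evaluates the resulting payoffs (3.10b); coupling it with the strategy-dependent Smith EDM of Section 3.1 reproduces the kind of experiment reported in Section 3.4.5.

import numpy as np

a, dt = 5.0, 1e-3                                   # time constant and step size (illustrative)
alpha = np.array([1.0, 2.0])
beta  = np.array([2.0, 1.0])
lam   = np.array([[5.0, 4.0, 3.0],
                  [10.0, 9.0, 5.0]])
c0    = np.array([1.0, 1.2, 1.5])

def pdm_step(q, x):
    # x: (2, 3) mean social state; q: (3,) PDM state.
    cost = c0 + (alpha @ x) ** 2                    # C_i(D_i(x)) with the quadratic term of Section 3.4.2
    q    = q + (dt / a) * (-q + cost)               # Euler step of a*dq/dt = -q + C(D(x))   (3.10a)
    p    = -beta[:, None] * lam - q[None, :]        # p^l_i = -beta^l * lambda^l_i - q_i     (3.10b)
    return q, p

q, p = pdm_step(np.zeros(3), np.full((2, 3), 1.0 / 3.0))
print(np.round(p, 3))                               # effective payoffs perceived by the two populations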
We can also specify a set P that includes all possible p as follows:

P := \{\pi \in \mathbb{R}^{\kappa L} \mid \pi^l_i = -\beta^l \lambda^l_i - \gamma_i \text{ for some } \gamma \in Q\}.

Note that, in (3.10), p is the effective payoff perceived by the agents and q represents a smoothed version of the costs of the component (produced by the different manufacturers). In [21], the authors argue that dynamically modifying a game via smoothing reduces the impact of short-term fluctuations and isolates the longer-term trends. We can motivate this effect in the context of our HPG in two ways. First, even though the costs of the component may change persistently (e.g., due to the demand-pull inflation outlined in Assumption 9), retailers often do not change the sale prices as frequently; rather, the sale prices follow the long-term trends of the costs. Second, it might take agents some time to learn and react to the sale prices, causing the effects of the changes in costs on the payoffs to be smoothed.

Remark 14 It follows immediately from (3.10), (3.1) and (2.2) that, for the smoothed HPG, the stationary game F_{G,H} is identical to F.

3.4.5 Dynamics Under the Smith Rule and the Smoothed HPG

We carry out an analysis analogous to that in Section 3.4.3, but with the PDM given by the smoothed HPG from Section 3.4.4. We select a = 5 in (3.10) while keeping all the other parameters from Section 3.4.1 and Section 3.4.3 unchanged. Note that the C constructed in Section 3.4.2 satisfies Assumption 9, therefore this PDM is δ-antidissipative (see Remark 10). Since the failure rates satisfy the conditions of Theorem 1 and the PDM specified above is δ-antidissipative, we conclude from Theorem 2 that, as in Section 3.4.3, x converges to NE(F). The time evolution of the PDM's state q and the mean social state x are plotted in Figure 3.4, indicating that x^1 and x^2 indeed converge to (0, 1, 0) and (0, 0, 1), respectively.

Figure 3.4: Time-domain plots of the PDM's state and the distributions of DRAM buyers over the manufacturers under the smoothed HPG and Smith rule; (a) state of the smoothed HPG (q), (b) industrial-grade users (x^1), (c) automotive-grade users (x^2).

3.5 Conclusion

In this chapter, we relaxed a fundamental assumption of the population games framework by allowing the agents' revision rates to depend on their strategies. We showed that the resulting social state is again a right-continuous pure-jump Markov process, so the traditional mean-field approximation results are maintained. By applying these findings, we derived an EDM that generalizes the EDM of the traditional framework. However, we observed that the stability results associated with IPC rules no longer hold in the strategy-dependent case. To address this, we established that if a population either has two strategies or its worst-case revision rate ratio satisfies a bound condition, then the population's EDM satisfies (NS) and δ-passivity. Utilizing the results of [10], we concluded that these conditions ensure the global asymptotic stability of the mean closed loop's equilibria, provided the payoff mechanism is a δ-antidissipative PDM or a weighted contractive game. Furthermore, the x-components of these equilibria correspond to NE(F_{G,H}) (or NE(F) in the memoryless case). We illustrated our results using a motivating example, called the hassle vs.
price game. 45 Chapter 4: Excess Payoff Rules Under Strategy-Dependent Revision Rates In Chapter 3, we extended the traditional population games framework to allow for the agents’ revision rates to depend on their strategies. This chapter also investigates this extension, but for a different class of learning rules. Specifically, in Chapter 3, we found that the stability guarantees associated with IPC and EPT rules no longer hold when the revision rates are strategy-dependent. We then identified new conditions on IPC rules and revision rates that secure the stability of Nash equilibria in the extended framework. In this chapter, we focus on broadening the stability guarantees from the EPT class to cover scenarios with strategy-dependent revision rates. Instead of focusing on system-theoretic passivity properties as in Chapter 3, here we seek conditions guaranteeing stability of Nash equilibria when the payoff mechanism is a potential game [32] (see also Section 2.5.2). As we stated in Remark 11, classical results were general enough to handle Nash equilibria stability under potential games for the PC class, hence potential games were not a focus in Chapter 3. Our contributions are twofold: (i) We unveil the existence of EPT rules and λ for which the EDM violates (NS) or (PC). Using these results, we show that for some EPT rules and λ, the mean population state may not converge to a Nash equilibrium under a broad class of potential games. (ii) We propose a modification of the EPT class, which we call sign preserving rate- 46 modified EPT rules. For any learning rule in the modified class and any λ, we prove that the EDM satisfies (NS) and (PC). As we illustrate in Section 4.5, the rules in the modified class are easy to implement. From this chapter onward, for simplicity and ease of notation, we assume a single population. However, it’s important to note that our results are applicable to multi-population scenarios as well. 4.1 EPT Rules, Recap of Existing Results, and Motivation Potential games (see Definition 9) were introduced in [36] and progressed to be analyzed extensively. They were adapted to the population games and evolutionary dynamics framework in [32, 37] and were further investigated in this context in [38]. Congestion games [7], [39] are celebrated examples of potential games. Also, remember that potential games have a valuable connection to distributed optimization [38]: A potential game’s Nash equilibria coincide with the points that satisfy the Karush-Kuhn-Tucker (KKT) conditions for the problem of maximizing the game’s potential over the standard (n− 1)-simplex (see Remark 6). The second key component of our analysis is the EPT class. Due to their suitability for capturing reluctance and moderation in finding best-performing strategies, EPT rules were introduced in [12] as a well-motivated model of individual choice. Applicability of EPT rules in engineering settings has been demonstrated in [40] for water distribution systems and in [41] for distributed wireless networks. Despite their thorough examination, as detailed in Remark 11, existing results [12] on the stability properties from EPT rules are inconclusive under strategy-dependent revision rates, i.e., 47 when there are strategies i and j for which λi ̸= λj . Recall from Remark 8 that the work in [32] establishes that if the payoff mechanism is a potential game F and the EDM satisfies (NS) and (PC), then NE(F) is the globally asymptotically stable equilibrium of the mean closed loop. 
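As a concrete instance of this baseline guarantee, the sketch below (Python; the game, target distribution and step size are illustrative assumptions, not taken from the text) integrates the EDM induced by the BNN rule with identical revision rates for the potential game F(x) = θ − x, whose strictly concave potential f(x) = θᵀx − ‖x‖²/2 is maximized at θ; the state approaches θ, as the cited result predicts. Chapter 4 asks to what extent this behavior survives once the revision rates become strategy-dependent.

import numpy as np

theta = np.array([0.5, 0.3, 0.2])        # assumed target distribution; unique NE of F(x) = theta - x
x     = np.full(3, 1.0 / 3.0)
dt    = 1e-2

for _ in range(20000):
    pi     = theta - x                   # payoffs of the assumed potential game
    pi_hat = pi - (x @ pi)               # excess payoff pi_hat = pi - 1 * (pi^T x)
    phi    = np.maximum(pi_hat, 0.0)     # BNN choice function [pi_hat_j]_+ (with C = 1)
    # With identical rates, the EDM reduces to V_i = phi_i(pi_hat) - x_i * sum_j phi_j(pi_hat).
    x = x + dt * (phi - x * phi.sum())

print(np.round(x, 3))                    # approaches theta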
Therefore, to obtain guarantees on the performance of the EPT class under strategy-dependent revision rates, we will investigate the (NS) and (PC) properties of the resulting EDM. 4.2 The EDM With Strategy-Dependent Revision Rates and Problem Description In Section 3.1, we generalized the population games framework to allow for the agents’ revision rates to depend on their strategies. We showed that the resulting population state X (we are assuming a single population) is a right-continuous pure-jump Markov process and utilized the mean-field techniques summarized in Section 2.4 to obtain the following EDM: Vi(ξ, π) = n∑ j=1, j ̸=i ξjλjρji(ξ, π)− n∑ j=1, j ̸=i ξiλiρij(ξ, π), which differs from the traditional EDM (2.3c) by incorporating the λi and λj factors instead of a common λ factor. As a motivating example of strategy-dependent revision rates, in addition to the examples in Chapter 3, we introduce a distributed task allocation problem. Example 5 Suppose that there is a large number of agents and 3 types of tasks that the agents perform, which constitute their strategy set. Tasks of type i ∈ {1, 2, 3} have expected service time λ−1 i > 0, which may differ from λ−1 j > 0 for i ̸= j. Each agent undertakes a new task, possibly of a different type, only after completing its current one. Hence, the revision rate associated with 48 tasks of type i is λi > 0. At any time t ≥ 0, a central entity assigns a payoff vector P(t) to the types of tasks and disseminates P(t). When an agent completes a task, it decides on the type of its new task in a distributed manner using a learning rule ρ and announces the types of its completed and new tasks to the central entity. Let us denote the fraction of agents working on tasks of type i at time t by Xi(t). Given θ ∈ ∆, the distributed task allocation problem is to find a ρ and a payoff mechanism that drives X near θ in the long run. We will revisit this problem in the upcoming Section 4.5 to demonstrate our results. Distributed task allocation problems in the population games and evolutionary dynamics framework have been studied in [3]. However, the setting in [3] is different from ours because [3] assumes that the revision rates are identical. Remember from Definition 8 that ρ belongs to the EPT class if it can be written as ρij(ξ, π) = φj(π̂), where π̂ := π − 1πT ξ is the excess payoff and φ : Rn → Rn ≥0 is a function that satisfies acuteness in the sense that ϕ(η)Tη > 0 for all η ∈ int(Rn ∗ ) := int(Rn \ int(Rn ≤0)). Then, the EDM induced by an EPT rule φ and strategy-dependent revision rates λ is specified for all i ∈ V , ξ ∈ ∆ and π ∈ P by Vi(ξ, π) = n∑ j=1, j ̸=i ξjλjφi(π̂)− n∑ j=1, j ̸=i ξiλiφj(π̂). As we outline in Remark 11, unlike the PC class, we cannot embed the effects of strategy- dependent revision rates into ρ, because the functional form of the EPT class can only depend on the target strategies, whereas the revision rates depend on the origin strategies. In simpler terms, ρij(ξ, π) = φi(π̂)λj/(1 Tλ) is not an EPT rule. In the nominal case of λ1 = · · · = λn, EPT rules are known to secure the (NS) and (PC) 49 of the EDM, resulting in the global asymptotic stability of NE(F) under the mean closed loop for any potential game F (see Remark 8). However, the above observation implies that we have a different EDM when λi ̸= λj for some i, j ∈ V , and this stability result is no longer valid. Consequently, two important questions arise: • Does the EDM satisfy (NS) and (PC) for every EPT rule and λ ∈ Rn >0? 
We show soon in Section 4.3 that the answer is negative in general (with strategy-dependent revision rates). Thus, we also pose the ensuing problem. • Can we modify the EPT class so that the EDM satisfies (NS) and (PC) for every ρ in the modified class and λ ∈ Rn >0? Sign preserving rate-modified EPT (RM-EPT) rules will be our answer to this problem. 4.3 Performance of EPT Rules Under Strategy-Dependent Revision Rates and RM-EPT Rules In this section, we show that the EPT class does not ensure (NS) and (PC) when the revision rates are allowed to be strategy-dependent. Using our results on (NS) we also show that, for some EPT rules and potential games, the convergence of x to NE(F) is not robust under perturbations of the revision rates. Subsequently, we modify the EPT class to ensure that the resulting EDM satisfies (NS) and (PC) for all λ ∈ Rn >0. We begin by showing that even one of the most ubiquitous EPT rules, the BNN rule, produces an EDM that violates (PC) for some λ. Recall from Section 2.3.2 that the BNN rule is obtained by setting φj(π̂) = max{0, π̂j}/C. 50 Counterexample 2 Consider ρ to be the BNN rule with C = 1 and let ξ = [1/2 0 1/2]T , π = [0 1/2 3/4]T , λ = [1 1 16]T . Substituting these into (2.4), we have V(ξ, π) ̸= 0 and V(ξ, π)Tπ = −0.0156 < 0. Therefore, the EDM specified by ρ and λ violates (PC). Having established that EPT rules may fail to guarantee (PC), we modify them to eliminate this issue. We construct the modified class, which we name rate-modified EPT (RM-EPT) learning rules, by replacing π̂ in Definition 8 with π̃ defined as π̃ := π − 1π̄, π̄ := ∑n i=1 λiξiπi∑n j=1 λjξj . (4.1) We refer to π̄ as the rate-weighted average payoff. Thus, given an EPT rule as in Definition 8, we obtain its rate-modified version by simply replacing π̂ in (2.7a) with π̃. Notice that an EPT rule coincides with its rate-modified counterpart when the revision rates are identical. Remark 15 Suppose that, at every revision time, the revising agent declares (possibly to a coordinator) the payoff of the strategy it was following immediately prior to the revision time. For instance, in our distributed task allocation problem in Example 5, this would correspond to the agents announcing the payoffs they received from the tasks that they recently completed. Since the number of agents is large, the average of the payoffs announced within a short time interval is approximately the rate-weighted average payoff. Consequently, in such a setting, RM-EPT rules can be implemented effortlessly, without any knowledge of λ. Although this modification suffices to ensure (PC), it does not guarantee (NS). Consider 51 the rate-modified version of the EPT rule in [12, Proposition 2.1], characterized by φj(π̃) = n∑ i=1 (ecπ̃i(k + 1)([π̃j]+) k + ecπ̃jc([π̃i]+) k+1), (4.2) where c > 0, k > 0 and (k + 1)ek+2 + 1 ≥ n. A key property of this rule is that it satisfies the positivity condition (Pos), requiring the following implication to be satisfied for all π ∈ Rn: ξ /∈ argmax η∈∆ πTη ⇒ min i,j∈V ρij(ξ, π) > 0. (Pos) The following counterexample shows that the EDM violates (NS) for the learning rule in (4.2) and some λ. Counterexample 3 Let ρ be the RM-EPT rule characterized by (4.2) and assume thatF satisfies int(∆) ̸⊂ NE(F) (for instance, F given by F(ξ) = ξ − (1/n)1 satisfies this criterion because NE(F) = {(1/n)1} = (1/n, . . . , 1/n)). Let us take ξ ∈ int(∆) such that ξ /∈ NE(F) and set π = F(ξ). 
Note that V(ξ, π) = 0 if and only if ρ(ξ, π)TΞλ = Ξλ, (4.3) where Ξ denotes the n-by-n diagonal matrix with its (i, i)-th entry given for all i ∈ V by ξi. From ξ /∈ NE(F), it follows that ρij(ξ, π) > 0 for all i, j ∈ V . Therefore, by ρ(ξ, π) being a stochastic matrix and the Perron-Frobenius theorem, ρ(ξ, π)T has an eigenvector ν that corresponds to the eigenvalue 1 and all entries of ν are positive. Because Ξ is invertible, Ξ−1ν is the unique λ for which Ξλ = ν. In addition, from Ξ−1 being a diagonal matrix with positive diagonal entries and 52 ν having positive entries, it follows that each entry of Ξ−1ν is positive. Thus, choosing λ = Ξ−1ν does not conflict with λ ∈ Rn >0, and with this choice (4.3) holds. Since ξ /∈ NE(F) implies ξ /∈ argmaxη∈∆ π Tη, we conclude that the EDM generated by ρ and λ violates (NS). Remark 16 Our derivations in Counterexample 3 are valid not only for t