ABSTRACT

Title of Dissertation: LEARNING IN LARGE MULTI-AGENT SYSTEMS
Semih Kara, Doctor of Philosophy, 2024
Dissertation Directed by: Professor Nuno C. Martins, Department of Electrical and Computer Engineering

In this dissertation, we study a framework of large-scale multi-agent strategic interactions. The agents are nondescript and use a learning rule to repeatedly revise their strategies based on their payoffs. Within this setting, our results are structured around three main themes: (i) guaranteed learning of Nash equilibria, (ii) the inverse problem, i.e., estimating the payoff mechanism from the agents' strategy choices, and (iii) applications to the placement of electric vehicle charging stations.

In the traditional setup, the agents' inter-revision times follow independent and identical exponential distributions. We expand on this by allowing these intervals to depend on the agents' strategies or to have Erlang distributions. These extensions enhance the framework's modeling capabilities, enabling it to address problems such as task allocation with varying service times or multiple stages. We also explore a third generalization, concerning the accessibility among strategies. The majority of the existing literature assumes that the agents can transition between any two strategies, whereas we allow only certain alternatives to be accessible from certain others. This adjustment further improves the framework's modeling capabilities, such as by incorporating constraints on strategy switching related to spatial and informational factors. For all of these extensions, we use Lyapunov's method and passivity-based techniques to find conditions on the revision rates, learning rule, and payoff mechanism that ensure the agents learn to play a Nash equilibrium of the payoff mechanism.

For our second class of problems, we adopt a multi-agent inverse reinforcement learning perspective. Here, we assume that the learning rule is known but, unlike in existing work, the payoff mechanism is unknown. We propose a method to estimate the unknown payoff mechanism from sample path observations of the populations' strategy profile. Our approach is two-fold: we estimate the agents' strategy transitioning probabilities, which we then use – along with the known learning rule – to obtain a payoff mechanism estimate. Our findings regarding the estimation of transitioning probabilities are general, while for the second step, we focus on linear payoff mechanisms and three well-known learning rules (Smith, replicator, and Brown-von Neumann-Nash). Additionally, under certain assumptions, we show that we can use the payoff mechanism estimate to predict the Nash equilibria of the unknown mechanism and forecast the strategy profile induced by other rules.

Lastly, we contribute to a traffic simulation tool by integrating electric vehicles, their charging behaviors, and charging stations. This simulation tool is based on spatial-queueing principles and, although less detailed than some microscopic simulators, it runs much faster and accurately represents traffic rules. Using this tool, we identify optimal charging station locations (on real roadway networks) that minimize overall traffic congestion.

LEARNING IN LARGE MULTI-AGENT SYSTEMS

by Semih Kara

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2024

Advisory Committee: Professor Nuno C. Martins, Chair/Advisor Professor P. S.
Krishnaprasad Professor Richard La Professor Kaiqing Zhang Professor Nikhil Chopra © Copyright by Semih Kara 2024 To my beloved parents: Mine and Tarık. ii Acknowledgments I am immensely thankful to my advisor, Prof. Nuno Martins, whose guidance has transformed my chaotic intellectual ambitions into focused research. From him, I have gained a wealth of knowledge, but above all, I cherish the lessons on being a part of academia. I can vividly remember countless instances where he encouraged me to delve deeply into problems, take the time to ponder, and pursue questions that sparked my interest; after all, that was what made academia special. He also went above and beyond to ensure that I was funded, provided with necessary research materials, and had the opportunity to attend conferences. Going back to my first research manuscripts, I can only imagine the pain that I put him through with my horrible writing, which he tirelessly helped me refine through numerous meticulous meetings. I believe that it is difficult to find mentors like him in this day and age, so I feel very fortunate that our paths crossed. I would also like to thank my dissertation committee: Prof. P. S. Krishnaprasad, Prof. Richard La, Prof. Kaiqing Zhang and Prof. Nikhil Chopra. Not only they have been extremely accommodating, but also they posed thought-provoking questions that fostered stimulating discussions. I wish to emphasize my sincere gratitude to Prof. Krishnaprasad for his generosity in offering me a desk at his lab in my first year, engaging in numerous academic discussions with me and guiding me to intriguing problems and references, enriching my academic journey. Furthermore, I am indebted to Prof. Murat Arcak for his guidance and support. I sincerely value his positivity, iii willingness to help, and our collaborative work – including my interactions with his research team at UC Berkeley: Can Kızılkale, Yasin Sönmez, and Alex Kurzhanskiy. The classes that I took at UMD was another defining part of my academic journey, as well as my personal growth. Learning from Prof. Krishnaprasad, Prof. Andre Tits, Prof. Alexander Barg, Prof. Armand Makowski, Prof. Mark Freidlin, Prof. John Baras, Prof. Richard La, and Prof. Radu Balan was a true pleasure. I cherish having the opportunity to benefit from the vast experience and intellectual capabilities of these individuals. Moreover, I am grateful for my undergraduate education at Bilkent University. There, I had the chance to build the fundamentals of my knowledge and developed an interest in dynamical systems and game theory. My interactions with Prof. Serdar Yüksel, Prof. Ömer Morgül, Prof. Hitay Özbay, Prof. Orhan Arıkan, and Prof. Bülent Özgüler were all pivotal during this period. Also, I feel very fortunate for the friendship and mentoring of Prof. Kemal Yıldız. Learning game theory from him was a pure joy, and his door was always open for a chat, whether academic or not. Overall, I consider myself extremely lucky to have known and interacted with so many exceptional scholars. This dissertation would also not be possible without the unwavering love and support of my family and friends. Throughout my life, my sister Simge, grandfather Halit, aunt Ayşen, and uncle Craig have always stood by me, shaping me into the person I am today. While I had to be away from my family during my PhD, I found a strong support system among my friends, who became like an extended family. I treasure the countless laughs with Batuhan and Özde. 
I never had a dull moment with them, and our bond only grew stronger with time. I am grateful for the friendships of Elif, Berk, Ferda, Olga, and Diego, with whom I shared numerous adventures and moments of happiness. Mario, Agata, and Michael also hold a special place in my heart, with our iv memories full of joy, great food, and smiles. Soccer has been a significant outlet for me during my PhD, and I am thankful to have had amazing teammates, who are also great friends: Mustafa, Cameron, Arda, Bilal, Rüştü, Olsan, Utku, Adam, and Tri. Moreover, my longtime friends from Ankara – Yunus, Cem, Tibet, Can, Deniz, Bartu, Oskay, and Ece – have consistently emphasized their love and encouragement for me. Having such wonderful friendships is one of my greatest achievements, and I am left speechless by the extent of support and affection I have received from them over the years. Finally, my deepest gratitude are to my wife, Ece Yegane, and my parents, Mine and Tarık Kara. To my parents: I inherited a love of learning from you, which has enriched my life since childhood. Seeing how you approached research and teaching (especially my father’s approach to mathematics), it was natural for me to aspire to intellectual exploration. You also supported me unconditionally in pursuing my aspirations and cheered for me at every step. I believe you are as loving and supportive as parents can be, and I am eternally grateful to have you in my life. To my beautiful and loving wife Ece: Your unconditional support, understanding, and love have been my rock throughout this journey. Not only your sacrifices and care have enabled me to focus on my research many, many times, but you also spark joy to my life, making it a million times better. Thank you for being my biggest supporter, being my best friend, always finding fun things to do, discovering great restaurants to try, convincing me to foster cats, finding our apartment, cooking the best dishes, helping me discover what I want, playing games with me, being my skiing buddy; the list goes on and on and on... I am the luckiest person alive, because I get to be loved by you. Looking back, I am truly touched to see what an amazing group of people that I had the chance to know. v Table of Contents Dedication ii Acknowledgements iii Table of Contents vi List of Tables x List of Figures xi List of Abbreviations xiii Chapter 1: Introduction 1 1.1 Learning Nash Equilibria Under Generalizations . . . . . . . . . . . . . . . . . . 1 1.1.1 Learning Under Strategy-Dependent Revision Rates . . . . . . . . . . . . 2 1.1.2 Learning Under Erlang Distributed Inter-Revision Intervals . . . . . . . . 3 1.1.3 Learning Under Constrained Strategy-Switching . . . . . . . . . . . . . 3 1.2 The Inverse Problem: Learning the Game from Agents’ Actions . . . . . . . . . 4 1.3 Applications to Charging Station Placement . . . . . . . . . . . . . . . . . . . . 5 Chapter 2: Framework Description 6 2.1 The Social State and the Population State . . . . . . . . . . . . . . . . . . . . . 6 2.2 The Payoffs, the Payoff Mechanism, and Nash Equilibria . . . . . . . . . . . . . 7 2.3 The Revision Paradigm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3.1 Revision Times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3.2 The Learning Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3.3 The Social State As a Markov Process . . . . . . . . . . . . . . . . . . . 13 2.4 Deterministic Mean-Field Approximation . . . . . . . . . . . . . . . . . . . . . 
14 2.5 Methods for Analyzing the Mean Closed Loop . . . . . . . . . . . . . . . . . . . 16 2.5.1 Conditions on the EDM . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.5.2 Conditions on the Payoff Mechanism . . . . . . . . . . . . . . . . . . . 20 2.5.3 Ensuring the Global Asymptotic Stability of the Mean Closed Loop . . . 24 Chapter 3: Pairwise Comparison Rules Under Strategy-Dependent Revision Rates 26 vi 3.1 Strategy-Dependent Revision Rates . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.3 Stability Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.3.1 Nash Stationarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.3.2 Passivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.3.3 Numerical Evaluation of the Worst-Case Revision Rate Ratio Bound for the Smith Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.3.4 Global Asymptotic Stability Guarantees . . . . . . . . . . . . . . . . . . 38 3.4 Numerical Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.4.1 A DRAM Market HPG . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.4.2 A Memoryless Payoff Mechanism . . . . . . . . . . . . . . . . . . . . . 40 3.4.3 Dynamics Under the Smith Rule . . . . . . . . . . . . . . . . . . . . . . 41 3.4.4 A Dynamic Payoff Mechanism . . . . . . . . . . . . . . . . . . . . . . . 42 3.4.5 Dynamics Under the Smith Rule and the Smoothed HPG . . . . . . . . . 43 3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 Chapter 4: Excess Payoff Rules Under Strategy-Dependent Revision Rates 46 4.1 EPT Rules, Recap of Existing Results, and Motivation . . . . . . . . . . . . . . . 47 4.2 The EDM With Strategy-Dependent Revision Rates and Problem Description . . 48 4.3 Performance of EPT Rules Under Strategy-Dependent Revision Rates and RM- EPT Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.4 (NS) and (PC) Properties of Sign Preserving RM-EPT Rules . . . . . . . . . . . 54 4.5 Numerical Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 Chapter 5: Learning Nash Equilibria Under Erlang Clocks 58 5.1 Existing Work and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 5.2 Erlang Distributed Inter-Revision Intervals and Problem Description . . . . . . . 60 5.2.1 Erlang Inter-Revision Intervals . . . . . . . . . . . . . . . . . . . . . . . 61 5.2.2 Erlang Evolutionary Dynamics . . . . . . . . . . . . . . . . . . . . . . . 61 5.2.3 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 5.3 Convergence Under Potential Games . . . . . . . . . . . . . . . . . . . . . . . . 66 5.4 Convergence Under Strictly Contractive Games . . . . . . . . . . . . . . . . . . 66 5.5 Numerical Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 5.5.1 A Congestion Game Example . . . . . . . . . . . . . . . . . . . . . . . 68 5.5.2 A Rock-Paper-Scissors (RPS) Game Example . . . . . . . . . . . . . . . 69 5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 Chapter 6: Learning Nash Equilibria Under Constrained Strategy Switching 72 6.1 Incomplete Strategy Graphs, Related Work and Gaps . . . . . . . . . . . . . . . 
72 6.2 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 6.3 Stability Guarantees Under Constrained Strategy Switching . . . . . . . . . . . . 79 6.3.1 Analysis Under a Fixed Regularization Parameter . . . . . . . . . . . . . 79 6.3.2 Updating the Regularization Parameter . . . . . . . . . . . . . . . . . . . 80 6.3.3 Asymptotic Stability of xNE . . . . . . . . . . . . . . . . . . . . . . . . 82 vii 6.4 Numerical Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 Chapter 7: Learning Population Games 86 7.1 Informal Problem Description, Motivation, and Contributions . . . . . . . . . . . 87 7.1.1 Informal Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . 87 7.1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 7.1.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 7.2 Additional Notation and Definitions . . . . . . . . . . . . . . . . . . . . . . . . 92 7.3 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 7.4 Estimating the Payoff Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 7.4.1 Estimating Strategy Switching Probabilities . . . . . . . . . . . . . . . . 96 7.4.2 Learning the Payoff Matrix From Strategy Switching Estimates . . . . . . 101 7.5 Practical Relevance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 7.5.1 Finite-Horizon Prediction of the Population State . . . . . . . . . . . . . 110 7.5.2 Numerical Results on Predicting the Population State . . . . . . . . . . . 111 7.5.3 Estimating the Nash Equilibria . . . . . . . . . . . . . . . . . . . . . . . 112 7.5.4 Numerical Results on Estimating Nash Equilibria . . . . . . . . . . . . . 114 7.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 Chapter 8: An Application to Optimal Electric Vehicle Charging Station Placement 116 8.1 The Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 8.2 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 8.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 Appendix A: Auxiliary Results for Chapter 3 125 A.1 Proof of Theorem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 A.2 Proof of Proposition 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 Appendix B: Auxiliary Results for Chapter 4 134 B.1 Proof of Theorem 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 B.2 Proof of Theorem 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 Appendix C: Auxiliary Results for Chapter 5 138 C.1 Proof of Theorem 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 C.2 Proof of Corollary 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 C.3 Proof of Theorem 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 C.3.1 Auxiliary Results for the Proof of Theorem 6 . . . . . . . . . . . . . . . 143 Appendix D: Auxiliary Results for Chapter 6 147 D.1 Proof of Lemma 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 D.2 Proof of Lemma 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 D.3 Proof of Theorem 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. 148 Appendix E: Auxiliary Results for Chapter 7 152 E.1 Proof of Lemma 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 viii E.2 Proof of Theorem 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 E.2.1 Proof Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154 E.2.2 Main Body of the Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 E.2.3 Auxiliary Results for the Proof of Theorem 8 . . . . . . . . . . . . . . . 158 E.3 Proof of Theorem 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166 E.3.1 Estimating the Row Differences Under the Smith Protocol . . . . . . . . 167 E.3.2 Estimating the Row Differences Under the Replicator Protocol . . . . . . 173 E.3.3 Estimating the Row Differences Under the BNN Protocol . . . . . . . . . 177 E.4 Proof of Theorem 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184 E.5 Results on the Estimation of Nash Equilibria . . . . . . . . . . . . . . . . . . . . 188 E.5.1 Proof of Theorem 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189 E.5.2 Essential Equilibria of F . . . . . . . . . . . . . . . . . . . . . . . . . . 191 Bibliography 196 ix List of Tables 2.1 The Smith, replicator and BNN revision protocols. . . . . . . . . . . . . . . . . . 13 7.1 Estimates and actual values of strategy switching probabilities at the 55-th segment.101 7.2 Estimates of strategy switching probabilities. . . . . . . . . . . . . . . . . . . . . 109 7.3 Row differences of Θ and Θ̂. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 E.1 Summary of the notation related to the partitioning process. . . . . . . . . . . . . 154 x List of Figures 2.1 Congestion game example with one origin/destination pair: (a) depicts the topology of the links and (b) illustrates the 3 strategies representing all possible routes from the origin to the destination. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2 The revision timeline near t∗. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3 The agents’ strategic revisions and its influence on X and P. . . . . . . . . . . . 14 2.4 Diagram representing the mean closed loop as the positive feedback interconnection of the PDM and the EDM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.1 Comparing λ̄Smith with the lower-bound in (3.9). . . . . . . . . . . . . . . . . . . 38 3.2 Trajectories of distributions of DRAM buyers on the manufacturers under the HPG and Smith rule. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.3 Time domain plots of the distributions of DRAM buyers on the manufacturers under the HPG and Smith rule. . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.4 Time domain plots of the PDM’s state and distributions of DRAM buyers on the manufacturers under the smoothed HPG and Smith rule. . . . . . . . . . . . . . . 44 4.1 Trajectories of x for the numerical example. . . . . . . . . . . . . . . . . . . . . 56 5.1 Possible sub-strategy transitions. . . . . . . . . . . . . . . . . . . . . . . . . . . 62 5.2 Trajectories of x̄ for the congestion game example. . . . . . . . . . . . . . . . . 69 5.3 Trajectories of x̄ for the RPS game example. . . . . . . . . . . . . . . . . . . . . 71 6.1 The agents’ learning rule and F as a feedback system. . . . . . . . . . . . . . . . 78 6.2 Coordinator implementation of KL-regularization. . . . . . . . . . . . . . . . . . 79 6.3 Transmitter configurations and the resulting G. . . . . . . . . . 
. . . . . . . . . 83 6.4 Time domain plots of x and y. . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 7.1 First 3 segments of X from the simulation. . . . . . . . . . . . . . . . . . . . . . 100 7.2 Illustration of S for the congestion game in Example 1: (a) shows S on the simplex, (b) shows the regions in which the strategy offering the smallest payoff is different (the numbers indicate which strategy offers the lowest payoff). As a side note, the point where the lines intersect is the Nash equilibrium of the game. 106 7.3 Sample paths from the simulation: (a) displays X1 [0,T 1], (b) displays X2 [0,T 2], and (c) displays X3 [0,T 3]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 xi 7.4 Simulations of the population states resulting fromF and F̂ : (a) shows a close-up of the trajectories (b) displays the trajectories on the simplex. . . . . . . . . . . . 111 8.1 Spatial-queuing structure of a link. . . . . . . . . . . . . . . . . . . . . . . . . . 117 8.2 Illustration of a charging station. . . . . . . . . . . . . . . . . . . . . . . . . . . 119 8.3 Area of interest in College Park, exported from OpenStreetMap. The figure in (b) highlights the roads that the vehicles can take. . . . . . . . . . . . . . . . . . . . 120 8.4 Roadway network extracted from the OpenStreetMap data displayed in Figure 8.3. 121 8.5 Roadway network with the candidate station locations, and the origin-destination nodes highlighted. Green nodes represent station locations, the blue node is the origin, and the red one is the destination. . . . . . . . . . . . . . . . . . . . . . . 122 8.6 Simulation results for 10 vehicles. . . . . . . . . . . . . . . . . . . . . . . . . . 122 8.7 Simulation results for 100 vehicles. . . . . . . . . . . . . . . . . . . . . . . . . . 123 xii List of Abbreviations EPT Excess payoff target EV Electric vehicle Iid Independent and identically distributed KL Kullback-Leibler (NS) Nash stationarity PATH Partners for Advanced Transportation Technology (PC) Positive correlation PC Pairwise comparison RM Rate-modified Uhc Upper hemicontinuous xiii Chapter 1: Introduction In this dissertation, we examine a framework of noncooperative strategic interactions among large numbers of agents; finding applications in various fields, such as traffic management [1, 2], distributed task allocation [3], distributed extremum seeking [4,5], communication networks [6], and many more. In this framework [7,8], the agents are indistinguishable and select strategies from a common finite set. Each strategy has a payoff, assigned by a payoff mechanism that couples the agents’ strategy choices. In response to these payoffs, the agents repeatedly revise their strategies using a learning rule. With the fundamental objective of ensuring the scalability, efficiency, and robustness of such systems, we study three types of problems: • Learning Nash equilibria under generalizations of the framework. • Learning the payoff mechanism from the agents’ strategy choices. • Applications to the placement of electric vehicle charging stations. 1.1 Learning Nash Equilibria Under Generalizations A well-studied problem in game theory, including the setting above, is whether the agents can learn to play a Nash equilibrium (a strategy profile at which no agent has incentive to 1 unilaterally deviate from its strategy) by following simple heuristics [7, 9]. 
This question not only provides an evolutionary basis and convenient computational algorithms for Nash equilibria, but also validates the use of these heuristics as distributed control algorithms. Nonetheless, in many cases, we do not have control over the agents' strategic behavior (e.g., traffic and epidemics). Therefore, to have performance guarantees for such cases, we would like the guarantees on learning Nash equilibria to be as comprehensive as possible. Motivated by this line of thinking, we analyze three extensions of the existing framework: strategy-dependent revision rates, Erlang distributed inter-revision intervals, and constrained strategy switching over graphs.

1.1.1 Learning Under Strategy-Dependent Revision Rates

First, we expand upon the traditional framework of population games by introducing a new element: the dependency of the agents' revision rates on their strategies. Existing stability results in this context have been limited to pairwise comparison learning rules and required the payoff mechanism to be a memoryless potential game. Consequently, our aim is to broaden these findings to encompass a wider range of payoff mechanisms and learning rules. These payoff mechanism types include dynamic mechanisms that are so-called δ-antidissipative or memoryless games that are weighted contractive [10]. As for the learning rules, the impartial pairwise comparison [11] and excess payoff target [12] classes will be central to our results. We demonstrate that, under these types of payoff mechanisms, learning rules that belong to the impartial pairwise comparison class or to an extension of the excess payoff target class lead to convergence to a Nash equilibrium. In some cases, this convergence is contingent upon the revision rates satisfying certain bounds.

1.1.2 Learning Under Erlang Distributed Inter-Revision Intervals

In this generalization, we again focus on the agents' inter-revision intervals. However, instead of allowing them to depend on the agents' strategies, we now consider them to follow independent and identical Erlang distributions. Existing results in the population games literature [7] assume exponentially distributed inter-revision intervals. Nevertheless, this assumption is inadequate for scenarios where each strategy involves a sequence of sub-tasks (sub-strategies) that must be completed before a new revision time occurs. Our work on this extension proposes a methodology for such cases, assuming that the duration of a sub-strategy is exponentially distributed, resulting in Erlang distributed inter-revision intervals. We provide guarantees for learning Nash equilibria under the assumption that the agents' learning rule belongs to the pairwise comparison class.

1.1.3 Learning Under Constrained Strategy-Switching

In our third and final consideration of generalized learning of Nash equilibria, we study restricted strategy switching over graphs. While most existing work [7, 8] assumes unrestricted switching between any two strategies, we focus here on scenarios where only certain strategies can be reached from others. This restriction is captured by a strategy graph, which, while connected, may be incomplete. Using Lyapunov's method, we demonstrate that adjustments based on the Kullback-Leibler divergence, applied to either the payoffs or the learning rules, lead to the strategy profile's near-global convergence to the Nash equilibrium.
This result holds for strictly concave potential games with interior Nash equilibria. We underscore the practical implications of our findings and validate them with a numerical example.

1.2 The Inverse Problem: Learning the Game from Agents' Actions

The first set of problems we address pertains to understanding the agents' learning behavior. In contrast, the second subject we investigate takes a completely different approach: here, we explore whether we can estimate the underlying payoff mechanism from observations of the agents' strategy choices, which falls under the realm of multi-agent inverse reinforcement learning [13, 14]. In this problem, we again assume that the learning rule is known. However, unlike existing studies, we assume that the payoff mechanism is unknown. We develop a method to learn the unknown payoff mechanism from sample path observations of the agents' strategy choices. Our approach consists of two main parts: (i) we provide probabilistic guarantees for estimating the agents' probabilities of transitioning among strategies; (ii) leveraging the known learning rule and the results from part (i), we derive an estimate of the payoff mechanism. Our results in part (i) are general, while in part (ii) we focus on linear payoff mechanisms and three well-known learning rules (the so-called Smith, replicator, and Brown-von Neumann-Nash rules). Additionally, under certain assumptions, we demonstrate that this estimate can be used to predict the Nash equilibria of the unknown payoff mechanism and forecast the population state induced by other learning rules.

1.3 Applications to Charging Station Placement

Finally, we focus on an application closely aligned with our framework, aiming to develop a simulation-based methodology to identify optimal locations for electric vehicle charging stations, with the goal of minimizing total traffic congestion. To address this challenge, we enhance an open-source traffic simulator [15] developed by the UC Berkeley PATH Program. Originally designed for analyzing traffic in wildfire evacuation scenarios, this tool has demonstrated its effectiveness in simulating traffic on a mesoscopic scale [16–18]. To tailor it to our purposes, we incorporate EV models and charging stations into this simulator. Subsequently, we employ the simulator to perform a basic case study on the placement of EV charging stations in College Park. Our findings illustrate how fluctuations in demand and other factors can impact the optimal locations for these stations. Our results on this subject are relatively new, and we discuss intriguing avenues for future research to build upon these findings.

Chapter 2: Framework Description

This thesis primarily focuses on a framework of large-scale multi-agent strategic interactions, commonly known as population games [7]. In this framework, a large number of agents, referred to as the society, are partitioned into populations, where agents within each population are nondescript. At any time, each agent follows a single strategy from a finite set of alternatives. The strategies have payoffs that couple the agents' choices and quantify their benefits. The agents repeatedly revise their strategies following a learning rule that reflects the boundedly rational, myopic behavior of their populations. The resulting process representing the society's strategy profile is referred to as the social state.
Provided that the number of agents in each population is large, the social state can be approximated by the trajectories of a nonlinear dynamical system known as the mean closed loop. The remainder of this chapter provides the mathematical details of this framework; we refer to [7, Chapter 10], [8], and the references therein for a more in-depth exposition.

2.1 The Social State and the Population State

Consider a collection of N ∈ ℕ agents, partitioned into a finite number of populations l ∈ {1, ..., L}, with N^l agents in population l. At any time t ≥ 0, each agent of any population l follows a single strategy from a finite set V^l := {1, ..., n^l}. The agents that belong to the same population are nondescript, meaning that the only way to tell them apart is through their strategies. Hence, the strategy profile of population l at time t can be described by the so-called population state X^l(t), whose entries are the proportions of agents in population l selecting the available strategies. Notice that X^l(t) takes values in the following discrete simplex:

$$\mathbb{X}^l := \left\{ \xi^l \in \mathbb{R}^{n^l} \;\middle|\; N^l \xi^l_i \in \{0, \ldots, N^l\} \ \text{and} \ \sum_{j=1}^{n^l} \xi^l_j = 1 \right\}.$$

The so-called social state is the concatenation of the states of all populations, X := (X^1, ..., X^L). Thus, X takes values in 𝕏 := 𝕏^1 × · · · × 𝕏^L.

2.2 The Payoffs, the Payoff Mechanism, and Nash Equilibria

Each strategy has a payoff that couples the agents' strategy choices, describing the reward that an agent would receive upon choosing it. We denote the payoff at time t of strategy i for population l by P^l_i(t), and denote the payoff vectors of population l and the society by P^l := (P^l_1, ..., P^l_{n^l}) and P := (P^1, ..., P^L), respectively. As X contains all the relevant information about the agents' strategy choices, the payoffs can be expressed in terms of X. More specifically, a payoff mechanism assigns P as a causal function of X. The simplest mechanism is memoryless, in which case P(t) is determined by a map (referred to as a game) applied to X(t).

[Figure 2.1: Congestion game example with one origin/destination pair: (a) depicts the topology of the links (origin O, destination D, links 1–5) and (b) illustrates the 3 strategies representing all possible routes from the origin to the destination.]

Definition 1 (Game) A memoryless payoff mechanism is a globally Lipschitz continuous and continuously differentiable function F : 𝕏 → ℝ^n, acting as F : X(t) ↦ P(t), where n := n^1 + · · · + n^L. Such an F is referred to as a game (or population game).

Example 1 Linear versions of congestion games, originally proposed by [19] (also see [7, Section 2.2.2]), are an example of a well-studied class of games. Consider the graph in Figure 2.1a. In this graph, O denotes the origin, D denotes the destination, links represent roads, and arrows on links represent the direction in which an agent choosing the link travels. There is a single origin-destination pair, and the agents can choose to travel from the origin to the destination via one of the 3 routes in Figure 2.1b. Accordingly, these routes constitute the strategies available to the agents. An agent using a link e ∈ {1, ..., 5} experiences a delay due to the link's utilization. We assume that this dependence is linear, given by $c_e \sum_{i \in \Phi_e} \xi_i$. Here, Φ_e is the set of routes that contain link e, ξ_i is the percentage of agents traversing route i, and c_e is a positive constant quantifying how well link e accommodates traffic.
The payoff of using route (or strategy) i is the negative of the sum of delays on the links that form route i. Hence, there is a single population, and the payoffs of using the routes under the population state value ξ are given by

$$\begin{bmatrix} \mathcal{F}_1(\xi) \\ \mathcal{F}_2(\xi) \\ \mathcal{F}_3(\xi) \end{bmatrix} = - \begin{bmatrix} c_1 + c_2 & 0 & c_1 \\ 0 & c_4 + c_5 & c_5 \\ c_1 & c_5 & c_1 + c_3 + c_5 \end{bmatrix} \begin{bmatrix} \xi_1 \\ \xi_2 \\ \xi_3 \end{bmatrix},$$

where Figure 2.1b displays which route corresponds to which strategy.

More generally, the payoff mechanism is a dynamical system, called the payoff dynamics model (PDM), with input X and output P. As discussed in [8, 20], PDMs incorporate behaviors such as delays, pricing inertia, and agent-level learning, and are also used to isolate long-term trends [21].

Definition 2 (Payoff Dynamics Model) At any t ≥ 0, a payoff dynamics model (PDM) assigns P(t) as

$$\dot{Q}(t) = \mathcal{G}\big(Q(t), X(t)\big), \qquad P(t) = \mathcal{H}\big(Q(t), X(t)\big), \tag{2.1}$$

where G : ℝ^m × 𝕏 → ℝ^m is globally Lipschitz continuous, H : ℝ^m × 𝕏 → ℝ^n is continuously differentiable and globally Lipschitz continuous, and m is a positive integer. In addition, Q(0) takes values in some compact set Q₀ ⊆ ℝ^m, and there is a game F_{G,H} that equals H in the stationary regime, in the sense that the following implication holds for all γ ∈ ℝ^m and ξ ∈ 𝕏:

$$\mathcal{G}(\gamma, \xi) = 0 \;\Rightarrow\; \mathcal{H}(\gamma, \xi) = \mathcal{F}_{\mathcal{G},\mathcal{H}}(\xi). \tag{2.2}$$

Remark 1 Games are specific PDM instances, because any game F can be obtained as a PDM by choosing G(γ, ξ) = 0 and H(γ, ξ) = F(ξ) for all ξ ∈ 𝕏 and γ ∈ ℝ^m (see [8] for further information on how PDMs generalize games).

The analysis in this thesis presumes, as is the case in [8, 20], that Q remains in a bounded set Q. Notice that this is guaranteed whenever the PDM is input-to-state stable [22], because X and Q(0) take values in the bounded sets 𝕏 and Q₀, respectively. Furthermore, in combination with Q and X being bounded, the fact that H is Lipschitz continuous ensures that P remains in a bounded set ℙ := ℙ^1 × · · · × ℙ^L.

An important concept in game theory is the Nash equilibria, i.e., the set of strategy profiles at which no agent has an incentive to unilaterally deviate from its strategy [23]. In our context, the Nash equilibria are defined for a (population) game F as follows.

Definition 3 (Nash equilibria) For any F : ∆ → ℝ^n, the Nash equilibria of F are

$$\mathrm{NE}(\mathcal{F}) := \left\{ \xi \in \Delta \;\middle|\; \xi^l_i > 0 \Rightarrow i \in \operatorname*{argmax}_{j \in \mathcal{V}^l} \{\mathcal{F}^l_j(\xi)\} \right\},$$

where $\Delta^l := \{\xi^l \in \mathbb{R}^{n^l}_{\geq 0} \mid \sum_{i=1}^{n^l} \xi^l_i = 1\}$ is the standard (n^l − 1)-simplex and ∆ = ∆^1 × · · · × ∆^L. In other words, at a Nash equilibrium, a positive proportion of a population plays a strategy only if that strategy offers the highest payoff. Alternatively, defining the correspondences

$$b^l_{\mathcal{F}}(\xi) := \operatorname*{argmax}_{j \in \mathcal{V}^l} \{\mathcal{F}^l_j(\xi)\} \quad \text{and} \quad B^l_{\mathcal{F}}(\xi) := \{\eta^l \in \Delta^l \mid \eta^l_i > 0 \Rightarrow i \in b^l_{\mathcal{F}}(\xi)\},$$

we can write NE(F) as the set of fixed points of B_F := (B^1_F, ..., B^L_F). Every population game has at least one Nash equilibrium [7, Theorem 2.1.1].
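To make Example 1 and Definition 3 concrete, the following Python sketch evaluates the congestion game's payoff map F(ξ) = −Aξ and tests the Nash equilibrium condition of Definition 3 for a candidate population state. The link constants c₁, ..., c₅ and the numerical tolerance are hypothetical placeholders chosen only for illustration; this is not code from the thesis.

```python
import numpy as np

# Hypothetical link constants c_1, ..., c_5 (for illustration only).
c = np.array([1.0, 2.0, 1.5, 2.0, 1.0])

# Payoff matrix of Example 1: F(xi) = -A @ xi, rows indexed by routes 1-3.
A = np.array([
    [c[0] + c[1], 0.0,         c[0]],
    [0.0,         c[3] + c[4], c[4]],
    [c[0],        c[4],        c[0] + c[2] + c[4]],
])

def F(xi):
    """Congestion game payoffs (negative route delays) at population state xi."""
    return -A @ xi

def is_nash(xi, tol=1e-6):
    """Definition 3: every strategy used by a positive fraction of agents
    must attain the maximal payoff (checked up to a numerical tolerance)."""
    p = F(xi)
    return bool(np.all(p[xi > tol] >= p.max() - tol))

xi = np.array([0.5, 0.3, 0.2])   # a candidate population state on the simplex
print(F(xi), is_nash(xi))
```

Any state accepted by this check satisfies the condition of Definition 3 only up to the chosen tolerance.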
2.3 The Revision Paradigm

The agents repeatedly revise their strategies in response to X and P. This revision process has two main components: (i) the times at which the agents revise their strategies, and (ii) a function – called the learning rule – that specifies how a revising agent decides on its subsequent strategy.

2.3.1 Revision Times

The revision times of each agent follow a Poisson process. Specifically, the agents possess identical and independent Poisson processes with rate λ > 0, and an agent's revision times are defined as the jump times of its process. This construction is equivalent to assuming that the time between any two consecutive revisions of an agent (the inter-revision interval) is an exponential random variable with rate λ. These random variables are independent across all agents in all populations. In Chapters 3, 4 and 5, we will explore extensions of this assumption by permitting the distributions of the inter-revision intervals to depend on the agents' strategies or to follow Erlang distributions.

2.3.2 The Learning Rule

Having specified the revision times, we now describe how a revising agent behaves. Given a revision opportunity, an agent decides on its subsequent strategy according to a probability distribution characterized by a learning rule (or revision protocol). The rule of each population l is a Lipschitz continuous function ρ^l : ∆^l × ℝ^{n^l} → ℝ^{n^l × n^l}_{≥0} that satisfies $\sum_{j=1}^{n^l} \rho^l_{ij}(\xi^l, \pi^l) = 1$ for all i ∈ V^l, ξ^l ∈ ∆^l and π^l ∈ ℙ^l. Broadly speaking, given that an agent in population l following strategy i is revising its strategy, ρ^l_{ij}(ξ^l, π^l) is the probability that the agent switches to strategy j, where ξ^l and π^l are respectively the values of X^l and P^l immediately prior to the revision.

More precisely, consider that an agent revises its strategy at time t̄ > 0. It follows from the construction of the revision times that, with probability 1, there exists a t* > 0 between t̄ and the previous revision time of the society. Let us denote the strategy at t* of the revising agent by i, which we refer to as its origin strategy. Then, for any j ∈ V^l the probability of the revising agent's subsequent strategy being j is ρ^l_{ij}(X^l(t*), P^l(t*)). The realization of this lottery becomes the revising agent's strategy at time t̄. Figure 2.2 illustrates the revision timeline near t*.

[Figure 2.2: The revision timeline near t*. The society's previous revision occurs at t, so X is constant over [t, t̄); successive society-wide revision times are separated by exp(λN) intervals, and the revision at t̄ results in a switch from i to j with probability (1/L) X^l_i(t*) ρ^l_{ij}(X^l(t*), P^l(t*)).]

Three well-studied examples of learning rules are the Smith, replicator, and Brown-von Neumann-Nash (BNN) rules. We present these rules in Table 2.1, in which [·]₊ = max(·, 0). Moreover, in this table, i, j ∈ V^l are such that i ≠ j, and C is a constant satisfying $C \geq \max_{\xi^l \in \Delta^l,\, \pi^l \in \mathbb{P}^l,\, i \in \mathcal{V}^l} \sum_{j \in \mathcal{V}^l \setminus \{i\}} \rho^l_{ij}(\xi^l, \pi^l)$. The Smith rule was proposed by [24] to study traffic assignment problems. An agent following the Smith rule switches to a strategy with positive probability if and only if that strategy offers a higher payoff than the agent's origin strategy. The replicator rule stems from the seminal paper of [25] and is closely related to earlier models from mathematical biology, such as those of genetics, prebiotic evolution, animal behavior, and population ecology [26]. An agent following the replicator rule not only considers the payoff differences between its origin strategy and the target strategies, but also the percentage of agents playing the target strategies, which constitutes an imitation component. The BNN rule was introduced by [27] to analyze symmetric zero-sum games and was later used by [28] to show the existence of Nash equilibria in normal-form games. In contrast to the Smith and replicator protocols, an agent following the BNN rule compares the payoffs of the target strategies with the population average of the payoffs.

Name                          | Definition
Smith                         | ρ^Smith_{ij}(ξ, π) := [π_j − π_i]₊ / C
Replicator                    | ρ^Rep_{ij}(ξ, π) := ξ_j [π_j − π_i]₊ / C
Brown-von Neumann-Nash (BNN)  | ρ^BNN_{ij}(ξ, π) := [π_j − ξᵀπ]₊ / C

Table 2.1: The Smith, replicator and BNN revision protocols.
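The protocols in Table 2.1 translate directly into code. The Python sketch below (a minimal illustration, not part of the thesis) expresses each rule as a switch-probability function of (ξ, π); the normalizing constant C and the sample values in the demo are placeholders, with C assumed large enough to satisfy the bound stated above.

```python
import numpy as np

C = 10.0  # placeholder normalizer; any C satisfying the bound above works

def smith(i, j, xi, pi):
    """Smith rule: switch i -> j with probability [pi_j - pi_i]_+ / C."""
    return max(pi[j] - pi[i], 0.0) / C

def replicator(i, j, xi, pi):
    """Replicator rule: payoff comparison weighted by the share xi_j (imitation)."""
    return xi[j] * max(pi[j] - pi[i], 0.0) / C

def bnn(i, j, xi, pi):
    """BNN rule: compares the target payoff with the population-average payoff."""
    return max(pi[j] - xi @ pi, 0.0) / C

xi = np.array([0.5, 0.3, 0.2])          # hypothetical population state
pi = np.array([-1.4, -1.2, -1.6])       # hypothetical payoff vector
print(smith(0, 1, xi, pi), replicator(0, 1, xi, pi), bnn(0, 1, xi, pi))
```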
2.3.3 The Social State As a Markov Process

The agents repeatedly update their strategies, as illustrated in Figure 2.3, causing X to vary over time. When the payoff mechanism is memoryless, X is a right-continuous pure-jump Markov process with state space 𝕏, jump rate λN, and transition probabilities from ξ to ζ given below [7]:

$$\mathbb{P}\big(X(\bar{t}) = \zeta \mid X(t^*) = \xi\big) = \begin{cases} \dfrac{1}{L}\, \xi^l_i\, \rho^l_{ij}\big(\xi^l, \mathcal{F}^l(\xi)\big) & \text{if } \zeta - \xi = \dfrac{e_{l,j} - e_{l,i}}{N^l},\ l \in \{1, \ldots, L\},\ i, j \in \mathcal{V}^l, \\[4pt] 0 & \text{otherwise}, \end{cases}$$

where e_{l,i} is the (i + ∑_{k=1}^{l−1} n^k)-th basis vector in ℝ^n, and t̄ and t* are as in Section 2.3.2.

[Figure 2.3: The agents' strategic revisions and their influence on X and P. An agent uses the revision protocol ρ^l to revise its strategy as a function of X^l and P^l; the social state X might vary with the outcome of the revision, and the payoff mechanism then assigns the updated P.]

Similarly, the analysis in [8] shows that if the payoff mechanism is a PDM, then the pair (X, Q̌) is a right-continuous pure-jump Markov process, where Q̌ is the piecewise constant approximation of Q that only gets updated at the revision times. We refer to [8] for the details of this case.
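Section 2.3.3 suggests a direct way to simulate the social state: society-wide revisions occur at rate λN, the revising agent's strategy is i with probability X_i, and the agent then switches according to its learning rule. The single-population Python sketch below does this for the Smith rule and the congestion game of Example 1; the constants (link coefficients, λ, N, C, horizon) are hypothetical and the code is illustrative only, not the simulation code used in later chapters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Congestion game of Example 1 with hypothetical link constants.
c = np.array([1.0, 2.0, 1.5, 2.0, 1.0])
A = np.array([[c[0] + c[1], 0.0, c[0]],
              [0.0, c[3] + c[4], c[4]],
              [c[0], c[4], c[0] + c[2] + c[4]]])
F = lambda xi: -A @ xi                       # memoryless payoff mechanism
C, lam, N, T = 10.0, 1.0, 500, 30.0          # normalizer, rate, agents, horizon

counts = np.array([N, 0, 0])                 # all agents start on strategy 1
t = 0.0
while t < T:
    t += rng.exponential(1.0 / (lam * N))    # society-wide revisions occur at rate lam*N
    xi = counts / N
    p = F(xi)
    i = rng.choice(3, p=xi)                  # the revising agent follows strategy i w.p. X_i
    probs = np.maximum(p - p[i], 0.0) / C    # Smith switch probabilities [p_j - p_i]_+/C
    probs[i] = 1.0 - probs.sum()             # remaining mass: keep the current strategy
    j = rng.choice(3, p=probs)
    counts[i] -= 1
    counts[j] += 1

print(counts / N)                            # empirical strategy distribution at time T
```

For large N, repeated runs of this simulation concentrate around a deterministic trajectory, which motivates the mean-field approximation discussed next.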
A question of practical relevance is whether the agents can learn to play a Nash equilibrium of F or FG,H (recall that FG,H is the “stationary game” of the PDM), meaning whether X converges to such equilibrium. Through their result in Remark 4, [29] establishes that we can address this question by analyzing the mean closed loop. Specifically, ifO is globally asymptotically stable and Ox = NE(FG,H), then the stationary distributions of X are close to NE(FG,H), provided that there is a large number of agents in every population. This insight greatly simplifies the task of deriving convergence properties for X, as achieving global asymptotic stability guarantees for the mean closed loop is typically easier. 2.5 Methods for Analyzing the Mean Closed Loop Having established the importance of the mean closed loop, the focus shifts to analyzing it. Observe that the mean closed loop can be interpreted as the positive feedback interconnection of two systems: The PDM and (2.3c), where (2.3c) (with state and output x and input p) is commonly referred to as the evolutionary dynamics model (EDM). This feedback interpretation, which we illustrate in Figure 2.4, yields a significant benefit: It enables us to utilize passivity-based techniques and analyze the PDM and the EDM independently. This approach yields (seperate) conditions on the EDM and the PDM, which, when combined, 16 Payoff Dynamics Model (PDM) q̇(t) = G(q(t), x(t)) p(t) = H(q(t), x(t)) Evolutionary Dynamics Model (EDM) ẋ(t) = V(x(t), p(t)) p x Figure 2.4: Diagram representing the mean closed loop as the positive feedback interconnection of the PDM and the EDM. have been proven to ensure the global asymptotic stability of Nash equilibria. In the following sections, we present some of these conditions, which will be instrumental in our forthcoming analysis. 2.5.1 Conditions on the EDM An important condition on the EDM is called Nash stationarity [12]. Definition 4 (Nash Stationarity) The EDM is said to be Nash stationary (NS) if for every l ∈ {1, . . . , L}, ξl ∈ ∆l and πl ∈ Pl the following holds: V l(ξl, πl) = 0 ⇐⇒ ξl ∈ argmax η∈∆l ηTπl. (NS) Observe that (NS) is a condition on the equilibria of the EDM and establishes an important connection between NE(FG,H) and the equilibria of the mean closed loop: When the PDM is stationary, the mean closed loop is in equilibrium if and only if the x-component of this equilibrium belongs to NE(FG,H). In the case where the payoff mechanism is memoryless, this means that the mean closed loop is stationary if and only if x is in NE(F). 17 Another condition, known as positive correlation, pertains to the non-equilibrium behavior of the EDM [12]. Definition 5 (Positive Correlation) The EDM is said to satisfy positive correlation (PC) if for every l ∈ {1, . . . , L}, ξl ∈ ∆l and πl ∈ Pl the following holds: V l(ξl, πl) ̸= 0⇒ V l(ξl, πl)Tπl > 0. (PC) Essentially, when the payoff mechanism is memoryless, (PC) implies that ẋ andF(x) make an acute angle as long as x is not stationary. In the context of population games, the concepts of (NS) and (PC) were solidified by [12, 32]. In their work, [21] highlighted the resemblance between (PC) and the standard system- theoretic passivity concept [33]. Expanding on this observation, [21] introduced the concept of δ-passivity. Subsequent extensions were made by [8, 10], with the latest being in [10], which is the form of δ-passivity that we will use in our discussion. Definition 6 (δ-Passivity) Given l ∈ {1, . . . 
Definition 6 (δ-Passivity) Given l ∈ {1, ..., L}, the EDM for population l is δ-passive if there are functions 𝒮^l : ∆^l × ℝ^{n^l} → ℝ_{≥0} and S^l : ∆^l × ℝ^{n^l} → ℝ_{≥0} such that the following holds for all ξ^l ∈ ∆^l and π^l, ν^l ∈ ℝ^{n^l}:

$$\frac{\partial \mathcal{S}^l(\xi^l, \pi^l)}{\partial \xi^l}\, \mathcal{V}^l(\xi^l, \pi^l) + \frac{\partial \mathcal{S}^l(\xi^l, \pi^l)}{\partial \pi^l}\, \nu^l \leq -\mathsf{S}^l(\xi^l, \pi^l) + \mathcal{V}^l(\xi^l, \pi^l)^T \nu^l, \tag{2.6a}$$
$$\mathcal{S}^l(\xi^l, \pi^l) = 0 \Leftrightarrow \mathcal{V}^l(\xi^l, \pi^l) = 0, \tag{2.6b}$$
$$\mathsf{S}^l(\xi^l, \pi^l) = 0 \Leftrightarrow \mathcal{V}^l(\xi^l, \pi^l) = 0. \tag{2.6c}$$

Following the convention in [8, 20], we will refer to 𝒮^l as a δ-storage function.

Now, we introduce some well-studied classes of learning rules that ensure the (PC) and δ-passivity of the EDM. These classes will be crucial for our upcoming results.

Definition 7 (Pairwise Comparison Learning Rules) A learning rule ρ^l is said to belong to the pairwise comparison (PC) class if for all i ∈ V^l, j ∈ V^l \ {i}, ξ^l ∈ ∆^l and π^l ∈ ℙ^l it can be written as

$$\rho^l_{ij}(\xi^l, \pi^l) = \phi^l_{ij}(\pi^l),$$

where φ^l_{ij} : ℝ^{n^l} → ℝ_{≥0} satisfies sign preservation in the sense that φ^l_{ij}(π^l) > 0 if π^l_j > π^l_i and φ^l_{ij}(π^l) = 0 if π^l_j ≤ π^l_i. If, in addition, ρ^l can be written as ρ^l_{ij}(ξ^l, π^l) = φ^l_j(π^l_j − π^l_i), then it is said to be an impartial pairwise comparison (IPC) rule, where φ^l_j : ℝ → ℝ_{≥0} is again sign preserving for all j ∈ V^l \ {i}.

Essentially, an agent following a PC rule can only switch to strategies that offer greater payoffs. Their desirable incentive properties [7, 11] and inherently fully decentralized operation make the PC class applicable to many engineering problems. The Smith rule, which we presented in Section 2.3.2, belongs to the PC class. Note that we will refer to positive correlation as (PC) and pairwise comparison as PC.

A second important class of learning rules is the excess payoff target class [12].

Definition 8 (Excess Payoff Target Learning Rules) A learning rule ρ^l is said to be an excess payoff target (EPT) rule if it can be written for all i ∈ V^l, j ∈ V^l \ {i}, ξ^l ∈ ∆^l and π^l ∈ ℙ^l as

$$\rho^l_{ij}(\xi^l, \pi^l) = \varphi^l_j(\hat{\pi}^l), \tag{2.7a}$$
$$\hat{\pi}^l := \pi^l - \mathbf{1}\,(\pi^l)^T \xi^l. \tag{2.7b}$$

Here, π̂^l is called the excess payoff of population l and φ^l : ℝ^{n^l} → ℝ^{n^l}_{≥0} is a function that satisfies acuteness, which requires that φ^l(η)^T η > 0 for all η ∈ int(ℝ^{n^l}_*) := int(ℝ^{n^l} \ int(ℝ^{n^l}_{≤0})) [12]. Intuitively, acuteness ensures that an agent following an EPT rule is more likely to switch to strategies with payoffs that are higher than the population-average payoff. An example of an EPT rule is the Brown-von Neumann-Nash (BNN) rule, which is obtained by setting φ^l_j(π̂^l) = max{0, π̂^l_j} (see Section 2.3.2).

Remark 5 Below are some of the existing findings on (NS), (PC), and δ-passivity linked to PC and EPT protocols, which will be pertinent to our upcoming analysis:
• If ρ^l is a PC rule, then the EDM of population l satisfies (NS) and (PC) [11].
• If ρ^l is an IPC rule, then the EDM of population l satisfies δ-passivity [8].
• If ρ^l is an EPT rule, then the EDM of population l satisfies (NS) and (PC) [12].

2.5.2 Conditions on the Payoff Mechanism

In line with the δ-passivity approach, in addition to the conditions for the EDM, we must also establish conditions for the payoff mechanism. The first two of these conditions are valid when the payoff mechanism is memoryless. Additionally, for simplicity, we present these conditions assuming a single population, which is how we will use them.

Definition 9 (Potential Games) Assume that there is a single population, i.e., L = 1. The game F is said to be a potential game if there is a continuously differentiable function f : ℝ^n → ℝ satisfying ∇f = F. We refer to f as the game's potential.
Originally formulated in the context of normal form games, potential games were extended to our (population games) framework in [32]. In such games, a single scalar-valued function can capture all essential information about payoffs that influences agents' incentives. The presence of this function, known as the potential function, underpins the numerous attractive features of potential games [7, Chapter 3.1]. One such feature is related to distributed optimization.

Remark 6 When F is a (single population) potential game, NE(F) coincides with the points that satisfy the Karush-Kuhn-Tucker conditions for the problem of maximizing f over ∆. Hence, if f is concave, then NE(F) = argmax_{ξ∈∆} f(ξ). Moreover, if f is strictly concave, then argmax_{ξ∈∆} f(ξ) is unique. We will denote this unique maximizer as x^NE.

A second type of game is called contractive (also called stable in earlier work [34]).

Definition 10 (Contractive Games) Assume that there is a single population. A game F is said to be contractive if

$$(\zeta - \xi)^T \big(\mathcal{F}(\zeta) - \mathcal{F}(\xi)\big) \leq 0$$

for all ζ, ξ ∈ ∆. If the above inequality holds strictly whenever ξ ≠ ζ, then F is said to be strictly contractive.

Note that when F is differentiable, the above condition is equivalent to η^T DF(ξ) η ≤ 0 being satisfied for all η ∈ T∆ := {ν ∈ ℝ^n | ∑_{i=1}^n ν_i = 0} and ξ ∈ ∆, where DF denotes the Jacobian of F. For strictly contractive and differentiable F, we have η^T DF(ξ) η < 0, in which case we define

$$\gamma := -\max_{\xi \in \Delta,\, \eta \in T\Delta} \eta^T D\mathcal{F}(\xi)\eta, \qquad \bar{\gamma} := -\min_{\xi \in \Delta,\, \eta \in T\Delta} \eta^T D\mathcal{F}(\xi)\eta.$$

Similar to potential games, contractive games also have many desirable properties, such as convexity of NE(F) [34]. The authors in [10] extend the contractivity definition to multi-population games by introducing weighted contractivity. However, we omit presenting this definition since we do not utilize it. We note that the classes of potential and contractive games do not contain one another. For instance, the 123-coordination game [7, Example 3.1.5] is potential, but not contractive, and the "good" rock-paper-scissors (RPS) game [7, Example 3.3.2] is strictly contractive, but not potential.

Finally, we present the passivity condition associated with the PDM. The original form of this condition (akin to δ-passivity of the EDM) is known as δ-antipassivity [8]. However, in this thesis, we adopt an expanded version termed δ-antidissipativity, introduced by [10] to accommodate a wider range of payoff mechanisms (see, for instance, the mixed autonomy congestion game in [10]).

Definition 11 (δ-Antidissipativity) Given positive constants w^1, ..., w^L, a PDM is said to be δ-antidissipative with weights w^1, ..., w^L if there are functions Q : ℝ^m × ∆ → ℝ_{≥0} and R : ℝ^m × ∆ → ℝ_{≥0} such that the following holds for all γ ∈ ℝ^m, ξ ∈ ∆ and ν ∈ T∆:

$$\frac{\partial \mathcal{Q}(\gamma, \xi)}{\partial \gamma}\, \mathcal{G}(\gamma, \xi) + \frac{\partial \mathcal{Q}(\gamma, \xi)}{\partial \xi}\, \nu \leq -\mathcal{R}(\gamma, \xi) - \psi^T \Pi \psi, \tag{2.8a}$$
$$\mathcal{Q}(\gamma, \xi) = 0 \Leftrightarrow \mathcal{G}(\gamma, \xi) = 0, \tag{2.8b}$$
$$\mathcal{R}(\gamma, \xi) = 0 \Leftrightarrow \mathcal{G}(\gamma, \xi) = 0, \tag{2.8c}$$

where T∆ = T∆^1 × · · · × T∆^L, the vector ψ is given by

$$\psi := \begin{bmatrix} \dfrac{\partial \mathcal{H}(\gamma,\xi)}{\partial \gamma}\, \mathcal{G}(\gamma, \xi) + \dfrac{\partial \mathcal{H}(\gamma,\xi)}{\partial \xi}\, \nu \\[6pt] \nu \end{bmatrix},$$

and Π is the block matrix with blocks Π₁₁ = Π₂₂ = 0, Π₁₂ = Π₂₁ = (1/2)W, in which W is the block-diagonal matrix W := diag(w^1 I_{n^1×n^1}, ..., w^L I_{n^L×n^L}).

Remark 7 If a PDM is δ-antidissipative with weights w^l = 1 for all l ∈ {1, ..., L}, then it is δ-antipassive as defined and studied in [8] (see [10, Remark 8]). Numerous examples of δ-antipassive PDMs can be found in [8].
For the case when the PDM is a game F, δ-antidissipativity with weights {w^1, ..., w^L} reduces to the condition that

$$\sum_{l=1}^{L} w^l \big(\mathcal{F}^l(\xi) - \mathcal{F}^l(\eta)\big)^T (\xi^l - \eta^l) \leq 0 \tag{2.9}$$

holds for all ξ, η ∈ ∆. Observe that (2.9) coincides with contractivity (see Definition 10) when the weights are identical, and can be viewed, more generally, as weighted contractivity [10] with respect to the block-diagonal matrix W.

2.5.3 Ensuring the Global Asymptotic Stability of the Mean Closed Loop

By employing Lyapunov's method and passivity-based techniques, various combinations of the conditions outlined in Sections 2.5.1 and 2.5.2 have been demonstrated to ensure the global asymptotic stability of NE(F) or NE(F_{G,H}). Below, we recap some of these guarantees.

Remark 8 (Guarantees on the Global Asymptotic Stability of Nash Equilibria) Assume that there is a single population and the payoff mechanism is memoryless. If F is a potential game and the EDM satisfies (NS) and (PC), then NE(F) is globally asymptotically stable under the mean closed loop [32]. On the other hand, if the payoff mechanism is a δ-antipassive PDM and the EDM of each population is δ-passive, then the equilibria of the mean closed loop are globally asymptotically stable. If, in addition, the EDM of each population is also (NS), then ξ* ∈ NE(F_{G,H}) for every equilibrium point (ξ*, γ*) of the mean closed loop [10]. In light of Remark 7, this result also implies that the δ-passivity and (NS) of the EDM guarantee the global asymptotic stability of NE(F) whenever the payoff mechanism is a memoryless contractive game.

As an example, consider a single population of agents interacting through the congestion game in Example 1. From Remark 5, we know that the Smith rule (which belongs to the PC class) yields an EDM that satisfies (NS) and (PC). Moreover, the congestion game is a potential game with

$$f(\xi) = -\Big( \tfrac{1}{2}(c_1 + c_2)\xi_1^2 + \tfrac{1}{2}(c_4 + c_5)\xi_2^2 + \tfrac{1}{2}(c_1 + c_3 + c_5)\xi_3^2 + c_1 \xi_1 \xi_3 + c_5 \xi_2 \xi_3 \Big)$$

(in fact, every congestion game is a potential game [7, Example 3.1.2]). Therefore, if the agents follow the Smith rule, then (as we outline in Remark 8) it follows from [32] that the distribution of the agents over the routes converges to NE(F). Note from Remark 6 that NE(F) = argmax_{ξ∈∆} f(ξ), which yields the following Nash equilibria:

$$\mathrm{NE}(\mathcal{F}) = \left\{ \frac{1}{d}\Big( c_1 c_5 + c_3 c_4 + c_3 c_5 + c_4 c_5,\;\; c_1 c_2 + c_1 c_3 + c_2 c_3 + c_1 c_5,\;\; c_2 c_4 - c_1 c_5 \Big) \right\},$$

where d := c₁c₂ + c₁c₃ + c₂c₃ + c₁c₅ + c₂c₄ + c₃c₄ + c₃c₅ + c₄c₅.
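As a sanity check on this example, one can integrate the mean dynamic (2.5) under the Smith rule with a forward-Euler scheme and compare the limit with the closed-form Nash equilibrium above. The Python sketch below does so for hypothetical link constants; the step size, horizon, revision rate λ, and normalizer C are arbitrary illustrative choices, not values prescribed by the thesis.

```python
import numpy as np

c1, c2, c3, c4, c5 = 1.0, 2.0, 1.5, 2.0, 1.0        # hypothetical link constants
A = np.array([[c1 + c2, 0.0, c1],
              [0.0, c4 + c5, c5],
              [c1, c5, c1 + c3 + c5]])
F = lambda x: -A @ x
lam, C, dt = 1.0, 10.0, 0.01

x = np.array([1.0, 0.0, 0.0])                        # initial population state
for _ in range(20000):                               # forward-Euler integration of (2.5)
    p = F(x)
    rho = np.maximum(p[None, :] - p[:, None], 0.0) / C   # rho[i, j] = [p_j - p_i]_+/C (Smith)
    inflow = lam * (x @ rho)                         # inflow_i  = sum_j x_j rho[j, i]
    outflow = lam * x * rho.sum(axis=1)              # outflow_i = x_i sum_j rho[i, j]
    x = x + dt * (inflow - outflow)

d = c1*c2 + c1*c3 + c2*c3 + c1*c5 + c2*c4 + c3*c4 + c3*c5 + c4*c5
x_ne = np.array([c1*c5 + c3*c4 + c3*c5 + c4*c5,
                 c1*c2 + c1*c3 + c2*c3 + c1*c5,
                 c2*c4 - c1*c5]) / d
print(x, x_ne)   # the Euler trajectory should end up near the closed-form Nash equilibrium
```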
Crucially, this inhibits the framework’s applicability in many practical settings. To illustrate, consider a decentralized task allocation problem. Imagine a system with a stream of large number of tasks, each categorized into one of several types. Suppose that there are entities that serve these tasks, which we interpret as the agents. At any given time, each agent handles a single job, requiring a service time that is independent of other tasks’ service times and follows an exponential distribution. After completing a task, an agent can opt to work on a 26 task of a different type. Each type of task has a payoff that quantifies the benefit that an agent would receive upon selecting it. A meaningful objective would be to devise learning rules and a payoff structure that guide the agents to allocate themselves optimally across task types, in a decentralized manner. However, applying results from the population games framework to this problem is feasible only if each task type’s service time has the same rate, which is often not the case in reality. To facilitate the framework’s applicability in a wider range of settings, such as the problem above, we allow for the agents’ revision rates to depend on their strategies. We assume that for each l ∈ {1, . . . , L} and i ∈ V l a positive constant λli characterizes the rate at which the agents in population l currently following strategy i revise their strategies, and define the vector λl = (λl1, . . . , λ l nl ) as the revision rates of population l. The examples below give a more concrete account of the prelevance of this extension. Example 2 A motivating example of application of our framework, which we will be invoking throughout this chapter to illustrate our contributions, is that of a “hassle vs. price” game (HPG). In this example, each agent operates a machine that uses a component that must be replaced when it fails. There are several manufacturers that make the component to varying degrees of reliability. Specifically, each component has an exponentially distributed lifetime and its failure rate depends on the manufacturer. The available strategies are the manufacturers, and the payoff of each strategy combines two non-positive terms: (i) a hassle (disruption) cost that increases with the failure rate and (ii) the price of the component, which is higher for more reliable manufacturers. The revision opportunity time occurs when the component fails and the agent must decide based on the available information, such as the current payoffs ascribed to the 27 strategies, whether to keep the current strategy (buy again from the same manufacturer) or follow a different strategy (decide on another manufacturer to buy from). The agents are partitioned into populations, each uniquely associated to a machine type and/or the undertaking for which the machine is used. The game F , with its i-th component for population l specified below, is an example of a memoryless payoff mechanism for the HPG: F li (ξ) := HPG −βlλli︸ ︷︷ ︸ hassle (replacement) cost − Ci ( Di(ξ) )︸ ︷︷ ︸ component price . (3.1) The game’s components have the following meaning: • β1, . . . , βL are positive constants quantifying the costs of replacing a component for the respective population. • {1, . . . , κ} is the set of available manufacturers (this is also the strategy set equally available to all1 populations). • λl1, . . . , λ l κ are the failure rates of the components for the l-th population according to manufacturer, which we assume are ordered as λl1 > . . . 
> λlκ > 0 (manufacturer κ makes the most reliable components). • D : X→ [0, d̄]κ gives the (effective) demand from each manufacturer for all i ∈ {1, . . . , κ} and ξ ∈ ∆ as Di(ξ) := L∑ l=1 αlξli. (3.2) 1This means that all populations have the same strategy set and same number of strategies (n1 = · · · = nL = κ). 28 Here, α1, . . . , αL are positive constants that quantify the relative weight of each population on the demand. These constants may reflect, for instance, the relative sizes of the populations. Finally, Ci : R≥0 → [ci,∞) is a continuously differentiable surjective function (of the demand) that quantifies the cost of a component made by the i-th manufacturer. When the HPG above satisfies the conditions in the remark below, it is weighted contractive. Remark 9 (Properties of C for Example 2 and Example 4) We assume that C in equations (3.1) and (3.10) have the following properties: a) C1, . . . , Cκ are increasing. b) More reliable components are more expensive, i.e., if i > j then Ci(δ) > Cj(δ) for all δ ∈ [0, d̄]. Remark 10 When C satisfies Assumption 9.a, we can show, by following an approach analogous to that of [10, Section IV.A], that theF in Example 2 is weighted contractive with {w1, . . . , wL} = {α1, . . . , αL}. Moreover, by appropriately modifying the steps in the proof of [10, Proposition 3], we can also show that the PDM example that we will present in Section satisfies δ-antidissipativity with {w1, . . . , wL} = {α1, . . . , αL}. In economic theory, Assumption 9.a is referred to as demand-pull inflation [35] that occurs when the supply of a product is limited2, the manufacturer discounts the price when the demand is weak (and gradually eliminates the discount as demand rises), or when the manufacturer raises the price with increasing demand as a way to increase profits when the product becomes popular. 2Factors restricting supply may include scarcity of raw materials, when manufacturer strategically opts to limit production to keep prices up (as dynamic random-access memory manufacturers have been doing in the last 3 years), difficulty in ramping up production fast enough to meet demand and sanctions to name a few. 29 Higher cost (decrease in payoff) for a strategy with higher demand, as measured by the portion of the population following it, is common in many other applications, such as congestion games. Additional examples of contractive and weighted contractive games can be found respectively in [7] and [10], while other instances of δ-antipassive and δ-antidissipative PDMs can be found respectively in [20], [8], [21] and [10]. Below, we present another setting that advocates strategy- dependent revision rates. Example 3 We could model the effect of the contract value on employee turnover in a way that would lead to another example analogous to Example 2. In such an example, a population’s agents would be the businesses wishing to hire and retain an employee for a specific job type. Each population would comprise businesses with comparable characteristics from the employees’ viewpoint, such as location, structure and size. The strategies available to a population’s agents would be the different types of contracts they can offer. In this case, Ci in (3.1) would determine the cost of contract i as a function of the demand. Cheaper contracts offering worse benefits and/or lower salary would lead to a higher turnover rate (quantified by λli) and associated increased cost for retraining and rehiring (quantified by βliλ l i). 
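Before formalizing the revision-time construction, we include a small numerical sanity check of the weighted-contractivity property claimed in Remark 10. The sketch below (written in Python; all parameter values are chosen purely for illustration and are not taken from the examples above) samples pairs of social states and verifies that the weighted sum in (2.9), with weights w_l = α_l, is never positive for an HPG payoff of the form (3.1) with increasing cost curves.

import numpy as np

rng   = np.random.default_rng(0)
alpha = np.array([1.0, 1.5])                    # population weights; also the w_l in (2.9)
beta  = np.array([1.5, 0.8])                    # hassle-cost constants beta^l
lam   = np.array([[2.0, 1.5, 1.0],              # failure rates lambda^l_i (row l, column i)
                  [3.0, 2.5, 1.2]])
c0    = np.array([1.0, 1.1, 1.3])               # baseline component prices

def C(d):
    # Increasing cost curves (Assumption 9.a): baseline price plus a quadratic demand term.
    return c0 + d**2

def F(x):
    # Payoff (3.1); x has shape (L, kappa) and D_i(x) = sum_l alpha_l * x^l_i as in (3.2).
    d = alpha @ x
    return -beta[:, None] * lam - C(d)[None, :]

def random_state():
    # One point of the product of simplices Delta^1 x Delta^2.
    return rng.dirichlet(np.ones(3), size=2)

worst = max(
    float(np.sum(alpha[:, None] * (F(x) - F(y)) * (x - y)))
    for x, y in ((random_state(), random_state()) for _ in range(10000))
)
print(worst <= 1e-9)   # True: the weighted sum in (2.9) stays nonpositive on all sampled pairs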
To allow for strategy-dependent revision rates, we replace the characterization of the revision times given in Section 2.3.1 as follows: Given a population l ∈ {1, …, L}, we endow each agent of population l with an arrival process in which the inter-revision times are distributed exponentially with rate λ^l_i, where i is the strategy of the agent within the inter-revision period, and we define the revision times of the agents as the jump times of their arrival processes.

From this construction, it follows that the probability that some agent of population l currently following the i-th strategy is allowed to revise its strategy within an infinitesimal time interval of duration δ is δ λ^l_i N^l X^l_i(t*), where t* is in the interval and precedes the revision opportunity time (see Section 2.3.2 for the definition of t*). Moreover, note that the event that a revision opportunity occurs for a given agent during this period is conditionally independent, given its own current strategy, of the revisions of all other agents. The social state X resulting from this extension is again a continuous-time pure-jump Markov process with state space X and transition probabilities from ξ to ζ for any ξ, ζ ∈ X given at any revision time t̄ by

P(X(\bar{t}) = \zeta \mid X(t^*) = \xi) =
\begin{cases}
\dfrac{\lambda^l_i \xi^l_i N^l}{\sum_{l'=1}^{L}\sum_{i'=1}^{n_{l'}} \lambda^{l'}_{i'} \xi^{l'}_{i'} N^{l'}}\, \rho^l_{ij}(\xi^l, \pi^l) & \text{if } \zeta - \xi = \dfrac{e_{l,j} - e_{l,i}}{N^l},\ l \in \{1,\dots,L\},\ i, j \in V^l,\ i \neq j, \\[2ex]
1 - \dfrac{1}{\sum_{l'=1}^{L}\sum_{i'=1}^{n_{l'}} \lambda^{l'}_{i'} \xi^{l'}_{i'} N^{l'}} \displaystyle\sum_{l=1}^{L}\sum_{i=1}^{n_l}\sum_{\substack{j=1 \\ j \neq i}}^{n_l} \lambda^l_i \xi^l_i N^l \rho^l_{ij}(\xi^l, \pi^l) & \text{if } \zeta = \xi, \\[2ex]
0 & \text{otherwise.}
\end{cases}

Just like in the conventional framework, the mean-field approximation results discussed in Section 2.4 can be extended to X with strategy-dependent revision rates, leading to a slightly modified EDM given for all t ≥ 0, l ∈ {1, …, L}, i ∈ V^l, ξ^l ∈ ∆^l and π^l ∈ P^l by

V^l_i(\xi^l, \pi^l) = \sum_{\substack{j=1 \\ j \neq i}}^{n_l} \xi^l_j \lambda^l_j \rho^l_{ji}(\xi^l, \pi^l) - \sum_{\substack{j=1 \\ j \neq i}}^{n_l} \xi^l_i \lambda^l_i \rho^l_{ij}(\xi^l, \pi^l).    (3.3)

Observe that, in contrast to the single λ factor in the traditional EDM (see (2.3c)), the EDM arising from the strategy-dependent revision rates features λ^l_i factors that scale ρ^l_{ij}. In the mean-field limit, we can interpret ξ^l_i λ^l_i ρ^l_{ij}(ξ^l, π^l) as the rate of flow from strategy i to strategy j in population l, incorporating the revision rate λ^l_i.

3.2 Problem Formulation

The mean closed loop with strategy-dependent revision rates is equivalent (in terms of equilibrium and stability properties) to the traditional mean closed loop with EDMs induced by learning rules of the form

\tilde{\rho}^l_{ij}(\xi^l, \pi^l) := \frac{\lambda^l_i}{\sum_{k=1}^{n_l} \lambda^l_k}\, \rho^l_{ij}(\xi^l, \pi^l).

This implies that, fundamentally, we can incorporate the effects of strategy-dependent revision rates into ρ. However, achieving this would necessitate ρ^l_{ij} ≠ ρ^l_{kj}, a condition that the IPC and EPT rules cannot accommodate (see Definitions 7 and 8). Consequently, it is crucial to note that the stability results associated with IPC and EPT rules no longer hold when we allow for strategy-dependent revision rates.

Remark 11 As opposed to the IPC and EPT classes, PC rules allow for the incorporation of λ^l_i. That is, for any PC rule ρ^l, the learning rule given by \tilde{\rho}^l_{ij}(ξ^l, π^l) := λ^l_i / (Σ_{k=1}^{n_l} λ^l_k) ρ^l_{ij}(ξ^l, π^l) also belongs to the PC class. Therefore, the stability results associated with PC rules carry over to the strategy-dependent case. In particular, the EDM resulting from strategy-dependent PC rules satisfies (NS) and (PC), which guarantees the global asymptotic stability of NE(F) under the mean closed loop for potential F (see Remarks 5 and 8).
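To make the modified EDM concrete, the following minimal sketch (Python; single population, with an illustrative payoff and illustrative rates that are not taken from the text) integrates (3.3) by forward Euler under the Smith rule ρ_{ij}(ξ, π) = [π_j − π_i]_+. Since the Smith rule is a PC rule and the assumed game is potential and contractive, Remark 11 predicts convergence to the game's unique Nash equilibrium, which for these parameters is (0.2, 0.6, 0.2).

import numpy as np

lam = np.array([5.0, 4.0, 3.0])            # strategy-dependent revision rates lambda_i
A   = np.diag([2.0, 1.0, 1.5])             # F(x) = b - A x: an assumed potential, strictly contractive game
b   = np.array([1.0, 1.2, 0.9])
x   = np.array([0.6, 0.3, 0.1])            # initial mean social state
dt  = 1e-2

for _ in range(20000):
    pi  = b - A @ x                                        # current payoffs
    rho = np.maximum(pi[None, :] - pi[:, None], 0.0)       # Smith rule: rho[i, j] = [pi_j - pi_i]_+
    np.fill_diagonal(rho, 0.0)
    inflow  = (x * lam) @ rho                              # sum_j x_j * lambda_j * rho_ji
    outflow = x * lam * rho.sum(axis=1)                    # x_i * lambda_i * sum_j rho_ij
    x = x + dt * (inflow - outflow)                        # forward-Euler step of (3.3)

print(np.round(x, 3))                                      # approaches (0.2, 0.6, 0.2)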
Having seen that some of the fundamental stability guarantees in the standard framework fail when we allow for strategy-dependent λ (specifically, those for the IPC and EPT rules), a natural next step is to ask whether we can find new conditions that ensure the global asymptotic stability of Nash equilibria. In the remainder of this chapter, we address this question for IPC rules, while we consider EPT rules in the next chapter. As in Section 2.5.1, we will denote an arbitrary IPC rule by ϕ, so the learning rules that we study have the form ρ^l_{ij}(ξ^l, π^l) = ϕ^l_j(π^l_j − π^l_i) (see, again, Definition 7). Similarly to the traditional case, we will use δ-passivity techniques to establish such guarantees for the IPC class. Therefore, following the stability results discussed in Remark 8, we will assume a δ-antidissipative PDM and derive conditions for the strategy-dependent IPC EDM (3.3) to ensure that it is (NS) and δ-passive. Hence, our refined problem is to determine guarantees for the (NS) and δ-passivity of the EDM resulting from IPC rules and strategy-dependent revision rates.

3.3 Stability Analysis

We start by defining the following worst-case ratios, which quantify the relative discrepancies of a population's revision rates.

Definition 12 (Worst-Case Revision Rate Ratios) Given l ∈ {1, …, L} and the revision rates λ^l_1, …, λ^l_{n_l}, we define the worst-case revision rate ratio for the l-th population as

\lambda^l_R := \max \left\{ \frac{\lambda^l_i}{\lambda^l_j} \;\middle|\; i, j \in V^l \right\}.    (3.4)

Notice that λ^l_R ≥ 1 holds by definition and λ^l_R = 1 if and only if the revision rates of the l-th population are identical.

We proceed with our analysis by first establishing Nash stationarity and then investigating δ-passivity. The worst-case revision rate ratios will be critical in the δ-passivity step.

3.3.1 Nash Stationarity

Recall from Remark 11 that the (NS) and (PC) properties of PC rules persist in the strategy-dependent revision rate case. Additionally, for any IPC rule ϕ^l, the function \tilde{\rho}^l_{ij}(ξ^l, π^l) := λ^l_i / (Σ_{k=1}^{n_l} λ^l_k) ϕ^l_j(π^l_j − π^l_i) is a PC rule. This implies that the EDM resulting from any combination of strategy-dependent revision rates and an IPC rule is a PC EDM. Consequently, the (NS) property of the PC EDM extends to the EDM resulting from any IPC rule and strategy-dependent λ. More precisely, we have the following lemma, which we state without proof because it follows immediately from this reasoning.

Lemma 1 Given l ∈ {1, …, L}, if the l-th population follows an IPC rule, then the resulting EDM of population l is (NS) for any positive revision rates λ^l_1, …, λ^l_{n_l}.

3.3.2 Passivity

Now, we investigate δ-passivity. Inspired by the Lyapunov and δ-storage functions introduced respectively in [34] and [21], we choose the following δ-storage function: Given a population l ∈ {1, …, L} with a learning rule ϕ^l of the IPC class, we set our δ-storage function to be S^l : ∆^l × R^{n_l} → R_{≥0} specified by

S^l(\xi^l, \pi^l) := \sum_{i=1}^{n_l} \lambda^l_i \xi^l_i \left( \sum_{k=1}^{n_l} \psi^l_k(\pi^l_k - \pi^l_i) \right),    (3.5a)

where

\psi^l_k(\pi^l_k - \pi^l_i) := \int_0^{\pi^l_k - \pi^l_i} \phi^l_k(s)\, ds.    (3.5b)

Denoting Σ_{k=1}^{n_l} ψ^l_k(π^l_k − π^l_i) by γ^l_i(π^l), we can write S^l more compactly as

S^l(\xi^l, \pi^l) = \sum_{i=1}^{n_l} \lambda^l_i \xi^l_i \gamma^l_i(\pi^l).    (3.5c)

From our analysis of S^l, we derive the following theorem, which presents a condition guaranteeing the δ-passivity of the EDM induced by IPC rules under strategy-dependent revision rates.
Theorem 1 Given l ∈ {1, …, L}, consider that the l-th population follows an IPC rule specified by ϕ^l and has worst-case revision rate ratio λ^l_R (see (3.4)). The EDM for population l is δ-passive if (i) n_l = 2, or (ii) n_l ≥ 3 and the following inequality holds:

\lambda^l_R < \bar{\lambda}_{\phi^l}(n_l),    (3.6)

where \bar{\lambda}_{\phi^l} is determined from ϕ^l as

\bar{\lambda}_{\phi^l}(n_l) := \min_{k \in V^l}\; \inf_{\pi^l \in \mathbb{R}^{n_l}} \left\{ \frac{\gamma^l_k(\pi^l) \sum_{i=1}^{n_l} \phi^l_i(\pi^l_i - \pi^l_k)}{\sum_{i=1}^{n_l} \phi^l_i(\pi^l_i - \pi^l_k)\, \gamma^l_i(\pi^l)} \right\}.    (3.7a)

Although (to avoid cluttered notation) we do not indicate it explicitly in (3.7a), the infimum is computed subject to the following constraint on π^l:

\sum_{i=1}^{n_l} \phi^l_i(\pi^l_i - \pi^l_k)\, \gamma^l_i(\pi^l) \neq 0.    (3.7b)

We prove Theorem 1 in Appendix A by showing that S^l satisfies (2.6).

Remark 12 (When to compute (3.7)) According to Theorem 1, an IPC EDM is always δ-passive for a population with two strategies (n_l = 2), irrespective of the revision rates. Hence, only when n_l ≥ 3 does one need to compute (3.7) to test whether (3.6) holds. Notably, in many cases (3.7) can be computed numerically. To see an application of (3.6), consider the HPG with 5 manufacturers and suppose that the agents of population l follow the Smith rule. In the upcoming Section 3.3.3, we compute \bar{\lambda}_{\phi^l}(5) to be 4.65 for the Smith rule. Consequently, in this setting, (3.6) requires the worst-case ratio of the failure rates of the component when used by population l (calculated via (3.4)) to be less than 4.65.

Although in many cases (3.7) can be computed numerically, this computation can be cumbersome. To facilitate bypassing the computation of (3.7), in Proposition 1 we present a simple lower bound for \bar{\lambda}_{\phi^l}(n_l) that is valid for IPC rules satisfying the following assumption.

Assumption 1 For a non-decreasing map \bar{\phi}^l : R → R_{≥0}, the following holds for all π ∈ R and i ∈ V^l:

\phi^l_i(\pi) = \bar{\phi}^l(\pi).    (3.8)

Proposition 1 Consider that a population l ∈ {1, …, L} (with n_l ≥ 3) follows an IPC rule. If this rule satisfies Assumption 1, then it holds for all n_l ≥ 3 that

\bar{\lambda}_{\phi^l}(n_l) \ge \frac{n_l - 1}{n_l - 2}.    (3.9)

We prove Proposition 1 in Appendix A. This proof also provides an alternative way to compute \bar{\lambda}_{\phi^l}(n_l) for the case in which Assumption 1 holds (see (A.8)).

Remark 13 We can conclude from (3.9) that, for learning rules satisfying the conditions of Proposition 1, \bar{\lambda}_{\phi^l}(n_l) is strictly greater than 1, which, according to Theorem 1, affords some δ-passivity robustness with respect to λ^l_R. This fact is in contrast to previous results establishing δ-passivity for IPC rules.

We emphasize that Assumption 1 is critical in obtaining Proposition 1, and the following counterexample illustrates why.

Counterexample 1 Consider that n_l = 3 and population l adopts an IPC rule specified by ϕ^l_1(·) = [·]_+^2 and ϕ^l_2(·) = ϕ^l_3(·) = [·]_+. This learning rule violates Assumption 1 and, as we proceed to show, it infringes (3.9) with \bar{\lambda}_{\phi^l}(3) = 1. To do so, we use the following inequality, obtained by setting π^l_1 = 0, π^l_2 = −ε and π^l_3 = −ε + ε^{7/4}, with ε > 0, when computing the infimum in (3.7a):

\bar{\lambda}_{\phi^l}(3) \le \lim_{\varepsilon \to 0^+} \frac{(2\varepsilon^3 + 3\varepsilon^{7/2})(\varepsilon^2 + \varepsilon^{7/4})}{2(\varepsilon - \varepsilon^{7/4})^3\, \varepsilon^{7/4}} = 1.

3.3.3 Numerical Evaluation of the Worst-Case Revision Rate Ratio Bound for the Smith Rule

Let us denote \bar{\lambda}_{\phi^l} for the Smith rule by \bar{\lambda}_{Smith}. In Figure 3.1, we plot the values of \bar{\lambda}_{Smith}(n_l) for n_l ∈ {3, …, 10}, which we calculated by computing (3.7) numerically. Figure 3.1 also displays the lower bound in (3.9) for \bar{\lambda}_{Smith}. Note that, since the Smith rule satisfies Assumption 1, the lower bound in (3.9) holds for \bar{\lambda}_{Smith} for any n_l ≥ 3.
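The following sketch (Python) indicates how (3.7) can be evaluated numerically for the Smith rule, for which ϕ^l_i(s) = [s]_+ and hence ψ^l_i(s) = [s]_+^2/2. It is only a crude Monte-Carlo search: random sampling of π^l yields an upper estimate of the infimum in (3.7a), and the values reported in Figure 3.1 were obtained with a more careful optimization. For n_l = 3, the estimate comes out close to the value 9.44 shown in Figure 3.1.

import numpy as np

rng = np.random.default_rng(0)

def smith_bar_lambda(n, samples=100000):
    # Smith rule: phi_i(s) = [s]_+, so psi_i(s) = [s]_+^2 / 2 and gamma_i(p) = sum_k psi(p_k - p_i).
    best = np.inf
    for _ in range(samples):
        p = rng.uniform(-1.0, 1.0, size=n)                            # random payoff vector
        diffs = p[None, :] - p[:, None]                               # diffs[i, k] = p_k - p_i
        gamma = 0.5 * np.sum(np.maximum(diffs, 0.0) ** 2, axis=1)     # gamma_i(p)
        for k in range(n):
            phi = np.maximum(p - p[k], 0.0)                           # phi_i(p_i - p_k)
            den = float(phi @ gamma)
            if den > 1e-12:                                           # constraint (3.7b)
                best = min(best, gamma[k] * phi.sum() / den)
    return best

print(smith_bar_lambda(3))   # close to 9.44, the value reported in Figure 3.1 for n_l = 3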
The plots in Figure 3.1 illustrate that the lower bound in (3.9) may be conservative, a consequence of it being valid for a large subclass of IPC rules. Notably, from the values of \bar{\lambda}_{Smith} plotted in Figure 3.1, we observe that the Smith rule yields a δ-passive EDM even if the revision rates vary by a multiplicative factor of 9.44 when n_l = 3. For the n_l = 9 case, the Smith EDM remains δ-passive even when the revision rates vary by a multiplicative factor of 3.1.

Figure 3.1: Comparing \bar{\lambda}_{Smith}(n_l) with the lower bound (n_l − 1)/(n_l − 2) in (3.9), on a logarithmic scale, for n_l ∈ {3, …, 10}; the computed values of \bar{\lambda}_{Smith} are 9.44, 5.89, 4.65, 4, 3.59, 3.3, 3.1 and 2.93 for n_l = 3, …, 10.

3.3.4 Global Asymptotic Stability Guarantees

As we outlined in Remark 8, [10] proves that δ-passivity and (NS) suffice to draw conclusions about the equilibrium stability of the mean closed loop. Consequently, our findings regarding (NS) and δ-passivity for IPC rules under strategy-dependent revision rates directly lead to the theorem below. We state the theorem without proof, as it follows directly from [10, Theorem 2] in conjunction with Lemma 1 and Theorem 1.

Theorem 2 Consider that the payoff mechanism is a PDM and that each population follows an IPC rule. If these learning rules satisfy the conditions of Theorem 1 and the PDM is δ-antidissipative (see Definition 11), then the set of equilibria of the mean closed loop is globally asymptotically stable. In addition, if (ξ*, γ*) is an equilibrium point of the mean closed loop, then ξ* ∈ NE(F_{G,H}).

Recall from Remark 7 that, for the case when the payoff mechanism is memoryless, the δ-antidissipativity requirement reduces to weighted contractivity. This, in combination with Theorem 2, leads to the corollary below.

Corollary 1 Consider that the payoff mechanism is a game F and that each population follows an IPC rule. If these learning rules satisfy the conditions of Theorem 1 and F is weighted contractive (see Remark 7), then the set of equilibria of the mean closed loop is NE(F) and is globally asymptotically stable.

We note that Corollary 1 generalizes the earliest stability results for IPC rules, established in [34, Theorem 7.1], in two ways. In comparison to [34, Theorem 7.1], which presumes that the game is contractive and the revision rates are identical within each population, Corollary 1 allows for weighted contractive games and it contends with unequal revision rates so long as they satisfy the conditions of the corollary. The stability theorems in [10] allow for weighted contractive games, but that article lacks the results needed to consider the case in which the revision rates within each population are different.

3.4 Numerical Examples

As industrial-grade data-driven processing centers and vehicle-to-vehicle networks become more prevalent, the life cycles of the dynamic random-access memories (DRAMs) used in these applications emerge as important benchmarks. To provide examples of how our results can come into play, we examine the HPG, introduced in Example 2, in the context of the DRAM market.

3.4.1 A DRAM Market HPG

Two classes of systems in which DRAMs are commonly used are industrial and automotive systems, which we refer to as classes 1 and 2, respectively. Hence, there are two populations, where we use population l to represent the collection of agents that utilize DRAMs in class l.
We assume that there are 3 manufacturers producing DRAMs, with failure rates in these utilization classes given by λ^1_1 = 5, λ^2_1 = 10, λ^1_2 = 4, λ^2_2 = 9, λ^1_3 = 3 and λ^2_3 = 5, where λ^l_i is the failure rate of DRAMs produced by manufacturer i when utilized in class l. Moreover, we let the replacement costs for industrial and automotive DRAMs be β^1 = 2 and β^2 = 1, respectively.

3.4.2 A Memoryless Payoff Mechanism

We assume that the payoffs are determined by the game F with the structure presented in Example 2. We specify the component price from manufacturer i ∈ {1, 2, 3}, which is the C_i in (3.1), as the sum of a fixed production cost, C_{0i}, and a term reflecting the demand-pull inflation, C_{pi}. To represent the demand-pull inflation on the cost, we use a quadratic term C_{pi}(D_i(ξ)) = (D_i(ξ))^2 = (α^1 ξ^1_i + α^2 ξ^2_i)^2, where α^l is in proportion to the share of class l in the DRAM market. Finally, we set α^1 = 1 and α^2 = 2, and the fixed DRAM production costs to be C_{01} = 1, C_{02} = 1.2 and C_{03} = 1.5, which completes the construction of F (as in (3.1)) for our DRAM market HPG.³ Notice that this F satisfies Assumption 9; hence, as stated in Remark 10, it is a weighted contractive game.

³We would like to clarify that the functions and parameters selected in this section are for illustration purposes; they are not estimated from data.

Figure 3.2: Trajectories of the distributions of DRAM buyers over the manufacturers under the HPG and Smith rule, plotted on the simplex with vertices (1, 0, 0), (0, 1, 0) and (0, 0, 1) and initial condition (2/3, 1/6, 1/6); (a) industrial-grade users, (b) automotive-grade users.

3.4.3 Dynamics Under the Smith Rule

Now we illustrate an application of Corollary 1. Consider the mean closed loop formed by the F constructed above (in Sections 3.4.1 and 3.4.2) and the EDM in which all populations follow the Smith rule with the revision rates specified in Section 3.4.1. Assume that initially the buyers are distributed over the manufacturers according to x^1(0) = x^2(0) = (2/3, 1/6, 1/6). Since the failure rates satisfy the conditions of Theorem 1 and F is weighted contractive, we can invoke Corollary 1 to conclude that x converges to NE(F), which in this case is the singleton {ξ*} with (ξ^1)* = (0, 1, 0) and (ξ^2)* = (0, 0, 1). For this example, the trajectory and the time-domain plot of x are portrayed in Figure 3.2 and Figure 3.3, respectively. We note that the way in which the trajectories in Figure 3.2 are portrayed is a common way to represent the trajectory of the mean social state when the populations have 3 strategies (see, for instance, [7, Chapter 2.3.4]).

Figure 3.3: Time-domain plots of the distributions of DRAM buyers over the manufacturers under the HPG and Smith rule; (a) industrial-grade users (x^1), (b) automotive-grade users (x^2).

3.4.4 A Dynamic Payoff Mechanism

Now, we consider a payoff mechanism that can be viewed as a dynamic extension of Example 2. Our construction of this PDM parallels that in [10, Section VI.A].

Example 4 Given a positive time constant a and parameters as defined in Example 2, the following defines the "smoothed" HPG PDM:

a\,\dot{q}(t) = -q(t) + \begin{bmatrix} C_1\big(D_1(x(t))\big) \\ \vdots \\ C_\kappa\big(D_\kappa(x(t))\big) \end{bmatrix},    (3.10a)
p^l_i(t) = -\beta^l \lambda^l_i - q_i(t).    (3.10b)

Here q(0) ∈ Q = Q_0 := [0, d̄]^κ.
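As an implementation note, (3.10) amounts to a first-order low-pass filter driven by the current component costs. The sketch below (Python; the parameter values echo Sections 3.4.1 and 3.4.2, while the time constant and step size are illustrative assumptions) performs one forward-Euler step of this filter and evaluates the resulting payoffs (3.10b); coupling it with the strategy-dependent Smith EDM of Section 3.1 reproduces the kind of experiment reported in Section 3.4.5.

import numpy as np

a, dt = 5.0, 1e-3                                   # time constant and step size (illustrative)
alpha = np.array([1.0, 2.0])
beta  = np.array([2.0, 1.0])
lam   = np.array([[5.0, 4.0, 3.0],
                  [10.0, 9.0, 5.0]])
c0    = np.array([1.0, 1.2, 1.5])

def pdm_step(q, x):
    # x: (2, 3) mean social state; q: (3,) PDM state.
    cost = c0 + (alpha @ x) ** 2                    # C_i(D_i(x)) with the quadratic term of Section 3.4.2
    q    = q + (dt / a) * (-q + cost)               # Euler step of a*dq/dt = -q + C(D(x))   (3.10a)
    p    = -beta[:, None] * lam - q[None, :]        # p^l_i = -beta^l * lambda^l_i - q_i     (3.10b)
    return q, p

q, p = pdm_step(np.zeros(3), np.full((2, 3), 1.0 / 3.0))
print(np.round(p, 3))                               # effective payoffs perceived by the two populations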
We can also specify a set P that includes all possible p as follows:

P := \{\pi \in \mathbb{R}^{\kappa L} \mid \pi^l_i = -\beta^l \lambda^l_i - \gamma_i \text{ for some } \gamma \in Q\}.

Note that, in (3.10), p is the effective payoff perceived by the agents and q represents a smoothed version of the costs of the component (produced by the different manufacturers). In [21], the authors argue that dynamically modifying a game via smoothing reduces the impact of short-term fluctuations and isolates the longer-term trends. We can motivate this effect in the context of our HPG in two ways. First, even though the costs of the component may change persistently (e.g., due to the demand-pull inflation outlined in Assumption 9), retailers often do not change the sale prices as frequently; rather, the sale prices follow the long-term trends of the costs. Second, it might take agents some time to learn and react to the sale prices, causing the effects of the changes in costs on the payoffs to be smoothed.

Remark 14 It follows immediately from (3.10), (3.1) and (2.2) that, for the smoothed HPG, the stationary game F_{G,H} is identical to F.

3.4.5 Dynamics Under the Smith Rule and the Smoothed HPG

We carry out an analysis analogous to that in Section 3.4.3, but with the PDM given by the smoothed HPG from Section 3.4.4. We select a = 5 in (3.10) while keeping all the other parameters from Section 3.4.1 and Section 3.4.3 unchanged. Note that the C constructed in Section 3.4.2 satisfies Assumption 9, therefore this PDM is δ-antidissipative (see Remark 10). Since the failure rates satisfy the conditions of Theorem 1 and the PDM specified above is δ-antidissipative, we conclude from Theorem 2 that, as in Section 3.4.3, x converges to NE(F). The time evolution of the PDM's state q and the mean social state x are plotted in Figure 3.4, indicating that x^1 and x^2 indeed converge to (0, 1, 0) and (0, 0, 1), respectively.

Figure 3.4: Time-domain plots of the PDM's state and the distributions of DRAM buyers over the manufacturers under the smoothed HPG and Smith rule; (a) state of the smoothed HPG (q), (b) industrial-grade users (x^1), (c) automotive-grade users (x^2).

3.5 Conclusion

In this chapter, we relaxed a fundamental assumption of the population games framework by allowing the agents' revision rates to depend on their strategies. We showed that the resulting social state is again a right-continuous pure-jump Markov process, so the traditional mean-field approximation results are maintained. By applying these findings, we derived an EDM that generalizes the EDM of the traditional framework. However, we observed that the stability results associated with IPC rules no longer hold in the strategy-dependent case. To address this, we established that if a population either has two strategies or its worst-case revision rate ratio satisfies a bound condition, then the population's EDM satisfies (NS) and δ-passivity. Utilizing the results of [10], we concluded that these conditions ensure the global asymptotic stability of the mean closed loop's equilibria, provided the payoff mechanism is a δ-antidissipative PDM or a weighted contractive game. Furthermore, the x-components of these equilibria correspond to NE(F_{G,H}) (or NE(F) in the memoryless case). We illustrated our results using a motivating example, called the hassle vs.
price game. 45 Chapter 4: Excess Payoff Rules Under Strategy-Dependent Revision Rates In Chapter 3, we extended the traditional population games framework to allow for the agents’ revision rates to depend on their strategies. This chapter also investigates this extension, but for a different class of learning rules. Specifically, in Chapter 3, we found that the stability guarantees associated with IPC and EPT rules no longer hold when the revision rates are strategy-dependent. We then identified new conditions on IPC rules and revision rates that secure the stability of Nash equilibria in the extended framework. In this chapter, we focus on broadening the stability guarantees from the EPT class to cover scenarios with strategy-dependent revision rates. Instead of focusing on system-theoretic passivity properties as in Chapter 3, here we seek conditions guaranteeing stability of Nash equilibria when the payoff mechanism is a potential game [32] (see also Section 2.5.2). As we stated in Remark 11, classical results were general enough to handle Nash equilibria stability under potential games for the PC class, hence potential games were not a focus in Chapter 3. Our contributions are twofold: (i) We unveil the existence of EPT rules and λ for which the EDM violates (NS) or (PC). Using these results, we show that for some EPT rules and λ, the mean population state may not converge to a Nash equilibrium under a broad class of potential games. (ii) We propose a modification of the EPT class, which we call sign preserving rate- 46 modified EPT rules. For any learning rule in the modified class and any λ, we prove that the EDM satisfies (NS) and (PC). As we illustrate in Section 4.5, the rules in the modified class are easy to implement. From this chapter onward, for simplicity and ease of notation, we assume a single population. However, it’s important to note that our results are applicable to multi-population scenarios as well. 4.1 EPT Rules, Recap of Existing Results, and Motivation Potential games (see Definition 9) were introduced in [36] and progressed to be analyzed extensively. They were adapted to the population games and evolutionary dynamics framework in [32, 37] and were further investigated in this context in [38]. Congestion games [7], [39] are celebrated examples of potential games. Also, remember that potential games have a valuable connection to distributed optimization [38]: A potential game’s Nash equilibria coincide with the points that satisfy the Karush-Kuhn-Tucker (KKT) conditions for the problem of maximizing the game’s potential over the standard (n− 1)-simplex (see Remark 6). The second key component of our analysis is the EPT class. Due to their suitability for capturing reluctance and moderation in finding best-performing strategies, EPT rules were introduced in [12] as a well-motivated model of individual choice. Applicability of EPT rules in engineering settings has been demonstrated in [40] for water distribution systems and in [41] for distributed wireless networks. Despite their thorough examination, as detailed in Remark 11, existing results [12] on the stability properties from EPT rules are inconclusive under strategy-dependent revision rates, i.e., 47 when there are strategies i and j for which λi ̸= λj . Recall from Remark 8 that the work in [32] establishes that if the payoff mechanism is a potential game F and the EDM satisfies (NS) and (PC), then NE(F) is the globally asymptotically stable equilibrium of the mean closed loop. 
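As a concrete instance of this baseline guarantee, the sketch below (Python; the game, target distribution and step size are illustrative assumptions, not taken from the text) integrates the EDM induced by the BNN rule with identical revision rates for the potential game F(x) = θ − x, whose strictly concave potential f(x) = θᵀx − ‖x‖²/2 is maximized at θ; the state approaches θ, as the cited result predicts. Chapter 4 asks to what extent this behavior survives once the revision rates become strategy-dependent.

import numpy as np

theta = np.array([0.5, 0.3, 0.2])        # assumed target distribution; unique NE of F(x) = theta - x
x     = np.full(3, 1.0 / 3.0)
dt    = 1e-2

for _ in range(20000):
    pi     = theta - x                   # payoffs of the assumed potential game
    pi_hat = pi - (x @ pi)               # excess payoff pi_hat = pi - 1 * (pi^T x)
    phi    = np.maximum(pi_hat, 0.0)     # BNN choice function [pi_hat_j]_+ (with C = 1)
    # With identical rates, the EDM reduces to V_i = phi_i(pi_hat) - x_i * sum_j phi_j(pi_hat).
    x = x + dt * (phi - x * phi.sum())

print(np.round(x, 3))                    # approaches theta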
Therefore, to obtain guarantees on the performance of the EPT class under strategy-dependent revision rates, we will investigate the (NS) and (PC) properties of the resulting EDM. 4.2 The EDM With Strategy-Dependent Revision Rates and Problem Description In Section 3.1, we generalized the population games framework to allow for the agents’ revision rates to depend on their strategies. We showed that the resulting population state X (we are assuming a single population) is a right-continuous pure-jump Markov process and utilized the mean-field techniques summarized in Section 2.4 to obtain the following EDM: Vi(ξ, π) = n∑ j=1, j ̸=i ξjλjρji(ξ, π)− n∑ j=1, j ̸=i ξiλiρij(ξ, π), which differs from the traditional EDM (2.3c) by incorporating the λi and λj factors instead of a common λ factor. As a motivating example of strategy-dependent revision rates, in addition to the examples in Chapter 3, we introduce a distributed task allocation problem. Example 5 Suppose that there is a large number of agents and 3 types of tasks that the agents perform, which constitute their strategy set. Tasks of type i ∈ {1, 2, 3} have expected service time λ−1 i > 0, which may differ from λ−1 j > 0 for i ̸= j. Each agent undertakes a new task, possibly of a different type, only after completing its current one. Hence, the revision rate associated with 48 tasks of type i is λi > 0. At any time t ≥ 0, a central entity assigns a payoff vector P(t) to the types of tasks and disseminates P(t). When an agent completes a task, it decides on the type of its new task in a distributed manner using a learning rule ρ and announces the types of its completed and new tasks to the central entity. Let us denote the fraction of agents working on tasks of type i at time t by Xi(t). Given θ ∈ ∆, the distributed task allocation problem is to find a ρ and a payoff mechanism that drives X near θ in the long run. We will revisit this problem in the upcoming Section 4.5 to demonstrate our results. Distributed task allocation problems in the population games and evolutionary dynamics framework have been studied in [3]. However, the setting in [3] is different from ours because [3] assumes that the revision rates are identical. Remember from Definition 8 that ρ belongs to the EPT class if it can be written as ρij(ξ, π) = φj(π̂), where π̂ := π − 1πT ξ is the excess payoff and φ : Rn → Rn ≥0 is a function that satisfies acuteness in the sense that ϕ(η)Tη > 0 for all η ∈ int(Rn ∗ ) := int(Rn \ int(Rn ≤0)). Then, the EDM induced by an EPT rule φ and strategy-dependent revision rates λ is specified for all i ∈ V , ξ ∈ ∆ and π ∈ P by Vi(ξ, π) = n∑ j=1, j ̸=i ξjλjφi(π̂)− n∑ j=1, j ̸=i ξiλiφj(π̂). As we outline in Remark 11, unlike the PC class, we cannot embed the effects of strategy- dependent revision rates into ρ, because the functional form of the EPT class can only depend on the target strategies, whereas the revision rates depend on the origin strategies. In simpler terms, ρij(ξ, π) = φi(π̂)λj/(1 Tλ) is not an EPT rule. In the nominal case of λ1 = · · · = λn, EPT rules are known to secure the (NS) and (PC) 49 of the EDM, resulting in the global asymptotic stability of NE(F) under the mean closed loop for any potential game F (see Remark 8). However, the above observation implies that we have a different EDM when λi ̸= λj for some i, j ∈ V , and this stability result is no longer valid. Consequently, two important questions arise: • Does the EDM satisfy (NS) and (PC) for every EPT rule and λ ∈ Rn >0? 
We show soon in Section 4.3 that the answer is negative in general (with strategy-dependent revision rates). Thus, we also pose the ensuing problem. • Can we modify the EPT class so that the EDM satisfies (NS) and (PC) for every ρ in the modified class and λ ∈ Rn >0? Sign preserving rate-modified EPT (RM-EPT) rules will be our answer to this problem. 4.3 Performance of EPT Rules Under Strategy-Dependent Revision Rates and RM-EPT Rules In this section, we show that the EPT class does not ensure (NS) and (PC) when the revision rates are allowed to be strategy-dependent. Using our results on (NS) we also show that, for some EPT rules and potential games, the convergence of x to NE(F) is not robust under perturbations of the revision rates. Subsequently, we modify the EPT class to ensure that the resulting EDM satisfies (NS) and (PC) for all λ ∈ Rn >0. We begin by showing that even one of the most ubiquitous EPT rules, the BNN rule, produces an EDM that violates (PC) for some λ. Recall from Section 2.3.2 that the BNN rule is obtained by setting φj(π̂) = max{0, π̂j}/C. 50 Counterexample 2 Consider ρ to be the BNN rule with C = 1 and let ξ = [1/2 0 1/2]T , π = [0 1/2 3/4]T , λ = [1 1 16]T . Substituting these into (2.4), we have V(ξ, π) ̸= 0 and V(ξ, π)Tπ = −0.0156 < 0. Therefore, the EDM specified by ρ and λ violates (PC). Having established that EPT rules may fail to guarantee (PC), we modify them to eliminate this issue. We construct the modified class, which we name rate-modified EPT (RM-EPT) learning rules, by replacing π̂ in Definition 8 with π̃ defined as π̃ := π − 1π̄, π̄ := ∑n i=1 λiξiπi∑n j=1 λjξj . (4.1) We refer to π̄ as the rate-weighted average payoff. Thus, given an EPT rule as in Definition 8, we obtain its rate-modified version by simply replacing π̂ in (2.7a) with π̃. Notice that an EPT rule coincides with its rate-modified counterpart when the revision rates are identical. Remark 15 Suppose that, at every revision time, the revising agent declares (possibly to a coordinator) the payoff of the strategy it was following immediately prior to the revision time. For instance, in our distributed task allocation problem in Example 5, this would correspond to the agents announcing the payoffs they received from the tasks that they recently completed. Since the number of agents is large, the average of the payoffs announced within a short time interval is approximately the rate-weighted average payoff. Consequently, in such a setting, RM-EPT rules can be implemented effortlessly, without any knowledge of λ. Although this modification suffices to ensure (PC), it does not guarantee (NS). Consider 51 the rate-modified version of the EPT rule in [12, Proposition 2.1], characterized by φj(π̃) = n∑ i=1 (ecπ̃i(k + 1)([π̃j]+) k + ecπ̃jc([π̃i]+) k+1), (4.2) where c > 0, k > 0 and (k + 1)ek+2 + 1 ≥ n. A key property of this rule is that it satisfies the positivity condition (Pos), requiring the following implication to be satisfied for all π ∈ Rn: ξ /∈ argmax η∈∆ πTη ⇒ min i,j∈V ρij(ξ, π) > 0. (Pos) The following counterexample shows that the EDM violates (NS) for the learning rule in (4.2) and some λ. Counterexample 3 Let ρ be the RM-EPT rule characterized by (4.2) and assume thatF satisfies int(∆) ̸⊂ NE(F) (for instance, F given by F(ξ) = ξ − (1/n)1 satisfies this criterion because NE(F) = {(1/n)1} = (1/n, . . . , 1/n)). Let us take ξ ∈ int(∆) such that ξ /∈ NE(F) and set π = F(ξ). 
Note that V(ξ, π) = 0 if and only if ρ(ξ, π)TΞλ = Ξλ, (4.3) where Ξ denotes the n-by-n diagonal matrix with its (i, i)-th entry given for all i ∈ V by ξi. From ξ /∈ NE(F), it follows that ρij(ξ, π) > 0 for all i, j ∈ V . Therefore, by ρ(ξ, π) being a stochastic matrix and the Perron-Frobenius theorem, ρ(ξ, π)T has an eigenvector ν that corresponds to the eigenvalue 1 and all entries of ν are positive. Because Ξ is invertible, Ξ−1ν is the unique λ for which Ξλ = ν. In addition, from Ξ−1 being a diagonal matrix with positive diagonal entries and 52 ν having positive entries, it follows that each entry of Ξ−1ν is positive. Thus, choosing λ = Ξ−1ν does not conflict with λ ∈ Rn >0, and with this choice (4.3) holds. Since ξ /∈ NE(F) implies ξ /∈ argmaxη∈∆ π Tη, we conclude that the EDM generated by ρ and λ violates (NS). Remark 16 Our derivations in Counterexample 3 are valid not only for t