ABSTRACT
Title of dissertation: Fast optimization methods for machine learning,
and game-theoretic models of cultural evolution
Soham De
Doctor of Philosophy, 2018
Dissertation directed by: Dr. Tom Goldstein and Dr. Dana Nau
Department of Computer Science
This thesis has two parts. In the first part, we explore fast stochastic optimization
methods for machine learning.
Mathematical optimization is a backbone of modern machine learning. Most ma-
chine learning problems require optimizing some objective function that measures how
well a model matches a data set, with the intention of drawing patterns and making de-
cisions on new unseen data. The success of optimization algorithms in solving these
problems is critical to the success of machine learning, and has enabled the research com-
munity to explore more complex machine learning problems that require bigger models
and larger datasets.
Stochastic gradient descent (SGD) has become the standard optimization routine
in machine learning, and in particular in deep neural networks, due to its impressive
performance across a wide variety of tasks and models. SGD, however, can often be
slow for neural networks with many layers and typically requires careful user oversight
for setting hyperparameters properly. While innovations such as batch normalization and
skip connections have helped alleviate some of these issues, why such innovations are
required eludes full understanding, and it is worthwhile to gain deeper theoretical insights
into these problems and to consider more advanced optimization methods specifically
tailored towards training large complex models.
In this part of the thesis, we review and analyze some of the recent progress made
in this direction, and develop new optimization algorithms that are provably fast, signifi-
cantly easier to train, and require less user oversight. Then, we will discuss the theory of
quantized networks, which use low-precision weights to compress and accelerate neural
networks, and when/why they are trainable. Finally, we discuss some recent results on
how the convergence of SGD is affected by the architecture of neural nets, and we show
using theoretical analysis that wide networks train faster than narrow nets, and deeper
networks train slower than shallow nets – an effect often observed in practice.
In the second part of the thesis, we study the evolution of cultural norms in human
societies using game-theoretic models, drawing from research in cross-cultural psychol-
ogy. Understanding human behavior and modeling how cultural norms evolve in different
human societies is vital for designing policies and avoiding conflicts around the world. In
this part, we explore ways to use computational game-theoretic techniques, and in partic-
ular evolutionary game-theoretic (EGT) models, to gain insight into why different human
societies have different norms and behaviors.
We first describe an evolutionary game-theoretic model to study how norms change
in a society, based on the idea that different strengths of norms in societies translate to
different game-theoretic interaction structures and incentives. We identify conditions that
determine when societies change their existing norms, when they are resistant to such
change, and how this depends on the strength of norms in a society.
Next, we extend this study to analyze the evolutionary relationships between the
tendency to conform and how quickly a population reacts when conditions make a change
in norm desirable. Our analysis identifies conditions when a tipping point is reached in a
population, causing norms to change rapidly.
Next we study conditions that affect the existence of group-biased behavior among
humans (i.e., favoring others from the same group, and being hostile towards others from
different groups). Using an evolutionary game-theoretic model, we show that out-group
hostility is dramatically reduced by mobility. Technological and societal advances over
the past centuries have greatly increased the degree to which humans change physical
locations, and our results show that in highly mobile societies, one’s choice of action is
more likely to depend on what individual one is interacting with, rather than the group to
which the individual belongs.
Fast optimization methods for machine learning,
and game-theoretic models of cultural evolution
by
Soham De
Dissertation submitted to the Faculty of the Graduate School of the
University of Maryland, College Park in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
2018
Advisory Committee:
Dr. Tom Goldstein, Co-Chair/Advisor
Dr. Dana S. Nau, Co-Chair/Advisor
Dr. David W. Jacobs
Dr. Michele J. Gelfand
Dr. John P. Dickerson
© Copyright by
Soham De
2018

Acknowledgments
I am grateful to my advisors Dana Nau and Tom Goldstein for their constant sup-
port, encouragement and guidance over the years, and for giving me the freedom to pur-
sue my own varied interests. Their brilliance and dedication have been an ongoing source
of inspiration. I have been fortunate to also have the opportunity to work closely with
Michele Gelfand from the Psychology department. The numerous fascinating and lively
discussions with her and Dana were among the highlights during my time here.
I was lucky to have been part of an excellent department filled with many wonderful
people. I would like to thank my other thesis defense committee members, David Jacobs
and John Dickerson, for their insightful comments and encouragement. I would also
like to thank Jodie, Sharron, Jenny and Tom Hurst, for always ensuring that everything
ran smoothly in the CS department, and for going out of their way to help me on a few
occasions.
This thesis would not have been possible without the help of my close collaborators
Hao, Karthik, James M., Anirbit, Sohil, Abhay, Zheng, Xinyue and Patrick, to whom I
am truly indebted. I would also like to thank Bhiksha Raj and Karen Livescu, with whom
I was fortunate to be able to work during my undergraduate years, and who were
instrumental in developing my interests in machine learning.
I would also like to acknowledge the many friends I made while pursuing my PhD,
as well as some of my older friends, all of whom made the last few years immensely
enjoyable. These include Siddharth, Soumyadip, Agniv, Jia, Upamanyu, Rishov, Souvik,
Piyana, Udit, Dipankar, Debdipta, Biswadip, Shawon, Arijit, Wrick, Sunandita, Kartik,
Sudha, Manaswi, Bhaskar, Pallabi, Meethu, Sankha, Prashanth, Vicky, Karol, Emmy,
Amit, Prarthana, Arunima, Amrita, Aritra, Anirban and others who I am surely forgetting.
I would also like to thank my uncle and aunt in Maryland who have never made me feel
too far from home.
Finally I would like to thank my parents for their continued love and support, and
for always encouraging me in every endeavor of my life.
Table of Contents
Acknowledgements ii
List of Tables vii
List of Figures viii
1 Introduction and organization of the thesis 1
I FAST & EFFICIENT TRAINING IN MACHINE LEARNING 4
2 Introduction, background and notation 5
2.1 Machine learning as an optimization problem . . . . . . . . . . . . . . . 6
2.2 Stochastic gradient descent . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 On the successes and drawbacks of SGD . . . . . . . . . . . . . . . . . . 10
2.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5 Table of notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3 Automated inference using adaptive batch sizes 14
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2 Big Batch SGD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.1 Preliminaries and motivation . . . . . . . . . . . . . . . . . . . . 18
3.2.2 A template for big batch SGD . . . . . . . . . . . . . . . . . . . 20
3.3 Convergence analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3.1 Comparison to classical SGD . . . . . . . . . . . . . . . . . . . 25
3.4 Practical implementation with backtracking line search . . . . . . . . . . 26
3.5 Adaptive step sizes using the Barzilai-Borwein estimate . . . . . . . . . . 29
3.5.1 Convergence proof . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.5.2 Practical implementation . . . . . . . . . . . . . . . . . . . . . . 33
3.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.6.1 Convex experiments . . . . . . . . . . . . . . . . . . . . . . . . 36
3.6.2 Neural network experiments . . . . . . . . . . . . . . . . . . . . 37
3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4 Distributing SGD using variance reduction 40
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2 CentralVR algorithm: single-worker case . . . . . . . . . . . . . . . . . 44
4.2.1 Algorithm overview . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2.2 Permutation sampling . . . . . . . . . . . . . . . . . . . . . . . 45
4.2.3 Algorithm details for CentralVR . . . . . . . . . . . . . . . . . . 46
4.3 Convergence analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.4 Distributed algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.4.1 Synchronous version . . . . . . . . . . . . . . . . . . . . . . . . 53
4.4.2 Asynchronous version . . . . . . . . . . . . . . . . . . . . . . . 54
4.5 Distributed variants of SVRG and SAGA . . . . . . . . . . . . . . . . . 55
4.5.1 Distributed SVRG . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.5.2 Distributed SAGA . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.6 Empirical results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.6.1 Single worker results . . . . . . . . . . . . . . . . . . . . . . . . 58
4.6.2 Distributed results . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5 Investigating training methods for quantized neural nets 70
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2 Background and related work . . . . . . . . . . . . . . . . . . . . . . . . 72
5.3 Algorithms for training quantized neural nets . . . . . . . . . . . . . . . 73
5.4 Convergence analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.4.1 Convergence of Stochastic Rounding (SR) . . . . . . . . . . . . . 76
5.4.2 Convergence of Binary Connect (BC) . . . . . . . . . . . . . . . 78
5.5 What about non-convex problems? . . . . . . . . . . . . . . . . . . . . . 80
5.5.1 Toy problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.5.2 Asymptotic analysis of Stochastic Rounding . . . . . . . . . . . 84
5.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.6.1 A way forward: big batch training . . . . . . . . . . . . . . . . . 96
5.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6 Why is SGD so fast for neural nets? 99
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.2 SGD is fast when gradient confusion is low . . . . . . . . . . . . . . . . 103
6.2.1 Conditions for even faster convergence . . . . . . . . . . . . . . 106
6.3 Over-parameterized problems have low gradient confusion . . . . . . . . 109
6.3.1 A simple case: linear regression . . . . . . . . . . . . . . . . . . 110
6.3.2 Linear neural networks . . . . . . . . . . . . . . . . . . . . . . . 116
6.3.3 Extension to arbitrary depth linear networks . . . . . . . . . . . . 119
6.3.4 More general neural networks . . . . . . . . . . . . . . . . . . . 122
6.3.5 Beyond linearly generated data . . . . . . . . . . . . . . . . . . . 125
6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
II STUDYING THE EVOLUTION OF CULTURAL NORMS 130
7 Using game theory to study the evolution of cultural norms 131
7.1 Evolutionary game theory in biology . . . . . . . . . . . . . . . . . . . . 132
7.2 Modeling cultural evolution . . . . . . . . . . . . . . . . . . . . . . . . . 135
7.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
8 Understanding norm change in human societies 139
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
8.2 Proposed model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
8.2.1 Replicator dynamic on infinite well-mixed populations . . . . . . 148
8.2.2 Agent simulations on finite networks . . . . . . . . . . . . . . . 157
8.3 Evolving exploration rates . . . . . . . . . . . . . . . . . . . . . . . . . 160
8.4 Significance of the work . . . . . . . . . . . . . . . . . . . . . . . . . . 165
9 Tipping points for norm change in human cultures 167
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
9.2 Background and related work . . . . . . . . . . . . . . . . . . . . . . . . 168
9.3 Proposed evolutionary game-theoretic model . . . . . . . . . . . . . . . 169
9.3.1 When does norm change occur? . . . . . . . . . . . . . . . . . . 171
9.3.2 Rate of norm change in tight vs. loose cultures . . . . . . . . . . 173
9.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
10 On the evolution of ethnocentrism in human cultures 176
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
10.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
10.2.1 Empirical analysis . . . . . . . . . . . . . . . . . . . . . . . . . 183
10.3 Significance of the work . . . . . . . . . . . . . . . . . . . . . . . . . . 186
10.4 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
10.4.1 Evolutionary dynamics of our model . . . . . . . . . . . . . . . . 187
10.4.2 Clustering coefficient . . . . . . . . . . . . . . . . . . . . . . . . 189
10.4.3 Strategy set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
10.4.4 Mutation rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
10.4.5 Range of mobility . . . . . . . . . . . . . . . . . . . . . . . . . 191
Bibliography 193
List of Tables
4.1 Distributed Algorithms Proposed . . . . . . . . . . . . . . . . . . . . . . 44
5.1 VGG-9 on CIFAR-10. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.2 VGG-BC for CIFAR-10. . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.3 Top-1 test error after training with full-precision (ADAM), binarized weights
(R-ADAM, SR-ADAM, BC-ADAM), and binarized weights with big batch
size (Big SR-ADAM). . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
List of Figures
3.1 Convex experiments. Left to right: Ridge regression on MILLIONSONG;
Logistic regression on COVERTYPE; Logistic regression on IJCNN1.
The top row shows how the norm of the true gradient decreases with the
number of epochs, the middle and bottom rows show the batch sizes and
step sizes used on each iteration by the big batch methods. Here ‘passes
through the data’ indicates number of epochs, while ‘iterations’ refers to
the number of parameter updates used by the method (there may be mul-
tiple iterations during one epoch). . . . . . . . . . . . . . . . . . . . . . 35
3.2 Neural Network Experiments. The three columns from left to right cor-
respond to results for CIFAR-10, SVHN, and MNIST, respectively. The
top row presents classification accuracies on the training set, while the
bottom row presents classification accuracies on the test set. . . . . . . . 36
4.1 Single Worker Results. Logistic regression on toy dataset; Ridge regres-
sion on toy data; Logistic regression on IJCNN1 dataset; Ridge regression
on MILLIONSONG dataset; In each case CentralVR converges much
faster than SVRG and SAGA. . . . . . . . . . . . . . . . . . . . . . . . 58
4.2 Distributed Results on toy datasets for CentralVR-Sync and CentralVR-
Async, compared to Distributed SVRG (Section 4.5.1), Distributed SAGA
(Section 4.5.2), Parameter Server SVRG and EASGD. Left two plots:
Convergence curve for Logistic and ridge regression on synthetic data
over 192 nodes. Right two plots: Time required for convergence as num-
ber of local workers is increased (data on each local worker is constant –
i.e., total data scales linearly with the number of local workers) for logis-
tic and ridge regression. . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.3 Distributed Results on SUSY and MILLIONSONG for CentralVR-Sync
and CentralVR-Async, compared to Distributed SVRG (Section 4.5.1),
Distributed SAGA (Section 4.5.2), Parameter Server SVRG (Param Server
SVRG) and EASGD. (Left two plots) Convergence curve for Logistic re-
gression and ridge regression on SUSY over 500 nodes and on MILLION-
SONG over 240 nodes. (Right two plots) Time required for convergence
as number of local workers is increased. . . . . . . . . . . . . . . . . . . 69
5.1 The SR method starts at some location w (in this case 0), adds a pertur-
bation to w, and then rounds. As the learning rate α gets smaller, the
distribution of the perturbation gets “squished” near the origin, making
the algorithm less likely to move. The “squishing” effect is the same for
the part of the distribution lying to the left and to the right of w, and so it
does not affect the relative probability of moving left or right. . . . . . . . 82
5.2 Effect of shrinking the learning rate in SR vs BC on a toy problem. The
left figure plots the objective function (5.8). Histograms plot the distribution
of the quantized weights over $10^6$ iterations. The top row of plots
corresponds to BC, while the bottom row is SR, for different learning rates
α. As the learning rate α shrinks, the BC distribution concentrates on a
minimizer, while the SR distribution stagnates. . . . . . . . . . . . . . . 83
5.3 Markov chain example with 3 states. In the right figure, we halved each
transition probability for moving between states, with the remaining prob-
ability put on the self-loop. Notice that halving all the transition proba-
bilities would not change the equilibrium distribution, and instead would
only increase the mixing time of the Markov chain. . . . . . . . . . . . . 85
5.4 Percentage of weight changes during training of VGG-BC on CIFAR-10. 96
5.5 Effect of batch size on SR-ADAM when tested with ResNet-56 on CIFAR-
10. (a) Test error vs epoch. Test error is reported with dashed lines, train
error with solid lines. (b) Percentage of weight changes since initializa-
tion. (c) Percentage of weight changes per every 5 epochs. . . . . . . . . 96
6.1 Simulation proof for Theorem 6.3.1. As the dimensionality of a random
linear regression problem increases, the probability of violating the gra-
dient confusion condition η > 0.1 vanishes. . . . . . . . . . . . . . . . . 111
6.2 How width affects convergence curves and gradient inner products. . . . . 128
6.3 How depth affects convergence curves and gradient inner products. . . . . 128
6.4 Effect of batch normalization and skip connections on a Wide ResNet . . 129
7.1 Graph of $\frac{1}{1+e^{s(u_a - u_n)}}$, for $s = 5$ and $-1 \le u_a - u_n \le 1$. . . . . . . . . . . 136
8.1 Individual payoff matrices. Mc denotes the coordination game and Mf
denotes the fixed-payoff game used in our model. . . . . . . . . . . . . . 142
8.2 Weighted payoff matrix M defined as M = cMc + (1− c)Mf . . . . . . . 143
8.3 Updated payoff matrix after assuming ac − bc = af − bc and adding a
suitable constant to the payoffs in M in Figure 8.2. . . . . . . . . . . . . 144
8.4 Figures show the change in the proportion of B agents with time with a
well-mixed infinite population where reproduction is determined by the
replicator dynamic with b > a. . . . . . . . . . . . . . . . . . . . . . . . 154
8.5 Figure shows the rate of change of B agents versus the proportion of
B agents, with a well-mixed infinite population where reproduction is
determined by the replicator dynamic with b > a. . . . . . . . . . . . . . 155
8.6 Simulations with the Fermi rule on a toroidal grid of size 2500. From top
to bottom: c = 1.0, c = 0.75, c = 0.5. Initially: a = 1.0, b = 1.15. We
use a structural shock at 2500 iterations, after which the payoffs become:
a = 1.15, b = 1.0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
8.7 Replicator-mutator dynamic on an infinite well-mixed population with
a = 0.4 and b = 0.6. The solid and dotted lines denote c = 0.05 and
c = 0.3, respectively. The colors denote the exploration rates. . . . . . . . 161
8.8 Simulations with the Fermi rule on a toroidal grid of size 2500, with struc-
tural shocks at intervals of 75 iterations. From left to right: c = 1.0,
c = 0.8, c = 0.5. Initially: a = 1.0, b = 1.15. The left column shows
proportions of norms A and B. The right column shows proportions of
the population that use each different exploration rate. . . . . . . . . . . . 162
9.1 Plot of (9.2) for different values of k. . . . . . . . . . . . . . . . . . . . . 171
9.2 Left: Heatmap of the right-hand side in (9.5) when xB = 0.1, for various
uB −uA and k values. Right: Heatmap of the right-hand side in (9.7), for
various uB − uA and k values. Best viewed in color. . . . . . . . . . . . . 172
9.3 Left: Plot of (9.4) at uB − uA = 0.7. Right: Heatmap of $\max_{x_B} \dot{x}_B$ for
various k and m values, with uB − uA = 0.7. Best viewed in color. . . . . 174
10.1 Prisoner’s Dilemma payoff matrix used in our model. . . . . . . . . . . . 177
10.2 Sequence of events at each time step in our evolutionary game-theoretic
model. The sequence of steps are the same as in Hammond and Axelrod’s
paper [HA06] except for the Mobility stage, which is new. For additional
details, see the Methods section. . . . . . . . . . . . . . . . . . . . . . . 178
10.3 Proportions of actions and strategies as a function of mobility, after 30,000
iterations, averaged over 100 simulation runs. The plots show the propor-
tions of (a) the group-entitative and individual-entitative agents, (b) the
actions played by the agents, (c) the strategies of the individual-entitative
agents, (d) the in-group and (e) out-group strategies of the group-entitative
agents, (f) the degree of clustering on the grid. . . . . . . . . . . . . . . . 182
10.4 Single simulation run for 20000 generations with no mobility (m = 0).
(a) Proportions of group-entitative and individual-entitative agents. (b)
Relative proportions of the individual-entitative agents’ strategies; Rela-
tive proportions of the group-entitative agents’ (c) in-group and (d) out-
group strategies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
10.5 Single simulation run for 30000 generations with no mobility (m = 0.05).
(a) Proportions of group-entitative and individual-entitative agents. (b)
Relative proportions of the individual-entitative agents’ strategies; Rela-
tive proportions of the group-entitative agents’ (c) in-group and (d) out-
group strategies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
10.6 Cooperation breaking down at higher mobility values. Each data point
is an average of 100 individual simulation runs. The plots show (a) the
proportion of agents cooperating and defecting; and (b) over an agent’s
lifetime, the average number of unique opponents it encounters, and the
average number of games played against each of them. . . . . . . . . . . 192
Chapter 1: Introduction and organization of the thesis
This thesis has two parts. In the first part, we explore fast stochastic optimization
methods for machine learning. In the second part of the thesis, we study the evolution of
cultural norms in human societies using game-theoretic models, drawing from research
in cross-cultural psychology. In this chapter, we provide a brief overview of each part of
the thesis.
Fast and efficient training in machine learning
Mathematical optimization is a backbone of modern machine learning. Most ma-
chine learning problems require optimizing some objective function that measures how
well a model matches a data set, with the intention of drawing patterns and making de-
cisions on new unseen data. The success of optimization algorithms in solving these
problems is critical to the success of machine learning, and has enabled the research com-
munity to explore more complex machine learning problems that require bigger models
and larger datasets.
Stochastic gradient descent (SGD) has become the standard optimization routine
in machine learning, and in particular in deep neural networks, due to its impressive
performance across a wide variety of tasks and models. SGD, however, can often be
slow for neural networks with many layers and typically requires careful user oversight
for setting hyperparameters properly. While innovations such as batch normalization and
skip connections have helped alleviate some of these issues, why such innovations are
required eludes full understanding, and it is worthwhile to gain deeper theoretical insights
into these problems and to consider more advanced optimization methods specifically
tailored towards training large complex models.
In this part of the thesis, we review and analyze some of the recent progress made
in this area, develop new optimization algorithms of our own, and theoretically and em-
pirically analyze the performance of existing well-known optimization techniques. In
Chapter 2, we review existing work in this area, and present some of the open problems
that we explore in the rest of the thesis. In Chapters 3 and 4, we develop new optimiza-
tion algorithms that are provably fast, significantly easier to train, and require less user
oversight. In Chapter 5, we discuss the theory of quantized networks, which use low-
precision weights to compress and accelerate neural networks, and when/why they are
trainable. Finally in Chapter 6, we discuss some recent results on how the convergence of
SGD is affected by the architecture of neural nets, and we show using theoretical analysis
that wide networks train faster than narrow nets, and deeper networks train slower than
shallow nets – an effect often observed in practice.
Studying the evolution of cultural norms
Understanding human behavior and modeling how cultural norms evolve in dif-
ferent human societies is vital for designing policies and avoiding conflicts around the
world. In this part, we explore ways to use computational game-theoretic techniques, and
in particular evolutionary game-theoretic (EGT) models, to gain insight into why different
human societies have different norms and behaviors.
In Chapter 7, we introduce evolutionary game theory, and review how it has been
previously used to study biological and cultural evolution.
In Chapter 8, we describe an evolutionary game-theoretic model to study how
norms change in a society, based on the idea that different strengths of norms in societies
translate to different game-theoretic interaction structures and incentives. We identify
conditions that determine when societies change their existing norms, when they are
resistant to such change, and how this depends on the strength of norms in a society.
Next, in Chapter 9, we extend this study to analyze the evolutionary relationships
between the tendency to conform and how quickly a population reacts when conditions
make a change in norm desirable. Our analysis identifies conditions when a tipping point
is reached in a population, causing norms to change rapidly.
Finally, in Chapter 10, we study conditions that affect the existence of group-biased
behavior among humans (i.e., favoring others from the same group, and being hostile
towards others from different groups). Using an evolutionary game-theoretic model, we
show that out-group hostility is dramatically reduced by mobility. Technological and
societal advances over the past centuries have greatly increased the degree to which humans
change physical locations, and our results show that in highly mobile societies, one’s
choice of action is more likely to depend on what individual one is interacting with, rather
than the group to which the individual belongs.
Part I
FAST & EFFICIENT TRAINING IN MACHINE LEARNING
Chapter 2: Introduction, background and notation
Interest in the field of machine learning has grown rapidly over the past decade,
and machine learning is now generally considered to be one of the key components in
building intelligent systems. Millions of people today use applications that run on machine
learning algorithms, in the form of recommendation systems on platforms such as Amazon or
Netflix, search engines like Google, speech recognition software such as Apple’s Siri or the
Google Assistant, or image recognition software used on various social media websites.
Machine learning algorithms draw inferences from massive amounts of data by building
a mathematical model to capture patterns or make predictions. As computing resources
become increasingly powerful and more easily accessible, machine learning has become
increasingly prevalent, and will most likely continue to do so over the coming years.
Mathematical optimization is one of the backbones of modern machine learning.
Most machine learning problems can be formulated as optimizing some objective based
on a current available set of data (a process typically called training), with the intention
of drawing patterns and making decisions on new unseen data (testing). The success of
optimization algorithms in solving these problems is critical to the success of machine
learning, and has led the research community to explore more complex machine learning
problems that require more complex mathematical models and larger datasets.
Due to the increasing size of datasets, complex machine learning models can take
days to train even with high-performance computing hardware. Moreover, there is a
need for efficient optimization algorithms specifically tailored towards training on huge
datasets. Thus, there has been widespread interest recently, not only in more efficient
optimization algorithms, but also in coming up with heuristics that enable existing op-
timization algorithms to work better. In this thesis, we review and analyze some of the
recent progress made in this direction, and develop several optimization algorithms of our
own that are provably fast. Using a principled approach, we also investigate and provide
a theoretical justification for why certain optimization algorithms and certain heuristics
have been successful in training complex models, while others have not.
For the rest of this chapter, we provide an introduction to optimization methods for
solving large-scale machine learning problems and define the notation to be used in the
rest of this part of the thesis. In the next section, we show how many popular machine
learning models can be formulated as solving an optimization problem. In subsequent
sections, we review some existing algorithms that have been successfully used to solve
such large-scale problems, and investigate the open questions in this area. Finally, we
summarize the main contributions of this part of the thesis.
2.1 Machine learning as an optimization problem
Consider the simple case of linear regression, which is used to model a linear re-
lationship between independent variables x and a dependent variable y. Suppose we are
given a dataset of n observations: {(x1, y1), (x2, y2), . . . , (xn, yn)}. In linear regression,
the objective is to find parameters w such that: 〈w,xi〉 = yi,∀i, where 〈w,x〉 denotes the
inner product between w and x. This problem may not be solvable for a variety of
reasons; for example, the underlying relationship between x and y may not be linear, or
the observations in the dataset may contain measurement noise. Thus, the typical
approach is to find parameters w such that 〈w,xi〉 is close to yi on average. This is
typically formulated as the following optimization problem:
\[ \min_{w} f(w) := \frac{1}{n} \sum_{i=1}^{n} \left( \langle w, x_i \rangle - y_i \right)^2 . \]
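As a concrete illustration, the least-squares objective above and its gradient can be evaluated in a few lines of NumPy. This is only a minimal sketch: the function names, and the convention that the rows of X hold the samples xi, are assumptions made here for illustration and do not come from the thesis.

```python
import numpy as np

def least_squares_loss(w, X, y):
    """f(w) = (1/n) * sum_i (<w, x_i> - y_i)^2, with samples stored as rows of X."""
    residuals = X @ w - y
    return np.mean(residuals ** 2)

def least_squares_grad(w, X, y):
    """Gradient of f with respect to w: (2/n) * X^T (X w - y)."""
    return 2.0 * X.T @ (X @ w - y) / X.shape[0]

# Tiny example: three samples in two dimensions, labels generated by w = (1, 2).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
w_star = np.array([1.0, 2.0])
y = X @ w_star
loss_at_zero = least_squares_loss(np.zeros(2), X, y)   # residuals equal -y
loss_at_solution = least_squares_loss(w_star, X, y)    # exact fit
```

Here the data are noiseless, so the loss vanishes at the generating parameters; with noisy labels the minimizer only fits the data on average.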
Similar to the linear regression case, many popular machine learning problems can
be formulated as optimization problems of the form:
\[ \min_{w} f(w) := \frac{1}{n} \sum_{i=1}^{n} f_i(w; x_i), \tag{2.1} \]
where {xi} is a collection of data drawn from some unknown probability distribution
p. In typical machine learning applications, each term fi(w; xi) measures how well a
model with parameters w fits one particular data observation xi. Given a dataset D of
n data samples {xi}, f(w) measures how well the model fits the entire corpus of data
on average. This is typically called an empirical risk minimization problem, and it is
an estimate of the true problem we want to solve, i.e., the expected risk minimization
problem: $\min_{w} \mathbb{E}_{x_i \sim p}[f_i(w; x_i)]$. Since we typically don’t have enough information on
the underlying data distribution p to solve the expected risk minimization problem, we
typically solve (2.1) instead.
For supervised learning problems, where the objective is to predict a value/label
based on some input, each data sample x in the dataset D has a corresponding label
y = C(x), for some unknown labeling function C. In this case, a training pair refers
to the tuple (x,y). We consider that x is a d-dimensional vector with x ∈ Rd, unless
specified otherwise. For clarity in presentation, from here on, we denote fi(w; xi) as
fi(w). We sometimes also use fx to denote the model’s loss corresponding to a data
sample x, which will be clear from the context. Some of the notation used in the rest of
the thesis is summarized in Section 2.5.
Logistic regression Many popular machine learning models use objective functions of
the same form as in (2.1). For example, logistic regression, which is a linear model for
doing binary classification (i.e., distinguishing between two classes of data), uses the
following objective function: fi(w) = log(1 + exp(−yi〈w,xi〉)), where yi ∈ {+1, −1} denotes the
binary label; the objective f is the average of these terms over the n observations.
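The logistic regression objective can be sketched in NumPy as follows. The function names and the row-wise data layout are illustrative assumptions; `np.logaddexp` is used only to evaluate log(1 + exp(·)) without overflow.

```python
import numpy as np

def logistic_loss(w, X, y):
    """Average of f_i(w) = log(1 + exp(-y_i <w, x_i>)) with labels y_i in {+1, -1}."""
    margins = y * (X @ w)
    # logaddexp(0, -m) equals log(1 + exp(-m)) and does not overflow for large |m|
    return np.mean(np.logaddexp(0.0, -margins))

def logistic_grad(w, X, y):
    """Gradient of the averaged logistic loss with respect to w."""
    margins = y * (X @ w)
    # d/dm log(1 + exp(-m)) = -1 / (1 + exp(m)) = -exp(-logaddexp(0, m))
    coeff = -y * np.exp(-np.logaddexp(0.0, margins))
    return X.T @ coeff / X.shape[0]

# Two samples with opposite labels; at w = 0 every term equals log(2).
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
loss_at_zero = logistic_loss(np.zeros(2), X, y)
grad_at_zero = logistic_grad(np.zeros(2), X, y)
```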
Neural networks Another powerful class of models that are formulated as (2.1) are
deep neural networks. Neural networks use a series of non-linear transformations to build
highly complex and flexible function approximators. The output of a typical deep neural
network with β + 1 layers is given by:
ŷi = σ(Wβσ(. . . σ(W1σ(W0xi + b0) + b1) . . . ) + bβ).
Here xi ∈ Rd is the input data sample to the neural net, the W’s denote the weight
matrices, b’s denote the bias vectors, and ŷi denotes the output of the neural network. The
function σ(.) is typically non-linear and applied point-wise to its arguments. Common
choices for σ(.) are the sigmoid function: σ(x) = 1/(1+exp(−x)), or the ReLU: σ(x) =
max(0, x). This sequence of non-linear transformations helps the neural network express
complex function classes. The shapes of the weight matrices and biases are such that
the output ŷi is the same size as the label yi. Thus, for this neural net, the parameters of
the model are given by
\[ w = [\, \mathrm{vec}(W_0)^\top \; \mathrm{vec}(W_1)^\top \; \cdots \; \mathrm{vec}(W_\beta)^\top \; b_0^\top \; b_1^\top \; \cdots \; b_\beta^\top \,]^\top \]
(where we imagine all vectors to be column vectors by default and ⊤ denotes the transpose
operator). Neural networks have been very successful at a wide range of applications, and
the loss function used depends on the specific application. For multi-class classification,
a common loss function is the cross entropy, where each fi would have the form
\[ f_i(w) = -\sum_{j=1}^{c} (y_i)_j \log (\hat{y}_i)_j , \]
where c is the number of classes (thus the dimensions of yi and ŷi are also c). One can also
use the L2 loss function for regression problems, where each fi would be
$f_i(w) = \| y_i - \hat{y}_i \|^2$.
Other examples of machine learning models that follow a similar form as (2.1) are
support vector machines, matrix completion and graph cuts, among others.
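The layered form of the network output and the cross-entropy loss described above can be sketched as follows. This is a hypothetical minimal forward pass, assuming the same activation σ is applied after every layer as in the formula for ŷi; the helper names and the small constant `eps` guarding the logarithm are assumptions added for illustration.

```python
import numpy as np

def sigmoid(z):
    """Point-wise sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, activation=sigmoid):
    """Apply h <- sigma(W h + b) layer by layer, starting from h = x."""
    h = x
    for W, b in zip(weights, biases):
        h = activation(W @ h + b)
    return h

def cross_entropy(y, y_hat, eps=1e-12):
    """f_i = -sum_j (y)_j log (y_hat)_j; eps (an assumption) avoids log(0)."""
    return -np.sum(y * np.log(y_hat + eps))

# Two layers mapping R^3 -> R^4 -> R^2; with zero parameters every output is sigmoid(0).
weights = [np.zeros((4, 3)), np.zeros((2, 4))]
biases = [np.zeros(4), np.zeros(2)]
y_hat = forward(np.ones(3), weights, biases)
loss = cross_entropy(np.array([1.0, 0.0]), y_hat)
```

In practice the final layer for classification is usually a softmax rather than a point-wise activation; the sketch follows the formula in the text instead.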
2.2 Stochastic gradient descent
Traditionally, optimization problems of the form (2.1) have been solved using it-
erative deterministic optimization methods. A popular example of such a method is the
gradient descent algorithm, which uses iterative updates of the form:
\[ w_{k+1} = w_k - \alpha \nabla f(w_k), \]
where α denotes the step size, and ∇f denotes the gradient of f w.r.t. the parameters w.
Deterministic optimization methods like gradient descent enjoy fast convergence
rates and require less user oversight for setting the step size α, and thus are easy to use.
However, when n is large (or even infinite) and the model is large, as is often the case
in modern machine learning, it becomes intractable to exactly evaluate f(w) or its gra-
dient ∇f(w), which makes classical gradient methods impossible. In such situations,
the method of choice for minimizing (2.1) is the stochastic gradient descent (SGD)
algorithm [RM51]. On iteration k, SGD uses an approximation f̃k of the true function f, and
then computes
\[ w_{k+1} = w_k - \alpha_k \nabla \tilde{f}_k(w_k), \tag{2.2} \]
where αk denotes the step size used on the k-th iteration. Typically, f̃k is an unbiased
estimate of f, where a batch Bk ⊆ D of data is selected uniformly at random on each
iteration k. Thus, $\tilde{f}_k(w) = \frac{1}{|B_k|} \sum_{x_i \in B_k} f_i(w)$. Note that $\mathbb{E}_{B_k}[\nabla \tilde{f}_k(w_k)] = \nabla f(w_k)$, and
so the calculated gradient ∇f̃k(wk) can be interpreted as a “noisy” approximation to the
true gradient.
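Update (2.2) with uniformly sampled mini-batches can be sketched as below. The constant step size, the toy noiseless least-squares problem, and all names (`sgd`, `batch_grad`) are assumptions chosen for illustration; they are not the experimental setup of the thesis.

```python
import numpy as np

def sgd(batch_grad, w0, n, step, batch_size, iters, seed=0):
    """Run w_{k+1} = w_k - step * grad f_k(w_k) with uniformly sampled batches."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(iters):
        # Batch B_k: indices drawn uniformly at random without replacement
        idx = rng.choice(n, size=batch_size, replace=False)
        w = w - step * batch_grad(w, idx)
    return w

# Toy problem (assumption): noiseless least squares with a known solution.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

def batch_grad(w, idx):
    """Gradient of the mini-batch least-squares loss."""
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)

w_hat = sgd(batch_grad, np.zeros(3), n=200, step=0.05, batch_size=10, iters=2000)
```

Because the labels here are noiseless, every per-sample gradient vanishes at the solution and a constant step size suffices; with noisy data the iterates would instead hover around the minimizer.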
2.3 On the successes and drawbacks of SGD
Stochastic gradient descent (SGD) has become one of the most popular optimiza-
tion algorithms for training deep neural networks, achieving impressive generalization
performance across a wide variety of tasks and models. When SGD’s hyper-parameters
(learning rate, batch size) are set properly, it usually achieves good generalization performance
compared to other optimization algorithms on a variety of benchmark neural network
tasks [WRS+17, KS17, SMDH13]. There are, however, a number of open questions and
well-known limitations of SGD.
SGD can often be slow for neural networks with many layers, or ones with recurrent
connections. While innovations such as batch normalization and skip connections have
helped alleviate this issue to a certain extent, why such techniques are required eludes full
understanding, and it is worthwhile to gain deeper theoretical insights into these problems.
A major drawback of SGD is that it requires careful user oversight for setting the
step size schedule. Performance is very sensitive to this choice, and all state-of-the-art
results were achieved with a very careful choice of the learning rate schedule. While there has
been some recent work on methods for automatically setting step sizes for stochastic
algorithms [KB14, MH15, SZL13, TMDQ16], these methods are largely heuristic, lack theoretical
guarantees on convergence rates, and don’t work well in practice either [WRLG18].
Moreover, as datasets grow larger and models become more complex (such as
neural networks of increasing depth), SGD typically takes much longer to train to
high accuracies (i.e., convergence rates are slow). Further, because SGD is an inherently
sequential algorithm and its gradients are noisy, it can’t be efficiently distributed
over computing clusters. This indicates the need for faster optimization methods
for training these models.
2.4 Contributions
In this part of the thesis, we explore a few of the open questions mentioned in
Section 2.3. We list the main contributions below.
In Chapter 3, we develop stochastic optimization algorithms that require no user
oversight by automatically setting the hyperparameters of SGD. This is done by adap-
tively growing the batch size over time to control the amount of noise in the gradient
estimate relative to the signal in the gradient estimate. Controlling the noise, in turn,
makes the process of setting step sizes much easier, and we present various adaptive step
size methods that have provable convergence rate guarantees, as well as good empirical
performance on a wide range of machine learning models and datasets.
In Chapter 4, we explore a variant of SGD that has a provably faster convergence
rate. We show that this variant can scale linearly over hundreds of computing cores and
can speed up training of machine learning models on massive datasets without experiencing
the slowdown that existing stochastic methods experience. This is done by leveraging
a class of stochastic algorithms called variance reduction methods, which explicitly reduce the
variance in the SGD gradient estimate by adding an error correction term.
In Chapter 5, we investigate quantized networks, which use low-precision weights
to compress and accelerate neural networks. We discuss the theory of quantized net-
works, and when/why they are trainable. In particular, we show that quantized training
algorithms that exploit high-precision representations have an important greedy search
phase that purely quantized training methods lack, which explains the difficulty of train-
ing using low-precision arithmetic.
Finally, in Chapter 6, we explore why SGD is efficient for neural nets when tuned
properly, and how neural net design affects SGD. In particular, we investigate how
over-parametrization (an increase in the number of parameters beyond the number of training
data points, which is the typical setting in most neural network problems) affects the dynamics of SGD.
We find that wide networks train faster than narrow nets, and deeper networks train slower
than shallow nets – an effect often observed in practice.
2.5 Table of notation
d           data dimension
n           number of data points
w           vector of parameters of the machine learning model (boldface denotes a vector)
x           d-dimensional input data sample
y           label of the input data sample; we assume y = C(x) for a labeling function C
D           training data; $D = \{(x_i, y_i)\}_{i=1}^{n}$ for supervised problems, $D = \{x_i\}_{i=1}^{n}$ otherwise
B           set of data points chosen in the mini-batch, i.e., B ⊆ D
k           current iteration of the optimizer
fx or fi    scalar function denoting the model’s loss corresponding to training pair (xi, yi)
f           scalar loss function to be minimized; typically $f = \frac{1}{n} \sum_{i=1}^{n} f_i$
f̃k          approximation of f at iteration k used by stochastic optimization algorithms
f̃B          (overloading notation) approximation of f by using mini-batch B ⊆ D
(v)i        i-th element of the vector v
∇γ(v)       gradient of a scalar function γ, i.e., $(\nabla \gamma(v))_i = \partial \gamma / \partial (v)_i$
〈v1,v2〉     inner product between two vectors, i.e., $\langle v_1, v_2 \rangle = \sum_{i=1}^{d} (v_1)_i \cdot (v_2)_i$
‖v‖         L2 norm of vector v, unless otherwise specified; i.e., $\|v\| = \sqrt{\sum_{i=1}^{d} (v)_i^2}$
Chapter 3: Automated inference using adaptive batch sizes
3.1 Introduction
SGD uses noisy gradient approximations to solve (2.1). Since the gradient approxi-
mations are noisy, the step size αk must vanish as k →∞ to guarantee convergence of the
method. Typical step size rules require the user to find the optimal decay rate schedule,
which usually requires an expensive grid search over different possible parameter values.
In this chapter, we propose a “big batch” strategy for SGD. Rather than letting the step size
vanish over time as the iterates approach a minimizer, we let the mini-batch B adaptively
grow in size to maintain a constant signal-to-noise ratio of the gradient approximation.
This prevents the algorithm from getting overwhelmed with noise, and guarantees con-
vergence with an appropriate constant step size. Recent results [KMN+16] have shown
that large fixed batch sizes fail to find good minimizers for non-convex problems like
deep neural networks. Adaptively increasing the batch size over time overcomes this lim-
itation: intuitively, in the initial iterations, the increased stochasticity (corresponding to
smaller batches) can help land the iterates near a good minimizer, and larger batches later
on can increase the speed of convergence towards this minimizer.
Using this batching strategy, we show that we can keep the step size constant, or let
it adapt using a simple Armijo backtracking line search, making the method completely
adaptive with no user-defined parameters. We also derive an adaptive step size method
based on the Barzilai-Borwein [BB88] curvature estimate that fully automates the big batch method, while
empirically enjoying a faster convergence rate than the Armijo backtracking line search.
Big batch methods that adaptively grow the batch size over time have several po-
tential advantages over conventional small-batch SGD:
• Big batch methods don’t require the user to choose step size decay parameters.
Larger batch sizes with less noise enable easy estimation of the accuracy of the
approximate gradient, making it straightforward to adaptively scale up the batch
size and maintain fast convergence.
• Backtracking line search tends to work very well when combined with big batches,
making the methods completely adaptive with no parameters. A nearly constant
signal-to-noise ratio also enables us to define an adaptive step size method based
on the Barzilai-Borwein curvature estimate, that performs better empirically on a
range of convex problems than the backtracking line search.
• Higher order methods like stochastic L-BFGS typically require more work per it-
eration than simple SGD. When using big batches, the overhead of more complex
methods like L-BFGS can be amortized over more costly gradient approximations.
Furthermore, better Hessian approximations can be computed using less noisy gra-
dient terms.
• For a restricted class of non-convex problems (functions satisfying the Polyak-
Łojasiewicz Inequality), the per-iteration convergence rate of big batch SGD is linear
and the approximate gradients vanish as the method approaches a solution, which
makes it easy to define automated stopping conditions. In contrast, small batch
SGD exhibits sub-linear convergence, and the noisy gradients are not usable as a
stopping criterion.
• Big batch methods are much more efficient than conventional SGD in massively
parallel/distributed settings. Bigger batches perform more computation between
parameter updates, and thus allow a much higher ratio of computation to commu-
nication.
For the reasons above, big batch SGD is potentially much easier to automate and requires
much less user oversight than classical small batch SGD.
Related work
In this section, we focus on automating stochastic optimization methods by reduc-
ing the noise in SGD. We do this by adaptively growing the batch size to control the
variance in the gradient estimates, maintaining an approximately constant signal-to-noise
ratio, leading to automated methods that do not require vanishing step size parameters.
While there has been some work on adaptive step size methods for stochastic optimization
[MH15, SZL13, TMDQ16, KB14, Zei12], the methods are largely heuristic, without
theoretical guarantees or convergence rates. The work in [TMDQ16] was a
first step towards provable automated stochastic methods, and we continue in this direction
to show provable convergence rates for the automated big batch method.
While there has been relatively little work in provable automated stochastic meth-
ods, there has been recent interest in methods that control gradient noise. These methods
mitigate the effects of vanishing step sizes, though choosing the (constant) step size still
requires tuning and oversight. There have been a few papers in this direction that use
dynamically increasing batch sizes. In [FS12], the authors propose to increase the size of
the batch by a constant factor on every iteration, and prove linear convergence in terms
of the iterates of the algorithm. In [BCNW12], the authors propose an adaptive strategy
for growing the batch size; however, the authors do not present a theoretical guarantee
for this method, and instead prove linear convergence for a continuously growing batch,
similar to [FS12].
Variance reduction (VR) SGD methods use an error correction term to reduce the
noise in stochastic gradient estimates. The methods enjoy a provably faster conver-
gence rate than SGD and have been shown to outperform SGD on convex problems
[DBLJ14, JZ13, SRB13, DD+14], as well as in parallel [RHS+15] and distributed set-
tings [DG16]. A caveat, however, is that these methods require either extra storage or
full gradient computations, both limiting factors when the dataset is very large. In a re-
cent paper [HAV+15], the authors propose a growing batch strategy for a VR method
that enjoys the same convergence guarantees. However, as mentioned above, choosing
the constant step size still requires tuning. Another conceptually related approach is im-
portance sampling, i.e., choosing training points such that the variance in the gradient
estimates is reduced [BTPG15, CR16, NWS14].
3.2 Big Batch SGD
3.2.1 Preliminaries and motivation
Classical stochastic gradient methods thrive when the current iterate is far from
optimal. In this case, only a small amount of data is needed to find a descent direction,
and optimization progresses efficiently. As wk starts approaching the true solution $w^\star$,
however, noisy gradient estimates frequently fail to produce descent directions and do not
reliably decrease the objective. By choosing larger batches with less noise, we may be
able to maintain descent directions on each iteration and uphold fast convergence. This
observation motivates the proposed “big batch” method. We now explore this idea more
rigorously. We wish to show that a noisy gradient approximation ∇f̃ produces a descent
direction when the noise is comparable in magnitude to the true gradient ∇f.
Lemma 3.2.1. A sufficient condition for −∇f̃(w) to be a descent direction is
\[ \|\nabla \tilde{f}(w) - \nabla f(w)\|^2 < \|\nabla \tilde{f}(w)\|^2 . \]
Proof. This is a standard result in stochastic optimization. We know that −∇f̃(w) is
a descent direction iff 〈∇f̃(w),∇f(w)〉 > 0. Expanding the left-hand side of the condition, we get
\[ \|\nabla \tilde{f}(w)\|^2 + \|\nabla f(w)\|^2 - 2\langle \nabla \tilde{f}(w), \nabla f(w) \rangle < \|\nabla \tilde{f}(w)\|^2 , \]
which we can re-write as
\[ -2\langle \nabla \tilde{f}(w), \nabla f(w) \rangle < -\|\nabla f(w)\|^2 \le 0 . \]
Hence 〈∇f̃(w),∇f(w)〉 > 0, and −∇f̃(w) is a descent direction. □
Thus, we see that if the error $\|\nabla \tilde{f}(w) - \nabla f(w)\|^2$ is small relative to the gradient
$\|\nabla \tilde{f}(w)\|^2$, the stochastic approximation is a descent direction. But how big is this error
and how large does a batch need to be to guarantee this condition? Let f̃B denote the
unbiased estimate of f using a mini-batch B sampled uniformly at random from dataset
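D. Also, let fx denote the loss corresponding to training pair (x, C(x)). Then, by the
weak law of large numbers (see footnote 1),
\[ \mathbb{E}[\|\nabla \tilde{f}_B(w) - \nabla f(w)\|^2] = \frac{1}{|B|} \mathbb{E}_x[\|\nabla f_x(w) - \nabla f(w)\|^2] = \frac{1}{|B|} \operatorname{Tr} \operatorname{Var}_x \nabla f_x(w), \]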
and so we can estimate the error of a stochastic gradient if we have some knowledge of
the variance of ∇fx(w). In practice, this variance could be estimated using the sample
variance of a batch $\{\nabla f_i(w)\}_{x_i \in B}$. However, we would like some bounds on the
magnitude of this gradient to show that it is well-behaved, and also to analyze worst-case
convergence behavior. To this end, we make the following assumption.
Assumption 3.2.1. We assume that each fi has Lx-Lipschitz dependence on data x, i.e.,
given two data points x1,x2 ∼ p(x), we have: ‖∇f1(w)−∇f2(w)‖ ≤ Lx‖x1 − x2‖.
Under this assumption, we can bound the error of the stochastic gradient. The
bound is uniform with respect to w, which makes it rather useful in analyzing the conver-
gence rate for big batch methods.
Theorem 3.2.1. Given the current iterate w, suppose Assumption 3.2.1 holds and that the
data distribution p has bounded second moment. Then the estimated gradient ∇f̃B(w)
has variance bounded by
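\[ \mathbb{E}_B \|\nabla \tilde{f}_B(w) - \nabla f(w)\|^2 := \operatorname{Tr} \operatorname{Var}_B(\nabla \tilde{f}_B(w)) \le \frac{4 L_x^2 \operatorname{Tr} \operatorname{Var}_x(x)}{|B|}, \]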
where x ∼ p(x). Note the bound is uniform in w.
[Footnote 1] We assume the random variable ∇fx is measurable and has bounded second moment. These conditions
will be guaranteed by the hypothesis of Theorem 3.2.1.
Proof. Let x̄ = E[x] be the mean of x. Given the current iterate w, we assume that the
batch B is sampled uniformly with replacement from p. We then have:
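\[ \begin{aligned}
\|\nabla f_x(w) - \nabla f(w)\|^2 &\le 2\|\nabla f_x(w) - \nabla f_{\bar{x}}(w)\|^2 + 2\|\nabla f_{\bar{x}}(w) - \nabla f(w)\|^2 \\
&\le 2L_x^2 \|x - \bar{x}\|^2 + 2\|\mathbb{E}_x[\nabla f_{\bar{x}}(w) - \nabla f_x(w)]\|^2 \\
&\le 2L_x^2 \|x - \bar{x}\|^2 + 2\mathbb{E}_x \|\nabla f_{\bar{x}}(w) - \nabla f_x(w)\|^2 \\
&\le 2L_x^2 \|x - \bar{x}\|^2 + 2L_x^2 \mathbb{E}_x \|\bar{x} - x\|^2 \\
&= 2L_x^2 \|x - \bar{x}\|^2 + 2L_x^2 \operatorname{Tr} \operatorname{Var}_x(x),
\end{aligned} \]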
where the first inequality uses the property ‖a + b‖2 ≤ 2‖a‖2 + 2‖b‖2, the second and
fourth inequalities use Assumption 3.2.1, and the third inequality uses Jensen’s inequality.
This bound is uniform in w. We then have
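\[ \mathbb{E}_x \|\nabla f_x(w) - \nabla f(w)\|^2 \le 2L_x^2 \mathbb{E}_x \|x - \bar{x}\|^2 + 2L_x^2 \operatorname{Tr} \operatorname{Var}_x(x) = 4L_x^2 \operatorname{Tr} \operatorname{Var}_x(x), \]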
uniformly for all w. The result follows from the observation that
\[ \mathbb{E}_B \|\nabla \tilde{f}_B(w) - \nabla f(w)\|^2 = \frac{1}{|B|} \mathbb{E}_x \|\nabla f_x(w) - \nabla f(w)\|^2 . \]
□
Note that, using a finite number of samples, one can approximate the quantity Varx(x).
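The 1/|B| scaling of the gradient error used in this section can be checked numerically. The sketch below treats a small set of per-example gradients as the full data distribution and enumerates every batch drawn with replacement, so the expectation is computed exactly rather than by simulation. The function names and the random test gradients are assumptions for illustration.

```python
import itertools
import numpy as np

def trace_variance(grads):
    """Tr Var_x(grad f_x): mean squared distance of per-example gradients from their mean."""
    mean = grads.mean(axis=0)
    return np.mean(np.sum((grads - mean) ** 2, axis=1))

def exact_batch_error(grads, batch_size):
    """E||grad f_B - grad f||^2, enumerating all batches drawn with replacement."""
    n = grads.shape[0]
    mean = grads.mean(axis=0)
    errors = [
        np.sum((grads[list(idx)].mean(axis=0) - mean) ** 2)
        for idx in itertools.product(range(n), repeat=batch_size)
    ]
    return np.mean(errors)

# Five hypothetical per-example gradients in three dimensions.
grads = np.random.default_rng(0).normal(size=(5, 3))
ratios = [exact_batch_error(grads, b) * b / trace_variance(grads) for b in (1, 2, 3)]
```

Each entry of `ratios` should equal one up to rounding, reflecting that the expected squared error of the batch gradient is the trace of the per-example variance divided by the batch size.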
3.2.2 A template for big batch SGD
Theorem 3.2.1 and Lemma 3.2.1 together suggest that we should expect d = −∇f̃B
to be a descent direction reasonably often provided
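\[ \theta^2 \|\nabla \tilde{f}_B(w)\|^2 \ge \frac{1}{|B|} \left[ \operatorname{Tr} \operatorname{Var}_x(\nabla f_x(w)) \right], \tag{3.1} \]
or
\[ \theta^2 \|\nabla \tilde{f}_B(w)\|^2 \ge \frac{4 L_x^2 \operatorname{Tr} \operatorname{Var}_x(x)}{|B|}, \]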
for some θ < 1. Big batch methods capitalize on this observation.
On each iteration k, starting from a point wk, the big batch method performs the
following steps:
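1. Estimate the variance Tr Varx[∇fx(wk)], and a batch size K large enough that
\[ \theta^2 \mathbb{E}\|\nabla \tilde{f}_{B_k}(w_k)\|^2 \ge \mathbb{E}\|\nabla \tilde{f}_{B_k}(w_k) - \nabla f(w_k)\|^2 = \frac{1}{K} \operatorname{Tr} \operatorname{Var}_x \nabla f_x(w_k), \tag{3.2} \]
where θ ∈ (0, 1) and Bk is the selected batch on the k-th iteration with |Bk| = K.
2. Choose a step size αk.
3. Perform the update: $w_{k+1} = w_k - \alpha_k \nabla \tilde{f}_{B_k}(w_k)$.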
One can implement these steps using different variance estimators and different step size
strategies. In the next section, we show that, if condition (3.2) holds, then fast conver-
gence can be achieved using an appropriate constant step size. In subsequent sections,
we address the issue of how to build practical big batch implementations using automated
variance and step size estimators that require no user oversight.
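The three steps of this template can be sketched in Python as follows. This is a hedged illustration, not the implementation used in the thesis: it assumes a constant step size, estimates the variance with the sample variance of the per-example gradients in the current batch, and grows the batch by a fixed fraction whenever the test fails. The callback `per_example_grads`, the parameter names, and the toy regression problem are all hypothetical.

```python
import numpy as np

def big_batch_sgd(per_example_grads, w0, n, step, batch_size=8,
                  growth=0.1, theta=1.0, iters=200, seed=0):
    """Sketch of the big batch template with a sample-variance batch test."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    K = batch_size                      # assumed to satisfy 2 <= K <= n
    batch_sizes = []
    for _ in range(iters):
        idx = rng.choice(n, size=K, replace=False)
        G = per_example_grads(w, idx)   # shape (|B|, d), one gradient per row
        while True:
            g = G.mean(axis=0)          # batch gradient estimate
            V = np.sum((G - g) ** 2) / (len(idx) - 1)   # sample variance
            # Step 1: accept the batch once theta^2 ||g||^2 > V / |B|,
            # or when the whole dataset is already in use.
            if theta ** 2 * (g @ g) > V / len(idx) or len(idx) == n:
                break
            K = min(n, K + max(1, int(np.ceil(growth * K))))
            unused = np.setdiff1d(np.arange(n), idx)
            extra = rng.choice(unused, size=K - len(idx), replace=False)
            idx = np.concatenate([idx, extra])
            G = np.vstack([G, per_example_grads(w, extra)])
        w = w - step * g                # Steps 2 and 3 with a constant step size
        batch_sizes.append(len(idx))
    return w, batch_sizes

# Toy problem (assumption): noisy least squares.
data_rng = np.random.default_rng(0)
X = data_rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * data_rng.normal(size=200)

def per_example_grads(w, idx):
    """Rows are gradients of (<w, x_i> - y_i)^2 for the selected samples."""
    r = X[idx] @ w - y[idx]
    return 2.0 * r[:, None] * X[idx]

w_hat, sizes = big_batch_sgd(per_example_grads, np.zeros(3), n=200, step=0.1)
```

The recorded batch sizes never decrease, because a batch is only ever enlarged; on this toy problem the iterates should end up close to the least-squares solution.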
3.3 Convergence analysis
We now present convergence bounds for big batch SGD methods (3.3). We rewrite
the SGD update as:
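\[ w_{k+1} = w_k - \alpha \nabla \tilde{f}_{B_k}(w_k) = w_k - \alpha (\nabla f(w_k) + \tilde{e}_k), \tag{3.3} \]
where $\tilde{e}_k = \nabla \tilde{f}_{B_k}(w_k) - \nabla f(w_k)$, and $\mathbb{E}_B[\tilde{e}_k] = 0$. Let us also define $\tilde{g}_k = \nabla f(w_k) + \tilde{e}_k$.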
Before we present our results, we first state two assumptions about the loss function f(w).
Assumption 3.3.1. We assume that the objective function f has L-Lipschitz gradients:
\[ f(w) \le f(w') + \langle \nabla f(w'), w - w' \rangle + \frac{L}{2} \|w - w'\|^2 . \]
This is a standard smoothness assumption used widely in the optimization literature.
Note that a consequence of Assumption 3.3.1 is: ‖∇f(w)−∇f(w′)‖ ≤ L‖w −w′‖.
Assumption 3.3.2. We assume that the objective function f satisfies the Polyak-Łojasiewicz
Inequality: $\|\nabla f(w)\|^2 \ge 2\mu (f(w) - f(w^\star))$, where $w^\star$ is the optimal solution.
Note that this inequality does not require f to be convex. It does, however, imply
that every stationary point is a global minimizer [KNS16,Pol63]. We now present a result
that establishes an upper bound on the objective value in terms of the error in the gradient
of the sampled batch.
Lemma 3.3.1. Suppose we apply an update of the form (3.3) where the batch Bk is uni-
formly sampled from the dataset D on each iteration k. If the objective f satisfies As-
sumptions 3.3.1 and 3.3.2, we have:
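\[ \mathbb{E}[f(w_{k+1}) - f(w^\star)] \le \left( 1 - 2\mu \left( \alpha - \frac{L\alpha^2}{2} \right) \right) \mathbb{E}[f(w_k) - f(w^\star)] + \frac{L\alpha^2}{2} \mathbb{E}\|\tilde{e}_k\|^2 . \]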
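Proof. From (3.3) and Assumption 3.3.1 we get
\[ f(w_{k+1}) \le f(w_k) - \alpha \langle \tilde{g}_k, \nabla f(w_k) \rangle + \frac{L\alpha^2}{2} \|\tilde{g}_k\|^2 . \]
Taking expectation with respect to the batch Bk and conditioning on wk, we get
\[ \begin{aligned}
\mathbb{E}[f(w_{k+1}) - f(w^\star)] &\le f(w_k) - f(w^\star) - \alpha \langle \mathbb{E}[\tilde{g}_k], \nabla f(w_k) \rangle + \frac{L\alpha^2}{2} \mathbb{E}\|\tilde{g}_k\|^2 \\
&= f(w_k) - f(w^\star) - \left( \alpha - \frac{L\alpha^2}{2} \right) \|\nabla f(w_k)\|^2 + \frac{L\alpha^2}{2} \mathbb{E}\|\tilde{e}_k\|^2 \\
&\le \left( 1 - 2\mu \left( \alpha - \frac{L\alpha^2}{2} \right) \right) (f(w_k) - f(w^\star)) + \frac{L\alpha^2}{2} \mathbb{E}\|\tilde{e}_k\|^2 ,
\end{aligned} \]
where the second inequality follows from Assumption 3.3.2. Taking expectation, the
result follows. □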
Using Lemma 3.3.1, we now provide convergence rates for big batch SGD.
Theorem 3.3.1. Suppose f satisfies Assumptions 3.3.1 and 3.3.2. Suppose further that on
each iteration the batch size is large enough to satisfy (3.2) for θ ∈ (0, 1). If $0 \le \alpha < \frac{2}{L\beta}$,
where $\beta = \frac{\theta^2 + (1-\theta)^2}{(1-\theta)^2}$, then we get the following linear convergence bound for big batch
SGD using updates of the form (3.3):
\[ \mathbb{E}[f(w_{k+1}) - f(w^\star)] \le \gamma \cdot \mathbb{E}[f(w_k) - f(w^\star)], \]
where $\gamma = 1 - 2\mu \left( \alpha - \frac{L\alpha^2\beta}{2} \right)$. Choosing the optimal step size of $\alpha = \frac{1}{\beta L}$, we get
\[ \mathbb{E}[f(w_{k+1}) - f(w^\star)] \le \left( 1 - \frac{\mu}{\beta L} \right) \cdot \mathbb{E}[f(w_k) - f(w^\star)]. \]
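Proof. We begin by applying the reverse triangle inequality to (3.2) to get
$(1-\theta)\, \mathbb{E}\|\nabla \tilde{f}_{B_k}(w_k)\| \le \mathbb{E}\|\nabla f(w_k)\|$, which applied to (3.2) yields:
\[ \frac{\theta^2}{(1-\theta)^2} \mathbb{E}\|\nabla f(w_k)\|^2 \ge \mathbb{E}\|\nabla \tilde{f}_{B_k}(w_k) - \nabla f(w_k)\|^2 = \mathbb{E}\|\tilde{e}_k\|^2 . \tag{3.4} \]
Applying (3.4) to the result in Lemma 3.3.1, we get
\[ \mathbb{E}[f(w_{k+1}) - f(w^\star)] \le \mathbb{E}[f(w_k) - f(w^\star)] - \left( \alpha - \frac{L\alpha^2\beta}{2} \right) \mathbb{E}\|\nabla f(w_k)\|^2 , \]
where $\beta = \frac{\theta^2 + (1-\theta)^2}{(1-\theta)^2} \ge 1$. Assuming $\alpha - \frac{L\alpha^2\beta}{2} \ge 0$ and using Assumption 3.3.2, we get:
\[ \mathbb{E}[f(w_{k+1}) - f(w^\star)] \le \left( 1 - 2\mu \left( \alpha - \frac{L\alpha^2\beta}{2} \right) \right) \mathbb{E}[f(w_k) - f(w^\star)], \]
which proves the theorem. Note that $\max_\alpha \left\{ \alpha - \frac{L\alpha^2\beta}{2} \right\} = \frac{1}{2L\beta}$, and µ ≤ L. It follows that
$0 \le 1 - 2\mu \left( \alpha - \frac{L\alpha^2\beta}{2} \right) < 1$. The second result follows immediately. □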
Note that the above linear convergence rate bound holds without requiring con-
vexity. Comparing it with the convergence rate of deterministic gradient descent under
similar assumptions, we see that big batch SGD suffers a slowdown by a factor β, due to
the noise in the estimation of the gradients. We now present a result proving a O(1/k)
convergence rate for general smooth convex functions.
Theorem 3.3.2. Suppose f satisfies Assumption 3.3.1, is convex, and condition (3.2) is
satisfied on each iteration. Then we get sub-linear convergence of the form:
\[ \mathbb{E}[f(w_k) - f(w^\star)] \le \frac{\|w_0 - w^\star\|^2}{(2\alpha - 2L\alpha^2\beta)(k+1)} = O(1/k), \]
where $\beta = \frac{\theta^2 + (1-\theta)^2}{(1-\theta)^2}$ and $\alpha < \frac{1}{L\beta}$. Choosing the optimal step size of $\alpha = \frac{1}{2L\beta}$, we get
\[ \mathbb{E}[f(w_k) - f(w^\star)] \le \frac{2L\beta \|w_0 - w^\star\|^2}{k+1} = O(1/k). \]
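Proof. Applying the reverse triangle inequality to (3.2) and using Lemma 3.3.1 we get,
as in Theorem 3.3.1:
\[ \mathbb{E}[f(w_{k+1})] \le \mathbb{E}[f(w_k)] - \left( \alpha - \frac{L\alpha^2\beta}{2} \right) \mathbb{E}\|\nabla f(w_k)\|^2 , \tag{3.5} \]
where $\beta = \frac{\theta^2 + (1-\theta)^2}{(1-\theta)^2} \ge 1$. Note that $\alpha - \frac{L\alpha^2\beta}{2} > 0$ if $\alpha < \frac{2}{L\beta}$. From (3.3), taking norm on
both sides and taking expectation, conditioned on all wt, with t = 0, 1, · · · , k, we get
\[ \begin{aligned}
\mathbb{E}\|w_{k+1} - w^\star\|^2 &= \|w_k - w^\star\|^2 - 2\alpha \mathbb{E}\langle w_k - w^\star, \nabla f(w_k) + \tilde{e}_k \rangle + \alpha^2 \mathbb{E}\|\nabla f(w_k) + \tilde{e}_k\|^2 \\
&\le \|w_k - w^\star\|^2 - 2\alpha \langle w_k - w^\star, \nabla f(w_k) \rangle + \alpha^2\beta \|\nabla f(w_k)\|^2 \\
&\le \|w_k - w^\star\|^2 - 2\alpha (f(w_k) - f(w^\star)) + \alpha^2\beta \|\nabla f(w_k)\|^2 \\
&\le \|w_k - w^\star\|^2 - 2\alpha (f(w_k) - f(w^\star)) + 2L\alpha^2\beta (f(w_k) - f(w^\star)) \\
&= \|w_k - w^\star\|^2 - (2\alpha - 2L\alpha^2\beta)(f(w_k) - f(w^\star)),
\end{aligned} \]
where we use the property that $\mathbb{E}[\tilde{e}_k] = 0$, and the properties
$f(w) \le f(w^\star) + \langle w - w^\star, \nabla f(w) \rangle$ (which follows from the convexity of f) and
$\|\nabla f(w)\|^2 \le 2L(f(w) - f(w^\star))$ (a proof for this inequality can be found in [Nes13]). Note that $2\alpha - 2L\alpha^2\beta > 0$
when $\alpha < \frac{1}{L\beta}$. Taking expectation on all wk, we get
\[ \mathbb{E}[f(w_k) - f(w^\star)] \le \frac{1}{2\alpha(1 - L\alpha\beta)} \left( \mathbb{E}\|w_k - w^\star\|^2 - \mathbb{E}\|w_{k+1} - w^\star\|^2 \right). \tag{3.6} \]
Summing (3.6) over all k = 0, 1, · · · , T, and using the telescoping sum in $\|w_k - w^\star\|^2$:
\[ \begin{aligned}
\sum_{k=0}^{T} \mathbb{E}[f(w_k) - f(w^\star)] &\le \frac{1}{2\alpha(1 - L\alpha\beta)} \left( \mathbb{E}\|w_0 - w^\star\|^2 - \mathbb{E}\|w_{T+1} - w^\star\|^2 \right) \\
&\le \frac{1}{2\alpha(1 - L\alpha\beta)} \|w_0 - w^\star\|^2 .
\end{aligned} \tag{3.7} \]
From (3.5) we see that $\mathbb{E}[f(w_{k+1})] \le \mathbb{E}[f(w_k)]$ when $\alpha < \frac{2}{L\beta}$. Thus, we rewrite (3.7) as:
\[ \mathbb{E}[f(w_T) - f(w^\star)] \le \frac{1}{T+1} \sum_{k=0}^{T} \mathbb{E}[f(w_k) - f(w^\star)] \le \frac{\|w_0 - w^\star\|^2}{(2\alpha - 2L\alpha^2\beta)(T+1)} . \]
Choosing the optimal step size of $\alpha = \frac{1}{2L\beta}$, the second result follows. □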
3.3.1 Comparison to classical SGD
Conventional small batch SGD methods can attain only O(1/k) convergence for
strongly convex problems, thus requiring O(1/ε) gradient evaluations to achieve an
optimality gap less than ε, and this has been shown to be optimal in the online setting (i.e.,
the infinite data setting) [RSS11]. In the previous section, however, we have shown that
big batch SGD methods converge linearly in the number of iterations, under a weaker
assumption than strong convexity, in the online setting. Unfortunately, per-iteration con-
vergence rates are not a fair comparison between these methods because the cost of a big
batch iteration grows with the iteration count, unlike classical SGD. For this reason, it
is interesting to study the convergence rate of big batch SGD as a function of gradient
evaluations.
From Lemma 3.3.1, we see that we should not expect to achieve an optimality gap
less than ε until we have: $\frac{L\alpha^2}{2} \mathbb{E}_{B_k}\|\tilde{e}_k\|^2 < \epsilon$. In the worst case, by Theorem 3.2.1, this
requires $\frac{L\alpha^2}{2} \cdot \frac{4 L_x^2 \operatorname{Tr} \operatorname{Var}_x(x)}{|B|} < \epsilon$, or |B| ≥ O(1/ε) gradient evaluations. Note that in the
online or infinite data case, this is an optimal bound, and matches that of other SGD
methods.
All our results hold for the infinite sample case. Note that the finite sample case is
fairly trivial with a growing batch size: asymptotically, the batch size becomes the whole
dataset, at which point we get the same asymptotic behavior as deterministic gradient
descent, achieving linear convergence rates.
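To get a feel for the constants in Theorem 3.3.1, the slowdown factor β and the per-iteration contraction factor γ can be evaluated directly from their definitions. The helper names below are hypothetical, and the values of θ, µ, and L in the example are arbitrary illustrative choices.

```python
def beta(theta):
    """beta = (theta^2 + (1 - theta)^2) / (1 - theta)^2, defined for 0 < theta < 1."""
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must lie strictly between 0 and 1")
    return (theta ** 2 + (1.0 - theta) ** 2) / (1.0 - theta) ** 2

def contraction_factor(theta, mu, L, alpha=None):
    """gamma = 1 - 2 mu (alpha - L alpha^2 beta / 2); alpha defaults to 1 / (beta L)."""
    b = beta(theta)
    if alpha is None:
        alpha = 1.0 / (b * L)
    return 1.0 - 2.0 * mu * (alpha - L * alpha ** 2 * b / 2.0)

# Example values (assumptions): theta = 0.5, mu = 1, L = 4.
b_half = beta(0.5)
gamma_half = contraction_factor(0.5, mu=1.0, L=4.0)
```

With the default step size the contraction factor reduces to 1 − µ/(βL), so a larger θ (a noisier gradient) gives a larger β and a slower linear rate.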
3.4 Practical implementation with backtracking line search
While one could implement a big batch method using analytical bounds on the
gradient and its variance (such as that provided by Theorem 3.2.1), the purpose of big
batch methods is to enable automated adaptive estimation of algorithm parameters.
Furthermore, the step size bounds provided by our convergence analysis, like the step size
bounds for classical SGD, are fairly conservative and more aggressive step size choices
are likely to be more effective.
Algorithm 1 Big batch SGD: fixed step size
initialize w0, step size α, initial batch size K > 1, batch size increment δK
while not converged do
    Draw random batch with size |B| = K; calculate VB and ∇f̃B(wk) using (3.8)
    while ‖∇f̃B(wk)‖² ≤ VB/K do
        Increase batch size K ← K + δK
        Sample more gradients and update VB and ∇f̃B(wk)
    end while
    wk+1 = wk − α∇f̃B(wk)
end while
The framework outlined in Section 3.2.2 requires two ingredients: estimating the
batch size and estimating the step size. To estimate the batch size needed to achieve (3.2),
we start with an initial batch size K, and draw a random batch B with |B| = K. We then
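compute the stochastic gradient estimate ∇f̃B(wk) and the sample variance
\[ V_B := \frac{1}{|B| - 1} \sum_{x \in B} \|\nabla f_x(w_k) - \nabla \tilde{f}_B(w_k)\|^2 \approx \operatorname{Tr} \operatorname{Var}_{x \in B}(\nabla f_x(w_k)). \tag{3.8} \]
We then test whether $\|\nabla \tilde{f}_B(w_k)\|^2 > V_B / |B|$ as a proxy for (3.2). If this condition holds,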
we proceed with a gradient step, else we increase the batch size K ← K + δK , and check
our condition again. We fix δK = 0.1K for all our experiments. Our implementation
also simply chooses θ = 1. The fixed step size big batch method is listed in Algorithm
1. We also consider a backtracking variant of SGD that adaptively tunes the step size.
This method selects batch sizes using the same criterion (3.8) as in the constant step size
case. However, after a batch has been selected, a backtracking Armijo line search is used
to select a step size. In the Armijo line search, we keep decreasing the step size by a
constant factor (in our case, by a factor of 2) until the following condition is satisfied on
each iteration:
\[ \tilde{f}_B(w_{k+1}) \le \tilde{f}_B(w_k) - c\, \alpha_k \|\nabla \tilde{f}_B(w_k)\|^2 , \tag{3.9} \]
where c is a parameter of the line search usually set to 0 < c ≤ 0.5. We now present a
convergence result of big batch SGD using the Armijo line search.
Theorem 3.4.1. Suppose that f satisfies Assumptions 3.3.1 and 3.3.2 and that, on each
iteration, the batch size is large enough to satisfy (3.2) for θ ∈ (0, 1). If an Armijo line
search, given by (3.9), is used, and the step size is decreased by a factor of 2 whenever (3.9) fails,
then we get the following linear convergence bound for big batch SGD using updates of
the form (3.3):
\[ \mathbb{E}[f(w_{k+1}) - f(w^\star)] \le \gamma \cdot \mathbb{E}[f(w_k) - f(w^\star)], \]
where $\gamma = 1 - 2c\mu \min\left( \alpha_0, \frac{1}{2\beta L} \right)$ and 0 < c ≤ 0.5. If the initial step size α0 is set
large enough such that $\alpha_0 \ge \frac{1}{2\beta L}$, then we get:
\[ \mathbb{E}[f(w_{k+1}) - f(w^\star)] \le \left( 1 - \frac{c\mu}{\beta L} \right) \mathbb{E}[f(w_k) - f(w^\star)]. \]
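Proof. Applying the reverse triangle inequality to (3.2) and using Lemma 3.3.1 we get,
as in Theorem 3.3.1:
\[ \mathbb{E}[f(w_{k+1}) - f(w^\star)] \le \mathbb{E}[f(w_k) - f(w^\star)] - \left( \alpha_t - \frac{L\alpha_t^2\beta}{2} \right) \mathbb{E}\|\nabla f(w_k)\|^2 , \tag{3.10} \]
where αt denotes the step size used on the current iteration and $\beta = \frac{\theta^2 + (1-\theta)^2}{(1-\theta)^2} \ge 1$.
We will show that the backtracking condition in (3.9) is satisfied whenever
$0 < \alpha_t \le \frac{1}{\beta L}$. Notice that $0 < \alpha_t \le \frac{1}{\beta L}$ implies $-\alpha_t + \frac{L\alpha_t^2\beta}{2} \le -\frac{\alpha_t}{2}$. Thus, we can rewrite
(3.10) as
\[ \begin{aligned}
\mathbb{E}[f(w_{k+1}) - f(w^\star)] &\le \mathbb{E}[f(w_k) - f(w^\star)] - \frac{\alpha_t}{2} \mathbb{E}\|\nabla f(w_k)\|^2 \\
&\le \mathbb{E}[f(w_k) - f(w^\star)] - c\, \alpha_t \mathbb{E}\|\nabla f(w_k)\|^2 ,
\end{aligned} \]
where 0 < c ≤ 0.5. Thus, the backtracking line search condition (3.9) is satisfied
whenever $0 < \alpha_t \le \frac{1}{L\beta}$. Now we know that either αt = α0 (the initial step size), or $\alpha_t \ge \frac{1}{2\beta L}$,
where the step size is decreased by a factor of 2 each time the backtracking condition
fails. Thus, we can rewrite the above as
\[ \mathbb{E}[f(w_{k+1}) - f(w^\star)] \le \mathbb{E}[f(w_k) - f(w^\star)] - c \min\left( \alpha_0, \frac{1}{2\beta L} \right) \mathbb{E}\|\nabla f(w_k)\|^2 . \]
Using Assumption 3.3.2 we get
\[ \mathbb{E}[f(w_{k+1}) - f(w^\star)] \le \left( 1 - 2c\mu \min\left( \alpha_0, \frac{1}{2\beta L} \right) \right) \mathbb{E}[f(w_k) - f(w^\star)]. \]
Assuming we start off the step size at a large value such that $\min\left( \alpha_0, \frac{1}{2\beta L} \right) = \frac{1}{2\beta L}$, we can
rewrite the above to get the desired bound. □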
In practice, on iterations where the batch size increases, we double the step size
before running line search to prevent the step sizes from decreasing monotonically. The
complete details are listed in Algorithm 2.
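As a minimal sketch of the backtracking rule (3.9), the following Python function halves the step size until the sufficient decrease condition holds on the current batch. It is an illustration under stated assumptions, not the implementation used in our experiments: the names `armijo_step`, `f_batch`, `grad_batch`, and the safety cap `max_halvings` are hypothetical, and the toy quadratic stands in for a real batch objective.

```python
def armijo_step(f_batch, grad_batch, w, alpha, c=0.1, max_halvings=50):
    """Return (w_next, alpha) after halving alpha until (3.9) holds.

    f_batch and grad_batch evaluate the batch objective and its gradient
    at a point w given as a list of floats. max_halvings is a safety cap
    introduced for this sketch only.
    """
    g = grad_batch(w)
    g_norm_sq = sum(gi * gi for gi in g)
    f_w = f_batch(w)
    for _ in range(max_halvings):
        w_next = [wi - alpha * gi for wi, gi in zip(w, g)]
        # Armijo condition (3.9): sufficient decrease on the current batch.
        if f_batch(w_next) <= f_w - c * alpha * g_norm_sq:
            return w_next, alpha
        alpha /= 2.0
    return w, alpha  # no acceptable step found within the cap


# Toy batch objective f(w) = 0.5 * 4 * ||w||^2 (an assumption for illustration).
def f_toy(w):
    return 0.5 * 4.0 * sum(wi * wi for wi in w)


def grad_toy(w):
    return [4.0 * wi for wi in w]


w0 = [1.0, -2.0]
w1, alpha_used = armijo_step(f_toy, grad_toy, w0, alpha=1.0, c=0.1)
```

On this toy quadratic the initial step size 1 is halved twice before (3.9) is satisfied, so `alpha_used` is 0.25. The doubling heuristic described above would be applied to `alpha` before calling the function on iterations where the batch size grows.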
Algorithm 2 Big batch SGD: backtracking line search
1: initialize $w_0$, initial step size $\alpha$, initial batch size $K > 1$, batch size increment $\delta_K$,
   backtracking line search parameter $c$, flag $F = 0$
2: while not converged do
3:   Draw random batch with size $|B| = K$; Calculate $V_B$ and $\nabla \tilde{f}_B(w_k)$ using (3.8)
4:   while $\|\nabla \tilde{f}_B(w_k)\|^2 \le V_B/K$ do
5:     Increase batch size $K \leftarrow K + \delta_K$
6:     Sample more gradients and update $V_B$ and $\nabla \tilde{f}_B(w_k)$
7:     Set flag $F = 1$
8:   end while
9:   if flag $F == 1$ then
10:    $\alpha \leftarrow \alpha * 2$; Reset flag $F = 0$
11:  end if
12:  while $\tilde{f}_B(w_k - \alpha \nabla \tilde{f}_B(w_k)) > \tilde{f}_B(w_k) - c\,\alpha \|\nabla \tilde{f}_B(w_k)\|^2$ do
13:    $\alpha \leftarrow \alpha/2$
14:  end while
15:  $w_{k+1} = w_k - \alpha \nabla \tilde{f}_B(w_k)$
16: end while

3.5 Adaptive step sizes using the Barzilai-Borwein estimate
While the Armijo backtracking line search leads to an automated big batch method,
the step size sequence is monotonic (neglecting the heuristic mentioned in the previous
section). In this section, we derive a non-monotonic step size scheme that uses curvature
estimates to propose new step size choices.
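Our derivation follows the classical adaptive Barzilai-Borwein (BB) method [BB88]. The BB
method fits a quadratic model to the objective on each iteration, and a step size is proposed
that is optimal for the local quadratic model [GSB14]. To derive the analog of the
BB method for stochastic problems, we consider quadratic approximations of the form
$f(w) = \mathbb{E}_\phi f_\phi(w)$, where $f_\phi(w) = \frac{\nu}{2}\|w - \phi\|^2$ and $\phi \sim \mathcal{N}(w^\star, \sigma^2 I)$. We derive the
optimal step size for this. We can rewrite the quadratic approximation as:
$$f(w) = \mathbb{E}_\phi \frac{\nu}{2}\|w - \phi\|^2 = \frac{\nu}{2}\left[\langle w, w\rangle - 2\langle w, w^\star\rangle + \mathbb{E}\langle \phi, \phi\rangle\right] = \frac{\nu}{2}\left(\|w - w^\star\|^2 + d\sigma^2\right),$$
since we can write: $\mathbb{E}\langle \phi, \phi\rangle = \sum_{i=1}^{d} \mathbb{E}(\phi)_i^2 = \sum_{i=1}^{d} \left((w^\star)_i^2 + \sigma^2\right) = \|w^\star\|^2 + d\sigma^2$.
Further, notice that: $\mathbb{E}_\phi[\nabla f_\phi(w)] = \nu(w - w^\star)$ and $\operatorname{Tr} \operatorname{Var}_\phi[\nabla f_\phi(w)] = d\nu^2\sigma^2$. Using the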
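quadratic approximation, we can rewrite the update for big batch SGD as:
$$w_{k+1} = w_k - \alpha_k \frac{1}{|B|} \sum_{i=1}^{|B|} \nu (w_k - \phi_i) = (1 - \nu\alpha_k) w_k + \nu\alpha_k w^\star + \frac{\nu\sigma\alpha_k}{|B|} \sum_{i \in B} \xi_i,$$
where we write $\phi_i = w^\star + \sigma\xi_i$ with $\xi_i \sim \mathcal{N}(0, I)$. The expected value of f is:
$$\mathbb{E}_\xi[f(w_{k+1})] = \frac{\nu}{2}\left(\mathbb{E}_\xi\left\|(1 - \nu\alpha_k)(w_k - w^\star) + \frac{\nu\sigma\alpha_k}{|B|} \sum_{i \in B} \xi_i\right\|^2 + d\sigma^2\right) = \frac{\nu}{2}\left(\|(1 - \nu\alpha_k)(w_k - w^\star)\|^2 + \left(1 + \frac{\nu^2\alpha_k^2}{|B|}\right) d\sigma^2\right).$$
Minimizing $\mathbb{E}[f(w_{k+1})]$ w.r.t. $\alpha_k$ we get:
$$\alpha_k = \frac{1}{\nu} \cdot \left(1 - \frac{\frac{1}{|B_k|} \operatorname{Tr} \operatorname{Var}_x[\nabla f_x(w_k)]}{\mathbb{E}\|\nabla \tilde{f}_{B_k}(w_k)\|^2}\right). \tag{3.11}$$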
Here ν denotes the curvature of the quadratic approximation. Note that, in the case of
deterministic gradient descent, the optimal step size is simply 1/ν [GSB14]. We estimate
the curvature $\nu_k$ on each iteration using the BB least-squares rule [BB88]:
$$\nu_k = \frac{\langle w_k - w_{k-1}, \nabla \tilde{f}_{B_k}(w_k) - \nabla \tilde{f}_{B_k}(w_{k-1})\rangle}{\|w_k - w_{k-1}\|^2}. \tag{3.12}$$
Thus, each time we sample a batch $B_k$ on the k-th iteration, we also calculate the gradient
on that batch at the previous iterate, i.e., we calculate $\nabla \tilde{f}_{B_k}(w_{k-1})$. This gives us an
approximate curvature estimate, with which we derive the step size $\alpha_k$ using (3.11).
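The two formulas (3.11) and (3.12) can be sketched in Python as follows. This is a hedged illustration: the function names are hypothetical, and the expectations in (3.11) are replaced by batch estimates (`trace_var` for the variance term and `grad_norm_sq` for the squared batch gradient norm), which is an assumption of this sketch.

```python
def bb_curvature(w, w_prev, grad, grad_prev):
    """Curvature estimate (3.12): <dw, dg> / ||dw||^2, both gradients on the same batch."""
    dw = [a - b for a, b in zip(w, w_prev)]
    dg = [a - b for a, b in zip(grad, grad_prev)]
    return sum(x * y for x, y in zip(dw, dg)) / sum(x * x for x in dw)


def bb_step_size(nu, batch_size, trace_var, grad_norm_sq):
    """Step size (3.11) with the expectations replaced by batch estimates."""
    return (1.0 - trace_var / (batch_size * grad_norm_sq)) / nu


# Toy check on a batch objective with exact curvature 2, i.e. gradient 2 * w.
w_prev, w = [1.0, 0.0], [0.5, 0.5]
grad_prev = [2.0 * x for x in w_prev]
grad = [2.0 * x for x in w]
nu = bb_curvature(w, w_prev, grad, grad_prev)
alpha = bb_step_size(nu, batch_size=100, trace_var=8.0, grad_norm_sq=2.0)
noise_free_alpha = bb_step_size(nu, batch_size=100, trace_var=0.0, grad_norm_sq=2.0)
```

In this toy setting the curvature estimate recovers ν = 2, the noisy step size is about 0.48, and with zero gradient variance the step size reduces to the deterministic value 1/ν.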
3.5.1 Convergence proof
Here we prove convergence for the adaptive step size method described above. For
the convergence proof, we first state two assumptions:
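Assumption 3.5.1. Each $f_i$ has L-Lipschitz gradients:
$$f_i(w) \le f_i(w') + \langle \nabla f_i(w'), w - w'\rangle + \frac{L}{2}\|w - w'\|^2, \quad \forall i.$$
Assumption 3.5.2. Each $f_i$ is µ-strongly convex:
$$\langle \nabla f_i(w) - \nabla f_i(w'), w - w'\rangle \ge \mu\|w - w'\|^2, \quad \forall i.$$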
Note that both assumptions are stronger than Assumptions 3.3.1 and 3.3.2, i.e.,
Assumption 3.5.1 implies 3.3.1 and Assumption 3.5.2 implies 3.3.2 [KNS16]. Both are
very standard assumptions frequently used in the convex optimization literature.
From (3.11), we see that we can lower bound the step size as: $\alpha_k \ge (1 - \theta^2)/\nu$.
Thus, the step size for big batch SGD is scaled down by at most $1 - \theta^2$. For simplicity,
we assume that the step size is set to this lower bound: $\alpha_k = (1 - \theta^2)/\nu_k$. Thus, from
Assumptions 3.5.1 and 3.5.2, we can bound $\nu_k$, and also $\alpha_k$, as follows:
$$\mu \le \nu_k \le L \implies \frac{1 - \theta^2}{L} \le \alpha_k \le \frac{1 - \theta^2}{\mu}.$$
From Theorem 3.3.1, we see that we have linear convergence with the adaptive step size
method when:
$$1 - 2\mu\left(\alpha - \frac{L\alpha^2\beta}{2}\right) \le 1 - \frac{2(1 - \theta^2)}{\kappa} + \beta(1 - \theta^2)^2\kappa < 1 \implies \kappa^2 < \frac{2}{\beta(1 - \theta^2)},$$
where κ = L/µ is the condition number. We see that the adaptive step size method enjoys
a linear convergence rate when the problem is well-conditioned. In the next section, we
discuss ways to deal with poorly-conditioned problems.
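To get a feel for how restrictive this condition is, the bound on κ can be evaluated numerically. The sketch below assumes that β is the same quantity as in the proof of Theorem 3.4.1, namely $\beta = (\theta^2 + (1-\theta)^2)/(1-\theta)^2$; the function names are hypothetical.

```python
import math


def beta(theta):
    """beta = (theta^2 + (1 - theta)^2) / (1 - theta)^2, as in the proof of Theorem 3.4.1."""
    return (theta ** 2 + (1.0 - theta) ** 2) / (1.0 - theta) ** 2


def max_condition_number(theta):
    """Largest kappa allowed by kappa^2 < 2 / (beta * (1 - theta^2))."""
    return math.sqrt(2.0 / (beta(theta) * (1.0 - theta ** 2)))


kappa_bound = max_condition_number(0.5)
```

For θ = 0.5 we have β = 2 and the bound evaluates to roughly κ < 1.15, which illustrates why a safeguard is needed for poorly-conditioned problems.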
3.5.2 Practical implementation
To achieve robustness of the algorithm for poorly conditioned problems, we include
a backtracking line search after calculating (3.11), to ensure that the step sizes do not
blow up. Further, instead of calculating two gradients on each iteration ($\nabla \tilde{f}_{B_k}(w_k)$ and
$\nabla \tilde{f}_{B_k}(w_{k-1})$), our implementation uses the same batch (and step size) on two consecutive
iterations. Thus, one parameter update takes place for each gradient calculation.
We found the step size calculated from (3.11) to be noisy when the batch is small.
While this did not affect long-term performance, we perform a smoothing operation to
even out the step sizes and make performance more predictable. Let $\tilde{\alpha}_k$ denote the step
size calculated from (3.11). Then, the step size on each iteration is given by
$\alpha_k = \left(1 - \frac{|B|}{n}\right)\alpha_{k-1} + \frac{|B|}{n}\tilde{\alpha}_k$. This ensures that the update is proportional to how accurate the
estimate on each iteration is. This simple smoothing operation seemed to work very well
in practice as shown in the experimental section. Note that when |Bk| = n, we just use
αk = 1/νk. Since there is no noise in the algorithm in this case, we use the optimal step
size for a deterministic algorithm. Algorithm 3 shows the complete details.
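The smoothing rule is a convex combination weighted by the fraction of the data in the batch. A minimal sketch, with a hypothetical function name:

```python
def smooth_step_size(alpha_prev, alpha_tilde, batch_size, n):
    """Blend the previous step size with the new estimate, weighted by |B| / n."""
    weight = batch_size / n
    return (1.0 - weight) * alpha_prev + weight * alpha_tilde


# A batch covering a quarter of the data moves the step size a quarter of the way.
alpha_quarter = smooth_step_size(0.1, 0.5, batch_size=250, n=1000)
# A full batch replaces the step size with the new estimate.
alpha_full = smooth_step_size(0.1, 0.5, batch_size=1000, n=1000)
```

Small batches therefore change the step size only slightly, while a full batch adopts the new estimate outright.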
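Algorithm 3 Big batch SGD: with BB step sizes
1: initialize iterate $x = w_0$, initial step size $\alpha$, initial batch size $K > 1$, batch size
   increment $\delta_K$, backtracking line search parameter $c$
2: while not converged do
3:   Draw random batch with $|B| = K$; Calculate $V_B$ and $G_B = \nabla \tilde{f}_B(x)$ using (3.8)
4:   while $\|G_B\|^2 \le V_B/K$ do
5:     Increase batch size $K \leftarrow K + \delta_K$
6:     Sample more gradients and update $V_B$ and $G_B$
7:   end while
8:   while $\tilde{f}_B(x - \alpha \nabla \tilde{f}_B(x)) > \tilde{f}_B(x) - c\,\alpha \|\nabla \tilde{f}_B(x)\|^2$ do
9:     $\alpha \leftarrow \alpha/2$
10:  end while
11:  $x \leftarrow x - \alpha \nabla \tilde{f}_B(x)$
12:  if $K < n$ then
13:    Calculate $\tilde{\alpha} = (1 - V_B/(K\|G_B\|^2))/\nu$ using (3.11) and (3.12)
14:  else
15:    Calculate $\tilde{\alpha} = 1/\nu$ using (3.12)
16:  end if
17:  step size smoothing: $\alpha \leftarrow \alpha(1 - K/n) + \tilde{\alpha} K/n$
18:  while $\tilde{f}_B(x - \alpha \nabla \tilde{f}_B(x)) > \tilde{f}_B(x) - c\,\alpha \|\nabla \tilde{f}_B(x)\|^2$ do
19:    $\alpha \leftarrow \alpha/2$
20:  end while
21:  $x \leftarrow x - \alpha \nabla \tilde{f}_B(x)$
22: end while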
Figure 3.1: Convex experiments. Left to right: Ridge regression on MILLIONSONG;
Logistic regression on COVERTYPE; Logistic regression on IJCNN1. The top row shows
how the norm of the true gradient decreases with the number of epochs, the middle and
bottom rows show the batch sizes and step sizes used on each iteration by the big batch
methods. Here ‘passes through the data’ indicates number of epochs, while ‘iterations’
refers to the number of parameter updates used by the method (there may be multiple
iterations during one epoch).
[Figure 3.2 plots: each panel shows mean class accuracy against the number of epochs (0 to 40) for Adadelta, BB+Adadelta, SGD+Mom (Fine Tuned), SGD+Mom (Fixed LR), BBS+Mom (Fixed LR), and BBS+Mom+Armijo.]
Figure 3.2: Neural Network Experiments. The three columns from left to right correspond
to results for CIFAR-10, SVHN, and MNIST, respectively. The top row presents
classification accuracies on the training set, while the bottom row presents classification
accuracies on the test set.
3.6 Experiments
In this section, we present our experimental results. We explore big batch meth-
ods with both convex and non-convex (neural network) experiments on large and high-
dimensional datasets.
3.6.1 Convex experiments
For the convex experiments, we test big batch SGD on a binary classification problem
with logistic regression and a linear regression problem:
$$\min_w \frac{1}{n} \sum_{i=1}^{n} \log(1 + \exp(-y_i \langle x_i, w\rangle)),$$
$$\min_w \frac{1}{n} \sum_{i=1}^{n} (\langle x_i, w\rangle - y_i)^2.$$
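For concreteness, the two objectives above can be evaluated as in the following sketch. It uses plain Python lists, assumes labels $y_i \in \{-1, +1\}$ for the logistic loss, and computes $\log(1 + \exp(-m))$ in a numerically stable form; the function names are hypothetical and the tiny dataset is made up for illustration.

```python
import math


def logistic_loss(w, xs, ys):
    """Mean of log(1 + exp(-y_i <x_i, w>)) with labels y_i in {-1, +1}."""
    total = 0.0
    for x, y in zip(xs, ys):
        margin = y * sum(xi * wi for xi, wi in zip(x, w))
        # Stable evaluation of log(1 + exp(-margin)) for large |margin|.
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(xs)


def least_squares_loss(w, xs, ys):
    """Mean of (<x_i, w> - y_i)^2."""
    total = 0.0
    for x, y in zip(xs, ys):
        residual = sum(xi * wi for xi, wi in zip(x, w)) - y
        total += residual * residual
    return total / len(xs)


# Made-up two-point dataset, for illustration only.
xs = [[1.0, 0.0], [0.0, 1.0]]
ys = [1.0, -1.0]
loss_at_zero = logistic_loss([0.0, 0.0], xs, ys)
fit_error = least_squares_loss([1.0, -1.0], xs, ys)
```

At w = 0 every margin is zero, so the logistic loss equals log 2; the weight vector (1, -1) fits both made-up targets exactly, so the least-squares loss is zero.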
Figure 3.1 presents the results of our convex experiments on three standard real
world datasets: IJCNN1 [Pro01] and COVERTYPE [BD99] for logistic regression, and
MILLIONSONG [BMEWL11] for linear regression. As a preprocessing step, we nor-
malize the features for each dataset. We compare deterministic gradient descent (GD)
and SGD with step size decay (αk = a/(b+ k)) to big batch SGD using a fixed step size
(BBS+Fixed LR), with backtracking line search (BBS+Armijo) and with the adaptive step
size (3.11) (BBS+BB), as well as the growing batch method described in [FS12] (denoted
as SF; while the authors propose a quasi-Newton method, we adapt their algorithm to a
first-order method). We selected step size parameters using a comprehensive grid search
for all algorithms, except BBS+Armijo and BBS+BB, which require no parameter tuning.
We see that across all three problems, the big batch methods outperform the other
algorithms. We also see that both fully automated methods are always comparable to or
better than fixed step size methods. The automated methods increase the batch size more
slowly than BBS+Fixed LR and SF, and thus, these methods can take more steps with
smaller batches, leveraging their advantages for longer. Further, note that the step sizes derived
by the automated methods are very close to the optimal fixed step size rate.
3.6.2 Neural network experiments
To demonstrate the versatility of the big batch SGD framework, we also present re-
sults on neural network experiments. We compare big batch SGD against SGD with finely
tuned step size schedules and fixed step sizes. We also compare with Adadelta [Zei12],
and combine the big batch method with Adadelta (BB+Adadelta) to show that more
complex SGD variants can benefit from growing batch sizes. In addition, we also
compared big batch methods with L-BFGS. However, we found L-BFGS to consistently
yield poorer generalization error on neural networks, and thus we omit these results.
We train a convolutional neural network [LBBH98] (ConvNet) to classify three
benchmark image datasets: CIFAR-10 [KH09], SVHN [NWC+11], and MNIST [LBBH98].
Our ConvNet is composed of 4 layers. We use 32 × 32 pixel images as input. The first
layer of the ConvNet contains 16 filters of size 3 × 3, and the second layer contains 256
filters of size 3 × 3. The third and fourth layers are fully connected [LBBH98] with 256 and
10 outputs respectively. Each layer except the last one is followed by a ReLU non-linearity
[KSH12] and a max pooling stage [RHBL07] of size 2 × 2. This ConvNet has over 4.3 million
weights.
To compare against fine-tuned SGD, we used a comprehensive grid search on the
step size schedule to identify optimal parameters (up to a factor of 2 accuracy). For
CIFAR-10, the step size starts from 0.5 and is divided by 2 every 5 epochs with 0 step size
decay. For SVHN, the step size starts from 0.5 and is divided by 2 every 5 epochs with
1e−05 step size decay. For MNIST, the step size starts from 1 and is divided by 2
every 3 epochs with 0 step size decay. All algorithms use a momentum parameter of 0.9,
and SGD and Adadelta use mini-batches of size 128.
Fixed step size methods use the default decay rule of the Torch library: $\alpha_k = \alpha_0/(1 + 10^{-7}k)$,
where $\alpha_0$ was chosen to be the step size used in the fine-tuned experiments.
We also tune the hyper-parameter ρ in the Adadelta algorithm, and we found 0.9,
0.9 and 0.8 to be best-performing parameters for CIFAR-10, SVHN and MNIST respectively.
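The two kinds of step size schedule described above can be written as small functions. This is a sketch with hypothetical names; it assumes epochs are counted from 0, so that the first halving happens once `period` epochs have been completed, and it treats the per-iteration decay separately from the halving rule.

```python
def halving_schedule(alpha0, epoch, period):
    """Step size that starts at alpha0 and is divided by 2 every `period` epochs."""
    return alpha0 / (2 ** (epoch // period))


def inverse_decay(alpha0, k, decay=1e-7):
    """Decay rule alpha_k = alpha0 / (1 + decay * k) for iteration counter k."""
    return alpha0 / (1.0 + decay * k)


# Schedule quoted for CIFAR-10: start at 0.5, halve every 5 epochs.
cifar_steps = [halving_schedule(0.5, epoch, 5) for epoch in (0, 4, 5, 12)]
# With decay 1e-7 the fixed step size is still close to alpha0 after 10^5 iterations.
late_step = inverse_decay(0.5, 100000)
```

With a decay of $10^{-7}$ the second rule changes the step size by about 1% over $10^5$ iterations, which is why we refer to these runs as fixed step size methods.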
We plot the accuracy on the train and test set vs the number of epochs (full passes
through the dataset) in Figure 3.2. We notice that big batch SGD with backtracking
performs better than both Adadelta and SGD (Fixed LR) in terms of both train and
test error. Big batch SGD even performs comparably to fine-tuned SGD but without the
trouble of fine tuning. This is interesting because most state-of-the-art deep networks
(like AlexNet [KSH12], VGG Net [SZ14], ResNets [HZRS16a]) were trained by their
creators using standard SGD with momentum, and training parameters were tuned over
long periods of time (sometimes months). Finally, we note that big batch Adadelta
performs consistently better than plain Adadelta on both large-scale problems (SVHN
and CIFAR-10), and performance is nearly identical on the small-scale MNIST problem.
3.7 Summary
We analyzed and studied the behavior of alternative SGD methods in which the
batch size increases over time. Unlike classical SGD methods, in which stochastic gradi-
ents quickly become swamped with noise, these “big batch” methods maintain a nearly
constant signal to noise ratio of the approximate gradient. As a result, big batch methods
are able to adaptively adjust batch sizes without user oversight. The proposed automated
methods are shown empirically to be comparable to or better than other standard
methods, without requiring an expert user to choose learning rates and decay
parameters.
Chapter 4: Distributing SGD using variance reduction
4.1 Introduction
For truly large datasets, parallel or distributed algorithms are vital, driving inter-
est in SGD variants that parallelize over massive distributed datasets. While there has
been quite a bit of recent work in the area of parallel asynchronous SGD algorithms
[RRWN11, DCM+12, LHLL15, AD11, LASY14, SS14, ZLS09, BT89, ZWLS10, ZCL15],
these methods typically experience substantially reduced marginal benefit as the number
of worker nodes increases beyond a certain limit. Thus, while some of these algorithms scale
linearly when the number of worker nodes is small, they are less effective when the data
is distributed over hundreds or thousands of nodes.
Moreover, most research in parallel or distributed SGD methods has been focused
on the parameter server model of computation [RRWN11, DCM+12, AD11, LASY14,
ZLS09], where each update to the centrally stored parameter vector requires a communi-
cation phase between the local node and the central server. However, SGD methods tend
to become unstable with infrequent communication, and there has been less work in the
truly distributed setting where communication costs are high [ZWLS10, ZCL15, MR16].
In this section, we propose to boost the scalability of stochastic optimization algorithms
using variance reduction techniques, yielding SGD methods that scale linearly over
hundreds or thousands of nodes and can train models on massive datasets without the
slowdown that existing stochastic methods experience.
Notation For this chapter, let fk̃ denote the stochastic function chosen on the k-th iter-
ation, where k̃ is an index chosen uniformly at random from {1, 2, . . . , n}. Thus, using
this notation, the regular SGD update can be written as wk+1 = wk − α∇fk̃(wk).
Background
Variance reduction (VR) methods [JZ13,DBLJ14,RHS+15,RSB12,DD+14,KLRT14,
KR13, XZ14, WCSX13, HAV+15] have recently gained popularity as an alternative to
classical SGD. These methods reduce the variance in the stochastic gradient estimates,
and are able to maintain a large constant step size to achieve fast convergence to high
accuracy.
VR methods exploit the fact that gradient errors are highly correlated between dif-
ferent uses of the same function fk̃. This is done by subtracting an error correction term
from ∇fk̃(wk) that estimates the gradient error from the most recent use of fk̃. Thus the
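stochastic gradients used by VR methods have the form
$$\tilde{g}_k = \underbrace{\nabla f_{\tilde{k}}(w_k)}_{\text{approximate gradient}} \; \underbrace{{} - \nabla f_{\tilde{k}}(\tilde{w}) + g_{\tilde{w}}}_{\text{error correction term}}, \tag{4.1}$$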
where w̃ is an old iterate, and gw̃ is an approximation of the true gradient ∇f(w̃). Typ-
ically, gw̃ can be kept fixed over an epoch or can be updated cheaply on every iteration.
As an example, the SVRG algorithm [JZ13] has an update rule of the form:
$$w_{k+1} = w_k - \alpha\left(\nabla f_{\tilde{k}}(w_k) - \nabla f_{\tilde{k}}(\tilde{w}) + \nabla f(\tilde{w})\right), \tag{4.2}$$
where w̃ is chosen to be a recent iterate from the algorithm history and is fixed over 1 or
2 epochs, and gw̃ = ∇f(w̃) is the true gradient of f at w̃, which needs to be computed
once every 1 or 2 epochs. Another popular VR algorithm, SAGA [DBLJ14], uses the
following corrected gradient approximation
$$\tilde{g}_k = \nabla f_{\tilde{k}}(w_k) - \nabla f_{\tilde{k}}(\tilde{w}_{\tilde{k}}) + \frac{1}{n} \sum_{j=1}^{n} \nabla f_j(\tilde{w}_j), \tag{4.3}$$
where each ∇fj(w̃j) denotes the most recent value of ∇fj and w̃j denotes the iterate at
which the most recent ∇fj was evaluated. In this case gw̃ is the average of the ∇fj(w̃j)
values for all j ∈ {1, 2, . . . , n}. This error correction term reduces the variance in the
stochastic gradients, and thus ensures fast convergence. Notice that for both algorithms,
SVRG and SAGA, if $\tilde{k}$ is chosen uniformly at random from $\{1, 2, \cdots, n\}$, we
have $\mathbb{E}_{\tilde{k}}[\tilde{g}_k] = \nabla f(w_k)$. Thus, the error correction term has expected value 0 and the
approximate gradient $\tilde{g}_k$ is unbiased for both SVRG and SAGA.
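The unbiasedness of the corrected gradients in (4.2) and (4.3) can be checked numerically by averaging the estimate over all indices. The sketch below uses made-up one-dimensional least-squares components $f_j(w) = \frac{1}{2}(a_j w - b_j)^2$; the data, the old iterates, and the function names are assumptions for illustration only.

```python
def grad_j(j, w, data):
    """Gradient of f_j(w) = 0.5 * (a_j * w - b_j)^2."""
    a, b = data[j]
    return a * (a * w - b)


def full_grad(w, data):
    """Gradient of the average objective f = (1/n) * sum_j f_j."""
    return sum(grad_j(j, w, data) for j in range(len(data))) / len(data)


def svrg_estimate(j, w, w_snapshot, g_snapshot, data):
    """Corrected gradient used in (4.2); g_snapshot is the full gradient at w_snapshot."""
    return grad_j(j, w, data) - grad_j(j, w_snapshot, data) + g_snapshot


def saga_estimate(j, w, stored):
    """Corrected gradient (4.3); stored[j] is the most recent value of grad_j."""
    return grad_j(j, w, DATA) - stored[j] + sum(stored) / len(stored)


DATA = [(1.0, 2.0), (2.0, 1.0), (3.0, -1.0), (0.5, 4.0)]
n = len(DATA)
w, w_snapshot = 0.3, 1.0
g_snapshot = full_grad(w_snapshot, DATA)
# Gradients stored at arbitrary old iterates, one per component.
stored = [grad_j(j, 0.1 * j, DATA) for j in range(n)]

svrg_mean = sum(svrg_estimate(j, w, w_snapshot, g_snapshot, DATA) for j in range(n)) / n
saga_mean = sum(saga_estimate(j, w, stored) for j in range(n)) / n
```

Averaging over a uniformly chosen index recovers the true gradient at `w` for both estimators, up to floating point rounding, regardless of where the old gradients were evaluated.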
Most work on VR methods has focused on studying their faster convergence rates
and better stability properties when compared to classical SGD in the sequential setting.
While there have been a few recent papers on parallelizing VR methods, these methods
scale poorly in distributed settings and all prior work that we know of has focused on
small-scale parallel or shared memory settings, with the data distributed over 10 or 20
nodes [RHS+15, MPP+15, PLT+16]. These parallel algorithms use a parameter server
model of computation, and are based on the assumption that communication costs are low,
which may not be true in large-scale heterogeneous distributed computing environments.
The fact that the error correction term reduces the variance in the stochastic gradients,
however, seems to indicate that distributed VR methods could be helpful in distributed
settings. In particular, the variance-reduced gradients would help in dealing with the
problems of instability and slower convergence faced by regular stochastic methods when
the frequency of communication between the server and the local nodes is reduced.
Contributions
In this work, we use variance reduction to dramatically boost the performance of
SGD in the distributed setting. We do this by exploiting the dependence of VR methods on
the gradient correction term $g_{\tilde{w}}$. We allow many local worker nodes to run simultaneously,
while communicating with the central server only through the exchange of this central
error correction term and the locally stored iterates. The proposed schemes allow many
asynchronous processes to work towards a central solution with minimal communication,
while simultaneously benefitting from the fast convergence provided by VR.
This work has four main contributions:
• First, we present a new VR algorithm CentralVR, built on SAGA, that is robust to
noise and variance in the dataset. We propose synchronous (CentralVR-Sync) and
asynchronous (CentralVR-Async) variations of CentralVR which can linearly scale
up over massive datasets using hundreds of cores.
• Second, we theoretically study the convergence of CentralVR and prove linear con-
vergence of the method with constant step sizes.
• Third, we propose distributed versions of the existing popular VR algorithms, SVRG
and SAGA, that are robust to high communication latency between the worker
nodes and the central server, and can scale over large distributed settings ranging
over hundreds of nodes. Table 4.1 summarizes the distributed algorithms proposed
in this section and their storage and computation requirements.
• Finally, we present empirical results over different models and datasets that show
that these distributed algorithms can be trained on massive highly distributed datasets
in far less time than existing state-of-the-art stochastic optimization methods. Per-
formance of all these distributed methods scales linearly up to hundreds of workers
with low communication frequency. We show empirically that the proposed meth-
ods converge much faster than competing options.
Table 4.1: Distributed Algorithms Proposed
Proposed Algorithm    Asynchronous?    Storage (No. of gradients)    Gradients/Iteration
CentralVR-Sync        No               n                             1
CentralVR-Async       Yes              n                             1
Distributed SVRG      No               2                             2.5
Distributed SAGA      Yes              n                             1
4.2 CentralVR algorithm: single-worker case
We begin by proposing our new VR scheme, CentralVR, in the single-worker case.
As we will see later, the proposed method has a natural generalization to the distributed
setting that has low communication requirements.
4.2.1 Algorithm overview
Our proposed VR scheme is divided into epochs, with n updates taking place in
each epoch. Let the iterates generated in the m-th epoch be written as $\{w_j^m\}_{j=1}^{n}$. Also
let $\tilde{w}_l^m$ denote the iterate at which the l-th data index was most recently used before the
(m+1)-th epoch (i.e., on or before the m-th epoch). Then, the update for CentralVR is:
$$w_{k+1}^{m+1} = w_k^{m+1} - \alpha v_k^{m+1}, \tag{4.4}$$
$$v_k^{m+1} = \nabla f_{\tilde{k}}(w_k^{m+1}) - \nabla f_{\tilde{k}}(\tilde{w}_{\tilde{k}}^m) + \frac{1}{n} \sum_{j=1}^{n} \nabla f_j(\tilde{w}_j^m). \tag{4.5}$$
Denote $\bar{g}^m = \frac{1}{n} \sum_{j=1}^{n} \nabla f_j(\tilde{w}_j^m)$. Thus, $\bar{g}^m$ is the average of the gradients of all
component functions $\{\nabla f_j\}_{j=1}^{n}$, each evaluated at the most recent iterate $\{\tilde{w}_j^m\}_{j=1}^{n}$ at which
the corresponding function was used on or before the m-th epoch. These gradients are
stored in a table, and the average gradient $\bar{g}^m$ is updated at the end of each epoch, i.e.,
after every n parameter updates. Note that if $\tilde{k}$ is chosen uniformly at random from the
set $\{1, 2, \cdots, n\}$ on each iteration k, then we have $\mathbb{E}_{\tilde{k}}\left[\nabla f_{\tilde{k}}(\tilde{w}_{\tilde{k}}^m)\right] = \bar{g}^m$. Thus, the error
correction term has expected value 0, and $\mathbb{E}\left[v_k^{m+1}\right] = \nabla f(w_k^{m+1})$, i.e., the approximate
gradient $v_k^{m+1}$ is unbiased.
4.2.2 Permutation sampling
In practical implementations, it is natural to consider a random permutation of the
data indices on every epoch, rather than uniformly choosing a random index on every
iteration. Thus, on each epoch, a random permutation of the data indices is chosen and a
pass is made over the entire dataset, resulting in n updates, one per data sample.
Permutation sampling often outperforms uniform random sampling empirically [Bot09,Bot12],
although theoretical justification for this is still limited (see [GOP15, Sha16] for some
recent results).
As an alternative to uniform random sampling, CentralVR can leverage random
permutations over the data indices. Let $\pi^m$ denote a random permutation of the data
indices $\{1, 2, \cdots, n\}$ for the m-th epoch, with $\pi_j^m$ denoting the data index chosen in the
j-th iteration in the m-th epoch. Thus, now $\tilde{w}_l^m$ denotes the iterate corresponding to the
point when the l-th data index was chosen in the m-th epoch. The update rule with the
random permutation is given by (4.4) and (4.5), with $\tilde{k} = \pi_k^{m+1}$.
Summing (4.5) over all $k = 0, 1, \cdots, n-1$, we get $\sum_{k=0}^{n-1} v_k^{m+1} = \sum_{j=1}^{n} \nabla f_j(\tilde{w}_j^{m+1})$.
Thus, summing (4.4) over all $k = 0, 1, \cdots, n-1$, using the telescoping sum in $w_k^{m+1}$,
and using the convention that $w_0^{m+1} = w_n^m$, we get
$$w_0^{m+2} = w_0^{m+1} - \alpha \sum_{j=1}^{n} \nabla f_j\left(\tilde{w}_j^{m+1}\right). \tag{4.6}$$
Equation (4.6) shows the update rule in terms of the iterates at the ends of the
epochs. Thus, over an epoch, the average gradient accumulated by CentralVR is unbi-
ased and thus is a good estimate of the true gradient. This average gradient term can be
accumulated cheaply during an epoch, without any noticeable overhead.
4.2.3 Algorithm details for CentralVR
The detailed steps of CentralVR are listed in Algorithm 4. Note, the stored gradients
and the average gradient term gw̃ are initialized using a single epoch of “vanilla” SGD
with no VR correction.
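Algorithm 4 CentralVR Algorithm: single worker case
1: parameters learning rate $\alpha$
2: initialize $w$, $\{\nabla f_j(\tilde{w}_j)\}_j$, and $\bar{g}$ using plain SGD
3: while not converged do
4:   $\tilde{g} \leftarrow 0$
5:   set $\pi$: random permutation of indices $1, 2, \cdots, n$
6:   for k in $\{1, \ldots, n\}$ do
7:     set: $w_{k+1} \leftarrow w_k - \alpha\left(\nabla f_{\pi_k}(w_k) - \nabla f_{\pi_k}(\tilde{w}_{\pi_k}) + \bar{g}\right)$
8:     accumulate average: $\tilde{g} \leftarrow \tilde{g} + \nabla f_{\pi_k}(w_k)/n$
9:     store gradient: $\nabla f_{\pi_k}(\tilde{w}_{\pi_k}) \leftarrow \nabla f_{\pi_k}(w_k)$
10:  end for
11:  set average gradient for next epoch: $\bar{g} \leftarrow \tilde{g}$
12: end while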
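Algorithm 4 translates almost line by line into code. The following Python sketch is a scalar illustration under stated assumptions, not the implementation used in our experiments: the function `central_vr`, the callback `grads`, the fixed number of epochs in place of a convergence test, and the toy components $f_j(w) = \frac{1}{2}(w - b_j)^2$ are all hypothetical choices made for brevity.

```python
import random


def central_vr(grads, n, w0, alpha, epochs, seed=0):
    """Single-worker CentralVR on a scalar parameter.

    grads(j, w) returns the gradient of the j-th component function at w.
    """
    rng = random.Random(seed)
    w = w0
    table = [0.0] * n
    # Line 2: one epoch of plain SGD initializes the gradient table and g_bar.
    for j in rng.sample(range(n), n):
        table[j] = grads(j, w)
        w -= alpha * table[j]
    g_bar = sum(table) / n
    for _ in range(epochs):                          # line 3 (fixed epoch budget here)
        g_acc = 0.0                                  # line 4
        perm = rng.sample(range(n), n)               # line 5
        for j in perm:                               # line 6
            g_new = grads(j, w)
            w -= alpha * (g_new - table[j] + g_bar)  # line 7
            g_acc += g_new / n                       # line 8
            table[j] = g_new                         # line 9
        g_bar = g_acc                                # line 11
    return w


# Toy problem: f_j(w) = 0.5 * (w - b_j)^2, whose average is minimized at mean(b).
B_VALUES = [2.0, 1.0, -1.0, 4.0]


def toy_grads(j, w):
    return w - B_VALUES[j]


w_final = central_vr(toy_grads, n=len(B_VALUES), w0=0.0, alpha=0.1, epochs=100)
```

Note that the average gradient `g_bar` is held fixed within an epoch and refreshed only at its end, which is the property the distributed variants later exploit. On the toy problem the iterate approaches the minimizer 1.5 with a constant step size.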
CentralVR builds on the SAGA method. SAGA relies on the update rule (4.3),
which requires an average over a large number of iterates ($g_{\tilde{w}} = \frac{1}{n} \sum_j \nabla f_j(\tilde{w}_j)$) to be
continuously updated on every iteration. In the distributed setting, where the vector $g_{\tilde{w}}$
must be shared across nodes, maintaining an up-to-date average requires large amounts
of communication. This makes SAGA less stable in distributed implementations when
the communication frequency is decreased. Updating gw̃ only occasionally (as we do
in the distributed variants of CentralVR below) translates into significant communication
savings in the distributed setting.
CentralVR has the same time and space complexities as SAGA. Namely, on every
iteration, 1 gradient computation is required, similar to SGD, and the n gradients
$\{\nabla f_j(\tilde{w}_j^m)\}_{j=1}^{n}$ also need to be stored. Note that this is not always a significant storage
requirement, since for models like logistic regression and ridge regression only a single
number is required to be stored corresponding to each gradient.
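To illustrate the storage remark, consider an unregularized least-squares term $(\langle x_i, w\rangle - y_i)^2$, whose gradient is a scalar multiple of the data point $x_i$. The sketch below stores only that scalar and rebuilds the gradient on demand; the function names are hypothetical, and any regularization term is assumed to be handled separately.

```python
def residual_scalar(w, x, y):
    """Scalar s such that the gradient of (<x, w> - y)^2 with respect to w equals s * x."""
    return 2.0 * (sum(xi * wi for xi, wi in zip(x, w)) - y)


def gradient_from_scalar(s, x):
    """Rebuild the full gradient from the stored scalar and the data point."""
    return [s * xi for xi in x]


x, y = [1.0, 2.0], 1.0
w = [0.5, 0.5]
s = residual_scalar(w, x, y)               # one float stored per data point
grad = gradient_from_scalar(s, x)
```

The gradient table then holds n floats instead of n vectors of the model dimension.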
4.3 Convergence analysis
We now present convergence bounds for Algorithm 4. We make the following
standard assumptions about the function when studying convergence properties. First,
each $f_i$ is strongly convex with strong convexity constant µ:
$$f_i(w) \ge f_i(w') + \langle \nabla f_i(w'), w - w'\rangle + \frac{\mu}{2}\|w - w'\|^2. \tag{4.7}$$
Second, each $f_i$ has Lipschitz continuous gradients with Lipschitz constant L so that
$$f_i(w) \le f_i(w') + \langle \nabla f_i(w'), w - w'\rangle + \frac{L}{2}\|w - w'\|^2. \tag{4.8}$$
We now present our main result.
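Theorem 4.3.1. Consider CentralVR with data index $\tilde{k}$ drawn uniformly at random (with
replacement) on each iteration k. Define $\rho := \max\left(1 - \alpha\mu, \frac{2L^2\alpha}{\mu(1 - 2L\alpha)}\right)$. If the step size α
is small enough such that $0 < \rho < 1$, then we have the following bound:
$$\mathbb{E}\left\|w_0^{m+2} - w^\star\right\|^2 + c\,\mathbb{E}\left(f(w^{m+1}) - f(w^\star)\right) \le \rho\left(\mathbb{E}\left\|w_0^{m+1} - w^\star\right\|^2 + c\,\mathbb{E}\left(f(\tilde{w}^m) - f(w^\star)\right)\right),$$
where $c = 2n\alpha(1 - 2L\alpha)$ and we define $f(w^m) := \frac{1}{n} \sum_{k=0}^{n-1} f(w_k^m)$. In other words, the
method converges linearly.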
We first start with two lemmas that will be useful in the proof for Theorem 4.3.1.
Lemma 4.3.1. For any f defined as $f := \frac{1}{n} \sum_{i=1}^{n} f_i$, where each $f_i$ satisfies (4.7) and
(4.8), and on conditioning on any w, we have
$$\mathbb{E}\left\|\nabla f_j(w) - \nabla f_j(w^\star)\right\|^2 \le 2L\left(f(w) - f(w^\star)\right),$$
where j is sampled uniformly at random from $\{1, 2, \ldots, n\}$ and $w^\star$ is the minimizer of f.
Proof. A standard result used frequently in the convex optimization literature is:
$$\|\nabla f_j(w) - \nabla f_j(w^\star)\|^2 \le 2L\left(f_j(w) - f_j(w^\star) - \langle \nabla f_j(w^\star), w - w^\star\rangle\right),$$
where $f_j$ is L-Lipschitz smooth. A proof for this inequality can be found in [Nes13]
(Theorem 2.1.5 on page 56). Since j is sampled uniformly at random from $\{1, 2, \ldots, n\}$,
we can write: $\mathbb{E}_j\left(f_j(w) - f_j(w^\star) - \langle \nabla f_j(w^\star), w - w^\star\rangle\right) = f(w) - f(w^\star)$, using the
property that $\nabla f(w^\star) = 0$. The result follows. □
Lemma 4.3.2. For any f defined as $f := \frac{1}{n} \sum_{i=1}^{n} f_i$, where each $f_i$ satisfies (4.7) and
(4.8), and for any w and i we have
$$\left\|\nabla f_i(w) - \nabla f_i(w^\star)\right\|^2 \le \frac{2L^2}{\mu}\left(f(w) - f(w^\star)\right),$$
where $w^\star$ denotes the minimizer of f.
Proof. A standard result used frequently in the convex optimization literature is:
$$\|\nabla f_i(w) - \nabla f_i(w^\star)\|^2 \le L^2\|w - w^\star\|^2,$$
where $f_i$ is L-Lipschitz smooth. A proof for this inequality can be found in [Nes13]
(Theorem 2.1.5 on page 56). From (4.7), we get:
$$\|w - w^\star\|^2 \le \frac{2}{\mu}\left(f(w) - f(w^\star) - \langle w - w^\star, \nabla f(w^\star)\rangle\right) = \frac{2}{\mu}\left(f(w) - f(w^\star)\right),$$
using the property that $\nabla f(w^\star) = 0$. The desired result follows immediately. □
We now move on to the proof of Theorem 4.3.1.
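Proof. Let the update rule for CentralVR be denoted as
$$w_{k+1}^{m+1} = w_k^{m+1} - \alpha v_k^{m+1},$$
$$v_k^{m+1} = \nabla f_{\tilde{k}}(w_k^{m+1}) - \nabla f_{\tilde{k}}(\tilde{w}_{\tilde{k}}^m) + \frac{1}{n} \sum_j \nabla f_j(\tilde{w}_j^m).$$
In this proof, we assume that the data indices are accessed randomly with replacement.
Thus, $\tilde{w}_{\tilde{k}}^m$ denotes the last iterate when the $\tilde{k}$-th data index was chosen in or before the
m-th epoch. Thus, conditioning on all w, $v_k^{m+1}$ is an unbiased estimator of the true gradient
at $w_k^{m+1}$, i.e., $\mathbb{E}\left[v_k^{m+1}\right] = \nabla f(w_k^{m+1})$. Conditioned on all history (all w), we first begin
with the standard identity:
$$\begin{aligned}
\mathbb{E}\left[\|w_{k+1}^{m+1} - w^\star\|^2\right] &= \mathbb{E}\left[\|w_k^{m+1} - \alpha v_k^{m+1} - w^\star\|^2\right] \\
&= \|w_k^{m+1} - w^\star\|^2 - 2\alpha\langle w_k^{m+1} - w^\star, \nabla f(w_k^{m+1})\rangle + \alpha^2\,\mathbb{E}\|v_k^{m+1}\|^2.
\end{aligned} \tag{4.9}$$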
We now bound (4.9). Using the definition of strong convexity in (4.7), we can simplify
the inner product term in (4.9) as
$$\langle w^\star - w_k^{m+1}, \nabla f(w_k^{m+1})\rangle \le -\left(f(w_k^{m+1}) - f(w^\star)\right) - \frac{\mu}{2}\|w^\star - w_k^{m+1}\|^2. \tag{4.10}$$
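We now bound the magnitude of the gradient term in (4.9):
$$\begin{aligned}
\mathbb{E}\|v_k^{m+1}\|^2
&= \mathbb{E}\Big\|\nabla f_{\tilde{k}}(w_k^{m+1}) - \nabla f_{\tilde{k}}(\tilde{w}_{\tilde{k}}^m) + \frac{1}{n}\sum_j \nabla f_j(\tilde{w}_j^m)\Big\|^2 \\
&= \mathbb{E}\Big\|\nabla f_{\tilde{k}}(w_k^{m+1}) - \nabla f_{\tilde{k}}(w^\star) + \nabla f_{\tilde{k}}(w^\star) - \nabla f_{\tilde{k}}(\tilde{w}_{\tilde{k}}^m) + \frac{1}{n}\sum_j \nabla f_j(\tilde{w}_j^m)\Big\|^2 \\
&\le 2\,\mathbb{E}\big\|\nabla f_{\tilde{k}}(w_k^{m+1}) - \nabla f_{\tilde{k}}(w^\star)\big\|^2 + 2\,\mathbb{E}\Big\|\nabla f_{\tilde{k}}(\tilde{w}_{\tilde{k}}^m) - \nabla f_{\tilde{k}}(w^\star) - \Big(\frac{1}{n}\sum_j \nabla f_j(\tilde{w}_j^m) - \frac{1}{n}\sum_j \nabla f_j(w^\star)\Big)\Big\|^2 \\
&= 2\,\mathbb{E}\big\|\nabla f_{\tilde{k}}(w_k^{m+1}) - \nabla f_{\tilde{k}}(w^\star)\big\|^2 + 2\,\mathbb{E}\Big\|\nabla f_{\tilde{k}}(\tilde{w}_{\tilde{k}}^m) - \nabla f_{\tilde{k}}(w^\star) - \mathbb{E}\big[\nabla f_{\tilde{k}}(\tilde{w}_{\tilde{k}}^m) - \nabla f_{\tilde{k}}(w^\star)\big]\Big\|^2 \\
&\le 2\,\mathbb{E}\big\|\nabla f_{\tilde{k}}(w_k^{m+1}) - \nabla f_{\tilde{k}}(w^\star)\big\|^2 + 2\,\mathbb{E}\big\|\nabla f_{\tilde{k}}(\tilde{w}_{\tilde{k}}^m) - \nabla f_{\tilde{k}}(w^\star)\big\|^2 \\
&\le 4L\big(f(w_k^{m+1}) - f(w^\star)\big) + \frac{4L^2}{\mu}\,\mathbb{E}\big(f(\tilde{w}_{\tilde{k}}^m) - f(w^\star)\big).
\end{aligned} \tag{4.11}$$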
The second equality uses the property that ∇f(w?) = 0. The first inequality uses the
property that ‖a + b‖2 ≤ 2‖a‖2 + 2‖b‖2. The second inequality uses E‖φ − Eφ‖2 =
E‖φ‖2 − ‖Eφ‖2 ≤ E‖φ‖2, for any random vector φ. The third inequality follows from
Lemma 4.3.1 and Lemma 4.3.2.
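We now plug (4.10) and (4.11) into (4.9) and rearrange:
$$\begin{aligned}
&\mathbb{E}\left[\|w_{k+1}^{m+1} - w^\star\|^2\right] + 2\alpha(1 - 2L\alpha)\left(f(w_k^{m+1}) - f(w^\star)\right) \\
&\quad \le \|w_k^{m+1} - w^\star\|^2 - \alpha\mu\|w_k^{m+1} - w^\star\|^2 + \frac{4L^2\alpha^2}{\mu}\,\mathbb{E}\left(f(\tilde{w}_{\tilde{k}}^m) - f(w^\star)\right).
\end{aligned} \tag{4.12}$$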
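Taking expectation over all w and summing (4.12) over all $k = 0, 1, \ldots, n-1$, we get a
telescoping sum in $\|w_k^{m+1} - w^\star\|^2$ that yields:
$$\begin{aligned}
&\mathbb{E}\|w_0^{m+2} - w^\star\|^2 + 2n\alpha(1 - 2L\alpha)\,\mathbb{E}\left(f(w^{m+1}) - f(w^\star)\right) \\
&\quad \le \mathbb{E}\|w_0^{m+1} - w^\star\|^2 - \alpha\mu \sum_{k=0}^{n-1} \mathbb{E}\|w_k^{m+1} - w^\star\|^2 + \frac{4nL^2\alpha^2}{\mu}\,\mathbb{E}\left(f(\tilde{w}^m) - f(w^\star)\right),
\end{aligned} \tag{4.13}$$
where we use the convention $w_n^m = w_0^{m+1}$, and define $f(w^m)$ as $f(w^m) := \frac{1}{n} \sum_{k=0}^{n-1} f(w_k^m)$.
We now observe that
$$\mathbb{E}\|w_0^{m+1} - w^\star\|^2 \le \sum_{k=0}^{n-1} \mathbb{E}\|w_k^{m+1} - w^\star\|^2.$$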
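Thus we can rewrite
$$-\alpha\mu \sum_{k=0}^{n-1} \mathbb{E}\|w_k^{m+1} - w^\star\|^2 \le -\alpha\mu\,\mathbb{E}\|w_0^{m+1} - w^\star\|^2.$$
Substituting this in (4.13), we get:
$$\begin{aligned}
&\mathbb{E}\|w_0^{m+2} - w^\star\|^2 + 2n\alpha(1 - 2L\alpha)\,\mathbb{E}\left(f(w^{m+1}) - f(w^\star)\right) \\
&\quad \le (1 - \alpha\mu)\,\mathbb{E}\|w_0^{m+1} - w^\star\|^2 + \frac{4nL^2\alpha^2}{\mu}\,\mathbb{E}\left(f(\tilde{w}^m) - f(w^\star)\right).
\end{aligned}$$
We can rewrite this to get:
$$\begin{aligned}
&\mathbb{E}\|w_0^{m+2} - w^\star\|^2 + 2n\alpha(1 - 2L\alpha)\,\mathbb{E}\left(f(w^{m+1}) - f(w^\star)\right) \\
&\quad \le \rho\left(\mathbb{E}\|w_0^{m+1} - w^\star\|^2 + 2n\alpha(1 - 2L\alpha)\,\mathbb{E}\left(f(\tilde{w}^m) - f(w^\star)\right)\right),
\end{aligned}$$
where $\rho = \max\left(1 - \alpha\mu, \frac{4nL^2\alpha^2}{2n\mu\alpha(1 - 2L\alpha)}\right)$. The result immediately follows. □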
Remark on step size restrictions From Theorem 4.3.1, notice that CentralVR converges
linearly when the step size α is small enough such that
$$\alpha < \min\left(\frac{1}{\mu}, \frac{1}{2L}, \frac{\mu}{2L(L + \mu)}\right).$$
Since L ≥ µ, we see that this condition is satisfied whenever $\alpha < \frac{\mu}{2L(L + \mu)}$.
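The step size restriction can be checked numerically against the contraction factor ρ of Theorem 4.3.1. The sketch below uses hypothetical function names and made-up constants µ = 1 and L = 2.

```python
def contraction_factor(alpha, mu, lipschitz):
    """rho = max(1 - alpha * mu, 2 L^2 alpha / (mu * (1 - 2 L alpha))) from Theorem 4.3.1."""
    second = 2.0 * lipschitz ** 2 * alpha / (mu * (1.0 - 2.0 * lipschitz * alpha))
    return max(1.0 - alpha * mu, second)


def max_step_size(mu, lipschitz):
    """Bound mu / (2 L (L + mu)) from the remark on step size restrictions."""
    return mu / (2.0 * lipschitz * (lipschitz + mu))


mu, lipschitz = 1.0, 2.0
alpha_max = max_step_size(mu, lipschitz)          # equals 1/12 for these constants
rho_inside = contraction_factor(0.9 * alpha_max, mu, lipschitz)
rho_outside = contraction_factor(1.1 * alpha_max, mu, lipschitz)
```

For these constants a step size slightly below the bound gives ρ < 1, while a step size slightly above it gives ρ > 1, in line with the remark.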
4.4 Distributed algorithms
We now consider the distributed setting, with a single central server and p local
clients, each of which contains a portion of the data set. In this setting, the data is
decomposed into disjoint subsets $\{\Omega_s\}$, where s denotes a particular local client, and
$\sum_s |\Omega_s| = n$. We denote the i-th function stored on client s as $f_i^s$. Our goal is to
minimize the global objective function of the form
$$f(w) = \frac{1}{n} \sum_{s=1}^{p} \sum_{j=1}^{|\Omega_s|} f_j^s(w).$$
We consider a centralized setting, where the clients can only communicate with the
central server, and our goal is to derive stochastic algorithms that scale linearly to high
p, while remaining stable even under low communication frequencies between local and
central nodes.
4.4.1 Synchronous version
CentralVR naturally extends to the distributed synchronous setting, and is presented
in Algorithm 5. To distinguish the algorithm from the single worker case, we call it
CentralVR-Sync. On each epoch, the local nodes first retrieve a copy of the central iterate
$w$, and also $g_{\tilde w}$, which represents the averaged gradient over all data. The CentralVR
method is then performed on each node, and the most recent gradient for each data point
$\nabla f^s_{\tilde k}(\tilde w_{\tilde k})$ is stored. By sharing $g_{\tilde w}$ across nodes, we ensure that the local gradient updates
utilize global gradient information from remote nodes. This prevents the local node from
drifting far away from the global solution, even if each local node runs for one whole
epoch before communicating back with the central server.
In CentralVR-Sync, each local node performs local updates for one epoch, or |Ωs|
iterations, before communicating with the server. This is a rather low communication
frequency compared to a parameter server model of computation in which updates are
continuously streamed to the central node. This makes a significant difference in runtimes
when the number of local nodes is large, as shown in later sections.
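The synchronous scheme described above can be sketched as a single-process simulation. This is a minimal illustration, not the MPI implementation used in the experiments: it assumes a scalar parameter, a scalar ridge term per sample, and equal-sized shards so that a plain mean over workers matches the global average. The names `grad`, `centralvr_sync`, and `shards` are hypothetical.

```python
import random

def grad(w, x, y, lam=1e-4):
    """Gradient of the scalar ridge term (x*w - y)**2 + lam*w**2."""
    return 2.0 * x * (x * w - y) + 2.0 * lam * w

def centralvr_sync(shards, alpha=0.05, epochs=30, seed=0):
    """Simulate CentralVR-Sync; each shard plays the role of one local node."""
    rng = random.Random(seed)
    p = len(shards)
    n = sum(len(s) for s in shards)
    w = 0.0
    # One stored gradient per local sample, evaluated at the initial iterate.
    stored = [[grad(w, x, y) for (x, y) in shard] for shard in shards]
    g_bar = sum(sum(g) for g in stored) / n
    for _ in range(epochs):
        local_w, local_g = [], []
        for s, shard in enumerate(shards):
            w_s, g_tilde = w, 0.0          # start from the central iterate
            perm = list(range(len(shard)))
            rng.shuffle(perm)              # random permutation of local indices
            for j in perm:
                x, y = shard[j]
                g = grad(w_s, x, y)
                # Variance-reduced step; g_bar stays fixed during the epoch.
                w_s -= alpha * (g - stored[s][j] + g_bar)
                g_tilde += g / len(shard)  # running average of fresh gradients
                stored[s][j] = g           # remember the most recent gradient
            local_w.append(w_s)
            local_g.append(g_tilde)
        # Central node: average what the workers sent, then broadcast.
        w = sum(local_w) / p
        g_bar = sum(local_g) / p
    return w

# Toy usage: noiseless data y = 3x split across four simulated workers.
data_rng = random.Random(1)
shards = [[(x, 3.0 * x) for x in (data_rng.uniform(-1, 1) for _ in range(50))]
          for _ in range(4)]
w_hat = centralvr_sync(shards)
```

Each simulated worker communicates only once per epoch, which is the low communication frequency the text refers to.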
4.4.2 Asynchronous version
The synchronous algorithm can be extended very easily to the asynchronous case,
CentralVR-Async, as shown in Algorithm 6. In CentralVR-Async, the central server keeps
a copy of the current iterate w and average gradient ḡ. The key idea for CentralVR-Async
is that, once a local node completes an epoch, it sends the change in the local averages,
given by ∆ws and ∆ḡs, over the last epoch to the central server. This change is added to
the global w and ḡ to update the parameters stored on the central server. Thus, when the
central server receives parameters from a local node s, it performs the updates:
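\[
w = w + \frac{1}{p}\Delta w^s \quad\text{and}\quad \bar g = \bar g + \frac{1}{p}\Delta\bar g^s,
\]
where $\Delta w^s$ and $\Delta\bar g^s$ are given by
\[
\Delta w^s = \big\{w^{m+1}_n - w^m_n\big\}_s \quad\text{and}\quad
\Delta\bar g^s = \Big\{\frac{1}{|\Omega_s|}\sum_{j\in\Omega_s}\nabla f^s_j(\tilde w^{m+1}_j) - \frac{1}{|\Omega_s|}\sum_{j\in\Omega_s}\nabla f^s_j(\tilde w^m_j)\Big\}_s.
\]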
Sending the change in the local parameter values rather than the local parameters them-
selves ensures that, when updating the central parameter, the previous contribution to
the average from that local worker is just replaced by the new contribution. Thus, a fast
working local node does not bias the global average solution toward its local solution with
an excessive number of updates. This makes the algorithm more robust to heterogeneous
computing environments where nodes work at disparate speeds.
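The effect of sending changes rather than raw parameters can be illustrated with a small sketch. It assumes scalar parameters and a server that starts at zero; `CentralServer` and `Worker` are hypothetical names, not part of any library or of the original implementation.

```python
class CentralServer:
    """Holds the global iterate and average gradient (scalars for brevity)."""

    def __init__(self, p, w=0.0, g_bar=0.0):
        self.rho = 1.0 / p      # each worker contributes a 1/p share
        self.w = w
        self.g_bar = g_bar

    def apply(self, dw, dg):
        # Add the scaled change reported by one worker.
        self.w += self.rho * dw
        self.g_bar += self.rho * dg
        return self.w, self.g_bar


class Worker:
    """Remembers what it last reported so it can send only the change."""

    def __init__(self):
        self.w_old = 0.0
        self.g_old = 0.0

    def delta(self, w_local, g_local):
        dw, dg = w_local - self.w_old, g_local - self.g_old
        self.w_old, self.g_old = w_local, g_local
        return dw, dg


server = CentralServer(p=2)
fast, slow = Worker(), Worker()
for w_local in (4.0, 6.0, 6.0):          # the fast worker reports three times
    server.apply(*fast.delta(w_local, 0.0))
server.apply(*slow.delta(2.0, 0.0))      # the slow worker reports once
```

After these calls `server.w` is the plain average of the latest local values, (6.0 + 2.0) / 2, regardless of how often the fast worker reported.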
The proposed CentralVR scheme has several advantages. It does not require a full
gradient computation as in SVRG, and thus can be made fully asynchronous. Moreover,
since the average gradient gw̃ in the error correction term is updated only at the end of an
epoch, communication periods can be increased between the central server and the local
nodes, while still maintaining fast and stable convergence.
4.5 Distributed variants of SVRG and SAGA
In this section, we propose distributed variants of popular variance reduction meth-
ods: SVRG and SAGA. The properties of these variants are overviewed in Table 4.1.
4.5.1 Distributed SVRG
In this section, we present a distributed version of SVRG appropriate for distributed
scenarios with high communication delays. Recently, in [RHS+15], the authors presented
an asynchronous distributed version of SVRG using a parameter server model of compu-
tation. In SVRG, the average gradient term is gw̃ = ∇f(w̃) as shown in (4.2). This cor-
rection term is very accurate because it uses the entire dataset. This would indicate that
the algorithm would be robust to high communication periods between the local nodes
and the server.
However, a truly asynchronous method is not possible with SVRG since a synchro-
nization step is unavoidable when computing the full gradient. Thus, in this section, we
present a synchronous variant of SVRG in Algorithm 7. We define an additional parame-
ter τ to denote the communication period, i.e., the number of updates to run on each local
node before communicating with the central server.
The true gradient $\bar g$ is maintained across all nodes throughout the whole communication period $\tau$, thus ensuring that the local workers stay close to the desired solution,
even when $\tau$ is large. After $\tau$ updates, the current iterate $w^s$ on each local node $s$ is averaged on the central server to get $w$. The true gradient is evaluated at $w$, i.e., $\bar g = \nabla f(w)$,
and $\bar w = w$ is used on each local node during the next epoch.
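The structure of this synchronous SVRG variant can be sketched as a single-process simulation. The sketch assumes a scalar parameter and an unregularized scalar least-squares term per sample; `grad`, `sync_svrg`, and `shards` are hypothetical names, and the full-gradient pass stands in for the synchronization step.

```python
import random

def grad(w, x, y):
    """Gradient of the scalar least-squares term (x*w - y)**2."""
    return 2.0 * x * (x * w - y)

def sync_svrg(shards, alpha=0.05, tau=100, rounds=20, seed=0):
    """Simulate synchronous distributed SVRG with communication period tau."""
    rng = random.Random(seed)
    n = sum(len(s) for s in shards)
    w = 0.0
    for _ in range(rounds):
        w_bar = w
        # Synchronization step: full gradient at the snapshot w_bar.
        g_bar = sum(grad(w_bar, x, y) for s in shards for (x, y) in s) / n
        local_w = []
        for shard in shards:
            w_s = w
            for _ in range(tau):
                x, y = rng.choice(shard)   # sample with replacement
                w_s -= alpha * (grad(w_s, x, y) - grad(w_bar, x, y) + g_bar)
            local_w.append(w_s)
        w = sum(local_w) / len(local_w)    # central node averages the iterates
    return w

data_rng = random.Random(1)
shards = [[(x, 3.0 * x) for x in (data_rng.uniform(-1, 1) for _ in range(50))]
          for _ in range(4)]
w_svrg = sync_svrg(shards)
```

Because `g_bar` is the exact gradient at the snapshot, the local correction stays accurate for all `tau` local steps, which is why this variant tolerates long communication periods.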
4.5.2 Distributed SAGA
The update rule for SAGA is given in (4.3). Since there is no synchronization step
required as in SVRG, there is a very natural asynchronous version of the algorithm under
the parameter server model of computation. A linear convergence proof has been pre-
sented for the parameter server model of SAGA (see Theorem 3 in [RHS+15]). However,
this work does not contain any empirical studies of the method. The parameter server
framework is a very natural generalization of SAGA; however, it has very high bandwidth
requirements for large numbers of nodes.
Algorithm 8 presents an asynchronous version of SAGA with lower communication
frequency. Like SVRG, we define a communication period parameter τ which determines
the number of iterations to run on each machine before central communication.
In the SAGA algorithm, the average gradient term ḡ is updated on each iteration.
Thus, as local iterations progress, the average gradient evolves differently on each local
node. This makes the algorithm less robust to higher communication periods τ . As the
communication period increases, the local nodes drift farther apart from each other and
the global solution. Thus, the learning rate needs to shrink as τ increases over a certain
limit. This in turn slows down convergence. For this reason, distributed SAGA is less
tolerant to long communication periods than the algorithms in Sections 4.4 and 4.5.1.
However, it still has fast convergence for much higher communication periods than exist-
ing stochastic schemes.
The asynchronous SAGA method (Algorithm 8) is built on the same idea as the
proposed asynchronous algorithm: running averages are kept on each local node, and at
the end of an epoch the change in the parameter values is sent to the central server. This
makes the algorithm more robust when local nodes work at heterogeneous speeds.
In our distributed SAGA algorithm, care has to be taken while updating the average
gradient ḡ. Note that ḡ is averaged over the whole dataset. Thus, when replacing the
gradient value at the current index k̃, the update is scaled down by a factor of n (the total
number of global samples, as opposed to |Ωs|, the number of local samples). At the end
of a local epoch, the average of the stored gradients on each local node is sent back to the
central server, along with the current estimate w. This ensures that the average gradient
term on the central server ḡ is built from the most recent gradient computations at each
index.
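The scaling of the average-gradient update can be checked on a tiny example. This sketch assumes scalar gradients and two local tables of stored gradients; `replace_gradient` is a hypothetical helper, not part of the original code.

```python
def replace_gradient(stored, g_bar, idx, new_grad, n_total):
    """Swap one stored gradient and keep g_bar equal to the global mean.

    The correction is divided by n_total (all samples on all nodes),
    not by len(stored) (the samples held by this node).
    """
    g_bar += (new_grad - stored[idx]) / n_total
    stored[idx] = new_grad
    return g_bar


node_a = [1.0, 2.0]                 # stored gradients on one node
node_b = [3.0, 4.0, 5.0, 6.0]       # stored gradients on another node
n = len(node_a) + len(node_b)
g_bar = (sum(node_a) + sum(node_b)) / n
g_bar = replace_gradient(node_a, g_bar, 0, 7.0, n)
global_mean = (sum(node_a) + sum(node_b)) / n
```

Dividing by `len(node_a)` instead would overweight the local change relative to the global mean that `g_bar` is meant to track.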
4.6 Empirical results
Figure 4.1: Single Worker Results. Logistic regression on toy dataset; Ridge regression on
toy data; Logistic regression on IJCNN1 dataset; Ridge regression on MILLIONSONG
dataset. In each case CentralVR converges much faster than SVRG and SAGA.
In this section, we present the empirical performance of the proposed methods, both
in sequential and distributed settings. We benchmark the methods on two test problems.
The first is a binary classification problem with $\ell_2$-regularized logistic regression, where each
$f_i$ is of the form $f_i(w) = \log\big(1 + \exp(-y_i\langle x_i, w\rangle)\big) + \lambda\|w\|^2$, where feature vector
$x_i \in \mathbb{R}^d$ has label $y_i \in \mathbb{R}$. We also consider a ridge regression problem of the form
$f_i(w) = (\langle x_i, w\rangle - y_i)^2 + \lambda\|w\|^2$. We present all our results with the $\ell_2$ regularization
parameter set at $\lambda = 10^{-4}$, though we found that our results were not sensitive to this
choice of parameter.
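The two per-sample objectives can be written out together with their gradients. This is a plain-Python sketch for illustration; the function names are hypothetical, vectors are ordinary lists, and the logistic version is not hardened against overflow for very large margins.

```python
import math

def logistic_loss_grad(w, x, y, lam=1e-4):
    """log(1 + exp(-y <x, w>)) + lam * ||w||^2 and its gradient."""
    z = y * sum(xi * wi for xi, wi in zip(x, w))
    loss = math.log1p(math.exp(-z)) + lam * sum(wi * wi for wi in w)
    coeff = -y / (1.0 + math.exp(z))     # derivative of log(1 + exp(-z)) times y
    grad = [coeff * xi + 2.0 * lam * wi for xi, wi in zip(x, w)]
    return loss, grad

def ridge_loss_grad(w, x, y, lam=1e-4):
    """(<x, w> - y)^2 + lam * ||w||^2 and its gradient."""
    residual = sum(xi * wi for xi, wi in zip(x, w)) - y
    loss = residual ** 2 + lam * sum(wi * wi for wi in w)
    grad = [2.0 * residual * xi + 2.0 * lam * wi for xi, wi in zip(x, w)]
    return loss, grad
```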
4.6.1 Single worker results
We first test our algorithms in the sequential, non-distributed setting. It is well
known that VR beats vanilla SGD by a wide margin in many applications. However, the
different VR methods vary widely in their empirical behavior. We compare the single
worker CentralVR algorithm to the two most popular VR methods, SVRG [JZ13] and
SAGA [DBLJ14].
We test the methods on two synthetic “toy” datasets, in addition to two real-world
datasets. Synthetic classification data was generated by sampling two normal distributions
with unit variance and means separated by one unit. For the least-squares prediction problem, we generate a random normal matrix X and random labels of the form y = Xw + ε,
where ε is standard Gaussian noise. For each case, we kept the size of the dataset at
n = 5000 with d = 20 features. For the binary classification problem, we kept equal
numbers of data samples for each class. We also tested performance of our algorithms on
two standard real world datasets: IJCNN1 [Pro01] for binary classification and the MIL-
LIONSONG [BMEWL11] dataset for least squares prediction. IJCNN1 contains 35,000
training data samples of 22 dimensions, while MILLIONSONG contains 463,715 train-
ing samples of 90 dimensions. For all our experiments, we maintain a constant learning
rate, and choose the learning rate that yields fastest convergence.
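A possible way to generate toy data of this kind is sketched below. Several details are assumptions not stated in the text: labels are taken to be +1 and -1, the class means are placed at +0.5 and -0.5 along the first coordinate only, and the ground-truth regression weights are drawn from a standard normal. The function names are hypothetical.

```python
import random

def make_classification(n=5000, d=20, seed=0):
    """Two unit-variance Gaussians whose means are one unit apart."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = 1 if i % 2 == 0 else -1   # equal number of samples per class
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        x[0] += 0.5 * label               # assumed direction of the mean shift
        X.append(x)
        y.append(label)
    return X, y

def make_regression(n=5000, d=20, seed=0):
    """Random normal design matrix with labels y = Xw + standard Gaussian noise."""
    rng = random.Random(seed)
    w_true = [rng.gauss(0.0, 1.0) for _ in range(d)]
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    y = [sum(a * b for a, b in zip(row, w_true)) + rng.gauss(0.0, 1.0) for row in X]
    return X, y, w_true
```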
Results appear in Figure 4.1. We compare convergence rates of the algorithms in
terms of number of gradient computations for each method. This provides a level playing
field since different VR methods require different numbers of gradient computations per
iteration, and gradient computations dominate the computing time. The proposed CentralVR algorithm substantially outperforms SAGA and SVRG in all cases, requiring less than
one-third of the gradient computations of the other methods.
4.6.2 Distributed results
We now present results of our algorithms in highly distributed settings. We imple-
ment the algorithms using a Python binding to MPI, and all experiments were run on an
Intel Xeon E5 cluster with 24 cores per node. All our asynchronous implementations are
“locked”, where at a given time only one local node can update the parameters on the cen-
tral server. However, all proposed asynchronous algorithms can be easily implemented in
a lock-free setting, leading to further speedups.
We compare the distributed versions of CentralVR, CentralVR-Async [CVR-Async
in Figures 4.2 and 4.3] and CentralVR-Sync [CVR-Sync in Figures 4.2 and 4.3], proposed
in Section 4.4 with the following algorithms:
1. Distributed SVRG (Section 4.5.1) [D-SVRG in Figures 4.2 and 4.3]. We set the
communication period τ = 2n as recommended in [JZ13]. We found the perfor-
mance of the algorithm to be very robust to τ .
2. Distributed SAGA (Section 4.5.2) [D-SAGA in Figures 4.2 and 4.3]. We vary the
communication period τ = {10, 100, 1000, 10000} and present results for the τ
yielding best results. The algorithm remains relatively stable for τ = {10, 100, 1000}
but convergence speeds start slowing down significantly at τ = 10000.
3. Elastic Averaging SGD (EASGD): This is a recently proposed asynchronous SGD
method [ZCL15] that has been shown to efficiently accelerate training times of deep
neural networks. As in [ZCL15], we tested the algorithm for communication peri-
ods τ = {4, 16, 64}, and found results to be nearly insensitive to τ (τ updates occur
before communication). We also found the regular EASGD algorithm to outper-
form the momentum version (M-EASGD). We test performance both for a constant
step size as well as a decaying step size (using a local clock on each machine) as
given by $\alpha_0/(1 + \gamma k)^{0.5}$ (as in [ZCL15]), where $\alpha_0$ is the initial step size, $k$ is the
local iteration number, and γ is the decay parameter. EASGD has been shown to
outperform the related popular asynchronous SGD method Downpour [DCM+12],
on both convex and non-convex settings.
4. Asynchronous “Parameter Server” SVRG [PS-SVRG in Figures 4.2 and 4.3]: an
asynchronous version of SVRG on a parameter server model of computation [RHS+15].
This method outperforms a popular asynchronous SGD method, Hogwild [RRWN11],
which also uses a parameter server model. We set the epoch size to 2n, as recom-
mended in [RHS+15].
For the variance reduction methods, we performed experiments using a constant
step size, as well as the simple learning rate decay rule $\alpha_l = \alpha_0\gamma^l$ (here, $l$ is the number of epochs, instead of iterations). Decaying the step size does not yield consistent
performance gains, and constant step sizes work very well in practice.
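The two decay rules mentioned in this section can be written as small helpers. The function names are hypothetical; the formulas are the per-iteration rule used for EASGD and the per-epoch rule used for the variance reduction methods.

```python
def iteration_decay(alpha0, gamma, k):
    """Step size alpha0 / (1 + gamma * k) ** 0.5 at local iteration k."""
    return alpha0 / (1.0 + gamma * k) ** 0.5

def epoch_decay(alpha0, gamma, l):
    """Step size alpha0 * gamma ** l after l epochs."""
    return alpha0 * gamma ** l
```

With `gamma = 1.0` in `epoch_decay`, or `gamma = 0.0` in `iteration_decay`, both reduce to the constant step size that the text reports working well in practice.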
We compared the algorithms on a binary classification problem and a least-squares
prediction problem using both toy datasets and real world datasets. The toy datasets were
created on each local worker exactly the same way as for the sequential experiments. The
toy datasets had d = 1000 features and |Ωs| = 5000 samples for each core s, i.e., the total
size of the dataset was p×5000, where p denotes the number of local nodes. We also used
the real world datasets MILLIONSONG [BMEWL11] (containing close to 500,000 data
samples) for ridge regression and SUSY [BSW14] (5,000,000 data samples) for logistic
regression.
Figure 4.2 shows results of our distributed experiments on toy datasets. The left
two plots compare the rates of convergence of our algorithms scaled over 192 cores for
logistic regression and ridge regression. The x-axis displays wall clock time in seconds
and the y-axis displays the relative norm of the gradient, i.e., the ratio between the current
gradient norm and the initial gradient norm. In almost all cases the proposed algorithms,
in particular CentralVR, have substantially superior rates of convergence over established
schemes. The right two plots in Figure 4.2 demonstrate the scalability of our algorithms.
On the y-axis, we plot the wall clock time (in seconds) required for convergence, and on
the x-axis, we vary the number of nodes as 96, 192, 480 and 960. Each local worker
has |Ωs| = 5000 data points in each case, i.e., the amount of data scales linearly with
the number of nodes. Notice that CentralVR-Sync and CentralVR-Async exhibit nearly
perfect linear scaling, even when the number of workers is almost 1000. The dataset size
in this regime is close to 5 million data points, and the proposed CentralVR methods train
both our logistic and ridge regression models to five digits of precision in less than 15
seconds.
Figure 4.3 shows results of our distributed experiments on the large datasets SUSY
and MILLIONSONG. The left two plots show convergence results for our algorithms
over 500 nodes for SUSY and 240 nodes for MILLIONSONG. In both cases, we see
that our proposed algorithms outperform or remain competitive with previously proposed
schemes. The right two plots show the scaling of our algorithms as we increase the number of local workers for training SUSY and MILLIONSONG. We see that for MILLIONSONG, increasing the number of local workers initially decreases convergence time, but
speed levels out for large numbers of workers, likely due to the smaller size of the local
dataset fragments. On the larger SUSY problem, we find a consistent decrease in the con-
vergence times as we increase the number of workers. We train on this 5,000,000 sample
dataset in less than 5 seconds using 750 local workers.
4.7 Summary
This chapter introduced a new variance reduction scheme, CentralVR, that has
lower communication requirements than conventional schemes, allowing it to perform
better in highly parallel cloud or cluster computing platforms. In addition, distributed
versions of well-known variance reduction stochastic gradient descent (SGD) methods
are presented that also perform well in highly distributed settings. We show that by lever-
aging variance reduction, we can combat the diminishing returns that plague classical
SGD methods when scaled across many workers, achieving linear performance scaling
to over 1000 cores. This represents a significant increase in scalability over previous
stochastic gradient methods.
Algorithm 5 CentralVR-Sync Algorithm
parameters: learning rate $\alpha$
initialize $w$, $\{\nabla f_j(\tilde w_j)\}_j$, $\bar g$
while not converged do
    for each local node $s$ do
        $\tilde g \leftarrow 0$
        set $\pi$: random permutation of indices $1, 2, \ldots, |\Omega_s|$
        for $k$ in $\{1, \ldots, |\Omega_s|\}$ do
            $w_{k+1} \leftarrow w_k - \alpha\big(\nabla f^s_{\pi_k}(w_k) - \nabla f^s_{\pi_k}(\tilde w_{\pi_k}) + \bar g\big)$
            accumulate average: $\tilde g \leftarrow \tilde g + \nabla f^s_{\pi_k}(w_k)/|\Omega_s|$
            store gradient: $\nabla f^s_{\pi_k}(\tilde w_{\pi_k}) \leftarrow \nabla f^s_{\pi_k}(w_k)$
        end for
        set average gradient to send to server: $\bar g \leftarrow \tilde g$
        send $w$, $\bar g$ to central node
        receive updated $w$, $\bar g$ from central node
    end for
    central node:
        average $w$, $\bar g$ received from workers
        broadcast averaged $w$, $\bar g$ to local workers
end while
Algorithm 6 CentralVR-Async Algorithm
parameters: learning rate $\alpha$
initialize $w$, $\{\nabla f_j(\tilde w_j)\}_j$, $\bar g$, $\rho = 1/p$, $w_{old} = \bar g_{old} = 0$
while not converged do
    for each local node do
        $\tilde g \leftarrow 0$
        set $\pi$: random permutation of indices $1, 2, \ldots, |\Omega_s|$
        for $k$ in $\{1, \ldots, |\Omega_s|\}$ do
            $w_{k+1} \leftarrow w_k - \alpha\big(\nabla f^s_{\pi_k}(w_k) - \nabla f^s_{\pi_k}(\tilde w_{\pi_k}) + \bar g\big)$
            accumulate average: $\tilde g \leftarrow \tilde g + \nabla f^s_{\pi_k}(w_k)/|\Omega_s|$
            store gradient: $\nabla f^s_{\pi_k}(\tilde w_{\pi_k}) \leftarrow \nabla f^s_{\pi_k}(w_k)$
        end for
        set average gradient: $\bar g \leftarrow \tilde g$
        compute change: $\Delta w \leftarrow w - w_{old}$, $\Delta\bar g \leftarrow \bar g - \bar g_{old}$
        set: $w_{old} \leftarrow w$, $\bar g_{old} \leftarrow \bar g$
        send $\Delta w$, $\Delta\bar g$ to central node
        receive updated $w$, $\bar g$ from central node
    end for
    central node:
        receive $\Delta w$, $\Delta\bar g$ from a local worker
        update: $w \leftarrow w + \rho\Delta w$, $\bar g \leftarrow \bar g + \rho\Delta\bar g$
        send new $w$, $\bar g$ back to local worker
end while
Algorithm 7 Synchronous SVRG
parameters: step size $\alpha$, communication period $\tau$
initialize $w$
while not converged do
    set: $\bar w \leftarrow w$
    set: $\bar g \leftarrow \nabla f(\bar w)$ via synchronization step
    for each local node $s$ do
        for $k$ in $\{1, \ldots, \tau\}$ do
            sample $\tilde k \in \{1, \ldots, |\Omega_s|\}$ with replacement
            $w_{k+1} \leftarrow w_k - \alpha\big(\nabla f^s_{\tilde k}(w_k) - \nabla f^s_{\tilde k}(\bar w) + \bar g\big)$
        end for
        send $w$ to central node
        receive updated $w$ from central node
    end for
    central node:
        average $w$ received from workers
        broadcast averaged $w$ to local workers
end while
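Algorithm 8 Asynchronous SAGA
parameters: step size $\alpha$, communication period $\tau$
initialize $w$, $\{\nabla f_j(\tilde w_j)\}_j$, $\rho = 1/p$, $w_{old} = \bar g_{old} = 0$
set average gradient: $\bar g \leftarrow \frac{1}{n}\sum_j \nabla f_j(\tilde w_j)$
while not converged do
    for each local node do
        for $k$ in $\{1, \ldots, \tau\}$ do
            sample $\tilde k \in \{1, \ldots, n\}$ with replacement
            $w_{k+1} \leftarrow w_k - \alpha\big(\nabla f^s_{\tilde k}(w_k) - \nabla f^s_{\tilde k}(\tilde w_{\tilde k}) + \bar g\big)$
            update: $\bar g \leftarrow \bar g + \frac{1}{n}\big(\nabla f^s_{\tilde k}(w_k) - \nabla f^s_{\tilde k}(\tilde w_{\tilde k})\big)$
            store gradient: $\nabla f^s_{\tilde k}(\tilde w_{\tilde k}) \leftarrow \nabla f^s_{\tilde k}(w_k)$
        end for
        compute change: $\Delta w \leftarrow w - w_{old}$, $\Delta\bar g \leftarrow \bar g - \bar g_{old}$
        set: $w_{old} \leftarrow w$, $\bar g_{old} \leftarrow \bar g$
        send $\Delta w$, $\Delta\bar g$ to central node
        receive updated $w$, $\bar g$ from central node
    end for
    central node:
        receive $\Delta w$, $\Delta\bar g$ from a local worker
        update: $w \leftarrow w + \rho\Delta w$, $\bar g \leftarrow \bar g + \rho\Delta\bar g$
        send new $w$, $\bar g$ back to local worker
end while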
Figure 4.2: Distributed Results on toy datasets for CentralVR-Sync and CentralVR-Async,
compared to Distributed SVRG (Section 4.5.1), Distributed SAGA (Section 4.5.2), Pa-
rameter Server SVRG and EASGD. Left two plots: Convergence curve for Logistic and
ridge regression on synthetic data over 192 nodes. Right two plots: Time required for
convergence as number of local workers is increased (data on each local worker is con-
stant – i.e., total data scales linearly with the number of local workers) for logistic and
ridge regression.
Figure 4.3: Distributed Results on SUSY and MILLIONSONG for CentralVR-Sync and
CentralVR-Async, compared to Distributed SVRG (Section 4.5.1), Distributed SAGA
(Section 4.5.2), Parameter Server SVRG (Param Server SVRG) and EASGD. (Left two
plots) Convergence curve for Logistic regression and ridge regression on SUSY over 500
nodes and on MILLIONSONG over 240 nodes. (Right two plots) Time required for con-
vergence as number of local workers is increased.
Chapter 5: Investigating training methods for quantized neural nets
5.1 Introduction
Deep neural networks are an integral part of state-of-the-art computer vision and
natural language processing systems. Because of their high memory requirements and
computational complexity, networks are usually trained using powerful hardware. There
is an increasing interest in training and deploying neural networks directly on battery-
powered devices, such as cell phones or other platforms. Such low-power embedded
systems are memory and power limited, and in some cases lack basic support for floating-
point arithmetic.
To make neural nets practical on embedded systems, many researchers have fo-
cused on training nets with coarsely quantized weights. For example, weights may be
constrained to take on integer/binary values, or may be represented using low-precision
(8 bits or less) fixed-point numbers. Quantized nets offer the potential of superior mem-
ory and computation efficiency, while achieving performance that is competitive with
state-of-the-art high-precision nets. Quantized weights can dramatically reduce memory
size and access bandwidth, increase power efficiency, exploit hardware-friendly bitwise
operations, and accelerate inference throughput [CHS+16, MOPU93, RORF16].
Handling low-precision weights is difficult and motivates interest in new training
methods. When learning rates are small, stochastic gradient methods make small updates
to weight parameters. Binarization/discretization of weights after each training iteration
“rounds off” these small updates and causes training to stagnate [CHS+16]. Thus, the
naïve approach of quantizing weights using a rounding procedure yields poor results when
weights are represented using a small number of bits. Other approaches include classical
stochastic rounding methods [GAGN15], as well as schemes that combine full-precision
floating-point weights with discrete rounding procedures [CBD15]. While some of these
schemes seem to work in practice, results in this area are largely experimental, and little
work has been devoted to explaining the excellent performance of some methods, the poor
performance of others, and the important differences in behavior between these methods.
Contributions
In this chapter, we study quantized training methods from a theoretical perspective,
with the goal of understanding the differences in behavior, and reasons for success or
failure, of various methods. In particular, we present a convergence analysis showing that
classical stochastic rounding (SR) methods [GAGN15] as well as newer and more pow-
erful methods like BinaryConnect (BC) [CBD15] are capable of solving convex discrete
problems up to a level of accuracy that depends on the quantization level. We then address
the issue of why algorithms that maintain floating-point representations, like BC, work so
well, while fully quantized training methods like SR stall before training is complete.
We show that the long-term behavior of BC has an important annealing property that is
needed for non-convex optimization, while classical rounding methods lack this property.
5.2 Background and related work
The arithmetic operations of deep networks can be truncated down to 8-bit fixed-
point without significant deterioration in inference performance [GAGN15,LTA16,HS14,
LCMB16, LZL16]. The most extreme scenario of quantization is binarization, in which
only 1-bit (two states) is used for weight representation [KS15,CBD15,CHS+16,RORF16,
HCS+16, BIL+15].
Previous work on obtaining a quantized neural network can be divided into two cat-
egories: quantizing pre-trained models with or without retraining [HS14,AHS15,LTA16,
ZHMD17, ZYG+17], and training a quantized model from scratch [GAGN15, CBD15,
RORF16, CHS+16, ZWN+16]. We focus on approaches that belong to the second cate-
gory, as they can be used for both training and inference under constrained resources.
For training quantized NNs from scratch, many authors suggest maintaining a high-
precision floating point copy of the weights while feeding quantized weights into back-
prop [CBD15, HCS+16, RORF16, ZWN+16], which results in good empirical perfor-
mance. There are limitations in using such methods on low-power devices, however,
where floating-point arithmetic is not always available or not desirable. Another widely
used solution using only low-precision weights is stochastic rounding [HF92, GAGN15].
Experiments show that networks using 16-bit fixed-point representations with stochastic
rounding can deliver results nearly identical to 32-bit floating-point computations [GAGN15],
while lowering the precision down to 3-bit fixed-point often results in a significant per-
formance degradation [MLM16]. Bayesian learning has also been applied to train binary
networks [SHM14,CSML15]. A more comprehensive review can be found in [RORF16].
5.3 Algorithms for training quantized neural nets
Neural networks have objective functions of the same form as (2.1) where each fi
is a non-convex loss function. When floating-point representations are available, the stan-
dard method for training neural networks is SGD (2.2). In this chapter, we consider the
problem of training convolutional neural networks (CNNs) with low precision weights.
Convolutions are computationally expensive; low precision weights can be used to accel-
erate them by replacing expensive multiplications with efficient addition and subtraction
operations [RORF16, LZL16] or bitwise operations [HCS+16, ZWN+16].
To train networks using a low-precision representation of the weights, a quantiza-
tion function Q(·) is needed to convert a real-valued number w into a quantized/rounded
version ŵ = Q(w). We use the same notation for quantizing vectors, where we assume
Q acts on each dimension of the vector. Different quantized optimization routines can be
defined by selecting different quantizers, and also by selecting when quantization happens
during optimization. The common options are:
Deterministic Rounding (R) A basic uniform or deterministic quantization function
snaps a floating point value to the closest quantized value as:
\[
Q_d(w) = \mathrm{sign}(w)\cdot\Delta\cdot\Big\lfloor\frac{|w|}{\Delta} + \frac{1}{2}\Big\rfloor, \tag{5.1}
\]
where ∆ denotes the quantization step or resolution, i.e., the smallest positive number
that is representable. One exception to this definition is when we consider binary weights,
where all weights are constrained to have two values w ∈ {−1, 1} and uniform rounding
becomes Qd(w) = sign(w).
The deterministic rounding SGD maintains quantized weights with updates of the form:
\[
\text{Deterministic Rounding:}\quad w^b_{k+1} = Q_d\big(w^b_k - \alpha_k\nabla\tilde f_k(w^b_k)\big), \tag{5.2}
\]
where wb denotes the low-precision weights, which are quantized using Qd immediately
after applying the gradient descent update. If gradient updates are significantly smaller
than the quantization step, this method loses gradient information and weights may never
be modified from their starting values.
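The stalling behavior is easy to reproduce with a direct transcription of (5.1). This is a minimal sketch; `q_det` is a hypothetical name, and the step of 0.01 stands in for an update $\alpha_k\nabla\tilde f_k$ that is much smaller than $\Delta$.

```python
import math

def q_det(w, delta):
    """Deterministic rounding: sign(w) * delta * floor(|w| / delta + 1/2)."""
    sign = (w > 0) - (w < 0)
    return sign * delta * math.floor(abs(w) / delta + 0.5)

# A weight on the grid with a constant small update in one direction.
delta = 0.25
w = 0.5
for _ in range(100):
    w = q_det(w - 0.01, delta)   # the update is rounded away every time
```

After 100 steps `w` is still 0.5, even though the accumulated updates sum to 1.0.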
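Stochastic Rounding (SR) The stochastic rounding quantization function is defined as:
\[
Q_s(w) = \Delta\cdot
\begin{cases}
\lfloor \frac{w}{\Delta} \rfloor + 1 & \text{for } p \le \frac{w}{\Delta} - \lfloor \frac{w}{\Delta} \rfloor, \\
\lfloor \frac{w}{\Delta} \rfloor & \text{otherwise,}
\end{cases} \tag{5.3}
\]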
where p ∈ [0, 1] is produced by a uniform random number generator. This operator is
non-deterministic, and rounds its argument up with probability $w/\Delta - \lfloor w/\Delta \rfloor$, and down
otherwise. This quantizer satisfies the important property E[Qs(w)] = w. Similar to
the deterministic rounding method, the SR optimization method also maintains quantized
weights with updates of the form:
\[
\text{Stochastic Rounding:}\quad w^b_{k+1} = Q_s\big(w^b_k - \alpha_k\nabla\tilde f_k(w^b_k)\big). \tag{5.4}
\]
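The stochastic quantizer and its unbiasedness can be checked empirically. In this sketch `q_stoch` is a hypothetical name, and the comparison uses a strict `<` so that a value already on the grid is never moved when the generator returns exactly 0.0; this is a small implementation choice, not something specified in the text.

```python
import math
import random

def q_stoch(w, delta, rng=random):
    """Round w/delta up with probability equal to its fractional part."""
    scaled = w / delta
    low = math.floor(scaled)
    p = rng.random()                       # uniform on [0, 1)
    return delta * (low + 1) if p < scaled - low else delta * low

rng = random.Random(0)
samples = [q_stoch(0.3, 0.25, rng) for _ in range(20000)]
empirical_mean = sum(samples) / len(samples)
```

Every sample is either 0.25 or 0.5, and the empirical mean is close to the unquantized value 0.3.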
BinaryConnect (BC) The BinaryConnect algorithm [CBD15] accumulates gradient
updates using a full-precision buffer wr, and quantizes weights just before gradient com-
putations as follows.
\[
\text{BinaryConnect:}\quad w^r_{k+1} = w^r_k - \alpha_k\nabla\tilde f_k\big(Q(w^r_k)\big). \tag{5.5}
\]
Either stochastic rounding Qs or deterministic rounding Qd can be used for quantizing
the weights wr, but in practice, Qd is the common choice. The original BinaryConnect
paper constrains the low-precision weights to be {−1, 1}, which can be generalized to
{−∆,∆}. A more recent method, Binary-Weights-Net (BWN) [RORF16], allows differ-
ent filters to have different scales for quantization, which often results in better perfor-
mance on large datasets.
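The role of the full-precision buffer can be seen on a one-dimensional toy problem. This sketch assumes the objective $f(w) = \frac{1}{2}(w - 0.6)^2$, deterministic rounding for $Q$, and a constant step size; `q_det` and `binaryconnect_1d` are hypothetical names.

```python
import math

def q_det(w, delta):
    """Deterministic rounding to the nearest multiple of delta."""
    sign = (w > 0) - (w < 0)
    return sign * delta * math.floor(abs(w) / delta + 0.5)

def binaryconnect_1d(target=0.6, delta=0.25, alpha=0.1, steps=200):
    """BinaryConnect-style updates for f(w) = 0.5 * (w - target) ** 2."""
    w_r = 0.0                            # full-precision buffer
    for _ in range(steps):
        w_q = q_det(w_r, delta)          # quantize just before the gradient
        w_r -= alpha * (w_q - target)    # gradient is taken at the quantized weight
    return w_r, q_det(w_r, delta)

w_r, w_q = binaryconnect_1d()
```

Small updates accumulate in `w_r` instead of being rounded away, so the quantized weight ends up at one of the two grid points adjacent to 0.6 while the buffer hovers near the rounding boundary 0.625.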
Notation For the rest of the chapter, we use Q to denote both Qs and Qd unless the
situation requires this to be distinguished. We also drop the superscripts on wr and wb,
and simply write w.
5.4 Convergence analysis
We now present convergence guarantees for the Stochastic Rounding (SR) and Bi-
naryConnect (BC) algorithms, with updates of the form (5.4) and (5.5), respectively. For
the purposes of deriving theoretical guarantees, we assume each $f_i$ in (2.1) is differentiable and $\mu$-strongly convex: $\langle\nabla f(w'), w - w'\rangle \le f(w) - f(w') - \frac{\mu}{2}\|w - w'\|^2$. We
assume the (stochastic) gradients are bounded: $\mathbb{E}\|\nabla\tilde f_k(w_k)\|^2 \le G^2$. Some results below
also assume the domain of the problem is finite. In this case, the rounding algorithm clips
values that leave the domain. For example, in the binary case, rounding returns bounded
values in {−1, 1}.
5.4.1 Convergence of Stochastic Rounding (SR)
We can rewrite the update rule (5.4) as:
wk+1 = wk − αk∇f̃k(wk) + rk, (5.6)
where rk = Qs(wk−αk∇f̃k(wk))−wk +αk∇f̃k(wk) denotes the quantization error on
the k-th iteration. We want to bound this error in expectation. To this end, we present the
following lemma.
Lemma 5.4.1. The stochastic rounding error rk on each iteration can be bounded, in
expectation, as:
\[
\mathbb{E}\|r_k\|^2 \le \sqrt{d}\,\Delta\,\alpha_k G,
\]
where d denotes the dimension of w.
Proof. We want to bound the quantization error rk. Consider the i-th entry in rk denoted
by (rk)i. Similarly, we define (wk)i and (∇f̃k(wk))i. Choose some random number
p ∈ [0, 1]. The stochastic rounding operation produces a value of rk given by
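\[
(r_k)_i = Q_s\big((w_k)_i - \alpha_k(\nabla\tilde f_k(w_k))_i\big) - (w_k)_i + \alpha_k(\nabla\tilde f_k(w_k))_i
= \Delta\cdot
\begin{cases}
-q + 1, & \text{for } p \le q, \\
-q, & \text{otherwise,}
\end{cases}
\]
where $q = \frac{-\alpha_k(\nabla\tilde f_k(w_k))_i}{\Delta} - \Big\lfloor\frac{-\alpha_k(\nabla\tilde f_k(w_k))_i}{\Delta}\Big\rfloor$ and $q \in [0, 1]$. Now we have
\[
\mathbb{E}_p\big[((r_k)_i)^2\big] \le \Delta^2\big((-q+1)^2 q + (-q)^2(1-q)\big) = \Delta^2 q(1-q) \le \Delta^2\min\{q, 1-q\}.
\]
Because $\min\{q, 1-q\} \le \Big|\frac{\alpha_k(\nabla\tilde f_k(w_k))_i}{\Delta}\Big|$, it follows that:
\[
\mathbb{E}_p\big[((r_k)_i)^2\big] \le \Delta^2\Big|\frac{\alpha_k(\nabla\tilde f_k(w_k))_i}{\Delta}\Big| \le \Delta\big|\alpha_k(\nabla\tilde f_k(w_k))_i\big|.
\]
Summing over the index $i$ yields:
\[
\mathbb{E}_p\|r_k\|_2^2 \le \Delta\alpha_k\|\nabla\tilde f_k(w_k)\|_1 \le \sqrt{d}\,\alpha_k\Delta\,\|\nabla\tilde f_k(w_k)\|_2. \tag{5.7}
\]
The result follows from: $\big(\mathbb{E}\|\nabla\tilde f_k(w_k)\|_2\big)^2 \le \mathbb{E}\|\nabla\tilde f_k(w_k)\|_2^2 \le G^2$.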
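The per-coordinate bound in the proof can be checked numerically. The sketch below evaluates the exact expected squared error $\Delta^2 q(1-q)$ for a weight that starts on the quantization grid and compares it with $\Delta|\alpha_k(\nabla\tilde f_k(w_k))_i|$; `expected_sq_rounding_error` is a hypothetical helper, and `update` stands for the scalar $\alpha_k(\nabla\tilde f_k(w_k))_i$.

```python
import math

def expected_sq_rounding_error(update, delta):
    """Exact E[(r_k)_i ** 2] for one coordinate that starts on the grid."""
    x = -update / delta
    q = x - math.floor(x)                 # probability of rounding up
    # Error is delta * (1 - q) with probability q, and -delta * q otherwise.
    return delta ** 2 * ((1.0 - q) ** 2 * q + q ** 2 * (1.0 - q))

delta = 0.25
updates = [0.001, 0.01, 0.1, 0.3, -0.07, -0.26]
errors = [expected_sq_rounding_error(u, delta) for u in updates]
bounds = [delta * abs(u) for u in updates]
```

For small updates the bound is nearly tight (about 0.0024 against 0.0025 when the update is 0.01), which is the regime the lemma is meant to capture.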
From Lemma 5.4.1, we see that the rounding error per step decreases as the learning
rate αk decreases. This is intuitive since the probability of an entry in wk+1 differing from
wk is small when the gradient update is small relative to ∆. Using the above lemma, we
now present convergence rate results for Stochastic Rounding (SR) in the strongly-convex
case. Our error estimates are ergodic, i.e., they are in terms of $\bar w_k = \frac{1}{k}\sum_{t=1}^{k} w_t$, the
average of the iterates.
Theorem 5.4.1. Assume that $f$ is $\mu$-strongly convex and the learning rates are given by
$\alpha_k = \frac{1}{\mu(k+1)}$. Consider the SR algorithm with updates of the form (5.4). Then, we have:
\[
\mathbb{E}[f(\bar w_k) - f(w^\star)] \le \frac{(1 + \log(k+1))G^2}{2\mu k} + \frac{\sqrt{d}\,\Delta G}{2},
\]
where $w^\star = \arg\min_w f(w)$.
Proof. Subtracting w? from (5.6), taking norm, and expectation conditioned on wk:
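\begin{align*}
\mathbb{E}\|w_{k+1} - w^\star\|^2 &= \|w_k - w^\star\|^2 - 2\mathbb{E}\langle w_k - w^\star, \alpha_k\nabla\tilde f_k(w_k) - r_k\rangle + \mathbb{E}\|\alpha_k\nabla\tilde f_k(w_k) - r_k\|^2 \\
&= \|w_k - w^\star\|^2 - 2\alpha_k\langle w_k - w^\star, \nabla f(w_k)\rangle + \alpha_k^2\mathbb{E}\|\nabla\tilde f_k(w_k)\|^2 + \mathbb{E}\|r_k\|^2 \\
&\le \|w_k - w^\star\|^2 - 2\alpha_k\langle w_k - w^\star, \nabla f(w_k)\rangle + \alpha_k^2 G^2 + \sqrt{d}\,\Delta\alpha_k G,
\end{align*}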
where we use the bounded variance assumption, E[rk] = 0, and Lemma 5.4.1. Using the
assumption that f is µ-strongly convex, we can simplify this to:
\[
\mathbb{E}\|w_{k+1} - w^\star\|^2 \le (1 - \alpha_k\mu)\|w_k - w^\star\|^2 - 2\alpha_k\big(f(w_k) - f(w^\star)\big) + \alpha_k^2 G^2 + \sqrt{d}\,\Delta\alpha_k G.
\]
Re-arranging the terms, and taking expectation we get:
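\[
2\alpha_k\mathbb{E}\big(f(w_k) - f(w^\star)\big) \le (1 - \alpha_k\mu)\mathbb{E}\|w_k - w^\star\|^2 - \mathbb{E}\|w_{k+1} - w^\star\|^2 + \alpha_k^2 G^2 + \sqrt{d}\,\Delta\alpha_k G.
\]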
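Assume that the step size decreases with the rate $\alpha_k = 1/(\mu(k+1))$. Then we have:
\[
\mathbb{E}\big(f(w_k) - f(w^\star)\big) \le \frac{\mu k}{2}\mathbb{E}\|w_k - w^\star\|^2 - \frac{\mu(k+1)}{2}\mathbb{E}\|w_{k+1} - w^\star\|^2 + \frac{1}{2\mu(k+1)}G^2 + \frac{\sqrt{d}\,\Delta G}{2}.
\]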
Averaging over k = 0 to T , we get a telescoping sum on the right hand side, which yields:
1 ∑T ∑T √G2 1 d∆G µ(T + 1)
E(f(wk)− f(w?)) ≤ + − E‖wk+1 −w?‖2
T 2µT k + 1 2 2
k=0 k=0 √
≤ (1 + log(T + 1))G
2 d∆G
+ .
2µT 2 ∑
Using Jensen’s ∑inequality, we have: E(f(w̄T ) − f(w
?)) ≤ 1 Tk=0 E(f(wk) − f(w?)),T
where w̄T = 1
T
k=0 wk, the average of the iterates. The desired bound follows. T
We see that SR converges until it reaches an “accuracy floor.” As the quantiza-
tion becomes more fine grained, our theory predicts that the accuracy of SR approaches
that of high-precision floating point at a rate linear in ∆. This extra term caused by the
discretization is unavoidable since this method maintains quantized weights.
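To make the “accuracy floor” concrete, the sketch below evaluates the two terms of the bound in Theorem 5.4.1 separately. The helper name `sr_bound` and the values chosen for d, G, µ and ∆ are illustrative assumptions, not values from the experiments.

```python
import math

def sr_bound(k, delta, d, G, mu):
    """Return the two terms of the Theorem 5.4.1 bound after k iterations.

    The first term decays with k; the second is the floor sqrt(d)*Delta*G/2.
    """
    decaying = (1 + math.log(k + 1)) * G ** 2 / (2 * mu * k)
    floor = math.sqrt(d) * delta * G / 2
    return decaying, floor

# Hypothetical problem constants, chosen only for illustration.
d, G, mu = 100, 1.0, 1.0
for delta in [0.5, 0.05]:
    for k in [10, 1_000, 100_000]:
        decaying, floor = sr_bound(k, delta, d, G, mu)
        print(f"Delta={delta:<5} k={k:<7} decaying={decaying:.5f} floor={floor:.5f}")
```

The decaying term vanishes as k grows, while the floor depends only on ∆ (linearly), d and G.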
5.4.2 Convergence of Binary Connect (BC)
When analyzing the BC algorithm, we assume that the Hessian satisfies the Lips-
chitz bound: ‖∇2fi(w) − ∇2fi(w′)‖ ≤ L2‖w − w′‖ for some L2 ≥ 0. While this is a
slightly non-standard assumption, we will see that it enables us to gain better insights into
the behavior of the algorithm.
We assume that the quantization function in BC uses stochastic rounding. In the
case of BC, we see that the quantization error r does not approach 0 as in SR-SGD.
Nonetheless, the effect of this rounding error diminishes with shrinking αk because αk
multiplies the gradient update, and thus implicitly the rounding error as well.
Theorem 5.4.2. Assume that f is µ-strongly convex, the domain has finite diameter D,
and the learning rates are given by $\alpha_k = \frac{1}{\mu(k+1)}$. Consider the BC algorithm with updates
of the form (5.5). Then we have:
\[
E[f(\bar w_k) - f(w^\star)] \le \frac{(1 + \log(k+1))G^2}{2\mu k} + \frac{D L_2\sqrt{d}\,\Delta}{2}.
\]
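Proof. We can rewrite the update rule (5.5) as
\[
\begin{aligned}
w_{k+1} &= w_k - \alpha_k\nabla\tilde f_k(w_k + r_k) \\
&= w_k - \alpha_k\left[\nabla\tilde f_k(w_k) + \nabla^2\tilde f_k(w_k)\,r_k + \hat r_k\right],
\end{aligned}
\]
where $\|\hat r_k\| \le \frac{L_2}{2}\|r_k\|^2$ from our assumption on the Hessian. Note that in general $r_k$ has
mean zero while $\hat r_k$ does not. Using the same steps as in Theorem 5.4.1, we get
\[
\begin{aligned}
E\|w_{k+1} - w^\star\|^2 &= \|w_k - w^\star\|^2 - 2\alpha_k E\langle w_k - w^\star, \nabla\tilde f_k(w_k + r_k)\rangle + \alpha_k^2 E\|\nabla\tilde f_k(w_k + r_k)\|^2 \\
&\le \|w_k - w^\star\|^2 - 2\alpha_k E\langle w_k - w^\star, \nabla f(w_k) + \hat r_k\rangle + \alpha_k^2 G^2 \\
&= \|w_k - w^\star\|^2 - 2\alpha_k E\langle w_k - w^\star, \nabla f(w_k)\rangle + \alpha_k^2 G^2 - 2\alpha_k E\langle w_k - w^\star, \hat r_k\rangle.
\end{aligned}
\]
Assuming the domain has finite diameter D, and observing that the quantization error for
BC-SGD can always be upper-bounded as $\|r_k\| \le \sqrt{d}\,\Delta$, we get:
\[
-2\alpha_k E\langle w_k - w^\star, \hat r_k\rangle \le 2\alpha_k D\,E\|\hat r_k\| \le 2\alpha_k D\,\frac{L_2}{2}\|r_k\| \le \alpha_k D L_2\sqrt{d}\,\Delta.
\]
Following the same steps as in Theorem 5.4.1, the desired bound follows. □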
Now, the error floor is determined by both ∆ and L2. For a quadratic least-squares
problem, the gradient of f is linear and the Hessian is constant. Thus, L2 = 0 and we get
the following corollary.
Corollary 5.4.1. Assume that f is quadratic and the learning rates are given by $\alpha_k = \frac{1}{\mu(k+1)}$.
The BC algorithm with updates of the form (5.5) yields
\[
E[f(\bar w_k) - f(w^\star)] \le \frac{(1 + \log(k+1))G^2}{2\mu k}.
\]
We see that the real-valued weights accumulated in BC can converge to the true
minimizer of quadratic losses. Furthermore, this suggests that, when the function behaves
like a quadratic on the distance scale ∆, one would expect BC to perform fundamentally
better than SR. While this may seem like a restrictive condition, there is evidence that
even non-convex neural networks become well approximated as a quadratic in the later
stages of optimization within a neighborhood of a local minimum [MG15].
Note, our convergence results on BC are for wr instead of wb, and these measures
of convergence are not directly comparable. It is not possible to bound wb when BC is
used, as the values of wb may not converge in the usual sense (e.g., in the +/-1 binary case
wr might converge to 0, in which case arbitrarily small perturbations to wr might send
wb to +1 or -1).
5.5 What about non-convex problems?
The global convergence results presented above for convex problems show that, in
general, both the SR and BC algorithms converge to within O(∆) accuracy of the mini-
mizer (in expected value). While we observe some differences between the two methods
when the iterates are close to a minimizer (where the objective function behaves like a
quadratic), these results do not explain the large differences generally observed when
applied to non-convex neural nets. We now study how the long-term behavior of SR dif-
fers from BC. Note that this section makes no convexity assumptions, and the proposed
theoretical results are directly applicable to neural networks.
Typical (continuous-valued) SGD methods have an important exploration-exploitation
tradeoff. When the learning rate is large, the algorithm explores by moving quickly
between states. Exploitation happens when the learning rate is small. In this case, noise
averaging causes the algorithm to more greedily pursue local minimizers with lower loss
values. Thus, the distribution of iterates produced by the algorithm becomes increasingly
concentrated near minimizers as the learning rate vanishes (see, e.g., the large-deviation
estimates in [LNS12]). BC maintains this property as well—indeed, we saw in Corollary
5.4.1 a class of problems for which the iterates concentrate on the minimizer for small αk.
In this section, we show that the SR method lacks this important tradeoff: as the
step size gets small and the algorithm slows down, the quality of the iterates produced by
the algorithm does not improve, and the algorithm does not become progressively more
likely to produce low-loss iterates. This behavior is illustrated in Figures 5.1 and 5.2.
To understand this problem conceptually, consider the simple case of a one-variable
optimization problem starting at w0 = 0 with ∆ = 1 (Figure 5.1). On each iteration, the
algorithm computes a stochastic approximation ∇f̃ of the gradient by sampling from
a distribution, which we call p. This gradient is then multiplied by the step size to get
α∇f̃. The probability of moving to the right (or left) is then roughly proportional to the
magnitude of α∇f̃. Note the random variable α∇f̃ has distribution $p_\alpha(z) = \alpha^{-1}p(z/\alpha)$.
Figure 5.1: The SR method starts at some location w (in this case 0), adds a perturbation
to w, and then rounds. As the learning rate α gets smaller, the distribution of the
perturbation gets “squished” near the origin, making the algorithm less likely to move. The
“squishing” effect is the same for the part of the distribution lying to the left and to the
right of w, and so it does not affect the relative probability of moving left or right.
Now, suppose that α is small enough that we can neglect the tails of $p_\alpha(z)$ that lie
outside the interval [−1, 1]. The probability of transitioning from w0 = 0 to w1 = 1 using
stochastic rounding, denoted by $T_\alpha(0, 1)$, is then
\[
T_\alpha(0,1) \approx \int_0^1 z\,p_\alpha(z)\,dz = \frac{1}{\alpha}\int_0^1 z\,p(z/\alpha)\,dz = \alpha\int_0^{1/\alpha} p(w)\,w\,dw \approx \alpha\int_0^{\infty} p(w)\,w\,dw,
\]
where the first approximation is because we neglected the unlikely case that α∇f̃ > 1,
and the second approximation appears because we added a small tail probability to the
estimate. These approximations get more accurate for small α. We see that, assuming the
tails of p are “light” enough, we have $T_\alpha(0,1) \sim \alpha\int_0^{\infty} p(w)\,w\,dw$ as α → 0. Similarly,
$T_\alpha(0,-1) \sim \alpha\int_{-\infty}^{0} p(w)\,w\,dw$ as α → 0.
What does this observation mean for the behavior of SR? First of all, the probability
of leaving w0 on an iteration is
\[
T_\alpha(0,-1) + T_\alpha(0,1) \approx \alpha\left[\int_0^{\infty} p(w)\,w\,dw + \int_{-\infty}^{0} p(w)\,w\,dw\right],
\]
which vanishes for small α. This means the algorithm slows down as the learning rate
drops off, which is not surprising. However, the conditional probability of ending up at
w1 = 1 given that the algorithm did leave w0 is
\[
T_\alpha(0, 1 \mid w_1 \ne w_0) \approx \frac{T_\alpha(0,1)}{T_\alpha(0,-1) + T_\alpha(0,1)} = \frac{\int_0^{\infty} p(w)\,w\,dw}{\int_{-\infty}^{0} p(w)\,w\,dw + \int_0^{\infty} p(w)\,w\,dw},
\]
which does not depend on α. In other words, provided α is small, SR, on average, makes
the same decisions/transitions with learning rate α as it does with learning rate α/10; it
just takes 10 times longer to make those decisions when α/10 is used. In this situation,
there is no exploitation benefit in decreasing α.
Figure 5.2: Effect of shrinking the learning rate in SR vs BC on a toy problem. The left
figure plots the objective function (5.8) (loss value against weight w). Histograms plot the
distribution of the quantized weights over $10^6$ iterations for (a) α = 1.0, (b) α = 0.1,
(c) α = 0.01, and (d) α = 0.001. The top row of plots corresponds to BC, while the bottom row
is SR. As the learning rate α shrinks, the BC distribution concentrates on a minimizer,
while the SR distribution stagnates.
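The claim that the conditional transition probability is insensitive to α can be checked by direct numerical integration. The sketch below is illustrative only: it assumes ∆ = 1, takes p to be a Gaussian density with a hypothetical mean and standard deviation, neglects the tails outside [−1, 1] as in the text, and integrates the magnitude |w| on the negative side so that both quantities are non-negative probabilities. The helper names are not from the source.

```python
import numpy as np

def gaussian_pdf(w, mean, std):
    return np.exp(-0.5 * ((w - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def trapezoid(y, x):
    """Plain trapezoidal rule, written out to avoid NumPy version differences."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

def transition_probs(alpha, mean=-0.3, std=1.0, n=400_001):
    """Approximate T_alpha(0, +1) and T_alpha(0, -1) for Delta = 1.

    p is the density of the unscaled perturbation w, so the scaled step is
    z = alpha * w. The Gaussian choice of p is an assumption for illustration.
    """
    w = np.linspace(0.0, 1.0 / alpha, n)
    right = alpha * trapezoid(w * gaussian_pdf(w, mean, std), w)
    left = alpha * trapezoid(w * gaussian_pdf(-w, mean, std), w)  # |w| p(w) over w < 0
    return right, left

for alpha in [0.1, 0.01, 0.001]:
    right, left = transition_probs(alpha)
    print(f"alpha={alpha:<6} P(leave)={right + left:.6f} "
          f"P(right | leave)={right / (right + left):.6f}")
```

The probability of leaving scales with α, while the conditional probability of moving right stays essentially fixed.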
5.5.1 Toy problem
To gain more intuition about the effect of shrinking the learning rate in SR vs BC,
consider the following simple 1-dimensional non-convex problem:
\[
\min_w f(w) := \begin{cases}
w^2 + 2, & \text{if } w < 1, \\
(w - 2.5)^2 + 0.75, & \text{if } 1 \le w < 3.5, \\
(w - 4.75)^2 + 0.19, & \text{if } w \ge 3.5.
\end{cases} \tag{5.8}
\]
Figure 5.2 shows a plot of this loss function. To visualize the distribution of iterates,
we initialize at w = 4.0, and run SR and BC for $10^6$ iterations using a quantization
resolution of 0.5.
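A small simulation in the spirit of this toy experiment can be sketched as follows. This is not the authors' code. The additive Gaussian gradient noise (standard deviation 1), the reduced iteration count, and the use of stochastic rounding for BC's quantizer are assumptions made for illustration; the objective, the starting point w = 4.0, and the resolution 0.5 follow the text.

```python
import numpy as np

DELTA = 0.5  # quantization resolution used in the text

def f(w):
    """Piecewise-quadratic objective (5.8)."""
    if w < 1.0:
        return w ** 2 + 2.0
    if w < 3.5:
        return (w - 2.5) ** 2 + 0.75
    return (w - 4.75) ** 2 + 0.19

def grad_f(w):
    """Derivative of (5.8) on each piece."""
    if w < 1.0:
        return 2.0 * w
    if w < 3.5:
        return 2.0 * (w - 2.5)
    return 2.0 * (w - 4.75)

def stochastic_round(x, rng, delta=DELTA):
    """Round x to a neighbouring grid point; round up with probability equal
    to the fractional position of x in its cell, so the result is unbiased."""
    lower = np.floor(x / delta) * delta
    prob_up = (x - lower) / delta
    return lower + delta if rng.random() < prob_up else lower

def run_sr(alpha, n_iters, rng, w0=4.0, noise_std=1.0):
    """SR: only the quantized weight is kept between iterations."""
    w, visited = w0, np.empty(n_iters)
    for k in range(n_iters):
        g = grad_f(w) + noise_std * rng.standard_normal()  # assumed noise model
        w = stochastic_round(w - alpha * g, rng)
        visited[k] = w
    return visited

def run_bc(alpha, n_iters, rng, w0=4.0, noise_std=1.0):
    """BC: a real-valued weight accumulates updates; the gradient is
    evaluated at its quantized copy."""
    w_real, visited = w0, np.empty(n_iters)
    for k in range(n_iters):
        w_quant = stochastic_round(w_real, rng)
        g = grad_f(w_quant) + noise_std * rng.standard_normal()  # assumed noise model
        w_real -= alpha * g
        visited[k] = w_quant
    return visited

rng = np.random.default_rng(0)
for alpha in [0.1, 0.01, 0.001]:
    sr = run_sr(alpha, 20_000, rng)
    bc = run_bc(alpha, 20_000, rng)
    mean_loss = lambda ws: np.mean([f(w) for w in ws])
    print(f"alpha={alpha:<6} mean loss SR={mean_loss(sr):.3f}  BC={mean_loss(bc):.3f}")
```

With far fewer iterations than the $10^6$ used in the text, this is only a qualitative probe; histograms of `sr` and `bc` can be compared across learning rates to look for the behavior described here.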
Figure 5.2 shows the distribution of the quantized weight parameters w over the it-
erations when optimized with SR and BC for different learning rates α. As we shift from
α = 1 to α = 0.001, the distribution of BC iterates transitions from a wide/explorative
distribution to a narrow distribution in which iterates aggressively concentrate on the min-
imizer. In contrast, the distribution produced by SR concentrates only slightly and then
stagnates; the iterates are spread widely even when the learning rate is small.
5.5.2 Asymptotic analysis of Stochastic Rounding
The above argument is intuitive, but also informal. To make these statements rigor-
ous, we interpret the SR method as a Markov chain. On each iteration, SR starts at some
state (iterate) w, and moves to a new state w′ with some transition probability Tα(w,w′)
that depends only on w and the learning rate α. For fixed α, this is clearly a Markov
process with transition matrix¹ Tα(w,w′).
Figure 5.3: Markov chain example with 3 states (A, B, and C). In the right figure, we halved each
transition probability for moving between states, with the remaining probability put on
the self-loop. Notice that halving all the transition probabilities would not change the
equilibrium distribution, and instead would only increase the mixing time of the Markov
chain.
The long-term behavior of this Markov process is determined by the stationary
distribution of Tα(w,w′). We show below that for small α, the stationary distribution
of Tα(w,w′) is nearly invariant to α, and thus decreasing α below some threshold has
virtually no effect on the long term behavior of the method. This happens because, as α
shrinks, the relative transition probabilities remain the same (conditioned on the fact that
the parameters change), even though the absolute probabilities decrease (see Figure 5.3).
In this case, there is no exploitation benefit to decreasing α.
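The invariance illustrated by Figure 5.3 is easy to verify numerically. The sketch below uses a hypothetical 3-state column-stochastic matrix (its entries are illustrative, not the ones in the figure), halves every transition between distinct states, and compares the stationary distributions.

```python
import numpy as np

def stationary_distribution(T):
    """Stationary distribution of a column-stochastic matrix T (T @ pi = pi)."""
    eigvals, eigvecs = np.linalg.eig(T)
    lead = np.argmin(np.abs(eigvals - 1.0))  # eigenvalue closest to 1
    pi = np.real(eigvecs[:, lead])
    return pi / pi.sum()  # normalise so the entries sum to 1

# Hypothetical chain: column j holds the transition probabilities out of state j.
T = np.array([[0.6, 0.4, 0.2],
              [0.2, 0.4, 0.2],
              [0.2, 0.2, 0.6]])

# Halve every off-diagonal transition and move the freed mass to the self-loops.
T_half = 0.5 * np.eye(3) + 0.5 * T

print("original:", stationary_distribution(T))
print("halved:  ", stationary_distribution(T_half))
```

Both chains share the same stationary distribution, because T_half π = 0.5π + 0.5Tπ = π whenever Tπ = π.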
Theorem 5.5.1. Let $p_{w,i}$ denote the probability distribution of the i-th entry in $\nabla\tilde f(w)$,
the stochastic gradient estimate at w. Assume there is a constant $C_1$ such that for all w,
i, and ν we have $\int_\nu^{\infty} p_{w,i}(z)\,dz \le \frac{C_1}{\nu^2}$, and some $C_2$ such that both $\int_0^{C_2} p_{w,i}(z)\,dz > 0$
and $\int_{-C_2}^{0} p_{w,i}(z)\,dz > 0$. Define the matrix
\[
\tilde U(w,w') = \begin{cases}
\int_0^{\infty} p_{w,i}(z)\,\frac{z}{\Delta}\,dz, & \text{if } w \text{ and } w' \text{ differ only at coordinate } i, \text{ and } (w')_i = (w)_i + \Delta, \\
\int_{-\infty}^{0} p_{w,i}(z)\,\frac{z}{\Delta}\,dz, & \text{if } w \text{ and } w' \text{ differ only at coordinate } i, \text{ and } (w')_i = (w)_i - \Delta, \\
0, & \text{otherwise,}
\end{cases}
\]
and the associated Markov chain transition matrix
\[
\tilde T_{\alpha_0} = I - \alpha_0\cdot\operatorname{diag}(\mathbf{1}^T\tilde U) + \alpha_0\tilde U, \tag{5.9}
\]
where $\alpha_0$ is the largest constant that makes $\tilde T_{\alpha_0}$ non-negative. Suppose $\tilde T_\alpha$ has a stationary
distribution, denoted $\tilde\pi$. Then, for sufficiently small α, $T_\alpha$ has a stationary distribution $\pi_\alpha$,
and $\lim_{\alpha\to 0}\pi_\alpha = \tilde\pi$. Furthermore, this limiting distribution satisfies $\tilde\pi(w) > 0$ for any
state w, and is thus not concentrated on local minimizers of f.
¹ Our analysis below does not require the state space to be finite, so $T_\alpha(w,w')$ may be a linear operator
rather than a matrix. Nonetheless, we use the term “matrix” as it is standard.
Proof. Let the matrix $U_\alpha$ be a partial transition matrix defined by $U_\alpha(w,w) = 0$, and
$U_\alpha(w,w') = T_\alpha(w,w')$ for $w \ne w'$. From $U_\alpha$, we can get back the full transition matrix
$T_\alpha$ using the formula
\[
T_\alpha = I - \operatorname{diag}(\mathbf{1}^T U_\alpha) + U_\alpha.
\]
Note that this formula is essentially “filling in” the diagonal entries of Tα so that every
column sums to 1, thus making Tα a valid stochastic matrix.
Let’s bound the entries in Uα. Suppose that we begin an iteration of the stochastic
rounding algorithm at some point w. Consider an adjacent point w′ that differs from w at
only 1 coordinate, i, with (w′)i = (w)i + ∆. Then we have
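\[
\begin{aligned}
U_\alpha(w,w') &= \frac{1}{\alpha}\int_0^{\Delta} p_{w,i}((w)_i/\alpha)\,\frac{(w)_i}{\Delta}\,d(w)_i + \frac{1}{\alpha}\int_{\Delta}^{2\Delta} p_{w,i}((w)_i/\alpha)\,\frac{2\Delta - (w)_i}{\Delta}\,d(w)_i \\
&= \frac{1}{\alpha}\int_0^{\Delta/\alpha} p_{w,i}(z)\,\frac{\alpha z}{\Delta}\,\alpha\,dz + \frac{1}{\alpha}\int_{\Delta/\alpha}^{2\Delta/\alpha} p_{w,i}(z)\,\frac{2\Delta - \alpha z}{\Delta}\,\alpha\,dz \\
&\le \alpha\int_0^{\Delta/\alpha} p_{w,i}(z)\,\frac{z}{\Delta}\,dz + \int_{\Delta/\alpha}^{\infty} p_{w,i}(z)\,dz \\
&= \alpha\int_0^{\infty} p_{w,i}(z)\,\frac{z}{\Delta}\,dz + O(\alpha^2). 
\end{aligned} \tag{5.10}
\]
Note we have used the decay assumption: $\int_\nu^{\infty} p_{w,i}(z)\,dz \le \frac{C_1}{\nu^2}$. If $(w')_i = (w)_i - \Delta$, then
similarly the transition probability is $U_\alpha(w,w') = \alpha\int_{-\infty}^{0} p_{w,i}(z)\,\frac{z}{\Delta}\,dz + O(\alpha^2)$, and if
$(w')_i = (w)_i \pm m\Delta$ for an integer m > 1, $U_\alpha(w,w') = O(\alpha^2)$. We can approximate the
behavior of $U_\alpha$ using the matrix
\[
\tilde U(w,w') = \begin{cases}
\int_0^{\infty} p_{w,i}(z)\,\frac{z}{\Delta}\,dz, & \text{if } w \text{ and } w' \text{ differ only at coordinate } i, \text{ and } (w')_i = (w)_i + \Delta, \\
\int_{-\infty}^{0} p_{w,i}(z)\,\frac{z}{\Delta}\,dz, & \text{if } w \text{ and } w' \text{ differ only at coordinate } i, \text{ and } (w')_i = (w)_i - \Delta, \\
0, & \text{otherwise.}
\end{cases}
\]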
Define the associated Markov chain transition matrix
T̃α0 = I − α0 · diag(1T Ũ) + α0Ũ , (5.11)
where α0 is the largest scalar such that the stochastic linear operator T̃α0 has non-negative
entries. For α < α0, T̃α has non-negative entries and column sums equal to 1; it thus
defines the transition operator of a Markov chain. Let π̃ denote the stationary distribution
of the Markov chain with transition matrix T̃α0 .
We now claim that π̃ is also the stationary distribution of T̃α for all α < α0. We
verify this by noting that
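\[
\begin{aligned}
\tilde T_\alpha &= (I - \alpha\cdot\operatorname{diag}(\mathbf{1}^T\tilde U)) + \alpha\tilde U \\
&= \left(1 - \frac{\alpha}{\alpha_0}\right)I + \frac{\alpha}{\alpha_0}\left[I - \alpha_0\cdot\operatorname{diag}(\mathbf{1}^T\tilde U) + \alpha_0\tilde U\right] \\
&= \left(1 - \frac{\alpha}{\alpha_0}\right)I + \frac{\alpha}{\alpha_0}\tilde T_{\alpha_0},
\end{aligned} \tag{5.12}
\]
and so $\tilde T_\alpha\tilde\pi = \left(1 - \frac{\alpha}{\alpha_0}\right)\tilde\pi + \frac{\alpha}{\alpha_0}\tilde\pi = \tilde\pi$.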
Recall that Tα is the transition matrix for the Markov chain generated by the stochas-
tic rounding algorithm with learning rate α. We wish to show that this Markov chain is
well approximated by $\tilde T_\alpha$. Note that
\[
T_\alpha(w,w') = \prod_{i:\,(w)_i \ne (w')_i} T_\alpha\big(w,\, w + ((w')_i - (w)_i)\Delta e_i\big) \le O(\alpha^2)
\]
when w,w′ differ at more than 1 coordinate, and $e_i$ denotes a vector that is 1 in the
i-th coordinate, and 0 everywhere else. In other words, transitions between multiple
coordinates simultaneously become vanishingly unlikely for small α. When w and w′
differ by exactly 1 coordinate, we know from (5.10): $T_\alpha(w,w') = \alpha\tilde U(w,w') + O(\alpha^2)$.
These observations show that the off-diagonal elements of Tα are well approximated (up
to uniform O(α2) error) by the corresponding elements in αŨ. Since the columns of Tα
sum to one, the diagonal elements are well approximated as well, and we have
Tα = (I − α · diag(1T Ũ)) + αŨ +O(α2) = T̃α +O(α2).
To be precise, the notation above means that
|Tα(w,w′)− T̃α(w,w′)| < Cα2, (5.13)
for some C that is uniform over (w,w′).
We are now ready to show that the stationary distribution of $T_\alpha$ exists and
approaches $\tilde\pi$. Re-arranging (5.12) gives us: $\alpha_0\tilde T_\alpha + (\alpha - \alpha_0)I = \alpha\tilde T_{\alpha_0}$. Combining this
with (5.13), we get
\[
\left\|\alpha_0 T_\alpha + (\alpha - \alpha_0)I - \alpha\tilde T_{\alpha_0}\right\|_\infty < O(\alpha^2),
\]
and so
\[
\left\|\frac{\alpha_0}{\alpha}T_\alpha + \left(1 - \frac{\alpha_0}{\alpha}\right)I - \tilde T_{\alpha_0}\right\|_\infty < O(\alpha). \tag{5.14}
\]
From (5.14), we see that the matrix $\frac{\alpha_0}{\alpha}T_\alpha + (1 - \frac{\alpha_0}{\alpha})I$ approaches $\tilde T_{\alpha_0}$. Note that
$\tilde\pi$ is the Perron-Frobenius eigenvector of $\tilde T_{\alpha_0}$, and the corresponding eigenvalue thus has
multiplicity 1. Multiplicity 1 eigenvalues/vectors of a matrix vary continuously with small
perturbations to that matrix (Theorem 8, p130 of [Lax07]). It follows that, for small α,
$\frac{\alpha_0}{\alpha}T_\alpha + (1 - \frac{\alpha_0}{\alpha})I$ has a stationary distribution, and this distribution approaches $\tilde\pi$.
The leading eigenvector of $\frac{\alpha_0}{\alpha}T_\alpha + (1 - \frac{\alpha_0}{\alpha})I$ is the same as the leading eigenvector
of $T_\alpha$, and it follows that $T_\alpha$ has a stationary distribution that approaches $\tilde\pi$.
Finally, note that we have assumed $\int_0^{C_2} p_{w,i}(z)\,dz > 0$ and $\int_{-C_2}^{0} p_{w,i}(z)\,dz > 0$.
Under this assumption, for $\alpha < \frac{1}{C_2}$, $\tilde T_{\alpha_0}(w,w') > 0$ whenever w,w′ are neighbors
that differ at a single coordinate. It follows that every state in the Markov chain $\tilde T_{\alpha_0}$ is
accessible from every other state by traversing a path of non-zero transition probabilities,
and so $\tilde\pi(w) > 0$ for every state w. □
While the long term stationary behavior of SR is relatively insensitive to α, the
convergence speed of the algorithm is not. To measure this, we consider the mixing time
of the Markov chain. Let πα denote the stationary distribution of a Markov chain. We say
that the ε-mixing time of the chain is M if M is the smallest integer such that [LPW09]
\[
|P(w_M \in A \mid w_0) - \pi(A)| \le \epsilon, \quad \text{for all } w_0 \text{ and all subsets of states } A \subseteq W. \tag{5.15}
\]
We show below that the mixing time of the Markov chain gets large for small α, which
means exploration slows down, even though no exploitation gain is being realized.
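The growth of the mixing time can be observed on a small example. The sketch below builds lazy chains of the form (1 − α/α0)I + (α/α0)T, the same structure as (5.12), from a hypothetical 3-state column-stochastic matrix, and measures the ε-mixing time via the worst-case total variation distance (which equals the maximum over subsets A in (5.15)). The matrix entries and ε = 0.01 are illustrative assumptions.

```python
import numpy as np

def stationary_distribution(T):
    """Stationary distribution of a column-stochastic matrix T."""
    eigvals, eigvecs = np.linalg.eig(T)
    pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
    return pi / pi.sum()

def mixing_time(T, eps, max_steps=100_000):
    """Smallest M such that, from every start state, the distribution after
    M steps is within eps of the stationary one in total variation."""
    pi = stationary_distribution(T)
    P = np.eye(T.shape[0])
    for M in range(1, max_steps + 1):
        P = T @ P  # column j of P is the distribution after M steps from state j
        worst_tv = 0.5 * np.abs(P - pi[:, None]).sum(axis=0).max()
        if worst_tv <= eps:
            return M
    return None  # did not mix within max_steps

# Hypothetical base chain playing the role of the alpha = alpha_0 chain.
T_base = np.array([[0.6, 0.4, 0.2],
                   [0.2, 0.4, 0.2],
                   [0.2, 0.2, 0.6]])

for ratio in [1.0, 0.1, 0.01]:  # ratio = alpha / alpha_0
    T_lazy = (1 - ratio) * np.eye(3) + ratio * T_base
    print(f"alpha/alpha_0={ratio:<5} mixing time={mixing_time(T_lazy, eps=0.01)}")
```

The stationary distribution is the same for every ratio, but the number of steps needed to reach it grows as the ratio shrinks.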
Theorem 5.5.2. Let $p_{w,i}$ satisfy the assumptions of Theorem 5.5.1. Choose some ε
sufficiently small that there exists a proper subset of states A ⊂ W with stationary probability
$\pi_\alpha(A)$ greater than ε. Let M(α) denote the ε-mixing time of the chain with learning rate
α. Then,
\[
\lim_{\alpha\to 0} M(\alpha) = \infty.
\]
Proof. Given some distribution π over the states of the Markov chain, and some set A of
states, let $[\pi]_A = \sum_{a\in A}\pi(a)$ denote the measure of A with respect to π.
Suppose for contradiction that the mixing time of the chain remains bounded as α
vanishes. Then we can find an integer M that upper bounds the ε-mixing time for all α.
By the assumption of the theorem, we can select some set of states A with $[\tilde\pi]_A > \epsilon$, and
some starting state $a \notin A$. Let e be a distribution (a vector in the finite-state case) with
$e_a = 1$, $e_b = 0$ for $b \ne a$. Note that $[e]_A = 0$ because $a \notin A$. Then
\[
\left|[e]_A - [\tilde\pi]_A\right| > \epsilon.
\]
Note that, as α → 0, we have $\|T_\alpha - \tilde T_\alpha\| \to 0$ and thus $\|T_\alpha^M - \tilde T_\alpha^M\| \to 0$. We also see
from the definition of $\tilde T_\alpha$ in (5.11), $\lim_{\alpha\to 0}\tilde T_\alpha = I$. It follows that
\[
\lim_{\alpha\to 0}\left|[T_\alpha^M e]_A - [\tilde\pi]_A\right| = \left|[e]_A - [\tilde\pi]_A\right| > \epsilon,
\]
and so for some α the inequality (5.15) is violated. This is a contradiction because it was
assumed M is an upper bound on the mixing time. □
5.6 Experiments
To explore the implications of the theory, we train both VGG-like networks [SZ15]
and Residual networks [HZRS16b] with binarized weights on image classification prob-
lems. On CIFAR-10, we train ResNet-56, wide ResNet-56 (WRN-56-2, with two times
more filters than ResNet-56), VGG-9, and the high capacity VGG-BC network used for
the original BC model [CBD15]. We also train ResNet-56 on CIFAR-100, and ResNet-18
on ImageNet [RDS+15]. VGG-9 on CIFAR-10 consists of 7 convolutional layers and 2
fully connected layers. The convolutional layers contain 64, 64, 128, 128, 256, 256 and
256 filters of size 3 × 3, respectively. Batch Normalization and ReLU follow each
convolutional layer and the first fully connected layer. The details of the architecture are
presented in Table 5.1. VGG-BC is a high-capacity network used for the original BC
method [CBD15], which contains 6 convolutional layers and 3 linear layers. We use the
same architecture as in [CBD15] except using softmax and cross-entropy loss instead of
SVM and squared hinge loss, respectively. The details of the architecture are presented
in Table 5.2. ResNet-56 has 55 convolutional layers and one linear layer, and contains
three stages of residual blocks where each stage has the same number of residual blocks.
WRN-56-2 doubles the number of filters in each residual block as in [ZK16]. ResNet-18
for ImageNet has the same description as in [HZRS16b].
We implement all models in Torch7 [CKF11] and train the quantized models with
NVIDIA GPUs. The default minibatch size is 128. Following [CBD15], we do not
use weight decay during training. We use Adam [KB15] as our baseline optimizer as
we found it to frequently give better results than well-tuned SGD (an observation that
is consistent with previous papers on quantized models [CHS+16, MOPU93, RORF16,
GAGN15, CBD15]), and we train with the three quantized algorithms mentioned in Sec-
tion 5.3, i.e., R-ADAM, SR-ADAM and BC-ADAM. The image pre-processing and
data augmentation procedures are the same as [HZRS16b]. Similar to [RORF16], we
only quantize the weights in the convolutional layers, but not linear layers, during train-
ing. Binarizing linear layers causes some performance drop without much computational
speedup. This is because fully connected layers have very little computation overhead
compared to Conv layers. Also, for state-of-the-art CNNs, the number of FC parameters
is quite small. The numbers of parameters in the Conv/FC layers for the CNNs in Table 5.3 are
(in millions): VGG-9: 1.7/1.1, VGG-BC: 4.6/9.4, ResNet-56: 0.84/0.0006, WRN-56-2:
3.4/0.001, ResNet-18: 11.2/0.5. While the VGG-like nets have many FC parameters, the
more efficient and higher performing ResNets are almost entirely convolutional.
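The Conv/FC parameter counts quoted for the two VGG-like networks can be reproduced from the layer shapes in Tables 5.1 and 5.2. The sketch below counts weights only; ignoring biases and Batch Normalization parameters is an assumption, and the helper names are illustrative.

```python
def conv_weights(c_in, c_out, k=3):
    """Number of weights in a k x k convolution (biases ignored)."""
    return c_in * c_out * k * k

def count(net):
    conv = sum(conv_weights(c_in, c_out) for c_in, c_out in net["conv"])
    fc = sum(n_in * n_out for n_in, n_out in net["fc"])
    return conv, fc

# (input channels, output channels) per conv layer and (inputs, outputs) per
# linear layer, transcribed from Tables 5.1 and 5.2.
VGG9 = {"conv": [(3, 64), (64, 64), (64, 128), (128, 128),
                 (128, 256), (256, 256), (256, 256)],
        "fc": [(4096, 256), (256, 10)]}
VGG_BC = {"conv": [(3, 128), (128, 128), (128, 256), (256, 256),
                   (256, 512), (512, 512)],
          "fc": [(8192, 1024), (1024, 1024), (1024, 10)]}

for name, net in [("VGG-9", VGG9), ("VGG-BC", VGG_BC)]:
    conv, fc = count(net)
    print(f"{name}: conv={conv / 1e6:.1f}M  fc={fc / 1e6:.1f}M")
# prints:
# VGG-9: conv=1.7M  fc=1.1M
# VGG-BC: conv=4.6M  fc=9.4M
```

The weight-only counts match the figures given in the text after rounding to one decimal place.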
The weights of convolutional layers are initialized with random Rademacher (±1)
variables. We set the initial learning rate to 0.01 and decrease the learning rate by a factor
of 10 at epochs 82 and 122 for CIFAR-10 and CIFAR-100 [HZRS16b]. For ImageNet
experiments, we train the model for 90 epochs and decrease the learning rate at epochs 30
and 60. The authors of BC [CBD15] adopt a small initial learning rate (0.003) and it takes
500 epochs to converge. It is observed that large binary weights (∆ = 1) will generate
small gradients when batch normalization is used [IS15a], hence a large learning rate is
necessary for faster convergence. We experiment with a larger learning rate (0.01) and
find it converges to the same performance within 160 epochs, compared with 500 epochs
in the original paper [CBD15].
Table 5.1: VGG-9 on CIFAR-10.
layer type     kernel size   input size        output size
Conv 1         3 × 3         3 × 32 × 32       64 × 32 × 32
Conv 2         3 × 3         64 × 32 × 32      64 × 32 × 32
Max Pooling    2 × 2         64 × 32 × 32      64 × 16 × 16
Conv 3         3 × 3         64 × 16 × 16      128 × 16 × 16
Conv 4         3 × 3         128 × 16 × 16     128 × 16 × 16
Max Pooling    2 × 2         128 × 16 × 16     128 × 8 × 8
Conv 5         3 × 3         128 × 8 × 8       256 × 8 × 8
Conv 6         3 × 3         256 × 8 × 8       256 × 8 × 8
Conv 7         3 × 3         256 × 8 × 8       256 × 8 × 8
Max Pooling    2 × 2         256 × 8 × 8       256 × 4 × 4
Linear         1 × 1         1 × 4096          1 × 256
Linear         1 × 1         1 × 256           1 × 10
Table 5.2: VGG-BC for CIFAR-10.
layer type     kernel size   input size        output size
Conv 1         3 × 3         3 × 32 × 32       128 × 32 × 32
Conv 2         3 × 3         128 × 32 × 32     128 × 32 × 32
Max Pooling    2 × 2         128 × 32 × 32     128 × 16 × 16
Conv 3         3 × 3         128 × 16 × 16     256 × 16 × 16
Conv 4         3 × 3         256 × 16 × 16     256 × 16 × 16
Max Pooling    2 × 2         256 × 16 × 16     256 × 8 × 8
Conv 5         3 × 3         256 × 8 × 8       512 × 8 × 8
Conv 6         3 × 3         512 × 8 × 8       512 × 8 × 8
Max Pooling    2 × 2         512 × 8 × 8       512 × 4 × 4
Linear         1 × 1         1 × 8192          1 × 1024
Linear         1 × 1         1 × 1024          1 × 1024
Linear         1 × 1         1 × 1024          1 × 10
Results. The overall results are summarized in Table 5.3. The binary model trained by
BC-ADAM has comparable performance to the full-precision model trained by ADAM.
SR-ADAM outperforms R-ADAM, which verifies the effectiveness of Stochastic Rounding.
There is a performance gap between SR-ADAM and BC-ADAM across all models
and datasets. This is consistent with our theoretical results in Sections 5.4 and 5.5, which
predict that keeping track of the real-valued weights as in BC-ADAM should produce
better minimizers.
Table 5.3: Top-1 test error (%) after training with full-precision (ADAM), binarized weights
(R-ADAM, SR-ADAM, BC-ADAM), and binarized weights with big batch size (Big SR-ADAM).
               CIFAR-10                                    CIFAR-100    ImageNet
               VGG-9    VGG-BC   ResNet-56   WRN-56-2      ResNet-56    ResNet-18
ADAM           7.97     7.12     8.10        6.62          33.98        36.04
BC-ADAM        10.36    8.21     8.83        7.17          35.34        52.11
Big SR-ADAM    16.95    16.77    19.84       16.04         50.79        77.68
SR-ADAM        23.33    20.56    26.49       21.58         58.06        88.86
R-ADAM         23.99    21.88    33.56       27.90         68.39        91.07
Exploration vs exploitation tradeoffs. Section 5.5 discusses the exploration/exploitation
tradeoff of continuous-valued SGD methods and predicts that fully discrete methods like
SR are unable to enter a greedy phase. To test this effect, we plot the percentage of
changed weights (signs different from the initialization) as a function of the training
epochs (Figures 5.4 and 5.5). SR-ADAM explores aggressively; it changes more weights
in the conv layers than both R-ADAM and BC-ADAM, and keeps changing weights until
nearly 40% of the weights differ from their starting values (in a binary model, randomly
re-assigning weights would result in 50% change). The BC method never changes more
than 20% of the weights (Fig 5.4b), indicating that it stays near a local minimizer and
explores less. Interestingly, we see that the weights of the conv layers were not changed
at all by R-ADAM; when the tails of the stochastic gradient distribution are light, this
method is ineffective.
Figure 5.4: Percentage of weight changes during training of VGG-BC on CIFAR-10, plotted
against training epochs for each layer (conv_1 to conv_6 and linear_1 to linear_3).
(a) R-ADAM. (b) BC-ADAM. (c) SR-ADAM.
Figure 5.5: Effect of batch size on SR-ADAM when tested with ResNet-56 on CIFAR-10,
comparing BC-ADAM and SR-ADAM with batch sizes 128 and 1024.
(a) Test error vs epoch. Test error is reported with dashed lines, train error with solid lines.
(b) Percentage of weight changes since initialization. (c) Percentage of weight changes
per every 5 epochs.
5.6.1 A way forward: big batch training
We saw in Section 5.5 that SR is unable to exploit local minima because, for small
learning rates, shrinking the learning rate does not produce additional bias towards mov-
ing downhill. This was illustrated in Figure 5.1. If this is truly the cause of the problem,
then our theory predicts that we can improve the performance of SR for low-precision
training by increasing the batch size. This shrinks the variance of the gradient distri-
bution in Figure 5.1 without changing the mean and concentrates more of the gradient
distribution towards downhill directions, making the algorithm more greedy.
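The effect of the batch size on the greediness of SR can be illustrated with the small-α conditional transition probability from Section 5.5. The sketch below is a simplified one-dimensional model, not the experimental setup: it assumes the mini-batch gradient is Gaussian with a fixed positive mean and variance proportional to 1/B, and all names and numbers are illustrative.

```python
import math

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def downhill_probability(mean_grad, sample_std, batch_size):
    """P(SR moves downhill | SR leaves its state) in the small-alpha limit.

    Assumes a scalar mini-batch gradient g ~ N(mean_grad, sample_std^2 / B)
    with mean_grad > 0, so a positive g produces a downhill step.
    """
    s = sample_std / math.sqrt(batch_size)
    t = mean_grad / s
    downhill = s * normal_pdf(t) + mean_grad * normal_cdf(t)   # E[max(g, 0)]
    uphill = s * normal_pdf(t) - mean_grad * normal_cdf(-t)    # E[max(-g, 0)]
    return downhill / (downhill + uphill)

for batch_size in [128, 256, 512, 1024]:
    p = downhill_probability(mean_grad=0.1, sample_std=1.0, batch_size=batch_size)
    print(f"batch size {batch_size:<5} P(downhill | move) = {p:.4f}")
```

Under this model the conditional probability of a downhill move increases with the batch size, because the variance shrinks while the mean stays fixed.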
To verify this, we tried different batch sizes for SR including 128, 256, 512 and
1024, and found that the larger the batch size, the better the performance of SR. Fig-
ure 5.5a illustrates the effect of a batch size of 1024 for BC and SR methods. We find
that the BC method, like classical SGD, performs best with a small batch size. However,
a large batch size is essential for the SR method to perform well. Figure 5.5b shows the
percentage of weights changed by SR and BC during training. We see that the large batch
methods change the weights less aggressively than the small batch methods, indicating
less exploration. Figure 5.5c shows the percentage of weights changed during each 5
epochs of training. It is clear that small-batch SR changes weights much more frequently
than using a big batch. This property of big batch training clearly benefits SR; we see in
Figure 5.5a and Table 5.3 that big batch training improved performance over SR-ADAM
consistently.
In addition to providing a means of improving fixed-point training, this suggests
that recently proposed methods using big batches [DYJG16, GDG+17] may be able to
exploit lower levels of precision to further accelerate training.
5.7 Conclusion
The training of quantized neural networks is essential for deploying machine learn-
ing models on portable and ubiquitous devices. We provide a theoretical analysis to better
understand the BinaryConnect (BC) and Stochastic Rounding (SR) methods for training
quantized networks. We proved convergence results for BC and SR methods that predict
an accuracy bound that depends on the coarseness of discretization. For general non-
convex problems, we proved that SR differs from conventional stochastic methods in that
it is unable to exploit greedy local search. Experiments confirm these findings, and show
that the mathematical properties of SR are indeed observable in practice.
Chapter 6: Why is SGD so fast for neural nets?
6.1 Introduction
Stochastic gradient descent [RM51] (and its momentum variants [Nes83]) has be-
come the standard optimization routine for deep learning due to its fast convergence and
good generalization properties [WRS+17,KS17,SMDH13], but the performance of these
methods defies explanation.
Classical convex optimization theory predicts that the learning rate of SGD needs
to decrease over time for convergence to be guaranteed [SZ13, Ber11]. With constant
learning rates, it has been shown that SGD converges fast to a neighborhood of the min-
imizer, but then reaches a noise floor that depends on the variance of the gradients at
the minimizer [MB11, NWS14]. When models contain the same number of parameters
as training data, it is possible for a model to over-fit the data while still being strongly
convex. In this case, convergence without a noise floor is possible without decaying the
learning rate [MB11, NWS14].
But the behavior of SGD on high-dimensional neural models still evades explana-
tion. Neural networks operate in a regime where the number of parameters is much larger
than the number of training data. In this regime, SGD seems to converge very quickly. So
quickly, in fact, that practitioners often use exponentially decaying learning rate schedules
without seeing the method stall. Furthermore, network architecture seems to affect SGD a
lot. It is common knowledge among practitioners that wider networks train faster [ZK16],
and deeper networks train slower [BSF94, GB10].
The goal of this chapter is to study why SGD is efficient for neural nets, and how
neural net design affects SGD. In particular, we investigate how over-parametrization –
an increase in the number of parameters beyond the number of training data – affects the
dynamics of SGD.
To explain the fast convergence of SGD on over-parameterized problems, we in-
troduce a simple concept called gradient confusion. When confusion is high, stochastic
gradients produced by different data samples may be negatively correlated. When this
happens, data samples contradict one another, causing slow convergence. When confu-
sion is low, the gradients produced by different samples are similar, and we show via the-
oretical and empirical results that convergence is much faster than predicted by classical
theory. For randomized training data, we show that the gradients of over-parameterized
neural networks are likely to have low confusion. Finally, we present experimental results
showing that low gradient confusion leads to efficient convergence of SGD.
Problem formulation & conceptual overview
Notation. In this chapter, we use upper-case bold fonts to represent matrices. We use
(W)i,j to indicate the (i, j) cell in matrix W and (W)i for the i-th row of matrix W.
SGD works by iteratively selecting a random function f̃k, and modifying the pa-
rameters to decrease the value of the objective term f̃k. It may happen that the selected
gradient ∇f̃k is negatively correlated with the gradient of another term ∇fj. In this case,
the gains we make by decreasing f̃k are partially cancelled out by an increase in fj, and
convergence becomes slow. When the gradients of different mini-batches are negatively
correlated, the objective terms disagree on which direction the parameters should move,
and we say that there is gradient confusion.
Definition 6.1.1. A set of objective functions {fi} has gradient confusion η if the pair-
wise inner products between gradients satisfy
〈∇fi(w),∇fj(w)〉 ≥ −η, ∀i, j. (6.1)
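Definition 6.1.1 can be checked numerically for a finite set of gradients evaluated at one point $w$. The sketch below is illustrative only: the helper name `gradient_confusion` is hypothetical, and it reports the clamped value $\max\{\eta, 0\}$ computed over pairs $i \neq j$.

```python
import numpy as np

def gradient_confusion(grads):
    """Smallest eta >= 0 such that <g_i, g_j> >= -eta for all pairs i != j."""
    G = np.asarray(grads, dtype=float)
    inner = G @ G.T                                   # all pairwise inner products
    off_diag = inner[~np.eye(len(G), dtype=bool)]     # drop the i == j entries
    return max(0.0, -off_diag.min())

# Orthogonal gradients never oppose each other.
print(gradient_confusion([[1.0, 0.0], [0.0, 2.0]]))
# Opposing gradients have inner product -2.
print(gradient_confusion([[1.0, 0.0], [-2.0, 0.0]]))
```

The first call returns zero confusion and the second returns the magnitude of the most negative inner product.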
SGD converges fast when gradient confusion is low. To see why, consider the
case of training a logistic regression model on a dataset with orthogonal vectors. We
have $f_i(w) = \ell(y_i \cdot \langle x_i, w\rangle)$, where $\ell : \mathbb{R} \to \mathbb{R}$ is the logistic loss, $\{x_i\}$ is a set of
orthogonal training vectors, and $y_i \in \{-1, 1\}$ is a label. We then have
$\nabla f_i(w) = y_i\, \ell'(y_i \cdot \langle x_i, w\rangle)\, x_i$ and, for $i \neq j$,
$\langle \nabla f_i(w), \nabla f_j(w)\rangle = y_i y_j\, \ell'(y_i \cdot \langle x_i, w\rangle)\, \ell'(y_j \cdot \langle x_j, w\rangle)\, \langle x_i, x_j\rangle = 0$, and so there
is no gradient confusion ($\eta = 0$). Because of gradient orthogonality, an update in the
gradient direction of $f_i$ has no effect on the loss value of $f_j$ for $i \neq j$. In this case, SGD
decouples into (deterministic) gradient descent on each objective term separately, and we
can expect to see the fast rates of convergence attained by deterministic gradient descent,
rather than the slow rates of SGD.
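The orthogonality argument can be reproduced in a few lines. This is a minimal sketch, assuming the logistic loss $\ell(z) = \log(1 + e^{-z})$ and three hand-picked orthogonal training vectors; the names `logistic_grad`, `X`, and `y` are illustrative and not taken from the thesis.

```python
import numpy as np

def logistic_grad(w, x, y):
    """Gradient of f(w) = log(1 + exp(-y <x, w>)), i.e. y * l'(y <x, w>) * x."""
    margin = y * (x @ w)
    return -y * x / (1.0 + np.exp(margin))

rng = np.random.default_rng(0)
d = 5
# Three mutually orthogonal training vectors (scaled standard basis vectors).
X = np.eye(d)[:3] * np.array([[1.0], [2.0], [0.5]])
y = np.array([1.0, -1.0, 1.0])
w = rng.normal(size=d)

grads = np.array([logistic_grad(w, X[i], y[i]) for i in range(3)])
inner = grads @ grads.T
off_diag = inner[~np.eye(3, dtype=bool)]
print(np.abs(off_diag).max())   # largest |<grad f_i, grad f_j>| over i != j
```

Each gradient is a multiple of its own training vector, so every off-diagonal inner product vanishes regardless of $w$.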
Can we expect a problem to have low gradient confusion in practice? It is known
that randomly chosen vectors in high dimensions are nearly orthogonal with high proba-
bility [GS16]. For this reason, we would expect an average-case (i.e., random) problem
to have nearly orthogonal gradients, provided that we don’t train on too many training
vectors (in which case it becomes likely that we will see two training vectors with large
negative correlation). In other words, we should expect a random optimization problem
to have low gradient confusion when the number of parameters is “large” and the number
of training data is “small” – i.e., when the model is over-parameterized.
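The near-orthogonality of random high-dimensional vectors is easy to observe empirically. The sketch below draws $n$ points uniformly from the unit sphere (by normalizing Gaussian samples) and reports the largest absolute cosine between any two of them; the function name and the choice $n = 50$ are assumptions made for illustration.

```python
import numpy as np

def max_abs_cosine(n, d, seed=0):
    """Largest |<x_i, x_j>| over i != j for n random unit vectors in R^d."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # uniform on the unit sphere
    C = X @ X.T
    return np.abs(C[~np.eye(n, dtype=bool)]).max()

for d in (10, 100, 10_000):
    print(d, max_abs_cosine(n=50, d=d))
```

With $n$ fixed, the worst-case cosine should shrink as $d$ grows, which is the regime the text calls over-parameterized.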
The above argument is rather informal, and ignores issues like random sampling or-
der and non-convexity. Furthermore, it is unclear whether we can expect low levels of gra-
dient confusion in practice, and what effect non-zero confusion has on convergence rates.
Below, we present a rigorous argument that low confusion levels accelerate SGD for both
convex and non-convex problems. Then, we turn to the issue of over-parameterization,
and show that gradient confusion is low for over-parameterized classifier problems with
random data. Finally, we use computational experiments to show that gradient confu-
sion is low for real-world neural nets, and that this explains the superior optimization
performance of SGD.
Related work
The authors of [ACH18] study the behavior of SGD on over-parameterized prob-
lems, and show that SGD on over-parameterized linear neural nets is similar to applying
a certain preconditioner while optimizing. Our work differs from [ACH18] in that it stud-
ies a completely different mechanism of acceleration, and that we establish a more direct
relationship between width, depth, problem dimensionality, and the error floor of SGD
convergence. The behavior of SGD on over-parameterized problems was also studied
in [MBB17] with the purpose of exploring how SGD hyper-parameters (learning rates,
batch size, etc.) affect convergence. In contrast, this study focuses on how and why
network architecture choices affect convergence.
Several other authors have studied the impact of structured gradients on SGD.
[BFL+17] study the effects of “shattered gradients,” which is when (non-stochastic) gra-
dients at different (but close) locations in parameter space become negatively correlated.
This is different from gradient confusion, which refers to negative correlations between
stochastic mini-batch gradients at the same location in parameter space. Another related
issue is that stochastic noise during training leads to improved generalization performance
because of implicit regularization. This is not our main focus – we address the question
of why SGD is good for optimization, rather than why it is good for generalization.
6.2 SGD is fast when gradient confusion is low
We now present a rigorous analysis of gradient confusion and its effect on SGD. We
begin by looking at the case where the objective satisfies the PL inequality (a condition
related to, but weaker than, strong convexity), where we can provide tight bounds on the
rate of convergence in terms of the optimality gap. Then we look at a broader class of
non-convex functions, and prove fast convergence to a stationary point.
We begin by making two standard assumptions about the objective function.
Assumption 6.2.1. The individual gradients $\nabla f_i$ are $L$-Lipschitz continuous:
\[ f_i(w') \le f_i(w) + \langle \nabla f_i(w),\, w' - w\rangle + \frac{L}{2}\|w' - w\|^2, \quad \forall i. \]
Assumption 6.2.2. The individual functions $f_i$ satisfy the PL inequality:
\[ \frac{1}{2}\|\nabla f_i(w)\|^2 \ge \mu\,(f_i(w) - f_i^\star), \quad \forall i, \]
where $f_i^\star = \min_w f_i(w)$.
Using these assumptions, we now state the following convergence result.
Theorem 6.2.1 (Linear convergence under bounded gradient confusion). If the objective
function satisfies Assumptions 6.2.1 and 6.2.2, and has gradient confusion bounded by $\eta$,
SGD with updates of the form (2.2) converges linearly to a neighborhood of the minima
on (2.1) as
\[ f(w_k) - f^\star \le \rho^k\,(f(w_0) - f^\star) + \frac{\alpha\hat\eta}{1-\rho}, \]
where $\hat\eta = \max\{\eta, 0\}$, the learning rate $\alpha \le 2/(nL)$ and $\rho = 1 - \frac{2\mu}{n}\left(\alpha - \frac{nL\alpha^2}{2}\right)$.
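Proof. From Assumption 6.2.1, we have
\begin{align*}
f(w_{k+1}) &\le f(w_k) + \langle \nabla f(w_k),\, w_{k+1} - w_k\rangle + \frac{L}{2}\|w_{k+1} - w_k\|^2 \\
&= f(w_k) - \alpha\langle \nabla f(w_k),\, \nabla\tilde f_k(w_k)\rangle + \frac{L\alpha^2}{2}\|\nabla\tilde f_k(w_k)\|^2 \\
&= f(w_k) - \left(\frac{\alpha}{n} - \frac{L\alpha^2}{2}\right)\|\nabla\tilde f_k(w_k)\|^2 - \frac{\alpha}{n}\sum_{\forall i:\, f_i \ne \tilde f_k}\langle \nabla f_i(w_k),\, \nabla\tilde f_k(w_k)\rangle \\
&\le f(w_k) - \left(\frac{\alpha}{n} - \frac{L\alpha^2}{2}\right)\|\nabla\tilde f_k(w_k)\|^2 + \frac{\alpha(n-1)\hat\eta}{n} \\
&\le f(w_k) - \left(\frac{\alpha}{n} - \frac{L\alpha^2}{2}\right)\|\nabla\tilde f_k(w_k)\|^2 + \alpha\hat\eta,
\end{align*}
where the second-last inequality follows from Definition 6.1.1. Let $\alpha < 2/(nL)$. Then,
using Assumption 6.2.2 and subtracting $f^\star = \min_w f(w)$ on both sides, we get
\[ f(w_{k+1}) - f^\star \le f(w_k) - f^\star - 2\mu\left(\frac{\alpha}{n} - \frac{L\alpha^2}{2}\right)(\tilde f_k(w_k) - \tilde f_k^\star) + \alpha\hat\eta, \]
where $\tilde f_k^\star = \min_w \tilde f_k(w)$. Taking expectation and using the fact that $\mathbb{E}_i[f_i^\star] \le f^\star$, we get:
\[ f(w_{k+1}) - f^\star \le \left(1 - \frac{2\mu\alpha}{n} + \mu L\alpha^2\right)(f(w_k) - f^\star) + \alpha\hat\eta. \]
Writing $\rho = 1 - \frac{2\mu\alpha}{n} + \mu L\alpha^2$, and unrolling the iterations, we get
\begin{align*}
f(w_{k+1}) - f^\star &\le \rho^{k+1}(f(w_0) - f^\star) + \sum_{i=0}^{k}\rho^i\alpha\hat\eta \\
&\le \rho^{k+1}(f(w_0) - f^\star) + \sum_{i=0}^{\infty}\rho^i\alpha\hat\eta \\
&= \rho^{k+1}(f(w_0) - f^\star) + \frac{\alpha\hat\eta}{1-\rho},
\end{align*}
which completes the proof. □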
This result shows that SGD converges linearly to a neighborhood of a minimizer,
and the size of this neighborhood depends on the level of gradient confusion. When
η ≤ 0, there is no confusion, and SGD converges directly to a minimizer without using a
vanishing learning rate schedule.
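The contrast between zero and non-zero confusion can be illustrated with two toy problems. This is a sketch under assumptions not found in the text: problem A uses orthonormal least-squares terms (zero confusion, $f^\star = 0$), problem B uses two conflicting scalar quadratics ($f^\star = 0.5$), and `run_sgd` is a hypothetical helper running constant-step SGD.

```python
import numpy as np

def run_sgd(grad_i, full_loss, w0, n, alpha, steps, seed=0):
    """Constant-step SGD; returns the full objective value after every step."""
    rng = np.random.default_rng(seed)
    w, losses = np.array(w0, dtype=float), []
    for _ in range(steps):
        i = rng.integers(n)                 # pick one objective term at random
        w = w - alpha * grad_i(w, i)
        losses.append(full_loss(w))
    return np.array(losses)

# Problem A: f_i(w) = 0.5 * (w[i] - y[i])^2, gradients are mutually orthogonal.
y = np.array([1.0, -2.0, 0.5, 3.0])
grad_a = lambda w, i: (w[i] - y[i]) * np.eye(4)[i]
loss_a = lambda w: 0.5 * np.mean((w - y) ** 2)

# Problem B: f_1 = 0.5 * (w - 1)^2 and f_2 = 0.5 * (w + 1)^2 conflict for |w| < 1.
t = np.array([1.0, -1.0])
grad_b = lambda w, i: w - t[i]
loss_b = lambda w: 0.5 * np.mean((w - t) ** 2)

gap_a = run_sgd(grad_a, loss_a, np.zeros(4), n=4, alpha=0.1, steps=2000)
gap_b = run_sgd(grad_b, loss_b, np.zeros(1), n=2, alpha=0.1, steps=2000) - 0.5
print(gap_a[-1000:].mean(), gap_b[-1000:].mean())
```

With the same constant step size, the optimality gap of problem A keeps shrinking toward zero, while problem B stalls at a noise floor, in line with the $\alpha\hat\eta/(1-\rho)$ term of Theorem 6.2.1.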
In the case of non-convex functions, we can still prove convergence to a neighbor-
hood of a stationary point under the following standard assumption.
Assumption 6.2.3. Assume that the variance of the gradients is bounded as:
\[ \mathbb{E}\left[\|\nabla\tilde f(w) - \nabla f(w)\|^2\right] \le \sigma^2. \]
The following theorem shows fast convergence in the case of a smooth non-convex
function when gradient confusion is low.
Theorem 6.2.2. If the objective function satisfies Assumptions 6.2.1 and 6.2.3, and the
confusion bound (6.1), then SGD converges to a stationary point with
\[ \min_{k=1,\dots,T} \mathbb{E}\|\nabla f(w_k)\|^2 \le \frac{2n}{2\alpha - nL\alpha^2}\,\frac{f(w_1) - f^\star}{T} + \frac{2n\eta}{2 - nL\alpha} \]
for learning rate $\alpha < 2/(nL)$.
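Proof. From the proof of Theorem 6.2.1, we have:
\[ f(w_{k+1}) \le f(w_k) - \left(\frac{\alpha}{n} - \frac{L\alpha^2}{2}\right)\|\nabla\tilde f_k(w_k)\|^2 + \alpha\eta. \tag{6.2} \]
Using Assumption 6.2.3, we can write:
\[ \mathbb{E}\|\nabla\tilde f_k(w_k)\|^2 = \mathbb{E}\|\nabla\tilde f_k(w_k) - \nabla f(w_k)\|^2 + \mathbb{E}\|\nabla f(w_k)\|^2 = \sigma^2 + \mathbb{E}\|\nabla f(w_k)\|^2. \]
Thus, taking expectation and assuming the step size $\alpha < 2/(nL)$, we can rewrite
equation (6.2) as:
\begin{align*}
\mathbb{E}\|\nabla f(w_k)\|^2 &\le \frac{2n}{2\alpha - nL\alpha^2}\,\mathbb{E}\big(f(w_k) - f(w_{k+1})\big) - \sigma^2 + \frac{2n\eta}{2 - nL\alpha} \\
&\le \frac{2n}{2\alpha - nL\alpha^2}\,\mathbb{E}\big(f(w_k) - f(w_{k+1})\big) + \frac{2n\eta}{2 - nL\alpha}.
\end{align*}
Taking an average over $T$ iterations, and using $f^\star = \min_w f(w)$, we get:
\[ \min_{k=1,\dots,T}\mathbb{E}\|\nabla f(w_k)\|^2 \le \frac{1}{T}\sum_{k=1}^{T}\mathbb{E}\|\nabla f(w_k)\|^2 \le \frac{2n}{2\alpha - nL\alpha^2}\,\frac{f(w_1) - f^\star}{T} + \frac{2n\eta}{2 - nL\alpha}. \]
□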
The presence of a noise floor is not always observed for over-parameterized prob-
lems, and the assumption that η ≤ 0 is unrealistically strong to guarantee such conver-
gence. In the next section, we present a few additional assumptions under which faster
convergence can be observed.
6.2.1 Conditions for even faster convergence
Faster convergence can be guaranteed if we re-define gradient confusion using the
correlation between gradients (rather than the dot product). If these correlations are
bounded below, then linear convergence occurs with no noise floor. The following results
prove this convergence in the case of over-fitting. Over-fitting occurs when the minimal
objective value of the composite objective $F$ in (2.1) is the average of all the minimal values
of the terms $\{f_i\}$. In other words, the parameter $w^\star$ that minimizes $F$ also simultaneously
minimizes every objective term $f_i$.
Theorem 6.2.3. Suppose $F$ satisfies Assumptions 6.2.1, 6.2.2, and the correlation-based
confusion condition
\[ \frac{\langle \nabla f_i(w), \nabla f_j(w)\rangle}{\|\nabla f_i(w)\|\,\|\nabla f_j(w)\|} \ge -\nu, \quad \forall i, j. \tag{6.3} \]
Suppose further the loss satisfies the over-fitting condition $\min_w \frac{1}{n}\sum_i f_i(w) = \frac{1}{n}\sum_i \min_w f_i(w)$.
If the objective has confusion $\nu < \frac{\mu}{nL}$ and learning rate $\alpha < \frac{2}{nL} - \frac{2\nu}{\mu}$, then SGD
converges with
\[ f(w_k) - f^\star \le \rho^k\,(f(w_0) - f^\star), \]
where $\rho = 1 - 2\mu\alpha/n + \mu L\alpha^2 + 2\alpha\nu L$.
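Proof. From (6.3) and the identity $2ab \le a^2 + b^2$ we get
\begin{align*}
\langle \nabla f_i(w), \nabla f_j(w)\rangle &\ge -\nu\|\nabla f_i(w)\|\,\|\nabla f_j(w)\| \ge -\frac{\nu}{2}\left(\|\nabla f_i(w)\|^2 + \|\nabla f_j(w)\|^2\right) \\
&\ge -\nu L\,(f_i(w) - f_i^\star + f_j(w) - f_j^\star).
\end{align*}
Following the proof of Theorem 6.2.1, we have
\begin{align*}
f(w_{k+1}) &\le f(w_k) + \langle \nabla f(w_k),\, w_{k+1} - w_k\rangle + \frac{L}{2}\|w_{k+1} - w_k\|^2 \\
&= f(w_k) - \alpha\langle \nabla f(w_k),\, \nabla\tilde f_k(w_k)\rangle + \frac{L\alpha^2}{2}\|\nabla\tilde f_k(w_k)\|^2 \\
&= f(w_k) - \left(\frac{\alpha}{n} - \frac{L\alpha^2}{2}\right)\|\nabla\tilde f_k(w_k)\|^2 - \frac{\alpha}{n}\sum_{\forall i:\, f_i \ne \tilde f_k}\langle \nabla f_i(w_k),\, \nabla\tilde f_k(w_k)\rangle \\
&\le f(w_k) - 2\mu\left(\frac{\alpha}{n} - \frac{L\alpha^2}{2}\right)(\tilde f_k(w_k) - \tilde f_k^\star) + \frac{\alpha\nu L}{n}\sum_{\forall i:\, f_i \ne \tilde f_k}(f_i(w_k) - f_i^\star + \tilde f_k(w_k) - \tilde f_k^\star),
\end{align*}
where the second-last inequality follows from Definition 6.1.1. Let the learning rate $\alpha < 2/(nL)$.
Then, using Assumption 6.2.2 and subtracting $f^\star = \min_w f(w)$ on both sides
\begin{align*}
f(w_{k+1}) - f^\star \le f(w_k) - f^\star &- 2\mu\left(\frac{\alpha}{n} - \frac{L\alpha^2}{2}\right)(\tilde f_k(w_k) - \tilde f_k^\star) \\
&+ \frac{\alpha\nu L}{n}\sum_{\forall i:\, f_i \ne \tilde f_k}(f_i(w_k) - f_i^\star + \tilde f_k(w_k) - \tilde f_k^\star),
\end{align*}
where $\tilde f_k^\star = \min_w \tilde f_k(w)$. Taking expectation and using the fact that $\mathbb{E}_i[f_i^\star] = f^\star$, we get
\[ f(w_{k+1}) - f^\star \le \left(1 - \frac{2\mu\alpha}{n} + \mu L\alpha^2 + 2\alpha\nu L\right)(f(w_k) - f^\star). \]
□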
Finally, we can strengthen the definition of confusion by examining the correlation
between $\nabla f_i(w)$ and $\nabla f_j(w')$ for all $w$ and $w'$. Compared to Theorem 6.2.1, convergence
is guaranteed with a larger learning rate that is independent of the training set size $n$, and
faster geometric decay.
Theorem 6.2.4. If the objective function satisfies Assumptions 6.2.1 and 6.2.2, and satis-
fies the strengthened gradient confusion bound
〈∇fi(w),∇fj(w′)〉 ≥ −η, ∀i, j,w,w′,
then SGD converges with
\[ f(w_k) - f^\star \le \rho^k\,(f(w_0) - f^\star) + \frac{\alpha\hat\eta}{1-\rho}, \]
where η̂ = max{η, 0}, the learning rate α ≤ 2/L and ρ = 1− 2µα/n+ µLα2/n.
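Proof. We start by noting that, for $i \neq j$
\begin{align*}
f_i(w - \alpha\nabla f_j(w)) &= f_i(w) + \int_{t=0}^{\alpha}\frac{\partial}{\partial t} f_i(w - t\nabla f_j(w))\,dt \tag{6.4} \\
&= f_i(w) - \int_{t=0}^{\alpha}\nabla f_j(w)^{\top}\nabla f_i(w - t\nabla f_j(w))\,dt \tag{6.5} \\
&\le f_i(w) + \int_{t=0}^{\alpha}\hat\eta\,dt \le f_i(w) + \alpha\hat\eta. \tag{6.6}
\end{align*}
We then have
\begin{align*}
n f(w_{k+1}) &= \tilde f_k(w_k - \alpha\nabla\tilde f_k(w_k)) + \sum_{f_i \ne \tilde f_k} f_i(w_k - \alpha\nabla\tilde f_k(w_k)) \\
&\le \tilde f_k(w_k) - \alpha\langle\nabla\tilde f_k(w_k),\, \nabla\tilde f_k(w_k)\rangle + \frac{L\alpha^2}{2}\|\nabla\tilde f_k(w_k)\|^2 + \sum_{f_i \ne \tilde f_k}\big(f_i(w_k) + \alpha\hat\eta\big) \\
&\le n f(w_k) - \left(\alpha - \frac{L\alpha^2}{2}\right)\|\nabla\tilde f_k(w_k)\|^2 + n\alpha\hat\eta.
\end{align*}
Re-arranging and applying Assumption 6.2.2 we get
\[ f(w_{k+1}) \le f(w_k) - \frac{2\mu}{n}\left(\alpha - \frac{L\alpha^2}{2}\right)(\tilde f_k(w_k) - \tilde f_k^\star) + \alpha\hat\eta. \]
Taking expectations and subtracting $f^\star$ from both sides we get
\[ f(w_{k+1}) - f^\star \le \left(1 - \frac{2\mu\alpha}{n} + \frac{\mu L\alpha^2}{n}\right)(f(w_k) - f^\star) + \alpha\hat\eta. \]
Unrolling this expression gives us
\[ f(w_{k+1}) - f^\star \le \rho^{k+1}(f(w_0) - f^\star) + \frac{\alpha\hat\eta}{1-\rho}, \]
where $\rho = 1 - 2\mu\alpha/n + \mu L\alpha^2/n$, which completes the proof. □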
6.3 Over-parameterized problems have low gradient confusion
In the previous section, we showed that low gradient confusion can result in much
faster convergence of SGD to the neighborhood of a critical point for general smooth
non-convex functions. The question still remains, however, when such a condition might
arise, and how it might explain the effectiveness of SGD on neural net problems.
In practice the level of gradient confusion will depend on the structure of the train-
ing data. However, we can analyze gradient confusion for generic (i.e., random) model
problems using methods from high-dimensional probability. We rigorously analyze the
case where training data is randomly sampled from a unit sphere, and identify specific
cases where gradient confusion (Definition 6.1.1) is low with high-probability.
We consider synthetic datasets of the form $D = \{(x_i, C(x_i))\}_{i=1}^{n}$, for some labeling
function $C$. The data points $\{x_i\}$ are drawn uniformly and independently from
the surface of a $d$-dimensional unit sphere. The function $f_i(w)$ we consider is the
least-squares function.
We show below that, given a confusion parameter η, a range of models (including
neural networks) with randomized training data naturally attain confusion less than η
provided the dimension of the problem is sufficiently large. The method we use to prove
this result is very similar for a range of different problem classes, and so we begin by
illustrating the result on the simple class of linear regression problems.
6.3.1 A simple case: linear regression
We begin by examining gradient confusion in the case of a simple linear least-squares
regression. We assume that the hidden concept that the algorithm is trying to
learn is given by $C(x) = \langle\tilde w, x\rangle$, for some “true” weight vector $\tilde w$.¹ Throughout we will
denote by $g_w : \mathbb{R}^d \to \mathbb{R}$ the function we fit to the training data. In this section we simply
have
\[ g_w(x) = \langle w, x\rangle, \quad\text{and}\quad f_i(w) = \frac{1}{2}\,(g_w(x_i) - C(x_i))^2, \]
but we will consider more complex situations below. For this problem, we have
\[ \nabla f_i(w) = \alpha_i x_i, \quad\text{where}\quad \alpha_i := g_w(x_i) - C(x_i). \]
With this definition of the gradient, we can prove the following theorem.
¹ The arguments in this paper actually hold for the case of a noisy model $C(x) = \langle\tilde w, x\rangle + \zeta$ where $\zeta$ is
a Gaussian random variable; however, we will omit the noise term for notational simplicity.
[Figure 6.1 plot: numerical estimation of the violation probability (vertical axis) against the problem dimension d (horizontal axis, 0 to 1000), together with a best-fit 1/poly(d) curve.]
Figure 6.1: Simulation proof for Theorem 6.3.1. As the dimensionality of a random linear
regression problem increases, the probability of violating the gradient confusion condition
η > 0.1 vanishes.
Theorem 6.3.1 (Concentration for linear regression). Let $w, \tilde w \in [-1, 1]^d$ be the approximate
and true weight vectors and let $\eta > 0$ be a given constant. For $d > \Omega(\log n)$, we
have that with probability at least $1 - \Omega\left(\frac{1}{\mathrm{poly}(d)}\right)$, equation (6.7) holds at $w$.
Figure 6.1 shows a numerical demonstration of this theorem.
Note that one limitation of Theorem 6.3.1, and of all other results presented in the
rest of this section, is that the bound is non-uniform in the weights w.
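A simulation in the spirit of Figure 6.1 can be sketched as follows. The sampling choices here are assumptions made for illustration (weights drawn uniformly from $[-1,1]^d$, $n = 20$ data points, $\eta = 0.1$) and are not claimed to match the setup used to produce the figure; `violation_probability` is a hypothetical name.

```python
import numpy as np

def violation_probability(d, n=20, eta=0.1, trials=200, seed=0):
    """Fraction of random problems where some pair has <grad f_i, grad f_j> < -eta."""
    rng = np.random.default_rng(seed)
    violations = 0
    for _ in range(trials):
        w = rng.uniform(-1, 1, size=d)          # approximate weights
        w_true = rng.uniform(-1, 1, size=d)     # "true" weights
        X = rng.normal(size=(n, d))
        X /= np.linalg.norm(X, axis=1, keepdims=True)   # points on the unit sphere
        alpha = X @ (w - w_true)                # residuals alpha_i = g_w(x_i) - C(x_i)
        G = alpha[:, None] * X                  # gradients alpha_i * x_i
        H = G @ G.T
        off_diag = H[~np.eye(n, dtype=bool)]
        violations += int(off_diag.min() < -eta)
    return violations / trials

for d in (10, 100, 2000):
    print(d, violation_probability(d))
```

The estimated violation frequency should fall as $d$ increases with $n$ held fixed, which is the qualitative behavior the theorem describes.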
Technical approach & proof sketch
To prove that any bound on the gradient confusion $\eta > 0$ is attained for sufficiently
large $d$, we examine the values of the function $h(x_i, x_j) := \langle\nabla f_i, \nabla f_j\rangle$ when $x_i$ and $x_j$ are
selected at random from the unit sphere. Our goal will be to show that this function has
positive expected value, and that $h$ concentrates around its expected value for large $d$,
thus making it extremely unlikely that a large negative value of $h$ is observed. Once we
have shown that $h(x_i, x_j) \le -\eta$ with extremely low probability for a random pair of points
$(x_i, x_j)$, we can use a union bound to show that this occurs with low probability across all
pairwise comparisons between data points.
For a given constant $\eta > 0$, our goal is to find an appropriate concentration bound
for the event
\[ \text{for all } i \neq j, \quad h(x_i, x_j) \ge -\eta. \tag{6.7} \]
In other words, we want to find a sharp bound $\tau(n, d, \eta)$ such that for fixed $i, j$
\[ \Pr[h(x_i, x_j) \le -\eta] \le \tau(n, d, \eta). \tag{6.8} \]
By using the union bound and the fact that $h(x_i, x_j)$ is identically distributed for all $i, j \in [n]$,
we have that
\[ \Pr[\exists\, i \neq j,\ h(x_i, x_j) \le -\eta] \le n^2\,\tau(n, d, \eta). \tag{6.9} \]
We use tools from high-dimensional probability to find an appropriate function $\tau$ for a
large class of predictors $g$, and show that the quantity $n^2\tau(n, d, \eta)$ vanishes for large $d$.
Proof of Theorem 6.3.1
We will briefly describe some technical lemmas we require in our analysis. The
following Chernoff-style concentration bound is proved in Chapter 5 of [Ver].
Lemma 6.3.1 (Concentration of Lipschitz function over a sphere). Let $x \in \mathbb{R}^d$ be sampled
uniformly from the surface of a $d$-dimensional sphere. Consider a Lipschitz function
$\ell : \mathbb{R}^d \to \mathbb{R}$ which is differentiable everywhere. Let $\|\nabla\ell\|_\infty$ denote $\sup_{x\in\mathbb{R}^d}\|\nabla\ell(x)\|_\infty$.
Then for any $t \ge 0$ and some fixed constant $c \ge 0$, we have the following.
\[ \Pr\Big[\big|\ell(x) - \mathbb{E}[\ell(x)]\big| \ge t\Big] \le 2\exp\left(-\frac{c\,d\,t^2}{\rho^2}\right), \tag{6.10} \]
where $\rho \ge \|\nabla\ell\|_\infty$ is an entry-wise bound on $\nabla\ell$.
We will rely on the following generalization of Lemma 6.3.1.
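Corollary 6.3.1. Let $x, y \in \mathbb{R}^d$ be two mutually independent vectors sampled uniformly
from the surface of a $d$-dimensional sphere. Consider a Lipschitz function $\ell : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$
which is differentiable everywhere. Let $\|\nabla\ell\|_\infty$ denote $\sup_{(x,y)\in\mathbb{R}^d\times\mathbb{R}^d}\|\nabla\ell(x, y)\|_\infty$. Then
for any $t \ge 0$ and some fixed constant $c \ge 0$, we have the following.
\[ \Pr\Big[\big|\ell(x, y) - \mathbb{E}[\ell(x, y)]\big| \ge t\Big] \le 2\exp\left(-\frac{c\,d\,t^2}{\rho^2}\right), \tag{6.11} \]
where $\rho \ge \|\nabla\ell\|_\infty$ is an entry-wise bound on $\nabla\ell$.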
Proof. This corollary can be derived from Lemma 6.3.1 as follows. Note that for every
fixed $\tilde y \in \mathbb{R}^d$, equation (6.10) holds. Additionally, we have that the vectors $x$ and $y$ are
mutually independent. Hence we can write the LHS of equation (6.11) as the following.
\[ \int_{(y)_1=-\infty}^{\infty}\!\!\cdots\int_{(y)_d=-\infty}^{\infty} \Pr\Big[\big|\ell(x, y) - \mathbb{E}[\ell(x, y)]\big| \ge t \,\Big|\, y = \tilde y\Big]\,\phi(\tilde y)\,d(y)_1 \dots d(y)_d. \]
Here $\phi(\tilde y)$ refers to the pdf of the distribution of $y$. From independence, the inner term
in the integral evaluates to $\Pr\big[|\ell(x, \tilde y) - \mathbb{E}[\ell(x, \tilde y)]| \ge t\big]$. We know this is less than or
equal to $2\exp\left(-\frac{c\,d\,t^2}{\|\nabla\ell\|_\infty^2}\right)$. Therefore, the integral can be upper bounded by the following.
\[ \int_{(y)_1=-\infty}^{\infty}\!\!\cdots\int_{(y)_d=-\infty}^{\infty} 2\exp\left(-\frac{c\,d\,t^2}{\|\nabla\ell\|_\infty^2}\right)\phi(\tilde y)\,d(y)_1 \dots d(y)_d. \]
Since $\phi(\tilde y)$ is a valid pdf, we get the required equation (6.11). □
Additionally, we will use the following facts about a normalized Gaussian random
variable.
Lemma 6.3.2. For a normalized Gaussian x (i.e., an x sampled uniformly from the sur-
face of a unit d-dimensional sphere) the following statements are true.
1. ∀i ∈ [d] we have that E[(x)i] = 0.
2. ∀i ∈ [d] we have that $\mathbb{E}[(x)_i^2] = 1/d$.
Proof. Part (1) can be proved by observing that the normalized Gaussian random variable
is spherically symmetric about the origin. In other words, for every i ∈ [d] the vectors
(x1, x2, . . . , xi, . . . , xd) and (x1, x2, . . . ,−xi, . . . , xd) are identically distributed. Hence
E[xi] = E[−xi] which implies that E[xi] = 0.
Part (2) can be proved by observing that for any $i, j \in [d]$, $x_i$ and $x_j$ are identically
distributed. Fix any $i \in [d]$. We have that $\sum_{j=1}^{d}\mathbb{E}[x_j^2] = d \times \mathbb{E}[x_i^2]$. Note that we have
\[ \sum_{j=1}^{d}\mathbb{E}[x_j^2] = \int_{(x)_1=-\infty}^{\infty}\!\!\cdots\int_{(x)_d=-\infty}^{\infty} \frac{\sum_{j=1}^{d} x_j^2}{\sum_{j'=1}^{d} x_{j'}^2}\,\phi(x)\,d(x)_1 \dots d(x)_d = 1. \]
Therefore $\mathbb{E}[x_i^2] = 1/d$. □
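Both parts of Lemma 6.3.2 can be sanity-checked by Monte Carlo. The sketch below uses an arbitrary dimension $d = 8$ and sample count chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, samples = 8, 200_000
X = rng.normal(size=(samples, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # normalized Gaussians on the unit sphere

first_moment = X.mean(axis=0)            # Part (1): each coordinate mean should be near 0
second_moment = (X ** 2).mean(axis=0)    # Part (2): each second moment should be near 1/d
print(np.abs(first_moment).max(), np.abs(second_moment - 1 / d).max())
```

Both printed deviations should be small and shrink further as the number of samples grows.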
We are now ready to prove Theorem 6.3.1.
Proof. The proof illustrates a general strategy we will use for other general models. We
will prove that h(·, ·) has two properties, namely that ∇h(·, ·) is bounded, and the entries
in ∇h(·, ·) have non-negative expectation. These properties enable us to use Corollary
6.3.1 to show that h(·, ·) concentrates on non-negative values.
Bounded Gradient. Fix an arbitrary value $x_i = \tilde x_i$ and $x_j = \tilde x_j$. Consider an arbitrary
coordinate corresponding to the variable $(x_i)_p$ (by symmetry a corresponding argument
holds for $(x_j)_p$) in the vector $\nabla h(\tilde x_i, \tilde x_j)$. Writing $\Delta := w - \tilde w$, so that $\alpha_i = \langle\Delta, x_i\rangle$, the term
evaluates to $\alpha_i\alpha_j(\tilde x_j)_p + (\Delta)_p\,\alpha_j\langle\tilde x_i, \tilde x_j\rangle$. Note that we have $|(\tilde x_i)_p| \le 1$ for every $i \in [n]$, $p \in [d]$.
Additionally, from our assumption on weights $w$ and $\tilde w$ we have $-2 \le \sum_{p=1}^{d}(\Delta)_p \le 2$.
Therefore we have that for every $i \in [n]$, $-2 \le \alpha_i \le 2$. And hence
$\alpha_i\alpha_j(\tilde x_j)_p + \alpha_j\sum_{p=1}^{d}(\Delta)_p(\tilde x_i)_p(\tilde x_j)_p \le 8$. Hence we have that $\|\nabla h\|_\infty \le \rho = 8$. In particular, the
upper bound $\rho$ is a constant.
Non-negative Expectation. To compute bounds on $\mathbb{E}[h(x_i, x_j)]$, we want to evaluate
$\mathbb{E}[\alpha_i\alpha_j\langle x_i, x_j\rangle]$. On expanding the product and removing all summands where either
$(x_i)_p$ or $(x_j)_p$ appear as an odd term, we have $\mathbb{E}[h(x_i, x_j)] = \|\Delta\|^2/d^2$. Therefore
we have $0 \le \mathbb{E}[h(x_i, x_j)]$. Alternatively, we can obtain this lower bound as follows.
Note that since $-2 \le \alpha_i, \alpha_j \le 2$, we have that $\mathbb{E}[\alpha_i\alpha_j\langle x_i, x_j\rangle] \ge -4\,\mathbb{E}[\langle x_i, x_j\rangle] = 0$.
The last equality is because of the following:
\[ \mathbb{E}[\langle x_i, x_j\rangle] = \mathbb{E}\Big[\sum_{p=1}^{d}(x_i)_p(x_j)_p\Big] = \sum_{p=1}^{d}\mathbb{E}[(x_i)_p(x_j)_p] = \sum_{p=1}^{d}\mathbb{E}[(x_i)_p]\,\mathbb{E}[(x_j)_p] = 0. \]
The second equality follows from
Linearity of Expectation, the third from independence of $x_i$ and $x_j$ and the last follows
from Lemma 6.3.2.
We combine the two properties as follows. From the Non-negative Expectation property
and equation (6.11), we have that
\[ \Pr[h(x_i, x_j) \le -\eta] \le \Pr\big[h(x_i, x_j) - \mathbb{E}[h(x_i, x_j)] \le -\eta\big] \le 2\exp\left(-\frac{c\,d\,\eta^2}{\rho^2}\right). \]
The probability that some value of $h(x_i, x_j)$ lies below $-\eta$ is then bounded by $2n^2\exp\left(-\frac{c\,d\,\eta^2}{\rho^2}\right)$.
For any choice of $c_1$, we can solve the inequality $2n^2\exp\left(-\frac{c\,d\,\eta^2}{\rho^2}\right) \le d^{-c_1}$, to get
$d > \Omega(\rho^2\log n)$ and $\eta > \Omega(1/\sqrt{d})$. In particular, the bound holds for any constant $\eta > 0$. □
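The value $\mathbb{E}[h(x_i, x_j)] = \|\Delta\|^2/d^2$ quoted in the proof can be spelled out as a short worked computation. The derivation below assumes $\Delta = w - \tilde w$ (so that $\alpha_i = \langle\Delta, x_i\rangle$) and uses the fact that $\mathbb{E}[(x)_a(x)_p] = 0$ for $a \neq p$, which follows from the same sign-flip symmetry used in Part (1) of Lemma 6.3.2.

```latex
\begin{align*}
\mathbb{E}[h(x_i, x_j)]
  &= \mathbb{E}\big[\langle\Delta, x_i\rangle\,\langle\Delta, x_j\rangle\,\langle x_i, x_j\rangle\big] \\
  &= \sum_{a=1}^{d}\sum_{b=1}^{d}\sum_{p=1}^{d} (\Delta)_a(\Delta)_b\;
     \mathbb{E}\big[(x_i)_a (x_i)_p\big]\;\mathbb{E}\big[(x_j)_b (x_j)_p\big]
     && \text{(independence of } x_i, x_j\text{)} \\
  &= \sum_{p=1}^{d} (\Delta)_p^2\;\mathbb{E}\big[(x_i)_p^2\big]\;\mathbb{E}\big[(x_j)_p^2\big]
     && \text{(cross terms vanish)} \\
  &= \sum_{p=1}^{d} (\Delta)_p^2 \cdot \frac{1}{d}\cdot\frac{1}{d}
   = \frac{\|\Delta\|^2}{d^2} \;\ge\; 0.
     && \text{(Lemma 6.3.2, Part 2)}
\end{align*}
```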
6.3.2 Linear neural networks
We now study the behavior of gradient confusion for neural networks. We begin
with the simplified case of linear neural networks (i.e., with no non-linearities) with one
output feature, one hidden layer, and a quadratic loss. We’ll then examine the case of
more general networks.
Let $W_0 \in [-1, 1]^{d\times\ell_1}$, $W_1 \in [-1, 1]^{\ell_1\times 1}$ denote the weight matrices connecting the
input layer to the hidden layer, and the hidden layer to the output, respectively. Thus, the
output of the neural net is given by $W_1 W_0 x$. Then we have,
\[ g(x) = \sum_{p=1}^{d}\sum_{p'=1}^{\ell_1}(W_1)_{p'}(W_0)_{p,p'}(x)_p, \quad\text{and}\quad \alpha_i := g(x_i) - C(x_i). \]
Later, we consider the case of more than one hidden layer, and use $\ell_i$ to denote
the width of layer $i$ and $\beta$ to be the total number of hidden layers. Further we define
$\ell := \max_{i\in[\beta]}\ell_i$. Throughout this sub-section we make the following assumption.
Assumption 6.3.1 (Small Weights). We will assume $W$ is such that the outputs at each
neuron of the hidden layer, $C(x)$ and $g(x)$ all lie in $[-1, 1]$ for every value of $x$ in
the unit ball.² Additionally, we will assume that the entries in the weight matrices $W_i$
(e.g., $W_0, W_1, \dots, W_\beta$) lie in the range $[-1/\ell, 1/\ell]$.
Is the small-weights assumption reasonable? If we relax the assumption and let every
entry in each of the weight matrices lie in the range $[-1, 1]$, then a product of matrices
$W_\beta \cdot W_{\beta-1} \cdots (W_{\beta'})_{k_{\beta'}}$ may lead to an exponential blow up in values. If we have
vectors $v_1, v_2 \in [-1, 1]^{\ell}$ then $\langle v_1, v_2\rangle$ can be as large as $\ell$. Hence in a sequence of $\beta$
products of matrices, the final value can become as large as $\ell^{\beta}$. Note this would imply
that $\|\nabla h\|_\infty \le \ell^{\beta}$ and hence we need $d \ge \Omega(\ell^{2\beta}\log n)$ for the required concentration.
The small weights assumption is not just a theoretical concern, but also usually mandated
in practice. Without small weights, the gradients $\nabla f_i$ blow up in magnitude and a
phenomenon known as gradient explosion (which is particularly problematic for recurrent
neural networks) is observed. This may indicate why weights are typically initialized as
$\mathcal{N}(0, 1/\ell_i)$ in practice [Ben12].
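The exponential blow-up described above can be seen in a worst-case toy example. The sketch below assumes square layers of equal width and sets every weight to the same value; `chain_output` is a hypothetical helper, and the setup is only meant to contrast entries at the edge of $[-1, 1]$ with entries at the edge of $[-1/\ell, 1/\ell]$.

```python
import numpy as np

def chain_output(width, depth, scale):
    """Apply `depth` square matrices with all entries equal to `scale` to the all-ones vector."""
    v = np.ones(width)
    W = np.full((width, width), scale)
    for _ in range(depth):
        v = W @ v
    return v.max()

width, depth = 10, 5
print(chain_output(width, depth, 1.0))           # weights at the edge of [-1, 1]
print(chain_output(width, depth, 1.0 / width))   # weights at the edge of [-1/width, 1/width]
```

In the first case the output grows like `width ** depth`, while in the second it stays of order one regardless of depth.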
Similar to the case for linear regression, we prove the following theorem. A full
discussion of this result, and a rigorous proof, are in the Appendix.
Theorem 6.3.2 (Bounded gradient confusion for single layer linear neural network). Consider
a single hidden layer neural network with fixed weight vectors $W_1 \in \mathbb{R}^{1\times\ell_1}$, $W_0 \in \mathbb{R}^{\ell_1\times d}$
satisfying Assumption 6.3.1, and let $\eta > 0$ be a given constant. We have that with
probability at least $1 - \Omega\big(n\exp(-c\,\ell_1^2\,d\,\eta^2)\big)$, equation (6.7) holds.
Proof. We will prove the two properties.
² Note one could use an activation function (sigmoid, tanh, or softmax) to enforce this assumption. For
now we will assume that we can bound the outputs appropriately.
Bounded Gradient. Fix an arbitrary value $x_i = \tilde x_i$ and $x_j = \tilde x_j$. Consider an arbitrary
coordinate corresponding to the variable $(x_i)_p$ (by symmetry a corresponding argument
holds for $(x_j)_p$) in the vector $\nabla h(\tilde x_i, \tilde x_j)$. The term evaluates to the following.
\begin{align*}
&\alpha_i\alpha_j\left(\sum_{p''=1}^{\ell_1}\sum_{p'=1}^{d}(W_0)_{p,p''}(W_0)_{p',p''}(x_j)_{p'} + \sum_{p'=1}^{\ell_1}(W_1)_{p'}^2\,(x_j)_p\right) \\
&\quad + \alpha_j\,\kappa_{i,j}\left(\sum_{p'=1}^{\ell_1}(W_1)_{p'}(W_0)_{p,p'} - (\tilde W)_p\right).
\end{align*}
By the assumption that both $C(x)$ and $g(x)$ lie in $[-1, 1]$, we have that $-2 \le \alpha_i \le 2$
for all $i \in [n]$. From Assumption 6.3.1 and the sampling procedure of the $x_i$'s we
have the following.
\[ \sum_{p=1}^{d}(W_0)_{p,p''}(x_i)_p \le \frac{1}{\ell_1}\sum_{p=1}^{d}(x_i)_p \le \frac{1}{\ell_1}. \tag{6.12} \]
Hence, combining this with Assumption 6.3.1, the first term in the sum
above can be upper-bounded by $8/\ell_1$. Note that for every pair $i, j \in [n]$ we have that
$-1 \le \sum_{p=1}^{d}(x_i)_p(x_j)_p \le 1$ since this quantity represents the cosine of the angle between
two vectors. Again using the observation in Equation (6.12), we have that $\kappa_{i,j} \le 2/\ell_1$.
Hence, the second term in the summand above can be upper-bounded by $8/\ell_1$ and hence
$\|\nabla(h(x_i, x_j))\|_\infty \le 16/\ell_1$. In other words, we have that $\rho$ is $O(1/\ell_1)$.
Non-negative Expectation. Note that αi, αj ≥ −2. Therefore we have E[h(xi,xj)] ≥
−4E[κi,j]. By using Linearity of Expectation and the fact that normalized Gaussian ran-
dom variables have mean 0 at every co-ordinate, we have that E[κi,j] = 0. Therefore we
have E[h(xi,xj)] ≥ 0.
Armed with the bounded gradient and non-negative expectation properties, the rest
of the proof follows as in the proof of Theorem 6.3.1. 
Note that from Theorem 6.3.2 we have that the concentration gets sharper as the
width of the hidden layer increases. In particular, SGD converges faster (on the simulated
dataset) with increasing width.
6.3.3 Extension to arbitrary depth linear networks
We now extend the above analysis to an arbitrary depth linear neural network. Inter-
estingly, we see in this case that the gradient confusion depends critically on the network
architecture – in particular the width and depth.
In this case we have the model $g(x) := W_\beta W_{\beta-1}\cdots W_1 W_0\,x$ where $\{W_i\}_{i=1}^{\beta}$ are
weight matrices of appropriate dimensions. We let $\beta$ denote the number of hidden layers in our
network, and we let $\ell$ denote the width of our network, i.e., the maximal number
of features in any layer.
Theorem 6.3.3 (Extension to arbitrary depth linear neural networks). Consider an arbitrary
depth linear neural network, and assume the weights satisfy Assumption 6.3.1. Let
$\eta > 0$ be a given constant. With probability at least $1 - \Omega\left(n\exp\left(-\frac{c\,\ell^2 d\,\eta^2}{\beta^2}\right)\right)$ we have that
equation (6.7) holds.
Proof. This statement can be equivalently written as follows. Note that for constants
$c', c'' > 0$, we want $c' n^2\exp\left(-\frac{c\,\ell^2 d\,\eta^2}{\beta^2}\right) \le c''\,\frac{1}{\mathrm{poly}(d)}$, since we assume that $d$ is the
asymptotic parameter and hence this makes the concentration explicit. Rearranging and
solving for $d$, we get the condition that $d \ge \Omega((\beta/\ell)^2\log n)$.
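The rearrangement can be written out explicitly. The steps below are a sketch that assumes $1/\mathrm{poly}(d) = d^{-c_1}$ for some constant $c_1 > 0$ (as in the proof of Theorem 6.3.1) and treats $\eta$, $c$, $c'$, $c''$ as constants.

```latex
\begin{align*}
c'\,n^2 \exp\!\left(-\frac{c\,\ell^2 d\,\eta^2}{\beta^2}\right) \le c''\,d^{-c_1}
  &\iff \frac{c\,\ell^2 d\,\eta^2}{\beta^2} \;\ge\; 2\log n + c_1\log d + \log\frac{c'}{c''} \\
  &\iff d \;\ge\; \frac{\beta^2}{c\,\ell^2\eta^2}\left(2\log n + c_1\log d + \log\frac{c'}{c''}\right).
\end{align*}
% For large d the \log d term is of lower order than d itself, so the
% requirement reduces to d \ge \Omega\big((\beta/\ell)^2 \log n\big).
```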
We now show the two properties.
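Bounded gradient. We will first compute $\nabla f_i(W_\beta, \dots, W_0)$. Note that the gradient
can be visualized as follows. The first $\ell_\beta$ entries correspond to the entries in $\frac{\partial f_i}{\partial W_\beta}$. The
next $\ell_\beta * \ell_{\beta-1}$ entries correspond to $\frac{\partial f_i}{\partial W_{\beta-1}}$ and so on. These entries can be computed as
follows.
\[ \left(\frac{\partial f_i}{\partial W_\beta}\right)_{p_\beta} = \alpha_i\sum_{p_{\beta-1}=1}^{\ell_{\beta-1}}\cdots\sum_{p_1=1}^{\ell_1}\sum_{p=1}^{d}(W_{\beta-1})_{p_\beta,p_{\beta-1}}\cdots(W_0)_{p_1,p}\,(x_i)_p. \]
\begin{align*}
\left(\frac{\partial f_i}{\partial W_{\beta'}}\right)_{p_{\beta'+1},p_{\beta'}} = \alpha_i\Bigg(&\sum_{p_\beta=1}^{\ell_\beta}\cdots\sum_{p_{\beta'+2}=1}^{\ell_{\beta'+2}}\;\sum_{p_{\beta'-1}=1}^{\ell_{\beta'-1}}\cdots\sum_{p_1=1}^{\ell_1}\sum_{p=1}^{d} \\
&(W_\beta)_{p_\beta}(W_{\beta-1})_{p_\beta,p_{\beta-1}}\cdots(W_{\beta'+1})_{p_{\beta'+2},p_{\beta'+1}}\,(W_{\beta'-1})_{p_{\beta'},p_{\beta'-1}}\cdots(W_0)_{p_1,p}\,(x_i)_p\Bigg)
\end{align*}
$\forall\beta' \in \{1, 2, \dots, \beta - 1\}$.
\[ \left(\frac{\partial f_i}{\partial W_0}\right)_{p_1,p} = \alpha_i\sum_{p_\beta=1}^{\ell_\beta}\cdots\sum_{p_2=1}^{\ell_2}(W_\beta)_{p_\beta}\cdots(W_1)_{p_2,p_1}\,(x_i)_p. \]
Hence $h(x_i, x_j)$ can be written as follows.
\begin{align*}
h(x_i, x_j) ={}& \sum_{p_\beta=1}^{\ell_\beta}\left(\frac{\partial f_j}{\partial W_\beta}\right)_{p_\beta}\left(\frac{\partial f_i}{\partial W_\beta}\right)_{p_\beta}
 + \sum_{\beta'=1}^{\beta-1}\sum_{p_{\beta'+1}=1}^{\ell_{\beta'+1}}\sum_{p_{\beta'}=1}^{\ell_{\beta'}}\left(\frac{\partial f_j}{\partial W_{\beta'}}\right)_{p_{\beta'+1},p_{\beta'}}\left(\frac{\partial f_i}{\partial W_{\beta'}}\right)_{p_{\beta'+1},p_{\beta'}} \\
&+ \sum_{p_1=1}^{\ell_1}\sum_{p=1}^{d}\left(\frac{\partial f_j}{\partial W_0}\right)_{p_1,p}\left(\frac{\partial f_i}{\partial W_0}\right)_{p_1,p}.
\end{align*}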
To bound ||∇h||∞, consider any fixed xi = x̃i and xj = x̃j . Consider the entry
corresponding to (xi)p. The following claim can be obtained as a consequence of small-
weights Assumption.
Lemma 6.3.3. Let $W_0, W_1, W_2, \dots, W_\beta$ be weight matrices satisfying Assumption
6.3.1. Then for any given $p_{\beta'}$, we have that any product consisting of a sequence of
matrices $W_\beta \cdot W_{\beta-1} \cdots (W_{\beta'})_{p_{\beta'}}$ lies in the interval $[-1/\ell, 1/\ell]$.
Proof. For notational convenience, denote the column vector $(W_{\beta'})_{p_{\beta'}}$ as $W_0$. Define
$\nu := \beta - \beta'$. We also rewrite the chain $W_\beta \cdot W_{\beta-1} \cdots (W_{\beta'})_{p_{\beta'}}$ as $W_\nu \cdot W_{\nu-1} \cdots W_0$. We
will now prove that the value of this matrix-vector product lies in the interval $[-1/\ell, 1/\ell]$.
By induction on $\nu$, we will prove that the product $v_\nu := W_{\nu-1} \cdots W_0$ will always
yield a column vector where every entry lies in the interval $[-1/\ell, 1/\ell]$. Given the proof of
this inductive hypothesis, note that we have the following inner product $\langle W_\nu, v_\nu\rangle$ where
every entry in both these vectors lies in the interval $[-1/\ell, 1/\ell]$. Since the dot product is a
sum over at most $\ell$ terms with each term in the sum bounded in the interval $[-1/\ell^2, 1/\ell^2]$,
we have that the dot product lies in the interval $[-1/\ell, 1/\ell]$.
We will now prove the inductive statement. The base case is when $\nu = 1$. In this
case we have a single vector $W_0$. By assumption we have that each entry in this vector
lies in the interval $[-1/\ell, 1/\ell]$ and therefore the hypothesis is true. Consider the case
when $\nu > 1$. Consider the chain $v_\nu := W_{\nu-1} \cdot W_{\nu-2} \cdots W_0$. From the inductive
hypothesis, we have that $v_{\nu-1} := W_{\nu-2} \cdots W_0$ gives a column vector where every
entry is in the interval $[-1/\ell, 1/\ell]$. Consider the $i$-th entry in the column vector $v_\nu$. This
is obtained by the inner product of the $i$-th row of matrix $W_{\nu-1}$ and the column vector
$v_{\nu-1}$. Since this is a sum of at most $\ell$ terms with each term in the interval $[-1/\ell^2, 1/\ell^2]$,
we have that this $i$-th entry lies in the interval $[-1/\ell, 1/\ell]$. □
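The induction in Lemma 6.3.3 can be checked numerically. The sketch below assumes square layers of equal width and draws every weight uniformly from $[-1/\ell, 1/\ell]$; the helper name `chain_entries_bounded` is hypothetical.

```python
import numpy as np

def chain_entries_bounded(width, depth, seed=0):
    """Check that every partial product W_nu ... W_0 has entries in [-1/width, 1/width]."""
    rng = np.random.default_rng(seed)
    bound = 1.0 / width
    v = rng.uniform(-bound, bound, size=width)       # base case: a single column vector
    for _ in range(depth):
        W = rng.uniform(-bound, bound, size=(width, width))
        v = W @ v        # each entry is a sum of `width` terms of size at most 1/width**2
        if np.abs(v).max() > bound:
            return False
    return True

print(all(chain_entries_bounded(width=16, depth=8, seed=s) for s in range(100)))
```

The check follows the inductive step directly: the bound is re-verified after each additional matrix is applied.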
In the analysis for general neural networks, we use the following corollary.
Corollary 6.3.2. Let $W_0, W_1, W_2, \dots, W_\beta$ be weight matrices satisfying Assumption
6.3.1. Then for any given $1 \le \beta' \le \beta$, we have that any product consisting of a
sequence of matrices $W_{\beta'} \cdot W_{\beta'-1} \cdots W_0 \cdot x$ lies in the interval $[-1/\ell, 1/\ell]$, where $x$ is
a normalized Gaussian.
Proof. The proof of this follows directly by combining Equation (6.12) and the above
lemma. □
By the small weights assumption and Lemma 6.3.3, we have that each entry in
$\nabla f_i(W_\beta, \dots, W_0)$ is at most $3/\ell$ (i.e., $\alpha_i \le 2$ and the terms involving the sum over
weight matrices are at most $1/\ell$). By using the small-weights assumption repeatedly and
noting that $\alpha_i, \alpha_j \le 2$, we get that each entry in the gradient is at most
$18/\ell^2 + 18(\beta - 1)/\ell + 18/\ell^2 \le 54\beta/\ell$.
Hence we have $\rho$ as $O(\beta/\ell)$.
Non-negative expectation. As before, using the fact that $\alpha_i \ge -2$ and $\alpha_j \ge -2$ and that a
normalized Gaussian random variable has mean 0, we have that $\mathbb{E}[h(x_i, x_j)] \ge 0$.
Note that the bound in this case depends both on the maximum “width” and the
“depth”. In particular, as depth increases or width decreases, we need the dimension $d$
to be larger to get the required concentration. The other way to interpret this is when
either the width increases or the depth decreases, the probability that Equation (6.7) holds
increases. □
6.3.4 More general neural networks
This result can be extended to models with certain non-linearities. We consider the
model g(x) := σ(Wβσ(Wβ−1 . . . σ(W1σ(W0x)) . . .). Here, the function σ(.) is applied
point-wise to its arguments. We will assume that the non-linear activation is given by a
function σ(x) with the following properties.
• (P1) Boundedness: −1 ≤ σ(x) ≤ 1 for every value of x ∈ R.
• (P2) Twice Differentiability: σ is twice differentiable at every point in R.
• (P3) Bounded Differentials: −1 ≤ σ′(x) ≤ 1 and −1 ≤ σ′′(x) ≤ 1 for all x ∈ R.
Most classical activation functions such as sigmoid, tanh, and softmax satisfy these re-
quirements (although relu does not).
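Properties P1 and P3 can be probed numerically on a grid using finite differences. This is only a rough check under stated assumptions: the grid, step size `h`, and tolerance are arbitrary choices, `satisfies_p1_p3` is a hypothetical helper, and a finite grid cannot establish the properties for all of $\mathbb{R}$.

```python
import numpy as np

def satisfies_p1_p3(sigma, xs, h=1e-4, tol=1e-6):
    """Check |sigma|, |sigma'| and |sigma''| <= 1 on the grid xs via central differences."""
    vals = sigma(xs)
    d1 = (sigma(xs + h) - sigma(xs - h)) / (2 * h)            # approximates sigma'
    d2 = (sigma(xs + h) - 2 * vals + sigma(xs - h)) / h ** 2  # approximates sigma''
    worst = max(np.abs(vals).max(), np.abs(d1).max(), np.abs(d2).max())
    return bool(worst <= 1 + tol)

xs = np.linspace(-20, 20, 4001)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
relu = lambda x: np.maximum(x, 0.0)

for name, fn in [("tanh", np.tanh), ("sigmoid", sigmoid), ("relu", relu)]:
    print(name, satisfies_p1_p3(fn, xs))
```

On this grid tanh and sigmoid pass, while ReLU already fails the boundedness requirement P1 because its output is unbounded.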
Provided the small-weights assumption holds,³ we have the following theorem
which is analogous to the case of linear neural networks, with a mildly stronger requirement
on the constant $\eta$ and weaker dependence on depth $\beta$.
Theorem 6.3.4 (Concentration bounds for arbitrary depth neural networks). Let $\eta > 4$ be
a given constant. Consider a neural network with weights satisfying Assumption 6.3.1,
non-linearity satisfying properties P1-P3, and quadratic loss. With probability at least
$1 - \Omega\left(n\exp\left(-\frac{c\,\ell^2 d\,(\eta-4)^2}{\beta^4}\right)\right)$ we have that equation (6.7) holds.
The proof of this theorem follows in a similar manner as for linear neural networks.
We critically use the property that terms involving σ(...), σ′(..) and σ′′(...) lie in the range
[−1, 1]. In particular, we have the following expression for ∇fi(Wβ, . . . ,W0). Here
A1, A2, . . . , Aβ are some fixed expressions involving the weight matrices and the random
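vector xi.
\[ \left(\frac{\partial f_i}{\partial W_\beta}\right)_{p_\beta} = \alpha_i\,\sigma'(\ldots)\,\big(\sigma(W_{\beta-1}\cdot\sigma(\ldots\sigma(W_0\cdot x_i)))\big)_{p_\beta}. \]
\[ \left(\frac{\partial f_i}{\partial W_{\beta'}}\right)_{p_{\beta'+1},p_{\beta'}} = \alpha_i\,\sigma'(A_1)\sigma'(A_2)\cdots\sigma'(A_{\beta-\beta'}) \left(\sum_{p_\beta=1}^{\ell_\beta}\cdots\sum_{p_{\beta'+2}=1}^{\ell_{\beta'+2}} (W_\beta)_{p_\beta}(W_{\beta-1})_{p_\beta,p_{\beta-1}}\cdots(W_{\beta'+1})_{p_{\beta'+2},p_{\beta'+1}}\right) \sigma(W_{\beta'-1}\cdot\sigma(W_{\beta'-2}\cdots\sigma(W_0\cdot x_i))\cdots)_{p_{\beta'}} \quad \forall \beta' \in \{1,2,\ldots,\beta-1\}. \]
\[ \left(\frac{\partial f_i}{\partial W_0}\right)_{p_1,p} = \alpha_i\,\sigma'(A_1)\sigma'(A_2)\cdots\sigma'(A_\beta) \sum_{p_\beta=1}^{\ell_\beta}\cdots\sum_{p_2=1}^{\ell_2} (W_\beta)_{p_\beta}\cdots(W_1)_{p_2,p_1}\,(x_i)_p. \]
³ The activation function guarantees that the output at each layer is at most 1. However we still need the
assumption that each entry in the weight matrix is not too large. Otherwise, the gradient can blow up on the
backward pass, even if the forward pass is stable.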
Consider the expression for h(xi,xj) as follows.
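\[ h(x_i,x_j) = \sum_{p_\beta=1}^{\ell_\beta} \left(\frac{\partial f_j}{\partial W_\beta}\right)_{p_\beta} \left(\frac{\partial f_i}{\partial W_\beta}\right)_{p_\beta} + \sum_{\beta'=1}^{\beta-1} \sum_{p_{\beta'+1}=1}^{\ell_{\beta'+1}} \sum_{p_{\beta'}=1}^{\ell_{\beta'}} \left(\frac{\partial f_j}{\partial W_{\beta'}}\right)_{p_{\beta'+1},p_{\beta'}} \left(\frac{\partial f_i}{\partial W_{\beta'}}\right)_{p_{\beta'+1},p_{\beta'}} + \sum_{p_1=1}^{\ell_1} \sum_{p=1}^{d} \left(\frac{\partial f_j}{\partial W_0}\right)_{p_1,p} \left(\frac{\partial f_i}{\partial W_0}\right)_{p_1,p}. \]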
To bound ||∇h||∞, consider any fixed xi = x̃i and xj = x̃j. Consider the entry
corresponding to (xi)p. Since −1 ≤ σ′(..) ≤ 1 we can use the same results from linear
neural network to upper bound the value of partial differentials involving the function fj.
To compute the partial differential with respect to (xi)p we make the following obser-
vation. Differential with respect to (xi)p in (∂fi/∂Wβ′)pβ′+1,pβ′ involves two terms both of
which can be upper bounded by 2/ℓ using Corollary 6.3.2 and Lemma 6.3.3. The differ-
ential in (∂fi/∂Wβ′)pβ′+1,pβ′ involves β′ terms each of them upper-bounded by 2/ℓ by using
the fact that −1 ≤ σ′′ ≤ 1, Corollary 6.3.2 and Lemma 6.3.3. Finally the differential in
(∂fi/∂W0)p1,p can be upper-bounded by 2β/ℓ.
Therefore, we get that ρ ≤ O(β²/ℓ).
Computing a lower-bound on the expectation is tricky because of the function σ.
Although we cannot show the non-negative expectation property in this case, we can
show a slightly weaker lower-bound that still suffices for our purposes. We show that
E[h(xi,xj)] ≥ −4. The proof of this follows from the assumption that αi ≥ −2 and
αj ≥ −2 and the fact that the remaining terms in h(xi,xj) involve products of terms
σ, σ′ and the product of matrices. From the boundedness assumptions on σ and its first
derivative and from Lemma 6.3.3 we have that these terms are all at least −1. Hence we
have that E[h(xi,xj)] ≥ −4.
To complete the proof, we now observe the following.
\[ \Pr[h(x_i,x_j) \le -\eta] \le \Pr\big[h(x_i,x_j) \le E[h(x_i,x_j)] - \eta + 4\big] \le 2\exp\left(-\frac{c\,d\,(\eta-4)^2}{\rho^2}\right). \]
Making the substitution of η′ := η − 4 we obtain the theorem. Note that to have
η′ > 0 we want η > 4.
6.3.5 Beyond linearly generated data
In this sub-section we will briefly show when the theorem (and the proofs) in the
previous sub-sections extend to the case beyond linearly generated data. Note that the
only fact we used about the function C(x) was that it always lies in the interval [−1, 1].
This implies that −2 ≤ αi ≤ 2 and −2 ≤ αj ≤ 2 holds. When considering ||∇h||∞ we
used the first derivative of the concept function C. In linearly generated data, we get that
this first-derivative exists and it lies in the interval [−1, 1]. Hence the bounds on ||∇h||∞
and E[h(xi,xj)] follow directly from these observations.
Therefore for any concept function C with the following properties, the above con-
centration theorems extend as is.
1. For every x ∈ [−1, 1]^d we have −1 ≤ C(x) ≤ 1.
2. The function C is differentiable everywhere in the interval [−1, 1]^d.
3. For every x ∈ [−1, 1]^d and every i ∈ [d] we have that −1 ≤ ∂C(x)/∂(x)i ≤ 1.
6.4 Experiments
We present experimental results to see the effect of depth and width on convergence
rates and gradient confusion. It is worth noting that Theorem 6.3.4 implies that SGD
becomes more effective when width increases or depth decreases. Also, Theorem 6.2.1
indicates that gradient confusion affects the height of the final “noise floor” of constant
step size SGD, and so we expect the effect of gradient confusion to be most prominent
near the end of training, particularly when the convergence curve has flattened out near
this floor.
We perform experiments on wide residual networks (WRN) [ZK16] for an image
classification task on CIFAR-10. WRN is an extension of ResNet [HZRS16a], which
is one of the state-of-the-art architectures for image classification. WRN is a stack of
residual blocks, and we denote the architecture as WRN-β-ℓ following [ZK16], where β
represents the depth and ℓ represents the width factor of the network⁴.
The WRN architecture for CIFAR datasets is a stack of three groups of residual
blocks. There is a downsampling layer between two blocks, and the number of channels
(width of a convolutional layer) is doubled after downsampling. In the three groups, the
width of convolutional layers is {16ℓ, 32ℓ, 64ℓ}, respectively. Each group contains βr
residual blocks, and each residual block contains two 3×3 convolutional layers equipped
with ReLU activation, batch normalization and dropout. There is a 3 × 3 convolutional
layer with 16 channels before the three groups of residual blocks. A global average
pooling layer, a fully-connected layer and a softmax layer follow the three groups. The
depth of WRN is β = 6βr + 4.
⁴ The width factor is the number of filters relative to the original ResNet, e.g., a factor of 1 corresponds
to the original ResNet, and 2 means the network is twice as wide.
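The relation between the WRN name and its layout can be sketched in a few lines. The function below inverts β = 6βr + 4 to get the number of residual blocks per group and lists the three group widths; the function name and the error handling are assumptions made for this illustration, not part of the original experimental code.

```python
def wrn_config(depth, width_factor):
    """Blocks per group and per-group channel widths for WRN-<depth>-<width_factor>.

    Uses depth = 6 * blocks_per_group + 4 and widths {16, 32, 64} times the width factor.
    """
    if depth < 10 or (depth - 4) % 6 != 0:
        raise ValueError("depth must have the form 6 * blocks_per_group + 4")
    blocks_per_group = (depth - 4) // 6
    group_widths = [16 * width_factor, 32 * width_factor, 64 * width_factor]
    return blocks_per_group, group_widths

# Networks named in the experiments.
wrn_28_10 = wrn_config(28, 10)
wrn_40_2 = wrn_config(40, 2)
```

Under these formulas WRN-28-10 has 4 residual blocks per group with widths 160, 320 and 640, and WRN-40-2 has 6 blocks per group with widths 32, 64 and 128.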
We turn off dropout for all our experiments. Our first round of experimental net-
works have no skip connections or batch normalization [IS15b] so as to stay as close to the
assumptions of our theorems as possible. Later on, we study the effects that skip connec-
tions and batch normalization have on convergence rate and gradient confusion. We use
SGD as the optimizer with no momentum. We train all experiments for 200 epochs and
use a standard learning rate decay schedule, where the initial learning rate is reduced by
a factor of 10 at epochs 80 and 160. We use a mini-batch of 128 for all our experiments.
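The learning rate schedule described above can be written as a small step-decay function. This is a minimal sketch: the function name, the zero-based epoch indexing, and the example initial learning rate of 0.1 are assumptions, since the text does not state them.

```python
def step_decay_lr(epoch, initial_lr, milestones=(80, 160), factor=0.1):
    """Learning rate at a given epoch when it is multiplied by `factor` at each milestone.

    Epochs are assumed to be zero-based; the drop takes effect from the milestone epoch on.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return initial_lr * (factor ** drops)

# Hypothetical initial learning rate, used only for illustration.
schedule = [step_decay_lr(e, 0.1) for e in range(200)]
```

With these assumptions the rate is 0.1 for epochs 0 to 79, 0.01 for epochs 80 to 159, and 0.001 for the remaining epochs.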
To measure gradient confusion, at the end of every training epoch, we sample 100
mini-batches each of size 128. We calculate gradients on each of these mini-batches, and
then calculate pairwise cosine similarities. To measure the worst-case gradient confusion,
we calculate the lowest gradient cosine similarity among all pairs.
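The measurement can be sketched as follows, assuming each mini-batch gradient has already been flattened into a list of floats. The function names are hypothetical, the three toy gradients are invented for the example, and the sketch assumes no gradient is exactly zero (which would make the cosine undefined).

```python
import math
from itertools import combinations

def cosine_similarity(u, v):
    """Cosine of the angle between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def min_pairwise_cosine(gradients):
    """Lowest cosine similarity over all unordered pairs of gradients."""
    return min(cosine_similarity(g, h) for g, h in combinations(gradients, 2))

# Toy example with three 2-dimensional "gradients" (values invented for illustration).
toy_gradients = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
worst_case = min_pairwise_cosine(toy_gradients)
```

In the toy example the first and third vectors point in opposite directions, so the minimum is −1. With 100 sampled mini-batches there are 4950 pairs to compare, which is cheap relative to computing the gradients themselves.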
Effect of width. To test our theoretical results, and in particular Theorem 6.3.4, we
consider a WRN with no batch normalization and no skip connections. This makes the
network behave like a typical deep convolutional neural network. We now test the effect
of increasing width in this network, while keeping the depth fixed. In particular, we
consider the following networks: WRN-28-1, WRN-28-2, WRN-28-10. Figure 6.2 shows
how the training loss and the minimum gradient cosine similarity are affected by a change
in width. We present results with both a fixed initial learning rate across all networks,
as well as where we tune the optimal initial learning rate to optimize the performance of
each network. Quite clearly, width helps in faster convergence, as well as lower gradient
confusion.
[Figure 6.2 panels: (a) Loss, fixed LR; (b) Confusion, fixed LR; (c) Loss, best LR; (d) Confusion, best LR]
Figure 6.2: How width affects convergence curves and gradient inner products.
Effect of depth. Using the same experimental setup as above, we now keep the width
fixed, and change the depth over the networks WRN-28-2 and WRN-40-2. Figure 6.3
shows the results. We again see that our theoretical results seem to be backed by the
experiments, where we find faster convergence and lower gradient confusion with smaller
depth.
[Figure 6.3 panels: (a) Loss, fixed LR; (b) Confusion, fixed LR; (c) Loss, best LR; (d) Confusion, best LR]
Figure 6.3: How depth affects convergence curves and gradient inner products.
Effect of batch normalization and skip connections. Finally, we test the effect that
techniques such as batch normalization and skip connections have on convergence
speed. Figure 6.4 shows results for WRN-40-2, where we start with a network with no
batch normalization and no skip connections, and then progressively add them to the net-
work. For these runs, we present results only with the best tuned initial learning rate since
the optimal learning rates are usually very different on batch normalized vs. non-batch
normalized networks. We see that adding batch normalization makes a big difference in
the convergence speed as well as in lowering gradient confusion. Adding skip connec-
tions on top of this further accelerates training, although it seems to have minimal effect
on gradient confusion (when used on top of batch normalization).
[Figure 6.4 panels: (a) Loss, best LR; (b) Confusion, best LR]
Figure 6.4: Effect of batch normalization and skip connections on a Wide ResNet
6.5 Conclusion
We study the effect of high dimensionality and over-parameterization on the con-
vergence of SGD, and show that low gradient confusion in high dimensional problems
can lead to accelerated convergence. This addresses the issue of why SGD is an effective
optimizer for over-parameterized problems. An interesting question for future work is
whether there is a connection between gradient confusion and generalization for SGD.
Part II
STUDYING THE EVOLUTION OF CULTURAL NORMS
Chapter 7: Using game theory to study the evolution of cultural norms
Understanding human behavior and modeling how cultural norms evolve in dif-
ferent human societies is vital for designing policies and avoiding conflicts around the
world. This part of the thesis describes ways to use computational game-theoretic tech-
niques, and in particular evolutionary game theoretic (EGT) models, to gain insight into
why different human societies have different norms and behaviors.
Conventional (non-evolutionary) game theory is good for analyzing situations where
we know an individual’s preferences, and want to predict what they will do based on those
preferences. However, in our work, we want to know how these preferences arose. We are
interested in the following kinds of questions:
• What kinds of structural and external factors might have led to the emergence of
behaviors we see among individuals in a society?
• What evolutionary pressures might have led to variations in those behaviors?
• Can they be validated by observed phenomena?
Conventional game theory can’t properly answer these questions. To lay out an
individual’s preferences in a conventional game-theoretic model would, in essence, be
building into the model the very traits whose emergence we want to study. We instead
need to lay out the structural/environmental factors that might be responsible for the evo-
lution of those traits, to see whether those traits would evolve, and evolutionary game
theory provides an efficient framework to do just that.
7.1 Evolutionary game theory in biology
Evolutionary game theory (EGT) was first developed as an application of game
theory to evolving populations composed of multiple animal species, as a way to model
how each species’ evolutionary fitness causes its proportion of the population to grow or
shrink [SP73]. The idea is to represent an interaction among animals as a normal-form
game. The game’s payoffs are intended to represent the effect that the interaction will
have on the individuals’ evolutionary fitness. For example, if two animals fight over a
piece of food, one might expect that each individual’s fitness would be affected by how
the fight affects the animal’s health, and whether the animal gets the piece of food.
Rather than developing a detailed model of each specific individual, EGT models
typically are at a much more abstract level that does not distinguish among the individuals
within each species, but instead looks at the average behavior of all individuals of that
species. More specifically:
• If the population is composed of n different species, then for each species i (i =
1, . . . , n), all individuals of species i have the same strategy si, namely the strategy
of being a member of species i. This strategy is intended to encompass (in an
abstract way, of course) everything that might affect this species’ average evolu-
tionary fitness: size, aggressiveness, sensory abilities, intelligence, etc.
• Each species i constitutes some proportion xi of the entire population, with
$\sum_{i=1}^{n} x_i = 1$. If we choose an individual at random, then for i = 1, . . . , n, the probability that
this individual uses strategy si is xi.
Now, consider an interaction (e.g., a conflict over a source of food) between two indi-
viduals: one from species i and one from species j. For simplicity of presentation, let’s
restrict this to just two individuals, but it can easily be generalized to interactions among
k individuals for arbitrary k.
To formulate this interaction as a normal-form game, let’s say that the individuals’
expected payoffs are u(si, sj) and u(sj, si), where “payoff” means the effect that the
interaction will have on the individual’s evolutionary fitness.
The normal-form game is symmetric, i.e., if the two individuals are named a and
b, then it doesn’t matter whether the one with strategy si is individual a or individual
b. In either case, this individual’s expected payoff is u(si, sj), and the other individual’s
expected payoff is u(sj, si).
Suppose an individual with strategy si meets another individual chosen at random.
Then for j = 1, . . . , n, the other individual’s strategy is sj with probability xj. Hence the
expected payoff for the individual with strategy si is $\sum_{j=1}^{n} x_j u(s_i, s_j)$. Earlier, we said the
expected payoff is intended to represent the interaction’s effects on evolutionary fitness.
The idea is that if species i’s expected payoff is higher than that of the entire population,
then species i will reproduce at a higher rate, hence its proportion xi will increase. If
species j’s expected payoff is lower than that of the entire population, then species j will
reproduce at a lower rate, hence xj will decrease.
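The comparison between a species' expected payoff and the population average can be sketched as below. The function names and the two-species payoff values are invented for illustration; `u[i][j]` stands for u(si, sj).

```python
def expected_payoffs(x, u):
    """Expected payoff of each strategy against a randomly chosen opponent.

    x[j] is the proportion of species j and u[i][j] stands for u(s_i, s_j).
    """
    n = len(x)
    return [sum(x[j] * u[i][j] for j in range(n)) for i in range(n)]

def population_average(x, payoffs):
    """Average payoff over the whole population."""
    return sum(xi * pi for xi, pi in zip(x, payoffs))

# Hypothetical two-species example.
x = [0.25, 0.75]
u = [[1.0, 3.0],
     [2.0, 2.0]]
payoffs = expected_payoffs(x, u)
average = population_average(x, payoffs)
growing = [p > average for p in payoffs]  # True where the proportion is expected to rise
```

In this example species 1 earns 2.5 against a population average of 2.125, so its proportion is expected to grow, while species 2 earns 2.0 and is expected to shrink.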
The best-known way to model this is the replicator dynamic [TJ78]. The origi-
nal version is a differential equation that assumes an infinite population and continuous
time. Let πi(x) ≥ 0 be the average payoff obtained by individuals of species i when
the proportions of each species are x = (x1, . . . , xn). Then the average payoff for the
entire population is $\theta(x) = \sum_{i=1}^{n} x_i \pi_i(x)$. According to the replicator dynamic, the rate
of change in each xi is given by the following differential equation:
dxi/dt = xi(πi(x)− θ(x)). (7.1)
The replicator dynamic is consistent with the Lotka-Volterra equations for the dynamics
of biological systems. Indeed, the replicator dynamic is mathematically equivalent to a
generalization of those equations [PN02].
The replicator dynamic can be translated into a difference equation in which the
population is finite, and time proceeds as a sequence of discrete iterations [HS84]. This
formulation can be used to run discrete-event computer simulations and look at their
outcomes, which is useful if the differential equations are too complicated to solve math-
ematically.
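To give a feel for how Equation (7.1) behaves numerically, here is a minimal sketch that integrates it with a forward-Euler step. This is a generic discretization chosen for illustration, not necessarily the finite-population difference equation of [HS84]; the function names, the step size, and the example payoff matrix are assumptions, and πi(x) is taken to be the expected payoff against a random opponent.

```python
def replicator_step(x, payoff, dt):
    """One forward-Euler step of dx_i/dt = x_i * (pi_i(x) - theta(x))."""
    n = len(x)
    pi = [sum(x[j] * payoff[i][j] for j in range(n)) for i in range(n)]
    theta = sum(x[i] * pi[i] for i in range(n))  # population-average payoff
    return [x[i] + dt * x[i] * (pi[i] - theta) for i in range(n)]

def simulate(x0, payoff, dt=0.01, steps=1000):
    """Iterate the Euler step from the initial proportions x0."""
    x = list(x0)
    for _ in range(steps):
        x = replicator_step(x, payoff, dt)
    return x

# Hypothetical payoffs in which strategy 0 always earns more than strategy 1.
final = simulate([0.1, 0.9], [[2.0, 2.0], [1.0, 1.0]])
```

Because the increments sum to zero whenever the proportions sum to one, the update keeps the proportions normalized up to floating-point error. In the example, the strategy with the uniformly higher payoff takes over almost the whole population.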
The above approach assumes that the species are well-mixed, i.e., that they are uni-
formly distributed geographically. Such an assumption is often inaccurate; there are many
settings in which an individual’s location can make a huge difference in what interactions
they have, and how those interactions affect their evolutionary fitness. To model such sit-
uations, it often is useful to locate the individuals in a network in which they are restricted
to interact with their neighbors. This is further discussed later in this chapter.
7.2 Modeling cultural evolution
EGT can be used to model aspects of the evolution of human cultures. Here, strate-
gies correspond not to collections of individuals, but instead to possible behaviors. A
successful strategy—i.e., a behavior that produces good results—is likely to be adopted
by others, hence become more prevalent in the population. Conversely, the prevalence
of an unsuccessful strategy is likely to decrease. The propagation of these strategies cor-
responds not to biological reproduction, but instead to cultural transmission, in which
humans imitate others and learn from others. Rather than the replicator dynamic, here the
evolutionary model is a comparison process, e.g., a modified version of the Fermi rule
from statistical mechanics [Blu93]. At each iteration t, each individual a uses some strat-
egy in a game-theoretic interaction and receives a payoff ua. Then, before the beginning
of iteration t+1, a compares ua to the payoff un received by a randomly chosen neighbor
n, and decides whether to keep using the same strategy that it used before, or switch to the
neighbor’s strategy. The probability of switching is given by a version of the well-known
sigmoid function (see Figure 7.1):
Pr[a switches to n’s strategy] = 1/(1 + e^{s(ua−un)}),
where ua and un are a’s and n’s payoffs in the current iteration, and s ≥ 0 is an arbitrary
constant called the selection strength. The Fermi rule can easily be adapted to situations
in which the population isn’t well-mixed. For example, one can locate the individuals at
the nodes of a network, restrict each individual a to interact only with its neighbors, and
restrict a to compare its payoff with the payoffs of its neighbors.
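The switching probability can be computed directly from the formula above. The sketch below is a plain transcription with a hypothetical function name; it does not guard against overflow in `math.exp` for very large values of s(ua − un).

```python
import math

def fermi_switch_probability(u_a, u_n, s):
    """Probability that agent a adopts neighbor n's strategy under the Fermi rule.

    u_a and u_n are the two payoffs in the current iteration; s >= 0 is the selection strength.
    """
    return 1.0 / (1.0 + math.exp(s * (u_a - u_n)))

# Selection strength s = 5, as in Figure 7.1.
p_equal = fermi_switch_probability(1.0, 1.0, 5.0)            # equal payoffs
p_neighbor_better = fermi_switch_probability(0.0, 1.0, 5.0)  # neighbor earned more
p_neighbor_worse = fermi_switch_probability(1.0, 0.0, 5.0)   # neighbor earned less
```

Equal payoffs give a probability of exactly one half, a better-performing neighbor is imitated with probability close to one, and a worse-performing neighbor is imitated rarely. Setting s = 0 makes the choice a coin flip regardless of payoffs.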
Figure 7.1: Graph of 1/(1 + e^{s(ua−un)}), for s = 5 and −1 ≤ ua − un ≤ 1.
Usually the Fermi rule is further modified by introducing an exploration dynamic
that is somewhat analogous to biological mutation. In biological evolution, mutation
occurs so rarely that game-theoretic biological models often omit it. In cultural evolution,
an analogous phenomenon happens more frequently: individuals try out new behaviors
[THDS+09]. The exploration dynamic models this as follows: when each agent a chooses
what strategy to use at the next iteration, there is a small probability µ that a will choose
a strategy s at random from the set of all possible strategies, regardless of whether s was
a successful strategy for the agents who used it in the current iteration, or whether any
agent even used it at all.
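One way to combine the exploration dynamic with the Fermi rule is sketched below: with probability µ the agent picks a strategy uniformly at random, and otherwise it compares payoffs with a neighbor. The function name, the argument names, and the decision to draw the exploration event before the comparison are assumptions made for this illustration. Note that the text uses s both for the selection strength and for the randomly chosen strategy; the code uses `selection_strength` for the former.

```python
import math
import random

def next_strategy(own, own_payoff, neighbor, neighbor_payoff,
                  strategies, selection_strength, mu, rng):
    """Strategy used at the next iteration under exploration plus the Fermi rule."""
    if rng.random() < mu:
        # Exploration: pick any strategy, regardless of how well it performed.
        return rng.choice(strategies)
    p_switch = 1.0 / (1.0 + math.exp(selection_strength * (own_payoff - neighbor_payoff)))
    return neighbor if rng.random() < p_switch else own

rng = random.Random(0)  # fixed seed so the example is reproducible
strategies = ["A", "B", "C"]
# No exploration, strong selection, neighbor clearly better: the agent imitates.
imitated = next_strategy("A", 0.0, "B", 1.0, strategies, 50.0, 0.0, rng)
# Pure exploration: the result is some member of the strategy set.
explored = [next_strategy("A", 0.0, "B", 1.0, strategies, 50.0, 1.0, rng) for _ in range(100)]
```

Passing the random number generator in explicitly keeps simulations reproducible and makes the update easy to test in isolation.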
One of the limitations of EGT models is that they deliberately omit large amounts
of detail. In EGT models of biological evolution, they ignore most of the factors that
might influence whether a particular individual will reproduce successfully, and instead
consider all individuals of a species to be equivalent. Similarly, EGT models of cultural
evolution ignore most of the complexities of human interactions. For example, rather than
reasoning about the physical outcomes of an interaction among several individuals, these
outcomes are represented by payoff values. Because the models are highly simplified,
they don’t give exact numeric predictions of what would happen in real life.
On the other hand, a good EGT model can provide explanations of the underlying
dynamics of an evolving system, and establish support for causal relationships. Conse-
quently, such models can provide a useful complement to empirical studies, in which
there may be questions whether or not a correlation among various factors indicates a
causal relationship [Ald95].
7.3 Contributions
We list the main contributions in this part of the thesis below.
In Chapter 8, we study how norms change in a society. To do this, we build an
evolutionary game-theoretic model based on the idea that different strengths of norms in
societies translate to different game-theoretic interaction structures and incentives. This
model is used to study the evolutionary relationships of the need for coordination in a
society (which is related to its norm strength) with two key aspects of norm change: cul-
tural inertia (whether or how quickly the population responds when faced with conditions
that make a norm change desirable), and exploration rate (the willingness of agents to try
out new strategies). Our results show that a high need for coordination leads to both high
cultural inertia and a low exploration rate, while a low need for coordination leads to low
cultural inertia and high exploration rate.
In Chapter 9, we extend this to study the rate at which a norm changes in different
cultures. We analyze the evolutionary relationships between the tendency to conform and
how quickly a population reacts when conditions make a change in norm desirable. Our
analysis identifies conditions when a tipping point is reached in a population, causing
norms to change rapidly. We find that tighter cultures are more likely to be initially
resistant to norm changes, but once they reach a tipping point, they change faster than
looser cultures.
In Chapter 10, we study conditions that affect the existence of group-biased behav-
ior among humans (i.e., favoring others from the same group, and being hostile towards
others from different groups). Using an evolutionary game-theoretic model, we show that
out-group hostility is dramatically reduced by mobility. Technological and societal ad-
vances over the past centuries have greatly increased the degree to which humans change
physical locations, and our results show that in highly mobile societies, one’s choice of
action is more likely to depend on what individual one is interacting with, rather than the
group to which the individual belongs.
Chapter 8: Understanding norm change in human societies
8.1 Introduction
Human societies around the world are unique in their ability to develop, maintain,
and enforce social norms. Social norms enable individuals in a society to coordinate
actions, and are critical in accomplishing different tasks. Neuroscience, field, and ex-
perimental research have all established that there are marked differences in the strength
of social norms around the globe [BVL13, EH14, GRN+11, HG14, HEM+10, HMB+06,
HTG08, RGNL15]. Some cultures (e.g., some Middle Eastern countries, India, South Korea,
etc.) are tight, in the sense that they tend to have strong social norms, with a high
degree of norm-adherence and higher punishment directed towards norm-violators. Other
cultures (e.g., Netherlands, New Zealand, Australia, etc.) are loose, i.e., individuals tend
to develop weaker norms with more tolerance for deviance [GRN+11, HG14, RGNL15].
This indicates that the nature of human interaction and influence is vastly different across
different cultures around the world.
To date, there has been little research on the evolutionary processes of norm main-
tenance and the processes that lead to norm change, and how these processes are substan-
tially different in societies around the world. However, recent world events (e.g., recent
social uprisings and turmoil) show that it is critically important to develop such an
understanding. In this section, we draw ideas from recent social science research to build
culture-sensitive models that provide insights into the substantial societal differences that
exist in how individuals interact and influence each other.
Here, we use EGT to examine the relationships of the amount of need for coor-
dination (which psychological and sociological studies show is related to norm strength
[RGNL15]), with two key aspects of norm change in societies:
1. the amount of cultural inertia, i.e., the amount of resistance to changing a cultural
norm, and
2. the exploration rate, i.e., the extent to which agents are willing to try out new
behaviors.
More specifically, our primary contributions in this work are as follows:
• We provide a novel way to
1. model a society’s strength of norms by using an agent’s need for coordination
in the society, and
2. model the desirable/undesirable norms in a society,
by characterizing how they affect the payoffs in a game-theoretic payoff matrix,
leading to different interaction structures and incentives in a society.
• We investigate cultural evolution of norm change in this model using two well-
known models of change in evolutionary game theory (the replicator dynamic [TJ78]
and the Fermi rule [Blu93]). Using mathematical analyses and extensive agent-
based simulations, we establish that: the higher the need for coordination is, the
higher the cultural inertia will be, and vice versa. When a population faces con-
ditions that make a norm change desirable, a high need for coordination will make
them slower to change to the new norm compared to a society with a lower need
for coordination. Further, if the need for coordination is high enough, the existing
norm will not change at all.
• In order to understand how norms change in different cultures, we also examine
whether the need for coordination in a society has a causal evolutionary relationship
to an agent’s tendency to learn socially (i.e., adopt a behavior that is being used by
other agents in the population) versus innovate/explore new random behaviors. In
order to be able to do so, we propose a novel way to model this, where we let the
exploration rate, i.e., the probability that an agent tries out a new action at random,
evolve over time as part of the agent’s strategy, rather than stay fixed as in previous
work [THDS+09].
• The cultural differences in the distribution of agent strategies favoring social learn-
ing versus innovation or exploration can have a critical impact on how attitudes,
beliefs and behaviors spread throughout the population, and thus, is vital to under-
standing norm change. At a societal level, such differences can affect the rate at
which new technologies, languages, moral traditions, and political institutions are
adopted, while at local levels, they can affect the processes of influence at the in-
dividual level. Using the above model of evolving exploration rates, we verify this
by establishing, via extensive agent-based simulations, that: the higher the need for
coordination is, the lower the exploration rate will be, and vice versa.
            A         B                        A         B
Mc =  A   ac, ac    0, 0         Mf =  A   af, af    af, bf
      B   0, 0      bc, bc             B   bf, af    bf, bf
Figure 8.1: Individual payoff matrices. Mc denotes the coordination game and Mf de-
notes the fixed-payoff game used in our model.
These results provide insight into the reasons why tight societies are less open to
change, and why cultural inertia and high levels of social learning develop in such soci-
eties. To our knowledge, this is the first work to provide a culturally-sensitive model of
norm change and to show how the processes of norm propagation differ across societies.
The rest of the chapter is organized as follows. Section 8.2 provides our model of
the need for coordination, and mathematical analyses and agent-based simulations show-
ing how it affects cultural inertia. Section 8.3 describes our model of evolving exploration
rates, and shows how the degree of need for coordination affects the evolution of explo-
ration rates. In Section 8.4 we discuss the significance of our results.
8.2 Proposed model
Past field and experimental research have shown that tight societies have stronger
norms, where individuals adhere to norms much more than loose societies, and face higher
punishment when deviating. On the other hand, individuals in loose societies typically
have more tolerance for deviant behavior [GRN+11, HG14, RGNL15]. Past EGT studies
have shown that a society’s exposure to societal threat is a key mediating factor in its
strength of norms [RGNL15], where threats can be either ecological like natural disas-
ters and scarcity of resources, or manmade such as threats of invasions and conflict. In
high-threat situations, societies tend to develop strong norms for coordinating social in-
teraction (i.e., to become tighter), since coordination is vital for the society’s survival. In
low-threat situations, there is less need for coordination, which affords weaker norms and
looser societies.
           A                                      B
M =  A   cac + (1− c)af , cac + (1− c)af      (1− c)af , (1− c)bf
     B   (1− c)bf , (1− c)af                  cbc + (1− c)bf , cbc + (1− c)bf
Figure 8.2: Weighted payoff matrix M defined as M = cMc + (1− c)Mf .
Using this intuition, we hypothesize that the interactions between individuals in dif-
ferent societies are governed by different payoff structures and incentives. Tight societies
tend to have a high need for coordination, and we can model the extreme case as a coordi-
nation game Mc, where one only gets a payoff if playing the same action as the agent one
is interacting with. In loose societies, on the other hand, individuals’ payoffs are less af-
fected by others’ actions, and we can model the extreme case as a fixed-payoff game Mf ,
in which an agent’s payoff depends only on the action played by that agent, and not on the
actions of the other agent. For cases in between the two extremes, we use a game in which
the payoff matrix is a weighted combination of a coordination game and a fixed-payoff
game, with the weighting factor 0 ≤ c ≤ 1 denoting the need for coordination.
            A                      B
M ′ =  A   a, a                  (1− c)a, (1− c)b
       B   (1− c)b, (1− c)a      b, b
Figure 8.3: Updated payoff matrix after assuming ac − bc = af − bf and adding a suitable
constant to the payoffs in M in Figure 8.2.
As is done in many EGT studies, we consider games in which individuals have two
possible actions to choose from. In our case, the two actions A and B correspond to
possible norms that the society could settle on. As shown in Figure 8.1, the coordination
game has a payoff matrix Mc in which ac and bc are the payoff parameters; and the fixed-
payoff game has a payoff matrix Mf in which af and bf are the payoff parameters. The
weighted combination of the two games, shown in Figure 8.2, is M = cMc + (1− c)Mf ,
where 0 ≤ c ≤ 1 is the need for coordination.
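The weighted combination can be sketched as a small function that returns the row player's payoff for each pair of actions (the column player's payoffs follow by symmetry). The function name, the dictionary representation, and the numeric payoff values are assumptions made for this example.

```python
def weighted_payoffs(a_c, b_c, a_f, b_f, c):
    """Row player's payoffs in M = c * Mc + (1 - c) * Mf.

    Keys are (own action, other agent's action); c in [0, 1] is the need for coordination.
    """
    m_c = {("A", "A"): a_c, ("A", "B"): 0.0, ("B", "A"): 0.0, ("B", "B"): b_c}
    m_f = {("A", "A"): a_f, ("A", "B"): a_f, ("B", "A"): b_f, ("B", "B"): b_f}
    return {key: c * m_c[key] + (1.0 - c) * m_f[key] for key in m_c}

# Hypothetical payoff parameters.
pure_coordination = weighted_payoffs(2.0, 4.0, 1.0, 3.0, 1.0)
pure_fixed_payoff = weighted_payoffs(2.0, 4.0, 1.0, 3.0, 0.0)
halfway = weighted_payoffs(2.0, 4.0, 1.0, 3.0, 0.5)
```

At c = 1 the result is the coordination game Mc, at c = 0 it is the fixed-payoff game Mf, and intermediate values interpolate entry by entry as in Figure 8.2.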
We first present a lemma that shows that under a mild assumption, the payoff matrix
M can be much simplified on adding a constant to all payoffs in the matrix.
Lemma 8.2.1. Consider the game matrix M defined in Figure 8.2, and assume that ac −
bc = af − bf . Then, under a suitable addition of a constant to the payoffs, and using
ac = a and bc = b, the game matrix M reduces to the matrix M ′ shown in Figure 8.3.
Proof. On adding the constant value of (1− c) ∗ (ac − af ) = (1− c) ∗ (bc − bf ) (where
equality holds under the assumption) to all payoffs in M , the payoff matrix M reduces to
M ′, shown in Figure 8.3, where we denote ac = a and bc = b. 
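For readers who want to see the step in the proof spelled out, the entry-by-entry check can be written as follows. Here k denotes the added constant (a symbol introduced only for this illustration), and only the row player's payoffs are shown, since the column player's follow by symmetry.

```latex
\begin{align*}
k &= (1-c)(a_c - a_f) = (1-c)(b_c - b_f), \\
M_{AA} + k &= c\,a_c + (1-c)\,a_f + (1-c)(a_c - a_f) = a_c = a, \\
M_{AB} + k &= (1-c)\,a_f + (1-c)(a_c - a_f) = (1-c)\,a_c = (1-c)\,a, \\
M_{BA} + k &= (1-c)\,b_f + (1-c)(b_c - b_f) = (1-c)\,b_c = (1-c)\,b, \\
M_{BB} + k &= c\,b_c + (1-c)\,b_f + (1-c)(b_c - b_f) = b_c = b.
\end{align*}
```

The first two lines use the form of k in terms of a, and the last two use the form in terms of b, which is why the assumption ac − bc = af − bf is needed for a single constant to work for all four entries.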
The assumption ac − bc = af − bf is very reasonable, since this just ensures that
switching from one norm to the other always results in the same change in payoffs, re-
gardless of the weight c on the coordination game. Otherwise, there would be an added
causal factor for the dynamics of norm change. Also note that, from Lemma 8.2.1, under
additions with a constant, this assumption reduces to just setting ac = af and bc = bf . For
the rest of the section, we will work with payoff matrix M ′ where we set ac = af = a and
bc = bf = b. In subsequent sections, we will show why simplifying the payoff matrix by
adding a constant value to all payoffs (as shown in Lemma 8.2.1) is a perfectly reasonable
step to take.
From payoff matrix M′, we see that whenever b < a, the better action for the society
to settle on (in terms of payoff) is A, while if a < b then it is B. Let M′_AB be the payoff
that an agent receives when they play action A and their opponent plays action B. Let
M′_AA, M′_BA and M′_BB be defined similarly. Studying the Nash equilibrium of the game
M′, we get the following lemma.
Lemma 8.2.2. Consider the game matrix M′ defined in Figure 8.3, where all payoff
values are positive, i.e., a, b > 0. Then we have:
(i) If b > a, the strategy profile (B,B) is a Nash Equilibrium. Further, if c ≥ (b − a)/b, then
(A,A) is also a Nash equilibrium. Further, the strategy profile ((q, 1−q), (q, 1−q))
is a Nash Equilibrium only when c ≥ (b − a)/b, where q = (b − (1 − c)a)/(c(a + b)). Note that the mixed
strategy (q, 1 − q) denotes playing action A with probability q and action B with
probability 1 − q.
(ii) Similarly, if a > b, the strategy profile (A,A) is a Nash Equilibrium. Further, if
c ≥ (a − b)/a, then the strategy profile (B,B), as well as ((q, 1 − q), (q, 1 − q)) are also
Nash Equilibria, with q = (b − (1 − c)a)/(c(a + b)).
Proof. For (A,A) to be a Nash equilibrium of the game M ′, as defined in Figure 8.3, the
following condition has to hold:
\[ M'_{AA} \ge M'_{BA} \;\Rightarrow\; c \ge \frac{b-a}{b}. \tag{8.1} \]
Similarly, for (B,B) to be a Nash equilibrium, the required condition is:
\[ M'_{BB} \ge M'_{AB} \;\Rightarrow\; c \ge \frac{a-b}{a}. \tag{8.2} \]
Consider the following two cases:
1. b > a : In this case, (8.2) is always satisfied. Thus, (B,B) is a NE. If c is large
enough such that (8.1) is satisfied, then (A,A) is also a NE.
2. a > b : In this case, (8.1) is always satisfied. Thus, (A,A) is a NE. If c is large
enough such that (8.2) is satisfied, then (B,B) is also a NE.
Note that ((q, 1 − q), (q, 1 − q)) is a mixed-strategy Nash Equilibrium when an
opponent playing (q, 1 − q) leaves the agent indifferent between actions A and B, i.e., when:
\[ q M'_{AA} + (1-q) M'_{AB} = q M'_{BA} + (1-q) M'_{BB}. \tag{8.3} \]
Simplifying this, we get:
\[ q = \frac{b-(1-c)a}{c(a+b)}. \]
We know that 0 ≤ q ≤ 1. Thus, this reduces to the following two conditions for
((q, 1 − q), (q, 1 − q)) to be a Nash Equilibrium:
\[ c \ge \frac{b-a}{b} \quad \text{and} \quad c \ge \frac{a-b}{a}. \]
When b > a, $c \ge \frac{a-b}{a}$ is always satisfied. Thus, when c is large enough such that
$c \ge \frac{b-a}{b}$, ((q, 1 − q), (q, 1 − q)) is a mixed-strategy Nash Equilibrium. Similarly, when
a > b, $c \ge \frac{b-a}{b}$ is always satisfied, and when c is large enough such that $c \ge \frac{a-b}{a}$,
((q, 1 − q), (q, 1 − q)) is a mixed-strategy Nash Equilibrium. □
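As an illustrative sketch, the equilibrium conditions of Lemma 8.2.2 can be checked
numerically. The code below assumes that M ′ has the entries M′AA = a, M′AB = (1 − c)a,
M′BA = (1 − c)b and M′BB = b. This form is inferred from conditions (8.1) and (8.2), since
Figure 8.3 is not reproduced here, and the function names are hypothetical.

```python
def payoff_matrix(a, b, c):
    """Assumed form of M' (inferred from conditions (8.1) and (8.2))."""
    return {
        ("A", "A"): a,            # both agents coordinate on A
        ("A", "B"): (1 - c) * a,  # A against B: the coordination share c is lost
        ("B", "A"): (1 - c) * b,  # B against A
        ("B", "B"): b,            # both agents coordinate on B
    }


def nash_equilibria(a, b, c):
    """Return symmetric Nash equilibria as {name: probability of playing A}."""
    m = payoff_matrix(a, b, c)
    equilibria = {}
    if m[("A", "A")] >= m[("B", "A")]:      # condition (8.1)
        equilibria["(A,A)"] = 1.0
    if m[("B", "B")] >= m[("A", "B")]:      # condition (8.2)
        equilibria["(B,B)"] = 0.0
    if c > 0:
        q = (b - (1 - c) * a) / (c * (a + b))  # mixed-strategy probability of A
        if 0.0 <= q <= 1.0:
            equilibria["mixed"] = q
    return equilibria


if __name__ == "__main__":
    for c in (0.05, 0.5):
        print(c, nash_equilibria(1.0, 1.15, c))
```

With a = 1.0 and b = 1.15, the threshold (b − a)/b is roughly 0.13, so the sketch reports
only (B,B) for c = 0.05 and all three equilibria for c = 0.5.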
From Lemma 8.2.2, we see that only when c is high enough, the sub-optimal action pair
becomes a Nash Equilibrium, where sub-optimal action pair refers to the situation where
both agents get a lower payoff than otherwise possible using, for example, the optimal
action pair. This means that when b > a, (A,A) is the sub-optimal action pair. Thus,
from Lemma 8.2.2, we see that if the need for coordination c is high, then the population
may converge to either of two different equilibria, one of which is sub-optimal in terms
of overall payoff. When c is low, on the other hand, the society will converge to a single
globally-optimal equilibrium.
In the next two sub-sections we introduce two models for studying norm change,
using two well-known models of evolutionary change (the replicator dynamic [TJ78] and
the Fermi rule [Blu93]). We show that both models of evolutionary change are invariant to
additions to the payoffs by a constant, and thus the results from this section carry forward.
We derive results for how different societies respond to a need for norm change using both
mathematical analysis on infinite well-mixed populations (where well-mixed denotes that
any agent can interact with any other agent in the population), and extensive agent-based
simulations on finite structured populations (where agents are placed on a network and
can interact with only their neighbors).
8.2.1 Replicator dynamic on infinite well-mixed populations
Consider a well-mixed infinite population of agents. This is a standard setting used
in evolutionary game theory, since a well-mixed infinite population is usually analytically
tractable. Let the agents be interacting with each other using game matrix M ′ defined
in Figure 8.3, and the proportion of agents playing each strategy be denoted by x =
(xA, xB), i.e., xA proportion of agents with strategy A, and proportion xB = 1− xA with
strategy B. Also, let uA(x) and uB(x) denote the payoffs received by an agent playing
actions A and B respectively, given the strategy proportion x. The expected payoff for an
agent is given by interacting with a randomly chosen agent in the population. Thus, we
get the following:
\[ E[u_A(x)] = x_A M'_{AA} + x_B M'_{AB}, \]
\[ E[u_B(x)] = x_A M'_{BA} + x_B M'_{BB}. \]
On analyzing the Nash Equilibria of this system, we observe the following lemma.
Lemma 8.2.3. Consider a well-mixed infinite population where agents interact using the
game M ′ in Figure 8.3. Assuming all payoff values are positive, i.e., a, b > 0, and using
Lemma 8.2.2, we have:
(i) When b > a, $x_A = 0$ is a Nash Equilibrium. If $c \ge \frac{b-a}{b}$, then $x_A = 1$ and
$x_A = \frac{b-(1-c)a}{c(a+b)}$ (which corresponds to the mixed-strategy Nash Equilibrium in Lemma
8.2.2) are also Nash Equilibria.
(ii) Similarly, when a > b, $x_A = 1$ is a Nash Equilibrium, while if $c \ge \frac{a-b}{a}$, then $x_A = 0$
and $x_A = \frac{b-(1-c)a}{c(a+b)}$ are also Nash Equilibria.
Proof. Consider the cases: xA = 0 (the strategy set where all of the population plays B)
and xA = 1 (the strategy set where all of the population plays A). From Lemma 8.2.2, we
get the following two cases:
1. b > a : In this case, (8.2) is always satisfied. Thus, xA = 0 is a NE. If c is large
enough such that (8.1) is satisfied, then xA = 1 is also a NE.
2. a > b : In this case, (8.1) is always satisfied. Thus, xA = 1 is a NE. If c is large
enough such that (8.2) is satisfied, then xA = 0 is also a NE.
Now consider the intermediate case where xA = p with 0 < p < 1. For xA = p to
be a NE, no A agent should have a strictly better payoff if switching to B, and vice versa.
Thus, the following two conditions need to be simultaneously satisfied:
\[ p M'_{AA} + (1-p) M'_{AB} \ge p M'_{BA} + (1-p) M'_{BB}, \]
\[ \text{and} \quad p M'_{BA} + (1-p) M'_{BB} \ge p M'_{AA} + (1-p) M'_{AB}. \]
Both of these conditions are satisfied only when:
\[ p M'_{AA} + (1-p) M'_{AB} = p M'_{BA} + (1-p) M'_{BB}. \]
This simplifies to:
\[ p = \frac{b-(1-c)a}{c(a+b)}, \]
and similar to Lemma 8.2.2, the results follow. □
We assume that on each iteration, agents interact with other randomly chosen agents,
and the population evolves according to the replicator dynamic. The replicator dynamic
is based on the idea that the proportion of agents of a type (or strategy) increases when it
achieves expected payoff higher than the average payoff, and decreases when achieving
lower payoff than the average payoff. Thus, over time, the proportion of agents of a type
that achieves payoff higher than the average payoff starts increasing in the population, and
eventually takes over. More formally, the replicator dynamic is given by the differential
equation
\[ \dot{x}_A = \frac{dx_A}{dt} = x_A \cdot (E[u_A(x)] - \theta(x)), \tag{8.4} \]
where θ(x) = xAE[uA(x)] + xBE[uB(x)] is the average payoff received by all agents in
the population. From (8.4), it is clear that the rate of change remains the same on adding
a constant to the payoff matrix, since the added constants would just cancel each other
out. Thus, Lemma 8.2.1 follows through to this section as well.
Using the game matrix M ′, the rate of change in the proportion xA is given by:
ẋA = xA(1− xA)(c(a+ b)xA − (b− (1− c)a)). (8.5)
The fixed points of this rate of change are given by:
\[ x_A = 0, \quad x_A = 1, \quad \text{and} \quad x_A = \frac{b-(1-c)a}{c(a+b)}. \tag{8.6} \]
These correspond to the Nash Equilibria derived earlier. Next, we study the stability of
the Nash equilibria derived above, where we define a stable Nash equilibrium under the
replicator dynamic to be one where: if an infinitesimal proportion of agents change their
strategy, the replicator dynamic always forces the population back to the original Nash
equilibrium. More precisely, let the Nash equilibrium be $x_A = p$. If $x_A$ increases by an
infinitesimal amount $\epsilon$ to $p + \epsilon$, the Nash equilibrium is stable only if $\dot{x}_A < 0$, which drives
the population back to the Nash equilibrium $x_A = p$. Similarly, if $x_A$ decreases by $\epsilon$,
$x_A = p$ is stable only if $\dot{x}_A > 0$. Thus, we state the following corollary.
Corollary 8.2.1. From Lemma 8.2.3 and Eq. (8.5) and Eq. (8.6), we see that the Nash
Equilibria $x_A = 0$ and $x_A = 1$ are stable, while the Nash Equilibrium
$x_A = \frac{b-(1-c)a}{c(a+b)}$ is unstable.
Proof. Let $\phi = \frac{b-(1-c)a}{c(a+b)}$. From Eq. (8.5), we notice that, if $x_A = \phi + \epsilon$, then $\dot{x}_A > 0$,
while if $x_A = \phi - \epsilon$, then $\dot{x}_A < 0$, for any small $\epsilon > 0$. Thus, $x_A = \frac{b-(1-c)a}{c(a+b)}$ represents
an unstable fixed point. Similarly, notice that if $x_A = \epsilon$, then $\dot{x}_A < 0$, while if $x_A = 1 - \epsilon$, then
$\dot{x}_A > 0$. Thus, $x_A = 0$ and $x_A = 1$ represent stable fixed points. □
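The stability argument can be illustrated with a short numerical sketch. It computes the
replicator dynamic (8.4) directly from payoffs, compares it with the closed form (8.5),
checks that adding a constant to all payoffs leaves the rate unchanged, and applies the
sign test from the text on either side of each fixed point. The payoff entries of M ′ are
an assumption inferred from conditions (8.1) and (8.2), and all function names are
hypothetical.

```python
def xdot_general(x_a, a, b, c, shift=0.0):
    """Replicator dynamic (8.4) from payoffs; `shift` is added to every payoff.

    Assumed M': M'_AA = a, M'_AB = (1-c)a, M'_BA = (1-c)b, M'_BB = b.
    """
    m_aa, m_ab = a + shift, (1 - c) * a + shift
    m_ba, m_bb = (1 - c) * b + shift, b + shift
    x_b = 1.0 - x_a
    u_a = x_a * m_aa + x_b * m_ab          # expected payoff of an A agent
    u_b = x_a * m_ba + x_b * m_bb          # expected payoff of a B agent
    theta = x_a * u_a + x_b * u_b          # average payoff in the population
    return x_a * (u_a - theta)


def xdot_closed(x_a, a, b, c):
    """Closed form of the rate of change, Eq. (8.5)."""
    return x_a * (1 - x_a) * (c * (a + b) * x_a - (b - (1 - c) * a))


def is_stable(p, a, b, c, eps=1e-6):
    """Sign test: the flow must point back toward p on each feasible side."""
    stable = True
    if p + eps <= 1.0:
        stable = stable and xdot_closed(p + eps, a, b, c) < 0
    if p - eps >= 0.0:
        stable = stable and xdot_closed(p - eps, a, b, c) > 0
    return stable


if __name__ == "__main__":
    a, b, c = 1.0, 1.15, 0.5
    phi = (b - (1 - c) * a) / (c * (a + b))
    for point in (0.0, phi, 1.0):
        print(point, is_stable(point, a, b, c))
```

For c above the threshold (b − a)/b, the sketch marks the two pure fixed points as stable
and the interior fixed point as unstable, in line with Corollary 8.2.1.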
There is a further notion of equilibrium used in EGT called evolutionarily stable
strategies (ESS) [Smi82]. A strategy S is an ESS if there is a small proportion py such
that, when any other strategy T has a proportion px < py (where the rest of the population
has strategy S), the payoff of an S agent is always strictly greater than a T agent. Using
this definition, we state the following theorem.
Theorem 8.2.1. From Lemma 8.2.3 and Corollary 8.2.1, we see:
(i) When b > a, B is an ESS. If $c > \frac{b-a}{b}$, then A is also an ESS.
(ii) When a > b, A is an ESS. If $c > \frac{a-b}{a}$, then B is also an ESS.
Proof. Let C denote the mixed strategy (q, 1 − q). From (8.3), we see that for C to be a
mixed strategy NE, the following condition needs to hold:
\[ q = \frac{M'_{BB} - M'_{AB}}{M'_{AA} - M'_{BA} + M'_{BB} - M'_{AB}}. \tag{8.7} \]
Let $M'_{AC}$ denote the payoff received by the row player when an A player (row player)
interacts with a C player (column player). Similarly, we define $M'_{CC}$, $M'_{CA}$, $M'_{BC}$ and
$M'_{CB}$. Thus we get:
\[ M'_{CC} := q^2 M'_{AA} + q(1-q) M'_{AB} + q(1-q) M'_{BA} + (1-q)^2 M'_{BB}, \]
\[ M'_{CA} := q M'_{AA} + (1-q) M'_{BA}, \]
\[ M'_{AC} := q M'_{AA} + (1-q) M'_{AB}, \]
\[ M'_{CB} := q M'_{AB} + (1-q) M'_{BB}, \]
\[ M'_{BC} := q M'_{BA} + (1-q) M'_{BB}. \]
Let us derive conditions for which A is an Evolutionarily Stable Strategy (ESS). Thus,
we consider the proportion of agents playing A to be close to 1, i.e., $x_A = 1 - \epsilon$, where
$0 < \epsilon \ll 1$.
Let S denote a strategy other than A that agents can play, i.e., S ∈ {B, C}.
For A to be an ESS, one of the following conditions needs to hold for each such S: either
1. $M'_{AA} > M'_{SA}$, or
2. $M'_{AA} = M'_{SA}$ and $M'_{AS} > M'_{SS}$.
From Lemma 8.2.2, we see that $M'_{AA} > M'_{BA}$ simplifies to the condition $c > \frac{b-a}{b}$.
Further, $M'_{AA} = M'_{BA}$ simplifies to $c = \frac{b-a}{b}$.
We also notice that $M'_{AA} > M'_{CA}$ simplifies to:
\[ M'_{AA} > q M'_{AA} + (1-q) M'_{BA} \;\Rightarrow\; M'_{AA} > M'_{BA}. \]
Now consider the three cases:
1. b > a : In this case, if c is large enough such that $c > \frac{b-a}{b}$ is satisfied, then A is an
ESS.
2. a > b : In this case, $c > \frac{b-a}{b}$ is always satisfied. Thus, A is an ESS.
3. b > a and $c = \frac{b-a}{b}$ : In this case, for A to be an ESS, both the conditions
$M'_{AB} > M'_{BB}$ and $M'_{AC} > M'_{CC}$ have to be satisfied. $M'_{AB} > M'_{BB}$ simplifies to
(1 − c)a > b, which is never satisfied. $M'_{AC} > M'_{CC}$ simplifies to:
\[ q M'_{AA} + (1-q) M'_{AB} > q^2 M'_{AA} + q(1-q) M'_{AB} + q(1-q) M'_{BA} + (1-q)^2 M'_{BB} \]
\[ \Rightarrow\; q(1-q)(M'_{AA} - M'_{BA}) > (1-q)^2 (M'_{BB} - M'_{AB}), \tag{8.8} \]
which is also never satisfied (follows from (8.7)). Thus, for this case, A is not an
ESS.
We can similarly derive conditions where B is an Evolutionarily Stable Strategy
(ESS).
Now we examine whether C is an ESS. Using (8.7), we can show that the following
conditions are satisfied: $M'_{CC} = M'_{AC}$ and $M'_{CC} = M'_{BC}$. Thus, for C to be an ESS,
both $M'_{CA} > M'_{AA}$ and $M'_{CB} > M'_{BB}$ need to be satisfied. These two conditions simplify
to the following conditions: $M'_{BA} > M'_{AA}$ and $M'_{AB} > M'_{BB}$, which in turn simplify to
the conditions $c < \frac{b-a}{b}$ and $c < \frac{a-b}{a}$. Both of these conditions cannot be simultaneously
satisfied. Thus, C is not an ESS. □
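The two ESS conditions used in the proof can also be checked numerically. The sketch
below represents each strategy by its probability of playing A (A is 1.0, B is 0.0, and C
is q), and again assumes the entries M′AA = a, M′AB = (1 − c)a, M′BA = (1 − c)b and
M′BB = b for M ′. The helper names and the tolerance are hypothetical choices.

```python
def payoff(p, r, a, b, c):
    """Expected payoff of a player with P(A) = p against an opponent with P(A) = r.

    Assumed M': M'_AA = a, M'_AB = (1-c)a, M'_BA = (1-c)b, M'_BB = b.
    """
    m_aa, m_ab, m_ba, m_bb = a, (1 - c) * a, (1 - c) * b, b
    return (p * r * m_aa + p * (1 - r) * m_ab
            + (1 - p) * r * m_ba + (1 - p) * (1 - r) * m_bb)


def is_ess(s, mutants, a, b, c, tol=1e-12):
    """Check the two ESS conditions for strategy s against every mutant t."""
    for t in mutants:
        e_ss, e_ts = payoff(s, s, a, b, c), payoff(t, s, a, b, c)
        if e_ss > e_ts + tol:                      # condition 1: strictly better
            continue
        tie = abs(e_ss - e_ts) <= tol
        if tie and payoff(s, t, a, b, c) > payoff(t, t, a, b, c) + tol:
            continue                               # condition 2: tie, but s beats t vs t
        return False
    return True


if __name__ == "__main__":
    a, b, c = 1.0, 1.15, 0.5
    q = (b - (1 - c) * a) / (c * (a + b))
    print("A:", is_ess(1.0, [0.0, q], a, b, c))
    print("B:", is_ess(0.0, [1.0, q], a, b, c))
    print("C:", is_ess(q, [0.0, 1.0], a, b, c))
```

For these parameter values, c exceeds (b − a)/b, so the sketch reports A and B as ESS and
the mixed strategy C as not an ESS.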
We observe that the strategies A and B are Evolutionarily Stable Strategies (ESS),
when adopted by everyone in the population (corresponding to the stable Nash equilibria
xA = 1 and xA = 0). The unstable Nash Equilibrium, on the other hand, does not
correspond to an ESS, since even a small group with a different strategy is able to force
the population to a different equilibrium. Thus, only stable Nash Equilibria correspond to
evolutionarily stable strategies.
Figure 8.4: Figures show the change in the proportion of B agents with time with a
well-mixed infinite population where reproduction is determined by the replicator dynamic
with b > a.
Theorem 8.2.1 indicates that a society under our model is bound to end up at one
of the evolutionarily stable strategies: with every individual on action A or everyone on
action B, since even a small perturbation moves the society away from the unstable Nash
equilibrium. When c is low, there exists only a single ESS, and thus the society adapts
itself and settles on the ESS. When c is high, there are two ESSs, and thus the society
might settle on either one, depending on the starting point of the society.
Let us consider two societies: one with a lower need for coordination c1, and one
with a high need for coordination c2 > c1. To avoid some awkward phrasing, we’ll
call these the “looser” and “tighter” societies, respectively. Suppose a majority of both
societies are playing norm A, and suppose they evolve according to the replicator dynamic
given in Eq. (7.1). We are interested in how these two societies would respond to the
action B, when the payoff of action B is higher than A, i.e., when b > a, or equivalently,
$M'_{BB} > M'_{AA}$. First notice that if c2 > (b − a)/b, and c1 < (b − a)/b, it follows from
Theorem 8.2.1 that the tighter society remains on norm A while the looser one switches
to the globally optimal norm B.
Figure 8.5: Figure shows the rate of change of B agents versus the proportion of B agents,
with a well-mixed infinite population where reproduction is determined by the replicator
dynamic with b > a.
Now suppose the difference in norm payoffs is large enough such that c2 < (b−a)/b
(and thus also, c1 < (b − a)/b). This ensures that there is only a single equilibrium for
both societies at xA = 0. Thus, both societies would eventually switch to norm B, and
we are interested in the rate at which this change occurs. Let $\dot{x}_{B1}$ and $\dot{x}_{B2}$ denote the rate
of change when the need to conform is c1 or c2, respectively. Then we can show that:
\[ \dot{x}_{B2} - \dot{x}_{B1} = x_B (1 - x_B)(c_2 - c_1)((a+b) x_B - b). \]
This simplifies to
\[ \dot{x}_{B2} - \dot{x}_{B1} \begin{cases} \le 0, & \text{when } x_B \le \frac{b}{a+b}; \\ > 0, & \text{when } x_B > \frac{b}{a+b}. \end{cases} \tag{8.9} \]
Thus, $\dot{x}_{B2} < \dot{x}_{B1}$ in the initial stages when xB < b/(a + b). However, once the proportion
of B agents becomes big enough such that xB > b/(a + b), then the higher the value of c,
the higher the rate of change will be. Thus, when c is high, the switch from A to B takes
time to speed up, with more cultural inertia than when c is low, even when the payoff
of the new norm is arbitrarily large compared to the previous norm. The initial cultural
inertia results in the society with a higher c value taking longer overall to switch to the
new norm.
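To make the comparison concrete, the following sketch evaluates the rate of change of B
agents for a looser and a tighter society and integrates both with a simple forward-Euler
scheme. The rate is obtained from Eq. (8.5) with xA = 1 − xB. The parameter values
(a = 1, b = 2, c1 = 0.1, c2 = 0.4), the step size, and the function names are
illustrative assumptions, chosen so that both societies have c < (b − a)/b.

```python
def xdot_b(x_b, a, b, c):
    """Rate of change of the proportion of B agents: minus Eq. (8.5) at x_A = 1 - x_B."""
    x_a = 1.0 - x_b
    return -x_a * (1 - x_a) * (c * (a + b) * x_a - (b - (1 - c) * a))


def time_to_switch(a, b, c, x_b0=0.05, target=0.95, dt=0.01, max_steps=1_000_000):
    """Forward-Euler time for x_B to rise from x_b0 to target (None if never reached)."""
    x_b = x_b0
    for step in range(max_steps):
        if x_b >= target:
            return step * dt
        x_b += dt * xdot_b(x_b, a, b, c)
    return None


if __name__ == "__main__":
    a, b = 1.0, 2.0          # hypothetical payoffs with b > a
    c_loose, c_tight = 0.1, 0.4
    print("threshold b/(a+b):", b / (a + b))
    for x_b in (0.3, 0.8):   # one point below and one above the threshold
        print(x_b, xdot_b(x_b, a, b, c_tight) - xdot_b(x_b, a, b, c_loose))
    print("loose:", time_to_switch(a, b, c_loose))
    print("tight:", time_to_switch(a, b, c_tight))
```

Under these assumptions, the difference in rates is negative below xB = b/(a + b) and
positive above it, and the tighter society takes longer overall to reach the target
proportion.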
Figure 8.4 illustrates these properties of well-mixed populations using the replica-
tor dynamic. We start off the society at the proportion xA = 0.95. In the first of the
three graphs, the tighter society (again using “tighter” as shorthand for “higher need for
coordination”) has c > (b− a)/b. Thus, while the less-tight society switches to the more
beneficial norm B, the tighter society is resistant to the change (since the difference in
payoffs is small) and stays with norm A. The second and third graphs show situations
where both societies switch to norm B. We observe that the tighter society switches
to norm B more slowly, but the difference in speed decreases as the
difference in payoffs between B and A increases.
As derived in Eq. (8.9), the rate of change for a society with higher c grows larger
than with lower c only after $x_B > \frac{b}{a+b}$. This is shown in Figure 8.5. This also indicates
the initial inertia that societies with a higher need for coordination experience towards
changing norms. The need for coordination in these societies leads to individuals being
reluctant to try out new norms, which in turn leads to inertia.
Figure 8.6: Simulations with the Fermi rule on a toroidal grid of size 2500. From top to
bottom: c = 1.0, c = 0.75, c = 0.5. Initially: a = 1.0, b = 1.15. We use a structural
shock at 2500 iterations, after which the payoffs become: a = 1.15, b = 1.0.
8.2.2 Agent simulations on finite networks
A limitation of the model introduced in the previous section is that it assumes
that the population is infinite and well-mixed. While the assumption that a population is
infinite is not a bad approximation for very large populations (which is the scale that we
are interested in), the assumption that agents are well-mixed, i.e., where any agent can
interact with any other agent, is often inaccurate. In this section, we show that the results
derived in the previous section also extend to a model in which agents are structured on
the nodes of a graph/network, where agents can only interact with another agent if they
are connected by an edge in the graph.
More specifically, we now consider a structured population where agents are ar-
ranged on the nodes of a toroidal (wrap-around) grid, such that each agent can interact
only with the 4 other agents they are connected to. We consider toroidal grids as a
convenient example; however, the results we describe below also extend to other network
structures like small-world networks [WS98] and preferential attachment [BA99]
models. Mathematical analysis of evolutionary games on structured populations is not yet a
well-developed field, and thus we perform simulations of our model as follows.
Initially, we arrange agents with random strategies (A or B) on each node of the
grid. In each iteration, each pair of agents connected by an edge interact in a two-player
game defined by the payoff matrix M . The total payoff of each agent is computed by
summing over the payoffs received by an agent for each game that they play. Since the
population is finite, we use dynamics defined on finite populations. After each interaction
phase, each agent uses the Fermi rule to update its strategy for the next iteration. Under the
Fermi rule, an agent $\psi_a$ picks a random neighbor $\psi_n$ and observes its payoff, and the agent
then decides to switch to the neighbor’s strategy with probability
$p = (1 + \exp(s(u_a - u_n)))^{-1}$, where $u_a$ and $u_n$ are the payoffs of the agent and the
neighbor, and s is a user-defined parameter (in all our experiments, we set s = 5). With
probability 1 − p, the agent retains its old strategy. With a small probability µ, called the
exploration rate, an agent then tries out an action completely randomly. This repeats for
every iteration of the simulation. Note that the Fermi rule also only depends on a difference
between payoff values, and thus, like the replicator dynamic, is also invariant to addition of
a constant to the payoffs. Thus, for all our experiments in this section, we use the simplified
game matrix M ′ from Figure 8.3.
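A minimal sketch of one iteration of this simulation is given below. It is not the code
used for our experiments: the synchronous update order, the default exploration rate
mu = 0.05, the assumed entries of M ′ (M′AA = a, M′AB = (1 − c)a, M′BA = (1 − c)b,
M′BB = b), and all function names are illustrative assumptions. A grid with 2500 nodes
corresponds to n = 50.

```python
import math
import random


def neighbors(i, j, n):
    """The 4 neighbors of cell (i, j) on an n x n toroidal (wrap-around) grid."""
    return [((i - 1) % n, j), ((i + 1) % n, j), (i, (j - 1) % n), (i, (j + 1) % n)]


def game_payoff(own, other, a, b, c):
    """Payoff of playing `own` against `other` under the assumed matrix M'."""
    base = a if own == "A" else b
    return base if own == other else (1 - c) * base


def step(grid, a, b, c, s=5.0, mu=0.05, rng=random):
    """One iteration: play all neighbors, apply the Fermi rule, then explore."""
    n = len(grid)
    # Total payoff of each agent, summed over its 4 pairwise games.
    total = [[sum(game_payoff(grid[i][j], grid[x][y], a, b, c)
                  for x, y in neighbors(i, j, n))
              for j in range(n)] for i in range(n)]
    new_grid = [row[:] for row in grid]
    for i in range(n):
        for j in range(n):
            x, y = rng.choice(neighbors(i, j, n))
            # Fermi rule: adopt the neighbor's strategy with probability p.
            p = 1.0 / (1.0 + math.exp(s * (total[i][j] - total[x][y])))
            if rng.random() < p:
                new_grid[i][j] = grid[x][y]
            # Exploration: with probability mu, pick an action uniformly at random.
            if rng.random() < mu:
                new_grid[i][j] = rng.choice("AB")
    return new_grid


if __name__ == "__main__":
    rng = random.Random(0)
    n = 50
    grid = [[rng.choice("AB") for _ in range(n)] for _ in range(n)]
    for _ in range(20):
        grid = step(grid, a=1.0, b=1.15, c=0.5, rng=rng)
    print(sum(row.count("A") for row in grid) / (n * n))
```

Because the update uses only the payoff difference between an agent and its neighbor,
adding a constant to every entry of the payoff matrix leaves the switching probabilities
unchanged.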
To study cultural inertia (i.e., resistance to changing a cultural norm) or rapid cul-
tural change in different societies, we use a game-theoretic model of a structural shock.
A structural shock represents a catastrophic incident in a society, where suddenly there
is an abrupt change in the payoffs for actions A and B. We are interested in studying how
societies with different needs for coordination react to such an abrupt and drastic shift in
the payoffs of the possible actions. In our EGT model, we implement a structural shock
by simply interchanging the payoffs of actions A and B, thus, denoting a sudden change
in the globally optimal action in a society. This is equivalent to interchanging the payoff
values a and b. Thus, if initially, we have b > a, after a structural shock, we get a > b.
Note that in the simulations presented in Figure 8.4 with well-mixed populations, we use
a structural shock implicitly by assuming the norm is A and the payoff of action B is
higher than action A.
Consider that, initially, the action with a higher utility (and the current norm) in
a society is B, i.e., b > a with xA = 0. Suppose the society experiences a structural
shock, where now action A becomes more desirable with a > b. On introducing a small
proportion of agents playing norm A (say xA = 0.01), if the need for coordination is
low then the population will switch to the new norm with xA = 1. This is because,
after the structural shock, the Nash Equilibrium (and ESS) is xA = 1, as shown above.
However, if the need for coordination is high (i.e., $c \ge \frac{a-b}{a}$), then xA = 0 is still a Nash
Equilibrium (and ESS) and the population will remain on the sub-optimal norm B even
after the structural shock.
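The effect of a structural shock can be illustrated on the well-mixed model with a short
sketch. It swaps the payoff values a and b, seeds the population with xA = 0.01, and
integrates Eq. (8.5) with forward Euler for one value of c below and one above the
threshold (a − b)/a. The step size, the number of steps, the two values of c, and the
function names are assumptions made for illustration.

```python
def xdot_a(x_a, a, b, c):
    """Closed-form replicator rate of change, Eq. (8.5)."""
    return x_a * (1 - x_a) * (c * (a + b) * x_a - (b - (1 - c) * a))


def settle(x_a, a, b, c, dt=0.05, steps=20000):
    """Forward-Euler integration of Eq. (8.5); returns the final proportion of A agents."""
    for _ in range(steps):
        x_a = min(1.0, max(0.0, x_a + dt * xdot_a(x_a, a, b, c)))
    return x_a


def structural_shock(a, b):
    """A structural shock interchanges the payoff values a and b."""
    return b, a


a, b = structural_shock(1.0, 1.15)   # before: b > a; after: a = 1.15 > b = 1.0
threshold = (a - b) / a              # B remains an equilibrium when c >= threshold
loose = settle(0.01, a, b, c=0.05)   # c below the threshold
tight = settle(0.01, a, b, c=0.5)    # c above the threshold

if __name__ == "__main__":
    print("threshold:", threshold)
    print("loose society, final x_A:", loose)
    print("tight society, final x_A:", tight)
```

Under these assumptions, the looser society ends up close to xA = 1 while the tighter
society returns to xA = 0, matching the equilibrium argument above.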
All experiments were run on a grid with 2500 nodes, and the simulation goes on for
6000 iterations, with a structural shock implemented at 2500 iterations. 100 independent
simulations are run for each setting and the results are averaged over the 100 runs. Figure
8.6 shows the results of our simulations. The plots show the proportion of agents playing
norm A vs norm B. As before, the parameter c denotes the need for coordination. When
c is low, very little cultural inertia develops and agents are more willing to innovate by
exploring behaviors other than the current societal norms. In this case, the population will
change more quickly to a different norm if the new norm will be beneficial. By contrast,
when c is high, we see the evolutionary emergence of higher levels of cultural inertia,
with agents less willing to innovate or to violate established cultural norms. In this case,
the population is slower to change to the new norm, and if c is high enough it may not
change at all. Thus, qualitatively, the results with structured populations match those
from the infinite well-mixed populations in Section 8.2.1, and the mechanics that lead to
the above results can be explained using the same equilibrium results derived above.
8.3 Evolving exploration rates
In addition to the amount of cultural inertia, another key aspect to understanding
how norms change in different cultures is to study whether an agent is evolutionarily
more likely to learn socially (i.e., adopt a behavior that is being used by other agents
in the population) or to innovate and explore new random behaviors. Such tendencies
are critical in understanding the rate at which new technologies, languages, or moral
traditions are adopted in a population, and help us understand the processes of influence
and persuasion at the individual level.
Figure 8.7: Replicator-mutator dynamic on an infinite well-mixed population with a =
0.4 and b = 0.6. The solid and dotted lines denote c = 0.05 and c = 0.3, respectively.
The colors denote the exploration rates.
In the model presented in Section 8.2.2 for finite structured populations, the ex-
ploration rate (i.e., the small probability with which an agent tries out a new strategy
at random) was kept at a constant low value. This exploration rate denotes how much an
agent is open to change and trying out new actions at random. Thus, it seems that the need
for coordination in a society might affect how likely an individual is to try out different
actions, instead of conforming to their neighbors. Particularly, it seems natural to assume
that individuals in tight societies are much less likely to try out random actions than in-
dividuals in loose societies [GRN+11, HG14]. In this section, we test this hypothesis by
presenting a model to study the evolution of exploration rates in different societies.
To get some intuition about the hypothesis, we go back to our setting of a well-
mixed infinite population. Note that the replicator dynamic does not have a provision for
exploration rates. Thus, we use a variant of the replicator dynamic called the
replicator-mutator equation [Lev03]. Using this variant, one can include exploration rates into
the replicator dynamic.
Figure 8.8: Simulations with the Fermi rule on a toroidal grid of size 2500, with structural
shocks at intervals of 75 iterations. From top to bottom: c = 1.0, c = 0.8, c = 0.5. Initially:
a = 1.0, b = 1.15. The left column shows proportions of norms A and B. The right
column shows proportions of the population that use each different exploration rate.
Panels: (a) c = 1.0, action proportions; (b) c = 1.0, exploration rates; (c) c = 0.8, action
proportions; (d) c = 0.8, exploration rates; (e) c = 0.5, action proportions; (f) c = 0.5,
exploration rates.
Thus, if we fix µ to be the exploration rate, we can write the
replicator-mutator equation as:
ẋA = (1− µ)xAE[uA(x)] + µxAE[uB(x)]− xAθ(x),
= xA(E[uA(x)]− θ(x)) + µxA(E[uB(x)]− E[uA(x)]).
Thus, like the replicator dynamic, we can write the rate of change in terms of payoff
differences, which makes the dynamic invariant to additions to the payoffs by a constant.
Thus, we again use the simplified game matrix M ′ from Figure 8.3. Simplifying the
equation for $\dot{x}_A$, we get:
\[ \dot{x}_A = x_A (1 - x_A)\big(c(a+b) x_A - (b - (1-c)a)\big) + \mu\big(x_A x_B (1-c)(b-a) + (x_B^2 b - x_A^2 a)\big). \tag{8.10} \]
Figure 8.7 plots the replicator-mutator equation (Eq. (8.10)) with a well-mixed infinite
population. The solid lines are for a low need for coordination (c = 0.05), while the
dotted lines are for a high need for coordination (c = 0.3), and we plot the proportion of
B agents, as well as the rate of change, for various exploration rate µ values. From the
figure, we see that for all exploration rates µ, when the need for coordination is high then
there is higher cultural inertia.
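The qualitative behavior described above can be reproduced with a small forward-Euler
integration of Eq. (8.10) as stated. The payoffs a = 0.4 and b = 0.6 and the two values
of c follow Figure 8.7; the starting point xA = 0.95, the step size, the exploration rate
µ = 0.1, and the function names are illustrative assumptions.

```python
def xdot_mutator(x_a, a, b, c, mu):
    """Right-hand side of the replicator-mutator equation, Eq. (8.10)."""
    x_b = 1.0 - x_a
    selection = x_a * x_b * (c * (a + b) * x_a - (b - (1 - c) * a))
    mutation = mu * (x_a * x_b * (1 - c) * (b - a) + (x_b ** 2 * b - x_a ** 2 * a))
    return selection + mutation


def trajectory(a, b, c, mu, x_a0=0.95, dt=0.01, steps=5000):
    """Forward-Euler trajectory of x_A, clipped to the interval [0, 1]."""
    xs = [x_a0]
    for _ in range(steps):
        x_next = xs[-1] + dt * xdot_mutator(xs[-1], a, b, c, mu)
        xs.append(min(1.0, max(0.0, x_next)))
    return xs


if __name__ == "__main__":
    a, b, mu = 0.4, 0.6, 0.1
    for c in (0.05, 0.3):
        xs = trajectory(a, b, c, mu)
        # Proportion of B agents after 2 and 10 time units.
        print(c, 1 - xs[200], 1 - xs[1000])
```

With µ = 0, the function reduces to Eq. (8.5). For µ = 0.1, the proportion of B agents
initially grows more slowly for c = 0.3 than for c = 0.05, which is the inertia effect
discussed above.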
To study how an agent’s tendency to learn socially or explore develops in a cul-
ture, we let the exploration rate (referred to as the mutation rate in biological models)
evolve. The exploration rate is the probability µ with which an agent chooses a random
new strategy at each iteration ($0 \le \mu \ll 1$). In biological evolution, mutation occurs
so rarely that game-theoretic biological models often omit it. In cultural evolution,
however, exploration is an important step since individuals try out new behaviors much more
frequently [THDS+09]. Studying the evolution of exploration rates helps us get insights
about a society’s openness to change. Low exploration rates suggest that individuals are
less likely to try out new strategies and are more likely to coordinate with their neighbors.
On the other hand, high exploration rates mean that individuals are more open to change
and innovation.
To model the evolution of exploration rates, we first create a set L of possible explo-
ration rates. These can be a finite discrete set of exploration rates. For all our experiments,
we use the set of exploration rates: L = {0.0, 0.1, 0.2, 0.3, 0.4, 0.5}. The exploration rate
is added as part of the strategy of an agent, and each individual now chooses an ex-
ploration rate in addition to the game action (A or B). Thus, an agent now copies the
exploration rate of a neighbor, along with the game action, when updating its strategy
using the Fermi rule.
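A single agent update under this extended strategy can be sketched as follows. The
strategy is a pair (action, exploration rate), and the Fermi rule copies both components
from the chosen neighbor. Two details are assumptions of this sketch rather than
statements about our implementation: the agent explores using the rate it holds after the
copy step, and exploration re-draws only the action, not the rate. The function name is
hypothetical.

```python
import math
import random

EXPLORATION_RATES = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]  # the set L from the text


def update_agent(own, neighbor, u_own, u_neighbor, s=5.0, rng=random):
    """Return the agent's next (action, exploration_rate) strategy.

    own, neighbor: (action, rate) pairs, with action in {"A", "B"}.
    u_own, u_neighbor: total payoffs from the last interaction phase.
    """
    action, rate = own
    # Fermi rule: copy the neighbor's whole strategy with probability p.
    p = 1.0 / (1.0 + math.exp(s * (u_own - u_neighbor)))
    if rng.random() < p:
        action, rate = neighbor
    # Exploration (assumption): use the current rate and re-draw only the action.
    if rng.random() < rate:
        action = rng.choice("AB")
    return action, rate


if __name__ == "__main__":
    rng = random.Random(1)
    agent = ("A", rng.choice(EXPLORATION_RATES))
    other = ("B", rng.choice(EXPLORATION_RATES))
    print(update_agent(agent, other, u_own=3.2, u_neighbor=4.6, rng=rng))
```

Since the exploration rate is inherited along with the action, rates that help agents
track the currently better norm can spread through the population in the same way as the
actions themselves.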
Note that a regularly changing environment is essential for studying the evolution
of exploration rates since, if the environment is not changing frequently enough, an ex-
ploration rate of 0 would always be evolutionarily stable. To model the changing environ-
ment, we will use the same switch in dominant norms (structural shock) that we used in
our earlier experiments, except now we apply the structural shock multiple times at much
shorter and regular intervals. We use a fixed interval of 75 iterations to apply the
structural shock. We run the simulation for a total of 2000 iterations. For these experiments,
an agent’s strategy set now has size 12: 6 possible exploration rates in L × 2
possible game actions (norm A or norm B). We use the same toroidal grid as described
before. Figure 8.8 shows the experimental results. Each row in Figure 8.8 shows,
for a specific c, the proportion of agents playing norm A vs norm B (left plot), and the
proportion of agents with each exploration rate (right plot).
We see that when the need for coordination is high, low exploration rates are
adopted by the majority of the society. Individuals in such a society are more likely
to adopt the strategies of their neighbors, and this leads to high cultural inertia. In loose
societies, however, higher proportions of exploration rates µ > 0 evolve, and individuals
are more open towards change, leading to lower cultural inertia. This fits well with our
results in Section 8.2, and provides insights into why cultural inertia develops in societies
with a higher need for coordination.
8.4 Significance of the work
In this chapter, we examined the processes underlying cultural inertia and norm
change. We built evolutionary game-theoretic models that show that societies that have
a higher need for coordination – those that are tight – have higher cultural inertia, with
individuals being less likely to switch to the new norm even when it might have a larger
payoff. Societies with a lower need for coordination – those that are loose – on the other
hand, have low cultural inertia, with individuals more willing to innovate and open to
change.
By letting the exploration rate evolve, we used it to study an agent’s tendency to
either learn using social interaction or innovate and explore new random behaviors, and
we show that exploration rates evolve differently in different cultures. When the need
for coordination is high, the majority of the population has very low exploration rates,
and individuals are more likely to adopt the strategies of their neighbors. When the need
for coordination is low, higher exploration rates evolve, leading to lower cultural inertia,
and more openness to change. This explains why individuals in tight cultures tend to show
less deviant behavior and more norm adherence.
To our knowledge, this is the first work that predicts the effects of the need for
coordination on norm change and cultural inertia, and how it affects an agent’s decision
of whether to learn from others or to innovate and explore random behaviors. We found
our main qualitative findings to be robust to a wide range of parameter values, in both our
simulation and theoretical results.
By studying how socio-structural factors such as the need for coordination affect
cultural inertia, this work aims to establish a culturally-sensitive model of norm change.
With this model, we identify the conditions that lead to stability or instability in estab-
lished population norms in different cultural contexts. Such knowledge is critical in pro-
viding us the ability to identify early markers of impending drastic shifts in populations’
norms and thus enable tools providing alerts to potential social uprisings and turmoil.
Chapter 9: Tipping points for norm change in human cultures
9.1 Introduction
Tightness-looseness is a dynamic construct, yet to date, there has been little research
on the evolutionary processes that lead to changes in societal norms, the rate at which such
changes occur, and how these processes vary across different cultures. In this chapter, we
aim to study how cultural differences in the way humans interact and influence each other
heavily influence how societal norms are established and the rate at which they change
across the world. We examine the causal relationship between an individual’s tendency
to conform with those around them and the rate at which norms are changed in different
cultures. More specifically, our primary contributions in this chapter are as follows:
• Drawing on recent research in cultural psychology, we propose a game-theoretic
model of a culture based on the tendency of an individual to conform with others,
vs. being more individualistic in their behavior.
• Using this model, we provide conditions under which a population is open to chang-
ing the current norm in a society.
• Finally, we analyze the rate at which such norm changes occur and compare the
rates in different cultures. We find that tighter cultures are more likely to be initially
resistant to norm changes, but once they reach a tipping point, they change faster
than looser cultures.
9.2 Background and related work
There has been widespread interest in studying the emergence of social norms in a
population both from an evolutionary perspective [You01,HO01,Mer38,Bic05,HYOR14],
as well as in empirical research [JKV10, KJTW09, CB15]. There has been, however,
much less work done on understanding the processes that lead to change in an already
established norm in a population. A related concept, the propagation of information
in social networks, has been well-studied (see [CLC13, Jac10, EK10] for an overview),
but these works typically do not account for the differences in how individuals inter-
act and influence each other in different cultures. Data science approaches have also
explored this question; however, it is very challenging to separate out the various
confounding factors (such as institutional influence) to establish clear causal relationships
[ZZX15, LGRC12, KYC+12, LMK+13].
This chapter extends the results of Chapter 8. While we studied the processes of
norm change in Chapter 8, there were limitations in the model that made it difficult to
analyze the speed of norm change. In the next section, we describe our new model, which
is more amenable to mathematical analysis, and provides a clearer picture into the factors
that affect the speed of norm change in different cultures.
9.3 Proposed evolutionary game-theoretic model
Consider an infinite, well-mixed population (i.e., each individual can interact with
any other individual in the population) that evolves according to the well-known replicator
dynamic [HS03]. For simplicity of presentation, suppose each agent may choose one of
two possible actions: A and B (see Section 9.4 for a discussion on the assumptions used
in our model). The two actions A and B correspond to possible norms that the society
could settle on. Let xA and xB denote the proportions of the population using actions
A and B respectively, with 0 ≤ xA, xB ≤ 1 and xA + xB = 1, and let x = (xA, xB).
According to the replicator dynamic, the rate of change in the proportions of agents using
each action is given by (7.1): ẋi = xi[fi(x)−φ(x)], where i ∈ {A,B}, ẋi = dxi/dt (i.e.,
rate of change of xi), fi(x) is the fitness of action i, and φ(x) denotes the average fitness
of the population, i.e.: φ(x) = xAfA(x) + xBfB(x). The replicator dynamic is based on
the idea that the proportion of agents with a particular strategy increases when it achieves
expected fitness higher than the average fitness, and vice versa.
Let uA and uB denote the payoffs associated with actions A and B, where 0 <
uA, uB < 1 and uA + uB = 1. To define the fitness function fi, we use the key insight
that in loose cultures, individuals tend to choose the action that is most beneficial to
them; but in tight cultures, individuals tend to conform to the same action that others
use, even if a different action might be more beneficial to each individual. To model this
mathematically, we let fi be a weighted combination of the payoff ui and an additional
conformism fitness measure θi that depends on whether the individual is conforming to
others in the population. Let m denote the parameter controlling the weighting between
these two fitness measures, i.e., the amount of conformist transmission in a population.
Thus, we define fi as:
\[ f_i(x, m) = (1 - m)\, u_i + m\, \theta_i(x, k), \tag{9.1} \]
where 0 ≤ m ≤ 1, and we define the conformism fitness measure θi as:
\[ \theta_i(x, k) = \left[ 1 + \exp\bigl( -k (x_i - 0.5) \bigr) \right]^{-1}, \tag{9.2} \]
where k > 0. Note that we can vary the behavior of the conformism fitness measure θi
using the parameter k (see Figure 9.1). For example, when k is large, θi is close to a step
function where an agent has a non-zero conformism fitness only if they conform with the
majority action:
\[ \theta_i^{\infty}(x) = \lim_{k \to \infty} \theta_i(x, k) =
\begin{cases} 0, & \text{if } x_i < 0.5; \\ 0.5, & \text{if } x_i = 0.5; \\ 1, & \text{if } x_i > 0.5. \end{cases} \]
Note that with no conformism whatsoever (m = 0), each action’s fitness depends solely
on its payoff, i.e., typical of a very loose culture. On the other hand, with 100% conformist
transmission (m = 1), i’s fitness depends solely on the conformism fitness measure (for
the case of θ∞i , this means that i’s fitness depends solely on whether i is in the majority
or the minority of the population). This is more indicative of a very tight culture. For
simplicity, for the rest of the chapter, we denote θi(x, k) as θi and θ∞i(x) as θ∞i.
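As an illustration, the fitness function (9.1), the conformism measure (9.2), and the replicator dynamic can be written in a few lines of Python. This is a minimal sketch, not the code used to produce the figures; the function names (theta, fitness, x_dot_B) and the argument order are assumptions made for this example.

```python
import math

def theta(x_i, k):
    """Conformism fitness (9.2): logistic function of the share x_i using action i."""
    return 1.0 / (1.0 + math.exp(-k * (x_i - 0.5)))

def fitness(u_i, x_i, m, k):
    """Fitness (9.1): weighted mix of the payoff u_i and the conformism fitness."""
    return (1.0 - m) * u_i + m * theta(x_i, k)

def x_dot_B(x_B, u_A, u_B, m, k):
    """Replicator dynamic for action B: x_B * (f_B - average fitness)."""
    x_A = 1.0 - x_B
    f_A = fitness(u_A, x_A, m, k)
    f_B = fitness(u_B, x_B, m, k)
    phi = x_A * f_A + x_B * f_B  # average fitness of the population
    return x_B * (f_B - phi)
```

Setting m = 0 recovers a purely payoff-driven dynamic, while m = 1 makes the rate of change depend only on which action is in the majority.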
Figure 9.1: Plot of (9.2) for different values of k.
9.3.1 When does norm change occur?
Suppose norm B has a higher utility compared to A, i.e., uB > uA. We are inter-
ested in analyzing the conditions for which a population shifts from norm A to B (norm
change). We can re-write the average fitness to be:
\[ \phi(x) = (1 - m)(x_A u_A + x_B u_B) + m (x_A \theta_A + x_B \theta_B). \tag{9.3} \]
We are interested in analyzing the rate of change in the proportion of B individuals. From
(7.1), (9.1), and (9.3), we get:
\[ \dot{x}_B = x_B (1 - x_B) \left[ (1 - m)(u_B - u_A) + m (\theta_B - \theta_A) \right]. \tag{9.4} \]
Note that θB ≥ θA when xB ≥ 0.5. Since uB > uA, we see that ẋB > 0, i.e., xB will
converge to 1 (limt→∞ xB = 1) when xB ≥ 0.5. If xA > xB, i.e., if the current norm in
the population is A, norm change takes place only if:
\[ m < \frac{u_B - u_A}{(u_B - u_A) + (\theta_A - \theta_B)}. \tag{9.5} \]
Thus, norm change takes place only if the population is loose enough, while tighter cul-
tures are more resistant to change. Further, note that θ∞A − θ∞B = 1 when xA > xB.
Figure 9.2: Left: Heatmap of the right-hand side in (9.5) when xB = 0.1, for various
uB−uA and k values. Right: Heatmap of the right-hand side in (9.7), for various uB−uA
and k values. Best viewed in color.
Thus, when the conformist fitness measure is a step-function θ∞i , (9.5) becomes: m <
(uB − uA)/(uB − uA + 1) < 0.5. Thus, for θ∞i , norm change occurs only in loose cul-
tures where individuals weigh their individual payoff more than whether they conform to
others in the population. Figure 9.2 (left) shows a heatmap of how condition (9.5) varies
with uB − uA and k when xB = 0.1. We see that the bound on m increases as uB − uA
increases, i.e., a population becomes more likely to switch the norm. On increasing k, we
see that the bound on m decreases. This makes intuitive sense, since a higher k makes
the difference in fitness between A and B clearer. Thus, in tight cultures, where peo-
ple tend to agree more on what behaviors are appropriate vs. inappropriate in different
situations [GRN+11], a higher k would lead to more resistance to norm change.
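The bound in (9.5) is easy to evaluate numerically. The sketch below is an assumption-laden illustration (the function name max_conformism_for_change and the payoff values 0.15 and 0.85 are chosen only for this example); it computes the right-hand side of (9.5) for a population in which B is still a minority.

```python
import math

def theta(x_i, k):
    """Conformism fitness (9.2)."""
    return 1.0 / (1.0 + math.exp(-k * (x_i - 0.5)))

def max_conformism_for_change(x_B, u_A, u_B, k):
    """Right-hand side of (9.5); meaningful when x_B < 0.5, so that theta_A > theta_B.

    Norm change (x_B increasing) requires m to lie below the returned value.
    """
    gap = u_B - u_A
    theta_gap = theta(1.0 - x_B, k) - theta(x_B, k)  # theta_A - theta_B
    return gap / (gap + theta_gap)

# Illustrative values only: u_B - u_A = 0.7 and x_B = 0.1, as in Figure 9.2 (left).
for k in (1, 10, 100):
    print(k, round(max_conformism_for_change(0.1, 0.15, 0.85, k), 3))
```

For large k the value approaches (uB − uA)/(uB − uA + 1), the step-function bound discussed above.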
9.3.2 Rate of norm change in tight vs. loose cultures
We are now interested in studying the speed with which norms change in different
populations. Consider two possible values of m, namely m1 and m2, with m2 > m1 (i.e.,
m2 is a more conformist culture than m1). Let the corresponding values of ẋB be denoted
by ẋ1B and ẋ2B, respectively. Assume further that both m1 and m2 satisfy (9.5), i.e., norm
change takes place in both cultures. Analyzing the difference in the rates of change, from
(9.4) we get:
\[ \dot{x}_B^{2} - \dot{x}_B^{1} = x_B (1 - x_B)(m_2 - m_1) \left[ (\theta_B - \theta_A) - (u_B - u_A) \right]. \tag{9.6} \]
Note that when xB ≤ 0.5, θB − θA ≤ 0, which would mean: ẋ2B − ẋ1B ≤ 0, i.e., the more
conformist culture would be slow to change initially. To analyze the case when xB > 0.5,
let’s assume xB = 0.5 + ε, for ε > 0. Thus, xA = 0.5 − ε. From (9.6), we see that for
ẋ2B − ẋ1B > 0, the following condition needs to hold:
\[ \epsilon > \frac{\ln(1 + u_B - u_A) - \ln\bigl(1 - (u_B - u_A)\bigr)}{k}. \tag{9.7} \]
Figure 9.2 (right) plots a heatmap of how this bound varies with uB − uA and k. We see
that as k increases, the point at which the more conformist culture starts changing faster
moves closer to the point xB = 0.5. Note that when k → ∞, (9.7) reduces to ε > 0,
i.e., as soon as xB becomes a majority, greater conformism would produce a larger rate
of change.
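For completeness, one way to obtain the threshold in (9.7) from (9.6) is sketched below. We write xB = 0.5 + ε and xA = 0.5 − ε with ε > 0, use the logistic form of the conformism measure in (9.2), and rely on xB(1 − xB)(m2 − m1) > 0 for 0 < xB < 1 and m2 > m1, and on uB − uA < 1.

```latex
\begin{align*}
\theta_B - \theta_A
  &= \frac{1}{1 + e^{-k\epsilon}} - \frac{1}{1 + e^{k\epsilon}}
   = \frac{e^{k\epsilon} - 1}{e^{k\epsilon} + 1}, \\
\dot{x}_B^{2} - \dot{x}_B^{1} > 0
  &\iff \frac{e^{k\epsilon} - 1}{e^{k\epsilon} + 1} > u_B - u_A \\
  &\iff e^{k\epsilon} > \frac{1 + (u_B - u_A)}{1 - (u_B - u_A)} \\
  &\iff \epsilon > \frac{\ln(1 + u_B - u_A) - \ln\bigl(1 - (u_B - u_A)\bigr)}{k}.
\end{align*}
```

Since the right-hand side is a fixed positive number divided by k, the threshold shrinks toward zero as k grows.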
Figure 9.3: Left: Plot of (9.4) at uB − uA = 0.7. Right: Heatmap of max_{xB} ẋB for various
k and m values, with uB − uA = 0.7. Best viewed in color.
We are also interested in studying how the rate of change ẋB varies with xB as a
population switches to norm B, and how this relates to different levels of conformism.
We first look at the maximum rate of change ẋmaxB = max_{xB} ẋB (we numerically calculate
this, given values for k, m and uB − uA). Figure 9.3 (right) plots ẋmaxB for different m
and k values, where we set uB − uA = 0.7. These values were chosen such that the norm
changes from A to B for all the considered combinations (using the bounds from Section
9.3.1). We see that when k is low, lower conformism leads to higher ẋmaxB . However, as
k increases (i.e., as θB approaches θ∞B ), there is a clear transition, where more conformist
cultures end up having a higher maximum rate of change.
This effect is clearer in the left plot of Figure 9.3, where we show how ẋB varies
with xB. We see that when k is low, i.e., when there is no clear difference between A
and B in their conformist fitness measure θ, ẋB changes slowly for both tight and loose
cultures, with the loose culture having a higher rate of change. With high k, however,
we see that the tighter culture faces a “tipping point”, which results in a sudden increase
in ẋB with the tighter culture adopting a higher rate of change ẋB than the loose culture.
Thus, in a more conformist culture with high k, initially peer pressure impedes the switch
to the more beneficial norm B. But once enough of the population has switched, a tipping
point is reached where peer pressure causes the rest of them to switch very rapidly.
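The maximum rate of change can be approximated with a simple grid search over xB. The sketch below is illustrative only: the grid resolution, the function names, and the two conformism levels (m = 0.1 for a looser culture, m = 0.3 for a tighter one) are assumptions chosen so that both satisfy the norm-change condition from Section 9.3.1 at uB − uA = 0.7.

```python
import math

def x_dot_B(x_B, gap, m, k):
    """Rate of change of x_B in closed form, with gap = u_B - u_A."""
    theta_B = 1.0 / (1.0 + math.exp(-k * (x_B - 0.5)))
    theta_A = 1.0 / (1.0 + math.exp(k * (x_B - 0.5)))
    return x_B * (1.0 - x_B) * ((1.0 - m) * gap + m * (theta_B - theta_A))

def max_rate(gap, m, k, steps=10_000):
    """Grid-search approximation of the maximum of x_dot_B over x_B in [0, 1]."""
    return max(x_dot_B(i / steps, gap, m, k) for i in range(steps + 1))

gap = 0.7
for k in (1, 50):
    loose = max_rate(gap, 0.1, k)
    tight = max_rate(gap, 0.3, k)
    print(f"k={k}: loose (m=0.1) max rate {loose:.3f}, tight (m=0.3) max rate {tight:.3f}")
```

Under these assumed values, the looser culture attains the larger maximum at k = 1 and the tighter culture attains the larger maximum at k = 50, in line with the transition described above.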
9.4 Discussion
In this chapter, we show that tight cultures sometimes experience a tipping point
for norm change while loose cultures typically face a more gradual change. The results
presented assume an infinite well-mixed population with two possible actions. We believe
it would be relatively straightforward to extend our results for multiple actions. Assuming
an infinite well-mixed population made our model mathematically tractable and allowed us to
provide exact conditions for norm change. As future work, it would be interesting to extend
this model to the finite population case, where interactions between individuals are
dictated by a social network.
Chapter 10: On the evolution of ethnocentrism in human cultures
10.1 Introduction
Nearly all major conflicts across the globe, both current and historical, are charac-
terized by individuals defining themselves and others in terms of their group membership.
Substantial empirical evidence supports people’s tendency to favor in-group members
and show hostility towards out-group individuals [Taj82, BK85, HRW02, BFF06, CL09].
From an evolutionary perspective, numerous studies have shown how in populations com-
prised of various groups, group-biased behavior that discriminates or is hostile against
out-groups evolves or emerges readily and dominantly [HA06,CB07,AOW+09,GvdB11,
FTC+12, MNVV12, HKS13]. Since humans are social beings who establish and define
groups constantly, the development of out-group hostility and resulting group conflict
might thus seem inevitable.
In contrast, however, statistics have shown that violence and outgroup conflict have
actually declined dramatically over the past few centuries of human civilization, suggest-
ing out-group hostility is not inevitable after all [Pin11a, Pin11b]. What factors might
lead to such a decrease in conflict? Evolutionary game-theoretic models can shed light on
this question by exploring how various factors affect the emergence and maintenance of
individuals’ behaviors relating to group conflict.
             Cooperate        Defect
Cooperate    b − c, b − c     −c, b
Defect       b, −c            0, 0
Figure 10.1: Prisoner’s Dilemma payoff matrix used in our model.
Our evolutionary game model builds on a prior model developed in Hammond
and Axelrod’s pioneering work [HA06] on the evolution of ethnocentrism, and used in
Hartshorn, et al. [HKS13]. In their model, agents had perceivable group tags, played
one-shot Prisoner’s Dilemma games with their neighbors, and could behave differently
toward in-group members than out-group members. Each agent’s inherited traits included
a group tag, an action (cooperate or defect) to use with in-group members, and a similar
action to use with out-group members. Thus there were four possible strategies: Co-
operate with both in-group and out-group members; Defect against both in-group and
out-group members; Ethnocentric (cooperate with in-group members, defect against out-
group members); Traitorous (defect against in-group members, cooperate with out-group
members).
Using their model with four different groups (or group tags), we have replicated
their result showing that after a period in which Cooperative agents are briefly abundant,
evolutionary pressure leads to a predominance of Ethnocentric agents. Defectors and
Traitors never establish themselves.
Figure 10.2: Sequence of events at each time step in our evolutionary game-theoretic
model. The sequence of steps is the same as in Hammond and Axelrod’s paper [HA06]
except for the Mobility stage, which is new. For additional details, see the Methods
section.
Since the agents in that model conditioned their actions only on the group tags, they
were in effect group-entitative. That leaves open the question whether there are conditions
under which individual-entitative agents – agents that base their actions on knowledge of
individuals per se rather than group tags – may be able to exist and perhaps even be
favored by evolutionary pressures.
Moreover, that model does not incorporate mobility. Research in cultural psychol-
ogy has demonstrated large empirical differences in residential mobility around the globe
with important psychological consequences [Lon91, Ang00]. Researchers have shown
that in high-mobility contexts, individuals change relationships often; they form new re-
lationships and sever unwanted relationships with great ease [OKM+13, OSYA15]. In
such contexts, having a broad network of weak ties and being open toward strangers
(with whom it might be valuable to form relationships) is highly adaptive. Indeed, Oishi,
et al. [OSYA15] observe that in highly mobile contexts, “since it is hard to keep track
of behaviors of many strangers whom one meets, one needs to carefully avoid being as-
sociated with defectors or free-riders in order to exploit the greatest possible relational
benefit” (p. 228). Thus, individuals are more likely to adopt strategies that try to evaluate
the “trustworthiness and worth” [OSYA15] of others in highly mobile contexts, i.e., adopt
individual-entitative strategies. On the other hand, in low-mobility contexts, individuals
have far fewer opportunities to form new relationships, and severing existing relationships
can have extreme adverse effects such as being ostracized from one’s only social cir-
cle [OSYA15], causing “the existential, social, and psychological death of the individual”
(p. 755) [Lan92]. Based on these theories, we would predict that group-entitative behavior
and associative ethnocentrism are adaptive in low-mobility societies, yet maladaptive
in high-mobility contexts, where individual-entitative strategies would be evolutionarily
favored.
We have run extensive new evolutionary simulations, augmenting the prior model to
include individual-entitative strategies and mobility; and our results show that the evolu-
tion of ethnocentrism is driven by low mobility. Indeed, our subsequent empirical analysis
of archival data verifies that contexts with high residential mobility have less out-group
hostility than those with low mobility.
In our evolutionary game model, agents are arranged on a toroidal (wrap-around)
grid, so that every node on the grid is connected to 4 neighboring nodes. Initially the grid
is empty. The sequence of events at each time step is shown in Figure 10.2; these are the
same as in Hammond and Axelrod’s paper [HA06] except for the Mobility stage, which
is new. For additional details, see the Methods section.
The agents’ strategies are similar to those in Hammond and Axelrod’s model [HA06],
where agents can distinguish between in-group and out-group members by observing the
group tags. Hence agents’ strategies can be conditioned on whether they are interacting
with in-group or out-group members. In addition, in our model, each agent’s strategies
can be conditioned on the past history of other agents. Each agent can either be group-
entitative or individual-entitative, and this is an inherited trait.
A group-entitative agent i ignores individual identities. Its actions toward an agent
j depend only on its last encounter with anyone in j’s group. It has two possibly different
strategies: one for in-groups and another for out-groups. Each of those strategies is one
of the following: AllC (always cooperate), AllD (always defect), TFT (Tit-for-Tat: play
whatever action the opponent played in i’s last interaction with anyone from j’s group),
or OTFT (play the opposite of what TFT would play). Note that this is unlike the models
in previous chapters where agents had no memory of prior interactions.
An individual-entitative agent i ignores other agents’ group tags; i’s action toward
j depends only on its last encounter specifically with j. Thus i has one of the above four
strategies, except that TFT and OTFT depend on i’s last interaction with j specifically,
rather than someone in j’s group.
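The difference between the two agent types can be made concrete with a small sketch. The class and function names below (Agent, choose_action, act, observe) are hypothetical, and the single first_move trait used when there is no history is a simplification adopted for this example rather than a description of our simulation code.

```python
from dataclasses import dataclass, field

def choose_action(strategy, last_seen, first_move):
    """Return 'C' or 'D' for AllC, AllD, TFT or OTFT given the last observed action."""
    if strategy == "AllC":
        return "C"
    if strategy == "AllD":
        return "D"
    if last_seen is None:  # no history yet: fall back to the first-move trait
        return first_move
    if strategy == "TFT":
        return last_seen
    if strategy == "OTFT":
        return "D" if last_seen == "C" else "C"
    raise ValueError(f"unknown strategy: {strategy}")

@dataclass
class Agent:
    ident: int
    tag: int
    group_entitative: bool
    in_strategy: str            # in-group strategy; the only strategy if individual-entitative
    out_strategy: str = "AllD"  # out-group strategy; ignored if individual-entitative
    first_move: str = "C"       # assumed single first-move trait (simplification)
    memory: dict = field(default_factory=dict)

    def _key(self, other):
        # Group-entitative agents remember groups; individual-entitative agents remember individuals.
        return other.tag if self.group_entitative else other.ident

    def act(self, other):
        if self.group_entitative and other.tag != self.tag:
            strategy = self.out_strategy
        else:
            strategy = self.in_strategy
        return choose_action(strategy, self.memory.get(self._key(other)), self.first_move)

    def observe(self, other, action):
        """Record the action that `other` just played against this agent."""
        self.memory[self._key(other)] = action
```

In this sketch, a group-entitative TFT agent that was defected against by one member of a group will defect against any other member of that group, whereas an individual-entitative TFT agent retaliates only against the specific defector.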
To model mobility, there is a probability m with which, at the beginning of each
iteration, an agent moves to a randomly chosen empty spot in the network. Thus a high
value of m represents a highly mobile population, while a low value of m represents a
population with low mobility. We vary m from 0 to 0.08 in our experiments. It is impor-
tant to note that a mobility probability of 0.08 is quite high: it means that on average, 8%
of the population move to different locations on each iteration – a substantial amount of
movement even for small values of m. At higher levels of mobility (m > 0.1), cooper-
ation breaks down in a society, and the majority of the population starts defecting – and
thus is not representative of any stable society around the world.
10.2 Results
Figure 10.3 shows our results after letting the populations evolve for 30,000 iter-
ations. Without mobility (i.e., m = 0), group-entitative agents comprise 75% of the
population. These agents’ strategies are predominantly out-group hostile (AllD) and in-
group cooperative (AllC). This is reasonably consistent with Hammond and Axelrod’s
model [HA06], but notice that even when m = 0, individual-entitative agents comprise
about 25% of the population.
As mobility increases, the evolutionary pressures shift to favor individual-entitative
agents. For m > 0.02 they comprise about 80% of the population, and about 70% of
them play TFT. Thus, the evolutionary dominance of group-entitative and ethnocentric
strategies is thwarted by mobility.
The reason why low mobility favors group-entitative strategies while higher mo-
bility favors individual-entitative strategies is related to the clustering of group members
(Figure 10.3). With low mobility, groups tend to cluster together heavily; hence agents
interact primarily with in-group members. Thus the ethnocentric strategy (i.e., group-
entitativity with in-group cooperation and out-group-hostility) is effective and profitable
in terms of payoff. Under higher mobility, however, agents are less clustered by group
membership, hence more likely to interact with out-group members, hence cannot rely on
high payoffs from in-group interactions. Furthermore, group-entitative strategies are less
effective because different group members are much less likely to have the same strategy.
This favors the individual-entitative Tit-for-Tat (TFT) strategy (see the Methods section
for more information on the clustering coefficient).
Figure 10.3: Proportions of actions and strategies as a function of mobility, after 30,000
iterations, averaged over 100 simulation runs. The plots show the proportions of (a) the
group-entitative and individual-entitative agents, (b) the actions played by the agents,
(c) the strategies of the individual-entitative agents, (d) the in-group and (e) out-group
strategies of the group-entitative agents, (f) the degree of clustering on the grid.
To illustrate the evolutionary trajectories that led to the reported results, Figures
10.4 and 10.5 show representative evolutionary trajectories for single simulation runs. In
Figure 10.4, there is no mobility. Group-entitative agents quickly become a majority, and
most of them are ethnocentric (in-group cooperative and out-group hostile). In Figure
10.5, the mobility probability is m = 0.05. Individual-entitative agents evolve to become
a majority, with most of those agents playing Tit-for-Tat (TFT).
To illustrate robustness of the results with our model, we also performed a series
of experiments where we initialized the population on the grid to have high clustering
of group-entitative or individual-entitative agents (instead of starting out with an empty
grid). In each case, we notice that the results are the same as when we start out with
an empty grid, i.e., group-entitative agents dominate under no mobility and individual-
entitative agents dominate under higher values of mobility. This is due to the exploration
dynamics (or mutation phase) in our model. The exploration dynamic has been shown
to be a key aspect of evolutionary game-theoretic models for cultural evolution [TSS+10,
ATTN12], and this ensures that our model remains robust to the initial conditions of the
grid.
10.2.1 Empirical analysis
In order to complement these modeling efforts, we also gathered data to test the notion
that mobility relates to lower ethnocentrism. We analyzed data from the U.S. Census
Bureau [Ren11, Wor98] that provides measures of mobility in the 50 U.S. states (defined
as the percentage of people born in the state of residence; reverse scored, with higher
scores being reflective of higher mobility) and data from the DDB Needham Life Style
Survey [Wor98]. We found that mobility was positively correlated with responses to the
question “I am interested in the cultures of other countries” (r = 0.614, p < .001), and
negatively correlated with responses to questions regarding ethnocentrism (e.g., “Americans
should always buy American products”, r = −0.654, p < .001; “The government
should restrict imported products”, r = −0.578, p < .001).
Figure 10.4: Single simulation run for 20000 generations with no mobility (m = 0). (a)
Proportions of group-entitative and individual-entitative agents. (b) Relative proportions
of the individual-entitative agents’ strategies; Relative proportions of the group-entitative
agents’ (c) in-group and (d) out-group strategies.
Figure 10.5: Single simulation run for 30000 generations with mobility probability m = 0.05.
(a) Proportions of group-entitative and individual-entitative agents. (b) Relative proportions
of the individual-entitative agents’ strategies; Relative proportions of the group-entitative
agents’ (c) in-group and (d) out-group strategies.
In addition, states that have higher mobility also have higher openness, one of the
big five personality dimensions, which is associated with breadth of experience and
interest in new ideas and other cultures (r = 0.321, p = .023) [JNS08].
10.3 Significance of the work
The evolution of cooperation has been of great scientific interest in many disci-
plines; and to date, many evolutionary and empirical studies have found that ingroup-
favoring and outgroup hostile behaviors are common. This has caused much concern that
group conflict and ethnocentrism are an inevitable threat on our planet. We integrate research
on group conflict with human mobility [Now06, OLS07, SYHT09, YSS09, Ois10,
SYM10, WEMG11], and show for the first time that the evolution of ethnocentrism and
group-entitative behavior is thwarted by high mobility. As mobility is rapidly changing
around the globe [OSYA15], this work predicts that group conflict will continue to
decrease, in line with Pinker’s historical analysis [Pin11a, Pin11b].
Mobility is an important and well-studied topic in cultural psychology [OLS07,
SYHT09, YSS09, Ois10, SYM10]. Low mobility leads to conditions where interacting
individuals are likely to be reproductively related and this has been shown to be important
in the evolution of cooperation [Now06, WEMG11]. In our model, we find mobility
plays a crucial role in the evolution of ethnocentrism in a society. More specifically, we
establish that low mobility leads to in-group cooperation and out-group hostility. High
mobility, on the other hand, leads to more individual-entitative behavior, where agents
take actions based on the specific individuals they interact with, and not based on the
group that those individuals belong to.
Another unique aspect of our model is that we allow for agents to have memory of
previous actions played by other agents and the possibility of individual-entitative agents,
where agents take actions based on the individual they are playing against rather than
their tag. In a society with high mobility, agents would be moving to different parts of the
grid, which leads to low clustering of agents belonging to the same group. Thus, agents
with group-entitative strategies suffer, which leads to the evolution of individual-entitative
strategies, with strategies like Tit-for-Tat gaining prominence. Under low mobility, on the
other hand, agents of the same group cluster together much more, and simple group-entitative
strategies like in-group cooperation and out-group hostility gain prominence.
It would be fruitful to incorporate mobility into other evolutionary game-theoretic
models of conflict in future research. Moreover, since mobility in a society could be mo-
tivated by multiple factors that could have divergent effects, it would be good for future
models of mobility and ethnocentrism to incorporate models of these motivations. Mo-
bility might reduce ethnocentrism when agents move for economic reasons, but mobility
might not reduce ethnocentrism if agents move primarily to be among other in-group
members. In all, our work shows for the first time that mobility is a critical factor that
affects the dynamics of conflict with important implications for theory and policy.
10.4 Methods
10.4.1 Evolutionary dynamics of our model
Here is a more detailed version of the sequence of steps in our evolutionary game-
theoretic model:
1. Birth: One agent with a random strategy appears at a random empty site, if such a
site exists.
2. Base Payoff: Each existing agent receives the base payoff from the environment;
we use a base payoff of 0.12 throughout our experiments.
3. Interaction Payoff: Each agent plays a game with each of its neighboring agents on
the grid, receiving payoffs according to the game definition. The game played by
the agents is the canonical 2-player cooperation/defection dilemma (Figure 10.1), with
the benefit of cooperation b = 0.03 and the cost of cooperation c = 0.01. In this
phase, the action chosen by an agent in each game depends on the type of agents
playing that game (i.e., group-entitative or individual-entitative agent) as well as
the type of strategy being used by that agent.
4. Fitness: Each agent is assigned fitness equal to the agent’s accumulated payoff.
5. Reproduction: In random order, each agent is given a chance to reproduce with
probability equal to its fitness. If an agent gets a chance to reproduce, it places an
offspring in a randomly chosen empty site in its neighborhood, if such a site exists.
The offspring has the same traits as its parent, with a mutation rate of µ = 0.005
per trait.
6. Death: Each agent has a probability d = 0.1 of dying. If an agent dies, it is removed
from the grid.
7. Mobility: Each agent has a probability m of moving to a randomly chosen empty
spot on the grid.
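Two of the steps above, the interaction payoff (step 3) and mobility (step 7), are sketched below. This is a simplified illustration under stated assumptions, not the simulation code itself: the dictionary-based grid representation, the function names, and the order in which agents are considered for moving are all hypothetical choices.

```python
import random

B, C = 0.03, 0.01  # benefit and cost of cooperation, as given in step 3

def pd_payoff(my_action, other_action):
    """Payoff to the focal agent in the Prisoner's Dilemma; actions are 'C' or 'D'."""
    gain = B if other_action == "C" else 0.0
    cost = C if my_action == "C" else 0.0
    return gain - cost

def mobility_step(grid, m, rng=random):
    """Move each agent, with probability m, to a randomly chosen empty site.

    `grid` maps (row, col) -> agent, or None for an empty site (assumed representation).
    Sites occupied at the start are visited once, so no agent moves twice in one step.
    """
    occupied = [site for site, agent in grid.items() if agent is not None]
    for site in occupied:
        if rng.random() < m:
            empty = [s for s, a in grid.items() if a is None]
            if not empty:
                break  # a full grid leaves nowhere to move
            target = rng.choice(empty)
            grid[target], grid[site] = grid[site], None
    return grid
```

With m = 0 the grid is left unchanged, and for any m the number of agents on the grid is preserved.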
10.4.2 Clustering coefficient
A clustering coefficient is a metric for measuring the amount of clustering of nodes
in a graph. We can measure the clustering of group tags by comparing the group tags of
the agents at neighboring locations. For each location (x, y) on the torus grid we consider
four triples, each consisting of (x, y) and a pair of adjacent neighboring locations:
1. location (x, y), the neighbor above it, and the neighbor to its left;
2. location (x, y), the neighbor above it, and the neighbor to its right;
3. location (x, y), the neighbor below it, and the neighbor to its left;
4. location (x, y), the neighbor below it, and the neighbor to its right.
Our clustering coefficient is the total number of triples that contain three agents
with the same group tag, divided by the total number of triples in the grid. For a torus
grid of size N × M, the denominator in our metric, i.e., the total number of triples, is
simply 4NM . The clustering coefficient lies in the range from 0 to 1, and is higher when
agents of the same tag cluster together on the grid, while being small when there is less
clustering of agents of the same tag.
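The metric can be computed directly from the definition. The sketch below assumes a list-of-lists grid in which each entry is a group tag or None for an empty site, and it assumes that triples containing an empty site do not count as matching; both are choices made for this example.

```python
def clustering_coefficient(grid):
    """Fraction of the 4*N*M triples whose three sites hold agents with the same tag.

    grid[r][c] is a group tag, or None for an empty site; indices wrap around (torus).
    """
    n_rows, n_cols = len(grid), len(grid[0])
    same = 0
    for r in range(n_rows):
        for c in range(n_cols):
            tag = grid[r][c]
            if tag is None:
                continue
            for dr in (-1, 1):      # neighbor above / below
                for dc in (-1, 1):  # neighbor to the left / right
                    vertical = grid[(r + dr) % n_rows][c]
                    horizontal = grid[r][(c + dc) % n_cols]
                    if vertical == tag and horizontal == tag:
                        same += 1
    return same / (4 * n_rows * n_cols)
```

A grid filled with a single tag gives a coefficient of 1, while a checkerboard of two tags gives 0.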
10.4.3 Strategy set
When an individual-entitative agent i interacts with an agent j that i has never
encountered before, or when a group-entitative agent i interacts with an agent from a
group that i has never encountered before, i must choose whether to cooperate or defect.
That choice is part of i’s strategy; and in our experiments we allowed both possibilities.
This doubled the total number of strategies in our simulations but made no meaningful
difference in the results – so in favor of simplicity and clarity, we did not discuss this
detail earlier.
10.4.4 Mutation rate
In Hammond and Axelrod’s model [HA06], during reproduction, an offspring will
have the same strategy s as its parent, except that for each trait in s, there is a small
probability µ = 0.05 that this trait will be changed to a randomly chosen one. Notice that
µ is not the probability that the offspring’s strategy will differ from s. Instead, for each
trait in s, it is the probability that this trait will be changed; and this happens independently
for each of the traits in s. Consequently, the probability that an offspring retains the exact
same strategy as its parent decreases as the number of traits increases (it is roughly (1 − µ)^n for an agent with n traits).
Our model has a higher number of different possible traits than Hammond and Ax-
elrod’s model. Thus, in order to maintain roughly the same probability that an offspring
will retain the same strategy as its parent, we needed to use a smaller value for µ.
In Hammond and Axelrod [HA06], each agent has 3 traits: the group tag and the
actions to take when playing against an in-group and an out-group agent. However, in
our model, the number of traits is significantly higher. Each group-entitative agent has 7
traits: the group tag, and traits specifying what action to take in each of the following six
situations:
1. when an in-group agent cooperated on the last meeting,
2. when an in-group agent defected on the last meeting,
3. the first time one meets an in-group agent,
4. when an out-group agent cooperated on the last meeting,
5. when an out-group agent defected on the last meeting,
6. the first time one meets an out-group agent.
Similarly, each individual-entitative agent in our model has 4 traits. Thus, in order
to ensure that the probability of an offspring retaining the same strategy as its parent is
similar to that in Hammond and Axelrod’s model, we needed to use a lower value for µ
than what they used. We used µ = 0.005.
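The effect of the number of traits on strategy retention can be checked with a one-line calculation. The sketch below assumes that a mutated trait always takes a different value, so that the retention probability is exactly (1 − µ)^n; if a mutated trait can be redrawn as its old value, the true probability is slightly higher. The function name is hypothetical.

```python
def retain_probability(mu, n_traits):
    """Probability that none of n_traits independent traits mutates (assumes every mutation changes the trait)."""
    return (1.0 - mu) ** n_traits

# Per-trait rates and trait counts as stated in this section.
print("3 traits, mu = 0.05: ", retain_probability(0.05, 3))
print("7 traits, mu = 0.005:", retain_probability(0.005, 7))
print("4 traits, mu = 0.005:", retain_probability(0.005, 4))
```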
10.4.5 Range of mobility
In our experiments, the reason we limited the mobility probability to 0 ≤ m ≤ 0.08
is that cooperation breaks down at higher levels of mobility, as shown in Figure 10.6. For
example, at m = 0.2, about 80% of all actions are defections. Figure 10.6 also shows the
reason for this breakdown. As m increases, the average number of games that an agent
plays with the same opponent decreases monotonically. For example, at m = 0.2, each
agent plays only 2 games with each opponent on average – which favors AllD rather than
Tit-for-Tat (TFT).
As a societal analogy, agents are more likely to defect against each other if no
agent interacts with any other agent long enough to create any kind of interpersonal ties.
For example, consider the limiting case where mobility probability m = 1.0, i.e., every
agent moves to a different location in the grid at every iteration. This condition is similar
to a well-mixed population, in which every agent can interact with every other agent
in the population, with no network structure defining the set of possible interactions.
In well-mixed populations, it is well established that in a cooperation game such as the
Prisoner’s Dilemma, the society devolves into universal defection.
Figure 10.6: Cooperation breaking down at higher mobility values. Each data point is an
average of 100 individual simulation runs. The plots show (a) the proportion of agents
cooperating and defecting; and (b) over an agent’s lifetime, the average number of unique
opponents it encounters, and the average number of games played against each of them.
Bibliography
[ACH18] Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of
deep networks: Implicit acceleration by overparameterization. arXiv preprint
arXiv:1802.06509, 2018.
[AD11] Alekh Agarwal and John C Duchi. Distributed delayed stochastic optimization. In
Advances in Neural Information Processing Systems, pages 873–881, 2011.
[AHS15] Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. Fixed point optimization of
deep convolutional neural networks for object recognition. In ICASSP. IEEE, 2015.
[Ald95] John Aldrich. Correlations genuine and spurious in Pearson and Yule. Statistical
Science, pages 364–376, 1995.
[Ang00] Shlomo Angel. Housing policy matters: A global analysis. Oxford University
Press, 2000.
[AOW+09] Tibor Antal, Hisashi Ohtsuki, John Wakeley, Peter D Taylor, and Martin A Nowak.
Evolution of cooperation by phenotypic similarity. Proceedings of the National
Academy of Sciences, 106(21):8597–8600, 2009.
[ATTN12] Benjamin Allen, Arne Traulsen, Corina E Tarnita, and Martin A Nowak. How
mutation affects evolutionary games on graphs. Journal of theoretical biology,
299:97–105, 2012.
[BA99] Albert-László Barabási and Réka Albert. Emergence of scaling in random net-
works. Science, 286(5439):509–512, 1999.
[BB88] Jonathan Barzilai and Jonathan M Borwein. Two-point step size gradient methods.
IMA Journal of Numerical Analysis, 8(1):141–148, 1988.
[BCNW12] Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size se-
lection in optimization methods for machine learning. Mathematical programming,
134(1):127–155, 2012.
[BD99] Jock A Blackard and Denis J Dean. Comparative accuracies of artificial neural net-
works and discriminant analysis in predicting forest cover types from cartographic
variables. Computers and electronics in agriculture, 24(3):131–151, 1999.
[Ben12] Yoshua Bengio. Practical recommendations for gradient-based training of deep
architectures. In Neural networks: Tricks of the trade, pages 437–478. Springer,
2012.
[Ber11] Dimitri P Bertsekas. Incremental gradient, subgradient, and proximal methods for
convex optimization: A survey. Optimization for Machine Learning, 2010(1-38):3,
2011.
[BFF06] Helen Bernhard, Urs Fischbacher, and Ernst Fehr. Parochial altruism in humans.
Nature, 442(7105):912–915, 2006.
[BFL+17] David Balduzzi, Marcus Frean, Lennox Leary, JP Lewis, Kurt Wan-Duo Ma, and
Brian McWilliams. The shattered gradients problem: If resnets are the answer, then
what is the question? arXiv preprint arXiv:1702.08591, 2017.
[Bic05] Cristina Bicchieri. The grammar of society: The nature and dynamics of social
norms. Cambridge University Press, 2005.
[BIL+15] Carlo Baldassi, Alessandro Ingrosso, Carlo Lucibello, Luca Saglietti, and Riccardo
Zecchina. Subdominant dense clusters allow for simple learning and high compu-
tational performance in neural networks with discrete synapses. Physical review
letters, 115(12):128101, 2015.
[BK85] Marilynn B Brewer and Roderick M Kramer. The psychology of intergroup atti-
tudes and behavior. Annual review of psychology, 36(1):219–243, 1985.
[Blu93] Lawrence E Blume. The statistical mechanics of strategic interaction. Games and
Economic Behavior, 5(3):387–424, 1993.
[BMEWL11] Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The
million song dataset. In Proceedings of the 12th International Conference on Music
Information Retrieval (ISMIR 2011), 2011.
[Bot09] Léon Bottou. Curiously fast convergence of some stochastic gradient descent al-
gorithms. In Proceedings of the symposium on learning and data science, Paris,
2009.
[Bot12] Léon Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the
Trade, pages 421–436. Springer, 2012.
[BSF94] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term depen-
dencies with gradient descent is difficult. IEEE transactions on neural networks,
5(2):157–166, 1994.
[BSW14] Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Searching for exotic particles
in high-energy physics with deep learning. Nature communications, 5, 2014.
[BT89] Dimitri P Bertsekas and John N Tsitsiklis. Parallel and distributed computation:
numerical methods. Prentice-Hall, Inc., 1989.
[BTPG15] Guillaume Bouchard, Théo Trouillon, Julien Perez, and Adrien Gaidon. Accel-
erating stochastic gradient descent via online learning to sample. arXiv preprint
arXiv:1506.09016, 2015.
[BVL13] Daniel Balliet and Paul AM Van Lange. Trust, punishment, and cooperation across
18 societies: A meta-analysis. Perspectives on Psychological Science, 8(4):363–379,
2013.
[CB07] Jung-Kyoo Choi and Samuel Bowles. The coevolution of parochial altruism and
war. Science, 318(5850):636–640, 2007.
[CB15] Damon Centola and Andrea Baronchelli. The spontaneous emergence of conven-
tions: An experimental study of cultural evolution. Proceedings of the National
Academy of Sciences, 112(7):1989–1994, 2015.
[CBD15] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect:
Training deep neural networks with binary weights during propagations. In NIPS,
2015.
[CHS+16] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Ben-
gio. Binarized neural networks: Training deep neural networks with weights and
activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
[CKF11] Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. Torch7: A Matlab-like
environment for machine learning. In BigLearn, NIPS Workshop, 2011.
[CL09] Yan Chen and Sherry Xin Li. Group identity and social preferences. The American
Economic Review, 99(1):431–457, 2009.
[CLC13] Wei Chen, Laks VS Lakshmanan, and Carlos Castillo. Information and influence
propagation in social networks. Synthesis Lectures on Data Management, 5(4):1–
177, 2013.
[CR16] Dominik Csiba and Peter Richtárik. Importance sampling for minibatches. arXiv
preprint arXiv:1602.02283, 2016.
[CSML15] Zhiyong Cheng, Daniel Soudry, Zexi Mao, and Zhenzhong Lan. Training binary
multilayer neural networks for image classification using expectation backpropa-
gation. arXiv preprint arXiv:1503.03562, 2015.
[DBLJ14] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental
gradient method with support for non-strongly convex composite objectives. In
Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
[DCM+12] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao,
Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed
deep networks. In Advances in Neural Information Processing Systems, pages
1223–1231, 2012.
[DD+14] Aaron Defazio, Justin Domke, et al. Finito: A faster, permutable incremental gra-
dient method for big data problems. In Proceedings of The 31st International Con-
ference on Machine Learning, pages 1125–1133, 2014.
[DG16] Soham De and Tom Goldstein. Efficient distributed SGD with variance reduction.
In 2016 IEEE International Conference on Data Mining. IEEE, 2016.
[DYJG16] Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Big Batch SGD:
Automated inference using adaptive batch sizes. arXiv preprint arXiv:1610.05792,
2016.
[EH14] Jean Ensminger and Joseph Henrich. Experimenting with social norms: Fairness
and punishment in cross-cultural perspective. Russell Sage Foundation, 2014.
[EK10] David Easley and Jon Kleinberg. Networks, crowds, and markets: Reasoning about
a highly connected world. Cambridge University Press, 2010.
[FS12] Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods
for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.
[FTC+12] Feng Fu, Corina E Tarnita, Nicholas A Christakis, Long Wang, David G Rand, and
Martin A Nowak. Evolution of in-group favoritism. Scientific reports, 2:460, 2012.
[GAGN15] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan.
Deep learning with limited numerical precision. In ICML, 2015.
[GB10] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep
feedforward neural networks. In Proceedings of the thirteenth international con-
ference on artificial intelligence and statistics, pages 249–256, 2010.
[GDG+17] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski,
Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large
minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677,
2017.
[GOP15] Mert Gürbüzbalaban, Asu Ozdaglar, and Pablo Parrilo. Why random reshuffling
beats stochastic gradient descent. arXiv preprint arXiv:1510.08560, 2015.
[GRN+11] Michele J Gelfand, Jana L Raver, Lisa Nishii, Lisa M Leslie, Janetta Lun,
Beng Chong Lim, Lili Duan, Assaf Almaliach, Soon Ang, Jakobina Arnadottir,
et al. Differences between tight and loose cultures: A 33-nation study. Science,
332(6033):1100–1104, 2011.
[GS16] Tom Goldstein and Christoph Studer. Phasemax: Convex phase retrieval via basis
pursuit. arXiv preprint arXiv:1610.07531, 2016.
[GSB14] Tom Goldstein, Christoph Studer, and Richard Baraniuk. A field guide to forward-
backward splitting with a FASTA implementation. arXiv eprint, abs/1411.3406,
2014.
[GvdB11] Julián García and Jeroen CJM van den Bergh. Evolution of parochial altruism by
multilevel selection. Evolution and Human Behavior, 32(4):277–287, 2011.
[HA06] Ross A Hammond and Robert Axelrod. The evolution of ethnocentrism. Journal
of Conflict Resolution, 50(6):926–936, 2006.
[HAV+15] Reza Harikandeh, Mohamed Osama Ahmed, Alim Virani, Mark Schmidt, Jakub
Konečný, and Scott Sallinen. Stop wasting my gradients: Practical SVRG. In
Advances in Neural Information Processing Systems, pages 2242–2250, 2015.
[HCS+16] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua
Bengio. Quantized neural networks: Training neural networks with low precision
weights and activations. arXiv preprint arXiv:1609.07061, 2016.
[HEM+10] Joseph Henrich, Jean Ensminger, Richard McElreath, Abigail Barr, Clark Barrett,
Alexander Bolyanatz, Juan Camilo Cardenas, Michael Gurven, Edwins Gwako,
Natalie Henrich, et al. Markets, religion, community size, and the evolution of
fairness and punishment. Science, 327(5972):1480–1484, 2010.
[HF92] Markus Höhfeld and Scott E Fahlman. Probabilistic rounding in neural network
learning with limited precision. Neurocomputing, 4(6):291–299, 1992.
[HG14] Jesse R Harrington and Michele J Gelfand. Tightness–looseness across the 50
United States. Proceedings of the National Academy of Sciences, 111(22):7990–
7995, 2014.
[HKS13] Max Hartshorn, Artem Kaznatcheev, and Thomas Shultz. The evolutionary domi-
nance of ethnocentric cooperation. Journal of Artificial Societies and Social Simu-
lation, 16(3):7, 2013.
[HMB+06] Joseph Henrich, Richard McElreath, Abigail Barr, Jean Ensminger, Clark Barrett,
Alexander Bolyanatz, Juan Camilo Cardenas, Michael Gurven, Edwins Gwako,
Natalie Henrich, et al. Costly punishment across human societies. Science,
312(5781):1767–1770, 2006.
[HO01] Michael Hechter and Karl-Dieter Opp. Social norms. Russell Sage Foundation,
2001.
[HRW02] Miles Hewstone, Mark Rubin, and Hazel Willis. Intergroup bias. Annual review of
psychology, 53(1):575–604, 2002.
[HS84] Josef Hofbauer and Karl Sigmund. Evolutionstheorie und dynamische Systeme.
Parey, 1984.
[HS03] Josef Hofbauer and Karl Sigmund. Evolutionary game dynamics. Bulletin of the
American Mathematical Society, 40(4):479–519, 2003.
[HS14] Kyuyeon Hwang and Wonyong Sung. Fixed-point feedforward deep neural network
design using weights +1, 0, and -1. In IEEE Workshop on Signal Processing
Systems (SiPS), 2014.
[HTG08] Benedikt Herrmann, Christian Thöni, and Simon Gächter. Antisocial punishment
across societies. Science, 319(5868):1362–1367, 2008.
[HYOR14] Dirk Helbing, Wenjian Yu, Karl-Dieter Opp, and Heiko Rauhut. Conditions for
the emergence of shared norms in populations with incompatible preferences. PloS
one, 9(8):e104207, 2014.
[HZRS16a] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning
for image recognition. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 770–778, 2016.
[HZRS16b] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learn-
ing for Image Recognition. In CVPR, 2016.
[IS15a] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Net-
work Training by Reducing Internal Covariate Shift. 2015.
[IS15b] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating
deep network training by reducing internal covariate shift. arXiv preprint
arXiv:1502.03167, 2015.
[Jac10] Matthew O Jackson. Social and economic networks. Princeton university press,
2010.
[JKV10] Stephen Judd, Michael Kearns, and Yevgeniy Vorobeychik. Behavioral dynamics
and influence in networked coloring and consensus. Proceedings of the National
Academy of Sciences, 107(34):14978–14982, 2010.
[JNS08] Oliver P John, Laura P Naumann, and Christopher J Soto. Paradigm shift to the
integrative big five trait taxonomy. Handbook of personality: Theory and research,
3:114–158, 2008.
[JZ13] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using pre-
dictive variance reduction. In Advances in Neural Information Processing Systems,
pages 315–323, 2013.
[KB14] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization.
arXiv preprint arXiv:1412.6980, 2014.
[KB15] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization.
ICLR, 2015.
[KH09] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from
tiny images, 2009.
[KJTW09] Michael Kearns, Stephen Judd, Jinsong Tan, and Jennifer Wortman. Behavioral
experiments on biased voting in networks. Proceedings of the National Academy
of Sciences, 106(5):1347–1352, 2009.
[KLRT14] Jakub Konečný, Jie Liu, Peter Richtárik, and Martin Takáč. mS2GD: Mini-batch
semi-stochastic gradient descent in the proximal setting. arXiv preprint
arXiv:1410.4744, 2014.
[KMN+16] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy,
and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization
gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
[KNS16] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient
and proximal-gradient methods under the Polyak-Łojasiewicz condition. In
Joint European Conference on Machine Learning and Knowledge Discovery in
Databases, pages 795–811. Springer, 2016.
[KR13] Jakub Konečný and Peter Richtárik. Semi-stochastic gradient descent methods.
arXiv preprint arXiv:1312.1666, 2013.
[KS15] Minje Kim and Paris Smaragdis. Bitwise neural networks. In ICML Workshop on
Resource-Efficient Machine Learning, 2015.
[KS17] Nitish Shirish Keskar and Richard Socher. Improving generalization performance
by switching from Adam to SGD. arXiv preprint arXiv:1712.07628, 2017.
[KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification
with deep convolutional neural networks. In Advances in neural information pro-
cessing systems, pages 1097–1105, 2012.
[KYC+12] Farshad Kooti, Haeryun Yang, Meeyoung Cha, P Krishna Gummadi, and Winter A
Mason. The emergence of conventions in online social networks. In ICWSM, 2012.
[Lan92] Hope Landrine. Clinical implications of cultural differences: The referential versus
the indexical self. Clinical Psychology Review, 12(4):401–415, 1992.
[LASY14] Mu Li, David G Andersen, Alex J Smola, and Kai Yu. Communication efficient
distributed machine learning with the parameter server. In Advances in Neural
Information Processing Systems, pages 19–27, 2014.
[Lax07] P.D. Lax. Linear Algebra and Its Applications. Number v. 10 in Linear algebra and
its applications. Wiley, 2007.
[LBBH98] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based
learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–
2324, 1998.
[LCMB16] Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, and Yoshua Bengio. Neu-
ral networks with few multiplications. ICLR, 2016.
[Lev03] Simon Levin. Complex adaptive systems: exploring the known, the unknown and
the unknowable. Bulletin of the American Mathematical Society, 40(1):3–19, 2003.
[LGRC12] Janette Lehmann, Bruno Gonçalves, José J Ramasco, and Ciro Cattuto. Dynamical
classes of collective attention in Twitter. In Proceedings of the 21st international
conference on World Wide Web, pages 251–260. ACM, 2012.
[LHLL15] Xiangru Lian, Yijun Huang, Yuncheng Li, and Ji Liu. Asynchronous parallel
stochastic gradient for nonconvex optimization. In Advances in Neural Informa-
tion Processing Systems, pages 2719–2727, 2015.
[LMK+13] Yu-Ru Lin, Drew Margolin, Brian Keegan, Andrea Baronchelli, and David Lazer.
#bigbirds never die: Understanding social dynamics of emergent hashtag. arXiv
preprint arXiv:1303.7144, 2013.
[LNS12] Guanghui Lan, Arkadi Nemirovski, and Alexander Shapiro. Validation analysis
of mirror descent stochastic approximation method. Mathematical programming,
134(2):425–458, 2012.
[Lon91] Larry Long. Residential mobility differences among developed countries. Interna-
tional Regional Science Review, 14(2):133–147, 1991.
[LPW09] David Asher Levin, Yuval Peres, and Elizabeth Lee Wilmer. Markov chains and
mixing times. American Mathematical Soc., 2009.
[LTA16] Darryl Lin, Sachin Talathi, and Sreekanth Annapureddy. Fixed point quantization
of deep convolutional networks. In ICML, 2016.
[LZL16] Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. arXiv preprint
arXiv:1605.04711, 2016.
[MB11] Eric Moulines and Francis R Bach. Non-asymptotic analysis of stochastic approx-
imation algorithms for machine learning. In Advances in Neural Information Pro-
cessing Systems, pages 451–459, 2011.
[MBB17] Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Un-
derstanding the effectiveness of sgd in modern over-parametrized learning. arXiv
preprint arXiv:1712.06559, 2017.
[Mer38] Robert K Merton. Science and the social order. Philosophy of Science, 5(3):321–
337, 1938.
[MG15] James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored
approximate curvature. In International Conference on Machine Learning,
pages 2408–2417, 2015.
[MH15] Maren Mahsereci and Philipp Hennig. Probabilistic line searches for stochastic
optimization. In Advances In Neural Information Processing Systems, pages 181–
189, 2015.
[MLM16] Daisuke Miyashita, Edward H Lee, and Boris Murmann. Convolutional neural
networks using logarithmic data representation. arXiv preprint arXiv:1603.01025,
2016.
[MNVV12] Melissa M McDonald, Carlos David Navarrete, and Mark Van Vugt. Evolution and
the psychology of intergroup conflict: The male warrior hypothesis. Phil. Trans. R.
Soc. B, 367(1589):670–679, 2012.
[MOPU93] Michele Marchesi, Gianni Orlandi, Francesco Piazza, and Aurelio Uncini. Fast
neural networks without multipliers. IEEE Transactions on Neural Networks,
4(1):53–62, 1993.
[MPP+15] Horia Mania, Xinghao Pan, Dimitris Papailiopoulos, Benjamin Recht, Kannan
Ramchandran, and Michael I Jordan. Perturbed iterate analysis for asynchronous
stochastic optimization. arXiv preprint arXiv:1507.06970, 2015.
[MR16] Aryan Mokhtari and Alejandro Ribeiro. Dsa: Decentralized double stochastic av-
eraging gradient algorithm. Journal of Machine Learning Research, 17(61):1–35,
2016.
[Nes83] Yurii Nesterov. A method of solving a convex programming problem with convergence
rate O(1/k^2). 1983.
[Nes13] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, vol-
ume 87. Springer Science & Business Media, 2013.
[Now06] Martin A Nowak. Five rules for the evolution of cooperation. Science,
314(5805):1560–1563, 2006.
[NWC+11] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and An-
drew Y Ng. Reading digits in natural images with unsupervised feature learning.
In NIPS workshop on deep learning and unsupervised feature learning, volume
2011, page 4. Granada, Spain, 2011.
[NWS14] Deanna Needell, Rachel Ward, and Nati Srebro. Stochastic gradient descent,
weighted sampling, and the randomized kaczmarz algorithm. In Advances in Neu-
ral Information Processing Systems, pages 1017–1025, 2014.
[Ois10] Shigehiro Oishi. The psychology of residential mobility: Implications for the
self, social relationships, and well-being. Perspectives on Psychological Science,
5(1):5–21, 2010.
[OKM+13] Shigehiro Oishi, Selin Kesebir, Felicity F Miao, Thomas Talhelm, Yumi Endo,
Yukiko Uchida, Yasufumi Shibanai, and Vinai Norasakkunkit. Residential mobility
increases motivation to expand social network: But why? Journal of Experimental
Social Psychology, 49(2):217–223, 2013.
[OLS07] Shigehiro Oishi, Janetta Lun, and Gary D Sherman. Residential mobility, self-
concept, and positive affect in social interactions. Journal of personality and social
psychology, 93(1):131, 2007.
[OSYA15] Shigehiro Oishi, Joanna Schug, Masaki Yuki, and Jordan Axt. The psychology
of residential and relational mobilities. Handbook of advances in culture
and psychology, 5:221–272, 2015.
[Pin11a] Steven Pinker. The better angels of our nature: Why violence has declined, vol-
ume 75. Viking New York, 2011.
[Pin11b] Steven Pinker. Decline of violence: Taming the devil within us. Nature,
478(7369):309–311, 2011.
[PLT+16] Xinghao Pan, Maximilian Lam, Stephen Tu, Dimitris Papailiopoulos, Ce Zhang,
Michael I Jordan, Kannan Ramchandran, Chris Re, and Benjamin Recht. Cyclades:
Conflict-free asynchronous machine learning. arXiv preprint arXiv:1605.09721,
2016.
[PN02] Karen M Page and Martin A Nowak. Unifying evolutionary dynamics. Journal of
Theoretical Biology, 219(1):93–98, 2002.
[Pol63] Boris Teodorovich Polyak. Gradient methods for minimizing functionals. Zhurnal
Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 3(4):643–653, 1963.
[Pro01] Danil Prokhorov. IJCNN 2001 neural network competition. Slide presentation in
IJCNN, 1, 2001.
[RDS+15] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean
Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al.
Imagenet Large Scale Visual Recognition Challenge. IJCV, 2015.
[Ren11] Ping Ren. Lifetime mobility in the United States: 2010. US Department of Com-
merce, Economics and Statistics Administration, US Census Bureau, 2011.
[RGNL15] Patrick Roos, Michele Gelfand, Dana Nau, and Janetta Lun. Societal threat and
cultural variation in the strength of social norms: An evolutionary basis. Organiza-
tional Behavior and Human Decision Processes, 129:14–23, 2015.
[RHBL07] Marc Aurelio Ranzato, Fu Jie Huang, Y-Lan Boureau, and Yann LeCun. Unsuper-
vised learning of invariant feature hierarchies with applications to object recogni-
tion. In Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Confer-
ence on, pages 1–8. IEEE, 2007.
[RHS+15] Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, and Alex J Smola.
On variance reduction in stochastic gradient descent and its asynchronous variants.
In Advances in Neural Information Processing Systems, pages 2629–2637, 2015.
[RM51] Herbert Robbins and Sutton Monro. A stochastic approximation method. The
annals of mathematical statistics, pages 400–407, 1951.
[RORF16] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-
Net: ImageNet Classification Using Binary Convolutional Neural Networks.
ECCV, 2016.
[RRWN11] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-
free approach to parallelizing stochastic gradient descent. In Advances in Neural
Information Processing Systems, pages 693–701, 2011.
[RSB12] Nicolas L Roux, Mark Schmidt, and Francis R Bach. A stochastic gradient method
with an exponential convergence rate for finite training sets. In Advances in Neural
Information Processing Systems, pages 2663–2671, 2012.
[RSS11] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient
descent optimal for strongly convex stochastic optimization. arXiv preprint
arXiv:1109.5647, 2011.
[Sha16] Ohad Shamir. Without-replacement sampling for stochastic gradient methods:
Convergence results and application to distributed optimization. arXiv preprint
arXiv:1603.00570, 2016.
[SHM14] Daniel Soudry, Itay Hubara, and Ron Meir. Expectation backpropagation:
Parameter-free training of multilayer neural networks with continuous or discrete
weights. In NIPS, 2014.
[SMDH13] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the impor-
tance of initialization and momentum in deep learning. In International conference
on machine learning, pages 1139–1147, 2013.
[Smi82] John Maynard Smith. Evolution and the Theory of Games. Cambridge University
Press, 1982.
[SP73] J. M. Smith and G. R. Price. The logic of animal conflict. Nature, 246:15–18, 1973.
[SRB13] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with
the stochastic average gradient. arXiv preprint arXiv:1309.2388, 2013.
[SS14] Ohad Shamir and Nathan Srebro. Distributed stochastic optimization and learning.
In Communication, Control, and Computing (Allerton), 2014 52nd Annual Allerton
Conference on, pages 850–857. IEEE, 2014.
[SYHT09] Joanna Schug, Masaki Yuki, Hiroki Horikawa, and Kosuke Takemura. Similar-
ity attraction and actually selecting similar others: How cross-societal differences
in relational mobility affect interpersonal similarity in Japan and the USA. Asian
Journal of Social Psychology, 12(2):95–103, 2009.
[SYM10] Joanna Schug, Masaki Yuki, and William Maddux. Relational mobility explains
between-and within-culture differences in self-disclosure to close friends. Psycho-
logical Science, 2010.
[SZ13] Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth opti-
mization: Convergence results and optimal averaging schemes. In International
Conference on Machine Learning, pages 71–79, 2013.
[SZ14] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for
large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[SZ15] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for
Large-Scale Image Recognition. In ICLR, 2015.
[SZL13] Tom Schaul, Sixin Zhang, and Yann LeCun. No more pesky learning rates. In
Proceedings of The 30th International Conference on Machine Learning, pages
343–351, 2013.
[Taj82] Henri Tajfel. Social psychology of intergroup relations. Annual review of psychol-
ogy, 33(1):1–39, 1982.
[THDS+09] Arne Traulsen, Christoph Hauert, Hannelore De Silva, Martin A Nowak, and Karl
Sigmund. Exploration dynamics in evolutionary games. Proceedings of the Na-
tional Academy of Sciences, 106(3):709–712, 2009.
[TJ78] Peter D Taylor and Leo B Jonker. Evolutionary stable strategies and game dynam-
ics. Mathematical Biosciences, 40(1-2):145–156, 1978.
[TMDQ16] Conghui Tan, Shiqian Ma, Yu-Hong Dai, and Yuqiu Qian. Barzilai-Borwein step
size for stochastic gradient descent. arXiv preprint arXiv:1605.04131, 2016.
[TSS+10] Arne Traulsen, Dirk Semmann, Ralf D Sommerfeld, Hans-Jürgen Krambeck, and
Manfred Milinski. Human strategy updating in evolutionary games. Proceedings
of the National Academy of Sciences, 107(7):2962–2966, 2010.
[Ver] Roman Vershynin. High-dimensional probability.
[WCSX13] Chong Wang, Xi Chen, Alex J Smola, and Eric P Xing. Variance reduction for
stochastic gradient optimization. In Advances in Neural Information Processing
Systems, pages 181–189, 2013.
[WEMG11] Stuart A West, Claire El Mouden, and Andy Gardner. Sixteen common misconcep-
tions about the evolution of cooperation in humans. Evolution and Human Behav-
ior, 32(4):231–262, 2011.
[Wor98] DDB Worldwide. The DDB Life Style Survey Data. http://bowlingalone.
com/?page_id=7, 1998. [Online; accessed 20-October-2015].
[WRLG18] Yuhuai Wu, Mengye Ren, Renjie Liao, and Roger Grosse. Understanding short-
horizon bias in stochastic meta-optimization. arXiv preprint arXiv:1803.02021,
2018.
[WRS+17] Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin
Recht. The marginal value of adaptive gradient methods in machine learning. In
Advances in Neural Information Processing Systems, pages 4151–4161, 2017.
[WS98] Duncan J Watts and Steven H Strogatz. Collective dynamics of ‘small-world’ networks.
Nature, 393(6684):440–442, 1998.
[XZ14] Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive
variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
[You01] H Peyton Young. Individual strategy and social structure: An evolutionary theory
of institutions. Princeton University Press, 2001.
[YSS09] Toshio Yamagishi, Naoto Suzuki, and M Schaller. An institutional approach to
culture. Evolution, culture, and the human mind, pages 185–203, 2009.
[ZCL15] Sixin Zhang, Anna E Choromanska, and Yann LeCun. Deep learning with elastic
averaging sgd. In Advances in Neural Information Processing Systems, pages 685–
693, 2015.
[Zei12] Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint
arXiv:1212.5701, 2012.
[ZHMD17] Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quan-
tization. ICLR, 2017.
[ZK16] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint
arXiv:1605.07146, 2016.
[ZLS09] Martin Zinkevich, John Langford, and Alex J Smola. Slow learners are fast. In
Advances in Neural Information Processing Systems, pages 2331–2339, 2009.
[ZWLS10] Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J Smola. Parallelized
stochastic gradient descent. In Advances in neural information processing systems,
pages 2595–2603, 2010.
[ZWN+16] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou.
DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth
gradients. arXiv preprint arXiv:1606.06160, 2016.
[ZYG+17] Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental
network quantization: Towards lossless CNNs with low-precision weights. ICLR,
2017.
[ZZX15] Leihan Zhang, Jichang Zhao, and Ke Xu. Who creates trends in online social
media: The crowd or opinion leaders? Journal of Computer-Mediated Communi-
cation, 21(1):1–16, 2015.