ABSTRACT

Title of dissertation: Fast optimization methods for machine learning, and game-theoretic models of cultural evolution

Soham De, Doctor of Philosophy, 2018

Dissertation directed by: Dr. Tom Goldstein and Dr. Dana Nau, Department of Computer Science

This thesis has two parts. In the first part, we explore fast stochastic optimization methods for machine learning.

Mathematical optimization is a backbone of modern machine learning. Most machine learning problems require optimizing some objective function that measures how well a model matches a data set, with the intention of drawing patterns and making decisions on new unseen data. The success of optimization algorithms in solving these problems is critical to the success of machine learning, and has enabled the research community to explore more complex machine learning problems that require bigger models and larger datasets.

Stochastic gradient descent (SGD) has become the standard optimization routine in machine learning, and in particular in deep neural networks, due to its impressive performance across a wide variety of tasks and models. SGD, however, can often be slow for neural networks with many layers and typically requires careful user oversight for setting hyperparameters properly. While innovations such as batch normalization and skip connections have helped alleviate some of these issues, why such innovations are required eludes full understanding, and it is worthwhile to gain deeper theoretical insights into these problems and to consider more advanced optimization methods specifically tailored towards training large complex models.

In this part of the thesis, we review and analyze some of the recent progress made in this direction, and develop new optimization algorithms that are provably fast, significantly easier to train, and require less user oversight.
We then discuss the theory of quantized networks, which use low-precision weights to compress and accelerate neural networks, and when and why they are trainable. Finally, we discuss some recent results on how the convergence of SGD is affected by the architecture of neural nets, and we show using theoretical analysis that wide networks train faster than narrow nets, and deeper networks train slower than shallow nets, an effect often observed in practice.

In the second part of the thesis, we study the evolution of cultural norms in human societies using game-theoretic models, drawing from research in cross-cultural psychology. Understanding human behavior and modeling how cultural norms evolve in different human societies is vital for designing policies and avoiding conflicts around the world. In this part, we explore ways to use computational game-theoretic techniques, and in particular evolutionary game-theoretic (EGT) models, to gain insight into why different human societies have different norms and behaviors.

We first describe an evolutionary game-theoretic model to study how norms change in a society, based on the idea that differences in the strength of norms across societies translate to different game-theoretic interaction structures and incentives. We identify conditions that determine when societies change their existing norms, when they are resistant to such change, and how this depends on the strength of norms in a society. Next, we extend this study to analyze the evolutionary relationships between the tendency to conform and how quickly a population reacts when conditions make a change in norm desirable. Our analysis identifies conditions when a tipping point is reached in a population, causing norms to change rapidly. Next we study conditions that affect the existence of group-biased behavior among humans (i.e., favoring others from the same group, and being hostile towards others from different groups).
Using an evolutionary game-theoretic model, we show that out-group hostility is dramatically reduced by mobility. Technological and societal advances over the past centuries have greatly increased the degree to which humans change physical locations, and our results show that in highly mobile societies, one's choice of action is more likely to depend on which individual one is interacting with, rather than the group to which the individual belongs.

Fast optimization methods for machine learning, and game-theoretic models of cultural evolution

by

Soham De

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2018

Advisory Committee:
Dr. Tom Goldstein, Co-Chair/Advisor
Dr. Dana S. Nau, Co-Chair/Advisor
Dr. David W. Jacobs
Dr. Michele J. Gelfand
Dr. John P. Dickerson

© Copyright by Soham De 2018

Acknowledgments

I am grateful to my advisors Dana Nau and Tom Goldstein for their constant support, encouragement and guidance over the years, and for giving me the freedom to pursue my own varied interests. Their brilliance and dedication have been an ongoing source of inspiration. I have been fortunate to also have the opportunity to work closely with Michele Gelfand from the Psychology department. The numerous fascinating and lively discussions with her and Dana were among the highlights during my time here.

I was lucky to have been part of an excellent department filled with many wonderful people. I would like to thank my other thesis defense committee members, David Jacobs and John Dickerson, for their insightful comments and encouragement. I would also like to thank Jodie, Sharron, Jenny and Tom Hurst, for always ensuring that everything ran smoothly in the CS department, and for going out of their way to help me on a few occasions.
This thesis would not have been possible without the help of my close collaborators Hao, Karthik, James M., Anirbit, Sohil, Abhay, Zheng, Xinyue and Patrick, to whom I am truly indebted. I would also like to thank Bhiksha Raj and Karen Livescu, with whom I was fortunate to work during my undergraduate years, and who were instrumental in developing my interests in machine learning.

I would also like to acknowledge the many friends I made while pursuing my PhD, as well as some of my older friends, all of whom made the last few years immensely enjoyable. These include Siddharth, Soumyadip, Agniv, Jia, Upamanyu, Rishov, Souvik, Piyana, Udit, Dipankar, Debdipta, Biswadip, Shawon, Arijit, Wrick, Sunandita, Kartik, Sudha, Manaswi, Bhaskar, Pallabi, Meethu, Sankha, Prashanth, Vicky, Karol, Emmy, Amit, Prarthana, Arunima, Amrita, Aritra, Anirban and others whom I am surely forgetting.

I would also like to thank my uncle and aunt in Maryland who have never made me feel too far from home. Finally I would like to thank my parents for their continued love and support, and for always encouraging me in every endeavor of my life.

Table of Contents

Acknowledgements
List of Tables
List of Figures

1 Introduction and organization of the thesis

I FAST & EFFICIENT TRAINING IN MACHINE LEARNING

2 Introduction, background and notation
  2.1 Machine learning as an optimization problem
  2.2 Stochastic gradient descent
  2.3 On the successes and drawbacks of SGD
  2.4 Contributions
  2.5 Table of notation

3 Automated inference using adaptive batch sizes
  3.1 Introduction
  3.2 Big Batch SGD
    3.2.1 Preliminaries and motivation
    3.2.2 A template for big batch SGD
  3.3 Convergence analysis
    3.3.1 Comparison to classical SGD
  3.4 Practical implementation with backtracking line search
  3.5 Adaptive step sizes using the Barzilai-Borwein estimate
    3.5.1 Convergence proof
    3.5.2 Practical implementation
  3.6 Experiments
    3.6.1 Convex experiments
    3.6.2 Neural network experiments
  3.7 Summary

4 Distributing SGD using variance reduction
  4.1 Introduction
  4.2 CentralVR algorithm: single-worker case
    4.2.1 Algorithm overview
    4.2.2 Permutation sampling
    4.2.3 Algorithm details for CentralVR
  4.3 Convergence analysis
  4.4 Distributed algorithms
    4.4.1 Synchronous version
    4.4.2 Asynchronous version
  4.5 Distributed variants of SVRG and SAGA
    4.5.1 Distributed SVRG
    4.5.2 Distributed SAGA
  4.6 Empirical results
    4.6.1 Single worker results
    4.6.2 Distributed results
  4.7 Summary

5 Investigating training methods for quantized neural nets
  5.1 Introduction
  5.2 Background and related work
  5.3 Algorithms for training quantized neural nets
  5.4 Convergence analysis
    5.4.1 Convergence of Stochastic Rounding (SR)
    5.4.2 Convergence of Binary Connect (BC)
  5.5 What about non-convex problems?
    5.5.1 Toy problem
    5.5.2 Asymptotic analysis of Stochastic Rounding
  5.6 Experiments
    5.6.1 A way forward: big batch training
  5.7 Conclusion

6 Why is SGD so fast for neural nets?
  6.1 Introduction
  6.2 SGD is fast when gradient confusion is low
    6.2.1 Conditions for even faster convergence
  6.3 Over-parameterized problems have low gradient confusion
    6.3.1 A simple case: linear regression
    6.3.2 Linear neural networks
    6.3.3 Extension to arbitrary depth linear networks
    6.3.4 More general neural networks
    6.3.5 Beyond linearly generated data
  6.4 Experiments
  6.5 Conclusion

II STUDYING THE EVOLUTION OF CULTURAL NORMS

7 Using game theory to study the evolution of cultural norms
  7.1 Evolutionary game theory in biology
  7.2 Modeling cultural evolution
  7.3 Contributions

8 Understanding norm change in human societies
  8.1 Introduction
  8.2 Proposed model
    8.2.1 Replicator dynamic on infinite well-mixed populations
    8.2.2 Agent simulations on finite networks
  8.3 Evolving exploration rates
  8.4 Significance of the work

9 Tipping points for norm change in human cultures
  9.1 Introduction
  9.2 Background and related work
  9.3 Proposed evolutionary game-theoretic model
    9.3.1 When does norm change occur?
    9.3.2 Rate of norm change in tight vs. loose cultures
  9.4 Discussion

10 On the evolution of ethnocentrism in human cultures
  10.1 Introduction
  10.2 Results
    10.2.1 Empirical analysis
  10.3 Significance of the work
  10.4 Methods
    10.4.1 Evolutionary dynamics of our model
    10.4.2 Clustering coefficient
    10.4.3 Strategy set
    10.4.4 Mutation rate
    10.4.5 Range of mobility

Bibliography

List of Tables

4.1 Distributed Algorithms Proposed

5.1 VGG-9 on CIFAR-10.

5.2 VGG-BC for CIFAR-10.

5.3 Top-1 test error after training with full-precision (ADAM), binarized weights (R-ADAM, SR-ADAM, BC-ADAM), and binarized weights with big batch size (Big SR-ADAM).

List of Figures

3.1 Convex experiments. Left to right: Ridge regression on MILLIONSONG; Logistic regression on COVERTYPE; Logistic regression on IJCNN1. The top row shows how the norm of the true gradient decreases with the number of epochs, the middle and bottom rows show the batch sizes and step sizes used on each iteration by the big batch methods. Here 'passes through the data' indicates number of epochs, while 'iterations' refers to the number of parameter updates used by the method (there may be multiple iterations during one epoch).

3.2 Neural Network Experiments. The three columns from left to right correspond to results for CIFAR-10, SVHN, and MNIST, respectively. The top row presents classification accuracies on the training set, while the bottom row presents classification accuracies on the test set.

4.1 Single Worker Results. Logistic regression on toy dataset; Ridge regression on toy data; Logistic regression on IJCNN1 dataset; Ridge regression on MILLIONSONG dataset. In each case CentralVR converges much faster than SVRG and SAGA.

4.2 Distributed Results on toy datasets for CentralVR-Sync and CentralVR-Async, compared to Distributed SVRG (Section 4.5.1), Distributed SAGA (Section 4.5.2), Parameter Server SVRG and EASGD. Left two plots: Convergence curve for Logistic and ridge regression on synthetic data over 192 nodes. Right two plots: Time required for convergence as number of local workers is increased (data on each local worker is constant, i.e., total data scales linearly with the number of local workers) for logistic and ridge regression.

4.3 Distributed Results on SUSY and MILLIONSONG for CentralVR-Sync and CentralVR-Async, compared to Distributed SVRG (Section 4.5.1), Distributed SAGA (Section 4.5.2), Parameter Server SVRG (Param Server SVRG) and EASGD. (Left two plots) Convergence curve for Logistic regression and ridge regression on SUSY over 500 nodes and on MILLIONSONG over 240 nodes. (Right two plots) Time required for convergence as number of local workers is increased.

5.1 The SR method starts at some location w (in this case 0), adds a perturbation to w, and then rounds. As the learning rate α gets smaller, the distribution of the perturbation gets "squished" near the origin, making the algorithm less likely to move. The "squishing" effect is the same for the part of the distribution lying to the left and to the right of w, and so it does not affect the relative probability of moving left or right.

5.2 Effect of shrinking the learning rate in SR vs BC on a toy problem. The left figure plots the objective function (5.8).
Histograms plot the distribution of the quantized weights over $10^6$ iterations. The top row of plots correspond to BC, while the bottom row is SR, for different learning rates α. As the learning rate α shrinks, the BC distribution concentrates on a minimizer, while the SR distribution stagnates.

5.3 Markov chain example with 3 states. In the right figure, we halved each transition probability for moving between states, with the remaining probability put on the self-loop. Notice that halving all the transition probabilities would not change the equilibrium distribution, and instead would only increase the mixing time of the Markov chain.

5.4 Percentage of weight changes during training of VGG-BC on CIFAR-10.

5.5 Effect of batch size on SR-ADAM when tested with ResNet-56 on CIFAR-10. (a) Test error vs epoch. Test error is reported with dashed lines, train error with solid lines. (b) Percentage of weight changes since initialization. (c) Percentage of weight changes every 5 epochs.

6.1 Simulation proof for Theorem 6.3.1. As the dimensionality of a random linear regression problem increases, the probability of violating the gradient confusion condition η > 0.1 vanishes.

6.2 How width affects convergence curves and gradient inner products.

6.3 How depth affects convergence curves and gradient inner products.

6.4 Effect of batch normalization and skip connections on a Wide ResNet

7.1 Graph of $\frac{1}{1 + e^{s(u_a - u_n)}}$, for $s = 5$ and $-1 \le u_a - u_n \le 1$.

8.1 Individual payoff matrices. $M_c$ denotes the coordination game and $M_f$ denotes the fixed-payoff game used in our model.

8.2 Weighted payoff matrix $M$ defined as $M = c M_c + (1 - c) M_f$.

8.3 Updated payoff matrix after assuming $a_c - b_c = a_f - b_c$ and adding a suitable constant to the payoffs in $M$ in Figure 8.2.

8.4 Figures show the change in the proportion of B agents with time with a well-mixed infinite population where reproduction is determined by the replicator dynamic with b > a.

8.5 Figure shows the rate of change of B agents versus the proportion of B agents, with a well-mixed infinite population where reproduction is determined by the replicator dynamic with b > a.

8.6 Simulations with the Fermi rule on a toroidal grid of size 2500. From top to bottom: c = 1.0, c = 0.75, c = 0.5. Initially: a = 1.0, b = 1.15. We use a structural shock at 2500 iterations, after which the payoffs become: a = 1.15, b = 1.0.

8.7 Replicator-mutator dynamic on an infinite well-mixed population with a = 0.4 and b = 0.6. The solid and dotted lines denote c = 0.05 and c = 0.3, respectively. The colors denote the exploration rates.

8.8 Simulations with the Fermi rule on a toroidal grid of size 2500, with structural shocks at intervals of 75 iterations. From left to right: c = 1.0, c = 0.8, c = 0.5. Initially: a = 1.0, b = 1.15. The left column shows proportions of norms A and B. The right column shows proportions of the population that use each different exploration rate.

9.1 Plot of (9.2) for different values of k.

9.2 Left: Heatmap of the right-hand side in (9.5) when $x_B = 0.1$, for various $u_B - u_A$ and $k$ values. Right: Heatmap of the right-hand side in (9.7), for various $u_B - u_A$ and $k$ values. Best viewed in color.

9.3 Left: Plot of (9.4) at $u_B - u_A = 0.7$. Right: Heatmap of $\max_{x_B} \dot{x}_B$ for various $k$ and $m$ values, with $u_B - u_A = 0.7$. Best viewed in color.

10.1 Prisoner's Dilemma payoff matrix used in our model.

10.2 Sequence of events at each time step in our evolutionary game-theoretic model. The sequence of steps is the same as in Hammond and Axelrod's paper [HA06] except for the Mobility stage, which is new. For additional details, see the Methods section.

10.3 Proportions of actions and strategies as a function of mobility, after 30,000 iterations, averaged over 100 simulation runs. The plots show the proportions of (a) the group-entitative and individual-entitative agents, (b) the actions played by the agents, (c) the strategies of the individual-entitative agents, (d) the in-group and (e) out-group strategies of the group-entitative agents, (f) the degree of clustering on the grid.

10.4 Single simulation run for 20000 generations with no mobility (m = 0). (a) Proportions of group-entitative and individual-entitative agents. (b) Relative proportions of the individual-entitative agents' strategies; Relative proportions of the group-entitative agents' (c) in-group and (d) out-group strategies.

10.5 Single simulation run for 30000 generations with mobility m = 0.05. (a) Proportions of group-entitative and individual-entitative agents. (b) Relative proportions of the individual-entitative agents' strategies; Relative proportions of the group-entitative agents' (c) in-group and (d) out-group strategies.

10.6 Cooperation breaking down at higher mobility values. Each data point is an average of 100 individual simulation runs. The plots show (a) the proportion of agents cooperating and defecting; and (b) over an agent's lifetime, the average number of unique opponents it encounters, and the average number of games played against each of them.

Chapter 1: Introduction and organization of the thesis

This thesis has two parts. In the first part, we explore fast stochastic optimization methods for machine learning. In the second part of the thesis, we study the evolution of cultural norms in human societies using game-theoretic models, drawing from research in cross-cultural psychology. In this chapter, we provide a brief overview of each part of the thesis.

Fast and efficient training in machine learning

Mathematical optimization is a backbone of modern machine learning. Most machine learning problems require optimizing some objective function that measures how well a model matches a data set, with the intention of drawing patterns and making decisions on new unseen data. The success of optimization algorithms in solving these problems is critical to the success of machine learning, and has enabled the research community to explore more complex machine learning problems that require bigger models and larger datasets.

Stochastic gradient descent (SGD) has become the standard optimization routine in machine learning, and in particular in deep neural networks, due to its impressive performance across a wide variety of tasks and models. SGD, however, can often be slow for neural networks with many layers and typically requires careful user oversight for setting hyperparameters properly. While innovations such as batch normalization and skip connections have helped alleviate some of these issues, why such innovations are required eludes full understanding, and it is worthwhile to gain deeper theoretical insights into these problems and to consider more advanced optimization methods specifically tailored towards training large complex models.

In this part of the thesis, we review and analyze some of the recent progress made in this area, develop new optimization algorithms of our own, and theoretically and empirically analyze the performance of existing well-known optimization techniques.
In Chapter 2, we review existing work in this area, and present some of the open problems that we explore in the rest of the thesis. In Chapters 3 and 4, we develop new optimization algorithms that are provably fast, significantly easier to train, and require less user oversight. In Chapter 5, we discuss the theory of quantized networks, which use low-precision weights to compress and accelerate neural networks, and when and why they are trainable. Finally, in Chapter 6, we discuss some recent results on how the convergence of SGD is affected by the architecture of neural nets, and we show using theoretical analysis that wide networks train faster than narrow nets, and deeper networks train slower than shallow nets, an effect often observed in practice.

Studying the evolution of cultural norms

Understanding human behavior and modeling how cultural norms evolve in different human societies is vital for designing policies and avoiding conflicts around the world. In this part, we explore ways to use computational game-theoretic techniques, and in particular evolutionary game-theoretic (EGT) models, to gain insight into why different human societies have different norms and behaviors. In Chapter 7, we introduce evolutionary game theory, and review how it has been previously used to study biological and cultural evolution.

In Chapter 8, we describe an evolutionary game-theoretic model to study how norms change in a society, based on the idea that differences in the strength of norms across societies translate to different game-theoretic interaction structures and incentives. We identify conditions that determine when societies change their existing norms, when they are resistant to such change, and how this depends on the strength of norms in a society. Next, in Chapter 9, we extend this study to analyze the evolutionary relationships between the tendency to conform and how quickly a population reacts when conditions make a change in norm desirable.
Our analysis identifies conditions when a tipping point is reached in a population, causing norms to change rapidly. Finally, in Chapter 10, we study conditions that affect the existence of group-biased behavior among humans (i.e., favoring others from the same group, and being hostile towards others from different groups). Using an evolutionary game-theoretic model, we show that out-group hostility is dramatically reduced by mobility. Technological and societal advances over the past centuries have greatly increased the degree to which humans change physical locations, and our results show that in highly mobile societies, one's choice of action is more likely to depend on which individual one is interacting with, rather than the group to which the individual belongs.

Part I

FAST & EFFICIENT TRAINING IN MACHINE LEARNING

Chapter 2: Introduction, background and notation

Interest in the field of machine learning has grown rapidly over the past decade, and machine learning is now generally considered to be one of the key components of building intelligent systems. Millions of people today use applications that run on machine learning algorithms, in the form of recommendation systems on platforms such as Amazon or Netflix, search engines like Google, speech recognition software such as Apple's Siri or the Google Assistant, or image recognition software used on various social media websites. Machine learning algorithms draw inferences from massive amounts of data by building a mathematical model to capture patterns or make predictions. As computing resources become increasingly powerful and more easily accessible, machine learning has become increasingly prevalent, and will most likely continue to do so over the coming years.

Mathematical optimization is one of the backbones of modern machine learning.
Most machine learning problems can be formulated as optimizing some objective based on a currently available set of data (a process typically called training), with the intention of drawing patterns and making decisions on new unseen data (testing). The success of optimization algorithms in solving these problems is critical to the success of machine learning, and has led the research community to explore more complex machine learning problems that require more complex mathematical models and larger datasets.

Due to the increasing size of datasets, complex machine learning models can take days to train even with high-performance computing hardware. There is therefore a need for efficient optimization algorithms specifically tailored towards training on huge datasets. As a result, there has been widespread interest recently, not only in more efficient optimization algorithms, but also in heuristics that enable existing optimization algorithms to work better. In this thesis, we review and analyze some of the recent progress made in this direction, and develop several optimization algorithms of our own that are provably fast. Using a principled approach, we also investigate and provide a theoretical justification for why certain optimization algorithms and certain heuristics have been successful in training complex models, while others have not.

For the rest of this chapter, we provide an introduction to optimization methods for solving large-scale machine learning problems and define the notation to be used in the rest of this part of the thesis. In the next section, we show how many popular machine learning models can be formulated as solving an optimization problem. In subsequent sections, we review some existing algorithms that have been successfully used to solve such large-scale problems, and investigate the open questions in this area. Finally, we summarize the main contributions of this part of the thesis.

2.1 Machine learning as an optimization problem

Consider the simple case of linear regression, which is used to model a linear relationship between independent variables $x$ and a dependent variable $y$. Suppose we are given a dataset of $n$ observations: $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$. In linear regression, the objective is to find parameters $w$ such that $\langle w, x_i \rangle = y_i$ for all $i$, where $\langle w, x \rangle$ denotes the inner product between $w$ and $x$. This problem may not be solvable for a variety of reasons; for example, the underlying relationship between $x$ and $y$ may not be linear, or the observations in the dataset may contain measurement noise. Thus, the typical approach is to find parameters $w$ such that $\langle w, x_i \rangle$ is close to $y_i$ on average. This is typically formulated as the following optimization problem:

\[ \min_w f(w) := \frac{1}{n} \sum_{i=1}^{n} \left( \langle w, x_i \rangle - y_i \right)^2. \]

Similar to the linear regression case, many popular machine learning problems can be formulated as optimization problems of the form

\[ \min_w f(w) := \frac{1}{n} \sum_{i=1}^{n} f_i(w; x_i), \tag{2.1} \]

where $\{x_i\}$ is a collection of data drawn from some unknown probability distribution $p$. In typical machine learning applications, each term $f_i(w; x_i)$ measures how well a model with parameters $w$ fits one particular data observation $x_i$. Given a dataset $D$ of $n$ data samples $\{x_i\}$, $f(w)$ measures how well the model fits the entire corpus of data on average. This is typically called an empirical risk minimization problem, and it is an estimate of the true problem we want to solve, i.e., the expected risk minimization problem: $\min_w \mathbb{E}_{x_i \sim p}[f_i(w; x_i)]$. Since we typically do not have enough information on the underlying data distribution $p$ to solve the expected risk minimization problem, we solve (2.1) instead.

For supervised learning problems, where the objective is to predict a value/label based on some input, each data sample $x$ in the dataset $D$ has a corresponding label $y = C(x)$, for some unknown labeling function $C$.
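As a rough illustration of the empirical risk in (2.1), the following Python sketch evaluates the least-squares objective on a tiny dataset. The helper names (`inner`, `squared_loss`, `empirical_risk`) and the toy data are assumptions made for this example only and do not come from the thesis.

```python
from typing import Callable, Sequence, Tuple

Vector = Sequence[float]
Sample = Tuple[Vector, float]  # (x_i, y_i)


def inner(w: Vector, x: Vector) -> float:
    """Inner product <w, x>."""
    return sum(wj * xj for wj, xj in zip(w, x))


def squared_loss(w: Vector, x: Vector, y: float) -> float:
    """Per-sample loss f_i(w) = (<w, x_i> - y_i)^2."""
    return (inner(w, x) - y) ** 2


def empirical_risk(
    w: Vector,
    data: Sequence[Sample],
    loss: Callable[[Vector, Vector, float], float] = squared_loss,
) -> float:
    """f(w) = (1/n) * sum_i f_i(w): the average loss over the dataset."""
    return sum(loss(w, x, y) for x, y in data) / len(data)


# Hypothetical noise-free toy data generated by y = 2*x1 - x2.
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]

risk_at_truth = empirical_risk((2.0, -1.0), data)  # generating parameters
risk_at_zero = empirical_risk((0.0, 0.0), data)    # all-zero parameters
```

Because the toy data are noise-free, the risk is zero at the generating parameters and strictly positive at the all-zero vector. Passing a different `loss` function gives other instances of (2.1) without changing `empirical_risk`.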
In this case, a training pair refers to the tuple $(x, y)$. We consider $x$ to be a $d$-dimensional vector with $x \in \mathbb{R}^d$, unless specified otherwise. For clarity in presentation, from here on, we denote $f_i(w; x_i)$ as $f_i(w)$. We sometimes also use $f_x$ to denote the model's loss corresponding to a data sample $x$, which will be clear from the context. Some of the notation used in the rest of the thesis is summarized in Section 2.5.

Logistic regression

Many popular machine learning models use objective functions of the same form as in (2.1). For example, logistic regression, which is a linear model for doing binary classification (i.e., distinguishing between two classes of data), uses the following objective function: $f_i(w) = \log(1 + \exp(-y_i\langle w, x_i\rangle))$, where $y_i$ denotes the binary label, $+1$ or $-1$, and the loss is averaged over the $n$ observations.

Neural networks

Another powerful class of models that are formulated as (2.1) are deep neural networks. Neural networks use a series of non-linear transformations to build highly complex and flexible function approximators. The output of a typical deep neural network with $\beta + 1$ layers is given by:
\[ \hat{y}_i = \sigma(W_\beta\,\sigma(\ldots \sigma(W_1\,\sigma(W_0 x_i + b_0) + b_1) \ldots) + b_\beta). \]
Here $x_i \in \mathbb{R}^d$ is the input data sample to the neural net, the $W$'s denote the weight matrices, the $b$'s denote the bias vectors, and $\hat{y}_i$ denotes the output of the neural network. The function $\sigma(\cdot)$ is typically non-linear and applied point-wise to its arguments. Common choices for $\sigma(\cdot)$ are the sigmoid function, $\sigma(x) = 1/(1 + \exp(-x))$, or the ReLU, $\sigma(x) = \max(0, x)$. This sequence of non-linear transformations helps the neural network express complex function classes. The shapes of the weight matrices and biases are such that the output $\hat{y}_i$ is the same size as the label $y_i$. Thus, for this neural net, the parameters of the model are given by
\[ w = [\mathrm{vec}(W_0)^\top\ \mathrm{vec}(W_1)^\top \cdots \mathrm{vec}(W_\beta)^\top\ b_0^\top\ b_1^\top \cdots b_\beta^\top]^\top \]
(where we imagine all vectors to be column vectors by default and $\top$ denotes the transpose operator).
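As a minimal sketch of the layered form above (not the implementation used in this thesis), the forward pass can be written as a loop over layers. The helper names `affine` and `forward`, the list-of-rows matrix layout, and the tiny two-layer example are assumptions made only for illustration.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Return W v + b, with the matrix W given as a list of rows."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) + b_i
            for row, b_i in zip(W, b)]

def forward(x, weights, biases, sigma=relu):
    """Apply h <- sigma(W_l h + b_l) layer by layer, starting from h = x."""
    h = list(x)
    for W, b in zip(weights, biases):
        h = [sigma(z) for z in affine(W, b, h)]
    return h

# Hypothetical two-layer network: W_0 is the 2x2 identity, W_1 sums its inputs.
weights = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]]
biases = [[0.0, 0.0], [0.5]]
y_hat = forward([1.0, -2.0], weights, biases)  # ReLU zeroes the negative entry
```

Here the parameter vector $w$ of the text corresponds to the flattened contents of `weights` and `biases`.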
Neural networks have been very successful at a wide range of applications, and the loss function used depends on the specific application. For multi-class classification, a common loss function is the cross entropy, where each $f_i$ has the form $f_i(w) = -\sum_{j=1}^{c} (y_i)_j \log(\hat{y}_i)_j$, where $c$ is the number of classes (thus the dimensions of $y_i$ and $\hat{y}_i$ are also $c$). One can also use the L2 loss function for regression problems, where each $f_i$ is $f_i(w) = \|y_i - \hat{y}_i\|^2$.

Other examples of machine learning models that follow a similar form as (2.1) are support vector machines, matrix completion and graph cuts, among others.

2.2 Stochastic gradient descent

Traditionally, optimization problems of the form (2.1) have been solved using iterative deterministic optimization methods. A popular example of such a method is the gradient descent algorithm, which uses iterative updates of the form $w_{k+1} = w_k - \alpha\nabla f(w_k)$, where $\alpha$ denotes the step size and $\nabla f$ denotes the gradient of $f$ w.r.t. the parameters $w$. Deterministic optimization methods like gradient descent enjoy fast convergence rates and require less user oversight for setting the step size $\alpha$, and are thus easy to use.

However, when $n$ is large (or even infinite) and the model is large, as is often the case in modern machine learning, it becomes intractable to exactly evaluate $f(w)$ or its gradient $\nabla f(w)$, which makes classical gradient methods impractical. In such situations, the method of choice for minimizing (2.1) is the stochastic gradient descent (SGD) algorithm [RM51]. On iteration $k$, SGD uses an approximation $\tilde{f}_k$ of the true function $f$, and then computes
\[ w_{k+1} = w_k - \alpha_k\nabla\tilde{f}_k(w_k), \tag{2.2} \]
where $\alpha_k$ denotes the step size used on the $k$-th iteration. Typically, $\tilde{f}_k$ is an unbiased estimate of $f$, obtained by selecting a batch $B_k \subseteq D$ of data uniformly at random on each iteration $k$. Thus,
\[ \tilde{f}_k(w) = \frac{1}{|B_k|}\sum_{x_i \in B_k} f_i(w). \]
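A minimal sketch of the mini-batch update (2.2), assuming the least-squares loss $f_i(w) = (\langle w, x_i\rangle - y_i)^2$ of Section 2.1 and a constant step size. The function names, the toy dataset, and the hyper-parameter values are hypothetical and chosen only so the example is easy to follow.

```python
import random

def grad_fi(w, x, y):
    """Gradient of f_i(w) = (<w, x> - y)^2 with respect to w."""
    r = sum(wj * xj for wj, xj in zip(w, x)) - y
    return [2.0 * r * xj for xj in x]

def minibatch_grad(w, data, batch):
    """Average the per-sample gradients over the indices in `batch`."""
    g = [0.0] * len(w)
    for i in batch:
        x, y = data[i]
        for j, gj in enumerate(grad_fi(w, x, y)):
            g[j] += gj / len(batch)
    return g

def sgd(data, w0, step, batch_size, iters, seed=0):
    """Run `iters` updates of the form (2.2) with a constant step size."""
    rng = random.Random(seed)
    w = list(w0)
    for _ in range(iters):
        batch = rng.sample(range(len(data)), batch_size)  # uniform batch
        g = minibatch_grad(w, data, batch)
        w = [wj - step * gj for wj, gj in zip(w, g)]
    return w

# Noiseless toy data generated by w = (2, -1), so every f_i shares a minimizer.
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0),
        ([1.0, 1.0], 1.0), ([1.0, -1.0], 3.0)]
w_hat = sgd(data, w0=[0.0, 0.0], step=0.1, batch_size=2, iters=1000)
```

Because the toy data are noiseless, a constant step size is enough here; with noisy data the iterates would instead hover around the minimizer unless the step size decays or the batch grows.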
Note that $\mathbb{E}_{B_k}[\nabla\tilde{f}_k(w_k)] = \nabla f(w_k)$, and so the calculated gradient $\nabla\tilde{f}_k(w_k)$ can be interpreted as a "noisy" approximation to the true gradient.

2.3 On the successes and drawbacks of SGD

Stochastic gradient descent (SGD) has become one of the most popular optimization algorithms for training deep neural networks, achieving impressive generalization performance across a wide variety of tasks and models. When SGD's hyper-parameters (learning rate, batch size) are set properly, it usually achieves good generalization performance compared to other optimization algorithms on a variety of benchmark neural network tasks [WRS+17, KS17, SMDH13].

There are, however, a number of open questions and well-known limitations of SGD. SGD can often be slow for neural networks with many layers, or ones with recurrent connections. While innovations such as batch normalization and skip connections have helped alleviate this issue to a certain extent, why such techniques are required eludes full understanding, and it is worthwhile to gain deeper theoretical insights into these problems.

A major drawback of SGD is that it requires careful user oversight for setting the step size schedule. Performance is very sensitive to this choice, and all state-of-the-art results were achieved with a very careful choice of the learning rate schedule. While there has been some recent work on methods for automatically setting step sizes for stochastic algorithms [KB14, MH15, SZL13, TMDQ16], these methods are largely heuristic, lack theoretical guarantees on convergence rates, and do not work well in practice either [WRLG18].

Moreover, as the datasets grow larger and models become more complex (such as increasing depth in neural networks), SGD typically takes a much longer time to train to high accuracies (i.e., convergence rates are slow).
While innovations such as batch normalization and skip connections have helped alleviate this issue to a certain extent, why such techniques are required eludes full understanding. Further, because SGD is an inherently sequential algorithm and its gradients are noisy, it cannot be efficiently distributed over computing clusters. This indicates the need for faster optimization methods for training these models.

2.4 Contributions

In this part of the thesis, we explore a few of the open questions mentioned in Section 2.3. We list the main contributions below.

In Chapter 3, we develop stochastic optimization algorithms that require no user oversight by automatically setting the hyperparameters of SGD. This is done by adaptively growing the batch size over time to control the amount of noise in the gradient estimate relative to the signal in the gradient estimate. Controlling the noise, in turn, makes the process of setting step sizes much easier, and we present various adaptive step size methods that have provable convergence rate guarantees, as well as good empirical performance on a wide range of machine learning models and datasets.

In Chapter 4, we explore a variant of SGD that has a provably faster convergence rate. We show that this variant can scale linearly over hundreds of computing cores and can speed up training of machine learning models on massive datasets without experiencing the slowdown that existing stochastic methods experience. This is done by leveraging a class of stochastic algorithms called variance reduction methods, which explicitly reduce the variance in the SGD gradient estimate by adding an error correction term.

In Chapter 5, we investigate quantized networks, which use low-precision weights to compress and accelerate neural networks. We discuss the theory of quantized networks, and when/why they are trainable.
In particular, we show that quantized training algorithms that exploit high-precision representations have an important greedy search phase that purely quantized training methods lack, which explains the difficulty of training using low-precision arithmetic.

Finally, in Chapter 6, we explore why SGD is efficient for neural nets when tuned properly, and how neural net design affects SGD. In particular, we investigate how over-parametrization (an increase in the number of parameters beyond the number of training data points, which is the typical setting in most neural network problems) affects the dynamics of SGD. We find that wide networks train faster than narrow nets, and deeper networks train slower than shallow nets, an effect often observed in practice.

2.5 Table of notation

$d$    data dimension
$n$    number of data points
$w$    vector of parameters of the machine learning model (boldface denotes a vector)
$x$    $d$-dimensional input data sample
$y$    label of the input data sample; we assume $y = C(x)$ for a labeling function $C$
$D$    training data; $D = \{(x_i, y_i)\}_{i=1}^{n}$ for supervised problems, $D = \{x_i\}_{i=1}^{n}$ otherwise
$B$    set of data points chosen in the mini-batch, i.e., $B \subseteq D$
$k$    current iteration of the optimizer
$f_{x_i}$ or $f_i$    scalar function denoting the model's loss corresponding to training pair $(x_i, y_i)$
$f$    scalar loss function to be minimized; typically $f = \frac{1}{n}\sum_{i=1}^{n} f_i$
$\tilde{f}_k$    approximation of $f$ at iteration $k$ used by stochastic optimization algorithms
$\tilde{f}_B$    (overloading notation) approximation of $f$ by using mini-batch $B \subseteq D$
$(v)_i$    $i$-th element of the vector $v$
$\nabla\gamma(v)$    gradient of a scalar function $\gamma$, i.e., $(\nabla\gamma(v))_i = \partial\gamma/\partial(v)_i$
$\langle v_1, v_2\rangle$    inner product between two vectors, i.e., $\langle v_1, v_2\rangle = \sum_{i=1}^{d} (v_1)_i \cdot (v_2)_i$
$\|v\|$    L2 norm of vector $v$, unless otherwise specified; i.e., $\|v\| = \sqrt{\sum_{i=1}^{d} (v)_i^2}$

Chapter 3: Automated inference using adaptive batch sizes

3.1 Introduction

SGD uses noisy gradient approximations to solve (2.1).
Since the gradient approximations are noisy, the step size $\alpha_k$ must vanish as $k \to \infty$ to guarantee convergence of the method. Typical step size rules require the user to find the optimal decay rate schedule, which usually requires an expensive grid search over different possible parameter values.

In this chapter, we propose a "big batch" strategy for SGD. Rather than letting the step size vanish over time as the iterates approach a minimizer, we let the mini-batch $B$ adaptively grow in size to maintain a constant signal-to-noise ratio of the gradient approximation. This prevents the algorithm from getting overwhelmed with noise, and guarantees convergence with an appropriate constant step size. Recent results [KMN+16] have shown that large fixed batch sizes fail to find good minimizers for non-convex problems like deep neural networks. Adaptively increasing the batch size over time overcomes this limitation: intuitively, in the initial iterations, the increased stochasticity (corresponding to smaller batches) can help land the iterates near a good minimizer, and larger batches later on can increase the speed of convergence towards this minimizer.

Using this batching strategy, we show that we can keep the step size constant, or let it adapt using a simple Armijo backtracking line search, making the method completely adaptive with no user-defined parameters. We also derive an adaptive step size method based on the Barzilai-Borwein [BB88] curvature estimate that fully automates the big batch method, while empirically enjoying a faster convergence rate than the Armijo backtracking line search.

Big batch methods that adaptively grow the batch size over time have several potential advantages over conventional small-batch SGD:

• Big batch methods don't require the user to choose step size decay parameters. Larger batch sizes with less noise enable easy estimation of the accuracy of the approximate gradient, making it straightforward to adaptively scale up the batch size and maintain fast convergence.

• Backtracking line search tends to work very well when combined with big batches, making the methods completely adaptive with no parameters. A nearly constant signal-to-noise ratio also enables us to define an adaptive step size method based on the Barzilai-Borwein curvature estimate, which performs better empirically on a range of convex problems than the backtracking line search.

• Higher order methods like stochastic L-BFGS typically require more work per iteration than simple SGD. When using big batches, the overhead of more complex methods like L-BFGS can be amortized over more costly gradient approximations. Furthermore, better Hessian approximations can be computed using less noisy gradient terms.

• For a restricted class of non-convex problems (functions satisfying the Polyak-Łojasiewicz Inequality), the per-iteration convergence rate of big batch SGD is linear and the approximate gradients vanish as the method approaches a solution, which makes it easy to define automated stopping conditions. In contrast, small batch SGD exhibits sub-linear convergence, and the noisy gradients are not usable as a stopping criterion.

• Big batch methods are much more efficient than conventional SGD in massively parallel/distributed settings. Bigger batches perform more computation between parameter updates, and thus allow a much higher ratio of computation to communication.

For the reasons above, big batch SGD is potentially much easier to automate and requires much less user oversight than classical small batch SGD.

Related work

In this section, we focus on automating stochastic optimization methods by reducing the noise in SGD.
We do this by adaptively growing the batch size to control the variance in the gradient estimates, maintaining an approximately constant signal-to-noise ratio, leading to automated methods that do not require vanishing step size parameters. While there has been some work on adaptive step size methods for stochastic optimization [MH15, SZL13, TMDQ16, KB14, Zei12], the methods are largely heuristic without any kind of theoretical guarantees or convergence rates. The work in [TMDQ16] was a first step towards provable automated stochastic methods, and we explore further in this direction to show provable convergence rates for the automated big batch method.

While there has been relatively little work on provable automated stochastic methods, there has been recent interest in methods that control gradient noise. These methods mitigate the effects of vanishing step sizes, though choosing the (constant) step size still requires tuning and oversight. There have been a few papers in this direction that use dynamically increasing batch sizes. In [FS12], the authors propose to increase the size of the batch by a constant factor on every iteration, and prove linear convergence in terms of the iterates of the algorithm. In [BCNW12], the authors propose an adaptive strategy for growing the batch size; however, the authors do not present a theoretical guarantee for this method, and instead prove linear convergence for a continuously growing batch, similar to [FS12].

Variance reduction (VR) SGD methods use an error correction term to reduce the noise in stochastic gradient estimates. The methods enjoy a provably faster convergence rate than SGD and have been shown to outperform SGD on convex problems [DBLJ14, JZ13, SRB13, DD+14], as well as in parallel [RHS+15] and distributed settings [DG16]. A caveat, however, is that these methods require either extra storage or full gradient computations, both limiting factors when the dataset is very large. In a recent paper [HAV+15], the authors propose a growing batch strategy for a VR method that enjoys the same convergence guarantees. However, as mentioned above, choosing the constant step size still requires tuning. Another conceptually related approach is importance sampling, i.e., choosing training points such that the variance in the gradient estimates is reduced [BTPG15, CR16, NWS14].

3.2 Big Batch SGD

3.2.1 Preliminaries and motivation

Classical stochastic gradient methods thrive when the current iterate is far from optimal. In this case, only a small amount of data is needed to find a descent direction, and optimization progresses efficiently. As $w_k$ starts approaching the true solution $w^\star$, however, noisy gradient estimates frequently fail to produce descent directions and do not reliably decrease the objective. By choosing larger batches with less noise, we may be able to maintain descent directions on each iteration and uphold fast convergence. This observation motivates the proposed "big batch" method. We now explore this idea more rigorously.

We wish to show that a noisy gradient approximation $\nabla\tilde{f}$ produces a descent direction when the noise is comparable in magnitude to the true gradient $\nabla f$.

Lemma 3.2.1. A sufficient condition for $-\nabla\tilde{f}(w)$ to be a descent direction is
\[ \|\nabla\tilde{f}(w) - \nabla f(w)\|^2 < \|\nabla\tilde{f}(w)\|^2. \]

Proof. This is a standard result in stochastic optimization. We know that $-\nabla\tilde{f}(w)$ is a descent direction iff $\langle\nabla\tilde{f}(w), \nabla f(w)\rangle > 0$. Expanding $\|\nabla\tilde{f}(w) - \nabla f(w)\|^2$ in the condition above, we get:
\[ \|\nabla\tilde{f}(w)\|^2 + \|\nabla f(w)\|^2 - 2\langle\nabla\tilde{f}(w), \nabla f(w)\rangle < \|\nabla\tilde{f}(w)\|^2. \]
We can re-write this as $-2\langle\nabla\tilde{f}(w), \nabla f(w)\rangle < -\|\nabla f(w)\|^2 \le 0$, so that $\langle\nabla\tilde{f}(w), \nabla f(w)\rangle > 0$ and $-\nabla\tilde{f}(w)$ is a descent direction. □

Thus, we see that if the error $\|\nabla\tilde{f}(w) - \nabla f(w)\|^2$ is small relative to the gradient $\|\nabla\tilde{f}(w)\|^2$, the stochastic approximation is a descent direction. But how big is this error, and how large does a batch need to be to guarantee this condition? Let $\tilde{f}_B$ denote the unbiased estimate of $f$ using a mini-batch $B$ sampled uniformly at random from dataset $D$.
Also, let $f_x$ denote the loss corresponding to training pair $(x, C(x))$. Then, by the weak law of large numbers¹,
\[ \mathbb{E}[\|\nabla\tilde{f}_B(w) - \nabla f(w)\|^2] = \frac{1}{|B|}\mathbb{E}_x[\|\nabla f_x(w) - \nabla f(w)\|^2] = \frac{1}{|B|}\mathrm{Tr}\,\mathrm{Var}_x\nabla f_x(w), \]
and so we can estimate the error of a stochastic gradient if we have some knowledge of the variance of $\nabla f_x(w)$. In practice, this variance could be estimated using the sample variance of a batch $\{\nabla f_i(w)\}_{x_i \in B}$. However, we would like some bounds on the magnitude of this gradient to show that it is well-behaved, and also to analyze worst-case convergence behavior. To this end, we make the following assumption.

¹ We assume the random variable $\nabla f_x$ is measurable and has bounded second moment. These conditions will be guaranteed by the hypothesis of Theorem 3.2.1.

Assumption 3.2.1. We assume that each $f_i$ has $L_x$-Lipschitz dependence on data $x$, i.e., given two data points $x_1, x_2 \sim p(x)$, we have: $\|\nabla f_{x_1}(w) - \nabla f_{x_2}(w)\| \le L_x\|x_1 - x_2\|$.

Under this assumption, we can bound the error of the stochastic gradient. The bound is uniform with respect to $w$, which makes it rather useful in analyzing the convergence rate for big batch methods.

Theorem 3.2.1. Given the current iterate $w$, suppose Assumption 3.2.1 holds and that the data distribution $p$ has bounded second moment. Then the estimated gradient $\nabla\tilde{f}_B(w)$ has variance bounded by
\[ \mathbb{E}_B\|\nabla\tilde{f}_B(w) - \nabla f(w)\|^2 := \mathrm{Tr}\,\mathrm{Var}_B(\nabla\tilde{f}_B(w)) \le \frac{4L_x^2\,\mathrm{Tr}\,\mathrm{Var}_x(x)}{|B|}, \]
where $x \sim p(x)$. Note the bound is uniform in $w$.

Proof. Let $\bar{x} = \mathbb{E}[x]$ be the mean of $x$. Given the current iterate $w$, we assume that the batch $B$ is sampled uniformly with replacement from $p$. We then have:
\begin{align*}
\|\nabla f_x(w) - \nabla f(w)\|^2 &\le 2\|\nabla f_x(w) - \nabla f_{\bar{x}}(w)\|^2 + 2\|\nabla f_{\bar{x}}(w) - \nabla f(w)\|^2 \\
&\le 2L_x^2\|x - \bar{x}\|^2 + 2\|\mathbb{E}_x[\nabla f_{\bar{x}}(w) - \nabla f_x(w)]\|^2 \\
&\le 2L_x^2\|x - \bar{x}\|^2 + 2\mathbb{E}_x\|\nabla f_{\bar{x}}(w) - \nabla f_x(w)\|^2 \\
&\le 2L_x^2\|x - \bar{x}\|^2 + 2L_x^2\mathbb{E}_x\|\bar{x} - x\|^2 \\
&= 2L_x^2\|x - \bar{x}\|^2 + 2L_x^2\,\mathrm{Tr}\,\mathrm{Var}_x(x),
\end{align*}
where the first inequality uses the property $\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$, the second and fourth inequalities use Assumption 3.2.1, and the third inequality uses Jensen's inequality. This bound is uniform in $w$.
We then have
\[ \mathbb{E}_x\|\nabla f_x(w) - \nabla f(w)\|^2 \le 2L_x^2\mathbb{E}_x\|x - \bar{x}\|^2 + 2L_x^2\,\mathrm{Tr}\,\mathrm{Var}_x(x) = 4L_x^2\,\mathrm{Tr}\,\mathrm{Var}_x(x), \]
uniformly for all $w$. The result follows from the observation that
\[ \mathbb{E}_B\|\nabla\tilde{f}_B(w) - \nabla f(w)\|^2 = \frac{1}{|B|}\mathbb{E}_x\|\nabla f_x(w) - \nabla f(w)\|^2. \] □

Note that, using a finite number of samples, one can approximate the quantity $\mathrm{Var}_x(x)$.

3.2.2 A template for big batch SGD

Theorem 3.2.1 and Lemma 3.2.1 together suggest that we should expect $d = -\nabla\tilde{f}_B$ to be a descent direction reasonably often provided
\[ \theta^2\|\nabla\tilde{f}_B(w)\|^2 \ge \frac{1}{|B|}[\mathrm{Tr}\,\mathrm{Var}_x(\nabla f_x(w))], \tag{3.1} \]
or $\theta^2\|\nabla\tilde{f}_B(w)\|^2 \ge \frac{4L_x^2\,\mathrm{Tr}\,\mathrm{Var}_x(x)}{|B|}$, for some $\theta < 1$. Big batch methods capitalize on this observation.

On each iteration $k$, starting from a point $w_k$, the big batch method performs the following steps:

1. Estimate the variance $\mathrm{Tr}\,\mathrm{Var}_x[\nabla f_x(w_k)]$, and a batch size $K$ large enough that
\[ \theta^2\mathbb{E}\|\nabla\tilde{f}_{B_k}(w_k)\|^2 \ge \mathbb{E}\|\nabla\tilde{f}_{B_k}(w_k) - \nabla f(w_k)\|^2 = \frac{1}{K}\mathrm{Tr}\,\mathrm{Var}_x\nabla f_x(w_k), \tag{3.2} \]
where $\theta \in (0, 1)$ and $B_k$ is the selected batch on the $k$-th iteration with $|B_k| = K$.

2. Choose a step size $\alpha_k$.

3. Perform the update: $w_{k+1} = w_k - \alpha_k\nabla\tilde{f}_{B_k}(w_k)$.

One can implement these steps using different variance estimators and different step size strategies. In the next section, we show that, if condition (3.2) holds, then fast convergence can be achieved using an appropriate constant step size. In subsequent sections, we address the issue of how to build practical big batch implementations using automated variance and step size estimators that require no user oversight.

3.3 Convergence analysis

We now present convergence bounds for big batch SGD methods of the form (3.3). We rewrite the SGD update as:
\[ w_{k+1} = w_k - \alpha\nabla\tilde{f}_{B_k}(w_k) = w_k - \alpha(\nabla f(w_k) + \tilde{e}_k), \tag{3.3} \]
where $\tilde{e}_k = \nabla\tilde{f}_{B_k}(w_k) - \nabla f(w_k)$, and $\mathbb{E}_B[\tilde{e}_k] = 0$. Let us also define $\tilde{g}_k = \nabla f(w_k) + \tilde{e}_k$. Before we present our results, we first state two assumptions about the loss function $f(w)$.

Assumption 3.3.1. We assume that the objective function $f$ has $L$-Lipschitz gradients:
\[ f(w) \le f(w') + \langle\nabla f(w'), w - w'\rangle + \frac{L}{2}\|w - w'\|^2. \]
This is a standard smoothness assumption used widely in the optimization literature. Note that a consequence of Assumption 3.3.1 is: $\|\nabla f(w) - \nabla f(w')\| \le L\|w - w'\|$.

Assumption 3.3.2. We assume that the objective function $f$ satisfies the Polyak-Łojasiewicz Inequality:
\[ \|\nabla f(w)\|^2 \ge 2\mu(f(w) - f(w^\star)), \]
where $w^\star$ is the optimal solution. Note that this inequality does not require $f$ to be convex. It does, however, imply that every stationary point is a global minimizer [KNS16, Pol63].

We now present a result that establishes an upper bound on the objective value in terms of the error in the gradient of the sampled batch.

Lemma 3.3.1. Suppose we apply an update of the form (3.3) where the batch $B_k$ is uniformly sampled from the dataset $D$ on each iteration $k$. If the objective $f$ satisfies Assumptions 3.3.1 and 3.3.2, we have:
\[ \mathbb{E}[f(w_{k+1}) - f(w^\star)] \le \left(1 - 2\mu\left(\alpha - \frac{L\alpha^2}{2}\right)\right)\mathbb{E}[f(w_k) - f(w^\star)] + \frac{L\alpha^2}{2}\mathbb{E}\|\tilde{e}_k\|^2. \]

Proof. From (3.3) and Assumption 3.3.1 we get
\[ f(w_{k+1}) \le f(w_k) - \alpha\langle\tilde{g}_k, \nabla f(w_k)\rangle + \frac{L\alpha^2}{2}\|\tilde{g}_k\|^2. \]
Taking expectation with respect to the batch $B_k$ and conditioning on $w_k$, we get
\begin{align*}
\mathbb{E}[f(w_{k+1}) - f(w^\star)] &\le f(w_k) - f(w^\star) - \alpha\langle\mathbb{E}[\tilde{g}_k], \nabla f(w_k)\rangle + \frac{L\alpha^2}{2}\mathbb{E}\|\tilde{g}_k\|^2 \\
&= f(w_k) - f(w^\star) - \left(\alpha - \frac{L\alpha^2}{2}\right)\|\nabla f(w_k)\|^2 + \frac{L\alpha^2}{2}\mathbb{E}\|\tilde{e}_k\|^2 \\
&\le \left(1 - 2\mu\left(\alpha - \frac{L\alpha^2}{2}\right)\right)(f(w_k) - f(w^\star)) + \frac{L\alpha^2}{2}\mathbb{E}\|\tilde{e}_k\|^2,
\end{align*}
where the second inequality follows from Assumption 3.3.2. Taking expectation, the result follows. □

Using Lemma 3.3.1, we now provide convergence rates for big batch SGD.

Theorem 3.3.1. Suppose $f$ satisfies Assumptions 3.3.1 and 3.3.2. Suppose further that on each iteration the batch size is large enough to satisfy (3.2) for $\theta \in (0, 1)$. If $0 \le \alpha < \frac{2}{L\beta}$, where $\beta = \frac{\theta^2 + (1-\theta)^2}{(1-\theta)^2}$, then we get the following linear convergence bound for big batch SGD using updates of the form (3.3):
\[ \mathbb{E}[f(w_{k+1}) - f(w^\star)] \le \gamma \cdot \mathbb{E}[f(w_k) - f(w^\star)], \]
where $\gamma = 1 - 2\mu\left(\alpha - \frac{L\alpha^2\beta}{2}\right)$. Choosing the optimal step size of $\alpha = \frac{1}{\beta L}$, we get
\[ \mathbb{E}[f(w_{k+1}) - f(w^\star)] \le \left(1 - \frac{\mu}{\beta L}\right) \cdot \mathbb{E}[f(w_k) - f(w^\star)]. \]

Proof.
We begin by applying the reverse triangle inequality to (3.2) to get $(1-\theta)\mathbb{E}\|\nabla\tilde{f}_B(w_k)\| \le \mathbb{E}\|\nabla f(w_k)\|$, which applied to (3.2) yields:
\[ \frac{\theta^2}{(1-\theta)^2}\mathbb{E}\|\nabla f(w_k)\|^2 \ge \mathbb{E}\|\nabla\tilde{f}_B(w_k) - \nabla f(w_k)\|^2 = \mathbb{E}\|\tilde{e}_k\|^2. \tag{3.4} \]
Applying (3.4) to the result in Lemma 3.3.1, we get
\[ \mathbb{E}[f(w_{k+1}) - f(w^\star)] \le \mathbb{E}[f(w_k) - f(w^\star)] - \left(\alpha - \frac{L\alpha^2\beta}{2}\right)\mathbb{E}\|\nabla f(w_k)\|^2, \]
where $\beta = \frac{\theta^2 + (1-\theta)^2}{(1-\theta)^2} \ge 1$. Assuming $\alpha - \frac{L\alpha^2\beta}{2} \ge 0$ and using Assumption 3.3.2, we get:
\[ \mathbb{E}[f(w_{k+1}) - f(w^\star)] \le \left(1 - 2\mu\left(\alpha - \frac{L\alpha^2\beta}{2}\right)\right)\mathbb{E}[f(w_k) - f(w^\star)], \]
which proves the theorem. Note that $\max_\alpha\{\alpha - \frac{L\alpha^2\beta}{2}\} = \frac{1}{2L\beta}$, and $\mu \le L$. It follows that $0 \le 1 - 2\mu\left(\alpha - \frac{L\alpha^2\beta}{2}\right) < 1$. The second result follows immediately. □

Note that the above linear convergence rate bound holds without requiring convexity. Comparing it with the convergence rate of deterministic gradient descent under similar assumptions, we see that big batch SGD suffers a slowdown by a factor $\beta$, due to the noise in the estimation of the gradients. We now present a result proving an $O(1/k)$ convergence rate for general smooth convex functions.

Theorem 3.3.2. Suppose $f$ satisfies Assumption 3.3.1, is convex, and condition (3.2) is satisfied on each iteration. Then we get sub-linear convergence of the form:
\[ \mathbb{E}[f(w_k) - f(w^\star)] \le \frac{\|w_0 - w^\star\|^2}{(2\alpha - 2L\alpha^2\beta)(k+1)} = O(1/k), \]
where $\beta = \frac{\theta^2 + (1-\theta)^2}{(1-\theta)^2}$ and $\alpha < \frac{1}{L\beta}$. Choosing the optimal step size of $\alpha = \frac{1}{2L\beta}$, we get
\[ \mathbb{E}[f(w_k) - f(w^\star)] \le \frac{2L\beta\|w_0 - w^\star\|^2}{k+1} = O(1/k). \]

Proof. Applying the reverse triangle inequality to (3.2) and using Lemma 3.3.1 we get, as in Theorem 3.3.1:
\[ \mathbb{E}[f(w_{k+1})] \le \mathbb{E}[f(w_k)] - \left(\alpha - \frac{L\alpha^2\beta}{2}\right)\mathbb{E}\|\nabla f(w_k)\|^2, \tag{3.5} \]
where $\beta = \frac{\theta^2 + (1-\theta)^2}{(1-\theta)^2} \ge 1$. Note that $\alpha - \frac{L\alpha^2\beta}{2} > 0$ if $\alpha < \frac{2}{L\beta}$. From (3.3), taking norm on both sides and taking expectation, conditioned on all $w_t$, with $t = 0, 1, \cdots, k$, we get
\begin{align*}
\mathbb{E}\|w_{k+1} - w^\star\|^2 &= \|w_k - w^\star\|^2 - 2\alpha\mathbb{E}\langle w_k - w^\star, \nabla f(w_k) + \tilde{e}_k\rangle + \alpha^2\mathbb{E}\|\nabla f(w_k) + \tilde{e}_k\|^2 \\
&\le \|w_k - w^\star\|^2 - 2\alpha\langle w_k - w^\star, \nabla f(w_k)\rangle + \alpha^2\beta\|\nabla f(w_k)\|^2 \\
&\le \|w_k - w^\star\|^2 - 2\alpha(f(w_k) - f(w^\star)) + \alpha^2\beta\|\nabla f(w_k)\|^2 \\
&\le \|w_k - w^\star\|^2 - 2\alpha(f(w_k) - f(w^\star)) + 2L\alpha^2\beta(f(w_k) - f(w^\star)) \\
&= \|w_k - w^\star\|^2 - (2\alpha - 2L\alpha^2\beta)(f(w_k) - f(w^\star)),
\end{align*}
where we use the property that $\mathbb{E}[\tilde{e}_k] = 0$, and the properties $f(w) \le f(w^\star) + \langle w - w^\star, \nabla f(w)\rangle$ (which follows from the convexity of $f$) and $\|\nabla f(w)\|^2 \le 2L(f(w) - f(w^\star))$ (a proof of this inequality can be found in [Nes13]). Note that $2\alpha - 2L\alpha^2\beta > 0$ when $\alpha < \frac{1}{L\beta}$. Taking expectation over all $w_k$, we get
\[ \mathbb{E}[f(w_k) - f(w^\star)] \le \frac{1}{2\alpha(1 - L\alpha\beta)}\left(\mathbb{E}\|w_k - w^\star\|^2 - \mathbb{E}\|w_{k+1} - w^\star\|^2\right). \tag{3.6} \]
Summing (3.6) over all $k = 0, 1, \cdots, T$, and using the telescoping sum in $\|w_k - w^\star\|^2$:
\begin{align*}
\sum_{k=0}^{T}\mathbb{E}[f(w_k) - f(w^\star)] &\le \frac{1}{2\alpha(1 - L\alpha\beta)}\left(\mathbb{E}\|w_0 - w^\star\|^2 - \mathbb{E}\|w_{T+1} - w^\star\|^2\right) \\
&\le \frac{1}{2\alpha(1 - L\alpha\beta)}\|w_0 - w^\star\|^2. \tag{3.7}
\end{align*}
From (3.5) we see that $\mathbb{E}[f(w_{k+1})] \le \mathbb{E}[f(w_k)]$ when $\alpha < \frac{2}{L\beta}$. Thus, we rewrite (3.7) as:
\[ \mathbb{E}[f(w_T) - f(w^\star)] \le \frac{1}{T+1}\sum_{k=0}^{T}\mathbb{E}[f(w_k) - f(w^\star)] \le \frac{\|w_0 - w^\star\|^2}{(2\alpha - 2L\alpha^2\beta)(T+1)}. \]
Choosing the optimal step size of $\alpha = \frac{1}{2L\beta}$, the second result follows. □

3.3.1 Comparison to classical SGD

Conventional small batch SGD methods can attain only $O(1/k)$ convergence for strongly convex problems, thus requiring $O(1/\epsilon)$ gradient evaluations to achieve an optimality gap less than $\epsilon$, and this has been shown to be optimal in the online setting (i.e., the infinite data setting) [RSS11]. In the previous section, however, we have shown that big batch SGD methods converge linearly in the number of iterations, under a weaker assumption than strong convexity, in the online setting. Unfortunately, per-iteration convergence rates are not a fair comparison between these methods because the cost of a big batch iteration grows with the iteration count, unlike classical SGD. For this reason, it is interesting to study the convergence rate of big batch SGD as a function of gradient evaluations.

From Lemma 3.3.1, we see that we should not expect to achieve an optimality gap less than $\epsilon$ until we have $\frac{L\alpha^2}{2}\mathbb{E}_{B_k}\|\tilde{e}_k\|^2 < \epsilon$. In the worst case, by Theorem 3.2.1, this requires $\frac{L\alpha^2}{2}\cdot\frac{4L_x^2\,\mathrm{Tr}\,\mathrm{Var}_x(x)}{|B|} < \epsilon$, or $|B| \ge O(1/\epsilon)$ gradient evaluations.
Note that in the online or infinite data case, this is an optimal bound, and matches that of other SGD methods.

All our results hold for the infinite sample case. Note that the finite sample case is fairly trivial with a growing batch size: asymptotically, the batch size becomes the whole dataset, at which point we get the same asymptotic behavior as deterministic gradient descent, achieving linear convergence rates.

3.4 Practical implementation with backtracking line search

While one could implement a big batch method using analytical bounds on the gradient and its variance (such as that provided by Theorem 3.2.1), the purpose of big batch methods is to enable automated adaptive estimation of algorithm parameters. Furthermore, the step size bounds provided by our convergence analysis, like the step size bounds for classical SGD, are fairly conservative and more aggressive step size choices are likely to be more effective.

Algorithm 1 Big batch SGD: fixed step size
 1: initialize $w_0$, step size $\alpha$, initial batch size $K > 1$, batch size increment $\delta_K$
 2: while not converged do
 3:     Draw random batch with size $|B| = K$; calculate $V_B$ and $\nabla\tilde{f}_B(w_k)$ using (3.8)
 4:     while $\|\nabla\tilde{f}_B(w_k)\|^2 \le V_B/K$ do
 5:         Increase batch size $K \leftarrow K + \delta_K$
 6:         Sample more gradients and update $V_B$ and $\nabla\tilde{f}_B(w_k)$
 7:     end while
 8:     $w_{k+1} = w_k - \alpha\nabla\tilde{f}_B(w_k)$
 9: end while

The framework outlined in Section 3.2.2 requires two ingredients: estimating the batch size and estimating the step size. To estimate the batch size needed to achieve (3.2), we start with an initial batch size $K$, and draw a random batch $B$ with $|B| = K$. We then compute the stochastic gradient estimate $\nabla\tilde{f}_B(w_k)$ and the sample variance
\[ V_B := \frac{1}{|B| - 1}\sum_{x \in B}\|\nabla f_x(w_k) - \nabla\tilde{f}_B(w_k)\|^2 \approx \mathrm{Tr}\,\mathrm{Var}_{x \in B}(\nabla f_x(w_k)). \tag{3.8} \]
We then test whether $\|\nabla\tilde{f}_B(w_k)\|^2 > V_B/|B|$ as a proxy for (3.2). If this condition holds, we proceed with a gradient step, else we increase the batch size $K \leftarrow K + \delta_K$, and check our condition again. We fix $\delta_K = 0.1K$ for all our experiments.
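The batch-growing test can be sketched as follows. This is an illustrative reading of Algorithm 1 under several assumptions that are not from the text: a finite dataset of size `n` (so the batch is capped at `n`), a user-supplied per-sample gradient oracle `grad_fi(w, i)`, an increment of at least one sample per growth step, and a fixed iteration count in place of a convergence test.

```python
import math
import random

def sq_norm(v):
    return sum(vj * vj for vj in v)

def batch_stats(grads):
    """Mean gradient and sample variance V_B of (3.8) for per-sample gradients."""
    k, d = len(grads), len(grads[0])
    mean = [sum(g[j] for g in grads) / k for j in range(d)]
    var = sum(sq_norm([g[j] - mean[j] for j in range(d)]) for g in grads) / (k - 1)
    return mean, var

def big_batch_sgd(grad_fi, n, w0, step, k0=2, iters=1000, seed=0):
    rng = random.Random(seed)
    w, K = list(w0), k0                      # K is never decreased
    for _ in range(iters):
        order = rng.sample(range(n), n)      # a prefix of a shuffle is a uniform batch
        grads = [grad_fi(w, i) for i in order[:K]]
        mean, var = batch_stats(grads)
        # Grow the batch until ||mean||^2 > V_B / K (theta = 1) or data runs out.
        while sq_norm(mean) <= var / K and K < n:
            K_new = min(n, K + max(1, math.ceil(0.1 * K)))
            grads += [grad_fi(w, i) for i in order[K:K_new]]
            K = K_new
            mean, var = batch_stats(grads)
        w = [wj - step * gj for wj, gj in zip(w, mean)]
    return w, K

# Hypothetical noiseless least-squares data generated by w = (2, -1).
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0),
        ([1.0, 1.0], 1.0), ([1.0, -1.0], 3.0)]

def grad_fi(w, i):
    x, y = data[i]
    r = sum(wj * xj for wj, xj in zip(w, x)) - y
    return [2.0 * r * xj for xj in x]

w_hat, K_final = big_batch_sgd(grad_fi, n=len(data), w0=[0.0, 0.0], step=0.1)
```

Recomputing the statistics from scratch after each growth step keeps the sketch short; a practical implementation would more likely update the running mean and variance incrementally.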
Our implementation also simply chooses $\theta = 1$. The fixed step size big batch method is listed in Algorithm 1.

We also consider a backtracking variant of SGD that adaptively tunes the step size. This method selects batch sizes using the same criterion (3.8) as in the constant step size case. However, after a batch has been selected, a backtracking Armijo line search is used to select a step size. In the Armijo line search, we keep decreasing the step size by a constant factor (in our case, by a factor of 2) until the following condition is satisfied on each iteration:
\[ \tilde{f}_B(w_{k+1}) \le \tilde{f}_B(w_k) - c\alpha_k\|\nabla\tilde{f}_B(w_k)\|^2, \tag{3.9} \]
where $c$ is a parameter of the line search usually set to $0 < c \le 0.5$. We now present a convergence result of big batch SGD using the Armijo line search.

Theorem 3.4.1. Suppose that $f$ satisfies Assumptions 3.3.1 and 3.3.2 and that, on each iteration, the batch size is large enough to satisfy (3.2) for $\theta \in (0, 1)$. If an Armijo line search, given by (3.9), is used, and the step size is decreased by a factor of 2 each time (3.9) fails, then we get the following linear convergence bound for big batch SGD using updates of the form (3.3):
\[ \mathbb{E}[f(w_{k+1}) - f(w^\star)] \le \gamma \cdot \mathbb{E}[f(w_k) - f(w^\star)], \]
where $\gamma = 1 - 2c\mu\min\left(\alpha_0, \frac{1}{2\beta L}\right)$ and $0 < c \le 0.5$. If the initial step size $\alpha_0$ is set large enough such that $\alpha_0 \ge \frac{1}{2\beta L}$, then we get:
\[ \mathbb{E}[f(w_{k+1}) - f(w^\star)] \le \left(1 - \frac{c\mu}{\beta L}\right)\mathbb{E}[f(w_k) - f(w^\star)]. \]

Proof. Applying the reverse triangle inequality to (3.2) and using Lemma 3.3.1 we get, as in Theorem 3.3.1:
\[ \mathbb{E}[f(w_{k+1}) - f(w^\star)] \le \mathbb{E}[f(w_k) - f(w^\star)] - \left(\alpha - \frac{L\alpha^2\beta}{2}\right)\mathbb{E}\|\nabla f(w_k)\|^2, \tag{3.10} \]
where $\beta = \frac{\theta^2 + (1-\theta)^2}{(1-\theta)^2} \ge 1$.

We will show that the backtracking condition in (3.9) is satisfied whenever $0 < \alpha_t \le \frac{1}{\beta L}$. Notice that $0 < \alpha_t \le \frac{1}{\beta L}$ implies $-\alpha_t + \frac{L\alpha_t^2\beta}{2} \le -\frac{\alpha_t}{2}$. Thus, we can rewrite (3.10) as
\begin{align*}
\mathbb{E}[f(w_{k+1}) - f(w^\star)] &\le \mathbb{E}[f(w_k) - f(w^\star)] - \frac{\alpha_t}{2}\mathbb{E}\|\nabla f(w_k)\|^2 \\
&\le \mathbb{E}[f(w_k) - f(w^\star)] - c\alpha_t\mathbb{E}\|\nabla f(w_k)\|^2,
\end{align*}
where $0 < c \le 0.5$. Thus, the backtracking line search condition (3.9) is satisfied whenever $0 < \alpha_t \le \frac{1}{L\beta}$.
Now we know that either $\alpha_t = \alpha_0$ (the initial step size), or $\alpha_t \ge \frac{1}{2\beta L}$, where the step size is decreased by a factor of 2 each time the backtracking condition fails. Thus, we can rewrite the above as
\[ \mathbb{E}[f(w_{k+1}) - f(w^\star)] \le \mathbb{E}[f(w_k) - f(w^\star)] - c\min\left(\alpha_0, \frac{1}{2\beta L}\right)\mathbb{E}\|\nabla f(w_k)\|^2. \]
Using Assumption 3.3.2 we get
\[ \mathbb{E}[f(w_{k+1}) - f(w^\star)] \le \left(1 - 2c\mu\min\left(\alpha_0, \frac{1}{2\beta L}\right)\right)\mathbb{E}[f(w_k) - f(w^\star)]. \]
Assuming we start off the step size at a large value such that $\min\left(\alpha_0, \frac{1}{2\beta L}\right) = \frac{1}{2\beta L}$, we can rewrite the above to get the desired bound. □

In practice, on iterations where the batch size increases, we double the step size before running line search to prevent the step sizes from decreasing monotonically. The complete details are listed in Algorithm 2.

Algorithm 2 Big batch SGD: backtracking line search
 1: initialize $w_0$, initial step size $\alpha$, initial batch size $K > 1$, batch size increment $\delta_K$, backtracking line search parameter $c$, flag $F = 0$
 2: while not converged do
 3:     Draw random batch with size $|B| = K$; calculate $V_B$ and $\nabla\tilde{f}_B(w_k)$ using (3.8)
 4:     while $\|\nabla\tilde{f}_B(w_k)\|^2 \le V_B/K$ do
 5:         Increase batch size $K \leftarrow K + \delta_K$
 6:         Sample more gradients and update $V_B$ and $\nabla\tilde{f}_B(w_k)$
 7:         Set flag $F = 1$
 8:     end while
 9:     if flag $F == 1$ then
10:         $\alpha \leftarrow \alpha * 2$; reset flag $F = 0$
11:     end if
12:     while $\tilde{f}_B(w_k - \alpha\nabla\tilde{f}_B(w_k)) > \tilde{f}_B(w_k) - c\alpha\|\nabla\tilde{f}_B(w_k)\|^2$ do
13:         $\alpha \leftarrow \alpha/2$
14:     end while
15:     $w_{k+1} = w_k - \alpha\nabla\tilde{f}_B(w_k)$
16: end while

3.5 Adaptive step sizes using the Barzilai-Borwein estimate

While the Armijo backtracking line search leads to an automated big batch method, the step size sequence is monotonic (neglecting the heuristic mentioned in the previous section). In this section, we derive a non-monotonic step size scheme that uses curvature estimates to propose new step size choices.

Our derivation follows the classical adaptive Barzilai-Borwein (BB) method [BB88]. The BB method fits a quadratic model to the objective on each iteration, and a step size is proposed that is optimal for the local quadratic model [GSB14].
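For reference, the halving loop of the Armijo search in Algorithm 2 can be sketched in isolation. The function name `armijo_step`, the cap `max_halvings`, and the one-dimensional toy objective are assumptions introduced for illustration; `f_batch` stands in for the batch objective $\tilde{f}_B$ and `g` for its gradient at `w`.

```python
def armijo_step(f_batch, w, g, alpha, c=0.1, max_halvings=50):
    """Halve alpha until the sufficient-decrease test (3.9) holds on the batch.

    The cap max_halvings is a safeguard added for this sketch; if it is hit,
    the last halved value is returned without a further check.
    """
    g_sq = sum(gj * gj for gj in g)
    f0 = f_batch(w)
    for _ in range(max_halvings):
        w_new = [wj - alpha * gj for wj, gj in zip(w, g)]
        if f_batch(w_new) <= f0 - c * alpha * g_sq:
            break
        alpha /= 2.0
    return alpha

def f_toy(w):
    """Toy batch objective f(w) = 2 w^2, whose gradient is 4 w."""
    return 2.0 * w[0] ** 2

# Starting from alpha = 1 at w = 1, two halvings are needed: 1 -> 0.5 -> 0.25.
alpha = armijo_step(f_toy, w=[1.0], g=[4.0], alpha=1.0)
```

On this quadratic the accepted step 0.25 equals the inverse curvature 1/4, which is the step that lands exactly on the minimizer.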
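To derive the analog of the BB method for stochastic problems, we consider quadratic approximations of the form $f(w) = \mathbb{E}_\phi f_\phi(w)$, where $f_\phi(w) = \frac{\nu}{2}\|w - \phi\|^2$ and $\phi \sim \mathcal{N}(w^\star, \sigma^2 I)$. We derive the optimal step size for this. We can rewrite the quadratic approximation as:

$$f(w) = \frac{\nu}{2}\,\mathbb{E}_\phi \|w - \phi\|^2 = \frac{\nu}{2}\left[\langle w, w\rangle - 2\langle w, w^\star\rangle + \mathbb{E}\langle \phi, \phi\rangle\right] = \frac{\nu}{2}\left(\|w - w^\star\|^2 + d\sigma^2\right),$$

since we can write: $\mathbb{E}\langle \phi, \phi\rangle = \sum_{i=1}^{d} \mathbb{E}(\phi)_i^2 = \sum_{i=1}^{d} \left((w^\star)_i^2 + \sigma^2\right) = \|w^\star\|^2 + d\sigma^2$. Further, notice that: $\mathbb{E}_\phi[\nabla f_\phi(w)] = \nu(w - w^\star)$ and $\operatorname{Tr} \operatorname{Var}_\phi[\nabla f_\phi(w)] = d\nu^2\sigma^2$.

Using the quadratic approximation, we can rewrite the update for big batch SGD as:

$$w_{k+1} = w_k - \alpha_k \frac{1}{|B|} \sum_{i=1}^{|B|} \nu (w_k - \phi_i) = (1 - \nu\alpha_k) w_k + \nu\alpha_k w^\star + \frac{\nu\sigma\alpha_k}{|B|} \sum_{i \in B} \xi_i,$$

where we write $\phi_i = w^\star + \sigma \xi_i$ with $\xi_i \sim \mathcal{N}(0, I)$. The expected value of $f$ is:

$$\mathbb{E}[f(w_{k+1})] = \frac{\nu}{2}\,\mathbb{E}_\xi\left[\Big\|(1 - \nu\alpha_k)(w_k - w^\star) + \frac{\nu\sigma\alpha_k}{|B|} \sum_{i \in B} \xi_i\Big\|^2 + d\sigma^2\right] = \frac{\nu}{2}\left(\|(1 - \nu\alpha_k)(w_k - w^\star)\|^2 + \Big(1 + \frac{\nu^2\alpha_k^2}{|B|}\Big) d\sigma^2\right).$$

Minimizing $\mathbb{E}[f(w_{k+1})]$ w.r.t. $\alpha_k$ we get:

$$\alpha_k = \frac{1}{\nu} \cdot \left(1 - \frac{\frac{1}{|B_k|} \operatorname{Tr} \operatorname{Var}_x[\nabla f_x(w_k)]}{\mathbb{E}\|\nabla \tilde{f}_{B_k}(w_k)\|^2}\right). \tag{3.11}$$

Here $\nu$ denotes the curvature of the quadratic approximation. Note that, in the case of deterministic gradient descent, the optimal step size is simply $1/\nu$ [GSB14].

We estimate the curvature $\nu_k$ on each iteration using the BB least-squares rule [BB88]:

$$\nu_k = \frac{\langle w_k - w_{k-1},\, \nabla \tilde{f}_{B_k}(w_k) - \nabla \tilde{f}_{B_k}(w_{k-1}) \rangle}{\|w_k - w_{k-1}\|^2}. \tag{3.12}$$

Thus, each time we sample a batch $B_k$ on the $k$-th iteration, we calculate the gradient on that batch at the previous iterate, i.e., we calculate $\nabla \tilde{f}_{B_k}(w_{k-1})$. This gives us an approximate curvature estimate, with which we derive the step size $\alpha_k$ using (3.11).

3.5.1 Convergence proof

Here we prove convergence for the adaptive step size method described above. For the convergence proof, we first state two assumptions:

Assumption 3.5.1. Each $f_i$ has $L$-Lipschitz gradients: $f_i(w) \le f_i(w') + \langle \nabla f_i(w'), w - w' \rangle + \frac{L}{2}\|w - w'\|^2, \ \forall i$.

Assumption 3.5.2. Each $f_i$ is $\mu$-strongly convex: $\langle \nabla f_i(w) - \nabla f_i(w'), w - w' \rangle \ge \mu \|w - w'\|^2, \ \forall i$.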
Note that both assumptions are stronger than Assumptions 3.3.1 and 3.3.2, i.e., Assumption 3.5.1 implies 3.3.1 and Assumption 3.5.2 implies 3.3.2 [KNS16]. Both are very standard assumptions frequently used in the convex optimization literature.

From (3.11), we see that we can lower bound the step size as: $\alpha_k \ge (1 - \theta^2)/\nu$. Thus, the step size for big batch SGD is scaled down by at most $1 - \theta^2$. For simplicity, we assume that the step size is set to this lower bound: $\alpha_k = (1 - \theta^2)/\nu_k$. Thus, from Assumptions 3.5.1 and 3.5.2, we can bound $\nu_k$, and also $\alpha_k$, as follows:

$$\mu \le \nu_k \le L \implies \frac{1 - \theta^2}{L} \le \alpha_k \le \frac{1 - \theta^2}{\mu}.$$

From Theorem 3.3.1, we see that we have linear convergence with the adaptive step size method when:

$$1 - 2\mu\left(\alpha - \frac{L\alpha^2\beta}{2}\right) \le 1 - \frac{2(1 - \theta^2)}{\kappa} + \beta (1 - \theta^2)^2 \kappa < 1 \implies \kappa^2 < \frac{2}{\beta(1 - \theta^2)},$$

where $\kappa = L/\mu$ is the condition number. We see that the adaptive step size method enjoys a linear convergence rate when the problem is well-conditioned. In the next section, we talk about ways to deal with poorly-conditioned problems.

3.5.2 Practical implementation

To achieve robustness of the algorithm for poorly conditioned problems, we include a backtracking line search after calculating (3.11), to ensure that the step sizes do not blow up. Further, instead of calculating two gradients on each iteration ($\nabla \tilde{f}_{B_k}(w_k)$ and $\nabla \tilde{f}_{B_k}(w_{k-1})$), our implementation uses the same batch (and step size) on two consecutive iterations. Thus, one parameter update takes place for each gradient calculation.

We found the step size calculated from (3.11) to be noisy when the batch is small. While this did not affect long-term performance, we perform a smoothing operation to even out the step sizes and make performance more predictable. Let $\tilde{\alpha}_k$ denote the step size calculated from (3.11). Then, the step size on each iteration is given by

$$\alpha_k = \left(1 - \frac{|B|}{n}\right) \alpha_{k-1} + \frac{|B|}{n}\, \tilde{\alpha}_k.$$

This ensures that the update is proportional to how accurate the estimate on each iteration is.
This simple smoothing operation seemed to work very well in practice as shown in the experimental section. Note that when $|B_k| = n$, we just use $\alpha_k = 1/\nu_k$. Since there is no noise in the algorithm in this case, we use the optimal step size for a deterministic algorithm. Algorithm 3 shows the complete details.

Algorithm 3 Big batch SGD: with BB step sizes
initialize $w_0$, initial step size $\alpha$, initial batch size $K > 1$, batch size increment $\delta_K$, backtracking line search parameter $c$
while not converged do
    Draw random batch with $|B| = K$; calculate $V_B$ and $G_B = \nabla \tilde{f}_B(w)$ using (3.8)
    while $\|G_B\|^2 \le V_B / K$ do
        Increase batch size $K \leftarrow K + \delta_K$
        Sample more gradients and update $V_B$ and $G_B$
    end while
    while $\tilde{f}_B(w - \alpha \nabla \tilde{f}_B(w)) > \tilde{f}_B(w) - c\,\alpha \|\nabla \tilde{f}_B(w)\|^2$ do
        $\alpha \leftarrow \alpha / 2$
    end while
    $w \leftarrow w - \alpha \nabla \tilde{f}_B(w)$
    if $K < n$ then
        Calculate $\tilde{\alpha} = (1 - V_B / (K \|G_B\|^2)) / \nu$ using (3.11) and (3.12)
    else
        Calculate $\tilde{\alpha} = 1/\nu$ using (3.12)
    end if
    step size smoothing: $\alpha \leftarrow \alpha (1 - K/n) + \tilde{\alpha} K / n$
    while $\tilde{f}_B(w - \alpha \nabla \tilde{f}_B(w)) > \tilde{f}_B(w) - c\,\alpha \|\nabla \tilde{f}_B(w)\|^2$ do
        $\alpha \leftarrow \alpha / 2$
    end while
    $w \leftarrow w - \alpha \nabla \tilde{f}_B(w)$
end while

Figure 3.1: Convex experiments. Left to right: Ridge regression on MILLIONSONG; Logistic regression on COVERTYPE; Logistic regression on IJCNN1. The top row shows how the norm of the true gradient decreases with the number of epochs, the middle and bottom rows show the batch sizes and step sizes used on each iteration by the big batch methods. Here ‘passes through the data’ indicates number of epochs, while ‘iterations’ refers to the number of parameter updates used by the method (there may be multiple iterations during one epoch).
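The step size logic of Algorithm 3 can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the experiments: the function names are hypothetical, NumPy is assumed, and the variance estimate $V_B$ from (3.8) is assumed to be computed elsewhere and passed in as a number (`v_b`).

```python
import numpy as np

def bb_curvature(w, w_prev, g, g_prev):
    """BB least-squares curvature estimate, following (3.12).

    g and g_prev are gradients of the SAME batch at w and w_prev.
    """
    dw = w - w_prev
    return float(dw @ (g - g_prev)) / float(dw @ dw)

def bb_step_size(nu, v_b, grad_sq_norm, batch_size, n):
    """Step size proposal from (3.11); 1/nu when the batch is the full dataset."""
    if batch_size >= n:
        return 1.0 / nu
    return (1.0 - v_b / (batch_size * grad_sq_norm)) / nu

def smooth_step_size(alpha_prev, alpha_new, batch_size, n):
    """Weight the new proposal by the batch fraction |B| / n."""
    frac = batch_size / n
    return (1.0 - frac) * alpha_prev + frac * alpha_new

def armijo_backtrack(loss, w, g, alpha, c=0.1, max_halvings=50):
    """Halve alpha until the Armijo condition (3.9) holds for the batch loss.

    max_halvings is a safeguard added for this sketch only.
    """
    f0 = loss(w)
    g_sq = float(g @ g)
    for _ in range(max_halvings):
        if loss(w - alpha * g) <= f0 - c * alpha * g_sq:
            break
        alpha *= 0.5
    return alpha
```

For example, with a curvature estimate of 2, `v_b = 4`, a squared batch gradient norm of 1 and a batch of 8 out of 100 samples, `bb_step_size` proposes (1 - 4/8)/2 = 0.25, and `smooth_step_size` then moves the previous step size only 8% of the way toward that proposal.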
[Figure 3.2 plots: mean class accuracy on the train set (top row) and on the test set (bottom row) versus number of epochs, for Adadelta, BB+Adadelta, SGD+Mom (Fine Tuned), SGD+Mom (Fixed LR), BBS+Mom (Fixed LR), and BBS+Mom+Armijo.]

Figure 3.2: Neural Network Experiments. The three columns from left to right correspond to results for CIFAR-10, SVHN, and MNIST, respectively. The top row presents classification accuracies on the training set, while the bottom row presents classification accuracies on the test set.

3.6 Experiments

In this section, we present our experimental results. We explore big batch methods with both convex and non-convex (neural network) experiments on large and high-dimensional datasets.

3.6.1 Convex experiments

For the convex experiments, we test big batch SGD on a binary classification problem with logistic regression and a linear regression problem:

$$\min_w \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp(-y_i \langle x_i, w \rangle)\right),$$

$$\min_w \frac{1}{n} \sum_{i=1}^{n} \left(\langle x_i, w \rangle - y_i\right)^2.$$

Figure 3.1 presents the results of our convex experiments on three standard real world datasets: IJCNN1 [Pro01] and COVERTYPE [BD99] for logistic regression, and MILLIONSONG [BMEWL11] for linear regression. As a preprocessing step, we normalize the features for each dataset.
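The two convex objectives above are simple to write down in code. The following is a minimal NumPy sketch with hypothetical function names; it assumes the rows of `X` are the feature vectors $x_i$ and, for the logistic loss, that the labels $y_i$ take values in $\{-1, +1\}$.

```python
import numpy as np

def logistic_loss(w, X, y):
    """(1/n) * sum_i log(1 + exp(-y_i <x_i, w>)), labels assumed in {-1, +1}."""
    margins = y * (X @ w)
    # logaddexp(0, -m) evaluates log(1 + exp(-m)) without overflow
    return float(np.mean(np.logaddexp(0.0, -margins)))

def logistic_grad(w, X, y):
    margins = y * (X @ w)
    # 1 / (1 + exp(m)) written as 0.5 * (1 - tanh(m / 2)) for numerical stability
    coeff = -y * 0.5 * (1.0 - np.tanh(0.5 * margins))
    return X.T @ coeff / X.shape[0]

def least_squares_loss(w, X, y):
    """(1/n) * sum_i (<x_i, w> - y_i)^2."""
    residual = X @ w - y
    return float(np.mean(residual ** 2))

def least_squares_grad(w, X, y):
    residual = X @ w - y
    return 2.0 * X.T @ residual / X.shape[0]
```

Restricting `X` and `y` to the rows of a sampled batch gives the batch loss and batch gradient that the big batch methods operate on.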
We compare deterministic gradient descent (GD) and SGD with step size decay ($\alpha_k = a/(b + k)$) to big batch SGD using a fixed step size (BBS+Fixed LR), with backtracking line search (BBS+Armijo) and with the adaptive step size (3.11) (BBS+BB), as well as the growing batch method described in [FS12] (denoted as SF; while the authors propose a quasi-Newton method, we adapt their algorithm to a first-order method). We selected step size parameters using a comprehensive grid search for all algorithms, except BBS+Armijo and BBS+BB, which require no parameter tuning.

We see that across all three problems, the big batch methods outperform the other algorithms. We also see that both fully automated methods are always comparable to or better than fixed step size methods. The automated methods increase the batch size more slowly than BBS+Fixed LR and SF, and thus can take more steps with smaller batches, leveraging the advantages of small batches for longer. Further, note that the step sizes derived by the automated methods are very close to the optimal fixed step size.

3.6.2 Neural network experiments

To demonstrate the versatility of the big batch SGD framework, we also present results on neural network experiments. We compare big batch SGD against SGD with finely tuned step size schedules and fixed step sizes. We also compare with Adadelta [Zei12], and combine the big batch method with Adadelta (BB+Adadelta) to show that more complex SGD variants can benefit from growing batch sizes. In addition, we also compared big batch methods with L-BFGS. However, we found L-BFGS to consistently yield poorer generalization error on neural networks, and thus we omitted these results.

We train a convolutional neural network [LBBH98] (ConvNet) to classify three benchmark image datasets: CIFAR-10 [KH09], SVHN [NWC+11], and MNIST [LBBH98]. Our ConvNet is composed of 4 layers. We use 32 × 32 pixel images as input.
The first layer of the ConvNet contains 16 × 3 × 3 filters, and the second layer contains 256 × 3 × 3 filters. The third and fourth layers are fully connected [LBBH98] with 256 and 10 outputs respectively. Each layer except the last one is followed by a ReLU non-linearity [KSH12] and a max pooling stage [RHBL07] of size 2 × 2. This ConvNet has over 4.3 million weights.

To compare against fine-tuned SGD, we used a comprehensive grid search on the step size schedule to identify optimal parameters (up to a factor of 2 accuracy). For CIFAR-10, the step size starts from 0.5 and is divided by 2 every 5 epochs with 0 step size decay. For SVHN, the step size starts from 0.5 and is divided by 2 every 5 epochs with 1e−05 step size decay. For MNIST, the step size starts from 1 and is divided by 2 every 3 epochs with 0 step size decay. All algorithms use a momentum parameter of 0.9, and SGD and Adadelta use mini-batches of size 128.

Fixed step size methods use the default decay rule of the Torch library: $\alpha_k = \alpha_0 / (1 + 10^{-7} k)$, where $\alpha_0$ was chosen to be the step size used in the fine-tuned experiments. We also tune the hyper-parameter $\rho$ in the Adadelta algorithm, and we found 0.9, 0.9 and 0.8 to be the best-performing parameters for CIFAR-10, SVHN and MNIST respectively.

We plot the accuracy on the train and test set vs the number of epochs (full passes through the dataset) in Figure 3.2. We notice that big batch SGD with backtracking performs better than both Adadelta and SGD (Fixed LR) in terms of both train and test error. Big batch SGD even performs comparably to fine-tuned SGD but without the trouble of fine tuning. This is interesting because most state-of-the-art deep networks (like AlexNet [KSH12], VGG Net [SZ14], ResNets [HZRS16a]) were trained by their creators using standard SGD with momentum, and training parameters were tuned over long periods of time (sometimes months).
Finally, we note that big batch Adadelta performs consistently better than plain Adadelta on both large-scale problems (SVHN and CIFAR-10), and performance is nearly identical on the small-scale MNIST problem.

3.7 Summary

We analyzed and studied the behavior of alternative SGD methods in which the batch size increases over time. Unlike classical SGD methods, in which stochastic gradients quickly become swamped with noise, these “big batch” methods maintain a nearly constant signal to noise ratio of the approximate gradient. As a result, big batch methods are able to adaptively adjust batch sizes without user oversight. The proposed automated methods are shown to be empirically comparable or better performing than other standard methods, but without requiring an expert user to choose learning rates and decay parameters.

Chapter 4: Distributing SGD using variance reduction

4.1 Introduction

For truly large datasets, parallel or distributed algorithms are vital, driving interest in SGD variants that parallelize over massive distributed datasets. While there has been quite a bit of recent work in the area of parallel asynchronous SGD algorithms [RRWN11, DCM+12, LHLL15, AD11, LASY14, SS14, ZLS09, BT89, ZWLS10, ZCL15], these methods typically experience substantially reduced marginal benefit as the number of worker nodes increases beyond a certain limit. Thus, while some of these algorithms scale linearly when the number of worker nodes is small, they are less effective when the data is distributed over hundreds or thousands of nodes.

Moreover, most research in parallel or distributed SGD methods has been focused on the parameter server model of computation [RRWN11, DCM+12, AD11, LASY14, ZLS09], where each update to the centrally stored parameter vector requires a communication phase between the local node and the central server.
However, SGD methods tend to become unstable with infrequent communication, and there has been less work in the truly distributed setting where communication costs are high [ZWLS10, ZCL15, MR16]. In this section, we propose to boost the scalability of stochastic optimization algorithms using variance reduction techniques, yielding SGD methods that scale linearly over hundreds or thousands of nodes and can train models on massive datasets without the slowdown that existing stochastic methods experience.

Notation. For this chapter, let $f_{\tilde{k}}$ denote the stochastic function chosen on the $k$-th iteration, where $\tilde{k}$ is an index chosen uniformly at random from $\{1, 2, \ldots, n\}$. Thus, using this notation, the regular SGD update can be written as $w_{k+1} = w_k - \alpha \nabla f_{\tilde{k}}(w_k)$.

Background. Variance reduction (VR) methods [JZ13, DBLJ14, RHS+15, RSB12, DD+14, KLRT14, KR13, XZ14, WCSX13, HAV+15] have recently gained popularity as an alternative to classical SGD. These methods reduce the variance in the stochastic gradient estimates, and are able to maintain a large constant step size to achieve fast convergence to high accuracy.

VR methods exploit the fact that gradient errors are highly correlated between different uses of the same function $f_{\tilde{k}}$. This is done by subtracting an error correction term from $\nabla f_{\tilde{k}}(w_k)$ that estimates the gradient error from the most recent use of $f_{\tilde{k}}$. Thus the stochastic gradients used by VR methods have the form

$$\tilde{g}_k = \underbrace{\nabla f_{\tilde{k}}(w_k)}_{\text{approximate gradient}} \; \underbrace{-\, \nabla f_{\tilde{k}}(\tilde{w}) + g_{\tilde{w}}}_{\text{error correction term}}, \tag{4.1}$$

where $\tilde{w}$ is an old iterate, and $g_{\tilde{w}}$ is an approximation of the true gradient $\nabla f(\tilde{w})$. Typically, $g_{\tilde{w}}$ can be kept fixed over an epoch or can be updated cheaply on every iteration.
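The effect of the correction term in (4.1) can be checked numerically. The sketch below is purely illustrative: it uses a synthetic least-squares problem with components $f_i(w) = \frac{1}{2}(\langle a_i, w\rangle - b_i)^2$ (an assumption made for the example), takes $g_{\tilde{w}}$ to be the exact full gradient at the old iterate (one possible choice), and compares the bias and total variance of the plain and corrected stochastic gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
A = rng.normal(size=(n, d))   # synthetic data, for illustration only
b = rng.normal(size=n)

def grad_i(w, i):
    """Gradient of the i-th component f_i(w) = 0.5 * (<a_i, w> - b_i)^2."""
    return (A[i] @ w - b[i]) * A[i]

def full_grad(w):
    """Gradient of f = (1/n) * sum_i f_i."""
    return A.T @ (A @ w - b) / n

def vr_grad(w, w_old, g_old, i):
    """Corrected gradient of the form (4.1)."""
    return grad_i(w, i) - grad_i(w_old, i) + g_old

w_old = rng.normal(size=d)
w = w_old + 0.01 * rng.normal(size=d)   # current iterate, close to the old one
g_old = full_grad(w_old)                # assumed choice of the stored average

sgd = np.array([grad_i(w, i) for i in range(n)])
vr = np.array([vr_grad(w, w_old, g_old, i) for i in range(n)])

# Averaging over all indices equals the expectation under uniform sampling.
bias_vr = np.linalg.norm(vr.mean(axis=0) - full_grad(w))
var_sgd = np.sum(sgd.var(axis=0))
var_vr = np.sum(vr.var(axis=0))
print(f"bias of VR gradient: {bias_vr:.2e}")
print(f"total variance, SGD: {var_sgd:.3e}, VR: {var_vr:.3e}")
```

When the current iterate is close to the old one, the corrected gradient keeps the same mean as the plain stochastic gradient while its variance is much smaller, which is what allows VR methods to use a large constant step size.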
As an example, the SVRG algorithm [JZ13] has an update rule of the form:

$$w_{k+1} = w_k - \alpha \left(\nabla f_{\tilde{k}}(w_k) - \nabla f_{\tilde{k}}(\tilde{w}) + \nabla f(\tilde{w})\right), \tag{4.2}$$

where $\tilde{w}$ is chosen to be a recent iterate from the algorithm history and is fixed over 1 or 2 epochs, and $g_{\tilde{w}} = \nabla f(\tilde{w})$ is the true gradient of $f$ at $\tilde{w}$, which needs to be computed once every 1 or 2 epochs. Another popular VR algorithm, SAGA [DBLJ14], uses the following corrected gradient approximation

$$\tilde{g}_k = \nabla f_{\tilde{k}}(w_k) - \nabla f_{\tilde{k}}(\tilde{w}_{\tilde{k}}) + \frac{1}{n} \sum_{j=1}^{n} \nabla f_j(\tilde{w}_j), \tag{4.3}$$

where each $\nabla f_j(\tilde{w}_j)$ denotes the most recent value of $\nabla f_j$ and $\tilde{w}_j$ denotes the iterate at which the most recent $\nabla f_j$ was evaluated. In this case $g_{\tilde{w}}$ is the average of the $\nabla f_j(\tilde{w}_j)$ values for all $j \in \{1, 2, \ldots, n\}$. This error correction term reduces the variance in the stochastic gradients, and thus ensures fast convergence. Notice that for both algorithms, SVRG and SAGA, if $\tilde{k}$ is chosen uniformly at random from $\{1, 2, \ldots, n\}$, we have $\mathbb{E}_{\tilde{k}}\, \tilde{g}_k = \nabla f(w_k)$. Thus, the error correction term has expected value 0 and the approximate gradient $\tilde{g}_k$ is unbiased for both SVRG and SAGA.

Most work on VR methods has focused on studying their faster convergence rates and better stability properties when compared to classical SGD in the sequential setting. While there have been a few recent papers on parallelizing VR methods, these methods scale poorly in distributed settings and all prior work that we know of has focused on small-scale parallel or shared memory settings, with the data distributed over 10 or 20 nodes [RHS+15, MPP+15, PLT+16]. These parallel algorithms use a parameter server model of computation, and are based on the assumption that communication costs are low, which may not be true in large-scale heterogeneous distributed computing environments. The fact that the error correction term reduces the variance in the stochastic gradients, however, seems to indicate that distributed VR methods could be helpful in distributed settings.
In particular, the variance-reduced gradients would help in dealing with the problems of instability and slower convergence faced by regular stochastic methods when the frequency of communication between the server and the local nodes is decreased.

Contributions. In this work, we use variance reduction to dramatically boost the performance of SGD in the distributed setting. We do this by exploiting the dependence of VR methods on the gradient correction term $g_{\tilde{w}}$. We allow many local worker nodes to run simultaneously, while communicating with the central server only through the exchange of this central error correction term and the locally stored iterates. The proposed schemes allow many asynchronous processes to work towards a central solution with minimal communication, while simultaneously benefitting from the fast convergence provided by VR.

This work has four main contributions:

• First, we present a new VR algorithm CentralVR, built on SAGA, that is robust to noise and variance in the dataset. We propose synchronous (CentralVR-Sync) and asynchronous (CentralVR-Async) variations of CentralVR which can linearly scale up over massive datasets using hundreds of cores.

• Second, we theoretically study the convergence of CentralVR and prove linear convergence of the method with constant step sizes.

• Third, we propose distributed versions of the existing popular VR algorithms, SVRG and SAGA, that are robust to high communication latency between the worker nodes and the central server, and can scale over large distributed settings ranging over hundreds of nodes. Table 4.1 summarizes the distributed algorithms proposed in this section and their storage and computation requirements.

• Finally, we present empirical results over different models and datasets that show that these distributed algorithms can be trained on massive highly distributed datasets in far less time than existing state-of-the-art stochastic optimization methods.
Performance of all these distributed methods scales linearly up to hundreds of workers with low communication frequency. We show empirically that the proposed methods converge much faster than competing options.

Table 4.1: Distributed Algorithms Proposed

Proposed Algorithm    Asynchronous?    Storage (No. of gradients)    Gradients/Iteration
CentralVR-Sync        No               n                             1
CentralVR-Async       Yes              n                             1
Distributed SVRG      No               2                             2.5
Distributed SAGA      Yes              n                             1

4.2 CentralVR algorithm: single-worker case

We begin by proposing our new VR scheme, CentralVR, in the single-worker case. As we will see later, the proposed method has a natural generalization to the distributed setting that has low communication requirements.

4.2.1 Algorithm overview

Our proposed VR scheme is divided into epochs, with $n$ updates taking place in each epoch. Let the iterates generated in the $m$-th epoch be written as $\{w^m_j\}_{j=1}^{n}$. Also let $\tilde{w}^m_l$ denote the iterate at which the $l$-th data index was most recently used before the $(m+1)$-th epoch (i.e., on or before the $m$-th epoch). Then, the update for CentralVR is:

$$w^{m+1}_{k+1} = w^{m+1}_k - \alpha v^{m+1}_k, \tag{4.4}$$

$$v^{m+1}_k = \nabla f_{\tilde{k}}(w^{m+1}_k) - \nabla f_{\tilde{k}}(\tilde{w}^m_{\tilde{k}}) + \frac{1}{n} \sum_{j=1}^{n} \nabla f_j(\tilde{w}^m_j). \tag{4.5}$$

Denote $\bar{g}^m = \frac{1}{n} \sum_{j=1}^{n} \nabla f_j(\tilde{w}^m_j)$. Thus, $\bar{g}^m$ is the average of the gradients of all component functions $\{\nabla f_j\}_{j=1}^{n}$, each evaluated at the most recent iterate $\{\tilde{w}^m_j\}_{j=1}^{n}$ at which the corresponding function was used on or before the $m$-th epoch. These gradients are stored in a table, and the average gradient $\bar{g}^m$ is updated at the end of each epoch, i.e., after every $n$ parameter updates. Note that if $\tilde{k}$ is chosen uniformly at random from the set $\{1, 2, \ldots, n\}$ on each iteration $k$, then we have $\mathbb{E}\left[\nabla f_{\tilde{k}}(\tilde{w}^m_{\tilde{k}})\right] = \bar{g}^m$. Thus, the error correction term has expected value 0, and $\mathbb{E}\left[v^{m+1}_k\right] = \nabla f(w^{m+1}_k)$, i.e., the approximate gradient $v^{m+1}_k$ is unbiased.
4.2.2 Permutation sampling

In practical implementations, it is natural to consider a random permutation of the data indices on every epoch, rather than uniformly choosing a random index on every iteration. Thus, on each epoch, a random permutation of the data indices is chosen and a pass is made over the entire dataset, resulting in $n$ updates, one per data sample. Permutation sampling often outperforms uniform random sampling empirically [Bot09, Bot12], although theoretical justification for this is still limited (see [GOP15, Sha16] for some recent results).

As an alternative to uniform random sampling, CentralVR can leverage random permutations over the data indices. Let $\pi^m$ denote a random permutation of the data indices $\{1, 2, \ldots, n\}$ for the $m$-th epoch, with $\pi^m_j$ denoting the data index chosen in the $j$-th iteration in the $m$-th epoch. Thus, now $\tilde{w}^m_l$ denotes the iterate corresponding to the point when the $l$-th data index was chosen in the $m$-th epoch. The update rule with the random permutation is given by (4.4) and (4.5), with $\tilde{k} = \pi^{m+1}_k$.

Summing (4.5) over all $k = 0, 1, \ldots, n-1$, we get $\sum_{k=0}^{n-1} v^{m+1}_k = \sum_{j=1}^{n} \nabla f_j(\tilde{w}^{m+1}_j)$, since every index is used exactly once per epoch and the stored gradients cancel against the average. Thus, summing (4.4) over all $k = 0, 1, \ldots, n-1$, using the telescoping sum in $w^{m+1}_k$, and using the convention that $w^{m+1}_n = w^{m+2}_0$, we get

$$w^{m+2}_0 = w^{m+1}_0 - \alpha \sum_{j=1}^{n} \nabla f_j\left(\tilde{w}^{m+1}_j\right). \tag{4.6}$$

Equation (4.6) shows the update rule in terms of the iterates at the ends of the epochs. Thus, over an epoch, the average gradient accumulated by CentralVR is unbiased and thus is a good estimate of the true gradient. This average gradient term can be accumulated cheaply during an epoch, without any noticeable overhead.

4.2.3 Algorithm details for CentralVR

The detailed steps of CentralVR are listed in Algorithm 4. Note, the stored gradients and the average gradient term $g_{\tilde{w}}$ are initialized using a single epoch of “vanilla” SGD with no VR correction.
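The single-worker procedure described in this section (permutation sampling, a table of stored gradients, an average refreshed once per epoch, and an SGD initialization epoch) can be sketched as follows. This is a minimal sketch under stated assumptions rather than the reference implementation: the function name `central_vr`, the callback `grad_i(w, i)` and the fixed number of epochs in place of a convergence test are all hypothetical choices made for illustration.

```python
import numpy as np

def central_vr(grad_i, w0, n, alpha, epochs, seed=0):
    """Single-worker CentralVR sketch with permutation sampling.

    grad_i(w, i) is assumed to return the gradient of the i-th
    component function at w as a 1-D NumPy array.
    """
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    table = np.zeros((n, w.size))     # stored gradients, one row per data index

    # Initialization epoch: plain SGD with no correction, filling the table.
    for i in rng.permutation(n):
        g = grad_i(w, i)
        table[i] = g
        w = w - alpha * g
    g_bar = table.mean(axis=0)

    for _ in range(epochs):
        g_acc = np.zeros_like(w)      # average gradient for the next epoch
        for i in rng.permutation(n):
            g = grad_i(w, i)
            # corrected step: new gradient, minus stored gradient, plus average
            w = w - alpha * (g - table[i] + g_bar)
            g_acc += g / n
            table[i] = g
        g_bar = g_acc                 # refreshed only once per epoch
    return w
```

A possible usage on a small noise-free least-squares problem, where the step size is chosen conservatively for the example:

```python
rng = np.random.default_rng(1)
A = rng.normal(size=(40, 3))
w_true = np.array([1.0, -2.0, 0.5])
b = A @ w_true
w_hat = central_vr(lambda w, i: (A[i] @ w - b[i]) * A[i],
                   np.zeros(3), n=40, alpha=0.005, epochs=400)
```

Because the average `g_bar` changes only at epoch boundaries, it is the only quantity besides the iterate that would need to be shared between workers in a distributed variant.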
Algorithm 4 CentralVR Algorithm: single worker case
parameters learning rate $\alpha$
initialize $w$, $\{\nabla f_j(\tilde{w}_j)\}_j$, and $\bar{g}$ using plain SGD
while not converged do
    $\tilde{g} \leftarrow 0$
    set $\pi$: random permutation of indices $1, 2, \ldots, n$
    for $k$ in $\{1, \ldots, n\}$ do
        set: $w_{k+1} \leftarrow w_k - \alpha \left(\nabla f_{\pi_k}(w_k) - \nabla f_{\pi_k}(\tilde{w}_{\pi_k}) + \bar{g}\right)$
        accumulate average: $\tilde{g} \leftarrow \tilde{g} + \nabla f_{\pi_k}(w_k)/n$
        store gradient: $\nabla f_{\pi_k}(\tilde{w}_{\pi_k}) \leftarrow \nabla f_{\pi_k}(w_k)$
    end for
    set average gradient for next epoch: $\bar{g} \leftarrow \tilde{g}$
end while

CentralVR builds on the SAGA method. SAGA relies on the update rule (4.3), which requires an average over a large number of iterates ($g_{\tilde{w}} = \frac{1}{n} \sum_j \nabla f_j(\tilde{w}_j)$) to be continuously updated on every iteration. In the distributed setting, where the vector $g_{\tilde{w}}$ must be shared across nodes, maintaining an up-to-date average requires large amounts of communication. This makes SAGA less stable in distributed implementations when the communication frequency is decreased. Updating $g_{\tilde{w}}$ only occasionally (as we do in the distributed variants of CentralVR below) translates into significant communication savings in the distributed setting.

CentralVR has the same time and space complexities as SAGA. Namely, on every iteration, 1 gradient computation is required, similar to SGD, and the $n$ gradients $\{\nabla f_j(\tilde{w}^m_j)\}_{j=1}^{n}$ also need to be stored. Note that this is not always a significant storage requirement, since for models like logistic regression and ridge regression only a single number is required to be stored corresponding to each gradient.

4.3 Convergence analysis

We now present convergence bounds for Algorithm 4. We make the following standard assumptions about the function when studying convergence properties. First, each $f_i$ is strongly convex with strong convexity constant $\mu$:

$$f_i(w) \ge f_i(w') + \langle \nabla f_i(w'), w - w' \rangle + \frac{\mu}{2} \|w - w'\|^2. \tag{4.7}$$

Second, each $f_i$ has Lipschitz continuous gradients with Lipschitz constant $L$ so that

$$f_i(w) \le f_i(w') + \langle \nabla f_i(w'), w - w' \rangle + \frac{L}{2} \|w - w'\|^2. \tag{4.8}$$

We now present our main result.
Theorem 4.3.1. Consider CentralVR with data index $\tilde{k}$ drawn uniformly at random (with replacement) on each iteration $k$. Define $\rho := \max\left(1 - \alpha\mu,\ \frac{2L^2\alpha}{\mu(1 - 2L\alpha)}\right)$. If the step size $\alpha$ is small enough such that $0 < \rho < 1$, then we have the following bound:

$$\mathbb{E}\left\|w^{m+2}_0 - w^\star\right\|^2 + c\,\mathbb{E}\left(f(w^{m+1}) - f(w^\star)\right) \le \rho \left(\mathbb{E}\left\|w^{m+1}_0 - w^\star\right\|^2 + c\,\mathbb{E}\left(f(\tilde{w}^m) - f(w^\star)\right)\right),$$

where $c = 2n\alpha(1 - 2L\alpha)$ and we define $f(w^m) := \frac{1}{n} \sum_{k=0}^{n-1} f(w^m_k)$. In other words, the method converges linearly.

We first start with two lemmas that will be useful in the proof for Theorem 4.3.1.

Lemma 4.3.1. For any $f$ defined as $f := \frac{1}{n} \sum_{i=1}^{n} f_i$, where each $f_i$ satisfies (4.7) and (4.8), and conditioning on any $w$, we have

$$\mathbb{E}\left\|\nabla f_j(w) - \nabla f_j(w^\star)\right\|^2 \le 2L\left(f(w) - f(w^\star)\right),$$

where $j$ is sampled uniformly at random from $\{1, 2, \ldots, n\}$ and $w^\star$ is the minimizer of $f$.

Proof. A standard result used frequently in the convex optimization literature is:

$$\|\nabla f_j(w) - \nabla f_j(w^\star)\|^2 \le 2L\left(f_j(w) - f_j(w^\star) - \langle \nabla f_j(w^\star), w - w^\star \rangle\right),$$

where $f_j$ is $L$-Lipschitz smooth. A proof for this inequality can be found in [Nes13] (Theorem 2.1.5 on page 56). Since $j$ is sampled uniformly at random from $\{1, 2, \ldots, n\}$, we can write: $\mathbb{E}_j\left(f_j(w) - f_j(w^\star) - \langle \nabla f_j(w^\star), w - w^\star \rangle\right) = f(w) - f(w^\star)$, using the property that $\nabla f(w^\star) = 0$. The result follows. ∎

Lemma 4.3.2. For any $f$ defined as $f := \frac{1}{n} \sum_{i=1}^{n} f_i$, where each $f_i$ satisfies (4.7) and (4.8), and for any $w$ and $i$ we have

$$\left\|\nabla f_i(w) - \nabla f_i(w^\star)\right\|^2 \le \frac{2L^2}{\mu}\left(f(w) - f(w^\star)\right),$$

where $w^\star$ denotes the minimizer of $f$.

Proof. A standard result used frequently in the convex optimization literature is:

$$\|\nabla f_i(w) - \nabla f_i(w^\star)\|^2 \le L^2 \|w - w^\star\|^2,$$

where $f_i$ is $L$-Lipschitz smooth. A proof for this inequality can be found in [Nes13] (Theorem 2.1.5 on page 56). From (4.7), we get:

$$\|w - w^\star\|^2 \le \frac{2}{\mu}\left(f(w) - f(w^\star) - \langle w - w^\star, \nabla f(w^\star) \rangle\right) = \frac{2}{\mu}\left(f(w) - f(w^\star)\right),$$

using the property that $\nabla f(w^\star) = 0$. The desired result follows immediately. ∎

We now move on to the proof of Theorem 4.3.1.

Proof.
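Let the update rule for CentralVR be denoted as

$$w^{m+1}_{k+1} = w^{m+1}_k - \alpha v^{m+1}_k, \qquad v^{m+1}_k = \left[\nabla f_{\tilde{k}}(w^{m+1}_k) - \nabla f_{\tilde{k}}(\tilde{w}^m_{\tilde{k}}) + \frac{1}{n} \sum_j \nabla f_j(\tilde{w}^m_j)\right].$$

In this proof, we assume that the data indices are accessed randomly with replacement. Thus, $\tilde{w}^m_{\tilde{k}}$ denotes the last iterate when the $\tilde{k}$-th data index was chosen in or before the $m$-th epoch. Thus, conditioning on all $w$, $v^{m+1}_k$ is an unbiased estimator of the true gradient at $w^{m+1}_k$, i.e., $\mathbb{E}\left[v^{m+1}_k\right] = \nabla f(w^{m+1}_k)$.

Conditioned on all history (all $w$), we first begin with the standard identity:

$$\mathbb{E}\left[\|w^{m+1}_{k+1} - w^\star\|^2\right] = \mathbb{E}\left[\|w^{m+1}_k - \alpha v^{m+1}_k - w^\star\|^2\right] = \|w^{m+1}_k - w^\star\|^2 - 2\alpha \langle w^{m+1}_k - w^\star, \nabla f(w^{m+1}_k) \rangle + \alpha^2\, \mathbb{E}\|v^{m+1}_k\|^2. \tag{4.9}$$

We now bound (4.9). Using the definition of strong convexity in (4.7), we can simplify the inner product term in (4.9) as

$$\langle w^\star - w^{m+1}_k, \nabla f(w^{m+1}_k) \rangle \le -\left(f(w^{m+1}_k) - f(w^\star)\right) - \frac{\mu}{2} \|w^\star - w^{m+1}_k\|^2. \tag{4.10}$$

We now bound the magnitude of the gradient term in (4.9):

$$\begin{aligned}
\mathbb{E}\|v^{m+1}_k\|^2 &= \mathbb{E}\Big\|\nabla f_{\tilde{k}}(w^{m+1}_k) - \nabla f_{\tilde{k}}(\tilde{w}^m_{\tilde{k}}) + \frac{1}{n} \sum_j \nabla f_j(\tilde{w}^m_j)\Big\|^2 \\
&= \mathbb{E}\Big\|\nabla f_{\tilde{k}}(w^{m+1}_k) - \nabla f_{\tilde{k}}(w^\star) + \nabla f_{\tilde{k}}(w^\star) - \nabla f_{\tilde{k}}(\tilde{w}^m_{\tilde{k}}) + \frac{1}{n} \sum_j \nabla f_j(\tilde{w}^m_j)\Big\|^2 \\
&\le 2\,\mathbb{E}\left\|\nabla f_{\tilde{k}}(w^{m+1}_k) - \nabla f_{\tilde{k}}(w^\star)\right\|^2 + 2\,\mathbb{E}\Big\|\nabla f_{\tilde{k}}(\tilde{w}^m_{\tilde{k}}) - \nabla f_{\tilde{k}}(w^\star) - \Big(\frac{1}{n} \sum_j \nabla f_j(\tilde{w}^m_j) - \frac{1}{n} \sum_j \nabla f_j(w^\star)\Big)\Big\|^2 \\
&= 2\,\mathbb{E}\left\|\nabla f_{\tilde{k}}(w^{m+1}_k) - \nabla f_{\tilde{k}}(w^\star)\right\|^2 + 2\,\mathbb{E}\left\|\nabla f_{\tilde{k}}(\tilde{w}^m_{\tilde{k}}) - \nabla f_{\tilde{k}}(w^\star) - \mathbb{E}\left[\nabla f_{\tilde{k}}(\tilde{w}^m_{\tilde{k}}) - \nabla f_{\tilde{k}}(w^\star)\right]\right\|^2 \\
&\le 2\,\mathbb{E}\left\|\nabla f_{\tilde{k}}(w^{m+1}_k) - \nabla f_{\tilde{k}}(w^\star)\right\|^2 + 2\,\mathbb{E}\left\|\nabla f_{\tilde{k}}(\tilde{w}^m_{\tilde{k}}) - \nabla f_{\tilde{k}}(w^\star)\right\|^2 \\
&\le 4L\left(f(w^{m+1}_k) - f(w^\star)\right) + \frac{4L^2}{\mu}\,\mathbb{E}\left(f(\tilde{w}^m_{\tilde{k}}) - f(w^\star)\right).
\end{aligned} \tag{4.11}$$

The second equality uses the property that $\nabla f(w^\star) = 0$. The first inequality uses the property that $\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$. The second inequality uses $\mathbb{E}\|\phi - \mathbb{E}\phi\|^2 = \mathbb{E}\|\phi\|^2 - \|\mathbb{E}\phi\|^2 \le \mathbb{E}\|\phi\|^2$, for any random vector $\phi$. The third inequality follows from Lemma 4.3.1 and Lemma 4.3.2.

We now plug (4.10) and (4.11) into (4.9) and rearrange:

$$\mathbb{E}\left[\|w^{m+1}_{k+1} - w^\star\|^2\right] + 2\alpha(1 - 2L\alpha)\left(f(w^{m+1}_k) - f(w^\star)\right) \le \left\|w^{m+1}_k - w^\star\right\|^2 - \alpha\mu \left\|w^{m+1}_k - w^\star\right\|^2 + \frac{4L^2\alpha^2}{\mu}\,\mathbb{E}\left(f(\tilde{w}^m_{\tilde{k}}) - f(w^\star)\right). \tag{4.12}$$

Taking expectation on all w and summing (4.12) over all k = 0, 1, . . .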
, n − 1, we get a telescoping sum in $\|w^{m+1}_k - w^\star\|^2$ that yields:

$$\mathbb{E}\left\|w^{m+2}_0 - w^\star\right\|^2 + 2n\alpha(1 - 2L\alpha)\,\mathbb{E}\left(f(w^{m+1}) - f(w^\star)\right) \le \mathbb{E}\left\|w^{m+1}_0 - w^\star\right\|^2 - \alpha\mu \sum_{k=0}^{n-1} \mathbb{E}\left\|w^{m+1}_k - w^\star\right\|^2 + \frac{4nL^2\alpha^2}{\mu}\,\mathbb{E}\left(f(\tilde{w}^m) - f(w^\star)\right), \tag{4.13}$$

where we use the convention $w^m_n = w^{m+1}_0$, and define $f(w^m)$ as $f(w^m) := \frac{1}{n} \sum_{k=0}^{n-1} f(w^m_k)$. We now observe that

$$\mathbb{E}\left\|w^{m+1}_0 - w^\star\right\|^2 \le \sum_{k=0}^{n-1} \mathbb{E}\left\|w^{m+1}_k - w^\star\right\|^2.$$

Thus we can rewrite

$$-\alpha\mu \sum_{k=0}^{n-1} \mathbb{E}\left\|w^{m+1}_k - w^\star\right\|^2 \le -\alpha\mu\,\mathbb{E}\left\|w^{m+1}_0 - w^\star\right\|^2.$$

Substituting this in (4.13), we get:

$$\mathbb{E}\left\|w^{m+2}_0 - w^\star\right\|^2 + 2n\alpha(1 - 2L\alpha)\,\mathbb{E}\left(f(w^{m+1}) - f(w^\star)\right) \le (1 - \alpha\mu)\,\mathbb{E}\left\|w^{m+1}_0 - w^\star\right\|^2 + \frac{4nL^2\alpha^2}{\mu}\,\mathbb{E}\left(f(\tilde{w}^m) - f(w^\star)\right).$$

We can rewrite this to get:

$$\mathbb{E}\left\|w^{m+2}_0 - w^\star\right\|^2 + 2n\alpha(1 - 2L\alpha)\,\mathbb{E}\left(f(w^{m+1}) - f(w^\star)\right) \le \rho\left(\mathbb{E}\left\|w^{m+1}_0 - w^\star\right\|^2 + 2n\alpha(1 - 2L\alpha)\,\mathbb{E}\left(f(\tilde{w}^m) - f(w^\star)\right)\right),$$

where $\rho = \max\left(1 - \alpha\mu,\ \frac{4nL^2\alpha^2}{2n\mu\alpha(1 - 2L\alpha)}\right)$. The result immediately follows. ∎

Remark on step size restrictions. From Theorem 4.3.1, notice that CentralVR converges linearly when the step size $\alpha$ is small enough such that

$$\alpha < \min\left(\frac{1}{\mu},\ \frac{1}{2L},\ \frac{\mu}{2L(L + \mu)}\right).$$

Since $L \ge \mu$, we see that this condition is satisfied whenever $\alpha < \frac{\mu}{2L(L + \mu)}$.

4.4 Distributed algorithms

We now consider the distributed setting, with a single central server and $p$ local clients, each of which contains a portion of the data set. In this setting, the data is decomposed into disjoint subsets $\{\Omega_s\}$, where $s$ denotes a particular local client, and $\sum_s |\Omega_s| = n$. We denote the $i$-th function stored on client $s$ as $f^s_i$. Our goal is to minimize the global objective function of the form

$$f(w) = \frac{1}{n} \sum_{s=1}^{p} \sum_{j=1}^{|\Omega_s|} f^s_j(w).$$

We consider a centralized setting, where the clients can only communicate with the central server, and our goal is to derive stochastic algorithms that scale linearly to high $p$, while remaining stable even under low communication frequencies between local and central nodes.

4.4.1 Synchronous version

CentralVR naturally extends to the distributed synchronous setting, and is presented in Algorithm 5.
To distinguish the algorithm from the single worker case, we call it CentralVR-Sync. On each epoch, the local nodes first retrieve a copy of the central iterate $w$, and also $g_{\tilde{w}}$, which represents the averaged gradient over all data. The CentralVR method is then performed on each node, and the most recent gradient for each data point $\nabla f^s_{\tilde{k}}(\tilde{w}_{\tilde{k}})$ is stored. By sharing $g_{\tilde{w}}$ across nodes, we ensure that the local gradient updates utilize global gradient information from remote nodes. This prevents the local node from drifting far away from the global solution, even if each local node runs for one whole epoch before communicating back with the central server.

In CentralVR-Sync, each local node performs local updates for one epoch, or $|\Omega_s|$ iterations, before communicating with the server. This is a rather low communication frequency compared to a parameter server model of computation in which updates are continuously streamed to the central node. This makes a significant difference in runtimes when the number of local nodes is large, as shown in later sections.

4.4.2 Asynchronous version

The synchronous algorithm can be extended very easily to the asynchronous case, CentralVR-Async, as shown in Algorithm 6. In CentralVR-Async, the central server keeps a copy of the current iterate $w$ and average gradient $\bar{g}$. The key idea for CentralVR-Async is that, once a local node completes an epoch, it sends the change in the local averages, given by $\Delta w_s$ and $\Delta \bar{g}_s$, over the last epoch to the central server. This change is added to the global $w$ and $\bar{g}$ to update the parameters stored on the central server.
Thus, when the central server receives parameters from a local node $s$, it performs the updates:

$$w = w + \frac{1}{p} \Delta w_s \quad \text{and} \quad \bar{g} = \bar{g} + \frac{1}{p} \Delta \bar{g}_s,$$

where $\Delta w_s$ and $\Delta \bar{g}_s$ are given by

$$\Delta w_s = \left\{w^{m+1}_n - w^m_n\right\}_s \quad \text{and} \quad \Delta \bar{g}_s = \Big\{\frac{1}{|\Omega_s|} \sum_{j \in \Omega_s} \nabla f^s_j(\tilde{w}^{m+1}_j) - \frac{1}{|\Omega_s|} \sum_{j \in \Omega_s} \nabla f^s_j(\tilde{w}^m_j)\Big\}_s.$$

Sending the change in the local parameter values rather than the local parameters themselves ensures that, when updating the central parameter, the previous contribution to the average from that local worker is just replaced by the new contribution. Thus, a fast working local node does not bias the global average solution toward its local solution with an excessive number of updates. This makes the algorithm more robust to heterogeneous computing environments where nodes work at disparate speeds.

The proposed CentralVR scheme has several advantages. It does not require a full gradient computation as in SVRG, and thus can be made fully asynchronous. Moreover, since the average gradient $g_{\tilde{w}}$ in the error correction term is updated only at the end of an epoch, communication periods can be increased between the central server and the local nodes, while still maintaining fast and stable convergence.

4.5 Distributed variants of SVRG and SAGA

In this section, we propose distributed variants of popular variance reduction methods: SVRG and SAGA. The properties of these variants are summarized in Table 4.1.

4.5.1 Distributed SVRG

In this section, we present a distributed version of SVRG appropriate for distributed scenarios with high communication delays. Recently, in [RHS+15], the authors presented an asynchronous distributed version of SVRG using a parameter server model of computation. In SVRG, the average gradient term is $g_{\tilde{w}} = \nabla f(\tilde{w})$ as shown in (4.2). This correction term is very accurate because it uses the entire dataset. This would indicate that the algorithm would be robust to high communication periods between the local nodes and the server.
However, a truly asynchronous method is not possible with SVRG since a synchronization step is unavoidable when computing the full gradient. Thus, in this section, we present a synchronous variant of SVRG in Algorithm 7. We define an additional parameter τ to denote the communication period, i.e., the number of updates to run on each local node before communicating with the central server.

The true gradient ḡ is maintained across all nodes throughout the whole communication period τ, thus ensuring that the local workers stay close to the desired solution, even when τ is large. After τ updates, the current iterate ws on each local node s is averaged on the central server to get w. The true gradient is evaluated at w, i.e., ḡ = ∇f(w), and w̄ = w is used on each local node during the next epoch.

4.5.2 Distributed SAGA

The update rule for SAGA is given in (4.3). Since there is no synchronization step required as in SVRG, there is a very natural asynchronous version of the algorithm under the parameter server model of computation. A linear convergence proof has been presented for the parameter server model of SAGA (see Theorem 3 in [RHS+15]). However, this work does not contain any empirical studies of the method.

The parameter server framework is a very natural generalization of SAGA; however, it has very high bandwidth requirements for large numbers of nodes. Algorithm 8 presents an asynchronous version of SAGA with lower communication frequency. As with SVRG, we define a communication period parameter τ which determines the number of iterations to run on each machine before central communication.

In the SAGA algorithm, the average gradient term ḡ is updated on each iteration. Thus, as local iterations progress, the average gradient evolves differently on each local node. This makes the algorithm less robust to higher communication periods τ. As the communication period increases, the local nodes drift farther apart from each other and the global solution.
Thus, the learning rate needs to shrink as τ increases beyond a certain limit. This in turn slows down convergence. For this reason, distributed SAGA is less tolerant to long communication periods than the algorithms in Sections 4.4 and 4.5.1. However, it still has fast convergence for much higher communication periods than existing stochastic schemes.

The asynchronous SAGA method (Algorithm 8) is built on the same idea as the proposed asynchronous algorithm: running averages are kept on each local node, and at the end of an epoch the change in the parameter values is sent to the central server. This makes the algorithm more robust when local nodes work at heterogeneous speeds.

In our distributed SAGA algorithm, care has to be taken while updating the average gradient ḡ. Note that ḡ is averaged over the whole dataset. Thus, when replacing the gradient value at the current index k̃, the update is scaled down by a factor of n (the total number of global samples, as opposed to |Ωs|, the number of local samples). At the end of a local epoch, the average of the stored gradients on each local node is sent back to the central server, along with the current estimate w. This ensures that the average gradient term ḡ on the central server is built from the most recent gradient computations at each index.

4.6 Empirical results

In this section, we present the empirical performance of the proposed methods, both in sequential and distributed settings. We benchmark the methods on two test problems. The first is a binary classification problem with $\ell_2$-regularized logistic regression, where each $f_i$ is of the form

\[ f_i(w) = \log\left(1 + \exp(-y_i\langle x_i, w\rangle)\right) + \lambda\|w\|^2, \]

where feature vector $x_i \in \mathbb{R}^d$ has label $y_i \in \mathbb{R}$. We also consider a ridge regression problem of the form

\[ f_i(w) = \left(\langle x_i, w\rangle - y_i\right)^2 + \lambda\|w\|^2. \]

We present all our results with the $\ell_2$ regularization parameter set at $\lambda = 10^{-4}$, though we found that our results were not sensitive to this choice of parameter.

Figure 4.1: Single Worker Results. Logistic regression on toy dataset; ridge regression on toy data; logistic regression on IJCNN1 dataset; ridge regression on MILLIONSONG dataset. In each case CentralVR converges much faster than SVRG and SAGA.

4.6.1 Single worker results

We first test our algorithms in the sequential, non-distributed setting. It is well known that VR beats vanilla SGD by a wide margin in many applications. However, the different VR methods vary widely in their empirical behavior. We compare the single worker CentralVR algorithm to the two most popular VR methods, SVRG [JZ13] and SAGA [DBLJ14]. We test the methods on two synthetic "toy" datasets, in addition to two real-world datasets.

Synthetic classification data was generated by sampling two normal distributions with unit variance and means separated by one unit. For the least-squares prediction problem, we generate a random normal matrix X and random labels of the form y = Aw + ε, where ε is standard Gaussian noise. For each case, we kept the size of the dataset at n = 5000 with d = 20 features. For the binary classification problem, we kept equal numbers of data samples for each class. We also tested performance of our algorithms on two standard real-world datasets: IJCNN1 [Pro01] for binary classification and the MILLIONSONG [BMEWL11] dataset for least-squares prediction. IJCNN1 contains 35,000 training data samples of 22 dimensions, while MILLIONSONG contains 463,715 training samples of 90 dimensions.

For all our experiments, we maintain a constant learning rate, and choose the learning rate that yields fastest convergence. Results appear in Figure 4.1. We compare convergence rates of the algorithms in terms of number of gradient computations for each method.
This provides a level playing field since different VR methods require different numbers of gradient computations per iteration, and gradient computations dominate the computing time. The proposed CentralVR algorithm outperforms SAGA and SVRG by a wide margin in all cases, requiring fewer than one-third of the gradient computations of the other methods.

4.6.2 Distributed results

We now present results of our algorithms in highly distributed settings. We implement the algorithms using a Python binding to MPI, and all experiments were run on an Intel Xeon E5 cluster with 24 cores per node. All our asynchronous implementations are "locked", where at a given time only one local node can update the parameters on the central server. However, all proposed asynchronous algorithms can be easily implemented in a lock-free setting, leading to further speedups.

We compare the distributed versions of CentralVR, CentralVR-Async [CVR-Async in Figures 4.2 and 4.3] and CentralVR-Sync [CVR-Sync in Figures 4.2 and 4.3], proposed in Section 4.4, with the following algorithms:

1. Distributed SVRG (Section 4.5.1) [D-SVRG in Figures 4.2 and 4.3]. We set the communication period τ = 2n as recommended in [JZ13]. We found the performance of the algorithm to be very robust to τ.

2. Distributed SAGA (Section 4.5.2) [D-SAGA in Figures 4.2 and 4.3]. We vary the communication period τ = {10, 100, 1000, 10000} and present results for the τ yielding best results. The algorithm remains relatively stable for τ = {10, 100, 1000} but convergence speeds start slowing down significantly at τ = 10000.

3. Elastic Averaging SGD (EASGD): This is a recently proposed asynchronous SGD method [ZCL15] that has been shown to efficiently accelerate training times of deep neural networks. As in [ZCL15], we tested the algorithm for communication periods τ = {4, 16, 64}, and found results to be nearly insensitive to τ (τ updates occur before communication). We also found the regular EASGD algorithm to outperform the momentum version (M-EASGD). We test performance both for a constant step size as well as a decaying step size (using a local clock on each machine) as given by $\alpha_0/(1 + \gamma k)^{0.5}$ (as in [ZCL15]), where $\alpha_0$ is the initial step size, k is the local iteration number, and γ is the decay parameter. EASGD has been shown to outperform the related popular asynchronous SGD method Downpour [DCM+12], in both convex and non-convex settings.

4. Asynchronous "Parameter Server" SVRG [PS-SVRG in Figures 4.2 and 4.3]: an asynchronous version of SVRG on a parameter server model of computation [RHS+15]. This method outperforms a popular asynchronous SGD method, Hogwild [RRWN11], which also uses a parameter server model. We set the epoch size to 2n, as recommended in [RHS+15].

For the variance reduction methods, we performed experiments using a constant step size, as well as the simple learning rate decay rule $\alpha_l = \alpha_0\gamma^l$ (here, l is the number of epochs, instead of iterations). Decaying the step size does not yield consistent performance gains, and constant step sizes work very well in practice.

We compared the algorithms on a binary classification problem and a least-squares prediction problem using both toy datasets and real-world datasets. The toy datasets were created on each local worker exactly the same way as for the sequential experiments. The toy datasets had d = 1000 features and |Ωs| = 5000 samples for each core s, i.e., the total size of the dataset was p × 5000, where p denotes the number of local nodes. We also used the real-world datasets MILLIONSONG [BMEWL11] (containing close to 500,000 data samples) for ridge regression and SUSY [BSW14] (5,000,000 data samples) for logistic regression.

Figure 4.2 shows results of our distributed experiments on toy datasets. The left two plots compare the rates of convergence of our algorithms scaled over 192 cores for logistic regression and ridge regression.
The x-axis displays wall clock time in seconds and the y-axis displays the relative norm of the gradient, i.e., the ratio between the current gradient norm and the initial gradient norm. In almost all cases the proposed algorithms, in particular CentralVR, have substantially superior rates of convergence over established schemes.

The right two plots in Figure 4.2 demonstrate the scalability of our algorithms. On the y-axis, we plot the wall clock time (in seconds) required for convergence, and on the x-axis, we vary the number of nodes as 96, 192, 480 and 960. Each local worker has |Ωs| = 5000 data points in each case, i.e., the amount of data scales linearly with the number of nodes. Notice that CentralVR-Sync and CentralVR-Async exhibit nearly perfect linear scaling, even when the number of workers is almost 1000. The dataset size in this regime is close to 5 million data points, and the proposed CentralVR methods train both our logistic and ridge regression models to five digits of precision in less than 15 seconds.

Figure 4.3 shows results of our distributed experiments on the large datasets SUSY and MILLIONSONG. The left two plots show convergence results for our algorithms over 500 nodes for SUSY and 240 nodes for MILLIONSONG. In both cases, we see that our proposed algorithms outperform or remain competitive with previously proposed schemes. The right two plots show the scaling of our algorithms as we increase the number of local workers for training SUSY and MILLIONSONG. We see that for MILLIONSONG, increasing the number of local workers initially decreases convergence time, but speed levels out for large numbers of workers, likely due to the smaller size of the local dataset fragments. On the larger SUSY problem, we find a consistent decrease in the convergence times as we increase the number of workers. We train on this 5,000,000 sample dataset in less than 5 seconds using 750 local workers.
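The two test objectives of Section 4.6 and the relative gradient norm plotted on the y-axis can be sketched in NumPy as follows. This is an illustrative sketch, not the code used for the experiments: the function names are hypothetical, the objective is assumed to be the average of the $f_i$ over all samples, and the logistic labels are assumed to take values in {−1, +1}.

```python
import numpy as np

LAMBDA = 1e-4  # l2 regularization weight reported in the text


def logistic_loss(w, X, y, lam=LAMBDA):
    """Average of log(1 + exp(-y_i <x_i, w>)) + lam * ||w||^2 (assumed averaging)."""
    margins = y * (X @ w)
    # logaddexp(0, -m) evaluates log(1 + exp(-m)) without overflow.
    return np.mean(np.logaddexp(0.0, -margins)) + lam * (w @ w)


def logistic_grad(w, X, y, lam=LAMBDA):
    """Gradient of logistic_loss with respect to w."""
    margins = y * (X @ w)
    # d/dm log(1 + exp(-m)) = -1 / (1 + exp(m)) = -0.5 * (1 - tanh(m / 2))
    coeff = -0.5 * y * (1.0 - np.tanh(0.5 * margins))
    return X.T @ coeff / len(y) + 2.0 * lam * w


def ridge_loss(w, X, y, lam=LAMBDA):
    """Average of (<x_i, w> - y_i)^2 + lam * ||w||^2 (assumed averaging)."""
    residual = X @ w - y
    return np.mean(residual ** 2) + lam * (w @ w)


def ridge_grad(w, X, y, lam=LAMBDA):
    """Gradient of ridge_loss with respect to w."""
    residual = X @ w - y
    return 2.0 * X.T @ residual / len(y) + 2.0 * lam * w


def relative_grad_norm(grad_fn, w, w0, X, y):
    """Ratio of the current gradient norm to the gradient norm at the start w0."""
    return np.linalg.norm(grad_fn(w, X, y)) / np.linalg.norm(grad_fn(w0, X, y))
```

Under these assumptions, a convergence curve like those in Figures 4.2 and 4.3 would record `relative_grad_norm(grad_fn, w, w0, X, y)` against elapsed wall clock time, with `w0` the initial iterate.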
4.7 Summary

This section introduced a new variance reduction scheme, CentralVR, that has lower communication requirements than conventional schemes, allowing it to perform better in highly parallel cloud or cluster computing platforms. In addition, distributed versions of well-known variance reduction stochastic gradient descent (SGD) methods are presented that also perform well in highly distributed settings. We show that by leveraging variance reduction, we can combat the diminishing returns that plague classical SGD methods when scaled across many workers, achieving linear performance scaling to over 1000 cores. This represents a significant increase in scalability over previous stochastic gradient methods.

Algorithm 5 CentralVR-Sync Algorithm
parameters: learning rate α
initialize w, $\{\nabla f_j(\tilde w_j)\}_j$, ḡ
while not converged do
    for each local node s do
        g̃ ← 0
        set π: random permutation of indices 1, 2, ..., |Ωs|
        for k in {1, ..., |Ωs|} do
            $w_{k+1} \leftarrow w_k - \alpha\left(\nabla f^s_{\pi_k}(w_k) - \nabla f^s_{\pi_k}(\tilde w_{\pi_k}) + \bar g\right)$
            accumulate average: $\tilde g \leftarrow \tilde g + \nabla f^s_{\pi_k}(w_k)/|\Omega_s|$
            store gradient: $\nabla f^s_{\pi_k}(\tilde w_{\pi_k}) \leftarrow \nabla f^s_{\pi_k}(w_k)$
        end for
        set average gradient to send to server: ḡ ← g̃
        send w, ḡ to central node
        receive updated w, ḡ from central node
    end for
    central node:
        average w, ḡ received from workers
        broadcast averaged w, ḡ to local workers
end while

Algorithm 6 CentralVR-Async Algorithm
parameters: learning rate α
initialize w, $\{\nabla f_j(\tilde w_j)\}_j$, ḡ, ρ = 1/p, $w_{old} = \bar g_{old} = 0$
while not converged do
    for each local node do
        g̃ ← 0
        set π: random permutation of indices 1, 2, ..., |Ωs|
        for k in {1, ..., |Ωs|} do
            $w_{k+1} \leftarrow w_k - \alpha\left(\nabla f^s_{\pi_k}(w_k) - \nabla f^s_{\pi_k}(\tilde w_{\pi_k}) + \bar g\right)$
            accumulate average: $\tilde g \leftarrow \tilde g + \nabla f^s_{\pi_k}(w_k)/|\Omega_s|$
            store gradient: $\nabla f^s_{\pi_k}(\tilde w_{\pi_k}) \leftarrow \nabla f^s_{\pi_k}(w_k)$
        end for
        set average gradient: ḡ ← g̃
        compute change: $\Delta w \leftarrow w - w_{old}$, $\Delta\bar g \leftarrow \bar g - \bar g_{old}$
        set: $w_{old} \leftarrow w$, $\bar g_{old} \leftarrow \bar g$
        send ∆w, ∆ḡ to central node
        receive updated w, ḡ from central node
    end for
    central node:
        receive ∆w, ∆ḡ from a local worker
        update: w ← w + ρ∆w, ḡ ← ḡ + ρ∆ḡ
        send new w, ḡ back to local worker
end while

Algorithm 7 Synchronous SVRG
parameters: step size α, communication period τ
initialize w
while not converged do
    set: w̄ ← w
    set: ḡ ← ∇f(w̄) via synchronization step
    for each local node s do
        for k in {1, ..., τ} do
            sample k̃ ∈ {1, ..., |Ωs|} with replacement
            $w_{k+1} \leftarrow w_k - \alpha\left(\nabla f^s_{\tilde k}(w_k) - \nabla f^s_{\tilde k}(\bar w) + \bar g\right)$
        end for
        send w to central node
        receive updated w from central node
    end for
    central node:
        average w received from workers
        broadcast averaged w to local workers
end while

Algorithm 8 Asynchronous SAGA
parameters: step size α, communication period τ
initialize w, $\{\nabla f_j(\tilde w_j)\}_j$, ρ = 1/p, $w_{old} = \bar g_{old} = 0$
set average gradient: $\bar g \leftarrow \frac{1}{n}\sum_j \nabla f_j(\tilde w_j)$
while not converged do
    for each local node do
        for k in {1, ..., τ} do
            sample k̃ ∈ {1, ..., n} with replacement
            $w_{k+1} \leftarrow w_k - \alpha\left(\nabla f^s_{\tilde k}(w_k) - \nabla f^s_{\tilde k}(\tilde w_{\tilde k}) + \bar g\right)$
            update: $\bar g \leftarrow \bar g + \frac{1}{n}\left(\nabla f^s_{\tilde k}(w_k) - \nabla f^s_{\tilde k}(\tilde w_{\tilde k})\right)$
            store gradient: $\nabla f^s_{\tilde k}(\tilde w_{\tilde k}) \leftarrow \nabla f^s_{\tilde k}(w_k)$
        end for
        compute change: $\Delta w \leftarrow w - w_{old}$, $\Delta\bar g \leftarrow \bar g - \bar g_{old}$
        set: $w_{old} \leftarrow w$, $\bar g_{old} \leftarrow \bar g$
        send ∆w, ∆ḡ to central node
        receive updated w, ḡ from central node
    end for
    central node:
        receive ∆w, ∆ḡ from a local worker
        update: w ← w + ρ∆w, ḡ ← ḡ + ρ∆ḡ
        send new w, ḡ back to local worker
end while

Figure 4.2: Distributed results on toy datasets for CentralVR-Sync and CentralVR-Async, compared to Distributed SVRG (Section 4.5.1), Distributed SAGA (Section 4.5.2), Parameter Server SVRG and EASGD. Left two plots: convergence curves for logistic and ridge regression on synthetic data over 192 nodes. Right two plots: time required for convergence as the number of local workers is increased (data on each local worker is constant, i.e., total data scales linearly with the number of local workers) for logistic and ridge regression.

Figure 4.3: Distributed results on SUSY and MILLIONSONG for CentralVR-Sync and CentralVR-Async, compared to Distributed SVRG (Section 4.5.1), Distributed SAGA (Section 4.5.2), Parameter Server SVRG (Param Server SVRG) and EASGD. Left two plots: convergence curves for logistic regression on SUSY over 500 nodes and ridge regression on MILLIONSONG over 240 nodes. Right two plots: time required for convergence as the number of local workers is increased.

Chapter 5: Investigating training methods for quantized neural nets

5.1 Introduction

Deep neural networks are an integral part of state-of-the-art computer vision and natural language processing systems. Because of their high memory requirements and computational complexity, networks are usually trained using powerful hardware. There is an increasing interest in training and deploying neural networks directly on battery-powered devices, such as cell phones or other platforms.
Such low-power embedded systems are memory and power limited, and in some cases lack basic support for floating-point arithmetic.

To make neural nets practical on embedded systems, many researchers have focused on training nets with coarsely quantized weights. For example, weights may be constrained to take on integer/binary values, or may be represented using low-precision (8 bits or less) fixed-point numbers. Quantized nets offer the potential of superior memory and computation efficiency, while achieving performance that is competitive with state-of-the-art high-precision nets. Quantized weights can dramatically reduce memory size and access bandwidth, increase power efficiency, exploit hardware-friendly bitwise operations, and accelerate inference throughput [CHS+16, MOPU93, RORF16].

Handling low-precision weights is difficult and motivates interest in new training methods. When learning rates are small, stochastic gradient methods make small updates to weight parameters. Binarization/discretization of weights after each training iteration "rounds off" these small updates and causes training to stagnate [CHS+16]. Thus, the naïve approach of quantizing weights using a rounding procedure yields poor results when weights are represented using a small number of bits. Other approaches include classical stochastic rounding methods [GAGN15], as well as schemes that combine full-precision floating-point weights with discrete rounding procedures [CBD15]. While some of these schemes seem to work in practice, results in this area are largely experimental, and little work has been devoted to explaining the excellent performance of some methods, the poor performance of others, and the important differences in behavior between these methods.

Contributions

In this chapter, we study quantized training methods from a theoretical perspective, with the goal of understanding the differences in behavior, and reasons for success or failure, of various methods.
In particular, we present a convergence analysis showing that classical stochastic rounding (SR) methods [GAGN15] as well as newer and more powerful methods like BinaryConnect (BC) [CBD15] are capable of solving convex discrete problems up to a level of accuracy that depends on the quantization level. We then address the issue of why algorithms that maintain floating-point representations, like BC, work so well, while fully quantized training methods like SR stall before training is complete. We show that the long-term behavior of BC has an important annealing property that is needed for non-convex optimization, while classical rounding methods lack this property.

5.2 Background and related work

The arithmetic operations of deep networks can be truncated down to 8-bit fixed-point without significant deterioration in inference performance [GAGN15, LTA16, HS14, LCMB16, LZL16]. The most extreme scenario of quantization is binarization, in which only 1 bit (two states) is used for weight representation [KS15, CBD15, CHS+16, RORF16, HCS+16, BIL+15].

Previous work on obtaining a quantized neural network can be divided into two categories: quantizing pre-trained models with or without retraining [HS14, AHS15, LTA16, ZHMD17, ZYG+17], and training a quantized model from scratch [GAGN15, CBD15, RORF16, CHS+16, ZWN+16]. We focus on approaches that belong to the second category, as they can be used for both training and inference under constrained resources.

For training quantized NNs from scratch, many authors suggest maintaining a high-precision floating-point copy of the weights while feeding quantized weights into backprop [CBD15, HCS+16, RORF16, ZWN+16], which results in good empirical performance. There are limitations in using such methods on low-power devices, however, where floating-point arithmetic is not always available or not desirable. Another widely used solution using only low-precision weights is stochastic rounding [HF92, GAGN15].
Experiments show that networks using 16-bit fixed-point representations with stochastic rounding can deliver results nearly identical to 32-bit floating-point computations [GAGN15], while lowering the precision down to 3-bit fixed-point often results in a significant performance degradation [MLM16]. Bayesian learning has also been applied to train binary networks [SHM14, CSML15]. A more comprehensive review can be found in [RORF16].

5.3 Algorithms for training quantized neural nets

Neural networks have objective functions of the same form as (2.1), where each $f_i$ is a non-convex loss function. When floating-point representations are available, the standard method for training neural networks is SGD (2.2). In this chapter, we consider the problem of training convolutional neural networks (CNNs) with low-precision weights. Convolutions are computationally expensive; low-precision weights can be used to accelerate them by replacing expensive multiplications with efficient addition and subtraction operations [RORF16, LZL16] or bitwise operations [HCS+16, ZWN+16].

To train networks using a low-precision representation of the weights, a quantization function Q(·) is needed to convert a real-valued number w into a quantized/rounded version ŵ = Q(w). We use the same notation for quantizing vectors, where we assume Q acts on each dimension of the vector. Different quantized optimization routines can be defined by selecting different quantizers, and also by selecting when quantization happens during optimization. The common options are:

Deterministic Rounding (R)

A basic uniform or deterministic quantization function snaps a floating-point value to the closest quantized value as:

\[ Q_d(w) = \mathrm{sign}(w)\cdot\Delta\cdot\left\lfloor \frac{|w|}{\Delta} + \frac{1}{2} \right\rfloor, \tag{5.1} \]

where ∆ denotes the quantization step or resolution, i.e., the smallest positive number that is representable.
One exception to this definition is when we consider binary weights, where all weights are constrained to have two values w ∈ {−1, 1} and uniform rounding becomes $Q_d(w) = \mathrm{sign}(w)$.

The deterministic rounding SGD maintains quantized weights with updates of the form:

\[ \text{Deterministic Rounding:}\quad w^b_{k+1} = Q_d\left(w^b_k - \alpha_k\nabla\tilde f_k(w^b_k)\right), \tag{5.2} \]

where $w^b$ denotes the low-precision weights, which are quantized using $Q_d$ immediately after applying the gradient descent update. If gradient updates are significantly smaller than the quantization step, this method loses gradient information and weights may never be modified from their starting values.

Stochastic Rounding (SR)

The stochastic rounding quantization function is defined as:

\[ Q_s(w) = \Delta\cdot\begin{cases} \left\lfloor \frac{w}{\Delta} \right\rfloor + 1 & \text{for } p \le \frac{w}{\Delta} - \left\lfloor \frac{w}{\Delta} \right\rfloor, \\[4pt] \left\lfloor \frac{w}{\Delta} \right\rfloor & \text{otherwise,} \end{cases} \tag{5.3} \]

where p ∈ [0, 1] is produced by a uniform random number generator. This operator is non-deterministic, and rounds its argument up with probability $w/\Delta - \lfloor w/\Delta \rfloor$, and down otherwise. This quantizer satisfies the important property $\mathbb{E}[Q_s(w)] = w$. Similar to the deterministic rounding method, the SR optimization method also maintains quantized weights with updates of the form:

\[ \text{Stochastic Rounding:}\quad w^b_{k+1} = Q_s\left(w^b_k - \alpha_k\nabla\tilde f_k(w^b_k)\right). \tag{5.4} \]

BinaryConnect (BC)

The BinaryConnect algorithm [CBD15] accumulates gradient updates using a full-precision buffer $w^r$, and quantizes weights just before gradient computations as follows:

\[ \text{BinaryConnect:}\quad w^r_{k+1} = w^r_k - \alpha_k\nabla\tilde f_k\left(Q(w^r_k)\right). \tag{5.5} \]

Either stochastic rounding $Q_s$ or deterministic rounding $Q_d$ can be used for quantizing the weights $w^r$, but in practice, $Q_d$ is the common choice. The original BinaryConnect paper constrains the low-precision weights to be {−1, 1}, which can be generalized to {−∆, ∆}. A more recent method, Binary-Weights-Net (BWN) [RORF16], allows different filters to have different scales for quantization, which often results in better performance on large datasets.
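The quantizers (5.1) and (5.3) and the three update rules (5.2), (5.4) and (5.5) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation used in this chapter: the function names and the `grad_fn` callback (standing in for the stochastic gradient $\nabla\tilde f_k$) are hypothetical, the stochastic quantizer compares with a strict `<` where (5.3) writes ≤ so that an exactly representable value is never rounded up, and the binary special case $Q_d(w) = \mathrm{sign}(w)$ is not handled.

```python
import numpy as np


def quantize_deterministic(w, delta):
    """Q_d of (5.1): snap each entry to the nearest multiple of delta."""
    return np.sign(w) * delta * np.floor(np.abs(w) / delta + 0.5)


def quantize_stochastic(w, delta, rng):
    """Q_s of (5.3): round up with probability w/delta - floor(w/delta)."""
    scaled = np.asarray(w, dtype=float) / delta
    lower = np.floor(scaled)
    p = rng.random(scaled.shape)  # uniform samples in [0, 1)
    # Strict '<' (assumption): a value already on the grid is never moved.
    return delta * (lower + (p < scaled - lower))


def step_deterministic_rounding(w_b, grad_fn, lr, delta):
    """Update (5.2): quantized weights, rounded right after the SGD step."""
    return quantize_deterministic(w_b - lr * grad_fn(w_b), delta)


def step_stochastic_rounding(w_b, grad_fn, lr, delta, rng):
    """Update (5.4): same as (5.2) but with the unbiased quantizer Q_s."""
    return quantize_stochastic(w_b - lr * grad_fn(w_b), delta, rng)


def step_binary_connect(w_r, grad_fn, lr, delta):
    """Update (5.5) with Q = Q_d: full-precision buffer, quantized gradient input."""
    return w_r - lr * grad_fn(quantize_deterministic(w_r, delta))
```

With ∆ = 1, a weight at 0, and a constant gradient of −1 scaled by a learning rate of 0.1, the deterministic rounding step rounds 0.1 back to 0 and the weight never moves, whereas the BinaryConnect buffer advances to 0.1 and the stochastic rounding step moves to 1 with probability 0.1.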
Notation

For the rest of the chapter, we use Q to denote both $Q_s$ and $Q_d$ unless the situation requires this to be distinguished. We also drop the superscripts on $w^r$ and $w^b$, and simply write w.

5.4 Convergence analysis

We now present convergence guarantees for the Stochastic Rounding (SR) and BinaryConnect (BC) algorithms, with updates of the form (5.4) and (5.5), respectively. For the purposes of deriving theoretical guarantees, we assume each $f_i$ in (2.1) is differentiable and µ-strongly convex:

\[ \langle\nabla f(w'), w - w'\rangle \le f(w) - f(w') - \frac{\mu}{2}\|w - w'\|^2. \]

We assume the (stochastic) gradients are bounded: $\mathbb{E}\|\nabla\tilde f_k(w_k)\|^2 \le G^2$. Some results below also assume the domain of the problem is finite. In this case, the rounding algorithm clips values that leave the domain. For example, in the binary case, rounding returns bounded values in {−1, 1}.

5.4.1 Convergence of Stochastic Rounding (SR)

We can rewrite the update rule (5.4) as:

\[ w_{k+1} = w_k - \alpha_k\nabla\tilde f_k(w_k) + r_k, \tag{5.6} \]

where $r_k = Q_s(w_k - \alpha_k\nabla\tilde f_k(w_k)) - w_k + \alpha_k\nabla\tilde f_k(w_k)$ denotes the quantization error on the k-th iteration. We want to bound this error in expectation. To this end, we present the following lemma.

Lemma 5.4.1. The stochastic rounding error $r_k$ on each iteration can be bounded, in expectation, as:

\[ \mathbb{E}\|r_k\|^2 \le \sqrt{d}\,\Delta\alpha_k G, \]

where d denotes the dimension of w.

Proof. We want to bound the quantization error $r_k$. Consider the i-th entry in $r_k$, denoted by $(r_k)_i$. Similarly, we define $(w_k)_i$ and $(\nabla\tilde f_k(w_k))_i$. Choose some random number p ∈ [0, 1]. The stochastic rounding operation produces a value of $r_k$ given by

\[ (r_k)_i = Q_s\big((w_k)_i - \alpha_k(\nabla\tilde f_k(w_k))_i\big) - (w_k)_i + \alpha_k(\nabla\tilde f_k(w_k))_i = \Delta\cdot\begin{cases} -q + 1, & \text{for } p \le q, \\ -q, & \text{otherwise,} \end{cases} \]

where $q = \frac{-\alpha_k(\nabla\tilde f_k(w_k))_i}{\Delta} - \left\lfloor \frac{-\alpha_k(\nabla\tilde f_k(w_k))_i}{\Delta} \right\rfloor$ and q ∈ [0, 1]. Now we have

\[ \mathbb{E}_p\left[(r_k)_i^2\right] \le \Delta^2\left((-q+1)^2 q + (-q)^2(1-q)\right) = \Delta^2 q(1-q) \le \Delta^2\min\{q, 1-q\}. \]
76 { − } ≤ ∣∣∣∣∣ ∣ αk(∇f̃ (w )) ∣Because k k i ∣min q, 1 q ∣, it follows that: ∆ ∣ [ ] ∣∣ ∣∣ ∣ ∣ E ((r ) )2 ≤ 2 ∣∣∣αk(∇f̃k(wk))i ∣p k i ∆ ∣∣ ≤ ∣∆ ∣ ∣αk(∇f̃k(wk))i∣ .∆ Summing over the index i yields: ∥∥ ∥∥2 ∥∥ ∥∥ √ ∥ ∥Ep rk ≤ ∆αk ∇f̃k(wk) ≤ dαk∆∥∇f̃k(wk)∥ . (5.7)2 1 2 ( ∥ ∥ ) ∥ ∥ The result follows from: E∥∇ 2 2f̃k(w )∥ ≤ E∥k ∇f̃ (w )∥ ≤ G2. 2 k k 2 From Lemma 5.4.1, we see that the rounding error per step decreases as the learning rate αk decreases. This is intuitive since the probability of an entry in wk+1 differing from wk is small when the gradient update is small relative to ∆. Using the above lemma, we now present convergence rate results for Stochastic Rounding (SR) in the stro∑ngly-convex case. Our error estimates are ergodic, i.e., they are in terms of w̄ = 1 kk t=1 wt, thek average of the iterates. Theorem 5.4.1. Assume that f is µ-strongly convex and the learning rates are given by α 1k = . Consider the SR algorithm with updates of the form (5.4). Then, we have:µ(k+1) √ − ? ≤ (1 + log(k + 1))G 2 d∆G E[f(w̄k) f(w )] + , 2µk 2 where w? = arg minw f(w). Proof. Subtracting w? from (5.6), taking norm, and expectation conditioned on wk: E‖wk+1 −w?‖2 = ‖wk −w?‖2 − 2E〈wk −w?, αk∇f̃k(wk)− rk〉+ E‖αk∇f̃k(wk)− rk‖2 = ‖wk −w?‖2 − 2α 〈w −w?k k ,∇f(w 2 2 2k)〉+ αkE‖∇f̃k(wk)‖ + E‖rk‖ √ ≤ ‖wk −w?‖2 − 2αk〈wk −w?,∇f(wk)〉+ α2G2k + d∆αkG, 77 where we use the bounded variance assumption, E[rk] = 0, and Lemma 5.4.1. Using the assumption that f is µ-strongly convex, we can simplify this to: √ E‖wk+1 −w?‖2 ≤ (1− αkµ)‖w ? 2 ? 2 2k −w ‖ − 2αk(f(wk)− f(w )) + αkG + d∆αkG. Re-arranging the terms, and taking expectation we get: √ 2αkE(f(wk)− f(w?)) ≤ (1− αkµ)E‖wk −w?‖2 − E‖wk+1 −w?‖2 + α2 2kG + d∆αkG. Assume that the step size decreases with the rate αk = 1/µ(k + 1). Then we have: √ µk µ(k + 1) 1 d∆G E(f(wk)− f(w?)) ≤ E‖w −w?‖2k − E‖w ? 2 2k+1 −w ‖ + G + . 
2 2 2µ(k + 1) 2 Averaging over k = 0 to T , we get a telescoping sum on the right hand side, which yields: 1 ∑T ∑T √G2 1 d∆G µ(T + 1) E(f(wk)− f(w?)) ≤ + − E‖wk+1 −w?‖2 T 2µT k + 1 2 2 k=0 k=0 √ ≤ (1 + log(T + 1))G 2 d∆G + . 2µT 2 ∑ Using Jensen’s ∑inequality, we have: E(f(w̄T ) − f(w ?)) ≤ 1 Tk=0 E(f(wk) − f(w?)),T where w̄T = 1 T k=0 wk, the average of the iterates. The desired bound follows. T We see that SR converges until it reaches an “accuracy floor.” As the quantiza- tion becomes more fine grained, our theory predicts that the accuracy of SR approaches that of high-precision floating point at a rate linear in ∆. This extra term caused by the discretization is unavoidable since this method maintains quantized weights. 5.4.2 Convergence of Binary Connect (BC) When analyzing the BC algorithm, we assume that the Hessian satisfies the Lips- chitz bound: ‖∇2fi(w) − ∇2fi(w′)‖ ≤ L2‖w − w′‖ for some L2 ≥ 0. While this is a 78 slightly non-standard assumption, we will see that it enables us to gain better insights into the behavior of the algorithm. We assume that the quantization function in BC uses stochastic rounding. In the case of BC, we see that the quantization error r does not approach 0 as in SR-SGD. Nonetheless, the effect of this rounding error diminishes with shrinking αk because αk multiplies the gradient update, and thus implicitly the rounding error as well. Theorem 5.4.2. Assume that f is µ-strongly convex, the domain has finite diameter D, and the learning rates are given by α = 1k . Consider the BC algorithm with updatesµ(k+1) of the form (5.5). Then we have: √ − ? ≤ (1 + log(k + 1))G 2 DL2 d∆E[f(w̄k) f(w )] + . 2µk 2 Proof. We can rewrite the update rule (5.5), as ( ) wk+1 = wk − αk∇f̃ (wk +) rk ( ) = wk − αk[∇f̃ wk +∇2f̃ wk rk + r̂k] where ‖r̂k‖ ≤ L2‖r ‖2k from our assumption on the Hessian. Note that in general rk has2 mean zero while r̂k does not. Using the same steps as in the Theorem 5.4.1, we get E‖w ? 
2k+1 −w ‖ = ‖wk −w?‖2 − 2αkE〈w ?k −w ,∇f̃k(wk + r 2k)〉+ αkE‖∇f̃k(wk + rk)‖2. ≤ ‖wk −w?‖2 − 2αkE〈w −w?k ,∇f(wk) + r̂k〉+ α2kG2 = ‖wk −w?‖2 − 2αkE〈wk −w?,∇f(wk)〉+ α2 2kG − 2αkE〈wk −w?, r̂k〉 Assuming the domain has finite diameter D, and observing that the quantization error for √ BC-SGD can always be upper-bounded as ‖rk‖ ≤ d∆, we get: √ −2αkE〈w −w?k , r̂k〉 ≤ E‖ ‖ ≤ L2 2αkD r̂k 2αkD ‖rk‖ ≤ αkDL2 d∆. 2 79 Following the same steps as in Theorem 5.4.1, the desired bound follows.  Now, the error floor is determined by both ∆ and L2. For a quadratic least-squares problem, the gradient of f is linear and the Hessian is constant. Thus, L2 = 0 and we get the following corollary. Corollary 5.4.1. Assume that f is quadratic and the learning rates are given by αk = 1 . The BC algorithm with updates of the form (5.5) yields µ(k+1) 2 E[f(w̄k)− ? ≤ (1 + log(k + 1))G f(w )] . 2µk We see that the real-valued weights accumulated in BC can converge to the true minimizer of quadratic losses. Furthermore, this suggests that, when the function behaves like a quadratic on the distance scale ∆, one would expect BC to perform fundamentally better than SR. While this may seem like a restrictive condition, there is evidence that even non-convex neural networks become well approximated as a quadratic in the later stages of optimization within a neighborhood of a local minimum [MG15]. Note, our convergence results on BC are for wr instead of wb, and these measures of convergence are not directly comparable. It is not possible to bound wb when BC is used, as the values of wb may not converge in the usual sense (e.g., in the +/-1 binary case wr might converge to 0, in which case arbitrarily small perturbations to wr might send wb to +1 or -1). 5.5 What about non-convex problems? The global convergence results presented above for convex problems show that, in general, both the SR and BC algorithms converge to within O(∆) accuracy of the mini- 80 mizer (in expected value). 
While we observe some differences between the two methods when the iterates are close to a minimizer (where the objective function behaves like a quadratic), these results do not explain the large differences generally observed when the methods are applied to non-convex neural nets. We now study how the long-term behavior of SR differs from BC. Note that this section makes no convexity assumptions, and the proposed theoretical results are directly applicable to neural networks.

Typical (continuous-valued) SGD methods have an important exploration-exploitation tradeoff. When the learning rate is large, the algorithm explores by moving quickly between states. Exploitation happens when the learning rate is small. In this case, noise averaging causes the algorithm to pursue local minimizers with lower loss values more greedily. Thus, the distribution of iterates produced by the algorithm becomes increasingly concentrated near minimizers as the learning rate vanishes (see, e.g., the large-deviation estimates in [LNS12]). BC maintains this property as well; indeed, we saw in Corollary 5.4.1 a class of problems for which the iterates concentrate on the minimizer for small $\alpha_k$.

In this section, we show that the SR method lacks this important tradeoff: as the step size gets small and the algorithm slows down, the quality of the iterates produced by the algorithm does not improve, and the algorithm does not become progressively more likely to produce low-loss iterates. This behavior is illustrated in Figures 5.1 and 5.2.

To understand this problem conceptually, consider the simple case of a one-variable optimization problem starting at $w_0 = 0$ with $\Delta = 1$ (Figure 5.1). On each iteration, the algorithm computes a stochastic approximation $\nabla\tilde{f}$ of the gradient by sampling from a distribution, which we call $p$. This gradient is then multiplied by the step size to get $\alpha\nabla\tilde{f}$.
The probability of moving to the right (or left) is then roughly proportional to the magnitude of $\alpha\nabla\tilde{f}$. Note the random variable $\alpha\nabla\tilde{f}$ has distribution $p_\alpha(z) = \alpha^{-1} p(z/\alpha)$.

Figure 5.1: The SR method starts at some location $w$ (in this case 0), adds a perturbation to $w$, and then rounds. As the learning rate $\alpha$ gets smaller, the distribution of the perturbation gets "squished" near the origin, making the algorithm less likely to move. The "squishing" effect is the same for the part of the distribution lying to the left and to the right of $w$, and so it does not affect the relative probability of moving left or right.

Now, suppose that $\alpha$ is small enough that we can neglect the tails of $p_\alpha(z)$ that lie outside the interval $[-1, 1]$. The probability of transitioning from $w_0 = 0$ to $w_1 = 1$ using stochastic rounding, denoted by $T_\alpha(0, 1)$, is then

$$T_\alpha(0, 1) \approx \int_0^1 z\, p_\alpha(z)\, dz = \frac{1}{\alpha}\int_0^1 z\, p(z/\alpha)\, dz = \alpha \int_0^{1/\alpha} p(w)\, w\, dw \approx \alpha \int_0^{\infty} p(w)\, w\, dw,$$

where the first approximation is because we neglected the unlikely case that $\alpha\nabla\tilde{f} > 1$, and the second approximation appears because we added a small tail probability to the estimate. These approximations get more accurate for small $\alpha$. We see that, assuming the tails of $p$ are "light" enough, we have $T_\alpha(0, 1) \sim \alpha\int_0^{\infty} p(w)\, w\, dw$ as $\alpha \to 0$. Similarly, $T_\alpha(0, -1) \sim \alpha\int_{-\infty}^{0} p(w)\, w\, dw$ as $\alpha \to 0$.

What does this observation mean for the behavior of SR? First of all, the probability of leaving $w_0$ on an iteration is

$$T_\alpha(0, -1) + T_\alpha(0, 1) \approx \alpha\left[\int_0^{\infty} p(w)\, w\, dw + \int_{-\infty}^{0} p(w)\, w\, dw\right],$$

which vanishes for small $\alpha$. This means the algorithm slows down as the learning rate drops off, which is not surprising. However, the conditional probability of ending up at $w_1 = 1$ given that the algorithm did leave $w_0$ is

$$T_\alpha(0, 1 \mid w_1 \ne w_0) \approx \frac{T_\alpha(0, 1)}{T_\alpha(0, -1) + T_\alpha(0, 1)} = \frac{\int_0^{\infty} p(w)\, w\, dw}{\int_{-\infty}^{0} p(w)\, w\, dw + \int_0^{\infty} p(w)\, w\, dw},$$

which does not depend on $\alpha$. In other words, provided $\alpha$ is small, SR, on average, makes the same decisions/transitions with learning rate $\alpha$ as it does with learning rate $\alpha/10$; it just takes 10 times longer to make those decisions when $\alpha/10$ is used. In this situation, there is no exploitation benefit in decreasing $\alpha$.

Figure 5.2: Effect of shrinking the learning rate in SR vs BC on a toy problem. The left figure plots the objective function (5.8) (loss value against weight $w$). Histograms plot the distribution of the quantized weights over $10^6$ iterations, in panels (a) $\alpha = 1.0$, (b) $\alpha = 0.1$, (c) $\alpha = 0.01$, (d) $\alpha = 0.001$. The top row of plots corresponds to BC, while the bottom row is SR, for different learning rates $\alpha$. As the learning rate $\alpha$ shrinks, the BC distribution concentrates on a minimizer, while the SR distribution stagnates.

5.5.1 Toy problem

To gain more intuition about the effect of shrinking the learning rate in SR vs BC, consider the following simple 1-dimensional non-convex problem:

$$\min_w f(w) := \begin{cases} w^2 + 2, & \text{if } w < 1, \\ (w - 2.5)^2 + 0.75, & \text{if } 1 \le w < 3.5, \\ (w - 4.75)^2 + 0.19, & \text{if } w \ge 3.5. \end{cases} \qquad (5.8)$$

Figure 5.2 shows a plot of this loss function. To visualize the distribution of iterates, we initialize at $w = 4.0$, and run SR and BC for $10^6$ iterations using a quantization resolution of 0.5.

Figure 5.2 shows the distribution of the quantized weight parameters $w$ over the iterations when optimized with SR and BC for different learning rates $\alpha$. As we shift from $\alpha = 1$ to $\alpha = 0.001$, the distribution of BC iterates transitions from a wide/explorative distribution to a narrow distribution in which iterates aggressively concentrate on the minimizer. In contrast, the distribution produced by SR concentrates only slightly and then stagnates; the iterates are spread widely even when the learning rate is small.

5.5.2 Asymptotic analysis of Stochastic Rounding

The above argument is intuitive, but also informal. To make these statements rigorous, we interpret the SR method as a Markov chain.
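Returning briefly to the toy problem of Section 5.5.1, the experiment can be approximated with a short simulation. The sketch below is only illustrative and rests on assumptions that the text does not state: the gradient noise is taken to be additive Gaussian with standard deviation `noise_std`, BC is assumed to quantize its real-valued weight with the same stochastic rounding before each gradient evaluation, and the run length is far shorter than the $10^6$ iterations used for Figure 5.2. All function names are hypothetical.

```python
import math
import random

DELTA = 0.5  # quantization resolution used in Section 5.5.1


def grad_f(w):
    """Gradient of the piecewise-quadratic objective (5.8)."""
    if w < 1.0:
        return 2.0 * w
    if w < 3.5:
        return 2.0 * (w - 2.5)
    return 2.0 * (w - 4.75)


def stochastic_round(x, rng, delta=DELTA):
    """Round x onto the grid delta * Z, rounding up with probability equal to
    the fractional position of x inside its cell (so the result is unbiased)."""
    lower = math.floor(x / delta)
    frac = x / delta - lower
    return delta * (lower + (1 if rng.random() < frac else 0))


def run_sr(alpha, iters, rng, noise_std=1.0, w0=4.0):
    """SR: the stored weight itself is re-quantized after every step."""
    w, trace = w0, []
    for _ in range(iters):
        g = grad_f(w) + rng.gauss(0.0, noise_std)  # assumed noise model
        w = stochastic_round(w - alpha * g, rng)
        trace.append(w)
    return trace


def run_bc(alpha, iters, rng, noise_std=1.0, w0=4.0):
    """BC: gradients are evaluated at quantized weights, but updates are
    accumulated in a real-valued copy of the weight."""
    w_real, trace = w0, []
    for _ in range(iters):
        w_quant = stochastic_round(w_real, rng)
        g = grad_f(w_quant) + rng.gauss(0.0, noise_std)  # assumed noise model
        w_real -= alpha * g
        trace.append(w_quant)
    return trace


def fraction_near_minimizer(trace):
    """Share of quantized iterates on the two grid points around w = 4.75."""
    return sum(1 for w in trace if w in (4.5, 5.0)) / len(trace)


if __name__ == "__main__":
    for alpha in (0.1, 0.01, 0.001):
        rng = random.Random(0)
        sr = fraction_near_minimizer(run_sr(alpha, 20_000, rng))
        bc = fraction_near_minimizer(run_bc(alpha, 20_000, rng))
        print(f"alpha={alpha}: SR {sr:.3f}, BC {bc:.3f}")
```

Under these assumptions, comparing the printed fractions across learning rates gives a rough numerical counterpart to the histograms of Figure 5.2; the exact values depend on the assumed noise level and the seed.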
On each iteration, SR starts at some state (iterate) $w$, and moves to a new state $w'$ with some transition probability $T_\alpha(w, w')$ that depends only on $w$ and the learning rate $\alpha$. For fixed $\alpha$, this is clearly a Markov process with transition matrix¹ $T_\alpha(w, w')$.

¹ Our analysis below does not require the state space to be finite, so $T_\alpha(w, w')$ may be a linear operator rather than a matrix. Nonetheless, we use the term "matrix" as it is standard.

Figure 5.3: Markov chain example with 3 states (A, B, C). In the right figure, we halved each transition probability for moving between states, with the remaining probability put on the self-loop. Notice that halving all the transition probabilities would not change the equilibrium distribution, and instead would only increase the mixing time of the Markov chain.

The long-term behavior of this Markov process is determined by the stationary distribution of $T_\alpha(w, w')$. We show below that for small $\alpha$, the stationary distribution of $T_\alpha(w, w')$ is nearly invariant to $\alpha$, and thus decreasing $\alpha$ below some threshold has virtually no effect on the long term behavior of the method. This happens because, as $\alpha$ shrinks, the relative transition probabilities remain the same (conditioned on the fact that the parameters change), even though the absolute probabilities decrease (see Figure 5.3). In this case, there is no exploitation benefit to decreasing $\alpha$.

Theorem 5.5.1. Let $p_{w,i}$ denote the probability distribution of the $i$-th entry in $\nabla\tilde{f}(w)$, the stochastic gradient estimate at $w$. Assume there is a constant $C_1$ such that for all $w$, $i$, and $\nu$ we have $\int_\nu^\infty p_{w,i}(z)\, dz \le \frac{C_1}{\nu^2}$, and some $C_2$ such that both $\int_0^{C_2} p_{w,i}(z)\, dz > 0$ and $\int_{-C_2}^0 p_{w,i}(z)\, dz > 0$. Define the matrix

$$\tilde{U}(w, w') = \begin{cases} \int_0^\infty p_{w,i}(z)\, \frac{z}{\Delta}\, dz, & \text{if } w \text{ and } w' \text{ differ only at coordinate } i, \text{ and } (w')_i = (w)_i + \Delta, \\ \int_{-\infty}^0 p_{w,i}(z)\, \frac{z}{\Delta}\, dz, & \text{if } w \text{ and } w' \text{ differ only at coordinate } i, \text{ and } (w')_i = (w)_i - \Delta, \\ 0, & \text{otherwise}, \end{cases}$$

and the associated Markov chain transition matrix

$$\tilde{T}_{\alpha_0} = I - \alpha_0 \cdot \mathrm{diag}(\mathbf{1}^T \tilde{U}) + \alpha_0 \tilde{U}, \qquad (5.9)$$

where $\alpha_0$ is the largest constant that makes $\tilde{T}_{\alpha_0}$ non-negative. Suppose $\tilde{T}_\alpha$ has a stationary distribution, denoted $\tilde{\pi}$. Then, for sufficiently small $\alpha$, $T_\alpha$ has a stationary distribution $\pi_\alpha$, and $\lim_{\alpha \to 0} \pi_\alpha = \tilde{\pi}$. Furthermore, this limiting distribution satisfies $\tilde{\pi}(w) > 0$ for any state $w$, and is thus not concentrated on local minimizers of $f$.

Proof. Let the matrix $U_\alpha$ be a partial transition matrix defined by $U_\alpha(w, w) = 0$, and $U_\alpha(w, w') = T_\alpha(w, w')$ for $w \ne w'$. From $U_\alpha$, we can get back the full transition matrix $T_\alpha$ using the formula

$$T_\alpha = I - \mathrm{diag}(\mathbf{1}^T U_\alpha) + U_\alpha.$$

Note that this formula is essentially "filling in" the diagonal entries of $T_\alpha$ so that every column sums to 1, thus making $T_\alpha$ a valid stochastic matrix.

Let's bound the entries in $U_\alpha$. Suppose that we begin an iteration of the stochastic rounding algorithm at some point $w$. Consider an adjacent point $w'$ that differs from $w$ at only 1 coordinate, $i$, with $(w')_i = (w)_i + \Delta$. Then we have

$$\begin{aligned}
U_\alpha(w, w') &= \frac{1}{\alpha}\int_0^{\Delta} p_{w,i}\big((w)_i/\alpha\big)\, \frac{(w)_i}{\Delta}\, d(w)_i + \frac{1}{\alpha}\int_{\Delta}^{2\Delta} p_{w,i}\big((w)_i/\alpha\big)\, \frac{2\Delta - (w)_i}{\Delta}\, d(w)_i \\
&= \frac{1}{\alpha}\int_0^{\Delta/\alpha} p_{w,i}(z)\, \frac{\alpha z}{\Delta}\, \alpha\, dz + \frac{1}{\alpha}\int_{\Delta/\alpha}^{2\Delta/\alpha} p_{w,i}(z)\, \frac{2\Delta - \alpha z}{\Delta}\, \alpha\, dz \\
&\le \alpha\int_0^{\Delta/\alpha} p_{w,i}(z)\, \frac{z}{\Delta}\, dz + \int_{\Delta/\alpha}^{\infty} p_{w,i}(z)\, dz \\
&= \alpha\int_0^{\infty} p_{w,i}(z)\, \frac{z}{\Delta}\, dz + O(\alpha^2). \qquad (5.10)
\end{aligned}$$

Note we have used the decay assumption $\int_\nu^\infty p_{w,i}(z)\, dz \le \frac{C_1}{\nu^2}$. If $(w')_i = (w)_i - \Delta$, then similarly the transition probability is $U_\alpha(w, w') = \alpha\int_{-\infty}^0 p_{w,i}(z)\, \frac{z}{\Delta}\, dz + O(\alpha^2)$, and if $(w')_i = (w)_i \pm m\Delta$ for an integer $m > 1$, $U_\alpha(w, w') = O(\alpha^2)$.
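We can approximate the behavior of $U_\alpha$ using the matrix

$$\tilde{U}(w, w') = \begin{cases} \int_0^\infty p_{w,i}(z)\, \frac{z}{\Delta}\, dz, & \text{if } w \text{ and } w' \text{ differ only at coordinate } i, \text{ and } (w')_i = (w)_i + \Delta, \\ \int_{-\infty}^0 p_{w,i}(z)\, \frac{z}{\Delta}\, dz, & \text{if } w \text{ and } w' \text{ differ only at coordinate } i, \text{ and } (w')_i = (w)_i - \Delta, \\ 0, & \text{otherwise}. \end{cases}$$

Define the associated Markov chain transition matrix

$$\tilde{T}_{\alpha_0} = I - \alpha_0 \cdot \mathrm{diag}(\mathbf{1}^T \tilde{U}) + \alpha_0 \tilde{U}, \qquad (5.11)$$

where $\alpha_0$ is the largest scalar such that the stochastic linear operator $\tilde{T}_{\alpha_0}$ has non-negative entries. For $\alpha < \alpha_0$, $\tilde{T}_\alpha$ has non-negative entries and column sums equal to 1; it thus defines the transition operator of a Markov chain. Let $\tilde{\pi}$ denote the stationary distribution of the Markov chain with transition matrix $\tilde{T}_{\alpha_0}$.

We now claim that $\tilde{\pi}$ is also the stationary distribution of $\tilde{T}_\alpha$ for all $\alpha < \alpha_0$. We verify this by noting that

$$\begin{aligned}
\tilde{T}_\alpha &= (I - \alpha \cdot \mathrm{diag}(\mathbf{1}^T \tilde{U})) + \alpha\tilde{U} \\
&= \Big(1 - \frac{\alpha}{\alpha_0}\Big) I + \frac{\alpha}{\alpha_0}\big[I - \alpha_0 \cdot \mathrm{diag}(\mathbf{1}^T \tilde{U}) + \alpha_0\tilde{U}\big] \\
&= \Big(1 - \frac{\alpha}{\alpha_0}\Big) I + \frac{\alpha}{\alpha_0}\tilde{T}_{\alpha_0}, \qquad (5.12)
\end{aligned}$$

and so $\tilde{T}_\alpha\tilde{\pi} = (1 - \frac{\alpha}{\alpha_0})\tilde{\pi} + \frac{\alpha}{\alpha_0}\tilde{\pi} = \tilde{\pi}$.

Recall that $T_\alpha$ is the transition matrix for the Markov chain generated by the stochastic rounding algorithm with learning rate $\alpha$. We wish to show that this Markov chain is well approximated by $\tilde{T}_\alpha$. Note that

$$T_\alpha(w, w') = \prod_{i:\, (w)_i \ne (w')_i} T_\alpha\big(w,\, w + ((w')_i - (w)_i)\Delta e_i\big) \le O(\alpha^2)$$

when $w, w'$ differ at more than 1 coordinate, and $e_i$ denotes a vector that is 1 in the $i$-th coordinate, and 0 everywhere else. In other words, transitions between multiple coordinates simultaneously become vanishingly unlikely for small $\alpha$. When $w$ and $w'$ differ by exactly 1 coordinate, we know from (5.10):

$$T_\alpha(w, w') = \alpha\tilde{U}(w, w') + O(\alpha^2).$$

These observations show that the off-diagonal elements of $T_\alpha$ are well approximated (up to uniform $O(\alpha^2)$ error) by the corresponding elements in $\alpha\tilde{U}$. Since the columns of $T_\alpha$ sum to one, the diagonal elements are well approximated as well, and we have

$$T_\alpha = (I - \alpha \cdot \mathrm{diag}(\mathbf{1}^T \tilde{U})) + \alpha\tilde{U} + O(\alpha^2) = \tilde{T}_\alpha + O(\alpha^2).$$

To be precise, the notation above means that

$$|T_\alpha(w, w') - \tilde{T}_\alpha(w, w')| < C\alpha^2, \qquad (5.13)$$

for some $C$ that is uniform over $(w, w')$.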
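As a numerical aside, the invariance expressed by (5.12) can be checked on a small example. The sketch below uses a hypothetical 3-state off-diagonal matrix `U_EXAMPLE` (not the chain of Figure 5.3, whose exact probabilities are not reproduced here), builds the column-stochastic matrix $I - \alpha\,\mathrm{diag}(\mathbf{1}^T U) + \alpha U$ for several $\alpha$, and runs power iteration. The number of power-iteration steps is used only as a rough proxy for mixing time; it is not the $\epsilon$-mixing time defined in the text.

```python
# Hypothetical off-diagonal transition weights; entry [i][j] is the weight
# of moving from state j to state i (columns index the current state).
U_EXAMPLE = [
    [0.0, 0.2, 0.1],
    [0.6, 0.0, 0.3],
    [0.2, 0.4, 0.0],
]


def lazy_chain(U, alpha):
    """Column-stochastic matrix T = I - alpha * diag(1^T U) + alpha * U."""
    n = len(U)
    col_sums = [sum(U[i][j] for i in range(n)) for j in range(n)]
    return [
        [alpha * U[i][j] if i != j else 1.0 - alpha * col_sums[j]
         for j in range(n)]
        for i in range(n)
    ]


def stationary(T, tol=1e-13, max_iters=1_000_000):
    """Power iteration pi <- T pi from a point mass; returns (pi, steps)."""
    n = len(T)
    pi = [1.0] + [0.0] * (n - 1)
    for step in range(1, max_iters + 1):
        new = [sum(T[i][j] * pi[j] for j in range(n)) for i in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new, step
        pi = new
    raise RuntimeError("power iteration did not converge")


if __name__ == "__main__":
    for alpha in (1.0, 0.1, 0.01):
        pi, steps = stationary(lazy_chain(U_EXAMPLE, alpha))
        print(alpha, [round(p, 6) for p in pi], steps)
```

For this example the computed stationary vectors agree across the three values of $\alpha$ up to the iteration tolerance, while the number of iterations needed grows as $\alpha$ shrinks, which is the qualitative behavior described in the caption of Figure 5.3.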
We are now ready to show that the stationary distribution of $T_\alpha$ exists and approaches $\tilde{\pi}$. Re-arranging (5.12) gives us:

$$\alpha_0\tilde{T}_\alpha + (\alpha - \alpha_0) I = \alpha\tilde{T}_{\alpha_0}.$$

Combining this with (5.13), we get

$$\big\|\alpha_0 T_\alpha + (\alpha - \alpha_0) I - \alpha\tilde{T}_{\alpha_0}\big\|_\infty < O(\alpha^2),$$

and so

$$\Big\|\frac{\alpha_0}{\alpha} T_\alpha + \Big(1 - \frac{\alpha_0}{\alpha}\Big) I - \tilde{T}_{\alpha_0}\Big\|_\infty < O(\alpha). \qquad (5.14)$$

From (5.14), we see that the matrix $\frac{\alpha_0}{\alpha} T_\alpha + (1 - \frac{\alpha_0}{\alpha}) I$ approaches $\tilde{T}_{\alpha_0}$. Note that $\tilde{\pi}$ is the Perron-Frobenius eigenvector of $\tilde{T}_{\alpha_0}$, and the associated eigenvalue thus has multiplicity 1. Multiplicity 1 eigenvalues/vectors of a matrix vary continuously with small perturbations to that matrix (Theorem 8, p130 of [Lax07]). It follows that, for small $\alpha$, $\frac{\alpha_0}{\alpha} T_\alpha + (1 - \frac{\alpha_0}{\alpha}) I$ has a stationary distribution, and this distribution approaches $\tilde{\pi}$. The leading eigenvector of $\frac{\alpha_0}{\alpha} T_\alpha + (1 - \frac{\alpha_0}{\alpha}) I$ is the same as the leading eigenvector of $T_\alpha$, and it follows that $T_\alpha$ has a stationary distribution that approaches $\tilde{\pi}$.

Finally, note that we have assumed $\int_0^{C_2} p_{w,i}(z)\, dz > 0$ and $\int_{-C_2}^0 p_{w,i}(z)\, dz > 0$. Under this assumption, for $\alpha < \frac{1}{C_2}$, $\tilde{T}_{\alpha_0}(w, w') > 0$ whenever $w, w'$ are neighbors that differ at a single coordinate. It follows that every state in the Markov chain $\tilde{T}_{\alpha_0}$ is accessible from every other state by traversing a path of non-zero transition probabilities, and so $\tilde{\pi}(w) > 0$ for every state $w$. □

While the long term stationary behavior of SR is relatively insensitive to $\alpha$, the convergence speed of the algorithm is not. To measure this, we consider the mixing time of the Markov chain. Let $\pi_\alpha$ denote the stationary distribution of a Markov chain. We say that the $\epsilon$-mixing time of the chain is $M_\epsilon$ if $M_\epsilon$ is the smallest integer such that [LPW09]

$$|P(w_{M_\epsilon} \in A \mid w_0) - \pi_\alpha(A)| \le \epsilon, \quad \text{for all } w_0 \text{ and all subsets of states } A \subseteq W. \qquad (5.15)$$

We show below that the mixing time of the Markov chain gets large for small $\alpha$, which means exploration slows down, even though no exploitation gain is being realized.

Theorem 5.5.2. Let $p_{w,i}$ satisfy the assumptions of Theorem 5.5.1.
Choose some $\epsilon$ sufficiently small that there exists a proper subset of states $A \subset W$ with stationary probability $\pi_\alpha(A)$ greater than $\epsilon$. Let $M_\epsilon(\alpha)$ denote the $\epsilon$-mixing time of the chain with learning rate $\alpha$. Then,

$$\lim_{\alpha \to 0} M_\epsilon(\alpha) = \infty.$$

Proof. Given some distribution $\pi$ over the states of the Markov chain, and some set $A$ of states, let $[\pi]_A = \sum_{a \in A}\pi(a)$ denote the measure of $A$ with respect to $\pi$.

Suppose for contradiction that the mixing time of the chain remains bounded as $\alpha$ vanishes. Then we can find an integer $M_\epsilon$ that upper bounds the $\epsilon$-mixing time for all $\alpha$. By the assumption of the theorem, we can select some set of states $A$ with $[\tilde{\pi}]_A > \epsilon$, and some starting state $a \notin A$. Let $e$ be a distribution (a vector in the finite-state case) with $e_a = 1$, $e_b = 0$ for $b \ne a$. Note that $[e]_A = 0$ because $a \notin A$. Then

$$\big|[e]_A - [\tilde{\pi}]_A\big| > \epsilon.$$

Note that, as $\alpha \to 0$, we have $\|T_\alpha - \tilde{T}_\alpha\| \to 0$ and thus $\|T_\alpha^{M_\epsilon} - \tilde{T}_\alpha^{M_\epsilon}\| \to 0$. We also see from the definition of $\tilde{T}_\alpha$ in (5.11) that $\lim_{\alpha \to 0}\tilde{T}_\alpha = I$. It follows that

$$\lim_{\alpha \to 0}\big|[T_\alpha^{M_\epsilon} e]_A - [\tilde{\pi}]_A\big| = \big|[e]_A - [\tilde{\pi}]_A\big| > \epsilon,$$

and so for some $\alpha$ the inequality (5.15) is violated. This is a contradiction because it was assumed $M_\epsilon$ is an upper bound on the mixing time. □

5.6 Experiments

To explore the implications of the theory, we train both VGG-like networks [SZ15] and Residual networks [HZRS16b] with binarized weights on image classification problems. On CIFAR-10, we train ResNet-56, wide ResNet-56 (WRN-56-2, with two times more filters than ResNet-56), VGG-9, and the high capacity VGG-BC network used for the original BC model [CBD15]. We also train ResNet-56 on CIFAR-100, and ResNet-18 on ImageNet [RDS+15].

VGG-9 on CIFAR-10 consists of 7 convolutional layers and 2 fully connected layers. The convolutional layers contain 64, 64, 128, 128, 256, 256 and 256 of 3 × 3 filters respectively. There is a Batch Normalization and ReLU after each convolutional layer and the first fully connected layer. The details of the architecture are presented in Table 5.1.
VGG-BC is a high-capacity network used for the original BC method [CBD15], which contains 6 convolutional layers and 3 linear layers. We use the same architecture as in [CBD15] except using softmax and cross-entropy loss instead of SVM and squared hinge loss, respectively. The details of the architecture are presented in Table 5.2.

ResNet-56 has 55 convolutional layers and one linear layer, and contains three stages of residual blocks where each stage has the same number of residual blocks. WRN-56-2 doubles the number of filters in each residual block as in [ZK16]. ResNet-18 for ImageNet has the same description as in [HZRS16b].

We implement all models in Torch7 [CKF11] and train the quantized models with NVIDIA GPUs. The default minibatch size is 128. Following [CBD15], we do not use weight decay during training. We use Adam [KB15] as our baseline optimizer as we found it to frequently give better results than well-tuned SGD (an observation that is consistent with previous papers on quantized models [CHS+16, MOPU93, RORF16, GAGN15, CBD15]), and we train with the three quantized algorithms mentioned in Section 5.3, i.e., R-ADAM, SR-ADAM and BC-ADAM. The image pre-processing and data augmentation procedures are the same as [HZRS16b].

Similar to [RORF16], we only quantize the weights in the convolutional layers, but not linear layers, during training. Binarizing linear layers causes some performance drop without much computational speedup. This is because fully connected layers have very little computation overhead compared to Conv layers. Also, for state-of-the-art CNNs, the number of FC parameters is quite small. The numbers of parameters in the Conv/FC layers of the CNNs in Table 5.3 are (in millions): VGG-9: 1.7/1.1, VGG-BC: 4.6/9.4, ResNet-56: 0.84/0.0006, WRN-56-2: 3.4/0.001, ResNet-18: 11.2/0.5. While the VGG-like nets have many FC parameters, the more efficient and higher performing ResNets are almost entirely convolutional.
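The Conv/FC parameter counts quoted for the two VGG-style networks can be reproduced from the layer shapes in Tables 5.1 and 5.2. The following is a small arithmetic sketch, assuming that only weights are counted (no bias or batch-normalization parameters) and that every convolution uses a 3 × 3 kernel as listed in the tables; the helper names are hypothetical.

```python
def conv_params(channels, kernel=3):
    """Weights in a chain of kernel x kernel convolutions with the given
    channel widths (first entry is the number of input image channels)."""
    return sum(c_in * c_out * kernel * kernel
               for c_in, c_out in zip(channels[:-1], channels[1:]))


def fc_params(sizes):
    """Weights in a chain of fully connected layers with the given widths."""
    return sum(n_in * n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))


# Channel widths and linear-layer sizes read off Tables 5.1 and 5.2.
VGG9_CONV = conv_params([3, 64, 64, 128, 128, 256, 256, 256])
VGG9_FC = fc_params([4096, 256, 10])
VGGBC_CONV = conv_params([3, 128, 128, 256, 256, 512, 512])
VGGBC_FC = fc_params([8192, 1024, 1024, 10])

if __name__ == "__main__":
    for name, conv, fc in [("VGG-9", VGG9_CONV, VGG9_FC),
                           ("VGG-BC", VGGBC_CONV, VGGBC_FC)]:
        print(f"{name}: conv {conv / 1e6:.1f}M, fc {fc / 1e6:.1f}M")
```

Under these counting assumptions the totals round to 1.7/1.1 million for VGG-9 and 4.6/9.4 million for VGG-BC, matching the figures in the text. The ResNet counts are not reproduced here because their layer shapes are not tabulated in this chapter.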
The weights of convolutional layers are initialized with random Rademacher (±1) variables. We set the initial learning rate to 0.01 and decrease the learning rate by a factor of 10 at epochs 82 and 122 for CIFAR-10 and CIFAR-100 [HZRS16b]. For ImageNet experiments, we train the model for 90 epochs and decrease the learning rate at epochs 30 and 60. The authors of BC [CBD15] adopt a small initial learning rate (0.003) and it takes 500 epochs to converge. It is observed that large binary weights ($\Delta = 1$) will generate small gradients when batch normalization is used [IS15a], hence a large learning rate is necessary for faster convergence. We experiment with a larger learning rate (0.01) and find it converges to the same performance within 160 epochs, compared with 500 epochs in the original paper [CBD15].

Table 5.1: VGG-9 on CIFAR-10.

| layer type | kernel size | input size | output size |
|---|---|---|---|
| Conv 1 | 3 × 3 | 3 × 32 × 32 | 64 × 32 × 32 |
| Conv 2 | 3 × 3 | 64 × 32 × 32 | 64 × 32 × 32 |
| Max Pooling | 2 × 2 | 64 × 32 × 32 | 64 × 16 × 16 |
| Conv 3 | 3 × 3 | 64 × 16 × 16 | 128 × 16 × 16 |
| Conv 4 | 3 × 3 | 128 × 16 × 16 | 128 × 16 × 16 |
| Max Pooling | 2 × 2 | 128 × 16 × 16 | 128 × 8 × 8 |
| Conv 5 | 3 × 3 | 128 × 8 × 8 | 256 × 8 × 8 |
| Conv 6 | 3 × 3 | 256 × 8 × 8 | 256 × 8 × 8 |
| Conv 7 | 3 × 3 | 256 × 8 × 8 | 256 × 8 × 8 |
| Max Pooling | 2 × 2 | 256 × 8 × 8 | 256 × 4 × 4 |
| Linear | 1 × 1 | 1 × 4096 | 1 × 256 |
| Linear | 1 × 1 | 1 × 256 | 1 × 10 |

Results. The overall results are summarized in Table 5.3. The binary model trained by BC-ADAM has comparable performance to the full-precision model trained by ADAM. SR-ADAM outperforms R-ADAM, which verifies the effectiveness of Stochastic Rounding. There is a performance gap between SR-ADAM and BC-ADAM across all models and datasets. This is consistent with our theoretical results in Sections 5.4 and 5.5, which predict that keeping track of the real-valued weights as in BC-ADAM should produce better minimizers.

Table 5.2: VGG-BC for CIFAR-10.

| layer type | kernel size | input size | output size |
|---|---|---|---|
| Conv 1 | 3 × 3 | 3 × 32 × 32 | 128 × 32 × 32 |
| Conv 2 | 3 × 3 | 128 × 32 × 32 | 128 × 32 × 32 |
| Max Pooling | 2 × 2 | 128 × 32 × 32 | 128 × 16 × 16 |
| Conv 3 | 3 × 3 | 128 × 16 × 16 | 256 × 16 × 16 |
| Conv 4 | 3 × 3 | 256 × 16 × 16 | 256 × 16 × 16 |
| Max Pooling | 2 × 2 | 256 × 16 × 16 | 256 × 8 × 8 |
| Conv 5 | 3 × 3 | 256 × 8 × 8 | 512 × 8 × 8 |
| Conv 6 | 3 × 3 | 512 × 8 × 8 | 512 × 8 × 8 |
| Max Pooling | 2 × 2 | 512 × 8 × 8 | 512 × 4 × 4 |
| Linear | 1 × 1 | 1 × 8192 | 1 × 1024 |
| Linear | 1 × 1 | 1 × 1024 | 1 × 1024 |
| Linear | 1 × 1 | 1 × 1024 | 1 × 10 |

Exploration vs exploitation tradeoffs. Section 5.5 discusses the exploration/exploitation tradeoff of continuous-valued SGD methods and predicts that fully discrete methods like SR are unable to enter a greedy phase. To test this effect, we plot the percentage of changed weights (signs different from the initialization) as a function of the training epochs (Figures 5.4 and 5.5). SR-ADAM explores aggressively; it changes more weights in the conv layers than both R-ADAM and BC-ADAM, and keeps changing weights until nearly 40% of the weights differ from their starting values (in a binary model, randomly re-assigning weights would result in 50% change). The BC method never changes more than 20% of the weights (Fig 5.4b), indicating that it stays near a local minimizer and explores less. Interestingly, we see that the weights of the conv layers were not changed at all by R-ADAM; when the tails of the stochastic gradient distribution are light, this method is ineffective.

Table 5.3: Top-1 test error (%) after training with full-precision (ADAM), binarized weights (R-ADAM, SR-ADAM, BC-ADAM), and binarized weights with big batch size (Big SR-ADAM).

| Method | CIFAR-10 VGG-9 | CIFAR-10 VGG-BC | CIFAR-10 ResNet-56 | CIFAR-10 WRN-56-2 | CIFAR-100 ResNet-56 | ImageNet ResNet-18 |
|---|---|---|---|---|---|---|
| ADAM | 7.97 | 7.12 | 8.10 | 6.62 | 33.98 | 36.04 |
| BC-ADAM | 10.36 | 8.21 | 8.83 | 7.17 | 35.34 | 52.11 |
| Big SR-ADAM | 16.95 | 16.77 | 19.84 | 16.04 | 50.79 | 77.68 |
| SR-ADAM | 23.33 | 20.56 | 26.49 | 21.58 | 58.06 | 88.86 |
| R-ADAM | 23.99 | 21.88 | 33.56 | 27.90 | 68.39 | 91.07 |
Figure 5.4: Percentage of weight changes during training of VGG-BC on CIFAR-10. Panels: (a) R-ADAM, (b) BC-ADAM, (c) SR-ADAM. Each panel plots the percentage of changed weights (%) against epochs, with one curve per layer (conv_1 to conv_6, linear_1 to linear_3).

Figure 5.5: Effect of batch size on SR-ADAM when tested with ResNet-56 on CIFAR-10. Curves compare BC-ADAM and SR-ADAM with batch sizes 128 and 1024. (a) BC-ADAM vs SR-ADAM: error (%) vs epoch. Test error is reported with dashed lines, train error with solid lines. (b) Percentage of weight changes since initialization. (c) Percentage of weight changes per every 5 epochs.

5.6.1 A way forward: big batch training

We saw in Section 5.5 that SR is unable to exploit local minima because, for small learning rates, shrinking the learning rate does not produce additional bias towards moving downhill. This was illustrated in Figure 5.1. If this is truly the cause of the problem, then our theory predicts that we can improve the performance of SR for low-precision training by increasing the batch size.
This shrinks the variance of the gradient distribution in Figure 5.1 without changing the mean and concentrates more of the gradient distribution towards downhill directions, making the algorithm more greedy.

To verify this, we tried different batch sizes for SR including 128, 256, 512 and 1024, and found that the larger the batch size, the better the performance of SR. Figure 5.5a illustrates the effect of a batch size of 1024 for BC and SR methods. We find that the BC method, like classical SGD, performs best with a small batch size. However, a large batch size is essential for the SR method to perform well. Figure 5.5b shows the percentage of weights changed by SR and BC during training. We see that the large batch methods change the weights less aggressively than the small batch methods, indicating less exploration. Figure 5.5c shows the percentage of weights changed during each 5 epochs of training. It is clear that small-batch SR changes weights much more frequently than using a big batch. This property of big batch training clearly benefits SR; we see in Figure 5.5a and Table 5.3 that big batch training improved performance over SR-ADAM consistently.

In addition to providing a means of improving fixed-point training, this suggests that recently proposed methods using big batches [DYJG16, GDG+17] may be able to exploit lower levels of precision to further accelerate training.

5.7 Conclusion

The training of quantized neural networks is essential for deploying machine learning models on portable and ubiquitous devices. We provide a theoretical analysis to better understand the BinaryConnect (BC) and Stochastic Rounding (SR) methods for training quantized networks. We proved convergence results for BC and SR methods that predict an accuracy bound that depends on the coarseness of discretization. For general non-convex problems, we proved that SR differs from conventional stochastic methods in that it is unable to exploit greedy local search.
Experiments confirm these findings, and show that the mathematical properties of SR are indeed observable in practice.

Chapter 6: Why is SGD so fast for neural nets?

6.1 Introduction

Stochastic gradient descent [RM51] (and its momentum variants [Nes83]) has become the standard optimization routine for deep learning due to its fast convergence and good generalization properties [WRS+17, KS17, SMDH13], but the performance of these methods defies explanation.

Classical convex optimization theory predicts that the learning rate of SGD needs to decrease over time for convergence to be guaranteed [SZ13, Ber11]. With constant learning rates, it has been shown that SGD converges fast to a neighborhood of the minimizer, but then reaches a noise floor that depends on the variance of the gradients at the minimizer [MB11, NWS14]. When models contain the same number of parameters as training data, it is possible for a model to over-fit the data while still being strongly convex. In this case, convergence without a noise floor is possible without decaying the learning rate [MB11, NWS14].

But the behavior of SGD on high-dimensional neural models still evades explanation. Neural networks operate in a regime where the number of parameters is much larger than the number of training data. In this regime, SGD seems to converge very quickly; so quickly, in fact, that practitioners often use exponentially decaying learning rate schedules without seeing the method stall. Furthermore, network architecture seems to strongly affect SGD. It is common knowledge among practitioners that wider networks train faster [ZK16], and deeper networks train slower [BSF94, GB10].

The goal of this chapter is to study why SGD is efficient for neural nets, and how neural net design affects SGD. In particular, we investigate how over-parameterization (an increase in the number of parameters beyond the number of training data) affects the dynamics of SGD.
To explain the fast convergence of SGD on over-parameterized problems, we introduce a simple concept called gradient confusion. When confusion is high, stochastic gradients produced by different data samples may be negatively correlated. When this happens, data samples contradict one another, causing slow convergence. When confusion is low, the gradients produced by different samples are similar, and we show via theoretical and empirical results that convergence is much faster than predicted by classical theory. For randomized training data, we show that the gradients of over-parameterized neural networks are likely to have low confusion. Finally, we present experimental results showing that low gradient confusion leads to efficient convergence of SGD.

Problem formulation & conceptual overview

Notation. In this chapter, we use upper-case bold fonts to represent matrices. We use $(W)_{i,j}$ to indicate the $(i, j)$ cell in matrix $W$ and $(W)_i$ for the $i$-th row of matrix $W$.

SGD works by iteratively selecting a random function $\tilde{f}_k$, and modifying the parameters to decrease the value of the objective term $\tilde{f}_k$. It may happen that the selected gradient $\nabla\tilde{f}_k$ is negatively correlated with the gradient of another term $\nabla f_j$. In this case, the gains we make by decreasing $\tilde{f}_k$ are partially cancelled out by an increase in $f_j$, and convergence becomes slow. When the gradients of different mini-batches are negatively correlated, the objective terms disagree on which direction the parameters should move, and we say that there is gradient confusion.

Definition 6.1.1. A set of objective functions $\{f_i\}$ has gradient confusion $\eta$ if the pairwise inner products between gradients satisfy

$$\langle\nabla f_i(w), \nabla f_j(w)\rangle \ge -\eta, \quad \forall i, j. \qquad (6.1)$$

SGD converges fast when gradient confusion is low. To see why, consider the case of training a logistic regression model on a dataset with orthogonal vectors.
We have $f_i(w) = \ell(y_i \cdot \langle x_i, w\rangle)$, where $\ell : \mathbb{R} \to \mathbb{R}$ is the logistic loss, $\{x_i\}$ is a set of orthogonal training vectors, and $y_i \in \{-1, 1\}$ is a label. We then have $\nabla f_i(w) = y_i\,\ell'(y_i \cdot \langle x_i, w\rangle)\, x_i$ and, for $i \ne j$,

$$\langle\nabla f_i(w), \nabla f_j(w)\rangle = y_i y_j\, \ell'(y_i \cdot \langle x_i, w\rangle)\, \ell'(y_j \cdot \langle x_j, w\rangle)\, \langle x_i, x_j\rangle = 0,$$

and so there is no gradient confusion ($\eta = 0$). Because of gradient orthogonality, an update along the gradient direction of $f_i$ has no effect on the loss value of $f_j$ for $i \ne j$. In this case, SGD decouples into a (deterministic) gradient descent on each objective term separately, and we can expect to see the fast rates of convergence attained by deterministic gradient descent, rather than the slow rates of SGD.

Can we expect a problem to have low gradient confusion in practice? It is known that randomly chosen vectors in high dimensions are nearly orthogonal with high probability [GS16]. For this reason, we would expect an average-case (i.e., random) problem to have nearly orthogonal gradients, provided that we don't train on too many training vectors (in which case it becomes likely that we will see two training vectors with large negative correlation). In other words, we should expect a random optimization problem to have low gradient confusion when the number of parameters is "large" and the number of training data is "small", i.e., when the model is over-parameterized.

The above argument is rather informal, and ignores issues like random sampling order and non-convexity. Furthermore, it is unclear whether we can expect low levels of gradient confusion in practice, and what effect non-zero confusion has on convergence rates. Below, we present a rigorous argument that low confusion levels accelerate SGD for both convex and non-convex problems. Then, we turn to the issue of over-parameterization, and show that gradient confusion is low for over-parameterized classifier problems with random data.
Finally, we use computational experiments to show that gradient confusion is low for real-world neural nets, and that this explains the superior optimization performance of SGD.

Related work

The authors of [ACH18] study the behavior of SGD on over-parameterized problems, and show that SGD on over-parameterized linear neural nets is similar to applying a certain preconditioner while optimizing. Our work differs from [ACH18] in that it studies a completely different mechanism of acceleration, and in that we establish a more direct relationship between width, depth, problem dimensionality, and the error floor of SGD convergence. The behavior of SGD on over-parameterized problems was also studied in [MBB17] with the purpose of exploring how SGD hyper-parameters (learning rates, batch size, etc.) affect convergence. In contrast, this study focuses on how and why network architecture choices affect convergence.

Several other authors have studied the impact of structured gradients on SGD. [BFL+17] study the effects of "shattered gradients," which is when (non-stochastic) gradients at different (but close) locations in parameter space become negatively correlated. This is different from gradient confusion, which refers to negative correlations between stochastic mini-batch gradients at the same location in parameter space. Another related issue is that stochastic noise during training leads to improved generalization performance because of implicit regularization. This is not our main focus: we address the question of why SGD is good for optimization, rather than why it is good for generalization.

6.2 SGD is fast when gradient confusion is low

We now present a rigorous analysis of gradient confusion and its effect on SGD. We begin by looking at the case where the objective satisfies the PL inequality (a condition related to, but weaker than, strong convexity), where we can provide tight bounds on the rate of convergence in terms of the optimality gap.
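Then we look at a broader class of non-convex functions, and prove fast convergence to a stationary point.

We begin by making two standard assumptions about the objective function.

Assumption 6.2.1. The individual gradients $\nabla f_i$ are $L$-Lipschitz continuous:
\[ f_i(w') \le f_i(w) + \langle \nabla f_i(w), w' - w\rangle + \frac{L}{2}\|w' - w\|^2, \quad \forall i. \]

Assumption 6.2.2. The individual functions $f_i$ satisfy the PL inequality:
\[ \frac{1}{2}\|\nabla f_i(w)\|^2 \ge \mu\,(f_i(w) - f_i^\star), \quad \forall i, \]
where $f_i^\star = \min_w f_i(w)$.

Using these assumptions, we now state the following convergence result.

Theorem 6.2.1 (Linear convergence under bounded gradient confusion). If the objective function satisfies Assumptions 6.2.1 and 6.2.2, and has gradient confusion bounded by $\eta$, SGD with updates of the form (2.2) converges linearly to a neighborhood of the minima of (2.1) as
\[ f(w_k) - f^\star \le \rho^k (f(w_0) - f^\star) + \frac{\alpha\hat\eta}{1 - \rho}, \]
where $\hat\eta = \max\{\eta, 0\}$, the learning rate $\alpha \le 2/(nL)$, and $\rho = 1 - \frac{2\mu}{n}\left(\alpha - \frac{nL\alpha^2}{2}\right)$.

Proof. From Assumption 6.2.1, we have
\begin{align*}
f(w_{k+1}) &\le f(w_k) + \langle \nabla f(w_k), w_{k+1} - w_k\rangle + \frac{L}{2}\|w_{k+1} - w_k\|^2 \\
&= f(w_k) - \alpha \langle \nabla f(w_k), \nabla \tilde f_k(w_k)\rangle + \frac{L\alpha^2}{2}\|\nabla \tilde f_k(w_k)\|^2 \\
&= f(w_k) - \left(\frac{\alpha}{n} - \frac{L\alpha^2}{2}\right)\|\nabla \tilde f_k(w_k)\|^2 - \frac{\alpha}{n}\sum_{i:\, f_i \neq \tilde f_k} \langle \nabla f_i(w_k), \nabla \tilde f_k(w_k)\rangle \\
&\le f(w_k) - \left(\frac{\alpha}{n} - \frac{L\alpha^2}{2}\right)\|\nabla \tilde f_k(w_k)\|^2 + \frac{\alpha (n-1)\hat\eta}{n} \\
&\le f(w_k) - \left(\frac{\alpha}{n} - \frac{L\alpha^2}{2}\right)\|\nabla \tilde f_k(w_k)\|^2 + \alpha\hat\eta,
\end{align*}
where the second-last inequality follows from Definition 6.1.1. Let $\alpha < 2/(nL)$. Then, using Assumption 6.2.2 and subtracting $f^\star = \min_w f(w)$ from both sides, we get
\[ f(w_{k+1}) - f^\star \le f(w_k) - f^\star - 2\mu\left(\frac{\alpha}{n} - \frac{L\alpha^2}{2}\right)(\tilde f_k(w_k) - \tilde f_k^\star) + \alpha\hat\eta, \]
where $\tilde f_k^\star = \min_w \tilde f_k(w)$. Taking expectation and using the fact that $\mathbb{E}_i[f_i^\star] \le f^\star$, we get:
\[ f(w_{k+1}) - f^\star \le \left(1 - \frac{2\mu\alpha}{n} + \mu L \alpha^2\right)(f(w_k) - f^\star) + \alpha\hat\eta. \]
Writing $\rho = 1 - \frac{2\mu\alpha}{n} + \mu L\alpha^2$, and unrolling the iterations, we get
\begin{align*}
f(w_{k+1}) - f^\star &\le \rho^{k+1}(f(w_0) - f^\star) + \sum_{i=0}^{k} \rho^i \alpha\hat\eta \\
&\le \rho^{k+1}(f(w_0) - f^\star) + \sum_{i=0}^{\infty} \rho^i \alpha\hat\eta \\
&= \rho^{k+1}(f(w_0) - f^\star) + \frac{\alpha\hat\eta}{1 - \rho},
\end{align*}
which completes the proof. □

This result shows that SGD converges linearly to a neighborhood of a minimizer, and the size of this neighborhood depends on the level of gradient confusion.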
When $\eta \le 0$, there is no confusion, and SGD converges directly to a minimizer without using a vanishing learning rate schedule.

In the case of non-convex functions, we can still prove convergence to a neighborhood of a stationary point under the following standard assumption.

Assumption 6.2.3. Assume that the variance of the gradients is bounded as:
\[ \mathbb{E}\left[\|\nabla \tilde f(w) - \nabla f(w)\|^2\right] \le \sigma^2. \]

The following theorem shows fast convergence in the case of a smooth non-convex function when gradient confusion is low.

Theorem 6.2.2. If the objective function satisfies Assumptions 6.2.1 and 6.2.3, and the confusion bound (6.1), then SGD converges to a stationary point with
\[ \min_{k=1,\dots,T} \mathbb{E}\|\nabla f(w_k)\|^2 \le \frac{2n}{2\alpha - nL\alpha^2}\,\frac{f(w_1) - f^\star}{T} + \frac{2n\eta}{2 - nL\alpha} \]
for learning rate $\alpha < 2/(nL)$.

Proof. From the proof of Theorem 6.2.1, we have:
\[ f(w_{k+1}) \le f(w_k) - \left(\frac{\alpha}{n} - \frac{L\alpha^2}{2}\right)\|\nabla \tilde f_k(w_k)\|^2 + \alpha\eta. \tag{6.2} \]
Using Assumption 6.2.3, we can write:
\[ \mathbb{E}\|\nabla \tilde f_k(w_k)\|^2 = \mathbb{E}\|\nabla \tilde f_k(w_k) - \nabla f(w_k)\|^2 + \mathbb{E}\|\nabla f(w_k)\|^2 = \sigma^2 + \mathbb{E}\|\nabla f(w_k)\|^2. \]
Thus, taking expectation and assuming the step size $\alpha < 2/(nL)$ (so that the coefficient $\frac{\alpha}{n} - \frac{L\alpha^2}{2}$ is positive), we can rewrite equation (6.2) as:
\begin{align*}
\mathbb{E}\|\nabla f(w_k)\|^2 &\le \frac{2n}{2\alpha - nL\alpha^2}\,\mathbb{E}\big(f(w_k) - f(w_{k+1})\big) - \sigma^2 + \frac{2n\eta}{2 - nL\alpha} \\
&\le \frac{2n}{2\alpha - nL\alpha^2}\,\mathbb{E}\big(f(w_k) - f(w_{k+1})\big) + \frac{2n\eta}{2 - nL\alpha}.
\end{align*}
Taking an average over $T$ iterations, and using $f^\star = \min_w f(w)$, we get:
\[ \min_{k=1,\dots,T} \mathbb{E}\|\nabla f(w_k)\|^2 \le \frac{1}{T}\sum_{k=1}^{T} \mathbb{E}\|\nabla f(w_k)\|^2 \le \frac{2n}{2\alpha - nL\alpha^2}\,\frac{f(w_1) - f^\star}{T} + \frac{2n\eta}{2 - nL\alpha}. \]
□

The presence of a noise floor is not always observed for over-parameterized problems, and the assumption that $\eta \le 0$ is unrealistically strong to guarantee such convergence. In the next section, we present a few additional assumptions under which faster convergence can be observed.

6.2.1 Conditions for even faster convergence

Faster convergence can be guaranteed if we re-define gradient confusion using the correlation between gradients (rather than the dot product). If these correlations are bounded below, then linear convergence occurs with no noise floor. The following results prove this convergence in the case of over-fitting.
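Over-fitting occurs when the minimal objective value of the composite objective $F$ in (2.1) is the average of the minimal values of the terms $\{f_i\}$. In other words, the parameter $w^\star$ that minimizes $F$ also simultaneously minimizes every objective term $f_i$.

Theorem 6.2.3. Suppose $F$ satisfies Assumptions 6.2.1, 6.2.2, and the correlation-based confusion condition
\[ \frac{\langle \nabla f_i(w), \nabla f_j(w)\rangle}{\|\nabla f_i(w)\|\,\|\nabla f_j(w)\|} \ge -\nu, \quad \forall i, j. \tag{6.3} \]
Suppose further the loss satisfies the over-fitting condition $\min_w \frac{1}{n}\sum_i f_i(w) = \frac{1}{n}\sum_i \min_w f_i(w)$. If the objective has confusion $\nu < \frac{\mu}{nL}$ and learning rate $\alpha < \frac{2}{nL} - \frac{2\nu}{\mu}$, then SGD converges with
\[ f(w_k) - f^\star \le \rho^k (f(w_0) - f^\star), \]
where $\rho = 1 - 2\mu\alpha/n + \mu L\alpha^2 + 2\alpha\nu L$.

Proof. From (6.3) and the identity $2ab \le a^2 + b^2$ we get
\begin{align*}
\langle \nabla f_i(w), \nabla f_j(w)\rangle &\ge -\nu \|\nabla f_i(w)\|\,\|\nabla f_j(w)\| \\
&\ge -\frac{\nu}{2}\left(\|\nabla f_i(w)\|^2 + \|\nabla f_j(w)\|^2\right) \\
&\ge -\nu L\,(f_i(w) - f_i^\star + f_j(w) - f_j^\star).
\end{align*}
Following the proof of Theorem 6.2.1, we have
\begin{align*}
f(w_{k+1}) &\le f(w_k) + \langle \nabla f(w_k), w_{k+1} - w_k\rangle + \frac{L}{2}\|w_{k+1} - w_k\|^2 \\
&= f(w_k) - \alpha\langle \nabla f(w_k), \nabla \tilde f_k(w_k)\rangle + \frac{L\alpha^2}{2}\|\nabla \tilde f_k(w_k)\|^2 \\
&= f(w_k) - \left(\frac{\alpha}{n} - \frac{L\alpha^2}{2}\right)\|\nabla \tilde f_k(w_k)\|^2 - \frac{\alpha}{n}\sum_{i:\, f_i \neq \tilde f_k}\langle \nabla f_i(w_k), \nabla \tilde f_k(w_k)\rangle \\
&\le f(w_k) - 2\mu\left(\frac{\alpha}{n} - \frac{L\alpha^2}{2}\right)(\tilde f_k(w_k) - \tilde f_k^\star) + \frac{\alpha\nu L}{n}\sum_{i:\, f_i \neq \tilde f_k}\left(f_i(w_k) - f_i^\star + \tilde f_k(w_k) - \tilde f_k^\star\right),
\end{align*}
where the second-last inequality follows from Definition 6.1.1. Let the learning rate $\alpha < 2/(nL)$. Then, using Assumption 6.2.2 and subtracting $f^\star = \min_w f(w)$ from both sides,
\[ f(w_{k+1}) - f^\star \le f(w_k) - f^\star - 2\mu\left(\frac{\alpha}{n} - \frac{L\alpha^2}{2}\right)(\tilde f_k(w_k) - \tilde f_k^\star) + \frac{\alpha\nu L}{n}\sum_{i:\, f_i \neq \tilde f_k}\left(f_i(w_k) - f_i^\star + \tilde f_k(w_k) - \tilde f_k^\star\right), \]
where $\tilde f_k^\star = \min_w \tilde f_k(w)$. Taking expectation and using the fact that $\mathbb{E}_i[f_i^\star] = f^\star$, we get
\[ f(w_{k+1}) - f^\star \le \left(1 - \frac{2\mu\alpha}{n} + \mu L\alpha^2 + 2\alpha\nu L\right)(f(w_k) - f^\star). \]
□

Finally, we can strengthen the definition of confusion by examining the correlation between $\nabla f_i(w)$ and $\nabla f_j(w')$ for all $w$ and $w'$. Compared to Theorem 6.2.1, convergence is guaranteed with a larger learning rate that is independent of the training set size $n$, and faster geometric decay.

Theorem 6.2.4.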
If the objective function satisfies Assumptions 6.2.1 and 6.2.2, and satisfies the strengthened gradient confusion bound
\[ \langle \nabla f_i(w), \nabla f_j(w')\rangle \ge -\eta, \quad \forall i, j, w, w', \]
then SGD converges with
\[ f(w_k) - f^\star \le \rho^k (f(w_0) - f^\star) + \frac{\alpha\hat\eta}{1 - \rho}, \]
where $\hat\eta = \max\{\eta, 0\}$, the learning rate $\alpha \le 2/L$, and $\rho = 1 - 2\mu\alpha/n + \mu L\alpha^2/n$.

Proof. We start by noting that, for $i \neq j$,
\begin{align}
f_i(w - \alpha\nabla f_j(w)) &= f_i(w) + \int_{t=0}^{\alpha} \frac{\partial}{\partial t} f_i(w - t\nabla f_j(w))\, dt \tag{6.4} \\
&= f_i(w) - \int_{t=0}^{\alpha} \nabla f_j(w)^T \nabla f_i(w - t\nabla f_j(w))\, dt \tag{6.5} \\
&\le f_i(w) + \int_{t=0}^{\alpha} \hat\eta\, dt \le f_i(w) + \alpha\hat\eta. \tag{6.6}
\end{align}
We then have
\begin{align*}
n f(w_{k+1}) &= \tilde f_k(w_k - \alpha\nabla \tilde f_k(w_k)) + \sum_{f_i \neq \tilde f_k} f_i(w_k - \alpha\nabla \tilde f_k(w_k)) \\
&\le \tilde f_k(w_k) - \alpha\langle \nabla \tilde f_k(w_k), \nabla \tilde f_k(w_k)\rangle + \frac{L\alpha^2}{2}\|\nabla \tilde f_k(w_k)\|^2 + \sum_{f_i \neq \tilde f_k}\left(f_i(w_k) + \alpha\hat\eta\right) \\
&\le n f(w_k) - \left(\alpha - \frac{L\alpha^2}{2}\right)\|\nabla \tilde f_k(w_k)\|^2 + n\alpha\hat\eta.
\end{align*}
Re-arranging and applying Assumption 6.2.2 we get
\[ f(w_{k+1}) \le f(w_k) - \frac{2\mu}{n}\left(\alpha - \frac{L\alpha^2}{2}\right)(\tilde f_k(w_k) - \tilde f_k^\star) + \alpha\hat\eta. \]
Taking expectations and subtracting $f^\star$ from both sides we get
\[ f(w_{k+1}) - f^\star \le \left(1 - \frac{2\mu\alpha}{n} + \frac{\mu L\alpha^2}{n}\right)(f(w_k) - f^\star) + \alpha\hat\eta. \]
Unrolling this expression gives us
\[ f(w_{k+1}) - f^\star \le \rho^{k+1}(f(w_0) - f^\star) + \frac{\alpha\hat\eta}{1 - \rho}, \]
where $\rho = 1 - 2\mu\alpha/n + \mu L\alpha^2/n$, which completes the proof. □

6.3 Over-parameterized problems have low gradient confusion

In the previous section, we showed that low gradient confusion can result in much faster convergence of SGD to the neighborhood of a critical point for general smooth non-convex functions. The question still remains, however, when such a condition might arise, and how it might explain the effectiveness of SGD on neural net problems.

In practice the level of gradient confusion will depend on the structure of the training data. However, we can analyze gradient confusion for generic (i.e., random) model problems using methods from high-dimensional probability. We rigorously analyze the case where training data is randomly sampled from a unit sphere, and identify specific cases where gradient confusion (Definition 6.1.1) is low with high probability.
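As an informal illustration of the high-dimensional effect this section relies on, the following sketch samples points uniformly from the unit sphere (by normalizing Gaussian vectors) and reports the most negative pairwise inner product for several dimensions. It is only a numerical sanity check, not part of the analysis; the sample size, the dimensions, the seed, and the helper names `sample_unit_sphere` and `min_pairwise_inner_product` are assumptions chosen for illustration.

```python
import numpy as np


def sample_unit_sphere(n, d, rng):
    """Draw n points uniformly from the unit sphere in R^d by normalizing Gaussians."""
    g = rng.standard_normal((n, d))
    return g / np.linalg.norm(g, axis=1, keepdims=True)


def min_pairwise_inner_product(x):
    """Smallest inner product <x_i, x_j> over all pairs with i != j."""
    gram = x @ x.T
    off_diagonal = ~np.eye(len(x), dtype=bool)
    return float(gram[off_diagonal].min())


rng = np.random.default_rng(0)  # arbitrary fixed seed
n = 50                          # hypothetical number of training points
min_inner = {
    d: min_pairwise_inner_product(sample_unit_sphere(n, d, rng))
    for d in (10, 100, 10000)
}
for d, value in min_inner.items():
    print(f"d = {d:5d}: most negative pairwise inner product = {value:.3f}")
```

For a fixed number of points, the most negative inner product should move toward zero as the dimension grows, which is the behavior the concentration arguments below make precise.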
We consider synthetic datasets of the form $D = \{(x_i, C(x_i))\}_{i=1}^{n}$, for some labeling function $C$. The data points $\{x_i\}$ are drawn uniformly and independently from the surface of a $d$-dimensional unit sphere. The function $f_i(w)$ we consider is the least-squares function. We show below that, given a confusion parameter $\eta$, a range of models (including neural networks) with randomized training data naturally attain confusion less than $\eta$ provided the dimension of the problem is sufficiently large. The method we use to prove this result is very similar for a range of different problem classes, and so we begin by illustrating the result on the simple class of linear regression problems.

6.3.1 A simple case: linear regression

We begin by examining gradient confusion in the case of a simple linear least-squares regression. We assume that the hidden concept that the algorithm is trying to learn is given by $C(x) = \langle \tilde w, x\rangle$, for some "true" weight vector $\tilde w$.¹ Throughout we will denote by $g_w : \mathbb{R}^d \to \mathbb{R}$ the function we fit to the training data. In this section we simply have
\[ g_w(x) = \langle w, x\rangle, \quad \text{and} \quad f_i(w) = \frac{1}{2}\left(g_w(x_i) - C(x_i)\right)^2, \]
but we will consider more complex situations below. For this problem, we have $\nabla f_i(w) = \alpha_i x_i$, where $\alpha_i := g_w(x_i) - C(x_i)$. With this definition of the gradient, we can prove the following theorem.

¹ The arguments in this paper actually hold for the case of a noisy model $C(x) = \langle \tilde w, x\rangle + \zeta$ where $\zeta$ is a Gaussian random variable; however, we will omit the noise term for notational simplicity.

[Figure 6.1: Simulation proof for Theorem 6.3.1. As the dimensionality of a random linear regression problem increases, the probability of violating the gradient confusion condition η > 0.1 vanishes. Horizontal axis: problem dimension (d). Legend: numerical estimation of violation probability; best fit 1/poly(d).]

Theorem 6.3.1 (Concentration for linear regression).
Let $w, \tilde w \in [-1, 1]^d$ be the approximate and true weight vectors and let $\eta > 0$ be a given constant. For $d > \Omega(\log n)$, we have that with probability at least $1 - \Omega\left(\frac{1}{\mathrm{poly}(d)}\right)$, equation (6.7) holds at $w$.

Figure 6.1 shows a numerical demonstration of this theorem. Note that one limitation of Theorem 6.3.1, and of all other results presented in the rest of this section, is that the bound is non-uniform in the weights $w$.

Technical approach & proof sketch

To prove that any bound on the gradient confusion $\eta > 0$ is attained for sufficiently large $d$, we examine the values of the function $h(x_i, x_j) := \langle \nabla f_i, \nabla f_j\rangle$ when $x_i$ and $x_j$ are selected at random from the unit sphere. Our goal will be to show that this function has non-negative expected value, and that $h$ concentrates around its expected value for large $d$, thus making it extremely unlikely that a large negative value of $h$ is observed. Once we have shown that $h(x_i, x_j) < -\eta$ with extremely low probability for a random pair of points $(x_i, x_j)$, we can use a union bound to show that this occurs with low probability for all pairwise comparisons between data points.

For a given constant $\eta > 0$, our goal is to find an appropriate concentration bound for the event
\[ h(x_i, x_j) \ge -\eta \quad \text{for } i \neq j. \tag{6.7} \]
In other words, we want to find a sharp bound $\tau(n, d, \eta)$ such that for fixed $i, j$
\[ \Pr[h(x_i, x_j) \le -\eta] \le \tau(n, d, \eta). \tag{6.8} \]
By using the union bound and the fact that $h(x_i, x_j)$ is identically distributed for all $i, j \in [n]$, we have that
\[ \Pr[\exists\, i \neq j,\ h(x_i, x_j) \le -\eta] \le n^2 \tau(n, d, \eta). \tag{6.9} \]
We use tools from high-dimensional probability to find an appropriate function $\tau$ for a large class of predictors $g$, and show that the quantity $n^2\tau(n, d, \eta)$ vanishes for large $d$.

Proof of Theorem 6.3.1

We will briefly describe some technical lemmas we require in our analysis. The following Chernoff-style concentration bound is proved in Chapter 5 of [Ver].

Lemma 6.3.1 (Concentration of Lipschitz function over a sphere).
Let $x \in \mathbb{R}^d$ be sampled uniformly from the surface of a $d$-dimensional sphere. Consider a Lipschitz function $\ell : \mathbb{R}^d \to \mathbb{R}$ which is differentiable everywhere. Let $\|\nabla\ell\|_\infty$ denote $\sup_{x \in \mathbb{R}^d} \|\nabla\ell(x)\|_\infty$. Then for any $t \ge 0$ and some fixed constant $c \ge 0$, we have the following.
\[ \Pr\Big[\big|\ell(x) - \mathbb{E}[\ell(x)]\big| \ge t\Big] \le 2\exp\left(-\frac{c\,d\,t^2}{\rho^2}\right), \tag{6.10} \]
where $\rho \ge \|\nabla\ell\|_\infty$ is an entry-wise bound on $\nabla\ell$.

We will rely on the following generalization of Lemma 6.3.1.

Corollary 6.3.1. Let $x, y \in \mathbb{R}^d$ be two mutually independent vectors sampled uniformly from the surface of a $d$-dimensional sphere. Consider a Lipschitz function $\ell : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ which is differentiable everywhere. Let $\|\nabla\ell\|_\infty$ denote $\sup_{(x,y) \in \mathbb{R}^d \times \mathbb{R}^d} \|\nabla\ell(x, y)\|_\infty$. Then for any $t \ge 0$ and some fixed constant $c \ge 0$, we have the following.
\[ \Pr\Big[\big|\ell(x, y) - \mathbb{E}[\ell(x, y)]\big| \ge t\Big] \le 2\exp\left(-\frac{c\,d\,t^2}{\rho^2}\right), \tag{6.11} \]
where $\rho \ge \|\nabla\ell\|_\infty$ is an entry-wise bound on $\nabla\ell$.

Proof. This corollary can be derived from Lemma 6.3.1 as follows. Note that for every fixed $\tilde y \in \mathbb{R}^d$, equation (6.10) holds. Additionally, we have that the vectors $x$ and $y$ are mutually independent. Hence we can write the LHS of equation (6.11) as the following.
\[ \int_{(y)_1 = -\infty}^{\infty} \cdots \int_{(y)_d = -\infty}^{\infty} \Pr\Big[\big|\ell(x, y) - \mathbb{E}[\ell(x, y)]\big| \ge t \,\Big|\, y = \tilde y\Big]\, \phi(\tilde y)\, d(y)_1 \cdots d(y)_d. \]
Here $\phi(\tilde y)$ refers to the pdf of the distribution of $y$. From independence, the inner term in the integral evaluates to $\Pr\big[|\ell(x, \tilde y) - \mathbb{E}[\ell(x, \tilde y)]| \ge t\big]$. We know this is less than or equal to $2\exp\left(-\frac{c\,d\,t^2}{\|\nabla\ell\|_\infty^2}\right)$. Therefore, the integral can be upper bounded by the following.
\[ \int_{(y)_1 = -\infty}^{\infty} \cdots \int_{(y)_d = -\infty}^{\infty} 2\exp\left(-\frac{c\,d\,t^2}{\|\nabla\ell\|_\infty^2}\right) \phi(\tilde y)\, d(y)_1 \cdots d(y)_d. \]
Since $\phi(\tilde y)$ is a valid pdf, we get the required equation (6.11). □

Additionally, we will use the following facts about a normalized Gaussian random variable.

Lemma 6.3.2. For a normalized Gaussian $x$ (i.e., an $x$ sampled uniformly from the surface of a unit $d$-dimensional sphere) the following statements are true.
1. $\forall i \in [d]$ we have that $\mathbb{E}[(x)_i] = 0$.
2. $\forall i \in [d]$ we have that $\mathbb{E}[(x)_i^2] = 1/d$.
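The two moment identities in Lemma 6.3.2 can be checked numerically with a small Monte Carlo sketch. This is only an illustration under assumed settings (dimension 8, 20000 samples, a fixed seed, and variable names chosen here); it does not replace the proof.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary fixed seed
d, num_samples = 8, 20000       # hypothetical dimension and sample count

# Normalized Gaussians are uniformly distributed on the unit sphere.
g = rng.standard_normal((num_samples, d))
x = g / np.linalg.norm(g, axis=1, keepdims=True)

first_moment = x.mean(axis=0)          # empirical estimate of E[(x)_i]
second_moment = (x ** 2).mean(axis=0)  # empirical estimate of E[(x)_i^2]

print("max |E[(x)_i]| estimate:", np.abs(first_moment).max())
print("E[(x)_i^2] estimates:", np.round(second_moment, 3), "target:", 1.0 / d)
```

Because every sample has unit norm, the second-moment estimates sum to one across coordinates, which mirrors the counting argument used for part (2).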
Proof. Part (1) can be proved by observing that the normalized Gaussian random variable is spherically symmetric about the origin. In other words, for every $i \in [d]$ the vectors $(x_1, x_2, \dots, x_i, \dots, x_d)$ and $(x_1, x_2, \dots, -x_i, \dots, x_d)$ are identically distributed. Hence $\mathbb{E}[x_i] = \mathbb{E}[-x_i]$, which implies that $\mathbb{E}[x_i] = 0$.

Part (2) can be proved by observing that for any $i, j \in [d]$, $x_i$ and $x_j$ are identically distributed. Fix any $i \in [d]$. We have that $\sum_{j=1}^{d} \mathbb{E}[x_j^2] = d \times \mathbb{E}[x_i^2]$. Note that we have
\[ \sum_{j=1}^{d} \mathbb{E}[x_j^2] = \int_{(x)_1 = -\infty}^{\infty} \cdots \int_{(x)_d = -\infty}^{\infty} \frac{\sum_{j=1}^{d} x_j^2}{\sum_{j'=1}^{d} x_{j'}^2}\, \phi(x)\, d(x)_1 \cdots d(x)_d = 1. \]
Therefore $\mathbb{E}[x_i^2] = 1/d$. □

We are now ready to prove Theorem 6.3.1.

Proof. The proof illustrates a general strategy we will use for other general models. We will prove that $h(\cdot, \cdot)$ has two properties, namely that $\nabla h(\cdot, \cdot)$ is bounded, and that $h(\cdot, \cdot)$ has non-negative expectation. These properties enable us to use Corollary 6.3.1 to show that $h(\cdot, \cdot)$ concentrates on non-negative values.

Bounded Gradient. Fix an arbitrary value $x_i = \tilde x_i$ and $x_j = \tilde x_j$. Consider an arbitrary coordinate corresponding to the variable $(x_i)_p$ (by symmetry a corresponding argument holds for $(x_j)_p$) in the vector $\nabla h(\tilde x_i, \tilde x_j)$. Writing $\Delta := w - \tilde w$, so that $\alpha_i = \langle \Delta, x_i\rangle$, the term evaluates to $\alpha_i\alpha_j(\tilde x_j)_p + (\Delta)_p\,\alpha_j\langle \tilde x_i, \tilde x_j\rangle$. Note that we have $|(\tilde x_i)_p| \le 1$ for every $i \in [n]$, $p \in [d]$. Additionally, from our assumption on the weights $w$ and $\tilde w$ we have $-2 \le \sum_{p=1}^{d} (\Delta)_p \le 2$. Therefore we have that for every $i \in [n]$, $-2 \le \alpha_i \le 2$. And hence $\alpha_i\alpha_j(\tilde x_j)_p + \alpha_j\sum_{p=1}^{d}(\Delta)_p(\tilde x_i)_p(\tilde x_j)_p \le 8$. Hence we have that $\|\nabla h\|_\infty \le \rho = 8$. In particular, the upper bound $\rho$ is a constant.

Non-negative Expectation. To compute bounds on $\mathbb{E}[h(x_i, x_j)]$, we want to evaluate $\mathbb{E}[\alpha_i\alpha_j\langle x_i, x_j\rangle]$. On expanding the product and removing all summands where either $(x_i)_p$ or $(x_j)_p$ appears with an odd power, we have $\mathbb{E}[h(x_i, x_j)] = \|\Delta\|^2/d^2$. Therefore we have $0 \le \mathbb{E}[h(x_i, x_j)]$. Alternatively, we can obtain this lower bound as follows.
Note that since $-2 \le \alpha_i, \alpha_j \le 2$, we have that $\mathbb{E}[\alpha_i\alpha_j\langle x_i, x_j\rangle] \ge -4\,\mathbb{E}[\langle x_i, x_j\rangle] = 0$. The last equality holds because
\[ \mathbb{E}[\langle x_i, x_j\rangle] = \mathbb{E}\Big[\sum_{p=1}^{d} (x_i)_p (x_j)_p\Big] = \sum_{p=1}^{d} \mathbb{E}[(x_i)_p (x_j)_p] = \sum_{p=1}^{d} \mathbb{E}[(x_i)_p]\,\mathbb{E}[(x_j)_p] = 0. \]
The second equality follows from linearity of expectation, the third from independence of $x_i$ and $x_j$, and the last follows from Lemma 6.3.2.

We combine the two properties as follows. From the non-negative expectation property and equation (6.11), we have that
\[ \Pr[h(x_i, x_j) \le -\eta] \le \Pr\big[h(x_i, x_j) - \mathbb{E}[h(x_i, x_j)] \le -\eta\big] \le 2\exp\left(-\frac{c\,d\,\eta^2}{\rho^2}\right). \]
The probability that some value of $h(x_i, x_j)$ lies below $-\eta$ is then bounded by $2n^2\exp\left(-\frac{c\,d\,\eta^2}{\rho^2}\right)$. For any choice of $c_1$, we can solve the inequality $2n^2\exp\left(-\frac{c\,d\,\eta^2}{\rho^2}\right) \le d^{-c_1}$ to get $d > \Omega(\rho^2\log n)$ and $\eta > \Omega(1/\sqrt{d})$. In particular, the bound holds for any constant $\eta > 0$. □

6.3.2 Linear neural networks

We now study the behavior of gradient confusion for neural networks. We begin with the simplified case of linear neural networks (i.e., with no non-linearities) with one output feature, one hidden layer, and a quadratic loss. We will then examine the case of more general networks.

Let $W_0 \in [-1, 1]^{d \times \ell_1}$ and $W_1 \in [-1, 1]^{\ell_1 \times 1}$ denote the weight matrices connecting the input layer to the hidden layer, and the hidden layer to the output, respectively. Thus, the output of the neural net is given by $W_1 W_0 x$. Then we have
\[ g(x) = \sum_{p=1}^{d}\sum_{p'=1}^{\ell_1} (W_1)_{p'} (W_0)_{p,p'} (x)_p, \quad \text{and} \quad \alpha_i := g(x_i) - C(x_i). \]
Later, we consider the case of more than one hidden layer, and use $\ell_i$ to denote the width of layer $i$ and $\beta$ to denote the total number of hidden layers. Further we define $\ell := \max_{i \in [\beta]} \ell_i$. Throughout this sub-section we make the following assumption.

Assumption 6.3.1 (Small Weights). We will assume $W$ is such that the outputs at each neuron of the hidden layer, $C(x)$ and $g(x)$ all lie in $[-1, 1]$ for every value of $x$ in the unit ball.² Additionally, we will assume that the entries in the weight matrices $W_i$ (e.g., $W_0$, $W_1$, ...
, $W_\beta$) lie in the range $[-1/\ell, 1/\ell]$.

Is the small-weights assumption reasonable? If we relax the assumption and let every entry in each of the weight matrices lie in the range $[-1, 1]$, then a product of matrices $W_\beta \cdot W_{\beta-1} \cdots (W_{\beta'})_{k_{\beta'}}$ may lead to an exponential blow-up in values. If we have vectors $v_1, v_2 \in [-1, 1]^{\ell}$ then $\langle v_1, v_2\rangle$ can be as large as $\ell$. Hence in a sequence of $\beta$ products of matrices, the final value can become as large as $\ell^\beta$. Note this would imply that $\|\nabla h\|_\infty \le \ell^\beta$ and hence we need $d \ge \Omega(\ell^{2\beta}\log n)$ for the required concentration. The small-weights assumption is not just a theoretical concern, but is also usually mandated in practice. Without small weights, the gradients $\nabla f_i$ blow up in magnitude and a phenomenon known as gradient explosion (which is particularly problematic for recurrent neural networks) is observed. This may indicate why weights are typically initialized as $\mathcal{N}(0, 1/\ell_i)$ in practice [Ben12].

Similar to the case for linear regression, we prove the following theorem. A full discussion of this result, and a rigorous proof, are in the Appendix.

Theorem 6.3.2 (Bounded gradient confusion for single layer linear neural network). Consider a single hidden layer neural network with fixed weight vectors $W_1 \in \mathbb{R}^{1 \times \ell_1}$, $W_0 \in \mathbb{R}^{\ell_1 \times d}$ satisfying Assumption 6.3.1, and let $\eta > 0$ be a given constant. We have that with probability at least $1 - \Omega\left(n^2\exp(-c\,\ell_1^2\, d\,\eta^2)\right)$, equation (6.7) holds.

Proof. We will prove the two properties.

² Note one could use an activation function (sigmoid, tanh, or softmax) to enforce this assumption. For now we will assume that we can bound the outputs appropriately.

Bounded Gradient. Fix an arbitrary value $x_i = \tilde x_i$ and $x_j = \tilde x_j$. Consider an arbitrary coordinate corresponding to the variable $(x_i)_p$ (by symmetry a corresponding argument holds for $(x_j)_p$) in the vector $\nabla h(\tilde x_i, \tilde x_j)$.
The term evaluates to the following.
\[ \alpha_i\alpha_j\left(\sum_{p''=1}^{\ell_1}\sum_{p'=1}^{d} (W_0)_{p,p''}(W_0)_{p',p''}(x_j)_{p'} + \sum_{p'=1}^{\ell_1} (W_1)_{p'}^2 (x_j)_p\right) + \alpha_j\kappa_{i,j}\left(\sum_{p'=1}^{\ell_1} (W_1)_{p'}(W_0)_{p,p'} - (\tilde w)_p\right), \]
where $\kappa_{i,j}$ is defined through $h(x_i, x_j) = \alpha_i\alpha_j\kappa_{i,j}$. By the assumption that both $C(x)$ and $g(x)$ lie in $[-1, 1]$, we have that $-2 \le \alpha_i \le 2$ for all $i \in [n]$. From Assumption 6.3.1 and the sampling procedure of the $x_i$'s we have the following.
\[ \sum_{p=1}^{d} (W_0)_{p,p''}(x_i)_p \le \frac{1}{\ell_1}\sum_{p=1}^{d} (x_i)_p \le \frac{1}{\ell_1}. \tag{6.12} \]
Hence, combining this with Assumption 6.3.1, the first term in the sum above can be upper-bounded by $8/\ell_1$. Note that for every pair $i, j \in [n]$ we have that $-1 \le \sum_{p=1}^{d} (x_i)_p(x_j)_p \le 1$, since this quantity represents the cosine of the angle between two vectors. Again using the observation in Equation (6.12), we have that $\kappa_{i,j} \le 2/\ell_1$. Hence, the second term in the sum above can be upper-bounded by $8/\ell_1$ and hence $\|\nabla h(x_i, x_j)\|_\infty \le 16/\ell_1$. In other words, we have that $\rho$ is $O(1/\ell_1)$.

Non-negative Expectation. Note that $\alpha_i, \alpha_j \ge -2$. Therefore we have $\mathbb{E}[h(x_i, x_j)] \ge -4\,\mathbb{E}[\kappa_{i,j}]$. By using linearity of expectation and the fact that normalized Gaussian random variables have mean 0 at every coordinate, we have that $\mathbb{E}[\kappa_{i,j}] = 0$. Therefore we have $\mathbb{E}[h(x_i, x_j)] \ge 0$.

Armed with the bounded gradient and non-negative expectation properties, the rest of the proof follows as in the proof of Theorem 6.3.1. □

Note that from Theorem 6.3.2 we have that the concentration gets sharper as the width of the hidden layer increases. In particular, this suggests that SGD converges faster (on the simulated dataset) with increasing width.

6.3.3 Extension to arbitrary depth linear networks

We now extend the above analysis to an arbitrary depth linear neural network. Interestingly, we see in this case that the gradient confusion depends critically on the network architecture, in particular the width and depth.

In this case we have the model $g(x) := W_\beta W_{\beta-1} \cdots W_1 W_0 x$ where $\{W_i\}_{i=0}^{\beta}$ are weight matrices of appropriate dimensions.
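We let $\beta$ denote the number of hidden layers in our network, and we let $\ell$ denote the width of our network, i.e., the maximal number of features in any layer.

Theorem 6.3.3 (Extension to arbitrary depth linear neural networks). Consider an arbitrary depth linear neural network, and assume the weights satisfy Assumption 6.3.1. Let $\eta > 0$ be a given constant. With probability at least $1 - \Omega\left(n^2\exp\left(-\frac{c\,\ell^2 d\,\eta^2}{\beta^2}\right)\right)$ we have that equation (6.7) holds.

Proof. This statement can be equivalently written as follows. Note that for constants $c', c'' > 0$, we want $c' n^2\exp\left(-\frac{c\,\ell^2 d\,\eta^2}{\beta^2}\right) \le c''\frac{1}{\mathrm{poly}(d)}$, since we assume that $d$ is the asymptotic parameter and hence this makes the concentration explicit. Rearranging and solving for $d$, we get the condition that $d \ge \Omega((\beta/\ell)^2\log n)$.

We now show the two properties.

Bounded gradient. We will first compute $\nabla f_i(W_\beta, \dots, W_0)$. Note that the gradient can be visualized as follows. The first $\ell_\beta$ entries correspond to the entries in $\frac{\partial f_i}{\partial W_\beta}$. The next $\ell_\beta \cdot \ell_{\beta-1}$ entries correspond to $\frac{\partial f_i}{\partial W_{\beta-1}}$, and so on. These entries can be computed as follows.
\[ \left(\frac{\partial f_i}{\partial W_\beta}\right)_{p_\beta} = \alpha_i \sum_{p_{\beta-1}=1}^{\ell_{\beta-1}} \cdots \sum_{p_1=1}^{\ell_1}\sum_{p=1}^{d} (W_{\beta-1})_{p_\beta,p_{\beta-1}} \cdots (W_0)_{p_1,p}\,(x_i)_p. \]
\[ \left(\frac{\partial f_i}{\partial W_{\beta'}}\right)_{p_{\beta'+1},p_{\beta'}} = \alpha_i \sum_{p_\beta=1}^{\ell_\beta} \cdots \sum_{p_{\beta'+2}=1}^{\ell_{\beta'+2}}\ \sum_{p_{\beta'-1}=1}^{\ell_{\beta'-1}} \cdots \sum_{p_1=1}^{\ell_1}\sum_{p=1}^{d} (W_\beta)_{p_\beta}(W_{\beta-1})_{p_\beta,p_{\beta-1}} \cdots (W_{\beta'+1})_{p_{\beta'+2},p_{\beta'+1}}\,(W_{\beta'-1})_{p_{\beta'},p_{\beta'-1}} \cdots (W_0)_{p_1,p}\,(x_i)_p, \quad \forall \beta' \in \{1, 2, \dots, \beta-1\}. \]
\[ \left(\frac{\partial f_i}{\partial W_0}\right)_{p_1,p} = \alpha_i \left(\sum_{p_\beta=1}^{\ell_\beta} \cdots \sum_{p_2=1}^{\ell_2} (W_\beta)_{p_\beta} \cdots (W_1)_{p_2,p_1}\right)(x_i)_p. \]
Hence $h(x_i, x_j)$ can be written as follows.
\[ h(x_i, x_j) = \sum_{p_\beta=1}^{\ell_\beta} \left(\frac{\partial f_j}{\partial W_\beta}\right)_{p_\beta}\left(\frac{\partial f_i}{\partial W_\beta}\right)_{p_\beta} + \sum_{\beta'=1}^{\beta-1}\ \sum_{p_{\beta'+1}=1}^{\ell_{\beta'+1}}\ \sum_{p_{\beta'}=1}^{\ell_{\beta'}} \left(\frac{\partial f_j}{\partial W_{\beta'}}\right)_{p_{\beta'+1},p_{\beta'}}\left(\frac{\partial f_i}{\partial W_{\beta'}}\right)_{p_{\beta'+1},p_{\beta'}} + \sum_{p_1=1}^{\ell_1}\sum_{p=1}^{d} \left(\frac{\partial f_j}{\partial W_0}\right)_{p_1,p}\left(\frac{\partial f_i}{\partial W_0}\right)_{p_1,p}. \]
To bound $\|\nabla h\|_\infty$, consider any fixed $x_i = \tilde x_i$ and $x_j = \tilde x_j$. Consider the entry corresponding to $(x_i)_p$. The following claim can be obtained as a consequence of the small-weights assumption.

Lemma 6.3.3. Let $W_0, W_1, W_2, \ldots$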
, $W_\beta$ be weight matrices satisfying Assumption 6.3.1. Then for any given index $p_{\beta'}$, we have that any product consisting of a sequence of matrices $W_\beta \cdot W_{\beta-1} \cdots (W_{\beta'})_{p_{\beta'}}$ lies in the interval $[-1/\ell, 1/\ell]$.

Proof. For notational convenience, denote the column vector $(W_{\beta'})_{p_{\beta'}}$ as $W_0$. Define $\nu := \beta - \beta'$. We also rewrite the chain $W_\beta \cdot W_{\beta-1} \cdots (W_{\beta'})_{p_{\beta'}}$ as $W_\nu \cdot W_{\nu-1} \cdots W_0$. We will now prove that the value of this matrix-vector product lies in the interval $[-1/\ell, 1/\ell]$.

By induction on $\nu$, we will prove that the product $v_\nu := W_{\nu-1} \cdots W_0$ will always yield a column vector where every entry lies in the interval $[-1/\ell, 1/\ell]$. Given the proof of this inductive hypothesis, note that we have the inner product $\langle W_\nu, v_\nu\rangle$ where every entry in both these vectors lies in the interval $[-1/\ell, 1/\ell]$. Since the dot product is a sum over at most $\ell$ terms with each term in the sum bounded in the interval $[-1/\ell^2, 1/\ell^2]$, we have that the dot product lies in the interval $[-1/\ell, 1/\ell]$.

We will now prove the inductive statement. The base case is when $\nu = 1$. In this case we have a single vector $W_0$. By assumption we have that each entry in this vector lies in the interval $[-1/\ell, 1/\ell]$ and therefore the hypothesis is true. Consider the case when $\nu > 1$. Consider the chain $v_\nu := W_{\nu-1} \cdot W_{\nu-2} \cdots W_0$. From the inductive hypothesis, we have that $v_{\nu-1} := W_{\nu-2} \cdots W_0$ gives a column vector where every entry is in the interval $[-1/\ell, 1/\ell]$. Consider the $i$-th entry in the column vector $v_\nu$. This is obtained by the inner product of the $i$-th row of matrix $W_{\nu-1}$ and the column vector $v_{\nu-1}$. Since this is a sum of at most $\ell$ terms with each term in the interval $[-1/\ell^2, 1/\ell^2]$, we have that this $i$-th entry lies in the interval $[-1/\ell, 1/\ell]$. □

In the analysis for general neural networks, we use the following corollary.

Corollary 6.3.2. Let $W_0, W_1, W_2, \ldots, W_\beta$ be weight matrices satisfying Assumption 6.3.1.
Then for any given $1 \le \beta' \le \beta$, we have that any product consisting of a sequence of matrices $W_{\beta'} \cdot W_{\beta'-1} \cdots W_0 \cdot x$ lies in the interval $[-1/\ell, 1/\ell]$, where $x$ is a normalized Gaussian.

Proof. The proof of this follows directly by combining Equation (6.12) and Lemma 6.3.3. □

By the small-weights assumption and Lemma 6.3.3, we have that each entry in $\nabla f_i(W_\beta, \dots, W_0)$ is at most $3/\ell$ (i.e., $\alpha_i \le 2$ and the terms involving the sum over weight matrices are at most $1/\ell$). By using the small-weights assumption repeatedly and noting that $\alpha_i, \alpha_j \le 2$, we get that each entry in the gradient is at most $18/\ell^2 + 18(\beta - 1)/\ell + 18/\ell^2 \le 54\beta/\ell$. Hence we have $\rho$ as $O(\beta/\ell)$.

Non-negative expectation. As before, using the fact that $\alpha_i \ge -2$ and $\alpha_j \ge -2$ and that a normalized Gaussian random variable has mean 0, we have that $\mathbb{E}[h(x_i, x_j)] \ge 0$.

Note that the bound in this case depends both on the maximum "width" and the "depth". In particular, as depth increases or width decreases, we need the dimension $d$ to be larger to get the required concentration. The other way to interpret this is that when either the width increases or the depth decreases, the probability that Equation (6.7) holds increases. □

6.3.4 More general neural networks

This result can be extended to models with certain non-linearities. We consider the model $g(x) := \sigma(W_\beta\,\sigma(W_{\beta-1} \cdots \sigma(W_1\,\sigma(W_0 x)) \cdots))$. Here, the function $\sigma(\cdot)$ is applied point-wise to its arguments. We will assume that the non-linear activation is given by a function $\sigma(x)$ with the following properties.

• (P1) Boundedness: $-1 \le \sigma(x) \le 1$ for every value of $x \in \mathbb{R}$.
• (P2) Twice Differentiability: $\sigma$ is twice differentiable at every point in $\mathbb{R}$.
• (P3) Bounded Differentials: $-1 \le \sigma'(x) \le 1$ and $-1 \le \sigma''(x) \le 1$ for all $x \in \mathbb{R}$.

Most classical activation functions such as sigmoid, tanh, and softmax satisfy these requirements (although ReLU does not).
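As a quick numerical illustration of properties P1 and P3, the sketch below evaluates tanh and the logistic sigmoid, together with their closed-form first and second derivatives, on a grid and checks that all values stay in $[-1, 1]$. A grid check is not a proof, and the grid range and the helper names (`tanh_derivs`, `sigmoid_derivs`, `bounded_on_grid`) are assumptions made for this example. ReLU is included only to show that it already violates P1.

```python
import numpy as np


def tanh_derivs(x):
    """Return tanh(x) with its first and second derivatives."""
    s = np.tanh(x)
    return s, 1.0 - s ** 2, -2.0 * s * (1.0 - s ** 2)


def sigmoid_derivs(x):
    """Return the logistic sigmoid with its first and second derivatives."""
    s = 1.0 / (1.0 + np.exp(-x))
    return s, s * (1.0 - s), s * (1.0 - s) * (1.0 - 2.0 * s)


def bounded_on_grid(derivs, grid):
    """True if the function and both derivatives lie in [-1, 1] on the grid."""
    return all(bool(np.all(np.abs(values) <= 1.0)) for values in derivs(grid))


grid = np.linspace(-20.0, 20.0, 4001)  # hypothetical evaluation range
tanh_ok = bounded_on_grid(tanh_derivs, grid)
sigmoid_ok = bounded_on_grid(sigmoid_derivs, grid)
relu_bounded = bool(np.all(np.abs(np.maximum(grid, 0.0)) <= 1.0))

print("tanh within bounds:", tanh_ok)
print("sigmoid within bounds:", sigmoid_ok)
print("ReLU output bounded by 1:", relu_bounded)
```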
Provided the small-weights assumption holds,³ we have the following theorem which is analogous to the case of linear neural networks, with a mildly stronger requirement on the constant $\eta$ and weaker dependence on depth $\beta$.

³ The activation function guarantees that the output at each layer is at most 1. However we still need the assumption that each entry in the weight matrix is not too large. Otherwise, the gradient can blow up on the backward pass, even if the forward pass is stable.

Theorem 6.3.4 (Concentration bounds for arbitrary depth neural networks). Let $\eta > 4$ be a given constant. Consider a neural network with weights satisfying Assumption 6.3.1, non-linearity satisfying properties P1-P3, and quadratic loss. With probability at least $1 - \Omega\left(n^2\exp\left(-\frac{c\,\ell^2 d\,(\eta - 4)^2}{\beta^4}\right)\right)$ we have that equation (6.7) holds.

The proof of this theorem follows in a similar manner as for linear neural networks. We critically use the property that terms involving $\sigma(\cdot)$, $\sigma'(\cdot)$ and $\sigma''(\cdot)$ lie in the range $[-1, 1]$. In particular, we have the following expression for $\nabla f_i(W_\beta, \dots, W_0)$. Here $A_1, A_2, \dots, A_\beta$ are some fixed expressions involving the weight matrices and the random vector $x_i$.
\[ \left(\frac{\partial f_i}{\partial W_\beta}\right)_{p_\beta} = \alpha_i\,\sigma'(\cdot)\,\big(\sigma(W_{\beta-1} \cdot \sigma(\cdots \sigma(W_0 \cdot x_i)))\big)_{p_\beta}. \]
\[ \left(\frac{\partial f_i}{\partial W_{\beta'}}\right)_{p_{\beta'+1},p_{\beta'}} = \alpha_i\,\sigma'(A_1)\sigma'(A_2) \cdots \sigma'(A_{\beta-\beta'}) \left(\sum_{p_\beta=1}^{\ell_\beta} \cdots \sum_{p_{\beta'+2}=1}^{\ell_{\beta'+2}} (W_\beta)_{p_\beta}(W_{\beta-1})_{p_\beta,p_{\beta-1}} \cdots (W_{\beta'+1})_{p_{\beta'+2},p_{\beta'+1}}\right) \big(\sigma(W_{\beta'-1} \cdot \sigma(W_{\beta'-2} \cdots \sigma(W_0 \cdot x_i)) \cdots)\big)_{p_{\beta'}}, \quad \forall \beta' \in \{1, 2, \dots, \beta-1\}. \]
\[ \left(\frac{\partial f_i}{\partial W_0}\right)_{p_1,p} = \alpha_i\,\sigma'(A_1)\sigma'(A_2) \cdots \sigma'(A_\beta) \left(\sum_{p_\beta=1}^{\ell_\beta} \cdots \sum_{p_2=1}^{\ell_2} (W_\beta)_{p_\beta} \cdots (W_1)_{p_2,p_1}\right)(x_i)_p. \]
Consider the expression for $h(x_i, x_j)$ as follows.
\[ h(x_i, x_j) = \sum_{p_\beta=1}^{\ell_\beta} \left(\frac{\partial f_j}{\partial W_\beta}\right)_{p_\beta}\left(\frac{\partial f_i}{\partial W_\beta}\right)_{p_\beta} + \sum_{\beta'=1}^{\beta-1}\ \sum_{p_{\beta'+1}=1}^{\ell_{\beta'+1}}\ \sum_{p_{\beta'}=1}^{\ell_{\beta'}} \left(\frac{\partial f_j}{\partial W_{\beta'}}\right)_{p_{\beta'+1},p_{\beta'}}\left(\frac{\partial f_i}{\partial W_{\beta'}}\right)_{p_{\beta'+1},p_{\beta'}} + \sum_{p_1=1}^{\ell_1}\sum_{p=1}^{d} \left(\frac{\partial f_j}{\partial W_0}\right)_{p_1,p}\left(\frac{\partial f_i}{\partial W_0}\right)_{p_1,p}. \]
To bound $\|\nabla h\|_\infty$, consider any fixed $x_i = \tilde x_i$ and $x_j = \tilde x_j$. Consider the entry corresponding to $(x_i)_p$. Since $-1 \le \sigma'(\cdot) \le 1$, we can use the same results from linear neural networks to upper bound the value of partial differentials involving the function $f_j$. To compute the partial differential with respect to $(x_i)_p$ we make the following observation. The differential with respect to $(x_i)_p$ in $\left(\frac{\partial f_i}{\partial W_{\beta'}}\right)_{p_{\beta'+1},p_{\beta'}}$ involves two terms, both of which can be upper bounded by $2/\ell$ using Corollary 6.3.2 and Lemma 6.3.3. The differential in $\left(\frac{\partial f_i}{\partial W_{\beta'}}\right)_{p_{\beta'+1},p_{\beta'}}$ involves $\beta'$ terms, each of them upper-bounded by $2/\ell$ by using the fact that $-1 \le \sigma'' \le 1$, Corollary 6.3.2 and Lemma 6.3.3. Finally the differential in $\left(\frac{\partial f_i}{\partial W_0}\right)_{p_1,p}$ can be upper-bounded by $2\beta/\ell$. Therefore, we get that $\rho \le O(\beta^2/\ell)$.

Computing a lower bound on the expectation is tricky because of the function $\sigma$. Although we cannot show the non-negative expectation property in this case, we can show a slightly weaker lower bound that still suffices for our purposes. We show that $\mathbb{E}[h(x_i, x_j)] \ge -4$. The proof of this follows from the assumption that $\alpha_i \ge -2$ and $\alpha_j \ge -2$ and the fact that the remaining terms in $h(x_i, x_j)$ involve products of terms $\sigma$, $\sigma'$ and the product of matrices. From the boundedness assumptions on $\sigma$ and its first derivative and from Lemma 6.3.3 we have that these terms are all at least $-1$. Hence we have that $\mathbb{E}[h(x_i, x_j)] \ge -4$.

To complete the proof, we now observe the following.
\[ \Pr[h(x_i, x_j) \le -\eta] \le \Pr\big[h(x_i, x_j) \le \mathbb{E}[h(x_i, x_j)] - \eta + 4\big] \le 2\exp\left(-\frac{c\,d\,(\eta - 4)^2}{\rho^2}\right). \]
Making the substitution $\eta' := \eta - 4$ we obtain the theorem. Note that to have $\eta' > 0$ we want $\eta > 4$.

6.3.5 Beyond linearly generated data

In this sub-section we will briefly show when the theorems (and the proofs) in the previous sub-sections extend to the case beyond linearly generated data. Note that the only fact we used about the function $C(x)$ was that it always lies in the interval $[-1, 1]$. This implies that $-2 \le \alpha_i \le 2$ and $-2 \le \alpha_j \le 2$ hold. When considering $\|\nabla h\|_\infty$ we used the first derivative of the concept function $C$.
In linearly generated data, this first derivative exists and lies in the interval $[-1, 1]$. Hence the bounds on $\|\nabla h\|_\infty$ and $\mathbb{E}[h(x_i, x_j)]$ follow directly from these observations.

Therefore for any concept function $C$ with the following properties, the above concentration theorems extend as is.
1. For every $x \in [-1, 1]^d$ we have $-1 \le C(x) \le 1$.
2. The function $C$ is differentiable everywhere in the interval $[-1, 1]^d$.
3. For every $x \in [-1, 1]^d$ and every $i \in [d]$ we have that $-1 \le \frac{\partial C(x)}{\partial (x)_i} \le 1$.

6.4 Experiments

We present experimental results to see the effect of depth and width on convergence rates and gradient confusion. It is worth noting that Theorem 6.3.4 suggests that SGD becomes more effective when width increases or depth decreases. Also, Theorem 6.2.1 indicates that gradient confusion affects the height of the final "noise floor" of constant step size SGD, and so we expect the effect of gradient confusion to be most prominent near the end of training, particularly when the convergence curve has flattened out near this floor.

We perform experiments on wide residual networks (WRN) [ZK16] for an image classification task on CIFAR-10. WRN is an extension of ResNet [HZRS16a], which is one of the state-of-the-art architectures for image classification. WRN is a stack of residual blocks, and we denote the architecture as WRN-$\beta$-$\ell$ following [ZK16], where $\beta$ represents the depth and $\ell$ represents the width factor of the network.⁴ The WRN architecture for CIFAR datasets is a stack of three groups of residual blocks. There is a downsampling layer between two blocks, and the number of channels (width of a convolutional layer) is doubled after downsampling. In the three groups, the width of convolutional layers is $\{16\ell, 32\ell, 64\ell\}$, respectively. Each group contains $\beta_r$ residual blocks, and each residual block contains two $3 \times 3$ convolutional layers equipped with ReLU activation, batch normalization and dropout.
There is a $3 \times 3$ convolutional layer with 16 channels before the three groups of residual blocks, and there is a global average pooling layer, a fully-connected layer and a softmax layer after the three groups. The depth of WRN is $\beta = 6\beta_r + 4$.

Footnote 4: The width factor is the number of filters relative to the original ResNet, e.g., a factor of 1 corresponds to the original ResNet, and 2 means the network is twice as wide.

We turn off dropout for all our experiments. Our first round of experimental networks have no skip connections or batch normalization [IS15b], so as to stay as close to the assumptions of our theorems as possible. Later on, we study the effects that skip connections and batch normalization have on convergence rate and gradient confusion. We use SGD as the optimizer with no momentum. We train all experiments for 200 epochs and use a standard learning rate decay schedule, where the initial learning rate is reduced by a factor of 10 at epochs 80 and 160. We use a mini-batch size of 128 for all our experiments.

To measure gradient confusion, at the end of every training epoch, we sample 100 mini-batches, each of size 128. We calculate gradients on each of these mini-batches, and then calculate pairwise cosine similarities. To measure the worst-case gradient confusion, we calculate the lowest gradient cosine similarity among all pairs.

Effect of width. To test our theoretical results, and in particular Theorem 6.3.4, we consider a WRN with no batch normalization and no skip connections. This makes the network behave like a typical deep convolutional neural network. We now test the effect of increasing width in this network, while keeping the depth fixed. In particular, we consider the following networks: WRN-28-1, WRN-28-2, WRN-28-10. Figure 6.2 shows how the training loss and the minimum gradient cosine similarity are affected by a change in width.
We present results both with a fixed initial learning rate across all networks, and with the initial learning rate tuned to optimize the performance of each network. Quite clearly, greater width yields faster convergence as well as lower gradient confusion.

Figure 6.2: How width affects convergence curves and gradient inner products. Panels: (a) Loss, fixed LR; (b) Confusion, fixed LR; (c) Loss, best LR; (d) Confusion, best LR.

Effect of depth. Using the same experimental setup as above, we now keep the width fixed, and change the depth over the networks WRN-28-2 and WRN-40-2. Figure 6.3 shows the results. We again see that our theoretical results seem to be backed by the experiments: we find faster convergence and lower gradient confusion with smaller depth.

Figure 6.3: How depth affects convergence curves and gradient inner products. Panels: (a) Loss, fixed LR; (b) Confusion, fixed LR; (c) Loss, best LR; (d) Confusion, best LR.

Effect of batch normalization and skip connections. Finally, we test the effect that techniques such as batch normalization and skip connections have on convergence speed. Figure 6.4 shows results for WRN-40-2, where we start with a network with no batch normalization and no skip connections, and then progressively add them to the network. For these runs, we present results only with the best tuned initial learning rate, since the optimal learning rates are usually very different on batch normalized vs. non-batch normalized networks. We see that adding batch normalization makes a big difference in the convergence speed as well as in lowering gradient confusion. Adding skip connections on top of this further accelerates training, although it seems to have minimal effect on gradient confusion (when used on top of batch normalization).
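The gradient-confusion measurement used throughout these experiments (pairwise cosine similarities between mini-batch gradients, with the lowest value taken as the worst case) can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the code used to produce the figures: the function names `cosine_similarity` and `min_pairwise_cosine` are hypothetical, each gradient is assumed to be already flattened into a list of floats, and all gradients are assumed to be nonzero.

```python
import math
from itertools import combinations


def cosine_similarity(g1, g2):
    """Cosine similarity between two flattened, nonzero gradient vectors."""
    dot = sum(a * b for a, b in zip(g1, g2))
    norm1 = math.sqrt(sum(a * a for a in g1))
    norm2 = math.sqrt(sum(b * b for b in g2))
    return dot / (norm1 * norm2)


def min_pairwise_cosine(gradients):
    """Lowest cosine similarity over all unordered pairs of gradients.

    A strongly negative value means that at least two mini-batches
    pull the parameters in nearly opposite directions.
    """
    return min(
        cosine_similarity(g1, g2) for g1, g2 in combinations(gradients, 2)
    )


# Toy example with three 2-dimensional "gradients" (illustrative values only).
toy_gradients = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
worst_case = min_pairwise_cosine(toy_gradients)
```

In the setup described in this section, `gradients` would hold 100 vectors, one per sampled mini-batch of size 128, giving 4950 pairs per epoch.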
Figure 6.4: Effect of batch normalization and skip connections on a Wide ResNet. Panels: (a) Loss, best LR; (b) Confusion, best LR.

6.5 Conclusion

We study the effect of high dimensionality and over-parameterization on the convergence of SGD, and show that low gradient confusion in high dimensional problems can lead to accelerated convergence. This addresses the issue of why SGD is an effective optimizer for over-parameterized problems. An interesting question for future work is whether there is a connection between gradient confusion and generalization for SGD.

Part II

STUDYING THE EVOLUTION OF CULTURAL NORMS

Chapter 7: Using game theory to study the evolution of cultural norms

Understanding human behavior and modeling how cultural norms evolve in different human societies is vital for designing policies and avoiding conflicts around the world. This part of the thesis describes ways to use computational game-theoretic techniques, and in particular evolutionary game-theoretic (EGT) models, to gain insight into why different human societies have different norms and behaviors.

Conventional (non-evolutionary) game theory is good for analyzing situations where we know an individual's preferences, and want to predict what they will do based on those preferences. However, in our work, we want to know how these preferences arose. We are interested in the following kinds of questions:

• What kinds of structural and external factors might have led to the emergence of behaviors we see among individuals in a society?
• What evolutionary pressures might have led to variations in those behaviors?
• Can they be validated by observed phenomena?

Conventional game theory can't properly answer these questions. To lay out an individual's preferences in a conventional game-theoretic model would, in essence, be building into the model the very traits whose emergence we want to study.
We instead need to lay out the structural/environmental factors that might be responsible for the evolution of those traits, to see whether those traits would evolve, and evolutionary game theory provides an efficient framework to do just that.

7.1 Evolutionary game theory in biology

Evolutionary game theory (EGT) was first developed as an application of game theory to evolving populations composed of multiple animal species, as a way to model how each species' evolutionary fitness causes its proportion of the population to grow or shrink [SP73]. The idea is to represent an interaction among animals as a normal-form game. The game's payoffs are intended to represent the effect that the interaction will have on the individuals' evolutionary fitness. For example, if two animals fight over a piece of food, one might expect that each individual's fitness would be affected by how the fight affects the animal's health, and whether the animal gets the piece of food.

Rather than developing a detailed model of each specific individual, EGT models typically are at a much more abstract level that does not distinguish among the individuals within each species, but instead looks at the average behavior of all individuals of that species. More specifically:

• If the population is composed of $n$ different species, then for each species $i$ ($i = 1, \ldots, n$), all individuals of species $i$ have the same strategy $s_i$, namely the strategy of being a member of species $i$. This strategy is intended to encompass (in an abstract way, of course) everything that might affect this species' average evolutionary fitness: size, aggressiveness, sensory abilities, intelligence, etc.
• Each species $i$ constitutes some proportion $x_i$ of the entire population, with $\sum_{i=1}^{n} x_i = 1$. If we choose an individual at random, then for $i = 1, \ldots, n$, the probability that this individual uses strategy $s_i$ is $x_i$.
Now, consider an interaction (e.g., a conflict over a source of food) between two individuals: one from species $i$ and one from species $j$. For simplicity of presentation, let's restrict this to just two individuals, but it can easily be generalized to interactions among $k$ individuals for arbitrary $k$. To formulate this interaction as a normal-form game, let's say that the individuals' expected payoffs are $u(s_i, s_j)$ and $u(s_j, s_i)$, where "payoff" means the effect that the interaction will have on the individual's evolutionary fitness. The normal-form game is symmetric, i.e., if the two individuals are named $a$ and $b$, then it doesn't matter whether the one with strategy $s_i$ is individual $a$ or individual $b$. In either case, this individual's expected payoff is $u(s_i, s_j)$, and the other individual's expected payoff is $u(s_j, s_i)$.

Suppose an individual with strategy $s_i$ meets another individual chosen at random. Then for $j = 1, \ldots, n$, the other individual's strategy is $s_j$ with probability $x_j$. Hence the expected payoff for the individual with strategy $s_i$ is $\sum_{j=1}^{n} x_j\, u(s_i, s_j)$.

Earlier, we said the expected payoff is intended to represent the interaction's effects on evolutionary fitness. The idea is that if species $i$'s expected payoff is higher than that of the entire population, then species $i$ will reproduce at a higher rate, hence its proportion $x_i$ will increase. If species $j$'s expected payoff is lower than that of the entire population, then species $j$ will reproduce at a lower rate, hence $x_j$ will decrease.

The best-known way to model this is the replicator dynamic [TJ78]. The original version is a differential equation that assumes an infinite population and continuous time. Let $\pi_i(x) \ge 0$ be the average payoff obtained by individuals of species $i$ when the proportions of the species are $x = (x_1, \ldots, x_n)$. Then the average payoff for the entire population is $\theta(x) = \sum_{i=1}^{n} x_i \pi_i(x)$.
According to the replicator dynamic, the rate of change in each $x_i$ is given by the following differential equation:

\[
\frac{dx_i}{dt} = x_i \big(\pi_i(x) - \theta(x)\big). \tag{7.1}
\]

The replicator dynamic is consistent with the Lotka-Volterra equations for the dynamics of biological systems. Indeed, the replicator dynamic is mathematically equivalent to a generalization of those equations [PN02].

The replicator dynamic can be translated into a difference equation in which the population is finite, and time proceeds as a sequence of discrete iterations [HS84]. This formulation can be used to run discrete-event computer simulations and look at their outcomes, which is useful if the differential equations are too complicated to solve mathematically.

The above approach assumes that the species are well-mixed, i.e., that they are uniformly distributed geographically. Such an assumption is often inaccurate; there are many settings in which an individual's location can make a huge difference in what interactions they have, and how those interactions affect their evolutionary fitness. To model such situations, it often is useful to locate the individuals in a network in which they are restricted to interact with their neighbors. This is further discussed later in this chapter.

7.2 Modeling cultural evolution

EGT can be used to model aspects of the evolution of human cultures. Here, strategies correspond not to collections of individuals, but instead to possible behaviors. A successful strategy (i.e., a behavior that produces good results) is likely to be adopted by others, and hence to become more prevalent in the population. Conversely, the prevalence of an unsuccessful strategy is likely to decrease. The propagation of these strategies corresponds not to biological reproduction, but instead to cultural transmission, in which humans imitate others and learn from others.
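As a concrete illustration of the replicator dynamic in Eq. (7.1) from Section 7.1, the sketch below integrates the equation numerically for a well-mixed population. It is a minimal example under stated assumptions: it uses a plain forward-Euler discretization with a hypothetical step size `dt`, which is not necessarily the finite-population difference-equation formulation of [HS84], and the function names and the 2×2 payoff matrix are illustrative rather than taken from the thesis.

```python
def replicator_step(x, payoff, dt=0.01):
    """One forward-Euler step of dx_i/dt = x_i * (pi_i(x) - theta(x)).

    x      -- list of proportions x_1..x_n (assumed to sum to 1)
    payoff -- payoff[i][j] = u(s_i, s_j), payoff to strategy i against j
    """
    n = len(x)
    # pi_i(x) = sum_j x_j * u(s_i, s_j): expected payoff of strategy i.
    pi = [sum(x[j] * payoff[i][j] for j in range(n)) for i in range(n)]
    # theta(x) = sum_i x_i * pi_i(x): average payoff in the population.
    theta = sum(x[i] * pi[i] for i in range(n))
    return [x[i] + dt * x[i] * (pi[i] - theta) for i in range(n)]


def simulate(x0, payoff, steps=10000, dt=0.01):
    """Iterate the Euler step and return the final proportions."""
    x = list(x0)
    for _ in range(steps):
        x = replicator_step(x, payoff, dt)
    return x


# Illustrative coordination-style payoffs (hypothetical values).
payoff = [[2.0, 0.0],
          [0.0, 1.0]]
final_high = simulate([0.6, 0.4], payoff)  # starts above the interior fixed point
final_low = simulate([0.2, 0.8], payoff)   # starts below it
```

For this payoff matrix the interior fixed point is at $x_1 = 1/3$, so the two starting points end up at different pure states, which shows how the outcome can depend on the initial proportions.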
Rather than the replicator dynamic, here the evolutionary model is a comparison process, e.g., a modified version of the Fermi rule from statistical mechanics [Blu93]. At each iteration $t$, each individual $a$ uses some strategy in a game-theoretic interaction and receives a payoff $u_a$. Then, before the beginning of iteration $t+1$, $a$ compares $u_a$ to the payoff $u_n$ received by a randomly chosen neighbor $n$, and decides whether to keep using the same strategy that it used before, or switch to the neighbor's strategy. The probability of switching is given by a version of the well-known sigmoid function (see Figure 7.1):

\[
\Pr[a \text{ switches to } n\text{'s strategy}] = \frac{1}{1 + e^{s(u_a - u_n)}},
\]

where $u_a$ and $u_n$ are $a$'s and $n$'s payoffs in the current iteration, and $s \ge 0$ is an arbitrary constant called the selection strength.

The Fermi rule can easily be adapted to situations in which the population isn't well-mixed. For example, one can locate the individuals at the nodes of a network, restrict each individual $a$ to interact only with its neighbors, and restrict $a$ to compare its payoff with the payoffs of its neighbors.

Figure 7.1: Graph of $\frac{1}{1 + e^{s(u_a - u_n)}}$, for $s = 5$ and $-1 \le u_a - u_n \le 1$.

Usually the Fermi rule is further modified by introducing an exploration dynamic that is somewhat analogous to biological mutation. In biological evolution, mutation occurs so rarely that game-theoretic biological models often omit it. In cultural evolution, an analogous phenomenon happens more frequently: individuals try out new behaviors [THDS+09]. The exploration dynamic models this as follows: when each agent $a$ chooses what strategy to use at the next iteration, there is a small probability $\mu$ that $a$ will choose a strategy at random from the set of all possible strategies, regardless of whether that strategy was successful for the agents who used it in the current iteration, or whether any agent even used it at all.
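The update rule described above (Fermi-rule imitation combined with the exploration dynamic) can be sketched for a single agent as follows. This is a minimal illustration, assuming that exploration is checked first and imitation second; that ordering, the function names, and the way the random number generator is passed in are all assumptions made for the example, not details given in the text.

```python
import math
import random


def switch_probability(u_a, u_n, s):
    """Fermi rule: probability that agent a adopts neighbor n's strategy.

    s >= 0 is the selection strength. For very large s * (u_a - u_n)
    math.exp can overflow; this sketch does not guard against that.
    """
    return 1.0 / (1.0 + math.exp(s * (u_a - u_n)))


def next_strategy(own, neighbor, u_a, u_n, s, mu, strategies, rng=random):
    """Pick agent a's strategy for the next iteration.

    With probability mu the agent explores (uniform random strategy);
    otherwise it imitates the neighbor with the Fermi-rule probability.
    """
    if rng.random() < mu:
        return rng.choice(strategies)  # exploration, analogous to mutation
    if rng.random() < switch_probability(u_a, u_n, s):
        return neighbor                # imitation of the neighbor
    return own                         # keep the current strategy


# With equal payoffs the agent is equally likely to switch or stay.
p_equal = switch_probability(1.0, 1.0, 5.0)
```

With $s = 5$, as in Figure 7.1, a payoff disadvantage of 1 gives a switching probability of about 0.993, and a payoff advantage of 1 gives about 0.007.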
One of the limitations of EGT models is that they deliberately omit large amounts of detail. EGT models of biological evolution ignore most of the factors that might influence whether a particular individual will reproduce successfully, and instead consider all individuals of a species to be equivalent. Similarly, EGT models of cultural evolution ignore most of the complexities of human interactions. For example, rather than reasoning about the physical outcomes of an interaction among several individuals, these outcomes are represented by payoff values. Because the models are highly simplified, they don't give exact numeric predictions of what would happen in real life.

On the other hand, a good EGT model can provide explanations of the underlying dynamics of an evolving system, and establish support for causal relationships. Consequently, such models can provide a useful complement to empirical studies, in which there may be questions about whether or not a correlation among various factors indicates a causal relationship [Ald95].

7.3 Contributions

We list the main contributions in this part of the thesis below.

In Chapter 8, we study how norms change in a society. To do this, we build an evolutionary game-theoretic model based on the idea that different strengths of norms in societies translate to different game-theoretic interaction structures and incentives. This model is used to study the evolutionary relationships of the need for coordination in a society (which is related to its norm strength) with two key aspects of norm change: cultural inertia (whether or how quickly the population responds when faced with conditions that make a norm change desirable), and exploration rate (the willingness of agents to try out new strategies). Our results show that a high need for coordination leads to both high cultural inertia and a low exploration rate, while a low need for coordination leads to low cultural inertia and a high exploration rate.
In Chapter 9, we extend this to study the rate at which a norm changes in different cultures. We analyze the evolutionary relationships between the tendency to conform and how quickly a population reacts when conditions make a change in norm desirable. Our analysis identifies conditions under which a tipping point is reached in a population, causing norms to change rapidly. We find that tighter cultures are more likely to be initially resistant to norm changes, but once they reach a tipping point, they change faster than looser cultures.

In Chapter 10, we study conditions that affect the existence of group-biased behavior among humans (i.e., favoring others from the same group, and being hostile towards others from different groups). Using an evolutionary game-theoretic model, we show that out-group hostility is dramatically reduced by mobility. Technological and societal advances over the past centuries have greatly increased the degree to which humans change physical locations, and our results show that in highly mobile societies, one's choice of action is more likely to depend on what individual one is interacting with, rather than the group to which the individual belongs.

Chapter 8: Understanding norm change in human societies

8.1 Introduction

Human societies around the world are unique in their ability to develop, maintain, and enforce social norms. Social norms enable individuals in a society to coordinate actions, and are critical in accomplishing different tasks. Neuroscience, field, and experimental research have all established that there are marked differences in the strength of social norms around the globe [BVL13, EH14, GRN+11, HG14, HEM+10, HMB+06, HTG08, RGNL15]. Some cultures (e.g., some Middle Eastern countries, India, South Korea, etc.) are tight, in the sense that they tend to have strong social norms, with a high degree of norm-adherence and higher punishment directed towards norm-violators.
Other cultures (e.g., Netherlands, New Zealand, Australia, etc.) are loose, i.e., individuals tend to develop weaker norms with more tolerance for deviance [GRN+11, HG14, RGNL15]. This indicates that the nature of human interaction and influence is vastly different across different cultures around the world.

To date, there has been little research on the evolutionary processes of norm maintenance and the processes that lead to norm change, and on how these processes differ substantially in societies around the world. However, recent world events (e.g., recent social uprisings and turmoil) show that it is critically important to develop such an understanding. In this section, we draw ideas from recent social science research to build culture-sensitive models that provide insights into the substantial societal differences that exist in how individuals interact and influence each other.

Here, we use EGT to examine the relationships of the need for coordination (which psychological and sociological studies show is related to norm strength [RGNL15]) with two key aspects of norm change in societies:

1. the amount of cultural inertia, i.e., the amount of resistance to changing a cultural norm, and
2. the exploration rate, i.e., the extent to which agents are willing to try out new behaviors.

More specifically, our primary contributions in this work are as follows:

• We provide a novel way to
  1. model a society's strength of norms by using an agent's need for coordination in the society, and
  2. model the desirable/undesirable norms in a society, by characterizing how they affect the payoffs in a game-theoretic payoff matrix, leading to different interaction structures and incentives in a society.
• We investigate cultural evolution of norm change in this model using two well-known models of change in evolutionary game theory (the replicator dynamic [TJ78] and the Fermi rule [Blu93]).
Using mathematical analyses and extensive agent-based simulations, we establish that: the higher the need for coordination is, the higher the cultural inertia will be, and vice versa. When a population faces conditions that make a norm change desirable, a high need for coordination will make it slower to change to the new norm compared to a society with a lower need for coordination. Further, if the need for coordination is high enough, the existing norm will not change at all.
• In order to understand how norms change in different cultures, we also examine whether the need for coordination in a society has a causal evolutionary relationship to an agent's tendency to learn socially (i.e., adopt a behavior that is being used by other agents in the population) versus innovate/explore new random behaviors. To do so, we propose a novel way to model this, where we let the exploration rate, i.e., the probability that an agent tries out a new action at random, evolve over time as part of the agent's strategy, rather than stay fixed as in previous work [THDS+09].
• The cultural differences in the distribution of agent strategies favoring social learning versus innovation or exploration can have a critical impact on how attitudes, beliefs and behaviors spread throughout the population, and thus are vital to understanding norm change. At a societal level, such differences can affect the rate at which new technologies, languages, moral traditions, and political institutions are adopted, while at local levels, they can affect the processes of influence at the individual level. Using the above model of evolving exploration rates, we verify this by establishing, via extensive agent-based simulations, that: the higher the need for coordination is, the lower the exploration rate will be, and vice versa.

\[
M_c = \begin{array}{c|cc} & A & B \\ \hline A & a_c,\, a_c & 0,\, 0 \\ B & 0,\, 0 & b_c,\, b_c \end{array}
\qquad
M_f = \begin{array}{c|cc} & A & B \\ \hline A & a_f,\, a_f & a_f,\, b_f \\ B & b_f,\, a_f & b_f,\, b_f \end{array}
\]

Figure 8.1: Individual payoff matrices.
$M_c$ denotes the coordination game and $M_f$ denotes the fixed-payoff game used in our model.

These results provide insight into the reasons why tight societies are less open to change, and why cultural inertia and high levels of social learning develop in such societies. To our knowledge, this is the first work to provide a culturally-sensitive model of norm change and to show how the processes of norm propagation differ across societies.

The rest of the chapter is organized as follows. Section 8.2 provides our model of the need for coordination, and mathematical analyses and agent-based simulations showing how it affects cultural inertia. Section 8.3 describes our model of evolving exploration rates, and shows how the degree of need for coordination affects the evolution of exploration rates. In Section 8.4 we discuss the significance of our results.

8.2 Proposed model

Past field and experimental research have shown that tight societies have stronger norms, where individuals adhere to norms much more than in loose societies, and face higher punishment when deviating. On the other hand, individuals in loose societies typically have more tolerance for deviant behavior [GRN+11, HG14, RGNL15]. Past EGT studies have shown that a society's exposure to societal threat is a key mediating factor in its strength of norms [RGNL15], where threats can be either ecological, like natural disasters and scarcity of resources, or manmade, such as threats of invasions and conflict. In high-threat situations, societies tend to develop strong norms for coordinating social interaction (i.e., to become tighter), since coordination is vital for the society's survival. In low-threat situations, there is less need for coordination, which affords weaker norms and looser societies.

\[
M = \begin{array}{c|cc}
 & A & B \\ \hline
A & c\,a_c + (1-c)\,a_f,\;\; c\,a_c + (1-c)\,a_f & (1-c)\,a_f,\;\; (1-c)\,b_f \\
B & (1-c)\,b_f,\;\; (1-c)\,a_f & c\,b_c + (1-c)\,b_f,\;\; c\,b_c + (1-c)\,b_f
\end{array}
\]

Figure 8.2: Weighted payoff matrix $M$ defined as $M = cM_c + (1-c)M_f$.
Using this intuition, we hypothesize that the interactions between individuals in different societies are governed by different payoff structures and incentives. Tight societies tend to have a high need for coordination, and we can model the extreme case as a coordination game $M_c$, where one only gets a payoff if playing the same action as the agent one is interacting with. In loose societies, on the other hand, individuals' payoffs are less affected by others' actions, and we can model the extreme case as a fixed-payoff game $M_f$, in which an agent's payoff depends only on the action played by that agent, and not on the actions of the other agent. For cases in between the two extremes, we use a game in which the payoff matrix is a weighted combination of a coordination game and a fixed-payoff game, with the weighting factor $0 \le c \le 1$ denoting the need for coordination.

\[
M' = \begin{array}{c|cc}
 & A & B \\ \hline
A & a,\; a & (1-c)\,a,\;\; (1-c)\,b \\
B & (1-c)\,b,\;\; (1-c)\,a & b,\; b
\end{array}
\]

Figure 8.3: Updated payoff matrix after assuming $a_c - b_c = a_f - b_f$ and adding a suitable constant to the payoffs in $M$ in Figure 8.2.

As is done in many EGT studies, we consider games in which individuals have two possible actions to choose from. In our case, the two actions $A$ and $B$ correspond to possible norms that the society could settle on. As shown in Figure 8.1, the coordination game has a payoff matrix $M_c$ in which $a_c$ and $b_c$ are the payoff parameters, and the fixed-payoff game has a payoff matrix $M_f$ in which $a_f$ and $b_f$ are the payoff parameters. The weighted combination of the two games, shown in Figure 8.2, is $M = cM_c + (1-c)M_f$, where $0 \le c \le 1$ is the need for coordination.

We first present a lemma that shows that, under a mild assumption, the payoff matrix $M$ can be much simplified by adding a constant to all payoffs in the matrix.

Lemma 8.2.1. Consider the game matrix $M$ defined in Figure 8.2, and assume that $a_c - b_c = a_f - b_f$.
Then, under a suitable addition of a constant to the payoffs, and using $a_c = a$ and $b_c = b$, the game matrix $M$ reduces to the matrix $M'$ shown in Figure 8.3.

Proof. On adding the constant value $(1-c)(a_c - a_f) = (1-c)(b_c - b_f)$ (where equality holds under the assumption) to all payoffs in $M$, the payoff matrix $M$ reduces to $M'$, shown in Figure 8.3, where we denote $a_c = a$ and $b_c = b$. □

The assumption $a_c - b_c = a_f - b_f$ is very reasonable, since it just ensures that switching from one norm to the other always results in the same change in payoffs, regardless of the weight $c$ on the coordination game. Otherwise, there would be an added causal factor for the dynamics of norm change. Also note that, from Lemma 8.2.1, under additions with a constant, this assumption reduces to just setting $a_c = a_f$ and $b_c = b_f$. For the rest of the section, we will work with payoff matrix $M'$, where we set $a_c = a_f = a$ and $b_c = b_f = b$. In subsequent sections, we will show why simplifying the payoff matrix by adding a constant value to all payoffs (as shown in Lemma 8.2.1) is a perfectly reasonable step to take.

From payoff matrix $M'$, we see that whenever $b < a$, the better action for the society to settle on (in terms of payoff) is $A$, while if $a < b$ then it is $B$. Let $M'_{AB}$ be the payoff that an agent receives when they play action $A$ and their opponent plays action $B$. Let $M'_{AA}$, $M'_{BA}$ and $M'_{BB}$ be defined similarly. Studying the Nash equilibria of the game $M'$, we get the following lemma.

Lemma 8.2.2. Consider the game matrix $M'$ defined in Figure 8.3, where all payoff values are positive, i.e., $a, b > 0$. Then we have:

(i) If $b > a$, the strategy profile $(B, B)$ is a Nash Equilibrium. Further, if $c \ge \frac{b-a}{b}$, then $(A, A)$ is also a Nash Equilibrium. Further, the strategy profile $((q, 1-q), (q, 1-q))$ is a Nash Equilibrium only when $c \ge \frac{b-a}{b}$, where $q = \frac{b - (1-c)a}{c(a+b)}$. Note that the mixed strategy $(q, 1-q)$ denotes playing action $A$ with probability $q$ and action $B$ with probability $1-q$.

(ii) Similarly, if $a > b$, the strategy profile $(A, A)$ is a Nash Equilibrium. Further, if $c \ge \frac{a-b}{a}$, then the strategy profiles $(B, B)$ and $((q, 1-q), (q, 1-q))$ are also Nash Equilibria, with $q = \frac{b - (1-c)a}{c(a+b)}$.

Proof. For $(A, A)$ to be a Nash Equilibrium of the game $M'$, as defined in Figure 8.3, the following condition has to hold:

\[
M'_{AA} \ge M'_{BA} \;\Rightarrow\; c \ge \frac{b-a}{b}. \tag{8.1}
\]

Similarly, for $(B, B)$ to be a Nash Equilibrium, the required condition is:

\[
M'_{BB} \ge M'_{AB} \;\Rightarrow\; c \ge \frac{a-b}{a}. \tag{8.2}
\]

Consider the following two cases:

1. $b > a$: In this case, (8.2) is always satisfied. Thus, $(B, B)$ is a NE. If $c$ is large enough such that (8.1) is satisfied, then $(A, A)$ is also a NE.
2. $a > b$: In this case, (8.1) is always satisfied. Thus, $(A, A)$ is a NE. If $c$ is large enough such that (8.2) is satisfied, then $(B, B)$ is also a NE.

Note that $((q, 1-q), (q, 1-q))$ is a mixed-strategy Nash Equilibrium when an agent whose opponent plays $(q, 1-q)$ is indifferent between actions $A$ and $B$, i.e., when:

\[
q M'_{AA} + (1-q) M'_{AB} = q M'_{BA} + (1-q) M'_{BB}. \tag{8.3}
\]

Substituting the entries of $M'$, this reads $qa + (1-q)(1-c)a = q(1-c)b + (1-q)b$, which simplifies to:

\[
q = \frac{b - (1-c)a}{c(a+b)}.
\]

We know that $0 \le q \le 1$. Thus, this reduces to the following two conditions for $((q, 1-q), (q, 1-q))$ to be a Nash Equilibrium:

\[
c \ge \frac{b-a}{b} \quad \text{and} \quad c \ge \frac{a-b}{a}.
\]

When $b > a$, $c \ge \frac{a-b}{a}$ is always satisfied. Thus, when $c$ is large enough such that $c \ge \frac{b-a}{b}$, $((q, 1-q), (q, 1-q))$ is a mixed-strategy Nash Equilibrium. Similarly, when $a > b$, $c \ge \frac{b-a}{b}$ is always satisfied, and when $c$ is large enough such that $c \ge \frac{a-b}{a}$, $((q, 1-q), (q, 1-q))$ is a mixed-strategy Nash Equilibrium. □

From Lemma 8.2.2, we see that the sub-optimal action pair becomes a Nash Equilibrium only when $c$ is high enough, where sub-optimal action pair refers to the situation where both agents get a lower payoff than otherwise possible using, for example, the optimal action pair.
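The conditions in Lemma 8.2.2 can be checked numerically for specific payoff values. The sketch below is a minimal illustration, assuming the payoff matrix $M'$ of Figure 8.3; the function names and the example values $a = 1$, $b = 2$ are hypothetical and chosen only to show the threshold $c \ge (b-a)/b = 0.5$ at work. Only symmetric profiles are examined, as in the lemma.

```python
def m_prime(a, b, c):
    """Payoffs in M' keyed by (own action, opponent's action)."""
    return {
        ("A", "A"): a,
        ("A", "B"): (1 - c) * a,
        ("B", "A"): (1 - c) * b,
        ("B", "B"): b,
    }


def symmetric_pure_nash(a, b, c):
    """Symmetric pure profiles of M' from which no agent gains by deviating."""
    m = m_prime(a, b, c)
    equilibria = []
    if m[("A", "A")] >= m[("B", "A")]:   # condition (8.1)
        equilibria.append(("A", "A"))
    if m[("B", "B")] >= m[("A", "B")]:   # condition (8.2)
        equilibria.append(("B", "B"))
    return equilibria


def mixed_q(a, b, c):
    """Probability q of playing A in the symmetric mixed equilibrium.

    Returns None when c == 0 or when the formula falls outside [0, 1],
    i.e., when no such mixed equilibrium exists.
    """
    if c == 0:
        return None
    q = (b - (1 - c) * a) / (c * (a + b))
    return q if 0 <= q <= 1 else None


low_c = symmetric_pure_nash(1.0, 2.0, 0.3)    # below the threshold 0.5
high_c = symmetric_pure_nash(1.0, 2.0, 0.8)   # above the threshold 0.5
q_high = mixed_q(1.0, 2.0, 0.8)
```

With these illustrative values, only $(B, B)$ survives at $c = 0.3$, while at $c = 0.8$ both pure profiles are equilibria and the mixed equilibrium plays $A$ with probability $0.75$.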
This means that when $b > a$, $(A, A)$ is the sub-optimal action pair. Thus, from Lemma 8.2.2, we see that if the need for coordination $c$ is high, then the population may converge to either of two different equilibria, one of which is sub-optimal in terms of overall payoff. When $c$ is low, on the other hand, the society will converge to a single globally-optimal equilibrium.

In the next two sub-sections we introduce two models for studying norm change, using two well-known models of evolutionary change (the replicator dynamic [TJ78] and the Fermi rule [Blu93]). We show that both models of evolutionary change are invariant to additions to the payoffs by a constant, and thus the results from this section carry forward. We derive results for how different societies respond to a need for norm change using both mathematical analysis on infinite well-mixed populations (where well-mixed denotes that any agent can interact with any other agent in the population), and extensive agent-based simulations on finite structured populations (where agents are placed on a network and can interact with only their neighbors).

8.2.1 Replicator dynamic on infinite well-mixed populations

Consider a well-mixed infinite population of agents. This is a standard setting used in evolutionary game theory, since a well-mixed infinite population is usually analytically tractable. Let the agents interact with each other using the game matrix $M'$ defined in Figure 8.3, and let the proportion of agents playing each strategy be denoted by $x = (x_A, x_B)$, i.e., a proportion $x_A$ of agents play strategy $A$, and a proportion $x_B = 1 - x_A$ play strategy $B$. Also, let $u_A(x)$ and $u_B(x)$ denote the payoffs received by an agent playing actions $A$ and $B$ respectively, given the strategy proportions $x$. The expected payoff for an agent is obtained by interacting with a randomly chosen agent in the population. Thus, we get the following:

\[
\mathbb{E}[u_A(x)] = x_A M'_{AA} + x_B M'_{AB}, \qquad
\mathbb{E}[u_B(x)] = x_A M'_{BA} + x_B M'_{BB}.
\]
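The two expected-payoff expressions above are easy to evaluate for concrete values. The sketch below is a minimal illustration assuming the entries of $M'$ from Figure 8.3; the function name `expected_payoffs` and the example values $a = 1$, $b = 2$, $c = 0.8$ are hypothetical.

```python
def expected_payoffs(x_a, a, b, c):
    """Expected payoffs (E[u_A], E[u_B]) in a well-mixed population.

    x_a is the proportion of agents playing A; the remaining 1 - x_a
    play B. Payoff entries are those of the matrix M'.
    """
    x_b = 1.0 - x_a
    e_a = x_a * a + x_b * (1 - c) * a        # x_A * M'_AA + x_B * M'_AB
    e_b = x_a * (1 - c) * b + x_b * b        # x_A * M'_BA + x_B * M'_BB
    return e_a, e_b


# Illustrative values: B is the better norm (b > a), coordination need is high.
a, b, c = 1.0, 2.0, 0.8
mostly_a = expected_payoffs(0.9, a, b, c)
half_half = expected_payoffs(0.5, a, b, c)
```

With these values, an agent playing $A$ does better than one playing $B$ when 90% of the population already plays $A$, even though $B$ is the higher-payoff norm, whereas at an even split $B$ does better. This is the population-level counterpart of the two pure equilibria in Lemma 8.2.2.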
On analyzing the Nash Equilibria of this system, we observe the following lemma.

Lemma 8.2.3. Consider a well-mixed infinite population where agents interact using the game $M'$ in Figure 8.3. Assuming all payoff values are positive, i.e., $a, b > 0$, and using Lemma 8.2.2, we have:

(i) When $b > a$, $x_A = 0$ is a Nash Equilibrium. If $c \ge \frac{b-a}{b}$, then $x_A = 1$ and $x_A = \frac{b - (1-c)a}{c(a+b)}$ (which corresponds to the mixed-strategy Nash Equilibrium in Lemma 8.2.2) are also Nash Equilibria.

(ii) Similarly, when $a > b$, $x_A = 1$ is a Nash Equilibrium, while if $c \ge \frac{a-b}{a}$, then $x_A = 0$ and $x_A = \frac{b - (1-c)a}{c(a+b)}$ also are Nash Equilibria.

Proof. Consider the cases $x_A = 0$ (the strategy set where all of the population plays $B$) and $x_A = 1$ (the strategy set where all of the population plays $A$). From Lemma 8.2.2, we get the following two cases:

1. $b > a$: In this case, (8.2) is always satisfied. Thus, $x_A = 0$ is a NE. If $c$ is large enough such that (8.1) is satisfied, then $x_A = 1$ is also a NE.
2. $a > b$: In this case, (8.1) is always satisfied. Thus, $x_A = 1$ is a NE. If $c$ is large enough such that (8.2) is satisfied, then $x_A = 0$ is also a NE.

Now consider the intermediate case where $x_A = p$ with $0 < p < 1$. For $x_A = p$ to be a NE, no $A$ agent should have a strictly better payoff if switching to $B$, and vice versa. Thus, the following two conditions need to be simultaneously satisfied:

\[
p M'_{AA} + (1-p) M'_{AB} \ge p M'_{BA} + (1-p) M'_{BB}, \quad \text{and} \quad
p M'_{BA} + (1-p) M'_{BB} \ge p M'_{AA} + (1-p) M'_{AB}.
\]

Both of these conditions are satisfied only when:

\[
p M'_{AA} + (1-p) M'_{AB} = p M'_{BA} + (1-p) M'_{BB}.
\]

This simplifies to:

\[
p = \frac{b - (1-c)a}{c(a+b)},
\]

and, similar to Lemma 8.2.2, the results follow. □

We assume that on each iteration, agents interact with other randomly chosen agents, and the population evolves according to the replicator dynamic.
The replicator dynamic is based on the idea that the proportion of agents of a type (or strategy) increases when it achieves an expected payoff higher than the average payoff, and decreases when it achieves a lower payoff than the average. Thus, over time, the proportion of agents of a type that achieves a payoff higher than the average starts increasing in the population, and eventually takes over. More formally, the replicator dynamic is given by the differential equation

\[ \dot{x}_A = \frac{dx_A}{dt} = x_A \cdot \left( E[u_A(x)] - \theta(x) \right), \tag{8.4} \]

where $\theta(x) = x_A E[u_A(x)] + x_B E[u_B(x)]$ is the average payoff received by all agents in the population. From (8.4), it is clear that the rate of change remains the same on adding a constant to the payoff matrix, since the added constants cancel each other out. Thus, Lemma 8.2.1 carries over to this section as well.

Using the game matrix $M'$, the rate of change in the proportion $x_A$ is given by:

\[ \dot{x}_A = x_A (1 - x_A) \left( c(a+b) x_A - (b - (1-c)a) \right). \tag{8.5} \]

The fixed points of this rate of change are given by:

\[ x_A = 0, \quad x_A = 1, \quad \text{and} \quad x_A = \frac{b-(1-c)a}{c(a+b)}. \tag{8.6} \]

These correspond to the Nash Equilibria derived earlier.

Next, we study the stability of the Nash equilibria derived above, where we define a stable Nash equilibrium under the replicator dynamic to be one where, if an infinitesimal proportion of agents change their strategy, the replicator dynamic always forces the population back to the original Nash equilibrium. More precisely, let the Nash equilibrium be $x_A = p$. If $x_A$ increases by an infinitesimal amount to $p + \epsilon$, the Nash equilibrium is stable only if $\dot{x}_A < 0$, which drives the population back to the Nash equilibrium $x_A = p$. Similarly, if $x_A$ decreases by $\epsilon$, $x_A = p$ is stable only if $\dot{x}_A > 0$. Thus, we state the following corollary.

Corollary 8.2.1. From Lemma 8.2.3, Eq. (8.5) and Eq. (8.6), we see that the Nash Equilibria $x_A = 0$ and $x_A = 1$ are stable, while the Nash Equilibrium $x_A = \frac{b-(1-c)a}{c(a+b)}$ is unstable.

Proof. Let $\phi = \frac{b-(1-c)a}{c(a+b)}$. From Eq. (8.5), we notice that if $x_A = \phi + \epsilon$, then $\dot{x}_A > 0$, while if $x_A = \phi - \epsilon$, then $\dot{x}_A < 0$, for any small $\epsilon > 0$. Thus, $x_A = \frac{b-(1-c)a}{c(a+b)}$ represents an unstable fixed point. Similarly, notice that if $x_A = \epsilon$, then $\dot{x}_A < 0$, while if $x_A = 1 - \epsilon$, then $\dot{x}_A > 0$. Thus, $x_A = 0$ and $x_A = 1$ represent stable fixed points. □

There is a further notion of equilibrium used in EGT called evolutionarily stable strategies (ESS) [Smi82]. A strategy $S$ is an ESS if there is a small proportion $p_y$ such that, when any other strategy $T$ has a proportion $p_x < p_y$ (where the rest of the population has strategy $S$), the payoff of an $S$ agent is always strictly greater than that of a $T$ agent. Using this definition, we state the following theorem.

Theorem 8.2.1. From Lemma 8.2.3 and Corollary 8.2.1, we see:

(i) When $b > a$, $B$ is an ESS. If $c > \frac{b-a}{b}$, then $A$ is also an ESS.

(ii) When $a > b$, $A$ is an ESS. If $c > \frac{a-b}{a}$, then $B$ is also an ESS.

Proof. Let $C$ denote the mixed strategy $(q, 1-q)$. From (8.3), we see that for $C$ to be a mixed-strategy NE, the following condition needs to hold:

\[ q = \frac{M'_{BB} - M'_{AB}}{M'_{AA} - M'_{BA} + M'_{BB} - M'_{AB}}. \tag{8.7} \]

Let $M'_{AC}$ denote the payoff received by the row player when an $A$ player (row player) interacts with a $C$ player (column player). Similarly, we define $M'_{CC}$, $M'_{CA}$, $M'_{BC}$ and $M'_{CB}$. Thus we get:

\[ M'_{CC} := q^2 M'_{AA} + q(1-q) M'_{AB} + q(1-q) M'_{BA} + (1-q)^2 M'_{BB}, \]
\[ M'_{CA} := q M'_{AA} + (1-q) M'_{BA}, \qquad M'_{AC} := q M'_{AA} + (1-q) M'_{AB}, \]
\[ M'_{CB} := q M'_{AB} + (1-q) M'_{BB}, \qquad M'_{BC} := q M'_{BA} + (1-q) M'_{BB}. \]

Let us derive conditions under which $A$ is an evolutionarily stable strategy (ESS). We consider the proportion of agents playing $A$ to be close to 1, i.e., $x_A = 1 - \epsilon$, where $0 < \epsilon \ll 1$. Let $S$ denote a strategy other than $A$ that agents can play, i.e., $S \in \{B, C\}$. For $A$ to be an ESS, one of the following conditions needs to hold: either

1. $M'_{AA} > M'_{SA}$, or

2. $M'_{AA} = M'_{SA}$ and $M'_{AS} > M'_{SS}$.

From Lemma 8.2.2, we see that $M'_{AA} > M'_{BA}$ simplifies to the condition $c > \frac{b-a}{b}$. Further, $M'_{AA} = M'_{BA}$ simplifies to $c = \frac{b-a}{b}$. We also notice that $M'_{AA} > M'_{CA}$ simplifies to:

\[ M'_{AA} > q M'_{AA} + (1-q) M'_{BA} \;\Rightarrow\; M'_{AA} > M'_{BA}. \]

Now consider the three cases:

1. $b > a$: In this case, if $c$ is large enough such that $c > \frac{b-a}{b}$ is satisfied, then $A$ is an ESS.

2. $a > b$: In this case, $c > \frac{b-a}{b}$ is always satisfied. Thus, $A$ is an ESS.

3. $b > a$ and $c = \frac{b-a}{b}$: In this case, for $A$ to be an ESS, both the conditions $M'_{AB} > M'_{BB}$ and $M'_{AC} > M'_{CC}$ have to be satisfied. $M'_{AB} > M'_{BB}$ simplifies to $(1-c)a > b$, which is never satisfied. $M'_{AC} > M'_{CC}$ simplifies to:

\[ q M'_{AA} + (1-q) M'_{AB} > q^2 M'_{AA} + q(1-q) M'_{AB} + q(1-q) M'_{BA} + (1-q)^2 M'_{BB} \]
\[ \Rightarrow\; q(1-q)(M'_{AA} - M'_{BA}) > (1-q)^2 (M'_{BB} - M'_{AB}), \tag{8.8} \]

which is also never satisfied (this follows from (8.7)). Thus, for this case, $A$ is not an ESS.

We can similarly derive conditions under which $B$ is an ESS.

Now we examine whether $C$ is an ESS. Using (8.7), we can show that the following conditions are satisfied: $M'_{CC} = M'_{AC}$ and $M'_{CC} = M'_{BC}$. Thus, for $C$ to be an ESS, both $M'_{CA} > M'_{AA}$ and $M'_{CB} > M'_{BB}$ need to be satisfied. These two conditions simplify to $M'_{BA} > M'_{AA}$ and $M'_{AB} > M'_{BB}$, which in turn simplify to $c < \frac{b-a}{b}$ and $c < \frac{a-b}{a}$. These two conditions cannot be satisfied simultaneously. Thus, $C$ is not an ESS. □

Figure 8.4: Change in the proportion of $B$ agents with time in a well-mixed infinite population where reproduction is determined by the replicator dynamic with $b > a$.

We observe that the strategies $A$ and $B$ are evolutionarily stable strategies (ESS) when adopted by everyone in the population (corresponding to the stable Nash equilibria $x_A = 1$ and $x_A = 0$). The unstable Nash Equilibrium, on the other hand, does not correspond to an ESS, since even a small group with a different strategy is able to force the population to a different equilibrium.
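The stability argument of Corollary 8.2.1 can be checked numerically. The following is a minimal sketch, assuming only the closed form in Eq. (8.5); the function names and the perturbation size `eps` are illustrative choices, not part of the original analysis.

```python
def x_dot(x_a, a, b, c):
    """Right-hand side of Eq. (8.5): rate of change of the share playing A."""
    return x_a * (1.0 - x_a) * (c * (a + b) * x_a - (b - (1.0 - c) * a))


def interior_fixed_point(a, b, c):
    """Interior fixed point from Eq. (8.6), or None if it lies outside (0, 1)."""
    phi = (b - (1.0 - c) * a) / (c * (a + b))
    return phi if 0.0 < phi < 1.0 else None


def is_stable(x_star, a, b, c, eps=1e-6):
    """Perturbation test: the flow must push back towards x_star from
    every side on which a feasible perturbation exists."""
    from_above = x_star + eps > 1.0 or x_dot(x_star + eps, a, b, c) < 0.0
    from_below = x_star - eps < 0.0 or x_dot(x_star - eps, a, b, c) > 0.0
    return from_above and from_below


if __name__ == "__main__":
    a, b = 1.0, 1.15
    for c in (0.1, 0.5):  # below and above the threshold (b - a) / b
        print("c =", c,
              "| interior fixed point:", interior_fixed_point(a, b, c),
              "| x_A = 0 stable:", is_stable(0.0, a, b, c),
              "| x_A = 1 stable:", is_stable(1.0, a, b, c))
```

With $b > a$, the check reports $x_A = 1$ as stable only when $c$ exceeds $(b-a)/b$, and the interior fixed point, when it exists, fails the test from both sides.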
In other words, only stable Nash Equilibria correspond to evolutionarily stable strategies. Theorem 8.2.1 indicates that a society under our model is bound to end up at one of the evolutionarily stable strategies, with everyone on action $A$ or everyone on action $B$, since even a small perturbation moves the society away from the unstable Nash equilibrium. When $c$ is low, there exists only a single ESS, and thus the society adapts itself and settles on that ESS. When $c$ is high, there are two ESSs, and thus the society might settle on either one, depending on its starting point.

Figure 8.5: Rate of change of $B$ agents versus the proportion of $B$ agents, in a well-mixed infinite population where reproduction is determined by the replicator dynamic with $b > a$.

Let us consider two societies: one with a lower need for coordination $c_1$, and one with a higher need for coordination $c_2 > c_1$. To avoid some awkward phrasing, we'll call these the "looser" and "tighter" societies, respectively. Suppose a majority of both societies are playing norm $A$, and suppose they evolve according to the replicator dynamic given in Eq. (7.1). We are interested in how these two societies would respond to the action $B$ when the payoff of action $B$ is higher than that of $A$, i.e., when $b > a$, or equivalently, $M'_{BB} > M'_{AA}$.

First notice that if $c_2 > (b-a)/b$ and $c_1 < (b-a)/b$, it follows from Theorem 8.2.1 that the tighter society remains on norm $A$ while the looser one switches to the globally optimal norm $B$.

Now suppose the difference in norm payoffs is large enough such that $c_2 < (b-a)/b$ (and thus also $c_1 < (b-a)/b$). This ensures that there is only a single equilibrium for both societies, at $x_A = 0$. Thus, both societies would eventually switch to norm $B$, and we are interested in the rate at which this change occurs. Let $\dot{x}_{B1}$ and $\dot{x}_{B2}$ denote the rate of change when the need to conform is $c_1$ or $c_2$, respectively.
Then we can show that:

\[ \dot{x}_{B2} - \dot{x}_{B1} = x_B (1 - x_B)(c_2 - c_1)\left( (a+b) x_B - b \right). \]

This simplifies to

\[ \dot{x}_{B2} - \dot{x}_{B1} \; \begin{cases} \le 0, & \text{when } x_B \le \frac{b}{a+b}; \\ > 0, & \text{when } x_B > \frac{b}{a+b}. \end{cases} \tag{8.9} \]

Thus, $\dot{x}_{B2} < \dot{x}_{B1}$ in the initial stages when $x_B < b/(a+b)$. However, once the proportion of $B$ agents becomes big enough such that $x_B > b/(a+b)$, the higher the value of $c$, the higher the rate of change will be. Thus, when $c$ is high, the switch from $A$ to $B$ takes time to speed up, with more cultural inertia than when $c$ is low, even when the payoff of the new norm is arbitrarily large compared to the previous norm. This initial cultural inertia causes the society with the higher $c$ to take longer overall to switch to the new norm.

Figure 8.4 illustrates these properties of well-mixed populations using the replicator dynamic. We start the society at the proportion $x_A = 0.95$. In the first of the three graphs, the tighter society (again using "tighter" as shorthand for "higher need for coordination") has $c > (b-a)/b$. Thus, while the less tight society switches to the more beneficial norm $B$, the tighter society is resistant to the change (since the difference in payoffs is small) and stays with norm $A$. The second and third graphs show situations where both societies switch to norm $B$. We observe that the tighter society switches to norm $B$ more slowly, but the difference in speed decreases as the difference in payoffs between $B$ and $A$ increases.

Figure 8.6: Simulations with the Fermi rule on a toroidal grid of size 2500. From top to bottom: $c = 1.0$, $c = 0.75$, $c = 0.5$. Initially: $a = 1.0$, $b = 1.15$. We use a structural shock at 2500 iterations, after which the payoffs become: $a = 1.15$, $b = 1.0$.

As derived in Eq. (8.9), the rate of change for a society with higher $c$ grows larger than that of a society with lower $c$ only after $x_B > \frac{b}{a+b}$. This is shown in Figure 8.5. This also indicates the initial inertia that societies with a higher need for coordination experience towards changing norms.
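The comparison between a looser and a tighter society can be reproduced with a few lines of numerical integration. The sketch below assumes only Eq. (8.5) rewritten in terms of $x_B = 1 - x_A$; the forward-Euler scheme, the step size, the target share of 0.99, and the function names are illustrative assumptions rather than the procedure used to produce Figures 8.4 and 8.5.

```python
def x_b_dot(x_b, a, b, c):
    """Rate of change of the share playing B: Eq. (8.5) with x_A = 1 - x_B."""
    x_a = 1.0 - x_b
    return -x_a * (1.0 - x_a) * (c * (a + b) * x_a - (b - (1.0 - c) * a))


def time_to_reach(target, x_b0, a, b, c, dt=1e-3, t_max=1e3):
    """Forward-Euler time until x_B first reaches `target`.
    Returns None if the target is not reached before t_max."""
    x_b, t = x_b0, 0.0
    while t < t_max:
        if x_b >= target:
            return t
        x_b += dt * x_b_dot(x_b, a, b, c)
        t += dt
    return None


if __name__ == "__main__":
    a, b = 1.0, 2.0                # (b - a) / b = 0.5, so both c values allow the switch
    c_loose, c_tight = 0.1, 0.4
    x_b0 = 0.05                    # the society starts at x_A = 0.95
    print("loose:", time_to_reach(0.99, x_b0, a, b, c_loose))
    print("tight:", time_to_reach(0.99, x_b0, a, b, c_tight))
```

Evaluating `x_b_dot` on either side of $x_B = b/(a+b)$ shows the sign change described by Eq. (8.9): the tighter society is slower below the threshold and faster above it.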
The need for coordination in these societies leads to individuals being reluctant to try out new norms, which in turn leads to inertia.

8.2.2 Agent simulations on finite networks

A limitation of the model introduced in the previous section is that it assumes that the population is infinite and well-mixed. While the assumption that a population is infinite is not a bad approximation for very large populations (which is the scale that we are interested in), the assumption that agents are well-mixed, i.e., that any agent can interact with any other agent, is often inaccurate. In this section, we show that the results derived in the previous section also extend to a model in which agents are placed on the nodes of a graph/network, where agents can only interact with another agent if they are connected by an edge in the graph.

More specifically, we now consider a structured population where agents are arranged on the nodes of a toroidal (wrap-around) grid, such that each agent can interact only with the 4 other agents they are connected to. We consider toroidal grids as a convenient example; however, the results we describe below also extend to other network structures like small-world networks [WS98] and preferential attachment models [BA99].

Mathematical analysis of evolutionary games on structured populations is not yet a well-developed field, and thus we perform simulations of our model as follows. Initially, we arrange agents with random strategies ($A$ or $B$) on each node of the grid. In each iteration, each pair of agents connected by an edge interact in a two-player game defined by the payoff matrix $M$. The total payoff of each agent is computed by summing the payoffs received by the agent in each game that they play. Since the population is finite, we use dynamics defined on finite populations. After each interaction phase, agents use the Fermi rule to update their strategy for the next iteration.
Under the Fermi rule, an agent $\psi_a$ picks a random neighbor $\psi_n$ and observes its payoff, and the agent then decides to switch to the neighbor's strategy with probability

\[ p = \left( 1 + \exp(-s(u_n - u_a)) \right)^{-1}, \]

where $u_a$ and $u_n$ are the payoffs of the agent and the neighbor, and $s$ is a user-defined parameter (in all our experiments, we set $s = 5$). With probability $1 - p$, the agent retains its old strategy. With a small probability $\mu$, called the exploration rate, an agent then tries out an action completely at random. This repeats for every iteration of the simulation. Note that the Fermi rule also depends only on a difference between payoff values, and thus, like the replicator dynamic, is invariant to the addition of a constant to the payoffs. Thus, for all our experiments in this section, we use the simplified game matrix $M'$ from Figure 8.3.

To study cultural inertia (i.e., resistance to changing a cultural norm) or rapid cultural change in different societies, we use a game-theoretic model of a structural shock. A structural shock represents a catastrophic incident in a society, where there is a sudden, abrupt change in the payoffs for actions $A$ and $B$. We are interested in studying how societies with different needs for coordination react to such an abrupt and drastic shift in the payoffs of the possible actions.

In our EGT model, we implement a structural shock by simply interchanging the payoffs of actions $A$ and $B$, thus denoting a sudden change in the globally optimal action in a society. This is equivalent to interchanging the payoff values $a$ and $b$. Thus, if initially we have $b > a$, after a structural shock we get $a > b$. Note that in the simulations presented in Figure 8.4 with well-mixed populations, we use a structural shock implicitly by assuming the norm is $A$ and the payoff of action $B$ is higher than that of action $A$.

Consider that, initially, the action with a higher utility (and the current norm) in a society is $B$, i.e., $b > a$ with $x_A = 0$.
Now suppose the society experiences a structural shock, after which action $A$ becomes more desirable, with $a > b$. On introducing a small proportion of agents playing norm $A$ (say $x_A = 0.01$), if the need for coordination is low then the population will switch to the new norm with $x_A = 1$. This is because, after the structural shock, the Nash Equilibrium (and ESS) is $x_A = 1$, as shown above. However, if the need for coordination is high (i.e., $c \ge \frac{a-b}{a}$), then $x_A = 0$ is still a Nash Equilibrium (and ESS) and the population will remain on the sub-optimal norm $B$ even after the structural shock.

All experiments were run on a grid with 2500 nodes, and each simulation ran for 6000 iterations, with a structural shock implemented at 2500 iterations. We ran 100 independent simulations for each setting and averaged the results over the 100 runs.

Figure 8.6 shows the results of our simulations. The plots show the proportion of agents playing norm $A$ vs. norm $B$. As before, the parameter $c$ denotes the need for coordination. When $c$ is low, very little cultural inertia develops and agents are more willing to innovate by exploring behaviors other than the current societal norms. In this case, the population will change more quickly to a different norm if the new norm is beneficial. By contrast, when $c$ is high, we see the evolutionary emergence of higher levels of cultural inertia, with agents less willing to innovate or to violate established cultural norms. In this case, the population is slower to change to the new norm, and if $c$ is high enough it may not change at all. Thus, qualitatively, the results with structured populations match those from the infinite well-mixed populations in Section 8.2.1, and the mechanics that lead to the above results can be explained using the same equilibrium results derived above.
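The simulation procedure of this section can be sketched in a few dozen lines. The code below is a simplified illustration under several assumptions that are not spelled out in the text: the matrix $M'$ of Figure 8.3 is taken to be $M'_{AA} = a$, $M'_{AB} = (1-c)a$, $M'_{BA} = (1-c)b$, $M'_{BB} = b$ (the form consistent with the equilibrium conditions quoted above); all agents update synchronously; exploration draws uniformly from $\{A, B\}$; and the Fermi rule is used in the form where imitation becomes more likely as the neighbor's payoff advantage grows. The function names, the smaller grid, and the shorter run are hypothetical choices made to keep the example fast.

```python
import math
import random


def payoff(me, other, a, b, c):
    """Row player's payoff under the assumed M' = [[a, (1-c)a], [(1-c)b, b]]."""
    base = a if me == "A" else b
    return base if me == other else (1.0 - c) * base


def neighbours(i, j, n):
    """The four neighbours of cell (i, j) on an n x n torus (indices wrap around)."""
    return [((i - 1) % n, j), ((i + 1) % n, j), (i, (j - 1) % n), (i, (j + 1) % n)]


def step(grid, a, b, c, s=5.0, mu=0.01, rng=random):
    """One synchronous iteration: play every neighbour, imitate via the
    Fermi rule, then explore with probability mu."""
    n = len(grid)
    total = [[sum(payoff(grid[i][j], grid[p][q], a, b, c)
                  for p, q in neighbours(i, j, n))
              for j in range(n)] for i in range(n)]
    new_grid = [row[:] for row in grid]
    for i in range(n):
        for j in range(n):
            p, q = rng.choice(neighbours(i, j, n))
            # Imitation probability grows with the neighbour's payoff advantage.
            prob = 1.0 / (1.0 + math.exp(-s * (total[p][q] - total[i][j])))
            if rng.random() < prob:
                new_grid[i][j] = grid[p][q]
            if rng.random() < mu:
                new_grid[i][j] = rng.choice("AB")
    return new_grid


def share_of_a(grid):
    """Fraction of agents currently playing A."""
    return sum(row.count("A") for row in grid) / (len(grid) ** 2)


if __name__ == "__main__":
    rng = random.Random(0)
    n = 20
    grid = [[rng.choice("AB") for _ in range(n)] for _ in range(n)]
    for t in range(200):
        # Structural shock at t = 100: the payoffs a and b are interchanged.
        a, b = (1.0, 1.15) if t < 100 else (1.15, 1.0)
        grid = step(grid, a, b, c=0.5, rng=rng)
    print("share of A after the run:", share_of_a(grid))
```

Averaging `share_of_a` over many independent runs and several values of `c` gives curves of the kind plotted in Figure 8.6; a single short run on a small grid is noisy and is meant only to show the mechanics.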
8.3 Evolving exploration rates

In addition to the amount of cultural inertia, another key aspect of understanding how norms change in different cultures is whether an agent is evolutionarily more likely to learn socially (i.e., adopt a behavior that is being used by other agents in the population) or to innovate and explore new random behaviors. Such tendencies are critical in understanding the rate at which new technologies, languages, or moral traditions are adopted in a population, and help us understand the processes of influence and persuasion at the individual level.

Figure 8.7: Replicator-mutator dynamic on an infinite well-mixed population with $a = 0.4$ and $b = 0.6$. The solid and dotted lines denote $c = 0.05$ and $c = 0.3$, respectively. The colors denote the exploration rates.

In the model presented in Section 8.2.2 for finite structured populations, the exploration rate (i.e., the small probability with which an agent tries out a new strategy at random) was kept at a constant low value. This exploration rate denotes how open an agent is to change and to trying out new actions at random. It seems plausible that the need for coordination in a society affects how likely an individual is to try out different actions instead of conforming to their neighbors. In particular, it seems natural to assume that individuals in tight societies are much less likely to try out random actions than individuals in loose societies [GRN+11, HG14]. In this section, we test this hypothesis by presenting a model to study the evolution of exploration rates in different societies.

To get some intuition about the hypothesis, we go back to our setting of a well-mixed infinite population. Note that the replicator dynamic does not have a provision for exploration rates. Thus, we use a variant of the replicator dynamic called the replicator-mutator equation [Lev03]. Using this variant, one can include exploration rates in the replicator dynamic.

Figure 8.8: Simulations with the Fermi rule on a toroidal grid of size 2500, with structural shocks at intervals of 75 iterations. Panels: (a) $c = 1.0$, action proportions; (b) $c = 1.0$, exploration rates; (c) $c = 0.8$, action proportions; (d) $c = 0.8$, exploration rates; (e) $c = 0.5$, action proportions; (f) $c = 0.5$, exploration rates. From top to bottom: $c = 1.0$, $c = 0.8$, $c = 0.5$. Initially: $a = 1.0$, $b = 1.15$. The left column shows proportions of norms $A$ and $B$. The right column shows proportions of the population that use each different exploration rate.

Thus, if we fix $\mu$ to be the exploration rate, we can write the replicator-mutator equation as:

\[ \dot{x}_A = (1-\mu) x_A E[u_A(x)] + \mu x_A E[u_B(x)] - x_A \theta(x) = x_A \left( E[u_A(x)] - \theta(x) \right) + \mu x_A \left( E[u_B(x)] - E[u_A(x)] \right). \]

Thus, like the replicator dynamic, we can write the rate of change in terms of payoff differences, which makes the dynamic invariant to the addition of a constant to the payoffs. Thus, we again use the simplified game matrix $M'$ from Figure 8.3. Simplifying the equation for $\dot{x}_A$, we get:

\[ \dot{x}_A = x_A (1 - x_A)\left( c(a+b) x_A - (b - (1-c)a) \right) + \mu \left( x_A x_B (1-c)(b-a) + (x_B^2 b - x_A^2 a) \right). \tag{8.10} \]

Figure 8.7 plots the replicator-mutator equation (Eq. (8.10)) for a well-mixed infinite population. The solid lines are for a low need for coordination ($c = 0.05$), while the dotted lines are for a high need for coordination ($c = 0.3$), and we plot the proportion of $B$ agents, as well as the rate of change, for various values of the exploration rate $\mu$. From the figure, we see that for all exploration rates $\mu$, a higher need for coordination leads to higher cultural inertia.

To study how an agent's tendency to learn socially or to explore develops in a culture, we let the exploration rate (referred to as the mutation rate in biological models) evolve. The exploration rate is the probability $\mu$ with which an agent chooses a random new strategy at each iteration ($0 \le \mu \ll 1$).
In biological evolution, mutation occurs so rarely that game-theoretic biological models often omit it. In cultural evolution, however, exploration is an important step, since individuals try out new behaviors much more frequently [THDS+09]. Studying the evolution of exploration rates gives us insight into a society's openness to change. Low exploration rates suggest that individuals are less likely to try out new strategies and are more likely to coordinate with their neighbors. On the other hand, high exploration rates mean that individuals are more open to change and innovation.

To model the evolution of exploration rates, we first create a finite, discrete set $L$ of possible exploration rates. For all our experiments, we use the set of exploration rates $L = \{0.0, 0.1, 0.2, 0.3, 0.4, 0.5\}$. The exploration rate is added as part of the strategy of an agent, and each individual now chooses an exploration rate in addition to the game action ($A$ or $B$). Thus, an agent now copies the exploration rate of a neighbor, along with the game action, when updating its strategy using the Fermi rule.

Note that a regularly changing environment is essential for studying the evolution of exploration rates since, if the environment does not change frequently enough, an exploration rate of 0 would always be evolutionarily stable. To model the changing environment, we use the same switch in dominant norms (structural shock) that we used in our earlier experiments, except that we now apply the structural shock multiple times at much shorter, regular intervals. We use a fixed interval of 75 iterations between structural shocks, and we run the simulation for a total of 2000 iterations. For these experiments, an agent's strategy set now becomes a size of 10: 5 possible exploration rates in $L$ × 2 possible game actions (norm $A$ or norm $B$). We use the same toroidal grid as described before.
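The extended update step can be sketched as follows. This is an illustrative sketch, not the simulation code behind Figure 8.8: it assumes that an agent copies the neighbor's action and exploration rate together, that exploration then redraws only the action (leaving the exploration rate unchanged), and that shocks simply swap $a$ and $b$ every 75 iterations starting from the initial payoffs. The names `update_strategy` and `payoffs_at` are hypothetical.

```python
import math
import random

EXPLORATION_RATES = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]  # the set L from the text


def update_strategy(own, neighbour, u_own, u_neighbour, s=5.0, rng=random):
    """Return the agent's next (action, exploration_rate) pair.

    Imitation follows the Fermi rule and copies both components of the
    neighbour's strategy; exploration then redraws only the action
    (an assumption of this sketch)."""
    action, mu = own
    p_imitate = 1.0 / (1.0 + math.exp(-s * (u_neighbour - u_own)))
    if rng.random() < p_imitate:
        action, mu = neighbour
    if rng.random() < mu:
        action = rng.choice("AB")
    return action, mu


def payoffs_at(t, a0, b0, interval=75):
    """Payoffs (a, b) at iteration t when a and b are swapped every `interval` iterations."""
    return (a0, b0) if (t // interval) % 2 == 0 else (b0, a0)


if __name__ == "__main__":
    rng = random.Random(0)
    agent = ("A", rng.choice(EXPLORATION_RATES))
    neighbour = ("B", rng.choice(EXPLORATION_RATES))
    print(payoffs_at(0, 1.0, 1.15), payoffs_at(80, 1.0, 1.15))
    print(update_strategy(agent, neighbour, u_own=3.0, u_neighbour=4.0, rng=rng))
```

Plugging `update_strategy` into a grid loop in place of the plain Fermi step, and tracking how many agents hold each value in `EXPLORATION_RATES`, yields the two kinds of curves shown in Figure 8.8.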
Figure 8.8 shows the experimental results. Each row in Figure 8.8 shows, for a specific $c$, the proportion of agents playing norm $A$ vs. norm $B$ (left plot), and the proportion of agents with each exploration rate (right plot). We see that when the need for coordination is high, low exploration rates are adopted by the majority of the society. Individuals in such a society are more likely to adopt the strategies of their neighbors, and this leads to high cultural inertia. In loose societies, however, higher proportions of exploration rates $\mu > 0$ evolve, and individuals are more open towards change, leading to lower cultural inertia. This fits well with our results in Section 8.2, and provides insight into why cultural inertia develops in societies with a higher need for coordination.

8.4 Significance of the work

In this chapter, we examined the processes underlying cultural inertia and norm change. We built evolutionary game-theoretic models showing that societies with a higher need for coordination (those that are tight) have higher cultural inertia, with individuals being less likely to switch to a new norm even when it might have a larger payoff. Societies with a lower need for coordination (those that are loose), on the other hand, have low cultural inertia, with individuals more willing to innovate and more open to change.

By letting the exploration rate evolve, we studied an agent's tendency to either learn through social interaction or innovate and explore new random behaviors, and we showed that exploration rates evolve differently in different cultures. When the need for coordination is high, the majority of the population has very low exploration rates, and individuals are more likely to adopt the strategies of their neighbors. When the need for coordination is low, higher exploration rates evolve, leading to lower cultural inertia and more openness to change.
This explains why tight cultures tend to have less deviant behavior among individuals, with more norm adherence. To our knowledge, this is the first work that predicts the effects of the need for coordination on norm change and cultural inertia, and how it affects an agent's decision of whether to learn from others or to innovate and explore random behaviors. We found our main qualitative findings to be robust to a wide range of parameter values, in both our simulation and theoretical results.

By studying how socio-structural factors such as the need for coordination affect cultural inertia, this work aims to establish a culturally sensitive model of norm change. With this model, we identify the conditions that lead to stability or instability in established population norms in different cultural contexts. Such knowledge is critical for identifying early markers of impending drastic shifts in a population's norms, and can thus enable tools that provide alerts to potential social uprisings and turmoil.

Chapter 9: Tipping points for norm change in human cultures

9.1 Introduction

Tightness-looseness is a dynamic construct, yet to date, there has been little research on the evolutionary processes that lead to changes in societal norms, the rate at which such changes occur, and how these processes vary across different cultures. In this chapter, we study how cultural differences in the way humans interact and influence each other shape how societal norms are established and the rate at which they change across the world. We examine the causal relationship between an individual's tendency to conform with those around them and the rate at which norms change in different cultures. More specifically, our primary contributions in this chapter are as follows:

• Drawing on recent research in cultural psychology, we propose a game-theoretic model of a culture based on the tendency of an individual to conform with others vs. being more individualistic in their behavior.

• Using this model, we provide conditions under which a population is open to changing the current norm in a society.

• Finally, we analyze the rate at which such norm changes occur and compare the rates in different cultures. We find that tighter cultures are more likely to be initially resistant to norm changes, but once they reach a tipping point, they change faster than looser cultures.

9.2 Background and related work

There has been widespread interest in studying the emergence of social norms in a population, both from an evolutionary perspective [You01, HO01, Mer38, Bic05, HYOR14] and in empirical research [JKV10, KJTW09, CB15]. There has been, however, much less work on understanding the processes that lead to change in an already established norm in a population. A related concept, the propagation of information in social networks, has been well studied (see [CLC13, Jac10, EK10] for an overview), but these works typically do not account for the differences in how individuals interact and influence each other in different cultures. Data science approaches have also explored this question; however, it is very challenging to separate out the various confounding factors (such as institutional influence) to establish clear causal relationships [ZZX15, LGRC12, KYC+12, LMK+13].

This chapter extends the results of Chapter 8. While we studied the processes of norm change in Chapter 8, there were limitations in the model that made it difficult to analyze the speed of norm change. In the next section, we describe our new model, which is more amenable to mathematical analysis and provides a clearer picture of the factors that affect the speed of norm change in different cultures.
9.3 Proposed evolutionary game-theoretic model

Consider an infinite, well-mixed population (i.e., each individual can interact with any other individual in the population) that evolves according to the well-known replicator dynamic [HS03]. For simplicity of presentation, suppose each agent may choose one of two possible actions: $A$ and $B$ (see Section 9.4 for a discussion of the assumptions used in our model). The two actions $A$ and $B$ correspond to possible norms that the society could settle on. Let $x_A$ and $x_B$ denote the proportions of the population using actions $A$ and $B$ respectively, with $0 \le x_A, x_B \le 1$ and $x_A + x_B = 1$, and let $x = (x_A, x_B)$. According to the replicator dynamic, the rate of change in the proportions of agents using each action is given by (7.1):

\[ \dot{x}_i = x_i \left[ f_i(x) - \phi(x) \right], \]

where $i \in \{A, B\}$, $\dot{x}_i = dx_i/dt$ (i.e., the rate of change of $x_i$), $f_i(x)$ is the fitness of action $i$, and $\phi(x)$ denotes the average fitness of the population, i.e., $\phi(x) = x_A f_A(x) + x_B f_B(x)$. The replicator dynamic is based on the idea that the proportion of agents with a particular strategy increases when it achieves an expected fitness higher than the average fitness, and vice versa.

Let $u_A$ and $u_B$ denote the payoffs associated with actions $A$ and $B$, where $0 < u_A, u_B < 1$ and $u_A + u_B = 1$. To define the fitness function $f_i$, we use the key insight that in loose cultures, individuals tend to choose the action that is most beneficial to them; but in tight cultures, individuals tend to conform to the same action that others use, even if a different action might be more beneficial to each individual. To model this mathematically, we let $f_i$ be a weighted combination of the payoff $u_i$ and an additional conformism fitness measure $\theta_i$ that depends on whether the individual is conforming to others in the population. Let $m$ denote the parameter controlling the weighting between these two fitness measures, i.e., the amount of conformist transmission in a population.
Thus, we define $f_i$ as:

\[ f_i(x, m) = (1-m) u_i + m\, \theta_i(x, k), \tag{9.1} \]

where $0 \le m \le 1$, and we define the conformism fitness measure $\theta_i$ as:

\[ \theta_i(x, k) = \left[ 1 + \exp\left( -k (x_i - 0.5) \right) \right]^{-1}, \tag{9.2} \]

where $k > 0$. Note that we can vary the behavior of the conformism fitness measure $\theta_i$ using the parameter $k$ (see Figure 9.1). For example, when $k$ is large, $\theta_i$ is close to a step function where an agent has a non-zero conformism fitness only if they conform with the majority action:

\[ \theta_i^{\infty}(x) = \lim_{k \to \infty} \theta_i(x, k) = \begin{cases} 0, & \text{if } x_i < 0.5; \\ 0.5, & \text{if } x_i = 0.5; \\ 1, & \text{if } x_i > 0.5. \end{cases} \]

Figure 9.1: Plot of (9.2) for different values of $k$.

Note that with no conformism whatsoever ($m = 0$), each action's fitness depends solely on its payoff, which is typical of a very loose culture. On the other hand, with 100% conformist transmission ($m = 1$), $i$'s fitness depends solely on the conformism fitness measure (for the case of $\theta_i^{\infty}$, this means that $i$'s fitness depends solely on whether $i$ is in the majority or the minority of the population). This is more indicative of a very tight culture. For simplicity, for the rest of the chapter, we denote $\theta_i(x, k)$ as $\theta_i$ and $\theta_i^{\infty}(x)$ as $\theta_i^{\infty}$.

9.3.1 When does norm change occur?

Suppose norm $B$ has a higher utility than $A$, i.e., $u_B > u_A$. We are interested in analyzing the conditions under which a population shifts from norm $A$ to $B$ (norm change). We can rewrite the average fitness as:

\[ \phi(x) = (1-m)(x_A u_A + x_B u_B) + m (x_A \theta_A + x_B \theta_B). \tag{9.3} \]

We are interested in analyzing the rate of change in the proportion of $B$ individuals. From (7.1), (9.1), and (9.3), we get:

\[ \dot{x}_B = x_B (1 - x_B) \left[ (1-m)(u_B - u_A) + m(\theta_B - \theta_A) \right]. \tag{9.4} \]

Note that $\theta_B \ge \theta_A$ when $x_B \ge 0.5$. Since $u_B > u_A$, we see that $\dot{x}_B > 0$, i.e., $x_B$ will converge to 1 ($\lim_{t \to \infty} x_B = 1$) when $x_B \ge 0.5$. If $x_A > x_B$, i.e., if the current norm in the population is $A$, norm change takes place only if:

\[ m < \frac{u_B - u_A}{(u_B - u_A) + (\theta_A - \theta_B)}. \tag{9.5} \]

Thus, norm change takes place only if the population is loose enough, while tighter cultures are more resistant to change. Further, note that $\theta_A^{\infty} - \theta_B^{\infty} = 1$ when $x_A > x_B$. Thus, when the conformist fitness measure is the step function $\theta_i^{\infty}$, (9.5) becomes $m < (u_B - u_A)/(u_B - u_A + 1) < 0.5$. Thus, for $\theta_i^{\infty}$, norm change occurs only in loose cultures where individuals weigh their individual payoff more than whether they conform to others in the population.

Figure 9.2: Left: Heatmap of the right-hand side of (9.5) when $x_B = 0.1$, for various $u_B - u_A$ and $k$ values. Right: Heatmap of the right-hand side of (9.7), for various $u_B - u_A$ and $k$ values. Best viewed in color.

Figure 9.2 (left) shows a heatmap of how condition (9.5) varies with $u_B - u_A$ and $k$ when $x_B = 0.1$. We see that the bound on $m$ increases as $u_B - u_A$ increases, i.e., a population becomes more likely to switch the norm. On increasing $k$, we see that the bound on $m$ decreases. This makes intuitive sense, since a higher $k$ makes the difference in fitness between $A$ and $B$ clearer. Thus, in tight cultures, where people tend to agree more on what behaviors are appropriate vs. inappropriate in different situations [GRN+11], a higher $k$ would lead to more resistance to norm change.

9.3.2 Rate of norm change in tight vs. loose cultures

We are now interested in studying the speed with which norms change in different populations. Consider two possible values of $m$, namely $m_1$ and $m_2$, with $m_2 > m_1$ (i.e., $m_2$ corresponds to a more conformist culture than $m_1$). Let the corresponding values of $\dot{x}_B$ be denoted by $\dot{x}_B^1$ and $\dot{x}_B^2$, respectively. Assume further that both $m_1$ and $m_2$ satisfy (9.5), i.e., norm change takes place in both cultures. Analyzing the difference in the rates of change, from (9.4) we get:

\[ \dot{x}_B^2 - \dot{x}_B^1 = x_B (1 - x_B)(m_2 - m_1) \left[ (\theta_B - \theta_A) - (u_B - u_A) \right]. \tag{9.6} \]

Note that when $x_B \le 0.5$, $\theta_B - \theta_A \le 0$, which means that $\dot{x}_B^2 - \dot{x}_B^1 \le 0$, i.e., the more conformist culture would be slow to change initially.
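The quantities in (9.2), (9.4) and (9.5) are easy to evaluate numerically. The following sketch assumes only those three equations; the function names are hypothetical, `du` stands for $u_B - u_A$, and the logistic function is written in its equivalent `tanh` form so that large values of $k$ do not overflow.

```python
import math


def theta(x_i, k):
    """Conformism fitness of Eq. (9.2) for an action used by a share x_i.
    Uses 1 / (1 + exp(-z)) = 0.5 * (1 + tanh(z / 2)), which is safe for large k."""
    return 0.5 * (1.0 + math.tanh(0.5 * k * (x_i - 0.5)))


def x_b_dot(x_b, m, k, du):
    """Eq. (9.4): rate of change of the share using norm B, with du = u_B - u_A."""
    x_a = 1.0 - x_b
    return x_b * (1.0 - x_b) * ((1.0 - m) * du + m * (theta(x_b, k) - theta(x_a, k)))


def max_conformity_for_change(x_b, k, du):
    """Right-hand side of (9.5), for the case x_A > x_B: norm change requires m below this value."""
    return du / (du + theta(1.0 - x_b, k) - theta(x_b, k))


if __name__ == "__main__":
    du, k, x_b = 0.7, 10.0, 0.1
    bound = max_conformity_for_change(x_b, k, du)
    print("bound on m from (9.5):", bound)
    for m in (bound - 0.05, bound + 0.05):
        print("m =", m, "-> x_B_dot =", x_b_dot(x_b, m, k, du))
```

Values of `m` just below the bound give a positive rate of change at the chosen `x_b`, and values just above it give a negative one, which is the content of condition (9.5). Comparing `x_b_dot` for two values of `m` at some $x_B \le 0.5$ reproduces the sign of (9.6).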
To analyze the case when $x_B > 0.5$, let us assume $x_B = 0.5 + \epsilon$, for $\epsilon > 0$. Thus, $x_A = 0.5 - \epsilon$. Substituting into (9.2) gives $\theta_B - \theta_A = \tanh(k\epsilon/2)$, so from (9.6), we see that for $\dot{x}_B^2 - \dot{x}_B^1 > 0$, the following condition needs to hold:

\[ \epsilon > \bigl[ \ln(1 + u_B - u_A) - \ln(1 - (u_B - u_A)) \bigr] / k. \tag{9.7} \]

Figure 9.2 (right) plots a heatmap of how this bound varies with $u_B - u_A$ and $k$. We see that as $k$ increases, the point at which the more conformist culture starts changing faster moves closer to the point $x_B = 0.5$. Note that when $k \to \infty$, (9.7) reduces to $\epsilon > 0$, i.e., as soon as $x_B$ becomes a majority, greater conformism would produce a larger rate of change.

We are also interested in studying how the rate of change $\dot{x}_B$ varies with $x_B$ as a population switches to norm B, and how this relates to different levels of conformism. We first look at the maximum rate of change $\dot{x}_B^{\max} = \max_{x_B} \dot{x}_B$ (we calculate this numerically, given values for $k$, $m$ and $u_B - u_A$). Figure 9.3 (right) plots $\dot{x}_B^{\max}$ for different $m$ and $k$ values, where we set $u_B - u_A = 0.7$. These values were chosen such that the norm changes from A to B for all the considered combinations (using the bounds from Section 9.3.1). We see that when $k$ is low, lower conformism leads to higher $\dot{x}_B^{\max}$. However, as $k$ increases (i.e., as $\theta_B$ approaches $\theta_B^{\infty}$), there is a clear transition, where more conformist cultures end up having a higher maximum rate of change.

Figure 9.3: Left: Plot of (9.4) at $u_B - u_A = 0.7$. Right: Heatmap of $\max_{x_B} \dot{x}_B$ for various $k$ and $m$ values, with $u_B - u_A = 0.7$. Best viewed in color.

This effect is clearer in the left plot of Figure 9.3, where we show how $\dot{x}_B$ varies with $x_B$. We see that when $k$ is low, i.e., when there is no clear difference between A and B in their conformist fitness measures $\theta$, $\dot{x}_B$ changes slowly for both tight and loose cultures, with the loose culture having a higher rate of change.
With high $k$, however, we see that the tighter culture faces a “tipping point”, which results in a sudden increase in $\dot{x}_B$, with the tighter culture attaining a higher rate of change $\dot{x}_B$ than the loose culture. Thus, in a more conformist culture with high $k$, peer pressure initially impedes the switch to the more beneficial norm B. But once enough of the population has switched, a tipping point is reached where peer pressure causes the rest of them to switch very rapidly.

9.4 Discussion

In this chapter, we show that tight cultures sometimes experience a tipping point for norm change while loose cultures typically face a more gradual change. The results presented assume an infinite well-mixed population with two possible actions. We believe it would be relatively straightforward to extend our results to multiple actions. Assuming an infinite well-mixed population made our model mathematically tractable and allowed us to provide exact conditions for norm change. As future work, it would be interesting to extend this model to the finite population case, where interactions between individuals are dictated by a social network.

Chapter 10: On the evolution of ethnocentrism in human cultures

10.1 Introduction

Nearly all major conflicts across the globe, both current and historical, are characterized by individuals defining themselves and others in terms of their group membership. Substantial empirical evidence supports people's tendency to favor in-group members and show hostility towards out-group individuals [Taj82, BK85, HRW02, BFF06, CL09]. From an evolutionary perspective, numerous studies have shown how, in populations comprised of various groups, group-biased behavior that discriminates or is hostile against out-groups evolves or emerges readily and dominantly [HA06, CB07, AOW+09, GvdB11, FTC+12, MNVV12, HKS13].
Since humans are social beings who establish and define groups constantly, the development of out-group hostility and resulting group conflict might thus seem inevitable.

In contrast, however, statistics have shown that violence and out-group conflict have actually declined dramatically over the past few centuries of human civilization, suggesting out-group hostility is not inevitable after all [Pin11a, Pin11b]. What factors might lead to such a decrease in conflict? Evolutionary game-theoretic models can shed light on this question by exploring how various factors affect the emergence and maintenance of individuals' behaviors relating to group conflict.

\[
\begin{array}{l|cc}
 & \text{Cooperate} & \text{Defect} \\ \hline
\text{Cooperate} & b - c,\; b - c & -c,\; b \\
\text{Defect} & b,\; -c & 0,\; 0
\end{array}
\]

Figure 10.1: Prisoner's Dilemma payoff matrix used in our model.

Our evolutionary game model builds on a prior model developed in Hammond and Axelrod's pioneering work [HA06] on the evolution of ethnocentrism, and used in Hartshorn, et al. [HKS13]. In their model, agents had perceivable group tags, played one-shot Prisoner's Dilemma games with their neighbors, and could behave differently toward in-group members than out-group members. Each agent's inherited traits included a group tag, an action (cooperate or defect) to use with in-group members, and a similar action to use with out-group members. Thus there were four possible strategies: Cooperate with both in-group and out-group members; Defect against both in-group and out-group members; Ethnocentric (cooperate with in-group members, defect against out-group members); Traitorous (defect against in-group members, cooperate with out-group members). Using their model with four different groups (or group tags), we have replicated their result showing that after a period in which Cooperative agents are briefly abundant, evolutionary pressure leads to a predominance of Ethnocentric agents. Defectors and Traitors never establish themselves.
Since the agents in that model conditioned their actions only on the group tags, they were in effect group-entitative. That leaves open the question of whether there are conditions under which individual-entitative agents (agents that base their actions on knowledge of individuals per se rather than group tags) may be able to exist and perhaps even be favored by evolutionary pressures.

Figure 10.2: Sequence of events at each time step in our evolutionary game-theoretic model. The sequence of steps is the same as in Hammond and Axelrod's paper [HA06] except for the Mobility stage, which is new. For additional details, see the Methods section.

Moreover, that model does not incorporate mobility. Research in cultural psychology has demonstrated large empirical differences in residential mobility around the globe with important psychological consequences [Lon91, Ang00]. Researchers have shown that in high-mobility contexts, individuals change relationships often; they form new relationships and sever unwanted relationships with great ease [OKM+13, OSYA15]. In such contexts, having a broad network of weak ties and being open toward strangers (with whom it might be valuable to form relationships) is highly adaptive. Indeed, Oishi, et al. [OSYA15] observe that in highly mobile contexts, “since it is hard to keep track of behaviors of many strangers whom one meets, one needs to carefully avoid being associated with defectors or free-riders in order to exploit the greatest possible relational benefit” (p. 228). Thus, individuals are more likely to adopt strategies that try to evaluate the “trustworthiness and worth” [OSYA15] of others in highly mobile contexts, i.e., adopt individual-entitative strategies.
On the other hand, in low-mobility contexts, individuals have far fewer opportunities to form new relationships, and severing existing relationships can have extreme adverse effects such as being ostracized from one's only social circle [OSYA15], causing “the existential, social, and psychological death of the individual” (p. 755) [Lan92]. Based on these theories we would predict that group-entitative behavior and associative ethnocentrism are adaptive in low-mobility societies, yet maladaptive in high-mobility contexts, where individual-entitative strategies would be evolutionarily favored.

We have run extensive new evolutionary simulations, augmenting the prior model to include individual-entitative strategies and mobility; and our results show that the evolution of ethnocentrism is driven by low mobility. Indeed, our subsequent empirical analysis of archival data verifies that contexts with high residential mobility have less out-group hostility than those with low mobility.

In our evolutionary game model, agents are arranged on a toroidal (wrap-around) grid, so that every node on the grid is connected to 4 neighboring nodes. Initially the grid is empty. The sequence of events at each time step is shown in Figure 10.2; these are the same as in Hammond and Axelrod's paper [HA06] except for the Mobility stage, which is new. For additional details, see the Methods section.

The agents' strategies are similar to those in Hammond and Axelrod's model [HA06], where agents can distinguish between in-group and out-group members by observing the group tags. Hence agents' strategies can be conditioned on whether they are interacting with in-group or out-group members. In addition, in our model, each agent's strategies can be conditioned on the past history of other agents. Each agent can either be group-entitative or individual-entitative, and this is an inherited trait. A group-entitative agent i ignores individual identities.
Its actions toward an agent j depend only on its last encounter with anyone in j's group. It has two possibly different strategies: one for in-groups and another for out-groups. Each of those strategies is one of the following: AllC (always cooperate), AllD (always defect), TFT (Tit-for-Tat: play whatever action the opponent played in i's last interaction with anyone from j's group), or OTFT (play the opposite of what TFT would play). Note that this is unlike the models in previous chapters where agents had no memory of prior interactions. An individual-entitative agent i ignores other agents' group tags; i's action toward j depends only on its last encounter specifically with j. Thus i has one of the above four strategies, except that TFT and OTFT depend on i's last interaction with j specifically, rather than someone in j's group.

To model mobility, there is a probability m with which, at the beginning of each iteration, an agent moves to a randomly chosen empty spot on the grid. Thus a high value of m represents a highly mobile population, while a low value of m represents a population with low mobility. We vary m from 0 to 0.08 in our experiments. It is important to note that a mobility probability of 0.08 is quite high: it means that on average, 8% of the population move to different locations on each iteration, a substantial amount of movement even for small values of m. At higher levels of mobility (m > 0.1), cooperation breaks down in a society, and the majority of the population starts defecting; such a population is not representative of any stable society around the world.

10.2 Results

Figure 10.3 shows our results after letting the populations evolve for 30,000 iterations. Without mobility (i.e., m = 0), group-entitative agents comprise 75% of the population. These agents' strategies are predominantly out-group hostile (AllD) and in-group cooperative (AllC).
This is reasonably consistent with Hammond and Axelrod's model [HA06], but notice that even when m = 0, individual-entitative agents comprise about 25% of the population. As mobility increases, the evolutionary pressures shift to favor individual-entitative agents. For m > 0.02 they comprise about 80% of the population, and about 70% of them play TFT. Thus, the evolutionary dominance of group-entitative and ethnocentric strategies is thwarted by mobility.

The reason why low mobility favors group-entitative strategies while higher mobility favors individual-entitative strategies is related to the clustering of group members (Figure 10.3(f)). With low mobility, groups tend to cluster together heavily; hence agents interact primarily with in-group members. Thus the ethnocentric strategy (i.e., group-entitativity with in-group cooperation and out-group hostility) is effective and profitable in terms of payoff. Under higher mobility, however, agents are less clustered by group membership, hence more likely to interact with out-group members, and hence cannot rely on high payoffs from in-group interactions. Furthermore, group-entitative strategies are less effective because different group members are much less likely to have the same strategy. This favors the individual-entitative Tit-for-Tat (TFT) strategy (see the Methods section for more information on the clustering coefficient).

Figure 10.3: Proportions of actions and strategies as a function of mobility, after 30,000 iterations, averaged over 100 simulation runs. The plots show the proportions of (a) the group-entitative and individual-entitative agents, (b) the actions played by the agents, (c) the strategies of the individual-entitative agents, (d) the in-group and (e) out-group strategies of the group-entitative agents, (f) the degree of clustering on the grid.
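The degree of clustering discussed above is measured with the clustering coefficient defined in the Methods section (Section 10.4.2): for every grid location, four triples are formed from the location, one vertical neighbor, and one horizontal neighbor, and the coefficient is the fraction of all 4NM triples whose three sites hold agents with the same group tag. As a rough illustration of that definition, and not the simulation code behind the reported results, a Python sketch might look as follows. The grid encoding (a list of rows holding an integer group tag, or `None` for an empty site) and the function name are assumptions made here.

```python
def clustering_coefficient(grid):
    """Fraction of (site, vertical neighbor, horizontal neighbor) triples
    whose three sites all hold agents with the same group tag.

    grid: list of rows on a torus; each cell is a group tag, or None if
    the site is empty (this encoding is an assumption of the sketch).
    """
    n_rows, n_cols = len(grid), len(grid[0])
    same = 0
    for r in range(n_rows):
        for c in range(n_cols):
            tag = grid[r][c]
            if tag is None:
                continue  # a triple with an empty site cannot match
            for dr in (-1, 1):        # neighbor above / below
                for dc in (-1, 1):    # neighbor to the left / right
                    vertical = grid[(r + dr) % n_rows][c]
                    horizontal = grid[r][(c + dc) % n_cols]
                    if vertical == tag and horizontal == tag:
                        same += 1
    # Denominator: four triples per location on an N x M torus.
    return same / (4 * n_rows * n_cols)

# Small hypothetical grids for illustration.
uniform = [[0] * 4 for _ in range(4)]                        # one group
checker = [[(r + c) % 2 for c in range(4)] for r in range(4)]
bands = [[0] * 4, [0] * 4, [1] * 4, [1] * 4]                 # two bands
empty = [[None] * 4 for _ in range(4)]
```

A fully occupied single-group grid attains the maximum of the measure, while a checkerboard of two tags or an empty grid attains the minimum, which matches the stated range from 0 to 1.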
To illustrate the evolutionary trajectories that led to the reported results, Figures 10.4 and 10.5 show representative evolutionary trajectories for single simulation runs. In Figure 10.4, there is no mobility. Group-entitative agents quickly become a majority, and most of them are ethnocentric (in-group cooperative and out-group hostile). In Figure 10.5, the mobility probability is m = 0.05. Individual-entitative agents evolve to become a majority, with most of those agents playing Tit-for-Tat (TFT).

Figure 10.4: Single simulation run for 20000 generations with no mobility (m = 0). (a) Proportions of group-entitative and individual-entitative agents. (b) Relative proportions of the individual-entitative agents' strategies; Relative proportions of the group-entitative agents' (c) in-group and (d) out-group strategies.

Figure 10.5: Single simulation run for 30000 generations with mobility (m = 0.05). (a) Proportions of group-entitative and individual-entitative agents. (b) Relative proportions of the individual-entitative agents' strategies; Relative proportions of the group-entitative agents' (c) in-group and (d) out-group strategies.

To illustrate robustness of the results with our model, we also performed a series of experiments where we initialized the population on the grid to have high clustering of group-entitative or individual-entitative agents (instead of starting out with an empty grid). In each case, we notice that the results are the same as when we start out with an empty grid, i.e., group-entitative agents dominate under no mobility and individual-entitative agents dominate under higher values of mobility. This is due to the exploration dynamics (or mutation phase) in our model. The exploration dynamic has been shown to be a key aspect of evolutionary game-theoretic models for cultural evolution [TSS+10, ATTN12], and this ensures that our model remains robust to the initial conditions of the grid.

10.2.1 Empirical analysis

In order to complement these modeling efforts, we also gathered data to test the notion that mobility relates to lower ethnocentrism. We analyzed data from the U.S. Census Bureau [Ren11, Wor98] that provides measures of mobility in the 50 U.S. states (defined as the percentage of people born in the state of residence; reverse scored, with higher scores being reflective of higher mobility) and data from the DDB Needham Life Style Survey [Wor98]. We found that mobility was positively correlated with responses to the question “I am interested in the cultures of other countries” (r = 0.614, p < .001), and negatively correlated with responses to questions regarding ethnocentrism (e.g., Americans should always buy American products, r = −0.654, p < .001; The government should restrict imported products, r = −0.578, p < .001).

In addition, states that have higher mobility also have higher openness, one of the big five personality dimensions, which is associated with breadth of experience and interest in new ideas and other cultures (r = 0.321, p = .023) [JNS08].

10.3 Significance of the work

The evolution of cooperation has been of great scientific interest in many disciplines; and to date, many evolutionary and empirical studies have found that in-group-favoring and out-group-hostile behaviors are common. This has caused much concern that group conflict and ethnocentrism are an inevitable threat on our planet. We integrate research on group conflict with human mobility [Now06, OLS07, SYHT09, YSS09, Ois10, SYM10, WEMG11], and show for the first time that the evolution of ethnocentrism and group-entitative behavior is thwarted by high mobility.
As mobility is rapidly changing around the globe [OSYA15], this work predicts that group conflict will continue to decrease, in line with Pinker's historical analysis [Pin11a, Pin11b].

Mobility is an important and well-studied topic in cultural psychology [OLS07, SYHT09, YSS09, Ois10, SYM10]. Low mobility leads to conditions where interacting individuals are likely to be reproductively related and this has been shown to be important in the evolution of cooperation [Now06, WEMG11]. In our model, we find mobility plays a crucial role in the evolution of ethnocentrism in a society. More specifically, we establish that low mobility leads to in-group cooperation and out-group hostility. High mobility, on the other hand, leads to more individual-entitative behavior, where agents take actions based on the specific individuals they interact with, and not based on the group that those individuals belong to.

Another unique aspect of our model is that we allow for agents to have memory of previous actions played by other agents and the possibility of individual-entitative agents, where agents take actions based on the individual they are playing against rather than their tag. In a society with high mobility, agents would be moving to different parts of the grid, which leads to low clustering of agents belonging to the same group. Thus, agents with group-entitative strategies suffer, which leads to the evolution of individual-entitative strategies, with strategies like Tit-for-Tat gaining prominence. Under low mobility, on the other hand, agents of the same group cluster together much more, and simple group-entitative strategies like in-group cooperation and out-group hostility gain prominence.

It would be fruitful to incorporate mobility into other evolutionary game-theoretic models of conflict in future research.
Moreover, since mobility in a society could be motivated by multiple factors that could have divergent effects, it would be good for future models of mobility and ethnocentrism to incorporate models of these motivations. Mobility might reduce ethnocentrism when agents move for economic reasons, but mobility might not reduce ethnocentrism if agents move primarily to be among other in-group members. In all, our work shows for the first time that mobility is a critical factor that affects the dynamics of conflict with important implications for theory and policy.

10.4 Methods

10.4.1 Evolutionary dynamics of our model

Here is a more detailed version of the sequence of steps in our evolutionary game-theoretic model:

1. Birth: One agent with a random strategy appears at a random empty site, if such a site exists.
2. Base Payoff: Each existing agent receives the base payoff from the environment; we use a base payoff of 0.12 throughout our experiments.
3. Interaction Payoff: Each agent plays a game with each of its neighboring agents on the grid, receiving payoffs according to the game definition. The game played by the agents is the canonical 2-player cooperation/defection dilemma (Figure 10.1), with the benefit of cooperation b = 0.03 and the cost of cooperation c = 0.01. In this phase, the action chosen by an agent in each game depends on the type of that agent (i.e., group-entitative or individual-entitative) as well as the type of strategy being used by that agent.
4. Fitness: Each agent is assigned fitness equal to the agent's accumulated payoff.
5. Reproduction: In random order, each agent is given a chance to reproduce with probability equal to its fitness. If an agent gets a chance to reproduce, it places an offspring in a randomly chosen empty site in its neighborhood, if such a site exists. The offspring has the same traits as its parent, with a mutation rate of µ = 0.005 per trait.
6. Death: Each agent has a probability d = 0.1 of dying. If an agent dies, it is removed from the grid.
7. Mobility: Each agent has a probability m of moving to a randomly chosen empty spot on the grid.

10.4.2 Clustering coefficient

A clustering coefficient is a metric for measuring the amount of clustering of nodes in a graph. We can measure the clustering of group tags by comparing the group tags of the agents at neighboring locations. For each location (x, y) on the torus grid we consider four triples, each consisting of (x, y) and a pair of adjacent neighboring locations:

1. location (x, y), the neighbor above it, and the neighbor to its left;
2. location (x, y), the neighbor above it, and the neighbor to its right;
3. location (x, y), the neighbor below it, and the neighbor to its left;
4. location (x, y), the neighbor below it, and the neighbor to its right.

Our clustering coefficient is the total number of triples that contain three agents with the same group tag, divided by the total number of triples in the grid. For a torus grid of size N × M, the denominator in our metric, i.e., the total number of triples, is simply 4NM. The clustering coefficient lies in the range from 0 to 1; it is higher when agents of the same tag cluster together on the grid, and lower when there is less clustering of agents of the same tag.

10.4.3 Strategy set

When an individual-entitative agent i interacts with an agent j that i has never encountered before, or when a group-entitative agent i interacts with an agent from a group that i has never encountered before, i must choose whether to cooperate or defect. That choice is part of i's strategy; and in our experiments we allowed both possibilities. This doubled the total number of strategies in our simulations but made no meaningful difference in the results, so in favor of simplicity and clarity, we did not discuss this detail earlier.
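As one possible reading of the strategy set, the sketch below selects an agent's action for a single game. It is a hypothetical illustration rather than the simulation code: the `Agent` fields, the `memory` dictionary (keyed by the opponent's group tag for group-entitative agents and by the opponent's identity for individual-entitative agents), and the string labels for actions and strategies are all assumptions introduced here. Each strategy is encoded as a response to the last observed action, with the first-encounter action stored as a separate trait.

```python
from dataclasses import dataclass, field

C, D = "C", "D"  # cooperate, defect

# Response to the last action seen from the relevant opponent or group.
STRATEGIES = {
    "AllC": {C: C, D: C},
    "AllD": {C: D, D: D},
    "TFT":  {C: C, D: D},
    "OTFT": {C: D, D: C},
}

@dataclass
class Agent:
    ident: int               # hypothetical unique identifier
    tag: int                 # group tag
    group_entitative: bool
    strategy_in: str = "TFT"   # in-group strategy; the only strategy
    first_in: str = C          # used by individual-entitative agents
    strategy_out: str = "TFT"  # out-group strategy (group-entitative only)
    first_out: str = C         # first-encounter action for out-groups
    memory: dict = field(default_factory=dict)

def memory_key(agent, other):
    """Group-entitative agents remember groups; others remember individuals."""
    return other.tag if agent.group_entitative else other.ident

def choose_action(agent, other):
    """Pick C or D for one game against `other`."""
    use_out = agent.group_entitative and other.tag != agent.tag
    strategy = agent.strategy_out if use_out else agent.strategy_in
    first = agent.first_out if use_out else agent.first_in
    last = agent.memory.get(memory_key(agent, other))
    if last is None:
        return first  # no prior encounter with this individual or group
    return STRATEGIES[strategy][last]

def record(agent, other, other_action):
    """Store the action `other` just played against `agent`."""
    agent.memory[memory_key(agent, other)] = other_action
```

In this encoding, AllC and AllD ignore the remembered action, so the first-encounter trait is the only place where they could deviate from their usual behavior; how the original simulations handle that corner case is not specified in the text.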
10.4.4 Mutation rate

In Hammond and Axelrod's model [HA06], during reproduction, an offspring will have the same strategy s as its parent, except that for each trait in s, there is a small probability µ = 0.05 that this trait will be changed to a randomly chosen one. Notice that µ is not the probability that the offspring's strategy will differ from s. Instead, for each trait in s, it is the probability that this trait will be changed; and this happens independently for each of the traits in s. Consequently, the probability that an offspring retains the exact same strategy as its parent decreases as the number of traits grows.

Our model has a higher number of different possible traits than Hammond and Axelrod's model. Thus, in order to maintain roughly the same probability that an offspring will retain the same strategy as its parent, we needed to use a smaller value for µ. In Hammond and Axelrod [HA06], each agent has 3 traits: the group tag and the actions to take when playing against an in-group and an out-group agent. However, in our model, the number of traits is significantly higher. Each group-entitative agent has 7 traits: the group tag, and traits specifying what action to take in each of the following six situations:

1. when an in-group agent cooperated on the last meeting,
2. when an in-group agent defected on the last meeting,
3. the first time one meets an in-group agent,
4. when an out-group agent cooperated on the last meeting,
5. when an out-group agent defected on the last meeting,
6. the first time one meets an out-group agent.

Similarly, each individual-entitative agent in our model has 4 traits. Thus, in order to ensure that the probability of an offspring retaining the same strategy as its parent is similar to that in Hammond and Axelrod's model, we needed to use a lower value for µ than what they used. We used µ = 0.005.
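Because each trait mutates independently with probability µ, an offspring with n traits keeps every trait of its parent with probability at least (1 − µ)^n (a mutated trait could, by chance, be redrawn to its old value, which this simple expression ignores). The short sketch below evaluates that expression for the trait counts and mutation rates quoted above. The function name is an assumption, and the numbers are only meant to show how the retention probability depends on µ and n.

```python
def retention_probability(mu, n_traits):
    """Probability that none of n_traits independent traits mutates."""
    return (1.0 - mu) ** n_traits

# Values of mu and trait counts as quoted in the text.
three_traits_mu_05 = retention_probability(0.05, 3)    # 3 traits, mu = 0.05
seven_traits_mu_005 = retention_probability(0.005, 7)  # group-entitative
four_traits_mu_005 = retention_probability(0.005, 4)   # individual-entitative

# For comparison: 7 traits with the larger mutation rate left unchanged.
seven_traits_mu_05 = retention_probability(0.05, 7)
```

Comparing the last value with the others shows why the larger trait count calls for a smaller µ: keeping µ = 0.05 with 7 traits lowers the retention probability noticeably below the 3-trait case.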
10.4.5 Range of mobility

In our experiments, the reason we limited the mobility probability to 0 ≤ m ≤ 0.08 is that cooperation breaks down at higher levels of mobility, as shown in Figure 10.6. For example, at m = 0.2, about 80% of all actions are defections. Figure 10.6 also shows the reason for this breakdown. As m increases, the average number of games that an agent plays with the same opponent decreases monotonically. For example, at m = 0.2, each agent plays only 2 games with each opponent on average, which favors AllD rather than Tit-for-Tat (TFT).

As a societal analogy, agents are more likely to defect against each other if no agent interacts with any other agent long enough to create any kind of interpersonal ties. For example, consider the limiting case where mobility probability m = 1.0, i.e., every agent moves to a different location in the grid at every iteration. This condition is similar to a well-mixed population, in which every agent can interact with every other agent in the population, with no network structure defining the set of possible interactions. Under well-mixed populations, it is well established that in a cooperation game like the Prisoner's Dilemma, the society devolves into universal Defection.

Figure 10.6: Cooperation breaking down at higher mobility values. Each data point is an average of 100 individual simulation runs. The plots show (a) the proportion of agents cooperating and defecting; and (b) over an agent's lifetime, the average number of unique opponents it encounters, and the average number of games played against each of them.

Bibliography

[ACH18] Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. arXiv preprint arXiv:1802.06509, 2018.
[AD11] Alekh Agarwal and John C Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pages 873–881, 2011.
[AHS15] Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. Fixed point optimization of deep convolutional neural networks for object recognition. In ICASSP. IEEE, 2015.
[Ald95] John Aldrich. Correlations genuine and spurious in Pearson and Yule. Statistical Science, pages 364–376, 1995.
[Ang00] Shlomo Angel. Housing policy matters: A global analysis. Oxford University Press, 2000.
[AOW+09] Tibor Antal, Hisashi Ohtsuki, John Wakeley, Peter D Taylor, and Martin A Nowak. Evolution of cooperation by phenotypic similarity. Proceedings of the National Academy of Sciences, 106(21):8597–8600, 2009.
[ATTN12] Benjamin Allen, Arne Traulsen, Corina E Tarnita, and Martin A Nowak. How mutation affects evolutionary games on graphs. Journal of theoretical biology, 299:97–105, 2012.
[BA99] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
[BB88] Jonathan Barzilai and Jonathan M Borwein. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8(1):141–148, 1988.
[BCNW12] Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical programming, 134(1):127–155, 2012.
[BD99] Jock A Blackard and Denis J Dean. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and electronics in agriculture, 24(3):131–151, 1999.
[Ben12] Yoshua Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural networks: Tricks of the trade, pages 437–478. Springer, 2012.
[Ber11] Dimitri P Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning, 2010(1-38):3, 2011.
[BFF06] Helen Bernhard, Urs Fischbacher, and Ernst Fehr. Parochial altruism in humans. Nature, 442(7105):912–915, 2006.
[BFL+17] David Balduzzi, Marcus Frean, Lennox Leary, JP Lewis, Kurt Wan-Duo Ma, and Brian McWilliams. The shattered gradients problem: If resnets are the answer, then what is the question? arXiv preprint arXiv:1702.08591, 2017. [Bic05] Cristina Bicchieri. The grammar of society: The nature and dynamics of social norms. Cambridge University Press, 2005. [BIL+15] Carlo Baldassi, Alessandro Ingrosso, Carlo Lucibello, Luca Saglietti, and Riccardo Zecchina. Subdominant dense clusters allow for simple learning and high compu- tational performance in neural networks with discrete synapses. Physical review letters, 115(12):128101, 2015. [BK85] Marilynn B Brewer and Roderick M Kramer. The psychology of intergroup atti- tudes and behavior. Annual review of psychology, 36(1):219–243, 1985. [Blu93] Lawrence E Blume. The statistical mechanics of strategic interaction. Games and Economic Behavior, 5(3):387–424, 1993. [BMEWL11] Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The million song dataset. In Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011), 2011. [Bot09] Léon Bottou. Curiously fast convergence of some stochastic gradient descent al- gorithms. In Proceedings of the symposium on learning and data science, Paris, 2009. [Bot12] L Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pages 421–436. Springer, 2012. [BSF94] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term depen- dencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166, 1994. [BSW14] Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature communications, 5, 2014. [BT89] Dimitri P Bertsekas and John N Tsitsiklis. Parallel and distributed computation: numerical methods. Prentice-Hall, Inc., 1989. [BTPG15] Guillaume Bouchard, Théo Trouillon, Julien Perez, and Adrien Gaidon. 
Accel- erating stochastic gradient descent via online learning to sample. arXiv preprint arXiv:1506.09016, 2015. 194 [BVL13] Daniel Balliet and Paul AM Van Lange. Trust, punishment, and cooperation across 18 societies a meta-analysis. Perspectives on Psychological Science, 8(4):363–379, 2013. [CB07] Jung-Kyoo Choi and Samuel Bowles. The coevolution of parochial altruism and war. science, 318(5850):636–640, 2007. [CB15] Damon Centola and Andrea Baronchelli. The spontaneous emergence of conven- tions: An experimental study of cultural evolution. Proceedings of the National Academy of Sciences, 112(7):1989–1994, 2015. [CBD15] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In NIPS, 2015. [CHS+16] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Ben- gio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016. [CKF11] Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. Torch7: A matlab- like environment for machine learning. In BigLearn, NIPS Workshop, 2011. [CL09] Yan Chen and Sherry Xin Li. Group identity and social preferences. The American Economic Review, 99(1):431–457, 2009. [CLC13] Wei Chen, Laks VS Lakshmanan, and Carlos Castillo. Information and influence propagation in social networks. Synthesis Lectures on Data Management, 5(4):1– 177, 2013. [CR16] Dominik Csiba and Peter Richtárik. Importance sampling for minibatches. arXiv preprint arXiv:1602.02283, 2016. [CSML15] Zhiyong Cheng, Daniel Soudry, Zexi Mao, and Zhenzhong Lan. Training binary multilayer neural networks for image classification using expectation backpropa- gation. arXiv preprint arXiv:1503.03562, 2015. [DBLJ14] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. 
In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
[DCM+12] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231, 2012.
[DD+14] Aaron Defazio, Justin Domke, et al. Finito: A faster, permutable incremental gradient method for big data problems. In Proceedings of The 31st International Conference on Machine Learning, pages 1125–1133, 2014.
[DG16] Soham De and Tom Goldstein. Efficient distributed SGD with variance reduction. In 2016 IEEE International Conference on Data Mining. IEEE, 2016.
[DYJG16] Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Big Batch SGD: Automated inference using adaptive batch sizes. arXiv preprint arXiv:1610.05792, 2016.
[EH14] Jean Ensminger and Joseph Henrich. Experimenting with social norms: Fairness and punishment in cross-cultural perspective. Russell Sage Foundation, 2014.
[EK10] David Easley and Jon Kleinberg. Networks, crowds, and markets: Reasoning about a highly connected world. Cambridge University Press, 2010.
[FS12] Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.
[FTC+12] Feng Fu, Corina E Tarnita, Nicholas A Christakis, Long Wang, David G Rand, and Martin A Nowak. Evolution of in-group favoritism. Scientific reports, 2:460, 2012.
[GAGN15] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In ICML, 2015.
[GB10] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.
[GDG+17] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
[GOP15] Mert Gürbüzbalaban, Asu Ozdaglar, and Pablo Parrilo. Why random reshuffling beats stochastic gradient descent. arXiv preprint arXiv:1510.08560, 2015.
[GRN+11] Michele J Gelfand, Jana L Raver, Lisa Nishii, Lisa M Leslie, Janetta Lun, Beng Chong Lim, Lili Duan, Assaf Almaliach, Soon Ang, Jakobina Arnadottir, et al. Differences between tight and loose cultures: A 33-nation study. Science, 332(6033):1100–1104, 2011.
[GS16] Tom Goldstein and Christoph Studer. Phasemax: Convex phase retrieval via basis pursuit. arXiv preprint arXiv:1610.07531, 2016.
[GSB14] Tom Goldstein, Christoph Studer, and Richard Baraniuk. A field guide to forward-backward splitting with a FASTA implementation. arXiv eprint, abs/1411.3406, 2014.
[GvdB11] Julián García and Jeroen CJM van den Bergh. Evolution of parochial altruism by multilevel selection. Evolution and Human Behavior, 32(4):277–287, 2011.
[HA06] Ross A Hammond and Robert Axelrod. The evolution of ethnocentrism. Journal of Conflict Resolution, 50(6):926–936, 2006.
[HAV+15] Reza Harikandeh, Mohamed Osama Ahmed, Alim Virani, Mark Schmidt, Jakub Konečný, and Scott Sallinen. Stop wasting my gradients: Practical svrg. In Advances in Neural Information Processing Systems, pages 2242–2250, 2015.
[HCS+16] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016.
[HEM+10] Joseph Henrich, Jean Ensminger, Richard McElreath, Abigail Barr, Clark Barrett, Alexander Bolyanatz, Juan Camilo Cardenas, Michael Gurven, Edwins Gwako, Natalie Henrich, et al. Markets, religion, community size, and the evolution of fairness and punishment.
Science, 327(5972):1480–1484, 2010.
[HF92] Markus Höhfeld and Scott E Fahlman. Probabilistic rounding in neural network learning with limited precision. Neurocomputing, 4(6):291–299, 1992.
[HG14] Jesse R Harrington and Michele J Gelfand. Tightness–looseness across the 50 united states. Proceedings of the National Academy of Sciences, 111(22):7990–7995, 2014.
[HKS13] Max Hartshorn, Artem Kaznatcheev, and Thomas Shultz. The evolutionary dominance of ethnocentric cooperation. Journal of Artificial Societies and Social Simulation, 16(3):7, 2013.
[HMB+06] Joseph Henrich, Richard McElreath, Abigail Barr, Jean Ensminger, Clark Barrett, Alexander Bolyanatz, Juan Camilo Cardenas, Michael Gurven, Edwins Gwako, Natalie Henrich, et al. Costly punishment across human societies. Science, 312(5781):1767–1770, 2006.
[HO01] Michael Hechter and Karl-Dieter Opp. Social norms. Russell Sage Foundation, 2001.
[HRW02] Miles Hewstone, Mark Rubin, and Hazel Willis. Intergroup bias. Annual review of psychology, 53(1):575–604, 2002.
[HS84] Josef Hofbauer and Karl Sigmund. Evolutionstheorie und dynamische Systeme. Parey, 1984.
[HS03] Josef Hofbauer and Karl Sigmund. Evolutionary game dynamics. Bulletin of the American Mathematical Society, 40(4):479–519, 2003.
[HS14] Kyuyeon Hwang and Wonyong Sung. Fixed-point feedforward deep neural network design using weights +1, 0, and -1. In IEEE Workshop on Signal Processing Systems (SiPS), 2014.
[HTG08] Benedikt Herrmann, Christian Thöni, and Simon Gächter. Antisocial punishment across societies. Science, 319(5868):1362–1367, 2008.
[HYOR14] Dirk Helbing, Wenjian Yu, Karl-Dieter Opp, and Heiko Rauhut. Conditions for the emergence of shared norms in populations with incompatible preferences. PloS one, 9(8):e104207, 2014.
[HZRS16a] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[HZRS16b] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016.
[IS15a] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. 2015.
[IS15b] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[Jac10] Matthew O Jackson. Social and economic networks. Princeton university press, 2010.
[JKV10] Stephen Judd, Michael Kearns, and Yevgeniy Vorobeychik. Behavioral dynamics and influence in networked coloring and consensus. Proceedings of the National Academy of Sciences, 107(34):14978–14982, 2010.
[JNS08] Oliver P John, Laura P Naumann, and Christopher J Soto. Paradigm shift to the integrative big five trait taxonomy. Handbook of personality: Theory and research, 3:114–158, 2008.
[JZ13] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
[KB14] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[KB15] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015.
[KH09] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images, 2009.
[KJTW09] Michael Kearns, Stephen Judd, Jinsong Tan, and Jennifer Wortman. Behavioral experiments on biased voting in networks. Proceedings of the National Academy of Sciences, 106(5):1347–1352, 2009.
[KLRT14] Jakub Konečný, Jie Liu, Peter Richtárik, and Martin Takáč. ms2gd: Mini-batch semi-stochastic gradient descent in the proximal setting. arXiv preprint arXiv:1410.4744, 2014.
[KMN+16] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang.
On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
[KNS16] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016.
[KR13] Jakub Konečný and Peter Richtárik. Semi-stochastic gradient descent methods. arXiv preprint arXiv:1312.1666, 2013.
[KS15] Minje Kim and Paris Smaragdis. Bitwise neural networks. In ICML Workshop on Resource-Efficient Machine Learning, 2015.
[KS17] Nitish Shirish Keskar and Richard Socher. Improving generalization performance by switching from adam to sgd. arXiv preprint arXiv:1712.07628, 2017.
[KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[KYC+12] Farshad Kooti, Haeryun Yang, Meeyoung Cha, P Krishna Gummadi, and Winter A Mason. The emergence of conventions in online social networks. In ICWSM, 2012.
[Lan92] Hope Landrine. Clinical implications of cultural differences: The referential versus the indexical self. Clinical Psychology Review, 12(4):401–415, 1992.
[LASY14] Mu Li, David G Andersen, Alex J Smola, and Kai Yu. Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems, pages 19–27, 2014.
[Lax07] P.D. Lax. Linear Algebra and Its Applications. Number v. 10 in Linear algebra and its applications. Wiley, 2007.
[LBBH98] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[LCMB16] Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, and Yoshua Bengio. Neural networks with few multiplications. ICLR, 2016.
[Lev03] Simon Levin. Complex adaptive systems: exploring the known, the unknown and the unknowable. Bulletin of the American Mathematical Society, 40(1):3–19, 2003.
[LGRC12] Janette Lehmann, Bruno Gonçalves, José J Ramasco, and Ciro Cattuto. Dynamical classes of collective attention in twitter. In Proceedings of the 21st international conference on World Wide Web, pages 251–260. ACM, 2012.
[LHLL15] Xiangru Lian, Yijun Huang, Yuncheng Li, and Ji Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2719–2727, 2015.
[LMK+13] Yu-Ru Lin, Drew Margolin, Brian Keegan, Andrea Baronchelli, and David Lazer. #bigbirds never die: Understanding social dynamics of emergent hashtag. arXiv preprint arXiv:1303.7144, 2013.
[LNS12] Guanghui Lan, Arkadi Nemirovski, and Alexander Shapiro. Validation analysis of mirror descent stochastic approximation method. Mathematical programming, 134(2):425–458, 2012.
[Lon91] Larry Long. Residential mobility differences among developed countries. International Regional Science Review, 14(2):133–147, 1991.
[LPW09] David Asher Levin, Yuval Peres, and Elizabeth Lee Wilmer. Markov chains and mixing times. American Mathematical Soc., 2009.
[LTA16] Darryl Lin, Sachin Talathi, and Sreekanth Annapureddy. Fixed point quantization of deep convolutional networks. In ICML, 2016.
[LZL16] Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.
[MB11] Eric Moulines and Francis R Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459, 2011.
[MBB17] Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of sgd in modern over-parametrized learning. arXiv preprint arXiv:1712.06559, 2017.
[Mer38] Robert K Merton. Science and the social order.
Philosophy of Science, 5(3):321–337, 1938.
[MG15] James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In International Conference on Machine Learning, pages 2408–2417, 2015.
[MH15] Maren Mahsereci and Philipp Hennig. Probabilistic line searches for stochastic optimization. In Advances In Neural Information Processing Systems, pages 181–189, 2015.
[MLM16] Daisuke Miyashita, Edward H Lee, and Boris Murmann. Convolutional neural networks using logarithmic data representation. arXiv preprint arXiv:1603.01025, 2016.
[MNVV12] Melissa M McDonald, Carlos David Navarrete, and Mark Van Vugt. Evolution and the psychology of intergroup conflict: The male warrior hypothesis. Phil. Trans. R. Soc. B, 367(1589):670–679, 2012.
[MOPU93] Michele Marchesi, Gianni Orlandi, Francesco Piazza, and Aurelio Uncini. Fast neural networks without multipliers. IEEE Transactions on Neural Networks, 4(1):53–62, 1993.
[MPP+15] Horia Mania, Xinghao Pan, Dimitris Papailiopoulos, Benjamin Recht, Kannan Ramchandran, and Michael I Jordan. Perturbed iterate analysis for asynchronous stochastic optimization. arXiv preprint arXiv:1507.06970, 2015.
[MR16] Aryan Mokhtari and Alejandro Ribeiro. Dsa: Decentralized double stochastic averaging gradient algorithm. Journal of Machine Learning Research, 17(61):1–35, 2016.
[Nes83] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). 1983.
[Nes13] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
[Now06] Martin A Nowak. Five rules for the evolution of cooperation. science, 314(5805):1560–1563, 2006.
[NWC+11] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 4. Granada, Spain, 2011.
[NWS14] Deanna Needell, Rachel Ward, and Nati Srebro. Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm. In Advances in Neural Information Processing Systems, pages 1017–1025, 2014.
[Ois10] Shigehiro Oishi. The psychology of residential mobility: Implications for the self, social relationships, and well-being. Perspectives on Psychological Science, 5(1):5–21, 2010.
[OKM+13] Shigehiro Oishi, Selin Kesebir, Felicity F Miao, Thomas Talhelm, Yumi Endo, Yukiko Uchida, Yasufumi Shibanai, and Vinai Norasakkunkit. Residential mobility increases motivation to expand social network: But why? Journal of Experimental Social Psychology, 49(2):217–223, 2013.
[OLS07] Shigehiro Oishi, Janetta Lun, and Gary D Sherman. Residential mobility, self-concept, and positive affect in social interactions. Journal of personality and social psychology, 93(1):131, 2007.
[OSYA15] Shigehiro Oishi, Joanna Schug, Masaki Yuki, and Jordan Axt. The psychology of residential and relational mobilities. Handbook of advances in culture and psychology, 5:221–272, 2015.
[Pin11a] Steven Pinker. The better angels of our nature: Why violence has declined, volume 75. Viking New York, 2011.
[Pin11b] Steven Pinker. Decline of violence: Taming the devil within us. Nature, 478(7369):309–311, 2011.
[PLT+16] Xinghao Pan, Maximilian Lam, Stephen Tu, Dimitris Papailiopoulos, Ce Zhang, Michael I Jordan, Kannan Ramchandran, Chris Re, and Benjamin Recht. Cyclades: Conflict-free asynchronous machine learning. arXiv preprint arXiv:1605.09721, 2016.
[PN02] Karen M Page and Martin A Nowak. Unifying evolutionary dynamics. Journal of Theoretical Biology, 219(1):93–98, 2002.
[Pol63] Boris Teodorovich Polyak. Gradient methods for minimizing functionals. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 3(4):643–653, 1963.
[Pro01] Danil Prokhorov. Ijcnn 2001 neural network competition. Slide presentation in IJCNN, 1, 2001.
[RDS+15] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet Large Scale Visual Recognition Challenge. IJCV, 2015.
[Ren11] Ping Ren. Lifetime mobility in the United States: 2010. US Department of Commerce, Economics and Statistics Administration, US Census Bureau, 2011.
[RGNL15] Patrick Roos, Michele Gelfand, Dana Nau, and Janetta Lun. Societal threat and cultural variation in the strength of social norms: An evolutionary basis. Organizational Behavior and Human Decision Processes, 129:14–23, 2015.
[RHBL07] Marc Aurelio Ranzato, Fu Jie Huang, Y-Lan Boureau, and Yann LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, pages 1–8. IEEE, 2007.
[RHS+15] Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, and Alex J Smola. On variance reduction in stochastic gradient descent and its asynchronous variants. In Advances in Neural Information Processing Systems, pages 2629–2637, 2015.
[RM51] Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
[RORF16] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. ECCV, 2016.
[RRWN11] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.
[RSB12] Nicolas Le Roux, Mark Schmidt, and Francis R Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pages 2663–2671, 2012.
[RSS11] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan.
Making gradient descent optimal for strongly convex stochastic optimization. arXiv preprint arXiv:1109.5647, 2011.
[Sha16] Ohad Shamir. Without-replacement sampling for stochastic gradient methods: Convergence results and application to distributed optimization. arXiv preprint arXiv:1603.00570, 2016.
[SHM14] Daniel Soudry, Itay Hubara, and Ron Meir. Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. In NIPS, 2014.
[SMDH13] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147, 2013.
[Smi82] John Maynard Smith. Evolution and the Theory of Games. Cambridge University Press, 1982.
[SP73] J. M. Smith and G. R. Price. The logic of animal conflict. Nature, 246:15–18, 1973.
[SRB13] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388, 2013.
[SS14] Ohad Shamir and Nathan Srebro. Distributed stochastic optimization and learning. In Communication, Control, and Computing (Allerton), 2014 52nd Annual Allerton Conference on, pages 850–857. IEEE, 2014.
[SYHT09] Joanna Schug, Masaki Yuki, Hiroki Horikawa, and Kosuke Takemura. Similarity attraction and actually selecting similar others: How cross-societal differences in relational mobility affect interpersonal similarity in japan and the usa. Asian Journal of Social Psychology, 12(2):95–103, 2009.
[SYM10] Joanna Schug, Masaki Yuki, and William Maddux. Relational mobility explains between-and within-culture differences in self-disclosure to close friends. Psychological Science, 2010.
[SZ13] Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning, pages 71–79, 2013.
[SZ14] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[SZ15] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR, 2015.
[SZL13] Tom Schaul, Sixin Zhang, and Yann LeCun. No more pesky learning rates. In Proceedings of The 30th International Conference on Machine Learning, pages 343–351, 2013.
[Taj82] Henri Tajfel. Social psychology of intergroup relations. Annual review of psychology, 33(1):1–39, 1982.
[THDS+09] Arne Traulsen, Christoph Hauert, Hannelore De Silva, Martin A Nowak, and Karl Sigmund. Exploration dynamics in evolutionary games. Proceedings of the National Academy of Sciences, 106(3):709–712, 2009.
[TJ78] Peter D Taylor and Leo B Jonker. Evolutionary stable strategies and game dynamics. Mathematical Biosciences, 40(1-2):145–156, 1978.
[TMDQ16] Conghui Tan, Shiqian Ma, Yu-Hong Dai, and Yuqiu Qian. Barzilai-borwein step size for stochastic gradient descent. arXiv preprint arXiv:1605.04131, 2016.
[TSS+10] Arne Traulsen, Dirk Semmann, Ralf D Sommerfeld, Hans-Jürgen Krambeck, and Manfred Milinski. Human strategy updating in evolutionary games. Proceedings of the National Academy of Sciences, 107(7):2962–2966, 2010.
[Ver] Roman Vershynin. High-dimensional probability.
[WCSX13] Chong Wang, Xi Chen, Alex J Smola, and Eric P Xing. Variance reduction for stochastic gradient optimization. In Advances in Neural Information Processing Systems, pages 181–189, 2013.
[WEMG11] Stuart A West, Claire El Mouden, and Andy Gardner. Sixteen common misconceptions about the evolution of cooperation in humans. Evolution and Human Behavior, 32(4):231–262, 2011.
[Wor98] DDB Worldwide. The DDB Life Style Survey Data. http://bowlingalone.com/?page_id=7, 1998. [Online; accessed 20-October-2015].
[WRLG18] Yuhuai Wu, Mengye Ren, Renjie Liao, and Roger Grosse.
Understanding short-horizon bias in stochastic meta-optimization. arXiv preprint arXiv:1803.02021, 2018.
[WRS+17] Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pages 4151–4161, 2017.
[WS98] Duncan J Watts and Steven H Strogatz. Collective dynamics of ’small-world’ networks. Nature, 393(6684):440–442, 1998.
[XZ14] Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
[You01] H Peyton Young. Individual strategy and social structure: An evolutionary theory of institutions. Princeton University Press, 2001.
[YSS09] Toshio Yamagishi, Naoto Suzuki, and M Schaller. An institutional approach to culture. Evolution, culture, and the human mind, pages 185–203, 2009.
[ZCL15] Sixin Zhang, Anna E Choromanska, and Yann LeCun. Deep learning with elastic averaging sgd. In Advances in Neural Information Processing Systems, pages 685–693, 2015.
[Zei12] Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
[ZHMD17] Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. ICLR, 2017.
[ZK16] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
[ZLS09] Martin Zinkevich, John Langford, and Alex J Smola. Slow learners are fast. In Advances in Neural Information Processing Systems, pages 2331–2339, 2009.
[ZWLS10] Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J Smola. Parallelized stochastic gradient descent. In Advances in neural information processing systems, pages 2595–2603, 2010.
[ZWN+16] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
[ZYG+17] Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization: Towards lossless CNNs with low-precision weights. ICLR, 2017.
[ZZX15] Leihan Zhang, Jichang Zhao, and Ke Xu. Who creates trends in online social media: The crowd or opinion leaders? Journal of Computer-Mediated Communication, 21(1):1–16, 2015.