ABSTRACT

Title of dissertation: GENERATION AND ANALYSIS OF STRATEGIES IN AN EVOLUTIONARY SOCIAL LEARNING GAME
James Ryan Carr, Doctor of Philosophy, 2013
Dissertation directed by: Professor Dana Nau, Department of Computer Science

An important way to learn new actions and behaviors is by observing others, and several evolutionary games have been developed to investigate what learning strategies work best and how they might have evolved. In this dissertation I present an extensive set of mathematical and simulation results for Cultaptation, which is one of the best-known such games. I derive a formula for measuring a strategy's expected reproductive success, and provide algorithms to compute near-best-response strategies and near-Nash equilibria. Some of these algorithms are too complex to run quickly on larger versions of Cultaptation, so I also show how they can be approximated to handle larger games, while still exhibiting better performance than the current best-known Cultaptation strategy for such games. Experimental studies provide strong evidence for the following hypotheses:

1. The best strategies for Cultaptation and similar games are likely to be conditional ones in which the choice of action at each round is conditioned on the agent's accumulated experience. Such strategies (or close approximations of them) can be computed by doing a lookahead search that predicts how each possible choice of action at the current round is likely to affect future performance.

2. Such strategies are likely to prefer social learning most of the time, but will have ways of quickly detecting structural shocks, so that they can switch quickly to individual learning in order to learn how to respond to such shocks. This conflicts with the conventional wisdom that successful social-learning strategies are characterized by a high frequency of individual learning, and agrees with recent experiments by others on human subjects that also challenge the conventional wisdom.

GENERATION AND ANALYSIS OF STRATEGIES IN AN EVOLUTIONARY SOCIAL LEARNING GAME

by

James Ryan Carr

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2012

Advisory Committee:
Professor Dana Nau, Chair/Advisor
Professor Michele Gelfand
Professor Mohammad Hajiaghayi
Professor Donald Perlis
Professor James Reggia

© Copyright by James Ryan Carr 2012

Acknowledgments

I would like to acknowledge the support and encouragement of many extraordinary people, without whom this work would have been impossible. I would like to thank Dr. Dana Nau for being a wonderful advisor, mentor, and friend, and especially for teaching me that half of research is arguing, and how to argue well. I would like to thank my other committee members, Drs. Michele Gelfand, James Reggia, Mohammad Hajiaghayi, Donald Perlis, and V.S. Subrahmanian, for their support, guidance, and patience. I would like to thank Dr. Marie desJardins for getting me interested in research and encouraging me to pursue a Ph.D. I would like to thank Eric Raboin and Austin Parker for their encouragement and contributions to our earlier studies of Cultaptation. I would like to thank my fellow graduate students in LCCD, Eric, Austin, Patrick Roos, Kan-Leung Cheng, Ron Alford, Vikas Shivashankar, and Brandon Wilson, for six years of friendship and advice.
This work was supported in part by AFOSR Grant FA95501210021, ARO Grant W911NF1110344, and ONR Grant W911NF0810144. The views expressed are those of the author and do not necessarily re ect the o cial policy or position of the funders. ii Table of Contents List of Figures v 1 Introduction 1 2 Background 6 2.1 Cultaptation Social-Learning Game . . . . . . . . . . . . . . . . . . . 6 2.2 Motivating Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2.1 Innovation, Observation, and Structural Shocks . . . . . . . . 13 2.2.2 Innovation and Observation Versus Exploitation . . . . . . . . 14 2.2.3 Best-Response Strategies in Cultaptation . . . . . . . . . . . . 15 3 Related Work 17 3.1 Social Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.2 Related Computational Techniques . . . . . . . . . . . . . . . . . . . 20 4 Game Analysis 22 4.1 Formal Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 4.1.1 Transition Probabilities . . . . . . . . . . . . . . . . . . . . . . 26 4.1.2 Accounting for Probability of Death . . . . . . . . . . . . . . . 28 4.1.3 Utility Functions . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.1.4 Strategy Representation . . . . . . . . . . . . . . . . . . . . . 29 4.1.5 Evaluating Strategies . . . . . . . . . . . . . . . . . . . . . . . 30 4.2 Analysis of EPRU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.2.1 Computation of EPRU . . . . . . . . . . . . . . . . . . . . . . 33 4.2.2 EPRU Corresponds to Reproductive Success . . . . . . . . . . 35 5 Strategy Generation Algorithms 43 5.1 Finding an -Best Response Strategy . . . . . . . . . . . . . . . . . . 43 5.1.1 Problem Speci cation . . . . . . . . . . . . . . . . . . . . . . . 43 5.1.2 Bounding EPRU . . . . . . . . . . . . . . . . . . . . . . . . . 44 5.1.3 Determining How Far to Search . . . . . . . . . . . . . . . . . 46 5.1.4 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 5.1.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 50 5.2 Cultaptation Strategy Learning Algorithm . . . . . . . . . . . . . . . 54 5.2.1 Implementation Details . . . . . . . . . . . . . . . . . . . . . . 57 5.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 5.3.1 Experiments with -Best Response Algorithm . . . . . . . . . 59 5.3.2 Experiments with the Cultaptation Strategy Learning Algo- rithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 iii 6 Improving Full-scale Cultaptation Strategies 78 6.1 The Tournament Winner: discountmachine . . . . . . . . . . . . . . . 78 6.1.1 Potential Problems with discountmachine . . . . . . . . . . . 80 6.2 A Full-scale Cultaptation Player: relaxedlookahead . . . . . . . . . . . 82 6.3 Experiments: RLA vs. discountmachine . . . . . . . . . . . . . . . . 85 6.3.1 Overall Performance . . . . . . . . . . . . . . . . . . . . . . . 85 6.3.2 Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . 89 7 Conclusion 102 A Proofs 108 B Converting from Histories to Repertoires 112 B.1 Repertoire De nition . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 B.2 Transition Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . 114 B.3 Consistency between P and PRep . . . . . . . . . . . . . . . . . . . . 117 B.4 Repertoire-Based Strategies . . . . . . . . . . . . . . . . . . . . . . . 120 B.5 Repertoire-Based Algorithm . . . . . . . . . . . . . . . . . . . . . . . 
125 C The Number of Pure Strategies for Cultaptation 128 Bibliography 130 iv List of Figures 2.1 Example game with a structural shock . . . . . . . . . . . . . . . . . 14 5.1 An illustration of a portion of the search performed by Algorithm 1 on a very small version of Cultaptation, with four actions and a set of possible action values equal to f20; 40; 80; 160g. Circle nodes indicate possible observations made by the agent, and triangle nodes indicate possible choices made by the agent (which have a random outcome, indicated by the line through the branches leading to its children). Ellipses indicate branches or subtrees omitted from the gure. The left side of the illustration shows how more actions can become avail- able to the agent as the game progresses, while the right side shows how potentially-changing values of actions must be modeled in the tree. Several techniques outlined in Section 5.1.5 can reduce the size of the tree that needs to be searched in the implementation of this algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 5.2 Number of nodes searched by -best response strategy algorithm. . . . 61 5.3 Average populations when sself and sEVC play against EVChooser . . 69 5.4 Probabilities of choosing each action for sself , sEVC, and EVChooser. . 72 5.5 Performance comparison between normal and \shock" environments for sself , sEVC, and EVChooser . . . . . . . . . . . . . . . . . . . . . . 75 6.1 Probabilities of choosing each type of action for relaxedlookahead and discountmachine, using Geo and c = 0:001. . . . . . . . . . . . . . . . 92 6.2 Probabilities of choosing each type of action for relaxedlookahead and discountmachine, using Geo and c = 0:05. . . . . . . . . . . . . . . . 93 6.3 Probabilities of choosing each type of action for relaxedlookahead and discountmachine, using Gam and c = 0:001. . . . . . . . . . . . . . . 94 6.4 Probabilities of choosing each type of action for relaxedlookahead and discountmachine, using Gam and c = 0:05. . . . . . . . . . . . . . . . 95 6.5 Performance comparison between relaxedlookahead and discountma- chine, using Geo and c = 0:001. . . . . . . . . . . . . . . . . . . . . . 98 6.6 Performance comparison between relaxedlookahead and discountma- chine, using Geo and c = 0:05. . . . . . . . . . . . . . . . . . . . . . . 99 6.7 Performance comparison between relaxedlookahead and discountma- chine, using Gam and c = 0:001. . . . . . . . . . . . . . . . . . . . . . 100 6.8 Performance comparison between relaxedlookahead and discountma- chine, using Gam and c = 0:05. . . . . . . . . . . . . . . . . . . . . . 101 v Chapter 1 Introduction An important way to learn new actions and behaviors is social learning, i.e., learning by observing others. Some social-learning theorists believe this is how most human behavior is learned [1], and it also is important for many other animal species [2, 3, 4]. Such learning usually involves evaluating the outcomes of others? actions, rather than indiscriminate copying of others? behavior [5, 6], but much is unknown about what learning strategies work best and how they might have evolved. For example, it seems natural to assume that communication has evolved due to the inherent superiority of copying others? success rather than learning on one?s own via trial-and-error innovation. However, there has also been substantial work questioning this intuition [7, 8, 9, 10, 11]. Several evolutionary games have been developed to investigate social learning [12, 13, 14, 15]. 
One of the best-known is Cultaptation, a multi-agent social-learning game developed by a consortium of European scientists [15] who sponsored an in- ternational tournament with a e10,000 prize.1 The rules of Cultaptation are rather complicated (see Section 2.1), but can be summarized as follows: Each agent has three kinds of possible actions: innovation, observation, and exploitation. These are highly simpli ed analogs of the following real-world 1NOTE: I am not a liated with the tournament or with the Cultaptation project. 1 activities, respectively: spending time and resources to learn something new, learning something by communicating with another agent, and exploiting the learned knowledge. At each step of the game, each agent must choose one of the available actions. How an agent does this constitutes the agent?s \social learning strategy." Each action provides an immediate numeric payo and/or information about the payo s of other actions at the current round of the game. This information is not necessarily correct in subsequent rounds because the actions? payo s may vary from one round to the next, and the way in which they may vary is unknown to the agents in the game.2 Each agent has a xed probability of dying at each round. At each round, each agent may also produce o spring, with a probability that depends on how this agent?s average per-round payo compares to the average per-round payo s of the other agents in the game. A second Cultaptation tournament is scheduled to begin in February 2012. This tournament carries a e25,000 prize and introduces a few new concepts into the game, such as the ability for agents to improve actions they already know, and proximity-based observation. This work does not deal with these additions, although 2 For my analyses, I assume the payo s at each round are determined by an arbitrary function (which may be either deterministic or probabilistic), and I ana- lyze how strategies perform given various possible characteristics of that function. In general, such characteristics would not be known to any Cultaptation agent| but my objective is to examine the properties of strategies in various versions of Cultaptation, not to develop a Cultaptation agent per se. 2 I may address them in future work. My work has had two main objectives: (1) to study the nature of Cultaptation to see what types of strategies are e ective; and (2) more generally, to develop ways of analyzing evolutionary environments with social learning. This work includes the following results. First, it provides a formula for ap- proximating (to within any given error bound > 0) the expected per-round utility (EPRU) of a given strategy (Section 4.2), and a proof that a strategy with maximal EPRU will have the highest frequency in the limit (Section 4.2.2). This provides a basis for evaluating Cultaptation strategies analytically, rather than through sim- ulated games. Next, the work provides a strategy-generation algorithm that can construct a strategy that is within any given error bound > 0 of a best response to a given set of competing strategies (Section 5.1). It then presents the Cultaptation Strategy Learning Algorithm (CSLA), which uses the strategy-generation algorithm in an iterative self-improvement loop to attempt to nd a strategy that is a near-best response to itself, and is therefore a symmetric near-Nash equilibrium (Section 5.2). 
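To make the shape of this approach concrete, the sketch below illustrates the kind of iterative self-improvement loop described above. It is an illustration only, not the actual interface of the algorithms in Chapters 5 and 6: the function name best_response, its arguments, and the convergence test are assumptions made here for exposition.

```python
# Illustrative sketch only: `best_response(strategies, eps)` stands in for the
# epsilon-best-response generator of Section 5.1, and the equality test is a
# simplified stand-in for a real convergence criterion.

def find_symmetric_near_equilibrium(initial_strategy, best_response, eps, max_iters=100):
    """Iterate toward a strategy that is a near-best response to itself
    (and hence a symmetric near-Nash equilibrium)."""
    candidate = initial_strategy
    for _ in range(max_iters):
        # Compute a strategy within eps of a best response to a population
        # consisting entirely of the current candidate.
        response = best_response([candidate], eps)
        if response == candidate:
            # No profitable deviation was found: the candidate is a
            # near-best response to itself.
            return candidate
        candidate = response
    return candidate
```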
Finding such a strategy is desirable because it should be able to perform well against any set of competing strategies (i.e., any strategies it plays against in a tournament setting). The work then provides experimental results outlining the generation of an ap- proximate Nash equilibrium strategy, sself , in a small version of Cultaptation, and a performance comparison (in the smaller environment) between sself and EVChooser, a known good strategy from the Cultaptation tournament (Section 5.3). These ex- periments show that sself is able to outperform EVChooser, and provide several 3 insights into the characteristics of good Cultaptation strategies. For example, the experiments show that sself observes and exploits most of the time, but switches quickly to innovation when a structural shock occurs, switching back to observation and exploitation once it has learned how to respond to the shock. This con icts with the conventional wisdom [16, 7] that successful social-learning strategies are characterized by a high frequency of innovation, but it helps to explain both the results of the Cultaptation tournament [17] and some recent experimental results on human subjects [18]. Finally, the work shows how the previous results (which were run on smaller versions of Cultaptation) can be extended to the larger version used in the tourna- ment (Chapter 6). It rst shows how the analysis and experimental results outlined above can be used to identify potential problems in the best strategy from the Cul- taptation tournament (i.e., discountmachine). Next, it uses the formulae from the analysis to de ne a new strategy, relaxedlookahead, that avoids such weaknesses. Experimental results verify that relaxedlookahead is capable of outperforming dis- countmachine in a variety of environments similar to those used in the Cultaptation tournament, and provide an in-depth analysis of the factors that allowed relaxed- lookahead to perform better. Taken together, these results provide strong support for the following hypothe- ses about the best strategies for Cultaptation and similar games: First, the best strategies are likely to be conditional ones in which the choice of action at each round is conditioned on the agent?s accumulated experience. Such strategies (or close approximations of them) can be computed by doing a lookahead search that 4 predicts how each possible choice of action at the current round is likely to a ect future performance. Second, it is likely that the best strategies will observe and exploit most of the time, but will have ways of quickly detecting structural shocks, so that they can switch quickly to innovation in order to learn how to respond to such shocks. In addition to these insights about Cultaptation and social learning in gen- eral, the work also explains how the analysis and algorithms outlined above could be adapted to apply to future evolutionary games that, like Cultaptation, are signif- icantly more complex than classical evolutionary games. As researchers in the eld of evolutionary game theory continue to study more complex phenomena like social learning, it is likely that these weaker assumptions will be needed more frequently, and so techniques like the ones presented here will be necessary more often. 5 Chapter 2 Background 2.1 Cultaptation Social-Learning Game This section gives a more detailed description of the Cultaptation social learn- ing game, adapted from the o cial description [15]. 
The game is a multi-agent, round-based game in which each agent chooses one action each round. There are N agents playing the game, where N is a parameter of the game. No agent knows of any other agent's actions at any point in the game except through the Obs action specified below. The actions available to each agent are innovation (Inv), observation (Obs), and exploitation ($X_1, \ldots, X_M$, where M is a parameter of the game). Each Inv and Obs action informs the agent what the utility would be for one of the exploitation actions, and an agent may not use an exploitation action $X_i$ unless the agent has previously learned of it through an innovation or observation action. Here are some details:

Exploitation. Each exploitation action $X_i$ provides utility specific to that action (e.g., $X_1$ may provide utility 10 and $X_2$ may provide utility 50). The utility assigned to each action at the beginning of the game is drawn from a probability distribution $\pi$, where $\pi$ is a parameter to the game. The utility provided by each exploitation action $X_i$ may change on round r, according to a probability $c_r$. The function c is a parameter to the game, and specifies the probability of change for every round of the game. When the changes occur, they are invisible to the agents playing the game until the agent interacts with the changed action. For instance: if an action's utility happens to change on the same round it is exploited, the agent receives the new utility, and discovers the change when the new utility is received. The new utility for a changed action is determined via the distribution $\pi$.

Innovation. When an agent uses the Inv action, it provides no utility, but it tells the agent the name and utility of some exploitation action $X_i$ that is chosen uniformly at random from the set of all exploitation actions about which the agent has no information. If an agent already knows all of the exploitation actions, then Inv is illegal, and indeed undesirable (when there is nothing left to innovate, why innovate?). The agent receives no utility on any round where she chooses an Inv action.

Observation. By performing an Obs action, an agent gets to observe the action performed and utility received by some other agent who performed an exploitation action on the previous round. Agents receive no utility for Obs actions, nor any information other than the action observed and its value: the agent being observed, for instance, is unknown. If none of the other agents performed an exploitation action on the previous round, then there were no $X_i$ actions to observe, so the observing agent receives no information. In some variants of the social learning game, agents receive information about more than one action when observing. I do not treat such variants directly in this proposal, but it is straightforward to extend my algorithms to take this difference into account.

Round #         1     2     3     4     5     ...   k
I1's action     Inv   X1    X1    X1    X1    ...   X1
I1's utility    0     3     6     9     12    ...   3(k-1)
Per round       0     1.5   2     2.25  2.4   ...   3(k-1)/k
I2O's action    Inv   Inv   Obs   X3    X3    ...   X3
I2O's utility   0     0     0     8     16    ...   8(k-3)
Per round       0     0     0     2     3.2   ...   8(k-3)/k

Table 2.1: Action sequences from Example 1, and their utilities.
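To make these rules concrete, here is a minimal sketch, in Python, of the round-to-round bookkeeping for the exploitation actions described above. The distribution stand-in sample_pi, the class name, and the specific payoff values are illustrative assumptions; in the real game the payoff distribution $\pi$ and the per-round change probability are game parameters, and the change probability may vary by round.

```python
import random

def sample_pi():
    """Stand-in for the game's payoff distribution pi (values are illustrative)."""
    return random.choice([3, 5, 8, 5])

class ExploitationActions:
    """Tracks the hidden utilities of the exploitation actions X_1..X_M."""

    def __init__(self, num_actions, change_prob):
        # Utilities are drawn from pi at the start of the game.
        self.values = [sample_pi() for _ in range(num_actions)]
        self.change_prob = change_prob  # assumed constant here for simplicity

    def step(self):
        """Each round, every action's utility changes with probability c;
        the new utility is redrawn from pi (invisibly to the agents)."""
        for i in range(len(self.values)):
            if random.random() < self.change_prob:
                self.values[i] = sample_pi()

    def exploit(self, i):
        """Exploiting X_i returns its current utility."""
        return self.values[i]
```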
Example 1. Consider two strategies: the innovate-once strategy (hereafter I1), which innovates exactly once and exploits that innovated action (whatever it is) for the rest of the game, and the innovate-twice-observe-once strategy (hereafter I2O), which innovates twice, observes once, and exploits the highest-valued action of the actions discovered for the rest of the game. For simplicity of exposition, suppose there are only four exploitation actions: $X_1$, $X_2$, $X_3$, and $X_4$. The values for each of these actions are drawn from a distribution; in this example we will assume that they are chosen to be 3, 5, 8, and 5, respectively. For simplicity, we will assume the probability of change is 0. Suppose there are two agents: one I1 and one I2O. For the first action, I1 will innovate, which we suppose gives I1 the value of action $X_1$. On every sequential action, I1 will choose action $X_1$, exploiting the initial investment. If the agent dies k rounds later, then the history of actions and utilities will be that given in Table 2.1, giving a utility of $3(k-1)$ and a per-round utility of $3(k-1)/k$. In contrast, I2O will innovate, informing it of the utility of $X_3$: 8; then it will innovate again, informing it of the utility of $X_4$: 5; and finally it will observe. On the second round, I1 performed $X_1$, and since these are the only two agents, this was the only exploitation action performed. Therefore, I2O's observation action on the next round must report that another agent got a utility of 3 from action $X_1$ last round (if there were multiple possibilities, one would be chosen uniformly at random). On round 4, I2O then knows that actions $X_1$, $X_3$, and $X_4$ have utilities of 3, 8, and 5, respectively. Since the probability of change is 0, the obvious best action is $X_3$, which I2O performs for the rest of her life. The utility of I2O on round k is $8(k-3)$, making the per-round utility $8(k-3)/k$. Note that on rounds 2 to 4, I2O will have a worse per-round utility than I1, while after round 4, the utility of I2O will be higher (this is important because reproduction is tied to per-round utility, as I will show shortly).

Formally, everything that an agent knows about each round can be described by an action-percept pair, $(a, (m, v))$, where $a \in \{\mathit{Inv}, \mathit{Obs}, X_1, \ldots, X_M\}$ is the action that $\alpha$ chose to perform, and $(m, v)$ is the percept returned by the action. More specifically, $m \in \{X_1, \ldots, X_M, \emptyset\}$ is either an exploitation action or a null value, and v is the utility observed or received. While a is chosen by the agent, m and v are percepts the agent receives in response to that choice. If a is Inv or Obs, then v is the utility of exploitation action m. If a is Obs and no agent performed an exploitation action last round, then there is no exploitation action to be observed, hence $m = \emptyset$ and $v = 0$. If a is some $X_i$, then m will be the same $X_i$ and v will be the utility the agent receives for that action. The agent history for agent $\alpha$ is a sequence of such action-percept pairs, $h_\alpha = \langle (a_1,(m_1,v_1)), \ldots, (a_k,(m_k,v_k))\rangle$. As a special case, the empty (initial) history is $\langle\rangle$.

Example 2. The history for I2O in Example 1 is:
\[
h_{I2O} = \langle (\mathit{Inv},(X_3,8)),\ (\mathit{Inv},(X_4,5)),\ (\mathit{Obs},(X_1,3)),\ (X_3,(X_3,8)),\ \ldots \rangle
\]

To concatenate a new action-percept pair onto the end of a history, I use the $\oplus$ symbol. For example, $h_\alpha \oplus (a,(m,v))$ is the history $h_\alpha$ concatenated with the action-percept pair $(a,(m,v))$. Further, for $h_\alpha = \langle p_1, p_2, \ldots, p_k\rangle$, where each $p_i$ is some action-percept pair, I let $h_\alpha[i] = p_i$, and $h_\alpha[i,\ldots,j]$ be the subhistory $\langle p_i, \ldots, p_j\rangle$.

Strategies.
The Cultaptation game is ultimately a competition among strate- gies. Here, a strategy is a function from histories to the set of possible actions: s : h 7! m, where h is a history of an agent using s and m is Inv, Obs or Xi for some i. Since each strategy may depend on the entire history, the set of possible 9 strategies is huge;1 but any particular Cultaptation game is a competition among a much smaller set of strategies S, which I will call the set of available strategies. For example, if there are n contestants, each of whom chooses a strategy to enter into the game, then in this case, S = fthe strategies chosen by the contestantsg: (2.1) Each strategy in S may be used by many di erent agents, and the strategy pro le at each round of the game may change many times as the game progresses. When an agent reproduces, it passes its strategy on to a newly created agent, with the per- round utility of each agent determining its likelihood of reproduction. A strategy?s success is measured by its average prevalence over the last quarter of the game [15]. The replication dynamics work as follows. On each round, each agent has a 2% chance of dying. As such, I also include a parameter d in my formulation representing the probability of death (d defaults to 0:02). Upon death, an agent is removed from the game and replaced by a new agent, whose strategy is chosen using the reproduction and mutation mechanisms described below. Mutation happens 2% of the time, and reproduction happens 98% of the time. Reproduction. When reproduction occurs, the social learning strategy used by the newborn agent is chosen from the strategies of agents currently alive with a probability proportional to their per-round utility (the utility gained by an agent 1 The number of possible mixed strategies is, of course, in nite. But even if we consider only pure strategies, the number is quite huge. For a 10,000-round Cultaptation game of the type used in the Cultaptation tournament, a loose lower bound on the number of pure strategies is 1009:4 10 20155 [19]. In contrast, it has been estimated [20] that the total number of atoms in the observable universe is only about 1078 to 1082. 10 divided by the number of rounds the agent has lived). The agent with the highest per-round utility is thus the most likely to propagate its strategy on reproduction. Example 3. Again looking at the sequences of actions in Table 2.1, we see that both agents would have equal chance of reproducing on round 1. However, on round 2 I1 has a per-round utility of 1:5, while I2O has a per-round utility of 0, meaning I1 gets 100% of the reproductions occurring on round 2. Round three is the same, but on round 4 I1 has a per round utility of 2:25 and I2O has a per-round utility of 2. This means that I1 gets 100 2:25=4:25 = 53% of the reproductions and I2O gets 100 2=4:25 = 47% of the reproductions on round 4. Mutation. In Cultaptation, mutation does not refer to changes in an agent?s codebase (as in genetic programming). Instead, it means that the new agent?s strat- egy s is chosen uniformly at random from the set of available strategies, regardless of whether any agents used s on the previous round. For instance, if there were a Cultaptation game pitting strategies I1 and I2O against one another, then a new mutated agent would be equally likely to have either strategy I1 or I2O, even if there were no living agents with strategy I1. Game Types. In the Cultaptation tournament [17], two types of games were played: pairwise games and melee games. 
A pairwise game was played with an invading strategy and a defending strategy. The defending strategy began play with a population of 100 agents, while the invading strategy began with none. Mutation was also disabled for the rst 100 rounds, to allow the defending strategy time to begin earning utility. After 100 rounds, mutation was enabled and the invader had the challenging task of establishing a foothold in a population consisting entirely of agents using the defending strategy (most of whom would have had time to nd several high-payo actions). Since the pairwise games provide a clear early- 11 game advantage to the defender, they were typically played twice with the invader and defender swapping roles on the second game. A melee game was played with n strategies, for some n > 2. Initially, the population of 100 agents was evenly divided between each strategy in the game. Mutation was disabled for the last quarter of the game, so that it would not in uence results when strategies had similar tness. Scoring. If we have k social learning strategies s1; : : : ; sk playing Cultapta- tion, then on any given round there will be some number nj of agents using strategy sj, for 1 j k. Strategy sj?s score for the game is the average value of nj over the nal 2,500 rounds of the game. The strategy with the highest score is declared the winner. The only way an agent may a ect nj is through reproduction. I will show in Section 4.2.2 that any strategy maximizing an agent?s expected per-round utility (de ned in Section 4.1.5) will also maximize its reproduction. I will therefore focus on computing the expected per-round utility. 2.2 Motivating Discussion The purpose of this section is to explain the motivations for several aspects of my work: Sections 2.2.1 and 2.2.2 give examples of types of strategies that seem like they should work well at rst glance, but can have unexpectedly bad consequences. The existence of such situations motivate the algorithms described later in this proposal, which perform a game tree search in order to consider strategies? 12 long-term consequences. An important way of getting insight into a game is to examine its best-response strategies; and this approach is at the heart of my formal analysis and my game-tree search algorithms. Section 2.2.3 explains some issues that are im- portant for nding best-response strategies in Cultaptation. 2.2.1 Innovation, Observation, and Structural Shocks If we want to acquire a new action to exploit, then what is the best way of doing it: to observe, or to innovate? At rst glance, observing might seem to be the best approach. If the other agents in the environment are competent, then it is likely that they are exploiting actions that have high payo s, hence we should be able to acquire a better action by observing them than by innovating. This suggests that an optimal agent will rely heavily on observation actions. However, the following example shows that relying only on observation actions can lead to disastrous consequences if there is a structural shock, i.e., a large change in the value of an exploitation action.2 Example 4. Structural shocks: Figure 2.1 shows a Cultaptation game in which all agents use the following strategy: each agent begins with a single Obs action, followed by a single Inv action if the Obs action returns ;,3 in order to obtain an exploitation action Xi which the agent will use in all subsequent rounds. 
Agent A3 acquires action $X_4$ by doing an unsuccessful Obs followed by an Inv; and A1 and A2 acquire $X_4$ by observing A3. At first, $X_4$ is far better than the other exploitation actions, so all of the agents do well by using it. On round 8, the action $X_4$ changes to the lowest possible value, but the agents continue to use it anyway. Furthermore, any time a new agent is born, it will observe them using $X_4$ and will start using it too. This is a pathological case where the best action has disappeared and the agents are in a sense "stuck" exploiting the suboptimal result. Their only way out is if all agents die at once, so that one of the newly born agents is forced to innovate. Experiments in Section 5.3.1 show that in some cases, situations like these are a big enough risk that a near-best response strategy will choose innovation moves more frequently than observation moves.

Round   X1   X2   X3   X4   A1                A2                A3
1       2    4    1    9    N/A               N/A               (Obs, (∅, 0))
2       2    4    1    9    N/A               N/A               (Inv, (X4, 9))
3       2    4    1    9    Birth             N/A               (X4, (X4, 9))
4       5    4    1    9    (Obs, (X4, 9))    Birth             (X4, (X4, 9))
5       5    2    1    9    (X4, (X4, 9))     (Obs, (X4, 9))    (X4, (X4, 9))
6       1    2    1    9    (X4, (X4, 9))     (X4, (X4, 9))     (X4, (X4, 9))
7       1    2    8    9    (X4, (X4, 9))     (X4, (X4, 9))     Death
8       1    2    8    1    (X4, (X4, 1))     (X4, (X4, 1))     N/A
...     ...  ...  ...  ...  ...               ...               ...

Figure 2.1: An example of a game in which there is a large structural shock. The columns for the exploitation actions $X_i$ show their values at each round, and the columns for agents A1-A3 show their histories. Note that by round 6, all agents choose action $X_4$, which has changed to a very low value. Since none of the agents are innovating, none of them can find the newly optimal action $X_3$.

Footnote 2: I have borrowed this term from the Economics literature, where it has an analogous meaning (e.g., [21, 22]).
Footnote 3: This will generally only happen on the first round of the game, before any agent has obtained an exploitation action.

2.2.2 Innovation and Observation Versus Exploitation

One might also think that agents should perform all of their innovation and observation actions first, so that they have as many options as possible when choosing an action to exploit. However, as Raboin et al. [23] demonstrate, this intuition is not always correct. Because the game selects which agents reproduce based on average per-round utility, not total accumulated utility, it is frequently better for a newborn agent to exploit one of the first actions it encounters, even if this action has a mediocre payoff (e.g., exploiting an action with value 10 on the second round of an agent's life gives it as much per-round payoff as exploiting an action with value 50 on the tenth round). Once the agent has at least some per-round utility, so that it has a nonzero chance of reproducing, it can then begin searching for a high-valued action to exploit for the rest of its lifetime.

2.2.3 Best-Response Strategies in Cultaptation

A widely used technique for getting insight about a game (e.g., see [24]) is to look at the game's best-response strategies. Given an agent $\alpha$ and a strategy profile (i.e., an assignment of strategies to agents) $s_{-\alpha}$ for the agents other than $\alpha$, $\alpha$'s best response is a strategy $s_{opt}$ that maximizes $\alpha$'s expected utility if the other agents use the strategies in $s_{-\alpha}$. In Cultaptation, it is more useful to consider a best response to the set of available strategies S, rather than any particular strategy profile. During the course of a Cultaptation game, the strategy profile will change many times as agents die and other agents are born to take their places.
Each strategy in S will be scored based on its average performance over the course of the game; and we will see (in Section 4.2.2) that given S, each strategy?s score is independent of the initial strategy pro le if the game is su ciently long. 15 IfG is a Cultaptation game (i.e., a set of values for game parameters such as the number of agents, set of available actions, probability distribution over their payo s; see Section 4.1 for details), then for any agent , any set of available strategies S, and any history h for , there is a probability distribution Obs(ajh ;S) that gives the probability of observing each action a, given S and h . Given Obs and G, we can calculate the probability of each possible outcome for each action the agent might take, which will allow us to determine the best response to S. To compute Obs is not feasible except in general, but it is possible to compute approximations of it in some special cases (e.g., cases in which all of the agents, or all of the agents other than , use the same strategy). That is the approach used in this proposal. 16 Chapter 3 Related Work In this section I will discuss related work on social learning and on computa- tional techniques related to my own. 3.1 Social Learning The Cultaptation social learning competition o ers insight into open questions in behavioral and cultural evolution. An analysis of the competition is provided by Rendell et al. [17]. Of the strategies entered into the competition, those that per- formed the best were those that greatly favored observation actions over innovation actions, and the top performing strategy learned almost exclusively through observa- tion. This was considered surprising, since several strong arguments have previously been made for why social learning isn?t purely bene cial [7, 16]. However, this re- sult is consistent with observations made during my own experiments, in which the -best-response strategy rarely did innovation (see Section 5.3). In previous work, Carr et al. showed how to compute optimal strategies for a highly simpli ed versions of the Cultaptation social learning game [25]. Their paper simpli es the game by completely removing the observation action|which prevents the agents from interacting with each other in any way whatsoever, thereby transforming the game into a single-agent game rather than a multi-agent game. 17 Their model also assumes that exploitable actions cannot change value once they have been learned, which overlooks a key part of the full social learning game. Wisdom and Goldstone attempted to study social learning strategies using a game similar to Cultaptation, but using humans rather than computer agents [18]. Their game environment consisted of a group of \creatures," each of which had some hidden utility. The agents? objective was to select a subset of the creatures to create a \team," which was assigned a utility based on the creatures used to create it. Agents had a series of rounds in which to modify their team, and on each round they were allowed to see the teams chosen by other agents on the previous round (and in some cases, the utility of the other agents? teams), and the object of the game was to maximize the utility of one?s team. In this game, the acts of keeping a creature on one?s team, choosing a creature that another agent has used, and choosing a creature no one has yet used correspond to exploitation, observation, and innovation (respectively) in the Cultaptation game. 
The successful strategies Wisdom and Goldstone saw are similar to those used by the strategies found by my algorithm: they keep most of the creatures on their team the same from round to round (which corresponds in Cultaptation to per- forming mostly exploitation actions), and new creatures are mostly drawn from other agents? teams (which corresponds to preferring observation over innovation in Cultaptation). However, Wisdom and Goldstone highlight these characteristics as interesting because they run contrary to the conventional wisdom for social learning strategies, which suggests that broader exploration should lead to better perfor- mance, and therefore that successful strategies should innovate more often [16]. In 18 this case, analyzing the strategies found by my algorithm allowed me to draw the same conclusions about what works well. This gives more evidence that the conven- tional wisdom on social learning [7, 16] may be mistaken. How best to learn in a social environment is still considered a nontrivial prob- lem. Barnard and Sibly show that if a large portion of the population is learning only socially, and there are few information producers, then the utility of social learning goes down [9]. Thus, indiscriminate observation is not always the best strategy, and there are indeed situations where innovation is appropriate. Authors such as Laland have attempted to produce simple models for determining when one choice is preferable to the other [8]. Game theoretic approaches have also been used to explore this subject, but it is still ongoing research [26, 27]. Giraldeau et al. o er reasons why social information can become unreliable. Both biological factors, and the limitations of observation, can signi cantly degrade the quality of information learned socially [11]. Work by Nettle outlines the circumstances in which verbal communication is evolutionarily adaptive, and why few species have developed the ability to use language despite its apparent advantages [10]. Nettle uses a signi cantly simpler model than the Cultaptation game, but provides insight that may be useful to understanding social learning in general. In Nettle?s model, the population reaches an equilibrium at a point where both individual and social learning occur. The point of equilibrium is a ected by the quality of observed information and the rate of change of the environment. 19 3.2 Related Computational Techniques The restless bandit problem, a generalization of the stochastic multi-armed bandit problem that accounts for probability of change, is cited as the basis for the rules of the Cultaptation tournament [17]. The rules of the Cultaptation game di er from the restless bandit problem by including other agents, making observation actions possible and complicating the game signi cantly. I also show in Section 4.2.2 that maximizing total payo , the goal of the restless bandit problem, is di erent from maximizing expected per-round utility (EPRU) of an agent in the Cultaptation tournament. The restless bandit problem is known to be PSPACE-complete, meaning it is di cult to compute optimal solutions for in practice [28, 29]. Multi-armed bandit problems have previously been used to study the tradeo between exploitation and exploration in learning environments [30, 31]. As discussed later in Section 4.1, nding a best-response strategy in Cultap- tation is basically equivalent to nding an optimal policy for a Markov Decision Process. 
Consequently, my algorithm for finding near-best-response strategies has several similarities to the approach used by Kearns et al. to find near-optimal policies for large MDPs [32]. Both algorithms use the discount factor of the MDP (which, in the case of Cultaptation, is the probability of death d) and the desired accuracy $\epsilon$ to create a horizon for their search, and the depth h of this horizon depends on the discount factor and the branching factor, but not on the size of the full state space (unlike conventional MDP algorithms). Thus, both their algorithm and mine also have running time exponential in $1/\epsilon$ and in the branching factor. However, the algorithm provided by Kearns et al. was designed as an online algorithm, so it only returns the near-optimal action for the state at the root of the search tree. Mine, on the other hand, returns a strategy specifying which action the agent should take for all states that can occur on the first h rounds. This means that the exponential-time algorithm only needs to run once to generate an entire strategy, rather than once per agent per round in each game we simulate.

Many algorithms for optimal control of an MDP have been developed; however, they all have running time that grows linearly with the size of the state space of the MDP. This makes them intractable for problems like Cultaptation, which have exponentially large state spaces. Several approaches for near-optimal control, which produce a policy within some $\epsilon$ of optimal, have been developed [32, 33, 34].

Chapter 4
Game Analysis

This chapter presents the formal model of Cultaptation used throughout the rest of the work, as well as the derivation of a formula that, given the parameters of the game environment and the other strategies in the game, will allow us to predict which strategy has an evolutionary advantage.

4.1 Formal Model

In this section I introduce a formal mathematical model of Cultaptation games. A glossary of the notation used in this proposal is provided as Table 4.1.

Table 4.1: A glossary of notation used in this proposal.

N: Number of agents in the environment.
S: The set of available strategies. Agents may only use strategies in S.
r: Number of the current round, ranging from 1 to $\infty$.
c: The probability of change on all rounds.
d: The probability of death on all rounds.
$(a, (m, v))$: An action-percept pair in which the action a returns the percept $(m, v)$.
$h_\alpha$: Agent history for $\alpha$. A sequence of action-percept pairs experienced by agent $\alpha$.
$h_\alpha[i]$: The i-th action-percept pair in $h_\alpha$.
$X(h_\alpha)$: Number of exploitable actions given history $h_\alpha$.
M: Number of exploitation actions in the game.
$\pi$: Probability distribution for the new value of any action whose value changed at round r.
$Obs(m, v \mid h_\alpha, S)$: Probability that Obs returns action m with value v, given history $h_\alpha$ and available strategies S.
$Inv(v \mid r)$: Probability that Inv returns an action with value v on round r.
V: The set of potential utility values.
$P(h'_\alpha \mid h_\alpha, a, S)$: Probability of transitioning to history $h'_\alpha$ if $\alpha$ performs action a with history $h_\alpha$ and available strategies S.
$L(|h_\alpha|)$: Probability that $\alpha$ lives long enough to experience history $h_\alpha$.
T: Set of all action-percept pairs of the form $(a, (m, v))$.

Game Definition. Cultaptation requires a number of parameters to determine exactly how it will run. Therefore, I will define the game parameters, G, to be a set of values for the following: N, the number of agents; M, the number of exploitation actions in the game; c, the probability that an exploitation action changes its utility each round; $\pi$, the probability distribution used to assign a utility value to each exploitation action, both at the outset and each time an action's utility changes; and d, the probability of death. In the Cultaptation tournament, only the values of N, M, and d were known ahead of time, but for this analysis I use the values of the other parameters as well.
Recall that in the Cultaptation tournament [15], each evolutionary simulation was a contest between two or more strategies submitted to the tournament. Thus, there is a fixed set of strategies that are allowed to occur in a given simulation. I will call this the set of available strategies S, where $S = \{s_1, s_2, \ldots, s_\ell\}$ for some finite $\ell$ (i.e., in pairwise games $\ell = 2$, in melee games $\ell > 2$). Any strategy profile s that occurs in the simulation will consist only of strategies in S. When an agent is chosen to be replaced via mutation, its new strategy is selected at random from the strategies in S.

A Cultaptation game can now be defined formally, as follows. A Cultaptation game is an $\ell$-player game, in which each player receives the game parameters G as input. Each player then simultaneously chooses a strategy to put into the set of available strategies. Let player i's strategy be $s_i$, so that $S = \{s_1, s_2, \ldots, s_\ell\}$. The pair (G, S) is an instance of G. In (G, S), each player i will receive a payoff equal to score($s_i$), defined below.

Scoring. The version of Cultaptation used in the tournament continued for 10,000 rounds, and each strategy was assigned a score equal to its average population over the last 2,500 rounds. But as is often done in analyses of repeated games, the formal model assumes an infinite Cultaptation game, i.e., the game continues for an infinite number of rounds, and the score for strategy s is its average population over the entire game:
\[
\mathrm{score}(s) = \lim_{r \to \infty} \frac{\sum_{j=1}^{r} p(s, j)}{r},
\]
where $p(s, j)$ is the population size of agents using strategy s on round j. This greatly simplifies the analysis in Section 4.1.5, by allowing us to average out the various sources of noise present in the game.

Actions. The rest of the formal model will be constructed from the perspective of an arbitrary agent, $\alpha$, in a given infinite Cultaptation game instance (G, S). I use r for the number of a round, and $X(h_\alpha)$ to specify the number of exploitation actions available after history $h_\alpha$. After all exploitation actions $X_1, \ldots, X_M$ have been innovated or observed in a history $h_\alpha$, then $X(h_\alpha) = M$ and innovation actions become illegal.

I model the payoffs supplied for exploitation actions $X_i$ by a probability distribution $\pi$. $\pi(v)$ is the probability of an action having payoff v at the start of the game instance. $\pi(v)$ is also the probability that, when an action changes its payoff, the new payoff is v. I let V be the set of all action values that may occur with non-zero probability: $V = \{v \mid \pi(v) > 0\}$, where V has finite size.

If we let $Inv(v \mid r)$ be the probability that value v is innovated on round r, it can be defined recursively in terms of c and $\pi$ as:
\[
\mathit{Inv}(v \mid r) =
\begin{cases}
\pi(v), & \text{if } r = 0;\\
c\,\pi(v) + (1-c)\,\mathit{Inv}(v \mid r-1), & \text{otherwise.}
\end{cases}
\]
That is, initially the chance that Inv will return an action with value v is determined by the given distribution $\pi(v)$. On later rounds ($r > 0$), the chance that Inv will return an action with value v is the chance that an action's value changed to v on the current round (given by $c\,\pi(v)$), plus the chance that an action's value was v on the previous round and it did not change this round.
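The recursion above translates directly into a few lines of code. The sketch below is an illustration rather than the dissertation's implementation: it assumes $\pi$ is given as a dictionary mapping each value in V to its probability, and it unrolls the recursion from r = 0 upward instead of recursing, but it computes the same quantity.

```python
def inv_distribution(pi, c, r):
    """Inv(v | r) for every v in V, where `pi` maps each value v to pi(v)
    and `c` is the per-round probability of change.

    Round 0:      Inv returns v with probability pi(v).
    Round r > 0:  either the action's value changed to v this round
                  (c * pi(v)), or it was already v on the previous round
                  and did not change ((1 - c) * Inv(v | r - 1))."""
    dist = dict(pi)                  # base case: r = 0
    for _ in range(r):
        dist = {v: c * pi[v] + (1 - c) * p for v, p in dist.items()}
    return dist
```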
While computing the probability distribution for utilities of actions returned by Inv was fairly straightforward, computing a similar distribution for Obs actions is significantly more difficult. Let $\alpha$ be any agent, and S be the set of available strategies. From S we can get a probability distribution over the other agents' actions in any given situation; and from this we can derive $Obs(m, v \mid h_\alpha, S)$, the probability that Obs would return the action-percept pair $(m, v)$, given history $h_\alpha$. In order to derive Obs, we must consider each possible strategy profile $s_{-\alpha}$ for agents besides $\alpha$, determine how likely that strategy profile is to occur, and then determine what each agent in $s_{-\alpha}$ will do for every possible sequence of actions they could have encountered, bounded only by the percepts the agent has received in $h_\alpha$. As discussed in Section 2.2.3, the number of possible histories alone is astronomically large. Since Obs is conditioned on each possible history it will be larger still, so in any practical implementation the best we can do is to approximate Obs (Section 5.1.5 describes how we will do this). But for the theoretical development, I will assume we have an oracle for Obs, that will tell us exactly how likely we are to observe any given action-utility pair.

In what follows, I will show that given $\pi$, Obs, V, and S, we can calculate the possible outcomes of each action the agent may take, and the probability of each of these outcomes. This allows us to treat an infinite Cultaptation game as a Markov Decision Process (MDP) [35]. Calculating the best response in this case is equivalent to finding an optimal control policy for an MDP.

4.1.1 Transition Probabilities

A transition probability function $P(h'_\alpha \mid h_\alpha, a, S)$ defines the probability of transitioning from history $h_\alpha$ to history $h'_\alpha = h_\alpha \oplus (a,(m,v))$ in the next round if an agent performs action a. These transition probabilities are for the case where $\alpha$ does not die before reaching $h'_\alpha$; I introduce functions to account for the probability of death in Section 4.1.2. There are three cases for what $P(h'_\alpha \mid h_\alpha, a, S)$ might be, depending on whether a is an innovation, observation, or exploitation action.

If $a = \mathit{Inv}$, then
\[
P(h_\alpha \oplus (\mathit{Inv},(m,v)) \mid h_\alpha, \mathit{Inv}, S) =
\begin{cases}
\dfrac{\mathit{Inv}(v \mid r)}{M - X(h_\alpha)} & \text{if } h_\alpha \text{ contains no percepts that contain the action } m,\\[1ex]
0 & \text{otherwise.}
\end{cases}
\tag{4.1}
\]
Recall that an agent cannot innovate action m if it has already encountered m by innovating or observing.

Observation actions are not subject to the same restriction, so if $a = \mathit{Obs}$, then
\[
P(h_\alpha \oplus (\mathit{Obs},(m,v)) \mid h_\alpha, \mathit{Obs}, S) = \mathit{Obs}(m, v \mid h_\alpha, S)
\tag{4.2}
\]
where $Obs(m, v \mid h_\alpha, S)$ models the exploitation behavior of the other agents in the environment. Obviously, the exact probability distribution will depend on the composition of strategies used by these agents. The above definition is general enough to support a wide range of environments; and in Section 5.1.5 I will discuss one potential way to model this function for a more specific set of environments.

Finally, if $a = X_m$, then $h_\alpha$ must contain at least one percept for $X_m$. Let r be the last round at which the last such percept occurred. For the case where $X_m$'s utility did not change since round r, we have
\[
P(h_\alpha \oplus (X_m,(m,v)) \mid h_\alpha, X_m, S) =
\underbrace{(1-c)^{|h_\alpha|-r}}_{\text{prob. of not changing}} +
\underbrace{c\,\pi(v)\sum_{j=r}^{|h_\alpha|}(1-c)^{|h_\alpha|-j}}_{\text{prob. of changing back to } v}
\tag{4.3}
\]
In all other cases, no transition from h to h0 is possible, so P (h 0 jh ; a;S) = 0. Probability of Reaching a History We will frequently be interested in P (h js;S), the probability of history h occurring given that the agent is following some strategy s 2 S. We will be able to derive P (h js;S) iteratively, calculating the probability of each step of history h occurring using the functions derived above. Speci cally, P (h js;S) is the probability that each h [i] = (ai; (mi; vi)) occurs given the action chosen by the strategy in the history h [1; : : : ; i 1] = (a1; (m1; v1)) (ai 1; (mi 1; vi 1)), or: P (h js;S) = jh j 1Y i=1 P (h [1; : : : ; i] h [i+ 1]jh [1; : : : ; i]; s(h [1; : : : ; i]);S) (4.5) 4.1.2 Accounting for Probability of Death The probability of an agent living long enough to experience history h de- pends on the probability of death. It is L(jh j) = (1 d) jh j 1: (4.6) When we calculate the probability of reaching a given history h , we will generally multiply it by L(jh j) to account for the chance that the agent dies before reaching h . 28 Sometimes we will also be interested in the probability that a randomly- selected agent has history h . For this we will need to know the probability that a randomly-selected agent is exactly jh j rounds old, which is simply: L(jh j) P1 i=1 L(i) = L(jh j) 1 1 (1 d) = L(jh j) 1 d = dL(jh j): (4.7) 4.1.3 Utility Functions A utility function U((a; (m; v))) de nes the utility gleaned on action-percept pair (a; (m; v)): U((a; (m; v))) = 8 >>< >>: v; if 9i such that a = Xi; 0; otherwise. (4.8) Notice that U( ) is only non-zero on exploitation actions. The per-round utility (PRU) of history h , where h = (a1; (m1; v1)) (ajh j; (mjh j; vjh j)); is de ned to be the sum of the utility acquired in that history divided by the history?s length: PRU(h ) = 1 jh j jh jX i=1 U((ai; (mi; vi))) (4.9) 4.1.4 Strategy Representation A strategy s is de ned as a function mapping each history h 2 H to the agent?s next action s(h ) 2 fInv;Obs;X1; : : : ;XMg. For instance, the strategy I1 29 from Example 1 is de ned by the function: sI1(h ) = 8 >>< >>: Inv; if h is empty, Xi; for h = (Inv; (Xi; v)); : : : In this proposal we will deal with partially speci ed strategies. A partially spec- i ed strategy is a mixed strategy (i.e., a probability distribution over a set of pure strategies) that is de ned by a nite set Q of history action pairs (Q H fInv;Obs;X1; : : : ;XMg), in which each h 2 H appears at most once. Given any history h , if there is an action m such that (h ;m) 2 Q, then sQ chooses the action m. Otherwise, sQ chooses an action arbitrarily from all actions that are legal in h . Partially speci ed strategies have the advantage of being guaranteed to be nitely representable. 4.1.5 Evaluating Strategies At each round, an agent with history h has reproductive tness PRU(h ), and agents are selected to reproduce with probability proportional to their reproductive tness (i.e., using the replicator equation [36]). Since a strategy?s score is a function of its average population over the course of the game, we want some metric that allows us to compare the expected reproductive tness of two strategies. This will allow us to predict which strategy is more likely to win. At rst glance, it may appear that the way to predict which strategy will have higher expected reproductive tness is to compare their expected utilities. 
At first glance, it may appear that the way to predict which strategy will have higher expected reproductive fitness is to compare their expected utilities. However, prior work has shown that this is not the case: in Cultaptation, a strategy's expected reproductive fitness is not necessarily proportional to its expected utility [23]. I now present a simple example that illustrates this phenomenon.

Table 4.2: The expected utility and expected reproductive fitness of $s_{II}$ and $s_{IEI}$ from Example 5. $s_{IEI}$ has a higher expected reproductive fitness, and therefore will be likely to win a game against $s_{II}$, even though $s_{II}$ has a higher expected utility.

           Expected Utility   Expected Reproductive Fitness
s_II       90.25              65.074
s_IEI      88.825             65.185

Example 5. Consider an infinite Cultaptation game with no probability of change, no observation actions, probability of death $d = 0.05$, two exploitation actions valued at 65 and 100, and an innovate action that will return either exploitation action with uniform probability. This means that an agent needs to perform at most two innovate actions to have knowledge of the action with value 100, since innovating does not return an action the agent already knows. Let us compare two strategies: $s_{II}$ and $s_{IEI}$. Both strategies will perform an innovate as their first action. If the action they learn has value 100, both strategies will exploit that action until the agent dies. If the action learned has value 65, $s_{II}$ will perform a second innovate on its next turn, learning the action with value 100, and will exploit that action until its agent dies. Meanwhile, $s_{IEI}$ will exploit the action with value 65 once, before performing an innovate on its third turn to learn the action with value 100. It then exploits this action until its agent dies. Since the two strategies are identical when they learn the action with value 100 on their first action, and since this case is equally likely to be encountered by both strategies, we can ignore it for the purposes of comparing them. For the rest of this analysis we will assume the first innovate returns the action with value 65. In this case, we can calculate the expected utility for both strategies using geometric series, and we can calculate their expected reproductive fitnesses using methods described in Section 4.2. Table 4.2 presents these values. While $s_{II}$ has a higher expected utility, since it exploits the action with value 100 more often, $s_{IEI}$ has a higher expected reproductive fitness, since it does not wait as long to begin exploiting. Therefore, $s_{IEI}$ will be the likely winner in a contest between these two strategies.

Since we cannot always use a strategy's expected utility to determine whether it is expected to win, we will instead compute a strategy's expected reproductive fitness directly, by computing its Expected Per-Round Utility.

Definition. The Expected Per-Round Utility for a strategy $s_\alpha$, $\mathrm{EPRU}(s_\alpha \mid G, S)$, is the expected value of $\mathrm{PRU}(h_\alpha)$ over all possible histories $h_\alpha$ for a randomly-selected agent using strategy $s_\alpha \in S$ in an infinite Cultaptation game instance (G, S). □

To calculate $\mathrm{EPRU}(s_\alpha \mid G, S)$, we look at each possible history $h_\alpha$ and multiply $\mathrm{PRU}(h_\alpha)$ by the probability that a randomly-chosen agent using $s_\alpha$ has history $h_\alpha$. This probability is equal to the probability that a randomly-chosen agent is $|h_\alpha|$ rounds old (Equation 4.7) times the probability of reaching history $h_\alpha$ in $|h_\alpha|$ steps using strategy $s_\alpha$ (Equation 4.5). Hence, the EPRU of a strategy is:
\[
\mathrm{EPRU}(s_\alpha \mid G, S) = \sum_{h_\alpha \in H} d\,L(|h_\alpha|)\; P(h_\alpha \mid s_\alpha, S)\; \mathrm{PRU}(h_\alpha).
\]
Note that for a given environment, the probability of death d is a constant. Hence, in the analysis I will frequently factor it out.
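The dissertation evaluates this expectation analytically (Section 4.2), but the definition above also suggests a simple Monte Carlo sanity check: sample an agent's age from the geometric age distribution $d\,L(|h_\alpha|)$, roll out the strategy for that many rounds, and average the resulting per-round utilities. The sketch below is an illustration only; simulate_history is a hypothetical helper that performs the rollout in some model of the environment, and per_round_utility is reused from the earlier sketch.

```python
import random

def estimate_epru(strategy, simulate_history, d, num_samples=10_000):
    """Monte Carlo estimate of EPRU: average PRU(h) over histories of randomly
    sampled agents, whose ages are geometric with parameter d (an agent survives
    each round with probability 1 - d)."""
    total = 0.0
    for _ in range(num_samples):
        age = 1
        while random.random() > d:          # survive another round with prob. 1 - d
            age += 1
        history = simulate_history(strategy, age)   # hypothetical rollout helper
        total += per_round_utility(history)
    return total / num_samples
```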
Example 6. Recall the innovate-once strategy, which innovates once to learn an action and then exploits that action until it dies. Suppose this strategy exists in an environment with a probability of death of 0.2 and only one possible exploit action, with non-changing value 10. All agents using this strategy will therefore learn the only action on their first round, and then exploit an action with value 10 on all subsequent rounds. Hence, there is only one possible history for a j-round-old agent using this strategy, and its per-round utility is 10(j-1)/j. The probability that a randomly-selected agent will be j rounds old is 0.2 · L(j) = 0.2 · 0.8^{j-1}. Thus the expected per-round utility achieved by this strategy in this environment is \sum_{j=1}^{\infty} 0.2 \cdot 0.8^{j-1} \cdot 10(j-1)/j.

4.2 Analysis of EPRU

In this section I examine methods for computing the expected per-round utility of a strategy. First I present a method for computing an approximation to the EPRU for a given strategy; then I present a proof that a strategy maximizing EPRU will also maximize its average population in an infinite Cultaptation game instance.

4.2.1 Computation of EPRU

I will now define a formula that can be used to compute EPRU exactly for a given strategy s. The definition of EPRU given in Section 4.1.5 used a "backward" view: for every possible history h, it looked back through h to determine PRU(h). Notice, however, that h must have some preceding history h', where h = h'·t for some action-percept pair t. This definition of EPRU must examine h' and h independently, even though their only difference is the addition of t. For this reason, it will make more sense computationally to use a "forward" view of EPRU: we will construct a recursive function on s and h which, for each possible h·t:

- calculates the per-round utility gained from t, both for history h·t and for all histories that can be reached from h·t, and then
- recurses on s and h·t.

For the calculation in the first bullet, we will use the formula EVexp(r, v), which computes the expected amount of per-round utility we gain (on the current round and on future rounds) by exploiting a value v on round r:

\mathrm{EV}_{\exp}(r, v) = \sum_{j=r}^{\infty} \frac{L(j)\,v}{j} = v \sum_{j=r}^{\infty} \frac{1}{j}(1-d)^{j-1}.    (4.10)

Using known properties of infinite series, EVexp can also be expressed as

\mathrm{EV}_{\exp}(r, v) = v\left(\sum_{i=1}^{\infty} \frac{1}{i}(1-d)^{i-1} - \sum_{i=1}^{r-1} \frac{1}{i}(1-d)^{i-1}\right) = v\left(\frac{\ln d}{d-1} - \sum_{i=1}^{r-1} \frac{1}{i}(1-d)^{i-1}\right)    (4.11)

and is therefore computable.

We can now express the expected per-round utility of a strategy s recursively in terms of the average per-round payoff of an agent:

\mathrm{EPRU}_{\mathrm{alt}}(s, h \mid G, S) = \sum_{t \in T} P(h \cdot t \mid h, s(h), S)\,\bigl(\mathrm{EV}_{\exp}(|h \cdot t|, U(t)) + \mathrm{EPRU}_{\mathrm{alt}}(s, h \cdot t \mid G, S)\bigr)    (4.12)

where T is the set of all possible action-percept pairs, and h·t represents a possible history on the next round. Note that the size of T is finite. A proof that EPRU(s | G, S)/d = EPRU_alt(s, ⟨⟩ | G, S) is included in [19].

Unfortunately, computing EPRU_alt is not possible since it suffers from infinite recursion. To handle this, I introduce a depth-limited computation of EPRU_alt, which only computes the portion of the total EPRU contributed by the first k rounds:

\mathrm{EPRU}^{k}_{\mathrm{alt}}(s, h \mid G, S) = \begin{cases} 0, & \text{if } k = 0, \\ \sum_{t \in T} P(h \cdot t \mid h, s(h), S)\,\bigl(\mathrm{EV}_{\exp}(|h \cdot t|, U(t)) + \mathrm{EPRU}^{k-1}_{\mathrm{alt}}(s, h \cdot t \mid G, S)\bigr), & \text{otherwise.} \end{cases}    (4.13)

I prove in Section 5.1.3 that if the search depth k is deep enough, EPRU^k_alt(s, h | G, S) will always be within ε of EPRU_alt(s, h | G, S).
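The closed form in Equation 4.11 and the series in Example 6 are easy to check numerically. The following sketch is illustrative only; the truncation depth is arbitrary (with d = 0.2 the geometric terms vanish quickly, so a few thousand terms are more than enough for double precision), and the class name is invented for this example.

// Numerical check of Example 6 and of Equation 4.11 against Equation 4.10.
public class EpruSeriesCheck {
    // Example 6: sum_{j>=1} 0.2 * 0.8^(j-1) * 10 * (j-1)/j, truncated.
    static double exampleSix(int terms) {
        double sum = 0.0;
        for (int j = 1; j <= terms; j++) {
            sum += 0.2 * Math.pow(0.8, j - 1) * 10.0 * (j - 1) / j;
        }
        return sum;
    }

    // EVexp(r, v) by direct truncation of Equation 4.10 ...
    static double evExpTruncated(int r, double v, double d, int terms) {
        double sum = 0.0;
        for (int j = r; j <= terms; j++) {
            sum += (1.0 / j) * Math.pow(1.0 - d, j - 1);
        }
        return v * sum;
    }

    // ... and by the closed form of Equation 4.11.
    static double evExpClosedForm(int r, double v, double d) {
        double partial = 0.0;
        for (int i = 1; i <= r - 1; i++) {
            partial += (1.0 / i) * Math.pow(1.0 - d, i - 1);
        }
        return v * (Math.log(d) / (d - 1.0) - partial);
    }

    public static void main(String[] args) {
        System.out.println("Example 6 EPRU ~ " + exampleSix(5000));
        System.out.println("EVexp(3,10)    ~ " + evExpTruncated(3, 10.0, 0.2, 5000));
        System.out.println("EVexp closed   = " + evExpClosedForm(3, 10.0, 0.2));
    }
}

The truncated and closed-form values of EVexp should agree to within floating-point error, which is the sense in which Equation 4.11 makes Equation 4.10 computable.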
4.2.2 EPRU Corresponds to Reproductive Success

This section provides a proof that if a strategy has the highest EPRU for the given environment, it will also have the optimal expected probability of reproducing. This proof applies only to pairwise games, but the same techniques should apply to arbitrary (finite) numbers of strategies.

Assume we have an infinite Cultaptation game instance (G, S), as defined in Section 4.1, made up of agents using strategies s and s' (i.e., S = {s, s'}). Recall from Section 4.1 that the score for strategy s is

\mathrm{score}(s) = \lim_{r \to \infty} \frac{\sum_{i=0}^{r} p(s, i)}{r}

where p(s, i) is the number of agents using strategy s on round i. Our objective for this section will be to show that EPRU(s | G, S) > EPRU(s' | G, S) if and only if score(s) > score(s').

I begin by defining a reset event, which will help illustrate some interesting properties of infinite Cultaptation.

Definition. Let n and n' be the number of agents using s and s', respectively, on the first round of the game instance, and let N = n + n'. A reset event occurs when all the agents in the environment die on two consecutive rounds, and on the second round they are replaced (via mutation) by n agents using s and n' agents using s'. The probability of a reset event occurring is \rho = d^N d^N m^N \binom{N}{n} 0.5^N. □

In other words, after a reset event occurs the conditions are identical to those that were present on the first round; the game instance has essentially started over. Note that ρ is the same on every round, and it is always greater than 0.

Since the game instance continues for an infinite number of rounds, there will be an infinite number of reset events. Thus, if we were to run other game instances with S = {s, s'}, both strategies would have the same score each time. Therefore, we also know that we can define each strategy's score as a function of its expected population at each round, rather than its population for a single game instance. This gives us

\lim_{r \to \infty} \frac{\sum_{i=0}^{r} p(s, i)}{r} = \lim_{r \to \infty} \frac{\sum_{i=0}^{r} \mathrm{EP}(s, i)}{r}    (4.14)

where EP(s, r) is the expected population of agents using strategy s on round r. I will also define EAU(s, r) to be the expected agent utility of strategy s on round r; that is, EAU(s, r) is the expected PRU of a randomly-chosen agent using strategy s on round r.

EP(s, r) can now be defined recursively for each strategy using the mechanics of Cultaptation, as follows. Let EP(s, 0) = n and EP(s', 0) = n'. Then, for r ≥ 0,

\mathrm{EP}(t, r+1) = \underbrace{(1-d)\,\mathrm{EP}(t, r)}_{\text{survived from previous round}} + \underbrace{N d (1-m)\,\frac{\mathrm{EP}(t, r)\,\mathrm{EAU}(t, r)}{\mathrm{TU}(r)}}_{\text{new agents from selection}} + \underbrace{\frac{N d m}{2}}_{\text{new agents from mutation}}

where t ∈ {s, s'} and TU(r) = EP(s, r) EAU(s, r) + EP(s', r) EAU(s', r) is the expected total utility on round r. Recall from Section 4.1 that N is the total number of agents in the environment, d is the probability of death, and m is the probability of mutation.

I now consider the behavior of EAU(s, r) as r increases.

Lemma 1. For any strategy s, \lim_{r \to \infty} \mathrm{EAU}(s, r) = \mu for some finite μ.

Proof. Let u(s, r) be the expected utility of a single agent using strategy s when r rounds have passed since the first round or the last reset event. For all r, we know that 0 ≤ u(s, r) ≤ V_max/d, since agents cannot earn negative utility, and no strategy can have better expected performance than a strategy that exploits the best possible action for its entire expected lifespan of 1/d rounds. We can rewrite EAU(s, r) in terms of u(s, r) as follows:

\mathrm{EAU}(s, r) = \left(\sum_{i=0}^{r-1} (1-\rho)^{i}\, u(s, i)\right) + (1-\rho)^{r}\, u(s, r).
Taking the limit of this form gives us

\lim_{r \to \infty} \mathrm{EAU}(s, r) = \lim_{r \to \infty}\left(\sum_{i=0}^{r-1} (1-\rho)^{i} u(s, i)\right) + \lim_{r \to \infty} (1-\rho)^{r} u(s, r) = \lim_{r \to \infty}\left(\sum_{i=0}^{r-1} (1-\rho)^{i} u(s, i)\right).

Since u(s, i) is bounded and \sum_{i=0}^{r-1}(1-\rho)^{i} is a geometric series, \lim_{r \to \infty}\left(\sum_{i=0}^{r-1}(1-\rho)^{i} u(s, i)\right) converges absolutely by the comparison test. Hence, \lim_{r\to\infty} \mathrm{EAU}(s, r) = \mu for some finite μ. □

Lemma 2. For any strategy s and set of available strategies S,

\lim_{r \to \infty} \mathrm{EAU}(s, r) = \mathrm{EPRU}(s \mid G, S).

Proof. The expected agent utility EAU(s, r) is defined as the expected PRU of an agent using strategy s on round r. As r approaches infinity, the probability that a randomly-selected agent will be i rounds old approaches L(i)/\sum_{j=1}^{\infty} L(j) = d\,L(i). The probability of reaching a history h is defined in Section 4.1.5 as P(h | s, S), and as r increases the set of histories a randomly-selected agent may have approaches H, the set of all histories. Thus,

\lim_{r \to \infty} \mathrm{EAU}(s, r) = \sum_{h \in H} d\,L(|h|)\; P(h \mid s, S)\; \mathrm{PRU}(h),

which is the definition of EPRU(s | G, S). □

EP(s, r) and EP(s', r) are both functions of EAU(s, r) and EAU(s', r), which converge to EPRU(s | G, S) and EPRU(s' | G, S) respectively. Therefore, EP(s, r) and EP(s', r) must also converge as r approaches infinity. We will let EP(t) = \lim_{r\to\infty} EP(t, r) for t ∈ {s, s'}. We can find the value of EP(s) as follows:

\mathrm{EP}(s) = (1-d)\,\mathrm{EP}(s) + N d (1-m)\,\frac{\mathrm{EP}(s)\,\mathrm{EPRU}(s \mid G, S)}{\mathrm{EP}(s)\,\mathrm{EPRU}(s \mid G, S) + \mathrm{EP}(s')\,\mathrm{EPRU}(s' \mid G, S)} + \frac{N d m}{2}.

After substituting EP(s') = N − EP(s) and rearranging terms, we have

0 = \bigl(\mathrm{EPRU}(s \mid G, S) - \mathrm{EPRU}(s' \mid G, S)\bigr)\,\mathrm{EP}(s)^2 + N\Bigl(\bigl(1 + \tfrac{m}{2}\bigr)\mathrm{EPRU}(s' \mid G, S) - \bigl(1 - \tfrac{m}{2}\bigr)\mathrm{EPRU}(s \mid G, S)\Bigr)\,\mathrm{EP}(s) - N^2\,\tfrac{m}{2}\,\mathrm{EPRU}(s' \mid G, S).

Assume EPRU(s | G, S) > 0 and EPRU(s' | G, S) > 0, which must be true as long as there are no actions that give negative utility. Also, let x = EPRU(s | G, S)/EPRU(s' | G, S). Then we can rewrite the above as

0 = (x-1)\,\mathrm{EP}(s)^2 + N\Bigl(1 + \tfrac{m}{2} - x\bigl(1 - \tfrac{m}{2}\bigr)\Bigr)\,\mathrm{EP}(s) - N^2\,\tfrac{m}{2}.    (4.15)

This equation, when subject to the constraint 0 ≤ EP(s) ≤ N, allows us to express EP(s) as a strictly increasing function of x. It also has the property that when x = 1, EP(s) = EP(s') = N/2.

Lemma 3. EP(s) > EP(s') if and only if EPRU(s | G, S) > EPRU(s' | G, S).

Proof. Assume EPRU(s | G, S) > EPRU(s' | G, S). Then x > 1 in Equation 4.15, and therefore EP(s) > N/2, so EP(s) > EP(s'). Conversely, assume EP(s) > EP(s'). Applying this as an extra constraint to Equation 4.15, we see that only values of x greater than one satisfy the equality. Therefore, EPRU(s | G, S) > EPRU(s' | G, S). Hence, EP(s) > EP(s') if and only if EPRU(s | G, S) > EPRU(s' | G, S). □

We now know that the strategy with higher EPRU will have the higher expected frequency in the limit, so all that remains is to show that a strategy with higher frequency in the limit will, in fact, have a higher average population over all rounds (and therefore a higher score in the game).

Lemma 4. For all s and s',

\lim_{r \to \infty} \frac{\sum_{i=0}^{r} \mathrm{EP}(s, i)/r}{\sum_{i=0}^{r} \mathrm{EP}(s', i)/r} = \frac{\mathrm{EP}(s)}{\mathrm{EP}(s')}.

Proof. First, note that because N, d, and m are all strictly positive, 0 < Ndm/2 ≤ EP(s, r) < N for all strategies s and rounds r. Therefore, \sum_{i=0}^{r} \mathrm{EP}(s', i) is strictly increasing and unbounded as r increases.
Using the fact that Ndm/2 ≤ EP(s, r) < N for all s and r, and the definition EP(s) = \lim_{r\to\infty} EP(s, r), we can obtain the following:

\lim_{r \to \infty} \frac{\sum_{i=0}^{r+1} \mathrm{EP}(s, i) - \sum_{i=0}^{r} \mathrm{EP}(s, i)}{\sum_{i=0}^{r+1} \mathrm{EP}(s', i) - \sum_{i=0}^{r} \mathrm{EP}(s', i)} = \lim_{r \to \infty} \frac{\mathrm{EP}(s, r+1)}{\mathrm{EP}(s', r+1)} = \frac{\lim_{r\to\infty} \mathrm{EP}(s, r+1)}{\lim_{r\to\infty} \mathrm{EP}(s', r+1)} = \frac{\lim_{r\to\infty} \mathrm{EP}(s, r)}{\lim_{r\to\infty} \mathrm{EP}(s', r)} = \frac{\mathrm{EP}(s)}{\mathrm{EP}(s')}.

Therefore, by the Stolz-Cesàro theorem,

\lim_{r \to \infty} \frac{\sum_{i=0}^{r} \mathrm{EP}(s, i)}{\sum_{i=0}^{r} \mathrm{EP}(s', i)} = \frac{\mathrm{EP}(s)}{\mathrm{EP}(s')},

and thus

\lim_{r \to \infty} \frac{\sum_{i=0}^{r} \mathrm{EP}(s, i)/r}{\sum_{i=0}^{r} \mathrm{EP}(s', i)/r} = \frac{\mathrm{EP}(s)}{\mathrm{EP}(s')}. □

From Equation 4.14 and Lemmas 3 and 4, we immediately get the following:

Theorem 1. For all s and s',

\lim_{r \to \infty} \frac{\sum_{i=0}^{r} p(s, i)/r}{\sum_{i=0}^{r} p(s', i)/r} > 1

if and only if EPRU(s | G, S) > EPRU(s' | G, S).

Therefore, the strategy with higher EPRU will have the higher score in a game of infinite Cultaptation.

Irrelevance of the Initial Strategy Profile

From the fact that EPRU is independent of the initial strategy profile, we also get the following corollary, which will help us understand some of the experimental results (see Section 5.3).

Corollary 1. The initial strategy profile of an infinite Cultaptation game instance (defined in Section 4.1) does not affect the score of any strategy in S.

If this seems counterintuitive, consider the following. At the beginning of this section I defined a reset event, in which every agent dies on two consecutive rounds, and all are replaced via mutation so that the population is identical to the initial strategy profile. For each reset event, there will be many similar events in which every agent dies on two consecutive rounds and is replaced via mutation, but in some arrangement different from the initial strategy profile. The probability of this happening is d^{2N} m^N, which is greater than 0. In an infinite-length game, such an event will eventually occur with probability 1. After it occurs, the initial strategy profile clearly has no bearing on how the rest of the game plays out, yet there are still an infinite number of rounds left in the game. Since each strategy's score is its average population over the entire game (see Section 4.1), the impact of the initial strategy profile on each strategy's total score is vanishingly small in an infinite-length game.

Application of EPRU to Other Evolutionary Games

Many of the equations used in calculating EPRU involve concepts particular to Cultaptation, such as innovation, observation, and changing action values. However, the general technique is to calculate the expected reproductive fitness of an agent on round j, multiply this quantity by the expected proportion of agents that are j rounds old, and sum these quantities to get the expected fitness of an entire population. This should be a useful metric in any evolutionary game in which agents live for more than one generation and reproduce according to the replicator equation, even if the game uses some measure other than per-round utility to determine reproductive fitness. The proofs in this section rely primarily on the symmetry between (1) the probability that an agent will be alive after k rounds and (2) the expected proportion of a population of agents that are k rounds old on any given round. Thus, any evolutionary game that allows agents to live more than one generation and in which agents die with the same probability on every round should be able to use a metric very similar to EPRU to compare strategies.
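To make the population dynamics of Section 4.2.2 concrete, the following sketch iterates the EP recursion with EAU held fixed at two limiting EPRU values and compares the result against the positive root of the quadratic in Equation 4.15. All of the numbers (N, d, m, and the two EPRU values) are made up purely for illustration, and the class is not part of the dissertation's implementation.

// Iterate the EP recursion and cross-check against Equation 4.15.
public class PopulationFixedPoint {
    public static void main(String[] args) {
        double N = 100, d = 0.25, m = 0.02;
        double epruS = 6.2, epruS2 = 5.9;     // stand-ins for EPRU(s | G,S) and EPRU(s' | G,S)

        // EP(t, r+1) = (1-d)EP(t,r) + Nd(1-m) EP(t,r)EAU(t,r)/TU(r) + Ndm/2
        double epS = 50, epS2 = 50;
        for (int r = 0; r < 100000; r++) {
            double tu = epS * epruS + epS2 * epruS2;
            double nextS  = (1 - d) * epS  + N * d * (1 - m) * epS  * epruS  / tu + N * d * m / 2;
            double nextS2 = (1 - d) * epS2 + N * d * (1 - m) * epS2 * epruS2 / tu + N * d * m / 2;
            epS = nextS;
            epS2 = nextS2;
        }

        // Equation 4.15: 0 = (x-1)P^2 + N(1 + m/2 - x(1 - m/2))P - N^2 m/2, with x the EPRU ratio;
        // the root in [0, N] is the limiting population EP(s).
        double x = epruS / epruS2;
        double a = x - 1, b = N * (1 + m / 2 - x * (1 - m / 2)), c = -N * N * m / 2;
        double root = (-b + Math.sqrt(b * b - 4 * a * c)) / (2 * a);

        System.out.printf("EP(s) by iteration:  %.4f (EP(s') = %.4f)%n", epS, epS2);
        System.out.printf("EP(s) from Eq. 4.15: %.4f%n", root);
    }
}

Because the strategy with the larger stand-in EPRU value ends up with more than N/2 of the population, the sketch also illustrates the direction of Lemma 3.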
Chapter 5

Strategy Generation Algorithms

This chapter describes the algorithms used to generate near-best-response strategies for Cultaptation, and presents experimental studies in which near-best-response strategies are tested against a known good strategy from the Cultaptation tournament.

5.1 Finding an ε-Best Response Strategy

In this section I explain what it means for a strategy to be a best response or near-best response in infinite Cultaptation, and I provide an algorithm for calculating a near-best response to S^-, the available strategies other than our own.

5.1.1 Problem Specification

Now that we have derived EPRU and proved that a strategy's EPRU is directly proportional to its score in an infinite Cultaptation game, we can determine how each strategy in a given set of available strategies S will perform by evaluating the EPRU of each strategy. Therefore, we can define a best-response strategy in terms of EPRU, as follows.

Recall that in an infinite Cultaptation game (as defined in Section 4.1) there are ν players, each of whom selects a strategy to put into the set of available strategies S. Let S^- be the set of available strategies other than our own, i.e., S^- = {s_1, ..., s_{i-1}, s_{i+1}, ..., s_ν}. Strategy s_opt is a best response to S^- if and only if for any other strategy s',

\mathrm{EPRU}(s_{\mathrm{opt}} \mid G, S^- \cup \{s_{\mathrm{opt}}\}) \geq \mathrm{EPRU}(s' \mid G, S^- \cup \{s'\}).

Computing s_opt is not possible due to its prohibitively large size. However, we can compute an ε-best-response strategy, i.e., a strategy s such that EPRU(s | G, S^- ∪ {s}) is arbitrarily close to EPRU(s_opt | G, S^- ∪ {s_opt}). This problem can be stated formally as follows: Given game parameters G, error bound ε > 0, and the set S^- of available strategies other than our own, find a strategy s such that EPRU(s | G, S^- ∪ {s}) is within ε of EPRU(s_opt | G, S^- ∪ {s_opt}).

5.1.2 Bounding EPRU

In games where 0 < d < 1, an agent could potentially live for any finite number of rounds. However, since the agent's probability of being alive on round r decreases exponentially with r, the expected utility contributed by an agent's actions in later rounds is exponentially lower than the expected utility contributed by earlier rounds. I will use this fact in deriving a bound on EPRU_alt(s, h | G, S) for a given strategy and a history h of length l.

Recall from Equations 4.10 and 4.11 that

\mathrm{EV}_{\exp}(r, v) = v \sum_{i=r}^{\infty} \frac{1}{i}(1-d)^{i-1} = v\left(\frac{\ln d}{d-1} - \sum_{i=1}^{r-1} \frac{1}{i}(1-d)^{i-1}\right)    (5.1)

where EVexp(r, v) is the expected contribution to EPRU made by exploiting an action with value v on round r.

Since we know how much any given exploit contributes to EPRU(s | G, S) for a given strategy s, we can calculate G(l, v), the amount that exploiting the same action on all rounds after l contributes to EPRU(s | G, S), as follows:

G(l, v) = \sum_{j=l+1}^{\infty} v \sum_{n=j}^{\infty} \frac{1}{n}(1-d)^{n-1} = v \sum_{j=l+1}^{\infty} \sum_{n=j}^{\infty} \frac{1}{n}(1-d)^{n-1}.

Expanding the summations yields:

G(l, v) = v\left(\frac{1}{l+1}(1-d)^{l} + \frac{2}{l+2}(1-d)^{l+1} + \cdots\right) = v \sum_{n=l+1}^{\infty} \frac{n-l}{n}(1-d)^{n-1} = v\left(\sum_{n=l+1}^{\infty} (1-d)^{n-1} - \sum_{n=l+1}^{\infty} \frac{l}{n}(1-d)^{n-1}\right) = v\left(\frac{(1-d)^{l}}{d} - \sum_{n=l+1}^{\infty} \frac{l}{n}(1-d)^{n-1}\right).

Next, we pull l out of the summation and use (5.1) to obtain:

G(l, v) = v\left(\frac{(1-d)^{l}}{d} - \underbrace{l\left(\frac{\ln d}{d-1} - \sum_{n=1}^{l} \frac{1}{n}(1-d)^{n-1}\right)}_{(a)}\right).    (5.2)

Note that for 0 < d < 1, G(l, v) is finite. G(l, v) provides a closed-form formula for the eventual contribution of exploiting an action with value v at every round after the l-th round.
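Equation 5.2 can be checked against the original double sum. The sketch below is illustrative only; the parameter values and the truncation depth are arbitrary, and the class name is invented for this example.

// Numerical sanity check of Equation 5.2 against the definition of G(l, v).
public class TailBoundCheck {
    // G(l, v) by truncating the double sum: sum_{j=l+1..} v * sum_{n=j..} (1/n)(1-d)^(n-1).
    static double gDirect(int l, double v, double d, int terms) {
        double outer = 0.0;
        for (int j = l + 1; j <= terms; j++) {
            double inner = 0.0;
            for (int n = j; n <= terms; n++) {
                inner += (1.0 / n) * Math.pow(1.0 - d, n - 1);
            }
            outer += inner;
        }
        return v * outer;
    }

    // G(l, v) by the closed form: v * ((1-d)^l/d - l*(ln d/(d-1) - sum_{n=1..l} (1/n)(1-d)^(n-1))).
    static double gClosedForm(int l, double v, double d) {
        double partial = 0.0;
        for (int n = 1; n <= l; n++) {
            partial += (1.0 / n) * Math.pow(1.0 - d, n - 1);
        }
        return v * (Math.pow(1.0 - d, l) / d - l * (Math.log(d) / (d - 1.0) - partial));
    }

    public static void main(String[] args) {
        int l = 5; double v = 160.0, d = 0.4;
        System.out.println("G(l, v) direct      ~ " + gDirect(l, v, d, 2000));
        System.out.println("G(l, v) closed form = " + gClosedForm(l, v, d));
    }
}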
Since the set V of possible action values is finite (see Section 4.1), let v_max = max(V) be the largest of these values. Then G(l, v_max) is an upper bound on the expected per-round utility achieved after round l (clearly no strategy can do better than exploiting an action with maximal value on every round after round l). I use this fact to bound the depth-limited expected per-round utility computation.

Theorem 2. Let v_max be the highest possible action utility for game parameters G, and let S^- be the set of available strategies other than our own. Then for all l and all strategies s,

\mathrm{EPRU}_{\mathrm{alt}}(s, \langle\rangle \mid G, S^- \cup \{s\}) - \mathrm{EPRU}^{l}_{\mathrm{alt}}(s, \langle\rangle \mid G, S^- \cup \{s\}) \leq G(l, v_{\max}).

Proof. Since it is not possible for any strategy to gain more utility than v_max on any round, this follows from the discussion above. □

Theorem 2 states that G(l, v_max) is the highest possible contribution to the total expected per-round utility (i.e., EPRU_alt(s, ⟨⟩ | G, S^- ∪ {s})) made by any strategy s after round l. Thus, if we are given an ε > 0 and we can find a value of k such that G(k, v_max) ≤ ε, then we know that no strategy can earn more than ε expected utility after round k. The next section will show how to find such a k.

5.1.3 Determining How Far to Search

In this section I show how to find a search depth k such that, for any given ε > 0, no strategy can earn more than ε utility after round k. We first note a bound on G(l, v):

Lemma 5. G(l, v) ≤ v(1-d)^l / d.

Proof. The lemma follows from noting that part (a) of Equation 5.2 is greater than or equal to zero, since \frac{\ln d}{d-1} = \sum_{n=1}^{\infty} \frac{1}{n}(1-d)^{n-1} and l < ∞. Thus G(l, v) = v\left(\frac{(1-d)^{l}}{d} - w\right) \leq v\,\frac{(1-d)^{l}}{d}, since w is always non-negative. □

Now if we can find a k such that

\varepsilon = v_{\max}(1-d)^{k}/d,

then we can be certain that ε ≥ G(k, v_max). Solving for k in the above equation yields

k = \log_{1-d}\!\left(\frac{\varepsilon\, d}{v_{\max}}\right),    (5.3)

which has a solution for 0 < d < 1 and v_max > 0, both of which will always be true in Cultaptation. This gives us the following theorem.

Theorem 3. Given ε > 0, set of available strategies S^- other than our own, and game parameters G with maximal utility v_max, let k = \log_{1-d}(\varepsilon d / v_{\max}). If s has the maximal value of EPRU^k_alt(s, ⟨⟩ | G, S^- ∪ {s}), then s is an ε-best response to S^-.

Proof. Let s_opt be the strategy with the maximal value of EPRU(s_opt | G, S^- ∪ {s_opt}). By Theorem 2, we know that s_opt cannot earn more than ε expected utility on rounds after k. Since s earns the maximum EPRU possible in the first k rounds, it follows that |EPRU(s_opt | G, S^- ∪ {s_opt}) − EPRU(s | G, S^- ∪ {s})| ≤ ε. Therefore, s is an ε-best response. □
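In practice, Equation 5.3 and Lemma 5 give a one-line recipe for choosing the search depth. The sketch below is an illustration only; the parameter values are borrowed from the uniform1-style environments used later in Section 5.3, and taking the ceiling (so that k is an integer) is my assumption, which only makes the bound tighter.

// Compute the search depth k from Equation 5.3 and verify the tail bound of Lemma 5.
public class SearchDepth {
    // k = ceil( log_{1-d}(epsilon * d / vmax) )
    static int searchDepth(double epsilon, double d, double vmax) {
        return (int) Math.ceil(Math.log(epsilon * d / vmax) / Math.log(1.0 - d));
    }

    public static void main(String[] args) {
        double epsilon = 0.4, d = 0.4, vmax = 166.67;
        int k = searchDepth(epsilon, d, vmax);
        double tail = vmax * Math.pow(1.0 - d, k) / d;  // Lemma 5's bound on utility earned after round k
        System.out.println("k = " + k + ", remaining-utility bound = " + tail + " <= " + epsilon);
    }
}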
5.1.4 Algorithm

I will now present my algorithm for computing the strategy s with the maximal value of EPRU^k_alt(s, ⟨⟩ | G, S^- ∪ {s}), and show how it can be used to compute an ε-best response. Algorithm 1 returns a 2-tuple containing a partially specified strategy s and a scalar U. Strategy s maximizes EPRU^k_alt(s, h | G, S^- ∪ {s}), and U is the value of this expression.

Algorithm 1 Produce the strategy s that maximizes EPRU^k_alt(s, h | G, S^- ∪ {s}), given initial history h, set of possible utility values V, and S^-, the set of available strategies other than our own.

Strat(h, k, V, S^-)
 1: if k = 0 then
 2:   return {∅, 0}
 3: end if
 4: Let U_max = 0
 5: Let s_max = null
 6: for each action a ∈ {Inv, Obs, X_1, ..., X_M} do
 7:   Let U_temp = 0
 8:   Let s_temp = {(h, a)}
 9:   for each action m ∈ {1, ..., M} do
10:     for each value v ∈ V do
11:       Let t = (a, (m, v))
12:       Let p = P(h·t | h, a, S^-)
13:       if p > 0 then
14:         Let {S', U'} = Strat(h·t, k-1, V, S^-)
15:         s_temp = s_temp ∪ S'
16:         U_temp = U_temp + p · (EVexp(|h·t|, U(t)) + U')
17:       end if
18:     end for
19:   end for
20:   if U_temp > U_max then
21:     U_max = U_temp
22:     s_max = s_temp
23:   end if
24: end for
25: return {s_max, U_max}

The algorithm performs a depth-first search through the space of strategies that start from the input history h, stopping once it reaches a specified depth k. Figure 5.1 provides an example of the kind of tree searched by this algorithm. For each possible action a ∈ {Inv, Obs, X_1, ..., X_M} at h, it computes the expected per-round utility gained from performing a, and the utility of the best strategy for each possible history h' that could result from choosing a. It combines these quantities to get the total expected utility for a, and selects the action with the best total expected utility, a_max. It returns the strategy created by combining the policy (h, a_max) with the strategies for each possible h', and the utility for this strategy. Seen another way, Strat(h, k, V, S^-) computes EPRU^k_alt(s, h | G, S^- ∪ {s}) for all possible strategies s, returning the strategy maximizing EPRU^k_alt as well as the maximal value of EPRU^k_alt.

Proposition 1. Strat(h, k, V, S^-) returns (s, U) such that

s = \arg\max_{s'} \mathrm{EPRU}^{k}_{\mathrm{alt}}(s', h \mid G, S^- \cup \{s'\}) \quad \text{and} \quad U = \mathrm{EPRU}^{k}_{\mathrm{alt}}(s, h \mid G, S^- \cup \{s\}).

A proof of this proposition is presented in [19].

We now have an algorithm capable of computing the strategy with maximal expected utility over the first k rounds. Hence, in order to find an ε-best-response strategy we need only find the search depth k such that no strategy can earn more than ε expected utility after round k, and then call the algorithm with that value of k.

Theorem 4. Given ε > 0, available strategies other than our own S^-, and a set of values V with maximum value v_max, let k = \log_{1-d}(\varepsilon d / v_{\max}). Then Strat(⟨⟩, k, V, S^-) returns (s, U) such that s is an ε-best response to S^-.

Proof. This follows from Theorem 3 and Proposition 1. □

We also have the following.

Corollary 2. Given available strategies other than our own S^- and a set of values V, let s_k be the strategy returned by Strat(⟨⟩, k, V, S^-). Then \lim_{k \to \infty} s_k is a best response to S^-.

Proof. Let s_opt be a best response to S^-. By Lemma 5 and Theorem 3,

\mathrm{EPRU}(s_{\mathrm{opt}} \mid G, S^- \cup \{s_{\mathrm{opt}}\}) - \mathrm{EPRU}(s_k \mid G, S^- \cup \{s_k\}) \leq v_{\max}(1-d)^{k}/d.

Since \lim_{k\to\infty} v_{\max}(1-d)^{k}/d = 0, it follows that \lim_{k\to\infty}\bigl(\mathrm{EPRU}(s_{\mathrm{opt}} \mid G, S^- \cup \{s_{\mathrm{opt}}\}) - \mathrm{EPRU}(s_k \mid G, S^- \cup \{s_k\})\bigr) = 0. Therefore, \lim_{k\to\infty} s_k is a best response to S^-. □

5.1.5 Implementation

In this section I discuss modifications that improve the running time of Algorithm 1 without any loss in accuracy. The following subsections discuss techniques for state aggregation that cut the branching factor of the algorithm in half, the representation of Obs, and caching and pruning.

State Aggregation

If the pseudocode for Algorithm 1 were implemented verbatim, it would search through each history that can be reached from the starting state. However, there is a significant amount of extraneous information in each history that is not needed for any of the algorithm's calculations. For example, the histories h = ⟨(Inv, (1, 10))⟩ and h' = ⟨(Inv, (2, 10))⟩ both describe a situation where the agent innovates once and obtains an action with value 10. The only difference between these histories is the identifier assigned to the action, which does not impact any of the calculations; yet the pseudocode must still search through each of these histories separately. We can eliminate this redundancy by using repertoires, rather than histories, as the states for the algorithm to search through, as the sketch below illustrates.
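The following sketch shows the aggregation idea on the two histories from the example above: both collapse to the same state once action identifiers are discarded. It is a simplified illustration, not the construction used in the implementation (Appendix B gives the full repertoire definition); in particular it ignores round numbers, value changes, and the refreshing of knowledge by exploit actions.

import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Two histories that differ only in the identifiers of the actions they contain
// collapse to the same "repertoire key", here a sorted multiset of known values.
public class RepertoireKey {
    record Step(String action, int actionId, double value) {}

    // Count how many learned actions have each value, ignoring action identifiers.
    static Map<Double, Integer> repertoireOf(List<Step> history) {
        Map<Double, Integer> key = new TreeMap<>();
        for (Step s : history) {
            if (s.action().equals("Inv") || s.action().equals("Obs")) {
                key.merge(s.value(), 1, Integer::sum);
            }
        }
        return key;
    }

    public static void main(String[] args) {
        List<Step> h1 = List.of(new Step("Inv", 1, 10.0));
        List<Step> h2 = List.of(new Step("Inv", 2, 10.0));
        // Both histories from the example above map to the same search state:
        System.out.println(repertoireOf(h1).equals(repertoireOf(h2))); // true
    }
}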
Figure 5.1: An illustration of a portion of the search performed by Algorithm 1 on a very small version of Cultaptation, with four actions and a set of possible action values equal to {20, 40, 80, 160}. Circle nodes indicate possible observations made by the agent, and triangle nodes indicate possible choices made by the agent (which have a random outcome, indicated by the line through the branches leading to its children). Ellipses indicate branches or subtrees omitted from the figure. The left side of the illustration shows how more actions can become available to the agent as the game progresses, while the right side shows how potentially-changing values of actions must be modeled in the tree. Several techniques outlined in Section 5.1.5 can reduce the size of the tree that needs to be searched in the implementation of this algorithm.

A repertoire is a record of what the agent knows about each of the actions it has learned, rather than a record of everything that has happened to it. Making this simple change allows Algorithm 1 to calculate the value of an observation action by combining information it learns when exploring innovate and exploit actions, rather than recursing again. This cuts the branching factor of the search in half. The analysis and details involved in this change, as well as the proof that the version of the algorithm using repertoires returns the same result as the previous version, are included in Appendix B.

Running time analysis. When Algorithm 1 considers a history h, it makes one recursive call for each possible action-percept pair (a, (m, v)) that can be executed at h. There are 2M such pairs for each history: if the agent knows how to exploit j actions, then it can innovate any of the M - j actions it does not know, and it can observe any of the M actions. Each of these actions can also return any of |V| values. Hence, the number of recursive calls made by the algorithm at each step is at most 2M|V|. Since the algorithm recurses to depth k, the running time of Algorithm 1 is O((2M|V|)^k). With the state aggregation technique described above, we do not need to perform additional recursions for observation actions. Hence, the number of recursive calls made at each step is at most M|V|, and the total running time is O((M|V|)^k), which improves upon the original running time by a factor of 2^k.

Representing Obs

For the formal proofs, I treated Obs as a black box that, when given the agent's history and round number, could tell us the exact probabilities of observing each action on the current round. However, since there are an exponential number of possible histories, storing Obs in this form would require an exponential amount of space, which would severely limit the size of games for which we could compute strategies. Algorithm 2 (introduced in Section 5.2) would also need to run a prohibitively large number of simulations to get enough samples to generate a new Obs of this type. Therefore, as an approximation, my implementation assumes that Obs has a similar structure to Inv, and remains constant throughout the agent's lifetime. That is, the Obs used in the experiments returns the probability of an action valued v being observed. While this leads to some loss in accuracy, it is very easy to store and compute. Further, we will see in the experimental results (particularly those dealing with iterative computation in Section 5.3.2) that this form of Obs is still able to produce good strategies.
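The simplified, Inv-like form of Obs described above amounts to a fixed probability distribution over action values. The sketch below shows one possible representation of such a distribution; the class name and API are assumptions made for illustration and are not taken from the dissertation's code.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

// A history-independent distribution over action values, usable for either Inv or Obs.
public class ValueDistribution {
    private final Map<Double, Double> probByValue = new LinkedHashMap<>();

    void put(double value, double probability) { probByValue.put(value, probability); }

    // Probability that an observe (or innovate) action returns an action worth `value`.
    double probabilityOf(double value) { return probByValue.getOrDefault(value, 0.0); }

    // Draw one value by inverse-transform sampling.
    double sample(Random rng) {
        double r = rng.nextDouble(), cumulative = 0.0;
        double last = Double.NaN;
        for (Map.Entry<Double, Double> e : probByValue.entrySet()) {
            cumulative += e.getValue();
            last = e.getKey();
            if (r < cumulative) return e.getKey();
        }
        return last; // guard against rounding when the probabilities sum to slightly less than 1
    }

    public static void main(String[] args) {
        ValueDistribution obs = new ValueDistribution();
        for (double v : new double[]{20, 40, 80, 160}) obs.put(v, 0.25); // uniform, as in Section 5.3.2
        System.out.println(obs.sample(new Random(7)));
    }
}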
Caching and Pruning

Since the implementation uses repertoires rather than histories to represent the agent's set of known actions, and since it is possible for two histories to produce the same repertoire, the algorithm will sometimes encounter repertoires that it has already evaluated. So that the algorithm will not have to waste time evaluating them again, the implementation includes a cache which stores the EPRU of every repertoire it has evaluated. When the algorithm encounters a repertoire whose expected utility is needed, the implementation first checks the cache to see if the EPRU of the repertoire has been previously computed, and uses the computed value if it exists. Caching is widely used in tree-search procedures, and is analogous to the transposition tables in chess-playing algorithms [37].

I also use another well-known method for avoiding unnecessary evaluation of states, namely branch-and-bound pruning [38, 39], which can be summarized as follows. Before we compute the expected per-round utility of a given action, we check to see whether an upper bound on the EPRU of that action would be sufficient to make the given action's utility higher than that of the best previously computed action. In many situations, the maximal utility that can be achieved for a given action will in fact be less than the utility we know we can achieve via some other action, and therefore we can skip the evaluation of that action (i.e., we can "prune" it from the search tree).

There are no theoretical guarantees on runtime reduction using these techniques, but we will see in Section 5.3.1 that the combination of pruning and caching allows the algorithm to avoid evaluating significant portions of the state space in the environments I tested.

5.2 Cultaptation Strategy Learning Algorithm

Until now I have assumed that Algorithm 1 has access to Obs, the distribution of observable actions, when it performs its calculations. While the algorithm finds the near-best-response strategy given a particular Obs, agents playing the real Cultaptation game are not given access to Obs beforehand, and even estimating what Obs looks like can be very difficult while playing the game due to the limited amount of information each agent receives in its lifetime. It is also unclear how exactly an agent's own actions will affect Obs: by exploiting a particular action, the agent is making that action observable to others, who might then exploit it in greater proportion than in the Obs used to compute the agent's strategy.

To address these difficulties, I developed the Cultaptation Strategy Learning Algorithm (CSLA), which uses a method for creating a strategy and a distribution Obs simultaneously so that (i) Obs is the distribution created when all agents in a Cultaptation game play the computed strategy and (ii) the computed strategy is a near-best response for Obs (and the other parameters). This algorithm copes with the lack of information about Obs, and generates an approximation of a strategy that is a best response to itself. At a high level, the algorithm can be thought of as generating a series of strategies, each an ε-best response to the one before it, and stopping when two successive strategies are extremely similar. A more detailed description of this process follows.

The algorithm begins by assuming Obs = Inv.
The algorithm then proceeds iteratively; at each iteration it generates s, the ε-best-response strategy to the current Obs, then simulates a series of Cultaptation games in which s plays itself, and extracts a new Obs from the actions exploited in these games.

At the end of each iteration, the algorithm compares s to s_old, the strategy produced by the previous iteration, using the stratDiff function. stratDiff(s, s_old) computes the probability that an agent using s would perform at least one different action before dying than the same agent using s_old. For instance, stratDiff(s, s_old) = 1.0 means that the two strategies will always perform at least one different action (i.e., the actions they choose on the first round are different), while stratDiff(s, s_old) = 0.0 means that s is identical to s_old. When stratDiff(s, s_old) is found to be below some threshold δ, CSLA terminates and returns s, the strategy computed by the last iteration. The formal algorithm is presented as Algorithm 2.

Algorithm 2 Produce an approximation of a strategy that is an ε-best response to itself.

CSLA(Inv, δ, k)
 1: Let Obs = Inv.
 2: s = ∅.
 3: repeat
 4:   Let s_old = s.
 5:   Let V = {Inv, Obs}.
 6:   s = Strat(⟨⟩, k, V, S)
 7:   Simulate a series of Cultaptation games in which s plays itself, and action utilities are initially drawn from Inv, recording all actions exploited in the last quarter of these games.
 8:   Use the records of exploited actions to generate a new distribution Obs (i.e., Obs(v) = fraction of the time v was exploited in the records).
 9: until stratDiff(s, s_old) < δ
10: return s.

Properties of the strategy. CSLA as presented here is a "best-effort" algorithm in the following sense: if CSLA converges to a strategy s, we know that it is an approximation of a symmetric Nash equilibrium strategy, but we do not know (i) whether or not CSLA will converge for a given environment, or (ii) how close to the true Nash equilibrium s is. The improved version of CSLA proposed in Chapter 5 will address these issues.

In my experimental studies (see Section 5.3), the strategies produced by CSLA in any given game were all virtually identical, even when a random distribution (rather than Inv) was used to initialize Obs. This strongly suggests (though it does not prove) that the strategy profile consisting of copies of s_self is a symmetric near-Nash equilibrium.

Furthermore, there is reason to believe that s is evolutionarily stable. Consider an environment in which all agents use the strategy s, and suppose a small number (say, one or two) of agents using other strategies are introduced as invaders. Because s was a near-best response to the environment that existed before the opponent's agents were introduced, and because the introduction of one or two invaders will change this environment only slightly, agents using s will still be using a strategy that is close to the best response for the current environment, and they will also have some payoff they have accumulated on previous rounds when their strategy was still a near-best response. Thus, the invaders should have a difficult time establishing a foothold in the population, and hence should die out with high probability. This suggests (but does not prove) that s is evolutionarily stable.¹

¹Among other things, a formal proof would require a way to calculate the payoffs for s and any invading strategy. Accomplishing this is likely to be complicated, but I hope to do it in my future research.
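The stratDiff convergence test used in Algorithm 2 can be approximated by Monte Carlo simulation. The sketch below is illustrative only and is not the dissertation's implementation: it models each strategy as a function from the percept sequence received so far to the next action, draws percepts from a fixed value set independently of the action chosen, and treats divergence on any round before death as a "different action". The toy strategies in main are made up for the example.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Function;

// Monte Carlo estimate of a stratDiff-style comparison between two strategies.
public class StratDiff {
    static double estimate(Function<List<Integer>, String> s,
                           Function<List<Integer>, String> sOld,
                           double d, int[] values, int trials, long seed) {
        Random rng = new Random(seed);
        int differed = 0;
        for (int t = 0; t < trials; t++) {
            List<Integer> percepts = new ArrayList<>();
            while (true) {
                // Compare the actions the two strategies would take at this point in the agent's life.
                if (!s.apply(percepts).equals(sOld.apply(percepts))) { differed++; break; }
                if (rng.nextDouble() < d) break;                       // agent dies with no divergence
                percepts.add(values[rng.nextInt(values.length)]);      // receive the next percept
            }
        }
        return (double) differed / trials;
    }

    public static void main(String[] args) {
        // Two toy strategies: both start with Innovate; one exploits any value >= 40,
        // the other exploits only values >= 80 and otherwise innovates again.
        Function<List<Integer>, String> sNew =
            p -> p.isEmpty() ? "Inv" : (p.get(p.size() - 1) >= 40 ? "Exploit" : "Inv");
        Function<List<Integer>, String> sOld =
            p -> p.isEmpty() ? "Inv" : (p.get(p.size() - 1) >= 80 ? "Exploit" : "Inv");
        int[] values = {20, 40, 80, 160};
        System.out.println(estimate(sNew, sOld, 0.25, values, 100_000, 42L));
    }
}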
5.2.1 Implementation Details

We have created a Java implementation of CSLA. Here I briefly discuss two issues dealt with during implementation.

Representing Obs

My implementation of CSLA uses the same representation of Obs as my implementation of Algorithm 1 does. In other words, it assumes Obs has the same form as Inv, and remains constant throughout the game. Ideally we would be able to condition Obs on the agent's history, but in practice this would require too much space (since there are an exponential number of possible histories), and we would need to run too many simulations in step 7 to get an accurate distribution for each history.

Training

In the Machine Learning literature, the process of improving an agent's performance on a given task is often referred to as "training." In Algorithm 2, strategy s is trained by playing against itself in a series of simulated games in step 7. However, in the implementation of CSLA the agents involved in the games in step 7 are a parameter to the algorithm. This means that CSLA can also produce a strategy that is trained by playing in an environment consisting of itself and one or more given strategies. The intuition behind this approach is that a strategy trained by playing against itself and strategy s' may perform better when playing against s' than a strategy trained against itself alone. I test this hypothesis experimentally in Section 5.3.2.

5.3 Experimental Results

In this section I present my experimental results.

Section 5.3.1 examines the performance of the implementation of the ε-best response algorithm. I find that the optimizations allow the algorithm to find strategies within 1% of the best response 1,000 times faster than the unoptimized algorithm. Section 5.3.1 also examines the strategies found by the ε-best response algorithm when presented with different environments and strategy profiles, and the results give us an idea of what kinds of circumstances are necessary for the near-best-response strategy to prefer innovation over observation.

Section 5.3.2 presents a series of experiments comparing two strategies generated with my Cultaptation Strategy Learning Algorithm to a known good strategy used in the international Cultaptation tournament. I find that the strategies generated with CSLA are able to beat the known good strategy, even when the environment is different from the one CSLA used to learn the strategies (Sections 5.3.2 and 5.3.2). Finally, I perform an in-depth qualitative analysis of all three strategies and highlight the differences in behavior that give my learned strategies an advantage (Section 5.3.2).

5.3.1 Experiments with the ε-Best Response Algorithm

In this section I present the experiments involving an implementation of Algorithm 1, which generates ε-best-response strategies for a given set of game parameters and strategy profile.

Implementation Performance

My first set of experiments was designed to study the accuracy and running time of my implementation, and the effectiveness of the methods developed to improve its performance. I first examined the effect of ε on running time and on the expected per-round utility computed by Algorithm 1. I ran the experiments in several different environments; first I examined the uniform1 environment. In this environment, Inv is a uniform distribution over {33.33, 66.67, 100, 133.33, 166.67}, and S contains the innovate-once (I1) strategy from Example 1, so Inv is identical to Obs. The probability of change in uniform1 is 1%, and the probability of death is 40%.
I also introduced several variations on the uniform1 environment to study the effect of different probabilities of change. They are uniform10, uniform20, uniform30, and uniform40, which have the respective probabilities of change of 10%, 20%, 30%, and 40%.

In Table 5.1, we see the EPRU computed for various values of ε in these environments. As a point of reference, strategy I1 can be analytically shown to achieve an EPRU of about 38.56. We can see that an upper bound on achievable EPRU in the uniform1 environment is 40.5, since the EPRU of an ε-best response to S is 40.1 when ε is 0.4. Also, note that the algorithm finds lower EPRUs as the probability of change increases. This is as expected: in a rapidly changing environment, one cannot expect an agent to do as well as in a static environment where good actions remain good and bad actions remain bad.

Table 5.1: Expected per-round utility of the ε-best response strategy computed by Algorithm 1, for eight different values of ε in various environments.

             ε=0.4  ε=0.8  ε=1.2  ε=1.6  ε=2.0  ε=2.4  ε=2.8  ε=3.2
  uniform1   40.1   40.0   39.6   39.6   39.6   38.9   38.9   38.9
  uniform10  39.3   39.1   38.8   38.8   38.8   38.1   38.1   38.1
  uniform20  38.7   38.5   38.2   38.2   38.2   37.7   37.7   37.7
  uniform30  38.7   38.5   38.2   38.2   38.2   37.7   37.7   37.7
  uniform40  38.7   38.5   38.2   38.2   38.2   37.7   37.7   37.7

The ε-best-response strategies computed generally innovate as the first action, then exploit that value if it is not the lowest value available (in this case 33.33). Otherwise, the strategies tend to innovate again in an attempt to find an action with a value bigger than 33.33. This is how they manage to achieve a higher EPRU than the innovate-once strategy.

As part of the experiment in the uniform1 environment, I kept track of the number of nodes searched by four variations of the algorithm. In the first variation, I ran Algorithm 1 without optimizations. I also examined the algorithm's performance with the pruning and caching optimizations described in Section 5.1.5. In Figure 5.2 we see that employing both caching and pruning allows the algorithm to compute strategies within 1% of the best response about 1,000 times faster. The search time required for an 80,000-node search is around 15 seconds on a 3.4GHz Xeon processor.

Figure 5.2: Number of nodes searched in the uniform1 environment, with different combinations of caching and pruning.
I designed this experiment to test the hypothesis that, even if I let Obs > Inv, the standard deviations of these distributions can still be varied such that the -best-response strategy computed by EPRUkalt will choose to innovate rather than observe. My methods and results are presented below. I used the repertoire-based algorithm Strat(R; r; k; V ) (Algorithm 1) to com- pute -best-response strategies for Cultaptation games with several di erent pa- rameter settings, then analyzed the strategies to determine how often they would observe or innovate. In this experiment, the agents died with 40% probability on each round (d = 0:4) and there were 5 potential exploitation actions. These games are smaller than the Cultaptation game used in the tournament, to ensure they can 62 Table 5.2: The portion of innovation actions (calculated as nInv=(nInv + nObs)) in the -best-response strategy when the standard deviations of Inv and Obs are as speci ed. In all cases, Inv = 100 and Obs = 110. Obs = 10 Obs = 300 Inv = 10 2:80 10 9 3:40 10 10 Inv = 300 0.995 0.215 be solved in a reasonable amount of time. In these games, I used distributions Obs and Inv with means of 110 and 100 respectively. Table 5.2 shows the results for four combinations of parameter settings: Inv 2 f10; 300g and Obs 2 f10; 300g. When Inv = 10, the near-best-response strategy will observe almost exclusively (innovating only in rare cases where observation returns several low-quality moves in a row). However, in the environment where Inv = 300 the near-best-response strategy includes signi cantly more Innvoates; when Obs = 10 it will innovate 99:5% of the time, and even when Obs = 300 it still innovates 21:5% of the time. This experiment lets us conclude that the means of Inv and Obs are not su - cient to determine if innovation or exploitation is better. In particular, if the stan- dard deviation of innovated values is high, then innovation becomes more valuable because multiple innovations tend to result in a higher valued action than multiple observations. An interesting strategy emerges when Inv and Obs both have high standard deviations. Even though the mean value of innovated actions is lower than the mean value of observed actions, the -best-response strategy in these cases innovates initially, then, if the value innovated is high, exploits that value. If the innovated 63 value is not high, an observation action is performed to ensure the agent has a reasonably-valued action available to exploit until it dies. 5.3.2 Experiments with the Cultaptation Strategy Learning Algo- rithm The objective of my second experiment was to examine the performance of strategies produced by the Cultaptation Strategy Learning Algorithm (Algorithm 2 in Section 5.2), and the importance of the environment (see Section 5.2.1) used to train these strategies. Speci cally, I was interested in| examining whether the strategies produced with CSLA were capable of beating a strategy that is known to do well; examining whether strategies produced by CSLA were able to perform well in environments di erent from those they were trained in; comparing how well a strategy that is trained only against itself (i.e., all agents in the simulated game in Step 7 of the CSLA algorithm use strategy s) can do at repelling an invader, versus how well a strategy trained against the invader (i.e. the invading strategy is included in the population of agents at Step 7) can do at repelling the invader. For the previous experiments, I assumed the algorithm had an oracle for Obs. 
For the rest of this section I will be running experimental simulations, so the oracle will observe what the agents do in the simulations and construct Obs from this, as described in Section 5.2.

For the known good strategy I used an algorithm called EVChooser, which performs a few innovation and observation actions early in the game and uses the results of these actions (along with a discount factor) to estimate the expected value of innovating, observing, and exploiting, then performs the action with the highest expected value. It placed 15th out of over 100 entries in the Cultaptation tournament [15]. We chose EVChooser because (1) it has been shown to be a competitive strategy, (2) its source code was readily available to me (unlike the other successful strategies from the Cultaptation tournament), and (3) it could be tuned to perform well in the Cultaptation environments I used (which, in order to accommodate CSLA's exponential running time, were much smaller than those used in the international Cultaptation tournament).

For games as small as the ones in my experiments, I believe EVChooser is representative of most of the high-performing strategies from the tournament. Nearly all of the strategies described in the tournament report [15] spend some time trying to figure out what the innovate and observe distributions look like, and afterwards use some heuristic for choosing whether to innovate, observe, or exploit their best known action on any given round. This heuristic often involves some type of expected-value computation; for instance, the winning strategy discountmachine used a discount factor to compare the utility gained by exploiting the current best-known action to the utility of possibly learning a better action and exploiting it on all future rounds, which is exactly what EVChooser does.² Unlike my CSLA algorithm, none of the strategies in the tournament conducted lookahead search.

For this experiment, I used an environment where Inv was a uniform distribution over the actions {20, 40, 80, 160}, probability of change was 1%, and probability of death was 25%. Due to the exponential running time of my strategy-generating algorithm, this is the largest environment (i.e., smallest probability of death, highest number of actions and action values) for which the algorithm could compute full strategies in a reasonable amount of time.

Convergence and Consistency of CSLA

As part of this work, I have developed a Java implementation of Algorithm 2 that allows one to specify the type of game to be used for the simulation in Step 7, and created two strategies: sself and sEVC. The training process for both strategies began with s0, the best response to a random Obs distribution, and continued by constructing a strategy s_{i+1} as a best response to the Obs generated by simulating games involving s_i. When training sself the simulated games consisted solely of agents using s_i, but while training sEVC they consisted of a population of agents using s_i being invaded by EVChooser. In both cases, 100 games were simulated at each step of the iteration, to limit the amount of noise in the Obs that was extracted from the simulations.
²discountmachine differs from EVChooser largely because it modifies the expected value of observing using a machine-learned function that accounts for observe actions being unreliable and returning multiple actions, neither of which is possible in my version of the game.

While there are no theoretical guarantees that the strategies produced by Algorithm 2 will converge, the algorithm's similarity to policy iteration [35] led me to suspect that they would converge. Also, since CSLA is greedy, i.e., it selects the best-response strategy at each step of the iteration, I was interested in seeing whether the strategy it found represented a local maximum or a global one. I designed a simple experiment to see how these issues would play out when generating sself and sEVC: I modified the program to use a randomly-generated distribution for the initial value of Obs, rather than always initially setting Obs = Inv as is done in Algorithm 2, and I used this modified program to generate 100 alternate versions of sself and sEVC. I then compared these alternates to the original sself and sEVC using stratDiff.

In the case of sself, I found that all 100 alternate versions were identical to the original. In the case of sEVC, I found that 58 alternate versions were identical to the original, and the rest exhibited a stratDiff of no more than 1.08 × 10^-4. This means that an agent using an alternate version of sEVC would choose all the same actions as one using the original sEVC at least 99.989% of the time. This tells us that not only does CSLA converge for the environment I am testing it in, it converges to the same strategy each time it is run. This suggests that the algorithm is finding a globally-best solution for this environment, rather than getting stuck in a local maximum.

Finally, to estimate how different sself and sEVC are, I ran stratDiff(sself, sEVC) and found it to be 0.27. This means that training a strategy against an external, fixed strategy in Algorithm 2 does produce significantly different results than training a strategy against itself. For a more in-depth look at where sself and sEVC differ, see Section 5.3.2.

Table 5.3: Win percentages of sself and sEVC when playing against EVChooser over 10,000 games as both Defender and Invader.

          Defending vs. EVChooser    Invading vs. EVChooser
  sself   70.65%                     70.16%
  sEVC    69.92%                     69.92%

Pairwise Competitions: sself vs. EVChooser and sEVC vs. EVChooser

I played both of the generated strategies, sself and sEVC, against EVChooser for 20,000 games: in 10,000 games, the generated strategy was defending against an invading population of EVChooser agents, and in 10,000 games the roles were reversed, with the generated strategy invading and EVChooser defending. I recorded the population of each strategy on every round, as well as the winner of every game.³ The populations in an individual game were extremely noisy, as seen in Figure 5.3(e); however, by averaging the populations over all 10,000 games we can see some trends emerge. These average populations for each strategy in all four match-ups are presented in Figure 5.3(a-d), while the win rates for each match-up are presented in Table 5.3.

In Figure 5.3 we see that, on average, the strategies generated by Algorithm 2 control roughly 57% of the population for the majority of the game in all four match-ups. Interestingly, both sself and sEVC are able to reach this point in roughly the same amount of time whether they are invading or defending.
It is also worth noting 3Recall that the winner of a Cultaptation game is the strategy with the highest average population over the last quarter of the game. 68 a) EVChooser invading sself b) sself invading EVChooser c) EVChooser invading sEVC d) sEVC invading EVChooser e) Population of sself at each round, in a single game against EVChooser Figure 5.3: Average populations of both strategies for each round, in match-ups between sself and EVChooser (parts a and b) and between sEVC and EVChooser (parts c and d), over 10,000 games. From round 2000 onwards, sself or sEVC control 57% of the population on average, regardless of whether EVChooser was invading or defending. Since mutation is enabled from round 100 onwards, populations in an individual game (exhibited in part e) are highly mercurial and do not converge. Therefore, we must run a large number of trials and average the results to get a good idea of each strategy?s expected performance. 69 Table 5.4: Percentage of games won (out of 10,000) by sself , sEVC, and EVChooser in a melee contest between all three. sself sEVC EVChooser Melee win percentage 38.78% 37.38% 23.84% that, even though I showed above that sself and sEVC have signi cant di erences, they performed almost identically against EVChooser in terms of population and win percentages Melee Competition: sself vs. sEVC vs. EVChooser My next experiment was to run sself , sEVC, and EVChooser against one another in a melee contest to see how the three strategies would interact in an environment where none of them originally had the upper hand. All three strategies had an initial population of 33 agents at the start of each game. I used the same Inv, probability of change, and probability of death as in Experiment 2. Mutation was disabled for the nal 2,500 rounds of each melee game, as was done in the Cultaptation tournament to allow the population to settle. I ran 10,000 games in this manner, and percentage of wins for each strategy are shown in Table 5.4. In the table we can see that sself has a slight edge over sEVC, and both these strategies have a signi cant advantage over EVChooser. In fact, I observed that in the rst 100 rounds of most games (before mutation begins) EVChooser nearly died out completely, although it is able to gain a foothold once mutation commences. Mutation is also turned o after 7500 rounds in Cultaptation melee games; this caused the population to quickly become dominated by one of the three strategies in all 10,000 games played. 70 Performance Analysis of sself , sEVC, and EVChooser In the experiments in Section 5.3.2, we saw that the strategies found by CSLA consistently outperform EVChooser in environments similar to the ones they were trained in. In order to get a better idea of why this happens, I ran two experiments to compare the performance of sself and EVChooser in more detail. The rst was designed to show the kinds of situations in which the two strategies chose di erent actions, while the second was designed to show how well the two strategies were able to spread good actions through their population. Action Preferences The objective of this experiment was to identify the kinds of situations in which sself , sEVC, and EVChooser made di erent choices. To this end, I allowed sself to play against itself for ve games, in an environment identical to the one used for the previous experiments in Section 5.3.2 (note that this is the same environment sself was trained in). 
On each round, for each agent, I recorded the number of rounds the agent had lived, the value of the best action in its repertoire,⁴ and whether the agent chose to innovate, exploit, or observe on that round. Since there are 100 agents alive at any given time and each game lasts 10,000 rounds, this yielded a total of five million samples. Figures 5.4(a), (d), and (g) show the observed probability that sself would innovate, observe, or exploit (respectively) for its first ten rounds and for each possible best action value. I then repeated this process for sEVC and EVChooser, allowing each strategy to play against itself for five games and recording the same data. The results for sEVC and EVChooser may be found in Figures 5.4(b), (e), and (h), and Figures 5.4(c), (f), and (i), respectively.

⁴This could be 20, 40, 80, 160, or None if the agent had not yet discovered an action.

Figure 5.4: The observed probability that sself, sEVC, and EVChooser will innovate, observe, or exploit when they are a given number of rounds old (on the x-axis) and with a given value of the best action in the agent's repertoire. Panels: (a) sself innovates; (b) sEVC innovates; (c) EVChooser innovates; (d) sself observes; (e) sEVC observes; (f) EVChooser observes; (g) sself exploits; (h) sEVC exploits; (i) EVChooser exploits. These results were observed by allowing each strategy to play itself for five games of 10,000 rounds each with 100 agents alive on each round, generating a total of 5,000,000 samples. All graphs in this figure share the same legend, which is included in graph (c) and omitted elsewhere to save space.

The most obvious difference among the three strategies is that EVChooser almost never innovates,⁵ a property it shares with the strategies that did well in the Cultaptation tournament [17]. On the other hand, sself and sEVC have conditions under which they innovate and conditions under which they do not. For instance, both sself and sEVC always innovate if their first action (which is always an observation) returns no action. Also, sEVC frequently innovates if it is stuck with the worst action after several observes, and sself also innovates (although less frequently; see next paragraph) in this case. Another sharp contrast between EVChooser and the generated strategies is in their exploitation actions. EVChooser spends nearly all of its time exploiting, even if it has a low-value action, and only observes with significant probability on round two. On the other hand, sself and sEVC will begin exploiting immediately if they have one of the two best actions, but otherwise will spend several rounds observing or innovating to attempt to find a better one, and the number of rounds they spend searching for a better action increases as the quality of their best known action decreases.

⁵EVChooser innovates 1% of the time on its first round.

The main difference between sself and sEVC that can be seen in Figure 5.4 is in the way they handle being stuck with the lowest-value action after several rounds. In these circumstances, sself prefers observation while sEVC prefers innovation. Here we see the most obvious impact of the differing environments used to generate these two strategies. sself prefers observation in these cases because it was trained in an environment where all agents are willing to perform innovation.
Therefore, if an sself agent is stuck with a bad action for more than a few rounds it will continue to observe other agents, since if a better action exists, it is likely that it has already been innovated by another agent and is spreading through the population. On the other hand, sEVC prefers innovation in these situations because it has been trained with EVChooser occupying a signi cant portion of the population, and we have seen that EVChooser almost never innovates. Therefore, if sEVC is stuck with a bad action after several rounds, it will attempt to innovate to nd a better one, since it is less likely that another agent has already done so. Spreading High-value Actions The objective of this experiment was to measure the rate at which sself , sEVC, and EVChooser were able to spread high-valued actions through their populations. To measure this, I again played sself against itself in the same environment used in the previous experiment (which I will refer to as the \normal" environment in this section), and on each round I recorded the number of agents exploiting actions with each of the four possible values (20, 40, 80, and 160). To account for the noise introduced by changing action values, I ran 10,000 games and averaged the results for each round. I then repeated this process, playing sEVC and EVChooser against themselves. The results for sself , sEVC, and EVChooser may be found in Figures 5.5(a), (c), and (e) respectively. This experiment lets us see what the steady state for these strategies looks like, and how quickly they are able to reach it. However, I am also interested in seeing how they respond to structural shocks [21, 22] (i.e., how quickly the strategies 74 a) sself in normal environment b) sself in shock enviornment c) sEVC in normal environment d) sEVC in shock enviornment e) EVChooser in normal environment f) EVChooser in shock enviornment Figure 5.5: The average number of agents exploiting an action with value U in two environments. The \normal" environment in parts a, c, and e shows how quickly sself , sEVC, and EVChooser spread actions through their population under normal circumstances when they control the entire population. The \shock" environment in parts b, d, and f shows how quickly each strategy responds to periodic structural shock. The \normal" environment is the same as in the rest of Section 5.3.2, and the \shock" environment is similar except that actions with value 160 are forced to change every 100th round and held constant all other rounds. Each data point is an average over 10,000 games. 75 are able to recover when a good, widely-used action changes values). To this end, I created a \shock" environment, which is identical to the normal environment with one modi cation: actions with value 160 have a probability of change equal to 0 ex- cept on rounds divisible by 100, in which case they have probability of change equal to 1. All other actions use the normal probability of change for this environment, 0.01. This modi cation creates a shock every 100 rounds, while still keeping the expected number of changes the same for all actions. I then repeated the experi- ment above with the shock environment, running 10,000 games for sself , sEVC, and EVChooser and averaging the results, which are presented in Figures 5.5(b), (d), and (f) respectively. In Figure 5.5 we can see that sself and sEVC exhibit nearly identical performance in both the normal and shock environments. 
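For concreteness, the shock schedule just described can be sketched as follows. This is a minimal Python illustration rather than code from the simulator; the function name and arguments are hypothetical, while the constants (value 160, period 100, baseline probability 0.01) are the ones stated above.

    def change_probability(action_value, round_number, shock_environment=True):
        # In the "shock" environment, actions currently worth 160 never change
        # except on rounds divisible by 100, when they are forced to change;
        # every other action keeps the normal probability of change, 0.01.
        if shock_environment and action_value == 160:
            return 1.0 if round_number % 100 == 0 else 0.0
        return 0.01

Averaged over time, a value-160 action therefore still changes once per 100 rounds in expectation, which keeps the expected number of changes the same as in the normal environment.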
In the normal environment, they are able to reach their steady state in only a few rounds, and the steady state consists of a roughly equal number of agents exploiting the best and second-best action. In the shock environment, we see that sself and sEVC respond to external shock by drastically increasing the number of agents exploiting the second-best action over the course of a few rounds, and returning to their steady states at a roughly linear rate over the next 100 rounds. The number of sself and sEVC agents exploiting the two worst actions remains extremely low except for small spikes immediately after each shock. Compared to the generated strategies, EVChooser?s performance appears to be less stable, and less robust to structural shock. In the normal environment, we see that EVChooser takes hundreds of rounds to reach its steady state. While 76 EVChooser?s steady state does include more agents exploiting the best action than sself and sEVC, it also includes a signi cant number of agents exploiting the two worst actions. In the shock environment, we see that changes to the best action result in signi cant increases to the number of EVChooser agents exploiting the other actions, including the two worst ones. We can also see that populations of EVChooser agents take a lot longer to return to normal after an external shock than populations of sself and sEVC. These results help us account for the superior performance of sself and sEVC over EVChooser in previous experiments, and indicate that there is plenty of room for improvement in EVChooser and strategies like it. 77 Chapter 6 Improving Full-scale Cultaptation Strategies So far, the objective of this work has been to analyze Cultaptation and see what types of strategies work well. However, in order to nd strategies that were provably good using the algorithms I was able to develop, it was necessary to use game environments that were much smaller than the ones used in the real Cultap- tation tournament. In this chapter, I will show how to take the observations and analysis used in the previous chapters and use them to develop a strategy for full-size Cultaptation that can outperform the current best strategy from the Cultaptation tournament. 6.1 The Tournament Winner: discountmachine The rst Cultaptaton tournament was won by the strategy discountmachine, by Dan Cownden and Tim Lillicrap of Queen?s University [15]. According to the tournament organizers, discountmachine performed signi cantly better than all other contestants; its population was about 50% higher, on average, than the second-place strategy [17]. Pseudocode for discountmachine is provided as Algorithm 3. It can be sum- marized as follows: On the rst round, it always plays Observe. If it?s the second round and the rst Observe returned no action, it will always play Innovate as a 78 bootstrapping measure. Otherwise, it begins by estimating the probability of change (c^), the average value of the payo distribution ( ^ ), the average value of the observe distribution ( ^Obs), and the current value of its best-known action (v^ ). The values v^ and ^Obs are then multiplied by a discount factor meant to re ect the value of exploiting the actions until either the action changes or the agent dies. 
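As a rough illustration of this estimation-and-discounting step (spelled out in Algorithm 3 below), the computation can be sketched in Python as follows. The function and variable names are mine, not those of the tournament code, and the 0.02 term is the per-round probability of death used in discountmachine's discount factor.

    def discounted_estimates(best_value, rounds_since_seen, c_hat, mu_hat, mu_obs_hat):
        # Expected current value of the best-known action: it has either kept its
        # value or, with the complementary probability, changed to a fresh draw
        # whose expected value is mu_hat.
        stay = (1 - c_hat) ** rounds_since_seen
        v_hat = stay * best_value + (1 - stay) * mu_hat

        # Discount factor: the action survives a round with probability (1 - c_hat)
        # and the agent survives with probability (1 - 0.02).
        f = (1 - c_hat) * (1 - 0.02)

        # Geometric sums: total payoff of exploiting the best-known action from this
        # round on, and of exploiting a newly observed action starting next round.
        v_total = v_hat * 1 / (1 - f)
        obs_total = mu_obs_hat * f / (1 - f)
        return v_total, obs_total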
In order to prevent the agent from missing out on high-value actions in environments with a low probability of change, the strategy includes a heuristic that will cause it to Observe at least once every 20 rounds if ĉ < 0.05 and its current action is not significantly better than any it has seen before. Otherwise, it decides which action to perform based on the output of a pre-trained neural network, which takes as input discounted versions of v̂ and μ̂_Obs, along with an estimate of the degree of noise present in Observation.

[Footnote 1: The strategy also defines a rare subset of values for some parameters for which the neural network performed poorly; in these cases the strategy simply returns Observe if v̂ < μ̂_Obs and Exploits the best-known action otherwise. For ease of exposition I have omitted this from the pseudocode.]

While the source code for discountmachine used in the tournament has been made public on the organizers' website [15], that code includes a neural network with pre-trained weights, and the code used to train the neural network was not made public. However, a comment in the source code explains, "We trained this function by having it try and match the estimate made by a (strategy) with perfect knowledge of what could be observed and (the amount of noise present for Observe moves)."

Algorithm 3  Simplified pseudocode for the discountmachine strategy, which won the first Cultaptation tournament. Returns the action selected by an agent, given the agent's current history h. Full source code is available from the Cultaptation website [15].

discountmachine(History h)
 1: if |h| == 0 then
 2:   return Observe.
 3: end if
 4: if |h| == 1 AND no actions are known then
 5:   return Innovate.
 6: end if
 7: Let ĉ and μ̂ be the probability of change and the average new action value, respectively, experienced in history h.
 8: Let v be the value of the best known action, and r be the number of rounds since that action was seen in h.
 9: v̂ = (1 − ĉ)^r · v + (1 − (1 − ĉ)^r) · μ̂
10: Let μ̂_Obs be the mean value of all observed actions in h.
11: Let f = (1 − ĉ)(1 − 0.02). This is the "discount factor."
12: v̂ = v̂ · 1/(1 − f)
13: μ̂_Obs = μ̂_Obs · f/(1 − f)
14: if ĉ < 0.05 AND |h| > 20 AND the last 20 actions have been Exploit then
15:   if v̂ < 3 + PRU(h) then
16:     return Observe.
17:   else
18:     return Exploit the best-known action.
19:   end if
20: end if
21: return the action selected by a pre-trained neural network with inputs v̂, μ̂_Obs, and an estimate of the amount of noise present in Observation.

6.1.1 Potential Problems with discountmachine

It is clear from the results of the Cultaptation tournament that discountmachine is a significantly better strategy than the other contestants. However, the analysis and experiments presented in Chapters 4 and 5 suggest that it has some aspects that could still be improved.

First, discountmachine only ever innovates as a bootstrapping mechanism on the second round of the game. While the tournament organizers found that this was a universal characteristic among the top-performing strategies in the tournament, Sections 2.2.1 and 5.3.2 present theoretical and empirical evidence that, in at least some circumstances, Innovating can be advantageous. Therefore, a strategy that could recognize these situations would have an advantage over discountmachine.

Second, the formulae used to obtain the total value of exploiting the best-known action (line 12) and the expected result of an Observe (line 13) are incomplete.
These formulae compute the expected total payo the agent will earn by exploiting each of these actions until either the action changes value or the agent dies. As we have seen in Example 5 and Theorem 1, the expected total payo is not always the same as the EPRU, and EPRU is the metric that should be used to determine the quality of a strategy (or, in this case, a potential course of action). Furthermore, the formula on line 13 neglects the fact that the information returned by an Observe pertains to the round before the Observe was performed. This means that, if the newly-observed action is Exploited on the following round, it has had two opportunities to change values, not one, and the expected value of the action should be updated to re ect this. Since the source code used to train the neural net in discountmachine is not provided, it?s di cult to characterize the function approximated by the network. However, given the issues with the formulae used to compute the inputs to the network, it seems unlikely that it is perfectly accurate in determining whether Ob- serving or Exploiting is a better choice. Finally, the experimental results from Section 5.3.2 suggest that the choice of whether to exploit a given action should be determined by the action?s value and 81 the agent?s age, and that the preference towards exploiting should increase mono- tonically as the agent gets older and as the value of the best known action increases. However, the heuristic used on lines 14 through 19 causes discountmachine to con- sistently violate this rule once every 20 rounds in environments with low probability of change. While this does cause the agent to have a better idea of what actions exist to be observed, I have been unable to nd any analytical support for the notion that this will typically increase the agent?s EPRU by more than the amount that is lost by not exploiting as much as possible when the agent already has a high-value action. 6.2 A Full-scale Cultaptation Player: relaxedlookahead In order to show how the analysis and experiments presented in Chapters 4 and 5 can be used to create a player for the Cultaptation tournament, I developed relaxedlookahead. The main insights that inspired relaxedlookahead ?s design were: 1. Actions should be evaluated based upon the estimated amount of EPRU they provide the agent, since EPRU is proportional to reproductive success (The- orem 1). While this may seem intuitive, I know of no current Cultaptation player that uses EPRU or an equivalent metric (e.g., discountmachine esti- mates expected total payo , see above). 2. Innovating should be regarded as a worthwhile choice in at least some cir- cumstances. Even though the top-performing strategies from the tournament tended to eschew Innovate moves, we have seen that there are some circum- 82 stances in which it is preferable. With an accurate formula for estimating the value of innovation, the agent should be able to identify such circumstances. 3. The agent should choose its move based upon its entire history; in other words, hard and fast rules such as the one used by discountmachine on lines 14 through 19 should be avoided. This will allow the agent to have behavior that smoothly transitions according to the value of its best action and age, as sself ?s behavior does in Figure 5.3.2, which gave it a signi cant performance advantage over EVChooser. 4. 
Due to computational constraints present in the Cultaptation tournament, it will not be possible to perform a full lookahead search to a depth su cient to achieve a small . However, a relaxed lookahead search, which considers a few di erent possible outcomes of each action and their e ects over the course of many rounds, should still be su cient to allow the agent to estimate which of three choices (Exploit the best known action, Observe, or Innovate) has the highest EPRU. The relaxedlookahead strategy is de ned in Algorithm 4, and can be summa- rized as follows. Like discountmachine, it always observes on its rst round and, if that observe fails to return a value, it innovates as a bootstrapping measure. Oth- erwise, the strategy begins by creating estimates of several quantities based on the information contained in h; it estimates the probability of change (c^), average value of the payo distribution ( ^ ), current value of the best known action (v^ ), and the mean and standard deviation of the distribution of observe moves ( ^Obs, ^Obs). It 83 then estimates the value of three possible moves: exploiting the best known action, observing, and innovating. These estimates are created using a relaxed lookahead search, which is a very rough approximation of the search conducted in Chapter 5. The relaxed lookahead search is conducted slightly di erently for each of the three possible moves. For the observe move, the strategy models the observe distri- bution Obs as a normal distribution2 with mean ^Obs and standard deviation ^Obs, and calculates the probability that the observed action will have a higher payo than v^ . The search then considers six di erent branches. In the rst branch, the observed action has a payo lower than the current best action, so the agent?s best known action is still v^ . The other ve branches assume the observed action is better than the current best action; candidate payo s are generated by using the inverse CDF of Obs to nd ve values appropriately distributed among the set of values greater than v^ . For example, if v^ was found to be in the 80th percentile of Obs, the ve candidate values would be in the 82nd, 86th, 90th, 94th, and 98th percentiles respectively. For each branch, the strategy assumes the agent will exploit its new best known action for 100 rounds, and it estimates the EPRU of the branch using formulae based on those from Chapter 4. Finally, the EPRU of the observe move is computed by combining the EPRU of each branch with its estimated likelihood. 2The normal distribution was selected as the most likely shape of Obs using a process similar to the one in CSLA; the rst version of relaxedlookahead did not as- sume any shape to Obs and only considered its average value. This version was then played against itself in simulated games, and the shape of the observe distribution from the simulations appeared to be gaussian. I then changed relaxedlookahead to model the observe distribution as a normal distribution, repeated the process, and found that the observe distribution from the simulations was still gaussian. 84 The EPRU of innovating is estimated in a similar manner, but the strategy models the payo distribution as an exponential distribution3 rather than a normal one. Finally, the EPRU of exploiting the current best action is estimated by simply assuming that the agent exploits on the current round and the next 100 rounds. 
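The per-branch estimate just described can be sketched as a short Python function. It mirrors estimateEPRU in Algorithm 4 below, with a 0.98 per-round survival factor (i.e. a death probability of 0.02) and a 100-round horizon; the names are illustrative rather than taken from the implementation.

    def estimate_epru(best_value, mean_new_value, change_prob, age,
                      horizon=100, survival_prob=0.98):
        # Expected per-round utility of exploiting the current best-known action
        # for the next `horizon` rounds.  On lookahead round i the action has kept
        # its value with probability (1 - change_prob)**i; otherwise its expected
        # value has reverted to mean_new_value.  The running payoff total divided
        # by the agent's age gives per-round utility, weighted by the probability
        # of surviving that long.
        total_payoff = 0.0
        epru = 0.0
        for i in range(horizon + 1):
            kept = (1 - change_prob) ** i
            total_payoff += kept * best_value + (1 - kept) * mean_new_value
            epru += survival_prob ** i * total_payoff / (age + i)
        return epru

relaxedlookahead applies the same routine to each candidate payoff produced by the observe and innovate branches, then weights the branches by their estimated likelihoods.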
Once the EPRUs for observing, innovating, and exploiting have all been estimated, relaxedlookahead simply makes the move with the highest estimate.

6.3 Experiments: RLA vs. discountmachine

In order to test the hypothesis that relaxedlookahead would have an advantage over discountmachine, I conducted several experiments to compare their performance.

6.3.1 Overall Performance

The first experiment examines the overall performance of both strategies in a variety of environments. I created two different versions of the action distribution π. The first, π_Geo, is a geometric distribution with probability of success p = 0.1. The second, π_Gam, is a Gamma distribution with shape k = 2 and scale θ = 5. Both of these distributions have a mean value of 10 and conform to the tournament organizers' specification of generating a large number of small-value actions and a few high-value actions [15]. However, π_Geo largely matches the assumptions relaxedlookahead makes about the shape of π, while π_Gam does not: it features a large "hump" around the low-value actions, with the most likely value being 5, and a lower probability of high-value actions than π_Geo.

[Footnote 3, attached to the choice of the exponential distribution in Section 6.2: The exponential distribution was selected due to the tournament organizers' statement that the move distribution would typically contain large numbers of low-value actions and small numbers of high-value actions [15]. The exponential distribution is generally representative of distributions that have this shape, and can easily be constructed since its shape depends only on its mean.]

Algorithm 4  Simplified pseudocode for the relaxedlookahead strategy. Returns the action selected by an agent, given the agent's current history h.

relaxedlookahead(History h)
 1: if |h| == 0 then
 2:   return Observe.
 3: end if
 4: if |h| == 1 AND no actions are known then
 5:   return Innovate.
 6: end if
 7: Let ĉ and μ̂ be the probability of change and the average new action value, respectively, experienced in history h.
 8: Let v be the value of the best known action, and r be the number of rounds since that action was seen in h.
 9: v̂ = (1 − ĉ)^r · v + (1 − (1 − ĉ)^r) · μ̂
10: Let μ̂_Obs and σ̂_Obs be the mean value and standard deviation, respectively, of all observed actions in h.
11: U_O = evaluateObs(μ̂_Obs, σ̂_Obs, v̂, μ̂, ĉ, |h|)
12: U_I = evaluateInv(μ̂, v̂, μ̂, ĉ, |h|)
13: U_X = estimateEPRU(v̂, μ̂, ĉ, |h|)
14: if U_X ≥ max(U_O, U_I) then
15:   return Exploit the best known action.
16: else if U_O ≥ U_I then
17:   return Observe.
18: else
19:   return Innovate.
20: end if

estimateEPRU(v, v̄, c, r)
 1: U = 0
 2: EPRU = 0
 3: for i = 0 → 100 do
 4:   U = U + (1 − c)^i · v + (1 − (1 − c)^i) · v̄
 5:   EPRU = EPRU + 0.98^i · U/(r + i)
 6: end for
 7: return EPRU

Algorithm 5  Pseudocode for the functions used to evaluate Observation and Innovation by relaxedlookahead.

evaluateObs(μ, σ, v, v̄, c, r)
 1: Let π_Obs be a normal distribution with mean μ and standard deviation σ.
 2: Let p_hi be the probability that a value drawn from π_Obs is greater than v.
 3: U_hi = 0
 4: for i = 0 → 4 do
 5:   p = 1 − p_hi · (1 + 2i)/10
 6:   Let v_i be the value of the inverse CDF of π_Obs at percentile p.
 7:   v_i = (1 − c)^2 · v_i + (1 − (1 − c)^2) · v̄
 8:   U_hi = U_hi + estimateEPRU(v_i, v̄, c, r + 1)
 9: end for
10: U_hi = U_hi / 5
11: U_low = estimateEPRU(v, v̄, c, r + 1)
12: return p_hi · U_hi + (1 − p_hi) · U_low

evaluateInv(μ, v, v̄, c, r)
 1: Let π_Inv be an exponential distribution with mean μ.
 2: Let p_hi be the probability that a value drawn from π_Inv is greater than v.
 3: U_hi = 0
 4: for i = 0 → 4 do
 5:   p = 1 − p_hi · (1 + 2i)/10
 6:   Let v_i be the value of the inverse CDF of π_Inv at percentile p.
 7:   v_i = (1 − c) · v_i + c · v̄
 8:   U_hi = U_hi + estimateEPRU(v_i, v̄, c, r + 1)
 9: end for
10: U_hi = U_hi / 5
11: U_low = estimateEPRU(v, v̄, c, r + 1)
12: return p_hi · U_hi + (1 − p_hi) · U_low

I simulated 200 pairwise Cultaptation games between relaxedlookahead and discountmachine (100 with relaxedlookahead invading and 100 with discountmachine invading) in each of 14 different environments. The first seven environments used π_Geo as the action distribution, and had probability of change c equal to 0.001, 0.005, 0.01, 0.05, 0.1, 0.2, and 0.4, respectively. The other seven environments used π_Gam as the action distribution. This set of values for c is the same set used by the Cultaptation tournament organizers for their first round of evaluations [15].

Table 6.1: Performance of relaxedlookahead when played against discountmachine in a variety of environments. Results are based on 200 simulated games for each environment (100 with relaxedlookahead invading, 100 with discountmachine invading). Results in bold or italics are statistically significant with α < 0.01; bold results represent an advantage for relaxedlookahead, italicized results represent an advantage for discountmachine.

                   Action distribution π_Geo          Action distribution π_Gam
                   Win percentage  Average score      Win percentage  Average score
    c = 0.001      .96             82.02              .97             83.96
    c = 0.005      .78             66.29              .84             70.86
    c = 0.01       .65             58.20              .72             63.77
    c = 0.05       .48             47.70              .55             52.21
    c = 0.1        .53             52.97              .59             55.24
    c = 0.2        .44             47.32              .41             47.38
    c = 0.4        .55             53.95              .42             45.04

The results of this experiment are presented in Table 6.1. relaxedlookahead has a statistically significant advantage (α < 0.01) over discountmachine in seven environments: c = 0.001, 0.005, and 0.01 for both action distributions, and c = 0.1 for π_Gam. It has a statistically significant disadvantage in one environment, c = 0.2 with distribution π_Gam. These results suggest that relaxedlookahead has a significant advantage over discountmachine in environments with a low probability of change, and that its advantage shrinks and eventually disappears as the probability of change grows to moderate and high levels. Curiously, the shape of π did not seem to make much difference in relaxedlookahead's performance; in fact, it performed slightly better in the environment that did not match its assumptions. We will investigate the reasons for this in the following section.

It should be noted that, in general, it will be much more difficult to obtain a significant advantage in environments with high probabilities of change, since as long as both players are exploiting as much as possible, their choice of action makes little difference. The tournament organizers have acknowledged this and, appropriately, put significantly more emphasis on low-change environments in all rounds of their evaluation. Therefore, since relaxedlookahead has a clear advantage over discountmachine in such environments, it seems reasonable to conclude that relaxedlookahead would be more likely to win in a tournament setting.

6.3.2 Performance Analysis

The next round of experiments was designed to duplicate those presented in Section 5.3.2, in order to get a more detailed look at the differences between relaxedlookahead and discountmachine.

Action Preferences

The objective of this experiment, like the first one in Section 5.3.2, was to get an idea of the kinds of situations in which relaxedlookahead and discountmachine made different choices.
To that end, I allowed each strategy to compete against itself in 50 full-length games using four separate environments: c = 0:001 and 0:05 with action distribution Geo, and c = 0:001 and 0:05 with action distribu- tion Gam.4 On each round, I recorded the actions taken by all 100 agents and the payo of their best-known action, yielding a total of 50 106 data points for each 4The values c = 0:001 and c = 0:05 were chosen because the rst re ects the largest advantage for relaxedlookahead in the previous experiment, while the second represents the lowest value at which both strategies appear to be evenly matched. 89 strategy in each environment. Figures 6.1 through 6.4 present the results of this experiment, showing the probability that each strategy would innovate, observe, or exploit given its age and the value of its best known action. There are two obvious di erences between the strategies in all four environments. First, discountmachine, like EVChooser, never innovates except as a bootstrap- ping mechanism on the second round of the game. Meanwhile, relaxedlookahead is willing to innovate on any round, under the right circumstances. In fact, if its best known action is between 5 and 15, relaxedlookahead appears to prefer innovating to observing. The fact that it does so in environments in which it handily defeats discountmachine is evidence that innovating should not be completely overlooked, as the top-performing strategies in the tournament tended to do. Second, the observe and exploit graphs for discountmachine exhibit large spikes at regular intervals. These are due to the clause on line 14 of Algorithm 3, which causes discountmachine to observe at least once every 20 rounds unless the probability of change is high. The authors of the strategy explained that this is done to make sure that an agent doesn?t miss a high-value action that most other agents are exploiting. Meanwhile, the same graphs for relaxedlookahead also exhibit spikes, but their size and the interval between them is correlated with the probability of change in the environment and the value of the best-known action. There is no code in relaxedlookahead that explicitly causes this behavior, so I can only conclude that this is emergent behavior and that certain rounds are more likely than others to be opportune times for exploring rather than exploiting. Whatever the cause, this is 90 evidence that arbitrary rules like the one on line 14 of Algorithm 3 are not necessary if the values of observing and exploiting can be estimated accurately. Spreading High-value Actions The objective of the next experiment, like the second one in Section 5.3.2, was to measure the rate at which relaxedlookahead and discountmachine were able to spread high-value actions through their populations. To measure this, I re-ran the simulations used for the previous experiment,5 but this time recorded the value of each action exploited on each round of the game. Since there are 100 agents in each game, this yielded 5,000 data points for each round. Figures 6.5 through 6.8 present the results of this experiment. They show the probability of each strategy exploiting an action in each of six possible value ranges: 0 to 5, 5 to 15, 15 to 25, 25 to 30, 30 to 40, and more than 40.6 These gures make it clear why relaxedlookahead has an advantage over dis- countmachine, particularly in environments with low probability of change. 
Over the rst few generations of agents, discountmachine does an excellent job of quickly spreading actions in the best value category that can reliably be found (e.g., val- ues over 40 in Figure 6.5, values between 15 and 25 in Figure 6.6). However, in all four environments, the proportion of agents exploiting such actions peaks at about round 200 and then begins to steadily decline. Meanwhile, the proportion of relaxedlookahead agents exploiting the same actions increases more slowly, but eventually reaches a steady state with more agents exploiting high-value actions 5Each strategy played against itself in 50 full-length Cultaptation games in each of four environments, with c = 0:001 or 0:05 and geo or gam as the payo distribution. 6In both geo and gam, the mean action value is 10. 91 a) relaxedlookahead Innovates b) discountmachine Innovates c) relaxedlookahead Observes d) discountmachine Observes e) relaxedlookahead Exploits f) discountmachine Exploits Figure 6.1: Frequency that an agent using relaxedlookahead (left) or discountma- chine (right) chose to Innovate, Observe, or Exploit, given its age and best known action, in an environment with the geometrically-distributed action distribution Geo (with mean 10) and probability of change c = 0:001. 92 a) relaxedlookahead Innovates b) discountmachine Innovates c) relaxedlookahead Observes d) discountmachine Observes e) relaxedlookahead Exploits f) discountmachine Exploits Figure 6.2: Frequency that an agent using relaxedlookahead (left) or discountma- chine (right) chose to Innovate, Observe, or Exploit, given its age and best known action, in an environment with the geometrically-distributed action distribution Geo (with mean 10) and probability of change c = 0:05. 93 a) relaxedlookahead Innovates b) discountmachine Innovates c) relaxedlookahead Observes d) discountmachine Observes e) relaxedlookahead Exploits f) discountmachine Exploits Figure 6.3: Frequency that an agent using relaxedlookahead (left) or discountma- chine (right) chose to Innovate, Observe, or Exploit, given its age and best known action, in an environment with the action distribution Gam (i.e. a gamma distribu- tion with shape 2 and scale 5) and probability of change c = 0:001. 94 a) relaxedlookahead Innovates b) discountmachine Innovates c) relaxedlookahead Observes d) discountmachine Observes e) relaxedlookahead Exploits f) discountmachine Exploits Figure 6.4: Frequency that an agent using relaxedlookahead (left) or discountma- chine (right) chose to Innovate, Observe, or Exploit, given its age and best known action, in an environment with the action distribution Gam (i.e. a gamma distribu- tion with shape 2 and scale 5) and probability of change c = 0:001. 95 than discountmachine. This is the consequence of only innovating as a bootstrapping measure, as discountmachine does. Over time, agents will observe actions that are better than the ones they currently know, and begin exploiting them. In environments with low probability of change, the best action in the environment will take a long time to change, which makes it likely that almost all of the agents will be exploiting this action when it does change. This creates a structural shock (see Section 2.2.1), and leaves strategies like discountmachine with knowledge of only a very small subset of the actions in the environment, and no way to learn more without in- novating. Strategies like relaxedlookahead, however, can recover by innovating and nding other actions that have changed to have higher value. 
This phenomenon appears to be the biggest reason why relaxedlookahead has such a large advantage in environments with low probability of change. These gures also help to explain why, in Table 6.1, relaxedlookahead was more likely to win in many environments using Gam, even though Geo more closely matches its assumptions about the shape of . The reason appears to be fairly sim- ple: relaxedlookahead ?s performance is about the same, while discountmachine ap- pears to do worse in the environments using Gam. Comparing relaxedlookahead ?s performance as the distribution changes (i.e. comparing Figure 6.6a to 6.8a), we can see that it is more likely to exploit actions between 15 and 25, and less likely to exploit actions with higher value than 25. This is to be expected, however, since Gam is characterized by a large hump around the mean and a narrower tail than Geo, so there will be fewer actions with very high values in these environments. 96 Meanwhile, comparing Figure 6.6b to 6.8b, we can see that discountmachine has much more trouble nding actions with above-average payo in the environment using Gam, and after round 500 most of its exploits have values between 5 and 15 (recall that the mean of Gam is 10). This explains why relaxedlookahead was able to win against discountmachine seven percent more often in the environment used in Figure 6.8. 97 a) relaxedlookahead b) discountmachine Figure 6.5: Frequency with which relaxedlookahead(top) and discountma- chine(bottom) exploited an action with payo P , for six di erent ranges of values for P , over the rst 1000 rounds of a game. Results were obtained by allowing each strategy to play a 50 games against itself, in an environment with action distribution Geo and probability of change c = 0:001, and averaging the exploits observed on each round. 98 a) relaxedlookahead b) discountmachine Figure 6.6: Frequency with which relaxedlookahead(top) and discountma- chine(bottom) exploited an action with payo P , for six di erent ranges of values for P , over the rst 1000 rounds of a game. Results were obtained by allowing each strategy to play a 50 games against itself, in an environment with action distribu- tion Geo and probability of change c = 0:01, and averaging the exploits observed on each round. 99 a) relaxedlookahead b) discountmachine Figure 6.7: Frequency with which relaxedlookahead(top) and discountma- chine(bottom) exploited an action with payo P , for six di erent ranges of values for P , over the rst 1000 rounds of a game. Results were obtained by allowing each strategy to play a 50 games against itself, in an environment with action distribution Gam and probability of change c = 0:001, and averaging the exploits observed on each round. 100 a) relaxedlookahead b) discountmachine Figure 6.8: Frequency with which relaxedlookahead(top) and discountma- chine(bottom) exploited an action with payo P , for six di erent ranges of values for P , over the rst 1000 rounds of a game. Results were obtained by allowing each strategy to play a 50 games against itself, in an environment with action distribution Gam and probability of change c = 0:05, and averaging the exploits observed on each round. 101 Chapter 7 Conclusion This dissertation has presented a variety of techniques for analyzing Cultap- taiton, a complex evolutionary game designed to explore the phenomenon of social learning. 
Furthermore, the work has demonstrated how to nd strategies that are provably good, and how to analyze such strategies to draw conclusions about how good social learning strategies will operate. Thus, this work has advanced the state of the art in two ways. First, it provides theory and analysis that support many of the empirical results found by the rst Cultaptation tournament, and identi es some aspects of the tournament results that are likely due to experimental error (e.g., the lack of individual learning in any of the top-performing strategies). This helps strengthen our understanding of social learning in general, and should help inform future studies of this phenomenon. Second, it demonstrates techniques that can be used to analyze evolutionary games that are signi cantly more complex than classical evolutionary games, i.e., games that have a nite number of agents, rather than an in nitely large, well-mixed population; games that last a nite, rather than in nite, number of generations; and games that allow agents to live for multiple generations and condition their actions on accumulated experience, rather than replacing the population every generation 102 and preventing agents from accumulating experience in the rst place. One of the reasons Cultaptation is so complex is that early evolutionary models of social learn- ing made very strong assumptions about social learning mechanics and strategies [16, 7], and the conclusions drawn from studying such models generated a controver- sial challenge to social learning?s role in evolutionary tness that has taken decades to fully address [7, 13, 14, 18]. As researchers in the eld of evolutionary game theory continue to study more complex phenomena like social learning, it is likely that these weaker assumptions will be needed more frequently, and so techniques like the ones presented here will be necessary more often. In summary, this dissertation has provided the following contributions: 1. Analyzing strategies? reproductive success. Given a Cultaptation game G and a set S of available strategies for G, the work presents a formula for approximating (to within any > 0) the expected per-round utility, EPRU(s j G;S), of each strategy in S. The work shows that a strategy with maximal expected per- round utility will have the highest expected frequency in the limit, independent of the initial strategy pro le. These results provide a basis for evaluating highly complex strategies such as the ones described below. Generalizability: These results can be generalized to other evolutionary games in which agents live more than one generation, with a xed probability of death at each generation, and reproduction is done using the replicator dynamic. 2. Computing near-best-response strategies. The work provides a strategy-generation algorithm that, given a Cultaptation game G and a set of avail- able strategies S, can construct a strategy s that is within of the a response to 103 S. Generalizability: The strategy-generation algorithm performs a nite-horizon search, and is generalizable to other evolutionary games in which there is a xed upper bound on per-round utility and a nonzero lower bound on the probability of death at each round. 3. Approximating symmetric Nash equilibria. The work provides CSLA, an iterative self-improvement algorithm that uses the strategy-generation algorithm in Section 5.1 to attempt to nd a strategy sself that is a near-best re- sponse in a Cultaptation game in which the other players are all using sself . 
Hence a strategy pro le composed entirely of instances of sself is a symmetric near-Nash equilibrium. Generalizability: An iterative self-improvement algorithm similar to CSLA should be able to nd a near-Nash equilibrium for any game in which the strategies are complex enough that computing a best (or near-best) response is not feasible by analyzing the strategies directly, but is feasible using information from a simulated game between strategies in the pro le. Games of this type will typically have a high branching factor but relatively simple interactions between agents. 4. State aggregation. To make its algorithms fast enough for practical experimentation, the work provides a state-aggregation technique that speeds them up by an exponential factor without any loss in accuracy. The experimental re- sults in Section 5.3 demonstrate the practical feasibility that this provides: in these experiments, CSLA always converged in just a few iterations. Generalizability: The state-aggregation technique is generalizable to other evo- 104 lutionary games in which the utilities are Markovian. 5. Experimental results. In the experimental studies, the near-Nash equi- libria produced by CSLA in any given game were all virtually identical, regardless of the starting values that were used. That strongly suggests (though it does not prove) that the strategy pro le consisting of copies of sself approximates an optimal Nash equilibrium, and possibly even a unique Nash equilibrium. Consequently, sself ?s characteristics provide insights into the characteristics of good Cultaptation strategies. For example, the experiments show that sself relies primarily on observation and exploitation, but switches quickly to innovation when a structural shock occurs, switching back to observation and exploitation once it has learned how to respond to the shock. This con icts with the conventional wisdom [16, 7] that successful social-learning strategies are characterized by a high frequency of innovation, but it helps to explain both the results of the Cultaptation tournament [17] and some recent experimental results on human subjects [18]. 6. Improvement on the best tournament strategy. While the algo- rithms described above are fast enough for experimentation on smaller variants of the Cultaptation game, they would be intractable for use on the more complex vari- ant used in the Cultaptation tournament. The work shows two approaches that can extend the above results to larger environments such as these. First, it shows how the analysis and experimental results outlined above can be used to identify potential problems in the best strategy from the Cultaptation tournament (i.e. dis- countmachine). Second, it uses the formulae from the analysis to de ne a new strategy, relaxedlookahead, that avoids such weaknesses. Experimental results verify 105 that relaxedlookahead is capable of outperforming discountmachine in a variety of environments similar to those used in the Cultaptation tournament, and provide an in-depth analysis of the factors that allowed relaxedlookahead to perform better. The analysis showed that refusing to innovate except as a bootstrapping mea- sure (as discountmachine and all of the top-performing strategies from the tourna- ment did) makes it much more di cult to recover from structural shocks, especially in environments with low probability of change. Strategies that are willing to in- novate when a structural shock is detected, as relaxedlookahead is, are able to avoid this problem. 
The analysis also showed that heuristics instructing a strategy to explore with some minimum frequency, like the one used by discountmachine, are unnecessary, since relaxedlookahead exhibits emergent behavior in which it ex- plores at intervals dependent upon the parameters of the environment, without being speci cally programmed to do so. One possible avenue for future work would be to identify computationally fea- sible techniques capable of approximating the Nash equilibrium strategy in these larger versions of Cultaptation, preferably with provable bounds on the di erence between the performance of such techniques and the Nash equilibrium strategy. In classical games, a regret minimizing strategy1 is typically not computationally intensive and has been shown to have performance quite close to that of a Nash strategy in several large classes of repeated games [40]. Therefore, it may be fruitful 1in short, a regret minimizing strategy seeks to minimize the expected di erence in payo between the result of its chosen action and the best possible result if it had selected a di erent action. 106 to attempt to extend the concept of regret minimization to evolutionary game the- ory; such an extension would need to account for the conceptual di erence between payo in classical games and tness in evolutionary games, perhaps involving an idea of \lost tness minimization," in which the player compares the tness gained by its chosen action to the highest amount of tness it would have obtained if it had selected a di erent action. 7. Implications. These results provide strong support for the following hypotheses about the best strategies for Cultaptation and similar games: What they are like, and how they can be computed. The best strategies are likely to be conditional ones in which the choice of action at each round is conditioned on the agent?s accumulated experience. Such strategies (or close approximations of them) can be computed by doing a lookahead search that predicts how each possible choice of action at the current round is likely to a ect future performance. How they are likely to behave. It is likely that the best strategies will observe and exploit most of the time, but will have ways of quickly detecting structural shocks, so that they can switch quickly to innovation in order to learn how to respond to such shocks. 107 Appendix A Proofs Proposition 2. EPRU(s j G;S)=d = EPRUalt(s; hi j G;S). Proof. First, we will show by induction that EPRUalt(s; hi j G;S) equals the summa- tion of P (h js ;S) EVexp(jh j; U(h [jh j])) for all histories h . Then we will show that this equals the summation of L(jh j)P (h js ;S) PRU(h ) for all h . We will begin with the de nition of EPRUalt in Equation 4.12, and note that P (hi tjhi; s (hi);S) = P (hi tjs ;S) for histories of length one. This gives us a base case of EPRUalt(s ; hi j G;S) = X t2T P (hi tjs ;S) EVexp(1; U(t)) + X t2T P (hi tjs ;S) EPRUalt(s ; hi t j G;S): For the inductive case, we will again start from Equation 4.12, this time noting that P (h js ;S)P (h tjh ; s (h );S) simpli es to just P (h tjs ;S). Thus, for all h we can rewrite P (h js ;S) EPRUalt(s ; h j G;S) in terms of histories one round longer than h , as follows: P (h js ;S) EPRUalt(s ; h j G;S) = X t2T P (h tjs ;S) EVexp(jh tj; U(t)) + X t2T P (h tjs ;S) EPRUalt(s ; h t j G;S)): 108 Therefore, by induction we have EPRUalt(s ; hi j G;S) = X h 2H P (h js ;S) EVexp(jh j; U(h [jh j])); where H is the set of all possible histories. 
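Restated in explicit notation, with ⟨⟩ denoting the empty history and h*[|h*|] the final entry of h*, this identity is

\[
\mathrm{EPRU}_{\mathrm{alt}}\bigl(s^{*}, \langle\rangle \mid G, \mathcal{S}\bigr)
  \;=\; \sum_{h^{*} \in H} P\bigl(h^{*} \mid s^{*}, \mathcal{S}\bigr)\,
        \mathrm{EV}_{\mathrm{exp}}\bigl(|h^{*}|,\, U(h^{*}[\,|h^{*}|\,])\bigr).
\]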
The proof then proceeds arithmetically: EPRUalt(s ; hi j G;S) = X h 2H P (h js ;S) EVexp(jh j; U(h [jh j])) = X h 2H P (h js ;S) 1X i=jh j L(i)U(h [jh j]) i = X h 2H 1X i=jh j L(i)P (h js ;S)U(h [jh j]) i = 1X i=1 X h 2H( i) L(i)P (h js ;S)U(h [jh j]) i = 1X i=1 X h 2H(i) L(i)P (h js ;S) PRU(h ) = EPRU(s j G;S)=d Where H( i) is the set of all histories of length less than or equal to i, and H(i) is the set of all histories exactly of length i. 2 Proposition 3. Strat(h ; k; V;S) returns (s ; U) such that EPRUkalt(s ; h j G;S) = U = maxs0(EPRU k alt(s 0; h j G;S)): Proof. Let s0 be a strategy maximizing EPRUkalt(s 0; h j G;S) and let fs ; Ug be the strategy and value returned by Strat(h ; k; V;S). We will show by induction on k that EPRUkalt(s ; h j G;S) = U = EPRU k alt(s 0; h j G;S): 109 In the base case, k = 0 and clearly EPRU0alt(s 0; h j G;S) = 0 for any s0, therefore EPRUkalt(s ; h j G;S) = EPRU k alt(s 0; h j G;S) = 0 = U as required. For the inductive case, suppose that for k, Strat(h ; k; V;S) returns fs ; Ug such that EPRUkalt(s ; h j G;S) = U = maxs0(EPRU k alt(s 0; h j G;S)). We must then show that Strat(h ; k + 1; V;S) returns fs ; Ug such that EPRUk+1alt (s ; h j G;S) = U = maxs0(EPRU k+1 alt (s 0; h j G;S)): Let stemp be the strategy constructed in lines 8{19 of the algorithm. First we show that on line 20, EPRUk+1alt (stemp; h j G;S) = (A.1) X t2T P (h tjh ; stemp(h );S) EVexp(jh j; U(t)) + EPRU k alt(stemp; h t j G;S) = Utemp: This follows because the t on line 11 iterates over all possible t 2 T (due to the for loops on lines 6, 9, and 10), meaning that the eventual value of Utemp is X t2T P (h tjh ; stemp(h )jS) (EVexp(jh tj; U(t)) + U 0) : By the inductive hypothesis, U 0 = EPRUkalt(stemp; h t j G;S), su cing to show that (A.1) holds. Now we show that EPRUk+1alt (s ; h j G;S) = EPRU k+1 alt (s 0; h j G;S): Clearly EPRUk+1alt (s ; h j G;S) EPRU k+1 alt (s 0; h j G;S), since s0 is assumed to 110 have maximal EPRUk+1alt for h , so it su ces to show that EPRUk+1alt (s ; h j G;S) EPRU k+1 alt (s 0; h j G;S): Since s maximizes X t2T P (h tjh ; s (h );S)(EVexp(jh tj; U(t)) + U 0); where U 0 EPRUkalt(s 0; h t j G;S) by the inductive hypothesis, there can be no action a such that X t2T P (h tjh ; a;S)(EVexp(jh tj; U(t)) + EPRU k alt(s 0; h t j G;S)) > X t2T P (h tjh ; s 0(h );S)(EVexp(jh tj; U(t)) + U 0): Therefore EPRUk+1alt (s ; h j G;S) EPRU k+1 alt (s 0; h j G;S). This concludes the inductive argument. Thus for all k, EPRUkalt(s ; h j G;S) = U = maxs0 EPRU k alt(s 0; h j G;S). 2 111 Appendix B Converting from Histories to Repertoires In this appendix, we will formally de ne a repertoire, explain how to trans- form histories into repertoires, and show that the number of possible repertoires is substantially smaller than the number of possible histories, while maintaining the property that any best-response action for a given repertoire is also a best resonse for any history associated with that repertoire (Theorem 5). Finally, we will present a modi ed version of Algorithm 1, which uses repertoires rather than histories, and we will show how this simple change cuts the branching factor of the algorithm in half (Algorithm 7). B.1 Repertoire De nition A repertoire tells the last value and age of each action an agent \knows," where an action?s age is the number of rounds that have passed since the agent last obtained information about it. 
Since at any given point in a game, each known action has a unique age, we label exploitation actions by their value and age, leaving o the action number (e.g. if we discovered an action with value 4 last round and an action with value 26 three rounds ago, then the repertoire will be fh4; 1i; h26; 3ig where h4; 1i denotes the existence of an action with value 4 discovered 1 round ago, and h26; 3i denotes the existence of an action with value 26 discovered 3 rounds 112 Algorithm 6 Creates and returns a repertoire for history h . CreateRepertoire(h = (a1; (m1; v1)); : : : ; (ar; (mr; vr))) Let M = fmiji = 1; : : : ; rg. Let R = ; fR will be the repertoire.g for m 2M do Let i = max f1;:::;rg (mi = m) Add hvi; r ii to R. end for return R ago). Formally, a repertoire is de ned to be a set of pairs, where the rst value in each pair represents the knowledge of an action with the given value, while the second value in the pair represents the number of rounds since that knowledge was last updated. De nition. Let v1; : : : ; vm 2 V be action values and 1; : : : ; m 2 Z+ (the positive integers) be action ages. A repertoire R is a set of action value/action age pairs R = fhv1; 1i; : : : ; hvm; mig. We denote the set of all repertoires as Rep, and the set of all repertoires where all i j as Repj. 2 Rep has unbounded size, but Repj has nite size. We show how to create a repertoire R from a history h using the CreateRepertoire function in Algorithm 6. Repertoires change based on the action performed. For example, repertoire R = fh4; 1i; h26; 3ig can change to repertoire R0 = fh4; 2i; h26; 4i; h27; 1ig after an innovation action where an action with value 27 is innovated. Notice that all actions in R0, apart from the newly-innovated action with age 1, are one round older 113 than they were in R. This aging process occurs often enough for us to introduce a function which ages a repertoire R = fhvi; iig: age(fhvi; iig) = fhvi; i + 1ig Finally, we will introduce two functions to represent the two ways our reper- toire can change when we perform an action. The rst, newaction, returns a repertoire with a new action added to it: newaction(R; v) = age(R) [ fhv; 1ig The second, updaction, returns a repertoire with updated information on action m: updaction(R; v;m) = age(R n fhvm; mig) [ fhv; 1ig B.2 Transition Probabilities We can now de ne the probability of transitioning between repertoires on round r. We will call the transition probability functions PRep(R0jR; r; a;S) for a 2 fInv;Obs;Xig. In general, these functions will mirror the P (h0jh; a;S) functions de ned in Section 4.1, with some extra clauses added to ensure that if it is not possible to go from repertoire R to repertoire R0 using the given action, then the transition probability is 0. Innovation actions For innovation actions, the function is: PRep(R0jR; r; Inv;S) = 8 >>< >>: 0 if jRj = M _ 6 9v : R0 = newaction(R; v) (v) if R0 = newaction(R; v) 114 The rst clause ensures that if all possible actions are already in R, or if it is not possible to go from R to R0 in one innovation action, then the transition probability is 0. The second clause simply tells us the probability of innovating an action with value v, given that it is possible to go from R to R0 in one innovation action. Observation actions In Section 4.1 we assumed the existence of a distribution Obs that, when given the current history, would tell us the probability of observing an action with a given value. 
Here, we will make the following assumptions about Obs: When given a repertoire and round number, Obs(vjR; r;S) tells us the prob- ability of observing an action that has value v and is not already known by R. When given a repertoire and round number, Obs(m; vjR; r;S) tells us the probability of observing an action that has value v, and was previously in R at position m. The value of this action may have changed. Obs can make its predictions without using any information lost when con- verting from a history to a repertoire. These assumptions are all satis ed by the Obs used in our implementation, and we expect them to hold for other practical implementations as well, since a distribution conditioned on entire histories would be impractically large. 115 With this in mind, the transition probability function for observation actions is PRep(R0jR; r;Obs;S) = 8 >>>>>>>>>>< >>>>>>>>>>: Obs(m; vjR; r;S) if hv0m; mi 2 R ^ R0 = updaction(R; v;m) Obs(vjR; r;S) if R0 = newaction(R; v) 0 Otherwise. The rst clause gives us the probability of observing an action already in our repertoire, while the second gives us the probability of observing a new action. Exploitation actions Let hvi; ii be the value and age of exploitation action Xi. Then PRep(R0jR; r;Xi;S) = 8 >>>>>>>>>>>>>>>>>>< >>>>>>>>>>>>>>>>>>: 0 if jRj 6= jR0j 0 if 8v0 2 V;R0 6= updaction(R; v0; i) Qr j=r i (1 c(j)) + Pr j=r i c(j) (vi; j) Qr i=j(1 c(i)) if R0 = updaction(R; vi; i) Pr j=r i c(j) (v0i; j) Qr i=j(1 c(i)) if R0 = updaction(R; v0i; i) and vi 6= v 0 i The rst two clauses check that we can, in fact, transition between R and R0 by exploiting. The third clause gives us the probability that the action we exploited has not changed since we last saw it, while the fourth clause gives us the probability that the action we exploited has changed. 116 B.3 Consistency between P and PRep Later in this section, we will show that using a repertoire-based algorithm to compute -best-response strategies returns the same results as using the history- based Algorithm 1. To do this, we will use the notion of consistency between the P and PRep equations. De nition. Let M be the set of actions known to an agent with history h . The P and PRep equations are consistent for h if, for all a 2 fInv;Obs;Xig and v 2 V : X m2M P (h (a; (m; v)) j h; a;S) = jRjX m=1 PRep(updaction(R; v;m)jR; r; a;S) (B.1) and X m2f1;:::;MgnM P (h (a; (m; v)) j h; a;S) = PRep(newaction(R; v)jR; r; a;S) (B.2) where R = CreateRepertoire(h) and r = jh j. 2 Lemma 6. The P and PRep equations are consistent for all h 2 H. Proof. We can prove this by using the de nition of P , found in Section 4.1, and the de nition of PRep found above. We will simply consider what happens for arbitrary h and v when performing innovation, observation, and exploitation actions. Recall that X(h) returns the number of exploit moves available to an agent with history h. For ease of exposition, we will assume without loss of generality that the rst action learned by h has label 1, the second has label 2, etc. Thus M = f1; : : : ; X(h)g, while f1; : : : ;Mg nM = fX(h) + 1; : : : ;Mg. 117 Innovation actions An innovation action always returns information on a new ac- tion, so both sides of Equation B.1 are clearly 0 in this case. If X(h) = jRj = M , both sides of Equation B.2 are also 0 since no new actions can be innovated. Thus, we will assume X(h) = jRj < M . 
We now have MX m=X(h)+1 P (h (Inv; (m; v))jh; Inv;S) = (M X(h)) (v) M X(h) = (v) and PRep(newaction(R; v)jR; r; Inv;S) = (v) which are clearly equivalent. Hence, P and PRep are consistent on h when a = Inv. Observation actions This section of the proof is mostly trivial, given the assump- tions we have made about Obs. The left side of Equation B.1 is X(h) m=1 P (h (Obs; (m; v))jh;Obs;S) = X(h) m=1 Obs(m; vjh;S) which tells us the probability of observing one of the actions already seen in our current history. The right side is jRjX m=1 PRep(updaction(R; v;m)jR; r;Obs;S) = jRjX m=1 Obs(m; vjR; r;S) Since we assume that Obs(m; vjR; r;S) tells us the probability of observing action m in the repertoire, PjRj m=1 Obs(m; vjR; r;S) also tells us the probability of observ- ing any of the actions we have already seen. Therefore, Equation B.1 holds for observation actions. For Equation B.2, the left side is MX m=X(h)+1 P (h (Obs; (m; v))jh;Obs;S) = MX m=X(h)+1 Obs(m; vjh;S) 118 which tells us the probability of observing an action we have not yet seen. The right side is PRep(newaction(R; v)jR; r;Obs;S) = Obs(vjR; r;S) Since we assume that Obs(vjR; r;S) gives us the probability of observing a new action, Equation B.2 also holds for observation actions. Hence, P and PRep are consistent on h when a = Obs. Exploitation actions Exploiting an action never gives us information about a new action, so both sides of Equation B.2 are 0 when we exploit. Thus, we need only consider Equation B.1. We will consider two cases. In the rst case, the action we choose to exploit has changed since we last saw it, so v is a new value. We then have X(h) m=1 P (h (Xm; (m; v))jh;Xm;S) = X(h) m=1 rX j=lastm c(j) (v; j) rY i=j 1 c(i) where lastm is the last round number on which we obtained any information about the m-th action. Similarly, since exploiting never increases the size of a repertoire, we have jRjX m=1 PRep(updaction(R; v;m)jR; r;Xm;S) = jRjX m=1 rX j=r m c(j) (v; j) rY i=j (1 c(i)) Since jRj = X(h) and lastm = r m by de nition, Equation B.1 holds when the action we exploit changes. Next, we consider the case where the action we choose to exploit has not changed. In this case, we have 119 X(h) m=1 P (h (Xm; (m; v))jh;Xm;S) = X(h) m=1 rY j=lastm (1 c(j)) + rX j=lastm c(j) (v; j) rY i=j 1 c(i) ! and jRjX m=1 PRep(updaction(R; v;m)jR; r;Xm;S) = jRjX m=1 rY j=r m (1 c(j)) + rX j=r m c(j) (v; j) rY i=j (1 c(i)) ! which are also equivalent. Therefore, Equation B.1 also holds when the action we exploit does not change. We have now shown that P and PRep are consistent on arbitrary h when a 2 fInv;Obs;Xig and for arbitrary v. Therefore, P and PRep are consistent for all h . 2 B.4 Repertoire-Based Strategies Repertoires can be used to more compactly de ne a strategy. We let a repertoire-based strategy s be a function from repertoires to actions. Such a strategy can be represented more compactly than the history-based strategies used earlier in this work, since there are fewer possible repertoires than there are possible histories. In any history h , a repertoire-based strategy s chooses the action associated with repertoire CreateRepertoire(h). We can use the PRep functions to de ne a formula that determines the EPRU for any repertoire-based strategy s. 
We can use the $P^{Rep}$ functions to define a formula that determines the EPRU of any repertoire-based strategy $s$. $\mathrm{EPRU}_{alt}(s, R, r \mid G, S)$ is a recursive function for calculating the expected per-round utility of $s$:
\[
\begin{aligned}
\mathrm{EPRU}_{alt}(s, R, r \mid G, S) = \sum_{a \in A} \sum_{v \in V} \Big[\;
& P^{Rep}(\mathrm{newaction}(R, v) \mid R, r, a, S)\,\big(\mathrm{EV}_{exp}(r, U((a, (\cdot, v)))) + \mathrm{EPRU}_{alt}(s, \mathrm{newaction}(R, v), r+1 \mid G, S)\big) \\
{}+{} & \sum_{m=1}^{|R|} P^{Rep}(\mathrm{updaction}(R, v, m) \mid R, r, a, S)\,\big(\mathrm{EV}_{exp}(r, U((a, (\cdot, v)))) + \mathrm{EPRU}_{alt}(s, \mathrm{updaction}(R, v, m), r+1 \mid G, S)\big) \Big],
\end{aligned}
\]
where $A$ is the set of possible actions and $V$ is the set of possible action values.

However, like $\mathrm{EPRU}_{alt}(s, h \mid G, S)$, $\mathrm{EPRU}_{alt}(s, R, r \mid G, S)$ contains infinite recursion and is therefore not computable. We deal with this problem as we did in Section 4.2.1, by introducing a depth-limited version. For ease of exposition, we introduce two helper functions
\[
\mathrm{EPRU}^{k}_{altnew}(s, R, r, a, v \mid G, S) = P^{Rep}(\mathrm{newaction}(R, v) \mid R, r, a, S)\,\big[\mathrm{EV}_{exp}(r, U((a, (\cdot, v)))) + \mathrm{EPRU}^{k-1}_{alt}(s, \mathrm{newaction}(R, v), r+1 \mid G, S)\big]
\]
and
\[
\mathrm{EPRU}^{k}_{altupd}(s, R, r, a, v \mid G, S) = \sum_{m=1}^{|R|} P^{Rep}(\mathrm{updaction}(R, v, m) \mid R, r, a, S)\,\big[\mathrm{EV}_{exp}(r, U((a, (\cdot, v)))) + \mathrm{EPRU}^{k-1}_{alt}(s, \mathrm{updaction}(R, v, m), r+1 \mid G, S)\big].
\]
Now we can define
\[
\mathrm{EPRU}^{k}_{alt}(s, R, r \mid G, S) =
\begin{cases}
0 & \text{if } k = 0,\\[4pt]
\displaystyle \sum_{a \in A} \sum_{v \in V} \big(\mathrm{EPRU}^{k}_{altnew}(s, R, r, a, v \mid G, S) + \mathrm{EPRU}^{k}_{altupd}(s, R, r, a, v \mid G, S)\big) & \text{otherwise.}
\end{cases}
\]
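To show the shape of this computation, here is a small Python sketch of the depth-limited recursion, written against hypothetical callables p_rep_new(R, r, a, v) and p_rep_upd(R, r, a, v, m) for the two $P^{Rep}$ cases, ev_exp(r, u) for $\mathrm{EV}_{exp}$, utility(a, v) for $U$, and the repertoire operations newaction and updaction defined earlier in this appendix. It mirrors the equations above and is not the implementation used in our experiments.

```python
def epru_alt(k, R, r, actions, values,
             p_rep_new, p_rep_upd, ev_exp, utility, newaction, updaction):
    """Depth-limited EPRU^k_alt(s, R, r | G, S): expected per-round utility
    accumulated over the next k rounds, computed by recursing over every
    possible next repertoire and weighting it by its transition probability."""
    if k == 0:                                   # base case: no rounds left
        return 0.0
    total = 0.0
    for a in actions:
        for v in values:
            u = ev_exp(r, utility(a, v))
            # Case 1: the percept reveals an action not yet in the repertoire.
            p = p_rep_new(R, r, a, v)
            if p > 0:
                total += p * (u + epru_alt(k - 1, newaction(R, v), r + 1,
                                           actions, values, p_rep_new, p_rep_upd,
                                           ev_exp, utility, newaction, updaction))
            # Case 2: the percept updates the action already at position m.
            for m in range(1, len(R) + 1):
                p = p_rep_upd(R, r, a, v, m)
                if p > 0:
                    total += p * (u + epru_alt(k - 1, updaction(R, v, m), r + 1,
                                               actions, values, p_rep_new, p_rep_upd,
                                               ev_exp, utility, newaction, updaction))
    return total
```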
A proof that this formulation is equivalent to the version of $\mathrm{EPRU}^{k}_{alt}$ from Section 4.2.1 follows.

Theorem 5. For all histories $h$, all repertoire-based strategies $s$, and all $k \geq 0$, if $s'$ is a function from histories to actions such that $s'(h) = s(\mathrm{CreateRepertoire}(h))$, $R = \mathrm{CreateRepertoire}(h)$, and $r = |h|$, then $\mathrm{EPRU}^{k}_{alt}(s, R, r \mid G, S) = \mathrm{EPRU}^{k}_{alt}(s', h \mid G, S)$.

Proof. We prove this by induction on $k$. For the base case, $k = 0$, we have $\mathrm{EPRU}^{0}_{alt}(s, R, r \mid G, S) = \mathrm{EPRU}^{0}_{alt}(s', h \mid G, S) = 0$ by definition.

For the inductive step, we assume that Theorem 5 holds for some $k \geq 0$ and show that it also holds for $k + 1$. Recall from Section 4.2.1 that, in this case,
\[
\mathrm{EPRU}^{k+1}_{alt}(s', h \mid G, S) = \sum_{t \in T} P(h \cdot t \mid h, s'(h), S)\,\big(\mathrm{EV}_{exp}(r, U(t)) + \mathrm{EPRU}^{k}_{alt}(s', h \cdot t \mid G, S)\big).
\]
Since $T$ is simply the set of all action-percept pairs, we can instead write this as
\[
\mathrm{EPRU}^{k+1}_{alt}(s', h \mid G, S) = \sum_{a \in A} \sum_{v \in V} \Big[ \sum_{m=1}^{M} P(h \cdot (a, (m, v)) \mid h, s'(h), S)\,\big(\mathrm{EV}_{exp}(r, U((a, (m, v)))) + \mathrm{EPRU}^{k}_{alt}(s', h \cdot (a, (m, v)) \mid G, S)\big) \Big],
\]
where $A$ is the set of possible actions and $V$ the set of possible values. We also have (recall that the function $U$ simply calculates the utility of performing the given action and does not depend on the action number, so $U(a, m, v) = U(a, \cdot, v)$ for any legal $m$)
\[
\mathrm{EPRU}^{k+1}_{alt}(s, R, r \mid G, S) = \sum_{a \in A} \sum_{v \in V} \Big[ P^{Rep}(\mathrm{newaction}(R, v) \mid R, r, a, S)\,\big(\mathrm{EV}_{exp}(r, U((a, (\cdot, v)))) + \mathrm{EPRU}^{k}_{alt}(s, \mathrm{newaction}(R, v), r+1 \mid G, S)\big) + \sum_{m=1}^{|R|} P^{Rep}(\mathrm{updaction}(R, v, m) \mid R, r, a, S)\,\big(\mathrm{EV}_{exp}(r, U((a, (\cdot, v)))) + \mathrm{EPRU}^{k}_{alt}(s, \mathrm{updaction}(R, v, m), r+1 \mid G, S)\big) \Big].
\]
Recall from Lemma 6 that $P$ and $P^{Rep}$ are consistent on $h$. We can combine this with our inductive hypothesis to show that the bracketed portions of the two equations above are equal.

Recall that when a repertoire encounters a new action, it does not store the action number $m$ for that action. Thus, for any pair $m$ and $m'$ that are not already in $h$, we know that $\mathrm{CreateRepertoire}(h \cdot (a, (m, v))) = \mathrm{CreateRepertoire}(h \cdot (a, (m', v))) = \mathrm{newaction}(R, v)$. Therefore, by our inductive hypothesis,
\[
\mathrm{EPRU}^{k}_{alt}(s', h \cdot (a, (m, v)) \mid G, S) = \mathrm{EPRU}^{k}_{alt}(s, \mathrm{newaction}(R, v), r + 1 \mid G, S)
\]
for all $m$ not already in $h$. Notice that, by definition, there are $M - X(h)$ values of $m$ that are not already in $h$. If we assume without loss of generality that the first action learned in $h$ has label 1, the second has label 2, and so on, then we can define $\alpha_{new}$ to be the quantity that $P$ and $P^{Rep}$ are multiplied by when we learn something new:
\[
\begin{aligned}
\alpha_{new} &= \mathrm{EV}_{exp}(r, U((a, (\cdot, v)))) + \mathrm{EPRU}^{k}_{alt}(s', h \cdot (a, (X(h)+1, v)) \mid G, S) \\
&= \mathrm{EV}_{exp}(r, U((a, (\cdot, v)))) + \mathrm{EPRU}^{k}_{alt}(s', h \cdot (a, (X(h)+2, v)) \mid G, S) \\
&\;\;\vdots \\
&= \mathrm{EV}_{exp}(r, U((a, (\cdot, v)))) + \mathrm{EPRU}^{k}_{alt}(s', h \cdot (a, (M, v)) \mid G, S) \\
&= \mathrm{EV}_{exp}(r, U((a, (\cdot, v)))) + \mathrm{EPRU}^{k}_{alt}(s, \mathrm{newaction}(R, v), r + 1 \mid G, S).
\end{aligned}
\]
Similarly, the inductive hypothesis also tells us that
\[
\mathrm{EPRU}^{k}_{alt}(s', h \cdot (a, (m, v)) \mid G, S) = \mathrm{EPRU}^{k}_{alt}(s, \mathrm{updaction}(R, v, m), r + 1 \mid G, S)
\]
for all $m$ that are already in $h$. Thus, we can also define $\alpha_m$ to be the quantity that $P$ and $P^{Rep}$ are multiplied by when we update our information on action $m$:
\[
\alpha_m = \mathrm{EV}_{exp}(r, U((a, (\cdot, v)))) + \mathrm{EPRU}^{k}_{alt}(s', h \cdot (a, (m, v)) \mid G, S) = \mathrm{EV}_{exp}(r, U((a, (\cdot, v)))) + \mathrm{EPRU}^{k}_{alt}(s, \mathrm{updaction}(R, v, m), r + 1 \mid G, S)
\]
for $m = 1, \ldots, X(h)$.

We can now rewrite $\mathrm{EPRU}^{k+1}_{alt}(s', h \mid G, S)$ and $\mathrm{EPRU}^{k+1}_{alt}(s, R, r \mid G, S)$ as
\[
\mathrm{EPRU}^{k+1}_{alt}(s', h \mid G, S) = \sum_{a \in A} \sum_{v \in V} \Big[ \sum_{m=X(h)+1}^{M} P(h \cdot (a, (m, v)) \mid h, s'(h), S)\,\alpha_{new} + \sum_{m=1}^{X(h)} P(h \cdot (a, (m, v)) \mid h, s'(h), S)\,\alpha_m \Big]
\]
and
\[
\mathrm{EPRU}^{k+1}_{alt}(s, R, r \mid G, S) = \sum_{a \in A} \sum_{v \in V} \Big[ P^{Rep}(\mathrm{newaction}(R, v) \mid R, r, a, S)\,\alpha_{new} + \sum_{m=1}^{|R|} P^{Rep}(\mathrm{updaction}(R, v, m) \mid R, r, a, S)\,\alpha_m \Big].
\]
Equations B.1 and B.2 tell us that, regardless of the values of $a$ and $v$,
\[
\sum_{m=X(h)+1}^{M} P(h \cdot (a, (m, v)) \mid h, s'(h), S) = P^{Rep}(\mathrm{newaction}(R, v) \mid R, r, a, S)
\]
and
\[
\sum_{m=1}^{X(h)} P(h \cdot (a, (m, v)) \mid h, s'(h), S) = \sum_{m=1}^{|R|} P^{Rep}(\mathrm{updaction}(R, v, m) \mid R, r, a, S).
\]
Therefore, $\mathrm{EPRU}^{k+1}_{alt}(s', h \mid G, S) = \mathrm{EPRU}^{k+1}_{alt}(s, R, r \mid G, S)$. This completes the induction. □

B.5 Repertoire-Based Algorithm

Now that we have a formula for computing the EPRU of a repertoire-based strategy, and we know that using repertoires rather than histories to calculate EPRU gives the same results, we can update our algorithm to use repertoires. The new algorithm is almost identical to Algorithm 1, except that using repertoires rather than histories allows us to cut the number of recursive calls in half.

Let $R'_{Obs}$ be the set of repertoires $R'$ for which $P^{Rep}(R' \mid R, r, \mathrm{Obs}, S) > 0$, and define $R'_{Inv}$ and $R'_{X_i}$ similarly. Note that, if $R$ contains $m$ different actions,
\[
R'_{Obs} \subseteq R'_{Inv} \cup \bigcup_{i=1}^{m} R'_{X_i}. \tag{B.3}
\]
In other words, since repertoires do not need to remember which actions our agent performed, choosing $X_1$ produces the same repertoire as choosing Obs and observing action 1. Similarly, choosing Inv and choosing Obs can also produce the same repertoires, if both actions happen to tell us about the same action. However, there is no action we could encounter through observing that we could not also encounter through either innovating or exploiting. Therefore, if we save the results of the recursive calls used to calculate the utilities of Inv and $X_1, \ldots, X_m$, we can compute the utility of Obs without any additional recursion. This cuts the branching factor of our algorithm in half, from $(2m + 2)v$ to $(m + 1)v$, which reduces the size of the search tree by a factor of $2^k$ for search depth $k$, without any impact on accuracy. Algorithm 7 is the complete algorithm.

Algorithm 7: Produce a strategy $s$ that maximizes $\mathrm{EPRU}^{k}_{alt}(s, R \mid G, S)$, given initial repertoire $R$ and set of possible utility values $V$.
Strat(R, r, k, V, S)
 1: if k = 0 then
 2:     return {null, 0}
 3: end if
 4: Let U_max = 0
 5: Let s_max = null
 6: Let U_Obs = 0
 7: for each action a ∈ {X_1, …, X_M, Inv} do
 8:     Let U_temp = 0
 9:     Let s_temp = null
10:     for each value v ∈ V do
11:         Let t = ⟨v, 1⟩
12:         if ∃i : (a = X_i) then let R' = age(R \ {⟨v_i, τ_i⟩}) ∪ {t} and p = P^Rep(R' | R, r, X_i, S)
13:         else let R' = age(R) ∪ {t} and p = P^Rep(R' | R, r, Inv, S)
14:         Let p_Obs = P^Rep(R' | R, r, Obs, S)
15:         if p + p_Obs > 0 then
16:             Let {S', U'} = Strat(R', r + 1, k − 1, V, S)
17:             if p_Obs > 0 then U_Obs = U_Obs + p_Obs · U'
18:             if p > 0 then
19:                 s_temp = s_temp ∪ S'
20:                 if ∃i : (a = X_i) then U_temp = U_temp + p · (EV_exp(r, v) + U')
21:                 else U_temp = U_temp + p · U'
22:             end if
23:         end if
24:     end for
25:     if U_temp > U_max then
26:         U_max = U_temp
27:         s_max = s_temp ∪ {⟨R, a⟩}
28:     end if
29: end for
30: if U_Obs > U_max then
31:     U_max = U_Obs
32:     s_max = s_temp ∪ {⟨R, Obs⟩}
33: end if
34: return {s_max, U_max}

Appendix C
The Number of Pure Strategies for Cultaptation

In Cultaptation as defined on the tournament web site, each game includes 10,000 rounds, 100 agents, 100 exploitation actions, and the actions Inv and Obs. Let $S$ be the set of all pure Cultaptation strategies, and $S'$ be the set of all strategies in which the first 100 moves are Inv and all subsequent moves are exploitation actions. Then any lower bound on $|S'|$ is a (loose) lower bound on $|S|$.

Suppose an agent uses a strategy in $S'$. If it survives the first 100 rounds of the game, it will learn values for all 100 of the exploitation actions. There are $100!$ different orders in which these actions may be learned, and for each action there are 100 possible values, hence $100^{100}$ possible combinations of values. Thus after 100 Inv moves, the number of possible histories is $100! \cdot 100^{100}$.

All subsequent moves by the agent will be exploitations, and it is possible (though quite unlikely!) that the agent may live for the remaining 9,900 rounds of the game. Thus each of the above histories is the root of a game tree of height $2 \cdot 9{,}900$. In this game tree, each node of even depth is a choice node (each branch emanating from the node corresponds to one of the 100 possible exploitation actions), and each node of odd depth is a value node (each branch emanating from the node corresponds to one of the 100 different values that the chosen action may return). Since there are $100! \cdot 100^{100}$ of these game trees, the total number of choice nodes is
\[
100! \cdot 100^{100} \cdot \sum_{d=0}^{9899} (100^2)^d > 9.3 \times 10^{39953}.
\]
If we use the conventional game-theoretic definition that a pure strategy $s$ must include a choice of action at each choice node, regardless of whether the choice node is reachable given $s$, then it follows that
\[
|S'| > 100^{\,9.3 \times 10^{39953}}.
\]
If we use the definition used by game-tree-search researchers, in which a pure strategy only includes a choice of action at each choice node that is reachable given $s$, then the number of reachable choice nodes given $s$ is
\[
100! \cdot 100^{100} \cdot \sum_{d=0}^{9899} 100^d > 9.4 \times 10^{20155},
\]
so
\[
|S'| > 100^{\,9.4 \times 10^{20155}}.
\]
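These bounds are easy to check with exact integer arithmetic. The following Python snippet (an illustrative check, not part of the dissertation's software) recomputes both choice-node counts using the closed form of the geometric sum and confirms the stated leading digits and exponents.

```python
from math import factorial

def geometric_sum(base, n_terms):
    """Sum of base**d for d = 0, ..., n_terms - 1, in exact integer arithmetic."""
    return (base ** n_terms - 1) // (base - 1)

histories_after_innovation = factorial(100) * 100 ** 100      # 100! * 100^100

# Choice nodes counting every node of even depth (conventional game-theoretic
# definition): each choice level multiplies the count by 100^2.
all_choice_nodes = histories_after_innovation * geometric_sum(100 ** 2, 9900)

# Choice nodes reachable under a fixed pure strategy (game-tree-search
# definition): only the 100 value branches multiply the count per level.
reachable_choice_nodes = histories_after_innovation * geometric_sum(100, 9900)

def leading_digits(n, k=2):
    """Return (mantissa, exponent) so that n is roughly mantissa * 10**exponent."""
    s = str(n)
    return int(s[:k]) / 10 ** (k - 1), len(s) - 1

print(leading_digits(all_choice_nodes))        # about (9.3, 39953)
print(leading_digits(reachable_choice_nodes))  # about (9.4, 20155)
```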