ABSTRACT

Title of dissertation: GENERATION AND ANALYSIS OF STRATEGIES IN AN EVOLUTIONARY SOCIAL LEARNING GAME
James Ryan Carr, Doctor of Philosophy, 2013
Dissertation directed by: Professor Dana Nau, Department of Computer Science

An important way to learn new actions and behaviors is by observing others, and several evolutionary games have been developed to investigate what learning strategies work best and how they might have evolved. In this dissertation I present an extensive set of mathematical and simulation results for Cultaptation, which is one of the best-known such games. I derive a formula for measuring a strategy's expected reproductive success, and provide algorithms to compute near-best-response strategies and near-Nash equilibria. Some of these algorithms are too complex to run quickly on larger versions of Cultaptation, so I also show how they can be approximated to handle larger games, while still exhibiting better performance than the current best-known Cultaptation strategy for such games. Experimental studies provide strong evidence for the following hypotheses:

1. The best strategies for Cultaptation and similar games are likely to be conditional ones in which the choice of action at each round is conditioned on the agent's accumulated experience. Such strategies (or close approximations of them) can be computed by doing a lookahead search that predicts how each possible choice of action at the current round is likely to affect future performance.

2. Such strategies are likely to prefer social learning most of the time, but will have ways of quickly detecting structural shocks, so that they can switch quickly to individual learning in order to learn how to respond to such shocks. This conflicts with the conventional wisdom that successful social-learning strategies are characterized by a high frequency of individual learning, and agrees with recent experiments by others on human subjects that also challenge the conventional wisdom.

GENERATION AND ANALYSIS OF STRATEGIES IN AN EVOLUTIONARY SOCIAL LEARNING GAME

by

James Ryan Carr

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2012

Advisory Committee:
Professor Dana Nau, Chair/Advisor
Professor Michele Gelfand
Professor Mohammad Hajiaghayi
Professor Donald Perlis
Professor James Reggia

© Copyright by James Ryan Carr 2012

Acknowledgments

I would like to acknowledge the support and encouragement of many extraordinary people, without whom this work would have been impossible. I would like to thank Dr. Dana Nau for being a wonderful advisor, mentor, and friend, and especially for teaching me that half of research is arguing, and how to argue well. I would like to thank my other committee members, Drs. Michele Gelfand, James Reggia, Mohammad Hajiaghayi, Donald Perlis, and V.S. Subrahmanian, for their support, guidance, and patience. I would like to thank Dr. Marie desJardins for getting me interested in research and encouraging me to pursue a Ph.D. I would like to thank Eric Raboin and Austin Parker for their encouragement and contributions to our earlier studies of Cultaptation. I would like to thank my fellow graduate students in LCCD, Eric, Austin, Patrick Roos, Kan-Leung Cheng, Ron Alford, Vikas Shivashankar, and Brandon Wilson, for six years of friendship and advice.
This work was supported in part by AFOSR Grant FA95501210021, ARO Grant W911NF1110344, and ONR Grant W911NF0810144. The views expressed are those of the author and do not necessarily re ect the o cial policy or position of the funders. ii Table of Contents List of Figures v 1 Introduction 1 2 Background 6 2.1 Cultaptation Social-Learning Game . . . . . . . . . . . . . . . . . . . 6 2.2 Motivating Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2.1 Innovation, Observation, and Structural Shocks . . . . . . . . 13 2.2.2 Innovation and Observation Versus Exploitation . . . . . . . . 14 2.2.3 Best-Response Strategies in Cultaptation . . . . . . . . . . . . 15 3 Related Work 17 3.1 Social Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.2 Related Computational Techniques . . . . . . . . . . . . . . . . . . . 20 4 Game Analysis 22 4.1 Formal Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 4.1.1 Transition Probabilities . . . . . . . . . . . . . . . . . . . . . . 26 4.1.2 Accounting for Probability of Death . . . . . . . . . . . . . . . 28 4.1.3 Utility Functions . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.1.4 Strategy Representation . . . . . . . . . . . . . . . . . . . . . 29 4.1.5 Evaluating Strategies . . . . . . . . . . . . . . . . . . . . . . . 30 4.2 Analysis of EPRU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.2.1 Computation of EPRU . . . . . . . . . . . . . . . . . . . . . . 33 4.2.2 EPRU Corresponds to Reproductive Success . . . . . . . . . . 35 5 Strategy Generation Algorithms 43 5.1 Finding an -Best Response Strategy . . . . . . . . . . . . . . . . . . 43 5.1.1 Problem Speci cation . . . . . . . . . . . . . . . . . . . . . . . 43 5.1.2 Bounding EPRU . . . . . . . . . . . . . . . . . . . . . . . . . 44 5.1.3 Determining How Far to Search . . . . . . . . . . . . . . . . . 46 5.1.4 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 5.1.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 50 5.2 Cultaptation Strategy Learning Algorithm . . . . . . . . . . . . . . . 54 5.2.1 Implementation Details . . . . . . . . . . . . . . . . . . . . . . 57 5.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 5.3.1 Experiments with -Best Response Algorithm . . . . . . . . . 59 5.3.2 Experiments with the Cultaptation Strategy Learning Algo- rithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 iii 6 Improving Full-scale Cultaptation Strategies 78 6.1 The Tournament Winner: discountmachine . . . . . . . . . . . . . . . 78 6.1.1 Potential Problems with discountmachine . . . . . . . . . . . 80 6.2 A Full-scale Cultaptation Player: relaxedlookahead . . . . . . . . . . . 82 6.3 Experiments: RLA vs. discountmachine . . . . . . . . . . . . . . . . 85 6.3.1 Overall Performance . . . . . . . . . . . . . . . . . . . . . . . 85 6.3.2 Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . 89 7 Conclusion 102 A Proofs 108 B Converting from Histories to Repertoires 112 B.1 Repertoire De nition . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 B.2 Transition Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . 114 B.3 Consistency between P and PRep . . . . . . . . . . . . . . . . . . . . 117 B.4 Repertoire-Based Strategies . . . . . . . . . . . . . . . . . . . . . . . 120 B.5 Repertoire-Based Algorithm . . . . . . . . . . . . . . . . . . . . . . . 
125 C The Number of Pure Strategies for Cultaptation 128 Bibliography 130 iv List of Figures 2.1 Example game with a structural shock . . . . . . . . . . . . . . . . . 14 5.1 An illustration of a portion of the search performed by Algorithm 1 on a very small version of Cultaptation, with four actions and a set of possible action values equal to f20; 40; 80; 160g. Circle nodes indicate possible observations made by the agent, and triangle nodes indicate possible choices made by the agent (which have a random outcome, indicated by the line through the branches leading to its children). Ellipses indicate branches or subtrees omitted from the gure. The left side of the illustration shows how more actions can become avail- able to the agent as the game progresses, while the right side shows how potentially-changing values of actions must be modeled in the tree. Several techniques outlined in Section 5.1.5 can reduce the size of the tree that needs to be searched in the implementation of this algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 5.2 Number of nodes searched by -best response strategy algorithm. . . . 61 5.3 Average populations when sself and sEVC play against EVChooser . . 69 5.4 Probabilities of choosing each action for sself , sEVC, and EVChooser. . 72 5.5 Performance comparison between normal and \shock" environments for sself , sEVC, and EVChooser . . . . . . . . . . . . . . . . . . . . . . 75 6.1 Probabilities of choosing each type of action for relaxedlookahead and discountmachine, using Geo and c = 0:001. . . . . . . . . . . . . . . . 92 6.2 Probabilities of choosing each type of action for relaxedlookahead and discountmachine, using Geo and c = 0:05. . . . . . . . . . . . . . . . 93 6.3 Probabilities of choosing each type of action for relaxedlookahead and discountmachine, using Gam and c = 0:001. . . . . . . . . . . . . . . 94 6.4 Probabilities of choosing each type of action for relaxedlookahead and discountmachine, using Gam and c = 0:05. . . . . . . . . . . . . . . . 95 6.5 Performance comparison between relaxedlookahead and discountma- chine, using Geo and c = 0:001. . . . . . . . . . . . . . . . . . . . . . 98 6.6 Performance comparison between relaxedlookahead and discountma- chine, using Geo and c = 0:05. . . . . . . . . . . . . . . . . . . . . . . 99 6.7 Performance comparison between relaxedlookahead and discountma- chine, using Gam and c = 0:001. . . . . . . . . . . . . . . . . . . . . . 100 6.8 Performance comparison between relaxedlookahead and discountma- chine, using Gam and c = 0:05. . . . . . . . . . . . . . . . . . . . . . 101 v Chapter 1 Introduction An important way to learn new actions and behaviors is social learning, i.e., learning by observing others. Some social-learning theorists believe this is how most human behavior is learned [1], and it also is important for many other animal species [2, 3, 4]. Such learning usually involves evaluating the outcomes of others? actions, rather than indiscriminate copying of others? behavior [5, 6], but much is unknown about what learning strategies work best and how they might have evolved. For example, it seems natural to assume that communication has evolved due to the inherent superiority of copying others? success rather than learning on one?s own via trial-and-error innovation. However, there has also been substantial work questioning this intuition [7, 8, 9, 10, 11]. Several evolutionary games have been developed to investigate social learning [12, 13, 14, 15]. 
One of the best-known is Cultaptation, a multi-agent social-learning game developed by a consortium of European scientists [15] who sponsored an in- ternational tournament with a e10,000 prize.1 The rules of Cultaptation are rather complicated (see Section 2.1), but can be summarized as follows: Each agent has three kinds of possible actions: innovation, observation, and exploitation. These are highly simpli ed analogs of the following real-world 1NOTE: I am not a liated with the tournament or with the Cultaptation project. 1 activities, respectively: spending time and resources to learn something new, learning something by communicating with another agent, and exploiting the learned knowledge. At each step of the game, each agent must choose one of the available actions. How an agent does this constitutes the agent?s \social learning strategy." Each action provides an immediate numeric payo and/or information about the payo s of other actions at the current round of the game. This information is not necessarily correct in subsequent rounds because the actions? payo s may vary from one round to the next, and the way in which they may vary is unknown to the agents in the game.2 Each agent has a xed probability of dying at each round. At each round, each agent may also produce o spring, with a probability that depends on how this agent?s average per-round payo compares to the average per-round payo s of the other agents in the game. A second Cultaptation tournament is scheduled to begin in February 2012. This tournament carries a e25,000 prize and introduces a few new concepts into the game, such as the ability for agents to improve actions they already know, and proximity-based observation. This work does not deal with these additions, although 2 For my analyses, I assume the payo s at each round are determined by an arbitrary function (which may be either deterministic or probabilistic), and I ana- lyze how strategies perform given various possible characteristics of that function. In general, such characteristics would not be known to any Cultaptation agent| but my objective is to examine the properties of strategies in various versions of Cultaptation, not to develop a Cultaptation agent per se. 2 I may address them in future work. My work has had two main objectives: (1) to study the nature of Cultaptation to see what types of strategies are e ective; and (2) more generally, to develop ways of analyzing evolutionary environments with social learning. This work includes the following results. First, it provides a formula for ap- proximating (to within any given error bound > 0) the expected per-round utility (EPRU) of a given strategy (Section 4.2), and a proof that a strategy with maximal EPRU will have the highest frequency in the limit (Section 4.2.2). This provides a basis for evaluating Cultaptation strategies analytically, rather than through sim- ulated games. Next, the work provides a strategy-generation algorithm that can construct a strategy that is within any given error bound > 0 of a best response to a given set of competing strategies (Section 5.1). It then presents the Cultaptation Strategy Learning Algorithm (CSLA), which uses the strategy-generation algorithm in an iterative self-improvement loop to attempt to nd a strategy that is a near-best response to itself, and is therefore a symmetric near-Nash equilibrium (Section 5.2). 
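To make the shape of this approach concrete, the sketch below illustrates the kind of iterative self-improvement loop described above. It is an illustration only, not the actual interface of the algorithms in Chapters 5 and 6: the function name best_response, its arguments, and the convergence test are assumptions made here for exposition.

```python
# Illustrative sketch only: `best_response(strategies, eps)` stands in for the
# epsilon-best-response generator of Section 5.1, and the equality test is a
# simplified stand-in for a real convergence criterion.

def find_symmetric_near_equilibrium(initial_strategy, best_response, eps, max_iters=100):
    """Iterate toward a strategy that is a near-best response to itself
    (and hence a symmetric near-Nash equilibrium)."""
    candidate = initial_strategy
    for _ in range(max_iters):
        # Compute a strategy within eps of a best response to a population
        # consisting entirely of the current candidate.
        response = best_response([candidate], eps)
        if response == candidate:
            # No profitable deviation was found: the candidate is a
            # near-best response to itself.
            return candidate
        candidate = response
    return candidate
```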
Finding such a strategy is desirable because it should be able to perform well against any set of competing strategies (i.e., any strategies it plays against in a tournament setting). The work then provides experimental results outlining the generation of an ap- proximate Nash equilibrium strategy, sself , in a small version of Cultaptation, and a performance comparison (in the smaller environment) between sself and EVChooser, a known good strategy from the Cultaptation tournament (Section 5.3). These ex- periments show that sself is able to outperform EVChooser, and provide several 3 insights into the characteristics of good Cultaptation strategies. For example, the experiments show that sself observes and exploits most of the time, but switches quickly to innovation when a structural shock occurs, switching back to observation and exploitation once it has learned how to respond to the shock. This con icts with the conventional wisdom [16, 7] that successful social-learning strategies are characterized by a high frequency of innovation, but it helps to explain both the results of the Cultaptation tournament [17] and some recent experimental results on human subjects [18]. Finally, the work shows how the previous results (which were run on smaller versions of Cultaptation) can be extended to the larger version used in the tourna- ment (Chapter 6). It rst shows how the analysis and experimental results outlined above can be used to identify potential problems in the best strategy from the Cul- taptation tournament (i.e., discountmachine). Next, it uses the formulae from the analysis to de ne a new strategy, relaxedlookahead, that avoids such weaknesses. Experimental results verify that relaxedlookahead is capable of outperforming dis- countmachine in a variety of environments similar to those used in the Cultaptation tournament, and provide an in-depth analysis of the factors that allowed relaxed- lookahead to perform better. Taken together, these results provide strong support for the following hypothe- ses about the best strategies for Cultaptation and similar games: First, the best strategies are likely to be conditional ones in which the choice of action at each round is conditioned on the agent?s accumulated experience. Such strategies (or close approximations of them) can be computed by doing a lookahead search that 4 predicts how each possible choice of action at the current round is likely to a ect future performance. Second, it is likely that the best strategies will observe and exploit most of the time, but will have ways of quickly detecting structural shocks, so that they can switch quickly to innovation in order to learn how to respond to such shocks. In addition to these insights about Cultaptation and social learning in gen- eral, the work also explains how the analysis and algorithms outlined above could be adapted to apply to future evolutionary games that, like Cultaptation, are signif- icantly more complex than classical evolutionary games. As researchers in the eld of evolutionary game theory continue to study more complex phenomena like social learning, it is likely that these weaker assumptions will be needed more frequently, and so techniques like the ones presented here will be necessary more often. 5 Chapter 2 Background 2.1 Cultaptation Social-Learning Game This section gives a more detailed description of the Cultaptation social learn- ing game, adapted from the o cial description [15]. 
The game is a multi-agent, round-based game in which each agent chooses one action each round. There are N agents playing the game, where N is a parameter of the game. No agent knows of any other agent's actions at any point in the game except through the Obs action specified below. The actions available to each agent are innovation (Inv), observation (Obs), and exploitation ($X_1, \ldots, X_M$, where M is a parameter of the game). Each Inv and Obs action informs the agent what the utility would be for one of the exploitation actions, and an agent may not use an exploitation action $X_i$ unless the agent has previously learned of it through an innovation or observation action. Here are some details:

Exploitation. Each exploitation action $X_i$ provides utility specific to that action (e.g., $X_1$ may provide utility 10 and $X_2$ may provide utility 50). The utility assigned to each action at the beginning of the game is drawn from a probability distribution $\pi$, where $\pi$ is a parameter to the game. The utility provided by each exploitation action $X_i$ may change on round r, according to a probability $c_r$. The function c is a parameter to the game, and specifies the probability of change for every round of the game. When the changes occur, they are invisible to the agents playing the game until the agent interacts with the changed action. For instance: if an action's utility happens to change on the same round it is exploited, the agent receives the new utility, and discovers the change when the new utility is received. The new utility for a changed action is determined via the distribution $\pi$.

Innovation. When an agent uses the Inv action, it provides no utility, but it tells the agent the name and utility of some exploitation action $X_i$ that is chosen uniformly at random from the set of all exploitation actions about which the agent has no information. If an agent already knows all of the exploitation actions, then Inv is illegal, and indeed undesirable (when there is nothing left to innovate, why innovate?). The agent receives no utility on any round where she chooses an Inv action.

Observation. By performing an Obs action, an agent gets to observe the action performed and utility received by some other agent who performed an exploitation action on the previous round. Agents receive no utility for Obs actions, nor any information other than the action observed and its value: the agent being observed, for instance, is unknown. If none of the other agents performed an exploitation action on the previous round, then there were no $X_i$ actions to observe, so the observing agent receives no information. In some variants of the social learning game, agents receive information about more than one action when observing. I do not treat such variants directly in this proposal, but it is straightforward to extend my algorithms to take this difference into account.

Round #         1     2     3     4     5     ...   k
I1's action     Inv   X1    X1    X1    X1    ...   X1
I1's utility    0     3     6     9     12    ...   3(k-1)
Per round       0     1.5   2     2.25  2.4   ...   3(k-1)/k
I2O's action    Inv   Inv   Obs   X3    X3    ...   X3
I2O's utility   0     0     0     8     16    ...   8(k-3)
Per round       0     0     0     2     3.2   ...   8(k-3)/k

Table 2.1: Action sequences from Example 1, and their utilities.
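To make these rules concrete, here is a minimal sketch, in Python, of the round-to-round bookkeeping for the exploitation actions described above. The distribution stand-in sample_pi, the class name, and the specific payoff values are illustrative assumptions; in the real game the payoff distribution $\pi$ and the per-round change probability are game parameters, and the change probability may vary by round.

```python
import random

def sample_pi():
    """Stand-in for the game's payoff distribution pi (values are illustrative)."""
    return random.choice([3, 5, 8, 5])

class ExploitationActions:
    """Tracks the hidden utilities of the exploitation actions X_1..X_M."""

    def __init__(self, num_actions, change_prob):
        # Utilities are drawn from pi at the start of the game.
        self.values = [sample_pi() for _ in range(num_actions)]
        self.change_prob = change_prob  # assumed constant here for simplicity

    def step(self):
        """Each round, every action's utility changes with probability c;
        the new utility is redrawn from pi (invisibly to the agents)."""
        for i in range(len(self.values)):
            if random.random() < self.change_prob:
                self.values[i] = sample_pi()

    def exploit(self, i):
        """Exploiting X_i returns its current utility."""
        return self.values[i]
```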
Example 1. Consider two strategies: the innovate-once strategy (hereafter I1), which innovates exactly once and exploits that innovated action (whatever it is) for the rest of the game, and the innovate-twice-observe-once strategy (hereafter I2O), which innovates twice, observes once, and exploits the highest-valued action of the actions discovered for the rest of the game. For simplicity of exposition, suppose there are only four exploitation actions: $X_1$, $X_2$, $X_3$, and $X_4$. The values for each of these actions are drawn from a distribution; in this example we will assume that they are chosen to be 3, 5, 8, and 5, respectively. For simplicity, we will assume the probability of change is 0. Suppose there are two agents: one I1 and one I2O. For the first action, I1 will innovate, which we suppose gives I1 the value of action $X_1$. On every sequential action, I1 will choose action $X_1$, exploiting the initial investment. If the agent dies k rounds later, then the history of actions and utilities will be that given in Table 2.1, giving a utility of $3(k-1)$ and a per-round utility of $3(k-1)/k$. In contrast, I2O will innovate, informing it of the utility of $X_3$: 8; then it will innovate again, informing it of the utility of $X_4$: 5; and finally it will observe. On the second round, I1 performed $X_1$, and since these are the only two agents, this was the only exploitation action performed. Therefore, I2O's observation action on the next round must report that another agent got a utility of 3 from action $X_1$ last round (if there were multiple possibilities, one would be chosen uniformly at random). On round 4, I2O then knows that actions $X_1$, $X_3$, and $X_4$ have utilities of 3, 8, and 5, respectively. Since the probability of change is 0, the obvious best action is $X_3$, which I2O performs for the rest of her life. The utility of I2O on round k is $8(k-3)$, making the per-round utility $8(k-3)/k$. Note that on rounds 2 to 4, I2O will have a worse per-round utility than I1, while after round 4, the utility of I2O will be higher (this is important because reproduction is tied to per-round utility, as I will show shortly).

Formally, everything that an agent knows about each round can be described by an action-percept pair, $(a, (m, v))$, where $a \in \{\mathit{Inv}, \mathit{Obs}, X_1, \ldots, X_M\}$ is the action that $\alpha$ chose to perform, and $(m, v)$ is the percept returned by the action. More specifically, $m \in \{X_1, \ldots, X_M, \emptyset\}$ is either an exploitation action or a null value, and v is the utility observed or received. While a is chosen by the agent, m and v are percepts the agent receives in response to that choice. If a is Inv or Obs, then v is the utility of exploitation action m. If a is Obs and no agent performed an exploitation action last round, then there is no exploitation action to be observed, hence $m = \emptyset$ and $v = 0$. If a is some $X_i$, then m will be the same $X_i$ and v will be the utility the agent receives for that action. The agent history for agent $\alpha$ is a sequence of such action-percept pairs, $h_\alpha = \langle (a_1,(m_1,v_1)), \ldots, (a_k,(m_k,v_k))\rangle$. As a special case, the empty (initial) history is $\langle\rangle$.

Example 2. The history for I2O in Example 1 is:
\[
h_{I2O} = \langle (\mathit{Inv},(X_3,8)),\ (\mathit{Inv},(X_4,5)),\ (\mathit{Obs},(X_1,3)),\ (X_3,(X_3,8)),\ \ldots \rangle
\]

To concatenate a new action-percept pair onto the end of a history, I use the $\oplus$ symbol. For example, $h_\alpha \oplus (a,(m,v))$ is the history $h_\alpha$ concatenated with the action-percept pair $(a,(m,v))$. Further, for $h_\alpha = \langle p_1, p_2, \ldots, p_k\rangle$, where each $p_i$ is some action-percept pair, I let $h_\alpha[i] = p_i$, and $h_\alpha[i,\ldots,j]$ be the subhistory $\langle p_i, \ldots, p_j\rangle$.

Strategies.
The Cultaptation game is ultimately a competition among strate- gies. Here, a strategy is a function from histories to the set of possible actions: s : h 7! m, where h is a history of an agent using s and m is Inv, Obs or Xi for some i. Since each strategy may depend on the entire history, the set of possible 9 strategies is huge;1 but any particular Cultaptation game is a competition among a much smaller set of strategies S, which I will call the set of available strategies. For example, if there are n contestants, each of whom chooses a strategy to enter into the game, then in this case, S = fthe strategies chosen by the contestantsg: (2.1) Each strategy in S may be used by many di erent agents, and the strategy pro le at each round of the game may change many times as the game progresses. When an agent reproduces, it passes its strategy on to a newly created agent, with the per- round utility of each agent determining its likelihood of reproduction. A strategy?s success is measured by its average prevalence over the last quarter of the game [15]. The replication dynamics work as follows. On each round, each agent has a 2% chance of dying. As such, I also include a parameter d in my formulation representing the probability of death (d defaults to 0:02). Upon death, an agent is removed from the game and replaced by a new agent, whose strategy is chosen using the reproduction and mutation mechanisms described below. Mutation happens 2% of the time, and reproduction happens 98% of the time. Reproduction. When reproduction occurs, the social learning strategy used by the newborn agent is chosen from the strategies of agents currently alive with a probability proportional to their per-round utility (the utility gained by an agent 1 The number of possible mixed strategies is, of course, in nite. But even if we consider only pure strategies, the number is quite huge. For a 10,000-round Cultaptation game of the type used in the Cultaptation tournament, a loose lower bound on the number of pure strategies is 1009:4 10 20155 [19]. In contrast, it has been estimated [20] that the total number of atoms in the observable universe is only about 1078 to 1082. 10 divided by the number of rounds the agent has lived). The agent with the highest per-round utility is thus the most likely to propagate its strategy on reproduction. Example 3. Again looking at the sequences of actions in Table 2.1, we see that both agents would have equal chance of reproducing on round 1. However, on round 2 I1 has a per-round utility of 1:5, while I2O has a per-round utility of 0, meaning I1 gets 100% of the reproductions occurring on round 2. Round three is the same, but on round 4 I1 has a per round utility of 2:25 and I2O has a per-round utility of 2. This means that I1 gets 100 2:25=4:25 = 53% of the reproductions and I2O gets 100 2=4:25 = 47% of the reproductions on round 4. Mutation. In Cultaptation, mutation does not refer to changes in an agent?s codebase (as in genetic programming). Instead, it means that the new agent?s strat- egy s is chosen uniformly at random from the set of available strategies, regardless of whether any agents used s on the previous round. For instance, if there were a Cultaptation game pitting strategies I1 and I2O against one another, then a new mutated agent would be equally likely to have either strategy I1 or I2O, even if there were no living agents with strategy I1. Game Types. In the Cultaptation tournament [17], two types of games were played: pairwise games and melee games. 
A pairwise game was played with an invading strategy and a defending strategy. The defending strategy began play with a population of 100 agents, while the invading strategy began with none. Mutation was also disabled for the rst 100 rounds, to allow the defending strategy time to begin earning utility. After 100 rounds, mutation was enabled and the invader had the challenging task of establishing a foothold in a population consisting entirely of agents using the defending strategy (most of whom would have had time to nd several high-payo actions). Since the pairwise games provide a clear early- 11 game advantage to the defender, they were typically played twice with the invader and defender swapping roles on the second game. A melee game was played with n strategies, for some n > 2. Initially, the population of 100 agents was evenly divided between each strategy in the game. Mutation was disabled for the last quarter of the game, so that it would not in uence results when strategies had similar tness. Scoring. If we have k social learning strategies s1; : : : ; sk playing Cultapta- tion, then on any given round there will be some number nj of agents using strategy sj, for 1 j k. Strategy sj?s score for the game is the average value of nj over the nal 2,500 rounds of the game. The strategy with the highest score is declared the winner. The only way an agent may a ect nj is through reproduction. I will show in Section 4.2.2 that any strategy maximizing an agent?s expected per-round utility (de ned in Section 4.1.5) will also maximize its reproduction. I will therefore focus on computing the expected per-round utility. 2.2 Motivating Discussion The purpose of this section is to explain the motivations for several aspects of my work: Sections 2.2.1 and 2.2.2 give examples of types of strategies that seem like they should work well at rst glance, but can have unexpectedly bad consequences. The existence of such situations motivate the algorithms described later in this proposal, which perform a game tree search in order to consider strategies? 12 long-term consequences. An important way of getting insight into a game is to examine its best-response strategies; and this approach is at the heart of my formal analysis and my game-tree search algorithms. Section 2.2.3 explains some issues that are im- portant for nding best-response strategies in Cultaptation. 2.2.1 Innovation, Observation, and Structural Shocks If we want to acquire a new action to exploit, then what is the best way of doing it: to observe, or to innovate? At rst glance, observing might seem to be the best approach. If the other agents in the environment are competent, then it is likely that they are exploiting actions that have high payo s, hence we should be able to acquire a better action by observing them than by innovating. This suggests that an optimal agent will rely heavily on observation actions. However, the following example shows that relying only on observation actions can lead to disastrous consequences if there is a structural shock, i.e., a large change in the value of an exploitation action.2 Example 4. Structural shocks: Figure 2.1 shows a Cultaptation game in which all agents use the following strategy: each agent begins with a single Obs action, followed by a single Inv action if the Obs action returns ;,3 in order to obtain an exploitation action Xi which the agent will use in all subsequent rounds. 
Agent A3 acquires action $X_4$ by doing an unsuccessful Obs followed by an Inv; and A1 and A2 acquire $X_4$ by observing A3. At first, $X_4$ is far better than the other exploitation actions, so all of the agents do well by using it. On round 8, the action $X_4$ changes to the lowest possible value, but the agents continue to use it anyway. Furthermore, any time a new agent is born, it will observe them using $X_4$ and will start using it too. This is a pathological case where the best action has disappeared and the agents are in a sense "stuck" exploiting the suboptimal result. Their only way out is if all agents die at once, so that one of the newly born agents is forced to innovate. Experiments in Section 5.3.1 show that in some cases, situations like these are a big enough risk that a near-best response strategy will choose innovation moves more frequently than observation moves.

Round   X1   X2   X3   X4   A1                A2                A3
1       2    4    1    9    N/A               N/A               (Obs, (∅, 0))
2       2    4    1    9    N/A               N/A               (Inv, (X4, 9))
3       2    4    1    9    Birth             N/A               (X4, (X4, 9))
4       5    4    1    9    (Obs, (X4, 9))    Birth             (X4, (X4, 9))
5       5    2    1    9    (X4, (X4, 9))     (Obs, (X4, 9))    (X4, (X4, 9))
6       1    2    1    9    (X4, (X4, 9))     (X4, (X4, 9))     (X4, (X4, 9))
7       1    2    8    9    (X4, (X4, 9))     (X4, (X4, 9))     Death
8       1    2    8    1    (X4, (X4, 1))     (X4, (X4, 1))     N/A
...     ...  ...  ...  ...  ...               ...               ...

Figure 2.1: An example of a game in which there is a large structural shock. The columns for the exploitation actions $X_i$ show their values at each round, and the columns for agents A1-A3 show their histories. Note that by round 6, all agents choose action $X_4$, which has changed to a very low value. Since none of the agents are innovating, none of them can find the newly optimal action $X_3$.

Footnote 2: I have borrowed this term from the Economics literature, where it has an analogous meaning (e.g., [21, 22]).
Footnote 3: This will generally only happen on the first round of the game, before any agent has obtained an exploitation action.

2.2.2 Innovation and Observation Versus Exploitation

One might also think that agents should perform all of their innovation and observation actions first, so that they have as many options as possible when choosing an action to exploit. However, as Raboin et al. [23] demonstrate, this intuition is not always correct. Because the game selects which agents reproduce based on average per-round utility, not total accumulated utility, it is frequently better for a newborn agent to exploit one of the first actions it encounters, even if this action has a mediocre payoff (e.g., exploiting an action with value 10 on the second round of an agent's life gives it as much per-round payoff as exploiting an action with value 50 on the tenth round). Once the agent has at least some per-round utility, so that it has a nonzero chance of reproducing, it can then begin searching for a high-valued action to exploit for the rest of its lifetime.

2.2.3 Best-Response Strategies in Cultaptation

A widely used technique for getting insight about a game (e.g., see [24]) is to look at the game's best-response strategies. Given an agent $\alpha$ and a strategy profile (i.e., an assignment of strategies to agents) $s_{-\alpha}$ for the agents other than $\alpha$, $\alpha$'s best response is a strategy $s_{opt}$ that maximizes $\alpha$'s expected utility if the other agents use the strategies in $s_{-\alpha}$. In Cultaptation, it is more useful to consider a best response to the set of available strategies S, rather than any particular strategy profile. During the course of a Cultaptation game, the strategy profile will change many times as agents die and other agents are born to take their places.
Each strategy in S will be scored based on its average performance over the course of the game; and we will see (in Section 4.2.2) that given S, each strategy?s score is independent of the initial strategy pro le if the game is su ciently long. 15 IfG is a Cultaptation game (i.e., a set of values for game parameters such as the number of agents, set of available actions, probability distribution over their payo s; see Section 4.1 for details), then for any agent , any set of available strategies S, and any history h for , there is a probability distribution Obs(ajh ;S) that gives the probability of observing each action a, given S and h . Given Obs and G, we can calculate the probability of each possible outcome for each action the agent might take, which will allow us to determine the best response to S. To compute Obs is not feasible except in general, but it is possible to compute approximations of it in some special cases (e.g., cases in which all of the agents, or all of the agents other than , use the same strategy). That is the approach used in this proposal. 16 Chapter 3 Related Work In this section I will discuss related work on social learning and on computa- tional techniques related to my own. 3.1 Social Learning The Cultaptation social learning competition o ers insight into open questions in behavioral and cultural evolution. An analysis of the competition is provided by Rendell et al. [17]. Of the strategies entered into the competition, those that per- formed the best were those that greatly favored observation actions over innovation actions, and the top performing strategy learned almost exclusively through observa- tion. This was considered surprising, since several strong arguments have previously been made for why social learning isn?t purely bene cial [7, 16]. However, this re- sult is consistent with observations made during my own experiments, in which the -best-response strategy rarely did innovation (see Section 5.3). In previous work, Carr et al. showed how to compute optimal strategies for a highly simpli ed versions of the Cultaptation social learning game [25]. Their paper simpli es the game by completely removing the observation action|which prevents the agents from interacting with each other in any way whatsoever, thereby transforming the game into a single-agent game rather than a multi-agent game. 17 Their model also assumes that exploitable actions cannot change value once they have been learned, which overlooks a key part of the full social learning game. Wisdom and Goldstone attempted to study social learning strategies using a game similar to Cultaptation, but using humans rather than computer agents [18]. Their game environment consisted of a group of \creatures," each of which had some hidden utility. The agents? objective was to select a subset of the creatures to create a \team," which was assigned a utility based on the creatures used to create it. Agents had a series of rounds in which to modify their team, and on each round they were allowed to see the teams chosen by other agents on the previous round (and in some cases, the utility of the other agents? teams), and the object of the game was to maximize the utility of one?s team. In this game, the acts of keeping a creature on one?s team, choosing a creature that another agent has used, and choosing a creature no one has yet used correspond to exploitation, observation, and innovation (respectively) in the Cultaptation game. 
The successful strategies Wisdom and Goldstone saw are similar to those used by the strategies found by my algorithm: they keep most of the creatures on their team the same from round to round (which corresponds in Cultaptation to per- forming mostly exploitation actions), and new creatures are mostly drawn from other agents? teams (which corresponds to preferring observation over innovation in Cultaptation). However, Wisdom and Goldstone highlight these characteristics as interesting because they run contrary to the conventional wisdom for social learning strategies, which suggests that broader exploration should lead to better perfor- mance, and therefore that successful strategies should innovate more often [16]. In 18 this case, analyzing the strategies found by my algorithm allowed me to draw the same conclusions about what works well. This gives more evidence that the conven- tional wisdom on social learning [7, 16] may be mistaken. How best to learn in a social environment is still considered a nontrivial prob- lem. Barnard and Sibly show that if a large portion of the population is learning only socially, and there are few information producers, then the utility of social learning goes down [9]. Thus, indiscriminate observation is not always the best strategy, and there are indeed situations where innovation is appropriate. Authors such as Laland have attempted to produce simple models for determining when one choice is preferable to the other [8]. Game theoretic approaches have also been used to explore this subject, but it is still ongoing research [26, 27]. Giraldeau et al. o er reasons why social information can become unreliable. Both biological factors, and the limitations of observation, can signi cantly degrade the quality of information learned socially [11]. Work by Nettle outlines the circumstances in which verbal communication is evolutionarily adaptive, and why few species have developed the ability to use language despite its apparent advantages [10]. Nettle uses a signi cantly simpler model than the Cultaptation game, but provides insight that may be useful to understanding social learning in general. In Nettle?s model, the population reaches an equilibrium at a point where both individual and social learning occur. The point of equilibrium is a ected by the quality of observed information and the rate of change of the environment. 19 3.2 Related Computational Techniques The restless bandit problem, a generalization of the stochastic multi-armed bandit problem that accounts for probability of change, is cited as the basis for the rules of the Cultaptation tournament [17]. The rules of the Cultaptation game di er from the restless bandit problem by including other agents, making observation actions possible and complicating the game signi cantly. I also show in Section 4.2.2 that maximizing total payo , the goal of the restless bandit problem, is di erent from maximizing expected per-round utility (EPRU) of an agent in the Cultaptation tournament. The restless bandit problem is known to be PSPACE-complete, meaning it is di cult to compute optimal solutions for in practice [28, 29]. Multi-armed bandit problems have previously been used to study the tradeo between exploitation and exploration in learning environments [30, 31]. As discussed later in Section 4.1, nding a best-response strategy in Cultap- tation is basically equivalent to nding an optimal policy for a Markov Decision Process. 
Consequently, my algorithm for finding near-best-response strategies has several similarities to the approach used by Kearns et al. to find near-optimal policies for large MDPs [32]. Both algorithms use the discount factor of the MDP (which, in the case of Cultaptation, is the probability of death d) and the desired accuracy $\epsilon$ to create a horizon for their search, and the depth h of this horizon depends on the discount factor and the branching factor, but not on the size of the full state space (unlike conventional MDP algorithms). Thus, both their algorithm and mine also have running time exponential in $1/\epsilon$ and in the branching factor. However, the algorithm provided by Kearns et al. was designed as an online algorithm, so it only returns the near-optimal action for the state at the root of the search tree. Mine, on the other hand, returns a strategy specifying which action the agent should take for all states that can occur on the first h rounds. This means that the exponential-time algorithm only needs to run once to generate an entire strategy, rather than once per agent per round in each game we simulate.

Many algorithms for optimal control of an MDP have been developed; however, they all have running time that grows linearly with the size of the state space of the MDP. This makes them intractable for problems like Cultaptation, which have exponentially large state spaces. Several approaches for near-optimal control, which produce a policy within some $\epsilon$ of optimal, have been developed [32, 33, 34].

Chapter 4
Game Analysis

This chapter presents the formal model of Cultaptation used throughout the rest of the work, as well as the derivation of a formula that, given the parameters of the game environment and the other strategies in the game, will allow us to predict which strategy has an evolutionary advantage.

4.1 Formal Model

In this section I introduce a formal mathematical model of Cultaptation games. A glossary of the notation used in this proposal is provided as Table 4.1.

Table 4.1: A glossary of notation used in this proposal.

N: Number of agents in the environment.
S: The set of available strategies. Agents may only use strategies in S.
r: Number of the current round, ranging from 1 to $\infty$.
c: The probability of change on all rounds.
d: The probability of death on all rounds.
$(a, (m, v))$: An action-percept pair in which the action a returns the percept $(m, v)$.
$h_\alpha$: Agent history for $\alpha$. A sequence of action-percept pairs experienced by agent $\alpha$.
$h_\alpha[i]$: The i-th action-percept pair in $h_\alpha$.
$X(h_\alpha)$: Number of exploitable actions given history $h_\alpha$.
M: Number of exploitation actions in the game.
$\pi$: Probability distribution for the new value of any action whose value changed at round r.
$Obs(m, v \mid h_\alpha, S)$: Probability that Obs returns action m with value v, given history $h_\alpha$ and available strategies S.
$Inv(v \mid r)$: Probability that Inv returns an action with value v on round r.
V: The set of potential utility values.
$P(h'_\alpha \mid h_\alpha, a, S)$: Probability of transitioning to history $h'_\alpha$ if $\alpha$ performs action a with history $h_\alpha$ and available strategies S.
$L(|h_\alpha|)$: Probability that $\alpha$ lives long enough to experience history $h_\alpha$.
T: Set of all action-percept pairs of the form $(a, (m, v))$.

Game Definition. Cultaptation requires a number of parameters to determine exactly how it will run. Therefore, I will define the game parameters, G, to be a set of values for the following: N, the number of agents; M, the number of exploitation actions in the game; c, the probability that an exploitation action changes its utility each round; $\pi$, the probability distribution used to assign a utility value to each exploitation action, both at the outset and each time an action's utility changes; and d, the probability of death. In the Cultaptation tournament, only the values of N, M, and d were known ahead of time, but for this analysis I use the values of the other parameters as well.
Recall that in the Cultaptation tournament [15], each evolutionary simulation was a contest between two or more strategies submitted to the tournament. Thus, there is a fixed set of strategies that are allowed to occur in a given simulation. I will call this the set of available strategies S, where $S = \{s_1, s_2, \ldots, s_\ell\}$ for some finite $\ell$ (i.e., in pairwise games $\ell = 2$, in melee games $\ell > 2$). Any strategy profile s that occurs in the simulation will consist only of strategies in S. When an agent is chosen to be replaced via mutation, its new strategy is selected at random from the strategies in S.

A Cultaptation game can now be defined formally, as follows. A Cultaptation game is an $\ell$-player game, in which each player receives the game parameters G as input. Each player then simultaneously chooses a strategy to put into the set of available strategies. Let player i's strategy be $s_i$, so that $S = \{s_1, s_2, \ldots, s_\ell\}$. The pair (G, S) is an instance of G. In (G, S), each player i will receive a payoff equal to score($s_i$), defined below.

Scoring. The version of Cultaptation used in the tournament continued for 10,000 rounds, and each strategy was assigned a score equal to its average population over the last 2,500 rounds. But as is often done in analyses of repeated games, the formal model assumes an infinite Cultaptation game, i.e., the game continues for an infinite number of rounds, and the score for strategy s is its average population over the entire game:
\[
\mathrm{score}(s) = \lim_{r \to \infty} \frac{\sum_{j=1}^{r} p(s, j)}{r},
\]
where $p(s, j)$ is the population size of agents using strategy s on round j. This greatly simplifies the analysis in Section 4.1.5, by allowing us to average out the various sources of noise present in the game.

Actions. The rest of the formal model will be constructed from the perspective of an arbitrary agent, $\alpha$, in a given infinite Cultaptation game instance (G, S). I use r for the number of a round, and $X(h_\alpha)$ to specify the number of exploitation actions available after history $h_\alpha$. After all exploitation actions $X_1, \ldots, X_M$ have been innovated or observed in a history $h_\alpha$, then $X(h_\alpha) = M$ and innovation actions become illegal.

I model the payoffs supplied for exploitation actions $X_i$ by a probability distribution $\pi$. $\pi(v)$ is the probability of an action having payoff v at the start of the game instance. $\pi(v)$ is also the probability that, when an action changes its payoff, the new payoff is v. I let V be the set of all action values that may occur with non-zero probability: $V = \{v \mid \pi(v) > 0\}$, where V has finite size.

If we let $Inv(v \mid r)$ be the probability that value v is innovated on round r, it can be defined recursively in terms of c and $\pi$ as:
\[
\mathit{Inv}(v \mid r) =
\begin{cases}
\pi(v), & \text{if } r = 0;\\
c\,\pi(v) + (1-c)\,\mathit{Inv}(v \mid r-1), & \text{otherwise.}
\end{cases}
\]
That is, initially the chance that Inv will return an action with value v is determined by the given distribution $\pi(v)$. On later rounds ($r > 0$), the chance that Inv will return an action with value v is the chance that an action's value changed to v on the current round (given by $c\,\pi(v)$), plus the chance that an action's value was v on the previous round and it did not change this round.
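The recursion above translates directly into a few lines of code. The sketch below is an illustration rather than the dissertation's implementation: it assumes $\pi$ is given as a dictionary mapping each value in V to its probability, and it unrolls the recursion from r = 0 upward instead of recursing, but it computes the same quantity.

```python
def inv_distribution(pi, c, r):
    """Inv(v | r) for every v in V, where `pi` maps each value v to pi(v)
    and `c` is the per-round probability of change.

    Round 0:      Inv returns v with probability pi(v).
    Round r > 0:  either the action's value changed to v this round
                  (c * pi(v)), or it was already v on the previous round
                  and did not change ((1 - c) * Inv(v | r - 1))."""
    dist = dict(pi)                  # base case: r = 0
    for _ in range(r):
        dist = {v: c * pi[v] + (1 - c) * p for v, p in dist.items()}
    return dist
```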
While computing the probability distribution for utilities of actions returned by Inv was fairly straightforward, computing a similar distribution for Obs actions is significantly more difficult. Let $\alpha$ be any agent, and S be the set of available strategies. From S we can get a probability distribution over the other agents' actions in any given situation; and from this we can derive $Obs(m, v \mid h_\alpha, S)$, the probability that Obs would return the action-percept pair $(m, v)$, given history $h_\alpha$. In order to derive Obs, we must consider each possible strategy profile $s_{-\alpha}$ for agents besides $\alpha$, determine how likely that strategy profile is to occur, and then determine what each agent in $s_{-\alpha}$ will do for every possible sequence of actions they could have encountered, bounded only by the percepts the agent has received in $h_\alpha$. As discussed in Section 2.2.3, the number of possible histories alone is astronomically large. Since Obs is conditioned on each possible history it will be larger still, so in any practical implementation the best we can do is to approximate Obs (Section 5.1.5 describes how we will do this). But for the theoretical development, I will assume we have an oracle for Obs, that will tell us exactly how likely we are to observe any given action-utility pair.

In what follows, I will show that given $\pi$, Obs, V, and S, we can calculate the possible outcomes of each action the agent may take, and the probability of each of these outcomes. This allows us to treat an infinite Cultaptation game as a Markov Decision Process (MDP) [35]. Calculating the best response in this case is equivalent to finding an optimal control policy for an MDP.

4.1.1 Transition Probabilities

A transition probability function $P(h'_\alpha \mid h_\alpha, a, S)$ defines the probability of transitioning from history $h_\alpha$ to history $h'_\alpha = h_\alpha \oplus (a,(m,v))$ in the next round if an agent performs action a. These transition probabilities are for the case where $\alpha$ does not die before reaching $h'_\alpha$; I introduce functions to account for the probability of death in Section 4.1.2. There are three cases for what $P(h'_\alpha \mid h_\alpha, a, S)$ might be, depending on whether a is an innovation, observation, or exploitation action.

If $a = \mathit{Inv}$, then
\[
P(h_\alpha \oplus (\mathit{Inv},(m,v)) \mid h_\alpha, \mathit{Inv}, S) =
\begin{cases}
\dfrac{\mathit{Inv}(v \mid r)}{M - X(h_\alpha)} & \text{if } h_\alpha \text{ contains no percepts that contain the action } m,\\[1ex]
0 & \text{otherwise.}
\end{cases}
\tag{4.1}
\]
Recall that an agent cannot innovate action m if it has already encountered m by innovating or observing.

Observation actions are not subject to the same restriction, so if $a = \mathit{Obs}$, then
\[
P(h_\alpha \oplus (\mathit{Obs},(m,v)) \mid h_\alpha, \mathit{Obs}, S) = \mathit{Obs}(m, v \mid h_\alpha, S)
\tag{4.2}
\]
where $Obs(m, v \mid h_\alpha, S)$ models the exploitation behavior of the other agents in the environment. Obviously, the exact probability distribution will depend on the composition of strategies used by these agents. The above definition is general enough to support a wide range of environments; and in Section 5.1.5 I will discuss one potential way to model this function for a more specific set of environments.

Finally, if $a = X_m$, then $h_\alpha$ must contain at least one percept for $X_m$. Let r be the last round at which the last such percept occurred. For the case where $X_m$'s utility did not change since round r, we have
\[
P(h_\alpha \oplus (X_m,(m,v)) \mid h_\alpha, X_m, S) =
\underbrace{(1-c)^{|h_\alpha|-r}}_{\text{prob. of not changing}} +
\underbrace{c\,\pi(v)\sum_{j=r}^{|h_\alpha|}(1-c)^{|h_\alpha|-j}}_{\text{prob. of changing back to } v}
\tag{4.3}
\]
In all other cases, no transition from h to h0 is possible, so P (h 0 jh ; a;S) = 0. Probability of Reaching a History We will frequently be interested in P (h js;S), the probability of history h occurring given that the agent is following some strategy s 2 S. We will be able to derive P (h js;S) iteratively, calculating the probability of each step of history h occurring using the functions derived above. Speci cally, P (h js;S) is the probability that each h [i] = (ai; (mi; vi)) occurs given the action chosen by the strategy in the history h [1; : : : ; i 1] = (a1; (m1; v1)) (ai 1; (mi 1; vi 1)), or: P (h js;S) = jh j 1Y i=1 P (h [1; : : : ; i] h [i+ 1]jh [1; : : : ; i]; s(h [1; : : : ; i]);S) (4.5) 4.1.2 Accounting for Probability of Death The probability of an agent living long enough to experience history h de- pends on the probability of death. It is L(jh j) = (1 d) jh j 1: (4.6) When we calculate the probability of reaching a given history h , we will generally multiply it by L(jh j) to account for the chance that the agent dies before reaching h . 28 Sometimes we will also be interested in the probability that a randomly- selected agent has history h . For this we will need to know the probability that a randomly-selected agent is exactly jh j rounds old, which is simply: L(jh j) P1 i=1 L(i) = L(jh j) 1 1 (1 d) = L(jh j) 1 d = dL(jh j): (4.7) 4.1.3 Utility Functions A utility function U((a; (m; v))) de nes the utility gleaned on action-percept pair (a; (m; v)): U((a; (m; v))) = 8 >>< >>: v; if 9i such that a = Xi; 0; otherwise. (4.8) Notice that U( ) is only non-zero on exploitation actions. The per-round utility (PRU) of history h , where h = (a1; (m1; v1)) (ajh j; (mjh j; vjh j)); is de ned to be the sum of the utility acquired in that history divided by the history?s length: PRU(h ) = 1 jh j jh jX i=1 U((ai; (mi; vi))) (4.9) 4.1.4 Strategy Representation A strategy s is de ned as a function mapping each history h 2 H to the agent?s next action s(h ) 2 fInv;Obs;X1; : : : ;XMg. For instance, the strategy I1 29 from Example 1 is de ned by the function: sI1(h ) = 8 >>< >>: Inv; if h is empty, Xi; for h = (Inv; (Xi; v)); : : : In this proposal we will deal with partially speci ed strategies. A partially spec- i ed strategy is a mixed strategy (i.e., a probability distribution over a set of pure strategies) that is de ned by a nite set Q of history action pairs (Q H fInv;Obs;X1; : : : ;XMg), in which each h 2 H appears at most once. Given any history h , if there is an action m such that (h ;m) 2 Q, then sQ chooses the action m. Otherwise, sQ chooses an action arbitrarily from all actions that are legal in h . Partially speci ed strategies have the advantage of being guaranteed to be nitely representable. 4.1.5 Evaluating Strategies At each round, an agent with history h has reproductive tness PRU(h ), and agents are selected to reproduce with probability proportional to their reproductive tness (i.e., using the replicator equation [36]). Since a strategy?s score is a function of its average population over the course of the game, we want some metric that allows us to compare the expected reproductive tness of two strategies. This will allow us to predict which strategy is more likely to win. At rst glance, it may appear that the way to predict which strategy will have higher expected reproductive tness is to compare their expected utilities. 
At first glance, it may appear that the way to predict which strategy will have higher expected reproductive fitness is to compare their expected utilities. However, prior work has shown that this is not the case: in Cultaptation, a strategy's expected reproductive fitness is not necessarily proportional to its expected utility [23]. I now present a simple example that illustrates this phenomenon.

Table 4.2: The expected utility and expected reproductive fitness of $s_{II}$ and $s_{IEI}$ from Example 5. $s_{IEI}$ has a higher expected reproductive fitness, and therefore will be likely to win a game against $s_{II}$, even though $s_{II}$ has a higher expected utility.

           Expected Utility   Expected Reproductive Fitness
s_II       90.25              65.074
s_IEI      88.825             65.185

Example 5. Consider an infinite Cultaptation game with no probability of change, no observation actions, probability of death $d = 0.05$, two exploitation actions valued at 65 and 100, and an innovate action that will return either exploitation action with uniform probability. This means that an agent needs to perform at most two innovate actions to have knowledge of the action with value 100, since innovating does not return an action the agent already knows. Let us compare two strategies: $s_{II}$ and $s_{IEI}$. Both strategies will perform an innovate as their first action. If the action they learn has value 100, both strategies will exploit that action until the agent dies. If the action learned has value 65, $s_{II}$ will perform a second innovate on its next turn, learning the action with value 100, and will exploit that action until its agent dies. Meanwhile, $s_{IEI}$ will exploit the action with value 65 once, before performing an innovate on its third turn to learn the action with value 100. It then exploits this action until its agent dies. Since the two strategies are identical when they learn the action with value 100 on their first action, and since this case is equally likely to be encountered by both strategies, we can ignore it for the purposes of comparing them. For the rest of this analysis we will assume the first innovate returns the action with value 65. In this case, we can calculate the expected utility for both strategies using geometric series, and we can calculate their expected reproductive fitnesses using methods described in Section 4.2. Table 4.2 presents these values. While $s_{II}$ has a higher expected utility, since it exploits the action with value 100 more often, $s_{IEI}$ has a higher expected reproductive fitness, since it does not wait as long to begin exploiting. Therefore, $s_{IEI}$ will be the likely winner in a contest between these two strategies.

Since we cannot always use a strategy's expected utility to determine whether it is expected to win, we will instead compute a strategy's expected reproductive fitness directly, by computing its Expected Per-Round Utility.

Definition. The Expected Per-Round Utility for a strategy $s_\alpha$, $\mathrm{EPRU}(s_\alpha \mid G, S)$, is the expected value of $\mathrm{PRU}(h_\alpha)$ over all possible histories $h_\alpha$ for a randomly-selected agent using strategy $s_\alpha \in S$ in an infinite Cultaptation game instance (G, S). □

To calculate $\mathrm{EPRU}(s_\alpha \mid G, S)$, we look at each possible history $h_\alpha$ and multiply $\mathrm{PRU}(h_\alpha)$ by the probability that a randomly-chosen agent using $s_\alpha$ has history $h_\alpha$. This probability is equal to the probability that a randomly-chosen agent is $|h_\alpha|$ rounds old (Equation 4.7) times the probability of reaching history $h_\alpha$ in $|h_\alpha|$ steps using strategy $s_\alpha$ (Equation 4.5). Hence, the EPRU of a strategy is:
\[
\mathrm{EPRU}(s_\alpha \mid G, S) = \sum_{h_\alpha \in H} d\,L(|h_\alpha|)\; P(h_\alpha \mid s_\alpha, S)\; \mathrm{PRU}(h_\alpha).
\]
Note that for a given environment, the probability of death d is a constant. Hence, in the analysis I will frequently factor it out.
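The dissertation evaluates this expectation analytically (Section 4.2), but the definition above also suggests a simple Monte Carlo sanity check: sample an agent's age from the geometric age distribution $d\,L(|h_\alpha|)$, roll out the strategy for that many rounds, and average the resulting per-round utilities. The sketch below is an illustration only; simulate_history is a hypothetical helper that performs the rollout in some model of the environment, and per_round_utility is reused from the earlier sketch.

```python
import random

def estimate_epru(strategy, simulate_history, d, num_samples=10_000):
    """Monte Carlo estimate of EPRU: average PRU(h) over histories of randomly
    sampled agents, whose ages are geometric with parameter d (an agent survives
    each round with probability 1 - d)."""
    total = 0.0
    for _ in range(num_samples):
        age = 1
        while random.random() > d:          # survive another round with prob. 1 - d
            age += 1
        history = simulate_history(strategy, age)   # hypothetical rollout helper
        total += per_round_utility(history)
    return total / num_samples
```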
Example 6. Recall the innovate-once strategy, which innovates once to learn an action and then exploits that action until it dies. Suppose this strategy exists in an environment with a probability of death of 0.2 and only one possible exploit action, with non-changing value 10. All agents using this strategy will therefore learn the only action on their first round, and then exploit an action with value 10 on all subsequent rounds. Hence, there is only one possible history for a j-round-old agent using this strategy, and its per-round utility is 10(j-1)/j. The probability that a randomly-selected agent will be j rounds old is 0.2 · L(j) = 0.2 · 0.8^{j-1}. Thus the expected per-round utility achieved by this strategy in this environment is \sum_{j=1}^{\infty} 0.2 \cdot 0.8^{j-1} \cdot 10(j-1)/j.

4.2 Analysis of EPRU

In this section I examine methods for computing the expected per-round utility of a strategy. First I present a method for computing an approximation to the EPRU for a given strategy; then I present a proof that a strategy maximizing EPRU will also maximize its average population in an infinite Cultaptation game instance.

4.2.1 Computation of EPRU

I will now define a formula that can be used to compute EPRU exactly for a given strategy s. The definition of EPRU given in Section 4.1.5 used a "backward" view: for every possible history h, it looked back through h to determine PRU(h). Notice, however, that h must have some preceding history h', where h = h'·t for some action-percept pair t. This definition of EPRU must examine h' and h independently, even though their only difference is the addition of t. For this reason, it will make more sense computationally to use a "forward" view of EPRU: we will construct a recursive function on s and h which, for each possible h·t:

- calculates the per-round utility gained from t, both for history h·t and for all histories that can be reached from h·t, and then
- recurses on s and h·t.

For the calculation in the first bullet, we will use the formula EVexp(r, v), which computes the expected amount of per-round utility we gain (on the current round and on future rounds) by exploiting a value v on round r:

\mathrm{EV}_{\exp}(r, v) = \sum_{j=r}^{\infty} \frac{L(j)\,v}{j} = v \sum_{j=r}^{\infty} \frac{1}{j}(1-d)^{j-1}.    (4.10)

Using known properties of infinite series, EVexp can also be expressed as

\mathrm{EV}_{\exp}(r, v) = v\left(\sum_{i=1}^{\infty} \frac{1}{i}(1-d)^{i-1} - \sum_{i=1}^{r-1} \frac{1}{i}(1-d)^{i-1}\right) = v\left(\frac{\ln d}{d-1} - \sum_{i=1}^{r-1} \frac{1}{i}(1-d)^{i-1}\right)    (4.11)

and is therefore computable.

We can now express the expected per-round utility of a strategy s recursively in terms of the average per-round payoff of an agent:

\mathrm{EPRU}_{\mathrm{alt}}(s, h \mid G, S) = \sum_{t \in T} P(h \cdot t \mid h, s(h), S)\,\bigl(\mathrm{EV}_{\exp}(|h \cdot t|, U(t)) + \mathrm{EPRU}_{\mathrm{alt}}(s, h \cdot t \mid G, S)\bigr)    (4.12)

where T is the set of all possible action-percept pairs, and h·t represents a possible history on the next round. Note that the size of T is finite. A proof that EPRU(s | G, S)/d = EPRU_alt(s, ⟨⟩ | G, S) is included in [19].

Unfortunately, computing EPRU_alt is not possible since it suffers from infinite recursion. To handle this, I introduce a depth-limited computation of EPRU_alt, which only computes the portion of the total EPRU contributed by the first k rounds:

\mathrm{EPRU}^{k}_{\mathrm{alt}}(s, h \mid G, S) = \begin{cases} 0, & \text{if } k = 0, \\ \sum_{t \in T} P(h \cdot t \mid h, s(h), S)\,\bigl(\mathrm{EV}_{\exp}(|h \cdot t|, U(t)) + \mathrm{EPRU}^{k-1}_{\mathrm{alt}}(s, h \cdot t \mid G, S)\bigr), & \text{otherwise.} \end{cases}    (4.13)

I prove in Section 5.1.3 that if the search depth k is deep enough, EPRU^k_alt(s, h | G, S) will always be within ε of EPRU_alt(s, h | G, S).
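The closed form in Equation 4.11 and the series in Example 6 are easy to check numerically. The following sketch is illustrative only; the truncation depth is arbitrary (with d = 0.2 the geometric terms vanish quickly, so a few thousand terms are more than enough for double precision), and the class name is invented for this example.

// Numerical check of Example 6 and of Equation 4.11 against Equation 4.10.
public class EpruSeriesCheck {
    // Example 6: sum_{j>=1} 0.2 * 0.8^(j-1) * 10 * (j-1)/j, truncated.
    static double exampleSix(int terms) {
        double sum = 0.0;
        for (int j = 1; j <= terms; j++) {
            sum += 0.2 * Math.pow(0.8, j - 1) * 10.0 * (j - 1) / j;
        }
        return sum;
    }

    // EVexp(r, v) by direct truncation of Equation 4.10 ...
    static double evExpTruncated(int r, double v, double d, int terms) {
        double sum = 0.0;
        for (int j = r; j <= terms; j++) {
            sum += (1.0 / j) * Math.pow(1.0 - d, j - 1);
        }
        return v * sum;
    }

    // ... and by the closed form of Equation 4.11.
    static double evExpClosedForm(int r, double v, double d) {
        double partial = 0.0;
        for (int i = 1; i <= r - 1; i++) {
            partial += (1.0 / i) * Math.pow(1.0 - d, i - 1);
        }
        return v * (Math.log(d) / (d - 1.0) - partial);
    }

    public static void main(String[] args) {
        System.out.println("Example 6 EPRU ~ " + exampleSix(5000));
        System.out.println("EVexp(3,10)    ~ " + evExpTruncated(3, 10.0, 0.2, 5000));
        System.out.println("EVexp closed   = " + evExpClosedForm(3, 10.0, 0.2));
    }
}

The truncated and closed-form values of EVexp should agree to within floating-point error, which is the sense in which Equation 4.11 makes Equation 4.10 computable.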
4.2.2 EPRU Corresponds to Reproductive Success

This section provides a proof that if a strategy has the highest EPRU for the given environment, it will also have the optimal expected probability of reproducing. This proof applies only to pairwise games, but the same techniques should apply to arbitrary (finite) numbers of strategies.

Assume we have an infinite Cultaptation game instance (G, S), as defined in Section 4.1, made up of agents using strategies s and s' (i.e., S = {s, s'}). Recall from Section 4.1 that the score for strategy s is

\mathrm{score}(s) = \lim_{r \to \infty} \frac{\sum_{i=0}^{r} p(s, i)}{r}

where p(s, i) is the number of agents using strategy s on round i. Our objective for this section will be to show that EPRU(s | G, S) > EPRU(s' | G, S) if and only if score(s) > score(s').

I begin by defining a reset event, which will help illustrate some interesting properties of infinite Cultaptation.

Definition. Let n and n' be the number of agents using s and s', respectively, on the first round of the game instance, and let N = n + n'. A reset event occurs when all the agents in the environment die on two consecutive rounds, and on the second round they are replaced (via mutation) by n agents using s and n' agents using s'. The probability of a reset event occurring is \rho = d^N d^N m^N \binom{N}{n} 0.5^N. □

In other words, after a reset event occurs the conditions are identical to those that were present on the first round; the game instance has essentially started over. Note that ρ is the same on every round, and it is always greater than 0.

Since the game instance continues for an infinite number of rounds, there will be an infinite number of reset events. Thus, if we were to run other game instances with S = {s, s'}, both strategies would have the same score each time. Therefore, we also know that we can define each strategy's score as a function of its expected population at each round, rather than its population for a single game instance. This gives us

\lim_{r \to \infty} \frac{\sum_{i=0}^{r} p(s, i)}{r} = \lim_{r \to \infty} \frac{\sum_{i=0}^{r} \mathrm{EP}(s, i)}{r}    (4.14)

where EP(s, r) is the expected population of agents using strategy s on round r. I will also define EAU(s, r) to be the expected agent utility of strategy s on round r; that is, EAU(s, r) is the expected PRU of a randomly-chosen agent using strategy s on round r.

EP(s, r) can now be defined recursively for each strategy using the mechanics of Cultaptation, as follows. Let EP(s, 0) = n and EP(s', 0) = n'. Then, for r ≥ 0,

\mathrm{EP}(t, r+1) = \underbrace{(1-d)\,\mathrm{EP}(t, r)}_{\text{survived from previous round}} + \underbrace{N d (1-m)\,\frac{\mathrm{EP}(t, r)\,\mathrm{EAU}(t, r)}{\mathrm{TU}(r)}}_{\text{new agents from selection}} + \underbrace{\frac{N d m}{2}}_{\text{new agents from mutation}}

where t ∈ {s, s'} and TU(r) = EP(s, r) EAU(s, r) + EP(s', r) EAU(s', r) is the expected total utility on round r. Recall from Section 4.1 that N is the total number of agents in the environment, d is the probability of death, and m is the probability of mutation.

I now consider the behavior of EAU(s, r) as r increases.

Lemma 1. For any strategy s, \lim_{r \to \infty} \mathrm{EAU}(s, r) = \mu for some finite μ.

Proof. Let u(s, r) be the expected utility of a single agent using strategy s when r rounds have passed since the first round or the last reset event. For all r, we know that 0 ≤ u(s, r) ≤ V_max/d, since agents cannot earn negative utility, and no strategy can have better expected performance than a strategy that exploits the best possible action for its entire expected lifespan of 1/d rounds. We can rewrite EAU(s, r) in terms of u(s, r) as follows:

\mathrm{EAU}(s, r) = \left(\sum_{i=0}^{r-1} (1-\rho)^{i}\, u(s, i)\right) + (1-\rho)^{r}\, u(s, r).
Taking the limit of this form gives us

\lim_{r \to \infty} \mathrm{EAU}(s, r) = \lim_{r \to \infty}\left(\sum_{i=0}^{r-1} (1-\rho)^{i} u(s, i)\right) + \lim_{r \to \infty} (1-\rho)^{r} u(s, r) = \lim_{r \to \infty}\left(\sum_{i=0}^{r-1} (1-\rho)^{i} u(s, i)\right).

Since u(s, i) is bounded and \sum_{i=0}^{r-1}(1-\rho)^{i} is a geometric series, \lim_{r \to \infty}\left(\sum_{i=0}^{r-1}(1-\rho)^{i} u(s, i)\right) converges absolutely by the comparison test. Hence, \lim_{r\to\infty} \mathrm{EAU}(s, r) = \mu for some finite μ. □

Lemma 2. For any strategy s and set of available strategies S,

\lim_{r \to \infty} \mathrm{EAU}(s, r) = \mathrm{EPRU}(s \mid G, S).

Proof. The expected agent utility EAU(s, r) is defined as the expected PRU of an agent using strategy s on round r. As r approaches infinity, the probability that a randomly-selected agent will be i rounds old approaches L(i)/\sum_{j=1}^{\infty} L(j) = d\,L(i). The probability of reaching a history h is defined in Section 4.1.5 as P(h | s, S), and as r increases the set of histories a randomly-selected agent may have approaches H, the set of all histories. Thus,

\lim_{r \to \infty} \mathrm{EAU}(s, r) = \sum_{h \in H} d\,L(|h|)\; P(h \mid s, S)\; \mathrm{PRU}(h),

which is the definition of EPRU(s | G, S). □

EP(s, r) and EP(s', r) are both functions of EAU(s, r) and EAU(s', r), which converge to EPRU(s | G, S) and EPRU(s' | G, S) respectively. Therefore, EP(s, r) and EP(s', r) must also converge as r approaches infinity. We will let EP(t) = \lim_{r\to\infty} EP(t, r) for t ∈ {s, s'}. We can find the value of EP(s) as follows:

\mathrm{EP}(s) = (1-d)\,\mathrm{EP}(s) + N d (1-m)\,\frac{\mathrm{EP}(s)\,\mathrm{EPRU}(s \mid G, S)}{\mathrm{EP}(s)\,\mathrm{EPRU}(s \mid G, S) + \mathrm{EP}(s')\,\mathrm{EPRU}(s' \mid G, S)} + \frac{N d m}{2}.

After substituting EP(s') = N − EP(s) and rearranging terms, we have

0 = \bigl(\mathrm{EPRU}(s \mid G, S) - \mathrm{EPRU}(s' \mid G, S)\bigr)\,\mathrm{EP}(s)^2 + N\Bigl(\bigl(1 + \tfrac{m}{2}\bigr)\mathrm{EPRU}(s' \mid G, S) - \bigl(1 - \tfrac{m}{2}\bigr)\mathrm{EPRU}(s \mid G, S)\Bigr)\,\mathrm{EP}(s) - N^2\,\tfrac{m}{2}\,\mathrm{EPRU}(s' \mid G, S).

Assume EPRU(s | G, S) > 0 and EPRU(s' | G, S) > 0, which must be true as long as there are no actions that give negative utility. Also, let x = EPRU(s | G, S)/EPRU(s' | G, S). Then we can rewrite the above as

0 = (x-1)\,\mathrm{EP}(s)^2 + N\Bigl(1 + \tfrac{m}{2} - x\bigl(1 - \tfrac{m}{2}\bigr)\Bigr)\,\mathrm{EP}(s) - N^2\,\tfrac{m}{2}.    (4.15)

This equation, when subject to the constraint 0 ≤ EP(s) ≤ N, allows us to express EP(s) as a strictly increasing function of x. It also has the property that when x = 1, EP(s) = EP(s') = N/2.

Lemma 3. EP(s) > EP(s') if and only if EPRU(s | G, S) > EPRU(s' | G, S).

Proof. Assume EPRU(s | G, S) > EPRU(s' | G, S). Then x > 1 in Equation 4.15, and therefore EP(s) > N/2, so EP(s) > EP(s'). Conversely, assume EP(s) > EP(s'). Applying this as an extra constraint to Equation 4.15, we see that only values of x greater than one satisfy the equality. Therefore, EPRU(s | G, S) > EPRU(s' | G, S). Hence, EP(s) > EP(s') if and only if EPRU(s | G, S) > EPRU(s' | G, S). □

We now know that the strategy with higher EPRU will have the higher expected frequency in the limit, so all that remains is to show that a strategy with higher frequency in the limit will, in fact, have a higher average population over all rounds (and therefore a higher score in the game).

Lemma 4. For all s and s',

\lim_{r \to \infty} \frac{\sum_{i=0}^{r} \mathrm{EP}(s, i)/r}{\sum_{i=0}^{r} \mathrm{EP}(s', i)/r} = \frac{\mathrm{EP}(s)}{\mathrm{EP}(s')}.

Proof. First, note that because N, d, and m are all strictly positive, 0 < Ndm/2 ≤ EP(s, r) < N for all strategies s and rounds r. Therefore, \sum_{i=0}^{r} \mathrm{EP}(s', i) is strictly increasing and unbounded as r increases.
Using the fact that Ndm/2 ≤ EP(s, r) < N for all s and r, and the definition EP(s) = \lim_{r\to\infty} EP(s, r), we can obtain the following:

\lim_{r \to \infty} \frac{\sum_{i=0}^{r+1} \mathrm{EP}(s, i) - \sum_{i=0}^{r} \mathrm{EP}(s, i)}{\sum_{i=0}^{r+1} \mathrm{EP}(s', i) - \sum_{i=0}^{r} \mathrm{EP}(s', i)} = \lim_{r \to \infty} \frac{\mathrm{EP}(s, r+1)}{\mathrm{EP}(s', r+1)} = \frac{\lim_{r\to\infty} \mathrm{EP}(s, r+1)}{\lim_{r\to\infty} \mathrm{EP}(s', r+1)} = \frac{\lim_{r\to\infty} \mathrm{EP}(s, r)}{\lim_{r\to\infty} \mathrm{EP}(s', r)} = \frac{\mathrm{EP}(s)}{\mathrm{EP}(s')}.

Therefore, by the Stolz-Cesàro theorem,

\lim_{r \to \infty} \frac{\sum_{i=0}^{r} \mathrm{EP}(s, i)}{\sum_{i=0}^{r} \mathrm{EP}(s', i)} = \frac{\mathrm{EP}(s)}{\mathrm{EP}(s')},

and thus

\lim_{r \to \infty} \frac{\sum_{i=0}^{r} \mathrm{EP}(s, i)/r}{\sum_{i=0}^{r} \mathrm{EP}(s', i)/r} = \frac{\mathrm{EP}(s)}{\mathrm{EP}(s')}. □

From Equation 4.14 and Lemmas 3 and 4, we immediately get the following:

Theorem 1. For all s and s',

\lim_{r \to \infty} \frac{\sum_{i=0}^{r} p(s, i)/r}{\sum_{i=0}^{r} p(s', i)/r} > 1

if and only if EPRU(s | G, S) > EPRU(s' | G, S).

Therefore, the strategy with higher EPRU will have the higher score in a game of infinite Cultaptation.

Irrelevance of the Initial Strategy Profile

From the fact that EPRU is independent of the initial strategy profile, we also get the following corollary, which will help us understand some of the experimental results (see Section 5.3).

Corollary 1. The initial strategy profile of an infinite Cultaptation game instance (defined in Section 4.1) does not affect the score of any strategy in S.

If this seems counterintuitive, consider the following. At the beginning of this section I defined a reset event, in which every agent dies on two consecutive rounds, and all are replaced via mutation so that the population is identical to the initial strategy profile. For each reset event, there will be many similar events in which every agent dies on two consecutive rounds and is replaced via mutation, but in some arrangement different from the initial strategy profile. The probability of this happening is d^{2N} m^N, which is greater than 0. In an infinite-length game, such an event will eventually occur with probability 1. After it occurs, the initial strategy profile clearly has no bearing on how the rest of the game plays out, yet there are still an infinite number of rounds left in the game. Since each strategy's score is its average population over the entire game (see Section 4.1), the impact of the initial strategy profile on each strategy's total score is vanishingly small in an infinite-length game.

Application of EPRU to Other Evolutionary Games

Many of the equations used in calculating EPRU involve concepts particular to Cultaptation, such as innovation, observation, and changing action values. However, the general technique is to calculate the expected reproductive fitness of an agent on round j, multiply this quantity by the expected proportion of agents that are j rounds old, and sum these quantities to get the expected fitness of an entire population. This should be a useful metric in any evolutionary game in which agents live for more than one generation and reproduce according to the replicator equation, even if the game uses some measure other than per-round utility to determine reproductive fitness. The proofs in this section rely primarily on the symmetry between (1) the probability that an agent will be alive after k rounds and (2) the expected proportion of a population of agents that are k rounds old on any given round. Thus, any evolutionary game that allows agents to live more than one generation and in which agents die with the same probability on every round should be able to use a metric very similar to EPRU to compare strategies.
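To make the population dynamics of Section 4.2.2 concrete, the following sketch iterates the EP recursion with EAU held fixed at two limiting EPRU values and compares the result against the positive root of the quadratic in Equation 4.15. All of the numbers (N, d, m, and the two EPRU values) are made up purely for illustration, and the class is not part of the dissertation's implementation.

// Iterate the EP recursion and cross-check against Equation 4.15.
public class PopulationFixedPoint {
    public static void main(String[] args) {
        double N = 100, d = 0.25, m = 0.02;
        double epruS = 6.2, epruS2 = 5.9;     // stand-ins for EPRU(s | G,S) and EPRU(s' | G,S)

        // EP(t, r+1) = (1-d)EP(t,r) + Nd(1-m) EP(t,r)EAU(t,r)/TU(r) + Ndm/2
        double epS = 50, epS2 = 50;
        for (int r = 0; r < 100000; r++) {
            double tu = epS * epruS + epS2 * epruS2;
            double nextS  = (1 - d) * epS  + N * d * (1 - m) * epS  * epruS  / tu + N * d * m / 2;
            double nextS2 = (1 - d) * epS2 + N * d * (1 - m) * epS2 * epruS2 / tu + N * d * m / 2;
            epS = nextS;
            epS2 = nextS2;
        }

        // Equation 4.15: 0 = (x-1)P^2 + N(1 + m/2 - x(1 - m/2))P - N^2 m/2, with x the EPRU ratio;
        // the root in [0, N] is the limiting population EP(s).
        double x = epruS / epruS2;
        double a = x - 1, b = N * (1 + m / 2 - x * (1 - m / 2)), c = -N * N * m / 2;
        double root = (-b + Math.sqrt(b * b - 4 * a * c)) / (2 * a);

        System.out.printf("EP(s) by iteration:  %.4f (EP(s') = %.4f)%n", epS, epS2);
        System.out.printf("EP(s) from Eq. 4.15: %.4f%n", root);
    }
}

Because the strategy with the larger stand-in EPRU value ends up with more than N/2 of the population, the sketch also illustrates the direction of Lemma 3.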
Chapter 5

Strategy Generation Algorithms

This chapter describes the algorithms used to generate near-best-response strategies for Cultaptation, and presents experimental studies in which near-best-response strategies are tested against a known good strategy from the Cultaptation tournament.

5.1 Finding an ε-Best Response Strategy

In this section I explain what it means for a strategy to be a best response or near-best response in infinite Cultaptation, and I provide an algorithm for calculating a near-best response to S^-, the available strategies other than our own.

5.1.1 Problem Specification

Now that we have derived EPRU and proved that a strategy's EPRU is directly proportional to its score in an infinite Cultaptation game, we can determine how each strategy in a given set of available strategies S will perform by evaluating the EPRU of each strategy. Therefore, we can define a best-response strategy in terms of EPRU, as follows.

Recall that in an infinite Cultaptation game (as defined in Section 4.1) there are ν players, each of whom selects a strategy to put into the set of available strategies S. Let S^- be the set of available strategies other than our own, i.e., S^- = {s_1, ..., s_{i-1}, s_{i+1}, ..., s_ν}. Strategy s_opt is a best response to S^- if and only if for any other strategy s',

\mathrm{EPRU}(s_{\mathrm{opt}} \mid G, S^- \cup \{s_{\mathrm{opt}}\}) \geq \mathrm{EPRU}(s' \mid G, S^- \cup \{s'\}).

Computing s_opt is not possible due to its prohibitively large size. However, we can compute an ε-best-response strategy, i.e., a strategy s such that EPRU(s | G, S^- ∪ {s}) is arbitrarily close to EPRU(s_opt | G, S^- ∪ {s_opt}). This problem can be stated formally as follows: Given game parameters G, error bound ε > 0, and the set S^- of available strategies other than our own, find a strategy s such that EPRU(s | G, S^- ∪ {s}) is within ε of EPRU(s_opt | G, S^- ∪ {s_opt}).

5.1.2 Bounding EPRU

In games where 0 < d < 1, an agent could potentially live for any finite number of rounds. However, since the agent's probability of being alive on round r decreases exponentially with r, the expected utility contributed by an agent's actions in later rounds is exponentially lower than the expected utility contributed by earlier rounds. I will use this fact in deriving a bound on EPRU_alt(s, h | G, S) for a given strategy and a history h of length l.

Recall from Equations 4.10 and 4.11 that

\mathrm{EV}_{\exp}(r, v) = v \sum_{i=r}^{\infty} \frac{1}{i}(1-d)^{i-1} = v\left(\frac{\ln d}{d-1} - \sum_{i=1}^{r-1} \frac{1}{i}(1-d)^{i-1}\right)    (5.1)

where EVexp(r, v) is the expected contribution to EPRU made by exploiting an action with value v on round r.

Since we know how much any given exploit contributes to EPRU(s | G, S) for a given strategy s, we can calculate G(l, v), the amount that exploiting the same action on all rounds after l contributes to EPRU(s | G, S), as follows:

G(l, v) = \sum_{j=l+1}^{\infty} v \sum_{n=j}^{\infty} \frac{1}{n}(1-d)^{n-1} = v \sum_{j=l+1}^{\infty} \sum_{n=j}^{\infty} \frac{1}{n}(1-d)^{n-1}.

Expanding the summations yields:

G(l, v) = v\left(\frac{1}{l+1}(1-d)^{l} + \frac{2}{l+2}(1-d)^{l+1} + \cdots\right) = v \sum_{n=l+1}^{\infty} \frac{n-l}{n}(1-d)^{n-1} = v\left(\sum_{n=l+1}^{\infty} (1-d)^{n-1} - \sum_{n=l+1}^{\infty} \frac{l}{n}(1-d)^{n-1}\right) = v\left(\frac{(1-d)^{l}}{d} - \sum_{n=l+1}^{\infty} \frac{l}{n}(1-d)^{n-1}\right).

Next, we pull l out of the summation and use (5.1) to obtain:

G(l, v) = v\left(\frac{(1-d)^{l}}{d} - \underbrace{l\left(\frac{\ln d}{d-1} - \sum_{n=1}^{l} \frac{1}{n}(1-d)^{n-1}\right)}_{(a)}\right).    (5.2)

Note that for 0 < d < 1, G(l, v) is finite. G(l, v) provides a closed-form formula for the eventual contribution of exploiting an action with value v at every round after the l-th round.
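Equation 5.2 can be checked against the original double sum. The sketch below is illustrative only; the parameter values and the truncation depth are arbitrary, and the class name is invented for this example.

// Numerical sanity check of Equation 5.2 against the definition of G(l, v).
public class TailBoundCheck {
    // G(l, v) by truncating the double sum: sum_{j=l+1..} v * sum_{n=j..} (1/n)(1-d)^(n-1).
    static double gDirect(int l, double v, double d, int terms) {
        double outer = 0.0;
        for (int j = l + 1; j <= terms; j++) {
            double inner = 0.0;
            for (int n = j; n <= terms; n++) {
                inner += (1.0 / n) * Math.pow(1.0 - d, n - 1);
            }
            outer += inner;
        }
        return v * outer;
    }

    // G(l, v) by the closed form: v * ((1-d)^l/d - l*(ln d/(d-1) - sum_{n=1..l} (1/n)(1-d)^(n-1))).
    static double gClosedForm(int l, double v, double d) {
        double partial = 0.0;
        for (int n = 1; n <= l; n++) {
            partial += (1.0 / n) * Math.pow(1.0 - d, n - 1);
        }
        return v * (Math.pow(1.0 - d, l) / d - l * (Math.log(d) / (d - 1.0) - partial));
    }

    public static void main(String[] args) {
        int l = 5; double v = 160.0, d = 0.4;
        System.out.println("G(l, v) direct      ~ " + gDirect(l, v, d, 2000));
        System.out.println("G(l, v) closed form = " + gClosedForm(l, v, d));
    }
}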
Since the set V of possible action values is finite (see Section 4.1), let v_max = max(V) be the largest of these values. Then G(l, v_max) is an upper bound on the expected per-round utility achieved after round l (clearly no strategy can do better than exploiting an action with maximal value on every round after round l). I use this fact to bound the depth-limited expected per-round utility computation.

Theorem 2. Let v_max be the highest possible action utility for game parameters G, and let S^- be the set of available strategies other than our own. Then for all l and all strategies s,

\mathrm{EPRU}_{\mathrm{alt}}(s, \langle\rangle \mid G, S^- \cup \{s\}) - \mathrm{EPRU}^{l}_{\mathrm{alt}}(s, \langle\rangle \mid G, S^- \cup \{s\}) \leq G(l, v_{\max}).

Proof. Since it is not possible for any strategy to gain more utility than v_max on any round, this follows from the discussion above. □

Theorem 2 states that G(l, v_max) is the highest possible contribution to the total expected per-round utility (i.e., EPRU_alt(s, ⟨⟩ | G, S^- ∪ {s})) made by any strategy s after round l. Thus, if we are given an ε > 0 and we can find a value of k such that G(k, v_max) ≤ ε, then we know that no strategy can earn more than ε expected utility after round k. The next section will show how to find such a k.

5.1.3 Determining How Far to Search

In this section I show how to find a search depth k such that, for any given ε > 0, no strategy can earn more than ε utility after round k. We first note a bound on G(l, v):

Lemma 5. G(l, v) ≤ v(1-d)^l / d.

Proof. The lemma follows from noting that part (a) of Equation 5.2 is greater than or equal to zero, since \frac{\ln d}{d-1} = \sum_{n=1}^{\infty} \frac{1}{n}(1-d)^{n-1} and l < ∞. Thus G(l, v) = v\left(\frac{(1-d)^{l}}{d} - w\right) \leq v\,\frac{(1-d)^{l}}{d}, since w is always non-negative. □

Now if we can find a k such that

\varepsilon = v_{\max}(1-d)^{k}/d,

then we can be certain that ε ≥ G(k, v_max). Solving for k in the above equation yields

k = \log_{1-d}\!\left(\frac{\varepsilon\, d}{v_{\max}}\right),    (5.3)

which has a solution for 0 < d < 1 and v_max > 0, both of which will always be true in Cultaptation. This gives us the following theorem.

Theorem 3. Given ε > 0, set of available strategies S^- other than our own, and game parameters G with maximal utility v_max, let k = \log_{1-d}(\varepsilon d / v_{\max}). If s has the maximal value of EPRU^k_alt(s, ⟨⟩ | G, S^- ∪ {s}), then s is an ε-best response to S^-.

Proof. Let s_opt be the strategy with the maximal value of EPRU(s_opt | G, S^- ∪ {s_opt}). By Theorem 2, we know that s_opt cannot earn more than ε expected utility on rounds after k. Since s earns the maximum EPRU possible in the first k rounds, it follows that |EPRU(s_opt | G, S^- ∪ {s_opt}) − EPRU(s | G, S^- ∪ {s})| ≤ ε. Therefore, s is an ε-best response. □
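In practice, Equation 5.3 and Lemma 5 give a one-line recipe for choosing the search depth. The sketch below is an illustration only; the parameter values are borrowed from the uniform1-style environments used later in Section 5.3, and taking the ceiling (so that k is an integer) is my assumption, which only makes the bound tighter.

// Compute the search depth k from Equation 5.3 and verify the tail bound of Lemma 5.
public class SearchDepth {
    // k = ceil( log_{1-d}(epsilon * d / vmax) )
    static int searchDepth(double epsilon, double d, double vmax) {
        return (int) Math.ceil(Math.log(epsilon * d / vmax) / Math.log(1.0 - d));
    }

    public static void main(String[] args) {
        double epsilon = 0.4, d = 0.4, vmax = 166.67;
        int k = searchDepth(epsilon, d, vmax);
        double tail = vmax * Math.pow(1.0 - d, k) / d;  // Lemma 5's bound on utility earned after round k
        System.out.println("k = " + k + ", remaining-utility bound = " + tail + " <= " + epsilon);
    }
}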
5.1.4 Algorithm

I will now present my algorithm for computing the strategy s with the maximal value of EPRU^k_alt(s, ⟨⟩ | G, S^- ∪ {s}), and show how it can be used to compute an ε-best response. Algorithm 1 returns a 2-tuple containing a partially specified strategy s and a scalar U. Strategy s maximizes EPRU^k_alt(s, h | G, S^- ∪ {s}), and U is the value of this expression.

Algorithm 1 Produce the strategy s that maximizes EPRU^k_alt(s, h | G, S^- ∪ {s}), given initial history h, set of possible utility values V, and S^-, the set of available strategies other than our own.

Strat(h, k, V, S^-)
 1: if k = 0 then
 2:   return {∅, 0}
 3: end if
 4: Let U_max = 0
 5: Let s_max = null
 6: for each action a ∈ {Inv, Obs, X_1, ..., X_M} do
 7:   Let U_temp = 0
 8:   Let s_temp = {(h, a)}
 9:   for each action m ∈ {1, ..., M} do
10:     for each value v ∈ V do
11:       Let t = (a, (m, v))
12:       Let p = P(h·t | h, a, S^-)
13:       if p > 0 then
14:         Let {S', U'} = Strat(h·t, k-1, V, S^-)
15:         s_temp = s_temp ∪ S'
16:         U_temp = U_temp + p · (EVexp(|h·t|, U(t)) + U')
17:       end if
18:     end for
19:   end for
20:   if U_temp > U_max then
21:     U_max = U_temp
22:     s_max = s_temp
23:   end if
24: end for
25: return {s_max, U_max}

The algorithm performs a depth-first search through the space of strategies that start from the input history h, stopping once it reaches a specified depth k. Figure 5.1 provides an example of the kind of tree searched by this algorithm. For each possible action a ∈ {Inv, Obs, X_1, ..., X_M} at h, it computes the expected per-round utility gained from performing a, and the utility of the best strategy for each possible history h' that could result from choosing a. It combines these quantities to get the total expected utility for a, and selects the action with the best total expected utility, a_max. It returns the strategy created by combining the policy (h, a_max) with the strategies for each possible h', and the utility for this strategy. Seen another way, Strat(h, k, V, S^-) computes EPRU^k_alt(s, h | G, S^- ∪ {s}) for all possible strategies s, returning the strategy maximizing EPRU^k_alt as well as the maximal value of EPRU^k_alt.

Proposition 1. Strat(h, k, V, S^-) returns (s, U) such that

s = \arg\max_{s'} \mathrm{EPRU}^{k}_{\mathrm{alt}}(s', h \mid G, S^- \cup \{s'\}) \quad \text{and} \quad U = \mathrm{EPRU}^{k}_{\mathrm{alt}}(s, h \mid G, S^- \cup \{s\}).

A proof of this proposition is presented in [19].

We now have an algorithm capable of computing the strategy with maximal expected utility over the first k rounds. Hence, in order to find an ε-best-response strategy we need only find the search depth k such that no strategy can earn more than ε expected utility after round k, and then call the algorithm with that value of k.

Theorem 4. Given ε > 0, available strategies other than our own S^-, and a set of values V with maximum value v_max, let k = \log_{1-d}(\varepsilon d / v_{\max}). Then Strat(⟨⟩, k, V, S^-) returns (s, U) such that s is an ε-best response to S^-.

Proof. This follows from Theorem 3 and Proposition 1. □

We also have the following.

Corollary 2. Given available strategies other than our own S^- and a set of values V, let s_k be the strategy returned by Strat(⟨⟩, k, V, S^-). Then \lim_{k \to \infty} s_k is a best response to S^-.

Proof. Let s_opt be a best response to S^-. By Lemma 5 and Theorem 3,

\mathrm{EPRU}(s_{\mathrm{opt}} \mid G, S^- \cup \{s_{\mathrm{opt}}\}) - \mathrm{EPRU}(s_k \mid G, S^- \cup \{s_k\}) \leq v_{\max}(1-d)^{k}/d.

Since \lim_{k\to\infty} v_{\max}(1-d)^{k}/d = 0, it follows that \lim_{k\to\infty}\bigl(\mathrm{EPRU}(s_{\mathrm{opt}} \mid G, S^- \cup \{s_{\mathrm{opt}}\}) - \mathrm{EPRU}(s_k \mid G, S^- \cup \{s_k\})\bigr) = 0. Therefore, \lim_{k\to\infty} s_k is a best response to S^-. □

5.1.5 Implementation

In this section I discuss modifications that improve the running time of Algorithm 1 without any loss in accuracy. The following subsections discuss techniques for state aggregation that cut the branching factor of the algorithm in half, the representation of Obs, and caching and pruning.

State Aggregation

If the pseudocode for Algorithm 1 were implemented verbatim, it would search through each history that can be reached from the starting state. However, there is a significant amount of extraneous information in each history that is not needed for any of the algorithm's calculations. For example, the histories h = ⟨(Inv, (1, 10))⟩ and h' = ⟨(Inv, (2, 10))⟩ both describe a situation where the agent innovates once and obtains an action with value 10. The only difference between these histories is the identifier assigned to the action, which does not impact any of the calculations; yet the pseudocode must still search through each of these histories separately. We can eliminate this redundancy by using repertoires, rather than histories, as the states for the algorithm to search through, as the sketch below illustrates.
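The following sketch shows the aggregation idea on the two histories from the example above: both collapse to the same state once action identifiers are discarded. It is a simplified illustration, not the construction used in the implementation (Appendix B gives the full repertoire definition); in particular it ignores round numbers, value changes, and the refreshing of knowledge by exploit actions.

import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Two histories that differ only in the identifiers of the actions they contain
// collapse to the same "repertoire key", here a sorted multiset of known values.
public class RepertoireKey {
    record Step(String action, int actionId, double value) {}

    // Count how many learned actions have each value, ignoring action identifiers.
    static Map<Double, Integer> repertoireOf(List<Step> history) {
        Map<Double, Integer> key = new TreeMap<>();
        for (Step s : history) {
            if (s.action().equals("Inv") || s.action().equals("Obs")) {
                key.merge(s.value(), 1, Integer::sum);
            }
        }
        return key;
    }

    public static void main(String[] args) {
        List<Step> h1 = List.of(new Step("Inv", 1, 10.0));
        List<Step> h2 = List.of(new Step("Inv", 2, 10.0));
        // Both histories from the example above map to the same search state:
        System.out.println(repertoireOf(h1).equals(repertoireOf(h2))); // true
    }
}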
Figure 5.1: An illustration of a portion of the search performed by Algorithm 1 on a very small version of Cultaptation, with four actions and a set of possible action values equal to {20, 40, 80, 160}. Circle nodes indicate possible observations made by the agent, and triangle nodes indicate possible choices made by the agent (which have a random outcome, indicated by the line through the branches leading to its children). Ellipses indicate branches or subtrees omitted from the figure. The left side of the illustration shows how more actions can become available to the agent as the game progresses, while the right side shows how potentially-changing values of actions must be modeled in the tree. Several techniques outlined in Section 5.1.5 can reduce the size of the tree that needs to be searched in the implementation of this algorithm.

A repertoire is a record of what the agent knows about each of the actions it has learned, rather than a record of everything that has happened to it. Making this simple change allows Algorithm 1 to calculate the value of an observation action by combining information it learns when exploring innovate and exploit actions, rather than recursing again. This cuts the branching factor of the search in half. The analysis and details involved in this change, as well as the proof that the version of the algorithm using repertoires returns the same result as the previous version, are included in Appendix B.

Running time analysis. When Algorithm 1 considers a history h, it makes one recursive call for each possible action-percept pair (a, (m, v)) that can be executed at h. There are 2M such pairs for each history: if the agent knows how to exploit j actions, then it can innovate any of the M - j actions it does not know, and it can observe any of the M actions. Each of these actions can also return any of |V| values. Hence, the number of recursive calls made by the algorithm at each step is at most 2M|V|. Since the algorithm recurses to depth k, the running time of Algorithm 1 is O((2M|V|)^k). With the state aggregation technique described above, we do not need to perform additional recursions for observation actions. Hence, the number of recursive calls made at each step is at most M|V|, and the total running time is O((M|V|)^k), which improves upon the original running time by a factor of 2^k.

Representing Obs

For the formal proofs, I treated Obs as a black box that, when given the agent's history and round number, could tell us the exact probabilities of observing each action on the current round. However, since there are an exponential number of possible histories, storing Obs in this form would require an exponential amount of space, which would severely limit the size of games for which we could compute strategies. Algorithm 2 (introduced in Section 5.2) would also need to run a prohibitively large number of simulations to get enough samples to generate a new Obs of this type. Therefore, as an approximation, my implementation assumes that Obs has a similar structure to Inv, and remains constant throughout the agent's lifetime. That is, the Obs used in the experiments returns the probability of an action valued v being observed. While this leads to some loss in accuracy, it is very easy to store and compute. Further, we will see in the experimental results (particularly those dealing with iterative computation in Section 5.3.2) that this form of Obs is still able to produce good strategies.
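The simplified, Inv-like form of Obs described above amounts to a fixed probability distribution over action values. The sketch below shows one possible representation of such a distribution; the class name and API are assumptions made for illustration and are not taken from the dissertation's code.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

// A history-independent distribution over action values, usable for either Inv or Obs.
public class ValueDistribution {
    private final Map<Double, Double> probByValue = new LinkedHashMap<>();

    void put(double value, double probability) { probByValue.put(value, probability); }

    // Probability that an observe (or innovate) action returns an action worth `value`.
    double probabilityOf(double value) { return probByValue.getOrDefault(value, 0.0); }

    // Draw one value by inverse-transform sampling.
    double sample(Random rng) {
        double r = rng.nextDouble(), cumulative = 0.0;
        double last = Double.NaN;
        for (Map.Entry<Double, Double> e : probByValue.entrySet()) {
            cumulative += e.getValue();
            last = e.getKey();
            if (r < cumulative) return e.getKey();
        }
        return last; // guard against rounding when the probabilities sum to slightly less than 1
    }

    public static void main(String[] args) {
        ValueDistribution obs = new ValueDistribution();
        for (double v : new double[]{20, 40, 80, 160}) obs.put(v, 0.25); // uniform, as in Section 5.3.2
        System.out.println(obs.sample(new Random(7)));
    }
}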
Caching and Pruning

Since the implementation uses repertoires rather than histories to represent the agent's set of known actions, and since it is possible for two histories to produce the same repertoire, the algorithm will sometimes encounter repertoires that it has already evaluated. So that the algorithm will not have to waste time evaluating them again, the implementation includes a cache which stores the EPRU of every repertoire it has evaluated. When the algorithm encounters a repertoire whose expected utility is needed, the implementation first checks the cache to see if the EPRU of the repertoire has been previously computed, and uses the computed value if it exists. Caching is widely used in tree-search procedures, and is analogous to the transposition tables in chess-playing algorithms [37].

I also use another well-known method for avoiding unnecessary evaluation of states, namely branch-and-bound pruning [38, 39], which can be summarized as follows. Before we compute the expected per-round utility of a given action, we check to see whether an upper bound on the EPRU of that action would be sufficient to make the given action's utility higher than that of the best previously computed action. In many situations, the maximal utility that can be achieved for a given action will in fact be less than the utility we know we can achieve via some other action, and therefore we can skip the evaluation of that action (i.e., we can "prune" it from the search tree).

There are no theoretical guarantees on runtime reduction using these techniques, but we will see in Section 5.3.1 that the combination of pruning and caching allows the algorithm to avoid evaluating significant portions of the state space in the environments I tested.

5.2 Cultaptation Strategy Learning Algorithm

Until now I have assumed that Algorithm 1 has access to Obs, the distribution of observable actions, when it performs its calculations. While the algorithm finds the near-best-response strategy given a particular Obs, agents playing the real Cultaptation game are not given access to Obs beforehand, and even estimating what Obs looks like can be very difficult while playing the game due to the limited amount of information each agent receives in its lifetime. It is also unclear how exactly an agent's own actions will affect Obs: by exploiting a particular action, the agent is making that action observable to others, who might then exploit it in greater proportion than in the Obs used to compute the agent's strategy.

To address these difficulties, I developed the Cultaptation Strategy Learning Algorithm (CSLA), which uses a method for creating a strategy and a distribution Obs simultaneously so that (i) Obs is the distribution created when all agents in a Cultaptation game play the computed strategy and (ii) the computed strategy is a near-best response for Obs (and the other parameters). This algorithm copes with the lack of information about Obs, and generates an approximation of a strategy that is a best response to itself. At a high level, the algorithm can be thought of as generating a series of strategies, each an ε-best response to the one before it, and stopping when two successive strategies are extremely similar. A more detailed description of this process follows.

The algorithm begins by assuming Obs = Inv.
The algorithm then proceeds iteratively; at each iteration it generates s, the ε-best-response strategy to the current Obs, then simulates a series of Cultaptation games in which s plays itself, and extracts a new Obs from the actions exploited in these games.

At the end of each iteration, the algorithm compares s to s_old, the strategy produced by the previous iteration, using the stratDiff function. stratDiff(s, s_old) computes the probability that an agent using s would perform at least one different action before dying than the same agent using s_old. For instance, stratDiff(s, s_old) = 1.0 means that the two strategies will always perform at least one different action (i.e., the actions they choose on the first round are different), while stratDiff(s, s_old) = 0.0 means that s is identical to s_old. When stratDiff(s, s_old) is found to be below some threshold δ, CSLA terminates and returns s, the strategy computed by the last iteration. The formal algorithm is presented as Algorithm 2.

Algorithm 2 Produce an approximation of a strategy that is an ε-best response to itself.

CSLA(Inv, δ, k)
 1: Let Obs = Inv.
 2: s = ∅.
 3: repeat
 4:   Let s_old = s.
 5:   Let V = {Inv, Obs}.
 6:   s = Strat(⟨⟩, k, V, S)
 7:   Simulate a series of Cultaptation games in which s plays itself, and action utilities are initially drawn from Inv, recording all actions exploited in the last quarter of these games.
 8:   Use the records of exploited actions to generate a new distribution Obs (i.e., Obs(v) = fraction of the time v was exploited in the records).
 9: until stratDiff(s, s_old) < δ
10: return s.

Properties of the strategy. CSLA as presented here is a "best-effort" algorithm in the following sense: if CSLA converges to a strategy s, we know that it is an approximation of a symmetric Nash equilibrium strategy, but we do not know (i) whether or not CSLA will converge for a given environment, or (ii) how close to the true Nash equilibrium s is. The improved version of CSLA proposed in Chapter 5 will address these issues.

In my experimental studies (see Section 5.3), the strategies produced by CSLA in any given game were all virtually identical, even when a random distribution (rather than Inv) was used to initialize Obs. This strongly suggests (though it does not prove) that the strategy profile consisting of copies of s_self is a symmetric near-Nash equilibrium.

Furthermore, there is reason to believe that s is evolutionarily stable. Consider an environment in which all agents use the strategy s, and suppose a small number (say, one or two) of agents using other strategies are introduced as invaders. Because s was a near-best response to the environment that existed before the opponent's agents were introduced, and because the introduction of one or two invaders will change this environment only slightly, agents using s will still be using a strategy that is close to the best response for the current environment, and they will also have some payoff they have accumulated on previous rounds when their strategy was still a near-best response. Thus, the invaders should have a difficult time establishing a foothold in the population, and hence should die out with high probability. This suggests (but does not prove) that s is evolutionarily stable.¹

¹Among other things, a formal proof would require a way to calculate the payoffs for s and any invading strategy. Accomplishing this is likely to be complicated, but I hope to do it in my future research.
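The stratDiff convergence test used in Algorithm 2 can be approximated by Monte Carlo simulation. The sketch below is illustrative only and is not the dissertation's implementation: it models each strategy as a function from the percept sequence received so far to the next action, draws percepts from a fixed value set independently of the action chosen, and treats divergence on any round before death as a "different action". The toy strategies in main are made up for the example.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Function;

// Monte Carlo estimate of a stratDiff-style comparison between two strategies.
public class StratDiff {
    static double estimate(Function<List<Integer>, String> s,
                           Function<List<Integer>, String> sOld,
                           double d, int[] values, int trials, long seed) {
        Random rng = new Random(seed);
        int differed = 0;
        for (int t = 0; t < trials; t++) {
            List<Integer> percepts = new ArrayList<>();
            while (true) {
                // Compare the actions the two strategies would take at this point in the agent's life.
                if (!s.apply(percepts).equals(sOld.apply(percepts))) { differed++; break; }
                if (rng.nextDouble() < d) break;                       // agent dies with no divergence
                percepts.add(values[rng.nextInt(values.length)]);      // receive the next percept
            }
        }
        return (double) differed / trials;
    }

    public static void main(String[] args) {
        // Two toy strategies: both start with Innovate; one exploits any value >= 40,
        // the other exploits only values >= 80 and otherwise innovates again.
        Function<List<Integer>, String> sNew =
            p -> p.isEmpty() ? "Inv" : (p.get(p.size() - 1) >= 40 ? "Exploit" : "Inv");
        Function<List<Integer>, String> sOld =
            p -> p.isEmpty() ? "Inv" : (p.get(p.size() - 1) >= 80 ? "Exploit" : "Inv");
        int[] values = {20, 40, 80, 160};
        System.out.println(estimate(sNew, sOld, 0.25, values, 100_000, 42L));
    }
}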
5.2.1 Implementation Details

We have created a Java implementation of CSLA. Here I briefly discuss two issues dealt with during implementation.

Representing Obs

My implementation of CSLA uses the same representation of Obs as my implementation of Algorithm 1 does. In other words, it assumes Obs has the same form as Inv, and remains constant throughout the game. Ideally we would be able to condition Obs on the agent's history, but in practice this would require too much space (since there are an exponential number of possible histories), and we would need to run too many simulations in step 7 to get an accurate distribution for each history.

Training

In the Machine Learning literature, the process of improving an agent's performance on a given task is often referred to as "training." In Algorithm 2, strategy s is trained by playing against itself in a series of simulated games in step 7. However, in the implementation of CSLA the agents involved in the games in step 7 are a parameter to the algorithm. This means that CSLA can also produce a strategy that is trained by playing in an environment consisting of itself and one or more given strategies. The intuition behind this approach is that a strategy trained by playing against itself and strategy s' may perform better when playing against s' than a strategy trained against itself alone. I test this hypothesis experimentally in Section 5.3.2.

5.3 Experimental Results

In this section I present my experimental results.

Section 5.3.1 examines the performance of the implementation of the ε-best response algorithm. I find that the optimizations allow the algorithm to find strategies within 1% of the best response 1,000 times faster than the unoptimized algorithm. Section 5.3.1 also examines the strategies found by the ε-best response algorithm when presented with different environments and strategy profiles, and the results give us an idea of what kinds of circumstances are necessary for the near-best-response strategy to prefer innovation over observation.

Section 5.3.2 presents a series of experiments comparing two strategies generated with my Cultaptation Strategy Learning Algorithm to a known good strategy used in the international Cultaptation tournament. I find that the strategies generated with CSLA are able to beat the known good strategy, even when the environment is different from the one CSLA used to learn the strategies (Sections 5.3.2 and 5.3.2). Finally, I perform an in-depth qualitative analysis of all three strategies and highlight the differences in behavior that give my learned strategies an advantage (Section 5.3.2).

5.3.1 Experiments with the ε-Best Response Algorithm

In this section I present the experiments involving an implementation of Algorithm 1, which generates ε-best-response strategies for a given set of game parameters and strategy profile.

Implementation Performance

My first set of experiments was designed to study the accuracy and running time of my implementation, and the effectiveness of the methods developed to improve its performance. I first examined the effect of ε on running time and on the expected per-round utility computed by Algorithm 1. I ran the experiments in several different environments; first I examined the uniform1 environment. In this environment, Inv is a uniform distribution over {33.33, 66.67, 100, 133.33, 166.67}, and S contains the innovate-once (I1) strategy from Example 1, so Inv is identical to Obs. The probability of change in uniform1 is 1%, and the probability of death is 40%.
I also introduced several variations on the uniform1 environment to study the effect of different probabilities of change. They are uniform10, uniform20, uniform30, and uniform40, which have the respective probabilities of change of 10%, 20%, 30%, and 40%.

In Table 5.1, we see the EPRU computed for various values of ε in these environments. As a point of reference, strategy I1 can be analytically shown to achieve an EPRU of about 38.56. We can see that an upper bound on achievable EPRU in the uniform1 environment is 40.5, since the EPRU of an ε-best response to S is 40.1 when ε is 0.4. Also, note that the algorithm finds lower EPRUs as the probability of change increases. This is as expected: in a rapidly changing environment, one cannot expect an agent to do as well as in a static environment where good actions remain good and bad actions remain bad.

Table 5.1: Expected per-round utility of the ε-best response strategy computed by Algorithm 1, for eight different values of ε in various environments.

             ε=0.4  ε=0.8  ε=1.2  ε=1.6  ε=2.0  ε=2.4  ε=2.8  ε=3.2
  uniform1   40.1   40.0   39.6   39.6   39.6   38.9   38.9   38.9
  uniform10  39.3   39.1   38.8   38.8   38.8   38.1   38.1   38.1
  uniform20  38.7   38.5   38.2   38.2   38.2   37.7   37.7   37.7
  uniform30  38.7   38.5   38.2   38.2   38.2   37.7   37.7   37.7
  uniform40  38.7   38.5   38.2   38.2   38.2   37.7   37.7   37.7

The ε-best-response strategies computed generally innovate as the first action, then exploit that value if it is not the lowest value available (in this case 33.33). Otherwise, the strategies tend to innovate again in an attempt to find an action with a value bigger than 33.33. This is how they manage to achieve a higher EPRU than the innovate-once strategy.

As part of the experiment in the uniform1 environment, I kept track of the number of nodes searched by four variations of the algorithm. In the first variation, I ran Algorithm 1 without optimizations. I also examined the algorithm's performance with the pruning and caching optimizations described in Section 5.1.5. In Figure 5.2 we see that employing both caching and pruning allows the algorithm to compute strategies within 1% of the best response about 1,000 times faster. The search time required for an 80,000-node search is around 15 seconds on a 3.4GHz Xeon processor.

Figure 5.2: Number of nodes searched in the uniform1 environment, with different combinations of caching and pruning.
I designed this experiment to test the hypothesis that, even if I let Obs > Inv, the standard deviations of these distributions can still be varied such that the -best-response strategy computed by EPRUkalt will choose to innovate rather than observe. My methods and results are presented below. I used the repertoire-based algorithm Strat(R; r; k; V ) (Algorithm 1) to com- pute -best-response strategies for Cultaptation games with several di erent pa- rameter settings, then analyzed the strategies to determine how often they would observe or innovate. In this experiment, the agents died with 40% probability on each round (d = 0:4) and there were 5 potential exploitation actions. These games are smaller than the Cultaptation game used in the tournament, to ensure they can 62 Table 5.2: The portion of innovation actions (calculated as nInv=(nInv + nObs)) in the -best-response strategy when the standard deviations of Inv and Obs are as speci ed. In all cases, Inv = 100 and Obs = 110. Obs = 10 Obs = 300 Inv = 10 2:80 10 9 3:40 10 10 Inv = 300 0.995 0.215 be solved in a reasonable amount of time. In these games, I used distributions Obs and Inv with means of 110 and 100 respectively. Table 5.2 shows the results for four combinations of parameter settings: Inv 2 f10; 300g and Obs 2 f10; 300g. When Inv = 10, the near-best-response strategy will observe almost exclusively (innovating only in rare cases where observation returns several low-quality moves in a row). However, in the environment where Inv = 300 the near-best-response strategy includes signi cantly more Innvoates; when Obs = 10 it will innovate 99:5% of the time, and even when Obs = 300 it still innovates 21:5% of the time. This experiment lets us conclude that the means of Inv and Obs are not su - cient to determine if innovation or exploitation is better. In particular, if the stan- dard deviation of innovated values is high, then innovation becomes more valuable because multiple innovations tend to result in a higher valued action than multiple observations. An interesting strategy emerges when Inv and Obs both have high standard deviations. Even though the mean value of innovated actions is lower than the mean value of observed actions, the -best-response strategy in these cases innovates initially, then, if the value innovated is high, exploits that value. If the innovated 63 value is not high, an observation action is performed to ensure the agent has a reasonably-valued action available to exploit until it dies. 5.3.2 Experiments with the Cultaptation Strategy Learning Algo- rithm The objective of my second experiment was to examine the performance of strategies produced by the Cultaptation Strategy Learning Algorithm (Algorithm 2 in Section 5.2), and the importance of the environment (see Section 5.2.1) used to train these strategies. Speci cally, I was interested in| examining whether the strategies produced with CSLA were capable of beating a strategy that is known to do well; examining whether strategies produced by CSLA were able to perform well in environments di erent from those they were trained in; comparing how well a strategy that is trained only against itself (i.e., all agents in the simulated game in Step 7 of the CSLA algorithm use strategy s) can do at repelling an invader, versus how well a strategy trained against the invader (i.e. the invading strategy is included in the population of agents at Step 7) can do at repelling the invader. For the previous experiments, I assumed the algorithm had an oracle for Obs. 
For the rest of this section I will be running experimental simulations, so the oracle will observe what the agents do in the simulations and construct Obs from this, as described in Section 5.2.

For the known good strategy I used an algorithm called EVChooser, which performs a few innovation and observation actions early in the game and uses the results of these actions (along with a discount factor) to estimate the expected value of innovating, observing, and exploiting, then performs the action with the highest expected value. It placed 15th out of over 100 entries in the Cultaptation tournament [15]. We chose EVChooser because (1) it has been shown to be a competitive strategy, (2) its source code was readily available to me (unlike the other successful strategies from the Cultaptation tournament), and (3) it could be tuned to perform well in the Cultaptation environments I used (which, in order to accommodate CSLA's exponential running time, were much smaller than those used in the international Cultaptation tournament).

For games as small as the ones in my experiments, I believe EVChooser is representative of most of the high-performing strategies from the tournament. Nearly all of the strategies described in the tournament report [15] spend some time trying to figure out what the innovate and observe distributions look like, and afterwards use some heuristic for choosing whether to innovate, observe, or exploit their best known action on any given round. This heuristic often involves some type of expected-value computation; for instance, the winning strategy discountmachine used a discount factor to compare the utility gained by exploiting the current best-known action to the utility of possibly learning a better action and exploiting it on all future rounds, which is exactly what EVChooser does.² Unlike my CSLA algorithm, none of the strategies in the tournament conducted lookahead search.

For this experiment, I used an environment where Inv was a uniform distribution over the actions {20, 40, 80, 160}, probability of change was 1%, and probability of death was 25%. Due to the exponential running time of my strategy-generating algorithm, this is the largest environment (i.e., smallest probability of death, highest number of actions and action values) for which the algorithm could compute full strategies in a reasonable amount of time.

Convergence and Consistency of CSLA

As part of this work, I have developed a Java implementation of Algorithm 2 that allows one to specify the type of game to be used for the simulation in Step 7, and created two strategies: sself and sEVC. The training process for both strategies began with s0, the best response to a random Obs distribution, and continued by constructing a strategy s_{i+1} as a best response to the Obs generated by simulating games involving s_i. When training sself the simulated games consisted solely of agents using s_i, but while training sEVC they consisted of a population of agents using s_i being invaded by EVChooser. In both cases, 100 games were simulated at each step of the iteration, to limit the amount of noise in the Obs that was extracted from the simulations.
²discountmachine differs from EVChooser largely because it modifies the expected value of observing using a machine-learned function that accounts for observe actions being unreliable and returning multiple actions, neither of which is possible in my version of the game.

While there are no theoretical guarantees that the strategies produced by Algorithm 2 will converge, the algorithm's similarity to policy iteration [35] led me to suspect that they would converge. Also, since CSLA is greedy, i.e., it selects the best-response strategy at each step of the iteration, I was interested in seeing whether the strategy it found represented a local maximum or a global one. I designed a simple experiment to see how these issues would play out when generating sself and sEVC: I modified the program to use a randomly-generated distribution for the initial value of Obs, rather than always initially setting Obs = Inv as is done in Algorithm 2, and I used this modified program to generate 100 alternate versions of sself and sEVC. I then compared these alternates to the original sself and sEVC using stratDiff.

In the case of sself, I found that all 100 alternate versions were identical to the original. In the case of sEVC, I found that 58 alternate versions were identical to the original, and the rest exhibited a stratDiff of no more than 1.08 × 10^-4. This means that an agent using an alternate version of sEVC would choose all the same actions as one using the original sEVC at least 99.989% of the time. This tells us that not only does CSLA converge for the environment I am testing it in, it converges to the same strategy each time it is run. This suggests that the algorithm is finding a globally-best solution for this environment, rather than getting stuck in a local maximum.

Finally, to estimate how different sself and sEVC are, I ran stratDiff(sself, sEVC) and found it to be 0.27. This means that training a strategy against an external, fixed strategy in Algorithm 2 does produce significantly different results than training a strategy against itself. For a more in-depth look at where sself and sEVC differ, see Section 5.3.2.

Table 5.3: Win percentages of sself and sEVC when playing against EVChooser over 10,000 games as both Defender and Invader.

          Defending vs. EVChooser    Invading vs. EVChooser
  sself   70.65%                     70.16%
  sEVC    69.92%                     69.92%

Pairwise Competitions: sself vs. EVChooser and sEVC vs. EVChooser

I played both of the generated strategies, sself and sEVC, against EVChooser for 20,000 games: in 10,000 games, the generated strategy was defending against an invading population of EVChooser agents, and in 10,000 games the roles were reversed, with the generated strategy invading and EVChooser defending. I recorded the population of each strategy on every round, as well as the winner of every game.³ The populations in an individual game were extremely noisy, as seen in Figure 5.3(e); however, by averaging the populations over all 10,000 games we can see some trends emerge. These average populations for each strategy in all four match-ups are presented in Figure 5.3(a-d), while the win rates for each match-up are presented in Table 5.3.

In Figure 5.3 we see that, on average, the strategies generated by Algorithm 2 control roughly 57% of the population for the majority of the game in all four match-ups. Interestingly, both sself and sEVC are able to reach this point in roughly the same amount of time whether they are invading or defending.
It is also worth noting 3Recall that the winner of a Cultaptation game is the strategy with the highest average population over the last quarter of the game. 68 a) EVChooser invading sself b) sself invading EVChooser c) EVChooser invading sEVC d) sEVC invading EVChooser e) Population of sself at each round, in a single game against EVChooser Figure 5.3: Average populations of both strategies for each round, in match-ups between sself and EVChooser (parts a and b) and between sEVC and EVChooser (parts c and d), over 10,000 games. From round 2000 onwards, sself or sEVC control 57% of the population on average, regardless of whether EVChooser was invading or defending. Since mutation is enabled from round 100 onwards, populations in an individual game (exhibited in part e) are highly mercurial and do not converge. Therefore, we must run a large number of trials and average the results to get a good idea of each strategy?s expected performance. 69 Table 5.4: Percentage of games won (out of 10,000) by sself , sEVC, and EVChooser in a melee contest between all three. sself sEVC EVChooser Melee win percentage 38.78% 37.38% 23.84% that, even though I showed above that sself and sEVC have signi cant di erences, they performed almost identically against EVChooser in terms of population and win percentages Melee Competition: sself vs. sEVC vs. EVChooser My next experiment was to run sself , sEVC, and EVChooser against one another in a melee contest to see how the three strategies would interact in an environment where none of them originally had the upper hand. All three strategies had an initial population of 33 agents at the start of each game. I used the same Inv, probability of change, and probability of death as in Experiment 2. Mutation was disabled for the nal 2,500 rounds of each melee game, as was done in the Cultaptation tournament to allow the population to settle. I ran 10,000 games in this manner, and percentage of wins for each strategy are shown in Table 5.4. In the table we can see that sself has a slight edge over sEVC, and both these strategies have a signi cant advantage over EVChooser. In fact, I observed that in the rst 100 rounds of most games (before mutation begins) EVChooser nearly died out completely, although it is able to gain a foothold once mutation commences. Mutation is also turned o after 7500 rounds in Cultaptation melee games; this caused the population to quickly become dominated by one of the three strategies in all 10,000 games played. 70 Performance Analysis of sself , sEVC, and EVChooser In the experiments in Section 5.3.2, we saw that the strategies found by CSLA consistently outperform EVChooser in environments similar to the ones they were trained in. In order to get a better idea of why this happens, I ran two experiments to compare the performance of sself and EVChooser in more detail. The rst was designed to show the kinds of situations in which the two strategies chose di erent actions, while the second was designed to show how well the two strategies were able to spread good actions through their population. Action Preferences The objective of this experiment was to identify the kinds of situations in which sself , sEVC, and EVChooser made di erent choices. To this end, I allowed sself to play against itself for ve games, in an environment identical to the one used for the previous experiments in Section 5.3.2 (note that this is the same environment sself was trained in). 
On each round, for each agent, I recorded the number of rounds the agent had lived, the value of the best action in its repertoire,⁴ and whether the agent chose to innovate, exploit, or observe on that round. Since there are 100 agents alive at any given time and each game lasts 10,000 rounds, this yielded a total of five million samples. Figures 5.4(a), (d), and (g) show the observed probability that sself would innovate, observe, or exploit (respectively) for its first ten rounds and for each possible best action value. I then repeated this process for sEVC and EVChooser, allowing each strategy to play against itself for five games and recording the same data. The results for sEVC and EVChooser may be found in Figures 5.4(b), (e), and (h), and Figures 5.4(c), (f), and (i), respectively.

⁴This could be 20, 40, 80, 160, or None if the agent had not yet discovered an action.

Figure 5.4: The observed probability that sself, sEVC, and EVChooser will innovate, observe, or exploit when they are a given number of rounds old (on the x-axis) and with a given value of the best action in the agent's repertoire. Panels: (a) sself innovates; (b) sEVC innovates; (c) EVChooser innovates; (d) sself observes; (e) sEVC observes; (f) EVChooser observes; (g) sself exploits; (h) sEVC exploits; (i) EVChooser exploits. These results were observed by allowing each strategy to play itself for five games of 10,000 rounds each with 100 agents alive on each round, generating a total of 5,000,000 samples. All graphs in this figure share the same legend, which is included in graph (c) and omitted elsewhere to save space.

The most obvious difference among the three strategies is that EVChooser almost never innovates,⁵ a property it shares with the strategies that did well in the Cultaptation tournament [17]. On the other hand, sself and sEVC have conditions under which they innovate and conditions under which they do not. For instance, both sself and sEVC always innovate if their first action (which is always an observation) returns no action. Also, sEVC frequently innovates if it is stuck with the worst action after several observes, and sself also innovates (although less frequently; see next paragraph) in this case. Another sharp contrast between EVChooser and the generated strategies is in their exploitation actions. EVChooser spends nearly all of its time exploiting, even if it has a low-value action, and only observes with significant probability on round two. On the other hand, sself and sEVC will begin exploiting immediately if they have one of the two best actions, but otherwise will spend several rounds observing or innovating to attempt to find a better one, and the number of rounds they spend searching for a better action increases as the quality of their best known action decreases.

⁵EVChooser innovates 1% of the time on its first round.

The main difference between sself and sEVC that can be seen in Figure 5.4 is in the way they handle being stuck with the lowest-value action after several rounds. In these circumstances, sself prefers observation while sEVC prefers innovation. Here we see the most obvious impact of the differing environments used to generate these two strategies. sself prefers observation in these cases because it was trained in an environment where all agents are willing to perform innovation.
Therefore, if an sself agent is stuck with a bad action for more than a few rounds it will continue to observe other agents, since if a better action exists, it is likely that it has already been innovated by another agent and is spreading through the population. On the other hand, sEVC prefers innovation in these situations because it has been trained with EVChooser occupying a signi cant portion of the population, and we have seen that EVChooser almost never innovates. Therefore, if sEVC is stuck with a bad action after several rounds, it will attempt to innovate to nd a better one, since it is less likely that another agent has already done so. Spreading High-value Actions The objective of this experiment was to measure the rate at which sself , sEVC, and EVChooser were able to spread high-valued actions through their populations. To measure this, I again played sself against itself in the same environment used in the previous experiment (which I will refer to as the \normal" environment in this section), and on each round I recorded the number of agents exploiting actions with each of the four possible values (20, 40, 80, and 160). To account for the noise introduced by changing action values, I ran 10,000 games and averaged the results for each round. I then repeated this process, playing sEVC and EVChooser against themselves. The results for sself , sEVC, and EVChooser may be found in Figures 5.5(a), (c), and (e) respectively. This experiment lets us see what the steady state for these strategies looks like, and how quickly they are able to reach it. However, I am also interested in seeing how they respond to structural shocks [21, 22] (i.e., how quickly the strategies 74 a) sself in normal environment b) sself in shock enviornment c) sEVC in normal environment d) sEVC in shock enviornment e) EVChooser in normal environment f) EVChooser in shock enviornment Figure 5.5: The average number of agents exploiting an action with value U in two environments. The \normal" environment in parts a, c, and e shows how quickly sself , sEVC, and EVChooser spread actions through their population under normal circumstances when they control the entire population. The \shock" environment in parts b, d, and f shows how quickly each strategy responds to periodic structural shock. The \normal" environment is the same as in the rest of Section 5.3.2, and the \shock" environment is similar except that actions with value 160 are forced to change every 100th round and held constant all other rounds. Each data point is an average over 10,000 games. 75 are able to recover when a good, widely-used action changes values). To this end, I created a \shock" environment, which is identical to the normal environment with one modi cation: actions with value 160 have a probability of change equal to 0 ex- cept on rounds divisible by 100, in which case they have probability of change equal to 1. All other actions use the normal probability of change for this environment, 0.01. This modi cation creates a shock every 100 rounds, while still keeping the expected number of changes the same for all actions. I then repeated the experi- ment above with the shock environment, running 10,000 games for sself , sEVC, and EVChooser and averaging the results, which are presented in Figures 5.5(b), (d), and (f) respectively. In Figure 5.5 we can see that sself and sEVC exhibit nearly identical performance in both the normal and shock environments. 
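For concreteness, the shock schedule just described can be sketched as follows. This is a minimal Python illustration rather than code from the simulator; the function name and arguments are hypothetical, while the constants (value 160, period 100, baseline probability 0.01) are the ones stated above.

    def change_probability(action_value, round_number, shock_environment=True):
        # In the "shock" environment, actions currently worth 160 never change
        # except on rounds divisible by 100, when they are forced to change;
        # every other action keeps the normal probability of change, 0.01.
        if shock_environment and action_value == 160:
            return 1.0 if round_number % 100 == 0 else 0.0
        return 0.01

Averaged over time, a value-160 action therefore still changes once per 100 rounds in expectation, which keeps the expected number of changes the same as in the normal environment.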
In the normal environment, they are able to reach their steady state in only a few rounds, and the steady state consists of a roughly equal number of agents exploiting the best and second-best action. In the shock environment, we see that sself and sEVC respond to external shock by drastically increasing the number of agents exploiting the second-best action over the course of a few rounds, and returning to their steady states at a roughly linear rate over the next 100 rounds. The number of sself and sEVC agents exploiting the two worst actions remains extremely low except for small spikes immediately after each shock. Compared to the generated strategies, EVChooser?s performance appears to be less stable, and less robust to structural shock. In the normal environment, we see that EVChooser takes hundreds of rounds to reach its steady state. While 76 EVChooser?s steady state does include more agents exploiting the best action than sself and sEVC, it also includes a signi cant number of agents exploiting the two worst actions. In the shock environment, we see that changes to the best action result in signi cant increases to the number of EVChooser agents exploiting the other actions, including the two worst ones. We can also see that populations of EVChooser agents take a lot longer to return to normal after an external shock than populations of sself and sEVC. These results help us account for the superior performance of sself and sEVC over EVChooser in previous experiments, and indicate that there is plenty of room for improvement in EVChooser and strategies like it. 77 Chapter 6 Improving Full-scale Cultaptation Strategies So far, the objective of this work has been to analyze Cultaptation and see what types of strategies work well. However, in order to nd strategies that were provably good using the algorithms I was able to develop, it was necessary to use game environments that were much smaller than the ones used in the real Cultap- tation tournament. In this chapter, I will show how to take the observations and analysis used in the previous chapters and use them to develop a strategy for full-size Cultaptation that can outperform the current best strategy from the Cultaptation tournament. 6.1 The Tournament Winner: discountmachine The rst Cultaptaton tournament was won by the strategy discountmachine, by Dan Cownden and Tim Lillicrap of Queen?s University [15]. According to the tournament organizers, discountmachine performed signi cantly better than all other contestants; its population was about 50% higher, on average, than the second-place strategy [17]. Pseudocode for discountmachine is provided as Algorithm 3. It can be sum- marized as follows: On the rst round, it always plays Observe. If it?s the second round and the rst Observe returned no action, it will always play Innovate as a 78 bootstrapping measure. Otherwise, it begins by estimating the probability of change (c^), the average value of the payo distribution ( ^ ), the average value of the observe distribution ( ^Obs), and the current value of its best-known action (v^ ). The values v^ and ^Obs are then multiplied by a discount factor meant to re ect the value of exploiting the actions until either the action changes or the agent dies. 
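As a rough illustration of this estimation-and-discounting step (spelled out in Algorithm 3 below), the computation can be sketched in Python as follows. The function and variable names are mine, not those of the tournament code, and the 0.02 term is the per-round probability of death used in discountmachine's discount factor.

    def discounted_estimates(best_value, rounds_since_seen, c_hat, mu_hat, mu_obs_hat):
        # Expected current value of the best-known action: it has either kept its
        # value or, with the complementary probability, changed to a fresh draw
        # whose expected value is mu_hat.
        stay = (1 - c_hat) ** rounds_since_seen
        v_hat = stay * best_value + (1 - stay) * mu_hat

        # Discount factor: the action survives a round with probability (1 - c_hat)
        # and the agent survives with probability (1 - 0.02).
        f = (1 - c_hat) * (1 - 0.02)

        # Geometric sums: total payoff of exploiting the best-known action from this
        # round on, and of exploiting a newly observed action starting next round.
        v_total = v_hat * 1 / (1 - f)
        obs_total = mu_obs_hat * f / (1 - f)
        return v_total, obs_total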
In order to prevent the agent from missing out on high-value actions in environments with a low probability of change, the strategy includes a heuristic that will cause it to Observe at least once every 20 rounds if ĉ < 0.05 and its current action is not significantly better than any it has seen before. Otherwise, it decides which action to perform based on the output of a pre-trained neural network, which takes as input discounted versions of v̂ and μ̂_Obs, along with an estimate of the degree of noise present in Observation.

[Footnote 1: The strategy also defines a rare subset of values for some parameters for which the neural network performed poorly; in these cases the strategy simply returns Observe if v̂ < μ̂_Obs and Exploits the best-known action otherwise. For ease of exposition I have omitted this from the pseudocode.]

While the source code for discountmachine used in the tournament has been made public on the organizers' website [15], that code includes a neural network with pre-trained weights, and the code used to train the neural network was not made public. However, a comment in the source code explains, "We trained this function by having it try and match the estimate made by a (strategy) with perfect knowledge of what could be observed and (the amount of noise present for Observe moves)."

Algorithm 3  Simplified pseudocode for the discountmachine strategy, which won the first Cultaptation tournament. Returns the action selected by an agent, given the agent's current history h. Full source code is available from the Cultaptation website [15].

discountmachine(History h)
 1: if |h| == 0 then
 2:   return Observe.
 3: end if
 4: if |h| == 1 AND no actions are known then
 5:   return Innovate.
 6: end if
 7: Let ĉ and μ̂ be the probability of change and the average new action value, respectively, experienced in history h.
 8: Let v be the value of the best known action, and r be the number of rounds since that action was seen in h.
 9: v̂ = (1 − ĉ)^r · v + (1 − (1 − ĉ)^r) · μ̂
10: Let μ̂_Obs be the mean value of all observed actions in h.
11: Let f = (1 − ĉ)(1 − 0.02). This is the "discount factor."
12: v̂ = v̂ · 1/(1 − f)
13: μ̂_Obs = μ̂_Obs · f/(1 − f)
14: if ĉ < 0.05 AND |h| > 20 AND the last 20 actions have been Exploit then
15:   if v̂ < 3 + PRU(h) then
16:     return Observe.
17:   else
18:     return Exploit the best-known action.
19:   end if
20: end if
21: return the action selected by a pre-trained neural network with inputs v̂, μ̂_Obs, and an estimate of the amount of noise present in Observation.

6.1.1 Potential Problems with discountmachine

It is clear from the results of the Cultaptation tournament that discountmachine is a significantly better strategy than the other contestants. However, the analysis and experiments presented in Chapters 4 and 5 suggest that it has some aspects that could still be improved.

First, discountmachine only ever innovates as a bootstrapping mechanism on the second round of the game. While the tournament organizers found that this was a universal characteristic among the top-performing strategies in the tournament, Sections 2.2.1 and 5.3.2 present theoretical and empirical evidence that, in at least some circumstances, Innovating can be advantageous. Therefore, a strategy that could recognize these situations would have an advantage over discountmachine.

Second, the formulae used to obtain the total value of exploiting the best-known action (line 12) and the expected result of an Observe (line 13) are incomplete.
These formulae compute the expected total payo the agent will earn by exploiting each of these actions until either the action changes value or the agent dies. As we have seen in Example 5 and Theorem 1, the expected total payo is not always the same as the EPRU, and EPRU is the metric that should be used to determine the quality of a strategy (or, in this case, a potential course of action). Furthermore, the formula on line 13 neglects the fact that the information returned by an Observe pertains to the round before the Observe was performed. This means that, if the newly-observed action is Exploited on the following round, it has had two opportunities to change values, not one, and the expected value of the action should be updated to re ect this. Since the source code used to train the neural net in discountmachine is not provided, it?s di cult to characterize the function approximated by the network. However, given the issues with the formulae used to compute the inputs to the network, it seems unlikely that it is perfectly accurate in determining whether Ob- serving or Exploiting is a better choice. Finally, the experimental results from Section 5.3.2 suggest that the choice of whether to exploit a given action should be determined by the action?s value and 81 the agent?s age, and that the preference towards exploiting should increase mono- tonically as the agent gets older and as the value of the best known action increases. However, the heuristic used on lines 14 through 19 causes discountmachine to con- sistently violate this rule once every 20 rounds in environments with low probability of change. While this does cause the agent to have a better idea of what actions exist to be observed, I have been unable to nd any analytical support for the notion that this will typically increase the agent?s EPRU by more than the amount that is lost by not exploiting as much as possible when the agent already has a high-value action. 6.2 A Full-scale Cultaptation Player: relaxedlookahead In order to show how the analysis and experiments presented in Chapters 4 and 5 can be used to create a player for the Cultaptation tournament, I developed relaxedlookahead. The main insights that inspired relaxedlookahead ?s design were: 1. Actions should be evaluated based upon the estimated amount of EPRU they provide the agent, since EPRU is proportional to reproductive success (The- orem 1). While this may seem intuitive, I know of no current Cultaptation player that uses EPRU or an equivalent metric (e.g., discountmachine esti- mates expected total payo , see above). 2. Innovating should be regarded as a worthwhile choice in at least some cir- cumstances. Even though the top-performing strategies from the tournament tended to eschew Innovate moves, we have seen that there are some circum- 82 stances in which it is preferable. With an accurate formula for estimating the value of innovation, the agent should be able to identify such circumstances. 3. The agent should choose its move based upon its entire history; in other words, hard and fast rules such as the one used by discountmachine on lines 14 through 19 should be avoided. This will allow the agent to have behavior that smoothly transitions according to the value of its best action and age, as sself ?s behavior does in Figure 5.3.2, which gave it a signi cant performance advantage over EVChooser. 4. 
Due to computational constraints present in the Cultaptation tournament, it will not be possible to perform a full lookahead search to a depth su cient to achieve a small . However, a relaxed lookahead search, which considers a few di erent possible outcomes of each action and their e ects over the course of many rounds, should still be su cient to allow the agent to estimate which of three choices (Exploit the best known action, Observe, or Innovate) has the highest EPRU. The relaxedlookahead strategy is de ned in Algorithm 4, and can be summa- rized as follows. Like discountmachine, it always observes on its rst round and, if that observe fails to return a value, it innovates as a bootstrapping measure. Oth- erwise, the strategy begins by creating estimates of several quantities based on the information contained in h; it estimates the probability of change (c^), average value of the payo distribution ( ^ ), current value of the best known action (v^ ), and the mean and standard deviation of the distribution of observe moves ( ^Obs, ^Obs). It 83 then estimates the value of three possible moves: exploiting the best known action, observing, and innovating. These estimates are created using a relaxed lookahead search, which is a very rough approximation of the search conducted in Chapter 5. The relaxed lookahead search is conducted slightly di erently for each of the three possible moves. For the observe move, the strategy models the observe distri- bution Obs as a normal distribution2 with mean ^Obs and standard deviation ^Obs, and calculates the probability that the observed action will have a higher payo than v^ . The search then considers six di erent branches. In the rst branch, the observed action has a payo lower than the current best action, so the agent?s best known action is still v^ . The other ve branches assume the observed action is better than the current best action; candidate payo s are generated by using the inverse CDF of Obs to nd ve values appropriately distributed among the set of values greater than v^ . For example, if v^ was found to be in the 80th percentile of Obs, the ve candidate values would be in the 82nd, 86th, 90th, 94th, and 98th percentiles respectively. For each branch, the strategy assumes the agent will exploit its new best known action for 100 rounds, and it estimates the EPRU of the branch using formulae based on those from Chapter 4. Finally, the EPRU of the observe move is computed by combining the EPRU of each branch with its estimated likelihood. 2The normal distribution was selected as the most likely shape of Obs using a process similar to the one in CSLA; the rst version of relaxedlookahead did not as- sume any shape to Obs and only considered its average value. This version was then played against itself in simulated games, and the shape of the observe distribution from the simulations appeared to be gaussian. I then changed relaxedlookahead to model the observe distribution as a normal distribution, repeated the process, and found that the observe distribution from the simulations was still gaussian. 84 The EPRU of innovating is estimated in a similar manner, but the strategy models the payo distribution as an exponential distribution3 rather than a normal one. Finally, the EPRU of exploiting the current best action is estimated by simply assuming that the agent exploits on the current round and the next 100 rounds. 
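The per-branch estimate just described can be sketched as a short Python function. It mirrors estimateEPRU in Algorithm 4 below, with a 0.98 per-round survival factor (i.e. a death probability of 0.02) and a 100-round horizon; the names are illustrative rather than taken from the implementation.

    def estimate_epru(best_value, mean_new_value, change_prob, age,
                      horizon=100, survival_prob=0.98):
        # Expected per-round utility of exploiting the current best-known action
        # for the next `horizon` rounds.  On lookahead round i the action has kept
        # its value with probability (1 - change_prob)**i; otherwise its expected
        # value has reverted to mean_new_value.  The running payoff total divided
        # by the agent's age gives per-round utility, weighted by the probability
        # of surviving that long.
        total_payoff = 0.0
        epru = 0.0
        for i in range(horizon + 1):
            kept = (1 - change_prob) ** i
            total_payoff += kept * best_value + (1 - kept) * mean_new_value
            epru += survival_prob ** i * total_payoff / (age + i)
        return epru

relaxedlookahead applies the same routine to each candidate payoff produced by the observe and innovate branches, then weights the branches by their estimated likelihoods.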
Once the EPRUs for observing, innovating, and exploiting have all been estimated, relaxedlookahead simply makes the move with the highest estimate.

6.3 Experiments: RLA vs. discountmachine

In order to test the hypothesis that relaxedlookahead would have an advantage over discountmachine, I conducted several experiments to compare their performance.

6.3.1 Overall Performance

The first experiment examines the overall performance of both strategies in a variety of environments. I created two different versions of the action distribution π. The first, π_Geo, is a geometric distribution with probability of success p = 0.1. The second, π_Gam, is a Gamma distribution with shape k = 2 and scale θ = 5. Both of these distributions have a mean value of 10 and conform to the tournament organizers' specification of generating a large number of small-value actions and a few high-value actions [15]. However, π_Geo largely matches the assumptions relaxedlookahead makes about the shape of π, while π_Gam does not: it features a large "hump" around the low-value actions, with the most likely value being 5, and a lower probability of high-value actions than π_Geo.

[Footnote 3, attached to the choice of the exponential distribution in Section 6.2: The exponential distribution was selected due to the tournament organizers' statement that the move distribution would typically contain large numbers of low-value actions and small numbers of high-value actions [15]. The exponential distribution is generally representative of distributions that have this shape, and can easily be constructed since its shape depends only on its mean.]

Algorithm 4  Simplified pseudocode for the relaxedlookahead strategy. Returns the action selected by an agent, given the agent's current history h.

relaxedlookahead(History h)
 1: if |h| == 0 then
 2:   return Observe.
 3: end if
 4: if |h| == 1 AND no actions are known then
 5:   return Innovate.
 6: end if
 7: Let ĉ and μ̂ be the probability of change and the average new action value, respectively, experienced in history h.
 8: Let v be the value of the best known action, and r be the number of rounds since that action was seen in h.
 9: v̂ = (1 − ĉ)^r · v + (1 − (1 − ĉ)^r) · μ̂
10: Let μ̂_Obs and σ̂_Obs be the mean value and standard deviation, respectively, of all observed actions in h.
11: U_O = evaluateObs(μ̂_Obs, σ̂_Obs, v̂, μ̂, ĉ, |h|)
12: U_I = evaluateInv(μ̂, v̂, μ̂, ĉ, |h|)
13: U_X = estimateEPRU(v̂, μ̂, ĉ, |h|)
14: if U_X ≥ max(U_O, U_I) then
15:   return Exploit the best known action.
16: else if U_O ≥ U_I then
17:   return Observe.
18: else
19:   return Innovate.
20: end if

estimateEPRU(v, v̄, c, r)
 1: U = 0
 2: EPRU = 0
 3: for i = 0 → 100 do
 4:   U = U + (1 − c)^i · v + (1 − (1 − c)^i) · v̄
 5:   EPRU = EPRU + 0.98^i · U/(r + i)
 6: end for
 7: return EPRU

Algorithm 5  Pseudocode for the functions used to evaluate Observation and Innovation by relaxedlookahead.

evaluateObs(μ, σ, v, v̄, c, r)
 1: Let π_Obs be a normal distribution with mean μ and standard deviation σ.
 2: Let p_hi be the probability that a value drawn from π_Obs is greater than v.
 3: U_hi = 0
 4: for i = 0 → 4 do
 5:   p = 1 − p_hi · (1 + 2i)/10
 6:   Let v_i be the value of the inverse CDF of π_Obs at percentile p.
 7:   v_i = (1 − c)^2 · v_i + (1 − (1 − c)^2) · v̄
 8:   U_hi = U_hi + estimateEPRU(v_i, v̄, c, r + 1)
 9: end for
10: U_hi = U_hi / 5
11: U_low = estimateEPRU(v, v̄, c, r + 1)
12: return p_hi · U_hi + (1 − p_hi) · U_low

evaluateInv(μ, v, v̄, c, r)
 1: Let π_Inv be an exponential distribution with mean μ.
 2: Let p_hi be the probability that a value drawn from π_Inv is greater than v.
 3: U_hi = 0
 4: for i = 0 → 4 do
 5:   p = 1 − p_hi · (1 + 2i)/10
 6:   Let v_i be the value of the inverse CDF of π_Inv at percentile p.
 7:   v_i = (1 − c) · v_i + c · v̄
 8:   U_hi = U_hi + estimateEPRU(v_i, v̄, c, r + 1)
 9: end for
10: U_hi = U_hi / 5
11: U_low = estimateEPRU(v, v̄, c, r + 1)
12: return p_hi · U_hi + (1 − p_hi) · U_low

I simulated 200 pairwise Cultaptation games between relaxedlookahead and discountmachine (100 with relaxedlookahead invading and 100 with discountmachine invading) in each of 14 different environments. The first seven environments used π_Geo as the action distribution, and had probability of change c equal to 0.001, 0.005, 0.01, 0.05, 0.1, 0.2, and 0.4, respectively. The other seven environments used π_Gam as the action distribution. This set of values for c is the same set used by the Cultaptation tournament organizers for their first round of evaluations [15].

Table 6.1: Performance of relaxedlookahead when played against discountmachine in a variety of environments. Results are based on 200 simulated games for each environment (100 with relaxedlookahead invading, 100 with discountmachine invading). Results in bold or italics are statistically significant with α < 0.01; bold results represent an advantage for relaxedlookahead, italicized results represent an advantage for discountmachine.

                   Action distribution π_Geo          Action distribution π_Gam
                   Win percentage  Average score      Win percentage  Average score
    c = 0.001      .96             82.02              .97             83.96
    c = 0.005      .78             66.29              .84             70.86
    c = 0.01       .65             58.20              .72             63.77
    c = 0.05       .48             47.70              .55             52.21
    c = 0.1        .53             52.97              .59             55.24
    c = 0.2        .44             47.32              .41             47.38
    c = 0.4        .55             53.95              .42             45.04

The results of this experiment are presented in Table 6.1. relaxedlookahead has a statistically significant advantage (α < 0.01) over discountmachine in seven environments: c = 0.001, 0.005, and 0.01 for both action distributions, and c = 0.1 for π_Gam. It has a statistically significant disadvantage in one environment, c = 0.2 with distribution π_Gam. These results suggest that relaxedlookahead has a significant advantage over discountmachine in environments with a low probability of change, and that its advantage shrinks and eventually disappears as the probability of change grows to moderate and high levels. Curiously, the shape of π did not seem to make much difference in relaxedlookahead's performance; in fact, it performed slightly better in the environment that did not match its assumptions. We will investigate the reasons for this in the following section.

It should be noted that, in general, it will be much more difficult to obtain a significant advantage in environments with high probabilities of change, since as long as both players are exploiting as much as possible, their choice of action makes little difference. The tournament organizers have acknowledged this and, appropriately, put significantly more emphasis on low-change environments in all rounds of their evaluation. Therefore, since relaxedlookahead has a clear advantage over discountmachine in such environments, it seems reasonable to conclude that relaxedlookahead would be more likely to win in a tournament setting.

6.3.2 Performance Analysis

The next round of experiments was designed to duplicate those presented in Section 5.3.2, in order to get a more detailed look at the differences between relaxedlookahead and discountmachine.

Action Preferences

The objective of this experiment, like the first one in Section 5.3.2, was to get an idea of the kinds of situations in which relaxedlookahead and discountmachine made different choices.
To that end, I allowed each strategy to compete against itself in 50 full-length games using four separate environments: c = 0:001 and 0:05 with action distribution Geo, and c = 0:001 and 0:05 with action distribu- tion Gam.4 On each round, I recorded the actions taken by all 100 agents and the payo of their best-known action, yielding a total of 50 106 data points for each 4The values c = 0:001 and c = 0:05 were chosen because the rst re ects the largest advantage for relaxedlookahead in the previous experiment, while the second represents the lowest value at which both strategies appear to be evenly matched. 89 strategy in each environment. Figures 6.1 through 6.4 present the results of this experiment, showing the probability that each strategy would innovate, observe, or exploit given its age and the value of its best known action. There are two obvious di erences between the strategies in all four environments. First, discountmachine, like EVChooser, never innovates except as a bootstrap- ping mechanism on the second round of the game. Meanwhile, relaxedlookahead is willing to innovate on any round, under the right circumstances. In fact, if its best known action is between 5 and 15, relaxedlookahead appears to prefer innovating to observing. The fact that it does so in environments in which it handily defeats discountmachine is evidence that innovating should not be completely overlooked, as the top-performing strategies in the tournament tended to do. Second, the observe and exploit graphs for discountmachine exhibit large spikes at regular intervals. These are due to the clause on line 14 of Algorithm 3, which causes discountmachine to observe at least once every 20 rounds unless the probability of change is high. The authors of the strategy explained that this is done to make sure that an agent doesn?t miss a high-value action that most other agents are exploiting. Meanwhile, the same graphs for relaxedlookahead also exhibit spikes, but their size and the interval between them is correlated with the probability of change in the environment and the value of the best-known action. There is no code in relaxedlookahead that explicitly causes this behavior, so I can only conclude that this is emergent behavior and that certain rounds are more likely than others to be opportune times for exploring rather than exploiting. Whatever the cause, this is 90 evidence that arbitrary rules like the one on line 14 of Algorithm 3 are not necessary if the values of observing and exploiting can be estimated accurately. Spreading High-value Actions The objective of the next experiment, like the second one in Section 5.3.2, was to measure the rate at which relaxedlookahead and discountmachine were able to spread high-value actions through their populations. To measure this, I re-ran the simulations used for the previous experiment,5 but this time recorded the value of each action exploited on each round of the game. Since there are 100 agents in each game, this yielded 5,000 data points for each round. Figures 6.5 through 6.8 present the results of this experiment. They show the probability of each strategy exploiting an action in each of six possible value ranges: 0 to 5, 5 to 15, 15 to 25, 25 to 30, 30 to 40, and more than 40.6 These gures make it clear why relaxedlookahead has an advantage over dis- countmachine, particularly in environments with low probability of change. 
Over the rst few generations of agents, discountmachine does an excellent job of quickly spreading actions in the best value category that can reliably be found (e.g., val- ues over 40 in Figure 6.5, values between 15 and 25 in Figure 6.6). However, in all four environments, the proportion of agents exploiting such actions peaks at about round 200 and then begins to steadily decline. Meanwhile, the proportion of relaxedlookahead agents exploiting the same actions increases more slowly, but eventually reaches a steady state with more agents exploiting high-value actions 5Each strategy played against itself in 50 full-length Cultaptation games in each of four environments, with c = 0:001 or 0:05 and geo or gam as the payo distribution. 6In both geo and gam, the mean action value is 10. 91 a) relaxedlookahead Innovates b) discountmachine Innovates c) relaxedlookahead Observes d) discountmachine Observes e) relaxedlookahead Exploits f) discountmachine Exploits Figure 6.1: Frequency that an agent using relaxedlookahead (left) or discountma- chine (right) chose to Innovate, Observe, or Exploit, given its age and best known action, in an environment with the geometrically-distributed action distribution Geo (with mean 10) and probability of change c = 0:001. 92 a) relaxedlookahead Innovates b) discountmachine Innovates c) relaxedlookahead Observes d) discountmachine Observes e) relaxedlookahead Exploits f) discountmachine Exploits Figure 6.2: Frequency that an agent using relaxedlookahead (left) or discountma- chine (right) chose to Innovate, Observe, or Exploit, given its age and best known action, in an environment with the geometrically-distributed action distribution Geo (with mean 10) and probability of change c = 0:05. 93 a) relaxedlookahead Innovates b) discountmachine Innovates c) relaxedlookahead Observes d) discountmachine Observes e) relaxedlookahead Exploits f) discountmachine Exploits Figure 6.3: Frequency that an agent using relaxedlookahead (left) or discountma- chine (right) chose to Innovate, Observe, or Exploit, given its age and best known action, in an environment with the action distribution Gam (i.e. a gamma distribu- tion with shape 2 and scale 5) and probability of change c = 0:001. 94 a) relaxedlookahead Innovates b) discountmachine Innovates c) relaxedlookahead Observes d) discountmachine Observes e) relaxedlookahead Exploits f) discountmachine Exploits Figure 6.4: Frequency that an agent using relaxedlookahead (left) or discountma- chine (right) chose to Innovate, Observe, or Exploit, given its age and best known action, in an environment with the action distribution Gam (i.e. a gamma distribu- tion with shape 2 and scale 5) and probability of change c = 0:001. 95 than discountmachine. This is the consequence of only innovating as a bootstrapping measure, as discountmachine does. Over time, agents will observe actions that are better than the ones they currently know, and begin exploiting them. In environments with low probability of change, the best action in the environment will take a long time to change, which makes it likely that almost all of the agents will be exploiting this action when it does change. This creates a structural shock (see Section 2.2.1), and leaves strategies like discountmachine with knowledge of only a very small subset of the actions in the environment, and no way to learn more without in- novating. Strategies like relaxedlookahead, however, can recover by innovating and nding other actions that have changed to have higher value. 
This phenomenon appears to be the biggest reason why relaxedlookahead has such a large advantage in environments with low probability of change. These gures also help to explain why, in Table 6.1, relaxedlookahead was more likely to win in many environments using Gam, even though Geo more closely matches its assumptions about the shape of . The reason appears to be fairly sim- ple: relaxedlookahead ?s performance is about the same, while discountmachine ap- pears to do worse in the environments using Gam. Comparing relaxedlookahead ?s performance as the distribution changes (i.e. comparing Figure 6.6a to 6.8a), we can see that it is more likely to exploit actions between 15 and 25, and less likely to exploit actions with higher value than 25. This is to be expected, however, since Gam is characterized by a large hump around the mean and a narrower tail than Geo, so there will be fewer actions with very high values in these environments. 96 Meanwhile, comparing Figure 6.6b to 6.8b, we can see that discountmachine has much more trouble nding actions with above-average payo in the environment using Gam, and after round 500 most of its exploits have values between 5 and 15 (recall that the mean of Gam is 10). This explains why relaxedlookahead was able to win against discountmachine seven percent more often in the environment used in Figure 6.8. 97 a) relaxedlookahead b) discountmachine Figure 6.5: Frequency with which relaxedlookahead(top) and discountma- chine(bottom) exploited an action with payo P , for six di erent ranges of values for P , over the rst 1000 rounds of a game. Results were obtained by allowing each strategy to play a 50 games against itself, in an environment with action distribution Geo and probability of change c = 0:001, and averaging the exploits observed on each round. 98 a) relaxedlookahead b) discountmachine Figure 6.6: Frequency with which relaxedlookahead(top) and discountma- chine(bottom) exploited an action with payo P , for six di erent ranges of values for P , over the rst 1000 rounds of a game. Results were obtained by allowing each strategy to play a 50 games against itself, in an environment with action distribu- tion Geo and probability of change c = 0:01, and averaging the exploits observed on each round. 99 a) relaxedlookahead b) discountmachine Figure 6.7: Frequency with which relaxedlookahead(top) and discountma- chine(bottom) exploited an action with payo P , for six di erent ranges of values for P , over the rst 1000 rounds of a game. Results were obtained by allowing each strategy to play a 50 games against itself, in an environment with action distribution Gam and probability of change c = 0:001, and averaging the exploits observed on each round. 100 a) relaxedlookahead b) discountmachine Figure 6.8: Frequency with which relaxedlookahead(top) and discountma- chine(bottom) exploited an action with payo P , for six di erent ranges of values for P , over the rst 1000 rounds of a game. Results were obtained by allowing each strategy to play a 50 games against itself, in an environment with action distribution Gam and probability of change c = 0:05, and averaging the exploits observed on each round. 101 Chapter 7 Conclusion This dissertation has presented a variety of techniques for analyzing Cultap- taiton, a complex evolutionary game designed to explore the phenomenon of social learning. 
Furthermore, the work has demonstrated how to nd strategies that are provably good, and how to analyze such strategies to draw conclusions about how good social learning strategies will operate. Thus, this work has advanced the state of the art in two ways. First, it provides theory and analysis that support many of the empirical results found by the rst Cultaptation tournament, and identi es some aspects of the tournament results that are likely due to experimental error (e.g., the lack of individual learning in any of the top-performing strategies). This helps strengthen our understanding of social learning in general, and should help inform future studies of this phenomenon. Second, it demonstrates techniques that can be used to analyze evolutionary games that are signi cantly more complex than classical evolutionary games, i.e., games that have a nite number of agents, rather than an in nitely large, well-mixed population; games that last a nite, rather than in nite, number of generations; and games that allow agents to live for multiple generations and condition their actions on accumulated experience, rather than replacing the population every generation 102 and preventing agents from accumulating experience in the rst place. One of the reasons Cultaptation is so complex is that early evolutionary models of social learn- ing made very strong assumptions about social learning mechanics and strategies [16, 7], and the conclusions drawn from studying such models generated a controver- sial challenge to social learning?s role in evolutionary tness that has taken decades to fully address [7, 13, 14, 18]. As researchers in the eld of evolutionary game theory continue to study more complex phenomena like social learning, it is likely that these weaker assumptions will be needed more frequently, and so techniques like the ones presented here will be necessary more often. In summary, this dissertation has provided the following contributions: 1. Analyzing strategies? reproductive success. Given a Cultaptation game G and a set S of available strategies for G, the work presents a formula for approximating (to within any > 0) the expected per-round utility, EPRU(s j G;S), of each strategy in S. The work shows that a strategy with maximal expected per- round utility will have the highest expected frequency in the limit, independent of the initial strategy pro le. These results provide a basis for evaluating highly complex strategies such as the ones described below. Generalizability: These results can be generalized to other evolutionary games in which agents live more than one generation, with a xed probability of death at each generation, and reproduction is done using the replicator dynamic. 2. Computing near-best-response strategies. The work provides a strategy-generation algorithm that, given a Cultaptation game G and a set of avail- able strategies S, can construct a strategy s that is within of the a response to 103 S. Generalizability: The strategy-generation algorithm performs a nite-horizon search, and is generalizable to other evolutionary games in which there is a xed upper bound on per-round utility and a nonzero lower bound on the probability of death at each round. 3. Approximating symmetric Nash equilibria. The work provides CSLA, an iterative self-improvement algorithm that uses the strategy-generation algorithm in Section 5.1 to attempt to nd a strategy sself that is a near-best re- sponse in a Cultaptation game in which the other players are all using sself . 
Hence a strategy pro le composed entirely of instances of sself is a symmetric near-Nash equilibrium. Generalizability: An iterative self-improvement algorithm similar to CSLA should be able to nd a near-Nash equilibrium for any game in which the strategies are complex enough that computing a best (or near-best) response is not feasible by analyzing the strategies directly, but is feasible using information from a simulated game between strategies in the pro le. Games of this type will typically have a high branching factor but relatively simple interactions between agents. 4. State aggregation. To make its algorithms fast enough for practical experimentation, the work provides a state-aggregation technique that speeds them up by an exponential factor without any loss in accuracy. The experimental re- sults in Section 5.3 demonstrate the practical feasibility that this provides: in these experiments, CSLA always converged in just a few iterations. Generalizability: The state-aggregation technique is generalizable to other evo- 104 lutionary games in which the utilities are Markovian. 5. Experimental results. In the experimental studies, the near-Nash equi- libria produced by CSLA in any given game were all virtually identical, regardless of the starting values that were used. That strongly suggests (though it does not prove) that the strategy pro le consisting of copies of sself approximates an optimal Nash equilibrium, and possibly even a unique Nash equilibrium. Consequently, sself ?s characteristics provide insights into the characteristics of good Cultaptation strategies. For example, the experiments show that sself relies primarily on observation and exploitation, but switches quickly to innovation when a structural shock occurs, switching back to observation and exploitation once it has learned how to respond to the shock. This con icts with the conventional wisdom [16, 7] that successful social-learning strategies are characterized by a high frequency of innovation, but it helps to explain both the results of the Cultaptation tournament [17] and some recent experimental results on human subjects [18]. 6. Improvement on the best tournament strategy. While the algo- rithms described above are fast enough for experimentation on smaller variants of the Cultaptation game, they would be intractable for use on the more complex vari- ant used in the Cultaptation tournament. The work shows two approaches that can extend the above results to larger environments such as these. First, it shows how the analysis and experimental results outlined above can be used to identify potential problems in the best strategy from the Cultaptation tournament (i.e. dis- countmachine). Second, it uses the formulae from the analysis to de ne a new strategy, relaxedlookahead, that avoids such weaknesses. Experimental results verify 105 that relaxedlookahead is capable of outperforming discountmachine in a variety of environments similar to those used in the Cultaptation tournament, and provide an in-depth analysis of the factors that allowed relaxedlookahead to perform better. The analysis showed that refusing to innovate except as a bootstrapping mea- sure (as discountmachine and all of the top-performing strategies from the tourna- ment did) makes it much more di cult to recover from structural shocks, especially in environments with low probability of change. Strategies that are willing to in- novate when a structural shock is detected, as relaxedlookahead is, are able to avoid this problem. 
The analysis also showed that heuristics instructing a strategy to explore with some minimum frequency, like the one used by discountmachine, are unnecessary, since relaxedlookahead exhibits emergent behavior in which it ex- plores at intervals dependent upon the parameters of the environment, without being speci cally programmed to do so. One possible avenue for future work would be to identify computationally fea- sible techniques capable of approximating the Nash equilibrium strategy in these larger versions of Cultaptation, preferably with provable bounds on the di erence between the performance of such techniques and the Nash equilibrium strategy. In classical games, a regret minimizing strategy1 is typically not computationally intensive and has been shown to have performance quite close to that of a Nash strategy in several large classes of repeated games [40]. Therefore, it may be fruitful 1in short, a regret minimizing strategy seeks to minimize the expected di erence in payo between the result of its chosen action and the best possible result if it had selected a di erent action. 106 to attempt to extend the concept of regret minimization to evolutionary game the- ory; such an extension would need to account for the conceptual di erence between payo in classical games and tness in evolutionary games, perhaps involving an idea of \lost tness minimization," in which the player compares the tness gained by its chosen action to the highest amount of tness it would have obtained if it had selected a di erent action. 7. Implications. These results provide strong support for the following hypotheses about the best strategies for Cultaptation and similar games: What they are like, and how they can be computed. The best strategies are likely to be conditional ones in which the choice of action at each round is conditioned on the agent?s accumulated experience. Such strategies (or close approximations of them) can be computed by doing a lookahead search that predicts how each possible choice of action at the current round is likely to a ect future performance. How they are likely to behave. It is likely that the best strategies will observe and exploit most of the time, but will have ways of quickly detecting structural shocks, so that they can switch quickly to innovation in order to learn how to respond to such shocks. 107 Appendix A Proofs Proposition 2. EPRU(s j G;S)=d = EPRUalt(s; hi j G;S). Proof. First, we will show by induction that EPRUalt(s; hi j G;S) equals the summa- tion of P (h js ;S) EVexp(jh j; U(h [jh j])) for all histories h . Then we will show that this equals the summation of L(jh j)P (h js ;S) PRU(h ) for all h . We will begin with the de nition of EPRUalt in Equation 4.12, and note that P (hi tjhi; s (hi);S) = P (hi tjs ;S) for histories of length one. This gives us a base case of EPRUalt(s ; hi j G;S) = X t2T P (hi tjs ;S) EVexp(1; U(t)) + X t2T P (hi tjs ;S) EPRUalt(s ; hi t j G;S): For the inductive case, we will again start from Equation 4.12, this time noting that P (h js ;S)P (h tjh ; s (h );S) simpli es to just P (h tjs ;S). Thus, for all h we can rewrite P (h js ;S) EPRUalt(s ; h j G;S) in terms of histories one round longer than h , as follows: P (h js ;S) EPRUalt(s ; h j G;S) = X t2T P (h tjs ;S) EVexp(jh tj; U(t)) + X t2T P (h tjs ;S) EPRUalt(s ; h t j G;S)): 108 Therefore, by induction we have EPRUalt(s ; hi j G;S) = X h 2H P (h js ;S) EVexp(jh j; U(h [jh j])); where H is the set of all possible histories. 
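Restated in explicit notation, with ⟨⟩ denoting the empty history and h*[|h*|] the final entry of h*, this identity is

\[
\mathrm{EPRU}_{\mathrm{alt}}\bigl(s^{*}, \langle\rangle \mid G, \mathcal{S}\bigr)
  \;=\; \sum_{h^{*} \in H} P\bigl(h^{*} \mid s^{*}, \mathcal{S}\bigr)\,
        \mathrm{EV}_{\mathrm{exp}}\bigl(|h^{*}|,\, U(h^{*}[\,|h^{*}|\,])\bigr).
\]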
The proof then proceeds arithmetically: EPRUalt(s ; hi j G;S) = X h 2H P (h js ;S) EVexp(jh j; U(h [jh j])) = X h 2H P (h js ;S) 1X i=jh j L(i)U(h [jh j]) i = X h 2H 1X i=jh j L(i)P (h js ;S)U(h [jh j]) i = 1X i=1 X h 2H( i) L(i)P (h js ;S)U(h [jh j]) i = 1X i=1 X h 2H(i) L(i)P (h js ;S) PRU(h ) = EPRU(s j G;S)=d Where H( i) is the set of all histories of length less than or equal to i, and H(i) is the set of all histories exactly of length i. 2 Proposition 3. Strat(h ; k; V;S) returns (s ; U) such that EPRUkalt(s ; h j G;S) = U = maxs0(EPRU k alt(s 0; h j G;S)): Proof. Let s0 be a strategy maximizing EPRUkalt(s 0; h j G;S) and let fs ; Ug be the strategy and value returned by Strat(h ; k; V;S). We will show by induction on k that EPRUkalt(s ; h j G;S) = U = EPRU k alt(s 0; h j G;S): 109 In the base case, k = 0 and clearly EPRU0alt(s 0; h j G;S) = 0 for any s0, therefore EPRUkalt(s ; h j G;S) = EPRU k alt(s 0; h j G;S) = 0 = U as required. For the inductive case, suppose that for k, Strat(h ; k; V;S) returns fs ; Ug such that EPRUkalt(s ; h j G;S) = U = maxs0(EPRU k alt(s 0; h j G;S)). We must then show that Strat(h ; k + 1; V;S) returns fs ; Ug such that EPRUk+1alt (s ; h j G;S) = U = maxs0(EPRU k+1 alt (s 0; h j G;S)): Let stemp be the strategy constructed in lines 8{19 of the algorithm. First we show that on line 20, EPRUk+1alt (stemp; h j G;S) = (A.1) X t2T P (h tjh ; stemp(h );S) EVexp(jh j; U(t)) + EPRU k alt(stemp; h t j G;S) = Utemp: This follows because the t on line 11 iterates over all possible t 2 T (due to the for loops on lines 6, 9, and 10), meaning that the eventual value of Utemp is X t2T P (h tjh ; stemp(h )jS) (EVexp(jh tj; U(t)) + U 0) : By the inductive hypothesis, U 0 = EPRUkalt(stemp; h t j G;S), su cing to show that (A.1) holds. Now we show that EPRUk+1alt (s ; h j G;S) = EPRU k+1 alt (s 0; h j G;S): Clearly EPRUk+1alt (s ; h j G;S) EPRU k+1 alt (s 0; h j G;S), since s0 is assumed to 110 have maximal EPRUk+1alt for h , so it su ces to show that EPRUk+1alt (s ; h j G;S) EPRU k+1 alt (s 0; h j G;S): Since s maximizes X t2T P (h tjh ; s (h );S)(EVexp(jh tj; U(t)) + U 0); where U 0 EPRUkalt(s 0; h t j G;S) by the inductive hypothesis, there can be no action a such that X t2T P (h tjh ; a;S)(EVexp(jh tj; U(t)) + EPRU k alt(s 0; h t j G;S)) > X t2T P (h tjh ; s 0(h );S)(EVexp(jh tj; U(t)) + U 0): Therefore EPRUk+1alt (s ; h j G;S) EPRU k+1 alt (s 0; h j G;S). This concludes the inductive argument. Thus for all k, EPRUkalt(s ; h j G;S) = U = maxs0 EPRU k alt(s 0; h j G;S). 2 111 Appendix B Converting from Histories to Repertoires In this appendix, we will formally de ne a repertoire, explain how to trans- form histories into repertoires, and show that the number of possible repertoires is substantially smaller than the number of possible histories, while maintaining the property that any best-response action for a given repertoire is also a best resonse for any history associated with that repertoire (Theorem 5). Finally, we will present a modi ed version of Algorithm 1, which uses repertoires rather than histories, and we will show how this simple change cuts the branching factor of the algorithm in half (Algorithm 7). B.1 Repertoire De nition A repertoire tells the last value and age of each action an agent \knows," where an action?s age is the number of rounds that have passed since the agent last obtained information about it. 
Since at any given point in a game, each known action has a unique age, we label exploitation actions by their value and age, leaving o the action number (e.g. if we discovered an action with value 4 last round and an action with value 26 three rounds ago, then the repertoire will be fh4; 1i; h26; 3ig where h4; 1i denotes the existence of an action with value 4 discovered 1 round ago, and h26; 3i denotes the existence of an action with value 26 discovered 3 rounds 112 Algorithm 6 Creates and returns a repertoire for history h . CreateRepertoire(h = (a1; (m1; v1)); : : : ; (ar; (mr; vr))) Let M = fmiji = 1; : : : ; rg. Let R = ; fR will be the repertoire.g for m 2M do Let i = max f1;:::;rg (mi = m) Add hvi; r ii to R. end for return R ago). Formally, a repertoire is de ned to be a set of pairs, where the rst value in each pair represents the knowledge of an action with the given value, while the second value in the pair represents the number of rounds since that knowledge was last updated. De nition. Let v1; : : : ; vm 2 V be action values and 1; : : : ; m 2 Z+ (the positive integers) be action ages. A repertoire R is a set of action value/action age pairs R = fhv1; 1i; : : : ; hvm; mig. We denote the set of all repertoires as Rep, and the set of all repertoires where all i j as Repj. 2 Rep has unbounded size, but Repj has nite size. We show how to create a repertoire R from a history h using the CreateRepertoire function in Algorithm 6. Repertoires change based on the action performed. For example, repertoire R = fh4; 1i; h26; 3ig can change to repertoire R0 = fh4; 2i; h26; 4i; h27; 1ig after an innovation action where an action with value 27 is innovated. Notice that all actions in R0, apart from the newly-innovated action with age 1, are one round older 113 than they were in R. This aging process occurs often enough for us to introduce a function which ages a repertoire R = fhvi; iig: age(fhvi; iig) = fhvi; i + 1ig Finally, we will introduce two functions to represent the two ways our reper- toire can change when we perform an action. The rst, newaction, returns a repertoire with a new action added to it: newaction(R; v) = age(R) [ fhv; 1ig The second, updaction, returns a repertoire with updated information on action m: updaction(R; v;m) = age(R n fhvm; mig) [ fhv; 1ig B.2 Transition Probabilities We can now de ne the probability of transitioning between repertoires on round r. We will call the transition probability functions PRep(R0jR; r; a;S) for a 2 fInv;Obs;Xig. In general, these functions will mirror the P (h0jh; a;S) functions de ned in Section 4.1, with some extra clauses added to ensure that if it is not possible to go from repertoire R to repertoire R0 using the given action, then the transition probability is 0. Innovation actions For innovation actions, the function is: PRep(R0jR; r; Inv;S) = 8 >>< >>: 0 if jRj = M _ 6 9v : R0 = newaction(R; v) (v) if R0 = newaction(R; v) 114 The rst clause ensures that if all possible actions are already in R, or if it is not possible to go from R to R0 in one innovation action, then the transition probability is 0. The second clause simply tells us the probability of innovating an action with value v, given that it is possible to go from R to R0 in one innovation action. Observation actions In Section 4.1 we assumed the existence of a distribution Obs that, when given the current history, would tell us the probability of observing an action with a given value. 
Here, we will make the following assumptions about Obs: When given a repertoire and round number, Obs(vjR; r;S) tells us the prob- ability of observing an action that has value v and is not already known by R. When given a repertoire and round number, Obs(m; vjR; r;S) tells us the probability of observing an action that has value v, and was previously in R at position m. The value of this action may have changed. Obs can make its predictions without using any information lost when con- verting from a history to a repertoire. These assumptions are all satis ed by the Obs used in our implementation, and we expect them to hold for other practical implementations as well, since a distribution conditioned on entire histories would be impractically large. 115 With this in mind, the transition probability function for observation actions is PRep(R0jR; r;Obs;S) = 8 >>>>>>>>>>< >>>>>>>>>>: Obs(m; vjR; r;S) if hv0m; mi 2 R ^ R0 = updaction(R; v;m) Obs(vjR; r;S) if R0 = newaction(R; v) 0 Otherwise. The rst clause gives us the probability of observing an action already in our repertoire, while the second gives us the probability of observing a new action. Exploitation actions Let hvi; ii be the value and age of exploitation action Xi. Then PRep(R0jR; r;Xi;S) = 8 >>>>>>>>>>>>>>>>>>< >>>>>>>>>>>>>>>>>>: 0 if jRj 6= jR0j 0 if 8v0 2 V;R0 6= updaction(R; v0; i) Qr j=r i (1 c(j)) + Pr j=r i c(j) (vi; j) Qr i=j(1 c(i)) if R0 = updaction(R; vi; i) Pr j=r i c(j) (v0i; j) Qr i=j(1 c(i)) if R0 = updaction(R; v0i; i) and vi 6= v 0 i The rst two clauses check that we can, in fact, transition between R and R0 by exploiting. The third clause gives us the probability that the action we exploited has not changed since we last saw it, while the fourth clause gives us the probability that the action we exploited has changed. 116 B.3 Consistency between P and PRep Later in this section, we will show that using a repertoire-based algorithm to compute -best-response strategies returns the same results as using the history- based Algorithm 1. To do this, we will use the notion of consistency between the P and PRep equations. De nition. Let M be the set of actions known to an agent with history h . The P and PRep equations are consistent for h if, for all a 2 fInv;Obs;Xig and v 2 V : X m2M P (h (a; (m; v)) j h; a;S) = jRjX m=1 PRep(updaction(R; v;m)jR; r; a;S) (B.1) and X m2f1;:::;MgnM P (h (a; (m; v)) j h; a;S) = PRep(newaction(R; v)jR; r; a;S) (B.2) where R = CreateRepertoire(h) and r = jh j. 2 Lemma 6. The P and PRep equations are consistent for all h 2 H. Proof. We can prove this by using the de nition of P , found in Section 4.1, and the de nition of PRep found above. We will simply consider what happens for arbitrary h and v when performing innovation, observation, and exploitation actions. Recall that X(h) returns the number of exploit moves available to an agent with history h. For ease of exposition, we will assume without loss of generality that the rst action learned by h has label 1, the second has label 2, etc. Thus M = f1; : : : ; X(h)g, while f1; : : : ;Mg nM = fX(h) + 1; : : : ;Mg. 117 Innovation actions An innovation action always returns information on a new ac- tion, so both sides of Equation B.1 are clearly 0 in this case. If X(h) = jRj = M , both sides of Equation B.2 are also 0 since no new actions can be innovated. Thus, we will assume X(h) = jRj < M . 
We now have MX m=X(h)+1 P (h (Inv; (m; v))jh; Inv;S) = (M X(h)) (v) M X(h) = (v) and PRep(newaction(R; v)jR; r; Inv;S) = (v) which are clearly equivalent. Hence, P and PRep are consistent on h when a = Inv. Observation actions This section of the proof is mostly trivial, given the assump- tions we have made about Obs. The left side of Equation B.1 is X(h) m=1 P (h (Obs; (m; v))jh;Obs;S) = X(h) m=1 Obs(m; vjh;S) which tells us the probability of observing one of the actions already seen in our current history. The right side is jRjX m=1 PRep(updaction(R; v;m)jR; r;Obs;S) = jRjX m=1 Obs(m; vjR; r;S) Since we assume that Obs(m; vjR; r;S) tells us the probability of observing action m in the repertoire, PjRj m=1 Obs(m; vjR; r;S) also tells us the probability of observ- ing any of the actions we have already seen. Therefore, Equation B.1 holds for observation actions. For Equation B.2, the left side is MX m=X(h)+1 P (h (Obs; (m; v))jh;Obs;S) = MX m=X(h)+1 Obs(m; vjh;S) 118 which tells us the probability of observing an action we have not yet seen. The right side is PRep(newaction(R; v)jR; r;Obs;S) = Obs(vjR; r;S) Since we assume that Obs(vjR; r;S) gives us the probability of observing a new action, Equation B.2 also holds for observation actions. Hence, P and PRep are consistent on h when a = Obs. Exploitation actions Exploiting an action never gives us information about a new action, so both sides of Equation B.2 are 0 when we exploit. Thus, we need only consider Equation B.1. We will consider two cases. In the rst case, the action we choose to exploit has changed since we last saw it, so v is a new value. We then have X(h) m=1 P (h (Xm; (m; v))jh;Xm;S) = X(h) m=1 rX j=lastm c(j) (v; j) rY i=j 1 c(i) where lastm is the last round number on which we obtained any information about the m-th action. Similarly, since exploiting never increases the size of a repertoire, we have jRjX m=1 PRep(updaction(R; v;m)jR; r;Xm;S) = jRjX m=1 rX j=r m c(j) (v; j) rY i=j (1 c(i)) Since jRj = X(h) and lastm = r m by de nition, Equation B.1 holds when the action we exploit changes. Next, we consider the case where the action we choose to exploit has not changed. In this case, we have 119 X(h) m=1 P (h (Xm; (m; v))jh;Xm;S) = X(h) m=1 rY j=lastm (1 c(j)) + rX j=lastm c(j) (v; j) rY i=j 1 c(i) ! and jRjX m=1 PRep(updaction(R; v;m)jR; r;Xm;S) = jRjX m=1 rY j=r m (1 c(j)) + rX j=r m c(j) (v; j) rY i=j (1 c(i)) ! which are also equivalent. Therefore, Equation B.1 also holds when the action we exploit does not change. We have now shown that P and PRep are consistent on arbitrary h when a 2 fInv;Obs;Xig and for arbitrary v. Therefore, P and PRep are consistent for all h . 2 B.4 Repertoire-Based Strategies Repertoires can be used to more compactly de ne a strategy. We let a repertoire-based strategy s be a function from repertoires to actions. Such a strategy can be represented more compactly than the history-based strategies used earlier in this work, since there are fewer possible repertoires than there are possible histories. In any history h , a repertoire-based strategy s chooses the action associated with repertoire CreateRepertoire(h). We can use the PRep functions to de ne a formula that determines the EPRU for any repertoire-based strategy s. 
We can use the $P^{Rep}$ functions to define a formula that determines the EPRU of any repertoire-based strategy $s$. $\mathrm{EPRU}_{alt}(s, R, r \mid G, S)$ is a recursive function for calculating the expected per-round utility of $s$:
\[
\begin{aligned}
\mathrm{EPRU}_{alt}(s, R, r \mid G, S) = \sum_{a \in A} \sum_{v \in V} \Big[\;
& P^{Rep}(\mathrm{newaction}(R, v) \mid R, r, a, S)\,\big(\mathrm{EV}_{exp}(r, U((a, (\cdot, v)))) + \mathrm{EPRU}_{alt}(s, \mathrm{newaction}(R, v), r+1 \mid G, S)\big) \\
{}+{} & \sum_{m=1}^{|R|} P^{Rep}(\mathrm{updaction}(R, v, m) \mid R, r, a, S)\,\big(\mathrm{EV}_{exp}(r, U((a, (\cdot, v)))) + \mathrm{EPRU}_{alt}(s, \mathrm{updaction}(R, v, m), r+1 \mid G, S)\big) \Big],
\end{aligned}
\]
where $A$ is the set of possible actions and $V$ is the set of possible action values.

However, like $\mathrm{EPRU}_{alt}(s, h \mid G, S)$, $\mathrm{EPRU}_{alt}(s, R, r \mid G, S)$ contains infinite recursion and is therefore not computable. We deal with this problem as we did in Section 4.2.1, by introducing a depth-limited version. For ease of exposition, we introduce two helper functions
\[
\mathrm{EPRU}^{k}_{altnew}(s, R, r, a, v \mid G, S) = P^{Rep}(\mathrm{newaction}(R, v) \mid R, r, a, S)\,\big[\mathrm{EV}_{exp}(r, U((a, (\cdot, v)))) + \mathrm{EPRU}^{k-1}_{alt}(s, \mathrm{newaction}(R, v), r+1 \mid G, S)\big]
\]
and
\[
\mathrm{EPRU}^{k}_{altupd}(s, R, r, a, v \mid G, S) = \sum_{m=1}^{|R|} P^{Rep}(\mathrm{updaction}(R, v, m) \mid R, r, a, S)\,\big[\mathrm{EV}_{exp}(r, U((a, (\cdot, v)))) + \mathrm{EPRU}^{k-1}_{alt}(s, \mathrm{updaction}(R, v, m), r+1 \mid G, S)\big].
\]
Now we can define
\[
\mathrm{EPRU}^{k}_{alt}(s, R, r \mid G, S) =
\begin{cases}
0 & \text{if } k = 0,\\[4pt]
\displaystyle \sum_{a \in A} \sum_{v \in V} \big(\mathrm{EPRU}^{k}_{altnew}(s, R, r, a, v \mid G, S) + \mathrm{EPRU}^{k}_{altupd}(s, R, r, a, v \mid G, S)\big) & \text{otherwise.}
\end{cases}
\]
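To show the shape of this computation, here is a small Python sketch of the depth-limited recursion, written against hypothetical callables p_rep_new(R, r, a, v) and p_rep_upd(R, r, a, v, m) for the two $P^{Rep}$ cases, ev_exp(r, u) for $\mathrm{EV}_{exp}$, utility(a, v) for $U$, and the repertoire operations newaction and updaction defined earlier in this appendix. It mirrors the equations above and is not the implementation used in our experiments.

```python
def epru_alt(k, R, r, actions, values,
             p_rep_new, p_rep_upd, ev_exp, utility, newaction, updaction):
    """Depth-limited EPRU^k_alt(s, R, r | G, S): expected per-round utility
    accumulated over the next k rounds, computed by recursing over every
    possible next repertoire and weighting it by its transition probability."""
    if k == 0:                                   # base case: no rounds left
        return 0.0
    total = 0.0
    for a in actions:
        for v in values:
            u = ev_exp(r, utility(a, v))
            # Case 1: the percept reveals an action not yet in the repertoire.
            p = p_rep_new(R, r, a, v)
            if p > 0:
                total += p * (u + epru_alt(k - 1, newaction(R, v), r + 1,
                                           actions, values, p_rep_new, p_rep_upd,
                                           ev_exp, utility, newaction, updaction))
            # Case 2: the percept updates the action already at position m.
            for m in range(1, len(R) + 1):
                p = p_rep_upd(R, r, a, v, m)
                if p > 0:
                    total += p * (u + epru_alt(k - 1, updaction(R, v, m), r + 1,
                                               actions, values, p_rep_new, p_rep_upd,
                                               ev_exp, utility, newaction, updaction))
    return total
```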
A proof that this formulation is equivalent to the version of $\mathrm{EPRU}^{k}_{alt}$ from Section 4.2.1 follows.

Theorem 5. For all histories $h$, all repertoire-based strategies $s$, and all $k \geq 0$, if $s'$ is a function from histories to actions such that $s'(h) = s(\mathrm{CreateRepertoire}(h))$, $R = \mathrm{CreateRepertoire}(h)$, and $r = |h|$, then $\mathrm{EPRU}^{k}_{alt}(s, R, r \mid G, S) = \mathrm{EPRU}^{k}_{alt}(s', h \mid G, S)$.

Proof. We prove this by induction on $k$. For the base case, $k = 0$, we have $\mathrm{EPRU}^{0}_{alt}(s, R, r \mid G, S) = \mathrm{EPRU}^{0}_{alt}(s', h \mid G, S) = 0$ by definition.

For the inductive step, we assume that Theorem 5 holds for some $k \geq 0$ and show that it also holds for $k + 1$. Recall from Section 4.2.1 that, in this case,
\[
\mathrm{EPRU}^{k+1}_{alt}(s', h \mid G, S) = \sum_{t \in T} P(h \cdot t \mid h, s'(h), S)\,\big(\mathrm{EV}_{exp}(r, U(t)) + \mathrm{EPRU}^{k}_{alt}(s', h \cdot t \mid G, S)\big).
\]
Since $T$ is simply the set of all action-percept pairs, we can instead write this as
\[
\mathrm{EPRU}^{k+1}_{alt}(s', h \mid G, S) = \sum_{a \in A} \sum_{v \in V} \Big[ \sum_{m=1}^{M} P(h \cdot (a, (m, v)) \mid h, s'(h), S)\,\big(\mathrm{EV}_{exp}(r, U((a, (m, v)))) + \mathrm{EPRU}^{k}_{alt}(s', h \cdot (a, (m, v)) \mid G, S)\big) \Big],
\]
where $A$ is the set of possible actions and $V$ the set of possible values. We also have (recall that the function $U$ simply calculates the utility of performing the given action and does not depend on the action number, so $U(a, m, v) = U(a, \cdot, v)$ for any legal $m$)
\[
\mathrm{EPRU}^{k+1}_{alt}(s, R, r \mid G, S) = \sum_{a \in A} \sum_{v \in V} \Big[ P^{Rep}(\mathrm{newaction}(R, v) \mid R, r, a, S)\,\big(\mathrm{EV}_{exp}(r, U((a, (\cdot, v)))) + \mathrm{EPRU}^{k}_{alt}(s, \mathrm{newaction}(R, v), r+1 \mid G, S)\big) + \sum_{m=1}^{|R|} P^{Rep}(\mathrm{updaction}(R, v, m) \mid R, r, a, S)\,\big(\mathrm{EV}_{exp}(r, U((a, (\cdot, v)))) + \mathrm{EPRU}^{k}_{alt}(s, \mathrm{updaction}(R, v, m), r+1 \mid G, S)\big) \Big].
\]
Recall from Lemma 6 that $P$ and $P^{Rep}$ are consistent on $h$. We can combine this with our inductive hypothesis to show that the bracketed portions of the two equations above are equal.

Recall that when a repertoire encounters a new action, it does not store the action number $m$ for that action. Thus, for any pair $m$ and $m'$ that are not already in $h$, we know that $\mathrm{CreateRepertoire}(h \cdot (a, (m, v))) = \mathrm{CreateRepertoire}(h \cdot (a, (m', v))) = \mathrm{newaction}(R, v)$. Therefore, by our inductive hypothesis,
\[
\mathrm{EPRU}^{k}_{alt}(s', h \cdot (a, (m, v)) \mid G, S) = \mathrm{EPRU}^{k}_{alt}(s, \mathrm{newaction}(R, v), r + 1 \mid G, S)
\]
for all $m$ not already in $h$. Notice that, by definition, there are $M - X(h)$ values of $m$ that are not already in $h$. If we assume without loss of generality that the first action learned in $h$ has label 1, the second has label 2, and so on, then we can define $\alpha_{new}$ to be the quantity that $P$ and $P^{Rep}$ are multiplied by when we learn something new:
\[
\begin{aligned}
\alpha_{new} &= \mathrm{EV}_{exp}(r, U((a, (\cdot, v)))) + \mathrm{EPRU}^{k}_{alt}(s', h \cdot (a, (X(h)+1, v)) \mid G, S) \\
&= \mathrm{EV}_{exp}(r, U((a, (\cdot, v)))) + \mathrm{EPRU}^{k}_{alt}(s', h \cdot (a, (X(h)+2, v)) \mid G, S) \\
&\;\;\vdots \\
&= \mathrm{EV}_{exp}(r, U((a, (\cdot, v)))) + \mathrm{EPRU}^{k}_{alt}(s', h \cdot (a, (M, v)) \mid G, S) \\
&= \mathrm{EV}_{exp}(r, U((a, (\cdot, v)))) + \mathrm{EPRU}^{k}_{alt}(s, \mathrm{newaction}(R, v), r + 1 \mid G, S).
\end{aligned}
\]
Similarly, the inductive hypothesis also tells us that
\[
\mathrm{EPRU}^{k}_{alt}(s', h \cdot (a, (m, v)) \mid G, S) = \mathrm{EPRU}^{k}_{alt}(s, \mathrm{updaction}(R, v, m), r + 1 \mid G, S)
\]
for all $m$ that are already in $h$. Thus, we can also define $\alpha_m$ to be the quantity that $P$ and $P^{Rep}$ are multiplied by when we update our information on action $m$:
\[
\alpha_m = \mathrm{EV}_{exp}(r, U((a, (\cdot, v)))) + \mathrm{EPRU}^{k}_{alt}(s', h \cdot (a, (m, v)) \mid G, S) = \mathrm{EV}_{exp}(r, U((a, (\cdot, v)))) + \mathrm{EPRU}^{k}_{alt}(s, \mathrm{updaction}(R, v, m), r + 1 \mid G, S)
\]
for $m = 1, \ldots, X(h)$.

We can now rewrite $\mathrm{EPRU}^{k+1}_{alt}(s', h \mid G, S)$ and $\mathrm{EPRU}^{k+1}_{alt}(s, R, r \mid G, S)$ as
\[
\mathrm{EPRU}^{k+1}_{alt}(s', h \mid G, S) = \sum_{a \in A} \sum_{v \in V} \Big[ \sum_{m=X(h)+1}^{M} P(h \cdot (a, (m, v)) \mid h, s'(h), S)\,\alpha_{new} + \sum_{m=1}^{X(h)} P(h \cdot (a, (m, v)) \mid h, s'(h), S)\,\alpha_m \Big]
\]
and
\[
\mathrm{EPRU}^{k+1}_{alt}(s, R, r \mid G, S) = \sum_{a \in A} \sum_{v \in V} \Big[ P^{Rep}(\mathrm{newaction}(R, v) \mid R, r, a, S)\,\alpha_{new} + \sum_{m=1}^{|R|} P^{Rep}(\mathrm{updaction}(R, v, m) \mid R, r, a, S)\,\alpha_m \Big].
\]
Equations B.1 and B.2 tell us that, regardless of the values of $a$ and $v$,
\[
\sum_{m=X(h)+1}^{M} P(h \cdot (a, (m, v)) \mid h, s'(h), S) = P^{Rep}(\mathrm{newaction}(R, v) \mid R, r, a, S)
\]
and
\[
\sum_{m=1}^{X(h)} P(h \cdot (a, (m, v)) \mid h, s'(h), S) = \sum_{m=1}^{|R|} P^{Rep}(\mathrm{updaction}(R, v, m) \mid R, r, a, S).
\]
Therefore, $\mathrm{EPRU}^{k+1}_{alt}(s', h \mid G, S) = \mathrm{EPRU}^{k+1}_{alt}(s, R, r \mid G, S)$. This completes the induction. □

B.5 Repertoire-Based Algorithm

Now that we have a formula for computing the EPRU of a repertoire-based strategy, and we know that using repertoires rather than histories to calculate EPRU gives the same results, we can update our algorithm to use repertoires. The new algorithm is almost identical to Algorithm 1, except that using repertoires rather than histories allows us to cut the number of recursive calls in half.

Let $R'_{Obs}$ be the set of repertoires $R'$ for which $P^{Rep}(R' \mid R, r, \mathrm{Obs}, S) > 0$, and define $R'_{Inv}$ and $R'_{X_i}$ similarly. Note that, if $R$ contains $m$ different actions,
\[
R'_{Obs} \subseteq R'_{Inv} \cup \bigcup_{i=1}^{m} R'_{X_i}. \tag{B.3}
\]
In other words, since repertoires do not need to remember which actions our agent performed, choosing $X_1$ produces the same repertoire as choosing Obs and observing action 1. Similarly, choosing Inv and choosing Obs can also produce the same repertoires, if both actions happen to tell us about the same action. However, there is no action we could encounter through observing that we could not also encounter through either innovating or exploiting. Therefore, if we save the results of the recursive calls used to calculate the utilities of Inv and $X_1, \ldots, X_m$, we can compute the utility of Obs without any additional recursion. This cuts the branching factor of our algorithm in half, from $(2m + 2)v$ to $(m + 1)v$, which reduces the size of the search tree by a factor of $2^k$ for search depth $k$, without any impact on accuracy. Algorithm 7 is the complete algorithm.

Algorithm 7: Produce a strategy $s$ that maximizes $\mathrm{EPRU}^{k}_{alt}(s, R \mid G, S)$, given initial repertoire $R$ and set of possible utility values $V$.
Strat(R, r, k, V, S)
 1: if k = 0 then
 2:     return {null, 0}
 3: end if
 4: Let U_max = 0
 5: Let s_max = null
 6: Let U_Obs = 0
 7: for each action a ∈ {X_1, …, X_M, Inv} do
 8:     Let U_temp = 0
 9:     Let s_temp = null
10:     for each value v ∈ V do
11:         Let t = ⟨v, 1⟩
12:         if ∃i : (a = X_i) then let R' = age(R \ {⟨v_i, τ_i⟩}) ∪ {t} and p = P^Rep(R' | R, r, X_i, S)
13:         else let R' = age(R) ∪ {t} and p = P^Rep(R' | R, r, Inv, S)
14:         Let p_Obs = P^Rep(R' | R, r, Obs, S)
15:         if p + p_Obs > 0 then
16:             Let {S', U'} = Strat(R', r + 1, k − 1, V, S)
17:             if p_Obs > 0 then U_Obs = U_Obs + p_Obs · U'
18:             if p > 0 then
19:                 s_temp = s_temp ∪ S'
20:                 if ∃i : (a = X_i) then U_temp = U_temp + p · (EV_exp(r, v) + U')
21:                 else U_temp = U_temp + p · U'
22:             end if
23:         end if
24:     end for
25:     if U_temp > U_max then
26:         U_max = U_temp
27:         s_max = s_temp ∪ {⟨R, a⟩}
28:     end if
29: end for
30: if U_Obs > U_max then
31:     U_max = U_Obs
32:     s_max = s_temp ∪ {⟨R, Obs⟩}
33: end if
34: return {s_max, U_max}

Appendix C
The Number of Pure Strategies for Cultaptation

In Cultaptation as defined on the tournament web site, each game includes 10,000 rounds, 100 agents, 100 exploitation actions, and the actions Inv and Obs. Let $S$ be the set of all pure Cultaptation strategies, and $S'$ be the set of all strategies in which the first 100 moves are Inv and all subsequent moves are exploitation actions. Then any lower bound on $|S'|$ is a (loose) lower bound on $|S|$.

Suppose an agent uses a strategy in $S'$. If it survives the first 100 rounds of the game, it will learn values for all 100 of the exploitation actions. There are $100!$ different orders in which these actions may be learned, and for each action there are 100 possible values, hence $100^{100}$ possible combinations of values. Thus after 100 Inv moves, the number of possible histories is $100! \cdot 100^{100}$.

All subsequent moves by the agent will be exploitations, and it is possible (though quite unlikely!) that the agent may live for the remaining 9,900 rounds of the game. Thus each of the above histories is the root of a game tree of height $2 \cdot 9{,}900$. In this game tree, each node of even depth is a choice node (each branch emanating from the node corresponds to one of the 100 possible exploitation actions), and each node of odd depth is a value node (each branch emanating from the node corresponds to one of the 100 different values that the chosen action may return). Since there are $100! \cdot 100^{100}$ of these game trees, the total number of choice nodes is
\[
100! \cdot 100^{100} \cdot \sum_{d=0}^{9899} (100^2)^d > 9.3 \times 10^{39953}.
\]
If we use the conventional game-theoretic definition that a pure strategy $s$ must include a choice of action at each choice node, regardless of whether the choice node is reachable given $s$, then it follows that
\[
|S'| > 100^{\,9.3 \times 10^{39953}}.
\]
If we use the definition used by game-tree-search researchers, in which a pure strategy only includes a choice of action at each choice node that is reachable given $s$, then the number of reachable choice nodes given $s$ is
\[
100! \cdot 100^{100} \cdot \sum_{d=0}^{9899} 100^d > 9.4 \times 10^{20155},
\]
so
\[
|S'| > 100^{\,9.4 \times 10^{20155}}.
\]
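These bounds are easy to check with exact integer arithmetic. The following Python snippet (an illustrative check, not part of the dissertation's software) recomputes both choice-node counts using the closed form of the geometric sum and confirms the stated leading digits and exponents.

```python
from math import factorial

def geometric_sum(base, n_terms):
    """Sum of base**d for d = 0, ..., n_terms - 1, in exact integer arithmetic."""
    return (base ** n_terms - 1) // (base - 1)

histories_after_innovation = factorial(100) * 100 ** 100      # 100! * 100^100

# Choice nodes counting every node of even depth (conventional game-theoretic
# definition): each choice level multiplies the count by 100^2.
all_choice_nodes = histories_after_innovation * geometric_sum(100 ** 2, 9900)

# Choice nodes reachable under a fixed pure strategy (game-tree-search
# definition): only the 100 value branches multiply the count per level.
reachable_choice_nodes = histories_after_innovation * geometric_sum(100, 9900)

def leading_digits(n, k=2):
    """Return (mantissa, exponent) so that n is roughly mantissa * 10**exponent."""
    s = str(n)
    return int(s[:k]) / 10 ** (k - 1), len(s) - 1

print(leading_digits(all_choice_nodes))        # about (9.3, 39953)
print(leading_digits(reachable_choice_nodes))  # about (9.4, 20155)
```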