ABSTRACT

Title of dissertation: SYNTHESIS OF STRATEGIES FOR NON-ZERO-SUM REPEATED GAMES
Tsz-Chiu Au, Doctor of Philosophy, 2008
Dissertation directed by: Professor Dana Nau, Department of Computer Science

There are numerous applications that involve two or more self-interested autonomous agents that repeatedly interact with each other in order to achieve a goal or maximize their utilities. This dissertation focuses on the problem of how to identify and exploit useful structures in agents' behavior for the construction of good strategies for agents in multi-agent environments, particularly non-zero-sum repeated games. This dissertation makes four contributions to the study of this problem. First, this thesis describes a way to take a set of interaction traces produced by different pairs of players in a two-player repeated game, and then find the best way to combine them into a strategy. The strategy can then be incorporated into an existing agent, as an enhancement of the agent's original strategy. In cross-validated experiments involving 126 agents for the Iterated Prisoner's Dilemma, Iterated Chicken Game, and Iterated Battle of the Sexes, my technique was able to improve the performance of nearly all of the agents. Second, this thesis investigates the issue of uncertainty about goals when a goal-based agent is situated in a nondeterministic environment. The results of this investigation include necessary and sufficient conditions under which an agent can guarantee to achieve its goal, and an algorithm for synthesizing a strategy from interaction traces that maximizes the probability of success of an agent even when no strategy can assure the success of the agent. Third, this thesis introduces a technique, Symbolic Noise Detection (SND), for detecting noise (i.e., mistakes or miscommunications) among agents in repeated games. The idea is that if we can build a model of the other agent's behavior, we can use this model to detect and correct actions that have been affected by noise. In the 20th Anniversary Iterated Prisoner's Dilemma competition, the SND agent placed third in the "noise" category, and was the best performer among programs that had no "slave" programs feeding points to them. Fourth, the thesis presents a generalization of SND that can be wrapped around any existing strategy. Finally, the thesis includes a general framework for synthesizing strategies from experience for repeated games in both noisy and noise-free environments.

SYNTHESIS OF STRATEGIES FOR NON-ZERO-SUM REPEATED GAMES

by

Tsz-Chiu Au

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2008

Advisory Committee:
Professor Dana Nau, Chair/Advisor
Professor Atif Memon
Professor James Reggia
Professor P.S. Krishnaprasad
Professor Sarit Kraus
Professor V.S. Subrahmanian

© Copyright by Tsz-Chiu Au 2008

Acknowledgments

I would like to thank my advisor, my colleagues, and my friends for their continuous support and encouragement during my years at UMD. First and foremost, I have to thank my advisor, Prof. Dana Nau, for his mentorship. It is my good fortune to have Dana as my advisor. I really appreciate his teaching about how to be a more successful researcher. Apart from being my mentor, Dana is also my good friend. The hallmark of Dana is his humor, which is a source of encouragement and amusement to me and many others.
His generous support gave me the peace of mind that was much needed in my graduate study.

I learned tremendously from my colleagues and my teachers. In particular, I would like to thank Sarit Kraus, Héctor Muñoz-Avila, and V.S. Subrahmanian. Special thanks go to past and current members of the DSN research group, who shared their knowledge and great passion for AI research with me throughout my research career at UMD. The people in the business offices of the CS department, UMIACS, and ISR are extremely helpful and friendly. I want to express my gratitude to Fatima in the graduate office. Without her friendly "nagging", I am afraid I would have taken a much longer time to finish my degree. In addition, I am grateful to Felicia; it was always joyful to talk to her in the hallways. I would like to thank all the systems staff for their technical support.

My friends played a large part in my graduate study. Among many others, I owe my gratitude to Waiyian Chong, Nick Frangiadakis, Annie Hui, and Shu-Hoi Pun for their friendship and the joys we shared. Their friendships are an integral part of my life. Most of all, I would like to thank my parents, my brother, and my sister for their encouragement and patience.

The University of Maryland, College Park is a superb place for study and research. How joyful it is to write a paper overnight in McKeldin Library among so many hardworking students. I will definitely miss this beautiful campus, and my couch in my office, very much in the future.

Table of Contents

List of Tables vii
List of Figures ix

1 Introduction 1
1.1 Objective and Approach 3
1.2 Contributions 5
1.3 Outline of this Thesis 8

2 Synthesis of Strategies from Interaction Traces 11
2.1 Introduction 11
2.2 Basic Definitions 13
2.3 Synthesis of Strategies from Interaction Traces 16
2.3.1 Reconstructing an Agent's Strategy 16
2.3.2 Constructing New Strategies 20
2.3.3 Finding the Best Composite Strategy 23
2.3.4 Using a Base Strategy 31
2.4 Experimental Evaluation 33
2.4.1 Experimental Design 33
2.4.2 Experimental Results 36
2.5 Comparisons with Other Work 43
2.6 Summary 45

3 Task-Completion Problems with Goal Uncertainty 48
3.1 Basic Definitions 51
3.1.1 Agents and Environments 52
3.1.2 Equivalences 53
3.1.3 Agent-Environment Interaction 55
3.2 Task-Completion Problems 55
3.2.1 Interaction-based Problem Formulations 56
3.2.2 Goal Uncertainty 59
3.2.3 Simplifying the Problem Definitions 63
3.3 Strong Solvability 65
3.4 p-Solvability and Optimal Solutions 72
3.5 Solving Weakly Solvable Problems 74
3.6 Experimental Results 78
3.7 Related Work 82
3.8 Summary 83

4 Noise Detection in the Iterated Prisoner's Dilemma 85
4.1 Motivation and Approach 88
4.2 Iterated Prisoner's Dilemma with Noise 91
4.3 Strategies, Policies, and Hypothesized Policies 93
4.3.1 Discussion 96
4.4 Derived Belief Strategy 97
4.4.1 Discussion 100
4.5 Learning Hypothesized Policies in Noisy Environments 102
4.5.1 Learning by Discounted Frequencies 102
4.5.2 Deficiencies of Discounted Frequencies in Noisy Environments 105
4.5.3 Identifying Deterministic Rules Using Induction 106
4.5.4 Symbolic Noise Detection and Temporary Tolerance 109
4.5.5 Coping with Ignorance of the Other Player's New Behavior 110
4.6 The Move Generator in DBS 112
4.6.1 Generalizing the Move Generator 115
4.6.2 An Analysis of the Move Generator 117
4.7 Competition Results 121
4.7.1 Overall Average Scores 123
4.7.2 DBS versus the Master-and-Slaves Strategies 124
4.7.2.1 Group Performance 125
4.7.2.2 Overall Average Scores versus Number of Slaves 127
4.7.2.3 Percentages of Interactions 129
4.7.2.4 Distributions of Average Scores 130
4.7.3 A comparison between DBSz, TFT, and TFTT 135
4.8 Related Work 135
4.9 Summary 140

5 Symbolic Noise Filter 153
5.1 Introduction 153
5.2 Our Hypothesis 157
5.3 The Noisy ICG Tournament and The Noisy IBS Tournament 159
5.4 Naïve Symbolic Noise Filter 162
5.5 Tournament Setup 167
5.6 Experimental Analysis of NSNF 168
5.6.1 Basic Statistics 168
5.6.1.1 Average scores 168
5.6.1.2 Accuracy of correction 170
5.6.1.3 Accuracy of correction vs Average Scores 174
5.6.2 Explanations via Characteristics of the Decision-Making Process 175
5.6.2.1 Average scores vs accuracy of correction 176
5.6.2.2 Average scores vs increases in average scores 179
5.6.3 Distribution of Decision Pairs 179
5.7 Discussions 183
5.8 Summary 183

6 A Framework for Building Strategies for Repeated Games 186
6.1 An Overview of the Framework 186
6.2 Strategy Construction Graphs 188

7 Conclusions and Future Work 192
7.1 Contributions 192
7.2 Directions for Future Work 200
7.2.1 New Functionality 201
7.2.2 New Problems and Applications 204
7.2.3 Synthesis of Strategies for Repeated Games 206
7.3 Summary 206

A List of Acronyms 207
B Glossary of Notation 209
Bibliography 216

List of Tables

2.1 The payoff matrix of the Prisoner's Dilemma. 34
2.2 The payoff matrix of the Chicken Game. 34
2.3 The payoff matrix of the Battle of the Sexes. 34
2.4 Among the original agents, how many of each type. 36
2.5 Average number of interaction traces collected during the training sessions, before and after removing duplicate traces, and average number of interaction traces in the composite strategies generated by the CIT algorithm. 37
2.6 Average increases in score and rank of the MCA agents, relative to the corresponding original agents. 41
2.7 Average scores of the best original agent, and the MCA agent whose base strategy is that agent. 41
2.8 Average frequency of invocation of base strategies. 42
4.1 Scores of the best programs in Competition 2 (IPD with Noise). The table shows each program's average score for each run and its overall average over all five runs. The competition included 165 programs, but we have listed only the top 15. 86
4.2 A modified version of Table 4.1 in which we have averaged the scores of each collection of programs that was either (i) a group of conspirators or (ii) a collection of variants of the same algorithm. The average score for DBS is 402.1, which is higher than the average score of any other program. The master of the best master-slave strategy, BWIN, came in 14th with a score of 379.9. Only the top fifteen groups are listed. 126
4.3 Percentages of different interactions. "All but M&S" means all 105 programs that did not use master-and-slaves strategies, and "all" means all 165 programs in the competition. 131
5.1 The overall normalized average scores (the average of the normalized average scores of all strategies in Figure 5.3). 170
5.2 Accuracy of Predictions and Corrections of NSNF. 172
5.3 Distributions of different decision pairs. The IPD's data are collected from Category 2 of the 2005 IPD competition. "All but M&S" means all 105 programs that did not use master-and-slaves strategies, and "all" means all 165 programs in the competition.
Note that the numbers for the IPD are not the numbers of decision pairs but of interaction pairs. 181
5.4 Frequency of change of decision pairs. 182
5.5 The sum of the diagonal entries in Table 5.4. 183

List of Figures

2.1 A composite strategy is synthesized from records of the interactions among many existing agents. An agent using this strategy can potentially outperform most of the agents from whom the interaction traces were obtained. 13
2.2 The pseudocode of a composite strategy. 18
2.3 The pseudocode of the CIT algorithm. 26
2.4 The search space of the CIT algorithm. 28
2.5 Algorithm for a composite agent with a base strategy. 32
2.6 Overall average scores of the base agents and the MCA agents in the IPD. The agents are displayed on the x axis in order of decreasing score of the base agent. The error bars denote the 95% confidence intervals of the overall average scores. 38
2.7 Overall average scores of the base agents and the MCA agents in the ICG. 38
2.8 Overall average scores of the base agents and the MCA agents in the IBS. 39
2.9 Increase in rank of each enhanced agent, relative to the corresponding base agent. The x axis is as in Figures 2.6-2.8. 40
3.1 The state diagrams of two configurations with different goals. The double circles are the goal states. This figure is adapted from Figure 4.19(a) on page 124 in Russell and Norvig [57]. 60
3.2 The pseudocode of a composite agent function (also known as a composite strategy) for task-completion problems (with or without goal uncertainty). 71
3.3 The Compatible Interaction Traces Search algorithm (CIT-search). 77
3.4 Success rates for composite agent functions constructed by running CIT-search with a database that covers k of the environment's 64 configurations, for k = 1,...,64. Each data point is an average of 250,000 runs. 81
4.1 An abstract representation of a class of strategies that generate moves using a model of the other player. 99
4.2 An outline of the DBS strategy. ShouldPromote first increases r+'s promotion count, and then if r+'s promotion count exceeds the promotion threshold, ShouldPromote returns true and resets r+'s promotion count. Likewise, ShouldDemote first increases r−'s violation count, and then if r−'s violation count exceeds the violation threshold, ShouldDemote returns true and resets r−'s violation count. Rp in Line 17 is the probabilistic rule set; R^prob_{k+1} in Line 17 is calculated from Equation 4.2. 101
4.3 Learning speeds of the induction method and the discounted frequency method when the other player always cooperates. The initial degree of cooperation is zero, the discount rate is 0.75, and the promotion threshold is 3. 109
4.4 An example of the tree that we use to compute the maximum expected scores. Each node denotes the interaction of an iteration. The top four nodes constitute a path representing the current history.
The length of the current history is l = 2, and the maximum depth N′ is 2. There are four edges emanating from each node S after the current node; each of these edges corresponds to a possible interaction of the iteration after S. The maximum expected scores (not shown) of the nodes with depth 2 are set by an evaluation function f; these values are then used to calculate the maximum expected scores of the nodes with depth 1 by using the maximizing rule. Similarly, the maximum expected score of the current node is calculated using the four maximum expected scores of the nodes with depth 1. 144
4.5 The procedure for computing a recommended move for the current iteration. In the competition, we set N′ = 60, f_CC = 3, f_CD = 0, f_DC = 5, and f_DD = 1. 145
4.6 The total number of changes of recommended policies generated by MoveGen for each hypothesized policy as the search depth increases. 146
4.7 The distribution of the periodicity of the cycle of recommended policies versus the starting search depths. No periodicity means that there is no obvious cycle of recommended policies in the sequences of recommended policies generated starting from a given starting search depth. 147
4.8 The percentage of recommended policies returned by the MoveGen procedure. The search depth is 100. Each recommended policy is represented by four characters m1 m2 m3 m4, which means that the recommended policy is {(C,C) → m1, (C,D) → m2, (D,C) → m3, (D,D) → m4}. This table excludes the hypothesized policies for which the MoveGen procedure returns a sequence of recommended policies that change as the search depth increases. 148
4.9 The overall average scores of selected programs versus the number of slaves removed from the tournament. 149
4.10 Density plots of the average scores of selected programs, overlapped with dot plots of the average scores. The vertical lines mark the overall average scores of the programs. 150
4.11 Density plots of the average scores of selected programs, overlapped with dot plots of the average scores. The slave programs and the programs submitted by the author, except DBSz and TFTIm, are excluded. 151
4.12 Average scores of the selected programs when played against DBSz and the programs provided by the organizer of the competition. 152
5.1 Naïve Symbolic Noise Filter (NSNF). 163
5.2 The pseudocode of the Naïve Symbolic Noise Filter. The function invert(b′) returns C if b′ = D and returns D if b′ = C. 165
5.3 Normalized average scores. 169
5.4 Increases in normalized average scores due to NSNF versus normalized average scores. Notice that 11 points are clustered at (0.7, 0.00). 171
5.5 Accuracy of correction versus normalized average scores. 175
5.6 Increases in normalized average scores versus accuracy of correction. 176
5.7 Frequency of choosing defect versus normalized average scores. 177
5.8 Frequency of changes of the player's own decisions versus normalized average scores. 177
6.1 A framework for constructing strategies from data. 187
6.2 The bond nodes and the bond edges represent the strategy construction graph of the CIT technique.
Notice that the edge "generalize by MCA" is a hyperedge. 189
6.3 The strategy construction graph of the CIT technique with the symbolic noise filter. 190

Chapter 1
Introduction

There are numerous applications that involve two or more self-interested autonomous agents that repeatedly interact with each other in order to complete a task or maximize their utilities. Some notable applications are computer game playing, negotiation of contracts, and foreign policy making. This dissertation focuses on the problem of finding a good strategy for an agent in multi-agent environments, particularly non-zero-sum repeated games, so as to maximize the utility of the agent.

Game theory has been widely used for studying interactions among self-interested agents whose performance depends on how well they interact with each other. Classical game theory studies the game-theoretic notion of an equilibrium strategy and various solution concepts, based on the assumptions that agents are rational, actions are perfectly executed, and observations have no error. While the techniques and analysis of classical game theory have enjoyed great success on many interesting and important problems, their success is limited when they are applied directly to more complicated settings in which the assumptions of rationality and perfect acting and sensing are severely violated. For example, empirical studies of repeated games have shown that agents seldom behave according to the subgame perfect equilibria of a game, casting doubt on the assumption of rationality based on backward induction in repeated or extensive games. Thus, a key problem in today's game theory research is the disparity between the theoretical predictions of agents' behavior and the actual behavior of agents in tournaments or real-world situations. This disparity prompts us to rethink the underlying assumptions of classical game theory. This dissertation sets out to propose new techniques for decision making in multi-agent environments when agents are irrational, mistakes can occur during the execution of actions, and observations are imperfect.

To account for the inconsistency between game theory and empirical results, researchers have proposed alternative models of economic reasoning, collectively called bounded rationality [58, 59, 56], which remove the assumption that agents have infinite computational resources for decision making. However, most of these models focus on the computational limitations of rationality, and do not take the contextual information of a game into account. Experiments conducted by Thomas C. Schelling in the 1950s showed that agents often depend on cultural or contextual information in their decision making, and this information is largely ignored in mathematical game theory. These results suggest that mathematical analysis alone may not be sufficient to account for the behavior of agents in a game. One way to get a more precise description of agents' behavior is to rely on observations of that behavior. This empirical approach assumes that we have data about decisions made by the agents in a game in a given context or environment. Then we can infer the behavior of the agents from the collected data. This is the approach this dissertation pursues.
We shall use this empirical approach to study several non-zero-sum repeated games, including the Iterated Prisoner's Dilemma (IPD), the Iterated Chicken Game (ICG), and the Iterated Battle of the Sexes (IBS). In particular, we aim to address two problems in these games: errors in the interaction among agents, and unfamiliarity with other agents' irrational behavior. The first problem concerns mistakes that can occur during the execution of actions and imperfect observations of agents: if errors can occur during the interaction among agents, the agents may no longer be able to maintain cooperation with each other, and the performance of the whole system decreases. The second problem is how to deal with irrational agents, that is, agents that do not act according to the Nash equilibrium or a subgame perfect equilibrium. The techniques proposed in this dissertation performed well in these games, according to our experiments. Our hope is that these techniques can be extended to other kinds of games and multi-agent environments as well.

1.1 Objective and Approach

Broadly speaking, our research question is: How can we identify and exploit useful structures in agents' behavior for effective decision making in non-zero-sum repeated games?

Repeated games are traditionally regarded as the "fruit flies" of the study of multi-agent systems and game theory. Despite the simplicity of their game structures and rules, repeated games can still be considered a complete multi-agent environment in which agents have to repeatedly interact with each other in order to maximize their own utilities. In fact, repeated games are rich in the variety of strategies and solutions, and yet existing theoretical analysis of repeated games seems unable to make perfect predictions of the actual behavior of players in those games, making repeated games an interesting subject. Therefore, we limit the scope of our studies to repeated games. At various points in this dissertation we shall discuss the possibility of extending the techniques we developed for these games to other problems.

Among all repeated games, we are interested in non-zero-sum repeated games, in which players do not always compete with each other. We choose to study non-zero-sum games because many real-world situations are not strictly competitive, and there is room for agents to cooperate with each other. To model these situations, non-zero-sum games are more appropriate than zero-sum games. Furthermore, the behavior of agents in non-zero-sum games is very different from the behavior of agents in zero-sum games such as Roshambo and chess, which leads to large differences in the design of strategies for the two kinds of games. Due to these differences, non-zero-sum games are technically interesting to AI researchers.

Our chief concern is decision making in non-zero-sum repeated games. More precisely, we are concerned with how to generate a good strategy for an agent to interact with other agents. Our approach is to identify inherent structures in agents' behavior and then exploit these structures to create better strategies or to improve existing strategies. As opposed to analytical approaches that tackle the problem from first principles (e.g., alpha-beta pruning for game-tree search and value iteration for MDPs), our approach is based on empirical observations of the structure of agents' behavior.
What we found is that it is often easier to discern structure in the behavior of agents in non-zero-sum games than in zero-sum games: in zero-sum games players tend to hide their intentions by making their moves unpredictable, whereas in non-zero-sum games agents are more willing to express their intentions openly through their moves. Due to the availability of easily discerned structures in agents' behavior, we believe our approach can be more effective than analytical approaches for non-zero-sum repeated games.

The difficult question, of course, is how to identify the structures in agents' behavior and then effectively exploit them. We consider two directions to address this question: (1) manually identify recurrent features in the agents' behavior from records of interactions in previous games, and (2) automatically discover those features using an algorithm. Either way, the focus of the identification is to discover structures in agents' behavior that can be used for improving the performance of existing agents or creating new agents. In this dissertation, we will show (1) how to exploit the clarity of behavior we observed to deal with noise, and (2) how to automatically identify a set T of interaction traces from a database of such traces, such that T can be used to form a new strategy and enhance existing strategies.

1.2 Contributions

This dissertation revolves around the following two theses:

Thesis 1: Clarity of agents' behavior can be used for detecting errors in the interaction among agents.

Thesis 2: Records of interactions among agents can be used to improve the performance of existing agents, especially when the distribution of agents' behavior is highly skewed.

Our investigation of these theses leads to several technical contributions to the study of repeated games and multi-agent systems. The following are the key contributions.

1. Synthesis of Strategies from Interaction Traces
We devised a novel technique for combining records of interactions, produced by many different agents in a two-player repeated game, into a strategy called a composite strategy. Our algorithm, called the CIT algorithm, can synthesize the best composite strategy from a database of interaction traces in polynomial time. Our experimental results show that, given a collection of agents, our technique can produce composite strategies that, when combined with the agents' existing strategies, do better than all but the best of the given agents.

2. Solvability of Task-Completion Problems with Goal Uncertainty
We introduce the notions of strong solvability and weak solvability for problems in which a goal-based agent must interact with a nondeterministic environment in order to achieve a goal, but the agent is uncertain about the goals. We state the necessary and sufficient conditions for strong solvability in terms of the interaction traces by which an agent can successfully reach the goals, provide an algorithm to find an optimal solution given the successful interaction traces for weakly solvable problems, and evaluate the algorithm empirically when the size of its inputs is small.

3. Symbolic Noise Detection
Accidents can cause great difficulty in cooperation with others. For example, in the Iterated Prisoner's Dilemma (IPD) with noise, actions chosen by the players can be randomly changed by noise, and that can trigger a long sequence of mutual defections between the players. Previous approaches for dealing with noise in the IPD are based on forgiveness.
Our approach, however, is based on the detection of noise. We developed a technique called symbolic noise detection (SND) for noise detection in the IPD with noise. The performance of SND was demonstrated in the 2005 IPD tournament.

4. Analysis of the Symbolic Noise Filter
A symbolic noise filter (SNF) is a wrapper that can be placed around any existing strategy in repeated games in order to endow the strategy with the capability of noise detection and correction. We evaluated the performance of SNF in two repeated games, the Iterated Chicken Game and the Iterated Battle of the Sexes, and found that SNF is highly effective in these games. Our experimental analysis indicated that SND will be more effective in games in which strategies often show stable behavior.

5. A Framework for Experience-Based Strategy Construction for Repeated Games
To construct a better strategy for a repeated game, one may want to examine the historical records of previous games, in order to learn from the data. We consider a learning process that involves the construction of several key data structures from data, the transformation from one data structure to another, and eventually the construction of strategies from the data structures. We proposed a framework that describes this learning process; its data structures are opponent models, action-value functions, and strategies.

Apart from the above technical contributions, this dissertation also includes two scientific discoveries pertaining to the behavior of agents in the IPD, ICG, and IBS, and perhaps non-zero-sum repeated games in general. First, we found that agents in those games often display deterministic behavior during interaction, and therefore noise is often detectable in those games. Second, we found that the distributions of interaction traces are highly skewed in these games, and therefore only a small number of interaction traces is needed in order to form a partial strategy that can be used to interact successfully with the agents in those games. We will discuss these empirical properties of agents' behavior at the end of this dissertation.

1.3 Outline of this Thesis

The contents of the rest of this dissertation are as follows:

Chapter 2. Synthesis of Strategies from Interaction Traces
This chapter presents a technique, called the CIT technique, for the synthesis of strategies for repeated games. Starting from a set of interaction traces produced by different pairs of players in a two-player repeated game, the CIT algorithm selects a subset of interaction traces and combines them into a composite strategy. The CIT algorithm is a polynomial-time algorithm that can generate the best such composite strategy, which potentially outperforms most of the players who contributed the interaction traces. This chapter also describes how to incorporate the composite strategy into an existing agent, as an enhancement of the agent's original strategy.

Chapter 3. Task-Completion Problems with Goal Uncertainty
This chapter provides a way to adapt the CIT technique presented in Chapter 2 to any problem that involves an agent who must achieve a goal or complete a task by interacting with an environment that responds to the agent's actions nondeterministically. First, we give the necessary and sufficient conditions under which an agent can guarantee to accomplish the goal.
For problems in which no agent can guarantee to accomplish the goal, we provide an algorithm that takes a set of interaction traces and generates a composite strategy with the highest probability of success.

Chapter 4. Noise Detection in the Iterated Prisoner's Dilemma
This chapter proposes the noise-detection approach to cope with noise in the Noisy Iterated Prisoner's Dilemma, a version of the IPD in which actions or observations can be randomly changed by accidental events. We describe the philosophy of symbolic noise detection (SND) and the details of the DBS strategies that use SND for noise detection in the IPD. Then we present the performance of the DBS strategies in the 2005 IPD tournament.

Chapter 5. Symbolic Noise Filter
This chapter introduces a wrapper that can be placed around any agent in 2×2 repeated games so as to filter the noise that may be present in the observed actions generated by the other player. The wrapper is called the symbolic noise filter (SNF) and is a simplified version of the SND mechanism in DBS. We conducted experiments to evaluate SNF, and empirically explained the performance of SNF in the Iterated Chicken Game and the Iterated Battle of the Sexes.

Chapter 6. A Framework for Building Strategies for Repeated Games
This chapter describes a general framework for constructing strategies from data in repeated games. This framework revolves around three key data structures for opponent modeling and decision making in repeated games: opponent models, action-value functions, and strategies. This framework elucidates the relationships between these data structures, and suggests a number of ways by which data can be turned into a strategy.

Chapter 7. Conclusions and Future Work
The last chapter reviews the contributions of this dissertation and proposes directions for future work.

In addition, Appendix A contains a list of acronyms that appear in this dissertation. The list also contains the names of several well-known strategies for the IPD. In Appendix B, there are tables describing the mathematical notation that we define and use in this dissertation.

Chapter 2
Synthesis of Strategies from Interaction Traces

In this chapter, we describe how to take a set of interaction traces produced by different pairs of players in a two-player repeated game, and combine them into a composite strategy. We provide an algorithm that, in polynomial time, can generate the best such composite strategy. We describe how to incorporate the composite strategy into an existing agent, as an enhancement of the agent's original strategy.

We provide experimental results using interaction traces from 126 agents (most of them written by students as class projects) for the Iterated Prisoner's Dilemma, Iterated Chicken Game, and Iterated Battle of the Sexes. We compared each agent with the enhanced version of that agent produced by our algorithm. The enhancements improved the agents' scores by about 5% in the IPD, 11% in the ICG, and 26% in the IBS, and improved their rank by about 12% in the IPD, 38% in the ICG, and 33% in the IBS.

2.1 Introduction

To create new and better agents in multi-agent environments, we may want to examine the strategies of several existing agents, in order to combine their best skills. One problem is that in general, we won't know what those strategies are. Instead, we'll only have observations of the agents' interactions with other agents.
The question is how to synthesize, from these observations, a new strategy that performs as well as possible.

In this chapter we present techniques for taking interaction traces (i.e., records of observed interactions) from many different pairs of agents in a 2-player iterated game, and synthesizing from these traces a new strategy called a composite strategy (see Figure 2.1). We also show how an existing agent can enhance its performance by combining its own strategy with the composite strategy. Our contributions include the following:

• We give a formal definition of a composite strategy, and present necessary and sufficient conditions under which a set of interaction traces from different agents can be combined together to form a composite strategy.
• We provide a polynomial-time algorithm for synthesizing, from a given set of interaction traces T, a composite strategy that is optimal, in the sense that it performs at least as well as any other composite strategy that can be formed from T.
• We provide a way to enhance an existing agent's performance by augmenting its strategy with a composite strategy.
• We provide experimental results demonstrating our algorithm's performance in the Iterated Prisoner's Dilemma (IPD), Iterated Chicken Game (ICG), and Iterated Battle of the Sexes (IBS), using interaction traces from 126 agents (117 written by students as class projects, and 9 standard agents from the published literature). For each agent, we compared its performance with the performance of our enhanced version of that agent. On average, the percentage improvements in score were about 5% in the IPD, 10% in the ICG, and 25% in the IBS; and the percentage improvements in rank were about 11% in the IPD, 38% in the ICG, and 33% in the IBS.

Figure 2.1: A composite strategy is synthesized from records of the interactions among many existing agents. An agent using this strategy can potentially outperform most of the agents from whom the interaction traces were obtained.

Our results show that, given a collection of agents, it is possible to do better than most of these agents by combining some of the "best" behaviors (in terms of interaction traces) exhibited by these agents in the past, as illustrated in Figure 2.1.

2.2 Basic Definitions

Consider a two-player finite repeated game, such as the IPD. At any time point t, two agents α_A and α_B can choose actions a and b from finite sets of actions A and B, respectively. Neither agent learns what the other's action is until after choosing its own action. Throughout this chapter, we will look at games from α_A's point of view; i.e., α_A is "our agent" and α_B is the opponent.

We call the pair (a,b) an interaction between α_A and α_B. An interaction trace between α_A and α_B is a sequence of interactions τ = ⟨(a1,b1),(a2,b2),...,(an,bn)⟩, where ai and bi are the actions of α_A and α_B, respectively. (This formalism can easily model environments where the actions are interleaved rather than occurring simultaneously, by specifying that at odd time points α_A's action must always be a "null" action, and similarly for α_B at even time points.) The length of τ is |τ| = n. We assume interactions occur at a finite number of discrete time points t1, t2, ..., tN, where N is the total number of interactions. (It is easy to modify our definitions and algorithms to handle games in which the number of interactions can vary, but to simplify our discussion we will assume the number of interactions is constant.) A history (or a history up to time tk) is an interaction trace τ = ⟨(a1,b1),(a2,b2),...,(ak,bk)⟩, where (1) ai and bi are the actions of α_A and α_B at time ti, respectively, and (2) 0 ≤ k ≤ N. τ is a full history if |τ| = N. The histories of α_A's and α_B's actions are τ^A = ⟨a1,a2,...,an⟩ and τ^B = ⟨b1,b2,...,bn⟩, respectively.

We often deal with the prefix of a history in this chapter. Let τ be a history ⟨(a1,b1),(a2,b2),...,(ak,bk)⟩. The j-th prefix of τ is the subsequence prefix_j(τ) = ⟨(a1,b1),(a2,b2),...,(aj,bj)⟩, whereas the j-th suffix of τ is suffix_j(τ) = ⟨(aj+1,bj+1),(aj+2,bj+2),...,(ak,bk)⟩.
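To make these definitions concrete, here is a small sketch in Python (illustrative only, not from the dissertation) of one way to represent interaction traces, their prefixes and suffixes, and the two players' action histories; the names Trace, prefix, suffix, and action_histories are hypothetical.

    from typing import List, Tuple

    Action = str                          # e.g., "C" or "D" in the IPD
    Interaction = Tuple[Action, Action]   # (our action a_i, opponent action b_i)
    Trace = List[Interaction]             # tau = <(a1,b1), ..., (an,bn)>

    def prefix(tau: Trace, j: int) -> Trace:
        """The j-th prefix of tau: its first j interactions."""
        return tau[:j]

    def suffix(tau: Trace, j: int) -> Trace:
        """The j-th suffix of tau: everything after the j-th interaction."""
        return tau[j:]

    def action_histories(tau: Trace) -> Tuple[List[Action], List[Action]]:
        """Split tau into our action history and the opponent's action history."""
        ours = [a for a, _ in tau]
        theirs = [b for _, b in tau]
        return ours, theirs

    # Example: a 3-iteration IPD trace.
    tau = [("C", "C"), ("C", "D"), ("D", "C")]
    assert prefix(tau, 2) == [("C", "C"), ("C", "D")]
    assert action_histories(tau) == (["C", "C", "D"], ["C", "D", "C"])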
An agent uses a strategy to determine what action it should take at every time point. There are two types of strategies: pure strategies and mixed strategies. Let H be the set of all possible histories. A pure strategy is a function σ from histories H into actions A that returns the next action to perform. A mixed strategy is a pair π = (Σ, π_Σ), where Σ is a nonempty set of pure strategies and π_Σ is a probability distribution over Σ. Each agent α has exactly one strategy, which can be either pure or mixed. We denote the strategy of an agent α by strategy(α). A deterministic agent is one whose strategy is a pure strategy, and a probabilistic agent is one whose strategy is a mixed strategy.

An alternative definition of a mixed strategy is a mapping π′ from histories H to the set P_A of all probability distributions over the set A of actions. Thus, π′(τ) returns a probability distribution π′_A such that π′_A(a) is the probability of choosing a ∈ A in the situation reached after τ. For clarity, we define π′(τ,a) = π′_A(a). It is not hard to see that the two definitions of mixed strategies are mathematically equivalent. (More precisely, for each π there is a unique π′ that is equivalent to π, but for each π′ there are one or more π that are equivalent to π′.) Given π = (Σ, π_Σ), we construct π′ as follows: for each history τ, let Σ′ be the set of all pure strategies in Σ that can possibly generate τ, and let Σ′_a be the set of pure strategies in Σ′ that will output a ∈ A after generating τ. Then

π′_A(a) = ( Σ_{σ∈Σ′_a} π_Σ(σ) ) / ( Σ_{σ∈Σ′} π_Σ(σ) ).

Conversely, given π′, we can construct π = (Σ, π_Σ), where Σ is the set of all possible pure strategies and π_Σ(σ) = Π_{τ∈H, a∈A} ρ(τ,a), where ρ(τ,a) = π′_A(a) if σ chooses a after the history τ, and ρ(τ,a) = 0 otherwise. However, these conversions may not be computationally feasible, because they can require too many computational resources.

Suppose both our agent α_A and the opponent α_B are deterministic, and their pure strategies are σ_A = strategy(α_A) and σ_B = strategy(α_B), respectively. When the deterministic agents α_A and α_B interact, only one possible interaction trace can be generated, namely the interaction trace trace(σ_A,σ_B) = τ_n, where τ_i is defined recursively as follows: τ_0 = ⟨⟩, and τ_{i+1} = τ_i ⋅ ⟨(σ_A(τ_i), σ_B(τ_i))⟩ for i = 0,...,n−1. We assume α_A's performance is measured by a utility function u_A, which is a function from interaction traces to non-negative real numbers. Thus, the reward for α_A when its opponent is α_B is u_A(trace(σ_A,σ_B)).

Consider an agent α that uses a mixed strategy π = (Σ, π_Σ). We will call the pure strategies in Σ the possible strategies for α. In each game, α will choose one strategy σ ∈ Σ according to the probability distribution π_Σ, and its opponent will not know which strategy was chosen. We call the chosen strategy α's actual strategy in this game.

Suppose the opponent α_B's strategy is a mixed strategy π = (Σ_B, π_{Σ_B}). The overall performance of our agent α_A using a pure strategy σ_A in its interactions with α_B is α_A's expected utility, which is

EU_A(σ_A, π) = Σ_{σ_B ∈ Σ_B} π_{Σ_B}(σ_B) · u_A(trace(σ_A, σ_B)).
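As an illustration of trace(σ_A, σ_B) and EU_A, the following sketch (hypothetical Python, not the dissertation's code) models a pure strategy as a function from the history so far to the next action; play_trace, expected_utility, and the always-cooperate, Tit-For-Tat, and always-defect strategies used in the example are illustrative assumptions.

    from typing import Callable, Dict, List, Tuple

    Action = str
    Trace = List[Tuple[Action, Action]]
    Strategy = Callable[[Trace], Action]   # a pure strategy: history -> next action

    def play_trace(sigma_a: Strategy, sigma_b: Strategy, n: int) -> Trace:
        """trace(sigma_a, sigma_b): the unique trace two deterministic strategies generate.
        Each side sees its own view of the history (its own actions listed first)."""
        tau: Trace = []
        for _ in range(n):
            a = sigma_a(tau)
            b = sigma_b([(y, x) for (x, y) in tau])  # the opponent's view swaps the roles
            tau.append((a, b))
        return tau

    def expected_utility(sigma_a: Strategy,
                         mixed: Dict[Strategy, float],     # pure strategy -> probability
                         utility: Callable[[Trace], float],
                         n: int) -> float:
        """EU_A(sigma_a, pi): probability-weighted utility over the opponent's pure strategies."""
        return sum(p * utility(play_trace(sigma_a, sigma_b, n))
                   for sigma_b, p in mixed.items())

    # Example: always-cooperate against a mixture of Tit-For-Tat and always-defect.
    all_c: Strategy = lambda h: "C"
    tit_for_tat: Strategy = lambda h: "C" if not h else h[-1][1]
    all_d: Strategy = lambda h: "D"
    ipd_payoff = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}
    u_a = lambda tau: float(sum(ipd_payoff[x] for x in tau))
    print(expected_utility(all_c, {tit_for_tat: 0.5, all_d: 0.5}, u_a, 5))  # 7.5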
2.3 Synthesis of Strategies from Interaction Traces

Section 2.3.1 presents an algorithm that can reconstruct the strategy of a single agent α_A, given a collection of interaction traces generated by α_A. Section 2.3.2 shows how to take a collection of traces that were generated by more than one agent, and construct a new strategy, called a composite strategy, that combines parts of the strategies of all of those agents. Section 2.3.3 gives an algorithm to find the optimal one of these composite strategies.

2.3.1 Reconstructing an Agent's Strategy

Suppose we play a pure strategy σ against a mixed strategy π = (Σ, π_Σ), where Σ = {σ1,σ2,σ3}. Then the set of all possible interaction traces is T = {trace(σ,σj) : j ∈ {1,2,3}}. T contains enough information for us to construct a new strategy σ_T whose interaction traces with π will be exactly the same as σ's. Figure 2.2 gives the pseudocode for an agent CA(T) that can generate and play the strategy σ_T. We will call CA(T) a composite agent, and σ_T = strategy(CA(T)) a composite strategy. The strategy σ_T is partial, i.e., it is only defined over some of the possible histories. However, as we will see in Theorem 1, σ_T is π-total, i.e., it is defined over the set of all histories that are possible when playing it against π. In other words, in any history τ of interactions between CA(T) and the agent using π, CA(T) will always be able to determine what its next action a_{|τ|+1} should be.

To see how the CA agent works, suppose that the three interaction traces in T are

τ1 = ⟨(C,C),(C,D),(D,C)⟩,
τ2 = ⟨(C,C),(C,C),(C,C)⟩,
τ3 = ⟨(C,D),(C,D),(D,C)⟩.

Next, suppose we play CA(T) against an agent α′ using π. Since α_A's first action is C in all three traces, we have A1 = {C} at Line 5 of CA, so CA(T) returns a1 = C as its first action. Suppose α′'s first action is C. Then α′ cannot be using the strategy that produced τ3, so Line 11 of the agent removes τ3 from T′1, leaving τ1 and τ2 in T2. Hence, in the next interaction with α′, CA(T) will choose a2 = C. Suppose α′'s second action is b2 = D. Then τ2 is removed from T′2, and CA(T)'s next action is D. Hence the composite agent CA(T) ended up "replaying" the interaction trace τ1.

Agent CA(T)   /* a composite agent synthesized from T */
1.  i := 1; Ti := T
2.  Loop until the end of the game
3.    If Ti = ∅, then exit with failure because T is insufficient
4.    If Ti ≠ ∅, then
5.      Ai := {a : (∃τ ∈ Ti) a is the i-th action of α_A}
6.      If |Ai| ≠ 1, then exit with failure because T is incompatible
7.      If |Ai| = 1, then let ai be the action in Ai
8.    Output ai and receive the other agent's action bi
9.    T′i := Ti
10.   For each τ ∈ Ti,
11.     If the i-th interaction in τ isn't (ai,bi), remove τ from T′i
12.   Ti+1 := T′i; i := i+1

Figure 2.2: The pseudocode of a composite strategy.
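The replay-and-filter loop of Figure 2.2 can be sketched as follows (hypothetical Python, not the dissertation's implementation); the class name CompositeAgent and the exception InsufficientTraces are illustrative, and the sketch assumes the trace set T is compatible in the sense defined in Section 2.3.2.

    from typing import List, Tuple

    Action = str
    Trace = List[Tuple[Action, Action]]

    class InsufficientTraces(Exception):
        """Raised when no remaining trace matches the observed history."""

    class CompositeAgent:
        """Plays the composite strategy sigma_T by replaying a set of traces T."""

        def __init__(self, traces: List[Trace]):
            self.candidates = list(traces)   # T_i: traces still consistent with play so far
            self.i = 0                       # number of completed interactions

        def next_action(self) -> Action:
            if not self.candidates:
                raise InsufficientTraces("T is insufficient for this opponent")
            actions = {tau[self.i][0] for tau in self.candidates}
            if len(actions) != 1:
                raise ValueError("T is incompatible: traces disagree on our next action")
            return actions.pop()

        def observe(self, a: Action, b: Action) -> None:
            """After playing a and seeing the opponent's b, discard inconsistent traces."""
            self.candidates = [tau for tau in self.candidates if tau[self.i] == (a, b)]
            self.i += 1

    # Replaying the example traces tau1, tau2, tau3 from the text:
    agent = CompositeAgent([
        [("C", "C"), ("C", "D"), ("D", "C")],   # tau1
        [("C", "C"), ("C", "C"), ("C", "C")],   # tau2
        [("C", "D"), ("C", "D"), ("D", "C")],   # tau3
    ])
    a1 = agent.next_action()      # "C"
    agent.observe(a1, "C")        # opponent plays C, so tau3 is discarded
    a2 = agent.next_action()      # "C"
    agent.observe(a2, "D")        # opponent plays D, so tau2 is discarded
    a3 = agent.next_action()      # "D"  (replaying tau1)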
The following theorem provides a condition under which CA(T) will always generate the same action that α_A would generate against an opponent α_B using a mixed strategy π:

Theorem 1 Let σ_A and π = (Σ_B, π_{Σ_B}) be pure and mixed strategies, respectively. If T = {trace(σ_A,σ_B) : σ_B ∈ Σ_B}, then trace(CA(T),σ_B) = trace(σ_A,σ_B) for every σ_B ∈ Σ_B.

Proof. Without loss of generality, let σ_B ∈ Σ_B be the actual strategy of α_B, and let τ = trace(σ_A,σ_B) be the interaction trace between σ_A and σ_B. Suppose there exists i such that (ai,bi) at Line 8 of Figure 2.2 is not the i-th interaction (a′i,b′i) in τ, but, for 1 ≤ j < i, (aj,bj) is the j-th interaction in τ. First, all interaction traces in Ti have the same prefix up to the (i−1)-th iteration (any traces without this prefix were removed at Line 11 on a previous iteration) and α_A is a deterministic agent. Therefore all traces in Ti (including τ) have the same i-th action a′i, and the composite agent will certainly output a′i at iteration i. Thus, ai = a′i. Second, σ_B will certainly output b′i at the i-th interaction, given the action sequence a1,a2,...,ai−1 as its inputs; that is, bi = b′i. Hence, a contradiction occurs. Therefore (ai,bi) at Line 8 of Figure 2.2 is always the i-th interaction in τ, and trace(CA(T),σ_B) = τ. ∎

The notion of composite agents establishes the fact that the process of interaction-trace generation is reversible: not only can agents generate interaction traces, but in addition, interaction traces can be used to construct agent strategies. This fact opens up opportunities to create better strategies. For example, we can first record the interaction traces generated by an agent, modify the set of interaction traces, and then synthesize a new strategy. By carefully adding or replacing interaction traces in the set of recorded interaction traces, the agent using the new strategy can outperform the original agent. In this chapter, we propose one such modification scheme for producing better agents. The key idea is that by mixing the interaction traces of two or more agents together, the composite strategy synthesized from these interaction traces often outperforms every agent that contributed the interaction traces. The reason why this method works is simple: when different agents are good at dealing with different possible strategies of the opponent, we can combine the good strategies together so that the composite strategy can be more successful in dealing with the same opponent or similar opponents.

2.3.2 Constructing New Strategies

The previous section showed how to reconstruct a strategy of a single agent α_A against an agent α_B, given α_A's interaction traces with α_B. The same approach can be used to construct new strategies from a collection of traces generated by more than one agent, provided that two conditions, called compatibility and sufficiency, are satisfied.

To illustrate the notion of compatibility, consider two well-known strategies for the IPD: Tit-For-Tat (TFT) and Tit-For-Two-Tats (TFTT). In the usual setting, the optimal strategy against TFT is to cooperate in every iteration, and the optimal strategy against TFTT is to defect and cooperate alternately. For an IPD game of five iterations, these can be represented by the following two interaction traces:

τ1 = ⟨(C,C),(C,C),(C,C),(C,C),(C,C)⟩,
τ2 = ⟨(D,C),(C,C),(D,C),(C,C),(D,C)⟩.

However, no pure strategy can generate both of these interaction traces, because in the first interaction, no agent can choose both Cooperate and Defect simultaneously. Thus, there is no single agent that can act optimally against both TFT and TFTT, and we say that τ1 and τ2 are incompatible with each other. If we run CA({τ1,τ2}), it will return an error message at Line 6 of Figure 2.2.

Now consider an agent that defects in the first iteration and then cooperates for the rest of the game. When this agent plays against TFT, the interaction trace is

τ3 = ⟨(D,C),(C,D),(C,C),(C,C),(C,C)⟩.
Although this strategy is not optimal against TFT, it is compatible with τ2: both τ2 and τ3 produce D for a1 and C for a2, and the opponent's response at the end of the 2nd interaction gives us enough information to decide which of τ2 and τ3 to use thereafter. If the opponent's second action b2 is C, then we discard τ3 and continue with τ2, and if b2 is D, we discard τ2 and continue with τ3. This is exactly what CA({τ2,τ3}) does when it plays against a mixed strategy that includes TFT and TFTT.

We now formalize the concepts introduced in the above example, to provide necessary and sufficient conditions on T for the synthesis of a composite strategy.

Definition 1 The longest common prefix of two action sequences λ = ⟨a1,a2,...,an⟩ and λ′ = ⟨a′1,a′2,...,a′n⟩ is the longest action sequence lcp(λ,λ′) that is a prefix of both λ and λ′.

We now define a condition called compatibility that (as we will see in Theorem 2) is necessary for a set of interaction traces T to be used successfully by CA. The interaction traces do not all need to be generated by the same agent.

Definition 2 Two interaction traces τ1 and τ2 are compatible if either (1) τ1^A = τ2^A, or (2) |lcp(τ1^A, τ2^A)| > |lcp(τ1^B, τ2^B)|. Otherwise, τ1 and τ2 are incompatible.

Definition 3 A set T of interaction traces is compatible if and only if there is no incompatible pair of interaction traces in T.
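The following sketch (hypothetical Python, not from the dissertation) implements the checks behind Definitions 1-3: the longest common prefix of two action sequences, pairwise compatibility of two traces, and compatibility of a whole trace set; the names lcp_len, compatible, and compatible_set are illustrative.

    from itertools import combinations
    from typing import List, Sequence, Tuple

    Action = str
    Trace = List[Tuple[Action, Action]]   # [(a1,b1), ..., (an,bn)]

    def lcp_len(xs: Sequence[Action], ys: Sequence[Action]) -> int:
        """Length of the longest common prefix of two action sequences (Definition 1)."""
        n = 0
        for x, y in zip(xs, ys):
            if x != y:
                break
            n += 1
        return n

    def compatible(t1: Trace, t2: Trace) -> bool:
        """Definition 2: either our action sequences are identical, or they agree
        strictly longer than the opponent's action sequences do."""
        a1, b1 = [a for a, _ in t1], [b for _, b in t1]
        a2, b2 = [a for a, _ in t2], [b for _, b in t2]
        return a1 == a2 or lcp_len(a1, a2) > lcp_len(b1, b2)

    def compatible_set(traces: List[Trace]) -> bool:
        """Definition 3: no pair of traces in the set is incompatible."""
        return all(compatible(t1, t2) for t1, t2 in combinations(traces, 2))

    # tau1 (all-cooperate vs. TFT) and tau2 (alternating vs. TFTT) are incompatible,
    # whereas tau2 and tau3 (defect once, then cooperate, vs. TFT) are compatible.
    tau1 = [("C", "C")] * 5
    tau2 = [("D", "C"), ("C", "C"), ("D", "C"), ("C", "C"), ("D", "C")]
    tau3 = [("D", "C"), ("C", "D"), ("C", "C"), ("C", "C"), ("C", "C")]
    assert not compatible(tau1, tau2)
    assert compatible(tau2, tau3)
    assert compatible_set([tau2, tau3])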
Even if a set T of interaction traces is compatible, CA(T) will not always be able to use them against an opponent α_B using a mixed strategy π unless there is at least one interaction trace for each of α_B's possible strategies. The following definition formalizes this notion.

Definition 4 Let π = (Σ_B, π_{Σ_B}) be a mixed strategy. A set T of interaction traces is π-sufficient if and only if for every strategy σ_B ∈ Σ_B, T contains an interaction trace τ = ⟨(a1,b1),...,(an,bn)⟩ such that the action sequence ⟨b1,...,bn⟩ can be generated by σ_B given ⟨a1,...,an⟩ as its inputs.

The following theorem shows that compatibility and π-sufficiency are necessary and sufficient to guarantee that CA(T) will always be able to play an entire game against π.

Theorem 2 The composite agent CA(T) will never exit with failure when it plays against an opponent whose mixed strategy is π = (Σ_B, π_{Σ_B}) if and only if T is compatible and π-sufficient.

Proof. The "only if" part is trivial. For the "if" part, suppose T is compatible and π-sufficient. Without loss of generality, let σ ∈ Σ_B be the pure strategy that the opponent chooses to use at the start of the game. By induction on i, we prove that CA(T) does not exit with failure at the i-th iteration. More precisely, we prove that (1) |Ai| = 1 at Line 6, and (2) there exists τ ∈ Ti such that τ can be generated by σ (i.e., Ti ≠ ∅ at Line 3), for 1 ≤ i ≤ n.

First, let us consider i = 1. For any two interaction traces τ1, τ2 ∈ T, the first actions in τ1^A and τ2^A must be the same, since T is compatible. Therefore, |A1| = 1. Since T is π-sufficient and Σ_B is nonempty, it follows that there is at least one interaction trace τ ∈ T that can be generated by σ. Thus, T1 ≠ ∅.

Now consider i = k + 1. Suppose |Ak| = 1 and there exists τ ∈ Tk that can be generated by σ. First, since all traces in Tk have the same prefix up to the (k−1)-th iteration and σ is a pure strategy, σ will return a unique bk at Line 8, which is the k-th action of τ (otherwise τ is not generated by σ). In addition, by Definition 2, for any two interaction traces τ1, τ2 ∈ Tk, the first k actions in τ1^A and τ2^A must be the same. Thus, the k-th action of τ^A must be ak, the action generated at Line 8. Therefore, (ak,bk) is the k-th interaction of τ, and τ will not be removed from T′k at Line 11. Hence τ ∈ Tk+1, since Tk+1 = T′k at Line 12. Second, |Ak+1| ≥ 1 because |Tk+1| ≥ 1. Third, at the start of iteration k + 1, all interaction traces in Tk+1 have the same prefix up to the k-th iteration (any traces without this prefix were removed at Line 11 on a previous iteration). By Definition 2, for any two interaction traces τ1, τ2 ∈ Tk+1, the first k + 1 actions in τ1^A and τ2^A must be the same; consequently, |Ak+1| ≤ 1. Therefore, |Ak+1| = 1. By induction, |Ai| = 1 and Ti ≠ ∅ for 1 ≤ i ≤ n. ∎

2.3.3 Finding the Best Composite Strategy

We now consider the problem of finding a strategy against a mixed strategy. It is not difficult to show, by reduction from 3SAT, that the problem of finding an optimal strategy (a strategy with the highest expected utility against a mixed strategy) is NP-hard.

Theorem 3 The problem of finding a strategy with an expected utility of at least K against a mixed strategy π = (Σ, π_Σ) is NP-hard.

Proof. We show that 3SAT is polynomial-time reducible to this problem. Let f be a Boolean formula in conjunctive normal form with Boolean variables x1, x2, ..., xn and clauses C1, C2, ..., Cm. For each Ci, we construct an agent α_i whose inputs are either True or False, and whose output is always the same action, namely Nil. The strategy σ_i = strategy(α_i) is a constant function that always returns Nil. The number of interactions is exactly N; thus any agent interacting with α_i will terminate after N interactions. The utility of the interaction trace τ = ⟨(vj,Nil)⟩_{j=1..n} is 1 if and only if the assignment of vj to xj for j = 1,...,n satisfies the clause Ci; otherwise u_A(τ) is 0.

We argue that f is satisfiable if and only if there exists an agent whose expected utility is at least 1 when it interacts with an opponent using a mixed strategy π = (Σ_B, π_{Σ_B}), where Σ_B = {σ1, σ2, ..., σm} and π_{Σ_B} is a uniform probability distribution over Σ_B. Clearly, if f is satisfiable by an assignment of vj to xj for j = 1,...,n, we can construct an agent α_A with a strategy σ_A that chooses vj in the j-th interaction. Then α_A would have an expected utility of 1, because u_A(trace(σ_A,σ_i)) = 1 for every σ_i ∈ Σ_B. If there exists an agent α_A whose expected utility is at least 1, then (1) u_A(trace(σ_A,σ_i)) = 1 for all σ_i ∈ Σ_B, since π_{Σ_B} is a uniform probability distribution, and (2) α_A's sequence of actions is the same when it interacts with any two strategies σ_i, σ_j ∈ Σ_B, since both σ_i and σ_j always return Nil, and α_A, as a deterministic agent, cannot return two different actions at the same interaction for σ_i and σ_j. Thus, the sequence of actions generated by α_A would satisfy every clause in f. ∎

Therefore, instead of finding the optimal strategy, we find the best composite strategy that can be synthesized from a given set of interaction traces, and then show experimentally that the best composite strategy performs well in practice.

Let π = (Σ_B, π_{Σ_B}) be a mixed strategy used by the opponent α_B. Suppose we are given a set T_B of interaction traces that were played against α_B, and suppose that for each interaction trace τ ∈ T_B, we know the utility u_A(τ) for the agent that played against α_B. Let 𝒯 = {T ⊆ T_B : T is compatible and π-sufficient}, and for each T ∈ 𝒯, let σ_T be the composite strategy constructed from T. Then the optimal composite strategy problem is the problem of finding a composite strategy σ_T* such that T* ∈ 𝒯 and EU_A(σ_T*, π) ≥ EU_A(σ_T, π) for every T ∈ 𝒯. We say that σ_T* is (T_B, π)-optimal.
Here is another formulation of the optimal composite strategy problem that is equivalent to the one above. Suppose we are given sets of interaction traces T_1, ..., T_m from games against several different agents α_1, ..., α_m, and for each τ_i ∈ T_i we are given the utility u_A(τ_i) for the agent that played against α_i. Furthermore, suppose we are given numbers p_1, ..., p_m such that p_i is the probability that we will need to play against α_i. Now consider the problem of finding a strategy with an optimal expected utility against these agents. This is equivalent to an optimal composite strategy problem in which T_B = T_1 ∪ ... ∪ T_m and π = ({σ_1, ..., σ_m}, φ), where σ_i = strategy(α_i) and φ(σ_i) = p_i for each i.

Procedure CIT(k, {T_i}_{i=1..m}, {p_i}_{i=1..m})
 1. If k > n, then   /* n is the maximum number of interactions */
 2.   For i = 1 to m, choose any one trace, namely τ_i, in T_i
 3.   Return ({τ_i}_{i=1..m}, ∑_{i=1..m} p_i · u_A(τ_i))
 4. Else
 5.   A_i := {a_k : ⟨(a_j, b_j)⟩_{j=1..n} ∈ T_i}, for i := 1..m
 6.   B_i := {b_k : ⟨(a_j, b_j)⟩_{j=1..n} ∈ T_i}, for i := 1..m
 7.   A′ := A_1 ∩ A_2 ∩ ... ∩ A_m;  B′ := B_1 ∪ B_2 ∪ ... ∪ B_m
 8.   If A′ = ∅, then return (∅, −1)   /* incompatibility detected */
 9.   If |B_i| ≠ 1 for some i, then exit with failure.
10.   For i := 1 to m
11.     Partition T_i into T_i^{ab} for each pair (a,b) ∈ A′ × B′ such that
12.       T_i^{ab} := {τ ∈ T_i : the k'th interaction in τ is (a,b)}
13.   For each (a,b) ∈ A′ × B′
14.     T^{ab} := {T_i^{ab} : 1 ≤ i ≤ m, T_i^{ab} ≠ ∅}
15.     P^{ab} := {p_i : 1 ≤ i ≤ m, T_i^{ab} ∈ T^{ab}}
16.     (T*^{ab}, U*^{ab}) := CIT(k+1, T^{ab}, P^{ab})   /* call CIT recursively */
17.   For each a ∈ A′
18.     If U*^{ab} ≥ 0 for all b ∈ B′, then
19.       T^a := ∪_{b∈B′} T*^{ab};  U^a := ∑_{b∈B′} U*^{ab}
20.     Else   /* i.e., T*^{ab} is not a solution for some b ∈ B′ */
21.       T^a := ∅;  U^a := −1   /* i.e., no solution */
22.   a_max := argmax_{a∈A′} U^a
23.   Return (T^{a_max}, U^{a_max})

Figure 2.3: The pseudocode of the CIT algorithm.

As an example, suppose we want to play the IPD against two agents α_1 = TFT and α_2 = TFTT, who will be our opponents with probabilities p_1 = 0.7 and p_2 = 0.3, respectively. Suppose we are given T_1 = {τ1, τ2, τ3}, where

  τ1 = ⟨(C,C),(C,C),(D,C)⟩;  u_A(τ1) = 11.0;
  τ2 = ⟨(D,C),(C,D),(C,C)⟩;  u_A(τ2) = 8.0;
  τ3 = ⟨(D,C),(D,D),(D,D)⟩;  u_A(τ3) = 7.0;

and T_2 = {τ′1, τ′2, τ′3}, where

  τ′1 = ⟨(C,C),(C,C),(C,C)⟩;  u_A(τ′1) = 9.0;
  τ′2 = ⟨(D,C),(C,C),(D,C)⟩;  u_A(τ′2) = 13.0;
  τ′3 = ⟨(D,C),(D,C),(D,D)⟩;  u_A(τ′3) = 11.0.

We can map this into the optimal composite strategy problem by letting π = (Φ_B, φ_B) and T_B = T_1 ∪ T_2, where Φ_B = {σ_1, σ_2}, φ_B(σ_1) = 0.7, and φ_B(σ_2) = 0.3. There are nine subsets of T_B that contain one trace for each of σ_1 and σ_2 (and hence are π-sufficient), namely {τ_j, τ′_k} for j = 1,2,3 and k = 1,2,3. Only two of these nine sets are compatible: {τ2, τ′2} and {τ3, τ′3}. Of the two sets, {τ2, τ′2} is the (T_B, π)-optimal one, because EU_A(CA({τ2, τ′2}), π) = 9.5 > 8.2 = EU_A(CA({τ3, τ′3}), π). Hence, our (T_B, π)-optimal strategy is to choose D and C in the first two iterations, and then choose C (or D) in the third iteration if the opponent chooses D (or C) in the second iteration, respectively.

Figure 2.3 shows an algorithm, CIT, that can be used to solve the above problem. CIT is a recursive algorithm that works by analyzing interaction traces played against a set of agents {α_1, ..., α_m}. Its inputs include, for each α_i, a set of interaction traces T_i and a probability p_i that we will have α_i as our opponent, and a number k that represents how many moves deep we have gone in our analysis of the interaction traces.
Figure 2.4: The search space of the CIT algorithm on the example above: a tree in which, for each k, a layer of max nodes (our possible actions) is followed by a layer of sum nodes (the opponents' possible responses), with the expected utilities of the terminal nodes (e.g., 0.3 × 13 = 3.9 and 0.7 × 8 = 5.6) propagated up to EU = 9.5 at the root S0.

Example. We will now illustrate CIT's operation on the same example that we discussed earlier. In Figure 2.4, the node S0 represents the initial call to CIT. The two sets of traces are T_1 and T_2 described earlier, and the superscripts on these traces are the probabilities p_1 = 0.7 and p_2 = 0.3 described earlier. Each path from S0 to the bottom of the tree is one of the interaction traces, and the value k = 1 at S0 indicates that we are currently looking at the first interaction in each trace.

At the k'th interaction, for each k, we need to consider both our possible moves and the opponent's possible moves. Although these moves are made simultaneously, we can separate the computation into two sequential choices: for each value of k, the higher layer of nodes corresponds to our possible moves, and the lower layer of nodes corresponds to the opponent's possible moves. In the higher layer, the expected utility of each node can be computed by taking the maximum of the expected utilities of the child nodes; hence we call these nodes max nodes. In the lower layer, the expected utility of each node can be computed by adding the expected utilities of the child nodes; hence we call these nodes sum nodes. (Mathematically, what we are computing here is a weighted average; but each number has already been multiplied by the appropriate weight, so we just need to add the numbers together.) In our example, the max node at k = 1 is S0, and the sum nodes at k = 1 are S1 and S5. The edges between the max node and the sum nodes correspond to our actions that can lead from the max node to the sum nodes. For instance, if our action is C at S0, the next sum node is S1; otherwise, the next sum node is S5. The edges between the sum nodes at k = 1 and the max nodes at k = 2 correspond to the actions that can be chosen by the opponent. At k = 1, the opponent can only choose C, but at S7, at k = 2, the opponent can choose either C or D, thus leading to two different max nodes S8 and S11.

A terminal node corresponds to the set of interaction traces that are consistent with the actions on the path from S0 to that terminal node. The expected utility of a terminal node is the sum, over the sets of interaction traces at that node, of the probability of each set (denoted by the superscripts in Figure 2.4) times the utility of any interaction trace in the set. For instance, at S10, the expected utility is ∑_{i=1..m} p_i · u_A(τ_i) = 0.3 × 13 = 3.9. Notice that all interaction traces in a set of interaction traces at a terminal node are the same; thus the algorithm can choose any τ_i from T_i at Line 2, since they all have the same utility.
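To make the max/sum propagation concrete, here is a minimal Python sketch of the recursion in Figure 2.3. It is an illustrative reconstruction, not the dissertation's implementation: it assumes each trace is stored as a plain list of (our action, opponent action) pairs together with its known utility, it ignores the indexing schemes mentioned below, and the names (Trace, cit) are mine.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Trace:
    moves: List[Tuple[str, str]]   # [(our_action, opponent_action), ...]
    utility: float                 # u_A of the whole trace

def cit(k, trace_sets, probs, n):
    """Return (chosen traces, expected utility); (None, -1.0) signals incompatibility.

    trace_sets[i] holds traces played against opponent i (probability probs[i]);
    every trace in trace_sets[i] agrees on its first k interactions."""
    if k >= n:  # terminal node: the remaining traces in each set are identical
        return ([Ts[0] for Ts in trace_sets],
                sum(p * Ts[0].utility for Ts, p in zip(trace_sets, probs)))
    # Our k-th action must be the same no matter which opponent we are facing.
    A = set.intersection(*[{t.moves[k][0] for t in Ts} for Ts in trace_sets])
    if not A:
        return (None, -1.0)                    # incompatibility detected
    B = set().union(*[{t.moves[k][1] for t in Ts} for Ts in trace_sets])
    best_traces, best_eu = None, -1.0
    for a in A:                                # max node: branch on our action
        chosen, total, ok = [], 0.0, True
        for b in B:                            # sum node: branch on the opponents' replies
            sub = [([t for t in Ts if t.moves[k] == (a, b)], p)
                   for Ts, p in zip(trace_sets, probs)]
            sub = [(Ts, p) for Ts, p in sub if Ts]   # keep only opponents consistent with (a, b)
            if not sub:
                continue
            traces, eu = cit(k + 1, [Ts for Ts, _ in sub], [p for _, p in sub], n)
            if eu < 0:                         # the sub-branch is incompatible
                ok = False
                break
            chosen += traces
            total += eu
        if ok and total > best_eu:
            best_traces, best_eu = chosen, total
    return (best_traces, best_eu)

On the example above, cit(0, [T1, T2], [0.7, 0.3], 3) selects τ2 and τ′2 with expected utility 9.5, matching the hand computation.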
The CIT algorithm basically does a depth-first search on a tree such as the one shown in Figure 2.4, and propagates the expected utilities of the terminal nodes up to S0, together with the compatible set of interaction traces that gives each expected utility. The expected utility of a max node is the maximum of the expected utilities of its child nodes, whereas the expected utility of a sum node is the sum of the expected utilities of its child nodes. Notice that at each max node the CIT algorithm checks the compatibility of the node's set of interaction traces at Line 8. For example, at S4 the algorithm discovers that τ1 and τ′1 are incompatible. A failure signal (the expected utility −1) is then passed from S4 to its ancestor nodes. The max nodes and the sum nodes then ignore any solution carrying a failure signal, thus eliminating the incompatibility. This is one of the key differences between the CIT algorithm and other search algorithms for game trees or MDPs: the CIT algorithm operates directly on interaction traces and checks their compatibility as the search proceeds.

Running time. To achieve efficient performance in CIT we have used some indexing schemes that we will not describe here, due to lack of space. CIT's running time is O(nM), where n is the number of iterations and M = ∑_{i=1}^{m} |T_i|. In practice, the CIT algorithm is very efficient: it can find the optimal solution from a database of 500,000 interaction traces in less than 15 minutes on a computer with a 3.8GHz Pentium 4 CPU and 2GB of RAM.

Discussion. At first glance, CIT's search space (see Figure 2.4) looks somewhat like a game tree, but there are several important differences. First, our purpose is to compute an agent's entire strategy offline, rather than having the agent do an online game-tree search at every move of a game. Second, we are not searching an entire game tree, but are just searching through a set of interaction traces; hence the number of nodes in our tree is no greater than the total number of interactions in the interaction traces. Third, game-tree search always assumes, either explicitly or tacitly, a model of the opponent's behavior. (For example, minimax game-tree search assumes that the opponent will always use its dominant strategy. In perfect-information games such as chess, this opponent model has worked so well that it is taken more or less for granted. But in an imperfect-information variant of chess, it has been shown experimentally [52] that this model is not the best one; instead, it is better to use a model that assumes the opponent has very little idea of what its dominant strategy might be.) Rather than assuming any particular model of the opponent, we instead proceed from observations of agents' actual interactions.

2.3.4 Using a Base Strategy

Recall that CA(T)'s strategy is partial. In particular, although CA(T) will never exit with failure when playing against π when T is compatible and π-sufficient, it might do so if we play it against another opponent α′_B with a mixed strategy π′ = (Φ′_B, φ′_B) for which T is not π′-sufficient. However, in iterated games such as the IPD, there is enough overlap in the strategies of different agents that even if α′_B was not used to construct any of the traces in T, T may still be π′-sufficient. Even if T is not π′-sufficient, CA(T) may still be able to play against α′_B most of the time without exiting with failure.
To handle cases where CA(T) does exit with failure, we can modify the CA algorithm so that instead of exiting with failure, it uses the output produced by a "base strategy" σ_base, which may be TFT or any other strategy. We call this modified algorithm the modified composite agent (MCA), and its pseudocode is shown in Figure 2.5. The strategy of a modified composite agent is called the modified composite strategy.

Agent MCA(T_B, π, σ_base)   /* a modified composite agent */
1. T* := CIT(0, T_B, π)   /* T* is (T_B, π)-optimal */
2. σ_A := CA(T*)   /* σ_A is a composite strategy */
3. Loop until the end of the game
4.   Get an action a from σ_A
5.   If σ_A fails, then get an action a from σ_base and σ_A := σ_base
6.   Output a and receive the other agent's action b
7.   Give b to σ_A as its input

Figure 2.5: Algorithm for a composite agent with a base strategy.

A modified composite agent may be viewed in either of two ways:

- As a composite strategy that can fall back on σ_base when the composite strategy is insufficient.
- As an enhanced version of σ_base, in which we modify σ_base's moves in cases where the set of traces T_B might yield a better move. From this viewpoint, T_B is a case base that is processed by the CIT algorithm so that cases can be selected quickly when they are appropriate.

2.4 Experimental Evaluation

We evaluated our technique in three well-known games: the Iterated Prisoner's Dilemma (IPD), the Iterated Chicken Game (ICG), and the Iterated Battle of the Sexes (IBS). The IPD and ICG are the iterated versions of the Prisoner's Dilemma [3] and the Game of Chicken [23], whose payoff matrices are shown in Tables 2.1 and 2.2. The IBS is the iterated version of the Battle of the Sexes [42], whose payoff matrix is shown in Table 2.3. To allow arbitrary agents to play against each other without having to take on different roles (hence different strategies), we needed to reformulate the IBS to make the roles of Husband and Wife interchangeable. This was quite easy to do, as shown in Table 2.3: for the Wife we renamed Football to C and Opera to D, and for the Husband we renamed Football to D and Opera to C.

2.4.1 Experimental Design

To obtain a large collection of agents for the games, we asked the students in several advanced-level AI classes to contribute agents. We did not tell the students the exact number of iterations in each game, but did tell them that it would be at least 50 (in all of our experiments we used 200 iterations). The students contributed 43 IPD agents, 37 ICG agents, and 37 IBS agents. For each game, we also contributed 9 more agents that used the following well-known strategies: ALLC, ALLD, GRIM, NEG, PAVLOV, RAND, STFT, TFT, and TFTT. (These are often used as standard strategies; e.g., they were used as standard entries in the 2005 IPD tournament [33].) This gave us a total of 52 IPD agents, 46 ICG agents, and 46 IBS agents.

Table 2.1: The payoff matrix of the Prisoner's Dilemma.

  Prisoner's Dilemma         Player B: Cooperate (C)   Player B: Defect (D)
  Player A: Cooperate (C)    (3,3)                     (0,5)
  Player A: Defect (D)       (5,0)                     (1,1)

Table 2.2: The payoff matrix of the Chicken Game.

  Chicken Game               Player B: Cooperate (C)   Player B: Defect (D)
  Player A: Cooperate (C)    (4,4)                     (3,5)
  Player A: Defect (D)       (5,3)                     (0,0)

Table 2.3: The payoff matrix of the Battle of the Sexes.

  Battle of the Sexes        Husband: D (was Football)   Husband: C (was Opera)
  Wife: C (was Football)     (1,2)                       (0,0)
  Wife: D (was Opera)        (0,0)                       (2,1)

In the rest of this section, we will call these the original agents, to distinguish them from the composite agents generated by our algorithms. In each game, each player's utility is the sum of that player's payoff in each of the 200 iterations in an interaction trace.
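As a concrete illustration of how scores are computed, the sketch below encodes the payoff matrices of Tables 2.1 through 2.3 and sums player A's payoffs over the interactions of a trace (200 of them in our games). It is illustrative only; the dictionary and function names are mine.

# Payoffs from Tables 2.1-2.3, indexed by (A's action, B's action) -> (payoff to A, payoff to B).
PAYOFFS = {
    "IPD": {("C", "C"): (3, 3), ("C", "D"): (0, 5), ("D", "C"): (5, 0), ("D", "D"): (1, 1)},
    "ICG": {("C", "C"): (4, 4), ("C", "D"): (3, 5), ("D", "C"): (5, 3), ("D", "D"): (0, 0)},
    "IBS": {("C", "C"): (0, 0), ("C", "D"): (1, 2), ("D", "C"): (2, 1), ("D", "D"): (0, 0)},
}

def utility(game, trace):
    """Sum player A's payoff over every (a, b) interaction in the trace."""
    return sum(PAYOFFS[game][(a, b)][0] for a, b in trace)

For instance, utility("IPD", [("C","C"), ("C","C"), ("D","C")]) returns 11, matching u_A(τ1) in the example of Section 2.3.3.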
For each agent α, we wanted to find out how much improvement we could get by replacing α with a modified composite agent whose base strategy is α's strategy. We investigated this by doing a 5-fold cross-validation experiment that worked as follows. For each of the three games (IPD, ICG, and IBS), we took our original set of agents for the game and randomly and evenly partitioned it into five subsets Φ1, Φ2, Φ3, Φ4, Φ5. Then we repeated the following steps five times, once for each Φi:

1. For the test set Φtest, we chose Φi. For the training set Φtrain, we chose ∪_{j≠i} Φj.

2. We ran a 200-iteration tournament (the details are described below) among the agents in Φtrain (with one modification, also described below), and let Ttrain be the set of all interaction traces recorded in this tournament.

3. For each agent α ∈ Φtrain, we played 100 tournaments, each 200 iterations long, involving α and all of the agents in Φtest. We calculated the agent's average score, which is equal to (1 / (100·|Φtest|)) ∑_{1≤l≤100} ∑_{αk∈Φtest} {the score of α when it plays against αk in the l'th tournament}.

4. Likewise, for each agent α ∈ Φtrain, we played 100 tournaments, each 200 iterations long, involving a modified composite agent α′ = MCA(Ttrain, π, α) and all of the agents in Φtest, where π is a uniform probability distribution over Φtrain. Apart from the average score of α′, we also recorded the frequency with which α′ used its base strategy α.

Table 2.4: Among the original agents, how many of each type.

                         IPD   ICG   IBS
  Deterministic agents    34    22    17
  Probabilistic agents    18    24    29

All of our tournaments were similar to Axelrod's IPD tournaments [3] and the 2005 IPD tournament [33]. Each participant played against every participant, including itself (thus a tournament among n agents consists of n² iterated games).

One problem in Step 2 is that MCA's input needs to come from deterministic agents, but many of our students' agents were probabilistic (see Table 2.4). We handled this problem as follows. Each probabilistic agent α used a random number generator for which one can set a starting seed. By using ten different starting seeds, we created ten determinized versions of α, and we generated interaction traces using the determinized agents rather than α.

Note that we used the determinized agents only during training. The modified composite agents generated from the training data are quite capable of being played against probabilistic agents, so we played them against the probabilistic agents in our tests.
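The determinization trick above can be sketched as a thin wrapper that pins an agent's random number generator to a fixed starting seed, so that replaying the agent from the start always reproduces the same behavior. This is only an illustration of the idea, under the assumption that an agent can be constructed around a reseedable RNG; the class and method names are mine.

import random

class DeterminizedAgent:
    """Wrap a probabilistic agent so that, replayed from the start, it always behaves the same way."""
    def __init__(self, agent_factory, seed):
        self.agent_factory = agent_factory
        self.seed = seed
        self.reset()

    def reset(self):
        # Recreate the wrapped agent with its RNG pinned to the chosen starting seed.
        self.agent = self.agent_factory(random.Random(self.seed))

    def next_action(self, last_percept=None):
        return self.agent.next_action(last_percept)

# Ten determinized versions of one probabilistic agent, as in Section 2.4.1:
# determinized = [DeterminizedAgent(make_student_agent, seed) for seed in range(10)]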
2.4.2 Experimental Results

Table 2.5 tells how many traces, on average, were collected for each game, and how many traces were in the composite strategies generated by CIT.

Table 2.5: Average number of interaction traces collected during the training sessions, before and after removing duplicate traces, and average number of interaction traces in the composite strategies generated by the CIT algorithm.

                               IPD       ICG       IBS
  Before removing duplicates   29466.0   44117.4   60470.5
  After removing duplicates     7452.0   25391.5   29700.8
  Composite strategies           171.2     209.6     245.6

In each experiment, we calculated 4 average scores for each agent (one for each test set that did not contain the agent) and 4 average scores for each MCA agent. We repeated the above experiment 100 times using different partitions and random seeds. Figures 2.6 through 2.8 show the agents' overall average scores, each of which is an average of 400 average scores. Since each average score is an average of 100·|Φtest| scores, each data point in the figures is computed from the scores of 40000·n games, where n is the average number of agents in the test sets. Hence in the IPD each data point is computed from the scores of approximately 415769.2 games, and in both the ICG and the IBS each data point is computed from the scores of approximately 367826.1 games. The data file of the experiments can be downloaded at http://www.cs.umd.edu/~chiu/papers/Au08synthesis_data.tar.gz.

In each graph in Figures 2.6 through 2.8, the x axis shows the agents and the y axis shows their scores. The lower line shows the performance of each original agent (the agents are sorted in order of decreasing overall performance). The upper line shows the average performance of the corresponding MCA agents.

Figure 2.6: Overall average scores of the base agents and the MCA agents in the IPD. The agents are displayed on the x axis in order of decreasing score of the base agent. The error bars denote the 95% confidence intervals of the overall average scores.

Figure 2.7: Overall average scores of the base agents and the MCA agents in the ICG.

Figure 2.8: Overall average scores of the base agents and the MCA agents in the IBS.

From the graphs, we note the following. In all three games, the average scores of the MCA agents were as good as or better than those of the corresponding base agents. In the IPD, the differences in performance were usually small; we believe this is because most IPD agents have similar behaviors and therefore leave little room for improvement. In the ICG and IBS, the differences were usually quite large. Finally, in all three games, the MCA agents performed well even when their base agents performed poorly. For example, in the IPD and the IBS, when we incorporated our composite strategy into the weakest of the existing strategies, it more than doubled that strategy's score.

Figure 2.9 shows the increase in rank of each agent after incorporating the composite strategy into it, while all other agents did not incorporate the composite strategy. We can see that the ranks of most agents increased after the modification.

Figure 2.9: Increase in rank of each enhanced agent, relative to the corresponding base agent. The x axis is as in Figures 2.6 through 2.8 (one curve per game: IPD, ICG, IBS).

We conducted sign tests to see whether the overall average scores of the MCA agents are greater than those of the original agents. The p-values of one-sided sign tests are less than 0.00001 in all games. Therefore, the MCA agents are significantly better at the 99.9% level.
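The significance claim can be checked with an exact one-sided sign test: under the null hypothesis that an MCA agent is no better than its base agent, each paired comparison is a fair coin flip, and the p-value is the binomial tail probability of seeing at least the observed number of wins. The sketch below is standard statistics, not code from the dissertation.

from math import comb

def sign_test_p(wins, n):
    """One-sided sign test: P(X >= wins) for X ~ Binomial(n, 0.5), ignoring ties."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

# For example, if all 52 IPD agents improved, sign_test_p(52, 52) = 2**-52, far below 0.00001.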
Table 2.6 shows the average improvement that each MCA agent provided relative to the corresponding original agent. On average, the percentage improvements in score were about 5% in the IPD, 11% in the ICG, and 26% in the IBS; and the percentage improvements in rank were about 12% in the IPD, 38% in the ICG, and 33% in the IBS. Hence, the use of composite strategies greatly enhances the performance of most of the agents.

Table 2.6: Average increases in score and rank of the MCA agents, relative to the corresponding original agents.

                                   IPD      ICG      IBS
  Average MCA agent score          582.69   856.37   262.47
  Average original agent score     553.62   774.38   208.73
  Average difference in score       29.08    81.99    53.74
  Average % difference in score      5.3%    10.6%    25.7%
  Average increase in rank            6.0     17.4     15.2
  Average % increase in rank        11.5%    37.8%    33.0%

Table 2.7 compares the average scores of the best agents, with and without the composite strategy. We can see that the average scores are more or less the same. Thus, the best strategies in the tournaments do not benefit from the use of composite strategies. We believe the reason is that the best strategies already have a decent set of behaviors that are similar to the interaction traces in the composite strategies.

Table 2.7: Average scores of the best original agent, and the MCA agent whose base strategy is that agent.

                    IPD      ICG      IBS
  MCA agents        477.23   729.47   232.61
  Original agents   477.46   729.50   232.46

Table 2.8 shows how frequently the MCA agents invoked their base strategies.

Table 2.8: Average frequency of invocation of base strategies.

                                IPD     ICG     IBS
  Average percentage of games   15.4%   52.7%   54.4%

We can make the following observations.

- In more than 84% of the IPD games, the MCA agents did not need to use their base strategies at all. In other words, the MCA agents' composite strategies worked successfully throughout those games, even though the games were played with opponents other than the ones used to build the composite strategies. One possible reason for this high reusability of interaction traces is that the IPD is a well-known game, and thus most of the original strategies are similar to certain well-known strategies such as Tit-for-Tat.

- In the ICG and IBS, the MCA agents invoked their base strategies a little more than half of the time. We think the reason for this is that there was more diversity among the original strategies, perhaps because these games are not as well known as the IPD. But even though the MCA agents used their composite strategies less frequently than in the IPD, the composite strategies provided much more improvement, relative to the base strategy, than in the IPD. In other words, the places where MCA used its composite strategy provided a much bigger enhancement to the base strategy's performance.

2.5 Comparisons with Other Work

Reinforcement learning. One similarity between our approach and reinforcement learning is that our technique improves an agent's performance by incorporating previous experiences. But reinforcement learning agents usually use online learning (e.g., Q-learning agents learn and improve their policy while acting), whereas our composite agent uses offline learning to produce a composite strategy that can then be incorporated into an agent. A more important difference is that our technique does not require us to know the set of all possible states of the other agents, as opposed to most existing work on POMDPs or learning automata, in which the set of all possible states must be known beforehand (e.g., [28] and [41]). In open environments such as IPD tournaments in which the opponents'
strategies are not known, it would be difficult, if not impossible, to identify all possible states of the opponent?s strategy. Moreover, the Markovian assumption intrinsic to a policy or automaton does not always hold because an agent?s decision can depend on the entire history in a game. Our paper demonstrates that it is possible to perform well in certain open environments such as the IPD without knowing the set of the opponents? internal states. To the best of our knowledge, contemporary reinforcement learning tech- niques are, unlike our techniques, not yet efficient enough to compete with strategies such as TFT and Pavlov in the IPD. In future, we would like to run a much wider variety of experiments to see whether our technique can be applicable in other open environments. Modeling other agents. One approach for playing against a given opponent is to de- velop an opponent model that predicts the opponent?s likely behavior [14, 22, 27]. This model is then used to aid in developing strategies against that opponent. In contrast, we 43 focus on generating strategies directly from observations, without constructing an explicit opponent model. Case-basedreasoning. OurtechniquehassomesimilaritiestoDerivationalAnalogy[63] and the reuse of macro actions/operators [34], in which records of a problem-solver?s de- cisions or actions are used to solve new problems. Most work on derivational analogy and macro actions focuses on problems in which there is no need to interact with the environ- ment during problem solving. But there are some works on using derivational analogy or macro actions in interactive environments [13, 43, 54]. In these domains, it would be beneficial not to discard the observation sequences generated by the environments, since the observation sequences capture important information that can be used to determine whether an agent can utilize two different action sequences in the same problem (see Definition 2). Our work pushes this idea further by showing how to construct a strategy from interaction traces. Case-based reasoning techniques have been used to improve existing reinforcement learning techniques [25, 65]. But so far case-based reasoning played a supporting role only in these work. In contrast, our technique generates fully-functional agents out of previousproblemsolvingexperiences, withoutthehelpofexistingreinforcementlearning techniques. The winner of the 2005 IPD tournament is based on some sort of case-based rea- soning technique [40]. Similarly, the winning strategy of our ICG tournament is a combi- nation of three different ways to deal with three different type of opponents. The success of these strategies compels us to believe that one way to dominate monolithic strategies 44 such as Tit-for-Tat is to combine the winning strategies for different type of opponents together. One problem with the case-based reasoning strategies mentioned in the above para- graph is that the ways they combine strategies together are quite ad hoc, and only manage to consider a handful of possible opponents? strategies. Our work shows a systematic way to combine strategies that can be scaled up to handle a large number of different opponent?s strategies. 2.6 Summary The idea that an agent can improve its performance by observing interactions among other agents is not new. What is new in this paper is how to do it without the knowledge of the set of opponents? internal states. In open environments such as IPD tournaments, we know little about the opponents? 
strategy, let alone the current state of the opponents? strategy. Hence methods that do not require us to know the opponents? states can be quite valuable in such domains. To avoid using the notion of states or belief states as in policies or automata, our ap- proach directly selects and combines the records of the other agents? interactions to form a partial strategy called a composite strategy, which maximizes the expected utility with respect to an estimated probability distribution of opponents that can possibly generate those interactions, without knowing the states or other details of the strategies of these opponents. The composite strategy can be used to decide how an agent should respond to typical behaviors of opponents in an environment. Out of a set of 126 agents in three 45 iterated games (the IPD, ICG, and IBS), our technique significantly improved the agent?s rank (compared to the other agents) by 12% in the IPD, 38% in the ICG, and 33% in the IBS, after incorporating the partial strategy. In future, we would like to address the following issues: ? In iterated games such as the IPD, ICG, and IBS, players frequently encounter the game situations that have been seen before; hence combining the interaction traces was sufficient to provide large increases in an agent?s performance. By itself, this would not be suitable for large-scale non-zero-sum games such as chess, where games among experts usually lead to situations that no player has ever seen before. We would like to develop a way to combine our technique with game-tree search, to see how much that extends the scope of the technique. ? Composite strategies in our experiments usually do not fail even if some agents in the test sets are probabilistic. The reason is that agents in non-zero-games we con- sidered exhibit relatively small number of different behaviors, such that the sizes of the sizes of the composite strategies are large enough to cover most of these behav- iors. However, in other domains agents may exhibit so many different behavior that cannot be covered by a small number of interaction traces. In the future, we would like to address this problem when we extend this technique to other domains. ? We?ve done preliminary tests on some nondeterministic partially-observable plan- ning domains (e.g., transportation domains with unknown traffic patterns). Our approach worked quite well in our tests, but we need to do larger cross-validated experiments before reporting the results. In future, we intend to generalize our 46 method for domains that are more complicated than two-player repeated games. ? A significant limitation of our technique is that it requires an estimate of the prob- ability with which we?ll encounter each of opponents we have observed. In the future, we would like to study how to estimate the probability from a database of interaction traces and modify our techniques to alleviate this requirement. ? Our current formalization cannot handle problems whose horizon is infinite. We would like to see how to extend our algorithm to deal with interaction trace of potentially infinite length. ? The CIT technique can be considered as a technique for transfer learning, the trans- fer oflearned knowledge fromone situationto another situation(e.g., [6]). The idea is that the composite strategy learned in one non-zero-sum games should enable the system to learn more quickly how to play another non-zero-sum game. 
In future, we would like to see how to transfer interaction traces recorded in one situation to another in order to speed up the learning process. 47 Chapter 3 Task-Completion Problems with Goal Uncertainty In Chapter 2, we presented the Combining Interaction Traces (CIT) technique for synthesizing a strategy from a database of interaction traces. An interesting question is whether this technique is applicable to domains other than repeated games. There is a large class of problems in which an agent must perform a sequence of interactions with an environment that responds non-deterministically to these actions, in order to achieve a goal. Theenvironmentcanbeanyenvironmentwithcontingencies,includinganopponent in a game or a city with traffic lights and accidental events. In this chapter, we consider the task of adapting the CIT technique to these problems. It turns out that we cannot directly apply the CIT technique to all kind of problems inwhichanagenthastointeractwithanenvironment. Theissueisthatsomeproblemsare inherently unsolvable?there does not exist a strategy that can guarantee the success of an agent in accomplishing the given task. For these problems, it is impossible to synthesize a sure-win strategy, no matter how clever the strategy synthesis algorithm is. The concept of unsolvable problems in agent-environment interaction resembles impossibility theorems in distributed systems [29] and the concepts of observability and controllability in control systems. In distributed systems, researchers know that some problems (e.g., distributed consensus with one faulty process) are fundamentally unsolv- able. Remedying these problems is far from trivial, because it requires a modification to 48 the fundamental assumptions of the distributed system model people have been used. In control theory, controllability is about the possibility of steering a system to a particular state by using an appropriate control signal?if a state is not controllable, no control sig- nal can control the state. In short, it is impossible to design a controller to achieve certain desired properties in a system. But our focus in this chapter is not the solvability of all kinds of systems, but the solvability of problems in which an agent has to interact with an environment in order to accomplish certain tasks. Repeated games such as the IPD, of course, belongs to this class of problems. We will study task-completion problems, in which an agent has to interact with an environment in order to accomplish a goal set by a given task. There are many problems in which an agent needs to complete a given task by interacting with a partially observable and/or non-deterministic environment. In these problems, the challenge for the agent is to explore an unknown world while solving the problem. However, not every environment is safely explorable [57, page 125]; The exploration-exploitation dilemma [19, 20] implies that it is not always possible for the agent to do this successfully (or optimally), because duringexplorationtheagentmayreachastateinwhichitisnolongerpossibletocomplete the task. The problem is compounded by the uncertainty about goals?the goals in the problem can vary from one situation to another, and the agent is uncertain about the goals it needs to achieve at the beginning of the interaction. Thus, the strategy synthesis algorithm must take both the nondeterministic behavior of the environment and the goal uncertainty into account when creating a new strategy for solving the problems. 
First, we characterize problems that are unsolvable by using the compatibility of interaction traces. We provide theorems giving conditions under which it is possible or impossible to construct strongly successful strategies (i.e., strategies that are guaranteed to be successful). For cases where no strongly successful agent is possible, we provide mathematical results giving the probability that an agent can be successful. Second, we present provably correct algorithms that analyze a database of interaction traces from previous problem-solving episodes in order to construct a strongly successful strategy if one exists, or else a strategy that has the highest possible probability of success given the interaction traces. These algorithms can be considered adaptations of the CIT algorithm in Chapter 2 to task-completion problems. Finally, we provide theoretical and experimental results demonstrating the algorithms' performance.

To summarize, the contributions of this chapter are:

- We provide necessary and sufficient conditions (on an environment's set of possible interaction traces) for there to exist an agent that can always successfully accomplish a task. If a task-completion problem satisfies this condition, we say the problem is strongly solvable.
- We provide an algorithm that takes a collection of interaction traces from previous problem-solving episodes, and constructs an agent with the highest probability of success among all combinations of the interaction traces.
- We present experimental results showing that even with a small number of interaction traces, the agent constructed by our algorithm performs well.

3.1 Basic Definitions

Our definitions of agents and environments are based on Chapter 2 of [57]. An agent interacts with an environment by performing actions in some finite set A and receiving percepts in some finite set B. In repeated games, the environment is the opponent and the percepts are the opponent's actions. We assume the interactions between an agent and an environment e occur at discrete time points t1, t2, .... At each time ti, the agent performs an action ai ∈ A and receives a percept bi ∈ B.

An interaction trace of length N is a sequence of interactions ⟨(a1,b1),(a2,b2),...,(aN,bN)⟩. For simplicity, we also denote an interaction trace as a pair τ = (α, β), where α = ⟨a1,...,aN⟩ is an action sequence and β = ⟨b1,b2,...,bN⟩ is a percept sequence. (Although we assume that ai and bi occur simultaneously, our formalism is general enough to model environments where they are interleaved; this can be done by specifying that at odd time points the actions have no effect, and that at even time points the environment always returns a "null" percept.) Notice that the lengths of α and β must be equal. Let T* be the set of all interaction traces, i.e., T* = {(α, β) : α ∈ A*, β ∈ B*, |α| = |β|}, where A* is the set of all action sequences and B* is the set of all percept sequences. Let τ = (α, β). Then we let

  actions(τ) = τ^A = α;  percepts(τ) = τ^B = β.

By extension, if T is a set of interaction traces, then we let

  actions(T) = {τ^A : τ ∈ T};  percepts(T) = {τ^B : τ ∈ T}.

Let ξ be any sequence, such as an action sequence, a percept sequence, or an interaction sequence. Then [ξ]k denotes the k'th element of ξ, prefix_k(ξ) is the k-prefix of ξ, and suffix_k(ξ) is the k-suffix of ξ. More precisely, we define a k-prefix and a k-suffix as follows: let ξ = ⟨m1,m2,...,mn⟩, where each mi can be either an action or a percept. Then prefix_k(ξ) = ⟨m1,...,m_min(k,n)⟩ and suffix_k(ξ) = ⟨m_min(k,n)+1,...,mn⟩. If ξ
is an interaction trace τ, then prefix_k(τ) = (prefix_k(τ^A), prefix_k(τ^B)) and suffix_k(τ) = (suffix_k(τ^A), suffix_k(τ^B)).

3.1.1 Agents and Environments

We consider three types of environments: deterministic, nondeterministic, and probabilistic.

- A deterministic environment is a function e : A* → B. If α is an action sequence of length k, then e(α) is the percept returned by e at time t_{k+1}.
- A nondeterministic environment is a function ẽ : T* → 2^B. If τ is an interaction sequence of length k, then at time t_{k+1}, ẽ may return any of the percepts in ẽ(τ).
- A probabilistic environment is a pair ê = (ẽ, ρ), where ẽ is as defined above and ρ : T* × B → [0,1] is a function such that for every τ ∈ T*, the function ρ_τ(b) = ρ(τ, b) is a probability distribution over B.

In Chapter 2, we distinguished agents from their strategies, in spite of the one-to-one correspondence between an agent and the strategy the agent uses. But in the literature, an agent is sometimes treated as an action function, which actually is the strategy of the agent rather than the agent itself. In this chapter, action functions and strategies mean the same thing. A deterministic agent function is a function λ : B* → A; a nondeterministic agent function is a function λ̃ : T* → 2^A; and a probabilistic agent function is a pair λ̂ = (λ̃, ψ), where ψ_τ is a probability distribution over λ̃(τ) for every τ ∈ T*. A deterministic agent is an agent whose action function is deterministic; a nondeterministic agent is an agent whose action function is nondeterministic; and a probabilistic agent is an agent whose action function is probabilistic.

3.1.2 Equivalences

Mathematically, deterministic agent functions and deterministic environments are pure strategies, whereas probabilistic agent functions and probabilistic environments are mixed strategies. A nondeterministic agent function is equivalent to a set of pure strategies, and so is a nondeterministic environment.

To see why this is the case, consider a nondeterministic environment ẽ. One can construct a set E_ẽ of deterministic environments recursively by (1) E_0 = {{⟨⟩ ↦ b} : b ∈ ẽ(⟨⟩)}, (2) for k ≥ 0, E_{k+1} = {e ∪ {α·⟨a⟩ ↦ b} : e ∈ E_k, α ∈ dom(e), a ∈ A, b ∈ ẽ((α·⟨a⟩, e(α)))}, and (3) E_ẽ = ∪_{k≥0} E_k. We call each of the deterministic environments in E_ẽ a configuration of ẽ. When an agent interacts with ẽ, the agent can imagine that it is interacting with one of the configurations in E_ẽ, but it does not know which one it is. We call this configuration the actual configuration. According to this viewpoint, all the contingencies in a nondeterministic environment can be summarized by a single contingency: ignorance about which configuration is the actual one. That is why we consider ẽ to be equivalent to E_ẽ. Every nondeterministic environment has an equivalent set of deterministic environments.

Likewise, let ê = (ẽ, ρ) be a probabilistic environment, let E_ẽ be as above, and let E_ê = E_ẽ. Then there is a probability distribution φ_ê over E_ê that is equivalent to ê, in the sense that ρ_τ(b) = ∑{φ_ê(e) : e ∈ E_ê, e(prefix_j(τ^A)) = [τ^B]_{j+1} for 0 ≤ j < |τ|, and e(τ^A) = b}.

A problem P̃ is strongly solvable iff P(λ*) = 1, and P̃ is unsolvable iff P(λ*) = 0.

Definition 11  Two interaction traces τ1 and τ2 are compatible if and only if one of them is a prefix of the other, or |lcp(τ^A_1, τ^A_2)| > |lcp(τ^B_1, τ^B_2)|.

Definition 12  A set T of interaction traces is compatible if and only if every pair of interaction traces in T is compatible.
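Under the reading of Definition 11 given above (part of the definition's statement was lost in this copy, so the exact wording is my reconstruction), the pairwise test reduces to comparing longest common prefixes. The sketch below assumes that reading; each trace is a pair (action sequence, percept sequence) of equal-length sequences, though two different traces may have different lengths.

def lcp_len(xs, ys):
    """Length of the longest common prefix of two sequences."""
    n = 0
    for x, y in zip(xs, ys):
        if x != y:
            break
        n += 1
    return n

def compatible(t1, t2):
    """Definition 11, as read above: one trace is a prefix of the other,
    or the action sequences agree strictly longer than the percept sequences do."""
    (a1, b1), (a2, b2) = t1, t2
    k = min(len(a1), len(a2))
    if a1[:k] == a2[:k] and b1[:k] == b2[:k]:   # the shorter trace is a prefix of the longer
        return True
    return lcp_len(a1, a2) > lcp_len(b1, b2)

def compatible_set(traces):
    """Definition 12: every pair in the set must be compatible."""
    return all(compatible(traces[i], traces[j])
               for i in range(len(traces)) for j in range(i + 1, len(traces)))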
It is interesting to compare Definitions 11 and 12 with Definitions 2 and 3: in Chapter 2, we assume all interaction traces have equal length; thus an interaction trace τ1 is a prefix of another interaction trace τ2 if and only if τ1 = τ2. But here interaction traces can have different lengths.

One may attempt to define the sufficiency of a set of interaction traces as in Definition 4, as follows:

Definition 13  A set T of interaction traces is E-sufficient if and only if for all e ∈ E there exists τ ∈ T such that τ = trace(α, e) for some action sequence α.

But E-sufficiency, together with compatibility, does not guarantee the success of an agent in task-completion problems unless the interaction traces in T are successful. Thus, we extend the definition of E-sufficiency to require that the interaction traces are compatible, E-sufficient, and successful.

Definition 14  A set T of interaction traces is P̃-successful if and only if (1) T is compatible and (2) for all e ∈ E there exists τ ∈ T such that τ ∈ Tsuccess(e).

If T is P̃-successful, there exists a bipartite matching between E and T such that, for every configuration in E, the corresponding interaction trace according to the matching is successful in that configuration. The bipartite matching, called an EI-mapping (which stands for environment-interaction mapping) and denoted by μ, is a surjection (i.e., an onto mapping) from E to T. We restate Definition 14 in terms of EI-mappings as follows:

Definition 15  A set T of interaction traces is P̃-successful if and only if (1) T is compatible and (2) there exists an EI-mapping μ such that μ(e) ∈ Tsuccess(e) for all e ∈ E.

We now extend Theorem 4 to problems that involve more than two environments.

Theorem 5  A nondeterministic task-completion problem P̃ = ⟨A, B, E, Tsuccess⟩ is strongly solvable if and only if there exists a P̃-successful set of interaction traces.

Proof. If P̃ is strongly solvable by an agent function λ, let T = {τ_i = trace(λ, e_i) : e_i ∈ E} be the set of successful interaction traces. Clearly, there is an EI-mapping μ : E → T such that μ(e_i) = τ_i for e_i ∈ E. The remaining question is whether T is compatible. If T is not compatible, there exist e1, e2 ∈ E such that τ1 and τ2 are not compatible; but this contradicts the fact that λ is deterministic. Thus, T is compatible.

Conversely, if there exists a P̃-successful set T of interaction traces, we claim that an agent using the deterministic agent function in Figure 3.2, with T as the input, will strongly succeed in P̃. This agent function is called a composite agent function and is denoted by CA(T). To see why CA(T) is a strong solution for P̃, we need to check whether trace(CA(T), e) ∈ Tsuccess(e) for all e ∈ E; in other words, we need to check that the composite agent function always terminates with success (Line 14) when interacting with any configuration in E. Without loss of generality, let e* ∈ E be the actual configuration. By induction on i, we prove that CA(T) terminates with success when interacting with e*. Since all interaction traces in T are finite (because they are all successful), if CA(T) does not exit with failure at Line 3 or Line 6 at the i'th iteration, then CA(T) will terminate with success at Line 14. Thus, all we need to do is prove, by induction on i, that CA(T) does not exit with failure at the i'th iteration. More precisely, we prove that (1) |A_i| = 1 at Line 6, and (2) there exists τ ∈ T_i that can be generated by e* (i.e., T_i ≠ ∅ at Line 3), for 1 ≤ i ≤ n. First, let us consider i = 1.
For any two interaction traces τ1, τ2 ∈ T, the first actions in τ^A_1 and τ^A_2 must be the same, since T is compatible. Therefore, |A_1| = 1. Since T is E-sufficient and E is nonempty, there is at least one interaction trace τ ∈ T that can be generated by e*. Thus, T_1 ≠ ∅.

Now consider i = k + 1. Suppose |A_k| = 1 and there exists τ ∈ T_k that can be generated by e*. First, since all traces in T_k have the same prefix up to the (k−1)'th iteration and e* is a deterministic environment, e* will return a unique percept b_k at Line 8, which is the k'th percept of τ (otherwise τ is not generated by e*). In addition, by Definition 11, for any two interaction traces τ1, τ2 ∈ T_k, the first k actions in τ^A_1 and τ^A_2 must be the same. Thus, the k'th action of τ^A must be a_k, the action generated at Line 8. Therefore, (a_k, b_k) is the k'th interaction of τ, and τ will not be removed from T′_k at Line 11. Hence τ ∈ T_{k+1}, since T_{k+1} = T′_k at Line 12. Second, |A_{k+1}| ≥ 1 because |T_{k+1}| ≥ 1. Third, at the start of each iteration k+1, all interaction traces in T_{k+1} have the same prefix up to the k'th iteration (any traces without this prefix were removed at Line 11 on a previous iteration). By Definition 11, for any two interaction traces τ1, τ2 ∈ T_{k+1}, the first k+1 actions in τ^A_1 and τ^A_2 must be the same; consequently |A_{k+1}| ≤ 1. Therefore, |A_{k+1}| = 1. By induction, |A_i| = 1 and T_i ≠ ∅ for 1 ≤ i ≤ n. □

The significance of Theorem 4 is that it implies that if we can find a set of mutually compatible interaction sequences that relate to the set E of configurations by an EI-mapping, then we know that P̃ is strongly solvable, without needing to know any information about the structure of the environments in the configurations in E. Hence, we can ignore a large part of the structure of the environments when we construct an agent function against the environment. Theorem 4 also suggests that one way to decide whether P̃ is not strongly solvable is to find two configurations e1, e2 ∈ E such that no successful interaction sequences of these environments are (e1, e2)-compatible.

The difference between the composite agent function in Figure 3.2 and the composite agent in Figure 2.2 is that this procedure continues until the agent succeeds, whereas the procedure in Figure 2.2 continues until the end of the game. If the input T is compatible, the procedure will never fail at Line 6. If T is E-sufficient, the procedure will never fail at Line 3. If T is P̃-successful, the procedure will terminate and succeed, according to Theorem 5.

Agent CA(T)   /* a composite agent synthesized from T */
 1. i := 1;  T_i := T
 2. Loop until the agent succeeds
 3.   If T_i = ∅, then exit with failure because T is insufficient
 4.   If T_i ≠ ∅, then
 5.     A_i := {a : (∃τ ∈ T_i) a is the i'th action of τ^A}
 6.     If |A_i| ≠ 1, then exit with failure because T is incompatible
 7.     If |A_i| = 1, then let a_i be the action in A_i
 8.   Output a_i and receive a percept b_i
 9.   T′_i := T_i
10.   For each τ ∈ T_i,
11.     If the i'th interaction in τ isn't (a_i, b_i), remove τ from T′_i
12.   T_{i+1} := T′_i;  i := i+1
13. End loop
14. Terminate with success

Figure 3.2: The pseudocode of a composite agent function (also known as a composite strategy) for task-completion problems (with or without goal uncertainty).
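The replay loop of Figure 3.2 is easy to express directly over a list of candidate traces; a prefix tree makes it faster, but the behavior is the same. The sketch below is an illustrative Python rendering under my own assumptions: traces are lists of (action, percept) pairs, the environment is a callback, and finishing the replay of one of the (successful) traces is taken as the success condition.

def composite_agent(traces, env_step):
    """Replay a compatible set of successful traces, in the spirit of Figure 3.2.

    traces   -- list of interaction traces, each a list of (action, percept) pairs
    env_step -- callback: given our action, returns the environment's percept
    Returns True on success, False if the trace set is insufficient or incompatible."""
    candidates = list(traces)
    i = 0
    while True:
        if not candidates:                      # Line 3: T is insufficient
            return False
        actions = {t[i][0] for t in candidates}
        if len(actions) != 1:                   # Line 6: T is incompatible
            return False
        a = actions.pop()                       # Line 7
        b = env_step(a)                         # Line 8: act and observe
        candidates = [t for t in candidates if t[i] == (a, b)]   # Lines 9-12
        i += 1
        if any(len(t) == i for t in candidates):
            return True                         # Line 14: a successful trace was fully replayed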
The running time of the procedure depends on the size of T. We assume every interaction trace in T has a finite length. The number of iterations in the main loop of the procedure is at most K = max{|τ| : τ ∈ T}. In each iteration, there is one enumeration of T_i, which takes at most |T| steps. Each enumeration involves a lookup of the i'th entry of a trace, which we assume takes constant time. In summary, the running time of the CIT-agent procedure in Figure 3.2 is O(K|T|), where K = max{|τ| : τ ∈ T}. In practice, the efficiency of the procedure depends on the data structure (i.e., the prefix tree) used to store T.

3.4 p-Solvability and Optimal Solutions

If a task-completion problem is not strongly solvable, we opt for weak solutions instead. Let us consider a probabilistic task-completion problem P̃ = (A, B, E, φ, Tsuccess). The probability of success of an agent function λ in P̃ is

  P(λ) = ∑{φ(e) : e ∈ E such that trace(λ, e) ∈ Tsuccess(e)}.

If P̃ is p-solvable, there is an agent function λ such that P(λ) ≥ p. If P̃ is not strongly solvable, then P̃ is not 1.0-solvable, but it may still be weakly solvable. Our objective is to find an optimal agent function λ* such that there is no other agent function λ with P(λ) > P(λ*).

One might wonder whether a probabilistic agent function can do better than the best deterministic agent function. The answer is no: for every probabilistic agent function, there is a deterministic agent function that has an equal or higher probability of success.

Theorem 6  For any probabilistic agent function λ̂, there exists a deterministic agent function λ such that P(λ) ≥ P(λ̂) in any probabilistic task-completion problem P̃ = (A, B, E, φ, Tsuccess).

Proof. Let λ̂ be (Λ̂, ψ̂), where Λ̂ is the set of deterministic agent functions equivalent to λ̂ and ψ̂ is the probability distribution over Λ̂. Let λmax = argmax{P(λ) : λ ∈ Λ̂}. Then

  P(λ̂) = ∑_{λ∈Λ̂} ψ̂(λ)·P(λ) ≤ ∑_{λ∈Λ̂} ψ̂(λ)·P(λmax) = P(λmax)·∑_{λ∈Λ̂} ψ̂(λ) = P(λmax).

Thus, the probability of success of the deterministic agent function λmax is equal to or larger than that of λ̂. □

Theorem 6 means that at least one of the optimal agents for P̃ is deterministic. Therefore, if we want to find one of the optimal agents, it is sufficient to look at deterministic agents. Moreover, among all optimal deterministic agent functions, it is sufficient to consider just composite agent functions, because every deterministic agent function has a "companion" composite agent function with the same behavior:

Theorem 7  For every deterministic agent function λ there is a composite agent function λ′ = CA(T), for some set T of interaction traces, such that λ(β) = λ′(β) for any percept sequence β.

Proof. Without loss of generality, let ẽ be a nondeterministic environment. Let T = {trace(λ, e_i) : e_i ∈ E_ẽ} be the set of all possible interaction traces that can be generated by λ. T is a compatible set of interaction traces because λ is deterministic. Therefore the composite agent function λ′ = CA(T) would not produce any error message. Moreover, both λ′ and λ generate the same sequence of actions no matter which configuration (among E_ẽ) is actual. □

In short, for any probabilistic task-completion problem there exists at least one composite agent function that is optimal.

3.5 Solving Weakly Solvable Problems

The results in the last section, together with Theorem 5, suggest an algorithm for constructing an optimal composite agent function for a (strongly or weakly solvable) problem P̃, given a set of interaction traces T. If T is exhaustive then the agent will have the highest possible probability of success, but it may be close to this probability even when T is not exhaustive.
We call this algorithm the CIT-search algorithm, where CIT stands for "Compatible Interaction Traces". The major difference between CIT-search and the CIT algorithm of Chapter 2 is that the CIT algorithm of Chapter 2 does not solve task-completion problems; if we translate the iterated games studied in Chapter 2 into task-completion problems, the CIT algorithm will find only the strongly solvable solutions. The CIT-search algorithm presented in this section can also find weakly solvable solutions. In this section, we present how the CIT-search algorithm works and an analysis of the algorithm. In the next section, we present experimental results to demonstrate its performance.

An optimal deterministic agent function λ* for P̃ is one that has the maximum probability of success. To see what this means, let 𝓔_strong ⊆ 2^E be the set of all strongly solvable sets of configurations. Here, a set E′ of environments is strongly solvable if the subproblem (A, B, E′, Tsuccess) of P̃ is strongly solvable. Then λ* is strongly successful on some set E_max ∈ 𝓔_strong such that ∑{φ(e) : e ∈ E_max} ≥ ∑{φ(e) : e ∈ E′} for every E′ ∈ 𝓔_strong. This analysis suggests a two-step procedure for constructing λ*: (1) identify a set E_max ⊆ E that maximizes the weighted sum ∑{φ(e) : e ∈ E_max}, subject to the constraint that (A, B, E_max, Tsuccess) is still strongly solvable, and (2) find an agent function that strongly succeeds on E_max. The CIT-search algorithm handles these two steps at the same time.

Suppose we have a database T of interaction traces, where each interaction trace is indexed by the environment in which it is successful. Mathematically, let E_case = {e1, e2, ..., em} ⊆ E, and let C = {T1, T2, ..., Tm} be a collection of sets of interaction traces such that T_i ⊆ T is a set of successful interaction traces for e_i, for 1 ≤ i ≤ m. Then we can search for (1) a subset E′ of E_case and (2) a set T′ = {τ1, τ2, ..., τ_{|E′|}} of interaction traces such that τ_i ∈ T_i for each e_i ∈ E′, and such that for any e_i, e_j ∈ E′, τ_i and τ_j are compatible. If we can find E′ and {τ1, τ2, ..., τ_{|E′|}} that satisfy these conditions, then we can show that E′ is strongly solvable, according to Theorem 5. Furthermore, the composite agent function CA(T′) will strongly succeed in the subproblem (A, B, E′, Tsuccess). If E′ maximizes ∑{φ(e) : e ∈ E′}, then this composite agent function is an optimal solution.

The CIT-search algorithm is a procedure that does a branch-and-bound search to construct a CIT-agent whose success probability is p or greater. The procedure is called CIT-search, and its pseudocode is shown in Figure 3.3. The initial inputs of CIT-search are as follows. Each T_i is a set of successful interaction traces for a configuration e_i whose probability is p_i = φ(e_i). μ, ρ, and ρ+ are for CIT-search's internal bookkeeping, and their initial values should be μ = ∅, ρ = 0, and ρ+ = 1. CIT-search creates a composite agent function that has either (i) a success probability of p or greater, or (ii) the highest probability of success (P(λ) = ρ*) among all combinations of the given interaction traces. Here is how it works. In this procedure, E′ and T′ are represented by an EI-mapping μ : E′ → T′ such that μ(e_i) = τ_i for each τ_i ∈ T′ that succeeds in e_i ∈ E′. CIT-search goes through the sets of interaction traces C = {T1, T2, ..., Tm} one by one. For each T_k, it looks for a successful interaction trace τ_k ∈ T_k that is compatible with every interaction trace in range(μ).
If such an interaction trace exists, CIT-search sets μ(e_k) = τ_k. Otherwise, it skips T_k and considers the next set of interaction traces, T_{k−1}. Once it has looked at all the sets of interaction traces in C, it checks whether CA(range(μ)) can succeed with a probability ρ ≥ p, where ρ = ∑{p_i : e_i ∈ dom(μ)} and p_i = φ(e_i). If the answer is yes, then CIT-search terminates and returns the composite agent function CA(range(μ)). Otherwise, it backtracks and searches for another mutually compatible set of interaction traces. The procedure maintains the EI-mapping μ* with the highest probability ρ* seen so far, and this can be used, together with an upper bound ρ+ on ρ, to prune unpromising search branches.

Global variables: μ* := ∅;  ρ* := 0   // the best EI-mapping found so far
Procedure CIT-search({T1,...,Tk}, {p1,...,pk}, p, μ, ρ, ρ+)
1. If ρ+ ≤ ρ*, then return   /* prune this branch */
2. If ρ > ρ*, then
3.   μ* := μ;  ρ* := ρ;  If ρ ≥ p, then terminate and return CA(range(μ))
4. If k > 0, then
5.   C′ := {T1,...,T_{k−1}};  P′ := {p1,...,p_{k−1}}
6.   For each τ ∈ T_k
7.     If τ is compatible with every τ′ ∈ range(μ), then
8.       CIT-search(C′, P′, p, μ ∪ {(e_k, τ)}, ρ + p_k, ρ+)
9.   CIT-search(C′, P′, p, μ, ρ, ρ+ − p_k)

Figure 3.3: The Compatible Interaction Traces Search algorithm (CIT-search).

The way the procedure constructs μ guarantees that T′ is a compatible set of interaction traces. Moreover, when E_case = E, T_i = Tsuccess(e_i) for all e_i ∈ E_case, and p = 1, CIT-search will correctly return an optimal composite agent function, because it exhaustively searches all possible EI-mappings. More generally, if we begin with a small case base and gradually add additional cases, then the probability of success of the agent returned by CIT-search will also increase and will approach the optimal probability:

Theorem 8  Let n = |E_case| and m_i = |T_i| for each e_i ∈ E_case. As n → |E| and m_i → |Tsuccess(e_i)|, P(CA(T′)) → p*, where p* is the probability of success of an optimal solution, and CA(T′) is the composite agent function returned by CIT-search with E_case, C, {p1, p2, ..., pm}, and p = 1.0 as inputs.

The running time of CIT-search is O(l^{m+1}), where l = max{|T_i| : T_i ∈ C} and m = |C|. Thus, the running time depends on the size of the case base. Although the CIT-search procedure (Figure 3.3) is computationally intractable (because E and Tsuccess can be infinite), we found that even if the procedure takes just a finite case base of successful interaction traces for a finite number of environments as input, it still returns an agent that performs reasonably well. In practice, we are likely to obtain successful interaction traces for some instances of the nondeterministic environment through our past successful interactions with the environment, so we can construct such a case base of successful interaction traces. Furthermore, the probability distribution φ over environments is often available, for example by multiplying the probabilities of the outcomes of the actions (e.g., when the domain description is written using probabilistic STRIPS-style operators).

If every e_i has an equal probability and |T_i| = 1 for each e_i, the problem of finding the maximal set of compatible successful interaction traces is NP-hard (by reduction from the maximum clique problem). On the other hand, the running time is mostly determined by how easily the successful interaction traces can be combined together. In some problems, there are many successful interaction traces for every environment (hence both |Tsuccess(e_i)| and |T_i| are large), and in such cases it is relatively easy for CIT-search to find a good solution (i.e., one whose success probability is greater than the input parameter p).
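A stripped-down version of this branch-and-bound search can be written in a few lines. The sketch below is a simplification of Figure 3.3 under my own naming: it keeps the best selection in a local variable instead of globals, returns the chosen traces and their covered probability rather than building CA directly, and reuses a pairwise compatible() test such as the one sketched earlier.

def cit_search(trace_sets, probs, p, compatible):
    """Branch-and-bound over EI-mappings: pick at most one successful trace per
    configuration, keep the selection mutually compatible, and maximize the covered
    probability. Returns (best_traces, best_prob)."""
    best = {"traces": [], "prob": 0.0}

    def search(k, chosen, rho, rho_plus):
        if rho_plus <= best["prob"]:          # prune: this branch cannot beat the best so far
            return
        if rho > best["prob"]:
            best["traces"], best["prob"] = list(chosen), rho
        if best["prob"] >= p or k < 0:        # good enough, or no configurations left
            return
        # Branch 1: cover configuration k with one of its successful traces.
        for tau in trace_sets[k]:
            if all(compatible(tau, other) for other in chosen):
                search(k - 1, chosen + [tau], rho + probs[k], rho_plus)
        # Branch 2: leave configuration k uncovered; its probability mass is forfeited.
        search(k - 1, chosen, rho, rho_plus - probs[k])

    search(len(trace_sets) - 1, [], 0.0, 1.0)
    return best["traces"], best["prob"]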
In some problems there are many successful interaction traces for every environment (hence both |T_success(ei)| and |Ti| are large), and in such cases it is relatively easy for CIT-search to find a good solution (i.e., one whose success probability is greater than the input parameter p).

3.6 Experimental Results

We now experimentally evaluate how good the solutions are that CIT-search produces from a small number of interaction traces. Our test domain is a pizza-delivery domain in which a delivery person must deliver a pizza to a customer's home within some time limit. Traffic congestion may hamper the job, but the only way for the delivery person to know whether a road is congested is to go onto that road.

We assume that the road map is a grid (e.g., like downtown Manhattan) of size N × N, with N + 1 north-south roads and N + 1 east-west roads, each N blocks long. The pizza shop and the customer's home are at (0,0) and (N,N), respectively. Each road has a probability of 0.5 of being congested, and either all of it is congested or none of it is. The delivery person only needs to travel one block on a road to find out whether it is congested: it takes one time unit to traverse a block without congestion and 5 time units with congestion. The actions are the directions the delivery person should go at each junction, and the percepts are whether congestion occurred on the block he or she has just traversed.

Note that there are 2^{2(N+1)} possible deterministic environments. The difficulty of each of them depends partly on the amount of time available to the delivery person. With 10N time units or more, the problem is strongly solvable; with fewer than 2N time units it is unsolvable. For our experiments we used a time limit of 2N + 8Nd with 0 ≤ d < 1, in order to focus on weakly solvable problems.

For each combination of N ∈ {4,5,6,7,8} and d ∈ {0.1, 0.3}, we did the following steps 25 times. We created 63 distinct configurations e1, ..., e63 at random, each with congestion on different sets of roads, with a 50% chance of congestion for each road. For each ei, we created a set of successful interaction traces Ti as follows. First, we found a path τi that went from (0,0) to (N,N) as quickly as possible in ei. Then for each τi and each j, we created the set of all paths that followed τi for the first j − 1 steps, chose a different action at step j, and then followed the quickest path from the current location to (N,N). Out of all of these paths, Ti contained the ones whose total time was within ei's time limit.

Then we ran CIT-search with C0, ..., C63, where Ck = {T1, ..., Tk}. Note that C0 is empty. Since the problem CIT-search solves is NP-hard, we usually could not wait for it to find the optimal agent; instead, we terminated it after 100,000 recursive calls and retrieved the best composite agent function it had found during that time.

In the literature on case-based problem solving, the problem solver will often revert to a conventional search algorithm if it cannot solve the problem using just the case base. To model this, for each agent σk we also created a modified agent σ′k that used a base procedure if the replay of interaction traces failed. Our base procedure was a simple one: it told the delivery person to move toward the destination at random, as long as the move did not increase the distance between the delivery person and the destination. We ran each of the agents σk and σ′k 1000 times on a simulator and recorded their success rates.
We did this for all combinations of N ∈ {4,5,6,7,8} and d ∈ {0.1, 0.3}, averaged the success rates, and graphed them as shown in Figure 3.4. Hence each data point in Figure 3.4 is an average of 250,000 runs.

For the agents that use the base procedure, Figure 3.4 shows that the success rate increases quickly with the number of configurations covered by CIT-search's database of interaction traces. It takes fewer than 16 configurations to attain a success rate of 0.33, and after that point there is no improvement as the number of configurations increases. We suspect the reason for convergence with a small number of environments is the base procedure, which has a success rate of 0.17 even without any help from the composite agent function.

For the agents that do not use the base procedure, the success rate also grows quickly with the number of environments. The agent generated from k = 3 performs as well as the base procedure with k = 0, and the difference in performance decreases as k increases. When all of the successful interaction traces are used without the base procedure, the performance is 94% higher than the base procedure without the interaction traces.

Figure 3.4: Success rates for composite agent functions constructed by running CIT-search with a database that covers k of the environment's 64 configurations, for k = 1,...,64. Each data point is an average of 250,000 runs. The plot shows success rate against the number of configurations k, with one curve for the agents with the base procedure and one for the agents without it.

Just as in machine learning techniques, there is a risk of overfitting the data. In fact, when d = 0.1, we observed a small decrease in the probability of success, from 0.27 to 0.26, as the number of configurations increased from 23 to 64. However, we did not observe the same tendency when d is large. Our hypothesis is that for difficult problems there are only a small number of successful interaction traces for each environment. Therefore it is possible that the CIT agent may lead the delivery person to some "remote" locations and then fail, and the base procedure cannot complete the task starting from these locations. This also explains why we did not observe a decrease in the probability of success when the problems are easy. In the future, we will conduct a more thorough study to investigate this issue.

3.7 Related Work

We know of no previous work on the use of case-based reasoning in stochastic environments, especially when there is goal uncertainty. A rough analogy can be made between each of the environment's configurations ei and a case in a traditional case-based reasoning problem such as those in [39, 63]. But in a traditional case-based reasoning problem the domain is totally observable, hence one knows what the "new" problem is that needs solving. In our case, all we have is a probability distribution over the domain's possible behaviors.

Our method is different from existing instance-based learning techniques (such as the k-nearest neighbor algorithms [44]) in that our agent classifies the environment during execution, not before execution. Unlike the decision trees in decision tree learning, our prefix tree generates actions during classification, and this influences the agent's probability of success, particularly if the environment is not safely explorable. This difficulty is similar to the exploration-exploitation dilemma that occurs in reinforcement learning [30], game playing [19], and real-time search algorithms [10].
But instead of using learning and exploration techniques, we reuse previously successful interactions to guide the ex- ploration. 82 3.8 Summary We have described a new technique to construct an agent for interacting with non- deterministic and stochastic environments with goal uncertainty. Our technique reuses previous successful interactions to maximize the agent?s chance of success. Our contri- butions are summarized below. We formulate the problems using interaction traces instead of states of the world. Based on this new interaction-based problem formulation of task-completion problems, we have given necessary and sufficient conditions for there to exist an agent that can always successfully accomplish a task. If there exists a set of mutually compatible suc- cessful interaction traces for each configuration of an environment, then the problem is strongly solvable, i.e., there exists an agent that is guaranteed to solve the problem. One class of such agent is the composite agent function, which take the set of compatible interaction traces to make decisions. On problems that are not strongly solvable but where we have a database of suc- cessful interaction traces, our CIT-search algorithm constructs a agent function having the highest probability of success among all combinations of the interaction traces. Even the database covers just a small number of configurations of an environment, our experiments show that the agents produced by CIT-search can still perform well, and that the reuse of successful interaction traces can improve the success rate of an existing problem solving strategy. In future, we would like to conduct more extensive experiments to find out the relationship between successful interaction traces in the database and the quality of the 83 agent function produced by CIT-search. Moreover, we are interested to see how to speed up the CIT-search algorithm. 84 Chapter 4 Noise Detection in the Iterated Prisoner?s Dilemma The Iterated Prisoner?s Dilemma (IPD) has become well known as an abstract model of a class of multi-agent environments in which agents accumulate payoffs that depend on how successful they are in their repeated interactions with other agents. An important variant of the IPD is the Noisy IPD, in which there is a small probability, called the noise level, that accidents will occur. In other words, the noise level is the probability of executing ?cooperate? when ?defect? was the intended move, or vice versa. Accidents can cause difficulty in cooperation with others in real-life situations, and the same is true in the Noisy IPD. Strategies that do quite well in the ordinary (non-noisy) IPD may do quite badly in the Noisy IPD [5, 11, 12, 45, 46, 47]. For example, if two players both use the well-known Tit-For-Tat (TFT) strategy, then an accidental defection may cause a long series of defections by both players as each of them punishes the other for defecting. This chapter reports on a strategy called the Derived Belief Strategy (DBS), which was the best-performing non-master-slave strategy in Category 2 (noisy environments) of the 2005 Iterated Prisoner?s Dilemma competition (see Table 4.1). Like most opponent-modeling techniques, DBS attempts to learn a model of the other player?s strategy (i.e., the opponent model1) during the games. Our main innovation 1The term ?opponent model? appears to be the most common term for a model of the other player, even 85 Table 4.1: Scores of the best programs in Competition 2 (IPD with Noise). 
The table shows each program's average score for each run and its overall average over all five runs. The competition included 165 programs, but we have listed only the top 15.

Rank  Program           Author          Run 1  Run 2  Run 3  Run 4  Run 5  Overall
1     BWIN              P. Vytelingum   441.7  431.7  427.1  434.8  433.5  433.8
2     IMM01             J.W. Li         424.7  414.6  414.7  409.1  407.5  414.1
3     DBSz              T.C. Au         411.7  405.0  406.5  407.7  409.2  408.0
4     DBSy              T.C. Au         411.9  407.5  407.9  407.0  405.5  408.0
5     DBSpl             T.C. Au         409.5  403.8  411.4  403.9  409.1  407.5
6     DBSx              T.C. Au         401.9  410.5  407.7  408.4  404.4  406.6
7     DBSf              T.C. Au         399.2  402.2  405.2  398.9  404.4  402.0
8     DBStft            T.C. Au         398.4  394.3  402.1  406.7  407.3  401.8
9     DBSd              T.C. Au         406.0  396.0  399.1  401.8  401.5  400.9
10    lowESTFT classic  M. Filzmoser    391.6  395.8  405.9  393.2  399.4  397.2
11    TFTIm             T.C. Au         399.0  398.8  395.0  396.7  395.3  397.0
12    Mod               P. Hingston     394.8  394.2  407.8  394.1  393.7  396.9
13    TFTIz             T.C. Au         397.7  396.1  390.7  392.1  400.6  395.5
14    TFTIc             T.C. Au         400.1  401.0  389.5  388.9  389.2  393.7
15    DBSe              T.C. Au         396.9  386.8  396.7  394.5  393.7  393.7

involves how to reason about noise using the opponent model. The key idea used in DBS is something that we call symbolic noise detection: the use of the other player's deterministic behavior to tell whether an action has been affected by noise. More precisely, DBS builds a symbolic model of how the other player behaves, and watches for any deviation from this model. If the other player's next move is inconsistent with its past behavior, this inconsistency can be due either to noise or to a genuine change in its behavior; and DBS can often distinguish between these two cases by waiting to see whether the inconsistency persists in the next few iterations of the game.2

Of the nine different versions of DBS that we entered into the competition, all of them placed in the top 15, and seven of them placed among the top ten (see Table 4.1). Our best version, DBSz, placed third; the two players that placed higher were both masters of master-and-slave teams.

DBS operates in a distinctly different way from the master-and-slaves strategy used by several other entrants in the competition. Each participant in the competition was allowed to submit up to 20 programs as contestants. Some participants took advantage of this to submit collections of programs that worked together in a conspiracy in which 19 of their 20 programs (the "slaves") worked to give as many points as possible to the 20th program (the "master"). DBS does not use a master-and-slaves strategy, nor does it conspire with other programs in any other way. Nonetheless, DBS remained competitive with the master-and-slaves strategies in the competition, and performed much better than the master-and-slaves strategies if the score of each master is averaged with the scores of its slaves. Furthermore, a more extensive analysis in Section 4.7.2 shows that if each master-and-slaves team had been limited to 10 programs or less, DBS would have placed first in the competition.

though this player is not necessarily an "opponent" (since the IPD is not zero-sum).
2An iteration has also been called a period or a round by some authors.

4.1 Motivation and Approach

The techniques used in DBS are motivated by a British army officer's story that was quoted in [4, page 40]:

I was having tea with A Company when we heard a lot of shouting and went out to investigate. We found our men and the Germans standing on their respective parapets. Suddenly a salvo arrived but did no damage.
Naturally both sides got down and our men started swearing at the Germans, when all at once a brave German got onto his parapet and shouted out: ?We are very sorry about that; we hope no one was hurt. It is not our fault. It is that damned Prussian artillery.? (Rutter 1934, 29) Such an apology was an effective way of resolving the conflict and preventing a retalia- tion because it told the British that the salvo was not the intention of the German infantry, but instead was an unfortunate accident that the German infantry did not expect nor de- sire. The reason why the apology was convincing was because it was consistent with the German infantry?s past behavior. The British had was ample evidence to believe that the German infantry wanted to keep the peace just as much as the British infantry did. More generally, an important question for conflict prevention in noisy environments is whether a misconduct is intentional or accidental. A deviation from the usual course 88 of action in a noisy environment can be explained in either way. If we form the wrong belief about which explanation is correct, our response may potentially destroy our long- term relationship with the other player. If we ground our belief on evidence accumulated before and after the incident, we should be in a better position to identify the true cause and prescribe an appropriate solution. To accomplish this, DBS uses the following key techniques: 1. Learning about the other player?s strategy. DBS uses an induction technique to identify a set of rules that model the other player?s recent behavior. The rules give the probability that the player will cooperate under different situations. As DBS learns these probabilities during the game, it identifies a set of deterministic rules that have either 0 or 1 as the probability of cooperation. 2. Detecting noise. DBS uses the above rules to detect anomalies that may be due either to noise or a genuine change in the other player?s behavior. If a move is different from what the deterministic rules predict, this inconsistency triggers an evidence collection process that will monitor the persistence of the inconsistency in the next few iterations of the game. The purpose of the evidence-collection process is to determine whether the violation is likely to be due to noise or to a change in the other player?s policy. If the inconsistency does not persist, DBS asserts that the derivation is due to noise; if the inconsistency persists, DBS assumes there is a change in the other player?s behavior. 3. Temporarily tolerating possible misbehaviors by the other player. Until the evidence-collection process finishes, DBS assumes that the other player?s behavior 89 is still as described by the deterministic rules. Once the evidence collection pro- cess has finished, DBS decides whether to believe the other player?s behavior has changed, and updates the deterministic rules accordingly. Since DBS emphasizes the use of deterministic behaviors to distinguish noise from the change of the other player?s behavior, it works well when the other player uses a pure (i.e., deterministic) strategy or a strategy that makes decisions deterministically most of the time. Fortunately, deterministic behaviors are abundant in the Iterated Prisoner?s Dilemma. Many well-known strategies, such as TFT and GRIM, are pure strategies. Some strategies such as Pavlov or Win-Stay, Lose-Shift strategy (WSLS) [35, 36, 37, 48] are not pure strategies, but a large part of their behavior is still deterministic. 
The reason for the prevalence of determinism is discussed by Axelrod in [3]: clarity of behavior is an important ingredient of long-term cooperation. A strategy such as TFT benefits from its clarity of behavior, because it allows other players to make credible predictions of TFT?s responses to their actions. We believe the success of our strategy in the competition is because this clarity of behavior also helps us to fend off noise. The results of the competition show that the techniques used in DBS are indeed an effective way to fend off noise and maintain cooperation in noisy environments. When DBSdefersjudgmentaboutwhethertheotherplayer?sbehaviorhaschanged, thepotential cost is that DBS may not be able to respond to a genuine change of the other player?s behavior as quickly as possible, thus losing a few points by not retaliating immediately. But this delay is only temporary, and after it DBS will adapt to the new behavior. More importantly, the techniques used in DBS greatly reduce the probability that noise will 90 cause it to end a cooperation and fall into a mutual-defect situation. Our experience has been that it is hard to re-establish cooperation from a mutual-defection situation, so it is better avoid getting into mutual defection situations in the first place. When compared with the potential cost of ending an cooperation, the cost of temporarily tolerating some defections is worthwhile. Temporary tolerance also benefits us in another way. In the noisy Iterated Prisoner?s Dilemma, there are two types of noise: one that affects the other player?s move, and the other affects our move. While our method effectively handles the first type of noise, it is the other player?s job to deal with the second type of noise. Some players such as TFT are easily provoked by the second type of noise and retaliate immediately. Fortunately, if the retaliation is not a permanent one, our method will treat the retaliation in the same way as the first type of noise, thus minimizing its effect. 4.2 Iterated Prisoner?s Dilemma with Noise In the Iterated Prisoner?s Dilemma, two players play a finite sequence of classical prisoner?s dilemma games, whose payoff matrix is: Player 2 Cooperate Defect Cooperate (uCC,uCC) (uCD,uDC) Player 1 Defect (u DC,uCD) (uDD,uDD) where uDC > uCC > uDD > uCD and 2uCC > uDC +uCD. In the competition, uDC, uCC,uDD anduCD are 5, 3, 1 and 0, respectively. 91 At the beginning of the game, each player knows nothing about the other player and does not know how many iterations it will play. In each iteration, each player chooses either to cooperate (C) or defect (D), and their payoffs in that iteration are as shown in the payoff matrix. We call this decision a move or an action. After both players choose a move, they will each be informed of the other player?s move before the next iteration begins. Ifak,bk ?{C,D}are the moves of Player 1 and Player 2 in iterationk, then we say that (ak,bk) is the interaction of iteration k. If there are N iterations in a game, then the total scores for Player 1 and Player 2 aresummationtext1?k?N uakbk andsummationtext1?k?N ubkak, respectively. The Noisy Iterated Prisoner?s Dilemma is a variant of the Iterated Prisoner?s Dilemma in which there is a small probability that a player?s moves will be mis- implemented. The probability is called the noise level.3 In other words, the noise level is the probability of executing C when D was the intended move, or vice versa. 
The incorrect move is recorded as the player's move, and determines the interaction of the iteration.4 Furthermore, neither player has any way of knowing whether the other player's move was executed correctly or incorrectly.5

For example, suppose Player 1 chooses C and Player 2 chooses D in iteration k, and noise occurs and affects Player 1's move. Then the interaction of iteration k is (D,D). However, since neither player knows that Player 1's move has been changed by noise, Player 1 and Player 2 perceive the interaction differently: for Player 1 the interaction is (C,D), but for Player 2 the interaction is (D,D). As in real life, this misunderstanding can become an obstacle to establishing and maintaining cooperation between the players.

3The noise level in the competition was 0.1.
4Hence, a mis-implementation is different from a misperception, which would not change the interaction of the iteration. The competition included mis-implementations but no misperceptions.
5As far as we know, the definitions of "mis-implementation" used in the existing literature are ambiguous about whether either of the players should know that an action has been mis-executed.

4.3 Strategies, Policies, and Hypothesized Policies

A history h of length k is the sequence of interactions of all iterations up to and including iteration k. We write h = ⟨(a1,b1),(a2,b2),...,(ak,bk)⟩. Let H = {(C,C),(C,D),(D,C),(D,D)}* be the set of all possible histories. A strategy σ : H → [0,1] associates with each history h a real number called the degree of cooperation. σ(h) is the probability that σ chooses to cooperate at iteration k + 1, where k = |h| is h's length. The definition of a strategy in this chapter is in fact a shorthand way of writing a mixed strategy σ′ that maps histories into probability distributions over A = {C,D}. σ′ can be obtained from σ by using the following equations: σ′(h,C) = σ(h) and σ′(h,D) = 1 − σ(h).

For example, TFT can be considered as a function σ_TFT such that (1) σ_TFT(h) = 1.0 if k = 0 or ak = C (where k = |h|), and (2) σ_TFT(h) = 0.0 otherwise. Tit-for-Two-Tats (TFTT), which is like TFT except that it defects only after it receives two consecutive defections, can be considered as a function σ_TFTT such that (1) σ_TFTT(h) = 0.0 if k ≥ 2 and a_{k−1} = a_k = D, and (2) σ_TFTT(h) = 1.0 otherwise.

We can model a strategy as a policy. A condition Cond : H → {True, False} is a mapping from histories to Boolean values. A history h satisfies a condition Cond if and only if Cond(h) = True. A rule is a pair (Cond, p), which we will write as Cond → p, where Cond is a condition and p is a degree of cooperation (a real number in [0,1]). A rule is deterministic if p is either 0.0 or 1.0; otherwise, the rule is probabilistic. A policy schema C is a set of conditions such that each history in H satisfies exactly one of the conditions in C. In this chapter, we define a policy to be a set of rules whose conditions constitute a policy schema.

σ_TFT can be modeled as a policy as follows. We define Cond_{a,b} to be a condition about the interaction of the last iteration of a history, such that Cond_{a,b}(h) = True if and only if (1) k ≥ 1, ak = a and bk = b (where k = |h|), or (2) k = 0 and a = b = C. For simplicity, we also write Cond_{a,b} as (a,b). The policy for σ_TFT is π_TFT = {(C,C) → 1.0, (C,D) → 1.0, (D,C) → 0.0, (D,D) → 0.0}. Notice that the policy schema for π_TFT is C = {(C,C),(C,D),(D,C),(D,D)}.

Given a policy π and a history h, there is one and only one rule Cond → p in π such that Cond(h)
= True. We write p as π(h). A policy π is complete for a strategy σ if and only if π(h) = σ(h) for every history h ∈ H. In other words, a complete policy for a strategy is one that completely models the strategy. For instance, π_TFT is a complete policy for σ_TFT.

Some strategies are much more complicated than TFT; a large number of rules is needed in order to completely model them. If the number of iterations is small and the strategy is complicated enough, it is difficult or impossible to obtain a complete model of the other player's strategy. Therefore, an agent should not aim at obtaining a complete policy of the other player's strategy; instead, all an agent can do is learn an approximation of the other player's strategy during a game, using a small number of rules. In order to distinguish this approximation from the complete policies for a strategy, we call it a hypothesized policy.

Given a policy schema C, an agent constructs a hypothesized policy π whose policy schema is C. The degrees of cooperation of the rules in π are estimated by a learning function, which computes the degrees of cooperation according to the current history. Mathematically, a learning function is a mapping from C × H to [0,1]. For example, consider a learning function L_avg that computes the degrees of cooperation by averaging the number of times the next action is C when a condition Cond_i holds in the current history h:

L_avg(Cond_i, h) = |{k : 0 ≤ k < |h|, Cond_i(prefix_k(h)) = True, and [h_B]_{k+1} = C}| / |{k : 0 ≤ k < |h| and Cond_i(prefix_k(h)) = True}|     (4.1)

where h_B is the sequence of actions of the agent we are trying to model, and [h_B]_{k+1} is the (k+1)'th action in the sequence. If the denominator of L_avg(Cond_i, h) is zero, then L_avg(Cond_i, h) = 0. L_avg is undefined if |h| = 0.

Suppose the other player's strategy is σ_TFTT, the given policy schema is C = {(C,C),(C,D),(D,C),(D,D)}, and the current history is h = ⟨(C,C),(D,C),(C,C),(D,C),(D,C),(D,D),(C,D),(C,C)⟩. Then the hypothesized policy is π = {(C,C) → 1.0, (C,D) → 1.0, (D,C) → 0.66, (D,D) → 0.0}. Notice that the rule (D,C) → 0.66 does not accurately model σ_TFTT; this probabilistic rule is just an approximation of what σ_TFTT does when the condition (D,C) holds. In fact, this approximation is inaccurate as long as the policy schema contains (D,C): there is no complete policy for σ_TFTT whose policy schema contains (D,C). If we want to model σ_TFTT correctly, we need a different policy schema that allows us to specify more complicated rules.

4.3.1 Discussion

When we use a hypothesized policy to model a strategy, the difference between the hypothesized policy and the strategy is due to the choice of the policy schema and the learning function. But we usually do not know how to choose a good policy schema and a good learning function. In the literature, the difficulty of choosing the right model is called model uncertainty, and the problem is called the problem of model selection. If we consider the difference between the hypothesized policy and the strategy as an error, the error is largely due to model uncertainty.

We interpret a hypothesized policy as a belief about what the other player will do in the next few iterations in response to our next few moves. This belief does not necessarily hold in the long run, since the other player can behave differently at different times in a game. Even worse, there is no guarantee that this belief is true in the next few iterations.
Nonetheless,hypothesizedpoliciesconstructedbyanagentincertainnon-zero-sum games such as the IPD usually have a high degree of accuracy in predicting what the other player will do. This belief is subjective?it depends on the choice of the policy schema and the learning function. We formally define this subjective viewpoint as follows. The hypoth- esized policy space spanned by a policy schemaCand a learning function L :C?H? 96 [0,1] is a set of policies ? = {pi(H) : ? ?H}, where pi(?) = {Cond ?L(Cond,?) : Cond?C}. Let? be a history of a game in which the other player?s strategy is?. The set of all possible hypothesized policies for?in this game is{pi(?k) : ?k ?prefixes(?)}??, where prefixes(?) is the set of all prefixes of ?, and ?k is the prefix of length k of ?. We say pi(?k) is the current hypothesized policy of ? in the iteration k. A rule Cond?p in pi(?k) describes a particular behavior of the other player?s strategy in the iterationk. The behavior is deterministic if p is either zero or one; otherwise, the behavior is random or probabilistic. If pi(?k) negationslash= pi(?k+1), we say there is a change of the hypothesized policy in the iteration k + 1, and the behaviors described by the rules in (pi(?k)\pi(?k+1)) have changed. 4.4 Derived Belief Strategy In the ordinary Iterated Prisoner?s Dilemma (i.e., without any noise), if we know the other player?s strategy and how many iterations in a game, we can compute an op- timal strategy against the other player by trying every possible sequence of moves to see which sequence yields the highest score, assuming we have sufficient computational power. However, we are missing both pieces of information. So it is impossible for us to compute an optimal strategy, even with sufficient computing resource. Therefore, we can at most predict the other player?s moves based on the history of a game, subject to the fact that the game may terminate any time. SomestrategiesfortheIteratedPrisoner?sDilemmadonotpredicttheotherplayer?s moves at all. For example, Tit-for-Tat and GRIM react deterministically to the other 97 player?s previous moves according to fixed sets of rules, no matter how the other player actually plays. Many strategies adapt to the other player?s strategy over the course of the game: for example, Pavlov [35] adjusts its degree of cooperation according to the history of a game. However, these strategies do not take any prior information about the other player?s strategy as an input; thus they are unable to make use of this important piece of information even when it is available. Let us consider a class of strategies that make use of a model of the other player?s strategytomakedecisions. Figure4.1showsanabstractrepresentationofthesestrategies. Initially, these strategies start out by assuming that the other player?s strategy is TFT or some other strategy. In every iteration of the game, the model is updated according to the current history (using UpdateModel). These strategies decide which move it should make in each iteration using a move generator (GenerateMove), which depends on the current model of the other player?s strategy of the iteration. DBS belongs to this class of strategies. DBS maintains a model of the other player in form of a hypothesized policy throughout a game, and makes decisions based on this hypothesized policy. The key issue for DBS in this process is how to maintain a good approximation of the other player?s strategy, despite that some actions in the history are affected by noise. 
A good approximation will increase the quality of moves generated by DBS, since the move generator in DBS depends on an accurate model of the other player's behavior. The approach DBS uses to minimize the effect of noise on the hypothesized policy has been discussed in Section 4.1: temporarily tolerate possible misbehavior by the other player, and then update the hypothesized policy only if DBS believes that the misbehavior is due to a genuine change of behavior. Figure 4.2 shows an outline of the implementation of this approach in DBS. As we can see, DBS does not maintain the hypothesized policy explicitly; instead, it maintains three sets of rules: the default rule set (Rd), the current rule set (Rc), and the probabilistic rule set (Rp). DBS combines these rule sets to form a hypothesized policy for move generation. In addition, DBS maintains several auxiliary variables (promotion counts and violation counts) to facilitate the update of these rule sets. We will explain every line of Figure 4.2 in detail in the next section.

Procedure StrategyUsingModelOfTheOtherPlayer()
  π ← InitialModel()            // the current model of the other player
  h ← ⟨⟩                        // the current history
  a ← GenerateMove(π, h)        // the initial move
  Loop until the end of the game
    Output our move a and obtain the other player's move b
    h := h · ⟨(a, b)⟩
    π := UpdateModel(π, h)
    a := GenerateMove(π, h)
  End Loop

Figure 4.1: An abstract representation of a class of strategies that generate moves using a model of the other player.

4.4.1 Discussion

When using a policy to express the other player's strategy, an important question is how large a policy schema to use for the hypothesized policy. If the other player's strategy is complicated and the policy schema is too small, the policy schema won't provide enough detail to give useful predictions of the other player's behavior. But if the policy schema is too large, DBS will be unable to compute an accurate approximation of each rule's degree of cooperation, because the number of iterations in the game is too small for learning all the rules precisely. After all, a large policy does not necessarily outperform a small one when it is used to project the behavior of the other player in the future; in fact, a policy schema that is too large is worse, for the first reason above. For these reasons we use a small policy schema. We found that a simple policy is sufficient for our noise detection technique to work. In the competition, we used a policy schema of size 4: {(C,C),(C,D),(D,C),(D,D)}. We have found this to be good enough for modeling a large number of strategies for the purpose of noise detection.

Procedure DerivedBeliefStrategy()
1. Rd := π_TFT                       // the default rule set
2. Rc := ∅                           // the current rule set
3. a0 := C; b0 := C; h := ⟨(a0,b0)⟩; π := Rd; k := 1; v := 0
4. a1 := MoveGen(π, h)
5. Loop until the end of the game
6.   Output ak and obtain the other player's move bk
7.   r+ := ((a_{k−1}, b_{k−1}) → bk)
8.   r− := ((a_{k−1}, b_{k−1}) → ({C,D} \ {bk}))
9.   If r+, r− ∉ Rc, then
10.     If ShouldPromote(r+) = true, then insert r+ into Rc
11.   If r+ ∈ Rc, then set the violation count of r+ to zero
12.   If r− ∈ Rc and ShouldDemote(r−) = true, then
13.     Rd := Rc ∪ Rd; Rc := ∅; v := 0
14.   If r− ∈ Rd, then v := v + 1
15.   If v > RejectThreshold, or (r+ ∈ Rc and r− ∈ Rd), then
16.     Rd := ∅; v := 0
17.   Rp := {(Cond → p′) ∈ R^prob_{k+1} : Cond does not appear in Rc or Rd}
18.   π := Rc ∪ Rd ∪ Rp              // construct a hypothesized policy
19.   h := h · ⟨(ak, bk)⟩; a_{k+1} := MoveGen(π, h); k := k + 1
20. End Loop

Figure 4.2: An outline of the DBS strategy.
ShouldPromote first increases r+'s promotion count, and then if r+'s promotion count exceeds the promotion threshold, ShouldPromote returns true and resets r+'s promotion count. Likewise, ShouldDemote first increases r−'s violation count, and then if r−'s violation count exceeds the violation threshold, ShouldDemote returns true and resets r−'s violation count. Rp in Line 17 is the probabilistic rule set; R^prob_{k+1} in Line 17 is calculated from Equation 4.2.

4.5 Learning Hypothesized Policies in Noisy Environments

In this section we describe how DBS learns and maintains a hypothesized policy for the other player's strategy. Section 4.5.1 describes how DBS uses discounted frequencies of each behavior to estimate the degree of cooperation of each rule in the hypothesized policy. Section 4.5.2 explains why discounted frequencies alone are not sufficient for constructing an accurate model of the other player's strategy in the presence of noise, and how symbolic noise detection and temporary tolerance can help overcome the difficulty of using discounted frequencies alone. Section 4.5.3 presents the induction technique DBS uses to identify deterministic behaviors in the other player. Section 4.5.4 illustrates how DBS defers judgment about whether an anomaly is due to noise. Section 4.5.5 discusses how DBS updates the hypothesized policy when it detects a change of behavior.

4.5.1 Learning by Discounted Frequencies

In Section 4.3, we introduced the learning function L_avg, which estimates the probabilities of cooperation by averaging the number of "firings" of rules in the current history. L_avg does not distinguish early moves from recent moves. However, agents in the IPD often change their behavior from time to time. This is especially true in noisy environments, since noise can trigger a change of behavior. Thus, early moves should count for less than recent moves in the current history when estimating the degree of cooperation. We now describe a better learning function for estimating the degrees of cooperation of the rules in the hypothesized policy. The idea is to maintain a discounted frequency for each behavior: instead of keeping an ordinary frequency count of how often the other player cooperates under a condition, DBS applies discount factors based on how recent each occurrence of the behavior was.

Given a history h = ⟨(a1,b1),(a2,b2),...,(ak,bk)⟩, a real number λ between 0 and 1 (called the discount factor), and an initial hypothesized policy π0 = {Cond1 → p^0_1, Cond2 → p^0_2, ..., Condn → p^0_n} whose policy schema is C = {Cond1, Cond2, ..., Condn}, the probabilistic policy at iteration k + 1 is R^prob_{k+1} = {Cond1 → p^{k+1}_1, Cond2 → p^{k+1}_2, ..., Condn → p^{k+1}_n}, where p^{k+1}_i is computed by the following equation:

p^{k+1}_i = ( Σ_{0 ≤ j ≤ k} λ^{k−j} g_j ) / ( Σ_{0 ≤ j ≤ k} λ^{k−j} f_j )     (4.2)

where

g_j = p^0_i   if j = 0,
      1       if 1 ≤ j ≤ k, Cond_i(h_{j−1}) = True, and b_j = C,
      0       otherwise;

f_j = p^0_i   if j = 0,
      1       if 1 ≤ j ≤ k and Cond_i(h_{j−1}) = True,
      0       otherwise;

h_{j−1} = ⟨⟩   if j = 1,
          ⟨(a1,b1),(a2,b2),...,(a_{j−1},b_{j−1})⟩   otherwise.

In short, the current history h has k + 1 possible prefixes, and f_j is basically a Boolean function indicating whether the prefix of h up to the (j−1)'th iteration satisfies Cond_i; g_j is a restricted version of f_j.
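To make Equation 4.2 concrete, the following is a minimal Python sketch of the discounted-frequency estimate for the size-4 policy schema used in the competition. The function and variable names are illustrative, not the DBS implementation, and the sketch uses the convention from Section 4.3 that the empty history satisfies the condition (C,C).

def discounted_degree(history, cond, p0, lam=0.75):
    # Estimate p_i^{k+1} (Equation 4.2) for the rule whose condition is `cond`.
    # `history` is a list of interactions (a_j, b_j); `cond` is a previous-
    # iteration interaction such as ("D", "C"); `p0` is the initial degree.
    k = len(history)
    num = lam ** k * p0            # j = 0 term: g_0 = f_0 = p0
    den = lam ** k * p0
    for j in range(1, k + 1):
        # Cond_i(h_{j-1}) looks at the last interaction of the first j-1 moves;
        # the empty history (j = 1) satisfies the condition (C, C).
        prev = history[j - 2] if j >= 2 else ("C", "C")
        if prev == cond:
            weight = lam ** (k - j)
            den += weight                      # f_j = 1
            if history[j - 1][1] == "C":       # the modeled player played C
                num += weight                  # g_j = 1
    return num / den if den > 0 else 0.0

# Usage: the (D,C) rule against the history from the TFTT example of Section 4.3
# (the undiscounted average over the same history is the 0.66 computed by L_avg).
hist = [("C","C"),("D","C"),("C","C"),("D","C"),("D","C"),("D","D"),("C","D"),("C","C")]
print(discounted_degree(hist, ("D", "C"), p0=0.0))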
When λ = 1, p_i is approximately equal to the ordinary frequency with which the other player cooperates when Cond_i holds. When λ is less than 1, p_i becomes a weighted sum of the frequencies that gives more weight to recent events than to earlier ones. For our purposes, it is important to use λ < 1, because the other player may change its behavior suddenly (for instance, when GRIM is triggered), and we should then forget about its past behavior and adapt to its new behavior. In the competition, we used λ = 0.75.

An important question is how large a policy schema to use for the hypothesized policy. If the policy schema is too small, it won't provide enough detail to give useful predictions of the other player's behavior. But if the policy schema is too large, DBS will be unable to compute an accurate approximation of each rule's degree of cooperation, because the number of iterations in the game will be too small. In the competition, we used a policy schema of size 4: {(C,C),(C,D),(D,C),(D,D)}. We have found this to be good enough for modeling a large number of strategies.

It is essential to have a good initial hypothesized policy, because at the beginning of the game the history is not long enough for us to derive any meaningful information about the other player's strategy. Due to the past success of TFT, we believe many strategies in the competition follow or mimic TFT. Thus, in the competition the initial hypothesized policy of DBS is π_TFT = {(C,C) → 1.0, (C,D) → 1.0, (D,C) → 0.0, (D,D) → 0.0}.

4.5.2 Deficiencies of Discounted Frequencies in Noisy Environments

It may appear that the probabilistic policy learned by the discounted-frequency learning technique should be inherently capable of tolerating noise, because it takes many, if not all, moves in the history into account: if the number of terms in the calculation of the average or weighted average is large enough, the effect of noise should be small. However, there is a problem with this reasoning: it neglects the effect of multiple occurrences of noise within a small time interval.

A mis-implementation that alters the move of one player distorts an established pattern of behavior observed by the other player. The general effect of such a distortion on Equation 4.2 is hard to tell; it varies with the values of the parameters and the history. But if several distortions occur within a small time interval, the distortion may be big enough to alter the probabilistic policy and hence change our decision about what move to make. This change of decision may potentially destroy an established pattern of mutual cooperation between the players.

At first glance, it might seem rare for several noise events to occur at nearly the same time. But if the game is long enough, the probability of this happening can be quite high. The probability of getting two noise events in two consecutive iterations out of a sequence of i iterations can be computed recursively as X_i = p(p + qX_{i−2}) + qX_{i−1}, with X_0 = X_1 = 0, where p is the probability of a noise event and q = 1 − p. In the competition, the noise level was p = 0.1 and i = 200, which gives X_200 = 0.84. Similarly, the probabilities of getting three and four noises in consecutive iterations are 0.16 and 0.018, respectively.

In the 2005 competition, there were 165 players, and each player played each of the other players five times. This means every player played 825 games. On average, there were 693 games having two noises in two consecutive iterations, 132 games having three noises in three consecutive iterations, and 15 games having four noises in four consecutive iterations.
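These figures can be reproduced directly from the recursion X_i = p(p + qX_{i−2}) + qX_{i−1}. The short Python script below is an illustrative sketch; the generalization to runs longer than two iterations is an assumption made here (conditioning on the first iteration that breaks the run), and it reduces to the two-in-a-row recursion above when r = 2.

def prob_noise_run(i, p, r):
    # Probability of at least one run of r consecutive noise events in i
    # iterations; for r = 2 this is the recursion X_i in the text.
    q = 1.0 - p
    x = [0.0] * (i + 1)
    for n in range(r, i + 1):
        # either the first r iterations are all noisy (p**r), or the run is
        # broken at the j-th iteration and we start over with n - j iterations
        x[n] = p ** r + sum(p ** (j - 1) * q * x[n - j] for j in range(1, r + 1))
    return x[i]

p = 0.1                              # noise level in the competition
print(prob_noise_run(200, p, 2))     # about 0.84, as reported above
print(prob_noise_run(200, p, 3))     # about 0.16
print(prob_noise_run(200, p, 4))     # about 0.018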
Clearly, we did not want to ignore situations in which several noises occur nearly at the same time. Symbolic noise detection and temporary tolerance outlined in Section 4.1 provide a waytoreducetheamountofsusceptibilitytomultipleoccurrencesofnoiseinasmalltime interval. Deterministic rules enable DBS to detect anomalies in the observed behavior of the other player. DBS temporarily ignores the anomalies which may or may not be due to noise, until a better conclusion about the cause of the anomalies can be drawn. This temporary tolerance prevents DBS from learning from the moves that may be affected by noise, and hence protects the hypothesized policy from the influence of errors due to noise. Since the amount of tolerance (and the accuracy of noise detection) can be controlled by adjusting parameters in DBS, we can reduce the amount of susceptibility to multiple occurrences of noise by increasing the amount of tolerance, at the expense of a higher cost of noise detection?losing more points when a change of behavior occurs. 4.5.3 Identifying Deterministic Rules Using Induction As we discussed in Section 4.1, deterministic behaviors are abundant in the Iter- ated Prisoner?s Dilemma. Deterministic behaviors can be modeled by deterministic rules, whereas random behavior would require probabilistic rules. 106 A nice feature about deterministic rules is that they have only two possible degrees of cooperation: zero or one, as opposed to an infinite set of possible degrees of coopera- tion of the probabilistic rules. Therefore, there should be ways to learn deterministic rules thataremuchfasterthanthediscountedfrequencymethoddescribedearlier. Forexample, if we knew at the outset which rules were deterministic, it would take only one occurrence to learn each of them: each time the condition of a deterministic rule was satisfied, we could assign a degree of cooperation of 1 or 0 depending on whether the player?s move wasC orD. The trick, of course, is to determine which rules are deterministic. We have devel- oped an inductive-reasoning method to distinguish deterministic rules from probabilistic rules during learning and to learn the correct degree of cooperation for the deterministic rules. In general, induction is the process of deriving general principles from particular facts or instances. To learn deterministic rules, the idea of induction can be used as follows. If a certain kind of behavior occurs repeatedly several times, and during this period of time there is no other behavior that contradicts to this kind of behavior, then we will hypothesize that the chance of the same kind of behavior occurring in the next few iterations is pretty high, regardless of how the other player behaved in the remote past. More precisely, let K ? 1 be a number which we will call the promotion thresh- old. Let ? = ?(a1,b1),(a2,b2),...,(ak,bk)? be the current history. For each condi- tion Condj ? C, let Ij be the set of indexes such that for all i ? Ij, i < k and Condj(?(a1,b1),(a2,b2),...,(ai,bi)?) = True. Let ?Ij be the set of the largest K indexes in Ij. If|Ij|?K and for all i? ?Ij, bi+1 = C (i.e., the other player chose C when the 107 previous history up to thei?th iteration satisfies Condj), then we will hypothesize that the other player will choose C whenever Condj is satisfied; hence we will use Condj ? 1 as a deterministic rule. Likewise, if|Ij|?K and for all i? ?Ij, bi+1 = D, we will use Condj ?0 as a deterministic rule. 
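The induction rule just described, together with the violation counting used for symbolic noise detection (Section 4.5.4 and Figure 4.2), can be sketched in Python as follows. The class name, thresholds, and return values are illustrative, not the actual DBS implementation.

PROMOTION_THRESHOLD = 3    # the promotion threshold K of Section 4.5.3
VIOLATION_THRESHOLD = 4    # the value used by our best competition entry

class RuleTracker:
    def __init__(self):
        self.deterministic = {}   # cond -> 'C' or 'D' (roughly the rule set Rc)
        self.promotion = {}       # cond -> (last move, consecutive count)
        self.violation = {}       # cond -> consecutive violations of the rule

    def observe(self, cond, move):
        # Update the rules after seeing the other player's `move` under `cond`.
        rule = self.deterministic.get(cond)
        if rule is None:
            # Induction: promote a rule after K consistent recent observations.
            last_move, count = self.promotion.get(cond, (move, 0))
            count = count + 1 if move == last_move else 1
            self.promotion[cond] = (move, count)
            if count >= PROMOTION_THRESHOLD:
                self.deterministic[cond] = move
                self.violation[cond] = 0
        elif move == rule:
            self.violation[cond] = 0          # positive example: unmark the rule
        else:
            # Anomaly: tolerate it temporarily and collect evidence.
            self.violation[cond] = self.violation.get(cond, 0) + 1
            if self.violation[cond] > VIOLATION_THRESHOLD:
                return "behavior_changed"     # caller rebuilds its rule sets
        return "ok"

# Usage: three consistent cooperations promote a deterministic rule, and a
# single anomaly afterwards is tolerated rather than treated as a change.
tracker = RuleTracker()
for move in ["C", "C", "C", "D"]:
    print(tracker.observe(("C", "C"), move))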
See Line 7 to Line 10 in Figure 4.2 for an outline of the induction method we use in DBS.

The induction method can be faster at learning deterministic rules than a discounted frequency method that regards a rule as deterministic when the degree of cooperation estimated by discounted frequencies is above or below certain thresholds. As can be seen in Figure 4.3, the induction method takes only three iterations to infer the other player's moves correctly, whereas the discounted frequency technique takes six iterations to obtain a 95% degree of cooperation, and it never reaches 100%.6 We might set the threshold in the discounted frequency method to less than 0.8 to make it faster than the induction method, but this would increase the chance of incorrectly identifying a random behavior as deterministic.

A faster learning speed allows us to infer deterministic rules from a shorter history, and hence increases the effectiveness of symbolic noise detection by having more deterministic rules available at any time, especially when a change of the other player's behavior occurs. The promotion threshold K controls the speed of the identification of deterministic rules. The larger the value of K, the slower the identification, but the less likely we are to mistakenly hypothesize that the other player's behavior is deterministic.

6If we modify Equation 4.2 to discard the early interactions of a game, the degree of cooperation of a probabilistic rule can attain 100%.

Figure 4.3: Learning speeds (degree of cooperation versus iteration) of the induction method and the discounted frequency method when the other player always cooperates. The initial degree of cooperation is zero, the discount factor is 0.75, and the promotion threshold is 3.

4.5.4 Symbolic Noise Detection and Temporary Tolerance

Once DBS has identified the set of deterministic rules, it can readily use them to detect noise. As we said earlier, if the other player's move violates a deterministic rule, the violation can be caused either by noise or by a change in the other player's behavior, and DBS uses an evidence collection process to figure out which is the case. More precisely, once a deterministic rule Cond_i → o_i is violated (i.e., the history up to the previous iteration satisfies Cond_i but the other player's move in the current iteration is different from o_i), DBS keeps the violated rule but marks it as violated. Then DBS starts an evidence collection process, which in the implementation of our competition entries is a form of violation counting: for each violated rule, DBS maintains a counter called the violation count to record how many violations of the rule have occurred (Line 12).7 In the subsequent iterations, DBS increases the violation count by one every time a violation of the rule occurs. However, if DBS encounters a positive example of the rule, DBS resets the violation count to zero and unmarks the rule (Line 11). If any violation count exceeds a threshold called the violation threshold, DBS concludes that the violation is not due to noise but to a change of the other player's behavior. In this case, DBS invokes a special procedure (described in Section 4.5.5) to handle the situation (Line 13). This evidence collection process takes advantage of the fact that the pattern of moves affected by noise is often quite different from the pattern of moves generated by the new behavior after a change of behavior occurs.
Therefore, it can often distinguish noise from a change of behavior by observing moves in the next few iterations and gather enough evidence. As discussed in Section 4.5.2, we want to set a larger violation threshold in order to avoid the drawback of the discount frequency method in dealing with several misinterpre- tations caused by noise within a small time interval. However, if the threshold is too large, it will slow down the speed of adaptation to changes in the other player?s behavior. In the competition, we entered DBS several times with several different violation thresholds; and in the one that performed the best, the violation threshold was 4. 4.5.5 Coping with Ignorance of the Other Player?s New Behavior When the evidence collection process detects a change in the other player?s behav- ior, DBS knows little about the other player?s new behavior. How DBS copes with this 7We believe that a better evidence collection process should be based on statistical hypothesis testing. 110 ignorance is critical to its success. When DBS knows little about the other player?s new behavior when it detects a change of the other player?s behavior, DBS temporarily uses the previous hypothesized policy as the current hypothesized policy, until it deems that this substitution no longer works. More precisely, DBS maintains two sets of deterministic rules: the current rule set Rc and the default rule set Rd. Rc is the set of deterministic rules that is learned after the change of behavior occurs, whileRd is the set of deterministic rules before the change of behavior occurs. At the beginning of a game, Rd is piTFT and Rc is an empty set (Line 1 and Line 2). When DBS constructs a hypothesized policy pi for move generation, it uses every rule in Rc and Rd. In addition, for any missing rule (i.e., the rule those condition are different from any rule?s condition in Rc or Rd), we regard it as a probabilistic rule and approximate its degree of cooperation by Equation 4.2 (Line 17). These probabilistic rules form the probabilistic rule setRp?Rprobk+1. While DBS can insert any newly found deterministic rule in Rc, it insert rules into Rd only when the evidence collection process detects a change of the other player?s be- havior. When it happens, DBS copies all the rules in Rc to Rd, and then set Rc to an empty set (Line 13). The default rule set is designed to be rejected: we maintain a violation count to record the number of violations to any rule in Rd. Every time any rule in Rd is violated, the violation count increased by 1 (Line 14). Once the violation count exceeds a rejection threshold, we drop the default rule set entirely (set it to an empty set) and reset the viola- tion count (Line 15 and Line 16). We also reject Rd whenever any rule in Rc contradicts any rule inRd (Line 15). 111 We preserve the rules in Rc mainly for sake of providing a smooth transition: we don?t want to convert all deterministic rules to probabilistic rules at once, as it might suddenly alter the course of our moves, since the move generator in DBS generates moves according to the current hypothesized policy only. This sudden change in DBS?s behavior can potentially disrupt the cooperative relationship with the other player. Furthermore, some of the rules inRc may still hold, and we don?t want to learn them from scratch. Noticethatsymbolicnoisedetectionandtemporarytolerancemakesuseoftherules inRc butnottherulesinRd, althoughDBSmakesuseoftherulesinbothRc andRd when DBS decides the next move (Line 18). 
We do not use Rd for symbolic noise detection and temporary tolerance because when DBS inserts rules into Rd, a change of the other player?s behavior has already occurred?there is little reason to believe that anomalies detected using the rules inRd are due to noise. Furthermore, we want to turn off symbolic noise detection and temporary tolerance temporarily when a change of behavior occurs, in order to identify a whole new set of deterministic rules from scratch. 4.6 The Move Generator in DBS We devised a simple and reasonably effective move generator for DBS. As shown in Figure 4.1, the move generator takes the current hypothesized policypi and the current history ?current whose length is l =|?current|, and then decides whether DBS should co- operate in the current iteration. It is difficult to devise a good move generator, because our move could lead to a change of the hypothesized policy and complicate our projection of the long-term payoff. Perhaps, the move generator should take the other player?s model 112 of DBS into account [18]. However, we found that by making the assumption that hy- pothesized policy will not change for the rest of the game, we can devise a simple move generator that generates fairly good moves. The idea is that we compute the maximum expected score we can possibly earn for the rest of the game, using a technique that com- bines some ideas from both game-tree search and Markov Decision Processes (MDPs). Then we choose the first move in the set of moves that leads to this maximum expected score as our move for the current iteration. To accomplish the above, we consider all possible histories whose prefix is?current as a tree. In this tree, each path starting from the root represents a possible history, which is a sequence of past interactions in ?current plus a sequence of possible interactions in futureiterations. Eachnodeonapathrepresentstheinteractionofaniterationofahistory. Figure 4.4 shows an example of such a tree. The root node of the tree represents the interaction of the first iteration. Let interaction(S) be the interaction represented by a node S. Let ?S0,S1,...,Sk? be a sequence of nodes on the path from the root S0 to Sk. We define the depth of Sk to be k ? l, and the history of Sk be ?(Sk) = ?interaction(S1),interaction(S2),...,interaction(Sk)?. Si is called the current node if the depth ofSi is zero; the current node represents the interaction of the last iteration and ?(Si) = ?current. As we do not know when the game will end, we assume it will go for N? more iterations; thus each path in the tree has length of at mostl+N?. Our objective is to compute a non-negative real number called the maximum ex- pected score E(S) for each node S with a non-negative depth. Like a conventional game tree search in computer chess or checkers, the maximum expected scores are de- 113 fined recursively: the maximum expected score of a node at depth i is determined by the maximum expected scores of its children nodes at depth i + 1. The maximum ex- pected score of a node S of depth N? is assumed to be the value computed by an evaluation function f. This is a mapping from histories to non-negative real numbers, such that E(S) = f(?(S)). The maximum expected score of a node S of depth k, where 0 ? k < N?, is computed by the maximizing rule: suppose the four possi- ble nodes after S are SCC, SCD, SDC, and SDD, and let p be the degree of cooper- ation predicted by the current hypothesized policy pi (i.e., p is the right-hand side of a rule (Cond ? 
p) in π such that the history of S satisfies the condition Cond). Then E(S) = max{E_C(S), E_D(S)}, where E_C(S) = p(u_CC + E(S_CC)) + (1 − p)(u_CD + E(S_CD)) and E_D(S) = p(u_DC + E(S_DC)) + (1 − p)(u_DD + E(S_DD)). Furthermore, we let move(S) be the decision made by the maximizing rule at each node S, i.e., move(S) = C if E_C(S) ≥ E_D(S) and move(S) = D otherwise. By applying this maximizing rule recursively, we obtain the maximum expected score of every node with a non-negative depth. The move that we choose for the current iteration is move(Si), where Si is the current node.

The number of nodes in the tree increases exponentially with N′; the tree can be huge, with over a billion nodes when N′ ≥ 15. It is infeasible to compute the maximum expected score for every node one by one. Fortunately, we can use dynamic programming to speed up the computation. As an example, suppose the hypothesized policy is π = {(C,C) → p_CC, (C,D) → p_CD, (D,C) → p_DC, (D,D) → p_DD}, and suppose the evaluation function f returns a constant f_{o1o2} for any history that satisfies the condition (o1,o2), where o1, o2 ∈ {C,D}. Then, given our assumption that the hypothesized policy does not change, it is not hard to show by induction that all nodes whose histories have the same length and satisfy the same condition have the same maximum expected score. Using this property, we construct a table of size 4 × (N′ + 2) in which each entry, denoted by E^k_{o1o2}, stores the maximum expected score of the nodes whose histories have length l + k and satisfy the condition (o1,o2), where o1, o2 ∈ {C,D}. We also have another table of the same size to record the decisions the procedure makes; the entry m^k_{o1o2} of this table is the decision made at E^k_{o1o2}. Initially, we set E^{N′+1}_{CC} = f_CC, E^{N′+1}_{CD} = f_CD, E^{N′+1}_{DC} = f_DC, and E^{N′+1}_{DD} = f_DD. Then the maximum expected scores in the remaining entries can be computed by the following recursive equation:

E^k_{o1o2} = max{ p_{o1o2}(u_CC + E^{k+1}_{CC}) + (1 − p_{o1o2})(u_CD + E^{k+1}_{CD}),
                  p_{o1o2}(u_DC + E^{k+1}_{DC}) + (1 − p_{o1o2})(u_DD + E^{k+1}_{DD}) }     (4.3)

where o1, o2 ∈ {C,D}. Similarly, m^k_{o1o2} = C if p_{o1o2}(u_CC + E^{k+1}_{CC}) + (1 − p_{o1o2})(u_CD + E^{k+1}_{CD}) ≥ p_{o1o2}(u_DC + E^{k+1}_{DC}) + (1 − p_{o1o2})(u_DD + E^{k+1}_{DD}), and m^k_{o1o2} = D otherwise. The policy {(C,C) → m^0_{CC}, (C,D) → m^0_{CD}, (D,C) → m^0_{DC}, (D,D) → m^0_{DD}} is called the recommended policy. If the interaction of the previous iteration is (o1,o2), we pick m^0_{o1o2} as the recommended move for the current iteration. The pseudocode of this dynamic programming algorithm is shown in Figure 4.5.

4.6.1 Generalizing the Move Generator

In general, the policy schema of the hypothesized policy can be different from the simple policy schema that has been used throughout this chapter (i.e., {(C,C),(C,D),(D,C),(D,D)}). In our experience, this simple policy schema is good enough for move generation and noise detection in the IPD. In other problems, however, it may not be expressive enough to describe the behavior of the opponents for move generation and noise detection. If we use a different policy schema, Equation 4.3 is no longer applicable and we have to use a different set of recursive equations. In general, the problem can be viewed as a Markov Decision Process (MDP) in which the other player constitutes a nondeterministic environment whose states are the conditions in the policy schema and whose transition matrix has probabilities that are equal to the degrees of cooperation of the rules in the hypothesized policy.
For a general policy schema, the recursive equations are similar to those used in value iteration for computing an optimal policy for an MDP. There are, however, subtle differences between solving an MDP and generating a move for the IPD. First, a policy in an MDP is different from a policy as defined here: a policy in an MDP is a mapping from states to actions, while a policy in this chapter is a mapping from conditions to probabilities of cooperation. The difference is due to the fact that DBS uses a policy to model an opponent's behavior; DBS does not use a policy to represent a strategy for itself to interact with other players. As discussed in Section 4.3, a policy for agent modeling is just a hypothesized policy of the behavior of the other player. Since we do not know the exact set of states of the opponent, we replace the concept of states in MDPs with policy schemata, each of which is a set of conditions over histories. Furthermore, since we do not know exactly what the opponent would do under each condition, we use a probability distribution over the set of actions on the right-hand side of each rule in a policy. These differences reflect the purpose of opponent modeling.

Second, the MoveGen procedure differs from value iteration for MDPs in several ways. Value iteration computes the optimal solution for an infinite-horizon MDP. The horizon of a game of the IPD is finite, but the players do not know the length of the game, which makes the game look infinite from the players' viewpoint. Instead of using a discount factor, the MoveGen procedure uses a fixed search depth and an evaluation function to estimate the expected utilities at the terminal nodes. The use of a fixed search depth and an evaluation function is a remnant of game-tree search. We have not checked whether MoveGen is better than value iteration in our context, so we cannot conclude that MoveGen outperforms value iteration with a discount factor. But an obvious benefit of the MoveGen procedure is that its expected utilities are more intuitive than the discounted expected utilities in value iteration. Moreover, we believe that the results of the MoveGen procedure are similar to those of value iteration with a large discount factor.

4.6.2 An Analysis of the Move Generator

The dynamic programming algorithm for move generation can search as deep in the game tree as we want. In the competition, the algorithm stops exploring the game tree at the cut-off level N′ = 60, and then uses the evaluation function f, as described in the caption of Figure 4.5, to estimate the expected scores of the nodes at the cut-off level. Would the recommended policies and the recommended moves be any different if we increased the cut-off level and used other evaluation functions? We conducted an experiment to investigate this issue; the results are presented in this section.

In the experiment, we generated a set of 625 hypothesized policies of the form {(C,C) → p_CC, (C,D) → p_CD, (D,C) → p_DC, (D,D) → p_DD}, where each of the degrees of cooperation p_CC, p_CD, p_DC, and p_DD takes one of five values: 0.00, 0.25, 0.50, 0.75, or 1.00 (giving 5^4 = 625 policies). For each hypothesized policy, we did the following. First, we varied the search depth N′ from 1 to 100 and computed a recommended policy for each search depth using the MoveGen procedure in Figure 4.5; we label the recommended policies π_1, π_2, ..., π_100, where π_i is generated with search depth i. Second, we compared π_i with π_{i+1} to see whether they differ.
If there is a difference, we say there is a change of recommended policy at search depth i. Finally, we counted the number of changes of recommended policy at search depths between 1 and k, to see how often the recommended policies changed as we increased the search depth. We plotted the results in Figure 4.6. The graph contains a total of 625 lines, one for each hypothesized policy. As can be seen, the recommended policies for most hypothesized policies did not change as we increased the search depth. However, a small number of them changed at every search depth. This phenomenon is worrying, because it indicates that the MoveGen procedure is unstable: its result is too sensitive to the choice of the search depth.

To see why this phenomenon occurs, we printed out a few sequences of recommended policies to look for patterns. We found that in most cases the sequence of recommended policies is simply an alternation of two recommended policies as the search depth increases. In a small number of cases, there is an alternation of three, four, or five recommended policies. We then plotted another graph to examine the periodicity of these cycles of recommended policies. First, for each hypothesized policy, we varied the search depth to generate a sequence of 100 recommended policies. Second, for each search depth i (called the starting search depth), we examined the subset of recommended policies generated with search depths between i and 100, to see whether the recommended policies repeat themselves periodically in this subset. Finally, we calculated, for each starting search depth, the percentage of sequences that exhibit a cycle of recommended policies with a particular periodicity, and plotted the results in Figure 4.7. As can be seen, over 90% of the sequences have a periodicity of 1 from a starting search depth of 5 onwards; in other words, the recommended policies do not change as the search depth increases. Around 8% of the hypothesized policies cause the MoveGen procedure to generate a sequence of recommended policies that alternates between two different recommended policies. The group of horizontal lines at the bottom of Figure 4.7 corresponds to the sequences exhibiting periodicities of 3, 4, or 5, or "no periodicity" (meaning there is no observable pattern of changes of recommended policies in the sequence); their periodicity dropped to either 1 or 2 as the starting search depth increased. Thus, increasing the search depth helps stabilize the solutions returned by the MoveGen procedure, but in about 10% of the situations the MoveGen procedure still returns alternating sequences of recommended policies as the search depth increases.

It is interesting to see which recommended policies the MoveGen procedure actually returns. Table 4.8 shows the percentage of each recommended policy returned by MoveGen given the set of hypothesized policies in the experiments. Each recommended policy is represented by four characters m1 m2 m3 m4, meaning that the recommended policy is {(C,C) → m1, (C,D) → m2, (D,C) → m3, (D,D) → m4}. For example, CDDD in the table refers to the recommended policy {(C,C) → C, (C,D) → D, (D,C) → D, (D,D) → D}. We did not take into account the sequences of recommended policies whose periodicity is greater than 1. We can see that over 60% of the recommended policies are ALLD (DDDD). Around 8% cooperate only when both our agent and the other player cooperated in the previous round (CDDD).
A slightly more generous strategy, which cooperates when the interaction of the previous round is (C,C) or (D,D), is equally common (8% for CDDC). In only about 6% of the hypothesized policies does MoveGen recommend cooperating no matter what happened in the previous iteration (CCCC = ALLC). Tit-for-Tat (CDCD) is recommended for less than 1% of the hypothesized policies. One caveat in interpreting the data in Table 4.8 is that the table does not show how often a recommended policy was returned by MoveGen in the actual tournament; it only tells us how often each recommended policy would be used if we chose a hypothesized policy at random. Thus, the MoveGen procedure might return TFT (CDCD) or ALLC (CCCC) more often than ALLD in a tournament, since the hypothesized policies for which MoveGen returns ALLD do not occur frequently in the tournament.

4.7 Competition Results

The 2005 IPD Competition was actually a set of four competitions, each for a different version of the IPD. The one for the Noisy IPD was Category 2, which used a noise level of 0.1. Of the 165 programs entered into the competition, eight were provided by the organizer of the competition: ALLC (always cooperates), ALLD (always defects), GRIM (cooperates until the first defection of the other player, and thereafter always defects), NEG (cooperates (or defects) if the other player defected (or cooperated) in the previous iteration), RAND (defects or cooperates with probability 1/2), STFT (suspicious TFT, which is like TFT except that it defects in the first iteration), TFT, and TFTT. All of these strategies are well known in the literature on the IPD.

The remaining 157 programs were submitted by 36 different participants. Each participant was allowed to submit up to 20 programs. We submitted the following 20:

• DBS. We entered nine different versions of DBS into the competition, each with a different set of parameters or a different implementation. The one that performed best was DBSz, which uses exactly the set of features described in this chapter. Versions with fewer features or additional features did not do as well.

• Learning of Opponent's Strategy with Forgiveness (LSF). Like DBS, LSF is a strategy that learns the other player's strategy during the game. The difference between LSF and DBS is that LSF does not make use of symbolic noise detection. It uses the discount frequency (Equation 4.2) to learn the other player's strategy, plus a forgiveness strategy that decides when to cooperate if mutual defection occurs (a generic sketch of a discounted-frequency estimate appears after this list). We entered one instance of LSF. It placed around 30th in three of the runs and around 70th in the other two runs. We believe LSF's poor ranking is due to the deficiency of using the discount frequency alone, as discussed at the beginning of Section 4.5.

• Tit-for-Tat Improved (TFTI). TFTI is a strategy based on a totally different philosophy from DBS's. It is not an opponent-modeling strategy, in the sense that it does not model the other player's behavior using a set of rules. Instead, it is a variant of TFT with a sophisticated forgiveness policy that aims at overcoming some of the deficiencies of TFT in noisy environments. We entered ten instantiations of TFTI in the competition, each with a different set of parameters or some differences in the implementation. The best of these, TFTIm, did well in the competition (see Table 4.1), but not as well as DBS.
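As referenced in the LSF entry above, the following is a generic sketch of an exponentially discounted frequency estimate of the other player's degree of cooperation. It only illustrates the general idea of weighting recent behavior more heavily; the exact form of Equation 4.2 may differ, and the class name, discount value, and prior below are hypothetical.

# A generic exponentially discounted estimate of the other player's degree of
# cooperation under one condition. Illustration only; not Equation 4.2 verbatim.
class DiscountedCooperation:
    def __init__(self, discount=0.95, prior=1.0):
        # `discount` and `prior` are illustrative values, not taken from the text.
        self.discount = discount
        self.weight = prior        # discounted count of observations
        self.coop = prior          # discounted count of cooperations

    def update(self, other_cooperated: bool) -> float:
        """Decay old observations, add the new one, and return the new estimate."""
        self.weight = self.discount * self.weight + 1.0
        self.coop = self.discount * self.coop + (1.0 if other_cooperated else 0.0)
        return self.coop / self.weight

# Example: after a long run of cooperation, a single defection changes the
# estimate only slightly, because older observations still carry most weight.
est = DiscountedCooperation()
for _ in range(20):
    est.update(True)
print(round(est.update(False), 2))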
Three of the other participants each entered the full complement of twenty programs: Wolfgang Kienreich, Jia-wei Li, and Perukrishnen Vytelingum. All three of them appear to have adopted the master-and-slaves strategy that was first proposed by Vytelingum's team from the University of Southampton. A master-and-slaves strategy is not a strategy for a single program, but for a team of collaborating programs. One of the programs in the team is the master, and the remaining programs are slaves. The basic idea is that at the start of a run, the master and slaves each make a series of moves following a predefined protocol, in order to identify themselves to each other. From then on, the master always plays "defect" when playing with a slave, and the slaves always play "cooperate" when playing with the master, so that the master gains the highest possible payoff at each iteration. Furthermore, a slave always plays "defect" when playing with a program other than its master, in order to minimize that player's score.

Wolfgang Kienreich's master program was CNGF (CosaNostra Godfather), and its slaves were 19 copies of CNHM (CosaNostra Hitman). Jia-wei Li's master program was IMM01 (Intelligent Machine Master 01), and its slaves were IMS02, IMS03, ..., IMS20 (Intelligent Machine Slave n, for n = 02, 03, ..., 20). Perukrishnen Vytelingum's master program was BWIN (S2Agent1 ZEUS), and its slaves were BLOS2, BLOS3, ..., BLOS20 (like BWIN, these programs also had longer names based on the names of ancient Greek gods). We do not know what strategies the other participants used in their programs.

4.7.1 Overall Average Scores

Category 2 (IPD with noise) consisted of five runs. Each run was a round-robin tournament in which each program played with every program, including itself. Each program participated in 166 games in each run (recall that the game in which a player plays against itself counts as two games for that player). Each game consisted of 200 iterations. A program's score for a game is the sum of its payoffs over all 200 iterations (this sum is at least 0 and at most 1000). A program's total score for an entire run is the sum of its scores over all 166 games. On the competition's website there is a ranking for each of the five runs, in which each program is ranked according to its total score for that run.

When a program P1 plays with a program P2 n times, the average score of P1 against P2 is the sum of the scores P1 received in those games divided by n. In the 2005 IPD tournament, every pair of programs played five times, once in each run, so a program's average score against another program P2 is the average of the scores it received in the five games it played with P2. The overall average score of a program P1 in a tournament is the average of P1's average scores against every program in the tournament, including a copy of P1 itself; equivalently, it is the average of all scores P1 received in the tournament. In the 2005 IPD tournament, a program's overall average score is its average over all games in all five runs, i.e., its total over all five runs divided by 830 = 5 × 166.

Table 4.1 shows the average scores in each of the five runs for the top twenty-five programs, ranked by their overall average scores. All nine of our versions of DBS are among the top twenty-five programs, and they dominate the top ten places.
This phenomenon suggests that DBS's performance is insensitive to the parameters and implementation details of the individual programs. The same holds for TFTI: nine of the ten TFTI programs are ranked between 11th and 25th place, and the remaining one is 29th.

4.7.2 DBS versus the Master-and-Slaves Strategies

We analyze the performance of the DBS programs against the master-and-slaves strategies in four ways: (1) compare the performance of the DBS programs with the performance of the entire master-and-slaves teams as a whole; (2) study how the overall average scores change as the number of slaves decreases; (3) analyze the percentages of different kinds of interactions among programs; and (4) compare the distributions of the average scores of DBSz and the master-and-slave programs using some exploratory data analysis techniques.

4.7.2.1 Group Performance

Recall from Table 4.1 that DBSz placed third in the competition: it lost only to BWIN and IMM01, the masters of two master-and-slaves teams. DBS does not use a master-and-slaves strategy, nor does it conspire with other programs in any other way; in contrast, BWIN's and IMM01's performance depended greatly on the points fed to them by their slaves. In fact, if we average the score of each master with the scores of its slaves, we get 379.9 for BWIN and 351.7 for IMM01, both of which are considerably less than DBSz's score of 408.

There are two kinds of programs in the competition that can be viewed as members of a group rather than as single programs: master-and-slaves teams, and collections of variants of the same algorithm. Another way to view the performance of such groups of programs is to average the performance of the members of each group, which is what we have done in Table 4.2. In this table, the best master-and-slaves team is the one submitted by Perukrishnen Vytelingum, which ranks only 14th. DBS, on the other hand, ranks first.

Table 4.2: A modified version of Table 4.1 in which we have averaged the scores of each collection of programs that was either (i) a group of conspirators or (ii) a collection of variants of the same algorithm. The average score for DBS is 402.1, which is higher than the average score of any other program. The master of the best master-and-slaves team, BWIN, came in 14th with a score of 379.9. Only the top fifteen groups are listed.

  Rank  Program(s)   Participant              Overall Avg. Score
  1     DBS*         Tsz-Chiu Au              402.1
  2     Mod          Philip Hingston          396.9
  3     TTFT         Louis Clement            393.4
  4     TFTI*        Tsz-Chiu Au              392.9
  5     *ESTFT *     Michael Filzmoser        390.5
  6     T4T          David Fogel              390.0
  7     TFTT         Tit-for-Two-Tats         388.4
  8     (no name)    Bingzhong Wang           388.3
  9     TFT          Tit-for-Tat              388.2
  10    SOMETHING    Jan Humble               383.8
  11    TTFT 1       Quan Zhou                383.6
  12    LSF          Tsz-Chiu Au              382.5
  13    STFT         Suspicious TFT           382.1
  14    BWIN/BLOS*   Perukrishnen Vytelingum  379.9
  15    RANB         Muhammad Ahmad           379.3

The poor group performance of the master-and-slaves teams implies that a master can hardly recruit a large number of slaves willing to sacrifice themselves for him in many realistic situations, because there is a strong incentive for slaves to betray their master: they can do much better by themselves. One might argue that the master could recruit more slaves by promising to distribute some of his payoffs to the slaves. However, the possibility of this "private" distribution of payoffs destroys the incentive for agents to form a master-and-slaves team.
The average score of a team of 20 agents in which every team member uses DBS or any of the top twenty non-master-and-slaves programs (i.e., without collusion) is much higher than the average score of a team of 20 agents using the master-and-slaves strategies in this competition. Thus, a team may be better off not colluding: its members can each use one of the top twenty non-master-and-slaves programs (e.g., DBS), and then redistribute their payoffs afterwards (whether the payoffs are distributed evenly, or the master gets more than the slaves).

4.7.2.2 Overall Average Scores versus Number of Slaves

To see the effect of the presence of slaves in the tournament, we want to study how the overall average scores of the programs change when the number of slaves varies. Unfortunately, we cannot rerun the tournament, since the organizer of the tournament did not release the competing programs to the public. But there is a trick to get around this problem. The organizer provided data files containing the records of the histories of every game in the tournament, and from these data files we can compute the scores of every game. Our approach is to recompute the overall average scores of the programs while ignoring the games that involved slave programs, as if the slave programs had not participated in the tournament. By virtually removing slaves from the tournament one by one, we can determine the effect of the slaves on the overall average scores of the other programs, including their own masters.

Each of the three master-and-slaves teams in the tournament has 19 slaves. We randomly selected k slaves from each team, and then recomputed the overall average scores of the remaining programs while ignoring the scores of the games involving these slaves, for 0 ≤ k ≤ 19. To minimize bias due to the selection of the slaves, we repeated this process 20 times with randomly chosen sets of slaves for each k. The averages of the 20 overall average scores of selected programs are shown in Figure 4.9. Due to space limitations, we cannot show the overall average scores of all 165 − k programs in one graph; instead, we plotted the overall average scores of some programs that performed well in the tournament: BWIN, IMM01, CNGF, DBSz, lowESTFT classic, TFTIm, Mod, TTFT, and mediumESTFT classic. The error bars in the figure indicate the maximums and minimums of the overall average scores over the 20 repetitions.

We can see from Figure 4.9 that the overall average scores of the non-master-and-slaves programs, including DBSz, lowESTFT classic, TFTIm, Mod, TTFT, and mediumESTFT classic, increase super-linearly as the number of slaves decreases. This clearly indicates the destructive effect of the presence of slaves in the tournament: the non-master-and-slaves programs could have done much better without the defections intentionally made by the slaves. For instance, the overall average score of DBSz would increase by approximately 19% if there were no slaves in the tournament. Let x be the average score of DBSz when playing with a slave. Then x = 258.3, by the equation

    408 = ((165 − 3×19)/165) × 487 + ((3×19)/165) × x.

Since a program's average score is at most 200 when playing with ALLD in a game of 200 iterations, and x is not much higher than 200, we can see that the slaves almost always defected when they played with non-master-and-slaves programs such as DBSz.
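For reference, solving the equation above for x makes the arithmetic explicit (all numbers are taken from the text):

\[
408 = \frac{165 - 3\times 19}{165}\times 487 + \frac{3\times 19}{165}\times x
\quad\Longrightarrow\quad
x = \frac{408\times 165 - 108\times 487}{57}
  = \frac{67320 - 52596}{57}
  = \frac{14724}{57}
  \approx 258.3 .
\]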
Contrary to the intuition that the master programs would perform worse in the absence of their slaves, the overall average scores of the master programs also increase as the number of slaves decreases, although the rate of increase is not as high as that of the non-master-and-slaves programs. This apparent paradox arises because the slaves defected when they played with the masters of the other teams; since the slaves of the other teams outnumbered the slaves of a master's own team, a master would perform better if no slaves were allowed in the tournament. Because the masters' overall average scores increase more slowly than those of the non-master-and-slaves programs, the overall average scores of the non-master-and-slaves programs eventually surpass the masters' scores. DBSz's score can surpass BWIN's when the number of slaves is reduced by 9, and almost always surpasses BWIN's when the number of slaves is reduced by 11. Thus, it is safe to say that if the size of a master-and-slaves team had been restricted to at most 10 in the tournament (i.e., if no participant could submit more than 10 programs), then DBSz would have placed first.

4.7.2.3 Percentages of Interactions

The reason for the poor performance of the master programs with few or no slaves is that the master-and-slaves strategies did not cooperate with the other players as much as they did among themselves. Table 4.3 gives the percentages of each of the four possible interactions when any program from one group plays with any program from another group. Note that:

• When BWIN and IMM01 play with their slaves, about 64% and 47% of the interactions, respectively, are (D,C), but when non-master-and-slaves strategies play with each other, only 19% of the interactions are (D,C).

• When the slave programs play with non-master-and-slaves programs, over 60% of the interactions are (D,D), but when non-master-and-slaves programs play with other non-master-and-slaves programs, only 31% of the interactions are (D,D).

• The master-and-slaves strategies decrease the overall percentage of (C,C) from 31% to 13%, and increase the overall percentage of (D,D) from 31% to 55%.

4.7.2.4 Distributions of Average Scores

The analyses we have conducted so far focus on the overall average score of a program, which, as defined in Section 4.7.1, is a summary of its average scores against every program in the tournament. To get a clearer picture of the performance of a program in the tournament, we have to look at the individual average scores that constitute the overall average score. The problem is that there are a large number of average scores, and we need a way to present the data succinctly and meaningfully. Our solution is to display the average scores in both density plots and dot plots, two types of graphical display widely used in exploratory data analysis for visualizing distributions of numbers.

Table 4.3: Percentages of different interactions. "All but M&S" means all 105 programs that did not use master-and-slaves strategies, and "all" means all 165 programs in the competition.
  Player 1         Player 2        (C,C)  (C,D)  (D,C)  (D,D)
  BWIN             BWIN's slaves    12%     5%    64%    20%
  IMM01            IMM01's slaves   10%     6%    47%    38%
  CNGF             CNGF's slaves     2%    10%    10%    77%
  BWIN's slaves    all but M&S       5%     9%    24%    62%
  IMM01's slaves   all but M&S       7%     9%    23%    61%
  CNGF's slaves    all but M&S       4%     8%    24%    64%
  TFT              all but M&S      33%    20%    20%    27%
  DBSz             all but M&S      54%    15%    13%    19%
  TFTT             all but M&S      55%    20%    11%    14%
  TFT              all              23%    19%    16%    42%
  DBSz             all              36%    14%    11%    39%
  TFTT             all              38%    21%    10%    31%
  all but M&S      all but M&S      31%    19%    19%    31%
  all              all              13%    16%    16%    55%

Figure 4.10 contains seven density plots (the curved lines) overlapped with seven dot plots (the data points), for DBSz, the three master programs, and three slave programs. In the tournament, each program plays against 166 different programs (including a copy of itself), and thus has 166 average scores. Since the number of iterations in a game is about 200, the average scores are between 0 and 1000; in fact, none of the average scores is below 200 or above 900. The density plots, shown as the curves in the graph, outline the distribution of the average scores of a program against all other programs in the tournament. To draw a density plot for a program, say DBSz, for each average score x we counted how many of DBSz's 166 average scores fall in the range [x − 2, x + 2], and then normalized the counts and plotted the normalized counts. The normalization is needed because the scale on the y-axis is not important; the density plots are intended to show the shape of the distribution rather than the frequency of the numbers. For example, there are two peaks in the density plot of DBSz, showing that most of its average scores are around either 260 or 530.

To see why there are two peaks in the density plot of DBSz, we used dot plots to identify which program DBSz was playing when it obtained each average score. We partitioned the set of all programs in the tournament into six groups: DBS* (all versions of the DBS strategies), TFTI* (all versions of the TFTI strategies), BWIN and its slaves, IMM01 and its slaves, CNGF and its slaves, and the remaining programs. For each group, we plotted the average scores on top of the density plot. As can be seen in Figure 4.10, DBSz's average scores when playing with the slave programs are around x = 260, while its average scores when playing with many non-master-and-slaves programs are around x = 530. This shows that the peak at around 260 is due to the slave programs, which defected most of the time, and the peak at around 530 is due to the cooperation between DBSz and the non-master-and-slaves programs. Interestingly, there is a small peak at around 580, which is mainly due to the cooperation among the different versions of the DBS programs, though some non-master-and-slaves programs also contribute to it. Clearly, if both agents use symbolic noise detection, the degree of cooperation can be even higher in noisy environments. From the density plot and the dot plot of DBSz, we can see that the overall average score of DBSz (the vertical line at x = 408) is roughly the midpoint between the two peaks.

The density plot of BWIN is very different from the density plot of DBSz in that the peak at around x = 530 is missing; alternatively, we can say that this peak is more spread out than the corresponding peak for DBSz. From the density plot, we can see how the slaves of BWIN work: when BWIN played with its slaves, the average scores were often more than 600.
This is possible only if BWIN defected most of the time while its slaves cooperated. BWIN was also able to exploit some of the non-master-and-slaves programs (i.e., some dots beyond x = 600). In general, most of the DBS programs tried to cooperate with BWIN. Although BWIN also suffers from the slaves of the other master-and-slaves teams, its overall average score (the vertical line at x = 433) is still much higher than DBSz's, mainly because of its slaves. IMM01's density plot is similar to BWIN's, but IMM01's second peak is essentially non-existent; this is why BWIN outperforms IMM01. The DBS programs cannot cooperate with IMM01 as well as they do with BWIN. CNGF cannot even cooperate with its own slaves. BLOS10 is a slave of BWIN; we can see that most of the DBS programs cannot cooperate with BLOS10. Compared with IMS02 (a slave of IMM01), BLOS10 is more successful in cooperating with its teammates.

The graphical analysis of Figure 4.10 shows that the slaves greatly affected the tournament. But what if the slaves did not exist? We drew another graph in which the average scores of the slave programs are removed. In addition, the average scores of the DBS programs and the TFTI programs, except DBSz and TFTIm, are removed too, so that collusion among the programs submitted by the author, if it existed at all, would have no effect. The result is shown in Figure 4.11.

When there are no slaves, the master programs do not do well; thus the top programs among the remaining 91 programs do not include the master programs. We selected the following six top programs and drew their density plots and dot plots in Figure 4.11: DBSz, lowESTFT classic, TFTIm, Mod, TTFT, and mediumESTFT classic. The peak at x = 260 in the density plot for DBSz disappears when there are no slaves; the remaining peak is the one at x = 530, which is mainly due to the cooperation of DBSz with the non-master-and-slaves programs. The peaks for the other programs are more spread out. More importantly, their peaks are spread toward the left side of the graph. This is why DBSz's overall average score is higher: the cooperation between DBSz and the other programs is more stable in noisy environments. Moreover, this stability can be achieved by DBSz alone; the other players did not use symbolic noise detection.

4.7.3 A Comparison between DBSz, TFT, and TFTT

Next, we consider how DBSz performs against TFT and TFTT. Table 4.3 shows that when playing with another cooperative player (the "all but M&S" rows), TFT cooperates ((C,C) in the table) 33% of the time, DBSz does so 54% of the time, and TFTT does so 55% of the time. Furthermore, when playing with a player who defects, TFT defects ((D,D) in the table) 27% of the time, DBSz does so 19% of the time, and TFTT does so 14% of the time. From this, one might think that DBSz's behavior lies somewhere between TFT's and TFTT's. On the other hand, when playing with a player who defects, DBSz cooperates ((C,D) in the table) only 15% of the time, a lower percentage than for TFT and TFTT (both 20%). Since cooperating with a defector generates no payoff, this makes TFT and TFTT perform worse than DBSz overall: DBSz's overall average score was 408 and it ranked 3rd, while TFTT's and TFT's overall average scores were 388.4 and 388.2 and they ranked 30th and 33rd, respectively. Figure 4.12 shows the average scores DBSz obtained when playing against TFT, TFTT, and the other programs provided by the organizer of the competition. We can see that TFT did not perform as well as DBSz and TFTT; in fact, TFT performed better when it played against DBSz than when it played against TFT itself in noisy environments.
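The observation that TFT does better against DBSz than against another copy of TFT under noise can be illustrated with a toy simulation. The sketch below is not part of the dissertation's experiments; it assumes the standard IPD payoffs (R = 3, S = 0, T = 5, P = 1) and the Category 2 noise model in which each executed action is flipped with probability 0.1, and it simply plays TFT against TFT for 200 iterations.

# A toy illustration (not from the dissertation's experiments): two TFT players
# in a noisy IPD. Assumed payoffs: R=3, S=0, T=5, P=1; noise level 0.1.
import random

PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tft(my_history, their_history):
    return their_history[-1] if their_history else "C"

def play(strategy1, strategy2, iterations=200, noise=0.1, seed=0):
    rng = random.Random(seed)
    h1, h2, score1, score2 = [], [], 0, 0
    for _ in range(iterations):
        a1, a2 = strategy1(h1, h2), strategy2(h2, h1)
        # Noise: each executed action is flipped independently with probability `noise`.
        if rng.random() < noise: a1 = "D" if a1 == "C" else "C"
        if rng.random() < noise: a2 = "D" if a2 == "C" else "C"
        p1, p2 = PAYOFF[(a1, a2)]
        score1, score2 = score1 + p1, score2 + p2
        h1.append(a1); h2.append(a2)
    return score1, score2

# With noise, accidental defections echo back and forth between the two TFT
# players, so the scores fall well below the 600 points of sustained cooperation.
print(play(tft, tft))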
4.8 Related Work

Early studies of the effect of noise in the Iterated Prisoner's Dilemma focused on how TFT, a highly successful strategy in noise-free environments, would fare in the presence of noise. TFT is known to be vulnerable to noise; for instance, if two players both use TFT, noise can trigger long sequences of mutual defections [45]. A number of researchers have confirmed the negative effects of noise on TFT [45, 11, 46, 5, 47, 12]. Axelrod found that TFT was still the best decision rule in the rerun of his first tournament with a one percent chance of misperception [3, page 183], but TFT finished sixth out of 21 in the rerun of Axelrod's second tournament with a 10 percent chance of misperception [24]. In Competition 2 of the 2005 IPD competition, the noise level was 0.1, and TFT's overall average score placed it 33rd out of 165.

The oldest approach to remedying TFT's deficiency in dealing with noise is to be more forgiving in the face of defections. A number of studies found that more forgiveness promotes cooperation in noisy environments [12, 46]. For instance, Tit-For-Two-Tats (TFTT), a strategy submitted by John Maynard Smith to Axelrod's second tournament, retaliates only when it receives two defections in the two previous iterations. TFTT can tolerate isolated defections caused by noise and is better able to avoid long sequences of mutual defections caused by noise. However, TFTT is susceptible to exploitation of its generosity, and it was beaten in Axelrod's second tournament by TESTER, a strategy that may defect every other move. In Competition 2 of the 2005 IPD Competition, TFTT ranked 30th, a slightly better ranking than TFT's. In contrast to TFTT, DBS can tolerate not only an isolated defection but also a sequence of defections caused by noise; at the same time, DBS monitors the other player's behavior and retaliates when exploitative behavior is detected (i.e., when the exploitation causes a change of the hypothesized policy, which initially is TFT). Furthermore, the retaliation triggered by exploitation continues until the other player shows a high degree of remorse (i.e., cooperation when DBS defects) that changes the hypothesized policy to one under which DBS favors cooperation instead of defection.

[45] proposed mixing TFT with ALLC to form a new strategy, now called Generous Tit-For-Tat (GTFT) [49]. Like TFTT, GTFT avoids an infinite echo of defections by cooperating in certain iterations when it receives a defection. The difference is that GTFT forgives randomly: for each defection it receives, GTFT randomly chooses to cooperate with a small probability (say 10%) and to defect otherwise. DBS, however, does not make use of forgiveness explicitly as GTFT does; its decisions are based entirely on the hypothesized policy that it learns. But temporary tolerance can be viewed as a form of forgiveness, since DBS does not retaliate immediately when a defection occurs in a mutual-cooperation situation. This form of forgiveness is carefully planned, and there is no randomness in it.

Another way to improve TFT in noisy environments is to use contrition: unilaterally cooperate after making a mistake. One strategy that makes use of contrition is Contrite TFT (CTFT) [61, 16, 64], which does not defect when it knows that noise has occurred and affected its previous action. However, this is less useful in the Noisy IPD, since a program does not know whether its action has been affected by noise.
DBS does not make use of contrition, though the effect of temporary tolerance resembles contrition.

A family of strategies called "Pavlovian" strategies, or simply Pavlov, was found to be more successful than TFT in noisy environments [35, 36, 37, 48]. The simplest form of Pavlov is called Win-Stay, Lose-Shift [48], because it cooperates only after mutual cooperation or mutual defection, an idea similar to Simpleton [55]. When an accidental defection occurs, Pavlov can resume mutual cooperation in a smaller number of iterations than TFT [35, 36]. Pavlov learns by conditioned response through rewards and punishments; it adjusts its probability of cooperation according to the previous interaction. Like Pavlov, DBS learns from its past experience and makes decisions accordingly. DBS, however, has an intermediate step between learning from experience and decision making: it maintains a model of the other player's behavior, and uses this model to reason about noise. Although there are probabilistic rules in the hypothesized policy, there is no randomness in DBS's decision-making process. For readers who are interested, there are several surveys on the Iterated Prisoner's Dilemma with noise [5, 32, 50, 38].

The use of opponent modeling is common in games of imperfect information such as Poker [15, 7, 8, 9, 21, 14] and RoShamBo [27]. One entry in Axelrod's original IPD tournament used opponent modeling, but it was not successful. There has been much work on learning the opponent's strategy in the non-noisy IPD [26, 31, 53]. By assuming that the opponent's next move depends only on the interactions of the last few iterations, these works model the opponent's strategy as a probabilistic finite automaton, and then use various learning methods to learn the probabilities in the automaton. For example, [31] proposed an adaptive agent called an opponent modeling agent (OMA) of order n, which maintains a summary of the moves made in up to n previous iterations. Like DBS, OMA learns the probabilities of cooperation of the other player in different situations, using an updating rule similar to Equation 4.2, and generates a move based on the opponent model by searching a tree similar to the one shown in Figure 4.4. The opponent model in [26] has a similar construction. The main way these approaches differ from DBS is in how they learn the other player's strategy, but there are several other differences as well: for example, the tree they use has a maximum depth of 4, whereas ours has a depth of 60.

The agents of both [31] and [26] learned the other player's strategy by exploration: deliberately making moves in order to probe the other player's strategy. The use of exploration for learning an opponent's behavior was studied by [19], who developed a lookahead-based exploration strategy to balance exploration and exploitation and to avoid making risky moves during exploration. [31] and [26] used different exploration strategies than [19]: [31] introduced noise into 1% of their agent's moves (they call this method the trembling hand), whereas the agent in [26] makes decisions at random when it consults the opponent model and finds a missing value in the model. Both of their agents used a random opponent model at the beginning of a game.

DBS does not make deliberate moves to try to explore the other player's strategy, because we believe that this is a high-risk, low-payoff tactic in the IPD.
We believe it incurs a high risk because many programs in the competition are adaptive; defections made during exploration may damage our long-term relationship with them. We believe it has a low payoff because the length of a game is usually too short for us to learn any non-trivial strategy completely. Moreover, the other player may alter its behavior in the middle of a game, which makes it difficult for any learning method to converge; this is especially true in the noisy IPD, since noise can provoke the other player (e.g., GRIM). Furthermore, our objective is to cooperate with the other players, not to exploit their weaknesses in order to beat them; as long as the opponent cooperates with us, there is no need to model the rest of its behavior. For these reasons, DBS does not aim at learning the other player's strategy completely; instead, it learns the other player's recent behavior, which is subject to change. In contrast to the OMA strategy described earlier in this section, most of our DBS programs cooperated with each other in the competition.

Our decision-making algorithm combines elements of both minimax game-tree search and the value iteration algorithm for Markov Decision Processes. In contrast to [18], we do not model the other player's model of our strategy; we assume that the hypothesized policy does not change for the rest of the game. This assumption is obviously not valid, because our decisions can affect the other player's future decisions. Nonetheless, we found that the moves returned by our algorithm are fairly good responses. For example, if the other player behaves like TFT, the move returned by our algorithm is to cooperate regardless of the previous interactions; if the other player does not behave like TFT, our algorithm is likely to return defection, a good move in many situations. To the best of our knowledge, ours is the first work on using opponent models in the IPD to detect errors in the execution of another agent's actions.

4.9 Summary

For conflict prevention in noisy environments, a critical problem is to distinguish between situations in which another player has misbehaved intentionally and situations in which the misbehavior was accidental. That is the problem DBS was formulated to deal with. DBS's impressive performance in the 2005 Iterated Prisoner's Dilemma competition occurred because DBS was better able to maintain cooperation in spite of noise than any other program in the competition.

To distinguish between intentional and unintentional misbehavior, DBS uses a combination of symbolic noise detection and temporary tolerance: if an action of the other player is inconsistent with the player's past behavior, we continue as if the player's behavior has not changed, until we gather sufficient evidence to determine whether the inconsistency was caused by noise or by a genuine change in the other player's behavior. Since clarity of behavior is an important ingredient of long-term cooperation in the IPD, most IPD programs have behavior that follows clear deterministic patterns. The clarity of these patterns made it possible for DBS to construct policies that were good approximations of the other players' strategies, and to use these policies to fend off noise. We believe that clarity of behavior is also likely to be important in other multi-agent environments in which agents have to cooperate with each other; thus it seems plausible that techniques similar to those used in DBS may be useful in those domains.
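As a schematic illustration of the idea summarized above (this is not the DBS implementation; the function name handle_observation and the parameter tolerance_limit are hypothetical), the noise-handling logic can be sketched as follows: keep acting as if the other player's behavior has not changed, and only accept a change of behavior after repeated violations of the hypothesized rule.

# A schematic sketch of symbolic noise detection with temporary tolerance.
# Illustration of the idea only; the threshold and data structures are hypothetical.
def handle_observation(rule_prediction, observed_move, violation_count,
                       tolerance_limit=3):
    """rule_prediction: the move predicted by a deterministic rule in the
    hypothesized policy; observed_move: the move actually observed.
    Returns (updated violation count, whether to accept a behavior change)."""
    if observed_move == rule_prediction:
        return 0, False                 # consistent: reset the evidence counter
    violation_count += 1                # inconsistent: possibly noise, tolerate it
    if violation_count >= tolerance_limit:
        return 0, True                  # enough evidence: treat it as a real change
    return violation_count, False       # keep acting as if behavior is unchanged

# Example: two isolated violations are tolerated as possible noise; a third
# consecutive violation is taken as evidence of a genuine change in behavior.
count, changed = 0, False
for move in ["D", "D", "D"]:
    count, changed = handle_observation("C", move, count)
print(changed)  # True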
In the future, we are interested in studying the following issues:

• The evidence collection process takes time, and the delay may invite exploitation. For example, the policy of temporary tolerance in DBS may be exploited by a "hypocrite" strategy that behaves like TFT most of the time but occasionally defects even though DBS did not defect in the previous iteration. DBS cannot distinguish this kind of intentional defection from noise, even though DBS has a built-in mechanism to monitor exploitation. We are interested in seeing how to avoid this kind of exploitation.

• In multi-agent environments where agents can communicate with each other, the agents might be able to detect noise by using a predefined communication protocol. However, we believe there is no protocol that is guaranteed to tell which action has been affected by noise, as long as the agents cannot completely trust each other. It would be interesting to compare these alternative approaches with symbolic noise detection, to see how symbolic noise detection could enhance these methods, or vice versa.

• The type of noise in the competition assumes that no agent knows whether the execution of an action has been affected by noise. There may be situations in which some agents can obtain partial information about the occurrence of noise; for example, an agent might obtain the plans of a malicious third party by counter-espionage. We are interested in seeing how to incorporate such information into symbolic noise detection.

• Symbolic noise detection is not designed for noise-free environments. That is why DBS did not work as well in Competition 1 (where there was no noise) as it did in Competition 2. (We submitted the same set of programs to both Competition 1 and Competition 2; although DBS did not win Competition 1, the best DBS program consistently ranked in the top 10 among 192 programs.) We believe that if we had turned off the noise detection mechanism by setting the violation thresholds to zero, the DBS programs would have performed better in Competition 1. Thus, if we want a DBS program to work well in both noisy and noise-free environments, we need a way to detect whether there is noise in the environment, so that DBS can turn off the noise detection mechanism when there is no noise. In general, the questions are (1) how to estimate the level of noise in the environment, and (2) how symbolic noise detection should be adjusted for different noise levels.

• It would be interesting to put DBS in an evolutionary environment to see whether it can survive after a number of generations. Is it evolutionarily stable?

Figure 4.4: An example of the tree that we use to compute the maximum expected scores. Each node denotes the interaction of an iteration. The top four nodes constitute a path representing the current history σ_current. The length of σ_current is l = 2, and the maximum depth N′ is 2. There are four edges emanating from each node S after the current node; each of these edges corresponds to a possible interaction of the iteration after S. The maximum expected scores (not shown) of the nodes at depth 2 are set by an evaluation function f; these values are then used to calculate the maximum expected scores of the nodes at depth 1 by the maximizing rule. Similarly, the maximum expected score of the current node is calculated from the four maximum expected scores of the nodes at depth 1.

Procedure MoveGen(π, σ)
    ⟨p_CC, p_CD, p_DC, p_DD⟩ := π
    {(a_1,b_1), (a_2,b_2), ..., (a_k,b_k)} := σ
    (a_0,b_0) := (C,C) ; (a,b) := (a_k,b_k)
    ⟨E^{N′+1}_CC, E^{N′+1}_CD, E^{N′+1}_DC, E^{N′+1}_DD⟩ := ⟨f_CC, f_CD, f_DC, f_DD⟩
    For k = N′ down to 0
        For each (o1,o2) in {(C,C), (C,D), (D,C), (D,D)}
            F^k_{o1,o2} := p_{o1,o2}(u_CC + E^{k+1}_CC) + (1 − p_{o1,o2})(u_CD + E^{k+1}_CD)
            G^k_{o1,o2} := p_{o1,o2}(u_DC + E^{k+1}_DC) + (1 − p_{o1,o2})(u_DD + E^{k+1}_DD)
            E^k_{o1,o2} := max(F^k_{o1,o2}, G^k_{o1,o2})
            If F^k_{o1,o2} ≥ G^k_{o1,o2}, then m^k_{o1,o2} := C
            If F^k_{o1,o2}