ABSTRACT

Title of Dissertation: APPLYING POLICY GRADIENT METHODS TO OPEN-ENDED DOMAINS
Ryan Sullivan, Doctor of Philosophy, 2025
Dissertation Directed by: Professor John P. Dickerson, Department of Computer Science

Deep reinforcement learning (RL) has been successfully used to train agents in complex video game environments including Starcraft 2, Dota 2, Minecraft, and Gran Turismo [1, 2, 3, 4]. Each of these projects utilized curriculum learning to train agents more efficiently. However, systematic investigations of curriculum learning are limited, and it is rarely studied outside of toy research environments. Modern RL methods still struggle in stochastic, sparse-reward environments with long planning horizons. This thesis studies these challenges from multiple perspectives to develop a stronger empirical understanding of curriculum learning in complex environments. By introducing novel visualization techniques for reward surfaces and empirically investigating key implementation details, it explores why policy gradient methods alone are insufficient for sparse-reward tasks. These findings motivate the use of curriculum learning to decompose problems into learnable subtasks and to prioritize learnable objectives. Building on these insights, this dissertation presents a general-purpose library for curriculum learning and uses it to evaluate popular automatic curriculum learning algorithms on challenging RL environments. Curricula have historically been effective for training reinforcement learning agents, and a fundamental understanding of automatic curriculum learning is an essential step toward developing generally capable agents in open-ended environments.

APPLYING POLICY GRADIENT METHODS TO OPEN-ENDED DOMAINS
by Ryan Sullivan

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2025

Advisory Committee:
Professor John P. Dickerson, Chair/Advisor
Professor Ming Lin
Professor Furong Huang
Professor Subramanian Raghavan
Professor Peter Stone

© Copyright by Ryan Sullivan 2025

Acknowledgments

This thesis would not have been possible without the support and contributions of many friends and collaborators. First of all, I would like to thank my advisor John P. Dickerson for his guidance throughout my PhD and for giving me the confidence to pursue my research. John gave me the freedom to explore my interests, as well as the reassurance and motivation that I needed to see this thesis through to the end. Thank you to Ming Lin, Furong Huang, Subramanian Raghavan, and Peter Stone for reviewing this dissertation and providing valuable feedback. Peter's work on curriculum learning was a large inspiration for this thesis. Thank you to Joseph Suarez and Costa Huang for being incredible collaborators and friends. Our frequent discussions made me a better researcher, and made my PhD more enjoyable. I also would like to thank Raghav Gupta and Eric Li, without whom I would never have been able to publish the research I conducted during my internship at Google. They went above and beyond to support my ideas and develop them into two successful research papers, and I thoroughly enjoyed my time working with them. I also owe thanks to my friends Jordan, Nameer, Killian, Jarett, Griffon, Aman, Samyadeep, Shlok, Naman, and Sneha, who frequently reminded me to enjoy life beyond research. I am obligated to thank my siblings Olivia, Kiernán, and Danny.
Most importantly, thank you to my parents for providing endless love, support, and food. You always enabled my curiosity and encouraged me to pursue my interests, and I will always be grateful. ii Table of Contents Acknowledgements ii Table of Contents iii List of Tables vii List of Figures viii List of Abbreviations xii Chapter 1: Introduction 1 1.1 Reinforcement Learning in Complex Environments . . . . . . . . . . . . . . . . 1 1.2 Intrinsic Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Curriculum Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.4 Open Source Reinforcement Learning Infrastructure . . . . . . . . . . . . . . . . 6 1.5 Overall Structure and Contributions . . . . . . . . . . . . . . . . . . . . . . . . 7 Chapter 2: Background 10 2.1 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.1.1 Markov Decision Processes . . . . . . . . . . . . . . . . . . . . . . . . 11 2.1.2 Partial Observability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.1.3 Estimating Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.1.4 Generalized Advantage Estimation . . . . . . . . . . . . . . . . . . . . . 15 2.2 Policy Gradient Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.2.1 REINFORCE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.2.2 Actor-Critic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.2.3 Proximal Policy Optimization . . . . . . . . . . . . . . . . . . . . . . . 19 2.3 Underspecified Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.4 Curriculum Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.5 Automatic Curriculum Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.6 Open-Endedness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 Chapter 3: Visualizing Reward Surfaces 28 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.2 Background and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.2.1 Loss Landscapes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.2.2 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 31 iii 3.2.3 Policy Gradient Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.2.4 Reward surfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.2.5 Proximal Policy Optimization . . . . . . . . . . . . . . . . . . . . . . . 33 3.3 Environment Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.4 Initial Explorations of Reward Surfaces . . . . . . . . . . . . . . . . . . . . . . 34 3.4.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.5 Discovering Cliffs in the Gradient Direction . . . . . . . . . . . . . . . . . . . . 39 3.5.1 Gradient Directions vs. Filter-Normalized Random Directions . . . . . . 41 3.5.2 Visualizing Rewards in the Gradient Direction During Training . . . . . 42 3.6 Cliffs Impact Policy Gradient Training . . . . . . . . . . . . . . . . . . . . . . . 43 3.6.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.6.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.7 Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . 45 3.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.10 Additional Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.10.1 All Reward Surfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.10.2 Standard Deviation Plots . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.10.3 Reproducibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.10.4 All Gradient Heat Maps . . . . . . . . . . . . . . . . . . . . . . . . . . 60 3.10.5 All Gradient Line Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 Chapter 4: Applying DreamerV3 Implementation Tricks to PPO 65 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 4.3.1 Symlog Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.3.2 Twohot Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.3.3 Critic EMA Regularizer . . . . . . . . . . . . . . . . . . . . . . . . . . 71 4.3.4 Percentile Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 4.3.5 Unimix Categoricals . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.4.1 Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.4.2 Enabling All Tricks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 4.4.3 Add-One Ablations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.4.4 Drop-One Ablations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.4.5 Symlog Predictions and Twohot Encoding . . . . . . . . . . . . . . . . . 78 4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 4.6 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 4.8 Additional Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 4.8.1 Atari Ablations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 4.8.2 DeepMind Control Suite Add-One Ablations . . . . . . . . . . . . . . . 87 4.8.3 DeepMind Control Suite Drop-One Ablations . . . . . . . . . . . . . . . 88 iv 4.8.4 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 4.8.5 PPO Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 Chapter 5: Multitask learning in Large Language Models 92 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 5.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 5.3 Problem Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 5.4 Conditional Language Policy (CLP) . . . . . . . . . . . . . . . . . . . . . . . . 100 5.4.1 Conditioning Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . 101 5.4.2 Three Instantiations of CLP . . . . . . . . . . . . . . . . . . . . . . . . 102 5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 5.5.1 Core Benchmarking Results . . . . . . . . . 
. . . . . . . . . . . . . . . 104 5.5.2 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 5.6 Multi-Objective Online DPO . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 5.6.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 5.6.2 Objective Weight Sampling . . . . . . . . . . . . . . . . . . . . . . . . 108 5.6.3 Prompt Conditioning Mechanism . . . . . . . . . . . . . . . . . . . . . 109 5.6.4 Sampling and Preference Pair Construction . . . . . . . . . . . . . . . . 109 5.6.5 DPO Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 5.7 MO-ODPO Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 5.7.1 Datasets and Rewards . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 5.7.2 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 5.7.3 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 5.8 MO-ODPO Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 5.8.1 Pareto Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 5.8.2 Autorater Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 5.8.3 Qualitative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 5.9 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 5.9.1 Role of Sampling for Prompt Conditioning . . . . . . . . . . . . . . . . 118 5.9.2 Training Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 5.10 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 5.11 Additional Experiment Details . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 Chapter 6: Enabling Future Curriculum Learning Research 124 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 6.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 6.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 6.4 Design Philosophy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 6.5 Syllabus APIs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 6.5.1 Curriculum API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 6.5.2 Agent API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 6.5.3 Task Interface API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 6.5.4 Task Space API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 6.5.5 Multiprocessing Infrastructure . . . . . . . . . . . . . . . . . . . . . . . 137 6.5.6 Automatic Curriculum Learning Implementations . . . . . . . . . . . . . 138 v 6.5.7 Manual Curricula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 6.6 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 6.7 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 6.8 Reproduction Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 6.8.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 6.8.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 6.9 Additional Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 6.9.1 RLLib with Syllabus . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
147 6.9.2 Self-Play on LaserTag . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 6.10 Discussion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150 6.11 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 6.12 Code Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 6.12.1 RLLib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 6.12.2 Stable Baselines 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 6.13 Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154 Chapter 7: Curriculum Learning in Complex Environments 155 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 7.2 New Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 7.2.1 Neural MMO 2.0 in PufferLib . . . . . . . . . . . . . . . . . . . . . . . 156 7.2.2 NetHack in Moolib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 7.2.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 7.2.4 Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 7.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 7.4 Phasic Policy Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 7.4.1 Stale Value Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . 166 7.5 Neural MMO Task-Based Curriculum . . . . . . . . . . . . . . . . . . . . . . . 168 7.6 Task Design in Crafter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171 7.6.1 Task Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173 7.6.2 Impossible Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173 7.6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173 7.6.4 Single Task Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . 175 7.7 A Closer Look at OMNI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176 7.7.1 Sampling for Learnability with Stratified Filtering . . . . . . . . . . . . 177 7.8 Discussion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178 Chapter 8: Future Research Directions and Open Questions 181 8.1 Challenging Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181 8.2 Task Space Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184 8.3 LLM-Augmented RL Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186 8.3.1 Direct Finetuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187 8.3.2 LLMs as Teachers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188 8.3.3 Commonsense Action Pruning . . . . . . . . . . . . . . . . . . . . . . . 190 vi List of Tables 3.1 Table of A2C and PPO’s average percent change in reward after taking a few gra- dient steps on cliff and non-cliff checkpoints for various sets of hyperparameters. These results are averaged among 10 trials each evaluated for 1000 episodes. . . 43 3.2 Settings used to generate reward surfaces for each environment. These settings were manually chosen to highlight interesting features of the reward surface and to reduce standard error in the plots. . . . . . . . . . . . . . . . . . . . . . . . . 52 4.1 Application of DreamerV3 Tricks . . . . . . . . . . . . . . . . . . . . . . . . . 
68 4.2 The default PPO hyperparameters in CleanRL’s implementation, used for all ex- periments. The only change we make is increasing the number of environments for better time efficiency. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 5.1 Overview of multi-objective LLM alignment approaches . . . . . . . . . . . . . 97 5.2 Example generations on Anthropic-HH at different objective weights from MO- ODPO & baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 6.1 Syllabus Performance Costs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 7.1 Learning Hyperparameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 7.2 Neural MMO Sequential Curriculum Task Thresholds . . . . . . . . . . . . . . . 169 vii List of Figures 3.1 Reward surface for the Solaris Atari environment. . . . . . . . . . . . . . . . . . 29 3.2 Reward surfaces for CartPole-v1, Breakout, Hopper-v2, Montezuma’s Revenge . 35 3.3 Gradient heat maps for Ant, Hopper, and InvertedDoublePendulum. The heat map for Hopper and InvertedDoublePendulum shows a cliff-like gradient direc- tion which falls off sharply compared to the filter-normalized direction. . . . . . 36 3.4 Gradient line search plots for Pendulum, Ant, and Pong. The line plot for Ant shows several checkpoints that exhibit cliff-like properties. . . . . . . . . . . . . 39 3.5 Reward surfaces for 5 Classic Control environments. . . . . . . . . . . . . . . . 49 3.6 Reward surfaces for 10 MuJoCo environments. . . . . . . . . . . . . . . . . . . 50 3.7 Reward surfaces for 12 Atari environments. . . . . . . . . . . . . . . . . . . . . 51 3.8 Standard deviation surfaces for 5 Classic Control environments. . . . . . . . . . 53 3.9 Standard deviation surfaces for 10 MuJoCo environments. . . . . . . . . . . . . 54 3.10 Standard deviation surfaces for 12 Atari environments. . . . . . . . . . . . . . . 55 3.11 18 training and plotting runs for the Classic Control Acrobot-v1 environment. . . 56 3.12 18 training and plotting runs for the MuJoCo HalfCheetah-v2 environment. . . . 57 3.13 18 training and plotting runs for the Atari Breakout environment. . . . . . . . . . 58 3.14 18 training and plotting runs for the Atari Montezuma’s Revenge environment. . 59 3.15 Policy gradient heat maps for 5 Classic Control environments. . . . . . . . . . . 60 3.16 Policy gradient heat maps for 10 MuJoCo environments. . . . . . . . . . . . . . 61 3.17 Policy gradient line search plots for 5 Classic Control environments. . . . . . . . 62 3.18 Policy gradient line search plots for 10 MuJoCo environments. . . . . . . . . . . 63 3.19 Policy gradient line search plots for 12 Atari environments. . . . . . . . . . . . . 64 4.1 PPO with all tricks enabled compared to base PPO, using the DeepMind Control Suite in a) and c) and Atari environments (with and without reward clipping) in b) and d). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.2 Median scores averaged across multiple seeds for the add-one ablations in each environment set. 5 seeds were used for Deepmind Control Suite and 3 seeds for Atari environments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 4.3 Median scores for the drop-one ablations in each environment set. 5 seeds were used for DeepMind Control Suite and 3 seeds for the Arcade Learning Environment 78 4.4 Experiments varying the a) range (bins=256) and b) number of bins (range=100) used for twohot encoding in Atari Breakout, averaged across 3 seeds. 
c) shows symlog predictions with twohot encoding across all 57 games in the Arcade Learning Environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 viii 4.5 Add-one ablations for the Arcade Learning Environment with reward clipping. Each line shows the median scores for 57 environments, averaged over 3 seeds, across training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 4.6 Add-one ablations for the Arcade Learning Environment without reward clip- ping. Each line shows the median scores for 57 environments, averaged over 3 seeds, across training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 4.7 Drop-one ablations for the Arcade Learning Environment with reward clipping. Each line shows the median scores for 57 environments, averaged over 3 seeds, across training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 4.8 Add-one ablations for the DeepMind Control Suite. Each line shows the median scores for 35 environments, averaged over 5 seeds, across training. . . . . . . . . 87 4.9 Drop-one ablations for the DeepMind Control Suite. Each line shows the median scores for 35 environments, averaged over 5 seeds, across training. . . . . . . . . 88 4.10 20M DreamerV3 XL architecture vs. 1M Nature CNN. Figures show median performance averaged across 3 seeds and all 57 Atari environments. . . . . . . . 89 4.11 20M Dreamer-v3 XL architecture (3 seeds) vs. 1M Nature CNN (1 seed). . . . . 90 5.1 (Left) For a fixed prompt x, a multi-objective LM can output y1, y2, y3 for dif- ferent weightings w1, w2, w3 of two rewards r1 and r2, such that the response yi for weighting wi is preferred under the weighted reward wi[1]r1 + wi[2]r2. (Right) Pareto-fronts when using the rewards NLI and Rouge (section 5.5.1.1). Rewarded Soups (RS)5 is Pareto-dominated by both full-CLP (this paper) and prompting (say, Jang et al. 6), but full-CLP is more appealing for its steerability, evidenced by its wider Pareto-front. In sum, Pareto-dominance (pushing out the front) and steerability (stretching out the front) are both key desiderata for MOFT. 93 5.2 Pareto-curves for two-reward & α = 0.01. Observe CLP variants (full-CLP and attn-CLP) offer improved spread (compared to prompting) while Pareto- dominating the Rewarded Soups (RS) baseline. . . . . . . . . . . . . . . . . . . 105 5.3 Combining Prompt-based conditioning with Parameter-space condtioning for full- CLP. Observe that prompting slots in with full-CLP to produce steerable and Pareto dominating behaviors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 5.4 Pareto-curves at 10k, 60k, 90k training steps. Observe that prompting shows slightly improved steerability with a 3× larger training budget but still isn’t as steerable as full-CLP which exhibits a strong steerability even at 10k iterations. . . 106 5.5 Ablation on model size for NLI v. TLDR. Observe that across model sizes, full- CLP and attn-CLP learn steerable (compared to prompting) and Pareto-dominating behaviors (compared to Rewarded Soups). . . . . . . . . . . . . . . . . . . . . . 107 5.6 Representation of the MO-ODPO algorithm with two rewards . . . . . . . . . . 111 5.7 Pareto fronts for Anthropic-HH (left) and TL;DR Summarization (right) with PaLM 2 XS policy model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 5.8 Pareto fronts for Anthropic-HH (left) and TL;DR Summarization (right) with PaLM 2 XXS policy model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
112 5.9 Automated evaluation win rates for MO-ODPO compared against individual base- lines, as judged by a large off-the-shelf LLM. MO-ODPO consistently achieves a higher win rate than each baseline on both benchmarks. . . . . . . . . . . . . . 113 ix 5.10 Pareto fronts for MO-ODPO (PaLM 2 XS) with different Dirichlet(α) sampling of objective weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 5.11 Training dynamics for MO-ODPO (above) and Rewards in Context (below) on the TL;DR benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 6.1 Syllabus with a standard asynchronous RL training setup. . . . . . . . . . . . . . 131 6.2 An abbreviated summary of the Curriculum interface. These represent the main methods for updating and sampling from a curriculum in Syllabus. The get_opponent, update_agent, and update_winrate methods are used for self-play. . . . . . . . . 132 6.3 Main features of the Task Interface. . . . . . . . . . . . . . . . . . . . . . . . . 135 6.4 Main features of the Task Space API. . . . . . . . . . . . . . . . . . . . . . . . . 136 6.5 Using Syllabus for curriculum learning with just a few lines of code. . . . . . . . 137 6.6 (a) Mean normalized test returns for Syllabus’s implementation vs. the original implementation of Prioritized Level Replay from7 on 10 Procgen environments. Domain Randomization is also included for reference. (b) Mean task success rate for Syllabus’s Learning Progress and OMNI implementations vs. the ref- erence implementations from8 on Crafter. (c) and (d) 95% Stratified bootstrap confidence intervals for Procgen and Crafter. . . . . . . . . . . . . . . . . . . . . 140 6.8 (a) Probability of sampling the first agent for each algorithm. Self-play only has one agent, so it always samples the current policy as it’s opponent. (b) The winrate of each method against a fixed random policy. (c) Stratified bootstrapped confidence intervals of the winrates. . . . . . . . . . . . . . . . . . . . . . . . . 149 6.9 Adding curriculum learning with Syllabus to RLLib training code with a few lines of code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 6.10 Quickstart page of Syllabus’s documentation website. . . . . . . . . . . . . . . . 154 7.1 Automatic curriculum learning training curves and 95% Stratified bootstrap con- fidence intervals on (a), (e) Procgen, (b), (f) Crafter, (c), (g) NetHack, and (d), (h) Neural MMO. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 7.2 Normalized test returns for Domain Randomization, Prioritized Level Replay, Sampling for Learnability, and Learning Progress evaluated on 10 Procgen envi- ronments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162 7.3 Normalized Test Returns of PPO and PPG with 1 or 3 value epochs when trained with DR or PLR on 10 Procgen environments with 5 seeds each. . . . . . . . . . 166 7.4 Training Procgen agents with PLR using stale value predictions. 1 buffer is equiv- alent to 1 episode of delay per environment. . . . . . . . . . . . . . . . . . . . . 167 7.5 Various events and achievements in Neural MMO for Domain randomization and a manually designed sequential curriculum over the course of training. . . . . . . 170 7.6 Crafter ablations removing task encodings and impossible tasks. Note that the y axes change when we remove impossible tasks. . . . . . . . . . . . . . . . . . . 174 7.7 Success rates for agents trained to achieve a single task over the course of train- ing. 
Each run shows a single seed to demonstrate the learning dynamics of individual agents. . . . 176
7.8 (Left) Mean task success rates for each algorithm combined with stratified sampling. (Right) Stratified Domain Randomization compared against other ACL algorithms without impossible tasks. . . . 178
7.9 Mean task success rates for the Full Distribution and Top K implementations of SFL with OMNI's interestingness filter. LP and OMNI are shown for reference. . . 179

List of Abbreviations

A2C Advantage Actor-Critic
ACL Automatic Curriculum Learning
AI Artificial Intelligence
ALE Arcade Learning Environment
APPO Asynchronous Proximal Policy Optimization
API Application Programming Interface
CL Curriculum Learning
CLP Conditional Language Policy
CNN Convolutional Neural Network
CPO Contrastive Preference Optimization
CPU Central Processing Unit
DeRa Decoding-time Realignment
DIAYN Diversity is All You Need
DMC DeepMind Control Suite
DPO Direct Preference Optimization
DR Domain Randomization
ELU Exponential Linear Unit
EMA Exponential Moving Average
FSP Fictitious Self-Play
GAE Generalized Advantage Estimation
GPU Graphics Processing Unit
IQM Interquartile Mean
IPO Implicit Preference Optimization
KL Kullback-Leibler
LLM Large Language Model
LM Language Model
LP Learning Progress
LSTM Long Short-Term Memory
MDP Markov Decision Process
MOD Multi-objective Decoding
MO-DPO Multi-objective Direct Preference Optimization
MOFT Multi-objective Finetuning
MO-ODPO Multi-objective Online Direct Preference Optimization
MO-RL Multi-objective Reinforcement Learning
MO-RLHF Multi-objective Reinforcement Learning from Human Feedback
MSE Mean Squared Error
NLE NetHack Learning Environment
NLI Natural Language Inference
ODPO Online Direct Preference Optimization
OMNI Open-endedness via Models of Novelty and Interestingness
PCG Procedural Content Generation
PFSP Prioritized Fictitious Self-Play
PLR Prioritized Level Replay
P-MORL Prompt-based Multi-objective Reinforcement Learning
POMDP Partially Observable Markov Decision Process
PPO Proximal Policy Optimization
PPG Phasic Policy Gradient
ReLU Rectified Linear Unit
REINFORCE REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility
RiC Rewards-in-Context
RL Reinforcement Learning
RLHF Reinforcement Learning from Human Feedback
RND Random Network Distillation
RS Rewarded Soups
SAC Soft Actor-Critic
SB3 Stable Baselines 3
SFL Sampling for Learnability
SFT Supervised Fine-tuning
SiLU Sigmoid Linear Unit
SOFT Single-objective Finetuning
SP Self Play
SVD Singular Value Decomposition
TD Temporal Difference
TRPO Trust Region Policy Optimization
UED Unsupervised Environment Design
UPOMDP Underspecified Partially Observable Markov Decision Process
VLM Vision-Language Model

Chapter 1: Introduction

1.1 Reinforcement Learning in Complex Environments

Reinforcement learning has solved many premier challenges in artificial intelligence, from defeating the best human players in the board game Go [9] to playing Starcraft II at a professional level [1]. In these early successes, agents focused on perfecting a single task, and since then we have seen RL agents learn to pilot weather balloons [10], design computer chips [11], and achieve many more feats in video games [2, 3, 4].
As RL has expanded into more challenging domains, the focus has shifted from solving singleton tasks to training generalist agents capable of performing well in unseen environments. The most challenging RL benchmarks present diverse scenarios, requiring agents to generalize to unseen objectives, terrain, opponents, allies, or game rules. Much of this thesis was inspired by this line of research and by the goal of developing agents that can play challenging games as well as people can.

While these achievements showcase the potential of reinforcement learning in controlled settings, they also rely on properties, such as determinism or self-play, that are absent in stochastic, single-agent, and many-agent environments. For instance, Atari games are deterministic and can easily be learned via memorization, even with mitigations like entropy bonuses and sticky actions [12]. Dota 2 and Starcraft 2, though significantly more complicated than Atari, are 2-player competitive games, meaning that the agent experiences a winning trajectory in each match and can be trained via self-play. AlphaGo demonstrated that self-play can lead to recursive self-improvement and policies capable of beating the best human players in Go.

In contrast, real-world tasks often lack these convenient features. Single-player games with procedurally generated content tax an agent's ability to generalize, and long planning horizons extend the credit assignment problem past the capabilities of existing discount-based RL methods. Beyond academic benchmarks, modern games with high-quality graphics and increasingly complex controls drastically increase the size and sparsity of the RL agent's search space, exacerbating the need for strong representation learning. In Chapter 3 we show how difficult exploration in these sparse reward problems can be. Hafner et al. [13] address this problem with DreamerV3, a model-based RL algorithm that was able to train proficient agents in many challenging sparse reward environments with a single set of hyperparameters. They credit their results to several implementation tricks such as reward normalization and critic regularizers. However, we show in Chapter 4 that these implementation tricks, though applicable to any actor-critic method, are ineffective when applied to Proximal Policy Optimization. The challenges of exploring large state spaces with uninformative rewards necessitate more advanced exploration techniques, a need which inspired the RL subfield of intrinsic motivation.

1.2 Intrinsic Motivation

Exploration is a fundamental challenge in reinforcement learning, especially in complex environments where finding all sources of reward is difficult. Intrinsic motivation is a field of research that seeks to directly motivate this exploratory behavior by designing agents with an additional "intrinsic" reward signal. There are many formulations of this goal, some inspired by human notions of exploration, such as curiosity [14], as well as practical methods like count-based exploration [15, 16, 17] and random network distillation (RND) [18]. Typically, the intrinsic reward decreases as the agent explores more, allowing the extrinsic reward to dominate once sufficient coverage is achieved.

Intrinsic motivation also implicitly incentivizes the discovery of useful skills. In challenging environments, some states are not reachable without first learning prerequisite skills. For instance, an agent playing chess cannot experience a winning checkmate without first becoming a competent chess player.
As such, intrinsic rewards can even outperform the environment's own extrinsic reward by promoting skills that allow the agent to reach more novel states.

Intrinsic motivation methods are intuitively appealing and successful in select environments, but in practice many have been shown to generalize poorly and to struggle in stochastic environments [19, 20]. The primary weakness of bonus-based methods used to encourage exploration is that they assume full coverage of the state space is possible and desirable [21]. In even moderately complex environments, the cardinality of the state space is many times larger than the number of steps seen during training, meaning that even visiting a unique state at each step will not fully explore the environment. Function approximation helps by associating similar states with each other, but it is less effective as the state space grows. Henaff et al. [21] recently proposed a method which explores the task-specific feature space instead of the full task space, but it still struggles with large action and state spaces. These observations suggest that a higher-level approach to incentivizing exploration is necessary. This thesis explores curriculum learning as an appealing alternative.

1.3 Curriculum Learning

Curriculum learning methods seek to generate or sort tasks in a way that increases sample efficiency or asymptotic performance on a target task or range of tasks [22]. By incentivizing agents to repeat behaviors that led to novel states, intrinsic motivation methods can be viewed as autocurricula over informative trajectories [23]. While exploration bonuses prioritize individual states that maximize some measure of new information, curriculum learning methods prioritize specific environment instances that maximize learning potential. Both approaches teach agents to develop better representations and discover important skills.

As the complexity of the environment increases even further and the task space also grows to an unmanageable size, we enter the domain of open-ended environments. These are environments with huge or infinite task spaces, where the environment evolves in response to agent behavior, similar to the real world. For example, as humanity entered the industrial revolution, skills like manufacturing and engineering became more important, while farmers and craftsmen needed to adapt. As with the transition from state space coverage to state space prioritization, the focus in open-ended environments shifts from task space coverage to task space prioritization. In these settings, agents can train faster by identifying tasks with learnable skills and avoiding distractions, highlighting the importance of curriculum learning.

Curriculum learning is appealing in complex, open-ended environments for several reasons. First, it divides challenging tasks into solvable subproblems. These subtasks can train agents faster by removing other sources of reward, distracting environment dynamics, or challenging starting conditions. Second, this approach is intuitively similar to how people learn in school, by progressing from easy problems to hard problems. Third, by operating at a high level, the solution space is often more interpretable. Experts can easily identify which tasks are being prioritized early in training, and it is easier to manually design curricula at the task level than to engineer shaping rewards.

The final and most important reason is that curriculum learning has been a core component of nearly every successful application of RL agents in challenging tasks.
AlphaGo uses self-play, which induces a curriculum of increasingly capable opponents [9]. AlphaStar extends this approach with prioritized fictitious self-play and league training to train agents to play Starcraft II, one of the most challenging multiplayer video games, at a competitive level. OpenAI Five, an agent capable of playing Dota 2, was trained with a curriculum that slowly introduced new game mechanics and expanded the range of playable configurations [2]. GT Sophy, which outraced a team of professional players in the Gran Turismo racing game, used a manually crafted distribution of training scenarios [4]. Multiple works have used curricula to train agents to collect diamonds in Minecraft [3, 13, 24]. Curriculum learning is ubiquitous in successful applications of RL, and it approaches problems in a way that is intuitively suited to challenging environments. Despite these successes, there is little fundamental research on the empirical and theoretical properties of curriculum learning. The later chapters of this thesis develop the science of curriculum learning with a specific focus on challenging domains.

Many successes of RL rely on either a natural curriculum from self-play or a hand-crafted curriculum from human experts. In order to tackle challenging single-agent environments where we cannot rely on self-play, we need to automatically discover effective curricula. NetHack, a text-based rogue-like dungeon-crawler with a vast array of mechanics, exemplifies this challenge. In NetHack, an agent must navigate 50 procedurally generated levels while managing hunger, combat, and resource collection. The game includes numerous puzzles and mechanics grounded in mythology and human intuition. This, along with its extensive state, action, and task spaces, makes it a formidable benchmark for tabula rasa RL agents. Although human players can leverage familiar concepts and community resources, current RL methods, and even LLMs, struggle with the low-level control and long-term planning required [25, 26]. In Chapter 5 we develop methods for training multitask LLMs, taking a step towards improving their capabilities in complex domains. In Chapter 7 we explore pure RL approaches to complex domains, utilizing recent advances in curriculum learning.

1.4 Open Source Reinforcement Learning Infrastructure

The best neural (RL and LLM) NetHack agents underperform the best symbolic agents by an order of magnitude [26]. While tabula rasa RL agents may never completely solve NetHack, I expect it is possible to significantly improve on current baselines. I attempted this with a curriculum learning method that Kanitscheider et al. [3] developed based on a measure of learning progress. They successfully used it to train a standard PPO agent to reliably collect diamonds in Minecraft, a similarly challenging open-ended environment. However, as I began the project, it became clear that the engineering challenges would be substantial. Curriculum learning requires environments to be synchronized globally using feedback from each environment. However, this is difficult to implement with the highly specialized multiprocessing libraries that NetHack baselines use to maximize performance [25, 27]. Faced with these challenges, I developed a novel curriculum learning infrastructure, called Syllabus, that simplifies the integration of curriculum methods across any RL library and environment (Chapter 6). Syllabus simplifies an otherwise unapproachable area of research and guarantees reproducibility across different environments.
In Chapter 7 we use Syllabus to present the first automatic curriculum learning results in NetHack and Neural MMO, two extremely challenging RL benchmarks. The portable software paradigm that this thesis develops could even be extended to other subfields of reinforcement learning, and I hope it will help others to advance the empirical science of curriculum learning.

A theme that appears throughout this thesis is a focus on high-quality open-source tools. Open source software is the driving force behind progress in machine learning. In reinforcement learning, the vast majority of projects rely on the API defined by OpenAI's Gym [28], as well as learning libraries like Stable Baselines 3 [29], RLLib [30], or CleanRL [31]. Releasing code and data for research projects is crucial for promoting reproducibility and accelerating research. Every project in this thesis includes a large open-source contribution (when legally permitted). Chapter 3 introduces a library for visualizing the reward surfaces of actor-critic agents, Chapter 4 provides training scripts and contributed experimental data to the Open RL Benchmark [32], and Chapter 6 presents our portable curriculum learning library. I hope that these libraries, as much as any experimental results in this thesis, will facilitate future research on applying policy gradient methods to complex environments.

1.5 Overall Structure and Contributions

This thesis is organized into chapters that collectively address the challenges of reinforcement learning in long-horizon, high-dimensional, and sparse reward environments. Each chapter contributes experimental insights and practical tools to advance the empirical understanding of policy gradient methods and curriculum learning.

Chapter 3: Visualizing Reward Surfaces. In this chapter, we introduce a novel library for visualizing the reward surfaces of actor-critic agents. By mapping out the reward landscapes, we empirically illustrate the exploration difficulties inherent in sparse reward settings and identify evidence of "cliffs" in the policy gradient direction that can collapse performance. The work in this chapter is based on the following paper:

Cliff Diving: Exploring Reward Surfaces in Reinforcement Learning Environments. Ryan Sullivan, Jordan K. Terry, Benjamin Black, and John P. Dickerson. 2022. In The International Conference on Machine Learning (ICML 2022).

Chapter 4: Applying DreamerV3 Implementation Tricks to PPO. Here, we rigorously evaluate the implementation tricks introduced by DreamerV3 within the model-free context of Proximal Policy Optimization. Surprisingly, we find that the tricks presented do not transfer as general improvements to PPO. We identify cases where these tricks succeed and offer insight into their relationships. This research was published in the following work:

Reward Scale Robustness for Proximal Policy Optimization via DreamerV3 Tricks. Ryan Sullivan, Akarsh Kumar, Shengyi Huang, John P. Dickerson, Joseph Suarez. 2023. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023).

Chapter 5: Multitask learning in Large Language Models. In this chapter, we present several novel methods for training multi-objective LLM policies through multi-task training. These approaches allow us to create steerable LLMs that can optimize a range of user preferences during inference. This work was done in collaboration with several teams at Google. I led the development of the RLHF multi-task training method and conducted the related experiments.
I also co-led an extension that applied the same concepts to Online DPO. The methods presented in this chapter were published in the following papers:

Conditional Language Policy: A General Framework For Steerable Multi-Objective Finetuning. Wang, Kaiwen, Rahul Kidambi, Ryan Sullivan, Alekh Agarwal, Christoph Dann, Andrea Michi, Marco Gelmi et al. 2024. In Findings of the Association for Computational Linguistics (EMNLP 2024).

Robust Multi-Objective Preference Alignment with Online DPO. Raghav Gupta, Ryan Sullivan, Yunxuan Li, Samrat Phatale, and Abhinav Rastogi. 2025. In The 39th Annual AAAI Conference on Artificial Intelligence (AAAI 2025).

Chapter 6: Enabling Future Curriculum Learning Research. This chapter presents Syllabus, an open-source and portable framework designed to facilitate curriculum learning across a range of environments. By decoupling curriculum design from specific reinforcement learning implementations, Syllabus enables researchers to efficiently experiment with curriculum learning methods in both standard and complex domains. I led the design and development of the library, and several collaborators added new features, manual curriculum learning tools, and example scripts. I also conducted the experiments in this chapter and wrote the following public preprint:

Syllabus: Portable Curricula for Reinforcement Learning Agents. Sullivan, Ryan, Ryan Pégoud, Ameen Ur Rehman, Xinchen Yang, Junyun Huang, Aayush Verma, Nistha Mitra, and John P. Dickerson. 2024. In arXiv (Preprint).

Chapter 7: Curriculum Learning in Complex Environments. This chapter contains targeted investigations of interactions among curriculum learning methods, policy gradient algorithms, and complex environments. I designed and conducted all of the experiments for this chapter, which have not yet been published.

In the final chapter, I discuss open questions and suggest relevant future research directions.

Chapter 2: Background

This chapter introduces the concepts and formulations used throughout the rest of this thesis. Future chapters may restate, revise, and extend these descriptions as a reminder and to better fit the specific problems they present.

2.1 Reinforcement Learning

Reinforcement learning (RL) [33, 34] considers the setting in which an agent learns to maximize a reward signal by interacting with an environment. The agent takes actions in response to the current state of the environment in order to accumulate rewards. The reward signal typically communicates a task or goal, and may be sparse, rewarding only a few interactions, or dense, rewarding many interactions. The agent is typically assumed to begin with zero knowledge of the environment, so it needs to interact with the environment to uncover rewards and accomplish its task. This setting is sometimes called tabula rasa RL because the agent starts with no prior knowledge, a clean slate. Ultimately, the goal of RL is to develop agents that can learn the optimal behavior in a given environment. The environment can be a physical environment or a simulated world, such as a video game. Virtual environments do not change the underlying RL framework, but they often have different problem constraints from learning in the real world, which we take advantage of in later chapters.

An RL agent's goal is to learn the optimal behavior in a previously unknown environment through interaction. To do this, it must discover sources of reward and learn the behavior to reliably receive that reward.
As such, a fundamental challenge of RL is balancing exploration with exploitation. At each time step, the agent can explore by trying actions that may uncover better rewards, or exploit its existing knowledge to receive known rewards. Notably, the agent can rarely explore and exploit at the same time, so it must balance both behaviors. Exploration is often required to avoid local optima and to thoroughly explore the state space of the environment, in order to identify reward that the agent can exploit to achieve the highest possible cumulative reward.

2.1.1 Markov Decision Processes

The environment is typically modeled as a Markov decision process (MDP) [35] with discrete time steps. At each time step $t$, an MDP exists in a state $s_t$, which is an element of the state space $\mathcal{S}$. The starting state $s_0$ is sampled from a distribution $\rho(s_0)$. The agent observes $s_t$ and takes an action $a_t$ from the action space $\mathcal{A}$ according to its policy $\pi : \mathcal{S} \times \mathcal{A} \mapsto [0, 1]$, which defines a state-conditional distribution over actions. In response to the agent's action $a_t \sim \pi(\cdot|s_t)$, the MDP transitions to the next state $s_{t+1}$ according to the transition function $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \mapsto [0, 1]$, which defines a distribution over next states conditioned on the current state and action, so that $s_{t+1} \sim P(\cdot|s_t, a_t)$. In episodic environments, $s_{t+1}$ can be a terminal state which ends the episode and prevents further interaction until the environment is reset to a new initial state. Terminal states are formally modeled as an absorbing state in the MDP where all transitions map back to the absorbing state. Importantly, the transition function is assumed to be Markovian, such that the next state $s_{t+1}$ depends only on the current state $s_t$ and action $a_t$. Taking action $a_t$ in state $s_t$ is also accompanied by an associated real-valued reward $r_t$, based on the reward function $R : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \times \mathbb{R} \mapsto [0, 1]$. The transition at time step $t$ typically refers to the tuple $(s_t, a_t, s_{t+1}, r_t)$, and the sequence of transitions up to time $t$, $\tau_t = (s_0, a_0, r_0, s_1, \ldots, s_{t-1}, a_{t-1}, r_{t-1}, s_t)$, is called the trajectory.

All RL algorithms seek to learn an optimal policy $\pi^*$ that maximizes the expected total return, defined in Equation 2.1, by updating the agent's current policy using information collected from repeated interactions with the environment over some countable number of time steps $T$:

$$ J = \mathbb{E}_{s_0 \sim \rho,\ \tau \sim \pi}\left[ \sum_{t=0}^{T-1} \gamma^t r_t \right], \tag{2.1} $$

where $\gamma < 1$ is the discount factor. The discounted rewards are summed in expectation over trajectories $\tau$ where actions are sampled according to $\pi$. Importantly, the discount factor introduces a bias such that near-term rewards are weighted more highly than more distant rewards. Reward discounting also empirically serves to reduce the variance of return estimates used in RL algorithms, by effectively shrinking the time horizon over which rewards are summed. In practice, the specific value of $\gamma$ can have a significant impact on the behavior and performance of an agent [36, 37]. The full MDP $\mathcal{M}$ can therefore be specified by the tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \rho, \gamma)$. There are many additional formulations of MDPs for more specific RL problems, which we explain in the following section and later in this chapter.
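To make this interaction loop concrete, the short sketch below rolls out a uniform random policy in CartPole-v1 (one of the Classic Control environments used in Chapter 3) and computes a Monte Carlo estimate of the discounted return in Equation 2.1. It is an illustrative sketch rather than code from this thesis, and it assumes the Gymnasium package, the maintained successor to the OpenAI Gym API discussed in Chapter 1.

```python
import gymnasium as gym

def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t (Equation 2.1) for a single episode."""
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
rewards, done = [], False
while not done:
    action = env.action_space.sample()  # a_t ~ pi(.|s_t); here, a uniform random policy
    obs, reward, terminated, truncated, info = env.step(action)  # s_{t+1} ~ P(.|s_t, a_t), reward r_t
    rewards.append(reward)
    done = terminated or truncated

print(f"Episode length: {len(rewards)}, discounted return: {discounted_return(rewards):.2f}")
```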
2.1.2 Partial Observability

In many real-world settings, the agent can only observe a subset of the full state $s_t$. This is called partial observability [38, 39], which we model by extending the standard MDP tuple with an additional observation function $O : \mathcal{S} \times \Omega \mapsto [0, 1]$, which defines a distribution over the observation space $\Omega$ conditioned on the current state $s_t$. In this setting, the policy is conditioned on $o_t \sim O(\cdot|s_t)$, so that $\pi : \Omega \times \mathcal{A} \mapsto [0, 1]$ and actions are sampled as $a_t \sim \pi(\cdot|o_t)$. The transition and reward functions still condition on the full state as in a standard MDP. This extension of an MDP is called a partially observable MDP (POMDP) and can be represented by the tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \Omega, P, R, O, \gamma)$. In practice, agents typically need to model the history of observations to produce a Markovian representation of the state, often using a recurrent neural network [40, 41].

2.1.3 Estimating Values

For a given environment and policy $\pi$, the value of a state $s$ is the expected discounted return obtained by sampling the rest of the trajectory using $\pi$, starting from $s$. The function that defines this value over the entire state space is called the state-value function:

$$ V_\pi(s) = \mathbb{E}_{\tau \sim \pi}\left[ \sum_{k=t}^{\infty} \gamma^{k-t} r_k \,\middle|\, s_t = s \right]. \tag{2.2} $$

A closely related entity is the state-action value function or Q-function, which maps every state-action pair $(s, a)$ to the expected discounted return obtained by following $\pi$ after taking action $a$ in state $s$:

$$ Q_\pi(s, a) = \mathbb{E}_{\tau \sim \pi}\left[ \sum_{k=t}^{\infty} \gamma^{k-t} r_k \,\middle|\, s_t = s, a_t = a \right]. \tag{2.3} $$

In other words, the Q-function measures the value of taking action $a$ in state $s$ for policy $\pi$. Subtracting the state-value function from the Q-function then yields the advantage function, which measures the expected improvement from taking action $a$ in state $s$ compared to the average performance of policy $\pi$ in state $s$:

$$ A(s, a) = Q_\pi(s, a) - V_\pi(s). \tag{2.4} $$

Intuitively, if the advantage function is non-positive for all states and actions, then the policy can do no better by selecting alternate actions, and is optimal [33]. One approach to search for the optimal policy $\pi^*$ is to iteratively update the policy to take actions that maximize the advantage in each state. In practice, values must be approximated through Monte Carlo sampling, by rolling out a trajectory in the environment, sampling from the policy at each timestep, and recording the sum of discounted rewards after each state. In high-dimensional state spaces that are impractical to fully cover with rollouts, it is common practice to approximate these value functions with neural networks [42].

A common approach to reduce the variance of Monte Carlo return estimates is to truncate the Monte Carlo rollout after $T$ steps and estimate the remaining return starting at $s_T$ with the current value function approximation $\hat{V}(s_T)$. By "bootstrapping" from the current value predictions, the return at time $t$ is estimated as

$$ \hat{R}_t = \left( \sum_{k=t}^{T-1} \gamma^{k-t} r_k \right) + \gamma^{T-t} \hat{V}(s_T). \tag{2.5} $$

By taking the error between this bootstrapped value prediction starting at state $s_t$ and the direct value prediction $\hat{V}(s_t)$, we get the following loss function for training the value approximator:

$$ \mathcal{L}_V = \left( \sum_{k=t}^{T-1} \gamma^{k-t} r_k \right) + \gamma^{T-t} \hat{V}(s_T) - \hat{V}(s_t). \tag{2.6} $$

Note that the bootstrapped value prediction includes rewards from the environment, so this loss function grounds the value predictions with experience gained via interaction. In practice, we can compute this value at every timestep, producing multiple error terms for optimization. The value loss can then be computed as the sum of one-step bootstrap errors for each time step, so that

$$ \mathcal{L}_V = \sum_{k=0}^{T-1} \left( r_k + \gamma \hat{V}(s_{k+1}) - \hat{V}(s_k) \right), \tag{2.7} $$

where each summand $\delta_k = r_k + \gamma \hat{V}(s_{k+1}) - \hat{V}(s_k)$ is called a temporal difference error or TD error at time $k$ [33]. Equation 2.6 therefore defines a $T$-step TD error.
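The following NumPy sketch shows how these quantities are typically computed from a finite rollout: the $T$-step bootstrapped return of Equation 2.5 for every timestep, and the one-step TD errors $\delta_k$ of Equation 2.7. It is a minimal illustration that assumes a single rollout with no episode boundaries inside it; practical implementations additionally zero out the bootstrap term at terminal states.

```python
import numpy as np

def bootstrapped_returns_and_td_errors(rewards, values, last_value, gamma=0.99):
    """rewards: (T,) array of r_0..r_{T-1}; values: (T,) array of V_hat(s_0)..V_hat(s_{T-1});
    last_value: scalar V_hat(s_T) used to bootstrap the truncated rollout."""
    T = len(rewards)
    returns = np.zeros(T)
    running = last_value                          # start from the bootstrap value V_hat(s_T)
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running    # sum_{k=t}^{T-1} gamma^{k-t} r_k + gamma^{T-t} V_hat(s_T)
        returns[t] = running

    next_values = np.append(values[1:], last_value)
    td_errors = rewards + gamma * next_values - values   # delta_k from Equation 2.7
    return returns, td_errors
```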
Importantly, the value function $V$ for any policy $\pi$ must satisfy the Bellman Equation [43]:

$$ V(s_t) = \mathbb{E}_{\pi}\left[ r_t + \gamma V(s_{t+1}) \right] \tag{2.8} $$
$$ = \sum_{a_t, s_{t+1}, r_t} \pi(a_t|s_t)\, P(s_{t+1}|s_t, a_t)\, R(r_t|s_t, a_t, s_{t+1}) \left[ r_t + \gamma V(s_{t+1}) \right]. \tag{2.9} $$

Value-based RL methods [44] use algorithms based on the Bellman Equations to learn the optimal state-value function or action-value function, then select actions greedily according to these functions to construct an optimal policy. In the following sections, we will discuss a different paradigm of RL algorithms which we utilize throughout this thesis.

2.1.4 Generalized Advantage Estimation

While reducing variance, the advantage estimates described in Equation 2.6, based on the $T$-step TD error, introduce bias via the bootstrapped value predictions. Generalized Advantage Estimation (GAE) [45] introduces a simple estimator that balances bias and variance in advantage estimation, based on an exponential average of all $T$-step TD errors for $T = 1, \ldots, \infty$. This average can be written simply in terms of purely one-step TD errors across all time steps:

$$ \hat{A}_t^{\mathrm{GAE}(\lambda)} = (1 - \lambda)\left( \hat{A}_t^{(1)} + \lambda \hat{A}_t^{(2)} + \lambda^2 \hat{A}_t^{(3)} + \ldots \right) \tag{2.10} $$
$$ = \sum_{k=0}^{\infty} (\gamma\lambda)^k \delta_{t+k}, \tag{2.11} $$

where $\hat{A}_t^{(n)}$ denotes the advantage estimate based on the $n$-step TD error, and $0 \le \lambda \le 1$ is the key GAE hyperparameter that controls the bias-variance tradeoff. When $\lambda = 0$, GAE reduces to $\delta_t$, the one-step TD error. When $\lambda = 1$, Equation 2.11 reduces to the Monte Carlo advantage estimator:

$$ \sum_{k=0}^{\infty} \gamma^k \delta_{t+k} = \sum_{k=0}^{\infty} \gamma^k r_{t+k} - \hat{V}(s_t). \tag{2.12} $$
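In practice, the infinite sum in Equation 2.11 is truncated at the end of a fixed-length rollout and computed with a simple backward recursion, $\hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}$. The sketch below shows this calculation in the style used by common PPO implementations such as CleanRL; it again assumes a single rollout with no terminal states inside it, which real implementations handle by masking the recursion at episode boundaries.

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Compute GAE(lambda) advantages (Equation 2.11) and critic targets for one rollout."""
    T = len(rewards)
    next_values = np.append(values[1:], last_value)
    deltas = rewards + gamma * next_values - values   # one-step TD errors delta_t
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = deltas[t] + gamma * lam * running   # A_t = delta_t + gamma * lambda * A_{t+1}
        advantages[t] = running
    returns = advantages + values                     # bootstrapped return targets for the value network
    return advantages, returns
```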
2.2 Policy Gradient Methods

The results and methods in this thesis are developed primarily using a class of RL algorithms known as policy gradient methods46,47. In contrast to value-based RL methods, which construct the optimal policy from a learned optimal value function, policy gradient methods directly represent the policy as a neural network. These methods perform stochastic gradient ascent on the weights of the policy network to optimize a noisy estimate of the discounted future return. In other words, we take the gradient of the expected return J for the policy π from Equation 2.1 with respect to the policy network's weights. The Policy Gradient Theorem46 states that this gradient is proportional to the expected sum of the gradients of the action log-probabilities along the trajectory, each scaled by the trajectory's return, as shown below:

\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right] \propto \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi(a_t|s_t)\, R(\tau) \right]. (2.13)

This simple formula allows us to empirically estimate the policy gradient with the sample mean over collected trajectories.

2.2.1 REINFORCE

The first and simplest policy gradient method is REINFORCE48, which estimates the gradient of the expected discounted return with respect to the policy weights as

\nabla_\theta J(\theta) \propto \mathbb{E}_{s_{0:\infty},\, a_{0:\infty} \sim \pi_\theta}\left[ \sum_{t=0}^{\infty} R_t \nabla_\theta \log \pi_\theta(a_t|s_t) \right]. (2.14)

The REINFORCE estimator effectively restates the Policy Gradient Theorem in a form whose value can be estimated tractably through Monte Carlo rollouts of the policy. Intuitively, this gradient increases the probability of actions that lead to high returns.

2.2.2 Actor-Critic

An important property of the REINFORCE estimator is that it remains unbiased in the presence of a baseline term, for any baseline function b that depends only on the state. In other words, we can subtract any state-based function from the summand in Equation 2.14 without introducing bias. Actor-critic methods33,49,50,51 exploit this fact to reduce the variance of REINFORCE by subtracting such a baseline from the return accumulated after timestep t as follows:

\nabla_\theta J(\theta) \propto \mathbb{E}_{s_{0:\infty},\, a_{0:\infty} \sim \pi_\theta}\left[ \sum_{t=0}^{\infty} (R_t - b_t) \nabla_\theta \log \pi_\theta(a_t|s_t) \right]. (2.15)

The baseline is typically chosen to be the state-value function V(s_t), implemented as a neural network V̂ : S → R. This network is typically called the value network or the "critic" (whereas the policy is dubbed the "actor"). When the baseline is a value network, the difference between the future discounted return R_t and b_t = V̂(s_t) is an unbiased estimator of the advantage A(s_t, a_t). Intuitively, updating the policy weights along the direction of the gradient defined in Equation 2.15 increases the probability of taking actions that are better than average in terms of expected future discounted return.

In practice, both the expected discounted return and the baseline term within the advantage must be estimated from empirical returns. Rollout data is typically collected over a fixed horizon during training, independent of whether an episode terminates. Therefore, the discounted return R_t is approximated via a bootstrapped value estimate, such that R_t ≈ G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k + \gamma^{T-t} \hat{V}(s_T), as in subsection 2.1.3. The value network is trained alongside the policy by minimizing the L2 loss between the bootstrapped return estimate and the initial value estimate:

L_V = \frac{1}{T} \sum_{t=0}^{T-1} \left( G_t - \hat{V}(s_t) \right)^2. (2.16)

The corresponding policy loss function based on Equation 2.15 can then be written as

L_P = -\frac{1}{T}\, \mathbb{E}_{s_{0:T},\, a_{0:T} \sim \pi_\theta}\left[ \sum_{t=0}^{T-1} (G_t - b_t) \log \pi_\theta(a_t|s_t) \right]. (2.17)

Additionally, it is common to include an entropy regularization term in the total loss when training deep RL networks50 to encourage exploration and promote generalized, stochastic behaviors. This term is shown in Equation 2.18, where H denotes the Shannon entropy:

L_H = -\frac{1}{T} \sum_{t=0}^{T-1} H\!\left(\pi(\cdot\,|\,s_t)\right). (2.18)

Adding this final term to the total loss function, we obtain the standard policy gradient loss used in actor-critic algorithms:

L = L_P + \lambda_V L_V + \lambda_H L_H, (2.19)

where λ_V and λ_H are weighting coefficients, typically set via hyperparameter tuning. Note that L_P and L_H are defined in Equations 2.17 and 2.18 with leading negative signs, so minimizing L increases both the log-probability of advantageous actions and the policy's entropy. This formulation, in which the advantage is estimated via bootstrapped return estimates G_t, is commonly called Advantage Actor-Critic (A2C).
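As a reference point, a minimal PyTorch sketch of this combined loss is shown below. The tensor names and coefficient values are illustrative rather than drawn from any particular implementation; the inputs are assumed to be per-timestep tensors collected from a rollout.

```python
import torch

def a2c_loss(log_probs, entropies, values, returns, lam_v=0.5, lam_h=0.01):
    """Combined A2C loss L = L_P + lambda_V * L_V + lambda_H * L_H (Equation 2.19).

    log_probs : log pi_theta(a_t|s_t) for actions taken in the rollout (requires grad)
    entropies : H(pi_theta(.|s_t)) at each visited state (requires grad)
    values    : critic predictions V_hat(s_t) (requires grad)
    returns   : bootstrapped return targets G_t (treated as constants)
    """
    advantages = (returns - values).detach()        # G_t - b_t; no gradient through the baseline
    policy_loss = -(advantages * log_probs).mean()  # L_P (Equation 2.17)
    value_loss = (returns - values).pow(2).mean()   # L_V (Equation 2.16)
    entropy_loss = -entropies.mean()                # L_H (Equation 2.18)
    return policy_loss + lam_v * value_loss + lam_h * entropy_loss
```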
2.2.3 Proximal Policy Optimization

Training models with A2C can be unstable and sample inefficient, requiring many transitions to reach a useful policy. One source of instability in A2C is its sensitivity to the step size taken along the policy gradient. The policy gradient theorem only holds locally for trajectories taken with the current policy52. As optimization moves farther from the policy that was used to calculate the policy gradient, there is a greater risk of degrading agent performance, sometimes catastrophically. Large steps can cause the policy to stray into suboptimal behaviors that then self-reinforce, as the model continues to train on its own transitions53. We explore these intuitions more thoroughly in Chapter 3.

Trust region methods enforce a constraint on the policy update step, such that the updated policy cannot deviate too far from the current policy, which provably results in monotonic policy improvement54. This optimization can be written as

\underset{\theta}{\text{maximize}} \;\; \mathbb{E}_t\left[ \frac{\pi(a_t|s_t)}{\pi_{old}(a_t|s_t)}\, A(s_t, a_t) \right] \quad \text{subject to} \quad D_{KL}(\pi_{old} \,\|\, \pi) \le \delta. (2.20)

Here, the expectation is an importance sampling estimator, and π_old denotes the current iterate of the policy, which collects transitions for the next update.

Proximal Policy Optimization (PPO)55 uses a simple first-order approximation of the trust-region constraint, updating the policy by maximizing the following "clipped" objective:

J^{clip}(\theta) = \mathbb{E}_t\left[ \min\left( \rho_t(\theta) \hat{A}_t,\; \text{clip}(\rho_t(\theta), 1-\epsilon, 1+\epsilon)\, \hat{A}_t \right) \right] (2.21)
                = \mathbb{E}_t\left[ \min\left( \rho_t(\theta) \hat{A}_t,\; g(\epsilon, \hat{A}_t) \right) \right], (2.22)

where ρ_t(θ) denotes the importance sampling ratio π_θ(a_t|s_t)/π_{old}(a_t|s_t), and ε > 0 is the clipping constant. This objective can be understood by observing how g behaves for positive and negative advantages \hat{A}_t 56:

g(\epsilon, \hat{A}_t) = \begin{cases} \hat{A}_t \cdot \min\left( \frac{\pi(a_t|s_t)}{\pi_{old}(a_t|s_t)},\, 1+\epsilon \right) & \text{if } \hat{A}_t > 0, \\ \hat{A}_t \cdot \max\left( \frac{\pi(a_t|s_t)}{\pi_{old}(a_t|s_t)},\, 1-\epsilon \right) & \text{if } \hat{A}_t < 0, \\ 0 & \text{otherwise.} \end{cases} (2.23)

When \hat{A}_t > 0, the action a_t is better than average when taken in s_t. As we would expect, the objective increases as π(a_t|s_t) becomes more likely, but only until π(a_t|s_t) exceeds (1+ε) · π_{old}(a_t|s_t), beyond which the term is clipped. Likewise, when \hat{A}_t < 0, the action a_t is worse than average when taken in s_t, and the objective rewards making π(a_t|s_t) less likely, but this benefit is capped once π(a_t|s_t) falls below (1−ε) · π_{old}(a_t|s_t). The clipping of ρ_t thus heuristically approximates the trust region constraint by limiting how much large changes in the policy can contribute to increasing the objective.

PPO is most commonly implemented by replacing the A2C policy loss term L_P in Equation 2.19 with −J^{clip}(θ). Empirically, this clipped objective allows PPO to stably take multiple gradient updates over a given batch of transitions collected by π_old, despite these samples becoming more stale after each update. This leads to improved sample efficiency through greater data reuse per batch. In practice, PPO performs multiple gradient updates per batch by subsampling minibatches of data without replacement. In contrast, A2C and other prior policy gradient methods take only a single gradient update per batch before collecting new data. Due to its relative simplicity and strong performance across many complex domains, the experiments in this thesis use PPO as the base RL optimization algorithm. Although this section highlights PPO's objective as its core difference from previous policy gradient methods, PPO also refers to the set of implementation details and hyperparameters that comprise the method in practice. These details are enumerated and explored thoroughly in57.
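In practice, the clipped objective reduces to a few lines of code. The following PyTorch sketch computes −J^{clip} for a minibatch so that it can be minimized with gradient descent; variable names are illustrative, and a full training loss would also include the value and entropy terms from Equation 2.19.

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    """Negative clipped surrogate objective -J_clip (Equation 2.21).

    log_probs_new : log pi_theta(a_t|s_t) under the parameters being optimized
    log_probs_old : log pi_old(a_t|s_t) stored when the batch was collected (constants)
    advantages    : advantage estimates A_hat_t, e.g. computed with GAE
    """
    ratio = torch.exp(log_probs_new - log_probs_old)               # rho_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()                   # minimize -J_clip
```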
2.3 Underspecified Environments

The previously discussed RL settings all assume a single environment instantiation, such that the underlying POMDP or MDP is fixed across all interactions. In contrast, most real-world settings feature a large degree of variability across many aspects of the environment. This is especially true in simulation, where many virtual environments of interest are generated via procedural content generation (PCG)58,59, the algorithmic generation of data.

To model such control over the environment, the Underspecified POMDP (UPOMDP)60 extends the standard POMDP with an additional space of free parameters Θ that correspond to settings of specific aspects of the environment that can vary, e.g. the positions of obstacles in a 2D maze environment. In its most general formulation, the UPOMDP allows the specific values of the free parameters θ to vary not only across episodes but across time steps. This is less common in practice, but we explore some methods that change tasks mid-episode in Chapter 6. The specific environment setting θ is incorporated into the transition function, P : S × A × S × Θ → [0, 1], and so a UPOMDP is defined by the tuple (S, A, Ω, P, R, O, γ, Θ). Following common terminology, this thesis uses the term task to refer to a specific setting of the free parameters θ. The standard UPOMDP does not expose the value of θ in the observation, as it was first devised to study zero-shot transfer to unknown θ at test time, i.e. without taking any gradient updates in the environment instance defined by θ. However, in practice, some experiments in this work do encode the current task into the observation, and the UPOMDP formulation can easily be modified to include θ in the observation, in which case it becomes equivalent to what Hallak et al.61 call a contextual MDP.

2.4 Curriculum Learning

Curriculum learning is a broad field of methods which generally aim to sort or generate tasks that accelerate agent learning. There is no widely agreed upon definition, though several previous works have attempted to formalize curriculum learning. This section first describes prior formulations of curriculum learning, then explains the definition used throughout the thesis.

Broadly, the term curriculum learning refers to methods that guide an agent toward problems that are challenging yet learnable, within the agent's zone of proximal development62. In other words, curricula typically sort tasks by difficulty, or prioritize tasks that maximize learning progress. Bengio et al.63 define curriculum learning as methods which emphasize tasks of increasing difficulty throughout training and converge on the uniform distribution over target tasks. Importantly, curriculum learning methods manipulate tasks, but do not directly optimize a policy to solve any task. While increasing difficulty is a common goal of curriculum learning methods, other approaches, such as maximizing learning progress, do not strictly aim to sort tasks by difficulty.

In our work, we consider a broader definition of curriculum learning, most similar to that of Unsupervised Environment Design (UED) introduced by Dennis et al.60. UED uses an underspecified environment (a UPOMDP) to produce a distribution over fully specified environments, which can be used to train a policy that performs well over that distribution. More precisely, UED specifies an environment policy Λ : Π → ∆(Θ^T), where Π is the set of possible policies and Θ^T is the set of possible sequences of environment parameters, with a specific environment instantiation being defined by θ. Notably, this definition centers on the ability to configure an environment, and on a class of methods which produce task distributions over these environments conditioned on a given policy. This excludes methods designed for singleton environments, which can be specified as POMDPs.

This thesis additionally considers any method which can be represented as an environment policy over UPOMDP free variables to be a form of curriculum learning, with no restriction on the conditioning inputs to the curriculum. More precisely, a curriculum learning method C : H_t → Θ_t chooses the environment free parameters θ_t at timestep t based on any training history H_t, which may include policies, environment outputs, or external expert inputs. This encodes the idea that any method which modifies the task structure to promote learning is a form of curriculum, and it broadly includes many techniques from the intrinsic motivation and transfer learning literature.
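To make this definition concrete, the sketch below shows one way such a curriculum could be expressed in code: an abstract task-sampling interface C : H_t → Θ_t plus a trivial static instantiation. The class and method names are hypothetical and do not correspond to any specific library; automatic methods, discussed in the next section, would adapt sample_task() using the feedback passed to update().

```python
import random
from abc import ABC, abstractmethod

class Curriculum(ABC):
    """Hypothetical interface for a curriculum C : H_t -> Theta_t.

    sample_task() returns the free parameters theta for the next episode;
    update() lets the curriculum accumulate whatever training history it needs.
    """

    @abstractmethod
    def sample_task(self):
        ...

    def update(self, task, episode_return):
        """Record feedback from training; static curricula can simply ignore it."""

class DomainRandomization(Curriculum):
    """A static task distribution: sample theta uniformly, ignoring all history."""

    def __init__(self, tasks, seed=0):
        self.tasks = list(tasks)
        self.rng = random.Random(seed)

    def sample_task(self):
        return self.rng.choice(self.tasks)

# Example usage inside a training loop (the environment API here is hypothetical):
# curriculum = DomainRandomization(tasks=range(100))
# theta = curriculum.sample_task()
# env.reset(task=theta)
```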
For instance, most bonus-based exploration methods10,19,21 can be formulated as UED methods that define a UPOMDP (S, A, Ω, P, R, O, γ, Θ) where the fully specified environment is a POMDP (S, A, Ω, P, R̂, O, γ) and θ specifies the bonus reward term, such that R̂ = R + θ. We discuss the possibility of self-guided curricula that do not require the explicit ability to instantiate an environment with specific settings in Chapter 8.

Multi-agent problems can also be specified as POMDPs by considering all other agents to be part of the environment dynamics. This is the intuition behind independent learning and self-play, which often work well in practice64,65,66. The opponent policy becomes a free parameter in the UPOMDP which can be specified to produce a single-agent POMDP. Methods like self-play and prioritized fictitious self-play produce curricula over opponent policies to train agents that are robust against the full space of potential policies1,67. With these algorithms, the UPOMDP's free parameter space Θ becomes the space of opponent policy parameters.

Though this definition of curriculum learning includes many works in the exploration literature, it does not include methods that do not explicitly specify changes to the environment. For instance, the state-distribution shift induced by on-policy learning is not curriculum learning, because the action distribution which generates those states is not a specifiable component of the UPOMDP. Curriculum learning cannot be used in singleton environments, although some singletons, like Atari games, can become underspecified environments when augmented with additional reward terms. As such, reward shaping methods are not curriculum learning, because agents train on only a single environment instance, but dynamic reward bonuses which modify shaping rewards throughout training are.

2.5 Automatic Curriculum Learning

This thesis distinguishes between manual and automatic curriculum learning as methods that use a static or dynamic task distribution, respectively. Manual methods, such as domain randomization or simulated annealing, specify the task distribution that will be used at any timestep independent of any history H. Automatic curriculum methods instead update their task sampling policy online during training using feedback from the training process. They typically use feedback from the environment or agent, such as task success rates or value predictions, but they can also use learned generator networks68, multi-agent competition69,70, or LLM responses to propose new tasks8,71. Prior work72 defines automatic curriculum learning methods as "a family of mechanisms that automatically adapt the distribution of training data by learning to adjust the selection of learning situations to the capabilities of DRL agents." This work uses a more specific definition that only allows for training data distribution changes induced by changes to the environment configuration.

2.6 Open-Endedness

Open-endedness is a murkier concept, with no widely agreed upon definition. The term was originally coined in the context of artificial intelligence by Stanley et al.73 to refer to a process that leads to endlessly evolving complexity, similar to natural evolution on Earth. It has since also been used to describe environments that provide challenges sufficient for an open-ended process to develop. However, it is not clear what minimum environment qualities are sufficient for open-endedness.
In fact, understanding this may be a key step toward developing open-ended systems73.

Hughes et al.74 present a thorough exploration of the definition of open-endedness, synthesizing recent work and providing several examples of open-ended systems. They define open-endedness as follows: "From the perspective of an observer, a system is open-ended if and only if the sequence of artifacts it produces is both novel and learnable"74. Formally, they consider a system open-ended if it is both novel and learnable, where these terms are defined as follows. A system produces artifacts X_T at each timestep T. An observer O produces a statistical model of the artifacts, X̂_t, conditioned on the history up until timestep t. The observer judges the quality of this prediction with a loss function l(X̂_t, X_T), or l(t, T) for short, comparing the statistical model's prediction to the true artifact. They consider a system novel if the loss incurred by the statistical model is higher for future artifacts, ∀T' > T : E[l(t, T')] > E[l(t, T)]. This indicates information gain by the system over time. A system is considered learnable if the loss incurred by the statistical model decreases with more history, ∀t' > t : E[l(t, T)] > E[l(t', T)]. This suggests that the information gained by the system is meaningful or "interesting" to the agent. The full paper includes a thorough discussion of the implications of this definition and its connection to prior attempts to define open-endedness74.

Many systems will only possess these two properties for a finite amount of time, until the observer's statistical model sufficiently generalizes over the environment's task space. As such, Hughes et al.74 also describe environments which are open-ended up to a certain timestep as finitely open-ended. Thus open-endedness is a temporary and subjective property of an environment, in that it depends on the observer and the time at which the system is being observed. Even simple environments may be considered open-ended for a short time. The simulated environments studied in this thesis are finitely open-ended in that they continue to present novel, learnable challenges to agents for much longer than we train those agents, but there is a limit on the complexity which they can produce. When an environment is finitely open-ended for t greater than the total number of training timesteps, these experiments can simulate infinite open-endedness.

The work in this thesis does not depend on any precise definition of open-endedness, and instead simply claims that certain simulated environments, namely NetHack25 and Neural MMO75,76, pose a wide array of diverse challenges for RL agents, which can be used to study multi-task learning as a necessary component of open-ended RL systems. This thesis aims to develop methods that can generalize to many tasks as a step toward truly open-ended systems.

Chapter 3: Visualizing Reward Surfaces

3.1 Introduction

We begin by exploring the behavior of policy gradient methods in single-task environments. Proximal Policy Optimization's failure cases are poorly understood, and it's even less clear how these weaknesses are exacerbated or mitigated by multi-task environments. We develop a novel visualization approach to study the characteristics of PPO on Atari games and MuJoCo physics environments. This allows us to analyze the types of tasks on which PPO underperforms and predict what challenges we will see in a multi-task setting.
Reinforcement learning typically attempts to optimize the expected discounted return of an agent's policy. Policy gradient methods represent the policy with a neural network and learn this policy by approximating the gradient of the reinforcement learning objective over the policy network parameters. This means that the benefits and challenges of deep learning apply to policy gradient methods. However, unlike most deep learning tasks, reinforcement learning is notoriously unstable, and agents often experience sharp drops in reward during training. Studying the reward surface and how RL algorithms interact with it is critical to understanding the successes and failures of deep reinforcement learning. This work appeared at ICML 2022.

Figure 3.1: Reward surface for the Solaris Atari environment.

A "reward surface" is the high-dimensional surface of the reinforcement learning objective (the cumulative expected reward when following a given policy in an environment) over the policy network parameter space. Reward surfaces have been used to study specific questions in limited contexts77 but have never been generated for a wide variety of environments. Loss landscapes have been used in computer vision to understand the effect that residual connections have on computer vision tasks78, and we expect that visualizations of reward surfaces could be similarly valuable for understanding techniques in reinforcement learning. As a first step toward this goal, we produce visualizations of the reward surfaces for 27 of the most popular reinforcement learning environments and identify new patterns in these surfaces for the first time. We see common traits of environments that are generally perceived to be "easy" for reinforcement learning to solve, and visual explanations of failure modes in sparse reward environments.

We additionally conduct a series of novel visualizations of the reward surface, finding evidence of steep "cliffs" in the reward surfaces of numerous environments when plotted in the policy gradient direction. These are directions where the returns are constant or increase for a short distance, then drop sharply. Reinforcement learning notoriously suffers from instability, and these cliffs may explain the sudden performance drops that agents experience during training. Our plots offer conclusive visual evidence that these cliffs exist, as well as methods to study them further. We show that the specific cliffs present in our visualizations impact learning performance in some situations, and by comparing the performance of PPO79 and A2C50 on these cliffs, we provide an explanation for PPO's improved efficacy over previous methods.

Finally, to accelerate future research on reward surfaces and the impact of cliffs in reinforcement learning, we release the code for this paper as a comprehensive, modular, and easy-to-use library for researchers to plot reward surfaces and the other visualizations used in this paper.

3.2 Background and Related Work

3.2.1 Loss Landscapes

Li et al.78 developed filter-normalization, a technique for plotting 3D visualizations of loss landscapes, the surface generated by a loss function over neural network parameters. They demonstrated that the sharpness of loss landscapes plotted using filter-normalized random directions correlates with neural network generalization error. They create their plots by choosing perturbations d of trained parameters that give an informative view of the neural network's local behavior.
However, uniform random perturbations are known to be misleading in neural network analysis, because neural networks with ReLU activations have scale-invariant weights78. To mitigate this problem, they propose filter-normalized random directions. They represent the neural network as a vector θ indexed by layer i and filter (not filter weight) j. Then, they sample a random Gaussian direction d and scale each filter of this random direction to match the magnitude of the neural network parameters in the corresponding filter, by applying the following formula:

d_{i,j} = \frac{d_{i,j}}{\| d_{i,j} \|} \, \| \theta_{i,j} \|.

(Note that this method also works for fully connected layers, which are equivalent to convolutional layers with a 1×1 output feature map.) Li et al.78 used this method to demonstrate that skip connections can limit the increase in non-convexity as network depth increases. Their work was originally applied to image classification networks, but we adapt these techniques to visualize reward surfaces for policy networks.

3.2.2 Reinforcement Learning

Deep reinforcement learning aims to optimize a policy π to maximize the expected return over neural network parameters θ. This objective can be written as J(π_θ) = \mathbb{E}_{\tau \sim \pi_\theta}[R(τ)], where R(τ) = \sum_{t=0}^{n} γ^t r_t. Here τ is a trajectory, r_t is the reward at time step t, γ is the discount factor, and the rewards are summed across an entire episode of n time steps. Nota and Thomas80 showed that the discounted policy gradient is not the gradient over the surface of average discounted rewards, and is in fact not the gradient of any surface. To avoid this issue and make interpretation easier, we plot the undiscounted reward surface, where γ = 1.

3.2.3 Policy Gradient Methods

Policy gradient methods estimate the gradient of the policy network and use gradient ascent to increase the probability of selecting actions that lead to high rewards. The gradient of the objective J is

\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\, A^{\pi_\theta}(s_t, a_t) \right],

where A^{π_θ}(s_t, a_t) is the advantage function for the current policy π_θ.

3.2.4 Reward surfaces

The "reward surface" is the reinforcement learning objective function J(π_θ). Reward surfaces were first visualized by Ilyas et al.77 to characterize problems with policy gradient estimates. The authors plotted a policy gradient estimate against a uniform random direction, showing via visually striking examples that low-sample estimates of the policy gradient rarely guide the policy in a better direction than a random direction. Note that this work did not make use of filter-normalized directions.

Bekci and Gumus81 then used loss landscapes from Li et al.78 to study Soft Actor-Critic agents82 trained on an inventory optimization task. They visualize the impact of policy stochasticity and action smoothing on the curvature of the loss landscapes for 4 MuJoCo environments. Later, Ota et al.83 used loss landscapes from Li et al.78 directly to compare shallow neural networks for reinforcement learning to deeper neural networks, showing that deep neural networks perform poorly because their loss landscape has more complex curvature. They used this visual insight to develop methods that can train deeper networks for reinforcement learning tasks. Their work plots the loss function of SAC agents, which includes an entropy-regularization term82, while we directly plot the reinforcement learning objective.
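For reference, the filter-normalization step from subsection 3.2.1, which we apply to policy network parameters throughout this chapter, can be sketched in a few lines of NumPy. The layer-and-filter array layout below is illustrative, the small epsilon guarding against zero-norm filters is our own addition, and this sketch is not the interface of our released library.

```python
import numpy as np

def filter_normalize(direction, params):
    """Rescale a random direction filter-by-filter to match the parameter norms.

    direction, params : lists of arrays, one per layer, shaped (num_filters, ...)
                        so that row j of layer i is filter (i, j).
    """
    normalized = []
    for d_layer, p_layer in zip(direction, params):
        d_layer = d_layer.copy()
        for j in range(d_layer.shape[0]):
            d_norm = np.linalg.norm(d_layer[j])
            p_norm = np.linalg.norm(p_layer[j])
            # d_ij <- (d_ij / ||d_ij||) * ||theta_ij||
            d_layer[j] = d_layer[j] / (d_norm + 1e-10) * p_norm
        normalized.append(d_layer)
    return normalized

# Example: a random Gaussian direction for a small two-layer network.
params = [np.random.randn(16, 3, 3, 3), np.random.randn(10, 256)]
direction = [np.random.randn(*p.shape) for p in params]
direction = filter_normalize(direction, params)
```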
This paper is different from each of these previous works in the breadth of environments explored, and in that we are the first to use the filter-normalized directions from Li et al.78 to visualize reward surfaces rather than loss landscapes. We additionally introduce novel experiments and results inspired by the findings in our initial reward surface plots.

3.2.5 Proximal Policy Optimization

Proximal Policy Optimization (PPO)79 is a derivative of Trust Region Policy Optimization (TRPO)54 intended to be easier to implement and less sensitive to hyperparameter choices. Over the past 4 years, PPO has become a "default" method for many deep RL practitioners and has been used in numerous high-profile works2,84. TRPO and PPO claim to offer enhanced empirical performance over previous methods by preventing dramatic changes in the policy via trust regions and ratio clipping, respectively. Ratio clipping is conceptually a useful heuristic; however, we are not aware of any work demonstrating why it results in empirically good performance across so many domains.

Instability during training is common in deep reinforcement learning, and agents often experience large drops in reward that can take a long time to recover from. One intuition for the utility of PPO's gradient and ratio clipping that has been informally discussed in the community is that they prevent agents from taking gradient steps that result in this collapse in performance85. More precisely, the intuition is that by preventing large changes to the policy, PPO avoids gradient steps that move the policy into regions of the parameter space with poor rewards and uninformative gradients. To the best of our knowledge, this intuition was first described in a widely circulated Medium article by Hui85, released roughly one year after the original PPO paper. We are aware of no prior work in the academic literature which directly supports this intuition, and the TRPO and PPO papers do not directly allude to it. We discover evidence of these cliffs in plots of the policy gradient direction, and perform experiments to confirm that they negatively impact learning. As a result of its empirically good performance, PPO has become the de facto policy gradient method, so we chose it as the agent in our reward surface experiments.

3.3 Environment Selection

In exploring these reward surfaces, we sought to cover many widely used benchmark environments. We generated plots for all "Classic Control" and "MuJoCo" environments in Gym28 and for many popular Atari environments from the Arcade Learning Environment86. Video games with graphical observations and discrete actions are a common benchmark for the field, and Atari games are the most popular games for this purpose. The MuJoCo environments are high-quality physics simulations with continuous actions and vector observations, used for prominent robotics work. The Classic Control games represent some of the early toy environments in reinforcement learning and continue to be used as a standard set of "easy" environments. We generate surfaces for all Classic Control and MuJoCo environments, but for the sake of brevity we only generate surfaces for a representative selection of 12 Atari environments instead of all 59 Atari environments in Gym. To make sure we explore diverse reward schemes, we specifically picked six sparse reward environments (Montezuma's Revenge, Pitfall!, Solaris, Private Eye, Freeway, Venture), three popular dense reward environments (Bank Heist, Q*Bert, Ms.
Pac-Man), and three popular easy exploration environments (Breakout, Pong, and Space Invaders), according to the standard taxonomy of Bellemare et al.15.

3.4 Initial Explorations of Reward Surfaces

3.4.1 Methodology

In this work, we adapt the techniques from Li et al.78 to visualize the reward surfaces of PPO agents in reinforcement learning environments. As in that paper, we deal with the high-dimensional domain of the reward surface by focusing our analysis around points in the policy space visited during training.

Figure 3.2: Reward surfaces for CartPole-v1, Breakout, Hopper-v2, and Montezuma's Revenge.

Given a training checkpoint with parameters θ, we are interested in understanding the local surface of rewards generated by the policy represented by parameters θ + d for small perturbations d. To visualize this local region in 3 dimensions, we plot the empirical expected episodic return on the z axis against independently sampled filter-normalized random directions on the x and y axes. (Because this is a high-dimensional space, these directions are orthogonal with high probability.) The plots are additionally scaled manually to highlight features of interest, so note the marks on the axes, which indicate those manual scales. The scale, resolution, and number of environment steps used in each environment are listed in Table 3.2.

A reward surface depends on the chosen learning algorithm, network architecture, hyperparameters, and random seed, so for these experiments we chose to plot the reward surfaces of PPO agents using the tuned hyperparameters found in RL Zoo 387. However, reward surfaces for a given environment are extremely visually similar across these variables, which we discuss in subsubsection 3.4.2.2. To understand what challenges RL algorithms face toward the end of training, after sufficient exploration has occurred, we chose the best checkpoint during training, evaluated on 25 episodes, with a preference for later checkpoints when evaluations showed equal rewards. The best checkpoint was typically found during the last 10% of training.

3.4.2 Results

A sample of the visualizations of the reward surfaces can be seen in Figure 3.2. The full set of reward surface plots can be found in subsection 3.10.1. In the following sections we present our key findings and discuss the validity of our reward surface visualizations.

3.4.2.1 Findings in Plots

Figure 3.3: Gradient heat maps for Ant, Hopper, and InvertedDoublePendulum. The heat maps for Hopper and InvertedDoublePendulum show a cliff-like gradient direction which falls off sharply compared to the filter-normalized direction.

One obvious observation is that the size and shape of the maximizers present in the reward surfaces roughly correlate with the perceived difficulty of the environment. Though there is some recent work attempting to quantify the difficulty of RL environments, it is still a challenging and unsolved problem88,89. However, researchers have used these benchmarks for years, and we now have some intuition for identifying environments that are easier for modern methods to solve. The Classic Control environments were designed as simple toy environments, and modern methods easily find strong policies for them. The Atari environments have been taxonomized by Bellemare et al.15 to indicate their relative difficulty. Breakout, Pong, and Space Invaders are listed as "Easy Exploration" games. Among the "Hard Exploration" games, Bank Heist, Ms.
Pac-Man, and Q*bert have dense rewards, while Freeway, Montezuma's Revenge, Pitfall!, Private Eye, Solaris, and Venture have a more challenging sparse reward structure. To our knowledge, the MuJoCo environments have not been officially categorized according to difficulty.

Using these rough categories as a guide, we see that the Classic Control environments shown in Figure 3.5, certainly the easiest set that we study, have large maximizers that often span the entire domain of the plot. In the Atari environments shown in Figure 3.7, we see that the "Human Optimal" (Easy Exploration) and "Dense Reward" (Hard Exploration) environments have large, smooth maximizers relative to the chaotic landscapes seen in the "Sparse Reward" environments. This finding complements that of Li et al.78, who found that the sharpness of loss landscapes using filter-normalized directions correlated with generalization error. This observation also raises the possibility of future work creating a metric based on reward surfaces to gauge the difficulty of RL environments.

We also observe some interesting characteristics of individual environments. In Figure 3.5, MountainCarContinuous has a strangely spiky reward surface for a Classic Control environment. This may be a result of the uniquely low training time required to solve it. Using the hyperparameters from RL Zoo 3, we train this agent for only 20,000 steps, while the next lowest training time in RL Zoo 3 is 200,000 timesteps. Looking at Figure 3.6, Humanoid Standup has a fairly spiky reward surface, suggesting that its reward function is sensitive to small perturbations in the policy. While many maximizers in these plots appear as a single hill or peak in an otherwise flat region, some of the surfaces have uniquely identifiable curvature. Hopper has a large semi-circular curve in its reward surface. CartPole, MountainCar, and InvertedPendulum have large plateaus at their peak reward.

The sparse reward Atari environments in Figure 3.7 are particularly interesting to examine. Note that each point in the sparse Atari reward surfaces was evaluated with either 1 or 2 million environment steps to limit the standard error of each estimate. We see that Freeway's reward surface has large flat regions of varying heights surrounding a single smooth maximizer. Montezuma's Revenge and Venture have short, noisy maximizers. The agents for the sparse environments (except Freeway) perform poorly, so note that even the maximizers in these plots represent weak policies. As a result of this, we see that the reward surfaces for Montezuma's Revenge, Solaris, and Venture show nearby maximizers that the agent was unable to find during training. Private Eye has a large region of zero reward and a distant maximum with much higher rewards. Finally, as its name suggests, the reward surface for Pitfall! is mostly flat and marred by several deep pits.

The idea that sparse reward structures lead to flat regions in a reward surface is intuitive: in environments where rewards are issued solely in infrequent goal states, large yet precise policy improvements may be required to experience variations in reward. Despite the intuitive nature of this observation, we are unaware of it being visually documented in the literature. We also find that in regions where the reward surfaces are not flat, they are often extremely noisy. This is supported by plots of the surface of standard deviations at each point, which we include in subsection 3.10.2.
We see in these plots that the standard deviation is often significantly larger than the average reward in the reward surfaces for Montezuma's Revenge, Private Eye, and Venture, unlike in any of the other environments we study. These plots seem to highlight two different failure modes of sparse environments: either the surface is flat, when the agent experiences no variation in rewards, or the surface is spiky, when rewards are sparse and noisy.

3.4.2.2 Plot Reproducibility and Standard Error

To demonstrate the consistency of these experiments across multiple random seeds, we repeated our reward surface plots 18 times for Acrobot, HalfCheetah, Breakout, and Montezuma's Revenge. For each trial, we trained and evaluated a new agent with a new random seed. We can see from the plots in subsection 3.10.3 that the reward surfaces for a particular environment are extremely visually similar, indicating that training appears to converge to visually similar regions of the reward landscape and that the characteristics of these plots are consistent across multiple seeds. We evaluated for at least 200,000 time steps (1 or 2 million steps for the sparse reward Atari games) at each point to ensure that the standard error of these estimates is small.

3.5 Discovering Cliffs in the Gradient Direction

Figure 3.4: Gradient line search plots for Pendulum, Ant, and Pong. The line plot for Ant shows several checkpoints that exhibit cliff-like properties.

Filter-normalized random directions provide a broad sense of the local reward landscape and are useful for analysis near the end of training, but they do not represent the direction of training, because the sampled random directions are likely orthogonal to the gradient direction. To better understand the optimization characteristics of these environments, we performed similar experiments using the policy gradient direction. We use a high-quality estimate of the policy gradient computed over 1,000,000 environment steps, as in Ilyas et al.77. In many of these plots, we find evidence of "cliffs" in the reward surface. These are regions of the reward surface in which rewards sharply decrease after a short distance.

For cliffs, we select checkpoints that exhibit a decrease in returns of at least 50% in the first, high-resolution section of the line search (that is, within a normalized gradient magnitude of 0.04). By selecting points closer to the initial, unperturbed weights, we choose cliffs that the optimization method is likely to reach. Although we do our best to evaluate the gradient line searches with enough environment steps to reduce their standard error, it is not computationally feasible to completely account for the high variance of returns in sparse reward environments. To mitigate this, the decrease in returns that identifies a cliff must also be at least 25% of the global reward range across all checkpoints in the line plot. This ensures that we select cliffs that are significant relative to the environment's overall reward scale and are unlikely to be caused by evaluation noise. We select non-cliff checkpoints arbitrarily from the remaining checkpoints that do not meet these criteria.

One difficulty of plotting the gradient direction is that gradient magnitudes vary drastically across environments and at different points during training90. Additionally, any maximum in a reward surface can be made to look like a sharp cliff by using a large enough gradient scale, or like a large plateau by using a smaller gradient scale.
To provide a fair comparison of the gradient direction's sharpness, we normalize the gradient direction by dividing each component by the direction's L2 norm.

3.5.1 Gradient Directions vs. Filter-Normalized Random Directions

We show the differences between filter-normalized random directions and the gradient direction by creating plots of the reward surface using the gradient direction on the x axis and a random filter-normalized direction on the y axis. Due to the frequent sharp changes in reward that we observe in these plots, 3D surface plots can become partially obscured by peaks, so we choose to plot these surfaces as heat maps instead. A sample of the heat maps can be seen in Figure 3.3, and the full set of heat maps for Classic Control and MuJoCo environments can be found in subsection 3.10.4.

3.5.1.1 Preliminary Observations

The most striking observation is that rewards in the gradient direction often change much more ra