ABSTRACT

Title of Dissertation: APPLYING POLICY GRADIENT METHODS TO OPEN-ENDED DOMAINS
Ryan Sullivan, Doctor of Philosophy, 2025
Dissertation Directed by: Professor John P. Dickerson, Department of Computer Science

Deep reinforcement learning (RL) has been successfully used to train agents in complex video game environments including Starcraft 2, Dota 2, Minecraft, and Gran Turismo [1, 2, 3, 4]. Each of these projects utilized curriculum learning to train agents more efficiently. However, systematic investigations of curriculum learning are limited, and it is rarely studied outside of toy research environments. Modern RL methods still struggle in stochastic, sparse-reward environments with long planning horizons. This thesis studies these challenges from multiple perspectives to develop a stronger empirical understanding of curriculum learning in complex environments. By introducing novel visualization techniques for reward surfaces and empirically investigating key implementation details, it explores why policy gradient methods alone are insufficient for sparse-reward tasks. These findings motivate the use of curriculum learning to decompose problems into learnable subtasks and to prioritize learnable objectives. Building on these insights, this dissertation presents a general-purpose library for curriculum learning and uses it to evaluate popular automatic curriculum learning algorithms on challenging RL environments. Curricula have historically been effective for training reinforcement learning agents, and a fundamental understanding of automatic curriculum learning is an essential step toward developing generally capable agents in open-ended environments.

APPLYING POLICY GRADIENT METHODS TO OPEN-ENDED DOMAINS
by Ryan Sullivan

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2025

Advisory Committee:
Professor John P. Dickerson, Chair/Advisor
Professor Ming Lin
Professor Furong Huang
Professor Subramanian Raghavan
Professor Peter Stone

© Copyright by Ryan Sullivan 2025

Acknowledgments

This thesis would not have been possible without the support and contributions of many friends and collaborators. First of all, I would like to thank my advisor John P. Dickerson for his guidance throughout my PhD and for giving me the confidence to pursue my research. John gave me the freedom to explore my interests, as well as the reassurance and motivation that I needed to see this thesis through to the end. Thank you to Ming Lin, Furong Huang, Subramanian Raghavan, and Peter Stone for reviewing this dissertation and providing valuable feedback. Peter's work on curriculum learning was a large inspiration for this thesis. Thank you to Joseph Suarez and Costa Huang for being incredible collaborators and friends. Our frequent discussions made me a better researcher, and made my PhD more enjoyable. I also would like to thank Raghav Gupta and Eric Li, without whom I would never have been able to publish the research I conducted during my internship at Google. They went above and beyond to support my ideas and develop them into two successful research papers, and I thoroughly enjoyed my time working with them. I also owe thanks to my friends Jordan, Nameer, Killian, Jarett, Griffon, Aman, Samyadeep, Shlok, Naman, and Sneha, who frequently reminded me to enjoy life beyond research. I am obligated to thank my siblings Olivia, Kiernán, and Danny.
Most importantly, thank you to my parents for providing endless love, support, and food. You always enabled my curiosity and encouraged me to pursue my interests, and I will always be grateful. ii Table of Contents Acknowledgements ii Table of Contents iii List of Tables vii List of Figures viii List of Abbreviations xii Chapter 1: Introduction 1 1.1 Reinforcement Learning in Complex Environments . . . . . . . . . . . . . . . . 1 1.2 Intrinsic Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Curriculum Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.4 Open Source Reinforcement Learning Infrastructure . . . . . . . . . . . . . . . . 6 1.5 Overall Structure and Contributions . . . . . . . . . . . . . . . . . . . . . . . . 7 Chapter 2: Background 10 2.1 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.1.1 Markov Decision Processes . . . . . . . . . . . . . . . . . . . . . . . . 11 2.1.2 Partial Observability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.1.3 Estimating Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.1.4 Generalized Advantage Estimation . . . . . . . . . . . . . . . . . . . . . 15 2.2 Policy Gradient Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.2.1 REINFORCE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.2.2 Actor-Critic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.2.3 Proximal Policy Optimization . . . . . . . . . . . . . . . . . . . . . . . 19 2.3 Underspecified Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.4 Curriculum Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.5 Automatic Curriculum Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.6 Open-Endedness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 Chapter 3: Visualizing Reward Surfaces 28 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.2 Background and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.2.1 Loss Landscapes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.2.2 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 31 iii 3.2.3 Policy Gradient Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.2.4 Reward surfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.2.5 Proximal Policy Optimization . . . . . . . . . . . . . . . . . . . . . . . 33 3.3 Environment Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.4 Initial Explorations of Reward Surfaces . . . . . . . . . . . . . . . . . . . . . . 34 3.4.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.5 Discovering Cliffs in the Gradient Direction . . . . . . . . . . . . . . . . . . . . 39 3.5.1 Gradient Directions vs. Filter-Normalized Random Directions . . . . . . 41 3.5.2 Visualizing Rewards in the Gradient Direction During Training . . . . . 42 3.6 Cliffs Impact Policy Gradient Training . . . . . . . . . . . . . . . . . . . . . . . 43 3.6.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.6.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.7 Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . 45 3.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.10 Additional Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.10.1 All Reward Surfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.10.2 Standard Deviation Plots . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.10.3 Reproducibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.10.4 All Gradient Heat Maps . . . . . . . . . . . . . . . . . . . . . . . . . . 60 3.10.5 All Gradient Line Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 Chapter 4: Applying DreamerV3 Implementation Tricks to PPO 65 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 4.3.1 Symlog Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.3.2 Twohot Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.3.3 Critic EMA Regularizer . . . . . . . . . . . . . . . . . . . . . . . . . . 71 4.3.4 Percentile Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 4.3.5 Unimix Categoricals . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.4.1 Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.4.2 Enabling All Tricks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 4.4.3 Add-One Ablations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.4.4 Drop-One Ablations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.4.5 Symlog Predictions and Twohot Encoding . . . . . . . . . . . . . . . . . 78 4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 4.6 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 4.8 Additional Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 4.8.1 Atari Ablations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 4.8.2 DeepMind Control Suite Add-One Ablations . . . . . . . . . . . . . . . 87 4.8.3 DeepMind Control Suite Drop-One Ablations . . . . . . . . . . . . . . . 88 iv 4.8.4 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 4.8.5 PPO Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 Chapter 5: Multitask learning in Large Language Models 92 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 5.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 5.3 Problem Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 5.4 Conditional Language Policy (CLP) . . . . . . . . . . . . . . . . . . . . . . . . 100 5.4.1 Conditioning Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . 101 5.4.2 Three Instantiations of CLP . . . . . . . . . . . . . . . . . . . . . . . . 102 5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 5.5.1 Core Benchmarking Results . . . . . . . . . 
. . . . . . . . . . . . . . . 104 5.5.2 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 5.6 Multi-Objective Online DPO . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 5.6.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 5.6.2 Objective Weight Sampling . . . . . . . . . . . . . . . . . . . . . . . . 108 5.6.3 Prompt Conditioning Mechanism . . . . . . . . . . . . . . . . . . . . . 109 5.6.4 Sampling and Preference Pair Construction . . . . . . . . . . . . . . . . 109 5.6.5 DPO Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 5.7 MO-ODPO Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 5.7.1 Datasets and Rewards . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 5.7.2 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 5.7.3 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 5.8 MO-ODPO Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 5.8.1 Pareto Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 5.8.2 Autorater Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 5.8.3 Qualitative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 5.9 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 5.9.1 Role of Sampling for Prompt Conditioning . . . . . . . . . . . . . . . . 118 5.9.2 Training Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 5.10 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 5.11 Additional Experiment Details . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 Chapter 6: Enabling Future Curriculum Learning Research 124 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 6.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 6.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 6.4 Design Philosophy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 6.5 Syllabus APIs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 6.5.1 Curriculum API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 6.5.2 Agent API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 6.5.3 Task Interface API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 6.5.4 Task Space API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 6.5.5 Multiprocessing Infrastructure . . . . . . . . . . . . . . . . . . . . . . . 137 6.5.6 Automatic Curriculum Learning Implementations . . . . . . . . . . . . . 138 v 6.5.7 Manual Curricula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 6.6 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 6.7 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 6.8 Reproduction Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 6.8.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 6.8.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 6.9 Additional Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 6.9.1 RLLib with Syllabus . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
147 6.9.2 Self-Play on LaserTag . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 6.10 Discussion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150 6.11 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 6.12 Code Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 6.12.1 RLLib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 6.12.2 Stable Baselines 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 6.13 Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154 Chapter 7: Curriculum Learning in Complex Environments 155 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 7.2 New Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 7.2.1 Neural MMO 2.0 in PufferLib . . . . . . . . . . . . . . . . . . . . . . . 156 7.2.2 NetHack in Moolib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 7.2.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 7.2.4 Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 7.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 7.4 Phasic Policy Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 7.4.1 Stale Value Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . 166 7.5 Neural MMO Task-Based Curriculum . . . . . . . . . . . . . . . . . . . . . . . 168 7.6 Task Design in Crafter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171 7.6.1 Task Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173 7.6.2 Impossible Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173 7.6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173 7.6.4 Single Task Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . 175 7.7 A Closer Look at OMNI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176 7.7.1 Sampling for Learnability with Stratified Filtering . . . . . . . . . . . . 177 7.8 Discussion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178 Chapter 8: Future Research Directions and Open Questions 181 8.1 Challenging Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181 8.2 Task Space Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184 8.3 LLM-Augmented RL Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186 8.3.1 Direct Finetuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187 8.3.2 LLMs as Teachers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188 8.3.3 Commonsense Action Pruning . . . . . . . . . . . . . . . . . . . . . . . 190 vi List of Tables 3.1 Table of A2C and PPO’s average percent change in reward after taking a few gra- dient steps on cliff and non-cliff checkpoints for various sets of hyperparameters. These results are averaged among 10 trials each evaluated for 1000 episodes. . . 43 3.2 Settings used to generate reward surfaces for each environment. These settings were manually chosen to highlight interesting features of the reward surface and to reduce standard error in the plots. . . . . . . . . . . . . . . . . . . . . . . . . 52 4.1 Application of DreamerV3 Tricks . . . . . . . . . . . . . . . . . . . . . . . . . 
68 4.2 The default PPO hyperparameters in CleanRL’s implementation, used for all ex- periments. The only change we make is increasing the number of environments for better time efficiency. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 5.1 Overview of multi-objective LLM alignment approaches . . . . . . . . . . . . . 97 5.2 Example generations on Anthropic-HH at different objective weights from MO- ODPO & baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 6.1 Syllabus Performance Costs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 7.1 Learning Hyperparameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 7.2 Neural MMO Sequential Curriculum Task Thresholds . . . . . . . . . . . . . . . 169 vii List of Figures 3.1 Reward surface for the Solaris Atari environment. . . . . . . . . . . . . . . . . . 29 3.2 Reward surfaces for CartPole-v1, Breakout, Hopper-v2, Montezuma’s Revenge . 35 3.3 Gradient heat maps for Ant, Hopper, and InvertedDoublePendulum. The heat map for Hopper and InvertedDoublePendulum shows a cliff-like gradient direc- tion which falls off sharply compared to the filter-normalized direction. . . . . . 36 3.4 Gradient line search plots for Pendulum, Ant, and Pong. The line plot for Ant shows several checkpoints that exhibit cliff-like properties. . . . . . . . . . . . . 39 3.5 Reward surfaces for 5 Classic Control environments. . . . . . . . . . . . . . . . 49 3.6 Reward surfaces for 10 MuJoCo environments. . . . . . . . . . . . . . . . . . . 50 3.7 Reward surfaces for 12 Atari environments. . . . . . . . . . . . . . . . . . . . . 51 3.8 Standard deviation surfaces for 5 Classic Control environments. . . . . . . . . . 53 3.9 Standard deviation surfaces for 10 MuJoCo environments. . . . . . . . . . . . . 54 3.10 Standard deviation surfaces for 12 Atari environments. . . . . . . . . . . . . . . 55 3.11 18 training and plotting runs for the Classic Control Acrobot-v1 environment. . . 56 3.12 18 training and plotting runs for the MuJoCo HalfCheetah-v2 environment. . . . 57 3.13 18 training and plotting runs for the Atari Breakout environment. . . . . . . . . . 58 3.14 18 training and plotting runs for the Atari Montezuma’s Revenge environment. . 59 3.15 Policy gradient heat maps for 5 Classic Control environments. . . . . . . . . . . 60 3.16 Policy gradient heat maps for 10 MuJoCo environments. . . . . . . . . . . . . . 61 3.17 Policy gradient line search plots for 5 Classic Control environments. . . . . . . . 62 3.18 Policy gradient line search plots for 10 MuJoCo environments. . . . . . . . . . . 63 3.19 Policy gradient line search plots for 12 Atari environments. . . . . . . . . . . . . 64 4.1 PPO with all tricks enabled compared to base PPO, using the DeepMind Control Suite in a) and c) and Atari environments (with and without reward clipping) in b) and d). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.2 Median scores averaged across multiple seeds for the add-one ablations in each environment set. 5 seeds were used for Deepmind Control Suite and 3 seeds for Atari environments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 4.3 Median scores for the drop-one ablations in each environment set. 5 seeds were used for DeepMind Control Suite and 3 seeds for the Arcade Learning Environment 78 4.4 Experiments varying the a) range (bins=256) and b) number of bins (range=100) used for twohot encoding in Atari Breakout, averaged across 3 seeds. 
c) shows symlog predictions with twohot encoding across all 57 games in the Arcade Learning Environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 viii 4.5 Add-one ablations for the Arcade Learning Environment with reward clipping. Each line shows the median scores for 57 environments, averaged over 3 seeds, across training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 4.6 Add-one ablations for the Arcade Learning Environment without reward clip- ping. Each line shows the median scores for 57 environments, averaged over 3 seeds, across training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 4.7 Drop-one ablations for the Arcade Learning Environment with reward clipping. Each line shows the median scores for 57 environments, averaged over 3 seeds, across training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 4.8 Add-one ablations for the DeepMind Control Suite. Each line shows the median scores for 35 environments, averaged over 5 seeds, across training. . . . . . . . . 87 4.9 Drop-one ablations for the DeepMind Control Suite. Each line shows the median scores for 35 environments, averaged over 5 seeds, across training. . . . . . . . . 88 4.10 20M DreamerV3 XL architecture vs. 1M Nature CNN. Figures show median performance averaged across 3 seeds and all 57 Atari environments. . . . . . . . 89 4.11 20M Dreamer-v3 XL architecture (3 seeds) vs. 1M Nature CNN (1 seed). . . . . 90 5.1 (Left) For a fixed prompt x, a multi-objective LM can output y1, y2, y3 for dif- ferent weightings w1, w2, w3 of two rewards r1 and r2, such that the response yi for weighting wi is preferred under the weighted reward wi[1]r1 + wi[2]r2. (Right) Pareto-fronts when using the rewards NLI and Rouge (section 5.5.1.1). Rewarded Soups (RS)5 is Pareto-dominated by both full-CLP (this paper) and prompting (say, Jang et al. 6), but full-CLP is more appealing for its steerability, evidenced by its wider Pareto-front. In sum, Pareto-dominance (pushing out the front) and steerability (stretching out the front) are both key desiderata for MOFT. 93 5.2 Pareto-curves for two-reward & α = 0.01. Observe CLP variants (full-CLP and attn-CLP) offer improved spread (compared to prompting) while Pareto- dominating the Rewarded Soups (RS) baseline. . . . . . . . . . . . . . . . . . . 105 5.3 Combining Prompt-based conditioning with Parameter-space condtioning for full- CLP. Observe that prompting slots in with full-CLP to produce steerable and Pareto dominating behaviors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 5.4 Pareto-curves at 10k, 60k, 90k training steps. Observe that prompting shows slightly improved steerability with a 3× larger training budget but still isn’t as steerable as full-CLP which exhibits a strong steerability even at 10k iterations. . . 106 5.5 Ablation on model size for NLI v. TLDR. Observe that across model sizes, full- CLP and attn-CLP learn steerable (compared to prompting) and Pareto-dominating behaviors (compared to Rewarded Soups). . . . . . . . . . . . . . . . . . . . . . 107 5.6 Representation of the MO-ODPO algorithm with two rewards . . . . . . . . . . 111 5.7 Pareto fronts for Anthropic-HH (left) and TL;DR Summarization (right) with PaLM 2 XS policy model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 5.8 Pareto fronts for Anthropic-HH (left) and TL;DR Summarization (right) with PaLM 2 XXS policy model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
112 5.9 Automated evaluation win rates for MO-ODPO compared against individual base- lines, as judged by a large off-the-shelf LLM. MO-ODPO consistently achieves a higher win rate than each baseline on both benchmarks. . . . . . . . . . . . . . 113 ix 5.10 Pareto fronts for MO-ODPO (PaLM 2 XS) with different Dirichlet(α) sampling of objective weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 5.11 Training dynamics for MO-ODPO (above) and Rewards in Context (below) on the TL;DR benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 6.1 Syllabus with a standard asynchronous RL training setup. . . . . . . . . . . . . . 131 6.2 An abbreviated summary of the Curriculum interface. These represent the main methods for updating and sampling from a curriculum in Syllabus. The get_opponent, update_agent, and update_winrate methods are used for self-play. . . . . . . . . 132 6.3 Main features of the Task Interface. . . . . . . . . . . . . . . . . . . . . . . . . 135 6.4 Main features of the Task Space API. . . . . . . . . . . . . . . . . . . . . . . . . 136 6.5 Using Syllabus for curriculum learning with just a few lines of code. . . . . . . . 137 6.6 (a) Mean normalized test returns for Syllabus’s implementation vs. the original implementation of Prioritized Level Replay from7 on 10 Procgen environments. Domain Randomization is also included for reference. (b) Mean task success rate for Syllabus’s Learning Progress and OMNI implementations vs. the ref- erence implementations from8 on Crafter. (c) and (d) 95% Stratified bootstrap confidence intervals for Procgen and Crafter. . . . . . . . . . . . . . . . . . . . . 140 6.8 (a) Probability of sampling the first agent for each algorithm. Self-play only has one agent, so it always samples the current policy as it’s opponent. (b) The winrate of each method against a fixed random policy. (c) Stratified bootstrapped confidence intervals of the winrates. . . . . . . . . . . . . . . . . . . . . . . . . 149 6.9 Adding curriculum learning with Syllabus to RLLib training code with a few lines of code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 6.10 Quickstart page of Syllabus’s documentation website. . . . . . . . . . . . . . . . 154 7.1 Automatic curriculum learning training curves and 95% Stratified bootstrap con- fidence intervals on (a), (e) Procgen, (b), (f) Crafter, (c), (g) NetHack, and (d), (h) Neural MMO. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 7.2 Normalized test returns for Domain Randomization, Prioritized Level Replay, Sampling for Learnability, and Learning Progress evaluated on 10 Procgen envi- ronments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162 7.3 Normalized Test Returns of PPO and PPG with 1 or 3 value epochs when trained with DR or PLR on 10 Procgen environments with 5 seeds each. . . . . . . . . . 166 7.4 Training Procgen agents with PLR using stale value predictions. 1 buffer is equiv- alent to 1 episode of delay per environment. . . . . . . . . . . . . . . . . . . . . 167 7.5 Various events and achievements in Neural MMO for Domain randomization and a manually designed sequential curriculum over the course of training. . . . . . . 170 7.6 Crafter ablations removing task encodings and impossible tasks. Note that the y axes change when we remove impossible tasks. . . . . . . . . . . . . . . . . . . 174 7.7 Success rates for agents trained to achieve a single task over the course of train- ing. 
Each run shows a single seed to demonstrate the learning dynamics of individual agents. . . . 176
7.8 (Left) Mean task success rates for each algorithm combined with stratified sampling. (Right) Stratified Domain Randomization compared against other ACL algorithms without impossible tasks. . . . 178
7.9 Mean task success rates for the Full Distribution and Top K implementations of SFL with OMNI's interestingness filter. LP and OMNI are shown for reference. . . 179

List of Abbreviations

A2C Advantage Actor-Critic
ACL Automatic Curriculum Learning
AI Artificial Intelligence
ALE Arcade Learning Environment
APPO Asynchronous Proximal Policy Optimization
API Application Programming Interface
CL Curriculum Learning
CLP Conditional Language Policy
CNN Convolutional Neural Network
CPO Contrastive Preference Optimization
CPU Central Processing Unit
DeRa Decoding-time Realignment
DIAYN Diversity is All You Need
DMC DeepMind Control Suite
DPO Direct Preference Optimization
DR Domain Randomization
ELU Exponential Linear Unit
EMA Exponential Moving Average
FSP Fictitious Self-Play
GAE Generalized Advantage Estimation
GPU Graphics Processing Unit
IQM Interquartile Mean
IPO Implicit Preference Optimization
KL Kullback-Leibler
LLM Large Language Model
LM Language Model
LP Learning Progress
LSTM Long Short-Term Memory
MDP Markov Decision Process
MOD Multi-objective Decoding
MO-DPO Multi-objective Direct Preference Optimization
MOFT Multi-objective Finetuning
MO-ODPO Multi-objective Online Direct Preference Optimization
MO-RL Multi-objective Reinforcement Learning
MO-RLHF Multi-objective Reinforcement Learning from Human Feedback
MSE Mean Squared Error
NLE NetHack Learning Environment
NLI Natural Language Inference
ODPO Online Direct Preference Optimization
OMNI Open-endedness via Models of Novelty and Interestingness
PCG Procedural Content Generation
PFSP Prioritized Fictitious Self-Play
PLR Prioritized Level Replay
P-MORL Prompt-based Multi-objective Reinforcement Learning
POMDP Partially Observable Markov Decision Process
PPO Proximal Policy Optimization
PPG Phasic Policy Gradient
ReLU Rectified Linear Unit
REINFORCE REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility
RiC Rewards-in-Context
RL Reinforcement Learning
RLHF Reinforcement Learning from Human Feedback
RND Random Network Distillation
RS Rewarded Soups
SAC Soft Actor-Critic
SB3 Stable Baselines 3
SFL Sampling for Learnability
SFT Supervised Fine-tuning
SiLU Sigmoid Linear Unit
SOFT Single-objective Finetuning
SP Self Play
SVD Singular Value Decomposition
TD Temporal Difference
TRPO Trust Region Policy Optimization
UED Unsupervised Environment Design
UPOMDP Underspecified Partially Observable Markov Decision Process
VLM Vision-Language Model

Chapter 1: Introduction

1.1 Reinforcement Learning in Complex Environments

Reinforcement learning has solved many premier challenges in artificial intelligence, from defeating the best human players in the board game Go [9] to playing Starcraft II at a professional level [1]. In these early successes, agents focused on perfecting a single task, and since then we have seen RL agents learn to pilot weather balloons [10], design computer chips [11], and achieve many more feats in video games [2, 3, 4].
As RL has expanded into more challenging domains, the focus has shifted from solving singleton tasks to training generalist agents capable of performing well in unseen environments. The most challenging RL benchmarks present diverse scenarios, requiring agents to generalize to unseen objectives, terrain, opponents, allies, or game rules. Much of this thesis was inspired by this line of research and by the goal of developing agents that can play challenging games as well as people can.

While these achievements showcase the potential of reinforcement learning in controlled settings, they also rely on properties, such as determinism or self-play, that are absent in stochastic, single-agent, and many-agent environments. For instance, Atari games are deterministic and can easily be learned via memorization, even with mitigations like entropy bonuses and sticky actions [12]. Dota 2 and Starcraft 2, though significantly more complicated than Atari, are 2-player competitive games, meaning that the agent experiences a winning trajectory in each match and can be trained via self-play. AlphaGo demonstrated that self-play can lead to recursive self-improvement and policies capable of beating the best human players in Go.

In contrast, real-world tasks often lack these convenient features. Single-player games with procedurally generated content tax an agent's ability to generalize, and long planning horizons extend the credit assignment problem past the capabilities of existing discount-based RL methods. Beyond academic benchmarks, modern games with high-quality graphics and increasingly complex controls drastically increase the size and sparsity of the RL agent's search space, exacerbating the need for strong representation learning. In Chapter 3 we show how difficult exploration in these sparse reward problems can be. Hafner et al. [13] address this problem with DreamerV3, a model-based RL algorithm that was able to train proficient agents in many challenging sparse reward environments with a single set of hyperparameters. They credit their results to several implementation tricks such as reward normalization and critic regularizers. However, we show in Chapter 4 that these implementation tricks, though applicable to any actor-critic method, are ineffective when applied to Proximal Policy Optimization. The challenges of exploring large state spaces with uninformative rewards necessitate more advanced exploration techniques, a need which inspired the RL subfield of intrinsic motivation.

1.2 Intrinsic Motivation

Exploration is a fundamental challenge in reinforcement learning, especially in complex environments where finding all sources of reward is difficult. Intrinsic motivation is a field of research that seeks to directly motivate this exploratory behavior by designing agents with an additional "intrinsic" reward signal. There are many formulations of this goal, some inspired by human notions of exploration, such as curiosity [14], as well as practical methods like count-based exploration [15, 16, 17] and random network distillation (RND) [18]. Typically, the intrinsic reward decreases as the agent explores more, allowing the extrinsic reward to dominate once sufficient coverage is achieved.

Intrinsic motivation also implicitly incentivizes the discovery of useful skills. In challenging environments, some states are not reachable without first learning prerequisite skills. For instance, an agent playing chess cannot experience a winning checkmate without first becoming a competent chess player.
As such, intrinsic rewards can even outperform the environment's own extrinsic reward by promoting skills that allow the agent to reach more novel states.

Intrinsic motivation methods are intuitively appealing and successful in select environments, but in practice many have been shown to generalize poorly and to struggle in stochastic environments [19, 20]. The primary weakness of bonus-based methods used to encourage exploration is that they assume full coverage of the state space is possible and desirable [21]. In even moderately complex environments, the cardinality of the state space is many times larger than the number of steps seen during training, meaning that even visiting a unique state at each step will not fully explore the environment. Function approximation helps by associating similar states with each other, but it is less effective as the state space grows. Henaff et al. [21] recently proposed a method which explores the task-specific feature space instead of the full task space, but it still struggles with large action and state spaces. These observations suggest that a higher-level approach to incentivizing exploration is necessary. This thesis explores curriculum learning as an appealing alternative.

1.3 Curriculum Learning

Curriculum learning methods seek to generate or sort tasks in a way that increases sample efficiency or asymptotic performance on a target task or range of tasks [22]. By incentivizing agents to repeat behaviors that led to novel states, intrinsic motivation methods can be viewed as autocurricula over informative trajectories [23]. While exploration bonuses prioritize individual states that maximize some measure of new information, curriculum learning methods prioritize specific environment instances that maximize learning potential. Both approaches teach agents to develop better representations and discover important skills.

As the complexity of the environment increases even further and the task space also grows to an unmanageable size, we enter the domain of open-ended environments. These are environments with huge or infinite task spaces, where the environment evolves in response to agent behavior, similar to the real world. For example, as humanity entered the industrial revolution, skills like manufacturing and engineering became more important, while farmers and craftsmen needed to adapt. As with the transition from state space coverage to state space prioritization, the focus in open-ended environments shifts from task space coverage to task space prioritization. In these settings, agents can train faster by identifying tasks with learnable skills and avoiding distractions, highlighting the importance of curriculum learning.

Curriculum learning is appealing in complex, open-ended environments for several reasons. First, it divides challenging tasks into solvable subproblems. These subtasks can train agents faster by removing other sources of reward, distracting environment dynamics, or challenging starting conditions. Second, this approach is intuitively similar to how people learn in school, by progressing from easy problems to hard problems. Third, by operating at a high level, the solution space is often more interpretable. Experts can easily identify which tasks are being prioritized early in training, and it is easier to manually design curricula at the task level than to engineer shaping rewards.

The final and most important reason is that curriculum learning has been a core component of nearly every successful application of RL agents in challenging tasks.
AlphaGo uses self-play, which induces a curriculum of increasingly capable opponents [9]. AlphaStar extends this approach with prioritized fictitious self-play and league training to train agents to play Starcraft II, one of the most challenging multiplayer video games, at a competitive level. OpenAI Five, an agent capable of playing Dota 2, was trained with a curriculum that slowly introduced new game mechanics and expanded the range of playable configurations [2]. GT Sophy, which outraced a team of professional players in the Gran Turismo racing game, used a manually crafted distribution of training scenarios [4]. Multiple works have used curricula to train agents to collect diamonds in Minecraft [3, 13, 24]. Curriculum learning is ubiquitous in successful applications of RL, and it approaches problems in a way that is intuitively suited to challenging environments. Despite these successes, there is little fundamental research on the empirical and theoretical properties of curriculum learning. The later chapters of this thesis develop the science of curriculum learning with a specific focus on challenging domains.

Many successes of RL rely on either a natural curriculum from self-play or a hand-crafted curriculum from human experts. In order to tackle challenging single-agent environments where we cannot rely on self-play, we need to automatically discover effective curricula. NetHack, a text-based rogue-like dungeon-crawler with a vast array of mechanics, exemplifies this challenge. In NetHack, an agent must navigate 50 procedurally generated levels while managing hunger, combat, and resource collection. The game includes numerous puzzles and mechanics grounded in mythology and human intuition. This, along with its extensive state, action, and task spaces, makes it a formidable benchmark for tabula rasa RL agents. Although human players can leverage familiar concepts and community resources, current RL methods, and even LLMs, struggle with the low-level control and long-term planning required [25, 26]. In Chapter 5 we develop methods for training multitask LLMs, taking a step towards improving their capabilities in complex domains. In Chapter 7 we explore pure RL approaches to complex domains, utilizing recent advances in curriculum learning.

1.4 Open Source Reinforcement Learning Infrastructure

The best neural (RL and LLM) NetHack agents underperform the best symbolic agents by an order of magnitude [26]. While tabula rasa RL agents may never completely solve NetHack, I expect it is possible to significantly improve on current baselines. I attempted this with a curriculum learning method that Kanitscheider et al. [3] developed based on a measure of learning progress. They successfully used it to train a standard PPO agent to reliably collect diamonds in Minecraft, a similarly challenging open-ended environment. However, as I began the project, it became clear that the engineering challenges would be substantial. Curriculum learning requires environments to be synchronized globally using feedback from each environment. However, this is difficult to implement with the highly specialized multiprocessing libraries that NetHack baselines use to maximize performance [25, 27]. Faced with these challenges, I developed a novel curriculum learning infrastructure, called Syllabus, that simplifies the integration of curriculum methods across any RL library and environment (Chapter 6). Syllabus simplifies an otherwise unapproachable area of research and guarantees reproducibility across different environments.
In Chapter 7 we use Syllabus to present the first automatic curriculum learning results in NetHack and Neural MMO, two extremely challenging RL benchmarks. The portable software paradigm that this thesis develops could even be extended to other subfields of reinforcement learning, and I hope it will help others to advance the empirical science of curriculum learning.

A theme that appears throughout this thesis is a focus on high-quality open-source tools. Open source software is the driving force behind progress in machine learning. In reinforcement learning, the vast majority of projects rely on the API defined by OpenAI's Gym [28], as well as learning libraries like Stable Baselines 3 [29], RLLib [30], or CleanRL [31]. Releasing code and data for research projects is crucial for promoting reproducibility and accelerating research. Every project in this thesis includes a large open-source contribution (when legally permitted). Chapter 3 introduces a library for visualizing the reward surfaces of actor-critic agents, Chapter 4 provides training scripts and contributed experimental data to the Open RL Benchmark [32], and Chapter 6 presents our portable curriculum learning library. I hope that these libraries, as much as any experimental results in this thesis, will facilitate future research on applying policy gradient methods to complex environments.

1.5 Overall Structure and Contributions

This thesis is organized into chapters that collectively address the challenges of reinforcement learning in long-horizon, high-dimensional, and sparse reward environments. Each chapter contributes experimental insights and practical tools to advance the empirical understanding of policy gradient methods and curriculum learning.

Chapter 3: Visualizing Reward Surfaces. In this chapter, we introduce a novel library for visualizing the reward surfaces of actor-critic agents. By mapping out the reward landscapes, we empirically illustrate the exploration difficulties inherent in sparse reward settings and identify evidence of "cliffs" in the policy gradient direction that can collapse performance. The work in this chapter is based on the following paper:

Cliff Diving: Exploring Reward Surfaces in Reinforcement Learning Environments. Ryan Sullivan, Jordan K. Terry, Benjamin Black, and John P. Dickerson. 2022. In The International Conference on Machine Learning (ICML 2022).

Chapter 4: Applying DreamerV3 Implementation Tricks to PPO. Here, we rigorously evaluate the implementation tricks introduced by DreamerV3 within the model-free context of Proximal Policy Optimization. Surprisingly, we find that the tricks presented do not transfer as general improvements to PPO. We identify cases where these tricks succeed and offer insight into their relationships. This research was published in the following work:

Reward Scale Robustness for Proximal Policy Optimization via DreamerV3 Tricks. Ryan Sullivan, Akarsh Kumar, Shengyi Huang, John P. Dickerson, Joseph Suarez. 2023. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023).

Chapter 5: Multitask learning in Large Language Models. In this chapter, we present several novel methods for training multi-objective LLM policies through multi-task training. These approaches allow us to create steerable LLMs that can optimize a range of user preferences during inference. This work was done in collaboration with several teams at Google. I led the development of the RLHF multi-task training method and conducted the related experiments.
I also co-led an extension that applied the same concepts to Online DPO. The methods presented in this chapter were published in the following papers:

Conditional Language Policy: A General Framework For Steerable Multi-Objective Finetuning. Wang, Kaiwen, Rahul Kidambi, Ryan Sullivan, Alekh Agarwal, Christoph Dann, Andrea Michi, Marco Gelmi et al. 2024. In Findings of the Association for Computational Linguistics (EMNLP 2024).

Robust Multi-Objective Preference Alignment with Online DPO. Raghav Gupta, Ryan Sullivan, Yunxuan Li, Samrat Phatale, and Abhinav Rastogi. 2025. In The 39th Annual AAAI Conference on Artificial Intelligence (AAAI 2025).

Chapter 6: Enabling Future Curriculum Learning Research. This chapter presents Syllabus, an open-source and portable framework designed to facilitate curriculum learning across a range of environments. By decoupling curriculum design from specific reinforcement learning implementations, Syllabus enables researchers to efficiently experiment with curriculum learning methods in both standard and complex domains. I led the design and development of the library, and several collaborators added new features, manual curriculum learning tools, and example scripts. I also conducted the experiments in this chapter and wrote the following public preprint:

Syllabus: Portable Curricula for Reinforcement Learning Agents. Sullivan, Ryan, Ryan Pégoud, Ameen Ur Rehman, Xinchen Yang, Junyun Huang, Aayush Verma, Nistha Mitra, and John P. Dickerson. 2024. In arXiv (Preprint).

Chapter 7: Curriculum Learning in Complex Environments. This chapter contains targeted investigations of interactions among curriculum learning methods, policy gradient algorithms, and complex environments. I designed and conducted all of the experiments for this chapter, which have not yet been published.

In the final chapter, I discuss open questions and suggest relevant future research directions.

Chapter 2: Background

This chapter introduces the concepts and formulations used throughout the rest of this thesis. Future chapters may restate, revise, and extend these descriptions as a reminder and to better fit the specific problems they present.

2.1 Reinforcement Learning

Reinforcement learning (RL) [33, 34] considers the setting in which an agent learns to maximize a reward signal by interacting with an environment. The agent takes actions in response to the current state of the environment in order to accumulate rewards. The reward signal typically communicates a task or goal, and may be sparse, rewarding only a few interactions, or dense, rewarding many interactions. The agent is typically assumed to begin with zero knowledge of the environment, so it needs to interact with the environment to uncover rewards and accomplish its task. This setting is sometimes called tabula rasa RL because the agent starts with no prior knowledge, a clean slate. Ultimately, the goal of RL is to develop agents that can learn the optimal behavior in a given environment. The environment can be a physical environment or a simulated world, such as a video game. Virtual environments do not change the underlying RL framework, but they often have different problem constraints from learning in the real world, which we take advantage of in later chapters.

An RL agent's goal is to learn the optimal behavior in a previously unknown environment through interaction. To do this, it must discover sources of reward and learn the behavior to reliably receive that reward.
As such, a fundamental challenge of RL is balancing exploration with exploitation. At each time step, the agent can explore by trying actions that may uncover better rewards, or exploit its existing knowledge to receive known rewards. Notably, the agent can rarely explore and exploit at the same time, so it must balance both behaviors. Exploration is often required to avoid local optima and to thoroughly explore the state space of the environment, in order to identify reward that the agent can exploit to achieve the highest possible cumulative reward.

2.1.1 Markov Decision Processes

The environment is typically modeled as a Markov decision process (MDP) [35] with discrete time steps. At each time step $t$, an MDP exists in a state $s_t$, which is an element of the state space $\mathcal{S}$. The starting state $s_0$ is sampled from a distribution $\rho(s_0)$. The agent observes $s_t$ and takes an action $a_t$ from the action space $\mathcal{A}$ according to its policy $\pi : \mathcal{S} \times \mathcal{A} \mapsto [0, 1]$, which defines a state-conditional distribution over actions. In response to the agent's action $a_t \sim \pi(\cdot|s_t)$, the MDP transitions to the next state $s_{t+1}$ according to the transition function $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \mapsto [0, 1]$, which defines a distribution over next states conditioned on the current state and action, so that $s_{t+1} \sim P(\cdot|s_t, a_t)$. In episodic environments, $s_{t+1}$ can be a terminal state which ends the episode and prevents further interaction until the environment is reset to a new initial state. Terminal states are formally modeled as an absorbing state in the MDP where all transitions map back to the absorbing state. Importantly, the transition function is assumed to be Markovian, such that the next state $s_{t+1}$ depends only on the current state $s_t$ and action $a_t$. Taking action $a_t$ in state $s_t$ is also accompanied by an associated real-valued reward $r_t$, based on the reward function $R : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \times \mathbb{R} \mapsto [0, 1]$. The transition at time step $t$ typically refers to the tuple $(s_t, a_t, s_{t+1}, r_t)$, and the sequence of transitions up to time $t$, $\tau_t = (s_0, a_0, r_0, s_1, \ldots, s_{t-1}, a_{t-1}, r_{t-1}, s_t)$, is called the trajectory.

All RL algorithms seek to learn an optimal policy $\pi^*$ that maximizes the expected total return, defined in Equation 2.1, by updating the agent's current policy using information collected from repeated interactions with the environment over some countable number of time steps $T$:

$$ J = \mathbb{E}_{s_0 \sim \rho,\ \tau \sim \pi}\left[ \sum_{t=0}^{T-1} \gamma^t r_t \right], \tag{2.1} $$

where $\gamma < 1$ is the discount factor. The discounted rewards are summed in expectation over trajectories $\tau$ where actions are sampled according to $\pi$. Importantly, the discount factor introduces a bias such that near-term rewards are weighted more highly than more distant rewards. Reward discounting also empirically serves to reduce the variance of return estimates used in RL algorithms, by effectively shrinking the time horizon over which rewards are summed. In practice, the specific value of $\gamma$ can have a significant impact on the behavior and performance of an agent [36, 37]. The full MDP $\mathcal{M}$ can therefore be specified by the tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \rho, \gamma)$. There are many additional formulations of MDPs for more specific RL problems, which we explain in the following section and later in this chapter.
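To make this interaction loop concrete, the short sketch below rolls out a uniform random policy in CartPole-v1 (one of the Classic Control environments used in Chapter 3) and computes a Monte Carlo estimate of the discounted return in Equation 2.1. It is an illustrative sketch rather than code from this thesis, and it assumes the Gymnasium package, the maintained successor to the OpenAI Gym API discussed in Chapter 1.

```python
import gymnasium as gym

def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t (Equation 2.1) for a single episode."""
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
rewards, done = [], False
while not done:
    action = env.action_space.sample()  # a_t ~ pi(.|s_t); here, a uniform random policy
    obs, reward, terminated, truncated, info = env.step(action)  # s_{t+1} ~ P(.|s_t, a_t), reward r_t
    rewards.append(reward)
    done = terminated or truncated

print(f"Episode length: {len(rewards)}, discounted return: {discounted_return(rewards):.2f}")
```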
2.1.2 Partial Observability

In many real-world settings, the agent can only observe a subset of the full state $s_t$. This is called partial observability [38, 39], which we model by extending the standard MDP tuple with an additional observation function $O : \mathcal{S} \times \Omega \mapsto [0, 1]$, which defines a distribution over the observation space $\Omega$ conditioned on the current state $s_t$. In this setting, the policy is conditioned on $o_t \sim O(\cdot|s_t)$, so that $\pi : \Omega \times \mathcal{A} \mapsto [0, 1]$ and actions are sampled as $a_t \sim \pi(\cdot|o_t)$. The transition and reward functions still condition on the full state as in a standard MDP. This extension of an MDP is called a partially observable MDP (POMDP) and can be represented by the tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \Omega, P, R, O, \gamma)$. In practice, agents typically need to model the history of observations to produce a Markovian representation of the state, often using a recurrent neural network [40, 41].

2.1.3 Estimating Values

For a given environment and policy $\pi$, the value of a state $s$ is the expected discounted return obtained by sampling the rest of the trajectory using $\pi$, starting from $s$. The function that defines this value over the entire state space is called the state-value function:

$$ V_\pi(s) = \mathbb{E}_{\tau \sim \pi}\left[ \sum_{k=t}^{\infty} \gamma^{k-t} r_k \,\middle|\, s_t = s \right]. \tag{2.2} $$

A closely related entity is the state-action value function or Q-function, which maps every state-action pair $(s, a)$ to the expected discounted return obtained by following $\pi$ after taking action $a$ in state $s$:

$$ Q_\pi(s, a) = \mathbb{E}_{\tau \sim \pi}\left[ \sum_{k=t}^{\infty} \gamma^{k-t} r_k \,\middle|\, s_t = s, a_t = a \right]. \tag{2.3} $$

In other words, the Q-function measures the value of taking action $a$ in state $s$ for policy $\pi$. Subtracting the state-value function from the Q-function then yields the advantage function, which measures the expected improvement from taking action $a$ in state $s$ compared to the average performance of policy $\pi$ in state $s$:

$$ A(s, a) = Q_\pi(s, a) - V_\pi(s). \tag{2.4} $$

Intuitively, if the advantage function is non-positive for all states and actions, then the policy can do no better by selecting alternate actions, and is optimal [33]. One approach to search for the optimal policy $\pi^*$ is to iteratively update the policy to take actions that maximize the advantage in each state. In practice, values must be approximated through Monte Carlo sampling, by rolling out a trajectory in the environment, sampling from the policy at each timestep, and recording the sum of discounted rewards after each state. In high-dimensional state spaces that are impractical to fully cover with rollouts, it is common practice to approximate these value functions with neural networks [42].

A common approach to reduce the variance of Monte Carlo return estimates is to truncate the Monte Carlo rollout after $T$ steps and estimate the remaining return starting at $s_T$ with the current value function approximation $\hat{V}(s_T)$. By "bootstrapping" from the current value predictions, the return at time $t$ is estimated as

$$ \hat{R}_t = \left( \sum_{k=t}^{T-1} \gamma^{k-t} r_k \right) + \gamma^{T-t} \hat{V}(s_T). \tag{2.5} $$

By taking the error between this bootstrapped value prediction starting at state $s_t$ and the direct value prediction $\hat{V}(s_t)$, we get the following loss function for training the value approximator:

$$ \mathcal{L}_V = \left( \sum_{k=t}^{T-1} \gamma^{k-t} r_k \right) + \gamma^{T-t} \hat{V}(s_T) - \hat{V}(s_t). \tag{2.6} $$

Note that the bootstrapped value prediction includes rewards from the environment, so this loss function grounds the value predictions with experience gained via interaction. In practice, we can compute this value at every timestep, producing multiple error terms for optimization. The value loss can then be computed as the sum of one-step bootstrap errors for each time step, so that

$$ \mathcal{L}_V = \sum_{k=0}^{T-1} \left( r_k + \gamma \hat{V}(s_{k+1}) - \hat{V}(s_k) \right), \tag{2.7} $$

where each summand $\delta_k = r_k + \gamma \hat{V}(s_{k+1}) - \hat{V}(s_k)$ is called a temporal difference error or TD error at time $k$ [33]. Equation 2.6 therefore defines a $T$-step TD error.
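The following NumPy sketch shows how these quantities are typically computed from a finite rollout: the $T$-step bootstrapped return of Equation 2.5 for every timestep, and the one-step TD errors $\delta_k$ of Equation 2.7. It is a minimal illustration that assumes a single rollout with no episode boundaries inside it; practical implementations additionally zero out the bootstrap term at terminal states.

```python
import numpy as np

def bootstrapped_returns_and_td_errors(rewards, values, last_value, gamma=0.99):
    """rewards: (T,) array of r_0..r_{T-1}; values: (T,) array of V_hat(s_0)..V_hat(s_{T-1});
    last_value: scalar V_hat(s_T) used to bootstrap the truncated rollout."""
    T = len(rewards)
    returns = np.zeros(T)
    running = last_value                          # start from the bootstrap value V_hat(s_T)
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running    # sum_{k=t}^{T-1} gamma^{k-t} r_k + gamma^{T-t} V_hat(s_T)
        returns[t] = running

    next_values = np.append(values[1:], last_value)
    td_errors = rewards + gamma * next_values - values   # delta_k from Equation 2.7
    return returns, td_errors
```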
Importantly, the value function $V$ for any policy $\pi$ must satisfy the Bellman Equation [43]:

$$ V(s_t) = \mathbb{E}_{\pi}\left[ r_t + \gamma V(s_{t+1}) \right] \tag{2.8} $$
$$ = \sum_{a_t, s_{t+1}, r_t} \pi(a_t|s_t)\, P(s_{t+1}|s_t, a_t)\, R(r_t|s_t, a_t, s_{t+1}) \left[ r_t + \gamma V(s_{t+1}) \right]. \tag{2.9} $$

Value-based RL methods [44] use algorithms based on the Bellman Equations to learn the optimal state-value function or action-value function, then select actions greedily according to these functions to construct an optimal policy. In the following sections, we will discuss a different paradigm of RL algorithms which we utilize throughout this thesis.

2.1.4 Generalized Advantage Estimation

While reducing variance, the advantage estimates described in Equation 2.6, based on the $T$-step TD error, introduce bias via the bootstrapped value predictions. Generalized Advantage Estimation (GAE) [45] introduces a simple estimator that balances bias and variance in advantage estimation, based on an exponential average of all $T$-step TD errors for $T = 1, \ldots, \infty$. This average can be written simply in terms of purely one-step TD errors across all time steps:

$$ \hat{A}_t^{\mathrm{GAE}(\lambda)} = (1 - \lambda)\left( \hat{A}_t^{(1)} + \lambda \hat{A}_t^{(2)} + \lambda^2 \hat{A}_t^{(3)} + \ldots \right) \tag{2.10} $$
$$ = \sum_{k=0}^{\infty} (\gamma\lambda)^k \delta_{t+k}, \tag{2.11} $$

where $\hat{A}_t^{(n)}$ denotes the advantage estimate based on the $n$-step TD error, and $0 \le \lambda \le 1$ is the key GAE hyperparameter that controls the bias-variance tradeoff. When $\lambda = 0$, GAE reduces to $\delta_t$, the one-step TD error. When $\lambda = 1$, Equation 2.11 reduces to the Monte Carlo advantage estimator:

$$ \sum_{k=0}^{\infty} \gamma^k \delta_{t+k} = \sum_{k=0}^{\infty} \gamma^k r_{t+k} - \hat{V}(s_t). \tag{2.12} $$
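In practice, the infinite sum in Equation 2.11 is truncated at the end of a fixed-length rollout and computed with a simple backward recursion, $\hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}$. The sketch below shows this calculation in the style used by common PPO implementations such as CleanRL; it again assumes a single rollout with no terminal states inside it, which real implementations handle by masking the recursion at episode boundaries.

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Compute GAE(lambda) advantages (Equation 2.11) and critic targets for one rollout."""
    T = len(rewards)
    next_values = np.append(values[1:], last_value)
    deltas = rewards + gamma * next_values - values   # one-step TD errors delta_t
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = deltas[t] + gamma * lam * running   # A_t = delta_t + gamma * lambda * A_{t+1}
        advantages[t] = running
    returns = advantages + values                     # bootstrapped return targets for the value network
    return advantages, returns
```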
2.2 Policy Gradient Methods

The results and methods in this thesis are developed primarily using a class of RL algorithms known as policy gradient methods46,47. In contrast to value-based RL methods, which construct the optimal policy from a learned optimal value function, policy gradient methods directly represent the policy as a neural network. These methods perform stochastic gradient ascent on the weights of the policy network to optimize a noisy estimate of the discounted future return. In other words, we take the gradient of the expected return J for the policy π from Equation 2.1 with respect to the policy network's weights. The Policy Gradient Theorem46 states that this gradient is proportional to the expected sum of the gradients of the action log-probabilities along the trajectory, each scaled by the trajectory's return, as shown below:

\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right] \propto \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi(a_t|s_t)\, R(\tau) \right]. (2.13)

This simple formula allows us to empirically estimate the policy gradient with the sample mean over collected trajectories.

2.2.1 REINFORCE

The first and simplest policy gradient method is REINFORCE48, which estimates the gradient of the expected discounted return with respect to the policy weights as

\nabla_\theta J(\theta) \propto \mathbb{E}_{s_{0:\infty},\, a_{0:\infty} \sim \pi_\theta}\left[ \sum_{t=0}^{\infty} R_t \nabla_\theta \log \pi_\theta(a_t|s_t) \right]. (2.14)

The REINFORCE estimator effectively restates the Policy Gradient Theorem in a form whose value can be estimated tractably through Monte Carlo rollouts of the policy. Intuitively, this gradient increases the probability of actions that lead to high returns.

2.2.2 Actor-Critic

An important property of the REINFORCE estimator is that it remains unbiased in the presence of a baseline term, for any baseline function b that depends only on the state. In other words, we can subtract any state-based function from the summand in Equation 2.14 without introducing bias. Actor-critic methods33,49,50,51 exploit this fact to reduce the variance of REINFORCE by subtracting such a baseline from the return accumulated after timestep t as follows:

\nabla_\theta J(\theta) \propto \mathbb{E}_{s_{0:\infty},\, a_{0:\infty} \sim \pi_\theta}\left[ \sum_{t=0}^{\infty} (R_t - b_t) \nabla_\theta \log \pi_\theta(a_t|s_t) \right]. (2.15)

The baseline is typically chosen to be the state-value function V(s_t), implemented as a neural network V̂ : S → R. This network is typically called the value network or the "critic" (whereas the policy is dubbed the "actor"). When the baseline is a value network, the difference between the future discounted return R_t and b_t = V̂(s_t) is an unbiased estimator of the advantage A(s_t, a_t). Intuitively, updating the policy weights along the direction of the gradient defined in Equation 2.15 increases the probability of taking actions that are better than average in terms of expected future discounted return.

In practice, both the expected discounted return and the baseline term within the advantage must be estimated from empirical returns. Rollout data is typically collected over a fixed horizon during training, independent of whether an episode terminates. Therefore, the discounted return R_t is approximated via a bootstrapped value estimate, such that R_t ≈ G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k + \gamma^{T-t} \hat{V}(s_T), as in subsection 2.1.3. The value network is trained alongside the policy by minimizing the L2 loss between the bootstrapped return estimate and the initial value estimate:

L_V = \frac{1}{T} \sum_{t=0}^{T-1} \left( G_t - \hat{V}(s_t) \right)^2. (2.16)

The corresponding policy loss function based on Equation 2.15 can then be written as

L_P = -\frac{1}{T}\, \mathbb{E}_{s_{0:T},\, a_{0:T} \sim \pi_\theta}\left[ \sum_{t=0}^{T-1} (G_t - b_t) \log \pi_\theta(a_t|s_t) \right]. (2.17)

Additionally, it is common to include an entropy regularization term in the total loss when training deep RL networks50 to encourage exploration and promote generalized, stochastic behaviors. This term is shown in Equation 2.18, where H denotes the Shannon entropy:

L_H = -\frac{1}{T} \sum_{t=0}^{T-1} H\!\left(\pi(\cdot\,|\,s_t)\right). (2.18)

Adding this final term to the total loss function, we obtain the standard policy gradient loss used in actor-critic algorithms:

L = L_P + \lambda_V L_V + \lambda_H L_H, (2.19)

where λ_V and λ_H are weighting coefficients, typically set via hyperparameter tuning. Note that L_P and L_H are defined in Equations 2.17 and 2.18 with leading negative signs, so minimizing L increases both the log-probability of advantageous actions and the policy's entropy. This formulation, in which the advantage is estimated via bootstrapped return estimates G_t, is commonly called Advantage Actor-Critic (A2C).
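As a reference point, a minimal PyTorch sketch of this combined loss is shown below. The tensor names and coefficient values are illustrative rather than drawn from any particular implementation; the inputs are assumed to be per-timestep tensors collected from a rollout.

```python
import torch

def a2c_loss(log_probs, entropies, values, returns, lam_v=0.5, lam_h=0.01):
    """Combined A2C loss L = L_P + lambda_V * L_V + lambda_H * L_H (Equation 2.19).

    log_probs : log pi_theta(a_t|s_t) for actions taken in the rollout (requires grad)
    entropies : H(pi_theta(.|s_t)) at each visited state (requires grad)
    values    : critic predictions V_hat(s_t) (requires grad)
    returns   : bootstrapped return targets G_t (treated as constants)
    """
    advantages = (returns - values).detach()        # G_t - b_t; no gradient through the baseline
    policy_loss = -(advantages * log_probs).mean()  # L_P (Equation 2.17)
    value_loss = (returns - values).pow(2).mean()   # L_V (Equation 2.16)
    entropy_loss = -entropies.mean()                # L_H (Equation 2.18)
    return policy_loss + lam_v * value_loss + lam_h * entropy_loss
```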
2.2.3 Proximal Policy Optimization

Training models with A2C can be unstable and sample inefficient, requiring many transitions to reach a useful policy. One source of instability in A2C is its sensitivity to the step size taken along the policy gradient. The policy gradient theorem only holds locally for trajectories taken with the current policy52. As optimization moves farther from the policy that was used to calculate the policy gradient, there is a greater risk of degrading agent performance, sometimes catastrophically. Large steps can cause the policy to stray into suboptimal behaviors that then self-reinforce, as the model continues to train on its own transitions53. We explore these intuitions more thoroughly in Chapter 3.

Trust region methods enforce a constraint on the policy update step, such that the updated policy cannot deviate too far from the current policy, which provably results in monotonic policy improvement54. This optimization can be written as

\underset{\theta}{\text{maximize}} \;\; \mathbb{E}_t\left[ \frac{\pi(a_t|s_t)}{\pi_{old}(a_t|s_t)}\, A(s_t, a_t) \right] \quad \text{subject to} \quad D_{KL}(\pi_{old} \,\|\, \pi) \le \delta. (2.20)

Here, the expectation is an importance sampling estimator, and π_old denotes the current iterate of the policy, which collects transitions for the next update.

Proximal Policy Optimization (PPO)55 uses a simple first-order approximation of the trust-region constraint, updating the policy by maximizing the following "clipped" objective:

J^{clip}(\theta) = \mathbb{E}_t\left[ \min\left( \rho_t(\theta) \hat{A}_t,\; \text{clip}(\rho_t(\theta), 1-\epsilon, 1+\epsilon)\, \hat{A}_t \right) \right] (2.21)
                = \mathbb{E}_t\left[ \min\left( \rho_t(\theta) \hat{A}_t,\; g(\epsilon, \hat{A}_t) \right) \right], (2.22)

where ρ_t(θ) denotes the importance sampling ratio π_θ(a_t|s_t)/π_{old}(a_t|s_t), and ε > 0 is the clipping constant. This objective can be understood by observing how g behaves for positive and negative advantages \hat{A}_t 56:

g(\epsilon, \hat{A}_t) = \begin{cases} \hat{A}_t \cdot \min\left( \frac{\pi(a_t|s_t)}{\pi_{old}(a_t|s_t)},\, 1+\epsilon \right) & \text{if } \hat{A}_t > 0, \\ \hat{A}_t \cdot \max\left( \frac{\pi(a_t|s_t)}{\pi_{old}(a_t|s_t)},\, 1-\epsilon \right) & \text{if } \hat{A}_t < 0, \\ 0 & \text{otherwise.} \end{cases} (2.23)

When \hat{A}_t > 0, the action a_t is better than average when taken in s_t. As we would expect, the objective increases as π(a_t|s_t) becomes more likely, but only until π(a_t|s_t) exceeds (1+ε) · π_{old}(a_t|s_t), beyond which the term is clipped. Likewise, when \hat{A}_t < 0, the action a_t is worse than average when taken in s_t, and the objective rewards making π(a_t|s_t) less likely, but this benefit is capped once π(a_t|s_t) falls below (1−ε) · π_{old}(a_t|s_t). The clipping of ρ_t thus heuristically approximates the trust region constraint by limiting how much large changes in the policy can contribute to increasing the objective.

PPO is most commonly implemented by replacing the A2C policy loss term L_P in Equation 2.19 with −J^{clip}(θ). Empirically, this clipped objective allows PPO to stably take multiple gradient updates over a given batch of transitions collected by π_old, despite these samples becoming more stale after each update. This leads to improved sample efficiency through greater data reuse per batch. In practice, PPO performs multiple gradient updates per batch by subsampling minibatches of data without replacement. In contrast, A2C and other prior policy gradient methods take only a single gradient update per batch before collecting new data. Due to its relative simplicity and strong performance across many complex domains, the experiments in this thesis use PPO as the base RL optimization algorithm. Although this section highlights PPO's objective as its core difference from previous policy gradient methods, PPO also refers to the set of implementation details and hyperparameters that comprise the method in practice. These details are enumerated and explored thoroughly in57.
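In practice, the clipped objective reduces to a few lines of code. The following PyTorch sketch computes −J^{clip} for a minibatch so that it can be minimized with gradient descent; variable names are illustrative, and a full training loss would also include the value and entropy terms from Equation 2.19.

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    """Negative clipped surrogate objective -J_clip (Equation 2.21).

    log_probs_new : log pi_theta(a_t|s_t) under the parameters being optimized
    log_probs_old : log pi_old(a_t|s_t) stored when the batch was collected (constants)
    advantages    : advantage estimates A_hat_t, e.g. computed with GAE
    """
    ratio = torch.exp(log_probs_new - log_probs_old)               # rho_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()                   # minimize -J_clip
```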
2.3 Underspecified Environments

The previously discussed RL settings all assume a single environment instantiation, such that the underlying POMDP or MDP is fixed across all interactions. In contrast, most real-world settings feature a large degree of variability across many aspects of the environment. This is especially true in simulation, where many virtual environments of interest are generated via procedural content generation (PCG)58,59, the algorithmic generation of data.

To model such control over the environment, the Underspecified POMDP (UPOMDP)60 extends the standard POMDP with an additional space of free parameters Θ that correspond to settings of specific aspects of the environment that can vary, e.g. the positions of obstacles in a 2D maze environment. In its most general formulation, the UPOMDP allows the specific values of the free parameters θ to vary not only across episodes but across time steps. This is less common in practice, but we explore some methods that change tasks mid-episode in Chapter 6. The specific environment setting θ is incorporated into the transition function, P : S × A × S × Θ → [0, 1], and so a UPOMDP is defined by the tuple (S, A, Ω, P, R, O, γ, Θ). Following common terminology, this thesis uses the term task to refer to a specific setting of the free parameters θ. The standard UPOMDP does not expose the value of θ in the observation, as it was first devised to study zero-shot transfer to unknown θ at test time, i.e. without taking any gradient updates in the environment instance defined by θ. However, in practice, some experiments in this work do encode the current task into the observation, and the UPOMDP formulation can easily be modified to include θ in the observation, in which case it becomes equivalent to what Hallak et al.61 call a contextual MDP.

2.4 Curriculum Learning

Curriculum learning is a broad field of methods which generally aim to sort or generate tasks that accelerate agent learning. There is no widely agreed upon definition, though several previous works have attempted to formalize curriculum learning. This section first describes prior formulations of curriculum learning, then explains the definition used throughout the thesis.

Broadly, the term curriculum learning refers to methods that guide an agent toward problems that are challenging yet learnable, within the agent's zone of proximal development62. In other words, curricula typically sort tasks by difficulty, or prioritize tasks that maximize learning progress. Bengio et al.63 define curriculum learning as methods which emphasize tasks of increasing difficulty throughout training and converge on the uniform distribution over target tasks. Importantly, curriculum learning methods manipulate tasks, but do not directly optimize a policy to solve any task. While increasing difficulty is a common goal of curriculum learning methods, other approaches, such as maximizing learning progress, do not strictly aim to sort tasks by difficulty.

In our work, we consider a broader definition of curriculum learning, most similar to that of Unsupervised Environment Design (UED) introduced by Dennis et al.60. UED uses an underspecified environment (a UPOMDP) to produce a distribution over fully specified environments, which can be used to train a policy that performs well over that distribution. More precisely, UED specifies an environment policy Λ : Π → ∆(Θ^T), where Π is the set of possible policies and Θ^T is the set of possible sequences of environment parameters, with a specific environment instantiation being defined by θ. Notably, this definition centers on the ability to configure an environment, and on a class of methods which produce task distributions over these environments conditioned on a given policy. This excludes methods designed for singleton environments, which can be specified as POMDPs.

This thesis additionally considers any method which can be represented as an environment policy over UPOMDP free variables to be a form of curriculum learning, with no restriction on the conditioning inputs to the curriculum. More precisely, a curriculum learning method C : H_t → Θ_t chooses the environment free parameters θ_t at timestep t based on any training history H_t, which may include policies, environment outputs, or external expert inputs. This encodes the idea that any method which modifies the task structure to promote learning is a form of curriculum, and it broadly includes many techniques from the intrinsic motivation and transfer learning literature.
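To make this definition concrete, the sketch below shows one way such a curriculum could be expressed in code: an abstract task-sampling interface C : H_t → Θ_t plus a trivial static instantiation. The class and method names are hypothetical and do not correspond to any specific library; automatic methods, discussed in the next section, would adapt sample_task() using the feedback passed to update().

```python
import random
from abc import ABC, abstractmethod

class Curriculum(ABC):
    """Hypothetical interface for a curriculum C : H_t -> Theta_t.

    sample_task() returns the free parameters theta for the next episode;
    update() lets the curriculum accumulate whatever training history it needs.
    """

    @abstractmethod
    def sample_task(self):
        ...

    def update(self, task, episode_return):
        """Record feedback from training; static curricula can simply ignore it."""

class DomainRandomization(Curriculum):
    """A static task distribution: sample theta uniformly, ignoring all history."""

    def __init__(self, tasks, seed=0):
        self.tasks = list(tasks)
        self.rng = random.Random(seed)

    def sample_task(self):
        return self.rng.choice(self.tasks)

# Example usage inside a training loop (the environment API here is hypothetical):
# curriculum = DomainRandomization(tasks=range(100))
# theta = curriculum.sample_task()
# env.reset(task=theta)
```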
For instance, most bonus-based exploration methods10,19,21 can be formulated as UED methods that define a UPOMDP (S, A, Ω, P, R, O, γ, Θ) where the fully specified environment is a POMDP (S, A, Ω, P, R̂, O, γ) and θ specifies the bonus reward term, such that R̂ = R + θ. We discuss the possibility of self-guided curricula that do not require the explicit ability to instantiate an environment with specific settings in Chapter 8.

Multi-agent problems can also be specified as POMDPs by considering all other agents to be part of the environment dynamics. This is the intuition behind independent learning and self-play, which often work well in practice64,65,66. The opponent policy becomes a free parameter in the UPOMDP which can be specified to produce a single-agent POMDP. Methods like self-play and prioritized fictitious self-play produce curricula over opponent policies to train agents that are robust against the full space of potential policies1,67. With these algorithms, the UPOMDP's free parameter space Θ becomes the space of opponent policy parameters.

Though this definition of curriculum learning includes many works in the exploration literature, it does not include methods that do not explicitly specify changes to the environment. For instance, the state-distribution shift induced by on-policy learning is not curriculum learning, because the action distribution which generates those states is not a specifiable component of the UPOMDP. Curriculum learning cannot be used in singleton environments, although some singletons, like Atari games, can become underspecified environments when augmented with additional reward terms. As such, reward shaping methods are not curriculum learning, because agents train on only a single environment instance, but dynamic reward bonuses which modify shaping rewards throughout training are.

2.5 Automatic Curriculum Learning

This thesis distinguishes between manual and automatic curriculum learning as methods that use a static or dynamic task distribution, respectively. Manual methods, such as domain randomization or simulated annealing, specify the task distribution that will be used at any timestep independent of any history H. Automatic curriculum methods instead update their task sampling policy online during training using feedback from the training process. They typically use feedback from the environment or agent, such as task success rates or value predictions, but they can also use learned generator networks68, multi-agent competition69,70, or LLM responses to propose new tasks8,71. Prior work72 defines automatic curriculum learning methods as "a family of mechanisms that automatically adapt the distribution of training data by learning to adjust the selection of learning situations to the capabilities of DRL agents." This work uses a more specific definition that only allows for training data distribution changes induced by changes to the environment configuration.

2.6 Open-Endedness

Open-endedness is a murkier concept, with no widely agreed upon definition. The term was originally coined in the context of artificial intelligence by Stanley et al.73 to refer to a process that leads to endlessly evolving complexity, similar to natural evolution on Earth. It has since also been used to describe environments that provide challenges sufficient for an open-ended process to develop. However, it is not clear what minimum environment qualities are sufficient for open-endedness.
In fact, understanding this may be a key step toward developing open-ended systems73.

Hughes et al.74 present a thorough exploration of the definition of open-endedness, synthesizing recent work and providing several examples of open-ended systems. They define open-endedness as follows: "From the perspective of an observer, a system is open-ended if and only if the sequence of artifacts it produces is both novel and learnable"74. Formally, they consider a system open-ended if it is both novel and learnable, where these terms are defined as follows. A system produces artifacts X_T at each timestep T. An observer O produces a statistical model of the artifacts, X̂_t, conditioned on the history up until timestep t. The observer judges the quality of this prediction with a loss function l(X̂_t, X_T), or l(t, T) for short, comparing the statistical model's prediction to the true artifact. They consider a system novel if the loss incurred by the statistical model is higher for future artifacts, ∀T' > T : E[l(t, T')] > E[l(t, T)]. This indicates information gain by the system over time. A system is considered learnable if the loss incurred by the statistical model decreases with more history, ∀t' > t : E[l(t, T)] > E[l(t', T)]. This suggests that the information gained by the system is meaningful or "interesting" to the agent. The full paper includes a thorough discussion of the implications of this definition and its connection to prior attempts to define open-endedness74.

Many systems will only possess these two properties for a finite amount of time, until the observer's statistical model sufficiently generalizes over the environment's task space. As such, Hughes et al.74 also describe environments which are open-ended up to a certain timestep as finitely open-ended. Thus open-endedness is a temporary and subjective property of an environment, in that it depends on the observer and the time at which the system is being observed. Even simple environments may be considered open-ended for a short time. The simulated environments studied in this thesis are finitely open-ended in that they continue to present novel, learnable challenges to agents for much longer than we train those agents, but there is a limit on the complexity which they can produce. When an environment is finitely open-ended for t greater than the total number of training timesteps, these experiments can simulate infinite open-endedness.

The work in this thesis does not depend on any precise definition of open-endedness, and instead simply claims that certain simulated environments, namely NetHack25 and Neural MMO75,76, pose a wide array of diverse challenges for RL agents, which can be used to study multi-task learning as a necessary component of open-ended RL systems. This thesis aims to develop methods that can generalize to many tasks as a step toward truly open-ended systems.

Chapter 3: Visualizing Reward Surfaces

3.1 Introduction

We begin by exploring the behavior of policy gradient methods in single-task environments. Proximal Policy Optimization's failure cases are poorly understood, and it's even less clear how these weaknesses are exacerbated or mitigated by multi-task environments. We develop a novel visualization approach to study the characteristics of PPO on Atari games and MuJoCo physics environments. This allows us to analyze the types of tasks on which PPO underperforms and predict what challenges we will see in a multi-task setting.
Reinforcement learning typically attempts to optimize the expected discounted return of an agent's policy. Policy gradient methods represent the policy with a neural network and learn this policy by approximating the gradient of the reinforcement learning objective over the policy network parameters. This means that the benefits and challenges of deep learning apply to policy gradient methods. However, unlike most deep learning tasks, reinforcement learning is notoriously unstable, and agents often experience sharp drops in reward during training. Studying the reward surface and how RL algorithms interact with it is critical to understanding the successes and failures of deep reinforcement learning. This work appeared at ICML 2022.

Figure 3.1: Reward surface for the Solaris Atari environment.

A "reward surface" is the high-dimensional surface of the reinforcement learning objective (the cumulative expected reward when following a given policy in an environment) over the policy network parameter space. Reward surfaces have been used to study specific questions in limited contexts77 but have never been generated for a wide variety of environments. Loss landscapes have been used in computer vision to understand the effect that residual connections have on computer vision tasks78, and we expect that visualizations of reward surfaces could be similarly valuable for understanding techniques in reinforcement learning. As a first step toward this goal, we produce visualizations of the reward surfaces for 27 of the most popular reinforcement learning environments and identify new patterns in these surfaces for the first time. We see common traits of environments that are generally perceived to be "easy" for reinforcement learning to solve, and visual explanations of failure modes in sparse reward environments.

We additionally conduct a series of novel visualizations of the reward surface, finding evidence of steep "cliffs" in the reward surfaces of numerous environments when plotted in the policy gradient direction. These are directions where the returns are constant or increase for a short distance, then drop sharply. Reinforcement learning notoriously suffers from instability, and these cliffs may explain the sudden performance drops that agents experience during training. Our plots offer conclusive visual evidence that these cliffs exist, as well as methods to study them further. We show that the specific cliffs present in our visualizations impact learning performance in some situations, and by comparing the performance of PPO79 and A2C50 on these cliffs, we provide an explanation for PPO's improved efficacy over previous methods.

Finally, to accelerate future research on reward surfaces and the impact of cliffs in reinforcement learning, we release the code for this paper as a comprehensive, modular, and easy-to-use library for researchers to plot reward surfaces and the other visualizations used in this paper.

3.2 Background and Related Work

3.2.1 Loss Landscapes

Li et al.78 developed filter-normalization, a technique for plotting 3D visualizations of loss landscapes, the surface generated by a loss function over neural network parameters. They demonstrated that the sharpness of loss landscapes plotted using filter-normalized random directions correlates with neural network generalization error. They create their plots by choosing perturbations d of trained parameters that give an informative view of the neural network's local behavior.
However, uniform random perturbations are known to be misleading in neural network analysis, because neural networks with ReLU activations have scale-invariant weights78. To mitigate this problem, they propose filter-normalized random directions. They represent the neural network as a vector θ indexed by layer i and filter (not filter weight) j. Then, they sample a random Gaussian direction d and scale each filter of this random direction to match the magnitude of the neural network parameters in the corresponding filter, by applying the following formula:

d_{i,j} = \frac{d_{i,j}}{\| d_{i,j} \|} \, \| \theta_{i,j} \|.

(Note that this method also works for fully connected layers, which are equivalent to convolutional layers with a 1×1 output feature map.) Li et al.78 used this method to demonstrate that skip connections can limit the increase in non-convexity as network depth increases. Their work was originally applied to image classification networks, but we adapt these techniques to visualize reward surfaces for policy networks.

3.2.2 Reinforcement Learning

Deep reinforcement learning aims to optimize a policy π to maximize the expected return over neural network parameters θ. This objective can be written as J(π_θ) = \mathbb{E}_{\tau \sim \pi_\theta}[R(τ)], where R(τ) = \sum_{t=0}^{n} γ^t r_t. Here τ is a trajectory, r_t is the reward at time step t, γ is the discount factor, and the rewards are summed across an entire episode of n time steps. Nota and Thomas80 showed that the discounted policy gradient is not the gradient over the surface of average discounted rewards, and is in fact not the gradient of any surface. To avoid this issue and make interpretation easier, we plot the undiscounted reward surface, where γ = 1.

3.2.3 Policy Gradient Methods

Policy gradient methods estimate the gradient of the policy network and use gradient ascent to increase the probability of selecting actions that lead to high rewards. The gradient of the objective J is

\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\, A^{\pi_\theta}(s_t, a_t) \right],

where A^{π_θ}(s_t, a_t) is the advantage function for the current policy π_θ.

3.2.4 Reward surfaces

The "reward surface" is the reinforcement learning objective function J(π_θ). Reward surfaces were first visualized by Ilyas et al.77 to characterize problems with policy gradient estimates. The authors plotted a policy gradient estimate against a uniform random direction, showing via visually striking examples that low-sample estimates of the policy gradient rarely guide the policy in a better direction than a random direction. Note that this work did not make use of filter-normalized directions.

Bekci and Gumus81 then used loss landscapes from Li et al.78 to study Soft Actor-Critic agents82 trained on an inventory optimization task. They visualize the impact of policy stochasticity and action smoothing on the curvature of the loss landscapes for 4 MuJoCo environments. Later, Ota et al.83 used loss landscapes from Li et al.78 directly to compare shallow neural networks for reinforcement learning to deeper neural networks, showing that deep neural networks perform poorly because their loss landscape has more complex curvature. They used this visual insight to develop methods that can train deeper networks for reinforcement learning tasks. Their work plots the loss function of SAC agents, which includes an entropy-regularization term82, while we directly plot the reinforcement learning objective.
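For reference, the filter-normalization step from subsection 3.2.1, which we apply to policy network parameters throughout this chapter, can be sketched in a few lines of NumPy. The layer-and-filter array layout below is illustrative, the small epsilon guarding against zero-norm filters is our own addition, and this sketch is not the interface of our released library.

```python
import numpy as np

def filter_normalize(direction, params):
    """Rescale a random direction filter-by-filter to match the parameter norms.

    direction, params : lists of arrays, one per layer, shaped (num_filters, ...)
                        so that row j of layer i is filter (i, j).
    """
    normalized = []
    for d_layer, p_layer in zip(direction, params):
        d_layer = d_layer.copy()
        for j in range(d_layer.shape[0]):
            d_norm = np.linalg.norm(d_layer[j])
            p_norm = np.linalg.norm(p_layer[j])
            # d_ij <- (d_ij / ||d_ij||) * ||theta_ij||
            d_layer[j] = d_layer[j] / (d_norm + 1e-10) * p_norm
        normalized.append(d_layer)
    return normalized

# Example: a random Gaussian direction for a small two-layer network.
params = [np.random.randn(16, 3, 3, 3), np.random.randn(10, 256)]
direction = [np.random.randn(*p.shape) for p in params]
direction = filter_normalize(direction, params)
```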
This paper is different from each of these previous works in the breadth of environments explored, and in that we are the first to use the filter-normalized directions from Li et al.78 to visualize reward surfaces rather than loss landscapes. We additionally introduce novel experiments and results inspired by the findings in our initial reward surface plots.

3.2.5 Proximal Policy Optimization

Proximal Policy Optimization (PPO)79 is a derivative of Trust Region Policy Optimization (TRPO)54 intended to be easier to implement and less sensitive to hyperparameter choices. Over the past 4 years, PPO has become a "default" method for many deep RL practitioners and has been used in numerous high-profile works2,84. TRPO and PPO claim to offer enhanced empirical performance over previous methods by preventing dramatic changes in the policy via trust regions and ratio clipping, respectively. Ratio clipping is conceptually a useful heuristic; however, we are not aware of any work demonstrating why it results in empirically good performance across so many domains.

Instability during training is common in deep reinforcement learning, and agents often experience large drops in reward that can take a long time to recover from. One intuition for the utility of PPO's gradient and ratio clipping that has been informally discussed in the community is that they prevent agents from taking gradient steps that result in this collapse in performance85. More precisely, the intuition is that by preventing large changes to the policy, PPO avoids gradient steps that move the policy into regions of the parameter space with poor rewards and uninformative gradients. To the best of our knowledge, this intuition was first described in a widely circulated Medium article by Hui85, released roughly one year after the original PPO paper. We are aware of no prior work in the academic literature which directly supports this intuition, and the TRPO and PPO papers do not directly allude to it. We discover evidence of these cliffs in plots of the policy gradient direction, and perform experiments to confirm that they negatively impact learning. As a result of its empirically good performance, PPO has become the de facto policy gradient method, so we chose it as the agent in our reward surface experiments.

3.3 Environment Selection

In exploring these reward surfaces, we sought to cover many widely used benchmark environments. We generated plots for all "Classic Control" and "MuJoCo" environments in Gym28 and for many popular Atari environments from the Arcade Learning Environment86. Video games with graphical observations and discrete actions are a common benchmark for the field, and Atari games are the most popular games for this purpose. The MuJoCo environments are high-quality physics simulations with continuous actions and vector observations, used for prominent robotics work. The Classic Control games represent some of the early toy environments in reinforcement learning and continue to be used as a standard set of "easy" environments. We generate surfaces for all Classic Control and MuJoCo environments, but for the sake of brevity we only generate surfaces for a representative selection of 12 Atari environments instead of all 59 Atari environments in Gym. To make sure we explore diverse reward schemes, we specifically picked six sparse reward environments (Montezuma's Revenge, Pitfall!, Solaris, Private Eye, Freeway, Venture), three popular dense reward environments (Bank Heist, Q*Bert, Ms.
Pac-Man), and three popular easy exploration environments (Breakout, Pong, and Space Invaders), according to the standard taxonomy of Bellemare et al.15.

3.4 Initial Explorations of Reward Surfaces

3.4.1 Methodology

In this work, we adapt the techniques from Li et al.78 to visualize the reward surfaces of PPO agents in reinforcement learning environments. As in that paper, we deal with the high-dimensional domain of the reward surface by focusing our analysis around points in the policy space visited during training.

Figure 3.2: Reward surfaces for CartPole-v1, Breakout, Hopper-v2, and Montezuma's Revenge.

Given a training checkpoint with parameters θ, we are interested in understanding the local surface of rewards generated by the policy represented by parameters θ + d for small perturbations d. To visualize this local region in 3 dimensions, we plot the empirical expected episodic return on the z axis against independently sampled filter-normalized random directions on the x and y axes. (Because this is a high-dimensional space, these directions are orthogonal with high probability.) The plots are additionally scaled manually to highlight features of interest, so note the marks on the axes, which indicate those manual scales. The scale, resolution, and number of environment steps used in each environment are listed in Table 3.2.

A reward surface depends on the chosen learning algorithm, network architecture, hyperparameters, and random seed, so for these experiments we chose to plot the reward surfaces of PPO agents using the tuned hyperparameters found in RL Zoo 387. However, reward surfaces for a given environment are extremely visually similar across these variables, which we discuss in subsubsection 3.4.2.2. To understand what challenges RL algorithms face toward the end of training, after sufficient exploration has occurred, we chose the best checkpoint during training, evaluated on 25 episodes, with a preference for later checkpoints when evaluations showed equal rewards. The best checkpoint was typically found during the last 10% of training.

3.4.2 Results

A sample of the visualizations of the reward surfaces can be seen in Figure 3.2. The full set of reward surface plots can be found in subsection 3.10.1. In the following sections we present our key findings and discuss the validity of our reward surface visualizations.

3.4.2.1 Findings in Plots

Figure 3.3: Gradient heat maps for Ant, Hopper, and InvertedDoublePendulum. The heat maps for Hopper and InvertedDoublePendulum show a cliff-like gradient direction which falls off sharply compared to the filter-normalized direction.

One obvious observation is that the size and shape of the maximizers present in the reward surfaces roughly correlate with the perceived difficulty of the environment. Though there is some recent work attempting to quantify the difficulty of RL environments, it is still a challenging and unsolved problem88,89. However, researchers have used these benchmarks for years, and we now have some intuition for identifying environments that are easier for modern methods to solve. The Classic Control environments were designed as simple toy environments, and modern methods easily find strong policies for them. The Atari environments have been taxonomized by Bellemare et al.15 to indicate their relative difficulty. Breakout, Pong, and Space Invaders are listed as "Easy Exploration" games. Among the "Hard Exploration" games, Bank Heist, Ms.
Pac-Man, and Q*bert have dense rewards, while Freeway, Montezuma's Revenge, Pitfall!, Private Eye, Solaris, and Venture have a more challenging sparse reward structure. To our knowledge, the MuJoCo environments have not been officially categorized according to difficulty.

Using these rough categories as a guide, we see that the Classic Control environments shown in Figure 3.5, certainly the easiest set that we study, have large maximizers that often span the entire domain of the plot. In the Atari environments shown in Figure 3.7, we see that the "Human Optimal" (Easy Exploration) and "Dense Reward" (Hard Exploration) environments have large, smooth maximizers relative to the chaotic landscapes seen in the "Sparse Reward" environments. This finding complements that of Li et al.78, who found that the sharpness of loss landscapes using filter-normalized directions correlated with generalization error. This observation also raises the possibility of future work creating a metric based on reward surfaces to gauge the difficulty of RL environments.

We also observe some interesting characteristics of individual environments. In Figure 3.5, MountainCarContinuous has a strangely spiky reward surface for a Classic Control environment. This may be a result of the uniquely low training time required to solve it. Using the hyperparameters from RL Zoo 3, we train this agent for only 20,000 steps, while the next lowest training time in RL Zoo 3 is 200,000 timesteps. Looking at Figure 3.6, Humanoid Standup has a fairly spiky reward surface, suggesting that its reward function is sensitive to small perturbations in the policy. While many maximizers in these plots appear as a single hill or peak in an otherwise flat region, some of the surfaces have uniquely identifiable curvature. Hopper has a large semi-circular curve in its reward surface. CartPole, MountainCar, and InvertedPendulum have large plateaus at their peak reward.

The sparse reward Atari environments in Figure 3.7 are particularly interesting to examine. Note that each point in the sparse Atari reward surfaces was evaluated with either 1 or 2 million environment steps to limit the standard error of each estimate. We see that Freeway's reward surface has large flat regions of varying heights surrounding a single smooth maximizer. Montezuma's Revenge and Venture have short, noisy maximizers. The agents for the sparse environments (except Freeway) perform poorly, so note that even the maximizers in these plots represent weak policies. As a result of this, we see that the reward surfaces for Montezuma's Revenge, Solaris, and Venture show nearby maximizers that the agent was unable to find during training. Private Eye has a large region of zero reward and a distant maximum with much higher rewards. Finally, as its name suggests, the reward surface for Pitfall! is mostly flat and marred by several deep pits.

The idea that sparse reward structures lead to flat regions in a reward surface is intuitive: in environments where rewards are issued solely in infrequent goal states, large yet precise policy improvements may be required to experience variations in reward. Despite the intuitive nature of this observation, we are unaware of it being visually documented in the literature. We also find that in regions where the reward surfaces are not flat, they are often extremely noisy. This is supported by plots of the surface of standard deviations at each point, which we include in subsection 3.10.2.
We see in these plots that the standard deviation is often significantly larger than the average reward in the reward surfaces for Montezuma's Revenge, Private Eye, and Venture, unlike in any of the other environments we study. These plots seem to highlight two different failure modes of sparse environments: either the surface is flat, when the agent experiences no variation in rewards, or the surface is spiky, when rewards are sparse and noisy.

3.4.2.2 Plot Reproducibility and Standard Error

To demonstrate the consistency of these experiments across multiple random seeds, we repeated our reward surface plots 18 times for Acrobot, HalfCheetah, Breakout, and Montezuma's Revenge. For each trial, we trained and evaluated a new agent with a new random seed. We can see from the plots in subsection 3.10.3 that the reward surfaces for a particular environment are extremely visually similar, indicating that training appears to converge to visually similar regions of the reward landscape and that the characteristics of these plots are consistent across multiple seeds. We evaluated for at least 200,000 time steps (1 or 2 million steps for the sparse reward Atari games) at each point to ensure that the standard error of these estimates is small.

3.5 Discovering Cliffs in the Gradient Direction

Figure 3.4: Gradient line search plots for Pendulum, Ant, and Pong. The line plot for Ant shows several checkpoints that exhibit cliff-like properties.

Filter-normalized random directions provide a broad sense of the local reward landscape and are useful for analysis near the end of training, but they do not represent the direction of training, because the sampled random directions are likely orthogonal to the gradient direction. To better understand the optimization characteristics of these environments, we performed similar experiments using the policy gradient direction. We use a high-quality estimate of the policy gradient computed over 1,000,000 environment steps, as in Ilyas et al.77. In many of these plots, we find evidence of "cliffs" in the reward surface. These are regions of the reward surface in which rewards sharply decrease after a short distance.

For cliffs, we select checkpoints that exhibit a decrease in returns of at least 50% in the first, high-resolution section of the line search (that is, within a normalized gradient magnitude of 0.04). By selecting points closer to the initial, unperturbed weights, we choose cliffs that the optimization method is likely to reach. Although we do our best to evaluate the gradient line searches with enough environment steps to reduce their standard error, it is not computationally feasible to completely account for the high variance of returns in sparse reward environments. To mitigate this, the decrease in returns that identifies a cliff must also be at least 25% of the global reward range across all checkpoints in the line plot. This ensures that we select cliffs that are significant relative to the environment's overall reward scale and are unlikely to be caused by evaluation noise. We select non-cliff checkpoints arbitrarily from the remaining checkpoints that do not meet these criteria.

One difficulty of plotting the gradient direction is that gradient magnitudes vary drastically across environments and at different points during training90. Additionally, any maximum in a reward surface can be made to look like a sharp cliff by using a large enough gradient scale, or like a large plateau by using a smaller gradient scale.
To provide a fair comparison of the gradient direction's sharpness, we normalize the gradient direction by dividing each component by the direction's L2 norm.

3.5.1 Gradient Directions vs. Filter-Normalized Random Directions

We show the differences between filter-normalized random directions and the gradient direction by creating plots of the reward surface using the gradient direction on the x axis and a random filter-normalized direction on the y axis. Due to the frequent sharp changes in reward that we observe in these plots, 3D surface plots can become partially obscured by peaks, so we choose to plot these surfaces as heat maps instead. A sample of the heat maps can be seen in Figure 3.3, and the full set of heat maps for Classic Control and MuJoCo environments can be found in subsection 3.10.4.

3.5.1.1 Preliminary Observations

The most striking observation is that rewards in the gradient direction often change much more ra