ABSTRACT Title of Dissertation: MACHINE LEARNING WITH DIFFERENTIABLE PHYSICS PRIORS Yiling Qiao Doctor of Philosophy, 2024 Dissertation Directed by: Professor Ming C. Lin Department of Computer Science Differentiable physics priors enable gradient-based learning systems to adhere to physical dynamics. By making physics simulations differentiable, we can backpropagate through the physical consequences of actions. This pipeline allows agents to quickly learn to achieve desired effects in the physical world and is an effective technique for solving inverse problems in physical or dynamical systems. This new programming paradigm bridges model-based and data-driven methods, mitigating data scarcity and model bias simultaneously. My research focuses on developing scalable, powerful, and efficient differentiable physics simulators. We have created state-of-the-art differentiable physics for rigid bodies, cloth, fluids, articulated bodies, and deformable solids, achieving performance orders of magnitude better than existing alternatives. These differentiable simulators are applied to solve inverse problems, train control policies, and enhance reinforcement learning algorithms. MACHINE LEARNING WITH DIFFERENTIABLE PHYSICS PRIORS by Yiling Qiao Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2024 Advisory Committee: Professor Ming C. Lin, Chair/Advisor Professor Maria Cameron, Dean’s Representative Professor Jia-Bin Huang Professor Tianyi Zhou Professor Dinesh Manocha © Copyright by Yiling Qiao 2024 Acknowledgments First and foremost, I would like to express my deepest gratitude to my advisor, Dr. Ming Lin. Your guidance, support, and encouragement have been invaluable throughout my doctoral journey. Your expertise and insights have profoundly shaped this dissertation, and I am incredibly fortunate to have had the opportunity to work under your mentorship. Your stories have inspired me greatly, and your suggestions about my future career will always play an important role in my life. I am also profoundly grateful to the members of my dissertation committee, Dr. Maria Cameron, Dinesh Manocha, Jia-Bin Huang, and Tianyi Zhou, for their insightful feedback, constructive criticism, and unwavering support. Your contributions have significantly enhanced the quality of this work. I would like to extend my heartfelt thanks to Dr. Vladlen Koltun, Aniket Bera, Matthias Zwicker, Tom Goldstein, Daniele Panozzo, Hao Su, Bo Zhu, Chuang Gan, Animesh Garg, Xiaodi Wu, Lingjie Liu, Breannan Smith, Takaaki Shiratori, Ken Museth, Miles Macklin, Chris Atkeson, Lin Gao, Xilin Chen, and others for your kind guidance and help. I feel so fortunate to have had the opportunity to read your great research works first and then to have been helped and/or mentored by you later. To my friends and labmates in GAMMA, Aaron, Adarsh, Alexander, Bhrij, Divya, Geonsun, Jaehoon, Jing, Junbang, Justin, Laura, Mohamed, Niall, Peng, Pooja, Qingyang, Rohan, Ruicheng, ii Ruiqi, Sanghyun, Senthil, Shrey, Shrellekha, Tianrui, Trisha, Utsav, Uttaran, Vishnu, Will, Xijun, Xiyang, Yonghan, Yu, Zhenyu, Jason, Tetsuya, Xian, Zhenjia, Johnson, Pingchuan, Xuan, Peter, Krishna, and others, thank you for your companionship, encouragement, and the countless discussions that have helped shape my thinking. At the University of Maryland, I really spent five happy years. 
Thank you Tom, Jodie, Sharron, Migo, Shanshan, and all the other staff in ISSS, the Graduate School, the CS department, the Eppley Center, and elsewhere. I also sincerely appreciate the support from Intel, Nvidia, and Meta for my research. A special thanks to the Meta Fellowship. Thank you for recognizing our research. That really means a lot to me and gives me the confidence for more ambitious ventures in the future. A special note of appreciation goes to my family: my wife, mom, dad, grandma, uncle, and aunts. Thank you for your love, patience, and belief in me. To my housemates, Jiaqi, Yuxiang, Hsien-Yu, and Suying, thanks for all the good times we had together. All the best to your future journey as well. Thank you all for your contributions, support, and encouragement. This dissertation would not have been possible without you. iii Table of Contents Acknowledgements ii Table of Contents iv List of Tables viii List of Figures ix I Introduction 1 Chapter 1: Overview 2 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2.1 Differentiable Programming . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2.2 Simulation and Animation Techniques . . . . . . . . . . . . . . . . . . . 6 1.2.3 Differentiable 3D Reconstruction . . . . . . . . . . . . . . . . . . . . . . 7 1.3 Outline of Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 II Differentiable Simulation 12 Chapter 2: Scalable Differentiable Physics for Learning and Control 13 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.3 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.4 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.5 Scalable Collision Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.6 Fast Differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.7 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.7.1 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.7.2 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.7.3 Two-way coupling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.7.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 Chapter 3: Efficient Differentiable Simulation of Articulated Bodies 38 iv 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.3 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.4 Efficient Differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.4.1 Adjoint Method for Articulated Dynamics . . . . . . . . . . . . . . . . . 45 3.4.2 Checkpointing for Articulated Dynamics . . . . . . . . . . . . . . . . . . 48 3.5 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.6.1 Comparison with Autodiff Tools . . . . . . . 
. . . . . . . . . . . . . . . 53 3.6.2 Integration with Reinforcement Learning . . . . . . . . . . . . . . . . . 56 3.6.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 Chapter 4: Differentiable Simulation of Soft Multi-body Systems 64 4.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 4.2 Soft Body Simulation Using Projective Dynamics . . . . . . . . . . . . . . . . . 68 4.3 Articulated Skeletons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.3.1 Rigid Body System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.3.2 Top-down Matrix Assembly for Articulated Body Systems . . . . . . . . 72 4.3.3 Articulated Joint Actuation . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.4 Contact Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 4.5.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 4.5.2 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 4.5.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 Chapter 5: Other Differentiable Dynamical Systems 85 5.1 Traffic Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 5.2 Fluids Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 5.3 Analog Quantum Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 III Reality-based Simulation 95 Chapter 6: Editable Neural Geometry and Physics from Monocular Videos 96 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 6.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 6.3.1 Rendering Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 6.3.2 Simulation Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 6.3.3 Training and Losses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 6.4.1 Dynamic Geometry Reconstruction . . . . . . . . . . . . . . . . . . . . 109 6.4.2 Novel View Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 v 6.4.3 Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 6.4.4 Editing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 6.4.5 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 Chapter 7: Dynamic Mesh-Aware Radiance Fields 118 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 7.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 7.2.1 Neural Fields and Surface Representations . . . . . . . . . . . . . . . . . 120 7.2.2 Scene Editing and Synthesis . . . . . . . . . . . . . . . . . . . . . . . . 122 7.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 7.3.1 Rendering . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . 123 7.3.2 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 7.3.3 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 7.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 7.4.1 Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 7.4.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 7.4.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 7.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 Chapter 8: Physically Consistent Perception of Hand-Object Interactions with Differentiable Priors 141 8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 8.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 8.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 8.3.1 Problem Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 8.3.2 Differentiable Rendering Priors . . . . . . . . . . . . . . . . . . . . . . . 147 8.3.3 Differentiable Physics Priors . . . . . . . . . . . . . . . . . . . . . . . . 148 8.3.4 Optimization-based Refinement . . . . . . . . . . . . . . . . . . . . . . 149 8.3.5 Filtering-based Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . 150 8.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 8.4.1 Real-world Robotic Hand and Object Iteration . . . . . . . . . . . . . . . 153 8.4.2 Pose Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 8.4.3 Filtering-based tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 8.4.4 Contact Refinement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 8.4.5 Generalization to Human . . . . . . . . . . . . . . . . . . . . . . . . . . 159 8.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 Chapter 9: Other Visual-Dynamical Systems 162 9.1 Navigation and Collision Avoidance Using Commodity Sensors . . . . . . . . . 162 9.2 Physics Augmented Continuum Neural Radiance Fields . . . . . . . . . . . . . . 164 vi IV Conclusion 167 Chapter 10: Conclusion 168 10.1 Summary of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168 10.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170 V Appendix 173 Appendix A: Scalable Differentiable Physics for Learning and Control 174 Appendix B: Efficient Differentiable Simulation of Articulated Bodies 180 Appendix C: Differentiable Simulation of Soft Multi-body Systems 201 Appendix D: Editable Neural Geometry and Physics from Monocular Videos 209 Appendix E: Dynamic Mesh-Aware Radiance Fields 213 Appendix F: Physically Consistent Perception of Hand-Object Interactions with Differentiable Priors 223 vii List of Tables 2.1 Performance of collision handling . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.2 Performance of differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.1 Memory usage of different simulation pipelines . . . . . . . . . . . . . . . . . . 52 3.2 imulation time for forward pass . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.3 Simulation time for the backward pass . . . . . . . . . . . . . . . . . . . . . . . 55 4.1 Memory footprint (GB). . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . 78 5.1 Machine learning in different dynamical systems. . . . . . . . . . . . . . . . . . 91 6.1 Quantitative evaluation of video reconstruction . . . . . . . . . . . . . . . . . . 109 6.2 Quantitative evaluation of novel view synthesis . . . . . . . . . . . . . . . . . . 110 8.1 Ablation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150 8.2 Evaluation of hand pose estimation . . . . . . . . . . . . . . . . . . . . . . . . . 154 8.3 Quantitative evaluation of object pose estimation . . . . . . . . . . . . . . . . . 155 8.4 Error after contact-based pose refinement . . . . . . . . . . . . . . . . . . . . . 159 8.5 Quantitative evaluation on human object pose estimation . . . . . . . . . . . . . 159 B.1 Maximum reward on n-link pendulum . . . . . . . . . . . . . . . . . . . . . . . 182 B.2 Model transplant results on MuJoCo Ant . . . . . . . . . . . . . . . . . . . . . . 185 F.1 Error after contact-based pose refinement . . . . . . . . . . . . . . . . . . . . . 225 viii List of Figures 1.1 An overview of my research. Upper boxes are the closed-loop pipeline, and the lower boxes are my works. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.1 Differentiable simulation embedded in a neural network . . . . . . . . . . . . . . 17 2.2 Collision visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.3 Scalability comparison with grid-based methods . . . . . . . . . . . . . . . . . . 25 2.4 Ablation studies for collision handling . . . . . . . . . . . . . . . . . . . . . . . 28 2.5 Two-way coupling of rigid bodies and cloth . . . . . . . . . . . . . . . . . . . . 31 2.6 Comparison with MuJoCo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.7 Comparison on inverse problems . . . . . . . . . . . . . . . . . . . . . . . . . . 33 2.8 Learning control policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.9 Parameter estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 2.10 Interoperability across simulators . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.1 The workflow of a simulation step . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.2 Differentiation of articulated body dynamics . . . . . . . . . . . . . . . . . . . . 49 3.3 Memory and speed comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.4 The n-link pendulum task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.5 The MuJoCo Ant task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 3.6 Motion control task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.7 Parameter estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 4.1 Ablation study of skeleton realization . . . . . . . . . . . . . . . . . . . . . . . 78 4.2 Ablation study of collision handling scheme . . . . . . . . . . . . . . . . . . . . 80 4.3 System identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 4.4 Motion control experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 5.1 Traffic simulation in an urban environment . . . . . . . . . . . . . . . . . . . . . 86 6.1 Overall workflow of NeuPhysics . . . . . . . . . . . . . . . . . . . . . . . . . . 101 6.2 The editing pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 6.3 Computational Graph During Training . . . . . . . . . . . . . . . . . . . . . . . 
107 6.4 Surface reconstruction from dynamic videos . . . . . . . . . . . . . . . . . . . . 109 6.5 Qualitative evaluation of geometry reconstruction and novel view synthesis . . . 110 6.6 Parameter estimation and editing by differentiable physics . . . . . . . . . . . . 110 ix 6.7 Geometry and appearance editing in videos . . . . . . . . . . . . . . . . . . . . 112 6.8 Joint training can be time-consuming and converges slowly. . . . . . . . . . . . 113 6.9 Region of Interest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 6.10 Compare with optimizing Chamfer Distance . . . . . . . . . . . . . . . . . . . . 117 7.1 Light transport on the surface (left) and in the medium (right). . . . . . . . . . 119 7.2 Pipeline overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 7.3 Shadow Casting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 7.4 HDR Volumetric Radiance Map . . . . . . . . . . . . . . . . . . . . . . . . . . 128 7.5 Rendering comparison for virtual object insertion . . . . . . . . . . . . . . . . . 132 7.6 Qualitative results for simulation . . . . . . . . . . . . . . . . . . . . . . . . . . 134 7.7 Real-time photorealistic gaming on a laptop . . . . . . . . . . . . . . . . . . . . 136 7.8 Application of mixing NeRF with meshes . . . . . . . . . . . . . . . . . . . . . 139 8.1 Application in multiple tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 8.2 An overview of our optimization and filtering pipelines . . . . . . . . . . . . . . 144 8.3 Diagram of the EKF filtering process . . . . . . . . . . . . . . . . . . . . . . . . 150 8.4 Real-world pose estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 8.5 Pose Estimation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154 8.6 Filtering results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 8.7 Qualitative results of contact-based pose refinement . . . . . . . . . . . . . . . . 158 8.8 Human Object Pose Refinement . . . . . . . . . . . . . . . . . . . . . . . . . . 160 A.1 Qualitative comparison on complex models and interactions . . . . . . . . . . . . 178 B.1 Ablation study for the checkpointing scheme. . . . . . . . . . . . . . . . . . . . 183 C.1 Ablation study of embedded skeletons . . . . . . . . . . . . . . . . . . . . . . . 207 C.2 Ablation study of collision handling scheme . . . . . . . . . . . . . . . . . . . . 207 D.1 Qualitative evaluation of video reconstruction . . . . . . . . . . . . . . . . . . . 211 E.1 Optimizing Emission UV Texture Map . . . . . . . . . . . . . . . . . . . . . . . 217 E.2 Inverse Camera Response Function (CRF) . . . . . . . . . . . . . . . . . . . . . 219 E.3 Scaling the number of the inserted objects and image resolution . . . . . . . . . . 220 E.4 Run time comparison with NeuPhysics . . . . . . . . . . . . . . . . . . . . . . . 220 E.5 Simulation Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221 E.6 Number of Rank-1s each method receives. . . . . . . . . . . . . . . . . . . . . . 221 E.7 Comparison between Luma AI (left) and Ours (right). . . . . . . . . . . . . . . . 222 x Part I Introduction 1 Chapter 1: Overview 1.1 Motivation Physically-based simulation has become an essential component in animation, visual effects, and broader computational problems, such as aircraft design and digital fabrication. 
In the recent wave of machine learning, simulation environments for games, robots, and autonomous cars have played a crucial role as infrastructure, fostering breakthroughs in learning algorithms. With stronger artificial general intelligence approaching, I believe that simulation will be even more important. We need physics-informed AI; any intelligent agent needs to be safely trained, aligned, and verified in a simulated sandbox before it can ultimately be deployed and benefit our human society. Moreover, despite the presence of numerous simulation models and real-world data, people are working to address the sim-to-real (generalization and robustness problems) and real-to-sim (digital twin) gaps. Simply enlarging the dataset size or pursuing infinitely accurate simulation would be too expensive. Instead, I aim to leverage the merits of both data-driven and model-based approaches by designing novel physics-informed learning algorithms. In recent years, I have undertaken many pioneering works in differentiable simulation toward this goal. My research goal is to develop safe, realistic, and efficient simulations for graphics, robotics, embodied AI, digital twin, and inverse design. The core of my work can be distilled into a three-part closed-loop pipeline, as illustrated 2 in Figure 1. (1) Formulate the governing dynamics: I will initially define the underlying rules of the system and tailor the simulation to be efficient and realistic for machine learning algorithms. (2) Generate and understand observations: Subsequently, coupling the simulation output with real-world observations, the existing data can be used to refine our model-based approach by automatically finetuning the simulation parameters. (3) Interact and control the system: With prior knowledge of the model and insights gained from real data, our algorithms can make more informed decisions to interact with the system. Throughout this closed-loop pipeline, my collaborators and I have developed several software libraries and made significant contributions published in internationally renowned venues, as ICML, NeurIPS, ICLR, Siggraph Asia, ICRA, and ICCV. Those works have garnered significant interest from researchers across the Machine Learning, Computer Graphics, Robotics, and Computer Vision research communities. Figure 1.1: An overview of my research. Upper boxes are the closed-loop pipeline, and the lower boxes are my works. 1.2 Related work 1.2.1 Differentiable Programming Differentiable physics provides gradients for learning, control, and inverse problems that involve physical systems. [1] advocate using differentiable physics to solve control problems in robotics. [2] obtain gradients of 2D rigid body dynamics. [3] use automatic differentiation 3 tools to obtain gradients for cloth simulation. [4] develop a more comprehensive differentiable physics engine for rigid bodies and cloth based on mesh representations. [5] present differentiable simulation of protein dynamics. For volumetric data, ChainQueen [6] computes gradients of the MPM simulation pipeline. [7] use FEM to model soft robots and perform trajectory optimization with analytic gradients. [8] differentiate the simulation of fluids with solid coupling. Many exciting applications of differentiable physics have been explored [9, 10, 11, 12]. [13] propose a soft-body manipulation benchmark for differentiable physics. [14] manipulate tools with the help of differentiable physics. 
[15] perform system identification by learning from the trajectory, and [16] then use the estimated frictional properties to help robotic pushing. Differentiable simulation can also be used to train control policies by optimizing model predictive control’s objective function [17, 18]. A related line of work concerns the development of powerful automatic differentiation tools that can be used to differentiate simulation. DiffTaichi [19] provides a new programming language and a compiler, enabling the high-performance Taichi simulator to compute the gradients of the simulation. JAX MD [20] makes use of the JAX autodiff library [21] to differentiate molecular dynamics simulation. These works have specifically made use of vectorization on both CPU and GPU to achieve high performance on grids and particle sets, where the simulation is intrinsically parallel. In contrast, the simulation of articulated bodies is far more serial, as dynamics propagates along kinematic paths rather than acting in parallel on grid cells or particles. TinyDiffSim [22] provides a templatized simulation framework that can leverage existing autodiff tools such as CppAD [23], Ceres [24], and PyTorch [25] to differentiate simulation. However, these methods introduce significant overhead in tracing the computation graph and accumulate substantial computational and memory costs when applied to articulated bodies. 4 Our method does not need to store the entire computation graph to compute the gradients. We store the initial states of each simulation step and reproduce the intermediate results when needed by the backward pass. This checkpointing strategy is also used in training neural ODEs [26] and large neural networks [27, 28, 29, 30]. We are the first to adapt this technique to articulated dynamics, achieving dramatic reductions in memory consumption and enabling stable simulation of long experiences. Adjoint method. The adjoint method has been applied to fluid control [31], PDEs [32], light transport simulation [33], and neural ODEs [26, 34]. Recently, [35] proposed to use the adjoint method in multi-body dynamics. However, they operate in maximal coordinates and model body attachments using springs. This does not enforce physical validity of articulated bodies. In contrast, we operate in reduced coordinates and derive the adjoints for the articulated body algorithm and spatial algebra operators that properly model the body’s joints. This supports physically correct simulation with joint torques, limits, Coriolis forces, and proper transmission of internal forces between links. Neural approximation. A number of works approximate physics simulation using neural networks [36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46]. Physical principles have also been incorporated in the design of neural networks [47, 48, 49, 50, 51]. Approximate simulation by neural networks is naturally differentiable, but the networks are not constrained to abide by the underlying physical dynamics and simulation fidelity may degenerate outside the training distribution. 5 1.2.2 Simulation and Animation Techniques Deformable body simulation using Finite Element Method (FEM) plays an important role in many scientific and engineering problems [52, 53, 54]. Previous works model soft bodies using different representations and methods for specific tasks. There are several kinds of approaches for modeling body actuation. Pneumatic-based methods [55] change the rest shape to produce reaction forces. 
Rigid bones attached within soft materials are also used to control the motion of deformable bodies [56, 57, 58, 59]. To further simulate biologically realistic motion, it is common to apply joint torques in articulated skeletons [60]. For example, [61, 62] use articulated body dynamics to govern the motion while handling collisions using soft contact. Inspired by animals, different designs of muscle-like actuators for soft-body simulations were also proposed [63, 64, 65]. Regarding contact modeling, spring-based penalty forces are widely used [66, 67, 68] for their simplicity. More advanced algorithms include inelastic projection [69, 70] and barrier-based repulsion [71, 72]. However, these methods do not always conform to Coulomb's frictional law. We opt for a more realistic dry frictional model [73] to better handle collisions. Projective Dynamics [74] is widely used for its robustness and efficiency in implicit time integration. It has been extended to model muscles [75], rigid skeletons [76], realistic materials [77], and accurate contact forces [78]. Our method also adopts this framework for faster and more stable time integration. In contrast to the aforementioned methods using Projective Dynamics, our algorithm is the first to enable joint actuation in articulated skeletons together with a generalized dry frictional contact for soft body dynamics.

1.2.3 Differentiable 3D Reconstruction

Differentiable volume rendering with implicit neural representations has recently been used to synthesize novel views and to estimate geometry and appearance. Neural Radiance Fields (NeRF) [79] and its follow-on works [80, 81, 82, 83] have achieved photorealistic rendering results even on complex scenes. The expressive coordinate-based multi-layer perceptron (MLP) representation equips the scene with high resolution [84] while remaining regularized [85] and memory-efficient. Compared with differentiable surface rendering [86, 87], volume rendering can backpropagate gradients to the entire space, which better handles complex geometry optimization with topological changes and mitigates local minima. The original density field from NeRF cannot directly reconstruct a high-quality surface mesh, so SDF-based neural fields [88, 89, 90] have been proposed for multi-view geometry reconstruction. Compared to other discrete representations such as meshes [91, 92, 93], point clouds [94, 95], and voxels [96, 97], the SDF field offers higher resolution, and the NeRF-based method can take raw RGB images without pixel-level segmentation [98, 99, 100]. However, current SDF-based neural fields cannot handle dynamic scenes. Recent work has extended NeRF to videos of dynamic scenes [101, 102, 103], but these methods mainly focus on 2D image synthesis, and few of them reconstruct the 3D geometry well. They either learn a canonical field with time-dependent motion [104, 105, 106, 107] or condition all the fields on the time axis [108, 109, 110]. We choose the former strategy because decoupling motion from the geometry makes the formulation more constrained and also provides us with point-wise correspondence (a minimal sketch of this decomposition appears at the end of this section). All the aforementioned dynamic extensions of NeRF focus on 2D video reconstruction and lack regularization on the 3D geometry. Our method reconstructs the geometry much better than existing works. Moreover, since single-video 3D reconstruction is highly ill-posed, our geometry prior can also help improve the image reconstruction quality.
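To make the canonical-field strategy above concrete, the sketch below queries a time-invariant SDF through a time-conditioned deformation field, so that geometry is decoupled from motion. The module names, layer sizes, and the additive-offset parameterization are illustrative assumptions and not the exact architecture used in Chapter 6.

```python
import torch
import torch.nn as nn

class DeformationField(nn.Module):
    """Maps a point observed at time t back to the canonical (reference) frame."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))

    def forward(self, x, t):
        # Predict a per-point offset; the canonical point is x plus that offset.
        return x + self.net(torch.cat([x, t], dim=-1))

class CanonicalSDF(nn.Module):
    """Time-invariant signed distance field defined in the canonical frame."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, x_canonical):
        return self.net(x_canonical)

deform, sdf = DeformationField(), CanonicalSDF()
x = torch.rand(1024, 3)            # query points at one video frame
t = torch.full((1024, 1), 0.5)     # normalized time stamp of that frame
sdf_values = sdf(deform(x, t))     # (1024, 1) signed distances in the canonical frame
```

Because the SDF is only ever queried in the canonical frame, the same geometry is shared across all frames, and the deformation field supplies the point-wise correspondence mentioned above.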
1.3 Outline of Dissertation The subsequent chapters in this dissertation are organized as follows. In Part II, we will demonstrate how to develop differentiable simulation for several dynamical systems. Chapter 2 is dedicated to the two-way coupling of rigid body and cloth. We develop a scalable framework for differentiable physics that can support a large number of objects and their interactions. To accommodate objects with arbitrary geometry and topology, we adopt meshes as our representation and leverage the sparsity of contacts for scalable differentiable collision handling. Collisions are resolved in localized regions to minimize the number of optimization variables even when the number of simulated objects is high. We further accelerate implicit differentiation of optimization with nonlinear constraints. Experiments demonstrate that the presented framework requires up to two orders of magnitude less memory and computation in comparison to recent particle-based methods. We further validate the approach on inverse problems and control scenarios, where it outperforms derivative-free and model-free baselines by at least an order of magnitude. Chapter 3 utilizes the adjoint method and checkpoint method for high-performance articulated body simulation. We present a method for efficient differentiable simulation of articulated bodies. This enables integration of articulated body dynamics into deep learning frameworks, and gradient- based optimization of neural networks that operate on articulated bodies. We derive the gradients of the forward dynamics using spatial algebra and the adjoint method. Our approach is an order 8 of magnitude faster than autodiff tools. By only saving the initial states throughout the simulation process, our method reduces memory requirements by two orders of magnitude. We demonstrate the utility of efficient differentiable dynamics for articulated bodies in a variety of applications. We show that reinforcement learning with articulated systems can be accelerated using gradients provided by our method. In applications to control and inverse problems, gradient-based optimization enabled by our work accelerates convergence by more than an order of magnitude. Chapter 4 provides a comprehensive simulation platform for articulated soft bodies. We present a method for differentiable simulation of soft articulated bodies. Our work enables the integration of differentiable physical dynamics into gradient-based pipelines. We develop a top-down matrix assembly algorithm within Projective Dynamics and derive a generalized dry friction model for soft continuum using a new matrix splitting strategy. We derive a differentiable control framework for soft articulated bodies driven by muscles, joint torques, or pneumatic tubes. The experiments demonstrate that our designs make soft body simulation more stable and realistic compared to other frameworks. Our method accelerates the solution of system identification problems by more than an order of magnitude, and enables efficient gradient-based learning of motion control with soft robots. Chapter 5 presents other differentiable systems that I have been involved with, including traffic simulation, fluid simulation, and quantum computing. In Part III, we will go further and study how to fuse simulation and real-world observation. Chapter 6 proposes to combine NeRF and differentiable simulation to reconstruct and simulate video data. 
We present a method for learning 3D geometry and physics parameters of a dynamic scene from only a monocular RGB video input. To decouple the learning of underlying scene geometry from dynamic motion, we represent the scene as a time-invariant 9 signed distance function (SDF) which serves as a reference frame, along with a time-conditioned deformation field. We further bridge this neural geometry representation with a differentiable physics simulator by designing a two-way conversion between the neural field and its corresponding hexahedral mesh, enabling us to estimate physics parameters from the source video by minimizing a cycle consistency loss. Our method also allows a user to interactively edit 3D objects from the source video by modifying the recovered hexahedral mesh, and propagating the operation back to the neural field representation. Experiments show that our method achieves superior mesh and video reconstruction of dynamic scenes compared to competing Neural Field approaches, and we provide extensive examples which demonstrate its ability to extract useful 3D representations from videos captured with consumer-grade cameras. Chapter 7 designs a two-way coupling between mesh and NeRF during the rendering and simulation. We first review the light transport equations for both mesh and NeRF, then distill them into a straightforward algorithm for updating radiance and throughput along a cast ray with an arbitrary number of bounces. To address the discrepancy between the linear color space that the path tracer assumes, versus the sRGB color space that standard NeRF uses, we train NeRF with High Dynamic Range images to pull the radiance field back to the linear space. Finally, we consider how the hybrid surface-volumetric formulation can be efficiently integrated with a high-performance physics simulator that supports rigid and soft bodies, and cloth. The full rendering and simulation system can be run on a GPU at interactive rates. We show that a hybrid system approach outperforms alternatives in achieving visually realistic mesh insertion into a NeRF scene, because it allows realistic light transport from volumetric NeRF media onto surfaces, which affects the appearance of reflective/refractive surfaces and illumination of diffuse surfaces informed by the scene. 10 Chapter 8 introduces a method to utilize differentiable priors in hand-related tasks. Our approach uses rendering priors to align with input images and segmentation masks, while physics priors mitigate penetration and relative-sliding across frames. Furthermore, we present two alternatives for hand and object pose estimation. The optimization-based pose estimation achieves high accuracy, while the filtering-based tracking, which utilizes the differentiable priors as dynamics and observation models, runs faster. We show that our method achieves comparable or better results in the pose estimation task, and the differentiable physics module enables us to predict contact information for pose refinement. We also show that our approach generalizes to perception tasks including robotic hand manipulation and human-object pose estimation in the wild. Chapter 9 includes two additional works on simulation from real-world observations. Chapter 10 concludes by summarizing the key contributions and suggest possible future directions in extending my dissertation research. 
Part II Differentiable Simulation

Chapter 2: Scalable Differentiable Physics for Learning and Control

2.1 Introduction

Differentiable physics enables gradient-based learning systems to strictly adhere to physical dynamics. By making physics simulation differentiable, we can backpropagate through the physical consequences of actions. This enables agents to quickly learn to achieve desired effects in the physical world. It is also a principled and effective approach to inverse problems that involve physical systems [1, 2, 14, 39]. Recent efforts have significantly advanced the understanding of differentiable physics in machine learning and robotics. Automatic differentiation and analytical methods have been applied to derive a variety of differentiable simulation engines [3, 5, 19, 32]. Yet existing differentiable frameworks are still limited in their ability to simulate complex scenes composed of many detailed interacting objects. For example, modeling assumptions have limited some existing frameworks to restricted object classes, such as balls or two-dimensional polygons [1, 2]. Other approaches adopt expressive grid- or particle-based formulations, which are general in principle but have not scaled to detailed simulation of large scenes due to the cubic growth rates of volumetric grids and particles [6, 38, 111]. For example, a grid of size 1024^3 has on the order of a billion elements, and this resolution is still quite coarse in comparison to the scale and detail of physical environments we encounter in daily life, such as city streets. Particle-based representations likewise suffer from explosive growth rates or limited resolution: Macklin et al. [112] suggest a maximum size ratio of 1:10 between the smallest and largest feature, or placing particles only around boundaries, which runs the risk of object tunneling. In this chapter, we develop a differentiable physics framework that addresses these limitations. We adopt meshes as a general representation of objects. Meshes are the most widely used specification for object geometry in computer graphics and scientific computing. Meshes are inherently sparse, can model objects of any shape, and can compactly specify environments with both large spatial extent and highly detailed features [113]. The use of meshes brings up the challenge of collision handling, since collisions can occur anywhere on the surface of the mesh. Prior frameworks adopted global LCP-based formulations, which severely limited scalability [1, 2]. In contrast, we leverage the structure of contacts by grouping them into localized impact zones, which can be efficiently represented and processed [114, 115]. This substantially reduces the number of variables and constraints involved in specifying dynamical scenes and dramatically speeds up backpropagation. Following prior work, we use implicit differentiation to compute gradients induced by an optimization problem embedded in the simulation [3, 116]. One of our contributions is an acceleration scheme that can handle the nonlinear constraints encountered in our system. We evaluate the scalability and generality of the presented approach experimentally. Controlled experiments that vary the number of objects in the scene and their relative scales demonstrate that our approach is dramatically more efficient than state-of-the-art differentiable physics frameworks.
Since meshes can represent both rigid and deformable objects, our work is also the first differentiable physics framework that can simulate two-way coupling of rigid bodies and cloth. We further demonstrate example applications to learning and control scenarios, where the presented framework outperforms derivative-free and model-free baselines by at least an order of magnitude. Code is available on our project page: https://gamma.umd.edu/researchdirections/mlphysics/diffsim

2.2 Related Work

Backpropagating through physical dynamics is an enticing prospect for embedding physical reasoning into learning processes. Degrave et al. [1] articulated the potential of this approach and applied automatic differentiation to rigid body systems where contacts are assumed to only happen between balls and planes. Although this assumption simplifies collision detection and response, it greatly limits the scope of application. de Avila Belbute-Peres et al. [2] introduced implicit differentiation for gradient computation, obviating the need to explicitly differentiate through all steps of the forward optimization solver. By designing specific rules for collision response, their framework supports simple two-dimensional shapes such as circles and polygons, but was not extended to general 3D objects. Toussaint et al. [14] utilized differentiable primitives to develop a robot reasoning system that can achieve user-defined tasks. It is designed for path planning with sphere-swept shapes and frictionless interactions, but not for general control tasks. Building on the implicit differentiation methodology, Liang et al. [3] developed a differentiable simulator for cloth. Their formulation assumes that each node can move independently and is only tied to others by soft constraints. This is true for cloth but not for rigid bodies. Carpentier and Mansard [117] developed algorithms for efficient computation of analytical derivatives of rigid-body dynamics, but did not handle collision. Millard et al. [118] implemented an articulated body simulator in a reverse-mode automatic differentiation framework, but likewise did not handle dynamic contact and collision response, which are at the heart of our work. Hu et al. [6] developed a differentiable simulator for deformable objects based on the material point method (MPM). A follow-up work applies this framework to controller design for soft robots [9]. This approach is effective in handling soft bodies, but has limited ability to enforce rigid structure. Since the method is based on particles and grids, it has limited scalability and has not been applied to large scenes with interacting objects. In contrast, our method uses meshes to represent objects, naturally enforces precise detailed geometry, and scales to scenes with both large spatial extent and fine local features. Hu et al. [19] introduce a domain-specific language for building differentiable physical simulators. This work develops a programming language and compiler that can be broadly useful throughout the differentiable physics community, including to projects such as ours. However, at present this framework does not deal with general mesh-based rigid and deformable objects. Their rigid-body scenario uses rectangles constrained by springs, with contact responses only possible between rectangles and the ground plane.
Their three-dimensional scenario uses soft bodies modeled by particles and grids and inherits the aforementioned limitations of MPM-based modeling. A line of work developed particle-based frameworks for differentiable simulation of fluids [39, 44] and unified treatment of fluids and solids [38, 111]. These frameworks resort to approximations in handling rigid bodies and detailed geometry, and suffer from high growth rates in the scale of objects and scenes. The generality of particle-based approaches is appealing, but their scalability limitations are well-known and are explicitly discussed in the literature [112]. We take a different tack and model objects as meshes, thus harnessing the scalability and generality of this representation. Some of the aforementioned works, as well as others [36, 37, 40, 42], fit function approximators to demonstrations of physical dynamics. Such approximators are valuable, but may not fully capture the true underlying physics and may degenerate outside the training distribution. Our work follows the program exemplified by Degrave et al. [1], de Avila Belbute-Peres et al. [2], and others in that we aim to backpropagate through the true physical dynamics. In this approach, physical correctness is enforced by construction.

Figure 2.1: Differentiable simulation embedded in a neural network. The loss can be backpropagated through the physics simulator to the control signal, thus enabling end-to-end gradient-based training of the controller network. This approach enables faster convergence with higher accuracy in learning and control of physical systems.

2.3 Overview

Our differentiable physics engine can function as a layer embedded within a broader differentiable pipeline, with other neural network layers in the loop. This scenario is illustrated in Figure 2.1. Given observations from simulation, the network can compute control signals and send them to the simulator. The physics engine performs time integration, evaluates collision response, and updates the mesh states. The simulation results can be used to evaluate the loss and provide new observations to the network, thus closing the loop. All operations in our physics engine are differentiable, supporting end-to-end gradient-based optimization of the complete pipeline; a minimal sketch of this pattern is given at the end of this section. As mentioned earlier, state-of-the-art differentiable simulation algorithms often suffer from scalability issues. They either use particle representations, which have cubic growth rates with respect to object sizes, or compute custom dynamic responses for a limited number of object primitives, which have limited ability to represent complex shapes. Our method resolves these issues to advance scalability and generality. Section 2.4 introduces our notation and the basic forward integration scheme. To facilitate scalability and generality, we adopt meshes as our basic object representation. Simulation of complex scenes brings up the challenge of collision detection and response. We describe an efficient collision handling scheme in Section 2.5. Observing that collisions are often sparse, we utilize the impact zone method and handle collisions and contact forces locally in independent collision areas. This collision handling scheme has linear computational complexity with respect to the number of constraints rather than the total number of nodes. Our unified collision handling formulation also enables differentiable two-way coupling of rigid bodies and cloth. In Section 2.6, we describe the challenges involved in differentiation and introduce an acceleration scheme for implicit differentiation with nonlinear constraints.
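The closed loop in Figure 2.1 can be written down compactly. In the sketch below, a placeholder one-line dynamics function stands in for the differentiable time-integration and collision-response step of our engine, whose actual API is not shown here; the controller architecture, loss, and rollout length are likewise illustrative assumptions.

```python
import torch
import torch.nn as nn

def simulation_step(state, control, h=0.01):
    # Placeholder for one differentiable simulation step (time integration plus
    # collision response in the real engine), written with torch ops so that
    # gradients flow from the loss back to the control signal.
    q, qdot = state
    qdot = qdot + h * control
    q = q + h * qdot
    return q, qdot

controller = nn.Sequential(nn.Linear(6, 64), nn.Tanh(), nn.Linear(64, 3))
optimizer = torch.optim.Adam(controller.parameters(), lr=1e-3)
target = torch.tensor([1.0, 0.0, 0.5])     # desired final position

for iteration in range(200):
    q, qdot = torch.zeros(3), torch.zeros(3)
    for _ in range(50):                    # roll out the closed loop
        obs = torch.cat([q, qdot])         # observation fed to the network
        control = controller(obs)          # network outputs the control signal
        q, qdot = simulation_step((q, qdot), control)
    loss = (q - target).pow(2).sum()       # loss on the simulated outcome
    optimizer.zero_grad()
    loss.backward()                        # backpropagate through the simulator
    optimizer.step()
```

The important property is that simulation_step is composed entirely of differentiable operations, so loss.backward() reaches the controller parameters through the physical rollout.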
2.4 Preliminaries

After construction of the system, the dynamics can be written as

\frac{d}{dt}\begin{bmatrix} q \\ \dot{q} \end{bmatrix} = \begin{bmatrix} \dot{q} \\ \ddot{q} \end{bmatrix} = \begin{bmatrix} \dot{q} \\ M^{-1} f(q, \dot{q}) \end{bmatrix}, \qquad (2.4.0.1)

where M is the mass matrix and f is the generalized force vector. We incorporate contact forces [115], internal forces including bending and stretching [119], and external forces including gravity and control input. For numerical simulation, the dynamical system can be discretized with time step h. At time t0, with q0 = q(t0) and q̇0 = q̇(t0), we can compute the increment of the coordinates ∆q = q(t0 + h) − q(t0) and velocities ∆q̇ = q̇(t0 + h) − q̇(t0) using the implicit Euler method [120]:

\begin{bmatrix} \Delta q \\ \Delta \dot{q} \end{bmatrix} = h \begin{bmatrix} \dot{q}_0 + \Delta \dot{q} \\ M^{-1} f(q + \Delta q,\ \dot{q} + \Delta \dot{q}) \end{bmatrix}. \qquad (2.4.0.2)

With a linear approximation, we can solve for ∆q̇,

\left( h^{-1} M - \frac{\partial f}{\partial \dot{q}} - h \frac{\partial f}{\partial q} \right) \Delta \dot{q} = f_0 + h \frac{\partial f}{\partial q} \dot{q}_0, \qquad (2.4.0.3)

which can then be used to compute q(t0 + h) and q̇(t0 + h) for the next time step.

2.5 Scalable Collision Handling

One of the most time-consuming and complex parts of physics simulation is collision handling. The difficulty, especially for differentiable physics, lies in how to compute the correct derivatives efficiently even in a large scene with many interacting objects. As observed by Hu et al. [19], naive discrete-time impulse-based collision response can lead to completely incorrect gradients. We apply continuous collision detection to circumvent this problem. We also employ a bounding volume hierarchy to localize and accelerate dynamic collision detection. Traditional collision response algorithms for rigid bodies commonly employ global LCP-based solvers to resolve collisions [2]. This works well for small-to-medium-size systems, but does not scale well when the number of objects increases. The problem is exacerbated when the derivatives of all DOFs in this large system need to be computed simultaneously, which is exactly what backpropagation does. Our key observation is that collisions are sparse in the majority of simulation steps, which means that the number of collisions is much smaller than the total DOF. As shown in Figure 2.2, collisions are commonly localized to small regions of the scene. With this observation, we resolve collisions locally among the objects involved, rather than setting up a global optimization, updating coordinates, and computing the gradients for all objects in the environment. One way to represent local collisions is to use impact zones [114, 115]. An impact is a pair of primitives colliding with each other. It can be an edge-edge (EE) or a vertex-face (VF) pair. For these two cases, the non-penetration constraints can be written as

0 \le C_{ee} = n \cdot \left[ (\alpha_3 x_3 + \alpha_4 x_4) - (\alpha_1 x_1 + \alpha_2 x_2) \right]
0 \le C_{vf} = n \cdot \left[ x_4 - (\alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3) \right], \qquad (2.5.0.1)

respectively, where x_i are the positions of the mesh vertices, α_i are the barycentric coordinates, and n is the normal of the colliding plane. Impacts may share vertices. All the impacts in one connected component are said to form an impact zone. Each impact zone is a local area that can be treated independently. The red areas in Figure 2.2 are impact zones.

Figure 2.2: Collision visualization. Regions in collision are shown in red. In most cases, the number of collisions is sparse compared to the number of degrees of freedom in the scene.
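To make the grouping step concrete, the sketch below forms impact zones as connected components of impacts that share vertices, using a small union-find structure. Representing each impact as a tuple of its four vertex indices is an illustrative simplification of the engine's actual data layout.

```python
from collections import defaultdict

def find(parent, v):
    # Path-compressing find for the union-find structure.
    while parent[v] != v:
        parent[v] = parent[parent[v]]
        v = parent[v]
    return v

def build_impact_zones(impacts):
    """Group impacts into impact zones (connected components over shared vertices).

    impacts: list of tuples holding the 4 vertex indices of each
             vertex-face or edge-edge impact.
    Returns a list of impact zones, each a list of impact indices.
    """
    parent = {}
    for verts in impacts:
        for v in verts:
            parent.setdefault(v, v)
    # Union all vertices that appear in the same impact.
    for verts in impacts:
        root = find(parent, verts[0])
        for v in verts[1:]:
            parent[find(parent, v)] = root
    # Impacts whose vertices share a root belong to the same zone.
    zones = defaultdict(list)
    for i, verts in enumerate(impacts):
        zones[find(parent, verts[0])].append(i)
    return list(zones.values())

# Example: two VF impacts sharing vertex 3 form one zone; the third impact is separate.
print(build_impact_zones([(0, 1, 2, 3), (3, 4, 5, 6), (7, 8, 9, 10)]))
# -> [[0, 1], [2]]
```

Each returned zone can then be resolved independently by the per-zone optimization described next.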
Collision resolution in one impact zone can be written as an optimization problem:

\begin{aligned} \underset{x'}{\text{minimize}} \quad & \tfrac{1}{2} (x - x')^\top M (x - x') \\ \text{subject to} \quad & G x' + h \le 0. \end{aligned} \qquad (2.5.0.2)

Here x is the concatenation of vertex positions in the impact zone before collision handling; x′ are the resolved positions that satisfy the constraints; M is the mass matrix; G and h are a matrix and a vector derived from the constraints in Equation 2.5.0.1. The impact zone method is fail-safe [114] and is commonly used in cloth simulation, where each vertex in the cloth is free and can be explicitly optimized when solving Equation 2.5.0.2. However, we cannot directly use this formulation for rigid bodies, because all vertices in a rigid body are tied and cannot be optimized separately. A possible treatment is to use additional constraints on the relative positions of vertices to enforce rigidity. However, the number of vertices in a rigid body is usually far larger than its DOF. We therefore take a different approach that minimizes the number of optimization variables in the system. This is critical for scalable backpropagation in such systems. We make a distinction between contact vertices x and the objects' generalized coordinates q. The actual optimization variables are replaced by q. For a rigid body i with rotation r = (ϕ, θ, ψ)⊤ and translation t = (t_x, t_y, t_z)⊤, the generalized coordinates are q = [r⊤, t⊤]⊤ ∈ R^6. The block-diagonal mass matrix M̂ ∈ R^{6×6} is accordingly changed into angular and linear inertia. The constraints are rewritten using the new variables. A function f(·) maps generalized coordinates q to a contact point x, including both rotation and translation. Detailed formulations of M̂ and f(·) are given in Appendix A. The optimization problem for collision resolution becomes

\begin{aligned} \underset{q'}{\text{minimize}} \quad & \tfrac{1}{2} (q - q')^\top \hat{M} (q - q') \\ \text{subject to} \quad & G f(q') + h \le 0. \end{aligned} \qquad (2.5.0.3)

2.6 Fast Differentiation

Gradients for the sparse linear system in Equation 2.4.0.3 can be computed via implicit differentiation [3, 116]. Other basic operations can be handled by automatic differentiation [121]. However, the heaviest computation is induced by collision handling, which can take up to 90% of the runtime for a single simulation step. Collision resolution can involve multiple iterations of the optimization specified in Equation 2.5.0.3 for each impact zone. Liang et al. [3] proposed a method that reduces computation, but it only deals with linear constraints and does not apply in our case due to the nonlinearity introduced by f(·). We now develop an appropriate acceleration scheme that works in our setting. Consider the differentiation in our case. At a local minimum z* with Lagrange multiplier λ*, the following KKT conditions hold:

\begin{aligned} \hat{M} z^* - \hat{M} q + \nabla f^\top G^\top \lambda^* &= 0 \\ D(\lambda^*)\left( G f(z^*) + h \right) &= 0, \end{aligned} \qquad (2.6.0.1)

where D(·) packages a vector into a diagonal matrix. The key idea here is to linearize f(·) around a small neighborhood of z*. The implicit differentiation is formulated as

\begin{bmatrix} \hat{M} & \nabla f^\top G^\top \\ D(\lambda^*) G \nabla f & D(G f(z^*) + h) \end{bmatrix} \begin{bmatrix} \partial z \\ \partial \lambda \end{bmatrix} = \begin{bmatrix} \hat{M}\, \partial q - \nabla f^\top \partial G^\top \lambda^* \\ -D(\lambda^*)\left( \partial G \cdot f(z^*) + \partial h \right) \end{bmatrix}. \qquad (2.6.0.2)
As derived by Amos and Kolter [116], by solving for d_z and d_λ in the following equation

\begin{bmatrix} \hat{M} & \nabla f^\top G^\top D(\lambda^*) \\ G \nabla f & D(G f(z^*) + h) \end{bmatrix} \begin{bmatrix} d_z \\ d_\lambda \end{bmatrix} = \begin{bmatrix} \frac{\partial L}{\partial z}^\top \\ 0 \end{bmatrix} \qquad (2.6.0.3)

we can get the backward gradients:

\frac{\partial L}{\partial q} = d_z^\top \hat{M} \qquad (2.6.0.4)
\frac{\partial L}{\partial G} = -D(\lambda^*)\, d_\lambda f(z^*)^\top - \lambda^* d_z^\top \nabla f^\top \qquad (2.6.0.5)
\frac{\partial L}{\partial h} = -d_\lambda^\top D(\lambda^*). \qquad (2.6.0.6)

Assume that in one impact zone, there are n DOFs and m constraints. In Equation 2.6.0.3, the size of the system is n + m. The solution of this system would typically take O((n + m)^3): prohibitively expensive in large scenes. We therefore need to accelerate the computation of d_z and d_λ. Liang et al. [3] proposed to use a QR decomposition. However, it only copes with quadratic programming with linear constraints. We can extend this method to nonlinear constraints by incorporating the Jacobian ∇f. Recall that a rigid body has coordinates q = [r, t] ∈ R^6. The rotation is represented by Euler angles r = (ϕ, θ, ψ). The Jacobian at a vertex x = (x, y, z)⊤ of the rigid body is

\nabla f_k = \begin{bmatrix} \frac{\partial x}{\partial \phi} & \frac{\partial x}{\partial \theta} & \frac{\partial x}{\partial \psi} & 1 & 0 & 0 \\ \frac{\partial y}{\partial \phi} & \frac{\partial y}{\partial \theta} & \frac{\partial y}{\partial \psi} & 0 & 1 & 0 \\ \frac{\partial z}{\partial \phi} & \frac{\partial z}{\partial \theta} & \frac{\partial z}{\partial \psi} & 0 & 0 & 1 \end{bmatrix}. \qquad (2.6.0.7)

The exact formulation of ∇f_k can be found in Appendix A. After computing ∇f, we apply a QR decomposition \sqrt{\hat{M}}^{-1} \nabla f^\top G^\top = QR and derive d_z and d_λ from Equation 2.6.0.3 as

d_z = \sqrt{\hat{M}}^{-1} \left( I - Q Q^\top \right) \sqrt{\hat{M}}^{-1} \frac{\partial L}{\partial z}^\top \qquad (2.6.0.8)
d_\lambda = D(\lambda^*)^{-1} R^{-1} Q^\top \sqrt{\hat{M}}^{-1} \frac{\partial L}{\partial z}^\top. \qquad (2.6.0.9)

The cost of the QR decomposition is O(nm^2), reducing the computation from O((n + m)^3).
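For reference, the sketch below spells out Equations 2.6.0.8 and 2.6.0.9 for a single impact zone with dense NumPy arrays. The diagonal mass-matrix assumption, the strictly positive multipliers, and the variable shapes are illustrative simplifications rather than the actual implementation.

```python
import numpy as np

def collision_backward(M_hat, G, jac_f, lam, dL_dz):
    """Backward pass for one impact zone via the QR scheme (Eqs. 2.6.0.8-2.6.0.9).

    M_hat : (n, n) generalized mass matrix of the zone (assumed diagonal here)
    G     : (m, k) constraint matrix acting on the stacked contact points
    jac_f : (k, n) Jacobian of the map from generalized coordinates to contact points
    lam   : (m,)   Lagrange multipliers at the solution (assumed nonzero)
    dL_dz : (n,)   gradient of the loss w.r.t. the resolved coordinates
    """
    # sqrt(M)^-1, using the diagonal assumption for simplicity.
    sqrt_M_inv = np.diag(1.0 / np.sqrt(np.diag(M_hat)))

    # QR decomposition of sqrt(M)^-1 (G grad_f)^T, an (n, m) matrix with n >= m.
    Q, R = np.linalg.qr(sqrt_M_inv @ (G @ jac_f).T)

    rhs = sqrt_M_inv @ dL_dz
    d_z = sqrt_M_inv @ (rhs - Q @ (Q.T @ rhs))     # Eq. 2.6.0.8
    d_lam = np.linalg.solve(R, Q.T @ rhs) / lam    # Eq. 2.6.0.9, D(lambda)^-1 applied last

    dL_dq = d_z @ M_hat                            # Eq. 2.6.0.4
    return dL_dq, d_z, d_lam
```

Only the m-by-m triangular solve and a few matrix-vector products remain after the QR factorization, which is what yields the O(nm^2) cost discussed above.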
2.7 Experiments

We conduct a variety of experiments to evaluate the presented approach. We begin by comparing to other differentiable physics frameworks, with particular attention to scalability and generality. We then conduct ablation studies to evaluate the impact of the techniques presented in Sections 2.5 and 2.6. Lastly, we provide case studies that illustrate the application of the presented approach to inverse problems and learning control. The automatic differentiation is implemented in PyTorch 1.3 [121]. All experiments are run on an Intel(R) Xeon(R) W-2123 CPU @ 3.60GHz.

2.7.1 Scalability

We construct two benchmark scenes and vary them in controlled fashion to evaluate scalability. The scenes are illustrated in Figure 2.3. In the first, objects fall from the air, hit the ground, and finally settle. To test scalability, we increase the number of objects while maintaining the stride between objects. As the number of objects increases, the spatial extent of the scene expands accordingly. In the second scene, a rigid bunny strikes a deformable cloth. We vary the relative scale of the bunny and the cloth to test the ability of simulation frameworks to handle objects with geometric features at different resolutions.

Few differentiable simulation frameworks are capable of modeling these scenes. de Avila Belbute-Peres et al. [2] only simulate 2D scenes with a restricted repertoire of primitives. Liang et al. [3] simulate cloth but cannot handle rigid-body dynamics. We therefore choose the simulation framework of Hu et al. [6] as our main baseline. This framework uses the material point method (MPM), which leverages particles and grids to model solids. We use the state-of-the-art implementation from the high-performance DiffTaichi library [19].

Figure 2.3: Scalability. (a) Benchmark scenes. Few differentiable simulation frameworks are capable of modeling these scenes. We compare to the expressive MPM-based framework ChainQueen [6], implemented in the high-performance DiffTaichi library [19]. (b, c) Runtime and memory consumption as the scenes are varied in controlled fashion. Memory consumption is peak memory usage. Time is the running time for simulating 2 seconds of dynamics. Top: the number of objects in the scene increases from 20 to 1000. Bottom: the relative scale of the cloth and the bunny increases from 1:1 to 10:1. The MPM-based framework consumes up to two orders of magnitude more memory and computation before it runs out of memory.

The results are shown in Figure 2.3. The first row reports the runtime and memory consumption of the two frameworks as the number of objects in the scene increases from 20 to 1000. Memory consumption is peak memory usage. Time is the running time for simulating 2 seconds of dynamics. Our runtime and memory consumption increase linearly in the complexity of the scene, while the MPM-based method scales cubically until it runs out of memory at 200 objects and a 640³ grid. The second row reports runtime and memory consumption as the relative sizes of the cloth and the bunny are varied from 1:1 to 10:1. The runtime and memory consumption of our method stay constant. In contrast, as the size of the cloth grows, the MPM-based framework is forced to allocate more and more memory and expend greater and greater computation.

These experiments indicate that scalability is a significant advantage of our method. Since we do not need to quantize space, the extent of the scene and the relative sizes of objects do not dominate our runtime and memory consumption. Since our method dynamically detects and handles collisions locally, as needed, rather than setting up a global optimization problem, the runtime scales linearly (rather than quadratically) with scene complexity.

2.7.2 Ablation Studies

We proposed localized collision handling in Section 2.5 and fast differentiation in Section 2.6 to make our method scalable to large scenes with many degrees of freedom. We conduct dedicated ablation studies to assess the contribution of these techniques.

Localized collision handling. We compare with the LCP-based rigid-body differentiable simulation developed by de Avila Belbute-Peres et al. [2]. This baseline also uses implicit differentiation to compute derivatives of optimization solvers [116]. The environment is shown in Figure 2.4(a).

Figure 2.4: Ablation studies. (a) Comparing global and local collision handling. Cubes are released from the air and fall to the ground. (b) Evaluating the contribution of the fast differentiation scheme. Cubes are densely stacked, forming a big connected component. All collisions need to be solved simultaneously because the motion of one cube can affect all others.

N (= 100, 200, 300) cubes are released above the ground plane. They fall down and hit the ground. We use the same environment for both methods, although the LCP-based framework only simulates in 2D and thus has only half the degrees of freedom of our 3D simulation (3 versus 6 per object, accounting for translation and rotation). We disable our fast differentiation method in this experiment in order to neutralize its effect and conduct a controlled comparison between global and local collision handling.
The implementation of de Avila Belbute-Peres et al. [2] uses four threads while our method uses only one. Results are reported in Table 2.1. Our sparse collision handling method runs up to 5 times faster than the LCP-based approach, and the performance gap widens as the complexity of the scene increases.

# of cubes   100              200              300
LCP          0.73s ± 0.017s   2.87s ± 0.103s   8.42s ± 0.190s
Ours         0.56s ± 0.009s   1.11s ± 0.012s   1.65s ± 0.025s

Table 2.1: Runtime of backpropagation (in seconds per simulation step) with LCP-based collision handling [2] and our approach. Simulation of the scene shown in Figure 2.4(a). The LCP-based framework simulates in 2D and uses 4 threads, while our implementation simulates in 3D and uses 1 thread. Our approach is faster, and the performance gap widens with the complexity of the scene.

Fast differentiation. We now evaluate the contribution of the acceleration scheme described in Section 2.6. The environment is shown in Figure 2.4(b). N (= 100, 200, 300) cubes are stacked in two layers. During collision handling, all contacts between cubes form one connected component. All constraints need to be solved in one big optimization problem. Thus the linear system in Equation 2.6.0.3 is large.

We evaluate the runtime of backpropagation in this scene with and without the presented acceleration scheme. The ablation condition is referred to as 'W/o FD', where FD refers to fast differentiation. In this condition, the derivative of Equation 2.5.0.3 is calculated by solving Equation 2.6.0.3 directly. The results are reported in Table 2.2. The fast differentiation accelerates backpropagation by up to 16.8x in this scene, and the impact of the technique increases with the complexity of the scene.

# of cubes   100              200              300
W/o FD       1.43s ± 0.015s   7.76s ± 0.302s   21.88s ± 0.125s
Ours         0.41s ± 0.004s   0.86s ± 0.008s   1.30s ± 0.008s
Speedup      3.49x            9.02x            16.83x

Table 2.2: Runtime of backpropagation (in seconds per simulation step) with and without our fast differentiation scheme. Simulation of the scene shown in Figure 2.4(b). N (= 100, 200, 300) cubes are stacked in two layers. The impact of the acceleration scheme increases with the complexity of the scene.

2.7.3 Two-way coupling

One of the advantages of the presented framework is that it can handle both rigid bodies and deformable objects such as cloth. This is illustrated in the two scenes shown in Figure 2.5. In Figure 2.5(a), a Stanford bunny and an armadillo stand on a piece of cloth. The cloth is lifted by its corners, envelops the figurines, and lifts them up. In Figure 2.5(b), a piece of cloth strikes a domino, which begins a chain reaction that propagates until the last domino strikes the cloth from behind. The scenes are also shown in the supplementary video. No prior differentiable simulation framework is capable of modeling these scenes properly. Liang et al. [3] can simulate the cloth, but its motion does not affect the figurines. The framework of Hu et al. [6, 19] does not enforce rigidity and suffers from interpenetration. In contrast, our approach simulates the correct dynamics and the two-way coupling between rigid and deformable objects.

We further compare with MuJoCo, a popular physics engine [122]. MuJoCo models cloth as a 2D grid of capsule and ellipsoid geoms in addition to spheres. This representation fails to correctly handle collisions near the holes in the grid. In Figure 2.6, we simulate a rigid body falling onto a trampoline and bouncing back. When simulated in MuJoCo, the ball penetrates the trampoline.
Our method simulates the interaction correctly.

Figure 2.5: Two-way coupling of rigid bodies and cloth. Top: initial state. Bottom: towards the end of the simulation. (a) Two figurines are lifted by a piece of cloth. (b) Cloth strikes a domino and is later struck back. No prior differentiable simulation framework is capable of modeling these scenes properly.

Figure 2.6: Comparison with MuJoCo. Simulation of a ball interacting with a trampoline. Top: initial state. Bottom: later in the simulation. (a) Simulation in MuJoCo. (b) Simulation by our method. MuJoCo models cloth as a 2D grid of capsule and ellipsoid geoms in addition to spheres. The ball penetrates the trampoline when the grid is sparse.

2.7.4 Applications

Inverse problem. Differentiable simulation naturally lends itself to gradient-based optimization for inverse problems. A case study is shown in Figure 2.7, in which a marble is supported by a soft sheet. The sheet is fixed at its four corners. In each time step, an external force is applied to the marble. The goal is to find a sequence of forces that drives the marble to a target position in 2 seconds, while minimizing the total amount of applied force. The vertical component of the external force is set to 0 so that the marble has to interact with the cloth before reaching the target. We compare with a derivative-free optimization algorithm, CMA-ES [123]. Our framework enables gradient-based optimization in this setting and converges in 4 iterations, reaching a lower objective value than what CMA-ES achieves after two orders of magnitude more iterations.

Figure 2.7: Inverse problem. (a) Marble on a sheet. The goal is to bring the marble to the desired position with minimal force. (b) Objective function. Our framework enables gradient-based optimization for this problem and quickly converges to a lower objective value than derivative-free optimization (CMA-ES). We perform 5 runs with different random seeds; shaded areas represent one standard deviation.

Learning control. Our second set of case studies uses the differentiable simulator to backpropagate gradients into a neural network that must learn to manipulate objects to achieve desired outcomes. The scenarios are shown in Figure 2.8. The neural network controls a pair of sticks and a piece of cloth, respectively, which must be used to manipulate an object. The goal is to bring the object to the desired position within 1 second. The optimization objective is the L2 distance to the target. The neural network is an MLP with 50 nodes in the first layer and 200 nodes in the second, with ReLU activations. For reference, we report the performance of DDPG, a model-free reinforcement learning algorithm [124]. In each episode, we fix the initial positions of the manipulator and the object while the target position is randomized. For both methods, the input to the network is a concatenated vector of the relative distance to the target, the speed, and the remaining time. The DDPG reward is the negative L2 distance to the target.

Figure 2.8: Learning control. A neural network must learn to control (a) a pair of sticks or (b) a piece of cloth. The trained controller must be able to bring the manipulated object to any desired location specified at test time. Our method enables gradient-based training of the controller. A model-free reinforcement learning baseline (DDPG) fails to learn the task on a comparable time scale. We perform 5 runs with different random seeds; shaded areas represent one standard deviation.
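For illustration, the following is a minimal PyTorch sketch of this training pattern. The cloth and stick environments are replaced by a toy differentiable point-mass "simulator", so the snippet only demonstrates how the task loss is backpropagated through the physics steps into the controller; it is not the actual implementation, and all numerical values are illustrative.

```python
import torch
import torch.nn as nn

# Toy differentiable "simulator": a point mass pushed by the network's force.
# Because the step is written in differentiable tensor operations, the final
# loss can be backpropagated through the entire rollout into the controller.
def sim_step(pos, vel, force, dt=0.02, mass=1.0, damping=0.98):
    vel = damping * (vel + dt * force / mass)
    pos = pos + dt * vel
    return pos, vel

policy = nn.Sequential(nn.Linear(7, 50), nn.ReLU(),
                       nn.Linear(50, 200), nn.ReLU(),
                       nn.Linear(200, 3))            # MLP controller (50 and 200 hidden units)
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
horizon, dt = 50, 0.02

for episode in range(200):
    target = torch.rand(3)                           # randomized goal, as in the case study
    pos, vel = torch.zeros(3), torch.zeros(3)
    for t in range(horizon):
        remaining = torch.tensor([(horizon - t) * dt])
        obs = torch.cat([target - pos, vel, remaining])  # relative distance, speed, time left
        pos, vel = sim_step(pos, vel, policy(obs))
    loss = ((pos - target) ** 2).sum()               # L2 distance to the target
    opt.zero_grad()
    loss.backward()                                  # gradients flow through every physics step
    opt.step()                                       # one network update per episode
```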
Our method updates the network once at the end of each episode, while DDPG receives a reward signal and updates the network weights in each time step. As shown in the loss curves in Figure 2.8, our method quickly converges to a good control policy, while DDPG fails to learn the task on a comparable time scale. This example illustrates the power of gradient-based optimization with our differentiable physics framework.

Figure 2.9: Parameter estimation. (a) Initial guess. (b) Target observation. Cubes start with opposite velocities and collide with each other. Our goal is to estimate the mass of the left cube such that the total momentum after the collision has the desired direction and magnitude. (a) shows the initial estimate: both cubes have the same mass and the total momentum after the collision is 0. (b) shows the simulation with the mass produced by our estimation procedure: the momentum points to the right and has a magnitude of 3, as specified in the target observation.

Parameter estimation. Given an observation of motion, our method can estimate unknown physical parameters of objects in the scene. In Figure 2.9, two cubes start with opposite velocities $\pm v$ and collide with each other. Given a target momentum observation $p = m_1 v'_1 + m_2 v'_2$, where $v'_{1,2}$ are the velocities after the collision and $m_{1,2}$ are the masses, we aim to estimate the mass $m_1$. The target momentum is set to $p = (3, 0, 0)$. We initialize with $m_1 = m_2 = 1$, which yields $p = (0, 0, 0)$ (Figure 2.9(a)). After 90 gradient steps, our method arrives at the estimate $m_1 = 5.4$ and achieves the desired momentum (Figure 2.9(b)).
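As a worked example of this optimization pattern (not the actual 3D scene, which is simulated by our differentiable simulator), the sketch below fits an unknown mass by gradient descent through a hand-written 1-D elastic-collision model; all numerical values are illustrative assumptions.

```python
import torch

# 1-D stand-in for the cube-collision scene: two bodies approach with opposite
# velocities and collide elastically. The post-collision momentum is a
# differentiable function of the unknown mass m1, so it can be fitted by
# gradient descent; only the optimization pattern is illustrated here.
def post_collision_momentum(m1, m2, v):
    v1, v2 = v, -v                                    # opposite initial velocities
    v1p = ((m1 - m2) * v1 + 2 * m2 * v2) / (m1 + m2)  # elastic collision formulas
    v2p = ((m2 - m1) * v2 + 2 * m1 * v1) / (m1 + m2)
    return m1 * v1p + m2 * v2p

target_p = torch.tensor(3.0)                          # desired momentum magnitude
m1 = torch.tensor(1.0, requires_grad=True)            # initial guess: equal masses
m2, v = torch.tensor(1.0), torch.tensor(1.0)          # known mass and speed (illustrative)
opt = torch.optim.Adam([m1], lr=0.1)

for step in range(90):
    loss = (post_collision_momentum(m1, m2, v) - target_p) ** 2
    opt.zero_grad()
    loss.backward()
    opt.step()

print(m1.item())   # approaches m2 + p / v, the mass that produces the target momentum
```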
Interoperability across simulators. Our differentiable simulation framework is interoperable with other physics simulators. We have evaluated this interoperability with MuJoCo, a popular non-differentiable simulator [122]. The experiment is illustrated in Figure 2.10. We place three cubes on smooth ground. The goal is to apply forces to the cubes in order to make them stick together. The objective function is the distance of each cube to its target position plus the L2 norm of the force. We compute the loss in MuJoCo but evaluate the gradient in DiffSim, our differentiable simulator. Figure 2.10(a) shows the initial state. Figure 2.10(b) is the result of the simulation after 10 gradient steps. The goal of making the cubes stick together was accomplished. This experiment shows that physical states and control signals are interoperable between our differentiable framework and non-differentiable simulators.

Figure 2.10: Interoperability across simulators. (a) Initial state. (b) Target state. Three cubes are placed on smooth ground. The goal is to make the cubes stick together while minimizing the applied forces. The loss is computed in MuJoCo but the gradient is evaluated in DiffSim. (a) is the initial state and (b) is the successful simulation result after the optimization.

2.8 Conclusion

We developed a scalable differentiable simulation method that has linear complexity with respect to the number of objects and collisions, and constant complexity with respect to spatial resolution. We use meshes as our core representation, enabling support for objects of arbitrary shape at any scale. A unified local collision handling scheme enables fast coupling between objects with different materials. An acceleration scheme based on implicit differentiation speeds up backpropagation. Experiments show that the presented techniques speed up backpropagation by large multiplicative factors, and their impact increases with the complexity of the simulated scenes. We demonstrated the application of the presented method in a number of case studies, including end-to-end gradient-based training of neural network controllers. Training with differentiable simulation was shown to be significantly more effective than gradient-free and model-free baselines.

One avenue for future work is to extend the presented framework to other types of objects, materials, and contact models. The framework is sufficiently general to support deformable solids and articulated bodies, as long as they have mesh-based representations with repulsive contact forces. We plan to extend our implementation to incorporate such object classes.

Chapter 3: Efficient Differentiable Simulation of Articulated Bodies

3.1 Introduction

Differentiable physics enables efficient gradient-based optimization with dynamical systems. It has achieved promising results in both simulated [4, 6] and real environments [7, 16]. Our goal is to make articulated body simulation efficiently differentiable. We aim to maximize efficiency in both computation and memory use, in order to support fast gradient-based optimization of differentiable systems that interact with articulated bodies in physical environments.

Articulated bodies play a central role in robotics, computer graphics, and embodied AI. Many control systems are optimized via experiences collected in simulation [122, 125, 126, 127]. However, they do not have access to analytic derivatives of the articulated dynamics. Therefore, nearly all gradient-based approaches have to design strategies to compute the gradients indirectly when dealing with articulated bodies.

The straightforward way to differentiate the simulation is to use existing automatic differentiation tools [25, 128, 129]. However, autodiff tools consume prohibitive amounts of memory when there are many simulation steps. Autodiff tracks every operation and stores the intermediate results in order to perform backpropagation. In articulated body simulation, the iterative contact solver and the dynamics algorithm [130] yield exceedingly long computational graphs. In our experiments, differentiable simulators built with autodiff tools run out of memory after 5,000 simulation steps – just a few seconds of experience. As a result, learning is constrained to short experiences or forced to use large time steps, thus curtailing the scope of behaviors that can be learned or undermining the simulation's stability. Furthermore, the overhead of creating and storing the auxiliary variables for autodiff also slows down the forward simulation. Although autodiff tools like DiffTaichi [19] and JAX [21] can accelerate the simulation of fluids and deformable solids by vectorization, it is difficult to achieve the same speedup in articulated body simulation because the articulated dynamics algorithm is highly serialized, unlike the inherently parallel computation on grids and particles.

In this chapter, we design a differentiable articulated body simulation method that runs 10x faster with 1% of the memory consumption compared to differentiation based on autodiff tools.
In order to minimize the overhead of differentiation, we derive the gradients of articulated body simulation using the adjoint method [131]. The adjoint method has been applied to fluids [31] and multi-body systems [35], but these applications do not support physically correct differentiable simulation of articulated bodies. Our derivation of the operator adjoints uses spatial algebra [132]. Our method needs almost no additional computation during the forward simulation and is an order of magnitude faster than autodiff tools.

We further reduce memory requirements by adapting ideas from checkpointing [27, 133] to the differentiable simulation setting. In the forward pass, we store the initial simulation state for each time step. During backpropagation, we recreate intermediate variables by reproducing the simulation from the stored state. The overall runtime remains fast, while memory consumption is reduced by two orders of magnitude.

As an application of differentiable dynamics for articulated systems, we show that reinforcement learning (RL) can benefit from the knowledge of gradients in two ways. First, it can make use of the gradients computed by the simulation to generate extra samples using a first-order approximation. Second, during the policy learning phase, differentiable physics enables us to perform a one-step rollout of the objective value function so that the policy updates can be more accurate. Both schemes effectively improve the convergence speed and the attained reward. We also demonstrate applications of efficient differentiable articulated dynamics to inverse problems, such as motion control and parameter estimation. Gradient-based optimization enabled by our method accelerates convergence in these settings by more than an order of magnitude. Code is available on our project page: https://github.com/YilingQiao/diffarticulated

The contributions of this work are as follows:
• We derive the adjoint formulations for the entire articulated body simulation workflow, enabling a 10x acceleration over autodiff tools.
• We adapt the checkpointing method to the structure of articulated body simulation to reduce memory consumption by 100x, making stable collection of extended experiences feasible.
• We introduce two general schemes for accelerating reinforcement learning using differentiable physics.
• We demonstrate the utility of differentiable simulation of articulated bodies in motion control and parameter estimation, enhancing performance in these scenarios by more than an order of magnitude.

3.2 Related Work

Differentiable programming has been applied to rendering [86, 134, 135], image processing [136, 137], SLAM [138], and design [139]. Making complex systems differentiable enables learning and optimization using gradient-based methods. Our literature review focuses on differentiable physics, the adjoint method, and neural approximations of physical systems.

Differentiable physics. Differentiable physics provides gradients for learning, control, and inverse problems that involve physical systems. Degrave et al. [1] advocate using differentiable physics to solve control problems in robotics. de Avila Belbute-Peres et al. [2] obtain gradients of 2D rigid body dynamics. Liang et al. [3] use automatic differentiation tools to obtain gradients for cloth simulation. Qiao et al. [4] develop a more comprehensive differentiable physics engine for rigid bodies and cloth based on mesh representations.
Ingraham et al. [5] present differentiable simulation of protein dynamics. For volumetric data, ChainQueen [6] computes gradients of the MPM simulation pipeline. Bern et al. [7] use FEM to model soft robots and perform trajectory optimization with analytic gradients. Takahashi et al. [8] differentiate the simulation of fluids with solid coupling. Many exciting applications of differentiable physics have been explored [9, 10, 11, 12]. Huang et al. [13] propose a soft-body manipulation benchmark for differentiable physics. Toussaint et al. [14] manipulate tools with the help of differentiable physics. Song and Boularias [15] perform system identification by learning from the trajectory, and Song and Boularias [16] then use the estimated frictional properties to help robotic pushing.

A related line of work concerns the development of powerful automatic differentiation tools that can be used to differentiate simulation. DiffTaichi [19] provides a new programming language and a compiler, enabling the high-performance Taichi simulator to compute the gradients of the simulation. JAX MD [20] makes use of the JAX autodiff library [21] to differentiate molecular dynamics simulation. These works have specifically made use of vectorization on both CPU and GPU to achieve high performance on grids and particle sets, where the simulation is intrinsically parallel. In contrast, the simulation of articulated bodies is far more serial, as dynamics propagates along kinematic paths rather than acting in parallel on grid cells or particles. TinyDiffSim [22] provides a templatized simulation framework that can leverage existing autodiff tools such as CppAD [23], Ceres [24], and PyTorch [25] to differentiate simulation. However, these methods introduce significant overhead in tracing the computation graph and accumulate substantial computational and memory costs when applied to articulated bodies. Our method does not need to store the entire computation graph to compute the gradients. We store the initial state of each simulation step and reproduce the intermediate results when needed by the backward pass. This checkpointing strategy is also used in training neural ODEs [26] and large neural networks [27, 28, 29, 30]. We are the first to adapt this technique to articulated dynamics, achieving dramatic reductions in memory consumption and enabling stable simulation of long experiences.

Adjoint method. The adjoint method has been applied to fluid control [31], PDEs [32], light transport simulation [33], and neural ODEs [26, 34]. Recently, Geilinger et al. [35] proposed to use the adjoint method in multi-body dynamics. However, they operate in maximal coordinates and model body attachments using springs. This does not enforce physical validity of articulated bodies. In contrast, we operate in reduced coordinates and derive the adjoints for the articulated body algorithm and the spatial algebra operators that properly model the body's joints. This supports physically correct simulation with joint torques, limits, Coriolis forces, and proper transmission of internal forces between links.

Neural approximation. A number of works approximate physics simulation using neural networks [36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46].
Physical principles have also been incorporated in the design of neural networks [47, 48, 49, 50, 51]. Approximate simulation by neural networks is naturally differentiable, but the networks are not constrained to abide by the underlying physical dynamics, and simulation fidelity may degenerate outside the training distribution.

Figure 3.1: The workflow of a simulation step. Assume there are $n_q = 6$ DoF. The initial state is an $n_q \times 3$ matrix containing the position, velocity, and control input of each joint. The forward dynamics traverses the articulated body sequentially three times. This process is difficult to parallelize and generates a large number of intermediate results. After the forward dynamics there is a collision resolution step with collision detection and an iterative Gauss-Seidel solver. The size of the initial state (position, velocity, and control input) is much smaller than that of the intermediate results, but the initial state carries all the information needed to recover the intermediate variables.

3.3 Preliminaries

Articulated body dynamics. For the forward simulation, we choose the recursive Articulated Body Algorithm (ABA) [130], which has O(n) complexity and is widely used in articulated body simulators [122, 125, 126]. In each simulation step k, the state $x_k$ of the system consists of the configuration in generalized coordinates and its velocity, $x_k = [q_k, \dot{q}_k] \in \mathbb{R}^{2 n_q \times 1}$, where $n_q$ is the number of degrees of freedom in the system. Assuming each joint can be actuated by a torque, there is an $n_q$-dimensional control signal $u_k \in \mathbb{R}^{n_q \times 1}$. The discretized dynamics at this step can be written as $f_{k+1}(u_k, x_k, x_{k+1}) = 0$.

Adjoint method. We can concatenate the states of the entire time sequence into $x = [x_1, x_2, \dots, x_{n_t}] \in \mathbb{R}^{n_t \cdot 2 n_q \times 1}$, with corresponding control input $u = [u_1, u_2, \dots, u_{n_t}] \in \mathbb{R}^{n_t \cdot n_u \times 1}$, where $n_t$ is the number of simulation steps. The concatenated dynamics equations can be similarly written as $f = [f_1, f_2, \dots, f_{n_t}] = 0$. For a learning or optimization problem, we would usually define a scalar objective function $\Phi(x, u)$. To get the derivative of this function w.r.t. the control input u, one needs to compute
\[
\frac{d\Phi}{du} = \frac{\partial \Phi}{\partial u} + \frac{\partial \Phi}{\partial x} \frac{dx}{du}.
\tag{3.3.0.1}
\]
$\frac{\partial \Phi}{\partial u}$ is easy to compute for each single simulation step. But it is prohibitively expensive to directly solve for $\frac{\partial \Phi}{\partial x} \frac{dx}{du}$, because $\frac{dx}{du}$ would be a $2 n_q n_t \times n_q n_t$ matrix. Instead, we differentiate the dynamics $f(x, u) = 0$ and obtain the constraint $\frac{\partial f}{\partial x} \frac{dx}{du} = -\frac{\partial f}{\partial u}$. Applying the adjoint method [131], $\frac{\partial \Phi}{\partial x} \frac{dx}{du}$ equals $-R^\top \frac{\partial f}{\partial u}$, where $R$ satisfies
\[
\left( \frac{\partial f}{\partial x} \right)^{\!\top} R = \left( \frac{\partial \Phi}{\partial x} \right)^{\!\top}.
\tag{3.3.0.2}
\]
Since the dynamics equation for one step only involves a small number of previous steps, the sparsity of $\frac{\partial f}{\partial u}$ and $\frac{\partial f}{\partial x}$ makes it easier to solve for the variable $R$ in Eq. 3.3.0.2. $\frac{\partial \Phi}{\partial u} - R^\top \frac{\partial f}{\partial u}$ is called the adjoint of u and is equivalent to the gradient by our substitution. In the following derivation, we denote the adjoint of a variable $s$ by $\bar{s}$, and the adjoint of a function $f(\cdot)$ by $\bar{f}(\cdot)$.
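The following toy sketch illustrates this recipe on a generic explicit-Euler dynamics; the dynamics function g, its Jacobians, and the objective derivatives are arbitrary placeholders rather than the articulated-body quantities, and the backward recursion it exploits is the one derived in the next section.

```python
import numpy as np

# Toy illustration of the adjoint method (Eqs. 3.3.0.1-3.3.0.2) for
# x_{k+1} = x_k + dt * g(x_k, u_k), i.e.
# f_{k+1}(u_k, x_k, x_{k+1}) = x_{k+1} - x_k - dt * g(x_k, u_k) = 0.
def g(x, u):      return -0.5 * x + np.tanh(u)
def dg_dx(x, u):  return -0.5 * np.eye(len(x))
def dg_du(x, u):  return np.diag(1.0 - np.tanh(u) ** 2)

def adjoint_gradients(x0, us, dt, dPhi_dx, dPhi_du):
    # Forward pass: store only the per-step states (these double as the
    # checkpoints of Section 3.4.2).
    xs = [x0]
    for u in us:
        xs.append(xs[-1] + dt * g(xs[-1], u))

    nt = len(us)
    grads_u = [None] * nt
    R = dPhi_dx(xs[nt])                 # R_{nt} = (dPhi/dx_{nt})^T for explicit Euler
    for k in range(nt - 1, -1, -1):
        # df_{k+1}/du_k = -dt * dg/du  and  df_{k+1}/dx_k = -I - dt * dg/dx,
        # so dPhi/du_k = dPhi/du_k + dt * (dg/du)^T R_{k+1}.
        grads_u[k] = dPhi_du(us[k]) + dt * dg_du(xs[k], us[k]).T @ R
        R = dPhi_dx(xs[k]) + (np.eye(len(x0)) + dt * dg_dx(xs[k], us[k])).T @ R
    return grads_u                      # dPhi/du_k for every step
```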
3.4 Efficient Differentiation

In this section, we introduce our algorithm for the gradient computation in articulated body simulation. For faster differentiation, the gradients of articulated body dynamics are computed by the adjoint method. We derive the adjoints of time integration, contact resolution, and forward dynamics, in reverse order. We then adapt the checkpointing method to articulated body simulation to reduce memory consumption.

3.4.1 Adjoint Method for Articulated Dynamics

One forward simulation step can be split into five modules, as shown in Figure 3.1: perform the forward kinematics from the root link to the leaf links, update the forces from the leaves to the root, update the accelerations from the root to the leaves, detect and resolve collisions, and perform time integration. Backpropagation proceeds through these modules in reverse order.

Time integration. Backpropagation starts from the time integration. As an example, for a simulation sequence with $n_t = 3$ steps and an integrator with a temporal horizon of 2, the constraint $(\frac{\partial f}{\partial x})^\top R = (\frac{\partial \Phi}{\partial x})^\top$ in Equation 3.3.0.2 can be expanded as
\[
\begin{bmatrix}
\left(\frac{\partial f_1}{\partial x_1}\right)^{\!\top} & \left(\frac{\partial f_2}{\partial x_1}\right)^{\!\top} & 0 \\
0 & \left(\frac{\partial f_2}{\partial x_2}\right)^{\!\top} & \left(\frac{\partial f_3}{\partial x_2}\right)^{\!\top} \\
0 & 0 & \left(\frac{\partial f_3}{\partial x_3}\right)^{\!\top}
\end{bmatrix}
\begin{bmatrix} R_1 \\ R_2 \\ R_3 \end{bmatrix}
=
\begin{bmatrix}
\left(\frac{\partial \Phi}{\partial x_1}\right)^{\!\top} \\
\left(\frac{\partial \Phi}{\partial x_2}\right)^{\!\top} \\
\left(\frac{\partial \Phi}{\partial x_3}\right)^{\!\top}
\end{bmatrix}.
\tag{3.4.1.1}
\]
If using the explicit Euler integrator, we have $\frac{\partial f_k}{\partial x_k} = I$. Initially, $R_{n_t} = \left(\frac{\partial \Phi}{\partial x_{n_t}}\right)^{\!\top}$. The remaining $R_k$ can be computed iteratively by
\[
R_k = \left(\frac{\partial \Phi}{\partial x_k}\right)^{\!\top} - \left(\frac{\partial f_{k+1}}{\partial x_k}\right)^{\!\top} R_{k+1}, \quad k = 1, 2, \dots, n_t - 1.
\tag{3.4.1.2}
\]
As $R_k$ backpropagates through time, the gradients $\bar{u}_k$ can also be computed via Equation 3.3.0.2. In fact, by the way we calculate $R_k$, it equals the gradient of $x_k$. Other parameters can be computed in the same way as $u_k$.

Collision resolution. The collision resolution step consists of collision detection and a collision solver. Upon receiving the gradients $\bar{x}$ from the time integrator, the collision solver needs to pass the gradients to detection, and then to the forward dynamics. In our collision solver, we construct a Mixed Linear Complementarity Problem (MLCP) [140]:
\[
a = A x + b \quad \text{s.t.} \quad 0 \le a \perp x \ge 0 \ \text{ and } \ c^- \le x \le c^+,
\tag{3.4.1.3}
\]
where x is the new collision-free state, A is the inertial matrix, b contains the relative velocities, and $c^-$, $c^+$ are the lower-bound and upper-bound constraints, respectively. We use the projected Gauss-Seidel (PGS) method to solve this MLCP. This iterative solve trades accuracy for speed, so the constraint $a^\top x = 0$ might not hold on termination. In this setting, where the solution is not guaranteed to satisfy the constraints, implicit differentiation [2, 3, 4] no longer works. Instead, we design a reverse version of the PGS solver using the adjoint method to compute the gradients. Further details are in the supplement. In essence, the solver mirrors PGS, substituting the adjoints for the original operators. This step passes the gradients from the collision-free states to the forward dynamics.

Forward dynamics. Our articulated body simulator propagates the forward dynamics as shown in the green blocks of Figure 3.1. Each operation in the forward simulation has a corresponding adjoint operation in the backward pass. To compute the gradients, we derive adjoint rules for the operators in spatial algebra [132]. As a simple example, a spatial vector $p_i = [w, v] \in \mathbb{R}^6$ representing the bias force of the i-th link is the sum of an external force $[f_1, f_2] \in \mathbb{R}^6$ and a cross product of two spatial motion vectors $[w_1, v_1], [w_2, v_2] \in \mathbb{R}^6$:
\[
\begin{bmatrix} w \\ v \end{bmatrix}
=
\begin{bmatrix}
f_1 + w_1 \times w_2 + v_1 \times v_2 \\
f_2 + w_1 \times v_2
\end{bmatrix}.
\tag{3.4.1.4}
\]
Once we obtain the adjoint $[\bar{w}, \bar{v}]$ of $p_i = [w, v]$, we can propagate it to the inputs:
\[
\begin{bmatrix} \bar{w}_1 \\ \bar{v}_1 \end{bmatrix}
=
\begin{bmatrix}
-\bar{w} \times w_2 - \bar{v} \times v_2 \\
-\bar{w} \times v_2
\end{bmatrix},
\quad
\begin{bmatrix} \bar{w}_2 \\ \bar{v}_2 \end{bmatrix}
=
\begin{bmatrix}
\bar{w} \times w_1 \\
\bar{w} \times v_1 + \bar{v} \times w_1
\end{bmatrix},
\quad
\begin{bmatrix} \bar{f}_1 \\ \bar{f}_2 \end{bmatrix}
=
\begin{bmatrix} \bar{w} \\ \bar{v} \end{bmatrix}.
\tag{3.4.1.5}
\]
This example shows the adjoint of one forward operation. The time and space complexity of the adjoint operations matches that of the forward simulation. The supplement provides more details on the adjoint operations.
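A direct transcription of this operator and its adjoint is sketched below; the `bar_` prefixes stand for the overbar (adjoint) notation, and the function names are illustrative rather than those of the actual implementation.

```python
import numpy as np

def bias_force(f, m1, m2):
    """Forward op (Eq. 3.4.1.4): p = [w, v] from an external force f = [f1, f2]
    and two spatial motion vectors m1 = [w1, v1], m2 = [w2, v2], each in R^6."""
    f1, f2 = f[:3], f[3:]
    w1, v1 = m1[:3], m1[3:]
    w2, v2 = m2[:3], m2[3:]
    w = f1 + np.cross(w1, w2) + np.cross(v1, v2)
    v = f2 + np.cross(w1, v2)
    return np.concatenate([w, v])

def bias_force_adjoint(bar_p, m1, m2):
    """Adjoint op (Eq. 3.4.1.5): given bar_p = [bar_w, bar_v], return the
    adjoints of the inputs (bar_f, bar_m1, bar_m2)."""
    bar_w, bar_v = bar_p[:3], bar_p[3:]
    w1, v1 = m1[:3], m1[3:]
    w2, v2 = m2[:3], m2[3:]
    bar_w1 = -np.cross(bar_w, w2) - np.cross(bar_v, v2)
    bar_v1 = -np.cross(bar_w, v2)
    bar_w2 = np.cross(bar_w, w1)
    bar_v2 = np.cross(bar_w, v1) + np.cross(bar_v, w1)
    bar_f = bar_p.copy()                   # [bar_f1, bar_f2] = [bar_w, bar_v]
    return bar_f, np.concatenate([bar_w1, bar_v1]), np.concatenate([bar_w2, bar_v2])
```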
3.4.2 Checkpointing for Articulated Dynamics

The input and output variables of one simulation step have relatively small dimensionalities (positions, velocities, and accelerations), but many more intermediate values are computed during the step. Although this is not a bottleneck for forward simulation, because the memory allocated for intermediate results can be reused across time steps, it becomes a problem for differentiable simulation, which needs to store the entire computation graph for the backward pass. This causes memory consumption to explode when the simulation proceeds over many time steps, as is the case when small time steps are needed for accuracy and stability, and when the effects of actions take time to become apparent. The need to scale to many simulation steps motivates our adaptation of the checkpointing method [27, 133].

Instead of saving the entire computation graph, we choose to store only the initial state of each time step. We use this granularity because (a) the states of each step have the smallest size among all essential information sufficient to replicate the simulation step, (b) finer checkpoints are not useful because at least one step of intermediate results needs to be stored in order to run the simulation, and (c) sparser checkpoints would use less memory but require multiple steps to reconstruct the intermediate variables, costing more memory and computation. We validate our checkpointing policy with an ablation study in the supplement.

Figure 3.2 illustrates the scheme. During forward simulation, we store the simulation state at the beginning of each time step. During backpropagation, we reload the checkpoints (blue arrows), redo the forward (one-step) simulation to generate the computation graph, and then compute the gradients using the adjoint method in reverse order (red arrows).

Figure 3.2: Differentiation of articulated body dynamics. Top: forward simulation at step k. The simulator stores the states $q_k$, $\dot{q}_k$ and the control input $u_k$. Bottom: during backpropagation, the checkpoints are reloaded and the simulator runs one forward step to reconstruct the intermediate variables. Beginning with the gradients from step k+1, we use the adjoint method to sequentially compute the gradients of all variables through time integration, the constraint solver, and forward dynamics.

In summary, assume the simulation step consists of two parts:
\[
Z_k = G(x_{k-1}), \qquad x_k = F(Z_k),
\tag{3.4.2.1}
\]
where $Z_k$ represents all the intermediate results. After each step, we free the space consumed by $Z_k$, storing only $x_k$. During backpropagation, we recompute $Z_k$ from $x_{k-1}$ and use the adjoint method to compute the gradients of the previous step:
\[
Z_k = G(x_{k-1}),
\tag{3.4.2.2}
\]
\[
\bar{Z}_k = \bar{F}(Z_k, \bar{x}_k),
\tag{3.4.2.3}
\]
\[
\bar{x}_{k-1} = \bar{G}(x_{k-1}, \bar{Z}_k).
\tag{3.4.2.4}
\]
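A minimal sketch of this scheme is shown below. G, F and their adjoints are left as placeholders for the simulator's forward-dynamics/collision stage, its time-integration stage, and their adjoint routines; the function names are illustrative.

```python
def simulate_and_backprop(x0, n_steps, G, F, G_adj, F_adj, dPhi_dx_final):
    """Checkpointed forward simulation and adjoint backward pass
    (Eqs. 3.4.2.1-3.4.2.4)."""
    # Forward pass: keep only the per-step states x_k (the checkpoints);
    # the intermediate results Z_k are discarded after every step.
    xs = [x0]
    for _ in range(n_steps):
        Z = G(xs[-1])                       # Eq. 3.4.2.1: Z_k = G(x_{k-1})
        xs.append(F(Z))                     #              x_k = F(Z_k)

    # Backward pass: reload each checkpoint, redo one forward step to
    # recover Z_k, then apply the adjoints in reverse order.
    bar_x = dPhi_dx_final(xs[-1])
    for k in range(n_steps, 0, -1):
        Z = G(xs[k - 1])                    # Eq. 3.4.2.2: recompute intermediates
        bar_Z = F_adj(Z, bar_x)             # Eq. 3.4.2.3
        bar_x = G_adj(xs[k - 1], bar_Z)     # Eq. 3.4.2.4
    # In the full simulator, control gradients are accumulated inside G_adj;
    # here we simply return the gradient w.r.t. the initial state.
    return bar_x
```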
3.5 Reinforcement Learning

Our method can be applied to simulation-based reinforcement learning (RL) tasks to improve policy performance as well as convergence speed. By computing gradients with respect to the actions, differentiable simulation can provide more information about the environment. We suggest two approaches to integrating differentiable physics into RL.

Sample enhancement. We can make use of the computed gradients to generate samples in the neighborhood of an existing sample. Specifically, given a sample $(s, a_0, s'_0, r_0)$ from the history, with observation $s$, action $a_0$, next-step observation $s'_0$, and reward $r_0$, we can generate new samples $(s, a_k, s'_k, r_k)$ using a first-order approximation:
\[
a_k = a_0 + \Delta a_k, \qquad
s'_k = s'_0 + \frac{\partial s'_0}{\partial a_0} \Delta a_k, \qquad
r_k = r_0 + \frac{\partial r_0}{\partial a_0} \Delta a_k,
\]
where $\Delta a_k$ is a random perturbation vector. This method, which we call sample enhancement, effectively generates as many approximately accurate samples as desired for learning purposes. By providing a sufficient number of generated samples in the neighborhood, the critic can get a better grasp of the function shape (patchwise rather than pointwise), and can thus learn and converge faster.

Policy enhancement. Alternatively, the policy update can be adjusted to a form compatible with differentiable physics, usually dependent on the specific RL algorithm. For example, in MBPO [141], the policy network is updated using the gradients of the critic:
\[
L_\mu = -Q(s, \mu(s)) + Z,
\tag{3.5.0.1}
\]
where $L_\mu$ is the loss for the policy network $\mu$, $Q$ is the value function for state-action pairs, and $Z$ is the regularization term. To make use of the gradients, we can expand the Q function one step forward,
\[
\frac{\partial Q(s, a)}{\partial a} = \frac{\partial r}{\partial a} + \gamma \frac{\partial Q(s', \mu(s'))}{\partial s'} \frac{\partial s'}{\partial a},
\tag{3.5.0.2}
\]
and substitute it in Eq. 3.5.0.1:
\[
L'_\mu = -\frac{\partial Q(s, a)}{\partial a}\, \mu(s) + Z.
\tag{3.5.0.3}
\]
Eqs. 3.5.0.1 and 3.5.0.3 yield the same gradients with respect to the action, which in turn provide the gradients of the network parameters. This method, which we call policy enhancement, constructs loss functions with embedded ground-truth gradients so that the policy updates are physics-aware and generally more accurate than merely looking at the Q function itself, thus achieving faster convergence and potentially even higher reward.
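The sample-enhancement scheme reduces to a few lines; in the sketch below, `ds_da` and `dr_da` stand for the Jacobians $\partial s'_0 / \partial a_0$ and $\partial r_0 / \partial a_0$ returned by the differentiable simulator, and the perturbation scale is an arbitrary illustrative choice.

```python
import numpy as np

def enhance_samples(s, a0, s_next0, r0, ds_da, dr_da, n_extra=8, sigma=0.01):
    """Generate approximate transitions around (s, a0, s'_0, r_0) by
    first-order expansion in the action."""
    samples = []
    for _ in range(n_extra):
        delta_a = sigma * np.random.randn(*a0.shape)
        a_k = a0 + delta_a
        s_next_k = s_next0 + ds_da @ delta_a     # s'_k = s'_0 + (ds'_0/da_0) * da
        r_k = r0 + dr_da @ delta_a               # r_k  = r_0  + (dr_0/da_0)  * da
        samples.append((s, a_k, s_next_k, r_k))
    return samples                               # pushed into the replay buffer
```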
3.6 Experiments

For experiments, we will first scale the complexity of simulated scenes and compare the performance of our method with autodiff tools. Then we will integrate differentiable physics into reinforcement learning and use our method to learn control policies. Lastly, we will apply differentiable articulated dynamics to solve motion control and parameter estimation problems.

3.6.1 Comparison with Autodiff Tools

Using existing autodiff tools is a convenient way to derive simulation gradients. However, these tools are not optimized for articulated dynamics, which adversely affects their computation and memory consumption in this domain. We compare our method with state-of-the-art autodiff tools, including CppAD [23], Ceres [24], PyTorch [25], autodiff [142] (referred to as ADF to avoid ambiguity), and JAX [21]. All our experiments are performed on a desktop with an Intel(R) Xeon(R) W-2123 CPU @ 3.60GHz with 32GB of memory.

steps     50       100      500       1000      5000
Ceres     100.0    200.0    1100.0    2700.0    23600.0
CppAD     16.0     26.0     160.0     320.0     3100.0
JAX       200.0    200.0    400.0     700.0     3000.0
PyTorch   1200.0   2400.0   11700.0   12400.0   N/A
Ours      0.3      0.3      0.7       1.2       5.0

Table 3.1: Peak memory use of different simulation frameworks. The memory footprint (MB) of our framework is more than two orders of magnitude lower than that of autodiff methods. PyTorch crashes at 5,000 simulation steps. ADF fails to compute the gradients in a reasonable time (10 min) and is not included here for this reason.

steps     50      100     500     1000    5000
ADF       25.7    25.5    25.1    32.1    58.4
Ceres     27.2    27.5    27.2    34.0    58.2
CppAD     2.4     2.4     2.3     2.3     4.5
JAX       53.3    46.1    43.1    42.7    42.3
PyTorch   195.6   192.2   199.2   192.8   N/A
Ours      0.3     0.3     0.2     0.2     0.2

Table 3.2: Simulation time for the forward pass (in msec). Ours is at least an order of magnitude faster than autodiff tools. PyTorch crashes at 5,000 simulation steps. CppAD is the fastest baseline.

One robot. In the first round, we profile all the methods by simulating one Laikago robot standing on the ground. Figure 3.3(a) provides a visualization. The state vector of the Laikago has 37 dimensions (19-dimensional positions and 18-dimensional velocities). We vary the number of simulation steps from 50 to 5,000. JAX is based on Python and its JIT compiler cannot easily
(a) One Laikago (b) Ten Laika