ABSTRACT Title of Dissertation: MACHINE LEARNING WITH DIFFERENTIABLE PHYSICS PRIORS Yiling Qiao Doctor of Philosophy, 2024 Dissertation Directed by: Professor Ming C. Lin Department of Computer Science Differentiable physics priors enable gradient-based learning systems to adhere to physical dynamics. By making physics simulations differentiable, we can backpropagate through the physical consequences of actions. This pipeline allows agents to quickly learn to achieve desired effects in the physical world and is an effective technique for solving inverse problems in physical or dynamical systems. This new programming paradigm bridges model-based and data-driven methods, mitigating data scarcity and model bias simultaneously. My research focuses on developing scalable, powerful, and efficient differentiable physics simulators. We have created state-of-the-art differentiable physics for rigid bodies, cloth, fluids, articulated bodies, and deformable solids, achieving performance orders of magnitude better than existing alternatives. These differentiable simulators are applied to solve inverse problems, train control policies, and enhance reinforcement learning algorithms. MACHINE LEARNING WITH DIFFERENTIABLE PHYSICS PRIORS by Yiling Qiao Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2024 Advisory Committee: Professor Ming C. Lin, Chair/Advisor Professor Maria Cameron, Dean’s Representative Professor Jia-Bin Huang Professor Tianyi Zhou Professor Dinesh Manocha © Copyright by Yiling Qiao 2024 Acknowledgments First and foremost, I would like to express my deepest gratitude to my advisor, Dr. Ming Lin. Your guidance, support, and encouragement have been invaluable throughout my doctoral journey. Your expertise and insights have profoundly shaped this dissertation, and I am incredibly fortunate to have had the opportunity to work under your mentorship. Your stories have inspired me greatly, and your suggestions about my future career will always play an important role in my life. I am also profoundly grateful to the members of my dissertation committee, Dr. Maria Cameron, Dinesh Manocha, Jia-Bin Huang, and Tianyi Zhou, for their insightful feedback, constructive criticism, and unwavering support. Your contributions have significantly enhanced the quality of this work. I would like to extend my heartfelt thanks to Dr. Vladlen Koltun, Aniket Bera, Matthias Zwicker, Tom Goldstein, Daniele Panozzo, Hao Su, Bo Zhu, Chuang Gan, Animesh Garg, Xiaodi Wu, Lingjie Liu, Breannan Smith, Takaaki Shiratori, Ken Museth, Miles Macklin, Chris Atkeson, Lin Gao, Xilin Chen, and others for your kind guidance and help. I feel so fortunate to have had the opportunity to read your great research works first and then to have been helped and/or mentored by you later. To my friends and labmates in GAMMA, Aaron, Adarsh, Alexander, Bhrij, Divya, Geonsun, Jaehoon, Jing, Junbang, Justin, Laura, Mohamed, Niall, Peng, Pooja, Qingyang, Rohan, Ruicheng, ii Ruiqi, Sanghyun, Senthil, Shrey, Shrellekha, Tianrui, Trisha, Utsav, Uttaran, Vishnu, Will, Xijun, Xiyang, Yonghan, Yu, Zhenyu, Jason, Tetsuya, Xian, Zhenjia, Johnson, Pingchuan, Xuan, Peter, Krishna, and others, thank you for your companionship, encouragement, and the countless discussions that have helped shape my thinking. At the University of Maryland, I really spent five happy years. 
Thank you Tom, Jodie, Sharron, Migo, Shanshan, and all the other staff in ISSS, the Graduate School, the CS department, the Eppley Center, and elsewhere. I also sincerely appreciate the support from Intel, Nvidia, and Meta for my research. A special thanks to the Meta Fellowship. Thank you for recognizing our research. That really means a lot to me and gives me the confidence for more ambitious ventures in the future. A special note of appreciation goes to my family: my wife, mom, dad, grandma, uncle, and aunts. Thank you for your love, patience, and belief in me. To my housemates, Jiaqi, Yuxiang, Hsien-Yu, and Suying, thanks for all the good times we had together. All the best to your future journey as well. Thank you all for your contributions, support, and encouragement. This dissertation would not have been possible without you. iii Table of Contents Acknowledgements ii Table of Contents iv List of Tables viii List of Figures ix I Introduction 1 Chapter 1: Overview 2 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2.1 Differentiable Programming . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2.2 Simulation and Animation Techniques . . . . . . . . . . . . . . . . . . . 6 1.2.3 Differentiable 3D Reconstruction . . . . . . . . . . . . . . . . . . . . . . 7 1.3 Outline of Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 II Differentiable Simulation 12 Chapter 2: Scalable Differentiable Physics for Learning and Control 13 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.3 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.4 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.5 Scalable Collision Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.6 Fast Differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.7 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.7.1 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.7.2 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.7.3 Two-way coupling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.7.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 Chapter 3: Efficient Differentiable Simulation of Articulated Bodies 38 iv 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.3 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.4 Efficient Differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.4.1 Adjoint Method for Articulated Dynamics . . . . . . . . . . . . . . . . . 45 3.4.2 Checkpointing for Articulated Dynamics . . . . . . . . . . . . . . . . . . 48 3.5 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.6.1 Comparison with Autodiff Tools . . . . . . . 
. . . . . . . . . . . . . . . 53 3.6.2 Integration with Reinforcement Learning . . . . . . . . . . . . . . . . . 56 3.6.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 Chapter 4: Differentiable Simulation of Soft Multi-body Systems 64 4.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 4.2 Soft Body Simulation Using Projective Dynamics . . . . . . . . . . . . . . . . . 68 4.3 Articulated Skeletons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.3.1 Rigid Body System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.3.2 Top-down Matrix Assembly for Articulated Body Systems . . . . . . . . 72 4.3.3 Articulated Joint Actuation . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.4 Contact Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 4.5.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 4.5.2 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 4.5.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 Chapter 5: Other Differentiable Dynamical Systems 85 5.1 Traffic Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 5.2 Fluids Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 5.3 Analog Quantum Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 III Reality-based Simulation 95 Chapter 6: Editable Neural Geometry and Physics from Monocular Videos 96 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 6.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 6.3.1 Rendering Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 6.3.2 Simulation Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 6.3.3 Training and Losses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 6.4.1 Dynamic Geometry Reconstruction . . . . . . . . . . . . . . . . . . . . 109 6.4.2 Novel View Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 v 6.4.3 Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 6.4.4 Editing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 6.4.5 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 Chapter 7: Dynamic Mesh-Aware Radiance Fields 118 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 7.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 7.2.1 Neural Fields and Surface Representations . . . . . . . . . . . . . . . . . 120 7.2.2 Scene Editing and Synthesis . . . . . . . . . . . . . . . . . . . . . . . . 122 7.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 7.3.1 Rendering . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . 123 7.3.2 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 7.3.3 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 7.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 7.4.1 Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 7.4.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 7.4.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 7.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 Chapter 8: Physically Consistent Perception of Hand-Object Interactions with Differentiable Priors 141 8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 8.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 8.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 8.3.1 Problem Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 8.3.2 Differentiable Rendering Priors . . . . . . . . . . . . . . . . . . . . . . . 147 8.3.3 Differentiable Physics Priors . . . . . . . . . . . . . . . . . . . . . . . . 148 8.3.4 Optimization-based Refinement . . . . . . . . . . . . . . . . . . . . . . 149 8.3.5 Filtering-based Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . 150 8.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 8.4.1 Real-world Robotic Hand and Object Iteration . . . . . . . . . . . . . . . 153 8.4.2 Pose Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 8.4.3 Filtering-based tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 8.4.4 Contact Refinement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 8.4.5 Generalization to Human . . . . . . . . . . . . . . . . . . . . . . . . . . 159 8.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 Chapter 9: Other Visual-Dynamical Systems 162 9.1 Navigation and Collision Avoidance Using Commodity Sensors . . . . . . . . . 162 9.2 Physics Augmented Continuum Neural Radiance Fields . . . . . . . . . . . . . . 164 vi IV Conclusion 167 Chapter 10: Conclusion 168 10.1 Summary of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168 10.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170 V Appendix 173 Appendix A: Scalable Differentiable Physics for Learning and Control 174 Appendix B: Efficient Differentiable Simulation of Articulated Bodies 180 Appendix C: Differentiable Simulation of Soft Multi-body Systems 201 Appendix D: Editable Neural Geometry and Physics from Monocular Videos 209 Appendix E: Dynamic Mesh-Aware Radiance Fields 213 Appendix F: Physically Consistent Perception of Hand-Object Interactions with Differentiable Priors 223 vii List of Tables 2.1 Performance of collision handling . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.2 Performance of differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.1 Memory usage of different simulation pipelines . . . . . . . . . . . . . . . . . . 52 3.2 imulation time for forward pass . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.3 Simulation time for the backward pass . . . . . . . . . . . . . . . . . . . . . . . 55 4.1 Memory footprint (GB). . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . 78 5.1 Machine learning in different dynamical systems. . . . . . . . . . . . . . . . . . 91 6.1 Quantitative evaluation of video reconstruction . . . . . . . . . . . . . . . . . . 109 6.2 Quantitative evaluation of novel view synthesis . . . . . . . . . . . . . . . . . . 110 8.1 Ablation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150 8.2 Evaluation of hand pose estimation . . . . . . . . . . . . . . . . . . . . . . . . . 154 8.3 Quantitative evaluation of object pose estimation . . . . . . . . . . . . . . . . . 155 8.4 Error after contact-based pose refinement . . . . . . . . . . . . . . . . . . . . . 159 8.5 Quantitative evaluation on human object pose estimation . . . . . . . . . . . . . 159 B.1 Maximum reward on n-link pendulum . . . . . . . . . . . . . . . . . . . . . . . 182 B.2 Model transplant results on MuJoCo Ant . . . . . . . . . . . . . . . . . . . . . . 185 F.1 Error after contact-based pose refinement . . . . . . . . . . . . . . . . . . . . . 225 viii List of Figures 1.1 An overview of my research. Upper boxes are the closed-loop pipeline, and the lower boxes are my works. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.1 Differentiable simulation embedded in a neural network . . . . . . . . . . . . . . 17 2.2 Collision visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.3 Scalability comparison with grid-based methods . . . . . . . . . . . . . . . . . . 25 2.4 Ablation studies for collision handling . . . . . . . . . . . . . . . . . . . . . . . 28 2.5 Two-way coupling of rigid bodies and cloth . . . . . . . . . . . . . . . . . . . . 31 2.6 Comparison with MuJoCo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.7 Comparison on inverse problems . . . . . . . . . . . . . . . . . . . . . . . . . . 33 2.8 Learning control policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.9 Parameter estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 2.10 Interoperability across simulators . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.1 The workflow of a simulation step . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.2 Differentiation of articulated body dynamics . . . . . . . . . . . . . . . . . . . . 49 3.3 Memory and speed comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.4 The n-link pendulum task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.5 The MuJoCo Ant task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 3.6 Motion control task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.7 Parameter estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 4.1 Ablation study of skeleton realization . . . . . . . . . . . . . . . . . . . . . . . 78 4.2 Ablation study of collision handling scheme . . . . . . . . . . . . . . . . . . . . 80 4.3 System identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 4.4 Motion control experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 5.1 Traffic simulation in an urban environment . . . . . . . . . . . . . . . . . . . . . 86 6.1 Overall workflow of NeuPhysics . . . . . . . . . . . . . . . . . . . . . . . . . . 101 6.2 The editing pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 6.3 Computational Graph During Training . . . . . . . . . . . . . . . . . . . . . . . 
107 6.4 Surface reconstruction from dynamic videos . . . . . . . . . . . . . . . . . . . . 109 6.5 Qualitative evaluation of geometry reconstruction and novel view synthesis . . . 110 6.6 Parameter estimation and editing by differentiable physics . . . . . . . . . . . . 110 ix 6.7 Geometry and appearance editing in videos . . . . . . . . . . . . . . . . . . . . 112 6.8 Joint training can be time-consuming and converges slowly. . . . . . . . . . . . 113 6.9 Region of Interest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 6.10 Compare with optimizing Chamfer Distance . . . . . . . . . . . . . . . . . . . . 117 7.1 Light transport on the surface (left) and in the medium (right). . . . . . . . . . 119 7.2 Pipeline overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 7.3 Shadow Casting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 7.4 HDR Volumetric Radiance Map . . . . . . . . . . . . . . . . . . . . . . . . . . 128 7.5 Rendering comparison for virtual object insertion . . . . . . . . . . . . . . . . . 132 7.6 Qualitative results for simulation . . . . . . . . . . . . . . . . . . . . . . . . . . 134 7.7 Real-time photorealistic gaming on a laptop . . . . . . . . . . . . . . . . . . . . 136 7.8 Application of mixing NeRF with meshes . . . . . . . . . . . . . . . . . . . . . 139 8.1 Application in multiple tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 8.2 An overview of our optimization and filtering pipelines . . . . . . . . . . . . . . 144 8.3 Diagram of the EKF filtering process . . . . . . . . . . . . . . . . . . . . . . . . 150 8.4 Real-world pose estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 8.5 Pose Estimation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154 8.6 Filtering results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 8.7 Qualitative results of contact-based pose refinement . . . . . . . . . . . . . . . . 158 8.8 Human Object Pose Refinement . . . . . . . . . . . . . . . . . . . . . . . . . . 160 A.1 Qualitative comparison on complex models and interactions . . . . . . . . . . . . 178 B.1 Ablation study for the checkpointing scheme. . . . . . . . . . . . . . . . . . . . 183 C.1 Ablation study of embedded skeletons . . . . . . . . . . . . . . . . . . . . . . . 207 C.2 Ablation study of collision handling scheme . . . . . . . . . . . . . . . . . . . . 207 D.1 Qualitative evaluation of video reconstruction . . . . . . . . . . . . . . . . . . . 211 E.1 Optimizing Emission UV Texture Map . . . . . . . . . . . . . . . . . . . . . . . 217 E.2 Inverse Camera Response Function (CRF) . . . . . . . . . . . . . . . . . . . . . 219 E.3 Scaling the number of the inserted objects and image resolution . . . . . . . . . . 220 E.4 Run time comparison with NeuPhysics . . . . . . . . . . . . . . . . . . . . . . . 220 E.5 Simulation Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221 E.6 Number of Rank-1s each method receives. . . . . . . . . . . . . . . . . . . . . . 221 E.7 Comparison between Luma AI (left) and Ours (right). . . . . . . . . . . . . . . . 222 x Part I Introduction 1 Chapter 1: Overview 1.1 Motivation Physically-based simulation has become an essential component in animation, visual effects, and broader computational problems, such as aircraft design and digital fabrication. 
In the recent wave of machine learning, simulation environments for games, robots, and autonomous cars have played a crucial role as infrastructure, fostering breakthroughs in learning algorithms. With stronger artificial general intelligence approaching, I believe that simulation will be even more important. We need physics-informed AI; any intelligent agent needs to be safely trained, aligned, and verified in a simulated sandbox before it can ultimately be deployed and benefit our human society. Moreover, despite the presence of numerous simulation models and real-world data, people are working to address the sim-to-real (generalization and robustness problems) and real-to-sim (digital twin) gaps. Simply enlarging the dataset size or pursuing infinitely accurate simulation would be too expensive. Instead, I aim to leverage the merits of both data-driven and model-based approaches by designing novel physics-informed learning algorithms. In recent years, I have undertaken many pioneering works in differentiable simulation toward this goal. My research goal is to develop safe, realistic, and efficient simulations for graphics, robotics, embodied AI, digital twin, and inverse design. The core of my work can be distilled into a three-part closed-loop pipeline, as illustrated 2 in Figure 1. (1) Formulate the governing dynamics: I will initially define the underlying rules of the system and tailor the simulation to be efficient and realistic for machine learning algorithms. (2) Generate and understand observations: Subsequently, coupling the simulation output with real-world observations, the existing data can be used to refine our model-based approach by automatically finetuning the simulation parameters. (3) Interact and control the system: With prior knowledge of the model and insights gained from real data, our algorithms can make more informed decisions to interact with the system. Throughout this closed-loop pipeline, my collaborators and I have developed several software libraries and made significant contributions published in internationally renowned venues, as ICML, NeurIPS, ICLR, Siggraph Asia, ICRA, and ICCV. Those works have garnered significant interest from researchers across the Machine Learning, Computer Graphics, Robotics, and Computer Vision research communities. Figure 1.1: An overview of my research. Upper boxes are the closed-loop pipeline, and the lower boxes are my works. 1.2 Related work 1.2.1 Differentiable Programming Differentiable physics provides gradients for learning, control, and inverse problems that involve physical systems. [1] advocate using differentiable physics to solve control problems in robotics. [2] obtain gradients of 2D rigid body dynamics. [3] use automatic differentiation 3 tools to obtain gradients for cloth simulation. [4] develop a more comprehensive differentiable physics engine for rigid bodies and cloth based on mesh representations. [5] present differentiable simulation of protein dynamics. For volumetric data, ChainQueen [6] computes gradients of the MPM simulation pipeline. [7] use FEM to model soft robots and perform trajectory optimization with analytic gradients. [8] differentiate the simulation of fluids with solid coupling. Many exciting applications of differentiable physics have been explored [9, 10, 11, 12]. [13] propose a soft-body manipulation benchmark for differentiable physics. [14] manipulate tools with the help of differentiable physics. 
[15] perform system identification by learning from the trajectory, and [16] then use the estimated frictional properties to help robotic pushing. Differentiable simulation can also be used to train control policies by optimizing model predictive control’s objective function [17, 18]. A related line of work concerns the development of powerful automatic differentiation tools that can be used to differentiate simulation. DiffTaichi [19] provides a new programming language and a compiler, enabling the high-performance Taichi simulator to compute the gradients of the simulation. JAX MD [20] makes use of the JAX autodiff library [21] to differentiate molecular dynamics simulation. These works have specifically made use of vectorization on both CPU and GPU to achieve high performance on grids and particle sets, where the simulation is intrinsically parallel. In contrast, the simulation of articulated bodies is far more serial, as dynamics propagates along kinematic paths rather than acting in parallel on grid cells or particles. TinyDiffSim [22] provides a templatized simulation framework that can leverage existing autodiff tools such as CppAD [23], Ceres [24], and PyTorch [25] to differentiate simulation. However, these methods introduce significant overhead in tracing the computation graph and accumulate substantial computational and memory costs when applied to articulated bodies. 4 Our method does not need to store the entire computation graph to compute the gradients. We store the initial states of each simulation step and reproduce the intermediate results when needed by the backward pass. This checkpointing strategy is also used in training neural ODEs [26] and large neural networks [27, 28, 29, 30]. We are the first to adapt this technique to articulated dynamics, achieving dramatic reductions in memory consumption and enabling stable simulation of long experiences. Adjoint method. The adjoint method has been applied to fluid control [31], PDEs [32], light transport simulation [33], and neural ODEs [26, 34]. Recently, [35] proposed to use the adjoint method in multi-body dynamics. However, they operate in maximal coordinates and model body attachments using springs. This does not enforce physical validity of articulated bodies. In contrast, we operate in reduced coordinates and derive the adjoints for the articulated body algorithm and spatial algebra operators that properly model the body’s joints. This supports physically correct simulation with joint torques, limits, Coriolis forces, and proper transmission of internal forces between links. Neural approximation. A number of works approximate physics simulation using neural networks [36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46]. Physical principles have also been incorporated in the design of neural networks [47, 48, 49, 50, 51]. Approximate simulation by neural networks is naturally differentiable, but the networks are not constrained to abide by the underlying physical dynamics and simulation fidelity may degenerate outside the training distribution. 5 1.2.2 Simulation and Animation Techniques Deformable body simulation using Finite Element Method (FEM) plays an important role in many scientific and engineering problems [52, 53, 54]. Previous works model soft bodies using different representations and methods for specific tasks. There are several kinds of approaches for modeling body actuation. Pneumatic-based methods [55] change the rest shape to produce reaction forces. 
Rigid bones attached within soft materials are also used to control the motion of deformable bodies [56, 57, 58, 59]. To further simulate biologically realistic motion, it is common to apply joint torques in articulated skeletons [60]. For example, [61, 62] use articulated body dynamics to govern the motion while handling collisions using soft contact. Inspired by animals, different designs of muscle-like actuators for soft-body simulations were also proposed [63, 64, 65]. Regarding contact modeling, spring-based penalty forces are widely used [66, 67, 68] for their simplicity. More advanced algorithms include inelastic projection [69, 70] and barrier-based repulsion [71, 72]. However, these methods do not always conform to Coulomb's frictional law. We opt for a more realistic dry frictional model [73] to better handle collisions. Projective Dynamics [74] is widely used for its robustness and efficiency in implicit time integration. It has been extended to model muscles [75], rigid skeletons [76], realistic materials [77], and accurate contact forces [78]. Our method also adopts this framework for faster and more stable time integration. In contrast to the aforementioned methods using Projective Dynamics, our algorithm is the first to enable joint actuation in articulated skeletons together with a generalized dry frictional contact for soft body dynamics.

1.2.3 Differentiable 3D Reconstruction

Differentiable volume rendering with implicit neural representations has recently been used to synthesize novel views and to estimate geometry and appearance. Neural Radiance Fields (NeRF) [79] and its follow-on works [80, 81, 82, 83] have achieved photorealistic rendering results even on complex scenes. The expressive coordinate-based multi-layer perceptron (MLP) representation equips the scene with high resolution [84] while remaining regularized [85] and memory-efficient. Compared with differentiable surface rendering [86, 87], volume rendering can backpropagate gradients to the entire space, which better handles complex geometry optimization with topological changes and mitigates local minima. The original density field from NeRF cannot directly reconstruct a high-quality surface mesh, so SDF-based neural fields [88, 89, 90] have been proposed for multi-view geometry reconstruction. Compared to other discrete representations such as meshes [91, 92, 93], point clouds [94, 95], and voxels [96, 97], the SDF field offers higher resolution, and the NeRF-based method can take raw RGB images without pixel-level segmentation [98, 99, 100]. However, current SDF-based neural fields cannot handle dynamic scenes. Recent work has extended NeRF to videos of dynamic scenes [101, 102, 103], but these methods mainly focus on 2D image synthesis, and few of them reconstruct the 3D geometry well. They either learn a canonical field with time-dependent motion [104, 105, 106, 107] or condition all the fields on the time axis [108, 109, 110]. We choose the former strategy because decoupling motion from the geometry makes the formulation more constrained and also provides us with point-wise correspondence (a minimal sketch of this decomposition appears at the end of this section). All the aforementioned dynamic extensions of NeRF focus on 2D video reconstruction and lack regularization on the 3D geometry. Our method reconstructs the geometry much better than existing works. Moreover, since single-video 3D reconstruction is highly ill-posed, our geometry prior can also help improve the image reconstruction quality.
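To make the canonical-field strategy above concrete, the sketch below queries a time-invariant SDF through a time-conditioned deformation field, so that geometry is decoupled from motion. The module names, layer sizes, and the additive-offset parameterization are illustrative assumptions and not the exact architecture used in Chapter 6.

```python
import torch
import torch.nn as nn

class DeformationField(nn.Module):
    """Maps a point observed at time t back to the canonical (reference) frame."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))

    def forward(self, x, t):
        # Predict a per-point offset; the canonical point is x plus that offset.
        return x + self.net(torch.cat([x, t], dim=-1))

class CanonicalSDF(nn.Module):
    """Time-invariant signed distance field defined in the canonical frame."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, x_canonical):
        return self.net(x_canonical)

deform, sdf = DeformationField(), CanonicalSDF()
x = torch.rand(1024, 3)            # query points at one video frame
t = torch.full((1024, 1), 0.5)     # normalized time stamp of that frame
sdf_values = sdf(deform(x, t))     # (1024, 1) signed distances in the canonical frame
```

Because the SDF is only ever queried in the canonical frame, the same geometry is shared across all frames, and the deformation field supplies the point-wise correspondence mentioned above.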
1.3 Outline of Dissertation The subsequent chapters in this dissertation are organized as follows. In Part II, we will demonstrate how to develop differentiable simulation for several dynamical systems. Chapter 2 is dedicated to the two-way coupling of rigid body and cloth. We develop a scalable framework for differentiable physics that can support a large number of objects and their interactions. To accommodate objects with arbitrary geometry and topology, we adopt meshes as our representation and leverage the sparsity of contacts for scalable differentiable collision handling. Collisions are resolved in localized regions to minimize the number of optimization variables even when the number of simulated objects is high. We further accelerate implicit differentiation of optimization with nonlinear constraints. Experiments demonstrate that the presented framework requires up to two orders of magnitude less memory and computation in comparison to recent particle-based methods. We further validate the approach on inverse problems and control scenarios, where it outperforms derivative-free and model-free baselines by at least an order of magnitude. Chapter 3 utilizes the adjoint method and checkpoint method for high-performance articulated body simulation. We present a method for efficient differentiable simulation of articulated bodies. This enables integration of articulated body dynamics into deep learning frameworks, and gradient- based optimization of neural networks that operate on articulated bodies. We derive the gradients of the forward dynamics using spatial algebra and the adjoint method. Our approach is an order 8 of magnitude faster than autodiff tools. By only saving the initial states throughout the simulation process, our method reduces memory requirements by two orders of magnitude. We demonstrate the utility of efficient differentiable dynamics for articulated bodies in a variety of applications. We show that reinforcement learning with articulated systems can be accelerated using gradients provided by our method. In applications to control and inverse problems, gradient-based optimization enabled by our work accelerates convergence by more than an order of magnitude. Chapter 4 provides a comprehensive simulation platform for articulated soft bodies. We present a method for differentiable simulation of soft articulated bodies. Our work enables the integration of differentiable physical dynamics into gradient-based pipelines. We develop a top-down matrix assembly algorithm within Projective Dynamics and derive a generalized dry friction model for soft continuum using a new matrix splitting strategy. We derive a differentiable control framework for soft articulated bodies driven by muscles, joint torques, or pneumatic tubes. The experiments demonstrate that our designs make soft body simulation more stable and realistic compared to other frameworks. Our method accelerates the solution of system identification problems by more than an order of magnitude, and enables efficient gradient-based learning of motion control with soft robots. Chapter 5 presents other differentiable systems that I have been involved with, including traffic simulation, fluid simulation, and quantum computing. In Part III, we will go further and study how to fuse simulation and real-world observation. Chapter 6 proposes to combine NeRF and differentiable simulation to reconstruct and simulate video data. 
We present a method for learning 3D geometry and physics parameters of a dynamic scene from only a monocular RGB video input. To decouple the learning of underlying scene geometry from dynamic motion, we represent the scene as a time-invariant 9 signed distance function (SDF) which serves as a reference frame, along with a time-conditioned deformation field. We further bridge this neural geometry representation with a differentiable physics simulator by designing a two-way conversion between the neural field and its corresponding hexahedral mesh, enabling us to estimate physics parameters from the source video by minimizing a cycle consistency loss. Our method also allows a user to interactively edit 3D objects from the source video by modifying the recovered hexahedral mesh, and propagating the operation back to the neural field representation. Experiments show that our method achieves superior mesh and video reconstruction of dynamic scenes compared to competing Neural Field approaches, and we provide extensive examples which demonstrate its ability to extract useful 3D representations from videos captured with consumer-grade cameras. Chapter 7 designs a two-way coupling between mesh and NeRF during the rendering and simulation. We first review the light transport equations for both mesh and NeRF, then distill them into a straightforward algorithm for updating radiance and throughput along a cast ray with an arbitrary number of bounces. To address the discrepancy between the linear color space that the path tracer assumes, versus the sRGB color space that standard NeRF uses, we train NeRF with High Dynamic Range images to pull the radiance field back to the linear space. Finally, we consider how the hybrid surface-volumetric formulation can be efficiently integrated with a high-performance physics simulator that supports rigid and soft bodies, and cloth. The full rendering and simulation system can be run on a GPU at interactive rates. We show that a hybrid system approach outperforms alternatives in achieving visually realistic mesh insertion into a NeRF scene, because it allows realistic light transport from volumetric NeRF media onto surfaces, which affects the appearance of reflective/refractive surfaces and illumination of diffuse surfaces informed by the scene. 10 Chapter 8 introduces a method to utilize differentiable priors in hand-related tasks. Our approach uses rendering priors to align with input images and segmentation masks, while physics priors mitigate penetration and relative-sliding across frames. Furthermore, we present two alternatives for hand and object pose estimation. The optimization-based pose estimation achieves high accuracy, while the filtering-based tracking, which utilizes the differentiable priors as dynamics and observation models, runs faster. We show that our method achieves comparable or better results in the pose estimation task, and the differentiable physics module enables us to predict contact information for pose refinement. We also show that our approach generalizes to perception tasks including robotic hand manipulation and human-object pose estimation in the wild. Chapter 9 includes two additional works on simulation from real-world observations. Chapter 10 concludes by summarizing the key contributions and suggest possible future directions in extending my dissertation research. 
Part II Differentiable Simulation

Chapter 2: Scalable Differentiable Physics for Learning and Control

2.1 Introduction

Differentiable physics enables gradient-based learning systems to strictly adhere to physical dynamics. By making physics simulation differentiable, we can backpropagate through the physical consequences of actions. This enables agents to quickly learn to achieve desired effects in the physical world. It is also a principled and effective approach to inverse problems that involve physical systems [1, 2, 14, 39]. Recent efforts have significantly advanced the understanding of differentiable physics in machine learning and robotics. Automatic differentiation and analytical methods have been applied to derive a variety of differentiable simulation engines [3, 5, 19, 32]. Yet existing differentiable frameworks are still limited in their ability to simulate complex scenes composed of many detailed interacting objects. For example, modeling assumptions have limited some existing frameworks to restricted object classes, such as balls or two-dimensional polygons [1, 2]. Other approaches adopt expressive grid- or particle-based formulations, which are general in principle but have not scaled to detailed simulation of large scenes due to the cubic growth rates of volumetric grids and particles [6, 38, 111]. For example, a grid of size 1024^3 has on the order of a billion elements, and this resolution is still quite coarse in comparison to the scale and detail of physical environments we encounter in daily life, such as city streets. Particle-based representations likewise suffer from explosive growth rates or limited resolution: Macklin et al. [112] suggest a maximum size ratio of 1:10 between the smallest and largest feature, or placing particles only around boundaries, which runs the risk of object tunneling. In this chapter, we develop a differentiable physics framework that addresses these limitations. We adopt meshes as a general representation of objects. Meshes are the most widely used specification for object geometry in computer graphics and scientific computing. Meshes are inherently sparse, can model objects of any shape, and can compactly specify environments with both large spatial extent and highly detailed features [113]. The use of meshes brings up the challenge of collision handling, since collisions can occur anywhere on the surface of the mesh. Prior frameworks adopted global LCP-based formulations, which severely limited scalability [1, 2]. In contrast, we leverage the structure of contacts by grouping them into localized impact zones, which can be efficiently represented and processed [114, 115]. This substantially reduces the number of variables and constraints involved in specifying dynamical scenes and dramatically speeds up backpropagation. Following prior work, we use implicit differentiation to compute gradients induced by an optimization problem embedded in the simulation [3, 116]. One of our contributions is an acceleration scheme that can handle the nonlinear constraints encountered in our system. We evaluate the scalability and generality of the presented approach experimentally. Controlled experiments that vary the number of objects in the scene and their relative scales demonstrate that our approach is dramatically more efficient than state-of-the-art differentiable physics frameworks.
Since meshes can represent both rigid and deformable objects, our work is also the first differentiable physics framework that can simulate two-way coupling of rigid bodies and cloth. We further demonstrate example applications to learning and control scenarios, where the presented framework outperforms derivative-free and model-free baselines by at least an order of magnitude. Code is available on our project page: https://gamma.umd.edu/researchdirections/mlphysics/diffsim

2.2 Related Work

Backpropagating through physical dynamics is an enticing prospect for embedding physical reasoning into learning processes. Degrave et al. [1] articulated the potential of this approach and applied automatic differentiation to rigid body systems where contacts are assumed to only happen between balls and planes. Although this assumption simplifies collision detection and response, it greatly limits the scope of application. de Avila Belbute-Peres et al. [2] introduced implicit differentiation for gradient computation, obviating the need to explicitly differentiate through all steps of the forward optimization solver. By designing specific rules for collision response, their framework supports simple two-dimensional shapes such as circles and polygons, but was not extended to general 3D objects. Toussaint et al. [14] utilized differentiable primitives to develop a robot reasoning system that can achieve user-defined tasks. It is designed for path planning with sphere-swept shapes and frictionless interactions, but not for general control tasks. Building on the implicit differentiation methodology, Liang et al. [3] developed a differentiable simulator for cloth. Their formulation assumes that each node can move independently and is only tied to others by soft constraints. This is true for cloth but not for rigid bodies. Carpentier and Mansard [117] developed algorithms for efficient computation of analytical derivatives of rigid-body dynamics, but did not handle collision. Millard et al. [118] implemented an articulated body simulator in a reverse-mode automatic differentiation framework, but likewise did not handle dynamic contact and collision response, which are at the heart of our work. Hu et al. [6] developed a differentiable simulator for deformable objects based on the material point method (MPM). A follow-up work applies this framework to controller design for soft robots [9]. This approach is effective in handling soft bodies, but has limited ability to enforce rigid structure. Since the method is based on particles and grids, it has limited scalability and has not been applied to large scenes with interacting objects. In contrast, our method uses meshes to represent objects, naturally enforces precise detailed geometry, and scales to scenes with both large spatial extent and fine local features. Hu et al. [19] introduce a domain-specific language for building differentiable physical simulators. This work develops a programming language and compiler that can be broadly useful throughout the differentiable physics community, including to projects such as ours. However, at present this framework does not deal with general mesh-based rigid and deformable objects. Their rigid-body scenario uses rectangles constrained by springs, with contact responses only possible between rectangles and the ground plane.
Their three-dimensional scenario uses soft bodies modeled by particles and grids and inherits the aforementioned limitations of MPM-based modeling. A line of work developed particle-based frameworks for differentiable simulation of fluids [39, 44] and unified treatment of fluids and solids [38, 111]. These frameworks resort to approximations in handling rigid bodies and detailed geometry, and suffer from high growth rates in the scale of objects and scenes. The generality of particle-based approaches is appealing, but their scalability limitations are well-known and are explicitly discussed in the literature [112]. We take a different tack and model objects as meshes, thus harnessing the scalability and generality of this representation. Some of the aforementioned works, as well as others [36, 37, 40, 42], fit function approximators to demonstrations of physical dynamics. Such approximators are valuable, but may not fully capture the true underlying physics and may degenerate outside the training distribution. Our work follows the program exemplified by Degrave et al. [1], de Avila Belbute-Peres et al. [2], and others in that we aim to backpropagate through the true physical dynamics. In this approach, physical correctness is enforced by construction.

Figure 2.1: Differentiable simulation embedded in a neural network. The loss can be backpropagated through the physics simulator to the control signal, thus enabling end-to-end gradient-based training of the controller network. This approach enables faster convergence with higher accuracy in learning and control of physical systems.

2.3 Overview

Our differentiable physics engine can function as a layer embedded within a broader differentiable pipeline, with other neural network layers in the loop. This scenario is illustrated in Figure 2.1. Given observations from simulation, the network can compute control signals and send them to the simulator. The physics engine performs time integration, evaluates collision response, and updates the mesh states. The simulation results can be used to evaluate the loss and provide new observations to the network, thus closing the loop. All operations in our physics engine are differentiable, supporting end-to-end gradient-based optimization of the complete pipeline; a minimal sketch of this pattern is given at the end of this section. As mentioned earlier, state-of-the-art differentiable simulation algorithms often suffer from scalability issues. They either use particle representations, which have cubic growth rates with respect to object sizes, or compute custom dynamic responses for a limited number of object primitives, which have limited ability to represent complex shapes. Our method resolves these issues to advance scalability and generality. Section 2.4 introduces our notation and the basic forward integration scheme. To facilitate scalability and generality, we adopt meshes as our basic object representation. Simulation of complex scenes brings up the challenge of collision detection and response. We describe an efficient collision handling scheme in Section 2.5. Observing that collisions are often sparse, we utilize the impact zone method and handle collisions and contact forces locally in independent collision areas. This collision handling scheme has linear computational complexity with respect to the number of constraints rather than the total number of nodes. Our unified collision handling formulation also enables differentiable two-way coupling of rigid bodies and cloth. In Section 2.6, we describe the challenges involved in differentiation and introduce an acceleration scheme for implicit differentiation with nonlinear constraints.
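The closed loop in Figure 2.1 can be written down compactly. In the sketch below, a placeholder one-line dynamics function stands in for the differentiable time-integration and collision-response step of our engine, whose actual API is not shown here; the controller architecture, loss, and rollout length are likewise illustrative assumptions.

```python
import torch
import torch.nn as nn

def simulation_step(state, control, h=0.01):
    # Placeholder for one differentiable simulation step (time integration plus
    # collision response in the real engine), written with torch ops so that
    # gradients flow from the loss back to the control signal.
    q, qdot = state
    qdot = qdot + h * control
    q = q + h * qdot
    return q, qdot

controller = nn.Sequential(nn.Linear(6, 64), nn.Tanh(), nn.Linear(64, 3))
optimizer = torch.optim.Adam(controller.parameters(), lr=1e-3)
target = torch.tensor([1.0, 0.0, 0.5])     # desired final position

for iteration in range(200):
    q, qdot = torch.zeros(3), torch.zeros(3)
    for _ in range(50):                    # roll out the closed loop
        obs = torch.cat([q, qdot])         # observation fed to the network
        control = controller(obs)          # network outputs the control signal
        q, qdot = simulation_step((q, qdot), control)
    loss = (q - target).pow(2).sum()       # loss on the simulated outcome
    optimizer.zero_grad()
    loss.backward()                        # backpropagate through the simulator
    optimizer.step()
```

The important property is that simulation_step is composed entirely of differentiable operations, so loss.backward() reaches the controller parameters through the physical rollout.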
2.4 Preliminaries

After construction of the system, the dynamics can be written as

\frac{d}{dt}\begin{bmatrix} q \\ \dot{q} \end{bmatrix} = \begin{bmatrix} \dot{q} \\ \ddot{q} \end{bmatrix} = \begin{bmatrix} \dot{q} \\ M^{-1} f(q, \dot{q}) \end{bmatrix}, \qquad (2.4.0.1)

where M is the mass matrix and f is the generalized force vector. We incorporate contact forces [115], internal forces including bending and stretching [119], and external forces including gravity and control input. For numerical simulation, the dynamical system can be discretized with time step h. At time t0, with q0 = q(t0) and q̇0 = q̇(t0), we can compute the increment of the coordinates ∆q = q(t0 + h) − q(t0) and velocities ∆q̇ = q̇(t0 + h) − q̇(t0) using the implicit Euler method [120]:

\begin{bmatrix} \Delta q \\ \Delta \dot{q} \end{bmatrix} = h \begin{bmatrix} \dot{q}_0 + \Delta \dot{q} \\ M^{-1} f(q + \Delta q,\ \dot{q} + \Delta \dot{q}) \end{bmatrix}. \qquad (2.4.0.2)

With a linear approximation, we can solve for ∆q̇,

\left( h^{-1} M - \frac{\partial f}{\partial \dot{q}} - h \frac{\partial f}{\partial q} \right) \Delta \dot{q} = f_0 + h \frac{\partial f}{\partial q} \dot{q}_0, \qquad (2.4.0.3)

which can then be used to compute q(t0 + h) and q̇(t0 + h) for the next time step.

2.5 Scalable Collision Handling

One of the most time-consuming and complex parts of physics simulation is collision handling. The difficulty, especially for differentiable physics, lies in how to compute the correct derivatives efficiently even in a large scene with many interacting objects. As observed by Hu et al. [19], naive discrete-time impulse-based collision response can lead to completely incorrect gradients. We apply continuous collision detection to circumvent this problem. We also employ a bounding volume hierarchy to localize and accelerate dynamic collision detection. Traditional collision response algorithms for rigid bodies commonly employ global LCP-based solvers to resolve collisions [2]. This works well for small-to-medium-size systems, but does not scale well when the number of objects increases. The problem is exacerbated when the derivatives of all DOFs in this large system need to be computed simultaneously, which is exactly what backpropagation does. Our key observation is that collisions are sparse in the majority of simulation steps, which means that the number of collisions is much smaller than the total DOF. As shown in Figure 2.2, collisions are commonly localized to small regions of the scene. With this observation, we resolve collisions locally among the objects involved, rather than setting up a global optimization, updating coordinates, and computing the gradients for all objects in the environment. One way to represent local collisions is to use impact zones [114, 115]. An impact is a pair of primitives colliding with each other. It can be an edge-edge (EE) or a vertex-face (VF) pair. For these two cases, the non-penetration constraints can be written as

0 \le C_{ee} = n \cdot \left[ (\alpha_3 x_3 + \alpha_4 x_4) - (\alpha_1 x_1 + \alpha_2 x_2) \right]
0 \le C_{vf} = n \cdot \left[ x_4 - (\alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3) \right], \qquad (2.5.0.1)

respectively, where x_i are the positions of the mesh vertices, α_i are the barycentric coordinates, and n is the normal of the colliding plane. Impacts may share vertices. All the impacts in one connected component are said to form an impact zone. Each impact zone is a local area that can be treated independently. The red areas in Figure 2.2 are impact zones.

Figure 2.2: Collision visualization. Regions in collision are shown in red. In most cases, the number of collisions is sparse compared to the number of degrees of freedom in the scene.
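To make the grouping step concrete, the sketch below forms impact zones as connected components of impacts that share vertices, using a small union-find structure. Representing each impact as a tuple of its four vertex indices is an illustrative simplification of the engine's actual data layout.

```python
from collections import defaultdict

def find(parent, v):
    # Path-compressing find for the union-find structure.
    while parent[v] != v:
        parent[v] = parent[parent[v]]
        v = parent[v]
    return v

def build_impact_zones(impacts):
    """Group impacts into impact zones (connected components over shared vertices).

    impacts: list of tuples holding the 4 vertex indices of each
             vertex-face or edge-edge impact.
    Returns a list of impact zones, each a list of impact indices.
    """
    parent = {}
    for verts in impacts:
        for v in verts:
            parent.setdefault(v, v)
    # Union all vertices that appear in the same impact.
    for verts in impacts:
        root = find(parent, verts[0])
        for v in verts[1:]:
            parent[find(parent, v)] = root
    # Impacts whose vertices share a root belong to the same zone.
    zones = defaultdict(list)
    for i, verts in enumerate(impacts):
        zones[find(parent, verts[0])].append(i)
    return list(zones.values())

# Example: two VF impacts sharing vertex 3 form one zone; the third impact is separate.
print(build_impact_zones([(0, 1, 2, 3), (3, 4, 5, 6), (7, 8, 9, 10)]))
# -> [[0, 1], [2]]
```

Each returned zone can then be resolved independently by the per-zone optimization described next.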
Collision resolution in one impact zone can be written as an optimization problem:

\begin{aligned} \underset{x'}{\text{minimize}} \quad & \tfrac{1}{2} (x - x')^\top M (x - x') \\ \text{subject to} \quad & G x' + h \le 0. \end{aligned} \qquad (2.5.0.2)

Here x is the concatenation of vertex positions in the impact zone before collision handling; x′ are the resolved positions that satisfy the constraints; M is the mass matrix; G and h are a matrix and a vector derived from the constraints in Equation 2.5.0.1. The impact zone method is fail-safe [114] and is commonly used in cloth simulation, where each vertex in the cloth is free and can be explicitly optimized when solving Equation 2.5.0.2. However, we cannot directly use this formulation for rigid bodies, because all vertices in a rigid body are tied and cannot be optimized separately. A possible treatment is to use additional constraints on the relative positions of vertices to enforce rigidity. However, the number of vertices in a rigid body is usually far larger than its DOF. We therefore take a different approach that minimizes the number of optimization variables in the system. This is critical for scalable backpropagation in such systems. We make a distinction between contact vertices x and the objects' generalized coordinates q. The actual optimization variables are replaced by q. For a rigid body i with rotation r = (ϕ, θ, ψ)⊤ and translation t = (t_x, t_y, t_z)⊤, the generalized coordinates are q = [r⊤, t⊤]⊤ ∈ R^6. The block-diagonal mass matrix M̂ ∈ R^{6×6} is accordingly changed into angular and linear inertia. The constraints are rewritten using the new variables. A function f(·) maps generalized coordinates q to a contact point x, including both rotation and translation. Detailed formulations of M̂ and f(·) are given in Appendix A. The optimization problem for collision resolution becomes

\begin{aligned} \underset{q'}{\text{minimize}} \quad & \tfrac{1}{2} (q - q')^\top \hat{M} (q - q') \\ \text{subject to} \quad & G f(q') + h \le 0. \end{aligned} \qquad (2.5.0.3)

2.6 Fast Differentiation

Gradients for the sparse linear system in Equation 2.4.0.3 can be computed via implicit differentiation [3, 116]. Other basic operations can be handled by automatic differentiation [121]. However, the heaviest computation is induced by collision handling, which can take up to 90% of the runtime for a single simulation step. Collision resolution can involve multiple iterations of the optimization specified in Equation 2.5.0.3 for each impact zone. Liang et al. [3] proposed a method that reduces computation, but it only deals with linear constraints and does not apply in our case due to the nonlinearity introduced by f(·). We now develop an appropriate acceleration scheme that works in our setting. Consider the differentiation in our case. At a local minimum z* with Lagrange multiplier λ*, the following KKT conditions hold:

\begin{aligned} \hat{M} z^* - \hat{M} q + \nabla f^\top G^\top \lambda^* &= 0 \\ D(\lambda^*)\left( G f(z^*) + h \right) &= 0, \end{aligned} \qquad (2.6.0.1)

where D(·) packages a vector into a diagonal matrix. The key idea here is to linearize f(·) around a small neighborhood of z*. The implicit differentiation is formulated as

\begin{bmatrix} \hat{M} & \nabla f^\top G^\top \\ D(\lambda^*) G \nabla f & D(G f(z^*) + h) \end{bmatrix} \begin{bmatrix} \partial z \\ \partial \lambda \end{bmatrix} = \begin{bmatrix} \hat{M}\, \partial q - \nabla f^\top \partial G^\top \lambda^* \\ -D(\lambda^*)\left( \partial G \cdot f(z^*) + \partial h \right) \end{bmatrix}. \qquad (2.6.0.2)
As derived by Amos and Kolter [116], by solving for d_z and d_λ in the following equation

\begin{bmatrix} \hat{M} & \nabla f^\top G^\top D(\lambda^*) \\ G \nabla f & D(G f(z^*) + h) \end{bmatrix} \begin{bmatrix} d_z \\ d_\lambda \end{bmatrix} = \begin{bmatrix} \frac{\partial L}{\partial z}^\top \\ 0 \end{bmatrix} \qquad (2.6.0.3)

we can get the backward gradients:

\frac{\partial L}{\partial q} = d_z^\top \hat{M} \qquad (2.6.0.4)
\frac{\partial L}{\partial G} = -D(\lambda^*)\, d_\lambda f(z^*)^\top - \lambda^* d_z^\top \nabla f^\top \qquad (2.6.0.5)
\frac{\partial L}{\partial h} = -d_\lambda^\top D(\lambda^*). \qquad (2.6.0.6)

Assume that in one impact zone, there are n DOFs and m constraints. In Equation 2.6.0.3, the size of the system is n + m. The solution of this system would typically take O((n + m)^3): prohibitively expensive in large scenes. We therefore need to accelerate the computation of d_z and d_λ. Liang et al. [3] proposed to use a QR decomposition. However, it only copes with quadratic programming with linear constraints. We can extend this method to nonlinear constraints by incorporating the Jacobian ∇f. Recall that a rigid body has coordinates q = [r, t] ∈ R^6. The rotation is represented by Euler angles r = (ϕ, θ, ψ). The Jacobian at a vertex x = (x, y, z)⊤ of the rigid body is

\nabla f_k = \begin{bmatrix} \frac{\partial x}{\partial \phi} & \frac{\partial x}{\partial \theta} & \frac{\partial x}{\partial \psi} & 1 & 0 & 0 \\ \frac{\partial y}{\partial \phi} & \frac{\partial y}{\partial \theta} & \frac{\partial y}{\partial \psi} & 0 & 1 & 0 \\ \frac{\partial z}{\partial \phi} & \frac{\partial z}{\partial \theta} & \frac{\partial z}{\partial \psi} & 0 & 0 & 1 \end{bmatrix}. \qquad (2.6.0.7)

The exact formulation of ∇f_k can be found in Appendix A. After computing ∇f, we apply a QR decomposition \sqrt{\hat{M}}^{-1} \nabla f^\top G^\top = QR and derive d_z and d_λ from Equation 2.6.0.3 as

d_z = \sqrt{\hat{M}}^{-1} \left( I - Q Q^\top \right) \sqrt{\hat{M}}^{-1} \frac{\partial L}{\partial z}^\top \qquad (2.6.0.8)
d_\lambda = D(\lambda^*)^{-1} R^{-1} Q^\top \sqrt{\hat{M}}^{-1} \frac{\partial L}{\partial z}^\top. \qquad (2.6.0.9)

The cost of the QR decomposition is O(nm^2), reducing the computation from O((n + m)^3).
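For reference, the sketch below spells out Equations 2.6.0.8 and 2.6.0.9 for a single impact zone with dense NumPy arrays. The diagonal mass-matrix assumption, the strictly positive multipliers, and the variable shapes are illustrative simplifications rather than the actual implementation.

```python
import numpy as np

def collision_backward(M_hat, G, jac_f, lam, dL_dz):
    """Backward pass for one impact zone via the QR scheme (Eqs. 2.6.0.8-2.6.0.9).

    M_hat : (n, n) generalized mass matrix of the zone (assumed diagonal here)
    G     : (m, k) constraint matrix acting on the stacked contact points
    jac_f : (k, n) Jacobian of the map from generalized coordinates to contact points
    lam   : (m,)   Lagrange multipliers at the solution (assumed nonzero)
    dL_dz : (n,)   gradient of the loss w.r.t. the resolved coordinates
    """
    # sqrt(M)^-1, using the diagonal assumption for simplicity.
    sqrt_M_inv = np.diag(1.0 / np.sqrt(np.diag(M_hat)))

    # QR decomposition of sqrt(M)^-1 (G grad_f)^T, an (n, m) matrix with n >= m.
    Q, R = np.linalg.qr(sqrt_M_inv @ (G @ jac_f).T)

    rhs = sqrt_M_inv @ dL_dz
    d_z = sqrt_M_inv @ (rhs - Q @ (Q.T @ rhs))     # Eq. 2.6.0.8
    d_lam = np.linalg.solve(R, Q.T @ rhs) / lam    # Eq. 2.6.0.9, D(lambda)^-1 applied last

    dL_dq = d_z @ M_hat                            # Eq. 2.6.0.4
    return dL_dq, d_z, d_lam
```

Only the m-by-m triangular solve and a few matrix-vector products remain after the QR factorization, which is what yields the O(nm^2) cost discussed above.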
2.7 Experiments

We conduct a variety of experiments to evaluate the presented approach. We begin by comparing to other differentiable physics frameworks, with particular attention to scalability and generality. We then conduct ablation studies to evaluate the impact of the techniques presented in Sections 2.5 and 2.6. Lastly, we provide case studies that illustrate the application of the presented approach to inverse problems and learning control. The automatic differentiation is implemented in PyTorch 1.3 [121]. All experiments are run on an Intel(R) Xeon(R) W-2123 CPU @ 3.60GHz.

2.7.1 Scalability

We construct two benchmark scenes and vary them in controlled fashion to evaluate scalability. The scenes are illustrated in Figure 2.3. In the first, objects fall from the air, hit the ground, and finally settle. To test scalability, we increase the number of objects while maintaining the stride between objects. As the number of objects increases, the spatial extent of the scene expands accordingly. In the second scene, a rigid bunny strikes a deformable cloth. We vary the relative scale of the bunny and the cloth to test the ability of simulation frameworks to handle objects with geometric features at different resolutions.

Few differentiable simulation frameworks are capable of modeling these scenes. de Avila Belbute-Peres et al. [2] only simulate 2D scenes with a restricted repertoire of primitives. Liang et al. [3] simulate cloth but cannot handle rigid-body dynamics. We therefore choose the simulation framework of Hu et al. [6] as our main baseline. This framework uses the material point method (MPM), which leverages particles and grids to model solids. We use the state-of-the-art implementation from the high-performance DiffTaichi library [19].

Figure 2.3: Scalability. (a) Benchmark scenes. Few differentiable simulation frameworks are capable of modeling these scenes. We compare to the expressive MPM-based framework ChainQueen [6], implemented in the high-performance DiffTaichi library [19]. (b, c) Runtime and memory consumption as the scenes are varied in controlled fashion. Memory consumption is peak memory usage. Time is the running time for simulating 2 seconds of dynamics. Top: the number of objects in the scene increases from 20 to 1000. Bottom: the relative scale of the cloth and the bunny increases from 1:1 to 10:1. The MPM-based framework consumes up to two orders of magnitude more memory and computation before it runs out of memory.

The results are shown in Figure 2.3. The first row reports the runtime and memory consumption of the two frameworks as the number of objects in the scene increases from 20 to 1000. Memory consumption is peak memory usage. Time is the running time for simulating 2 seconds of dynamics. Our runtime and memory consumption increase linearly in the complexity of the scene, while the MPM-based method scales cubically until it runs out of memory at 200 objects and a 640³ grid. The second row reports runtime and memory consumption as the relative sizes of the cloth and the bunny are varied from 1:1 to 10:1. The runtime and memory consumption of our method stay constant. In contrast, as the size of the cloth grows, the MPM-based framework is forced to allocate more and more memory and expend greater and greater computation.

These experiments indicate that scalability is a significant advantage of our method. Since we do not need to quantize space, the extent of the scene and the relative sizes of objects do not dominate our runtime and memory consumption. Since our method dynamically detects and handles collisions locally, as needed, rather than setting up a global optimization problem, the runtime scales linearly (rather than quadratically) with scene complexity.

2.7.2 Ablation Studies

We proposed localized collision handling in Section 2.5 and fast differentiation in Section 2.6 to make our method scalable to large scenes with many degrees of freedom. We conduct dedicated ablation studies to assess the contribution of these techniques.

Localized collision handling. We compare with the LCP-based rigid-body differentiable simulation developed by de Avila Belbute-Peres et al. [2]. This baseline also uses implicit differentiation to compute derivatives of optimization solvers [116]. The environment is shown in Figure 2.4(a).

Figure 2.4: Ablation studies. (a) Comparing global and local collision handling. Cubes are released from the air and fall to the ground. (b) Evaluating the contribution of the fast differentiation scheme. Cubes are densely stacked, forming a big connected component. All collisions need to be solved simultaneously because the motion of one cube can affect all others.

N (= 100, 200, 300) cubes are released above the ground plane. They fall down and hit the ground. We use the same environment for both methods, although the LCP-based framework only simulates in 2D and thus has only half the degrees of freedom of our 3D simulation (3 versus 6 per object, accounting for translation and rotation). We disable our fast differentiation method in this experiment in order to neutralize its effect and conduct a controlled comparison between global and local collision handling.
The implementation of de Avila Belbute-Peres et al. [2] uses four threads while our method uses only one. Results are reported in Table 2.1. Our sparse collision handling method runs up to 5 times faster than the LCP-based approach, and the performance gap widens as the complexity of the scene increases.

# of cubes   100              200              300
LCP          0.73s ± 0.017s   2.87s ± 0.103s   8.42s ± 0.190s
Ours         0.56s ± 0.009s   1.11s ± 0.012s   1.65s ± 0.025s

Table 2.1: Runtime of backpropagation (in seconds per simulation step) with LCP-based collision handling [2] and our approach. Simulation of the scene shown in Figure 2.4(a). The LCP-based framework simulates in 2D and uses 4 threads, while our implementation simulates in 3D and uses 1 thread. Our approach is faster, and the performance gap widens with the complexity of the scene.

Fast differentiation. We now evaluate the contribution of the acceleration scheme described in Section 2.6. The environment is shown in Figure 2.4(b). N (= 100, 200, 300) cubes are stacked in two layers. During collision handling, all contacts between cubes form one connected component. All constraints need to be solved in one big optimization problem. Thus the linear system in Equation 2.6.0.3 is large.

We evaluate the runtime of backpropagation in this scene with and without the presented acceleration scheme. The ablation condition is referred to as 'W/o FD', where FD refers to fast differentiation. In this condition, the derivative of Equation 2.5.0.3 is calculated by solving Equation 2.6.0.3 directly. The results are reported in Table 2.2. The fast differentiation accelerates backpropagation by up to 16.8x in this scene, and the impact of the technique increases with the complexity of the scene.

# of cubes   100              200              300
W/o FD       1.43s ± 0.015s   7.76s ± 0.302s   21.88s ± 0.125s
Ours         0.41s ± 0.004s   0.86s ± 0.008s   1.30s ± 0.008s
Speedup      3.49x            9.02x            16.83x

Table 2.2: Runtime of backpropagation (in seconds per simulation step) with and without our fast differentiation scheme. Simulation of the scene shown in Figure 2.4(b). N (= 100, 200, 300) cubes are stacked in two layers. The impact of the acceleration scheme increases with the complexity of the scene.

2.7.3 Two-way coupling

One of the advantages of the presented framework is that it can handle both rigid bodies and deformable objects such as cloth. This is illustrated in the two scenes shown in Figure 2.5. In Figure 2.5(a), a Stanford bunny and an armadillo stand on a piece of cloth. The cloth is lifted by its corners, envelops the figurines, and lifts them up. In Figure 2.5(b), a piece of cloth strikes a domino, which begins a chain reaction that propagates until the last domino strikes the cloth from behind. The scenes are also shown in the supplementary video. No prior differentiable simulation framework is capable of modeling these scenes properly. Liang et al. [3] can simulate the cloth, but its motion does not affect the figurines. The framework of Hu et al. [6, 19] does not enforce rigidity and suffers from interpenetration. In contrast, our approach simulates the correct dynamics and the two-way coupling between rigid and deformable objects.

We further compare with MuJoCo, a popular physics engine [122]. MuJoCo models cloth as a 2D grid of capsule and ellipsoid geoms in addition to spheres. This representation fails to correctly handle collisions near the holes in the grid. In Figure 2.6, we simulate a rigid body falling onto a trampoline and bouncing back. When simulated in MuJoCo, the ball penetrates the trampoline.
Our method simulates the interaction correctly.

Figure 2.5: Two-way coupling of rigid bodies and cloth. Top: initial state. Bottom: towards the end of the simulation. (a) Two figurines are lifted by a piece of cloth. (b) Cloth strikes a domino and is later struck back. No prior differentiable simulation framework is capable of modeling these scenes properly.

Figure 2.6: Comparison with MuJoCo. Simulation of a ball interacting with a trampoline. Top: initial state. Bottom: later in the simulation. (a) Simulation in MuJoCo. (b) Simulation by our method. MuJoCo models cloth as a 2D grid of capsule and ellipsoid geoms in addition to spheres. The ball penetrates the trampoline when the grid is sparse.

2.7.4 Applications

Inverse problem. Differentiable simulation naturally lends itself to gradient-based optimization for inverse problems. A case study is shown in Figure 2.7, in which a marble is supported by a soft sheet. The sheet is fixed at its four corners. In each time step, an external force is applied to the marble. The goal is to find a sequence of forces that drives the marble to a target position in 2 seconds, while minimizing the total amount of applied force. The vertical component of the external force is set to 0 so that the marble has to interact with the cloth before reaching the target. We compare with a derivative-free optimization algorithm, CMA-ES [123]. Our framework enables gradient-based optimization in this setting and converges in 4 iterations, reaching a lower objective value than what CMA-ES achieves after two orders of magnitude more iterations.

Figure 2.7: Inverse problem. (a) Marble on a sheet. The goal is to bring the marble to the desired position with minimal force. (b) Objective function. Our framework enables gradient-based optimization for this problem and quickly converges to a lower objective value than derivative-free optimization (CMA-ES). We perform 5 runs with different random seeds; shaded areas represent one standard deviation.

Learning control. Our second set of case studies uses the differentiable simulator to backpropagate gradients into a neural network that must learn to manipulate objects to achieve desired outcomes. The scenarios are shown in Figure 2.8. The neural network controls a pair of sticks and a piece of cloth, respectively, which must be used to manipulate an object. The goal is to bring the object to the desired position within 1 second. The optimization objective is the L2 distance to the target. The neural network is an MLP with 50 nodes in the first layer and 200 nodes in the second, with ReLU activations. For reference, we report the performance of DDPG, a model-free reinforcement learning algorithm [124]. In each episode, we fix the initial positions of the manipulator and the object while the target position is randomized. For both methods, the input to the network is a concatenated vector of the relative distance to the target, the speed, and the remaining time. The DDPG reward is the negative L2 distance to the target.

Figure 2.8: Learning control. A neural network must learn to control (a) a pair of sticks or (b) a piece of cloth. The trained controller must be able to bring the manipulated object to any desired location specified at test time. Our method enables gradient-based training of the controller. A model-free reinforcement learning baseline (DDPG) fails to learn the task on a comparable time scale. We perform 5 runs with different random seeds; shaded areas represent one standard deviation.
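For illustration, the following is a minimal PyTorch sketch of this training pattern. The cloth and stick environments are replaced by a toy differentiable point-mass "simulator", so the snippet only demonstrates how the task loss is backpropagated through the physics steps into the controller; it is not the actual implementation, and all numerical values are illustrative.

```python
import torch
import torch.nn as nn

# Toy differentiable "simulator": a point mass pushed by the network's force.
# Because the step is written in differentiable tensor operations, the final
# loss can be backpropagated through the entire rollout into the controller.
def sim_step(pos, vel, force, dt=0.02, mass=1.0, damping=0.98):
    vel = damping * (vel + dt * force / mass)
    pos = pos + dt * vel
    return pos, vel

policy = nn.Sequential(nn.Linear(7, 50), nn.ReLU(),
                       nn.Linear(50, 200), nn.ReLU(),
                       nn.Linear(200, 3))            # MLP controller (50 and 200 hidden units)
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
horizon, dt = 50, 0.02

for episode in range(200):
    target = torch.rand(3)                           # randomized goal, as in the case study
    pos, vel = torch.zeros(3), torch.zeros(3)
    for t in range(horizon):
        remaining = torch.tensor([(horizon - t) * dt])
        obs = torch.cat([target - pos, vel, remaining])  # relative distance, speed, time left
        pos, vel = sim_step(pos, vel, policy(obs))
    loss = ((pos - target) ** 2).sum()               # L2 distance to the target
    opt.zero_grad()
    loss.backward()                                  # gradients flow through every physics step
    opt.step()                                       # one network update per episode
```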
Our method updates the network once at the end of each episode, while DDPG receives a reward signal and updates the network weights in each time step. As shown in the loss curves in Figure 2.8, our method quickly converges to a good control policy, while DDPG fails to learn the task on a comparable time scale. This example illustrates the power of gradient-based optimization with our differentiable physics framework.

Figure 2.9: Parameter estimation. (a) Initial guess. (b) Target observation. Cubes start with opposite velocities and collide with each other. Our goal is to estimate the mass of the left cube such that the total momentum after the collision has the desired direction and magnitude. (a) shows the initial estimate: both cubes have the same mass and the total momentum after the collision is 0. (b) shows the simulation with the mass produced by our estimation procedure: the momentum points to the right and has a magnitude of 3, as specified in the target observation.

Parameter estimation. Given an observation of motion, our method can estimate unknown physical parameters of objects in the scene. In Figure 2.9, two cubes start with opposite velocities $\pm v$ and collide with each other. Given a target momentum observation $p = m_1 v'_1 + m_2 v'_2$, where $v'_{1,2}$ are the velocities after the collision and $m_{1,2}$ are the masses, we aim to estimate the mass $m_1$. The target momentum is set to $p = (3, 0, 0)$. We initialize with $m_1 = m_2 = 1$, which yields $p = (0, 0, 0)$ (Figure 2.9(a)). After 90 gradient steps, our method arrives at the estimate $m_1 = 5.4$ and achieves the desired momentum (Figure 2.9(b)).
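As a worked example of this optimization pattern (not the actual 3D scene, which is simulated by our differentiable simulator), the sketch below fits an unknown mass by gradient descent through a hand-written 1-D elastic-collision model; all numerical values are illustrative assumptions.

```python
import torch

# 1-D stand-in for the cube-collision scene: two bodies approach with opposite
# velocities and collide elastically. The post-collision momentum is a
# differentiable function of the unknown mass m1, so it can be fitted by
# gradient descent; only the optimization pattern is illustrated here.
def post_collision_momentum(m1, m2, v):
    v1, v2 = v, -v                                    # opposite initial velocities
    v1p = ((m1 - m2) * v1 + 2 * m2 * v2) / (m1 + m2)  # elastic collision formulas
    v2p = ((m2 - m1) * v2 + 2 * m1 * v1) / (m1 + m2)
    return m1 * v1p + m2 * v2p

target_p = torch.tensor(3.0)                          # desired momentum magnitude
m1 = torch.tensor(1.0, requires_grad=True)            # initial guess: equal masses
m2, v = torch.tensor(1.0), torch.tensor(1.0)          # known mass and speed (illustrative)
opt = torch.optim.Adam([m1], lr=0.1)

for step in range(90):
    loss = (post_collision_momentum(m1, m2, v) - target_p) ** 2
    opt.zero_grad()
    loss.backward()
    opt.step()

print(m1.item())   # approaches m2 + p / v, the mass that produces the target momentum
```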
Interoperability across simulators. Our differentiable simulation framework is interoperable with other physics simulators. We have evaluated this interoperability with MuJoCo, a popular non-differentiable simulator [122]. The experiment is illustrated in Figure 2.10. We place three cubes on smooth ground. The goal is to apply forces to the cubes in order to make them stick together. The objective function is the distance of each cube to its target position plus the L2 norm of the force. We compute the loss in MuJoCo but evaluate the gradient in DiffSim, our differentiable simulator. Figure 2.10(a) shows the initial state. Figure 2.10(b) is the result of the simulation after 10 gradient steps. The goal of making the cubes stick together was accomplished. This experiment shows that physical states and control signals are interoperable between our differentiable framework and non-differentiable simulators.

Figure 2.10: Interoperability across simulators. (a) Initial state. (b) Target state. Three cubes are placed on smooth ground. The goal is to make the cubes stick together while minimizing the applied forces. The loss is computed in MuJoCo but the gradient is evaluated in DiffSim. (a) is the initial state and (b) is the successful simulation result after the optimization.

2.8 Conclusion

We developed a scalable differentiable simulation method that has linear complexity with respect to the number of objects and collisions, and constant complexity with respect to spatial resolution. We use meshes as our core representation, enabling support for objects of arbitrary shape at any scale. A unified local collision handling scheme enables fast coupling between objects with different materials. An acceleration scheme based on implicit differentiation speeds up backpropagation. Experiments show that the presented techniques speed up backpropagation by large multiplicative factors, and their impact increases with the complexity of the simulated scenes. We demonstrated the application of the presented method in a number of case studies, including end-to-end gradient-based training of neural network controllers. Training with differentiable simulation was shown to be significantly more effective than gradient-free and model-free baselines.

One avenue for future work is to extend the presented framework to other types of objects, materials, and contact models. The framework is sufficiently general to support deformable solids and articulated bodies, as long as they have mesh-based representations with repulsive contact forces. We plan to extend our implementation to incorporate such object classes.

Chapter 3: Efficient Differentiable Simulation of Articulated Bodies

3.1 Introduction

Differentiable physics enables efficient gradient-based optimization with dynamical systems. It has achieved promising results in both simulated [4, 6] and real environments [7, 16]. Our goal is to make articulated body simulation efficiently differentiable. We aim to maximize efficiency in both computation and memory use, in order to support fast gradient-based optimization of differentiable systems that interact with articulated bodies in physical environments.

Articulated bodies play a central role in robotics, computer graphics, and embodied AI. Many control systems are optimized via experiences collected in simulation [122, 125, 126, 127]. However, they do not have access to analytic derivatives of the articulated dynamics. Therefore, nearly all gradient-based approaches have to design strategies to compute the gradients indirectly when dealing with articulated bodies.

The straightforward way to differentiate the simulation is to use existing automatic differentiation tools [25, 128, 129]. However, autodiff tools consume prohibitive amounts of memory when there are many simulation steps. Autodiff tracks every operation and stores the intermediate results in order to perform backpropagation. In articulated body simulation, the iterative contact solver and the dynamics algorithm [130] yield exceedingly long computational graphs. In our experiments, differentiable simulators built with autodiff tools run out of memory after 5,000 simulation steps – just a few seconds of experience. As a result, learning is constrained to short experiences or forced to use large time steps, thus curtailing the scope of behaviors that can be learned or undermining the simulation's stability. Furthermore, the overhead of creating and storing the auxiliary variables for autodiff also slows down the forward simulation. Although autodiff tools like DiffTaichi [19] and JAX [21] can accelerate the simulation of fluids and deformable solids by vectorization, it is difficult to achieve the same speedup in articulated body simulation because the articulated dynamics algorithm is highly serialized, unlike the inherently parallel computation on grids and particles.

In this chapter, we design a differentiable articulated body simulation method that runs 10x faster with 1% of the memory consumption compared to differentiation based on autodiff tools.
In order to minimize the overhead of differentiation, we derive the gradients of articulated body simulation using the adjoint method [131]. The adjoint method has been applied to fluids [31] and multi-body systems [35], but these applications do not support physically correct differentiable simulation of articulated bodies. Our derivation of the operator adjoints uses spatial algebra [132]. Our method needs almost no additional computation during the forward simulation and is an order of magnitude faster than autodiff tools.

We further reduce memory requirements by adapting ideas from checkpointing [27, 133] to the differentiable simulation setting. In the forward pass, we store the initial simulation state for each time step. During backpropagation, we recreate intermediate variables by reproducing the simulation from the stored state. The overall runtime remains fast, while memory consumption is reduced by two orders of magnitude.

As an application of differentiable dynamics for articulated systems, we show that reinforcement learning (RL) can benefit from the knowledge of gradients in two ways. First, it can make use of the gradients computed by the simulation to generate extra samples using a first-order approximation. Second, during the policy learning phase, differentiable physics enables us to perform a one-step rollout of the objective value function so that the policy updates can be more accurate. Both schemes effectively improve the convergence speed and the attained reward. We also demonstrate applications of efficient differentiable articulated dynamics to inverse problems, such as motion control and parameter estimation. Gradient-based optimization enabled by our method accelerates convergence in these settings by more than an order of magnitude. Code is available on our project page: https://github.com/YilingQiao/diffarticulated

The contributions of this work are as follows:
• We derive the adjoint formulations for the entire articulated body simulation workflow, enabling a 10x acceleration over autodiff tools.
• We adapt the checkpointing method to the structure of articulated body simulation to reduce memory consumption by 100x, making stable collection of extended experiences feasible.
• We introduce two general schemes for accelerating reinforcement learning using differentiable physics.
• We demonstrate the utility of differentiable simulation of articulated bodies in motion control and parameter estimation, enhancing performance in these scenarios by more than an order of magnitude.

3.2 Related Work

Differentiable programming has been applied to rendering [86, 134, 135], image processing [136, 137], SLAM [138], and design [139]. Making complex systems differentiable enables learning and optimization using gradient-based methods. Our literature review focuses on differentiable physics, the adjoint method, and neural approximations of physical systems.

Differentiable physics. Differentiable physics provides gradients for learning, control, and inverse problems that involve physical systems. Degrave et al. [1] advocate using differentiable physics to solve control problems in robotics. de Avila Belbute-Peres et al. [2] obtain gradients of 2D rigid body dynamics. Liang et al. [3] use automatic differentiation tools to obtain gradients for cloth simulation. Qiao et al. [4] develop a more comprehensive differentiable physics engine for rigid bodies and cloth based on mesh representations.
Ingraham et al. [5] present differentiable simulation of protein dynamics. For volumetric data, ChainQueen [6] computes gradients of the MPM simulation pipeline. Bern et al. [7] use FEM to model soft robots and perform trajectory optimization with analytic gradients. Takahashi et al. [8] differentiate the simulation of fluids with solid coupling. Many exciting applications of differentiable physics have been explored [9, 10, 11, 12]. Huang et al. [13] propose a soft-body manipulation benchmark for differentiable physics. Toussaint et al. [14] manipulate tools with the help of differentiable physics. Song and Boularias [15] perform system identification by learning from the trajectory, and Song and Boularias [16] then use the estimated frictional properties to help robotic pushing.

A related line of work concerns the development of powerful automatic differentiation tools that can be used to differentiate simulation. DiffTaichi [19] provides a new programming language and a compiler, enabling the high-performance Taichi simulator to compute the gradients of the simulation. JAX MD [20] makes use of the JAX autodiff library [21] to differentiate molecular dynamics simulation. These works have specifically made use of vectorization on both CPU and GPU to achieve high performance on grids and particle sets, where the simulation is intrinsically parallel. In contrast, the simulation of articulated bodies is far more serial, as dynamics propagates along kinematic paths rather than acting in parallel on grid cells or particles. TinyDiffSim [22] provides a templatized simulation framework that can leverage existing autodiff tools such as CppAD [23], Ceres [24], and PyTorch [25] to differentiate simulation. However, these methods introduce significant overhead in tracing the computation graph and accumulate substantial computational and memory costs when applied to articulated bodies. Our method does not need to store the entire computation graph to compute the gradients. We store the initial state of each simulation step and reproduce the intermediate results when needed by the backward pass. This checkpointing strategy is also used in training neural ODEs [26] and large neural networks [27, 28, 29, 30]. We are the first to adapt this technique to articulated dynamics, achieving dramatic reductions in memory consumption and enabling stable simulation of long experiences.

Adjoint method. The adjoint method has been applied to fluid control [31], PDEs [32], light transport simulation [33], and neural ODEs [26, 34]. Recently, Geilinger et al. [35] proposed to use the adjoint method in multi-body dynamics. However, they operate in maximal coordinates and model body attachments using springs. This does not enforce physical validity of articulated bodies. In contrast, we operate in reduced coordinates and derive the adjoints for the articulated body algorithm and the spatial algebra operators that properly model the body's joints. This supports physically correct simulation with joint torques, limits, Coriolis forces, and proper transmission of internal forces between links.

Neural approximation. A number of works approximate physics simulation using neural networks [36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46].
Physical principles have also been incorporated in the design of neural networks [47, 48, 49, 50, 51]. Approximate simulation by neural networks is naturally differentiable, but the networks are not constrained to abide by the underlying physical dynamics, and simulation fidelity may degenerate outside the training distribution.

Figure 3.1: The workflow of a simulation step. Assume there are $n_q = 6$ DoF. The initial state is an $n_q \times 3$ matrix containing the position, velocity, and control input of each joint. The forward dynamics traverses the articulated body sequentially three times. This process is difficult to parallelize and generates a large number of intermediate results. After the forward dynamics there is a collision resolution step with collision detection and an iterative Gauss-Seidel solver. The size of the initial state (position, velocity, and control input) is much smaller than that of the intermediate results, but the initial state carries all the information needed to recover the intermediate variables.

3.3 Preliminaries

Articulated body dynamics. For the forward simulation, we choose the recursive Articulated Body Algorithm (ABA) [130], which has O(n) complexity and is widely used in articulated body simulators [122, 125, 126]. In each simulation step k, the state $x_k$ of the system consists of the configuration in generalized coordinates and its velocity, $x_k = [q_k, \dot{q}_k] \in \mathbb{R}^{2 n_q \times 1}$, where $n_q$ is the number of degrees of freedom in the system. Assuming each joint can be actuated by a torque, there is an $n_q$-dimensional control signal $u_k \in \mathbb{R}^{n_q \times 1}$. The discretized dynamics at this step can be written as $f_{k+1}(u_k, x_k, x_{k+1}) = 0$.

Adjoint method. We can concatenate the states of the entire time sequence into $x = [x_1, x_2, \dots, x_{n_t}] \in \mathbb{R}^{n_t \cdot 2 n_q \times 1}$, with corresponding control input $u = [u_1, u_2, \dots, u_{n_t}] \in \mathbb{R}^{n_t \cdot n_u \times 1}$, where $n_t$ is the number of simulation steps. The concatenated dynamics equations can be similarly written as $f = [f_1, f_2, \dots, f_{n_t}] = 0$. For a learning or optimization problem, we would usually define a scalar objective function $\Phi(x, u)$. To get the derivative of this function w.r.t. the control input u, one needs to compute
\[
\frac{d\Phi}{du} = \frac{\partial \Phi}{\partial u} + \frac{\partial \Phi}{\partial x} \frac{dx}{du}.
\tag{3.3.0.1}
\]
$\frac{\partial \Phi}{\partial u}$ is easy to compute for each single simulation step. But it is prohibitively expensive to directly solve for $\frac{\partial \Phi}{\partial x} \frac{dx}{du}$, because $\frac{dx}{du}$ would be a $2 n_q n_t \times n_q n_t$ matrix. Instead, we differentiate the dynamics $f(x, u) = 0$ and obtain the constraint $\frac{\partial f}{\partial x} \frac{dx}{du} = -\frac{\partial f}{\partial u}$. Applying the adjoint method [131], $\frac{\partial \Phi}{\partial x} \frac{dx}{du}$ equals $-R^\top \frac{\partial f}{\partial u}$, where $R$ satisfies
\[
\left( \frac{\partial f}{\partial x} \right)^{\!\top} R = \left( \frac{\partial \Phi}{\partial x} \right)^{\!\top}.
\tag{3.3.0.2}
\]
Since the dynamics equation for one step only involves a small number of previous steps, the sparsity of $\frac{\partial f}{\partial u}$ and $\frac{\partial f}{\partial x}$ makes it easier to solve for the variable $R$ in Eq. 3.3.0.2. $\frac{\partial \Phi}{\partial u} - R^\top \frac{\partial f}{\partial u}$ is called the adjoint of u and is equivalent to the gradient by our substitution. In the following derivation, we denote the adjoint of a variable $s$ by $\bar{s}$, and the adjoint of a function $f(\cdot)$ by $\bar{f}(\cdot)$.
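The following toy sketch illustrates this recipe on a generic explicit-Euler dynamics; the dynamics function g, its Jacobians, and the objective derivatives are arbitrary placeholders rather than the articulated-body quantities, and the backward recursion it exploits is the one derived in the next section.

```python
import numpy as np

# Toy illustration of the adjoint method (Eqs. 3.3.0.1-3.3.0.2) for
# x_{k+1} = x_k + dt * g(x_k, u_k), i.e.
# f_{k+1}(u_k, x_k, x_{k+1}) = x_{k+1} - x_k - dt * g(x_k, u_k) = 0.
def g(x, u):      return -0.5 * x + np.tanh(u)
def dg_dx(x, u):  return -0.5 * np.eye(len(x))
def dg_du(x, u):  return np.diag(1.0 - np.tanh(u) ** 2)

def adjoint_gradients(x0, us, dt, dPhi_dx, dPhi_du):
    # Forward pass: store only the per-step states (these double as the
    # checkpoints of Section 3.4.2).
    xs = [x0]
    for u in us:
        xs.append(xs[-1] + dt * g(xs[-1], u))

    nt = len(us)
    grads_u = [None] * nt
    R = dPhi_dx(xs[nt])                 # R_{nt} = (dPhi/dx_{nt})^T for explicit Euler
    for k in range(nt - 1, -1, -1):
        # df_{k+1}/du_k = -dt * dg/du  and  df_{k+1}/dx_k = -I - dt * dg/dx,
        # so dPhi/du_k = dPhi/du_k + dt * (dg/du)^T R_{k+1}.
        grads_u[k] = dPhi_du(us[k]) + dt * dg_du(xs[k], us[k]).T @ R
        R = dPhi_dx(xs[k]) + (np.eye(len(x0)) + dt * dg_dx(xs[k], us[k])).T @ R
    return grads_u                      # dPhi/du_k for every step
```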
3.4 Efficient Differentiation

In this section, we introduce our algorithm for the gradient computation in articulated body simulation. For faster differentiation, the gradients of articulated body dynamics are computed by the adjoint method. We derive the adjoints of time integration, contact resolution, and forward dynamics, in reverse order. We then adapt the checkpointing method to articulated body simulation to reduce memory consumption.

3.4.1 Adjoint Method for Articulated Dynamics

One forward simulation step can be split into five modules, as shown in Figure 3.1: perform the forward kinematics from the root link to the leaf links, update the forces from the leaves to the root, update the accelerations from the root to the leaves, detect and resolve collisions, and perform time integration. Backpropagation proceeds through these modules in reverse order.

Time integration. Backpropagation starts from the time integration. As an example, for a simulation sequence with $n_t = 3$ steps and an integrator with a temporal horizon of 2, the constraint $(\frac{\partial f}{\partial x})^\top R = (\frac{\partial \Phi}{\partial x})^\top$ in Equation 3.3.0.2 can be expanded as
\[
\begin{bmatrix}
\left(\frac{\partial f_1}{\partial x_1}\right)^{\!\top} & \left(\frac{\partial f_2}{\partial x_1}\right)^{\!\top} & 0 \\
0 & \left(\frac{\partial f_2}{\partial x_2}\right)^{\!\top} & \left(\frac{\partial f_3}{\partial x_2}\right)^{\!\top} \\
0 & 0 & \left(\frac{\partial f_3}{\partial x_3}\right)^{\!\top}
\end{bmatrix}
\begin{bmatrix} R_1 \\ R_2 \\ R_3 \end{bmatrix}
=
\begin{bmatrix}
\left(\frac{\partial \Phi}{\partial x_1}\right)^{\!\top} \\
\left(\frac{\partial \Phi}{\partial x_2}\right)^{\!\top} \\
\left(\frac{\partial \Phi}{\partial x_3}\right)^{\!\top}
\end{bmatrix}.
\tag{3.4.1.1}
\]
If using the explicit Euler integrator, we have $\frac{\partial f_k}{\partial x_k} = I$. Initially, $R_{n_t} = \left(\frac{\partial \Phi}{\partial x_{n_t}}\right)^{\!\top}$. The remaining $R_k$ can be computed iteratively by
\[
R_k = \left(\frac{\partial \Phi}{\partial x_k}\right)^{\!\top} - \left(\frac{\partial f_{k+1}}{\partial x_k}\right)^{\!\top} R_{k+1}, \quad k = 1, 2, \dots, n_t - 1.
\tag{3.4.1.2}
\]
As $R_k$ backpropagates through time, the gradients $\bar{u}_k$ can also be computed via Equation 3.3.0.2. In fact, by the way we calculate $R_k$, it equals the gradient of $x_k$. Other parameters can be computed in the same way as $u_k$.

Collision resolution. The collision resolution step consists of collision detection and a collision solver. Upon receiving the gradients $\bar{x}$ from the time integrator, the collision solver needs to pass the gradients to detection, and then to the forward dynamics. In our collision solver, we construct a Mixed Linear Complementarity Problem (MLCP) [140]:
\[
a = A x + b \quad \text{s.t.} \quad 0 \le a \perp x \ge 0 \ \text{ and } \ c^- \le x \le c^+,
\tag{3.4.1.3}
\]
where x is the new collision-free state, A is the inertial matrix, b contains the relative velocities, and $c^-$, $c^+$ are the lower-bound and upper-bound constraints, respectively. We use the projected Gauss-Seidel (PGS) method to solve this MLCP. This iterative solve trades accuracy for speed, so the constraint $a^\top x = 0$ might not hold on termination. In this setting, where the solution is not guaranteed to satisfy the constraints, implicit differentiation [2, 3, 4] no longer works. Instead, we design a reverse version of the PGS solver using the adjoint method to compute the gradients. Further details are in the supplement. In essence, the solver mirrors PGS, substituting the adjoints for the original operators. This step passes the gradients from the collision-free states to the forward dynamics.

Forward dynamics. Our articulated body simulator propagates the forward dynamics as shown in the green blocks of Figure 3.1. Each operation in the forward simulation has a corresponding adjoint operation in the backward pass. To compute the gradients, we derive adjoint rules for the operators in spatial algebra [132]. As a simple example, a spatial vector $p_i = [w, v] \in \mathbb{R}^6$ representing the bias force of the i-th link is the sum of an external force $[f_1, f_2] \in \mathbb{R}^6$ and a cross product of two spatial motion vectors $[w_1, v_1], [w_2, v_2] \in \mathbb{R}^6$:
\[
\begin{bmatrix} w \\ v \end{bmatrix}
=
\begin{bmatrix}
f_1 + w_1 \times w_2 + v_1 \times v_2 \\
f_2 + w_1 \times v_2
\end{bmatrix}.
\tag{3.4.1.4}
\]
Once we obtain the adjoint $[\bar{w}, \bar{v}]$ of $p_i = [w, v]$, we can propagate it to the inputs:
\[
\begin{bmatrix} \bar{w}_1 \\ \bar{v}_1 \end{bmatrix}
=
\begin{bmatrix}
-\bar{w} \times w_2 - \bar{v} \times v_2 \\
-\bar{w} \times v_2
\end{bmatrix},
\quad
\begin{bmatrix} \bar{w}_2 \\ \bar{v}_2 \end{bmatrix}
=
\begin{bmatrix}
\bar{w} \times w_1 \\
\bar{w} \times v_1 + \bar{v} \times w_1
\end{bmatrix},
\quad
\begin{bmatrix} \bar{f}_1 \\ \bar{f}_2 \end{bmatrix}
=
\begin{bmatrix} \bar{w} \\ \bar{v} \end{bmatrix}.
\tag{3.4.1.5}
\]
This example shows the adjoint of one forward operation. The time and space complexity of the adjoint operations matches that of the forward simulation. The supplement provides more details on the adjoint operations.
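A direct transcription of this operator and its adjoint is sketched below; the `bar_` prefixes stand for the overbar (adjoint) notation, and the function names are illustrative rather than those of the actual implementation.

```python
import numpy as np

def bias_force(f, m1, m2):
    """Forward op (Eq. 3.4.1.4): p = [w, v] from an external force f = [f1, f2]
    and two spatial motion vectors m1 = [w1, v1], m2 = [w2, v2], each in R^6."""
    f1, f2 = f[:3], f[3:]
    w1, v1 = m1[:3], m1[3:]
    w2, v2 = m2[:3], m2[3:]
    w = f1 + np.cross(w1, w2) + np.cross(v1, v2)
    v = f2 + np.cross(w1, v2)
    return np.concatenate([w, v])

def bias_force_adjoint(bar_p, m1, m2):
    """Adjoint op (Eq. 3.4.1.5): given bar_p = [bar_w, bar_v], return the
    adjoints of the inputs (bar_f, bar_m1, bar_m2)."""
    bar_w, bar_v = bar_p[:3], bar_p[3:]
    w1, v1 = m1[:3], m1[3:]
    w2, v2 = m2[:3], m2[3:]
    bar_w1 = -np.cross(bar_w, w2) - np.cross(bar_v, v2)
    bar_v1 = -np.cross(bar_w, v2)
    bar_w2 = np.cross(bar_w, w1)
    bar_v2 = np.cross(bar_w, v1) + np.cross(bar_v, w1)
    bar_f = bar_p.copy()                   # [bar_f1, bar_f2] = [bar_w, bar_v]
    return bar_f, np.concatenate([bar_w1, bar_v1]), np.concatenate([bar_w2, bar_v2])
```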
3.4.2 Checkpointing for Articulated Dynamics

The input and output variables of one simulation step have relatively small dimensionalities (positions, velocities, and accelerations), but many more intermediate values are computed during the step. Although this is not a bottleneck for forward simulation, because the memory allocated for intermediate results can be reused across time steps, it becomes a problem for differentiable simulation, which needs to store the entire computation graph for the backward pass. This causes memory consumption to explode when the simulation proceeds over many time steps, as is the case when small time steps are needed for accuracy and stability, and when the effects of actions take time to become apparent. The need to scale to many simulation steps motivates our adaptation of the checkpointing method [27, 133].

Instead of saving the entire computation graph, we choose to store only the initial state of each time step. We use this granularity because (a) the states of each step have the smallest size among all essential information sufficient to replicate the simulation step, (b) finer checkpoints are not useful because at least one step of intermediate results needs to be stored in order to run the simulation, and (c) sparser checkpoints would use less memory but require multiple steps to reconstruct the intermediate variables, costing more memory and computation. We validate our checkpointing policy with an ablation study in the supplement.

Figure 3.2 illustrates the scheme. During forward simulation, we store the simulation state at the beginning of each time step. During backpropagation, we reload the checkpoints (blue arrows), redo the forward (one-step) simulation to generate the computation graph, and then compute the gradients using the adjoint method in reverse order (red arrows).

Figure 3.2: Differentiation of articulated body dynamics. Top: forward simulation at step k. The simulator stores the states $q_k$, $\dot{q}_k$ and the control input $u_k$. Bottom: during backpropagation, the checkpoints are reloaded and the simulator runs one forward step to reconstruct the intermediate variables. Beginning with the gradients from step k+1, we use the adjoint method to sequentially compute the gradients of all variables through time integration, the constraint solver, and forward dynamics.

In summary, assume the simulation step consists of two parts:
\[
Z_k = G(x_{k-1}), \qquad x_k = F(Z_k),
\tag{3.4.2.1}
\]
where $Z_k$ represents all the intermediate results. After each step, we free the space consumed by $Z_k$, storing only $x_k$. During backpropagation, we recompute $Z_k$ from $x_{k-1}$ and use the adjoint method to compute the gradients of the previous step:
\[
Z_k = G(x_{k-1}),
\tag{3.4.2.2}
\]
\[
\bar{Z}_k = \bar{F}(Z_k, \bar{x}_k),
\tag{3.4.2.3}
\]
\[
\bar{x}_{k-1} = \bar{G}(x_{k-1}, \bar{Z}_k).
\tag{3.4.2.4}
\]
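A minimal sketch of this scheme is shown below. G, F and their adjoints are left as placeholders for the simulator's forward-dynamics/collision stage, its time-integration stage, and their adjoint routines; the function names are illustrative.

```python
def simulate_and_backprop(x0, n_steps, G, F, G_adj, F_adj, dPhi_dx_final):
    """Checkpointed forward simulation and adjoint backward pass
    (Eqs. 3.4.2.1-3.4.2.4)."""
    # Forward pass: keep only the per-step states x_k (the checkpoints);
    # the intermediate results Z_k are discarded after every step.
    xs = [x0]
    for _ in range(n_steps):
        Z = G(xs[-1])                       # Eq. 3.4.2.1: Z_k = G(x_{k-1})
        xs.append(F(Z))                     #              x_k = F(Z_k)

    # Backward pass: reload each checkpoint, redo one forward step to
    # recover Z_k, then apply the adjoints in reverse order.
    bar_x = dPhi_dx_final(xs[-1])
    for k in range(n_steps, 0, -1):
        Z = G(xs[k - 1])                    # Eq. 3.4.2.2: recompute intermediates
        bar_Z = F_adj(Z, bar_x)             # Eq. 3.4.2.3
        bar_x = G_adj(xs[k - 1], bar_Z)     # Eq. 3.4.2.4
    # In the full simulator, control gradients are accumulated inside G_adj;
    # here we simply return the gradient w.r.t. the initial state.
    return bar_x
```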
3.5 Reinforcement Learning

Our method can be applied to simulation-based reinforcement learning (RL) tasks to improve policy performance as well as convergence speed. By computing gradients with respect to the actions, differentiable simulation can provide more information about the environment. We suggest two approaches to integrating differentiable physics into RL.

Sample enhancement. We can make use of the computed gradients to generate samples in the neighborhood of an existing sample. Specifically, given a sample $(s, a_0, s'_0, r_0)$ from the history, with observation $s$, action $a_0$, next-step observation $s'_0$, and reward $r_0$, we can generate new samples $(s, a_k, s'_k, r_k)$ using a first-order approximation:
\[
a_k = a_0 + \Delta a_k, \qquad
s'_k = s'_0 + \frac{\partial s'_0}{\partial a_0} \Delta a_k, \qquad
r_k = r_0 + \frac{\partial r_0}{\partial a_0} \Delta a_k,
\]
where $\Delta a_k$ is a random perturbation vector. This method, which we call sample enhancement, effectively generates as many approximately accurate samples as desired for learning purposes. By providing a sufficient number of generated samples in the neighborhood, the critic can get a better grasp of the function shape (patchwise rather than pointwise), and can thus learn and converge faster.

Policy enhancement. Alternatively, the policy update can be adjusted to a form compatible with differentiable physics, usually dependent on the specific RL algorithm. For example, in MBPO [141], the policy network is updated using the gradients of the critic:
\[
L_\mu = -Q(s, \mu(s)) + Z,
\tag{3.5.0.1}
\]
where $L_\mu$ is the loss for the policy network $\mu$, $Q$ is the value function for state-action pairs, and $Z$ is the regularization term. To make use of the gradients, we can expand the Q function one step forward,
\[
\frac{\partial Q(s, a)}{\partial a} = \frac{\partial r}{\partial a} + \gamma \frac{\partial Q(s', \mu(s'))}{\partial s'} \frac{\partial s'}{\partial a},
\tag{3.5.0.2}
\]
and substitute it in Eq. 3.5.0.1:
\[
L'_\mu = -\frac{\partial Q(s, a)}{\partial a}\, \mu(s) + Z.
\tag{3.5.0.3}
\]
Eqs. 3.5.0.1 and 3.5.0.3 yield the same gradients with respect to the action, which in turn provide the gradients of the network parameters. This method, which we call policy enhancement, constructs loss functions with embedded ground-truth gradients so that the policy updates are physics-aware and generally more accurate than merely looking at the Q function itself, thus achieving faster convergence and potentially even higher reward.
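The sample-enhancement scheme reduces to a few lines; in the sketch below, `ds_da` and `dr_da` stand for the Jacobians $\partial s'_0 / \partial a_0$ and $\partial r_0 / \partial a_0$ returned by the differentiable simulator, and the perturbation scale is an arbitrary illustrative choice.

```python
import numpy as np

def enhance_samples(s, a0, s_next0, r0, ds_da, dr_da, n_extra=8, sigma=0.01):
    """Generate approximate transitions around (s, a0, s'_0, r_0) by
    first-order expansion in the action."""
    samples = []
    for _ in range(n_extra):
        delta_a = sigma * np.random.randn(*a0.shape)
        a_k = a0 + delta_a
        s_next_k = s_next0 + ds_da @ delta_a     # s'_k = s'_0 + (ds'_0/da_0) * da
        r_k = r0 + dr_da @ delta_a               # r_k  = r_0  + (dr_0/da_0)  * da
        samples.append((s, a_k, s_next_k, r_k))
    return samples                               # pushed into the replay buffer
```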
3.6 Experiments

For experiments, we will first scale the complexity of simulated scenes and compare the performance of our method with autodiff tools. Then we will integrate differentiable physics into reinforcement learning and use our method to learn control policies. Lastly, we will apply differentiable articulated dynamics to solve motion control and parameter estimation problems.

3.6.1 Comparison with Autodiff Tools

Using existing autodiff tools is a convenient way to derive simulation gradients. However, these tools are not optimized for articulated dynamics, which adversely affects their computation and memory consumption in this domain. We compare our method with state-of-the-art autodiff tools, including CppAD [23], Ceres [24], PyTorch [25], autodiff [142] (referred to as ADF to avoid ambiguity), and JAX [21]. All our experiments are performed on a desktop with an Intel(R) Xeon(R) W-2123 CPU @ 3.60GHz with 32GB of memory.

steps     50       100      500       1000      5000
Ceres     100.0    200.0    1100.0    2700.0    23600.0
CppAD     16.0     26.0     160.0     320.0     3100.0
JAX       200.0    200.0    400.0     700.0     3000.0
PyTorch   1200.0   2400.0   11700.0   12400.0   N/A
Ours      0.3      0.3      0.7       1.2       5.0

Table 3.1: Peak memory use of different simulation frameworks. The memory footprint (MB) of our framework is more than two orders of magnitude lower than that of autodiff methods. PyTorch crashes at 5,000 simulation steps. ADF fails to compute the gradients in a reasonable time (10 min) and is not included here for this reason.

steps     50      100     500     1000    5000
ADF       25.7    25.5    25.1    32.1    58.4
Ceres     27.2    27.5    27.2    34.0    58.2
CppAD     2.4     2.4     2.3     2.3     4.5
JAX       53.3    46.1    43.1    42.7    42.3
PyTorch   195.6   192.2   199.2   192.8   N/A
Ours      0.3     0.3     0.2     0.2     0.2

Table 3.2: Simulation time for the forward pass (in msec). Ours is at least an order of magnitude faster than autodiff tools. PyTorch crashes at 5,000 simulation steps. CppAD is the fastest baseline.

One robot. In the first round, we profile all the methods by simulating one Laikago robot standing on the ground. Figure 3.3(a) provides a visualization. The state vector of the Laikago has 37 dimensions (19-dimensional positions and 18-dimensional velocities). We vary the number of simulation steps from 50 to 5,000. JAX is based on Python and its JIT compiler cannot easily
(a) One Laikago (b) Ten Laika