ABSTRACT

Title of Dissertation: ENHANCED ROBOT PLANNING AND PERCEPTION THROUGH ENVIRONMENT PREDICTION

Vishnu Dutt Sharma
Doctor of Philosophy, 2024

Dissertation Directed by: Professor Pratap Tokekar
Department of Computer Science

Mobile robots rely on maps to navigate through an environment. In the absence of any map, the robots must build the map online from partial observations as they move in the environment. Traditional methods build a map using only direct observations. In contrast, humans identify patterns in the observed environment and make informed guesses about what to expect ahead. Modeling these patterns explicitly is difficult due to the complexity of the environments. However, these complex models can be approximated well using learning-based methods in conjunction with large training data. By extracting patterns, robots can use not only direct observations but also predictions of what lies ahead to better navigate through an unknown environment. In this dissertation, we present several learning-based methods to equip mobile robots with prediction capabilities for efficient and safer operation.

In the first part of the dissertation, we learn to predict using geometrical and structural patterns in the environment. Partially observed maps provide invaluable cues for accurately predicting the unobserved areas. We first demonstrate the capability of general learning-based approaches to model these patterns for a variety of overhead map modalities. Then we employ task-specific learning for faster navigation in indoor environments by predicting 2D occupancy in the nearby regions. This idea is further extended to 3D point cloud representation for object reconstruction. By predicting the shape of the full object from only partial views, our approach paves the way for efficient next-best-view planning, which is a crucial requirement for energy-constrained aerial robots. Deploying a team of robots can also accelerate mapping. Our algorithms benefit from this setup as more observations result in more accurate predictions, further improving efficiency in the aforementioned tasks.

In the second part of the dissertation, we learn to predict using spatiotemporal patterns in the environment. We focus on dynamic tasks such as target tracking and coverage where we seek decentralized coordination between robots. We first show how graph neural networks can be used for more scalable and faster inference while achieving coverage performance comparable to classical approaches. We find that differentiable design is instrumental here for end-to-end task-oriented learning. Building on this, we present a differentiable decision-making framework that consists of a differentiable decentralized planner and a differentiable perception module for dynamic tracking.

In the third part of the dissertation, we show how to harness semantic patterns in the environment. Adding semantic context to the observations can help the robots decipher the relations between objects and infer what may happen next based on the activity around them. We present a pipeline using vision-language models to capture a wider scene using an overhead camera to provide assistance to humans and robots in the scene. We use this setup to implement an assistive robot to help humans with daily tasks, and then present a semantic communication-based collaborative setup of overhead-ground agents, highlighting the embodiment-specific challenges they may encounter and how they can be overcome.
The first three parts employ learning-based methods for predicting the environment. However, if the predictions are incorrect, this could pose a risk to the robot and its surroundings. The final part of the dissertation presents risk management methods with meta-reasoning over the predictions. We study two such methods: one extracting uncertainty from the prediction model for risk-aware planning, and another using a heuristic to adaptively switch between classical and prediction-based planning, resulting in safe and efficient robot navigation.

ENHANCED ROBOT PLANNING AND PERCEPTION THROUGH ENVIRONMENT PREDICTION

by

Vishnu Dutt Sharma

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2024

Advisory Committee:
Professor Pratap Tokekar, Chair/Advisor
Professor Nikhil Chopra, Dean's Representative
Professor Dinesh Manocha
Professor Tianyi Zhou
Professor Kaiqing Zhang

© Copyright by Vishnu Dutt Sharma 2024

Dedication

To my brother, Ramdutt Sharma.

"We are stardust brought to life, then empowered by the universe to figure itself out – and we have only just begun." — Neil deGrasse Tyson

Acknowledgments

My Ph.D. journey has been a thrilling rollercoaster, filled with exhilarating highs and challenging lows. While I officially embarked on this path five years ago, the journey truly started many years earlier, and it would not have been possible without the unwavering support of many individuals who stood by me all along.

First and foremost, I express my deepest gratitude to my advisor, Dr. Pratap Tokekar, for his tireless support and mentorship at every step of this journey. Looking back, I recognize how he encouraged me to explore my interests and gently nudged me in the right direction when necessary, helping me grow into an independent researcher. His support extended beyond academic guidance, providing me with empathy and encouragement throughout this process. As my friend and colleague Deeksha aptly put it, "he is a great adviser and an amazing guide", and I am deeply grateful for the opportunity to have worked with him.

I am profoundly thankful to my committee members, Dr. Nikhil Chopra, Dr. Dinesh Manocha, Dr. Tianyi Zhou, and Dr. Kaiqing Zhang, for their invaluable feedback and insights on my dissertation. I also want to extend my gratitude to Dr. Manocha for his advice during the preliminary exam and for connecting me with the collaborators at the GAMMA lab at UMD, which led to the projects forming an essential part of this dissertation.

Heartfelt thanks to my collaborators and co-authors: Harnaik Singh Dhami, Lifeng Zhou, Anukriti Singh, Vishnu Sashank Dorbala, Qingbiao Li, Jingxi Chen, and Maymoonah Toubeh. This dissertation was made possible with their help, and working with them expanded the horizons of my conceptual and practical knowledge. I am also grateful to have worked with and learned from Dr. Matthew Andrews, Jeongran Lee, and Ilija Hadžić during my internship.

This dissertation would not have been possible without the generous financial support from several sources: the U.S. National Science Foundation (grant #1943368), the Office of Naval Research (grant #N00014-18-1-2829), the Kulkarni Foundation, Nokia Bell Labs, and Comcast Corporation. I am grateful to Ivan Penskiy and the Maryland Robotics Center for providing the necessary hardware for experiments.
I also extend my thanks to IEEE RAS, the Department of Computer Science at UMD, and the Graduate School at UMD for support in the form of travel grants. Attending conferences with these grants allowed me to experience research on a broad scale and connect with the wider research community. I owe my deepest gratitude to my friends who kept me going through the adversities: Ak- shita Jha, Siddharth Jar, Aman Gupta, Vikram Mohanty, Rajnish Aggarwal, Abhilash Sahoo, Alisha Pradhan, Biswaksen Patnaik, and Pramod Chundury. Special thanks to Akshita and Sid- dharth, who were always a call away, to lend an ear to all my personal and professional problems and provide kind and energizing words. I also want to thank Aman for motivating me to keep working towards starting my Ph.D. journey. I am grateful to Dr. Pawan Goyal and Dr. Amrith Krishna, who provided me with my first opportunity to pursue academic research during my undergraduate studies. Under their guidance, I learned the fundamental skills that have carried me through my academic career. Many thanks to my friends and lab-mates from the RAAS Lab: Guangyao Shi, Amisha Bhaskar, Prateek Verma, Jingxi Chen, Troi Williams, Charith Reddy, Deeksha Dixit, Rui Liu, Chak Lam Shek, Zahir Mahammad, and Sachin Jadhav. I thank them for making this experience v enjoyable. This journey would have been unimaginable without the unwavering love and support of my family. I owe them a deep gratitude for their patience, understanding, and encouragement throughout this journey. Finally, I wish to thank the creators of Naruto. I started watching the show when I was going through a very rough patch. The character instilled in me the courage to keep going and start this journey. It transformed the anger within into acknowledgment and kindness towards myself, setting me on a path that ultimately led to this accomplishment. vi Table of Contents Dedication ii Acknowledgements iv Table of Contents vii List of Tables xi List of Figures xiii List of Abbreviations xix Chapter 1: Introduction 1 1.1 Types of Patterns and Informed Decision-Making . . . . . . . . . . . . . . . . . 3 1.1.1 Geometrical and Structural Patterns . . . . . . . . . . . . . . . . . . . . 3 1.1.2 Spatiotemporal Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.1.3 Semantic Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.2 Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.2.1 Enhanced Perception with Structural Continuity and Closure . . . . . . . 9 1.2.2 Spatiotemporal Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 1.2.3 Semantic Pattern Prediction to Assist Humans and Robots . . . . . . . . 19 1.2.4 Meta-Reasoning to Manage Risk in Predictions . . . . . . . . . . . . . . 24 Chapter 2: Structural and Geometric Pattern Prediction in 2D Occupancy Maps 29 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.3 Proposed Approach: Proximal Occupancy Map Prediction (ProxMaP) . . . . . . 33 2.3.1 Network Architecture and Training Details . . . . . . . . . . . . . . . . 33 2.3.2 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 2.4 Experiments & Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 2.4.1 Occupancy Map Prediction . . . . . . . . . . . . . . . . . . . . . . . . . 37 2.4.2 Navigation Performance . . . . . . . . . . . . 
. . . . . . . . . . . . . . 40 2.4.3 Predictions on Real Data . . . . . . . . . . . . . . . . . . . . . . . . . . 43 2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 Chapter 3: Structural and Geometric Pattern Prediction in 2D Images and Maps 45 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 vii 3.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.2.1 Mapping for Robot Navigation . . . . . . . . . . . . . . . . . . . . . . . 49 3.2.2 Self-supervised masked encoding . . . . . . . . . . . . . . . . . . . . . 50 3.3 Proposed Approach: MAE as Zero-Shot Predictor . . . . . . . . . . . . . . . . . 50 3.3.1 FoV Expansion and Navigation . . . . . . . . . . . . . . . . . . . . . . . 51 3.3.2 Multi-Agent Uncertainty Guided Exploration . . . . . . . . . . . . . . . 53 3.3.3 Navigation with Prediction . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.4 Experiments and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.4.1 FoV Expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.4.2 Multi-Agent Uncertainty Guided Exploration . . . . . . . . . . . . . . . 60 3.4.3 Navigation with prediction . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 Chapter 4: Structural and Geometrical Pattern Prediction in 3D Point Clouds 64 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 4.3 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.4 Proposed Approach: Prediction-based Next-Best-View (Pred-NBV) . . . . . . . 71 4.4.1 PoinTr-C: 3D Shape Completion Network . . . . . . . . . . . . . . . . . 71 4.4.2 Next-Best View Planner . . . . . . . . . . . . . . . . . . . . . . . . . . 72 4.5 Experiment and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.5.1 Qualitative Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.5.2 3D Shape Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.5.3 Next-Best-View Planning . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 Chapter 5: Structural and Geometric Pattern Prediction with Planning for Multi-Robot Systems 81 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 5.3 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 5.4 Proposed Approach: Multi-Agent Prediction-based Next-Best-View (MAP-NBV) 87 5.4.1 3D Model Prediction (Line 5-6) . . . . . . . . . . . . . . . . . . . . . . 88 5.4.2 Decentralized Coordination (Line 7-12) . . . . . . . . . . . . . . . . . . 88 5.5 Experiments and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 5.5.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 5.5.2 Qualitative Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 5.5.3 Can Point Cloud Prediction Improve Reconstruction? . . . . . . . . . . . 93 5.5.4 How does coordination affect reconstruction? . . . . . . . . . . . . . . . 95 5.5.5 Qualitative Real-World Experiment . 
. . . . . . . . . . . . . . . . . . . 99 5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 Chapter 6: Spatiotemporal Pattern Prediction for Multi-Robot Coordination and Tracking 102 6.1 Learning Decentralized Coordination with Graph Neural Networks . . . . . . . . 103 6.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 viii 6.1.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 6.1.3 Proposed Approach: GNN-based Decentralized Coverage Planner . . . . 113 6.1.4 Experiments and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 117 6.1.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 6.2 Learning to Track and Coordinate with Differentiable Planner . . . . . . . . . . . 124 6.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 6.2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 6.2.3 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 6.2.4 Proposed Approach: Differentiable, Decentralized Coverage Planner (D2CoPlan) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 6.2.5 Experiments and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 133 6.2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 Chapter 7: Semantic Pattern Prediction for Assistive Robot Perception and Planning 143 7.1 Semantic Pattern Prediction for Assisting Humans . . . . . . . . . . . . . . . . . 144 7.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 7.1.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 7.1.3 System Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 7.1.4 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 7.1.5 An Examples Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . 152 7.2 Semantic Communication for Assisting Robots with ObjectNav . . . . . . . . . . 156 7.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 7.2.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 7.2.3 Proposed Approach: Assisted ObjectNav . . . . . . . . . . . . . . . . . 161 7.2.4 Experiments and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 168 7.2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172 Chapter 8: Meta-Reasoning for Risk Management with Implicit Measures 174 8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174 8.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178 8.2.1 Conditional Value at Risk . . . . . . . . . . . . . . . . . . . . . . . . . . 178 8.2.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 8.3 Proposed Approach: Risk-Aware Path Planner . . . . . . . . . . . . . . . . . . . 181 8.3.1 Semantic Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . 182 8.3.2 Candidate Path Generation . . . . . . . . . . . . . . . . . . . . . . . . . 183 8.3.3 Risk-Aware Path Assignment . . . . . . . . . . . . . . . . . . . . . . . . 183 8.4 Experiment and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187 8.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
192 Chapter 9: Meta-Reasoning for Risk Management with Explicit Measures 196 9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196 9.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199 9.3 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201 9.3.1 Dynamic Window Approach (DWA) . . . . . . . . . . . . . . . . . . . . 201 9.3.2 SACPlanner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204 ix 9.4 Proposed Approach: Hybrid Local Planner . . . . . . . . . . . . . . . . . . . . . 206 9.4.1 Waypoint Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206 9.4.2 Clearance Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208 9.4.3 Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209 9.4.4 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . 210 9.5 Experiments and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212 9.5.1 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212 9.5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 9.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218 Chapter 10: Conclusion and Future Work 220 10.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220 10.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221 10.2.1 Large Models for Zero-Shot Robotic Applications . . . . . . . . . . . . . 221 10.2.2 Risk-Aware Methods for Large Models . . . . . . . . . . . . . . . . . . 222 10.2.3 Building End-to-End Methods . . . . . . . . . . . . . . . . . . . . . . . 222 Appendix A: Prompts and Additional Experiments for Assisted ObjectNav 224 A.1 Prompts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224 A.1.1 No Comm. - Ground . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224 A.1.2 Communication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226 A.1.3 Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227 A.1.4 Preemptive Actions Classifier . . . . . . . . . . . . . . . . . . . . . . . 230 A.2 Adverserial Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230 A.3 Real World Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232 A.3.1 Localization Failures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232 A.3.2 Finetuning Prompts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233 A.3.3 Identifying Correct Action from Dialogue . . . . . . . . . . . . . . . . . 236 A.4 Selective Communication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239 A.5 Communication Wordclouds . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240 242 Bibliography 242 x List of Tables 2.1 Comparison across different variations of ProxMaP over living room data from AI2THOR [1] simulator. Abbreviations Reg and Class refer to Regression and Classification tasks, respectively . . . . . . . . . . . . . . . . . . . . . . . . . . 38 2.2 Generalizability of ProxMaP and variations over Habitat-Matterport3D (HM3D) [2] dataset. Abbreviations Reg and Class refer to Regression and Classification tasks, respectively . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 2.3 SCT performance across different living rooms . . . . . . . 
. . . . . . . . . . . 42 3.1 Results for increasing the FoV in RGB images . . . . . . . . . . . . . . . . . . . 59 3.2 Results for increasing the FoV in Semantic segmentation images . . . . . . . . . 59 3.3 Results for increasing the FoV in Binary images from Outdoor environment . . . 60 4.1 Comparison between the baseline model (PoinTr) and PoinTr-C over test data with and without perturbation. Arrows show if a higher (↑) or a lower (↓) value is better. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 4.2 Points observed by Pred-NBV and the baseline NBV method [3] for all models in AirSim. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 5.1 MAP-NBV results in a better coverage compared to the multi-agent baseline NBV method [3] for all models in AirSim upon algorithm termination. . . . . . . . . . 95 6.1 Percentage of the number of targets covered (the average across 1000 trials) by GNNtrained and tested with varying numbers of robots. . . . . . . . . . . . . . . 122 6.2 Percentage of the targets covered (the average across 1000 trials) with respect to EXPERT by D2COPLAN trained and tested with varying numbers of robots. . . . 138 6.3 Percentage of the targets covered (the average across 1000 trials) with respect to EXPERT by D2COPLAN across varying target density maps. . . . . . . . . . . . 138 7.1 Oracle Success Rate (OSR) and Success Weighted by Path Length (SPL) for the assisted ObjectNav in simulation. . . . . . . . . . . . . . . . . . . . . . . . . . . 169 7.2 Generative communication traits in simulations. We find that while the VLMs do not hallucinate much in describing images, they progressively perform worse in assuming pre-emptive agent motion during communication. . . . . . . . . . . . . 170 9.1 Summary statistics of trajectories from the real-world experiments using DWA, SAC, and Hybrid planner. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218 xi A.1 Performance of the Selective Communication setup. We find that letting the GA choose whether to communicate with the agent or not results in a better Object- Nav performance than the fully cooperative setup. . . . . . . . . . . . . . . . . . 240 xii List of Figures 1.1 Motivating Application with Geometrical Structural Patterns: Limited observa- tions and occlusion can limit the robots’ planning capabilities. Geometrical and structural predictions can help them make informed decisions by predicting the map beyond direct observations. The images are from the City and Forest envi- ronments in AirSim. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2 Motivating Application with Spatiotemporal Patterns: The motion of dynamic objects can be difficult to model. Spatiotemporal patterns can help the robots estimate the motion to avoid or track them efficiently. The image is from experi- ments performed at Nokia Bell Labs as an intern. . . . . . . . . . . . . . . . . . 6 1.3 Motivating Application with Semantic Patterns: Lack of semantic understanding can make the robots reliant on human instructions to perform tasks for them. Semantic patterns can help the robots anticipate what a human may need, and help them proactively. The images are from our experiments at RAAS Lab, University of Maryland. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.4 Example of some tasks that can be done with geometrical and structural patterns . 
10 1.5 Movement configuration for data collection and training strategy for proximal occupancy map prediction (ProxMaP) . . . . . . . . . . . . . . . . . . . . . . . 12 1.6 Overview of our prediction-guided next-best-view approach (Pred-NBV) for ob- ject reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.7 Overview of our prediction and uncertainty-driven planning approach for multi- agent coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 1.8 Our graph neural network (GNN) based method for decentralized multi-robot coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 1.9 Our tracking and coverage approach using a differentiable decentralized coverage planner (D2CoPlan) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 1.10 A VLM-based overhead agent working along with a ground robot can act as an effective assistance unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 1.11 Two VLM-based agents, one with an overhead view and another with a ground view, can work together to find an object on a scene with generative communication 23 1.12 Risk-aware planning strategy using uncertainty extraction allows the user to choose between a conservative and adventurous plans . . . . . . . . . . . . . . . . . . . 25 1.13 Left: Overview of the proposed hybrid local planning approach which combines the benefits of classical and AI-based local planners. Right: A real experiment scenario showing a hybrid planner in action when a human suddenly appears on the robot’s path. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 xiii 2.1 An example situation where the robot’s view is limited by the obstacles (sofa blocking the view) and the camera field of view (sofa on the right is not fully visible). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.2 Overview of the proposed approach. The training and inference flows are indi- cated with red and black arrows, respectively. We take the input view by moving the robot to the left and right sides (CamLeft and CamRight), looking towards the region of interest. ProxMaP makes predictions using the CamCenter only, and the map obtained by combining the information from the three positions acts as the ground truth. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.3 Results obtained by the proposed model over some examples (rows). Red, yellow, and green areas represent a high, moderate, and low chance of occupancy in an area. ProxMaP makes more accurate and precise predictions than others. . . . . . 39 2.4 Prediction by ProxMaP over real-world inputs. . . . . . . . . . . . . . . . . . . . 44 3.1 Traditional approach of leveraging models trained on huge computer vision datasets can be applied to robotic tasks reliant on top-down images, albeit with some task- specific fine-tuning. We show that this is not necessary and some models, such as MAE [4] can be applied directly to these robotics tasks. . . . . . . . . . . . . . . 46 3.2 Example of robotics tasks solved with the help of Masked Autoencoder. . . . . . 48 3.3 Examples showing that the masked autoencoder can be used to expand the effec- tive FoV in top-down RGB, semantic, and binary images without fine-tuning. . . 51 3.4 Results of expanding FoV for indoor images in three masking scenarios. The corner of the bathtub and room is accurately predicted based on the symmetry of the lines. . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.5 An overview of the MAE-based multi-agent exploration pipeline. . . . . . . . . . 54 3.6 Left: The area robot has explored till now. Right: Prediction of obstacle (red) shape aiding robot path planning. . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.7 Comparison between the multi-agent exploration algorithms to reach at least 95% accuracy in prediction of the unexplored map. . . . . . . . . . . . . . . . . . . . 60 4.1 Overview of Prediction-based Next Best View (PredNBV). . . . . . . . . . . . . 65 4.2 Flight path and total observations of C17 Airplane after running our NBV planner in AirSim simulation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.3 Results over the real-world point cloud of a car obtained using LiDAR (Interactive figure available on our webpage). . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.4 Comparison between Pred-NBV and the baseline NBV algorithm [3] for a C-17 airplane. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 5.1 MAP-NBV uses predictions to select better NBVs for a team of robots compared to the non-predictive baseline approach. . . . . . . . . . . . . . . . . . . . . . . 82 5.2 Algorithm Overview for MAP-NBV: Each robot runs the same algorithm includ- ing perception, prediction, and planning steps. The robots that communicate with each other can share observations and coordinate planning, whereas robots in isolation (e.g., Robot n) perform individual greedy planning. . . . . . . . . . . . 84 5.3 Flight paths of the two robots during C-17 simulation for MAP-NBV. . . . . . . . 90 xiv http://raaslab.org/projects/PredNBV/ 5.4 Examples of the 5 simulation model classes used for the multi-agent object re- construction task using MAP-NBV. . . . . . . . . . . . . . . . . . . . . . . . . . 92 5.5 MAP-NBV (CO(d)-CP(d)-Greedy) performs comparably to the optimal so- lution (CO(c)-CP(c)-Optimal; Section 5.5.4), and much better than the frontiers-baseline in AirSim experiments. . . . . . . . . . . . . . . . . . . . . . 96 5.6 Directional CD-ℓ2 for teams of 2, 4, and 6 robots on ShapeNet models [5] with different coordination strategies. . . . . . . . . . . . . . . . . . . . . . . . . . . 98 5.7 Real-World MAP-NBV experiment. (a) RGB Image. (b) Observations, Predic- tions, and MAP-NBV poses. (c) Initial, Drone 1, and Drone 2 points after MAP- NBV iteration. (d) Reconstruction. . . . . . . . . . . . . . . . . . . . . . . . . . 100 6.1 An overview of the multi-robot target tracking setup: a team of aerial robots, mounted with down-facing cameras, aims at covering multiple targets (depicted as colorful dots) on the ground. The red arrow lines and the blue dotted lines show inter-robot observations and communications. The red squares represent the fields of view of the robots’ cameras. . . . . . . . . . . . . . . . . . . . . . . 107 6.2 Overview of the decentralized target tracking problem. At a given time step, each robot observes the robots within its sensing range and chooses an action from a set of motion primitives to cover some targets using its camera. Each robot communicates with those robots within its communication range. . . . . . . . . . 108 6.3 Overview of our learning-based planner for decentralized target tracking. 
It con- sists of three main modules: (i), the Individual Observation Processing module processes the local observations and generates a feature vector for each robot; (ii), the Decentralized Information Aggregation module utilizes the GNN to aggregate and fuse the features from K-hop neighbors for each robot; (iii), the Decentral- ized Action Selection module selects an action for each robot by imitating an expert algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 6.4 Comparison of Opt, Centrl-gre, Decent-gre, GNN, and Rand in terms of running time (plotted in log scale) and the number of targets covered. (a) & (d) are for small-scale comparison averaged across 1000 Monte Carlo trials. (b) & (e) are for large-scale comparison averaged across 1000 Monte Carlo trials. . . . 120 6.5 An overview of the multi-robot target coverage setup. A team of aerial robots aims at covering multiple targets (depicted as red dots) on the ground. The robots observe the targets in their respective field of view (green squares) using down- facing cameras and share information with the neighbors through communication links (red arrows). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 6.6 An illustrative example showing multi-robot action selection for joint coverage maximization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 xv 6.7 Overview of our approach from a robot’s perspective: first the local observations are processed to generate the current coverage map. This can be done with the Differentiable Map Processor (DMP). D2COPLAN takes the coverage map as the input and processes it to first generate compact feature representation, with Map Encoder; shares the features with its neighbors, using the Distributed Feature Aggregator; and then selects an action using the aggregated information, with the Local Action Selector. The abbreviations in the parentheses for D2COPLAN’s sub-modules indicate the type of neural network used in their implementation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 6.8 Comparison of EXPERT and D2COPLAN in terms of running time (plotted in log scale) and the number of targets covered, averaged across 1000 Monte Carlo trials. D2COPLAN was trained on 20 robots. D2COPLAN is able to cover 92%-96% of the targets covered by EXPERT, while running at a much faster rate. . . . . . . . . . . . . . . . . . . . . 136 6.9 Comparison of D2COPLAN, and DG in terms of running time (plotted in log scale) and the number of targets covered, averaged across 1000 Monte Carlo trials. D2COPLAN was trained on 20 robots. D2COPLAN is able to cover almost the same number of targets as DG. DG is faster for fewer number of robots, but as the number of robots increases, D2COPLAN scales better than it. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 6.10 Comparison of coverage highlighting the effect of using D2COPLAN, a differentiable planner to aid learning for a differentiable map predictor (DMP), which works better than the DMP trained standalone. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 6.11 An ablation study for DMP and D2COPLAN. The plot shows results for the sce- narios where there parts and trained together or in isolation. . . . . . . . . . . . . 141 7.1 Overview of the assistive robot setup: an overhead agent equipped with the VLM directs a ground robot to help with the tasks based on semantic predictions . . . . 
145 7.2 Overview of the proposed pipeline for the assistive robot . . . . . . . . . . . . . 147 7.3 Turtlebot2 and the sensors used for the assistive robot setup . . . . . . . . . . . . 149 7.4 The occupancy map and the labels assigned to the rooms for the house-like envi- ronment. The images on the sides show example observations from the rooms. . . 151 7.5 A sequence of observations from the overhead camera while making coffee. There is no spoon or stirrer nearby, which the person may need next. . . . . . . . 154 7.6 Executing the action suggested by the LLMs: The robot goes to the pantry, finds the stirrer, and brings it to the kitchen. . . . . . . . . . . . . . . . . . . . . . . . 155 7.7 Overview of the assistive ObjectNav setup. The overhead and ground VLM agents communicate to coordinate and find the target object. . . . . . . . . . . . 157 7.8 Overview of the proposed approach for assisted ObjectNav. The GA and OC first communicate with each other. A summary of the communication is then used to recommend an action for the GA. The GA then can either cooperate or decline cooperation and decide the next action on its own. . . . . . . . . . . . . . . . . . 162 7.9 An example showing dialogues hallucinations across different lengths of commu- nication between the ground and overhead agents. . . . . . . . . . . . . . . . . . 165 7.10 Overview of the real-world experimental setup consisting of a Turtlebot as a Ground Agent (GA) and a GoPro camera mounted to the roof as an Overhead Agent (OA). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 xvi 8.1 An illustration showing CVaRα of function f(S, y).. It denotes the expectation on the left α-tailed cases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 8.2 The breakdown of the risk-aware planning framework. Given an overhead image input, the algorithm provides a semantic segmentation and uncertainty map, then generates candidate paths, and finally performs the risk-aware path assignment of vehicles to demands. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181 8.3 Difference in variance in semantic segmentation outputs due to data distribution. 189 8.4 Average Cross-Entropy for each class/label. It is inversely correlated with the frequency of the class/label in the data. . . . . . . . . . . . . . . . . . . . . . . . 190 8.5 Effect of λ on path planning. Smaller λ (=10) gives a shorter path with high uncertainty; larger λ (=50) gives a longer path with low uncertainty. . . . . . . . 190 8.6 Efficiency distributions of paths and the path assignment when α = 0.01. The assigned path for each robot is marked in red. . . . . . . . . . . . . . . . . . . . 192 8.7 Efficiency distributions of paths and the path assignment when α = 1. The as- signed path for each robot is marked in red. . . . . . . . . . . . . . . . . . . . . 193 8.8 Start and demand positions for surprise calculation setup. . . . . . . . . . . . . . 194 8.9 Surprise vs λ. A higher λ may result in a longer path, increasing the cost of traversal. For large values of λ, even a few pixels with high variance may greatly increase the cost. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194 8.10 Distribution of the total travel efficiency f(S, y) by Algorithm 2. . . . . . . . . . 195 9.1 An illustrative example showing various ways in which a robot can react to an obstacle with a series of arc-motions. . . . . . . . . . . . . . . . . . . . . . . . . 
203 9.2 ROS framework and the architecture of our hybrid local planner. . . . . . . . . . 207 9.3 An overview of the waypoint generation scheme for the hybrid local planner. . . . 207 9.4 An overview of the clearance detection module for the hybrid planner. . . . . . . 208 9.5 Dummy training environment for SACPlanner (left) and the associated polar costmap (right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210 9.6 Real experimental environment and 4 test case scenarios (C1-4) from left to right. 213 9.7 Trajectory comparison between DWA, SACPlanner vs. Hybrid planner agent for each test case. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 9.8 Trajectory comparison between DWA, SAC, and Hybrid planners based on logs from the scenario (C3). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216 A.1 Ground agent observations which result in error with GPT-4-turbo. . . . . . . . . 231 A.2 Localization responses from GPT-4-turbo for varying extent of labelling. . . . . . 233 A.3 Localization responses from GPT-4-turbo for different locations of the Turtlebot2 robot. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234 A.4 An example showing the need for environment-specific fine-tuning. Upon pass- ing the simulator prompts for conversation to this image, the overhead agent would misidentify the lights as a table. Explicitly mentioning the presence of white lights on the floor alleviates this issue. . . . . . . . . . . . . . . . . . . . . 236 A.5 Egocentric images from the camera on Turtlebot2. These images correspond to images of robot locations shown in Figure A.3 . . . . . . . . . . . . . . . . . . . 237 xvii A.6 Corrective action with GA in the real-world experiments. In this case, the agent decides to rotate by 180 degrees first and then move towards the left. The OA in the execution phase correctly suggests that the same can be accomplished by rotating towards the right. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239 A.7 World clouds over the communication between the GA and OA in Selective Ac- tion setup. Communication in Cooperative Action setup also exhibits similar patterns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241 xviii List of Abbreviations AUV Autonomous Underwater Vehicle CNN Convolutional Neural Network DS Dialogue Similarity GA Ground Agent GC Generative Communication GNN Graph Neural Network LLM Large Language Model MAE Masked Autoencoder MLP Multi-Layer Perceptron NBV Next Best View OA Overhead Agent OSR Oracle Success Rate SCT Success weighted by Completion Time SLAM Simultaneous Localization and Mapping SPL Success weighted by Path Length SR Success Rate UAV Unmanned Aerial Vehicle UGV Unmanned Ground Vehicle VLM Vision-Langugae Model xix Chapter 1: Introduction Navigating through unknown environments is a fundamental capability of mobile robots and has been studied by the robotics community for a long time [6]. Onboard sensors help them perceive their surroundings and planning algorithms enable them to navigate through the environment [6]. If a robot already has a map of the environment, it can plan optimal routes between locations, even in the presence of some dynamic objects [7]. However, the presence of a map is not always guaranteed. In certain scenarios, building a map from the ground up is itself the very mission assigned to a robot [8, 9]. 
Efficient navigation requires careful selection of the next location to move to at each time step. Should the robot need to traverse to a pre-defined goal, it must move to locations such that the time taken to reach the goal is minimized. If the objective is to build a map, it must move to locations that help in observing most of the environment in the fewest steps. At each step, the local observations are integrated to construct a partial global map, and a new path is planned based on the updated map. Typically, a planner produces navigation strategies either avoiding or exploring unknown areas, depending upon the task at hand and safety constraints. Unsurprisingly, the amount of information about the map acts as the bottleneck for efficient navigation. So we ask: is there a way to make the robot navigation process more efficient even with limited information? To answer this, we look towards humans for inspiration.

In similar circumstances where there is limited information, humans show remarkable efficiency in navigation by making guesses about the yet-unseen part of the environment. At the core of this cognitive process lies our ability to identify patterns in our surroundings. We can walk through our living room without directly looking at the floor, just by mentally visualizing the furniture in the room, even if we can only see a part of it. We move through a crowd gracefully, walking faster or slower by intuitively anticipating other people's motions. Locating a book, say Probabilistic Robotics, within a library is easy once we know the positions of the volumes starting with 'O' and 'Q'. In all these instances, we rely on patterns we have identified from our prior experiences to inform our decision-making. Inspired by these feats, the central question we aim to answer in this dissertation is: can robots also reason about the regions beyond their field of view by employing similar pattern recognition capabilities to improve their navigation efficiency?

To answer this question, it is important to understand how humans develop these capabilities. Our brain learns these patterns through experiences and builds an internal model to facilitate their applications [10–12]. The easiest solution to impart such capabilities to the robots is therefore creating such models manually for them. However, handcrafting models for these patterns proves to be a formidable task, owing to our limited knowledge of the system-environment interaction and the exact distribution of these patterns [13]. In contrast, learning-based approaches, especially deep neural networks, offer a promising avenue to approximate these models using extensive training data. This dissertation introduces an array of learning-based algorithms for this purpose, demonstrating their efficacy in enabling robots to make informed predictions about unobserved environments and enhancing navigation efficiency for a variety of input views, modalities, and numbers of robots. Being mindful of the inherent approximations, we concurrently develop methods to manage risks when using predictions to ensure the robots can utilize them safely. Thus, the goal of this dissertation is to equip robots with pattern recognition capabilities and facilitate the judicious use of predictions to enhance navigation efficiency and safety.
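The sense-integrate-plan loop described above can be summarized in a short sketch. The `robot` and `planner` interfaces and the map encoding used here are hypothetical placeholders introduced only for illustration; they are not interfaces defined later in the dissertation.

```python
import numpy as np

UNKNOWN = -1  # cells not yet observed; free/occupied values come from the sensor model

def navigate_incrementally(robot, goal, map_shape, planner):
    """Sense -> integrate -> (re)plan loop on a partially known map.

    `robot` and `planner` are hypothetical interfaces used only for illustration:
    robot.position(), robot.sense() -> (cells, values), robot.step_along(path).
    """
    global_map = np.full(map_shape, UNKNOWN, dtype=int)
    while robot.position() != goal:
        cells, values = robot.sense()                 # local observation from onboard sensors
        for (r, c), v in zip(cells, values):          # integrate it into the partial global map
            global_map[r, c] = v
        path = planner(global_map, robot.position(), goal)  # replan on the updated map
        robot.step_along(path)                        # execute the beginning of the new plan
    return global_map
```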
1.1 Types of Patterns and Informed Decision-Making

In this section, we start by introducing some patterns often harnessed by humans which prove to be of significant utility to robots for informed decision-making. Specifically, we focus on geometrical and structural, spatiotemporal, and semantic patterns, and describe their characteristics and relevance to robotic planning in the subsections below.

1.1.1 Geometrical and Structural Patterns

Geometrical or structural patterns refer to recurring arrangements or configurations of shapes or structures in the environment. There is a plethora of these patterns present in our surroundings, mainly due to our affinity towards finding regularity in what we see, even in scenarios where they may not be immediately apparent. A great example of this is star constellations, where we can visualize animals and objects in a seemingly random arrangement of tiny twinkling lights.

The human brain is remarkable at recognizing and organizing visual information through the application of Gestalt principles [14, 15]. These fundamental concepts in psychology explain how we perceive complex scenes as cohesive wholes, rather than isolated elements. They not only equip us to understand our surroundings but are also expressed as regular designs of man-made objects. Consider the prevalent rectilinear design of walls, tables, and boxes, or the cylindrical shape of bottles, cups, and tumblers. Despite variations in specific details, objects often adhere to similar underlying structures. Ask a child to draw an airplane, and we would expect to see a tube with pointed ends and two triangles on the sides. Boeing and Airbus models may differ in their tail designs, but the general shape is similar. Another prevalent characteristic of man-made objects is symmetry. Together, this results in easy-to-remember and recognizable shapes all around us.

Figure 1.1: Motivating Application with Geometrical Structural Patterns: Limited observations and occlusion can limit the robots' planning capabilities. Geometrical and structural predictions can help them make informed decisions by predicting the map beyond direct observations. The images are from the City and Forest environments in AirSim.

We show that these principles can be exploited in robot navigation, especially when the robots work in man-made environments, by predicting the object shapes beyond the robot's field of view. The abundance of familiar and recurring shapes presents a great opportunity for robots to infer the complete shape from partial views, making efficient navigation possible. Deploying a team of robots leads to further improvement as more information helps in more accurate predictions, making these patterns immensely useful for robot navigation. Identifying and modeling these patterns is a challenging task. Additionally, the underlying method should ensure generalizability, which is crucial for the deployment of the robot to new scenarios, including the real world.

1.1.2 Spatiotemporal Patterns

Spatiotemporal patterns refer to the arrangement of entities in their scene, not necessarily geometrical, and the regularity in their motion. While we focus on static arrangements in geometrical and structural patterns, here we aim to couple the motion with the spatial arrangements and harness the predictable cues. For "mobile" robots these cues can arise from the environment, as well as their own motion.

Soccer is the most popular sport in the world. Millions of fans watch it in arenas and on TV.
Passing the ball, one of the most fundamental skills in soccer, involves one player kicking the ball toward a teammate while the opponents try to intercept it. The motion of the ball can be easily predicted with Newton's laws of motion. But the players and the viewers do not have to pull their notebooks out to calculate where the ball would be after a certain time. Even those unfamiliar with Newton's laws can easily predict the trajectory of the ball, as humans can anticipate the motion from their experience. Similar patterns can be found in how people move in a crowd and how cars navigate on the road, and we use them to navigate safely and smoothly through both.

Figure 1.2: Motivating Application with Spatiotemporal Patterns: The motion of dynamic objects can be difficult to model. Spatiotemporal patterns can help the robots estimate the motion to avoid or track them efficiently. The image is from experiments performed at Nokia Bell Labs as an intern.

Robots also can learn to anticipate motion based on experience, as demonstrated in our works. Robots can extract these patterns from moving targets, predicting their future locations for better tracking. When working with other robots, they could further utilize their spatial arrangements to coordinate with others to improve robot navigation for the whole team, especially when their communication capabilities are limited. Limited observations pose a significant challenge in these scenarios, which is further exacerbated by the need for scalable solutions; such solutions can be hard to model and learn when generating labels with optimal algorithms for large teams is impractical.

1.1.3 Semantic Patterns

Semantic patterns refer to a consistent arrangement or structure of words, symbols, or elements that conveys meaning within a given context to express higher-level concepts, relationships, or information. While the first two types of patterns often come to us intuitively, semantic patterns are usually reasoned about explicitly [16]. Human decision-making is highly dependent on semantic entities, resulting in a world full of such patterns. Guessing that someone pouring coffee into a cup may need sugar or milk next is one such example, where the pattern emerges from an understanding of objects, their semantic relationships, and the activity they imply. Language, color codings, and sounds are some examples of such patterns. The inherent intuitiveness of semantic patterns paves the way to make accurate predictions about them.

The ubiquity of the semantics around us means it is essential for robots to reason semantically to efficiently assist humans. It also means that, similar to us, the robots can make inferences to navigate efficiently in such a world. Using these patterns can enable robots to fully utilize the man-made signals around them and, in turn, effectively work alongside humans by reasoning similarly to them. Learning these patterns requires a huge amount of data and computational resources, which may not be easily available. The emergence of foundational and large language models addresses this issue and promises an easy avenue for scene understanding and human-level planning, but requires bridging them together with grounding approaches to make them realizable.

Figure 1.3: Motivating Application with Semantic Patterns: Lack of semantic understanding can make the robots reliant on human instructions to perform tasks for them.
Semantic patterns can help the robots anticipate what a human may need, and help them proactively. The images are from our experiments at the RAAS Lab, University of Maryland.

1.2 Research Contributions

This section highlights our contributions proposing methods to leverage the three types of patterns. The key proposition of these methods is to utilize learning-based approaches for modeling the patterns from data and improving planning efficiency across a variety of tasks and scenarios. Additionally, we explore methods for risk management with predictive models for the safe deployment of robots.

We start with a discussion on how geometric and structural patterns can be used to improve PointGoal navigation [17], i.e., traversing to a pre-defined goal through an unknown environment, and active object reconstruction by making accurate predictions about unseen regions from partial observations. These works take advantage of the patterns in the static environment around them. Then we turn our attention to dynamic entities in the scene and detail our approaches to harnessing spatiotemporal patterns, arising from the motion of a team of robots and the dynamic targets and their spatial arrangements, for scalable coverage and tracking. Our works show how predictions are invaluable to efficient navigation for mobile robots, but erroneous predictions can be a point of concern for safety in some applications. We deal with this issue in the last subsection with our meta-reasoning approaches for the safe navigation of the robot using heuristics and uncertainty-based strategies for risk management.

1.2.1 Enhanced Perception with Structural Continuity and Closure

This section describes our contributions to using geometrical and structural patterns to enhance robot perception and, as a result, planning. We start with a discussion on how geometric and structural patterns can be used to improve robot navigation for PointGoal navigation with a 2D occupancy map, and active object reconstruction based on 3D point cloud representations, by predicting unseen regions from only partial observations. Lastly, we show that these approaches can be effectively extended to multi-agent systems by using multiple views to further improve the predictions and, in turn, the navigation efficiency.

Figure 1.4: Example of some tasks that can be done with geometrical and structural patterns

PointGoal Navigation with 2D Maps

PointGoal navigation refers to the navigation task in which the robot is given a specific destination point (goal) in the environment and is required to reach it [17]. 2D overhead or bird's eye view (BEV) maps are commonly used by ground robots for this task. Typically, the robot builds the navigation maps incrementally from local observations using onboard sensors. 2D ranging LiDARs and RGB-D cameras are the most popular sensors for this task and are used to generate an occupancy map, which distinguishes the free areas from occupied or unknown areas. Sometimes, an unmanned aerial vehicle may act as a scout and observe a wider area from a height to get RGB maps, over which semantic segmentation is applied to act as an occupancy map for the navigation of the ground robot. The planner plans a path to the goal using these maps. As the robot navigates and updates the map, the path is updated as a result. A conservative planner may avoid the unknown regions for safety, taking a longer time to navigate to the goal.
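To make the cost of that conservatism concrete, here is a minimal planning sketch on a partial occupancy grid (it could fill the `planner` slot in the earlier loop sketch). The conservative rule blocks every unknown cell; passing a predicted occupancy map instead lets the planner cut through unknown cells that are predicted to be free. The cell encoding, the 4-connected grid, and the 0.3 threshold are illustrative assumptions, not the planners evaluated in the later chapters.

```python
import heapq
import numpy as np

UNKNOWN, FREE, OCCUPIED = -1, 0, 1  # illustrative cell encoding for the occupancy grid

def traversable(grid, cell, predicted_occ=None, occ_thresh=0.3):
    """Conservative rule: unknown cells are blocked. If a predicted occupancy map
    (per-cell probability of being occupied) is supplied, unknown cells whose
    predicted occupancy is low become usable, extending the plannable map."""
    r, c = cell
    if grid[r, c] == OCCUPIED:
        return False
    if grid[r, c] == UNKNOWN:
        return predicted_occ is not None and predicted_occ[r, c] < occ_thresh
    return True  # FREE

def path_cost(grid, start, goal, predicted_occ=None):
    """Dijkstra on a 4-connected grid; returns the shortest path cost, or None if
    the goal is unreachable under the current (conservative or predictive) map.
    Recovering the actual path would additionally track parent cells."""
    dist = {start: 0.0}
    pq = [(0.0, start)]
    while pq:
        d, cell = heapq.heappop(pq)
        if cell == goal:
            return d
        if d > dist.get(cell, np.inf):
            continue
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if not (0 <= nxt[0] < grid.shape[0] and 0 <= nxt[1] < grid.shape[1]):
                continue
            if not traversable(grid, nxt, predicted_occ):
                continue
            nd = d + 1.0
            if nd < dist.get(nxt, np.inf):
                dist[nxt] = nd
                heapq.heappush(pq, (nd, nxt))
    return None
```

A path found with predictions can be much shorter, but its quality now hinges on the predictions being correct, which is the trade-off the risk-management chapters later return to.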
Instead, if a robot is able to correctly predict the occupancy in the occluded regions, the robot may navigate efficiently. Recent works have shown that predicting the structural patterns in the environment through learning-based approaches can greatly enhance task efficiency [18, 19]. This is accomplished by predicting the occupancy maps in the yet unobserved regions, effectively increasing the field of view of the sensor. Figure 1.4 shows some example applications with this capability. We show that existing foundational vision networks can accomplish this without any fine-tuning by using the concept of continuity learned from computer vision datasets. Specifically, we use Masked Autoencoders [20], pre-trained on street images, for field of view expansion on RGB images, semantic segmentation maps, and binary maps. The images and maps span both outdoor and indoor scenes and a diverse set of locations from the AirSim [21] and AI2THOR [1] simulators. Two key findings stand out from our experiments in this work: (1) inferring unobserved scenes is easier for simpler, more abstract representations such as semantic segmentation and binary maps, as they do not require complex reasoning about textures, and (2) predictions are more accurate when made closer to the areas of direct observation and degrade as we move farther away. The limitations of such foundation models are that they are computationally heavy and may not capture the structural patterns of the regions with missing information well. This work was accepted at the 2024 International Conference on Robotics and Automation (ICRA 2024) [22].

To overcome these limitations, we look towards task-specific, convolutional neural network (CNN)-based approaches for occupancy map prediction. Existing works using CNNs for this task learn to predict occupancy in areas away from direct observations and thus may suffer from network overfitting [23, 24]. This also requires a time-consuming data collection step. To alleviate these issues, prior work has proposed a self-supervised approach with a multi-camera setup [25], but this setup does not produce precise predicted maps and is not economical.

Figure 1.5: Movement configuration for data collection and training strategy for proximal occupancy map prediction (ProxMaP)

We use the takeaways from our previous work and focus on making predictions near the observed regions only, thus reducing overfitting and making precise predictions. This also results in a self-supervised and efficient data collection approach (as shown in Figure 1.5), which is also more economical than the existing self-supervised approach. We further improve navigation by adjusting the robot's speed according to the information over the path to the goal, resulting in faster navigation to the goal. This work was accepted at the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2023) [26].

Active Object Reconstruction with 3D Point Clouds

Object reconstruction refers to the task of observing an object from multiple viewpoints and reconstructing a 3D representation of it. 3D LiDARs, which provide a sparse point cloud as output, are often used with UAVs for this task. While the object could ideally be observed from a large constellation of viewpoints around it, an efficient plan requires selecting only those viewpoints that minimize the overlap with the previous observations to finish the task quickly.
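Returning to the ProxMaP data-collection strategy above (Figure 1.5), the following sketch shows how a self-supervised training pair could be assembled from three nearby viewpoints: the center view alone is the input, and the label is the fusion of the center view with the left and right views. The fusion rule (any observation overrides unknown, occupied overrides free) and the cell encoding are assumptions made for this illustration rather than the exact procedure of Chapter 2.

```python
import numpy as np

UNKNOWN, FREE, OCCUPIED = -1, 0, 1

def fuse_views(maps):
    """Combine occupancy grids from nearby viewpoints: any observation overrides
    'unknown', and an 'occupied' report wins over 'free' (a conservative choice
    made for this illustration)."""
    fused = np.full(maps[0].shape, UNKNOWN, dtype=int)
    for m in maps:
        seen = m != UNKNOWN
        newly = seen & (fused == UNKNOWN)
        fused[newly] = m[newly]
        fused[seen & (m == OCCUPIED)] = OCCUPIED
    return fused

def make_training_pair(center_map, left_map, right_map):
    """Self-supervised sample in the spirit of Figure 1.5: the network input is the
    center-camera occupancy grid alone, and the label is the fusion of the center
    view with the nearby left/right views, so the model learns to fill in occupancy
    just beyond what the center view sees."""
    x = center_map.copy()
    y = fuse_views([center_map, left_map, right_map])
    return x, y
```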
Efficiency is a crucial requirement when using UAVs, as their limited battery capacity limits the flight time. Similar to the 2D situation, to incrementally build the map, the robot must choose the next-best-view (NBV) carefully to produce an accurate reconstruction with an efficient plan. If the shape of the object (equivalent to the map) is already known, we can carefully select a minimal set of viewpoints, resulting in a geometric NBV approach. But this may not always be possible, and such a requirement limits the situations in which the UAV can be deployed. Prior works have used learning-based approaches for this, but these networks are usually specific to an object class (e.g., houses only). This means that we must train a model for each target object. Other works have used the Gestalt principle of similarity to bridge the gap between the lack of a map and the geometric NBV approach. Revisiting the example from Section 1, all planes have similar shapes even though they may differ in details. Thus, if we can predict the full shape of the object of interest from partial views (this is known as shape completion), we can effectively construct a geometric NBV plan.

Figure 1.6: Overview of our prediction-guided next-best-view approach (Pred-NBV) for object reconstruction

The existing works on 3D shape prediction make an implicit assumption about the partial observations and therefore cannot be used for real-world planning [27]. Also, they do not consider the control effort for next-best-view planning, which directly affects the flight time [28]. We proposed Pred-NBV [29], a realistic object shape reconstruction method consisting of PoinTr-C, an enhanced 3D prediction model trained on the ShapeNet dataset using curriculum learning, and an information- and control-effort-based next-best-view method to address these issues. Figure 1.6 shows the overview of Pred-NBV. In each iteration, the robot predicts the full shape from partial observations and moves to the closest location that is expected to yield high information gain. After moving to the new location, the current observations are combined with the previous ones. The process repeats until a termination condition is reached. Pred-NBV achieves an improvement of 25.46% in object coverage over traditional methods in the AirSim simulator [21] and performs better shape completion than PoinTr [27], the state-of-the-art shape completion model, even on real data obtained from a Velodyne 3D LiDAR mounted on a DJI M600 Pro. This work was accepted at the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2023) [29].

Extensions to Multi-Robot Systems

The previous subsections show how the continuity and closure principles can improve navigation efficiency for single-robot systems by making predictions over partial observations. It is intuitive to expect further benefits when more information and context (and hence more geometrical and structural cues) are provided to the prediction model. This raises the question: what if these ideas are extended to multi-robot systems?

For navigation with a 2D map, we study the task of multi-agent exploration, which requires a team of robots to observe the whole map in the fewest steps. We use predictions for planning as well as mapping by inferring the unexplored regions using continuity. As observations from multiple viewpoints provide strong geometrical and structural cues, the predictions are more reliable.
Whenever there is uncertainty in the predictions, we navigate the robots to these regions to observe them directly and reduce the uncertainty. For this, we use the MAE to predict the unexplored regions and extract uncertainty from these predictions by adding visually imperceptible noise to the input. A centralized planner then uses the K-Nearest Neighbors (KNN) algorithm over the unknown and high-uncertainty cells to identify regions of interest and moves the robots to them. Figure 1.7 shows an overview of this process. This method results in higher prediction accuracy in fewer steps compared to the traditional method of assigning non-overlapping regions to robots and scanning them in a sweeping fashion. These findings were accepted at the 2024 International Conference on Robotics and Automation (ICRA 2024) [22].

Figure 1.7: Overview of our prediction- and uncertainty-driven planning approach for multi-agent coverage

For 3D object reconstruction, we extended the prior work to MAP-NBV, a prediction-guided active algorithm for 3D reconstruction with multi-agent UAV systems. We use PoinTr-C as in Pred-NBV, but add a centralized planner to find the NBV for all the UAVs together. We jointly optimize the information gain and control effort for efficient collaborative 3D reconstruction of the object. Our method achieves a 19% improvement over the non-predictive multi-agent approach and a 17% improvement over the prediction-based, non-cooperative multi-agent approach. This work was accepted at the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024) [30].

1.2.2 Spatiotemporal Patterns

In this section, we describe spatiotemporal patterns and outline our contributions that use them to improve robot planning. Specifically, we focus on decentralized algorithms for multi-robot coverage and target tracking, putting the primary emphasis on planning. The first subsection aims to harness the patterns emerging from the spatial arrangement of the robots, represented as a communication graph, and their motion through space. In the next subsection, we additionally utilize similar patterns from moving targets.

Decentralized Coordinated Coverage using Graph Neural Networks

The problem of decentralized multi-robot target tracking asks for jointly selecting actions, e.g., motion primitives, for the robots to maximize joint coverage using only local communication. One major challenge for practical implementations is to make target-tracking approaches scalable to large-scale problem instances. In this work, we propose a general-purpose learning architecture for collaborative target tracking at scale with decentralized communication. Classical, manually designed decentralized approaches can be more scalable than centralized ones at the cost of reduced coverage [31]. We investigate whether planners designed using learning-based algorithms can accomplish the same when provided with local observations as hand-crafted features. In particular, our learning architecture, shown in Figure 1.8, leverages a graph neural network (GNN) to capture local interactions of the robots and learns decentralized decision-making for the robots [32]. We train the learning model by imitating an expert solution and deploy the resulting model for decentralized action selection involving only local observations and communication.
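For intuition, one round of GNN message passing over the robots' communication graph can be sketched as below. This is a generic mean-aggregation toy in NumPy, not the architecture used in [32]; the feature sizes and weights are arbitrary placeholders.

```python
import numpy as np

def gnn_layer(node_features, adjacency, weight):
    """One round of neighborhood aggregation: each robot combines its own
    features with the mean of its communication neighbors' features, then
    applies a shared transformation (here a random placeholder matrix)."""
    degree = adjacency.sum(axis=1, keepdims=True).clip(min=1)
    neighbor_mean = (adjacency @ node_features) / degree
    return np.tanh((node_features + neighbor_mean) @ weight)

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 3))          # 4 robots, 3 local features each
adjacency = np.array([[0, 1, 0, 1],         # ring-shaped communication graph
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [1, 0, 1, 0]], dtype=float)
outputs = gnn_layer(features, adjacency, weight=rng.normal(size=(3, 3)))
```

Because each robot's output depends only on messages from its direct neighbors, a layer of this form that is trained centrally can still be executed locally on each robot at deployment time.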
Representing the spatial arrangement of the robots as a graph and using a GNN over it also allows centralized training with decentralized execution [33], a property specific to GNNs. We demonstrate the performance of our learning-based approach in a scenario of active target tracking with large networks of robots. The simulation results show that our approach nearly matches the tracking performance of the expert algorithm, yet runs several orders of magnitude faster with up to 100 robots. Moreover, it slightly outperforms a decentralized greedy algorithm while also running faster (especially with more than 20 robots). The results also exhibit our approach's generalization capability to previously unseen scenarios, e.g., larger environments and larger networks of robots. This shows that learning-based approaches can exploit these patterns for efficient planning, achieving similar coverage with faster inference than the equivalent classical approaches. This work was accepted at the 2022 IEEE Symposium on Safety, Security, and Rescue Robotics (SSRR 2022) [32].

Figure 1.8: Our graph neural network (GNN) based method for decentralized multi-robot coverage

Decentralized Coverage and Tracking with Differentiable Planners

Learning-based distributed algorithms provide a scalable avenue while also bringing data-driven feature generation capabilities to the table, allowing integration with other learning-based approaches. Our previous work focuses on local communication through a GNN to improve planning. As we demonstrated in Section 1.2.1, perception is often harder to model than planning. Thus we ask: can a learning-based planner augment the training of a learning-based perception model to outperform a simple combination of the two? Realizing this setup requires an end-to-end differentiable planner that can be seamlessly combined with a perception network. To this end, we present a learning-based, differentiable distributed coverage planner (D2CoPlan) [34], shown in Figure 1.9, that scales efficiently in runtime and number of agents compared to the expert algorithm and performs on par with the classical distributed algorithm. We then combine it with a perception network to predict motion primitives for covering dynamic targets, hence solving a target tracking problem. We find that this modular combination not only outperforms combinations of classical and learning-based counterparts but also learns more efficiently than a single monolithic end-to-end planning network. These findings suggest that differentiable designs in perception and planning are key to the development of more powerful learning-based solutions through end-to-end, task-specific learning. This work was published in the proceedings of the 2023 International Conference on Robotics and Automation (ICRA 2023) [34].

Figure 1.9: Our tracking and coverage approach using a differentiable decentralized coverage planner (D2CoPlan)

1.2.3 Semantic Pattern Prediction to Assist Humans and Robots

In this section, we present our contributions utilizing semantic patterns grounded in vision and language to assist humans and other robots. For the former, we investigate the innate world knowledge in Large Language Models (LLMs) and Vision-Language Models (VLMs) to anticipate what a person may need in the future and help them with the task. We specifically focus on language-based semantic patterns.
Large Language Models as Anticipatory Planners for Assisting Humans

Large Language Models (LLMs) are among the most recent significant advancements in artificial intelligence (AI). Trained using reinforcement learning from human feedback (RLHF), these models are exceptionally good at conversing with humans. Within a few months of their introduction, people came up with a huge range of applications using LLMs as human surrogates. Language researchers studying what LLMs learn have revealed that they can reason about the world, which carries huge implications for applications dependent on semantic patterns.

Consider this scenario: a person wakes up in the morning and is getting ready to make coffee. They reach the kitchen counter and turn on the coffee maker. An assistive home robot observing them infers that they are making coffee but notices that there is no sugar nearby. Anticipating that they may need sugar next, the assistant sends a robot to fetch sugar from the pantry and bring it to the human. This ability to help humans without them needing to ask the robot explicitly requires a world model to understand and make inferences about a wide array of human activities.

Leveraging the ability of LLMs to act as approximate world models, we use their capability to generate likely words to assist humans by anticipating their next actions. Here we pose next-action prediction as next-word prediction with LLMs, given past observations of a human's activity. Due to the generative nature of LLMs and their lack of grounding, they may generate many plausible actions. To ground them in real applications, we provide context in the form of a textual description of the map. The robot's action primitives are used as additional input to ensure that affordance-based actions are selected to effectively assist the human. We build this system and demonstrate real-world applications for a variety of tasks in a home-like environment, with both image- and text-based observations handled through VLMs and image captioning methods.

Vision Language Models as Global Context Providers for Assisting Robots

Recently, researchers have been exploring multi-modal representation learning to combine multiple representations in the same embedding space, mainly aimed at tasks that assist humans. One such joint representation that is relevant to robots is the vision-language representation, which can help an agent equipped with such a VLM not only comprehend its observations but also make predictions about the future state of the environment. Focusing on such semantic predictions, we propose a pipeline where an environment camera monitors the surroundings to identify ongoing activities and directs the robot to help the human with them. Having such cameras in the environment is not unusual nowadays; cameras are often used for security purposes in industrial and residential spaces. Growing interest in AI assistants has also brought devices equipped with cameras, which can provide additional guidance to a robot to help the people around it. As depicted in Figure 1.10, the overhead camera monitors and deciphers the activities using VLMs, acting as an overhead VLM agent. The ground robot, or ground agent, can only see limited parts of the scene due to its limited camera field-of-view and occlusions in the scene (such as walls). In contrast, the placement of the overhead camera allows it to observe a wider area than a ground robotic agent could.
However, it cannot move around and help the people in the environment, which the ground agent can do. We propose a method that utilizes the strengths of both agents: the overhead agent is tasked with scene understanding, activity recognition, and predicting what assistance a person in the scene may need. It can then direct the ground robot to move around and perform the appropriate actions. Since the overhead agent may not be able to see all the details from its height, and walls may obstruct its view of some areas, the ground agent uses VLMs to accomplish the task. We implemented this pipeline using GPT-4 to direct a Turtlebot2 robot to help humans in a house-like environment in the real world, paving the way for assistive robots aided by external sensors and VLM-based capabilities.

Figure 1.10: A VLM-based overhead agent working along with a ground robot can act as an effective assistance unit

Given the world knowledge that VLMs encompass, they are well-equipped to assist not only with temporal semantic predictions but also with spatial semantic predictions (e.g., a bowl of sugar is more likely to be in the pantry than in a bathroom). Owing to these properties, the research community has welcomed these models with open arms for many applications, including ObjectNav. ObjectNav [17] is an embodied task where a mobile robot must find an object (e.g., a fork) in an environment without a map. This task is challenging as the ground agent has limited observations due to its limited field-of-view (FoV) and obstructions, and hence its planning horizon is limited. While a VLM may provide sound reasoning about the target object and where to find it, the lack of grounding, hallucinations, and reliance on limited observations pose challenges for effective applications.

Figure 1.11: Two VLM-based agents, one with an overhead view and another with a ground view, can work together to find an object in a scene with generative communication

To address these challenges, we propose using an environment camera, also equipped with a VLM, to provide additional guidance as an overhead agent. However, since the overhead agent itself may suffer from occlusions caused by walls and objects, or may confuse other objects with the robot, it must communicate with the ground agent to estimate the latter's position well and provide accurate guidance to accomplish ObjectNav. To the best of our knowledge, the proposed approach, shown in Figure 1.11, is the first instance of two agents with global and local views communicating using VLMs. Similar prior works have focused on emergent semantic communication with a limited vocabulary, in contrast to the unrestricted, generative communication used here. To study the effect of communication, we further investigate communication length and varying degrees of communication and show that communication indeed plays a crucial role in completing the task. To mitigate the adverse effects of hallucination in simple two-way communication, we propose a selective cooperation framework for this task and achieve a 10% improvement over the non-assistive, single-agent method. This work is currently under review.

1.2.4 Meta-Reasoning to Manage Risk in Predictions

The accuracy of learning-based models is largely dependent on the training data. Generalizability is often a source of concern for these methods and manifests itself as a Sim2Real gap in robotic applications.
An error in prediction can be dangerous to the surroundings, to the humans nearby, and even to the robot itself. Trust in predictions is thus a critical issue for deploying robots in the real world. Existing works have explored this issue through the lens of uncertainty extraction [35], interpretable designs [36], and explainable methods [37], among others. We contribute in this regard with implicit and explicit meta-reasoning approaches over predictions for planning, as described below.

Meta-Reasoning for Risk-Aware Planning with Implicit Means

Neural network-based perception models are generally point prediction models, i.e., for the same input, the network provides the same deterministic output. Bayesian neural networks, which were designed for stochastic outputs, can be computationally intensive and thus are unsuitable for deployment on resource-constrained robots. Kendall et al. [35] proposed Bayesian SegNet, which uses Bayesian dropout [38] for semantic segmentation and uncertainty extraction, and showed its use for detecting uncertainty in street-view images for autonomous driving.

Figure 1.12: Risk-aware planning strategy using uncertainty extraction allows the user to choose between conservative and adventurous plans

Our work builds on this idea and uses Bayesian SegNet on top-down images for high-level planning. In this setup, an aerial robot acts as a scout for ground robot navigation and applies Bayesian SegNet to aerial images. We train our network in the CityEnviron environment and test it in a suburban scene. The grass patches, which are scarce in the city, act as out-of-distribution entities and result in high prediction uncertainty. The semantic costmap is combined with the uncertainty map using a user-defined risk-affinity factor, which allows the user to select between risk-conservative and risk-adventurous paths. The proposed approach thus allows risk management using the implicit uncertainty in the networks. This process is shown in Figure 1.12. This work was published at the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2020) [39].

Meta-Reasoning for Hybrid Planning with Explicit Means

Uncertainty extraction from network predictions is a contentious issue, as we do not know how to truly validate the 'uncertainty' thus obtained. This may result in a gap in trustworthiness, even if learning-based methods are more desirable and suitable for certain scenarios. The simplest solution in such scenarios is to use the predictions only when we trust them and otherwise switch back to traditional methods. Recent works have explored this idea, albeit with neural networks acting as the switch [40, 41], which can again cause trust issues and may not be generalizable.

Figure 1.13: Left: Overview of the proposed hybrid local planning approach, which combines the benefits of classical and AI-based local planners. Right: A real experiment scenario showing the hybrid planner in action when a human suddenly appears on the robot's path.

We use a heuristic-based approach that switches between a classical and a reinforcement learning (RL) based method for local planning in an indoor environment with unexpected obstacles. The classical planner, the dynamic window approach (DWA) [7], is robust and moves the robot efficiently along a smooth path. However, DWA may be slow to react to unexpected obstacles on its path, which may result in a collision.
SACPlanner [42], an RL-based local planner, is more reactive in these situations but results in jerky and inefficient motion. We therefore propose a switching approach that identifies whether there is an unexpected obstacle on the path and switches between the two planners accordingly. If the path is clear, the robot uses the DWA planner, resulting in fast progress towards the goal. If an obstacle is detected, the robot uses the RL-based planner to safely avoid it and switches back to DWA once no other obstacle is on the path to the goal. This method results in faster navigation to the goal without any collisions with obstacles across various scenarios in real-world experiments. Figure 1.13 visualizes this process and shows an experimental situation where this framework was tested in the real world. This work is currently under revision [43].

Organization of the Dissertation

This dissertation is organized into 10 chapters following this one.

In Chapters 2-5, we present methods that utilize structural and geometrical patterns for efficient navigation planning. Chapters 2 and 3 focus on predicting unseen parts of 2D maps from partial views with task-specific training (https://raaslab.org/projects/ProxMaP) and a pre-trained masked image model (https://raaslab.org/projects/MIM4Robots), respectively. Chapters 4 and 5 present methods for predicting 3D maps from partial observations and building next-best-view planning approaches with them for single-agent (https://raaslab.org/projects/PredNBV) and multi-agent systems (https://raaslab.org/projects/MAPNBV), respectively.

Chapter 6 concentrates on leveraging spatiotemporal patterns. First, we show how learning-based approaches can act as approximate but scalable planners for multi-agent coverage problems (https://github.com/VishnuDuttSharma/deep-multirobot-task). Then we present a method that uses a learning-based decentralized approach as a differentiable planner to efficiently train a multi-agent tracking method in an end-to-end manner, achieving better results than its counterpart composed of independently trained submodules (https://raaslab.org/projects/d2coplan).

In Chapter 7, we present a framework using semantic patterns with the help of VLMs and LLMs. This framework uses an overhead camera to direct or coordinate with a ground robot. We first show this framework in action with a real-world implementation to assist a human with everyday tasks in a house-like environment. Then we use this framework to help a ground robot equipped with a VLM perform ObjectNav and present an analysis of the conversation between the two agents.

Chapters 8 and 9 present methods to manage risk in predictions. In Chapter 8, we propose an implicit measure for risk-aware path planning where out-of-distribution data acts as the source of uncertainty. Chapter 9 focuses on an explicit measure for risk management, using a heuristic-based approach to switch between classical and learning-based planners to balance navigation efficiency against the collision risk due to unexpected obstacles on the path.

We conclude the dissertation with an overview of our research contributions and an outline of future research directions. The prompts and additional results from Chapter 7 are presented in the appendix.
The software and media corresponding to the work in this dissertation are available at https://vishnuduttsharma.github.io/thesis/.

Chapter 2: Structural and Geometric Pattern Prediction in 2D Occupancy Maps

Further details and results for this work are available at https://raaslab.org/projects/ProxMaP.

2.1 Introduction

To navigate in a complex environment, a robot needs to know the map of the environment. This information can either be obtained by mapping the environment beforehand, or the robot can build a map online using its onboard sensors. Occupancy maps, which provide probabilistic estimates of the free (navigable) and occupied (non-navigable) areas, are often used for this purpose. These estimates can be updated as the robot gains new information while navigating. Given an occupancy map, the robot can adjust its speed to navigate faster through high-confidence free areas and slower through low-confidence free areas so that it can stop before a collision. The effective speed of the robot thus depends on the occupancy estimates. Occlusions due to obstacles and the limited field-of-view (FoV) of the robot lead to low-confidence occupancy estimates, which limit the navigation speed of the robot.

In this chapter, we propose training a neural network to predict occupancy in the regions that are currently occluded by obstacles, as shown in Fig. 2.1. Prior works learn to predict the occupancy map all around the robot, i.e., simulating a 360° FoV given the visible occupancy map within the current, limited FoV [44-46]. Since the network is trained to predict the occupancy map all around the robot, it overfits by learning the room layouts. This happens because the network must predict occupancy for areas, such as the region behind the robot, for which it may not have any overlapping information in its egocentric observations. This makes the prediction task difficult to learn. Furthermore, obtaining the ground truth requires mapping the whole environment beforehand, which could make sim-to-real transfer tedious.

Figure 2.1: An example situation where the robot's view is limited by the obstacles (a sofa blocking the view) and the camera field of view (the sofa on the right is not fully visible). (a) Third-person view of the robot in a living room. (b) Top view of the robot showing the visibility polygon.

Our key insight is to simplify this problem by making predictions only in the immediate proximity of the robot, i.e., in areas where it could move next. This setting has three-fold advantages: first, the prediction task is easier and more relevant, as the network needs to reason only about the immediately accessible regions (which are partly visible); second, the robot learns to predict obstacle shapes instead of room layouts, making it more generalizable; and third, the ground truth is easier to obtain, as it can be collected by simply moving the robot, making the approach self-supervised.

Figure 2.2: Overview of the proposed approach. (a) Movement configuration for data collection. (b) Training and prediction overview. The training and inference flows are indicated with red and black arrows, respectively. We take the input views by moving the robot to the left and right sides (CamLeft and CamRight), looking towards the region of interest.
ProxMaP makes predictions using CamCenter only, and the map obtained by combining the information from the three positions acts as the ground truth.

Following are our main contributions in this work:

1. We present ProxMaP, a self-supervised proximal occupancy map prediction method for indoor navigation, trained on occupancy maps generated from the AI2THOR simulator [1], and show that it makes accurate predictions and also generalizes well to the HM3D dataset [2] without fine-tuning.

2. We study the effect of training ProxMaP under various paradigms on prediction quality and navigation tasks, highlighting the role of the training method in occupancy map prediction tasks. We also present some qualitative results on real data showing that ProxMaP can be extended to real-world inputs.

3. We simulate PointGoal navigation as a downstream task utilizing our method for occupancy map prediction and show that our method outperforms the baseline, non-predictive approach by 12.40% (relative) in navigation speed, and even outperforms a robot with multiple cameras in the general setting.

2.2 Related Works

Mapping the environment is a standard step for autonomous navigation. Classical methods typically treat unobserved (i.e., occluded) locations as unknown. Our focus in this work is on learning to predict the occupancy values in these occluded areas. As shown by recent works, occupancy map prediction can help the robot navigate faster [47] and more efficiently [18].

Earlier works explored machine learning techniques for online occupancy map prediction [48, 49], but they require updating the model online with new observations. Recent works shifted to offline training using neural networks, treating map-to-map prediction as an inpainting task. Katyal et al. [50] compared ResNet, UNet, and GAN architectures for 2D occupancy map inpainting with LiDAR data, finding that UNet outperforms the others. Subsequent works used UNet for occupancy map prediction with RGBD sensors, demonstrating improved robot navigation [25, 44, 45].

Offline training for these methods requires collecting ground truth data by mapping the entire training environment, which can be time-consuming and hinder real-world deployment. Moreover, these models are trained to predict occupancy for the entire surroundings of the robot, including the scene behind it, for which they may lack context within the current observation; this could result in the networks memorizing room layouts, affecting their generalizability. Additionally, methods relying on historical observations for predictions [51] face data-efficiency challenges during training.

As robots can actively collect data, self-supervised methods have been successful in addressing data requirements for various robotic learning tasks [29, 52-56]. For occupancy map prediction in indoor robot navigation, Wei et al. [25] proposed a self-supervised approach using two downward-looking RGBD cameras at different heights. The network predicts the combined occupancy map from the lower camera's input without manual annotation, making it data-efficient and suitable for real robots. However, it struggles to predict edge-like obstacles and requires additional data collection for fine-tuning. Moreover, the tilted cameras limit the information captured ahead of the robot compared to straight, forward-looking cameras.
To this end, we propose a self-supervised method that uses a single, forward-looking camera to maximize information acquisition on the navigation plane while reducing the control effort required to collect data. Adding two cameras to the sides of the robot could further reduce this effort. We design our predictor as a classification network, which can generate sharper maps than regression networks, as shown later in this chapter. We focus on making predictions in the proximity of the robot, reducing the likelihood of memorization and improving generalizability by using the current view as context.

2.3 Proposed Approach: Proximal Occupancy Map Prediction (ProxMaP)

In this work, we consider a ground robot equipped with an RGBD camera in indoor environments. Two additional views are obtained by moving the robot around, as shown in Fig. 2.2a. The same can also be achieved by adding extra cameras to the robot. In the following subsections, we detail the network architecture for ProxMaP, the training details, and the data collection process.

2.3.1 Network Architecture and Training Details

We use the occupancy map generated by CamCenter as input and augment it using a prediction network. Our goal is to accurately predict the occupancy information for the unknown cells in the input map. The network uses the map generated by combining information from the three robot positions as the ground truth for training and thus learns to predict occupancy in the robot's proximity. We use UNet [57] for map prediction in ProxMaP due to its ability to perform pixel-to-pixel prediction well by sharing intermediate encodings between the encoder and decoder. We use a UNet with a 5-block encoder and a 5-block decoder. For training, we convert these maps to 3-channel images representing free, unknown, and occupied regions. This is done by assigning each cell to one of the 3 classes based on its probability p: if p ≤ 0.495, the cell is treated as free; if p ≥ 0.505, it is treated as occupied; and it is treated as unknown in the rest of the cases, similar to Wei et al. [25]. We train the network with the cross-entropy loss, a popular choice for training classification networks.

Since previous works have used variations of UNet for occupancy map prediction, training it as a regression task [25, 52] and as a generative task [47, 58], we also train ProxMaP with these variations. We use UNet as the building block for these approaches as well, with Oc as input and O∗ as the target map (Fig. 2.2b). For the regression tasks, these maps are transformed from log odds to probability maps before training. For the generative tasks, we use the UNet-based pix2pix [59] network with single- and three-channel input-output pairs for regression and classification, respectively.

For regression, since both input and output are probability maps, we use the KL-divergence loss function for training UNet, which simplifies to binary cross-entropy (BCE) under the assumption that each occupancy map is sampled from a multivariate Bernoulli distribution parameterized by the probability of each cell. In addition, we also train a UNet with the Mean Squared Error (MSE) loss for regression. For training the generative models, we use the L1 and LGAN losses as suggested by Isola et al. [59]. In the rest of the discussion, we will refer to the generative classification, generative regression, and discriminative regression variations of ProxMaP as Class-GAN, Reg-GAN, and Reg-UNet, respectively.
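As a concrete illustration of the three-class conversion described above, the following sketch maps a cell-wise probability map to free/unknown/occupied labels using the stated thresholds and prepares a three-channel representation. It is a minimal NumPy example with placeholder names, not the actual ProxMaP implementation.

```python
import numpy as np

FREE, UNKNOWN, OCCUPIED = 0, 1, 2

def probability_to_classes(prob_map):
    """Assign each cell a class from its occupancy probability p:
    free if p <= 0.495, occupied if p >= 0.505, unknown otherwise."""
    labels = np.full(prob_map.shape, UNKNOWN, dtype=np.int64)
    labels[prob_map <= 0.495] = FREE
    labels[prob_map >= 0.505] = OCCUPIED
    return labels

def to_three_channel(labels, num_classes=3):
    """One-hot encode the label map into a (3, H, W) image."""
    return np.eye(num_classes, dtype=np.float32)[labels].transpose(2, 0, 1)

prob = np.array([[0.20, 0.50],
                 [0.90, 0.49]])
labels = probability_to_classes(prob)   # [[FREE, UNKNOWN], [OCCUPIED, FREE]]
image = to_three_channel(labels)        # three-channel map representation
```

In a classification setup like the one described above, integer label maps of this kind would serve as the targets for a pixel-wise cross-entropy loss.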
2.3.2 Data Collection

We use the AI2THOR [1] simulator, which provides photo-realistic scenes with depth and segmentation maps. Our setup, as shown in Fig. 2.2a, includes three RGBD cameras: CamCenter, positioned at the robot's location at a height of 0.5 m, and two additional observations from CamLeft and CamRight, located at a horizontal distance of 0.3 m from the original position towards the left and right, respectively. Each camera is rotated by 30° to capture extra information about the scene and increase the robot's FoV, while also making sure that the cameras on the sides have some overlap with CamCenter. The rotation of the cameras virtually increases the FoV of the robot, and the translation ensures that the robot learns to look around corners rather than simply rotating in place.

Each camera captures depth and instance segmentation images. The depth image aids in creating a 3D re-projection of the scene into point clouds, while the segmentation image identifies the ceiling (excluded from occupancy map generation) and the floor (representing the free/navigable area). The rest of the scene is considered occupied/non-navigable. The segmentation-based processing can be replaced with height-based filtering of the ceiling and floor after re-projection. All the point clouds are reprojected to a top-down view in the robot frame using the appropriate rotation and translation. Maps are then limited to a 5 m × 5 m area in front of the robot and converted to 256 × 256 images for use in the network. Points belonging to obstacles increment the corresponding cell value by 1, while floor points decrement it by 1. Each bin's point count is multiplied by a factor m = 0.1 to obtain an occupancy map in log odds. To limit the log-odds values, the count is clipped to the range [−10, 10]. The resulting map from CamCenter, denoted Oc, is the network input. The ground truth map O∗ is constructed as a combination of the maps from the three cameras, similar to Wei et al. [25], as follows:

O∗ = max{abs(Oc), abs(Ol), abs(Or)} · sign(Oc + Ol + Or),   (2.1)

where Oc, Ol, and Or refer to the occupancy maps generated by CamCenter, CamLeft, and CamRight, respectively. These log-odds maps are converted to probability maps before being used for network training.

AI2THOR provides different types of rooms. We use only living rooms, as they are larger and contain more obstacles than the other room types. Out of the 30 such rooms, we use the first 20 for training and validation and the rest for testing. For data collection, we divide the floor into a square grid with 0.5 m cells and rotate the cameras over 360° in steps of 45°. Some maps do not contain much information to predict due to the robot being close to the walls. Thus, we filter out map pairs where the number of occupied cells in O∗ is more than 20%. This process provides us with ∼6000 map pairs for training and ∼2000 pairs for testing.

2.4 Experiments & Evaluation

We report two types of results in this section. First, we present the prediction performance of ProxMaP and its variations on our test dataset from AI2THOR. Additionally, we show prediction results on HM3D [2] to test generalizability. Then we use these networks for indoor point-goal navigation and compare them with non-predictive methods and the state-of-the-art self-supervised approach [25]. Finally, we present qualitative results on some real observations to highlight the potential for real-world applications.
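For concreteness, the ground-truth fusion in Eq. (2.1) from the data-collection step and the subsequent log-odds-to-probability conversion can be sketched in a few lines of NumPy. The array values below are toy log-odds maps, not data from the actual pipeline.

```python
import numpy as np

def fuse_ground_truth(o_c, o_l, o_r):
    """Combine the CamCenter, CamLeft, and CamRight log-odds maps as in Eq. (2.1):
    keep the largest magnitude among the three maps, signed by the sum of the maps."""
    magnitude = np.max(np.abs(np.stack([o_c, o_l, o_r])), axis=0)
    return magnitude * np.sign(o_c + o_l + o_r)

def log_odds_to_probability(log_odds):
    """Convert a log-odds occupancy map to a probability map before training."""
    return 1.0 / (1.0 + np.exp(-log_odds))

# Toy 2x2 log-odds maps, already clipped to [-10, 10] as in the data pipeline.
o_c = np.array([[ 2.0, 0.0], [-3.0, 1.0]])
o_l = np.array([[-1.0, 4.0], [ 0.0, 0.0]])
o_r = np.array([[ 0.0, 0.0], [-5.0, 2.0]])
o_star = fuse_ground_truth(o_c, o_l, o_r)    # ground-truth log-odds map O*
p_star = log_odds_to_probability(o_star)     # probability map used for training
```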
The networks were trained on a GeForce RTX 2080 GPU, with a batch size of 4 for the GANs and 16 for the discriminative models. Early stopping was used to avoid overfitting, with the maximum number of epochs set to 300.

2.4.1 Occupancy Map Prediction

Setup. As our ground truth maps are generated from a limited set of observations, they may not contain the occupancy information for all the surrounding cells. Hence, we evaluate the predictions only in cells whose ground truth occupancy is known to be either occupied or free. We refer to such cells as inpainted cells. For classification, we choose the most likely label as the output for each cell. For regression, a cell is considered free if the probability p in that cell is less than 0.495. Similarly, a cell with p ≥ 0.505 is considered occupied. The remaining cells are treated as unknown and are not considered in the evaluations. Prediction accuracy is a typical metric to evaluate prediction quality. However, it may not present a clear picture in our situation due to the data imbalance caused by the small number of occupied cells. Ground robots with cameras at low heights, as in our case, are more prone to this imbalance, as the robot may observe only the edges of obstacles. Thus, we also present the precision, recall, and F1 score for each class.

Results. Fig. 2.3 shows the qualitative results from ProxMaP and its variants, and Table 2.1 summarizes the quantitative outcomes. The classification version of ProxMaP exhibits superior precision in predicting occupied cells. In contrast