ABSTRACT

Title of Dissertation: ENHANCED ROBOT PLANNING AND PERCEPTION THROUGH ENVIRONMENT PREDICTION

Vishnu Dutt Sharma
Doctor of Philosophy, 2024

Dissertation Directed by: Professor Pratap Tokekar
Department of Computer Science

Mobile robots rely on maps to navigate through an environment. In the absence of any map, the robots must build the map online from partial observations as they move in the environment. Traditional methods build a map using only direct observations. In contrast, humans identify patterns in the observed environment and make informed guesses about what to expect ahead. Modeling these patterns explicitly is difficult due to the complexity of the environments. However, these complex models can be approximated well using learning-based methods in conjunction with large training data. By extracting patterns, robots can use not only direct observations but also predictions of what lies ahead to better navigate through an unknown environment. In this dissertation, we present several learning-based methods to equip mobile robots with prediction capabilities for efficient and safer operation.

In the first part of the dissertation, we learn to predict using geometrical and structural patterns in the environment. Partially observed maps provide invaluable cues for accurately predicting the unobserved areas. We first demonstrate the capability of general learning-based approaches to model these patterns for a variety of overhead map modalities. Then we employ task-specific learning for faster navigation in indoor environments by predicting 2D occupancy in the nearby regions. This idea is further extended to 3D point cloud representation for object reconstruction. By predicting the shape of the full object from only partial views, our approach paves the way for efficient next-best-view planning, which is a crucial requirement for energy-constrained aerial robots. Deploying a team of robots can also accelerate mapping. Our algorithms benefit from this setup as more observations result in more accurate predictions, further improving efficiency in the aforementioned tasks.

In the second part of the dissertation, we learn to predict using spatiotemporal patterns in the environment. We focus on dynamic tasks such as target tracking and coverage where we seek decentralized coordination between robots. We first show how graph neural networks can be used for more scalable and faster inference while achieving coverage performance comparable to classical approaches. We find that differentiable design is instrumental here for end-to-end task-oriented learning. Building on this, we present a differentiable decision-making framework that consists of a differentiable decentralized planner and a differentiable perception module for dynamic tracking.

In the third part of the dissertation, we show how to harness semantic patterns in the environment. Adding semantic context to the observations can help the robots decipher the relations between objects and infer what may happen next based on the activity around them. We present a pipeline using vision-language models to capture a wider scene using an overhead camera to provide assistance to humans and robots in the scene. We use this setup to implement an assistive robot to help humans with daily tasks, and then present a semantic communication-based collaborative setup of overhead-ground agents, highlighting the embodiment-specific challenges they may encounter and how they can be overcome.
The first three parts employ learning-based methods for predicting the environment. However, if the predictions are incorrect, this could pose a risk to the robot and its surroundings. The final part of the dissertation presents risk management methods with meta-reasoning over the predictions. We study two such methods: one extracting uncertainty from the prediction model for risk-aware planning, and another using a heuristic to adaptively switch between classical and prediction-based planning, resulting in safe and efficient robot navigation.

ENHANCED ROBOT PLANNING AND PERCEPTION THROUGH ENVIRONMENT PREDICTION

by

Vishnu Dutt Sharma

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2024

Advisory Committee:
Professor Pratap Tokekar, Chair/Advisor
Professor Nikhil Chopra, Dean's Representative
Professor Dinesh Manocha
Professor Tianyi Zhou
Professor Kaiqing Zhang

© Copyright by Vishnu Dutt Sharma 2024

Dedication

To my brother, Ramdutt Sharma.

"We are stardust brought to life, then empowered by the universe to figure itself out – and we have only just begun." — Neil deGrasse Tyson

Acknowledgments

My Ph.D. journey has been a thrilling rollercoaster, filled with exhilarating highs and challenging lows. While I officially embarked on this path five years ago, the journey truly started many years earlier, and it would not have been possible without the unwavering support of many individuals who stood by me all along.

First and foremost, I express my deepest gratitude to my advisor, Dr. Pratap Tokekar, for his tireless support and mentorship at every step of this journey. Looking back, I recognize how he encouraged me to explore my interests and gently nudged me in the right direction when necessary, helping me grow into an independent researcher. His support extended beyond academic guidance, providing me with empathy and encouragement throughout this process. As my friend and colleague Deeksha aptly put it, "he is a great adviser and an amazing guide", and I am deeply grateful for the opportunity to have worked with him.

I am profoundly thankful to my committee members, Dr. Nikhil Chopra, Dr. Dinesh Manocha, Dr. Tianyi Zhou, and Dr. Kaiqing Zhang, for their invaluable feedback and insights on my dissertation. I also want to extend my gratitude to Dr. Manocha for his advice during the preliminary exam and for connecting me with the collaborators at the GAMMA lab at UMD, which led to the projects forming an essential part of this dissertation.

Heartfelt thanks to my collaborators and co-authors: Harnaik Singh Dhami, Lifeng Zhou, Anukriti Singh, Vishnu Sashank Dorbala, Qingbiao Li, Jingxi Chen, and Maymoonah Toubeh. This dissertation was made possible with their help, and working with them expanded the horizons of my conceptual and practical knowledge. I am also grateful to have worked with and learned from Dr. Matthew Andrews, Jeongran Lee, and Ilija Hadžić during my internship.

This dissertation would not have been possible without the generous financial support from several sources: the U.S. National Science Foundation (grant #1943368), the Office of Naval Research (grant #N00014-18-1-2829), the Kulkarni Foundation, Nokia Bell Labs, and Comcast Corporation. I am grateful to Ivan Penskiy and the Maryland Robotics Center for providing the necessary hardware for experiments.
I also extend my thanks to IEEE RAS, the Department of Computer Science at UMD, and the Graduate School at UMD for support in the form of travel grants. Attending conferences with these grants allowed me to experience research on a broad scale and connect with the wider research community. I owe my deepest gratitude to my friends who kept me going through the adversities: Ak- shita Jha, Siddharth Jar, Aman Gupta, Vikram Mohanty, Rajnish Aggarwal, Abhilash Sahoo, Alisha Pradhan, Biswaksen Patnaik, and Pramod Chundury. Special thanks to Akshita and Sid- dharth, who were always a call away, to lend an ear to all my personal and professional problems and provide kind and energizing words. I also want to thank Aman for motivating me to keep working towards starting my Ph.D. journey. I am grateful to Dr. Pawan Goyal and Dr. Amrith Krishna, who provided me with my first opportunity to pursue academic research during my undergraduate studies. Under their guidance, I learned the fundamental skills that have carried me through my academic career. Many thanks to my friends and lab-mates from the RAAS Lab: Guangyao Shi, Amisha Bhaskar, Prateek Verma, Jingxi Chen, Troi Williams, Charith Reddy, Deeksha Dixit, Rui Liu, Chak Lam Shek, Zahir Mahammad, and Sachin Jadhav. I thank them for making this experience v enjoyable. This journey would have been unimaginable without the unwavering love and support of my family. I owe them a deep gratitude for their patience, understanding, and encouragement throughout this journey. Finally, I wish to thank the creators of Naruto. I started watching the show when I was going through a very rough patch. The character instilled in me the courage to keep going and start this journey. It transformed the anger within into acknowledgment and kindness towards myself, setting me on a path that ultimately led to this accomplishment. vi Table of Contents Dedication ii Acknowledgements iv Table of Contents vii List of Tables xi List of Figures xiii List of Abbreviations xix Chapter 1: Introduction 1 1.1 Types of Patterns and Informed Decision-Making . . . . . . . . . . . . . . . . . 3 1.1.1 Geometrical and Structural Patterns . . . . . . . . . . . . . . . . . . . . 3 1.1.2 Spatiotemporal Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.1.3 Semantic Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.2 Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.2.1 Enhanced Perception with Structural Continuity and Closure . . . . . . . 9 1.2.2 Spatiotemporal Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 1.2.3 Semantic Pattern Prediction to Assist Humans and Robots . . . . . . . . 19 1.2.4 Meta-Reasoning to Manage Risk in Predictions . . . . . . . . . . . . . . 24 Chapter 2: Structural and Geometric Pattern Prediction in 2D Occupancy Maps 29 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.3 Proposed Approach: Proximal Occupancy Map Prediction (ProxMaP) . . . . . . 33 2.3.1 Network Architecture and Training Details . . . . . . . . . . . . . . . . 33 2.3.2 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 2.4 Experiments & Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 2.4.1 Occupancy Map Prediction . . . . . . . . . . . . . . . . . . . . . . . . . 37 2.4.2 Navigation Performance . . . . . . . . . . . . 
. . . . . . . . . . . . . . 40 2.4.3 Predictions on Real Data . . . . . . . . . . . . . . . . . . . . . . . . . . 43 2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 Chapter 3: Structural and Geometric Pattern Prediction in 2D Images and Maps 45 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 vii 3.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.2.1 Mapping for Robot Navigation . . . . . . . . . . . . . . . . . . . . . . . 49 3.2.2 Self-supervised masked encoding . . . . . . . . . . . . . . . . . . . . . 50 3.3 Proposed Approach: MAE as Zero-Shot Predictor . . . . . . . . . . . . . . . . . 50 3.3.1 FoV Expansion and Navigation . . . . . . . . . . . . . . . . . . . . . . . 51 3.3.2 Multi-Agent Uncertainty Guided Exploration . . . . . . . . . . . . . . . 53 3.3.3 Navigation with Prediction . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.4 Experiments and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.4.1 FoV Expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.4.2 Multi-Agent Uncertainty Guided Exploration . . . . . . . . . . . . . . . 60 3.4.3 Navigation with prediction . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 Chapter 4: Structural and Geometrical Pattern Prediction in 3D Point Clouds 64 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 4.3 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.4 Proposed Approach: Prediction-based Next-Best-View (Pred-NBV) . . . . . . . 71 4.4.1 PoinTr-C: 3D Shape Completion Network . . . . . . . . . . . . . . . . . 71 4.4.2 Next-Best View Planner . . . . . . . . . . . . . . . . . . . . . . . . . . 72 4.5 Experiment and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.5.1 Qualitative Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.5.2 3D Shape Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.5.3 Next-Best-View Planning . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 Chapter 5: Structural and Geometric Pattern Prediction with Planning for Multi-Robot Systems 81 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 5.3 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 5.4 Proposed Approach: Multi-Agent Prediction-based Next-Best-View (MAP-NBV) 87 5.4.1 3D Model Prediction (Line 5-6) . . . . . . . . . . . . . . . . . . . . . . 88 5.4.2 Decentralized Coordination (Line 7-12) . . . . . . . . . . . . . . . . . . 88 5.5 Experiments and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 5.5.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 5.5.2 Qualitative Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 5.5.3 Can Point Cloud Prediction Improve Reconstruction? . . . . . . . . . . . 93 5.5.4 How does coordination affect reconstruction? . . . . . . . . . . . . . . . 95 5.5.5 Qualitative Real-World Experiment . 
. . . . . . . . . . . . . . . . . . . 99 5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 Chapter 6: Spatiotemporal Pattern Prediction for Multi-Robot Coordination and Tracking 102 6.1 Learning Decentralized Coordination with Graph Neural Networks . . . . . . . . 103 6.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 viii 6.1.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 6.1.3 Proposed Approach: GNN-based Decentralized Coverage Planner . . . . 113 6.1.4 Experiments and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 117 6.1.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 6.2 Learning to Track and Coordinate with Differentiable Planner . . . . . . . . . . . 124 6.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 6.2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 6.2.3 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 6.2.4 Proposed Approach: Differentiable, Decentralized Coverage Planner (D2CoPlan) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 6.2.5 Experiments and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 133 6.2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 Chapter 7: Semantic Pattern Prediction for Assistive Robot Perception and Planning 143 7.1 Semantic Pattern Prediction for Assisting Humans . . . . . . . . . . . . . . . . . 144 7.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 7.1.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 7.1.3 System Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 7.1.4 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 7.1.5 An Examples Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . 152 7.2 Semantic Communication for Assisting Robots with ObjectNav . . . . . . . . . . 156 7.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 7.2.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 7.2.3 Proposed Approach: Assisted ObjectNav . . . . . . . . . . . . . . . . . 161 7.2.4 Experiments and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 168 7.2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172 Chapter 8: Meta-Reasoning for Risk Management with Implicit Measures 174 8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174 8.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178 8.2.1 Conditional Value at Risk . . . . . . . . . . . . . . . . . . . . . . . . . . 178 8.2.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 8.3 Proposed Approach: Risk-Aware Path Planner . . . . . . . . . . . . . . . . . . . 181 8.3.1 Semantic Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . 182 8.3.2 Candidate Path Generation . . . . . . . . . . . . . . . . . . . . . . . . . 183 8.3.3 Risk-Aware Path Assignment . . . . . . . . . . . . . . . . . . . . . . . . 183 8.4 Experiment and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187 8.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
192 Chapter 9: Meta-Reasoning for Risk Management with Explicit Measures 196 9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196 9.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199 9.3 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201 9.3.1 Dynamic Window Approach (DWA) . . . . . . . . . . . . . . . . . . . . 201 9.3.2 SACPlanner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204 ix 9.4 Proposed Approach: Hybrid Local Planner . . . . . . . . . . . . . . . . . . . . . 206 9.4.1 Waypoint Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206 9.4.2 Clearance Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208 9.4.3 Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209 9.4.4 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . 210 9.5 Experiments and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212 9.5.1 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212 9.5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 9.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218 Chapter 10: Conclusion and Future Work 220 10.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220 10.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221 10.2.1 Large Models for Zero-Shot Robotic Applications . . . . . . . . . . . . . 221 10.2.2 Risk-Aware Methods for Large Models . . . . . . . . . . . . . . . . . . 222 10.2.3 Building End-to-End Methods . . . . . . . . . . . . . . . . . . . . . . . 222 Appendix A: Prompts and Additional Experiments for Assisted ObjectNav 224 A.1 Prompts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224 A.1.1 No Comm. - Ground . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224 A.1.2 Communication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226 A.1.3 Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227 A.1.4 Preemptive Actions Classifier . . . . . . . . . . . . . . . . . . . . . . . 230 A.2 Adverserial Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230 A.3 Real World Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232 A.3.1 Localization Failures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232 A.3.2 Finetuning Prompts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233 A.3.3 Identifying Correct Action from Dialogue . . . . . . . . . . . . . . . . . 236 A.4 Selective Communication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239 A.5 Communication Wordclouds . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240 242 Bibliography 242 x List of Tables 2.1 Comparison across different variations of ProxMaP over living room data from AI2THOR [1] simulator. Abbreviations Reg and Class refer to Regression and Classification tasks, respectively . . . . . . . . . . . . . . . . . . . . . . . . . . 38 2.2 Generalizability of ProxMaP and variations over Habitat-Matterport3D (HM3D) [2] dataset. Abbreviations Reg and Class refer to Regression and Classification tasks, respectively . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 2.3 SCT performance across different living rooms . . . . . . . 
. . . . . . . . . . . 42 3.1 Results for increasing the FoV in RGB images . . . . . . . . . . . . . . . . . . . 59 3.2 Results for increasing the FoV in Semantic segmentation images . . . . . . . . . 59 3.3 Results for increasing the FoV in Binary images from Outdoor environment . . . 60 4.1 Comparison between the baseline model (PoinTr) and PoinTr-C over test data with and without perturbation. Arrows show if a higher (↑) or a lower (↓) value is better. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 4.2 Points observed by Pred-NBV and the baseline NBV method [3] for all models in AirSim. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 5.1 MAP-NBV results in a better coverage compared to the multi-agent baseline NBV method [3] for all models in AirSim upon algorithm termination. . . . . . . . . . 95 6.1 Percentage of the number of targets covered (the average across 1000 trials) by GNNtrained and tested with varying numbers of robots. . . . . . . . . . . . . . . 122 6.2 Percentage of the targets covered (the average across 1000 trials) with respect to EXPERT by D2COPLAN trained and tested with varying numbers of robots. . . . 138 6.3 Percentage of the targets covered (the average across 1000 trials) with respect to EXPERT by D2COPLAN across varying target density maps. . . . . . . . . . . . 138 7.1 Oracle Success Rate (OSR) and Success Weighted by Path Length (SPL) for the assisted ObjectNav in simulation. . . . . . . . . . . . . . . . . . . . . . . . . . . 169 7.2 Generative communication traits in simulations. We find that while the VLMs do not hallucinate much in describing images, they progressively perform worse in assuming pre-emptive agent motion during communication. . . . . . . . . . . . . 170 9.1 Summary statistics of trajectories from the real-world experiments using DWA, SAC, and Hybrid planner. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218 xi A.1 Performance of the Selective Communication setup. We find that letting the GA choose whether to communicate with the agent or not results in a better Object- Nav performance than the fully cooperative setup. . . . . . . . . . . . . . . . . . 240 xii List of Figures 1.1 Motivating Application with Geometrical Structural Patterns: Limited observa- tions and occlusion can limit the robots’ planning capabilities. Geometrical and structural predictions can help them make informed decisions by predicting the map beyond direct observations. The images are from the City and Forest envi- ronments in AirSim. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2 Motivating Application with Spatiotemporal Patterns: The motion of dynamic objects can be difficult to model. Spatiotemporal patterns can help the robots estimate the motion to avoid or track them efficiently. The image is from experi- ments performed at Nokia Bell Labs as an intern. . . . . . . . . . . . . . . . . . 6 1.3 Motivating Application with Semantic Patterns: Lack of semantic understanding can make the robots reliant on human instructions to perform tasks for them. Semantic patterns can help the robots anticipate what a human may need, and help them proactively. The images are from our experiments at RAAS Lab, University of Maryland. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.4 Example of some tasks that can be done with geometrical and structural patterns . 
10 1.5 Movement configuration for data collection and training strategy for proximal occupancy map prediction (ProxMaP) . . . . . . . . . . . . . . . . . . . . . . . 12 1.6 Overview of our prediction-guided next-best-view approach (Pred-NBV) for ob- ject reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.7 Overview of our prediction and uncertainty-driven planning approach for multi- agent coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 1.8 Our graph neural network (GNN) based method for decentralized multi-robot coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 1.9 Our tracking and coverage approach using a differentiable decentralized coverage planner (D2CoPlan) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 1.10 A VLM-based overhead agent working along with a ground robot can act as an effective assistance unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 1.11 Two VLM-based agents, one with an overhead view and another with a ground view, can work together to find an object on a scene with generative communication 23 1.12 Risk-aware planning strategy using uncertainty extraction allows the user to choose between a conservative and adventurous plans . . . . . . . . . . . . . . . . . . . 25 1.13 Left: Overview of the proposed hybrid local planning approach which combines the benefits of classical and AI-based local planners. Right: A real experiment scenario showing a hybrid planner in action when a human suddenly appears on the robot’s path. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 xiii 2.1 An example situation where the robot’s view is limited by the obstacles (sofa blocking the view) and the camera field of view (sofa on the right is not fully visible). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.2 Overview of the proposed approach. The training and inference flows are indi- cated with red and black arrows, respectively. We take the input view by moving the robot to the left and right sides (CamLeft and CamRight), looking towards the region of interest. ProxMaP makes predictions using the CamCenter only, and the map obtained by combining the information from the three positions acts as the ground truth. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.3 Results obtained by the proposed model over some examples (rows). Red, yellow, and green areas represent a high, moderate, and low chance of occupancy in an area. ProxMaP makes more accurate and precise predictions than others. . . . . . 39 2.4 Prediction by ProxMaP over real-world inputs. . . . . . . . . . . . . . . . . . . . 44 3.1 Traditional approach of leveraging models trained on huge computer vision datasets can be applied to robotic tasks reliant on top-down images, albeit with some task- specific fine-tuning. We show that this is not necessary and some models, such as MAE [4] can be applied directly to these robotics tasks. . . . . . . . . . . . . . . 46 3.2 Example of robotics tasks solved with the help of Masked Autoencoder. . . . . . 48 3.3 Examples showing that the masked autoencoder can be used to expand the effec- tive FoV in top-down RGB, semantic, and binary images without fine-tuning. . . 51 3.4 Results of expanding FoV for indoor images in three masking scenarios. The corner of the bathtub and room is accurately predicted based on the symmetry of the lines. . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.5 An overview of the MAE-based multi-agent exploration pipeline. . . . . . . . . . 54 3.6 Left: The area robot has explored till now. Right: Prediction of obstacle (red) shape aiding robot path planning. . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.7 Comparison between the multi-agent exploration algorithms to reach at least 95% accuracy in prediction of the unexplored map. . . . . . . . . . . . . . . . . . . . 60 4.1 Overview of Prediction-based Next Best View (PredNBV). . . . . . . . . . . . . 65 4.2 Flight path and total observations of C17 Airplane after running our NBV planner in AirSim simulation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.3 Results over the real-world point cloud of a car obtained using LiDAR (Interactive figure available on our webpage). . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.4 Comparison between Pred-NBV and the baseline NBV algorithm [3] for a C-17 airplane. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 5.1 MAP-NBV uses predictions to select better NBVs for a team of robots compared to the non-predictive baseline approach. . . . . . . . . . . . . . . . . . . . . . . 82 5.2 Algorithm Overview for MAP-NBV: Each robot runs the same algorithm includ- ing perception, prediction, and planning steps. The robots that communicate with each other can share observations and coordinate planning, whereas robots in isolation (e.g., Robot n) perform individual greedy planning. . . . . . . . . . . . 84 5.3 Flight paths of the two robots during C-17 simulation for MAP-NBV. . . . . . . . 90 xiv http://raaslab.org/projects/PredNBV/ 5.4 Examples of the 5 simulation model classes used for the multi-agent object re- construction task using MAP-NBV. . . . . . . . . . . . . . . . . . . . . . . . . . 92 5.5 MAP-NBV (CO(d)-CP(d)-Greedy) performs comparably to the optimal so- lution (CO(c)-CP(c)-Optimal; Section 5.5.4), and much better than the frontiers-baseline in AirSim experiments. . . . . . . . . . . . . . . . . . . . . . 96 5.6 Directional CD-ℓ2 for teams of 2, 4, and 6 robots on ShapeNet models [5] with different coordination strategies. . . . . . . . . . . . . . . . . . . . . . . . . . . 98 5.7 Real-World MAP-NBV experiment. (a) RGB Image. (b) Observations, Predic- tions, and MAP-NBV poses. (c) Initial, Drone 1, and Drone 2 points after MAP- NBV iteration. (d) Reconstruction. . . . . . . . . . . . . . . . . . . . . . . . . . 100 6.1 An overview of the multi-robot target tracking setup: a team of aerial robots, mounted with down-facing cameras, aims at covering multiple targets (depicted as colorful dots) on the ground. The red arrow lines and the blue dotted lines show inter-robot observations and communications. The red squares represent the fields of view of the robots’ cameras. . . . . . . . . . . . . . . . . . . . . . . 107 6.2 Overview of the decentralized target tracking problem. At a given time step, each robot observes the robots within its sensing range and chooses an action from a set of motion primitives to cover some targets using its camera. Each robot communicates with those robots within its communication range. . . . . . . . . . 108 6.3 Overview of our learning-based planner for decentralized target tracking. 
It con- sists of three main modules: (i), the Individual Observation Processing module processes the local observations and generates a feature vector for each robot; (ii), the Decentralized Information Aggregation module utilizes the GNN to aggregate and fuse the features from K-hop neighbors for each robot; (iii), the Decentral- ized Action Selection module selects an action for each robot by imitating an expert algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 6.4 Comparison of Opt, Centrl-gre, Decent-gre, GNN, and Rand in terms of running time (plotted in log scale) and the number of targets covered. (a) & (d) are for small-scale comparison averaged across 1000 Monte Carlo trials. (b) & (e) are for large-scale comparison averaged across 1000 Monte Carlo trials. . . . 120 6.5 An overview of the multi-robot target coverage setup. A team of aerial robots aims at covering multiple targets (depicted as red dots) on the ground. The robots observe the targets in their respective field of view (green squares) using down- facing cameras and share information with the neighbors through communication links (red arrows). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 6.6 An illustrative example showing multi-robot action selection for joint coverage maximization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 xv 6.7 Overview of our approach from a robot’s perspective: first the local observations are processed to generate the current coverage map. This can be done with the Differentiable Map Processor (DMP). D2COPLAN takes the coverage map as the input and processes it to first generate compact feature representation, with Map Encoder; shares the features with its neighbors, using the Distributed Feature Aggregator; and then selects an action using the aggregated information, with the Local Action Selector. The abbreviations in the parentheses for D2COPLAN’s sub-modules indicate the type of neural network used in their implementation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 6.8 Comparison of EXPERT and D2COPLAN in terms of running time (plotted in log scale) and the number of targets covered, averaged across 1000 Monte Carlo trials. D2COPLAN was trained on 20 robots. D2COPLAN is able to cover 92%-96% of the targets covered by EXPERT, while running at a much faster rate. . . . . . . . . . . . . . . . . . . . . 136 6.9 Comparison of D2COPLAN, and DG in terms of running time (plotted in log scale) and the number of targets covered, averaged across 1000 Monte Carlo trials. D2COPLAN was trained on 20 robots. D2COPLAN is able to cover almost the same number of targets as DG. DG is faster for fewer number of robots, but as the number of robots increases, D2COPLAN scales better than it. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 6.10 Comparison of coverage highlighting the effect of using D2COPLAN, a differentiable planner to aid learning for a differentiable map predictor (DMP), which works better than the DMP trained standalone. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 6.11 An ablation study for DMP and D2COPLAN. The plot shows results for the sce- narios where there parts and trained together or in isolation. . . . . . . . . . . . . 141 7.1 Overview of the assistive robot setup: an overhead agent equipped with the VLM directs a ground robot to help with the tasks based on semantic predictions . . . . 
145 7.2 Overview of the proposed pipeline for the assistive robot . . . . . . . . . . . . . 147 7.3 Turtlebot2 and the sensors used for the assistive robot setup . . . . . . . . . . . . 149 7.4 The occupancy map and the labels assigned to the rooms for the house-like envi- ronment. The images on the sides show example observations from the rooms. . . 151 7.5 A sequence of observations from the overhead camera while making coffee. There is no spoon or stirrer nearby, which the person may need next. . . . . . . . 154 7.6 Executing the action suggested by the LLMs: The robot goes to the pantry, finds the stirrer, and brings it to the kitchen. . . . . . . . . . . . . . . . . . . . . . . . 155 7.7 Overview of the assistive ObjectNav setup. The overhead and ground VLM agents communicate to coordinate and find the target object. . . . . . . . . . . . 157 7.8 Overview of the proposed approach for assisted ObjectNav. The GA and OC first communicate with each other. A summary of the communication is then used to recommend an action for the GA. The GA then can either cooperate or decline cooperation and decide the next action on its own. . . . . . . . . . . . . . . . . . 162 7.9 An example showing dialogues hallucinations across different lengths of commu- nication between the ground and overhead agents. . . . . . . . . . . . . . . . . . 165 7.10 Overview of the real-world experimental setup consisting of a Turtlebot as a Ground Agent (GA) and a GoPro camera mounted to the roof as an Overhead Agent (OA). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 xvi 8.1 An illustration showing CVaRα of function f(S, y).. It denotes the expectation on the left α-tailed cases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 8.2 The breakdown of the risk-aware planning framework. Given an overhead image input, the algorithm provides a semantic segmentation and uncertainty map, then generates candidate paths, and finally performs the risk-aware path assignment of vehicles to demands. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181 8.3 Difference in variance in semantic segmentation outputs due to data distribution. 189 8.4 Average Cross-Entropy for each class/label. It is inversely correlated with the frequency of the class/label in the data. . . . . . . . . . . . . . . . . . . . . . . . 190 8.5 Effect of λ on path planning. Smaller λ (=10) gives a shorter path with high uncertainty; larger λ (=50) gives a longer path with low uncertainty. . . . . . . . 190 8.6 Efficiency distributions of paths and the path assignment when α = 0.01. The assigned path for each robot is marked in red. . . . . . . . . . . . . . . . . . . . 192 8.7 Efficiency distributions of paths and the path assignment when α = 1. The as- signed path for each robot is marked in red. . . . . . . . . . . . . . . . . . . . . 193 8.8 Start and demand positions for surprise calculation setup. . . . . . . . . . . . . . 194 8.9 Surprise vs λ. A higher λ may result in a longer path, increasing the cost of traversal. For large values of λ, even a few pixels with high variance may greatly increase the cost. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194 8.10 Distribution of the total travel efficiency f(S, y) by Algorithm 2. . . . . . . . . . 195 9.1 An illustrative example showing various ways in which a robot can react to an obstacle with a series of arc-motions. . . . . . . . . . . . . . . . . . . . . . . . . 
203 9.2 ROS framework and the architecture of our hybrid local planner. . . . . . . . . . 207 9.3 An overview of the waypoint generation scheme for the hybrid local planner. . . . 207 9.4 An overview of the clearance detection module for the hybrid planner. . . . . . . 208 9.5 Dummy training environment for SACPlanner (left) and the associated polar costmap (right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210 9.6 Real experimental environment and 4 test case scenarios (C1-4) from left to right. 213 9.7 Trajectory comparison between DWA, SACPlanner vs. Hybrid planner agent for each test case. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 9.8 Trajectory comparison between DWA, SAC, and Hybrid planners based on logs from the scenario (C3). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216 A.1 Ground agent observations which result in error with GPT-4-turbo. . . . . . . . . 231 A.2 Localization responses from GPT-4-turbo for varying extent of labelling. . . . . . 233 A.3 Localization responses from GPT-4-turbo for different locations of the Turtlebot2 robot. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234 A.4 An example showing the need for environment-specific fine-tuning. Upon pass- ing the simulator prompts for conversation to this image, the overhead agent would misidentify the lights as a table. Explicitly mentioning the presence of white lights on the floor alleviates this issue. . . . . . . . . . . . . . . . . . . . . 236 A.5 Egocentric images from the camera on Turtlebot2. These images correspond to images of robot locations shown in Figure A.3 . . . . . . . . . . . . . . . . . . . 237 xvii A.6 Corrective action with GA in the real-world experiments. In this case, the agent decides to rotate by 180 degrees first and then move towards the left. The OA in the execution phase correctly suggests that the same can be accomplished by rotating towards the right. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239 A.7 World clouds over the communication between the GA and OA in Selective Ac- tion setup. Communication in Cooperative Action setup also exhibits similar patterns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241 xviii List of Abbreviations AUV Autonomous Underwater Vehicle CNN Convolutional Neural Network DS Dialogue Similarity GA Ground Agent GC Generative Communication GNN Graph Neural Network LLM Large Language Model MAE Masked Autoencoder MLP Multi-Layer Perceptron NBV Next Best View OA Overhead Agent OSR Oracle Success Rate SCT Success weighted by Completion Time SLAM Simultaneous Localization and Mapping SPL Success weighted by Path Length SR Success Rate UAV Unmanned Aerial Vehicle UGV Unmanned Ground Vehicle VLM Vision-Langugae Model xix Chapter 1: Introduction Navigating through unknown environments is a fundamental capability of mobile robots and has been studied by the robotics community for a long time [6]. Onboard sensors help them perceive their surroundings and planning algorithms enable them to navigate through the environment [6]. If a robot already has a map of the environment, it can plan optimal routes between locations, even in the presence of some dynamic objects [7]. However, the presence of a map is not always guaranteed. In certain scenarios, building a map from the ground up is itself the very mission assigned to a robot [8, 9]. 
Efficient navigation requires careful selection of the next location to move to at each time step. Should the robot need to traverse to a pre-defined goal, it must move to locations such that the time taken to reach the goal is minimized. If the objective is to build a map, it must move to locations that help in observing most of the environment in the fewest steps. At each step, the local observations are integrated to construct a partial global map, and a new path is planned based on the updated map. Typically, a planner produces navigation strategies either avoiding or exploring unknown areas, depending upon the task at hand and safety constraints. Unsurprisingly, the amount of information about the map acts as the bottleneck for efficient navigation. So we ask: is there a way to make the robot navigation process more efficient even with limited information? To answer this, we look towards humans for inspiration.

In similar circumstances where there is limited information, humans show remarkable efficiency in navigation by making guesses about the yet-unseen part of the environment. At the core of this cognitive process lies our ability to identify patterns in our surroundings. We can walk through our living room without directly looking at the floor, just by mentally visualizing the furniture in the room, even if we can only see a part of it. We move through a crowd gracefully, walking faster or slower by intuitively anticipating other people's motions. Locating a book, say Probabilistic Robotics, within a library is easy once we know the positions of the volumes starting with 'O' and 'Q'. In all these instances, we rely on patterns we have identified from our prior experiences to inform our decision-making. Inspired by these feats, the central question we aim to answer in this dissertation is: can robots also reason about the regions beyond their field of view by employing similar pattern recognition capabilities to improve their navigation efficiency?

To answer this question, it is important to understand how humans develop these capabilities. Our brain learns these patterns through experiences and builds an internal model to facilitate their applications [10–12]. The easiest solution to impart such capabilities to the robots is therefore creating such models manually for them. However, handcrafting models for these patterns proves to be a formidable task, owing to our limited knowledge of the system-environment interaction and the exact distribution of these patterns [13]. In contrast, learning-based approaches, especially deep neural networks, offer a promising avenue to approximate these models using extensive training data. This dissertation introduces an array of learning-based algorithms for this purpose, demonstrating their efficacy in enabling robots to make informed predictions about unobserved environments and enhancing navigation efficiency for a variety of input views, modalities, and numbers of robots. Being mindful of the inherent approximations, we concurrently develop methods to manage risks when using predictions to ensure the robots can utilize them safely. Thus, the goal of this dissertation is to equip robots with pattern recognition capabilities and facilitate the judicious use of predictions to enhance navigation efficiency and safety.
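The sense-integrate-plan loop described above can be summarized in a short sketch. The `robot` and `planner` interfaces and the map encoding used here are hypothetical placeholders introduced only for illustration; they are not interfaces defined later in the dissertation.

```python
import numpy as np

UNKNOWN = -1  # cells not yet observed; free/occupied values come from the sensor model

def navigate_incrementally(robot, goal, map_shape, planner):
    """Sense -> integrate -> (re)plan loop on a partially known map.

    `robot` and `planner` are hypothetical interfaces used only for illustration:
    robot.position(), robot.sense() -> (cells, values), robot.step_along(path).
    """
    global_map = np.full(map_shape, UNKNOWN, dtype=int)
    while robot.position() != goal:
        cells, values = robot.sense()                 # local observation from onboard sensors
        for (r, c), v in zip(cells, values):          # integrate it into the partial global map
            global_map[r, c] = v
        path = planner(global_map, robot.position(), goal)  # replan on the updated map
        robot.step_along(path)                        # execute the beginning of the new plan
    return global_map
```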
1.1 Types of Patterns and Informed Decision-Making

In this section, we start by introducing some patterns often harnessed by humans which prove to be of significant utility to robots for informed decision-making. Specifically, we focus on geometrical and structural, spatiotemporal, and semantic patterns, and describe their characteristics and relevance to robotic planning in the subsections below.

1.1.1 Geometrical and Structural Patterns

Geometrical or structural patterns refer to recurring arrangements or configurations of shapes or structures in the environment. There is a plethora of these patterns present in our surroundings, mainly due to our affinity towards finding regularity in what we see, even in scenarios where they may not be immediately apparent. A great example of this is star constellations, where we can visualize animals and objects in a seemingly random arrangement of tiny twinkling lights.

The human brain is remarkable at recognizing and organizing visual information through the application of Gestalt principles [14, 15]. These fundamental concepts in psychology explain how we perceive complex scenes as cohesive wholes, rather than isolated elements. They not only equip us to understand our surroundings but are also expressed as regular designs of man-made objects. Consider the prevalent rectilinear design of walls, tables, and boxes, or the cylindrical shape of bottles, cups, and tumblers. Despite variations in specific details, objects often adhere to similar underlying structures. Ask a child to draw an airplane, and we would expect to see a tube with pointed ends and two triangles on the sides. Boeing and Airbus models may differ in their tail designs, but the general shape is similar. Another prevalent characteristic of man-made objects is symmetry. Together, this results in easy-to-remember and recognizable shapes all around us.

Figure 1.1: Motivating Application with Geometrical Structural Patterns: Limited observations and occlusion can limit the robots' planning capabilities. Geometrical and structural predictions can help them make informed decisions by predicting the map beyond direct observations. The images are from the City and Forest environments in AirSim.

We show that these principles can be exploited in robot navigation, especially when the robots work in man-made environments, by predicting the object shapes beyond the robot's field of view. The abundance of familiar and recurring shapes presents a great opportunity for robots to infer the complete shape from partial views, making efficient navigation possible. Deploying a team of robots leads to further improvement as more information helps in more accurate predictions, making these patterns immensely useful for robot navigation. Identifying and modeling these patterns is a challenging task. Additionally, the underlying method should ensure generalizability, which is crucial for the deployment of the robot to new scenarios, including the real world.

1.1.2 Spatiotemporal Patterns

Spatiotemporal patterns refer to the arrangement of entities in their scene, not necessarily geometrical, and the regularity in their motion. While we focus on static arrangements in geometrical and structural patterns, here we aim to couple the motion with the spatial arrangements and harness the predictable cues. For "mobile" robots these cues can arise from the environment, as well as their own motion.

Soccer is the most popular sport in the world. Millions of fans watch it in arenas and on TV.
Passing the ball, one of the most fundamental skills in soccer, involves one player kicking the ball toward a teammate while the opponents try to intercept it. The motion of the ball can be easily predicted with Newton's laws of motion. But the players and the viewers do not have to pull their notebooks out to calculate where the ball would be after a certain time. Even those unfamiliar with Newton's laws can easily predict the trajectory of the ball, as humans can anticipate the motion from their experience. Similar patterns can be found in how people move in a crowd and how cars navigate on the road, and we use them to navigate safely and smoothly through both.

Figure 1.2: Motivating Application with Spatiotemporal Patterns: The motion of dynamic objects can be difficult to model. Spatiotemporal patterns can help the robots estimate the motion to avoid or track them efficiently. The image is from experiments performed at Nokia Bell Labs as an intern.

Robots also can learn to anticipate motion based on experience, as demonstrated in our works. Robots can extract these patterns from moving targets, predicting their future locations for better tracking. When working with other robots, they could further utilize their spatial arrangements to coordinate with others to improve robot navigation for the whole team, especially when their communication capabilities are limited. Limited observations pose a significant challenge in these scenarios, which is further exacerbated by the need for scalable solutions; such solutions can be hard to model and learn when generating labels with optimal algorithms for large teams is impractical.

1.1.3 Semantic Patterns

Semantic patterns refer to a consistent arrangement or structure of words, symbols, or elements that conveys meaning within a given context to express higher-level concepts, relationships, or information. While the first two types of patterns often come to us intuitively, semantic patterns are usually reasoned about explicitly [16]. Human decision-making is highly dependent on semantic entities, resulting in a world full of such patterns. Guessing that someone pouring coffee into a cup may need sugar or milk next is one such example, where the pattern emerges from an understanding of objects, their semantic relationships, and the activity they imply. Language, color codings, and sounds are some examples of such patterns. The inherent intuitiveness of semantic patterns paves the way to make accurate predictions about them.

The ubiquity of the semantics around us means it is essential for robots to reason semantically to efficiently assist humans. It also means that, similar to us, the robots can make inferences to navigate efficiently in such a world. Using these patterns can enable robots to fully utilize the man-made signals around them and, in turn, effectively work alongside humans by reasoning similarly to them. Learning these patterns requires a huge amount of data and computational resources, which may not be easily available. The emergence of foundational and large language models addresses this issue and promises an easy avenue for scene understanding and human-level planning, but requires bridging them together with grounding approaches to make them realizable.

Figure 1.3: Motivating Application with Semantic Patterns: Lack of semantic understanding can make the robots reliant on human instructions to perform tasks for them.
Semantic patterns can help the robots anticipate what a human may need, and help them proactively. The images are from our experiments at the RAAS Lab, University of Maryland.

1.2 Research Contributions

This section highlights our contributions proposing methods to leverage the three types of patterns. The key proposition of these methods is to utilize learning-based approaches for modeling the patterns from data and improving planning efficiency across a variety of tasks and scenarios. Additionally, we explore methods for risk management with predictive models for the safe deployment of robots.

We start with a discussion on how geometric and structural patterns can be used to improve PointGoal navigation [17], i.e., traversing to a pre-defined goal through an unknown environment, and active object reconstruction by making accurate predictions about unseen regions from partial observations. These works take advantage of the patterns in the static environment around them. Then we turn our attention to dynamic entities in the scene and detail our approaches to harnessing spatiotemporal patterns, arising from the motion of a team of robots and the dynamic targets and their spatial arrangements, for scalable coverage and tracking. Our works show how predictions are invaluable to efficient navigation for mobile robots, but erroneous predictions can be a point of concern for safety in some applications. We deal with this issue in the last subsection with our meta-reasoning approaches for the safe navigation of the robot using heuristics and uncertainty-based strategies for risk management.

1.2.1 Enhanced Perception with Structural Continuity and Closure

This section describes our contributions to using geometrical and structural patterns to enhance robot perception and, as a result, planning. We start with a discussion on how geometric and structural patterns can be used to improve robot navigation for PointGoal navigation with a 2D occupancy map, and active object reconstruction based on 3D point cloud representations, by predicting unseen regions from only partial observations. Lastly, we show that these approaches can be effectively extended to multi-agent systems by using multiple views to further improve the predictions and, in turn, the navigation efficiency.

Figure 1.4: Example of some tasks that can be done with geometrical and structural patterns

PointGoal Navigation with 2D Maps

PointGoal navigation refers to the navigation task in which the robot is given a specific destination point (goal) in the environment and is required to reach it [17]. 2D overhead or bird's eye view (BEV) maps are commonly used by ground robots for this task. Typically, the robot builds the navigation maps incrementally from local observations using onboard sensors. 2D ranging LiDARs and RGB-D cameras are the most popular sensors for this task and are used to generate an occupancy map, which distinguishes the free areas from occupied or unknown areas. Sometimes, an unmanned aerial vehicle may act as a scout and observe a wider area from a height to get RGB maps, over which semantic segmentation is applied to act as an occupancy map for the navigation of the ground robot. The planner plans a path to the goal using these maps. As the robot navigates and updates the map, the path is updated as a result. A conservative planner may avoid the unknown regions for safety, taking a longer time to navigate to the goal.
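To make the cost of that conservatism concrete, here is a minimal planning sketch on a partial occupancy grid (it could fill the `planner` slot in the earlier loop sketch). The conservative rule blocks every unknown cell; passing a predicted occupancy map instead lets the planner cut through unknown cells that are predicted to be free. The cell encoding, the 4-connected grid, and the 0.3 threshold are illustrative assumptions, not the planners evaluated in the later chapters.

```python
import heapq
import numpy as np

UNKNOWN, FREE, OCCUPIED = -1, 0, 1  # illustrative cell encoding for the occupancy grid

def traversable(grid, cell, predicted_occ=None, occ_thresh=0.3):
    """Conservative rule: unknown cells are blocked. If a predicted occupancy map
    (per-cell probability of being occupied) is supplied, unknown cells whose
    predicted occupancy is low become usable, extending the plannable map."""
    r, c = cell
    if grid[r, c] == OCCUPIED:
        return False
    if grid[r, c] == UNKNOWN:
        return predicted_occ is not None and predicted_occ[r, c] < occ_thresh
    return True  # FREE

def path_cost(grid, start, goal, predicted_occ=None):
    """Dijkstra on a 4-connected grid; returns the shortest path cost, or None if
    the goal is unreachable under the current (conservative or predictive) map.
    Recovering the actual path would additionally track parent cells."""
    dist = {start: 0.0}
    pq = [(0.0, start)]
    while pq:
        d, cell = heapq.heappop(pq)
        if cell == goal:
            return d
        if d > dist.get(cell, np.inf):
            continue
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if not (0 <= nxt[0] < grid.shape[0] and 0 <= nxt[1] < grid.shape[1]):
                continue
            if not traversable(grid, nxt, predicted_occ):
                continue
            nd = d + 1.0
            if nd < dist.get(nxt, np.inf):
                dist[nxt] = nd
                heapq.heappush(pq, (nd, nxt))
    return None
```

A path found with predictions can be much shorter, but its quality now hinges on the predictions being correct, which is the trade-off the risk-management chapters later return to.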
Instead, if a robot is able to correctly predict the occupancy in the occluded regions, the robot may navigate efficiently. Recent works have shown that predicting the structural patterns in the environment through learning-based approaches can greatly enhance task efficiency [18, 19]. This is accomplished by predicting the occupancy maps in the yet unobserved regions, effectively increasing the field of view of the sensor. Figure 1.4 shows some example applications with this capability. We show that existing foundational vision networks can accomplish this without any fine-tuning by using the concept of continuity learned from computer vision datasets. Specifically, we use Masked Autoencoders [20], pre-trained on street images, for field of view expansion on RGB images, semantic segmentation maps, and binary maps. The images and maps span both outdoor and indoor scenes and a diverse set of locations from the AirSim [21] and AI2THOR [1] simulators. Two key findings stand out from our experiments in this work: (1) inferring unobserved scenes is easier for simpler, more abstract representations such as semantic segmentation and binary maps, as they do not require complex reasoning about textures, and (2) predictions are more accurate when made closer to the areas of direct observation and degrade as we move farther away. The limitations of such foundation models are that they are computationally heavy and may not capture the structural patterns of the regions with missing information well. This work was accepted at the 2024 International Conference on Robotics and Automation (ICRA 2024) [22].

To overcome these limitations, we look towards task-specific, convolutional neural network (CNN)-based approaches for occupancy map prediction. Existing works using CNNs for this task learn to predict occupancy in areas away from direct observations and thus may suffer from network overfitting [23, 24]. This also requires a time-consuming data collection step. To alleviate these issues, prior work has proposed a self-supervised approach with a multi-camera setup [25], but this setup does not produce precise predicted maps and is not economical.

Figure 1.5: Movement configuration for data collection and training strategy for proximal occupancy map prediction (ProxMaP)

We use the takeaways from our previous work and focus on making predictions near the observed regions only, thus reducing overfitting and making precise predictions. This also results in a self-supervised and efficient data collection approach (as shown in Figure 1.5), which is also more economical than the existing self-supervised approach. We further improve navigation by adjusting the robot's speed according to the information over the path to the goal, resulting in faster navigation to the goal. This work was accepted at the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2023) [26].

Active Object Reconstruction with 3D Point Clouds

Object reconstruction refers to the task of observing an object from multiple viewpoints and reconstructing a 3D representation of it. 3D LiDARs, which provide a sparse point cloud as output, are often used with UAVs for this task. While the object could ideally be observed from a large constellation of viewpoints around it, an efficient plan requires selecting only those viewpoints that minimize the overlap with the previous observations to finish the task quickly.
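Returning to the ProxMaP data-collection strategy above (Figure 1.5), the following sketch shows how a self-supervised training pair could be assembled from three nearby viewpoints: the center view alone is the input, and the label is the fusion of the center view with the left and right views. The fusion rule (any observation overrides unknown, occupied overrides free) and the cell encoding are assumptions made for this illustration rather than the exact procedure of Chapter 2.

```python
import numpy as np

UNKNOWN, FREE, OCCUPIED = -1, 0, 1

def fuse_views(maps):
    """Combine occupancy grids from nearby viewpoints: any observation overrides
    'unknown', and an 'occupied' report wins over 'free' (a conservative choice
    made for this illustration)."""
    fused = np.full(maps[0].shape, UNKNOWN, dtype=int)
    for m in maps:
        seen = m != UNKNOWN
        newly = seen & (fused == UNKNOWN)
        fused[newly] = m[newly]
        fused[seen & (m == OCCUPIED)] = OCCUPIED
    return fused

def make_training_pair(center_map, left_map, right_map):
    """Self-supervised sample in the spirit of Figure 1.5: the network input is the
    center-camera occupancy grid alone, and the label is the fusion of the center
    view with the nearby left/right views, so the model learns to fill in occupancy
    just beyond what the center view sees."""
    x = center_map.copy()
    y = fuse_views([center_map, left_map, right_map])
    return x, y
```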
Efficiency is a crucial requirement when using UAVs, as their limited battery capacity limits the flight time. Similar to the 2D situation, to incrementally build the map, the robot must choose the next-best-view (NBV) carefully to produce an accurate reconstruction with an efficient plan. If the shape of the object (equivalent to the map) is already known, we can carefully select a minimal set of viewpoints, resulting in a geometric NBV approach. But this may not always be possible, and such a requirement limits the situations in which the UAV can be deployed. Prior works have used learning-based approaches for this, but these networks are usually specific to an object class (e.g., houses only). This means that we must train a model for each target object. Other works have used the Gestalt principle of similarity to bridge the gap between the lack of a map and the geometric NBV approach. Revisiting the example from Section 1, all planes have similar shapes even though they may differ in details. Thus, if we can predict the full shape of the object of interest from partial views (this is known as shape completion), we can effectively construct a geometric NBV plan.

Figure 1.6: Overview of our prediction-guided next-best-view approach (Pred-NBV) for object reconstruction

The existing works on 3D shape prediction make an implicit assumption about the partial observations and therefore cannot be used for real-world planning [27]. Also, they do not consider the control effort for next-best-view planning, which directly affects the flight time [28]. We proposed Pred-NBV [29], a realistic object shape reconstruction method consisting of PoinTr-C, an enhanced 3D prediction model trained on the ShapeNet dataset using curriculum learning, and an information- and control-effort-based next-best-view method to address these issues. Figure 1.6 shows the overview of Pred-NBV. In each iteration, the robot predicts the full shape from partial observations and moves to the closest location that is expected to yield high information gain. After moving to the new location, the current observations are combined with the previous ones. The process repeats until a termination condition is reached. Pred-NBV achieves an improvement of 25.46% in object coverage over traditional methods in the AirSim simulator [21] and performs better shape completion than PoinTr [27], the state-of-the-art shape completion model, even on real data obtained from a Velodyne 3D LiDAR mounted on a DJI M600 Pro. This work was accepted at the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2023) [29].

Extensions to Multi-Robot Systems

The previous subsections show how the continuity and closure principles can improve navigation efficiency for single-robot systems by making predictions over partial observations. It is intuitive to expect further benefits when more information and context (and hence more geometrical and structural cues) are provided to the prediction model. This raises the question: what if these ideas are extended to multi-robot systems?

For navigation with a 2D map, we study the task of multi-agent exploration, which requires a team of robots to observe the whole map in the fewest steps. We use predictions for planning as well as mapping by inferring the unexplored regions using continuity. As observations from multiple viewpoints provide strong geometrical and structural cues, the predictions are more reliable.
Whenever there is uncertainty in the predictions, we navigate the robots to these regions to observe them directly and reduce the uncertainty. For this, we use the MAE to predict the unexplored regions and extract uncertainty from these predictions by adding visually imperceptible noise to the input. A centralized planner then uses the K-Nearest Neighbors (KNN) algorithm over the unknown and high-uncertainty cells to identify regions of interest and moves the robots to them. Figure 1.7 shows an overview of this process. This method results in higher prediction accuracy in fewer steps compared to the traditional method of assigning non-overlapping regions to robots and scanning them in a sweeping fashion. These findings were accepted at the 2024 International Conference on Robotics and Automation (ICRA 2024) [22].

Figure 1.7: Overview of our prediction- and uncertainty-driven planning approach for multi-agent coverage

For 3D object reconstruction, we extended the prior work to MAP-NBV, a prediction-guided active algorithm for 3D reconstruction with multi-agent UAV systems. We use PoinTr-C as in Pred-NBV, but add a centralized planner to find the NBV for all the UAVs together. We jointly optimize the information gain and control effort for efficient collaborative 3D reconstruction of the object. Our method achieves a 19% improvement over the non-predictive multi-agent approach and a 17% improvement over the prediction-based, non-cooperative multi-agent approach. This work was accepted at the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024) [30].

1.2.2 Spatiotemporal Patterns

In this section, we describe spatiotemporal patterns and outline our contributions that use them to improve robot planning. Specifically, we focus on decentralized algorithms for multi-robot coverage and target tracking, putting the primary emphasis on planning. The first subsection aims to harness the patterns emerging from the spatial arrangement of the robots, represented as a communication graph, and their motion through space. In the next subsection, we additionally utilize similar patterns from moving targets.

Decentralized Coordinated Coverage using Graph Neural Networks

The problem of decentralized multi-robot target tracking asks for jointly selecting actions, e.g., motion primitives, for the robots to maximize joint coverage using only local communication. One major challenge for practical implementations is to make target-tracking approaches scalable to large-scale problem instances. In this work, we propose a general-purpose learning architecture for collaborative target tracking at scale with decentralized communication. Classical, manually designed decentralized approaches can be more scalable than centralized ones at the cost of reduced coverage [31]. We investigate whether planners designed using learning-based algorithms can accomplish the same when provided with local observations as hand-crafted features. In particular, our learning architecture, shown in Figure 1.8, leverages a graph neural network (GNN) to capture local interactions of the robots and learns decentralized decision-making for the robots [32]. We train the learning model by imitating an expert solution and deploy the resulting model for decentralized action selection involving only local observations and communication.
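For intuition, one round of GNN message passing over the robots' communication graph can be sketched as below. This is a generic mean-aggregation toy in NumPy, not the architecture used in [32]; the feature sizes and weights are arbitrary placeholders.

```python
import numpy as np

def gnn_layer(node_features, adjacency, weight):
    """One round of neighborhood aggregation: each robot combines its own
    features with the mean of its communication neighbors' features, then
    applies a shared transformation (here a random placeholder matrix)."""
    degree = adjacency.sum(axis=1, keepdims=True).clip(min=1)
    neighbor_mean = (adjacency @ node_features) / degree
    return np.tanh((node_features + neighbor_mean) @ weight)

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 3))          # 4 robots, 3 local features each
adjacency = np.array([[0, 1, 0, 1],         # ring-shaped communication graph
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [1, 0, 1, 0]], dtype=float)
outputs = gnn_layer(features, adjacency, weight=rng.normal(size=(3, 3)))
```

Because each robot's output depends only on messages from its direct neighbors, a layer of this form that is trained centrally can still be executed locally on each robot at deployment time.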
Representing the spatial arrangement of the robots as a graph and using a GNN over it also allows centralized training with decentralized execution [33], a property specific to GNNs. We demonstrate the performance of our learning-based approach in a scenario of active target tracking with large networks of robots. The simulation results show that our approach nearly matches the tracking performance of the expert algorithm, yet runs several orders of magnitude faster with up to 100 robots. Moreover, it slightly outperforms a decentralized greedy algorithm while also running faster (especially with more than 20 robots). The results also exhibit our approach's generalization capability to previously unseen scenarios, e.g., larger environments and larger networks of robots. This shows that learning-based approaches can exploit these patterns for efficient planning, achieving similar coverage with faster inference than the equivalent classical approaches. This work was accepted at the 2022 IEEE Symposium on Safety, Security, and Rescue Robotics (SSRR 2022) [32].

Figure 1.8: Our graph neural network (GNN) based method for decentralized multi-robot coverage

Decentralized Coverage and Tracking with Differentiable Planners

Learning-based distributed algorithms provide a scalable avenue while also bringing data-driven feature generation capabilities to the table, allowing integration with other learning-based approaches. Our previous work focuses on local communication through a GNN to improve planning. As we demonstrated in Section 1.2.1, perception is often harder to model than planning. Thus we ask: can a learning-based planner augment the training of a learning-based perception model to outperform a simple combination of the two? Realizing this setup requires an end-to-end differentiable planner that can be seamlessly combined with a perception network. To this end, we present a learning-based, differentiable distributed coverage planner (D2CoPlan) [34], shown in Figure 1.9, that scales efficiently in runtime and number of agents compared to the expert algorithm and performs on par with the classical distributed algorithm. We then combine it with a perception network to predict motion primitives for covering dynamic targets, hence solving a target tracking problem. We find that this modular combination not only outperforms combinations of classical and learning-based counterparts but also learns more efficiently than a single monolithic end-to-end planning network. These findings suggest that differentiable designs in perception and planning are key to the development of more powerful learning-based solutions through end-to-end, task-specific learning. This work was published in the proceedings of the 2023 International Conference on Robotics and Automation (ICRA 2023) [34].

Figure 1.9: Our tracking and coverage approach using a differentiable decentralized coverage planner (D2CoPlan)

1.2.3 Semantic Pattern Prediction to Assist Humans and Robots

In this section, we present our contributions utilizing semantic patterns grounded in vision and language to assist humans and other robots. For the former, we investigate the innate world knowledge in Large Language Models (LLMs) and Vision-Language Models (VLMs) to anticipate what a person may need in the future and help them with the task. We specifically focus on language-based semantic patterns.
Large Language Models as Anticipatory Planners for Assisting Humans

Large Language Models (LLMs) are among the most recent significant advancements in artificial intelligence (AI). Trained using reinforcement learning from human feedback (RLHF), these models are exceptionally good at conversing with humans. Within a few months of their introduction, people came up with a huge range of applications using LLMs as human surrogates. Language researchers studying what LLMs learn have revealed that they can reason about the world, which carries huge implications for applications dependent on semantic patterns.

Consider this scenario: a person wakes up in the morning and is getting ready to make coffee. They reach the kitchen counter and turn on the coffee maker. An assistive home robot observing them infers that they are making coffee but notices that there is no sugar nearby. Anticipating that they may need sugar next, the assistant sends a robot to fetch sugar from the pantry and bring it to the human. This ability to help humans without them needing to ask the robot explicitly requires a world model to understand and make inferences about a wide array of human activities.

Leveraging the ability of LLMs to act as approximate world models, we use their capability to generate likely words to assist humans by anticipating their next actions. Here we pose next-action prediction as next-word prediction with LLMs, given past observations of a human's activity. Due to the generative nature of LLMs and their lack of grounding, they may generate many plausible actions. To ground them in real applications, we provide context in the form of a textual description of the map. The robot's action primitives are used as additional input to ensure that affordance-based actions are selected to effectively assist the human. We build this system and demonstrate real-world applications for a variety of tasks in a home-like environment, with both image- and text-based observations handled through VLMs and image captioning methods.

Vision Language Models as Global Context Providers for Assisting Robots

Recently, researchers have been exploring multi-modal representation learning to combine multiple representations in the same embedding space, mainly aimed at tasks that assist humans. One such joint representation that is relevant to robots is the vision-language representation, which can help an agent equipped with such a VLM not only comprehend its observations but also make predictions about the future state of the environment. Focusing on such semantic predictions, we propose a pipeline where an environment camera monitors the surroundings to identify ongoing activities and directs the robot to help the human with them. Having such cameras in the environment is not unusual nowadays; cameras are often used for security purposes in industrial and residential spaces. Growing interest in AI assistants has also brought devices equipped with cameras, which can provide additional guidance to a robot to help the people around it. As depicted in Figure 1.10, the overhead camera monitors and deciphers the activities using VLMs, acting as an overhead VLM agent. The ground robot, or ground agent, can only see limited parts of the scene due to its limited camera field-of-view and occlusions in the scene (such as walls). In contrast, the placement of the overhead camera allows it to observe a wider area than a ground robotic agent could.
However, it cannot move around and help the people in the environment, which the ground agent can do. We propose a method that utilizes the strengths of both agents: the overhead agent is tasked with scene understanding, activity recognition, and predicting what assistance a person in the scene may need. It can then direct the ground robot to move around and perform the appropriate actions. Since the overhead agent may not be able to see all the details from its height, and walls may obstruct its view of some areas, the ground agent uses VLMs to accomplish the task. We implemented this pipeline using GPT-4 to direct a Turtlebot2 robot to help humans in a house-like environment in the real world, paving the way for assistive robots aided by external sensors and VLM-based capabilities.

Figure 1.10: A VLM-based overhead agent working along with a ground robot can act as an effective assistance unit

Given the world knowledge that VLMs encompass, they are well-equipped to assist not only with temporal semantic predictions but also with spatial semantic predictions (e.g., a bowl of sugar is more likely to be in the pantry than in a bathroom). Owing to these properties, the research community has welcomed these models with open arms for many applications, including ObjectNav. ObjectNav [17] is an embodied task where a mobile robot must find an object (e.g., a fork) in an environment without a map. This task is challenging as the ground agent has limited observations due to its limited field-of-view (FoV) and obstructions, and hence its planning horizon is limited. While a VLM may provide sound reasoning about the target object and where to find it, the lack of grounding, hallucinations, and reliance on limited observations pose challenges for effective applications.

Figure 1.11: Two VLM-based agents, one with an overhead view and another with a ground view, can work together to find an object in a scene with generative communication

To address these challenges, we propose using an environment camera, also equipped with a VLM, to provide additional guidance as an overhead agent. However, since the overhead agent itself may suffer from occlusions caused by walls and objects, or may confuse other objects with the robot, it must communicate with the ground agent to estimate the latter's position well and provide accurate guidance to accomplish ObjectNav. To the best of our knowledge, the proposed approach, shown in Figure 1.11, is the first instance of two agents with global and local views communicating using VLMs. Similar prior works have focused on emergent semantic communication with a limited vocabulary, in contrast to the unrestricted, generative communication used here. To study the effect of communication, we further investigate communication length and varying degrees of communication and show that communication indeed plays a crucial role in completing the task. To mitigate the adverse effects of hallucination in simple two-way communication, we propose a selective cooperation framework for this task and achieve a 10% improvement over the non-assistive, single-agent method. This work is currently under review.

1.2.4 Meta-Reasoning to Manage Risk in Predictions

The accuracy of learning-based models is largely dependent on the training data. Generalizability is often a source of concern for these methods and manifests itself as a Sim2Real gap in robotic applications.
An error in prediction can be dangerous to the surroundings, to the humans nearby, and even to the robot itself. Trust in predictions is thus a critical issue for deploying robots in the real world. Existing works have explored this issue through the lens of uncertainty extraction [35], interpretable designs [36], and explainable methods [37], among others. We contribute in this regard with implicit and explicit meta-reasoning approaches over predictions for planning, as described below.

Meta-Reasoning for Risk-Aware Planning with Implicit Means

Neural network-based perception models are generally point prediction models, i.e., for the same input, the network provides the same deterministic output. Bayesian neural networks, which were designed for stochastic outputs, can be computationally intensive and thus are unsuitable for deployment on resource-constrained robots. Kendall et al. [35] proposed Bayesian SegNet, which uses Bayesian dropout [38] for semantic segmentation and uncertainty extraction, and showed its use for detecting uncertainty in street-view images for autonomous driving.

Figure 1.12: Risk-aware planning strategy using uncertainty extraction allows the user to choose between conservative and adventurous plans

Our work builds on this idea and uses Bayesian SegNet on top-down images for high-level planning. In this setup, an aerial robot acts as a scout for ground robot navigation and applies Bayesian SegNet to aerial images. We train our network in the CityEnviron environment and test it in a suburban scene. The grass patches, which are scarce in the city, act as out-of-distribution entities and result in high prediction uncertainty. The semantic costmap is combined with the uncertainty map using a user-defined risk-affinity factor, which allows the user to select between risk-conservative and risk-adventurous paths. The proposed approach thus allows risk management using the implicit uncertainty in the networks. This process is shown in Figure 1.12. This work was published at the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2020) [39].

Meta-Reasoning for Hybrid Planning with Explicit Means

Uncertainty extraction from network predictions is a contentious issue, as we do not know how to truly validate the 'uncertainty' thus obtained. This may result in a gap in trustworthiness, even if learning-based methods are more desirable and suitable for certain scenarios. The simplest solution in such scenarios is to use the predictions only when we trust them and otherwise switch back to traditional methods. Recent works have explored this idea, albeit with neural networks acting as the switch [40, 41], which can again cause trust issues and may not be generalizable.

Figure 1.13: Left: Overview of the proposed hybrid local planning approach, which combines the benefits of classical and AI-based local planners. Right: A real experiment scenario showing the hybrid planner in action when a human suddenly appears on the robot's path.

We use a heuristic-based approach that switches between a classical and a reinforcement learning (RL) based method for local planning in an indoor environment with unexpected obstacles. The classical planner, the dynamic window approach (DWA) [7], is robust and moves the robot efficiently along a smooth path. However, DWA may be slow to react to unexpected obstacles on its path, which may result in a collision.
SACPlanner [42], an RL-based local planner, is more reactive in these situations but results in jerky and inefficient motion. We therefore propose a switching approach that identifies whether there is an unexpected obstacle on the path and switches between the two planners accordingly. If the path is clear, the robot uses the DWA planner, resulting in fast progress towards the goal. If an obstacle is detected, the robot uses the RL-based planner to safely avoid it and switches back to DWA once no other obstacle is on the path to the goal. This method results in faster navigation to the goal without any collisions with obstacles across various scenarios in real-world experiments. Figure 1.13 visualizes this process and shows an experimental situation where this framework was tested in the real world. This work is currently under revision [43].

Organization of the Dissertation

This dissertation is organized into 10 chapters following this one.

In Chapters 2-5, we present methods that utilize structural and geometrical patterns for efficient navigation planning. Chapters 2 and 3 focus on predicting unseen parts of 2D maps from partial views with task-specific training (https://raaslab.org/projects/ProxMaP) and a pre-trained masked image model (https://raaslab.org/projects/MIM4Robots), respectively. Chapters 4 and 5 present methods for predicting 3D maps from partial observations and building next-best-view planning approaches with them for single-agent (https://raaslab.org/projects/PredNBV) and multi-agent systems (https://raaslab.org/projects/MAPNBV), respectively.

Chapter 6 concentrates on leveraging spatiotemporal patterns. First, we show how learning-based approaches can act as approximate but scalable planners for multi-agent coverage problems (https://github.com/VishnuDuttSharma/deep-multirobot-task). Then we present a method that uses a learning-based decentralized approach as a differentiable planner to efficiently train a multi-agent tracking method in an end-to-end manner, achieving better results than its counterpart composed of independently trained submodules (https://raaslab.org/projects/d2coplan).

In Chapter 7, we present a framework using semantic patterns with the help of VLMs and LLMs. This framework uses an overhead camera to direct or coordinate with a ground robot. We first show this framework in action with a real-world implementation to assist a human with everyday tasks in a house-like environment. Then we use this framework to help a ground robot equipped with a VLM perform ObjectNav and present an analysis of the conversation between the two agents.

Chapters 8 and 9 present methods to manage risk in predictions. In Chapter 8, we propose an implicit measure for risk-aware path planning where out-of-distribution data acts as the source of uncertainty. Chapter 9 focuses on an explicit measure for risk management, using a heuristic-based approach to switch between classical and learning-based planners to balance navigation efficiency against the collision risk due to unexpected obstacles on the path.

We conclude the dissertation with an overview of our research contributions and an outline of future research directions. The prompts and additional results from Chapter 7 are presented in the appendix.
The software and media corresponding to the work in this dissertation are available at https://vishnuduttsharma.github.io/thesis/.

Chapter 2: Structural and Geometric Pattern Prediction in 2D Occupancy Maps

Further details and results for this work are available at https://raaslab.org/projects/ProxMaP.

2.1 Introduction

To navigate in a complex environment, a robot needs to know the map of the environment. This information can either be obtained by mapping the environment beforehand, or the robot can build a map online using its onboard sensors. Occupancy maps, which provide probabilistic estimates of the free (navigable) and occupied (non-navigable) areas, are often used for this purpose. These estimates can be updated as the robot gains new information while navigating. Given an occupancy map, the robot can adjust its speed to navigate faster through high-confidence free areas and slower through low-confidence free areas so that it can stop before a collision. The effective speed of the robot thus depends on the occupancy estimates. Occlusions due to obstacles and the limited field-of-view (FoV) of the robot lead to low-confidence occupancy estimates, which limit the navigation speed of the robot.

In this chapter, we propose training a neural network to predict occupancy in the regions that are currently occluded by obstacles, as shown in Fig. 2.1. Prior works learn to predict the occupancy map all around the robot, i.e., simulating a 360° FoV given the visible occupancy map within the current, limited FoV [44-46]. Since the network is trained to predict the occupancy map all around the robot, it overfits by learning the room layouts. This happens because the network must predict occupancy for areas, such as the region behind the robot, for which it may not have any overlapping information in its egocentric observations. This makes the prediction task difficult to learn. Furthermore, obtaining the ground truth requires mapping the whole environment beforehand, which could make sim-to-real transfer tedious.

Figure 2.1: An example situation where the robot's view is limited by the obstacles (a sofa blocking the view) and the camera field of view (the sofa on the right is not fully visible). (a) Third-person view of the robot in a living room. (b) Top view of the robot showing the visibility polygon.

Our key insight is to simplify this problem by making predictions only in the immediate proximity of the robot, i.e., in areas where it could move next. This setting has three-fold advantages: first, the prediction task is easier and more relevant, as the network needs to reason only about the immediately accessible regions (which are partly visible); second, the robot learns to predict obstacle shapes instead of room layouts, making it more generalizable; and third, the ground truth is easier to obtain, as it can be collected by simply moving the robot, making the approach self-supervised.

Figure 2.2: Overview of the proposed approach. (a) Movement configuration for data collection. (b) Training and prediction overview. The training and inference flows are indicated with red and black arrows, respectively. We take the input views by moving the robot to the left and right sides (CamLeft and CamRight), looking towards the region of interest.
ProxMaP makes predictions using CamCenter only, and the map obtained by combining the information from the three positions acts as the ground truth.

Following are our main contributions in this work:

1. We present ProxMaP, a self-supervised proximal occupancy map prediction method for indoor navigation, trained on occupancy maps generated from the AI2THOR simulator [1], and show that it makes accurate predictions and also generalizes well to the HM3D dataset [2] without fine-tuning.

2. We study the effect of training ProxMaP under various paradigms on prediction quality and navigation tasks, highlighting the role of the training method in occupancy map prediction tasks. We also present some qualitative results on real data showing that ProxMaP can be extended to real-world inputs.

3. We simulate PointGoal navigation as a downstream task utilizing our method for occupancy map prediction and show that our method outperforms the baseline, non-predictive approach by 12.40% (relative) in navigation speed, and even outperforms a robot with multiple cameras in the general setting.

2.2 Related Works

Mapping the environment is a standard step for autonomous navigation. Classical methods typically treat unobserved (i.e., occluded) locations as unknown. Our focus in this work is on learning to predict the occupancy values in these occluded areas. As shown by recent works, occupancy map prediction can help the robot navigate faster [47] and more efficiently [18].

Earlier works explored machine learning techniques for online occupancy map prediction [48, 49], but they require updating the model online with new observations. Recent works shifted to offline training using neural networks, treating map-to-map prediction as an inpainting task. Katyal et al. [50] compared ResNet, UNet, and GAN architectures for 2D occupancy map inpainting with LiDAR data, finding that UNet outperforms the others. Subsequent works used UNet for occupancy map prediction with RGBD sensors, demonstrating improved robot navigation [25, 44, 45].

Offline training for these methods requires collecting ground truth data by mapping the entire training environment, which can be time-consuming and hinder real-world deployment. Moreover, these models are trained to predict occupancy for the entire surroundings of the robot, including the scene behind it, for which they may lack context within the current observation; this could result in the networks memorizing room layouts, affecting their generalizability. Additionally, methods relying on historical observations for predictions [51] face data-efficiency challenges during training.

As robots can actively collect data, self-supervised methods have been successful in addressing data requirements for various robotic learning tasks [29, 52-56]. For occupancy map prediction in indoor robot navigation, Wei et al. [25] proposed a self-supervised approach using two downward-looking RGBD cameras at different heights. The network predicts the combined occupancy map from the lower camera's input without manual annotation, making it data-efficient and suitable for real robots. However, it struggles to predict edge-like obstacles and requires additional data collection for fine-tuning. Moreover, the tilted cameras limit the information captured ahead of the robot compared to straight, forward-looking cameras.
To this end, we propose a self-supervised method that uses a single, forward-looking camera to maximize information acquisition on the navigation plane while reducing the control effort required to collect data. Adding two cameras to the sides of the robot could further reduce this effort. We design our predictor as a classification network, which can generate sharper maps than regression networks, as shown later in this chapter. We focus on making predictions in the proximity of the robot, reducing the likelihood of memorization and improving generalizability by using the current view as context.

2.3 Proposed Approach: Proximal Occupancy Map Prediction (ProxMaP)

In this work, we consider a ground robot equipped with an RGBD camera in indoor environments. Two additional views are obtained by moving the robot around, as shown in Fig. 2.2a. The same can also be achieved by adding extra cameras to the robot. In the following subsections, we detail the network architecture for ProxMaP, the training details, and the data collection process.

2.3.1 Network Architecture and Training Details

We use the occupancy map generated by CamCenter as input and augment it using a prediction network. Our goal is to accurately predict the occupancy information for the unknown cells in the input map. The network uses the map generated by combining information from the three robot positions as the ground truth for training and thus learns to predict occupancy in the robot's proximity. We use UNet [57] for map prediction in ProxMaP due to its ability to perform pixel-to-pixel prediction well by sharing intermediate encodings between the encoder and decoder. We use a UNet with a 5-block encoder and a 5-block decoder. For training, we convert these maps to 3-channel images representing free, unknown, and occupied regions. This is done by assigning each cell to one of the 3 classes based on its probability p: if p ≤ 0.495, the cell is treated as free; if p ≥ 0.505, it is treated as occupied; and it is treated as unknown in the rest of the cases, similar to Wei et al. [25]. We train the network with the cross-entropy loss, a popular choice for training classification networks.

Since previous works have used variations of UNet for occupancy map prediction, training it as a regression task [25, 52] and as a generative task [47, 58], we also train ProxMaP with these variations. We use UNet as the building block for these approaches as well, with Oc as input and O∗ as the target map (Fig. 2.2b). For the regression tasks, these maps are transformed from log odds to probability maps before training. For the generative tasks, we use the UNet-based pix2pix [59] network with single- and three-channel input-output pairs for regression and classification, respectively.

For regression, since both input and output are probability maps, we use the KL-divergence loss function for training UNet, which simplifies to binary cross-entropy (BCE) under the assumption that each occupancy map is sampled from a multivariate Bernoulli distribution parameterized by the probability of each cell. In addition, we also train a UNet with the Mean Squared Error (MSE) loss for regression. For training the generative models, we use the L1 and LGAN losses as suggested by Isola et al. [59]. In the rest of the discussion, we will refer to the generative classification, generative regression, and discriminative regression variations of ProxMaP as Class-GAN, Reg-GAN, and Reg-UNet, respectively.
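As a concrete illustration of the three-class conversion described above, the following sketch maps a cell-wise probability map to free/unknown/occupied labels using the stated thresholds and prepares a three-channel representation. It is a minimal NumPy example with placeholder names, not the actual ProxMaP implementation.

```python
import numpy as np

FREE, UNKNOWN, OCCUPIED = 0, 1, 2

def probability_to_classes(prob_map):
    """Assign each cell a class from its occupancy probability p:
    free if p <= 0.495, occupied if p >= 0.505, unknown otherwise."""
    labels = np.full(prob_map.shape, UNKNOWN, dtype=np.int64)
    labels[prob_map <= 0.495] = FREE
    labels[prob_map >= 0.505] = OCCUPIED
    return labels

def to_three_channel(labels, num_classes=3):
    """One-hot encode the label map into a (3, H, W) image."""
    return np.eye(num_classes, dtype=np.float32)[labels].transpose(2, 0, 1)

prob = np.array([[0.20, 0.50],
                 [0.90, 0.49]])
labels = probability_to_classes(prob)   # [[FREE, UNKNOWN], [OCCUPIED, FREE]]
image = to_three_channel(labels)        # three-channel map representation
```

In a classification setup like the one described above, integer label maps of this kind would serve as the targets for a pixel-wise cross-entropy loss.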
2.3.2 Data Collection

We use the AI2THOR [1] simulator, which provides photo-realistic scenes with depth and segmentation maps. Our setup, as shown in Fig. 2.2a, includes three RGBD cameras: CamCenter, positioned at the robot's location at a height of 0.5 m, and two additional observations from CamLeft and CamRight, located at a horizontal distance of 0.3 m from the original position towards the left and right, respectively. Each camera is rotated by 30° to capture extra information about the scene and increase the robot's FoV, while also making sure that the cameras on the sides have some overlap with CamCenter. The rotation of the cameras virtually increases the FoV of the robot, and the translation ensures that the robot learns to look around corners rather than simply rotating in place.

Each camera captures depth and instance segmentation images. The depth image aids in creating a 3D re-projection of the scene into point clouds, while the segmentation image identifies the ceiling (excluded from occupancy map generation) and the floor (representing the free/navigable area). The rest of the scene is considered occupied/non-navigable. The segmentation-based processing can be replaced with height-based filtering of the ceiling and floor after re-projection. All the point clouds are reprojected to a top-down view in the robot frame using the appropriate rotation and translation. Maps are then limited to a 5 m × 5 m area in front of the robot and converted to 256 × 256 images for use in the network. Points belonging to obstacles increment the corresponding cell value by 1, while floor points decrement it by 1. Each bin's point count is multiplied by a factor m = 0.1 to obtain an occupancy map in log odds. To limit the log-odds values, the count is clipped to the range [−10, 10]. The resulting map from CamCenter, denoted Oc, is the network input. The ground truth map O∗ is constructed as a combination of the maps from the three cameras, similar to Wei et al. [25], as follows:

O∗ = max{abs(Oc), abs(Ol), abs(Or)} · sign(Oc + Ol + Or),   (2.1)

where Oc, Ol, and Or refer to the occupancy maps generated by CamCenter, CamLeft, and CamRight, respectively. These log-odds maps are converted to probability maps before being used for network training.

AI2THOR provides different types of rooms. We use only living rooms, as they are larger and contain more obstacles than the other room types. Out of the 30 such rooms, we use the first 20 for training and validation and the rest for testing. For data collection, we divide the floor into a square grid with 0.5 m cells and rotate the cameras over 360° in steps of 45°. Some maps do not contain much information to predict due to the robot being close to the walls. Thus, we filter out map pairs where the number of occupied cells in O∗ is more than 20%. This process provides us with ∼6000 map pairs for training and ∼2000 pairs for testing.

2.4 Experiments & Evaluation

We report two types of results in this section. First, we present the prediction performance of ProxMaP and its variations on our test dataset from AI2THOR. Additionally, we show prediction results on HM3D [2] to test generalizability. Then we use these networks for indoor point-goal navigation and compare them with non-predictive methods and the state-of-the-art self-supervised approach [25]. Finally, we present qualitative results on some real observations to highlight the potential for real-world applications.
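For concreteness, the ground-truth fusion in Eq. (2.1) from the data-collection step and the subsequent log-odds-to-probability conversion can be sketched in a few lines of NumPy. The array values below are toy log-odds maps, not data from the actual pipeline.

```python
import numpy as np

def fuse_ground_truth(o_c, o_l, o_r):
    """Combine the CamCenter, CamLeft, and CamRight log-odds maps as in Eq. (2.1):
    keep the largest magnitude among the three maps, signed by the sum of the maps."""
    magnitude = np.max(np.abs(np.stack([o_c, o_l, o_r])), axis=0)
    return magnitude * np.sign(o_c + o_l + o_r)

def log_odds_to_probability(log_odds):
    """Convert a log-odds occupancy map to a probability map before training."""
    return 1.0 / (1.0 + np.exp(-log_odds))

# Toy 2x2 log-odds maps, already clipped to [-10, 10] as in the data pipeline.
o_c = np.array([[ 2.0, 0.0], [-3.0, 1.0]])
o_l = np.array([[-1.0, 4.0], [ 0.0, 0.0]])
o_r = np.array([[ 0.0, 0.0], [-5.0, 2.0]])
o_star = fuse_ground_truth(o_c, o_l, o_r)    # ground-truth log-odds map O*
p_star = log_odds_to_probability(o_star)     # probability map used for training
```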
The networks were trained on a GeForce RTX 2080 GPU, with a batch size of 4 for the GANs and 16 for the discriminative models. Early stopping was used to avoid overfitting, with the maximum number of epochs set to 300.

2.4.1 Occupancy Map Prediction

Setup. As our ground truth maps are generated from a limited set of observations, they may not contain the occupancy information for all the surrounding cells. Hence, we evaluate the predictions only in cells whose ground truth occupancy is known to be either occupied or free. We refer to such cells as inpainted cells. For classification, we choose the most likely label as the output for each cell. For regression, a cell is considered free if the probability p in that cell is less than 0.495. Similarly, a cell with p ≥ 0.505 is considered occupied. The remaining cells are treated as unknown and are not considered in the evaluations. Prediction accuracy is a typical metric to evaluate prediction quality. However, it may not present a clear picture in our situation due to the data imbalance caused by the small number of occupied cells. Ground robots with cameras at low heights, as in our case, are more prone to this imbalance, as the robot may observe only the edges of obstacles. Thus, we also present the precision, recall, and F1 score for each class.

Results. Fig. 2.3 shows the qualitative results from ProxMaP and its variants, and Table 2.1 summarizes the quantitative outcomes. The classification version of ProxMaP exhibits superior precision in predicting occupied cells. In contrast