ABSTRACT

Title of Dissertation: MINIMAL PERCEPTION: ENABLING AUTONOMY
ON RESOURCE-CONSTRAINED ROBOTS

Chahat Deep Singh
Doctor of Philosophy, 2023

Dissertation Directed by: Professor Yiannis Aloimonos
Department of Computer Science

Mobile robots are widely used and crucial in diverse fields due to their autonomous task

performance. They enhance efficiency and safety, and enable novel applications like precision

agriculture, environmental monitoring, disaster management, and inspection. Perception plays

a vital role in their autonomous behavior for environmental understanding and interaction.

Perception in robots refers to their ability to gather, process, and interpret environmental data,

enabling autonomous interactions. It facilitates navigation, object identification, and real-time

reactions. By integrating perception, robots achieve onboard autonomy, operating without

constant human intervention, even in remote or hazardous areas. This enhances adaptability

and scalability.

This thesis explores the challenge of developing autonomous systems for smaller robots

used in precise tasks like confined space inspections and robot pollination. These robots face

limitations in real-time perception due to computing, power, and sensing constraints. To address

this, we draw inspiration from small organisms such as insects and hummingbirds, known for

their sophisticated perception, navigation, and survival abilities despite their minimalistic sensory

and neural systems. This research aims to provide insights into designing compact, efficient, and

minimal perception systems for tiny autonomous robots.

Embracing this minimalism is paramount in unlocking the full potential of tiny robots

and enhancing their perception systems. By streamlining and simplifying their design and

functionality, these compact robots can maximize efficiency and overcome limitations imposed

by size constraints. In this work, a Minimal Perception framework is proposed that enables

onboard autonomy in resource-constrained robots at scales (as small as a credit card) that were not

possible before. Minimal perception refers to a simplified, efficient, and selective approach from

both hardware and software perspectives to gather and process sensory information. Adopting a

task-centric perspective allows for further refinement of the minimalist perception framework for

tiny robots. For instance, certain animals like jumping spiders, measuring just 1/2 inch in length,

demonstrate minimal perception capabilities through sparse vision facilitated by multiple eyes,

enabling them to efficiently perceive their surroundings and capture prey with remarkable agility.

This thesis introduces a cutting-edge exploration of the minimal perception framework,

pushing the boundaries of robot autonomy to new heights. The contributions of this work can be

summarized as follows:

• Utilizing minimal quantities such as uncertainty in optical flow (Ajna Chp 2) and

its untapped potential to enable autonomous navigation, static and dynamic obstacle

avoidance, and the ability to fly through unknown gaps.

• By utilizing the principles of interactive perception (Chp 3), the framework proposes novel

object segmentation in cluttered environments, eliminating the reliance on neural network

training for object recognition.

• Introducing a generative simulator called WorldGen (Chp 4) that has the power to generate

countless cities and petabytes of high-quality annotated data, designed to minimize the

demanding need for laborious 3D modeling and annotations, thus unlocking unprecedented

possibilities for perception and autonomy tasks.

• Proposing a method to predict metric dense depth maps (Chp 5) in never-seen or

out-of-domain environments by fusing information from a traditional RGB camera and

a sparse 64-pixel depth sensor.

• The autonomous capabilities of the tiny robots are demonstrated on both aerial and ground

robots: (a) an autonomous car smaller than a credit card (70 mm), and (b) a bee drone

with a length of 120 mm, showcasing navigation abilities, depth perception in all four main

directions, and effective avoidance of both static and dynamic obstacles. (Chp 6)

In conclusion, the integration of the minimal perception framework in tiny mobile robots

heralds a new era of possibilities, signaling a paradigm shift in unlocking their perception

and autonomy potential. This thesis aims to serve as a transformative milestone that will

reshape the landscape of mobile robot autonomy, ushering in a future where tiny robots operate

synergistically in swarms, revolutionizing fields such as exploration, disaster response, and

distributed sensing.


MINIMAL PERCEPTION: ENABLING AUTONOMY
ON RESOURCE-CONSTRAINED ROBOTS

by

Chahat Deep Singh

Dissertation submitted to the Faculty of the Graduate School of the
University of Maryland, College Park in partial fulfillment

of the requirements for the degree of
Doctor of Philosophy

2023

Advisory Committee:
Dr. Yiannis Aloimonos, Chair/Advisor
Dr. Cornelia Fermüller
Dr. Guido de Croon
Dr. Christopher Metzler
Dr. Nitin J. Sanket
Dr. Inderjit Chopra, Dean’s Representative


© Copyright by
Chahat Deep Singh

2023


To my family, friends, mentors

and

to the people who strive to do what they love


Acknowledgments

I am sincerely humbled and overwhelmed with gratitude as I pen down this section. The

process of my Ph.D. has taken me on a thrilling ride of emotions, obstacles, accomplishments,

and profound personal development. It has been a remarkable journey that could not have been

completed without the unflinching support and contributions of several incredible individuals.

It all started during my Master’s in Robotics in 2016 at the University of Maryland. I

still remember the first day I met Prof. Yiannis Aloimonos – I was awestruck by his inspiring

thoughts on perception and how both small and large animals/insects solve the same day-to-day

tasks but in different ways. He asked me one fundamental question – ‘What is the minimum

information required to solve a given task?’ Little did I know that this question would lay down

the foundation of this thesis. I will be perpetually grateful to him for giving me this opportunity

to work in the Perception and Robotics Group (PRG). I am also forever grateful to Dr. Cornelia

Fermüller for introducing me to the field of neuromorphic perception and mentoring me. Thank

you for treating me as your child and making PRG feel like a home away from home. The amount

of freedom I have gotten from you two for creative thinking is unfathomable and has led me to

this research. Thank you for allowing me to mentor students on my own and teach courses while

pursuing my research – valuable skills that I would not have developed otherwise. Undoubtedly,

these were the best years of my academic life!

I am eternally indebted to Nitin J. Sanket for his unwavering, unconditional support and

mentoring, especially during my master's, which has touched the depths of my heart and forever

transformed my life. This journey would have been impossible without you! Thank you for being a

better than best friend throughout this journey. Thank you for teaching me the art of research and

fostering me to think outside the box. I will always cherish the road trips and the fun sessions

in the lab and in the house! Thank you for all the photography lessons among countless other

things. I have thoroughly enjoyed discussing math with you. Thank you for being there at every

step of the way – more than I deserved.

PRG has been like an amazing family all these years. I am thankful to everyone in the

group for limitless discussions and fun experiences. I would like to thank Levi Burner with whom

I had an insane amount of random discussions, especially math. Chethan Mysore Parameshwara

for introducing, teaching, and involving me in neuromorphic perception. Kanishka Ganguly

for helping me with all my papers, especially hardware and Linux-related stuff. To Loonix,

our pet cockroach who stayed with us at 3 am during drone flight experiments. It was scary

when you flew faster than the drone. Thank you Snehesh Shreshta for taking the lead with

us in setting up the aerial robotics lab. A big thanks to Xiaomin Lin, Jingxi Chen, Botao He,

Konstantinos Zampogiannis, Francisco Barranco, Michael Maynord, Chinmaya Devraj, Matthew

Evanusa, Peter Sutor, Behzad Sadrfaridpour.

I am always indebted to all the Master’s students that have helped me in my research

over the years – Prateek Arora, Ashwin V. Kuruttukulam, Abhinav Modi, Yashveer Jain, Kartik

Madhira, Varun Asthana, Saumil Shah and Akash Gupta. And a huge thanks to the undergraduate

students for teaching me valuable mentorship skills – Rishabh Singh, Riya Kumari and Rohan

Uttamsingh.

A huge thanks to Guido de Croon and Davide Scaramuzza for teaching me valuable skills

in research; Inderjit Chopra for inviting me to teach an aerial robotics course in your department;

Luca Carlone and Ashok Agarwala for your insights into academia; Christopher Metzler for

introducing me to computational imaging. And finally, President Darryll Pines for his continuous

support in the RoboBeeHive project.

To mom – Harbir Kaur and dad – Rajinder Singh Kambo, thank you for your patience

over the years. I know I cannot thank you enough for believing in me even when I did not.

I have deeply embedded your qualities within me without even realizing it. Completing this

journey—from an underachieving student to where I am now—feels like a real-life Cinderella

story. Jasmeet Singh Kambo, I have no words for you. Firstly, thank you for forcing me to enjoy

my undergraduate life with fun projects and not worrying about grades. Thank you for being a

life mentor! Sakshi Singh, I guess there’s another doctor in the family now. Thank you for taking

all the pressure off my shoulders. To the entire Certified Siblings – Jasmeet Singh, Sakshi Singh,

Aman Kaur, Sparsh Deep Singh, Ananta Malhotra, Hersh Deep Singh, Hershita Singh, Arsh

Deep Kaur, Utsav Agarwal, and Tapasi Malhotra for the family trips and endless fun discussions

every weekend. Special thanks to Arhaan Agarwal and Jaiveer ‘Fateh’ Singh. A big thank you

to Hersh for his insightful views on mentoring, academia, and random fun math discussions.

A huge thank you to Sunaina Prabhu and Kedar Gaitonde for being there as emotional

support throughout. To our family in College Park (Ghar ek Mandir) – Priyal Gala, Anoorag

Sunkari, Vinayak Bendale, Ankita Tondwalkar, Prateek Arora, Harshvardhan Uppaluru, Shankar

Ramesh, Pranay Kanagat, Devyani Gera, Kunal Mehta, Ishmeet Singh, Aprit Agarwal, Meghavi

Prashnani, Nakul Garg, Pooja Guhan, Mrunal Dhaygude, and Aakriti Agrawal, thank you for

your support and endless Bakchodi. A big thank you to my unconditional pet dogs – Bansi,

Stella, and Lilly.

I would like to express my deepest gratitude to Pranshu Jhamb, Tapan Khattar, and Niharika

Singh for their unwavering support, even during times when I couldn’t be there. I offer my

heartfelt apologies for not being able to attend your respective weddings. Finally, I extend

my heartfelt gratitude to Naitri Rajyaguru for standing by my side during the last leg of this

remarkable journey. Your support has been genuinely priceless and irreplaceable!

I extend my heartfelt appreciation to Tom Ventsias and Maria Herd for their invaluable

assistance in promoting my research throughout the years. I am deeply grateful to BBC Earth,

Voice of America, Maryland Today, and IEEE Spectrum for featuring my research. A special

thank you goes to Indian Creek Elementary School for giving me the opportunity to teach the

findings of my research to third-grade students. I am grateful to Ivan Pensiky and Kimberly

Edwards for their support in the labs. Ania Picard, your unwavering support has meant the world

to me. I would also like to express my gratitude to Janice Perrone, Tom Hurst, and Sharron

Mcelroy for their patience and assistance with logistics.

I am immensely thankful to Vikram Hrishikeshavan and Derrick Yeo from the aerospace

department for their invaluable help with drone hardware. I also want to express my profound

gratitude to the Department of Computer Science, UMIACS, and the Maryland Robotics Center.

Lastly, I extend my thanks to the Wikimedia Foundation for providing a free source of education.

I wish to express my sincere appreciation for the generous financial support received

from the Office of Naval Research (ONR), Brin Family Foundation, Northrop Grumman

Corporation, NVIDIA, National Science Foundation (NSF), Intel, Dean’s Fellowship, Ann G.

Wylie Fellowship, and the Future Faculty Fellowship.

Additionally, I am immensely grateful to the remarkable open-source platforms, including

Linux, TensorFlow, ArduPilot, Raspberry Pi, NVIDIA Jetson, and PX4. Without their invaluable

contributions, this thesis would not have been achievable.

Remembering everyone is an impossible feat, and from the depths of my heart, I humbly

apologize to those I may have inadvertently missed.

As I close this chapter and embark on new adventures, I carry with me the memories,

lessons, and relationships forged along the way. May this acknowledgment serve as a token of

my deepest appreciation and as a reminder of the indelible impact each and every one of you has

had on my life. Thank you from the bottom of my heart.


Table of Contents

Preface

Acknowledgements

Table of Contents

List of Tables

List of Figures

Chapter 1: Introduction
1.1 Resource-constraint autonomy
1.2 Learning From Nature
1.3 Frugal AI
1.4 Active Vision
1.5 Principles of Minimal Perception
1.6 Predicting Minimal Quantities Using Uncertainty Principles
1.7 Minimal Prior Knowledge – Interactive Perception
1.8 Learning Structure via a Generative Simulator – Minimizing Annotations and Modeling
1.9 Minimal Sensing Modality
1.10 Minimal Data Acquisition
1.11 Applications of Minimal Perception

Chapter 2: Generalized Deep Uncertainty For Parsimonious Robots
2.1 Background
2.1.1 Estimating Uncertainties in Neural Networks
2.1.2 Applications of Deep Uncertainty in Robotics and Computer Vision
2.2 Method – Ajna
2.2.1 General Heteroscedastic Aleatoric Uncertainty Formulation
2.2.2 Informational Cues from Uncertainty Υ
2.2.3 Uncertainty of Optical Flow
2.2.4 Uncertainty of Monocular/Stereo Depth
2.2.5 Uncertainty of Surface Normals
2.2.6 Uncertainty of Semantic Segmentation
2.2.7 Uncertainty and its relationship to Confidence and Inlier ratio
2.3 Results
2.3.1 Quadrotor Platform
2.3.2 Perception Pipeline
2.3.3 Application 1: Dodging Dynamic Obstacles
2.3.4 Application 2: Navigating through unstructured environments
2.3.5 Application 3: Flying Through An Unknown Gap
2.3.6 Application 4: Segmentation of Object Pile
2.3.7 Network Speed on Different Hardware
2.4 Discussion
2.4.1 Limitations and Future Work
2.5 Conclusion

Chapter 3: Novel Object Segmentation With Minimum Knowledge
3.1 Background
3.1.1 Problem Formulation and Contributions
3.2 NudgeSeg Framework
3.2.1 Active perception in NudgeSeg
3.2.2 Interactive perception in NudgeSeg
3.2.3 Verification and Termination
3.2.4 Network Details
3.3 Experimental Results and Discussion
3.3.1 Description of robot platforms – Aerial Robot and UR10
3.3.2 Quantitative Evaluation
3.4 Discussions
3.5 Conclusions

Chapter 4: WorldGen: Generative Simulator for Minimal Perception
4.1 Background
4.1.1 Key Contributions
4.2 WorldGen Generative Simulator
4.2.1 Environment
4.2.2 Texture Mapping
4.2.3 Generative Models
4.2.4 Sensors
4.2.5 Lighting and Climate Conditions
4.2.6 Assets
4.2.7 Design Principles
4.3 Applications
4.3.1 Improvements in Optical Flow
4.3.2 Computational Photography
4.3.3 View Synthesis using Neural Radiance Fields
4.3.4 Active and Interactive Perception
4.3.5 Generating Real World Traffic
4.3.6 Human Pose Estimation
4.4 Conclusion

Chapter 5: Generalized Neural Metric Depth Estimation
5.1 Background
5.1.1 Monocular Depth Estimation
5.1.2 Estimating Depth using Multiple Frames
5.1.3 Using Sparse Depth Supervision
5.2 TinyDepth Model
5.2.1 Sensor Setup
5.2.2 Pre-Processing and Data Generation
5.2.3 Flow-Guided Depth Estimation
5.2.4 Evaluation Metrics
5.3 Experiments and Applications
5.3.1 Experimental Setup
5.3.2 Experimental Results
5.3.3 Evaluations
5.4 Discussions and Limitations
5.5 Conclusion

Chapter 6: Conclusions

Chapter 7: Future Directions
7.1 Passive Computing – Modifying Camera Apertures
7.1.1 Non Visible Spectral Sensing
7.2 Leveraging From Active Elements In Front Of The Sensor
7.3 Towards Robot Morphology Design

Bibliography


List of Tables

2.1 Relation to existing works (Chronological Order).
2.2 Quantitative Evaluation for various applications.

3.1 Description of Evaluation Sequences.
3.2 Evaluation with different segmentation methods for multiple sequences.
3.3 Evaluation of GrassMoss sequence with different amount of errors in A and M.

4.1 Comparison of the different simulation environments.
4.2 Optical Flow EPE Comparison of Training RAFT [1] On Different Datasets. Lower Is better.

5.1 Quantitative evaluation of different methods for metric depth estimation on
out-of-domain datasets.


List of Figures

1.1 A qualitative comparison of living beings and robots in terms of perceptual
capabilities with respect to their scaled body length. Note that cat and eagle
sizes are not to scale.

1.2 Segmentation of the gap with similar texture on the foreground and background
elements.

1.3 A comparison between a honey bee and a hummingbird.
1.4 Bee Peering
1.5 Flying Through Unknown Gaps. (a) Active strategy and (b) Visual Servoing
1.6 Minimal Perception Framework. Green parts are presented in this thesis and the
blue parts are ongoing work that is discussed as future work.
1.7 Learning to estimate the structure of the unknown scene (a) by observing it from
multiple views can reduce the neural network model size in the estimation of the
structures like mountains with various textures (b).

1.8 Combining RGB with a tiny sparse sensor leads to high-resolution depth maps.
1.9 Illustration of a tiny robotic bee pollinating the flowers.
1.10 Weight comparison between a 330 ml Pepsi can and the tiny autonomous car.

2.1 Unification of common robotics problems using the novel generalized
heteroscedastic aleatoric uncertainty formulation for neural networks – Ajna.
This chapter experimentally demonstrates the efficacy of using uncertainty
for the following robotics tasks: (A) Dodging dynamic obstacles, (B)
Navigating through cluttered scenes, (C) Flying through unknown gaps, and
(D) Segmentation of unknown object piles. This chapter shows that such an
algorithmic approach would enable autonomy at scales not thought possible
before such as the drone the size of a hummingbird as shown in the center. All
the images in this chapter are best viewed in color and on a computer screen at
200% zoom.

2.2 A sequence of images of quadrotor dodging objects. Dodging (A) Airplane,
(B) Ball, (C) Cart and (D) Drone. Here, the object and quadrotor transparency
show the progression over time. Red and green arrows indicate object and
quadrotor directions, respectively. In each sub-figure, the outputs are shown in
the following order (taking example as sub-figures of A): (A1) Image sequence
of dodging, (A2) RGB image as seen by the quadrotor, (A3) D435i depth image,
(A4) MiDaS-S output, (A5) MiDaS output, (A6) OccMask, (A7) Ajna. The color
map used in all the depth images is plasma, where blue color represents far
and yellow is close. The colormap for occlusion and uncertainty map is inverse
plasma, where blue color represents lower uncertainty/occlusions and yellow
represents higher uncertainty/occlusions. The yellow boxes show the zoomed-in
view of the object. The colormap is consistent across all figures in this chapter.

2.3 A sequence of images of the quadrotor navigating through cluttered
environments. (A) Indoor forest, (B) Boxes. Here, the object and quadrotor
transparency show the progression of time. Green arrow indicates the quadrotor
direction. In each sub-figure, the outputs are shown in the following order (taking
example as sub-figures of A): (A1) Image sequence of navigation, (A2) RGB
image as seen by the quadrotor, (A3) D435i depth image, (A4) MiDaS-S output,
(A5) MiDaS output, (A6) OccMask, (A7) Ajna.

2.4 Comparison of various methods to navigate through a simulated realistic forest
scene. (A) The scene from the top view with paths overlaid (direction of travel
is left to right). The legend is as follows: white – ground truth depth, green –
MiDaS, dashed green – MiDaS-S, yellow – OccMask, black – MorphEyes, blue
– Ajna (ours), (B) Sample RGB Image as seen by the quadrotor, (C) ground
truth depth, (D) MiDaS-S output, (E) MiDaS output, (F) OccMask output, (G)
MorphEyes output and (H) Ajna output.

2.5 Image of quadrotor flying through unknown gaps. (A) Egg, (B) Goku, (C)
Infinity, (D) Rectangle. In each sub-figure, the outputs are shown in
the following order (taking example as sub-figures of A): (A1) Image of flight
through the gap, (A2) RGB image as seen by the quadrotor, (A3) D435i depth
image, (A4) MiDaS-S output, (A5) MiDaS output, (A6) OccMask, (A7) Ajna.
The black or yellow boxes on the images show the window location.

2.6 Outputs for segmentation experiments using various methods on different
datasets: (A) GrassMoss, (B) Rocks, (C) YCB. In each sub-figure, the outputs
are shown in the following order (taking example as sub-figures of A): (A1)
RGB image as seen by the robot, (A2) D435i depth image, (A3) Mask R-CNN
output, (A4) PointRend output, (A5) MiDaS-S output, (A6) MiDaS output, (A7)
OccMask, (A8) Ajna output. Different colors in A3 and A4 show different objects
with different labels being detected by the instance segmentation.

2.7 (A1) Input image pair as an anaglyph, (A2) Optical flow with colormap shown as
inset, (A3) Ajna’s predicted uncertainty. Despite low ṗ in the highlighted white
region, the quadrotor needs to dodge this area. This is correctly predicted as high
Υ. This is a common example where Υ provides additional information over
ṗ. (B1 to B3) input image frames and ṗ under blinking LED without motion.
(C1 to C3) input image frames and ṗ under blinking LED with motion. (D1 to
D4): Image input, predicted Υ, image input with flow attack, predicted Υ under
attack. (E), (F) and (G) are experiments of flying through gaps, flying through a
forest and detecting dynamic obstacles. Left to right: ground truth depth (white
is 4 m and black is 0 m), input images 1 and 2, predicted Υ.

2.8 Uncertainty Estimation from a moving camera looking at an unknown-shaped
gap. Fig. 2.9 shows the environmental setup. (a) shows the direction of
camera motion and the ground truth mask: white is the background and grey
is the foreground. (b)-(j) shows the pair of images and uncertainty (from left
to right). Note that (d) shows uncertainty in a challenging/illusion scene with
a checkerboard pattern. (b)-(f) scenes with the same texture in foreground and
background; (g)-(i) scenes with different textures and (j) no texture in both
foreground and background.

2.9 GapFlyt experiment setup in 3D for uncertainty estimation in different texture
environments. The top pyramid represents the camera, the yellow plane
represents the foreground with a gap and the blue plane represents the background.

2.10 The plot represents how the detection accuracy of the gap varies with the texture
resolution and contrast.

2.11 From left to right: Input Image, Uncertainty Estimation, Input Image with
Black-Box patch, Uncertainty on the patched image. We show that uncertainty
behavior is not affected by the optical flow attack patch in both cases of (a)
obstacle avoidance and (b) navigation.

2.12 From left to right: Pair of consecutive input images and uncertainty. (a) shows
uncertainty with both motion and illumination changes and (b) shows only
illumination changes.

3.1 Top row: Robots (UR-10 and a quadrotor) used to physically interact (or nudge)
with the objects to get motion cues for segmenting objects in a clutter. Bottom
row (left to right): Initial Configuration of a cluttered scene and the first nudge
being invoked, final nudge is invoked, final Segmentation of the cluttered scene.
Green circles show the nudge operation. All the images in this chapter are best
viewed in color at 200% zoom on a computer screen.

3.2 A conceptual graph of variation of complexity in perception, planning, and
control with task philosophy. As a keen observation, the algorithmic complexity
decreases with an increase in the manipulator motion.

3.3 First nudge policy using uncertainty in optical flow. Hotter colors represent
higher uncertainty. The dashed line represents the convex hull of the cluttered
scene and the arrow represents the direction of the first nudge at point N1.

3.4 (a) Active perception in NudgeSeg framework. The top row shows the movement
of the camera. The bottom row shows the image inputs and uncertainty ρ. (b) and
(c) Interactive perception in NudgeSeg framework. The top row shows the object
nudging. The bottom row shows the input images (before and after the nudge),
optical flow representation, and segmentation hypothesis where colors indicate
cluster membership.

3.5 Top Row: Sample objects used in Table 3.1 as the evaluation sequences. Bottom
Row: Sample cluttered scene for each sequence.

3.6 For each sub-figure: First row (From left to right): Sample monochrome input
image, Uncertainty in optical flow ρ, Segmentation hypothesis after first nudge,
Final segmentation masks. Second row: (From left to right): Outputs of 0-MMS
[2], PointRend (color input), PointRend (mono input), Mask-RCNN (color input),
Mask-RCNN (mono input). Note that in (a) and (d), the objects highlighted with a
red boundary in the top left image of the respective sequences are ‘glued’ together
and are considered to be adversarial samples. This image is viewed best in color
at 400% zoom on a computer screen. . . . . . . . . . . . . . . . . . . . . . . . . 94

3.7 Qualitative Results with (a) no error, ϵA = 0, ϵM = (b) ±5%, (c) ±10%, (d)
±20%, ϵM = 0, ϵA = (e) ±10◦, (f) ±20◦, (g) ±30◦. . . . . . . . . . . . . . . . . 96

4.1 Generative ability of WorldGen: (a) Comparison between Google Street View
(left) and the same street in WorldGen (right), (b) Comparison of Google Maps
satellite image vs. WorldGen top view, (c) Collection of 3D objects in motion,
(d) Object fragmentation, (e) Annotation from left to right: depth, optical flow,
surface normals, stereo anaglyph, image segmentation, event frame. All the
images in this chapter are best viewed in color on a computer screen at 200%
zoom. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

4.2 An overview of WorldGen Framework: (a) Assets: Loads the assets such as maps,
objects, materials etc. into WorldGen environment, (b) Structural Modification
and Animation: Modifying the texture maps and applying physics and motion
models on different objects in the scene, (c) Rendering: Generates rich ground
truth data with the desired metadata (time, frame number, camera intrinsic and
extrinsic properties). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

4.3 Mapping textures to a round table. Top row: Rendered Output, Bottom row:
Sample textures projected on a sphere. (a) Barebone 3D model, (b)-(d) Different
Textures applied on (a). Note: Variational mapping models change the structure
of the 3D objects in different renders (notice the legs on the chair). Here, the
Gaussian noise in (d) > (c) > (b). . . . . . . . . . . . . . . . . . . . . . . . . . 109

4.4 (a) OpenStreetView, (b) Depth Map, (c) 3D Model View Generated by WorldGen
and (d) Final Rendered View . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

4.5 City environment in different weather and time of the day: (a) Day, (b) Night with
rain, (c) Dawn and (d) Night without rain, (e) panoramic view of the city and (f)
demonstrates the generative ability of WorldGen by changing the textures of the
entire scene while keeping the same structure. . . . . . . . . . . . . . . . . . . . 110

4.6 High-resolution views generated by WorldGen from views at different altitudes
with dynamic lighting, camera intrinsic, and extrinsic. . . . . . . . . . . . . . . . 118



5.1 Depth Estimation for Tiny Autonomous Robots: (a) Bee drone of size 92cm in the
largest dimension, (b) Lightweight sensor suite – RGB and sparse time-of-flight
sensor used on Bee drone and Tiny car, (c) Tiny car of size 70cm largest
dimension and (d) illustrates the pair of sensor inputs along with our metric dense
depth prediction with the ground truth on the right. . . . . . . . . . . . . . . . . 126

5.2 Sensing principle of VL53L5CX sensor that results in super sparse 8 × 8 depth
resolution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

5.3 System overview: An architecture of TinyDepth encoder-decoder model that
utilizes an eight-channel input by combining RGB and L5 data from two views
to predict metric dense depth. Refer to Fig. 5.2 for the color scheme. . . . . . . . . 130

5.4 Real-world robot experiments: (a) Drone Navigation in an unstructured indoor
forest scene, (b) Flying through unknown gaps, and (c) Tiny Car navigation
through an obstacle course. The bottom left insets in each section of the image
represent input RGB image, ground truth data, and depth prediction (left to right).
Note the gradient yellow to red line shows the traversal of the robots in time where
red represents temporally later stage. . . . . . . . . . . . . . . . . . . . . . . . . 132

5.5 Quantitative evaluation on out-of-domain dataset: (a)-(c) NYUv2 out-of-domain
samples, (d) GapFlyt data: Flying through unstructured gaps and (e) Indoor forest
data for drone navigation. Note that MiDaS/MiDaS-S use a single RGB image,
DELTAR uses a single RGB + single L5 and TinyDepth uses two RGB and two
L5 consecutive image pairs for depth predictions. . . . . . . . . . . . . . . . . . 137

6.1 Onboard Autonomy on Tiny Robots: An Outcome of Minimal Perception . . . . 147
6.2 Flower Detection of a Downfacing Camera on the RoboBee . . . . . . . . . . . . 148
6.3 Autonomous onboard obstacle avoidance on a credit-card size robot . . . . . . . 149

7.1 Various Apertures in the wild . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
7.2 Images of Flower: (a) RGB, (b) Simulated Bee Vision and (c) UV Space . . . . . 156
7.3 (a) Bald eagle seamlessly changing its state from walking to squeezing to flying.
(Left to Right). (b) Our Morphing Quadrotor Prototype walking, squeezing and
switching to flight mode. (Left to Right). . . . . . . . . . . . . . . . . . . . . . . 159



Chapter 1: Introduction

Nature has spent 3.8 billion years on research and development in genetic evolution. Living

beings have evolved over the years based on their daily operations, habitat, and surroundings. The

evolution in nature has been purposive rather than generic. This evolution is largely driven

by their perceptual behaviors based on their needs and environment. Over the years, these

systems have learned to solve specific tasks very efficiently. These parsimonious systems or

living beings carry a blueprint to develop the next generation of robots. The key to parsimony

is using a minimal amount of information/cues or sensing modalities for an efficient completion

of goals. In stark contrast, we have been developing robots and AI frameworks for merely 50

years, spending the most time developing independent modules. I draw inspiration from nature

to formulate robot autonomy frameworks using only onboard sensing and computation at scales

that were never thought possible before. The solution to robot autonomy lies at the intersection

of AI, computer vision, computational imaging and robotics – resulting in parsimonious robots.

Parsimony refers to thriving in a resource-constrained environment. Even with drastically varied

power and sensor constraints, bees and birds are often able to perform similar tasks. To enable

autonomy at such scales, my research focuses on the robustness, unification and generalizability

of AI frameworks for robots. Robustness refers to making reliable predictions in the presence of noise

in the input data. Unification refers to the formulation of a single mathematical framework that



enables solving different robotics problems. Generalizability refers to the ability of an AI to work

across various environmental conditions.

Inspired by nature, this thesis deals with a minimal perception framework for robots to

develop the next generation of tiny, efficient, effective, and purposive robots with onboard

autonomy. It introduces alternative methodologies to depth sensors that are essential for robot

autonomy, especially navigation and obstacle avoidance.

An outline of the thesis is given below:

• Chapter 2 proposes the general formulation of uncertainty principles in optical flow and

their unconventional uses in robot navigation.

• Chapter 3 utilizes the uncertainty in optical flow along with the principles of interactive

perception to segment the never-seen objects in a cluttered environment by repeated

nudging.

• Chapter 4 introduces a generative simulator to automatically create petabytes of

high-quality annotated data of real-world cities (digital twins) without any need for manual

effort in 3D modeling, for use in computer vision and robotics applications.

• Chapter 5 presents a method to predict metric dense depth maps by utilizing only a

traditional RGB sensor with a 64-pixel super sparse depth sensor. We demonstrate its

ability to navigate in unknown environments on a credit card-sized robot.

• Chapter 6 serves as the culmination of the minimal perception framework within the

context of robot autonomy applications, encapsulating key findings and insights, and

offering valuable reflections for future directions in this field.



1.1 Resource-Constrained Autonomy

Robot autonomy within a resource-constrained environment is a complex and challenging

task that requires intricate strategies for optimal functionality. The basic premise revolves around

designing robotic systems capable of performing tasks autonomously despite limitations in

computational resources, energy supplies, or sensor capabilities. This is of particular importance

in scenarios such as aerial robotics, deep-sea exploration, or extraterrestrial missions, where

resource management becomes critical. Algorithms, such as those based on reinforcement

learning or genetic algorithms, are often used to optimize resource allocation dynamically. These

algorithms are designed to make trade-offs between resources used and the quality of the task

performed, thus maximizing efficiency. Sensor fusion techniques are also crucial in managing

limited sensor capabilities, merging data from multiple sources to improve understanding and

accuracy. Moreover, power management strategies, including sleep-wake cycles and variable

processing speed, are implemented to minimize energy consumption. Software and hardware are

co-designed for these systems to leverage the unique properties of specific hardware for better

resource management. As such, resource-constrained robot autonomy requires a multi-faceted

approach integrating planning, learning, perception, and decision-making to facilitate successful

operation.

From a perception point of view, resource-constrained robot autonomy presents intriguing

challenges and necessitates innovative solutions. The perception system of an autonomous

robot, including cameras, lidar, sonar, and other sensors, serves as its ‘eyes’ but in a

resource-constrained environment, the capabilities of these sensors may be limited. Therefore,

advanced methods such as sensor fusion become crucial, which integrate data from different



sensor types to form a more comprehensive understanding of the environment and reduce

perceptual uncertainty. Deep learning and computer vision techniques are also used to extract

relevant features from the sensory data and to recognize objects and patterns, yet they must be

implemented efficiently due to limited computational resources. Furthermore, perception-based

strategies must account for energy consumption, as continuous data acquisition and processing

can be power-intensive. As such, low-power modes and selective perception, where only relevant

data are actively processed, can be effective strategies. Hence, in resource-constrained robot

autonomy, the perception system must balance between detail and depth of environmental

understanding, computational demand, and energy efficiency to ensure reliable operation.

However, these computationally expensive perception algorithms can be offloaded to

a cloud computer or a companion computer over a network. So the first question

that comes to mind is – ‘Why do we need onboard autonomy?’ Autonomous systems that rely

on either wirelessly connected companion computers in the vicinity or cloud computing are

susceptible to failure when deployed in the wild. Such systems tend to fail in GPS-denied environments

and are prone to latency issues. Onboard robot autonomy leads to secure systems that reduce

the possibility of hacking and other security threats, and makes the robots more robust.

We also already have some highly capable autonomous robots with onboard computing, but they are

relatively large in size (more than 300mm) – both aerial and ground robots. So, ‘Why do we

need small robots?’ Small robots are safe and agile, and can be deployed as swarms. These

swarms are highly scalable and can be effectively produced at much cheaper costs. Furthermore,

these autonomous swarms enable the robots to inspect confined and dangerous areas that are time

constrained such as thermonuclear power plants. It is a well-known fact that robot autonomy is

substantially affected by the speed and size of the memory, sensor type and quality, as well as the



power required. This directly affects the robotic system’s size, area, and weight.

1.2 Learning From Nature

The field of biomimetics, or biologically inspired engineering, offers valuable insights

into designing and creating robots based on the study of nature - animals, birds, insects, and

even plant life. These natural entities exhibit unique abilities honed by millions of years of

evolution, providing excellent models for robotic systems. For instance, the agility of a cheetah

can inform the development of robots with enhanced locomotion capabilities. Similarly, the

echolocation method used by bats can inspire the design of robotic sensing systems that operate

effectively in low-light conditions. The collective behavior of ants and bees provides concepts for

swarm robotics, where multiple robots work together to perform complex tasks more efficiently

than a single robot could. Studying bird flight can contribute to advancements in aerial drone

technology, providing insights into energy efficiency and aerodynamics. In the case of arachnids like

spiders, their ability to create intricate web structures could influence the design of construction

or 3D printing robots. Thus, understanding nature and its mechanisms opens up a wide array of

possibilities for designing innovative, efficient, and adaptable robotic systems.

From a perception perspective, biomimetics plays a vital role in developing advanced

robotic systems inspired by nature. A central premise is how animals, birds, and insects

perceive and interact with their environment, offering rich insights for creating efficient robotic

perception systems. For instance, the complex visual processing system of a dragonfly, capable

of detecting movement and depth with extraordinary precision, can inspire the development of

sophisticated machine vision algorithms for robots. The sonar system of bats, which enables



navigation and hunting in the dark, provides a model for developing robust echo-based sensing

mechanisms, especially useful for robots operating in low-visibility environments. Birds, with

their ability to adjust their flight based on wind conditions, offer insights for developing adaptive

perception and control systems in aerial drones. Similarly, the tactile sensing of rodents’

whiskers could inform the design of touch-based perception for robots operating in dark or

cluttered spaces. Swarm robotics often draw inspiration from ants and bees, which communicate

and coordinate effectively to perceive their environment collectively and perform complex

tasks. Thus, perception research in biomimetics is about deciphering and applying nature’s

sophisticated sensory systems to enhance robotic perception and interaction capabilities.

Thus, in order to make these systems autonomous, let us understand what we can learn from

nature to build the next generation of tiny robots with onboard autonomy. What does it take to

downsize currently existing autonomous systems by a large factor while maintaining

or eventually enhancing autonomous capabilities? One way is to look at nature and living beings

and observe their behaviors. Consider Fig. 1.1, which plots the perception capabilities of living

beings, largely driven by their perception systems, against their body lengths. It is

important to note that perception capabilities generally increase with body size, i.e., as

body length increases, perception systems become more mature, with exceptions in a few

living beings. These exceptions include jumping spiders, cuttlefish, and various species of frogs.

Jumping spiders have a very sparse low-resolution vision system that enables them to process

fast-moving objects or prey and quickly react to hunt them. Whereas cuttlefish and different

species of frogs and other beings have developed their visual systems by modifying the aperture

shapes. For example, a cuttlefish has a ‘W’-shaped aperture while there are species of frogs that

have vertical or horizontal openings. Also, note that the blue and green bubbles have real-world



[Figure 1.1 plot. Vertical axis: Perception Capability. Horizontal axis: Body Size (mm), with ticks at 10, 100, 200, 300, 400, 500.]

Figure 1.1: A qualitative comparison of living beings and robots in terms of perceptual
capabilities with respect to their scaled body length. Note that cat and eagle sizes are not to
scale.

tiny robots with onboard autonomy. Before this work, there existed robots

as small as 120mm that were able to perform autonomous tasks such as navigation, static and

dynamic obstacle avoidance as well as flying through unknown-shaped gaps [3]. These robots



are represented by a blue bubble. This work enables us to downsize further and boost autonomy

performance on robots that are as small as a credit card (less than 3 inches in length).

These robots are represented by a green bubble in Fig. 1.1. In the following sections, we will

learn how bio-inspired solutions can boost resource-constrained autonomy in mobile robots.

1.3 Frugal AI

Nature’s grandest architecture is built upon eons of evolution, an epitome of minimalism

refined over time. Our approach to building robots should mirror this ethos - evolving complexity

through simplicity, function through form, and achieving great feats not through excess but

efficiency. One of the most influential theories highlighting the elegance of simplicity in language

structure is Noam Chomsky’s Minimalist Program [4]. Chomsky, a renowned linguist, cognitive

scientist, and philosopher, proposed the Minimalist Program as a radical rethinking of syntactic

theory. Its underlying principle is the idea that nature, including human language, operates

in the simplest and most efficient way possible. The core of Chomsky’s Minimalist Program

rests on the assumption that sentences are built from a basic lexical inventory through a series

of binary merges. This process minimizes computational complexity by reusing the same

operation for structuring sentences, enabling an infinite array of expressions from a finite set

of elements. Another significant facet of the Minimalist Program is the notion of ‘economy.’

Chomsky theorized that linguistic expressions follow the principle of economy, ensuring the

most resource-efficient outcomes. This mirrors the concept of minimalism where ‘less is more.’

Based on such theories, a new field has emerged – Frugal AI, which refers to the

development and deployment of artificial intelligence (AI) systems that operate effectively with



minimal resources, particularly in terms of computational power, energy consumption, and data

requirements. The concept aims to create AI models that maintain high efficiency and robust

performance despite these constraints. This is particularly relevant in scenarios where the

availability of resources is limited, such as remote areas, edge devices, or low-income regions.

Frugal AI can also contribute to sustainable AI practices, reducing the environmental impact

associated with data centers and large-scale computing.

The development of frugal AI poses significant research challenges, necessitating the

exploration of methods that allow model compression, efficient learning, and effective inference.

Techniques such as quantization, pruning, and knowledge distillation are often used to compress

deep learning models without substantial loss of accuracy. Transfer learning and few-shot

learning strategies are used to enable models to learn effectively from small data sets. On-device

AI and federated learning can be utilized for inference in resource-constrained environments

while also preserving privacy. As such, the research and implementation of frugal AI embody

a shift towards more accessible, sustainable, and decentralized AI practices, making advanced

technologies more equitable and less resource-intensive.
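
As a minimal illustration of one of the compression techniques named above, the following sketch applies symmetric per-tensor 8-bit quantization to a weight matrix. The function names (quantize_int8, dequantize) and the random weights are hypothetical and chosen only for illustration; practical toolchains typically add calibration data and per-channel scales.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization of float weights to int8."""
    max_abs = float(np.max(np.abs(weights)))
    # One scale for the whole tensor; guard against an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to approximate float32 weights."""
    return q.astype(np.float32) * scale

# Hypothetical weight matrix standing in for one layer of a network.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=(64, 64)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = float(np.max(np.abs(w - w_hat)))  # bounded by about scale / 2
```

The int8 tensor occupies a quarter of the memory of the float32 original, at the cost of a rounding error of at most about half a quantization step per weight.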

While many forms of perception in the animal kingdom are astoundingly complex, they

also often exhibit a remarkable sense of frugality. This ‘minimal perception’ underscores how

living beings optimize their perceptual capacities in resource-constrained conditions. In the

following section, we will engage in a detailed exploration of how diverse organisms employ

distinct strategies to address identical challenges.



Figure 1.2: Segmentation of the gap with similar texture on the foreground and background
elements.

1.4 Active Vision

Active vision [3,5–7] is a concept in computer vision and robotics that describes a system’s

ability to control the focus of its attention rather than passively analyzing an entire scene. This is

often achieved through physical motion or manipulation of the sensor or environment. It reflects

the concept of “looking around” to gather information, mimicking the way humans and animals

actively observe their surroundings. Active vision systems can also dynamically modify their

field of view or adjust parameters such as focus and aperture to capture the most pertinent

information. This allows for more efficient data collection and processing since it enables the

system to concentrate resources on the most relevant aspects of the visual scene.

Figure 1.2 illustrates a scene comprising a foreground and a background. In Figure 1.2a,

the side view setup is presented, where the yellow region represents the foreground with a gap or

hole, while the blue region represents the background. Notably, both foreground and background

elements possess identical textures. Observing the scene from a single view, as depicted in Figure

1.2b, it is unfeasible to determine the exact location of the gap. However, by combining Figures

1.2b-c, it becomes possible to calculate the optical flow, enabling the estimation of ordinal depth.

Consequently, the gap in the image can be identified, as demonstrated in Figure 1.2d.
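
The reasoning above can be sketched in a few lines, under strong simplifying assumptions: the camera translates sideways between the two views, the scene is static, and a dense optical flow field is already available from some estimator. Under these assumptions, nearer surfaces produce larger flow magnitudes, so the far background seen through the gap appears as a region of small flow. The function name gap_mask_from_flow, the midpoint threshold, and the synthetic flow field are illustrative assumptions.

```python
import numpy as np

def gap_mask_from_flow(flow):
    """Mark pixels with below-midpoint flow magnitude as 'far' (candidate gap).

    flow: array of shape (H, W, 2) holding per-pixel (dx, dy) displacements.
    """
    mag = np.linalg.norm(flow, axis=-1)
    # Ordinal depth only: we separate 'near' from 'far', not metric distance.
    threshold = 0.5 * (mag.min() + mag.max())
    return mag < threshold

# Synthetic example: near foreground moves 4 px, far background (gap) 1 px.
h, w = 60, 80
flow = np.zeros((h, w, 2), dtype=np.float32)
flow[..., 0] = 4.0
flow[20:40, 30:50, 0] = 1.0

mask = gap_mask_from_flow(flow)
ys, xs = np.nonzero(mask)
center = (float(ys.mean()), float(xs.mean()))  # (row, column) of gap center
```

Note that texture plays no role here: the foreground and background could look identical in a single image, and the gap would still be recovered from the difference in apparent motion.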

In real-world applications, active vision strategies could significantly enhance the



Figure 1.3: A comparison between a honey bee and a hummingbird.

capabilities of automated systems, such as autonomous vehicles, industrial robots, and

surveillance systems. For instance, an autonomous vehicle equipped with an active vision

system can adjust its sensors to focus on specific areas of interest like pedestrians, road signs,

or other vehicles. Similarly, in surveillance applications, an active vision system can concentrate

on unusual movements or behavior, making the overall system more effective and efficient.

Thus, active vision forms a critical part of advanced AI systems, enabling them to interact more

effectively with their environments.

Birds and bees (see Fig. 1.3) have very different resources in terms

of sensing quality, number of neurons, weight, power, etc., yet they solve the same problem of flying

through never-before-seen, unstructured gaps, albeit in very different manners. While birds

seamlessly traverse these gaps, bees, owing to their lower computation and sensing quality,

tend to utilize active vision techniques. Bees wander around the gaps and observe them

from various positions and orientations in order to estimate the position and relative size (to the



Figure 1.4: Bee Peering

background) of the gap. It is also important to note that bees do not estimate the absolute size of the gaps;

instead, they estimate the size of the gap relative to their body lengths. This affects the amount of

peering the bees have to actively perform in order to have an ordinal depth perception of the gap.

Fig. 1.4 demonstrates the amount of movement required for the bee for different sizes of gaps.

This image has been adapted from [8].

GapFlyt [9] studied these behaviors and introduced TS2P, which obtains and stacks

optical flow [10] from different views in order to estimate the ordinal depth. Fig. 1.5 shows

the active strategy used in GapFlyt [9]. However, a significant limitation of this technique is the

absence of mathematical assurances for successful gap traversal. The drone lacks the ability to

estimate the metric depth of the gap or determine whether it has successfully traversed through

it. However, this challenge has been addressed by the TinyDepth approach, which is discussed in

detail in Section 1.9.



Figure 1.5: Flying Through Unknown Gaps. (a) Active strategy and (b) Visual Servoing

1.5 Principles of Minimal Perception

Minimal perception is defined by a living being’s ability to extract maximal information

from its environment using minimal sensory data. It reflects an organism’s capacity to

maintain functional performance while minimizing the cognitive and energy resources required

for perception. This economization of resources is not only essential for survival but also a

testament to nature’s ingenuity.

Fig. 1.6 shows the classification of the minimal perception framework. The notion of

Minimal Perception can be conceptualized at different levels – from cognitive to the sensor. This

work deals with three different categories of minimalism in perception:

1. Minimalism in Information: What is the minimum information required by the robot to

complete a given task? This can be achieved by utilizing minimal quantities such

as uncertainty of optical flow (Chapter 2) or by active perception (observing from multiple

views to learn the structure of the scene rather than the texture; see Fig. 1.7).

2. Minimalism in Sensing Modality: What is the minimal choice of sensor required, under given

size, area, weight, and power constraints, to solve the tasks at hand?



[Figure 1.6 diagram. Node labels: Minimal Perception; Minimal Information Models; Minimal Data Acquisition; Minimal Sensing Modalities; By Predicting Minimal Quantities; By Learning Scene Structures; By Adding Passive Element to the Sensor; By Adding Active Element to the Sensor.]

Figure 1.6: Minimal Perception Framework. Green parts are presented in this thesis and the blue
parts are ongoing work that is presented as future work.

3. Minimalism in Data Acquisition: By adding an active or passive element in front of the

sensor, can we extract the minimal information from the environment that is required to

solve a set of tasks?

The underlying principles of minimal perception are rooted in the goal of extracting

essential information while optimizing resource utilization in autonomous systems. The

principles governing minimal perception can be defined as follows:

Figure 1.7: Learning to estimate the structure of the unknown scene (a) by observing it from
multiple views can reduce the neural network model size in the estimation of the structures like
mountains with various textures (b).

1. Selective Information Extraction: Minimal perception involves the selective extraction of

pertinent information from the environment. This principle focuses on identifying and

prioritizing relevant data while disregarding non-essential or redundant information. By

employing techniques such as feature selection, saliency analysis, or attention mechanisms,

minimal perception aims to minimize the computational burden associated with processing

large amounts of data.

2. Minimal Prior Knowledge: ‘What is the information I required to solve a set of N tasks T_N

in a given amount of time?’ This question addresses the possibility of solving a given task

with absolutely minimal prior knowledge or information. Active and interactive perception

[11–13] strategies play a key role in solving these tasks with minimal prior knowledge.

3. Adaptive Sensing: Minimal perception incorporates adaptive sensing strategies to optimize

data collection based on the specific context and task requirements. Adaptive sensing

techniques dynamically adjust sensor parameters such as the sensing modality,

aperture shape, etc. to effectively capture the necessary information. By adapting

the sensing process to the prevailing conditions, minimal perception reduces resource

consumption and maximizes the efficiency of data acquisition.

4. Attention Mechanism: Attention mechanisms play a crucial role in minimal perception

by directing computational resources toward salient stimuli. Inspired by nature’s visual

attention, these mechanisms allocate processing power and sensory focus to the most

relevant parts of the input data. By selectively attending to significant features or

regions, minimal perception optimizes computational efficiency and facilitates real-time

responsiveness in resource-constrained systems. In Chapter 5, we exemplify this principle

by showcasing the robot’s capability to estimate dense depth in all spatial directions.



However, it selectively focuses its computational resources solely on the direction

associated with the highest potential risk for effective obstacle avoidance and navigation

during tasks.

5. Hardware-Software Optimization: Hardware-software co-design for mobile robots plays a

vital role in maximizing the capabilities and performance of these autonomous systems. It

involves the seamless integration and optimization of hardware components and software

algorithms specifically tailored to the requirements of mobile robotic applications. The

co-design process aims to strike a balance between computational efficiency, power

consumption, real-time responsiveness, and physical constraints to enable mobile robots

to navigate their environments, perform complex tasks, interact with humans, and adapt to

changing conditions.
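
The selective allocation described under the Attention Mechanism principle can be sketched as follows. This is an illustration under stated assumptions, not the procedure of Chapter 5: coarse depth maps are assumed to be available for a few named viewing directions, and "risk" is taken to be simply the distance to the nearest point in each direction. The function name riskiest_direction and the direction labels are hypothetical.

```python
import numpy as np

def riskiest_direction(depth_by_direction):
    """Return the direction whose nearest obstacle is closest, and that distance.

    depth_by_direction: dict mapping a direction name to a depth map in meters.
    """
    nearest = {name: float(np.min(d)) for name, d in depth_by_direction.items()}
    direction = min(nearest, key=nearest.get)
    return direction, nearest[direction]

# Hypothetical coarse 8 x 8 depth maps for three viewing directions.
depths = {
    "front": np.full((8, 8), 3.0),
    "left": np.full((8, 8), 1.2),
    "right": np.full((8, 8), 2.5),
}
depths["left"][4, 4] = 0.4  # a close obstacle on the left

direction, distance = riskiest_direction(depths)
# Only the selected direction would then receive the expensive processing.
```

The cheap pass over all directions costs little, while the expensive computation is spent only where a collision is most imminent.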

1.6 Predicting Minimal Quantities Using Uncertainty Principles

The widely successful classical theory of visual perception utilizes a single image that is

tailor-made for static scenes. However, this theory is destined to fail on robots due to the dynamic

nature of real-world environments, limiting robot autonomy. Combining motion or Temporal

Information (TI) with the sensor characteristics enables us to unlock hidden potential in

perception that was not accessible before. The use of TI empowers us to solve common robotics

problems such as navigation and segmentation without any need for depth (or range) sensors. To

accomplish such tasks, it is crucial for the robot to learn the geometry of the environment and the

physics of the robot (how things move), rather than learning only the scene characteristics (how

the environment looks).



The presence of motion or temporal information (TI) introduces uncertainty in network

predictions. Roboticists and computer vision scientists currently exploit this uncertainty to

improve predictions. However, these uncertainties contain untapped concealed information with

significant potential for addressing various robotics problems. Aleatoric uncertainty specifically

characterizes the inherent bias in data collection by sensors, such as cameras’ limited ability

to perceive obstructed objects. To demonstrate the potential of these additional cues, aleatoric

uncertainty prediction was exclusively employed on TI, specifically optical flow, for diverse

robotics applications. The primary advantage of relying solely on uncertainty, rather than

traditional predictions, is a reduction in computational cost by a factor of 10 to 100.

Works such as Ajna (Chapter 2) and NudgeSeg [14] (Chapter 3) showcase real-time

robotics tasks utilizing uncertainty, including navigation, static and dynamic obstacle avoidance,

traversing unknown gaps, and segmentation tasks. Significant uncertainties were observed in

areas where optical flow presented challenges, such as occlusions and motion blur, and this

information was effectively utilized to detect obstacles within the scene. Additionally, a novel

class of sensors known as neuromorphic sensors or event cameras, capable of extracting TI at the

sensor level itself, can be employed to further enhance robot efficiency.
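
To make the idea of using uncertainty alone as an obstacle cue concrete, the sketch below marks pixels whose predicted optical flow uncertainty is unusually high. It assumes a per-pixel uncertainty map is already produced by some network; the function name obstacle_mask, the mean-plus-k-standard-deviations rule, and the synthetic map are illustrative assumptions rather than the procedure used in Ajna or NudgeSeg.

```python
import numpy as np

def obstacle_mask(uncertainty, k=2.0):
    """Flag pixels whose uncertainty exceeds mean + k * std of the map."""
    threshold = uncertainty.mean() + k * uncertainty.std()
    return uncertainty > threshold

# Synthetic uncertainty map: low everywhere except near an occlusion boundary.
rho = np.full((40, 40), 0.1)
rho[10:14, 20:30] = 0.9

mask = obstacle_mask(rho)
num_flagged = int(mask.sum())
```

No flow vectors are consumed here; the mask is derived from the uncertainty channel only, which is what allows the flow prediction itself to be skipped.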

Quoting the philosopher Socrates, who famously said, “The only true wisdom is in

knowing you know nothing,” it is crucial to recognize when an agent lacks certainty, just as

it is important to evaluate the accuracy of predictions. In the field of robotics, we often rely

blindly on neural network predictions for quantities such as depth, optical flow, and surface

normals, without quantifying the reliability of these predictions. This reliance has prompted

roboticists to acknowledge the need for incorporating uncertainties, leading to the adoption

of the gold-standard approach in robotics: combining multiple measurements using Bayesian

formulations and propagating distribution statistics.

While uncertainties prove valuable for merging multiple measurements, we believe that

their potential in robotics remains underexploited: uncertainties also

offer contextual information beyond their role in combining measurements. Before delving into specific

examples, let us introduce two common types of uncertainties: Aleatoric, also known as

observational data uncertainty, and Epistemic, which pertains to model uncertainty. Aleatoric

uncertainty captures the inherent bias in data collection by a sensor, while epistemic uncertainty

captures the inherent bias arising from the scenarios used to train the model. For example,

aleatoric uncertainty would be high in transparent or dark regions when using RGB-D data,

whereas a network trained indoors would exhibit high epistemic uncertainty when tested on

outdoor data.

The contextual information provided by an epistemic uncertainty model is the need for

additional data samples to improve accuracy for a particular sample. This information is valuable

for determining if the agent is operating in an “out of domain” situation and whether online

learning is necessary to achieve desirable performance. On the other hand, a more careful

examination of the contextual information from aleatoric uncertainty reveals intriguing insights

about the scene based on sensor characteristics. For instance, cameras cannot see through objects,

so high aleatoric uncertainty at the depth boundaries of an object can serve as a powerful cue for

various robotics tasks.

Estimating epistemic uncertainty requires variational inference and multiple runs of

the neural network, making it impractical for real-time applications unless multiple neural

network accelerators are employed. Conversely, aleatoric uncertainty is well-suited for real-time

applications as it only requires a minor increase in the number of parameters and a single pass



of the network to predict uncertainty. In this study, the focus is on estimating heteroscedastic

aleatoric uncertainty, which refers to the observational uncertainty specific to the input data.

We address the following questions in Chapter 2: How can we estimate heteroscedastic

aleatoric uncertainty in a neural network? What informational cues does it provide for various

robotic tasks? This work proposes a novel generalized formulation for heteroscedastic aleatoric

uncertainty in neural networks.

1.7 Minimal Prior Knowledge – Interactive Perception

Perception and interaction constitute a synergistic pair that exhibits complementary

properties in the field of robotics. Despite the inherent capabilities of most robots to engage

in movement or body manipulation for the purpose of acquiring additional information, the

utilization of this combined perception-interaction paradigm remains limited. Nature’s creations,

even at the most rudimentary biological level, exploit this active-interactive synergy to effectively

address complex problem domains [15]. Consequently, the foundational principles of robotics

have encompassed formal frameworks that capture the elegance of action-interaction-perception

loops. By augmenting the computational requirements of a specific task through exploration

and interaction, valuable information can be obtained in a manner that simplifies the underlying

perception challenges.

In recent years, deep neural networks have made significant strides in object segmentation,

effectively delineating objects within color and depth images for specific classes [11, 13, 16].

However, the performance of these networks relies heavily on the availability of training data

encompassing diverse classes and objects. This limitation restricts their ability to generalize well



to previously unseen objects or zero-shot samples. Resource constraints further exacerbate the

issue as robots can only be trained on a limited number of samples.

Furthermore, object segmentation based solely on image frames depends on recognition

and pattern-matching cues. To address these challenges more efficiently, our proposed approach

leverages the active nature of robots and their capacity to interact with the environment. By

engaging in interactions with objects, robots can induce additional geometric constraints to

facilitate the segmentation of zero-shot samples. Our framework introduces a process where

the robot repeatedly nudges or pokes at objects, leveraging the resulting motion cues to generate

and refine segmentation masks at each step. The fundamental concept underlying our approach

is that each rigid body exhibits a unique motion signature (optical flow) during each nudge. We

exploit this characteristic to provide an initial estimation for the robot to learn about new objects

through interaction, analogous to how infants acquire knowledge about their surroundings.

Since the method only relies on optical flow for segmenting these objects, it only utilizes a

monocular monochrome camera. The method is evaluated on zero-shot samples (GrassMoss

and Rocks) and the YCB dataset [17], and compared with state-of-the-art methods such as

Mask-RCNN [18], PointRend [19], and 0-MMS [2]. It is observed that NudgeSeg outperforms

previous state-of-the-art passive approaches on zero-shot samples. Chapter 3 will extensively

explore the realm of interactive perception and provide an in-depth analysis of the NudgeSeg [14]

framework.
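As a rough illustration of the motion cue described above (not the NudgeSeg pipeline itself, which Chapter 3 details), the sketch below marks the pixels whose optical-flow magnitude after a nudge exceeds a threshold. The function name `motion_mask`, the threshold value, and the synthetic flow field are assumptions made purely for this example.

```python
import numpy as np

def motion_mask(flow, threshold=0.5):
    """Return a boolean mask of pixels that moved after a nudge.

    flow: (H, W, 2) array of per-pixel optical flow (dx, dy) in pixels.
    threshold: hypothetical magnitude cutoff in pixels (an assumption).
    """
    magnitude = np.linalg.norm(flow, axis=-1)
    return magnitude > threshold

# Synthetic flow field: a 4x4 rigid "object" translates by (2, 1) pixels
# while the background stays static.
flow = np.zeros((8, 8, 2))
flow[2:6, 2:6] = (2.0, 1.0)

mask = motion_mask(flow)
print(mask.astype(int))
```

A real system would repeat this over several nudges and group pixels by similar flow rather than by a single fixed threshold; the snippet only shows why a rigid body's shared motion makes it separable from a static background.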



1.8 Learning Structure via a Generative Simulator – Minimizing Annotations and Modeling

In the field of computer vision, a significant challenge involves transferring learned

quantities such as ‘depth’ and ‘optical flow’ from one environment to another. Unlike living

beings, current robotic systems lack the ability to infer depth in new surroundings and struggle

with the cross-domain inference of acquired knowledge. For instance, a depth prediction model

trained on outdoor data fails to accurately predict depth in indoor environments. The research

presented in Chapters 2, 3, and 5 addresses this issue by training models in a simulated

environment and evaluating their performance on real-world scenes. Notably, this approach

eliminates the need for fine-tuning the models with real-world data, which is contrary to the

prevailing literature. This strategy effectively mitigates the problem of overfitting in neural

networks, which commonly arises when fine-tuning is performed using testing data. Moreover,

it improves the scalability of the networks, enabling their deployment across a variety of robot

sizes and scales.

Neural network predictions are often constrained by simulated data generated using

imperfect camera models that lack photorealism and accurate camera physics. Conversely, the

collection and annotation of real-world data can be prohibitively expensive. In recent research,

the open-source framework WorldGen [20] was employed to autonomously generate diverse

structured and unstructured 3D photorealistic scenes, such as city views, object collections, and

object fragmentation. This data generation process relies on existing open-source object models,

world maps, and semantic information. WorldGen, a perception-centric generative simulator,



enables the modification of textures, object structures, motion, camera properties, and lens

properties using photorealistic camera models, thereby reducing data bias in neural networks.

Significant improvements in optical flow predictions were demonstrated using the WorldGen

data. Furthermore, the capabilities of the WorldGen simulator were extended to include human

motion data with various textures and structural characteristics. Remarkable advancements in

human pose estimation on event data in the real world were achieved solely through training

on the WorldGen simulation environment [21]. Additionally, simple and effective methods for

learning to generate training data tailored to specific robotics applications were explored.

WorldGen serves as a high-level open-source Python library for generating an unlimited

amount of synthetic data. This library provides a platform for generating visual data to simulate

various scenarios, including self-driving cars, autonomous drones, object segmentation, active

vision, motion segmentation, tracking, and computational photography. Its key contribution

lies in the API that enables the construction of generative environments and streamlines the

process of generating synthetic data, thereby lowering the difficulty barrier for researchers and

practitioners. WorldGen is built around Blender™, a free and open-source 3D creation suite,

allowing the generation of synthetic data such as city maps, collections of moving objects, and

object fragmentation. The design of WorldGen emphasizes scalability and speed. Chapter 4

provides a comprehensive discussion of the various components and details employed in building

WorldGen.



1.9 Minimal Sensing Modality

Minimal sensing modality in robots refers to the implementation of a simplified sensory

system that enables the robot to perceive and understand the environment using a limited set of

sensors. The goal is to design a sensing framework that optimizes resource utilization while

still providing sufficient information for the robot to perform its intended tasks effectively.

This approach involves carefully selecting a subset of sensors that capture key aspects of the

robot’s surroundings, such as proximity, orientation, or object detection, based on the specific

requirements of the application. By minimizing the number and complexity of sensors, the robot

can reduce cost, power consumption, and computational overhead while maintaining a practical

level of situational awareness. The challenge lies in finding the right balance between sensor

richness and system constraints to ensure reliable and efficient operation in real-world scenarios.

Accurate measurement of distances and depth cues is a fundamental requirement for

autonomous robots to comprehend the geometric properties of a 3D scene. When it comes

to navigation, agents heavily rely on depth maps to effectively traverse intricate and dynamic

environments. However, conventional depth estimation algorithms, whether monocular or

stereo-based, often involve computationally expensive operations or necessitate high-quality

sensors. Consequently, their implementation becomes challenging in resource-constrained

settings. To address this, leveraging motion cues like parallax, as observed in pigeons, can

expedite depth computation. Previous endeavors have aimed to mitigate computational burdens

by lowering the resolution or capitalizing on known environmental cues. Nonetheless, these

approaches fall short in terms of accuracy for obstacle avoidance or fail to generalize to unfamiliar

scenes when applied in real-world scenarios.
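To make the parallax cue concrete, the following sketch uses the standard pinhole relation for a camera that translates sideways by a known baseline: a static point at depth Z shifts by d = f·b/Z pixels in the image, so Z = f·b/d. It assumes pure lateral translation, a static scene, and a known baseline; the function name and the numbers are illustrative and are not taken from the thesis.

```python
import numpy as np

def depth_from_parallax(shift_px, focal_px, baseline_m, eps=1e-6):
    """Depth (metres) from image shift under pure lateral camera translation.

    shift_px: per-pixel image shift in pixels (e.g. horizontal optical flow).
    focal_px: focal length in pixels.
    baseline_m: distance the camera moved sideways, in metres.
    eps: floor on the shift magnitude to avoid division by zero.
    """
    shift = np.maximum(np.abs(np.asarray(shift_px, dtype=float)), eps)
    return focal_px * baseline_m / shift

# Nearby points shift more than distant ones.
shifts = np.array([30.0, 15.0, 5.0])
depths = depth_from_parallax(shifts, focal_px=300.0, baseline_m=0.05)
print(depths)  # depths in metres: 0.5, 1.0, 3.0
```

The relation shows why a small deliberate motion can replace a second camera: the moving camera supplies its own baseline.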



Figure 1.8: Combining RGB with a tiny sparse sensor leads to high-resolution depth maps.

Chapter 5 introduces TinyDepth, a compact neural network architecture that leverages

a sparse depth sensor with low resolution and low power consumption (64 depth values).

The proposed method achieves dense depth estimation by combining this sensor with a

high-resolution monocular RGB camera. To enhance the training process, information from

multiple viewpoints is exploited, incorporating motion parallax cues. This approach enables

the model to generalize effectively to previously unseen or zero-shot scenes without the need

for fine-tuning or retraining. Remarkably, the network achieves a processing rate of 4.3Hz

on the Raspberry Pi CPU, providing accuracy comparable to larger networks while surpassing

them significantly in terms of speed. Due to its lightweight computational demands and sensor

requirements, this method is highly suitable for deployment on small-sized robots, including

palm-sized and even hummingbird-sized aerial platforms.

Fig. 1.8 contrasts a conventional depth camera (left) with the proposed setup, which pairs an RGB

camera with a sparse depth sensor to estimate a high-resolution depth map. The work presented

in this study is closely related in spirit to [22], with notable differences. The model

proposed here is much smaller, up to 126× smaller than the

Intel RealSense D435i, a de facto industry-standard depth sensor. Moreover, our approach incorporates

cues from multiple views and demonstrates the ability to generalize to zero-shot or unseen

environments following simulation-based training. The efficacy of this approach is validated

through real-world robotics experiments, explicitly focusing on navigation in complex static

scenes involving both ground and aerial robots.

1.10 Minimal Data Acquisition

Minimal data acquisition in robot perception refers to the efficient and judicious collection

of sensory information required for effective perception tasks in robotics. By carefully

selecting and prioritizing the relevant data, robots can optimize their computational resources

and improve real-time decision-making capabilities. This approach focuses on acquiring only

the essential information needed to perceive the environment accurately while disregarding

redundant or irrelevant data. Techniques such as active perception and sensor fusion play a crucial

role in minimizing data acquisition. Active perception involves intelligent control strategies

that guide the robot’s sensors to actively gather information from specific areas of interest,

maximizing the utility of acquired data. Sensor fusion combines data from multiple sensors to

create a comprehensive and reliable representation of the environment. By adopting minimal

data acquisition strategies, robots can enhance their perceptual capabilities while reducing

computational complexity and achieving efficient and streamlined operations in various domains,

including navigation, object recognition, and scene understanding.

Two ways to modify the data without computation are to add either a passive element

(such as custom apertures) or an active element (such as a rotating prism) in front of the

camera/sensor plane in order to filter the data at the hardware level. This is discussed in

Chapter 7.

Figure 1.9: Illustration of a tiny robotic bee pollinating flowers.

1.11 Applications of Minimal Perception

The future of tiny mobile robots and drones holds immense potential for various industrial,

research, and consumer applications. With advancements in miniaturization and robotics

technology, these miniature devices can perform intricate tasks in constrained environments with

great precision and agility. The integration of artificial intelligence and minimal perception

thinking is expected to play a crucial role in shaping the next generation of these robots. By

employing minimal perception thinking, tiny mobile robots and drones can navigate complex

environments, avoid obstacles, and carry out specific tasks with improved adaptability and

autonomy. Furthermore, the integration of AI techniques, such as machine learning and computer

vision, can enhance their perception capabilities, enabling them to interpret and respond to

dynamic environments effectively. This convergence of minimal perception thinking and AI

holds significant promise in unlocking the full potential of tiny mobile robots and drones across

industries ranging from healthcare and agriculture to manufacturing and surveillance.

Figure 1.10: Weight comparison between a 330 ml Pepsi can and the tiny autonomous car.

The thesis findings have led to significant practical applications in two areas. Firstly,

the utilization of tiny robot bee drones equipped with comprehensive metric depth perception

capabilities in all directions enables efficient pollination processes. Secondly, a credit-card-sized

mobile robot weighing a mere 100 g demonstrates onboard autonomous navigation capabilities.

Visual representations of the weight comparison and the RoboBee can be found in Fig. 1.10

and Fig. 1.9, respectively. Subsequent chapters of this thesis will delve into the modeling of

the necessary frameworks and showcase real-world robotic applications, particularly focusing on

navigating in unfamiliar environments.



Chapter 2: Generalized Deep Uncertainty For Parsimonious Robots

Robots are proactive entities functioning within fluctuating environments using imperfect

sensors. These variable sensor readings often result in predictive inaccuracies and can prove

untrustworthy. As a solution, robotic researchers employ fusion methods involving multiple

observations. Recently, neural networks have emerged as leaders in terms of accuracy for

perception-oriented predictions for robotic decision-making, although they frequently lack

associated uncertainty measurements with the predictions. This chapter will introduce a

mathematical model for determining heteroscedastic aleatoric uncertainty in any random

distribution, without requiring preliminary data knowledge. This model does not make any

assumptions about prediction labels and is agnostic to network design.

A specific category of networks proposed in this work, known as Ajna, involves a minimal

computational addition and necessitates only a slight alteration to the loss function during neural

network training to capture uncertainty in predictions. This facilitates real-time operation even

in robots under severe computational limitations, such as small drones. This chapter also explores the

informative cues found in the uncertainties of predicted values and their use in unifying

common robotics problems. Specifically, this work proposes a strategy to avoid dynamic

obstacles, traverse cluttered scenes, pass through unknown gaps, and segment an object pile.

This is achieved not by computing depth but by utilizing the uncertainties of optical flow acquired

from a monocular camera with onboard sensing and computation.

Figure 2.1: Unification of common robotics problems using the novel generalized heteroscedastic
aleatoric uncertainty formulation for neural networks – Ajna. This chapter experimentally
demonstrates the efficacy of using uncertainty for the following robotics tasks: (A) Dodging
dynamic obstacles, (B) Navigating through cluttered scenes, (C) Flying through unknown gaps,
and (D) Segmentation of unknown object piles. This chapter shows that such an algorithmic
approach would enable autonomy at scales not thought possible before, such as the
hummingbird-sized drone shown in the center. All the images in this chapter are best viewed in color
and on a computer screen at 200% zoom.

This chapter evaluates and demonstrates the proposed Ajna network on the four

aforementioned robotics and computer vision tasks, showing results comparable to

methods that directly use depth.



2.1 Background

As the old saying goes, “If knowledge is power, knowing what you don’t know is wisdom.”

Knowing when the agent is unsure is as important as the correctness of the prediction itself.

Especially in the case of neural network predictions, estimating the uncertainty associated

with these predictions aids in making better decisions, rather than blindly relying on the

predictions under the assumption that they are correct. Roboticists have long recognized

this, which led to the approach of combining multiple measurements using uncertainties,

now the gold-standard approach in robotics. Fundamentally, these measurements

are combined using Bayesian formulations and by propagating the distribution statistics.
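For readers unfamiliar with this kind of fusion, the sketch below shows its simplest instance: combining two independent Gaussian measurements of the same quantity by inverse-variance weighting, which is the scalar form of the Kalman update. The function name and the example numbers are hypothetical and serve only to illustrate the idea.

```python
def fuse_gaussian(mean_a, var_a, mean_b, var_b):
    """Fuse two independent Gaussian measurements of the same quantity.

    Each measurement is weighted by its precision (1 / variance), so the
    less uncertain measurement dominates the fused estimate.
    """
    weight_a = 1.0 / var_a
    weight_b = 1.0 / var_b
    fused_var = 1.0 / (weight_a + weight_b)
    fused_mean = fused_var * (weight_a * mean_a + weight_b * mean_b)
    return fused_mean, fused_var

# Two range readings (metres): the first is four times more precise.
fused_mean, fused_var = fuse_gaussian(2.0, 0.04, 2.4, 0.16)
print(fused_mean, fused_var)  # approximately 2.08 and 0.032
```

The fused variance is smaller than either input variance, which is why uncertainty estimates are so useful when several measurements are available.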

Although uncertainties are very useful for combining multiple measurements, they are

underutilized in robotics: beyond fusion, uncertainties also provide contextual

cues. Before giving examples of this, let us introduce

two kinds of common uncertainties: Aleatoric or observational data uncertainty and

Epistemic or model uncertainty.

The aleatoric uncertainty models the inherent bias in the way a sensor collects data and

epistemic uncertainty models the inherent bias in the scenarios used to collect the training data.

For example, the aleatoric uncertainty would be high for transparent or dark regions in RGB-D data

and the epistemic uncertainty of a network trained indoors would be high when tested on outdoor

data.

The contextual information that an epistemic uncertainty model provides is that the trained

model requires more data to improve accuracy for the particular input sample. Such information

is useful to know if one is operating ‘out of domain’ and if online learning is required for a



desirable operation. In contrast, the contextual information from aleatoric uncertainty, when

studied more carefully, is more intriguing, as it helps reveal information about the scene based

on the sensor characteristics. For example, cameras cannot see through objects; hence one would

expect high aleatoric uncertainty at the object’s depth boundaries which can act as a powerful cue

for performing various robotics tasks.

Furthermore, from a pragmatic viewpoint, estimating epistemic uncertainty requires

variational inference and multiple runs of the neural network, rendering it impractical for real-time

applications unless multiple neural network accelerators are used. In contrast, aleatoric

uncertainty is highly suited for real-time applications since it requires a minor increase in the

number of parameters and requires a single pass of the network to predict the uncertainty. In

this work, we focus on estimating the heteroscedastic aleatoric uncertainty, i.e., observational

uncertainty with respect to the input data.
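The cost difference between the two kinds of estimates can be seen in a small sketch. It uses a toy, untrained two-layer network written in NumPy (an assumption for illustration only, not the Ajna architecture): epistemic uncertainty is approximated with Monte Carlo dropout over T stochastic passes, whereas aleatoric uncertainty is read from one extra output head in a single deterministic pass.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy network weights (random, untrained; purely illustrative).
W_hidden = rng.normal(size=(4, 16))
W_mean = rng.normal(size=(16, 1))           # prediction head
W_logvar = rng.normal(size=(16, 1)) * 0.1   # extra head for log-variance

def forward(x, drop_prob=0.0):
    """One pass through the toy network; returns (mean, log_variance)."""
    hidden = np.maximum(x @ W_hidden, 0.0)  # ReLU
    if drop_prob > 0.0:
        keep = rng.random(hidden.shape) >= drop_prob
        hidden = hidden * keep / (1.0 - drop_prob)
    return hidden @ W_mean, hidden @ W_logvar

x = rng.normal(size=(1, 4))

# Epistemic (Monte Carlo dropout): T stochastic passes, variance of the means.
T = 30
mc_means = np.stack([forward(x, drop_prob=0.2)[0] for _ in range(T)])
epistemic_var = mc_means.var(axis=0)

# Aleatoric: a single pass; the variance comes from the extra output head.
mean, log_var = forward(x)
aleatoric_var = np.exp(log_var)
```

Here the epistemic estimate costs T forward passes, while the aleatoric estimate adds only the parameters of one small head to a single pass, which is what makes it attractive on computationally limited robots.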

In particular, this work proposes a generalized loss function formulation to estimate the

heteroscedastic aleatoric uncertainty that can be used to model various probability distributions

and relate it to the works in the past decade. This demonstrates that previous works are

special cases of our generalized formulation. Furthermore, this work presents a theoretical

analysis of what information/cues this uncertainty formulation provides for various prediction

modalities. Finally, this work applies the predicted uncertainty to perform various robotic tasks

and demonstrates the unification such a methodology can bring to various classes of robotics

problems. We call this class of networks Ajna, after the third eye of Lord Shiva in

Hindu mythology, the eye of wisdom, consciousness, and intuition, since our networks can

“see” (predict) where they might not work well. The uncertainty of predicted values is denoted

Υ, the Greek letter upsilon, which stands for the “u” in uncertainty and resembles the shrug emoji.

We formally define the problem statement and list our contributions next.

The following questions are addressed: How to estimate the heteroscedastic aleatoric

uncertainty of a neural network? What informational cues does it provide for various robotic

tasks? Given an input x, label ŷ, and prediction ỹ, the heteroscedastic aleatoric uncertainty Υ

is predicted by minimizing the proposed generalized loss function. The loss function reduces

to classical statistical properties of variance for common distributions such as Gaussian or

Laplacian. Additionally, the uncertainty of optical flow is learned using this loss function, which

is then applied to four example robotic tasks: (a) Navigating through a scene with static obstacles,

(b) Dodging unknown dynamic obstacles, (c) Detecting and Flying through unknown shaped

gaps, and (d) Segmenting an unknown object pile (see Fig. 2.1). A summary of the contributions

in this chapter is provided below:

• A generalized heteroscedastic aleatoric uncertainty formulation for neural networks

• Analysis of informational cues provided by heteroscedastic aleatoric uncertainty for robotic

tasks

• Extensive real-world experiments demonstrating how such uncertainty can be used for

various robotic tasks

• Discussion of how uncertainty can act as a unifying parsimonious framework for various

robotics applications

Uncertainties and error statistics have been widely utilized in robotics for several decades.

In the subsequent sections, the works concerning the estimation of uncertainties in neural



networks and the applications of deep uncertainty in computer vision and robotics will be

presented.

2.1.1 Estimating Uncertainties in Neural Networks

As previously mentioned, two types of uncertainties exist: (a) Aleatoric or observational

uncertainty and (b) Epistemic or model uncertainty. Previous studies focused on estimating either

Aleatoric or Epistemic uncertainty individually. Approaches such as [23–25] solely estimated

Epistemic uncertainty by assuming a Gaussian prior distribution over weights. These models are

known as Bayesian Neural Networks (BNN). Although the mathematical formulations of BNNs

are straightforward, their inference requires complex computations as marginal distributions

across all neurons need to be computed. Additionally, [26] introduced dropout variational

inference to make Epistemic uncertainty estimation tractable through stochastic Monte Carlo

dropout. In contrast, [27] presented a method specifically for Aleatoric uncertainty estimation,

which was later combined with Epistemic uncertainty in [28] to obtain the concept of “total

uncertainty.” However, these methods were either computationally slow for robotic applications

or lacked sufficient accuracy. To address this, [29] introduced Lightweight Probabilistic Deep

Networks, which propagate uncertainties using assumed density filtering. An even faster variant

was proposed, which directly predicts uncertainties only in the final layer. The approach was

further extended in [30] to be agnostic to the network architecture and loss function. For

a comprehensive overview of related works, please refer to [31], which provides a detailed

summary of prior research.



2.1.2 Applications of Deep Uncertainty in Robotics and Computer Vision

In the field of robotics, the fusion of uncertainties and their statistical analysis has been

widely employed to combine multiple measurements obtained from either a single sensor

or multiple sensors. Recent research has witnessed a shift in focus towards incorporating

uncertainty fusion techniques within neural networks, owing to the dominance of deep learning

approaches in terms of accuracy metrics. For instance, TLIO [32] proposed a methodology that

fuses multiple inertial measurements, leveraging predicted uncertainties in conjunction with an

Extended Kalman Filter, to estimate odometry. KFNet [33] introduced a neural network-based

fusion approach that combines measurement and process models, drawing inspiration from

the classical Kalman Filter formulation [34], which was specifically applied to the problem of

camera relocalization. In the pursuit of robust performance, IVOA [35] incorporates predicted

uncertainties into the navigation stack. Moreover, a general framework for uncertainty estimation,

encompassing both aleatoric and epistemic uncertainties, was presented in [30]. This framework

was successfully applied to three tasks: (a) End-to-End Steering Angle Prediction, (b) Object

Future Motion Prediction, and (c) Closed-Loop Control of a Quadrotor.

In the field of computer vision, the utilization of deep uncertainty predictions to enhance

performance has gained significant attention in recent years. Various applications, including

object detection, optical flow estimation, visual odometry, monocular depth estimation, stereo

depth/disparity, and surface normals estimation, have leveraged uncertainties as a regularizer

to improve robustness. To address noisy samples in 3D object detection using LiDAR data,

Feng et al. [36] proposed a method that learns to ignore such samples. Several works, such

as Lee et al. [37], Kang et al. [38], Ilg et al. [39], Gast et al. [29], and Li et al. [40], employ



either a Generative Adversarial model or an aleatoric uncertainty model to estimate uncertainties.

These uncertainties are then used as regularizers to train optical flow models, leading to improved

performance as observed empirically. In our work, we provide theoretical reasoning to explain

this phenomenon, specifically attributing it to loss attenuation at optical flow discontinuities.

Methods presented by Yuan et al. [41], Bae et al. [42], Roessle et al. [43], and Bhatt et

al. [44] focus on estimating dense depth from stereo or monocular views. They aim to improve

accuracy at the boundaries by incorporating an uncertainty metric. Martin-Brualla et al. [45]

utilize the same aleatoric uncertainty formulation to enhance volumetric color rendering in a

NeRF (Neural Radiance Fields) model. Their approach involves rejecting dynamic objects based

on uncertainty estimates. Eldesokey et al. [46] exploit uncertainty for self-supervised depth

completion, achieving state-of-the-art performance. Similarly, Poggi et al. [47] utilize uncertainty

obtained through image flipping to enhance monocular depth estimation results. Costante et al.

[48] propose a method to estimate and incorporate total uncertainty into a deep visual odometry

pipeline. Furthermore, Kawashima et al. [49] present an alternative approach for aleatoric

uncertainty estimation, employing virtual residuals to address overfitting and demonstrating

state-of-the-art results in age and monocular depth estimation. Alternatively, uncertainty has

been indirectly learned as the probability of outlier/inlier in SFMLearner [50].

2.2 Method – Ajna

2.2.1 General Heteroscedastic Aleatoric Uncertainty Formulation

Consider an input x provided to a neural network N, which has weights W. Let ỹ represent

the estimated output of the neural network N (Eq. 2.1), while the ground truth is denoted by ŷ.

\[ \tilde{y} = N(x \mid W) \tag{2.1} \]

The objective is to learn weights W in order to optimize the following problem:

\[ \operatorname*{argmin}_{W,\,\Upsilon} \; f(\hat{y}, \tilde{y}) \quad \text{s.t.} \quad \Upsilon = k\left(f(\hat{y}, \tilde{y}),\, x\right) \tag{2.2} \]

In this context, the symbol f represents a distance metric between the predicted value ỹ

and the ground truth value ŷ. The symbol Υ corresponds to a monotone function k that depends

on the heteroscedastic aleatoric uncertainty of the underlying probability distribution p(x, ỹ|W ).

This uncertainty is positively correlated with the expected error or risk. The correlation between

two random variables X and Y is formally expressed as the Pearson correlation ρX,Y in Equation

2.3, where the symbol E denotes the expectation operator.

\[ \rho_{X,Y} = \frac{E(XY) - E(X)\,E(Y)}{\sqrt{E(X^2) - E(X)^2}\,\sqrt{E(Y^2) - E(Y)^2}} \tag{2.3} \]
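As a quick check of Eq. 2.3, the sketch below computes the Pearson correlation directly from the expectations in the formula and compares it with NumPy's built-in `np.corrcoef`. The uncertainty and error values are made-up numbers used only to show a positively correlated pair.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation written term by term as in Eq. 2.3."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    covariance = np.mean(x * y) - np.mean(x) * np.mean(y)
    std_x = np.sqrt(np.mean(x ** 2) - np.mean(x) ** 2)
    std_y = np.sqrt(np.mean(y ** 2) - np.mean(y) ** 2)
    return covariance / (std_x * std_y)

# Hypothetical predicted uncertainties and the corresponding errors.
uncertainty = np.array([0.1, 0.4, 0.2, 0.9, 0.6])
error = np.array([0.05, 0.5, 0.1, 1.1, 0.4])

rho = pearson(uncertainty, error)
rho_numpy = np.corrcoef(uncertainty, error)[0, 1]
print(rho, rho_numpy)
```

A positive value of `rho` is the property the formulation asks of Υ: larger predicted uncertainty should accompany larger error.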

To reiterate, the function Υ is dependent on the input x and exhibits correlation with the

estimated error between ỹ and ŷ. Its formal definition is presented below:

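\[ \Upsilon(x \mid W) := h\left(E\left(d(\hat{y}, \tilde{y})\right)\right) \quad \text{s.t.} \quad \rho_{\Upsilon,\, f(\hat{y}, \tilde{y})} > 0 \tag{2.4} \]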

In this context, let d and f denote distance metrics on a set X , such that f, d : X ×X →

[0,∞), satisfying the properties of identity, symmetry, and the triangle inequality. It is important

to note that Υ does not necessarily correspond to the variance of the distribution p (x, ỹ|W ), but

it must fulfill the condition ρ_{Υ,ν} > 0, where ν represents the variance (which may be challenging

to compute for arbitrary distributions). Intuitively, Υ represents the anticipated error, risk, or lack

of confidence in the predicted output. To obtain Υ, which will be referred to as “uncertainty”

for easier comprehension, a self-supervised optimization of the following function needs to be

performed.

\[ \operatorname*{argmin}_{\tilde{y},\,\Upsilon} \; h(\Upsilon)\, f(\hat{y}, \tilde{y}) + \lambda\, g(\Upsilon) \tag{2.5} \]

In the above objective, g is a monotone function of the uncertainty that preserves

domain order and convexity. The function h, in contrast, inverts the monotonicity

of g, satisfying ρ_{h,g} < 0 (h may itself

be a function of g). The rationale behind this formulation is to establish a two-way coupling

between Υ and ỹ in order to prevent trivial solutions and appropriately scale the values. The

term h (Υ) f (ŷ, ỹ) scales the value of f (ŷ, ỹ) based on the uncertainty per input dimension,

simulating “outlier rejection” by weighing different noisy observations. It can be considered as a

loss attenuator. However, this approach can lead to trivial solutions where Υ → ∞ (if unbounded)

to minimize the loss. To mitigate this issue, a simple penalty term λg (Υ) is added to counteract

the occurrence of exploding values for Υ. This formulation extends the work presented in [28].

The selection of the functions g, h, and f is at the discretion of the user and can be tailored based

on domain-specific knowledge. The relationship between f , g, and h has been established in

previous studies, as shown in Table 2.1. It is important to note that our formulation is derived by

summarizing a substantial amount of prior work from various domains that estimate uncertainty,

risk, and/or learned robustness parameters. We identified a common trend in these previous

works and developed a blueprint function that can be employed to design novel loss functions. In

summary, we unify previous approaches into a single generalized function, and specific functional

parameters from our formulation (Eq. 2.5) can be substituted to obtain the previously proposed

works (Table 2.1).
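The interplay between the attenuation term and the penalty term in Eq. 2.5 can be sketched numerically. The choices below (squared error for f, h(Υ) = exp(−Υ), g(Υ) = Υ, and a grid search over Υ) are illustrative assumptions made for this sketch and are not necessarily a row of Table 2.1. With these choices Υ acts as a log-scale uncertainty and may be negative.

```python
import math


def squared_error(y_true, y_pred):
    """Distance f: squared error (an illustrative choice)."""
    return (y_true - y_pred) ** 2


def h(upsilon):
    """Attenuator h: decreasing in the uncertainty (illustrative choice)."""
    return math.exp(-upsilon)


def g(upsilon):
    """Penalty g: increasing and convex in the uncertainty (illustrative choice)."""
    return upsilon


def blueprint_loss(y_true, y_pred, upsilon, lam=1.0):
    """h(Upsilon) * f(y_true, y_pred) + lam * g(Upsilon), in the form of Eq. 2.5."""
    return h(upsilon) * squared_error(y_true, y_pred) + lam * g(upsilon)


def best_upsilon(y_true, y_pred, lam, grid):
    """Grid search for the uncertainty value that minimizes the loss."""
    return min(grid, key=lambda u: blueprint_loss(y_true, y_pred, u, lam))


grid = [i / 100.0 for i in range(-500, 501)]  # candidate values in [-5, 5]

small_error = best_upsilon(1.0, 1.1, lam=1.0, grid=grid)  # f is about 0.01
large_error = best_upsilon(1.0, 3.0, lam=1.0, grid=grid)  # f = 4.0
no_penalty = best_upsilon(1.0, 3.0, lam=0.0, grid=grid)   # penalty switched off

print("larger error yields larger uncertainty:", large_error > small_error)
print("without the penalty the minimizer hits the grid edge:", no_penalty == grid[-1])
```

Under these assumptions the minimizing uncertainty grows with the error, and removing the penalty pushes Υ to the largest value allowed, which is the trivial solution the penalty term is meant to prevent.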

Note that in the formulation presented, Υ can represent either uncertainty (similar to

covariance) or lack of confidence (risk) of any arbitrary distribution. For complex distributions,

Υ can be a complex function of the variance ν, resulting in qualitative rather than quantitative

uncertainty. However, by carefully selecting functions f, g, h, and λ, Υ can be transformed into

a quantitative function of ν with straightforward closed-form solutions. In such cases, it is also

possible to work towards certifying the robustness of neural networks within a limited domain of

training/operating data.
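As a worked illustration of such a closed form, assume (purely for this sketch) that f is the squared error, h(Υ) = e^{−Υ}, g(Υ) = Υ, and λ = 1, and treat Υ at a fixed input x as a free scalar. Minimizing the expected value of the objective in Eq. 2.5 over Υ then gives:

```latex
\begin{align*}
J(\Upsilon) &= e^{-\Upsilon}\, E\big((\hat{y} - \tilde{y})^2 \mid x\big) + \Upsilon \\
\frac{\partial J}{\partial \Upsilon} &= -e^{-\Upsilon}\, E\big((\hat{y} - \tilde{y})^2 \mid x\big) + 1 = 0 \\
\Upsilon^{*} &= \log E\big((\hat{y} - \tilde{y})^2 \mid x\big)
\end{align*}
```

The second derivative, e^{−Υ} E((ŷ − ỹ)² | x), is non-negative, so Υ* is a minimizer. If ỹ equals the conditional mean of ŷ given x, the expected squared error is the variance ν, and Υ* = log ν is a monotone, quantitative function of ν.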

Formally, a network is considered certifiably robust when the error in predicting perturbed

inputs is bounded by a value τ. If x is the input and x′ is the perturbed input, the ℓ_p

distance between their respective outputs should be at most τ, expressed as ∥N(x|W) −

N(x′|W)∥_p ≤ τ. We hypothesize that this definition of robustness should also incorporate

the network’s confidence as an additional constraint. In essence, the network would “inform”

us when it anticipates a failure. Such a formulation requires a comprehensive

mathematical treatment and falls beyond the scope of this chapter, but we consider it

a promising direction for future research.
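The bound in the robustness definition can be spot-checked for a single input pair, as sketched below. The toy linear layer standing in for N(x|W), the weights, and the perturbation are all hypothetical. Checking one pair is only an empirical test and does not certify robustness, which would require the bound to hold for every admissible perturbation.

```python
def lp_norm(vector, p):
    """l_p norm of a sequence of floats, for finite p >= 1."""
    return sum(abs(v) ** p for v in vector) ** (1.0 / p)


def toy_network(x, weights):
    """Hypothetical stand-in for N(x|W): a single linear layer without bias."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def within_bound(x, x_perturbed, weights, tau, p=2):
    """Check || N(x|W) - N(x'|W) ||_p <= tau for one input pair."""
    out = toy_network(x, weights)
    out_perturbed = toy_network(x_perturbed, weights)
    diff = [a - b for a, b in zip(out, out_perturbed)]
    return lp_norm(diff, p) <= tau


weights = [[1.0, 0.0], [0.0, 2.0]]  # illustrative weights W
x = [1.0, 1.0]                      # nominal input
x_perturbed = [1.01, 0.99]          # perturbed input x'

print("bound holds for tau = 0.05:", within_bound(x, x_perturbed, weights, tau=0.05))
print("bound holds for tau = 0.01:", within_bound(x, x_perturbed, weights, tau=0.01))
```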

By employing Eq. 2.5 in a self-supervised manner, Υ is learned in conjunction with ỹ,

with both being dense and exhibiting variations across pixel locations x. In practical terms,