ABSTRACT

Title of Dissertation: MINIMAL PERCEPTION: ENABLING AUTONOMY ON RESOURCE-CONSTRAINED ROBOTS

Chahat Deep Singh
Doctor of Philosophy, 2023

Dissertation Directed by: Professor Yiannis Aloimonos
Department of Computer Science

Mobile robots are widely used across diverse fields because they can perform tasks autonomously. They enhance efficiency and safety, and enable novel applications such as precision agriculture, environmental monitoring, disaster management, and inspection. Perception plays a vital role in their autonomous behavior for environmental understanding and interaction. Perception in robots refers to their ability to gather, process, and interpret environmental data, enabling autonomous interactions. It facilitates navigation, object identification, and real-time reactions. By integrating perception, robots achieve onboard autonomy, operating without constant human intervention, even in remote or hazardous areas. This enhances adaptability and scalability.

This thesis explores the challenge of developing autonomous systems for smaller robots used in precise tasks such as confined space inspections and robotic pollination. These robots face limitations in real-time perception due to computing, power, and sensing constraints. To address this, we draw inspiration from small organisms such as insects and hummingbirds, known for their sophisticated perception, navigation, and survival abilities despite their minimalistic sensory and neural systems. This research aims to provide insights into designing compact, efficient, and minimal perception systems for tiny autonomous robots. Embracing this minimalism is paramount in unlocking the full potential of tiny robots and enhancing their perception systems. By streamlining and simplifying their design and functionality, these compact robots can maximize efficiency and overcome limitations imposed by size constraints.
In this work, a Minimal Perception framework is proposed that enables onboard autonomy in resource-constrained robots at scales (as small as a credit card) that were not possible before. Minimal perception refers to a simplified, efficient, and selective approach, from both hardware and software perspectives, to gathering and processing sensory information. Adopting a task-centric perspective allows for further refinement of the minimalist perception framework for tiny robots. For instance, certain animals like jumping spiders, measuring just 1/2 inch in length, demonstrate minimal perception capabilities through sparse vision facilitated by multiple eyes, enabling them to efficiently perceive their surroundings and capture prey with remarkable agility.

This thesis introduces a cutting-edge exploration of the minimal perception framework, pushing the boundaries of robot autonomy to new heights. The contributions of this work can be summarized as follows:

• Utilizing minimal quantities such as uncertainty in optical flow (Ajna, Chapter 2) and its untapped potential to enable autonomous navigation, static and dynamic obstacle avoidance, and the ability to fly through unknown gaps.
• Utilizing the principles of interactive perception (Chapter 3) to propose novel object segmentation in cluttered environments, eliminating the reliance on neural network training for object recognition.
• Introducing a generative simulator called WorldGen (Chapter 4) that can generate countless cities and petabytes of high-quality annotated data, designed to minimize the need for laborious 3D modeling and annotations, thus unlocking unprecedented possibilities for perception and autonomy tasks.
• Proposing a method to predict metric dense depth maps (Chapter 5) in never-seen or out-of-domain environments by fusing information from a traditional RGB camera and a sparse 64-pixel depth sensor.
• Demonstrating the autonomous capabilities of tiny robots on both aerial and ground platforms (Chapter 6): (a) an autonomous car smaller than a credit card (70mm), and (b) a bee drone with a length of 120mm, showcasing navigation abilities, depth perception in all four main directions, and effective avoidance of both static and dynamic obstacles.

In conclusion, the integration of the minimal perception framework in tiny mobile robots heralds a new era of possibilities, signaling a paradigm shift in unlocking their perception and autonomy potential. This thesis aims to serve as a transformative milestone that reshapes the landscape of mobile robot autonomy, ushering in a future where tiny robots operate synergistically in swarms, revolutionizing fields such as exploration, disaster response, and distributed sensing.

MINIMAL PERCEPTION: ENABLING AUTONOMY ON RESOURCE-CONSTRAINED ROBOTS

by

Chahat Deep Singh

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy
2023

Advisory Committee:
Dr. Yiannis Aloimonos, Chair/Advisor
Dr. Cornelia Fermüller
Dr. Guido de Croon
Dr. Christopher Metzler
Dr. Nitin J. Sanket
Dr. Inderjit Chopra, Dean’s Representative

© Copyright by Chahat Deep Singh 2023

To my family, friends, mentors and to the people who strive to do what they love

Acknowledgments

I am sincerely humbled and overwhelmed with gratitude as I pen down this section. The process of my Ph.D. has taken me on a thrilling ride of emotions, obstacles, accomplishments, and profound personal development. It has been a remarkable journey that could not have been completed without the unflinching support and contributions of several incredible individuals.

It all started during my Master’s in Robotics in 2016 at the University of Maryland. I still remember the first day I met Prof.
Yiannis Aloimonos – I was awestruck by his inspiring thoughts on perception and how both small and large animals/insects solve the same day-to-day tasks but in different ways. He asked me one fundamental question – ‘What is the minimum information required to solve a given task?’ Little did I know that this question would lay down the foundation of this thesis. I will be perpetually grateful to him for giving me this opportunity to work in the Perception and Robotics Group (PRG). I am also forever grateful to Dr. Cornelia Fermüller for introducing me to the field of neuromorphic perception and mentoring me. Thank you for treating me as your child and making PRG feel like a home away from home. The amount of freedom I have gotten from you two for creative thinking is unfathomable and has led me to this research. Thank you for allowing me to mentor students on my own and teach courses while pursuing my research – valuable skills that I would not have developed otherwise. Undoubtedly, these were the best years of my academic life!

I am eternally indebted to Nitin J. Sanket for his unwavering, unconditional support and mentoring, especially during my Master’s, which has touched the depths of my heart and forever transformed my life. This journey would have been impossible without you! Thank you for being a better-than-best friend throughout this journey. Thank you for teaching me the art of research and encouraging me to think outside the box. I will always cherish the road trips and the fun sessions in the lab and in the house! Thank you for all the photography lessons among countless other things. I have thoroughly enjoyed discussing math with you. Thank you for being there at every step of the way – more than I deserved.

PRG has been like an amazing family all these years. I am thankful to everyone in the group for limitless discussions and fun experiences. I would like to thank Levi Burner, with whom I had an insane amount of random discussions, especially about math.
Chethan Mysore Parameshwara for introducing, teaching, and involving me in neuromorphic perception. Kanishka Ganguly for helping me with all my papers, especially hardware and Linux-related stuff. To Loonix, our pet cockroach who stayed with us at 3 am during drone flight experiments. It was scary when you flew faster than the drone. Thank you Snehesh Shreshta for taking the lead with us in setting up the aerial robotics lab. A big thanks to Xiaomin Lin, Jingxi Chen, Botao He, Konstantinos Zampogiannis, Francisco Barranco, Michael Maynord, Chinmaya Devraj, Matthew Evanusa, Peter Sutor, and Behzad Sadrfaridpour. I am always indebted to all the Master’s students who have helped me in my research over the years – Prateek Arora, Ashwin V. Kuruttukulam, Abhinav Modi, Yashveer Jain, Kartik Madhira, Varun Asthana, Saumil Shah and Akash Gupta. And a huge thanks to the undergraduate students for teaching me valuable mentorship skills – Rishabh Singh, Riya Kumari and Rohan Uttamsingh.

A huge thanks to Guido de Croon and Davide Scaramuzza for teaching me valuable skills in research; Inderjit Chopra for inviting me to teach an aerial robotics course in your department; Luca Carlone and Ashok Agarwala for your insights into academia; Christopher Metzler for introducing me to computational imaging. And finally, President Darryll Pines for his continuous support in the RoboBeeHive project.

To mom – Harbir Kaur and dad – Rajinder Singh Kambo, thank you for your patience over the years. I know I cannot thank you enough for believing in me even when I did not. I have deeply embedded your qualities within me without even realizing it. Completing this journey, from an underachieving student to where I am now, feels like a real-life Cinderella story. Jasmeet Singh Kambo, I have no words for you. Firstly, thank you for forcing me to enjoy my undergraduate life with fun projects and not worrying about grades. Thank you for being a life mentor!
Sakshi Singh, I guess there’s another doctor in the family now. Thank you for taking all the pressure off my shoulders. To the entire Certified Siblings – Jasmeet Singh, Sakshi Singh, Aman Kaur, Sparsh Deep Singh, Ananta Malhotra, Hersh Deep Singh, Hershita Singh, Arsh Deep Kaur, Utsav Agarwal, and Tapasi Malhotra – thank you for the family trips and endless fun discussions every weekend. Special thanks to Arhaan Agarwal and Jaiveer ‘Fateh’ Singh. A big thank you to Hersh for his insightful views on mentoring, academia, and random fun math discussions.

A huge thank you to Sunaina Prabhu and Kedar Gaitonde for being there as emotional support throughout. To our family in College Park (Ghar ek Mandir) – Priyal Gala, Anoorag Sunkari, Vinayak Bendale, Ankita Tondwalkar, Prateek Arora, Harshvardhan Uppaluru, Shankar Ramesh, Pranay Kanagat, Devyani Gera, Kunal Mehta, Ishmeet Singh, Aprit Agarwal, Meghavi Prashnani, Nakul Garg, Pooja Guhan, Mrunal Dhaygude, and Aakriti Agrawal, thank you for your support and endless Bakchodi. A big thank you to my pet dogs – Bansi, Stella, and Lilly – for their unconditional love.

I would like to express my deepest gratitude to Pranshu Jhamb, Tapan Khattar, and Niharika Singh for their unwavering support, even during times when I couldn’t be there. I offer my heartfelt apologies for not being able to attend your respective weddings. Finally, I extend my heartfelt gratitude to Naitri Rajyaguru for standing by my side during the last leg of this remarkable journey. Your support has been genuinely priceless and irreplaceable!

I extend my heartfelt appreciation to Tom Ventsias and Maria Herd for their invaluable assistance in promoting my research throughout the years. I am deeply grateful to BBC Earth, Voice of America, Maryland Today, and IEEE Spectrum for featuring my research. A special thank you goes to Indian Creek Elementary School for giving me the opportunity to teach the findings of my research to third-grade students.
I am grateful to Ivan Pensiky and Kimberly Edwards for their support in the labs. Ania Picard, your unwavering support has meant the world to me. I would also like to express my gratitude to Janice Perrone, Tom Hurst, and Sharron Mcelroy for their patience and assistance with logistics. I am immensely thankful to Vikram Hrishikeshavan and Derrick Yeo from the aerospace department for their invaluable help with drone hardware. I also want to express my profound gratitude to the Department of Computer Science, UMIACS, and the Maryland Robotics Center. Lastly, I extend my thanks to the Wikimedia Foundation for providing a free source of education.

I wish to express my sincere appreciation for the generous financial support received from the Office of Naval Research (ONR), Brin Family Foundation, Northrop Grumman Corporation, NVIDIA, National Science Foundation (NSF), Intel, Dean’s Fellowship, Ann G. Wylie Fellowship, and the Future Faculty Fellowship.

Additionally, I am immensely grateful to the remarkable open-source platforms, including Linux, TensorFlow, ArduPilot, Raspberry Pi, NVIDIA Jetson, and PX4. Without their invaluable contributions, this thesis would not have been achievable.

Remembering everyone is an impossible feat, and from the depths of my heart, I humbly apologize to those I may have inadvertently missed. As I close this chapter and embark on new adventures, I carry with me the memories, lessons, and relationships forged along the way. May this acknowledgment serve as a token of my deepest appreciation and as a reminder of the indelible impact each and every one of you has had on my life. Thank you from the bottom of my heart.

Table of Contents

Preface
Acknowledgements
Table of Contents
List of Tables
List of Figures

Chapter 1: Introduction
1.1 Resource-constraint autonomy
1.2 Learning From Nature
1.3 Frugal AI
1.4 Active Vision
1.5 Principles of Minimal Perception
1.6 Predicting Minimal Quantities Using Uncertainty Principles
1.7 Minimal Prior Knowledge – Interactive Perception
1.8 Learning Structure via a Generative Simulator – Minimizing Annotations and Modeling
1.9 Minimal Sensing Modality
1.10 Minimal Data Acquisition
1.11 Applications of Minimal Perception

Chapter 2: Generalized Deep Uncertainty For Parsimonious Robots
2.1 Background
2.1.1 Estimating Uncertainties in Neural Networks
2.1.2 Applications of Deep Uncertainty in Robotics and Computer Vision
2.2 Method – Ajna
2.2.1 General Heteroscedastic Aleatoric Uncertainty Formulation
2.2.2 Informational Cues from Uncertainty Υ
2.2.3 Uncertainty of Optical Flow
2.2.4 Uncertainty of Monocular/Stereo Depth
2.2.5 Uncertainty of Surface Normals
2.2.6 Uncertainty of Semantic Segmentation
2.2.7 Uncertainty and its relationship to Confidence and Inlier ratio
2.3 Results
2.3.1 Quadrotor Platform
2.3.2 Perception Pipeline
2.3.3 Application 1: Dodging Dynamic Obstacles
2.3.4 Application 2: Navigating through unstructured environments
2.3.5 Application 3: Flying Through An Unknown Gap
2.3.6 Application 4: Segmentation of Object Pile
2.3.7 Network Speed on Different Hardware
2.4 Discussion
2.4.1 Limitations and Future Work
2.5 Conclusion

Chapter 3: Novel Object Segmentation With Minimum Knowledge
3.1 Background
3.1.1 Problem Formulation and Contributions
3.2 NudgeSeg Framework
3.2.1 Active perception in NudgeSeg
3.2.2 Interactive perception in NudgeSeg
3.2.3 Verification and Termination
3.2.4 Network Details
3.3 Experimental Results and Discussion
3.3.1 Description of robot platforms – Aerial Robot and UR10
3.3.2 Quantitative Evaluation
3.4 Discussions
3.5 Conclusions

Chapter 4: WorldGen: Generative Simulator for Minimal Perception
4.1 Background
4.1.1 Key Contributions
4.2 WorldGen Generative Simulator
4.2.1 Environment
4.2.2 Texture Mapping
4.2.3 Generative Models
4.2.4 Sensors
4.2.5 Lighting and Climate Conditions
4.2.6 Assets
4.2.7 Design Principles
4.3 Applications
4.3.1 Improvements in Optical Flow
4.3.2 Computational Photography
4.3.3 View Synthesis using Neural Radiance Fields
4.3.4 Active and Interactive Perception
4.3.5 Generating Real World Traffic
4.3.6 Human Pose Estimation
4.4 Conclusion

Chapter 5: Generalized Neural Metric Depth Estimation
5.1 Background
5.1.1 Monocular Depth Estimation
5.1.2 Estimating Depth using Multiple Frames
5.1.3 Using Sparse Depth Supervision
5.2 TinyDepth Model
5.2.1 Sensor Setup
5.2.2 Pre-Processing and Data Generation
5.2.3 Flow-Guided Depth Estimation
5.2.4 Evaluation Metrics
5.3 Experiments and Applications
5.3.1 Experimental Setup
5.3.2 Experimental Results
5.3.3 Evaluations
5.4 Discussions and Limitations
5.5 Conclusion

Chapter 6: Conclusions

Chapter 7: Future Directions
7.1 Passive Computing – Modifying Camera Apertures
7.1.1 Non Visible Spectral Sensing
7.2 Leveraging From Active Elements In Front Of The Sensor
7.3 Towards Robot Morphology Design

Bibliography

List of Tables

2.1 Relation to existing works (Chronological Order).
2.2 Quantitative Evaluation for various applications.
3.1 Description of Evaluation Sequences.
3.2 Evaluation with different segmentation methods for multiple sequences.
3.3 Evaluation of GrassMoss sequence with different amount of errors in A and M.
4.1 Comparison of the different simulation environments.
4.2 Optical Flow EPE Comparison of Training RAFT [1] On Different Datasets. Lower Is better.
5.1 Quantitative evaluation of different methods for metric depth estimation on out-of-domain datasets.

List of Figures

1.1 A qualitative comparison of living beings and robots in terms of perceptual capabilities with respect to their scaled body length. Note that cat and eagle sizes are not to scale.

1.2 Segmentation of the gap with similar texture on the foreground and background elements.

1.3 A comparison between a honey bee and a hummingbird.

1.4 Bee Peering

1.5 Flying Through Unknown Gaps. (a) Active strategy and (b) Visual Servoing

1.6 Minimal Perception Framework. Green parts are presented in this thesis and the blue parts are ongoing work presented in the future work.

1.7 Learning to estimate the structure of the unknown scene (a) by observing it from multiple views can reduce the neural network model size in the estimation of the structures like mountains with various textures (b).

1.8 Combining RGB with a tiny sparse sensor leads to high-resolution depth maps.

1.9 Illustration of a tiny robotic bee pollinating the flowers.

1.10 Weight comparison between a 330ml Pepsi can and the tiny autonomous car.

2.1 Unification of common robotics problems using the novel generalized heteroscedastic aleatoric uncertainty formulation for neural networks – Ajna.
This chapter experimentally demonstrates the efficacy of using uncertainty for the following robotics tasks: (A) Dodging dynamic obstacles, (B) Navigating through cluttered scenes, (C) Flying through unknown gaps, and (D) Segmentation of unknown object piles. This chapter shows that such an algorithmic approach would enable autonomy at scales not thought possible before, such as a drone the size of a hummingbird as shown in the center. All the images in this chapter are best viewed in color and on a computer screen at 200% zoom.

2.2 A sequence of images of quadrotor dodging objects. Dodging (A) Airplane, (B) Ball, (C) Cart and (D) Drone. Here, the object and quadrotor transparency show the progression over time. Red and green arrows indicate object and quadrotor directions, respectively. In each sub-figure, the outputs are shown in the following order (taking example as sub-figures of A): (A1) Image sequence of dodging, (A2) RGB image as seen by the quadrotor, (A3) D435i depth image, (A4) MiDaS-S output, (A5) MiDaS output, (A6) OccMask, (A7) Ajna. The color map used in all the depth images is plasma, where blue color represents far and yellow is close. The colormap for occlusion and uncertainty map is inverse plasma, where blue color represents lower uncertainty/occlusions and yellow represents higher uncertainty/occlusions. The yellow boxes show the zoomed-in view of the object. The colormap is consistent across all figures in this chapter.

2.3 A sequence of images of the quadrotor navigating through cluttered environments. (A) Indoor forest, (B) Boxes. Here, the object and quadrotor transparency show the progression of time. Green arrow indicates the quadrotor direction.
In each sub-figure, the outputs are shown in the following order (taking example as sub-figures of A): (A1) Image sequence of navigation, (A2) RGB image as seen by the quadrotor, (A3) D435i depth image, (A4) MiDaS-S output, (A5) MiDaS output, (A6) OccMask, (A7) Ajna.

2.4 Comparison of various methods to navigate through a simulated realistic forest scene. (A) The scene from the top view with paths overlaid (direction of travel is left to right). The legend is as follows: white – ground truth depth, green – MiDaS, dashed green – MiDaS-S, yellow – OccMask, black – MorphEyes, blue – Ajna (ours), (B) Sample RGB Image as seen by the quadrotor, (C) ground truth depth, (D) MiDaS-S output, (E) MiDaS output, (F) OccMask output, (G) MorphEyes output and (H) Ajna output.

2.5 Image of quadrotor flying through unknown gaps. (A) Egg, (B) Goku, (C) Infinity, (D) Rectangle. In each sub-figure, the outputs are shown in the following order (taking example as sub-figures of A): (A1) Image of flight through the gap, (A2) RGB image as seen by the quadrotor, (A3) D435i depth image, (A4) MiDaS-S output, (A5) MiDaS output, (A6) OccMask, (A7) Ajna. The black or yellow boxes on the images show the window location.

2.6 Outputs for segmentation experiments using various methods on different datasets: (A) GrassMoss, (B) Rocks, (C) YCB. In each sub-figure, the outputs are shown in the following order (taking example as sub-figures of A): (A1) RGB image as seen by the robot, (A2) D435i depth image, (A3) Mask R-CNN output, (A4) PointRend output, (A5) MiDaS-S output, (A6) MiDaS output, (A7) OccMask, (A8) Ajna output. Different colors in A3 and A4 show different objects with different labels being detected by the instance segmentation.

2.7 (A1) Input image pair as an anaglyph, (A2) Optical flow with colormap shown as inset, (A3) Ajna’s predicted uncertainty.
Despite low ṗ in the highlighted white region, the quadrotor needs to dodge this area. This is correctly predicted as high Υ. This is a common example where Υ provides additional information over ṗ. (B1 to B3) input image frames and ṗ under blinking LED without motion. (C1 to C3) input image frames and ṗ under blinking LED with motion. (D1 to D4): Image input, predicted Υ, image input with flow attack, predicted Υ under attack. (E), (F) and (G) are experiments of flying through gaps, flying through a forest and detecting dynamic obstacles. Left to right: ground truth depth (white is 4 m and black is 0 m), input images 1 and 2, predicted Υ.

2.8 Uncertainty Estimation from a moving camera looking at an unknown-shaped gap. Fig. 2.9 shows the environmental setup. (a) shows the direction of camera motion and the ground truth mask: white is the background and grey is the foreground. (b)-(j) shows the pair of images and uncertainty (from left to right). Note that (d) shows uncertainty in a challenging/illusion scene with a checkerboard pattern. (b)-(f) scenes with the same texture in foreground and background; (g)-(i) scenes with different textures and (j) no texture in both foreground and background.

2.9 GapFlyt experiment setup in 3D for uncertainty estimation in different texture environments. The top pyramid represents the camera, the yellow plane represents the foreground with a gap and the blue plane represents the background.

2.10 The plot represents how the detection accuracy of the gap varies with the texture resolution and contrast.

2.11 From left to right: Input Image, Uncertainty Estimation, Input Image with Black-Box patch, Uncertainty on the patched image. We show that uncertainty behavior is not affected by the optical flow attack patch in both cases of (a) obstacle avoidance and (b) navigation.
2.12 From left to right: Pair of consecutive input images and uncertainty. (a) shows uncertainty with both motion and illumination changes and (b) shows only illumination changes.

3.1 Top row: Robots (UR-10 and a quadrotor) used to physically interact (or nudge) with the objects to get motion cues for segmenting objects in a clutter. Bottom row (left to right): Initial Configuration of a cluttered scene and the first nudge being invoked, final nudge is invoked, final Segmentation of the cluttered scene. Green circles show the nudge operation. All the images in this chapter are best viewed in color at 200% zoom on a computer screen.

3.2 A conceptual graph of variation of complexity in perception, planning, and control with task philosophy. As a keen observation, the algorithmic complexity decreases with an increase in the manipulator motion.

3.3 First nudge policy using uncertainty in optical flow. Hotter colors represent higher uncertainty. The dashed line represents the convex hull of the cluttered scene and the arrow represents the direction of the first nudge at point N1.

3.4 (a) Active perception in NudgeSeg framework. The top row shows the movement of the camera. The bottom row shows the image inputs and uncertainty ρ. (b) and (c) Interactive perception in NudgeSeg framework. The top row shows the object nudging. The bottom row shows the input images (before and after the nudge), optical flow representation, and segmentation hypothesis where colors indicate cluster membership.

3.5 Top Row: Sample objects used in Table 3.1 as the evaluation sequences. Bottom Row: Sample cluttered scene for each sequence.
3.6 For each sub-figure: First row (From left to right): Sample monochrome input image, Uncertainty in optical flow ρ, Segmentation hypothesis after first nudge, Final segmentation masks. Second row: (From left to right): Outputs of 0-MMS [2], PointRend (color input), PointRend (mono input), Mask-RCNN (color input), Mask-RCNN (mono input). Note that in (a) and (d), the objects highlighted with a red boundary in the top left image of the respective sequences are ‘glued’ together and are considered to be adversarial samples. This image is viewed best in color at 400% zoom on a computer screen.

3.7 Qualitative Results with (a) no error; ϵA = 0 and ϵM = (b) ±5%, (c) ±10%, (d) ±20%; ϵM = 0 and ϵA = (e) ±10°, (f) ±20°, (g) ±30°.

4.1 Generative ability of WorldGen: (a) Comparison between Google Street View (left) and the same street in WorldGen (right), (b) Comparison of Google Maps satellite image vs. WorldGen top view, (c) Collection of 3D objects in motion, (d) Object fragmentation, (e) Annotation from left to right: depth, optical flow, surface normals, stereo anaglyph, image segmentation, event frame. All the images in this chapter are best viewed in color on a computer screen at 200% zoom.

4.2 An overview of WorldGen Framework: (a) Assets: Loads the assets such as maps, objects, materials etc. into WorldGen environment, (b) Structural Modification and Animation: Modifying the texture maps and applying physics and motion models on different objects in the scene, (c) Rendering: Generates rich ground truth data with the desired metadata (time, frame number, camera intrinsic and extrinsic properties).

4.3 Mapping textures to a round table. Top row: Rendered Output, Bottom row: Sample textures projected on a sphere.
(a) Barebone 3D model, (b)-(d) Different Textures applied on (a). Note: Variational mapping models change the structure of the 3D objects in different renders (notice the legs on the chair). Here, the Gaussian noise in (d) > (c) > (b). . . . . . . . . . . . . . . . . . . . . . . . . . 109 4.4 (a) OpenStreetView, (b) Depth Map, (c) 3D Model View Generated by WorldGen and (d) Final Rendered View . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 4.5 City environment in different weather and time of the day: (a) Day, (b) Night with rain, (c) Dawn and (d) Night without rain, (e) panoramic view of the city and (f) demonstrates the generative ability of WorldGen by changing the textures of the entire scene while keeping the same structure. . . . . . . . . . . . . . . . . . . . 110 4.6 High-resolution views generated by WorldGen from views at different altitudes with dynamic lighting, camera intrinsic, and extrinsic. . . . . . . . . . . . . . . . 118 xv 5.1 Depth Estimation for Tiny Autonomous Robots: (a) Bee drone of size 92cm in the largest dimension, (b) Lightweight sensor suite – RGB and sparse time-of-flight sensor used on Bee drone and Tiny car, (c) Tiny car of size 70cm largest dimension and (d) illustrates the pair of sensor inputs along with our metric dense depth prediction with the ground truth on the right. . . . . . . . . . . . . . . . . 126 5.2 Sensing principle of VL53L5CX sensor that results in super sparse 8 × 8 depth resolution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 5.3 System overview: An architecture of TinyDepth encoder-decoder model that utilizes an eight-channel input by combining RGB and L5 data from two views to predict metric dense depth. Refer Fig. 5.2 for the color scheme. . . . . . . . . 130 5.4 Real-world robot experiments: (a) Drone Navigation in an unstructured indoor forest scene, (b) Flying through unknown gaps, and (c) Tiny Car navigation through an obstacle course. 
The bottom left insets in each section of the image represent the input RGB image, ground truth data, and depth prediction (left to right). Note that the gradient yellow-to-red line shows the traversal of the robots in time, where red represents a temporally later stage.
5.5 Quantitative evaluation on out-of-domain dataset: (a)-(c) NYUv2 out-of-domain samples, (d) GapFlyt data: Flying through unstructured gaps and (e) Indoor forest data for drone navigation. Note that MiDaS/MiDaS-S use a single RGB image, DELTAR uses a single RGB + single L5 and TinyDepth uses two RGB and two L5 consecutive image pairs for depth predictions.
6.1 Onboard Autonomy on Tiny Robots: An Outcome of Minimal Perception
6.2 Flower Detection of a Downfacing Camera on the RoboBee
6.3 Autonomous onboard obstacle avoidance on a credit-card size robot
7.1 Various Apertures in the wild
7.2 Images of Flower: (a) RGB, (b) Simulated Bee Vision and (c) UV Space
7.3 (a) Bald eagle seamlessly changing its state from walking to squeezing to flying (left to right). (b) Our Morphing Quadrotor Prototype walking, squeezing and switching to flight mode (left to right).

Chapter 1: Introduction

Nature has spent 3.8 billion years on research and development in genetic evolution. Living beings have evolved over the years based on their daily operations, habitats, and surroundings. The evolution in nature has been purposive rather than generic. It is largely driven by the perceptual behaviors of living beings, which are shaped by their needs and environment. Over the years, these systems have learned to solve specific tasks very efficiently. These parsimonious systems, or living beings, carry a blueprint to develop the next generation of robots.
The key to parsimony is using a minimal amount of information, cues, or sensing modalities for an efficient completion of goals. In stark contrast, we have been developing robots and AI frameworks for merely 50 years, spending most of that time developing independent modules.

I draw inspiration from nature to formulate robot autonomy frameworks using only onboard sensing and computation at scales that were never thought possible before. The solution to robot autonomy lies at the intersection of AI, computer vision, computational imaging, and robotics, resulting in parsimonious robots. Parsimony refers to thriving in a resource-constrained environment. Even with drastically varied power and sensor constraints, bees and birds are often able to perform similar tasks. To enable autonomy at such scales, my research focuses on the robustness, unification, and generalizability of AI frameworks for robots. Robustness refers to making reliable predictions in the presence of noise in the input data. Unification refers to the formulation of a single mathematical framework that enables solving different robotics problems. Generalizability refers to the ability of an AI to work across various environmental conditions.

Inspired by nature, this thesis deals with a minimal perception framework for robots to develop the next generation of tiny, efficient, effective, and purposive robots with onboard autonomy. It introduces alternative methodologies to depth sensors that are essential for robot autonomy, especially navigation and obstacle avoidance. An outline of the thesis is given below:
• Chapter 2 proposes the general formulation of uncertainty principles in optical flow and their unconventional uses in robot navigation.
• Chapter 3 utilizes the uncertainty in optical flow along with the principles of interactive perception to segment never-seen objects in a cluttered environment by repeated nudging.
• Chapter 4 introduces a generative simulator to automatically create petabytes of high-quality annotated data of real-world cities (digital twins) for computer vision and robotics applications, without any need for manual effort in 3D modeling.
• Chapter 5 presents a method to predict metric dense depth maps by utilizing only a traditional RGB sensor with a 64-pixel super sparse depth sensor. We demonstrate that it enables a credit card-sized robot to navigate in unknown environments.
• Chapter 6 serves as the culmination of the minimal perception framework within the context of robot autonomy applications, encapsulating key findings and insights, and offering valuable reflections for future directions in this field.

1.1 Resource-constrained autonomy

Robot autonomy within a resource-constrained environment is a complex and challenging task that requires intricate strategies for optimal functionality. The basic premise revolves around designing robotic systems capable of performing tasks autonomously despite limitations in computational resources, energy supplies, or sensor capabilities. This is of particular importance in scenarios such as aerial robotics, deep-sea exploration, or extraterrestrial missions, where resource management becomes critical. Algorithms, such as those based on reinforcement learning or genetic algorithms, are often used to optimize resource allocation dynamically. These algorithms are designed to make trade-offs between the resources used and the quality of the task performed, thus maximizing efficiency. Sensor fusion techniques are also crucial in managing limited sensor capabilities, merging data from multiple sources to improve understanding and accuracy. Moreover, power management strategies, including sleep-wake cycles and variable processing speed, are implemented to minimize energy consumption. Software and hardware are co-designed for these systems to leverage the unique properties of specific hardware for better resource management.
As such, resource-constrained robot autonomy requires a multi-faceted approach integrating planning, learning, perception, and decision-making to facilitate successful operation.

From a perception point of view, resource-constrained robot autonomy presents intriguing challenges and necessitates innovative solutions. The perception system of an autonomous robot, including cameras, lidar, sonar, and other sensors, serves as its ‘eyes’, but in a resource-constrained environment the capabilities of these sensors may be limited. Therefore, advanced methods such as sensor fusion become crucial, which integrate data from different sensor types to form a more comprehensive understanding of the environment and reduce perceptual uncertainty. Deep learning and computer vision techniques are also used to extract relevant features from the sensory data and to recognize objects and patterns, yet they must be implemented efficiently due to limited computational resources. Furthermore, perception-based strategies must account for energy consumption, as continuous data acquisition and processing can be power-intensive. As such, low-power modes and selective perception, where only relevant data are actively processed, can be effective strategies. Hence, in resource-constrained robot autonomy, the perception system must balance the detail and depth of environmental understanding against computational demand and energy efficiency to ensure reliable operation.

However, these computationally expensive perception algorithms could be offloaded to a cloud computer or a companion computer over a network. So the first question that comes to mind is: ‘Why do we need onboard autonomy?’ Autonomous systems that rely on either wirelessly connected companion computers in the vicinity or cloud computing are susceptible to failure when deployed in the wild. Such systems tend to fail in GPS-denied environments and are prone to latency issues.
Onboard robot autonomy leads to secure systems that reduce the possibility of hacking and other security threats, and it makes the robots more robust. We already have some highly capable autonomous robots with onboard computing, both aerial and ground robots, but they are relatively large in size (more than 300 mm). So, ‘Why do we need small robots?’ These robots are safe and agile, and they can be deployed as swarms. These swarms are highly scalable and can be produced at much lower cost. Furthermore, these autonomous swarms enable the robots to inspect confined and dangerous areas under time constraints, such as thermonuclear power plants. It is a well-known fact that robot autonomy is substantially affected by the speed and size of the memory, the sensor type and quality, as well as the power required. These factors directly affect the robotic system’s size, area, and weight.

1.2 Learning From Nature

The field of biomimetics, or biologically inspired engineering, offers valuable insights into designing and creating robots based on the study of nature: animals, birds, insects, and even plant life. These natural entities exhibit unique abilities honed by millions of years of evolution, providing excellent models for robotic systems. For instance, the agility of a cheetah can inform the development of robots with enhanced locomotion capabilities. Similarly, the echolocation method used by bats can inspire the design of robotic sensing systems that operate effectively in low-light conditions. The collective behavior of ants and bees provides concepts for swarm robotics, where multiple robots work together to perform complex tasks more efficiently than a single robot could. Studying bird flight can contribute to advancements in aerial drone technology, providing insights into energy efficiency and aerodynamics. In the case of arachnids like spiders, their ability to create intricate web structures could influence the design of construction or 3D printing robots.
Thus, understanding nature and its mechanisms opens up a wide array of possibilities for designing innovative, efficient, and adaptable robotic systems.

From a perception perspective, biomimetics plays a vital role in developing advanced robotic systems inspired by nature. A central premise is how animals, birds, and insects perceive and interact with their environment, offering rich insights for creating efficient robotic perception systems. For instance, the complex visual processing system of a dragonfly, capable of detecting movement and depth with extraordinary precision, can inspire the development of sophisticated machine vision algorithms for robots. The sonar system of bats, which enables navigation and hunting in the dark, provides a model for developing robust echo-based sensing mechanisms, especially useful for robots operating in low-visibility environments. Birds, with their ability to adjust their flight based on wind conditions, offer insights for developing adaptive perception and control systems in aerial drones. Similarly, the tactile sensing of rodents’ whiskers could inform the design of touch-based perception for robots operating in dark or cluttered spaces. Swarm robotics often draws inspiration from ants and bees, which communicate and coordinate effectively to perceive their environment collectively and perform complex tasks. Perception research in biomimetics is therefore about deciphering and applying nature’s sophisticated sensory systems to enhance robotic perception and interaction capabilities.

In order to make these systems autonomous, let us understand what we can learn from nature to build the next generation of onboard tiny autonomous robots. What does it take to downsize the currently existing autonomous systems by a large factor while maintaining or even enhancing their autonomous capabilities? One way is to look at nature and living beings and observe their behaviors. Let us look at Fig. 1.1.
It plots the perceptual capabilities of living beings, which are largely driven by their perception systems, against their body lengths. It is important to note that perception capabilities generally increase with body size, i.e., with an increase in body length the perception system becomes more mature, with exceptions in a few living beings. These exceptions include jumping spiders, cuttlefish, and various species of frogs. Jumping spiders have a very sparse low-resolution vision system that enables them to process fast-moving objects or prey and quickly react to hunt them. Cuttlefish, different species of frogs, and other beings have instead developed their visual systems by modifying the aperture shapes. For example, a cuttlefish has a ‘W’-shaped aperture, while there are species of frogs that have vertical or horizontal openings. Also, note that the blue and green bubbles contain real-world tiny robots with onboard autonomy.

Figure 1.1: A qualitative comparison of living beings and robots in terms of perceptual capabilities with respect to their scaled body length. Note that cat and eagle sizes are not to scale. (Axes: Perception Capability versus Body Size in mm.)

Before this work, robots as small as 120 mm existed that were able to perform autonomous tasks such as navigation, static and dynamic obstacle avoidance, as well as flying through unknown-shaped gaps [3]. These robots are represented in a blue bubble. This work enables us to downsize and boost the autonomy performance even further, on robots that are as small as a credit card (less than 3 inches in length). These robots are presented with a green bubble in Fig. 1.1. In the following sections, we will learn how bio-inspired solutions will boost resource-constrained autonomy in mobile robots.

1.3 Frugal AI

Nature’s grandest architecture is built upon eons of evolution, an epitome of minimalism refined over time.
Our approach to building robots should mirror this ethos: evolving complexity through simplicity, function through form, and achieving great feats not through excess but through efficiency.

One of the most influential theories highlighting the elegance of simplicity in language structure is Noam Chomsky’s Minimalist Program [4]. Chomsky, a renowned linguist, cognitive scientist, and philosopher, proposed the Minimalist Program as a radical rethinking of syntactic theory. Its underlying principle is the idea that nature, including human language, operates in the simplest and most efficient way possible. The core of Chomsky’s Minimalist Program rests on the assumption that sentences are built from a basic lexical inventory through a series of binary merges. This process minimizes computational complexity by reusing the same operation for structuring sentences, enabling an infinite array of expressions from a finite set of elements. Another significant facet of the Minimalist Program is the notion of ‘economy.’ Chomsky theorized that linguistic expressions follow the principle of economy, ensuring the most resource-efficient outcomes. This mirrors the concept of minimalism where ‘less is more.’

Based on such theories, a new field has emerged: Frugal AI, which refers to the development and deployment of artificial intelligence (AI) systems that operate effectively with minimal resources, particularly in terms of computational power, energy consumption, and data requirements. The concept aims to create AI models that maintain high efficiency and robust performance despite these constraints. This is particularly relevant in scenarios where the availability of resources is limited, such as remote areas, edge devices, or low-income regions. Frugal AI can also contribute to sustainable AI practices, reducing the environmental impact associated with data centers and large-scale computing.
The development of frugal AI poses significant research challenges, necessitating the exploration of methods that allow model compression, efficient learning, and effective inference. Techniques such as quantization, pruning, and knowledge distillation are often used to compress deep learning models without substantial loss of accuracy. Transfer learning and few-shot learning strategies are used to enable models to learn effectively from small data sets. On-device AI and federated learning can be utilized for inference in resource-constrained environments while also preserving privacy. As such, the research and implementation of frugal AI embody a shift towards more accessible, sustainable, and decentralized AI practices, making advanced technologies more equitable and less resource-intensive.

While many forms of perception in the animal kingdom are astoundingly complex, they also often exhibit a remarkable sense of frugality. This ‘minimal perception’ underscores how living beings optimize their perceptual capacities in resource-constrained conditions. In the following section, we will engage in a detailed exploration of how diverse organisms employ distinct strategies to address identical challenges.

1.4 Active Vision

Figure 1.2: Segmentation of the gap with similar texture on the foreground and background elements.

Active vision [3,5–7] is a concept in computer vision and robotics that describes a system’s ability to control the focus of its attention rather than passively analyzing an entire scene. This is often achieved through physical motion or manipulation of the sensor or environment. It reflects the concept of “looking around” to gather information, mimicking the way humans and animals actively observe their surroundings. Active vision systems can also dynamically modify their field of view or adjust parameters such as focus and aperture to capture the most pertinent information.
This allows for more efficient data collection and processing since it enables the system to concentrate resources on the most relevant aspects of the visual scene.

Figure 1.2 illustrates a scene comprising a foreground and a background. In Figure 1.2a, the side view setup is presented, where the yellow region represents the foreground with a gap or hole, while the blue region represents the background. Notably, both foreground and background elements possess identical textures. Observing the scene from a single view, as depicted in Figure 1.2b, it is infeasible to determine the exact location of the gap. However, by combining Figures 1.2b-c, it becomes possible to calculate the optical flow, enabling the estimation of ordinal depth. Consequently, the gap in the image can be identified, as demonstrated in Figure 1.2d.

In real-world applications, active vision strategies could significantly enhance the capabilities of automated systems, such as autonomous vehicles, industrial robots, and surveillance systems. For instance, an autonomous vehicle equipped with an active vision system can adjust its sensors to focus on specific areas of interest like pedestrians, road signs, or other vehicles. Similarly, in surveillance applications, an active vision system can concentrate on unusual movements or behavior, making the overall system more effective and efficient. Thus, active vision forms a critical part of advanced AI systems, enabling them to interact more effectively with their environments.

Figure 1.3: A comparison between a honey bee and a hummingbird.

Birds and bees are two very different kinds of animals (see Fig. 1.3), with very different resources in terms of sensing quality, number of neurons, weight, power, etc. Yet they solve the same problem of flying through never-seen, unknown, unstructured gaps, but in a very different manner.
While birds seamlessly traverse through these gaps, bees tend to utilize active vision techniques due to their lower computation and sensing quality. Bees wander around the gaps and observe them from various positions and orientations in order to estimate the position and relative size (to the background) of the gap. It is also important to note that bees do not estimate the absolute size of a gap; they estimate its size with respect to their body length. This affects the amount of peering the bees have to actively perform in order to have an ordinal depth perception of the gap. Fig. 1.4 demonstrates the amount of movement required by the bee for different sizes of gaps. This image has been adapted from [8].

Figure 1.4: Bee Peering

GapFlyt [9] studied these behaviors and introduced TS2P, which obtains and stacks optical flow [10] from different views in order to estimate the ordinal depth. Fig. 2.5 shows the active strategy used in GapFlyt [9]. However, a significant limitation of this technique is the absence of mathematical assurances for successful gap traversal. The drone lacks the ability to estimate the metric depth of the gap or determine whether it has successfully traversed through it. This challenge has been addressed by the TinyDepth approach, which is discussed in detail in Section 1.9.

Figure 1.5: Flying Through Unknown Gaps. (a) Active strategy and (b) Visual Servoing

1.5 Principles of Minimal Perception

Minimal perception is defined by a living being’s ability to extract maximal information from its environment using minimal sensory data. It reflects an organism’s capacity to maintain functional performance while minimizing the cognitive and energy resources required for perception. This economization of resources is not only essential for survival but also a testament to nature’s ingenuity. Fig. 1.6 shows the classification of the minimal perception framework.
The notion of Minimal Perception can be conceptualized at different levels, from the cognitive level down to the sensor. This work deals with three different categories of minimalism in perception:
1. Minimalism in Information: What is the minimum information required by the robot to complete a given task? This can be carried forward by utilizing minimal quantities such as uncertainty of optical flow (Chapter 2) or by active perception, i.e., observing from multiple views to learn the structure of the scene rather than the texture (see Fig. 1.7).
2. Minimalism in Sensing Modality: What is the minimal choice of sensor required, under given size, area, weight, and power constraints, to solve the tasks at hand?
3. Minimalism in Data Acquisition: By adding an active or passive element in front of the sensor, can we extract the minimal information from the environment that is required to solve a set of tasks?

Figure 1.6: Minimal Perception Framework. Green parts are presented in this thesis and the blue parts are ongoing work that is presented as future work. (Diagram: Minimal Perception branches into Minimal Information Models, by predicting minimal quantities or by learning scene structures; Minimal Sensing Modalities; and Minimal Data Acquisition, by adding a passive element or an active element to the sensor.)

Figure 1.7: Learning to estimate the structure of the unknown scene (a) by observing it from multiple views can reduce the neural network model size in the estimation of the structures like mountains with various textures (b).

The underlying principles of minimal perception are rooted in the goal of extracting essential information while optimizing resource utilization in autonomous systems. The principles governing minimal perception can be defined as follows:
1. Selective Information Extraction: Minimal perception involves the selective extraction of pertinent information from the environment.
This principle focuses on identifying and prioritizing relevant data while disregarding non-essential or redundant information. By employing techniques such as feature selection, saliency analysis, or attention mechanisms, minimal perception aims to minimize the computational burden associated with processing large amounts of data.
2. Minimal Prior Knowledge: ‘What is the information required to solve a set of N tasks T_N in a given amount of time?’ This question addresses the possibility of solving a given task with absolutely minimal prior knowledge or information. Active and interactive perception [11–13] strategies play a key role in solving these tasks with minimal prior knowledge.
3. Adaptive Sensing: Minimal perception incorporates adaptive sensing strategies to optimize data collection based on the specific context and task requirements. Adaptive sensing techniques dynamically adjust sensor parameters, such as the sensing modality or the aperture shape, to effectively capture the necessary information. By adapting the sensing process to the prevailing conditions, minimal perception reduces resource consumption and maximizes the efficiency of data acquisition.
4. Attention Mechanism: Attention mechanisms play a crucial role in minimal perception by directing computational resources toward salient stimuli. Inspired by nature’s visual attention, these mechanisms allocate processing power and sensory focus to the most relevant parts of the input data. By selectively attending to significant features or regions, minimal perception optimizes computational efficiency and facilitates real-time responsiveness in resource-constrained systems. In Chapter 5, we exemplify this principle by showcasing the robot’s capability to estimate dense depth in all spatial directions. However, it selectively focuses its computational resources solely on the direction associated with the highest potential risk for effective obstacle avoidance and navigation during tasks.
5. Hardware-Software Optimization: Hardware-software co-design for mobile robots plays a vital role in maximizing the capabilities and performance of these autonomous systems. It involves the seamless integration and optimization of hardware components and software algorithms specifically tailored to the requirements of mobile robotic applications. The co-design process aims to strike a balance between computational efficiency, power consumption, real-time responsiveness, and physical constraints to enable mobile robots to navigate their environments, perform complex tasks, interact with humans, and adapt to changing conditions.

1.6 Predicting Minimal Quantities Using Uncertainty Principles

The widely successful classical theory of visual perception utilizes a single image and is tailor-made for static scenes. However, this theory is destined to fail on robots due to the dynamic nature of real-world environments, limiting robot autonomy. Combining motion or Temporal Information (TI) with the sensor characteristics enables us to unlock hidden potentials of perception that were not possible before. The use of TI empowers us to solve common robotics problems such as navigation and segmentation without any need for depth (or range) sensors. To accomplish such tasks, it is crucial for the robot to learn the geometry of the environment and the physics of the robot (how things move), rather than learning only the scene characteristics (what the environment looks like).

The presence of motion or temporal information (TI) introduces uncertainty in network predictions. Roboticists and computer vision scientists currently exploit this uncertainty to improve predictions. However, these uncertainties contain untapped, concealed information with significant potential for addressing various robotics problems. Aleatoric uncertainty specifically characterizes the inherent bias in data collection by sensors, such as cameras’ limited ability to perceive obstructed objects.
To demonstrate the potential of these additional cues, aleatoric uncertainty prediction was exclusively employed on TI, specifically optical flow, for diverse robotics applications. The primary advantage of relying solely on uncertainty, rather than traditional predictions, is a substantial reduction in computational cost, by a factor of 10 to 100. Works such as Ajna (Chapter 2) and NudgeSeg [14] (Chapter 3) showcase real-time robotics tasks utilizing uncertainty, including navigation, static and dynamic obstacle avoidance, traversing unknown gaps, and segmentation tasks. Significant uncertainties were observed in areas where optical flow presented challenges, such as occlusions and motion blur, and this information was effectively utilized to detect obstacles within the scene. Additionally, a novel class of sensors known as neuromorphic sensors or event cameras, capable of extracting TI at the sensor level itself, can be employed to further enhance robot efficiency.

As the philosopher Socrates famously said, “The only true wisdom is in knowing you know nothing.” It is crucial to recognize when an agent lacks certainty, just as it is important to evaluate the accuracy of predictions. In the field of robotics, we often rely blindly on neural network predictions for quantities such as depth, optical flow, and surface normals, without quantifying the reliability of these predictions. This reliance has prompted roboticists to acknowledge the need for incorporating uncertainties, leading to the adoption of the gold-standard approach in robotics: combining multiple measurements using Bayesian formulations and propagating distribution statistics. While uncertainties prove valuable for merging multiple measurements, we believe that their potential in robotics remains underexploited. This is mainly because uncertainties offer contextual information beyond their use in combining measurements.
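
The Bayesian merging of measurements mentioned above can be illustrated with a minimal sketch. It assumes independent scalar measurements with Gaussian noise; the function name `fuse_gaussian` and the example depth readings are hypothetical and are not taken from this thesis. Under these assumptions the fused estimate is the inverse-variance weighted mean, and the fused variance is never larger than the smallest input variance.

```python
import numpy as np


def fuse_gaussian(means, variances):
    """Fuse independent Gaussian measurements of the same scalar quantity.

    Assumption (illustrative, not from the thesis): the measurements are
    independent and Gaussian, so the fused estimate is the inverse-variance
    weighted mean and the precisions (1 / variance) simply add up.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    if means.shape != variances.shape:
        raise ValueError("means and variances must have the same shape")
    if np.any(variances <= 0):
        raise ValueError("variances must be strictly positive")
    weights = 1.0 / variances            # precision of each measurement
    fused_var = 1.0 / weights.sum()      # fused precision is the sum of precisions
    fused_mean = fused_var * (weights * means).sum()
    return fused_mean, fused_var


# Hypothetical example: two depth readings (in metres) of the same point,
# the first one being four times more certain than the second.
mean, var = fuse_gaussian([2.0, 2.4], [0.04, 0.16])
print(f"fused depth = {mean:.3f} m, variance = {var:.3f}")
```

Here the more confident reading dominates the fused value, and the variances serve only as weights. This is the conventional use of uncertainty that the text above contrasts with using uncertainty as a cue in its own right.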
Before delving into specific examples, let us introduce two common types of uncertainties: aleatoric, also known as observational or data uncertainty, and epistemic, which pertains to model uncertainty. Aleatoric uncertainty captures the inherent bias in data collection by a sensor, while epistemic uncertainty captures the inherent bias arising from the scenarios used to train the model. For example, aleatoric uncertainty would be high in transparent or dark regions when using RGB-D data, whereas a network trained indoors would exhibit high epistemic uncertainty when tested on outdoor data.

The contextual information provided by an epistemic uncertainty model is the need for additional data samples to improve accuracy for a particular sample. This information is valuable for determining if the agent is operating in an “out of domain” situation and whether online learning is necessary to achieve desirable performance. On the other hand, a more careful examination of the contextual information from aleatoric uncertainty reveals intriguing insights about the scene based on sensor characteristics. For instance, cameras cannot see through objects, so high aleatoric uncertainty at the depth boundaries of an object can serve as a powerful cue for various robotics tasks.

Estimating epistemic uncertainty requires variational inference and multiple runs of the neural network, making it impractical for real-time applications unless multiple neural network accelerators are employed. Conversely, aleatoric uncertainty is well-suited for real-time applications as it only requires a minor increase in the number of parameters and a single pass of the network to predict uncertainty. In this study, the focus is on estimating heteroscedastic aleatoric uncertainty, which refers to the observational uncertainty specific to the input data. We address the following questions in Chapter 2: How can we estimate heteroscedastic aleatoric uncertainty in a neural network?
What informational cues does it provide for various robotic tasks? This work proposes a novel generalized formulation for heteroscedastic aleatoric uncertainty in neural networks.

1.7 Minimal Prior Knowledge – Interactive Perception

Perception and interaction constitute a synergistic pair that exhibits complementary properties in the field of robotics. Although most robots are inherently capable of moving or manipulating their bodies to acquire additional information, this combined perception-interaction paradigm remains underused. Nature’s creations, even at the most rudimentary biological level, exploit this active-interactive synergy to effectively address complex problem domains [15]. Consequently, the foundational principles of robotics have encompassed formal frameworks that capture the elegance of action-interaction-perception loops. By augmenting the computational requirements of a specific task through exploration and interaction, valuable information can be obtained in a manner that simplifies the underlying perception challenges.

In recent years, deep neural networks have made significant strides in object segmentation, effectively delineating objects within color and depth images for specific classes [11, 13, 16]. However, the performance of these networks relies heavily on the availability of training data encompassing diverse classes and objects. This limitation restricts their ability to generalize well to previously unseen objects or zero-shot samples. Resource constraints further exacerbate the issue as robots can only be trained on a limited number of samples. Furthermore, object segmentation based solely on image frames depends on recognition and pattern-matching cues. To address these challenges more efficiently, our proposed approach leverages the active nature of robots and their capacity to interact with the environment.
By engaging in interactions with objects, robots can induce additional geometric constraints to facilitate the segmentation of zero-shot samples. Our framework introduces a process where the robot repeatedly nudges or pokes at objects, leveraging the resulting motion cues to generate and refine segmentation masks at each step. The fundamental concept underlying our approach is that each rigid body exhibits a unique motion signature (optical flow) during each nudge. We exploit this characteristic to provide an initial estimation for the robot to learn about new objects through interaction, analogous to how infants acquire knowledge about their surroundings. Since the method relies only on optical flow for segmenting these objects, it requires only a monocular monochrome camera. The method is evaluated on zero-shot samples (GrassMoss and Rocks) and the YCB dataset [17], and compared with state-of-the-art methods such as Mask-RCNN [18], PointRend [19], and 0-MMS [2]. It is observed that NudgeSeg outperforms previous state-of-the-art passive approaches on zero-shot samples. Chapter 3 will extensively explore the realm of interactive perception and provide an in-depth analysis of the NudgeSeg [14] framework.

1.8 Learning Structure via a Generative Simulator – Minimizing Annotations and Modeling

In the field of computer vision, a significant challenge involves transferring learned quantities such as ‘depth’ and ‘optical flow’ from one environment to another. Unlike living beings, current robotic systems lack the ability to infer depth in new surroundings and struggle with the cross-domain inference of acquired knowledge. For instance, a depth prediction model trained on outdoor data fails to accurately predict depth in indoor environments. In the research presented in Chapters 2, 3, and 5, this issue is addressed by training models in a simulated environment and evaluating their performance on real-world scenes.
Notably, this approach eliminates the need for fine-tuning the models with real-world data, in contrast to the prevailing literature. This strategy effectively mitigates the problem of overfitting in neural networks, which commonly arises when fine-tuning is performed using testing data. Moreover, it improves the scalability of the networks, enabling their deployment across a variety of robot sizes and scales.

Neural network predictions are often constrained by simulated data generated using imperfect camera models that lack photorealism and accurate camera physics. Conversely, the collection and annotation of real-world data can be prohibitively expensive. In recent research, the open-source framework WorldGen [20] was employed to autonomously generate diverse structured and unstructured 3D photorealistic scenes, such as city views, object collections, and object fragmentation. This data generation process relies on existing open-source object models, world maps, and semantic information. WorldGen, a perception-centric generative simulator, enables the modification of textures, object structures, motion, camera properties, and lens properties using photorealistic camera models, thereby reducing data bias in neural networks. Significant improvements in optical flow predictions were demonstrated using the WorldGen data. Furthermore, the capabilities of the WorldGen simulator were extended to include human motion data with various textures and structural characteristics. Remarkable advancements in human pose estimation on event data in the real world were achieved solely through training on the WorldGen simulation environment [21]. Additionally, simple and effective methods for learning to generate training data tailored to specific robotics applications were explored.

WorldGen serves as a high-level open-source Python library for generating an unlimited amount of synthetic data.
This library provides a platform for generating visual data to simulate various scenarios, including self-driving cars, autonomous drones, object segmentation, active vision, motion segmentation, tracking, and computational photography. Its key contribution lies in the API that enables the construction of generative environments and streamlines the process of generating synthetic data, thereby lowering the difficulty barrier for researchers and practitioners. WorldGen is built around Blender™, a free and open-source 3D creation suite, allowing the generation of synthetic data such as city maps, collections of moving objects, and object fragmentation. The design of WorldGen emphasizes scalability and speed. Chapter 4 provides a comprehensive discussion of the various components and details employed in building WorldGen.

1.9 Minimal Sensing Modality

Minimal sensing modality in robots refers to the implementation of a simplified sensory system that enables the robot to perceive and understand the environment using a limited set of sensors. The goal is to design a sensing framework that optimizes resource utilization while still providing sufficient information for the robot to perform its intended tasks effectively. This approach involves carefully selecting a subset of sensors that capture key aspects of the robot’s surroundings, such as proximity, orientation, or object detection, based on the specific requirements of the application. By minimizing the number and complexity of sensors, the robot can reduce cost, power consumption, and computational overhead while maintaining a practical level of situational awareness. The challenge lies in finding the right balance between sensor richness and system constraints to ensure reliable and efficient operation in real-world scenarios.

Accurate measurement of distances and depth cues is a fundamental requirement for autonomous robots to comprehend the geometric properties of a 3D scene.
When it comes to navigation, agents heavily rely on depth maps to effectively traverse intricate and dynamic environments. However, conventional depth estimation algorithms, whether monocular or stereo-based, often involve computationally expensive operations or necessitate high-quality sensors. Consequently, their implementation becomes challenging in resource-constrained settings. To address this, leveraging motion cues like parallax, as observed in pigeons, can expedite depth computation. Previous endeavors have aimed to mitigate computational burdens by lowering the resolution or capitalizing on known environmental cues. Nonetheless, these approaches fall short in terms of accuracy for obstacle avoidance or fail to generalize to unfamiliar scenes when applied in real-world scenarios.

Figure 1.8: Combining RGB with a tiny sparse sensor leads to high-resolution depth maps.

Chapter 5 introduces TinyDepth, a compact neural network architecture that leverages a sparse depth sensor with low resolution and low power consumption (64 depth values). The proposed method achieves dense depth estimation by combining this sensor with a high-resolution monocular RGB camera. To enhance the training process, information from multiple viewpoints is exploited, incorporating motion parallax cues. This approach enables the model to generalize effectively to previously unseen or zero-shot scenes without the need for fine-tuning or retraining. Remarkably, the network achieves a processing rate of 4.3 Hz on the Raspberry Pi CPU, providing accuracy comparable to larger networks while surpassing them significantly in terms of speed. Due to its lightweight computational demands and sensor requirements, this method is highly suitable for deployment on small-sized robots, including palm-sized and even hummingbird-sized aerial platforms. Fig.
1.8 shows a conventional depth camera on the left and, in contrast, the setup of an RGB camera and a sparse depth sensor used to estimate a high-resolution depth map. The work presented in this study is closely related in spirit to [22], with notable differences. The model proposed here is much smaller, up to 126× smaller than the Intel RealSense D435i, a de facto industry-standard depth sensor. Moreover, our approach incorporates cues from multiple views and demonstrates the ability to generalize to zero-shot or unseen environments following simulation-based training. The efficacy of this approach is validated through real-world robotics experiments, explicitly focusing on navigation in complex static scenes involving both ground and aerial robots.

1.10 Minimal Data Acquisition

Minimal data acquisition in robot perception refers to the efficient and judicious collection of sensory information required for effective perception tasks in robotics. By carefully selecting and prioritizing the relevant data, robots can optimize their computational resources and improve real-time decision-making capabilities. This approach focuses on acquiring only the essential information needed to perceive the environment accurately while disregarding redundant or irrelevant data. Techniques such as active perception and sensor fusion play a crucial role in minimizing data acquisition. Active perception involves intelligent control strategies that guide the robot’s sensors to actively gather information from specific areas of interest, maximizing the utility of acquired data. Sensor fusion combines data from multiple sensors to create a comprehensive and reliable representation of the environment.
By adopting minimal data acquisition strategies, robots can enhance their perceptual capabilities while reducing computational complexity and achieving efficient and streamlined operations in various domains, including navigation, object recognition, and scene understanding. Two ways to modify the data without computing are by adding either a passive element (such as custom apertures) or an active element (such as a rotating prism) in front of the camera/sensor plane in order to filter the data at the hardware level. This is discussed in Chapter 7.

Figure 1.9: Illustration of a tiny robotic bee pollinating flowers.

1.11 Applications of Minimal Perception

The future of tiny mobile robots and drones holds immense potential for various industrial, research, and consumer applications. With advancements in miniaturization and robotics technology, these miniature devices can perform intricate tasks in constrained environments with great precision and agility. The integration of artificial intelligence and minimal perception thinking is expected to play a crucial role in shaping the next generation of these robots. By employing minimal perception thinking, tiny mobile robots and drones can navigate complex environments, avoid obstacles, and carry out specific tasks with improved adaptability and autonomy. Furthermore, the integration of AI techniques, such as machine learning and computer vision, can enhance their perception capabilities, enabling them to interpret and respond to dynamic environments effectively. This convergence of minimal perception thinking and AI holds significant promise in unlocking the full potential of tiny mobile robots and drones across industries ranging from healthcare and agriculture to manufacturing and surveillance.

Figure 1.10: Weight comparison between a 330 ml Pepsi can and the tiny autonomous car.

The thesis findings have led to significant practical applications in two areas.
Firstly, the utilization of tiny robot bee drones equipped with comprehensive metric depth perception capabilities in all directions enables efficient pollination processes. Secondly, a credit-card-sized mobile robot weighing a mere 100 g demonstrates onboard autonomous navigation capabilities. Visual representations of the weight comparison and the RoboBee can be found in Fig. 1.10 and Fig. 1.9, respectively. Subsequent chapters of this thesis will delve into the modeling of the necessary frameworks and showcase real-world robotic applications, particularly focusing on navigating in unfamiliar environments.

Chapter 2: Generalized Deep Uncertainty For Parsimonious Robots

Robots are active agents operating in dynamic environments with imperfect sensors. Noisy sensor readings often lead to inaccurate predictions that cannot be trusted. As a solution, robotics researchers employ fusion methods involving multiple observations. Recently, neural networks have emerged as leaders in terms of accuracy for perception-oriented predictions for robotic decision-making, although their predictions frequently lack associated uncertainty measurements. This chapter will introduce a mathematical model for determining heteroscedastic aleatoric uncertainty in any random distribution, without requiring preliminary data knowledge. This model does not make any assumptions about prediction labels and is agnostic to network design. A specific category of networks proposed in this work, known as Ajna, adds minimal computation and requires only a slight alteration to the loss function during neural network training to capture uncertainty in predictions. This facilitates real-time operation even in robots under severe computational limitations, such as small drones.
It will also explore the informative cues found in the uncertainties of predicted values and their use in unifying common robotics problems. Specifically, this work proposes a strategy to avoid dynamic obstacles, traverse cluttered scenes, pass through unknown gaps, and segment an object pile. This is achieved not by computing depth but by utilizing the uncertainties of optical flow acquired from a monocular camera with onboard sensing and computation. This chapter evaluates and demonstrates the proposed Ajna network on the four aforementioned robotics and computer vision tasks, showing results comparable to methods that directly use depth.

Figure 2.1: Unification of common robotics problems using the novel generalized heteroscedastic aleatoric uncertainty formulation for neural networks, Ajna. This chapter experimentally demonstrates the efficacy of using uncertainty for the following robotics tasks: (A) Dodging dynamic obstacles, (B) Navigating through cluttered scenes, (C) Flying through unknown gaps, and (D) Segmentation of unknown object piles. This chapter shows that such an algorithmic approach would enable autonomy at scales not thought possible before, such as on the hummingbird-sized drone shown in the center. All the images in this chapter are best viewed in color and on a computer screen at 200% zoom.

2.1 Background

As an old saying goes: “If knowledge is power, knowing what you don’t know is wisdom”. Knowing when the agent is unsure is as important as the correctness of the prediction. Especially in the case of neural network predictions, estimating the uncertainty associated with these predictions aids in making better decisions rather than blindly relying on these predictions based on the assumption that they are correct.
Roboticists have remarked on this observation, and this led to the approach of combining multiple measurements using uncertainties, which has become the gold-standard approach in robotics. Fundamentally, these measurements are combined using Bayesian formulations and propagating the distribution statistics. Although uncertainties are very useful for combining multiple measurements, they remain underutilized in robotics, because uncertainties also provide contextual cues/information that is rarely exploited. Before this chapter provides examples of the previous statement, let us talk about two kinds of common uncertainties: Aleatoric or observational data uncertainty and Epistemic or model uncertainty. The aleatoric uncertainty models the inherent bias in the way a sensor collects data, and epistemic uncertainty models the inherent bias in the scenarios used to collect the training data. For example, the aleatoric uncertainty would be high for transparent or dark regions for RGB-D data, and the epistemic uncertainty of a network trained indoors would be high when tested on outdoor data. The contextual information that an epistemic uncertainty model provides is that the trained model requires more data to improve accuracy for the particular input sample. Such information is useful to know if one is operating ‘out of domain’ and if online learning is required for a desirable operation. On the contrary, contextual information from Aleatoric uncertainty, when studied more carefully, is more intriguing as it helps unravel information about the scene based on the sensor characteristics. For example, cameras cannot see through objects, hence one would expect high aleatoric uncertainty at the object’s depth boundaries, which can act as a powerful cue for performing various robotics tasks.
Furthermore, from a pragmatic viewpoint, estimating epistemic uncertainty requires variational inference and multiple runs of the neural network, rendering it ineffectual for real-time applications unless multiple neural network accelerators are used. On the contrary, aleatoric uncertainty is highly suited for real-time applications since it requires only a minor increase in the number of parameters and a single pass of the network to predict the uncertainty.

In this work, we focus on estimating the heteroscedastic aleatoric uncertainty, i.e., observational uncertainty with respect to the input data. In particular, this work proposes a generalized loss function formulation to estimate the heteroscedastic aleatoric uncertainty that can be used to model various probability distributions, and relates it to the works of the past decade. This demonstrates that previous works are special cases of our generalized formulation. Furthermore, this work presents a theoretical analysis of what information/cues this uncertainty formulation provides for various prediction modalities. Finally, this work applies the predicted uncertainty to perform various robotic tasks and demonstrates the unification such a methodology can bring to various classes of robotics problems. We refer to this class of networks as Ajna, which is named after the third eye of Lord Shiva from Hindu Mythology and refers to the eye of wisdom/consciousness/intuition, since our networks can “see” (predict) where they might not work well. The uncertainty of predicted values is denoted as Υ, since it is the Greek letter corresponding to u, standing for uncertainty, and resembles the shrug emoji. We formally define the problem statement and a list of our contributions next.

The following questions are addressed: How to estimate the heteroscedastic aleatoric uncertainty of a neural network? What informational cues does it provide for various robotic tasks?
Given an input x, label ŷ, and prediction ỹ, the heteroscedastic aleatoric uncertainty Υ is predicted by minimizing the proposed generalized loss function. The loss function reduces to classical statistical properties of variance for common distributions such as Gaussian or Laplacian. Additionally, the uncertainty of optical flow is learned using this loss function, which is then applied to four example robotic tasks: (a) Navigating through a scene with static obstacles, (b) Dodging unknown dynamic obstacles, (c) Detecting and Flying through unknown shaped gaps, and (d) Segmenting an unknown object pile (See Fig. 4.2). A summary of the contributions in this chapter is provided below:

• A generalized heteroscedastic aleatoric uncertainty formulation for neural networks
• Analysis of informational cues provided by heteroscedastic aleatoric uncertainty for robotic tasks
• Extensive real-world experiments demonstrating how such uncertainty can be used for various robotic tasks
• Discussion of how uncertainty can act as a unifying parsimonious framework for various robotics applications

Uncertainties and error statistics have been widely utilized in robotics for several decades. In the subsequent sections, the works concerning the estimation of uncertainties in neural networks and the applications of deep uncertainty in computer vision and robotics will be presented.

2.1.1 Estimating Uncertainties in Neural Networks

As previously mentioned, two types of uncertainties exist: (a) Aleatoric or observational uncertainty and (b) Epistemic or model uncertainty. Previous studies focused on estimating either Aleatoric or Epistemic uncertainty individually. Approaches such as [23–25] solely estimated Epistemic uncertainty by assuming a Gaussian prior distribution over weights. These models are known as Bayesian Neural Networks (BNN).
Although the mathematical formulations of BNNs are straightforward, their inference requires complex computations as marginal distributions across all neurons need to be computed. Additionally, [26] introduced dropout variational inference to make Epistemic uncertainty estimation tractable through stochastic Monte Carlo dropout. In contrast, [27] presented a method specifically for Aleatoric uncertainty estimation, which was later combined with Epistemic uncertainty in [28] to obtain the concept of “total uncertainty.” However, these methods were either computationally slow for robotic applications or lacked sufficient accuracy. To address this, [29] introduced Lightweight Probabilistic Deep Networks, which propagate uncertainties using assumed density filtering. An even faster variant was proposed, which directly predicts uncertainties only in the final layer. The approach was further extended in [30] to be agnostic to the network architecture and loss function. For a comprehensive overview of related works, please refer to [31], which provides a detailed summary of prior research.

2.1.2 Applications of Deep Uncertainty in Robotics and Computer Vision

In the field of robotics, the fusion of uncertainties and their statistical analysis has been widely employed to combine multiple measurements obtained from either a single sensor or multiple sensors. Recent research has witnessed a shift in focus towards incorporating uncertainty fusion techniques within neural networks, owing to the dominance of deep learning approaches in terms of accuracy metrics. For instance, TLIO [32] proposed a methodology that fuses multiple inertial measurements, leveraging predicted uncertainties in conjunction with an Extended Kalman Filter, to estimate odometry.
KFNet [33] introduced a neural network-based fusion approach that combines measurement and process models, drawing inspiration from the classical Kalman Filter formulation [34], which was specifically applied to the problem of camera relocalization. In the pursuit of robust performance, IVOA [35] incorporates predicted uncertainties into the navigation stack. Moreover, a general framework for uncertainty estimation, encompassing both aleatoric and epistemic uncertainties, was presented in [30]. This framework was successfully applied to three tasks: (a) End-to-End Steering Angle Prediction, (b) Object Future Motion Prediction, and (c) Closed-Loop Control of a Quadrotor.

In the field of computer vision, the utilization of deep uncertainty predictions to enhance performance has gained significant attention in recent years. Various applications, including object detection, optical flow estimation, visual odometry, monocular depth estimation, stereo depth/disparity, and surface normals estimation, have leveraged uncertainties as a regularizer to improve robustness. To address noisy samples in 3D object detection using LiDAR data, Feng et al. [36] proposed a method that learns to ignore such samples. Several works, such as Lee et al. [37], Kang et al. [38], Ilg et al. [39], Gast et al. [29], and Li et al. [40], employ either a Generative Adversarial model or an aleatoric uncertainty model to estimate uncertainties. These uncertainties are then used as regularizers to train optical flow models, leading to improved performance as observed empirically. In our work, we provide theoretical reasoning to explain this phenomenon, specifically attributing it to loss attenuation at optical flow discontinuities. Methods presented by Yuan et al. [41], Bae et al. [42], Roessle et al. [43], and Bhatt et al. [44] focus on estimating dense depth from stereo or monocular views. They aim to improve accuracy at the boundaries by incorporating an uncertainty metric.
Martin-Brualla et al. [45] utilize the same aleatoric uncertainty formulation to enhance volumetric color rendering in a NeRF (Neural Radiance Fields) model. Their approach involves rejecting dynamic objects based on uncertainty estimates. Eldesokey et al. [46] exploit uncertainty for self-supervised depth completion, achieving state-of-the-art performance. Similarly, Poggi et al. [47] utilize uncertainty obtained through image flipping to enhance monocular depth estimation results. Costante et al. [48] propose a method to estimate and incorporate total uncertainty into a deep visual odometry pipeline. Furthermore, Kawashima et al. [49] present an alternative approach for aleatoric uncertainty estimation, employing virtual residuals to address overfitting and demonstrating state-of-the-art results in age and monocular depth estimation. Alternatively, uncertainty has been indirectly learned as the probability of outlier/inlier in SFMLearner [50].

2.2 Method – Ajna

2.2.1 General Heteroscedastic Aleatoric Uncertainty Formulation

Consider an input x provided to a neural network N, which has weights W. Let ỹ represent the estimated output of the neural network N (Eq. 2.1), while the ground truth label is denoted by ŷ.

\[ \tilde{y} = N(x \mid W) \tag{2.1} \]

The objective is to learn weights W in order to optimize the following problem:

\[ \operatorname*{argmin}_{W,\,\Upsilon} \; f(\hat{y}, \tilde{y}) \quad \text{s.t.} \quad \Upsilon = k\left(f(\hat{y}, \tilde{y}),\, x\right) \tag{2.2} \]

In this context, the symbol f represents a distance metric between the predicted value ỹ and the ground truth value ŷ. The symbol Υ corresponds to a monotone function k that depends on the heteroscedastic aleatoric uncertainty of the underlying probability distribution p(x, ỹ|W). This uncertainty is positively correlated with the expected error or risk. The correlation between two random variables X and Y is formally expressed as the Pearson correlation ρ_{X,Y} in Equation 2.3, where the symbol E denotes the expectation operator.
\[ \rho_{X,Y} = \frac{E(XY) - E(X)\,E(Y)}{\sqrt{E(X^2) - E(X)^2}\;\sqrt{E(Y^2) - E(Y)^2}} \tag{2.3} \]

To reiterate, the function Υ is dependent on the input x and exhibits correlation with the estimated error between ỹ and ŷ. Its formal definition is presented below:

\[ \Upsilon(x \mid W) := h\left(E\left(d(\hat{y}, \tilde{y})\right)\right) \quad \text{s.t.} \quad \rho_{\Upsilon,\, f(\hat{y}, \tilde{y})} > 0 \tag{2.4} \]

In this context, let d and f denote distance metrics on a set X, such that f, d : X × X → [0, ∞), satisfying the properties of identity, symmetry, and the triangle inequality. It is important to note that Υ does not necessarily correspond to the variance of the distribution p(x, ỹ|W), but it must fulfill the condition ρ_{Υ,ν} > 0, where ν represents the variance (which may be challenging to compute for arbitrary distributions). Intuitively, Υ represents the anticipated error, risk, or lack of confidence in the predicted output. To obtain Υ, which will be referred to as “uncertainty” for easier comprehension, a self-supervised optimization of the following function needs to be performed.

\[ \operatorname*{argmin}_{\tilde{y},\,\Upsilon} \; h(\Upsilon)\, f(\hat{y}, \tilde{y}) + \lambda\, g(\Upsilon) \tag{2.5} \]

In the above optimization function, the function g represents a monotone function of the uncertainty, ensuring preservation of domain order and convexity. On the other hand, the function h is responsible for inverting the monotonicity of g, satisfying ρ_{h,g} < 0 (where h could also be a function of g). The rationale behind this formulation is to establish a two-way coupling between Υ and ỹ in order to prevent trivial solutions and appropriately scale the values. The term h(Υ) f(ŷ, ỹ) scales the value of f(ŷ, ỹ) based on the uncertainty per input dimension, simulating “outlier rejection” by weighing different noisy observations. It can be considered as a loss attenuator. However, this approach can lead to trivial solutions where Υ → ∞ (if unbounded) to minimize the loss. To mitigate this issue, a simple penalty term λg(Υ) is added to counteract the occurrence of exploding values for Υ. This formulation extends the work presented in [28].
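As a rough numerical illustration of Eq. 2.5, the sketch below assumes one particular choice of the free functions: h(Υ) = exp(−Υ), g(Υ) = Υ, and λ = 1, with the error term f held fixed. This choice is an assumption made for the example and is not claimed to be the one used for Ajna; the function names and the synthetic errors are likewise hypothetical. Under this assumption the per-sample minimizer has the closed form Υ* = log(f/λ), which the code recovers by gradient descent, and the resulting Υ satisfies the positive-correlation constraint of Eq. 2.4 when measured with the Pearson correlation of Eq. 2.3.

```python
import numpy as np


def generalized_loss(err, upsilon, lam=1.0):
    """Eq. 2.5 under the assumed choices h(U) = exp(-U) and g(U) = U."""
    return np.exp(-upsilon) * err + lam * upsilon


def pearson(a, b):
    """Pearson correlation written exactly as in Eq. 2.3."""
    num = np.mean(a * b) - np.mean(a) * np.mean(b)
    den = np.sqrt(np.mean(a**2) - np.mean(a) ** 2) * np.sqrt(
        np.mean(b**2) - np.mean(b) ** 2
    )
    return num / den


def fit_upsilon(err, lam=1.0, lr=0.1, steps=2000):
    """Gradient descent on Upsilon only; the error f is held fixed."""
    upsilon = np.zeros_like(err)
    for _ in range(steps):
        # d/dU [exp(-U) * f + lam * U] = -exp(-U) * f + lam
        grad = -np.exp(-upsilon) * err + lam
        upsilon -= lr * grad
    return upsilon


rng = np.random.default_rng(0)
err = rng.uniform(0.1, 5.0, size=256)  # synthetic per-pixel errors f(y_hat, y_tilde)
upsilon = fit_upsilon(err)             # learned uncertainty per pixel
closed_form = np.log(err)              # stationary point for lam = 1
rho = pearson(upsilon, err)            # should be positive, as Eq. 2.4 requires
```

The penalty term is what keeps Υ finite here: with λ = 0 the gradient is always negative and Υ grows without bound, which is the trivial solution described above.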
The selection of the functions g, h, and f is at the discretion of the user and can be tailored based on domain-specific knowledge. The relationship between f, g, and h has been established in previous studies, as shown in Table 2.1. It is important to note that our formulation is derived by summarizing a substantial amount of prior work from various domains that estimate uncertainty, risk, and/or learned robustness parameters. We identified a common trend in these previous works and developed a blueprint function that can be employed to design novel loss functions. In summary, we unify previous approaches into a single generalized function, and specific functional parameters from our formulation (Eq. 2.5) can be substituted to obtain the previously proposed works (Table 2.1).

Note that in the formulation presented, Υ can represent either uncertainty (similar to covariance) or lack of confidence (risk) of any arbitrary distribution. For complex distributions, Υ can be a complex function of the variance ν, resulting in qualitative rather than quantitative uncertainty. However, by carefully selecting functions f, g, h, and λ, Υ can be transformed into a quantitative function of ν with straightforward closed-form solutions. In such cases, it is also possible to work towards certifying the robustness of neural networks within a limited domain of training/operating data. Formally, a network is considered certifiably robust when the error in predicting perturbed inputs is bounded by a value τ. If x is the input and x′ is the perturbed input, the ℓ_p distance between their respective outputs should be constrained to τ, expressed as ‖N(x|W) − N(x′|W)‖_p ≤ τ. We hypothesize that this definition of robustness should also incorporate the network’s confidence as an additional constraint. In essence, the network would “inform” us when it is speculating a failure.
However, such a formulation requires comprehensive mathematical treatment and falls beyond the scope of this chapter. Moreover, we consider it a promising direction for future research endeavors. By employing Eq. 2.5 in a self-supervised manner, Υ is learned in conjunction with ỹ, with both being dense and exhibiting variations across pixel locations x. In practical terms,