ABSTRACT

Title of Dissertation: ACTIVE VISION BASED EMBODIED-AI DESIGN FOR NANO-UAV AUTONOMY
Nitin Jagannatha Sanket, Doctor of Philosophy, 2021
Dissertation Directed by: Professor Yiannis Aloimonos, Department of Computer Science

The human fascination with mimicking ultra-efficient flying beings like birds and bees has led to a rapid rise in aerial robots over the last decade. These aerial robots now command a market worth over 10 billion US dollars. The future of aerial robots, or Unmanned Aerial Vehicles (UAVs), commonly called drones, is very bright because of their utility in a myriad of applications. I envision drones delivering packages to our homes, finding survivors in collapsed buildings, pollinating flowers, inspecting bridges, performing surveillance of cities, competing in sports and even serving as pets. In particular, quadrotors have become the go-to platform for aerial robotics due to the simplicity of their mechanical design, their vertical takeoff and landing capability and their agility. Our eternal pursuit of improved drone safety and power efficiency has given rise to the research and development of smaller yet smarter drones. Furthermore, smaller drones are more agile and can be distributed across tasks as swarms. Embodied Artificial Intelligence (AI) has been a major fuel pushing this area further.

Classically, the approach to designing such nano-drones maintains a strict distinction between perception, planning and control and relies on a 3D map of the scene that is used to plan paths, which are then executed by a control algorithm. On the contrary, nature's never-ending quest to improve the efficiency of flying agents through genetic evolution led to birds developing amazing eyes and brains tailored for agile flight in complex environments as a software and hardware co-design solution. In contrast, smaller flying agents such as insects, which are at the other end of the size and computation spectrum, adopted an ingenious approach: utilize movement to gather more information. Early pioneers of robotics remarked on this observation and coined the concept of "Active Perception", which proposed that one can move in an exploratory way to gather more information and thereby compensate for a lack of computation and sensing. Such controlled movement imposes additional constraints on the data being perceived, which makes the perception problem simpler.

Inspired by this concept, in this thesis I present a novel approach to algorithmic design on nano aerial robots (flying robots the size of a hummingbird) based on active perception, by tightly coupling and combining perception, planning and control into sensorimotor loops using only on-board sensing and computation. This is done by re-imagining each aerial robot as a series of hierarchical sensorimotor loops where the outer ones require the inner ones, so that resources and computation can be efficiently re-used. Activeness is presented and utilized in four different forms to enable large-scale autonomy under Size, Weight, Area and Power (SWAP) constraints tighter than previously demonstrated. The four forms of activeness are: (1) moving the agent itself, (2) employing an active sensor, (3) moving a part of the agent's body, and (4) hallucinating active movements. Next, to make this work practically applicable, I show how hardware and software co-design can be performed to optimize which form of active perception is used.
Finally, I present the world's first prototype of a RoboBeeHive, which shows how to integrate multiple competences centered around active vision in all its glory. Following is a list of the contributions of this thesis:

• The world's first functional prototype of a RoboBeeHive that can artificially pollinate flowers.
• The first method that allows a quadrotor to fly through gaps of unknown shape, location and size using a single monocular camera with only on-board sensing and computation.
• The first method to dodge dynamic obstacles of unknown shape, size and location on a quadrotor using a monocular event camera. Our series of shallow neural networks is trained in simulation and transfers to the real world without any fine-tuning or re-training.
• The first method to detect unmarked drones by detecting their propellers. Our neural network is trained in simulation and transfers to the real world without any fine-tuning or re-training.
• A method to adaptively change the baseline of a stereo camera system for quadrotor navigation.
• The first method to introduce the usage of saliency to select features in a direct visual odometry pipeline.
• A comprehensive benchmark of software and hardware for embodied AI which would serve as a blueprint for researchers and practitioners alike.

ACTIVE VISION BASED EMBODIED-AI DESIGN FOR NANO-UAV AUTONOMY

by

Nitin Jagannatha Sanket

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2021

Advisory Committee:
Professor Yiannis Aloimonos, Chair/Advisor
Dr. Cornelia Fermüller
Professor Davide Scaramuzza
Professor Dinesh Manocha
Professor Inderjit Chopra, Dean's Representative

© Copyright by Nitin Jagannatha Sanket 2021

Dedication

To my family.

Acknowledgments

I owe my gratitude to all the people who have made this thesis possible and because of whom my doctoral study experience has been one that I will cherish forever.

I still remember the day I first met Prof. Yiannis Aloimonos in his office to discuss how perception leads to cognition. I am eternally grateful to him for inspiring me and giving me an opportunity to join his lab as a doctoral student. I am also eternally grateful to Dr. Cornelia Fermüller for introducing me to the field of neuromorphic cameras and perception. In particular, my experience at the Telluride Neuromorphic workshop was one of a kind that I will cherish forever. Both Yiannis and Cornelia have treated me like their own child, giving their time whenever I needed them, mentoring me and helping me hone my skills. One of the most remarkable things was the amount of freedom they gave me during my doctoral studies; they funded me to set up the aerial robotics lab without hesitating even once. They gave me the freedom to mentor students and teach courses, all while pursuing my research – these were the best years of my academic life.

I still remember the day Prof. Yiannis came into the room and asked, "What is the minimal amount of information you would need to solve a task?" He then added, "You could do this to fly through gaps of unknown shape and location!" and "Do all this with only a single camera!" I was still thinking in the paradigm of building a map when he said, "Use active vision: move to make your problem easier."
This was the monumental moment of my Ph.D., where I adopted the active philosophy to solve problems on extremely computation- and sensor-starved aerial robots – nano-quadrotors. Further down the line, Cornelia said, "Why don't we use event cameras to solve problems? They are energy efficient and bio-inspired." This led to tackling the problem of dynamic obstacle dodging using event cameras. Both Yiannis and Cornelia were a powerhouse of inspiration; they motivated and nurtured my wild ideas and also helped translate the math into code and high-quality publications to be made open-source. I am elated to have worked with both of you.

I am forever indebted to Chahat Deep Singh for his help with all my papers. He has been the best friend, housemate, lab-mate, road-trip partner and companion on astrophotography expeditions during the last five years of my life. I would not have been able to fix all those Linux issues without your help and would not have enjoyed discovering the math used in the papers without our intense discussions. I still remember getting yelled at by other professors for being loud during discussions. Chahat was one of the heroes who helped set up the aerial robotics lab, even sweeping the floors with me and getting yelled at for it. Chahat, I will forever try to maintain this friendship and collaboration.

I would also specifically like to thank Chethan Mysore Parameshwara for all his help with ROS and event cameras. Chethan, thanks for inspiring me to work on neuromorphic algorithms and cameras with you. It has been a pleasure.

I would also like to thank all my other labmates of the Perception and Robotics Group (PRG, which was formerly called the Computer Vision Lab or CVL) during my past five years of Ph.D.: Chahat Deep Singh, Chethan Mysore Parameshwara, Kanishka Ganguly, Huai-Jen Liang, Aleksandrs Ecins, Konstantinos Zampogiannis, Dr. Francisco Barranco, Levi Burner, Snehesh Shreshta, Michael Maynord, Chinmaya Devaraj, Matthew Evanusa, Xiaomin Lin, Peter Sutor, Siqin Li, Dr. Behzad Sadrfaridpour, Dr. Krishna Kidambi and Anton Mitrokhin. PRG has always been a second home to me.

I am also indebted to all the Master's students who helped me with experiments in my research: Prateek Arora, Ashwin V. Kuruttukulam, Abhinav Modi, Kartik Madhira, Miguel Maestre Trueba, Varun Asthana, Saumil Shah and Akash Guha. Without you guys, it would not have been possible to conduct these hard and amazing experiments.

I am thrilled to have worked with brilliant professors such as Prof. Davide Scaramuzza and Prof. Guido de Croon, and I am indebted to them for the opportunity to collaborate. You have made me a better writer, a better experimentalist and an overall better researcher.

I would also like to thank the unsung heroes who have contributed immensely to the completion of this thesis: the University of Maryland Institute for Advanced Computer Studies staff, in particular Janice Perrone for handling a multitude of order requests with patience and Tom Ventsias for handling our media outreach. Ivan Penskiy and Kimberly Edwards have been of immense help in setting up the lab and procuring materials for experiments, and I would like to wholeheartedly thank them. I also thank the patient housekeeping staff for keeping our lab clean during messy experiments.

My housemates have helped in balancing workload with unlimited doses of pure unadulterated fun and dad jokes.
I thank Chahat Deep Singh, Sunaina Prabhu, Vinayak Bendale, Ankita Tondwalkar, Priyal Gala, Anoorag Sunkari, Prateek Arora, Harshvardhan Uppaluru, Kedar Gaitonde, Pranay Kanagat, Shankar Ramesh, Kunal Mehta, Meghavi Prashnani and Arpit Agarwal. Stella (Anoorag's pet), you have helped alleviate stress during my Ph.D. with your purest soul and that cute smile.

I owe my deepest thanks to my family: my mother R. S. Shubha and my father K. Jagannatha. You have always stood by me and guided me through my career, and have pulled me through against impossible odds at times. I owe everything I am today to both of you. Words cannot express the gratitude I owe you. Thanks Amma and Daddy for believing in me and for being patient for the last five years while giving me absolute freedom to pursue my interests. You have never let the large physical distance between us bother me, always being there at any time of the day or night when I was tense.

I would like to thank my dearest uncle Chinmaya, aunt Suma and my cousins Nikhil and Aditya, who have been so welcoming and were my only family away from my mother and father. They never let me feel alone by constantly checking on my well-being.

Finally, I want to thank Dr. Vikram Hrishikeshavan and Dr. Derrick Yeo for all their help with aerospace matters, for lending tools when needed and for their help with hardware. I am grateful to Prof. Inderjit Chopra for giving me the opportunity to teach ENAE788M, which I will cherish forever for letting me meet some of the best graduate students and challenge them with hands-on projects.

I would like to acknowledge financial support from the Office of Naval Research (ONR), the Brin Family Foundation, Northrop Grumman Corporation, Samsung Electronics, the National Science Foundation (NSF), NVIDIA and Intel for all the research work discussed herein.

I would also like to thank the amazing open-source and open-hardware communities of Ubuntu, TensorFlow, ArduPilot, RaspberryPi and PX4, without whose work this thesis would not have been possible.

It is impossible to remember everyone, and I sincerely apologize from the bottom of my heart to those I have inadvertently left out. Lastly, thank you all and thank you God!

Table of Contents

Dedication ii
Acknowledgements iii
Table of Contents viii
List of Tables xii
List of Figures xiv

Chapter 1: Introduction 1
1.1 Active Agents . . . . . . . . . . . . . . . . . . . . 2
1.2 Active vs Passive Approaches to Perception . . . . . . . . . . . . . . . . 3
1.3 Forms of Activeness on a UAV . . . . . . . . . . . . . . . . 5
1.4 Hardware and Software Co-design . . . . . . . . . . . . . . . . 6
1.5 When is Active design useful? . . . . . . . . . . . . . . . . 7
1.6 Applications of an Active Nano-Quadrotor(s) . . . . . . . . . . . . . . . . 10
1.7 Research Objectives . . . . . . . . . . . . . . . . 13
1.8 State of the Art . . . . . . . . . . . . . . . . 15
1.9 Summary . . . . . . . . . . . . . . . . 24

Chapter 2: Contributions 26
2.1 Active Perception by moving the agent . . . . . . . . . . . . . . . . 27
2.1.1 Paper A: GapFlyt . . . . . . . . . . . . . . . . 27
2.2 Active sensing using event cameras . . . . . . . . . . . . . . . . 30
2.2.1 Paper B: EVDodgeNet . . . . . . . . . . . . . . . . 30
2.2.2 Paper C: EVPropNet . . . . . . . . . . . . . . . .
33 2.3 Active Perception by moving a part of the agent . . . . . . . . . . . . . . 36 2.3.1 Paper D: MorphEyes . . . . . . . . . . . . . . . . . . . . . . . . 36 2.4 Hallucinated Activeness . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 2.4.1 Paper E: SalientDSO . . . . . . . . . . . . . . . . . . . . . . . . 39 2.5 Hardware and Software Co-design . . . . . . . . . . . . . . . . . . . . . 42 2.5.1 Paper F: PRGFlow . . . . . . . . . . . . . . . . . . . . . . . . . 42 2.6 Unrelated Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Chapter 3: RoboBeeHive: Integration of Active Competences 47 3.1 Hierarchy of competences . . . . . . . . . . . . . . . . . . . . . . . . . . 47 viii 3.2 Motivation and Conceptualization of the RoboBeeHive . . . . . . . . . . 49 3.3 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.3.1 Bee Nano-quadrotors . . . . . . . . . . . . . . . . . . . . . . . . 53 3.3.2 BeeHive Quadrotor . . . . . . . . . . . . . . . . . . . . . . . . . 55 Chapter 4: Future Directions 63 4.1 Limitations of Proposed Approaches . . . . . . . . . . . . . . . . . . . . 64 4.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 Chapter A: GapFlyt 72 A.1 Supplementary Material . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 A.2 Introduction and Philosophy . . . . . . . . . . . . . . . . . . . . . . . . 74 A.2.1 Organization of the paper: . . . . . . . . . . . . . . . . . . . . . 77 A.2.2 Problem Formulation and Proposed Solutions . . . . . . . . . . . 77 A.3 Gap Detection using TS2P . . . . . . . . . . . . . . . . . . . . . . . . . 78 A.4 High Speed Gap Tracking For Visual Servoing Based Control . . . . . . . 82 A.4.1 Safe Point Computation and Tracking . . . . . . . . . . . . . . . 84 A.4.2 Control Policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 A.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 A.5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . 88 A.5.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 90 A.5.3 Robustness of TS2P against different textures . . . . . . . . . . . 98 A.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 Chapter B: EVDodgeNet 102 B.1 Supplementary Material . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 B.2 Introduction and Philosophy . . . . . . . . . . . . . . . . . . . . . . . . 104 B.2.1 Problem Formulation and Contributions . . . . . . . . . . . . . . 106 B.2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 B.2.3 Organization of the paper . . . . . . . . . . . . . . . . . . . . . . 109 B.3 Deep Learning Based Navigation Stack For Dodging Dynamic Obstacles . 109 B.3.1 Definitions Of Coordinate Frames Used . . . . . . . . . . . . . . 110 B.3.2 Event Frame E . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 B.3.3 EVDeBlurNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 B.3.4 EVHomographyNet . . . . . . . . . . . . . . . . . . . . . . . . . 116 B.3.5 EVSegFlowNet . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 B.3.6 Network Details . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 B.3.7 Loss Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 B.3.8 Compression Achieved by using EVSegFlowNet . . . . . . . . . 124 B.4 Multi Moving Object Event Dataset . . . . . . . . . . . . . . . . . . . . 125 B.4.1 3D room and moving objects . . . . . . . . . . . . 
. . . . . . . . 127 B.4.2 Dataset for EVDeblurNet . . . . . . . . . . . . . . . . . . . . . . 128 B.4.3 Dataset for EVSegNet, EVFlowNet and EVSegFlowNet . . . . . 128 B.4.4 Dataset for EVHomographyNet . . . . . . . . . . . . . . . . . . 130 B.5 Control Policy for Dodging Dynamic Obstacles . . . . . . . . . . . . . . 130 ix B.5.1 Sphere with known radius r . . . . . . . . . . . . . . . . . . . . 131 B.5.2 Unknown shaped objects with bound on size . . . . . . . . . . . . 133 B.5.3 Unknown objects with no prior knowledge . . . . . . . . . . . . . 133 B.5.4 Pursuit: A reversal of evasion? . . . . . . . . . . . . . . . . . . . 134 B.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 B.6.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . 137 B.6.2 Experimental Results and Discussion . . . . . . . . . . . . . . . 139 B.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 Chapter C: EVPropNet 145 C.1 Supplementary Material . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 C.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 C.2.1 Problem Formulation and Contributions . . . . . . . . . . . . . . 148 C.2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 C.2.3 Organization of the paper . . . . . . . . . . . . . . . . . . . . . . 151 C.3 Geometric Modelling of a Propeller . . . . . . . . . . . . . . . . . . . . 151 C.4 EVPropNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 C.4.1 Event Generation . . . . . . . . . . . . . . . . . . . . . . . . . . 156 C.4.2 Data Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 C.4.3 Network Architecture and Loss Function . . . . . . . . . . . . . . 160 C.5 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 C.5.1 Following . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 C.5.2 Landing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162 C.5.3 Quadrotor Location from Detected Propellers and Filtering . . . . 163 C.6 Experimental Results and Discussion . . . . . . . . . . . . . . . . . . . . 163 C.6.1 Quadrotor Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 C.6.2 Experimental Results And Observations . . . . . . . . . . . . . . 165 C.6.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171 C.6.4 Implementation Considerations . . . . . . . . . . . . . . . . . . . 175 C.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 Chapter D: MorphEyes 177 D.1 Supplementary Material . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 D.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 D.2.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181 D.2.2 Organization of the paper . . . . . . . . . . . . . . . . . . . . . . 181 D.3 Theoretical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181 D.4 Hardware and Software Design . . . . . . . . . . . . . . . . . . . . . . . 189 D.4.1 Hardware Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 189 D.4.2 Software Stack . . . . . . . . . . . . . . . . . . . . . . . . . . . 190 D.5 Experiments: Applications . . . . . . . . . . . . . . . . . . . . . . . . . 191 D.5.1 Quadrotor Platform . . . . . . . . . . . . . . . . . . . . . . . . . 191 D.5.2 Simulation Environment . . . . . . . . . . . . . . . . . . . . . . 191 D.5.3 Forest Navigation . . 
. . . . . . . . . . . . . . . . . . . . . . . . 192 D.5.4 Flying through a static/dynamic unknown shaped gap . . . . . . . 194 x D.5.5 Accurate IMO Detection . . . . . . . . . . . . . . . . . . . . . . 195 D.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196 Chapter E: SalientDSO 197 E.1 Supplementary Material . . . . . . . . . . . . . . . . . . . . . . . . . . . 200 E.2 Introduction and Philosophy . . . . . . . . . . . . . . . . . . . . . . . . 200 E.3 SalientDSO Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 204 E.4 Point selection based on visual saliency and scene parsing . . . . . . . . . 206 E.4.1 Visual Saliency Prediction . . . . . . . . . . . . . . . . . . . . . 206 E.4.2 Filtering saliency using semantic information . . . . . . . . . . . 207 E.4.3 Features/Points selection . . . . . . . . . . . . . . . . . . . . . . 209 E.5 Experimental Results and Analysis . . . . . . . . . . . . . . . . . . . . . 211 E.5.1 Quantitative Evaluation . . . . . . . . . . . . . . . . . . . . . . . 213 E.5.2 Qualitative Evaluation . . . . . . . . . . . . . . . . . . . . . . . 216 E.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218 Chapter F: PRGFlow 220 F.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222 F.1.1 Problem Definition and Contributions . . . . . . . . . . . . . . . 224 F.1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225 F.2 Pseudo-Similarity Estimation Using PRGFlow . . . . . . . . . . . . . . . 228 F.3 Table-top Experiments and Evaluation . . . . . . . . . . . . . . . . . . . 230 F.3.1 Data Setup, Training and Testing Details . . . . . . . . . . . . . . 230 F.3.2 Network Architectures . . . . . . . . . . . . . . . . . . . . . . . 231 F.3.3 Loss Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 232 F.3.4 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 234 F.3.5 Hardware Platforms . . . . . . . . . . . . . . . . . . . . . . . . . 236 F.4 Flight Experiments and Evaluation . . . . . . . . . . . . . . . . . . . . . 238 F.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . 239 F.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240 F.5.1 Algorithmic Design . . . . . . . . . . . . . . . . . . . . . . . . . 240 F.5.2 Hardware Aware Design . . . . . . . . . . . . . . . . . . . . . . 253 F.5.3 Trajectory Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 258 F.6 Summary And Directions For Future Work . . . . . . . . . . . . . . . . . 259 F.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260 Bibliography 262 xi List of Tables 1.1 Minimalist design of autonomous UAV behaviours. . . . . . . . . . . . . 5 A.1 Minimalist design of autonomous quadrotor (drone) behaviours. . . . . . 75 A.2 Comparison of different methods used for gap detection. . . . . . . . . . 95 A.3 Comparison of different methods used for tracking. . . . . . . . . . . . . 98 A.4 Comparison of our approach with different setups . . . . . . . . . . . . . 100 B.1 Quantitative evaluation of different methods for Homography estimation. . 140 B.2 Quantitative evaluation of different methods for Segmentation of IMO. . . 142 C.1 Parameters used in geometric model of the propeller. . . . . . . . . . . . 154 C.2 Detection Rate (%) ↑ of EVPropNet for variation in parameters. . . . . . . 166 C.3 Detection Rate (%) ↑ of AprilTags 3 for amount of tag blocked. . . . . . . 
167 C.4 Performance Metrics On Different Compute Modules. . . . . . . . . . . . 168 C.5 Different Propeller Configurations Used for Qualitative Evaluation in Fig. C.7. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170 C.6 Aratio FOR SOME COMMON COMMERCIAL DRONES. . . . . . . . . . . . 172 D.1 Relationship between ex and ey for errors in different parameters. . . . . . 184 E.1 Active vs Passive approach for computer vision tasks. . . . . . . . . . . . 203 E.2 Parameter settings for different datasets. . . . . . . . . . . . . . . . . . . 213 E.3 RMSEate on ICL-NIUM dataset in m. . . . . . . . . . . . . . . . . . . . . 215 E.4 ealign on TUM monoVO dataset in m. . . . . . . . . . . . . . . . . . . . . 215 E.5 Comparison of success rate between DSO and SalientDSO on CVL-UMD dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216 F.1 Different Computers Used on Aerial Robots. . . . . . . . . . . . . . . . . 235 F.2 Quantitative evaluation of different warping combination for Pseudo-similarity estimation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240 F.3 Quantitative evaluation of different network architectures for Pseudo-similarity estimation using T×2, S×2 warping block for large model (≤8.3 MB). . . 240 F.4 Quantitative evaluation of different network architectures for Pseudo-similarity estimation using T×2, S×2 warping block for small model (≤0.83 MB). . 240 F.5 Quantitative evaluation of different network inputs for Pseudo-similarity estimation using T×2, S×2 warping block for large model (≤8.3 MB). . . 241 F.6 Quantitative evaluation of different loss functions for Pseudo-similarity estimation using PS×1 warping block for large model (≤8.3 MB). . . . . 241 xii F.7 Quantitative evaluation of different compression methods for Pseudo-similarity estimation using PS×1 warping. . . . . . . . . . . . . . . . . . . . . . . 241 F.8 Comparison of PRGFlow with different classical methods. . . . . . . . . 242 F.9 Different-sized Quadrotor Configuration with respective computers. . . . 253 F.10 Trajectory evaluation for flight experiemtnts of PRGFlow. . . . . . . . . . 254 xiii List of Figures 1.1 Sensing, Control and Computation variation with respect to the amount of activeness used by the agent. . . . . . . . . . . . . . . . . . . . . . . 4 1.2 Algorithmic design philosophies for different sized robots along with their capabilities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.3 Amount of autonomous capabilities for different sized robots and living beings with respect to size. The red box shows where this thesis aims to take the autonomy of a nano-quadrotor. . . . . . . . . . . . . . . . . . . 9 1.4 Comparison of our proposed “bee” nano-quadrotor with birds and bees. (a) Sparrowhawk, (b) White Necked Jacobin Hummingbird, (c) Giant Honeybee, and (d) Our proposed “bee” nano-quadrotor. The number next to the brain and scale icon shows the number of neurons and the weight respectively. Note that the images are to relative size. . . . . . . . . . . . 9 1.5 A stack of images showing an owlet bobbing its head (see red highlight) to make perception easier. This is an example of an agent moving a part of it’s body to exhibit activeness. For original video see https://vimeo.com/152347964. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 1.6 Left to right: Color image of the scene, corresponding saliency map out- put by SalGAN [1]. 
The hotness of the saliency color corresponds to the value being higher. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 1.7 Left to right: Bidens ferulifolia flower as seen by Human vision, reflected UV, butterfly vision and bee vision. Note that altough images shown here for simulated butterfly and bee vision are at the same resolution as those seen human eyes, the real resolution of the eyes on these flying agents is much smaller. Photo credits and ©: Dr. Klaus Schmitt. . . . . . . . . . . 22 2.1 Different parts of the GapFlyt framework: (a) Detection of the unknown gap using active vision and TS2P algorithm (cyan highlight shows the path followed for obtaining multiple images for detection), (b) Sequence of quadrotor passing through the unknown gap using visual servoing based control. The blue and green highlights represent the tracked foreground and background regions respectively. . . . . . . . . . . . . . . . . . . . 29 2.2 A real quadrotor running EVDodgeNet to dodge two obstacles thrown at it simultaneously. From left to right in bottom row: (a) Raw event frame as seen from the front event camera. (b) Segmentation output. (c) Segmentation flow output which includes both segmentation and optical flow. (d) Simulation environment where EVDodgeNet was trained. (e) Segmentation ground truth. (f) Simulated front facing event frame. . . . . 32 xiv https://vimeo.com/152347964 2.3 Applications presented in this work using the proposed propeller detec- tion method for finding multi-rotors. (a) Tracking and following an un- marked quadrotor, (b) Landing/Docking on a flying quadrotor. Red and green arrows indicates the movement of the larger and smaller quadrotors respectively. Time progression is shown as quadrotor opacity. The insets show the event frames E from the smaller quadrotor used for detecting the propellers of the bigger quadrotor using the proposed EVPropNet. Red and blue color in the event frames indicate positive and negative events respectively. Green color indicates the network prediction. . . . . . . . . 35 2.4 Three applications of a variable baseline stereo system were explored in this work. (a) Flying through a forest, (b) Flying through an unknown shape and location dynamic gap, (c) Detecting an Independently Moving Object. In all the cases, the baseline of the stereo system is changing and is colored coded as jet (blue to red indicates 100 mm to 300 mm baseline). The opacity of the quadrotor/object shows positive progression of time. (d) Variation of baseline from 100 mm to 300 mm. Notice that the stereo system is bigger than the quadrotor at the largest baseline. . . . 37 2.5 Sample point-cloud output of SalientDSO which does not have loop clo- sure or global bundle adjustment. Each inset (color-coded to suite the re- spective location on the map) in clockwise direction from top left show the corresponding image, saliency, scene parsing outputs and active features. Observe that features from non-informative regions are almost removed approaching object-centric odometry. . . . . . . . . . . . . . . . . . . . 40 2.6 Size comparison of various components used on quadrotors. (a) Snap- dragon Flight, (b) PixFalcon, (c) 120 mm quadrotor platform with NanoPi Neo Core 2, (d) MYNT EYE stereo camera, (e) Google Coral USB ac- celerator, (f) Sipeed Maix Bit, (g) PX4Flow, (h) 210 mm quadrotor plat- form with Coral Dev board, (i) 360 mm quadrotor platform with Intel® Up board, (j) 500 mm quadrotor platform with NVIDIA® JetsonTM TX2. 
Note that all components shown are to relative scale. . . . . . . . . . . . 43 3.1 Proposed hierarchy of competences with the exterior ones needing the ones inside them, blue color indicates competences related to individual agents, green color indicates competences related to multiple agents and yellow bubbles show multiple agents. . . . . . . . . . . . . . . . . . . . 50 3.2 Illustration of the RoboBeeHive. . . . . . . . . . . . . . . . . . . . . . . 51 3.3 Different parts of the Bee nano-quadrotor. 1. Front facing RGB camera, 2. T-Motor F1404 Motors, 3. Tattu R-Line 3S 750mAh LiPo battery, 4. Raspberry Pi CM4 mated to a StereoPi v2 motherboard, 5. Gutted Google Coral USB Accelerator with custom heatsink, 6. Flywoo Goku F745 AIO Flight Controller and 4 in 1 ESC, 7. Downfacing RGB camera, 8. Optical flow sensor, 9. TFMini Lidar, 10. Gemfan 2540×3 propeller. Bottom left of the image shows a standard US quarter for scale reference. . . . . . . 56 xv 3.4 Different iterations of the Bee nano-quadrotor (number at the bottom left of each quadrotor shows the version number). Bottom left of the image shows a standard US quarter for scale reference. . . . . . . . . . . . . . 57 3.5 Left to right: Camera board, Interface board and RaspberryPi CM4. 1. Cameras, 2. Coral USB Accelerator attached to the PCIe port on the CM4, and 3. CM4 board sandwitched between the camera and interface board. Design based on https://grabcad.com/library/tpu-cam-with-cm4-1. Bottom left of the image shows a standard US quarter for scale reference. 58 3.6 Left to right: Raw RGB image (green overlay shows the detected flower and red overlay shows removed false positives based on geometry), HSV representation of the RGB image , yellow color thresholded using the Gaussian Mixture Model. . . . . . . . . . . . . . . . . . . . . . . . . . 58 3.7 Top Row: Different views of the GoGoBird ornithocopter used as a dy- namic obstacle. Bottom row (left to right): Consecutive RGB images taken from the front camera on the bee nano-quadrotor , optical flow color map and detected dynamic obstacle (Inset shows the color representation used, the hue of the color represents the direction and the saturation rep- resents the magnitude). Notice how the optical flow colors of the dynamic obstacle and the background regions are different and are easily clustered. 59 3.8 CAD Model of the BeeHive drone. 1. Propeller, 2. Flower petal flaps, 3. Flower petal servo motors, 4. Hook for perching. . . . . . . . . . . . . . 60 3.9 Different iterations of the Hive drone (left to right show progression of versions). Note that this image only shows the drone without the perching and bee holding flower mechanism which is under construction and was delayed due to COVID-19 causing machine shop closures and shipping delays. Bottom left of the image shows a standard US quarter for scale reference. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 3.10 Left to right: RGB Image of the pole used for perching, Depth Image of the pole (brighter is farther), Mask of the segmented pole. . . . . . . . . 61 A.1 Different parts of the pipeline: (a) Detection of the unknown gap using active vision and TS2P algorithm (cyan highlight shows the path followed for obtaining multiple images for detection), (b) Sequence of quadrotor passing through the unknown gap using visual servoing based control. The blue and green highlights represent the tracked foreground and back- ground regions respectively. Best viewed in color. . . . . . . . . . . . . . 
73 A.2 Components of the environment. On-set Image: Quadrotor view of the scene. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 A.3 Representation of co-ordinate frames. . . . . . . . . . . . . . . . . . . . 79 A.4 Label sets used in tracking. (blue: foreground region, green: background region, orange: uncertainty region, black line: contour, brighter part of frame: active region and darker part of frame: inactive region.) . . . . . . 82 xvi https://grabcad.com/library/tpu-cam-with-cm4-1 A.5 Tracking of F and B across frames. (a) shows tracking when Ci F > kF and Ci B > kB. (b) When Ci B ≤ kB, the tracking for B will be reset. (c) When Ci F ≤ kF , the tracking for F will be reset. (d) shows tracking only with B, when F = ∅. (blue: F , green: B, yellow: O, yellow dots: Ci F , red dots: Ci B, blue Square: xs,F , red Square: xs,B.) . . . . . . . . . . . . 86 A.6 The platform used for experiments. (1) The front facing camera, (2) NVIDIA TX2 CPU+GPU, (3) Downward facing optical flow sensor (cam- era+sonar) which is only used for position hold. . . . . . . . . . . . . . . 89 A.7 First two rows: (XW , YW ), (YW , ZW ) and (XW , ZW ) Vicon estimates of the trajectory executed by the quadrotor in different stages (gray bar indicates the gap). (XW , ZW ) plot shows the diagonal scanning trajectory (the lines don’t coincide due to drift). Last row: Photo of the quadrotor during gap traversal. (cyan: detection stage, red: traversal stage.) . . . . 90 A.8 Sequence of images of quadrotor going through different shaped gaps. Top on-set: Ξ outputs, bottom on-set: quadrotor view. . . . . . . . . . . . 91 A.9 Top Row (left to right): Quadrotor view at 0ZF = 1.5, 2.6, 3m respec- tively with 0ZB = 5.7m. Bottom Row: Respective Ξ outputs for N = 4. Observe how the fidelity of Ξ reduces as 0ZF → 0ZB, making the detec- tion more noisy. (white boxes show the location of the gap in Figs. A.9 to A.13.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 A.10 Comparison of different philosophies to gap detection. Top row (left to right): DSO, Stereo Depth, MonoDepth, TS2P. Bottom row shows the detected gap overlayed on the corresponding input image. (green: G ∩O, yellow: false negative G ∩ O′, red: false positive G ′ ∩ O.) . . . . . . . . . 93 A.11 Left Column: Images used to compute Ξ. Middle Column (top to bot- tom): Ξ outputs for DIS Flow, SpyNet and FlowNet2. Right Column: Gap Detection outputs. (green: G∩O, yellow: false negative G∩O′, red: false positive G ′ ∩ O. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 A.12 Top row (left to right): Quadrotor view at image sizes of 384×576, 192× 288, 96×144, 48×72, 32×48. Note all images are re-scaled to 384×576 for better viewing. Bottom row shows the respective Ξ outputs for N = 4. 96 A.13 Top two rows show the input images. The third row shows the Ξ outputs when only the first 2, 4 and all 8 images are used. . . . . . . . . . . . . . 97 A.14 Quadrotor traversing an unknown window with a minimum tolerance of just 5cm. (red dashed line denotes C.) . . . . . . . . . . . . . . . . . . . 97 A.15 Left to right columnwise: Side view of the setup, Front view of the setup, sample image frame used, Ξ output, Detection output - Yellow: Ground Truth, Green: Correctly detected region, Red: Incorrectly detected region. Rowwise: Cases in the order in Table A.4. Best viewed in color. . . . . . 101 xvii B.1 (a) A real quadrotor running EVDodgeNet to dodge two obstacles thrown at it simultaneously. 
(b) Raw event frame as seen from the front event camera. (c) Segmentation output. (d) Segmentation flow output which includes both segmentation and optical flow. (e) Simulation environment where EVDodgeNet was trained. (f) Segmentation ground truth. (g) Sim- ulated front facing event frame. All the images in this paper are best viewed in color. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 B.2 Overview of the proposed neural network based navigation stack for the purpose of dodging. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 B.3 (a) A sample simulation scene used for training our networks, (b) Sample objects used in (a), (c) sample scene textures used in (a). . . . . . . . . . 110 B.4 Representation of coordinate frames on the hardware platform used. (1) Front facing DAVIS 240C, (2) down facing sonar on PX4Flow, (3) down facing DAVIS 240B, (4) NVIDIA TX2 CPU+GPU, (5) Intel® Aero Com- pute board. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 B.5 Network Architectures used in the proposed pipeline. Left: EVDeblur- Net, Middle: EVHomographyNet and Right: EVSegFlowNet. Green blocks show the convolutional layer with batch normalization and ReLU activation, cyan blocks show deconvolutional layer with batch normal- ization and ReLU activation and orange blocks show dropout layers. The numbers inside convolutional and deconvolutional layers show kernel size, number of filters and stride factor. The number inside dropout layer shows the dropout fraction. N is 3 and 6 respectively for EVDeblurNet when us- ing losses D1/D2 and D3. N is 2 and 5 respectively for EVSegFlowNet when using losses D1/D2 and D3. . . . . . . . . . . . . . . . . . . . . . 117 B.6 Various Scene setups used for generating data. Red box indicates the scene used for generating out of dataset testing data to evaluate general- ization to novel scenes. . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 B.7 Moving objects used in our simulation environment. Left to right: ball, cereal box, tower, cone, car, drone, kunai, wine bottle and airplane. Notice the variation in texture, color and shape. Note that the objects are not presented to scale for visual clarity. . . . . . . . . . . . . . . . . . . . . 125 B.8 Random textures used in our simulation environment . . . . . . . . . . . 126 B.9 Different textured carpets laid on the ground during real experiments to aid robust homography estimation from EVHomographyNet. . . . . . . . 127 B.10 Vectors XIMO i,p and XIMO i+1,p represent the intersection of the trajectory and the image plane. xs is the direction of the “safe” trajectory. All the vectors are defined with respect to the center of the quadrotor projected on the image plane, O. Both of the spheres are of known radii. . . . . . . . . . . 132 B.11 Representation of velocity direction of multiple unknown IMOs. The vec- tor vIMO i and vIMO i+1 represent velocities of the corresponding objects. xs denotes the “safe” direction for the quadrotor. . . . . . . . . . . . . . . . 134 B.12 Objects used in experiments. Left to right: Airplane, car, spherical ball and Bebop 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 xviii B.13 Vicon estimates for the trajectories of the objects and quadrotor. (a) Per- spective and top view for single unknown object case, (b) perspective and top view for multiple object case. Object and quadrotor silhouettes are shown to scale. 
Time progression is shown from red to yellow for objects and blue to green for the quadrotor. . . . . . . . . . . . . . . . . . . . . . 135 B.14 Sequence of images of quadrotor dodging or pursuing of objects. (a)- (d): Dodging a spherical ball, car, airplane and Bebop 2 respectively. (e): Dodging multiple objects simultaneously. (f): Pursuit of Bebop 2 by re- versing control policy. Object and quadrotor transparency show progres- sion of time. Red and green arrows indicate object and quadrotor direc- tions respectively. On-set images show front facing event frame (top) and respective segmentation obtained from our network (down). . . . . . . . . 136 B.15 Output of EVDeBlurNet for different integration time and loss functions. Top row: raw event frames, middle row: deblurred event frames with D2 and bottom row: deblurred event frames with D3 with δt. Left to right: δt of 1 ms, 5 ms and 10 ms. Notice that only the major contours are preserved and blurred contours are thinned in deblurred outputs. . . . . . 136 B.16 Representation of coordinate frames on the hardware platform used. (1) Front facing DAVIS 240C, (2) down facing sonar on PX4Flow, (3) down facing DAVIS 240B, (4) NVIDIA TX2 CPU+GPU, (5) Intel® Aero Com- pute board. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 B.17 Output of EVHomographyNet for raw and deblurred event frames at dif- ferent integration times. Green and red color denotes ground truth and predicted H̃4Pt respectively. Top row: raw events frames and bottom row: deblurred event frames. Left to right: δt of 1 ms, 5 ms and 10 ms. Notice that the deblurred homography outputs are almost not affected by integra- tion time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 C.1 Applications presented in this work using the proposed propeller detec- tion method for finding multi-rotors. (a) Tracking and following an un- marked quadrotor, (b) Landing/Docking on a flying quadrotor. Red and green arrows indicates the movement of the larger and smaller quadrotors respectively. Time progression is shown as quadrotor opacity. The insets show the event frames E from the smaller quadrotor used for detecting the propellers of the bigger quadrotor using the proposed EVPropNet. Red and blue color in the event frames indicate positive and negative events respectively. Green color indicates the network prediction. All the event images in this paper follow the same color scheme. Vicon estimates are shown in corresponding sub-figures of Fig. C.8. All the images in this paper are best viewed in color on a computer screen at a zoom of 200%. 146 xix C.2 (a) Coordinate frames used for the geometric modelling of a propeller, (b) Blade coordinate definition, (c) Skew definition, (d) Coordinate axes for propeller projection on camera, and (e) Simplified model of the projection of the propeller blade; Each color represents a single spline and points with same color denote knots used to fit the cubic spline. Bi-color points are used as knots for both the splines of respective color. See Table C.1 for a tabulation of the variables used in this figure. . . . . . . . . . . . . . 152 C.3 Spatio-temporal event cloud E and Event frame E . The cloud shows that the propeller creates a helix in the spatio-temporal domain. The zoomed in view shows the propeller with positive events colored red and negative events colored blue along with network prediction as green with the color saturation indicating confidence. . . . . . . . . . . . . . . . . . . . . . . 
158 C.4 Sample event images E from the generated synthetic dataset used to train EVPropNet. Here red and blue colors show positive and negative events respectively. Green color indicates our ground truth label with the color saturation indicating confidence as defined by Eq. C.14. . . . . . . . . . . 158 C.5 Network architecture for EVPropNet (χ is a hyperparameter along with expansion rate – rate at which the number of neurons grow after each block). If no down/up-sampling rate is shown, it is taken to be 1. This image is best viewed on the computer screen at a zoom of 200%. . . . . . 159 C.6 (a) Smaller quadrotor on the bigger quadrotor used for landing experi- ments (Sec. C.5.1), (b) Gutted Coral USB Accelerator with custom heat sink used to run the neural networks, (c) Samsung Gen 3 DVS sensor used for experiments, (d) Bigger quadrotor used in the following experiments (Sec. C.5.2). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 C.7 Top rows: Input event frame E where red and blue colors show posi- tive and negative events respectively. Green color indicates EVPropNet prediction with the color saturation indicating confidence. Bottom rows: reference images of the propeller taken with a Nikon D850 DSLR (32dB dynamic range). Scenarios (a) to (h) are explained in Table C.5. . . . . . . 169 C.8 Vicon estimates for the trajectories of the smaller and larger quadrotor in the application experiments shown in Fig. C.1. (a) Tracking and follow- ing, (b) Mid-air landing. Time progression is shown from yellow to red for the smaller quadrotor and and green to blue for the bigger quadro- tor. The black dots in (b) show the moment in time where the touchdown occured. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 C.9 (a) Simplified model of a quadrotor used to calculate area ratios of the propellers to that of the biggest square fiducial marker that can be fit in the center without obstruction, (b) Simplified arm and motor projection to compute amount of propeller occluded from generating events – gray areas show where the propeller is visible and generates events, green area is occluded by the motor and blue area is occluded by the arm. . . . . . . 172 C.10 Variation of Detection Rate with variation in real-world propeller radius r for different (a) Focal lengths f with ϕ = 0◦, and (b) Camera Roll ϕ with f = 2.5mm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174 xx D.1 Three applications of a variable baseline stereo system were explored in this work. (a) Flying through a forest, (b) Flying through an unknown shape and location dynamic gap, (c) Detecting an Independently Moving Object. In all the cases, the baseline of the stereo system is changing and is colored coded as jet (blue to red indicates 100 mm to 300 mm baseline). The opacity of the quadrotor/object shows positive progression of time. All the images in this paper are best viewed in color on a computer. . . . 178 D.2 Error in pixel location due to error in various estimated intrinsic parame- ters. (a) ey vs. fe, (b) ex vs. αe, (c) ex vs. k1e, (d) ey vs. k2e, (e) ey vs. k3e, (f) ey vs. k5e. Notice that the X and Y scales for each of the plots is different though trend may seem similar. . . . . . . . . . . . . . . . . . . 185 D.3 Error in pixel location in right camera due to error in various estimated extrinsic parameters. (a) ex,R vs. Txe, (b) ey,R vs. Txe, (c) ex,R vs. ϕe, (d) ey,R vs. ϕe, (e) ex,R vs. θe, (f) ey,R vs. θe, (g) ex,R vs. ψe, (h) ey,R vs. ψe. 
Notice that the X and Y scales for each of the plots is different though trend may seem similar. . . . . . . . . . . . . . . . . . . . . . . . . . . . 186 D.4 Target vs. achieved baseline. The highlight shows the 10σ value. . . . . . 186 D.5 Max. velocity to have a disparity error lower than k px. vs. baseline for different time synchronization errors (δt). . . . . . . . . . . . . . . . . . 187 D.6 Variation of baseline from 100 mm to 300 mm. Notice that the stereo system is bigger than the quadrotor at the largest baseline. . . . . . . . . . 187 D.7 Quadrotor platform used for experiments. (1) RaspberryPi 3B+ compute module, (2) Stereo camera, (3) Actuonix linear servo, (4) T-Motor F40 III Motors, (5) T-Motor F55A 4-in-1 ESC, (6) Holybro Kakute F7 flight controller, (7) WiFi module, (8) Teensy 3.2 microcontroller, (9) 5045×3 propeller, (10) Optical Flow module, (11) TFMini lidar, (12) 3S LiPo battery. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188 D.8 Variable baseline stereo performance in simulated forest flight when com- pared to small and large baselines. Note that the large baseline system crashes (red curve) and small baseline system (blue curve) can traverse the scene but is about 4× slower than the variable baseline system. The baseline for the variable baseline case is color-coded as jet (blue to red indicates small to large baseline). . . . . . . . . . . . . . . . . . . . . . . 188 D.9 Sequence of images of quadrotor going through different shaped gaps: (a) Infinity, (b) Goku, (c) Rectangle. In all the cases, the baseline of the stereo system is changing and is colored coded as jet (blue to red indicates 100 mm to 300 mm baseline). . . . . . . . . . . . . . . . . . . . . . . . . . . 189 D.10 Variable baseline stereo performance in simulated 3D IMO detection when compared to small and large baselines. Note that both the large baseline system (red curve) and small baseline system (blue curve) lose detection of the IMO at different parts of the scene (dots). The baseline for the vari- able baseline case is color-coded as jet (blue to red indicates small to large baseline). Black curve (horizontal line at zero vertical axis) represents the ground truth trajectory of the object. . . . . . . . . . . . . . . . . . . . . 189 xxi E.1 Sample point-cloud output of SalientDSO which does not have loop clo- sure or global bundle adjustment. Each inset (color-coded to suite the re- spective location on the map) in clockwise direction from top left show the corresponding image, saliency, scene parsing outputs and active features. Observe that features from non-informative regions are almost removed approaching object-centric odometry. . . . . . . . . . . . . . . . . . . . 198 E.2 Sample point-cloud output of SalientDSO which does not have loop clo- sure or global bundle adjustment. Each inset (color-coded to suite the re- spective location on the map) in clockwise direction from top left show the corresponding image, saliency, scene parsing outputs and active features. Observe that features from non-informative regions are almost removed approaching object-centric odometry. . . . . . . . . . . . . . . . . . . . . 201 E.3 Algorithmic overview of SalientDSO, blue parts show our contributions. Here KF is the abbreviation for Key Frame. . . . . . . . . . . . . . . . . 205 E.4 Left column: Input image, Right column: Saliency overlayed on input image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
206 E.5 Variation of Saliency Map due to changes in illumination and viewpoint. Notice that the fixation still remains inside the same object but the saliency map varies. The crosses of respective color highlight the fixation in the respective images. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207 E.6 Point selection using different schemes. Top rows in (a) and (b), left to right: features selected using DSO’s scheme, saliency only, saliency+scene parsing. Bottom rows in (a) and (b), left to right: input image, saliency, scene parsing output. Notice how using saliency+scene parsing removed all non-informative features. (a) and (b) show images from ICL-NUIM and CVL-UMD datasets respectively. . . . . . . . . . . . . . . . . . . . . 211 E.7 Comparison of evaluation results for ICL-NIUM dataset. Left: DSO, Right: SalientDSO. Each square correspondes to a color coded error. Note that Salient DSO almost always has lower error than it’s DSO counterpart. 214 E.8 Comparison of evaluation results for TUM dataset. Left: DSO, Right: SalientDSO. Note that Salient DSO almost always has lower error than it’s DSO counterpart. Note that, for the TUM dataset scene parsing was turned off as TUM dataset only provides grayscale images and scene pars- ing outputs are very noisy for grayscale images. . . . . . . . . . . . . . . 214 E.9 Comparison of outputs for Np = 40 – very few features. (a) Success case of DSO with a large amount of drift, (b) Success case for SalientDSO, (c) Failure case of DSO where the optimization diverges due to very few features. Notice that SalientDSO can perform very well in these extreme conditions showing the robustness of the features chosen. . . . . . . . . . 217 E.10 Comparison of drift. (a) DSO’s output, (b) SalientDSO’s output, (c) Im- age corresponding to crop shown in the inset. Observe that SalientDSO’s output has the checkerboard from different times more closely aligned as compared to DSO. Here Np = 1000. . . . . . . . . . . . . . . . . . . . . 218 E.11 Sample outputs for TUM sequence 1. (a) DSO, (b) SalientDSO. Here Np = 1000. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218 xxii F.1 Size comparison of various components used on quadrotors. (a) Snap- dragon Flight, (b) PixFalcon, (c) 120 mm quadrotor platform with NanoPi Neo Core 2, (d) MYNT EYE stereo camera, (e) Google Coral USB ac- celerator, (f) Sipeed Maix Bit, (g) PX4Flow, (h) 210 mm quadrotor plat- form with Coral Dev board, (i) 360 mm quadrotor platform with Intel® Up board, (j) 500 mm quadrotor platform with NVIDIA® JetsonTM TX2. Note that all components shown are to relative scale. All the images in this paper are best viewed in color. . . . . . . . . . . . . . . . . . . . . 221 F.2 Different network architectures. (a) VanillaNet, (b) ResNet, (c) SqueezeNet, (d) MobileNet and (e) ShuffleNet. (χ and ξ are hyperparameters). Each architecture block is repeated per warp parameter prediction. This image is best viewed on the computer screen at a zoom of 200%. . . . . . . . . . 225 F.3 PRG Husky-360γ platform used in flight experiments. (a) Top view, (b) front view, (c) down-facing leopard imaging camera. . . . . . . . . . . . 232 F.4 (a) Accuracy, (b) Accuracy per Kilo param, (c) Accuracy per Kilo OP for different network architectures. Blue and orange histograms denote small (≤0.83 MB) and large (≤8.3 MB) networks respectively. Here the following shorthand is used for network names: VN: VanillaNet, RN: ResNet, SqN: SqueezeNet, MN: MobileNet and ShN: ShuffleNet. 
All networks use T×2, S×2 warping configuration. . . . . . . . . . . . . . . 243 F.5 Weight vs. FPS for VanillaNet4 (T×2, S×2) on different hardware and software optimization combinations. Left: small (≤0.83 MB) model, right: large (≤8.3 MB) model. The radius of each circle is proportional to log of volume of each hardware (this is shown in the legend below the plots with volume indicated on top of each legend in cm3). The out- line on each sample indicates the configuration of quantization or opti- mization used (Float32 (red outline) is the original TensorFlow model without any quantization or optimization, Int8-TFLite (black out- line) is the TensorFlow-Lite model with 8-bit Integer quantization and Int8-EdgeTPU (blue outline) is the TensorFlow-Lite model with 8-bit Integer quantization and Edge-TPU optimization. The samples are color coded to indicate the computer it was run on (shown in the legend on the bottom). Also note that, Laptop and PC (Deskop) weight and volume val- ues are not to actual scale for visual clarity in all images. All the figures in this paper use the same legend and color coding for ease of readability. 248 F.6 Weight vs. FPS for ResNet4 (T×2, S×2) on different hardware and soft- ware optimization combinations. Left: small (≤0.83 MB) model, right: large (≤8.3 MB) model. The radius of each circle is proportional to log of volume of each hardware. . . . . . . . . . . . . . . . . . . . . . . . . 249 F.7 Weight vs. FPS for SqueezeNet4 (T×2, S×2) on different hardware and software optimization combinations. Left: small (≤0.83 MB) model, right: large (≤8.3 MB) model. The radius of each circle is proportional to log of volume of each hardware. . . . . . . . . . . . . . . . . . . . . . 249 xxiii F.8 Weight vs. FPS for MobileNet4 (T×2, S×2) on different hardware and software optimization combinations. Left: small (≤0.83 MB) model, right: large (≤8.3 MB) model. The radius of each circle is proportional to log of volume of each hardware. . . . . . . . . . . . . . . . . . . . . . 249 F.9 Weight vs. FPS for ShuffleNet4 (T×2, S×2) on different hardware and software optimization combinations. Left: small (≤0.83 MB) model, right: large (≤8.3 MB) model. The radius of each circle is proportional to log of volume of each hardware. . . . . . . . . . . . . . . . . . . . . . 250 F.10 Weight vs. FPS for the best model architecture on each hardware cou- pled to the best software optimization combination. The radius of each circle is proportional to log of volume of each hardware. The best model architecture and model optimization for each hardware are: Up: ResNetS- Float32, CoralDev: ResNetS-Int8-EdgeTPU, CoralUSB: ResNetS- Int8-EdgeTPU, NanoPi: ResNetS-Int8, BananaPiM2-Zero: ResNetS- Int8, TX2: SqueezeNetS-Float32, Laptop-i7: SqueezeNetS-Float32, Laptop-1070: SqueezeNetS-Float32, PC-i9: SqueezeNetS-Float32, PC-TitanXp: SqueezeNetS-Float32. All networks use T×2, S×2 con- figuration and S and L subscripts indicate small and large networks re- spectively. The best network for each hardware was chosen with the avg. error ≤ 2.5 px. and the configuration which gives the higest FPS. . . . . . 250 F.11 Num. of Params vs. FPS (a) when only increasing the depth of the net- work while keeping width constant, (b) when only increasing the width of the network while keeping depth constant, (c) when increasing a com- bination of depth and width of the network for different computers. . . . . 251 F.12 Total Power vs. Quadrotor Size at hover. 
Each sample is a pie chart which shows the percentage of power consumed by the motors in red and the compute and sensing power in blue. The radius of the pie chart is proportional to the power efficiency (in g/W, given as the ratio of hover thrust to hover power). Refer to the legend at the bottom (gray circles), with the numbers on top indicating power efficiency in g/W.
F.13 Comparison of the trajectory obtained by dead-reckoning our estimates (red) with the ground truth (blue) for quadrotor flight in various trajectory shapes. (a) Circle, (b) Moon, (c) Line, (d) Figure8 and (e) Square.

Chapter 1: Introduction

Robots have always been imagined as intelligent agents that can solve any problem faster, cheaper and better than human beings. Such an imagination has been prevalent since the 1800s. Despite major advances in technology, autonomy in robots falls significantly short of the predictions made in the past along with the expectations of most people. Most successful autonomous robots today are generally big, bulky and targeted towards a particular set of tasks such as cleaning floors, assembling cars and so on. To build a general purpose robot, such as those shown in TV shows, which possesses capabilities similar to humans, we need to utilize concepts used by living agents for both hardware and software co-design. The capabilities of such a general purpose robot in the wild can be categorized as: (a) Navigation, (b) Human-robot interaction and finally (c) Physical interaction or Manipulation.

Navigation involves moving around in space without running into static or dynamic obstacles such as trees and humans. Human-robot interaction involves the ability to understand humans and reason about their intent by interacting with them through a natural language to disambiguate hard-to-parse queries. Finally, physical interaction or manipulation involves the ability to alter the world by picking, moving and nudging objects.

This thesis focuses on the first capability: Navigation, specifically tailored towards aerial robots or Unmanned Aerial Vehicles (UAVs). However, the methods can be easily adapted to ground robots with minimal effort. This fundamentally involves answering the following three questions: Where am I? What is a hazard while I am moving? Where should I go next? This capability has been traditionally achieved using computer vision algorithms with the aim of building a representation of general applicability: a 3D reconstruction of the scene. Using this representation, planning tasks are constructed and accomplished to allow the quadrotor to demonstrate autonomous behavior. Note that I will use the words aerial robot(s), Unmanned Aerial Vehicle(s) (UAV(s)), quadrotor(s) and drone(s) interchangeably; they refer to the same entity unless specified otherwise.

1.1 Active Agents

Although aerial robots are inherently active agents, their perceptual capabilities in the literature so far have been mostly passive in nature. Researchers and practitioners today use traditional computer vision algorithms with the aim of building a representation of general applicability: a 3D reconstruction of the scene. Using this representation, planning tasks are constructed and accomplished to allow the robot to demonstrate autonomous behavior.
However, this is in stark contrast to the methodology used by living agents such as birds and bees, which have been solving these problems for ages with relative ease and extreme efficiency. These living beings utilize their activeness (the ability to control the movement of their bodies or a part thereof) to simplify perception problems by building specific task-driven sensorimotor loops (combinations of perception, planning and control). This thesis is built upon this philosophy: the agent can control its own movement, or the movement of a part of its body, to make its perception problem simpler. This is due to the additional constraints introduced by moving in particular ways. Such a movement to manipulate perception forms a sensorimotor loop: a perception, planning and control loop to solve the task at hand. Note that solving a real-world task can utilize multiple such sensorimotor loops.

1.2 Active vs Passive Approaches to Perception

Different sets of tasks or competences of an aerial robot have traditionally been achieved with passive perception (or vision), which is based on human or primate vision and involves sensing the world in 3D. This philosophy revolves around obtaining a 3D map first and then utilizing it for various tasks. However, many tasks rarely require a full 3D map of the scene to be accomplished, and such designs are hence not minimalist (they do not use the minimum power, computation or number of sensors) by virtue of their design. The major advantage of such a system is that it is agnostic to the morphology of the robot and can be almost directly adapted to different shapes, sizes and kinds of robots, given that they possess the required computation, sensing and power on-board.

On the contrary, active perception (or vision) adapts the design philosophy to the current operating constraints of computation, sensing and power. Although such an approach is not directly transferable to different agent morphologies, it is generally more power-efficient for the set of tasks it is designed for. Such an active design method is task driven and utilizes the minimal amount of information, computation and power required for the task. It can inherently handle the risk of a sensor failure by virtue of its design, utilizing exploration to gather more information, and is generally more robust, although it might take more time compared to its passive counterpart (see Fig. 1.1).

Figure 1.1: Sensing, control and computation variation with respect to the amount of activeness used by the agent.

Table 1.1 shows a comparison of different behaviours of a UAV using both the active and passive design philosophies. Notice that the integration of different behaviours is harder in the active approach. To make this problem easier and more tractable, we propose to re-use multiple of these competences by conceptualizing each agent as a set of hierarchical sensorimotor loops (or competences or behaviours). This makes it easier to adapt the agent to related but different sets of problems just by changing the checking condition (the condition used to see if a sensorimotor loop needs to be terminated). For example, in flower pollination one would check for a flower, and this could easily be changed to check for survivors in a search and rescue task, as sketched below.
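To make the notion of a re-usable sensorimotor loop with a swappable checking condition concrete, here is a minimal Python sketch (illustrative only, not code from this thesis); the helpers sense, plan, act, flower_detected and survivor_detected are hypothetical placeholders.

    def sensorimotor_loop(sense, plan, act, check_done, max_steps=1000):
        """Run a perceive-plan-act loop until the task-specific checking condition fires."""
        for _ in range(max_steps):
            observation = sense()            # perception
            if check_done(observation):      # checking condition: terminate this loop
                return True
            act(plan(observation))           # planning and control
        return False

    # Re-targeting the loop to a new task only swaps the checking condition:
    # pollination:       sensorimotor_loop(sense, plan, act, check_done=flower_detected)
    # search and rescue: sensorimotor_loop(sense, plan, act, check_done=survivor_detected)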
Table 1.1: Minimalist design of autonomous UAV behaviours.

Competence | Passive Approach | Active and Task-based Approach
Kinetic stabilization | Optimization of optical flow fields | Sensor fusion between optical flow and IMU measurements
Obstacle avoidance | Obtain 3D model and plan accordingly | Obtain flow fields and extract relevant information from them
Segmentation of independently moving objects | Optimization of flow fields | Fixation and tracking allows detection
Homing | Application of SLAM | Learn paths to home from many locations
Landing | Reconstruct 3D model and plan accordingly | Perform servoing of landing area and plan appropriate policy
Pursuit and Avoidance | Reconstruct 3D model and plan accordingly | Track while in motion
Integration: Switching between behaviors | Easy: the planner interacts with the 3D model | Hard: an attention mechanism decides the switching between behaviors

1.3 Forms of Activeness on a UAV

Activeness on a UAV, or any robot in general, can be accomplished in multiple ways: 1. By moving the agent itself, 2. By employing an active sensor, 3. By moving a part of the agent's body, 4. By hallucinating active movements.

In the first approach, the entire agent moves such that the perception problem becomes simpler. Such an approach is generally used by smaller robots, where moving the entire agent is less power hungry than adding another sensor and thereby increasing the robot's weight and computation.

The second approach utilizes an active sensor, i.e., a sensor which only produces data when movement is present. One class of sensors inspired by animal eyes is the event camera. Such cameras only record asynchronous intensity changes in light rather than traditional image frames. They have a higher dynamic range and lower latency compared to classical cameras. However, such a sensor lacks the ability to perform recognition from a single timestep due to the lack of dense data when little or no movement is present. On the other hand, these sensors excel at tasks that involve movement, which is the core concept of activeness. This approach is useful when recognition of static objects is not required or when high speeds and severe illumination changes might be encountered.

The third approach changes the robot's body morphology to enable simpler perception. Such an approach can be used to make the robot as small as required while utilizing a bigger set of sensors, and is desirable when moving the whole robot is less power efficient than adding additional components to enable movement of the sensor suite. Such a method can also simplify certain perception problems by directly estimating depth.

Finally, the last approach entails utilizing a method which hallucinates an active observer. For example, one could hallucinate a heatmap of the microsaccades a human being might exhibit while looking at an image. Such a method is computationally more expensive but can be utilized when the power used by computation is far less than that used by moving the agent or a part of it.

1.4 Hardware and Software Co-design

Keen readers would have realized that the active approach to designing algorithms is dictated by the amount of sensing, computation and power supply on-board, which are commonly called Size, Weight, Area and Power (SWAP) constraints. This is in unison with what Alan Kay, a pioneering computer scientist, said in the 1980s: "People who are really serious about software should make their own hardware." Thus, understanding the hierarchy of autonomous systems and applying it in engineering becomes a problem of synergistic hardware and software co-design.
This multidimensional optimization problem across different strata (hardware – integrated chips, sensors and effectors; software – the set of programs running on the system) is a new research area that has the potential to lead to a disruptive technology in this field. We call this field "Embodied AI". This is directly inspired by how nature has taken ages to solve the optimization problem between hardware (sensing using eyes and other sensors) and software (neural architecture), and it is still being refined by a process we call 'Genetic Evolution' [2].

This multidimensional problem is intractable if we try to study all combinations of sensor placements, computation and algorithms. Hence, to make the problem tractable, we limit all our algorithms to be implemented using deep learning modules which obey SWAP constraints, and we limit ourselves to the morphology of a quadrotor. However, the sensor suite is allowed to change between a monocular traditional or event camera and a stereo camera.

1.5 When is Active design useful?

The next obvious question that comes to mind is when such an active design philosophy is useful. An active philosophy using the agent's movement, or the movement of a part of its body, is useful when the agent cannot carry passive sensors that can sense the required quantity (such as depth for mapping) due to SWAP constraints. That being said, an active sensor can still be used regardless of the size of the robot for low-latency applications. Furthermore, an active mechanism on a large robot (with a SWAP budget large enough for a myriad of sensors) can act as a fallback mechanism when one or more of the sensors fail. This could prevent the UAV from crashing by landing it safely. An active philosophy is also useful when uncertainty is expected in either the environment or the sensing. In stark contrast to classical computer vision approaches, the agent would move to explore the world and gather more information for a confident prediction.

Fig. 1.2 shows how the ability of a robot is severely affected as the SWAP constraints change. This shows that we need to use the active design philosophy as SWAP constraints become tighter (the UAV becomes smaller). This is important as smaller UAVs are safer, more agile and scalable for use as swarms. But today's passive algorithmic design philosophy has bounded their abilities severely, and the level of autonomy of these smaller UAVs is very far from that of their larger counterparts. This shows how tightly coupled the design philosophy is with the SWAP constraints.

Figure 1.2: Algorithmic design philosophies for different sized robots along with their capabilities.

However, when we bring living beings into the same plot, we see from Fig. 1.3 that the difference between autonomy levels is not as significant even though the size of living beings changes drastically. This is because of the adaptation of sensing, neural architectures and the methodology used by these agents, which scale well with their SWAP constraints. Also, notice that the level of autonomy possessed by living agents is far higher than that of our man-made autonomous UAVs.

Figure 1.3: Amount of autonomous capabilities for different sized robots and living beings with respect to size. The red box shows where this thesis aims to take the autonomy of a nano-quadrotor.

Figure 1.4: Comparison of our proposed "bee" nano-quadrotor with birds and bees. (a) Sparrowhawk, (b) White-necked Jacobin hummingbird, (c) Giant honeybee, and (d) our proposed "bee" nano-quadrotor. The number next to the brain and scale icon shows the number of neurons and the weight respectively. Note that the images are shown at relative size.
A detailed comparison between a sparrowhawk, a hummingbird, a bee and our "bee" nano-quadrotor is shown in Fig. 1.4. One can observe that our nano-quadrotor has more computation than the sparrowhawk, weighs almost the same as the sparrowhawk and is around the size of the hummingbird, yet has capabilities similar to those of a bee. This is because our technology and design methodology are not as efficient as nature's, and this area has a huge scope for further research. This thesis is one of the few works that has taken baby steps in this area of on-board nano-UAV autonomy using an active approach. Next, I will describe some applications of a nano-UAV (specifically a quadrotor that weighs < 250 g with a maximum motor-to-motor diagonal size of 120 mm) based on the active design philosophy.

1.6 Applications of an Active Nano-Quadrotor(s)

Quadrotors, out of all possible aerial robots, have gained massive popularity due to the simplicity of their mechanical design: a frame with four motors where diagonally opposite motors spin in the same direction. Such a design is inherently stable due to counteracting torques and can be directly controlled by changing motor speeds, since it is a differentially flat system.
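As a concrete illustration of the "directly controlled by changing motor speeds" statement, the standard textbook rigid-body model (not specific to this thesis) maps squared rotor speeds to the collective thrust and body torques. Assuming a plus configuration with motors 1 and 3 on the body x-axis, motors 2 and 4 on the body y-axis, arm length l, rotor thrust coefficient k_f and rotor drag coefficient k_m (the signs depend on the motor numbering and spin directions assumed here):

    \begin{equation}
    \begin{bmatrix} T \\ \tau_x \\ \tau_y \\ \tau_z \end{bmatrix} =
    \begin{bmatrix}
    k_f & k_f & k_f & k_f \\
    0 & l\,k_f & 0 & -l\,k_f \\
    -l\,k_f & 0 & l\,k_f & 0 \\
    k_m & -k_m & k_m & -k_m
    \end{bmatrix}
    \begin{bmatrix} \omega_1^2 \\ \omega_2^2 \\ \omega_3^2 \\ \omega_4^2 \end{bmatrix}
    \end{equation}

Since this mixing matrix is invertible, any desired thrust-torque command can be realized by rotor speeds alone, which is what makes the quadrotor directly controllable and amenable to differential-flatness-based planning.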
A quadrotor has the following advantages over a traditional fixed-wing aircraft: it can hover in place, generally has a higher payload for the same size and can fly faster for the same size. Quadrotors also do not need a runway or a catapult mechanism to take off and can be easily deployed fully autonomously in the field.

In particular, nano-quadrotors are defined in this thesis as quadrotors with a maximum side of 85 mm (motor-to-motor diagonal dimension of 120 mm) and a maximum All Up Weight (AUW) of 250 g. These vehicles are safe, agile and can carry enough payload to be autonomous with off-the-shelf components, making them easy to repair and distribute for further research even in developing countries. A few applications of such nano-quadrotors (one or many) are described next.

Fast exploration and mapping
Nano-quadrotors can be used to map large areas quickly using a centralized or decentralized mapping server, where a fast map can be generated on-board and further optimized once downloaded to a large server. Such a method can be very useful when one can deploy drones with different sensor suites to gather data from different spectra. This would be useful for mapping bridges and pipes where gaps are small and large drones cannot traverse them. Such nano-drones can also be used in sports to obtain camera angles that were not possible before.

Search and Rescue
Nano-UAVs are especially suited for search and rescue since they can be readily deployed with minimal effort in the field and can traverse gaps of unknown shape, size and location. Such a swarm with different sensor suites can be used to find survivors (for example, using a thermal camera). Nano-UAVs can traverse terrain that might not be accessible to ground robots or larger UAVs due to their size, agility and full 3D maneuverability. Because of their cost-effectiveness, such small nano-UAVs can be used as 'disposable information capture devices' in areas such as nuclear plants. Finally, they can also provide a bird's eye view of the scene along with 3D reconstructions for necessary evaluation and for collaboration with other ground robots.

E-sports and Hobbying
Drones have become popular due to a sport called drone racing and have carved out a new space that blends reality with computer games. These drone races are manually piloted, and recently the AlphaPilot competition from Lockheed Martin aimed to do the same in a completely autonomous manner with on-board sensing and computation; it was won by Prof. Guido de Croon's team from TU Delft using a monocular and active approach (https://mavlab.tudelft.nl/mavlab-wins-the-alpha-pilot-challenge/). These features could be built into drones of smaller sizes to make them safer for learning and hobby flying, so that they do not hurt birds in flight. One such step has been taken by DJI in the form of the DJI FPV drone (which weighs about 1 kg), but it is still far from being a nano-quadrotor.

Co-operative delivery
Although nano-UAVs cannot individually carry large payloads, they can be combined (either using rigid links or tethered cables) to carry larger payloads. Such an approach would be desirable during search and rescue, as smaller drones are easier to deploy and faster at exploration, yet can be combined to lift larger payloads.

Inspection
Drones are a popular tool for inspecting large structures where access is limited for ground robots or where using human labor is dangerous, such as under-bridge structures [3], historical monuments [4] or radioactive areas [5]. Recently, Skydio announced an active approach called 3D Scan (https://www.skydio.com/3d-scan) to map and inspect any structure in a GPS-denied environment using only on-board computation and sensing. In addition, a research direction could be to use a gimbal camera with zoom that also uses active perception to reduce the overall control effort of moving the entire vehicle to obtain a higher-resolution image for mapping. Such an approach would make it safer when tight areas have to be mapped with a high degree of accuracy.

1.7 Research Objectives

Since I started my doctoral studies in 2016, there has been astounding progress in the field of quadrotors, and especially autonomous ones. One of the major factors that influenced this growth was the Drone Racing League (DRL), which was launched publicly in January 2016. Since then, there has been rapid development and availability of hobby-grade parts such as sensors, motors and on-board flight controllers that are on par with their commercially available counterparts and those used in top-of-the-line research. This trend has also set off a cycle where Ph.D. students work with hobbyists to build autonomous drones (different multirotors and fixed-wing planes) using hobby-grade parts so that such research is accessible to the general public. To fuel this growth further, there has been rapid advancement in the field of on-board algorithms involving visual perception (visual inertial odometry), better sensor fusion, motion planning, control schemes and decision making. The recent deep learning trend has also been shown to improve robustness over classical methods if the training data is generated appropriately. These networks can also be trained in simulation and transferred to the real world with minimal effort if the problems are chosen carefully. Furthermore, these networks can be hardware-accelerated to obtain speeds not possible with classical methods, without extensive reprogramming to fit the underlying computer architecture.

Although electronics and avionics exist for building efficient nano-quadrotors (defined here as a maximum diagonal motor-to-motor size of 120 mm and an AUW of 250 g), they rarely possess autonomous features. This is because of a large number of open challenges in scaling down autonomy with limited sensing and computation. With the advent of special deep learning accelerator chips, open-source simulation tools for data generation, new-generation sensors and new FAA rules for small drones, now is the perfect time to make these nano-quadrotors autonomous. This research direction has also caught the attention of many top-notch researchers across the world due to DARPA's Fast Lightweight Autonomy project, where the goal was to build an autonomous quadrotor that can fly up to 20 m/s with on-board sensing and computation.

The myriad challenges involved in making nano-drones autonomous span multiple domains and multiple Ph.D. theses, and will take an effort from researchers all across the world to address. My doctoral work focuses on a few key components which revolve around the central philosophy of active perception. Each of my works has a philosophical ideology of active perception with a practical application of immediate impact. I particularly showcase four forms of active perception (combining perception, planning and control into a single sensorimotor loop to make the problem simpler to solve): 1. By moving the agent itself, 2. By employing an active sensor, 3. By moving a part of the agent's body, 4. By hallucinating active movements. Next, to make this work practically applicable, I show how hardware and software co-design can be performed to optimize the form of active perception to be used. Finally, I present the world's first prototype of a RoboBeeHive that shows how to integrate multiple competences centered around active vision in all its glory.

Today's drone autonomy in indoor (and sometimes outdoor) spaces has traditionally been achieved using motion capture setups, which are expensive, have a large setup and calibration time, and strongly limit the flying area. The key advantages of such a system are its sub-millimeter accuracy, constant and low latency of about 10 ms, and high update rate of up to 200 Hz. However, such a system is not applicable when drones need to be deployed in the wild with wind, changing terrain, varied visual features and illumination, and other external obstacles. This necessitates on-board perception (sensing) and computation. Such an on-board system also has the added advantage of preserving privacy and security (by being harder to hack into). In this thesis, I focus on on-board sensing and perception to solve fundamental problems for aerial robot autonomy: static and dynamic obstacle avoidance, flying through gaps, and homing (where do I find home?). All these problems are tackled by controlling the agent's movement in a way that simplifies the perception problem (forming a sensorimotor loop), following the active perception philosophy.

1.8 State of the Art

I will next describe the state of the art in various related areas that can potentially advance the area of autonomy on nano-UAVs.
Active Perception by moving the agent
Active perception is the concept in which an agent controls its movements (control and planning) in order to simplify the perception problem. This is in stark contrast to passive perception, where a 3D map is first constructed, motion planning is then performed on it, and the result is used to control the robot's movement. Such a passive approach relies on the assumption that the perception information obtained is accurate and robust, which is rarely the case in the wild. The passive approach arose because it is the most evident way to set up a mathematical optimization problem that can be sub-divided into multiple fields so that fast progress can be achieved. Now that the field has matured to the point where we have full-fledged products based on this philosophy, academia needs to think about the next conceptualization of a robot. Revisiting what experts of the field put forth about three decades ago: we need to combine perception, planning and control into a single entity, the sensorimotor loop. This has been called by different names: active vision, animat vision, or perception-aware planning. This literature has gained a lot of momentum in the last five years due to active vision's robustness, built-in contingency planning and minimalism. The literature spans multiple robot platforms such as quadrotors, ground robots and humanoids, to name a few, and is dominant where robots can carry only a limited payload in terms of sensors and computation.

Active view planning
Another way planning and perception have been tightly coupled is through an area called active view planning. Here, the planner takes into account the current or accumulated perception information to plan a path on the fly to obtain a better view for the desired task. For example, when mapping a bridge [6], spots under the bridge are generally dark and require closer views, and the view planner takes this into account. Such planning is driven by statistical measures such as entropy or the estimated coverage of the map built so far. Another flavor of active view planning comes in the form of keeping some points of interest in the field of view while executing a desired trajectory. Such a problem is tackled by formulating minimum-time trajectories for quadrotors with a limited field-of-view camera to ensure that the points of interest are always in the field of view. Alternative formulations have been presented to control yaw for autonomous aerial cinematography [7]. Furthermore, a few approaches also consider obstacle avoidance when planning such perception-aware trajectories [8].

Active Sensing using event cameras
Over the last decade, there has been an enormous advancement in sensor technology, especially imaging sensors or cameras. Even after these significant advancements, the dynamic range and latency of these cameras are nowhere comparable to those possessed by living agents, especially ones that can fly, such as birds and bees. This observation motivated neuromorphic engineers to develop a new class of imaging sensors called event cameras. These event cameras have 'smart pixels', wherein each pixel is asynchronous and records changes in light rather than the traditional intensity measurements used to create classical image frames. Such a sensor outputs an event cloud rather than an image frame, where each event contains the pixel location and polarity (along with a timestamp).
The polarity takes values from the set {+1, −1, 0}, where +1 (−1) indicates an intensity increase (decrease) and 0 indicates that no event was triggered. Since only the intensity changes are recorded, bandwidth savings of up to orders of magnitude are obtained. Such a sensor also has an ultra-high dynamic range of up to 100 dB, which is much higher than that of a traditional camera [9]. The latency of such a camera can be on the order of a few microseconds, which is two to three orders of magnitude lower than that of a comparable classical camera. Since event sensor data is tightly coupled to the sensor's movement and the scene, event cameras are also called active sensors, in the sense that one has to move (or the scene has to move) to obtain data. To reiterate, the event camera has a high dynamic range, low latency and low bandwidth, and is particularly suited for an active agent.

As one would expect intuitively, the larger the perception latency, the slower the robot can respond to abrupt or dynamic changes in the environment or scene. However, robots in the wild often encounter dynamic obstacles such as humans, birds and insects, along with unstructured obstacles in collapsed buildings such as falling rocks. Although in theory one could recognize the objects in the scene to obtain dynamic obstacles, such methods are not robust to the motion blur and drastic illumination changes that classical cameras suffer from in dynamic, in-the-wild scenarios. Event cameras, by virtue of their design, are perfectly suited for the task of dynamic obstacle detection. In the literature, event-camera-based detection of Independently Moving Objects (IMOs, or dynamic obstacles) has been performed in two major ways: by motion-compensating the event volume (or a frame formed as a projection of events) [10, 11], or by learning to predict the segmentation masks directly [12, 13]. In the first method, the algorithms construct an Image of Warped Events (IWE), or event frame, and use a measure of the IWE's contrast or sharpness as the metric for warping the event cloud to obtain a sharp IWE. The blurry parts of the resulting image are the regions that do not comply with the motion model of the 'background' (the parts of the scene that are not IMOs); these regions form the 'foreground', or IMOs. Such an approach is robust but generally slow due to the processing of 3D spatio-temporal event data; depending on the amount of motion and the scene, it can use a lot of memory and be slower than real-time (defined as 50 Hz). In the second method, a network is trained on these input event frames (without sharpening) to directly predict the pixel locations that belong to IMOs. Such a method is faster, and is robust if enough training data is provided.

Owing to their myriad advantages over classical cameras, event cameras have also gained the attention of other roboticists for building visual odometry [14, 15, 16] and Simultaneous Localization And Mapping (SLAM) algorithms [17]. Event cameras have also been utilized on humanoid robots [18] and on other aerial robots to track objects of interest.
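The motion-compensation family of methods described above can be illustrated with a minimal NumPy sketch (an illustration of the general contrast-maximization idea, not the implementation used in this thesis). It assumes events are given as (x, y, t, polarity) rows, models the background with a single global image-plane velocity, and uses the variance of the warped-event image as the sharpness score; real pipelines use richer motion models and gradient-based optimization rather than the grid search shown here.

    import numpy as np

    def warp_events_to_iwe(events, flow, t_ref, img_shape):
        """Accumulate events into an Image of Warped Events (IWE).
        events: (N, 4) array of (x, y, t, polarity); flow: (vx, vy) in px/s."""
        x = events[:, 0] - flow[0] * (events[:, 2] - t_ref)
        y = events[:, 1] - flow[1] * (events[:, 2] - t_ref)
        xi = np.clip(np.round(x).astype(int), 0, img_shape[1] - 1)
        yi = np.clip(np.round(y).astype(int), 0, img_shape[0] - 1)
        iwe = np.zeros(img_shape, dtype=np.float32)
        np.add.at(iwe, (yi, xi), 1.0)      # count warped events per pixel
        return iwe

    def sharpness(iwe):
        """Contrast metric: a well-compensated background yields a high-variance IWE."""
        return float(np.var(iwe))

    def fit_background_flow(events, img_shape, candidate_flows):
        """Pick the candidate flow that maximizes IWE contrast. Pixels that remain
        blurry under this best warp are candidates for independently moving objects."""
        t_ref = float(events[:, 2].min())
        scores = [sharpness(warp_events_to_iwe(events, f, t_ref, img_shape))
                  for f in candidate_flows]
        return candidate_flows[int(np.argmax(scores))]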
Active Perception by moving a part of the agent
One can also achieve activeness by moving a part of the agent rather than the entire agent itself (such an example in nature can be seen when owlets bob their heads around in circles, see Fig. 1.5). Morphable robot designs in their early days [19, 20, 21, 22, 23, 24] were meant to showcase that such designs could be made functional, and were targeted towards the mechanical and control design of these systems. In the last decade, gimbal systems on drones have become so common that even a cheap hobby drone is generally equipped with one. One of the major driving forces for such a design is aerial videography, to obtain smooth footage with little to no post-processing. This also inspired roboticists to build rigs with pan, tilt and zoom cameras [25] on robots so that they could track objects faster by moving the camera rather than the entire robot. Similarly, stereo systems were built on the same principle to enable better tracking and depth estimation by changing the baseline.

Figure 1.5: A stack of images showing an owlet bobbing its head (see red highlight) to make perception easier. This is an example of an agent moving a part of its body to exhibit activeness. For the original video see https://vimeo.com/152347964.

Hallucinated Activeness
Activeness, as discussed before, comes in various forms and involves some amount of physical movement. However, similar to the difference in the way older agents (or humans) approach problems as compared to younger ones, activeness changes its form as more data or information is acquired. For example, a bird does not have to explore the area around its nest since it already knows how the area looks, having created a mental picture of it. In such a case, activeness becomes closer to passiveness with one single difference: the activeness is hallucinated, or is in the imagination of the agent.

Hallucinated activeness in its most simplified form can be captured by visual saliency: the amount of time an agent's gaze rests on different parts of the image (see Fig. 1.6 for an example of how such a map looks). Such a saliency method depends on the context and is generally a top-down approach. For example, if one is looking for a yellow ball, the saliency heatmap would light up at the spots of the image which are yellow. Saliency could also take the form of a rudimentary motion segmentation method; such an approach would be trivial if an event camera is employed. Also, note that most living agents have evolved eyes that sense in the spectral range needed to find the most salient objects from their perspective. For example, bees and butterflies can 'see' in the ultraviolet range to find the salient flowers they can drink nectar from (see Fig. 1.7). In the literature, saliency has been used for navigation [26], for human-robot interaction [27, 28] and to give robots gaze behavior similar to that of humans [29]. A simple way to exploit such a hallucinated saliency map is sketched below.

Figure 1.6: Left to right: Color image of the scene, corresponding saliency map output by SalGAN [1]. Hotter saliency colors correspond to higher values.

Figure 1.7: Left to right: Bidens ferulifolia flower as seen by human vision, reflected UV, butterfly vision and bee vision. Note that although the images shown here for simulated butterfly and bee vision are at the same resolution as those seen by human eyes, the real resolution of the eyes of these flying agents is much smaller. Photo credits and ©: Dr. Klaus Schmitt.
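As a purely illustrative example of using a hallucinated saliency map, the sketch below biases the point selection of a direct visual odometry front-end towards salient regions. This is not SalientDSO's actual selection scheme (which works on image regions and additionally uses scene parsing); the blending weight and the use of gradient magnitude are assumptions made only for this sketch.

    import numpy as np

    def saliency_weighted_points(gradient_mag, saliency, num_points, w_sal=0.7):
        """Select pixels for a direct VO front-end by mixing photometric gradient
        strength with a hallucinated saliency map. Both inputs are HxW arrays
        normalized to [0, 1]; returns a (num_points, 2) array of (x, y) pixel coords."""
        score = (1.0 - w_sal) * gradient_mag + w_sal * saliency
        top = np.argsort(score, axis=None)[::-1][:num_points]   # indices of top scores
        ys, xs = np.unravel_index(top, score.shape)
        return np.stack([xs, ys], axis=1)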
Nano/Pico-quadrotor design
Smaller UAVs are inherently safer, more agile and can be deployed as large swarms [30]. In the literature, pico-quadrotors have been developed with a myriad of capabilities such as walking and flying, but they rarely have enough on-board computation to be of practical value. Recently, custom chips based on the RISC-V architecture have been developed to bring deep-learning-based autonomy features to these tiny pico-quadrotors [31]; however, they require mastery of hardware design along with writing low-level driver code to make them functional and autonomous, thereby limiting further research. Moreover, the cameras and other sensors that can be used on drones of this size are either custom-made or expensive. Furthermore, the battery life of these tiny drones is less than 2 minutes due to their limited payload capabilities. A similar issue is observed with flapping-wing bird drones, which are inherently safer [32]. On the other end of the spectrum, research groups all over the world have worked on integrating larger sensors and computation modules into smaller units to run a full SLAM stack along with path planning and control, enabled by the latest generation of Graphics Processing Units (GPUs) [33]. Although massive advances have been made in this regard, and these systems have been released as open-source and open-hardware, they still use expensive sensors to achieve good SLAM accuracy. To this end, we propose to utilize activeness in our design and build nano-quadrotors which fit right in between the two aforementioned areas, enabling autonomous features at scales not possible before due to their higher payload and computation as compared to pico-quadrotors and their improved agility, safety and smaller size as compared to micro-quadrotors. We also take the first step in studying the co-design of hardware and software for nano-quadrotors using embodied AI.

1.9 Summary

In this chapter, I discussed what active agents are and how an active agent differs from a passive agent. This was then extended to why and how one should use the activeness of a UAV to make perception problems simpler, enabling autonomy at scales that were not possible before. I also discussed how activeness is a hardware and software co-design problem. Then, I formally presented the research objectives of this thesis, followed by different applications of nano-quadrotors. In the literature review, I summarized the state of the art on different forms of active perception, on combining perception and planning, and on nano/pico-quadrotor design.

Chapter 2: Contributions

In this chapter, I summarize the key contributions of the papers re-printed in the appendix. In particular, this chapter highlights the individual results and refers to the related videos, open-source code and open-hardware tutorials. I also put this thesis's research in context with the state-of-the-art literature. In total, this research has been published in three peer-reviewed journals and six peer-reviewed conference publications. One further paper is under preparation for the AAAS Science Robotics journal.

Paper A was successfully demonstrated live to multiple audiences, most notably to Sam Brin (brother of Sergey Brin), which led to an award of USD 200K from the Brin Family Foundation to our lab to advance machine perception on drones, along with a grant of USD 2M to the University of Maryland for building the Brin Family Aerial Robotics Lab (https://robotics.umd.edu/facilities/brin-family-aerial-robotics-lab, https://twitter.com/umdcs/status/994575308450947072?s=20), which is currently used by various departments. Paper B was presented by Profs. John Baras and Yiannis Aloimonos to the Office of Naval Research and obtained a grant of USD 2.2M to advance the field of Intelligent and Learning Autonomous Systems: Composability and Correctness. The works in Papers A and B also won runner-up for the Brin Family Prize in 2016.
Papers B and C have helped cultivate relationships with two of the best labs for aerial robotics research in the world: the Robotics and Perception Group at the University of Zurich, headed by Prof. Davide Scaramuzza, and the Micro Air Vehicle Laboratory at the Delft University of Technology, headed by Prof. Guido de Croon. The work from this thesis was used to create the world's first prototype of the RoboBeeHive, which is described in detail in Chapter 3. Finally, a lot of the research in this thesis has led to the creation of two fully open-source and open-hardware courses with video lectures and slides: ENAE788M: Hands-on Autonomous Aerial Robotics and CMSC828T: Vision, Planning and Control in Aerial Robotics.

2.1 Active Perception by moving the agent

In this section, I present work on the textbook definition of active perception: controlling the agent's movement to simplify the perception problem. Such work is also called perception-aware planning in the classical literature.

2.1.1 Paper A: GapFlyt (P1)

Nitin J. Sanket*, Chahat Deep Singh*, Kanishka Ganguly, Cornelia Fermüller, Yiannis Aloimonos, "GapFlyt: Active Vision Based Minimalist Structure-Less Gap Detection For Quadrotor Flight", IEEE Robotics and Automation Letters (RA-L), Vol. 3, No. 4, pp. 3847–3854, 2018. DOI: http://dx.doi.org/10.1109/LRA.2018.2843445.

2.1.1.1 Brief Description

In this work, we address one of the biggest challenges for autonomous operation of a UAV in complex environments: navigating through narrow gaps of unknown shape,