ABSTRACT

Title of Dissertation: ACTIVE VISION BASED EMBODIED-AI DESIGN FOR NANO-UAV AUTONOMY
Nitin Jagannatha Sanket, Doctor of Philosophy, 2021
Dissertation Directed by: Professor Yiannis Aloimonos, Department of Computer Science

The human fascination with mimicking ultra-efficient flying beings like birds and bees has led to a rapid rise in aerial robots over the last decade. These aerial robots now command a market worth over 10 billion US dollars. The future of aerial robots, or Unmanned Aerial Vehicles (UAVs), commonly called drones, is very bright because of their utility in a myriad of applications. I envision drones delivering packages to our homes, finding survivors in collapsed buildings, pollinating flowers, inspecting bridges, performing surveillance of cities, competing in sports and even serving as pets. In particular, quadrotors have become the go-to platform for aerial robotics due to the simplicity of their mechanical design, their vertical takeoff and landing capability and their agility. Our eternal pursuit of improved drone safety and power efficiency has given rise to the research and development of smaller yet smarter drones. Furthermore, smaller drones are more agile and can be distributed across tasks as swarms. Embodied Artificial Intelligence (AI) has been a major fuel pushing this area further.

Classically, the approach to designing such nano-drones maintains a strict distinction between perception, planning and control and relies on a 3D map of the scene that is used to plan paths, which are then executed by a control algorithm. On the contrary, nature's never-ending quest to improve the efficiency of flying agents through genetic evolution led to birds developing amazing eyes and brains tailored for agile flight in complex environments as a software and hardware co-design solution. In contrast, smaller flying agents such as insects, which are at the other end of the size and computation spectrum, adopted an ingenious approach: utilize movement to gather more information. Early pioneers of robotics remarked on this observation and coined the concept of "Active Perception", which proposed that one can move in an exploratory way to gather more information and thereby compensate for a lack of computation and sensing. Such controlled movement imposes additional constraints on the data being perceived, which makes the perception problem simpler.

Inspired by this concept, in this thesis I present a novel approach to algorithmic design on nano aerial robots (flying robots the size of a hummingbird) based on active perception, by tightly coupling and combining perception, planning and control into sensorimotor loops using only on-board sensing and computation. This is done by re-imagining each aerial robot as a series of hierarchical sensorimotor loops where the outer ones require the inner ones, so that resources and computation can be efficiently re-used. Activeness is presented and utilized in four different forms to enable large-scale autonomy under Size, Weight, Area and Power (SWAP) constraints tighter than previously demonstrated. The four forms of activeness are: (1) moving the agent itself, (2) employing an active sensor, (3) moving a part of the agent's body, and (4) hallucinating active movements. Next, to make this work practically applicable, I show how hardware and software co-design can be performed to optimize which form of active perception is used.
Finally, I present the world's first prototype of a RoboBeeHive, which shows how to integrate multiple competences centered around active vision in all its glory. Following is a list of the contributions of this thesis:

• The world's first functional prototype of a RoboBeeHive that can artificially pollinate flowers.
• The first method that allows a quadrotor to fly through gaps of unknown shape, location and size using a single monocular camera with only on-board sensing and computation.
• The first method to dodge dynamic obstacles of unknown shape, size and location on a quadrotor using a monocular event camera. Our series of shallow neural networks is trained in simulation and transfers to the real world without any fine-tuning or re-training.
• The first method to detect unmarked drones by detecting their propellers. Our neural network is trained in simulation and transfers to the real world without any fine-tuning or re-training.
• A method to adaptively change the baseline of a stereo camera system for quadrotor navigation.
• The first method to introduce the usage of saliency to select features in a direct visual odometry pipeline.
• A comprehensive benchmark of software and hardware for embodied AI which would serve as a blueprint for researchers and practitioners alike.

ACTIVE VISION BASED EMBODIED-AI DESIGN FOR NANO-UAV AUTONOMY

by

Nitin Jagannatha Sanket

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2021

Advisory Committee:
Professor Yiannis Aloimonos, Chair/Advisor
Dr. Cornelia Fermüller
Professor Davide Scaramuzza
Professor Dinesh Manocha
Professor Inderjit Chopra, Dean's Representative

© Copyright by Nitin Jagannatha Sanket 2021

Dedication

To my family.

Acknowledgments

I owe my gratitude to all the people who have made this thesis possible and because of whom my doctoral study experience has been one that I will cherish forever.

I still remember the day I first met Prof. Yiannis Aloimonos in his office to discuss how perception leads to cognition. I am eternally grateful to him for inspiring me and giving me an opportunity to join his lab as a doctoral student. I am also eternally grateful to Dr. Cornelia Fermüller for introducing me to the field of neuromorphic cameras and perception. In particular, my experience at the Telluride Neuromorphic workshop was one of a kind that I will cherish forever. Both Yiannis and Cornelia have treated me like their own child, giving their time whenever I needed them, mentoring me and helping me hone my skills. One of the most remarkable things was the amount of freedom they gave me during my doctoral studies; they funded me to set up the aerial robotics lab without hesitating even once. They gave me the freedom to mentor students and teach courses, all while pursuing my research – these were the best years of my academic life.

I still remember the day Prof. Yiannis came into the room and asked, "What is the minimal amount of information you would need to solve a task?" He then added, "You could do this to fly through gaps of unknown shape and location!" and "Do all this with only a single camera!" I was still thinking in the paradigm of building a map when he said, "Use active vision: move to make your problem easier."
This was the monumental moment of my Ph.D., where I adopted the active philosophy to solve problems on extremely computation- and sensor-starved aerial robots – nano-quadrotors. Further down the line, Cornelia said, "Why don't we use event cameras to solve problems? They are energy efficient and bio-inspired." This led to tackling the problem of dynamic obstacle dodging using event cameras. Both Yiannis and Cornelia were a powerhouse of inspiration; they motivated and nurtured my wild ideas and also helped translate the math into code and high-quality publications to be made open-source. I am elated to have worked with both of you.

I am forever indebted to Chahat Deep Singh for his help with all my papers. He has been the best friend, housemate, lab-mate, road-trip partner and companion on astrophotography expeditions during the last five years of my life. I would not have been able to fix all those Linux issues without your help and would not have enjoyed discovering the math used in the papers without our intense discussions. I still remember getting yelled at by other professors for being loud during discussions. Chahat was one of the heroes who helped set up the aerial robotics lab, even sweeping the floors with me and getting yelled at for it. Chahat, I will forever try to maintain this friendship and collaboration.

I would also specifically like to thank Chethan Mysore Parameshwara for all his help with ROS and event cameras. Chethan, thanks for inspiring me to work on neuromorphic algorithms and cameras with you. It has been a pleasure.

I would also like to thank all my other labmates of the Perception and Robotics Group (PRG, which was formerly called the Computer Vision Lab or CVL) during my past five years of Ph.D.: Chahat Deep Singh, Chethan Mysore Parameshwara, Kanishka Ganguly, Huai-Jen Liang, Aleksandrs Ecins, Konstantinos Zampogiannis, Dr. Francisco Barranco, Levi Burner, Snehesh Shreshta, Michael Maynord, Chinmaya Devaraj, Matthew Evanusa, Xiaomin Lin, Peter Sutor, Siqin Li, Dr. Behzad Sadrfaridpour, Dr. Krishna Kidambi and Anton Mitrokhin. PRG has always been a second home to me.

I am also indebted to all the Master's students who helped me with experiments in my research: Prateek Arora, Ashwin V. Kuruttukulam, Abhinav Modi, Kartik Madhira, Miguel Maestre Trueba, Varun Asthana, Saumil Shah and Akash Guha. Without you guys, it would not have been possible to conduct these hard and amazing experiments.

I am thrilled to have worked with brilliant professors such as Prof. Davide Scaramuzza and Prof. Guido de Croon, and I am indebted to them for the opportunity to collaborate. You have made me a better writer, a better experimentalist and an overall better researcher.

I would also like to thank the unsung heroes who have contributed immensely to the completion of this thesis: the University of Maryland Institute for Advanced Computer Studies staff, in particular Janice Perrone for handling a multitude of order requests with patience and Tom Ventsias for handling our media outreach. Ivan Penskiy and Kimberly Edwards have been of immense help in setting up the lab and procuring materials for experiments, and I would like to wholeheartedly thank them. I also thank the patient housekeeping staff for keeping our lab clean during messy experiments.

My housemates have helped in balancing workload with unlimited doses of pure unadulterated fun and dad jokes.
I thank Chahat Deep Singh, Sunaina Prabhu, Vinayak Bendale, Ankita Tondwalkar, Priyal Gala, Anoorag Sunkari, Prateek Arora, Harshvardhan Uppaluru, Kedar Gaitonde, Pranay Kanagat, Shankar Ramesh, Kunal Mehta, Meghavi Prashnani and Arpit Agarwal. Stella (Anoorag's pet), you have helped alleviate stress during my Ph.D. with your purest soul and that cute smile.

I owe my deepest thanks to my family: my mother R. S. Shubha and my father K. Jagannatha. You have always stood by me and guided me through my career, and have pulled me through against impossible odds at times. I owe everything I am today to both of you. Words cannot express the gratitude I owe you. Thanks Amma and Daddy for believing in me and for being patient for the last five years while giving me absolute freedom to pursue my interests. You have never let the large physical distance between us bother me, always being there at any time of the day or night when I was tense.

I would like to thank my dearest uncle Chinmaya, aunt Suma and my cousins Nikhil and Aditya, who have been so welcoming and were my only family away from my mother and father. They never let me feel alone by constantly checking on my well-being.

Finally, I want to thank Dr. Vikram Hrishikeshavan and Dr. Derrick Yeo for all their help with aerospace matters, for lending tools when needed and for their help with hardware. I am grateful to Prof. Inderjit Chopra for giving me the opportunity to teach ENAE788M, which I will cherish forever for letting me meet some of the best graduate students and challenge them with hands-on projects.

I would like to acknowledge financial support from the Office of Naval Research (ONR), the Brin Family Foundation, Northrop Grumman Corporation, Samsung Electronics, the National Science Foundation (NSF), NVIDIA and Intel for all the research work discussed herein.

I would also like to thank the amazing open-source and open-hardware communities of Ubuntu, TensorFlow, ArduPilot, RaspberryPi and PX4, without whose work this thesis would not have been possible.

It is impossible to remember everyone, and I sincerely apologize from the bottom of my heart to those I have inadvertently left out. Lastly, thank you all and thank you God!

Table of Contents

Dedication ii
Acknowledgements iii
Table of Contents viii
List of Tables xii
List of Figures xiv

Chapter 1: Introduction 1
1.1 Active Agents . . . . . . . . . . . . . . . . . . . . 2
1.2 Active vs Passive Approaches to Perception . . . . . . . . . . . . . . . . 3
1.3 Forms of Activeness on a UAV . . . . . . . . . . . . . . . . 5
1.4 Hardware and Software Co-design . . . . . . . . . . . . . . . . 6
1.5 When is Active design useful? . . . . . . . . . . . . . . . . 7
1.6 Applications of an Active Nano-Quadrotor(s) . . . . . . . . . . . . . . . . 10
1.7 Research Objectives . . . . . . . . . . . . . . . . 13
1.8 State of the Art . . . . . . . . . . . . . . . . 15
1.9 Summary . . . . . . . . . . . . . . . . 24

Chapter 2: Contributions 26
2.1 Active Perception by moving the agent . . . . . . . . . . . . . . . . 27
2.1.1 Paper A: GapFlyt . . . . . . . . . . . . . . . . 27
2.2 Active sensing using event cameras . . . . . . . . . . . . . . . . 30
2.2.1 Paper B: EVDodgeNet . . . . . . . . . . . . . . . . 30
2.2.2 Paper C: EVPropNet . . . . . . . . . . . . . . . .
33 2.3 Active Perception by moving a part of the agent . . . . . . . . . . . . . . 36 2.3.1 Paper D: MorphEyes . . . . . . . . . . . . . . . . . . . . . . . . 36 2.4 Hallucinated Activeness . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 2.4.1 Paper E: SalientDSO . . . . . . . . . . . . . . . . . . . . . . . . 39 2.5 Hardware and Software Co-design . . . . . . . . . . . . . . . . . . . . . 42 2.5.1 Paper F: PRGFlow . . . . . . . . . . . . . . . . . . . . . . . . . 42 2.6 Unrelated Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Chapter 3: RoboBeeHive: Integration of Active Competences 47 3.1 Hierarchy of competences . . . . . . . . . . . . . . . . . . . . . . . . . . 47 viii 3.2 Motivation and Conceptualization of the RoboBeeHive . . . . . . . . . . 49 3.3 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.3.1 Bee Nano-quadrotors . . . . . . . . . . . . . . . . . . . . . . . . 53 3.3.2 BeeHive Quadrotor . . . . . . . . . . . . . . . . . . . . . . . . . 55 Chapter 4: Future Directions 63 4.1 Limitations of Proposed Approaches . . . . . . . . . . . . . . . . . . . . 64 4.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 Chapter A: GapFlyt 72 A.1 Supplementary Material . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 A.2 Introduction and Philosophy . . . . . . . . . . . . . . . . . . . . . . . . 74 A.2.1 Organization of the paper: . . . . . . . . . . . . . . . . . . . . . 77 A.2.2 Problem Formulation and Proposed Solutions . . . . . . . . . . . 77 A.3 Gap Detection using TS2P . . . . . . . . . . . . . . . . . . . . . . . . . 78 A.4 High Speed Gap Tracking For Visual Servoing Based Control . . . . . . . 82 A.4.1 Safe Point Computation and Tracking . . . . . . . . . . . . . . . 84 A.4.2 Control Policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 A.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 A.5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . 88 A.5.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 90 A.5.3 Robustness of TS2P against different textures . . . . . . . . . . . 98 A.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 Chapter B: EVDodgeNet 102 B.1 Supplementary Material . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 B.2 Introduction and Philosophy . . . . . . . . . . . . . . . . . . . . . . . . 104 B.2.1 Problem Formulation and Contributions . . . . . . . . . . . . . . 106 B.2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 B.2.3 Organization of the paper . . . . . . . . . . . . . . . . . . . . . . 109 B.3 Deep Learning Based Navigation Stack For Dodging Dynamic Obstacles . 109 B.3.1 Definitions Of Coordinate Frames Used . . . . . . . . . . . . . . 110 B.3.2 Event Frame E . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 B.3.3 EVDeBlurNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 B.3.4 EVHomographyNet . . . . . . . . . . . . . . . . . . . . . . . . . 116 B.3.5 EVSegFlowNet . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 B.3.6 Network Details . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 B.3.7 Loss Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 B.3.8 Compression Achieved by using EVSegFlowNet . . . . . . . . . 124 B.4 Multi Moving Object Event Dataset . . . . . . . . . . . . . . . . . . . . 125 B.4.1 3D room and moving objects . . . . . . . . . . . . 
. . . . . . . . 127 B.4.2 Dataset for EVDeblurNet . . . . . . . . . . . . . . . . . . . . . . 128 B.4.3 Dataset for EVSegNet, EVFlowNet and EVSegFlowNet . . . . . 128 B.4.4 Dataset for EVHomographyNet . . . . . . . . . . . . . . . . . . 130 B.5 Control Policy for Dodging Dynamic Obstacles . . . . . . . . . . . . . . 130 ix B.5.1 Sphere with known radius r . . . . . . . . . . . . . . . . . . . . 131 B.5.2 Unknown shaped objects with bound on size . . . . . . . . . . . . 133 B.5.3 Unknown objects with no prior knowledge . . . . . . . . . . . . . 133 B.5.4 Pursuit: A reversal of evasion? . . . . . . . . . . . . . . . . . . . 134 B.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 B.6.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . 137 B.6.2 Experimental Results and Discussion . . . . . . . . . . . . . . . 139 B.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 Chapter C: EVPropNet 145 C.1 Supplementary Material . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 C.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 C.2.1 Problem Formulation and Contributions . . . . . . . . . . . . . . 148 C.2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 C.2.3 Organization of the paper . . . . . . . . . . . . . . . . . . . . . . 151 C.3 Geometric Modelling of a Propeller . . . . . . . . . . . . . . . . . . . . 151 C.4 EVPropNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 C.4.1 Event Generation . . . . . . . . . . . . . . . . . . . . . . . . . . 156 C.4.2 Data Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 C.4.3 Network Architecture and Loss Function . . . . . . . . . . . . . . 160 C.5 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 C.5.1 Following . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 C.5.2 Landing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162 C.5.3 Quadrotor Location from Detected Propellers and Filtering . . . . 163 C.6 Experimental Results and Discussion . . . . . . . . . . . . . . . . . . . . 163 C.6.1 Quadrotor Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 C.6.2 Experimental Results And Observations . . . . . . . . . . . . . . 165 C.6.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171 C.6.4 Implementation Considerations . . . . . . . . . . . . . . . . . . . 175 C.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 Chapter D: MorphEyes 177 D.1 Supplementary Material . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 D.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 D.2.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181 D.2.2 Organization of the paper . . . . . . . . . . . . . . . . . . . . . . 181 D.3 Theoretical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181 D.4 Hardware and Software Design . . . . . . . . . . . . . . . . . . . . . . . 189 D.4.1 Hardware Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 189 D.4.2 Software Stack . . . . . . . . . . . . . . . . . . . . . . . . . . . 190 D.5 Experiments: Applications . . . . . . . . . . . . . . . . . . . . . . . . . 191 D.5.1 Quadrotor Platform . . . . . . . . . . . . . . . . . . . . . . . . . 191 D.5.2 Simulation Environment . . . . . . . . . . . . . . . . . . . . . . 191 D.5.3 Forest Navigation . . 
. . . . . . . . . . . . . . . . . . . . . . . . 192 D.5.4 Flying through a static/dynamic unknown shaped gap . . . . . . . 194 x D.5.5 Accurate IMO Detection . . . . . . . . . . . . . . . . . . . . . . 195 D.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196 Chapter E: SalientDSO 197 E.1 Supplementary Material . . . . . . . . . . . . . . . . . . . . . . . . . . . 200 E.2 Introduction and Philosophy . . . . . . . . . . . . . . . . . . . . . . . . 200 E.3 SalientDSO Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 204 E.4 Point selection based on visual saliency and scene parsing . . . . . . . . . 206 E.4.1 Visual Saliency Prediction . . . . . . . . . . . . . . . . . . . . . 206 E.4.2 Filtering saliency using semantic information . . . . . . . . . . . 207 E.4.3 Features/Points selection . . . . . . . . . . . . . . . . . . . . . . 209 E.5 Experimental Results and Analysis . . . . . . . . . . . . . . . . . . . . . 211 E.5.1 Quantitative Evaluation . . . . . . . . . . . . . . . . . . . . . . . 213 E.5.2 Qualitative Evaluation . . . . . . . . . . . . . . . . . . . . . . . 216 E.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218 Chapter F: PRGFlow 220 F.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222 F.1.1 Problem Definition and Contributions . . . . . . . . . . . . . . . 224 F.1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225 F.2 Pseudo-Similarity Estimation Using PRGFlow . . . . . . . . . . . . . . . 228 F.3 Table-top Experiments and Evaluation . . . . . . . . . . . . . . . . . . . 230 F.3.1 Data Setup, Training and Testing Details . . . . . . . . . . . . . . 230 F.3.2 Network Architectures . . . . . . . . . . . . . . . . . . . . . . . 231 F.3.3 Loss Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 232 F.3.4 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 234 F.3.5 Hardware Platforms . . . . . . . . . . . . . . . . . . . . . . . . . 236 F.4 Flight Experiments and Evaluation . . . . . . . . . . . . . . . . . . . . . 238 F.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . 239 F.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240 F.5.1 Algorithmic Design . . . . . . . . . . . . . . . . . . . . . . . . . 240 F.5.2 Hardware Aware Design . . . . . . . . . . . . . . . . . . . . . . 253 F.5.3 Trajectory Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 258 F.6 Summary And Directions For Future Work . . . . . . . . . . . . . . . . . 259 F.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260 Bibliography 262 xi List of Tables 1.1 Minimalist design of autonomous UAV behaviours. . . . . . . . . . . . . 5 A.1 Minimalist design of autonomous quadrotor (drone) behaviours. . . . . . 75 A.2 Comparison of different methods used for gap detection. . . . . . . . . . 95 A.3 Comparison of different methods used for tracking. . . . . . . . . . . . . 98 A.4 Comparison of our approach with different setups . . . . . . . . . . . . . 100 B.1 Quantitative evaluation of different methods for Homography estimation. . 140 B.2 Quantitative evaluation of different methods for Segmentation of IMO. . . 142 C.1 Parameters used in geometric model of the propeller. . . . . . . . . . . . 154 C.2 Detection Rate (%) ↑ of EVPropNet for variation in parameters. . . . . . . 166 C.3 Detection Rate (%) ↑ of AprilTags 3 for amount of tag blocked. . . . . . . 
167 C.4 Performance Metrics On Different Compute Modules. . . . . . . . . . . . 168 C.5 Different Propeller Configurations Used for Qualitative Evaluation in Fig. C.7. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170 C.6 Aratio FOR SOME COMMON COMMERCIAL DRONES. . . . . . . . . . . . 172 D.1 Relationship between ex and ey for errors in different parameters. . . . . . 184 E.1 Active vs Passive approach for computer vision tasks. . . . . . . . . . . . 203 E.2 Parameter settings for different datasets. . . . . . . . . . . . . . . . . . . 213 E.3 RMSEate on ICL-NIUM dataset in m. . . . . . . . . . . . . . . . . . . . . 215 E.4 ealign on TUM monoVO dataset in m. . . . . . . . . . . . . . . . . . . . . 215 E.5 Comparison of success rate between DSO and SalientDSO on CVL-UMD dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216 F.1 Different Computers Used on Aerial Robots. . . . . . . . . . . . . . . . . 235 F.2 Quantitative evaluation of different warping combination for Pseudo-similarity estimation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240 F.3 Quantitative evaluation of different network architectures for Pseudo-similarity estimation using T×2, S×2 warping block for large model (≤8.3 MB). . . 240 F.4 Quantitative evaluation of different network architectures for Pseudo-similarity estimation using T×2, S×2 warping block for small model (≤0.83 MB). . 240 F.5 Quantitative evaluation of different network inputs for Pseudo-similarity estimation using T×2, S×2 warping block for large model (≤8.3 MB). . . 241 F.6 Quantitative evaluation of different loss functions for Pseudo-similarity estimation using PS×1 warping block for large model (≤8.3 MB). . . . . 241 xii F.7 Quantitative evaluation of different compression methods for Pseudo-similarity estimation using PS×1 warping. . . . . . . . . . . . . . . . . . . . . . . 241 F.8 Comparison of PRGFlow with different classical methods. . . . . . . . . 242 F.9 Different-sized Quadrotor Configuration with respective computers. . . . 253 F.10 Trajectory evaluation for flight experiemtnts of PRGFlow. . . . . . . . . . 254 xiii List of Figures 1.1 Sensing, Control and Computation variation with respect to the amount of activeness used by the agent. . . . . . . . . . . . . . . . . . . . . . . 4 1.2 Algorithmic design philosophies for different sized robots along with their capabilities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.3 Amount of autonomous capabilities for different sized robots and living beings with respect to size. The red box shows where this thesis aims to take the autonomy of a nano-quadrotor. . . . . . . . . . . . . . . . . . . 9 1.4 Comparison of our proposed “bee” nano-quadrotor with birds and bees. (a) Sparrowhawk, (b) White Necked Jacobin Hummingbird, (c) Giant Honeybee, and (d) Our proposed “bee” nano-quadrotor. The number next to the brain and scale icon shows the number of neurons and the weight respectively. Note that the images are to relative size. . . . . . . . . . . . 9 1.5 A stack of images showing an owlet bobbing its head (see red highlight) to make perception easier. This is an example of an agent moving a part of it’s body to exhibit activeness. For original video see https://vimeo.com/152347964. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 1.6 Left to right: Color image of the scene, corresponding saliency map out- put by SalGAN [1]. 
The hotness of the saliency color corresponds to the value being higher. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 1.7 Left to right: Bidens ferulifolia flower as seen by Human vision, reflected UV, butterfly vision and bee vision. Note that altough images shown here for simulated butterfly and bee vision are at the same resolution as those seen human eyes, the real resolution of the eyes on these flying agents is much smaller. Photo credits and ©: Dr. Klaus Schmitt. . . . . . . . . . . 22 2.1 Different parts of the GapFlyt framework: (a) Detection of the unknown gap using active vision and TS2P algorithm (cyan highlight shows the path followed for obtaining multiple images for detection), (b) Sequence of quadrotor passing through the unknown gap using visual servoing based control. The blue and green highlights represent the tracked foreground and background regions respectively. . . . . . . . . . . . . . . . . . . . 29 2.2 A real quadrotor running EVDodgeNet to dodge two obstacles thrown at it simultaneously. From left to right in bottom row: (a) Raw event frame as seen from the front event camera. (b) Segmentation output. (c) Segmentation flow output which includes both segmentation and optical flow. (d) Simulation environment where EVDodgeNet was trained. (e) Segmentation ground truth. (f) Simulated front facing event frame. . . . . 32 xiv https://vimeo.com/152347964 2.3 Applications presented in this work using the proposed propeller detec- tion method for finding multi-rotors. (a) Tracking and following an un- marked quadrotor, (b) Landing/Docking on a flying quadrotor. Red and green arrows indicates the movement of the larger and smaller quadrotors respectively. Time progression is shown as quadrotor opacity. The insets show the event frames E from the smaller quadrotor used for detecting the propellers of the bigger quadrotor using the proposed EVPropNet. Red and blue color in the event frames indicate positive and negative events respectively. Green color indicates the network prediction. . . . . . . . . 35 2.4 Three applications of a variable baseline stereo system were explored in this work. (a) Flying through a forest, (b) Flying through an unknown shape and location dynamic gap, (c) Detecting an Independently Moving Object. In all the cases, the baseline of the stereo system is changing and is colored coded as jet (blue to red indicates 100 mm to 300 mm baseline). The opacity of the quadrotor/object shows positive progression of time. (d) Variation of baseline from 100 mm to 300 mm. Notice that the stereo system is bigger than the quadrotor at the largest baseline. . . . 37 2.5 Sample point-cloud output of SalientDSO which does not have loop clo- sure or global bundle adjustment. Each inset (color-coded to suite the re- spective location on the map) in clockwise direction from top left show the corresponding image, saliency, scene parsing outputs and active features. Observe that features from non-informative regions are almost removed approaching object-centric odometry. . . . . . . . . . . . . . . . . . . . 40 2.6 Size comparison of various components used on quadrotors. (a) Snap- dragon Flight, (b) PixFalcon, (c) 120 mm quadrotor platform with NanoPi Neo Core 2, (d) MYNT EYE stereo camera, (e) Google Coral USB ac- celerator, (f) Sipeed Maix Bit, (g) PX4Flow, (h) 210 mm quadrotor plat- form with Coral Dev board, (i) 360 mm quadrotor platform with Intel® Up board, (j) 500 mm quadrotor platform with NVIDIA® JetsonTM TX2. 
Note that all components shown are to relative scale. . . . . . . . . . . . 43 3.1 Proposed hierarchy of competences with the exterior ones needing the ones inside them, blue color indicates competences related to individual agents, green color indicates competences related to multiple agents and yellow bubbles show multiple agents. . . . . . . . . . . . . . . . . . . . 50 3.2 Illustration of the RoboBeeHive. . . . . . . . . . . . . . . . . . . . . . . 51 3.3 Different parts of the Bee nano-quadrotor. 1. Front facing RGB camera, 2. T-Motor F1404 Motors, 3. Tattu R-Line 3S 750mAh LiPo battery, 4. Raspberry Pi CM4 mated to a StereoPi v2 motherboard, 5. Gutted Google Coral USB Accelerator with custom heatsink, 6. Flywoo Goku F745 AIO Flight Controller and 4 in 1 ESC, 7. Downfacing RGB camera, 8. Optical flow sensor, 9. TFMini Lidar, 10. Gemfan 2540×3 propeller. Bottom left of the image shows a standard US quarter for scale reference. . . . . . . 56 xv 3.4 Different iterations of the Bee nano-quadrotor (number at the bottom left of each quadrotor shows the version number). Bottom left of the image shows a standard US quarter for scale reference. . . . . . . . . . . . . . 57 3.5 Left to right: Camera board, Interface board and RaspberryPi CM4. 1. Cameras, 2. Coral USB Accelerator attached to the PCIe port on the CM4, and 3. CM4 board sandwitched between the camera and interface board. Design based on https://grabcad.com/library/tpu-cam-with-cm4-1. Bottom left of the image shows a standard US quarter for scale reference. 58 3.6 Left to right: Raw RGB image (green overlay shows the detected flower and red overlay shows removed false positives based on geometry), HSV representation of the RGB image , yellow color thresholded using the Gaussian Mixture Model. . . . . . . . . . . . . . . . . . . . . . . . . . 58 3.7 Top Row: Different views of the GoGoBird ornithocopter used as a dy- namic obstacle. Bottom row (left to right): Consecutive RGB images taken from the front camera on the bee nano-quadrotor , optical flow color map and detected dynamic obstacle (Inset shows the color representation used, the hue of the color represents the direction and the saturation rep- resents the magnitude). Notice how the optical flow colors of the dynamic obstacle and the background regions are different and are easily clustered. 59 3.8 CAD Model of the BeeHive drone. 1. Propeller, 2. Flower petal flaps, 3. Flower petal servo motors, 4. Hook for perching. . . . . . . . . . . . . . 60 3.9 Different iterations of the Hive drone (left to right show progression of versions). Note that this image only shows the drone without the perching and bee holding flower mechanism which is under construction and was delayed due to COVID-19 causing machine shop closures and shipping delays. Bottom left of the image shows a standard US quarter for scale reference. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 3.10 Left to right: RGB Image of the pole used for perching, Depth Image of the pole (brighter is farther), Mask of the segmented pole. . . . . . . . . 61 A.1 Different parts of the pipeline: (a) Detection of the unknown gap using active vision and TS2P algorithm (cyan highlight shows the path followed for obtaining multiple images for detection), (b) Sequence of quadrotor passing through the unknown gap using visual servoing based control. The blue and green highlights represent the tracked foreground and back- ground regions respectively. Best viewed in color. . . . . . . . . . . . . . 
73 A.2 Components of the environment. On-set Image: Quadrotor view of the scene. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 A.3 Representation of co-ordinate frames. . . . . . . . . . . . . . . . . . . . 79 A.4 Label sets used in tracking. (blue: foreground region, green: background region, orange: uncertainty region, black line: contour, brighter part of frame: active region and darker part of frame: inactive region.) . . . . . . 82 xvi https://grabcad.com/library/tpu-cam-with-cm4-1 A.5 Tracking of F and B across frames. (a) shows tracking when Ci F > kF and Ci B > kB. (b) When Ci B ≤ kB, the tracking for B will be reset. (c) When Ci F ≤ kF , the tracking for F will be reset. (d) shows tracking only with B, when F = ∅. (blue: F , green: B, yellow: O, yellow dots: Ci F , red dots: Ci B, blue Square: xs,F , red Square: xs,B.) . . . . . . . . . . . . 86 A.6 The platform used for experiments. (1) The front facing camera, (2) NVIDIA TX2 CPU+GPU, (3) Downward facing optical flow sensor (cam- era+sonar) which is only used for position hold. . . . . . . . . . . . . . . 89 A.7 First two rows: (XW , YW ), (YW , ZW ) and (XW , ZW ) Vicon estimates of the trajectory executed by the quadrotor in different stages (gray bar indicates the gap). (XW , ZW ) plot shows the diagonal scanning trajectory (the lines don’t coincide due to drift). Last row: Photo of the quadrotor during gap traversal. (cyan: detection stage, red: traversal stage.) . . . . 90 A.8 Sequence of images of quadrotor going through different shaped gaps. Top on-set: Ξ outputs, bottom on-set: quadrotor view. . . . . . . . . . . . 91 A.9 Top Row (left to right): Quadrotor view at 0ZF = 1.5, 2.6, 3m respec- tively with 0ZB = 5.7m. Bottom Row: Respective Ξ outputs for N = 4. Observe how the fidelity of Ξ reduces as 0ZF → 0ZB, making the detec- tion more noisy. (white boxes show the location of the gap in Figs. A.9 to A.13.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 A.10 Comparison of different philosophies to gap detection. Top row (left to right): DSO, Stereo Depth, MonoDepth, TS2P. Bottom row shows the detected gap overlayed on the corresponding input image. (green: G ∩O, yellow: false negative G ∩ O′, red: false positive G ′ ∩ O.) . . . . . . . . . 93 A.11 Left Column: Images used to compute Ξ. Middle Column (top to bot- tom): Ξ outputs for DIS Flow, SpyNet and FlowNet2. Right Column: Gap Detection outputs. (green: G∩O, yellow: false negative G∩O′, red: false positive G ′ ∩ O. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 A.12 Top row (left to right): Quadrotor view at image sizes of 384×576, 192× 288, 96×144, 48×72, 32×48. Note all images are re-scaled to 384×576 for better viewing. Bottom row shows the respective Ξ outputs for N = 4. 96 A.13 Top two rows show the input images. The third row shows the Ξ outputs when only the first 2, 4 and all 8 images are used. . . . . . . . . . . . . . 97 A.14 Quadrotor traversing an unknown window with a minimum tolerance of just 5cm. (red dashed line denotes C.) . . . . . . . . . . . . . . . . . . . 97 A.15 Left to right columnwise: Side view of the setup, Front view of the setup, sample image frame used, Ξ output, Detection output - Yellow: Ground Truth, Green: Correctly detected region, Red: Incorrectly detected region. Rowwise: Cases in the order in Table A.4. Best viewed in color. . . . . . 101 xvii B.1 (a) A real quadrotor running EVDodgeNet to dodge two obstacles thrown at it simultaneously. 
(b) Raw event frame as seen from the front event camera. (c) Segmentation output. (d) Segmentation flow output which includes both segmentation and optical flow. (e) Simulation environment where EVDodgeNet was trained. (f) Segmentation ground truth. (g) Sim- ulated front facing event frame. All the images in this paper are best viewed in color. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 B.2 Overview of the proposed neural network based navigation stack for the purpose of dodging. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 B.3 (a) A sample simulation scene used for training our networks, (b) Sample objects used in (a), (c) sample scene textures used in (a). . . . . . . . . . 110 B.4 Representation of coordinate frames on the hardware platform used. (1) Front facing DAVIS 240C, (2) down facing sonar on PX4Flow, (3) down facing DAVIS 240B, (4) NVIDIA TX2 CPU+GPU, (5) Intel® Aero Com- pute board. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 B.5 Network Architectures used in the proposed pipeline. Left: EVDeblur- Net, Middle: EVHomographyNet and Right: EVSegFlowNet. Green blocks show the convolutional layer with batch normalization and ReLU activation, cyan blocks show deconvolutional layer with batch normal- ization and ReLU activation and orange blocks show dropout layers. The numbers inside convolutional and deconvolutional layers show kernel size, number of filters and stride factor. The number inside dropout layer shows the dropout fraction. N is 3 and 6 respectively for EVDeblurNet when us- ing losses D1/D2 and D3. N is 2 and 5 respectively for EVSegFlowNet when using losses D1/D2 and D3. . . . . . . . . . . . . . . . . . . . . . 117 B.6 Various Scene setups used for generating data. Red box indicates the scene used for generating out of dataset testing data to evaluate general- ization to novel scenes. . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 B.7 Moving objects used in our simulation environment. Left to right: ball, cereal box, tower, cone, car, drone, kunai, wine bottle and airplane. Notice the variation in texture, color and shape. Note that the objects are not presented to scale for visual clarity. . . . . . . . . . . . . . . . . . . . . 125 B.8 Random textures used in our simulation environment . . . . . . . . . . . 126 B.9 Different textured carpets laid on the ground during real experiments to aid robust homography estimation from EVHomographyNet. . . . . . . . 127 B.10 Vectors XIMO i,p and XIMO i+1,p represent the intersection of the trajectory and the image plane. xs is the direction of the “safe” trajectory. All the vectors are defined with respect to the center of the quadrotor projected on the image plane, O. Both of the spheres are of known radii. . . . . . . . . . . 132 B.11 Representation of velocity direction of multiple unknown IMOs. The vec- tor vIMO i and vIMO i+1 represent velocities of the corresponding objects. xs denotes the “safe” direction for the quadrotor. . . . . . . . . . . . . . . . 134 B.12 Objects used in experiments. Left to right: Airplane, car, spherical ball and Bebop 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 xviii B.13 Vicon estimates for the trajectories of the objects and quadrotor. (a) Per- spective and top view for single unknown object case, (b) perspective and top view for multiple object case. Object and quadrotor silhouettes are shown to scale. 
Time progression is shown from red to yellow for objects and blue to green for the quadrotor. . . . . . . . . . . . . . . . . . . . . . 135 B.14 Sequence of images of quadrotor dodging or pursuing of objects. (a)- (d): Dodging a spherical ball, car, airplane and Bebop 2 respectively. (e): Dodging multiple objects simultaneously. (f): Pursuit of Bebop 2 by re- versing control policy. Object and quadrotor transparency show progres- sion of time. Red and green arrows indicate object and quadrotor direc- tions respectively. On-set images show front facing event frame (top) and respective segmentation obtained from our network (down). . . . . . . . . 136 B.15 Output of EVDeBlurNet for different integration time and loss functions. Top row: raw event frames, middle row: deblurred event frames with D2 and bottom row: deblurred event frames with D3 with δt. Left to right: δt of 1 ms, 5 ms and 10 ms. Notice that only the major contours are preserved and blurred contours are thinned in deblurred outputs. . . . . . 136 B.16 Representation of coordinate frames on the hardware platform used. (1) Front facing DAVIS 240C, (2) down facing sonar on PX4Flow, (3) down facing DAVIS 240B, (4) NVIDIA TX2 CPU+GPU, (5) Intel® Aero Com- pute board. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 B.17 Output of EVHomographyNet for raw and deblurred event frames at dif- ferent integration times. Green and red color denotes ground truth and predicted H̃4Pt respectively. Top row: raw events frames and bottom row: deblurred event frames. Left to right: δt of 1 ms, 5 ms and 10 ms. Notice that the deblurred homography outputs are almost not affected by integra- tion time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 C.1 Applications presented in this work using the proposed propeller detec- tion method for finding multi-rotors. (a) Tracking and following an un- marked quadrotor, (b) Landing/Docking on a flying quadrotor. Red and green arrows indicates the movement of the larger and smaller quadrotors respectively. Time progression is shown as quadrotor opacity. The insets show the event frames E from the smaller quadrotor used for detecting the propellers of the bigger quadrotor using the proposed EVPropNet. Red and blue color in the event frames indicate positive and negative events respectively. Green color indicates the network prediction. All the event images in this paper follow the same color scheme. Vicon estimates are shown in corresponding sub-figures of Fig. C.8. All the images in this paper are best viewed in color on a computer screen at a zoom of 200%. 146 xix C.2 (a) Coordinate frames used for the geometric modelling of a propeller, (b) Blade coordinate definition, (c) Skew definition, (d) Coordinate axes for propeller projection on camera, and (e) Simplified model of the projection of the propeller blade; Each color represents a single spline and points with same color denote knots used to fit the cubic spline. Bi-color points are used as knots for both the splines of respective color. See Table C.1 for a tabulation of the variables used in this figure. . . . . . . . . . . . . . 152 C.3 Spatio-temporal event cloud E and Event frame E . The cloud shows that the propeller creates a helix in the spatio-temporal domain. The zoomed in view shows the propeller with positive events colored red and negative events colored blue along with network prediction as green with the color saturation indicating confidence. . . . . . . . . . . . . . . . . . . . . . . 
158 C.4 Sample event images E from the generated synthetic dataset used to train EVPropNet. Here red and blue colors show positive and negative events respectively. Green color indicates our ground truth label with the color saturation indicating confidence as defined by Eq. C.14. . . . . . . . . . . 158 C.5 Network architecture for EVPropNet (χ is a hyperparameter along with expansion rate – rate at which the number of neurons grow after each block). If no down/up-sampling rate is shown, it is taken to be 1. This image is best viewed on the computer screen at a zoom of 200%. . . . . . 159 C.6 (a) Smaller quadrotor on the bigger quadrotor used for landing experi- ments (Sec. C.5.1), (b) Gutted Coral USB Accelerator with custom heat sink used to run the neural networks, (c) Samsung Gen 3 DVS sensor used for experiments, (d) Bigger quadrotor used in the following experiments (Sec. C.5.2). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 C.7 Top rows: Input event frame E where red and blue colors show posi- tive and negative events respectively. Green color indicates EVPropNet prediction with the color saturation indicating confidence. Bottom rows: reference images of the propeller taken with a Nikon D850 DSLR (32dB dynamic range). Scenarios (a) to (h) are explained in Table C.5. . . . . . . 169 C.8 Vicon estimates for the trajectories of the smaller and larger quadrotor in the application experiments shown in Fig. C.1. (a) Tracking and follow- ing, (b) Mid-air landing. Time progression is shown from yellow to red for the smaller quadrotor and and green to blue for the bigger quadro- tor. The black dots in (b) show the moment in time where the touchdown occured. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 C.9 (a) Simplified model of a quadrotor used to calculate area ratios of the propellers to that of the biggest square fiducial marker that can be fit in the center without obstruction, (b) Simplified arm and motor projection to compute amount of propeller occluded from generating events – gray areas show where the propeller is visible and generates events, green area is occluded by the motor and blue area is occluded by the arm. . . . . . . 172 C.10 Variation of Detection Rate with variation in real-world propeller radius r for different (a) Focal lengths f with ϕ = 0◦, and (b) Camera Roll ϕ with f = 2.5mm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174 xx D.1 Three applications of a variable baseline stereo system were explored in this work. (a) Flying through a forest, (b) Flying through an unknown shape and location dynamic gap, (c) Detecting an Independently Moving Object. In all the cases, the baseline of the stereo system is changing and is colored coded as jet (blue to red indicates 100 mm to 300 mm baseline). The opacity of the quadrotor/object shows positive progression of time. All the images in this paper are best viewed in color on a computer. . . . 178 D.2 Error in pixel location due to error in various estimated intrinsic parame- ters. (a) ey vs. fe, (b) ex vs. αe, (c) ex vs. k1e, (d) ey vs. k2e, (e) ey vs. k3e, (f) ey vs. k5e. Notice that the X and Y scales for each of the plots is different though trend may seem similar. . . . . . . . . . . . . . . . . . . 185 D.3 Error in pixel location in right camera due to error in various estimated extrinsic parameters. (a) ex,R vs. Txe, (b) ey,R vs. Txe, (c) ex,R vs. ϕe, (d) ey,R vs. ϕe, (e) ex,R vs. θe, (f) ey,R vs. θe, (g) ex,R vs. ψe, (h) ey,R vs. ψe. 
Notice that the X and Y scales for each of the plots is different though trend may seem similar. . . . . . . . . . . . . . . . . . . . . . . . . . . . 186 D.4 Target vs. achieved baseline. The highlight shows the 10σ value. . . . . . 186 D.5 Max. velocity to have a disparity error lower than k px. vs. baseline for different time synchronization errors (δt). . . . . . . . . . . . . . . . . . 187 D.6 Variation of baseline from 100 mm to 300 mm. Notice that the stereo system is bigger than the quadrotor at the largest baseline. . . . . . . . . . 187 D.7 Quadrotor platform used for experiments. (1) RaspberryPi 3B+ compute module, (2) Stereo camera, (3) Actuonix linear servo, (4) T-Motor F40 III Motors, (5) T-Motor F55A 4-in-1 ESC, (6) Holybro Kakute F7 flight controller, (7) WiFi module, (8) Teensy 3.2 microcontroller, (9) 5045×3 propeller, (10) Optical Flow module, (11) TFMini lidar, (12) 3S LiPo battery. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188 D.8 Variable baseline stereo performance in simulated forest flight when com- pared to small and large baselines. Note that the large baseline system crashes (red curve) and small baseline system (blue curve) can traverse the scene but is about 4× slower than the variable baseline system. The baseline for the variable baseline case is color-coded as jet (blue to red indicates small to large baseline). . . . . . . . . . . . . . . . . . . . . . . 188 D.9 Sequence of images of quadrotor going through different shaped gaps: (a) Infinity, (b) Goku, (c) Rectangle. In all the cases, the baseline of the stereo system is changing and is colored coded as jet (blue to red indicates 100 mm to 300 mm baseline). . . . . . . . . . . . . . . . . . . . . . . . . . . 189 D.10 Variable baseline stereo performance in simulated 3D IMO detection when compared to small and large baselines. Note that both the large baseline system (red curve) and small baseline system (blue curve) lose detection of the IMO at different parts of the scene (dots). The baseline for the vari- able baseline case is color-coded as jet (blue to red indicates small to large baseline). Black curve (horizontal line at zero vertical axis) represents the ground truth trajectory of the object. . . . . . . . . . . . . . . . . . . . . 189 xxi E.1 Sample point-cloud output of SalientDSO which does not have loop clo- sure or global bundle adjustment. Each inset (color-coded to suite the re- spective location on the map) in clockwise direction from top left show the corresponding image, saliency, scene parsing outputs and active features. Observe that features from non-informative regions are almost removed approaching object-centric odometry. . . . . . . . . . . . . . . . . . . . 198 E.2 Sample point-cloud output of SalientDSO which does not have loop clo- sure or global bundle adjustment. Each inset (color-coded to suite the re- spective location on the map) in clockwise direction from top left show the corresponding image, saliency, scene parsing outputs and active features. Observe that features from non-informative regions are almost removed approaching object-centric odometry. . . . . . . . . . . . . . . . . . . . . 201 E.3 Algorithmic overview of SalientDSO, blue parts show our contributions. Here KF is the abbreviation for Key Frame. . . . . . . . . . . . . . . . . 205 E.4 Left column: Input image, Right column: Saliency overlayed on input image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
206 E.5 Variation of Saliency Map due to changes in illumination and viewpoint. Notice that the fixation still remains inside the same object but the saliency map varies. The crosses of respective color highlight the fixation in the respective images. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207 E.6 Point selection using different schemes. Top rows in (a) and (b), left to right: features selected using DSO’s scheme, saliency only, saliency+scene parsing. Bottom rows in (a) and (b), left to right: input image, saliency, scene parsing output. Notice how using saliency+scene parsing removed all non-informative features. (a) and (b) show images from ICL-NUIM and CVL-UMD datasets respectively. . . . . . . . . . . . . . . . . . . . . 211 E.7 Comparison of evaluation results for ICL-NIUM dataset. Left: DSO, Right: SalientDSO. Each square correspondes to a color coded error. Note that Salient DSO almost always has lower error than it’s DSO counterpart. 214 E.8 Comparison of evaluation results for TUM dataset. Left: DSO, Right: SalientDSO. Note that Salient DSO almost always has lower error than it’s DSO counterpart. Note that, for the TUM dataset scene parsing was turned off as TUM dataset only provides grayscale images and scene pars- ing outputs are very noisy for grayscale images. . . . . . . . . . . . . . . 214 E.9 Comparison of outputs for Np = 40 – very few features. (a) Success case of DSO with a large amount of drift, (b) Success case for SalientDSO, (c) Failure case of DSO where the optimization diverges due to very few features. Notice that SalientDSO can perform very well in these extreme conditions showing the robustness of the features chosen. . . . . . . . . . 217 E.10 Comparison of drift. (a) DSO’s output, (b) SalientDSO’s output, (c) Im- age corresponding to crop shown in the inset. Observe that SalientDSO’s output has the checkerboard from different times more closely aligned as compared to DSO. Here Np = 1000. . . . . . . . . . . . . . . . . . . . . 218 E.11 Sample outputs for TUM sequence 1. (a) DSO, (b) SalientDSO. Here Np = 1000. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218 xxii F.1 Size comparison of various components used on quadrotors. (a) Snap- dragon Flight, (b) PixFalcon, (c) 120 mm quadrotor platform with NanoPi Neo Core 2, (d) MYNT EYE stereo camera, (e) Google Coral USB ac- celerator, (f) Sipeed Maix Bit, (g) PX4Flow, (h) 210 mm quadrotor plat- form with Coral Dev board, (i) 360 mm quadrotor platform with Intel® Up board, (j) 500 mm quadrotor platform with NVIDIA® JetsonTM TX2. Note that all components shown are to relative scale. All the images in this paper are best viewed in color. . . . . . . . . . . . . . . . . . . . . 221 F.2 Different network architectures. (a) VanillaNet, (b) ResNet, (c) SqueezeNet, (d) MobileNet and (e) ShuffleNet. (χ and ξ are hyperparameters). Each architecture block is repeated per warp parameter prediction. This image is best viewed on the computer screen at a zoom of 200%. . . . . . . . . . 225 F.3 PRG Husky-360γ platform used in flight experiments. (a) Top view, (b) front view, (c) down-facing leopard imaging camera. . . . . . . . . . . . 232 F.4 (a) Accuracy, (b) Accuracy per Kilo param, (c) Accuracy per Kilo OP for different network architectures. Blue and orange histograms denote small (≤0.83 MB) and large (≤8.3 MB) networks respectively. Here the following shorthand is used for network names: VN: VanillaNet, RN: ResNet, SqN: SqueezeNet, MN: MobileNet and ShN: ShuffleNet. 
All networks use T×2, S×2 warping configuration. . . . . . . . . . . . . . . 243 F.5 Weight vs. FPS for VanillaNet4 (T×2, S×2) on different hardware and software optimization combinations. Left: small (≤0.83 MB) model, right: large (≤8.3 MB) model. The radius of each circle is proportional to log of volume of each hardware (this is shown in the legend below the plots with volume indicated on top of each legend in cm3). The out- line on each sample indicates the configuration of quantization or opti- mization used (Float32 (red outline) is the original TensorFlow model without any quantization or optimization, Int8-TFLite (black out- line) is the TensorFlow-Lite model with 8-bit Integer quantization and Int8-EdgeTPU (blue outline) is the TensorFlow-Lite model with 8-bit Integer quantization and Edge-TPU optimization. The samples are color coded to indicate the computer it was run on (shown in the legend on the bottom). Also note that, Laptop and PC (Deskop) weight and volume val- ues are not to actual scale for visual clarity in all images. All the figures in this paper use the same legend and color coding for ease of readability. 248 F.6 Weight vs. FPS for ResNet4 (T×2, S×2) on different hardware and soft- ware optimization combinations. Left: small (≤0.83 MB) model, right: large (≤8.3 MB) model. The radius of each circle is proportional to log of volume of each hardware. . . . . . . . . . . . . . . . . . . . . . . . . 249 F.7 Weight vs. FPS for SqueezeNet4 (T×2, S×2) on different hardware and software optimization combinations. Left: small (≤0.83 MB) model, right: large (≤8.3 MB) model. The radius of each circle is proportional to log of volume of each hardware. . . . . . . . . . . . . . . . . . . . . . 249 xxiii F.8 Weight vs. FPS for MobileNet4 (T×2, S×2) on different hardware and software optimization combinations. Left: small (≤0.83 MB) model, right: large (≤8.3 MB) model. The radius of each circle is proportional to log of volume of each hardware. . . . . . . . . . . . . . . . . . . . . . 249 F.9 Weight vs. FPS for ShuffleNet4 (T×2, S×2) on different hardware and software optimization combinations. Left: small (≤0.83 MB) model, right: large (≤8.3 MB) model. The radius of each circle is proportional to log of volume of each hardware. . . . . . . . . . . . . . . . . . . . . . 250 F.10 Weight vs. FPS for the best model architecture on each hardware cou- pled to the best software optimization combination. The radius of each circle is proportional to log of volume of each hardware. The best model architecture and model optimization for each hardware are: Up: ResNetS- Float32, CoralDev: ResNetS-Int8-EdgeTPU, CoralUSB: ResNetS- Int8-EdgeTPU, NanoPi: ResNetS-Int8, BananaPiM2-Zero: ResNetS- Int8, TX2: SqueezeNetS-Float32, Laptop-i7: SqueezeNetS-Float32, Laptop-1070: SqueezeNetS-Float32, PC-i9: SqueezeNetS-Float32, PC-TitanXp: SqueezeNetS-Float32. All networks use T×2, S×2 con- figuration and S and L subscripts indicate small and large networks re- spectively. The best network for each hardware was chosen with the avg. error ≤ 2.5 px. and the configuration which gives the higest FPS. . . . . . 250 F.11 Num. of Params vs. FPS (a) when only increasing the depth of the net- work while keeping width constant, (b) when only increasing the width of the network while keeping depth constant, (c) when increasing a com- bination of depth and width of the network for different computers. . . . . 251 F.12 Total Power vs. Quadrotor Size at hover. 
Each sample is a pie chart which shows the percentage of power consumed by the motors in red and the compute and sensing power in blue. The radius of the pie chart is proportional to the power efficiency (in g/W, given as the ratio of hover thrust to hover power). Refer to the legend at the bottom (gray circles), with the numbers on top indicating power efficiency in g/W.
F.13 Comparison of the trajectory obtained by dead-reckoning our estimates (red) with the ground truth (blue) for quadrotor flight in various trajectory shapes. (a) Circle, (b) Moon, (c) Line, (d) Figure8 and (e) Square.

Chapter 1: Introduction

Robots have always been imagined as intelligent agents that can solve any problem faster, cheaper and better than human beings. Such an imagination has been prevalent since the 1800s. Despite major advances in technology, autonomy in robots falls significantly short of the predictions made in the past along with the expectations of most people. Most successful autonomous robots today are generally big, bulky and targeted towards a particular set of tasks such as cleaning floors, assembling cars and so on. To build a general purpose robot, such as those shown in TV shows, which possesses capabilities similar to humans, we need to utilize concepts used by living agents for both hardware and software co-design. The capabilities of such a general purpose robot in the wild can be categorized as: (a) Navigation, (b) Human-robot interaction and finally (c) Physical interaction or Manipulation.

Navigation involves moving around in space without running into static or dynamic obstacles such as trees and humans. Human-robot interaction involves the ability to understand humans and reason about their intent by interacting with them through a natural language to disambiguate hard-to-parse queries. Finally, physical interaction or manipulation involves the ability to alter the world by picking, moving and nudging objects.

This thesis focuses on the first capability: Navigation, specifically tailored towards aerial robots or Unmanned Aerial Vehicles (UAVs). However, the methods can be easily adapted to ground robots with minimal effort. This fundamentally involves answering the following three questions: Where am I? What is a hazard while I am moving? Where should I go next? This capability has been traditionally achieved using computer vision algorithms with the aim of building a representation of general applicability: a 3D reconstruction of the scene. Using this representation, planning tasks are constructed and accomplished to allow the quadrotor to demonstrate autonomous behavior. Note that I will use the words aerial robot(s), Unmanned Aerial Vehicle(s) (UAV(s)), quadrotor(s) and drone(s) interchangeably; they refer to the same entity unless specified otherwise.

1.1 Active Agents

Although aerial robots are inherently active agents, their perceptual capabilities in the literature so far have been mostly passive in nature. Researchers and practitioners today use traditional computer vision algorithms with the aim of building a representation of general applicability: a 3D reconstruction of the scene. Using this representation, planning tasks are constructed and accomplished to allow the robot to demonstrate autonomous behavior.
However, this is in stark contrast to the methodology used by living agents such as birds and bees, which have been solving these problems for ages with relative ease and extreme efficiency. These living beings utilize their activeness (the ability to control the movement of their bodies or a part thereof) to simplify perception problems by building specific task-driven sensorimotor loops (combinations of perception, planning and control). This thesis is built upon this philosophy: the agent can control its own movement, or the movement of a part of its body, to make its perception problem simpler. This is due to the additional constraints introduced by moving in particular ways. Such a movement to manipulate perception forms a sensorimotor loop: a perception, planning and control loop to solve the task at hand. Note that solving a real-world task can utilize multiple such sensorimotor loops.

1.2 Active vs Passive Approaches to Perception

Different sets of tasks or competences of an aerial robot have traditionally been achieved with passive perception (or vision), which is based on human or primate vision and involves sensing the world in 3D. This philosophy revolves around obtaining a 3D map first and then utilizing it for various tasks. However, many tasks rarely require a full 3D map of the scene to be accomplished, and such designs are hence not minimalist (they do not use the minimum power, computation or number of sensors) by virtue of their design. The major advantage of such a system is that it is agnostic to the morphology of the robot and can be almost directly adapted to different shapes, sizes and kinds of robots, given that they possess the required computation, sensing and power on-board.

On the contrary, active perception (or vision) adapts the design philosophy to the current operating constraints of computation, sensing and power. Although such an approach is not directly transferable to different agent morphologies, it is generally more power-efficient for the set of tasks it is designed for. Such an active design method is task driven and utilizes the minimal amount of information, computation and power required for the task. It can inherently handle the risk of a sensor failure by virtue of its design, utilizing exploration to gather more information, and is generally more robust, although it might take more time compared to its passive counterpart (see Fig. 1.1).

Figure 1.1: Sensing, control and computation variation with respect to the amount of activeness used by the agent.

Table 1.1 shows a comparison of different behaviours of a UAV using both the active and passive design philosophies. Notice that the integration of different behaviours is harder in the active approach. To make this problem easier and more tractable, we propose to re-use multiple of these competences by conceptualizing each agent as a set of hierarchical sensorimotor loops (or competences or behaviours). This makes it easier to adapt the agent to related but different sets of problems just by changing the checking condition (the condition used to see if a sensorimotor loop needs to be terminated). For example, in flower pollination one would check for a flower, and this could easily be changed to check for survivors in a search and rescue task, as sketched below.
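To make the notion of a re-usable sensorimotor loop with a swappable checking condition concrete, here is a minimal Python sketch (illustrative only, not code from this thesis); the helpers sense, plan, act, flower_detected and survivor_detected are hypothetical placeholders.

    def sensorimotor_loop(sense, plan, act, check_done, max_steps=1000):
        """Run a perceive-plan-act loop until the task-specific checking condition fires."""
        for _ in range(max_steps):
            observation = sense()            # perception
            if check_done(observation):      # checking condition: terminate this loop
                return True
            act(plan(observation))           # planning and control
        return False

    # Re-targeting the loop to a new task only swaps the checking condition:
    # pollination:       sensorimotor_loop(sense, plan, act, check_done=flower_detected)
    # search and rescue: sensorimotor_loop(sense, plan, act, check_done=survivor_detected)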
Table 1.1: Minimalist design of autonomous UAV behaviours.

Competence | Passive Approach | Active and Task-based Approach
Kinetic stabilization | Optimization of optical flow fields | Sensor fusion between optical flow and IMU measurements
Obstacle avoidance | Obtain 3D model and plan accordingly | Obtain flow fields and extract relevant information from them
Segmentation of independently moving objects | Optimization of flow fields | Fixation and tracking allows detection
Homing | Application of SLAM | Learn paths to home from many locations
Landing | Reconstruct 3D model and plan accordingly | Perform servoing of landing area and plan appropriate policy
Pursuit and Avoidance | Reconstruct 3D model and plan accordingly | Track while in motion
Integration: Switching between behaviors | Easy: the planner interacts with the 3D model | Hard: an attention mechanism decides the switching between behaviors

1.3 Forms of Activeness on a UAV

Activeness on a UAV, or any robot in general, can be accomplished in multiple ways: 1. By moving the agent itself, 2. By employing an active sensor, 3. By moving a part of the agent's body, 4. By hallucinating active movements.

In the first approach, the entire agent moves such that the perception problem becomes simpler. Such an approach is generally used by smaller robots, where moving the entire agent is less power hungry than adding another sensor and thereby increasing the robot's weight and computation.

The second approach utilizes an active sensor, i.e., a sensor which only produces data when movement is present. One class of sensors inspired by animal eyes is the event camera. Such cameras only record asynchronous intensity changes in light rather than traditional image frames. They have a higher dynamic range and lower latency compared to classical cameras. However, such a sensor lacks the ability to perform recognition from a single timestep due to the lack of dense data when little or no movement is present. On the other hand, these sensors excel at tasks that involve movement, which is the core concept of activeness. This approach is useful when recognition of static objects is not required or when high speeds and severe illumination changes might be encountered.

The third approach changes the robot's body morphology to enable simpler perception. Such an approach can be used to make the robot as small as required while utilizing a bigger set of sensors, and is desirable when moving the whole robot is less power efficient than adding additional components to enable movement of the sensor suite. Such a method can also simplify certain perception problems by directly estimating depth.

Finally, the last approach entails utilizing a method which hallucinates an active observer. For example, one could hallucinate a heatmap of the microsaccades a human being might exhibit while looking at an image. Such a method is computationally more expensive but can be utilized when the power used by computation is far less than that used by moving the agent or a part of it.

1.4 Hardware and Software Co-design

Keen readers would have realized that the active approach to designing algorithms is dictated by the amount of sensing, computation and power supply on-board, which are commonly called Size, Weight, Area and Power (SWAP) constraints. This is in unison with what Alan Kay, a pioneering computer scientist, said in the 1980s: "People who are really serious about software should make their own hardware." Thus, understanding the hierarchy of autonomous systems and applying it in engineering becomes a problem of synergistic hardware and software co-design.
This multidimensional optimization problem across different strata (hardware – integrated chips, sensors and effectors; software – the set of programs running on the system) is a new research area that has the potential to lead to a disruptive technology in this field. We call this field "Embodied AI". This is directly inspired by how nature has taken ages to solve the optimization problem between hardware (sensing using eyes and other sensors) and software (neural architecture), and it is still being refined by a process we call 'Genetic Evolution' [2].

This multidimensional problem is intractable if we try to study all combinations of sensor placements, computation and algorithms. Hence, to make the problem tractable, we limit all our algorithms to be implemented using deep learning modules which obey SWAP constraints, and we limit ourselves to the morphology of a quadrotor. However, the sensor suite is allowed to change between a monocular traditional or event camera and a stereo camera.

1.5 When is Active design useful?

The next obvious question that comes to mind is when such an active design philosophy is useful. An active philosophy using the agent's movement, or the movement of a part of its body, is useful when the agent cannot carry passive sensors that can sense the required quantity (such as depth for mapping) due to SWAP constraints. That being said, an active sensor can still be used regardless of the size of the robot for low-latency applications. Furthermore, an active mechanism on a large robot (with a SWAP budget large enough for a myriad of sensors) can act as a fallback mechanism when one or more of the sensors fail. This could prevent the UAV from crashing by landing it safely. An active philosophy is also useful when uncertainty is expected in either the environment or the sensing. In stark contrast to classical computer vision approaches, the agent would move to explore the world and gather more information for a confident prediction.

Fig. 1.2 shows how the ability of a robot is severely affected as the SWAP constraints change. This shows that we need to use the active design philosophy as SWAP constraints become tighter (the UAV becomes smaller). This is important as smaller UAVs are safer, more agile and scalable for use as swarms. But today's passive algorithmic design philosophy has bounded their abilities severely, and the level of autonomy of these smaller UAVs is very far from that of their larger counterparts. This shows how tightly coupled the design philosophy is with the SWAP constraints.

Figure 1.2: Algorithmic design philosophies for different sized robots along with their capabilities.

However, when we bring living beings into the same plot, we see from Fig. 1.3 that the difference between autonomy levels is not as significant even though the size of living beings changes drastically. This is because of the adaptation of sensing, neural architectures and the methodology used by these agents, which scale well with their SWAP constraints. Also, notice that the level of autonomy possessed by living agents is far higher than that of our man-made autonomous UAVs.

Figure 1.3: Amount of autonomous capabilities for different sized robots and living beings with respect to size. The red box shows where this thesis aims to take the autonomy of a nano-quadrotor.

Figure 1.4: Comparison of our proposed "bee" nano-quadrotor with birds and bees. (a) Sparrowhawk, (b) White-necked Jacobin hummingbird, (c) Giant honeybee, and (d) our proposed "bee" nano-quadrotor. The number next to the brain and scale icon shows the number of neurons and the weight respectively. Note that the images are shown at relative size.
A detailed comparison between a sparrowhawk, a hummingbird, a bee and our "bee" nano-quadrotor is shown in Fig. 1.4. One can observe that our nano-quadrotor has more computation than the sparrowhawk, weighs almost the same as the sparrowhawk and is around the size of the hummingbird, yet has capabilities similar to those of a bee. This is because our technology and design methodology are not as efficient as nature's, and this area has a huge scope for further research. This thesis is one of the few works that has taken baby steps in this area of on-board nano-UAV autonomy using an active approach. Next, I will describe some applications of a nano-UAV (specifically a quadrotor that weighs < 250 g with a maximum motor-to-motor diagonal size of 120 mm) based on the active design philosophy.

1.6 Applications of an Active Nano-Quadrotor(s)

Quadrotors, out of all possible aerial robots, have gained massive popularity due to the simplicity of their mechanical design: a frame with four motors where diagonally opposite motors spin in the same direction. Such a design is inherently stable due to counteracting torques and can be directly controlled by changing motor speeds, since it is a differentially flat system.
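As a concrete illustration of the "directly controlled by changing motor speeds" statement, the standard textbook rigid-body model (not specific to this thesis) maps squared rotor speeds to the collective thrust and body torques. Assuming a plus configuration with motors 1 and 3 on the body x-axis, motors 2 and 4 on the body y-axis, arm length l, rotor thrust coefficient k_f and rotor drag coefficient k_m (the signs depend on the motor numbering and spin directions assumed here):

    \begin{equation}
    \begin{bmatrix} T \\ \tau_x \\ \tau_y \\ \tau_z \end{bmatrix} =
    \begin{bmatrix}
    k_f & k_f & k_f & k_f \\
    0 & l\,k_f & 0 & -l\,k_f \\
    -l\,k_f & 0 & l\,k_f & 0 \\
    k_m & -k_m & k_m & -k_m
    \end{bmatrix}
    \begin{bmatrix} \omega_1^2 \\ \omega_2^2 \\ \omega_3^2 \\ \omega_4^2 \end{bmatrix}
    \end{equation}

Since this mixing matrix is invertible, any desired thrust-torque command can be realized by rotor speeds alone, which is what makes the quadrotor directly controllable and amenable to differential-flatness-based planning.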
A quadrotor has the following advantages over a traditional fixed-wing aircraft: it can hover in place, generally has a higher payload for the same size and can fly faster for the same size. Quadrotors also do not need a runway or a catapult mechanism to take off and can be easily deployed fully autonomously in the field.

In particular, nano-quadrotors are defined in this thesis as quadrotors with a maximum side of 85 mm (motor-to-motor diagonal dimension of 120 mm) and a maximum All Up Weight (AUW) of 250 g. These vehicles are safe, agile and can carry enough payload to be autonomous with off-the-shelf components, making them easy to repair and distribute for further research even in developing countries. A few applications of such nano-quadrotors (one or many) are described next.

Fast exploration and mapping
Nano-quadrotors can be used to map large areas quickly using a centralized or decentralized mapping server, where a fast map can be generated on-board and further optimized once downloaded to a large server. Such a method can be very useful when one can deploy drones with different sensor suites to gather data from different spectra. This would be useful for mapping bridges and pipes where gaps are small and large drones cannot traverse them. Such nano-drones can also be used in sports to obtain camera angles that were not possible before.

Search and Rescue
Nano-UAVs are especially suited for search and rescue since they can be readily deployed with minimal effort in the field and can traverse gaps of unknown shape, size and location. Such a swarm with different sensor suites can be used to find survivors (for example, using a thermal camera). Nano-UAVs can traverse terrain that might not be accessible to ground robots or larger UAVs due to their size, agility and full 3D maneuverability. Because of their cost-effectiveness, such small nano-UAVs can be used as 'disposable information capture devices' in areas such as nuclear plants. Finally, they can also provide a bird's eye view of the scene along with 3D reconstructions for necessary evaluation and for collaboration with other ground robots.

E-sports and Hobbying
Drones have become popular due to a sport called drone racing and have carved out a new space that blends reality with computer games. These drone races are manually piloted, and recently the AlphaPilot competition from Lockheed Martin aimed to do the same in a completely autonomous manner with on-board sensing and computation; it was won by Prof. Guido de Croon's team from TU Delft using a monocular and active approach (https://mavlab.tudelft.nl/mavlab-wins-the-alpha-pilot-challenge/). These features could be built into drones of smaller sizes to make them safer for learning and hobby flying, so that they do not hurt birds in flight. One such step has been taken by DJI in the form of the DJI FPV drone (which weighs about 1 kg), but it is still far from being a nano-quadrotor.

Co-operative delivery
Although nano-UAVs cannot individually carry large payloads, they can be combined (either using rigid links or tethered cables) to carry larger payloads. Such an approach would be desirable during search and rescue, as smaller drones are easier to deploy and faster at exploration, yet can be combined to lift larger payloads.

Inspection
Drones are a popular tool for inspecting large structures where access is limited for ground robots or where using human labor is dangerous, such as under-bridge structures [3], historical monuments [4] or radioactive areas [5]. Recently, Skydio announced an active approach called 3D Scan (https://www.skydio.com/3d-scan) to map and inspect any structure in a GPS-denied environment using only on-board computation and sensing. In addition, a research direction could be to use a gimbal camera with zoom that also uses active perception to reduce the overall control effort of moving the entire vehicle to obtain a higher-resolution image for mapping. Such an approach would make it safer when tight areas have to be mapped with a high degree of accuracy.

1.7 Research Objectives

Since I started my doctoral studies in 2016, there has been astounding progress in the field of quadrotors, and especially autonomous ones. One of the major factors that influenced this growth was the Drone Racing League (DRL), which was launched publicly in January 2016. Since then, there has been rapid development and availability of hobby-grade parts such as sensors, motors and on-board flight controllers that are on par with their commercially available counterparts and those used in top-of-the-line research. This trend has also set off a cycle where Ph.D. students work with hobbyists to build autonomous drones (different multirotors and fixed-wing planes) using hobby-grade parts so that such research is accessible to the general public. To fuel this growth further, there has been rapid advancement in the field of on-board algorithms involving visual perception (visual inertial odometry), better sensor fusion, motion planning, control schemes and decision making. The recent deep learning trend has also been shown to improve robustness over classical methods if the training data is generated appropriately. These networks can also be trained in simulation and transferred to the real world with minimal effort if the problems are chosen carefully. Furthermore, these networks can be hardware-accelerated to obtain speeds not possible with classical methods, without extensive reprogramming to fit the underlying computer architecture.

Although electronics and avionics exist for building efficient nano-quadrotors (defined here as a maximum diagonal motor-to-motor size of 120 mm and an AUW of 250 g), they rarely possess autonomous features. This is because of a large number of open challenges in scaling down autonomy with limited sensing and computation. With the advent of special deep learning accelerator chips, open-source simulation tools for data generation, new-generation sensors and new FAA rules for small drones, now is the perfect time to make these nano-quadrotors autonomous. This research direction has also caught the attention of many top-notch researchers across the world due to DARPA's Fast Lightweight Autonomy project, where the goal was to build an autonomous quadrotor that can fly up to 20 m/s with on-board sensing and computation.

The myriad challenges involved in making nano-drones autonomous span multiple domains and multiple Ph.D. theses, and will take an effort from researchers all across the world to address. My doctoral work focuses on a few key components which revolve around the central philosophy of active perception. Each of my works has a philosophical ideology of active perception with a practical application of immediate impact. I particularly showcase four forms of active perception (combining perception, planning and control into a single sensorimotor loop to make the problem simpler to solve): 1. By moving the agent itself, 2. By employing an active sensor, 3. By moving a part of the agent's body, 4. By hallucinating active movements. Next, to make this work practically applicable, I show how hardware and software co-design can be performed to optimize the form of active perception to be used. Finally, I present the world's first prototype of a RoboBeeHive that shows how to integrate multiple competences centered around active vision in all its glory.

Today's drone autonomy in indoor (and sometimes outdoor) spaces has traditionally been achieved using motion capture setups, which are expensive, have a large setup and calibration time, and strongly limit the flying area. The key advantages of such a system are its sub-millimeter accuracy, constant and low latency of about 10 ms, and high update rate of up to 200 Hz. However, such a system is not applicable when drones need to be deployed in the wild with wind, changing terrain, varied visual features and illumination, and other external obstacles. This necessitates on-board perception (sensing) and computation. Such an on-board system also has the added advantage of preserving privacy and security (by being harder to hack into). In this thesis, I focus on on-board sensing and perception to solve fundamental problems for aerial robot autonomy: static and dynamic obstacle avoidance, flying through gaps, and homing (where do I find home?). All these problems are tackled by controlling the agent's movement in a way that simplifies the perception problem (forming a sensorimotor loop), following the active perception philosophy.

1.8 State of the Art

I will next describe the state of the art in various related areas that can potentially advance the area of autonomy on nano-UAVs.
Active Perception by moving the agent
Active perception is the concept in which an agent controls its movements (control and planning) in order to simplify the perception problem. This is in stark contrast to passive perception, where a 3D map is first constructed, motion planning is then performed on it, and the result is used to control the robot's movement. Such a passive approach relies on the assumption that the perception information obtained is accurate and robust, which is rarely the case in the wild. The passive approach arose because it is the most evident way to set up a mathematical optimization problem that can be sub-divided into multiple fields so that fast progress can be achieved. Now that the field has matured to the point where we have full-fledged products based on this philosophy, academia needs to think about the next conceptualization of a robot. Revisiting what experts of the field put forth about three decades ago: we need to combine perception, planning and control into a single entity, the sensorimotor loop. This has been called by different names: active vision, animat vision, or perception-aware planning. This literature has gained a lot of momentum in the last five years due to active vision's robustness, built-in contingency planning and minimalism. The literature spans multiple robot platforms such as quadrotors, ground robots and humanoids, to name a few, and is dominant where robots can carry only a limited payload in terms of sensors and computation.

Active view planning
Another way planning and perception have been tightly coupled is through an area called active view planning. Here, the planner takes into account the current or accumulated perception information to plan a path on the fly to obtain a better view for the desired task. For example, when mapping a bridge [6], spots under the bridge are generally dark and require closer views, and the view planner takes this into account. Such planning is driven by statistical measures such as entropy or the estimated coverage of the map built so far. Another flavor of active view planning comes in the form of keeping some points of interest in the field of view while executing a desired trajectory. Such a problem is tackled by formulating minimum-time trajectories for quadrotors with a limited field-of-view camera to ensure that the points of interest are always in the field of view. Alternative formulations have been presented to control yaw for autonomous aerial cinematography [7]. Furthermore, a few approaches also consider obstacle avoidance when planning such perception-aware trajectories [8].

Active Sensing using event cameras
Over the last decade, there has been an enormous advancement in sensor technology, especially imaging sensors or cameras. Even after these significant advancements, the dynamic range and latency of these cameras are nowhere comparable to those possessed by living agents, especially ones that can fly, such as birds and bees. This observation motivated neuromorphic engineers to develop a new class of imaging sensors called event cameras. These event cameras have 'smart pixels', wherein each pixel is asynchronous and records changes in light rather than the traditional intensity measurements used to create classical image frames. Such a sensor outputs an event cloud rather than an image frame, where each event contains the pixel location and polarity (along with a timestamp).
The polarity takes values from the set {+1, −1, 0}, where +1 (−1) indicates an intensity increase (decrease) and 0 indicates that no event was triggered. Since only the intensity changes are recorded, bandwidth savings of up to orders of magnitude are obtained. Such a sensor also has an ultra-high dynamic range of up to 100 dB, which is much higher than that of a traditional camera [9]. The latency of such a camera can be on the order of a few microseconds, which is two to three orders of magnitude lower than that of a comparable classical camera. Since event sensor data is tightly coupled to the sensor's movement and the scene, event cameras are also called active sensors, in the sense that one has to move (or the scene has to move) to obtain data. To reiterate, the event camera has a high dynamic range, low latency and low bandwidth, and is particularly suited for an active agent.

As one would expect intuitively, the larger the perception latency, the slower the robot can respond to abrupt or dynamic changes in the environment or scene. However, robots in the wild often encounter dynamic obstacles such as humans, birds and insects, along with unstructured obstacles in collapsed buildings such as falling rocks. Although in theory one could recognize the objects in the scene to obtain dynamic obstacles, such methods are not robust to the motion blur and drastic illumination changes that classical cameras suffer from in dynamic, in-the-wild scenarios. Event cameras, by virtue of their design, are perfectly suited for the task of dynamic obstacle detection. In the literature, event-camera-based detection of Independently Moving Objects (IMOs, or dynamic obstacles) has been performed in two major ways: by motion-compensating the event volume (or a frame formed as a projection of events) [10, 11], or by learning to predict the segmentation masks directly [12, 13]. In the first method, the algorithms construct an Image of Warped Events (IWE), or event frame, and use a measure of the IWE's contrast or sharpness as the metric for warping the event cloud to obtain a sharp IWE. The blurry parts of the resulting image are the regions that do not comply with the motion model of the 'background' (the parts of the scene that are not IMOs); these regions form the 'foreground', or IMOs. Such an approach is robust but generally slow due to the processing of 3D spatio-temporal event data; depending on the amount of motion and the scene, it can use a lot of memory and be slower than real-time (defined as 50 Hz). In the second method, a network is trained on these input event frames (without sharpening) to directly predict the pixel locations that belong to IMOs. Such a method is faster, and is robust if enough training data is provided.

Owing to their myriad advantages over classical cameras, event cameras have also gained the attention of other roboticists for building visual odometry [14, 15, 16] and Simultaneous Localization And Mapping (SLAM) algorithms [17]. Event cameras have also been utilized on humanoid robots [18] and on other aerial robots to track objects of interest.
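The motion-compensation family of methods described above can be illustrated with a minimal NumPy sketch (an illustration of the general contrast-maximization idea, not the implementation used in this thesis). It assumes events are given as (x, y, t, polarity) rows, models the background with a single global image-plane velocity, and uses the variance of the warped-event image as the sharpness score; real pipelines use richer motion models and gradient-based optimization rather than the grid search shown here.

    import numpy as np

    def warp_events_to_iwe(events, flow, t_ref, img_shape):
        """Accumulate events into an Image of Warped Events (IWE).
        events: (N, 4) array of (x, y, t, polarity); flow: (vx, vy) in px/s."""
        x = events[:, 0] - flow[0] * (events[:, 2] - t_ref)
        y = events[:, 1] - flow[1] * (events[:, 2] - t_ref)
        xi = np.clip(np.round(x).astype(int), 0, img_shape[1] - 1)
        yi = np.clip(np.round(y).astype(int), 0, img_shape[0] - 1)
        iwe = np.zeros(img_shape, dtype=np.float32)
        np.add.at(iwe, (yi, xi), 1.0)      # count warped events per pixel
        return iwe

    def sharpness(iwe):
        """Contrast metric: a well-compensated background yields a high-variance IWE."""
        return float(np.var(iwe))

    def fit_background_flow(events, img_shape, candidate_flows):
        """Pick the candidate flow that maximizes IWE contrast. Pixels that remain
        blurry under this best warp are candidates for independently moving objects."""
        t_ref = float(events[:, 2].min())
        scores = [sharpness(warp_events_to_iwe(events, f, t_ref, img_shape))
                  for f in candidate_flows]
        return candidate_flows[int(np.argmax(scores))]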
Active Perception by moving a part of the agent
One can also achieve activeness by moving a part of the agent rather than the entire agent itself (such an example in nature can be seen when owlets bob their heads around in circles, see Fig. 1.5). Morphable robot designs in their early days [19, 20, 21, 22, 23, 24] were meant to showcase that such designs could be made functional, and were targeted towards the mechanical and control design of these systems. In the last decade, gimbal systems on drones have become so common that even a cheap hobby drone is generally equipped with one. One of the major driving forces for such a design is aerial videography, to obtain smooth footage with little to no post-processing. This also inspired roboticists to build rigs with pan, tilt and zoom cameras [25] on robots so that they could track objects faster by moving the camera rather than the entire robot. Similarly, stereo systems were built on the same principle to enable better tracking and depth estimation by changing the baseline.

Figure 1.5: A stack of images showing an owlet bobbing its head (see red highlight) to make perception easier. This is an example of an agent moving a part of its body to exhibit activeness. For the original video see https://vimeo.com/152347964.

Hallucinated Activeness
Activeness, as discussed before, comes in various forms and involves some amount of physical movement. However, similar to the difference in the way older agents (or humans) approach problems as compared to younger ones, activeness changes its form as more data or information is acquired. For example, a bird does not have to explore the area around its nest since it already knows how the area looks, having created a mental picture of it. In such a case, activeness becomes closer to passiveness with one single difference: the activeness is hallucinated, or is in the imagination of the agent.

Hallucinated activeness in its most simplified form can be captured by visual saliency: the amount of time an agent's gaze rests on different parts of the image (see Fig. 1.6 for an example of how such a map looks). Such a saliency method depends on the context and is generally a top-down approach. For example, if one is looking for a yellow ball, the saliency heatmap would light up at the spots of the image which are yellow. Saliency could also take the form of a rudimentary motion segmentation method; such an approach would be trivial if an event camera is employed. Also, note that most living agents have evolved eyes that sense in the spectral range needed to find the most salient objects from their perspective. For example, bees and butterflies can 'see' in the ultraviolet range to find the salient flowers they can drink nectar from (see Fig. 1.7). In the literature, saliency has been used for navigation [26], for human-robot interaction [27, 28] and to give robots gaze behavior similar to that of humans [29]. A simple way to exploit such a hallucinated saliency map is sketched below.

Figure 1.6: Left to right: Color image of the scene, corresponding saliency map output by SalGAN [1]. Hotter saliency colors correspond to higher values.

Figure 1.7: Left to right: Bidens ferulifolia flower as seen by human vision, reflected UV, butterfly vision and bee vision. Note that although the images shown here for simulated butterfly and bee vision are at the same resolution as those seen by human eyes, the real resolution of the eyes of these flying agents is much smaller. Photo credits and ©: Dr. Klaus Schmitt.
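As a purely illustrative example of using a hallucinated saliency map, the sketch below biases the point selection of a direct visual odometry front-end towards salient regions. This is not SalientDSO's actual selection scheme (which works on image regions and additionally uses scene parsing); the blending weight and the use of gradient magnitude are assumptions made only for this sketch.

    import numpy as np

    def saliency_weighted_points(gradient_mag, saliency, num_points, w_sal=0.7):
        """Select pixels for a direct VO front-end by mixing photometric gradient
        strength with a hallucinated saliency map. Both inputs are HxW arrays
        normalized to [0, 1]; returns a (num_points, 2) array of (x, y) pixel coords."""
        score = (1.0 - w_sal) * gradient_mag + w_sal * saliency
        top = np.argsort(score, axis=None)[::-1][:num_points]   # indices of top scores
        ys, xs = np.unravel_index(top, score.shape)
        return np.stack([xs, ys], axis=1)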
Nano/Pico-quadrotor design
Smaller UAVs are inherently safer, more agile and can be deployed as large swarms [30]. In the literature, pico-quadrotors have been developed with a myriad of capabilities such as walking and flying, but they rarely have enough on-board computation to be of practical value. Recently, custom chips based on the RISC-V architecture have been developed to bring deep-learning-based autonomy features to these tiny pico-quadrotors [31]; however, they require mastery of hardware design along with writing low-level driver code to make them functional and autonomous, thereby limiting further research. Moreover, the cameras and other sensors that can be used on drones of this size are either custom-made or expensive. Furthermore, the battery life of these tiny drones is less than 2 minutes due to their limited payload capabilities. A similar issue is observed with flapping-wing bird drones, which are inherently safer [32]. On the other end of the spectrum, research groups all over the world have worked on integrating larger sensors and computation modules into smaller units to run a full SLAM stack along with path planning and control, enabled by the latest generation of Graphics Processing Units (GPUs) [33]. Although massive advances have been made in this regard, and these systems have been released as open-source and open-hardware, they still use expensive sensors to achieve good SLAM accuracy. To this end, we propose to utilize activeness in our design and build nano-quadrotors which fit right in between the two aforementioned areas, enabling autonomous features at scales not possible before due to their higher payload and computation as compared to pico-quadrotors and their improved agility, safety and smaller size as compared to micro-quadrotors. We also take the first step in studying the co-design of hardware and software for nano-quadrotors using embodied AI.

1.9 Summary

In this chapter, I discussed what active agents are and how an active agent differs from a passive agent. This was then extended to why and how one should use the activeness of a UAV to make perception problems simpler, enabling autonomy at scales that were not possible before. I also discussed how activeness is a hardware and software co-design problem. Then, I formally presented the research objectives of this thesis, followed by different applications of nano-quadrotors. In the literature review, I summarized the state of the art on different forms of active perception, on combining perception and planning, and on nano/pico-quadrotor design.

Chapter 2: Contributions

In this chapter, I summarize the key contributions of the papers re-printed in the appendix. In particular, this chapter highlights the individual results and refers to the related videos, open-source code and open-hardware tutorials. I also put this thesis's research in context with the state-of-the-art literature. In total, this research has been published in three peer-reviewed journals and six peer-reviewed conference publications. One further paper is under preparation for the AAAS Science Robotics journal.

Paper A was successfully demonstrated live to multiple audiences, most notably to Sam Brin (brother of Sergey Brin), which led to an award of USD 200K from the Brin Family Foundation to our lab to advance machine perception on drones, along with a grant of USD 2M to the University of Maryland for building the Brin Family Aerial Robotics Lab (https://robotics.umd.edu/facilities/brin-family-aerial-robotics-lab, https://twitter.com/umdcs/status/994575308450947072?s=20), which is currently used by various departments. Paper B was presented by Profs. John Baras and Yiannis Aloimonos to the Office of Naval Research and obtained a grant of USD 2.2M to advance the field of Intelligent and Learning Autonomous Systems: Composability and Correctness. The works in Papers A and B also won runner-up for the Brin Family Prize in 2016.
Papers B and C have helped cultivate relationships with two of the best labs for aerial robotics research in the world: the Robotics and Perception Group at the University of Zurich, headed by Prof. Davide Scaramuzza, and the Micro Air Vehicle Laboratory at the Delft University of Technology, headed by Prof. Guido de Croon. The work from this thesis was used to create the world's first prototype of the RoboBeeHive, which is described in detail in Chapter 3. Finally, a lot of the research in this thesis has led to the creation of two fully open-source and open-hardware courses with video lectures and slides: ENAE788M: Hands-on Autonomous Aerial Robotics and CMSC828T: Vision, Planning and Control in Aerial Robotics.

2.1 Active Perception by moving the agent

In this section, I present work on the textbook definition of active perception: controlling the agent's movement to simplify the perception problem. Such work is also called perception-aware planning in the classical literature.

2.1.1 Paper A: GapFlyt (P1)

Nitin J. Sanket*, Chahat Deep Singh*, Kanishka Ganguly, Cornelia Fermüller, Yiannis Aloimonos, "GapFlyt: Active Vision Based Minimalist Structure-Less Gap Detection For Quadrotor Flight", IEEE Robotics and Automation Letters (RA-L), Vol. 3, No. 4, pp. 3847–3854, 2018. DOI: http://dx.doi.org/10.1109/LRA.2018.2843445.

2.1.1.1 Brief Description

In this work, we address one of the biggest challenges for autonomous operation of a UAV in complex environments: navigating through narrow gaps of unknown shape,