ABSTRACT

Title of Dissertation: TOWARD INTEGRATING INTELLIGENCE INTO EVERYTHING AROUND US

Nakul Garg
Doctor of Philosophy, 2025

Dissertation Directed by: Professor Nirupam Roy
Department of Computer Science

The vision of ambient intelligence promises a world where computational capabilities seamlessly integrate into everyday objects and environments, creating systems that sense, learn, and adapt to human needs while remaining invisible to users. Despite significant advances in miniaturization and low-power computing, true ambient intelligence has remained elusive, hindered by a fundamental challenge: current intelligent systems require substantial energy, complex hardware, and frequent maintenance, making widespread deployment impractical.

We introduce a paradigm shift in how we create intelligent systems by fundamentally reimagining sensing and computing architectures from first principles for extreme resource constraints. This thesis centers on encoding intelligence directly into the physical domain through novel hardware-software co-design, where passive structures perform initial signal transformations without consuming power. Through novel architectures across acoustic, radio frequency, and optical domains, we demonstrate systems that achieve spatial perception, global positioning, and environmental monitoring with orders of magnitude less power than conventional approaches. These innovations enable intelligence in previously impossible contexts: insect-scale robots that navigate complex environments, sticker-sized tags that provide GPS-like tracking for years on a single battery, and wireless sensors that monitor food quality throughout global supply chains. By bridging the gap between what intelligent systems can do and what resource-constrained platforms can support, this work establishes a foundation for truly pervasive intelligence that operates sustainably at large scale.
TOWARD INTEGRATING INTELLIGENCE INTO EVERYTHING AROUND US

by

Nakul Garg

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2025

Advisory Committee:
Professor Nirupam Roy, Chair/Advisor
Professor Ramani Duraiswami
Professor Lin Zhong
Professor Alan Liu
Professor Sennur Ulukus, Dean's Representative

© Copyright by Nakul Garg 2025

To my family - Kavita, Sunil, and Rishabh

Acknowledgments

I would like to express my sincere gratitude to my advisor, Professor Nirupam Roy, for his exceptional guidance and mentorship throughout my doctoral studies. His support has been instrumental in shaping my research capabilities and analytical thinking. I have thoroughly enjoyed our brainstorming sessions, where several creative ideas emerged and I learned how to transform research challenges into opportunities. He taught me invaluable skills in organizing complex thoughts, approaching problems with critical rigor, and communicating my ideas. Without his dedication and belief in my potential, the work presented in this dissertation would not have been possible.

I would like to express my deep appreciation to my committee members for their guidance and support. Professor Ramani Duraiswami has been incredibly supportive of my work since day one and provided invaluable feedback through to the end of my academic job search. I am very grateful to have Professor Lin Zhong on my committee; his leadership and mentorship have been exemplary. He has been a role model for me and has shown unwavering support for our research directions, providing validation for the structure-assisted spatial sensing work we pursued. I extend my heartfelt thanks to Professor Alan Liu, from whom I learned how to build and scale networked systems, both while taking his class on cloud networking and computing and during our discussions about the future of Edge IoT.
I truly value his thoughtful feedback on this dissertation and his help in crafting my thesis story. I am fortunate to have Professor Sennur Ulukus on my committee. As an expert in wireless communications, she has offered keen insights that have been particularly valuable as I explore scalable wireless systems in this work.

I am incredibly grateful to the mentors and supporters who helped me through the academic job market. Thank you to Nirupam Roy, Karthik Sundaresan, Ranveer Chandra, Lin Zhong, and Siavash Alamouti. Karthik mentored me during my internship at NEC and beyond. He is a brilliant researcher and collaborator who believed in my abilities from the start. I want to thank him for collaborating with me on the scalable UWB project and for inspiring me to aim higher. I met Ranveer during my internship at Microsoft Research. He is the best mentor anyone could ask for, and his leadership and ability to solve impactful problems have inspired me to the core. I am fortunate for his feedback and mentorship during my job search and lucky to have him as a mentor. Siavash is a rockstar researcher and has been a strong advocate for my work. Whenever I got a chance to meet him, he always gave great advice on my research directions and was especially supportive of my low-power wireless work. I am lucky to know him as a friend, a mentor, and a 10x entrepreneur.

I would also like to thank Akshay Gadre, Ish Jain, Nivedita Arora, Akarsh Prabhakar, Justin Chan, Vikram Iyer, Suman Banerjee, Yasaman Ghasempour, Venkat Arun, Dinesh Bharadia, Swarun Kumar, Chahatdeep Singh, Nitin Sanket, and Tara Boroushaki for their time and invaluable advice throughout my job search journey.

I am grateful to my labmates and Spacewalkers - Yang Bai, Aritrik Ghosh, Irtaza Shahid, and Harsh Takawale (the iCoSMoS Lab). Thank you, Yang, for working with me on the spatial acoustics projects. I learned a lot about perseverance and discipline from your approach to research.
Special thanks to Aritrik for collaborating on the low-power cellular work. I learned a lot about theoretical research and cherish the many deep physics conversations we had. I would like to thank Irtaza for working with me on the UWB project and for spending nights creating testbeds and calibrating our sensors. I also cherish all our fundamental discussions and our recoveries from breakdowns while thinking through signal processing problems. Thanks to Harsh for working with me on the food sensing project; I have always enjoyed brainstorming with you. Thanks to the entire iCoSMoS lab for always providing feedback on my papers and talks, and for supporting me through both the ups and downs of my PhD journey.

I must express my heartfelt gratitude to my amazing friends who have been my support system throughout this journey. To Satyarth, who taught me life philosophy and positivity - I'm sorry I couldn't attend your wedding. To Deepak, Ravin, Smriti, Varenya, Shweta, and Abhishek - I found a family away from family, especially during COVID, with our Spring TP group. To Aditi, for helping me with medicines during all my conference travels and for your perfect song suggestions. Special thanks to Meenal for her unwavering support. You are my best friend, and I am lucky to have you. I am deeply grateful to have Nitin bhaiya as a friend and a mentor - I have learned so much from our conversations, especially your advice and guidance in the early years of my PhD when I felt lost. Chahat bhaiya, thank you for all your advice throughout my PhD journey, from my first project to my most recent one, and especially for all the job talk and interview tips. Thanks also to Anoorag, Priyal, Stella, Pooja, and Mrunal. To my dearest friends whose support and belief in me has been a driving force - Tanmay, Bhawana, Prerna, Ishani, Devina, and Sidharth - I'm super lucky to have all of you. Thank you, Ayush, for being a huge inspiration to pursue academia and for being my longest friend.
Most importantly, I want to thank my family: my parents (Kavita and Sunil), my brother (Rishabh), my grandparents (Ilaychi and Ramchandra), and the entire RCG family. I cannot thank my parents enough; they have sacrificed so much and faced hardships just to make my life better and to prioritize my education. My brother is my best friend, who unknowingly taught me so much: to dream big, never lose hope, and smile through challenges. My grandfather is the mathematical prodigy I grew up with, and he taught me mathematics and discipline. My grandmother, amma, taught me to be a good human being above everything else. I am lucky to have grown up learning from her. My uncle (Anil) has taught me to be selfless, to provide value and support, and to remain optimistic no matter what. Thanks to my brothers Lakshay and Gaurav for being my biggest pillars in life. My sister (Mahima) has been a role model, helping me pursue science and teaching me the fundamentals. You are the biggest inspiration for me to pursue academia. Special thanks to my superstar sisters - Himanshi, Mugdhal; my aunts - Anita, Mamta, and Babita; my brother Akshay; and everyone for their love.

Table of Contents

Dedication . . . ii
Acknowledgements . . . iii
Table of Contents . . . vii
List of Tables . . . xi
List of Figures . . . xii
List of Abbreviations . . . xix

Chapter 1: Introduction . . . 1
1.1 Challenges in Scaling Ambient Intelligence . . . 3
1.2 Designing Ultra-Low-Power Intelligent Systems . . . 5
1.3 Systems Developed . . . 8
1.3.1 Structure-Assisted Spatial Intelligence . . . 8
1.3.2 Scalable nextG Systems . . . 10
1.3.3 Systems for Sustainability . . . 12
1.3.4 Other Contributions . . . 13
1.4 Organization . . . 14

Chapter 2: Structure-Assisted Spatial Audio Sensing . . . 15
2.1 Introduction . . . 15
2.2 Core Intuitions and Primers . . . 20
2.2.1 Metamaterials for Passive Filtering . . . 22
2.3 System Design . . . 24
2.3.1 Processing for DoA Estimation . . . 25
2.3.2 Eliminating Source Signal Dependency . . . 25
2.3.3 Eliminating Environmental Dependency . . . 28
2.3.4 Synthetic Training for Deep Learning . . . 29
2.3.5 Optimizing 3D Stencil Design . . . 31
2.4 Prototype Development . . . 36
2.4.1 3D-Printing Stencil Caps . . . 36
2.4.2 Calibration and Data Collection . . . 36
2.5 Evaluation . . . 37
2.5.1 Evaluation Setup and Results Summary . . . 37
2.5.2 Impacts of external conditions . . . 39
2.5.3 Performance in different environments . . . 41
2.5.4 Impact of different sound sources . . . 42
2.5.5 Performance in known environment . . . 42
2.5.6 Localization Performance . . . 43
2.5.7 Comparison with traditional methods . . . 44
2.5.8 Comparison between learning models . . . 44
2.5.9 Energy consumption . . . 45
2.6 Discussion . . . 48
2.7 Related Work . . . 49
2.8 Conclusion . . . 51

Chapter 3: Microstructure-Assisted Vision for Ubiquitous Tiny Robots . . . 52
3.1 Introduction . . . 52
3.2 Core Intuitions and Primers . . . 57
3.2.1 Coded Signal Projection with Structures . . . 58
3.2.2 Single-Receiver Depth Mapping . . . 60
3.3 System Design . . . 61
3.3.1 Low-Power Scene Reconstruction . . . 61
3.3.2 Directional Code Projection . . . 63
3.3.3 Optimal microstructure design . . . 68
3.3.4 Motion stacking . . . 69
3.4 Evaluation . . . 70
3.4.1 Metrics . . . 71
3.4.2 Overall Performance . . . 71
3.4.3 Impact of the Environment . . . 72
3.4.4 Impact of system parameters . . . 74
3.4.5 Impact of scene parameters . . . 76
3.4.6 Computation techniques . . . 78
3.4.7 Power consumption . . . 80
3.5 Related Work . . . 83
3.6 Discussion . . . 85
3.7 Conclusion . . . 86

Chapter 4: Ultra-Low-Power Self-Localization Using a Single Antenna . . . 87
4.1 Introduction . . . 87
4.2 Core Intuition and Feasibility . . . 92
4.3 System Design . . . 94
4.3.1 Ultra-low Power Receiver . . . 94
4.3.2 AoA Estimation . . . 95
4.3.3 Localization with Independent Beacons . . . 97
4.3.4 Designing Programmable Directional Gain . . . 102
4.3.5 Pin-diodes as RF Switches . . . 103
4.3.6 Directional Code to AoA Mapping . . . 105
4.4 Evaluation . . . 107
4.4.1 AoA Performance . . . 109
4.4.2 Localization . . . 109
4.4.3 Impact on RF Communication . . . 110
4.4.4 Power Consumption . . . 111
4.4.5 Robustness . . . 112
4.5 Limitation and Future Work . . . 118
4.6 Related Work . . . 119
4.7 Conclusion . . . 122

Chapter 5: Scalable Asset Tracking with NextG Cellular Signals . . . 123
5.1 Introduction . . . 123
5.2 Cellular Networks Primer . . . 128
5.3 System Design . . . 131
5.4 LiTEfoot Prototype Implementation . . . 144
5.5 Evaluation . . . 147
5.6 Related Work . . . 154
5.7 Discussion . . . 155
5.8 Conclusion . . . 158

Chapter 6: Large Network UWB Localization: Algorithms and Implementation . . . 159
6.1 Introduction . . . 159
6.2 System Design . . . 163
6.2.1 Joint Range-Angle Localization . . . 164
6.2.2 Scaling to Large Networks . . . 167
6.2.3 Reference Frame Transformation . . . 174
6.3 Opportunistic Anchor Integration . . . 177
6.4 Implementation . . . 179
6.5 Evaluation . . . 179
6.5.1 Localization Evaluation . . . 180
6.5.2 Latency Evaluation . . . 185
6.5.3 Micro-Benchmarks . . . 186
6.6 Discussion . . . 191
6.7 Related Work . . . 193
6.8 Conclusion . . . 194

Chapter 7: Low-Cost and Dynamic Food Quality Sensing at the Pallet-Level . . . 195
7.1 Introduction . . . 195
7.2 Understanding Food Quality Metrics . . . 197
7.3 Physics and Core Intuition . . . 198
7.4 Design and Implementation . . . 201
7.5 Performance Evaluation . . . 203
7.6 Conclusion . . . 206

Chapter 8: Conclusion and Future Directions . . . 207
8.1 Summary of Contributions . . . 207
8.2 Broader Impact . . . 209
8.3 Future Directions . . . 210
8.4 Closing Thoughts . . . 211

Bibliography . . . 213

List of Tables

2.1 Comparison of prototype cost, size, median error, and energy consumption of Owlet with a microphone array. . . . 39
3.1 Breakdown of energy consumption for the hardware and software submodules. . . . 81
3.2 Comparison of different computation optimizations showing the total energy consumed per scene reconstruction. Prototype on Raspberry Pi 4. . . . 83
4.1 Breakdown of energy consumption. . . . 112
4.2 Summary of related work. . . . 119
5.1 Key characteristics of evaluation routes. . . . 147
5.2 Comparison between latency, accuracy, and power consumption. . . . 149
5.3 Overall power and energy per inference. . . . 150
5.4 RF frontend power consumption. . . . 150
5.5 Baseband power and time per inference. . . . 151

List of Figures

1.1 Examples of intelligent, low-power, and scalable sensors developed in this thesis. . . . 2
1.2 The intelligence-energy tradeoff: Conventional intelligent systems (top left) operate at high energy levels and offer good capabilities. Current resource-constrained systems (bottom right) provide limited intelligence. This thesis explores approaches to bridge this gap by creating systems that offer 100× more intelligence at similar energy levels. . . . 4
1.3 Our approach to sustainable ambient intelligence combines nature-inspired architectures, physics-informed AI, and scalable networks. . . . 6
1.4 Structure-assisted sensing systems: (A) SPiDR's depth imaging using 3D-printed acoustic metamaterial, (B) SPiDR's acoustic stencil and its internal spatial filtering channels, (C) Owlet's acoustic stencil and its internal direction-finding channels, (D) Owlet's direction finding with passive acoustic structures. . . . 8
1.5 LiTEfoot leverages cellular signals for nationwide tracking using an ultra-low-power miniaturized receiver. . . . 10
1.6 Locate3D enables infrastructure-free tracking at scale through optimized peer-to-peer measurements. . . . 11
2.1 The vision and technical overview of Owlet, a low-power and miniaturized system for extracting spatial information from sound. Owlet uses acoustic microstructures to embed direction-specific signatures on the recorded sound and develops a learning-based approach for signature recovery and mapping in real-time. . . . 16
2.2 The concept of using a stencil with direction-specific hole patterns and microstructures for passive filtering of the incoming sound. The stencil embeds a directional response to the recorded signals. . . . 20
2.3 The concept of passive directional filtering using a stencil of acoustic microstructure. The stencil embeds a directional signature to the recorded sound unique to its direction of arrival (DoA). The spectrum of complex gains represents the signature for further computation. . . . 21
2.4 Different types of metamaterial stencils used in our experiments. . . . 23
2.5 Angular diversity of the microphone with and without the microstructure stencil. . . . 24
2.6 Comparison of the diversity in frequency responses (amplitude and phase) of the three types of metamaterial stencils. . . . 24
2.7 The two-microphone model for eliminating source and environmental dependency. . . . 26
2.8 The architecture of the proposed CNN model. . . . 29
2.9 The behavior of the sound field at the outer surface of an obstacle. (a) When the object's size is much larger than the wavelength of the sound, the obstacle creates a shadow region. (b) When the object's size is comparable to the wavelength of the sound, the wave diffracts around the object, creating high pressure over a larger region of the surface. It also creates a high-pressure region directly opposite to the sound's direction where the sound fields from the top and bottom sides meet. . . . 32
2.10 (a) A one-hole stencil to measure surface pressure levels. (b) Sound amplitude at different angles from the sound's direction of arrival. . . . 33
2.11 Comparison of diversity in phase and amplitude patterns for an optimal and a sub-optimal design of the stencil. . . . 35
2.12 The Owlet prototype used in the evaluation experiment (left) and a 9-element uniform linear microphone array used as a baseline for comparison (right). The array is 12 cm wide, whereas Owlet is significantly smaller, measuring less than 2 cm in its largest dimension. . . . 37
2.13 Various locations for system evaluations: (a) indoor laboratory, (b) indoor lobby, (c) outdoor. . . . 38
2.14 Overall performance of the Owlet system compared to traditional microphone arrays of various sizes. Owlet requires 100× less energy than state-of-the-art array systems while achieving better accuracy than a 9-element array. . . . 39
2.15 Performance under external conditions: (a) The impact of varying types and loudness levels of ambient noise on the median DoA estimation error. (b) The CDF of errors when the sound source is located at varying distances from the receiver. (c) The CDF plot of estimation error for different elevation angles or vertical positions of the sound source. (d) The CDF plots of errors that show the impact of dynamic movements in the environment. . . . 40
2.16 The performance of sound tracking while the source is constantly moving near the sensor. The movement of the source creates a dynamic multipath scenario. . . . 41
2.17 The CDF of median error for (a) different environments and (b) different types of sound sources. . . . 42
2.18 The performance for DoA estimation with known room size: (a) The confusion matrix and (b) the CDF of error in degrees of angle. . . . 43
2.19 The localization error as (a) a heatmap and (b) an empirical CDF. . . . 44
2.20 Performance comparison of Owlet with implementations of the beamscan, MVDR, and MUSIC algorithms: (a) The CDF of median errors, (b) The spatial spectrum for an incoming signal from a 20° angle. . . . 45
2.21 Performance comparison of Owlet with different deep learning models and architectures. . . . 45
2.22 The setup for evaluating energy consumption. The setup tracks the energy requirements of Owlet and baseline microphone arrays under various conditions using a Keysight E6313A power supply and monitor. . . . 46
2.23 Energy consumption of (a) the MSP430FR5969 low-power ADC [1] for different sampling rates and (b) the Keysight Data Acquisition System [2] for different numbers of microphones. . . . 47
2.24 Overall energy consumption of array-based systems and Owlet. . . . 48
3.1 SPiDR, an ultra-low-power acoustic spatial sensing system for mobile robots. The system uses a carefully designed 3D-printed microstructure for projecting spatially coded signals for imaging. . . . 53
3.2 The concept of the spatially coded channel sounding method. The received signal is the weighted linear combination of the reflections that bear the direction-specific signature. . . . 54
3.3 Diversity projection with the stencil, whose internal channels encode unique gains to the signals, probing each pixel on the object plane with a unique signature. . . . 58
3.4 (left) The 3D design of the stencil and (right) the internal structure showing the tubular helical paths of different lengths. . . . 59
3.5 Ultrasound emitted from the speaker without (left) and with the stencil (right). The stencil spreads the signal energy over the region of interest. . . . 59
3.6 The amplitude of the acoustic signal at the cross section of the image plane when the speaker (left) does not have any stencil and (right) has a stencil. The internal microstructure of the stencil diversifies the signal amplitude as direction codes. . . . 61
3.7 (a) Energy consumption for different numbers of columns in the channel matrix H. (b) The correlation between the signals at nearby locations. Pixels within 3 cm have a correlation higher than 0.5. . . . 62
3.8 Row 1: Comparison of the correlation of the time-domain signals projected at different angles with an arbitrary stencil (left) and an optimized stencil (right). Row 2: Comparison of the location detection performance for a small object (1 cm wide) placed at different angles from the sensor with an arbitrary stencil (left) and an optimized stencil (right). . . . 64
3.9 Different sizes of stencils used in our experiments. . . . 65
3.10 F values with (a) different lengths of tubes inside the stencil, (b) different diameters of tubes. . . . 67
3.11 Confusion matrix of imaging accuracy for different frequencies and after combining all frequencies together. . . . 68
3.12 Stacking multiple frames suppresses spurious objects in the scene. Above we show the result of motion stacking of 5 frames taken as the robot moves. . . . 70
3.13 Depth-map reconstruction using SPiDR for various real-world scenes. . . . 72
3.14 Overall performance of SPiDR compared to an Intel RealSense lidar and an ultrasound distance sensor mounted on a servo motor. SPiDR consumes a fraction of the power of lidar and motor-based systems while delivering high accuracy in depth-map reconstruction. . . . 73
3.15 Impact of varying (a) types and (b) levels of noise on depth-map reconstruction. . . . 73
3.16 SPiDR's performance in different environments. . . . 74
3.17 Cross-sectional depth-map reconstruction performance in terms of (a) RMS error and (b) structural similarity, as a function of the sparsity of the scene. . . . 75
3.18 Horizontal localization performance in terms of (a) RMS error and (b) structural similarity, as a function of the sparsity of the scene. . . . 75
3.19 CDF plot of (a) RMS error and (b) structural similarity for depth-map reconstruction at varying resolutions. . . . 76
3.20 Scene reconstruction results at 1 cm and 0.5 cm resolutions. We modify the number of columns in the channel matrix to achieve a higher resolution. . . . 76
3.21 The depth-map reconstruction performance for (a) different materials and (b) different proximity between two objects. . . . 77
3.22 Depth-map reconstruction outputs for varying proximity between objects. . . . 77
3.23 Depth-map reconstruction outputs for different depths of the objects. . . . 78
3.24 Performance of depth-map reconstruction with different depths of the objects. . . . 78
3.25 Scene reconstruction of a horizontal bar with and without frequency or motion stacking. (a) Ground truth, (b) Raw output, (c) Only motion stacked, (d) Only frequency stacked, (e) Both motion and frequency stacked. . . . 79
3.26 The scene reconstruction with and without the fractional computing method. (a) Ground truth, and the scene reconstruction (b) without and (c) with fractional computing. . . . 79
3.27 Performance with different sizes of Hmeta. . . . 80
3.28 SPiDR prototype for power evaluation. . . . 80
3.29 Power consumption during sampling and computation. . . . 82
3.30 Comparison of power consumption with different sizes of Hmeta. Prototype on MSP430FR5969. . . . 83
4.1 Sirius dynamically switches the beam pattern of the antenna to embed a direction-specific signature in the received signal. The vector of amplitudes contains the unique signature, which maps to the angle-of-arrival, θ. . . . 88
4.2 Sirius uses a pin diode to switch the gain pattern of an antenna. Connecting and disconnecting a conductive patch to the surface of the antenna can change the shape of the gain pattern. This figure shows four different configurations of the antenna controlled using a set of two switches. . . . 90
4.3 Feasibility study demonstrating dynamic gain pattern switching. The figure shows gain patterns for (a) our reconfigurable antenna and (b) a regular antenna. . . . 93
4.4 The passive envelope detector used by Sirius captures incoming signal energy without requiring power-hungry components such as oscillators and down-converters. . . . 95
4.5 Gain patterns for the first 3 paths the signal takes to reach the antenna. The first is the direct path, which constitutes the majority of the energy, followed by weaker reflections. . . . 97
4.6 Sirius uses triangulation to localize mobile nodes. It estimates angles from at least two anchors with known locations. . . . 98
4.7 The figure depicts the envelope detector's received signal: (top) Anchor 1 transmitting, (middle) Anchor 2 transmitting, and (bottom) both Anchor 1 and Anchor 2 transmitting. . . . 99
4.8 Anchor beacon signals designed with varying duty cycles create distinct time windows for interference-free reception. (left) Case 1: Anchor 2's window as the outer neighbor of the collision windows. (right) Case 2: Anchor 2's window as the inner neighbor of the collision window. . . . 101
4.9 The figure displays anchor detection algorithm outputs: (top) time-domain signal, (middle) total energy from all configurations, and (bottom) the signal's first derivative. . . . 102
4.10 Using a pin diode as an RF switch. (a) Biasing circuit (b, c) Equivalent lumped model for on and off states. . . . 103
4.11 Switching and sampling techniques: (a) Sequential sampling: constant-time switch, hold, and sample, (b) Uniform sampling: continuous configuration switching at a constant rate, (c) Burst sampling: high-speed switching and sampling with long duty cycles for omnidirectional communication. . . . 103
4.12 Sirius's prototype for localization. . . . 107
4.13 We evaluate Sirius on different antennas in the 900 MHz and 2.4 GHz ISM bands. The figure shows the fabricated reconfigurable antennas used in the prototype. . . . 107
4.14 Setup for outdoor long-range data collection. The map shows the static anchors and node locations. . . . 108
4.15 (a) Overall AoA estimation accuracy of Sirius. (b) CDF of AoA error. . . . 109
4.16 (a) CDF of localization error. (b) Localization error per location showing the impact of distance from the APs. . . . 110
4.17 Effect of pattern switching speed on communication. . . . 111
4.18 Impact of antenna reconfiguration on bit error rate for different antennas. . . . 111
4.19 Energy distribution between the direct path and multipath for various indoor locations. . . . 113
4.20 AoA errors for different indoor locations. . . . 113
4.21 AoA errors for different reflection materials. Each cluster of bars represents one type of material, and each bar within a cluster represents a different indoor location. . . . 114
4.22 Inverse distance matrix for varying numbers of gain patterns (N). . . . 115
4.23 AoA estimation errors: (left) varying levels of multipath density in the environment, (right) varying numbers of gain patterns. . . . 115
4.24 (a) Correlation of gain pattern amplitudes. (b) Mean and median AoA errors showing the long-term stability of gain patterns in a dynamic environment. . . . 116
4.25 The figure shows gain patterns collected at two different indoor locations in a dynamic environment. . . . 117
4.26 CDF of AoA error for different environments. . . . 117
4.27 Our experimental setup at different indoor locations with varying levels of static and dynamic multipath. . . . 117
4.28 AoA error for varying distance from the anchor. . . . 118
4.29 Impact of clock drift on AoA estimation. . . . 120
4.30 Impact of frequency shift on AoA estimation. . . . 120
5.1 LiTEfoot - an ultra-low-power wireless tracker next to a US quarter for scale. . . . 124
5.2 LTE cells and frame structure. . . . 129
5.3 Intermodulation and spectrum folding. . . . 132
5.4 Confusion matrix for PCI estimation using SSS2 (i.e., the SSS signal after non-linearity). . . . 136
5.5 LTE packets before and after stacking (a) 1.4 MHz bandwidth (b) 10 MHz bandwidth. After stacking, the PSS and SSS strength increases in the 10 MHz case whereas the data fades into a DC bias.
138 5.6 (a) Blind Separation of weighted superposition (b) Linear phase change introduced due to sub-sample offset in the OFDM subcarriers. . . . . . . . . . . . . . . . . 140 5.7 The high-level circuit schematic of LiTEfoot. . . . . . . . . . . . . . . . . . . . 144 5.8 LiTEfoot PCB tag prototype showing the low-power RF frontend. . . . . . . . . . 145 5.9 LiTEfoot’s estimated trajectories and localization errors for in (a) urban and (b), (c) rural environments. The black lines denote the GPS trajectory and blue markers denote LiTEfoot’s estimated trajectory. The cell towers that are detected during the route are shown in red. (d) The empirical CDF of the localization errors. . . . 145 5.10 Diversity in downlink center frequencies used by the cells in a 200-meter radius. . 148 5.11 PCI estimation accuracy (F1 score) vs. number of stacked frames for LTE, 5G-NR, and NB-IoT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 5.12 (a) CDF of phase offsets in measurements before and after sub-sample offset correction. (b) Localization error for varying speed of vehicle. . . . . . . . . . . 152 xvi 5.13 (a) An alert is generated when the tag exits the boundary of the marked region. (b) Alert generation distance from the map boundaries. . . . . . . . . . . . . . . . . 154 6.1 Locate3D’s approach to include both angle and edge constraints for faster and efficient localization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 6.2 Comparative analysis of constraints: Incorporating angles reduces the number of edges required to attain the same level of accuracy as the ”Ranges only” approach. . . . . . . 166 6.3 (a) Histogram of localization errors for all spanning trees. (b) Reported angles by COTS UWB sensor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170 6.4 Different spanning trees representing rigid and non-rigid graphs. 
Solid lines indicate both range+angle edges, and dashed lines indicate range-only edges. (a) A connected but non-rigid graph due to missing angle information in an edge. (b) The subgraph is free to rotate. (c) Adding a range measurement makes the graph rigid. . . . . . . . . . . . . . 171 6.5 The displacements corresponding to zero eigenvalues represent the translational and rotational motions that the nodes can undergo without violating any constraint. . . . . . 172 6.6 (a) When Node 1 is aligned with the global frame of reference, it reports that Node 2 is positioned at angle θ. (b) When Node 1 is rotated by γ relative to the global frame of reference in a 3D space, it reports a distinctly different angle θ′, for the same node. . . . 174 6.7 The relation between AoAs reported by two nodes (a) when nodes are perfectly aligned, and (b) when nodes are oriented with angle γ relative to the global frame of reference. . . 176 6.8 (a) Time to reach ‘0 False positives’. (b) Percentage of users registered for varying trajectory matching thresholds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178 6.9 Room-scale evaluation: CDF of localization errors. . . . . . . . . . . . . . . . . . . . 180 6.10 3D Localization performance of Locate3D compared to baseline system - Cappella [3] - which uses visual odometry along with UWB. Results show performance in different lighting conditions for (a) Moving nodes and (b) Static nodes. . . . . . . . . . . . . . 181 6.11 (a) Prototype built using Raspberry Pi, UWB sensor, Intel Realsense, IR markers for ground truth and a battery pack. (b) Room-scale evaluation. (c) Building-scale evaluation (d) 3D Lidar scan of the building for reference (not used in computation) (e) Snapshot of estimated locations and MST. (f) AprilTags [4] captured by nodes for ground truth (not used in computation). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
182 6.12 City-scale analysis: (a) CDF of localization errors of 30k nodes, 3800m× 3800m area, and 15 anchors. (b) Errors for 30k, 60k, and 100k nodes in 1500m× 1500m area and 1 anchor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183 6.13 New York City wide-area simulation results: CDF of localization errors for 100,000 nodes using 1 (left figure) and 5 (right figure) anchors in a 22000m× 3200m area. . . . . . . 185 6.14 Impact of submodules on (a) latency and (b) accuracy. . . . . . . . . . . . . . . . . . 186 6.15 Localization errors (in meters) for infrastructure baseline [5] and Locate3D for varying nodes and anchors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187 6.16 (a) Number of unreachable nodes and (b) Total number of NLOS measurements made with varying anchor area density. . . . . . . . . . . . . . . . . . . . . . . . . . . . 188 6.17 (a) LOS and NLOS localization errors. (b) Static vs Mobile nodes localization errors . . 188 6.18 (a) City-scale results for a 1000 node topology simulation in a 200 × 200 × 50 meter 3D space for a varying number of static anchors. (b) Room-scale localization error for real-world 20-node experiments with varying numbers of anchors registered. . . . . . . 189 xvii 6.19 Ranging and AoA errors for various range-angles. . . . . . . . . . . . . . . . . . . . 190 6.20 (a) CDF orientation errors. (b) Percent of users registered as Virtual anchors over time. . 191 7.1 Complex permittivity of electromagnetic waves in water. . . . . . . . . . . . . . 199 7.2 FreshSense: Different frequencies travel at different speeds through water resulting in an additional frequency dependent delay. . . . . . . . . . . . . . . . . . . . . 200 7.3 Estimated dispersion for varying percentage of water in the box. . . . . . . . . . 201 7.4 Evaluation setup with a box of avocados. . . . . . . . . . . . . . . . . . . . . . . 202 7.5 Images of avocados captured across 14 days. . . . . . . . . . . 
. . . . . . . . . . 204 7.6 Ground truth measured for 5 individual fruits and their average trend. . . . . . . . 204 7.7 Correlation between estimated and true DM% for the testing data. . . . . . . . . 205 7.8 Estimated and true DM% across 14 days. . . . . . . . . . . . . . . . . . . . . . . 205 xviii List of Abbreviations ADC Analog-to-Digital Converter AI Artificial Intelligence AP Access Point AoA Angle of Arrival API Application Programming Interface ASIC Application-Specific Integrated Circuit BLE Bluetooth Low Energy CDF Cumulative Distribution Function CGI Cell Global Identity CMOS Complementary Metal–Oxide–Semiconductor CNN Convolutional Neural Network COTS Commercial Off-The-Shelf DAQ Data Acquisition System DC Direct Current DM Dry Matter DoA Direction of Arrival DoF Degree of Freedom eDRX Extended Discontinuous Reception eNodeB Evolved Node B FCC Federal Communications Commission FDA Food and Drug Administration FDD Frequency Division Duplex FFT Fast Fourier Transform FMCW Frequency-Modulated Continuous Wave FoV Field of View GHz Gigahertz GPS Global Positioning System GS Gerchberg-Saxton HFSS High Frequency Structure Simulator IoT Internet of Things IoU Intersection over Union IQ In-phase and Quadrature LEA Low Energy Accelerator LiDAR Light Detection and Ranging xix LNA Low-Noise Amplifier LOS Line of Sight LTE Long-Term Evolution MCU Microcontroller Unit MDS Multidimensional Scaling MEMS Microelectromechanical Systems MHz Megahertz ML Machine Learning MLP Multilayer Perceptron MST Minimum Spanning Tree MUSIC MUltiple SIgnal Classification MUT Micromachined Ultrasound Transducer NB-IoT Narrowband Internet of Things NeRF Neural Radiance Fields NIR Near-Infrared NLOS Non-Line of Sight NSSS Narrowband Secondary Synchronization Signal OFDM Orthogonal Frequency Division Multiplexing PA Power Amplifier PCA Principal Component Analysis PCB Printed Circuit Board PCI Physical Cell Identity PIN Positive-Intrinsic-Negative (diode) PSS Primary Synchronization 
Signal QPSK Quadrature Phase Shift Keying R-squared Coefficient of Determination RF Radio Frequency RLS Recursive Least Squares RMSE Root Mean Square Error RSSI Received Signal Strength Indicator SAR Successive Approximation Register SDR Software-Defined Radio SIMD Single Instruction, Multiple Data SMACOF Scaling by MAjorizing a COmplicated Function SNR Signal-to-Noise Ratio SPL Sound Pressure Level SPI Serial Peripheral Interface SSIM Structural Similarity Index SSS Secondary Synchronization Signal STL Stereolithography xx SWaP Size, Weight, and Power SWaP-C Size, Weight, Power, and Cost TSS Total Soluble Solids ToF Time-of-Flight UDP User Datagram Protocol UE User Equipment UHF Ultra High Frequency UWB Ultra-Wideband VIO Visual Inertial Odometry VNA Vector Network Analyzer xxi Chapter 1: Introduction Imagine a world where intelligence is woven into the fabric of our physical environment. Insect-scale robots navigate disaster zones, paper-thin tags track goods across global supply chains without manual maintenance, and medical implants detect disease markers years before symptoms appear. The vision of intelligent systems seamlessly embedded in the environment has attracted sustained interest from the research community since the late 1990s, when pioneers like Mark Weiser envisioned ubiquitous computing [6] and Kristofer S. J. Pister, Joe Kahn, and Bernhard Boser conceptualized smart dust [7]. While today’s smart devices offer glimpses of this future, they represent only incremental steps toward true ambient intelligence—systems that vanish into the background while autonomously sensing, learning, and adapting to human needs. The fundamental challenge lies in scaling intelligence sustainably: current approaches rely on power-hungry hardware, frequent maintenance, and expensive components, making widespread deployment impractical [8]. 
Achieving this vision of ambient intelligence at scale requires fundamentally new approaches that can operate within extreme resource constraints. Over the past two decades, we have pursued miniaturization and low-power operation as primary paths toward embedding intelligence in everyday objects. This approach has yielded impressive advances in microelectronics, MEMS sensors, and efficient computing. However, simply making conventional systems smaller encounters fundamental physical barriers that force difficult tradeoffs between intelligence capabilities and energy requirements. For instance, consider the contrast between perception systems for autonomous vehicles and insect-scale robots: self-driving cars use LiDAR sensors consuming watts of power, while a robot at millimeter scale must achieve similar functionality within sub-milliwatt constraints. Sensor arrays required for spatial perception cannot be reduced beyond wavelength-dependent limits without sacrificing resolution or accuracy. Furthermore, modern deep-learning systems demand substantial computational resources that conflict with the strict power budgets of tiny devices. Our goal in this thesis is not merely to reduce power consumption, but to maintain intelligent capabilities while operating within extreme resource limitations. To summarize, these challenges indicate that miniaturization alone cannot bridge the gap between current technology and true ambient intelligence. We need to fundamentally redesign our approaches to sensing, computing, and communication in resource-constrained environments.

Figure 1.1: Examples of intelligent, low-power, and scalable sensors developed in this thesis.

In this thesis, we investigate how to achieve ambient intelligence by fundamentally reimagining sensing and computing architectures from first principles.
We show how bio-inspired meta-structures (see Figure 1.1) enable spatial perception with single sensors that match the performance of multi-element arrays while using 1000× less power; how ultra-low-power techniques can provide GPS-like positioning with sticker-sized tags operating for years on a single battery; and how novel sensing approaches can monitor food quality non-invasively throughout global supply chains. Through innovations at the intersection of physics and computation, we create sensing systems that achieve orders-of-magnitude improvements in energy efficiency, size, and scalability, making intelligence practical in environments previously considered impossible.

1.1 Challenges in Scaling Ambient Intelligence

The gap between our vision of integrating intelligence in everything and current reality stems from three fundamental challenges:

Nature's Efficiency vs. Artificial Systems: Nature has created intelligent systems that operate within incredible efficiency constraints. A fruit fly navigates complex 3D environments, avoids predators, and locates food with a brain of merely 100,000 neurons consuming just microwatts of power [9]. Similarly, desert ants perform precise navigation across vast distances without GPS, using minimal neural hardware [10]. In stark contrast, today's artificial intelligent systems capable of comparable perception [11], reasoning, and decision-making [12] require orders of magnitude more resources—hundreds of megabytes of memory and several watts of power [13]. This fundamental mismatch prevents direct application of current AI approaches in resource-constrained devices, creating a critical barrier to embedding intelligence in everyday objects.

Physical Limitations: Traditional sensing paradigms encounter fundamental physical barriers when scaled down to micro-devices.
Spatial perception systems like radar, sonar, and camera arrays rely on sampling-theory principles that demand multiple sensors separated by wavelength-dependent distances to avoid aliasing. The Nyquist-Shannon sampling theorem states that to accurately capture a signal with maximum frequency component fmax, sampling must occur at a rate of at least 2fmax [14, 15]. This creates critical hardware requirements: higher sampling rates demand faster ADCs with correspondingly higher power consumption. Additionally, diffraction limits optical sensing according to Abbe's criterion (d = λ/2NA), while signal-to-noise ratio deteriorates as sensor size decreases. These physics-based constraints collectively create seemingly insurmountable barriers to effective sensing in tiny platforms.

Figure 1.2: The intelligence-energy tradeoff: Conventional intelligent systems (top left) operate at high energy levels and offer good capabilities. Current resource-constrained systems (bottom right) provide limited intelligence. This thesis explores approaches to bridge this gap by creating systems that offer 100× more intelligence at similar energy levels.

Energy Sustainability at Scale: Achieving ambient intelligence requires not just creating intelligent systems but ensuring they can operate sustainably at planetary scale. As illustrated in Figure 1.2, current intelligent systems consume orders of magnitude more power, requiring frequent battery replacements or continuous power delivery. The energy gap is large mainly because perception typically requires hundreds of milliwatts, while energy harvesting provides only microwatts in many scenarios [16]. This gap cannot be bridged through incremental improvements alone; it demands radically different approaches to energy-conscious intelligence.

These challenges collectively highlight that simply miniaturizing conventional sensing and computing approaches is insufficient.
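To make these floors concrete, consider a back-of-the-envelope sketch of what the Nyquist rate and half-wavelength array spacing imply for a conventional ultrasonic sensing array. The numbers and names here are illustrative choices, not figures from this thesis:

```python
# Illustrative sketch (hypothetical numbers) of the hardware floors that
# sampling theory and wave physics impose on a conventional array sensor.

SPEED_OF_SOUND = 343.0  # m/s in air at room temperature

def nyquist_rate(f_max_hz: float) -> float:
    """Minimum ADC rate for a signal band-limited to f_max (Nyquist-Shannon)."""
    return 2.0 * f_max_hz

def min_element_spacing(f_hz: float, c: float = SPEED_OF_SOUND) -> float:
    """Half-wavelength element spacing an array needs to avoid spatial aliasing."""
    return c / f_hz / 2.0

# For a 40 kHz ultrasonic signal, a common ranging frequency:
f = 40e3
print(f"ADC rate        >= {nyquist_rate(f) / 1e3:.0f} kS/s")
print(f"Element spacing >= {min_element_spacing(f) * 1e3:.2f} mm")
# A 9-element linear array spans roughly 8 x 4.3 mm, about 34 mm, already
# larger than an insect-scale robot; faster ADCs only raise the power draw.
```

At 40 kHz the spacing floor is about 4.3 mm per element, which is why shrinking an array below these wavelength-set limits sacrifices resolution rather than just size.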
The vision of ambient intelligence requires fundamentally reimagining how we design intelligent systems from first principles, particularly for extreme resource constraints.

1.2 Designing Ultra-Low-Power Intelligent Systems

Traditional approaches to intelligent systems follow a common pipeline: collect high-resolution signals using sensors, convert the analog signals to digital data, and process this data through computationally intensive, large ML models. This conventional paradigm tightly integrates sensing with digital processing, requiring substantial computational resources to extract meaningful information from raw sensor data. While this approach works well for resource-rich platforms like self-driving cars, drones, and humanoids, it fundamentally breaks down in resource-constrained scenarios. Tiny robots, wearable devices, and battery-free sensor networks cannot support the high-dimensional data acquisition, digital conversion, and large-model inference required by conventional intelligent systems. This mismatch between learning capabilities and hardware limitations represents a critical barrier to realizing ambient intelligence at scale.

Our approach departs from this conventional pipeline by fundamentally rethinking the relationship between sensing hardware and computation. We relax the strict hardware requirements for sensing on resource-constrained devices and instead leverage tiny ML models that operate on physically pre-processed signals. The key idea is to merge the fundamental laws of physics with learning principles, creating sensing frontends that perform signal transformations in the analog domain before any power-consuming digital processing occurs. In other words, these passive structures act as physical neural encoders, encoding domain knowledge directly into hardware that requires zero power to operate.
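The paradigm can be caricatured in a few lines. This is a conceptual sketch only: the random matrix below is a stand-in for a real passive structure, and every name in it is hypothetical:

```python
# Conceptual sketch (hypothetical code) of physical pre-processing: a passive
# structure applies a fixed linear transform "for free" in the analog domain,
# so the ADC digitizes a few features instead of the raw waveform, and the
# digital "model" can be tiny.
import numpy as np

rng = np.random.default_rng(0)
PHYSICAL_ENCODER = rng.standard_normal((8, 1024))  # fixed by geometry; 0 power

def sense(raw_signal: np.ndarray) -> np.ndarray:
    """What actually gets digitized: 8 features instead of 1024 samples."""
    return PHYSICAL_ENCODER @ raw_signal

def tiny_model(features: np.ndarray, centroids: np.ndarray) -> int:
    """Nearest-centroid classifier, small enough for a microcontroller."""
    return int(np.argmin(np.linalg.norm(centroids - features, axis=1)))
```

Here the 128× reduction in dimensionality happens before any power-consuming digitization; the systems in Section 1.3 realize such transforms with metamaterials and reconfigurable antennas rather than a matrix.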
For instance, in acoustic sensing, we implement this concept through 3D-printed metamaterials that transform omnidirectional microphones into direction-aware sensors (Figures 1.1a, 1.1b). For wireless applications, we develop gain-pattern reconfigurable antennas that embed spatial information directly into received signal strength measurements (Figure 1.1c). By performing these transformations in the physical domain, our approach dramatically reduces both data dimensionality and energy requirements. The resulting sensor outputs contain high-level features extracted with minimal energy consumption, requiring significantly smaller neural models for downstream processing while providing information that naturally aligns with the hierarchical structure of neural networks.

Figure 1.3: Our approach to sustainable ambient intelligence combines nature-inspired architectures, physics-informed AI, and scalable networks.

This approach combines three complementary strategies that together enable sustainable ambient intelligence, as illustrated in Figure 1.3. First, we draw inspiration from nature's efficient solutions to similar problems. Biological systems have evolved sophisticated sensing capabilities within strict resource constraints—owls can precisely locate prey in total darkness using asymmetric ear structures, insects navigate complex environments with tiny brains, and even plants respond to environmental stimuli through passive mechanisms. By studying these biological systems, we identify principles for efficient spatial sensing that can be translated into engineered systems through biomimetic design. This bio-inspired approach leads to sensing architectures that achieve remarkable capabilities with minimal active components.

Second, we integrate physics-informed machine learning to bridge the gap between simple hardware and complex perception tasks.
Our systems incorporate passive components that perform initial signal transformations in the analog domain without consuming power, effectively encoding domain knowledge directly into the hardware. These physical structures act as computational elements, extracting meaningful features before digital processing begins. The resulting signals naturally align with the hierarchical feature extraction performed by neural networks, dramatically reducing the computational burden on subsequent processing stages. This physics-informed approach enables us to extract maximum information from minimal hardware, achieving capabilities that would traditionally require complex, power-hungry systems.

Third, we develop techniques for scaling these intelligent systems across large networks and diverse environments. By reimagining how devices communicate, collaborate, and leverage existing infrastructure, we create systems that can be deployed at unprecedented scales—from city-wide sensor networks to micro-robotic swarms. This scalable approach enables widespread deployment of ambient intelligence without requiring dedicated infrastructure or frequent maintenance, making these systems practical for real-world applications.

Together, these strategies enable a new class of intelligent systems with transformative potential. They deliver extreme energy efficiency, supporting sensing functions that typically demand complex digital systems with only minimal power use. They improve robustness by leveraging core physical principles for better generalization. They also allow real-time adaptability through dynamically reconfigurable designs. Most importantly, they close the gap between ambient intelligence's promise and the practical limits of embedded platforms, making it feasible to deploy at scale.
1.3 Systems Developed

1.3.1 Structure-Assisted Spatial Intelligence

Spatial perception systems fundamentally rely on sensor arrays spanning multiple wavelengths, consuming hundreds of milliwatts of power and requiring substantial physical space. This dependency on arrays creates a critical barrier for resource-constrained devices like micro-robots and IoT sensors. We developed a different approach that achieves spatial sensing without arrays by combining carefully designed passive structures with minimal active components, significantly reducing both power and computational requirements.

Figure 1.4: Structure-assisted sensing systems: (A) SPiDR's depth imaging using a 3D-printed acoustic metamaterial, (B) SPiDR's acoustic stencil and its internal spatial filtering channels, (C) Owlet's acoustic stencil and its internal direction-finding channels, (D) Owlet's direction finding with passive acoustic structures.

Bio-inspired DoA Estimation. Conventional acoustic DoA estimation requires multiple synchronized microphones separated by half-wavelength distances. Drawing inspiration from how barn owls achieve precise sound localization through asymmetric ear structures, we developed Owlet [17], a system that reimagines direction estimation. The key insight lies in leveraging diffraction and Helmholtz resonance through 3D-printed metamaterials to create direction-dependent acoustic filtering (Figures 1.4C, 1.4D). By wrapping a single microphone with a structured pattern of holes and resonant cavities, each sound direction creates a unique spectral signature. We solved the critical challenge of environmental robustness through a two-microphone architecture that eliminates both source and environmental dependencies. The system achieves 3.6° angular error—matching a 9-microphone array while using 100× less power.

Single-sensor Depth Perception.
Insect-scale robots operating in unknown environments require depth perception for navigation, but existing solutions like LiDAR and ultrasound arrays are power-hungry and bulky. We developed SPiDR [18], a fundamentally new approach to depth perception using a single microphone-speaker pair. The core idea is to spatially encode the acoustic channel through a metamaterial "stencil" that creates unique signatures for each point in 3D space (Figures 1.4A, 1.4B). These physics-optimized waveguides embed both direction and distance information in single measurements, eliminating the need for scanning or arrays. Through sparse recovery algorithms informed by wave interference patterns, SPiDR achieves centimeter-level depth accuracy while consuming only 0.83 mJ per frame—a 400× improvement over traditional solutions. This work establishes new possibilities for perception in resource-constrained robotics.

Passive Spectral Analysis. Extending our structure-assisted sensing approach to spectral processing, we drew inspiration from how the human cochlea naturally decomposes sound into frequencies. We developed Lyra [19], a system that uses standing-wave resonators to implement FFT-like spectral analysis without power-hungry ADCs or digital processing. By leveraging wave interference patterns in carefully designed cavity structures, Lyra enables always-on acoustic monitoring with microwatt power consumption, making it suitable for continuous environmental sensing in battery-constrained or energy-harvesting scenarios.

RF Self-localization. Building on our success with acoustic metamaterials, we extended these principles to RF signals with Sirius [20]. While passive structures worked well for acoustic sensing, the small wavelength diversity in RF necessitated a different approach. We developed a gain-pattern reconfigurable antenna that dynamically embeds direction-specific codes in received signals.
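A minimal sketch of how such direction codes can be decoded follows. This is hypothetical code: Sirius ultimately uses a learned neural decoder, and the cosine gain model below is a stand-in for measured antenna patterns:

```python
# Illustrative sketch of decoding direction from gain-pattern amplitudes.
# Cycling the antenna through N patterns yields one envelope amplitude per
# pattern, a_i = G_i(theta) * s; normalizing the vector removes the unknown
# signal strength s, leaving a signature unique to the angle theta.
import numpy as np

N_PATTERNS = 4
ANGLES = np.arange(360)  # candidate AoAs in degrees

def pattern_gains(theta_deg: float) -> np.ndarray:
    """Stand-in gain model: N rotated cardioid-like patterns."""
    th = np.radians(theta_deg)
    return 1.0 + 0.5 * np.cos(th - np.radians(90.0 * np.arange(N_PATTERNS)))

# Pre-measured signature table: one normalized amplitude vector per angle.
TABLE = np.stack([pattern_gains(a) for a in ANGLES])
TABLE /= np.linalg.norm(TABLE, axis=1, keepdims=True)

def estimate_aoa(amplitudes: np.ndarray) -> int:
    """Match the normalized amplitude vector to the closest angle signature."""
    sig = amplitudes / np.linalg.norm(amplitudes)
    return int(ANGLES[np.argmax(TABLE @ sig)])

# A transmitter at 40 degrees with unknown strength s = 3.7:
print(estimate_aoa(3.7 * pattern_gains(40)))  # 40
```

The normalization step is what makes an amplitude-only (phase-free) frontend viable for spatial sensing: the unknown transmit power cancels, and only the direction-dependent shape of the gain vector remains.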
Recent advances show that envelope detectors enable ultra-low-power communication, but they cannot extract the phase information needed for spatial sensing. We solved this through a neural pipeline that learns to decode spatial information directly from signal amplitudes, achieving 7° angular accuracy while consuming 1000× less energy than array-based positioning systems. This enables sustainable sensor networks across agricultural fields, supply chains, and wildlife habitats where traditional GPS (25 mJ per fix) would quickly deplete batteries.

Figure 1.5: LiTEfoot leverages cellular signals for nationwide tracking using an ultra-low-power miniaturized receiver.

1.3.2 Scalable nextG Systems

From tracking perishable goods to monitoring elderly patients with dementia, continuous location tracking of small assets and individuals has become essential across numerous domains. However, existing solutions like GPS rely on bulky batteries or dedicated infrastructure, creating fundamental barriers to widespread deployment. We developed two complementary systems that reimagine global positioning for resource-constrained scenarios, enabling seamless tracking from nationwide supply chains to dense urban environments.

Ultra-low-power Location Tracking. Traditional cellular localization requires frequency hopping across multi-GHz bandwidths using power-hungry oscillators and IQ demodulators that consume over 100 mW. We developed LiTEfoot [21], a cellular-based self-localization system that uses non-linear intermodulation to simultaneously capture synchronization signals across 3 GHz of spectrum through a passive envelope detector (Figure 1.5). The system decodes Physical Cell Identities from the folded spectrum and performs multilateration, achieving 19 m accuracy while consuming only 40 µJ—a 625× reduction compared to GPS.
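The multilateration step can be sketched with a standard linearized least-squares fit. This is illustrative code with made-up coordinates, not the LiTEfoot implementation: once decoded PCIs are mapped to known tower positions and coarse range estimates, the tag position follows from the range equations ||p - t_i|| = r_i:

```python
# Illustrative multilateration sketch (hypothetical code and coordinates).
# Subtracting the first range equation from the others cancels the quadratic
# ||p||^2 term, leaving a linear system solvable by least squares.
import numpy as np

def multilaterate(towers: np.ndarray, ranges: np.ndarray) -> np.ndarray:
    """Estimate a 2D position from >= 3 tower positions and range estimates."""
    t0, r0 = towers[0], ranges[0]
    A = 2.0 * (towers[1:] - t0)
    b = (r0**2 - ranges[1:]**2
         + np.sum(towers[1:]**2, axis=1)
         - np.sum(t0**2))
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    return p

# Three towers and ranges consistent with a tag at (300, 400) meters:
towers = np.array([[0.0, 0.0], [1000.0, 0.0], [0.0, 1000.0]])
tag = np.array([300.0, 400.0])
ranges = np.linalg.norm(towers - tag, axis=1)
print(multilaterate(towers, ranges))  # ≈ [300. 400.]
```

With noisy, quantized ranges the least-squares formulation degrades gracefully, which is what makes coarse cellular ranges usable for the tens-of-meters accuracy reported above.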
This enables 11-year operation on a coin-cell battery for nationwide asset tracking in supply chains, precision agriculture, wildlife conservation, and healthcare monitoring, all without requiring dedicated infrastructure.

Figure 1.6: Locate3D enables infrastructure-free tracking at scale through optimized peer-to-peer measurements.

Infrastructure-free 6DoF Tracking at Scale. For scenarios without cellular infrastructure, we developed Locate3D [22], a peer-to-peer system enabling infrastructure-free 6-degree-of-freedom tracking using UWB radios for massive-scale networks. The system introduces angle measurements alongside ranging to reduce the minimum number of edges needed for unique topology realization by 4× (see Figure 1.6). Using rigidity-aware spanning tree optimization and non-rigid graph decomposition, Locate3D achieves 0.86 m accuracy in building-scale networks and 12.09 m in city-scale deployments while reducing latency by 75%. The system seamlessly scales to track 100,000 devices across cities, enabling transformative applications from coordinating disaster-response teams to managing autonomous delivery fleets and monitoring smart-city infrastructure.

1.3.3 Systems for Sustainability

While 800 million people face hunger globally, nearly 40% of food produced is lost to waste, primarily due to inadequate monitoring during distribution. In collaboration with Microsoft Research, we developed sensing systems to enable data-driven food quality monitoring across global supply chains.

Non-invasive Food Quality Monitoring. Current food quality assessment methods require destructive testing, making continuous monitoring throughout distribution impractical. We developed FreshSense [23], a wireless sensing system that monitors dry matter content non-invasively at the pallet level.
The key challenge lies in measuring subtle changes in water content through densely-packed produce where traditional RF sensing fails due to complex multi-path effects. We solved this through a dispersion-based sensing approach that exploits frequency-dependent wave propagation in water-rich environments. By analyzing electromagnetic delays with physics-informed neural networks, FreshSense achieves robust quality assessment while eliminating environmental variations.

1.3.4 Other Contributions

In my Ph.D., I have also developed solutions across security, AI verification, and audio processing, demonstrating the broader applicability of our approach.

Security and Self-defense for Drones. As drones become trusted delivery systems and law enforcement tools, they face increasing risk of mid-air attacks and vandalism. We developed DopplerDodge [24], an acoustic sensing system enabling real-time threat detection and avoidance in resource-constrained drones. Using just a single microphone and Doppler effect analysis, the system detects incoming projectiles with 100 ms advance warning, enabling autonomous evasive maneuvers while consuming minimal power. This work establishes a new direction in embedded defense systems for tiny autonomous vehicles.

Side-channel Security in Embedded AI. With the proliferation of edge AI, verifying the trustworthiness of model inference is increasingly important. We developed ThermWare [25], a system that leverages thermal side-channels to detect anomalous computations in embedded AI systems. By capturing spatiotemporal heat signatures with a thermal camera, the system identifies unauthorized operations with 94% accuracy, enabling non-invasive run-time AI model monitoring without requiring system-level access or modifications.

Noise-cancellation for Wearables.
To improve voice communication in noisy environments, we developed VoiceFind [26], a speech enhancement system that uses just two microphones to achieve spatial filtering of desired speech. Through a combination of harmonic-based direction finding and conditional generative adversarial networks, the system improves speech intelligibility by 16% in real-world environments. This work demonstrates how physics-informed machine learning can enable sophisticated audio processing on resource-constrained wearables.

1.4 Organization

The remainder of this thesis is organized as follows. Chapter 2 introduces a bio-inspired acoustic sensing system that achieves direction finding with a single microphone through carefully designed metamaterials, enabling spatial audio perception in wearable devices. Chapter 3 presents a novel approach to depth perception for micro-robotics that provides high-resolution imaging with minimal power requirements through physics-based spatial encoding. Chapter 4 explores self-localization for IoT devices using reconfigurable RF structures and machine learning techniques, achieving long-range positioning with significantly reduced energy consumption. Chapter 5 details a cellular-based localization framework that enables nationwide tracking with sticker-sized tags by exploiting existing cellular infrastructure for applications in supply chains and healthcare. Chapter 6 describes an infrastructure-free tracking system that supports large-scale coordination of devices in urban environments through innovative peer-to-peer algorithms. Chapter 7 presents a non-invasive approach to food quality monitoring that creates three-dimensional maps of food quality within pallets through physics-informed sensing techniques. Chapter 8 concludes with reflections on the broader implications and future directions of this thesis.
Chapter 2: Structure-Assisted Spatial Audio Sensing

2.1 Introduction

Acoustic devices are increasingly becoming pervasive in our everyday environments. Beyond just voice interfaces, a broad spectrum of applications is emerging that leverages multiple facets of context-awareness and analytics. These applications encompass indoor activity monitoring based on sound [27–29], health monitoring through acoustic signals [30, 31], speech development support and acoustic environment sensing with on-body wearables [32, 33], along with numerous outdoor use-cases utilizing distributed sensor nodes [34, 35]. With the advent of new low-power and battery-free technologies [36, 37], it has become feasible to continuously capture and process sound using independent sensing modules distributed throughout the environment. Adding spatial analysis of sound and source localization can significantly enhance the capabilities of such context-aware systems. Meanwhile, spatial sensing of sound is also critical for robotic navigation and situational awareness systems, both in aerial environments [38–40] and underwater scenarios [41, 42]. However, conventional methods for obtaining spatial sound information typically rely on capturing multiple synchronized audio streams through microphone arrays, which is a power-intensive hardware requirement that is challenging for standalone sensing modules. In this chapter, we aim to build an acoustic sensing system that enables spatial information processing on power-constrained ubiquitous devices with compact form factors.

Figure 2.1: The vision and technical overview of Owlet, a low-power and miniaturized system for extracting spatial information from sound.
Owlet uses acoustic microstructures to embed direction-specific signatures on the recorded sound and develops a learning-based approach for signature recovery and mapping in real-time.

Estimating spatial features of sound, such as direction-of-arrival (DoA) or source location, traditionally relies on sampling the acoustic wavefront across space using a microphone array. Since conventional DoA algorithms are fundamentally based on this spatial sampling model, both the array dimensions and the number of microphones directly impact their performance. According to the sampling theorem [14], microphones in a linear array must ideally be spaced at half the signal's wavelength (λ/2) for accurate DoA estimation. Moreover, the angular resolution, inversely proportional to the half-power beamwidth, improves with the total length of the array aperture. Thus, achieving fine-grained DoA resolution typically demands large physical arrays. Additionally, these arrays require simultaneous sampling across all microphones, causing power consumption and hardware complexity to grow with array size. Although acoustic devices are increasingly common in ubiquitous computing, the power, hardware, and form-factor constraints limit the feasibility of traditional array-based spatial sensing. In this chapter, we explore an alternative approach to spatial signal processing that moves away from the standard spatio-temporal sampling paradigm, leveraging wave-structure interactions to enable low-power, compact, and simple designs.

Directional hearing aided by structural interactions is widespread in nature. The symmetric placement of ears in most mammals, including humans, effectively forms a two-element array for directional processing. However, biophysical studies show that fine-grained localization relies heavily on how sound interacts with the complex three-dimensional geometry of the head [43].
Owls, for example, possess asymmetrically positioned ears along both horizontal and vertical planes [44], allowing them to precisely localize low-frequency sounds, a task difficult to achieve with symmetric structures alone. Remarkably, certain insects with body sizes much smaller than one-tenth of the sound wavelength can localize sound as accurately as mammals [45]. A grasshopper, for instance, with a body width of just 3 mm, achieves precise localization despite its small size relative to the sound wavelength. This is enabled by asymmetrical body structures that produce direction-dependent responses to incoming sounds. The sensory and neural systems of these organisms have evolved to map such responses to sound direction. Inspired by these biological mechanisms, we design a structure-assisted DoA estimation system suited for power-constrained and miniaturized sensing platforms.

In this chapter, we present the design and prototype of an acoustic localization system that introduces physical structures around a microphone to embed directional cues. As acoustic waves propagate, they interact with physical structures, resulting in transformations of the wave field. Such interactions are evident at large scales in room acoustics, where the same sound can differ based on the room's shape, size, and object placement. We demonstrate that small 3D-printed structures can similarly manipulate sound waves, imprinting unique signatures onto the passing sound. By placing a microphone within a structure only a few centimeters in size, the recorded signals inherently carry these signatures. With careful design, the structure can embed distinct signatures for sounds arriving from different angles, achieving angular resolutions of a few degrees. Our system detects these embedded signatures to infer the direction of arrival (DoA) of sound. We name this system Owlet, inspired by a bird known for its exceptional auditory capabilities.
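The signature-to-direction mapping just described can be made concrete with a toy lookup sketch. The signatures below are synthetic placeholders, and Euclidean nearest-neighbor matching stands in for Owlet's actual learned model, which is described later in this chapter.

```python
import math, random

# Minimal illustration of signature-based DoA inference: a table of
# direction-specific signatures, then nearest-signature matching at run time.
# Signatures here are random stand-ins; Owlet learns the mapping instead.
random.seed(0)
N_FREQ, ANGLES = 32, range(0, 360, 10)

# One signature vector per direction (a toy "calibration table").
table = {th: [random.gauss(0, 1) for _ in range(N_FREQ)] for th in ANGLES}

def estimate_doa(signature):
    """Return the table angle whose signature is closest in Euclidean distance."""
    return min(table, key=lambda th: math.dist(table[th], signature))

# A noisy observation of the 130-degree signature still maps back to 130.
observed = [g + random.gauss(0, 0.1) for g in table[130]]
print(estimate_doa(observed))  # 130
```

The matching stays reliable as long as the structure keeps signatures for different angles well separated, which is exactly the design goal discussed in the stencil-optimization section.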
The idea of leveraging environmental variations in sound fields for localization is not new. Prior work has explored fingerprinting multipath environments and analyzing reflections for localization [46]. Closest to our approach is [47], which places objects in a 60 × 60 cm area with a central microphone, showing that scattered sound carries directional cues that can be used for DoA estimation. While Owlet builds on similar principles, it differs in two critical ways. First, we focus on creating a centimeter-scale sensing system suitable for resource-constrained robots and ubiquitous sensing applications. Our Owlet prototype achieves angular resolutions comparable to or better than previous work, with a compact 1.5 cm × 1.3 cm sensor. Second, we address robustness to environmental changes. Owlet is designed to operate beyond controlled environments like anechoic chambers and does not require location-specific training data.

A primary challenge for Owlet is achieving sufficient multipath diversity within a small form factor. Because low-frequency acoustic signals have large wavelengths, traditional reflection-based approaches require similarly large reflectors to create diversity, directly limiting spatial resolution. We overcome this limitation by developing a diffraction-based technique for miniature acoustic structures. When sound passes through small apertures, it diffracts, effectively generating new secondary sources. We exploit this phenomenon by designing a 3D-printed cylindrical cover, called a stencil, that surrounds the microphone. These stencils incorporate optimally patterned holes that create complex but predictable multipath interference inside the structure. The resulting interference patterns carry signatures encoding the direction of arrival. To enhance angular diversity, we also integrate principles of metamaterial design into the stencil.
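The diffraction-as-virtual-sources picture above can be illustrated with a toy interference sum: each hole re-radiates the incident plane wave, and the microphone observes the superposition. The hole layout and dimensions below are invented for illustration and are not an Owlet design.

```python
import cmath, math

# Toy model of diffraction through small apertures: each hole acts as a
# virtual point source, and the microphone observes their interference.
# Geometry is made up for illustration; the amplitude at the mic varies with
# the incidence angle, which is the direction-dependence the stencil exploits.

C = 343.0                                 # speed of sound, m/s
HOLES = [(0.006, 0.0), (0.004, 0.0045), (-0.002, 0.0057), (-0.006, -0.002)]
MIC = (0.0, 0.0)                          # microphone at the center

def mic_response(angle_deg, freq):
    """Interference sum at the mic for a plane wave arriving from angle_deg."""
    k = 2 * math.pi * freq / C
    d = (math.cos(math.radians(angle_deg)), math.sin(math.radians(angle_deg)))
    total = 0j
    for hx, hy in HOLES:
        phase_in = k * (hx * d[0] + hy * d[1])      # plane-wave phase at hole
        r = math.hypot(hx - MIC[0], hy - MIC[1])    # hole-to-mic path length
        total += cmath.exp(1j * (phase_in + k * r)) / r
    return total

for ang in (0, 60, 120, 180):
    print(f"{ang:3d} deg: |u| = {abs(mic_response(ang, 7000.0)):.1f}")
```

Even this crude four-hole model produces angle-dependent amplitudes; the real design additionally shapes the frequency response through the internal microstructures.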
The Owlet system learns these signatures through a one-time calibration and maps them to DoA estimates during operation.

Another major challenge is ensuring that the design remains robust against environmental changes that can unpredictably alter incoming sound. For practical deployment, the system must function reliably across diverse environments while requiring only a one-time calibration during manufacturing. As previously noted, room acoustics can distort the sound field and compromise the mapping between directional signatures and angles. To address this, Owlet introduces a reference microphone and adopts a communication-theoretic approach to suppress transient multipath effects during signature generation and mapping. This technique enhances Owlet's robustness to environmental variations, making it viable for real-world applications.

This chapter explores acoustic structures as passive components for creating low-power, miniaturized solutions in ubiquitous sensing. Potential applications include wearable devices for acoustic environment sensing, such as systems for assessing speech development in infants [48, 49] or personal acoustic analytics [50, 51], where sound direction is critical. Navigation in SWaP-constrained [52, 53] aerial and underwater robots can also benefit from spatial sensing capabilities enabled by Owlet. Moreover, Owlet offers a path toward directional sensing and localization in energy-harvesting systems, a task challenging for traditional microphone arrays. Figure 2.1 illustrates the broader vision and technical overview of this work. While many application opportunities emerge from this platform, this chapter focuses on developing the core capabilities and understanding the system's fundamental limits.

In this chapter, we make the following three contributions:

• A novel method of using passive elements for directional sensing, enabling a low-power, low-complexity, and miniaturized system for acoustic localization.
The sensing and signal processing techniques ensure robust DoA estimation with a single in-lab calibration. The system achieves a median DoA error of 3.6°, comparable to microphone array-based solutions while significantly reducing power and size requirements.

• A replicable process for designing and 3D-printing optimal acoustic structures that encode incoming sounds with directional cues, presenting a new approach for shaping sound fields using controlled diffraction in compact metamaterial structures.

• A complete hardware and software prototype of the Owlet system, made available for the community to reproduce, evaluate, and extend.

In the following sections, we detail the core intuition, system design, and key findings of this work.

Figure 2.2: The concept of using a stencil with direction-specific hole patterns and microstructures for passive filtering of the incoming sound. The stencil embeds a directional response to the recorded signals.

2.2 Core Intuitions and Primers

The core idea of our system is to engineer a controlled environment around the microphone such that the recorded signal carries a unique, direction-specific channel impulse response. This impulse response, extracted from the microphone recording, serves as a signature of the sound's angle of arrival. While larger objects or room acoustics naturally introduce diverse multipath effects that embed directional cues, our goal is to achieve finer-grained diversity within a compact form factor by combining principles of diffraction, interference, and structural resonance. Toward this end, we design a porous cap for the microphone, referred to as a stencil. The stencil features patterned holes on different sides, as illustrated in Figure 2.2.
Incoming sound waves pass through these holes, and depending on the angle of incidence, different hole patterns influence the wave before it reaches the microphone. The holes are coupled with internal microstructures of varying parameters, imparting distinct frequency responses to the sound. The stencil acts as a metamaterial, where the internal microstructures modulate the incoming sound and imprint a unique directional signature. Because the response of these microstructures is frequency-dependent, the directional signature is represented as a vector of complex frequency gains, Gθ. This concept is depicted in Figure 2.3.

Figure 2.3: The concept of passive directional filtering using a stencil of acoustic microstructure. The stencil embeds a directional signature to the recorded sound unique to its direction of arrival (DoA). The spectrum of complex gains represents the signature for further computation.

2.2.1 Metamaterials for Passive Filtering

When sound interacts with physical structures, its frequencies are either amplified or attenuated. At larger scales, multipath reflections create such frequency variations through constructive and destructive interference. While reflections can embed directional signatures into sounds, they typically require structures comparable in size to the acoustic wavelength. Since Owlet targets low-frequency audible signals with large wavelengths, traditional reflection-based approaches would demand structures on the order of half a meter, prohibitively large for our goals. To achieve passive filtering within a compact design, we leverage the concept of acoustic metamaterials. Metamaterials are artificially structured materials composed of subwavelength elements that endow the material with novel properties.
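The half-meter figure above is quick arithmetic from the acoustic wavelength (speed of sound taken as 343 m/s; the frequencies below are illustrative):

```python
# Reflector-size arithmetic: a reflector must be comparable to the wavelength,
# and low audible frequencies have wavelengths from centimeters up to a meter.
# Speed of sound assumed 343 m/s; frequencies chosen for illustration.
C = 343.0  # m/s

for f in (343.0, 1000.0, 7000.0):
    lam = C / f
    print(f"{f:5.0f} Hz: wavelength = {lam*100:5.1f} cm, lambda/2 = {lam*50:5.1f} cm")
```

At 343 Hz the half-wavelength is 50 cm, which is the "half a meter" scale quoted above, while a 1.5 cm stencil is deeply subwavelength across the entire audible band, hence the need for metamaterial techniques.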
In constructing the metamaterial stencil, we utilize three key principles: (a) diffraction, (b) capillary effects, and (c) structural resonance.

(a) Diffraction: When waves encounter the edge of an obstacle, they bend around it, a phenomenon known as diffraction. This behavior is particularly pronounced when sound passes through an aperture smaller than its wavelength [54]. In such cases, the hole behaves as a virtual point source. When a receiver is placed behind a barrier with multiple small apertures, it observes a multipath-like environment formed by multiple virtual sources. The interaction of signals from these sources creates patterns of constructive and destructive interference, influenced by both the receiver's position and the sound frequency. We exploit this property by designing diverse hole patterns on the stencil, enabling a rich multipath environment around the microphone within a small form factor.

(b) Capillary Effect: As sound propagates through narrow capillary tubes, its acoustic impedance varies significantly [55]. The dimensions of these tubes, particularly their length and cross-sectional area, influence the speed and phase of the transmitted sound. By integrating capillary tubes of different geometries into the stencil, we introduce controlled phase shifts between sound paths. This enhances frequency diversity, even when the physical separations between the holes are small.

Figure 2.4: Different types of metamaterial stencils used in our experiments: a cylindrical stencil with capillary tubes, a cylindrical stencil with micro-resonators, and a hemispherical stencil with capillary tubes (US penny for scale), along with the internal cavity structure of the stencils.

(c) Structural Resonance: Certain sound frequencies are amplified when oscillating air pressures interact with cavities along their path [56]. This phenomenon, known as Helmholtz resonance, is commonly observed in whistling bottles.
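The resonance frequency of such a cavity can be estimated with the standard Helmholtz formula, f = (c/2π)·√(A/(V·L_eff)). The dimensions below are illustrative only, not the actual Owlet resonator geometry, and the end-correction factor is a common textbook approximation:

```python
import math

# Helmholtz resonance of a millimeter-scale cavity (textbook formula;
# dimensions are illustrative, not an actual Owlet resonator design).
c = 343.0                      # speed of sound, m/s
r = 0.3e-3                     # neck radius: 0.3 mm
A = math.pi * r**2             # neck cross-sectional area
L = 2.0e-3                     # neck length: 2 mm
L_eff = L + 1.7 * r            # neck length with a common end correction
V = 15e-9                      # cavity volume: 15 mm^3

f_res = (c / (2 * math.pi)) * math.sqrt(A / (V * L_eff))
print(f"resonance ~ {f_res:.0f} Hz")   # lands in the audible band
```

Millimeter-scale necks and cavities thus place resonances within the audible band Owlet operates in, which is why varying resonator geometry yields diverse frequency responses.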
We design millimeter-scale Helmholtz resonators embedded within the stencil, connected to the sound holes. By varying the shapes and dimensions of these resonators, we generate arbitrary resonance effects across different frequencies. Figure 2.4 shows examples of 3D-printed stencils with embedded microstructures for directional filtering. Figure 2.5 illustrates the improvement in angular diversity provided by the stencil, comparing the amplitude variation of a 7 kHz tone with and without the stencil. Finally, Figure 2.6 presents the corresponding diversity in direction-specific frequency responses across different stencil designs.

Figure 2.5: Angular diversity of the microphone with and without the microstructure stencil.

2.3 System Design

The system design focuses on two key objectives: (a) creating an optimal stencil structure that maximizes angular diversity, and (b) developing signal processing techniques to estimate the direction of arrival (DoA) from the recorded signal. Naturally, the accuracy of the system depends directly on the diversity introduced by the stencil. Our algorithms optimize the stencil design by simulating wave propagation around small structures and subsequently fabricating the optimized stencil using 3D printing. Before delving into stencil design details, we first describe the signal processing and DoA estimation techniques, providing an overview of the complete system.
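One simple way to score the "angular diversity" objective above is the minimum pairwise Euclidean distance between direction-specific gain patterns. The sketch below uses random placeholder patterns in place of simulated wave-propagation results, and a keep-the-best loop as a stand-in for the actual design search:

```python
import math, random

# Sketch of the diversity objective: score a candidate stencil design by the
# minimum pairwise Euclidean distance between its direction-specific gain
# patterns. Patterns here are random placeholders for simulator output.
random.seed(0)

def min_pairwise_distance(patterns):
    vals = list(patterns.values())
    return min(math.dist(a, b)
               for i, a in enumerate(vals) for b in vals[i + 1:])

def random_design(n_angles=12, n_freq=64):
    """One candidate design: a gain-pattern vector per direction."""
    return {a: [random.gauss(0, 1) for _ in range(n_freq)]
            for a in range(0, 360, 360 // n_angles)}

# Keep the best of several candidates (a stand-in for the simulation loop).
best = max((random_design() for _ in range(20)), key=min_pairwise_distance)
print(f"best design min-distance: {min_pairwise_distance(best):.2f}")
```

Maximizing the worst-case pairwise distance corresponds to guaranteeing a minimum DoA resolution in every direction, the same criterion used later in the stencil-optimization section.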
Figure 2.6: Comparison of the diversity in frequency responses (amplitude and phase) of the three types of metamaterial stencils: (a) Cylinder_capillary, (b) Cylinder_resonator, and (c) Hemisphere_capillary, each measured over 4–8 kHz for sources at 0°, 60°, 120°, and 180°.

2.3.1 Processing for DoA Estimation

At a high level, Owlet's DoA estimation operates in two stages. First, during a one-time in-lab calibration, we generate a table of direction-specific signatures Gθ by sending known wideband signals from various directions. These signals are recorded by a microphone equipped with the stencil cap, capturing the direction-dependent modifications imposed by the structure. This calibration process is analogous to the procedures used for commercial microphone arrays. The second stage occurs at run-time, where the system processes incoming sounds to extract the stencil-induced signature, Hstencil, and matches it against the pre-collected signature table to infer the DoA. In practice, we train a deep learning model on variations of the signature table, allowing the system to predict the DoA directly from pre-processed signals during real-world operation. A critical aspect of this processing is accurately extracting Hstencil from real-world signals, which must overcome two challenges: (i) separating the stencil's signature from the unknown source signal, and (ii) mitigating distortions caused by environmental multipath.
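In the idealized case of a known source and no environment, the extraction step reduces to frequency-domain division of the recorded spectrum by the source spectrum. A self-contained numerical check of that special case (synthetic signals; naive DFT to avoid dependencies):

```python
import cmath, random

# Known-source, no-environment extraction: dividing the recorded spectrum by
# the source spectrum recovers the stencil response exactly. Signals are
# synthetic; a naive O(n^2) DFT keeps the sketch dependency-free.

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

random.seed(1)
x = [random.gauss(0, 1) for _ in range(64)]      # known calibration signal
h = [0.9, 0.0, -0.4, 0.2] + [0.0] * 60           # toy stencil impulse response
X, H = dft(x), dft(h)
Y = [Xi * Hi for Xi, Hi in zip(X, H)]            # recorded spectrum: Y = X * H
H_est = [Yi / Xi for Yi, Xi in zip(Y, X)]        # spectral division
h_est = idft(H_est)
print(abs(h_est[0].real - 0.9) < 1e-9)           # True: h recovered
```

With an unknown source or room reflections this division is no longer available, which is exactly what the two-microphone design in the following subsections addresses.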
We first describe our method for eliminating dependency on the source signal, followed by techniques to suppress environmental effects.

2.3.2 Eliminating Source Signal Dependency

The signal recorded by the microphone inside the stencil is the source signal modified by the stencil's directional response. Assuming no environmental effects, if X(ω) denotes the source signal and Yin(ω) the recorded signal, their relationship in the frequency domain is:

Yin(ω) = X(ω) Hstencil    (2.1)

When the source signal X(ω) is known, the stencil response Hstencil can be directly obtained by dividing the recorded signal by the source signal, i.e., Hstencil = Yin(ω)/X(ω). In certain applications, such as navigation where a robot localizes itself using a known control signal, this assumption holds. However, in many other applications, including ambient sound localization or user direction detection from speech, the source signal is unknown. To address this, Owlet introduces a secondary microphone placed outside the stencil. This reference microphone records the incoming sound without the stencil's influence, providing an independent observation of the source signal. Unlike traditional microphone arrays, the secondary microphone can be placed arbitrarily close to the primary microphone without affecting system operation. Figure 2.7 illustrates the physical setup and realistic signal model of the system.

Figure 2.7: The two-microphone model for eliminating source and environmental dependency. The inside microphone records Yin = X · Henv · Hstencil and the outside microphone records Yout = X · H′env, where Henv and H′env are the environmental multipath responses and Hstencil is the direction-specific response of the stencil.

Consider the channel frequency responses from the source to the inside and outside microphones as Henv and H′env, respectively. These responses capture the effects of multipath propagation, including reflections from nearby objects.
The presence of the stencil around the internal microphone introduces an additional modulation represented by the frequency response Hstencil. Assuming linearity of the system, the signal recorded by the inside microphone experiences both environmental and stencil-induced transformations, as illustrated in Figure 2.7. Thus, the signals recorded by the inside and outside microphones, Yin(ω) and Yout(ω), can be expressed as:

Yin(ω) = X(ω) Henv Hstencil + N(ω)
Yout(ω) = X(ω) H′env + N′(ω)    (2.2)

where X(ω) denotes the source signal, and N(ω) and N′(ω) represent independent noise at the two microphones. Dividing Yin(ω) by Yout(ω) cancels the dependency on the unknown source signal but retains a residual environmental dependency:

Yin(ω)/Yout(ω) = Hstencil (Henv/H′env) + N′′(ω),    (2.3)

where N′′(ω) ≪ Hstencil (Henv/H′env). This residual term, Henv/H′env, implies that without further correction, the system remains sensitive to environmental variations. Thus, the stencil calibration or the training of the deep learning model would need to be performed for every new environment to ensure accurate angle prediction. Such a setup may be feasible in controlled scenarios where the positions of sound sources and sensing modules are fixed, such as object tracking on a conveyor belt. However, in most practical applications, the source location is unknown and variable, making exhaustive environment-specific training impractical. To overcome this limitation, we introduce a technique to eliminate environmental dependency, enabling Owlet to operate reliably with a single in-lab calibration.

2.3.3 Eliminating Environmental Dependency

Our approach to removing environmental dependency is based on the observation that, although Henv and H′env can vary unpredictably across environments, their ratio Hratio = Henv/H′env remains bounded when the microphones are placed closely together. This can be intuitively understood by considering the origin of environmental diversity.
After emission from the source, sound waves reflect off various surfaces, creating multiple paths that arrive at the microphones with different delays. These differences in path lengths cause variations in the observed environmental response. When microphones are widely separated, these path differences are significant, leading to substantial variability in Henv and H′env. However, when the microphones are placed within a few centimeters of each other, the differences in reflection paths become small and bounded. At the limit, if the microphones are exactly co-located, they observe identical environmental responses. Therefore, the ratio Hratio = Henv/H′env has a narrow distribution across frequencies when the microphones are closely spaced. We validate this property through both simulated ray tracing and real-world experiments, demonstrating that environmental variations can be effectively suppressed, enabling robust DoA estimation without location-specific retraining.

Once the distribution of Hratio is known and Hstencil is collected through the calibration process, we generate synthetic training data by sampling from the distribution of Hratio and combining it with Hstencil. This synthetic data enables robust training of the deep learning model for angle prediction without requiring real-world sound traces. Moreover, if the dimensions of the target environment and the positions of major reflectors are known, the synthetic training data can be tailored accordingly. Such customization accelerates model convergence and improves prediction accuracy. At run-time, the system extracts Hratio Hstencil from the two recorded channels, Yin(ω) and Yout(ω). To enhance this extraction, we employ a Recursive Least Squares (RLS) adaptive filter [57] operating in system identification mode.
The adaptive filter exploits the uncorrelated Gaussian noise between the two channels to estimate Hratio Hstencil by minimizing the following error term via gradient descent:

e(ω) = Yin(ω) − Yout(ω) (Henv Hstencil / H′env)    (2.4)

2.3.4 Synthetic Training for Deep Learning

We use the synthetic channel responses generated above to introduce environmental diversity into the training set for the neural network model. Specifically, we simulate different room environments and various source-microphone placements to create a wide range of Hratio Hstencil samples. Each sample is represented as a vector of 400 equally spaced frequency components between 0 and 8 kHz. Rather than using raw complex values, we separately extract the amplitude and phase spectra for training.

Figure 2.8: The architecture of the proposed CNN model. The input matrix of amplitude and phase spectra (400 × 2) passes through three convolutional layers with kernels 2×7, 1×5, and 1×3 (output sizes 394 × 64, 390 × 128, and 388 × 256), followed by a fully connected layer and a regression layer producing the predicted angle (1 × 1).

We adopt a Convolutional Neural Network (CNN)-based regression model for DoA estimation. CNNs are well-suited for environmental sound processing [58] and offer low-latency performance due to their compact parameter sets. Our model consists of a one-dimensional CNN with three convolutional layers followed by a fully connected layer and an output regression layer.
The convolutional layers have 64, 128, and 256 filters with kernel sizes of 2×7, 1×5, and 1×3, respectively. The regression layer minimizes the half-mean-squared-error loss for angular prediction. We customize the loss function according to the target range and resolution of the directional angles. ReLU activation functions and batch normalization layers are used between convolutional layers to accelerate training, and stochastic gradient descent serves as the optimizer. The model is trained for 100 epochs with a learning rate of 1×10⁻⁶. The block diagram of the CNN architecture is shown in Figure 2.8.

In addition to the regression model, we also develop a CNN-based classification model for evaluation and comparison, as described in Section 2.5.8. This model largely mirrors the regression architecture but replaces the output with a fully connected layer of size 360 (one node per degree) followed by a Softmax and classification layer.

2.3.5 Optimizing 3D Stencil Design

The performance of our system depends critically on the diversity of the frequency gain patterns across different angles. Our initial feasibility studies using randomly distributed holes on the stencil demonstrated a median DoA error of 7°. However, the angular resolution was non-uniform, with some directions exhibiting significantly worse accuracy. This issue arises from suboptimal distributions of holes and internal microstructures, leading to similar gain patterns for different angles. To address this, we systematically optimize the 3D stencil design to guarantee a minimum DoA detection resolution across all directions. Ideally, the stencil should maximize the diversity of frequency gain patterns associated with each possible DoA. This design problem parallels the information-theoretic challenge of constructing maximally diverse code sequences. Each frequency gain pattern, Gθ, associated with an angle θ, can be thought of as a codeword.
Our goal is to design a set of N codewords that are maximally distant from each other (i.e., maximizing the Euclidean distance between all pairs of codewords). The number of codewords, N, determines the angular resolution, ∆θ = 2π/N. Initially, we attempt to design ideal codewords of discrete frequencies and use them as guidelines for constructing desired gain patterns Gθ. These patterns are then mapped to physical arrangements of pinholes on the stencil surface at corresponding angles. Given the number of holes N, the distances from the microphone to the holes rn, the distance from the microphone to the stencil surface D, and the acoustic wavelength λ, the superposition of waves at the microphone can be modeled as:

u(λ) = Σₙ₌₁ᴺ (D / (jλ rn²)) e^(j2πrn/λ)    (2.5)

This equation defines the resu