ABSTRACT

Title of Dissertation: MINIMALIST SENSING TOWARD UBIQUITOUS PERCEPTION

Yang Bai, Doctor of Philosophy, 2025

Dissertation Directed by: Professor Nirupam Roy, Department of Computer Science

This thesis explores minimalist sensing, a design philosophy that prioritizes simplicity and efficiency in sensor technology to capture essential information for applications including robotics, environmental monitoring, and wearable technology. By focusing on streamlined functionalities, these sensors avoid the complexity and cost of more elaborate systems, offering practical solutions under resource constraints. My research emphasizes developing low-power, miniaturized systems that integrate seamlessly into both urban and natural environments, enhancing ubiquitous perception without the encumbrance of complex technologies. I explore three main areas: low-power and miniaturized acoustic direction-of-arrival (DoA) estimation, ultra-low-power spatial sensing for miniature robots, and a single-frequency tracking interface for voice assistants. The contributions include a novel low-power DoA estimation system using 3D-printed metamaterials, an innovative spatial sensing system for mobile robots using a single speaker-microphone pair, and a comprehensive voice and motion tracking interface that operates on a single frequency. This work aims to establish a pervasive perception network that offers continuous, reliable data while minimizing energy use and infrastructure demands, potentially revolutionizing real-time monitoring and responsiveness in diverse settings.

A critical aspect of achieving minimalist perception is the integration of machine intelligence and computation. By leveraging advanced algorithms and computational techniques, we can bridge the gap in minimalist perception, making it both feasible and efficient.
Machine learning and signal processing algorithms enhance the accuracy and functionality of simplified sensor systems, allowing them to perform complex tasks without sophisticated hardware. For instance, intelligent data processing enables low-power sensors to extract meaningful information from limited data inputs, reducing the need for extensive sensor networks. By incorporating these computational strategies, we can push the boundaries of minimalist sensing, enabling the creation of smart, resource-efficient perception systems that are capable of operating in diverse and challenging environments.

MINIMALIST SENSING TOWARD UBIQUITOUS PERCEPTION

by

Yang Bai

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2025

Advisory Committee:
Professor Nirupam Roy, Chair/Advisor
Professor Christopher Allan Metzler
Professor Xiaofan (Fred) Jiang
Professor Ashok K. Agrawala
Professor Min Wu

© Copyright by Yang Bai 2025

To my parents

Acknowledgments

First, I want to express my heartfelt gratitude to the society that fosters peace and stability, providing an environment where people can pursue advanced education and personal growth. This journey would not have been possible without the foundation of progress and the values that support learning and development.

I thoroughly cherished my time at the University of Maryland, College Park, as a Ph.D. student in the Department of Computer Science. This journey has been as much about personal growth and introspection as it has been about pushing the boundaries of knowledge through rigorous experiments and thoughtful hypotheses. First and foremost, I owe an immense debt of gratitude to my advisor, Professor Nirupam Roy, for accepting me as his Ph.D. student and for his unwavering support throughout this transformative journey.
In the face of challenges and doubts from others, he stood by me, offering not just guidance in research but also encouragement to believe in my own potential. His mentorship has been a beacon, reminding me of the power of resilience and self-confidence, as encapsulated in the lines of William Ernest Henley’s Invictus: “I am the master of my fate, I am the captain of my soul.” Professor Roy’s ability to inspire confidence and perseverance has left an indelible mark on my academic and personal growth. He not only taught me how to navigate the complexities of research but also how to confront challenges with determination and grace, ensuring that I emerged stronger and more self-assured. For this, I will forever remain grateful.

I am also profoundly grateful to my committee members for their invaluable contributions to my academic journey. I thank Professor Christopher Allan Metzler for generously sharing his profound knowledge about compressed sensing, which greatly influenced my work. I deeply appreciate Professor Xiaofan (Fred) Jiang for his inspiring research on acoustic sensing, which has been a source of motivation for my own studies. I am equally grateful to Professor Ashok K. Agrawala for his renowned expertise in real-time systems and pervasive computing, which provided valuable insights that broadened my perspective on the practical implications of my research. Similarly, I extend my deepest thanks to Professor Min Wu. Her innovative approaches and thoughtful perspectives encouraged me to think critically about the impact and ethical considerations of my work. Their collective guidance and encouragement have been invaluable in refining and enhancing the quality of this thesis.

Additionally, I would like to express my heartfelt gratitude to Professor Ramani Duraiswami for sharing his extensive knowledge of acoustics, which provided a strong foundation for many aspects of my research.
I am also deeply appreciative of Professor Dinesh Manocha, whose pioneering work was highly inspiring when I began exploring the intersection of minimalist perception with LLMs and LAMs. Their support and belief in my potential have been invaluable, and I am truly honored to have had their guidance and encouragement.

I would also like to express my gratitude to my Master’s advisor, Prof. Yingying (Jennifer) Chen, for her invaluable guidance and support in my research journey. She played a crucial role in mentoring me through my first two research papers, providing insightful direction and fostering my growth as a researcher. I am also deeply thankful to Dr. Jian Liu and Dr. Li Lu, who provided significant help throughout the research process, offering technical insights and constructive feedback. Their expertise and support were instrumental in overcoming challenges and successfully completing the work, shaping my foundation in academic research.

I am deeply thankful to my lab mates for creating such a supportive and intellectually stimulating environment. Working alongside such talented and driven individuals has been an incredibly enriching experience. I would especially like to thank Nakul Garg, Irtaza Shahid, Aritrik Ghosh, Harshvardhan Takawale, and Ayushi Mishra. Collaborating with them has enriched this dissertation in many ways, from insightful discussions to creative problem-solving. Their camaraderie, encouragement, and contributions have made this journey both productive and memorable. I am grateful for the friendships and professional relationships that I have built with them, which I will cherish beyond my time at the lab.

I would also like to express my heartfelt gratitude to my friend Kathryn Juliette Klett, who sits opposite my lab. She is an outgoing and kind individual whose smile and conversations always refresh me and lift my spirits.
Additionally, I want to thank my English teacher, Marcia Albergo, who has been a source of unwavering support and guidance. Like an elder in my family, she has been there for me.

I want to express my deepest gratitude to my husband, Zhenzhe Lin, who has been with me through every step of this journey, from our undergraduate days to the completion of this dissertation. His unwavering support, understanding, and encouragement have been a cornerstone of my success. Whether it was providing emotional strength during challenging times or helping me navigate the many facets of life, he has been my constant companion and greatest advocate. I am incredibly fortunate to have him by my side.

Finally, I owe profound gratitude to my parents, Qiusheng Bai and Ling Yang, for their unwavering belief in my abilities and their support throughout my life. My father has always been my greatest advocate, instilling in me the confidence to pursue success regardless of gender. His faith in my unlimited potential has been a source of strength and inspiration. Their love, sacrifices, and encouragement have been the bedrock of my achievements, and I dedicate this work to them with all my heart. Your support has been my foundation, and this achievement would not have been possible without you. This thesis is dedicated to you.

Table of Contents

Preface
Acknowledgements
Table of Contents

Chapter 1: Introduction
1.1 Background and Motivation
1.2 Problem Statement
1.3 Contributions to Minimalist Perception
1.4 Opportunities Beyond this Dissertation
1.5 Organization

Chapter 2: Low-power and Miniaturized Acoustic DoA Estimation
2.1 Overview
2.2 Core Intuitions and Primers
2.2.1 Metamaterials for passive filtering
2.3 System Design
2.3.1 Processing for DoA estimation
2.3.2 Eliminating source signal dependency
2.3.3 Eliminating environmental dependency
2.3.4 Synthetic training for deep learning
2.3.5 Optimizing 3D stencil design
2.4 Prototype Development
2.4.1 3D-printing stencil caps
2.4.2 Calibration and data collection
2.5 Evaluation
2.5.1 Evaluation setup and results summary
2.5.2 Impacts of external conditions
2.5.3 Performance in different environments
2.5.4 Impact of different sound sources
2.5.5 Performance in known environment
2.5.6 Localization Performance
2.5.7 Comparison with traditional methods
2.5.8 Comparison between learning models
2.5.9 Energy consumption
2.6 Limitations and Discussion
2.7 Related Work
2.8 Chapter Summary

Chapter 3: Ultra-low-power Acoustic Spatial Sensing
3.1 Overview
3.2 Core Intuitions and Primers
3.2.1 Coded signal projection with structures
3.2.2 Single receiver depth mapping
3.3 System Design
3.3.1 Low-power scene reconstruction
3.3.2 Directional code projection
3.3.3 Optimal microstructure design
3.3.4 Motion stacking
3.4 Evaluation
3.4.1 Metrics
3.4.2 Overall performance
3.4.3 Impact of the environment
3.4.4 Impact of system parameters
3.4.5 Impact of scene parameters
3.4.6 Computation techniques
3.4.7 Power consumption
3.5 Related Work
3.6 Discussion
3.7 Chapter Summary
Chapter 4: Simultaneous Voice and Handwriting Interface
4.1 Overview
4.2 Primer: Location from Phase
4.2.1 Advantage of pure-tone based ranging
4.2.2 Advantages of high-frequency signal
4.2.3 Challenges of pure tone-based ranging
4.3 Cross-Frequency Sonar Design
4.3.1 Implicit frequency and phase translation
4.3.2 Multipath avoidance
4.4 Evaluation
4.4.1 Location tracking performance
4.4.2 Performance of voice recovery
4.4.3 Performance under human voice and environmental noise
4.4.4 Performance of handwriting recovery
4.4.5 Performance of signature recovery
4.4.6 Impacts of external conditions
4.4.7 Ablation study
4.5 Discussion
4.6 Related Work
4.7 Chapter Summary

Chapter 5: Low-power Spatial Intelligence for Next-Gen Wearables
5.1 Overview
5.2 Related Work
5.2.1 Spatial audio detection
5.2.2 Multimodal large language models
5.2.3 Spatial-Aware Large Language Models
5.3 Microstructure-Assisted Spatial Encoding
5.3.1 Spatial encoding microstructure
5.3.2 Spatial speech generation
5.4 SING: Spatial Context to Wearable LLM
5.4.1 Spatial speech encoder
5.4.2 Speech-to-text encoder
5.4.3 Alignment to LLM space
5.5 Soundscaping: Multi-DoA Encoder
5.6 Evaluation of Spatial-Aware ASR
5.6.1 Dataset
5.6.2 Training details
5.6.3 Performance on spatially aware ASR
5.6.4 Performance under noise and room reverberation
5.6.5 Performance on different datasets
5.7 Evaluation of Multi-DoA Encoder
5.7.1 Dataset
5.7.2 Baseline
5.7.3 Evaluation metrics
5.7.4 Performance of DoA estimation
5.7.5 Performance of number of speakers estimation
5.8 Discussion and Future Work
5.9 Chapter Summary

Chapter 6: Conclusion
6.1 Impacts and Future Directions

Bibliography

Chapter 1: Introduction

1.1 Background and Motivation

In an increasingly interconnected world, ubiquitous perception—the ability for devices to continuously sense and interpret their surroundings—has become a key enabler for next-generation smart systems. From wearables and augmented reality (AR) glasses to IoT sensors and intelligent assistants, the ability to seamlessly perceive and interact with the environment is essential for enhancing human-computer interaction, improving accessibility, and enabling automation in everyday life. Acoustic sensing plays a critical role in this vision, as sound carries rich spatial and contextual information about an environment. It enables devices to locate sound sources, track motion, and infer scene dynamics, all while operating in conditions where vision-based sensing may fail (e.g., low-light environments or occluded spaces). However, despite the immense potential of acoustic sensing, significant technical hurdles prevent its integration into ubiquitous systems. Traditional approaches rely on large, power-hungry microphone arrays and wide-band signal processing, which are impractical for small, battery-operated, and always-on devices. Additionally, existing AI models, including large language models (LLMs), lack the ability to incorporate spatial sound perception, limiting their ability to reason about physical environments in an intuitive manner.

[Figure 1.1: Illustration of four types of minimalist sensing: frequency band, hardware size, energy consumption, and embedded AI.]
1.2 Problem Statement

As shown in Figure 1.1, to truly enable ubiquitous acoustic perception, we must address these challenges by developing minimalist-hardware, energy-efficient, frequency-conscious, and embedded-AI acoustic sensing solutions. These challenges can be categorized into three broad axes:

1. Power and Hardware Constraints on Ubiquitous Devices – The reliance on large, multi-microphone arrays and high-power processing makes acoustic sensing impractical for resource-constrained devices such as wearables, IoT nodes, and smart assistants.

2. Frequency Band Limitations and Interference with Voice Interfaces – Wide-band acoustic sensing methods interfere with critical voice-based applications, limiting their deployment on devices that prioritize natural speech interaction.

3. Integrating Acoustic Sensing with Large Language Models (LLMs) for Spatial Awareness – Despite advancements in AI, current LLMs lack the ability to process spatial audio information, preventing them from understanding and reasoning about soundscapes.

By tackling these fundamental barriers, this research paves the way for a new era of lightweight, intelligent, and seamlessly integrated acoustic sensing, enabling a world where devices can perceive and interact with their surroundings effortlessly.

1.3 Contributions to Minimalist Perception

With this goal of developing minimalist sensing systems toward ubiquitous perception in mind, my research breaks down into four areas: low-power and miniaturized acoustic DoA estimation, ultra-low-power spatial sensing for miniature robots, a single-frequency tracking interface, and integration of the miniaturized acoustic DoA estimation into LLMs for enhanced spatial understanding. This thesis makes the following contributions:

Low-power and miniaturized acoustic DoA estimation.
This dissertation introduces Owlet, an owl-inspired acoustic sensing system that enables low-power, compact, and high-resolution DoA estimation using a monaural setup embedded within a structured micro-enclosure. Unlike traditional microphone arrays that require multiple spatially distributed sensors, this approach passively encodes spatial information through controlled modifications of the impulse response, allowing a compact microstructure setup to infer DoA with high accuracy. By leveraging machine learning to extract spatial features from the altered acoustic signals, this method eliminates the need for complex array processing, significantly reducing hardware footprint, power consumption, and computational overhead. The compact nature of this system makes it well-suited for wearable devices, IoT nodes, and AR/VR platforms, where conventional microphone arrays are impractical. This contribution represents a fundamental shift toward minimalist acoustic sensing, demonstrating that spatial perception can be achieved with single-microphone solutions, paving the way for energy-efficient, scalable, and widely deployable acoustic sensing technologies.

Ultra-low-power spatial sensing for miniature robots. SPiDR is a microstructure-assisted acoustic sensing system designed to enable low-power spatial perception. Unlike conventional active sonar or microphone array-based imaging techniques, SPiDR employs a structured micro-acoustic surface that modulates incident acoustic waves, introducing diffraction patterns that encode spatial information. These passive structures alter the wavefront in a predictable manner, allowing a single microphone to capture rich spatial features without requiring multiple sensors or active signal transmission. A computational model, combined with compressed sensing-based reconstruction, extracts the phase and amplitude variations induced by diffraction, effectively converting them into spatially resolved depth and shape information.
The system is designed to be compact and energy-efficient, making it suitable for miniature robotics, low-power imaging, and non-invasive material sensing. By leveraging microstructure-induced diffraction as a novel sensing modality, SPiDR introduces a fundamentally new approach to acoustic imaging, overcoming the limitations of traditional sensor arrays and expanding the possibilities of passive, energy-efficient spatial sensing in constrained environments.

Single-frequency tracking interface. Scribe introduces a novel authentication system that leverages single-frequency acoustic sensing and motion tracking to enable secure and intuitive handwriting-in-the-air authentication. Unlike traditional authentication methods that rely on passwords, biometrics, or capacitive touch surfaces, Scribe utilizes high-precision phase tracking of acoustic signals to capture the fine-grained trajectory of a user’s handwriting in free space. By integrating voice-based authentication with spatial handwriting recognition, Scribe achieves a robust two-factor authentication system that is resistant to forgery, spoofing, and environmental interference. A key technical innovation of Scribe is its ability to eliminate multipath interference through frequency hopping techniques, ensuring that the tracked handwriting is accurately reconstructed even in complex acoustic environments. Additionally, by leveraging single-frequency phase tracking, Scribe achieves high accuracy in motion estimation without requiring wide-band acoustic signals, making it energy-efficient and compatible with voice interfaces. The system’s compact and low-power design allows seamless integration into voice assistants, wearable devices, and mobile platforms, paving the way for next-generation, touch-free authentication and human-computer interaction.

Spatial context in Large Language Models (LLMs) for next-gen wearables.
This work expands Owlet to detect multiple DoAs for speech simultaneously and integrates this information into an LLM. This contribution bridges the gap between low-power spatial sensing and advanced AI systems, enabling context-aware processing of spatial information in real time. By leveraging Owlet’s minimalist design, we develop methods to estimate and utilize several DoAs for speech concurrently, enhancing applications such as speaker tracking, spatial reasoning, and multi-speaker environments. The integration with an LLM demonstrates how minimalist sensing can enrich AI models with spatial perception capabilities, opening new possibilities for human-centered AI applications.

A key technical contribution of SING is the development of a DoA encoder, which extracts spatial cues from raw multi-microphone audio signals and encodes them into a compact, structured representation. This encoder learns spatial embeddings that capture information about sound source directionality, spatial relationships, and environmental reflections, ensuring robust spatial representation across diverse acoustic scenes. To align this information with LLMs, SING introduces a projection layer that maps the learned DoA embeddings into the LLM’s latent space, allowing it to integrate spatial perception seamlessly with text-based reasoning. This alignment ensures that LLMs can leverage spatial audio understanding alongside other modalities, unlocking new capabilities such as spatially informed dialogue, immersive AI-assisted navigation, and intelligent environmental awareness.

1.4 Opportunities Beyond this Dissertation

While this dissertation presents significant advancements in minimalist acoustic sensing and spatial perception, several exciting opportunities remain open for further exploration. The methods and technologies introduced in this work provide a strong foundation for future research, expanding the scope of applications and optimizing performance in various domains.
Below, we outline several key opportunities beyond this dissertation.

Optimizing SPiDR with Deep Image Prior. SPiDR, introduced in this dissertation as a microstructure-assisted imaging technique, offers a novel approach to spatial sensing. However, its reconstruction quality can be further enhanced using deep image prior (DIP) [1] methods. Unlike traditional deep learning approaches that require large datasets for training, DIP leverages the implicit structure of convolutional networks to refine noisy or incomplete images. By integrating DIP into SPiDR’s reconstruction pipeline, we can potentially improve the resolution and accuracy of spatial information, making the system more robust for high-precision applications such as micro-scale sensing and medical imaging.

Extending Single-Frequency Tracking for Indentation Detection and Fusion with Image-Based 3D Shape Reconstruction. Image-based 3D shape reconstruction has been widely explored for capturing object geometries, yet challenges remain in achieving high precision, particularly for fine surface details and subtle indentations. Given that single-frequency tracking has demonstrated precise motion tracking capabilities, an intriguing question arises: can the same technique be adapted for accurate indentation detection? By leveraging single-frequency tracking to map indentation profiles, we can complement existing image-based 3D reconstruction methods, fusing the two modalities to achieve higher accuracy in 3D shape estimation. This integration is particularly valuable for applications such as 3D printing, manufacturing, and quality control, where both external geometry and fine indentation details are critical for achieving high-fidelity replication. A hybrid approach combining precise acoustic-based indentation detection with image-based reconstruction has the potential to refine object modeling, enabling more precise and efficient fabrication processes.
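To see why single-frequency phase tracking can plausibly resolve indentation-scale detail, consider a toy simulation: a pure tone reflected off a moving surface is I/Q-demodulated at a single receiver, and the unwrapped carrier phase is converted to displacement via d = φ·c/(4πf), where the factor of 4π accounts for the round-trip path. All parameters below (tone frequency, sample rate, the 2 mm indentation profile) are illustrative assumptions for this sketch, not values taken from Scribe.

```python
import numpy as np

fs = 48_000          # sample rate (Hz), illustrative
f0 = 18_000          # single near-ultrasonic tone (Hz), illustrative
c = 343.0            # speed of sound (m/s)
t = np.arange(0, 0.5, 1 / fs)

# Simulated surface motion: a 2 mm sinusoidal indentation profile at 3 Hz
d_true = 0.002 * np.sin(2 * np.pi * 3 * t)

# Received echo: the round-trip path change 2*d shifts the tone's phase
phase_shift = 2 * np.pi * f0 * (2 * d_true) / c
rx = np.cos(2 * np.pi * f0 * t - phase_shift)

# I/Q demodulation: mix with quadrature carriers, then low-pass filter
# (a ~10 ms moving average) to remove the double-frequency terms
i = rx * np.cos(2 * np.pi * f0 * t)
q = rx * np.sin(2 * np.pi * f0 * t)
win = np.ones(480) / 480
i_lp = np.convolve(i, win, mode="same")
q_lp = np.convolve(q, win, mode="same")

# Unwrapped phase -> displacement: lambda/(4*pi) metres per radian (round trip)
phase = np.unwrap(np.arctan2(q_lp, i_lp))
d_est = phase * c / (4 * np.pi * f0)

# Ignore filter edge effects when measuring accuracy
err = np.max(np.abs(d_est[5000:-5000] - d_true[5000:-5000]))
print(f"max tracking error: {err * 1e3:.3f} mm")
```

In this idealized noise-free setting the naive demodulator recovers the 2 mm profile to well under 0.1 mm, which is the basic intuition behind preferring phase over time-of-flight for fine surface measurements; a real indentation sensor would additionally have to handle noise, multipath, and phase wrapping for motions larger than half a wavelength.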
Extending Microstructure-Assisted Imaging to Underwater Environments. Microstructure-assisted imaging techniques like SPiDR have demonstrated strong potential in controlled terrestrial environments. However, their principles can be extended to underwater imaging, where traditional optical and sonar-based imaging techniques face limitations. In murky or low-light underwater conditions, acoustic-based imaging through microstructured surfaces could provide a novel alternative for capturing fine structural details of submerged objects. This opens up opportunities for applications in marine exploration, environmental monitoring, and underwater robotics, where compact, low-power imaging solutions are crucial.

[Figure 1.2: Organization of the thesis in chapters (minimalist perception divided into microstructure-assisted perception, single-frequency perception, and spatial audio perception, mapped to Chapters 2–5).]

1.5 Organization

The subsequent sections elaborate on these ideas of minimalist sensing, starting with low-power and miniaturized design for acoustic DoA estimation in Chapter 2. Chapter 3 extends the primitive of 3D-printed passive structures to an ultra-low-power spatial sensing system for miniature mobile robots. Chapter 4 explains a comprehensive voice processing and handwriting interface for voice assistants. Chapter 5 discusses the expansion of Owlet for spatial speech integration into LLMs. Finally, we conclude in Chapter 6. Figure 1.2 shows the organization of the topics in the rest of this dissertation.

Chapter 2: Low-power and Miniaturized Acoustic DoA Estimation

2.1 Overview

Acoustic devices are increasingly integrated into our daily environments in diverse forms.
Beyond traditional voice interfaces, a growing number of applications are leveraging sound for context-awareness and analytical purposes. These include indoor activity recognition [2–4], health monitoring through acoustic signals [5, 6], speech development tracking and acoustic environment sensing using wearable devices [7, 8], as well as outdoor applications powered by distributed sensor networks [9, 10]. With advancements in low-power and battery-free technologies [11, 12], it is now feasible to continuously sense and analyze audio through compact, standalone modules deployed throughout the environment. Integrating spatial sound analysis and source localization into these systems can further enhance their contextual sensing capabilities. Meanwhile, spatial sound sensing is critical for applications such as robotic navigation and situational awareness in both aerial [13–15] and underwater [16, 17] settings. However, conventional approaches to spatial sound processing typically rely on multi-channel audio captured by microphone arrays, which demand significant power and are unsuitable for lightweight, low-power sensing nodes. This chapter aims to design an acoustic sensing framework capable of spatial sound analysis within the constraints of small, power-efficient ubiquitous computing platforms.

Figure 2.1: The vision and technical overview of Owlet, a low-power and miniaturized system for extracting spatial information from sound. Owlet uses acoustic microstructures to embed direction-specific signatures on the recorded sound and develops a learning-based approach for signature recovery and mapping in real time.

Extracting spatial characteristics of sound—such as direction-of-arrival (DoA) or source localization—typically involves sampling the acoustic wave across space using a microphone array. Traditional DoA estimation algorithms rely heavily on this spatial sampling framework, making the array's size and the number of microphones critical to their accuracy. Based on the sampling theorem [18], an ideal linear array features microphones spaced at half the wavelength (λ/2) of the signal for optimal DoA estimation. Additionally, the angular resolution, often defined by the inverse of the half-power beamwidth, scales with the overall length of the array. As a result, achieving high-resolution DoA measurements conventionally demands large and complex hardware setups. These systems also require synchronized data collection from all microphones, leading to increased power usage and greater system complexity. Although acoustic devices are now widespread in ubiquitous computing, the significant power, hardware, and size constraints of traditional arrays limit their integration in applications that demand fine-grained spatial sensing. In this chapter, we propose an alternative approach to spatial audio processing that moves away from the standard spatio-temporal sampling paradigm. Instead, we explore how sound waves interact with physical structures to enable a compact, low-power, and low-complexity solution for spatial sensing. Directional hearing aided by acoustic structures is widely observed in nature. In most mammals, including humans, the placement of the left and right ears mimics a two-element array for capturing directional sound cues. However, research in biophysics reveals that these animals rely not only on ear positioning but also on how sound interacts with the complex 3D geometry of the head to achieve fine-grained source localization [19].
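To make the hardware-scale argument concrete, the λ/2 spacing and total aperture of a conventional linear array can be computed directly. The following is a small illustrative sketch, not part of the Owlet system; the 8 kHz signal and 8-element array are example values.

```python
# Sketch (illustrative values): size of a conventional lambda/2 linear array.
SPEED_OF_SOUND = 343.0  # m/s, in air at roughly 20 degrees C

def half_wavelength_spacing(freq_hz):
    """Microphone spacing (meters) of an ideal lambda/2 array for freq_hz."""
    return SPEED_OF_SOUND / (2.0 * freq_hz)

# Example: an 8-element array tuned for an 8 kHz signal.
spacing = half_wavelength_spacing(8000.0)   # about 21.4 mm per gap
aperture = 7 * spacing                      # about 15 cm end to end
```

Even for a relatively high 8 kHz signal, the array spans roughly 15 cm, an order of magnitude larger than the centimeter-scale sensor targeted in this chapter.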
In contrast, many owl species have asymmetrically positioned ears—both horizontally and vertically—which enhances their ability to detect low-frequency sounds with high precision [20], a feat that symmetric ear arrangements cannot accomplish due to their limited spacing. Remarkably, certain insects, despite having bodies smaller than one-tenth of the wavelength of the sounds they perceive, can localize sound as accurately as mammals [21]. For example, a grasshopper with a body width of just 3 mm—far smaller than the wavelength of the target sound—can still determine sound direction accurately. This capability arises from the insect's asymmetrical body structure and orientation, which produce direction-dependent acoustic responses. Their sensory and neural systems are finely tuned to interpret these responses and map them to the sound's direction of arrival. Drawing inspiration from these biologically driven, structure-enhanced hearing mechanisms, we propose a DoA estimation system tailored for compact, energy-efficient devices. In this chapter, we introduce the design and implementation of an acoustic localization system that leverages specially crafted acoustic structures surrounding a microphone to encode directional information. As sound waves travel, they interact with physical objects in their path, altering the wave field in the process. This phenomenon is evident in large-scale environments, such as rooms, where the same sound can vary significantly depending on the room's shape, size, and the arrangement of objects. We demonstrate that similar transformations can be intentionally induced on a much smaller scale using a compact, 3D-printed acoustic structure. This structure modifies the sound waves in a way that imprints a unique, direction-dependent signature onto them. When a microphone is placed inside such a structure—just a few centimeters in size—it captures sounds embedded with these distinct signatures.
With careful design, the structure can produce distinguishable patterns for sounds arriving from angles separated by only a few degrees. Our system decodes these patterns to estimate the DoA of incoming sounds. We call this system Owlet, inspired by the owl's exceptional auditory localization abilities. We are not the first to recognize the potential of using environmental variations in sound fields for localization. Previous research has investigated this idea by fingerprinting multipath environments and analyzing nearby sound reflections [22]. The work most closely related to ours is [23], where objects are placed within a 60×60 cm area, and a microphone is positioned at the center. That study demonstrated that sound scattered by surrounding objects contains directional information, which can be used to estimate the DoA. While the core idea behind Owlet aligns with this prior work, our approach differs in two key aspects. First, we aim to create a compact, centimeter-scale sensing system suitable for integration into low-power, resource-constrained robots or for widespread use in ubiquitous sensing applications. The Owlet prototype achieves angular resolution comparable to or better than previous systems, despite using a much smaller sensor measuring just 1.5 cm × 1.3 cm. Second, we tackle the challenge of ensuring robustness to changes in the environment. Unlike systems that depend on controlled conditions or location-specific training, Owlet is designed to operate reliably outside of anechoic chambers and perform consistently in real-world, dynamic settings. A key challenge in developing Owlet lies in achieving sufficient multipath diversity within a compact physical footprint. Low-frequency acoustic signals have long wavelengths, which typically require reflectors of comparable size to create the necessary spatial variations—directly impacting the system's spatial resolution.
To overcome this constraint, we adopt a diffraction-based approach rather than relying on traditional reflection-based methods to design miniature acoustic structures. The core idea stems from the principle that when sound waves pass through small openings, they diffract and behave like new point sources. Leveraging this, we design a 3D-printed cylindrical shell—referred to as a stencil—that encases the microphone. These stencils are embedded with carefully engineered patterns of holes that produce complex yet predictable multipath interference patterns within the structure. These patterns encode unique directional signatures into the captured sound. To enhance angular discrimination, we incorporate metamaterial-inspired design elements into the stencil. The Owlet system undergoes a one-time calibration to learn the relationship between these interference signatures and the DoA, enabling real-time directional sensing during operation. Another significant challenge is ensuring that the system remains robust against environmental variations, which can unpredictably alter the characteristics of incoming sound. For Owlet to be practical and widely deployable, it must operate reliably across diverse environments with only a one-time calibration performed during manufacturing. As previously discussed, room acoustics can distort the sound field, potentially disrupting the mapping between directional signatures and the actual direction of arrival. To address this, Owlet incorporates a reference microphone into the design and adopts a communication-theoretic approach to mitigate the impact of transient multipath effects during signature extraction and matching. This strategy enhances the system's resilience to environmental changes, making Owlet well-suited for real-world deployment. This chapter investigates the use of acoustic structures as passive elements in the design of novel, low-power, and compact solutions for ubiquitous sensing.
Potential applications include wearable devices for sensing acoustic environments to support speech development monitoring in infants [24, 25], as well as personal audio analytics [26, 27], both of which rely on accurate sound direction detection. Spatial sensing through Owlet can also enhance navigation for size, weight, and power (SWaP)-constrained mobile robots operating in air or underwater [28, 29]. Additionally, Owlet enables direction estimation and localization in energy-harvesting systems—something traditional microphone arrays struggle to achieve due to their power and hardware demands. Figure 2.1 illustrates the overarching vision and technical approach behind our work. While many application avenues are possible, this chapter focuses on building the core functionality of the system and evaluating its performance boundaries. At this stage of the project, we contribute the following three key advancements:

1. A novel approach that leverages passive structures for directional sensing, enabling a low-power, low-complexity, and compact system for acoustic localization. The sensing and signal processing pipeline is designed to provide reliable DoA estimation across varying environments using only a one-time calibration during manufacturing. The system achieves a median angular error of just 3.6°—on par with traditional microphone array solutions, but with significantly reduced power and size requirements.

2. A reproducible framework for designing and 3D-printing optimized acoustic structures that encode direction-dependent features into incoming sounds. This includes a method for shaping the acoustic field using controlled diffraction within compact metamaterial-based geometries.

3. A complete hardware and software prototype of the system, made available to the community for replication, evaluation, and future development of the Owlet platform.

Next, we elaborate on the core intuition, system design, and key findings of this project.
Figure 2.2: The concept of using a stencil with direction-specific hole patterns and microstructures for passive filtering of the incoming sound. The stencil embeds a directional response to the recorded signals.

2.2 Core Intuitions and Primers

At its core, our goal is to create a controlled acoustic environment surrounding the microphone so that the captured signal includes a distinct, direction-dependent channel impulse response. This response can be extracted from the recording and used as a unique signature indicating the angle of arrival of the sound. While typical room acoustics or the presence of large nearby objects can naturally introduce directional multipath effects, our approach seeks to generate more precise and fine-grained spatial diversity using a compact design. To achieve this, we integrate principles of diffraction, interference, and structural resonance. Specifically, we develop a perforated cap for the microphone—referred to as a stencil—featuring strategically placed hole patterns on various sides, as illustrated in Figure 2.2. Sound entering from a particular direction passes through these distinct hole patterns and converges at the microphone. Each set of holes is linked to internal microstructures with varying parameters, producing a unique frequency response that encodes directional information. The stencil forms a metamaterial with internal microstructures that naturally modulates incoming sound to introduce a unique directional signature. As the impact of the microstructures depends on the frequencies of the sound, the signature is essentially a vector of complex gains, G_θ, of the frequency response. The concept is explained in Figure 2.3.
Figure 2.3: The concept of passive directional filtering using a stencil of acoustic microstructure. The stencil embeds a directional signature to the recorded sound unique to its direction of arrival (DoA). The spectrum of complex gains represents the signature for further computation.

2.2.1 Metamaterials for passive filtering

When sound waves interact with physical structures, certain frequencies can be either amplified or attenuated. On a larger scale, such variations typically result from multipath reflections, where constructive and destructive interference alters the frequency profile of the sound. While these reflections can embed directional information into the sound, they generally require large structures—comparable to the sound's wavelength—to be effective. Since Owlet targets low-frequency sounds, which have long wavelengths, using traditional reflectors would necessitate structures nearly half a meter in size. To achieve directional filtering in a much smaller form factor, we turn to the concept of metamaterials—engineered materials composed of specially arranged substructures that exhibit unique acoustic properties. In designing our metamaterial-based stencil, we apply a combination of (a) diffraction, (b) capillary channel effects, and (c) structural resonance to create compact yet effective direction-dependent filtering.

(a) Diffraction: When sound waves encounter the edge of an obstacle, they tend to bend around it—a phenomenon known as diffraction. This effect becomes particularly interesting when sound passes through a small opening [30]. If the hole's size is much smaller than the wavelength of the sound, the wave diffracts at the edges, and the hole effectively acts as a virtual point source of sound.
When a receiver, such as a microphone, is positioned on the opposite side of a surface with multiple such apertures, it perceives a complex sound field similar to one created by multiple sources—resulting in a multipath-like environment. The overlapping signals from these virtual sources produce patterns of constructive and destructive interference, shaped by both the signal frequency and the receiver's position. We harness this behavior by embedding various small hole patterns into the stencil, generating a rich multipath effect around the microphone within a compact structure.

(b) Capillary effect: When sound travels through narrow capillary tubes, its acoustic impedance changes [31]. Additionally, both the length and cross-sectional area of these tubes influence the speed at which sound propagates. To simulate phase differences along different sound paths, we incorporate capillary tubes of varying shapes and dimensions into the stencil design. This approach introduces significant diversity in the resulting frequency spectrum, even though the openings between sound paths are closely spaced.

Figure 2.4: Different types of metamaterial stencils used in our experiments.

(c) Structural resonance: When sound waves with oscillating air pressure encounter cavities, certain frequencies become amplified—a phenomenon known as Helmholtz resonance [32], often demonstrated by the sound of air blown across the top of a bottle. We apply this principle by embedding millimeter-scale Helmholtz resonators into the stencil design, each connected to the sound holes. By varying the shapes and dimensions of these tiny resonators, we can produce targeted resonance effects at specific frequencies.
Figure 2.4 presents our 3D-printed stencils featuring integrated microstructures designed for directional acoustic filtering. In Figure 2.5, we illustrate how these microstructures enhance the angular sensitivity of the sensor by comparing the amplitude variation of a 7 kHz tone recorded with and without the stencil in the Owlet setup. Figure 2.6 further highlights the directional frequency response diversity achieved by different stencil designs.

Figure 2.5: Angular diversity of the microphone with and without the microstructure stencil.

Figure 2.6: Comparison of the diversity in frequency responses (amplitude and phase) of the three types of metamaterial stencils.

2.3 System Design

The system design centers around two core objectives: (a) creating an optimal stencil structure that maximizes angular diversity, and (b) developing computational methods to estimate the DoA from the recorded audio. The overall accuracy of the system is closely tied to the level of directional diversity introduced by the stencil. To achieve this, our algorithms simulate sound wave behavior around small-scale structures, optimize the stencil design accordingly, and then fabricate it using 3D printing for real-world testing. Before delving into the specifics of stencil design, we first outline our signal processing and DoA estimation methods, which also provide a high-level view of the system's architecture.

2.3.1 Processing for DoA estimation

At a high level, Owlet's DoA estimation method involves a two-step process. The first step is a one-time, in-lab calibration where we generate a set of direction-specific signatures, denoted as G_θ, for the stencil.
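The two-step pipeline can be sketched as a nearest-signature lookup against the calibrated table. This is a hypothetical toy sketch, not Owlet's actual implementation (which uses a learned model rather than a direct table match); the table below is random placeholder data.

```python
import numpy as np

# Hypothetical sketch of the lookup step: match a run-time signature against
# the calibrated table of direction-specific signatures G_theta. The table
# is random toy data; names and sizes are illustrative, not Owlet's API.
rng = np.random.default_rng(0)
angles = np.arange(0, 360, 10)  # calibrated DoAs, in degrees
table = rng.standard_normal((36, 400)) + 1j * rng.standard_normal((36, 400))

def lookup_doa(signature, table, angles):
    """Return the calibrated angle whose signature is nearest in Euclidean distance."""
    distances = np.linalg.norm(table - signature, axis=1)
    return int(angles[np.argmin(distances)])

# A lightly perturbed copy of the 130-degree signature maps back to 130 degrees.
noisy = table[13] + 0.1 * rng.standard_normal(400)
estimated = lookup_doa(noisy, table, angles)  # -> 130
```

In practice, a learned model replaces this direct lookup so that it can generalize across the signature variations described next.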
This is done by playing a known wideband sound from various angles and recording the responses using a microphone covered with the stencil cap. The recorded responses serve as unique directional signatures, similar to calibration procedures used in commercial microphone array systems. The second step takes place during actual usage. When an unknown sound arrives, the system processes the incoming signal to extract the stencil-induced signature, h_stencil, and compares it against the pre-collected signature table to estimate the corresponding DoA. In practice, we train a deep learning model on variations of the signature set and use this model to predict the DoA from the processed signal in real time. A key part of this process is accurately extracting the stencil-specific signature from real-world recordings. This involves overcoming two main challenges: (i) isolating h_stencil from the frequency-dependent characteristics of the sound source, and (ii) removing additional distortions caused by multipath effects in the surrounding environment. In the following sections, we describe our techniques for handling source variability and mitigating environmental multipath to ensure robust signature extraction across diverse conditions.

2.3.2 Eliminating source signal dependency

The signal recorded by the microphone inside the stencil is essentially the source signal distorted by the direction-specific response of the stencil. If we assume no environmental effect on the source signal X(ω), the signal received by the inside microphone Y_in(ω) can be expressed as a multiplication between this source signal and the stencil's response H_stencil in the frequency domain:

Y_in(ω) = X(ω) H_stencil    (2.1)

When the source signal X(ω) is known, extracting the stencil's response is straightforward—we can compute it as H_stencil = Y_in(ω)/X(ω).
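The known-source case of Equation 2.1 can be checked numerically: dividing the received spectrum by the source spectrum recovers the response. This is a toy sketch under stated assumptions; the 3-tap "stencil" response is a made-up stand-in for the real directional filter.

```python
import numpy as np

# Toy illustration of Eq. 2.1: with a known source X, the stencil response
# follows from frequency-domain division, H = Y_in / X.
rng = np.random.default_rng(1)
x = rng.standard_normal(1024)         # known wideband source signal
h_true = np.zeros(1024)
h_true[:3] = [1.0, 0.5, 0.25]         # toy stencil impulse response (assumed)

# Circular convolution via the FFT models Y_in = X * H_stencil exactly.
y_in = np.fft.irfft(np.fft.rfft(x) * np.fft.rfft(h_true))

H_est = np.fft.rfft(y_in) / np.fft.rfft(x)   # recovered frequency response
h_est = np.fft.irfft(H_est)[:3]              # back to the time domain
```

The recovered taps match the toy impulse response up to floating-point error, confirming that a known source makes the extraction a simple deconvolution.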
This setup works well in scenarios where the system uses predefined signals, such as in robotic navigation, where the robot estimates its orientation using the direction of a known control signal. However, many practical use cases—such as detecting the direction of ambient sounds or identifying a speaker's location during conversation—do not involve a known source signal. In these cases, separating H_stencil from the unknown source becomes challenging. To overcome this, we introduce a secondary microphone positioned outside the stencil. This microphone captures the same incoming sound but without being influenced by the stencil's directional filtering. Importantly, unlike traditional microphone arrays, this reference microphone can be placed very close to the primary one, allowing for a compact design. Figure 2.7 illustrates the physical setup and provides a realistic signal model for the system.

Figure 2.7: The two-microphone model for eliminating source and environmental dependency.

Consider the channel frequency responses to the inside and outside microphones to be H_env and H'_env, respectively. These channel responses manifest the effects of the multipath signal propagation from the source to the microphones and the signal's reflections from nearby objects. The presence of the stencil around the internal microphone introduces additional modulation to the recorded signal, represented by the frequency response H_stencil. Considering the linearity of the channels, the signal recorded by the inside microphone experiences both impulse responses, as shown in Figure 2.7. Therefore, the signals recorded simultaneously by these microphones, Y_in(ω) and Y_out(ω), can be formulated as the following equations. The source sound is X(ω) and the independent noise terms at the two channels are N(ω) and N'(ω)
at the frequency ω:

Y_in(ω) = X(ω) H_env H_stencil + N(ω)
Y_out(ω) = X(ω) H'_env + N'(ω)    (2.2)

If we divide Y_in(ω) by Y_out(ω), it successfully eliminates the dependency on the source signal. However, the environmental dependency remains in the form of H_env/H'_env:

Y_in(ω)/Y_out(ω) = H_stencil (H_env/H'_env) + N''(ω),    (2.3)

where N''(ω) ≪ H_stencil (H_env/H'_env). This means the stencil calibration process, or the training of the deep learning module, would have to cover all locations in the target environment to capture the environmental dependency and make the angle prediction effective. Such a system may be applicable in scenarios where the locations of the sound sources and the sensing modules are predefined, for instance, when acoustic localization is used to track objects on a conveyor belt or on a track. However, for most practical scenarios the location of the sound source is unknown, and it would require collecting data from virtually every point in the scene to train the prediction module, leading to an impractical solution. Next, we explain our technique to eliminate this location dependency. With this technique, Owlet can function with one round of in-lab calibration of the stencil and does not require collecting any calibration data at the target environment.

2.3.3 Eliminating environmental dependency

This final stage of the technique is based on the observation that, despite the diverse and unpredictable nature of the environmental channel responses H_env and H'_env, the ratio of the channels, H_ratio = H_env/H'_env, is bounded when the microphones are closely placed. This idea can be intuitively understood by first analyzing the reason for diversity in the environmental response H_env. The sound wave reflects off various objects in the environment after leaving the source. These reflections follow paths of varying lengths to get superimposed at the recording microphone along with the direct line-of-sight path.
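The cancellation in Equations 2.2–2.3 can be verified with synthetic spectra: the unknown source drops out of the ratio exactly. This is a toy check with random placeholder channels, not measured responses.

```python
import numpy as np

# Toy check of Eqs. 2.2-2.3: dividing the inside-mic spectrum by the
# outside-mic spectrum cancels the unknown source X (noise-free case).
rng = np.random.default_rng(2)
n = 513
X = rng.standard_normal(n) + 1j * rng.standard_normal(n)   # unknown source
H_env = 1.0 + 0.1 * rng.standard_normal(n)                 # channel to inside mic
H_env_out = 1.0 + 0.1 * rng.standard_normal(n)             # channel to outside mic (H'_env)
H_stencil = rng.standard_normal(n) + 1j * rng.standard_normal(n)

Y_in = X * H_env * H_stencil     # Eq. 2.2 without the noise terms
Y_out = X * H_env_out
ratio = Y_in / Y_out             # = H_stencil * H_env / H'_env; X drops out
```

Whatever the source spectrum, the ratio depends only on the stencil response and the residual channel ratio H_env/H'_env, which is the term addressed in the next subsection.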
The diversity in the path distances creates time delays in the reflected components, leading to a unique response of the environment. Therefore, two microphones, even when recording the same signal, can observe different responses, as the path lengths of the reflections are different. However, if the locations of the microphones are close to each other, these path differences of the reflections are bounded; at one extreme, when the two microphones are exactly collocated, they observe the same environmental response. Therefore, H_ratio = H_env/H'_env has a narrow distribution of values for each frequency in the response when the two microphones are a few centimeters apart from each other. We obtained the probability distributions from simulated ray tracing and real-world experiments. Once the distributions of H_ratio are known and H_stencil is collected through the calibration stage, we generate synthetic training data for H_ratio H_stencil by drawing from the distribution and use it for training the deep learning module. This process can make our angle prediction module robust to environmental variations at run-time without requiring real-world sound traces for training. Interestingly, if the dimensions of the target environment and the locations of the major reflectors are known, the synthetic training data can be customized to that environment. This customization reduces the time for convergence during training and improves prediction accuracy. The run-time processing now requires extracting H_ratio H_stencil from the two channels of sound, Y_in(ω) and Y_out(ω). We improve this process by employing a recursive least squares (RLS) adaptive filter [33] in system identification mode. The adaptive filter takes advantage of the uncorrelated Gaussian noise in the recorded signals to estimate H_ratio H_stencil by minimizing the following error term:

e(ω) = Y_in(ω) − Y_out(ω) (H_env H_stencil / H'_env)    (2.4)

2.3.4 Synthetic training for deep learning

To enhance the training diversity of our neural network model, we utilize the synthetic channel response described in the previous section. Specifically, we compute H_ratio H_stencil by simulating various room environments and different configurations of source and microphone placements. These simulations produce a distribution of channel responses, which we use to generate additional training examples of H_ratio H_stencil for the learning model. Each direction of arrival is represented by a vector containing 400 evenly spaced frequency samples between 0 and 8 kHz, capturing the discrete spectrum of H_ratio H_stencil. Rather than feeding the raw complex-valued vectors into the network, we decompose them into their amplitude and phase spectra and use these components as inputs for training.

Figure 2.8: The architecture of the proposed CNN model.

For DoA estimation, we employ a Convolutional Neural Network (CNN)-based regression model. CNNs are well-suited for processing environmental sound data due to their strong performance and low latency, thanks to their relatively compact parameter sets [34]. Our model is a one-dimensional CNN consisting of three convolutional layers, followed by a fully connected layer and a final regression output layer. The convolutional layers use 64, 128, and 256 filters with kernel sizes of 2 × 7, 1 × 5, and 1 × 3, respectively. The output regression layer is designed to minimize the half-mean-squared-error (HMSE) loss for angle prediction. We tailor the loss function based on the desired range and resolution of angle estimation.
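The layer dimensions of this architecture can be sanity-checked from the "valid" 1-D convolution length formula; the sketch below uses only the numbers stated in the text (400 input bins, kernels of length 7, 5, 3 with 64, 128, 256 filters), with a helper name of our own.

```python
# Sanity check of the CNN layer sizes: with "valid" 1-D convolutions the
# output length is L_out = L_in - kernel + 1. The 2 x 7 first kernel spans
# both input channels (amplitude, phase) and a 7-sample frequency window.
def conv1d_out_len(length, kernel):
    return length - kernel + 1

shapes = [(400, 2)]  # input: 400 frequency bins x (amplitude, phase)
for kernel, n_filters in [(7, 64), (5, 128), (3, 256)]:
    shapes.append((conv1d_out_len(shapes[-1][0], kernel), n_filters))
# shapes -> [(400, 2), (394, 64), (390, 128), (388, 256)]
```

These sizes match the intermediate dimensions annotated in Figure 2.8.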
The network uses ReLU (Rectified Linear Unit) as the activation function and includes batch normalization layers between the convolutional layers to accelerate training. Optimization is performed using stochastic gradient descent (SGD), and the model is trained over 100 epochs with a learning rate of 1e-6. A block diagram of the CNN architecture is provided in Figure 2.8. In addition to the regression model, we also develop a CNN-based classification model for evaluation and comparison purposes, as described in Section 2.5.8. This model largely mirrors the architecture of the regression network, with a key modification: the fully connected layer is configured to have 360 output units—one for each possible angle—followed by a Softmax activation and a classification layer.

2.3.5 Optimizing 3D stencil design

The effectiveness of our system relies heavily on the variation in frequency gain patterns across different angles. In our initial feasibility study using a stencil cap with randomly placed holes, we observed sufficient diversity in the gain patterns to differentiate sound directions, achieving a median error of 7°. However, the system's directional resolution is not consistent across all angles—some directions are detected with lower accuracy than others. This inconsistency stems from the suboptimal arrangement of holes and internal microstructures on the stencil, which can lead to similar gain responses for different directions. To overcome this, we adopt a more systematic approach to designing the 3D stencil. Our optimized design ensures that a minimum angular resolution is maintained for DoA estimation in all directions. An ideal stencil cap should produce highly distinct frequency gain patterns for each possible DoA, maximizing the system's ability to differentiate between angles. This challenge of maximizing diversity in gain patterns is similar to an information-theoretic problem: designing a set of codewords that are as different from each other as possible.
In our context, a frequency gain pattern G_θ corresponding to a specific angle is treated as a codeword. The goal is to create a set of N such codewords that are maximally separated—typically measured by maximizing the Euclidean distance between every pair. The number N also determines the angular resolution of the system, defined as Δθ = 2π/N. As a first step, we aim to design a set of ideal codewords over a range of discrete frequencies and then use these as targets to guide the creation of the corresponding gain patterns G_θ. The next step involves translating each G_θ into a physical hole configuration on the stencil surface for the corresponding angle θ. Given parameters such as the number of holes N, the distances from the microphone to each hole r_n, the distance between the microphone and the stencil D, and the sound wavelength λ, Equation 2.5 provides the resulting superimposed value u(λ) at the microphone:

u(λ) = Σ_{n=1}^{N} (D / (jλ r_n²)) e^{j 2π r_n / λ}    (2.5)

This value represents the combined contribution of waves entering through all stencil holes. Because this equation spans multiple wavelengths, solving for the ideal hole patterns becomes an overdetermined problem—one that, in theory, can be approached approximately to optimize hole placement. However, this analytical approach quickly becomes unmanageable, especially when dealing with more than 10 holes on a 3D stencil. Furthermore, it fails to capture the complex behavior of wave propagation around small objects, where theoretical models diverge significantly from real-world observations. Before introducing our simulation-based design strategy, we first discuss these wave properties in more detail.

Behavior of wave fields near the stencil: In our stencil model, we make a simplifying assumption that incoming sound waves primarily pass through the holes located on the side of the cylindrical structure that directly faces the wavefront.
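As an aside, the superposition in Equation 2.5 is cheap to evaluate numerically for a candidate hole layout. The sketch below is illustrative only: the hole distances r_n, the aperture term D, and the 7 kHz tone are made-up values, not measured stencil parameters.

```python
import numpy as np

# Direct numerical evaluation of Eq. 2.5 for one wavelength lam.
def superposed_field(r, D, lam):
    """u(lam) = sum_n D / (j * lam * r_n^2) * exp(j * 2*pi * r_n / lam)."""
    r = np.asarray(r, dtype=float)
    return complex(np.sum(D / (1j * lam * r**2) * np.exp(2j * np.pi * r / lam)))

lam = 343.0 / 7000.0  # wavelength of a 7 kHz tone, about 49 mm
u1 = superposed_field([0.006, 0.008, 0.011], D=0.015, lam=lam)  # layout A (toy)
u2 = superposed_field([0.005, 0.009, 0.013], D=0.015, lam=lam)  # layout B (toy)
# Different hole layouts yield different complex gains at the microphone.
```

Sweeping lam over the band of interest turns this into the frequency gain pattern for one direction, which is the quantity compared across angles in the Monte Carlo search described below.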
Essentially, we approximate the cylindrical stencil as an N-gonal prism (illustrated in Figure 2.2), where each face contains distinct hole configurations that contribute to generating a specific frequency gain pattern at the microphone. This approximation is generally valid for large objects—specifically, when the object's diameter is more than ten times the wavelength of the sound signal [35]. However, our goal is to design a compact interference-shaping structure, and at this miniature scale, the actual behavior of wave propagation deviates significantly from this simplified model. In reality, due to diffraction, the sound wave bends around the outer surface of the small stencil and wraps around the structure, affecting nearly the entire cap, as illustrated in Figure 2.9.

Figure 2.9: The behavior of the sound field at the outer surface of an obstacle. (a) When the object's size is much larger than the wavelength of the sound, the obstacle creates a shadow region. (b) When the object's size is comparable to the wavelength of the sound, the wave diffracts around the object, creating high pressure over a larger region of the surface. It also creates a high-pressure region directly opposite to the sound's direction, where sound fields from the top and bottom sides meet.

We validated this diffraction effect using a cylindrical structure with a single pinhole on one side (Figure 2.10(a)), placing a microphone at the center to measure sound pressure. As shown in Figure 2.10(b), the microphone recorded substantial sound pressure even when the pinhole was positioned more than 90° away from the direction of the sound source—evidence of the sound waves bending around the structure.
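A quick scale check shows why diffraction dominates at the prototype's size. The sketch below takes the speed of sound in air as roughly 343 m/s and a cap size of about 2 cm (the prototype's largest dimension, described later in Section 2.5.1), and tests the ten-wavelength rule of thumb from [35]:

```python
# Scale check for the shadow-region approximation, which is valid
# only when the object's diameter exceeds ~10x the wavelength [35].
SPEED_OF_SOUND = 343.0  # m/s, in air at room temperature
STENCIL_SIZE = 0.02     # m; roughly the prototype's largest dimension

for freq_hz in (1000.0, 4000.0, 8000.0):
    wavelength = SPEED_OF_SOUND / freq_hz
    shadow_valid = STENCIL_SIZE > 10 * wavelength
    print(f"{freq_hz:6.0f} Hz: wavelength = {wavelength * 100:.1f} cm, "
          f"shadow approximation valid: {shadow_valid}")
```

Even at the top of the 8 kHz operating band, the wavelength (about 4.3 cm) is larger than the cap itself, far from the ten-wavelength threshold, so the wave wraps around the structure as Figure 2.9(b) depicts.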
Notably, we observed a strong pressure level directly opposite the sound source, caused by the merging of diffracted waves from both sides of the cylinder. This angular variation in sound intensity across the stencil surface is frequency-dependent and plays a key role in shaping the received frequency gain pattern.

Figure 2.10: (a) A one-hole stencil to measure surface pressure levels. (b) Sound amplitude at different angles from the sound's direction of arrival.

To improve the stencil cap design, we revise our approach and adopt a forward simulation-based method. Rather than working backward from a target frequency gain pattern to determine the hole layout, we instead evaluate a large number of randomly generated hole configurations and select the best-performing one. This Monte Carlo-based approach involves repeatedly sampling random stencil designs and analytically simulating the resulting frequency gain patterns for each direction of arrival—covering 360 angles with 1° resolution. We then assess and optimize the angular diversity of these gain patterns across all directions, following the steps outlined below. This method allows us to identify a hole configuration that closely approximates the globally optimal design, given the stencil cap's size and other physical constraints.

(1) Random stencil pattern generation: The variation in directional gain patterns is closely tied to the multipath diversity produced by the arrangement of holes on the stencil. For our simulations, we fix the outer and inner diameters, as well as the height of the cylindrical stencil. Each pinhole is assigned a diameter of 2 mm. We then generate random configurations of hole placements along the cylinder's surface. However, uniformly sampling hole positions does not guarantee sufficient spacing between them.
To ensure that each hole contributes uniquely to the frequency gain pattern, a minimum separation—typically half the maximum wavelength—is required. To enforce this constraint, we adapt the Fast Poisson Disc sampling algorithm [36]. In each iteration, the Poisson Disc method generates 2D coordinates for new holes on the unwrapped (flattened) surface of the cylinder, starting from a few initial seed points. Each new hole is placed randomly within an annular region centered around existing holes, with a radius of 3 mm to maintain the minimum required distance. To further increase the diversity of hole arrangements, we randomly vary the annulus width during each iteration.

(2) Estimating frequency gain patterns: Using Equation 2.5, the algorithm calculates the frequency-dependent gain pattern for each stencil design generated in the previous step. It determines the path differences between each hole and the microphone, factoring in the diffraction of sound waves around the outer surface of the cylinder. For each stencil, we compute the gain pattern across 400 evenly spaced frequencies ranging from 0 to 8 kHz, resulting in a 400-point complex gain vector for each of the 360 source angles (with 1° resolution). We also apply amplitude and phase adjustments to account for the diffraction effects discussed earlier (as illustrated in Figure 2.10). At the end of this step, we obtain a set of 360 gain patterns—each with 400 frequency points—for further analysis and optimization.

(3) Assessing the diversity of gain patterns: Next, we measure the diversity of the gain patterns using the all-pair Euclidean distance as the metric, which we call the chord-distance. Two distinguishable gain patterns show a higher chord-distance than two similar patterns. We use this metric in the maximin decision criterion in the final step.
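Steps (2) and (3) can be sketched as follows. This is a simplified sketch: the geometry values are made up for illustration, and the diffraction-based amplitude and phase corrections are omitted. Equation 2.5 is evaluated per frequency, and the chord-distance is the Euclidean distance between two complex gain vectors.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def gain_pattern(path_lengths, mic_distance, freqs):
    """Superimposed response u(lambda) from Equation 2.5, summed over holes.

    path_lengths: distances r_n from each hole to the microphone (m).
    mic_distance: microphone-to-stencil distance D (m).
    freqs: frequencies (Hz) at which to evaluate the pattern.
    """
    lam = SPEED_OF_SOUND / freqs                  # wavelength per frequency
    r = np.asarray(path_lengths)[:, None]         # shape: holes x 1
    u = (mic_distance / (1j * lam * r**2)) * np.exp(1j * 2 * np.pi * r / lam)
    return u.sum(axis=0)                          # complex gain per frequency

def chord_distance(g1, g2):
    """Euclidean distance between two complex gain vectors (step 3)."""
    return np.linalg.norm(g1 - g2)

# 400 frequency points; we start at 20 Hz to avoid the zero-frequency singularity.
freqs = np.linspace(20, 8000, 400)
D = 0.008                                          # hypothetical geometry (m)
g_a = gain_pattern([0.009, 0.011, 0.013], D, freqs)  # hypothetical path lengths
g_b = gain_pattern([0.010, 0.012, 0.015], D, freqs)
```

In the full pipeline, 360 such patterns (one per degree) are computed for each candidate stencil, and the candidate maximizing the minimum chord-distance over all pairs is kept.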
(4) Stopping criteria and selecting the best stencil: In each iteration with a newly generated stencil pattern, the algorithm calculates and records the minimum chord-distance between all pairs of gain patterns obtained in the previous step. This metric reflects the angular diversity of the stencil. The iteration process continues until the distribution of these minimum distances approximates a Gaussian curve, indicating convergence. Once this condition is met, the stencil pattern with the highest recorded chord-distance is selected as the optimal design for fabrication. Figure 2.11 illustrates a comparison between the frequency gain-pattern diversity of an optimal stencil and that of a sub-optimal one.

Figure 2.11: Comparison of diversity in phase and amplitude patterns for an optimal and a sub-optimal design of the stencil.

2.4 Prototype Development

2.4.1 3D-printing stencil caps

We first run our optimization algorithm in MATLAB to obtain a stencil design. Next, we use the Autodesk Fusion 360 Python API [37, 38] to generate the 3D model of the stencil. The script takes the design parameters of the stencil as input, builds the structure including internal substructures and cavities, and adds the holes on the surface. Finally, we export the STL model of the stencil and slice it for 3D printing. We used the Elegoo Mars photocuring 3D printer [39] to print the stencils.
We use an ultraviolet light-curable resin with a density of 1.195 g/cm³ that solidifies when exposed to light of 405 nm wavelength. Compared with jetting-based printing, it provides a high resolution and a smooth finish, which is ideal for the tiny sub-structures on the stencil. More importantly, the photocuring method produces dense surfaces and makes the acoustic behavior of the stencil predictable [40].

2.4.2 Calibration and data collection

We began by generating a wideband calibration signal in MATLAB and exporting it as a .arb file, which was then loaded onto a Keysight waveform generator [41]. To transmit the signal, we used two commercial speakers powered by a 40 W dual-channel amplifier. For precise timing and synchronization, the waveform generator was triggered via an external wired connection. The stencil was mounted on a stepper motor controlled by an Arduino [42], allowing for automated rotation in 1° increments from 0° to 360°. During each step, the calibration signal was recorded using omnidirectional ADMP401 MEMS microphones [43], sampled at 16 kHz. The recorded data was then processed offline on a computer for analysis.

2.5 Evaluation

Our goal is to evaluate the effectiveness of our microstructure-based spatial sensing approach. To do this, we built a functional prototype of the Owlet system and conducted experiments across multiple indoor and outdoor environments, each with varying acoustic conditions. For benchmarking and comparison, we used conventional uniform linear microphone arrays (ULAs) of different lengths to establish baseline performance and energy consumption metrics. In the following sections, we describe the experimental setup in detail and present our evaluation results.

2.5.1 Evaluation setup and results summary

In the Owlet prototype, we utilize a 3D-printed stencil and two microphones aligned vertically, facing in opposite directions, with a 4 mm separation between them.
One of the microphones is enclosed by the stencil, while the other remains exposed. For comparison, we also built a 9-element uniform linear array (ULA) with 1.3 cm spacing between adjacent microphones, all simultaneously sampled using a multi-channel data acquisition system [44]. Figure 2.12 shows the front-end sensor configurations for both the Owlet and ULA systems. We used omnidirectional ADMP401 MEMS microphones [43], sampled at 16 kHz, for both Owlet and the ULAs. The recorded data was processed offline using MATLAB scripts on a computer. Test signals included multi-frequency wideband tones, white noise, drone sounds, and car engine noise. Unless otherwise specified, the default sound source was a multi-frequency wideband signal, played at 40 dB SPL, positioned 3 feet away from the microphones at a 0° elevation angle. The stencil used in the Owlet system measured 1.5 cm × 1.3 cm and incorporated internal capillary tubes and structural resonator cavities. We evaluated system performance across a variety of representative settings, including an indoor lab, a building lobby, and outdoor locations, as shown in Figure 2.13. It is important to note that our structure-guided DoA estimation method does not distort the original sound source, as the secondary reference microphone is located outside the stencil and can capture the source signal clearly.

Figure 2.12: The Owlet prototype used in the evaluation experiment (left) and a 9-element uniform linear microphone array used as a baseline for comparison (right). The array is 12 cm wide, whereas Owlet is significantly smaller, measuring less than 2 cm in its largest dimension.

Figure 2.13: Various locations for system evaluations: (a) indoor laboratory, (b) indoor lobby, (c) outdoor.

  System            Prototype cost   Size      Median error   Energy
  Owlet             $15              1.9 cm    3.6°           16.7 mJ
  9-element array   $70              11.4 cm   4°             2078 mJ

Table 2.1: Comparison of prototype cost, size, median error, and energy consumption of Owlet with a microphone array.
Summary: Figure 2.14 provides a summary of Owlet's overall performance compared to traditional DoA estimation using ULAs. As detailed later in this section, Owlet achieves higher accuracy than even a 9-element ULA utilizing the standard MUSIC algorithm for direction estimation, all while operating with significantly lower energy consumption. Table 2.1 presents a side-by-side comparison of the prototypes, including estimated manufacturing costs, physical dimensions, median DoA estimation errors, and power requirements.

Figure 2.14: Overall performance of the Owlet system compared to traditional microphone arrays of various sizes. Owlet requires 100× less energy than the state-of-the-art array systems while achieving better accuracy than a 9-element array.

2.5.2 Impacts of external conditions

We evaluated the performance of our prototype under various adversarial conditions. We present the results below.

Figure 2.15: Performance under external conditions: (a) The impact of varying types and loudness levels of ambient noise on the median DoA estimation error. (b) The CDF of errors when the sound source is located at varying distances from the receiver. (c) The CDF plot of estimation error for different elevation angles, or the vertical positions, of the sound source. (d) The CDF plots of errors that show the impact of dynamic movements in the environment.

(a) Ambient noise: The ambient noise level at our test locations was approximately 40 dB SPL. To evaluate robustness, we introduced four types of noise with varying spectral characteristics: (i) white noise, (ii) traffic sounds, (iii) human speech, and (iv) machinery noise, such as a jackhammer. These noises were played from three different speakers positioned at various angles and with varying loudness levels near the receiver during the DoA estimation process. The target sound used for direction estimation had a loudness of 60 dB SPL, which is comparable to typical conversational volume. As shown in Figure 2.15(a), Owlet's median DoA estimation error remains consistently low across a wide range of noise types and intensity levels, demonstrating the system's robustness to environmental interference.

(b) Distance from the receiver: We evaluated the system's performance by placing the sound source at different distances from the receiver. Figure 2.15(b) presents the resulting median DoA estimation errors. The observed errors are primarily influenced by the decrease in signal-to-noise ratio (SNR) at the receiver as the distance increases, since the source intensity remained constant regardless of its position. However, when we adjusted the setup to maintain a consistent sound level at the receiver—regardless of source distance—the effect of distance on DoA accuracy became minimal.

(c) Elevation angle: The current version of Owlet is designed to estimate the DoA only in the azimuth plane—that is, horizontal directions. In theory, an azimuth-only system should be unaffected by the elevation (vertical position) of the sound source. However, in real-world scenarios, microphones are not perfectly omnidirectional, which means traditional microphone array systems typically perform accurately only within a limited elevation range.
Beyond microphone limitations, Owlet's stencil design—with its specific pinhole patterns—can also be influenced by the vertical angle of incoming sound due to the way those patterns are projected onto the microphone. To assess the effect of elevation, we varied the vertical position of the sound source while keeping its horizontal distance fixed at 150 cm. As shown in Figure 2.15(c), Owlet's performance remains stable when the vertical offset of the source is within 15 cm of the microphone's center, indicating minimal impact from moderate elevation changes.

Figure 2.16: The performance of sound tracking while the source is constantly moving near the sensor. The movement of the source creates a dynamic multipath scenario.

(d) Dynamic multipath: Owlet is specifically designed to reduce the impact of environmental multipath effects. We initially tested this capability by varying the locations in our earlier experiments. To further assess its robustness, we introduced additional environmental changes by moving the sound source during testing, simulating dynamic conditions. In this scenario, Owlet maintained a median error below 7°. We then introduced moving subjects near the sensor setup to create time-varying multipath conditions. Even with three people walking within a 3-meter radius of the sensor, the median error only increased slightly to around 9°. Figure 2.15(d) presents the cumulative distribution function (CDF) of the DoA error under these dynamic conditions, alongside results from more stable environments. Additionally, Figure 2.16 illustrates Owlet's performance in tracking a moving sound source near the sensor.

2.5.3 Performance in different environments

We tested Owlet in a variety of representative environments, including an indoor laboratory, a building lobby, and outdoor open-air locations, as illustrated in Figure 2.13.
To ensure robustness across different acoustic settings, we trained the deep learning model using synthetic room impulse responses, as described in Section §2.3.4. Figure 2.17(a) presents Owlet's DoA estimation performance across various positions within these environments. The system achieves a median error of less than 4°, with 90th-percentile errors staying below 10°. These results demonstrate Owlet's ability to operate reliably in unfamiliar environments, relying only on a one-time calibration during initial prototype development.

2.5.4 Impact of different sound sources

In this experiment, we evaluate the system's DoA estimation performance for the parallel-frequency signal and other types of signal sources. These signals differ in their active bandwidths, frequency spectra, and loudness. Figure 2.17(b) shows comparable performance across the different sounds.

Figure 2.17: The CDF of median error for (a) different environments and (b) different types of sound sources.

2.5.5 Performance in a known environment

Owlet's synthetic training data generation can be tailored based on the known geometry of the target environment. We tested this feature by creating training data specific to the test location. Figure 2.18 displays the overall performance of Owlet in estimating the direction of arrival (DoA) of signals. In this experiment, we transmitted signals from a speaker placed at different angles relative to the Owlet system. The ground-truth DoAs spanned from 0° to 180° in front of the receiver, with 1° separation between each position. Unlike traditional microphone arrays, Owlet's design eliminates the issue of "mirror location" (front-back) ambiguity in DoA estimation.
The confusion matrix in Figure 2.18(a) visually illustrates the distribution of errors for each ground-truth angle. Figure 2.18(b) shows the empirical cumulative distribution of the errors. In this scenario, Owlet achieves a median error of less than 3.3° and a 90th-percentile error of under 10°.

Figure 2.18: The performance for DoA estimation with known room size: (a) the confusion matrix and (b) the CDF of error in degrees of angle.

2.5.6 Localization Performance

Owlet is primarily designed for DoA estimation of sound sources. However, by combining data from multiple Owlet units, it is possible to localize a sound source using triangulation. To test this, we set up an experiment with two speakers continuously emitting 50 ms parallel-frequency pulses. The Owlet receiver was positioned at various locations on a grid in front of the speakers. The Owlet system estimated the DoA for both speakers and used triangulation to calculate the location of the sound source. Figure 2.19(a) shows a heatmap of the localization error, while Figure 2.19(b) presents the corresponding cumulative distribution function (CDF) plot. The median localization error achieved was 10 cm.

Figure 2.19: The localization error as (a) a heatmap and (b) an empirical CDF.

2.5.7 Comparison with traditional methods

Figure 2.20 compares the performance of Owlet with traditional array-based DoA estimation techniques. We implemented three widely used array-based methods: beamscan, minimum variance distortionless response (MVDR), and the MUSIC algorithm. These techniques were applied to microphone arrays with varying numbers of elements.
The results in Figure 2.20(a) demonstrate that Owlet significantly outperforms the other algorithms under similar conditions and signal-to-noise ratios (SNR), despite using only two microphones. Owlet's median error is even slightly better than that of the MUSIC algorithm with a 9-microphone array. To evaluate the DoA resolution, we compare the spatial spectrum of each traditional algorithm with Owlet's. Since Owlet employs a regression-based approach, it does not directly generate a spatial spectrum. Instead, we plot the confidence score distribution for all angles. Figure 2.20(b) displays the spatial spectra for a signal arriving at 20°. Owlet shows a narrower beamwidth, similar to that of the MUSIC algorithm.

Figure 2.20: Performance comparison of Owlet with the implementation of beamscan, MVDR, and MUSIC algorithms: (a) the CDF of median errors, (b) the spatial spectrum for an incoming signal from a 20° angle.

2.5.8 Comparison between learning models

Figure 2.21 compares the performance of various deep learning models and algorithms. In some scenarios, the regression model slightly outperforms the classification algorithm. We also evaluate different architectures for the regression model. For instance, when we halve the filter sizes in the three convolution layers from 64, 128, and 256, the median error becomes 5.6°. Using only the first two convolution layers with reduced filter sizes results in a median error of 5.8°. When we employ two convolution layers with filters sized 64 and 128, the median error increases to 7.8°.
These findings highlight the flexibility of the model, allowing customization for resource-limited computational environments while still achieving acceptable DoA performance.

Figure 2.21: Performance comparison of Owlet with different deep learning models and architectures.

2.5.9 Energy consumption

In this section, we assess and compare the energy efficiency of Owlet with traditional array-based systems. We measure the power consumption of each submodule, including the hardware frontend, analog-to-digital conversion (ADC), and DoA computation. While we directly monitor the frontend and ADC, the computation part requires porting the runtime code to a Raspberry Pi 4 module and tracking the overall power variation of the module. For accurate and high-resolution power tracking, we used a Keysight E6313A power supply and monitoring unit [45]. The setup is illustrated in Figure 2.22.

Computation: We write the code in MATLAB and use MATLAB Coder [46] to generate executable C files for the Raspberry Pi 4. We employ the MathWorks Raspbian image optimized for deep learning applications and cross-compile the code for the ARMv7 architecture with NEON acceleration. This acceleration utilizes special registers for parallel processing, which provides an advantage for neural network systems compared to traditional methods. We deploy the executable code and run it on offline data for 10,000 iterations, collecting voltage and current readings from the power meter.

Figure 2.22: The setup for evaluating energy consumption. The setup tracks the energy requirements of Owlet and baseline microphone arrays under various conditions using a Keysight E6313A power supply and monitor.
Additionally, we record the total time required to complete the DoA estimations. We observe that while Owlet's instantaneous power consumption is 1.92 W—about twice that of the traditional algorithms, which consume 1.05 W—the time taken for Owlet to complete the estimation is significantly lower, at just 8.3 ms, compared to 2050 ms for the traditional algorithm. This difference arises from the highly parallelized operations of the neural network, which are not feasible with the sequential nature of traditional algorithms.

ADC: Figure 2.23 shows the energy consumption of the ADC in two devices: the MSP430 microcontroller ($3) [47] and the Keysight Data Acquisition System ($2500) [44]. For the MSP430FR5969, we utilize its low-power 12-bit ADC and adjust the sampling rate to simulate the multiplexing of multiple microphones. For the Keysight DAQ, we use its 12-bit parallel-channel ADC in single-shot data acquisition mode and vary the number of channels. Power consumption data is recorded for both devices from the power supply, and the microphones are disconnected to isolate the power consumption of the ADCs from the microphones and their amplifiers.

Figure 2.23: Energy consumption of (a) the MSP430FR5969 low-power ADC [47] for different sampling rates and (b) the Keysight Data Acquisition System [44] for different numbers of microphones.

Microphone frontend: We measure the power consumption of the ADMP401 MEMS microphones [43]. To estimate the total energy consumption, we consider a 50 ms duration, which is typical for collecting 800 samples at 16 kHz. We calculate the energy by multiplying this duration by the average power consumed by the microphones. In Section §2.5.1, we compare the power consumption and accuracy of Owlet with traditional arrays.
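The computation-stage trade-off described above reduces to energy = power × time. A quick sketch with the measured figures (1.92 W for 8.3 ms versus 1.05 W for 2050 ms; this covers the computation stage only, so the totals differ slightly from Table 2.1, which also counts the microphone frontend and ADC):

```python
# Computation energy = average power (W) x runtime (s),
# using the measurements reported in Section 2.5.9.
owlet_energy = 1.92 * 8.3e-3    # joules: parallel neural-network inference
array_energy = 1.05 * 2.050     # joules: sequential traditional algorithm

print(f"Owlet computation energy: {owlet_energy * 1e3:.1f} mJ")
print(f"Array computation energy: {array_energy * 1e3:.0f} mJ")
print(f"Energy ratio: {array_energy / owlet_energy:.0f}x")
```

Despite nearly double the instantaneous power, Owlet's computation energy comes out roughly two orders of magnitude lower, because its runtime is about 250× shorter.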
Figure 2.14 illustrates the energy consumption of Owlet and the other arrays, alongside the median errors in DoA estimation. Figure 2.24 breaks down the energy usage of each submodule (computation, ADC, and microphone frontend). Owlet uses less than one-hundredth of the energy required by traditional arrays for similar accuracy and angular resolution.

Figure 2.24: Overall energy consumption of array-based systems and Owlet.

2.6 Limitations and Discussion

It is important to note that the current version of Owlet is an initial prototype of the concept, and there is ample room for further development and improvement. Here, we discuss some key points for future work.

• Multiple Sound Sources: We have tested our prototype in various acoustic environments with different types of target sounds and noise sources. At present, the system is designed for DoA estimation assuming a single sound source. When multiple sources overlap, the system focuses on the strongest signal for direction estimation, treating the others as noise. While considering only one dominant source is practical for many applications, we believe that Owlet could be adapted to estimate multiple DoAs. A promising approach would involve applying statistical methods to separate the source signals and then matching them with directional signatures for DoA detection. We leave this area for future exploration.

• Theoretical Capacity Bounds: Array signal processing has been extensively studied, and it is possible to estimate the theoretical limits on achievable spatial resolution, taking into account various constraints and array configurations. This information is crucial for designing and simulating array-based systems.
The Owlet concept differs significantly from traditional array processing techniques, but its performance could still be analyzed through an information-theoretic framework, focusing on the entropy of the directional gain patterns. The shape and size of the stencil, as well as the frequency of the sound, impose additional limitations on Owlet's capacity to produce diverse gain patterns. A theoretical analysis would enhance our understanding of the system and help guide future improvements.

• Mobility: The Doppler effect, caused by fast-moving sound sources, can influence the frequency gain patterns Owlet uses for direction estimation. Our current prototype operates in the low-frequency audible range, which is less susceptible to Doppler shifts from the sound source or receiver. Additionally, DoA estimation with parallel-frequency signals offers some robustness against Doppler shifts. As a result, we did not focus on the system's performance under mobility. However, if Owlet operates at higher frequencies in the future, considerations will need to be made for detecting and compensating for Doppler frequency shifts.

• Inaudibility of Sound Signals: In this work, we focused on audible sound frequencies for system calibration and source signals. Low-frequency signals, with longer wavelengths, tend to exhibit less diversity in their frequency gain patterns, which limits the achievable angular resolution. We specifically chose this frequency range to demonstrate system performance at the lower end of the spectrum. Higher frequencies are likely to improve spatial resolution and reduce the size of the stencil. Future versions of Owlet will explore inaudible near-ultrasonic frequencies (17–24 kHz) and ultrasound frequencies (above 24 kHz).

2.7 Related Work

There is a wealth of research on spatial sound analysis techniques.
Notable studies in direction of arrival (DoA) estimation using microphone arrays [48–51], array signal processing for beamforming [52–54], and subspace-based super-resolution algorithms [55, 56] have greatly advanced this field. Recently, innovations in ubiquitous spatial acoustic sensing [14, 57–65] have opened up new opportunities for acoustic sensing in diverse environments. Below, we highlight two topics that are closely related to Owlet.

• Acoustic Structures: The study of how structures affect sound fields has a long history. In ancient architecture, large structures were used to amplify sound or reduce noise. Modern architectural acoustics is applied in buildings and auditoriums to control reverberation and sound isolation. Research related to Owlet includes the design of 3D-printed acoustic metamaterials that absorb specific frequencies [66] and the development of meta-surfaces that generate diffraction-limited acoustic fields [67]. The use of acoustic structures for sensing applications is a relatively recent area of study. For example, Li et al. [68] used additive manufacturing to create acoustic filters that control the impedance at specific frequencies. In [69], physical notches were created on a surface to form acoustic barcodes. Other works [70, 71] use 3D-printed acoustic structures to create tangible user interfaces, with varying structure shapes to produce unique frequency responses that are classified using smartphone microphones.

• Monaural DoA: Previous studies [48–50, 58] have explored microphone arrays for DoA estimation. Recently, there has been growing interest in minimizing resources for directional acoustic sensing. For example, [72] uses a single microphone placed in a known room, relying on wall reflections and scattering to estimate the sound source location. In [22], a small vertical wall is placed near a microphone, altering the frequency response based on the direction of sound.
Recent work [23, 73] has placed small objects like Legos and