ABSTRACT

Title of Dissertation: MINIMALIST SENSING TOWARD UBIQUITOUS PERCEPTION

Yang Bai, Doctor of Philosophy, 2025

Dissertation Directed by: Professor Nirupam Roy, Department of Computer Science

This thesis explores minimalist sensing, a design philosophy that prioritizes simplicity and efficiency in sensor technology to capture essential information for applications including robotics, environmental monitoring, and wearable technology. By focusing on streamlined functionalities, these sensors avoid the complexity and cost of more elaborate systems, offering practical solutions under resource constraints. My research emphasizes developing low-power, miniaturized systems that integrate seamlessly into both urban and natural environments, enhancing ubiquitous perception without the encumbrance of complex technologies. I explore three main areas: low-power and miniaturized acoustic direction-of-arrival (DoA) estimation, ultra-low-power spatial sensing for miniature robots, and a single-frequency tracking interface for voice assistants. The contributions include a novel low-power DoA estimation system using 3D-printed metamaterials, an innovative spatial sensing system for mobile robots using a single speaker-microphone pair, and a comprehensive voice and motion tracking interface that operates on a single frequency. This work aims to establish a pervasive perception network that offers continuous, reliable data while minimizing energy use and infrastructure demands, potentially revolutionizing real-time monitoring and responsiveness in diverse settings.

A critical aspect of achieving minimalist perception is the integration of machine intelligence and computation. By leveraging advanced algorithms and computational techniques, we can bridge the gap in minimalist perception, making it both feasible and efficient.
Machine learning and signal processing algorithms enhance the accuracy and functionality of simplified sensor systems, allowing them to perform complex tasks without sophisticated hardware. For instance, intelligent data processing enables low-power sensors to extract meaningful information from limited data inputs, reducing the need for extensive sensor networks. By incorporating these computational strategies, we can push the boundaries of minimalist sensing, enabling the creation of smart, resource-efficient perception systems that are capable of operating in diverse and challenging environments.

MINIMALIST SENSING TOWARD UBIQUITOUS PERCEPTION

by

Yang Bai

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2025

Advisory Committee:
Professor Nirupam Roy, Chair/Advisor
Professor Christopher Allan Metzler
Professor Xiaofan (Fred) Jiang
Professor Ashok K. Agrawala
Professor Min Wu

© Copyright by Yang Bai 2025

To my parents

Acknowledgments

First, I want to express my heartfelt gratitude to the society that fosters peace and stability, providing an environment where people can pursue advanced education and personal growth. This journey would not have been possible without the foundation of progress and the values that support learning and development.

I thoroughly cherished my time at the University of Maryland, College Park, as a Ph.D. student in the Department of Computer Science. This journey has been as much about personal growth and introspection as it has been about pushing the boundaries of knowledge through rigorous experiments and thoughtful hypotheses. First and foremost, I owe an immense debt of gratitude to my advisor, Professor Nirupam Roy, for accepting me as his Ph.D. student and for his unwavering support throughout this transformative journey.
In the face of challenges and doubts from others, he stood by me, offering not just guidance in research but also encouragement to believe in my own potential. His mentorship has been a beacon, reminding me of the power of resilience and self-confidence, as encapsulated in the lines of William Ernest Henley’s Invictus: “I am the master of my fate, I am the captain of my soul.” Professor Roy’s ability to inspire confidence and perseverance has left an indelible mark on my academic and personal growth. He not only taught me how to navigate the complexities of research but also how to confront challenges with determination and grace, ensuring that I emerged stronger and more self-assured. For this, I will forever remain grateful.

I am also profoundly grateful to my committee members for their invaluable contributions to my academic journey. I thank Professor Christopher Allan Metzler for generously sharing his profound knowledge about compressed sensing, which greatly influenced my work. I deeply appreciate Professor Xiaofan (Fred) Jiang for his inspiring research on acoustic sensing, which has been a source of motivation for my own studies. I am equally grateful to Professor Ashok K. Agrawala for his renowned expertise in real-time systems and pervasive computing, which provided valuable insights that broadened my perspective on the practical implications of my research. Similarly, I extend my deepest thanks to Professor Min Wu. Her innovative approaches and thoughtful perspectives encouraged me to think critically about the impact and ethical considerations of my work. Their collective guidance and encouragement have been invaluable in refining and enhancing the quality of this thesis.

Additionally, I would like to express my heartfelt gratitude to Professor Ramani Duraiswami for sharing his extensive knowledge of acoustics, which provided a strong foundation for many aspects of my research.
I am also deeply appreciative of Professor Dinesh Manocha, whose pioneering work was highly inspiring when I began exploring the intersection of minimalist perception with LLMs and LAMs. Their support and belief in my potential have been invaluable, and I am truly honored to have had their guidance and encouragement.

I would also like to express my gratitude to my Master’s advisor, Prof. Yingying (Jennifer) Chen, for her invaluable guidance and support in my research journey. She played a crucial role in mentoring me through my first two research papers, providing insightful direction and fostering my growth as a researcher. I am also deeply thankful to Dr. Jian Liu and Dr. Li Lu, who provided significant help throughout the research process, offering technical insights and constructive feedback. Their expertise and support were instrumental in overcoming challenges and successfully completing the work, shaping my foundation in academic research.

I am deeply thankful to my lab mates for creating such a supportive and intellectually stimulating environment. Working alongside such talented and driven individuals has been an incredibly enriching experience. I would especially like to thank Nakul Garg, Irtaza Shahid, Aritrik Ghosh, Harshvardhan Takawale, and Ayushi Mishra. Collaborating with them has enriched this dissertation in many ways, from insightful discussions to creative problem-solving. Their camaraderie, encouragement, and contributions have made this journey both productive and memorable. I am grateful for the friendships and professional relationships that I have built with them, which I will cherish beyond my time at the lab.

I would also like to express my heartfelt gratitude to my friend Kathryn Juliette Klett, who sits opposite my lab. She is an outgoing and kind individual whose smile and conversations always refresh me and lift my spirits.
Additionally, I want to thank my English teacher, Marcia Albergo, who has been a source of unwavering support and guidance. Like an elder in my family, she has been there for me.

I want to express my deepest gratitude to my husband, Zhenzhe Lin, who has been with me through every step of this journey, from our undergraduate days to the completion of this dissertation. His unwavering support, understanding, and encouragement have been a cornerstone of my success. Whether it was providing emotional strength during challenging times or helping me navigate the many facets of life, he has been my constant companion and greatest advocate. I am incredibly fortunate to have him by my side.

Finally, I owe profound gratitude to my parents, Qiusheng Bai and Ling Yang, for their unwavering belief in my abilities and their support throughout my life. My father has always been my greatest advocate, instilling in me the confidence to pursue success regardless of gender. His faith in my unlimited potential has been a source of strength and inspiration. Their love, sacrifices, and encouragement have been the bedrock of my achievements, and I dedicate this work to them with all my heart. Your support has been my foundation, and this achievement would not have been possible without you. This thesis is dedicated to you.

Table of Contents

Preface
Acknowledgements
Table of Contents

Chapter 1: Introduction
1.1 Background and Motivation
1.2 Problem Statement
1.3 Contributions to Minimalist Perception
1.4 Opportunities Beyond this Dissertation
1.5 Organization

Chapter 2: Low-power and Miniaturized Acoustic DoA Estimation
2.1 Overview
2.2 Core Intuitions and Primers
2.2.1 Metamaterials for passive filtering
2.3 System Design
2.3.1 Processing for DoA estimation
2.3.2 Eliminating source signal dependency
2.3.3 Eliminating environmental dependency
2.3.4 Synthetic training for deep learning
2.3.5 Optimizing 3D stencil design
2.4 Prototype Development
2.4.1 3D-printing stencil caps
2.4.2 Calibration and data collection
2.5 Evaluation
2.5.1 Evaluation setup and results summary
2.5.2 Impacts of external conditions
2.5.3 Performance in different environments
2.5.4 Impact of different sound sources
2.5.5 Performance in known environment
2.5.6 Localization Performance
2.5.7 Comparison with traditional methods
2.5.8 Comparison between learning models
2.5.9 Energy consumption
2.6 Limitations and Discussion
2.7 Related Work
2.8 Chapter Summary

Chapter 3: Ultra-low-power Acoustic Spatial Sensing
3.1 Overview
3.2 Core Intuitions and Primers
3.2.1 Coded signal projection with structures
3.2.2 Single receiver depth mapping
3.3 System Design
3.3.1 Low-power scene reconstruction
3.3.2 Directional code projection
3.3.3 Optimal microstructure design
3.3.4 Motion stacking
3.4 Evaluation
3.4.1 Metrics
3.4.2 Overall performance
3.4.3 Impact of the environment
3.4.4 Impact of system parameters
3.4.5 Impact of scene parameters
3.4.6 Computation techniques
3.4.7 Power consumption
3.5 Related Work
3.6 Discussion
3.7 Chapter Summary
Chapter 4: Simultaneous Voice and Handwriting Interface
4.1 Overview
4.2 Primer: Location from Phase
4.2.1 Advantage of pure-tone based ranging
4.2.2 Advantages of high-frequency signal
4.2.3 Challenges of pure tone-based ranging
4.3 Cross-Frequency Sonar Design
4.3.1 Implicit frequency and phase translation
4.3.2 Multipath avoidance
4.4 Evaluation
4.4.1 Location tracking performance
4.4.2 Performance of voice recovery
4.4.3 Performance under human voice and environmental noise
4.4.4 Performance of handwriting recovery
4.4.5 Performance of signature recovery
4.4.6 Impacts of external conditions
4.4.7 Ablation study
4.5 Discussion
4.6 Related Work
4.7 Chapter Summary

Chapter 5: Low-power Spatial Intelligence for Next-Gen Wearables
5.1 Overview
5.2 Related Work
5.2.1 Spatial audio detection
5.2.2 Multimodal large language models
5.2.3 Spatial-Aware Large Language Models
5.3 Microstructure-Assisted Spatial Encoding
5.3.1 Spatial encoding microstructure
5.3.2 Spatial speech generation
5.4 SING: Spatial Context to Wearable LLM
5.4.1 Spatial speech encoder
5.4.2 Speech-to-text encoder
5.4.3 Alignment to LLM space
5.5 Soundscaping: Multi-DoA Encoder
5.6 Evaluation of Spatial-Aware ASR
5.6.1 Dataset
5.6.2 Training details
5.6.3 Performance on spatially aware ASR
5.6.4 Performance under noise and room reverberation
5.6.5 Performance on different datasets
5.7 Evaluation of Multi-DoA Encoder
5.7.1 Dataset
5.7.2 Baseline
5.7.3 Evaluation metrics
5.7.4 Performance of DoA estimation
5.7.5 Performance of number of speakers estimation
5.8 Discussion and Future Work
5.9 Chapter Summary

Chapter 6: Conclusion
6.1 Impacts and Future Directions

Bibliography

Chapter 1: Introduction

1.1 Background and Motivation

In an increasingly interconnected world, ubiquitous perception—the ability for devices to continuously sense and interpret their surroundings—has become a key enabler for next-generation smart systems. From wearables and augmented reality (AR) glasses to IoT sensors and intelligent assistants, the ability to seamlessly perceive and interact with the environment is essential for enhancing human-computer interaction, improving accessibility, and enabling automation in everyday life. Acoustic sensing plays a critical role in this vision, as sound carries rich spatial and contextual information about an environment. It enables devices to locate sound sources, track motion, and infer scene dynamics, all while operating in conditions where vision-based sensing may fail (e.g., low-light environments or occluded spaces). However, despite the immense potential of acoustic sensing, significant technical hurdles prevent its integration into ubiquitous systems. Traditional approaches rely on large, power-hungry microphone arrays and wide-band signal processing, which are impractical for small, battery-operated, and always-on devices. Additionally, existing AI models, including large language models (LLMs), lack the ability to incorporate spatial sound perception, limiting their ability to reason about physical environments in an intuitive manner.

[Figure 1.1: Illustration of four types of minimalist sensing: frequency band, hardware size, energy consumption, and embedded AI.]
1.2 Problem Statement

As shown in Figure 1.1, to truly enable ubiquitous acoustic perception, we must address these challenges by developing minimalist-hardware, energy-efficient, frequency-conscious, and embedded-AI acoustic sensing solutions. These challenges can be categorized into three broad axes:

1. Power and Hardware Constraints on Ubiquitous Devices – The reliance on large, multi-microphone arrays and high-power processing makes acoustic sensing impractical for resource-constrained devices such as wearables, IoT nodes, and smart assistants.

2. Frequency Band Limitations and Interference with Voice Interfaces – Wide-band acoustic sensing methods interfere with critical voice-based applications, limiting their deployment on devices that prioritize natural speech interaction.

3. Integrating Acoustic Sensing with Large Language Models (LLMs) for Spatial Awareness – Despite advancements in AI, current LLMs lack the ability to process spatial audio information, preventing them from understanding and reasoning about soundscapes.

By tackling these fundamental barriers, this research paves the way for a new era of lightweight, intelligent, and seamlessly integrated acoustic sensing, enabling a world where devices can perceive and interact with their surroundings effortlessly.

1.3 Contributions to Minimalist Perception

With this goal of developing minimalist sensing systems toward ubiquitous perception in mind, my research breaks down into four areas: low-power and miniaturized acoustic DoA estimation, ultra-low-power spatial sensing for miniature robots, a single-frequency tracking interface, and integration of the miniaturized acoustic DoA estimation into LLMs for enhanced spatial understanding. This thesis makes the following contributions:

Low-power and miniaturized acoustic DoA estimation.
This dissertation introduces Owlet, an owl-inspired acoustic sensing system that enables low-power, compact, and high-resolution DoA estimation using a monaural setup embedded within a structured micro-enclosure. Unlike traditional microphone arrays that require multiple spatially distributed sensors, this approach passively encodes spatial information through controlled modifications of the impulse response, allowing a compact microstructure setup to infer DoA with high accuracy. By leveraging machine learning to extract spatial features from the altered acoustic signals, this method eliminates the need for complex array processing, significantly reducing hardware footprint, power consumption, and computational overhead. The compact nature of this system makes it well-suited for wearable devices, IoT nodes, and AR/VR platforms, where conventional microphone arrays are impractical. This contribution represents a fundamental shift toward minimalist acoustic sensing, demonstrating that spatial perception can be achieved with single-microphone solutions, paving the way for energy-efficient, scalable, and widely deployable acoustic sensing technologies.

Ultra-low-power spatial sensing for miniature robots. SPiDR is a microstructure-assisted acoustic sensing system designed to enable low-power spatial perception. Unlike conventional active sonar or microphone array-based imaging techniques, SPiDR employs a structured micro-acoustic surface that modulates incident acoustic waves, introducing diffraction patterns that encode spatial information. These passive structures alter the wavefront in a predictable manner, allowing a single microphone to capture rich spatial features without requiring multiple sensors or active signal transmission. A computational model, combined with compressed sensing-based reconstruction, extracts the phase and amplitude variations induced by diffraction, effectively converting them into spatially resolved depth and shape information.
The system is designed to be compact and energy-efficient, making it suitable for miniature robotics, low-power imaging, and non-invasive material sensing. By leveraging microstructure-induced diffraction as a novel sensing modality, SPiDR introduces a fundamentally new approach to acoustic imaging, overcoming the limitations of traditional sensor arrays and expanding the possibilities of passive, energy-efficient spatial sensing in constrained environments.

Single-frequency tracking interface. Scribe introduces a novel authentication system that leverages single-frequency acoustic sensing and motion tracking to enable secure and intuitive handwriting-in-the-air authentication. Unlike traditional authentication methods that rely on passwords, biometrics, or capacitive touch surfaces, Scribe utilizes high-precision phase tracking of acoustic signals to capture the fine-grained trajectory of a user’s handwriting in free space. By integrating voice-based authentication with spatial handwriting recognition, Scribe achieves a robust two-factor authentication system that is resistant to forgery, spoofing, and environmental interference. A key technical innovation of Scribe is its ability to eliminate multipath interference through frequency hopping techniques, ensuring that the tracked handwriting is accurately reconstructed even in complex acoustic environments. Additionally, by leveraging single-frequency phase tracking, Scribe achieves high accuracy in motion estimation without requiring wide-band acoustic signals, making it energy-efficient and compatible with voice interfaces. The system’s compact and low-power design allows seamless integration into voice assistants, wearable devices, and mobile platforms, paving the way for next-generation, touch-free authentication and human-computer interaction.

Spatial context in Large Language Models (LLMs) for next-gen wearables.
This work expands Owlet to detect multiple DoAs for speech simultaneously and integrates this information into an LLM. This contribution bridges the gap between low-power spatial sensing and advanced AI systems, enabling context-aware processing of spatial information in real time. By leveraging Owlet’s minimalist design, we develop methods to estimate and utilize several DoAs for speech concurrently, enhancing applications such as speaker tracking, spatial reasoning, and multi-speaker environments. The integration with an LLM demonstrates how minimalist sensing can enrich AI models with spatial perception capabilities, opening new possibilities for human-centered AI applications.

A key technical contribution of SING is the development of a DoA encoder, which extracts spatial cues from raw multi-microphone audio signals and encodes them into a compact, structured representation. This encoder learns spatial embeddings that capture information about sound source directionality, spatial relationships, and environmental reflections, ensuring robust spatial representation across diverse acoustic scenes. To align this information with LLMs, SING introduces a projection layer that maps the learned DoA embeddings into the LLM’s latent space, allowing it to integrate spatial perception seamlessly with text-based reasoning. This alignment ensures that LLMs can leverage spatial audio understanding alongside other modalities, unlocking new capabilities such as spatially informed dialogue, immersive AI-assisted navigation, and intelligent environmental awareness.

1.4 Opportunities Beyond this Dissertation

While this dissertation presents significant advancements in minimalist acoustic sensing and spatial perception, several exciting opportunities remain open for further exploration. The methods and technologies introduced in this work provide a strong foundation for future research, expanding the scope of applications and optimizing performance in various domains.
Below, we outline several key opportunities beyond this dissertation.

Optimizing SPiDR with Deep Image Prior. SPiDR, introduced in this dissertation as a microstructure-assisted imaging technique, offers a novel approach to spatial sensing. However, its reconstruction quality can be further enhanced using deep image prior (DIP) [1] methods. Unlike traditional deep learning approaches that require large datasets for training, DIP leverages the implicit structure of convolutional networks to refine noisy or incomplete images. By integrating DIP into SPiDR’s reconstruction pipeline, we can potentially improve the resolution and accuracy of spatial information, making the system more robust for high-precision applications such as micro-scale sensing and medical imaging.

Extending Single-Frequency Tracking for Indentation Detection and Fusion with Image-Based 3D Shape Reconstruction. Image-based 3D shape reconstruction has been widely explored for capturing object geometries, yet challenges remain in achieving high precision, particularly for fine surface details and subtle indentations. Given that single-frequency tracking has demonstrated precise motion tracking capabilities, an intriguing question arises: can the same technique be adapted for accurate indentation detection? By leveraging single-frequency tracking to map indentation profiles, we can complement existing image-based 3D reconstruction methods, fusing the two modalities to achieve higher accuracy in 3D shape estimation. This integration is particularly valuable for applications such as 3D printing, manufacturing, and quality control, where both external geometry and fine indentation details are critical for achieving high-fidelity replication. A hybrid approach combining precise acoustic-based indentation detection with image-based reconstruction has the potential to refine object modeling, enabling more precise and efficient fabrication processes.
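To see why single-frequency phase tracking can plausibly resolve indentation-scale detail, consider a toy simulation: a pure tone reflected off a moving surface is I/Q-demodulated at a single receiver, and the unwrapped carrier phase is converted to displacement via d = φ·c/(4πf), where the factor of 4π accounts for the round-trip path. All parameters below (tone frequency, sample rate, the 2 mm indentation profile) are illustrative assumptions for this sketch, not values taken from Scribe.

```python
import numpy as np

fs = 48_000          # sample rate (Hz), illustrative
f0 = 18_000          # single near-ultrasonic tone (Hz), illustrative
c = 343.0            # speed of sound (m/s)
t = np.arange(0, 0.5, 1 / fs)

# Simulated surface motion: a 2 mm sinusoidal indentation profile at 3 Hz
d_true = 0.002 * np.sin(2 * np.pi * 3 * t)

# Received echo: the round-trip path change 2*d shifts the tone's phase
phase_shift = 2 * np.pi * f0 * (2 * d_true) / c
rx = np.cos(2 * np.pi * f0 * t - phase_shift)

# I/Q demodulation: mix with quadrature carriers, then low-pass filter
# (a ~10 ms moving average) to remove the double-frequency terms
i = rx * np.cos(2 * np.pi * f0 * t)
q = rx * np.sin(2 * np.pi * f0 * t)
win = np.ones(480) / 480
i_lp = np.convolve(i, win, mode="same")
q_lp = np.convolve(q, win, mode="same")

# Unwrapped phase -> displacement: lambda/(4*pi) metres per radian (round trip)
phase = np.unwrap(np.arctan2(q_lp, i_lp))
d_est = phase * c / (4 * np.pi * f0)

# Ignore filter edge effects when measuring accuracy
err = np.max(np.abs(d_est[5000:-5000] - d_true[5000:-5000]))
print(f"max tracking error: {err * 1e3:.3f} mm")
```

In this idealized noise-free setting the naive demodulator recovers the 2 mm profile to well under 0.1 mm, which is the basic intuition behind preferring phase over time-of-flight for fine surface measurements; a real indentation sensor would additionally have to handle noise, multipath, and phase wrapping for motions larger than half a wavelength.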
Extending Microstructure-Assisted Imaging to Underwater Environments. Microstructure-assisted imaging techniques like SPiDR have demonstrated strong potential in controlled terrestrial environments. However, their principles can be extended to underwater imaging, where traditional optical and sonar-based imaging techniques face limitations. In murky or low-light underwater conditions, acoustic-based imaging through microstructured surfaces could provide a novel alternative for capturing fine structural details of submerged objects. This opens up opportunities for applications in marine exploration, environmental monitoring, and underwater robotics, where compact, low-power imaging solutions are crucial.

[Figure 1.2: Organization of the thesis in chapters (minimalist perception divided into microstructure-assisted perception, single-frequency perception, and spatial audio perception, mapped to Chapters 2–5).]

1.5 Organization

The subsequent sections elaborate on these ideas of minimalist sensing, starting with low-power and miniaturized design for acoustic DoA estimation in Chapter 2. Chapter 3 extends the primitive of 3D-printed passive structures to an ultra-low-power spatial sensing system for miniature mobile robots. Chapter 4 explains a comprehensive voice processing and handwriting interface for voice assistants. Chapter 5 discusses the expansion of Owlet for spatial speech integration into LLMs. Finally, we conclude in Chapter 6. Figure 1.2 shows the organization of the topics in the rest of this dissertation.

Chapter 2: Low-power and Miniaturized Acoustic DoA Estimation

2.1 Overview

Acoustic devices are increasingly integrated into our daily environments in diverse forms.
Beyond traditional voice interfaces, a growing number of applications are leveraging sound for context-awareness and analytical purposes. These include indoor activity recognition [2–4], health monitoring through acoustic signals [5, 6], speech development tracking and acoustic environment sensing using wearable devices [7, 8], as well as outdoor applications powered by distributed sensor networks [9, 10]. With advancements in low-power and battery-free technologies [11, 12], it is now feasible to continuously sense and analyze audio through compact, standalone modules deployed throughout the environment. Integrating spatial sound analysis and source localization into these systems can further enhance their contextual sensing capabilities. Meanwhile, spatial sound sensing is critical for applications such as robotic navigation and situational awareness in both aerial [13–15] and underwater [16, 17] settings. However, conventional approaches to spatial sound processing typically rely on multi-channel audio captured by microphone arrays, which demand significant power and are unsuitable for lightweight, low-power sensing nodes. This chapter aims to design an acoustic sensing framework capable of spatial sound analysis within the constraints of small, power-efficient ubiquitous computing platforms.

Figure 2.1: The vision and technical overview of Owlet, a low-power and miniaturized system for extracting spatial information from sound. Owlet uses acoustic microstructures to embed direction-specific signatures on the recorded sound and develops a learning-based approach for signature recovery and mapping in real time.

Extracting spatial characteristics of sound—such as direction-of-arrival (DoA) or source localization—typically involves sampling the acoustic wave across space using a microphone array. Traditional DoA estimation algorithms rely heavily on this spatial sampling framework, making the array's size and the number of microphones critical to their accuracy. Based on the sampling theorem [18], an ideal linear array features microphones spaced at half the wavelength (λ/2) of the signal for optimal DoA estimation. Additionally, the angular resolution, often defined by the inverse of the half-power beamwidth, scales with the overall length of the array. As a result, achieving high-resolution DoA measurements conventionally demands large and complex hardware setups. These systems also require synchronized data collection from all microphones, leading to increased power usage and greater system complexity. Although acoustic devices are now widespread in ubiquitous computing, the significant power, hardware, and size constraints of traditional arrays limit their integration in applications that demand fine-grained spatial sensing. In this chapter, we propose an alternative approach to spatial audio processing that moves away from the standard spatio-temporal sampling paradigm. Instead, we explore how sound waves interact with physical structures to enable a compact, low-power, and low-complexity solution for spatial sensing. Directional hearing aided by acoustic structures is widely observed in nature. In most mammals, including humans, the placement of the left and right ears mimics a two-element array for capturing directional sound cues. However, research in biophysics reveals that these animals rely not only on ear positioning but also on how sound interacts with the complex 3D geometry of the head to achieve fine-grained source localization [19].
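To make the hardware-scale argument concrete, the λ/2 spacing and total aperture of a conventional linear array can be computed directly. The following is a small illustrative sketch, not part of the Owlet system; the 8 kHz signal and 8-element array are example values.

```python
# Sketch (illustrative values): size of a conventional lambda/2 linear array.
SPEED_OF_SOUND = 343.0  # m/s, in air at roughly 20 degrees C

def half_wavelength_spacing(freq_hz):
    """Microphone spacing (meters) of an ideal lambda/2 array for freq_hz."""
    return SPEED_OF_SOUND / (2.0 * freq_hz)

# Example: an 8-element array tuned for an 8 kHz signal.
spacing = half_wavelength_spacing(8000.0)   # about 21.4 mm per gap
aperture = 7 * spacing                      # about 15 cm end to end
```

Even for a relatively high 8 kHz signal, the array spans roughly 15 cm, an order of magnitude larger than the centimeter-scale sensor targeted in this chapter.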
In contrast, many owl species have asymmetrically positioned ears—both horizontally and vertically—which enhances their ability to detect low-frequency sounds with high precision [20], a feat that symmetric ear arrangements cannot accomplish due to their limited spacing. Remarkably, certain insects, despite having bodies smaller than one-tenth of the wavelength of the sounds they perceive, can localize sound as accurately as mammals [21]. For example, a grasshopper with a body width of just 3 mm—far smaller than the wavelength of the target sound—can still determine sound direction accurately. This capability arises from the insect's asymmetrical body structure and orientation, which produce direction-dependent acoustic responses. Their sensory and neural systems are finely tuned to interpret these responses and map them to the sound's direction of arrival. Drawing inspiration from these biologically driven, structure-enhanced hearing mechanisms, we propose a DoA estimation system tailored for compact, energy-efficient devices. In this chapter, we introduce the design and implementation of an acoustic localization system that leverages specially crafted acoustic structures surrounding a microphone to encode directional information. As sound waves travel, they interact with physical objects in their path, altering the wave field in the process. This phenomenon is evident in large-scale environments, such as rooms, where the same sound can vary significantly depending on the room's shape, size, and the arrangement of objects. We demonstrate that similar transformations can be intentionally induced on a much smaller scale using a compact, 3D-printed acoustic structure. This structure modifies the sound waves in a way that imprints a unique, direction-dependent signature onto them. When a microphone is placed inside such a structure—just a few centimeters in size—it captures sounds embedded with these distinct signatures.
With careful design, the structure can produce distinguishable patterns for sounds arriving from angles separated by only a few degrees. Our system decodes these patterns to estimate the DoA of incoming sounds. We call this system Owlet, inspired by the owl's exceptional auditory localization abilities. We are not the first to recognize the potential of using environmental variations in sound fields for localization. Previous research has investigated this idea by fingerprinting multipath environments and analyzing nearby sound reflections [22]. The work most closely related to ours is [23], where objects are placed within a 60×60 cm area, and a microphone is positioned at the center. That study demonstrated that sound scattered by surrounding objects contains directional information, which can be used to estimate the DoA. While the core idea behind Owlet aligns with this prior work, our approach differs in two key aspects. First, we aim to create a compact, centimeter-scale sensing system suitable for integration into low-power, resource-constrained robots or for widespread use in ubiquitous sensing applications. The Owlet prototype achieves angular resolution comparable to or better than previous systems, despite using a much smaller sensor measuring just 1.5 cm × 1.3 cm. Second, we tackle the challenge of ensuring robustness to changes in the environment. Unlike systems that depend on controlled conditions or location-specific training, Owlet is designed to operate reliably outside of anechoic chambers and perform consistently in real-world, dynamic settings. A key challenge in developing Owlet lies in achieving sufficient multipath diversity within a compact physical footprint. Low-frequency acoustic signals have long wavelengths, which typically require reflectors of comparable size to create the necessary spatial variations—directly impacting the system's spatial resolution.
To overcome this constraint, we adopt a diffraction-based approach rather than relying on traditional reflection-based methods to design miniature acoustic structures. The core idea stems from the principle that when sound waves pass through small openings, they diffract and behave like new point sources. Leveraging this, we design a 3D-printed cylindrical shell—referred to as a stencil—that encases the microphone. These stencils are embedded with carefully engineered patterns of holes that produce complex yet predictable multipath interference patterns within the structure. These patterns encode unique directional signatures into the captured sound. To enhance angular discrimination, we incorporate metamaterial-inspired design elements into the stencil. The Owlet system undergoes a one-time calibration to learn the relationship between these interference signatures and the DoA, enabling real-time directional sensing during operation. Another significant challenge is ensuring that the system remains robust against environmental variations, which can unpredictably alter the characteristics of incoming sound. For Owlet to be practical and widely deployable, it must operate reliably across diverse environments with only a one-time calibration performed during manufacturing. As previously discussed, room acoustics can distort the sound field, potentially disrupting the mapping between directional signatures and the actual direction of arrival. To address this, Owlet incorporates a reference microphone into the design and adopts a communication-theoretic approach to mitigate the impact of transient multipath effects during signature extraction and matching. This strategy enhances the system's resilience to environmental changes, making Owlet well-suited for real-world deployment. This chapter investigates the use of acoustic structures as passive elements in the design of novel, low-power, and compact solutions for ubiquitous sensing.
Potential applications include wearable devices for sensing acoustic environments to support speech development monitoring in infants [24, 25], as well as personal audio analytics [26, 27], both of which rely on accurate sound direction detection. Spatial sensing through Owlet can also enhance navigation for size, weight, and power (SWaP)-constrained mobile robots operating in air or underwater [28, 29]. Additionally, Owlet enables direction estimation and localization in energy-harvesting systems—something traditional microphone arrays struggle to achieve due to their power and hardware demands. Figure 2.1 illustrates the overarching vision and technical approach behind our work. While many application avenues are possible, this chapter focuses on building the core functionality of the system and evaluating its performance boundaries. At this stage of the project, we contribute the following three key advancements:

1. A novel approach that leverages passive structures for directional sensing, enabling a low-power, low-complexity, and compact system for acoustic localization. The sensing and signal processing pipeline is designed to provide reliable DoA estimation across varying environments using only a one-time calibration during manufacturing. The system achieves a median angular error of just 3.6°—on par with traditional microphone array solutions, but with significantly reduced power and size requirements.

2. A reproducible framework for designing and 3D-printing optimized acoustic structures that encode direction-dependent features into incoming sounds. This includes a method for shaping the acoustic field using controlled diffraction within compact metamaterial-based geometries.

3. A complete hardware and software prototype of the system, made available to the community for replication, evaluation, and future development of the Owlet platform.

Next, we elaborate on the core intuition, system design, and key findings of this project.
Figure 2.2: The concept of using a stencil with direction-specific hole patterns and microstructures for passive filtering of the incoming sound. The stencil embeds a directional response to the recorded signals.

2.2 Core Intuitions and Primers

At its core, our goal is to create a controlled acoustic environment surrounding the microphone so that the captured signal includes a distinct, direction-dependent channel impulse response. This response can be extracted from the recording and used as a unique signature indicating the angle of arrival of the sound. While typical room acoustics or the presence of large nearby objects can naturally introduce directional multipath effects, our approach seeks to generate more precise and fine-grained spatial diversity using a compact design. To achieve this, we integrate principles of diffraction, interference, and structural resonance. Specifically, we develop a perforated cap for the microphone—referred to as a stencil—featuring strategically placed hole patterns on various sides, as illustrated in Figure 2.2. Sound entering from a particular direction passes through these distinct hole patterns and converges at the microphone. Each set of holes is linked to internal microstructures with varying parameters, producing a unique frequency response that encodes directional information. The stencil forms a metamaterial with internal microstructures that naturally modulates incoming sound to introduce a unique directional signature. As the impact of the microstructures depends on the frequencies of the sound, the signature is essentially a vector of complex gains, G_θ, of the frequency response. The concept is explained in Figure 2.3.
Figure 2.3: The concept of passive directional filtering using a stencil of acoustic microstructure. The stencil embeds a directional signature to the recorded sound unique to its direction of arrival (DoA). The spectrum of complex gains represents the signature for further computation.

2.2.1 Metamaterials for passive filtering

When sound waves interact with physical structures, certain frequencies can be either amplified or attenuated. On a larger scale, such variations typically result from multipath reflections, where constructive and destructive interference alters the frequency profile of the sound. While these reflections can embed directional information into the sound, they generally require large structures—comparable to the sound's wavelength—to be effective. Since Owlet targets low-frequency sounds, which have long wavelengths, using traditional reflectors would necessitate structures nearly half a meter in size. To achieve directional filtering in a much smaller form factor, we turn to the concept of metamaterials—engineered materials composed of specially arranged substructures that exhibit unique acoustic properties. In designing our metamaterial-based stencil, we apply a combination of (a) diffraction, (b) capillary channel effects, and (c) structural resonance to create compact yet effective direction-dependent filtering.

(a) Diffraction: When sound waves encounter the edge of an obstacle, they tend to bend around it—a phenomenon known as diffraction. This effect becomes particularly interesting when sound passes through a small opening [30]. If the hole's size is much smaller than the wavelength of the sound, the wave diffracts at the edges, and the hole effectively acts as a virtual point source of sound.
When a receiver, such as a microphone, is positioned on the opposite side of a surface with multiple such apertures, it perceives a complex sound field similar to one created by multiple sources—resulting in a multipath-like environment. The overlapping signals from these virtual sources produce patterns of constructive and destructive interference, shaped by both the signal frequency and the receiver's position. We harness this behavior by embedding various small hole patterns into the stencil, generating a rich multipath effect around the microphone within a compact structure.

(b) Capillary effect: When sound travels through narrow capillary tubes, its acoustic impedance changes [31]. Additionally, both the length and cross-sectional area of these tubes influence the speed at which sound propagates. To simulate phase differences along different sound paths, we incorporate capillary tubes of varying shapes and dimensions into the stencil design. This approach introduces significant diversity in the resulting frequency spectrum, even though the openings between sound paths are closely spaced.

Figure 2.4: Different types of metamaterial stencils used in our experiments.

(c) Structural resonance: When sound waves with oscillating air pressure encounter cavities, certain frequencies become amplified—a phenomenon known as Helmholtz resonance [32], often demonstrated by the sound of air blown across the top of a bottle. We apply this principle by embedding millimeter-scale Helmholtz resonators into the stencil design, each connected to the sound holes. By varying the shapes and dimensions of these tiny resonators, we can produce targeted resonance effects at specific frequencies.
Figure 2.4 presents our 3D-printed stencils featuring integrated microstructures designed for directional acoustic filtering. In Figure 2.5, we illustrate how these microstructures enhance the angular sensitivity of the sensor by comparing the amplitude variation of a 7 kHz tone recorded with and without the stencil in the Owlet setup. Figure 2.6 further highlights the directional frequency response diversity achieved by different stencil designs.

Figure 2.5: Angular diversity of the microphone with and without the microstructure stencil.

Figure 2.6: Comparison of the diversity in frequency responses (amplitude and phase) of the three types of metamaterial stencils.

2.3 System Design

The system design centers around two core objectives: (a) creating an optimal stencil structure that maximizes angular diversity, and (b) developing computational methods to estimate the DoA from the recorded audio. The overall accuracy of the system is closely tied to the level of directional diversity introduced by the stencil. To achieve this, our algorithms simulate sound wave behavior around small-scale structures, optimize the stencil design accordingly, and then fabricate it using 3D printing for real-world testing. Before delving into the specifics of stencil design, we first outline our signal processing and DoA estimation methods, which also provide a high-level view of the system's architecture.

2.3.1 Processing for DoA estimation

At a high level, Owlet's DoA estimation method involves a two-step process. The first step is a one-time, in-lab calibration where we generate a set of direction-specific signatures, denoted as G_θ, for the stencil.
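The two-step pipeline can be sketched as a nearest-signature lookup against the calibrated table. This is a hypothetical toy sketch, not Owlet's actual implementation (which uses a learned model rather than a direct table match); the table below is random placeholder data.

```python
import numpy as np

# Hypothetical sketch of the lookup step: match a run-time signature against
# the calibrated table of direction-specific signatures G_theta. The table
# is random toy data; names and sizes are illustrative, not Owlet's API.
rng = np.random.default_rng(0)
angles = np.arange(0, 360, 10)  # calibrated DoAs, in degrees
table = rng.standard_normal((36, 400)) + 1j * rng.standard_normal((36, 400))

def lookup_doa(signature, table, angles):
    """Return the calibrated angle whose signature is nearest in Euclidean distance."""
    distances = np.linalg.norm(table - signature, axis=1)
    return int(angles[np.argmin(distances)])

# A lightly perturbed copy of the 130-degree signature maps back to 130 degrees.
noisy = table[13] + 0.1 * rng.standard_normal(400)
estimated = lookup_doa(noisy, table, angles)  # -> 130
```

In practice, a learned model replaces this direct lookup so that it can generalize across the signature variations described next.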
This is done by playing a known wideband sound from various angles and recording the responses using a microphone covered with the stencil cap. The recorded responses serve as unique directional signatures, similar to calibration procedures used in commercial microphone array systems. The second step takes place during actual usage. When an unknown sound arrives, the system processes the incoming signal to extract the stencil-induced signature, h_stencil, and compares it against the pre-collected signature table to estimate the corresponding DoA. In practice, we train a deep learning model on variations of the signature set and use this model to predict the DoA from the processed signal in real time. A key part of this process is accurately extracting the stencil-specific signature from real-world recordings. This involves overcoming two main challenges: (i) isolating h_stencil from the frequency-dependent characteristics of the sound source, and (ii) removing additional distortions caused by multipath effects in the surrounding environment. In the following sections, we describe our techniques for handling source variability and mitigating environmental multipath to ensure robust signature extraction across diverse conditions.

2.3.2 Eliminating source signal dependency

The signal recorded by the microphone inside the stencil is essentially the source signal distorted by the direction-specific response of the stencil. If we assume no environmental effect on the source signal X(ω), the signal received by the inside microphone Y_in(ω) can be expressed as a multiplication between this source signal and the stencil's response H_stencil in the frequency domain:

Y_in(ω) = X(ω) H_stencil    (2.1)

When the source signal X(ω) is known, extracting the stencil's response is straightforward—we can compute it as H_stencil = Y_in(ω)/X(ω).
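The known-source case of Equation 2.1 can be checked numerically: dividing the received spectrum by the source spectrum recovers the response. This is a toy sketch under stated assumptions; the 3-tap "stencil" response is a made-up stand-in for the real directional filter.

```python
import numpy as np

# Toy illustration of Eq. 2.1: with a known source X, the stencil response
# follows from frequency-domain division, H = Y_in / X.
rng = np.random.default_rng(1)
x = rng.standard_normal(1024)         # known wideband source signal
h_true = np.zeros(1024)
h_true[:3] = [1.0, 0.5, 0.25]         # toy stencil impulse response (assumed)

# Circular convolution via the FFT models Y_in = X * H_stencil exactly.
y_in = np.fft.irfft(np.fft.rfft(x) * np.fft.rfft(h_true))

H_est = np.fft.rfft(y_in) / np.fft.rfft(x)   # recovered frequency response
h_est = np.fft.irfft(H_est)[:3]              # back to the time domain
```

The recovered taps match the toy impulse response up to floating-point error, confirming that a known source makes the extraction a simple deconvolution.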
This setup works well in scenarios where the system uses predefined signals, such as in robotic navigation, where the robot estimates its orientation using the direction of a known control signal. However, many practical use cases—such as detecting the direction of ambient sounds or identifying a speaker's location during conversation—do not involve a known source signal. In these cases, separating H_stencil from the unknown source becomes challenging. To overcome this, we introduce a secondary microphone positioned outside the stencil. This microphone captures the same incoming sound but without being influenced by the stencil's directional filtering. Importantly, unlike traditional microphone arrays, this reference microphone can be placed very close to the primary one, allowing for a compact design. Figure 2.7 illustrates the physical setup and provides a realistic signal model for the system.

Figure 2.7: The two-microphone model for eliminating source and environmental dependency.

Consider the channel frequency responses to the inside and outside microphones to be H_env and H'_env, respectively. These channel responses manifest the effects of the multipath signal propagation from the source to the microphones and the signal's reflections from nearby objects. The presence of the stencil around the internal microphone introduces additional modulation to the recorded signal, represented by the frequency response H_stencil. Considering the linearity of the channels, the signal recorded by the inside microphone experiences both impulse responses, as shown in Figure 2.7. Therefore, the signals recorded simultaneously by these microphones, Y_in(ω) and Y_out(ω), can be formulated as the following equations. The source sound is X(ω) and the independent noise terms at the two channels are N(ω) and N'(ω)
at the frequency ω:

Y_in(ω) = X(ω) H_env H_stencil + N(ω)
Y_out(ω) = X(ω) H'_env + N'(ω)    (2.2)

If we divide Y_in(ω) by Y_out(ω), it successfully eliminates the dependency on the source signal. However, the environmental dependency remains in the form of H_env/H'_env:

Y_in(ω)/Y_out(ω) = H_stencil (H_env/H'_env) + N''(ω),    (2.3)

where N''(ω) ≪ H_stencil (H_env/H'_env). This means the stencil calibration process, or the training of the deep learning module, would have to cover all locations in the target environment to capture the environmental dependency and make the angle prediction effective. Such a system may be applicable in scenarios where the locations of the sound sources and the sensing modules are predefined, for instance, when acoustic localization is used to track objects on a conveyor belt or on a track. However, for most practical scenarios the location of the sound source is unknown, and it would require collecting data from virtually every point in the scene to train the prediction module, leading to an impractical solution. Next, we explain our technique to eliminate this location dependency. With this technique, Owlet can function with one round of in-lab calibration of the stencil and does not require collecting any calibration data at the target environment.

2.3.3 Eliminating environmental dependency

This final stage of the technique is based on the observation that, despite the diverse and unpredictable nature of the environmental channel responses H_env and H'_env, the ratio of the channels, H_ratio = H_env/H'_env, is bounded when the microphones are closely placed. This idea can be intuitively understood by first analyzing the reason for diversity in the environmental response H_env. The sound wave reflects off various objects in the environment after leaving the source. These reflections follow paths of varying lengths to get superimposed at the recording microphone along with the direct line-of-sight path.
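The cancellation in Equations 2.2–2.3 can be verified with synthetic spectra: the unknown source drops out of the ratio exactly. This is a toy check with random placeholder channels, not measured responses.

```python
import numpy as np

# Toy check of Eqs. 2.2-2.3: dividing the inside-mic spectrum by the
# outside-mic spectrum cancels the unknown source X (noise-free case).
rng = np.random.default_rng(2)
n = 513
X = rng.standard_normal(n) + 1j * rng.standard_normal(n)   # unknown source
H_env = 1.0 + 0.1 * rng.standard_normal(n)                 # channel to inside mic
H_env_out = 1.0 + 0.1 * rng.standard_normal(n)             # channel to outside mic (H'_env)
H_stencil = rng.standard_normal(n) + 1j * rng.standard_normal(n)

Y_in = X * H_env * H_stencil     # Eq. 2.2 without the noise terms
Y_out = X * H_env_out
ratio = Y_in / Y_out             # = H_stencil * H_env / H'_env; X drops out
```

Whatever the source spectrum, the ratio depends only on the stencil response and the residual channel ratio H_env/H'_env, which is the term addressed in the next subsection.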
The diversity in the path distances creates time delays in the reflected components, leading to a unique response of the environment. Therefore, two microphones, even when recording the same signal, can observe different responses, as the path lengths of the reflections are different. However, if the locations of the microphones are close to each other, these path differences of the reflections are bounded; at one extreme, when the two microphones are exactly collocated, they observe the same environmental response. Therefore, H_ratio = H_env/H'_env has a narrow distribution of values for each frequency in the response when the two microphones are a few centimeters apart from each other. We obtained the probability distributions from simulated ray tracing and real-world experiments. Once the distributions of H_ratio are known and H_stencil is collected through the calibration stage, we generate synthetic training data for H_ratio H_stencil by drawing from the distribution and use it for training the deep learning module. This process can make our angle prediction module robust to environmental variations at run-time without requiring real-world sound traces for training. Interestingly, if the dimensions of the target environment and the locations of the major reflectors are known, the synthetic training data can be customized to that environment. This customization reduces the time for convergence during training and improves prediction accuracy. The run-time processing now requires extracting H_ratio H_stencil from the two channels of sound, Y_in(ω) and Y_out(ω). We improve this process by employing a recursive least squares (RLS) adaptive filter [33] in system identification mode. The adaptive filter takes advantage of the uncorrelated Gaussian noise in the recorded signals to estimate H_ratio H_stencil by minimizing the following error term:

e(ω) = Y_in(ω) − Y_out(ω) (H_env H_stencil / H'_env)    (2.4)

2.3.4 Synthetic training for deep learning

To enhance the training diversity of our neural network model, we utilize the synthetic channel response described in the previous section. Specifically, we compute H_ratio H_stencil by simulating various room environments and different configurations of source and microphone placements. These simulations produce a distribution of channel responses, which we use to generate additional training examples of H_ratio H_stencil for the learning model. Each direction of arrival is represented by a vector containing 400 evenly spaced frequency samples between 0 and 8 kHz, capturing the discrete spectrum of H_ratio H_stencil. Rather than feeding the raw complex-valued vectors into the network, we decompose them into their amplitude and phase spectra and use these components as inputs for training.

Figure 2.8: The architecture of the proposed CNN model.

For DoA estimation, we employ a Convolutional Neural Network (CNN)-based regression model. CNNs are well-suited for processing environmental sound data due to their strong performance and low latency, thanks to their relatively compact parameter sets [34]. Our model is a one-dimensional CNN consisting of three convolutional layers, followed by a fully connected layer and a final regression output layer. The convolutional layers use 64, 128, and 256 filters with kernel sizes of 2 × 7, 1 × 5, and 1 × 3, respectively. The output regression layer is designed to minimize the half-mean-squared-error (HMSE) loss for angle prediction. We tailor the loss function based on the desired range and resolution of angle estimation.
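The layer dimensions of this architecture can be sanity-checked from the "valid" 1-D convolution length formula; the sketch below uses only the numbers stated in the text (400 input bins, kernels of length 7, 5, 3 with 64, 128, 256 filters), with a helper name of our own.

```python
# Sanity check of the CNN layer sizes: with "valid" 1-D convolutions the
# output length is L_out = L_in - kernel + 1. The 2 x 7 first kernel spans
# both input channels (amplitude, phase) and a 7-sample frequency window.
def conv1d_out_len(length, kernel):
    return length - kernel + 1

shapes = [(400, 2)]  # input: 400 frequency bins x (amplitude, phase)
for kernel, n_filters in [(7, 64), (5, 128), (3, 256)]:
    shapes.append((conv1d_out_len(shapes[-1][0], kernel), n_filters))
# shapes -> [(400, 2), (394, 64), (390, 128), (388, 256)]
```

These sizes match the intermediate dimensions annotated in Figure 2.8.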
The network uses ReLU (Rectified Linear Unit) as the activation function and includes batch normalization layers between the convolutional layers to accelerate training. Optimization is performed using stochastic gradient descent (SGD), and the model is trained over 100 epochs with a learning rate of 1e-6. A block diagram of the CNN architecture is provided in Figure 2.8. In addition to the regression model, we also develop a CNN-based classification model for evaluation and comparison purposes, as described in Section 2.5.8. This model largely mirrors the architecture of the regression network, with a key modification: the fully connected layer is configured to have 360 output units—one for each possible angle—followed by a Softmax activation and a classification layer.

2.3.5 Optimizing 3D stencil design

The effectiveness of our system relies heavily on the variation in frequency gain patterns across different angles. In our initial feasibility study using a stencil cap with randomly placed holes, we observed sufficient diversity in the gain patterns to differentiate sound directions, achieving a median error of 7°. However, the system's directional resolution is not consistent across all angles—some directions are detected with lower accuracy than others. This inconsistency stems from the suboptimal arrangement of holes and internal microstructures on the stencil, which can lead to similar gain responses for different directions. To overcome this, we adopt a more systematic approach to designing the 3D stencil. Our optimized design ensures that a minimum angular resolution is maintained for DoA estimation in all directions. An ideal stencil cap should produce highly distinct frequency gain patterns for each possible DoA, maximizing the system's ability to differentiate between angles. This challenge of maximizing diversity in gain patterns is similar to an information-theoretic problem: designing a set of codewords that are as different from each other as possible.
In our context, a frequency gain pattern G_θ corresponding to a specific angle is treated as a codeword. The goal is to create a set of N such codewords that are maximally separated—typically measured by maximizing the Euclidean distance between every pair. The number N also determines the angular resolution of the system, defined as Δθ = 2π/N. As a first step, we aim to design a set of ideal codewords over a range of discrete frequencies and then use these as targets to guide the creation of the corresponding gain patterns G_θ. The next step involves translating each G_θ into a physical hole configuration on the stencil surface for the corresponding angle θ. Given parameters such as the number of holes N, the distances from the microphone to each hole r_n, the distance between the microphone and the stencil D, and the sound wavelength λ, Equation 2.5 provides the resulting superimposed value u(λ) at the microphone:

u(λ) = Σ_{n=1}^{N} (D / (jλ r_n²)) e^{j 2π r_n / λ}    (2.5)

This value represents the combined contribution of waves entering through all stencil holes. Because this equation spans multiple wavelengths, solving for the ideal hole patterns becomes an overdetermined problem—one that, in theory, can be approached approximately to optimize hole placement. However, this analytical approach quickly becomes unmanageable, especially when dealing with more than 10 holes on a 3D stencil. Furthermore, it fails to capture the complex behavior of wave propagation around small objects, where theoretical models diverge significantly from real-world observations. Before introducing our simulation-based design strategy, we first discuss these wave properties in more detail.

Behavior of wave fields near the stencil: In our stencil model, we make a simplifying assumption that incoming sound waves primarily pass through the holes located on the side of the cylindrical structure that directly faces the wavefront.
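As an aside, the superposition in Equation 2.5 is cheap to evaluate numerically for a candidate hole layout. The sketch below is illustrative only: the hole distances r_n, the aperture term D, and the 7 kHz tone are made-up values, not measured stencil parameters.

```python
import numpy as np

# Direct numerical evaluation of Eq. 2.5 for one wavelength lam.
def superposed_field(r, D, lam):
    """u(lam) = sum_n D / (j * lam * r_n^2) * exp(j * 2*pi * r_n / lam)."""
    r = np.asarray(r, dtype=float)
    return complex(np.sum(D / (1j * lam * r**2) * np.exp(2j * np.pi * r / lam)))

lam = 343.0 / 7000.0  # wavelength of a 7 kHz tone, about 49 mm
u1 = superposed_field([0.006, 0.008, 0.011], D=0.015, lam=lam)  # layout A (toy)
u2 = superposed_field([0.005, 0.009, 0.013], D=0.015, lam=lam)  # layout B (toy)
# Different hole layouts yield different complex gains at the microphone.
```

Sweeping lam over the band of interest turns this into the frequency gain pattern for one direction, which is the quantity compared across angles in the Monte Carlo search described below.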
Essentially, we approximate the cylindrical stencil as an N-gonal prism (illustrated in Figure 2.2), where each face contains distinct hole configurations that contribute to generating a specific frequency gain pattern at the microphone. This approximation is generally valid for large objects—specifically, when the object's diameter is more than ten times the wavelength of the sound signal [35]. However, our goal is to design a compact interference-shaping structure, and at this miniature scale, the actual behavior of wave propagation deviates significantly from this simplified model. In reality, due to diffraction, the sound wave bends around the outer surface of the small stencil and wraps around the structure, affecting nearly the entire cap, as illustrated in Figure 2.9.

Figure 2.9: The behavior of the sound field at the outer surface of an obstacle. (a) When the object's size is much larger than the wavelength of the sound, the obstacle creates a shadow region. (b) When the object's size is comparable to the wavelength of the sound, the wave diffracts around the object, creating high pressure over a larger region of the surface. It also creates a high-pressure region directly opposite to the sound's direction, where sound fields from the top and bottom sides meet.

We validated this diffraction effect using a cylindrical structure with a single pinhole on one side (Figure 2.10(a)), placing a microphone at the center to measure sound pressure. As shown in Figure 2.10(b), the microphone recorded substantial sound pressure even when the pinhole was positioned more than 90° away from the direction of the sound source—evidence of the sound waves bending around the structure.
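A quick scale check shows why diffraction dominates at the prototype's size. The sketch below takes the speed of sound in air as roughly 343 m/s and a cap size of about 2 cm (the prototype's largest dimension, described later in Section 2.5.1), and tests the ten-wavelength rule of thumb from [35]:

```python
# Scale check for the shadow-region approximation, which is valid
# only when the object's diameter exceeds ~10x the wavelength [35].
SPEED_OF_SOUND = 343.0  # m/s, in air at room temperature
STENCIL_SIZE = 0.02     # m; roughly the prototype's largest dimension

for freq_hz in (1000.0, 4000.0, 8000.0):
    wavelength = SPEED_OF_SOUND / freq_hz
    shadow_valid = STENCIL_SIZE > 10 * wavelength
    print(f"{freq_hz:6.0f} Hz: wavelength = {wavelength * 100:.1f} cm, "
          f"shadow approximation valid: {shadow_valid}")
```

Even at the top of the 8 kHz operating band, the wavelength (about 4.3 cm) is larger than the cap itself, far from the ten-wavelength threshold, so the wave wraps around the structure as Figure 2.9(b) depicts.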
Notably, we observed a strong pressure level directly opposite the sound source, caused by the merging of diffracted waves from both sides of the cylinder. This angular variation in sound intensity across the stencil surface is frequency-dependent and plays a key role in shaping the received frequency gain pattern.

Figure 2.10: (a) A one-hole stencil to measure surface pressure levels. (b) Sound amplitude at different angles from the sound's direction of arrival.

To improve the stencil cap design, we revise our approach and adopt a forward simulation-based method. Rather than working backward from a target frequency gain pattern to determine the hole layout, we instead evaluate a large number of randomly generated hole configurations and select the best-performing one. This Monte Carlo-based approach involves repeatedly sampling random stencil designs and analytically simulating the resulting frequency gain patterns for each direction of arrival—covering 360 angles with 1° resolution. We then assess and optimize the angular diversity of these gain patterns across all directions, following the steps outlined below. This method allows us to identify a hole configuration that closely approximates the globally optimal design, given the stencil cap's size and other physical constraints.

(1) Random stencil pattern generation: The variation in directional gain patterns is closely tied to the multipath diversity produced by the arrangement of holes on the stencil. For our simulations, we fix the outer and inner diameters, as well as the height of the cylindrical stencil. Each pinhole is assigned a diameter of 2 mm. We then generate random configurations of hole placements along the cylinder's surface. However, uniformly sampling hole positions does not guarantee sufficient spacing between them.
To ensure that each hole contributes uniquely to the frequency gain pattern, a minimum separation—typically half the maximum wavelength—is required. To enforce this constraint, we adapt the Fast Poisson Disc sampling algorithm [36]. In each iteration, the Poisson Disc method generates 2D coordinates for new holes on the unwrapped (flattened) surface of the cylinder, starting from a few initial seed points. Each new hole is placed randomly within an annular region centered around existing holes, with a radius of 3 mm to maintain the minimum required distance. To further increase the diversity of hole arrangements, we randomly vary the annulus width during each iteration.

(2) Estimating frequency gain patterns: Using Equation 2.5, the algorithm calculates the frequency-dependent gain pattern for each stencil design generated in the previous step. It determines the path differences between each hole and the microphone, factoring in the diffraction of sound waves around the outer surface of the cylinder. For each stencil, we compute the gain pattern across 400 evenly spaced frequencies ranging from 0 to 8 kHz, resulting in a 400-point complex gain vector for each of the 360 source angles (with 1° resolution). We also apply amplitude and phase adjustments to account for the diffraction effects discussed earlier (as illustrated in Figure 2.10). At the end of this step, we obtain a set of 360 gain patterns—each with 400 frequency points—for further analysis and optimization.

(3) Assessing the diversity of gain patterns: Next, we measure the diversity of the gain patterns using the all-pair Euclidean distance as the metric, which we call the chord-distance. Two distinguishable gain patterns show a higher chord-distance than two similar patterns. We use this metric in the maximin decision criterion in the final step.
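Steps (2) and (3) can be sketched as follows. This is a simplified sketch: the geometry values are made up for illustration, and the diffraction-based amplitude and phase corrections are omitted. Equation 2.5 is evaluated per frequency, and the chord-distance is the Euclidean distance between two complex gain vectors.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def gain_pattern(path_lengths, mic_distance, freqs):
    """Superimposed response u(lambda) from Equation 2.5, summed over holes.

    path_lengths: distances r_n from each hole to the microphone (m).
    mic_distance: microphone-to-stencil distance D (m).
    freqs: frequencies (Hz) at which to evaluate the pattern.
    """
    lam = SPEED_OF_SOUND / freqs                  # wavelength per frequency
    r = np.asarray(path_lengths)[:, None]         # shape: holes x 1
    u = (mic_distance / (1j * lam * r**2)) * np.exp(1j * 2 * np.pi * r / lam)
    return u.sum(axis=0)                          # complex gain per frequency

def chord_distance(g1, g2):
    """Euclidean distance between two complex gain vectors (step 3)."""
    return np.linalg.norm(g1 - g2)

# 400 frequency points; we start at 20 Hz to avoid the zero-frequency singularity.
freqs = np.linspace(20, 8000, 400)
D = 0.008                                          # hypothetical geometry (m)
g_a = gain_pattern([0.009, 0.011, 0.013], D, freqs)  # hypothetical path lengths
g_b = gain_pattern([0.010, 0.012, 0.015], D, freqs)
```

In the full pipeline, 360 such patterns (one per degree) are computed for each candidate stencil, and the candidate maximizing the minimum chord-distance over all pairs is kept.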
(4) Stopping criteria and selecting the best stencil: In each iteration with a newly generated stencil pattern, the algorithm calculates and records the minimum chord-distance between all pairs of gain patterns obtained in the previous step. This metric reflects the angular diversity of the stencil. The iteration process continues until the distribution of these minimum distances approximates a Gaussian curve, indicating convergence. Once this condition is met, the stencil pattern with the highest recorded chord-distance is selected as the optimal design for fabrication. Figure 2.11 illustrates a comparison between the frequency gain-pattern diversity of an optimal stencil and that of a sub-optimal one.

Figure 2.11: Comparison of diversity in phase and amplitude patterns for an optimal and a sub-optimal design of the stencil.

2.4 Prototype Development

2.4.1 3D-printing stencil caps

We first run our optimization algorithm in MATLAB to obtain a stencil design. Next, we use the Autodesk Fusion 360 Python API [37, 38] to generate the 3D model of the stencil. The script takes the design parameters of the stencil as input, builds the structure including internal substructures and cavities, and adds the holes on the surface. Finally, we export the STL model of the stencil and slice it for 3D printing. We used the Elegoo Mars photocuring 3D printer [39] to print the stencils.
We use an ultraviolet light-curable resin with a density of 1.195 g/cm³ that solidifies when exposed to light of 405 nm wavelength. Compared with jetting-based printing, it provides a high resolution and a smooth finish, which is ideal for the tiny sub-structures on the stencil. More importantly, the photocuring method produces dense surfaces and makes the acoustic behavior of the stencil predictable [40].

2.4.2 Calibration and data collection

We began by generating a wideband calibration signal in MATLAB and exporting it as a .arb file, which was then loaded onto a Keysight waveform generator [41]. To transmit the signal, we used two commercial speakers powered by a 40 W dual-channel amplifier. For precise timing and synchronization, the waveform generator was triggered via an external wired connection. The stencil was mounted on a stepper motor controlled by an Arduino [42], allowing for automated rotation in 1° increments from 0° to 360°. During each step, the calibration signal was recorded using omnidirectional ADMP401 MEMS microphones [43], sampled at 16 kHz. The recorded data was then processed offline on a computer for analysis.

2.5 Evaluation

Our goal is to evaluate the effectiveness of our microstructure-based spatial sensing approach. To do this, we built a functional prototype of the Owlet system and conducted experiments across multiple indoor and outdoor environments, each with varying acoustic conditions. For benchmarking and comparison, we used conventional uniform linear microphone arrays (ULAs) of different lengths to establish baseline performance and energy consumption metrics. In the following sections, we describe the experimental setup in detail and present our evaluation results.

2.5.1 Evaluation setup and results summary

In the Owlet prototype, we utilize a 3D-printed stencil and two microphones aligned vertically, facing in opposite directions, with a 4 mm separation between them.
One of the microphones is enclosed by the stencil, while the other remains exposed. For comparison, we also built a 9-element uniform linear array (ULA) with 1.3 cm spacing between adjacent microphones, all simultaneously sampled using a multi-channel data acquisition system [44]. Figure 2.12 shows the front-end sensor configurations for both the Owlet and ULA systems. We used omnidirectional ADMP401 MEMS microphones [43], sampled at 16 kHz, for both Owlet and the ULAs. The recorded data was processed offline using MATLAB scripts on a computer. Test signals included multi-frequency wideband tones, white noise, drone sounds, and car engine noise. Unless otherwise specified, the default sound source was a multi-frequency wideband signal, played at 40 dB SPL, positioned 3 feet away from the microphones at a 0° elevation angle. The stencil used in the Owlet system measured 1.5 cm × 1.3 cm and incorporated internal capillary tubes and structural resonator cavities. We evaluated system performance across a variety of representative settings, including an indoor lab, a building lobby, and outdoor locations, as shown in Figure 2.13. It is important to note that our structure-guided DoA estimation method does not distort the original sound source, as the secondary reference microphone is located outside the stencil and can capture the source signal clearly.

Figure 2.12: The Owlet prototype used in the evaluation experiment (left) and a 9-element uniform linear microphone array used as a baseline for comparison (right). The array is 12 cm wide, whereas Owlet is significantly smaller, measuring less than 2 cm in its largest dimension.

Figure 2.13: Various locations for system evaluations: (a) indoor laboratory, (b) indoor lobby, (c) outdoor.

  System            Prototype cost   Size      Median error   Energy
  Owlet             $15              1.9 cm    3.6°           16.7 mJ
  9-element array   $70              11.4 cm   4°             2078 mJ

Table 2.1: Comparison of prototype cost, size, median error, and energy consumption of Owlet with a microphone array.
Summary: Figure 2.14 provides a summary of Owlet's overall performance compared to traditional DoA estimation using ULAs. As detailed later in this section, Owlet achieves higher accuracy than even a 9-element ULA utilizing the standard MUSIC algorithm for direction estimation, all while operating with significantly lower energy consumption. Table 2.1 presents a side-by-side comparison of the prototypes, including estimated manufacturing costs, physical dimensions, median DoA estimation errors, and power requirements.

Figure 2.14: Overall performance of the Owlet system compared to traditional microphone arrays of various sizes. Owlet requires 100× less energy than the state-of-the-art array systems while achieving better accuracy than a 9-element array.

2.5.2 Impacts of external conditions

We evaluated the performance of our prototype under various adversarial conditions. We present the results below.

Figure 2.15: Performance under external conditions: (a) The impact of varying types and loudness levels of ambient noise on the median DoA estimation error. (b) The CDF of errors when the sound source is located at varying distances from the receiver. (c) The CDF plot of estimation error for different elevation angles, or the vertical positions, of the sound source. (d) The CDF plots of errors that show the impact of dynamic movements in the environment.

(a) Ambient noise: The ambient noise level at our test locations was approximately 40 dB SPL. To evaluate robustness, we introduced four types of noise with varying spectral characteristics: (i) white noise, (ii) traffic sounds, (iii) human speech, and (iv) machinery noise, such as a jackhammer. These noises were played from three different speakers positioned at various angles and with varying loudness levels near the receiver during the DoA estimation process. The target sound used for direction estimation had a loudness of 60 dB SPL, which is comparable to typical conversational volume. As shown in Figure 2.15(a), Owlet's median DoA estimation error remains consistently low across a wide range of noise types and intensity levels, demonstrating the system's robustness to environmental interference.

(b) Distance from the receiver: We evaluated the system's performance by placing the sound source at different distances from the receiver. Figure 2.15(b) presents the resulting median DoA estimation errors. The observed errors are primarily influenced by the decrease in signal-to-noise ratio (SNR) at the receiver as the distance increases, since the source intensity remained constant regardless of its position. However, when we adjusted the setup to maintain a consistent sound level at the receiver—regardless of source distance—the effect of distance on DoA accuracy became minimal.

(c) Elevation angle: The current version of Owlet is designed to estimate the DoA only in the azimuth plane—that is, horizontal directions. In theory, an azimuth-only system should be unaffected by the elevation (vertical position) of the sound source. However, in real-world scenarios, microphones are not perfectly omnidirectional, which means traditional microphone array systems typically perform accurately only within a limited elevation range.
Beyond microphone limitations, Owlet's stencil design—with its specific pinhole patterns—can also be influenced by the vertical angle of incoming sound due to the way those patterns are projected onto the microphone. To assess the effect of elevation, we varied the vertical position of the sound source while keeping its horizontal distance fixed at 150 cm. As shown in Figure 2.15(c), Owlet's performance remains stable when the vertical offset of the source is within 15 cm of the microphone's center, indicating minimal impact from moderate elevation changes.

Figure 2.16: The performance of sound tracking while the source is constantly moving near the sensor. The movement of the source creates a dynamic multipath scenario.

(d) Dynamic multipath: Owlet is specifically designed to reduce the impact of environmental multipath effects. We initially tested this capability by varying the locations in our earlier experiments. To further assess its robustness, we introduced additional environmental changes by moving the sound source during testing, simulating dynamic conditions. In this scenario, Owlet maintained a median error below 7°. We then introduced moving subjects near the sensor setup to create time-varying multipath conditions. Even with three people walking within a 3-meter radius of the sensor, the median error only increased slightly to around 9°. Figure 2.15(d) presents the cumulative distribution function (CDF) of the DoA error under these dynamic conditions, alongside results from more stable environments. Additionally, Figure 2.16 illustrates Owlet's performance in tracking a moving sound source near the sensor.

2.5.3 Performance in different environments

We tested Owlet in a variety of representative environments, including an indoor laboratory, a building lobby, and outdoor open-air locations, as illustrated in Figure 2.13.
To ensure robustness across different acoustic settings, we trained the deep learning model using synthetic room impulse responses, as described in Section §2.3.4. Figure 2.17(a) presents Owlet's DoA estimation performance across various positions within these environments. The system achieves a median error of less than 4°, with 90th-percentile errors staying below 10°. These results demonstrate Owlet's ability to operate reliably in unfamiliar environments, relying only on a one-time calibration during initial prototype development.

2.5.4 Impact of different sound sources

In this experiment, we evaluate the system's DoA estimation performance for the parallel-frequency signal and other types of signal sources. These signals differ in their active bandwidths, frequency spectra, and loudness. Figure 2.17(b) shows comparable performance across the different sounds.

Figure 2.17: The CDF of median error for (a) different environments and (b) different types of sound sources.

2.5.5 Performance in a known environment

Owlet's synthetic training data generation can be tailored based on the known geometry of the target environment. We tested this feature by creating training data specific to the test location. Figure 2.18 displays the overall performance of Owlet in estimating the direction of arrival (DoA) of signals. In this experiment, we transmitted signals from a speaker placed at different angles relative to the Owlet system. The ground-truth DoAs spanned from 0° to 180° in front of the receiver, with 1° separation between each position. Unlike traditional microphone arrays, Owlet's design eliminates the issue of "mirror location" (front-back) ambiguity in DoA estimation.
The confusion matrix in Figure 2.18(a) visually illustrates the distribution of errors for each ground-truth angle. Figure 2.18(b) shows the empirical cumulative distribution of the errors. In this scenario, Owlet achieves a median error of less than 3.3° and a 90th-percentile error of under 10°.

Figure 2.18: The performance for DoA estimation with known room size: (a) the confusion matrix and (b) the CDF of error in degrees of angle.

2.5.6 Localization Performance

Owlet is primarily designed for DoA estimation of sound sources. However, by combining data from multiple Owlet units, it is possible to localize a sound source using triangulation. To test this, we set up an experiment with two speakers continuously emitting 50 ms parallel-frequency pulses. The Owlet receiver was positioned at various locations on a grid in front of the speakers. The Owlet system estimated the DoA for both speakers and used triangulation to calculate the location of the sound source. Figure 2.19(a) shows a heatmap of the localization error, while Figure 2.19(b) presents the corresponding cumulative distribution function (CDF) plot. The median localization error achieved was 10 cm.

Figure 2.19: The localization error as (a) a heatmap and (b) an empirical CDF.

2.5.7 Comparison with traditional methods

Figure 2.20 compares the performance of Owlet with traditional array-based DoA estimation techniques. We implemented three widely used array-based methods: beamscan, minimum variance distortionless response (MVDR), and the MUSIC algorithm. These techniques were applied to microphone arrays with varying numbers of elements.
The results in Figure 2.20(a) demonstrate that Owlet significantly outperforms the other algorithms under similar conditions and signal-to-noise ratios (SNR), despite using only two microphones. Owlet's median error is even slightly better than that of the MUSIC algorithm with a 9-microphone array. To evaluate the DoA resolution, we compare the spatial spectrum of each traditional algorithm with Owlet's. Since Owlet employs a regression-based approach, it does not directly generate a spatial spectrum. Instead, we plot the confidence score distribution for all angles. Figure 2.20(b) displays the spatial spectra for a signal arriving at 20°. Owlet shows a narrower beamwidth, similar to that of the MUSIC algorithm.

Figure 2.20: Performance comparison of Owlet with the implementation of beamscan, MVDR, and MUSIC algorithms: (a) the CDF of median errors, (b) the spatial spectrum for an incoming signal from a 20° angle.

2.5.8 Comparison between learning models

Figure 2.21 compares the performance of various deep learning models and algorithms. In some scenarios, the regression model slightly outperforms the classification algorithm. We also evaluate different architectures for the regression model. For instance, when we halve the filter sizes in the three convolution layers from 64, 128, and 256, the median error becomes 5.6°. Using only the first two convolution layers with reduced filter sizes results in a median error of 5.8°. When we employ two convolution layers with filters sized 64 and 128, the median error increases to 7.8°.
These findings highlight the flexibility of the model, allowing customization for resource-limited computational environments while still achieving acceptable DoA performance.

Figure 2.21: Performance comparison of Owlet with different deep learning models and architectures.

2.5.9 Energy consumption

In this section, we assess and compare the energy efficiency of Owlet with traditional array-based systems. We measure the power consumption of each submodule, including the hardware frontend, analog-to-digital conversion (ADC), and DoA computation. While we directly monitor the frontend and ADC, the computation part requires porting the runtime code to a Raspberry Pi 4 module and tracking the overall power variation of the module. For accurate and high-resolution power tracking, we used a Keysight E6313A power supply and monitoring unit [45]. The setup is illustrated in Figure 2.22.

Computation: We write the code in MATLAB and use MATLAB Coder [46] to generate executable C files for the Raspberry Pi 4. We employ the MathWorks Raspbian image optimized for deep learning applications and cross-compile the code for the ARMv7 architecture with NEON acceleration. This acceleration utilizes special registers for parallel processing, which provides an advantage for neural network systems compared to traditional methods. We deploy the executable code and run it on offline data for 10,000 iterations, collecting voltage and current readings from the power meter.

Figure 2.22: The setup for evaluating energy consumption. The setup tracks the energy requirements of Owlet and baseline microphone arrays under various conditions using a Keysight E6313A power supply and monitor.
Additionally, we record the total time required to complete the DoA estimations. We observe that while Owlet's instantaneous power consumption is 1.92 W—about twice that of the traditional algorithms, which consume 1.05 W—the time taken for Owlet to complete the estimation is significantly lower, at just 8.3 ms, compared to 2050 ms for the traditional algorithm. This difference arises from the highly parallelized operations of the neural network, which are not feasible with the sequential nature of traditional algorithms.

ADC: Figure 2.23 shows the energy consumption of the ADC in two devices: the MSP430 microcontroller ($3) [47] and the Keysight Data Acquisition System ($2500) [44]. For the MSP430FR5969, we utilize its low-power 12-bit ADC and adjust the sampling rate to simulate the multiplexing of multiple microphones. For the Keysight DAQ, we use its 12-bit parallel-channel ADC in single-shot data acquisition mode and vary the number of channels. Power consumption data is recorded for both devices from the power supply, and the microphones are disconnected to isolate the power consumption of the ADCs from the microphones and their amplifiers.

Figure 2.23: Energy consumption of (a) the MSP430FR5969 low-power ADC [47] for different sampling rates and (b) the Keysight Data Acquisition System [44] for different numbers of microphones.

Microphone frontend: We measure the power consumption of the ADMP401 MEMS microphones [43]. To estimate the total energy consumption, we consider a 50 ms duration, which is typical for collecting 800 samples at 16 kHz. We calculate the energy by multiplying this duration by the average power consumed by the microphones. In Section §2.5.1, we compare the power consumption and accuracy of Owlet with traditional arrays.
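The computation-stage trade-off described above reduces to energy = power × time. A quick sketch with the measured figures (1.92 W for 8.3 ms versus 1.05 W for 2050 ms; this covers the computation stage only, so the totals differ slightly from Table 2.1, which also counts the microphone frontend and ADC):

```python
# Computation energy = average power (W) x runtime (s),
# using the measurements reported in Section 2.5.9.
owlet_energy = 1.92 * 8.3e-3    # joules: parallel neural-network inference
array_energy = 1.05 * 2.050     # joules: sequential traditional algorithm

print(f"Owlet computation energy: {owlet_energy * 1e3:.1f} mJ")
print(f"Array computation energy: {array_energy * 1e3:.0f} mJ")
print(f"Energy ratio: {array_energy / owlet_energy:.0f}x")
```

Despite nearly double the instantaneous power, Owlet's computation energy comes out roughly two orders of magnitude lower, because its runtime is about 250× shorter.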
Figure 2.14 illustrates the energy consumption of Owlet and the other arrays, alongside the median errors in DoA estimation. Figure 2.24 breaks down the energy usage of each submodule (computation, ADC, and microphone frontend). Owlet uses less than one-hundredth of the energy required by traditional arrays for similar accuracy and angular resolution.

Figure 2.24: Overall energy consumption of array-based systems and Owlet.

2.6 Limitations and Discussion

It is important to note that the current version of Owlet is an initial prototype of the concept, and there is ample room for further development and improvement. Here, we discuss some key points for future work.

• Multiple Sound Sources: We have tested our prototype in various acoustic environments with different types of target sounds and noise sources. At present, the system is designed for DoA estimation assuming a single sound source. When multiple sources overlap, the system focuses on the strongest signal for direction estimation, treating the others as noise. While considering only one dominant source is practical for many applications, we believe that Owlet could be adapted to estimate multiple DoAs. A promising approach would involve applying statistical methods to separate the source signals and then matching them with directional signatures for DoA detection. We leave this area for future exploration.

• Theoretical Capacity Bounds: Array signal processing has been extensively studied, and it is possible to estimate the theoretical limits on achievable spatial resolution, taking into account various constraints and array configurations. This information is crucial for designing and simulating array-based systems.
The Owlet concept differs significantly from traditional array processing techniques, but its performance could still be analyzed through an information-theoretic framework, focusing on the entropy of the directional gain patterns. The shape and size of the stencil, as well as the frequency of the sound, impose additional limitations on Owlet's capacity to produce diverse gain patterns. A theoretical analysis would enhance our understanding of the system and help guide future improvements.

• Mobility: The Doppler effect, caused by fast-moving sound sources, can influence the frequency gain patterns Owlet uses for direction estimation. Our current prototype operates in the low-frequency audible range, which is less susceptible to Doppler shifts from the sound source or receiver. Additionally, DoA estimation with parallel-frequency signals offers some robustness against Doppler shifts. As a result, we did not focus on the system's performance under mobility. However, if Owlet operates at higher frequencies in the future, considerations will need to be made for detecting and compensating for Doppler frequency shifts.

• Inaudibility of Sound Signals: In this work, we focused on audible sound frequencies for system calibration and source signals. Low-frequency signals, with longer wavelengths, tend to exhibit less diversity in their frequency gain patterns, which limits the achievable angular resolution. We specifically chose this frequency range to demonstrate system performance at the lower end of the spectrum. Higher frequencies are likely to improve spatial resolution and reduce the size of the stencil. Future versions of Owlet will explore inaudible near-ultrasonic frequencies (17–24 kHz) and ultrasound frequencies (above 24 kHz).

2.7 Related Work

There is a wealth of research on spatial sound analysis techniques.
Notable studies in direction of arrival (DoA) estimation using microphone arrays [48–51], array signal processing for beamforming [52–54], and subspace-based super-resolution algorithms [55, 56] have greatly advanced this field. Recently, innovations in ubiquitous spatial acoustic sensing [14, 57–65] have opened up new opportunities for acoustic sensing in diverse environments. Below, we highlight two topics that are closely related to Owlet.

• Acoustic Structures: The study of how structures affect sound fields has a long history. In ancient architecture, large structures were used to amplify sound or reduce noise. Modern architectural acoustics is applied in buildings and auditoriums to control reverberation and sound isolation. Research related to Owlet includes the design of 3D-printed acoustic metamaterials that absorb specific frequencies [66] and the development of meta-surfaces that generate diffraction-limited acoustic fields [67]. The use of acoustic structures for sensing applications is a relatively recent area of study. For example, Li et al. [68] used additive manufacturing to create acoustic filters that control the impedance at specific frequencies. In [69], physical notches were created on a surface to form acoustic barcodes. Other works [70, 71] use 3D-printed acoustic structures to create tangible user interfaces, with varying structure shapes to produce unique frequency responses that are classified using smartphone microphones.

• Monaural DoA: Previous studies [48–50, 58] have explored microphone arrays for DoA estimation. Recently, there has been growing interest in minimizing resources for directional acoustic sensing. For example, [72] uses a single microphone placed in a known room, relying on wall reflections and scattering to estimate the sound source location. In [22], a small vertical wall is placed near a microphone, altering the frequency response based on the direction of sound.
Recent work [23, 73] has placed small objects like Legos and