ABSTRACT

Title of Dissertation: EFFICIENT ACOUSTIC SIMULATION FOR LEARNING-BASED VIRTUAL AND REAL-WORLD AUDIO PROCESSING

Zhenyu Tang
Doctor of Philosophy, 2022

Dissertation Directed by: Professor Dinesh Manocha
Department of Computer Science

Sound propagation is commonly understood as the travel of air pressure perturbations produced by vibrating or moving objects. The energy of sound is attenuated as it travels through the air over a distance and as it is absorbed at other objects' surfaces. Numerous researchers have focused on devising better acoustic simulation methods to model sound propagation in a more realistic manner. The benefits of accurate acoustic simulation include, but are not limited to, computer-aided acoustic design, acoustic optimization, synthetic speech data generation, and immersive audio-visual rendering for mixed reality. However, acoustic simulation has been underexplored for relevant virtual and real-world audio processing applications. The main challenges in adopting accurate acoustic simulation methods include the tradeoff between accuracy and time-space cost and the difficulties in acquiring and reconstructing acoustic scenes in the real world. In this dissertation, we propose novel methods to overcome the above challenges by leveraging the inferential power of deep neural networks and combining them with interactive acoustic simulation techniques. First, we develop a neural network model that can learn the acoustic scattering fields of different objects given their 3D representations as input. This work facilitates the inclusion of wave acoustic scattering effects in interactive sound rendering applications, which used to be difficult without intensive pre-computation. Second, we incorporate a deep acoustic analysis neural network into the sound rendering pipeline to allow the generation of sounds that are perceptually consistent with real-world sounds. This is achieved by predicting acoustic parameters at run time from real-world audio samples and optimizing simulation parameters accordingly. Finally, we build a pipeline that utilizes general 3D indoor scene datasets to generate high-quality acoustic room impulse responses and demonstrate the usefulness of the generated data on several practical speech processing tasks. Our results demonstrate that, by leveraging state-of-the-art physics-based acoustic simulation and deep learning techniques, realistic simulated data can be generated to enhance sound rendering quality in the virtual world and boost the performance of audio processing tasks in the real world.

EFFICIENT ACOUSTIC SIMULATION FOR LEARNING-BASED VIRTUAL AND REAL-WORLD AUDIO PROCESSING

by Zhenyu Tang

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2022

Advisory Committee:
Professor Dinesh Manocha, Chair/Advisor
Professor Carol Espy-Wilson, Dean's Representative
Professor Ming C. Lin
Professor Ramani Duraiswami
Professor Nirupam Roy

© Copyright by Zhenyu Tang 2022

Acknowledgements

The past five years of my life have been a journey full of discoveries, both in scientific research and in learning my own pursuits, limits, and potential. The completion of this dissertation would not have been possible without the help of many people. First, I would like to express my deepest appreciation to my advisor, Prof. Dinesh Manocha, for lending me his invaluable experience and insights in conducting high-quality research.
While leading the large GAMMA research group, he never hesitated to provide me the necessary resources, and has always trusted and supported my decisions. I would like to extend my sincere thanks to my collaborators from both academia and industry. Thank you all for dedicating your time to my preliminary ideas and helping me turn them into meaningful findings that I would not have the bandwidth to achieve alone - Dr. Nicolas Morales, Dr. Dingzeyu Li, Dr. Timothy Langlois, Dr. Nicholas Bryan, Dr. Dong Yu, Dr. Buye Xu, Hsien-Yu Meng, Anton Jeran Ratnarajah, Rohith Aralikatti. I especially want to thank Dr. Atul Rungta, a GAMMA member who discussed acoustic research with me in my early days at graduate school and encouraged me to set forth on this exciting new line of research. I am also grateful to my undergraduate advisor Prof. Hongzhi Wu, for his initial guidance and influence on me, which motivated me into computer science research in the first place. Doctoral research is known to be physically and mentally demanding, and I am very privileged to be supported by my loving family and friends during this period. ii I deeply appreciate the support from my parents, who have selflessly cared for my well-being all the time, even though being thousands of miles apart. Lastly, a special thank you to my partner Ran for all her kindness and love, who has filled my life with joyful expectations. I am so lucky to have encountered you in this period of my life. iii Table of Contents Acknowledgements ii Table of Contents iv List of Tables vii List of Figures ix 1 Introduction 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Challenges and Contributions . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 Background and Previous Research 5 2.1 Room Acoustics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.1 Room Impulse Response . . . . . . . . . . . . . . . . . . . . . 5 2.1.2 Reverberation Time . . . . . . . . . . . . . . . . . . . . . . . . 7 2.1.3 Room Modes . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2 Acoustic Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2.1 Wave Acoustics . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2.2 Geometric Acoustics . . . . . . . . . . . . . . . . . . . . . . . 9 2.3 Acoustic Scene Representation . . . . . . . . . . . . . . . . . . . . . . 9 2.4 Audio Processing Applications . . . . . . . . . . . . . . . . . . . . . . 11 3 Scene-Aware Audio for Mixed Reality1 13 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.3 Deep Acoustic Analysis: Our Algorithm . . . . . . . . . . . . . . . . 21 3.3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.3.2 Geometry Reconstruction . . . . . . . . . . . . . . . . . . . . 23 3.3.3 Learning Reverberation and Equalization . . . . . . . . . . . . 24 3.3.4 Acoustic Material Optimization . . . . . . . . . . . . . . . . . 28 3.4 Analysis and Applications . . . . . . . . . . . . . . . . . . . . . . . . 30 3.4.1 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 1The work in this chapter has been published in Tang et al. (2019a) iv 3.4.2 Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.4.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
34 3.5 Perceptual Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.5.1 Design and Procedure . . . . . . . . . . . . . . . . . . . . . . 36 3.5.2 Participants . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.5.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.5.4 Stimuli . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.5.5 User Study Results . . . . . . . . . . . . . . . . . . . . . . . . 39 3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4 Fast Learning-Based Acoustic Scattering2 44 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.2.1 Interactive Sound Rendering in Dynamic Scenes . . . . . . . . 47 4.2.2 Machine Learning and Acoustic Processing . . . . . . . . . . . 48 4.3 Acoustic Scattering Preliminary . . . . . . . . . . . . . . . . . . . . . 49 4.3.1 Helmholtz Equation . . . . . . . . . . . . . . . . . . . . . . . 49 4.3.2 Acoustic Wave Scattering . . . . . . . . . . . . . . . . . . . . 50 4.3.3 Global and Localized Sound Fields . . . . . . . . . . . . . . . 51 4.3.4 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.4 Learning-based Sound Scattering . . . . . . . . . . . . . . . . . . . . 53 4.4.1 Wave Propagation Modeling . . . . . . . . . . . . . . . . . . . 53 4.4.2 Learning Spherical Pressure Fields . . . . . . . . . . . . . . . 55 4.5 Interactive Sound Propagation with Wave-Ray Coupling . . . . . . . 56 4.6 Implementation and Results . . . . . . . . . . . . . . . . . . . . . . . 59 4.6.1 Data Generation . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.6.2 Network Training . . . . . . . . . . . . . . . . . . . . . . . . . 62 4.6.3 Runtime System and Benchmarks . . . . . . . . . . . . . . . . 62 4.6.4 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 4.7 Perceptual Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 68 4.7.1 Participants . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 4.7.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 4.7.3 Stimuli and Procedure . . . . . . . . . . . . . . . . . . . . . . 69 4.7.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 5 High-Quality Synthetic Acoustic Datasets3 74 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 5.2 Data Augmentation Preliminary . . . . . . . . . . . . . . . . . . . . . 77 5.3 Dataset Creation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 5.3.1 Acoustic Environment Acquisition . . . . . . . . . . . . . . . . 80 5.3.2 Semantic Acoustic Material Assignment . . . . . . . . . . . . 81 5.3.3 Geometric-Wave Hybrid Simulation . . . . . . . . . . . . . . . 82 2The work in this chapter has been published in Tang et al. (2021) 3The work in this chapter has been published in Tang et al. (2019b, 2020, 2022) v 5.3.4 Analysis and Statistics . . . . . . . . . . . . . . . . . . . . . . 86 5.4 Acoustic Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.4.1 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.5 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.5.1 Automated Speech Recognition . . . . . . . . . . . . . . 
. . . 91 5.5.2 Speech Dereverberation . . . . . . . . . . . . . . . . . . . . . . 91 5.5.3 Speech Separation . . . . . . . . . . . . . . . . . . . . . . . . 92 5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 6 Conclusion 95 6.1 Summary of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 vi List of Tables 3.1 Dataset composition. The training set and validation set are based on synthetic IRs and the test set is based on real IRs to guarantee model generalization. Clean speech files are also divided in a way that speakers (?f1" for female speaker 1; ?m10" for male speaker 10) in each dataset partition are different, to avoid the model learning the speaker?s voice signature. Audio files are generated at a sample rate of 16kHz, which is sufficient to cover the human voice?s frequency range. 26 3.2 Benchmark results for acoustic matching. These real-world rooms are of different sizes and shapes, and contain a wide variety of acoustic materials such as brick, carpet, glass, metal, wood, plastic, etc., which make the problem acoustically challenging. We compare our method with Li et al. (2018). Our method does not require a reference IR and still obtains similar T60 and EQ errors in most scenes compared with their method. We also achieve faster optimization speed. Note that the input audio to our method is already noisy and reverberant, whereas Li et al. (2018) requires clean IR recording. All IR plots in the table have the same time and amplitude scale. . . . . . . . . . . . . . . . . 34 4.1 Runtime performance on our benchmarks. The computation of ASFs takes ? 1ms per view and most frame time is spent in ray tracing. . . 64 5.1 Overview of some existing large IR datasets and their characteristics. In the ?Type? column, ?Rec.? means recorded and ?Syn.? means syn- thetic. The real-world datasets capture the low-frequency (LF) and high-frequency (HF) wave effects in the recorded IRs. Note that all prior synthetic datasets use geometric simulation methods and are ac- curate for higher frequencies only. In contrast, we use an accurate hy- brid geometric-wave simulator on more diverse input data, correspond- ing to professionally designed 3D interior models with furniture, and generate accurate IRs corresponding to the entire human aural range (LF and HF). We highlight the benefits of our high-quality dataset for different audio and speech applications. . . . . . . . . . . . . . . . . 76 5.2 Character accuracy of ASR systems. Our method has the highest accuracy and outperforms IM by 1.58%. . . . . . . . . . . . . . . . . 78 vii 5.3 Equal error rates of KWS systems. Our method has the lowest equal error rate and results in a 21% error reduction relative to that of IM. 78 5.4 Results on the SOFA (P?rez-L?pez and De Muynke, 2018) dataset. First three columns show the percentage of DOA labels correctly pre- dicted within error tolerances, followed by average angular errors, and %-improvement on baseline. Best performance in each column is high- lighted in bold. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 5.5 Far-field ASR results obtained for the AMI corpus. The best result is marked in bold. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 5.6 We tabulate the SRMR of the SkipConvNet enhancement model trained using different synthetic IR generation methods. We test the results on real-world reverberant recordings from the VOiCES dataset. 
Use of our hybrid dataset results in improved accuracy over prior methods. 92 5.7 SI-SDRi values reported for different IR generation methods. We re- port results separately for the four rooms used to capture the test set (higher is better). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 viii List of Figures 2.1 Energy distribution of an impulse response in time. . . . . . . . . . . 6 3.1 Given a natural sound in a real-world room that is recorded using a cellphone microphone (left), we estimate the acoustic material prop- erties and the frequency equalization of the room using a novel deep learning approach (middle). We use the estimated acoustic material properties for generating plausible sound effects in the virtual model of the room (right). Our approach is general and robust, and works well with commodity devices. . . . . . . . . . . . . . . . . . . . . . . 13 3.2 Our pipeline: Starting with a audio-video recording (left), we esti- mate the 3D geometric representation of the environment using stan- dard computer vision methods. We use the reconstructed 3D model to simulate new audio effects in that scene. To ensure our simula- tion results perceptually match recorded audio in the scene, we auto- matically estimate two acoustic properties from the audio recordings: frequency-dependent reverberation time or T60 of the environment, and a frequency-dependent equalization curve. The T60 is used to optimize the frequency-dependent absorption coefficients of the materials in the scene. The frequency equalization filter is applied to the simulated au- dio, and accounts for the missing wave effects in geometrical acoustics simulation. We use these parameters for interactive scene-aware audio rendering (right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.3 The simulated and recorded frequency response in the same room at a sample rate of 44.1kHz is shown. Note that the recorded response has noticeable peaks and notches compared with the relatively flat simu- lated response. This is mainly caused by room equalization. Missing proper room equalization leads to discrepancies in audio quality and overall room acoustics. . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.4 We use an off-the-shelf app called MagicPlan to generate geometry proxy. Input: a real-world room (left); Output: the captured 3D model of the room (right) without high-level details, which is used by the runtime geometric acoustic simulator. . . . . . . . . . . . . . . . . . 24 ix 3.5 Network architecture for T60 and EQ prediction. Two models are trained for T60 and EQ, which have the same components except the output layers have different dimensions customized for the octave bands they use. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.6 Equalization augmentation. The 1000Hz sub-band is used as reference and has unit gain. We fit normal distributions (red bell curves shown in (a)) to describe the EQ gains of MIT IRs. We then apply EQs sampled from these distributions to our training set distribution in (b). We observe that the augmented EQ distribution in (d) becomes more similar to the target distribution in (c). . . . . . . . . . . . . . 27 3.7 Evaluating T60 from signal envelope on low and high frequency bands of the same IR. Note that the SNR in the low frequency band is lower than the high frequency band. This makes T60 evaluation for low frequency bands less reliable, which partly explains the larger test error in low frequency sub-bands. . . . . . . . . . . . 
. . . . . . . . . . . . . . . . 31 3.8 Simulated energy curves before and after optimization (with target slope shown). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.9 Stress test our our optimizer. We uniformly sample T60 between 0.2s and 2.5s and set it to be the target. The ideal I/O relationship is a straight line passing the origin with slope 1. Our optimization results matches the ideal line much better than prior optimization method. 33 3.10 We show the effects of our equalization filtering on audio spectrograms, compared with Schissler et al. (2017). In the highlighted region, we are able to better reproduce the fast decay in the high-frequency range, closely matching the recorded sound. . . . . . . . . . . . . . . . . . . 33 3.11 We demonstrate the importance on T60 optimization on audio ampli- tude waveform. Our method optimizes the material parameters based on input audio and matches the tail shape and decay amplitude with the recorded sound, whereas the visual-based object materials from Kim et al. (2019) failed to compensate for the audio effects. . . . . . 35 3.12 A screenshot of MUSHRA-like web interface used in our user study. The design is from Cartwright et al. (2016). . . . . . . . . . . . . . . 37 3.13 Box plot results for our listening test. Participants were asked to rate how similar each recording was to the explicit reference. All recordings have the same content, but different acoustic conditions. Note our proposed T60 and T60+EQ are both better than the Mid-Anchor by a statistically significant amount (approx10 rating points on a 100 point scale). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 x 4.1 We show the dynamic scenes with various moving objects that are used to evaluate our hybrid sound propagation algorithm. We compute the acoustic scattered fields of each object using a neural network and couple them with interactive ray tracing to generate diffraction and occlusion effects. Our approach can generate plausible acoustic effects in dynamic scenes in a few milliseconds and we demonstrate its benefits for sound rendering in virtual environments. . . . . . . . . . . . . . . 44 4.2 Overview: Our algorithm consists of the training stage and the run- time stage. The training stage uses a large dataset of 3D objects and their associated acoustic pressure fields computed using an accurate BEM solver to train the network. The runtime stage uses the trained neural network to predict the sound pressure field from a point cloud approximation of different objects at interactive rates. . . . . . . . . 53 4.3 Simulated sound pressure fall-off and inverse-distance law fit- ted curves: We calculate the sound pressure around a sound scat- terer in our dataset using the BEM solver as reference. We exam- ine the sound pressure from 1m to 10m scattered along 5 directions (0?, 72?, 144?, 216?, and 288?). We regard the sound pressure value at 10m to correspond to far-field condition, and inversely fit the pres- sure values for distance within 10m according to Equation 4.7. We userref = 5m is used for generating our ASFs, although other values can be used as well. . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.4 PointNet regression: Given an input point cloud with N = 1024 3D points, we feed it to the PointNet architecture (Charles et al., 2017) until maxpooling to extract the global feature. 
Then we use multi-layer perceptrons (MLPs) of layer size 256, 128, and 16 to map the feature to a SH vector of length 16 representing the scattering field. . . . . . 56 4.5 Our dataset generation pipeline for neural network training: Given a set of CAD models, we apply random rotations with respect to their center of mass to generate a larger augmented dataset and use a BEM solver to calculate the ASFs. . . . . . . . . . . . . . . . . . . 60 4.6 Spherical harmonics approximation of sound pressure fields: We evaluate different orders of SH functions to fit our pressure fields at 4 frequencies and calculate the relative fitting errors. . . . . . . . . 62 4.7 Comparing ASF prediction accuracy in latitude-longitude plots: We highlight the ASFs for different simulation frequencies. For each image block, the left column shows the mesh rendering of the objects. The Lat-Long plots visualize the ASF used in Equation (4.9) by fre- quency using perceptually uniform colormaps: the top row (Target) is the groundtruth ASF computed using a BEM solver on the original mesh; the bottom row (Predicted) represents the ASF computed us- ing our neural network based on point-cloud representation. The error metric NRE from Equation (4.13) is annotated above predicted ASFs. 65 4.8 Distribution of test set prediction errors: We also mark the 50%, 75% and 95% percentiles in the error histogram. . . . . . . . . . 67 xi 4.9 Perceptual evaluation results: User ratings are visualized as box plots. A higher rating means better quality. Results are grouped by benchmark scene and each box represents the rating of a specific ren- dering pipeline in that scene. . . . . . . . . . . . . . . . . . . . . . . . 71 5.1 Our IR data generation pipeline starts from a 3D model of a com- plex scene and its visual material annotations (unstructured texts). We sample multiple collision-free source and receiver locations in the scene. We use a novel scheme to automatically assign acoustic material parameters by semantic matching from a large acoustic database. Our hybrid acoustic simulator generates accurate impulse responses (IRs), which become part of the large synthetic impulse response dataset after post-processing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 5.2 Our semantic material assignment algorithm. We use NLP techniques based on sentence embedding along with transformer network to choose absorption coefficients from a database of 2, 042 unique materials. . . 81 5.3 Power spectrum comparison between the original wave FDTD simu- lated IR and the calibrated IR. The vertical dashed line indicates the highest valid frequency of the FDTD method. Our automatic cali- bration method ensures that the GA and wave-based methods have consistent energy levels so that they can generate high quality IRs and plausible/smooth sound effects. . . . . . . . . . . . . . . . . . . . . . 86 5.4 We highlight the most frequently used materials in our approach for generating the IR dataset. The acoustic database also contains non- English words, which are handled by a pre-trained multi-lingual lan- guage model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 5.5 Distance distribution between source and receiver pairs in our scene database. No special distance constraints are enforced during sampling except the need to be collision-free from the objects in the scene. The IRs vary based on relative positions of the source and the received in a 3D scene. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
88 5.6 Statistics of house/scene volumes and reverberation times. We see a large variation in reverberation times, which is important for speech processing and other applications. . . . . . . . . . . . . . . . . . . . 88 5.7 Frequency responses of geometric and hybrid simulations compared with measured IRs in BRAS benchmarks RS5-7 (Asp?ck et al., 2020). Images of each setup are attached in the corners of the graph. We notice that the IRs generated using our hybrid method closely match with the measure IRs, as compared to those generated using GA methods. This demonstrates the higher quality and accuracy of our IRs as compared to the ones generated by prior GA methods highlighted in Table 5.1. 94 xii Chapter 1 Introduction 1.1 Motivation Accurate and efficient simulation of physics has been an important topic for computer science and applied math research. Over past decades, the rapid development of computing hardware and software has facilitated many simulation techniques that transfer theories to professional tools that we use to interpret and predict real-world physics. The application of computer simulation has seen huge success in many fields, including weather forecasting, industrial computer-aided design (CAD), flight simulation for personnel training, digital entertainment, etc. One research area that has gained increased interest in recent years is efficient acoustic simulation for audio processing. Audio signals corresponding to music, speech, and non-verbal sounds in the real world encode rich information regarding the surrounding environment. Many digital signal processing algorithms and audio deep learning techniques have been proposed to extract information from audio signals. These methods are widely used for different applications such as music information retrieval, automated speech recognition, sound separation and localization, sound synthesis and rendering, etc. Acoustic simulation techniques are often used in audio 1 processing tasks where real-world audio data is difficult to acquire. In contrast to realistic visual rendering techniques, which have been the main topic for computer graphics research, acoustic simulation has not been as widely adopted by related applications such as computer games, digital film making, and virtual reality. While state-of-the-art acoustic simulation techniques can add high- fidelity physics-based sounds to these applications, there are still barriers that make them less practical to be used for situations where: (1) it is difficult to faithfully describe the acoustic environment and give accurate inputs to the simulator or (2) there is a strict requirement for computing efficiency. As a result, digital audio in many applications is often post-processed by professionals subjectively, even though they can deviate hugely from physically realistic sounds. However, there are still areas where accurate acoustic simulation is irreplaceable, including but not limited to computer-aided acoustic design, environmental acoustic optimization, and immersive audio-visual rendering for mixed reality. This motivates us to investigate how to use acoustic simulation techniques practically in various virtual and real-world audio processing tasks. 1.2 Challenges and Contributions In contrast to previous research, which focused on theoretical acoustic simulations, my dissertation research aims to bridge the gap between theoretical methods and their applications in practical audio processing tasks. 
One challenge is the trade-off between simulation accuracy and time-space cost. Conventional numerical wave solvers based on the first-principles wave equation provide the most accurate results and can be validated against real-world measurements. However, they usually scale poorly with simulation frequency and scene scale, making them unsuitable for simulations that are large in number or in scale. Another challenge is incorporating synthetic sound into real-world settings, where the simulated sound needs to be consistent with the recorded sound. This requires the simulator to be scene-aware: the sound simulation setup needs to align well with the real-world scene. The main difficulty comes from two sources: (1) the real-world scene configuration is not always well known and needs to be empirically inferred or measured on-site, and prior solutions are either inaccurate or not user-friendly; (2) wave effects are essential for low-frequency components but are poorly approximated by state-of-the-art real-time acoustic simulators, and a large amount of pre-computation time is needed to incorporate results from wave-based solvers.

In this dissertation, we develop a series of algorithms and tools to overcome the above challenges and verify their effectiveness via real-world acoustic benchmarks and subjective listening studies. Our main contributions can be summarized in the following three aspects:

Scene-Aware Audio for Mixed Reality: We propose a novel method that allows automatic analysis of real-world acoustics for generating virtual sounds that are perceptually consistent with real-world sounds. This is achieved by training acoustic parameter predictors1 on a large amount of simulated data from various room environments. The scene analysis can be performed on new real-world scenes on the fly while still generating plausible sound rendering that is consistent with the recorded sound in the same environment.

Fast Learning-Based Acoustic Scattering: We present a novel approach to approximate the acoustic scattering field of any geometric object using neural networks for interactive sound propagation in highly dynamic scenes2. Our approach is general and makes no assumption about the scene or the motion or topology of the objects. We exploit properties of the acoustic scattering field of objects at lower frequencies and use neural networks to learn this field from geometric representations of the objects.

High-Quality Synthetic Acoustic Datasets: We propose methods to simulate high-quality room impulse responses (RIRs) using our physics-based geometric acoustic simulator3 and a hybrid geometric-wave simulation approach. We address the challenges in accurately modeling acoustic phenomena, including occlusion, specular and diffuse reflections, and diffraction, and demonstrate the benefits of our method on speech recognition, speech enhancement, keyword spotting, and direction-of-arrival estimation tasks.

1 Code available at https://github.com/GAMMA-UMD/deep-acoustic-analysis
2 Code available at https://github.com/GAMMA-UMD/Fast3DScattering-release

1.3 Organization

The rest of the dissertation is organized as follows: Chapter 2 gives a comprehensive background and overview of previous research related to the topics in this dissertation. Chapters 3, 4, and 5 present our work on scene-aware audio, fast 3D acoustic scattering, and high-quality acoustic dataset generation, respectively.
Then we discuss the limitations, envision several future research directions, and conclude my dissertation in Chapter 6.

3 Code available at https://github.com/GAMMA-UMD/pygsound

Chapter 2  Background and Previous Research

2.1 Room Acoustics

2.1.1 Room Impulse Response

Sound is commonly known to be air pressure perturbations caused by vibrating/moving objects. One conventional way to define a sound signal is by describing the air pressure perturbation (in Pascals) as a function of time, denoted as s(t). A sound signal is attenuated by transmission through the air over a distance and by absorption at other objects' surfaces. A room, or more generally, an acoustic environment, affects any sound signal excited within it before the sound is received by a listener (e.g., human ears or microphones). The transformation from the input signal to the output signal can be characterized by the room impulse response (RIR), which specifies how a signal is delayed and attenuated in a linear time-invariant (LTI) system. If we denote the RIR by h(t), we can write the input-output relationship as

$$s_{\text{out}}[t] = s_{\text{in}}[t] \ast h[t], \qquad (2.1)$$

where $\ast$ denotes 1D convolution. More formally, the RIR is defined as the output signal in response to an impulsive input signal represented by the Dirac function $\delta(t)$, which is zero everywhere except at the origin, where it is infinite (Kuttruff, 2016). Conventionally, an RIR can be decomposed into three parts: the direct response, early reflections, and the late reverberation. The direct response is determined by the visibility between the source and the listener. Early reflections have stronger energy peaks and follow shortly after the direct response. The late reverberation is the result of high-order reflections and is more random. A typical RIR energy distribution is shown in Figure 2.1.

Figure 2.1: Energy distribution of an impulse response in time.

The Fourier transform of the RIR is known as the frequency response of the room, which reveals the frequency-dependent changes in sound intensity and phase. Once the RIR for a particular source-listener pair in a room is known, it can be used as a digital filter to reproduce any sound signal as if the sound had been emitted in the same room.

In terms of recording RIRs in the real world, the most reliable methods involve playing and recording Golay codes (Foster, 1986) or sine sweeps (Farina, 2000) at high signal-to-noise ratios. Also required are fairly high-quality speakers and microphones with flat frequency responses, small harmonic distortion, and little cross-talk. The speaker and microphone should be acoustically separated from surfaces, i.e., they shouldn't be placed directly on tables (else surface vibrations could contaminate the signal). Clock drift between the source and microphone must be accounted for (Bryan et al., 2010). Alternatively, balloon pops or hand claps have been proposed for easier RIR estimation, but require additional post-processing (Abel et al., 2010; Seetharaman and Tarzia, 2012).

2.1.2 Reverberation Time

The sound signal emitted by any finite-time sound source will eventually drop in amplitude below the human hearing threshold as the signal is absorbed by its propagating medium (i.e., air) and by the boundaries of the room. Such energy decay is often exponential with respect to time.
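As a concrete aside, both Equation (2.1) and this exponential energy decay can be reproduced in a few lines of Python. The sketch below is purely illustrative and not code from this dissertation: the exponentially decaying noise burst stands in for a measured or simulated RIR, and the noise input stands in for a dry recording.

```python
# Minimal sketch: applying Equation (2.1) and inspecting the energy decay of
# an RIR. The synthetic RIR and input below are placeholders for real data.
import numpy as np
from scipy.signal import fftconvolve

fs = 16000                                   # sample rate in Hz
rng = np.random.default_rng(0)

# Placeholder RIR: exponentially decaying white noise, ~0.6 s long.
t = np.arange(int(0.6 * fs)) / fs
h = rng.standard_normal(t.size) * np.exp(-t / 0.1)
h /= np.max(np.abs(h))

# Placeholder "dry" input signal (would normally be loaded from a file).
s_in = rng.standard_normal(fs)               # 1 s of noise as dry input

# Equation (2.1): the reverberant output is the dry signal convolved with h.
s_out = fftconvolve(s_in, h)

# Schroeder backward integration gives the energy decay curve in dB, from
# which the decay rate can be read off.
edc = np.cumsum(h[::-1] ** 2)[::-1]
edc_db = 10 * np.log10(edc / edc[0] + 1e-12)
print("time to -20 dB: %.3f s" % (np.argmax(edc_db < -20) / fs))
```

The backward-integrated energy curve computed at the end of this sketch is the usual starting point for the decay metrics discussed next.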
One acoustic metric commonly used to describe the decay rate is the reverberation time (Sabine, 1927), defined as the time interval in which the sound pressure level for an impulse input decays by 60 dB from its onset, written as T60. T60 can be evaluated directly from a recorded RIR (Karjalainen et al., 2001). Conventional rooms may have reverberation times from 0.3 s to 2.0 s, depending mostly on the size and furnishing of the room. Extremely large environments and reverberation chambers can have reverberation times up to 10 s. In theory, a signal in the free field (e.g., a vacuum) will have a T60 of 0 s. An acoustically treated anechoic chamber can also have a T60 close to zero; sounds recorded under such conditions are often called "dry" or "clean" sounds.

2.1.3 Room Modes

As a sound signal propagates in a room, standing waves can form at discrete resonant frequencies whose wavelength $\lambda$ satisfies $\lambda = 2L/n,\ n = 1, 2, 3, \ldots$, where $L$ is the room dimension along some direction (e.g., axial, tangential, oblique). Room modes are the collection of these resonant frequencies and consist of mostly low frequencies below the Schroeder frequency (in Hz) $f_c = 2000\sqrt{T_{60}/V}$ (Schroeder, 1996), where $T_{60}$ is the reverberation time in seconds and $V$ is the volume of the room in m³. For typical residential rooms, $f_c$ will be lower than 200 Hz. At these resonant frequencies, the sound pressure tends to be significantly modified at different locations in the same room, which can cause problems for accurate sound reproduction. Various methods have been devised to remove room modes using equalization filters (Cecchi et al., 2018).

2.2 Acoustic Simulation

Acoustic simulation can involve the processes of sound generation and sound propagation. In this dissertation, we focus on the sound propagation aspect and refer readers interested in modal sound simulation to Zheng and James (2011).

2.2.1 Wave Acoustics

First, we describe the theoretical foundation of wave acoustic simulations. A scalar acoustic pressure field $P(\mathbf{x}, t)$ satisfies the inhomogeneous wave equation

$$\frac{\partial^2 P(\mathbf{x}, t)}{\partial t^2} - c^2 \nabla^2 P(\mathbf{x}, t) = f(\mathbf{x}, t), \qquad (2.2)$$

where $c$ is the speed of sound, $\mathbf{x}$ is the 3D coordinate, and $f(\mathbf{x}, t)$ is the forcing term, usually representing some driving source signal. An RIR can be obtained by setting $f(\mathbf{x}, t)$ to an impulse signal at a source location $\mathbf{x}_s$, fixing $P(\mathbf{x}, t)$ at the receiver location $\mathbf{x}_r$, and extracting its time-varying component. The wave equation can be solved numerically using the finite-difference time-domain (FDTD) method (Botteldooren, 1995) or, in the frequency domain, using the finite-element method (FEM) (Thompson, 2006), the boundary-element method (BEM) (Wrobel and Kassab, 2003), the adaptive rectangular decomposition (ARD) method (Raghuvanshi et al., 2009), etc. These methods are also referred to as wave-based methods. Their computational complexity increases linearly with the size of the environment (surface area or volume) and as the third or fourth power of the simulation frequency. As a result, they are limited to lower frequencies and offline simulations (Raghuvanshi et al., 2010; Mehra et al., 2013; Yeh et al., 2013).

2.2.2 Geometric Acoustics

When the wavelength of the sound is smaller than the size of the obstacles in the environment, the sound wave can be treated as a ray, which is the key idea of geometric acoustics.
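Before listing specific geometric techniques, the following toy example illustrates the wave-based alternative of Section 2.2.1: a one-dimensional FDTD update for Equation (2.2). It is only a didactic sketch under simplifying assumptions (1D, zero-pressure boundaries, no absorption) and is not one of the solvers cited above.

```python
# Toy 1D FDTD solver for the wave equation (Equation 2.2), for illustration
# only: real room-acoustic solvers are 3D and model boundary absorption.
import numpy as np

c = 343.0                     # speed of sound (m/s)
L = 10.0                      # length of the 1D domain (m)
dx = 0.01                     # spatial step (m)
dt = 0.9 * dx / c             # time step satisfying the CFL condition
C2 = (c * dt / dx) ** 2       # squared Courant number

n = int(L / dx) + 1
p_prev = np.zeros(n)          # pressure at time step k-1
p = np.zeros(n)               # pressure at time step k
p[n // 2] = 1.0               # impulsive source in the middle of the domain

for _ in range(2000):
    p_next = np.zeros(n)
    # Leapfrog update of the interior points; boundary samples are held at
    # zero pressure.
    p_next[1:-1] = (2 * p[1:-1] - p_prev[1:-1]
                    + C2 * (p[2:] - 2 * p[1:-1] + p[:-2]))
    p_prev, p = p, p_next

# p now holds a snapshot of the propagating pressure field; sampling it at a
# fixed "receiver" index over time would yield a (1D) impulse response.
print("peak pressure after propagation:", float(np.max(np.abs(p))))
```

Even this toy version hints at the cost issue noted above: halving dx doubles the number of grid points and forces a smaller dt, and in 3D the total cost grows roughly with the third or fourth power of the simulation frequency.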
Typical geometric acoustic simulation techniques include the image method (Allen and Berkley, 1979), which only models specular reflections; path tracing methods (Taylor et al., 2009, 2012b; Schissler and Manocha, 2016, 2018) based on efficient Monte Carlo path tracing (Kajiya, 1986); and beam or frustum tracing methods (Funkhouser et al., 1998b; Chandak et al., 2008). These techniques are designed to run orders of magnitude faster than wave acoustic solvers and can be enhanced to simulate low-frequency diffraction effects. This category includes the time-domain Biot-Tolstoy-Medwin (BTM) model, which can be expensive and is also limited to offline computation (Svensson et al., 1999). For interactive applications, commonly used techniques are based on the uniform theory of diffraction (UTD), a less accurate frequency-domain model that can generate plausible results in some cases (Tsingos et al., 2001; Taylor et al., 2012a; Schissler et al., 2014). Moreover, the complexity of edge-based diffraction algorithms can increase exponentially with the maximum diffraction order. A more extensive review of geometric acoustic techniques can be found in Liu and Manocha (2020).

2.3 Acoustic Scene Representation

Room acoustics depend on many factors. Room geometry and acoustic materials together can greatly affect how a sound signal is modified by propagation. In larger and less absorbent rooms, the sound signal can keep travelling for a longer time before vanishing (hence a longer T60), and vice versa. For the same room size, differently shaped rooms also exhibit different room modes. In practice, modern 3D vision techniques can be used on commodity devices to construct geometry proxies from a video recording of a real-world scene in the form of dense 3D point clouds or meshes (Zhi et al., 2019; Bloesch et al., 2018).

Acoustic materials are often described in terms of how they react to incoming sound. The complex acoustic impedance indicates how much sound pressure is generated in response to vibrations in the acoustic medium (e.g., air). This quantity is used by many wave acoustic solvers but needs to be measured in controlled lab settings (Hiremath et al., 2021), making it less accessible for most materials. In the geometric acoustics context, the absorption and scattering coefficients are more commonly used, though they are also a source of error in acoustic simulations (Vorländer, 2013). The absorption coefficient $\alpha \in [0, 1]$ is defined as the fraction of sound energy at a specific frequency that is absorbed by the material. While the measurement of $\alpha$ also requires a reverberation chamber, the frequency-dependent absorption coefficients of many common materials have been measured and compiled into acoustic material databases. The energy that is not absorbed can be further described using the scattering coefficient $s \in [0, 1]$, which represents the fraction of sound that is diffusely reflected (e.g., following a Lambertian distribution), while the remaining fraction is specularly reflected (i.e., has high directivity). However, the scattering coefficient depends strongly on the roughness of the surface (Christensen and Rindel, 2005), and available measured data are few.
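To make the role of absorption coefficients concrete, the classical Sabine formula $T_{60} \approx 0.161\,V / \sum_i \alpha_i S_i$ relates the frequency-dependent absorption coefficients and surface areas of a room to its reverberation time. The sketch below evaluates it per octave band; the room dimensions and coefficient values are invented for illustration and are not taken from any database used in this dissertation.

```python
# Sabine's formula per octave band: T60 ~= 0.161 * V / sum(alpha_i * S_i).
# The room and the absorption coefficients below are illustrative only.
import numpy as np

bands_hz = [125, 250, 500, 1000, 2000, 4000]
room_dims = np.array([6.0, 4.0, 3.0])            # shoebox room, in meters
V = float(np.prod(room_dims))                    # volume in m^3

# Surface areas of a shoebox room: floor, ceiling, and four walls, each with
# made-up per-band absorption coefficients.
lx, ly, lz = room_dims
surfaces = {
    "floor (carpet)":   (lx * ly, [0.10, 0.20, 0.35, 0.40, 0.50, 0.60]),
    "ceiling (gypsum)": (lx * ly, [0.10, 0.08, 0.05, 0.04, 0.05, 0.05]),
    "walls (brick)":    (2 * (lx + ly) * lz,
                         [0.02, 0.02, 0.03, 0.04, 0.05, 0.07]),
}

for b, f in enumerate(bands_hz):
    total_absorption = sum(area * alphas[b]
                           for area, alphas in surfaces.values())
    t60 = 0.161 * V / total_absorption
    print(f"{f:>5d} Hz: T60 = {t60:.2f} s")
```

Simple statistical estimates of this kind ignore geometry and scattering entirely, which is one reason the simulators used later in this dissertation model reflections and diffraction explicitly.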
In theory, bidirectional reflection distribution functions (BRDFs), which are widely used in computer graphics, can more accurately describe the interaction between an incoming sound and a material (Möckl and Dachsbacher, 2014), but acoustic BRDFs have not been commonly measured.

Once the room geometry and materials are well defined, the sound field can be simulated for any given sources. Wave-based methods solve the sound field for the whole space within the acoustic scene, so the sound pressure at specific locations can be evaluated after the simulation finishes. In contrast, geometric methods only simulate results at pre-defined receiver locations but use less memory than wave-based methods. One convenient property for acoustic measurements and simulations is that the locations of the source and the receiver can be interchanged without affecting the measured/simulated result, according to acoustic reciprocity (Wapenaar, 2019). This is sometimes useful in reducing the number of measurements/simulations in various scenes.

2.4 Audio Processing Applications

The measurement/computation of RIRs and the resulting datasets have been used for audio processing applications including but not limited to:

1. Sound Propagation and Rendering: Sounds in nature are produced by vibrating objects and then propagate through a medium (e.g., air) before finally being heard by a listener. Humans can perceive these sound waves in the frequency range of 20 Hz to 20 kHz (the human aural range). There is a large body of literature on modeling sound propagation in indoor scenes using geometric and wave-based methods (Liu and Manocha, 2020; Krokstad et al., 1968; Vorländer, 1989; Funkhouser et al., 1998a; Raghuvanshi et al., 2009; Mehra et al., 2013; Schissler et al., 2014). Wave-based solvers are practical for lower frequencies and limited to static scenes. Geometric methods, widely used in interactive applications, are accurate for higher frequencies. We need automatic software systems that can accurately compute IRs covering the human aural range and handle arbitrary 3D models.

2. Deep Audio Synthesis for Videos: Video acquisition has become very common and easy. However, it is difficult to add realistic audio that can be synchronized with the animation in a video. Many deep learning methods that utilize acoustic impulse responses have been proposed for such audio synthesis (Li et al., 2018; Owens et al., 2016; Zhou et al., 2018).

3. Speech Processing using Deep Learning: IRs contain many cues relevant to reproducing or understanding intelligible human speech. Synthetic datasets of IRs have been used in machine learning methods for automatic speech recognition (Malik et al., 2021; Ko et al., 2017; Tang et al., 2020; Ratnarajah et al., 2021), sound source separation (Aralikatti et al., 2021; Jenrungrot et al., 2020), and sound source localization (Grumiaux et al., 2021).

4. Sound Simulation using Machine Learning: Many recent deep learning methods have been proposed for sound synthesis (Hawley et al., 2020; Ji et al., 2020; Jin et al., 2020), scattering effect computation, and sound propagation (Fan et al., 2020b; Meng et al., 2021; Pulkki and Svensson, 2019). Deep learning methods have also been used to compute the material properties of a room and its acoustic characteristics (Schissler et al., 2017; Tang et al., 2019a). Some of these will be discussed in more detail in this dissertation.
Other applications that have used acoustic datasets include navigation (Chen et al., 2020), floorplan reconstruction (Purushwalkam et al., 2021), and depth estimation algorithms (Gao et al., 2020).

Chapter 3  Scene-Aware Audio for Mixed Reality1

1 The work in this chapter has been published in Tang et al. (2019a)

Figure 3.1: Given a natural sound in a real-world room that is recorded using a cellphone microphone (left), we estimate the acoustic material properties and the frequency equalization of the room using a novel deep learning approach (middle). We use the estimated acoustic material properties for generating plausible sound effects in the virtual model of the room (right). Our approach is general and robust, and works well with commodity devices.

3.1 Introduction

Auditory perception of recorded sound is strongly affected by the acoustic environment it is captured in. Concert halls are carefully designed to enhance the sound on stage, even accounting for the effects an audience of human bodies will have on the propagation of sound (Barron, 2010). Anechoic chambers are designed to remove acoustic reflections and propagation effects as much as possible. Home theaters are designed with acoustic absorption and diffusion panels, as well as with careful speaker and seating arrangements (Rizzi et al., 2016).

The same acoustic effects are important when creating immersive effects for virtual reality (VR) and augmented reality (AR) applications. It is well known that realistic sounds can improve a user's sense of presence and immersion (Larsson et al., 2002). There is considerable work on interactive sound propagation in virtual environments based on geometric and wave-based methods (Vorländer, 1989; Schissler and Manocha, 2017; Raghuvanshi and Snyder, 2014c; Cao et al., 2017). Furthermore, these techniques are increasingly used to generate plausible sound effects in VR systems and games, including Microsoft Project Acoustics2, Oculus Spatializer3, Steam Audio4, etc. However, these methods are limited to synthetic scenes where an exact geometric representation of the scene and the acoustic material properties are known a priori.

2 https://aka.ms/acoustics
3 https://developer.oculus.com/downloads/package/oculus-spatializer-unity
4 https://valvesoftware.github.io/steam-audio

In this chapter, we address the problem of rendering realistic sounds that are similar to recordings of real acoustic scenes. These capabilities are needed for VR as well as AR applications (Conference, 2018), which often use recorded sounds. Foley artists often record source audio in environments similar to the places where the visual contents were recorded. Similarly, creators of vocal content (e.g., podcasts, movie dialogue, or video voice-overs) carefully re-record content made in a different environment or with different equipment to match the acoustic conditions. However, these processes are expensive, time-consuming, and cannot adapt to the spatial listening location. There is strong interest in developing automatic spatial audio synthesis methods.

For VR or AR content creation, acoustic effects can also be captured with an impulse response (IR), a compact acoustic description of how sound propagates from one location to another in a given scene. Given an IR, it can be convolved with any virtual sound or dry sound to generate the desired acoustic effects. However, recording the IRs of real-world scenes can be challenging, especially for interactive applications.
Many times special recording hardware is needed to record the IRs. Furthermore, the IR is a function of the source and listener positions and it needs to be re-recorded as either position changes. Our goal is to replace the step of recording an IR with an unobtrusive method that works on in-situ speech recordings and video signals and uses commodity devices. This can be regarded as an acoustic analogy of visual relighting (Debevec, 2002): to light a new visual object in an image, traditional image based lighting methods re- quire the capture of real-world illumination as an omnidirectional, high dynamic range (HDR) image. This light can be applied to the scene, as well as on a newly inserted object, making the object appear as if it was always in the scene. Recently, Gardner et al. (2017) and Hold-Geoffroy et al. (2017) proposed convolutional neural network (CNN)-based methods to estimate HDR indoor or outdoor illumination from a single low dynamic range (LDR) image. These high-quality visual illumination estimation methods enable novel interactive applications. Concurrent work from LeGendre et al. (2019) demonstrates the effectiveness on mobile devices, enabling photorealistic mo- bile mixed reality experiences. In terms of audio ?relighting" or reproduction, there have been several approaches proposed toward realistic audio in 360? images (Kim et al., 2019), multi-modal estima- tion and optimization (Schissler et al., 2017), and scene-aware audio in 360? videos (Li et al., 2018). However, these approaches either require separate recording of an IR, or produce audio results that are perceptually different from recorded scene audio. Important acoustic properties can be extracted from IRs, including the reverberation time (T60), which is defined as the time it takes for a sound to decay 60 decibels (Kut- 15 Figure 3.2: Our pipeline: Starting with a audio-video recording (left), we estimate the 3D geometric representation of the environment using standard computer vision methods. We use the reconstructed 3D model to simulate new audio effects in that scene. To ensure our simulation results perceptually match recorded audio in the scene, we automatically estimate two acoustic properties from the audio recordings: frequency-dependent reverberation time or T60 of the environment, and a frequency- dependent equalization curve. The T60 is used to optimize the frequency-dependent absorption coefficients of the materials in the scene. The frequency equalization filter is applied to the simulated audio, and accounts for the missing wave effects in geometrical acoustics simulation. We use these parameters for interactive scene-aware audio rendering (right). truff, 2016), and the frequency-dependent amplitude level or equalization (EQ) (Hak et al., 2012). This heavy reliance on IRs greatly constrains the wide adoption of au- dio for immersive applications or video post-production that require realistic acoustic simulation that is calibrated to real-world acoustic scenes. Main Results: We present novel algorithms to estimate two important environ- mental acoustic properties from recorded sounds (e.g. speech). Our approach uses commodity microphones and does not need to capture any IRs. The first property is the frequency-dependent T60. This is used to optimize absorption coefficients for ge- ometric acoustic (GA) simulators for audio rendering. Next, we estimate a frequency equalization filter to account for wave effects that cannot be modeled accurately using geometric acoustic simulation algorithms. 
This equalization step is crucial to ensur- 16 ing that our GA simulator outputs perceptually match existing recorded audio in the scene. Estimating the equalization filter without an IR is challenging since it is not only speaker dependent, but also scene dependent, which poses extra difficulties in terms of dataset collection. For a model to predict the equalization filtering behavior accurately, we need a large amount of diverse speech data and IRs. Our key idea is a novel dataset augmentation process that significantly increases room equalization variation. With robust room acoustic estimation as input, we present a novel inverse material optimization algorithm to estimate the acoustic properties. We propose a new objective function for material optimization and show that it models the IR de- cay behavior better than (Li et al., 2018). We demonstrate our ability to add new sound sources in regular videos. Similar to visual relighting examples where new ob- jects can be rendered with photorealistic lighting, we enable audio reproduction in any regular video with existing sound with applications for mixed reality experiences. We highlight their performance on many challenging benchmarks. We show the importance of matched T60 and equalization in our perceptual user study ?3.5. In particular, our perceptual evaluation results show that: (1) Our T60 estimation method is perceptually comparable to all past baseline approaches, even though we do not require an explicit measured IR; (2) Our EQ estimation method im- proves the performance of our T60-only approach by a statistically significant amount (? 10 rating points on a 100 point scale); and (3) Our combined method (T60+EQ) outperforms the average room IR (T60 = .5 seconds with uniform EQ) by a sta- tistically significant amount (+10 rating points) ? the only reasonable comparable baseline we could conceive that does not require an explicit IR estimate. To the best of our knowledge, ours is the first method to predict IR equalization from raw speech data and validate its accuracy. Our main contributions include: ? A CNN-based model to estimate frequency-dependent T60 and equalization filter from real-world speech recordings. 17 ? An equalization augmentation scheme for training to improve the prediction robustness. ? A derivation for a new optimization objective that better models the IR decay process for inverse materials optimization. ? A user study to compare and validate our performance with current state-of-the- art audio rendering algorithms. Our study is used to evaluate the perceptual similarity between the recorded sounds and our rendered audio. 3.2 Related Work Cohesive audio in mixed reality environments (when there is a mix of real and virtual content), is more difficult than in fully virtual environments. This stems from the difference between ?Plausibility? in VR and ?Authenticity? in AR (Kim et al., 2019). Visual cues dominate acoustic cues, so the perceptual difference between how audio sounds and the environment in which it is seen is smaller than the perceived envi- ronment of two sounds. Recently, Li et al. (2018) introduced scene-aware audio to optimize simulator parameters to match the room acoustics from existing recordings. By leveraging visual information for acoustic material classification, Schissler et al. (2017) demonstrated realistic audio for 3D-reconstructed real-world scenes. However, both of these methods still require explicit measurement of IRs. 
In contrast, our proposed pipeline works with any input speech signal and commodity microphones. Sound simulation can be categorized into wave-based methods and geometric acoustics. While wave-based methods generally produce more accurate results, it remains an open challenge to build a real-time universal wave solver. Recent ad- vances such as parallelization via rectangular decomposition (Morales et al., 2015), pre-computation acceleration structures (Mehra et al., 2015), and coupling with ge- ometric acoustics (Yeh et al., 2013; Rungta et al., 2018) are used for interactive 18 applications. It is also possible to precompute low-frequency wave-based propaga- tion effects in large scenes (Raghuvanshi et al., 2010), and to perceptually compress them to reduce runtime requirements (Raghuvanshi and Snyder, 2014a). Even with the massive speedups presented, and a real-time runtime engine, these methods still require tens of minutes to hours of pre-computation depending on the size of the scene and frequency range chosen, making them impractical for augmented reality scenarios and difficult to include in an optimization loop to estimate material param- eters. With interactive applications as our goal, most game engines and VR systems tend to use geometric acoustic simulation methods (Vorl?nder, 1989; Schissler and Manocha, 2017; Cao et al., 2017). These algorithms are based on fast ray tracing and perform specular and diffuse reflections (Savioja and Svensson, 2015). Some techniques have been proposed to approximate low-frequency diffraction effects using ray-tracing (Tsingos et al., 2001; Rungta et al., 2018; Taylor et al., 2012a). Our ap- proach can be combined with any interactive audio simulation method, though our current implementation is based on bidirectional ray tracing (Cao et al., 2017). The sound propagation algorithms can also be used for acoustic material design optimiza- tion for synthetic scenes (Morales and Manocha, 2016). The efficiency of deep neural networks has been shown in audio/video-related tasks that are challenging for traditional methods(Virtanen et al., 2018; Gharib et al., 2018; Hinton et al., 2012; Evers et al., 2016; Sterling et al., 2018). Hershey et al. (2017) showed that it is feasible to use CNNs for large-scale audio classification problems. Many deep neural networks require a large amount of training data. Salamon and Bello (2017) used data augmentation to improve environmental sound classification. Similarly, Bryan (2020) estimates the T60 and the direct-to-reverberant ratio (DRR) from a single speech recording via augmented datasets. Tang et al. (2019b) trained CRNN models purely based on synthetic spatial IRs that generalize to real-world recordings. We strategically design an augmentation scheme to address the challenge 19 of equalization?s dependence on both IRs and speaker voice profiles, which is fully complimentary to all prior data-driven methods. Figure 3.3: The simulated and recorded frequency response in the same room at a sample rate of 44.1kHz is shown. Note that the recorded response has noticeable peaks and notches compared with the relatively flat simulated response. This is mainly caused by room equalization. Missing proper room equalization leads to discrepancies in audio quality and overall room acoustics. Acoustic simulators require a set of well-defined material properties. The material absorption coefficient is one of the most important parameters (Bork, 2000), ranging from 0 (total reflection) to 1 (total absorption). 
A material?s acoustic properties are correlated with its visual appearance to some extent. For example, a carpet is usually more absorptive for sound than a glass is. This audio-visual correlation enables rough material estimation from visual cues (Schissler et al., 2017). However, despite a non- zero material recognition error, visual information alone does not accurately capture the acoustic property of materials. Prior work shows that 7D (source-listener 3D locations and time) acoustic fields in an environment can be effectively compressed into 6D time-invariant fields using only four selected scalar acoustic metrics with low 20 reconstruction errors (Raghuvanshi and Snyder, 2014c). This indicates that certain acoustic metrics can be used to guide the modeling of acoustic materials. When a reference IR is available, it is straightforward to adjust room materials to match the energy decay of the simulated IR to the reference IR (Li et al., 2018). Similarly, Ren et al. (2013) optimized linear modal analysis parameters to match the given recordings. A probabilistic damping model for audio-material reconstruction has been presented for VR applications (Sterling et al., 2019). Unlike all previous methods which require a clean IR recording for accurate estimation and optimization of boundary materials, we infer typical material parameters including T60 values and equalization from raw speech signals using a CNN-based model. Analytical gradients can significantly accelerate the optimization process. With similar optimization objectives, it was shown that additional gradient information can boost the speed by a factor of over ten times (Li et al., 2018; Schissler et al., 2017). The speed gain shown in (Li et al., 2018) is impressive, and we further improve the accuracy and speed of the formulation. More specifically, the original objective function evaluated energy decay relative to the first ray received (the direct sound if there were no obstacles). However, energy estimates can be noisy due to both the oscillatory nature of audio as well as simulator noise. Instead, we optimize the slope of the best fit line of ray energies to the desired energy decay (defined by the T60), which we found to be more robust. 3.3 Deep Acoustic Analysis: Our Algorithm In this section, we overview our proposed method for scene-aware audio rendering. We begin by providing background information, discuss how we capture room geometry, and then proceed with discussing how we estimate the frequency dependent room reverberation and equalization parameters directly from recorded speech. We follow 21 by discussing how we use the estimated acoustic parameters to perform acoustic materials optimization such that we calibrate our virtual acoustic model with real- world recordings. 3.3.1 Background To explain the motivation of our approach, we briefly elaborate on the most difficult parts of previous approaches, upon which our method improves. Previous methods require an impulse response of the environment to estimate acoustic properties (Li et al., 2018; Schissler et al., 2017). Recording an impulse response is a non-trivial task. The most reliable methods involve playing and recording Golay codes (Foster, 1986) or sine sweeps (Farina, 2000), which both play loud and intrusive audio signals. Also required are a fairly high-quality speaker and microphone with constant frequency response, small harmonic distortion and little crosstalk. 
The speaker and microphone should be acoustically separated from surfaces, i.e., they shouldn?t be placed directly on tables (else surface vibrations could contaminate the signal). Clock drift between the source and microphone must be accounted for (Bryan et al., 2010). Alternatively, balloon pops or hand claps have been proposed for easier IR estimation, but require significant post-processing and still are very obtrusive (Abel et al., 2010; Seethara- man and Tarzia, 2012). In short, correctly recording an IR is not easy, and makes it challenging to add audio in scenarios such as augmented reality, where the environ- ment is not known beforehand and estimation must be done interactively to preserve immersion. Geometric acoustics is a high-frequency approximation to the wave equation. It is a fast method, but assumes that wavelengths are small compared to objects in the scene, while ignoring pressure effects (Savioja and Svensson, 2015). It misses several important wave effects such as diffraction and room resonance. Diffraction occurs when sound paths bend around objects that are of similar size to the wavelength. 22 Resonance is a pressure effect that happens when certain wavelengths are either re- inforced or diminished by the room geometry: certain wavelengths create peaks or troughs in the frequency spectrum based on the positive or negative interference they create. We model these effects with a linear finite impulse response (FIR) equalization filter (Schafer and Oppenheim, 1989). We compute the discrete Fourier transform on the recorded IR over all frequencies, following Li et al. (2018). Instead of filter- ing directly in the frequency domain, we design a linear phase EQ filter with 32ms delay to compactly represent this filter at 7 octave bin locations. We then blindly estimate this compact representation of the frequency spectrum of the impulse re- sponse as discrete frequency gains, without specific knowledge of the input sound or room geometry. This is a challenging estimation task. Since the convolution of two signals (the IR and the input sound) is equivalent to multiplication in the frequency domain, estimating the frequency response of the IR is equivalent to estimating one multiplicative factor of a number without constraining the other. We are relying on this approach to recognize the a compact representation of the frequency response magnitude in different environments. 3.3.2 Geometry Reconstruction Given the background, we begin by first estimating the room geometry. In our exper- iments, we utilize the ARKit-based iOS app MagicPlan5 to acquire the basic room geometry. A sample reconstruction is shown in Figure 3.4. With computer vision research evolving rapidly, we believe constructing geometry proxies from video input will become even more robust and easily accessible (Zhi et al., 2019; Bloesch et al., 2018). 5https://www.magicplan.app/ 23 Figure 3.4: We use an off-the-shelf app called MagicPlan to generate geometry proxy. Input: a real-world room (left); Output: the captured 3D model of the room (right) without high-level details, which is used by the runtime geometric acoustic simulator. 3.3.3 Learning Reverberation and Equalization Figure 3.5: Network architecture for T60 and EQ prediction. Two models are trained for T60 and EQ, which have the same components except the output layers have different dimensions customized for the octave bands they use. 
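As a concrete companion to Figure 3.5, the following is a minimal Keras sketch of a model with this general shape. The filter counts, kernel sizes, and pooling factors are illustrative placeholders rather than our exact configuration, which is detailed in the subsections below.

# Hypothetical sketch of a Figure 3.5-style estimator (illustrative sizes only).
from tensorflow.keras import layers, models

def build_estimator(n_outputs: int) -> models.Model:
    """n_outputs = 7 for sub-band T60, 6 for octave EQ gains."""
    inp = layers.Input(shape=(32, 499, 1))            # log-Mel spectrogram (freq x time)
    x = inp
    pools = [(2, 2)] * 5 + [(1, 2)]                   # six conv blocks
    for filters, pool in zip((8, 16, 32, 32, 64, 64), pools):
        x = layers.Conv2D(filters, (3, 3), padding="same")(x)
        x = layers.ReLU()(x)
        x = layers.MaxPooling2D(pool_size=pool)(x)    # shrink time/frequency axes
        x = layers.BatchNormalization()(x)
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    out = layers.Dense(n_outputs)(x)                  # linear outputs (regression)
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="mse")
    return model

t60_model = build_estimator(7)   # 7 octave-band T60 values
eq_model = build_estimator(6)    # 6 octave-band EQ gains (relative to 1 kHz)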
We use a convolutional neural network (Figure 3.5) to predict room equalization and reverberation time (T60) directly from a speech recording. Training requires a large number of speech recordings with known T60 and room equalization. The standard practice is to generate speech recordings from known real-world or synthetic IRs (Kim et al., 2017; Doulaty et al., 2017). Unfortunately, large-scale IR datasets do not currently exist due to the difficulty of IR measurement; most publicly available IR datasets have fewer than 1000 IR recordings. Synthetic IRs are easy to obtain and can be used, but they lack wave-based effects and suffer from other simulation deficiencies. Recent work has addressed this issue by combining real-world IR measurements with augmentation to increase the diversity of existing real-world datasets (Bryan, 2020). This work, however, only addresses T60 and DRR augmentation, and lacks a method to augment the frequency equalization of existing IRs. To address this, we propose an augmentation method in this section. First, however, we describe our neural network method for estimating both T60 and equalization.

Octave-Based Prediction

Most prior work takes the full frequency range as input for prediction. For example, one closely related work (Bryan, 2020) only predicts one T60 value for the entire frequency range (full-band). However, sound propagates and interacts with materials differently at different frequencies. To this end, we define our learning targets over several octaves. Specifically, we calculate T60 at 7 sub-bands centered at {125, 250, 500, 1000, 2000, 4000, 8000} Hz. We found prediction of T60 at the 62.5Hz band to be unreliable due to low SNR. During material optimization, we set the 62.5Hz T60 value to the 125Hz one. Our frequency equalization estimation is done at 6 octave bands centered at {62.5, 125, 250, 500, 2000, 4000} Hz. Note that we compute equalization relative to the 1kHz band, so we do not estimate it. When applying our equalization filter, we set bands greater than or equal to 8kHz to -50dB. Given our target sampling rate of 16kHz and the limited content of speech in higher octaves, this did not affect our estimation.

Data Augmentation

We use the following datasets as the basis for our training and augmentation:
• ACE Challenge (Eaton et al., 2016): 70 IRs and noise audio;
• MIT IR Survey (Traer and McDermott, 2016): 271 IRs;
• DAPS dataset (Mysore, 2014): 4.5 hours of speech from 20 speakers (10 male and 10 female).

Table 3.1: Dataset composition. The training set and validation set are based on synthetic IRs and the test set is based on real IRs to guarantee model generalization. Clean speech files are also divided so that the speakers ("f1" for female speaker 1; "m10" for male speaker 10) in each dataset partition are different, to avoid the model learning a speaker's voice signature. Audio files are generated at a sample rate of 16kHz, which is sufficient to cover the frequency range of the human voice.

Partition                    | Noise       | Clean Speech   | IR
Training set (size: 56.5k)   | ACE ambient | f5-f10, m5-m10 | Synthetic IR (size: 4.5k)
Validation set (size: 19.5k) | ACE ambient | f3, f4, m3, m4 | Synthetic IR (size: 1k)
Test set (size: 18.5k)       | ACE ambient | f1, f2, m1, m2 | MIT survey IR (size: 271)

First, we use the method in Bryan (2020) to expand the T60 and direct-to-reverberant ratio (DRR) range of the 70 ACE IRs, resulting in 7000 synthetic IRs with a balanced T60 distribution between 0.1 and 1.5 seconds.
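To make the octave-band targets concrete, the following sketch illustrates one common way of labeling an IR with sub-band T60 values: band-pass the IR around each octave center, compute the Schroeder backward-integrated energy decay curve, and extrapolate the slope of the -5 to -25 dB region to a 60 dB decay. This is a generic illustration, not our exact labeling code; as noted below, we follow Karjalainen et al. (2001) for real IRs with a measurable noise floor.

# Generic sketch: octave-band T60 labels from an IR via Schroeder integration.
# (Illustrative only; a noise-floor-aware estimator is preferable for real IRs.)
import numpy as np
from scipy.signal import butter, sosfiltfilt

def subband_t60(ir, fs=16000, centers=(125, 250, 500, 1000, 2000, 4000, 8000)):
    t60s = []
    for fc in centers:
        lo, hi = fc / np.sqrt(2), min(fc * np.sqrt(2), 0.45 * fs)  # octave band edges
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, ir)
        edc = np.cumsum(band[::-1] ** 2)[::-1]             # Schroeder backward integral
        edc_db = 10 * np.log10(edc / edc.max() + 1e-12)
        t = np.arange(len(ir)) / fs
        m = (edc_db <= -5) & (edc_db >= -25)               # fit the -5..-25 dB region
        slope, _ = np.polyfit(t[m], edc_db[m], 1)          # decay in dB per second
        t60s.append(-60.0 / slope)
    return np.array(t60s)

# Example: a synthetic exponentially decaying "IR" with roughly 1 s reverberation.
ir = np.random.randn(16000) * np.exp(-6.9 * np.arange(16000) / 16000)
print(subband_t60(ir))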
Ground-truth T60 values can be computed directly from IRs in a variety of ways. We follow the methodology of Karjalainen et al. (2001) when computing the T60 from real IRs with a measurable noise floor. This method was found to be the most robust estimator when computing the T60 from real IRs in recent work (Eaton et al., 2016). The final composition of our dataset is listed in Table 3.1.

While we know the common range of real-world T60 values, there is limited literature giving statistics about room equalization. Therefore, we analyzed the equalization range and distribution of the 271 MIT survey IRs as guidance for data augmentation. The equalization of frequency bands is computed relative to the 1kHz octave. This is a common practice (Välimäki and Reiss, 2016), unless expensive equipment is used to obtain calibrated acoustic pressure readings.

For our equalization augmentation procedure, we first fit a normal distribution (mean and standard deviation) to each sub-band amplitude of the MIT IR dataset, as shown in Figure 3.6. Given this set of parametric model estimates, we iterate through our training and validation IRs. For each IR, we extract its original EQ.

(a) MIT IR survey equalization distribution by sub-band. (b) Original synthetic IR equalization. (c) Target (MIT) IR equalization. (d) Augmented synthetic IR equalization.
Figure 3.6: Equalization augmentation. The 1000Hz sub-band is used as reference and has unit gain. We fit normal distributions (red bell curves shown in (a)) to describe the EQ gains of the MIT IRs. We then apply EQs sampled from these distributions to our training set distribution in (b). We observe that the augmented EQ distribution in (d) becomes more similar to the target distribution in (c).

We then randomly sample a target EQ according to our fitted models (independently per frequency band), calculate the distance between the source and target EQ, and then design an FIR filter to compensate for the difference. For simplicity, we use the window method for FIR filter design (Smith III, 2008). Note that we do not require a perfect filter design method; we simply need a procedure to increase the diversity of our data. Also note that we intentionally sample our augmented IRs to have a larger variance than the recorded IRs to further increase the variety of our training data.

We compute the log Mel-frequency spectrogram for each four-second audio clip, which is commonly used for speech-related tasks (Chen et al., 2018; Eskimez et al., 2018). We use a Hann window of size 256 with 50% overlap during computation of the short-time Fourier transform (STFT) for our 16kHz samples. Then we use 32 Mel-scale bands and area normalization for Mel-frequency warping (Stevens et al., 1937). The spectrogram power is computed in decibels. This extraction process yields a 32 × 499 (frequency × time) matrix feature representation. All feature matrices are normalized by the mean and standard deviation of the training set.

Network Architecture and Training

We propose using a network architecture that differs only in the final layer for T60 and room equalization estimation. Six 2D convolutional layers are used sequentially to reduce both the time and frequency resolution of the features until they have approximately the same dimension. Each conv layer is immediately followed by a rectified linear unit (ReLU) (Nair and Hinton, 2010) activation function, 2D max pooling, and batch normalization.
The output of the conv layers is flattened to a 1D vector and connected to a fully connected layer of 64 units with a dropout rate of 50% to lower the risk of overfitting. The final output layer has 7 fully connected units to predict a vector of length 7 for T60, or 6 fully connected units to predict a vector of length 6 for frequency equalization. This network architecture is inspired by Bryan (2020), where it was used to predict full-band T60. We updated the output layer to predict the more challenging sub-band T60 and also discovered that the same architecture predicts equalization well. For training the network, we use the mean square error (MSE) loss with the ADAM optimizer (Kingma and Ba, 2014) in Keras (Chollet et al., 2015). The maximum number of epochs is 500 with an early stopping mechanism. We choose the model with the lowest validation error for further evaluation on the test set. Our model architecture is shown in Figure 3.5.

3.3.4 Acoustic Material Optimization

Our goal is to optimize the absorption coefficients of a set of room materials, at the same octave bands as T60, so that the sub-band T60 of the simulated sound matches the target predicted in § 3.3.3.

Ray Energy. We borrow notation from Li et al. (2018). Briefly, a geometric acoustic simulator generates a set of sound paths, each of which carries an amount of sound energy. Each material m_i in a scene is described by a frequency-dependent absorption coefficient \beta_i. A path leaving the source is reflected by a set of materials before it reaches the listener. The energy fraction that is received by the listener along path j is

e_j = \alpha_j \prod_{k=1}^{N_j} \beta_{m_k},    (3.1)

where m_k is the material the path intersects on the k-th bounce, N_j is the number of surface reflections for path j, and \alpha_j accounts for air absorption (dependent on the total length of the path). Our goal is to optimize the set of absorption coefficients \beta_i to match the energy distribution of the paths e_j to that of the environment's IR. Again, similar to (Li et al., 2018), we assume the energy decrease of the IR follows an exponential curve, which is a linear decay in dB space. The slope of this decay line is \hat{m} = -60 / T_{60}.

Objective Function. We propose the following objective function:

J(\beta) = (m - \hat{m})^2,    (3.2)

where m is the slope of the best-fit line of the ray energies on a decibel scale:

m = \frac{n \sum_{i=0}^{n} t_i y_i - \sum_{i=0}^{n} t_i \sum_{i=0}^{n} y_i}{n \sum_{i=0}^{n} t_i^2 - \left(\sum_{i=0}^{n} t_i\right)^2},    (3.3)

with y_i = 10 \log_{10}(e_i), which we found to be more robust than previous methods. Specifically, in comparison with Equation (3) in Li et al. (2018), we see that they try to match the slope of the energies relative to e_0, forcing e_0 to be at the origin on a dB scale. However, we only care about the energy decrease, and not the absolute scale of the values from the simulator. We found that allowing the absolute scale to move and only optimizing the slope of the best-fit line produced a better match to the target T60. We minimize J using the L-BFGS-B algorithm (Zhu et al., 1997). The gradient of J is given by

\frac{\partial J}{\partial \beta_j} = 2(m - \hat{m}) \sum_{i=0}^{n} \frac{n t_i - \sum_{k=0}^{n} t_k}{n \sum_{k=0}^{n} t_k^2 - \left(\sum_{k=0}^{n} t_k\right)^2} \cdot \frac{10}{\ln(10)\, e_i} \cdot \frac{\partial e_i}{\partial \beta_j}.    (3.4)
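As an illustration of Equations (3.2)-(3.4), the sketch below evaluates the slope-based objective for a set of ray energies and minimizes it with L-BFGS-B. Here the geometric simulator is replaced by a hypothetical stand-in function, simulate_ray_energies, and the analytical gradient is omitted in favor of SciPy's numerical differentiation, so this is a simplified stand-in for our implementation rather than the implementation itself.

# Sketch of the slope-matching objective (Eqs. 3.2-3.3), minimized with L-BFGS-B.
# `simulate_ray_energies` is a hypothetical stand-in for the geometric simulator.
import numpy as np
from scipy.optimize import minimize

def simulate_ray_energies(absorption, t):
    """Placeholder: per-ray received energies e_j at arrival times t.
    Toy model in which the decay rate grows with the mean absorption."""
    return np.exp(-(5.0 + 40.0 * absorption.mean()) * t)

def slope_of_best_fit(t, e):
    y = 10.0 * np.log10(np.maximum(e, 1e-12))               # ray energies in dB
    n = len(t)
    return (n * np.sum(t * y) - np.sum(t) * np.sum(y)) / \
           (n * np.sum(t ** 2) - np.sum(t) ** 2)             # Eq. (3.3)

def objective(absorption, t, target_t60):
    m_hat = -60.0 / target_t60                               # target decay slope (dB/s)
    m = slope_of_best_fit(t, simulate_ray_energies(absorption, t))
    return (m - m_hat) ** 2                                  # Eq. (3.2)

t = np.linspace(0.0, 1.0, 2000)                              # ray arrival times (s)
beta0 = np.full(4, 0.3)                                      # initial absorption guess
res = minimize(objective, beta0, args=(t, 0.6),
               method="L-BFGS-B", bounds=[(0.01, 0.99)] * 4)
print(res.x)                                                 # optimized coefficients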
3.4 Analysis and Applications

3.4.1 Analysis

Speed. We implement our system on an Intel Xeon(R) CPU @3.60GHz and an NVIDIA GTX 1080 Ti GPU. Our neural network inference runs at 222 fps on 4-second sliding windows of audio due to the compact design (only 18K trainable parameters). Optimization runs twice as fast with our improved objective function. The sound rendering is based on the real-time geometric bidirectional sound path tracing from Cao et al. (2017).

Sub-band T60 prediction. We first evaluate our T60 blind estimation model and achieve a mean absolute error (MAE) of 0.23s on the test set (MIT IRs). While the 271 IRs in the test set have a mean T60 of 0.49s with a standard deviation (STD) of 0.85s at the 125Hz sub-band, the highest sub-band (8000Hz) only has a mean T60 of 0.33s with a STD of 0.24s, which reflects a narrow subset of our T60 augmentation range. We also notice that the validation MAE on the ACE IRs is 0.12s, which indicates that our validation set and the test set still come from different distributions. Another error source is the inaccurate labeling of low-frequency sub-band T60, as shown in Figure 3.7, but we do not filter any outliers in the test set. In addition, our data is intended to cover frequency ranges up to 8000Hz, but human speech has less energy in the high-frequency range (Titze et al., 2017), which results in low signal energy for these sub-bands and makes learning more difficult.

(a) 125Hz sub-band. (b) 8000Hz sub-band.
Figure 3.7: Evaluating T60 from the signal envelope on low and high frequency bands of the same IR. Note that the SNR in the low frequency band is lower than in the high frequency band. This makes T60 evaluation for low frequency bands less reliable, which partly explains the larger test error in low-frequency sub-bands.

Material Optimization. When we optimize the room material absorption coefficients according to the predicted T60 of a room, our optimizer efficiently modifies the simulated energy curve to the desired energy decay rate (T60), as shown in Figure 3.8. We also try fixing the room configuration, setting the target T60 to values uniformly distributed between 0.2s and 2.5s, and evaluating the T60 of the simulated IRs. The relationship between the target and output T60 is shown in Figure 3.9, in which our simulation closely matches the target, demonstrating that our optimization is able to match a wide range of T60 values.

To test the real-world performance of our acoustic matching, we recorded ground truth IRs in 5 benchmark scenes, then used the method in Li et al. (2018), which requires a reference IR, and our method, which does not require an IR, for comparison.

Figure 3.8: Simulated energy curves before and after optimization (with the target slope shown).

Benchmark scenes and results are summarized in Table 3.2. We apply the EQ filter to the simulated IR as a last step. Overall, we obtain a prediction MAE of 3.42dB on our test set, whereas before augmentation the MAE was 4.72dB under the same training conditions, which confirms the effectiveness of our EQ augmentation. The perceptual impact of the EQ filter step is evaluated in §3.5.

3.4.2 Comparisons

We compare our work with two related projects, Schissler et al. (2017) and Kim et al. (2019), where the high-level goal is similar to ours but the specific approach is different. Material optimization is a key step in our method and in Schissler et al. (2017). One major difference is that we additionally compensate wave effects explicitly with an equalization filter. Figure 3.10 shows the difference in spectrograms when the high-frequency equalization is not properly accounted for. Our method better replicates the rapid decay in the high-frequency range. For audio comparison, please refer to our supplemental video (https://gamma.umd.edu/pro/sound/sceneaware).

Figure 3.9: Stress test of our optimizer. We uniformly sample T60 between 0.2s and 2.5s and set it as the target. The ideal input/output relationship is a straight line passing through the origin with slope 1. Our optimization results match the ideal line much better than the prior optimization method.
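Comparisons such as the one in Figure 3.10 below can be reproduced with simple dB-scale spectrograms of the recorded and rendered clips; the sketch below uses librosa and matplotlib purely as one convenient tool choice, and the file names are placeholders.

# Sketch: dB-scale spectrograms of recorded vs. rendered audio for visual comparison.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, path, title in zip(axes,
                           ["recorded.wav", "rendered.wav"],     # placeholder paths
                           ["Recorded", "Rendered (ours)"]):
    y, sr = librosa.load(path, sr=16000)
    S = librosa.stft(y, n_fft=512, hop_length=128)
    S_db = librosa.amplitude_to_db(np.abs(S), ref=np.max)        # magnitude in dB
    librosa.display.specshow(S_db, sr=sr, hop_length=128,
                             x_axis="time", y_axis="hz", ax=ax)
    ax.set_title(title)
plt.tight_layout()
plt.show()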
Figure 3.10: We show the effects of our equalization filtering on audio spectrograms, compared with Schissler et al. (2017). In the highlighted region, we are able to better reproduce the fast decay in the high-frequency range, closely matching the recorded sound.

We also want to highlight the importance of optimizing T60. In (Kim et al., 2019), a CNN is used for object-based material classification. Default materials are assigned to a limited set of objects. Without optimizing specifically for the audio objective, the resulting sound might not blend in seamlessly with the existing audio. In Figure 3.11, we show that our method produces audio that matches the decay tail better, whereas (Kim et al., 2019) produces a longer reverb tail than the recorded ground truth.

Table 3.2: Benchmark results for acoustic matching. These real-world rooms are of different sizes and shapes, and contain a wide variety of acoustic materials such as brick, carpet, glass, metal, wood, and plastic, which make the problem acoustically challenging. We compare our method with Li et al. (2018). Our method does not require a reference IR and still obtains similar T60 and EQ errors in most scenes compared with their method. We also achieve faster optimization speed. Note that the input audio to our method is already noisy and reverberant, whereas Li et al. (2018) requires a clean IR recording. Ground-truth IRs, Li et al. (2018) IRs, and our IRs (plotted on the same time and amplitude dB scales) accompany each scene in the original table.

Benchmark scene                 | 1                | 2              | 3          | 4
Size (m^3)                      | 1100 (irregular) | 1428 (12x17x7) | 72 (4x6x3) | 352 (11x8x4)
# Main planes                   | 6                | 6              | 11         | 6
Li et al. (2018): Opt. time (s) | 29               | 43             | 71         | 46
Li et al. (2018): T60 error (s) | 0.11             | 0.23           | 0.02       | 0.10
Li et al. (2018): EQ error (dB) | 1.50             | 2.97           | 3.61       | 7.55
Ours: Opt. time (s)             | 13               | 13             | 31         | 20
Ours: T60 error (s)             | 0.14             | 0.12           | 0.04       | 0.24
Ours: EQ error (dB)             | 2.26             | 3.86           | 3.46       | 4.62

3.4.3 Applications

Acoustic Matching in Videos. Given a recorded video in an acoustic environment, our method can analyze the room acoustic properties from the noisy, reverberant recorded audio in the video. The room geometry can be estimated from the video (Bloesch et al., 2018) if the user has no access to the room for measurement. During post-processing, we can simulate sound that is similar to the recorded sound in the room. Moreover, virtual characters or speakers, such as the ones shown in Figure 3.1, can be added to the video, generating sound that is consistent with the real-world environment.

Figure 3.11: We demonstrate the importance of T60 optimization on the audio amplitude waveform. Our method optimizes the material parameters based on the input audio and matches the tail shape and decay amplitude of the recorded sound, whereas the visual-based object materials from Kim et al. (2019) fail to compensate for the audio effects.

Real-time Immersive Augmented Reality Audio. Our method works in a real-time manner and can be integrated into modern AR systems. AR devices are capable of capturing real-world geometry and can stream audio input to our pipeline. At interactive rates, we can optimize and update the material properties as well as the room EQ filter. Our method is not hardware-dependent and can be used with any AR device (which provides geometry and audio) to enable a more immersive listening experience.
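In these applications, the final rendering step amounts to convolving a dry (anechoic) source signal with the simulated IR, to which the estimated EQ filter has already been applied, and mixing the result into the existing soundtrack. A minimal sketch, with placeholder file names and a simple fixed gain for mixing, is given below.

# Sketch: add a virtual source to existing audio by convolving a dry signal
# with the simulated (EQ-filtered) IR. File names and the mix gain are placeholders.
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

dry, fs = sf.read("virtual_source_dry.wav")          # anechoic speech/effect
ir, fs_ir = sf.read("simulated_ir_with_eq.wav")      # simulated IR after EQ filtering
assert fs == fs_ir, "resample one signal so the sample rates match"

dry = dry.mean(axis=1) if dry.ndim > 1 else dry      # force mono for simplicity
wet = fftconvolve(dry, ir)[: len(dry)]               # reverberant virtual source
wet /= np.max(np.abs(wet)) + 1e-9                    # normalize

scene, _ = sf.read("original_video_audio.wav")
scene = scene.mean(axis=1) if scene.ndim > 1 else scene
n = min(len(scene), len(wet))
mix = scene[:n] + 0.5 * wet[:n]                      # simple additive mix
sf.write("augmented_audio.wav", mix, fs)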
35 Real-world Computer-Aided Acoustic Design Computer-aided design (CAD) software has been used for designing architecture acoustics, usually before construc- tion is done, in a predictive manner (Pelzer et al., 2014; Kleiner et al., 1990). But when given an existing real-world environment, it becomes challenging for traditional CAD software to adapt to current settings because acoustic measurement can be tedious and error-prone. By using our method, room materials and EQ properties can be estimated from simple input, and can be further fed to other acoustic design applications in order to improve the room acoustics such as material replacement, source and listener placement (Morales et al., 2019), and soundproofing setup. 3.5 Perceptual Evaluation We perceptually evaluated our approach using a critical listening test. For this test, we studied the perceptual similarity of a reference speech recording with speech record- ings convolved with simulated impulse responses. We used the same speech content for the reference and all stimuli under testing and evaluated how well we can recon- struct the same identical speech content in a given acoustic scene. This is useful for understanding the absolute performance of our approach compared to the ground truth results. 3.5.1 Design and Procedure For our test, we adopted the multiple stimulus with hidden reference and anchor (MUSHRA) methodology from the ITU-R BS.1534-3 recommendation (Series, 2014). MUSHRA provides a protocol for the subjective assessment of intermediate quality level of audio systems (Series, 2014) and has been adopted for a wide variety of audio processing tasks such as audio coding, source separation, and speech synthesis evaluation (Schoeffler et al., 2015; Cartwright et al., 2016). 36 Figure 3.12: A screenshot of MUSHRA-like web interface used in our user study. The design is from Cartwright et al. (2016). In a single MUSHRA trial, participants are presented with a high-quality reference signal and asked to compare the quality (or similarity) of three to twelve stimuli on a 0- 100 point scale using a set of vertical sliders as shown in Figure 3.12. The stimuli must contain a hidden reference (identical to the explicit reference), two anchor conditions ? low-quality and high-quality, and any additional conditions under study (maximum of nine). The hidden reference and anchors are used to help the participants calibrate their ratings relative to one another, as well as to filter out inaccurate assessors in a post-screening process. MUSHRA tests serve a similar purpose to mean opinion (MOS) score tests (Series, 2016), but requires fewer participants to obtain results that are statistically significant. We performed our studies using Amazon Mechanical Turk (AMT), resulting in a MUSHRA-like protocol (Cartwright et al., 2016). In recent years, web-based MUSHRA-like tests have become a standard methodology and have been shown to perform equivalently to full, in-person tests(Schoeffler et al., 2015; Cartwright et al., 37 2016). 3.5.2 Participants We recruited 261 participants on AMT to rate one or more of our five acoustic scenes under testing following the approach proposed by Cartwright et al. (2016). To increase the quality of the evaluation, we pre-screened the participants for our tests. To do this, we first required that all participants have a minimum number of 1000 approved Human Intelligence Task (HITs) assignments and have had at least 97 percent of all assignments approved. 
Second, all participants must pass a hearing screening test to verify they are listening over devices with adequate frequency response. This was performed by asking participants to listen to two separate eight second recordings consisting of a 55Hz tone, a 10kHz tone and zero to six tones of random frequency. If any user failed to count the number of tones correctly after two or more attempts, they were not allowed to proceed. 3.5.3 Training After having passed our hearing screening test, each user was presented with a one page training test. For this, the participant was provided two sets of recordings. The first set of training recordings consisted of three recordings: a reference, a low-quality anchor, and a high-quality anchor. The second set of training recordings consisted of the full set of recordings used for the given MUSHRA trail, albeit without the vertical sliders present. To proceed to the actual test, participants were required to listen to each recording in full. In total, the training time was estimated to take approximately two minutes. 38 3.5.4 Stimuli For our test conditions, we simulated five different acoustic scenes. For each scene, a separate MUSHRA trial was created. In AMT language, each scene was presented as a separate HIT per user. For each MUSHRA trial or HIT, we tested the follow- ing stimuli: hidden reference, low-quality anchor, mid-quality anchor, baseline T60, Baseline T60+EQ, proposed T60, and proposed T60+EQ. As noted by the ITU-R BS.1534-3 specification (Series, 2014), both the reference and anchors will have a significant effect on the test results, must resemble the artifacts from the systems, and must be designed carefully. For our work, we set the hidden reference as an identical copy of the explicit reference (required), which consisted of speech convolved with the ground truth IR for each acoustic scene. Then, we set the low-quality anchor to be completely anechoic, non-reverberated speech. We set the mid-quality anchor to be speech convolved with an impulse response with a 0.5 second T60 (typical conference room) across frequencies, and uniform equalization. For our baseline comparison, we included two baseline approaches following previ- ous work (Li et al., 2018). More specifically, our Baseline T60 leverages the geometric acoustics method proposed by Cao et al. (2017) as well as the materials analysis cal- ibration method of Li et al. (2018). Our Baseline T60+EQ extends this and includes the additional frequency equalization analysis (Li et al., 2018). These two baselines directly correspond to the proposed materials optimization (Proposed T60) and equal- ization prediction subsystems (Proposed T60+EQ) in our work. The key difference is that we blindly estimate the parameters necessary for both steps blindly from speech. 3.5.5 User Study Results When we analyzed the results of our listening test, we post-filtered the results follow- ing the ITU-R BS.1534-3 specification (Series, 2014). More specifically, we excluded assessors if they 39 ? rated the hidden reference condition for > 15% of the test items lower than a score of 90 ? or, rated the mid-range (or low-range) anchor for more than 15% of the test items higher than a score of 90. Using this post-filtering, we reduce our collected data down to 70 unique participants and 108 unique test trials, spread across our five acoustic scene conditions. Figure 3.13: Box plot results for our listening test. Participants were asked to rate how similar each recording was to the explicit reference. 
All recordings have the same content, but different acoustic conditions. Note our proposed T60 and T60+EQ are both better than the Mid-Anchor by a statistically significant amount (approx10 rating points on a 100 point scale). We show the box plots of our results in Figure 3.13. The median ratings for each stimulus include: Baseline T60 (62.0), Baseline T60+EQ (85.0), Low-Anchor (40.5), Mid-Anchor (59.0), Proposed T60 (61.5), Proposed T60+EQ (71.0), Hidden Reference (99.5). As seen, the Low-Anchor and Hidden Reference outline the range of user 40 scores for our test. In terms of baseline approaches, the Proposed T60+EQ method achieves the highest overall listening test performance. We then see that our proposed T60 method and T60+EQ method outperform the mid-anchor. Our proposed T60 method is comparable to the baseline T60 method, and our proposed T60+EQ method outperforms our proposed T60-only method. To understand the statistical significance, we perform paired t-tests between stim- uli pairs. The p-value between Baseline T60 and Proposed T60 is 0.09, suggesting that we cannot reject the null hypothesis of identical average scores between prior work (which uses manually measured IRs) and our work. The p-value of Baseline T60+EQ and Proposed T60+EQ, however, is 1.85e-6, suggesting our EQ method has a statisti- cally different average (lower). The p-value of Proposed T60 and Proposed T60+EQ, however, is 0.004, suggesting our EQ method does significantly improve performance compared to our proposed T60-only subsystem. We also note that the p-value of the Mid-Anchor and Proposed T60+EQ is 0.0002, suggesting our method is statistically different (higher performing) on average than simply using an average room T60 and uniform equalization. In summary, we see that our proposed T60 computation method is comparable to prior work, albeit we perform such estimation directly from a short speech recording rather than relying on intrusive IR measurement schemes. Further, our proposed complete system (Proposed T60+EQ) outperforms both the mid-anchor and proposed T60 system alone, demonstrating the value of EQ estimation. Finally, we note our proposed T60+EQ method does not perform as well as prior work, largely due to the EQ estimation subsystem. This result, however, is expected as prior work requires manual IR measurements, which result in perfect EQ estimation. This is in contrast to our work, which directly estimates both T60 and EQ parameters from recorded speech, enabling a drastically improved interaction paradigm for matching acoustics in several applications. 41 3.6 Summary We present a new pipeline to estimate, optimize, and render immersive audio in video and mixed reality applications. We present novel algorithms to estimate two important acoustic environment characteristics ? the frequency-dependent reverber- ation time and equalization filter of a room. Our multi-band octave-based prediction model works in tandem with our equalization augmentation and provides robust in- put to our improved materials optimization algorithm. Our user study validates the perceptual importance of our method. To the best of our knowledge, our method is the first method to predict IR equalization from raw speech data and validate its accuracy. Limitations and Future Work To achieve a perfect acoustic match, one would expect the real-world validation error to be zero. In reality, zero error is only a suf- ficient but not necessary condition. 
In our evaluation tests, we observe that small validation errors still allow for plausible acoustic matching. While reducing the pre- diction error is an important direction, it is also useful to investigate the perceptual error threshold for acoustic matching for different tasks or applications. Moreover, temporal prediction coherence is not in our evaluation process. This implies that given a sliding windows of audio recordings, our model might predict temporally in- coherent T60 values. One interesting problem is to utilize this coherence to improve the prediction accuracy and can be an interesting future direction. Modeling real-world characteristics in simulation is a non-trivial task ? as in pre- vious work along this line, our simulator does not fully recreate the real world in terms of precise details. For example, we did not consider the speaker or microphone response curve in our simulation. In addition, sound sources are modeled as omni- directional sources (Cao et al., 2017), where real sources exhibit certain directional patterns. It remains an open research challenge to perfectly replicate and simulate 42 our real world in a simulator. Like all data-driven methods, our learned model performs best on the same kind of data on which it was trained. Augmentation is useful because it generalizes the existing dataset so that the learned model can extrapolate to unseen data. However, defining the range of augmentation is not straightforward. We set the MIT IR dataset as the baseline for our augmentation process. In certain cases, this assumption might not generalize well to estimate the extreme room acoustics. We need to design bet- ter and more universal augmentation training algorithms. Our method focused on estimation from speech signals, due to their pervasiveness and importance. It would be useful to explore how well the estimation could work on other audio domains, especially when interested in frequency ranges outside typical human speech. This could further increase the usefulness of our method, e.g., if we could estimate acoustic properties from ambient/HVAC noise instead of requiring a speech signal. 43 Chapter 4 Fast Learning-Based Acoustic Scattering1 Figure 4.1: We show the dynamic scenes with various moving objects that are used to evaluate our hybrid sound propagation algorithm. We compute the acoustic scattered fields of each object using a neural network and couple them with interactive ray tracing to generate diffraction and occlusion effects. Our approach can generate plausible acoustic effects in dynamic scenes in a few milliseconds and we demonstrate its benefits for sound rendering in virtual environments. 4.1 Introduction Interactive sound propagation and rendering are increasingly used to generate plau- sible sounds that can improve a user?s sense of presence and immersion in virtual environments (Larsson et al., 2002; Liu and Manocha, 2020). Recent advances in ge- 1The work in this chapter has been published in Tang et al. (2021) 44 ometric and wave-based simulation methods have lead to integration of these methods into current games and virtual reality (VR) applications to generate plausible acous- tic effects, including Project Acoustics (Mic, 2019), Oculus Spatializer (Ocu, 2019), and Steam Audio (Ste, 2018). The underlying propagation algorithms are based on using reverberation filters (Valimaki et al., 2012), ray tracing (Schissler et al., 2014; Schissler and Manocha, 2018), or precomputed wave-based acoustics (Raghuvanshi and Snyder, 2014b). 
A key challenge in interactive sound rendering is handling dynamic scenes that are frequently used in games and VR applications. Not only can the objects undergo large motion or deformation, but their topologies may also change. In addition to specular and diffuse effects, it is also important to simulate complex diffracted scatter- ing, occlusions, and inter-reflections that are perceptible (James et al., 2006; Pulkki and Svensson, 2019; Raghuvanshi and Snyder, 2014b). Prior geometric methods are accurate in terms of simulating high-frequency effects and can be augmented with approximate edge diffraction methods that may work well in certain cases (Tsingos et al., 2001; Schissler et al., 2014), though their behavior can be erratic (Rungta et al., 2016). On the other hand, wave-based precomputation methods can accurately sim- ulate these effects, but are limited to static scenes (Raghuvanshi and Snyder, 2014b, 2018). Some hybrid methods are limited to interactive dynamic scenes with well- separated rigid objects (Rungta et al., 2018). Our goal is to design similar hybrid methods that can overcome these restrictions and can generate diffraction and occlu- sion effects that translate into good perceptual differentiation (Rungta et al., 2016). Many recent works use machine learning techniques for audio processing, including recovering acoustic parameters of real-world scenes from recordings (Eaton et al., 2016; Genovese et al., 2019; Tsokaktsidis et al., 2019). Furthermore, machine learning methods have been used to approximate diffraction scattering and occlusion effects from rectangular plate objects (Pulkki and Svensson, 2019) and frequency-dependent 45 loudness fields for 2D convex shapes (Fan et al., 2020a). These results are promising and have motivated us to develop good learning based methods for more general 3D objects. Main Results: We present a novel approach to approximate the acoustic scattering field of an object in 3D using neural networks for interactive sound propagation in dynamic scenes. Our approach makes no assumption about the motion or topology of the objects. We exploit properties of the acoustic scattering field of objects for lower frequencies and use neural networks to learn this field from geometric representations of the objects. Given an object in 3D, we use the neural network to estimate the scattered field at runtime, which is used to compute the propagation paths when sound waves interact with objects in the scene. The radial part of the acoustic scattering field is estimated using geometric ray tracing, along with specular and diffuse reflections. Some of the novel components of our work include: ? Learning acoustic scattering fields: We use techniques based on geometric deep learning to approximate the angular component of acoustic wave propa- gation in the wave-field. Our neural network takes the point cloud as the input and outputs the spherical harmonic coefficients that represent the acoustic scat- tering field. We compare the accuracy of our learning method with an exact BEM solver, and the error on new, unseen objects (as compared to training data). Our empirical results are promising and we observe average normalized reproduction error(Lilis et al., 2010; Betlehem and Abhayapala, 2005) of 8.8% in the pressure fields. ? 
Interactive wave-geometric sound propagation: We present a hybrid propagation algorithm that uses a neural network-based scattering field repre- sentation along with ray tracing to efficiently generate specular, diffuse, diffrac- tion, and occlusion effects at interactive rates. 46 ? Plausible sound rendering for dynamic scenes: We present the first inter- active approach for plausible sound rendering in dynamic scenes with diffraction modeling and occlusion effects. As the objects deform or change topology, we compute a new spherical harmonic representation using the neural network. Compared with prior interactive methods, we can handle unseen objects at real-time, without using precomputed transfer functions for each object. ? Perceptual evaluation: We perform a user study to validate the perceptual benefits of our method. Our propagation algorithm generates more smooth and realistic sound and has increased perceptual differentiation over prior methods used for dynamic scenes (Schissler and Manocha, 2017; Rungta et al., 2018). We demonstrate the performance in dynamic scenes with multiple moving objects and changing topologies. The additional runtime overhead of estimating the scattering field from neural networks is less than 1ms per object on a NVIDIA GeForce RTX 2080 Ti GPU. The overall running time of sound propagation is governed by the underlying ray tracing system and takes few milliseconds per frame on multi-core desktop PC. We also evaluate the accuracy of acoustic scattering fields, as shown in Figure 4.7. 4.2 Related Work 4.2.1 Interactive Sound Rendering in Dynamic Scenes At a broad level, techniques for dynamic scenes can be classified into reverberation filters, geometric and wave-based methods, and hybrid combinations. The simplest and lowest-cost algorithms are based on artificial reverberators (Valimaki et al., 2012), which simulate the decay of sound in rooms. These filters are designed based on different parameters and are either specified by an artist or computed using scene 47 characteristics (Tsingos, 2009). They can handle dynamic scenes but assume that the reverberant sound field is diffuse, making them unable to generate directional reverberation or time-varying effects. Many interactive techniques based on geometric acoustics and ray tracing have been proposed for dynamic scenes (Vorl?nder, 1989; Taylor et al., 2012a; Schissler and Manocha, 2017). They use spatial data structures along with multiple cores on commodity processors and caching techniques to achieve higher performance. Fur- thermore, hybrid combinations of ray tracing and reverberation filters (Schissler and Manocha, 2018) have been proposed for low-power, mobile devices. In practice, these methods can handle scenes with a large number of moving objects, along with sources and the listener, but can?t model diffraction or occlusion effects well. Many precomputation-based wave acoustics techniques tend to compute a global representation of the acoustic pressure field. They are limited to static scenes, but can handle real-time movement of both sources and the listener (Raghuvanshi et al., 2010; Mehra et al., 2015). These representations are computed based on uniform or adaptive sampling techniques (Chaitanya et al., 2019). Overall, the acoustic wave field is a complex high-dimensional function and many efficient techniques have been designed to encode this field (Raghuvanshi and Snyder, 2014b, 2018) within 100MB and with a small runtime overhead. 
A hybrid combination of BEM and ray tracing has been presented for dynamic scenes with well-separated rigid objects (Rungta et al., 2018). The recent Planeverb system (Rosen et al., 2020) is able to perform 2D wave simulation at interactive rates and calculate perceptual acoustic parameters that can be used for sound rendering.

4.2.2 Machine Learning and Acoustic Processing

Machine learning techniques are increasingly used for acoustic processing applications. These include isolating source locations in multipath environments (Ferguson et al., 2018) and recovering room acoustic parameters such as reverberation time, direct-to-reverberant ratio, room volume, and equalization from recorded signals (Eaton et al., 2016; Genovese et al., 2019; Tsokaktsidis et al., 2019; Tang et al., 2019a). These parameters are used for speech processing or audio rendering in real-world scenes. Neural networks have also been used to replace the expensive convolution operations for fast auralization (Tenenbaum et al., 2019), to render the acoustic effects of scattering from rectangular plate objects for VR applications (Pulkki and Svensson, 2019), or to learn the mapping from convex shapes to the frequency-dependent loudness field (Fan et al., 2020a). The last method formulates the scattering function computation as a high-dimensional image-to-image regression and is mainly limited to convex objects that are isomorphic to spheres. In contrast, our learning-based method can compute a good approximation of the acoustic scattering field of arbitrary objects (e.g., non-convex or non-manifold).

4.3 Acoustic Scattering Preliminary

4.3.1 Helmholtz Equation

We can analyze the acoustic pressure field in the frequency domain by converting P(x, t) from Equation (2.2) using the Fourier transform

p(\mathbf{x}, \omega) = \mathcal{F}_t\{P(\mathbf{x}, t)\} = \int_{-\infty}^{\infty} P(\mathbf{x}, t)\, e^{-j\omega t}\, dt.    (4.1)

At each frequency \omega, the pressure field satisfies the homogeneous Helmholtz wave equation

(\nabla^2 + k^2)\, p(\mathbf{x}, \omega) = 0,    (4.2)

where k = \omega / c is the wavenumber. We can expand the Laplacian operator in spherical coordinates (r, \theta, \phi) as

\left( \frac{\partial^2}{\partial r^2} + \frac{2}{r}\frac{\partial}{\partial r} + \frac{1}{r^2 \sin\theta}\frac{\partial}{\partial \theta}\left( \sin\theta\, \frac{\partial}{\partial \theta} \right) + \frac{1}{r^2 \sin^2\theta}\frac{\partial^2}{\partial \phi^2} + k^2 \right) p = 0.    (4.3)

The general free-field solution of (4.3) can be formulated as

p(\mathbf{x}, \omega) = \sum_{l=0}^{\infty} \sum_{m=-l}^{+l} \left[ A_{lm}\, h_l^{(1)}(kr) + B_{lm}\, h_l^{(2)}(kr) \right] Y_l^m(\theta, \phi),    (4.4)

where h_l^{(1)} and h_l^{(2)} are the spherical Hankel functions of the first and second kind, respectively, and A_{lm} and B_{lm} are arbitrary constants. The term A_{lm} h_l^{(1)}(kr) + B_{lm} h_l^{(2)}(kr) represents the radial part of the solution, and the spherical harmonics term Y_l^m(\theta, \phi) represents the angular part of the solution.
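For reference, the radial terms in Equation (4.4) can be evaluated numerically from the spherical Bessel functions, since h_l^{(1)}(x) = j_l(x) + i y_l(x) and h_l^{(2)}(x) = j_l(x) - i y_l(x); a small sketch using SciPy is shown below (the chosen frequency and radii are arbitrary example values).

# Evaluate the spherical Hankel functions used in the radial part of Eq. (4.4).
import numpy as np
from scipy.special import spherical_jn, spherical_yn

def spherical_hankel(l, x, kind=1):
    """h_l^{(1)}(x) or h_l^{(2)}(x) from the spherical Bessel functions."""
    sign = 1.0 if kind == 1 else -1.0
    return spherical_jn(l, x) + sign * 1j * spherical_yn(l, x)

c = 343.0                       # speed of sound (m/s)
freq = 500.0                    # example frequency in Hz
k = 2 * np.pi * freq / c        # wavenumber k = omega / c
r = np.linspace(0.5, 10.0, 5)   # example radii in meters
print(spherical_hankel(2, k * r, kind=1))   # radial factor for l = 2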
4.3.2 Acoustic Wave Scattering

Equation (4.2) describes the behavior of acoustic waves in free-field conditions. When a propagating acoustic wave generated by a sound source interacts with an obstacle (the scatterer), a scattered field is generated outside the scatterer. The Helmholtz equation can be used to describe this scenario:

(\nabla^2 + k^2)\, p(\mathbf{x}, \omega) = -Q(\mathbf{x}, \omega), \quad \forall \mathbf{x} \in E,    (4.5)

where E is the space exterior to the scatterer and Q(x, \omega) represents the acoustic sources in the frequency domain. Common types of sound sources include monopole sources, dipole sources, and plane wave sources. To obtain an exact solution to Equation (4.5), the boundary conditions on the scatterer surface S need to be specified. In this work, we assume all the scattering objects are sound-hard (i.e., all energy is scattered, not absorbed) and therefore use the zero Neumann boundary condition for all of S:

\frac{\partial p}{\partial n(\mathbf{x})} = 0, \quad \forall \mathbf{x} \in S,    (4.6)

where n(x) is the normal vector at x. Alternatively, other conditions, including the sound-soft Dirichlet boundary condition and the mixed Robin boundary condition (Pierce and Beyer, 1990), can be used to model different acoustic scattering problems. When the boundary conditions are fully defined, the constants in Equation (4.4) can be uniquely determined.

4.3.3 Global and Localized Sound Fields

Sound fields typically refer to the sound energy/pressure distribution over a bounded space as generated by one or more sound sources. The global sound field in an acoustic environment depends on each sound source location, the propagating medium, and any reflections from boundary surfaces and objects. Computing it requires solving the wave equation in the free-field condition and evaluating inter-boundary interactions of sound energy using a global numeric solver (details in § 4.3.1). In this case, the positions of all scene objects/boundaries and sound sources need to be specified beforehand, and any change in these conditions changes the sound field. The exact computation of the global pressure field is very expensive and can take tens of hours on a cluster (Mehra et al., 2013; Raghuvanshi et al., 2010; Raghuvanshi and Snyder, 2014b).

Our goal is to generate plausible sounds in virtual environments with dynamic objects. Therefore, it is important to model the acoustic scattering field (ASF) of each object. The ASFs of different objects are used to represent the localized pressure field, which is needed for diffraction and inter-reflection effects (James et al., 2006; Mehra et al., 2013). At the same time, the sound field in the free space (e.g., the far field) between two distant objects is approximated using ray tracing, and we do not compute that pressure field accurately using a wave solver. In practice, computing the sound field in a localized space for each object in the scene is much simpler, and such fields are easier to represent than the output of a global solver (Mehra et al., 2013; Rungta et al., 2018).

4.3.4 Overview

We present a learning method to approximate the ASFs of static or dynamic 3D objects of moderate sizes. In terms of the correlation between an object's shape and its scattering field, the volume of the scatterer closely relates to its low-order shape characteristics, which can be represented by coarse triangle faces and which dominate the low-frequency scattering behavior; at high frequencies, this relationship shifts to high-order shape characteristics (i.e., geometric details). Given the inferential power of deep learning, we hypothesize that the scattered sound distribution can be learned directly from the scatterer geometry, without solving the complicated wave equations. The inference speed on a modern GPU far exceeds that of conventional wave solvers, making deep neural networks suitable for interactive sound rendering applications. Therefore, we propose feeding an appropriate 3D representation of each object to a neural network that learns its corresponding scattered acoustic pressure field. We build and evaluate our method mainly on low-frequency sounds and leverage state-of-the-art geometric ray-tracing techniques to handle high-frequency sounds. For each object, we consider a spherical grid of incoming directions and model the plane waves from each direction of this grid.
For each plane wave, our goal is to compute the scattered field for the object on an offset surface of the object. Our geometric deep learning method is used to compute the angular portion of the scattered field (Equation 4.4). If two objects move into a touching configuration, our learning algorithm treats them as one large object and estimates its scattered field. Similarly, we can recompute the scattered field for a deforming object. An overview of our approach is illustrated in Figure 4.2.

Figure 4.2: Overview: Our algorithm consists of a training stage and a runtime stage. The training stage uses a large dataset of 3D objects and their associated acoustic pressure fields computed using an accurate BEM solver to train the network. The runtime stage uses the trained neural network to predict the sound pressure field from a point cloud approximation of different objects at interactive rates.

4.4 Learning-based Sound Scattering

4.4.1 Wave Propagation Modeling

Our approach is designed for synthetic scenes, and we assume a geometric representation (e.g., a triangle mesh) is given to us, so the acoustic scattering field p(x, \omega) around the object can be solved numerically (derivation in § 4.3.1 and § 4.3.2). In this work, we propose modeling the angular part of the scattering field using our learning-based pressure field inference. The radial part is approximated using geometric sound propagation techniques.

Radial Decoupling

Our goal is to determine the scattering field exterior to an object using a wave solver. This field needs to be compactly encoded for efficient training. As shown in Equation (4.4), acoustic wave propagation in the free field can be decomposed into radial and angular components. Furthermore, the radial sound pressure in the far field follows the inverse-distance law (Beranek and Mellow, 2012): p \propto 1/r, as shown in Figure 4.3. We utilize this property to extrapolate the full ASF from one of its far-field "snapshots" at a fixed radius, so that the full ASF does not need to be stored. Following the inverse-distance law, the sound pressure at any far-field location (r, \theta, \phi) can be computed as

p(r, \theta, \phi, \omega) = \frac{r_{\mathrm{ref}}}{r}\, p(r_{\mathrm{ref}}, \theta, \phi, \omega),    (4.7)

where r_{\mathrm{ref}} is the reference distance and only p(r_{\mathrm{ref}}, \theta, \phi, \omega) needs to be computed and stored. For brevity, we omit r in the following sections.

Figure 4.3: Simulated sound pressure fall-off and inverse-distance-law fitted curves: We calculate the sound pressure around a sound scatterer in our dataset using the BEM solver as reference. We examine the sound pressure from 1m to 10m scattered along 5 directions (0°, 72°, 144°, 216°, and 288°). We regard the sound pressure value at 10m as corresponding to the far-field condition and inversely fit the pressure values for distances within 10m according to Equation 4.7. We use r_ref = 5m for generating our ASFs, although other values can be used as well.

Angular Pressure Field Encoding

A spherical field consisting of a fixed number of points (e.g., 642 points evenly distributed on a sphere surface) is obtained by generating an icosphere with 4 subdivisions. Real-valued scattered sound pressures are evaluated at these field points during wave-based simulation. Spherical harmonics (SH) can represent a spherical scalar field compactly using a set of SH coefficients; they have been widely used for 3D sound field recording and reproduction (Poletti, 2005). An SH expansion up to order l_max has M = (l_max + 1)^2 coefficients.
The angular pressure at the outgoing direction (\theta, \phi) can be evaluated as

p(\theta, \phi, \omega) = \sum_{l=0}^{l_{\max}} \sum_{m=-l}^{+l} Y_l^m(\theta, \phi)\, c_l^m(\omega),    (4.8)

where Y_l^m(\theta, \phi) are the SH basis functions at degree l and order m, and c_l^m(\omega) are the SH coefficients that encode our angular pressure fields. Increasing the number of coefficients leads to more challenges because the dimension of our learning target is raised.

4.4.2 Learning Spherical Pressure Fields

We need an appropriate geometric representation for the underlying objects in the scene so that we can apply geometric deep learning methods to compute the ASF. It is important that our approach be able to handle dynamic scenes with moving objects or changing topology. It can be difficult to handle such scenarios with mesh-based representations (Hanocka et al., 2019; Tan et al., 2018; Zheng et al., 2017). For example, (Hanocka et al., 2019) calculates intrinsic geodesic distances for convolution operations, which cannot be applied when one big object breaks into two.

Our approach uses a point cloud representation of the objects in the scene as input, and we leverage the PointNet (Charles et al., 2017) architecture to regress the spherical harmonics term c_l^m in Equation (4.8). PointNet is a highly efficient and effective network architecture that works on raw point cloud input and can perform various tasks, including 3D object classification, semantic segmentation, and our ASF regression. It also respects the permutation invariance of points. We slightly modify its output layers to predict the SH vector, as shown in Figure 4.4.

Figure 4.4: PointNet regression: Given an input point cloud with N = 1024 3D points, we feed it to the PointNet architecture (Charles et al., 2017) up to max pooling to extract the global feature. Then we use multi-layer perceptrons (MLPs) of layer sizes 256, 128, and 16 to map the feature to an SH vector of length 16 representing the scattering field.
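To make the decoding side concrete, the sketch below evaluates Equation (4.8) from a 16-dimensional SH vector (l_max = 3, so (l_max + 1)^2 = 16 coefficients, matching the network output) at an arbitrary outgoing direction, and applies the inverse-distance extrapolation of Equation (4.7). The real-SH convention shown here is one common choice, not necessarily the one used by our implementation.

# Decode the angular pressure field (Eq. 4.8) from a 16-dim SH vector (l_max = 3)
# and extrapolate with the inverse-distance law (Eq. 4.7).
import numpy as np
from scipy.special import sph_harm

L_MAX = 3  # (L_MAX + 1)**2 == 16 coefficients

def real_sh_basis(theta, phi):
    """Real spherical harmonics Y_l^m(theta, phi); theta = polar, phi = azimuth."""
    basis = []
    for l in range(L_MAX + 1):
        for m in range(-l, l + 1):
            # scipy's sph_harm(m, l, azimuth, polar) returns the complex Y_l^m
            y = sph_harm(abs(m), l, phi, theta)
            if m > 0:
                basis.append(np.sqrt(2) * (-1) ** m * y.real)
            elif m < 0:
                basis.append(np.sqrt(2) * (-1) ** m * y.imag)
            else:
                basis.append(y.real)
    return np.stack(basis, axis=-1)                   # (..., 16)

def pressure(sh_coeffs, theta, phi, r, r_ref=5.0):
    p_ref = real_sh_basis(theta, phi) @ sh_coeffs     # Eq. (4.8) at r_ref
    return (r_ref / r) * p_ref                        # Eq. (4.7)

coeffs = np.random.randn(16)                          # stand-in for a network output
print(pressure(coeffs, theta=np.pi / 3, phi=0.25, r=8.0))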
4.5 Interactive Sound Propagation with Wave-Ray Coupling

In this section, we describe how our learning-based method can be combined with geometric sound propagation techniques to compute the impulse responses for given source and listener positions. We can then render them in highly dynamic scenes.

Hybrid Sound Propagation We use a hybrid sound propagation algorithm that combines wave-based and ray acoustics. Each handles a different part of the wave acoustics phenomena, but they are coupled in terms of incoming and outgoing energies at multiple localized scattering fields. Specifically, our trained neural network estimates the scattering field and is used to compute propagation paths when sound interacts with obstacles in the scene. Sound propagation in the air, along with specular and diffuse reflections at large boundary surfaces (e.g., walls, floors), is computed using ray tracing methods (Schissler et al., 2014; Schissler and Manocha, 2017; Rungta et al., 2018).

Ray Tracing with Localized Fields Our localized ASFs are represented using SH coefficients. In the most general ray tracing formulation at a scattering surface, the sound intensity $I_{out}$ in an outgoing direction $(\theta_o, \phi_o)$ is given by the integral of the incoming intensity over all directions:
$$I_{out}(\theta_o, \phi_o, \omega) = \int_S I_{in}(\theta_i, \phi_i, \omega)\, f(\theta_i, \phi_i, \theta_o, \phi_o, \omega)\, dS, \qquad (4.9)$$
where $S$ represents the directions on a spherical surface around the ray hit point, $I_{in}(\theta_i, \phi_i, \omega)$ is the incoming sound intensity from direction $(\theta_i, \phi_i)$, and $f(\theta_i, \phi_i, \theta_o, \phi_o, \omega)$ is the bi-directional scattering distribution function (BSDF) commonly used in visual rendering (Pharr et al., 2016). Our problem of acoustic wave scattering differs from visual rendering in two respects: (1) sound waves scatter around objects, whereas light mostly transmits to visible directions or propagates through transparent materials; (2) BSDFs are point-based functions that depend on both incoming and outgoing directions, whereas our localized scattered fields are region-based functions. Therefore, we replace the BSDF in Equation (4.9) with our localized scattered field $p(\theta, \phi, \omega)$ from Equation (4.8). Our choice of a spherical offset surface to model the scattered field also lets us integrate over the whole spherical surface in a straightforward manner, since evaluating spherical coordinates is efficient with SH functions. Although $p(\theta, \phi, \omega)$ encodes only the outgoing directions and assumes an incoming plane wave along the $-x$ direction, one can easily rotate the point cloud to align any incoming direction with the $-x$ direction and use our network to infer $p(\theta, \phi, \omega)$ for that direction. We update Equation (4.9) to
$$I_{out}(\theta_o, \phi_o, \omega) = \int_S I_{in}(\theta_i, \phi_i, \omega)\, p^2(\theta_i, \phi_i, \omega)\, dS. \qquad (4.10)$$
We use Monte Carlo integration to numerically evaluate the outgoing scattered intensity:
$$I_{out}(\theta_o, \phi_o, \omega) \approx \frac{1}{N} \sum_{j=1}^{N} \frac{I_{in}(\theta_j, \phi_j, \omega)\, p^2(\theta_j, \phi_j, \omega)}{\Pr(\theta_j, \phi_j)}, \qquad (4.11)$$
where $N$ is the number of samples and $\Pr(\theta_j, \phi_j)$ is the probability of generating a sample in direction $(\theta_j, \phi_j)$. Uniform sampling over the sphere surface gives $\Pr(\theta_j, \phi_j) = \frac{1}{4\pi}$. As $N$ increases, the approximation becomes more accurate.

Diffraction Compensation In wave acoustics, the total sound field at a position can be decomposed into the sum of the free-field sound pressure and the scattered sound field. Similar to Rungta et al. (2018), we have so far computed only the scattered sound field. But when the listener is occluded from the sound source, the traditional ray-tracing algorithm will miss the contribution of the free field, which results in a very unnatural phenomenon: the sound would be greatly attenuated by a single obstacle if we only render the scattered sound, whereas in a realistic setup, low-frequency sound should not be attenuated much by a small obstacle. To address this issue in a ray-tracing context, we propose to approximate sound interference with and without an obstacle depending on an extra visibility check. Specifically, for a sound source in direction $(\theta_j, \phi_j)$ and the listener at $(\theta_o, \phi_o)$, we calculate the sound at the listener position based on whether the source and listener are blocked from each other by a scatterer:
$$I_{out}(\theta_o, \phi_o, \omega) \approx \begin{cases} \dfrac{1}{N} \displaystyle\sum_{j=1}^{N} \dfrac{I_{in}(\theta_j, \phi_j, \omega)\,\bigl(1 - p^2(\theta_j, \phi_j, \omega)\bigr)}{\Pr(\theta_j, \phi_j)}, & \text{if invisible,} \\[2ex] \dfrac{1}{N} \displaystyle\sum_{j=1}^{N} \dfrac{I_{in}(\theta_j, \phi_j, \omega)\, p^2(\theta_j, \phi_j, \omega)}{\Pr(\theta_j, \phi_j)}, & \text{if visible.} \end{cases} \qquad (4.12)$$
Note that the visible case remains the same as Equation (4.11), because the direct response is automatically accounted for by the original ray-tracing pipeline. This implementation is not physically accurate compared with wave acoustic simulations, since the phase information is missing. However, this formulation generates more realistic and smoother sound renderings than prior work that only considers the scattering field, and we verify its benefits through a perceptual evaluation in § 4.7.
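A minimal Python sketch of the estimator in Equations (4.11)–(4.12) is shown below; the incoming-intensity and ASF lookups are hypothetical callables standing in for the ray tracer's internal state, not the actual implementation.

```python
import numpy as np

def outgoing_intensity(incoming, asf, visible, n_samples=1024, rng=None):
    """Monte Carlo estimate of Eq. (4.11)/(4.12) with uniform sphere sampling.

    incoming(theta, phi) -> incoming intensity from a sampled direction
    asf(theta, phi)      -> predicted angular scattered pressure in [0, 1]
    visible              -> True if source and listener can see each other
    """
    rng = rng or np.random.default_rng()
    theta = np.arccos(1.0 - 2.0 * rng.random(n_samples))  # polar angle, cos-uniform
    phi = 2.0 * np.pi * rng.random(n_samples)             # azimuth, uniform
    pr = 1.0 / (4.0 * np.pi)                              # uniform sphere pdf

    p2 = np.array([asf(t, p) ** 2 for t, p in zip(theta, phi)])
    weight = p2 if visible else 1.0 - p2                  # branch of Eq. (4.12)
    i_in = np.array([incoming(t, p) for t, p in zip(theta, phi)])
    return np.mean(i_in * weight / pr)

# Example with constant incoming intensity and a constant ASF value.
print(outgoing_intensity(lambda t, p: 1.0, lambda t, p: 0.5, visible=True, n_samples=256))
```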
4.6 Implementation and Results

In this section, we describe our implementation details and demonstrate the performance on many dynamic benchmarks.

4.6.1 Data Generation

Dataset To generate our learning examples, we use the ABC Dataset (Koch et al., 2019). This dataset is a collection of one million general Computer-Aided Design (CAD) models and is widely used for evaluating geometric deep learning methods and applications. In particular, this dataset has been used to estimate differential quantities (e.g., normals) and sharp features, which makes it attractive for learning ASFs as well. We sample 100,000 models from the ABC Dataset and process them by scaling objects such that their longest dimension lies in the range [1m, 2m]. This choice of object size limit is not fixed and can depend on the specific problem domain (e.g., the size of objects used in applications like games or VR). Because the scattered pressure field is orientation-dependent, we augment our models by applying random 3D rotations to the original dataset to create an equal-sized rotation-augmented dataset. To generate accurate labeled data, we use an accurate BEM wave solver, placing a plane wave source with unit strength propagating in the $-x$ direction. The solver outputs the ASF for each object, which becomes our learning target. The dataset pipeline is illustrated in Figure 4.5.

Figure 4.5: Our dataset generation pipeline for neural network training: Given a set of CAD models, we apply random rotations with respect to their centers of mass to generate a larger augmented dataset and use a BEM solver to calculate the ASFs.

Mesh Pre-processing The original meshes from the ABC Dataset have high levels of detail, with fine edges shorter than 1 cm. Dense point-cloud inputs with similar granularity could also be modeled or collected from real-world scenes. However, a high number of triangle elements in a mesh significantly increases the simulation time of BEM solvers. For the wave-based solver, our highest simulation frequency is 1000 Hz, which corresponds to a wavelength of 34 cm. Therefore, we use the standard mesh simplification and mesh clustering procedures from vcglib 2 to ensure that our meshes have a minimum edge length of 1.7 cm, which is 1/20 of our shortest target wavelength. This is sufficient according to the standard techniques used in BEM simulators (Marburg, 2002). After pre-processing, most meshes have fewer than 20% of the elements of the original, and the BEM simulation for dataset generation gains over a 10× speedup.

BEM Solver We use the FastBEM Acoustics software 3 as our wave-based solver. Simulations are run on a Windows 10 workstation with 32 Intel(R) Xeon(R) Gold 5218 CPU cores with multi-threading. First, we use the adaptive cross approximation (ACA) BEM (Kurz et al., 2002) to compute the ASF, since it can achieve near-O(N) computational performance for small to medium-sized models (e.g., element count N ≤ 100,000). If it fails to converge within a fixed number of iterations, we fall back to the conventional, accurate BEM solver. Overall, it takes about 12 days to compute the ASFs up to 1000 Hz for about 100,000 objects from the ABC Dataset. The sound pressure field is evaluated at 642 field points evenly distributed on the spherical field surface. Next, we use the pyshtools 4 software (Wieczorek and Meschede, 2018) to compute the spherical harmonics coefficients from the pressure field using least-squares inversion.

2http://vcg.isti.cnr.it/vcglib/
3https://www.fastbem.com/
Reference Field Distance Since the inverse-distance law has increasing error in the near field of objects, we need to find a suitable distance for computing our reference field. We experimentally simulate the sound pressure fall-off with respect to distance and observe that the sound pressure 5 m or further away from the scatterer closely agrees with the far-field approximation (see Figure 4.3). Therefore, we calculate the pressure field on an offset surface 5 m away from the scatterer's center using a BEM solver (i.e., setting $r_{\mathrm{ref}} = 5$ m in Equation 4.7). Note that this choice of 5 m is not strict or fixed. If higher accuracy along the radial line is desired, multiple locations (especially in the near field) can be sampled during the simulation to interpolate the curve at higher accuracy. The precomputation time and memory overhead increase linearly with the number of sampled distance fields.

Max Spherical Harmonics Order We experiment with the number of SH coefficients by projecting our scattered sound pressure fields onto SH functions of different orders, as shown in Figure 4.6. Based on this analysis, we choose to use up to a 3rd-order SH projection, which yields sufficiently small fitting errors (relative error smaller than 2%) with 16 SH coefficients. This sets the output of our neural network (§ 4.4.2) to be a vector of length 16.

4https://shtools.oca.eu/shtools/public/index.html

Figure 4.6: Spherical harmonics approximation of sound pressure fields: We evaluate different orders of SH functions to fit our pressure fields at 4 frequencies and calculate the relative fitting errors.

4.6.2 Network Training

Our network model is trained on a GeForce RTX 2080 Ti GPU using the TensorFlow framework (Abadi et al., 2016). The dataset is split into a training set and a test set at a 9:1 ratio. In the training stage, we use the Adam optimizer to minimize the L2 loss between the predicted spherical harmonic coefficients and the ground truth. In practice, the initial learning rate is set to $1 \times 10^{-3}$, which decays exponentially at a rate of 0.9 and is clipped at $1 \times 10^{-5}$. The batch size is set to 128, and our network typically converges after 100 epochs in 8 hours. The number of trainable parameters is about 800k.

4.6.3 Runtime System and Benchmarks

We use the geometric sound propagation and rendering algorithm described in Schissler et al. (2014). Our sound rendering system traces sound rays in octave frequency bands at 125 Hz, 250 Hz, 500 Hz, 1000 Hz, 2000 Hz, 4000 Hz, and 8000 Hz. The direct output of ray tracing for each frequency band is the energy histogram with respect to propagation delays. We take the square root of these responses to compute the frequency-dependent pressure response envelopes. Broadband frequency responses are interpolated from our traced frequency bands, and the inverse Fourier transform is used to reconstruct the broadband impulse response. In theory, it is possible to encode phase information within a spherical harmonics representation. However, prior auralization research (Kuttruff, 1993) suggests that using a random phase spectrum along with the energy response does not introduce a noticeable sound difference during auralization. Therefore, our method does not preserve phase information, which keeps the system lightweight. We require that the wall boundaries be explicitly marked in our scenes. As a result, when a ray hits a wall, only conventional sound reflections occur for all frequencies.
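The band-to-broadband reconstruction described above can be sketched as follows for a single response frame; this is a simplified, assumed version (the actual system works per propagation-delay bin), and the sampling rate and FFT length are made-up defaults.

```python
import numpy as np

def bands_to_ir(band_energy, band_centers, n_fft=8192, fs=44100, rng=None):
    """Convert octave-band energy responses into a broadband IR segment.

    band_energy  -- (n_bands,) energy response for one frame
    band_centers -- (n_bands,) band center frequencies in Hz
    """
    rng = rng or np.random.default_rng()
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    # Pressure envelope = square root of the energy, interpolated over all bins.
    envelope = np.interp(freqs, band_centers, np.sqrt(band_energy))
    # Random phase spectrum (Kuttruff, 1993) in place of the true phase.
    phase = rng.uniform(0.0, 2.0 * np.pi, size=freqs.shape)
    return np.fft.irfft(envelope * np.exp(1j * phase), n=n_fft)

centers = np.array([125, 250, 500, 1000, 2000, 4000, 8000], dtype=float)
ir = bands_to_ir(np.array([1.0, 0.8, 0.6, 0.5, 0.3, 0.2, 0.1]), centers)
```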
During audio-visual rendering, when a ray hits a scattering object, we first extend the hit point along its ray direction by 0.5 m and use it as the scattering region center. We include all points within a search radius of 1 m of the region center to generate a point-cloud approximation of the scatterer. This point cloud is resampled using farthest point sampling and fed into our neural networks. Our network predicts the ASFs for sound frequencies of 125 Hz, 250 Hz, 500 Hz, and 1000 Hz. The higher frequencies (i.e., 2000 Hz, 4000 Hz, and 8000 Hz) are handled by conventional geometric ray tracing with specular and diffuse reflections, which does not use ASFs. Our neural network has a small prediction overhead of less than 1 ms per view on an NVIDIA GeForce RTX 2080 Ti GPU. The interactive runtime propagation system is illustrated in Figure 4.2. Our ray tracer performs 200 orders of reflections to generate late reverberation.

We evaluate the performance of our hybrid sound propagation and rendering algorithms on several benchmark scenes shown in Figure 4.1 and Table 4.1. They have varying levels of dynamism in terms of moving objects and are demonstrated in our supplemental video 5.

Table 4.1: Runtime performance on our benchmarks. The computation of ASFs takes about 1 ms per view, and most of the frame time is spent in ray tracing.
- Floor (4,065 triangles, 10.65 ms/frame): One static sound scatterer and one static sound source above an infinitely large floor. The listener moves horizontally so that the sound source visibility changes periodically. This is the simplest case, where no sound reverberation occurs, so as to accentuate the effect of sound diffraction.
- Sibenik (122,798 triangles, 6.87 ms/frame): Two disjoint moving objects are used as scatterers in a church. The two scatterers revolve around each other in close proximity such that there are complicated near-field interactions of sound waves. This scene is a reverberant benchmark.
- Trinity (386,007 triangles, 12.95 ms/frame): Six objects fly across a large indoor room and dynamically form new composite scatterers or decompose into separate scatterers (i.e., changing topologies). As a result, the total number of separate scattering entities in the scene changes, and prior methods (Rungta et al., 2018) are not effective. The occluded regions also change dynamically and create challenging scenarios for sound propagation.
- Havana (54,383 triangles, 6.78 ms/frame): Two rotating walls, generally larger than the scatterers in the previous benchmarks, in a half-open space. We use this benchmark to show that our approach can also handle large static objects, in addition to a large number of dynamic objects. It is an outdoor scene with moderate reverberation.

5https://gamma.umd.edu/pro/sound/asf
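As a concrete illustration of the runtime point-cloud extraction described at the beginning of this subsection (0.5 m offset along the ray, 1 m gather radius, farthest point sampling), here is a minimal sketch; the scene point array and helper names are assumptions, not the actual system.

```python
import numpy as np

def farthest_point_sampling(points, n_samples=1024):
    """Greedily pick n_samples well-spread points from an (M, 3) array."""
    chosen = [0]
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(1, min(n_samples, len(points))):
        idx = int(np.argmax(dist))
        chosen.append(idx)
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return points[chosen]

def scatterer_point_cloud(hit_point, ray_dir, scene_points, offset=0.5, radius=1.0):
    """Gather scene points around the extended hit point and resample them."""
    center = hit_point + offset * ray_dir / np.linalg.norm(ray_dir)
    region = scene_points[np.linalg.norm(scene_points - center, axis=1) <= radius]
    return farthest_point_sampling(region) if len(region) else region

# Example with random placeholder scene geometry.
pts = np.random.rand(5000, 3) * 10.0
cloud = scatterer_point_cloud(np.array([5.0, 5.0, 1.0]), np.array([0.0, 1.0, 0.0]), pts)
```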
4.6.4 Analysis

Figure 4.7: Comparing ASF prediction accuracy in latitude-longitude plots: (a) ASFs of static objects from the unseen test set; (b) ASFs of dynamically moving objects (lowest and highest frequencies), where we recompute the ASF at each time instance using our network; (c) ASFs of a deforming object (lowest and highest frequencies), computed using our network. We highlight the ASFs for different simulation frequencies. For each image block, the left column shows the mesh rendering of the objects. The lat-long plots visualize the ASF used in Equation (4.9) by frequency using perceptually uniform colormaps: the top row (Target) is the ground-truth ASF computed using a BEM solver on the original mesh; the bottom row (Predicted) is the ASF computed using our neural network from the point-cloud representation. The error metric NRE from Equation (4.13) is annotated above the predicted ASFs.

Accuracy Evaluation Our goal is to approximate the acoustic scattering fields of general 3D objects. While there is a preliminary 2D scattering dataset (Fan et al., 2020a), there are no general or well-known datasets or benchmarks for evaluating such ASFs or related computations. Therefore, we use 10k objects from our test dataset to evaluate the accuracy of our trained network. Compared with the original ABC Dataset, our test dataset has been augmented in scale and orientation to evaluate the performance of our learning method. Since the prediction $p(\theta, \phi, \omega) \in [0, 1]$ from our network is used as the BSDF in Equation (4.9), by fixing $\omega$ and varying $\theta$ and $\phi$, we visualize the field using latitude-longitude plots in Figure 4.7. We use the common normalized reproduction error (NRE) (Lilis et al., 2010; Betlehem and Abhayapala, 2005) to measure the error level of our predicted fields, defined as:
$$E(\omega) = \frac{\displaystyle\int_0^{2\pi}\!\!\int_0^{\pi} \bigl|p_{target}(\theta, \phi, \omega) - p_{predict}(\theta, \phi, \omega)\bigr|^2 \, d\theta\, d\phi}{\displaystyle\int_0^{2\pi}\!\!\int_0^{\pi} \bigl|p_{target}(\theta, \phi, \omega)\bigr|^2 \, d\theta\, d\phi}. \qquad (4.13)$$
We analyze three types of results. 1) Static objects: Figure 4.7a shows a subset of CAD objects sampled from our test set, which comes from the same distribution as the training set. The average NREs over the entire test set are 4.2%, 7.6%, 8.5%, and 10% for 125 Hz, 250 Hz, 500 Hz, and 1000 Hz, respectively, with an overall NRE of 8.8%. In addition, we show the NRE distribution in Figure 4.8, where most test errors lie below the average NRE. We observe a close visual match for most objects across frequencies. 2) Dynamic objects: Figure 4.7b shows an example where two disjoint objects move in proximity. Such scenarios are not created for the training set. We show the comparison and NREs at the lowest and highest frequencies. 3) Deforming objects: Figure 4.7c shows an example where a sphere deforms in different parts. These examples show that our network performs consistently well on a large unseen test set when the objects are similar to the CAD models in training. Preliminary results on dynamic and deforming objects indicate that our network has the potential to generalize to more complicated scenarios that are not explicitly modeled during training, although we cannot provide error bounds for these cases.

Figure 4.8: Distribution of test set prediction errors: We also mark the 50%, 75%, and 95% percentiles in the error histogram.

Note that the ASFs are not directly the perceived sound field at specific listener positions; instead, they are intermediate transfer functions within the sound rendering pipeline. Therefore, we further demonstrate the perceptual benefits of our predicted ASFs in § 4.7 and show that we can reliably generate plausible sound renderings at this error level.
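A discrete version of the NRE in Equation (4.13), evaluated on a latitude-longitude grid, could look like the following sketch (a straightforward numerical assumption rather than the evaluation code used in the dissertation).

```python
import numpy as np

def normalized_reproduction_error(p_target, p_predict):
    """Discrete NRE (Eq. 4.13) over a lat-long grid of the angular field."""
    num = np.sum(np.abs(p_target - p_predict) ** 2)
    den = np.sum(np.abs(p_target) ** 2)
    return num / den

target = np.random.rand(64, 128)      # placeholder (theta x phi) field
print(normalized_reproduction_error(target, 0.95 * target))
```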
Frequency Growth In theory, our learning-based framework and runtime system can also incorporate wave frequencies beyond 1000 Hz. However, two important factors need to be considered when extending our setup: 1) the wave simulation time increases with the simulation frequency (e.g., between a square and a cubic function of frequency for an accurate BEM solver); and 2) the ASF becomes more complicated at higher frequencies, which makes it more difficult to learn or approximate with the same neural network. The per-object simulation times in our experiments are 0.87 s, 1.10 s, 2.04 s, and 2.80 s for 125 Hz, 250 Hz, 500 Hz, and 1000 Hz, respectively. Note that the simulation time is governed largely by the choice of wave solver, as well as the relevant parameters and strategies used. We pre-processed our meshes according to the highest simulation frequency and used that mesh representation for all frequencies. When a higher frequency needs to be added, the meshes need finer details, meaning more boundary elements will be involved. A frequency-adaptive mesh simplification strategy (Li et al., 2015) can be used to reduce the simulation time at low frequencies. Our network prediction error also grows with the target frequency, but not at a prohibitive rate.

4.7 Perceptual Evaluation

We perceptually evaluate our method using audio-visual listening tests. Our goal is to verify that our method generates plausible sound renderings and to identify the conditions under which it may or may not work well. We evaluate three pipelines: 1) using predicted ASFs and our diffraction handling (ours); 2) using predicted ASFs and the scattering sound rendering pipeline of diffraction kernels (DK) (Rungta et al., 2018); and 3) using geometric sound propagation only (GSound) (Schissler and Manocha, 2017). The reason for choosing these two alternatives is that GSound is the state of the art for interactive sound propagation without diffraction modeling, while DK is regarded as the state-of-the-art hybrid algorithm for interactive sound propagation in dynamic scenes with rigid objects and uses ASFs precomputed by a BEM solver. Since wave-based methods are limited to static scenes, they are not included in our evaluation.

4.7.1 Participants

We performed our studies using Amazon Mechanical Turk 6 (AMT), a popular online crowdsourcing platform for data collection. We recruited 71 participants on AMT to take our study. To ensure the quality of our evaluation, we pre-screened the participants. The pre-screening question is designed to test whether a participant has a proper listening device and is in a comfortable listening environment, so that they can tell basic qualitative differences between audio clips. Specifically, we convolved three impulse responses with reverberation times of 0.2 s, 0.6 s, and 1.0 s with a 5-second clean human speech recording to generate three corresponding reverberant speech clips. The commonly used just-noticeable difference (JND) for reverberation time is a 5% relative change (ISO, 2009), so under normal conditions we expect a listener to correctly rank the three clips by their reverberation levels. Each participant is asked to listen to the three clips with no time limit and rank them by reverberance. The initial presentation order of the three clips is randomized for each participant. After pre-screening, our participants consist of 35 males and 16 females, with an average age of 35.9 years and a standard deviation of 9.5 years.

6https://www.mturk.com/
4.7.2 Training

As expected, general listeners have varied levels of understanding of sound effects, and we try to reduce this variance through a quick introduction to sound diffraction. During the training, we provide educational materials about sound diffraction, including text in non-academic language and a short YouTube video showing this phenomenon in the real world (where the sound travels around a pillar while the sound source is invisible). These materials require about one minute to read and watch. In addition, our participants become familiar with the video playing interface and are asked to adjust their audio volume to a comfortable level before the main listening tasks.

4.7.3 Stimuli and Procedure

We use the four benchmark scenes from § 4.6.3 in combination with the three sound rendering pipelines to populate 12 audio-visual renderings that we ask our participants to rate, with no time limit. We present the videos on four pages, one after another, each containing only the three videos from the same scene (e.g., Floor (ours), Floor (DK), and Floor (GSound)). The presentation order of the pages, as well as the order of videos within each page, is randomized for each participant. Immediately after each video, participants are asked to give a sound reality rating and a sound smoothness rating. Both ratings range from 0 to 5 stars, with half-star granularity. Participants are asked to "give 5 stars for the most realistic and most smooth video and 0 stars for the least realistic and smooth". Although we believe perceptual sensitivity varies among individuals, we expect that participants will recognize cases where unnatural, abrupt sound changes occur in response to scene dynamics, and will penalize them in their ratings.

4.7.4 Results

The average study completion time is 13 minutes. We show box plots of user ratings in Figure 4.9. We are interested in the rating differences under the 3 test conditions (i.e., GSound, DK, and ours) on a per-scene basis. Therefore, we perform within-group statistical analysis to identify potential significant differences. A significance level of 0.05 is adopted for all results in our discussion.

Figure 4.9: Perceptual evaluation results: (a) Sound reality ratings by scene. (b) Sound smoothness ratings by scene. User ratings are visualized as box plots. A higher rating means better quality. Results are grouped by benchmark scene, and each box represents the rating of a specific rendering pipeline in that scene.

Sound Reality Ratings First, we conduct a non-parametric Friedman test on the ratings given to the 3 rendering conditions and find significant group differences in Floor (χ² = 10.82, p < 0.01) and Havana (χ² = 8.27, p = 0.02), but not in Trinity (χ² = 0.16, p = 0.92) or Sibenik (χ² = 3.70, p = 0.16). Note that Floor and Havana are essentially open-space scenes with little reverberation, whereas Trinity and Sibenik are common indoor environments with a lot of reverberation. Considering that the sound power of reverberation is usually more dominant than diffraction, this result indicates that it is harder to tell the perceptual difference between these rendering pipelines when there is strong reverberation. To identify the source of differences in the Floor and Havana scenes, we perform post-hoc non-parametric Wilcoxon signed-rank tests with Bonferroni correction (Holm, 1979).
We observe that ours receives higher ratings than DK and GSound in both Floor (Z = {215.0, 144.0}, p < 0.01) and Havana (Z = {254.0, 186.5}, p < 0.01). However, there are no significant differences between GSound and DK in any scene.

Sound Smoothness Ratings Following the same procedure, we perform a Friedman test on the smoothness ratings and discover significant group differences in Floor (χ² = 10.29, p < 0.01), Havana (χ² = 7.63, p = 0.02), and Sibenik (χ² = 12.59, p < 0.01). Post-hoc Wilcoxon tests show results consistent with the reality ratings: we only see a higher smoothness rating for ours compared with both DK and GSound in Floor (Z = {203.5, 186.0}, p = 0.01) and Havana (Z = {233.5, 127.5}, p < 0.01). In Sibenik, both ours and DK receive a higher rating than GSound (Z = {146.5, 171.0}, p = 0.01).

In conclusion, our pipeline receives better perceptual ratings than the other two methods in moderately reverberant conditions, which may not hold in highly reverberant scenes. We achieve increased perceptual differentiation over the DK method. This is due to our better computation of the ASF for dynamic objects, which DK cannot handle well, and our diffraction handling, which aligns better with wave acoustic observations.
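For reference, the within-group analysis described above maps directly onto SciPy's non-parametric tests; the sketch below uses fabricated placeholder ratings (not our study data) and a plain Bonferroni-style correction as an assumed stand-in for the Holm procedure.

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(0)
# Placeholder per-participant ratings for one scene (three conditions x 51 raters).
gsound, dk, ours = rng.uniform(0, 5, (3, 51))

# Friedman test across the three rendering conditions.
chi2, p = friedmanchisquare(gsound, dk, ours)
print(f"Friedman: chi2={chi2:.2f}, p={p:.3f}")

# Post-hoc pairwise Wilcoxon signed-rank tests with a Bonferroni-style correction.
pairs = {"ours vs DK": (ours, dk), "ours vs GSound": (ours, gsound)}
for name, (a, b) in pairs.items():
    stat, p_raw = wilcoxon(a, b)   # SciPy reports the W statistic
    print(f"{name}: W={stat:.1f}, corrected p={min(1.0, p_raw * len(pairs)):.3f}")
```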
4.8 Summary

We present a new learning-based approach to approximate the ASFs of objects for interactive sound propagation. We exploit properties of the acoustic scattering field and use a geometric learning algorithm based on a point-based approximation. We evaluate the accuracy of our learning method on a large number of objects not seen in the training dataset, including objects undergoing topology changes, and observe low relative error in our benchmarks. Furthermore, we combine our method with a ray-tracing-based engine for sound rendering in highly dynamic scenes. A perceptual study confirms that our approach generates smooth and realistic sound effects in dynamic environments with increased perceptual differentiation over prior interactive methods.

Our approach has several limitations. These include all the challenges of geometric deep learning in terms of choosing an appropriate training dataset and long training times. It is very hard to provide rigorous guarantees in terms of error bounds on arbitrary objects. Furthermore, we assume that objects in the scene are sound-hard and do not take into account various material properties. The training time scales linearly with the number of frequencies and the number of scattering objects, while the simulation time can scale as a cubic function of the frequency. One mitigation is to limit the training to the kinds of objects that are frequently used in an interactive application (i.e., customized training).

There are many avenues for future work. It would be useful to take into account material properties by considering them as an additional object characteristic during training. We would also like to use other techniques from geometric deep learning to improve the performance of our approach. Our runtime ray tracing algorithm could use a different sampling scheme that exploits the properties of the ASF. In-person user studies using a VR headset or standardized lab listening tests may add more insight into how spatial sound perception is affected by different sound propagation schemes.

Chapter 5 High-Quality Synthetic Acoustic Datasets 1

Figure 5.1: Our IR data generation pipeline starts from a 3D model of a complex scene and its visual material annotations (unstructured texts). We sample multiple collision-free source and receiver locations in the scene. We use a novel scheme to automatically assign acoustic material parameters by semantic matching from a large acoustic database. Our hybrid acoustic simulator generates accurate impulse responses (IRs), which become part of the large synthetic impulse response dataset after post-processing.

1The work in this chapter has been published in Tang et al. (2019b, 2020, 2022)

5.1 Introduction

Many audio processing tasks have seen rapid progress in recent years due to advances in deep learning and the accumulation of large-scale audio and speech datasets. Not only are these techniques widely used for speech processing, but also for acoustic scene understanding and reconstruction, generating plausible sound effects for interactive applications, audio synthesis for videos, etc. A key factor in the advancement of these methods is the development and release of audio-related datasets. There are many datasets for speech processing, including datasets with different settings and languages (Park and Mulc, 2019), emotional speech (Tits et al., 2019), speech source separation (Drude et al., 2019), sound source localization (Wu et al., 2018), noise suppression (Reddy et al., 2020), background noise (Reddy et al., 2019), music generation (Briot et al., 2017), etc.

In this chapter, we present a large, novel dataset of synthetic room impulse responses (IRs). As introduced in § 2.1.1, an IR is regarded as the acoustical signature of a system and contains information related to reverberant decay, signal-to-noise ratio, arrival time, energy of direct and indirect sound, and other data related to acoustic scene analysis. These IRs can be convolved with anechoic sound to generate artificial reverberation, which is widely used in music, gaming, and VR applications, as enumerated in § 2.4.

There are some known datasets of recorded IRs from real-world scenes and synthetic IRs (see Table 5.1). The real-world datasets are limited in terms of the number of IRs or the size and characteristics of the captured scenes. All prior synthetic IR datasets are generated using geometric simulators and do not accurately capture low-frequency wave effects, which limits their applications.

Main Results: We present a large, accurate acoustic dataset (GWA) of synthetically generated IRs. Our approach is based on a hybrid simulator that combines a wave solver based on the finite-difference time-domain (FDTD) method with geometric sound propagation based on path tracing. The resulting IRs are accurate over the human aural range. Moreover, we use a large database of more than 6.8K professionally designed scenes with more than 18K furnished rooms, which provide a diverse set of geometric models. We present a novel and automatic scheme for semantic acoustic material assignment based on natural language processing techniques. We use a database of absorption coefficients of 2,042 unique real-world materials and a transformer network for sentence embedding. Currently, GWA consists of about 2 million IRs. We can easily use our approach to generate more IRs by changing the source and receiver positions or by using different sets of geometric models or materials.
Table 5.1: Overview of some existing large IR datasets and their characteristics. "Rec." means recorded and "Syn." means synthetic. The real-world datasets capture the low-frequency (LF) and high-frequency (HF) wave effects in the recorded IRs. Note that all prior synthetic datasets use geometric simulation methods and are accurate for higher frequencies only. In contrast, we use an accurate hybrid geometric-wave simulator on more diverse input data, corresponding to professionally designed 3D interior models with furniture, and generate accurate IRs corresponding to the entire human aural range (LF and HF). We highlight the benefits of our high-quality dataset for different audio and speech applications.
- BIU (Hadad et al., 2014): Rec.; 234 IRs; 3 scenes; photos; acoustic lab; real-world materials; LF, HF.
- MeshRIR (Koyama et al., 2021): Rec.; 4.4K IRs; 2 scenes; room dimensions; acoustic lab; real-world materials; LF, HF.
- BUT Reverb (Szöke et al., 2019): Rec.; 1.3K IRs; 8 scenes; photos; various sized rooms; real-world materials; LF, HF.
- S3A (Coleman et al., 2020): Rec.; 1.6K IRs; 5 scenes; room dimensions; various sized rooms; real-world materials; LF, HF.
- dEchorate (Di Carlo et al., 2021): Rec.; 2K IRs; 11 scenes; room dimensions; acoustic lab; real-world materials; LF, HF.
- Ko et al. (2017): Syn.; 60K IRs; 600 scenes; room dimensions; empty shoebox rooms; uniform material sampling; HF.
- BIRD (Grondin et al., 2020): Syn.; 100K IRs; 100K scenes; room dimensions; empty shoebox rooms; uniform material sampling; HF.
- SoundSpaces (Chen et al., 2020): Syn.; 16M IRs; 101 scenes; annotated 3D models; scanned indoor scenes; material database; HF.
- GWA (ours): Syn.; 2M IRs; 18.9K scenes; annotated 3D models; professionally designed; material database; LF, HF.

The novel components of our work include:
- Our dataset has more acoustic environments than real-world IR datasets by two orders of magnitude.
- Our dataset has more diverse IRs with higher accuracy, as compared to prior synthetic IR datasets.
- The accuracy improvement of our hybrid method over prior methods is evaluated by comparing our IRs with recorded IRs from multiple real-world scenes.
- We use our dataset to improve the performance of deep-learning speech processing algorithms, including automatic speech recognition, speech enhancement, and source separation, and observe significant improvements in accuracy.

5.2 Data Augmentation Preliminary

In this section, we explain the process of audio data augmentation, with an emphasis on speech data, and its use for deep learning tasks. Deep learning theory indicates that having more training examples with the same data distribution as the test data is crucial for reducing the generalization error of trained models in real test cases (Seltzer et al., 2013). However, the majority of popular speech corpora were recorded under relatively ideal conditions, i.e., anechoic speech with negligible noise and environmental reverberation. When training models for real-world applications, it is common to distort the clean speech by adding noise and reverberation as a pre-processing step to augment the training data (Kim et al., 2017; Doulaty et al., 2017). In general, speech processing tasks use an IR dataset to augment anechoic speech data and create synthetic distant training data, whereas the test data is reverberant data recorded in the real world. In practice, both recorded IRs and synthetic IRs have been used to convolve with the clean speech, and significant improvements in model accuracy have been observed from this type of data augmentation. When high-quality IR datasets are used, the training set is expected to generalize better on the test data.
Specifically, we can generate distant speech data $x_d[t]$ by convolving anechoic speech $x_c[t]$ with different IRs $h[t]$ and adding environmental noise $n[t]$ (e.g., from noise datasets like BUT ReverbDB (Szöke et al., 2019)):
$$x_d[t] = x_c[t] \ast h[t] + n[t]. \qquad (5.1)$$
This process is the most common way of augmenting reverberant speech data.

The image method is currently the most widely used method in the speech community for generating IRs for speech augmentation (Ko et al., 2017). It is based on the principle of specular reflections, where all reflection paths can be constructed by mirroring sound sources with respect to the reflecting plane. We hypothesize that more accurate acoustic simulations (i.e., ones that consider more than specular reflections) can benefit downstream tasks trained on the simulated IRs. To verify this, we run various speech processing benchmarks to test a diffuse geometric acoustic simulator we developed (Tang et al., 2020) and compare it with an image-method simulator. Specifically, we test our geometric simulation with diffuse components against the conventional image method on the automatic speech recognition (ASR) task (Table 5.2), the key-word spotting (KWS) task (Table 5.3), and the direction-of-arrival (DOA) estimation task (Table 5.4). In all tests, our method consistently achieves the best performance.

Table 5.2: Character accuracy of ASR systems. Our method has the highest accuracy and outperforms IM by 1.58%.
- Image Method (IM): 59.96%
- Our Geometric Simulator: 61.54%

Table 5.3: Equal error rates of KWS systems. Our method has the lowest equal error rate and results in a 21% error reduction relative to that of IM.
- Image Method (IM): 1.48%
- Our Geometric Simulator: 1.17%

Table 5.4: Results on the SOFA (Pérez-López and De Muynke, 2018) dataset. The first three columns show the percentage of DOA labels correctly predicted within error tolerances, followed by the average angular error and the %-improvement over the baseline. The best performance in each column is highlighted in bold.
- Image Method: < 5°: 11.9%; < 10°: 35.9%; < 15°: 73.2%; error: 16.9°; improvement: -
- Ours: < 5°: 24.4%; < 10°: 66.3%; < 15°: 88.2%; error: 9.68°; improvement: 43%

In addition, we are aware that geometric simulation has the drawback of inaccurate low-frequency modeling due to diffraction and room modes (see § 2.2). This motivates us to develop a larger dataset with the highest quality synthetic IRs, which model all acoustic phenomena, including specular and diffuse reflections, occlusion, diffraction, and low-frequency wave effects.
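The augmentation in Equation (5.1) can be sketched with SciPy as follows; the signal-to-noise handling and placeholder signals are assumptions for illustration, not part of the GWA pipeline.

```python
import numpy as np
from scipy.signal import fftconvolve

def make_distant_speech(clean, ir, noise, snr_db=20.0):
    """x_d = x_c * h + n (Eq. 5.1), with the noise scaled to a target SNR."""
    reverberant = fftconvolve(clean, ir)[: len(clean)]
    noise = np.resize(noise, reverberant.shape)
    # Scale noise so the reverberant-speech-to-noise ratio equals snr_db.
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return reverberant + gain * noise

# Example with synthetic placeholders (16 kHz, 1 s of "speech", a decaying IR).
fs = 16000
clean = np.random.randn(fs)
ir = np.exp(-np.linspace(0, 8, fs // 2)) * np.random.randn(fs // 2)
distant = make_distant_speech(clean, ir, np.random.randn(fs))
```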
5.3 Dataset Creation

A key issue in the design and release of an acoustic dataset is the choice of the underlying 3D geometric models. Given the availability of interactive geometric acoustic simulation software packages, it is relatively simple to randomly sample a set of simple virtual shoebox-shaped rooms for source and listener positions and generate unlimited simulated IR data. However, such IR data will not have the acoustic variety (e.g., room equalization, material diversity, wave effects, reverberation patterns, etc.) frequently observed in real-world datasets. We identify several criteria that are important for creating a useful synthetic acoustic dataset: (1) a wide range of room configurations: the room space should include regular and irregular shapes as well as furniture placed in reasonable ways; many prior datasets are limited to rectangular, shoebox, or empty rooms (see Table 5.1); (2) meaningful acoustic materials: object surfaces should use physically plausible acoustic materials with varying absorption and scattering coefficients, rather than randomly assigned frequency-dependent coefficients; and (3) an accurate simulation method that accounts for various acoustic effects, including specular and diffuse reflections, occlusion, and low-frequency wave effects like diffraction. It is important to generate IRs corresponding to the human aural range for many speech processing and related applications. In this section, we present our pipeline for developing a dataset that satisfies all these criteria. An overview of our pipeline is illustrated in Figure 5.1.

5.3.1 Acoustic Environment Acquisition

Acoustic simulation for 3D models requires that environment boundaries and object shapes be well defined and represented as 3D meshes. Simple image-method simulations may only require a few room dimensions (i.e., length, width, and height) and have been used for speech applications, but these methods cannot handle complex 3D indoor scenes. Many techniques have been proposed in computer vision to reconstruct large-scale 3D environments from RGB-D input (Choi et al., 2015). Moreover, they can be combined with 3D semantic segmentation (Dai et al., 2018) to recover category labels of objects in the scene. This facilitates the collection of indoor scene datasets. However, real-world 3D scans tend to suffer from measurement noise, resulting in incomplete or discontinuous surfaces in the reconstructed model that can be problematic for acoustic simulation algorithms. One alternative is to use professionally designed indoor scenes in the form of CAD models. These models are desirable for acoustic simulation because they have well-defined geometries and the most accurate semantic labels. Therefore, we use CAD models from the 3D-FRONT dataset (Fu et al., 2021), which contains 18,968 diversely furnished rooms in 6,813 irregularly shaped houses/scenes. These different types of rooms (e.g., bedrooms, living rooms, dining rooms, and study rooms) are diversely furnished with varying numbers of furniture objects in meaningful locations. This differs from prior methods that use empty shoebox-shaped rooms (Grondin et al., 2020; Ko et al., 2017), because room shapes and the presence of furniture significantly modify the acoustic signature of the room, including shifting the room modes at low frequencies. The 3D-FRONT dataset is designed to have realistic scene layouts and has received higher human ratings in subjective studies. Generating audio data from these models allows us to better approximate real-world acoustics.

5.3.2 Semantic Acoustic Material Assignment

Figure 5.2: Our semantic material assignment algorithm. We use NLP techniques based on sentence embedding along with a transformer network to choose absorption coefficients from a database of 2,042 unique materials.

Because the 3D-FRONT dataset also provides object semantics (i.e., object material labels), it is possible to assign more meaningful acoustic materials to individual surfaces or objects in the scene. For example, an object with a "window" description is likely to be matched with several types of window glass material from the acoustic material database.
The SoundSpaces dataset (Chen et al., 2020) also utilizes scene labels through empirical manual material assignment (e.g., acoustic materials of carpet, gypsum board, and acoustic tile are assumed for the floor, wall, and ceiling classes), creating a one-to-one visual-acoustic material mapping for the entire dataset. This approach works for a small set of known material types. Instead, we present a general and fully automatic method that works for unknown materials with unstructured text descriptions.

To start, we retrieve measured frequency-dependent acoustic absorption coefficients for 2,042 unique materials from a room acoustics database (Kling, 2018). The descriptions of these materials do not directly match the semantic labels of objects in the 3D-FRONT dataset. Therefore, we present a method to calculate the semantic similarity between each label and material description using natural language processing (NLP) techniques. In NLP research, sentences can be encoded into fixed-length numeric vectors known as sentence embeddings (Mishra and Viradiya, 2019). One goal of sentence embedding is to find semantic similarities to identify text with similar meanings. Transformer networks have been very successful in generating good sentence embeddings (Liu et al., 2020), such that sentences with similar meanings lie relatively close in the embedding vector space. We leverage a state-of-the-art sentence transformer model 2 (Reimers and Gurevych, 2019) that calculates an embedding of dimension 512 for each sentence. Next, we calculate the cosine similarity score between each pair of embedding vectors, which represents the pair-wise semantic distance between each material label and each description in the material database. For each material label in a 3D scene, we assign a set of absorption coefficients from the acoustic database using weighted sampling based on the cosine similarity scores between the 3D-FRONT material label and all descriptions in the material database. This process is illustrated in Figure 5.2. Note that we do not directly pick the material with the highest score, because for the same type of material there are still different versions with different absorption coefficients (e.g., in terms of thickness, brand, painting, etc.). These slightly different descriptions of the same material are likely to have similar semantic distances to the 3D-FRONT material label being examined. We use a probabilistic assignment process that provides balanced sampling over the material database and thereby increases the acoustic diversity of our dataset.

2Using the pre-trained model at https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2
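A minimal sketch of this matching step using the sentence-transformers library is shown below; the label and description strings are invented examples, and the specific weighting scheme is an assumption standing in for the weighted sampling described above.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/distiluse-base-multilingual-cased-v2")

scene_label = "wooden window frame"            # invented example label
material_descriptions = [                       # invented database entries
    "window glass, 4 mm, standard glazing",
    "double glazing window, 12 mm air gap",
    "painted concrete block wall",
]

emb = model.encode([scene_label] + material_descriptions)
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)   # unit-normalize embeddings
scores = emb[1:] @ emb[0]                                # cosine similarities

# Weighted (probabilistic) assignment rather than an arg-max pick.
weights = np.clip(scores, 0.0, None)
probs = weights / weights.sum()
chosen = np.random.default_rng().choice(len(material_descriptions), p=probs)
print(material_descriptions[chosen], scores[chosen])
```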
5.3.3 Geometric-Wave Hybrid Simulation

It is well known that geometric acoustic (GA) methods do not model low-frequency acoustic effects well due to the linear ray assumption (Funkhouser et al., 1998a; Schissler et al., 2014). Therefore, we use a hybrid propagation algorithm that combines wave-based methods with GA. These wave-based methods can accurately model low-frequency wave effects, but their running time increases as the third or fourth power of the highest simulation frequency (Raghuvanshi et al., 2009). Given the high time complexity of wave-based methods, we also want to use methods that are: (1) highly parallelizable, so that dataset creation takes acceptable time on high-performance computing clusters; (2) compatible with arbitrary geometric mesh representations and acoustic material inputs; and (3) open-source, so that the simulation pipeline can be reused by the research community. In this chapter, we develop our hybrid simulation pipeline from a CPU-based GA implementation, pygsound 3, and a GPU-based wave FDTD implementation, PFFDTD (Hamilton, 2021).

Inputs The scene CAD models from the 3D-FRONT dataset, each corresponding to several rooms with open doors, are represented in a triangle mesh format. Most GA methods have native support for 3D mesh input. The meshes are converted to voxels to be used as geometry input to the wave-based solver. We randomly sample 1 source and 50 receiver locations in each scene. We perform collision checking to ensure that all sampled locations have at least 0.2 m clearance from any object in the scene. We assign acoustic absorption coefficients according to the scheme presented in § 5.3.2. These coefficients can be directly used by the GA method and are integrated with the passive boundary impedance model used by the wave FDTD method (Bilbao et al., 2015). The GA method also requires scattering coefficients, which account for the energy ratio between specular and diffuse reflections. Such data is measured less commonly and is not available from the material database in § 5.3.2. It is known that scattering coefficients tend to be negligible (e.g., ≤ 0.05) for the low-frequency bands (Cox et al., 2006) handled by the wave method. Therefore, we sample scattering coefficients by fitting a normal distribution to 37 sets of frequency-dependent scattering coefficients obtained from the benchmark data in § 5.4.1; these are only used by the GA method.

Setup For the GA method, we set 20,000 rays and a maximum depth of 200 for specular and diffuse reflections. The GA simulation is intended for the human aural range, while most absorption coefficient data is only valid for octave bands from 63 Hz to 8,000 Hz. The ray tracing stops when the maximum depth is reached or the energy falls below the hearing threshold. For the wave-based FDTD method, we set the maximum simulation frequency to 1,400 Hz. The grid spacing is set according to 10.5 points per wavelength. Our simulation duration is one second, since indoor scenes are usually not too large.

Automatic Calibration Before combining simulated IRs from the two methods, one important step is to properly calibrate their relative energies. Southern et al. (2011) describe two objective calibration methods: (1) pre-defining a crossover frequency range near the highest frequency of the wave method and aligning the sound levels of the two methods in that range; (2) calibrating the peak level from time-windowed, bandwidth-matched regions in the wave and GA methods. Both calibration methods are applied case by case for each pair of IRs. However, the first method is not physically correct, and the second method can be vulnerable when the direct sound is not known, as with occluded direct rays in the GA method. Southern et al. (2013) improved the second method by calculating the calibration parameters once in free-field conditions using a band-limited source signal.

3https://github.com/royjames/pygsound
We use a similar calibration procedure. The calibration source and receivers have a fixed distance of r = 1 m in a large volume with absorbing boundary conditions, and the 90 calibration receivers span a 90° arc to account for the influence of propagation direction along the FDTD grid. The source impulse signal is low-pass filtered at a cut-off frequency of 255 Hz. When the source signal is a unit impulse, this filtering makes the source signal essentially the same as the coefficients of the low-pass filter. The simulated band-limited IRs are truncated at twice the theoretical direct response time to further exclude any unwanted reflected wave. The calibration parameter for wave-based FDTD is computed as:
$$\lambda_w = \sqrt{\frac{E_s}{E_r}}, \qquad (5.2)$$
where $E_s$ is the total energy of the band-limited point source and $E_r$ is the total energy at the receiver point. For multiple receiver points, $\lambda_w$ takes the average value. During wave-based FDTD calibration, each received signal is multiplied by $\lambda_w$, and we can calculate the difference between the calibrated signal and the band-limited source signal. We obtain a very low mean error of 0.50 dB and a maximum error of 0.85 dB among all calibration receivers. For the GA method, we follow the same procedure, though the process is simpler since the direct sound energy is explicit in most GA algorithms (i.e., $\frac{1}{r}$ scaled by some constant). Another calibration parameter $\lambda_g$ is obtained similarly for the GA method. This calibration process ensures that the full-band transmitted energy from both methods will be E = 1 at a distance of 1 m from a sound source, although the absolute energy does not matter and the two parameters can be combined into one (i.e., only $\lambda'_w = \lambda_w / \lambda_g$ is used for wave calibration). Figure 5.3 shows an example of simulation results with and without calibration. Without properly calibrating the energies, there will be abrupt sound level changes in the frequency domain, which can create unnatural sound.

Figure 5.3: Power spectrum comparison between the original wave FDTD simulated IR and the calibrated IR. The vertical dashed line indicates the highest valid frequency of the FDTD method. Our automatic calibration method ensures that the GA and wave-based methods have consistent energy levels so that they can generate high-quality IRs and plausible, smooth sound effects.

Hybrid Combination Ideally, we would want to use the wave-based method up to the highest possible simulation frequency. Besides the running time, one issue with the FDTD scheme is the dispersion error, which rises with frequency (Lehtinen, 2003). As a remedy, the FDTD results are first high-pass filtered at a very low frequency (e.g., 10 Hz) to remove the DC offset and then low-pass filtered at the crossover frequency to be combined with the GA results. We use a Linkwitz-Riley crossover filter (Linkwitz, 1976), which cascades Butterworth filters, to avoid ringing artifacts near the crossover frequency. The crossover frequency in this work is chosen to be 1,400 Hz to fully utilize the accuracy of the wave simulation results. Higher crossover frequencies could be used at the cost of increased FDTD simulation time.
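A simplified sketch of the calibration-and-crossover combination is shown below using SciPy; the 4th-order Linkwitz-Riley realization (the same 2nd-order Butterworth section applied twice) and the parameter defaults are assumptions for illustration, not the exact pipeline configuration.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def linkwitz_riley_sos(cutoff_hz, fs, btype):
    """2nd-order Butterworth section; applying it twice yields a 4th-order LR filter."""
    return butter(2, cutoff_hz, btype=btype, fs=fs, output="sos")

def combine_hybrid_ir(ir_fdtd, ir_ga, lam_w, fs=48000, fc=1400.0):
    """Calibrate the FDTD IR (Eq. 5.2) and crossover-combine it with the GA IR."""
    n = max(len(ir_fdtd), len(ir_ga))
    low = np.pad(lam_w * ir_fdtd, (0, n - len(ir_fdtd)))
    high = np.pad(ir_ga, (0, n - len(ir_ga)))

    hp_dc = linkwitz_riley_sos(10.0, fs, "highpass")       # remove DC offset
    lp = linkwitz_riley_sos(fc, fs, "lowpass")
    hp = linkwitz_riley_sos(fc, fs, "highpass")

    low = sosfilt(lp, sosfilt(lp, sosfilt(hp_dc, low)))    # LR4 low band (FDTD)
    high = sosfilt(hp, sosfilt(hp, high))                  # LR4 high band (GA)
    return low + high
```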
5.3.4 Analysis and Statistics

Runtime The runtime of our hybrid simulator depends on the specific computational hardware. We utilize a high-performance computing cluster with 20 Intel Ivy Bridge E5-2680v2 CPUs and 2 Nvidia Tesla K20m GPUs on each node. On a single node, our simulator requires about 800 computing hours for the wave-based FDTD method and about 500 computing hours for the GA method to generate all data. One can roughly estimate the wall time needed by dividing these times by the number of such available compute nodes.

Figure 5.4: We highlight the most frequently used materials in our approach for generating the IR dataset: (a) occurrence of the top visual material names; (b) occurrence of the top acoustic material names. The acoustic database also contains non-English words, which are handled by a pre-trained multilingual language model.

Distributions More than 5,000 scene/house models are used. On average, each scene uses 22.5 different acoustic materials. We assign 1,955 unique acoustic materials (out of 2,042) from the material database, and the most frequently used materials are several versions of brick, concrete, glass, wood, and plaster. The occurrences of the most frequently used materials are visualized in Figure 5.4. The distribution of distances between all source and receiver pairs is visualized in Figure 5.5. We also show the relationship between the volume of each 3D house model and its reverberation time in Figure 5.6 to highlight the wide distribution of our dataset. Overall, we have a balanced distribution of reverberation times in the normal range.

Figure 5.5: Distance distribution between source and receiver pairs in our scene database. No special distance constraints are enforced during sampling other than being collision-free with respect to the objects in the scene. The IRs vary based on the relative positions of the source and the receiver in a 3D scene.

Figure 5.6: Statistics of house/scene volumes and reverberation times. We see a large variation in reverberation times, which is important for speech processing and other applications.

5.4 Acoustic Evaluation

In this section, we evaluate the accuracy of our hybrid IR generation algorithm. We use a set of real-world acoustic scenes with measured IR data to evaluate the effectiveness and accuracy of our hybrid simulation method.

5.4.1 Benchmarks

Several real-world benchmarks have been proposed to investigate the accuracy of acoustic simulation techniques. A series of three round-robin studies (Vorländer, 1995; Bork, 2000, 2005a,b) have been conducted on several acoustic simulation software systems by providing the same input and then comparing the different simulation results with the measured data. In general, these studies provide the room and material descriptions as well as microphone and loudspeaker specifications, including locations and directivity. However, the level of detail, in terms of complete 3D models and consistently measured acoustic material properties, tends to vary. Previous round-robin studies have identified many issues (e.g., uncertainty in boundary condition definitions) in the simulation input definitions for many simulation packages, which can result in poor agreement between simulation results and real-world measurements. A more recent benchmark, the BRAS benchmark (Aspöck et al., 2020), contains the most complete scene descriptions and covers a wide range of recording scenarios. We use the BRAS benchmark to evaluate our simulation method. Three reference scenes (RS5-7) are designed as diffraction benchmarks, and we use them to evaluate the performance of our hybrid simulator, especially at lower frequencies.
The 3D models of the reference scenes, along with frequency-dependent acoustic absorption and scattering coefficients, are used directly in our hybrid simulator. We use these three scenes because they are considered difficult for the geometric method alone (Brinkmann et al., 2019).

5.4.2 Results

We use the room geometry, source-listener locations, and material definitions as input to our simulation pipeline. Note that the benchmark only provides absorption and scattering coefficients, and no impedance data is directly available for wave solvers; thus, we use fitted values rather than exact values. The IRs generated by the GA method, the IRs generated by our hybrid method, and the measured IRs from the benchmark are compared in the frequency domain in Figure 5.7. In these scenes, the source and receiver are placed on different sides of the obstacle, and the semi-anechoic room only has floor reflections. In the high-frequency range, there are fewer variations in the measured response, and both methods capture the general trend of energy decay even though the response levels are not perfectly matched. This demonstrates that our hybrid sound simulation pipeline generates more accurate results than the GA method for complex real-world scenes.

5.5 Applications

We use our dataset in three speech processing applications that use deep learning methods. Synthetic IRs have been widely used for training neural networks for automatic speech recognition, speech enhancement, and source separation. We evaluate the benefits of generating a diverse, high-quality IR dataset over prior methods for generating synthetic IRs. Far-field speech data is generated according to Equation (5.1) using synthetic IRs. In the following tests, we use several versions of IR datasets: GA (geometric method only), FDTD (only up to 1,400 Hz), and GWA (hybrid method). The speech data is then used by different training procedures and neural network architectures on the benchmarks described below.

5.5.1 Automated Speech Recognition

Automatic speech recognition (ASR) aims to convert speech data into text transcriptions. The performance of ASR models is measured by the word error rate (WER), which is the percentage of incorrectly transcribed words in the test data. The AMI speech corpus (Carletta et al., 2005), consisting of 100 hours of meeting recordings, is used as our benchmark, and we use the Kaldi 4 toolbox to run experiments on it. We randomly select 17,749 IRs out of the 2M synthetic IRs in GWA to augment the anechoic training set in AMI, and report the WER on the real-world test set. A lower WER indicates that the synthetic distant speech data used for training is closer to real-world distant speech data. We highlight the improved accuracy obtained using GWA over prior synthetic IR generators in Table 5.5.

Table 5.5: Far-field ASR results obtained for the AMI corpus (lower WER is better). The best result is marked in bold.
- None (anechoic speech): 64.2% WER
- GA: 55.5% WER
- GWA (ours): 54.1% WER

4https://github.com/kaldi-asr/kaldi

5.5.2 Speech Dereverberation

Speech dereverberation aims to convert a reverberant speech signal back to its anechoic version to enhance its intelligibility. We use SkipConvNet (Kothapally et al., 2020), a U-Net based speech dereverberation model. The model is trained on the 100-hour subset of the Librispeech dataset (Panayotov et al., 2015). The reverberant input to the model is generated by convolving the clean Librispeech data with our synthetically generated IRs.
In addition, we include another synthetic IR dataset, SoundSpaces (Chen et al., 2020), in this comparison. We test the performance of the model on real-world recordings from the VOiCES dataset (Richey et al., 2018). We report the speech-to-reverberation modulation energy ratio (SRMR) over the test set. A higher value of SRMR indicates lower reverberation and higher speech quality. As seen in Table 5.6, our proposed dataset obtains better dereverberation performance compared to all other datasets.

Table 5.6: We tabulate the SRMR of the SkipConvNet enhancement model trained using different synthetic IR generation methods. We test the results on real-world reverberant recordings from the VOiCES dataset. Use of our hybrid dataset results in improved accuracy over prior methods.
    IR used                          SRMR (higher is better)
    None (baseline)                  4.96
    SoundSpaces (Chen et al., 2020)  7.44
    GA                               6.01
    FDTD                             4.78
    GWA (ours)                       8.14

5.5.3 Speech Separation

We train a model to separate reverberant mixtures of two speech signals into their constituent reverberant sources. We use the Asteroid (Pariente et al., 2020) implementation of the DPRNN-TasNet model (Luo et al., 2020) for our benchmarks. The 100-hour split of the Libri2Mix (Cosentino et al., 2020) dataset is used for training. We test the model on reverberant mixtures generated from the VOiCES dataset. We report the improvement in scale-invariant signal-to-distortion ratio (SI-SDRi) (Roux et al., 2018) to measure separation performance. A higher SI-SDRi implies better separation. As seen in Table 5.7, our proposed hybrid approach (GWA) outperforms both GA and FDTD for speech separation.

Table 5.7: SI-SDRi values reported for different IR generation methods. We report results separately for the four rooms used to capture the test set (higher is better).
    IR used       Room 1   Room 2   Room 3   Room 4
    GA            2.25     2.55     1.44     2.55
    FDTD          2.36     2.43     1.33     2.46
    GWA (ours)    2.94     2.76     1.86     2.91

5.6 Summary

We introduced a large new audio dataset of synthetic room impulse responses, along with the simulation pipeline that can take different scene configurations and generate high-quality IRs. We demonstrated the improved accuracy of our hybrid geometric-wave simulator on three difficult scenes from the BRAS benchmark. Compared to prior datasets, GWA has more scene diversity than recorded datasets and more physically accurate IRs than other synthetic datasets. We also used our dataset with audio deep learning algorithms to improve the performance of speech processing applications.

Our dataset only consists of synthetic scenes, and its IRs may not be as accurate as real-world captured IRs. In many applications, it is also important to model ambient noise. In the future, we will continue growing the dataset by including more 3D scenes to further expand its acoustic diversity. We also plan to evaluate the performance of other audio deep learning applications.

Figure 5.7: Frequency responses of geometric and hybrid simulations compared with measured IRs in BRAS benchmarks RS5-7 (Aspöck et al., 2020): (a) RS5: simple diffraction with an infinite edge; (b) RS6: diffraction with an infinite body; (c) RS7: multiple diffraction (seat dip effect). Images of each setup are attached in the corners of the graph. We notice that the IRs generated using our hybrid method closely match the measured IRs, as compared to those generated using GA methods.
This demonstrates the higher quality and accuracy of our IRs compared to those generated by the prior GA methods highlighted in Table 5.1.

Chapter 6

Conclusion

6.1 Summary of Results

In this dissertation, we first investigate novel solutions via acoustic simulation and deep learning to provide high-quality sound rendering in mixed reality settings with fewer limitations than existing vision-based and measurement-based methods. Next, we extend the inferential power of deep neural networks to predict complicated acoustic scattering fields by analyzing object shapes. This is the first and fastest method to generate wave acoustic scattering effects on the fly in 3D environments without additional pre-computation for unseen scenes. Finally, we develop a data pipeline that utilizes state-of-the-art geometric and wave acoustic simulators to generate high-quality synthetic impulse response data at scale. Our pipeline can take general 3D model inputs and automatically assign meaningful acoustic materials by semantic matching. The simulation pipeline and dataset can significantly improve the performance of data-driven applications such as deep learning-based speech processing tasks.

Our results have demonstrated that by leveraging state-of-the-art physics-based acoustic simulation and deep learning techniques, realistic simulated data can be generated to enhance the sound rendering quality in the virtual world and boost the performance of audio processing tasks in the real world.

6.2 Future Work

In the future, I would like to address some of the limitations mentioned in previous chapters. In the following, I identify several specific future directions.

Just-Noticeable-Difference (JND) in Simulations We have run several perceptual evaluations against other works to verify that the quality of sound rendering from our methods is on par with or better than previous work. However, it is not clear to what extent we should optimize the respective objective functions for acoustic simulations. In other words, how much do factors like accurate material modeling, low-frequency wave simulation, and geometry details affect perceptual listening quality for humans? While some JND metrics have been established for more common acoustic parameters like the T60, less work has been done in the context of acoustic simulations. I believe more rigorous in-lab listening tests with a range of simulation setups will help establish more useful JND metrics for follow-up work.

Curse of Dimensionality Deep learning methods generally suffer from the curse of dimensionality: as the dimensionality of the problem being analyzed increases, the amount of data required grows exponentially. As a consequence, the time needed to prepare the data and train the model also grows accordingly. This situation applies to deep learning for acoustic problems. As discussed, the sound field in a room can be affected by the room shape, source and listener positions, acoustic materials/boundary conditions, and medium properties (e.g., air temperature). Most of the time, we are only able to study a subset of these conditions, as is the case with our deep learning-based acoustic scattering framework, where we based our analysis entirely on the geometry inputs and ignored the variations in their material properties.
While we can expand the training data by adding more dimensions to the simulation setup, techniques like parameter regularization and autoencoders should be considered to mitigate the curse of dimensionality and to train a more general acoustic inference model.

Neural Acoustic Fields We have managed to generate a large, high-quality acoustic dataset, and the pipeline allows anyone to expand the dataset to much larger scales if resources permit. This also includes simulating audio data in different hardware (e.g., multi-channel) or software (e.g., spatially encoded) formats. However, there is an infinite amount of data that could be simulated, and it is unlikely that any one dataset can satisfy all needs. Therefore, one promising direction is to use such a large dataset to learn to construct the acoustic field with neural networks. The same idea has rapidly gained success in computer graphics and is known as neural radiance fields (NeRF) (Mildenhall et al., 2020). While some preliminary work has been done for acoustics (Ratnarajah et al., 2022), dealing with acoustic fields in higher dimensions than radiance fields remains an open and challenging problem.

Bibliography

Steam audio. https://valvesoftware.github.io/steam-audio, 2018.

Microsoft project acoustics. https://aka.ms/acoustics, 2019.

Oculus spatializer. https://developer.oculus.com/downloads/package/oculus-spatializer-unity, 2019.

M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.

J. S. Abel, N. J. Bryan, P. P. Huang, M. Kolar, and B. V. Pentcheva. Estimating room impulse responses from recorded balloon pops. In Audio Engineering Society Convention 129. Audio Engineering Society, 2010.

J. B. Allen and D. A. Berkley. Image method for efficiently simulating small-room acoustics. The Journal of the Acoustical Society of America, 65(4):943–950, 1979.

R. Aralikatti, A. Ratnarajah, Z. Tang, and D. Manocha. Improving reverberant speech separation with multi-stage training and curriculum learning. arXiv preprint arXiv:2107.09177, 2021.

L. Aspöck, M. Vorländer, F. Brinkmann, D. Ackermann, and S. Weinzierl. Benchmark for room acoustical simulation (BRAS). DOI, 10:14279, 2020.

M. Barron. Auditorium acoustics and architectural design. E & FN Spon, 2010.

L. L. Beranek and T. Mellow. Acoustics: sound fields and transducers. Academic Press, 2012.

T. Betlehem and T. D. Abhayapala. Theory and design of sound field reproduction in reverberant rooms. The Journal of the Acoustical Society of America, 117(4):2100–2111, 2005.

S. Bilbao, B. Hamilton, J. Botts, and L. Savioja. Finite volume time domain room acoustics simulation under general impedance boundary conditions. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(1):161–173, 2015.

M. Bloesch, J. Czarnowski, R. Clark, S. Leutenegger, and A. J. Davison. Codeslam - learning a compact, optimisable representation for dense visual slam. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

I. Bork. A comparison of room simulation software - the 2nd round robin on room acoustical computer simulation. Acta Acustica united with Acustica, 86(6):943–956, 2000.

I. Bork. Report on the 3rd round robin on room acoustical computer simulation - part i: Measurements. Acta Acustica united with Acustica, 91(4):740–752, 2005a.

I. Bork.
Report on the 3rd round robin on room acoustical computer simulation?part ii: Calculations. Acta Acustica united with Acustica, 91(4):753?763, 2005b. D. Botteldooren. Finite-difference time-domain simulation of low-frequency room acoustic problems. The Journal of the Acoustical Society of America, 98(6):3302? 3308, 1995. F. Brinkmann, L. Asp?ck, D. Ackermann, S. Lepa, M. Vorl?nder, and S. Weinzierl. A round robin on room acoustical simulation and auralization. The Journal of the Acoustical Society of America, 145(4):2746?2760, 2019. J. Briot, G. Hadjeres, and F. Pachet. Deep learning techniques for music generation - A survey. CoRR, abs/1709.01620, 2017. N. J. Bryan. Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1?5. IEEE, 2020. N. J. Bryan, J. S. Abel, and M. A. Kolar. Impulse response measurements in the presence of clock drift. In Audio Engineering Society Convention 129. Audio En- gineering Society, 2010. C. Cao, Z. Ren, C. Schissler, D. Manocha, and K. Zhou. Bidirectional sound transport. The Journal of the Acoustical Society of America, 141(5):3454?3454, 2017. J. Carletta et al. The ami meeting corpus: A pre-announcement. In Proceedings of the Second International Conference on Machine Learning for Multimodal Interaction, MLMI?05, page 28?39. Springer-Verlag, 2005. ISBN 3540325492. doi: 10.1007/ 11677482_3. M. Cartwright, B. Pardo, G. J. Mysore, and M. Hoffman. Fast and easy crowdsourced perceptual audio evaluation. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 619?623. IEEE, 2016. S. Cecchi, A. Carini, and S. Spors. Room response equalization?a review. Applied Sciences, 8(1):16, 2018. 99 C. R. A. Chaitanya, J. M. Snyder, K. Godin, D. Nowrouzezahrai, and N. Raghuvanshi. Adaptive sampling for sound propagation. IEEE transactions on visualization and computer graphics, 25(5):1846?1854, 2019. A. Chandak, C. Lauterbach, M. Taylor, Z. Ren, and D. Manocha. Ad-frustum: Adaptive frustum tracing for interactive sound propagation. IEEE Transactions on Visualization and Computer Graphics, 14(6):1707?1722, 2008. R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul 2017. doi: 10.1109/cvpr.2017.16. URL http://dx.doi.org/10.1109/CVPR.2017.16. C. Chen, U. Jain, C. Schissler, S. V. A. Gari, Z. Al-Halah, V. K. Ithapu, P. Robinson, and K. Grauman. Soundspaces: Audio-visual navigation in 3d environments. In Computer Vision?ECCV 2020: 16th European Conference, Glasgow, UK, August 23?28, 2020, Proceedings, Part VI 16, pages 17?36. Springer, 2020. L. Chen, Z. Li, R. K Maddox, Z. Duan, and C. Xu. Lip movements generation at a glance. In The European Conference on Computer Vision (ECCV), September 2018. S. Choi, Q.-Y. Zhou, and V. Koltun. Robust reconstruction of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5556?5565, 2015. F. Chollet et al. Keras. https://keras.io, 2015. C. L. Christensen and J. H. Rindel. A new scattering method that combines roughness and diffraction effects. In Forum Acousticum, Budapest, Hungary, pages 344?352, 2005. P. Coleman, L. Remaggi, and P. Jackson. S3a room impulse responses, 2020. A. I. Conference. Audio for virtual and augmented reality. 
AES Proceedings, 2018. J. Cosentino, M. Pariente, S. Cornell, A. Deleforge, and E. Vincent. Librimix: An open-source dataset for generalizable speech separation, 2020. T. J. Cox, B.-I. Dalenback, P. D?Antonio, J.-J. Embrechts, J. Y. Jeon, E. Mommertz, and M. Vorl?nder. A tutorial on scattering and diffusion coefficients for room acoustic surfaces. Acta Acustica united with ACUSTICA, 92(1):1?15, 2006. A. Dai, D. Ritchie, M. Bokeloh, S. Reed, J. Sturm, and M. Nie?ner. Scancomplete: Large-scale scene completion and semantic segmentation for 3d scans. In Proceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4578?4587, 2018. 100 P. Debevec. Image-based lighting. IEEE Computer Graphics and Applications, 22 (2):26?34, 2002. D. Di Carlo, P. Tandeitnik, C. Foy, A. Deleforge, N. Bertin, and S. Gannot. dechorate: a calibrated room impulse response database for echo-aware signal processing. arXiv preprint arXiv:2104.13168, 2021. M. Doulaty, R. Rose, and O. Siohan. Automatic optimization of data perturbation distributions for multi-style training in speech recognition. In Spoken Language Technology Workshop, 2017. L. Drude, J. Heitkaemper, C. Boeddeker, and R. Haeb-Umbach. Sms-wsj: Database, performance measures, and baseline recipe for multi-channel source separation and recognition. arXiv preprint arXiv:1910.13934, 2019. J. Eaton, N. D. Gaubitch, A. H. Moore, P. A. Naylor, J. Eaton, N. D. Gaubitch, A. H. Moore, P. A. Naylor, N. D. Gaubitch, J. Eaton, et al. Estimation of room acoustic parameters: The ace challenge. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 24(10):1681?1693, 2016. S. E. Eskimez, P. Soufleris, Z. Duan, and W. Heinzelman. Front-end speech en- hancement for commercial speaker verification systems. Speech Communication, 99:101?113, 2018. C. Evers, A. H. Moore, and P. A. Naylor. Acoustic simultaneous localization and mapping (a-slam) of a moving microphone array and its surrounding speakers. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6?10. IEEE, 2016. Z. Fan, V. Vineet, H. Gamper, and N. Raghuvanshi. Fast acoustic scattering using convolutional neural networks. In ICASSP 2020-2020 IEEE International Confer- ence on Acoustics, Speech and Signal Processing (ICASSP), pages 171?175. IEEE, 2020a. Z. Fan, V. Vineet, H. Gamper, and N. Raghuvanshi. Fast acoustic scattering using convolutional neural networks. In ICASSP 2020-2020 IEEE International Confer- ence on Acoustics, Speech and Signal Processing (ICASSP), pages 171?175. IEEE, 2020b. A. Farina. Simultaneous measurement of impulse response and distortion with a swept-sine technique. In Audio Engineering Society Convention 108. Audio Engi- neering Society, 2000. E. L. Ferguson, S. B. Williams, and C. T. Jin. Sound source localization in a multipath environment using convolutional neural networks. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2386? 2390. IEEE, 2018. 101 S. Foster. Impulse response measurement using golay codes. In ICASSP?86. IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 11, pages 929?932. IEEE, 1986. H. Fu, B. Cai, L. Gao, L.-X. Zhang, J. Wang, C. Li, Q. Zeng, C. Sun, R. Jia, B. Zhao, et al. 3d-front: 3d furnished rooms with layouts and semantics. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10933?10942, 2021. T. Funkhouser, I. Carlbom, G. Elko, G. Pingali, M. Sondhi, and J. West. 
A beam tracing approach to acoustic modeling for interactive virtual environments. In Proceedings of the 25th annual conference on Computer graphics and interactive techniques, pages 21?32. ACM, 1998a. T. Funkhouser, I. Carlbom, G. Elko, G. Pingali, M. Sondhi, and J. West. A beam tracing approach to acoustic modeling for interactive virtual environments. In Proceedings of the 25th annual conference on Computer graphics and interactive techniques, pages 21?32. ACM, 1998b. R. Gao, C. Chen, Z. Al-Halah, C. Schissler, and K. Grauman. Visualechoes: Spatial image representation learning through echolocation. In European Conference on Computer Vision, pages 658?676. Springer, 2020. M.-A. Gardner, K. Sunkavalli, E. Yumer, X. Shen, E. Gambaretto, C. Gagn?, and J.-F. Lalonde. Learning to predict indoor illumination from a single image. arXiv preprint arXiv:1704.00090, 2017. A. F. Genovese, H. Gamper, V. Pulkki, N. Raghuvanshi, and I. J. Tashev. Blind room volume estimation from single-channel noisy speech. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 231?235. IEEE, 2019. S. Gharib, H. Derrar, D. Niizumi, T. Senttula, J. Tommola, T. Heittola, T. Virtanen, and H. Huttunen. Acoustic scene classification: A competition review. In IEEE 28th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1?6. IEEE, 2018. F. Grondin, J.-S. Lauzon, S. Michaud, M. Ravanelli, and F. Michaud. Bird: Big impulse response dataset. arXiv preprint arXiv:2010.09930, 2020. P. Grumiaux, S. Kitic, L. Girin, and A. Gu?rin. A survey of sound source localization with deep learning methods. CoRR, abs/2109.03465, 2021. E. Hadad, F. Heese, P. Vary, and S. Gannot. Multichannel audio database in var- ious acoustic environments. In 14th International Workshop on Acoustic Signal Enhancement (IWAENC), pages 313?317. IEEE, 2014. 102 C. Hak, R. Wenmaekers, and L. Van Luxemburg. Measuring room impulse responses: Impact of the decay range on derived room acoustic parameters. Acta Acustica united with Acustica, 98(6):907?915, 2012. B. Hamilton. Pffdtd software, 2021. https://github.com/bsxfun/pffdtd. R. Hanocka, A. Hertz, N. Fish, R. Giryes, S. Fleishman, and D. Cohen-Or. Meshcnn: a network with an edge. ACM Transactions on Graphics (TOG), 38(4):1?12, 2019. S. H. Hawley, V. Chatziiannou, and A. Morrison. Synthesis of musical instrument sounds: Physics-based modeling or machine learning. Phys. Today, 16:20?28, 2020. S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, et al. Cnn architectures for large- scale audio classification. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 131?135. IEEE, 2017. G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. R. Mohamed, N. Jaitly, A. Senior, V. Van- houcke, P. Nguyen, and T. N. Sainath. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82?97, 2012. N. Hiremath, V. Kumar, N. Motahari, and D. Shukla. An overview of acoustic impedance measurement techniques and future prospects. Metrology, 1(1):17?38, 2021. Y. Hold-Geoffroy, K. Sunkavalli, S. Hadap, E. Gambaretto, and J.-F. Lalonde. Deep outdoor illumination estimation. In Proceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition, pages 7312?7321, 2017. S. Holm. A simple sequentially rejective multiple test procedure. 
Scandinavian journal of statistics, pages 65?70, 1979. A. ISO. Measurement of room acoustic parameters - part 1. ISO Std, 2009. D. L. James, J. Barbi?, and D. K. Pai. Precomputed acoustic transfer: output- sensitive, accurate sound generation for geometrically complex vibration sources. In ACM Transactions on Graphics (TOG), volume 25, pages 987?995. ACM, 2006. T. Jenrungrot, V. Jayaram, S. Seitz, and I. Kemelmacher-Shlizerman. The cone of silence: speech separation by localization. arXiv preprint arXiv:2010.06007, 2020. S. Ji, J. Luo, and X. Yang. A comprehensive survey on deep music generation: Multi- level representations, algorithms, evaluations, and future directions. arXiv preprint arXiv:2011.06801, 2020. X. Jin, S. Li, T. Qu, D. Manocha, and G. Wang. Deep-modal: real-time impact sound synthesis for arbitrary shapes. In Proceedings of the 28th ACM International Conference on Multimedia, pages 1171?1179, 2020. 103 J. T. Kajiya. The rendering equation. In ACM SIGGRAPH computer graphics, volume 20, pages 143?150. ACM, 1986. M. Karjalainen, P. Antsalo, A. Makivirta, T. Peltonen, and V. Valimaki. Estimation of modal decay parameters from noisy response measurements. In Audio Engineer- ing Society Convention 110. Audio Engineering Society, 2001. C. Kim, A. Misra, K. Chin, T. Hughes, A. Narayanan, T. N. Sainath, and M. Bac- chiani. Generation of large-scale simulated utterances in virtual rooms to train deep-neural networks for far-field speech recognition in google home. In Inter- speech, 2017. H. Kim, L. Remaggi, P. Jackson, and A. Hilton. Immersive spatial audio reproduc- tion for vr/ar using room acoustic modelling from 360? images. Proceedings IEEE VR2019, 2019. D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. M. Kleiner, P. Svensson, and B.-I. Dalenb?ck. Auralization: experiments in acoustical cad. In Audio Engineering Society Convention 89. Audio Engineering Society, 1990. C. Kling. Absorption coefficient database, Jul 2018. URL https: //www.ptb.de/cms/de/ptb/fachabteilungen/abt1/fb-16/ag-163/ absorption-coefficient-database.html. T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur. A study on data augmentation of reverberant speech for robust speech recognition. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5220?5224. IEEE, 2017. S. Koch, A. Matveev, Z. Jiang, F. Williams, A. Artemov, E. Burnaev, M. Alexa, D. Zorin, and D. Panozzo. Abc: A big cad model dataset for geometric deep learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019. V. Kothapally, W. Xia, S. Ghorbani, J. H. Hansen, W. Xue, and J. Huang. Skipcon- vnet: Skip convolutional neural network for speech dereverberation using optimally smoothed spectral mapping. arXiv preprint arXiv:2007.09131, 2020. S. Koyama, T. Nishida, K. Kimura, T. Abe, N. Ueno, and J. Brunnstr?m. Meshrir: A dataset of room impulse responses on meshed grid points for evaluating sound field analysis and synthesis methods. arXiv preprint arXiv:2106.10801, 2021. A. Krokstad, S. Strom, and S. S?rsdal. Calculating the acoustical room response by the use of a ray tracing technique. Journal of Sound and Vibration, 8(1):118?125, 1968. 104 S. Kurz, O. Rain, and S. Rjasanow. The adaptive cross-approximation technique for the 3d boundary-element method. IEEE transactions on Magnetics, 38(2):421?424, 2002. H. Kuttruff. Room Acoustics. Taylor & Francis Group, London, U. K., 6th edition, 2016. K. H. Kuttruff. 
Auralization of impulse responses modeled on the basis of ray-tracing results. Journal of the Audio Engineering Society, 41(11):876?880, 1993. P. Larsson, D. Vastfjall, and M. Kleiner. Better presence and performance in virtual environments by improved binaural sound rendering. In Virtual, Synthetic, and Entertainment Audio conference, Jun 2002. URL http://www.aes.org/e-lib/ browse.cfm?elib=11148. C. LeGendre, W.-C. Ma, G. Fyffe, J. Flynn, L. Charbonnel, J. Busch, and P. De- bevec. Deeplight: Learning illumination for unconstrained mobile mixed reality. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5918?5928, 2019. J. Lehtinen. Time-domain numerical solution of the wave equation. Feb, 6:1?17, 2003. D. Li, Y. Fei, and C. Zheng. Interactive acoustic transfer approximation for modal sound. ACM Transactions on Graphics (TOG), 35(1):1?16, 2015. D. Li, T. R. Langlois, and C. Zheng. Scene-aware audio for 360? videos. ACM Trans. Graph., 37(4), 2018. G. N. Lilis, D. Angelosante, and G. B. Giannakis. Sound field reproduction using the lasso. IEEE Transactions on Audio, Speech, and Language Processing, 18(8): 1902?1912, 2010. S. H. Linkwitz. Active crossover networks for noncoincident drivers. Journal of the Audio Engineering Society, 24(1):2?8, 1976. Q. Liu, M. J. Kusner, and P. Blunsom. A survey on contextual embeddings. arXiv preprint arXiv:2003.07278, 2020. S. Liu and D. Manocha. Sound synthesis, propagation, and rendering: A survey. arXiv preprint arXiv:2011.05538, 2020. Y. Luo, Z. Chen, and T. Yoshioka. Dual-path rnn: efficient long sequence modeling for time-domain single-channel speech separation, 2020. M. Malik, M. K. Malik, K. Mehmood, and I. Makhdoom. Automatic speech recogni- tion: a survey. Multimedia Tools and Applications, 80(6):9411?9457, 2021. S. Marburg. Six boundary elements per wavelength: Is that enough? Journal of computational acoustics, 10(01):25?51, 2002. 105 R. Mehra, N. Raghuvanshi, L. Antani, A. Chandak, S. Curtis, and D. Manocha. Wave-based sound propagation in large open scenes using an equivalent source formulation. ACM Transactions on Graphics (TOG), 32(2):19, 2013. R. Mehra, A. Rungta, A. Golas, M. Lin, and D. Manocha. Wave: Interactive wave- based sound propagation for virtual environments. IEEE transactions on visual- ization and computer graphics, 21(4):434?442, 2015. H.-Y. Meng, Z. Tang, and D. Manocha. Point-based acoustic scattering for interactive sound propagation via surface encoding. arXiv preprint arXiv:2105.08177, 2021. B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In European conference on computer vision, pages 405?421. Springer, 2020. M. K. Mishra and J. Viradiya. Survey of sentence embedding methods. International Journal of Applied Science and Computations, 6(3):592?592, 2019. N. Morales and D. Manocha. Efficient wave-based acoustic material design optimiza- tion. Computer-Aided Design, 78:83?92, 2016. N. Morales, R. Mehra, and D. Manocha. A parallel time-domain wave simulator based on rectangular decomposition for distributed memory architectures. Applied Acoustics, 97:104?114, 2015. N. Morales, Z. Tang, and D. Manocha. Receiver placement for speech enhancement using sound propagation optimization. Applied Acoustics, 155:53?62, 2019. G. M?ckl and C. Dachsbacher. Precomputing sound scattering for structured surfaces. In EGPGV@ EuroVis, pages 73?80, 2014. G. J. Mysore. 
Can we automatically transform speech recorded on common consumer devices in real-world environments into professional production quality speech??a dataset, insights, and challenges. IEEE Signal Processing Letters, 22(8):1006?1010, 2014. V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann ma- chines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807?814, 2010. A. Owens, P. Isola, J. McDermott, A. Torralba, E. H. Adelson, and W. T. Freeman. Visually indicated sounds. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2405?2413, 2016. V. Panayotov, G. Chen, D. Povey, and S. Khudanpur. Librispeech: An asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206?5210, 2015. doi: 10.1109/ICASSP.2015.7178964. 106 M. Pariente, S. Cornell, J. Cosentino, S. Sivasankaran, E. Tzinis, J. Heitkaemper, M. Olvera, F.-R. St?ter, M. Hu, J. M. Mart?n-Do?as, D. Ditter, A. Frank, A. Dele- forge, and E. Vincent. Asteroid: the PyTorch-based audio source separation toolkit for researchers. In Proc. Interspeech, 2020. K. Park and T. Mulc. CSS10: A collection of single speaker speech datasets for 10 languages. CoRR, abs/1903.11269, 2019. URL http://arxiv.org/abs/1903. 11269. S. Pelzer, L. Asp?ck, D. Schr?der, and M. Vorl?nder. Integrating real-time room acoustics simulation into a cad modeling software to enhance the architectural design process. Buildings, 4(2):113?138, 2014. A. P?rez-L?pez and J. De Muynke. Ambisonics directional room impulse response as a new convention of the spatially oriented format for acoustics. In Audio Engineering Society Convention 144. Audio Engineering Society, 2018. M. Pharr, W. Jakob, and G. Humphreys. Physically based rendering: From theory to implementation. Morgan Kaufmann, 2016. A. D. Pierce and R. T. Beyer. Acoustics: An introduction to its physical principles and applications. 1989 edition, 1990. M. A. Poletti. Three-dimensional surround sound systems based on spherical har- monics. Journal of the Audio Engineering Society, 53(11):1004?1025, 2005. V. Pulkki and U. P. Svensson. Machine-learning-based estimation and rendering of scattering in virtual reality. The Journal of the Acoustical Society of America, 145 (4):2664?2676, 2019. S. Purushwalkam, S. V. A. Gari, V. K. Ithapu, C. Schissler, P. Robinson, A. Gupta, and K. Grauman. Audio-visual floorplan reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1183?1192, 2021. N. Raghuvanshi and J. Snyder. Parametric wave field coding for precomputed sound propagation. ACM Trans. Graph., 33(4):38:1?38:11, July 2014a. ISSN 0730-0301. doi: 10.1145/2601097.2601184. URL http://doi.acm.org/10.1145/2601097. 2601184. N. Raghuvanshi and J. Snyder. Parametric wave field coding for precomputed sound propagation. ACM Transactions on Graphics (TOG), 33(4):38, 2014b. N. Raghuvanshi and J. Snyder. Parametric wave field coding for precomputed sound propagation. ACM Transactions on Graphics (TOG), 33(4):38, 2014c. N. Raghuvanshi and J. Snyder. Parametric directional coding for precomputed sound propagation. ACM Transactions on Graphics (TOG), 37(4):108, 2018. 107 N. Raghuvanshi, R. Narain, and M. C. Lin. Efficient and accurate sound propagation using adaptive rectangular decomposition. Visualization and Computer Graphics, IEEE Transactions on, 15(5):789?801, 2009. N. Raghuvanshi, J. Snyder, R. Mehra, M. Lin, and N. 
Govindaraju. Precomputed wave simulation for real-time sound propagation of dynamic sources in complex scenes. ACM Trans. Graph., 29(4):68:1?68:11, July 2010. ISSN 0730-0301. doi: 10. 1145/1778765.1778805. URL http://doi.acm.org/10.1145/1778765.1778805. A. Ratnarajah, Z. Tang, and D. Manocha. IR-GAN: Room Impulse Response Gen- erator for Far-Field Speech Recognition. In Proc. Interspeech 2021, pages 286?290, 2021. doi: 10.21437/Interspeech.2021-230. A. Ratnarajah, S.-X. Zhang, M. Yu, Z. Tang, D. Manocha, and D. Yu. FAST-RIR: Fast neural diffuse room impulse response generator. In 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022. C. K. A. Reddy, E. Beyrami, J. Pool, R. Cutler, S. Srinivasan, and J. Gehrke. A scalable noisy speech dataset and online subjective test framework, 2019. C. K. A. Reddy, V. Gopal, R. Cutler, E. Beyrami, R. Cheng, H. Dubey, S. Matu- sevych, R. Aichner, A. Aazami, S. Braun, P. Rana, S. Srinivasan, and J. Gehrke. The interspeech 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results, 2020. N. Reimers and I. Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084, 2019. Z. Ren, H. Yeh, and M. C. Lin. Example-guided physically based modal sound synthesis. ACM Transactions on Graphics (TOG), 32(1):1, 2013. C. Richey, M. A. Barrios, Z. Armstrong, C. Bartels, H. Franco, M. Graciarena, A. Lawson, M. K. Nandwana, A. Stauffer, J. van Hout, P. Gamble, J. Hetherly, C. Stephenson, and K. Ni. Voices obscured in complex environmental settings (voices) corpus, 2018. L. Rizzi, G. Ghelfi, and M. Santini. Small-rooms dedicated to music: From room response analysis to acoustic design. In Audio Engineering Society Convention 140. Audio Engineering Society, 2016. M. Rosen, K. W. Godin, and N. Raghuvanshi. Interactive Sound Propagation For Dynamic Scenes Using 2d Wave Simulation. Computer Graphics Forum, 2020. ISSN 1467-8659. doi: 10.1111/cgf.14099. J. L. Roux, S. Wisdom, H. Erdogan, and J. R. Hershey. SDR - half-baked or well done? CoRR, abs/1811.02508, 2018. URL http://arxiv.org/abs/1811.02508. 108 A. Rungta, S. Rust, N. Morales, R. Klatzky, M. Lin, and D. Manocha. Psychoacoustic characterization of propagation effects in virtual environments. ACM Transactions on Applied Perception (TAP), 13(4):21, 2016. A. Rungta, C. Schissler, N. Rewkowski, R. Mehra, and D. Manocha. Diffraction ker- nels for interactive sound propagation in dynamic environments. IEEE transactions on visualization and computer graphics, 24(4):1613?1622, 2018. W. C. Sabine. Collected papers on acoustics. 1927. J. Salamon and J. P. Bello. Deep convolutional neural networks and data augmenta- tion for environmental sound classification. IEEE Signal Processing Letters, 24(3): 279?283, 2017. L. Savioja and U. P. Svensson. Overview of geometrical room acoustic modeling techniques. The Journal of the Acoustical Society of America, 138(2):708?730, 2015. doi: 10.1121/1.4926438. R. W. Schafer and A. V. Oppenheim. Discrete-time signal processing. Prentice Hall Englewood Cliffs, NJ, 1989. C. Schissler and D. Manocha. Interactive sound propagation and rendering for large multi-source scenes. ACM Transactions on Graphics (TOG), 36(1):2, 2016. C. Schissler and D. Manocha. Interactive sound propagation and rendering for large multi-source scenes. ACM Transactions on Graphics (TOG), 36(1):2, 2017. C. Schissler and D. Manocha. 
Interactive sound rendering on mobile devices using ray-parameterized reverberation filters. arXiv preprint arXiv:1803.00430, 2018. C. Schissler, R. Mehra, and D. Manocha. High-order diffraction and diffuse reflections for interactive sound propagation in large environments. ACM Transactions on Graphics (TOG), 33(4):39, 2014. C. Schissler, C. Loftin, and D. Manocha. Acoustic classification and optimization for multi-modal rendering of real-world scenes. IEEE Transactions on Visualization and Computer Graphics, 24(3):1246?1259, 2017. M. Schoeffler, F.-R. St?ter, B. Edler, and J. Herre. Towards the next generation of web-based experiments: A case study assessing basic audio quality following the itu-r recommendation bs. 1534 (mushra). In 1st Web Audio Conference, pages 1?6, 2015. M. R. Schroeder. The ?schroeder frequency?revisited. The Journal of the Acoustical Society of America, 99(5):3240?3241, 1996. P. Seetharaman and S. P. Tarzia. The hand clap as an impulse source for measuring room acoustics. In Audio Engineering Society Convention 132. Audio Engineering Society, 2012. 109 M. L. Seltzer, Y. Dong, and Y. Wang. An investigation of deep neural networks for noise robust speech recognition. In IEEE International Conference on Acoustics, 2013. B. Series. Recommendation ITU-R BS. 1534-3 method for the subjective assessment of intermediate quality level of audio systems. International Telecommunication Union Radio Communication Assembly, 2014. P. Series. Methods for objective and subjective assessment of speech and video quality. International Telecommunication Union Radiocommunication Assembly, 2016. J. O. Smith III. Spectral Audio Signal Processing. 01 2008. A. Southern, S. Siltanen, and L. Savioja. Spatial room impulse responses with a hybrid modeling method. In Audio Engineering Society Convention 130. Audio Engineering Society, 2011. A. Southern, S. Siltanen, D. T. Murphy, and L. Savioja. Room impulse response synthesis and validation using a hybrid acoustic model. IEEE Transactions on Audio, Speech, and Language Processing, 21(9):1940?1952, 2013. A. Sterling, J. Wilson, S. Lowe, and M. C. Lin. Isnn: Impact sound neural network for audio-visual object classification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 555?572, 2018. A. Sterling, N. Rewkowski, R. L. Klatzky, and M. C. Lin. Audio-material reconstruc- tion for virtualized reality using a probabilistic damping model. IEEE transactions on visualization and computer graphics, 25(5):1855?1864, 2019. S. S. Stevens, J. Volkmann, and E. B. Newman. A scale for the measurement of the psychological magnitude pitch. The Journal of the Acoustical Society of America, 8(3):185?190, 1937. U. P. Svensson, R. I. Fred, and J. Vanderkooy. An analytic secondary source model of edge diffraction impulse responses. The Journal of the Acoustical Society of America, 106(5):2331?2344, 1999. I. Sz?ke, M. Sk?cel, L. Mo?ner, J. Paliesek, and J. ?ernocky?. Building and evaluation of a real room impulse response dataset. IEEE Journal of Selected Topics in Signal Processing, 13(4):863?876, 2019. Q. Tan, L. Gao, Y.-K. Lai, J. Yang, and S. Xia. Mesh-based autoencoders for localized deformation component analysis. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018. Z. Tang, N. J. Bryan, D. Li, T. R. Langlois, and D. Manocha. Scene-aware audio rendering via deep acoustic analysis. arXiv preprint arXiv:1911.06245, 2019a. 110 Z. Tang, J. Kanu, K. Hogan, and D. Manocha. 
Regression and classification for direction-of-arrival estimation with convolutional recurrent neural networks. In Interspeech, 2019b. Z. Tang, L. Chen, B. Wu, D. Yu, and D. Manocha. Improving reverberant speech training using diffuse acoustic simulation. In ICASSP 2020-2020 IEEE Interna- tional Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6969?6973. IEEE, 2020. Z. Tang, H.-Y. Meng, and D. Manocha. Learning acoustic scattering fields for dy- namic interactive sound propagation. In 2021 IEEE Virtual Reality and 3D User Interfaces (VR), pages 835?844. IEEE, 2021. Z. Tang, R. Aralikatti, A. Ratnarajah, , and D. Manocha. Gwa: A large geometric- wave acoustic dataset for audio deep learning, 2022. M. Taylor, A. Chandak, Q. Mo, C. Lauterbach, C. Schissler, and D. Manocha. Guided multiview ray tracing for fast auralization. IEEE Transactions on Visualization and Computer Graphics, 18:1797?1810, 2012a. M. Taylor, A. Chandak, Q. Mo, C. Lauterbach, C. Schissler, and D. Manocha. Guided multiview ray tracing for fast auralization. IEEE Transactions on Visualization and Computer Graphics, 18(11):1797?1810, 2012b. M. T. Taylor, A. Chandak, L. Antani, and D. Manocha. Resound: interactive sound rendering for dynamic virtual environments. In Proceedings of the 17th ACM in- ternational conference on Multimedia, pages 271?280. ACM, 2009. R. A. Tenenbaum, F. O. Taminaro, and V. Melo. Room acoustics modeling using a hybrid method with fast auralization with artificial neural network techniques. In Proc. International Congress on Acoustics (ICA), pages 6420?6427, 2019. L. L. Thompson. A review of finite-element methods for time-harmonic acoustics. The Journal of the Acoustical Society of America, 119(3):1315?1330, 2006. N. Tits, K. El Haddad, and T. Dutoit. Emotional speech datasets for english speech synthesis purpose: A review. In Proceedings of SAI Intelligent Systems Conference, pages 61?66. Springer, 2019. I. R. Titze, L. M. Maxfield, and M. C. Walker. A formant range profile for singers. Journal of Voice, 31(3):382.e9 ? 382.e13, 2017. ISSN 0892-1997. URL http:// www.sciencedirect.com/science/article/pii/S0892199716301096. J. Traer and J. H. McDermott. Statistics of natural reverberation enable perceptual separation of sound and space. Proceedings of the National Academy of Sciences, 113(48):E7856?E7865, 2016. 111 N. Tsingos. Precomputing geometry-based reverberation effects for games. In Audio Engineering Society Conference: 35th International Conference: Audio for Games. Audio Engineering Society, 2009. N. Tsingos, T. Funkhouser, A. Ngan, and I. Carlbom. Modeling acoustics in virtual environments using the uniform theory of diffraction. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 545?552. ACM, 2001. D. Tsokaktsidis, T. Von Wysocki, F. Gauterin, and S. Marburg. Artificial neural network predicts noise transfer as a function of excitation and geometry. In Proc. International Congress on Acoustics (ICA), pages 4392?4396, 2019. V. V?lim?ki and J. Reiss. All about audio equalization: Solutions and frontiers. Applied Sciences, 6(5):129, 2016. V. Valimaki, J. D. Parker, L. Savioja, J. O. Smith, and J. S. Abel. Fifty years of artificial reverberation. IEEE Transactions on Audio, Speech, and Language Processing, 20(5):1421?1448, 2012. T. Virtanen, M. D. Plumbley, and D. Ellis. Computational analysis of sound scenes and events. Springer, 2018. M. Vorl?nder. 
Simulation of the transient and steady-state sound propagation in rooms using a new combined ray-tracing/image-source algorithm. The Journal of the Acoustical Society of America, 86(1):172?178, 1989. M. Vorl?nder. Computer simulations in room acoustics: Concepts and uncertainties. The Journal of the Acoustical Society of America, 133(3):1203?1213, 2013. M. Vorliander. International round robin on room acoustical computer simulations. In 15th Intl. Congress on Acoustics, Trondheim, Norway, pages 689?692, 1995. K. Wapenaar. Unified matrix?vector wave equation, reciprocity and representations. Geophysical Journal International, 216(1):560?583, 2019. M. A. Wieczorek and M. Meschede. Shtools: Tools for working with spherical har- monics. Geochemistry, Geophysics, Geosystems, 19(8):2574?2592, 2018. L. C. Wrobel and A. Kassab. Boundary element method, volume 1: Applications in thermo-fluids and acoustics. Appl. Mech. Rev., 56(2):B17?B17, 2003. T. Wu, Y. Jiang, N. Li, and T. Feng. An indoor sound source localization dataset for machine learning. In Proceedings of the 2018 2nd International Conference on Computer Science and Artificial Intelligence, pages 28?32, 2018. H. Yeh, R. Mehra, Z. Ren, L. Antani, D. Manocha, and M. Lin. Wave-ray coupling for interactive sound propagation in large complex scenes. ACM Transactions on Graphics (TOG), 32(6):165, 2013. 112 C. Zheng and D. L. James. Toward high-quality modal contact sound. In ACM SIGGRAPH 2011 papers, pages 1?12. 2011. X. Zheng, C. Wen, N. Lei, M. Ma, and X. Gu. Surface registration via foliation. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017. S. Zhi, M. Bloesch, S. Leutenegger, and A. J. Davison. Scenecode: Monocular dense semantic reconstruction using learned encoded scene representations. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019. Y. Zhou, Z. Wang, C. Fang, T. Bui, and T. L. Berg. Visual to sound: Generating natural sound for videos in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3550?3558, 2018. C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal. Algorithm 778: L-bfgs-b: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans. Math. Softw., 23(4):550?560, Dec. 1997. ISSN 0098-3500. URL http://doi.acm.org/ 10.1145/279232.279236. 113