ABSTRACT

Title of Dissertation: REAL-TIME AUDIO REVERBERATION FOR VIRTUAL ROOM ACOUSTICS

Justin M Shen, Master of Science, 2020

Dissertation Directed by: Professor Ramani Duraiswami, Department of Computer Science

For virtual and augmented reality applications, it is desirable to render audio sources in the space the user is in, in real time, without sacrificing the perceptual quality of the sound. One aspect of the rendering that is perceptually important for a listener is the late reverberation, or "echo," of the sound within a room environment. A popular method of generating a plausible late reverberation in real time is the use of a Feedback Delay Network (FDN). However, this approach has the drawback that the FDN first has to be tuned (usually manually) for a particular room before the late reverberation it generates becomes perceptually accurate. In this thesis, we propose a data-driven approach to automatically generate a pre-tuned FDN for any given room described by a set of room parameters. When combined with existing methods for rendering the direct path and early reflections of a sound source, we demonstrate the feasibility of rendering audio sources in real time for interactive applications.

REAL-TIME AUDIO REVERBERATION FOR VIRTUAL ROOM ACOUSTICS

by Justin M Shen

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Master of Science, 2020

Advisory Committee:
Professor Ramani Duraiswami, Chair/Advisor
Professor Matthias Zwicker
Assistant Professor Nirupam Roy

© Copyright by Justin M Shen 2020

Acknowledgments

I would like to take some time to acknowledge all the people who have made an impact on my graduate experience so far through the various interactions we have had. Without them, my graduate experience simply would not be the same, and I am grateful for all their encouragement and advice throughout the thesis process.

First, I would like to thank my advisor, Professor Ramani Duraiswami, for many years of guidance and teaching. Even as far back as my high school years, he was open to introducing me to the world of scientific research by providing me with the opportunity to work in his lab for my high school senior research project. That opportunity played no small part in my eventual decision to study Computer Science in college. My education throughout my undergraduate and graduate years has been greatly enriched thanks to his presence and encouragement.

I would also like to thank my committee members, Professor Matthias Zwicker and Professor Nirupam Roy. They have both taught courses I have taken in the past whose teachings helped expand my horizons and proved to be useful in my research. I am truly grateful for their flexibility and willingness to be a part of my committee, especially during this unusual time of COVID-19.

Lastly, I would like to thank everyone else who made my graduate experience possible: all my friends and family who have cared for and encouraged me along the way; Mr. Jason Filippou, Professor David Jacobs, Dr. Ilchul Yoon, Professor Udaya Shankar, and all the other professors who have provided me the opportunity to work as a teaching assistant throughout my undergraduate and graduate years; Tom Hurst for his stellar administrative support to graduate students; and Dorothea Brosius and the Institute for Research in Electronics and Applied Physics for providing the LaTeX thesis template.
Once again, I owe my gratitude to all the people who have made this thesis possible and have made my graduate experience unique.

Table of Contents

Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Abbreviations
1 Introduction
2 Background
  2.1 Artificial Reverberation
  2.2 Convolutional and Computational Acoustics
  2.3 Delay Network and Feedback Delay Network
    2.3.1 Feedback Matrix
    2.3.2 Delay Line and Length
  2.4 Important Perceptual Metrics for Reverberation
  2.5 Automatic Tuning of Feedback Delay Network
3 Methodology
  3.1 Automatic FDN Construction from Room Parameters
  3.2 Obtaining the Room Impulse Response
  3.3 Automatic Design of FDN
  3.4 Choice of Perceptual Metric
  3.5 SVM Regression for Generating FDN Parameters
4 Results
  4.1 Genetic Algorithm Optimization Loss Value
  4.2 SVM Results
    4.2.1 Small Room Data Set Results
    4.2.2 Large Room Data Set Results
  4.3 Real-time System Demonstration
5 Discussions
  5.1 Quality of the Training Data Set
  5.2 SVM Regression Performance
  5.3 Computational Cost Considerations
6 Conclusion
  6.1 Summary
  6.2 Possible Future Work

List of Tables

4.1 Training performance for the SVM regression model for FDN parameterization (g, m1, m2, m3, m4) for the small room data set.
4.2 Testing performance for the SVM regression model for FDN parameterization (g, m1, m2, m3, m4) for the small room data set.
4.3 Training performance for the SVM regression model for FDN parameterization (g, b, c) for the small room data set.
4.4 Testing performance for the SVM regression model for FDN parameterization (g, b, c) for the small room data set.
4.5 Training performance for the SVM regression model for FDN parameterization (g, b, c) for the large room data set.
4.6 Testing performance for the SVM regression model for FDN parameterization (g, b, c) for the large room data set.

List of Figures

2.1 Anatomy of a typical room impulse response, from [7].
2.2 Example FDN structure, from [7].
2.3 Example of a simulated impulse response from an FDN matched to have the same energy as the measured IR [11].
3.1 Flow chart of the proposed general method. The final regression model to deploy for inference is boxed in red.
4.1 Spectrogram of a sound signal convolved with the simulated RIR and with the FDN-produced RIR for various loss values. In each set of three spectrograms, the top is the original audio, the middle is the result produced from the simulated RIR, and the bottom is the result produced from the FDN-generated RIR.
4.2 Sample output from the SVM regression model trained on the small room data set with the FDN parameterized as (g, m1, m2, m3, m4). In each set of three spectrograms, the original audio (top), the result produced from the simulated RIR (middle), and the result produced from the FDN predicted by the SVM (bottom).
4.3 Sample output from the SVM regression model trained on the small room data set with the FDN parameterized as (g, b, c). In each set of three spectrograms, the original audio (top), the result produced from the simulated RIR (middle), and the result produced from the FDN predicted by the SVM (bottom).
4.4 Sample output from the SVM regression model trained on the large room data set with the FDN parameterized as (g, b, c). In each set of three spectrograms, the original audio (top), the result produced from the simulated RIR (middle), and the result produced from the FDN predicted by the SVM (bottom).
4.5 Plot of the total amount of time the mock demo took to process different numbers of audio samples.

List of Abbreviations

BEM   Boundary Element Method
BRIR  Binaural Room Impulse Response
EDC   Energy Decay Curve
FDN   Feedback Delay Network
FDTD  Finite-Difference Time-Domain
FEM   Finite Element Method
FFT   Fast Fourier Transform
HRTF  Head-Related Transfer Function
IR    Impulse Response
MFCC  Mel-frequency cepstral coefficients
RIR   Room Impulse Response
SVM   Support Vector Machine

Chapter 1: Introduction

When a sound wave propagates through a room, what is perceived by a listener is typically composed of the direct signal from the sound source to the listener, as well as the various indirect signals from the sound source bouncing off the surfaces of the room. This perceived reverberated sound can be divided into two perceptually different segments based on the room impulse response (RIR). The first segment is referred to as the early reflections, while the second segment is referred to as the late reverberation [19]. In this thesis, we are primarily interested in developing a framework for efficient approximation of the late reverberation of audio signals in an arbitrary environment.

Being able to approximate the late reverberation in real time is important for interactive applications where realistic audio rendering is required. For example, in augmented and virtual reality applications, it is often desirable to create as immersive and realistic an experience as possible. This can be realized with traditional head-up display hardware, which can render visual elements and play audio. In particular, we can imagine augmented or virtual reality use cases such as a virtual concert performance, where realistic audio can dramatically enhance the listening experience, since music performed in an amphitheater or concert hall carries a very rich tone. Similarly, for virtual conference meetings and virtual classrooms, realistic audio and visuals can potentially provide a higher level of engagement and more closely replicate the experience of in-person meetings, which is especially useful at times when in-person meetings might be inconvenient.
In audio rendering applications, the original, unaltered audio signal that a listener will hear is referred to as the dry audio, and the process of the dry audio propagating through a room environment and bouncing off the surfaces of the room to form the final reverberated sound a listener perceives can be referred to as reverberation. The point-to-point RIR of a room characterizes the behavior of a sound traveling from a source location to a receiver location, and the convolution of the dry audio signal with the RIR can be used to accurately render the reverberated sound [10]. Thus, to render audio that is similar to how a sound is actually perceived by a listener in a given room, it is often sufficient to compute the RIR of the room, along with the Head-Related Transfer Function (HRTF) unique to an individual.

While we can accurately compute a reverberated sound through convolution, that alone is not enough if we are interested in being able to render the reverberated sound in real time [18] [19]. To render the reverberated sound in real time, one approach is to approximate the late reverberation using a Feedback Delay Network (FDN), allowing us to apply other existing methods for rendering just the direct path and early reflections of the sound rather than the entire sound source [19]. This approach, however, has the drawback that the FDN must first be tuned for a particular room (often manually) before the late reverberation it generates becomes perceptually accurate.

To overcome this drawback, we propose a data-driven approach to automatically generate a pre-tuned FDN for any given room described by a set of room parameters. This approach involves building a data set to train a model to learn a mapping between a parametric model of a room and the parametric model of the FDN corresponding to that room. The main requirement for this model is that, once built, it can be used to infer the FDN parameters in real time as soon as information about the room environment is known. This, when combined with existing methods for rendering the direct path and early reflections of a sound source, allows us to render more realistic audio in real time.

In the next chapter we discuss relevant background and previous work related to our proposed approach. We then discuss the methodology of our approach and the results of our implementation in Chapters 3 and 4. Finally, we provide some discussion and conclusions regarding our work in the remaining chapters.

Chapter 2: Background

2.1 Artificial Reverberation

As mentioned earlier, reverberation is the result of sound bouncing off the surfaces of the room. More specifically, as the sound propagates through an environment, it is delayed and attenuated by its interactions with the environment, giving the listener a sense of the space and structure of the environment. Furthermore, a sound wave can bounce off surfaces in the room multiple times, meaning the resulting sound the listener hears is a combination of the various paths the sound waves took prior to reaching the ears. Sound that has bounced off surfaces only a few times can be heard more distinctly and is referred to as the early reflections of the sound. While the sound that traveled directly to the listener (the direct path) allows listeners to perceive the direction of the sound, the early reflections are distinct enough to give listeners a sense of the geometry and material of the room [19].
The early reflections end once the reverberation reaches its "asymptotic statistical behavior," but they are typically taken to be the first 80 to 100 ms of the sound signal [13]. An example of the different parts of a room impulse response (RIR) for characterizing the interaction of the sound with the environment is shown in Fig. 2.1. Since the direct path and early reflections are rather distinct, it is important to be able to accurately simulate them in order to render sound realistically. In contrast, the late reverberation that results from the sound bouncing off the surfaces of the room many times becomes less distinct and conveys a sense of the size of the environment and its absorbing power. Furthermore, this late reverberation can be characterized statistically, making it acceptable to approximate it without significant loss in realism [19].

Figure 2.1: Anatomy of a typical room impulse response, from [7]

Since the idea of recreating reverberation through artificial reverberation was first introduced by Schroeder in 1961, three main types of algorithms have been developed: delay networks, convolutional, and computational acoustics [19]. Convolutional approaches focus on obtaining good physical measurements of a room and being able to compute the convolution efficiently. Computational acoustics approaches simulate the propagation of acoustic signals in a geometric representation of the room and can be further broken down into two variants: one being the wave-based methods that aim to solve the wave equation numerically, and the other being ray-based methods that propagate sound as rays geometrically. Delay network approaches use networks of delay lines and digital filters to recreate the sound delays of a reverberation. A popular example of a delay network structure is the Feedback Delay Network (FDN), which will be the main focus of this thesis for producing reverberation in real time.

2.2 Convolutional and Computational Acoustics

The goal of convolutional approaches and computational acoustics approaches for simulating artificial reverberation is to compute the room impulse response (RIR). Once the RIR is computed, the artificial reverberation can be obtained by finding the convolution of the dry audio signal and the computed RIR [10] [18]. Both approaches are physically based, allowing them to accurately reproduce the reverberated audio.

As mentioned previously, convolutional approaches are concerned with physically measuring the RIR through sound recordings and applying the measured RIR to generate artificial reverberation efficiently. Once a recording of the RIR is obtained, post-processing is necessary to deal with noise and other limitations related to the recording device [19]. With the post-processing complete, the Fast Fourier Transform (FFT) algorithm and its variants are used to efficiently compute the convolution between the dry audio signal and the impulse response.

Meanwhile, for wave-based computational acoustics approaches, the wave equation can be solved through various numerical techniques in either the time or frequency domain [19]. Notable time-domain techniques include the finite-difference time-domain (FDTD) technique and its variants [5] [19]. For the frequency domain, the finite element method (FEM) [15] and boundary element method (BEM) are used [4] [9] [19]. Wave-based approaches are considered the most accurate way to simulate the RIR; however, they are also very computationally expensive.
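To make the convolutional rendering step concrete, the short Python sketch below convolves a dry (mono) signal with a measured or simulated RIR using FFT-based convolution; the file names are placeholders rather than assets from this work, and SciPy is assumed to be available. In an interactive setting the same convolution would be performed block by block (for example with overlap-add), but the offline form is enough to show the idea.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import fftconvolve

# Placeholder file names: a dry (anechoic, mono) recording and a measured or simulated RIR.
fs_dry, dry = wavfile.read("dry_audio.wav")
fs_rir, rir = wavfile.read("room_impulse_response.wav")
assert fs_dry == fs_rir, "dry signal and RIR must share a sampling rate"

# Convert to float and normalize to avoid integer overflow during convolution.
dry = dry.astype(np.float64) / np.max(np.abs(dry))
rir = rir.astype(np.float64) / np.max(np.abs(rir))

# FFT-based convolution of the dry signal with the RIR yields the reverberated ("wet") signal.
wet = fftconvolve(dry, rir, mode="full")

# Rescale and write out the reverberated result as 16-bit PCM.
wet /= np.max(np.abs(wet))
wavfile.write("reverberated_audio.wav", fs_dry, (wet * 32767).astype(np.int16))
```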
Ray-based computational acoustics approaches (also called geometrical acoustics) offer alternative techniques that are faster but less accurate [17] [19]. Because ray-based approaches treat sound as rays, they are more accurate for mid- to high-frequency sound waves and have difficulty capturing lower-frequency wave phenomena.

2.3 Delay Network and Feedback Delay Network

Feedback Delay Networks were first pioneered by Jot and Chaigne in 1991 and remain a state-of-the-art reverberation method [7] [19]. Feedback Delay Networks and other delay network methods aim to simulate reverberation through networks of delay lines and digital filters. For the purpose of generating high-quality artificial reverberation, a reverberator based on a Feedback Delay Network has the advantage that the energy storage, damping, and diffusion components related to the reverberation can be tuned independently. However, Feedback Delay Networks and other delay networks are not physically based, so they do not necessarily model the RIR accurately. Instead, they aim to capture the perceptual quality of the RIR [18].

An FDN has two major components: a set of delay lines and a feedback matrix. For a given sound source, the FDN generates the artificial reverberant sound by continually looping the signal through a set of delay lines (represented as a diagonal delay matrix) and then mixing the delayed signals with a feedback matrix (potentially with some additional filters in between).

An example of a full FDN is displayed below in Fig. 2.2 to clarify the interaction between the input audio signal u(n) and the components of an FDN. Besides the feedback matrix and delay lines (denoted q_{i,j} and z^{-M_i} in the figure), an FDN can also have a set of input gains (b_i), output gains (c_i), and feedback gains (g_i). Optionally, the output from each delay line may also be passed through a low-pass filter (not pictured). The remainder of Section 2.3 describes the different components of an FDN and their typical design, as described in [7] [18].

Figure 2.2: Example FDN structure, from [7]

2.3.1 Feedback Matrix

In an FDN, the feedback matrix serves the role of mixing the delayed signals to simulate the aggregate effect of a sound signal's interaction with the surfaces of the enclosing environment. For this reason, the feedback matrix is sometimes also referred to as the scattering matrix. The Hadamard and Householder feedback matrices are common choices for the feedback matrix in an FDN and are well studied, but in general any unitary matrix can be chosen to ensure stability of the FDN [18].

A Hadamard matrix is a square matrix whose entries are either 1 or -1 and whose rows are mutually orthogonal. Furthermore, it is known that if a Hadamard matrix of order n > 2 exists, then n is divisible by 4. Although there is no known formula for constructing all possible Hadamard matrices, there are known methods for constructing Hadamard matrices of a specific structure [1]. An example of a Hadamard matrix for n = 4 would be:

H_4 = [  1   1   1   1
         1  -1   1  -1
         1   1  -1  -1
         1  -1  -1   1 ]

The Householder matrix represents a reflection transformation about the hyperplane defined by the vector u [6] and is defined as

H_u = I - (2 / (u^T u)) u u^T    (2.1)
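To illustrate these two common feedback matrix choices, the following Python sketch constructs the order-4 Hadamard matrix by the Sylvester construction and a Householder matrix from an example vector u (Eq. 2.1), and checks that the normalized matrices are orthogonal; the vector and the order are arbitrary example values, not choices prescribed by this work.

```python
import numpy as np

def sylvester_hadamard(n: int) -> np.ndarray:
    """Hadamard matrix of order n (n a power of two) via the Sylvester construction."""
    assert n >= 1 and (n & (n - 1)) == 0, "this construction needs n to be a power of two"
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def householder(u: np.ndarray) -> np.ndarray:
    """Householder reflection about the hyperplane defined by u (Eq. 2.1)."""
    u = u.reshape(-1, 1)
    return np.eye(len(u)) - (2.0 / float(u.T @ u)) * (u @ u.T)

if __name__ == "__main__":
    H4 = sylvester_hadamard(4)
    Q = H4 / np.sqrt(4)        # normalize so that Q is orthogonal (unitary)
    u = np.ones(4)             # arbitrary example vector
    Hu = householder(u)

    # Both normalized matrices satisfy M M^T = I, which keeps the FDN feedback loop stable.
    print(np.allclose(Q @ Q.T, np.eye(4)))    # True
    print(np.allclose(Hu @ Hu.T, np.eye(4)))  # True
```

Either matrix, after the 1/sqrt(n) normalization in the Hadamard case, can then be scaled by a feedback gain and used as the feedback matrix in the structure of Fig. 2.2.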
2.3.2 Delay Line and Length

A delay line in the FDN specifies the amount of time by which a signal passed through it gets "delayed," and the number of delay lines that the FDN has is often referred to as the number of taps of the FDN. The delay line lengths should roughly correspond to the "mean free path," given approximately as d = 4V/S, due to Sabine, where S is the total surface area of the room and V its enclosed volume [10] [20]. More precisely, each delay line length M_i should be chosen to be mutually prime with the others (to maximize the number of samples the lossless reverberator prototype must go through before the impulse response repeats) and chosen to ensure a sufficiently high mode density in all frequency bands (typically, we want M ≥ 0.15 t60 fs, where M = Σ_{i=1}^{N} M_i).

To generate a set of prime-power delay line lengths, a common scheme is to parameterize each delay line length as an integer power of a distinct prime number, where the power is chosen as

m_i = ⌊0.5 + log(M_i) / log(p_i)⌋

for a given prime p_i, yielding the final set of delay line lengths {p_i^{m_i}}. In practice, to generate a good prime-power delay line set, we should ensure that the minimum delay line length roughly corresponds to the minimum acoustic ray length in the reverberator (that is, the desired delay between the sound source and receiver at the given sampling frequency). Similarly, we should bound the maximum delay line length to correspond to the maximum acoustic ray length ("room size") [18].

2.4 Important Perceptual Metrics for Reverberation

To compare how similar two room impulse responses are to one another in terms of how an individual perceives them, it is useful to be able to quantify any perceptual differences. The following features are considered good metrics for measuring the perceptual accuracy of a generated impulse response for a given sampling frequency [8]:

- t60(f): the desired reverberation time at each frequency f, used as a measure of perceived reverberation time. It is defined as the time it takes for the sound level in the room to decrease by 60 dB [16]. The Energy Decay Curve (EDC), defined as

  EDC(t) = ∫_t^∞ h^2(τ) dτ    (2.2)

  is often used to compute t60 since it decays more smoothly than the impulse response envelope [18]. t60 may be approximated by Sabine's formula [10] [20]:

  T = 0.049 V / (S ᾱ)    (2.3)

  for room dimensions measured in feet, where V is the volume of the room, S is the boundary surface area, and ᾱ is the average absorption coefficient given as

  ᾱ = (1/S) Σ_{i=1}^{n} s_i α_i    (2.4)

  where s_i is the area of a boundary surface and α_i is the absorption coefficient of the corresponding boundary surface.

- G^2(f): the signal power gain at each frequency.

- C50(f) or C80(f): the clarity or early-to-late index (related to the direct-to-reverberant ratio), important for the perception of room size and the perception of sound quality [12]. It can be defined as

  C80 = 10 log10 [ ∫_0^{80 ms} h^2(t) dt / ∫_{80 ms}^∞ h^2(t) dt ]    (2.5)

  where h(t) is the impulse response. C50 is defined similarly [16].

Other features of the impulse response, such as the Mel-frequency cepstral coefficients (MFCC) and the signal power envelope, can also serve as perceptual comparison metrics, though they are less common. Each of these metrics is computed for an individual impulse response, and two impulse responses are then compared based on the differences in their metric values. Another way to compare impulse responses is with the spectral distortion and the signal-to-distortion ratio. These two metrics can be used to directly compare two impulse responses by treating one of them (such as the FDN-generated impulse response) as a distortion of the other.
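As an illustration of the first and third metrics, the Python sketch below estimates t60 from the Schroeder energy decay curve (Eq. 2.2) and computes the clarity index (Eq. 2.5) for a mono impulse response; the decay-fitting range (-5 dB to -35 dB, extrapolated to -60 dB) and the toy exponentially decaying impulse response are illustrative choices, not values used in this thesis.

```python
import numpy as np

def energy_decay_curve_db(h: np.ndarray) -> np.ndarray:
    """Schroeder backward integration of the squared IR, normalized and expressed in dB (Eq. 2.2)."""
    edc = np.cumsum(h[::-1] ** 2)[::-1]
    return 10.0 * np.log10(edc / edc[0] + 1e-12)

def estimate_t60(h: np.ndarray, fs: float) -> float:
    """Fit a line to the EDC between -5 dB and -35 dB and extrapolate the decay to -60 dB."""
    edc_db = energy_decay_curve_db(h)
    t = np.arange(len(h)) / fs
    mask = (edc_db <= -5.0) & (edc_db >= -35.0)
    slope, _ = np.polyfit(t[mask], edc_db[mask], 1)   # decay rate in dB per second
    return -60.0 / slope

def clarity(h: np.ndarray, fs: float, early_ms: float = 80.0) -> float:
    """Clarity index C80 (or C50 for early_ms=50.0), Eq. 2.5."""
    split = int(round(early_ms * 1e-3 * fs))
    early = np.sum(h[:split] ** 2)
    late = np.sum(h[split:] ** 2)
    return 10.0 * np.log10(early / (late + 1e-12))

if __name__ == "__main__":
    fs = 48000
    # Toy exponentially decaying noise as a stand-in for a simulated RIR (roughly t60 = 0.5 s).
    t = np.arange(int(1.0 * fs)) / fs
    rir = np.random.randn(len(t)) * np.exp(-6.9 * t / 0.5)
    print("t60 ~", estimate_t60(rir, fs), "s")
    print("C80 ~", clarity(rir, fs, 80.0), "dB")
```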
2.5 Automatic Tuning of Feedback Delay Network

When Jot and Chaigne first proposed the FDN, its various components had to be manually designed and tuned for each room before a plausible late reverberation could be generated from the FDN. The FDN tuning is done in such a way as to match some perceptual metrics (see Fig. 2.3 for an example). Typically, FDNs are designed for a given reverberation time at several frequencies.

The idea of automatically tuning an FDN was first proposed in [2] and then further refined in [3]. The core idea behind these automatic tuning methods is to tune the FDN to match a given RIR by using the Genetic Algorithm to optimize some loss function with respect to the parameters describing the FDN. The loss function is a function of the given RIR and the generated RIR produced by the FDN, and it should ideally model a decrease in the perceptual differences between the two RIRs as the loss approaches zero. To ensure that perceptual differences are captured in the loss function, perceptual metrics are used in formulating possible loss functions.

Figure 2.3: Example of a simulated impulse response from an FDN matched to have the same energy as the measured IR [11].

A subset of the perceptual metrics described in Section 2.4 has been explored and compared in [3] for the automatic tuning of FDNs. The choice of the best set of perceptual metrics to use in the automatic tuning of an FDN is explored to a limited extent in this thesis, though it is outside of the thesis's main scope.

Chapter 3: Methodology

3.1 Automatic FDN Construction from Room Parameters

We aim to construct, in real time, an FDN that produces an impulse response that perceptually matches the impulse response of a given room described by a set of room parameters. A possible approach to accomplishing this goal is to break it into two sub-problems. The first sub-problem involves computing the RIR of the parameterized room and computing some perceptual metric to represent the RIR. The second sub-problem then involves finding an FDN that produces an impulse response matching the same perceptual metric. Here we propose an initial general method for solving the problem by combining the results from the two sub-problems to build a data set based on the correspondence between the set of room parameters and the FDN parameters. The data set can then be used to train a regression model to infer the FDN parameters from a given set of room parameters. The general method is outlined below (Fig. 3.1):

1. Generate a set of RIRs parameterized by a set of parameters describing a room
2. Apply an optimization algorithm to generate a matching FDN for each RIR over a set of FDN parameters with a perceptual loss function
3. Train a regression model to learn the mapping between room parameters and FDN parameters
4. Refine the inferred FDN to generate the late reverberation (optional)
5. Deploy the regression model to infer a plausible FDN for each room of interest

Figure 3.1: Flow chart of the proposed general method. The final regression model to deploy for inference is boxed in red.

One possible implementation approach for our proposed method, using a shoebox room Binaural Room Impulse Response (BRIR) generator, the Genetic Algorithm, and Support Vector Machine regression, is described in the rest of this chapter.
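As a rough sketch of steps 1 and 2 of this pipeline, the Python outline below shows how the training-data construction loop might be organized; the helper functions simulate_brir and fit_fdn_parameters are placeholders standing in for the BRIR simulator and Genetic Algorithm stages described in Sections 3.2 and 3.3, not the implementation used in this work.

```python
import numpy as np

def simulate_brir(length, width, height, betas, fs=48000):
    """Placeholder: shoebox BRIR simulation (Section 3.2); the left-ear channel is used as the RIR."""
    raise NotImplementedError

def fit_fdn_parameters(rir, fs=48000):
    """Placeholder: Genetic Algorithm matching of FDN parameters to a target RIR (Section 3.3)."""
    raise NotImplementedError

def build_dataset(room_list, fs=48000):
    """Steps 1-2: map each parameterized room to a matched set of FDN parameters."""
    X, y = [], []
    for (length, width, height, betas) in room_list:
        rir = simulate_brir(length, width, height, betas, fs)
        fdn_params = fit_fdn_parameters(rir, fs)
        X.append([length, width, height, *betas])   # room parameters (regression input)
        y.append(fdn_params)                         # FDN parameters (regression target)
    return np.array(X), np.array(y)
```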
3.2 Obtaining the Room Impulse Response

BRIR simulation for a box room is used to generate a set of room impulse responses, each corresponding to a room parameterized by its dimensions (length, width, and height) and the reflection coefficient of each of the six walls. Once the BRIR is simulated, the BRIR corresponding to the "left ear" is chosen to be used as the RIR, and the reverberation time and other possible perceptual metrics can be computed from this RIR. The particular method used to generate the BRIR is described in [21]. Note that a box room is used here for ease of parameterization. This process can be generalized further by simulating BRIRs for rooms with more complex geometries, so long as there is a reasonable way to parameterize those geometries.

3.3 Automatic Design of FDN

The automatic design of an FDN can be framed as an inverse problem: given an impulse response for some room, recover a set of parameters describing an FDN that produces the same impulse response (as closely as possible).

The previous approach to solving this problem uses the Genetic Algorithm to search for the FDN parameters, given an objective function whose optima correspond to a match of some chosen perceptual metric. In that sense, the inverse problem is solved through global optimization. The Genetic Algorithm is a viable optimization method for this case because it does not require the gradient of the perceptual loss function. Working with the gradient of the loss function is tricky because the loss function is a function of the impulse response generated from the FDN. This would require finding derivatives of a recurrent function, which can be difficult in practice. Automatic differentiation or numerical differentiation might make it possible to apply gradient-based optimization methods, and this can be explored in later work.

We take a modified approach to tuning the FDN based on the work of [3]. As in their approach, the impulse response is generated at a sampling rate of 48 kHz, and the first 4096 samples are taken directly from the original matching impulse response. The main modifications to the optimization process are that we use a four-tap FDN, the perceptual metric we use for the matching is different (see Section 3.4), and instead of constraining the tap delay line lengths to integer values that cover a scale of 1:2.5, with the longest value corresponding to 100 milliseconds, we only require that the integers be mutually prime. This can be achieved by optimizing over a set of integers corresponding to the exponents of the delay lengths over a fixed set of prime bases: {2, 3, 5, 7}. Further, we do not modify the input and output gains or the low-pass filter.

In summary, we optimize over a parameterization of the FDN given by {g, m1, m2, m3, m4}, where g is the feedback gain and the m_i are the delay line length exponents. We also try another parameterization where we optimize only the input gain (b), output gain (c), and feedback gain (g) of the FDN.
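To illustrate this parameterization, the Python sketch below builds a four-tap FDN with prime-power delay line lengths {2^m1, 3^m2, 5^m3, 7^m4}, a normalized Hadamard feedback matrix scaled by the feedback gain g, and fixed example input and output gains, and runs a unit impulse through it to obtain an approximate impulse response. The exponent and gain values are arbitrary examples, not tuned results from this work.

```python
import numpy as np

def fdn_impulse_response(g, exponents, n_samples=48000):
    """Impulse response of a four-tap FDN with prime-power delays and a Hadamard feedback matrix."""
    primes = np.array([2, 3, 5, 7])
    delays = primes ** np.asarray(exponents)          # delay line lengths in samples
    A = g * np.array([[1,  1,  1,  1],
                      [1, -1,  1, -1],
                      [1,  1, -1, -1],
                      [1, -1, -1,  1]]) / 2.0         # normalized Hadamard, scaled by feedback gain
    b = np.ones(4)                                    # input gains (example values)
    c = np.ones(4) * 0.25                             # output gains (example values)

    buffers = [np.zeros(d) for d in delays]           # circular delay line buffers
    write_idx = np.zeros(4, dtype=int)
    out = np.zeros(n_samples)

    for n in range(n_samples):
        x = 1.0 if n == 0 else 0.0                    # unit impulse input
        # Read each delay line output (the sample written delays[i] steps ago).
        s = np.array([buffers[i][write_idx[i]] for i in range(4)])
        out[n] = float(c @ s)
        # Feed the mixed delayed signals plus the new input back into the delay lines.
        v = b * x + A @ s
        for i in range(4):
            buffers[i][write_idx[i]] = v[i]
            write_idx[i] = (write_idx[i] + 1) % delays[i]
    return out

if __name__ == "__main__":
    h = fdn_impulse_response(g=0.8, exponents=[9, 6, 4, 4])   # delays of 512, 729, 625, 2401 samples
    print("peak of late tail:", np.max(np.abs(h[4096:])))
```

In the optimization described above, the exponents m_i and the gain g (or, in the alternative parameterization, g, b, and c) are the quantities the Genetic Algorithm adjusts to minimize the perceptual loss.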
3.4 Choice of Perceptual Metric

Some options explored by [3] so far for suitable perceptual metrics are ISO acoustic metric (such as clarity index) matching, energy decay (EDC) matching, MFCC matching, and power envelope matching. For simplicity and as a proof of concept, we chose the following loss function:

L(IR, IR_FDN) = (8/10)(t60(IR) - t60(IR_FDN))^2 + (2/10)(C50(IR) - C50(IR_FDN))^2    (3.1)

where IR is the impulse response we want to match, IR_FDN is the impulse response generated from the FDN, and t60(·) and C50(·) are the reverberation time and clarity index computed for each respective impulse response.

3.5 SVM Regression for Generating FDN Parameters

To achieve real-time reverberation generation for a given room, just using the Genetic Algorithm to construct the FDN parameters would not be sufficient, because the optimization takes a non-trivial amount of time to compute. Thus, ideally we would want to be able to generate the FDN parameters in real time as soon as information about the room becomes available. To achieve this, we can train a Support Vector Machine (SVM) regression model and run inference on the trained model to predict the desired set of FDN parameters from the given room parameters in real time.

Training the SVM model first requires training data. This training data can be generated by first using the BRIR simulator and then applying the Genetic Algorithm to build a correspondence between the room parameters that yield a particular RIR and a suitable FDN.

Two sets of training data were generated. The first data set consists of RIRs generated from rooms with length and width varying between 3 and 8 meters and a height of 2.7 meters. The reflection coefficients for the walls are fixed, with the side walls all set to the same reflection coefficient and the top and bottom surfaces set to a different coefficient. Similarly, the second data set consists of RIRs generated from rooms with length and width varying between 22 and 27 meters and a height of 8 meters. The reflection coefficients for the walls are set in the same way as in the previous data set. We refer to the first data set as the small room data set and the second data set as the large room data set.

For each impulse response generated, we compute the reverberation time and clarity index and then use the automatic FDN tuning approach described above to generate the corresponding FDN parameters that "best" match the impulse response in terms of the perceptual loss function (3.1). This establishes a correspondence between the room parameters that generated the impulse response and the FDN parameters. This correspondence can be used as training data for the SVM regression.

Chapter 4: Results

4.1 Genetic Algorithm Optimization Loss Value

Three sample results from the RIR matching using the Genetic Algorithm are displayed below (Fig. 4.1(a), 4.1(b), 4.1(c)) in decreasing order of final loss value as calculated by Equation 3.1. The figures display the spectrogram of the reverberated sound produced using the simulated RIR compared to the spectrogram of the reverberated sound produced from the RIR generated by the matching FDN.

The visualization of the spectrograms and the sound produced from the two RIRs demonstrates visually the difference between the dry audio and the reverberated sound. It also illustrates visually how the loss value affects the matching of two RIRs.

Figure 4.1: Spectrogram of a sound signal convolved with the simulated RIR and with the FDN-produced RIR for various loss values: (a) loss value of 18.3822, (b) loss value of 2.4727, (c) loss value of 1.0042 × 10^-4. In each set of three spectrograms, the top is the original audio, the middle is the result produced from the simulated RIR, and the bottom is the result produced from the FDN-generated RIR.
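Before presenting the regression results, a minimal sketch of the regression step from Section 3.5 is given below. It fits one support vector regressor per FDN parameter with an RBF kernel and with a polynomial kernel, using scikit-learn as a stand-in for the MATLAB implementation used in this work; the arrays X (room parameters) and y (matched FDN parameters) are assumed to come from the data-set construction described above, and the file names are placeholders.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.multioutput import MultiOutputRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# X: one row of room parameters per room (length, width, height, reflection coefficients).
# y: the matched FDN parameters for that room, e.g. (g, b, c) or (g, m1, m2, m3, m4).
X = np.load("room_parameters.npy")   # placeholder file names
y = np.load("fdn_parameters.npy")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "rbf": MultiOutputRegressor(SVR(kernel="rbf")),
    "poly": MultiOutputRegressor(SVR(kernel="poly", degree=3)),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # Per-parameter mean squared error, analogous to the MSE columns in Tables 4.1-4.6.
    mse = mean_squared_error(y_test, pred, multioutput="raw_values")
    print(name, mse)
```

Note that in the experiments below the training/testing split is made by thresholding the Genetic Algorithm loss value rather than at random, and the perceptual loss of Eq. (3.1), not the per-parameter MSE, is the more meaningful measure of quality.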
4.2 SVM Results

We want to build a regression model to map some parameterization of a box room to the corresponding parameterization of the FDN that would produce a perceptually identical room impulse response. In the impulse generation process, we have a simplified parameterization of the room given by (ℓ, w, h), where ℓ, w, and h are the length, width, and height of the box room, and the reflection coefficient of each wall is described by a set of fixed β values taken to be in the range 0 to 1. To represent the FDN, we parameterize it as (g, m1, m2, m3, m4), where g is the feedback gain that multiplies the feedback matrix and m_i is the exponent to which we raise the i-th prime basis for our prime-power delay lines. For the four-tap FDN we are modeling, the prime bases used are 2, 3, 5, and 7. We alternatively parameterize the FDN as (g, b, c), where g is the feedback gain, b is the input gain, and c is the output gain.

With the Genetic Algorithm we were able to establish more than 100 correspondences so far between room and FDN parameters for both the large room data set and the small room data set.

4.2.1 Small Room Data Set Results

For the small room data set, we trained four SVM regression models: for each parameterization of the FDN, we trained an SVM model with the Gaussian (RBF) kernel and an SVM model with the polynomial kernel. The SVM regression implementation from MATLAB is used, with the hyperparameters of the SVM model set to be determined automatically.

For each model we report the mean squared error of the prediction for each FDN parameter. The mean squared error by itself is not very informative; rather, the loss function described in Equation 3.1 evaluated over the predicted FDN is much more indicative of the perceptual relevance of the result. Thus, we also report the perceptual loss values for each model.

The training and testing results for the first FDN parameterization (g, m1, m2, m3, m4) are tabulated in Tables 4.1 and 4.2 below. The training data is restricted to samples with a loss value less than 0.8 from the initial Genetic Algorithm matching, and the remaining data serves as the testing data set. The threshold is set in such a way as to ensure the training data set is approximately the same size across all the models trained.

                      FDN Parameters MSE                            Perceptual Loss Stat.
Regression Model      g        m1        m2        m3       m4       Mean       Median
RBF Kernel SVM        0.0727   2.0345    4.8966    1.1724   1.0690   1.9955     2.1470
Poly. Kernel SVM      0.0284   4.1379    4.9310    0.7586   0.4828   45.7648    2.3623

Table 4.1: Training performance for the SVM regression model for FDN parameterization (g, m1, m2, m3, m4) for the small room data set.

                      FDN Parameters MSE                            Perceptual Loss Stat.
Regression Model      g        m1        m2        m3       m4       Mean       Median
RBF Kernel SVM        0.0840   16.2254   12.0704   3.2113   1.4225   1.5700     1.0308
Poly. Kernel SVM      0.0861   22.7042   13.8592   4.5915   1.3099   246.0664   20.8343

Table 4.2: Testing performance for the SVM regression model for FDN parameterization (g, m1, m2, m3, m4) for the small room data set.

Sample spectrograms of the audio output from the FDN predicted by the SVM model are displayed in Fig. 4.2. The spectrograms of the dry audio and of the reverberated audio obtained by convolution with the RIR are also included for reference. Note that the sample output is not necessarily representative of how well the model performs.
Figure 4.2: Sample output from the SVM regression model trained on the small room data set with the FDN parameterized as (g, m1, m2, m3, m4): (a) sample result from the SVM with Gaussian kernel, (b) sample result from the SVM with polynomial kernel. In each set of three spectrograms, the original audio (top), the result produced from the simulated RIR (middle), and the result produced from the FDN predicted by the SVM (bottom).

Now we shift to reporting the SVM regression results for predicting the FDN parameterized by (g, b, c). As before, the training and testing results for the second FDN parameterization are tabulated in Tables 4.3 and 4.4 below. The training data is restricted to samples with a loss value less than 0.2 from the initial Genetic Algorithm matching, and the remaining data serves as the testing data set. Sample spectrograms of the audio output from the FDN predicted by the SVM model are displayed in Fig. 4.3 as before.

                      FDN Parameters MSE            Perceptual Loss Stat.
Regression Model      g        b        c           Mean     Median   Min      Max
RBF Kernel SVM        0.0083   0.0040   0.0047      0.0676   0.0343   0.0010   0.5456
Poly. Kernel SVM      0.0101   0.0053   0.0053      0.0713   0.0378   0.0003   0.3636

Table 4.3: Training performance for the SVM regression model for FDN parameterization (g, b, c) for the small room data set.

                      FDN Parameters MSE            Perceptual Loss Stat.
Regression Model      g        b        c           Mean     Median   Min      Max
RBF Kernel SVM        0.0734   0.0265   0.0051      0.1519   0.0622   0.0018   0.8613
Poly. Kernel SVM      0.0761   0.0320   0.0067      0.1332   0.0721   0.0016   0.7356

Table 4.4: Testing performance for the SVM regression model for FDN parameterization (g, b, c) for the small room data set.

Figure 4.3: Sample output from the SVM regression model trained on the small room data set with the FDN parameterized as (g, b, c): (a) sample result from the SVM with Gaussian kernel, (b) sample result from the SVM with polynomial kernel. In each set of three spectrograms, the original audio (top), the result produced from the simulated RIR (middle), and the result produced from the FDN predicted by the SVM (bottom).

4.2.2 Large Room Data Set Results

For the SVM regression models trained on the large room data set, we also report the mean squared training error and loss function statistics, in Tables 4.5 and 4.6. The threshold for the training set is now set to samples with a loss value less than 0.7 from the initial Genetic Algorithm matching, and the remaining data serves as the testing data set as before. Sample spectrograms of the audio output from the FDN predicted by the SVM model are displayed in Fig. 4.4. Models were only trained for the second FDN parameterization for the large room data set.

                      FDN Parameters MSE            Perceptual Loss Stat.
Regression Model      g        b        c           Mean      Median    Min       Max
RBF Kernel SVM        0.0045   0.0172   0.0113      31.2218   31.6700   25.0280   34.5177
Poly. Kernel SVM      0.0062   0.0159   0.0110      31.1467   32.0197   21.9203   47.1815

Table 4.5: Training performance for the SVM regression model for FDN parameterization (g, b, c) for the large room data set.

                      FDN Parameters MSE            Perceptual Loss Stat.
Regression Model      g        b        c           Mean      Median    Min       Max
RBF Kernel SVM        0.0075   0.0647   0.0397      31.4420   31.7469   26.3002   34.7566
Poly. Kernel SVM      0.0082   0.0569   0.0360      32.1311   31.8487   22.4158   62.6516

Table 4.6: Testing performance for the SVM regression model for FDN parameterization (g, b, c) for the large room data set.

Figure 4.4: Sample output from the SVM regression model trained on the large room data set with the FDN parameterized as (g, b, c): (a) sample result from the SVM with Gaussian kernel, (b) sample result from the SVM with polynomial kernel. In each set of three spectrograms, the original audio (top), the result produced from the simulated RIR (middle), and the result produced from the FDN predicted by the SVM (bottom).
4.3 Real-time System Demonstration

To demonstrate the feasibility of approximating the late reverberation in real time using our proposed method, a mock demo was implemented in C/C++ to demonstrate the audio stream processing capability of the FDN. The demo consists of two system processes. The first process sends sample audio data through a pipe to simulate a dummy audio stream. The second process reads from the pipe, processes the audio stream in batches of 4800 samples, and outputs the resulting audio that has been passed through the FDN.

Figure 4.5: Plot of the total amount of time the mock demo took to process different numbers of audio samples.

The running time of the mock demo, run on a 1.6 GHz dual-core Intel Core i5 processor, is displayed in Fig. 4.5. This suggests that the FDN should be able to stream and process audio in real time, depending on the sampling frequency of the audio, the tap delay line lengths, and the processor speed.

At the time of this writing, a full virtual demo of our pre-tuned FDN is also under development using [14], which is a real-time virtual environment rendering system originally developed in the Spatial Auditory Displays Lab at NASA Ames Research Center.

Chapter 5: Discussions

5.1 Quality of the Training Data Set

Our method of obtaining FDN parameters that perceptually match some target impulse response is to set an objective function comparing the FDN-generated impulse response with the target and then optimize that objective with respect to the FDN parameters. This matching process is done for each RIR we generated from a set of room parameters; thus, the quality of our training data set for the SVM regression model is limited by how closely the Genetic Algorithm was able to tune the FDN parameters. In practice, we see from the results in Section 4.1 that the matching computed by the Genetic Algorithm is not necessarily always good, since there are already some visually distinct differences between the FDN-produced RIR and the simulated RIR when the loss value is around 2.

As with any optimization algorithm, the Genetic Algorithm can occasionally return a sub-optimal solution when trying to minimize the loss function. This can occur if the Genetic Algorithm did not converge within the given maximum number of iterations ("generations" in the specific context of the Genetic Algorithm) or if the optimization got stuck in a local minimum. In the context we are working with, a sub-optimal result means that the set of FDN parameters obtained from the Genetic Algorithm corresponds to an objective function value that is large enough to yield a noticeable perceptual difference between the generated impulse response and the desired impulse response.
Meanwhile, having results that have not yet converge before the maximum number of generations can be overcome simply by increasing the maximum allowed generations or setting the algorithm to terminate only when certain objective threshold is met. This, however, can significantly increase the amount of time it takes to optimize for one impulse response, and the amount of increase may or may not be too much to handle. We have tried optimizing up to 15 generations with a population size of 20. We have applied these methods to improve our data set quality and have driven down the loss function value calculated from each optimization to be less than one. User listening tests may be necessary to confirm whether the FDN produced from the SVM trained on the data set is sufficiently good for application purposes. 29 5.2 SVM Regression Performance In section 4.2, the mean squared training error serves mainly as an indication of the SVM model?s ability to learn the patterns captured in the training data. We see evidence of this through the SVM model for the large room data set where despite low mean squared training error the loss value remains high. While this training error is not necessarily indicative of the perceptual relevance of the SVM model, it can be useful for comparing between regression models. The loss function value of the RIR generated from the predicted FDN is more directly relevant to the perceptual performance of the SVM regression in actual au- dio rendering application. This can be seen in section 4.1 where we see that the spectrogram of the reverberant sound produced from the optimized FDN more visu- ally resemble the spectrogram of the reverberant sound produced from the simulated RIR with lower loss values. Base on the loss value the polynomial kernel SVM model trained on the small room data set has a higher variance in terms of being able to predict a good set of FDN parameters for the given room dimension compared to the Gaussian kernel SVM model. When predicting the FDN with the first parameterization, notice that the mean squared training error of the mi?s are significantly larger than that of the feedback gain parameter g. This suggests that the regression model is not learning the mi exponents well. The mi?s parameters seems more difficult to learn likely because the set of integer powers we optimize over is very small and contain patterns that is hard to disambiguate from overlap. It is also possible that there is little 30 correlation between the individual value of the mi values with the perceptual metrics we are matching, and that it is the aggregate behavior of the mi exponents that has relation to the perceptual metrics. Another interesting result from the SVM regression worth pointing out is the high loss value for the model trained and evaluated over the large room data set. This is in-spite of a mean training error lower than the mean training error for the SVM model trained on the small room data for predicting (g, m1, m2, m3, m4). A possible explanation for this poor result is that a four tap FDN is not sufficiently complex enough to model the reverberation of a large room; perhaps a eight tap FDN would be more sufficient for modeling larger rooms. Another possibility is that the delay line lengths chosen is not appropriate for modeling the larger rooms. 
Out of the different parameterization and data set we considered, training the SVM model on the small room data set to predict the feedback, input, and output gains parameters demonstrate that it is possible to construct a model that can predict good pre-tuned FDN that closely model the impulse response of a given room. 5.3 Computational Cost Considerations As seen in Section 4.3, the FDN is able to process the audio stream in ap- proximately linear time with respect to the duration of the stream. This mean depending on the quality of the audio we want to process, that is, its sampling rate, we can potentially process each signal before the next signal arrives: yielding 31 real-time performance. What this also mean is that the time it will take to run Genetic Algorithm is also approximately linear with respect to the length of the audio signal (or more specifically, the length of the impulse response). However, the constant factor can be quite large and have a significant impact on the running time. The constant factor associated with this linear running time has to do with the maximum length of our delay line and also the population size and maximum number of generations allowed for the Genetic Algorithm implementation. This in practice mean that the Genetic Algorithm can take time on the order of half an hour to run for a population of 10 and max generation of 5. This make the optimization procedure using the Genetic Algorithm relatively expensive, especially when scaled to a large data set size or considering that the algorithm is not guaranteed to converge to a good result within the maximum set generation. 32 Chapter 6: Conclusion 6.1 Summary Through the course of this thesis work we have examine previous works done to generate artificial reverberation. On the one hand, we have convolutional and computational acoustic approaches that are capabable of producing highly accurate reverberant sound, but at a high computational cost not practical for real-time applications. On the other hand, we have delay network based approach that is computationally efficient, but produce a lower quality reverberant sound. With the goal of producing higher quality reverberant sound in real-time, we developed a data-driven framework for enabling real-time approximation of late-reverberation. This approximation method can be combined with existing efficient methods for rendering the direct path and early reflections of a sound source to render the full reverberant audio source in real-time, as shown in the mock demo. This data- driven approach can be view as a hybrid method that takes advantage of the high quality RIR generated offline using computational acoustics method to quickly infer a plausible FDN for efficient rendering. This is useful for enhancing the realism of augmented and virtual reality applications where audio signal needs to be streamed to the user in real-time, such as virtual video conferencing and concert performance 33 broadcasting. To conclude, we make several suggestions on possible future direction related to the thesis. 6.2 Possible Future Work One area with rooms for improvement is to improve the speed and effectiveness of the optimization process used to build the training data set. Because some of the room dimension input are similar, an optimal output for one set of impulse response matching can be a good candidate initial candidate for another that is similar. 
This means we can potentially speed up the optimization process through joint optimization of multiple impulse responses using the Genetic Algorithm, allowing us to build a larger and more representative training data set more efficiently and to improve the data set quality. The choice of the Genetic Algorithm itself is also open for exploration; it is possible that gradient-based optimization methods would be more suitable for the optimization involved in matching an FDN to a given RIR. Similarly, the choice of regression model for predicting the FDN parameters from the room parameters can be explored as well. Once a sufficiently large training set is constructed, it might even be possible to apply neural networks to handle the regression task.

Finally, it is known that, with a given HRTF, it is possible to combine two FDNs to simulate the Binaural Room Impulse Response (BRIR). Rendering a sound source using a person's HRTF and BRIR provides additional value over using just the RIR because it enables a higher level of personalization and realism for the listener. More sophisticated FDNs can be considered to improve the realism of the reverberant audio even further, so long as the more complex FDN models remain efficient to evaluate.

Bibliography

[1] Richard A Brualdi, Shmuel Friedland, and Victor Klee. Combinatorial and graph-theoretical problems in linear algebra. Vol. 50. Springer Science & Business Media, 2012.

[2] Michael Chemistruck, Kyle Marcolini, and Will Pirkle. "Generating matrix coefficients for feedback delay networks using genetic algorithm". In: Audio Engineering Society Convention 133. Audio Engineering Society. 2012.

[3] Jay Coggin and Will Pirkle. "Automatic Design of Feedback Delay Network Reverb Parameters for Impulse Response Matching". In: Audio Engineering Society Convention 141. Audio Engineering Society. 2016.

[4] Nail A Gumerov and Ramani Duraiswami. Fast multipole methods for the Helmholtz equation in three dimensions. Elsevier, 2005.

[5] Brian Hamilton and Stefan Bilbao. "FDTD methods for 3-D room acoustics simulation with high-order accuracy in space and time". In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 25.11 (2017), pp. 2112–2124.

[6] Alston S Householder. "Unitary triangularization of a nonsymmetric matrix". In: Journal of the ACM (JACM) 5.4 (1958), pp. 339–342.

[7] Jean-Marc Jot and Antoine Chaigne. "Digital delay networks for designing artificial reverberators". In: Audio Engineering Society Convention 90. Audio Engineering Society. 1991.

[8] J-P Jullien et al. "Spatializer: a perceptual approach". In: Preprints, Audio Engineering Society (1993).

[9] Stephen Kirkup. "The boundary element method in acoustics: a survey". In: Applied Sciences 9.8 (2019), p. 1642.

[10] Heinrich Kuttruff. Room acoustics. CRC Press, 2016.

[11] Fritz Menzer. "Binaural reverberation using two parallel feedback delay networks". In: Audio Engineering Society Conference: 40th International Conference: Spatial Audio: Sense the Sound of Space. Audio Engineering Society. 2010.

[12] Andrzej Miśkiewicz et al. "Concert hall sound clarity: A comparison of auditory judgments and objective measures". In: Archives of Acoustics 37.1 (2012), pp. 41–46.

[13] James A Moorer. "About this reverberation business". In: Computer Music Journal (1979), pp. 13–28.

[14] NASA Spatial Auditory Displays Lab. slab3d User Manual. Version 6.8.3. May 1, 2020. url: http://slab3d.sonisphere.com.
[15] Takeshi Okuzono et al. "An explicit time-domain finite element method for room acoustics simulations: Comparison of the performance with implicit methods". In: Applied Acoustics 104 (2016), pp. 76–84.

[16] Thomas Rossing. Springer handbook of acoustics. Springer Science & Business Media, 2007.

[17] Lauri Savioja and U Peter Svensson. "Overview of geometrical room acoustic modeling techniques". In: The Journal of the Acoustical Society of America 138.2 (2015), pp. 708–730.

[18] Julius O. Smith. Physical Audio Signal Processing. Online book, 2010 edition. http://ccrma.stanford.edu/~jos/pasp/, accessed April 2020.

[19] Vesa Valimaki et al. "Fifty years of artificial reverberation". In: IEEE Transactions on Audio, Speech, and Language Processing 20.5 (2012), pp. 1421–1448.

[20] Robert W Young. "Sabine reverberation equation and sound power calculations". In: The Journal of the Acoustical Society of America 31.7 (1959), pp. 912–921.

[21] Dmitry N Zotkin, Ramani Duraiswami, and Larry S Davis. "Rendering localized spatial audio in a virtual auditory space". In: IEEE Transactions on Multimedia 6.4 (2004), pp. 553–564.