ABSTRACT

Title of thesis: POSITION CALIBRATION OF ACOUSTIC SENSORS AND ACTUATORS ON DISTRIBUTED GENERAL PURPOSE COMPUTING PLATFORMS

Vikas Chandrakant Raykar, Master of Science, 2003

Thesis directed by: Professor Rama Chellappa, Department of Electrical Engineering

An algorithm is presented to automatically determine the relative 3D positions of audio sensors and actuators in an ad-hoc distributed network of heterogeneous general purpose computing platforms. A closed form approximate solution is derived, which is further refined by minimizing a non-linear error function. Our formulation and solution account for the lack of temporal synchronization among different platforms. We also derive an approximate expression for the mean and covariance of the implicitly defined estimator. The theoretical performance limits for the sensor positions are derived and analyzed with respect to the number of sensors and actuators as well as their geometry. We report extensive simulation results and discuss the practical details of implementing our algorithms.

POSITION CALIBRATION OF ACOUSTIC SENSORS AND ACTUATORS ON DISTRIBUTED GENERAL PURPOSE COMPUTING PLATFORMS

by Vikas Chandrakant Raykar

Thesis submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Master of Science, 2003

Advisory Committee:
Professor Rama Chellappa, Chair/Advisor
Dr. Ramani Duraiswami, Co-Advisor
Professor Min Wu
Professor Shihab Shamma

© Copyright by Vikas Chandrakant Raykar, 2003

ACKNOWLEDGEMENTS

I am grateful to those people who have made this research and thesis possible and who have made my experience in graduate school one that I will always remember fondly. First of all I would like to thank my research advisor Dr. Ramani Duraiswami, who inspired me to do research in the area of microphone arrays. His passion and unparalleled enthusiasm for new ideas really put me on the track of thinking on my own about new developments in the existing field. I am greatly indebted to Prof. B. Yegnanarayana, with whom I had an opportunity to interact when he was a visiting professor at Maryland for three months in the summer. I learnt the right way of doing research from him. I would also like to thank my graduate advisor Prof. Rama Chellappa, who has been an inspirational figure for me in terms of the excellence a person can achieve in a field. I would like to thank Prof. Shihab Shamma and Prof. Min Wu, who kindly agreed to be on my thesis committee. A large part of this work was done during my internship at Intel Labs, Intel Corporation, Santa Clara. This work would be incomplete without the guidance of my mentors at Intel, Dr. Igor Kozintsev and Dr. Rainer Lienhart, who put me on the right track and were helpful in criticizing and improving different aspects of the algorithms. During my internship at Intel I learnt a great deal about how research can be directed to have significant impact in terms of applications. I would also like to thank Dr. Eric Debes, who was fun to hang around with at Intel. Finally, I would also like to thank my co-interns Aaamer for those rollerblading trips, Brinda and Madhavi for the endless coffee hours and fundaes, my previous room mates Karthik, Saurabh and Devang, my room mates at Maryland Niranjan, Adarsh and Urvashi, and my current room mates Arun and Punya. Last but not the least, I am forever indebted to my parents, for I am what I am wholly due to their efforts and sacrifices.
TABLE OF CONTENTS

List of Tables
List of Figures

1 Introduction
  1.1 Motivation
  1.2 Distributed Array Processing
  1.3 Common Time and Space
  1.4 Previous work
  1.5 Organization
  1.6 Novel Contributions

2 Problem Formulation
  2.1 Problem statement and notation
    2.1.1 Time Difference Of Flight
  2.2 Maximum Likelihood Estimate
  2.3 Reference Coordinate System
  2.4 Non-Linear Least Squares
  2.5 Minimum number of microphones and speakers required

3 Closed Form Approximate Solution
  3.1 Initial Guess for capture and emission start times
  3.2 Initial Guess for microphone and speaker positions
    3.2.1 Multidimensional Scaling with clustering
  3.3 Final Algorithm

4 Estimator Performance
  4.1 Notation
  4.2 Vector Derivatives
  4.3 Estimator Covariance
  4.4 Estimator Mean
  4.5 Cramér-Rao Bound
  4.6 Rank of the Fisher Information Matrix
  4.7 Effect of the Nuisance Parameters
  4.8 Increasing the number of GPCs
  4.9 How to select a coordinate system?
  4.10 Monte Carlo Simulation Results

5 Implementation Details
  5.1 Calibration Signals
  5.2 Time Delay Estimation
  5.3 Testbed Setup
  5.4 Software details
  5.5 Dealing with Partial Information
  5.6 Robust ML estimation

6 Conclusions

A Derivatives
B Distance matrix to a dot product matrix

Bibliography

LIST OF TABLES

2.1 Total number of independent observations (N) and parameters to be estimated (P) for different estimation procedures: M = number of microphones, S = number of speakers, D = dimension.
2.2 Minimum number of microphone-speaker pairs (K) required for different estimation procedures (D = dimension).

LIST OF FIGURES

1.1 A general setup involving multiple microphones, loudspeakers, cameras and displays.
1.2 Some typical applications involving multiple audio/video sensors and actuators.
1.3 Distributed computing platform consisting of N general-purpose computers along with their onboard audio sensors, actuators and wireless communication capabilities.
2.1 Schematic indicating the unknown emission and capture start times.
3.1 Flow chart of the complete algorithm.
3.2 Results of Multidimensional Scaling for a network consisting of 10 GPCs each having one microphone and one speaker.
4.1 Cramér-Rao bound on the total variance of the unknown microphone coordinates as a function of TOF noise standard deviation for different estimation procedures. For the TDOF-based method the noise variance was taken as twice the TOF variance. The network had a total of 16 microphones and 16 speakers.
4.2 95% uncertainty ellipses for a regular 2 dimensional array of (a) 9 speakers and 9 microphones, (b) 25 speakers and 25 microphones. The noise variance in both cases is σ² = 10⁻⁹ for the TOF-based method and σ² = 2×10⁻⁹ for the TDOF-based method. The microphones are represented as crosses (×) and the speakers as dots (·). The position of one microphone and the x coordinate of one speaker are assumed to be known (shown in bold). The solid and dotted ellipses are the uncertainty ellipses for the estimation procedures using the TOF and TDOF-based methods, respectively.
4.3 95% uncertainty ellipses for a regular 2 dimensional array of 25 microphones and 25 speakers for different positions of the known microphone and different x coordinates of the known speaker. In (a) and (b) the known nodes are close to each other, and in (c) they are spread out, one at each corner of the grid. The microphones are represented as crosses (×) and the speakers as dots (·). The noise variance in all cases was σ² = 10⁻⁹. (d) Schematic explaining the shape of the uncertainty ellipses. 50 TDOF pairs were used for the estimation procedure.
4.4 (a) The total variance and (b) total bias of all the microphone coordinates for increasing TOF noise standard deviation σ. The sensor network consisted of 16 microphones and 16 speakers. The results are shown for both the TOF and TDOF-based Position and Joint Estimation. The Cramér-Rao bound for the TDOF-based Joint Estimation is also plotted. For the TDOF-based method the noise variance was taken as twice the TOF variance.
5.1 (a) The loopback reference chirp signal, (b) the chirp signal received by one of the microphones, (c) the magnitude spectrum of the reference signal and (d) of the received chirp signal.
5.2 GCC function with PHAT weighting.
5.3 (a) Synchronized setup consisting of four microphones and four loudspeakers. (b) A sample screen shot of the OpenGL display showing the positions of the microphones and loudspeakers for the synchronized setup.
5.4 (a) 2D distributed setup consisting of 5 laptops placed on a table. (b) Results for the setup consisting of 5 laptops, each having one internal microphone and speaker.
5.5 Schematic showing the distributed control scheme.
B.1 Law of cosines.
Chapter 1

Introduction

1.1 Motivation

Arrays of audio/video sensors and actuators (such as microphones, cameras, loudspeakers and displays) along with array processing algorithms offer a rich set of new features for emerging multimedia applications. A typical setup, as shown in Figure 1.1, would involve capturing the audio and video scene using multiple microphones and cameras. The captured multiple audio/video streams can be rendered on multiple loudspeakers/displays or used for different applications. A few such applications include multi-stream audio/video rendering, smart conference rooms [35, 34, 37], meeting recording, hands-free voice communication [23, 15], speech acquisition in automobile environments [16, 22], object localization and tracking, hearing-aid devices [14], speech enhancement [25, 24], speech dereverberation and acoustic surveillance (refer to Figure 1.2).

Figure 1.1: A general setup involving multiple microphones, loudspeakers, cameras and displays.

In this thesis we are concerned only with acoustic sensors (microphones) and actuators (loudspeakers). The two main applications for which we can use multiple microphones are sound source localization and beamforming. Using multiple microphones, and knowing their locations, we can estimate the location of a speaker based on the waveform captured at each of the microphones. Once we know the location of the speaker we can track the moving speaker and beamform to his location. A beamformer performs spatial filtering in the sense that it separates two signals with overlapping frequency content originating from different directions.

Consider a typical conference room scenario. The speech signal received from a speaker in such acoustical environments is corrupted both by additive noise and room reverberation. One effective way of dealing with such situations is to use a set of spatially distributed microphones for recording the speech.

Figure 1.2: Some typical applications involving multiple audio/video sensors and actuators.

In order to keep the speaker in focus in videoconferencing, the speaker can be localized, and this information can be fed to a video system for actuating the pan-tilt operations of a camera. Once the actual position of the speaker is known, the microphone array can be steered electronically (beamformed) for high quality speech acquisition. Tracking a moving speaker is also useful in a multispeaker scenario, in which speech from a particular speaker may need to be enhanced with respect to others, or with respect to noise sources.

1.2 Distributed Array Processing

Much of the current work has focused on setting up all the sensors and actuators on a single dedicated computing platform. Such a setup requires a lot of dedicated infrastructure in terms of sensors, multi-channel interface cards and computing power. For example, to set up a microphone array on a single general purpose computer we need expensive multichannel sound cards and a CPU with enough computational power to process all the multiple streams.
At the same time, recent advances in mobile computing and communication technologies provide a very attractive platform for implementing these algorithms. Students in classrooms, co-workers at meetings and family members at home are nowadays accompanied by one or several mobile computing and communication devices such as laptops, PDAs and tablets, with multiple audio and video I/O devices onboard. We collectively refer to such devices as General Purpose Computers (GPCs). In addition, high-speed wireless network connections, like IEEE 802.11a/b/g, are available to network those devices. If we manage to combine sensors/actuators with wireless connectivity and computational resources, we can potentially transform such a network into a complex array digital signal processing system. The advantage of such an approach is that multiple GPCs along with their sensors and actuators can be converted into a distributed sensor network in an ad-hoc fashion just by adding appropriate software layers. No dedicated infrastructure in terms of sensors, actuators, multi-channel interface cards and computing power is required. However, there are several important technical and theoretical problems that need to be addressed before the idea of using GPCs for array signal processing algorithms can materialize in real-life applications.

Figure 1.3: Distributed computing platform consisting of N general-purpose computers along with their onboard audio sensors, actuators and wireless communication capabilities.

Figure 1.3 shows a schematic representation of our distributed computing platform consisting of N GPCs. Each GPC is assumed to be equipped with audio sensors (microphones) and actuators (speakers) for performing audio I/O, and with wireless communication capabilities for exchanging data with the other GPCs.

1.3 Common Time and Space

A prerequisite for using distributed audio-visual I/O capabilities is to put the sensors and actuators into a common time and space. [19] proposes a way to provide a common time reference for multiple distributed GPCs with a precision of tens of microseconds. This thesis is mainly concerned with providing a common space (a relative coordinate system) by actively estimating the three-dimensional positions of the sensors and actuators. Many multi-microphone array processing algorithms (like sound source localization or conventional beamforming) need to know the positions of the microphones very precisely. Even relatively small uncertainties in sensor location can make substantial, often dominant, contributions to the overall localization error [27].

1.4 Previous work

Current audio array processing systems either rely on placing the microphones in known locations or on manual calibration of their positions. There are some approaches which do position calibration using speakers in known locations. [28] describes an experimental setup for automatic calibration of a large-aperture microphone array using acoustic signals from transducers whose locations are known. We follow a more general approach, in which we assume that the speaker locations are also unknown. A lot of related theoretical work can be found in [27, 36, 21]. Most of the formulations assume that all the sensors and actuators are on a synchronized setup, i.e., capture and playback occur simultaneously. However, in a typical distributed setup we start the audio capture and playback on each GPC one by one, and the playback and capture start times are generally unknown.
Our solution explicitly accounts for the errors in localization due to the lack of temporal synchronization among different platforms. A recent paper [20] accounts only for the unknown source emission time. The solution turns out to be a non-linear minimization problem which requires a good starting point to reach the global minimum. We derive a closed form approximate solution to be used as the initial guess for the minimization routine.

The problem of self-localization for a network of nodes has also been dealt with in the wireless networking and robotics communities. The problem is essentially the same as in our case, but the ranging method differs depending on the sensors and actuators. Self-localization of a network of nodes involves two steps: ranging and multilateration. Ranging involves the estimation of the distance between two nodes in the network. Multilateration refers to using the estimated ranges to find the positions of the different nodes. The ranging technology can be based either on the Time-Of-Arrival (TOA) or on the Received Signal Strength (RSS) of acoustic, ultrasound or radio frequency (RF) signals. The choice of a particular technology depends on the environment and the range for which the sensor network is designed. The GPS system and long range wireless sensor networks use RF technology for range estimation. Localization using the Global Positioning System (GPS) is not suitable for our applications, since GPS does not work indoors and is very expensive. Also, RSS based on RF is very unpredictable [29], and the RF TOA is too small to be used indoors. [29] discusses systems based on ultrasound TOA using specialized hardware (like motes) as the nodes. However, our goal is to use the already available sensors and actuators on the GPCs to estimate their positions. So our ranging technology is based on acoustic TOA, as in [28, 20, 13]. Once we have the range estimates, the Maximum Likelihood (ML) estimate can be used to get the positions.

1.5 Organization

The thesis is organized as follows. In Chapter 2, we formulate the problem and derive the Maximum Likelihood (ML) estimator. We derive two estimators, one based on TOF and the other based on TDOF. In Chapter 3, an approximate closed form solution is derived, which can be used as an initial guess for the non-linear minimization routine. In Chapter 4, we derive the theoretical mean and covariance of the estimated parameters. The Cramér-Rao bound is also derived and analyzed for its sensitivity with respect to the number of sensors and actuators as well as their geometry. Chapter 5 gives a discussion of the issues involved in designing a practical system. Chapter 6 concludes with a summary of the present work.

1.6 Novel Contributions

The following are the novel contributions of this thesis.

- We propose a novel setup for array processing algorithms with ad-hoc connected GPCs.

- The position estimation problem has been formulated as maximum likelihood estimation in several papers [20, 36, 28]. The solution turns out to be the minimum of a nonlinear cost function. Iterative nonlinear least squares optimization procedures require a very close initial guess to converge to the global minimum. We propose the technique of metric Multidimensional Scaling (MDS) [32] to get an initial guess for the nonlinear minimization problem. Using this technique, we get the approximate positions of the GPCs.
- Most of the previous work on position calibration (except [13], which describes a setup based on Compaq iPAQs and motes) is formulated assuming time-synchronized platforms. However, in an ad-hoc distributed computing platform consisting of heterogeneous GPCs we need to explicitly account for errors due to the lack of temporal synchronization. We perform an analysis of the localization errors due to the lack of synchronization among multiple platforms and propose ways to account for the unknown emission start times and capture start times.

- Most of the existing localization methods use the Time Of Flight (TOF) approach for position calibration [20, 28, 13]. We show that for distributed computing platforms, the method based on Time Difference Of Flight (TDOF) is better than the TOF method in many respects.

- We derive the approximate mean and covariance of the implicitly defined estimator using the implicit function theorem and a Taylor series expansion, as in [11]. We also derive the Cramér-Rao bound and analyze the localization accuracy with respect to the number of sensors and the sensor geometry.

The work presented in this thesis resulted in two conference publications and two patents being filed.

Publications

- Position Calibration of Audio Sensors and Actuators in a Distributed Computing Platform. Vikas C. Raykar, Igor Kozintsev and Rainer Lienhart, ACM Multimedia 2003, Berkeley, CA, USA, November 2003.

- Self Localization of Acoustic Sensors and Actuators on Distributed Platforms. Vikas C. Raykar, Igor Kozintsev and Rainer Lienhart, ICCV 2003 International Workshop on Multimedia Technologies in E-Learning and Collaboration, Nice, France, October 2003.

Patents filed

- Three-Dimensional Position Calibration of Audio Sensors and Actuators on a Distributed Computing Platform (filed on 05/09/2003 along with Igor Kozintsev and Rainer Lienhart).

- Method for 3-Dimensional Position Calibration of Audio Sensors and Actuators on a Distributed Computing Platform (filed on 08/29/2003 along with Igor Kozintsev and Rainer Lienhart).

Chapter 2

Problem Formulation

2.1 Problem statement and notation

Given a set of M acoustic sensors (microphones) and S acoustic actuators (speakers) in unknown locations, our goal is to estimate their three-dimensional coordinates. We assume that each of the GPCs has at least one microphone and one speaker. We also assume that at any given instant we know the number of sensors and actuators in the network. Any new node entering or departing the network announces its arrival or departure by some means, so that the network of sensors and actuators can be recalibrated.

Each of the speakers is excited using a known calibration signal, such as a maximum length sequence or a chirp signal, and the signal is captured by each of the acoustic sensors. The Time Of Flight (TOF) is estimated from the captured audio signal. The TOF for a given microphone-speaker pair is defined as the time taken by the acoustic signal to travel from the speaker to the microphone. (In some papers, TOF is referred to as Time Of Arrival, TOA.) We assume that the signals emitted from each of the speakers do not interfere with each other, i.e., each signal can be associated with a particular speaker. This can be achieved by confining the signal at each speaker to disjoint frequency bands or time intervals. Alternately, we can use coded sequences such that the signal due to each speaker can be extracted at the microphones and correctly attributed to the corresponding speaker.
The MS TOF measurements constitute our observations, based on which we have to estimate the microphone and speaker positions. The approach we describe is a generalization of the trilateration and multilateration techniques used in GPS positioning and other localization systems. Such systems assume that the locations of four sources are known; based on these sources the TOF to a sensor is estimated, and by trilateration the sensor's position can be determined. At least four speakers are required to find the position of an omnidirectional microphone: knowing the distance from one speaker, the microphone can lie anywhere on a sphere centered at the speaker; with two speakers it can lie on a circle, since two spheres intersect in a circle; with three we can narrow it down to two points; and four speakers give a unique location. Since the estimated distances are corrupted by noise, the intersection in general need not be a unique point. Therefore we solve the problem in a least squares sense by adding more speakers. We formulate the problem for the general case where the positions of both the microphones and the speakers are unknown.

Let m_i for i \in [1, M] and s_j for j \in [1, S] be the three-dimensional vectors representing the spatial coordinates of the ith microphone and the jth speaker, respectively. We excite one of the S speakers at a time and measure the TOF at each of the M microphones. Let TOF^{actual}_{ij} be the actual TOF for the ith microphone due to the jth source. Based on geometry, the actual TOF can be written as (assuming a direct path)

    TOF^{actual}_{ij} = \frac{\| m_i - s_j \|}{c}    (2.1)

where c is the speed of sound in the acoustical medium and \| \cdot \| is the Euclidean norm. The speed of sound in a given acoustical medium is assumed to be constant. In air it is given by c = (331 + 0.6T) m/s, where T is the temperature of the medium in degrees Celsius. For improved position calibration it is beneficial to integrate a temperature sensor into the system. It is also possible to include the speed of sound as a parameter to be estimated, as in [28].

The TOF which we estimate from the captured signal conforms to this model only when all the sensors start capturing at the same instant and we know when the calibration signal was sent from each speaker. This is generally the case when we use multichannel sound cards to interface multiple microphones and speakers: all the channels are synchronized, and the time when the calibration signal was sent can be determined by doing a loopback from the output to the input, using the loopback signal as a reference to estimate the TOF. However, in a typical distributed setup of GPCs as shown in Figure 1.3, the master starts the audio capture and playback on each of the GPCs one by one. As a result the capture starts at a different instant on each GPC, and the time at which the calibration signal was emitted from each loudspeaker is not known.

Figure 2.1: Schematic indicating the unknown emission and capture start times.

In a distributed setting, the TOF which we measure includes both the speaker emission start time and the microphone capture start time (see Figure 2.1, where \widehat{TOF}_{ij} is what we measure and TOF_{ij} is what we require). The speaker emission start time is defined as the time at which the sound is actually emitted from the speaker.
This includes the time when the playback command was issued (with reference to some time origin), the network delay involved in starting the playback on a different machine (if the speaker is on a different GPC), the delay in setting up the audio buffers, and the time required for the speaker diaphragm to start vibrating. The emission start time is generally unknown and depends on the particular sound card, the speaker, and the system state, such as the processor workload, interrupts, and the processes scheduled at the given instant. The microphone capture start time is defined as the time instant at which capture is started. This includes the time when the capture command was issued, the network delay involved in starting the capture on a different machine, and the delay in transferring the captured samples from the sound card to the buffers.

Let ts_j be the emission start time for the jth source and tm_i be the capture start time for the ith microphone with respect to some origin (see Figure 2.1). Incorporating these two, the actual TOF now becomes

    \widehat{TOF}^{actual}_{ij} = TOF^{actual}_{ij} + ts_j - tm_i = \frac{\| m_i - s_j \|}{c} + ts_j - tm_i    (2.2)

The origin can be arbitrary, since \widehat{TOF}^{actual}_{ij} depends only on the difference between ts_j and tm_i. We start the audio capture on each GPC one by one. We define the microphone on which the audio capture was started first as our first microphone. In practice, we set tm_1 = 0, i.e., the time at which the first microphone started capturing is our origin, and we define all other times with respect to it.

If two audio input and output channels are available on a single GPC, then one of the output channels can be used to play a reference signal which is RF modulated and transmitted through the air [19]. This reference signal can be captured on one of the input channels, demodulated, and used to estimate ts_j - tm_i, since the transmission time for RF waves can be considered almost zero. Note that this assumes that all audio channels on the same I/O device are synchronized, which is generally true. However, this method requires extra hardware in the form of RF modulators/demodulators. The other solution is to jointly estimate the unknown source emission and capture start times along with the microphone and source coordinates. However, we can eliminate the source emission start times if we use Time Difference Of Flight instead of Time Of Flight.

2.1.1 Time Difference Of Flight

The TDOF for a given pair of microphones and a speaker is defined as the time difference between the signals received by the two microphones. Let TDOF^{estimated}_{ikj} be the estimated TDOF between the ith and the kth microphone when the jth source is excited, and let TDOF^{actual}_{ikj} be the actual TDOF. It is given by

    TDOF^{actual}_{ikj} = \frac{\| m_i - s_j \| - \| m_k - s_j \|}{c}    (2.3)

Including the source emission and capture start times, it becomes

    \widehat{TDOF}^{actual}_{ikj} = \frac{\| m_i - s_j \| - \| m_k - s_j \|}{c} + tm_k - tm_i    (2.4)

In the case of TDOF the source emission time is the same for both microphones and thus cancels out. Therefore, by using TDOF measurements instead of TOF we can reduce the number of parameters to be estimated. Given M microphones and S speakers we can form MS(M-1)/2 TDOF measurements, as opposed to MS TOF measurements; of these MS(M-1)/2 TDOF measurements only (M-1)S are linearly independent.
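To make the measurement models of Equations 2.2 and 2.4 concrete, the following Python sketch generates noise-free TOF and TDOF observations for a synthetic geometry. It is an illustrative sketch, not code from the thesis; the function names and the example geometry are made up.

```python
import numpy as np

C = 342.0  # speed of sound (m/s); roughly 331 + 0.6*T at T ~ 18 C

def tof_model(mics, spks, ts, tm):
    """Observed TOF matrix of Eq. 2.2: ||m_i - s_j||/c + ts_j - tm_i.
    mics: (M,3), spks: (S,3), ts: (S,), tm: (M,); returns (M,S)."""
    dist = np.linalg.norm(mics[:, None, :] - spks[None, :, :], axis=2)
    return dist / C + ts[None, :] - tm[:, None]

def tdof_model(mics, spks, tm):
    """Observed TDOF of Eq. 2.4 for all microphone pairs (i, k), i < k.
    The speaker emission start times cancel and do not appear."""
    dist = np.linalg.norm(mics[:, None, :] - spks[None, :, :], axis=2)
    i, k = np.triu_indices(mics.shape[0], k=1)
    return (dist[i] - dist[k]) / C + (tm[k] - tm[i])[:, None]  # (pairs, S)

rng = np.random.default_rng(0)
mics = rng.uniform(0, 4, (4, 3))              # 4 microphones in a 4 m room
spks = mics + rng.uniform(-0.1, 0.1, (4, 3))  # speakers near the microphones
ts = rng.uniform(0, 0.05, 4)                  # unknown emission start times
tm = rng.uniform(0, 0.05, 4)
tm[0] = 0.0                                   # tm_1 = 0 fixes the time origin
print(tof_model(mics, spks, ts, tm).shape, tdof_model(mics, spks, tm).shape)
```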
2.2 Maximum Likelihood Estimate

Assuming an additive Gaussian noise model for the TDOF observations, we can derive the Maximum Likelihood estimate as follows. (We estimate the TDOF or TOF using the Generalized Cross Correlation (GCC) [17]. The estimated TDOF or TOF is corrupted by ambient noise and room reverberation. For high SNR, the delays estimated by GCC can be shown to be normally distributed with zero mean [17].)

Let \Theta be a vector of length P \times 1 representing all the unknown non-random parameters to be estimated (the microphone and speaker coordinates and the microphone capture start times). Let \Gamma be a vector of length N \times 1 representing the noisy TDOF measurements, and let T(\Theta) be a vector of length N \times 1 representing the actual values of the observations. Then our model for the observations is \Gamma = T(\Theta) + \eta, where \eta is a zero-mean additive white Gaussian noise vector of length N \times 1 in which each element has variance \sigma_j^2. Also let us define \Sigma to be the N \times N covariance matrix of the noise vector \eta. The likelihood function of \Gamma in vector form can be written as

    p(\Gamma/\Theta) = (2\pi)^{-N/2} |\Sigma|^{-1/2} \exp\left( -\frac{1}{2} (\Gamma - T)^T \Sigma^{-1} (\Gamma - T) \right)    (2.5)

The log-likelihood function is given by

    \ln p(\Gamma/\Theta) = -\frac{N}{2} \ln(2\pi) - \frac{1}{2} \ln|\Sigma| - \frac{1}{2} (\Gamma - T)^T \Sigma^{-1} (\Gamma - T)    (2.6)

The ML estimate of \Theta is the one which maximizes the log-likelihood and is given by

    \hat{\Theta}_{ML} = \arg\max_{\Theta} F(\Gamma, \Theta),  where  F(\Gamma, \Theta) = -\frac{1}{2} [\Gamma - T(\Theta)]^T \Sigma^{-1} [\Gamma - T(\Theta)]    (2.7)

In our case, \Theta represents a vectorized form of the following parameters:

    \Theta = [\Theta_m, \Theta_s, \Theta_{tm}]    (2.8)

    \Theta_m = [mx_1, my_1, mz_1, \ldots, mx_M, my_M, mz_M]^T
    \Theta_s = [sx_1, sy_1, sz_1, \ldots, sx_S, sy_S, sz_S]^T
    \Theta_{tm} = [tm_1, tm_2, \ldots, tm_M]^T    (2.9)

where mx_i, my_i and mz_i are the x, y and z coordinates of the ith microphone, sx_i, sy_i and sz_i are the x, y and z coordinates of the ith speaker, and tm_i is the capture start time of the ith microphone. \Gamma and T correspond to the estimated and actual TDOF measurements, respectively.

Assuming that each of the TDOFs is independently corrupted by zero-mean additive white Gaussian noise of variance \sigma_{ikj}^2, the ML estimate becomes a nonlinear least squares problem (in this case \Sigma is a diagonal matrix), i.e.,

    \hat{\Theta}_{ML} = \arg\min_{\Theta} F_{TDOF}(\Gamma, \Theta)
    F_{TDOF}(\Gamma, \Theta) = \sum_{j=1}^{S} \sum_{i=1}^{M-1} \sum_{k=i+1}^{M} \frac{ \left( TDOF^{estimated}_{ikj} - \widehat{TDOF}^{actual}_{ikj} \right)^2 }{ \sigma_{ikj}^2 }    (2.10)

In the case of TOF measurements the ML estimate can be derived similarly and is given by

    \hat{\Theta}_{ML} = \arg\min_{\Theta} F_{TOF}(\Gamma, \Theta)
    F_{TOF}(\Gamma, \Theta) = \sum_{j=1}^{S} \sum_{i=1}^{M} \frac{ \left( TOF^{estimated}_{ij} - \widehat{TOF}^{actual}_{ij} \right)^2 }{ \sigma_{ij}^2 }    (2.11)

In this case \Theta also includes the speaker emission start times.

2.3 Reference Coordinate System

Since the TOF and TDOF depend on pairwise distances, any translation and rotation of the coordinate system will also be a global minimum. In order to eliminate multiple global minima we select three arbitrary nodes to lie in a plane such that the first is at (0, 0, 0), the second at (x_1, 0, 0), and the third at (x_2, y_2, 0). Essentially we are fixing a plane so that the sensor configuration cannot be translated or rotated. In two dimensions we select two nodes to lie on a line, the first at (0, 0) and the second at (x_1, 0). To eliminate the ambiguity due to reflection along the Z-axis (or Y-axis in 2D), we specify one more node to lie on the positive Z-axis (or positive Y-axis in 2D). The reflections along the X-axis and Y-axis (for 3D) are eliminated by requiring the fixed nodes to lie on the positive side of the respective axes, i.e., x_1 > 0 and y_2 > 0.

Since the TDOF and TOF depend on time differences (i.e., ts_j - tm_i in the case of TOF and tm_k - tm_i in the case of TDOF), there are also multiple global minima due to shifts along the time axis. Similar to fixing a reference coordinate system in space, we introduce a reference time line by setting tm_1 = 0. This is needed since we are estimating the absolute source emission and capture start times. (If we were estimating only the differences, we would not need a time reference; however, estimating the differences introduces a lot of unnecessary parameters: O(N²) parameters instead of O(N).) Note that we are only interested in the positions of the microphones and speakers; the emission and capture times are just nuisance parameters.
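As an illustration of fixing the reference frame, the 2D sketch below translates a configuration so that node 0 is at the origin, rotates node 1 onto the positive x-axis, and reflects node 2 to the positive y side. It is a hypothetical helper, not the thesis implementation; the 3D case adds one more rotation and the positive z-axis constraint.

```python
import numpy as np

def to_reference_frame_2d(pts):
    """Map a 2D configuration into the reference frame of Section 2.3:
    node 0 at (0,0), node 1 on the positive x-axis, node 2 with y > 0.
    pts: (N,2) array of coordinates; returns a transformed copy."""
    p = pts - pts[0]                    # translation: node 0 -> origin
    ang = np.arctan2(p[1, 1], p[1, 0])  # angle of node 1
    c, s = np.cos(-ang), np.sin(-ang)
    p = p @ np.array([[c, -s], [s, c]]).T  # rotation: node 1 -> x-axis
    if p[2, 1] < 0:                     # resolve the reflection ambiguity
        p[:, 1] = -p[:, 1]
    return p

pts = np.random.default_rng(1).uniform(0, 4, (5, 2))
ref = to_reference_frame_2d(pts)
print(np.round(ref[:3], 3))  # row 0 ~ (0,0); row 1 ~ (d,0); row 2 has y >= 0
```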
2.4 Non-Linear Least Squares

The ML estimate for the node coordinates of the microphones and speakers is implicitly defined as the minimum of the non-linear function defined in Equation 2.10. This function has to be minimized using numerical optimization methods. Least squares problems can be solved using general unconstrained minimization. However, there exist specialized methods like the Gauss-Newton and the Levenberg-Marquardt methods which are often more efficient in practice. The Levenberg-Marquardt method [8] is a popular method for solving non-linear least squares problems. It is a compromise between steepest descent and Newton's method: the steepest descent method potentially has very slow convergence but can converge from any starting point, while Newton's method converges fast but requires a good initial guess and computation of the inverse of the Hessian matrix. For more details on nonlinear minimization refer to [12]. Appendix A gives the non-zero partial derivatives needed by the minimization routines. Many software implementations of the Levenberg-Marquardt method are available, such as lsqnonlin in MATLAB, mrqmin provided by Numerical Recipes in C [26], and the MINPACK-1 routines [3]. A common problem with minimization methods is that they often get stuck in a local minimum; good initial guesses of the node locations counteract the problem. A minimal numerical sketch of this minimization appears at the end of this chapter.

2.5 Minimum number of microphones and speakers required

Non-linear least squares optimization requires that the total number of observations be greater than or equal to the total number of parameters to be estimated. This imposes a minimum number of microphones and speakers required for the position estimation method to work. Assuming we have M microphones and S speakers, Table 2.1 summarizes the number of independent observations (N) and the number of parameters to be estimated (P) for each of the estimation procedures. In the case of the TDOF-based method, only (M-1)S out of the MS(M-1)/2 TDOF measurements per speaker are linearly independent. Assuming M = S = K, Table 2.2 lists the minimum K required for least squares fitting.

Table 2.1: Total number of independent observations (N) and parameters to be estimated (P) for different estimation procedures: M = number of microphones, S = number of speakers, D = dimension.

                              N          P
  TOF Position Estimation     MS         DM + DS - D(D+1)/2
  TDOF Position Estimation    (M-1)S     DM + DS - D(D+1)/2
  TOF Joint Estimation        MS         (D+1)M + (D+1)S - D(D+1)/2 - 1
  TDOF Joint Estimation       (M-1)S     (D+1)M + DS - D(D+1)/2 - 1

Table 2.2: Minimum number of microphone-speaker pairs (K) required for different estimation procedures (D = dimension).

                              D = 2      D = 3
  TOF Position Estimation     3          5
  TDOF Position Estimation    5          6
  TOF Joint Estimation        6          7
  TDOF Joint Estimation       6          7
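To tie Sections 2.2-2.5 together, here is a hedged numerical sketch of TOF-based joint estimation in 2D with M = S = 6 (the minimum from Table 2.2), using scipy.optimize.least_squares with method='lm' (a Levenberg-Marquardt implementation) in place of the routines cited in Section 2.4. All names and values are illustrative, not the thesis code; the reference frame and time line are fixed as in Section 2.3 by holding microphone 0 at the origin, microphone 1 on the x-axis, and tm_1 = 0.

```python
import numpy as np
from scipy.optimize import least_squares

C = 342.0   # speed of sound (m/s)
M = S = 6   # minimum count for 2D TOF joint estimation (Table 2.2)

rng = np.random.default_rng(2)
mics = rng.uniform(0, 4, (M, 2))
mics[0] = 0.0        # reference frame: mic 0 at the origin,
mics[1, 1] = 0.0     # mic 1 on the x-axis
spks = rng.uniform(0, 4, (S, 2))
tm = rng.uniform(0, 0.02, M)
tm[0] = 0.0          # tm_1 = 0 fixes the time origin
ts = rng.uniform(0, 0.02, S)

def model(mic, spk, tm_, ts_):
    """Observed TOF matrix of Eq. 2.2: ||m_i - s_j||/c + ts_j - tm_i."""
    d = np.linalg.norm(mic[:, None] - spk[None, :], axis=2)
    return d / C + ts_[None, :] - tm_[:, None]

tof_meas = model(mics, spks, tm, ts) + 1e-6 * rng.standard_normal((M, S))

def unpack(theta):
    """Rebuild the full parameter set; the constrained coordinates and
    tm_1 are held at their fixed values (gauge fixing)."""
    mic = np.zeros((M, 2))
    mic[1, 0] = theta[0]
    mic[2:] = theta[1:2 * M - 3].reshape(M - 2, 2)
    spk = theta[2 * M - 3:2 * M - 3 + 2 * S].reshape(S, 2)
    tm_ = np.concatenate([[0.0], theta[2 * M - 3 + 2 * S:2 * M - 3 + 2 * S + M - 1]])
    ts_ = theta[2 * M - 3 + 2 * S + M - 1:]
    return mic, spk, tm_, ts_

def residuals(theta):
    mic, spk, tm_, ts_ = unpack(theta)
    return (model(mic, spk, tm_, ts_) - tof_meas).ravel()  # Eq. 2.11 terms

theta_true = np.concatenate([[mics[1, 0]], mics[2:].ravel(),
                             spks.ravel(), tm[1:], ts])
theta0 = theta_true + 0.05 * rng.standard_normal(theta_true.size)  # close guess
fit = least_squares(residuals, theta0, method='lm')
print(np.abs(unpack(fit.x)[0] - mics).max())  # microphone position error (m)
```

Note how the free parameter count (32 here) matches the joint-estimation row of Table 2.1 for D = 2, and is just below the MS = 36 available observations.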
Chapter 3

Closed Form Approximate Solution

A common problem with minimization methods is that they often get stuck in a local minimum; they may not converge unless we have a very good starting point. In this chapter we make some approximations to obtain closed form solutions for the microphone and speaker positions and the capture start times, which can be used as an initial guess for the nonlinear minimization routine.

3.1 Initial Guess for capture and emission start times

Consider two laptops i and j, each having one microphone and one speaker. For these two laptops we can measure \widehat{TOF}_{ii}, \widehat{TOF}_{jj}, \widehat{TOF}_{ij} and \widehat{TOF}_{ji}. Assuming no noise, these are related to the actual TOF as follows:

    \widehat{TOF}_{ii} = TOF_{ii} + ts_i - tm_i
    \widehat{TOF}_{jj} = TOF_{jj} + ts_j - tm_j
    \widehat{TOF}_{ij} = TOF_{ij} + ts_j - tm_i
    \widehat{TOF}_{ji} = TOF_{ji} + ts_i - tm_j    (3.1)

Assuming that the microphone and speaker on the same laptop are sufficiently close compared to the distance between two laptops, the following approximations can be made:

    TOF_{ii} \approx TOF_{jj} \approx 0,   TOF_{ij} \approx TOF_{ji}    (3.2)

Substituting, we have the following equations:

    \widehat{TOF}_{ii} \approx ts_i - tm_i
    \widehat{TOF}_{jj} \approx ts_j - tm_j
    \widehat{TOF}_{ij} \approx TOF_{ij} + ts_j - tm_i
    \widehat{TOF}_{ji} \approx TOF_{ij} + ts_i - tm_j    (3.3)

From the above equations we can solve for TOF_{ij} as

    TOF_{ij} \approx \frac{ (\widehat{TOF}_{ij} + \widehat{TOF}_{ji}) - (\widehat{TOF}_{ii} + \widehat{TOF}_{jj}) }{2}    (3.4)

We can also solve for the microphone capture start times and the source emission start times as follows:

    ts_i \approx \widehat{TOF}_{ii} + tm_i
    tm_j \approx \frac{ (\widehat{TOF}_{ij} - \widehat{TOF}_{ji}) + (\widehat{TOF}_{ii} - \widehat{TOF}_{jj}) }{2} + tm_i    (3.5)

Taking the time when capture starts on the first microphone as zero (i.e., tm_1 = 0), we can solve for all the other microphone capture start times and the speaker emission start times. Note that all the above equations hold only approximately; their values have to be refined further using the ML estimation procedure.

3.2 Initial Guess for microphone and speaker positions

Given the pairwise Euclidean distances between N nodes, their relative positions can be determined by means of metric Multidimensional Scaling (MDS) [32]. MDS is popular in psychology and denotes a set of data-analysis techniques for the analysis of proximity data on a set of stimuli, used to reveal the hidden structure underlying the data [31]. The proximity data refer to some measure of pairwise dissimilarity. Given a set of N stimuli along with their pairwise dissimilarities p_{ij}, MDS places the N stimuli as points in a multidimensional space such that the distance between any two points is a monotonic function of the corresponding dissimilarity. MDS is widely used to visually study the structure in proximity data. [31] describes an experiment where MDS is used to reveal some of the perceptual dimensions that people might use in face similarity judgement tasks. If the proximity data are based on Euclidean distances, then classical metric MDS [32] can exactly recreate the configuration.

Given a set of N GPCs, let X be an N \times 3 matrix in which each row represents the 3D coordinates of one GPC. Then the N \times N matrix B = X X^T is called the dot product matrix. By definition, B is a symmetric positive semidefinite matrix, so the rank of B (i.e., the number of positive eigenvalues) is equal to the dimension of the data points, i.e., 3 in this case. Based on the rank of B we can also determine whether the GPCs lie on a plane or are distributed in 3D. Starting with a matrix B (possibly corrupted by noise), it is possible to factor it to get the matrix of coordinates X. One method to factor B is to use the singular value decomposition (SVD) [26], i.e., B = U \Lambda U^T, where \Lambda is an N \times N diagonal matrix of singular values. The diagonal elements are arranged as s_1 \geq s_2 \geq \ldots \geq s_r > s_{r+1} = \ldots = s_N = 0, where r is the rank of the matrix B. The columns of U are the corresponding singular vectors. We can write X' = U \Lambda^{1/2} and take the first three columns of X' to get X. If the elements of B are exact (i.e., not corrupted by noise), then all the other columns are zero. It can be shown that the SVD factorization minimizes the matrix norm \| B - X X^T \|.
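A minimal sketch of this factorization, assuming numpy; since B is symmetric positive semidefinite, the symmetric eigendecomposition used here coincides with the SVD described above. The recovered coordinates match the true ones only up to translation, rotation and reflection.

```python
import numpy as np

def mds_from_dot_product(B, dim=3):
    """Factor the dot product matrix B ~ X X^T (classical metric MDS)."""
    w, U = np.linalg.eigh(B)            # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:dim]     # keep the `dim` largest
    w = np.clip(w[idx], 0.0, None)      # noise can make them slightly negative
    return U[:, idx] * np.sqrt(w)       # X' = U Lambda^{1/2}, first dim columns

# Example: recover a random 3D configuration from its dot product matrix.
X = np.random.default_rng(3).uniform(0, 4, (10, 3))
Xc = X - X.mean(axis=0)                 # centroid as origin (see below)
Xhat = mds_from_dot_product(Xc @ Xc.T)
# Xhat equals Xc up to an orthogonal transform, so pairwise distances agree.
d = lambda P: np.linalg.norm(P[:, None] - P[None, :], axis=2)
print(np.abs(d(Xhat) - d(Xc)).max())    # near machine precision
```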
In practice, we can estimate the distance matrix D, whose ijth element is the Euclidean distance between the ith and the jth GPC. This distance matrix D must be converted into a dot product matrix B before MDS can be applied. We need to choose some point as the origin of our coordinate system in order to form the dot product matrix. Any point can be selected as the origin, but Torgerson [32] recommends the centroid of all the points: if the distances have random errors, then choosing the centroid as the origin will minimize those errors, as they tend to cancel each other. We can obtain the dot product matrix using the cosine law, which relates the distance between two vectors to their lengths and the cosine of the angle between them. Refer to Appendix B for a detailed derivation of how to convert the distance matrix to the dot product matrix.

3.2.1 Multidimensional Scaling with clustering

In our case of M microphones and S speakers we cannot use MDS directly, because we cannot measure all the pairwise distances: we can measure the distance between each speaker and all the microphones, but not the distance between two microphones or between two speakers. In order to apply MDS, we cluster microphones and speakers which are close together. Based on the approximation discussed in the previous section, the distance d_{ij} between the ith and the jth GPC is given by

    d_{ij} \approx \frac{ c \, (\widehat{TOF}_{ij} + \widehat{TOF}_{ji} - \widehat{TOF}_{ii} - \widehat{TOF}_{jj}) }{2}    (3.6)

where c is the speed of sound. The position estimate from MDS is arbitrary with respect to the centroid and the orientation, and is converted into the reference coordinate system described in Section 2.3. The approximate locations of the GPCs are slightly perturbed to get the initial guess for the microphone and speaker locations.

Figure 3.1: Flow chart of the complete algorithm.

3.3 Final Algorithm

Figure 3.1 summarizes the algorithm. Say we have M microphones and S speakers.

- STEP 1: Measure the M \times S Time Of Flight (\widehat{TOF}) matrix.

- STEP 2 (a sketch of this step appears after this list):
  - Form the approximate distance matrix D (Equation 3.6).
  - Assume tm_1 = 0 (the microphone on which capture was started first) and get the approximate microphone capture and speaker emission start times (Equation 3.5).
  - Convert the distance matrix D to the dot product matrix B (Appendix B). Find the rank of B to determine whether the GPCs are in 2D or 3D.

- STEP 3: Form a reference coordinate system.
  - If 3D, select three nodes: the first as the origin, the second to define the x-axis and the third to form the xy-plane. Also select a fourth node to represent the positive z-axis.
  - If 2D, select two nodes: the first as the origin and the second to define the x-axis. Also select a third node to represent the positive y-axis.

- STEP 4:
  - Get the approximate positions of the GPCs using metric Multidimensional Scaling.
  - Translate, rotate and mirror them into the chosen coordinate system.
  - Slightly perturb the coordinates to get the approximate initial guess for the microphone and speaker coordinates.

- STEP 5: Minimize the TDOF-based error function using the Levenberg-Marquardt method to get the final positions of the microphones and speakers, using the approximate positions and the capture start times as the initial guess.
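The following is a sketch of STEP 2 (Equations 3.5 and 3.6) for a network in which GPC i hosts microphone i and speaker i. It is illustrative only, not the thesis implementation, and assumes the measured TOF matrix is already available.

```python
import numpy as np

def initial_guess(tof, c=342.0):
    """Closed-form approximations of Section 3.1.
    tof[i, j] = measured TOF at microphone i from speaker j, for N GPCs
    each with one microphone and one speaker approximately co-located.
    Returns (D, tm, ts): inter-GPC distances and start times."""
    diag = np.diag(tof)
    # Eq. 3.6: approximate pairwise GPC distances (diagonal comes out 0).
    D = c * (tof + tof.T - diag[:, None] - diag[None, :]) / 2.0
    # Eq. 3.5 with i = 0 and tm_1 = 0: capture start times w.r.t. mic 0.
    tm = ((tof[0, :] - tof[:, 0]) + (tof[0, 0] - diag)) / 2.0
    ts = diag + tm   # Eq. 3.5: ts_i ~ TOF_ii + tm_i
    return D, tm, ts
```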
Figure 3.2 shows an example with 10 laptops, each having one microphone and one speaker. The actual locations of the sensors and actuators are shown as 'x'. The '*'s are the approximate GPC locations as determined by MDS. As can be seen, the MDS results are already very close to the microphone and speaker locations. The estimated locations are further improved by the minimization in STEP 5 and marked as 'o's.

Figure 3.2: Results of Multidimensional Scaling for a network consisting of 10 GPCs each having one microphone and one speaker.

Chapter 4

Estimator Performance

The properties of the ML estimator can be studied in terms of the estimator bias and error covariance matrix. The bias and error variance depend on the noise variance, the number of microphones and speakers, and the geometry of the setup. One way to study them is to run extensive Monte Carlo simulations for various geometries and numbers of nodes. However, if we have analytical expressions for the bias and the variance of the estimator, then these simulation studies can be carried out quickly and the estimator can be studied in depth. The ML estimate for the microphone and speaker positions is defined implicitly as the minimum of a certain error function, so it is not possible to obtain exact analytical expressions for its mean and variance. However, using the implicit function theorem and a Taylor series expansion it is possible to derive approximate expressions for the mean and variance of implicitly defined estimators [10, 11]. In this chapter we derive such approximate expressions for both the mean and the variance. We could have derived the Cramér-Rao bound, which gives a lower bound on the error covariance matrix of any unbiased estimator; however, since we cannot determine whether our estimator is unbiased, we cannot rely on the Cramér-Rao bound for unbiased estimators. We nevertheless also derive the Cramér-Rao bound assuming our estimator is unbiased; it turns out to be the same as our approximate covariance matrix expression.

4.1 Notation

Let \Theta be a vector of length P \times 1 representing all the unknown non-random parameters to be estimated. Let \Gamma be a vector of length N \times 1 representing our noisy measurements. Let T(\Theta) be a vector of length N \times 1 representing the actual values of the observations:

    \Theta = [\theta_1, \theta_2, \ldots, \theta_P]^T
    \Gamma = [\Gamma_1, \Gamma_2, \ldots, \Gamma_N]^T
    T(\Theta) = [t_1, t_2, \ldots, t_N]^T    (4.1)

Then our model for the observations is \Gamma = T(\Theta) + \eta, where \eta is a zero-mean additive white Gaussian noise vector of length N \times 1 in which each element has variance \sigma_j^2. Also let us define \Sigma to be the N \times N covariance matrix of the noise vector \eta.

The ML estimate of \Theta is the one which maximizes the log-likelihood and is given by

    \hat{\Theta}_{ML} = \arg\max_{\Theta} F(\Gamma, \Theta),  where  F(\Gamma, \Theta) = -\frac{1}{2} [\Gamma - T(\Theta)]^T \Sigma^{-1} [\Gamma - T(\Theta)]    (4.2)

4.2 Vector Derivatives

In the derivations that follow we need the first and second derivatives of Equation 4.2 with respect to \Theta and \Gamma. In this section we specify the vector derivative notation we use and the corresponding derivatives of F(\Gamma, \Theta). The P \times 1 column gradient operator \nabla_{\Theta} is defined as

    \nabla_{\Theta} F(\Gamma, \Theta) = \left[ \frac{\partial F(\Gamma, \Theta)}{\partial \theta_1}, \frac{\partial F(\Gamma, \Theta)}{\partial \theta_2}, \ldots, \frac{\partial F(\Gamma, \Theta)}{\partial \theta_P} \right]^T    (4.3)

Similarly, the N \times 1 column gradient operator \nabla_{\Gamma} with respect to \Gamma is defined as
    \nabla_{\Gamma} F(\Gamma, \Theta) = \left[ \frac{\partial F(\Gamma, \Theta)}{\partial \Gamma_1}, \frac{\partial F(\Gamma, \Theta)}{\partial \Gamma_2}, \ldots, \frac{\partial F(\Gamma, \Theta)}{\partial \Gamma_N} \right]^T    (4.4)

We also define the following four second derivative operators: the P \times P operator \nabla_{\Theta}\nabla_{\Theta}, the N \times N operator \nabla_{\Gamma}\nabla_{\Gamma}, the N \times P operator \nabla_{\Gamma}\nabla_{\Theta} and the P \times N operator \nabla_{\Theta}\nabla_{\Gamma}, defined as

    \nabla_{\Theta}\nabla_{\Theta} F = \nabla_{\Theta}[\{\nabla_{\Theta} F\}^T]
    \nabla_{\Gamma}\nabla_{\Gamma} F = \nabla_{\Gamma}[\{\nabla_{\Gamma} F\}^T]
    \nabla_{\Gamma}\nabla_{\Theta} F = \nabla_{\Gamma}[\{\nabla_{\Theta} F\}^T]
    \nabla_{\Theta}\nabla_{\Gamma} F = \nabla_{\Theta}[\{\nabla_{\Gamma} F\}^T]    (4.5)

Using the generalized chain rule, it can be shown that the vector derivatives are as follows:

    \nabla_{\Theta} F = J^T \Sigma^{-1} (\Gamma - T(\Theta))
    \nabla_{\Gamma} F = -\Sigma^{-1} (\Gamma - T(\Theta))
    \nabla_{\Theta}\nabla_{\Theta} F = -J^T \Sigma^{-1} J
    \nabla_{\Gamma}\nabla_{\Gamma} F = -\Sigma^{-1}
    \nabla_{\Gamma}\nabla_{\Theta} F = \Sigma^{-1} J
    \nabla_{\Theta}\nabla_{\Gamma} F = J^T \Sigma^{-1}    (4.6)

where J is the N \times P matrix of partial derivatives of T(\Theta), called the Jacobian of T(\Theta):

    [J]_{ij} = \frac{\partial t_i(\Theta)}{\partial \theta_j}    (4.7)

Refer to Appendix A for the individual derivatives of the Jacobian matrix.

4.3 Estimator Covariance

In this section we use the Taylor series expansion and the implicit function theorem to derive an approximate expression for the covariance of the implicitly defined estimator. The ML estimate of \Theta is the one which maximizes the log-likelihood defined in Equation 4.2. The maximum can be found by setting the first derivative to zero, i.e.,

    \nabla_{\Theta} F(\Gamma, \Theta) |_{\Theta = \hat{\Theta}} = 0    (4.8)

where 0 is a zero column vector of length P. The implicit function theorem guarantees that Equation 4.8 implicitly defines a vector valued function \hat{\Theta} = h(\Gamma) = [h_1(\Gamma), h_2(\Gamma), \ldots, h_P(\Gamma)]^T that maps the observation vector \Gamma to the parameter vector \hat{\Theta}. Equation 4.8 can be written as

    \nabla_{\Theta} F(\Gamma, \Theta) |_{\Theta = h(\Gamma)} = 0    (4.9)
    \nabla_{\Theta} F(h(\Gamma), \Gamma) = 0    (4.10)

However, it is not possible to find an analytical expression for h(\Gamma). We can nevertheless approximate the covariance using the first-order Taylor series expansion of h(\Gamma). Let \Gamma_m be the mean of \Gamma. Expanding h(\Gamma) around \Gamma_m we get

    h(\Gamma) \approx h(\Gamma_m) + [\nabla_{\Gamma} h(\Gamma)^T |_{\Gamma = \Gamma_m}]^T (\Gamma - \Gamma_m)    (4.11)

where \nabla_{\Gamma} = [\partial/\partial\Gamma_1, \partial/\partial\Gamma_2, \ldots, \partial/\partial\Gamma_N]^T is the N \times 1 column gradient operator. Taking the covariance on both sides yields

    Cov(h(\Gamma)) \approx [\nabla_{\Gamma} h(\Gamma)^T |_{\Gamma = \Gamma_m}]^T Cov(\Gamma) [\nabla_{\Gamma} h(\Gamma)^T |_{\Gamma = \Gamma_m}]    (4.12)

Note that we do not know h(\Gamma). Differentiating Equation 4.10 with respect to \Gamma and evaluating at \Gamma_m yields

    \nabla_{\Theta}\nabla_{\Theta} F(h(\Gamma), \Gamma) [\nabla_{\Gamma} h(\Gamma)^T]^T + \nabla_{\Theta}\nabla_{\Gamma} F(h(\Gamma), \Gamma) |_{\Gamma = \Gamma_m} = 0
    \nabla_{\Theta}\nabla_{\Theta} F(h(\Gamma_m), \Gamma_m) [\nabla_{\Gamma} h(\Gamma_m)^T]^T + \nabla_{\Theta}\nabla_{\Gamma} F(h(\Gamma_m), \Gamma_m) = 0    (4.13)

Assuming \nabla_{\Theta}\nabla_{\Theta} F(h(\Gamma_m), \Gamma_m) is invertible, we can write

    [\nabla_{\Gamma} h(\Gamma_m)^T]^T = -[\nabla_{\Theta}\nabla_{\Theta} F(h(\Gamma_m), \Gamma_m)]^{-1} \nabla_{\Theta}\nabla_{\Gamma} F(h(\Gamma_m), \Gamma_m)    (4.14)

Substituting from Equation 4.6 we get

    [\nabla_{\Gamma} h(\Gamma_m)^T]^T = -[-J^T \Sigma^{-1} J]^{-1} J^T \Sigma^{-1}    (4.15)

Using this in the covariance expression of Equation 4.12, we arrive at

    Cov(\hat{\Theta}) = Cov(h(\Gamma))
                      = [J^T \Sigma^{-1} J]^{-1} J^T \Sigma^{-1} \Sigma \Sigma^{-1} J \{[J^T \Sigma^{-1} J]^T\}^{-1}
                      = [J^T \Sigma^{-1} J]^{-1} [J^T \Sigma^{-1} J] [J^T \Sigma^{-1} J]^{-1}
                      = [J^T \Sigma^{-1} J]^{-1}    (4.16)

    Cov(\hat{\Theta}) = [J^T \Sigma^{-1} J]^{-1}    (4.17)

4.4 Estimator Mean

Taking the expectation of the first-order Taylor series expansion in Equation 4.11,

    E(h(\Gamma)) \approx h(\Gamma_m) = h(T(\Theta))    (4.18)

where we have used the fact that \Gamma_m = T(\Theta). We see that the mean is the value given by the estimation procedure when applied to the actual noise-free measurements T. It is also possible to get the mean using the second-order Taylor series expansion, but that involves third-order derivatives and generally does not yield a simple form like Equation 4.17.

4.5 Cramér-Rao Bound

The Cramér-Rao bound gives a lower bound on the variance of any unbiased estimator [33]. It does not depend on the particular estimation method used. In this section we derive the Cramér-Rao bound (CRB) assuming our estimator is unbiased. The variance of any unbiased estimator \hat{\Theta} of \Theta is bounded as [33]

    E\left[ (\hat{\Theta} - \Theta)(\hat{\Theta} - \Theta)^T \right] \geq F^{-1}(\Theta)    (4.19)

where F(\Theta) is called the Fisher Information matrix and is given by
    F(\Theta) = E\left\{ [\nabla_{\Theta} \ln p(\Gamma/\Theta)] [\nabla_{\Theta} \ln p(\Gamma/\Theta)]^T \right\}    (4.20)

The derivative of the log-likelihood function can be found using the generalized chain rule and is given by

    \nabla_{\Theta} \ln p(\Gamma/\Theta) = J^T \Sigma^{-1} (\Gamma - T)    (4.21)

where J is the Jacobian. Substituting this in Equation 4.20 and taking the expectation, the Fisher Information matrix is

    F = J^T \Sigma^{-1} J    (4.22)

    Cov(\hat{\Theta}) \geq [J^T \Sigma^{-1} J]^{-1}    (4.23)

Note that this expression is the same as the approximate covariance of the estimator derived in the previous section.

4.6 Rank of the Fisher Information Matrix

If we assume \Sigma = \sigma^2 I, i.e., the noise components are independent with equal variance, then the covariance matrix simplifies to

    Cov(\hat{\Theta}) = \sigma^2 [J^T J]^{-1} = F^{-1}    (4.24)

where F = \frac{1}{\sigma^2} J^T J. If we assume that all the microphone and source locations are unknown, F is rank deficient and hence not invertible. This is because the ML cost function as formulated is invariant to rotations and translations of the configuration, so the solution is not unique. In order to make the Fisher Information matrix invertible, we remove the rows and columns corresponding to the known parameters.

Theorem: rank(F) < P.

Proof: rank(F) = rank(J^T J) = rank(J) \leq \min(N, P). In our case we always have a sufficient number of microphones and speakers such that P < N, i.e., the number of parameters to be estimated is less than the number of observations; hence rank(F) \leq P. Since the rank of F equals the rank of J, rank(F) is strictly less than P only when the columns of J are linearly dependent. J is the N \times P Jacobian of T(\Theta); each row of J contains the derivatives of one TOF_{ij} with respect to all the unknown parameters. From Appendix A it can be seen that for TOF_{ij} the only non-zero derivatives are those with respect to mx_i, my_i, mz_i, sx_j, sy_j and sz_j, and that these derivatives sum to zero. Therefore each row of J sums to zero, hence the columns of J are linearly dependent and rank(F) < P.

4.7 Effect of the Nuisance parameters

The speaker emission start times, the microphone capture start times and the loudspeaker coordinates can be considered nuisance parameters, since we are interested only in the microphone coordinates. We can split the Jacobian J into two blocks, one involving the parameters of interest, i.e., the microphone coordinates, and the other the nuisance parameters. Let \Theta_m represent the parameters of interest and \Theta_s the nuisance parameters. Then

    J = [J_m \; J_s],  where  J_m = \frac{\partial T(\Theta)}{\partial \Theta_m},  J_s = \frac{\partial T(\Theta)}{\partial \Theta_s}    (4.25)

So now

    F = \frac{1}{\sigma^2} \begin{bmatrix} J_m^T J_m & J_m^T J_s \\ J_s^T J_m & J_s^T J_s \end{bmatrix}    (4.26)

Using the formula for the inverse of a block matrix, we can write

    F^{-1} = \sigma^2 \begin{bmatrix} F_{mm}^{-1} & F_{ms} \\ F_{sm} & F_{ss}^{-1} \end{bmatrix}    (4.27)

where

    F_{mm} = J_m^T \left[ I - J_s (J_s^T J_s)^{-1} J_s^T \right] J_m
    F_{ss} = J_s^T \left[ I - J_m (J_m^T J_m)^{-1} J_m^T \right] J_s
    F_{ms} = -(J_m^T J_m)^{-1} (J_m^T J_s) F_{ss}^{-1}
    F_{sm} = -F_{ss}^{-1} (J_s^T J_m) (J_m^T J_m)^{-1}    (4.28)

and I is the identity matrix of appropriate dimensions.
In the TOF case only one signal was degraded due to noise and reverberation while the other was the reference signal. In case of TDOF both the signals are degraded. The efiect of the nuisance parameters on the Cram?er Rao bound can be seen from Figure 4.1, where the total error variance in the microphone coordinates is plotted against the noise standard deviation for both normal position estimation and joint position estimation. For both the TOF and TDOF approaches the joint estimation results in a higher variance which is due to the extra nuisance parameters. Among TOF and TDOF approaches TOF has more number of nuisance parameters and hence it has a higher variance than the TDOF approach. Another point to be noted is that in the TDOF approach we need not use all the M(M ?1)=2 pairwise TDOF measurements. However as we use more and more 40 10?8 10?7 10?6 10?5 10?4 10?310 ?12 10?10 10?8 10?6 10?4 10?2 100 102 TOF Noise Standard Deviation (?) Total Variance of the microphone coordinates (m 2 ) TOF based Position EstimationTDOA based Position Estimation TOF based Joint EstimationTDOA based Joint Estimation(120 pairs) TDOA based Joint Estimation(16 pairs) Figure 4.1: Cram?er-Rao bound on the total variance of the unknown microphone coordinates as a function of TOF noise standard deviation for difierent estimation procedures. For the TDOF- based method the noise variance was taken as twice that of the TOF variance. The network had a total of 16 microphones and 16 speakers. TDOF measurements the variance decreases as can be seen in Figure 4.1. 4.8 Increasing the number of GPCs As the number of nodes increases in the network, the CRB on the covariance matrix decreases. The more microphones and speakers in the network, the smaller the error in estimating their positions. Figure 4.2(a) shows the 95% uncertainty ellipses for a regular two dimensional array consisting of 9 microphones and 9 41 speakers, for both the TOF and the TDOF-based joint estimation procedures. We flxed the position of one microphone and the x coordinate of one speaker. For the flxed speaker only the variance in y direction is shown since the x coordinate is flxed. For TOF-based method the noise variance was assumed to be 10?9 in order to properly visualize the uncertainty ellipses. In order to give a fair comparison, a noise variance of 2?10?9 was assumed for the TDOF-based method. Figure 4.2(b) shows the corresponding 95% uncertainty ellipses for a two dimensional array consisting of 25 microphones and 25 speakers. It can be seen that as the number of sensors in the network increases the size of the uncertainty ellipses decreases. Intuitively this can be explained as follows: Let there be a total of n nodes in the network whose coordinates are unknown. Then we have to estimate a total of 3n parameters. The total number of TOF measurements available is however n2=4 (assuming that there are n=2 microphones and n=2 speakers). So if the number of unknown parameters increases as O(n), the number of available measurements increases as O(n2). The linear increase in the number of unknown parameters, is compensated by the quadratic increase in the available measurements, which suggests that the uncertainty per unknown variable will decrease. 4.9 How to select a coordinate system? The geometry of the network plays an important role in CRB. It is possible to analyze how to place the sensors in order to achieve a lower CRB. In an ad-hoc network, however, such analysis is of little beneflt. 
4.9 How to select a coordinate system?

The geometry of the network plays an important role in the CRB, and one can analyze how to place the sensors so as to achieve a lower bound. In an ad-hoc network, however, such analysis is of little benefit, since we do not control where the nodes are placed. In our formulation we assumed that we know the positions of a certain number of nodes, i.e. we fix three of the nodes to lie in the x-y plane. The CRB depends on which of the sensor nodes are assumed to have known positions. Figure 4.3 shows the 95% uncertainty ellipses for a regular two-dimensional array of 25 microphones and 25 speakers for different positions of the known nodes. In Figure 4.3(a) the two known nodes are at one corner of the grid: the uncertainty ellipses become wider as one moves away from the known nodes, and the uncertainty in the direction tangential to the line joining a sensor node to the center of the known nodes is much larger than along that line. The same behavior can be seen in Figure 4.3(b), where the known nodes are at the center of the grid. The reason can be explained for the simple case where we know the locations of two speakers, as shown in Figure 4.3(d). Each circular band represents the uncertainty in a distance estimate, and the intersection of the two annuli corresponding to the two speakers gives the uncertainty region for the position of the sensor. For nodes far from the two speakers the region widens because the curvature of the annuli decreases. It is therefore beneficial to have the known nodes on the edges of the network and as far from each other as possible. In Figure 4.3(c) the known sensor nodes are on the edges of the network, and the uncertainty ellipses shrink substantially. To minimize the error due to Gaussian noise we should choose the three reference nodes (in 3D) as far apart as possible; in practice, using the TOF matrix, we can choose the three nodes for which the area of the triangle they form is maximal. In this way we can dynamically adapt our coordinate system to minimize the error even when the array geometry changes drastically.

4.10 Monte Carlo Simulation Results

We performed a series of Monte Carlo simulations to compare the performance of the different estimation procedures. 16 microphones and 16 speakers were placed at random in a room of dimensions 4.0 m × 4.0 m × 4.0 m, with each speaker close to a microphone in order to simulate a typical laptop. Based on the geometry of the setup, the actual TOF between each speaker and microphone was calculated and then corrupted with zero-mean additive white Gaussian noise of variance σ² to model the room ambient noise and reverberation. The TOF matrix was also corrupted by known systematic errors, i.e. a known microphone capture start time and a known speaker emission start time were added. The Levenberg-Marquardt method was used as the minimization routine, and for each noise variance σ² the results were averaged over 2000 trials.

Figure 4.4(a) and Figure 4.4(b) show the total variance and the total bias (the sum of the biases in each parameter) of all the unknown microphone coordinates plotted against the noise standard deviation for both the TOF and the TDOF-based approach. Results are shown for plain position estimation and for joint estimation of positions and start times, along with the Cramér-Rao bound for the TDOF-based joint estimation procedure. Since the TOFs were corrupted with systematic errors, plain position estimation shows a very high variance and a correspondingly high bias; when the TOFs are corrupted by systematic errors we therefore need to estimate the positions jointly with the nuisance parameters.
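A sketch of the simulated measurement model (Python/NumPy). The sign convention TOF_ij = ||m_i - s_j||/c + ts_j - tm_i follows the derivatives in Appendix A; the start-time ranges and the noise level are illustrative values, not the ones used in the thesis experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
c = 342.0                                   # speed of sound (m/s), assumed
M = S = 16
mics = rng.uniform(0.0, 4.0, (M, 3))        # 4 m x 4 m x 4 m room
spks = np.clip(mics + rng.uniform(-0.2, 0.2, (S, 3)), 0.0, 4.0)  # speaker near "its" mic

tm = rng.uniform(0.0, 0.1, M)               # capture start times (systematic error)
ts = rng.uniform(0.0, 0.1, S)               # emission start times (systematic error)

dist = np.linalg.norm(mics[:, None, :] - spks[None, :, :], axis=2)
sigma = 1e-4                                # TOF noise std. dev., illustrative
tof = dist / c + ts[None, :] - tm[:, None] + sigma * rng.standard_normal((M, S))
# Feeding `tof` to plain position estimation yields a large bias;
# jointly estimating tm and ts removes the systematic component.
```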
Even though the TDOF-based joint estimation procedure theoretically has the lowest variance, experimentally all the joint estimation procedures showed the same variance. The estimator is unbiased for low noise variances.

[Figure 4.2: 95% uncertainty ellipses for a regular 2-dimensional array of (a) 9 speakers and 9 microphones and (b) 25 speakers and 25 microphones. The noise variance in both cases is σ² = 10⁻⁹ for the TOF-based method and σ² = 2×10⁻⁹ for the TDOF-based method. Microphones are shown as crosses (x) and speakers as dots (.). The position of one microphone and the x coordinate of one speaker are assumed known (shown in bold). The solid and dotted ellipses correspond to the TOF and TDOF-based estimation procedures, respectively.]

[Figure 4.3: 95% uncertainty ellipses for a regular 2-dimensional array of 25 microphones and 25 speakers for different positions of the known microphone and different x coordinates of the known speaker. In (a) and (b) the known nodes are close to each other; in (c) they are spread out, one at each corner of the grid. Microphones are shown as crosses (x) and speakers as dots (.). The noise variance in all cases was σ² = 10⁻⁹. (d) Schematic explaining the shape of the uncertainty ellipses. 50 TDOF pairs were used for the estimation procedure.]

[Figure 4.4: (a) Total variance and (b) total bias of all the microphone coordinates for increasing TOF noise standard deviation σ (curves: TOF-based position estimation, TDOA-based position estimation with 45 pairs, TOF-based joint estimation, TDOA-based joint estimation with 45 pairs, TDOA-based joint estimation with 10 pairs, and the CR bound for TDOA-based joint estimation with 45 pairs). The sensor network consisted of 16 microphones and 16 speakers. For the TDOF-based method the noise variance was taken as twice the TOF variance.]

Chapter 5

Implementation Details

In this chapter we discuss some of the practical issues of our real-time implementation, such as the choice of calibration signal and the TOF estimation procedure, as well as other design decisions.

5.1 Calibration Signals

In order to measure the TOF accurately, the calibration signal has to be appropriately selected and its parameters properly tuned. Chirp signals and Maximum Length Sequences are the two most popular choices for this task. A linear chirp signal is a short pulse in which the frequency of the signal varies linearly between two preset frequencies.
The cosine linear chirp signal of duration T, with instantaneous frequency varying linearly from f_0 to f_1, is given by

s(t) = A \cos\!\left( 2\pi \left[ f_0 + \frac{f_1 - f_0}{2T}\, t \right] t \right), \quad 0 \leq t \leq T (5.1)

In our system we used a chirp signal of 512 samples at 44.1 kHz (11.61 ms) as the calibration signal, with the instantaneous frequency varying linearly from 5 kHz to 8 kHz. The initial and final frequencies were chosen to lie in the common passband of the microphone and speaker frequency responses.

The chirp signal sent by the speaker is convolved with the room impulse response, which spreads the chirp out in time. Figure 5.1(a) shows the chirp signal as sent out by the soundcard to the speaker; this signal is recorded by looping the output channels directly back to an input channel, and its initial delay corresponds to the source emission time. Figure 5.1(b) shows the corresponding chirp signal received by a microphone: the chirp is delayed by the propagation path, and the distortion and spreading are due to the speaker, microphone and room responses. Figures 5.1(c) and 5.1(d) show the magnitude spectra of the transmitted and received chirp signals, respectively. One of the difficulties in accurately estimating the TOF is multipath propagation caused by room reflections; in the received chirp signal the initial part corresponds to the direct path and the rest to the room reflections.

We use Time Division Multiplexing to send the calibration signal to the different speakers. To avoid interference between the calibration signals, each one is zero-padded appropriately, depending on the room reverberation level and the maximum propagation delay. Alternatively, we could use Frequency Division Multiplexing, allocating a frequency band to each channel, or spread-spectrum techniques, using a different Maximum Length Sequence for each channel. The advantage would be that all output channels could play simultaneously; however, extra processing would be needed at the input to separate the signals.

[Figure 5.1: (a) The loopback reference chirp signal, (b) the chirp signal received by one of the microphones, (c) the magnitude spectrum of the reference signal and (d) the magnitude spectrum of the received chirp signal.]
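A minimal sketch of generating this calibration chirp (Python/NumPy; the parameter values are the ones quoted above):

```python
import numpy as np

fs = 44100                        # sampling rate (Hz)
n = 512                           # 512 samples, about 11.61 ms
T = n / fs
f0, f1 = 5000.0, 8000.0           # sweep limits inside the common passband
t = np.arange(n) / fs
# The phase 2*pi*(f0 + (f1 - f0)/(2T) * t) * t has instantaneous
# frequency f0 + (f1 - f0) * t / T, a linear sweep from f0 to f1.
chirp = np.cos(2 * np.pi * (f0 + (f1 - f0) / (2 * T) * t) * t)
```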
5.2 Time Delay Estimation

This is the most crucial part of the algorithm and also a potential source of error, so a great deal of care must be taken to obtain accurate TOF measurements in noisy and reverberant environments. The time delay may be obtained by locating the peak of the cross-correlation function of the signals received by a pair of microphones, but this approach is not robust to degradations in the signals. Knapp and Carter [18] developed the Maximum Likelihood (ML) estimator for the time delay between signals received at two spatially separated microphones when the noise is uncorrelated. In this method the estimated delay is the time lag which maximizes the cross-correlation between filtered versions of the received signals [18]; the cross-correlation of the filtered signals is called the Generalized Cross Correlation (GCC) function. The GCC function R_{x_1 x_2}(\tau) is given by [18]

R_{x_1 x_2}(\tau) = \int_{-\infty}^{\infty} W(\omega) X_1(\omega) X_2^*(\omega) e^{j\omega\tau} \, d\omega (5.2)

where X_1(\omega) and X_2(\omega) are the Fourier transforms of the microphone signals x_1(t) and x_2(t), respectively, and W(\omega) is the weight function. The effects of five different weight functions, namely the Roth impulse response, the Smoothed Coherence Transform (SCOT), the Phase Transform (PHAT), the Eckart filter and the Maximum Likelihood (ML) weighting, were studied in [18]. The two most commonly used are ML and PHAT. The ML weight function accentuates the signal passed to the correlator at frequencies where the Signal-to-Noise Ratio (SNR) is high [18]. Brandstein, Adcock and Silverman [9] proposed an approximate ML-type weighting for speech applications,

\widehat{W}_{ML}(\omega) = \frac{|X_1(\omega)|\,|X_2(\omega)|}{|N_1(\omega)|^2 |X_2(\omega)|^2 + |N_2(\omega)|^2 |X_1(\omega)|^2} (5.3)

where |N_1(\omega)| and |N_2(\omega)| are the noise power spectra at the two microphones, assumed to be estimated during silence intervals [9]. We use this weight function in our simulation studies. The ML weighting performs well when the room reverberation is low; as the reverberation increases, its performance degrades [6]. Since the spectral characteristics of the received signal are affected by multipath propagation, i.e. room reverberation, the GCC function is made more robust by de-emphasizing the frequency-dependent weighting. The Phase Transform is the extreme case in which the magnitude spectrum is flattened completely:

W_{PHAT}(\omega) = \frac{1}{|X_1(\omega) X_2^*(\omega)|} (5.4)

By flattening the magnitude spectrum, the resulting peak in the GCC function corresponds to the dominant delay. The disadvantage of PHAT weighting is that it places equal emphasis on low- and high-SNR regions, and hence works well only when the overall noise level is low. Stéphanne and Champagne [30] proposed cepstral prefiltering to reduce the effects of reverberation, and Benesty [7] proposed a time-delay estimation method based on the eigenvalue decomposition of the covariance matrix. Our current implementation offers a choice between simple cross-correlation and GCC with PHAT weighting; we plan to combine the ML and PHAT techniques appropriately depending on the environment. A more accurate estimate of the peak location can be obtained with a parabolic fit around the peak. Figure 5.2 shows a sample GCC-PHAT function; the TOF is the position of the peak in the correlation function.

[Figure 5.2: GCC function with PHAT weighting.]
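A compact sketch of a GCC-PHAT delay estimator (Python/NumPy), as a discrete-time stand-in for Equations 5.2 and 5.4; the small constant added to the denominator is a numerical safeguard, an implementation choice rather than part of the original formulation.

```python
import numpy as np

def gcc_phat(x1, x2, fs):
    # Estimate the delay D (seconds) such that x2(t) ~ x1(t - D).
    n = len(x1) + len(x2)                  # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n)
    X2 = np.fft.rfft(x2, n)
    G = X2 * np.conj(X1)
    G /= np.abs(G) + 1e-12                 # PHAT: flatten the magnitude spectrum
    r = np.fft.irfft(G, n)                 # GCC function over all lags
    lags = np.concatenate((np.arange(n - n // 2), np.arange(-(n // 2), 0)))
    return lags[np.argmax(r)] / fs

# Self-check with a 50-sample delay at 44.1 kHz:
fs = 44100
x = np.random.randn(4096)
y = np.concatenate((np.zeros(50), x))[:4096]
print(round(gcc_phat(x, y, fs) * fs))      # 50
```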
5.3 Testbed Setup

The real-time setup has been tested in both a synchronized and a distributed configuration. Figure 5.3(a) shows the top view of our experimental synchronized setup: four omnidirectional microphones (RadioShack) and four loudspeakers (Mackie HR624) were set up in a room with low reverberation and low ambient noise, interfaced through an RME DIGI9652 card. For the distributed setup we used five laptops (IBM T-series ThinkPads with Intel Pentium series processors), as shown in Figure 5.4(a), using each laptop's internal microphone and speaker. The room also contained multiple PCs, which acted as noise sources. All five laptops were placed on a flat table so that we could form a 2D coordinate system.¹ The ground truth was measured manually to validate the results of the position calibration methods.

¹As discussed earlier, a minimum of six laptops is needed for the minimization routine; with five laptops we need to know the actual x coordinate of one of them.

5.4 Software Details

Capture and playback were done using PortAudio [4], a free, cross-platform, open-source audio I/O library that provides a very simple API for recording and playing sound through a callback function [4]. Most of the signal processing tasks were implemented using the Intel Integrated Performance Primitives (IPP), a cross-platform, low-level software layer that abstracts multimedia functionality from the underlying processor and provides highly optimized code [2]. For the non-linear minimization we used the mrqmin routine from Numerical Recipes in C [26]. For displaying the calibrated microphones and speakers we used the OpenGL Utility Toolkit (GLUT) ported to Win32 [5].

For the distributed platform we used Universal Plug and Play (UPnP) [1] to form an ad-hoc network and control the audio devices on the different platforms. UPnP is a distributed, open networking architecture that employs TCP/IP and other Internet technologies to enable seamless proximity networking [1]. Each laptop runs a UPnP service for playing the chirp signal and capturing the audio stream. A program on the master scans the network for all available UPnP players; the master first starts audio capture on each laptop, one by one, and the chirp signal is then played on each device in turn while the signals are captured. The TOF computation is distributed: each laptop computes its own TOFs and reports them back to the master, which runs the minimization routine once it has the complete TOF matrix. The TOF estimation consumes negligible CPU resources, and with a good initial guess from the multidimensional scaling technique the minimization converges within 10 to 30 iterations.

For the setup of 5 microphones and 5 speakers, Figure 5.4(b) shows the actual ('x') and estimated ('o') positions of the microphones and speakers; the locations obtained from the closed-form approximate solution are shown as '*'. The localization error for each microphone or speaker is defined as the Euclidean distance between the actual and estimated positions; for our setup the average localization error was 8.2 cm. For the synchronized setup of 4 speakers and 4 microphones, the three-dimensional locations of the sensors and actuators could be estimated with an average bias of 0.08 cm and an average standard deviation of 3 cm (results averaged over 100 trials). Figure 5.3(b) shows a snapshot of the OpenGL display with the estimated locations of the speakers and microphones.

Our algorithm assumes that the sampling rate of each laptop is known and that the clocks do not drift. In practice the sampling rate deviates from its nominal value and the clocks do drift, so our real-time setup integrates the distributed synchronization scheme based on ML sequences proposed in [19], which essentially provides the exact sampling rate on each of the GPCs. Figure 5.5 shows a schematic of the TOF computation protocol.
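The control flow just described can be summarized in a hypothetical orchestration loop (Python-flavored sketch; start_capture, play_chirp, report_tofs and minimize_ml are illustrative stand-ins for the UPnP service calls and the Levenberg-Marquardt step, not real PortAudio or UPnP APIs):

```python
def calibrate(master, nodes):
    # 1. Start audio capture on every GPC, one by one.
    for node in nodes:
        node.start_capture()
    # 2. Play the chirp on each device in turn (time-division multiplexing).
    for node in nodes:
        node.play_chirp()
    # 3. Each GPC computes its own TOFs locally and reports them back.
    tof_matrix = [node.report_tofs() for node in nodes]
    # 4. The master runs the non-linear minimization on the full matrix.
    return master.minimize_ml(tof_matrix)
```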
5.5 Dealing with Partial Information

In some cases not all TOF measurements are available, either because a large obstacle sits between a microphone and a speaker or because a measurement is an outlier. In such cases we can formulate the ML estimation procedure so that it ignores the unavailable measurements: we define a weight w_{ij} which takes the value 1 if the corresponding measurement is available and 0 otherwise. For example, for TOF-based position estimation the ML estimate becomes

\hat{\Theta}_{ML} = \arg\min_{\Theta} F_{ML}(\Theta) (5.5)

F_{ML}(\Theta) = \sum_{j=1}^{S} \sum_{i=1}^{M} w_{ij} \, \frac{\left( TOF_{ij}^{estimated} - TOF_{ij}^{actual} \right)^2}{\sigma_{ij}^2} (5.6)

In the case of multidimensional scaling it is possible to fill in the missing entries when not all pairwise distances are available. If we know the distance between a node and at least four other nodes (three in two dimensions), then it is possible to find the distance between that node and all the other nodes. First, using the available pairwise distances between a subset of nodes, we can form a coordinate system for that subset and hence determine the locations of its nodes. If the locations of four nodes are known, the position of an unknown node can be determined analytically by trilateration: knowing its distance from one known node, the unknown node can lie anywhere on a sphere centered at that node; with two known nodes it is constrained to a circle, since two spheres intersect in a circle; with three nodes we get two candidate points; and with four we obtain a unique location. Since the estimated distances are corrupted by noise, the intersection is in general not a unique point, and if the distances to more than four nodes are known we solve the problem in a least-squares sense. Once the node locations are known, the missing pairwise distances can be computed.

5.6 Robust ML Estimation

In practice the TOF measurements may contain outliers, i.e. measurements in gross disagreement with the underlying model. Outliers can strongly influence the solution and completely distort the non-linear fit, so robust methods for ML estimation are needed. One approach is to iteratively discard the measurement with the largest residual after the non-linear least-squares fit; however, this does not necessarily remove the actual outlier. Two other robust methods are M-estimators and RANSAC. With M-estimators, a robust error metric replaces the squared error. One example is the Lorentzian function, which penalizes large errors far less severely than the squared error:

d(e_n) = \ln\!\left( 1 + \left( \frac{e_n}{\sigma} \right)^2 \right) (5.7)

In the RANdom SAmple Consensus (RANSAC) method we use the minimum amount of data required to find a solution (see Table 2.2 for the minimum required by each estimation procedure). This is repeated on different subsets of the data so that there is a high chance that at least one subset contains only good data points. The best solution is the one that maximizes the number of points whose residual is below a given threshold. Once all the outliers are removed, the estimation can be performed using only the good data.
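To see why the Lorentzian metric tames outliers, compare it with the squared error on a residual vector containing one gross outlier (a throwaway sketch; the scale σ is set to 1):

```python
import numpy as np

def lorentzian(e, scale=1.0):
    # d(e) = ln(1 + (e/scale)^2), Equation 5.7
    return np.sum(np.log1p((e / scale) ** 2))

e = np.array([0.01, -0.02, 0.015, 5.0])  # the last residual is a gross outlier
print(np.sum(e ** 2))   # ~25.0: the squared error is dominated by the outlier
print(lorentzian(e))    # ~3.26: the outlier contributes only ln(26)
```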
[Figure 5.3: (a) Synchronized setup consisting of four microphones and four loudspeakers (room dimensions 4.22 m x 2.55 m x 2.03 m). (b) A sample screenshot of the OpenGL display showing the positions of the microphones and loudspeakers for the synchronized setup.]

[Figure 5.4: (a) 2D distributed setup consisting of 5 laptops placed on a table. (b) Results for the setup of 5 laptops, each with one internal microphone and speaker, comparing the TOF-based, TDOF-based and MDS solutions.]

[Figure 5.5: Schematic showing the distributed control scheme. In the initialization phase the master scans the network for the available GPCs and UPnP services and distributes the calibration signal parameters; each GPC then plays the calibration signal and the ML sequence in turn, the TOA computations are gathered into a TOA matrix, and the master performs the position estimation.]

Chapter 6

Conclusions

In this thesis we described the problem of position calibration of acoustic sensors and actuators in a network of distributed general-purpose computing platforms. Our approach allows laptops, PDAs and tablets to be placed into a common 3D coordinate system. Together with time synchronization, this creates arrays of audio sensors and actuators, enabling a rich set of new multistream A/V applications on platforms that are available virtually anywhere. We also derived important bounds on the performance of spatial localization algorithms, proposed optimization techniques to implement them, and extensively validated the algorithms on simulated and real data.

Appendix A

Derivatives

The following derivatives are needed for the minimization routine; they form the non-zero elements of the Jacobian matrix. For \alpha \in \{x, y, z\},

\frac{\partial \widehat{TOF}_{ij}}{\partial m_{\alpha i}} = -\frac{\partial \widehat{TOF}_{ij}}{\partial s_{\alpha j}} = \frac{m_{\alpha i} - s_{\alpha j}}{c \, \| m_i - s_j \|}, \qquad \frac{\partial \widehat{TOF}_{ij}}{\partial t_{s_j}} = -\frac{\partial \widehat{TOF}_{ij}}{\partial t_{m_i}} = 1 (A.1)

\frac{\partial \widehat{TDOF}_{ikj}}{\partial m_{\alpha i}} = \frac{m_{\alpha i} - s_{\alpha j}}{c \, \| m_i - s_j \|}, \qquad \frac{\partial \widehat{TDOF}_{ikj}}{\partial m_{\alpha k}} = -\frac{m_{\alpha k} - s_{\alpha j}}{c \, \| m_k - s_j \|},

\frac{\partial \widehat{TDOF}_{ikj}}{\partial s_{\alpha j}} = -\frac{m_{\alpha i} - s_{\alpha j}}{c \, \| m_i - s_j \|} + \frac{m_{\alpha k} - s_{\alpha j}}{c \, \| m_k - s_j \|}, \qquad \frac{\partial \widehat{TDOF}_{ikj}}{\partial t_{m_k}} = -\frac{\partial \widehat{TDOF}_{ikj}}{\partial t_{m_i}} = 1 (A.2)

Appendix B

Distance matrix to a dot product matrix

Suppose we choose the k-th GPC as the origin of our coordinate system, and let d_{ij} and b_{ij} be, respectively, the distance and the dot product between the i-th and j-th GPCs. Referring to Figure B.1 and using the law of cosines,

d_{ij}^2 = d_{ki}^2 + d_{kj}^2 - 2 d_{ki} d_{kj} \cos\alpha (B.1)

The dot product b_{ij} is defined as

b_{ij} = d_{ki} d_{kj} \cos\alpha (B.2)

Combining the two equations,

b_{ij} = \frac{1}{2}\left( d_{ki}^2 + d_{kj}^2 - d_{ij}^2 \right) (B.3)

This, however, takes the k-th GPC as the origin of the coordinate system, whereas we need the dot product matrix with the centroid as the origin. Let B be the dot product matrix with respect to the k-th GPC, let \hat{B} be the dot product matrix with the centroid of the data points as the origin, and let \hat{X} be the matrix of coordinates with the origin shifted to the centroid:

\hat{X} = X - \frac{1}{N} 1_{N \times N} X (B.4)

where 1_{N \times N} is an N x N matrix whose elements are all 1. Then \hat{B} can be written in terms of B as

\hat{B} = \hat{X}\hat{X}^T = \left( X - \frac{1}{N} 1_{N \times N} X \right)\left( X - \frac{1}{N} 1_{N \times N} X \right)^T (B.5)
= XX^T - \frac{1}{N} XX^T 1_{N \times N} - \frac{1}{N} 1_{N \times N} XX^T + \frac{1}{N^2} 1_{N \times N} XX^T 1_{N \times N} (B.6)
= B - \frac{1}{N} B 1_{N \times N} - \frac{1}{N} 1_{N \times N} B + \frac{1}{N^2} 1_{N \times N} B 1_{N \times N} (B.7)

Hence the ij-th element of \hat{B} is

\hat{b}_{ij} = b_{ij} - \frac{1}{N}\sum_{l=1}^{N} b_{il} - \frac{1}{N}\sum_{m=1}^{N} b_{mj} + \frac{1}{N^2}\sum_{o=1}^{N}\sum_{p=1}^{N} b_{op} (B.8)

Substituting Equation B.3, we get

\hat{b}_{ij} = -\frac{1}{2}\left[ d_{ij}^2 - \frac{1}{N}\sum_{l=1}^{N} d_{il}^2 - \frac{1}{N}\sum_{m=1}^{N} d_{mj}^2 + \frac{1}{N^2}\sum_{o=1}^{N}\sum_{p=1}^{N} d_{op}^2 \right] (B.9)

This operation is known as double centering: subtract the row and column means from each element, add back the grand mean, and multiply by -1/2.

[Figure B.1: Law of cosines.]
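Equations B.3 to B.9 are the double-centering step of classical multidimensional scaling and are easy to exercise numerically (Python/NumPy sketch; recovering coordinates from the eigendecomposition of the centered matrix is the standard MDS step rather than something spelled out in this appendix):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(0.0, 4.0, (6, 3))                  # true node coordinates
D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)  # squared distances

row = D2.mean(axis=1, keepdims=True)
col = D2.mean(axis=0, keepdims=True)
B = -0.5 * (D2 - row - col + D2.mean())            # Equation B.9: double centering

# Recover centroid-referenced coordinates from the top eigenvectors of B.
w, V = np.linalg.eigh(B)
Y = V[:, -3:] * np.sqrt(np.maximum(w[-3:], 0.0))
# Y matches X up to rotation/reflection and translation; distances agree:
D2y = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=2)
print(np.allclose(D2, D2y))                        # True
```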
BIBLIOGRAPHY

[1] http://intel.com/technology/upnp/.
[2] http://www.intel.com/software/products/perflib/.
[3] http://www.netlib.org/minpack/.
[4] http://www.portaudio.com/.
[5] http://www.xmission.com/~nate/glut.html.
[6] S. Bédard, B. Champagne, and A. Stéphanne. Effects of room reverberation on time-delay estimation performance. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pages II-261 – II-264, 1994.
[7] J. Benesty. Adaptive eigenvalue decomposition algorithm for passive acoustic source localization. J. Acoust. Soc. Am., 107(1):384–391, 2000.
[8] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 1995.
[9] M. Brandstein, J. Adcock, and H. Silverman. A practical time-delay estimator for localizing speech sources with a microphone array. Comput. Speech Lang., 9:153–169, September 1995.
[10] A. R. Chowdhury and R. Chellappa. Statistical bias and the accuracy of 3D reconstruction from video. Submitted to International Journal of Computer Vision.
[11] J. A. Fessler. Mean and variance of implicitly defined biased estimators (such as penalized maximum likelihood): Applications to tomography. IEEE Trans. Image Processing, 5(3):493–506, March 1996.
[12] P. E. Gill, W. Murray, and M. H. Wright. Practical Optimization. Academic Press, 1981.
[13] L. Girod, V. Bychkovskiy, J. Elson, and D. Estrin. Locating tiny sensors in time and space: A case study. In Proc. International Conference on Computer Design, September 2002.
[14] J. E. Greenberg and P. M. Zurek. Microphone-array hearing aids. Chapter 11, pages 229–253, in Microphone Arrays: Signal Processing Techniques and Applications. Springer-Verlag, 2001.
[15] R. Le Bouquin Jeannès, P. Scalart, G. Faucon, and C. Beaugeant. Combined noise and echo reduction in hands-free systems: A survey. IEEE Trans. Speech Audio Processing, 9:808–820, November 2001.
[16] N. Kawaguchi, S. Matsubara, H. Iwa, S. Kajita, K. Takeda, F. Itakura, and Y. Inagaki. Construction of speech corpus in moving car environment. In Proc. Int. Conf. Spoken Language Processing, volume III, pages 362–365, 2000.
[17] C. H. Knapp and G. C. Carter. The generalized correlation method for estimation of time delay. IEEE Trans. Acoust., Speech, Signal Processing, ASSP-24(4):320–327, August 1976.
[18] C. H. Knapp and G. C. Carter. The generalized correlation method for estimation of time delay. IEEE Trans. Acoust., Speech, Signal Processing, ASSP-24(4):320–327, August 1976.
[19] R. Lienhart, I. Kozintsev, S. Wehr, and M. Yeung. On the importance of exact synchronization for distributed audio processing. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, April 2003.
[20] R. Moses, D. Krishnamurthy, and R. Patterson. A self-localization method for wireless sensor networks. EURASIP Journal on Applied Signal Processing, Special Issue on Sensor Networks, 2003(4):348–358, March 2003.
[21] B. C. Ng and C. M. S. See. Sensor-array calibration using a maximum-likelihood approach. IEEE Trans. Acoust., Speech, Signal Processing, 44(6):827–835, June 1996.
[22] S. Nordholm, I. Claesson, and N. Grbić. Optimal and adaptive microphone arrays for speech input in automobiles. Chapter 14, pages 307–329, in
Microphone Arrays: Signal Processing Techniques and Applications. Springer-Verlag, 2001.
[23] S. Oh and V. Viswanathan. Hands-free voice communication in an automobile with a microphone array. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pages 281–284, 1992.
[24] M. Omologo, M. Matassoni, and P. Svaizer. Speech recognition with microphone arrays. Chapter 15, pages 331–353, in Microphone Arrays: Signal Processing Techniques and Applications. Springer-Verlag, 2001.
[25] M. Omologo, P. Svaizer, and M. Matassoni. Environmental conditions and acoustic transduction in hands-free speech recognition. Speech Communication, 25:75–95, 1998.
[26] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, 2nd edition, 1995.
[27] Y. Rockah and P. M. Schultheiss. Array shape calibration using sources in unknown locations, Part II: Near-field sources and estimator implementation. IEEE Trans. Acoust., Speech, Signal Processing, ASSP-35(6):724–735, June 1987.
[28] J. M. Sachar, H. F. Silverman, and W. R. Patterson III. Position calibration of large-aperture microphone arrays. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pages II-1797 – II-1800, 2002.
[29] A. Savvides, C. C. Han, and M. B. Srivastava. Dynamic fine-grained localization in ad-hoc wireless sensor networks. In Proc. International Conference on Mobile Computing and Networking, July 2001.
[30] A. Stéphanne and B. Champagne. Cepstral prefiltering for time delay estimation in reverberant environments. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, volume 5, pages 3055–3058, May 1995.
[31] M. Steyvers. Multidimensional scaling. Encyclopedia of Cognitive Science, 2002.
[32] W. S. Torgerson. Multidimensional scaling: I. Theory and method. Psychometrika, 17:401–419, 1952.
[33] H. L. Van Trees. Detection, Estimation, and Modulation Theory, Part I. Wiley-Interscience, 2001.
[34] C. Wang, S. Griebel, P. Hsu, and M. Brandstein. Real-time automated video and audio capture with multiple cameras and microphones. Journal of VLSI Signal Processing Systems, 29(1/2):81–100, August/September 2001.
[35] H. Wang and P. Chu. Voice source localization for automatic camera pointing system in videoconferencing. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pages 187–190, April 1997.
[36] A. J. Weiss and B. Friedlander. Array shape calibration using sources in unknown locations: a maximum-likelihood approach. IEEE Trans. Acoust., Speech, Signal Processing, 37(12):1958–1966, December 1989.
[37] D. Zotkin, R. Duraiswami, V. Philomin, and L. Davis. Smart videoconferencing. In Proc. ICME, pages 1597–1600, August 2000.

Vikas Chandrakant Raykar
3409 Tulane Drive, Apt. 13, Hyattsville, MD 20783
(h) 301-422-6069, (w) 301-405-8753, (cell) 408-242-1863
vikas@umiacs.umd.edu
http://cvl.umiacs.umd.edu/users/vikas/

EDUCATION

Master of Science, December 2003. Department of Electrical Engineering, University of Maryland, College Park, MD. Advisors: Dr. Rama Chellappa and Dr. Ramani Duraiswami. Thesis: Position Calibration of Acoustic Sensors and Actuators on Distributed General Purpose Computing Platforms. Major: Signal Processing; minor: Computer Engineering. Current GPA: 3.785/4.0.

Bachelor of Engineering, May 2001. Electronics and Communication Engineering, Regional Engineering College, Trichy, India. Aggregate: 87.97% (equivalent GPA: 4.0/4.0). Department rank: 1/51.

PUBLICATIONS
1. Position Calibration of Audio Sensors and Actuators in a Distributed Computing Platform. Vikas C. Raykar, Igor Kozintsev and Rainer Lienhart. ACM Multimedia 2003, Berkeley, CA, USA, November 2003.
2. Self Localization of Acoustic Sensors and Actuators on Distributed Platforms. Vikas C. Raykar, Igor Kozintsev and Rainer Lienhart. ICCV 2003 International Workshop on Multimedia Technologies in E-Learning and Collaboration, Nice, France, October 2003.
3. Tracking a Moving Speaker Using Excitation Source Information. Vikas C. Raykar, B. Yegnanarayana, S. R. Mahadeva Prasanna and Ramani Duraiswami. Eurospeech 2003, Geneva, September 2003.
4. Extracting Significant Features from the HRTF. Vikas C. Raykar, B. Yegnanarayana, Ramani Duraiswami and Larry Davis. International Conference on Auditory Display 2003, Boston, July 2003.
5. Virtual Audio System Customization Using Visual Matching of Ear Parameters. D. Zotkin, R. Duraiswami, L. Davis, V. C. Raykar and A. Mohan. ICPR 2002, Quebec City, Canada, August 2002.
6. Head Related Impulse Response Interpolation for Dynamic Spatialization. V. C. Raykar, T. V. Sreenivas and R. Raman. Texas Instruments DSPS Fest-2k, Bangalore, India, November 2000.

SUBMITTED

1. Speaker Localization Using Excitation Source Information in Speech. Vikas C. Raykar, B. Yegnanarayana, S. R. Mahadeva Prasanna and Ramani Duraiswami. Submitted to IEEE Trans. Speech and Audio Processing.
2. Position Calibration of Microphones and Loudspeakers in Distributed Computing Platforms. Vikas C. Raykar, Igor Kozintsev and Rainer Lienhart. Submitted to IEEE Trans. Speech and Audio Processing, Special Issue on Multi-channel Signal Processing for Audio and Acoustics Applications.
3. Position Calibration of Multiple Microphones. Vikas C. Raykar and Ramani Duraiswami. Submitted to ICASSP 2003.

PATENTS FILED

1. Three-Dimensional Position Calibration of Audio Sensors and Actuators on a Distributed Computing Platform. Filed 05/09/2003 with Igor Kozintsev and Rainer Lienhart.
2. Method for 3-Dimensional Position Calibration of Audio Sensors and Actuators on a Distributed Computing Platform. Filed 09/18/2003 with Igor Kozintsev and Rainer Lienhart.

RESEARCH / WORK EXPERIENCE

Intern, 02/2003 to 08/2003. Future Platforms Lab, Intel Research Labs, Intel Corporation, Santa Clara, CA. Mentors: Dr. Igor Kozintsev and Dr. Rainer Lienhart.
- Designed and implemented novel algorithms for 3D position calibration of a network of microphones/cameras and speakers/displays on distributed computing platforms.
- The work resulted in 3 publications and 2 patent filings.

Graduate Research Assistant, 08/2001 to 02/2003. Perceptual Interfaces and Realities Laboratory, University of Maryland, College Park, MD. Research advisor: Dr. Ramani Duraiswami.
- Implemented a real-time video conferencing setup using a microphone array and a pan-tilt camera, including auditory source localization, automatic camera pointing, face detection and multi-channel speech enhancement. Also worked on temporal and spectral decomposition of HRTFs.

Research Fellow, 11/1999 to 12/1999 and 04/2000 to 05/2000. Speech and Audio Laboratory, Indian Institute of Science, Bangalore, India. Advisor: Dr. T. V. Sreenivas.
- Implemented a real-time 3D spatial audio system using Head Related Transfer Functions (HRTFs). Also worked on modelling and interpolation of HRTFs.

Intern, 05/20/1999 to 06/25/1999. Centre for Artificial Intelligence and Robotics, Bangalore, India. Advisor: Dr. Ambalal V. Patel.
- Implemented PI and PD controllers using fuzzy logic.

SKILLS
- Expertise in C, MATLAB and Win32 programming.
- Experience in building real-time video conferencing systems with the Matrox EVI-D30 pan-tilt camera and a PowerDAQ data acquisition board.
- Built a real-time face detection system using a FireWire camera and OpenCV.
- Coding experience with the MIL image processing library, the Intel Integrated Performance Primitives (IPP) library, OpenCV and PortAudio.
- Working knowledge of C++, FORTRAN, DirectX and Verilog.
- Working knowledge of MPI and OpenMP.
- Digital signal processors, including the Texas Instruments TMS320C30 and TMS320C54.
- Assembly languages for the 8085 and 8086 processors and the 8051/52 microcontrollers.
- Worked on a wide range of platforms, including MS-DOS 6.2, Windows XP/NT/2000/95, Unix (Solaris) and Linux (SuSE).
- Also familiar with MS Office, LaTeX and HTML.

COURSE PROJECTS

- Evaluation of kernel methods such as KPCA, KLDA and KBDA for face detection.
- Fast Kernel Principal Component Analysis for the polynomial and Gaussian kernels.
- Classification and regression using linear networks, multilayer perceptrons and RBFs.
- Video codec implementation, DC image extraction and shot segmentation.
- Verilog implementation of a pipelined CPU with precise interrupt handling.
- Optimization methods for sound source localization.
- Implementation of the Frost beamformer.

HONORS

- Best outgoing student in the Electronics and Communication Engineering Department at Regional Engineering College, Trichy, India, for the year 2000-2001.
- Recipient of the prestigious National Science Fellowship Award (Engineering Stream) for 1999 under the KVPY scheme funded by the Department of Science and Technology (DST), Government of India.
- Recipient of the National Talent Search Examination (NTSE) scholarship in 1995.
- 18th rank in the state-level entrance examination for engineering colleges.
- 13th rank in the state-level public examination.
- 2nd rank in Karnataka State in the X standard public examination.

ACTIVITIES

- Organizer of the national-level technical symposium Probe 99, held in Trichy, India.
- Participated in the KVPY summer robotics camp for engineering students held at IIT Bombay, India.
- Student member of the IEEE Signal Processing Society.
- Member of SPIC MACAY.

INTERESTS

Sketching, painting, amateur robotics, astronomy.

REFERENCES

- Dr. Ramani Duraiswami (Director, Perceptual Interfaces and Reality Laboratory, University of Maryland, College Park)
- Dr. Rama Chellappa (Professor and Director of the Center for Automation Research, University of Maryland, College Park)
- Dr. B. Yegnanarayana (Visiting Professor, University of Maryland, College Park)
- Dr. Igor Kozintsev, Intel Research Labs, Santa Clara, CA
- Dr. Rainer Lienhart, Intel Research Labs, Santa Clara, CA