ABSTRACT

Title of Dissertation: UNCONSTRAINED FACE RECOGNITION
Shaohua Zhou, Doctor of Philosophy, 2004
Dissertation directed by: Professor Rama Chellappa, Department of Electrical and Computer Engineering

Although face recognition has been actively studied over the past decade, state-of-the-art recognition systems yield satisfactory performance only under controlled scenarios, and recognition accuracy degrades significantly when confronted with unconstrained situations due to variations in illumination, pose, etc. In this dissertation, we propose novel approaches that are able to recognize human faces under unconstrained situations.

Part I presents algorithms for face recognition under illumination/pose variations. For face recognition across illumination, we present a generalized photometric stereo approach by modeling all face appearances belonging to all humans under all lighting conditions. Using a linear generalization, we achieve a factorization of the observation matrix consisting of face appearances of different individuals, each under a different illumination. We resolve ambiguities in the factorization using surface integrability and symmetry constraints. In addition, an illumination-invariant identity descriptor is provided to perform face recognition across illumination. We further extend the generalized photometric stereo approach to an illuminating light field approach, which is able to recognize faces under pose and illumination variations.

Face appearance lies in a high-dimensional nonlinear manifold. In Part II, we introduce machine learning approaches based on reproducing kernel Hilbert space (RKHS) to capture higher-order statistical characteristics of the nonlinear appearance manifold. In particular, we analyze principal components of the RKHS in a probabilistic manner and compute distances, such as the Chernoff distance and the Kullback-Leibler divergence, between two Gaussian densities in RKHS.

Part III is on face tracking and recognition from video. We first present an enhanced tracking algorithm that models online appearance changes in a video sequence using a mixture model and produces good tracking results in various challenging scenarios. For video-based face recognition, while conventional approaches treat tracking and recognition separately, we present a simultaneous tracking-and-recognition approach. This simultaneous approach, solved using the sequential importance sampling algorithm, improves accuracy in both tracking and recognition. Finally, we propose a unifying framework called probabilistic identity characterization that is able to perform face recognition under registration/illumination/pose variations and from a still image, a group of still images, or a video sequence.

UNCONSTRAINED FACE RECOGNITION

by Shaohua Zhou

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2004

Advisory Committee:
Professor Rama Chellappa, Chairman
Professor Larry S. Davis
Professor David W. Jacobs
Professor Adrian Papamarcou
Professor Min Wu

© Copyright by Shaohua Zhou 2004

DEDICATION

To Chunhui

ACKNOWLEDGEMENTS

I wish to express my sincere gratitude to my supervisor, Professor Rama Chellappa, for his sustained financial support, his valuable guidance on research, and his scholarly and honest attitude toward life. I am grateful to my committee members, Professors Larry S. Davis, David W. Jacobs, Adrian Papamarcou, and Min Wu.
I enjoyed my fruitful discussions with Professor David W. Jacobs. I also thank Professor Eric V. Slud in the Mathematics department for educating me and sharing with me his broad knowledge onstatistics and Dr. Baback Moghaddam at Mitsubishi Electric Research Labs (MERL) for hosting me as a summer intern in 2002. I also would like to express my appreciation of Professor Azriel Rosenfeld, who was in my proposal examination committee and edited two of my technical reports. I had a pleasant stay at the Center for Automation Research (CfAR). I am indebted to my lab colleagues: Amit R. Chowdhury, Naresh Contoor, Jian Li, Jian Liang, Haiying Liu, Amit Kale, Gang Qian, Jie Shao, Namrata Vaswani, Zhanfen Yue, and Qinfen Zheng. I really enjoyed my collaborations and discussions with these brilliant guys. I take this special occasion to thank my parents and parents-in-law back in China for their support and to wish them best. Finally, I thank my wife, Chunhui, for her patience, her encouragement, and her lifelong love. I dedicate my thesis to her. iii TABLE OF CONTENTS List of Tables vii List of Figures ix 1 Introduction 1 1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1.1 Biometric perspective . . . . . . . . . . . . . . . . . . . . . . 1 1.1.2 Experimental perspective . . . . . . . . . . . . . . . . . . . . 4 1.1.3 Theoretic perspective . . . . . . . . . . . . . . . . . . . . . . 5 1.2 Unconstrained Face Recognition . . . . . . . . . . . . . . . . . . . . 13 1.2.1 Face recognition under variations . . . . . . . . . . . . . . . 15 1.2.2 Face recognition via kernel learning . . . . . . . . . . . . . . 16 1.2.3 Face tracking and recognition from videos . . . . . . . . . . 18 2 Generalized Photometric Stereo 21 2.1 Principle of Generalized Photometric Stereo . . . . . . . . . . . . . 26 2.1.1 Literature review and proposed approach . . . . . . . . . . . 27 2.1.2 Setting and constraints . . . . . . . . . . . . . . . . . . . . . 29 2.1.3 Separating illumination . . . . . . . . . . . . . . . . . . . . . 34 2.1.4 Recovering class-specific albedos and surface normals . . . . 37 2.2 Face Recognition across Illumination . . . . . . . . . . . . . . . . . 39 2.2.1 Literature review and proposed approach . . . . . . . . . . . 40 2.2.2 Bootstrap set . . . . . . . . . . . . . . . . . . . . . . . . . . 42 2.2.3 Recognition experiments . . . . . . . . . . . . . . . . . . . . 45 2.3 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3 Illuminating Light Field 57 3.1 Principle of Illuminating Light Field . . . . . . . . . . . . . . . . . 58 3.1.1 Literature review . . . . . . . . . . . . . . . . . . . . . . . . 58 3.1.2 Pose-invariant identity signature . . . . . . . . . . . . . . . . 62 3.1.3 Illumination- and pose-invariant identity signature . . . . . . 65 3.1.4 Learning algorithms . . . . . . . . . . . . . . . . . . . . . . 67 3.2 Face Recognition across Illumination and Poses . . . . . . . . . . . 70 iv 3.2.1 PIE database and recognition setting . . . . . . . . . . . . . 70 3.2.2 Recognition performance . . . . . . . . . . . . . . . . . . . . 73 3.2.3 Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . 77 4 Probabilistic Kernel Principal Component Analysis 82 4.1 Reproducing Kernel Hilbert Space (RKHS) . . . . . . . . . . . . . . 85 4.2 Probabilistic Analysis of Kernel Principal Components . . . . . . . 88 4.2.1 Kernel principal component analysis . . . . . . . . . . . . . 88 4.2.2 Theory of PKPCA . . . . . . . . . . . . . . . . . . . . . . 
. 90 4.3 Mixture Modeling of Probabilistic Kernel Principal Components . . 96 4.3.1 Theory of mixture of PKPCA . . . . . . . . . . . . . . . . . 96 4.3.2 Why mixture of PKPCA? . . . . . . . . . . . . . . . . . . . 100 4.4 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 4.4.1 PKPCA or mixture of PKPCA classifier . . . . . . . . . . . 101 4.4.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 4.5 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 5 Probability Distances in Reproducing Kernel Hilbert Space 118 5.1 Probabilistic Distances in Rd . . . . . . . . . . . . . . . . . . . . . 120 5.2 Mean and Covariance Marix in RKHS . . . . . . . . . . . . . . . . 123 5.2.1 First- and second-order statistics . . . . . . . . . . . . . . . 123 5.2.2 Covariance matrix approximation . . . . . . . . . . . . . . . 124 5.3 The Probabilistic Distances in RKHS . . . . . . . . . . . . . . . . . 126 5.3.1 The Chernoff distance and the Bhattarchayya distance . . . 126 5.3.2 The KL divergence and the symmetric divergence . . . . . . 129 5.3.3 The Patrick-Fisher distance . . . . . . . . . . . . . . . . . . 130 5.3.4 Limiting behavior . . . . . . . . . . . . . . . . . . . . . . . . 130 5.3.5 Kernel for set . . . . . . . . . . . . . . . . . . . . . . . . . . 131 5.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . 132 5.4.1 Synthetic examples . . . . . . . . . . . . . . . . . . . . . . . 132 5.4.2 Face recognition from a group of images . . . . . . . . . . . 134 6 Adaptive Visual Tracking 138 6.1 Related Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 6.1.1 Visual tracking . . . . . . . . . . . . . . . . . . . . . . . . . 142 6.1.2 Particle filter . . . . . . . . . . . . . . . . . . . . . . . . . . 143 6.2 Appearance-Adaptive Models . . . . . . . . . . . . . . . . . . . . . 145 6.2.1 Adaptive observation model . . . . . . . . . . . . . . . . . . 145 6.2.2 Adaptive state transition model . . . . . . . . . . . . . . . . 148 6.2.3 Handling occlusion . . . . . . . . . . . . . . . . . . . . . . . 154 6.3 Experimental results on visual tracking . . . . . . . . . . . . . . . . 157 6.3.1 Car tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 6.3.2 Tank tracking in an aerial video . . . . . . . . . . . . . . . . 160 v 6.3.3 Face tracking . . . . . . . . . . . . . . . . . . . . . . . . . . 163 6.3.4 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 7 Simultaneous Tracking and Recognition 166 7.1 Related Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 7.1.1 Face modeling and recognition . . . . . . . . . . . . . . . . . 169 7.1.2 Video-based tracking and recognition . . . . . . . . . . . . . 170 7.2 Stochastic Models and Algorithms for Recognition from Video . . . 173 7.2.1 Time series state space model . . . . . . . . . . . . . . . . . 173 7.2.2 Posterior probability of identity variable . . . . . . . . . . . 174 7.2.3 SIS algorithms and computational efficiency . . . . . . . . . 176 7.3 Still-to-Video Face Recognition Experiments . . . . . . . . . . . . . 180 7.3.1 Results for Database-0 . . . . . . . . . . . . . . . . . . . . . 181 7.3.2 Results for Database-1 . . . . . . . . . . . . . . . . . . . . . 187 7.3.3 Results for Database-2 . . . . . . . . . . . . . . . . . . . . . 191 7.3.4 Enhanced results . . . . . . . . . . . . . . . . . . . . . . . . 192 7.4 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
198 8 Probabilistic Identity Characterization 202 8.1 Principle of Probabilistic Identity Characterization . . . . . . . . . 205 8.1.1 Independent group (I-group) . . . . . . . . . . . . . . . . . . 206 8.1.2 Video sequence . . . . . . . . . . . . . . . . . . . . . . . . . 207 8.1.3 Difference from Bayesian estimation . . . . . . . . . . . . . . 207 8.2 Recognition Setting and Issues . . . . . . . . . . . . . . . . . . . . . 208 8.2.1 Discrete identity signature . . . . . . . . . . . . . . . . . . . 209 8.2.2 Continuous identity signature . . . . . . . . . . . . . . . . . 209 8.2.3 The effects of the transformation . . . . . . . . . . . . . . . 210 8.2.4 Asymptotic behaviors . . . . . . . . . . . . . . . . . . . . . . 211 8.3 Subspace Identity Encoding . . . . . . . . . . . . . . . . . . . . . . 211 8.3.1 Invariant to localization, illumination, and pose . . . . . . . 212 8.3.2 Computational issues . . . . . . . . . . . . . . . . . . . . . . 214 8.3.3 Experimental results . . . . . . . . . . . . . . . . . . . . . . 216 8.4 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220 9 Conclusions 223 9.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223 9.2 Future works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225 vi LIST OF TABLES 1.1 A list of biometrics. . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2.1 Recognition rate obtained by our approach using the first rank con- straint and the Yale?s database as the training set. . . . . . . . . . . 46 2.2 Recognition rate obtained by the ?Eigenface?approach (discarding the first 3 components) using the Yale?s database as the training set. 48 2.3 Recognition rate obtained by the ?Fisherface? approach using the Yale?s database as the training set. . . . . . . . . . . . . . . . . . . 48 2.4 Recognition rate obtained by our approach with the first rank con- straint and Vetter?s database as the training set. . . . . . . . . . . 49 2.5 Recognition rate obtained by our approach with the second rank constraint and Vetter?s database as the training set. . . . . . . . . . 49 2.6 Recognition rate across poses and illumination. The front view is from camera 27, and the side view from camera 05. . . . . . . . . . 50 3.1 Recognition rates forall the probe sets with a fixed gallery set (c27,f11). 73 3.2 Average recognition rates for all the gallery sets. For each cell, say the gallery set at (vg = c27,sg = f12), the average rate is taken over all probe sets (vp,sp) where vp negationslash= vg and sp negationslash= sg. For example, the average rate for (c27,f11) is the average of the rates in Table 3.1 excluding the row c27 and the column f11. . . . . . . . . . . . . . . 74 3.3 The recognition rates for test scenario B. . . . . . . . . . . . . . . . 78 4.1 PPCA and PKPCA reconstruction error percentage. . . . . . . . . . 96 4.2 Classification error on the single C-shaped, the single O-shape, and the double C-shapes. . . . . . . . . . . . . . . . . . . . . . . . . . . 105 4.3 The classification error on IDA benchmark repository. The SVM and KFD results are reported in [179]. . . . . . . . . . . . . . . . . 109 4.4 Recognition rate of various kernel and non-kernel subspace methods. 111 5.1 (a) The KL distances in the RKHS with ? = 1 and q = 3. (b) The Bhatacharyya distances in the RKHS with ? = 0.5 and q = 1. p1 is listed in the first column and p2 in the first row. . . . . . . . . . . . 135 vii 5.2 The recognition score obtaining using the symmetric divergence and Bhatacharyya distance. . . 
. . . . . . . . . . . . . . . . . . . . . . . 135 6.1 Comparison of tracking results obtained by particle filters with dif- ferent configurations. ?At size? means pixel size in the component(s) of the appearance model. ?o? means success in tracking. ?x? means failure in tracking. . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 7.1 Use of temporal information in various tracking/recognition processes.168 7.2 Summary of three databases experimented. . . . . . . . . . . . . . . 181 7.3 Recognition performance of algorithms when applied to Database-0. 187 7.4 Performances of algorithms when applied to Database-1. . . . . . . 188 8.1 Recognition rates of different methods. . . . . . . . . . . . . . . . . 219 viii LIST OF FIGURES 1.1 Comparison of various biometric features based on MRTD compat- ibility (from [33]). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Three face recognition tasks: verification, identification, watch list (courtesy of P.J.Phillips [59]). . . . . . . . . . . . . . . . . . . . . . 5 1.3 A hierarchy of face pattern and face recognition. . . . . . . . . . . . 6 1.4 An illustration of the imaging system. . . . . . . . . . . . . . . . . . 8 1.5 One PIE [75] individual under different illumination and poses. . . . 9 1.6 (a) Appearances of one individual with different facial expression (from [53]). (b) Appearances of one individual at different ages (from [50]). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.7 Face appearances in a video sequences, forming a nonlinear manifold. 14 2.1 Top row: One object under eight different light sources. This can be handled by the ordinary photometric stereo algorithm. Bottom row: Eight different objects illuminated by eight different lighting sources. This cannot be handled by the ordinary photometric stereo algorithm but can be handled by the proposed generalized photo- metric stereo algorithm. . . . . . . . . . . . . . . . . . . . . . . . . 22 2.2 The first row: The first basis object under eight different illumi- nation. The second row: The second basis object under the same set of eight different illumination. The third row: Eight images (constructed by random linear combinations of two basis objects) illuminated by eight different lighting sources. The fourth row: Re- covered class-specific albedo-shape matrix W showing the product of varying albedos and surface normals of two basis objects (i.e. the three columns of T1 and T2) using the generalized photometric stereo algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 2.3 Right: Flash distribution in the PIE database. For illustrative pur- poses, we move their positions on a unit sphere as only the illu- minant directions matter. ?o? means the ground truth and ?x? the estimated values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 ix 2.4 The first and second rows display one PIE object under the selected 12 illuminants (from left to right, row 1 to row 2: f08, f09, f11-f17, and f20-f22) and the third and fourth rows one Yale object under 9 lights (most frontal lights) used in the training set. . . . . . . . . . 47 3.1 This figure illustrates the 2D light-field of a 2D object (a square with four differently colored sides), which is placed within an circle. The angles ? and ? are used to relate the viewpoint with the radiance from the object. The right image shows the actual light field for the square object. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
63 3.2 Examples of the face images of one PIE object (used in the testing stage) under selected illumination and poses . . . . . . . . . . . . . 71 3.3 The first nine columns of the learned W matrix. . . . . . . . . . . . 75 3.4 The reconstruction results of the object in Figure 3.2. Notice that only the f?s and s?s for the row c27 are used for reconstructing all the images. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 3.5 The average recognition rates across illumination (the top row) and across poses (the bottom row) for three cases. Case (a) shows the average recognition rate (averaging over all illumination/poses and all gallery sets) obtained by the proposed algorithm using the top n matches. Case (b) shows the average recognition rate (averaging over all illumination/poses for the gallery set (c27, f11) only) ob- tained by the proposed algorithm using the top n matches. Case(c) shows the average recognition rate (averaging over all illumina- tion/poses andallgallerysets) obtainedby the ?Eigenface? algorithm using the top n matches. . . . . . . . . . . . . . . . . . . . . . . . . 77 4.1 Two nonlinear data structures (a)(d) and their drawn samples (of size 200) for the foreground class (b)(e) and the background (c)(f). 85 4.2 Histogram of ? for iris data obtained by (a) PPCA with q = 2, (b) PPCA with q = 3, (c) PKPCA with Gaussian kernel with q = 9, ? = 2 and ? = 0.001, and (d) PKPCA with Gaussian kernel with q = 15, ? = 2 and ? = 0.001. . . . . . . . . . . . . . . . . . . . . . . 97 4.3 (a) Initial configuration. (b) After first iteration. (c) Final configu- ration. ?+? and ?x? denote two different mixture components. . . . . 100 4.4 (a) One C-shape and contour plots of its (b) 1st and (c) 2nd KPCA features. (d) Two C-shapes and its contour plots of its (e) 1st and (f) 2nd KPCA features. . . . . . . . . . . . . . . . . . . . . . . . . 102 4.5 The approximation of the Jacobi matrix. (a) The contour plots of the true density: uniform inside the C-shaped region. (b) The map of log(??). (c) The contour plots of ??? inside the C-shaped region. . 103 4.6 The classification results on the single C-shape obtained by (a) PKPCA-d, (b) PKPCA-s, (c) SVM, and (d) KFDA. . . . . . . . . . 106 x 4.7 The classification results on the double C-shape obtained by (a) PKPCA-d classfier, (b) SVM, and (c) mixture of PKPCA classfier with different kernel widths. . . . . . . . . . . . . . . . . . . . . . . 106 4.8 The classification results on the single O-shape. . . . . . . . . . . . 107 4.9 Top row: neutral faces. Middle row: faces with facial expression. Bottom row: faces under different illumination. Image size is 24 by 21 in pixels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 4.10 (a) The curve of E(?). (b) The curve of ?1(?). We have set q = 30 and ? = 1e?6. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 4.11 (a) The map of log(??) and (b) the contour plots of ??? inside the C-shaped region, when ? = 3. (c) The map of log(??) and (d) the contour plots of ??? inside the C-shaped region, when ? = 36. . . . . 117 5.1 300 i.i.d. realizations of four different densities with the same mean (zero mean) and covariance matrix (identity matrix). (a) 2-D Gaus- sian. (b) ?O?-shaped uniform.(c) ?D?-shapeduniform. (d) ?X?-shaped uniform. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
133 5.2 (a) The symmetric divergence ?JD(?,q) and (b) the Bhatacharyya distance ?JB(?,q) between the 2-D Gaussian and the ?O?-shaped uni- form as a function of ? and q. . . . . . . . . . . . . . . . . . . . . . 134 5.3 Examples of face images in the gallery and probe set. (a) The 4th gallery person in 10 frames (every 8 frames) of a 80-frame se- quence. (b) The 9th gallery person in 10 frames (every 10 frames) of a 105-frame sequence.(a) The 4th probe person in 10 frames (every 6 frames) of a 60-frame sequence. (d) The plot of first three PCA coefficients of the above three sets. . . . . . . . . . . . . . . . . . . 136 6.1 The general particle filter algorithm. . . . . . . . . . . . . . . . . . 144 6.2 Particle configurations from (top row) the adaptive velocity model and (bottom row) the zero-velocity model. . . . . . . . . . . . . . . 154 6.3 The proposed visual tracking algorithm with occlusion handling. . . 157 6.4 The car sequence. Notice the fast scale change present in the video. Column 1: the tracking results obtained with an adaptive motion model and an adaptive appearance model (?adp?). Column 2: the tracking results obtained with an adaptive motion model but a fixed appearance model (?fa?). In this case, the corner shows the tracked region. Column 3: the tracking results obtained with an adaptive appearance model but a fixed motion model (?fm?). . . . . . . . . . 160 xi 6.5 (a) The scale estimate for the car. (b) The 2-D trajectory of the cen- troid of the tracked tank. ?*? means the starting and ending points and ?.? points are marked along the trajectory every 10 frames. (c) The particle number Jt vs. t obtained when tracking the tank. (d) The MSE invoked by the ?adp? and ?fa? algorithms. (e) The scale estimate for the face sequence. . . . . . . . . . . . . . . . . . . . . . 161 6.6 Tracking a moving tank in a video acquired by an airborne camera. 162 6.7 The face sequence. Frames 145, 148, and 155 show the first oc- clusion. Frames 470 and 517 show the smallest and largest face observed. Frames 685, 690, and 710 show the second occlusion. . . . 164 6.8 Tracking results on the face sequence using the adaptive particle filter without occlusion analysis. . . . . . . . . . . . . . . . . . . . . 165 7.1 The conventional particle filter algorithm for simultaneous tracking and recognition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178 7.2 The computationally efficient particle filter algorithm for simulta- neous tracking and recognition. . . . . . . . . . . . . . . . . . . . . 179 7.3 Database-0. The 1st row: the face gallery with image size being 30?26. The 2nd and 3rd rows: 4 example frames in one probe video with image size being 320 ? 240 while the actual face size ranges approximately from 30?30 in the first frame to 50?50 in the last frame. Notice that the sequence is taken under a well-controlled con- dition so that there are no illumination or pose variations between the gallery and the probe. . . . . . . . . . . . . . . . . . . . . . . . 182 7.4 Database-1. The 1st row: the face gallery with image size being 30?26. The 2nd and 3rd rows: 4 example frames in one probe video with image size being 720 ? 480 while the actual face size ranges approximately from 20?20 in the first frame to 60?60 in the last frame. Notice the significant illumination variations between the probe and the gallery. . . . . . . . . . . . . . . . . . . . . . . . . . . 183 7.5 Database-2. The 1st row: the face gallery with image size being 30?26. 
The 2nd and 3rd rows: some example frames in one probe video (slowWalk). Each video consists of 300 frames (480x640 pixels per frame) captured at 30 Hz. The inner face regions in these videos contain between 30x30 and 40x40 pixels. Notice the significant pose variation available in the video. . . . 184
7.6 Posterior probability p(nt|y0:t) against time t, obtained by the CONDENSATION algorithm (top left) and the proposed algorithm (top right). Conditional entropy H(nt|y0:t) (bottom left) and MMSE estimate of scale parameter sc (bottom right) against time t. The conditional entropy and the MMSE estimate are obtained using the proposed algorithm. . . . 186
7.7 Database-1. Top row: the second facial images for estimating probabilistic density. Middle row: top 10 eigenvectors for the IPS. Bottom row: the facial images cropped out from the largest frontal view. . . . 190
7.8 Cumulative match curves for Database-1 (left) and Database-2 (right). . . . 192
7.9 The visual tracking and recognition algorithm. . . . 195
7.10 Rows 1-3: the gallery set with 29 subjects in frontal view. Rows 4, 5, and 6: the top 10 eigenvectors for FFS, IPS, and EPS, respectively. . . . 196
7.11 Example images in the "Subject-2" probe video sequence and the tracking results. . . . 197
7.12 Results on the "Subject-2" sequence. (a) Posterior probabilities against time t for all identities p(nt|y1:t), nt = 1,2,...,N. The line close to 1 is for the true identity. (b) Scale estimate against time t. . . . 198
7.13 Left: The "average" likelihood of the correct hypothesis and incorrect hypotheses against the log of scale parameter. Right: The "average" likelihood ratio against the log of scale parameter. . . . 201
8.1 The posterior distributions p(?1|y1:T) with different T's: (a) p(?1|y1); (b) p(?1|y1:6); and (c) p(?1|y1:12), and (d) the posterior distribution p(?|y1:12). Notice that p(?1|y1:T) has two modes and becomes more peaked as T increases. . . . 219
8.2 The recognition rates of all tests. (a) Our method based on ?k. (b) Our method based on ?k. (c) The PCA approach [62]. (d) The KL approach. Notice the different ranges of values for different methods and the diagonal entries should be ignored. . . . 220

Chapter 1

Introduction

1.1 Overview

Identifying people from faces is an effortless task for humans. Is it the same for computers? This defines the very question for the field of automatic face recognition [20, 21, 22, 23, 24, 25, 26, 27, 191] (also referred to as face recognition in the present dissertation), one of the most active research areas in computer vision, pattern recognition, and image understanding.

Over the past decade, face recognition has attracted substantial attention from various disciplines and contributed to a skyrocketing growth in the literature. Below, we mainly emphasize the biometric, experimental, and theoretic perspectives of face recognition.

1.1.1 Biometric perspective

Face is a biometric [31]. As a consequence, face recognition finds wide applications related to authentication, security, and so on. One striking example is the recent deployment of the US-VISIT system [30] by the Department of Homeland Security (DHS), collecting foreign passengers' fingerprints and face images.

Biometrics enable automatic identification of a person based on physiological or behavioral characteristics [29, 28].
Physiological biometrics are biological/chemical traits that are innate or naturally grown, while behavioral biometrics are mannerisms or traits that are learned or acquired. Table 1.1 lists commonly used biometrics. Some introductory discussions on biometrics may be found in [28, 29, 31, 32].

    Type                       Examples
    Physiological biometrics   Body odor, DNA, face, fingerprint, hand geometry, iris, pulse, retinal
    Behavioral biometrics      Face, gait, handwriting, signature, voice

Table 1.1: A list of biometrics.

Biometric technologies are becoming the foundations of an extensive array of highly secure identification and personal verification solutions. Compared with conventional identification and verification methods based on personal identification numbers (PINs) or passwords, biometric technologies offer some unique advantages. First, biometrics are individualized traits, while passwords may be used or stolen by someone other than the authorized user. Also, a biometric is very convenient since there is nothing to carry or remember. In addition, biometric technology is becoming more accurate and inexpensive.

Among all the biometrics listed in Table 1.1, the face is unique because it is the only biometric belonging to both the physiological and behavioral categories. While the physiological part of the face biometric is widely researched in the literature, the behavioral part is not yet fully investigated. In addition, as reported in [33, 34], the face has an advantage over other biometrics because it is a natural, non-intrusive, and easy-to-use biometric. For example [33], among the six biometrics of face, finger, hand, voice, eye, and signature in Figure 1.1, the face biometric ranks first in the compatibility evaluation of a machine readable travel document (MRTD) system in terms of six criteria: enrollment, renewal, machine-assisted identity verification requirements, redundancy, public perception, and storage requirements and performance. Probably the most important feature of a biometric is its ability to collect the signature from non-cooperating subjects.

Figure 1.1: Comparison of various biometric features based on MRTD compatibility (from [33]).

Besides applications related to identification and verification such as access control, law enforcement, ID and licensing, surveillance, etc., face recognition is also useful in human-computer interaction, virtual reality, database retrieval, multimedia, computer entertainment, etc. See [27, 45] for a review of face recognition applications.

1.1.2 Experimental perspective

Face recognition mainly involves the following three tasks [59]:

• Verification. The recognition system determines if the query face image and the claimed identity match.

• Identification. The recognition system determines the identity of the query face image by matching it with a database of images with known identities, assuming that the identity is inside the database.

• Watch list. The recognition system first determines if the identity of the query face image is on the stored watch list and, if yes, then identifies the individual.

Figure 1.2 illustrates the above three tasks and the corresponding statistics used for evaluation. Among the three tasks, the watch list task is the most difficult one. The present thesis focuses only on the identification task.

We introduce a face recognition test protocol, FERET [58], which is widely observed in the face recognition literature.
FERET stands for "facial recognition technology". In most experiments conducted in the thesis, we follow the FERET protocol. FERET assumes the availability of three sets, namely one training set, one gallery set, and one probe set. The training set is provided for the recognition algorithm to learn the characteristic features. The gallery and probe sets are used in the testing stage. The gallery set contains images with known identities and the probe set images with unknown identities. The algorithm associates descriptive features with images in the gallery and probe sets and determines the identities of the probe images by comparing their associated features with those associated with gallery images.

Figure 1.2: Three face recognition tasks: verification, identification, watch list (courtesy of P. J. Phillips [59]).

1.1.3 Theoretic perspective

Face recognition is by nature an interdisciplinary research area, tied to an array of research fields, ranging from pattern recognition, computer vision and graphics, and image processing/understanding to statistical computing and machine learning. In addition, automatic face recognition designs are often guided by psychophysical and neural studies. A good summary of research on face perception is presented in [27, 35, 38]. We now focus on the theoretical implications of pattern recognition for the special task of face recognition.

We present a three-level structure for understanding the face recognition problem. The three levels forming the pyramid are: pattern, visual pattern, and face pattern, each associated with a corresponding theory of recognition. Accordingly, face recognition approaches can be grouped into three categories.

Figure 1.3: A hierarchy of face pattern and face recognition.

Pattern and recognition

At the base of the pyramid lies the general pattern. Because a face is first of all a pattern, any pattern recognition theory [7] can be directly applied to a face recognition problem. In general, a vector representation is used in pattern recognition. A common way of deriving a vector representation from a 2D face image, say of size M x N, is through a "vectorization" operator that stacks the pixels in a particular order, say a raster-scanning order, into an MN x 1 vector. Obviously, an arbitrary MN x 1 vector can be decoded into an M x N image by reversing the above "vectorization" operator. Such a vector representation corresponds to a holistic viewpoint in the psychophysics literature [36, 37].

Subspace methods are pattern recognition techniques widely invoked in various face recognition approaches. Two well-known appearance-based recognition schemes utilize principal component analysis (PCA) [12] and linear discriminant analysis (LDA) [7]. PCA performs an eigen-decomposition of the covariance matrix and consequently minimizes the reconstruction error in the mean square sense. LDA minimizes the within-class scatter while maximizing the between-class scatter. The PCA approach used in face recognition is called the "Eigenface" approach [62]. Another work using PCA earlier than "Eigenface" is [47]. The LDA approach used in face recognition is called the "Fisherface" approach [41], since LDA is also commonly referred to as Fisher discriminant analysis. LDA for face recognition was also independently proposed in [44]. Further, PCA and LDA can be combined (LDA after PCA), as in [64], to yield a better recognition scheme.
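As a minimal illustration of the eigenface-style pipeline just described (vectorize the images, project onto principal components, identify by nearest neighbor under the FERET gallery/probe setup), the following sketch uses hypothetical gallery and probe arrays; it is an illustrative simplification, not the exact implementation evaluated in this thesis.

```python
import numpy as np

def eigenface_features(gallery, probe, q=20):
    """Sketch of the 'Eigenface' pipeline: vectorize, PCA, project.

    gallery: (n_gallery, M, N) face images with known identities
    probe:   (n_probe,   M, N) query face images
    q:       number of principal components kept
    (Array names, shapes, and q are illustrative assumptions.)
    """
    X = gallery.reshape(len(gallery), -1).astype(float)   # each row is an MN-vector
    mean = X.mean(axis=0)
    Xc = X - mean
    # PCA via SVD of the centered data; rows of Vt span the eigenface subspace
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    U = Vt[:q].T                                           # MN x q basis
    g_feat = Xc @ U                                        # gallery features
    p_feat = (probe.reshape(len(probe), -1) - mean) @ U    # probe features
    return g_feat, p_feat

def identify(g_feat, p_feat):
    """Nearest-neighbor identification: index of best-matching gallery image."""
    d = ((p_feat[:, None, :] - g_feat[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)
```

A Fisherface-style variant would replace the final PCA basis with LDA directions computed after this PCA step.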
Other subspace methods, such as independent component analysis (ICA) [20, 40, 155], local feature analysis (LFA) [164], probabilistic subspaces [54, 55, 56], and multi-exemplar discriminant analysis [211], have been used in face recognition. A comparison of these subspace methods is reported in [56, 200]. Other than the subspace methods, classical pattern recognition tools such as neural networks [51], learning methods [57], and evolutionary pursuit/genetic algorithms [52] have also been applied to face recognition.

One concern in a general pattern recognition problem is the "curse of dimensionality", since usually M and N themselves are quite large. In face recognition, because of limitations of image acquisition, practical face recognition systems store only a small number of samples per subject. This further worsens the "curse of dimensionality" problem. Face recognition also differs from general pattern recognition problems in various aspects. Some of the differences are illustrated below.

Visual pattern and visual recognition

In the middle of the pyramid in Figure 1.3 sits the visual pattern layer. A face is a visual pattern in the sense that it is a 2D appearance of a 3D object captured by an imaging system. Certainly, visual appearance is affected by the configuration of the imaging system. An illustration of the imaging system is presented in Figure 1.4.

Figure 1.4: An illustration of the imaging system.

There are two distinct characteristics of the imaging system: photometric and geometric.

• Photometric characteristics are related to the light sources distributed in the scene. Figure 1.5 shows the face images of one object captured under varying illumination conditions. Numerous models have been proposed to describe the illuminating phenomenon, i.e., how the light travels when it hits the object. In addition to its relationship with the light distribution, such as the light direction and intensity, an illumination model is in general also related to the surface material properties of the object.

• Geometric characteristics concern the camera properties and the relative positioning of the camera and the object. Camera properties include camera intrinsic parameters and camera imaging models. The imaging models widely studied in the computer vision literature are the orthographic, scaled orthographic, and perspective models. Because the perspective model is difficult to deal with, as it requires depth information, the orthographic or scaled orthographic model is more commonly used in the face recognition community. The relative positioning of the camera and the object results in pose variation, a key factor determining how the 2D appearances are produced. Figure 1.5 shows the face images of one object captured at different poses.

Figure 1.5: One PIE [75] individual under different illumination and poses.

Studying photometric and geometric characteristics is the key problem in the computer vision literature, and consequently visual recognition under illumination and pose variations is the main challenge in the recognition community. A full review of the visual recognition literature is beyond the scope of the thesis. However, face recognition methods that address the photometric and geometric characteristics are still in a nascent stage and need to be fully explored.

Approaches to face recognition under illumination variation are usually treated as extensions of research efforts on illumination models.
For example, if a simplified Lambertian reflectance model ignoring shadow pixels [96, 101, 103] is used, a rank-3 subspace can be constructed to cover the appearances arbitrarily illuminated by a distant point source. Similarly, low-dimensional subspaces [94, 95] can be found using a Lambertian model with attached shadows. Face recognition can be performed by checking if a query face image lies in the object-specific illumination subspace. To generalize from the object-specific illumination subspace to a class-specific illumination subspace, bilinear models are used in [74, 138, 204]. Most face recognition approaches across pose variation use a view-based appearance representation [67, 69, 72]. Face recognition across illumination and pose is more difficult than recognition across a single modality. Proposed approaches in the literature include [66, 70, 208], among which the 3D morphable model [66] yields the best recognition performance. The feature-based approach [48] is reported to be partially robust to illumination and pose variations.

An important feature of a visual pattern is its presence in video. The ubiquitousness of video sequences calls for recognition algorithms based on videos. Because a video sequence is a collection of still images, face recognition from still images certainly applies. However, an important property of a video sequence is its temporal dimension. Recent psychophysical and neural studies [37, 39] demonstrate the role of movement in face recognition: famous faces are easier to recognize when presented in moving sequences than in still photographs, even under a range of different types of degradation. Computational approaches utilizing such temporal information include [86, 193, 194, 185, 186, 190]. Figure 1.7 shows the tracked face appearance in a video sequence captured in an office environment [84]. Clearly, due to free movement of the human face and an uncontrolled environment, issues like illumination and pose variations still exist. Besides these issues, localizing faces or face segmentation in a cluttered environment in video sequences is very challenging. In surveillance scenarios, further challenges include poor video quality and lower resolution. For example, the face region can be as small as 15 x 15, while most feature-based approaches [48, 66] need large face images, of size as large as 128 x 128. However, video provides multiple observations linked by their temporal continuity.

Face pattern and face recognition

At the top of the pyramid lies the face pattern. The face pattern specializes the visual pattern by letting the object be a human face. Therefore, face-specific properties or characteristics should be taken into account when performing face recognition.

• Deformation. Humans express emotions through facial expressions, yielding patterns under nonrigid deformations. The non-rigidity has a very high degree of freedom and complicates the recognition task. Figure 1.6(a) shows the face images of a person exhibiting different expressions. While facial expression analysis attracts a lot of attention [42, 60, 61], recognition under facial expression variation has not been fully explored.

Figure 1.6: (a) Appearances of one individual with different facial expressions (from [53]). (b) Appearances of one individual at different ages (from [50]).

• Aging. Face appearances vary significantly with aging, and such variations are specific to an individual.
As a result, theoretical modeling of aging [50] is very difficult due to the individualized variation. Figure 1.6(b) shows the face images of a person at different ages.

• Face surface. One special property of the face surface is its bilateral symmetry. The symmetry constraint has been widely exploited in [102, 104, 204]. In addition, surface integrability is an inherent property of any surface, which has also been used in [99, 103, 137, 204].

• Self-similarity. There is a strong visual similarity among face images of different individuals. The geometric positioning of facial features such as the eyes, nose, and mouth is alike across individuals. Early face recognition approaches in the 1970s [24, 46] used the distances between feature points to describe the face and achieved some success. Also, face surface material properties are similar within the same race. As a consequence of visual similarity, the "shapes" of the face appearance manifolds belonging to different subjects are similar. This is the foundation of approaches [55, 56, 211] that attempt to capture the "shape" characteristics by constructing the so-called intra-person space.

• Makeup, cosmetics, etc. These factors are specific to an individual and so are unpredictable. Except for the effect of glasses, which has been studied in [41], effects induced by such factors have not been widely investigated.

Face appearances of the same individual under variations in illumination, pose, deformation, aging, etc. lie in a nonlinear manifold. Figure 1.7 visualizes such a manifold by projecting the appearances in its top row onto the top three principal components. Manifold characterization can be done in various ways. One way is to embed a manifold in a low-dimensional space [162, 166]. The other way is to learn the nonlinearity using machine learning techniques [9, 19, 63, 172, 177, 179, 181, 189, 198].

1.2 Unconstrained Face Recognition

State-of-the-art face recognition systems yield satisfactory performance under controlled conditions. To be specific, the face images are typically acquired in frontal views and are often illuminated by a frontal light source. These conditions pose strong restrictions on the patterns that can be acquired. In other words, the clustering nature of the produced patterns (usually tightly clustered) is amenable to classical pattern analysis. Therefore, most face recognition approaches lie in the first level of the hierarchy. Unfortunately, recognition performance degrades significantly when face recognition systems are presented with patterns that go beyond these controlled conditions.

Recently, researchers have begun to investigate face recognition under unconstrained conditions. Examples of unconstrained conditions include illumination and pose variations, video sequences, expression, aging, and so on. In general, recognition approaches addressing the second and third levels of the hierarchy can be considered in the category of unconstrained face recognition.

Figure 1.7: Face appearances in a video sequence, forming a nonlinear manifold.

The present thesis presents several unconstrained face recognition approaches. It consists of three parts: Part I is on Face Recognition under Variations, Part II on Face Recognition via Kernel Learning, and Part III on Face Tracking and Recognition from Videos.

1.2.1 Face recognition under variations

Part I of the thesis studies face recognition under illumination and pose variations. Pose and illumination are related to the second level of Figure 1.3.
In Chapter 2, we present a generalized photometric stereo algorithm for recognizing faces under illumination variation and then in Chapter 3 an illuminating light field algorithm for recognizing faces under illumination and pose variations. Most photometric stereo algorithms employ a Lambertian reflectance model with a varying albedo field and involve the appearances of only one object. The recovered albedos and surface normals are object-specific and appearances not be- longing to the object cannot be easily handled. In Chapter 2, we generalize pho- tometric stereo algorithms to handle all appearances of all objects in a class, in particular the human face class, by assuming that albedos and surface normals of all objects in the class be rank-constrained, i.e. lie in a subspace. Rank con- straints lead us to a factorization of an observation matrixthat consists of exemplar images of different objects under different illuminations. To fully recover the sub- space bases or class-specific albedos and surface normals, we employ integrability and face symmetry constraints and propose a linearized algorithm. This algorithm takes into account the effects of varying albedo field by approximating the inte- grability terms using only the surface normals. We then apply our generalized photometric stereo algorithm for recognizing faces under illumination variations. As far as recognition is concerned, we can utilize a bootstrap set which is just a collection of 2D image observations to avoid an explicit requirement that 3D infor- 15 mation be available. We obtain good recognition results using the PIE database [187, 202, 204]. The illuminating light field algorithm presented in Chapter 3 is an image-based method for face recognition across different illumination and different poses, where the term image-based means that no explicit prior 3D models are needed. As face recognition under illumination and pose variations involves three factors, namely identity, illumination, and pose, generalizations in all these three factors are de- sired. The illuminating light field approach is able to generalize in identity and illumination and handle a given set of poses. The proposed approach derives an identity signature that is illumination- and pose-invariant, where the identity is tackled using subspace encoding, the illumination is characterized using a Lam- bertian reflectance model, and the given set of poses is treated as a whole. Ex- perimental results using the PIE database demonstrate the effectiveness of the proposed approach [188, 208]. 1.2.2 Face recognition via kernel learning As mentioned earlier, the visual pattern lies in a nonlinear manifold, which is further complicated by face-specific characteristics. Nonlinear data modeling is an important research topic in machine learning. While linear data modeling such as PCA and LDA utilizes first- and second-order statistics, higher-order statistics play essential roles in nonlinear data modeling. Kernel learning methods (or kernel methods) are able to capture the higher-order statistical information. In the core of kernel learning methods lie two important components: a learning algorithm using linear geometry and a nonlinear feature space induced by a kernel function. Such a space is referred as reproducing kernel Hilbert space (RKHS) 16 in the literature. Kernel methods are linear learning algorithms operating on the nonlinear feature space. In Part II, we introduce two kernel learning methods. 
Chapter 4 presents a probabilistic approach to analyzing kernel principal components by naturally combining in one treatment the theory of probabilistic principal component analysis and that of kernel principal component analysis. In this formulation, the kernel component enhances the nonlinear modeling power, while the probabilistic structure offers (i) a mixture model for a nonlinear data structure containing nonlinear sub-structures, and (ii) an effective classification scheme. It also turns out that the original loading matrix [15] is replaced by the newly defined empirical loading matrix. The expectation-maximization algorithm for learning the parameters of interest is then developed. Computation of the reconstruction error and the Mahalanobis distance is also discussed. Finally, we apply this approach to face recognition [198, 209].

Probabilistic distance measures are important quantities in many research areas. For example, the Chernoff distance (or the Bhattacharyya distance as a special case) is often used to bound the Bayes error in a pattern classification task, and the Kullback-Leibler (KL) distance is a key quantity in the information theory literature. However, computing these distances is a difficult task, and analytic solutions are not available except under some special conditions. One popular example is the Gaussian density. The Gaussian density employs only up to second-order statistics, and its modeling capacity is linear and hence rather limited. In Chapter 5, we enhance this capacity through a nonlinear mapping from the original data space to an RKHS, which is implemented using kernel embedding. Since this mapping is nonlinear, we achieve a new paradigm for studying these distances, whose feasibility and efficiency are demonstrated using experiments on synthetic and face recognition examples [189].

1.2.3 Face tracking and recognition from videos

Video sequences are becoming ubiquitous due to the advances in digital imaging devices and the advent of the internet era. A face in video sequences presents further challenges to recognition algorithms besides those common to face recognition from still images.

In Chapter 6, we present an approach called adaptive visual tracking that incorporates appearance-adaptive models in a particle filter to realize robust visual tracking. Tracking needs modeling of inter-frame motion and appearance changes, whereas recognition needs modeling of appearance changes between frames and gallery images. In conventional tracking algorithms, the appearance model is either fixed or rapidly changing, and the motion model is simply a random walk with fixed noise variance. Also, the number of particles is typically fixed. All these factors make the visual tracker unstable. To stabilize the tracker, we propose the following features: an observation model arising from an adaptive appearance model, an adaptive-velocity motion model with adaptive noise variance, and an adaptive number of particles. The adaptive-velocity model is derived using a first-order linear predictor based on the appearance difference between the incoming observation and the existing particle configuration. Occlusion analysis is implemented using robust statistics. Experimental results [186, 201, 203] on tracking visual objects in long outdoor and indoor video sequences demonstrate the effectiveness and robustness of our tracking algorithm.
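The sketch below shows a generic sequential importance sampling / resampling particle-filter skeleton of the kind such trackers build on. The state parameterization, the user-supplied propagate and likelihood functions, and the fixed particle count are illustrative assumptions; it does not reproduce the adaptive appearance, adaptive-velocity, or occlusion-handling components of Chapter 6.

```python
import numpy as np

def particle_filter(frames, likelihood, propagate, n_particles=200, seed=0):
    """Generic SIR particle-filter skeleton (a sketch, not the adaptive tracker).

    frames:     iterable of observations (e.g., video frames)
    likelihood: function(frame, particles) -> nonnegative weights, shape (n_particles,)
    propagate:  function(particles, rng) -> predicted particles (motion model)
    Both model functions are assumed to be supplied by the caller.
    """
    rng = np.random.default_rng(seed)
    # A hypothetical 4-D motion state per particle: [x, y, scale, rotation]
    particles = propagate(np.zeros((n_particles, 4)), rng)
    estimates = []
    for frame in frames:
        particles = propagate(particles, rng)                  # predict
        w = likelihood(frame, particles)                       # weight by appearance model
        w = w / w.sum()
        estimates.append((w[:, None] * particles).sum(axis=0)) # MMSE state estimate
        idx = rng.choice(n_particles, size=n_particles, p=w)   # resample
        particles = particles[idx]
    return np.array(estimates)
```

The simultaneous tracking-and-recognition model of Chapter 7 augments this state with a discrete identity variable, so that the same weighting and resampling steps also propagate a posterior over identities.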
In Chapter 7, recognition of human faces using a gallery of still images and a probe set of videos is systematically investigated using a probabilistic framework called simultaneous tracking and recognition. In still-to-video recognition, where the gallery consists of still images, a time series state space model is proposed to fuse temporal information in a probe video; it simultaneously characterizes the kinematics and the identity using a motion vector and an identity variable, respectively. The joint posterior distribution of the motion vector and the identity variable is estimated at each time instant and then propagated to the next time instant. Marginalization over the motion vector yields a robust estimate of the posterior distribution of the identity variable. A computationally efficient sequential importance sampling (SIS) algorithm is developed to estimate the posterior distribution. Empirical results demonstrate that, due to the propagation of the identity variable over time, a degeneracy in the posterior probability of the identity variable is achieved, giving improved recognition. We perform experiments [192, 193, 194, 195, 196, 197, 199] using images/videos with pose/illumination variations to illustrate the effectiveness of this approach for the still-to-video scenario with appropriate model choices.

In Chapter 8, we present the most general framework for characterizing the face identity in a single image or a group of images, with each image containing a transformed version of the object. In terms of the transformation, the group is made up of either still images or frames of a video sequence. The face identity signature is either discrete- or continuous-valued. This framework, referred to as probabilistic identity characterization, integrates all the evidence of the set and handles the localization problem as well as illumination and pose variations through subspace identity encoding. Issues and challenges arising in this framework are addressed and efficient computational schemes are given. All instances of face recognition algorithms can be interpreted in this most general framework [210].

Part I: Face Recognition under Variations

Chapter 2

Generalized Photometric Stereo

In this chapter, we present a theory of generalized photometric stereo and its application to face recognition across illumination. We first present the generalized photometric stereo algorithm, which is able to handle all appearances, under different illumination, of all objects in a class, in particular the human face class. In contrast, the ordinary photometric stereo algorithm handles the appearances belonging to one object under different illumination. We then evaluate this algorithm in its application to face recognition under illumination variation. Since this generalization is linear, the blending linear coefficients offer an illuminant-invariant identity signature.

Figure 2.1 motivates the proposed approach. The first row of Figure 2.1 displays one Yale object [68] under eight different illumination conditions. Photometric stereo algorithms can recover the varying albedos and surface normals for the object, even assuming no knowledge of the illumination conditions. Here, by photometric stereo algorithm we mean any algorithm that utilizes a Lambertian reflectance model to describe the visual appearance and has the capability to recover the albedos and surface normals involved in the reflectance model.
This can be handled by the ordinary photometric stereo algorithm. Bottom row: Eight different objects illuminated by eight different lighting sources. This cannot be handled by the ordinary photometric stereo algorithm but can be handled by the proposed generalized photometric stereo algorithm. ric stereo algorithm cannot handle the images in the second row of Figure 2.1, where each image represents a different object under a different illumination. This motivates us to propose a generalized photometric stereo approach. As in ordinary photometric stereo algorithm, the generalized photometric stereo algorithm utilizes a Lambertain reflectance model to depict the visual appearance. The significant difference between the ordinary and generalized photometric stereo algorithms lies in the image ensemble they analyze. The image ensemble that the ordinary photometric stereo algorithm analyzes consists of the appearances of one object under different illumination while, in general, the image ensemble that the generalized photometric stereo algorithm analyzes consists of the appearances of different objects, with each object under a different illumination. Analysis of the latter image ensemble is very difficult. To this end, we introduce a key assumption: These different objects belong to one class (for example, the human face class) so that they are linearly spanned by a fixed number of basis objects. Generalized pho- tometric stereo does not assume any knowledge of the lighting sources as well as the blending coefficients. Rather, the generalized photometric stereo approach ac- tually recovers such information. To further complicate the matter, the knowledge 22 of the basis objects is also unknown and needs to be recovered. We evaluate the generalized photometric stereo algorithm for a face recognition application. The key assumption has two important implications. Firstly, it fits with the requirement of a recognition task that needs a generalization capability built on a training set. The idea is to learn the basis objects from the training set. Once learned, we use them to cope with arbitrary images belonging to objects other than those in the training set. Secondly, because the bases are for the object class only, the blending coefficients provide an identity encoding which is invariant to illumination. We use the blending coefficients for face recognition under illumination variation, which results in good recognition performance. Chapter organization Section 2.1 elaborates the generalized photometric stereo algorithm and addresses its issues and challenges. Section 2.2 details the face recognition setting and presents the experimental results using the PIE database. Appendices 2.I and 2.II give supplementary details of the algorithms proposed in the chapter. A glossary of notations In general, we denote a scalar by a, a vector by a, and a matrix with r rows and c columns by Ar?c. The matrix transpose is donate by AT, the pseudo-inverse by A?. The matrix L2-norm is denoted by ||.||2. The following notations are introduced for the sake of notational conciseness and emphasis of special structure. ? Concatenation notations: ? and ?. ? and ? mean horizontal and vertical concatenations, respectively. For 23 example, we can represent a n ? 1 vector an?1 by a = [a1,a2,...,an]T = [?ni=1 ai] and its transpose by aT = [a1,a2,...,an] = [?ni=1 ai]. We can use ? and ? to concatenate matrices to form a new matrix. For instance, given a collection of matrices {A1,A2,...,An} of size r ? 
c, we construct a r ?cn matrix1 [?ni=1 Ai] = [A1,A2,...,An] and a rn?c matrix [?ni=1 Ai] = [AT1 ,AT2 ,...,ATn]T. In addition, we can combine ? and ? to achieve a concise notation. Rather than representing a matrix Ar?c as [aij], we represent it as Ar?c = [?ri=1 [?cj=1 aij] ] = [?cj=1 [?ri=1 aij] ]. Also we can easily construct ?big? matrices using ?small? matrices {A11,A12,...,A1n,...,Amn} of size r?c. The matrix [?mi=1 [?nj=1 Aij] ] is of size rm?cn, the matrix [?mi=1 [?nj=1 Aij] ] of size r?cmn. ? Kronecker (tensor) product: ?. It is defined as Am?n ?Br?c = [?mi=1 [?nj=1 aijB] ]mr?nc. ? Hadamard (element-wise) product: ?. It is defined as Am?n ?Bm?n = [?mi=1 [?nj=1 aijbij] ]m?n. ? Special notation: circledot. This is used for the special structure of the object-specific albedo-shape ma- trix T (The definitions of T, p, and N are listed below), i.e., Td?3 = [?di=1 (pinTi )] = pcircledotNT = (pd?1 ?11?3)?NT3?d Some special scalars, vectors, and matrices are defined as follows: ? d: number of pixels; 1We do not need the size of {A1,A2,...,An} to be exactly same. We use the same matrix size for simplicity. For example, for [?ni=1 Ai], we only need the number of rows of these matrices to be same. 24 ? m: the rank used in the first rank constraint. ? i, j, iprime, jprime, l, and k: loop indices. ? 1r?c: a r?c matrix of ones. ? In: an identity matrix of size n?n. ? h: a pixel; hd?1: an image. ? p: albedo at a pixel. pd?1: albedo vector ? n3?1 = [?a,?b,?c]T: unit surface normal vector; ?a, ?b, and ?c: elements of n. ? N3?d = [?di=1 ni]: the surface normal matrix. ? t3?1 = [a,b,c]T: product of albedo and surface normal; a, b, and c: elements of t. ? Td?3 = [?di=1 (pinTi )]: the object-specific albedo-shape matrix. Also, Td?3 = [a,b,c] where a, b, and c are d?1 vectors. ? s3?1: illumination vector. S3?n: the matrix consisting of a collection of different illumination vectors. ? fm?1: the vector of blending linear coefficients under the first rank constraint. Fm?n: the matrix consisting of a collection of different f?s. ? Wd?3m = [?mi=1 Ti]: the class-specific albedo-shape matrix. Also, Wd?3m = [?mi=1 [ai,bi,ci]]. ? A, B, C: A = [?mi=1 ai], B = [?mi=1 bi], and C = [?mi=1 ci]. ? Wf: Wf = [?mi=1 (Tis)]d?m. 25 ? Ws: Ws = [Af,Bf,Cf]. ? Hd?n = [?ni=1 hi]: the observation matrix consisting of a collection of images. ? ?Wd?3m: the U matrix after a rank-3m SVD factorization of H. ? ?w(x): a 3m?1 vector same as the row in ?W associated with the pixel x ? R3m?3m: the ambiguity matrix in the factorization. ? raj, rbj, and rcj: the (3j ?2)th, (3j ?1)th, and (3j)th columns of the matrix R. ? ?: an indicator function. ? x = (x,y): pixel coordinate; ?x = (?x,y): the symmetric point of x. ? ?: the integrability constraint term. ? ?: the face symmetry constraint term. 2.1 Principle of Generalized Photometric Stereo This section describes the generalized photometric stereo algorithm. We start in Section 2.1.1 by a brief review of related literature and highlight the advantages of the proposed approach. We list in Section 2.1.2 the setting and constraints. Then we present a method to recover the albedos and surface normal for a class of objects in Sections 2.1.3 and 2.1.4. Section 2.1.3 handles the isolated task of separating the illumination (v.i.z. finding the illuminant vector and the blending coefficients) from an arbitrary image, which is used in the recovery algorithm presented in Section 2.1.4. 
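To make the notation above concrete, the structured products it introduces (horizontal and vertical concatenation, the Kronecker and Hadamard products, and the circledot construction T = p circledot N^T) can be reproduced in a few lines of NumPy. This is only an illustrative sketch; the array sizes and random values below are arbitrary and do not come from the dissertation's experiments.

```python
import numpy as np

# Illustrative sizes only (not taken from the experiments in this chapter)
d, m = 4, 2                                    # d pixels, m basis objects
rng = np.random.default_rng(0)

# Albedo vector p (d x 1) and unit surface normals N (3 x d)
p = rng.uniform(0.2, 1.0, size=d)
N = rng.normal(size=(3, d))
N /= np.linalg.norm(N, axis=0, keepdims=True)  # make every column a unit vector

# The circledot construction: T is d x 3 with i-th row equal to p_i * n_i^T
T = p[:, None] * N.T

# Kronecker product (used to form f (x) s) and Hadamard (element-wise) product
f = rng.normal(size=m)
s = rng.normal(size=3)
k = np.kron(f, s)                              # length-3m vector f (x) s
A, B = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
hadamard = A * B                               # element-wise product of A and B

# Horizontal and vertical concatenation of a family of matrices
mats = [rng.normal(size=(3, 2)) for _ in range(m)]
W_horizontal = np.hstack(mats)                 # 3 x 2m
W_vertical = np.vstack(mats)                   # 3m x 2
```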
26 2.1.1 Literature review and proposed approach Recovery of albedos and surface normals has been studied in the computer vision research for a long time. Usually a Lambertian reflectance model, ignoring both attached and cast shadows, is employed. Early works from the shape from shading (SFS) literature have typically assumed a constant albedo field: this assumption is not valid for many real objects and thus limits the practical applicability of the SFS algorithms. Early photometric stereo approaches require the knowledge of lighting conditions, but such knowledge is hard to gather under uncontrolled scenarios. Recent research efforts [74, 68, 94, 95, 96, 101, 103, 104] attempt to go beyond these restrictions by (i) using a varying albedo field, a more accurate model of the real world, and (ii) assuming no prior knowledge or requiring no control of the lighting sources. As a consequence, the complexity of the problem has also significantly increased. If we fix the imaging geometry and only move the lighting source to illumi- nate one object, the observed images (ignoring the cast and attached shadows) lie in a subspace completely determined by three images illuminated by three in- dependent lighting sources [101]. If an ambient component is added [103], this subspace becomes 4-D. If attached shadows are considered, the subspace dimen- sion grows to infinity [97] but most of its energy is packed in a limited number of harmonic components, thereby leading to a low-dimensional subspace approx- imations in [94, 95, 100]. However, all the photometric-stereo-type approaches (except [74]) commonly restrict themselves to using object-specific samples and cannot perform reconstruction combining images produced by different objects. In this chapter, we present a generalized photometric stereo algorithm that is able to handle all appearances of all objects in a class, in particular the human face 27 class. To this end, we impose a rank constraint (i.e. a linear generalization) on the albedos and surface normals of all human faces. We choose the human face as a working example because it naturally fits in our framework and is widely studied in the photometric stereo literature; however this does not pose any limitations in applying our algorithm to other object classes such as vehicles. We propose a rank constraint on the product of albedo and surface normal. The rank constraint enables us to accomplish a factorization of the observation matrix that decomposes a class-specific ensemble into a product of two matrices: one encoding the albedos and surfaces normals for a class of objects and the other encoding blending linear coefficients and lighting conditions. A class-specific en- semble consists of exemplar images of different objects with each under a different illumination, which is beyond what can be analyzed using the bilinear analysis of [138]. Bilinear analysis requires exemplar images of different objects under the same set of illumination conditions. Because a factorization is always up to an in- vertible matrix, unique recovery of the albedos and surface normals is not possible and requires additional constraints. We use two constraints: surface integrability and face symmetry. The surface integrability constraint [99, 137] has been used in several ap- proaches [68, 103] to successfully recover albedo and shape. The symmetry con- straint has also been employed in [102, 104] for face images. 
We present an ap- proach to fusing these constraints to recover the class-specific albedos and surface normals, even in the presence of shadows. More importantly, this approach takes into account the effects of a varying albedo field by approximating the integra- bility terms using only the surface normals instead of the product of the albedos and the surface normals. Due to the nonlinearity embedded in the integrability 28 terms, regular algorithms such as the steepest descent are inefficient. We derive a linearized algorithm to find the solution. 2.1.2 Setting and constraints Photometric stereo We assume a Lambertian imaging model with a varying albedo field. A pixel h is represented as h = p nTs = tTs, (2.1) where [.]T denotes the transpose, p is the albedo at the pixel, n ? [?a,?b,?c]T is the unit surface normal vector at the pixel, t3?1 ? [a ? p?a,b ? p?b,c ? p?c]T is the product of albedo and surface normal, and s (a 3 ? 1 unit vector multiplied by its intensity) specifies a distant illuminant. For time being, we consider the case without the shadow pixels and will deal with the shadow pixels later on. An image h is a collection of d pixels {hi,i = 1,...,d} 2. By stacking all the pixels into a column vector, we have hd?1 ? [?di=1 hi] = [?di=1 (pi nTi )]s = [?di=1 tTi ]s = [?di=1 [ai,bi,ci]]s = (pd?1 circledotNT3?d)s3?1 = [ad?1,bd?1,cd?1]s3?1 (2.2) = Td?3 s3?1, (2.3) where p ? [?di=1 pi] is the albedo vector, N ? [?di=1 ni] is the surface normal matrix, a ? [?di=1 ai] = [?di=1 pi?ai], b ? [?di=1 bi] = [?di=1 pi?bi], and c ? [?di=1 ci] = [?di=1 pi?ci]. To emphasize the structure of the T matrix which is a ?product? 2The index i corresponds to a spatial position x = (x,y). We will interchange both notations. For instance, we might also use x = 1,...,d. 29 of the albedo vector p and the surface normal N, we introduce a special notation circledot to denote T by T ? pcircledotNT ? [?di=1 tTi ] ? [a,b,c]. (2.4) We call the T matrix as the object-specific albedo-shape matrix. In the case of photometric stereo, we have n images of the same object, say {h1,h2,...,hn}, observed at a fixed pose illuminated by n different lighting sources, forming an object-specific ensemble. Simple algebraic manipulation gives: Hd?n ? [?ni=1 hi] = T[?ni=1 si] = Td?3 S3?n, (2.5) where H is the observation matrix and S ? [?ni=1 si] encodes the information on the illuminants. Hence photometric stereo is rank-3 constrained. Therefore, given at least three exemplar images for one object under three different independent illumination, we can determine the identity of a new probe image by checking if it lies in the linear span of the three exemplar images. This requires capturing at least three images for one object in the gallery set, which can be prohibitive in practical scenarios. Note that in this recognition setting, there is no need for the training set; in other words, the training set is equivalent to the gallery set. A typical recognition setting [58], however, assumes no identity overlap between the gallery set and the training set and often stores only one exemplar image for each object in the gallery set. However, the training set can have multiple images for one object. In order to generalize from the training set to the gallery and probe sets, we note that all images in the training, gallery, and probe sets belong to the same face class, which naturally leads to the rank constraint. 
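The rank-3 structure of the object-specific ensemble in Eq. (2.5) is easy to check numerically. The following sketch, which is an illustration rather than part of the original derivation, synthesizes a shadow-free Lambertian ensemble H = TS from random albedos, unit surface normals, and light sources, confirms that rank(H) = 3, and verifies that an image of the same object under a new light is explained by three exemplar images.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 500, 8                                   # d pixels, n lighting conditions

# Synthetic object-specific albedo-shape matrix T = p circledot N^T (d x 3)
p = rng.uniform(0.2, 1.0, size=d)
N = rng.normal(size=(3, d))
N /= np.linalg.norm(N, axis=0, keepdims=True)
T = p[:, None] * N.T

# n random distant light sources (direction times intensity)
S = rng.normal(size=(3, n))

# Shadow-free Lambertian images stacked as columns: H = T S  (cf. Eq. 2.5)
H = T @ S
print(np.linalg.matrix_rank(H))                 # prints 3: the ensemble is rank-3

# An image of the same object under a new light lies in the span of 3 exemplars
h_new = T @ rng.normal(size=3)
_, residual, *_ = np.linalg.lstsq(H[:, :3], h_new, rcond=None)
print(residual)                                 # essentially zero
```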
30 The rank constraint We impose the rank constraint on the T matrix by assuming that any T matrix is a linear combination of some basis matrices {T1,T2,...,Tm} coming from some m basis objects. Rank constraints are often found in the literature [110, 111, 129, 116, 117, 122]. Mathematically, there exist coefficients {fj; j = 1,...,m} such that Td?3 = msummationdisplay j=1 fjTj = [?mj=1 Tj](f?I3) = Wd?3m(fm?1 ?I3), (2.6) where f ? [?mj=1 fj], W ? [?mj=1 Tj], In denotes an identity matrix of dimension n?n, and ? denotes the Kronecker (tensor) product. Since the W matrix encodes all albedos and surface normals for a class of objects, we call it a class-specific albedo-shape matrix. Substitution of (2.6) into (2.3) yields hd?1 = Ts = W(f?I3)s = W(f?s) = Wd?3m k3m?1, (2.7) where k ? f?s. This leads to a two-factor bilinear analysis [138]. With the availability of n images {h1,h2,...,hn} for different objects, observed at a fixed pose illuminated by n different lighting sources, forming a class-specific ensemble, we have Hd?n = [?ni=1 hi] = W[?ni=1 (fi ?si)] = W[?ni=1 ki] = Wd?3m K3m?n, (2.8) where K ? [?ni=1 (fi ?si)] = [?ni=1 ki]. It is a rank-3m problem, which combines the rank of 3 for the illumination and the rank of m for the identity. The rank constraint generalizes many approaches in the literature other than the photometric stereo. If the surface normal is fixed and the albedo field lies in a rank-m linear subspace, we have (2.6) satisfied. Interestingly, the ?Eigenface? approach [62] is just a special case of this approach for a fixed illumination source. 31 Suppose that the fixed illuminant vector is ?s. (2.7) and (2.8) reduce to hd?1 = W(f??s) = ?Wd?mfm?1; Hd?n = [?ni=1 hi] = ?W[?ni=1 fi] = ?Wd?mFm?n, (2.9) where ?W ? [?mi=1 Ti?s]. Therefore, our approach can also be regarded as a gener- alized ?Eigenface? analysis able to handle illumination variation. Our immediate goal is to estimate W and K from the observation matrix H. The first step is to invoke an SVD factorization, H = U?VT, and retain the top 3m components as H = U3m?3mVT3m=?W ?K, where ?W = U3m and ?K = ?3mVT3m. Thus, we can recover W and K up to an 3m ? 3m invertible matrix R with W = ?WR, K = R?1?K. Additional constraints are required to determine the R matrix. We will use the integrability and face symmetry constraints, both related to W. Moreover, K must take the special structure K = [?i (fi ?si)]. Incidentally, by noting that T = p circledot NT, we can introduce a second rank constraint which assumes that (i) any p vector is a linear combination of some basis vectors {p1,p2,...,pm1} with m1 < d and (ii) any N matrix is a linear combination of some basis matrices{N1,N2,...,Nm2}with m2 < d. This is a common constraint used in the face recognition literature. For example, in [43, 66, 76], they all assume that shape and texture have separate bases. However, it turns out that the second rank constraint is not systematically superior to the first rank constraint in terms of recognition performance. Also, it is computationally inconvenient to use the second rank constraint. Hence, there exist two vectors fm1?1 ? [?i fi] and gm2?1 ? [?i gi] such that p = [?m1i=1 pi]f;NT = [?m2j=1 NTj ](g?I3), (2.10) 32 and similarly the image h can be expressed as hd?1 = [?m1i=1 [?m2j=1 (pi circledotNTj )] ](f?g?s) = Yd?3m1m2(fm1?1 ?gm2?1 ?s3?1), (2.11) where Y ? [?m1i=1 [?m2j=1 (pi circledotNTj )] ]. The integrability constraint One common constraint used in SFS research is the integrability of the surface [68, 99, 103, 137]. 
Suppose that the surface function is z = z(x) with x ? (x,y), we must have ??x ?z?y = ??y ?z?x. For the given unit surface normal vector n(x) ? [?a(x),?b(x),?c(x)]T at pixel x, the integrability constraint requires that ? ?x ?b(x) ?c(x) = ? ?y ?a(x) ?c(x) . (2.12) In other words, with ?(x) defined as an integrability constraint term, ?(x) ? ?c(x)? ?b(x) ?x ? ?b(x)??c(x) ?x + ?a(x) ??c(x) ?y ??c(x) ??a(x) ?y = 0. (2.13) If given the product of the albedo and the surface normal t(x) ? [a(x),b(x),c(x)]T with a(x) ? p(x)?a(x), b(x) ? p(x)?b(x), and c(x) ? p(x)?c(x), Eq. (2.13) still holds with ?a, ?b, and ?c replaced by a, b, and c, respectively. Practical algorithms approximate the partial derivatives by forward or backward differences or other differences with the inherent smoothness assumption. Hence, the approximations based on t(x) are very rough especially at places where abrupt albedo variations exist (e.g. the boundaries of eyes, iris, eyebrow, etc.) since the smoothness assumption is seriously violated. We should by all means use n(x) in order to remove this effect. 33 The face symmetry constraint For a face image in a frontal view, one natural constraint is its symmetry about the central y-axis [102, 104]: p(x,y) = p(?x,y);?a(x,y) = ??a(?x,y);?b(x,y) = ?b(?x,y);?c(x,y) = ?c(?x,y), (2.14) which is equivalent to, using x ? (x,y) and its symmetric point ?x ? (?x,y), a(x) = ?a(?x); b(x) = b(?x);c(x) = c(?x). (2.15) If a face image in a non-frontal view, such a symmetry still exists but the coordinate system should be modified to take into account the view change. 2.1.3 Separating illumination In this section, we temporarily assume that the class-specific albedo-shape matrix W is available and solve the problem of separating illumation, v.i.z., foran arbitrary image h, find the illuminant vector s and the coefficient f under the first constraint (or f and g under the second constraint). For convenience in performing tasks such as recognition, we also normalize the solution f to the same range. The first rank constraint gives rise to the basic equation h = W (f?s). So, we convert the separation task to a minimization task of finding f and s to minimize the least square (LS) cost, i.e., minf ,s E(f,s) ?bardblh?W (f?s)bardbl 2, (2.16) Note that f and s can be recovered only up to a non-zero scalar; one can always multiply f by a non-zero scalar and divide s by the same scalar. Therefore, without loss of generality, we can simply pose an additional constraint: 1Tf = 1, where 1m?1 is a vector of 1?s. 34 One way to solve this is indicated in [74]. It is a two-step algorithm. First, k is approximated by k = W?h. Then k = f ? s is used to solve for f and s, again using the LS approximation, i.e. finding f and s such that the cost bardblk ? f ? sbardbl2 is minimized. However, as pointed out in [74], the above algorithm is not robust since two approximations are involved. Before we proceed to the actual separation algorithm, note that shadows in principle increase the rank (for the illumination only) to infinity. However, if those pixels are successfully excluded in our calculations, the rank for the illumination is still maintained to be 3 and the overall rank is 3m. In view of the above and considering the normalization requirement, we modify the cost function as E(f,s) ?bardbl? ?(h?W (f?s))bardbl2 + (1Tf?1)2, (2.17) where ?d?1 indicates the inclusion or exclusion of the pixels of the image h and ? denotes the Hadamard (or element-wise) product. 
Notice that (2.17) can be easily generalized to a cost function used in robust estimation if the vector norm is replaced by a robust function, and ? by an appropriate weight function. Using the fact that Eq. (2.7) provides a series of sub-equations, which is linear in f if s is fixed and in s if f is fixed, we can design a simple iterative algorithm. Each iteration of the algorithm has three steps. In the first step, we solve for the LS estimate of f, given s and ?. f = ? ?? ? Wf 1T ? ?? ? ?? ?? ? ? ?h 1 ? ?? ?; Wf ? [?mi=1 (Tis)]d?m. (2.18) In the second step, we solve for the LS estimate of s, given f and ?: s = W?s(? ?h); Ws ? [ [?mi=1 ai]f, [?mi=1 bi]f, [?mi=1 ci]f ]d?3 ? [Af,Bf,Cf], (2.19) 35 where Ad?m ? [?mi=1 ai], Bd?m ? [?mi=1 bi], and Cd?m ? [?mi=1 ci], respectively. In the third step, given f and s we update ? as follows3: ? = [ |h?W (f?s)| < ? ], (2.20) where ? is a pre-defined threshold. Note that in (2.18) and (2.19), additional saving in computation is possible. We can form dimension-reduced matrices Wprimef and Wprimes and vector hprime and apply the primed version in (2.18) and (2.19) The matrices Wprimef and Wprimes and vector hprime are formed from Wf, Ws, and h, respectively, by discarding those rows corresponding to the excluded pixels. The initial conditions can be arbitrary. But, for fast convergence, we need good initial values. In our implementation, we estimate s using the algorithm presented in [105]. To initialize ?, we employ heuristics to distinguish pixels in shadows: their intensities are close to zero. In practice, we set those pixels whose intensities are smaller than a certain threshold as missing values. In addition, we also set those pixels whose intensities are above a certain threshold as missing values to remove pixels possibly in a specular region. This is only for initialization, we update ? during iterations. To test the stability of our algorithm, we perturb the initial conditions and find that our algorithm is very stable in the sense that it always reaches the same solution (up to the convergence error) regardless of initial conditions and generates a smaller residual than the algorithm reported in [74]. Learning f, g, and s from h using the second constraint is a straightforward generalization of the above algorithm. Appendix 2.I presents such a recovery al- gorithm in an even more general setting, i.e. a multilinear setting. 3This is a Matlab operation which performs an element-wise comparison. 36 2.1.4 Recovering class-specific albedos and surface normals The recovery task is to find from the observation matrix H the class-specific albedo-shape matrix W (or equivalently R), which satisfies both the integrabil- ity and symmetry constraints, as well as the matrices F and S. We decompose R as R3m?3m ? [?mj=1 [raj,rbj,rcj]] and treat the column vectors {raj,rbj,rcj; j = 1,...,m} as our computational ?units?. We also decompose ?W as ?W ? [?dx=1 ?wT(x)] where ?w(x) is a 3m?1 vector same as the row in ?W corresponding to the pixel x. As W ? [?dx=1 [?mj=1 [aj(x),bj(x),cj(x)]]] = ?WR, we have aj(x) = ?wT(x)raj, bj(x) = ?wT(x)rbj, cj(x) = ?wT(x)rcj; j = 1,...,m. (2.21) As mentioned in Section 2.1.3, we must take into account attached and cast shadows. After setting them as missing values, we perform SVD with missing val- ues [149] to find ?W. Other approaches for dealing with missing value are available in [141, 165, 169]. 
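Before formulating the full recovery problem, it is convenient to summarize the three-step illumination-separation iteration of Section 2.1.3 in code, since the same routine reappears below as the sub-step that updates F, S, and the inclusion masks. The sketch is a simplified illustration, not the exact implementation: the initialization of s and of the mask, the threshold tau, and the fixed iteration count are placeholders, and the shadow/specularity heuristics of the text are reduced to a single residual test.

```python
import numpy as np

def separate_illumination(h, W, m, n_iter=50, tau=0.1):
    """Alternating least-squares sketch for h ~ W (f (x) s), cf. Eqs. (2.18)-(2.20).

    h : (d,) image;  W : (d, 3m) class-specific albedo-shape matrix.
    Initialization, the threshold tau, and the iteration count are placeholders.
    """
    # Crude initial mask: drop near-black (shadow) and near-saturated pixels
    eps = ((h > 0.05 * h.max()) & (h < 0.95 * h.max())).astype(float)
    s = np.array([0.0, 0.0, 1.0])               # arbitrary frontal-light initial guess
    f = np.full(m, 1.0 / m)

    for _ in range(n_iter):
        # Step 1: least-squares estimate of f with s and eps fixed (cf. Eq. 2.18);
        # the appended row of ones enforces 1^T f = 1 in the least-squares sense.
        Wf = W @ np.kron(np.eye(m), s[:, None])     # columns are T_i s
        A = np.vstack([eps[:, None] * Wf, np.ones((1, m))])
        b = np.concatenate([eps * h, [1.0]])
        f = np.linalg.lstsq(A, b, rcond=None)[0]

        # Step 2: least-squares estimate of s with f and eps fixed (cf. Eq. 2.19);
        # W @ (f (x) I_3) equals sum_i f_i T_i, the blended albedo-shape matrix.
        Ws = W @ np.kron(f[:, None], np.eye(3))
        s = np.linalg.lstsq(eps[:, None] * Ws, eps * h, rcond=None)[0]

        # Step 3: re-estimate which pixels to include (cf. Eq. 2.20)
        eps = (np.abs(h - W @ np.kron(f, s)) < tau).astype(float)
    return f, s, eps
```

In the recovery algorithm of Section 2.1.4, a routine of this form is simply invoked once per image h_i with the current estimate of W held fixed.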
In view of the above, we formulate the following optimization problem: mini- mize over R, F, and S the cost function E defined as E(R,F,S) = 12 nsummationdisplay i=1 dsummationdisplay x=1 ?i(x){hi(x) ? ?w(x)TR(fi ?si)}2 +?12 msummationdisplay j=1 dsummationdisplay x=1 {?j(x)}2 + ?22 msummationdisplay j=1 dsummationdisplay x=1 {?j(x)}2, = E0(R,F,S)+ ?1E1(R)+ ?2E2(R), (2.22) where ?i(x) is an indicator function which takes the value one if the pixel x of the image hi is not in shadow and zero otherwise, ?j(x) is the integrability constraint term based only on surface normals as defined in (2.13), and ?j(x) is the symmetry constraint term given as ?2j(x) = {aj(x) + aj(?x)}2 +{bj(x) ?bj(?x)}2 +{cj(x) ?cj(?x)}2; j = 1,...,m. (2.23) 37 One approach could be to directly minimize the cost function over W, F, and S. This is in principle possible but numerically difficult as the number of unknowns depends on the image size, which can be quite large in practice. As shown in [98], the recovered surface normal is up to a generalized bas-relief (GBR) ambiguity. To avoid trivial solutions such as a planar object4, we normalize the matrix R by setting ||R||2 = 1 where ||.||2 is a matrix norm. Another ambiguity between fj and sj is a nonzero scale, which can be removing by normalizing f to same range: fTj 1 = 1, where 1m?1 is a vector of 1?s. To summarize, we perform the following task: minR,F,S E(R,F,S) subject to ||R||2 = 1,FT1 = 1. (2.24) An iterative algorithm can be designed to solve (2.24). While solving for F and S with R fixed is quite easy, solving for R with F and S is very difficult because the integrability constraint terms involve partial derivatives of the surface normals that are nonlinear in R. Regular algorithms such as the steepest descent are inefficient. One main contribution of this chapter is that we propose a linearized algorithm to solve for R, which is detailed in Appendix 2.II. We now illustrate how to update F = [?i fi], S = [?i si], and ? = [?i ?i] with R fixed (or W fixed). First notice that F, S, and ? are only involved in the term E0. Moreover, fi, si and ?i are related to only the image hi. This becomes the same as the illumination separation problem defined in Section 2.1.3. The proposed algorithm is also iterative in nature. After running one iterative step to obtain the updated F, S, and ?, we proceed to update R again and this process 4In this way, the surface normals we are recovering are versions up to a GBR ambiguity with respect to the true physical surface normals [68]. However, they are enough for tasks such as face recognition under illumination variation. 38 carries on until convergence. To demonstrate how the algorithm works, we design the following scenario with m = 2 so that the rank of interest is 2x3=6. To defeat the photometric stereo algorithm, which requires one object illuminated by at least three sources, and the bilinear analysis, which requires two fixed objects illuminated by at least three same lighting sources, we construct eight images by taking random linear combinations of two basis objects illuminated by eight different lighting sources. Figure 2.2 displays the two basis objects under the same set of eight illumination and the synthesized images. The recovered class-specific albedo-shape matrix is also presented in Figure 2.2, which clearly shows the two basis objects. The quality of reconstruction is quite good except the nose part. 
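The synthetic experiment just described can also be mimicked numerically to verify the rank-3m structure of Eq. (2.8) and to exhibit the factorization ambiguity. The sketch below is illustrative only (sizes and random draws are arbitrary): it builds a class-specific ensemble with m = 2 basis objects and n = 8 images, checks that rank(H) = 3m = 6, and recovers W up to the invertible 3m x 3m matrix R from a rank-3m SVD.

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, n = 400, 2, 8                     # pixels, basis objects, images

def random_T(d, rng):
    # synthetic object-specific albedo-shape matrix T = p circledot N^T
    p = rng.uniform(0.2, 1.0, size=d)
    N = rng.normal(size=(3, d))
    N /= np.linalg.norm(N, axis=0, keepdims=True)
    return p[:, None] * N.T

W = np.hstack([random_T(d, rng) for _ in range(m)])          # d x 3m

# Each image is a random blend of the basis objects under its own random light
F = rng.dirichlet(np.ones(m), size=n).T                      # m x n, columns sum to 1
S = rng.normal(size=(3, n))
K = np.column_stack([np.kron(F[:, i], S[:, i]) for i in range(n)])  # 3m x n
H = W @ K                                                    # cf. Eq. (2.8)
print(np.linalg.matrix_rank(H))                              # prints 3m = 6

# Rank-3m SVD factorization H = W~ K~; the true W equals W~ R for invertible R
U, sv, Vt = np.linalg.svd(H, full_matrices=False)
W_tilde = U[:, :3 * m]
K_tilde = np.diag(sv[:3 * m]) @ Vt[:3 * m]
print(np.allclose(W_tilde @ K_tilde, H))                     # True
R = np.linalg.lstsq(W_tilde, W, rcond=None)[0]               # 3m x 3m ambiguity
print(np.allclose(W_tilde @ R, W))                           # True
```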
The reason might be that the two basis objects have quite distinct noses so that the nose part of their linear combinations is not visually good (see the image in the last column of the third row), which propagates to the recovery results of albedos and surface normals from these combination images. Our algorithm usually converges within 100 iterations. One notes that the special case m = 1 of our algorithm can be readily applied to photometric stereo (with the symmetry constraint removed) to robustly recover the albedos and surface normals for one object. 2.2 Face Recognition across Illumination This section deals with the face recognition part, which serves as a main evaluation tool for the generalized photometric stereo algorithm. Section 2.2.1 briefly reviews the literature on face recognition across illumination. In Section 2.2.2, we relax the requirement of recovering the albedos and surface normals by utilizing sample imagery as a bootstrap set for the recognition task. We then report in Section 39 Figure 2.2: The first row: The first basis object under eight different illumination. The second row: The second basis object under the same set of eight different illumination. The third row: Eight images (constructed by random linear combi- nations of two basis objects) illuminated by eight different lighting sources. The fourth row: Recovered class-specific albedo-shape matrix W showing the product of varying albedos and surface normals of two basis objects (i.e. the three columns of T1 and T2) using the generalized photometric stereo algorithm. 2.2.3 face recognition results using the PIE database. 2.2.1 Literature review and proposed approach Face recognition under illumination variation is a very challenging problem. The key is to successfully separate the illumination source from the observed appear- ance. Once separated, what remains is illuminant-invariant and appropriate for recognition. In addition to illumination variation, various issues embedded in the recognition setting make recognition even more difficult. We follow the recogni- tion protocol introduced in [58]. Assuming the availability of the following three sets, namely one training set, one gallery set, and one probe set, the recognition algorithm learns from the training set the characteristic features, associates de- 40 scriptive features with the objects in the gallery set, and determines the identity for the objects in the probe set. Different recognition settings can be formed in terms of identity and illumination overlaps among the training, gallery, and probe sets. The most difficult setting, which is the focus of this chapter, is obviously the one in which there is no overlap at all among the three sets in terms of both identity and illumination, except the identity overlap between the gallery and probe sets. In this setting, generalizations from known illumination to unknown illumination and from known identities to unknown identities are particularly desired. State-of-the-art research efforts can be grouped into three streams: subspace methods, reflectance-model methods, and 3D-model-based methods. (i) The first approach is very popular for the recognition problem. After removing the first three eigenvectors, principal component analysis (PCA) was reported to be more robust to illumination variation than the ordinary PCA or the ?Eigenface? approach [62]. Fisher discriminant analysis (FDA) [41, 70] has also been modified to handle illumination variations. 
In general, subspace learning methods are able to cap- ture the generic face space and thus to recognize new objects not present in the training set. The disadvantage is that subspace learning is actually tuned to the lighting conditions of the training set; therefore if the illumination conditions are not similar among the training, gallery, and probe sets, recognition performance may not be acceptable. (ii) The second approach [68, 74, 101, 104] employs a Lambertian reflectance model with a varying albedo field ignoring both attached and cast shadows. The main disadvantage of this approach is the lack of general- ization from known objects to unknown objects. (iii) The third approach employs 3D models. The ?Eigenhead? approach [65] assumes that the 3D geometry (or 3D depth information) of any face lies in a linear space spanned by the 3D geometry 41 of the training ensemble and uses a constant albedo field. The morphable model approach [66] is based on a synthesis-and-analysis strategy. Both geometry and texture are linearly spanned by those of the training ensemble. It is able to han- dle both illumination and pose variations with illumination directions specified. The weakness of the 3D model approaches is that they require 3D models and complicated fitting algorithms. Compared to the above, the proposed recognition scheme possesses the follow- ing properties: (i) It is able to recognize new objects not present in the training set; (ii) It is able to handle new lighting conditions not present in the training set; and (iii) No explicit 3D model and no prior knowledge about illumination condi- tions are needed. In other words, we combine the advantages of subspace learning and reflectance model-based methods. Further, we can avoid the recovery burden as far as recognition is concerned by using a proper bootstrap set under the first constraint. 2.2.2 Bootstrap set A procedure for learning the W matrix was presented in Section 2.1.4. Even though the learning algorithm is quite robust, it is possible that it gets trapped in local minima, which might subsequently yield inferior recognition results. Thus, an alternative approach without explicitly learning the W matrix is very beneficial. We now show that, as far as recognition is concerned, the W matrix under the first constraint can be replaced by a bootstrap set ?W consisting of sample imagery only. The bootstrap set can take various forms. In this chapter, we focus on such a bootstrap set that contains m exemplar objects captured at a fixed pose, each with three images illuminated by three independent but fixed lighting sources. 42 We denote ?hij as the image for the ith exemplar object illuminated by the jth exemplar lighting source. As an image can be expressed in a two-factor form using (2.7), we can write ?hij as ?hij = W(?fi ??sj); i = 1,...,m;j = 1,2,3. (2.25) where?fi isthe blending coefficient vector forthe ith exemplar object and?sj describes the jth exemplar lighting source. The bootstrap set ?W is then expressed as ?Wd?3m = [?mi=1 [?3j=1 ?hij] ] = W[?mi=1 [?3j=1 (?fi ??sj)] ] = Wd?3m(?Fm?m ? ?S3?3), (2.26) where ?F ? [?mi=1 ?fi] and ?S ? [?3j=1 ?sj] define the (not necessarily orthogonal) bases for the identity coefficients and the light sources, respectively. Thus, any vector f lies in the linear span of ?F, i.e., there exists a coefficient vector ? = [?mi=1 ?i] relating f with ?F in the following way: f = msummationdisplay i=1 ?i ?fi = ?F?; (2.27) Similarly, for any vector s, there exists ? 
= [?3j=1 ?j] such that s = 3summationdisplay j=1 ?j ?sj = ?S?. (2.28) Substituting (2.27) and (2.28) into (2.7), we have hd?1 = W(f?s) = W((?F ?)?(?S ?)) = W(?F??S)(???) = ?Wd?3m(?m?1 ??3?1) (2.29) Therefore, if the bootstrap set ?W is given, finding f and s for image h is equivalent to finding ? and ?. Since (2.29) is in a bilinear form, we can compute ? and ? 43 via the same algorithm described in Section 2.1.3 and employ ? for subsequent recognition task. The use of the bootstrap set yields an additional benefit. As indicated before, the rank for covering illumination variations in practice exceeds 3. Suppose that this rank is r > 3, we can use a bootstrap set of dimension d by rm, i.e. using images for m exemplar objects taken under r exemplar lighting conditions, to improve the recognition performance. Obviously, our separation algorithm can be generalized to handle s with dimension r?1. Unfortunately, no bootstrap set can be easily constructed for the second constraint using exemplar images. ?1 ?0.8 ?0.6 ?0.4 ?0.2 0 0.2 0.4 0.6 0.8 1 ?0.1 0 0.1 0.2 0.3 0.4 0 0.2 0.4 0.6 0.8 1 f17 f16 f15 f22 f14 f21 f13 f12 f 9 f20 f11 o ?? ground truth, x ?? estimated value head f 8 f 6 f19 f 7 f 5 f18 f10 f 4 f 2 f 3 Figure 2.3: Right: Flash distribution in the PIE database. For illustrative pur- poses, we move their positions on a unit sphere as only the illuminant directions matter. ?o? means the ground truth and ?x? the estimated values. 44 2.2.3 Recognition experiments We study an extreme recognition setting with the following features: there is no identity overlap between the training set and the gallery and probe sets; only one image per object is stored in the gallery set; the lighting conditions for the training, gallery and probe sets are completely unknown. Our strategy is to: (i) Learn W, if needed, from the training set using the recovery algorithm described in Section 2.1.4 or construct a bootstrap set ?W for simplicity; (ii) With W (or ?W) given, learn the identity signature f?s (or ??s) for both the gallery and probe sets using the recovery algorithm described in Section 2.1.3, assuming no knowledge of illumination directions; and (iii) Perform recogni- tion using the nearest correlation coefficient. Suppose that a gallery image g has its signature5 fg (or ?g) and a probe image p has its signature fp (or ?g), their correlation coefficient is k(p,g) = (fp,fg)/ radicalBig (fp,fp)(fg,fg), (2.30) where (x,y) is an inner-product such as (x,y) = xT?y with ? learned or given. We use ? as an identity matrix. PIE database We use the Pose and Illumination and Expression (PIE) database [75] in our ex- periment6. Figure 2.3 shows the distribution of all 21 flashes used in PIE and their estimated positions using our algorithm. Since the flashes are almost symmetri- cally distributed about the head position, we only use 12 of them distributed on 5In the sequel, we simply refer as f = [fT,gT]T for the second rank constraint 6We use the ?illum? part of the PIE database that is close to obeying the Lambertian model as in [70] while the ?light? part that includes an ambient light is used in [66]. 45 the right half of the unit sphere in Figure 2.3. More specifically, the flashes we used are f08, f09, f11-f17, and f20-f22. In total, we used 68x12=816 images in a fixed view as there are 68 subjects in the PIE database. Figure 2.4 displays one PIE object under the selected 12 illuminants. Registration is performed by aligning the eyes and mouth to desired posi- tions. 
No flow computation [66] is carried on for further alignment. After the pre-processing step, the cropped out face image is of size 50 by 50, i.e. d = 2500. Also, we only study gray images by taking the average of the red, green, and blue channels of their color versions. We use all 68 images under one illumination to form a gallery set and under another illumination to form a probe set. The training set is taken from sources other than the PIE dataset. Thus, we have 12x11=132 tests, with each test giving rise to a recognition score. Gallery f08 f09 f11 f12 f13 f14 f15 f16 f17 f20 f21 f22 Average Probe f08 - 96 96 87 66 60 46 29 22 85 78 53 65 f09 94 - 96 96 90 87 56 40 24 84 96 68 75 f11 94 91 - 97 72 72 38 28 16 100 94 51 69 f12 88 94 97 - 88 93 57 41 28 94 100 76 78 f13 56 87 59 85 - 100 90 71 50 54 87 100 76 f14 51 85 63 93 100 - 90 66 49 59 91 99 77 f15 33 40 37 49 85 88 - 93 78 32 49 97 62 f16 19 26 26 32 59 44 84 - 93 26 31 63 46 f17 14 28 19 26 50 41 68 94 - 19 26 44 39 f20 90 85 99 97 65 69 38 26 21 - 93 53 67 f21 79 94 93 100 88 94 62 49 28 91 - 76 78 f22 43 65 46 75 99 99 97 76 59 43 74 - 70 Average 60 72 66 76 78 77 66 56 42 63 74 71 67 Table 2.1: Recognition rate obtained by our approach using the first rank con- straint and the Yale?s database as the training set. 46 Figure 2.4: The first and second rows display one PIE object under the selected 12 illuminants (from left to right, row 1 to row 2: f08, f09, f11-f17, and f20-f22) and the third and fourth rows one Yale object under 9 lights (most frontal lights) used in the training set. Recognition across illumination We first assume that all the images have been captured in a frontal view, but we do not assume that the directions and intensities of the illuminants are known. [Yale training set] The training (or bootstrap) set is first taken as the Yale?s illumination database [68]. There are only 10 subjects (i.e. m = 10) in this database and each subject has 64 images in frontal view illuminated by 64 different lights. We pick out images under 9 lights (mostly frontal) in order to cover up to second-order harmonic components [95]. Figure 2.3 shows one Yale object under r = 9 lights. Table 2.1 lists the recognition rate for the PIE database using the first rank 47 Gallery f08 f09 f11 f12 f13 f14 f15 f16 f17 f20 f21 f22 Average Probe f08 - 100 90 66 21 9 1 9 4 60 60 1 38 f09 100 - 72 94 59 31 10 24 13 51 84 13 50 f11 97 91 - 100 29 24 13 15 10 100 94 19 54 f12 93 97 100 - 93 90 56 59 35 96 100 69 81 f13 19 62 22 68 - 97 82 100 68 13 84 81 63 f14 9 15 12 62 100 - 100 84 82 12 72 100 59 f15 0 3 1 4 76 100 - 74 76 1 18 100 41 f16 6 25 3 31 82 65 71 - 100 3 41 57 44 f17 4 12 3 31 51 56 81 100 - 3 28 59 39 f20 88 76 100 99 28 28 15 12 16 - 99 19 53 f21 84 97 97 100 96 88 57 74 46 96 - 71 82 f22 3 4 3 13 72 100 100 50 57 3 24 - 39 Average 46 53 46 61 64 62 53 54 46 40 64 54 54 Table 2.2: Recognition rate obtained by the ?Eigenface?approach (discarding the first 3 components) using the Yale?s database as the training set. 
Gallery f08 f09 f11 f12 f13 f14 f15 f16 f17 f20 f21 f22 Average Probe f08 - 97 97 93 63 56 29 16 9 94 85 29 61 f09 99 - 97 99 96 88 38 21 12 91 96 57 72 f11 99 96 - 99 62 63 29 16 12 100 94 41 65 f12 96 99 100 - 93 91 40 22 13 99 100 69 75 f13 74 93 69 84 - 100 71 37 16 62 87 97 72 f14 66 88 74 93 100 - 76 34 19 71 93 100 74 f15 22 34 24 35 71 66 - 82 46 28 44 99 50 f16 12 21 13 18 28 26 74 - 85 18 22 47 33 f17 6 7 9 13 15 18 40 81 - 13 16 24 22 f20 93 88 100 96 63 68 32 19 13 - 96 43 65 f21 87 94 100 100 93 99 51 22 15 99 - 84 77 f22 41 65 43 62 96 100 100 56 29 46 71 - 64 Average 63 71 66 72 71 70 53 37 24 65 73 63 61 Table 2.3: Recognition rate obtained by the ?Fisherface? approach using the Yale?s database as the training set. 48 Gallery f08 f09 f11 f12 f13 f14 f15 f16 f17 f20 f21 f22 Average Probe f08 - 100 99 99 97 97 79 72 43 99 97 93 88 f09 100 - 99 99 99 99 97 91 60 97 97 97 94 f11 99 99 - 100 100 100 90 76 65 100 100 99 93 f12 99 99 100 - 100 100 100 93 76 100 100 100 97 f13 99 99 100 100 - 100 100 100 88 99 100 100 99 f14 99 99 100 100 100 - 100 100 96 99 100 100 99 f15 84 94 93 100 100 100 - 100 100 88 100 100 96 f16 69 87 78 90 100 100 100 - 100 69 90 100 89 f17 44 60 51 71 84 91 99 100 - 56 75 94 75 f20 97 97 100 100 100 100 90 74 68 - 100 99 93 f21 97 97 100 100 100 100 100 97 82 100 - 100 98 f22 90 97 96 100 100 100 100 100 99 97 100 - 98 Average 89 93 92 96 98 99 96 91 80 91 96 98 93 Table 2.4: Recognition rate obtained by our approach with the first rank constraint and Vetter?s database as the training set. Gallery f08 f09 f11 f12 f13 f14 f15 f16 f17 f20 f21 f22 Average Probe f08 - 100 99 99 97 93 82 59 35 99 97 88 86 f09 100 - 99 99 99 99 91 84 53 99 99 96 92 f11 99 99 - 100 100 100 91 71 44 100 100 94 90 f12 99 99 100 - 100 100 99 90 72 100 100 99 96 f13 99 99 100 100 - 100 99 99 79 99 100 99 97 f14 99 99 100 100 100 - 99 97 87 99 100 99 98 f15 93 96 93 97 99 99 - 100 99 96 99 100 97 f16 75 90 69 93 97 99 100 - 99 69 94 100 89 f17 47 68 51 78 84 90 100 100 - 57 82 94 77 f20 99 99 100 100 99 100 91 76 51 - 100 94 92 f21 99 99 100 100 100 100 99 94 78 100 - 99 97 f22 97 96 96 99 99 99 100 100 90 96 99 - 97 Average 91 94 91 97 97 98 95 88 71 92 97 96 92 Table 2.5: Recognition rate obtained by our approach with the second rank con- straint and Vetter?s database as the training set. 49 f08 f09 f11 f12 f13 f14 f15 f16 f17 f20 f21 f22 Average Gly: Front f12, Prb: Front 99 99 100 - 100 100 97 93 78 100 100 99 96 Gly: Front f12, Prb: Side 85 89 88 94 96 96 88 81 68 86 95 91 88 Gly: Side f12, Prb: Front 92 91 99 97 85 87 72 53 33 94 96 87 82 Gly: Side f12, Prb: Side 100 100 100 - 100 100 99 85 63 100 100 100 95 Gly: Side f08, Prb: Side - 100 100 100 99 97 72 59 35 100 100 90 86 Gly: Side f17, Prb: Side 26 41 37 57 76 84 100 100 - 43 65 91 66 Gly: Side f22, Prb: Side 75 97 88 99 100 100 100 100 100 91 100 - 95 Table 2.6: Recognition rate across poses and illumination. The front view is from camera 27, and the side view from camera 05. constraint and the Yale?s database as the training set. Even with m = 10, we obtain quite good results, especially when the gallery and probe sets are close in terms of their flash positions. When the flashes of the gallery and probe sets become separated, the recognition rate decreases. The worst performance is with the gallery set at f08 and the probe set at f17, two most separated flashes. In general, using images under frontal or near-frontal illuminants (e.g. f09, f12, and f21) as gallery sets produces good results. 
For comparison, we also implemented the ?Eigenface? approach (discarding the first 3 components) and the ?Fisherface? approach by training the subspace pro- jection vectors from the same training set. The recognition rates are presented in Tables 2.2 and 2.3. The ?Fisherface? approach outperforms the ?Eigenface? ap- proach, but their performances are worse than our approach. This highlights the virtue of decoupling the illumination variations. [Vetter training set] Generalization capacity with m = 10 is rather restrictive. We now increase m from 10 to 100 by using Vetter?s 3D face database [66]. As this is a 3D database, we actually have W (even p and N) available. However, we believe that using a training set of m = 100 from other sources, which to the best of our 50 knowledge is not available in the literature, can yield similar performances. Table 2.4 tabulates the recognition rates obtained by imposing the first rank constraint. Significant improvements have been achieved by increasing m. This seems to suggest that a moderate sample size of 100 is enough to span the entire face space under a fixed view. As an interesting comparison, Blanz and Vetter [66] also reported the recog- nition rates across the illumination variation (with only ?f12? being the gallery set and using the ?light? part of the PIE database) and their average is 98% for color images while ours is 96% for gray images under the first rank constraint. We believe that our performances can be boosted using the color images and finer alignment. Note that our approaches look similar to [66], but there are significant differences. In [66] depths and texture maps of explicit 3D face models are used, while our image-based approach uses the concepts of albedo and surface normal and can recover the 3D models under the first constraint. Also, [66] needs a very good initialization for the lighting source. We then experiment with the second rank constraint. Note that here we need explicit knowledge of p and N, while under the first constraint we can use a boot- strap set instead. Table 2.5 tabulates the recognition rate obtained. It seems that the use of the second rank constraint does not help much. In fact, it is slightly worse due to possible over-parameterization. In addition, it is difficult to estimate p and N using the second rank constraint. Thus, it seems beneficial to use the first rank constraint in practice. 51 Recognition across views and illumination We now present our preliminary results on recognition across poses and illumina- tion. Our approach in principle can also handle pose variation since the W matrix contains all the needed 3D information, i.e., we can recover the 3D model from it. Also as mentioned earlier, learning the W matrix can be avoided by using a boot- strap set. Here, we simply use Vetter?s database to handle pose variation. Pose is roughly estimated from the geometric calibration information provided in the PIE database. We then warp the 3D model to the desired pose. The motivation is the following: suppose the pose parameter is ?, then the image h? at pose ? can be expressed as h? = W?(f?s). (2.31) In other words, the illumination-invariant signature f for image h? is kept the same if we have the class-specific albedo and shape matrix at pose ?. The rest just follows using the first constraint approach. Table 2.6 lists the recognition results obtained. In general, using the side view still yields quite good recognition result. 
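To summarize the recognition procedure used in these experiments, the following sketch strings together the bootstrap-set signature of Section 2.2.2 and the nearest-correlation-coefficient rule of Eq. (2.30). It assumes the r = 3 bootstrap case, it reuses the separate_illumination routine sketched earlier in this chapter (so it is not fully self-contained), and the function names are illustrative rather than taken from an actual implementation.

```python
import numpy as np

def identity_signature(h, W_boot, m):
    # Recover the illumination-invariant coefficients (alpha) of image h from the
    # bootstrap matrix W_boot (d x 3m), reusing the alternating solver
    # separate_illumination(...) sketched after Section 2.1.3.
    alpha, beta, _ = separate_illumination(h, W_boot, m)
    return alpha

def correlation(f_p, f_g):
    # Nearest-correlation-coefficient score, cf. Eq. (2.30) with Psi = I
    return float(f_p @ f_g / np.sqrt((f_p @ f_p) * (f_g @ f_g)))

def recognize(probe_images, gallery_images, W_boot, m):
    gallery_sigs = [identity_signature(g, W_boot, m) for g in gallery_images]
    labels = []
    for p in probe_images:
        f_p = identity_signature(p, W_boot, m)
        labels.append(int(np.argmax([correlation(f_p, f_g) for f_g in gallery_sigs])))
    return labels                               # index of best gallery match per probe
```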
Illuminant estimation In the above process, we achieve illuminant estimation. Figure 2.3 also shows the estimated illuminant directions. It is quite accurate for estimation of directions of flashes near frontal pose. But when the flashes are significantly off-frontal, accuracy slightly goes down. 52 2.3 Appendix Appendix 2.I: Recovering multilinear coefficients from h The algorithm presented in Sec. 2.1.4 can be generalized to recover {f1,...,fn} from h if the following multilinear form is satisfied: hd?1 = Wd?producttextn i=1 mi (f1m1?1 ?...?fnmn?1), (2.32) where W ? [?j1,...,jn wj1,...,jn]. Again, we impose the addition constraints: 1Tfi = 1; i = 1,...,n?1. In the iteration for computing fi given all other fj?s (j negationslash= i) fixed, we have, h = Aifi, (2.33) where Ai ? [?miji=1 aiji] and aiji = m1summationdisplay j1=1 ... mnsummationdisplay jn=1 c1j1 ...ci?1ji?1ci+1ji+1 ...cnjnwj1,...,jn. (2.34) If 1Tfi = 1 is imposed for i = 1,...,n?1, the LS solution to fi is fi = ?? ??? ??? ??? ??? ? ? ?? ? Ai 1T ? ?? ? ?? ?? ? h 1 ? ?? ?, i = 1,...,n?1; [An]? h, i = n. (2.35) Appendix 2.II: Computing R from H This appendix concentrates on the most difficult part of recovering the albedos and surface normals from H: updating R with F, S, and ? fixed. We will take vector derivatives of E with respective to {rij; i = a,b,c; j = 1,...,m} and treat the three terms in E separately. 53 [About E0.] With fjprime ? [?mj=1 fjprimej] and sjprime ? [sjprimea,sjprimeb,sjprimec]T, ?E0 ?rij = nsummationdisplay jprime=1 dsummationdisplay x=1 ?jprime(x){?w(x)TR(fjprime ?sjprime)?hjprime(x)}?w(x)fjprimejsjprimei = nsummationdisplay jprime=1 dsummationdisplay x=1 ?jprime(x){ summationdisplay l=a,b,c msummationdisplay k=1 ?w(x)Trlkfjprimeksjprimel ?hjprime(x)}?w(x)fjprimejsjprimei = summationdisplay l=a,b,c msummationdisplay k=1 { nsummationdisplay jprime=1 dsummationdisplay x=1 ?jprime(x)fjprimeksjprimelfjprimejsjprimei?w(x)?w(x)T}rlk ? nsummationdisplay jprime=1 dsummationdisplay x=1 ?jprime(x)hjprime(x)fjprimejsjprimei?w(x) = summationdisplay l=a,b,c msummationdisplay k=1 Olkijrlk ??ij, (2.36) where {Olkij;l = a,b,c;k = 1,...,m} are properly defined 3m?3m matrices, and ?ij is a properly defined 3m?1 vector. [About E1.] Using forward differences to approximate the partial derivatives7, ??aj(x,y) ?y similarequal ?aj(x,y+1) ??aj(x,y); ??bj(x,y) ?x similarequal ?bj(x+1,y) ??bj(x,y); ??cj(x,y) ?x similarequal ?cj(x+1,y) ??cj(x,y); ??cj(x,y) ?y similarequal ?cj(x,y+1) ??cj(x,y), (2.37) we have ?j(x,y) ??bj(x+1,y)?cj(x,y) ??bj(x,y)?cj(x+1,y) + ?aj(x,y)?cj(x,y+1) ??aj(x,y+1)?cj(x,y). (2.38) Suppose we are given the product of albedo and surface normal [aj(x),bj(x),cj(x)] as in (2.21), we can derive the albedo pj(x) and surface normals ?aj(x), ?bj(x), and ?cj(x) as follows: pj(x) = radicalbigg (?wT(x)raj)2 + (?wT(x)rbj)2 + (?wT(x)rcj)2, (2.39) ?aj(x) = ?w T (x)raj pj(x) , ?bj(x) = ?w T (x)rbj pj(x) , ?cj(x) = ?wT(x)rcj pj(x) . (2.40) 7Partial derivatives of boundary pixels require different approximations. But, similar deriva- tions can be derived. 54 So, their partial derivatives with respect to raj are ??aj(x) ?raj = ?w(x) pj(x) ? ?w T (x)raj ?w(x)?wT(x)raj p3j(x) = 1??a2j(x) pj(x) ?w(x), (2.41) ??aj(x) ?rbj = ??w T (x)raj ?w(x)?wT(x)rbj p3j(x) = ??aj(x)?bj(x) pj(x) ?w(x), ??aj(x) ?rcj = ??aj(x)?cj(x) pj(x) ?w(x). 
(2.42) Similarly, we can derive their partial derivatives with respect to rbj and rcj, which are summarized as follows: ??kj(x) ?rlj = ??kj(x)?lj(x) pj(x) ?w(x), ??kj(x) ?rkj = 1??k2j(x) pj(x) ?w(x), k,l ?{a,b,c}, k negationslash= l. (2.43) Notice that ??aj(x)?r bj = ??bj(x)?r aj , ??aj(x)?r cj = ??cj(x)?r aj , and ??bj(x)?r cj = ??cj(x)?r bj , which imply saving in computations. We now compute the partial derivative of ?j(x,y) with respect to raj: ??j(x,y) ?raj = ? ?raj{ ?bj(x+1,y)?cj(x,y) ??bj(x,y)?cj(x+1,y) + ?aj(x,y)?cj(x,y+1) ??aj(x,y+1)?cj(x,y)} = { ?aj(x,y)?cj(x,y)p j(x,y)pj(x,y+1) ?w(x,y)?wT(x,y+1) ? ?aj(x,y+1)?cj(x,y+1)p j(x,y)pj(x,y+1) ?w(x,y+1)?wT(x,y)}raj + {?aj(x+1,y)?cj(x+1,y)p j(x,y)pj(x+1,y) ?w(x+1,y)?wT(x,y) ? ?aj(x,y)?cj(x,y)p j(x,y)pj(x+1,y) ?w(x,y)?wT(x+1,y)}rbj + { ?aj(x,y) ?bj(x,y) pj(x,y)pj(x+1,y) ?w(x,y)?w T (x+1,y) ? ?aj(x+1,y)?bj(x+1,y) pj(x,y)pj(x+1,y) ?w(x+1,y)?w T (x,y) + 1??a2j(x,y) pj(x,y)pj(x,y+1) ?w(x,y)?w T (x,y+1) ? 1??a2j(x,y+1) pj(x,y)pj(x+1,y) ?w(x,y+1)?w T (x,y)}rcj = Paaj(x,y)raj +Pbaj(x,y)rbj +Pcaj(x,y)rcj = summationdisplay l=a,b,c Plaj(x,y)rlj, (2.44) where Paaj(x,y), Pbaj(x,y), and Pcaj(x,y) are properly defined matrices of dimension 3m? 3m. By the same token, using properly defined Pabj(x,y), Pbbj(x,y), Pcbj(x,y), Pacj(x,y), Pbcj(x,y), and Pccj(x,y), we can calculate ??j(x,y) ?rij = summationdisplay l=a,b,c Plij(x,y)rlj; i = a,b,c, (2.45) 55 and, finally, ?E1 ?rij = dsummationdisplay x=1 ?j(x) summationdisplay l=a,b,c Plij(x)rlj = summationdisplay l=a,b,c Plijrlj; Plij ? dsummationdisplay x=1 ?j(x)Plij(x). (2.46) [About E2.] The symmetry constraint term ?j(x) defined as in (2.23) can be expressed as ?2j(x) = rTajQa(x)raj +rTbjQb(x)rbj +rTcjQc(x)rcj, (2.47) where Qa(x), Qb(x), and Qc(x) are symmetric matrices with size 3m?3m: Qa(x) = (?w(x) + ?w(?x))(?w(x) + ?w(x))T,Qb(x) = (?w(x) ? ?w(?x))(?w(x) ? ?w(x))T,Qc(x) = Qb(x). (2.48) The derivatives of ?2j(x)/2 and E2 with respective to raj, rbj, and rcj are ?{?2j(x)/2} ?rij = Q i (x)rij; ?E2 ?rij = dsummationdisplay x=1 Qi(x)rij = Qirij; Qi = dsummationdisplay x=1 Qi(x). (2.49) Combining the above derivations and using ?E?rij = 0, we have summationdisplay l=a,b,c msummationdisplay k=1 Olkijrlk + ?1 summationdisplay l=a,b,c Plijrlj + ?2Qirij = ?ij;i = a,b,c; j = 1,...,m. (2.50) We therefore arrive at a set of equations linear in {rij; i = a,b,c; j = 1,...,m} that can be solved easily. After finding the new R, we normalize it using R=R/||R||2. 56 Chapter 3 Illuminating Light Field State-of-the-art algorithms are not able to produce satisfactory recognition per- formance when confronted by pose and illumination variations. In general, pose variation is slightly more difficult to handle than illumination variation. The pres- ence of both variations further challenges the recognition algorithms. This chapter extends the generalized photometric stereo algorithm presented in Chapter 2 to handle pose variation. The way we handle pose variation is through the ?Eigen? light approach [69]. This unified approach is image-based, in the sense that, in the training set, only 2D images are used and no explicit 3D models are needed. The unification is achieved by exploiting the fact that both approaches use a subspace model for identity. The ?Eigen? light field approach combines subspace modeling with light field and offers a pose-invariant encoding of identity. The generalized photometric stereo algorithm combines the identity subspace with the illumination model and provides an illumination-invariant description. 
However, the ?Eigen? light field approach assumes a fixed illumination and cannot handle illumination variations, i.e., its pose-invariant identity encoding is not invariant to variations in illumination. The generalized photometric stereo algorithm assumes a 57 fixed pose and cannot easily handle pose variations, i.e., its illumination-invariant identity description is not invariant to variations in pose. This motivates our integrated approach for handling both pose and illumination variations using an illumination- and pose-invariant identity signature. Chapter organization Section 3.1 presents the principle of the illuminating light field approach. It starts by reviewing in Section 3.1.1 the related literature, then describes Section 3.1.2 the ?Eigen? light field approach [69] that performs FR under pose variations, and finally introduces in Section 3.1.3 our integrated approach. Section 3.1.4 presents algorithms for recovering the identity signature that is invariant to illumination and pose. Section 3.2 gives our experimental results on the PIE database [75] and comparisons with other approaches. 3.1 Principle of Illuminating Light Field 3.1.1 Literature review Identity, illumination, and pose Three factors are involved in face recognition, namely illumination, pose, and iden- tity. Using the human face images as examples, we now address issues involved in each of the three factors by fixing the other two. ? Illumination. Various illumination models are available in the literature, ranging from models for highly specular objects such as mirrors to models for matte objects. Mostly objects belong to the latter category, which is described by a Lambertian reflectance model for its simplicity. Early shape 58 from shading approaches [10] assumed a constant albedo field. However, this assumption is violated at locations such as eyes and mouth edges. For the human face, the Lambertian reflectance model with a varying albedo field provides a reasonable approximation [68, 74, 95, 103, 204]. The Phong illumination model also has found application [66]. This proposed method adopts the Lambertian reflectance model with a varying albedo field to model the effect of illumination. ? Pose. The issue of pose essentially amounts to a correspondence problem. If dense correspondences across poses are available and if a Lambertian re- flectance model is further assumed, a rank-1 constraint is implied because theoretically, a 3D model can be recovered and used to render novel poses. However, recovering a 3D model from 2D images is a difficult task. There are two types of approaches: model-based and image-based. Model-based approaches [66, 139, 145, 146] require explicit knowledge of prior 3D mod- els, while image-based approaches [125, 129, 142, 143, 144] do not use prior 3D models. In general, model-based approaches [66, 139, 145, 146] register the 2D face image to 3D models that are given beforehand. In [139, 146], a generative face model is deformed through bundle adjustment to fit 2D images. In [145], a generative face model is used to regularize the 3D model recovered using the SfM algorithm. In [66], 3D morphable models are con- structed based on many prior 3D models. There are mainly three types of image-based approaches: Structure from motion (SfM) [125, 129], visual hull [142, 144], and light field rendering [143, 140] methods. The SfM approach [125] works with sparse correspondence and does not reliably recover the 3D model amenable for practical use. 
The visual hull methods [142, 144] assume 59 that the shape of the object is convex, which is not always satisfied by the human face, and also require accurate calibration information. The light field rendering methods [143, 140] relax the requirement of calibration by a fine quantization of the pose space and recover a novel view by sampling the captured data that form the so-called light field. The proposed method is image-based, so no prior 3D models are used. It handles a given set of views through an analysis analogous to the light field concept. However, no novel poses are rendered. ? Identity. One straightforward method to describe the identity is through discrete labels. However, using this discrete description it is impossible to establish a link between objects used in the training and testing stages in terms of the identity. An alternative way is to associate a discrete label with a continuous-valued variable, which is regarded as an identity signature. One goodexample is to use subspace encoding [47, 62], where linear generalization is assumed to incorporate the fact that all human faces are similar. Once the subspace basis are learned from the training set, they are used to characterize the gallery/probe set, thus enabling the required generalization capability. In this chapter, we also use the subspace method to describe the identity. Face recognition under illumination variation FR under illumination variation must take into account the two factors of identity and illumination. Refer to Section 2.2.1 in Chapter 2 for a review of related work. 60 Face recognition under pose variation As mentioned earlier, pose variation essentially amounts to a correspondence prob- lem. If dense correspondences across poses are available and a Lambertian re- flectance is assumed, then a rank-1 constraint is implied. Unfortunately, finding correspondences is a very difficult task and, therefore there exist no subspace based on an appearance representation when confronted with pose variation. Approaches to face recognition under pose variation [68, 69, 72] avoid the correspondence prob- lem by sampling the continuous pose space into a set of poses, v.i.z. storing mul- tiple images at different poses for each person at least in the training set. In [72], view-based ?Eigenfaces? are learned from the training set and used for recognition. In [68], a denser sampling is used to cover the pose space. However, as [68] uses object-specific images, appearances belonging to a novel object (i.e. not in the training set) cannot be handled. In [69], the concept of light field [143] is used to characterize the continuous pose space. ?Eigen? light fields are learnt from the training set. However, the implementation of [69] still discretizes the pose space and recognition can be based on probe images at poses in the discretized set. One should note that the light field is not related to variation in illumination. Face recognition under illumination and pose variations Approaches to handling both illumination and pose variations include [66, 70, 77, 78, 202]. The approach [66] uses morphable 3D models to characterize the human faces. Both geometry and texture are linearly spanned by those of the training ensemble consisting of 3D prior models. It is able to handle both illumination and pose variations. Its only weakness is a complicated fitting algorithm. Recently, a fitting algorithm more efficient than suggested in [66] is proposed in [73]. 
In [70], the Fisher light field is proposed to handle both illumination and pose variations, where the light field covers the pose variation and Fisher discriminant analysis covers the illumination variation. Since discriminant analysis is a purely statistical tool that minimizes the within-class scatter while maximizing the between-class scatter, and bears no relationship to any physical illumination model, it is questionable whether discriminant analysis can generalize to new lighting conditions. This generalization may in fact be inferior because discriminant analysis tends to over-tune to the lighting conditions present in the training set. The "Tensorface" approach [77, 78] uses a multilinear analysis to handle various factors such as identity, illumination, pose, and expression. The factors of identity and illumination are suitable for linear analysis, as evidenced by the "Eigenface" approach (assuming a fixed illumination and a fixed pose) and the subspace induced by the Lambertian model, respectively. However, the factor of expression is only arguably amenable to linear analysis, and the factor of pose is not. In [202], preliminary results are reported by first warping the albedo and surface normal fields to the desired pose and then carrying out recognition as usual.

3.1.2 Pose-invariant identity signature

The light field measures the radiance in free space (free of occluders) as a 4D function of position and direction. An image is a 2D slice of the 4D light field. If the space is only 2D, the light field is a 2D function. This is illustrated in Figure 3.1 (also see [69] for another illustration), where a camera conceptually moves along a circle, within which a square object with four differently colored sides resides. The 2D light field $L$ is a function of $\theta$ and $\phi$ as defined in Figure 3.1. The image of the 2D object is just a vertical line. If the camera is allowed to leave the circle, then a curve is traced out in the light field to form the image, i.e., the light field is sampled accordingly. Even though the light field of a 3D object is a 4D function, we still use the notation $L(\theta,\phi)$ for simplicity.

Figure 3.1: This figure illustrates the 2D light field of a 2D object (a square with four differently colored sides), which is placed within a circle. The angles $\theta$ and $\phi$ relate the viewpoint to the radiance from the object. The right image shows the actual light field for the square object.

Starting from the light fields $\{L_n(\theta,\phi);\ n = 1,\ldots,N\}$ of the training samples, the "Eigen" light field approach conducts a PCA to find the eigenvectors $\{e_i(\theta,\phi);\ i = 1,\ldots,m\}$ that span a rank-$m$ subspace. The "Eigen" light field [69] is again motivated by the similarity among human faces. Using the fact [47, 62] that if $Y^TY$ has an eigenpair $(\lambda, v)$, then $YY^T$ has a corresponding eigenpair $(\lambda, Yv)$, we know that $e_i(\theta,\phi)$ is just a linear combination of the $L_n(\theta,\phi)$'s, i.e., there exist $a_{in}$'s such that
$$e_i(\theta,\phi) = \sum_n a_{in} L_n(\theta,\phi). \quad (3.1)$$
For an arbitrary subject, its light field $L(\theta,\phi)$ lies in this rank-$m$ subspace. In other words, there exist coefficients $f_i$'s such that, $\forall (\theta,\phi)$,
$$L(\theta,\phi) = \sum_{i=1}^{m} f_i e_i(\theta,\phi) = e(\theta,\phi)^T f, \quad (3.2)$$
where $e(\theta,\phi) \equiv [\oplus_{i=1}^{m} e_i(\theta,\phi)]_{m\times 1}$ and $f \equiv [\oplus_{i=1}^{m} f_i]_{m\times 1}$. As mentioned earlier, to obtain an image $h^v$ at a particular pose $v$ (a collection of $d$ pixels), one should sample the light field.
Suppose that one pixel $h^v$ is the point sample of the light field associated with the coordinate $(\theta^v,\phi^v)$, i.e.,
$$h^v = L(\theta^v,\phi^v). \quad (3.3)$$
The image $h^v$ can be expressed as
$$h^v \equiv [\oplus_{i=1}^{d} h^v_i] = [\oplus_{i=1}^{d} L(\theta^v_i,\phi^v_i)], \quad (3.4)$$
where $(\theta^v_i,\phi^v_i)$ is the corresponding coordinate in the light field for the pixel $h^v_i$. Substituting (3.2) into (3.4) yields
$$h^v = [\oplus_{i=1}^{d} e(\theta^v_i,\phi^v_i)^T] f = E^v f, \quad (3.5)$$
where $E^v \equiv [\oplus_{i=1}^{d} e(\theta^v_i,\phi^v_i)^T]_{d\times m}$. Eq. (3.5) has an important implication: $f$ is a pose-invariant identity signature because the pose information is encoded in $E^v$. This is summarized in Proposition 3.1.

Proposition 3.1: The identity signature $f$ as derived in (3.5) is pose-invariant.

Constructing a light field is a practically difficult task. However, if only some specific poses are of interest, with each pose sampling a subset of the light field, we can focus on just the portion of the light field equivalent to the union of these subsets. Suppose that the $K$ poses of interest are $\{v_1,\ldots,v_K\}$ and the corresponding images at these poses are $\{h^{v_1},\ldots,h^{v_K}\}$, with $h^{v_k}$ expressed as in (3.4); the portion of the light field of interest is nothing but $[\oplus_{k=1}^{K} [\oplus_{i=1}^{d} L(\theta^{v_k}_i,\phi^{v_k}_i)]]$, a "long" $Kd \times 1$ vector obtained by stacking all the images at all these poses. The introduction of such a "long" vector eases our computation: (i) if we are interested in a particular view $v$, we simply take out the rows corresponding to that view; (ii) in this context, computing the "Eigen" light field is equivalent to performing PCA on the ensemble consisting of a collection of such "long" vectors.

The concept of light field was introduced in the computer graphics literature [143]. A strict assumption is that the scene be static. While characterizing the appearances of one object at given views using the light field concept is legitimate, generalizing this to many objects is questionable since the light fields belonging to different objects are not in correspondence, i.e., they are not shape-free in the terminology of [49, 76]. The mismatch in correspondence arises from differences in head sizes, locations in the world coordinate system of different objects, and so on. Typically, correspondences between different objects are established by performing face normalization or registration. Unfortunately, the normalization step violates the static-scene requirement of light field theory. On the other hand, as argued in [49, 76], since the shape-free appearance is amenable to linear analysis, we can pursue PCA on the shape-free vector $L$, similar to the "Eigen" light field approach [69]. This point is illustrated in [71]. Following [71], we also use the term light field in a loose sense.

3.1.3 Illumination- and pose-invariant identity signature

As mentioned earlier and in [143], the underlying assumption behind the light field concept is one of fixed illumination. We now consider the light fields formed under varying illumination, i.e., illuminating the light field. Clearly, the light field under a fixed illumination $s$, $L^s(\theta,\phi)$, follows the Lambertian reflectance model:
$$L^s(\theta,\phi) = t(\theta,\phi)^T s, \quad (3.6)$$
where $t(\theta,\phi)$ is the product of the albedo and the surface normal at the proper pixel and does not depend on $s$. Combining (3.1) and (3.6) yields the "Eigen" light field $e^s_i(\theta,\phi)$ under the illumination $s$ as
$$e^s_i(\theta,\phi) = \sum_n a_{in} t_n(\theta,\phi)^T s = t_{e_i}(\theta,\phi)^T s, \quad (3.7)$$
where $t_{e_i}(\theta,\phi) \equiv \sum_n a_{in} t_n(\theta,\phi)$. Eq. (3.2) then becomes
$$L^s(\theta,\phi) = [\oplus_{i=1}^{m} t_{e_i}(\theta,\phi)^T s]^T f = W(\theta,\phi)(f \otimes s), \quad (3.8)$$
where
$W(\theta,\phi) \equiv [\oplus_{i=1}^{m} t_{e_i}(\theta,\phi)^T]_{1\times 3m}$ does not depend on $s$. This leads to a two-factor analysis [138, 187]. A pixel $h^v_s$ under a pose $v$ and an illumination $s$ is a point sample of the light field $L^s(\theta,\phi)$ at coordinate $(\theta^v,\phi^v)$, i.e.,
$$h^v_s = L^s(\theta^v,\phi^v) = W(\theta^v,\phi^v)(f \otimes s), \quad (3.9)$$
and an image $h^v_s$ under the pose $v$ and illumination $s$, which traces a set of $d$ samples of the light field under illumination $s$, is
$$h^v_s = [\oplus_{i=1}^{d} h^v_{s,i}] = [\oplus_{i=1}^{d} W(\theta^v_i,\phi^v_i)](f \otimes s) = W^v(\theta,\phi)(f \otimes s), \quad (3.10)$$
where $W^v(\theta,\phi) \equiv [\oplus_{i=1}^{d} W(\theta^v_i,\phi^v_i)]_{d\times 3m}$. Eq. (3.10) has an important implication: the coefficient vector $f$ provides an identity signature invariant to both pose and illumination because the pose is absorbed in $W^v(\theta,\phi)$ and the illumination is absorbed in $s$.

Proposition 3.2: The identity signature $f$ as derived in (3.10) is illumination- and pose-invariant.

The remaining questions are how to learn the basis matrix $W(\theta,\phi)$ from a given training ensemble and how to compute the blending coefficient vector $f$, as well as $s$, for an arbitrary image $h^v_s$. The next section presents the algorithms in detail.

3.1.4 Learning algorithms

Learning the basis matrix $W(\theta,\phi)$

Suppose that the training ensemble is given as $\{L^s_n(\theta,\phi);\ n = 1,\ldots,N,\ s = 1,\ldots,S\}$, where $L^s_n(\theta,\phi)$ is the light field of the $n$th training object under illumination $s$ (a $Kd \times 1$ vector as explained in Section 3.1.2). Learning $W(\theta,\phi)$ (a $Kd \times mr$ matrix, where $m$ is the rank for the identity and $r$ is the rank for the illumination) from the training ensemble is detailed in [138] and further extended in [187] by imposing the integrability constraint. The main difference between [138] and [187] is the following: in [138], the recovered $W(\theta,\phi)$ minimizes the approximation error in the mean-square sense and does not necessarily satisfy the integrability constraint; in other words, the hypothetical base objects in $W(\theta,\phi)$ are not integrable. In [187], the recovered $W(\theta,\phi)$ minimizes the above approximation error as well as a cost function invoked by violating the integrability constraint. As a consequence, [138] can only process an image ensemble consisting of different objects under the same set of illuminations (e.g., the case considered here), while [187] can process an image ensemble consisting of different objects under completely different illuminations. Here, we follow the approach in [138] to derive $W(\theta,\phi)$ for simplicity. The basic underlying principle is a two-fold SVD algorithm, reviewed below.

The following two matrices (A-type and B-type) are first constructed by grouping the "long" vectors $\{L^s_n(\theta,\phi);\ n = 1,\ldots,N,\ s = 1,\ldots,S\}$ in two ways:
$$A = [\oplus_{n=1}^{N} [\oplus_{s=1}^{S} L^s_n(\theta,\phi)]], \qquad B = [\oplus_{s=1}^{S} [\oplus_{n=1}^{N} L^s_n(\theta,\phi)]], \quad (3.11)$$
where $A$ is a $KNd \times S$ matrix whose rows stack together the light fields of different identities under the same illumination and whose columns correspond to different illuminations, and $B$ is a $KSd \times N$ matrix whose rows stack together the light fields under different illuminations for the same identity and whose columns correspond to different identities. It is obvious that we can convert from an A-type matrix to a B-type matrix and vice versa. We perform the SVD of the $A$ matrix as $A = U_A D_A V_A^T$ and keep the top $r$ rows of the column basis $V_A^T$ for the illumination, denoted by $S$. We do the same for the $B$ matrix and keep the top $m$ rows of the column basis $V_B^T$ for the identities, denoted by $F$.
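As a concrete illustration, the following is a minimal NumPy sketch of the two-fold SVD just described; the thesis reports a Matlab implementation, so this Python version, the (N, S, D) input layout, and all variable names are illustrative assumptions rather than the author's code. It already uses the small Gram matrices $A^TA$ and $B^TB$ and the regrouping of $W'$ into $W$ that the next paragraphs explain.

import numpy as np

def learn_basis(Lf, m, r):
    # Lf: light-field vectors of N training identities under S illuminations,
    #     arranged as an array of shape (N, S, D) with D = K*d (layout assumed here).
    # m : identity rank, r : illumination rank.
    N, S, D = Lf.shape

    # A-type matrix (KNd x S): column s stacks all identities under illumination s.
    A = Lf.transpose(0, 2, 1).reshape(N * D, S)
    # B-type matrix (KSd x N): column n stacks all illuminations of identity n.
    B = Lf.transpose(1, 2, 0).reshape(S * D, N)

    # Right singular vectors via the small Gram matrices A^T A (S x S) and B^T B (N x N).
    wA, VA = np.linalg.eigh(A.T @ A)
    S_basis = VA[:, np.argsort(wA)[::-1][:r]].T      # top-r rows of V_A^T
    wB, VB = np.linalg.eigh(B.T @ B)
    F_basis = VB[:, np.argsort(wB)[::-1][:m]].T      # top-m rows of V_B^T

    # A' = A S^T, regroup A' (still A-type) into the B-type matrix B', then W' = B' F^T.
    Ap = A @ S_basis.T                               # KNd x r
    Bp = Ap.reshape(N, D, r).transpose(2, 1, 0).reshape(r * D, N)
    Wp = Bp @ F_basis.T                              # Krd x m

    # Group W' into the Kd x (m*r) basis W; column (i*r + j) holds W_ij (0-based indexing).
    W = Wp.reshape(r, D, m).transpose(1, 2, 0).reshape(D, m * r)
    return W, F_basis, S_basis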
Direct SVD of the $A$ and $B$ matrices is numerically inefficient or even prohibitive since they are extremely "tall". Also, it is unnecessary to compute $U$ and $D$, as we are interested only in the $V$ part of the SVD result. For computational savings, we observe that $V_A$ encodes the eigenvectors of $A^TA = V_A D_A^2 V_A^T$. Since the size of $A^TA$ is only $S \times S$, computing its eigenvalues is numerically stable. Therefore, we simply first compute $A^TA$ and then perform its eigendecomposition to find $V_A$. Similarly, we can compute $V_B$. We now have the matrices $S$ and $F$ at our disposal. To find $W(\theta,\phi)$, we first compute $A' = AS^T$, where $A'$ is a $KNd \times r$ matrix. Notice that $A'$ is still an A-type matrix, so we can convert $A'$ to a B-type matrix $B'$ following the strategy described in (3.11), where $B'$ is a $Krd \times N$ matrix. Then we compute $W' = B'F^T$, where $W'$ is a $Krd \times m$ matrix. The rest is to group $W'$ to form a $Kd \times mr$ matrix $W$.

Recovering the blending coefficient vector $f$ from an image

Given $W(\theta,\phi) = [\oplus_{i=1}^{m} [\oplus_{j=1}^{r} W_{ij}(\theta,\phi)]]_{Kd \times mr}$, where $W_{ij}(\theta,\phi)$ denotes the $((i-1)r+j)$th column of the $W(\theta,\phi)$ matrix, computing $f$ and $s$ for an arbitrary image $h^v_s$ utilizes (3.10) iteratively [187]. Notice that we need only the portion of $W(\theta,\phi)$ corresponding to the pose $v$, denoted by $W^v(\theta,\phi) = [\oplus_{i=1}^{m} [\oplus_{j=1}^{r} W^v_{ij}(\theta,\phi)]]_{d \times mr}$. If $f$ is fixed, (3.10) is linear in $s$ and its least-squares (LS) solution is
$$s = [\oplus_{j=1}^{r} ([\oplus_{i=1}^{m} W^v_{ij}(\theta,\phi)] f)]^{\dagger} h^v_s, \quad (3.12)$$
where $[\,\cdot\,]^{\dagger}$ denotes the matrix pseudo-inverse; if $s$ is fixed, (3.10) is linear in $f$ and its LS solution is
$$f = \begin{bmatrix} [\oplus_{i=1}^{m} ([\oplus_{j=1}^{r} W^v_{ij}(\theta,\phi)] s)] \\ \mathbf{1}^T \end{bmatrix}^{\dagger} \begin{bmatrix} h^v_s \\ 1 \end{bmatrix}, \quad (3.13)$$
where $\mathbf{1}$ is a vector of 1's. To obtain (3.13), we also impose $f^T\mathbf{1} = 1$ to normalize the solution to the same range, which facilitates the recognition task. We iterate this process until convergence. Meanwhile, we can also take into account the pixels in shadows as in [187].

Recovering the blending coefficient vector $f$ from a group of images

This iterative algorithm can easily be modified to handle a group of $Q$ images $\{h^{v_1}_{s_1},\ldots,h^{v_Q}_{s_Q}\}$ having the same $f$ but different $s$'s, since multiple equations like (3.10) can be formulated. To be specific, we have the following iterative equations:
$$s_q = [\oplus_{j=1}^{r} ([\oplus_{i=1}^{m} W^{v_q}_{ij}(\theta,\phi)] f)]^{\dagger} h^{v_q}_{s_q}; \quad q = 1,2,\ldots,Q, \quad (3.14)$$
$$f = \begin{bmatrix} [\oplus_{q=1}^{Q} [\oplus_{i=1}^{m} ([\oplus_{j=1}^{r} W^{v_q}_{ij}(\theta,\phi)] s_q)]] \\ \mathbf{1}^T \end{bmatrix}^{\dagger} \begin{bmatrix} [\oplus_{q=1}^{Q} h^{v_q}_{s_q}] \\ 1 \end{bmatrix}. \quad (3.15)$$
In practice, using a group of images yields a robust estimate of $f$. The presence of shadow pixels affects the learning algorithm; handling shadows can be performed in the same fashion as in Chapter 2.

3.2 Face Recognition across Illumination and Poses

3.2.1 PIE database and recognition setting

We use the "illum" subset of the PIE database [75] in our experiments. This subset has 68 subjects under 21 illuminations and 13 poses. Out of the 21 illumination configurations, we select 12, denoted by F = {f16, f15, f13, f21, f12, f11, f08, f06, f10, f18, f04, f02} as in [70], which typically span the set of variations. Out of the 13 poses, we select 9, denoted by C = {c22, c02, c37, c05, c27, c29, c11, c14, c34}, which cover from the left profile to the right profile. In total, we have 68*12*9 = 7344 images. Figure 3.2 displays one PIE object under illumination and pose variations. Registration is performed by aligning the eyes and mouth to desired positions. No flow computation is carried out for further alignment. After the pre-processing step, each face image is of size 48 by 40, i.e., d = 1920. Also, we use only gray-scale images, obtained by averaging the red, green, and blue channels of the color versions.
We believe that our recognition rates could be boosted by using color images and finer registration. Figure 3.2 shows some examples of the face images actually used in recognition.

We randomly divide the 68 subjects into two parts. The first 34 subjects are used in the training set and the remaining 34 subjects are used in the gallery and probe sets. It is guaranteed that there is no identity overlap between the training set and the gallery and probe sets. To form the light field, we use images at all available poses. Since the illumination model has generalization capability, we can select as few as 3 illuminations for the training set. In our experiments, the training set includes only 9 selected illuminations to cover the second-order harmonic components [95]. Notice that this is not possible in the Fisher light field approach [70], which exhausts all illumination configurations.

Figure 3.2: Examples of the face images of one PIE object (used in the testing stage) under the selected illuminations and poses.

The images belonging to the remaining 34 subjects are used in the gallery and probe sets. The construction of the gallery and probe sets conforms to the following two scenarios:

(A) We use all 34 images under one illumination $s_g$ and one pose $v_g$ to form a gallery set, and those under another illumination $s_p$ and another pose $v_p$ to form a probe set. There are three cases of interest: same pose but different illumination, different pose but same illumination, and different pose and different illumination. We mainly concentrate on the third case, with $s_p \neq s_g$ and $v_p \neq v_g$. Also, our approach reduces to the "Eigen" light field approach [69] if $s_p = s_g$ and to the generalized photometric stereo approach [187] if $v_p = v_g$. Thus, we have $(9 \times 12)^2 - (9 \times 12) = 11{,}556$ tests, with each test giving rise to a recognition score.

(B) We divide C into three sets: C1 = {c22, c02, c37} (left-profile views), C2 = {c05, c27, c29} (frontal views), and C3 = {c11, c14, c34} (right-profile views), and F into three sets: F1 = {f16, f15, f13, f21} (left lights), F2 = {f12, f11, f08, f06} (frontal lights), and F3 = {f10, f18, f04, f02} (right lights). For each of the thirty-four subjects, the gallery set contains all twelve images under the illuminations in Fg and the poses in Cg, and the probe set contains all twelve images under the illuminations in Fp and the poses in Cp. We make sure that (Cp, Fp) $\neq$ (Cg, Fg). Thus, we have $(3 \times 3)^2 - (3 \times 3) = 72$ tests in this scenario, which has no counterpart in the Fisher light field [70].

To make the recognition more difficult, we assume that the lighting conditions for the training, gallery, and probe sets are completely unknown when recovering the identity signatures. The testing strategy is similar to that described in Chapter 2 and is summarized below; a sketch of steps 2 and 3 follows the list.

1. Learn W from the training set using the bilinear learning algorithm [138, 204]. Figure 3.3 shows the W matrix obtained using the training set.

2. With W given, learn the identity signatures f (as well as the s's) for all gallery and probe elements (an element is an image in Scenario A and a group of images in Scenario B) using the iterative algorithms in Section 3.1.4. Learning f and s from a single image takes about 1-2 seconds in a Matlab implementation. Figure 3.4 shows the reconstructed images using the learned f and s.

3. Perform recognition using the nearest correlation coefficient.
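As a rough illustration of steps 2 and 3, the sketch below alternates the least-squares updates of Eqs. (3.12)-(3.13) and then matches signatures by the correlation coefficient. This is a minimal NumPy sketch under stated assumptions (a fixed iteration count instead of a convergence test, a uniform initialization of f, and the column grouping in which W_ij is column (i-1)r+j); the thesis used a Matlab implementation, so names and layout here are illustrative.

import numpy as np

def recover_signature(h, Wv, m, r, n_iter=20):
    # h  : one image at pose v, flattened to a d-vector.
    # Wv : the d x (m*r) rows of W for pose v; column (i*r + j) holds W_ij (0-based).
    d = h.size
    W3 = Wv.reshape(d, m, r)                  # W3[:, i, j] = W_ij at pose v
    f = np.full(m, 1.0 / m)                   # uniform start, consistent with f^T 1 = 1
    s = None
    for _ in range(n_iter):                   # fixed iteration count for simplicity
        # Eq. (3.12): with f fixed, the d x r design matrix has columns sum_i f_i W_ij.
        X = np.einsum('dmr,m->dr', W3, f)
        s = np.linalg.lstsq(X, h, rcond=None)[0]
        # Eq. (3.13): with s fixed, columns sum_j s_j W_ij, plus a row of ones for f^T 1 = 1.
        Y = np.einsum('dmr,r->dm', W3, s)
        f = np.linalg.lstsq(np.vstack([Y, np.ones((1, m))]),
                            np.append(h, 1.0), rcond=None)[0]
    return f, s

def nearest_correlation(f_probe, F_gallery):
    # Step 3: return the gallery index with the largest correlation coefficient.
    fp = (f_probe - f_probe.mean()) / f_probe.std()
    Fg = (F_gallery - F_gallery.mean(1, keepdims=True)) / F_gallery.std(1, keepdims=True)
    return int(np.argmax(Fg @ fp))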
Table 3.1: Recognition rates (%) for all the probe sets with a fixed gallery set (c27, f11). Rows index the probe camera (pose) and columns the probe flash (illumination).

        f16  f15  f13  f21  f12  f11  f08  f06  f10  f18  f04  f02  Average
c22      56   41   62   68   71   71   53   65   41   44   38   21   52
c02      71   76   76   91   88   94   94   94   85   71   50   32   77
c37      79   82   82   94   94   97   94   94   76   65   65   50   81
c05      68   85   97  100  100   97   97   97   91   82   71   44   86
c27      94  100  100  100  100    -  100  100  100   97   94   76   97
c29      74   82   91  100  100  100   97   97   94   91   88   65   90
c11      50   53   68   79   85   97   97   88   79   82   71   62   76
c14      15   24   44   71   76   82   74   82   82   74   79   56   63
c34      18   18   47   50   56   65   62   56   44   44   41   38   45
Average  58   62   74   84   86   88   85   86   77   72   66   49   74

3.2.2 Recognition performance

Scenario A

Table 3.1 shows the recognition results for all probe sets with a fixed gallery set (c27, f11), whose gallery images are in a frontal pose and under a frontal illumination. Using this table we compare the three cases. The case of same pose but different illumination has an average rate of 97% (i.e., the average of all 11 cells in the row c27), the case of different pose but same illumination has an average rate of 88% (i.e., the average of all 8 cells in the column f11), and the case of different pose and different illumination has an average rate of 70% (i.e., the average of all 88 cells excluding the row c27 and the column f11). This shows that illumination variation is easier to handle than pose variation, and that variations in both pose and illumination are the most difficult to deal with.

Table 3.2: Average recognition rates (%) for all the gallery sets. Rows index the gallery camera $v_g$ and columns the gallery flash $s_g$. For each cell, say the gallery set at $(v_g = c27, s_g = f12)$, the average rate is taken over all probe sets $(v_p, s_p)$ with $v_p \neq v_g$ and $s_p \neq s_g$. For example, the average rate for (c27, f11) is the average of the rates in Table 3.1 excluding the row c27 and the column f11.

        f16  f15  f13  f21  f12  f11  f08  f06  f10  f18  f04  f02  Average
c22      44   44   46   45   46   49   46   49   44   32   30   14   41
c02      55   58   59   62   63   62   60   60   54   48   40   22   54
c37      56   59   61   64   65   62   60   58   51   47   45   34   55
c05      56   63   66   67   68   65   59   58   54   51   45   36   57
c27      62   66   69   70   70   70   65   69   68   67   65   54   66
c29      46   53   53   61   60   63   59   62   66   68   62   60   60
c11      41   43   50   53   55   61   57   58   56   61   58   51   54
c14      19   24   39   49   53   58   58   61   60   61   57   48   49
c34      16   21   38   44   46   51   48   51   46   45   45   42   41
Average  44   48   53   57   59   60   57   59   56   53   50   40   53

We now focus on the case of different pose and different illumination. For each gallery set, we average the recognition scores of all the probe sets whose pose and illumination both differ from those of the gallery set. Table 3.2 shows the average recognition rates for all the gallery sets. As an interesting comparison, the "grand" average is 53% (the last cell in Table 3.2), while that of the Fisher light field approach [70] is 36%. In general, when the poses and illuminations of the gallery and probe sets become far apart, the recognition rates decrease. The best gallery sets for recognition are those in frontal poses and under frontal illumination, and the worst gallery sets are those in profile views and under off-frontal illumination. As shown in Figures 1.5 and 3.2, the worst gallery sets consist of face images that are almost invisible (see, for example, the images (c22, f02), (c34, f16), etc.), on which recognition can hardly be performed. Figure 3.5 presents the curves of the average recognition rates (i.e., the last columns and last rows of Tables 3.1 and 3.2) across poses and illuminations.
Clearly, the effect of illumination variation is not as strong as that of pose variation, in the sense that the curves of average recognition rates across illuminations are flatter than those across poses. Figure 3.5 also shows the curves of the average recognition rates obtained based on the top 3 and top 5 matches. Using more matches increases the recognition rates significantly, which demonstrates the effectiveness of our recognition scheme. For comparison, Figure 3.5 also plots the average rates obtained using the baseline PCA. These rates are well below ours; the "grand" average is below 10% if the top 1 match is used.

Figure 3.3: The first nine columns of the learned W matrix.

Figure 3.4: The reconstruction results of the object in Figure 3.2. Notice that only the f's and s's for the row c27 are used for reconstructing all the images.

Figure 3.5: The average recognition rates across illumination (the top row) and across poses (the bottom row) for three cases. Case (a) shows the average recognition rate (averaging over all illuminations/poses and all gallery sets) obtained by the proposed algorithm using the top n matches. Case (b) shows the average recognition rate (averaging over all illuminations/poses for the gallery set (c27, f11) only) obtained by the proposed algorithm using the top n matches. Case (c) shows the average recognition rate (averaging over all illuminations/poses and all gallery sets) obtained by the "Eigenface" algorithm using the top n matches.

Scenario B

This test scenario is designed for face recognition based on a group of images, which can be under different poses and different illuminations. Table 3.3 lists the recognition rates, which are much higher than those in Tables 3.1 and 3.2. Also, similar observations can be made regarding the effects of illumination and pose variations.

Table 3.3: The recognition rates (%) for test scenario B. Rows index the probe set and columns the gallery set.

Probe   C1F1  C1F2  C1F3  C2F1  C2F2  C2F3  C3F1  C3F2  C3F3  Average
C1F1       -   100    85   100    94    82    62    85    94   88
C1F2     100     -   100   100   100    85    71    82    94   92
C1F3      85    97     -    88    88    91    76    62    65   82
C2F1      97    94    71     -   100    85    71    85    76   85
C2F2      97   100    85   100     -   100    76    91    85   92
C2F3      79    82    76    97   100     -    74    88    91   86
C3F1      59    59    68    85    76    71     -   100    82   75
C3F2      74    85    62    91    94    82   100     -   100   86
C3F3      88    82    62    79    79    94    85   100     -   84
Average   85    88    76    93    92    86    77    87    86   85

3.2.3 Comparisons

Comparison with the Fisher light field

It is interesting to compare the proposed approach with the Fisher light field [70], since both handle pose variation in a similar fashion. The main difference lies in handling the illumination variation: our approach uses the Lambertian model, while [70] uses Fisher discriminant analysis. Therefore, our approach can
generalize to novel illuminations, while [70] does not have such a generalization capability. Also, as shown in Section 3.2, the proposed approach leads to a new recognition scenario that is not available in [70].

Comparison with the 3D morphable model

The 3D morphable model (3DMM) [66] is the state-of-the-art approach for identifying faces across illuminations and poses. The proposed approach differs from the 3DMM approach mainly as follows:

• Model-based vs. image-based. The 3DMM approach requires prior 3D models, while the proposed approach is image-based and needs only 2D images. Linear assumptions are used in both approaches. The operating units in the 3DMM approach are 3D depth and texture, respectively, and two independent linear models are assumed for the two units. The operating unit in the proposed approach is the product of the albedo and the surface normal, and a single linear model is assumed. As in the 3DMM approach, it seems that the dimensionality of the proposed model could be "decomposed" as the product (or the sum) of the dimensionality of the surface normals and that of the albedo field. However, empirical analysis [202] shows that such a decomposition is not necessary and might overfit the problem, thereby indicating that a subspace of rather low dimensionality can be used.

• Handling illumination. The Lambertian model is used in the proposed algorithm, and pixels in shadows and specular reflection regions are inferred and excluded from consideration. The 3DMM approach uses the standard Phong model to directly model diffuse and specular reflection on the face surface. The 3DMM also takes into account inputs illuminated by colored lights using a color transformation, while the proposed approach only processes inputs illuminated by white lights.

• Handling pose. The 3DMM approach can handle images at any pose, while the current implementation of the proposed approach handles images sampled from a given set of poses. In order to handle an arbitrary pose other than those in the given set, the system should incorporate a tool to render novel poses from the given ones, which is left for future work. In the proposed approach, pixels at different poses might correspond to the same point in the physical 3D model; in the 3DMM approach, one point is represented only once for all poses since the 3D model is used.

• Experiments. Both the 3DMM and the proposed approach conducted experiments on the PIE database, but different portions of the database are used. The 3DMM approach worked on the "lights" part, where an ambient light source is always present; the proposed approach worked on the "illum" part, with no ambient light source. As a consequence, some images appear almost dark (refer to Figure 3.2), and there is little hope of performing correct recognition on these extreme images, which explains the relatively low recognition rates compared with those produced by the 3DMM approach.

In terms of computational complexity, the proposed algorithm is more efficient than the 3DMM approach. The proposed fitting algorithm, which takes 1-2 seconds to process one input image in a Matlab implementation, is simply linear (rather than bilinear) and has a unique minimum, while the 3DMM approach, which takes 4.5 minutes to process one input image, invokes a gradient descent algorithm that does not guarantee a global minimum. Also, the proposed algorithm is able to handle face images of very small size: in the reported experiments, gray-level images are normalized to a size of 48×40.
The size of color images used in the 3DMM approach is unclear, but typically much larger. 80 Part II: Face Recognition via Kernel Learning 81 Chapter 4 Probabilistic Kernel Principal Component Analysis Principal component analysis [12] is one of the most popular statistical data analy- sis techniques with applications in numerous areas such as data compression, image processing, computer vision, and pattern recognition, to name a few. However, the PCA has two disadvantages: (i) it lacks a probabilistic model structure which is important in many contexts such as mixture modeling and Bayesian decision (also see [167]); and (ii) it restricts itself to a linear setting, where high-order statistical information is discarded [181]. Probabilistic principal component analysis (PPCA) proposed by Tipping and Bishop [167, 168] overcomes the first disadvantage. By letting the noise com- ponent possess an isotropic structure, the PCA is implicitly embedded in a pa- rameter learning stage for this model using the maximum likelihood estimation (MLE) method. An efficient expectation/maximization (EM) algorithm [152] is also developed to iteratively learn the parameters. Kernel principal component analysis (KPCA) proposed by Sch?olkopf, Smola 82 and M?uller [181] overcomes the second disadvantage by using the so-called ?kernel trick?. The essential idea of the KPCA is to avoid the direct evaluation of the required dot product in a high-dimensional feature space using the kernel function. The feature space is called reproducing kernel Hilbert space (RKHS). Hence, no explicit nonlinear mapping function projecting the data from the original space to the feature space is needed. Since a nonlinear function is used, albeit in an implicit fashion, high-order statistical information is captured. See [179] for a recent survey on the kernel space and application on discovering pre-image and denoised pattern in the original space. We propose an approach to analyze kernel principal components in a probabilis- tic manner. It naturally unifies PPCA and KPCA in one treatment to overcome the both disadvantages of PCA. We call it the probabilistic kernel principal com- ponent analysis (PKPCA). In this chapter, we present our development of the PKPCA approach by treating the KPCA as a special case of PCA where the num- ber of samples is smaller than the data dimension. One speciality of KPCA is the data centering issue, which is also taken into account in Section 4.2. While the kernel part retains the nonlinear modeling power, resulting in a smaller reconstruction error, the additional probabilistic structure offers us (i) a mixture modeling capacity of PKPCA, and (ii) an efficient classification scheme. Mixture of PKPCA is derived to model the nonlinear structure containing non- linear substructures in a systematic way. Mixture of PKPCA nontrivially extends to the feature space induced by the kernel function, the theory of mixture of PPCA proposed by Tipping and Bishop [167, 168]. An EM algorithm [152] is also devel- oped to iteratively but efficiently learn the parameters of interest. We also show how to compute two important quantities, namely the reconstruction error and 83 the Mahalanobis distance. Our analysis can be easily incorporated for a classification task. 
Our performance is competitive with that of mainstream kernel classifiers such as the support vector machine (SVM) and the kernel Fisher discriminant (KFD) classifier, but our analysis provides a more regularized approximation to the data structure.

Chapter organization

Section 4.1 briefly reviews the essentials of RKHS. Section 4.2 presents how to compute the kernel principal components and how to analyze these components in a probabilistic manner. Section 4.3 presents the mixture of PKPCA, and Section 4.4 presents classification results on synthetic data and in a face recognition application.

Two examples

Figure 4.1 shows two examples of nonlinear data structures to be modeled, i.e., structures that are badly approximated if conventional linear modeling techniques such as linear PCA are used. Figure 4.1(a) presents the first example: a C-shaped structure in the foreground. In the context of data modeling, we consider only the foreground and assume a uniform distribution within the C-shaped region and zero outside. Figure 4.1(b) displays 200 sample points drawn from this density. In the context of pattern classification, we consider both the foreground and the background and further assume that the background class possesses a uniform distribution outside the C-shaped region and zero inside. Figure 4.1(c) shows the samples for the background class. Figure 4.1(d) shows the second example, where the foreground nonlinear data structure consists of two C-shaped substructures. Figures 4.1(e) and 4.1(f) present the drawn samples for the foreground and background classes, respectively. We mainly use this example for mixture modeling.

Figure 4.1: Two nonlinear data structures (a)(d) and their drawn samples (of size 200) for the foreground class (b)(e) and the background class (c)(f).

4.1 Reproducing Kernel Hilbert Space (RKHS)

We illustrate the principle of the RKHS by drawing an analogy between the RKHS, a functional space, and a regular vector space $R^d$. We start with a $d \times d$ positive definite matrix $T = [t_i(j)]$, where $t_i(j)$ is its $(i,j)$th element. By denoting the $i$th column by $t_i = [t_i(1),\ldots,t_i(d)]^T$, we have $T = [t_1,t_2,\ldots,t_d]$. The eigendecomposition of $T$ is given as
$$T = \sum_{n=1}^{d} \lambda_n \phi_n \phi_n^T; \quad \lambda_n > 0,$$
where the $(\lambda_n,\phi_n)$'s are eigenpairs. We define an inner product between two elements $a$ and $b$ in $R^d$ as
$$\langle a,b \rangle \equiv a^T T^{-1} b = \sum_{n=1}^{d} \lambda_n^{-1} a^T \phi_n \phi_n^T b = \sum_{n=1}^{d} \lambda_n^{-1} (a,\phi_n)(b,\phi_n),$$
where $(u,v) \equiv u^T v$. Suppose that $g = [g(1),g(2),\ldots,g(d)]^T \in R^d$ and that the identity matrix $I_d$ is written as $I_d = [e_1,e_2,\ldots,e_d]$, where $e_i$ is the $i$th column of $I_d$. The inner product $\langle \cdot,\cdot \rangle$ possesses two important properties:
$$P1: \ \langle t_i,t_j \rangle = t_i^T T^{-1} t_j = t_i^T e_j = t_i(j), \qquad P2: \ \langle t_i,g \rangle = t_i^T T^{-1} g = e_i^T g = g(i).$$
The RKHS, denoted by $H$, can be heuristically thought of as an $f$-dimensional "vector space" $R^f$ ($f$ might be finite or infinite) associated with a positive kernel function $k_x(y) = k(x,y)$. The existence of such kernel functions is guaranteed by Mercer's Theorem [176], and the eigensystem of $k(x,y)$ is given as
$$k(x,y) = \sum_{n=1}^{f} \lambda_n \phi_n(x)\phi_n(y); \quad \lambda_n > 0; \quad \sum_{n=1}^{f} \lambda_n^2 < \infty.$$
(4.1) Similarly, the inner product is defined as, with a(x), b(x), and g(x) in H, < a,b >H? fsummationdisplay n=1 ??1n (a,?n)(b,?n), where (u,v) ?integraltextxu(x)v(x)dx. Furthermore, the two properties known as reproduc- ing properties hold too. P1 : < kx,ky >H= kx(y), P2 : < kx,g >H= g(x). 86 An alternative perspective to view Eq. (4.1) is to consider a hypothetical nonlinear mapping ? : Rd ?Rf defined as ?(x) = [?1/21 (x,?1),...,?1/2f (x,?f)]T. It is easy to verify that ?(x)T?(y) = k(x,y) =< kx,ky >H . Thus evaluating the dot product can be easily done by computing k(x,y) which usually takes a parametric form. This is so-called ?kernel trick?, which plays an essential role in many kernel methods, such as SVM [19] and KPCA [181], kernel Fisher discriminant analysis [177, 172], and kernel independent component analysis [170]. In this chapter, we also adopt this viewpoint. There are a lot of ways to construct a kernel function: see [17] for a list. One example of k(x,y) is the radial basis function (RBF) kernel which is widely studied in the literature and the focus of this chapter. It is defined as k(x,y) = exp(? 12?2bardblx?ybardbl2) ?x,y ?Rd, where ? controls the kernel width. This is an infinite-dimensional RKHS, i.e., f = ?. The RBF kernel is a special example of translation-invariant kernels of the form k(x,y) = k(x?y) whose characteristics can be easily described using Fourier theory [173]. In particular, the functions in the RKHS exhibit smoothness since their Fourier transforms decay rapidly. 87 4.2 ProbabilisticAnalysisofKernelPrincipalCom- ponents 4.2.1 Kernel principal component analysis Suppose that {x1,x2,...,xN} are the given training samples in the original data space Rd. KPCA operates in a feature space that is in fact a RKHS Hk induced by a kernel function k. There exists a hypothetical nonlinear mapping function ? : Rd ? Rf, where f > d and f could even be infinite. The training samples in Rf are denoted by ?f?N = [?1,?2,...,?N], where ?n ? ?(xn) ? Rf. Denote the sample mean in the feature space as ??0 ? 1 N Nsummationdisplay n=1 ?(xn) = ?e, (4.2) where eN?1 = N?11. The f ?f covariance matrix in the feature space denoted by ? is given as ? ? 1N Nsummationdisplay n=1 (?n ? ??0)(?n ? ??0)T = ?JJT?T = ??T, (4.3) where J ? N?1/2(IN ?e1T), ? ? ?J. KPCA performs eigen-decomposition of the covariance matrix ? in the feature space. Due to the high dimensionality of the feature space, we often have insuffi- cient number of samples, i.e., the rank of the ? matrix is maximally N instead of f. However, computing the eigensystem is still possible using the method presented in [47, 62]. The explicit knowledge of the nonlinear feature mapping can be avoided using the ?kernel trick? as in Section 4.1. Define ?K ? ?T? = JT?T?J = JTKJ, (4.4) 88 where K ? ?T? is the Gram matrix or the dot product matrix. The (i,j)th entry of the Gram matrix K can be calculated as follows: Kij = ?(xi)T?(xj) = k(xi,xj). As in Appendix 4.I and [47, 62], the eigensystem for ? can be derived from ?K. Suppose that the top r eigenpairs for ?K are {(?n,vn)}qn=1, where ?n?s are sorted in a non-increasing order, and the r top eigenpairs for ? are {(?n,un)}qn=1, then we can compute un as un = (?n)?1/2?vn. In a matrix form (if only the top q eigenvectors are retained), Uq ? [u1,u2,...,uq] = ?Vq??1/2q = ?JVq??1/2q , (4.5) where Vq ? [v1,v2,...,vq] and ?q ? D[?1,?2,...,?1], a diagonal matrix whose diag- onal elements are {?1,?2,...,?q}. 
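To make Eqs. (4.2)-(4.5) concrete, here is a minimal NumPy sketch of KPCA with the RBF kernel discussed in Section 4.1. The helper names are illustrative, not the thesis implementation, and the projection routine uses the centered kernel vector c(y) - Ke that reappears later in this chapter.

import numpy as np

def rbf_kernel(X, Y, sigma):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), the RBF kernel considered in Section 4.1.
    d2 = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-d2 / (2.0 * sigma**2))

def kpca_fit(X, q, sigma):
    # Centering and eigendecomposition following Eqs. (4.2)-(4.5).
    N = X.shape[0]
    K = rbf_kernel(X, X, sigma)
    e = np.full(N, 1.0 / N)
    J = (np.eye(N) - np.outer(e, np.ones(N))) / np.sqrt(N)   # J = N^{-1/2} (I_N - e 1^T)
    K_tilde = J.T @ K @ J                                    # Eq. (4.4)
    evals, evecs = np.linalg.eigh(K_tilde)
    idx = np.argsort(evals)[::-1][:q]
    Lam_q = np.maximum(evals[idx], 1e-12)                    # guard against tiny negative round-off
    V_q = evecs[:, idx]
    return dict(X=X, sigma=sigma, K=K, e=e, J=J, Lam_q=Lam_q, V_q=V_q)

def kpca_project(model, Y):
    # U_q^T (phi(y) - mean) = Lam_q^{-1/2} V_q^T J^T (c(y) - K e), with c(y) = [k(x_i, y)].
    c = rbf_kernel(model['X'], Y, model['sigma'])
    b = c - (model['K'] @ model['e'])[:, None]
    return (np.diag(model['Lam_q'] ** -0.5) @ model['V_q'].T @ model['J'].T @ b).T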
It is clear that we are not operating in the full feature space, but in a low- dimensional subspace of it, which is spanned by the training samples. It seems that the modeling capacity is limited by subspace dimensionality, or by the number of the samples. In reality, it however turns out that even in this subspace, the smallest eigenvalues are very close to zero, which means that the full feature space can be further captured by a subspace with an even-lower dimensionality. This motivates us to use a latent model. 89 4.2.2 Theory of PKPCA Probabilistic analysis assumes that the data in the feature space follows a special factor analysis model [15] which relates an f-dimensional data ?(x) to a latent q-dimensional variable z as ?(x) = ? +Wz + epsilon1, where z ? N(0,Iq), epsilon1 ? N(0,?If), and W is a f ? q loading matrix. Therefore, ?(x) ? N(?,S), where S = WWT + ?If. Typically, we have q << N << f. As shown in [167, 168], the MLE?s for ? and W, denoted by ?? and ?W, respec- tively, are given by ?? = ??0 = 1N Nsummationdisplay n=1 ?(xn) = ?e, (4.6) ?W = Uq(?q ??Iq)1/2R, (4.7) where R is any q ? q orthogonal matrix, i.e., RTR = RRT = Iq, and Uq and ?q contain the top q eigenvectors and eigenvalues of the ? matrix. It is in this sense that our probabilistic analysis coincides with the plain KPCA. Substituting (4.5) into (4.7), we obtain the following: ?W = ?Vq??1/2q (?q ??Iq)1/2R = ?Q = ?JQ, (4.8) where the N ?q matrix Q is defined as Q ? Vq(Iq ????1q )1/2R. (4.9) Equation (4.8) has a very important implication: ?W lies in a linear subspace of ?. We name the Q matrix as empirical loading matrix since this relates the loading 90 matrix to the empirical data. Also since the matrix (Iq????1q ) in (4.9) is diagonal, additional savings in computing its square root are realized. The MLE for ?, ??, is given [167, 168] as ?? = 1f ?q{tr(S)?tr(?q)}. (4.10) Assuming that the remaining eigenvalues are zero, (this is a reasonable assumption supported by empirical evidences when f is finite), it is approximated as ?? similarequal 1f ?q{tr(K)?tr(?q)}. (4.11) But when f is infinite, this is doubtful since this always gives ?? = 0. In such a case, there is no automatic way of learning this. We temporarily set a manual choice for ??. as in [182]. However, as shown later on, we can in fact study the limiting case by letting ?? approach zero in various cases. Even when a fixed ?? is used, the optimal estimate for W (or ?W) is still the same as in (4.8). It is interesting to note that Moghaddam and Pentland [54] derived (4.10) in a different context by minimizing the Kullback-Leibler divergence distance [4, 13]. Now, the covariance matrix is estimated by ?S = ?JQQTJT?T + ??If = ?A?T + ??If, where A ? JQQTJT. This offers a regularized approximation to ? = ?JJT?T. In ridge regression [9], the form of S1 = ?JJT?T+?If (with rho a pre-specified small positive number) is used to provide a regularized approximation. This has a smoothness interpretation of the regression parameters. However, the eigenvalues of S1 always increase those of ? by an amount of ? but the eigenvectors of the S1 are the same as those of ?. 91 Although S is in a compact form and also regularized, inversion of the S1 matrix involves inverting an N ?N matrix, which is still prohibitive in real applications with a large N, whereas ?S?1 involves inverting only a r?r M matrix (defined later). This form of S1 is also used in [170, 171] for estimating the canonical correlation and [175] for constructing the Bhattacharyya kernel. 
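A compact sketch of the quantities just derived may help: the empirical loading matrix Q of Eq. (4.9) (with the arbitrary rotation R = I_q), the "reciprocal" matrix M of Eq. (4.12), and the matrix B entering the Woodbury form of the inverse covariance. The fixed noise variance is supplied manually, as discussed above for the infinite-dimensional RKHS case; function and variable names are illustrative assumptions, not the thesis code.

import numpy as np

def pkpca_fit(K, q, sigma2):
    # K      : N x N Gram matrix of the training samples.
    # q      : latent dimension; sigma2 : manually chosen noise variance
    #          (the RBF kernel gives an infinite f, so Eq. (4.11) is not used).
    N = K.shape[0]
    e = np.full(N, 1.0 / N)
    J = (np.eye(N) - np.outer(e, np.ones(N))) / np.sqrt(N)
    K_tilde = J.T @ K @ J
    evals, evecs = np.linalg.eigh(K_tilde)
    idx = np.argsort(evals)[::-1][:q]
    Lam_q = np.maximum(evals[idx], 1e-12)
    V_q = evecs[:, idx]

    # Empirical loading matrix, Eq. (4.9), with R = I_q.
    Q = V_q @ np.diag(np.sqrt(np.maximum(1.0 - sigma2 / Lam_q, 0.0)))
    # "Reciprocal" matrix, Eq. (4.12); with R = I_q it should equal diag(Lam_q).
    M = sigma2 * np.eye(q) + Q.T @ K_tilde @ Q
    # B = J Q M^{-1} Q^T J^T, the N x N matrix entering S^{-1} via the Woodbury formula.
    B = J @ Q @ np.linalg.inv(M) @ Q.T @ J.T
    return Q, M, B, e, J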
In [182] the covariance matrix ? is approximated as S2 = ?JDJT?T + ?If, where D is a diagonal matrix whose many diagonal entries empirically shown to be zero. This is not surprising as in our computation D = QQT is rank deficient. However, we do not enforce D to be diagonal. Inverting ?S is also easy by invoking the Woodbury formula [8], ?S?1 = (??If + ?W?WT)?1 = ???1(If ? ?WM?1 ?WT) = ???1(If ??B?T), where B ? JQM?1QTJT, and the matrix Mr?r can be thought of as a ?reciprocal? matrix for ?S, M ? ??Iq + ?WT ?W = ??Iq +L, (4.12) with L ? QT?KQ. Using the Q matrix in 4.9, Appendix 4.II calculates various quantities in a closed form. For example, M = RT?qR, |S| = ??(f?q)|M|. Refer to Appendix 4.II for details. From now on, we will drop the (?.) notation that denotes the MLE estimate. Whenever we mention some parameters requiring estimates, we mean the MLE values. 92 Parameter learning using EM The key for the approach developed in Section 4.2.2 is (4.8) which relates W to ? using a linear equation and the empirical loading matrix Q. This motivates us to use the EM learning algorithm to learn the Q matrix instead of the W matrix. We now present the EM algorithm for learning the parameters Q and ? in PKPCA. Assume that Q(j) and ?(j) are the estimates obtained after the jth itera- tion. The iteration proceeds as follows: Q(j+1) = ?KQ(j)(?(j)Iq +M?1Q(j)T?K2Q(j))?1, (4.13) ?(j+1) = 1ftr(?K? ?KQ(j)M(j)?1Q(j+1)T?K), (4.14) where M(j), defined in (4.9), is evaluated using Q(j). As mentioned earlier, when f is infinite, using (4.14) is not appropriate and hence a manual choice of ? is used instead. With ? fixed, Q is nothing but the solution to (4.13) and one can check that Q given in (4.9) is the solution. Computational efficiency The above EM algorithm involves only inversions of q ?q matrices and arrives at the same results (up to an orthogonal matrix R) as direct computation. However, in practice one may still use direct computation of complexity O(N3) since the complexity of computing ?K2 is O(N3). If we pre-compute ?K2, the complexity for each iteration reduces to O(qN2). Clearly, the overall computation complexity depends on the number of iterations needed for desired accuracy and the ratio of N to q. In our experiment, the EM algorithm converges to reasonable accuracy very fast, usually in less than 20 iterations. 93 Reconstruction error and Mahalanobis distance Given a vector y ? Rd, we are often interested in computing the following two quantities: 1. the reconstruction error epsilon1?(y) ? (?(y)???(y))T(?(y)???(y)) where ??(y) is the reconstructed version of ?(y); 2. the Mahalanobis distance ??(y) ? (?(y)? ??0)TS?1(?(y)? ??0). As shown in [167], the best predictor for ?(y) is ??(y) given by ??(y) = W(WTW)?1WT(?(y)? ??0)+ ??0, and ?(y)? ??(y) is given by ?(y)? ??(y) = (If ?W(WTW)?1WT)(?(y)? ??0) = ?(?(y)??0), where the f ?f matrix ? ? If ?W(WTW)?1WT is symmetric and idempotent as ?2 = ?. So, epsilon1?(y) is computed as follows: epsilon1?(y) = (?(y)? ??0)T?(?(y)? ??0) = a(y)?b(y)TCb(y), where C, ay, and by are defined by: CN?N ? JQ(QT?KQ)?1QTJT, a(y) ? (?(y)? ??0)T(?(y)? ??0) = k(y,y)?2c(y)Te +eTKe, b(y)N?1 ? ?T(?(y)? ??0) = c(y)?Ke, 94 with c(y)N?1 ? ?T?(y) = [k(x1,y),...,k(xN,y)]T. The Mahalanobis distance is calculated as follows: ??(y) = (?(y)? ??0)TS?1(?(y)? ??0) = ??1{a(y)?b(y)TBb(y)}. (4.15) Finally, an important observation is that as long as we can express ??0 and S as in (4.2) and (4.3), i.e. 
there exist e and J that relate ??0 and S to ?, we can safely use the derivations presented in this section. This lays a solid foundation for the development of the mixture of PKPCA theory. We can study a limiting behavior of ??(y) by defining ???(y) ? lim ??0???(y) = a(y)?b(y) T?Bb(y), (4.16) where ?B ? lim??0 B. Experiments on kernel modeling This part addresses the power of kernel modeling part in PKPCA in terms of the reconstruction error. The probabilistic nature of PKPCA will be illustrated in the next sections. We compare PPCA and PKPCA since the only difference between them is the kernel modeling part. We define the reconstruction error percentage ? as follows: ?(y) = epsilon1(y)yTy, ??(y) = epsilon1?(y)k(y,y), where ?(y) is for PPCA and ??(y) for PKPCA. Figure 4.2 shows the histogram of ? for the famous iris data2. This dataset consists of 150 samples and is used in pattern classification tasks. We, however, 2This is available at the UCI Machine Learning Repository. The URL is http://www.ics.uci.edu/?mlearn/MLRepository.html. 95 Algorithm PPCA PPCA PKPCA PKPCA q = 2 q = 3 q = 9 q = 15 Mean 8.23% 1.42% 3.88% 1.39% Std. dev. 13.12% 4.52% 3.86% 1.39% Table 4.1: PPCA and PKPCA reconstruction error percentage. just treat it as a whole regardless of its class labels. Since it is just 4-d data, PPCA keeps at most 3 principal component, i.e. q ? 3, while PKPCA has no such limit and can have q ? 149. Figure 4.2 and Table 4.1 show that PKPCA with q = 9, i.e. using 6% percent principal components produces a small ? than PPCA with q = 2 that uses 50% components. In addition, PKPCA with q = 15 that uses 10% percent principal components produces a small ? than PPCA with q = 3, using 75% components. A larger q produces even smaller ?. This improvement benefits from kernel modeling, which is able to capture the nonlinear structure of the data. However, PKPCA involves much more computation than PPCA. 4.3 MixtureModelingofProbabilisticKernelPrin- cipal Components 4.3.1 Theory of mixture of PKPCA Mixture of PKPCA models the data in a high-dimensional feature space using a mixture of I densities with each mixture component p(.|i) being a PKPCA density associated with an empirical loading matrix Qi that can be derived from corresponding ei and Ji (as shown below). For ?i?s, we assume ?i ? ? with ? fixed. 96 0 0.5 10 50 100 150 (a) PPCA: q=2 0 0.5 10 50 100 150 (b) PPCA: q=3 0 0.5 10 50 100 150 (c) PKPCA: q=9, ?=1.6, ?=0.001 0 0.5 10 50 100 150 (d) PKPCA: q=15, ?=1.6, ?=0.001 Figure 4.2: Histogram of ? for iris data obtained by (a) PPCA with q = 2, (b) PPCA with q = 3, (c) PKPCA with Gaussian kernel with q = 9, ? = 2 and ? = 0.001, and (d) PKPCA with Gaussian kernel with q = 15, ? = 2 and ? = 0.001. Mathematically, p(?(x)) = Isummationdisplay i=1 mip(?(x)|i) = Isummationdisplay i=1 miN(??i,Si), where mi?s are mixing probabilities summing up to 1, and p(?(x)|i) = N(??i,Si) is the PKPCA density for the ith component defined as N(??i,Si) = (2pi) ?f/2 |Si|1/2 exp{? 1 2??,i(x)} = (2pi)?f/2 ?(f?qi)/2|Mi|1/2 exp{? 1 2??,i(x)} = (2pi?)?f/2 exp{?12???,i(x)} where ??,i(x) is the Mahalanobis distance as in (4.15) with all parameters involved coming from the ith component, and ???,i(x) ? ??,i(x)+ log(|Mi|)+ qi log(??1). 97 We call ???(x) as the ?generalized? Mahalanobis distance. Parameter learning using EM We invoke the ML principle to estimate the parameters of interest, i.e., {mi,Qi}?s from the training data. 
It turns out that direct maximization is cumbersome since the log-likelihood involves summations within logarithms. The iterative EM algorithm [152, 167] is used instead. Assume that {m(j)i ,Q(j)i } are the values obtained in the jth iteration. We begin by computing the posterior responsibility rni. r(j)ni ? p(j)(i|?n) = mip (j)(?n|i) p(j)(?n) = m(j)i exp{?12??(j)?,i(x)} summationtextI l=1 m (j) l exp{? 1 2 ??(j)?,l(x)}. (4.17) There is no need to calculate rni by exactly following (4.17). One only needs to evaluate the numerator mi exp{?12???,i(x)} and perform normalization to guarantee that summationtextIi=1 rni = 1. The EM iterations compute the following quantities: m(j+1)i = 1N Nsummationdisplay n=1 r(j)ni , (4.18) ??(j+1)i = summationtextN n=1 r (j) ni ?nsummationtext N n=1 r (j) ni = Nsummationdisplay n=1 e(j)ni ?n = ?e(j)i , where e(j)i = [e(j)1i ,e(j)2i ,...,e(j)Ni]T with e(j)ni ? r (j) nisummationtext N n=1 r (j) ni . It is easy to show that the local responsibility-weighted covariance matrix for component i, Si, is obtained as S(j+1)i ? Nsummationdisplay n=1 e(j)ni (?n ? ??(j+1)i )(?n ? ??(j+1)i )T = ?J(j+1)i J(j+1)i T?T, 98 where J(j+1)i ? (IN ?e(j)i 1T) D1/2[e(j)1i ,e(j)2i ,...,e(j)Ni]. Using ?K(j+1)i = J(j+1)i TKJ(j+1)i , the updated Q(j+1)i can obtained as Q(j+1)i = V(j+1)qi,i (Iqi ???(j+1)qi,i ?1)1/2, (4.19) where ?(j+1)qi,i and V(j+1)qi,i are the top qi eigenvalues and eigenvectors of ?K(j+1)i . Also, an EM algorithm for learning the Qi matrix as shown in Section 4.2.2 can be used instead of direct computation. The above derivations indicate that it is not necessary to start the EM itera- tions from initializing the parameters e.g. {mi,Qi}?s. Instead, we can start from assigning the posterior responsibility {rni}?s. Once assigned, we follow equations (4.18) to (4.19) to compute the updated {mi,Qi}?s. The iterations then move on. This way we can easily incorporate any prior knowledge gained from cluster- ing techniques such as the ?kernelized? version of the K-means algorithm [181], or other algorithms [180]. Parameter learning experiments We now demonstrate how mixture of PKPCA performs by fitting it to the two C-shapes shown in Figure 4.4(d). We set the following parameters: I = 2, q = 2, ? = 1e?2, and ? = 8. The algorithm iterations are terminated if the changes in the {rni}?s are small enough. Figure 4.3(a) presents the initial configuration for the two C-shapes. We just generate random numbers for{rni}?s followed by a normalization step to guarantee 99 summationtextI i=1 rni = 1. Figure 4.3(b) shows the mixture assignment after the first iteration and Figure 4.3(c) the final configuration (only after 3 iterations). A final note is that the EM algorithm can still converge to a local minimum. In this case, the clustering method [180] is very helpful for initialization. 0 20 40 60 80 1000 10 20 30 40 50 60 70 80 90 100 0 20 40 60 80 1000 10 20 30 40 50 60 70 80 90 100 0 20 40 60 80 1000 10 20 30 40 50 60 70 80 90 100 (a) (b) (c) Figure 4.3: (a) Initial configuration. (b) After first iteration. (c) Final configura- tion. ?+? and ?x? denote two different mixture components. 4.3.2 Why mixture of PKPCA? It is well known [180, 181] that kernel embedding results in clustering capability. This raises the doubt whether PKPCA is sufficient to model a nonlinear struc- ture with nonlinear substructures. We demonstrate the effectiveness of mixture of PKPCA with the following examples. 
Figure 4.4(a) shows a nonlinear structure containing a single C-shape and Fig- ures 4.4(b) and 4.4(c) the contour plots, for the 1st and 2nd kernel principal com- ponents, i.e. all points in the contour share the same principal component values. These plots capture the nonlinear shape very precisely. Now in Figure 4.4(d), a nonlinear structure containing two C-shapes is presented. Figures 4.4(e) and 4.4(f) display the contour plots corresponding to 4.4(d). Clearly, they attempt to cap- ture both C-shapes at the same time. This is not desirable. Ideally, we want to 100 have two KPCAs, each modeling a different C-shape more precisely. However, the ordinary KPCA has no such capability but PKPCA does. This naturally leads us to considering a mixture of PKPCA. Section 4.4 also demonstrates this using classification results. The successful kernel clustering algorithm [180] shows that after kernel em- bedding, the clusters become more separable. This further sheds light on the effectiveness of mixture of PKPCA. One may also ask: why not use the mixture of PPCA directly? Although a mixture of PPCA is legitimate, its use is not elegant in this scenario since one may need more than 2 components for Figure 4.4(d) to capture the data structure due to the limitation of the linear setting in PCA. But mixture of PKPCA can elegantly model it using two components. 4.4 Classification 4.4.1 PKPCA or mixture of PKPCA classifier We now demonstrate the probabilistic interpretation embedded in PKPCA using a pattern classification problem. Suppose we have N classes. For class n, a PKPCA or mixture of PKPCA density p(?n(x)|n) is trained; then, the class label for a point x is determined using the Bayesian decision principle by ?n = arg maxn=1,...,N p(n)p(x|n) = arg maxn=1,...,N p(n)p(?n(x)|n)|Jn(x)|, (4.20) where p(n) is the prior distribution, p(x|n) is the conditional density for class n in the original space, and Jn(x) is the Jacobi matrix for class n. To use (4.20), we are confronted by two dilemmas: (i) the Jacobi matrices, Jn(x)?s, are unknown since we have no knowledge of ?n(x); and (ii) the densities, 101 0 10 20 30 0 10 20 30 (a) one C?shape 0 10 20 30 0 10 20 30 (b) 1st KPC 0 10 20 30 0 10 20 30 (c) 2nd KPC 0 10 20 30 0 10 20 30 (d) two C?shapes 0 10 20 30 0 10 20 30 (e) 1st KPC 0 10 20 30 0 10 20 30 (f) 2nd KPC Figure 4.4: (a) One C-shape and contour plots of its (b) 1st and (c) 2nd KPCA features. (d) Two C-shapes and its contour plots of its (e) 1st and (f) 2nd KPCA features. p(?n(x)|n)?s, involves infinite f. The latter is easily fixed by assuming ?c ? ? for all classes, where ?n is the parameter in the density p(?n(x)|n) for class n. One trick to attack the first dilemma is to use the same kernel function for all the classes with the same kernel width ?, i.e. ?n = ?. However, it might not be appropriate since different classes possess different data structures. An alternative approach is that we still use different kernel functions for different classes but we approximate the Jacobi matrices. We use the following approximation: |Jn(x)| similarequal const, ?x. Figure 4.5 demonstrates our rationale. Figure 4.5(a) presents the contour plots 102 for the true density to be modeled, which is uniform inside the black C-shaped region (Figure 4.1(a)). All contour plots are located on the boundary. We fit a PKPCA density (? = 15, q = 20, and ? = 1e ? 6) based on the samples shown in Figure 4.1(b) and visualize the density using Figure 4.5(b), which displays the map of log(??(x)). 
To verify that the values in the C-shaped region are uniform, we show in Figure 4.5(c) the contour plots for ???(x) inside the C-shaped region. Most contours are close to the boundary, which indicates the uniformity of the density p(?(x)) inside the C-shaped region and thus the Jacobi approximation which relates p(?(x)) and p(x) is reasonable. The above approximation leads to a linear decision rule. For example, in a two-class problem, the decision rule is, for some ? > 0, If p(?1(x)|1) ? ? p(?2(x)|2) then class 1; Else class 2 In the sequel, we simply take ? = 1. 10 20 30 40 50 60 70 80 90 100 10 20 30 40 50 60 70 80 90 100 20 40 60 80 100 10 20 30 40 50 60 70 80 90 100 10 20 30 40 50 60 70 80 90 100 10 20 30 40 50 60 70 80 90 100 (a) (b) (c) Figure 4.5: The approximation of the Jacobi matrix. (a) The contour plots of the true density: uniform inside the C-shaped region. (b) The map of log(??). (c) The contour plots of ??? inside the C-shaped region. Putting the above discussions together, we have the following decision rules: ? If PKPCA densities are learned for all classes, i.e., for class n, we learn 103 {?n = ?,Qn}, it is easy to check that the classifier performs the following: arg minn=1,...,N ??n?(x), where ??n is the ?generalized? Mahalanobis distance. ? If mixture of PKPCA densities are learned for all classes, i.e., for the class c, we learn {?n = ?,mn,1,Qn,1,...,mn,Ic,Qn,In} with In being the number of mixture components, then the classifier decides as follows: arg maxn=1,...,N Insummationdisplay j=1 mn,j exp{?12??n?,j(x)}. 4.4.2 Experiments Synthetic Data We consider a 2-class problem with foreground (class 1) and background (class 2) classes given in Figure 4.1(a), where the letter ?C? or ?O? means the foreground class. We then draw 200 samples for both classes as shown in Figures 4.1 and 4.8. Figure 4.6 presents the classification results obtained by the PKPCA classifier with different kernel widths for different classes (PKPCA-d), the PKPCA classifier with same kernel widths for different classes (PKPCA-s), the support vector ma- chine (SVM) [19], and the kernel Fisher discriminant analysis (KFDA) [177]. In PKPCA-s, SVM and KFD, the kernel width ? is tuned (via exhaustive search from 1 to 100) to yield the best empirical classification results and reported in Table 4.2. The PKPCA-d parameters actually used are also reported in Table 4.2, where the kernel widths for the background and foreground classes are found via the procedures described in Appendix 4.III. As shown in Figure 4.6, the classification boundary obtained by PKPCA-d is very smooth and very similar to the original 104 Algorithm Single C-shape Single O-shape Double C-shapes PKPCA-d 1.57% 3.80% 7.49% q = 30, ? = 10?8 q = 20, ? = 10?6 q = 20, ? = 10?6 ?1 = 15,?2 = 35 ?1 = 15, ?2 = 35 ?1 = 15, ?2 = 35 PKPCA-s 1.95% 5.50% 1.85% q = 30, ? = 10?8 q = 30, ? = 10?8 q = 20, ? = 10?6 ? = 1 ? = 1 ? = 1 SVM 1.80% 5.45% 1.69% ? = 1 ? = 1 ? = 1 KFDA 1.84% 5.47% 1.82% ? = 1, 30 components ? = 1, 20 components ? = 1, 20 components mix. PKPCA NA NA 0.70% q = 20, ? = 10?6, I1 = 2, ?1 = 8, I2 = 1, ?2 = 35 Table 4.2: Classification error on the single C-shaped, the single O-shape, and the double C-shapes. boundary, while those of PKPCA-s, SVM and KFDA seem to only replicate the training samples, with holes and gaps. Table 4.2 indicates that our PKPCA-d classifier outperforms the SVM and KFDA classifiers by some margin. Similar observations can be made based on the experimental results on a single O-shape as shown in Figure 4.8. 
The superior performance of the PKPCA-d classifier mainly arises from its ability to model different classes with different kernel functions, while PKPCA-s, SVM, and KFDA employ only one kernel. This is a big advantage since, as seen in our synthetic examples, we clearly need different kernel widths for the foreground and background classes.

Figure 4.6: The classification results on the single C-shape obtained by (a) PKPCA-d, (b) PKPCA-s, (c) SVM, and (d) KFDA.

Figure 4.7: The classification results on the double C-shapes obtained by (a) the PKPCA-d classifier, (b) SVM, and (c) the mixture of PKPCA classifier with different kernel widths.

Figure 4.8: The classification results on the single O-shape. Panels: (a) original 2-class data, (b) foreground samples, (c) background samples, (d) PKPCA, (e) SVM, (f) KFDA.

More importantly, PKPCA provides a regularized approximation to the data structure; thus its decision boundary is very smooth. Also, the probabilistic interpretation of PKPCA enables the PKPCA classifier to deal with an N-class problem as easily as KFDA, while the SVM is basically designed for a two-class problem and extending it to a multi-class problem is not straightforward.

We now illustrate the mixture of PKPCA classifier by applying it to the double C-shapes shown in Figure 4.1(d). We fit the mixture of PKPCA density for the foreground class based on the samples shown in Figure 4.1(e) and the PKPCA density for the background class based on the samples shown in Figure 4.1(f). Figure 4.7 and Table 4.2 present the classification results. Clearly, the mixture of PKPCA classifier produces the best performance in terms of classification error, and its decision boundary is very smooth.

One important observation is that the PKPCA classifier with different kernel widths performs poorly here. This is because the selected kernel width attempts to cover both nonlinear substructures simultaneously, which actually over-smoothes each substructure (see Figure 4.7(a)). Hence, caution should be exercised when modeling mixture data via PKPCA densities with different kernel widths.

IDA Benchmark

We also test our classifier on the IDA benchmark repository3 [179]. To make our results comparable, we use cross-validation (the same procedure as in [179]) to choose our parameters; also, we invoke the PKPCA density without mixture modeling and with the same kernel parameter for different classes. As tabulated in Table 4.3, our PKPCA classifier compares favorably to kernel classifiers such as SVM and KFD. We believe that the classification results can be improved by using the PKPCA-d or even the mixture of PKPCA classifier.

A real application: face recognition

We report face recognition results using a subset of the FERET database [58] with 200 subjects only. Each subject has 3 images: (i) one taken under a controlled lighting condition with a neutral expression; (ii) one taken under the same lighting condition as (i) but with a different facial expression (mostly smiling); and (iii) one taken under a different lighting condition and mostly with a neutral expression.
Figure 4.9 shows some face examples from this database.

Our experiment focuses on testing the generalization capability of our algorithm. It is our hope that the training stage can learn the intrinsic characteristics of the space we are interested in. Therefore, we always keep the gallery and probe sets separate. We randomly select 300 images belonging to 100 subjects as the gallery set for learning and the remaining 300 images as the probe set for testing. This random division is repeated 20 times and we take the average as the final result.

3 This is available at http://ida.first.gmd.de/~raetsch/data/benchmarks.htm.

            PKPCA-s      SVM          KFD
Banana      10.5 ± 0.4   11.5 ± 0.7   10.8 ± 0.5
B. Cancer   28.0 ± 4.7   26.0 ± 4.7   25.8 ± 4.6
Diabetes    24.8 ± 1.9   23.5 ± 1.7   23.2 ± 1.6
German      24.9 ± 2.2   23.6 ± 2.1   23.7 ± 2.2
Heart       16.8 ± 3.4   16.0 ± 3.3   16.1 ± 3.4
Image        2.8 ± 0.6    3.0 ± 0.6    3.3 ± 0.6
Ringnorm     1.6 ± 0.1    1.7 ± 0.1    1.5 ± 0.1
F. Solar    34.8 ± 1.9   32.4 ± 1.8   33.2 ± 1.7
Splice      12.2 ± 0.8   10.9 ± 0.7   10.5 ± 0.6
Thyroid      4.0 ± 2.0    4.8 ± 2.2    4.2 ± 2.1
Titanic     22.6 ± 1.3   22.4 ± 1.0   23.2 ± 2.0
Twonorm      2.6 ± 0.2    3.0 ± 0.2    2.6 ± 0.2
Waveform    11.4 ± 0.5    9.9 ± 0.4    9.9 ± 0.4
Table 4.3: The classification error on the IDA benchmark repository. The SVM and KFD results are reported in [179].

Figure 4.9: Top row: neutral faces. Middle row: faces with facial expression. Bottom row: faces under different illumination. Image size is 24 by 21 pixels.

Component analysis in general is not geared towards discrimination and thus yields inferior recognition results in practice. To this end, Moghaddam et al. [55, 56] introduced the concept of intra-personal space (IPS). The IPS is constructed by collecting all the difference images between any two image pairs belonging to the same individual; it is meant to capture all the possible intra-personal variations introduced during image acquisition. Suppose that we have learned some density p_{IPS} on top of the IPS and we are given a gallery set consisting of images {x_1, x_2, ..., x_N} for N different individuals. Given a probe image y, its identity n̂ is determined by

n̂ = arg max_{n=1,...,N} p_{IPS}(y − x_n) = arg min_{n=1,...,N} δ_{IPS,∞}(y − x_n).

Here we use the limiting Mahalanobis distance δ_∞.

For comparison, we have implemented the following four methods. In PKPCA/IPS and PPCA/IPS, the IPS is constructed based on the gallery set and the PKPCA/PPCA density is fitted on top of it. In KPCA and PCA, all 300 training images are regarded as lying in one face space and KPCA/PCA is then learned on that space; the classifier sets the identity of a probe image to the identity of its nearest neighbor in the gallery set.

Table 4.4 lists the recognition rate, averaged over the 20 simulations, using the top-1 match. The PKPCA/IPS algorithm attains the best performance since it combines the discriminative power of the IPS model and the merit of PKPCA. However, compared to PPCA/IPS, the improvement is not significant, indicating that second-order statistics might be enough after IPS modeling for the face recognition problem. Still, PKPCA may be more effective in general since it also takes higher-order statistics into account. Another observation is that variations in illumination are easier to model than facial expression using subspace methods.

                PKPCA/IPS   PPCA/IPS   KPCA     PCA
Expression      78.55%      78.35%     63.85%   67.65%
Illumination    83.9%       81.85%     51.9%    73.1%
Average         81.23%      80.1%      57.88%   70.38%
Table 4.4: Recognition rate of various kernel and non-kernel subspace methods.
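As an illustration of the IPS-based matching rule above, here is a small sketch that builds the intra-personal space from a gallery and identifies a probe by the smallest subspace Mahalanobis-type score. For simplicity it fits a plain PCA subspace model in place of the PKPCA/PPCA density, so it is not the method evaluated in Table 4.4; all function names are hypothetical.

```python
import numpy as np

def build_ips(gallery_images, gallery_ids):
    """Intra-personal space: difference images between image pairs of the same person."""
    diffs = []
    for pid in np.unique(gallery_ids):
        imgs = gallery_images[gallery_ids == pid]
        for a in range(len(imgs)):
            for b in range(a + 1, len(imgs)):
                diffs.append(imgs[a] - imgs[b])
                diffs.append(imgs[b] - imgs[a])
    return np.asarray(diffs)

def fit_subspace_density(D, q=20):
    """PCA-style second-order model of the IPS (a stand-in for the PKPCA/PPCA density)."""
    mean = D.mean(axis=0)
    U, s, Vt = np.linalg.svd(D - mean, full_matrices=False)
    basis = Vt[:q].T                      # top-q principal directions of the IPS
    var = (s[:q] ** 2) / len(D)           # corresponding variances
    return mean, basis, var

def identify(probe, gallery_images, mean, basis, var):
    """Return the gallery index minimizing the Mahalanobis-type score on the IPS subspace."""
    scores = []
    for x in gallery_images:
        proj = basis.T @ ((probe - x) - mean)
        scores.append(np.sum(proj ** 2 / var))
    return int(np.argmin(scores))
```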
4.5 Appendix

Appendix 4.I: Two Lemmas on Matrix Computation

We introduce some related results on matrix computation in the following two lemmas. The proofs are straightforward and hence skipped here.

Lemma 4.1. Suppose that A is of size d × q with q < d and the matrix A^T A is of full rank. Then the matrices A^T A and A A^T have the same nonzero eigenvalues.

Lemma 4.2. Suppose that B = ε I_d + A A^T and {λ_i; i = 1, 2, ..., q} are the eigenvalues of the matrix A^T A. Then the determinant |B| is given by

|B| = ∏_{i=1}^{q} (ε + λ_i) · ε^{d−q},    (4.21)

and the inverse matrix B^{−1} is given by

B^{−1} = ε^{−1}{ I_d − A(ε I_q + A^T A)^{−1} A^T }.

Appendix 4.II: A List of Important Quantities

Important quantities:
RKHS: H = R^f.
Original observations: X_{d×N} = [x_1, x_2, ..., x_N].
Nonlinear mapping: φ(x): R^d → R^f.
Observations in RKHS: Φ_{f×N} = [φ_1, φ_2, ..., φ_N].
Weight vector: e_{N×1} = N^{−1} 1 (for example).
Mean: μ_{f×1} = Φ e.
Centering matrix: J_{N×N} = N^{−1/2}(I_N − e 1^T).
Covariance matrix (c.m.): Σ_{f×f} = Φ J J^T Φ^T.
Gram matrix (g.m.): K_{N×N} = Φ^T Φ.
Centered g.m.: K̃_{N×N} = J^T K J.
Eigenvalues of K̃: Λ_q = D[λ_1, λ_2, ..., λ_q]_{q×q}.
Eigenvectors of K̃: V_q = [v_1, v_2, ..., v_q]_{N×q}.
Approximate c.m.: S_{f×f} = Φ A Φ^T + ε I_f.
A matrix: A_{N×N} = J V_q (I_q − ε Λ_q^{−1}) V_q^T J^T.
Inverse of S: S^{−1} = ε^{−1}(I_f − Φ B Φ^T).
B matrix: B_{N×N} = J V_q (Λ_q^{−1} − ε Λ_q^{−2}) V_q^T J^T.
C matrix: C_{N×N} = J Q (Q^T K̃ Q)^{−1} Q^T J^T.
Q matrix: Q_{N×q} = V_q (I_q − ε Λ_q^{−1})^{1/2} R.
M matrix: M_{q×q} = ε I_q + Q^T K̃ Q.

Computation related to L and M

We first compute L = Q^T K̃ Q and then M:

L = Q^T K̃ Q = R^T (I_q − ε Λ_q^{−1})^{1/2} V_q^T K̃ V_q (I_q − ε Λ_q^{−1})^{1/2} R
  = R^T (I_q − ε Λ_q^{−1})^{1/2} Λ_q (I_q − ε Λ_q^{−1})^{1/2} R
  = R^T (Λ_q − ε I_q) R,

where the fact that V_q^T K̃ V_q = V_q^T J^T K J V_q = Λ_q is used. Therefore,

M = ε I_q + L = ε I_q + R^T (Λ_q − ε I_q) R = R^T Λ_q R,
|M| = |Λ_q| = ∏_{i=1}^{q} λ_i,    M^{−1} = R^T Λ_q^{−1} R.

Computation related to A, B, and C

A = J Q Q^T J^T = J V_q (I_q − ε Λ_q^{−1})^{1/2} R R^T (I_q − ε Λ_q^{−1})^{1/2} V_q^T J^T = J V_q (I_q − ε Λ_q^{−1}) V_q^T J^T,

B = J Q M^{−1} Q^T J^T = J V_q (I_q − ε Λ_q^{−1})^{1/2} R R^T Λ_q^{−1} R R^T (I_q − ε Λ_q^{−1})^{1/2} V_q^T J^T = J V_q (Λ_q^{−1} − ε Λ_q^{−2}) V_q^T J^T,

C = J Q (Q^T K̃ Q)^{−1} Q^T J^T = J V_q (I_q − ε Λ_q^{−1})^{1/2} R R^T (Λ_q − ε I_q)^{−1} R R^T (I_q − ε Λ_q^{−1})^{1/2} V_q^T J^T = J V_q Λ_q^{−1} V_q^T J^T,

tr[AK] = tr[J V_q (I_q − ε Λ_q^{−1}) V_q^T J^T K] = tr[(I_q − ε Λ_q^{−1}) V_q^T J^T K J V_q] = tr[(I_q − ε Λ_q^{−1}) Λ_q] = tr[Λ_q] − εq = Σ_{i=1}^{q} λ_i − εq,

tr[BK] = tr[J V_q (Λ_q^{−1} − ε Λ_q^{−2}) V_q^T J^T K] = tr[(Λ_q^{−1} − ε Λ_q^{−2}) V_q^T J^T K J V_q] = tr[(Λ_q^{−1} − ε Λ_q^{−2}) Λ_q] = q − ε tr[Λ_q^{−1}] = q − ε Σ_{i=1}^{q} λ_i^{−1}.

Computation related to S

We have shown that S^{−1} = ε^{−1}(I_f − Φ B Φ^T). We are also often interested in computing tr(S^{−1} Σ):

tr(S^{−1} Σ) = tr(S^{−1} Φ J J^T Φ^T) = tr(J^T Φ^T S^{−1} Φ J)
  = ε^{−1}(tr(K̃) − tr(J^T Φ^T Φ B Φ^T Φ J))
  = ε^{−1}(tr(K̃) − tr(K̃ V_q (Λ_q^{−1} − ε Λ_q^{−2}) V_q^T K̃))
  = ε^{−1}(tr(K̃) − tr((Λ_q^{−1} − ε Λ_q^{−2}) Λ_q^2))
  = ε^{−1}(tr(K̃) − tr(Λ_q − ε I_q))
  = ε^{−1}(tr(K̃) − Σ_{i=1}^{q} λ_i) + q.

Also, using Lemma 4.2 in Appendix 4.I, the determinant of S is given by

|S| = ε^{f−q} |M| = ε^{f−q} |Λ_q| = ε^{f−q} ∏_{i=1}^{q} λ_i.

Appendix 4.III: Kernel selection

Only functions satisfying Mercer's theorem [176] can be used as kernel functions. In general, the kernel function lies in some parameterized function family. Denote the parameter of interest by θ. For example, θ can be the polynomial degree in the polynomial kernel, or the kernel width in the Gaussian kernel. The choice of θ remains an open question because there is no systematic criterion to judge its goodness. Again, we focus only on the Gaussian kernel case, so θ = σ and f = ∞. It seems that PKPCA offers a systematic ML principle to follow, i.e., picking the σ that maximizes the likelihood or log-likelihood.
However, it turns out that the ML principle fails, as it has an inherent bias towards a large σ value. The log-likelihood L is given by:

L = −(Nf/2) log(2π) − (N/2) log|S| − (1/2) Σ_{n=1}^{N} (φ(x_n) − μ)^T S^{−1} (φ(x_n) − μ)
  ∝ −(N/2) Σ_{i=1}^{q} log(λ_i) − (N/2) tr(S^{−1} Σ)
  ∝ −(N/2) Σ_{i=1}^{q} log(λ_i) − (N/2) ε^{−1}(tr(K̃) − Σ_{i=1}^{q} λ_i).

By defining the quantity

E(σ) = −(2/N) L ∝ Σ_{i=1}^{q} log(λ_i) + ε^{−1}(tr(K̃) − tr(Λ_q)),    (4.22)

the goal is to min_σ E(σ) subject to λ_q(σ) > ε.

Figure 4.10: (a) The curve of E(σ). (b) The curve of λ_1(σ). We have set q = 30 and ε = 1e−6.

We now show how this works. Figure 4.10(a) presents the curve of E(σ) obtained using (4.22) for the C-shaped data (Figure 4.1(a)); it always has a bias toward favoring a large σ. This is not surprising since a large σ makes the Gram matrix K close to a matrix of ones; hence the centered Gram matrix K̃ becomes close to a matrix of zeros, the data variation is reduced, and therefore the likelihood is increased. If σ goes to ∞, all data essentially reduce to one point in the feature space. This is also explained by Williams in [183], who studied the ratio of the sum of the top q eigenvalues to that of all eigenvalues and discovered the same bias.

We propose an alternative approach by examining the first eigenvalue, which equals the maximum variance of the projected data, where the projection occurs in the feature space induced by the kernel function. Figure 4.10(b) shows the plot of the first eigenvalue λ_1(σ) against σ. There is a unique maximum, and we pick it as our kernel width. This choice of the kernel width seems to have a close relationship with the assumption on the Jacobi matrix in Section 4.4.1. Figure 4.11(a) presents the map of log p̂(φ(x)) for the single C-shape with σ = 3 and Figure 4.11(b) the contour plots of p̂(φ(x)); the map is very granular and the uniformity inside the C-shaped region disappears. Figure 4.11(c) shows the map of log p̂(φ(x)) with σ = 36 and Figure 4.11(d) the contour plots of p̂(φ(x)); now the map is over-smoothed (compare the intensity change inside and outside the C-shaped region with that of Figure 4.5(b)).

Figure 4.11: (a) The map of log p̂ and (b) the contour plots of p̂ inside the C-shaped region, when σ = 3. (c) The map of log p̂ and (d) the contour plots of p̂ inside the C-shaped region, when σ = 36.

Chapter 5

Probability Distances in Reproducing Kernel Hilbert Space

Probabilistic distance measures, defined as distances between two probability distributions, are important quantities that find use in many research areas such as probability and statistics, pattern recognition, information theory, communication, and so on. In statistics, probabilistic distances are often used in asymptotic analysis. In pattern recognition, pattern separability is usually calibrated using probabilistic distance measures [5] such as the Chernoff distance and the Bhattacharyya distance because they provide bounds on the probability of error in a pattern classification problem. In information theory, mutual information, a special example of the Kullback-Leibler divergence or relative entropy [4], is a fundamental quantity related to the channel capacity.
In communication, divergence and Bhattacharyya distance measures are used for signal selection [156].

Direct evaluation of probabilistic distances is nontrivial since they involve integrals. Only within certain parametric families, say the widely used Gaussian density, do we have analytic expressions for probability distances. However, the Gaussian density employs only up to second-order statistics; its modeling capacity is linear and hence rather limited when confronted with a nonlinear data structure. By a nonlinear data structure, we mean that if conventional linear modeling techniques, such as fitting a Gaussian density, are used, the responses are badly approximated. To absorb the nonlinearity, mixture models or non-parametric densities are used in practice. For such cases, one has to resort to numerical methods for computing the probabilistic distances. Such computation is not robust in nature since two approximations are invoked: one in estimating the density and the other in evaluating the numerical integral.

In this chapter, we model the nonlinearity through a different approach: kernel methods. The essence of kernel methods is to combine a linear algorithm with a nonlinear embedding, which maps the data from the original vector space to the reproducing kernel Hilbert space (RKHS). We do not require any explicit knowledge of the nonlinear mapping function as long as we can cast our computations into dot product evaluations. Since a nonlinear function is used, albeit in an implicit fashion, we achieve a new paradigm to study these distances and investigate their uses in a different space.

Clearly, our computation depends on the assumption that the data is Gaussian in the RKHS. This assumption has been implicitly used in many kernel methods such as [172, 181]. In [181], PCA operates on the RKHS; even though it seems that PCA needs only the covariance matrix without the Gaussianity assumption, it is the deviation of the data from Gaussianity in the original space that drives us to search for the principal components in the nonlinear feature space. In [172], discriminant analysis is performed on the feature space; it is well known that discriminant analysis originated as a two-class problem by assuming that each class is distributed as a Gaussian with a common covariance matrix. Recently, Gaussianity has been directly adopted in the literature [170, 171, 175]. In [170, 171], it is used to compute the mutual information between two Gaussian random vectors in the RKHS. In [175], it is used to construct the so-called Bhattacharyya kernel. In fact, the validity of this assumption boils down to a Gaussian process argument [175]. However, since the induced RKHS is certainly limited by the number of available samples, a regularized covariance matrix is needed in [170, 171]. We also propose a way to regularize the covariance matrix in this chapter.

Chapter organization

This chapter is organized as follows. Section 5.1 introduces several probabilistic distances often used in the literature and Section 5.2 presents a method for estimating the first- and second-order statistics of the data in the RKHS. Section 5.3 elaborates the derivations of the probabilistic distances in the RKHS and their limiting behavior. Section 5.4 demonstrates the feasibility and efficiency of the proposed measures using experiments on synthetic and real examples.
5.1 Probabilistic Distances in R^d

Consider a two-class problem and suppose that class 1 has prior probability π_1 and class-dependent density p_1(x) and class 2 has prior probability π_2 and class-dependent density p_2(x), both defined on R^d. The following is a list of probabilistic distance measures often found in the literature [5]:

- Chernoff distance [151]:
  J_C(p_1, p_2) = −log { ∫_x p_1^{α}(x) p_2^{1−α}(x) dx };    (5.1)

- Bhattacharyya distance [150]:
  J_B(p_1, p_2) = −log { ∫_x [p_1(x) p_2(x)]^{1/2} dx };    (5.2)

- Hellinger or Matusita distance [161]:
  J_T(p_1, p_2) = { ∫_x [√p_1(x) − √p_2(x)]^2 dx }^{1/2};    (5.3)

- The symmetric divergence [13]:
  J_D(p_1, p_2) = ∫_x [p_1(x) − p_2(x)] log (p_1(x)/p_2(x)) dx;    (5.4)

- Patrick-Fisher distance [163]:
  J_P(p_1, p_2) = { ∫_x [p_1(x) π_1 − p_2(x) π_2]^2 dx }^{1/2};    (5.5)

- Lissack-Fu distance [158]:
  J_L(p_1, p_2) = ∫_x |p_1(x) π_1 − p_2(x) π_2|^{α} p^{1−α}(x) dx;    (5.6)

- Kolmogorov distance [147]:
  J_K(p_1, p_2) = ∫_x |p_1(x) π_1 − p_2(x) π_2| dx;    (5.7)

where 0 < α < 1 and p(x) = p_1(x) π_1 + p_2(x) π_2. It is obvious that (i) the Bhattacharyya distance is a special case of the Chernoff distance with α = 1/2; (ii) the Hellinger distance is related to the Bhattacharyya distance as follows:

J_T = {2[1 − exp(−J_B)]}^{1/2};    (5.8)

and (iii) the Kolmogorov distance is a special case of the Lissack-Fu distance with α = 1. Some interesting properties of these distances can be found in [5, 156].

In particular, the symmetric divergence is of great interest in the information theory literature [4] and has a close connection with the famous Kullback-Leibler (KL) divergence [13]. The KL divergence or relative entropy between two densities p_1(x) and p_2(x) is given by

J_R(p_1||p_2) = ∫_x p_1(x) log { p_1(x)/p_2(x) } dx.    (5.9)

However, the KL divergence is not a true metric because neither the symmetry constraint nor the triangle inequality is satisfied. The symmetric divergence, which is symmetric, is equal to

J_D(p_1, p_2) = J_R(p_1||p_2) + J_R(p_2||p_1).    (5.10)

As mentioned earlier, computing the above probabilistic distance measures is nontrivial. Only within certain parametric families, say the Gaussian density, do we know how to analytically compute some of the above distance measures. Suppose that N(x; μ, Σ) is a multivariate Gaussian density defined as

N(x; μ, Σ) = (2π)^{−d/2} |Σ|^{−1/2} exp{ −(1/2)(x − μ)^T Σ^{−1} (x − μ) },    (5.11)

where x ∈ R^d and |·| is the matrix determinant. With p_1(x) = N(x; μ_1, Σ_1) and p_2(x) = N(x; μ_2, Σ_2), we evaluate some of the above probabilistic distance measures as follows:

- Chernoff distance:
  J_C(p_1, p_2) = (1/2) α(1−α) (μ_1 − μ_2)^T [(1−α)Σ_1 + αΣ_2]^{−1} (μ_1 − μ_2) + (1/2) log { |(1−α)Σ_1 + αΣ_2| / (|Σ_1|^{1−α} |Σ_2|^{α}) };    (5.12)

- Bhattacharyya distance:
  J_B(p_1, p_2) = (1/8) (μ_1 − μ_2)^T [(Σ_1 + Σ_2)/2]^{−1} (μ_1 − μ_2) + (1/2) log { |(Σ_1 + Σ_2)/2| / (|Σ_1|^{1/2} |Σ_2|^{1/2}) };    (5.13)

- Kullback-Leibler divergence or relative entropy:
  J_R(p_1||p_2) = (1/2) (μ_1 − μ_2)^T Σ_2^{−1} (μ_1 − μ_2) + (1/2) log (|Σ_2|/|Σ_1|) + (1/2) tr[Σ_1 Σ_2^{−1} − I_d];    (5.14)

- The symmetric divergence:
  J_D(p_1, p_2) = (1/2) (μ_1 − μ_2)^T (Σ_1^{−1} + Σ_2^{−1}) (μ_1 − μ_2) + (1/2) tr[Σ_1^{−1}Σ_2 + Σ_2^{−1}Σ_1 − 2 I_d];    (5.15)

- Patrick-Fisher distance:
  J_P(p_1, p_2) = [(2π)^d |2Σ_1|]^{−1/2} + [(2π)^d |2Σ_2|]^{−1/2} − 2 [(2π)^d |Σ_1 + Σ_2|]^{−1/2} exp{ −(1/2)(μ_1 − μ_2)^T (Σ_1 + Σ_2)^{−1} (μ_1 − μ_2) };    (5.16)

where d is the dimensionality of the random vector x and tr[·] is the matrix trace. In particular, when the covariance matrices of the two densities are the same, i.e., Σ_1 = Σ_2 = Σ, the Bhattacharyya distance and the symmetric divergence reduce to the Mahalanobis distance [160]:

J_M = J_D = 8 J_B = (μ_1 − μ_2)^T Σ^{−1} (μ_1 − μ_2).    (5.17)
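For reference, the closed-form Gaussian expressions (5.13)-(5.15) can be evaluated directly. The sketch below mirrors the equations as written and uses NumPy; the function names are assumptions for illustration.

```python
import numpy as np

def kl_divergence(mu1, S1, mu2, S2):
    """J_R(p1||p2) between N(mu1, S1) and N(mu2, S2), per Eq. (5.14)."""
    d = len(mu1)
    S2inv = np.linalg.inv(S2)
    dm = mu1 - mu2
    logdet = np.linalg.slogdet(S2)[1] - np.linalg.slogdet(S1)[1]
    return 0.5 * (dm @ S2inv @ dm + logdet + np.trace(S1 @ S2inv) - d)

def bhattacharyya(mu1, S1, mu2, S2):
    """J_B(p1, p2) per Eq. (5.13)."""
    S = 0.5 * (S1 + S2)
    dm = mu1 - mu2
    logdet = (np.linalg.slogdet(S)[1]
              - 0.5 * (np.linalg.slogdet(S1)[1] + np.linalg.slogdet(S2)[1]))
    return 0.125 * dm @ np.linalg.inv(S) @ dm + 0.5 * logdet

def symmetric_divergence(mu1, S1, mu2, S2):
    """J_D = J_R(p1||p2) + J_R(p2||p1), consistent with Eqs. (5.10) and (5.15)."""
    return kl_divergence(mu1, S1, mu2, S2) + kl_divergence(mu2, S2, mu1, S1)
```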
In this chapter, we focus only on the distances defined in (5.12)-(5.15).

5.2 Mean and Covariance Matrix in RKHS

5.2.1 First- and second-order statistics

Computing the probabilistic distance measures requires first- and second-order statistics in the RKHS, as shown in Section 5.1. In practice, we have to estimate these statistics from a set of training samples. Chapter 4 presented a detailed treatment of this topic and here we recapitulate some important points.

Suppose that {x_1, x_2, ..., x_N} are given observations in the original data space R^d. We operate in the RKHS R^f induced by a nonlinear mapping function φ: R^d → R^f, where f > d and f could even be infinite. The training samples in R^f are denoted by Φ_{f×N} = [φ_1, φ_2, ..., φ_N], where φ_n ≡ φ(x_n) ∈ R^f. Using the maximum likelihood estimation (MLE) principle, the mean μ and the covariance matrix Σ are estimated as

μ = (1/N) Σ_{n=1}^{N} φ(x_n) = Φ e;    Σ = (1/N) Σ_{n=1}^{N} (φ_n − μ)(φ_n − μ)^T = Φ J J^T Φ^T = Ψ Ψ^T,    (5.18)

where the weight vector e_{N×1} ≡ N^{−1} 1 with 1 being a vector of ones, Ψ ≡ Φ J, and J is an N × N centering matrix given by

J ≡ N^{−1/2}(I_N − e 1^T).    (5.19)

5.2.2 Covariance matrix approximation

The covariance matrix Σ in (5.18) is rank-deficient since f > N. Thus, inverting such a matrix is impossible and an approximation to the covariance matrix is necessary. Later, in Section 5.3, we show that this approximation can be made exact by studying the limiting behavior. Such an approximation S should possess the following features:

- It keeps the principal structure of the covariance matrix Σ. In other words, the dominant eigenvalues and eigenvectors of Σ and S should be the same.

- It is compact and regularized. The compactness is inspired by the fact that the smallest eigenvalues of the covariance matrix are very close to zero. Regularity is always desirable in approximation theory.

- It is easy to invert.

As shown in Chapter 4, we suggest the following approximation form:

S = ε I_f + Φ J Q Q^T J^T Φ^T = ε I_f + Φ A Φ^T,    (5.20)

where Q is an N × q matrix, A ≡ J Q Q^T J^T, and ε > 0 is a pre-specified constant. Typically, q ≪ N ≪ f. First, when Q = V_q (I_q − ε Λ_q^{−1})^{1/2} R, where V_q and Λ_q encode the top q eigenvectors and eigenvalues of the centered Gram matrix K̃, the top q eigenpairs of Σ are maintained; hence, if ε = 0, we exactly maintain the subspace containing the top q eigenpairs. Second, S is regularized and its compactness is achieved through the Q matrix. Finally, inverting S is easy using the Woodbury formula [8]: with W ≡ Φ J Q,

S^{−1} = (ε I_f + W W^T)^{−1} = ε^{−1}(I_f − W M^{−1} W^T) = ε^{−1}(I_f − Φ B Φ^T),    (5.21)

where B ≡ J Q M^{−1} Q^T J^T and the q × q matrix M is

M ≡ ε I_q + W^T W = ε I_q + Q^T K̃ Q.    (5.22)

After obtaining Q, it is easy to check that the following equations hold:

M = Λ_q,    |M| = |Λ_q| = ∏_{i=1}^{q} λ_i,    M^{−1} = Λ_q^{−1},    |S| = ε^{f−q} |Λ_q|.    (5.23)

A = J V_q (I_q − ε Λ_q^{−1}) V_q^T J^T,    B = J V_q (Λ_q^{−1} − ε Λ_q^{−2}) V_q^T J^T.    (5.24)

tr[AK] = tr[Λ_q] − εq,    tr[BK] = q − ε tr[Λ_q^{−1}].    (5.25)

5.3 The Probabilistic Distances in RKHS

Since the probabilistic distances involve two densities p_1 and p_2, we need two sets of training samples: Φ_1 for p_1 and Φ_2 for p_2. For each density p_i, we can find its corresponding e_i, J_i, μ_i, Σ_i, K_i, S_i, V_{q_i,i}, Λ_{q_i,i} = D[λ_{1,i}, λ_{2,i}, ..., λ_{q_i,i}], A_i, B_i, etc., by keeping the top q_i principal components. In general, we can have q_1 ≠ q_2 and N_1 ≠ N_2, with N_i being the number of samples for the ith density. In addition, we define the following dot product matrix:

[Φ_1  Φ_2]^T [Φ_1  Φ_2] = [[Φ_1^T Φ_1, Φ_1^T Φ_2], [Φ_2^T Φ_1, Φ_2^T Φ_2]] ≡ [[K_{11}, K_{12}], [K_{21}, K_{22}]],    (5.26)

where K_{ij} ≡ Φ_i^T Φ_j and K_{21} = K_{12}^T.
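Before deriving the distances, here is a small numerical sketch of the second-order quantities of Eqs. (5.18)-(5.25) computed from a precomputed Gram matrix. It reflects my reading of these formulas rather than the author's implementation, and the helper names are assumptions.

```python
import numpy as np

def rkhs_second_order(K, q, eps):
    """Given an N x N Gram matrix K, compute the centered Gram matrix, its top-q
    eigenpairs, and the A/B matrices, traces, and (partial) log-determinant used in
    the distance formulas (Eqs. (5.18)-(5.25)). Sketch; kernel computed outside."""
    N = K.shape[0]
    e = np.full(N, 1.0 / N)                                   # weight vector (5.18)
    J = (np.eye(N) - np.outer(e, np.ones(N))) / np.sqrt(N)    # centering matrix (5.19)
    Kt = J.T @ K @ J                                          # centered Gram matrix
    lam, V = np.linalg.eigh(Kt)
    lam, V = lam[::-1][:q], V[:, ::-1][:, :q]                 # top-q eigenpairs of Kt
    A = J @ V @ np.diag(1.0 - eps / lam) @ V.T @ J.T          # (5.24)
    B = J @ V @ np.diag(1.0 / lam - eps / lam ** 2) @ V.T @ J.T
    tr_AK = lam.sum() - eps * q                               # (5.25)
    tr_BK = q - eps * (1.0 / lam).sum()
    # log|S| = (f - q) log(eps) + sum_i log(lam_i); only the finite part is returned
    logdet_top = np.log(lam).sum()
    return dict(e=e, J=J, Kt=Kt, lam=lam, V=V, A=A, B=B,
                tr_AK=tr_AK, tr_BK=tr_BK, logdet_top=logdet_top)
```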
5.3.1 The Chernoff distance and the Bhattacharyya distance

As mentioned before, the Bhattacharyya distance is a special case of the Chernoff distance with α = 1/2; hence, we focus only on the Chernoff distance. The key quantity in computing the Chernoff distance is α_1 S_1 + α_2 S_2 with α_1 + α_2 = 1. We now analyze this quantity in detail:

α_1 S_1 + α_2 S_2 = α_1{ε I_f + Φ_1 A_1 Φ_1^T} + α_2{ε I_f + Φ_2 A_2 Φ_2^T}
 = ε I_f + α_1 Φ_1 A_1 Φ_1^T + α_2 Φ_2 A_2 Φ_2^T
 = ε I_f + [Φ_1  Φ_2] [[α_1 A_1, 0], [0, α_2 A_2]] [Φ_1  Φ_2]^T
 = ε I_f + [Φ_1  Φ_2] [[α_1 J_1 Q_1 Q_1^T J_1^T, 0], [0, α_2 J_2 Q_2 Q_2^T J_2^T]] [Φ_1  Φ_2]^T
 = ε I_f + [Φ_1  Φ_2] A_{ch} [Φ_1  Φ_2]^T,    (5.27)

where the matrix A_{ch} is rank-deficient since A_{ch} = P P^T with

P_{(N_1+N_2)×(q_1+q_2)} ≡ [[√α_1 J_1 Q_1, 0], [0, √α_2 J_2 Q_2]].    (5.28)

Therefore, the matrix α_1 S_1 + α_2 S_2 is of such a form that we can easily find its determinant and inverse. The determinant is given by

|α_1 S_1 + α_2 S_2| = ε^{f−(q_1+q_2)} |ε I_{q_1+q_2} + L| = ε^{f−(q_1+q_2)} ∏_{i=1}^{q_1+q_2} (γ_i + ε),    (5.29)

where {γ_i; i = 1, ..., q_1+q_2} are the eigenvalues of the L matrix. The L matrix is given by

L_{(q_1+q_2)×(q_1+q_2)} = P^T [Φ_1  Φ_2]^T [Φ_1  Φ_2] P = P^T [[K_{11}, K_{12}], [K_{21}, K_{22}]] P
 = [[α_1 Q_1^T J_1^T K_{11} J_1 Q_1, √(α_1 α_2) Q_1^T J_1^T K_{12} J_2 Q_2], [√(α_1 α_2) Q_2^T J_2^T K_{21} J_1 Q_1, α_2 Q_2^T J_2^T K_{22} J_2 Q_2]]
 = [[α_1(Λ_{q_1,1} − ε I_{q_1}), √(α_1 α_2) L_{12}], [√(α_1 α_2) L_{12}^T, α_2(Λ_{q_2,2} − ε I_{q_2})]],    (5.30)

with L_{12} ≡ Q_1^T J_1^T K_{12} J_2 Q_2. The inverse {α_1 S_1 + α_2 S_2}^{−1} is given by

{α_1 S_1 + α_2 S_2}^{−1} = ε^{−1}{ I_f − [Φ_1  Φ_2] B_{ch} [Φ_1  Φ_2]^T },    B_{ch} = P(ε I_{q_1+q_2} + L)^{−1} P^T.    (5.31)

We now show how to compute the two quantities needed in (5.12):

μ_i^T {α_1 S_1 + α_2 S_2}^{−1} μ_j = e_i^T Φ_i^T ε^{−1}{ I_f − [Φ_1  Φ_2] B_{ch} [Φ_1  Φ_2]^T } Φ_j e_j
 = ε^{−1}{ e_i^T K_{ij} e_j − e_i^T [K_{i1}  K_{i2}] B_{ch} [K_{1j}; K_{2j}] e_j } ≡ ε^{−1} ζ_{ij},    (5.32)

log { |α_1 S_1 + α_2 S_2| / (|S_1|^{α_1} |S_2|^{α_2}) }
 = Σ_{i=1}^{q_1+q_2} log(ε + γ_i) + (f − q_1 − q_2) log ε
   − α_1{ Σ_{i=1}^{q_1} log(λ_{i,1}) + (f − q_1) log ε }
   − α_2{ Σ_{i=1}^{q_2} log(λ_{i,2}) + (f − q_2) log ε }
 = α_1 Σ_{i=1}^{q_1+q_2} log ((ε + γ_i)/λ_{i,1}) + α_2 Σ_{i=1}^{q_1+q_2} log ((ε + γ_i)/λ_{i,2}),    (5.33)

where {λ_{i,1}; i = 1, 2, ..., q_1} and {λ_{i,2}; i = 1, 2, ..., q_2} are the eigenvalues of S_1 and S_2, respectively. Notice that (i) {λ_{i,1}; i = q_1+1, ..., q_1+q_2} and {λ_{i,2}; i = q_2+1, ..., q_1+q_2}, all equal to ε, are introduced only for notational convenience; (ii) the infinite dimensionality f in (5.32) and (5.33) has disappeared, as needed; and (iii) all calculations are based on the Gram matrix defined in (5.26).

Finally, we compute the Chernoff distance as follows (with α_1 = 1 − α and α_2 = α):

2 J_C(p_1, p_2) = ε^{−1} α_1 α_2 { ζ_{11} + ζ_{22} − 2ζ_{12} } + α_1 Σ_{i=1}^{q_1+q_2} log ((ε + γ_i)/λ_{i,1}) + α_2 Σ_{i=1}^{q_1+q_2} log ((ε + γ_i)/λ_{i,2}).    (5.34)

5.3.2 The KL divergence and the symmetric divergence

Computing the KL divergence in the RKHS reduces to collecting terms of the form μ_i^T S_j^{−1} μ_k and tr{S_i S_j^{−1}}:

μ_i^T S_j^{−1} μ_k = e_i^T Φ_i^T ε^{−1}(I_f − Φ_j B_j Φ_j^T) Φ_k e_k = ε^{−1}(e_i^T K_{ik} e_k − e_i^T K_{ij} B_j K_{jk} e_k) ≡ ε^{−1} ζ_{ijk}.    (5.35)

tr[S_i S_j^{−1}] = tr[(Φ_i A_i Φ_i^T + ε I_f) ε^{−1}(I_f − Φ_j B_j Φ_j^T)]
 = ε^{−1} tr[Φ_i A_i Φ_i^T] − ε^{−1} tr[Φ_i A_i Φ_i^T Φ_j B_j Φ_j^T] + f − tr[Φ_j B_j Φ_j^T]
 = ε^{−1} tr[A_i K_{ii}] − ε^{−1} tr[A_i K_{ij} B_j K_{ji}] + f − tr[B_j K_{jj}]
 = ε^{−1} tr[Λ_{q_i,i}] − q_i − ε^{−1} tr[A_i K_{ij} B_j K_{ji}] + f + ε tr[Λ_{q_j,j}^{−1}] − q_j
 = ε^{−1}{ tr[Λ_{q_i,i}] − ξ_{ij} } + ε tr[Λ_{q_j,j}^{−1}] + f − (q_i + q_j),    (5.36)

where ξ_{ij} ≡ tr[A_i K_{ij} B_j K_{ji}]. Finally, we obtain the KL divergence and the symmetric divergence in the RKHS by substituting (5.35) and (5.36) into (5.14) and (5.15) with d replaced by f:

2 J_R(p_1||p_2) = ε^{−1}{ ζ_{121} + ζ_{222} − ζ_{122} − ζ_{221} } + { log|Λ_{q_2,2}| − log|Λ_{q_1,1}| } + (q_1 − q_2) log ε + ε^{−1}{ tr[Λ_{q_1,1}] − ξ_{12} } + ε tr[Λ_{q_2,2}^{−1}] − (q_1 + q_2).    (5.37)
2 J_D(p_1, p_2) = ε^{−1}{ ζ_{111} + ζ_{121} + ζ_{212} + ζ_{222} − ζ_{112} − ζ_{122} − ζ_{211} − ζ_{221} } + ε^{−1}{ tr[Λ_{q_1,1}] + tr[Λ_{q_2,2}] − ξ_{12} − ξ_{21} } + ε{ tr[Λ_{q_1,1}^{−1}] + tr[Λ_{q_2,2}^{−1}] } − 2(q_1 + q_2).    (5.38)

5.3.3 The Patrick-Fisher distance

Given the derivations in Sections 5.3.1 and 5.3.2, computing the Patrick-Fisher distance J_P(p_1, p_2) is easily done by putting together the related terms:

J_P(p_1, p_2) = [2(2π)^f ε^{f−q_1} ∏_{i=1}^{q_1} λ_{i,1}]^{−1/2} + [2(2π)^f ε^{f−q_2} ∏_{i=1}^{q_2} λ_{i,2}]^{−1/2} − 2[2(2π)^f ε^{f−q_1−q_2} ∏_{i=1}^{q_1+q_2}(ε + γ_i)]^{−1/2} exp{ −ε^{−1}(ζ_{11} + ζ_{22} − 2ζ_{12}) },

where {γ_i; i = 1, 2, ..., q_1+q_2} are the eigenvalues of the L matrix defined in (5.30) with α = 1/2.

5.3.4 Limiting behavior

It is interesting to study the behavior of the distances as ε approaches zero. First,

lim_{ε→0} A = Ã ≡ J V_q V_q^T J^T,    lim_{ε→0} B = B̃ ≡ J V_q Λ_q^{−1} V_q^T J^T.    (5.39)

Then,

lim_{ε→0} ζ_{ijk} = ζ̃_{ijk} ≡ e_i^T K_{ik} e_k − e_i^T K_{ij} B̃_j K_{jk} e_k,    lim_{ε→0} ξ_{ij} = ξ̃_{ij} ≡ tr[Ã_i K_{ij} B̃_j K_{ji}].    (5.40)

Similarly,

lim_{ε→0} ζ_{ij} = ζ̃_{ij} ≡ e_i^T K_{ij} e_j − e_i^T [K_{i1}  K_{i2}] B̃_{ch} [K_{1j}; K_{2j}] e_j,    (5.41)

where B̃_{ch} = lim_{ε→0} B_{ch}. Finally,

lim_{ε→0} ε J_C(p_1, p_2) = J̃_C(p_1, p_2),    (5.42)
lim_{ε→0} ε J_R(p_1||p_2) = J̃_R(p_1||p_2),    (5.43)
lim_{ε→0} ε J_D(p_1, p_2) = J̃_D(p_1, p_2),    (5.44)

where

2 J̃_C(p_1, p_2) = α(1−α){ ζ̃_{11} + ζ̃_{22} − 2ζ̃_{12} },    (5.45)

2 J̃_R(p_1||p_2) = ζ̃_{121} + ζ̃_{222} − ζ̃_{122} − ζ̃_{221} + tr[Λ_{q_1,1}] − ξ̃_{12},    (5.46)

2 J̃_D(p_1, p_2) = ζ̃_{111} + ζ̃_{121} + ζ̃_{212} + ζ̃_{222} − ζ̃_{112} − ζ̃_{122} − ζ̃_{211} − ζ̃_{221} + tr[Λ_{q_1,1}] + tr[Λ_{q_2,2}] − ξ̃_{12} − ξ̃_{21}.    (5.47)

When α = 1/2, we obtain the limiting distance corresponding to the Bhattacharyya distance:

2 J̃_B(p_1, p_2) = (1/4){ ζ̃_{11} + ζ̃_{22} − 2ζ̃_{12} }.    (5.48)

The limiting behavior of the Patrick-Fisher distance J_P(p_1, p_2) is not interesting since it involves f, so we omit its discussion. As mentioned earlier, when ε = 0 and q_1 = q_2 = q, we actually use the subspace of the RKHS containing the top q eigenpairs. Therefore, the derived limiting distances calibrate the pattern separability on this subspace of the RKHS and carry many of the optimal features their original counterparts possess, while additionally being equipped with a nonlinear embedding.

5.3.5 Kernel for sets

A set here is a collection of observations. A kernel for sets is a two-input kernel function that takes two sets as inputs and satisfies the requirement of positive definiteness. Several kernels for sets have emerged in the literature. In [184], Wolf and Shashua proposed the kernel principal angle: the principal angle is defined as the angle between the principal subspaces of the two input sets and is then 'kernelized'. In [174], Jebara and Kondor showed that the Bhattacharyya coefficient [156], which operates on probability distributions defined on the original data space, is a kernel. In [175], they extended the Bhattacharyya kernel to operate on probability distributions defined on the RKHS. In [178], Moreno et al. proposed a kernel function based on the Kullback-Leibler divergence in the original data space.

It is obvious that our probabilistic distance measures can be adapted as kernel functions for sets. First, the Bhattacharyya kernel defined in [174] differs from the Bhattacharyya distance by −log(·). Second, the adaptation can be in the sense of [178]. Other ways are possible by utilizing the construction rules of kernel functions.

5.4 Experimental Results

In the following experiments, with both synthetic examples and a real face recognition application, we use only the limiting distances, namely J̃_C(p_1, p_2) (or J̃_B(p_1, p_2)), J̃_R(p_1||p_2), and J̃_D(p_1, p_2), since they do not depend on ε, which frees us from the burden of choosing it. Also, we set q_1 = q_2 = q.
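As a worked illustration of the limiting distances, the sketch below computes J̃_B between two sample sets from their Gram matrices, following Eqs. (5.28), (5.30), (5.41), and (5.48) at ε = 0. The RBF parameterization and all function names are assumptions for illustration, not the code used for the experiments that follow.

```python
import numpy as np

def rbf(A, B, sigma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def _top_eig(Kt, q):
    lam, V = np.linalg.eigh(Kt)
    return lam[::-1][:q], V[:, ::-1][:, :q]

def limiting_bhattacharyya(X1, X2, sigma=1.0, q=3):
    """Limiting (eps -> 0) Bhattacharyya distance in RKHS (sketch of Eqs. (5.41), (5.48))."""
    N1, N2 = len(X1), len(X2)
    K11, K12, K22 = rbf(X1, X1, sigma), rbf(X1, X2, sigma), rbf(X2, X2, sigma)
    K21 = K12.T
    e1, e2 = np.full(N1, 1 / N1), np.full(N2, 1 / N2)
    J1 = (np.eye(N1) - np.outer(e1, np.ones(N1))) / np.sqrt(N1)
    J2 = (np.eye(N2) - np.outer(e2, np.ones(N2))) / np.sqrt(N2)
    lam1, V1 = _top_eig(J1.T @ K11 @ J1, q)
    lam2, V2 = _top_eig(J2.T @ K22 @ J2, q)
    L12 = V1.T @ J1.T @ K12 @ J2 @ V2
    L = 0.5 * np.block([[np.diag(lam1), L12], [L12.T, np.diag(lam2)]])   # Eq. (5.30), eps = 0
    P = np.block([[np.sqrt(0.5) * J1 @ V1, np.zeros((N1, q))],
                  [np.zeros((N2, q)), np.sqrt(0.5) * J2 @ V2]])          # Eq. (5.28)
    Bch = P @ np.linalg.inv(L) @ P.T                                     # limit of Eq. (5.31)
    rows = {1: np.hstack([K11, K12]), 2: np.hstack([K21, K22])}
    cols = {1: np.vstack([K11, K21]), 2: np.vstack([K12, K22])}
    e = {1: e1, 2: e2}
    Kij = {(1, 1): K11, (1, 2): K12, (2, 1): K21, (2, 2): K22}

    def zeta(i, j):                                                      # Eq. (5.41)
        return e[i] @ Kij[(i, j)] @ e[j] - (e[i] @ rows[i]) @ Bch @ (cols[j] @ e[j])

    return 0.125 * (zeta(1, 1) + zeta(2, 2) - 2 * zeta(1, 2))            # Eq. (5.48)
```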
5.4.1 Synthetic examples

To illustrate the failure of the KL distance between two Gaussian fits in the original space, we designed four different 2-D densities sharing the same mean (zero mean) and covariance matrix (identity matrix). As shown in Figure 5.1, the four densities are a 2-D Gaussian and 'O'-, 'D'-, and 'X'-shaped uniform densities, where, say, the 'O'-shaped uniform density is uniform in the 'O'-shaped region and zero outside the region. Figure 5.1 actually shows 300 i.i.d. realizations sampled from these four densities. Due to the identical first- and second-order statistics, the probabilistic distance between any two of these densities in the original space is simply zero. This highlights the virtue of a nonlinear mapping that provides us with information embedded in higher-order statistics.

Figure 5.1: 300 i.i.d. realizations of four different densities with the same mean (zero mean) and covariance matrix (identity matrix). (a) 2-D Gaussian. (b) 'O'-shaped uniform. (c) 'D'-shaped uniform. (d) 'X'-shaped uniform.

Obviously, the probabilistic distances depend on q, the number of eigenpairs, and σ, the RBF kernel width. Figure 5.2 displays J̃_D and J̃_B as functions of q and σ. The effect of σ is biased: it always disfavors a large σ, since a large σ tends to pool the data together. For example, when σ is infinite, all data points collapse to one single point in the RKHS and become inseparable. Generally, it is not necessary that a large q (or, equivalently, a nonlinear subspace with a large dimension) yields a large distance; a typical subspace yielding the maximum distances is low-dimensional.

Figure 5.2: (a) The symmetric divergence J̃_D(σ, q) and (b) the Bhattacharyya distance J̃_B(σ, q) between the 2-D Gaussian and the 'O'-shaped uniform as a function of σ and q.

Table 5.1 lists some computed values of the probabilistic distances. It is interesting to observe that when the shapes of two densities are close, their distance is small. For example, 'O' is closest to 'D' among all possible pairs, and the closest density to the 2-D Gaussian is the 'O'-shaped uniform.

(a) J̃_R(p_1||p_2)   Gau     'O'     'D'     'X'
    Gau              -       .0740   .0782   .0808
    'O'              .0584   -       .0281   .0523
    'D'              .0670   .0295   -       .0436
    'X'              .0944   .0505   .0417   -

(b) J̃_B(p_1, p_2)   Gau     'O'     'D'     'X'
    Gau              -       .0033   .0037   .0048
    'O'              .0033   -       .0021   .0099
    'D'              .0037   .0021   -       .0086
    'X'              .0048   .0099   .0086   -

Table 5.1: (a) The KL distances in the RKHS with σ = 1 and q = 3. (b) The Bhattacharyya distances in the RKHS with σ = 0.5 and q = 1. p_1 is listed in the first column and p_2 in the first row.

5.4.2 Face recognition from a group of images

The gallery set consists of 15 sets of images (one per person), while the probe set consists of 15 new sets of the same people (one per person). In these sets, the people move their heads freely, so pose and illumination variations abound. The existence of these variations violates the Gaussianity assumption on the original data space used in [91]. Figure 5.3 shows some example faces of the 4th gallery person, the 9th gallery person, and the 4th probe person (whose identity is the same as that of the 4th gallery person).
The face images shown, of size 32 by 32, are automatically cropped from video sequences (courtesy of [84]) using a flow tracking algorithm.

                                       Symmetric divergence   Bhattacharyya distance
J̃(p_1, p_2) in the RKHS               13/15                  13/15
J(p_1, p_2) in the original space R^d  11/15                  11/15
Table 5.2: The recognition scores obtained using the symmetric divergence and the Bhattacharyya distance.

A generic principal component analysis is performed to reduce the dimensionality to 300. Figure 5.3 also plots the first three PCA coefficients of the 4th gallery person, the 9th gallery person, and the 4th probe person. Clearly, the manifolds are highly nonlinear, which indicates the need for nonlinear modeling.

Table 5.2 reports the recognition rates; the top match with the smallest distance is declared the winner. For comparison, we also implemented the approaches that use the symmetric divergence [91] and the Bhattacharyya distance in the original space for face recognition. Clearly, using the distances in the RKHS yields better results: out of 15 probe sets, we successfully classified 13. In fact, Figure 5.3 shows a misclassification example in [91], where the 4th probe person is misclassified as the 9th gallery person, while our approach corrects this error.

Figure 5.3: Examples of face images in the gallery and probe sets. (a) The 4th gallery person in 10 frames (every 8 frames) of an 80-frame sequence. (b) The 9th gallery person in 10 frames (every 10 frames) of a 105-frame sequence. (c) The 4th probe person in 10 frames (every 6 frames) of a 60-frame sequence. (d) The plot of the first three PCA coefficients of the above three sets.

Part III: Face Tracking and Recognition from Videos

Chapter 6

Adaptive Visual Tracking

Particle filtering [114, 157, 159, 153, 6] is an inference technique [3, 18] for estimating the unknown motion state θ_t from a noisy collection of observations y_{1:t} = {y_1, ..., y_t} arriving in a sequential fashion. A state space model is often employed to accommodate such a time series. Two important components of this approach are the state transition and observation models, whose most general forms can be defined as follows:

State transition model:  θ_t = f_t(θ_{t−1}, u_t),    (6.1)
Observation model:       y_t = g_t(θ_t, v_t),    (6.2)

where u_t is the system noise, f_t(·,·) characterizes the kinematics, v_t is the observation noise, and g_t(·,·) models the observer. The particle filter approximates the posterior distribution p(θ_t|y_{1:t}) by a set of weighted particles {θ_t^{(j)}, w_t^{(j)}}_{j=1}^{J}. The state estimate θ̂_t can then be the minimum mean square error (MMSE) estimate,

θ̂_t = θ̂_t^{mmse} = E[θ_t|y_{1:t}] ≈ J^{−1} Σ_{j=1}^{J} w_t^{(j)} θ_t^{(j)},    (6.3)

where E is the expectation operator, the maximum a posteriori (MAP) estimate,

θ̂_t = θ̂_t^{map} = arg max_{θ_t} p(θ_t|y_{1:t}) ≈ arg max_{θ_t^{(j)}} w_t^{(j)},    (6.4)

or other forms based on p(θ_t|y_{1:t}).

The state transition model characterizes the motion change between frames. In a visual tracking problem, it is ideal to have an exact motion model governing the kinematics of the object. In practice, however, approximate models are used. There are two types of approximations commonly found in the literature. (i) One is to learn a motion model directly from a training video [118, 124]. However, such a model may overfit the training data and may not necessarily succeed when presented with test videos containing objects moving arbitrarily at different times and places. Also, one cannot always rely on the availability of training data.
(ii) The other is to fit a fixed constant-velocity model with fixed noise variance, as in [109, 133, 135, 185]:

θ_t = θ_{t−1} + ν_t + u_t,    (6.5)

where ν_t is a constant velocity, i.e., ν_t = ν_0, and u_t has a fixed noise variance of the form u_t = r_0 · u_0, with r_0 a fixed constant measuring the extent of noise and u_0 a 'standardized' random variable/vector.1 Since a constant ν_0 has difficulty handling arbitrary movement, ν_0 is typically set to ν_0 = 0. If r_0 is small, it is very hard to model rapid movements; if r_0 is large, it is computationally inefficient since many more particles are needed to accommodate the large noise variance. All these factors make such a model ineffective. In this chapter, we overcome this by introducing an adaptive-velocity model.

1 Consider the scalar case for example. If u_t is distributed as N(0, σ²), we can write u_t = σ u_0, where u_0 is standard normal N(0, 1). This also applies to multivariate cases.

While the contour is the visual cue used in many tracking algorithms [118], another class of tracking approaches [115, 127, 185] exploits an appearance model A_t. In its simplest form, we have the following observation equation:2

z_t = T{y_t; θ_t} = A_t + v_t,    (6.6)

where z_t is the image patch of interest in the video frame y_t, parameterized by θ_t. In [115], a fixed template, A_t = A_0, is matched with observations to minimize a cost function in the form of a sum of squared distances (SSD). This is equivalent to assuming that the noise v_t is a normal random vector with zero mean and a diagonal (isotropic) covariance matrix. At the other extreme, one could use a rapidly changing model [127], say A_t = ẑ_{t−1}, i.e., the 'best' patch of interest in the previous frame. However, a fixed template cannot handle appearance changes in the video, while a rapidly changing model is susceptible to drift. Thus, it is necessary to have a model that is a compromise between these two cases. In [120], Jepson et al. proposed an online appearance model (OAM) for a robust visual tracker, which is a mixture of three components; two EM algorithms are used, one for updating the appearance model and the other for deriving the tracking parameters. Our approach to visual tracking is to make both the observation and state transition models adaptive in the framework of a particle filter, with provisions for handling occlusion. The main features of our tracking approach are as follows:

2 For the sake of simplicity, we denote z_t ≡ T{y_t; θ_t}, z_t^{(j)} ≡ T{y_t; θ_t^{(j)}}, and ẑ_t ≡ T{y_t; θ̂_t}. Also, we can always vectorize the 2-D image by a lexicographical scanning of all pixels and denote the number of pixels by d.

- Appearance-based. The only visual cue used in our tracker is the 2-D appearance; i.e., we employ only image intensities, though in general features derived from image intensities, such as the phase information of filter responses [120] or the Gabor feature graph representation [85], are also applicable. No prior object models are invoked. In addition, we use only gray-scale images.

- Adaptive observation model. We adopt an appearance-based approach. The original OAM is modified and then embedded in our particle filter. Therefore, the observation model is adaptive, as the appearance A_t involved in (6.6) is adaptive.

- Adaptive state transition model. Instead of using a fixed model, we use an adaptive-velocity model, where the adaptive motion velocity ν_t is predicted using a first-order linear approximation based on the appearance difference between the incoming observation and the previous particle configuration.
  We also use an adaptive noise component, i.e., u_t = r_t · u_0, whose magnitude r_t is a function of the prediction error. It is natural to vary the number of particles based on the degree of uncertainty r_t in the noise component.

- Handling occlusion. Occlusion is handled using robust statistics [11, 115, 108]. We robustify the likelihood measurement and the adaptive velocity estimate by downweighting the 'outlier' pixels. If occlusion is declared, we stop updating the appearance model and estimating the motion velocity.

Chapter organization

This chapter is organized as follows. We briefly review the related literature on visual tracking and particle filters in Section 6.1. We examine the details of an adaptive observation model in Section 6.2.1, with a special focus on the adaptive appearance model, and of an adaptive state transition model in Section 6.2.2, with a special focus on how to calculate the motion velocity. Handling occlusion is discussed in Section 6.2.3, and experimental results on tracking vehicles and human faces are presented in Section 6.3.

6.1 Related Literature

6.1.1 Visual tracking

Roughly speaking, previous work on visual tracking can be divided into two groups: deterministic tracking and stochastic tracking. Our approach combines the merits of both stochastic and deterministic tracking approaches in a unified framework using a particle filter. We give below a brief review of both approaches.

Deterministic approaches usually reduce to an optimization problem, e.g., minimizing an appropriate cost function. The definition of the cost function is a key issue. A common choice in the literature is the SSD used in many optical flow approaches [115].3 A gradient descent algorithm is most commonly used to find the minimum; very often, only a local minimum can be reached. In [115], the cost function is defined as the SSD between the observation and a fixed template, and the motion is parameterized as affine; hence the task is to find the affine parameters minimizing the cost function. Using a Taylor series expansion and keeping only the first-order terms, a linear prediction equation is obtained. It has been shown that for the affine case the system matrix can be computed efficiently since a fixed template is used. Mean shift [113] is an alternative deterministic approach to visual tracking, where the cost function is derived from the color histogram.

3 We note that using SSD is equivalent to using a model where the noise obeys an i.i.d. Gaussian distribution; therefore this case can also be viewed as stochastic tracking.

Stochastic tracking approaches often reduce to an estimation problem, e.g., estimating the state of a time series state space model. Early works [106, 112] used the Kalman filter or its variants [1] to provide solutions; however, this restricts the type of model that can be used. Recently, sequential Monte Carlo (SMC) algorithms [6, 114, 157, 159], which can model nonlinear/non-Gaussian cases, have gained prevalence in the tracking literature, due in part to the CONDENSATION algorithm [118]. Stochastic tracking improves robustness over its deterministic counterpart through its capability of escaping local minima, since the search directions are for the most part random, even though they are governed by a deterministic state transition model.
Toyama and Blake [130] proposed a probabilistic paradigm for tracking with the following properties: exemplars are learned from the raw training data and embedded in a mixture density; the kinematics is also learned; and the likelihood measurement is constructed on a metric space. Other approaches are discussed in Section 6.1.2. However, as far as computational load is concerned, stochastic algorithms are in general more demanding. Note that the stochastic approaches can often be formulated as optimization problems.

6.1.2 Particle filter

General particle filter algorithm

Given the state transition model in (6.1), characterized by the state transition probability p(θ_t|θ_{t−1}), and the observation model in (6.2), characterized by the likelihood function p(y_t|θ_t), the problem reduces to computing the posterior probability p(θ_t|y_{1:t}). The nonlinearity/non-normality in (6.1) and (6.2) renders the Kalman filter [1] ineffective. The particle filter is a means to approximate the posterior distribution p(θ_t|y_{1:t}) by a set of weighted particles S_t = {θ_t^{(j)}, w_t^{(j)}}_{j=1}^{J} with Σ_{j=1}^{J} w_t^{(j)} = 1. It can be shown [159] that S_t is properly weighted with respect to p(θ_t|y_{1:t}) in the sense that, for every bounded function h(·),

lim_{J→∞} Σ_{j=1}^{J} w_t^{(j)} h(θ_t^{(j)}) = E_p[h(θ_t)].    (6.7)

Given S_{t−1} = {θ_{t−1}^{(j)}, w_{t−1}^{(j)}}_{j=1}^{J}, which is properly weighted with respect to p(θ_{t−1}|y_{1:t−1}), we first resample S_{t−1} to reach a new set of samples with equal weights, {θ'^{(j)}_{t−1}, 1}_{j=1}^{J}. We then draw samples {u_t^{(j)}}_{j=1}^{J} for u_t and propagate θ'^{(j)}_{t−1} to θ_t^{(j)} via (6.1). The new weight is updated as

w_t ∝ p(y_t|θ_t).    (6.8)

The complete algorithm is summarized in Figure 6.1.

Initialize a sample set S_0 = {(θ_0^{(j)}, 1)}_{j=1}^{J} according to the prior distribution p(θ_0).
For t = 1, 2, ...
    For j = 1, 2, ..., J
        Resample S_{t−1} = {θ_{t−1}^{(j)}, w_{t−1}^{(j)}} to obtain a new sample (θ'^{(j)}_{t−1}, 1).
        Predict the sample by drawing u_t^{(j)} for u_t and computing θ_t^{(j)} = f_t(θ'^{(j)}_{t−1}, u_t^{(j)}).
        Compute the transformed image z_t^{(j)} = T{y_t; θ_t^{(j)}}.
        Update the weight using w_t^{(j)} = p(y_t|θ_t^{(j)}) = p(z_t^{(j)}|θ_t^{(j)}).
    End
    Normalize the weights using w_t^{(j)} = w_t^{(j)} / Σ_{j=1}^{J} w_t^{(j)}.
End

Figure 6.1: The general particle filter algorithm.
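As a concrete illustration of Figure 6.1, here is a minimal generic particle filter for a toy one-dimensional state. The function names and the toy random-walk model are assumptions for illustration only, not the tracker developed in this chapter.

```python
import numpy as np

def particle_filter(y, f, likelihood, prior_sample, J=200, rng=None):
    """Generic particle filter in the spirit of Figure 6.1.
    f(theta, u): state transition; likelihood(y_t, theta): p(y_t | theta);
    prior_sample(J): draws the initial particles."""
    rng = np.random.default_rng() if rng is None else rng
    theta = prior_sample(J)                          # S_0 with equal weights
    w = np.full(J, 1.0 / J)
    estimates = []
    for yt in y:
        # resample to equal weights
        theta = theta[rng.choice(J, size=J, p=w)]
        # predict: draw system noise and propagate through the state transition model
        u = rng.normal(0.0, 1.0, size=J)
        theta = f(theta, u)
        # update and normalize the weights with the observation likelihood
        w = likelihood(yt, theta)
        w = w / w.sum()
        estimates.append(np.sum(w * theta))          # weighted state estimate (cf. Eq. (6.3))
    return np.array(estimates)

# toy usage: random-walk state observed in Gaussian noise
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true = np.cumsum(rng.normal(0, 0.5, 50))
    obs = true + rng.normal(0, 1.0, 50)
    est = particle_filter(
        obs,
        f=lambda th, u: th + 0.5 * u,
        likelihood=lambda yt, th: np.exp(-0.5 * (yt - th) ** 2),
        prior_sample=lambda J: rng.normal(0, 1, J),
        rng=rng)
```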
We use a modified version of OAM as developed in [120]. The differences between our appearance model and the original OAM are highlighted below. Mixture appearance model The original OAM assumes that the observations are explained by different causes, thereby indicating the use of a mixture density of components. In the originalOAM 145 presented in [120], three components are used, namely the W-component charac- terizing the two-frame variations, the S-component depicting the stable structure within all past observations (though it is slowly-varying), and the L-component accounting for outliers such as occluded pixels. We modify the OAM to accommodate our appearance analysis in the following aspects. (i) We directly use the image intensities while they use phase information derived from image intensities. Direct use of image intensities is computationally more efficient than using the phase information that requires filtering and visually more interpretable. (ii) As an option, in order to further stabilize the tracker one could use an F-component which is a fixed template that one is expecting to observe most often. For example, in face tracking this could be just the facial image as seen from a frontal view. In the sequel, we derive the equations as if there is an F-component. However, the effect of this component can be ignored by setting its initial mixing probability to zero. (iii) We embed the appearance model in a particle filter to perform tracking while they use the EM algorithm. (iv) In our implementation, we do not incorporate the L-component because we model the occlusion in a different manner (using robust statistics) as discussed in Section 6.2.3. We now describe the mixture appearance model. The appearance model at time t, At = {Wt,St,Ft}, is a time-varying one that models the appearances present in all observations up to time t?1. It obeys a mixture of Gaussians, with Wt,St,Ft as mixture centers {?i,t; i = w,s,f} and their corresponding variances {?2i,t; i = w,s,f} and mixing 146 probabilities {mi,t; i = w,s,f}. Notice that {mi,t,?i,t,?2i,t; i = w,s,f} are ?images? consisting of d pixels that are assumed to be independent of each other. In summary, the observation likelihood is written as p(yt|?t) = p(zt|?t) = dproductdisplay j=1 { summationdisplay i=w,s,f mi,t(j)N(zt(j);?i,t(j),?2i,t(j))}, (6.10) where N(x;?,?2) is a normal density N(x;?,?2) = (2pi?2)?1/2 exp{??(x??? )}, ?(x) = 12x2. (6.11) Model update To keep the chapter self-contained, we show how to update the current appearance model At to At+1 after ?zt becomes available, i.e., we want to compute the new mixing probabilities, mixture centers, and variances for time t+ 1, {mi,t+1,?i,t+1,?2i,t+1; i = w,s,f}. It is assumed that the past observations are exponentially ?forgotten? with re- spect to their contributions to the current appearance model. Denote the expo- nential envelop by ?exp(???1(t ? k)) for k ? t, where ? = nh/log2, nh is the half-life of the envelope in frames, and ? = 1?exp(???1) to guarantee that the area under the envelope is 1. We just sketch the updating equations as follows and refer the interested readers to [120] for technical details and justifications. The EM algorithm [152] is invoked. Since we assume that the pixels are in- dependent of each other, we can deal with each pixel separately. The following 147 computation is valid for j = 1,2,...,d where d is the number of pixels in the appearance model. First, the posterior responsibility probabilities are computed as oi,t(j) ? 
Model update

To keep the chapter self-contained, we show how to update the current appearance model A_t to A_{t+1} after ẑ_t becomes available, i.e., we want to compute the new mixing probabilities, mixture centers, and variances for time t+1, {m_{i,t+1}, μ_{i,t+1}, σ²_{i,t+1}; i = w,s,f}.

It is assumed that the past observations are exponentially 'forgotten' with respect to their contributions to the current appearance model. Denote the exponential envelope by α exp(−τ^{−1}(t − k)) for k ≤ t, where τ = n_h/log 2, n_h is the half-life of the envelope in frames, and α = 1 − exp(−τ^{−1}) guarantees that the area under the envelope is 1. We just sketch the updating equations below and refer the interested reader to [120] for technical details and justifications.

The EM algorithm [152] is invoked. Since we assume that the pixels are independent of each other, we can deal with each pixel separately; the following computation is valid for j = 1, 2, ..., d, where d is the number of pixels in the appearance model. First, the posterior responsibility probabilities are computed as

o_{i,t}(j) ∝ m_{i,t}(j) N(ẑ_t(j); μ_{i,t}(j), σ²_{i,t}(j)),  i = w,s,f,   with  Σ_{i=w,s,f} o_{i,t}(j) = 1.    (6.12)

Then, the mixing probabilities are updated as

m_{i,t+1}(j) = α o_{i,t}(j) + (1 − α) m_{i,t}(j);  i = w,s,f,    (6.13)

and the first- and second-moment images {M_{p,t+1}; p = 1, 2} are evaluated as

M_{p,t+1}(j) = α ẑ_t^p(j) o_{s,t}(j) + (1 − α) M_{p,t}(j);  p = 1, 2.    (6.14)

Finally, the mixture centers and variances are updated as

S_{t+1}(j) = μ_{s,t+1}(j) = M_{1,t+1}(j) / m_{s,t+1}(j),    σ²_{s,t+1}(j) = M_{2,t+1}(j) / m_{s,t+1}(j) − μ²_{s,t+1}(j),    (6.15)

W_{t+1}(j) = μ_{w,t+1}(j) = ẑ_t(j),    σ²_{w,t+1}(j) = σ²_{w,1}(j),    (6.16)

F_{t+1}(j) = μ_{f,t+1}(j) = F_1(j),    σ²_{f,t+1}(j) = σ²_{f,1}(j).    (6.17)

Model initialization

To initialize A_1, we set W_1 = S_1 = F_1 = T_0 (with T_0 supplied by a detection algorithm or manually), choose {m_{i,1}, σ²_{i,1}; i = w,s,f}, and set M_{1,1} = m_{s,1} z_0 and M_{2,1} = m_{s,1} σ²_{s,1} + T_0².

6.2.2 Adaptive state transition model

The state transition model we use incorporates a term for modeling adaptive velocity. The adaptive velocity is calculated using a first-order linear prediction method based on the appearance differences between two successive frames; the previous particle configuration is incorporated in the prediction scheme.

Constructing the particle configuration involves the costly computation of image warping (in the experiments reported here, it usually accounts for about half of the computation). In a conventional particle filtering algorithm, the particle configuration is used only to update the weights, i.e., the weight for each particle is computed by comparing the warped image with the online appearance model using the observation equation. Our approach, in addition, uses the particle configuration in the state transition equation. In some sense, we 'maximally' utilize the information contained in the particles (without wasting the costly computation of image warping) since we use it in both the state and observation models.

In [128], random samples are guided by deterministic search. Momentum for each particle is computed as the sum of absolute differences between two frames. If the momentum is below a threshold, a deterministic search is first performed using a gradient descent method and a small number of offspring is then generated using stochastic diffusion; otherwise, stochastic diffusion is performed to generate a large number of offspring. The stochastic diffusion is based on a second-order autoregressive process. However, the gradient descent method does not utilize the previous particle configuration in its entirety. Also, the generated particle configuration could deviate severely from the second-order autoregressive model, which clearly implies the need for an adaptive model.
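Before moving to the velocity computation, here is a compact rendering of the appearance-model update rules (6.12)-(6.16) above; it is one possible sketch under the stated equations (the F-component is left untouched, per (6.17)), with all names assumed.

```python
import numpy as np

def oam_update(z_hat, m, mu, var, M1, M2, alpha):
    """One appearance-model update: responsibilities (6.12), mixing probabilities (6.13),
    moment images (6.14), and the S- and W-components (6.15)-(6.16). Sketch only."""
    comps = ('w', 's', 'f')
    # (6.12) posterior responsibilities, normalized per pixel
    o = {i: m[i] * np.exp(-0.5 * (z_hat - mu[i]) ** 2 / var[i]) / np.sqrt(2 * np.pi * var[i])
         for i in comps}
    tot = sum(o.values())
    o = {i: o[i] / tot for i in comps}
    # (6.13) mixing probabilities
    m_new = {i: alpha * o[i] + (1 - alpha) * m[i] for i in comps}
    # (6.14) first- and second-moment images driven by the S-responsibility
    M1_new = alpha * z_hat * o['s'] + (1 - alpha) * M1
    M2_new = alpha * z_hat ** 2 * o['s'] + (1 - alpha) * M2
    # (6.15)-(6.16) component centers and variances
    mu_new, var_new = dict(mu), dict(var)
    mu_new['s'] = M1_new / m_new['s']
    var_new['s'] = M2_new / m_new['s'] - mu_new['s'] ** 2
    mu_new['w'] = z_hat                      # W-component is reset to the newest patch
    return m_new, mu_new, var_new, M1_new, M2_new
```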
??t similarequal?Bt(T{yt; ??t}??zt?1), (6.21) where Bt is the pseudo-inverse of the Ct matrix, which can be efficiently estimated from the available data ?t?1 and Zt?1. Specifically, to estimate Bt we stack into matrices the differences in motion vectors and image patches, using ??t?1 and ?zt?1 as pivotal points: ??t?1 = [?(1)t?1 ? ??t?1, ..., ?(J)t?1 ? ??t?1], (6.22) ?Zt?1 = [z(1)t?1 ??zt?1, ..., z(J)t?1 ??zt?1]. (6.23) The least square (LS) solution for Bt is Bt = (??t?1?ZTt?1)(?Zt?1?ZTt?1)?1. (6.24) 150 However, it turns out that the matrix ?Zt?1?ZTt?1 is very often rank-deficient due to the high dimensionality of the data (unless the number of the particles at least exceeds the data dimension). To overcome this, we use the SVD as ?Zt?1 = USVT (6.25) It can be easily shown that Bt = ??t?1VS?1UT. (6.26) To gain some computational efficiency, we can further approximate Bt = ??t?1VqS?1q UTq , (6.27) by retaining the top q components. Notice that if only a fixed template is used [121], the B matrix is fixed and pre-computable. But, in our case, the appearance is changing so that we have to compute the Bt matrix in each time step. In practice, one may run several iterations till ?zt = T{yt; ??t + ?t} stabilizes, i.e., the error epsilon1t defined below is small enough. epsilon1t = ?(?zt,At) = 2d dsummationdisplay j=1 { summationdisplay i=w,s,f mi,t(j)?(?zt(j)??i,t(j)? i,t(j) )}. (6.28) In (6.28), epsilon1t measures the distance between T{yt; ??t + ?t} and the updated ap- pearance model At. The iterations proceed as follows: We initially set ??1t = ??t?1. For the first iteration, we compute ?1t as usual. For the kth iteration, we use the predicted ??kt = ??k?1t + ?k?1t as a pivotal point for the Taylor expansion in (6.19) and the rest of the calculation then follows. It is rather beneficial to run sev- eral iterations especially when the object moves very fast in two successive frames since ??t?1 might cover the target in yt in a small portion. After one iteration, the computed ?t might be not accurate, but indicates a good minimization direction. Using several iterations helps to find ?t (compared to ??t?1) more accurately. 151 We use the following adaptive state transition model ?t = ??t?1 + ?t +ut, (6.29) where ?t is the predicted shift in the motion vector. The choice of ut is discussed below. One should note that we are not using (6.29) as a proposal function to draw particles, which requires using (6.9) to compute the particle weight. Instead we directly use it as the state transition model and hence use (6.8) to compute the particle weight. Our model can be easily interpreted as a time-varying state model. It is interesting to note that the approach proposed in [131] also uses motion cues as well as color parameter adaptation. Our approach is different from [131] in that: (i) We use the motion cue in the state transition model while they use it as part of observations; (ii) We only use the gray images without using the color cue which is used in [131]; and (iii) We use an adaptive appearance model which is updated by the EM algorithm while they use an adaptive color model which is updated by a stochastic version of the EM algorithm. Adaptive noise The value of epsilon1t determines the quality of prediction. 
Therefore, if epsilon1t is small, which implies a good prediction, we only need noise with small variance to absorb the residual motion; if epsilon1t is large, which implies a poor prediction, we then need noise with large variance to model the potentially large jumps in the motion state. To this end, we use ut of the form ut = rt?u0, where rt is a function of epsilon1t. Since epsilon1t defined in (6.28) is a ?variance?-type measure, we use rt = max(min(r0?epsilon1t,rmax),rmin), (6.30) 152 where rmin is the lower bound to maintain a reasonable sample coverage and rmax is the upper bound to constrain the computational load. Adaptive number of particles If the noise variance rt is large, we need more particles, while conversely, fewer particles are needed for noise with small variance rt. Based on the principle of asymptotic relative efficiency (ARE) [3], we should adjust the particle number Jt in a similar fashion, i.e., Jt = J0rt/r0. (6.31) Fox [154] also presents an approach to improve the efficiency of particle filters by adapting the particle numbers on-the-fly. His approach is to divide the state space into bins and approximate the posterior distribution by a multinomial distribution. A small number of particles is used if the density is focused on a small part of the state space and a large number of particles if the uncertainty in the state space is high. In this way, the error between the empirical distribution and the true distribution (approximated as a multinomial in his analysis) measured by Kullback-Leilber distance is bounded. However, in his approach, since the state space (only 2D) is exhaustively divided, the number of particles is at least several thousand, while our approach uses at most a few hundred. Our attempt is not to explore the state space (6-D affine space) exhaustively, but only regions that have high potential for the object to be present. 153 Comparison between the adaptive velocity model and the zero velocity model We demonstrate the necessity of the adaptive velocity model by comparing it with the zero velocity model. Figure 6.2 shows the particle configurations created from the adaptive velocity model (with Jt < J0 and rt < r0 computed as above) and the zero velocity model (with Jt = J0 and rt = r0). Clearly, the adaptive- velocity model generates particles very efficiently, i.e, they are tightly centered around the object of interest so that we can easily track the object at time t; while the zero-velocity model generates more particles widely spread to explore larger regions, leading to unsuccessful tracking as widespread particles often lead to a local minimum. Tracking result at t?1 Particle configuration at t Tracking result at t Figure 6.2: Particle configurations from (top row) the adaptive velocity model and (bottom row) the zero-velocity model. 6.2.3 Handling occlusion Occlusion is usually handled in two ways. One way is to use joint probabilistic data associative filter (JPDAF) [2, 126]; and the other one is to use robust statistics 154 [11]. We use robust statistics here. Robust statistics We assume that occlusion produces large image differences which can be treated as ?outliers?. Outlier pixels cannot be explained by the underlying process and their influences on the estimation process should be reduced. Robust statistics provide such mechanisms. We use the ?? function defined as follows: ??(x) = ?? ?? ??? 1 2x 2 if |x| ? c cx? 12c2 if |x| > c , (6.32) where x is normalized to have unit variance and the constant c controls the outlier rate. 
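For reference, this robust cost and the associated outlier test are simple to implement. The sketch below is illustrative only; the function names are ours, and the threshold c is left as a parameter whose experimental value is given next.

```python
import numpy as np

def rho(x, c):
    # Huber-type robust cost of Eq. (6.32); x is assumed to be normalized
    # to unit variance, and c controls the outlier rate.
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= c, 0.5 * x**2, c * np.abs(x) - 0.5 * c**2)

def outlier_mask(x, c):
    # Pixels whose normalized residual exceeds c are declared outliers.
    return np.abs(np.asarray(x, dtype=float)) > c
```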
In our experiment, we take c = 1.435 based on experimental experience. If |x| > c is satisfied, we declare the corresponding pixel as an outlier. Robust likelihood measure and adaptive velocity estimate The likelihood measure defined in Eq. (6.10) involves a multi-dimensional normal density. Since we assume that each pixel is independent, we consider the one- dimensional normal density. To make the likelihood measure robust, we replace the one-dimensional normal density N(x;?,?2) by ?N(x;?,?2) = (2pi?2)?1/2 exp(???(x??? )). (6.33) Note that this is not a density function any more, but since we are dealing with discrete approximation in the particle filter, normalization makes it a probability mass function. Existence of outlier pixels severely violates the constant brightness constraint and hence affects our estimate of the adaptive velocity. To downweight the influ- 155 ence of the outlier pixels in estimating the adaptive velocity, we introduce a d?d diagonal matrix Lt with its ith diagonal element being Lt(i) = ?(xi) where xi is the pixel intensity of the difference image (T{yt; ??t}??zt?1) normalized by the variance of the OAM stable component and ?(x) = 1xd??(x)dx = ? ??? ??? 1 if |x| ? c c/|x| if |x| > c , (6.34) Eq. (6.21) becomes ?t similarequal ?BtLt(T{yt; ??t?1}??zt?1). (6.35) This is similar in principle to the weighted least square algorithm. Occlusion declaration If the number of the outlier pixels in?zt (compared with the OAM), say dout, exceeds a certain threshold, i.e., dout > ?d where 0 < ? < 1 (we take ? = 0.15), we declare occlusion. Since the OAM has more than one component, we count the number of outlier pixels with respect to every component and take the maximum. If occlusion is declared, we stop updating the appearance model and estimating the motion velocity. Instead, we (i) keep the current appearance model, i.e., At+1 = At and (ii) set the motion velocity to zero, i.e., ?t = 0 and use the maximum number of particles sampled from the diffusion process with largest variance, i.e., rt = rmax, and Jt = Jmax. The adaptive particle filtering algorithm with occlusion analysis is summarized in Figure 6.3. 156 Initialize a sample set S0 = {?(j)0 ,1/J0)}J0j=1 according to prior distribution p(?0). Initialize the appearance model A1. Set OCCFLAG = 0 to indicate no occlusion. For t = 1,2,... If (OCCFLAG == 0) Calculate the state estimate ??t?1 by Eq. (6.3) or (6.4), the adaptive velocity ?t by Eq. (6.21), the noise variance rt by Eq. (6.30), and the particle number Jt by Eq. (6.31). Else rt = rmax, Jt = Jmax, ?t = 0. End For j = 1,2,...,Jt Draw the sample u(j)t for ut with variance rt. Construct the sample ?(j)t = ??t?1 + ?t + u(j)t by Eq. (6.29). Compute the transformed image z(j)t . Update the weight using w(j)t = p(yt|?(j)t ) = p(z(j)t |?(j)t ). End Normalize the weight using w(j)t = w(j)t /summationtextJj=1 w(j)t . Set OCCFLAG according to the number of the outlier pixels in ?zt. If (OCCFLAG == 0) Update the appearance model At+1 using ?zt. End End Figure 6.3: The proposed visual tracking algorithm with occlusion handling. 6.3 Experimental results on visual tracking In our implementation, we used the following choices. We consider affine transfor- mation only. Specifically, the motion is characterized by ? = (a1,a2,a3,a4,tx,ty) where {a1,a2,a3,a4}are deformation parameters and{tx,ty}denote the 2-D trans- 157 lation parameters. 
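The next paragraph spells out the complete transformation T{y;θ} step by step (affine warp, crop, zero-mean-unit-variance normalization). As a rough sketch only, and under conventions the text does not fix (we assume the 2x2 deformation matrix is laid out as [[a1, a2], [a3, a4]], that (tx, ty) locate the patch centre, and we use nearest-neighbour sampling), the patch extraction could look like this:

```python
import numpy as np

def extract_patch(frame, theta, patch_shape):
    # T{y; theta}: affine-warp the frame, crop a patch of patch_shape at
    # (tx, ty), and apply zero-mean-unit-variance normalization (the only
    # photometric correction used in the experiments).
    a1, a2, a3, a4, tx, ty = theta
    A = np.array([[a1, a2], [a3, a4]], dtype=float)      # assumed layout of the deformation
    h, w = patch_shape
    ys, xs = np.mgrid[0:h, 0:w]
    grid = np.stack([ys - (h - 1) / 2.0, xs - (w - 1) / 2.0]).reshape(2, -1)
    src = A @ grid + np.array([[ty], [tx]])              # assumed: (tx, ty) is the patch centre
    r = np.clip(np.round(src[0]).astype(int), 0, frame.shape[0] - 1)
    c = np.clip(np.round(src[1]).astype(int), 0, frame.shape[1] - 1)
    patch = frame[r, c].reshape(h, w).astype(float)      # nearest-neighbour sampling
    return (patch - patch.mean()) / (patch.std() + 1e-8)
```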
Even though significant pose/illumincation changes are present in the video, we believe that our adaptive appearance model can easily absorb them and therefore for our purposes the affine transformation is a reasonable approxi- mation. Regarding photometric transformations, only a zero-mean-unit-variance normalization is used to partially compensate for contrast variations. The com- plete image transformation T{y;?} is implemented as follows: affine transform y using {a1,a2,a3,a4}, crop out the region of interest at position {tx,ty} with the same size as the still template in the appearance model, and perform zero-mean- unit-variance normalization. We demonstrate our algorithm by tracking a disappearing car, a moving tank acquired by a camera mounted on a micro air vehicle, and a moving face under occlusion. Table 6.1 summarizes some statistics about the video sequences and the appearance model size used. We initialize the particle filter and the appearance model with a detector algo- rithm (we actually used the face detector described in [132] for the face sequence) or a manually specified image patch in the first frame. r0 and J0 are also manually set, depending on the sequence. 6.3.1 Car tracking We first test our algorithm to track a vehicle with the F-component but without occlusion analysis. The result of tracking a fast moving car is shown in Figure 6.4 (column 1)4. The tracking result is shown with a bounding box. We also show the stable and wandering components separately (in a double-zoomed size) at the corner of each frame. The video is captured by a camera mounted on the 4Accompanying videos are available at http://www.cfar.umd.edu/?shaohua/research/. 158 Video Car Tank Face # of frames 500 300 800 Frame size 576x768 240x360 240x360 At size 24x30 24x30 30x26 Occlusion No No Yes (twice) ?adp? o o x ?fa? o o x ?fm? x x x ?fb? x x x ?adp & occ? o o o Table 6.1: Comparison of tracking results obtained by particle filters with different configurations. ?At size? means pixel size in the component(s) of the appearance model. ?o? means success in tracking. ?x? means failure in tracking. car. In this footage the relative velocity of the car with respect to the camera platform is very large, and the target rapidly decreases in size. Our algorithm?s adaptive particle filter successfully tracks this rapid change in scale. Figure 6.5(a) plots the scale estimate (calculated as radicalBig (a21 + a22 + a23 + a24)/2 ) recovered by our algorithm. It is clear that the scale follows a decreasing trend as time proceeds. The pixels located on the car in the final frame are about 12 by 15 in size, which makes the vehicle almost invisible. In this sequence we set J0 = 50 and r0 = 0.25. The algorithm implemented in a standard Matlab environment processes about 1.2 frames per second (with J0 = 50) running on a PC with a PIII 650 CPU and 512M memory. 159 Frame 1 Frame 100 Frame 300 Frame 500 Figure 6.4: The car sequence. Notice the fast scale change present in the video. Column 1: the tracking results obtained with an adaptive motion model and an adaptive appearance model (?adp?). Column 2: the tracking results obtained with an adaptive motion model but a fixed appearance model (?fa?). In this case, the corner shows the tracked region. Column 3: the tracking results obtained with an adaptive appearance model but a fixed motion model (?fm?). 6.3.2 Tank tracking in an aerial video Figure 6.6 shows our results on tracking a tank in an aerial video with degraded image quality due to motion blur. 
Also, the movement of the tank is very jerky and arbitrary because of platform motion, as seen in Figure 6.5(b) which plots the 160 0 100 200 300 400 5000 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 time scale estimate 0 50 100 150 200 250 300 350 0 50 100 150 200 column index row index 0 50 100 150 200 250 30060 70 80 90 100 110 120 130 140 time particle number (a) (b) (c) 0 50 100 150 200 250 3000 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 time mean square error adp fa 0 100 200 300 400 500 600 700 8001 1.5 2 2.5 3 3.5 time scale estimate (d) (e) Figure 6.5: (a) The scale estimate for the car. (b) The 2-D trajectory of the centroid of the tracked tank. ?*? means the starting and ending points and ?.? points are marked along the trajectory every 10 frames. (c) The particle number Jt vs. t obtained when tracking the tank. (d) The MSE invoked by the ?adp? and ?fa? algorithms. (e) The scale estimate for the face sequence. 2-D trajectory of the centroid of the tracked tank every 10 frames, covering from the left to the right in 300 frames. Although the tank moved about 100 pixels in column index in a certain period of 10 frames, the tracking is still successful. Figure 6.5(c) displays the plot of actual number of particles Jt as a function of time t. The average number of particle is about 83, where we set J0 to be 100, which means that in this case we actually saved about 20% in computation by using an adaptive Jt instead of a fixed number of particles. To further illustrate the importance of the adaptive appearance model, we 161 Frame 1 Frame 31 Frame 49 Frame 116 Frame 228 Frame 300 Figure 6.6: Tracking a moving tank in a video acquired by an airborne camera. computed the mean square error (MSE) invoked by two particle filter algorithms, one (referred as?adp? inSection 6.3.4)using the adaptive appearance model and the other (referred as ?fa? in Section 6.3.4) using a fixed appearance model. Computing the MSE for the ?fa? algorithm is straightforward, with T0 denoting the fixed template, MSEfa(t) = d?1 dsummationdisplay j=1 (?zt(j)?T0(j))2. (6.36) Computing the MSE for the ?adp? algorithm is as follows: MSEadp(t) = d?1 dsummationdisplay j=1 { summationdisplay i=w,s,f mi,t(?zt(j)??i,t(j))2}. (6.37) Figure 6.5(d) plots the functions of MSEfa(t) and MSEadp(t). Clearly, using the adaptive appearance model invokes smaller MSE for almost all 300 frames. The average MSE for the ?adp? algorithm is 0.1394 5 while that for the ?fa? algorithm is 0.3169! 5The range of MSE is very reasonable since we are using image patches after the zero-mean- unit-variance normalization not the raw image intensities. 162 6.3.3 Face tracking We present one example of successful tracking of a human face using a hand-held video camera in an office environment, where both camera and object motion are present. Figure 6.7 presents the tracking results on the video sequence featuring the following variations: moderate lighting variations, quick scale changes (back and forth)in themiddle of thesequence, andocclusion (twice). The results areobtained by incorporating the occlusion analysis in the particle filter, but we did not use the F-component. Notice that the adaptive appearance model remains fixed during occlusion. Figure 6.8 presents the tracking results obtained using the particle filter without occlusion analysis. We have found that the predicted velocity actually accounts for the motion of the occluding hand since the outlier pixels (mainly on the hand) dominate the image difference (T{yt; ??t}??zt?1). 
Updating the appearance model deteriorates the situation. Figure 6.5(e) plots the scale estimate against time t. We clearly observe a rapid scale change (a sudden increase followed by a decrease within about 50 frames) in the middle of the sequence (though hard to display the recovered scale estimates are in perfect synchrony with the video data). 6.3.4 Comparison We illustrate the effectiveness of our adaptive approach (?adp?) by comparing the particle filter either with (a) an adaptive motion model but a fixed appearance model (?fa?), or with (b) a fixed motion model but an adaptive appearance model (?fm?); or with (c) a fixed motion model and a fixed appearance model (?fb?). Table 163 Frame 1 Frame 145 Frame 148 Frame 155 Frame 470 Frame 517 Frame 685 Frame 695 Frame 800 Figure 6.7: The face sequence. Frames 145, 148, and 155 show the first occlusion. Frames 470 and 517 show the smallest and largest face observed. Frames 685, 690, and 710 show the second occlusion. 6.1 lists the tracking results obtained using particle filters under the above situa- tions, where ?adp & occ? refers to the adaptive approach with occlusion handling. Figure 6.4 also shows the tracking results on the car sequence when the ?fa? and ?fm? options are used. Table 6.1 seems to suggest that the adaptive motion model plays a more im- portant role than the adaptive appearance model since ?fa? always yields successful tracking while ?fm? fails, the reasons being that (i) the fixed motion model is unable to adapt to quick motion present in the video sequences, and (ii) the appearance 164 Frame 1 Frame 145 Frame 148 Frame 155 Frame 170 Frame 200 Figure 6.8: Tracking results on the face sequence using the adaptive particle filter without occlusion analysis. changes in the video sequences, though significant in some cases, are still within the range of the fixed appearance model. However, as seen in the videos, ?adp? produces much smoother tracking results than ?fa?, demonstrating the power of the adaptive appearance model. 165 Chapter 7 Simultaneous Tracking and Recognition Following [58], we define a still-to-video scenario: the gallery consists of still facial templates and the probe set consists of video sequences containing the facial re- gion. Denote the gallery as I = {I1,I2,...,IN}, indexed by the identity variable n, which lies in a finite sample space N = {1,2,...,N}. Though significant research has been conducted on the still-to-still face recognition problem, research efforts on still-to-video recognition, are relatively fewer due to the following challenges [27] in typical surveillance applications: poor video quality, significant illumina- tion and pose variations, and low image resolution. Most existing video-based recognition systems [79] attempt the following: the face is first detected and then tracked over time. Only when a frame satisfying certain criteria (size, pose) is acquired, recognition is performed using still-to-still recognition technique. For this, the face part is cropped from the frame and transformed or registered using appropriate transformations. This tracking-then-recognition approach attempts to resolve uncertainties in tracking and recognition sequentially and separately. 166 There are several unresolved issues in the tracking-then-recognition approach: criteria for selecting good frames and estimation of parameters for registration. Also, still-to-still recognition does not effectively exploit temporal information. 
A common strategy that selects several good frames, performs recognition on each frame and then votes on these recognition results for a final solution is rather ad hoc. To overcome these difficulties, we propose a tracking-and-recognition approach, which attempts to resolve uncertainties in tracking and recognition simultaneously in a unified probabilistic framework. To fuse temporal information, the time series state space model is adopted to characterize the evolving kinematics and identity in the probe video. Three basic components of the model are: ? a motion equation governing the kinematic behavior of the tracking motion vector, ? an identity equation governing the temporal evolution of the identity variable, ? an observation equation establishing a link between the motion vector and the identity variable. Using the SIS [114, 118, 153, 157, 159] technique, the joint posterior distribution of the motion vector and the identity variable, i.e., p(nt,?t|y0:t) 1 is estimated at each time instant and then propagated to the next time instant governed by mo- tion and identity equations. The marginal distribution of the identity variable, i.e., p(nt|y0:t), is estimated to provide a recognition result. An SIS algorithm is developed to approximate the distribution p(nt|y0:t) in the still-to-video scenario. 1For notational convenience, e.g. in (7.5) and (7.6), we introduce in this chapter a dummy variable y0. 167 It achieves computational efficiency over its CONDENSATION counterpart by con- sidering the discrete nature of the identity variable. It is worth emphasizing that (i) our model can take advantage of any still-to- still recognition algorithm [41, 44, 48, 62] by embedding distance measures used therein in our likelihood measurement; and (ii) it allows a variety of image repre- sentations and transformations. Section 7.3.4 presents an enhancement technique by incorporating the sophisticated appearance-based models in Chapter 6. The ap- pearance models are used for tracking (modeling inter-frame appearance changes) and recognition (modeling appearance changes between video frames and gallery images), respectively. Table 7.1 summarizes the proposed approach and others, in term of using temporal information. Process Operation Temporal information Visual tracking Modeling the inter-frame Used in tracking differences Visual recognition Modeling the difference between Not applicable probe and gallery images Tracking-then-recognition Combining tracking and Used only in tracking recognition sequentially Tracking-and-recognition Unifying tracking and Used in both tracking recognition and recognition Table 7.1: Use of temporal information in various tracking/recognition processes. Chapter organization The organization of the chapter is as follows: Section 7.1 reviews some related stud- ies on (i) face modeling and recognition and (ii) video-based tracking and recog- 168 nition in the literature. Section 7.2 introduces the time series state space model for recognition and establishes the time-evolving behavior of p(nt|y0:t). Section 7.2.3 briefly reviews the SIS principles from the viewpoint of a general state space model and develops a SIS algorithm to solve the still-to-video recognition prob- lem, with special emphasis on its computational efficiency. Section 7.3 describes the experimental scenarios for still-to-video recognition and presents results using data collected at UMD, NIST/USF, and CMU (MoBo database) as part of the DARPA HumanID effort. 
7.1 Related Literature 7.1.1 Face modeling and recognition Statistical approaches to face modeling have been very popular since Turk and Pentland?s work on eigenface [62]. In the statistical approach, the two-dimensional appearance of face image is treated as a vector by scanning the image in lexico- graphical order, with the vector dimension being the number of pixels in the image. In the eigenface approach [62], all face images consists of a distinctive face sub- space. This subspace is linear and spanned by the eigenvectors of the covariance matrix found using PCA. Typically we keep the number of eigenvectors much less than the true dimension of the vector space. The task of face recognition is then to find the closest matches in this face subspace. However, PCA might not be ef- ficient in terms of recognition accuracy since the construction of the face subspace does not capture discrimination between humans. This motivates the use of LDA [41, 44] and its variants. In LDA, the linear subspace is constructed [7] in such a manner that the within-class scatter is minimized and the between-class scatter is 169 maximized. This idea is further generalized in the approach called Bayesian face recognition [55], where intra-personal space (IPS) and extra-personal space (EPS) are used in lieu of within-class scatter and between-class scatter measures. The IPS models the variations in the appearance of the same individual and the EPS models the variations in appearances due to differences in the identity. Probabilis- tic subspace density is then fitted on each space. A Bayesian decision is taken using a maximum a posteriori (MAP) rule to determine the identity. In the famous EGM [48] algorithm, the face is represented as a labeled graph. The nodes of the graph are located at facial landmarks, e.g., the pupils, the tip of nose, etc. Also, each node is labeled with jets derived from responses obtained by convolving the image with a family of Gabor functions. The edge characterizes the geometric distance between two nodes. Face recognition is then formalized as a graph matching problem. All the above approaches are based on 2-D appearance and perform poorly when significant pose and illumination variations are present [58]. To completely resolve such challenges, 3-D face modeling [66, 83] is necessary. However, building a 3-D face model is a very difficult and complicated task in the literature even though structure from motion has been studied for several decades. 7.1.2 Video-based tracking and recognition Nearly all video-based recognition systems apply still-image-based recognition to selected good frames. The face images are warped into frontal views whenever pose and depth information about the faces is available [79]. In [82, 90, 93], RBF (Radial Basis Function) networks are used for tracking and recognition purposes. In [82], the system uses an RBF (Radial Basis Function) 170 network for recognition. Since no warping is done, the RBF network has to learn the individual variations as well as possible transformations. The performance appears to vary widely, depending on the size of the training data. [93] presents a fully automatic person authentication system. The system uses video break, face detection, and authentication modules and cycles over successive video images until a high recognition confidence is reached. 
This system was tested on three image sequences; the first was taken indoors with one subject present, the second was taken outdoors with two subjects, and the third was taken outdoors with one subject in stormy conditions. Perfect results were reported on all three sequences, when verified against a database of 20 still face images. In [92], a system called PersonSpotter is described. This system is able to capture, track and recognize a person walking toward or passing a stereo CCD camera. It has several modules, including a head tracker, and a landmark finder. The landmark finder uses a dense graph consisting of 48 nodes learned from 25 example images to find landmarks such as eyes and nose tip. An elastic graph matching scheme is employed to identify the face. A multimodal based person recognition system is described in [79]. This system consists of a face recognition module, a speaker identification module, and a classi- fier fusion module. The most reliable video frames and audio clips are selected for recognition. The 3D head information is used to detect the presence of an actual person as opposed to an image of that person. Recognition and verification rates of 100% were achieved for 26 registered clients. In [87, 88], recognition of face over time is implemented by constructing a face identity surface. The face is first warped to a frontal view, and its Kernel Discriminant Analysis (KDA) features over time form a trajectory. It is shown 171 that the trajectory distances accumulate recognition evidence over time. In [86], a generic approach to simultaneous object tracking and verification is proposed. The approach is based on posterior probability density estimation using sequential Monte Carlo methods [118, 153, 157, 159]. Tracking is formulated as a probability density propagation problem and the algorithm also provides verifi- cation results. However, no systematic evaluation of recognition was done. Our approach looks similar to this algorithm; however, there are significant differences from the algorithm described in [86]. (i) In [86], basically only the tracking motion vector is parameterized in the state-space model. The identity is involved only in the initialization step to rectify the template onto the first frame of the sequence. However, in our approach both tracking motion vector and identity variables are parameterized in the state-space model, which offers us one more degree of freedom and leads to a different approach for deriving the solution. (ii) The SIS technique is used in both approaches to numerically approximate the posterior probability given the observation. Again in [86], it is the posterior probability of motion vector and the verification probability is estimated by marginalizing over a proper region of state space redefined at each time instant. However, we always compute the joint density, i.e., the posterior probability of motion vector and identity variable and the posterior probability of identity variable is just a free estimate obtained by marginalizing over the motion vector. Note that there is no time propagation of verification probability in [86] while we always propagate the joint density. One consequence is that we guarantee that summationtextnt?N p(nt|y0:t) = 1, but there is no such guarantee in [86]. 172 7.2 StochasticModelsand Algorithmsfor Recog- nition from Video In this section, we present the details on the propagation model for recognition and discuss its impact on the posterior distribution of identity variable. 
7.2.1 Time series state space model Motion equation In its most general form, the motion model can be written as ?t = g(?t?1,ut); t ? 1, (7.1) where ut is noise in the motion model, whose distribution determines the motion state transition probability p(?t|?t?1). The function g(.,.) characterizes the evolv- ing motion and it could be a function learned o?ine or given a priori. One of the simplest choice is an additive function, i.e., ?t = ?t?1 + ut, which leads to a first-order Markov chain. Choice of ?t is application dependent. Affine motion parameters are often used when there is no significant pose variation available in the video sequence. However, if a 3-D face model is used, then the 3-D motion parameters should be used accordingly. Identity equation nt = nt?1; t ? 1, (7.2) assuming that the identity does not change as time proceeds. 173 Observation equation By assuming that the transformed observation is a noise-corrupted version of some still template in the gallery, the observation equation can be written as T{yt;?t} = Int +vt; t ? 1, (7.3) where vt is observation noise at time t, whose distribution determines the observa- tion likelihood p(yt|nt,?t), and T{yt;?t} is a transformed version of the observation yt. This transformation could be either geometric or photometric or both. How- ever, when confronting sophisticated scenarios, this model is far from sufficient. One should use the complicated likelihood measurement as shown in Section 7.3.2. We assume statistical independence between all noise variables and prior knowl- edge on the distributions p(?0|y0) and p(n0|y0). Using the overall state vector xt = (nt,?t), Eq. (7.1) and (7.2) can be combined into one state equation (in a normal sense) which is completely described by the overall state transition proba- bility p(xt|xt?1) = p(nt|nt?1)p(?t|?t?1) . (7.4) Given this model, our goal is to compute the posterior probability p(nt|y0:t). It is in fact a probability mass function (PMF) since nt only takes values from N = {1,2,...,N}, as well as a marginal probability of p(nt,?t|y0:t), which is a mixed- type distribution. Therefore, the problem is reduced to computing the posterior probability. 7.2.2 Posterior probability of identity variable The evolution of the posterior probability p(nt|y0:t) as time proceeds is very in- teresting to study as the identity variable does not change by assumption, i.e., 174 p(nt|nt?1) = ?(nt ?nt?1), where ?(.) is a discrete impulse function at zero. Using time recursion, Markov properties, and statistical independence embed- ded in the model, we can easily derive: p(n0:t,?0:t|y0:t) = p(n0:t?1,?0:t?1|y0:t?1)p(yt|nt,?t)p(nt|nt?1)p(?t|?t?1)p(y t|y0:t?1) = p(n0,?0|y0) tproductdisplay s=1 p(ys|ns,?s)p(ns|ns?1)p(?s|?s?1) p(ys|y0:s?1) = p(n0|y0)p(?0|y0) tproductdisplay s=1 p(ys|ns,?s)?(ns ?ns?1)p(?s|?s?1) p(ys|y0:s?1) .(7.5) Therefore, by marginalizing over ?0:t and n0:t?1, we obtain p(nt = l|y0:t) = p(l|y0) integraldisplay ?0 ... integraldisplay ?t p(?0|y0) tproductdisplay s=1 p(ys|l,?s)p(?s|?s?1) p(ys|y0:s?1) d?t ...d?0. (7.6) Thus p(nt = l|y0:t) is determined by the prior distribution p(n0 = l|y0) and the product of the likelihood functions, producttextts=1 p(ys|l,?s). If a uniform prior is assumed, then producttextts=1 p(ys|l,?s) is the only determining factor. In the appendix, we show that, under some minor assumptions, the poste- rior probability for the correct identity l, p(nt = l|y0:t), is lower-bounded by an increasing curve which converges to 1. 
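To make the recursion in (7.5)-(7.6) concrete, the toy sketch below (our own illustration, not the experimental code) updates the identity PMF by multiplying in per-frame likelihoods that are assumed to have already been marginalized over the motion samples, for instance by the particle filter described later. Because the identity is fixed over time, even a small but consistent likelihood advantage for the correct identity drives its posterior toward one, which is the behaviour formalized above.

```python
import numpy as np

def update_identity_posterior(log_post, frame_loglik):
    # One recursion of p(n_t | y_{0:t}) in Eq. (7.6): multiply the running
    # posterior by the marginalized frame likelihood of each identity and
    # renormalize so the PMF sums to one.
    log_post = log_post + frame_loglik
    return log_post - np.logaddexp.reduce(log_post)

# Toy illustration with synthetic log-likelihoods.
N, T = 5, 40
rng = np.random.default_rng(0)
log_post = np.full(N, -np.log(N))        # uniform prior p(n_0 | y_0)
for t in range(T):
    loglik = rng.normal(0.0, 0.3, size=N)
    loglik[2] += 0.4                     # identity 2 is the correct one in this toy
    log_post = update_identity_posterior(log_post, loglik)
print(np.exp(log_post))                  # the mass concentrates on identity 2
```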
To measure the evolving uncertainty remaining in the identity variable as ob- servations accumulate, we use the notion of entropy [4]. In the context of this problem, conditional entropy H(nt|y0:t) is used. However, the knowledge of p(y0:t) is needed to compute H(nt|y0:t). We assume that it degenerates to an impulse at the actual observations ?y0:t since we observe only this particular sequence, i.e., p(y0:t) = ?(y0:t ??y0:t). Thus, H(nt|y0:t) = ? Nsummationdisplay nt=1 p(nt|?y0:t)log2 p(nt|?y0:t). (7.7) Under the assumptions listed in the appendix, we expect that H(nt|y0:t) decreases 175 as time proceeds since we start from an equi-probable distribution to a degenerate one. 7.2.3 SIS algorithms and computational efficiency Consider a general time series state space model fully determined by (i) the over- all state transition probability p(xt|xt?1), (ii) the observation likelihood p(yt|xt), and (iii) prior probability p(x0) and statistical independence among all the noise variables. We wish to compute the posterior probability p(xt|y0:t). If the modelis linear withGaussian noise, itis analytically solvable by a Kalman filter which essentially propagates the mean and variance ofa Gaussian distribution over time. For nonlinear and non-Gaussian cases, an extended Kalman filter (EKF) and its variants have been used to arrive at an approximate analytic solution [1]. Recently, the SIS technique or particle filter algorithm, a special case of Monte Carlo method, [118, 153, 157, 159] has been used to provide a numerical solution and propagate an arbitrary distribution over time. However, since we are dealing with a mixed-type distribution, additional properties are available to be exploited when developing the SIS algorithms. First, two following two propositions are useful. Proposition 7.1 When pi(x) is a PMF defined on a finite sample space, the proper sample set should exactly include all samples in the sample space. Proposition 7.2 If a set of weighted random samples {(x(m),y(m),w(m))}Mm=1 is proper with respect to pi(x,y), then a new set of weighted random samples {(yprime(k),wprime(k))}Kk=1, which is proper with respect to pi(y), the marginal of pi(x,y), can be constructed as follows: 1) Remove the repetitive samples from {y(m)}Mm=1 to obtain {yprime(k)}Kk=1, where all 176 yprime(k)?s are distinct; 2) Sum the weight w(m) belonging to the same sample yprime(k) to obtain the weight wprime(k), i.e., wprime(k) = Msummationdisplay m=1 w(m)?(y(m) ?yprime(k)) (7.8) In the context of this framework, the posterior probability p(nt,?t|y0:t) is rep- resented by a set of indexed and weighted samples St = {(n(m)t ,?(m)t ,w(m)t )}Mm=1 (7.9) with nt as the above index. By Proposition 7.2, we can sum the weights of the samples belonging to the same index nt to obtain a proper sample set {nt,?nt}Nnt=1 with respect to the posterior PMF p(nt|y0:t). A straightforward implementation of the particle filter algorithm (Figure 7.1) for simultaneous tracking and recognition is not efficient in terms of its compu- tational load. Since N = {1,2,...,N} is a countable sample space, we need N samples for the identity variable nt according to Proposition 7.1. Assume that, for each identity variable nt, J samples are needed to represent ?t. Hence, we need M = J ?N samples in total. Further assume that one resampling step takes Tr seconds (s), one predicting step Tp s, computing one transformed image Tt s, evaluating likelihood once Tl s, one updating step Tu s. 
Obviously, the bulk of computation is J?N?(Tr+Tp+Tt+Tl) s to deal with one video frame as the com- putational time for the normalizing step and the marginalizing step is negligible. It is well known that computing the transformed image is much more expensive than other operations, i.e., Tt >> max(Tr,Tp,Tl). Therefore, as the number of templates N grows, the computational load increases dramatically. There are various approaches in the literature to reduce the computational cost of the conventional particle filter algorithm. In [128], random particles are guided 177 Initialize a sample set S0 = {(n(m)0 ,?(m)0 ,1)}Mm=1 according to prior distribu- tions p(n0|y0) and p(?0|y0). For t = 1,2,... For m = 1,2,...,M Resample St?1 = {(n(m)t?1,?(m)t?1,w(m)t?1)}Mm=1 to obtain a new sample (nprime(m)t?1 ,?prime(m)t?1 ,1). Predict a sample by drawing (n(m)t ,?(m)t ) from p(nt|nprime(m)t?1 ) and p(?t|?prime(m)t?1 ). Compute the transformed image z(m)t = T{yt;?(m)t }. Update the weight using ?(m)t = p(yt|n(m)t ,?(m)t ). End Normalize each weight using w(m)t = ?(m)t /summationtextMm=1 ?(m)t . Marginalize over ?t to obtain the weight ?nt for nt. End Figure 7.1: The conventional particle filter algorithm for simultaneous tracking and recognition. by deterministic search. Assumed density filtering approach [148], different from particle filter, is even more efficient. Those approaches are general and do not explicitly exploit the special structure of the distribution in this setting: a mixed distribution of continuous and discrete variables. To this end, we propose the following algorithm. As the sample space N is countably finite, an exhaustive search of sample space N is possible. Mathematically, we release the random sampling in the identity variable nt by constructing samples as follows: for each ?(j)t , (1,?(j)t ,w(j)t,1),(2,?(j)t ,w(j)t,2),...,(N,?(j)t ,w(j)t,N). 178 We in fact use the following notation for the sample set, St = {(?(j)t ,w(j)t ,w(j)t,1,w(j)t,2,...,w(j)t,N)}Jj=1, (7.10) with w(j)t = summationtextNn=1 w(j)t,n. The proposed algorithm is summarized in Figure 7.2. Initialize a sample set S0 = {(?(j)0 ,N,1,...,1)}Jj=1 according to prior distribu- tion p(?0|z0). For t = 1,2,... For j = 1,2,...,J Resample St?1 = {(?(j)t?1,w(j)t?1)}Jj=1 to obtain a new sample (?prime(j)t?1,1,wprime(j)t?1,1,...,wprime(j)t?1,N), where wprime(j)t?1,n = w(j)t?1,n/w(j)t?1 for n = 1,2,...,N. Predict a sample by drawing (?(j)t ) from p(?t|?prime(j)t?1). Compute the transformed image z(m)t = T{yt;?(m)t }. For n = 1,...,N Update the weight using ?(j)t,n = wprime(j)t?1,n ?p(yt|n,?(j)t ). End End Normalize each weight using w(j)t,n = ?(j)t,n/summationtextNn=1summationtextJj=1 ?(j)t,n and w(j)t = summationtextNn=1 w(j)t,n. Marginalize over ?t to obtain the weight ?nt for nt. End Figure 7.2: The computationally efficient particle filter algorithm for simultaneous tracking and recognition. The crux of this algorithm lies in the fact that, instead of propagating random samples on both motion vector and identity variable, we can keep the samples on the identity variable fixed and let those on the motion vector be random. Although 179 we propagate only the marginal distribution for motion tracking, we still propagate the joint distribution for recognition purposes. The bulk of computation of the proposed algorithm is J?(Tr+Tp+Tt)+J?N?Tl s, a tremendous improvement over the conventional particle filter when dealing with a large database since the majority computational time J?Tt does not depend on N. 
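A compact Python rendering of one time step of the algorithm in Figure 7.2 is sketched below. It is illustrative only: warp(), likelihood(), and propagate() are hypothetical callables standing in for T{.;.}, p(y_t|n,θ_t), and p(θ_t|θ_{t-1}), and the (J, N) weight array mirrors the sample set in (7.10). The structural point is visible in the inner loops: the costly warp is computed once per motion sample and reused across all N identities.

```python
import numpy as np

def efficient_sis_step(thetas, weights, frame, gallery, warp, likelihood,
                       propagate, rng):
    # One time step of the efficient simultaneous tracking-and-recognition
    # filter of Figure 7.2.
    #   thetas  : (J, p) motion samples from the previous step
    #   weights : (J, N) per-sample, per-identity weights w_{t-1,n}^{(j)}
    J, N = weights.shape
    total = weights.sum(axis=1)                         # w_{t-1}^{(j)}
    idx = rng.choice(J, size=J, p=total / total.sum())  # resample on the total weight
    cond = weights[idx] / total[idx, None]              # w'_{t-1,n} after resampling
    thetas = np.array([propagate(thetas[j], rng) for j in idx])
    new_w = np.empty((J, N))
    for j in range(J):
        patch = warp(frame, thetas[j])                  # one costly warp per particle ...
        for n in range(N):                              # ... shared by all N identities
            new_w[j, n] = cond[j, n] * likelihood(patch, gallery[n])
    new_w /= new_w.sum()                                # normalize over all (j, n)
    beta = new_w.sum(axis=0)                            # marginal posterior p(n_t | y_{0:t})
    return thetas, new_w, beta
```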
7.3 Still-to-Video Face Recognition Experiments In this section we describe the still-to-video scenarios used in our experiments and their practical model choices, followed by a discussion of experiments. Three databases are used in the still-to-video experiments. Database-0 was collected outside a building. Subjects walked straight towards a video camera inorder tosimulate typical scenarios in visual surveillance. Database- 0 includes one face gallery, and one probe set. The images in the gallery are listed in Figure 7.3. The probe contains 12 videos, one for each individual. Figure 7.3 gives some frames in a probe video. In Database-1, we have video sequences with subjects walking in a slant path towards the camera. There are 30 subjects, each having one face template. There are one face gallery and one probe set. The face gallery is shown in Figure 7.4. The probe contains 30 video sequences, one for each subject. Figure 7.4 gives some example frames extracted from one probe video. As far as imaging conditions are concerned, the gallery is very different from the probe, especially in lighting. This is similar to the ?FC? test protocol of the FERET test [58]. These images/videos were collected, as part of the HumanID project, by National Institute of Standards and Technology and University of South Florida researchers. 180 Database-2, Motion of Body (MoBo) database, was collected at the Carnegie Mellon University [81] under the HumanID project. There are 25 different indi- viduals in total. The video sequences show the individuals walking on a tread-mill so that they move their heads naturally. Different walking styles have been simu- lated to assure a variety of conditions that are likely to appear in real life: walking slowly, walking fast, inclining and carrying an object. Therefore, four videos per person and 99 videos in total ( with one carrying video missing ) are available. However, the probe set we use in this section includes only 25 slowWalk videos. Some example images of the videos (slowWalk) are shown in Figure 7.5. Figure 7.5 also shows the face gallery in Database-2 with face images in almost frontal view cropped from probe videos and then normalized using their eye positions. Table 7.2 summaries the features of the three databases. Database Database-0 Database-1 Database-2 No. of subjects 12 30 25 Gallery Frontal face Frontal face Frontal face Motion in probe Walking straight Walking in an angle Walking towards the camera towards the camera on tread-mill Illumination variation No Large No Pose variation No Slight Large Table 7.2: Summary of three databases experimented. 7.3.1 Results for Database-0 We consider an affine transformation. Specifically, the motion is characterized by ? = (a1,a2,a3,a4,tx,ty) where {a1,a2,a3,a4} are deformation parameters and {tx,ty}are2-Dtranslation parameters. Itis a reasonable approximation since there 181 Figure 7.3: Database-0. The 1st row: the face gallery with image size being 30?26. The 2nd and 3rd rows: 4 example frames in one probe video with image size being 320?240 while the actual face size ranges approximately from 30?30 in the first frame to 50?50 in the last frame. Notice that the sequence is taken under a well- controlled condition so that there are no illumination or pose variations between the gallery and the probe. is no significant out-of-plane motion as the subjects walk towards the camera. 
Re- garding the photometric transformation, only zero-mean-unit-variance operator is performed to partially compensate for contrast variations. The complete trans- formation T{y;?} is processed as follows: affine transform y using {a1,a2,a3,a4}, 182 Figure 7.4: Database-1. The 1st row: the face gallery with image size being 30?26. The 2nd and 3rd rows: 4 example frames in one probe video with image size being 720?480 while the actual face size ranges approximately from 20?20 in the first frame to 60?60 in the last frame. Notice the significant illumination variations between the probe and the gallery. crop out the interested region at position {tx,ty} with the same size as the still template in the gallery, and perform zero-mean-unit-variance operation. Prior distribution p(?0|y0) is assumed to be Gaussian, whose mean comes from 183 Figure 7.5: Database-2. The 1st row: the face gallery with image size being 30?26. The 2nd and 3rd rows: some example frames in one probe video (slowWalk). Each video consists of 300 frames (480?640 pixels per frame) captured at 30 Hz. The inner face regions in these videos contain between 30?30 and 40?40 pixels. Notice the significant pose variation available in the video. the initial detector and whose covariance matrix is manually specified. A time-invariant first-order Markov Gaussian model with constant velocity is used for modeling motion transition. Given the scenario that the subject is walking 184 towards the camera, the scale increases with time. However, under perspective projection, this increase is no longer linear, causing the constant-velocity model to be not optimal. However, experimental results show that as long as the samples of ? can cover the motion, this model is sufficient. The likelihood measurement is simply set as a ?truncated? Laplacian: p1(yt|nt,?t) = LAP(bardblT{yt;?t}?Intbardbl;?1,?1) (7.11) where, bardbl.bardbl is sum of absolute distance, ?1 and ?1 are manually specified, and LAP(x;?,?) = ? ??? ??? ??1 exp(?x/?) if x ? ?? ??1 exp(??) otherwise (7.12) Gaussian distribution is widely used as a noise model, accounting for sensor noise, digitization noise, etc. However, given the observation equation: vt = T{yt;?t}? Int, the dominant part of vt becomes the high-frequency residual if ?t is not proper, and it is well known that the high-frequency residual of natural images is more Laplacian-like. The ?truncated? Laplacian is used to give a ?surviving? chance for samples to accommodate abrupt motion changes. Figure 7.6 presents the plot of the posterior probability p(nt|y0:t), the condi- tional entropy H(nt|y0:t) and the minimum mean square error (MMSE) estimate of the scale parameter sc = radicalBig (a21 + a22 + a23 + a24)/2, all against t. In Figure 7.3, the tracked face is superimposed on the image using a bounding box. Suppose the correct identity for Figure 7.3 is l. From Figure 7.6, we can easily observe that the posterior probability p(nt = l|y0:t) increases as time proceeds and eventually approaches 1, and all others p(nt = j|y0:t) for j negationslash= l go to 0. Figure 7.6 also plots the decrease in conditional entropy H(nt|y0:t) and the increase in scale parameter, which matches with the scenario of a subject walking towards a camera. 185 Table 7.3 summarizes the average recognition performance and computational time of the conventional and the proposed particle filter algorithm when applied to Database-0. Both algorithms achieved 100% recognition rate with top match. 
The proposed algorithm is much more efficient than the conventional one. It is more than 10 times faster as shown in Table I. This experiment was implemented in C++ on a PC with P-III 1G CPU and 512M RAM with the number of motion samples J chosen to be 200, the number of templates in the gallery N to be 12. 5 10 15 20 25 30 35 40 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 time t posterior probability p(n t|z 0:t ) 5 10 15 20 25 30 35 40 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 time t posterior probability p(n t|z 0:t ) 5 10 15 20 25 30 35 40 0 0.5 1 1.5 2 2.5 3 3.5 time t conditional entropy H(n t|z 0:t ) 5 10 15 20 25 30 35 40 0.9 1 1.1 1.2 1.3 1.4 1.5 time t scale estime Figure 7.6: Posterior probability p(nt|y0:t) against time t, obtained by the CONDENSATION algorithm (top left) and the proposed algorithm (top right). Con- ditional entropy H(nt|y0:t) (bottom left) and MMSE estimate of scale parameter sc (bottom right) against time t. The conditional entropy and the MMSE estimate are obtained using the proposed algorithm. 186 Algorithm Conventional algorithm Efficient algorithm Recognition rate within top 1 match 100% 100% Time per frame 7s 0.5s Table 7.3: Recognition performance of algorithms when applied to Database-0. 7.3.2 Results for Database-1 Case 1: Tracking and Recognition using Laplacian Density We first investigate the performance using the same setting as described in Section 7.3.1. In other words, we still use the affine transformation, first-order Markov Gaussian state transition model, ?truncated? Laplacian observation likelihood, etc. Table 7.4 shows that the recognition rate is very poor, only 13% are correctly identified using top match. The main reason is that the ?truncated? Laplacian density is farfrom sufficient to capture the appearance difference between the probe and the gallery, thereby indicating a need for a different appearance modeling. Nevertheless, the tracking accuracy 2 is reasonable with 83% successfully tracked because we are using multiple face templates in the gallery to track the specific face in the probe video. After all, faces in both the gallery and the probe belong to the same class of human face and it seems that the appearance change is within the class range. 2We manually inspect the tracking results by imposing the MMSE motion estimate on the final frame as shown in Figs. 7.3 and 7.4 and determine if tracking is successful or not for this sequence. This is done for all sequences and tracking accuracy is defined as the ratio of the number of sequences successfully tracked to the total number of all sequences. 187 Case 2: Pure Tracking using Laplacian Density In Case 2, we measure the appearance change within the probe video as well as the noise in the background. To this end, we introduce a dummy template T0, a cut version in the first frame of the video. Define the observation likelihood for tracking as p2(yt|?t) = LAP(bardblT{yt;?t}?T0bardbl;?2,?2), (7.13) where ?2 and ?2 are set manually. The other setting, such as motion parameter and model, is the same as in Case 1. We still can run the CONDENSATION algorithm to perform pure tracking. Table 7.4 shows that87% are successfully tracked by this simple tracking model, which implies that the appearance within the video remains similar. Case Case 1 Case 2 Case 3 Case 4 Case 5 Tracking accuracy 83% 87% 93% 100% NA Recognition w/in top 1 match 13% NA 83% 93% 57% Recognition w/in top 3 matches 43% NA 97% 100% 83% Table 7.4: Performances of algorithms when applied to Database-1. 
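For reference, the "truncated" Laplacian of (7.12), used by the likelihoods in Cases 1 and 2, is straightforward to implement. The sketch below is illustrative; the function names are ours, and the scale and truncation parameters are supplied manually as in the text.

```python
import numpy as np

def truncated_laplacian(x, sigma, lam):
    # LAP(x; sigma, lambda) of Eq. (7.12): a Laplacian density whose tail is
    # clipped at lambda * sigma so that samples corresponding to abrupt motion
    # changes keep a small "surviving" probability.
    x = np.asarray(x, dtype=float)
    return np.where(x <= lam * sigma,
                    np.exp(-x / sigma) / sigma,
                    np.exp(-lam) / sigma)

def likelihood_case1(warped_patch, template, sigma, lam):
    # p1(y_t | n_t, theta_t) of Eq. (7.11): truncated Laplacian of the sum of
    # absolute differences between the warped patch and a gallery template.
    sad = np.abs(warped_patch - template).sum()
    return truncated_laplacian(sad, sigma, lam)
```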
Case 3: Tracking and Recognition using Probabilistic Subspace Density As mentioned in Case 1, we need a new appearance model to improve the recog- nition accuracy. As reviewed in Section 7.1.1, there are various approaches in the literature. We decided to use the approach suggested by Moghaddam et al. [55] due to its computational efficiency and high recognition accuracy. However, in our implementation, we model only intra-personal variations instead of both intra/extra-personal variations for simplicity. 188 We need at least two facial images for one identity to construct the intra- personal space (IPS). Apart from the available gallery, we crop out the second image from the video ensuring no overlap with the frames actually used in probe videos. Figure 7.7 (top row) shows a list of such images. Compare with Figure 7.4 to see how the illumination varies between the gallery and the probe. We then fit a probabilistic subspace density [56] on top of the IPS. It proceeds as follows: a regular PCA is performed for the IPS. Suppose the eigensystem for the IPS is {(?i,ei)}di=1, where d is the number of pixels and ?1 ? ... ? ?d. Only top s principal components corresponding to top s eigenvalues are then kept while the residual components are considered as isotropic. We refer the reader to the original paper [56] for the full details. Figure 7.7 (middle row) shows the eigenvectors for the IPS. The density is written as follows: QIPS(x) = {exp(? 1 2 summationtexts i=1 y2i ?i) (2pi)s/2producttextsi=1 ?1/2i }{ exp(?epsilon122?) (2pi?)(d?s)/2}, (7.14) where yi = eTi xfori = 1,...,s is theith principal component ofx, epsilon12 = bardblxbardbl2?summationtextsi=1 y2i is the reconstruction error, and ? = (summationtextdi=s+1 ?i)/(d ? q). It is easy to write the likelihood as follows: p3(yt|nt,?t) = QIPS(T{yt;?t}?Int). (7.15) Table 7.4 lists the performance by using this new likelihood measurement. It turns out that the performance is significantly better that in Case 1, with 93% tracked successfully and 83% recognized within top 1 match. If we consider the top 3 matches, 97% are correctly identified. Case 4: Tracking and Recognition using Combined Density In Case 2, we have studied appearance changes within a video sequence. In Case 3, we have studied the appearance change between the gallery and the probe. In 189 Figure 7.7: Database-1. Top row: the second facial images for estimating prob- abilistic density. Middle row: top 10 eigenvectors for the IPS. Bottom row: the facial images cropped out from the largest frontal view. Case 4, we attempt to take advantage of both cases by introducing a combined likelihood defined as follows: p4(yt|nt,?t) = p3(yt|nt,?t)p2(yt|?t) (7.16) Again, all other setting is the same as in Case 1. We now obtain the best perfor- mance so far: no tracking error, 93% are correctly recognized as the first match, and no error in recognition when top 3 matches are considered. 190 Case 5: Still-to-still Face Recognition To make a comparison, we also performed an experiment on still-to-still face recog- nition. We selected the probe video frames with the best frontal face view (i.e. biggest frontal view) and cropped out the facial region by normalizing with respect to the eye coordinates manually specified. This collection of images is shown in Figure 7.7 (bottom row) and it is fed as probes into a still-to-still face recognition system with the learned probabilistic subspace as in Case 3. 
It turns out that the recognition result is 57% correct for the top one match, and 83% for the top 3 matches. The cumulative match curves for Case 1 and Cases 3-5 are presented in Figure 7.8. Clearly, Case 4 is the best among all. We also implemented the original algorithm by Moghaddam et al. [56], i.e., both intra/extra-personal variations are considered, the recognition rate is similar to that obtained in Case 5. 7.3.3 Results for Database-2 The recognition result for Database-2 is presented in Figure 7.8, using the cumu- lative match curve. We still use the same setting as in Case 1 of section 7.3.2. However, due to the pose variations present in the database, using one frontal view is not sufficient to represent all the appearances under different poses and the recognition rate is hence not so high, 56% when only the top match is considered and 88% when top 3 matches are considered. We do not use probabilistic subspace modeling for this database because such modeling requires manually cropping out multiple templates for each individual. Also, pre-selecting video frames from the same probe video and ensuring that they do not overlap with the probe frames is time-consuming. What is desirable is to automatically select such templates from different sources other than the probe video. Since we have multiple videos 191 available for one individual in Database-2, this motivates us to obtain more repre- sentative views for one face class, leading to the discussions in [194]. 5 10 15 20 25 30 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 rank cumulative match score Case1 Case 3 Case 4 Case 5 5 10 15 20 250 0.2 0.4 0.6 0.8 1 rank cumulative match score Figure 7.8: Cumulative match curves for Database-1 (left) and Database-2 (right). 7.3.4 Enhanced results Visual tracking models the inter-frame appearance differences and visual recogni- tion models the appearance differences between video frames and gallery images. Simultaneous tracking and recognition provides a mechanism of jointly modeling inter-frame appearance differences and the appearance differences between video frames and gallery images. As in Section 7.3.2, this joint modeling of appearance differences in both tracking and recognition in one framework actually improves both tracking and recognition accuracies over approaches that separate tracking and recognition as two tasks. The more effective the model choices are, improved performance in tracking and recognition is expected. We explore this avenue by incorporating the models used in Chapter 6. We use the same adaptive-velocity motion model (6.29) and the same identity equation (7.2). The observation likelihood is modified to combine contributions (or scores) from both tracking and recognition in the likelihood yields the best performance in both tracking and recognition. 192 To compute the tracking score pa(yt|?t) which measures the inter-frame appear- ance changes, we use the appearance model introduced in Section 6.2.1 and the quantity defined in (6.10) as pa(yt|?t). To compute the recognition score which measures the appearance changes be- tween probe videos and gallery images, we assume the same model as in (7.3), i.e., the transformed observation is a noise-corrupted version of some still template in the gallery, and the noise distribution determines the recognition score pn(yt|nt,?t). We will physically define this quantity below. 
To fully exploit the fact that all gallery images are in frontal view, we also compute below how likely the patch zt is in frontal view and denote this score by pf(yt|?t). If the patch is in frontal view, we accept a recognition score; otherwise, we simply set the recognition score as equiprobable among all identities, i.e., 1/N. The complete likelihood p(yt|nt,?t) is now defined as p(yt|nt,?t) ? pa {pf pn + (1?pf) N?1}. (7.17) Model components in detail ? A. Modeling inter-frame appearance changes Inter-frame appearance changes are related to the motion transition model and the appearance model for tracking, which were explained in Sections 6.2.1 and 6.2.2. ? B. Being in frontal view Since all gallery images are in frontal view, we simply measure the extent of being frontal by fitting a probabilistic subspace (PS) density on the top of the gallery images [54, 56], assuming that they are i.i.d. samples from the 193 frontal face space (FFS). pf(yt|?t) is written as follows: pf(yt|?t) = QFFS(zt), (7.18) where the density Q(.) is defined same as that in (7.14). ? C. Modeling appearance changes between probe video frames and gallery im- ages WeadopttheMAPrule developed in[56]fortherecognitionscore pn(yt|nt,?t). Two subspaces are constructed to model appearance variations. The IPS is meant to cover all the variations in appearances belonging to the same person while the EPS is used to cover all the variations in appearances belonging to different people. More than one facial image per person is needed to con- struct the IPS. Apart from the available gallery, we crop out four images from the video ensuring no overlap with frames used in probe videos. The above PS density estimation method is applied separately to the IPS and the EPS, yielding two different eigensystems. The recognition score pn(yt|nt,?t) is finally computed as, assuming equal priors on the IPS and the EPS, pn(yt|nt,?t) = QIPS(zt ?Int)Q IPS(zt ?Int)+ QEPS(zt ?Int) . (7.19) D. Proposed algorithm We adjust the particle number Jt based on the following considerations. (i) The first issue is same as (6.31) based on the prediction error. (ii) As shown above, the uncertainty in the identity variable nt is characterized by an entropy measure Ht for p(nt|y1:t) and Ht is a non-increasing function (under one weak assumption). Accordingly, we increase the number of particles by a fixed amount Jfix if Ht 194 Initialize a sample set S0 = {?(j)0 ,w(j)0 = 1/J0)}J0j=1 according to prior distribution p(?0). Set ?0,l = 1/N. Initialize the appearance mode A1. For t = 1,2,... Calculate the MAP estimate ??t?1, the adaptive motion shift ?t by Eq. (6.21), the noise variance rt by Eq. (6.30), and particle number Jt by Eq. (7.20). For j = 1,2,...,Jt Draw the sample u(j)t for ut with variance Rt. Construct the sample ?(j)t by Eq. (6.29). Compute the transformed image z(j)t . For l = 1,2,...,N Update the weight using ?(j)t,l = ?t?1,lp(yt|l,?(j)t ) = ?t?1,lp(z(j)t |l,?(j)t ) by Eq. (7.17). End End Normalize the weight using w(j)t,l = ?(j)t,l /summationtextj,l ?(j)t,l and compute w(j)t = summationtextj w(j)t,l and ?t,l =summationtextj w(j)t,l . Update the appearance model At+1 using ?zt. End Figure 7.9: The visual tracking and recognition algorithm. increases; otherwise we deduct Jfix from Jt. Combining these two, we have Jt = J0 rtr 0 + Jfix ?(?1)i[Ht?1 1 such that, p(yt|nt = l,?t) ? ?p(yt|nt = j,?t); t ? 1,j ? N,j negationslash= l. (7.22) Substitution of Eq. (7.21) and (7.22) into Eq. (7.6) gives rise to p(nt = l|y0:t) = 1N integraldisplay ?0 ... 
integraldisplay ?t p(?0|y0) tproductdisplay s=1 p(ys|ns = l,?s)p(?s|?s?1) p(ys|y0:s?1) d?t ...d?0 ? 1N integraldisplay ?0 ... integraldisplay ?t p(?0|y0) tproductdisplay s=1 ?p(ys|ns = j,?s)p(?s|?s?1) p(ys|y0:s?1) d?t ...d?0 198 = ? t N integraldisplay ?0 ... integraldisplay ?t p(?0|z0) tproductdisplay s=1 p(ys|ns = j,?s)p(?s|?s?1) p(ys|y0:s?1) d?t ...d?0 = ?tp(nt = j|y0:t); j ? N,j negationslash= l, (7.23) where ?t = producttextts=1 ?. More interestingly, from Eq. (7.23), we have (N ?1)p(nt = l|y0:t) ? ?t Nsummationdisplay j=1,jnegationslash=l p(nt = j|y0:t) = ?t(1?p(nt = l|y0:t)), (7.24) i.e., p(nt = l|y0:t) ? h(?,t), (7.25) where h(?,t) = ? t ?t + N ?1. (7.26) Eq. (7.25) has two implications. 1. Since the function h(?,t) which provides a lower bound for p(nt = l|y0:t) is monotonically increasing against time t, p(nt = l|y0:t) has a probable trend of increase over t, even though not in a monotonic manner. 2. Since ? > 1 and p(nt = l|y0:t) ? 1, limt??p(nt = l|y0:t) = 1, (7.27) implying that p(nt = l|y0:t) degenerates in the identity l for some sufficiently large t. However, all these derivations are based on assumptions (A) and (B). Though it is easy to satisfy (A), difficulty arises in practice in order to satisfy (B) for all the frames in the sequence. Fortunately, as we have seen in the experiment in Section 7.3, numerically this degeneracy is still reached even if (B) is satisfied only for most but not all frames in the sequence. 199 Appendix 7.II: More on assumption (B) A trivial choice for ? is the lower bound on the likelihood ratio, i.e., ? = inf t?1,jnegationslash=l,?t?? p(yt|nt = l,?t) p(yt|nt = j,?t). (7.28) This choice is of theoretical interest. In practice, how good is the assumption (B) satisfied? Figure 7.13 plots against the logarithm of the scale parameter, the ?average? likelihood of the correct identity, 1 N summationdisplay n?N p(In|n,?), and that of the incorrect identities, 1 N(N ?1) summationdisplay m?N,n?N,mnegationslash=n p(Im|n,?), of the face gallery as well as the ?average? likelihood ratio, i.e., the ratio between the above two quantities. The observation is that only within a narrow ?band? the condition (B) is well satisfied. Therefore, the success of SIS algorithm depends on how good the samples lie in a similar ?band? in the high-dimensional affine space. Also, the lower bound ? in assumption (B) is too strict. If we take the mean of the ?average? likelihood ratio shown in Figure 7.13 as an estimate of ? ( roughly 1.5 ), Eq. (7.25) tells that, after 20 frames, the probability p(l|y0:t) reaches 0.99! However, this is not reached in the experiments due to noise in the observations and incomplete parameterization of transformations. 200 ?1 ?0.5 0 0.5 10 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 log of scale parameter likelihood correct incorrect ?1 ?0.5 0 0.5 10 1 2 3 4 5 6 7 8 log of scale parameter likelihood ratio Figure 7.13: Left: The ?average? likelihood of the correct hypothesis and incorrect hypotheses against the log of scale parameter. Right: The ?average? likelihood ratio against the log of scale parameter. 201 Chapter 8 Probabilistic Identity Characterization Visual face recognition is an important task. Even though a lot of research has been carried out, state-of-the-art recognizers still yield unsatisfactory results especially when confronted with pose and illumination variations. 
In addition, the recognizers are further complicated by the registration requirement, since the images that a recognizer processes contain transformed appearances of the object. Below, we simply use the term "transformation" to cover all the variations involved, be it registration, pose, or illumination variation.

While most recognizers process a single image, there is a growing interest in using a group of images [80, 84, 88, 89, 91, 184, 185]. Depending on the transformations embedded in the group and the temporal continuity between them, the group can be either independent or dependent. Examples of the independent group (I-group) are face databases that store multiple appearances for one object. Examples of the dependent group are video sequences. If the temporal information is stripped, video sequences reduce to I-groups. In this chapter, whenever we mention video sequences, we mean dependent groups of images.

Approaches that use I-groups can be roughly divided into two categories. The first category is based on manifold matching. In [88], hypothetical identity surfaces are constructed by computing the linear coefficients of a view space; illumination variations are not accounted for, and discriminant features are then extracted to overcome other variations. In [80], a manifold is formed for every I-group and recognition is performed by computing the shortest distance between two manifolds. The manifold takes a certain parameterized form and the parameters are learned directly from the visual appearances; robustness to pose and illumination variations is not reported. The second category is based on statistical learning. In [91], a multivariate Gaussian density is fitted to every I-group, and recognition is achieved by computing the Kullback-Leibler distance [4] between two Gaussian densities. However, the Gaussian assumption is easily violated if pose and illumination variations exist. In [184], a principal subspace is learned for each I-group and the principal angles between two principal subspaces are used for recognition; the principal angles can also be computed in the feature space induced by kernel functions. One common disadvantage of the above approaches is that they assume the face regions have already been cropped beforehand, using either a detector or a tracker.

Approaches using video sequences utilize temporal information for recognition as well. In [185], simultaneous tracking and recognition is implemented in a probabilistic framework: the joint posterior probability of the tracking parameter and the identity variable is approximated using the SIS algorithm, and the marginal posterior probability of the identity variable is used for recognition. However, only an affine localization parameter is used for tracking, and pose and illumination variations are not considered. In addition, exemplars are learned from the gallery videos to cover pose and illumination variations. In [89], hidden Markov models are used to learn the dynamics between successive appearances. In [84], pose variations are handled by learning view-discretized appearance manifolds from the training ensemble, and transition probabilities from one view to another are used to regularize the search space. However, in [84, 89], cropped images are used for testing.

In this chapter, we propose a general framework which possesses the following features:

• It processes either a single image or a group of images (including the I-group and the video sequence).

•
It handles the localization problem, illumination and pose variations. ? The identity description could be either discrete or continuous. The contin- uous identity encoding typically arises from subspace modeling. ? It is probabilistic and integrates all the available evidence. Chapter organization In Section 8.1 we introduce the generic framework which provides a probabilis- tic characterization of the object identity. In Section 8.2 we address issues and challenges arising in this framework. In Section 8.3 we focus on how to achieve an identity encoding which is invariant to localization, illumination and pose vari- ations. In Section 8.3.2, we present some efficient computational methods. In Section 8.3.3, we present experimental results. 204 8.1 Principle of Probabilistic Identity Charac- terization Suppose ? is the identity signature, which represents the identity in an abstract manner. It can be either discrete- or continuous- valued. If we have an N-class problem, ? is discrete taking value in {1,2,...,N}. If we associate the identity with image intensity or feature vectors derived from say subspace projections, ? is continuous-valued. Given a group of images y1:T .= {y1,y2,...,yT} containing the appearances of the same but unknown identity, probabilistic identity characteriza- tion is equivalent to finding the posterior probability p(?|y1:T). As the image only contains a transformed version of the object, we also need to associate it a transformation parameter ?, which lies in a transformation space ?. The transformation space ? is usually application dependent. Affine trans- formation is often used to compensate for the localization problem. To handle illumination variation, the lighting direction is used. If pose variation is involved, 3D transformation is needed or a discrete set is used if we quantize the continuous view space. We assume that the prior probability of ? is pi(?), which is assumed to be, in practice, a non-informative prior. A non-informative prior is uniform in the discrete case and treated as a constant, say 1, in the continuous case. The key to our probabilistic identity characterization is as follows: p(?|y1:T) ? pi(?)p(y1:T|?) = pi(?) integraldisplay ?1:T p(y1:T|?1:T,?)p(?1:T)d?1:T = pi(?) integraldisplay ?1:T Tproductdisplay t=1 p(yt|?t,?)p(?t|?1:t?1)d?1:T, (8.1) 205 where the following rules, namely (a) observational conditional independence and (b) chain rule, are applied: (a) p(y1:T|?1:T,?) = Tproductdisplay t=1 p(yt|?t,?); (8.2) (b) p(?1:T) = Tproductdisplay t=1 p(?t|?1:t?1); p(?1|?0) .= p(?1). (8.3) Equation (8.1) involves two key quantities: the observation likelihood p(yt|?t,?) and the state transition probability p(?t|?1:t?1). The former is essential to a recog- nition task, the ideal case being that it possesses a discriminative power in the sense that it always favors the correct identity and disfavors the others; the latter is also very helpful especially when processing video sequences, which constrains the search space. We now study two special cases of p(?t|?1:t?1). 8.1.1 Independent group (I-group) In this case, the transformations {?t; t = 1,...,T} are independent of each other, i.e. p(?t|?1:t?1) = p(?t). (8.4) Eq. (8.1) becomes p(?|y1:T) ? pi(?) Tproductdisplay t=1 integraldisplay ?t p(yt|?t,?)p(?t)d?t. (8.5) In this context, the probability p(?t) can be regarded as a prior for ?t, which is often assumed to be Gaussian with mean ?? or non-informative. The most widely studied case in the literature is T = 1, i.e. 
there is only a single image in the group. Due to its importance, sometimes we will distinguish 206 it from the I-group (with T > 1) depending on the context. We will present in Section 8.2 the shortcomings of many contemporary approaches. It all boils down to how to compute the integral in (8.5) in real applications. In the sequel, we show how to efficiently approximate it. 8.1.2 Video sequence In the case of video sequence, temporal continuity between successive video frames implies thatthe transformations{?t; t = 1,...,T}follow a Markov chain. Without loss of generality, we assume a first-order Markov chain, i.e. p(?t|?1:t?1) = p(?t|?t?1). (8.6) Eq. (8.1) becomes p(?|y1:T) ? pi(?) integraldisplay ?1:T Tproductdisplay t=1 p(yt|?t,?)p(?t|?t?1)d?1:T. (8.7) The difference between (8.5) and (8.7) is whether the product lies inside or outside the integral. In (8.5), the product lies outside the integral, which divides the quantity of interest into ?small? integrals that can be computed efficiently; while (8.7) does not have such a decomposition, causing computational difficulty. 8.1.3 Difference from Bayesian estimation Our framework is very different fromthe traditional Bayesian parameter estimation setting, where a certain parameter ? should be estimated from the i.i.d. observa- tions {x1,x2,...,xT} generated from a parametric density p(x|?). If we assume that ? has a prior probability pi(?), then the posterior probability p(?|x1:T) is computed as p(?|x1:T) ? pi(?)p(x1:T|?) = pi(?) Tproductdisplay t=1 p(xt|?) (8.8) 207 and used to derive the parameter estimate ??. One should not confuse our trans- formation parameter ? with the parameter ?. Notice that ? is fixed in p(xt|?) for different t?s. However, each yt is associates with a ?t. Also, ? is different from ? in the sense that ? describes the identity and ? helps to describe the parametric density. To make our framework more general, we can also incorporate the ? parameter by letting the observation likelihood be p(y|?,?,?). Equation (8.1) then becomes p(?|y1:T) ? pi(?)p(y1:T|?) (8.9) = pi(?) integraldisplay ?,?1:T p(y1:T|?1:T,?,?)p(?1:T)pi(?)d?1:Td? = pi(?) integraldisplay Tproductdisplay t=1 p(yt|?t,?,?)p(?t|?1:t?1)pi(?)d?1:Td?, where ?1:T and ? are assumed to be statistically independent. In this chapter, we will focus only on (8.1) as if we already know the true parameter ? in (8.9). This greatly simplifies our computation. 8.2 Recognition Setting and Issues Equation (8.1) lays a theoretical foundation, which is universal for all recognition settings: (i) recognition is based on a single image (an I-group with T = 1), an I- group with T ? 2, or a video sequence; (ii) the identity signature is either discrete- or continuous-valued; and (iii) the transformation space takes into account all available variations, such as localization and variations in illumination and pose. 208 8.2.1 Discrete identity signature In a typical pattern recognition scenario, say an N-class problem, the identity signature for y1:T, ??, is determined by the Bayesian decision rule: ?? = arg max {1,2,...,N} p(?|y1:T). (8.10) Usually p(y|?,?) is a class-dependent density, either pre-specified or learned. This is a well studied problem and we will not focus on this. 8.2.2 Continuous identity signature If the identity signature is continuous-valued, two recognition schemes are possible. The first is to derive a point estimate ?? (e.g. conditional mean, mode) from p(?|y1:T) to represent the identity of image group y1:T. 
Recognition is performed by matching ???s belonging to different groups of images using a metric k(.,.). Say, ??1 is for group 1 and ??2 for group 2, the point distance ?k1,2 .= k(??1, ??2) is computed to characterize the difference between groups 1 and 2. Instead of comparing the point estimates, the second scheme directly compares different distributions that characterize the identities for different groups of images. Therefore, for two groups 1 and 2 with the corresponding posterior probabilities p(?1) and p(?2), we use the following expected distance [134] ?k1,2 .= integraldisplay ?1 integraldisplay ?2 k(?1,?2)p(?1)p(?2)d?1d?2. Ideally, we wish to compare the two probability distributions using quantities such as the Kullback-Leibler distance [4]. However, computing such quantities is nu- merically prohibitive when ? is of high dimensionality. 209 The second scheme is preferred as it utilizes the complete statistical informa- tion, while in the first one, point estimates use partial information. For examples, if only the conditional mean is used, the covariance structure or higher-order statis- tics is thrown away. However, there are circumstances when the first scheme makes sense: the posterior distribution p(?|y1:T) is highly peaked or even degenerate at ??. This might occur when (i) the variance parameters are taken to be very small; or (ii) we let T go to ?, i.e. keep observing the same object for a long time. 8.2.3 The effects of the transformation Even though recognition based on single images has been studied for a long time, most efforts assume only one alignment parameter ?? and compute the probabil- ity p(y|??,?). Any recognition algorithm computing some distance measures can be thought of as using a properly defined Gibbs distribution. The underlying assumption is that p(?) = ?(?? ??), (8.11) where ?(.) is an impulse function. Using (8.11), (8.5) becomes p(?|y) ? pi(?) integraldisplay ? p(y|?,?)?(?? ??)d? = pi(?)p(y|??,?). (8.12) Incidentally, if the Laplace?s method is used to approximate the integral (refer to the Appendix 8.I for details) and the maximizer ??? = argmax? p(y|?,?)p(?) does not depend on ?, say ??? = ??, then p(?|y) ? pi(?) integraldisplay ? p(y|?,?)p(?)d? similarequal pi(?)p(y|??,?)p(??) radicalBig (2pi)r/|I(??)|. (8.13) This gives rise to the same decision rule as implied by (8.12) and also partly explains why the simple assumption (8.11) can work in practice. 210 The alignment parameter is therefore very crucial for a good recognition perfor- mance. Even a slightly erroneous ?? may affect the recognition system significantly. It is very beneficial to have a continuous density p(?) such as a Gaussian or even a non-informative since marginalization of p(?,?|y) over ? yields a robust estimate of p(?|y). In addition, our Bayesian framework also provides a way to estimate the best alignment parameter through the posterior probability: p(?|y) ? integraldisplay ? p(y|?,?)pi(?)d?. (8.14) 8.2.4 Asymptotic behaviors When we have an I-group or a video sequence, we are often interested in dis- covering the asymptotic (or large-sample) behaviors of the posterior distribution p(?|y1:T) when T is large. In [185], the discrete case of ? in a video sequence is studied. However it is very challenging to extend this study to a continuous case. Experimentally (refer to Section 8.3.3), we find that p(?|y1:T) becomes more and more peaked as N increase, which seems to suggest a degenerancy in the true value ?true. 
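To make the contrast in Section 8.2.3 concrete, namely the delta-prior rule (8.12) versus marginalizing over the transformation as in (8.5), the following toy sketch compares the two for a discrete identity set. The circular-shift transformation space, the isotropic-Gaussian likelihood, and all data below are simplifying assumptions of my own, not the models used in the experiments.

```python
import numpy as np

def log_likelihood(y, template, shift, sigma=1.0):
    """Hypothetical observation log-likelihood log p(y | theta, alpha): the observation is
    modeled as a circularly shifted template plus isotropic Gaussian noise."""
    return -np.sum((y - np.roll(template, shift)) ** 2) / (2.0 * sigma ** 2)

def normalize(log_post):
    """Turn unnormalized log-posteriors over identities into probabilities."""
    log_post = log_post - log_post.max()
    p = np.exp(log_post)
    return p / p.sum()

def posterior_delta(y, templates, shift_hat):
    """Delta-prior rule (8.11)-(8.12): trust a single alignment shift_hat."""
    return normalize(np.array([log_likelihood(y, t, shift_hat) for t in templates]))

def posterior_marginal(y, templates, shift_samples):
    """Marginalize the transformation as in (8.5): average p(y|theta,alpha) over
    samples theta^(k) drawn from the prior p(theta) (log-mean-exp for stability)."""
    def logmeanexp(v):
        m = v.max()
        return m + np.log(np.mean(np.exp(v - m)))
    log_post = np.array([logmeanexp(np.array([log_likelihood(y, t, s) for s in shift_samples]))
                         for t in templates])
    return normalize(log_post)

# Toy example with hypothetical 1-D "appearances" of N = 3 identities.
rng = np.random.default_rng(1)
templates = [rng.normal(size=32) for _ in range(3)]
y = np.roll(templates[1], 2) + 0.05 * rng.normal(size=32)    # identity 1, shifted by 2

shift_samples = rng.integers(-3, 4, size=50)                  # draws from a broad prior p(theta)
print(posterior_delta(y, templates, shift_hat=0))             # computed under a wrong alignment; unreliable
print(posterior_marginal(y, templates, shift_samples))        # concentrates on the correct identity (index 1)
```

The second call illustrates the robustness argued for above: averaging the likelihood over transformation samples rescues the decision even when a single assumed alignment is wrong.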
8.3 Subspace Identity Encoding The main challenge is to specify the likelihood p(y|?,?). Practical considerations require that (i) the identity encoding coefficient ? is compact so that our target space where ? resides is of low dimensional; and (ii) ? should be invariant to transformations and tightly clustered so that we can safely focus on a small portion of the spaces. 211 Inspired by the popularity of subspace analysis, we assume that the observation y can be well explained by a subspace, whose basis vectors are encoded in a matrix denoted by B, i.e. there exists linear coefficients ? such that y ? B?. Clearly, ? naturally encodes the identity. However, the observation under the transformation condition (parameterized by ?) deviates from the canonical condition (parameter- ized by say ??) under which the B matrix is defined. To achieve an identity encoding that is invariant to the transformation, there are two possible ways. One way is to inverse-warp the observation y from the transformation condition ? to the canoni- cal condition ?? and the other way is to warp the basis matrix B from the canonical condition ?? to the transformation condition ?. In practice, inverse-warping is typ- ically difficult. For example, we cannot easily warp an off-frontal view to a frontal view without explicit 3D depth information that is unavailable. Hence, we follow the second approach, which is also known as analysis-by-synthesis approach. We denote the basis matrix under the transformation condition ? by B?. 8.3.1 Invariant to localization, illumination, and pose Localization parameter, denoted by ?, includes the face location, scale and in-plane rotation. Typically, an affine transformation is used. We absorb the localization parameter ? in the observation using T{y;?}, where the T{.;?} is a localization operator, extracting the region of interest and normalizing it to match with the size of the basis. The illumination parameter, denoted by ?, is a vector specifying the illumi- nant direction (and intensity if required). The pose parameter, denoted by ?, is a continuous-valued random variable. However, practical systems [67, 69] often discretize this due to the difficulty in handling 3D to 2D projection. Suppose the 212 quantized pose set is {1,...,V}. To achieve pose invariance, we concatenate all the images [69] {y1,...,yV} under all the views and a fixed illumination ? to form a high-dimensional vector Y? = [y1,?,...,yV,?]T. To further achieve invariance to illuminations, we invoke the Lambertian reflectance model, ignoring shadow pixels. Now, ? is actually a 3-D vector describing the illuminant. We now follow Chapter 3 to derive a bilinear analysis summarized below. Since all yv?s are illuminated by the same ?, the Lambertian model gives, Y? = W?. (8.15) Following [204], we assume that W = msummationdisplay i=1 ?iWi, (8.16) and we have Y? = msummationdisplay i=1 ?iWi?, (8.17) where Wi?s are illumination-invariant bilinear basis and ? = [?1,...,?m]T pro- vides an illuminant-invariant identity signature. Those bilinear basis can be easily learned as shown in [138, 202]. Thus ? is also pose-invariant because, for a given view ?, we take the part in Y corresponding to this view and still have y?,? = msummationdisplay i=1 ?iW?i ?. (8.18) In summary, the basis matrix B? for ? = (?,?,?) with ? absorbed in y is expressed as B?,? = [W?1?,...,W?m?]. We focus on the following likelihood: p(y|?) = p(y|?,?,?,?) = Z?1?,?,? exp{?D(T{y;?},B?,??)}, (8.19) 213 where D(y,B??) is some distance measure and Z?,?,? 
is the so-called partition func- tion which plays a normalization role. In particular, if we take D as D(T{y;?},B?,??) = (T{y;?}?B?,??)T??1(T{y;?}?B?,??)/2, (8.20) with a given ? (say ? = ?2I where I is an identity matrix), then (8.19) becomes a multivariate Gaussian and the partition function Z?,?,? does not depend on the parameters any more. However, even though (8.19) is a multivariate Gaussian, the posterior distribution p(?|y1:T) is no longer Gaussian. 8.3.2 Computational issues The integral If the transformation space ? is discrete, it is easy to evaluate the integral1 integraltext ? p(y|?,?)p(?)d?, which becomes a sum. If ? is continuous, in general, com- puting integral integraltext? p(y|?,?)p(?)d? is a difficult task. Many techniques are available in the literature. Here we mainly focus on two techniques: Monte Carlo simulation [14, 16] and Laplace?s method [16, 136]. Monte Carlo simulation. The underlying principle is the law of large number (LLN). If {x(1),x(2),...,x(K)} are K i.i.d. samples of the density p(x), for any bounded function h(x), limK?? 1K Ksummationdisplay k=1 h(x(k)) = integraldisplay xh(x)p(x)dx = Ep[h]. (8.21) Alternatively, when drawing i.i.d. samples from p(x) is difficult, we can use importance sampling [14, 16]. Suppose that the importance function q(x) has i.i.d. realizations {x(1),x(2),...,x(K)}. The pdf p(x) can be represented by a weighted 1We drop the subscript [.]t notation as this is a general treatment. 214 sample set {(x(k),w(k)p )}Kk=1, where the weight for the sample x(k) is w(k)p = p(x(k))/q(x(k)), (8.22) in the sense that for any bounded function h(x), limK?? Ksummationdisplay k=1 w(k)p h(x(k)) = Ksummationdisplay k=1 p(x(k)) q(x(k))h(x (k)) = Ep[h]. (8.23) Laplace?s method [16, 136]. The general approach of this method is presented in Appendix 8.I. This is a good approximation to the integral only if the integrand is uniquely peaked and reasonably mimics the Gaussian function. In our context, we use importance sampling (or i.i.d sampling if possible) for ? and the Laplace?s method for ? and enumerate ?. We draw i.i.d. samples {?(1),?(2),...,?(K)} from q(?) and, for each sample ?(k), compute the weight w?(k) = p(?(k))/q(?(k)). If the i.i.d. sampling is used, the weights are always ones. Putting things together, we have (assuming pi(?) is a non-informative prior) p(?|y) ? integraldisplay ?,?,? p(y|?,?,?,?)p(?)p(?)p(?)d?d?d? similarequal 1K Ksummationdisplay k=1 w?(k) 1V Vsummationdisplay ?=1 p(y|?(k),???(k),?,?,?,?)? p(???(k),?,?) radicalBig (2pi)r/|I(???(k),?,?)|, (8.24) where ???k,?,? is the maximizer ???(k),?,? = argmin ? p(y|? (k),?,?,?)p(?), (8.25) r is the dimensionality of ?, and I(???,?,?) is a properly defined matrix. Refer to Appendix 8.II for computing ???,?,? and I(???,?,?) if the likelihood is given as (8.19) and (8.20) and a non-informative prior p(?) is assumed. Similar derivations can be conducted for an I-group of observations y1:T. 215 The distances ?k and ?k To evaluate the expected distance ?k, we use the Monte Carlo method. In our context, the target distribution is p(?|y1:T). Based on the above derivations, we know how to evaluate the target distribution, but not to draw sample from it. Therefore, we use importance sampling. Other sampling techniques such as Monte Carlo Markov chain [14, 16] can also be applied. 
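As a reminder of how such an importance-sampling estimate behaves before it is specialized to the two image groups, here is a generic toy example, unrelated to the face data. It uses a self-normalized variant of the weights in (8.22), which is convenient when the target density is known only up to a constant; the target and importance densities below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target density p(x): a standard Gaussian restricted to x > 0, known only up to scale.
def p_unnorm(x):
    return np.exp(-0.5 * x ** 2) * (x > 0)

# Importance function q(x): an exponential density, easy to draw i.i.d. samples from.
def q_pdf(x):
    return np.exp(-x) * (x > 0)

K = 100_000
x = rng.exponential(size=K)          # i.i.d. samples x^(k) ~ q
w = p_unnorm(x) / q_pdf(x)           # weights w^(k) = p(x^(k)) / q(x^(k)), cf. (8.22)

# Estimate E_p[h] for h(x) = x; dividing by the weight sum removes the unknown
# normalizing constant of p (self-normalized form of (8.23)).
estimate = np.sum(w * x) / np.sum(w)
print(estimate)                      # approx. sqrt(2/pi) ~ 0.80, the mean of the half-Gaussian
```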
Suppose that, say for group 1, the importance function is q1(?1), and weighted sample set is {?(i)1 ,w(i)1 }Ii=1, the expected distance is approximated as ?k1,2 similarequal summationtextI i=1 summationtextJ j=1 w (i) 1 w (j) 2 k(? (i) 1 ,? (j) 2 )summationtext I i=1 w (i) 1 summationtextJ j=1 w (j) 2 . (8.26) The point distance is approximated as ?k1,2 similarequal k( summationtextI i=1 w (i) 1 ? (i) 1summationtext I i=1 w (i) 1 , summationtextJ j=1 w (j) 2 ? (j) 2summationtext J j=1 w (j) 2 ). (8.27) 8.3.3 Experimental results We use the ?illum? subset of the PIE database [75] in our experiments. This subset has 68 subjects under 21 illumination configurations and 13 poses. Out of the 21 illumination configurations, we select 12 of them denoted by F, F = {f16,f15,f13,f21,f12,f11,f08,f06,f10,f18,f04,f02}, which typically span the set of variations. Out of the 13 poses, we select 9 of them denoted by C, C = {c22,c02,c37,c05,c27,c29,c11,c14,c34}, which cover from the left profile to the frontal to the right profile. In total, we have 68?12?9 = 7344 images. Fig 3.2 displays one PIE object under the illumination and pose variations. 216 We randomly divide the 68 subjects into two parts. The first 34 subjects are used in the training set and the remaining 34 subjects are used in the gallery and probe sets. It is guaranteed that there is no identity overlap between the training set and the gallery set. During training, the images are pre-preprocessed by aligning the eyes and mouth to desired positions. No flow computation is carried on for further align- ment. After the pre-processing step, the used face image is of size 48 by 40, i.e. d = 48?40 = 1920. Also, we only study gray images by taking the average of the red, green, and blue channels of their color versions. The training set is used to learn the basis matrix B? or the bilinear basis Wi?s. As mentioned before, ? includes the illumination direction ? and the view pose ?, where ? is a continuous-valued random vector and ? is a discrete random variable taking values in {1,...,V} with p = 9 (corresponding to C). The images belonging to the remaining 34 subjects are used in the gallery and probe sets. The construction of the gallery and probe sets conforms the following: To form a gallery set of the 34 subjects, for each subject, we use an I-group of 12 images under all the illuminations under one pose ?p; to form a probe set, we use I-groups under the other pose ?g. We mainly concentrate on the case with ?p negationslash= ?g. Thus, we have 9?8 = 72 tests, with each test giving rise to a recognition score. The 1-NN (nearest neighbor) rule is applied to find the identity for a probe I-group. During testing, we no longer use the pre-processed images and therefore the unknown transformation parameter includes the affine localization parameter, the light direction, and the discrete view pose. The prior distribution p(?t) is assumed to be a Gaussian, whose mean is found by a background subtraction algorithm 217 and whose covariance matrix is manually specified. We use i.i.d. sampling from p(?t) since it is Gaussian. The metric k(.,.) actually used in our experiments is the correlation coefficient: k(x,y) = {(xTy)2}/{(xTx)(yTy)}. Figure 8.1 shows the marginal posterior distribution of the first element ?1 of the identity variable ?, i.e., p(?1|y1:T), with different N?s. 
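As an implementation aside, the group-matching scores (8.26) and (8.27) used in these tests reduce to a few lines once each group's posterior is represented by a weighted sample set. The sketch below is illustrative only: the samples and weights are made-up stand-ins for the posterior sample sets, and the correlation-coefficient metric k(x, y) defined above plays the role of k(., .).

```python
import numpy as np

def corr_metric(x, y):
    """k(x, y) = (x^T y)^2 / ((x^T x)(y^T y)), the metric used in the experiments."""
    return (x @ y) ** 2 / ((x @ x) * (y @ y))

def expected_distance(samples1, w1, samples2, w2):
    """Eq. (8.26): weighted average of k over all pairs of posterior samples."""
    total, norm = 0.0, 0.0
    for a1, u1 in zip(samples1, w1):
        for a2, u2 in zip(samples2, w2):
            total += u1 * u2 * corr_metric(a1, a2)
            norm += u1 * u2
    return total / norm

def point_distance(samples1, w1, samples2, w2):
    """Eq. (8.27): k evaluated at the weighted means (point estimates) of the two groups."""
    m1 = np.average(samples1, axis=0, weights=w1)
    m2 = np.average(samples2, axis=0, weights=w2)
    return corr_metric(m1, m2)

# Made-up weighted sample sets standing in for the posteriors of two image groups.
rng = np.random.default_rng(2)
s1 = rng.normal(loc=1.0, size=(40, 5)); w1 = rng.random(40)
s2 = rng.normal(loc=1.0, size=(60, 5)); w2 = rng.random(60)
print(expected_distance(s1, w1, s2, w2), point_distance(s1, w1, s2, w2))
```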
From Figure 8.1, we notice that (i) the posterior probability $p(\alpha_1|y_{1:T})$ has two modes, which might fail algorithms that rely on a point estimate, and (ii) it becomes more peaked and more tightly supported as $T$ increases, which empirically supports the asymptotic behavior mentioned in Section 8.2.

Figure 8.1: The posterior distributions $p(\alpha_1|y_{1:T})$ with different $T$'s: (a) $p(\alpha_1|y_1)$; (b) $p(\alpha_1|y_{1:6})$; (c) $p(\alpha_1|y_{1:12})$; and (d) the posterior distribution of the pose given $y_{1:12}$. Notice that $p(\alpha_1|y_{1:T})$ has two modes and becomes more peaked as $T$ increases.

Figure 8.2 shows the recognition rates for all the 72 tests. In general, when the poses of the gallery and probe sets are far apart, the recognition rates decrease. The best gallery sets for recognition are those in frontal poses and the worst gallery sets are those in profile views. These observations are similar to those made in Chapter 3.

For comparison, Table 8.1 shows the average recognition rates for four different methods: our two probabilistic approaches using $\bar{k}$ and $\hat{k}$, respectively, the PCA approach [62], and the statistical approach [91] using the KL distance. When implementing the PCA approach, we learned a generic face subspace from all the training images, stripping their illumination and pose conditions; when implementing the KL approach, we fit a Gaussian density on every I-group and the learning set is not used. Our approaches outperform the other two approaches significantly due to the transformation-invariant subspace modeling. The KL approach [91] performs even worse than the PCA approach simply because no illumination and pose learning is used in the KL approach, while the PCA approach has a learning algorithm based on image ensembles taken under different illuminations and poses (though this specific information is stripped).

Method            | $\bar{k}$ | $\hat{k}$ | PCA | KL [91]
Rec. Rate (top 1) | 82%       | 76%       | 36% | 6%
Rec. Rate (top 3) | 94%       | 91%       | 56% | 15%

Table 8.1: Recognition rates of different methods.

As mentioned earlier in Section 8.2.3, we can infer the transformation parameters using the posterior probability $p(\theta|y_{1:T})$. Figure 8.1(d) also shows the obtained pose posterior given $y_{1:12}$ for one probe I-group. In this case, the actual pose is the fifth one in the pose set (i.e., camera c27), which has the maximum probability in Figure 8.1(d). Similarly, we can find an estimate of the localization parameter, which is quite accurate since the background subtraction algorithm already provides a clean position.

Figure 8.2: The recognition rates of all tests. (a) Our method based on $\bar{k}$. (b) Our method based on $\hat{k}$. (c) The PCA approach [62]. (d) The KL approach. Notice the different ranges of values for different methods; the diagonal entries should be ignored.

8.4 Appendix

Appendix 8.I: Laplace's method

We are interested in computing the following quantity, for $\theta$
= [?1,?2,...,?r]T ? Rr, J = integraltext p(?)d?. Suppose that ?? is the maximizer of p(?) or equivalently logp(?) 220 which satisfies ?p(?) ?? |?? = 0 or ? logp(?) ?? |?? = 0. (8.28) We expand logp(?) around ?? using a Taylor series: logp(?) similarequal logp(??)? 12(????)TI(??)(?? ??), (8.29) where I(?) is an r?r matrix whose ijth element is Iij(?) = ?? 2 logp(?) ??i??j . (8.30) Note that the first-order term in (8.29) is zero by virtue of (8.28). If p(?) is a pdf function with parameter ?, then I(?) is the famous Fisher information matrix [16]. Substituting (8.29) into J gives J similarequal p(??) integraldisplay exp{?12(?? ??)TI(??)(? ? ??)}d? = p(??) radicalBig (2pi)r/|I(??)|. (8.31) Appendix 8.II ? About ???,?,? If a non-information prior p(?) is assumed2, the maximizer ???,?,? satisfies ???,?,? = argmax ? p(y|?,?,?,?) (8.32) = argmin ? (T{y;?}?B?,??)T(T{y;?}?B?,??) = argmin ? L(?,?,?,?) where L(?,?,?,?) .= (T{y;?}?B?,??)T(T{y;?}?B?,??). Using the fact that B?,?? = [W?1?,...,W?m?]? = B?,??; B?,? .= msummationdisplay i=1 ?iW?i , (8.33) 2If a Gaussian prior is assumed, a similar derivation can be carried. 221 The term L(?,?,?,?) becomes L(?,?,?,?) = (T{y;?}?B?,??)T(T{y;?}?B?,??), (8.34) which is quadratic in ?. The optimum ???,?,? is unique and its value is ???,?,? = (B?,?TB?,?)?1B?,?Ty = B?,??T{y;?}. (8.35) where [.]? is the pseudo-inverse. Substituting (8.35) into L(?,?,?,?) yields L(?,?,???,?,?,?) = T{y;?}T(Id ?B?,?B?,??)T?{y}. (8.36) It is easy to show that I(?) is no longer a function of ? and equals to I = ??2B?,?TB?,?. (8.37) 222 Chapter 9 Conclusions 9.1 Summary This doctoral dissertation addressed several approaches for unconstrained face recognition from three aspects. The first aspect is to directly model illumination and pose variations. The second aspect is to use nonlinear kernel learning to char- acterize the face appearance manifold. The third aspect is to perform recognition using video sequences. Here are some of the key contributions made in the thesis: ? In the generalized photometric stereo approach in Chapter 2, we proposed a rank constraint on the product of albedo and surface normal that provides a very compact yet efficient encoding of the identity. In the literature, usually two separate linear subspaces [43, 66] are constructed for shape and texture, respectively, assuming the independence between them. This assumption might result in an overfit for the problem [202]. By using the integrability and symmetry constraints, we then achieve a lin- 223 earized algorithm that recovers the class-specific albedos and surface normals under the most general and hence most difficult setting, i.e., the observation matrix consists of different objects under different illuminations. In particu- lar, this algorithm takes into account the effect of varying albedo field in the integrability term. ? The proposed illuminating light field approach in Chapter 3 is image-based and requires no explicit 3D model. It is computationally efficient and able to deal with images of small size. In contrast, the 3D model-based approach [66] is computationally intense and needs image of large size. ? Probabilistic analysis of kernel principal components in Chapter 4 provides a tool for modeling nonlinear manifold in an interpretable manner. This also implicitly characterizes the high order statistical information. The prob- abilistic nature enables a mixture modeling of kernel principal component analysis and an effective classification scheme. ? 
Computing the probabilistic distance measures (e.g. the Chernoff distance, the Bhatacharyya distance, the KL distance, and the divergence distance) between two Gaussian densities in the RKHS is presented in Chapter 5. Since the RKHS might be infinite-dimensional, we derive a limiting distance which can be easily computed. This leads to a novel paradigm for studying pattern separability, especially for visual pattern lying in a nonlinear manifold. ? Presented in Chapter 6 is an adaptive method for visual tracking which stabi- lizes the tracker by embedding deterministic linear prediction into stochastic diffusion. Numerical solutions have been provided using particle filters with the adaptive observation model arising from the adaptive appearance model, 224 adaptive state transition model, and adaptive number of particles. Occlusion analysis is also embedded in the particle filter. ? A systematic method for face recognition from a probe video, compared with a gallery of still templates is introduced in Chapter 7. A time series state space model is used to accommodate the video and SIS algorithms provide the numerical solutions to the model. This probabilistic framework, which overcomes many difficulties arising in conventional recognition approaches using video, is registration-free and poses no need for selecting good frames. It turns out that an immediate recognition decision can be made in our framework due to the degeneracy of the posterior probability of the identity variable. The conditional entropy can also serve as a good indication for the convergence. ? We present in Chapter 8 a generic framework of modeling human identity for a single image, a group of images, or a video sequence . This framework provides a complete statistic description of the identity. Various current recognition schemes are just instances of this generic framework. 9.2 Future works Unconstrained face recognition can be expanded in a multitude of ways. The following just lists some potential avenues to explore in the context of the proposed approaches: ? In Chapters 2 and 3, we utilize a Lambertian reflectance model to describe illumination phenomenon. However, the Lambertian reflectance model is a rather simple model and unable to handle cast shadows and specular regions. 225 Although we employ a simple technique to exclude pixels in cast shadow and specular regions, it turns out when the light comes from extreme directions (e.g. highly off-frontal ones), the recognition performance drops quickly. We need to investigate these lighting conditions. Alternatively, a complex illumination model providing a better illumination description can be used. ? In the illuminating light field approach of Chapter 3, we need an image-based rendering technique to handle novel poses. Some promising works along this line are [67, 110, 111]. ? On probabilistic analysis of kernel principal components and probability dis- tances on RKHS, possible future works include (i) how to design or select the kernel function for a given task, be it classification or modeling; (ii) evaluat- ing the kernels for set based on the derived probabilistic distances (as argued in Section 5.3.5) in a classification device such as Support Vector Machine for various applications; (iii) utilizing probabilistic distances for an independent component analysis (ICA) as in [170]. ? The visual tracking algorithm of Chapter 6 can be extended in many ways [206, 212]. (i) Combining shape information into appearance. 
Appearance and shape are two very important visual cues arguably presented in a comple- mentary fashion [133]. (ii) Utilizing appearance from multiple views. Using multiple views can overcome some difficulty in a single view. For example, an object might be occluded in one view but not the other one. Using the multi- view geometry, we can infer the movement of the object in the occluded view [207]. (iii) Here we mostly model the movement of the foreground object. Joint modeling of foreground and background movements is very promising 226 [212, 213] since the stabilization obtained by background modeling signifi- cantly reduces the clutter in the background that confuses the foreground tracking algorithm. ? In simultaneous tracking and recognition of Chapter 7, various issues ex- ist. (i) Robustness. Generally speaking, our approach is more robust than still-image-based approach since we essentially compute the recognition score based on all video frames and, in each frame, all kinds of transformed ver- sions of the face part corresponding to the sample configurations that are considered. However, since we take no explicit measure when handling frames with outlier or other unexpected factors, recognition scores based on those frames might be low. But, this is a problem for other approaches too. The assumption that the identity does not change as time proceeds, i.e., p(nt|nt?1) = ?(nt?nt?1), could be relaxed by having nonzero transition probabilities between different identity variables. Using nonzero transition probabilities will enable us an easier transition to the correct choice in case that the initial choice is incorrectly chosen, making the algorithm more ro- bust. (ii) Resampling. In the recognition algorithm, the marginal distribution {(?(j)t?1,wprime(j)t?1)}Jj=1 is sampled to obtain the sample set {(?(j)t ,1)}Jj=1. This may cause problems in principle since there is no conditional independence between ?t and nt given y0:t. However, in a practical sense, this is not a big disadvantage because the purpose of resampling is to ?provide chances for the good streams (samples) to amplify themselves and hence rejuvenate the sampler to produce better results for future states as the system evolves? [159]. The resampling scheme can either be simple random sampling with 227 weights (like in CONDENSATION), residual sampling, or local Monte Carlo methods. ? Further, in the experimental part of Chapter 8, we can extend our approach to perform recognition from video sequences with localization, illumination, and pose variations. Again, Sequential Monte Carlo methods can be used to accommodate temporal continuity. This leads to a very high-dimensional state space to explore. Efficient simulation techniques are desired. In fact, the issue of computation load also exist for the efficient algorithm in Chapter 7. There, two important numbers affecting the computation are J, the num- ber of motion samples, and N, the size of the database. (i) The choice of J is an open question in the statistics literature. In general, larger J produces more accurate results. (ii) The choice of N depends on application. Since a small database is used in this experiment, it is not a big issue here. However, the computational burden may be excessive if N is large. One possibility is to use a continuous parameterized representation, say ? as in Chapter 8, instead of discrete identity variable n. Now the task reduces to computing p(?t,?t|y0:t). 
The approaches taken in this thesis by no means cover the whole spectrum of the unconstrained face recognition problem and address only a small portion of all available issues. Some possible important issues, other than those addressed in the thesis, include the following: ? Aging. Aging is a very important topic in unconstrained face recognition. Often the stored gallery images are taken well before the probe images. For example, passengers hold passports with photos taken when the passport was issued years ago. While one solution is to maintain the gallery images 228 up-to-date, a systematic solution is theoretical modeling of the generic affect of aging. This modeling is very difficult due to the individualized variation. Presented in [50] is just one attempt with limited success. More research efforts are certainly worthwhile. ? Expression. Facial expression analysis and modeling attracts a lot of atten- tion [42, 60, 61] and some approaches [60] focus on expression recognition, i.e., identifying different modalities of facial expression such as happy, angry, disgust, etc. Face recognition under expression variation has not been fully explored. Clearly expression recognition and face recognition under expres- sion variation are two different topics. However, expression recognition and modeling is a crucial component for accurate face recognition under expres- sion variation. Further, facial expressions manifest themselves in a temporal dimension. The manner that an individual poses expressions (in natural contexts) presents certain behavioral aspect of the face biometric. Utilizing temporal infor- mation embedded in facial expression for face recognition under expression variation is an interesting research topic. ? Distorted imagery. Images as one main digital media are to be compressed, stored, transmitted and so on. Compression schemes sacrifice image quality for fewer bits to encode the image, storage devices are susceptible to various damages, trans- mission channels are often noisy. All these results in distorted images. How to perform face recognition accounting for sources of distortions [199] is a very practical research topic that needs to be explored. 229 BIBLIOGRAPHY [Books on general topics] [1] B. Anderson and J. Moore, Optimal Filtering. New Jersey: Prentice Hall, Engle-wood Cliffs, 1979. [2] Y. Bar-Shalom and T. Fortmann, Tracking and Data Association. Academic Press, 1988. [3] G. Casella and R. L. Berger, Statistical Inference. Duxbury, 2002. [4] T.M. Cover and J.A. Thomas, Elements of Information Theory. Wiley, 1991. [5] P. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach. Prentice Hall International, 1982. [6] A. Doucet, N. d. Freitas, and N. Gordon (Eds.), Sequential Monte Carlo Methods in Practice. Springer-Verlag, New York, 2001. [7] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. Wiley- Interscience, 2001. [8] G. H. Golub and C. F. Van Loan, Matrix Computations. The Johns Hopkins University Press, 1996. 230 [9] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learn- ing: Data Mining, Inference, and Prediction. Springer-Verlag, New York, 2001. [10] B. Horn and M. Brooks (Eds.) Shape from Shading. MIT Press, 1989. [11] P.J. Huber, Robust statistics. Wiley, 1981. [12] I. T. Jolliffe, Principal Component Analysis. New York: Springer-Verlag, 2002. [13] Kullback, Information Theory and Statistics. Wiley, New York, 1959. [14] J.S. Liu, Monte Carlo Strtegies in Scientific Computing. Springer, 2001. [15] K. V. 
Mardia, J. T. Kent, and J. M. Bibby, Multivariate Analysis. Academic Press, 1979. [16] C. Robert and G. Casella, Monte Carlo Statistical Methods. Springer, 1999. [17] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge University Press, 2004. [18] M.A. Tanner, Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions. Springer, 1996. [19] V. N. Vapnik, The Nature of Statistical Learning Theory. Springer-Verlag, New York, ISBN 0-387-94559-8, 1995. [Books and Review Papers on face recognition] [20] M.S.Bartlett, Face Image Analysis by Unsupervised Learning. Kluwer Aca- demic Publishers, 2001. 231 [21] R. Chellappa, C. L. Wilson, and S. Sirohey, ?Human andmachine recognition of faces: A survey,? Proceedings of IEEE, vol. 83, pp. 705?740, 1995. [22] S. Gong, S.J. McKenna, Dynamic Vision: From Images to Face Recognition. Imperial College Press, 2000. [23] P.W. Hallinan, G. Gordon, A. Yuille, P. Giblin, and D. Mumford, Two- and Three-Dimensional Patterns of the Face. A. K. Peters, Ltd., 1999. [24] T. Kanade, Computer Recognition of Human Faces. Birhauser, Basel, Switzerland, and Stuggart, Germany, 1973. [25] S.Z. Li, A.K. Jain (Eds.), Handbook of Face Recognition. Springer-Verlag, 2004. [26] H. Wechsler, P.J. Phillips, V. Bruce, F.F. Soulie, and T.S. Huang (Eds.), Face Recognition: From Theory to Applications. Springer-Verlag, 1998. [27] W. Zhao, R. Chellappa, A. Rosenfeld, and J. Phillips, ?Face recognition: A literature survey,? ACM Computing Surveys, vol. 12, 2003. [Biometrics] [28] Biometric Catalog. http://www.biomtricscatalog.org. [29] Biometric Consortium. http://www.biometrics.org. [30] Deparment of Homeland Security (DHS), US-VISIT Program. http://www.dhs.goc/dhspublic/interapp/editorial/editorial 0333.xml. [31] National Institue of Standards and Technologies (NIST), Biometrics Web Site. http://www.nist.gov/biometrics. 232 [32] D.M. Blackburn, ?Biometrics 101 (version 3.1)? http://www.biometricscatalog.org/biometrics/Introduction.asp, March 2004. [33] R. Hietmeyer, ?Biometric identification promises fast and secure processings of airline passengers,? The Internationl Civil Aviation Organization Journal, vol. 55, no. 9, pp. 10-11, 2000. [34] P.J. Phillips, R.M. McCabe, and R. Chellappa, ?Biometric image process- ing and recognition,? Proceedings of European Signal Processing Conference, 1998. [Psychophysical and neural aspects] [35] I. Biederman and P. Kalocsai, ?Neural and psychophysical analysis of object and face recognition,? In Face Recognition: From Theory to Applications, H. Wechsler, P.J. Phillips, V. Bruce, F.F. Soulie, and T.S. Huang (Eds.), Springer-Verlag, 1998. [36] V. Bruce, Recognizing Faces. Lawrence Erlbaum Associates, London, U.K., 1988. [37] V. Bruce, P.J.B. Hancock, and A.M. Burton, ?Human face perception and identification,? In Face Recognition: From Theory to Applications, H. Wech- sler, P.J. Phillips, V. Bruce, F.F. Soulie, and T.S. Huang (Eds.), Springer- Verlag, 1998. [38] A.J. O?Toole, ?Psychological and neural perspectives on human faces recog- nition,? In Handbook of Face Recognition, S.Z. Li and A.K. Jain (Eds.), Springer, 2004. 233 [39] B. Knight and A. Johnston, ?The role of movement in face recognition,? Visual Cognition, vol. 4, pp. 265-274, 1997. [Face recognition from still images] [40] M.S. Barlett, H.M. Ladesand, and T.J. Sejnowski, ?Independent component representations for face recognition,? Proceedings of SPIE 3299, pp. 528-539, 1998. [41] P. N. Belhumeur, J. P. 
Hespanha, and D. J. Kriegman, ?Eigenfaces vs. fish- erfaces: Recognition using class specific linear projection,? IEEE Trans. Pat- tern Analysis and Machine Intelligence, vol. 19, pp. 711?720, 1997. [42] M.J. Black and Y. Yacoob, ?Recognizing facial expressions in image se- quences using local paramterized models of image motion,? International Journal of Computer Vision, vol. 25, pp. 23-48, 1997. [43] T. Cootes, G.Edwards, andC. Taylor, ?Active appearance model,? European Conference on Computer Vision, 1998. [44] K. Etemad and R. Chellappa, ?Discriminant analysis for recognition of hu- man face images,? Journal of Optical Society of America A, pp. 1724?1733, 1997. [45] T. Huang, Z. Xiong, and Z. Zhang, ?Face recognition applications,? Hand- book of Face Recognition, S. Li and A. K. Jain (Eds.), Springer, 2004. [46] M.D. Kelly, ?Visual identification of people by computer,? Tech. rep. AI-130, Stanford AI project, Stanform, CA, 1970. 234 [47] M. Kirby and L. Sirovich, ?Application of Karhunen-Lo?eve procedure of the characterization of human faces,? IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 12, pp. 103?108, 1990. [48] M. Lades, J.C. Vorbruggen, J. Buhmann, J. Lange, C. v. d. Malsburg, R.P. Wurtz, and W. Konen, ?Distortion invariant object recognition in the dy- namic link architecture,? IEEE Trans. Computers, vol. 42, no. 3, pp. 300? 311, 1993. [49] A. Lanitis, C.J. Taylor, and T.F. Cootes, ?Automatic interpretation and coding of face images using flexible models,? IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 442-455, 1997. [50] A. Lanitis, C.J. Taylor, and T.F. Cootes, ?Toward automatic simulation of aging affects on face images,? IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, pp. 442-455, 2002. [51] S.H. Lin, S.Y. Kung, and J.J. Lin, ?Face recognition/detection by probabilis- tic decision based neural network,? IEEE Trans. Neural Networks, vol. 9, pp. 114-132, 1997. [52] C. Liu and H. Wechsler, ?Evolutionary pursuit and its applications to face recognition,? IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, pp. 570-582, 2000. [53] M.J. Lyons, J. Biudynek, and S. Akamatsu, ?Automatic classification of sin- gle facial images,? IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 21, no. 12, pp. 1357-1362, 1999. 235 [54] B. Moghaddam and A. Pentland, ?Probabilistic visual learning for object representation,? IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. PAMI-19, no. 7, pp. 696?710, 1997. [55] B. Moghaddam, T. Jebara, and A. Pentland, ?Bayesian modeling of facial similarity,? Advances in Neural Information Processing Systems, vol. 11, pp. 910?916, 1999. [56] B. Moghaddam, ?Principal manifolds and probabilistic subspaces for vi- sual recognition,? IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, pp. 780?788, 2002. [57] P.J. Phillips, ?Support vector machines applied to face recognition,? Ad- vances in Neural Information Processing Systems, vol. 11, pp. 803-809, 1998. [58] P.J. Phillips, H. Moon, S. Rizvi, and P.J. Rauss, ?The FERET evaluation methodology fro face-recognition algorithms,? IEEE Trans. attern Analysis and Machine Intelligence, vol. 22, pp. 1090?1104, 2000. [59] P.J. Phillips, P. Grother, R.J. Micheals, D.M. Blackburn, E. Tabbssi, and M. Bone, ?Face recognition vendor test 2002: evaluation report? NISTIR 6965, http://www.frvt.org, 2003. [60] Y. Tian, T. Kanade, and J. Cohn, ?Recognizing action units of facial ex- pression analysis,? IEEE Trans. 
Pattern Analysis and Machine Intelligence, vol. 23, pp. 1-19, 2001. [61] Y. Tian, T. Kanade, and J. Cohn, ?Recognizing action units of facial ex- pression analysis,? In Handbook of Face Recognition, S.Z. Li and A.K. Jain (Eds.), Springer, 2004. 236 [62] M. Turk and A. Pentland, ?Eigenfaces for recognition,? Journal of Cognitive Neuroscience, vol. 3, pp. 72?86, 1991. [63] M.-H. Yang, ?Kernel eigenfaces vs. kernel Fisherfaces: Face recognition using kernel methods,? Proceedings of International Conference on Automatic Face and Gesture Recognition, 2002. [64] W. Zhao, R. Chellappa, and A. Krishnaswamy, ?Discriminant analysis of principal components forface recognition,? Proceedings of International Con- ference on Automatic Face and Gesture Recognition, pp. 361-341, Nara, Japan, 1998. [Face recognition across illumination and poses] [65] J. Atick, P. Griffin, and A. Redlich, ?Statistical approach to shape from shad- ing: Reconstrunction of3-dimensional facesurfaces fromsingle 2-dimentional images,? Neural Computation, vol. 8, pp. 1321?1340, 1996. [66] V. Blanz and T. Vetter, ?Face recognition based on fitting a 3D morphable model,? IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 25, pp. 1063?1074, 2003. [67] T. Cootes, K. Walker, and C. Taylor, ?View-based Active appearance mod- els,? Proceedings of International Conference on Automatic Face and Gesture Recognition, 2000. [68] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, ?From few to many: Illumination cone models for face recognition under variable lighting and pose,? IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, pp. 643 ?660, 2001. 237 [69] R. Gross, I. Matthews, and S. Baker, ?Eigen light-fields and face recognition across pose,? Proceedings of Intenational Conference on Automatic Face and Gesture Recognition, Washington D.C., 2002. [70] R. Gross, I. Matthews, and S. Baker, ?Fisher light-fields for face recognition across pose and illumination,? Proceedings of the German Symposium on Pattern Recognition, Washington D.C., 2002. [71] R. Gross, I. Matthews, and S. Baker, ?Appearance-based face recognition and light-fields,? IEEE Transactions on Pattern Analysis and Machine In- telligence, vol. 26, no. 4, pp. 449 - 465, April, 2004. [72] A. Pentland, B. Moghaddam, and T. Starner, ?View-based and modular eigenspaces for face recognition,? Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Seattle, WA, 1994. [73] S. Romdhani and T. Vetter, ?Efficient, robust and accurate fitting of a 3D morphable model,? Proceedings of IEEE Internationl Conference on Com- puter Vision, pp. 59-66, Nice, France, 2003. [74] A. Shashua and T. R. Raviv, ?The quotient image: Class based re-rendering and recognition with varying illuminations,? IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, pp. 129?139, 2001. [75] T. Sim, S. Baker, and M. Bast, ?The CMU pose, illuminatin, and expression (PIE) database,? Proceedings of International Conference on Automatic Face and Gesture Recognition, pp. 53?58, Washington D.C., 2002. 238 [76] T. Vetter and T. Poggio, ?Linear object classes and image synthesis from a single example image,? IEEE Trans. Pattern Analysis and Machine Intelli- gence, vol. 11, pp. 733?742, 1997. [77] M.A.O. Vasilescu and D. Terzopoulos, ?Multilinear analysis of image en- sembles: Tensorfaces,? European Conference on Computer Vision, vol. 2350, pp. 447-460, Copenhagen, Denmark, May 2002. [78] M. Vasilescu and D. 
Terzopoulos, ?Multilinear image analysis for facialrecog- nition,? Proceedings of International Conference on Pattern Recognition, Quebec City, Canada, 2002. [Face recognition from video sequences] [79] T. Choudhury, B. Clarkson, T. Jebara, and A. Pentland, ?Multimodal per- son recognition using unconstrained audio and video,? Proceedings of Inter- national Conference on Audio- and Video-Based Person Authentication, pp. 176?181, Washington D.C., 1999. [80] A. Fitzgibbon and A. Zisserman, ?Joint manifold distance: a new approach to appearance based clustering,? Proceedings of IEEE Conference on Com- puter Vision and Pattern Recognition, Madison, WI, 2003. [81] R. Gross and J. Shi, ?The CMU Motion of Body (MoBo) Database,? CMU- RI-TR-01-18, 2001. [82] A. Howell and H.Buxton, ?Facerecognition using radialbasis functionneural networks,? Proceedings of British Machine Vision Conference, pp. 455?464, 1996. 239 [83] T. Jebara and A. Pentland, ?Parameterized structure from motion for 3D adaptive feedback tracking of faces,? Proceedings of IEEE Computer Soci- ety Conference on Computer Vision and Pattern Recognition, pp. 144 ?150, Puerto Rico, 1997. [84] K. Lee, M. Yang, and D. Kriegman, ?Video-based face recognition using probabilistic appearance manifolds,? Proceedings of IEEE Computer Soci- ety Conference on Computer Vision and Pattern Recognition, Madison, WI, 2003. [85] B. Li and R. Chellappa, ?Face verification through tracking facial features,? Journal of Optical Society of America A, vol. 18, no. 12, pp. 2969?2981, 2001. [86] B. Li and R. Chellappa, ?A generic approach to simultaneous tracking and verification in video,? IEEE Transaction on Image Processing, vol. 11, no. 5, pp. 530?554, 2002. [87] Y. Li, S. Gong, and H. Liddell, ?Modelling faces dynamically across views and over time,? Proceedings of International Conference on Computer Vi- sion, pp. 554 ?559, Hawaii, 2001. [88] Y. Li, S. Gong, and H. Liddell, ?Constructing facial identity surfaces in a nonlinear discriminant space,? Proceedings of IEEE Computer Society Con- ference on Computer Vision and Pattern Recognition, Hawaii, 2001. [89] X. Liu and T. Chen, ?Video-based face recognition using adaptive hidden markov models,? Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Madison, WI, 2003. 240 [90] S. McKenna and S. Gong, ?Non-intrusive person authentication for access control by visual tracking and face recognition,? Proceedings of International Conference on Audio- and Video-based Biometric Person Authentication, pp. 177?183, Crans-Montana, Switzerland, 1997. [91] G. Shakhnarovich, J. Fisher, and T. Darrell, ?Face recognition from long- term observations,? Proc. European Conference on Computer Vision, Copen- hagen, Denmark, 2002. [92] J. Steffens, E. Elagin, and H. Neven, ?Personspotter - fast and robust system for human detection, tracking, and recognition,? Proceedings of Internationl Conference on Automatic Face and Gesture Recognition, pp. 516?521, Nara, Japan, 1998. [93] H. Wechsler, V. Kakkad, J. Huang, S. Gutta, and V. Chen, ?Automatic video-based person authentication using the RBF network,? Proceedings of International Conference on Audio- and Video-based Biometric Person Au- thentication, pp. 85?92, Crans-Montana, Switzerland, 1997. [Lighting and illumination] [94] R. Basri and D. Jacobs, ?Photometric stereo with general, unknown light- ing,? Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, vol. II, pp. 374?381, Hawaii, 2001. 
[95] R. Basri and D. Jacobs, "Lambertian reflectance and linear subspaces," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, pp. 218-233, 2003.
[96] H. Hayakawa, "Photometric stereo under a light source with arbitrary motion," Journal of the Optical Society of America A, vol. 11, 1994.
[97] P. N. Belhumeur and D. J. Kriegman, "What is the set of images of an object under all possible illumination conditions?" International Journal of Computer Vision, vol. 28, pp. 245-260, 1998.
[98] P. Belhumeur, D. Kriegman, and A. Yuille, "The bas-relief ambiguity," International Journal of Computer Vision, vol. 35, pp. 33-44, 1999.
[99] R. T. Frankot and R. Chellappa, "A method for enforcing integrability in shape from shading algorithms," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 10, pp. 439-451, 1987.
[100] R. Ramamoorthi and P. Hanrahan, "On the relationship between radiance and irradiance: Determining the illumination from images of a convex Lambertian object," Journal of the Optical Society of America A, vol. 18, pp. 2448-2459, 2001.
[101] A. Shashua, "On photometric issues in 3D visual recognition from a single 2D image," International Journal of Computer Vision, vol. 21, pp. 99-122, 1997.
[102] I. Shimshoni, Y. Moses, and M. Lindenbaum, "Shape reconstruction of 3D bilaterally symmetric surfaces," International Journal of Computer Vision, vol. 39, pp. 97-100, 2000.
[103] A. L. Yuille, D. Snow, R. Epstein, and P. N. Belhumeur, "Determining generative models of objects under varying illumination: Shape and albedo from multiple images using SVD and integrability," International Journal of Computer Vision, vol. 35, pp. 203-222, 1999.
[104] W. Zhao and R. Chellappa, "Symmetric shape from shading using self-ratio image," International Journal of Computer Vision, vol. 45, pp. 55-75, 2001.
[105] Q. F. Zheng and R. Chellappa, "Estimation of illuminant direction, albedo and shape from shading," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 13, pp. 680-702, 1991.

[Tracking, detection, and registration]

[106] A. Azarbayejani and A. Pentland, "Recursive estimation of motion, structure, and focal length," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 17, pp. 562-575, 1995.
[107] A. Bergen, P. Anandan, K. Hanna, and R. Hingorani, "Hierarchical model-based motion estimation," European Conference on Computer Vision, pp. 237-252, Stockholm, Sweden, 1992.
[108] M. J. Black and A. D. Jepson, "EigenTracking: Robust matching and tracking of articulated objects using a view-based representation," European Conference on Computer Vision, vol. 1, pp. 329-342, Cambridge, UK, 1996.
[109] M. J. Black and D. J. Fleet, "Probabilistic detection and tracking of motion discontinuities," Proceedings of International Conference on Computer Vision, vol. 2, pp. 551-558, Greece, 1999.
[110] M. E. Brand, "Morphable 3D models from video," Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Hawaii, 2001.
[111] C. Bregler, A. Hertzmann, and H. Biermann, "Recovering nonrigid 3D shape from image streams," Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Hilton Head, SC, 2000.
[112] T. J. Broida, S. Chandra, and R. Chellappa, "Recursive techniques for estimation of 3-D translation and rotation parameters from noisy image sequences," IEEE Trans. Aerospace and Electronic Systems, vol. AES-26, pp. 639-656, 1990.
[113] D. Comaniciu, V. Ramesh, and P. Meer, "Real-time tracking of non-rigid objects using mean shift," Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 142-149, Hilton Head, SC, 2000.
[114] N. J. Gordon, D. J. Salmond, and A. F. M. Smith, "Novel approach to nonlinear/non-Gaussian Bayesian state estimation," IEE Proceedings on Radar and Signal Processing, vol. 140, pp. 107-113, 1993.
[115] G. D. Hager and P. N. Belhumeur, "Efficient region tracking with parametric models of geometry and illumination," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 20, pp. 1025-1039, 1998.
[116] M. Irani, "Multi-frame optical flow estimation using subspace constraints," Proceedings of International Conference on Computer Vision, pp. 626-633, Greece, 1999.
[117] M. Irani and P. Anandan, "Factorization with uncertainty," European Conference on Computer Vision, pp. 539-553, Dublin, Ireland, 2000.
[118] M. Isard and A. Blake, "Contour tracking by stochastic propagation of conditional density," European Conference on Computer Vision, pp. 343-356, Cambridge, UK, 1996.
[119] M. Isard and A. Blake, "ICONDENSATION: Unifying low-level and high-level tracking in a stochastic framework," European Conference on Computer Vision, vol. 1, pp. 767-781, Freiburg, Germany, 1998.
[120] A. D. Jepson, D. J. Fleet, and T. El-Maraghi, "Robust online appearance models for visual tracking," Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 415-422, Hawaii, 2001.
[121] F. Jurie and M. Dhome, "A simple and efficient template matching algorithm," Proceedings of International Conference on Computer Vision, vol. 2, pp. 544-549, Vancouver, BC, 2001.
[122] Q. Ke and T. Kanade, "A subspace approach to layer extraction," Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Hawaii, 2001.
[123] B. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," International Joint Conference on Artificial Intelligence, 1981.
[124] B. North, A. Blake, M. Isard, and J. Rittscher, "Learning and classification of complex dynamics," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, pp. 1016-1034, 2000.
[125] G. Qian and R. Chellappa, "Structure from motion using sequential Monte Carlo methods," Proceedings of International Conference on Computer Vision, pp. 614-621, Vancouver, BC, 2001.
[126] C. Rasmussen and G. Hager, "Probabilistic data association methods for tracking complex visual objects," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 560-576, 2001.
[127] H. Sidenbladh, M. J. Black, and D. J. Fleet, "Stochastic tracking of 3D human figures using 2D image motion," European Conference on Computer Vision, vol. 2, pp. 702-718, Copenhagen, Denmark, 2002.
[128] J. Sullivan and J. Rittscher, "Guiding random particles by deterministic search," Proceedings of International Conference on Computer Vision, vol. 1, pp. 323-330, Vancouver, BC, 2001.
[129] C. Tomasi and T. Kanade, "Shape and motion from image streams under orthography: a factorization method," International Journal of Computer Vision, vol. 9, no. 2, pp. 137-154, 1992.
[130] K. Toyama and A. Blake, "Probabilistic tracking in a metric space," Proceedings of International Conference on Computer Vision, pp. 50-59, Vancouver, BC, 2001.
[131] J. Vermaak, P. Pérez, M. Gangnet, and A. Blake, "Towards improved observation models for visual tracking: selective adaptation," European Conference on Computer Vision, pp. 645-660, Copenhagen, Denmark, 2002.
[132] P. Viola and M. Jones, "Robust real-time object detection," Second Intl. Workshop on Statistical and Computational Theories of Vision, Vancouver, BC, 2001.
[133] Y. Wu and T. S. Huang, "A co-inference approach to robust visual tracking," Proceedings of International Conference on Computer Vision, vol. 2, pp. 26-33, Vancouver, BC, 2001.
[134] C. Yang, R. Duraiswami, A. Elgammal, and L. Davis, "Real-time kernel-based tracking in joint feature-spatial spaces," Tech. Report CS-TR-4567, Univ. of Maryland, 2004.

[Others in computer vision and graphics]

[135] M. J. Black and A. D. Jepson, "A probabilistic framework for matching temporal trajectories," Proceedings of International Conference on Computer Vision, pp. 176-181, Greece, 1999.
[136] R. Bolle and D. Cooper, "On optimally combining pieces of information, with application to estimating 3-D complex-object position from range data," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 8, pp. 619-638, 1986.
[137] D. Forsyth, "Shape from texture and integrability," Proc. International Conference on Computer Vision, pp. 447-453, Vancouver, BC, 2001.
[138] W. T. Freeman and J. B. Tenenbaum, "Learning bilinear models for two-factor problems in vision," Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Puerto Rico, 1997.
[139] P. Fua, "Regularized bundle adjustment to model heads from image sequences without calibrated data," International Journal of Computer Vision, vol. 38, pp. 153-157, 2000.
[140] S. J. Gortler, R. Grzeszczuk, R. Szeliski, and M. Cohen, "The lumigraph," Proceedings of SIGGRAPH, pp. 43-54, New Orleans, LA, USA, 1996.
[141] D. Jacobs, "Linear fitting with missing data for structure-from-motion," Computer Vision and Image Understanding, vol. 82, pp. 57-81, 2001.
[142] A. Laurentini, "The visual hull concept for silhouette-based image understanding," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 16, no. 2, pp. 150-162, 1994.
[143] M. Levoy and P. Hanrahan, "Light field rendering," Proceedings of ACM SIGGRAPH, New Orleans, LA, USA, 1996.
[144] W. Matusik, C. Buehler, R. Raskar, S. Gortler, and L. McMillan, "Image-based visual hulls," Proceedings of SIGGRAPH, pp. 369-374, New Orleans, LA, USA, 2000.
[145] A. Roy Chowdhury and R. Chellappa, "Face reconstruction from video using uncertainty analysis and a generic model," Computer Vision and Image Understanding, vol. 91, pp. 188-213, 2003.
[146] Y. Shan, Z. Liu, and Z. Zhang, "Model-based bundle adjustment with application to face modeling," Proceedings of International Conference on Computer Vision, pp. 645-651, Vancouver, BC, 2001.

[Statistical analysis and computing]

[147] B. Adhikari and D. Joshi, "Distance, discrimination et résumé exhaustif," Publ. Inst. Statist., vol. 5, pp. 57-74, 1956.
[148] X. Boyen and D. Koller, "Tractable inference for complex stochastic processes," Proceedings of the 14th Annual Conference on Uncertainty in AI (UAI), pp. 33-42, Madison, Wisconsin, 1998.
[149] M. Brand, "Incremental singular value decomposition of uncertain data with missing values," European Conference on Computer Vision, pp. 707-720, Copenhagen, Denmark, 2002.
[150] A. Bhattacharyya, "On a measure of divergence between two statistical populations defined by their probability distributions," Bull. Calcutta Math. Soc., vol. 35, pp. 99-109, 1943.
[151] H. Chernoff, "A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations," Annals of Math. Stat., vol. 23, pp. 493-507, 1952.
[152] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Roy. Statist. Soc. B, vol. 39, 1977.
[153] A. Doucet, S. J. Godsill, and C. Andrieu, "On sequential Monte Carlo sampling methods for Bayesian filtering," Statistics and Computing, vol. 10, no. 3, pp. 197-209, 2000.
[154] D. Fox, "KLD-sampling: Adaptive particle filters and mobile robot localization," Neural Information Processing Systems (NIPS), 2001.
[155] A. Hyvärinen, "Survey on independent component analysis," Neural Computing Surveys, vol. 2, pp. 94-128, 1999.
[156] T. Kailath, "The divergence and Bhattacharyya distance measures in signal selection," IEEE Trans. on Comm. Tech., vol. COM-15, pp. 52-60, 1967.
[157] G. Kitagawa, "Monte Carlo filter and smoother for non-Gaussian nonlinear state space models," J. Computational and Graphical Statistics, vol. 5, pp. 1-25, 1996.
[158] T. Lissack and K. Fu, "Error estimation in pattern recognition via L-distance between posterior density functions," IEEE Trans. Information Theory, vol. 22, pp. 34-45, 1976.
[159] J. S. Liu and R. Chen, "Sequential Monte Carlo methods for dynamic systems," Journal of the American Statistical Association, vol. 93, pp. 1031-1041, 1998.
[160] P. Mahalanobis, "On the generalized distance in statistics," Proc. National Inst. Sci. (India), vol. 12, pp. 49-55, 1936.
[161] K. Matusita, "Decision rules based on the distance for problems of fit, two samples and estimation," Ann. Math. Stat., vol. 26, pp. 631-640, 1955.
[162] S. Roweis and L. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323-2326, December 2000.
[163] E. Patrick and F. Fisher, "Nonparametric feature selection," IEEE Trans. Information Theory, vol. 15, pp. 577-584, 1969.
[164] P. Penev and J. Atick, "Local feature analysis: A general statistical theory for object representation," Network: Computation in Neural Systems, vol. 7, pp. 477-500, 1996.
[165] H. Shum, K. Ikeuchi, and R. Reddy, "Principal component analysis with missing data and its applications to polyhedral object modeling," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 17, pp. 854-867, 1995.
[166] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, no. 5500, pp. 2319-2323, December 2000.
[167] M. E. Tipping and C. M. Bishop, "Mixtures of probabilistic principal component analysers," Neural Computation, vol. 11, no. 2, pp. 443-482, 1999.
[168] M. E. Tipping and C. M. Bishop, "Probabilistic principal component analysis," Journal of the Royal Statistical Society, Series B, vol. 61, pp. 611-622, 1999.
[169] T. Wiberg, "Computation of principal components when data are missing," Proc. Second Symp. Computational Statistics, pp. 229-236, 1976.

[Machine learning and kernel methods]

[170] F. Bach and M. I. Jordan, "Kernel independent component analysis," Journal of Machine Learning Research, vol. 3, pp. 1-48, 2002.
[171] F. Bach and M. I. Jordan, "Learning graphical models with Mercer kernels," Advances in Neural Information Processing Systems, 2002.
[172] G. Baudat and F. Anouar, "Generalized discriminant analysis using a kernel approach," Neural Computation, vol. 12, pp. 2385-2404, 2000.
[173] F. Girosi, M. Jones, and T. Poggio, "Regularization theory and neural networks architectures," Neural Computation, vol. 7, pp. 219-269, 1995.
[174] T. Jebara and R. Kondor, "Bhattacharyya and expected likelihood kernels," Conference on Learning Theory (COLT), 2003.
[175] R. Kondor and T. Jebara, "A kernel between sets of vectors," International Conference on Machine Learning (ICML), 2003.
[176] J. Mercer, "Functions of positive and negative type and their connection with the theory of integral equations," Philos. Trans. Roy. Soc. London, vol. A 209, pp. 415-446, 1909.
[177] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller, "Fisher discriminant analysis with kernels," in Neural Networks for Signal Processing IX, Y.-H. Hu, J. Larsen, E. Wilson, and S. Douglas (Eds.), IEEE, pp. 41-48, 1999.
[178] P. Moreno, P. Ho, and N. Vasconcelos, "A Kullback-Leibler divergence based kernel for SVM classification in multimedia applications," Neural Information Processing Systems, 2003.
[179] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, "An introduction to kernel-based learning algorithms," IEEE Trans. Neural Networks, vol. 12, pp. 181-202, 2001.
[180] A. Ng, M. Jordan, and Y. Weiss, "On spectral clustering: analysis and an algorithm," Neural Information Processing Systems, 2002.
[181] B. Schölkopf, A. Smola, and K.-R. Müller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, vol. 10, pp. 1299-1319, 1998.
[182] M. Tipping, "Sparse kernel principal component analysis," Neural Information Processing Systems, 2001.
[183] C. K. I. Williams, "On a connection between kernel PCA and metric multidimensional scaling," Neural Information Processing Systems, 2001.
[184] L. Wolf and A. Shashua, "Kernel principal angles for classification machines with applications to image sequence interpretation," IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Madison, WI, 2003.

[Shaohua Zhou's publications]

[185] S. Zhou, V. Krueger, and R. Chellappa, "Probabilistic recognition of human faces from video," Computer Vision and Image Understanding, vol. 91, pp. 214-245, 2003.
[186] S. Zhou, R. Chellappa, and B. Moghaddam, "Visual tracking and recognition using appearance-adaptive models in particle filters," IEEE Trans. Image Processing (to appear), 2004.
[187] S. Zhou, R. Chellappa, and D. Jacobs, "Generalized photometric stereo and its applications to face recognition," International Journal of Computer Vision (submitted).
[188] S. Zhou and R. Chellappa, "Image-based face recognition under illumination and pose variations," Journal of the Optical Society of America (submitted).
[189] S. Zhou and R. Chellappa, "Probabilistic distances in reproducing kernel Hilbert space," IEEE Trans. on Information Theory (under preparation).
[190] R. Chellappa and S. Zhou, "Face tracking and recognition from video," Handbook of Face Recognition, S. Li and A. K. Jain (Eds.), Springer, 2004.
[191] S. Zhou and R. Chellappa, "Face recognition from still images and videos," Handbook of Image and Video Processing, A. Bovik (Ed.), Academic Press, 2004.
[192] S. Zhou, V. Krueger, and R. Chellappa, "Face recognition from video: A condensation approach," Proceedings of International Conference on Automatic Face and Gesture Recognition, Washington, D.C., USA, May 2002.
[193] S. Zhou and R. Chellappa, "Probabilistic human recognition from video," European Conference on Computer Vision, vol. 3, pp. 681-697, Copenhagen, Denmark, May 2002.
[194] V. Krueger and S. Zhou, "Exemplar-based face recognition from video," European Conference on Computer Vision, Copenhagen, Denmark, 2002.
[195] R. Chellappa, S. Zhou, and B. Li, "Bayesian methods for probabilistic human recognition from video," Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Orlando, Florida, USA, 2002.
[196] S. Zhou and R. Chellappa, "A robust algorithm for probabilistic human recognition from video," Proceedings of International Conference on Pattern Recognition, Quebec City, Canada, 2002.
[197] R. Chellappa, V. Krueger, and S. Zhou, "Probabilistic recognition of human faces from video," Proceedings of IEEE International Conference on Image Processing, Rochester, NY, 2002.
[198] S. Zhou, "Probabilistic analysis of kernel principal components: classification and mixture modeling," CfAR Technical Report, CAR-TR-993, 2003.
[199] S. Zhou and R. Chellappa, "Simultaneous tracking and recognition of human faces from video," Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003.
[200] J. Li, S. Zhou, and C. Shekhar, "A comparison of subspace analysis for face recognition," Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003.
[201] S. Zhou, R. Chellappa, and B. Moghaddam, "Adaptive visual tracking and recognition using particle filters," Proceedings of IEEE International Conference on Multimedia & Expo, Baltimore, USA, 2003.
[202] S. Zhou and R. Chellappa, "Rank constrained recognition under unknown illuminations," IEEE Intl. Workshop on Analysis and Modeling of Faces and Gestures, Nice, France, 2003.
[203] S. Zhou, R. Chellappa, and B. Moghaddam, "Appearance tracking using adaptive models in a particle filter," Proceedings of Asian Conference on Computer Vision, Korea, January 2004.
[204] S. Zhou, R. Chellappa, and D. Jacobs, "Characterization of human faces under illumination variations using rank, integrability, and symmetry constraints," European Conference on Computer Vision, Prague, Czech Republic, May 2004.
[205] J. Li and S. Zhou, "Probabilistic face recognition with compressed imagery," Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Canada, May 2004.
[206] J. Shao, S. Zhou, and R. Chellappa, "Appearance-based visual tracking and recognition with trilinear tensor," Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Canada, May 2004.
[207] Z. Yue, S. Zhou, and R. Chellappa, "Robust two-camera visual tracking with homography," Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Canada, May 2004.
[208] S. Zhou and R. Chellappa, "Illuminating light field: Image-based face recognition across illuminations and poses," Proceedings of International Conference on Automatic Face and Gesture Recognition, Seoul, Korea, May 2004.
[209] S. Zhou, R. Chellappa, and B. Moghaddam, "Intra-personal kernel space for face recognition," Proceedings of International Conference on Automatic Face and Gesture Recognition, Seoul, Korea, May 2004.
[210] S. Zhou and R. Chellappa, "Probabilistic identity characterization for face recognition," Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington D.C., USA, June 2004.
[211] S. Zhou and R. Chellappa, "Multiple-exemplar discriminant analysis for face recognition," Proceedings of International Conference on Pattern Recognition, Cambridge, UK, August 2004.
[212] J. Shao, S. Zhou, and Q. Zheng, "Robust appearance-based tracking of moving object from moving platform," Proceedings of International Conference on Pattern Recognition, Cambridge, UK, August 2004.
[213] J. Shao, S. Zhou, and R. Chellappa, "Simultaneous background and foreground modeling for tracking in surveillance video," Proceedings of IEEE International Conference on Image Processing, Singapore, October 2004.