ABSTRACT

Title of Dissertation: UNCONSTRAINED FACE RECOGNITION
Shaohua Zhou, Doctor of Philosophy, 2004
Dissertation directed by: Professor Rama Chellappa, Department of Electrical and Computer Engineering

Although face recognition has been actively studied over the past decade, state-of-the-art recognition systems yield satisfactory performance only under controlled scenarios, and recognition accuracy degrades significantly when confronted with unconstrained situations due to variations in illumination, pose, etc. In this dissertation, we propose novel approaches that are able to recognize human faces under unconstrained situations.

Part I presents algorithms for face recognition under illumination/pose variations. For face recognition across illumination, we present a generalized photometric stereo approach by modeling all face appearances belonging to all humans under all lighting conditions. Using a linear generalization, we achieve a factorization of the observation matrix consisting of face appearances of different individuals, each under a different illumination. We resolve ambiguities in the factorization using surface integrability and symmetry constraints. In addition, an illumination-invariant identity descriptor is provided to perform face recognition across illumination. We further extend the generalized photometric stereo approach to an illuminating light field approach, which is able to recognize faces under pose and illumination variations.

Face appearance lies in a high-dimensional nonlinear manifold. In Part II, we introduce machine learning approaches based on reproducing kernel Hilbert space (RKHS) to capture higher-order statistical characteristics of the nonlinear appearance manifold. In particular, we analyze principal components of the RKHS in a probabilistic manner and compute distances, such as the Chernoff distance and the Kullback-Leibler divergence, between two Gaussian densities in RKHS.

Part III is on face tracking and recognition from video. We first present an enhanced tracking algorithm that models online appearance changes in a video sequence using a mixture model and produces good tracking results in various challenging scenarios. For video-based face recognition, while conventional approaches treat tracking and recognition separately, we present a simultaneous tracking-and-recognition approach. This simultaneous approach, solved using the sequential importance sampling algorithm, improves accuracy in both tracking and recognition. Finally, we propose a unifying framework called probabilistic identity characterization that is able to perform face recognition under registration/illumination/pose variations and from a still image, a group of still images, or a video sequence.

UNCONSTRAINED FACE RECOGNITION

by Shaohua Zhou

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2004

Advisory Committee:
Professor Rama Chellappa, Chairman
Professor Larry S. Davis
Professor David W. Jacobs
Professor Adrian Papamarcou
Professor Min Wu

© Copyright by Shaohua Zhou 2004

DEDICATION

To Chunhui

ACKNOWLEDGEMENTS

I wish to express my sincere gratitude to my supervisor, Professor Rama Chellappa, for his sustained financial support, his valuable guidance on research, and his scholarly and honest attitude toward life. I am grateful to my committee members, Professors Larry S. Davis, David W. Jacobs, Adrian Papamarcou, and Min Wu.
I enjoyed my fruitful discussions with Professor David W. Jacobs. I also thank Professor Eric V. Slud in the Mathematics department for educating me and sharing with me his broad knowledge onstatistics and Dr. Baback Moghaddam at Mitsubishi Electric Research Labs (MERL) for hosting me as a summer intern in 2002. I also would like to express my appreciation of Professor Azriel Rosenfeld, who was in my proposal examination committee and edited two of my technical reports. I had a pleasant stay at the Center for Automation Research (CfAR). I am indebted to my lab colleagues: Amit R. Chowdhury, Naresh Contoor, Jian Li, Jian Liang, Haiying Liu, Amit Kale, Gang Qian, Jie Shao, Namrata Vaswani, Zhanfen Yue, and Qinfen Zheng. I really enjoyed my collaborations and discussions with these brilliant guys. I take this special occasion to thank my parents and parents-in-law back in China for their support and to wish them best. Finally, I thank my wife, Chunhui, for her patience, her encouragement, and her lifelong love. I dedicate my thesis to her. iii TABLE OF CONTENTS List of Tables vii List of Figures ix 1 Introduction 1 1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1.1 Biometric perspective . . . . . . . . . . . . . . . . . . . . . . 1 1.1.2 Experimental perspective . . . . . . . . . . . . . . . . . . . . 4 1.1.3 Theoretic perspective . . . . . . . . . . . . . . . . . . . . . . 5 1.2 Unconstrained Face Recognition . . . . . . . . . . . . . . . . . . . . 13 1.2.1 Face recognition under variations . . . . . . . . . . . . . . . 15 1.2.2 Face recognition via kernel learning . . . . . . . . . . . . . . 16 1.2.3 Face tracking and recognition from videos . . . . . . . . . . 18 2 Generalized Photometric Stereo 21 2.1 Principle of Generalized Photometric Stereo . . . . . . . . . . . . . 26 2.1.1 Literature review and proposed approach . . . . . . . . . . . 27 2.1.2 Setting and constraints . . . . . . . . . . . . . . . . . . . . . 29 2.1.3 Separating illumination . . . . . . . . . . . . . . . . . . . . . 34 2.1.4 Recovering class-specific albedos and surface normals . . . . 37 2.2 Face Recognition across Illumination . . . . . . . . . . . . . . . . . 39 2.2.1 Literature review and proposed approach . . . . . . . . . . . 40 2.2.2 Bootstrap set . . . . . . . . . . . . . . . . . . . . . . . . . . 42 2.2.3 Recognition experiments . . . . . . . . . . . . . . . . . . . . 45 2.3 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3 Illuminating Light Field 57 3.1 Principle of Illuminating Light Field . . . . . . . . . . . . . . . . . 58 3.1.1 Literature review . . . . . . . . . . . . . . . . . . . . . . . . 58 3.1.2 Pose-invariant identity signature . . . . . . . . . . . . . . . . 62 3.1.3 Illumination- and pose-invariant identity signature . . . . . . 65 3.1.4 Learning algorithms . . . . . . . . . . . . . . . . . . . . . . 67 3.2 Face Recognition across Illumination and Poses . . . . . . . . . . . 70 iv 3.2.1 PIE database and recognition setting . . . . . . . . . . . . . 70 3.2.2 Recognition performance . . . . . . . . . . . . . . . . . . . . 73 3.2.3 Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . 77 4 Probabilistic Kernel Principal Component Analysis 82 4.1 Reproducing Kernel Hilbert Space (RKHS) . . . . . . . . . . . . . . 85 4.2 Probabilistic Analysis of Kernel Principal Components . . . . . . . 88 4.2.1 Kernel principal component analysis . . . . . . . . . . . . . 88 4.2.2 Theory of PKPCA . . . . . . . . . . . . . . . . . . . . . . 
. 90 4.3 Mixture Modeling of Probabilistic Kernel Principal Components . . 96 4.3.1 Theory of mixture of PKPCA . . . . . . . . . . . . . . . . . 96 4.3.2 Why mixture of PKPCA? . . . . . . . . . . . . . . . . . . . 100 4.4 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 4.4.1 PKPCA or mixture of PKPCA classifier . . . . . . . . . . . 101 4.4.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 4.5 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 5 Probability Distances in Reproducing Kernel Hilbert Space 118 5.1 Probabilistic Distances in Rd . . . . . . . . . . . . . . . . . . . . . 120 5.2 Mean and Covariance Marix in RKHS . . . . . . . . . . . . . . . . 123 5.2.1 First- and second-order statistics . . . . . . . . . . . . . . . 123 5.2.2 Covariance matrix approximation . . . . . . . . . . . . . . . 124 5.3 The Probabilistic Distances in RKHS . . . . . . . . . . . . . . . . . 126 5.3.1 The Chernoff distance and the Bhattarchayya distance . . . 126 5.3.2 The KL divergence and the symmetric divergence . . . . . . 129 5.3.3 The Patrick-Fisher distance . . . . . . . . . . . . . . . . . . 130 5.3.4 Limiting behavior . . . . . . . . . . . . . . . . . . . . . . . . 130 5.3.5 Kernel for set . . . . . . . . . . . . . . . . . . . . . . . . . . 131 5.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . 132 5.4.1 Synthetic examples . . . . . . . . . . . . . . . . . . . . . . . 132 5.4.2 Face recognition from a group of images . . . . . . . . . . . 134 6 Adaptive Visual Tracking 138 6.1 Related Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 6.1.1 Visual tracking . . . . . . . . . . . . . . . . . . . . . . . . . 142 6.1.2 Particle filter . . . . . . . . . . . . . . . . . . . . . . . . . . 143 6.2 Appearance-Adaptive Models . . . . . . . . . . . . . . . . . . . . . 145 6.2.1 Adaptive observation model . . . . . . . . . . . . . . . . . . 145 6.2.2 Adaptive state transition model . . . . . . . . . . . . . . . . 148 6.2.3 Handling occlusion . . . . . . . . . . . . . . . . . . . . . . . 154 6.3 Experimental results on visual tracking . . . . . . . . . . . . . . . . 157 6.3.1 Car tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 6.3.2 Tank tracking in an aerial video . . . . . . . . . . . . . . . . 160 v 6.3.3 Face tracking . . . . . . . . . . . . . . . . . . . . . . . . . . 163 6.3.4 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 7 Simultaneous Tracking and Recognition 166 7.1 Related Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 7.1.1 Face modeling and recognition . . . . . . . . . . . . . . . . . 169 7.1.2 Video-based tracking and recognition . . . . . . . . . . . . . 170 7.2 Stochastic Models and Algorithms for Recognition from Video . . . 173 7.2.1 Time series state space model . . . . . . . . . . . . . . . . . 173 7.2.2 Posterior probability of identity variable . . . . . . . . . . . 174 7.2.3 SIS algorithms and computational efficiency . . . . . . . . . 176 7.3 Still-to-Video Face Recognition Experiments . . . . . . . . . . . . . 180 7.3.1 Results for Database-0 . . . . . . . . . . . . . . . . . . . . . 181 7.3.2 Results for Database-1 . . . . . . . . . . . . . . . . . . . . . 187 7.3.3 Results for Database-2 . . . . . . . . . . . . . . . . . . . . . 191 7.3.4 Enhanced results . . . . . . . . . . . . . . . . . . . . . . . . 192 7.4 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
198 8 Probabilistic Identity Characterization 202 8.1 Principle of Probabilistic Identity Characterization . . . . . . . . . 205 8.1.1 Independent group (I-group) . . . . . . . . . . . . . . . . . . 206 8.1.2 Video sequence . . . . . . . . . . . . . . . . . . . . . . . . . 207 8.1.3 Difference from Bayesian estimation . . . . . . . . . . . . . . 207 8.2 Recognition Setting and Issues . . . . . . . . . . . . . . . . . . . . . 208 8.2.1 Discrete identity signature . . . . . . . . . . . . . . . . . . . 209 8.2.2 Continuous identity signature . . . . . . . . . . . . . . . . . 209 8.2.3 The effects of the transformation . . . . . . . . . . . . . . . 210 8.2.4 Asymptotic behaviors . . . . . . . . . . . . . . . . . . . . . . 211 8.3 Subspace Identity Encoding . . . . . . . . . . . . . . . . . . . . . . 211 8.3.1 Invariant to localization, illumination, and pose . . . . . . . 212 8.3.2 Computational issues . . . . . . . . . . . . . . . . . . . . . . 214 8.3.3 Experimental results . . . . . . . . . . . . . . . . . . . . . . 216 8.4 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220 9 Conclusions 223 9.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223 9.2 Future works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225 vi LIST OF TABLES 1.1 A list of biometrics. . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2.1 Recognition rate obtained by our approach using the first rank con- straint and the Yale?s database as the training set. . . . . . . . . . . 46 2.2 Recognition rate obtained by the ?Eigenface?approach (discarding the first 3 components) using the Yale?s database as the training set. 48 2.3 Recognition rate obtained by the ?Fisherface? approach using the Yale?s database as the training set. . . . . . . . . . . . . . . . . . . 48 2.4 Recognition rate obtained by our approach with the first rank con- straint and Vetter?s database as the training set. . . . . . . . . . . 49 2.5 Recognition rate obtained by our approach with the second rank constraint and Vetter?s database as the training set. . . . . . . . . . 49 2.6 Recognition rate across poses and illumination. The front view is from camera 27, and the side view from camera 05. . . . . . . . . . 50 3.1 Recognition rates forall the probe sets with a fixed gallery set (c27,f11). 73 3.2 Average recognition rates for all the gallery sets. For each cell, say the gallery set at (vg = c27,sg = f12), the average rate is taken over all probe sets (vp,sp) where vp negationslash= vg and sp negationslash= sg. For example, the average rate for (c27,f11) is the average of the rates in Table 3.1 excluding the row c27 and the column f11. . . . . . . . . . . . . . . 74 3.3 The recognition rates for test scenario B. . . . . . . . . . . . . . . . 78 4.1 PPCA and PKPCA reconstruction error percentage. . . . . . . . . . 96 4.2 Classification error on the single C-shaped, the single O-shape, and the double C-shapes. . . . . . . . . . . . . . . . . . . . . . . . . . . 105 4.3 The classification error on IDA benchmark repository. The SVM and KFD results are reported in [179]. . . . . . . . . . . . . . . . . 109 4.4 Recognition rate of various kernel and non-kernel subspace methods. 111 5.1 (a) The KL distances in the RKHS with ? = 1 and q = 3. (b) The Bhatacharyya distances in the RKHS with ? = 0.5 and q = 1. p1 is listed in the first column and p2 in the first row. . . . . . . . . . . . 135 vii 5.2 The recognition score obtaining using the symmetric divergence and Bhatacharyya distance. . . 
. . . . . . . . . . . . . . . . . . . . . . . 135 6.1 Comparison of tracking results obtained by particle filters with dif- ferent configurations. ?At size? means pixel size in the component(s) of the appearance model. ?o? means success in tracking. ?x? means failure in tracking. . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 7.1 Use of temporal information in various tracking/recognition processes.168 7.2 Summary of three databases experimented. . . . . . . . . . . . . . . 181 7.3 Recognition performance of algorithms when applied to Database-0. 187 7.4 Performances of algorithms when applied to Database-1. . . . . . . 188 8.1 Recognition rates of different methods. . . . . . . . . . . . . . . . . 219 viii LIST OF FIGURES 1.1 Comparison of various biometric features based on MRTD compat- ibility (from [33]). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Three face recognition tasks: verification, identification, watch list (courtesy of P.J.Phillips [59]). . . . . . . . . . . . . . . . . . . . . . 5 1.3 A hierarchy of face pattern and face recognition. . . . . . . . . . . . 6 1.4 An illustration of the imaging system. . . . . . . . . . . . . . . . . . 8 1.5 One PIE [75] individual under different illumination and poses. . . . 9 1.6 (a) Appearances of one individual with different facial expression (from [53]). (b) Appearances of one individual at different ages (from [50]). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.7 Face appearances in a video sequences, forming a nonlinear manifold. 14 2.1 Top row: One object under eight different light sources. This can be handled by the ordinary photometric stereo algorithm. Bottom row: Eight different objects illuminated by eight different lighting sources. This cannot be handled by the ordinary photometric stereo algorithm but can be handled by the proposed generalized photo- metric stereo algorithm. . . . . . . . . . . . . . . . . . . . . . . . . 22 2.2 The first row: The first basis object under eight different illumi- nation. The second row: The second basis object under the same set of eight different illumination. The third row: Eight images (constructed by random linear combinations of two basis objects) illuminated by eight different lighting sources. The fourth row: Re- covered class-specific albedo-shape matrix W showing the product of varying albedos and surface normals of two basis objects (i.e. the three columns of T1 and T2) using the generalized photometric stereo algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 2.3 Right: Flash distribution in the PIE database. For illustrative pur- poses, we move their positions on a unit sphere as only the illu- minant directions matter. ?o? means the ground truth and ?x? the estimated values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 ix 2.4 The first and second rows display one PIE object under the selected 12 illuminants (from left to right, row 1 to row 2: f08, f09, f11-f17, and f20-f22) and the third and fourth rows one Yale object under 9 lights (most frontal lights) used in the training set. . . . . . . . . . 47 3.1 This figure illustrates the 2D light-field of a 2D object (a square with four differently colored sides), which is placed within an circle. The angles ? and ? are used to relate the viewpoint with the radiance from the object. The right image shows the actual light field for the square object. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
63 3.2 Examples of the face images of one PIE object (used in the testing stage) under selected illumination and poses . . . . . . . . . . . . . 71 3.3 The first nine columns of the learned W matrix. . . . . . . . . . . . 75 3.4 The reconstruction results of the object in Figure 3.2. Notice that only the f?s and s?s for the row c27 are used for reconstructing all the images. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 3.5 The average recognition rates across illumination (the top row) and across poses (the bottom row) for three cases. Case (a) shows the average recognition rate (averaging over all illumination/poses and all gallery sets) obtained by the proposed algorithm using the top n matches. Case (b) shows the average recognition rate (averaging over all illumination/poses for the gallery set (c27, f11) only) ob- tained by the proposed algorithm using the top n matches. Case(c) shows the average recognition rate (averaging over all illumina- tion/poses andallgallerysets) obtainedby the ?Eigenface? algorithm using the top n matches. . . . . . . . . . . . . . . . . . . . . . . . . 77 4.1 Two nonlinear data structures (a)(d) and their drawn samples (of size 200) for the foreground class (b)(e) and the background (c)(f). 85 4.2 Histogram of ? for iris data obtained by (a) PPCA with q = 2, (b) PPCA with q = 3, (c) PKPCA with Gaussian kernel with q = 9, ? = 2 and ? = 0.001, and (d) PKPCA with Gaussian kernel with q = 15, ? = 2 and ? = 0.001. . . . . . . . . . . . . . . . . . . . . . . 97 4.3 (a) Initial configuration. (b) After first iteration. (c) Final configu- ration. ?+? and ?x? denote two different mixture components. . . . . 100 4.4 (a) One C-shape and contour plots of its (b) 1st and (c) 2nd KPCA features. (d) Two C-shapes and its contour plots of its (e) 1st and (f) 2nd KPCA features. . . . . . . . . . . . . . . . . . . . . . . . . 102 4.5 The approximation of the Jacobi matrix. (a) The contour plots of the true density: uniform inside the C-shaped region. (b) The map of log(??). (c) The contour plots of ??? inside the C-shaped region. . 103 4.6 The classification results on the single C-shape obtained by (a) PKPCA-d, (b) PKPCA-s, (c) SVM, and (d) KFDA. . . . . . . . . . 106 x 4.7 The classification results on the double C-shape obtained by (a) PKPCA-d classfier, (b) SVM, and (c) mixture of PKPCA classfier with different kernel widths. . . . . . . . . . . . . . . . . . . . . . . 106 4.8 The classification results on the single O-shape. . . . . . . . . . . . 107 4.9 Top row: neutral faces. Middle row: faces with facial expression. Bottom row: faces under different illumination. Image size is 24 by 21 in pixels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 4.10 (a) The curve of E(?). (b) The curve of ?1(?). We have set q = 30 and ? = 1e?6. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 4.11 (a) The map of log(??) and (b) the contour plots of ??? inside the C-shaped region, when ? = 3. (c) The map of log(??) and (d) the contour plots of ??? inside the C-shaped region, when ? = 36. . . . . 117 5.1 300 i.i.d. realizations of four different densities with the same mean (zero mean) and covariance matrix (identity matrix). (a) 2-D Gaus- sian. (b) ?O?-shaped uniform.(c) ?D?-shapeduniform. (d) ?X?-shaped uniform. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
133 5.2 (a) The symmetric divergence ?JD(?,q) and (b) the Bhatacharyya distance ?JB(?,q) between the 2-D Gaussian and the ?O?-shaped uni- form as a function of ? and q. . . . . . . . . . . . . . . . . . . . . . 134 5.3 Examples of face images in the gallery and probe set. (a) The 4th gallery person in 10 frames (every 8 frames) of a 80-frame se- quence. (b) The 9th gallery person in 10 frames (every 10 frames) of a 105-frame sequence.(a) The 4th probe person in 10 frames (every 6 frames) of a 60-frame sequence. (d) The plot of first three PCA coefficients of the above three sets. . . . . . . . . . . . . . . . . . . 136 6.1 The general particle filter algorithm. . . . . . . . . . . . . . . . . . 144 6.2 Particle configurations from (top row) the adaptive velocity model and (bottom row) the zero-velocity model. . . . . . . . . . . . . . . 154 6.3 The proposed visual tracking algorithm with occlusion handling. . . 157 6.4 The car sequence. Notice the fast scale change present in the video. Column 1: the tracking results obtained with an adaptive motion model and an adaptive appearance model (?adp?). Column 2: the tracking results obtained with an adaptive motion model but a fixed appearance model (?fa?). In this case, the corner shows the tracked region. Column 3: the tracking results obtained with an adaptive appearance model but a fixed motion model (?fm?). . . . . . . . . . 160 xi 6.5 (a) The scale estimate for the car. (b) The 2-D trajectory of the cen- troid of the tracked tank. ?*? means the starting and ending points and ?.? points are marked along the trajectory every 10 frames. (c) The particle number Jt vs. t obtained when tracking the tank. (d) The MSE invoked by the ?adp? and ?fa? algorithms. (e) The scale estimate for the face sequence. . . . . . . . . . . . . . . . . . . . . . 161 6.6 Tracking a moving tank in a video acquired by an airborne camera. 162 6.7 The face sequence. Frames 145, 148, and 155 show the first oc- clusion. Frames 470 and 517 show the smallest and largest face observed. Frames 685, 690, and 710 show the second occlusion. . . . 164 6.8 Tracking results on the face sequence using the adaptive particle filter without occlusion analysis. . . . . . . . . . . . . . . . . . . . . 165 7.1 The conventional particle filter algorithm for simultaneous tracking and recognition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178 7.2 The computationally efficient particle filter algorithm for simulta- neous tracking and recognition. . . . . . . . . . . . . . . . . . . . . 179 7.3 Database-0. The 1st row: the face gallery with image size being 30?26. The 2nd and 3rd rows: 4 example frames in one probe video with image size being 320 ? 240 while the actual face size ranges approximately from 30?30 in the first frame to 50?50 in the last frame. Notice that the sequence is taken under a well-controlled con- dition so that there are no illumination or pose variations between the gallery and the probe. . . . . . . . . . . . . . . . . . . . . . . . 182 7.4 Database-1. The 1st row: the face gallery with image size being 30?26. The 2nd and 3rd rows: 4 example frames in one probe video with image size being 720 ? 480 while the actual face size ranges approximately from 20?20 in the first frame to 60?60 in the last frame. Notice the significant illumination variations between the probe and the gallery. . . . . . . . . . . . . . . . . . . . . . . . . . . 183 7.5 Database-2. The 1st row: the face gallery with image size being 30?26. 
The 2nd and 3rd rows: some example frames in one probe video (slowWalk). Each video consists of 300 frames (480x640 pixels per frame) captured at 30 Hz. The inner face regions in these videos contain between 30x30 and 40x40 pixels. Notice the significant pose variation available in the video. . . . 184
7.6 Posterior probability p(nt|y0:t) against time t, obtained by the CONDENSATION algorithm (top left) and the proposed algorithm (top right). Conditional entropy H(nt|y0:t) (bottom left) and MMSE estimate of scale parameter sc (bottom right) against time t. The conditional entropy and the MMSE estimate are obtained using the proposed algorithm. . . . 186
7.7 Database-1. Top row: the second facial images for estimating probabilistic density. Middle row: top 10 eigenvectors for the IPS. Bottom row: the facial images cropped out from the largest frontal view. . . . 190
7.8 Cumulative match curves for Database-1 (left) and Database-2 (right). . . . 192
7.9 The visual tracking and recognition algorithm. . . . 195
7.10 Rows 1-3: the gallery set with 29 subjects in frontal view. Rows 4, 5, and 6: the top 10 eigenvectors for FFS, IPS, and EPS, respectively. . . . 196
7.11 Example images in the "Subject-2" probe video sequence and the tracking results. . . . 197
7.12 Results on the "Subject-2" sequence. (a) Posterior probabilities against time t for all identities p(nt|y1:t), nt = 1,2,...,N. The line close to 1 is for the true identity. (b) Scale estimate against time t. . . . 198
7.13 Left: The "average" likelihood of the correct hypothesis and incorrect hypotheses against the log of scale parameter. Right: The "average" likelihood ratio against the log of scale parameter. . . . 201
8.1 The posterior distributions p(?1|y1:T) with different T's: (a) p(?1|y1); (b) p(?1|y1:6); and (c) p(?1|y1:12), and (d) the posterior distribution p(?|y1:12). Notice that p(?1|y1:T) has two modes and becomes more peaked as T increases. . . . 219
8.2 The recognition rates of all tests. (a) Our method based on ?k. (b) Our method based on ?k. (c) The PCA approach [62]. (d) The KL approach. Notice the different ranges of values for different methods and the diagonal entries should be ignored. . . . 220

Chapter 1

Introduction

1.1 Overview

Identifying people from faces is an effortless task for humans. Is it the same for computers? This defines the very question for the field of automatic face recognition [20, 21, 22, 23, 24, 25, 26, 27, 191] (also referred to as face recognition in the present dissertation), one of the most active research areas in computer vision, pattern recognition, and image understanding.

Over the past decade, face recognition has attracted substantial attention from various disciplines and contributed to a skyrocketing growth in the literature. Below, we mainly emphasize the biometric, experimental, and theoretic perspectives of face recognition.

1.1.1 Biometric perspective

Face is a biometric [31]. As a consequence, face recognition finds wide applications related to authentication, security, and so on. One striking example is the recent deployment of the US-VISIT system [30] by the Department of Homeland Security (DHS), collecting foreign passengers' fingerprints and face images.

Biometrics enable automatic identification of a person based on physiological or behavioral characteristics [29, 28].
Physiological biometrics are biological/chemical traits that are innate or naturally grown, while behavioral biometrics are mannerisms or traits that are learned or acquired. Table 1.1 lists commonly used biometrics. Some introductory discussions on biometrics may be found in [28, 29, 31, 32].

    Type                       Examples
    Physiological biometrics   Body odor, DNA, face, fingerprint, hand geometry, iris, pulse, retinal
    Behavioral biometrics      Face, gait, handwriting, signature, voice

Table 1.1: A list of biometrics.

Biometric technologies are becoming the foundations of an extensive array of highly secure identification and personal verification solutions. Compared with conventional identification and verification methods based on personal identification numbers (PINs) or passwords, biometric technologies offer some unique advantages. First, biometrics are individualized traits, while passwords may be used or stolen by someone other than the authorized user. Also, a biometric is very convenient since there is nothing to carry or remember. In addition, biometric technology is becoming more accurate and inexpensive.

Among all the biometrics listed in Table 1.1, the face is unique because it is the only biometric belonging to both the physiological and behavioral categories. While the physiological part of the face biometric is widely researched in the literature, the behavioral part is not yet fully investigated. In addition, as reported in [33, 34], the face has an advantage over other biometrics because it is a natural, non-intrusive, and easy-to-use biometric. For example [33], among the six biometrics of face, finger, hand, voice, eye, and signature in Figure 1.1, the face biometric ranks first in the compatibility evaluation of a machine readable travel document (MRTD) system in terms of six criteria: enrollment, renewal, machine-assisted identity verification requirements, redundancy, public perception, and storage requirements and performance. Probably the most important feature of a biometric is its ability to collect the signature from non-cooperating subjects.

Figure 1.1: Comparison of various biometric features based on MRTD compatibility (from [33]).

Besides applications related to identification and verification such as access control, law enforcement, ID and licensing, surveillance, etc., face recognition is also useful in human-computer interaction, virtual reality, database retrieval, multimedia, computer entertainment, etc. See [27, 45] for a review of face recognition applications.

1.1.2 Experimental perspective

Face recognition mainly involves the following three tasks [59]:

• Verification. The recognition system determines if the query face image and the claimed identity match.

• Identification. The recognition system determines the identity of the query face image by matching it with a database of images with known identities, assuming that the identity is inside the database.

• Watch list. The recognition system first determines if the identity of the query face image is on the stored watch list and, if yes, then identifies the individual.

Figure 1.2 illustrates the above three tasks and the corresponding statistics used for evaluation. Among the three tasks, the watch list task is the most difficult one. The present thesis focuses only on the identification task.

We introduce a face recognition test protocol, FERET [58], which is widely observed in the face recognition literature.
FERET stands for "facial recognition technology". In most experiments conducted in the thesis, we follow the FERET protocol. FERET assumes the availability of three sets, namely one training set, one gallery set, and one probe set. The training set is provided for the recognition algorithm to learn the characteristic features. The gallery and probe sets are used in the testing stage. The gallery set contains images with known identities and the probe set images with unknown identities. The algorithm associates descriptive features with images in the gallery and probe sets and determines the identities of the probe images by comparing their associated features with those associated with gallery images.

Figure 1.2: Three face recognition tasks: verification, identification, watch list (courtesy of P. J. Phillips [59]).

1.1.3 Theoretic perspective

Face recognition is by nature an interdisciplinary research area, tied to an array of research fields, ranging from pattern recognition, computer vision and graphics, and image processing/understanding to statistical computing and machine learning. In addition, automatic face recognition designs are often guided by psychophysical and neural studies. A good summary of research on face perception is presented in [27, 35, 38]. We now focus on the theoretical implications of pattern recognition for the special task of face recognition.

We present a three-level structure for understanding the face recognition problem. The three levels forming the pyramid are: pattern, visual pattern, and face pattern, each associated with a corresponding theory of recognition. Accordingly, face recognition approaches can be grouped into three categories.

Figure 1.3: A hierarchy of face pattern and face recognition.

Pattern and recognition

At the base of the pyramid lies the general pattern. Because a face is first of all a pattern, any pattern recognition theory [7] can be directly applied to a face recognition problem. In general, a vector representation is used in pattern recognition. A common way of deriving a vector representation from a 2D face image, say of size M x N, is through a "vectorization" operator that stacks the pixels in a particular order, say a raster-scanning order, into an MN x 1 vector. Obviously, an arbitrary MN x 1 vector can be decoded into an M x N image by reversing the above "vectorization" operator. Such a vector representation corresponds to a holistic viewpoint in the psychophysics literature [36, 37].

Subspace methods are pattern recognition techniques widely invoked in various face recognition approaches. Two well-known appearance-based recognition schemes utilize principal component analysis (PCA) [12] and linear discriminant analysis (LDA) [7]. PCA performs an eigen-decomposition of the covariance matrix and consequently minimizes the reconstruction error in the mean square sense. LDA minimizes the within-class scatter while maximizing the between-class scatter. The PCA approach used in face recognition is called the "Eigenface" approach [62]. Another work using PCA earlier than "Eigenface" is [47]. The LDA approach used in face recognition is called the "Fisherface" approach [41], since LDA is also commonly referred to as Fisher discriminant analysis. LDA for face recognition was also independently proposed in [44]. Further, PCA and LDA can be combined (LDA after PCA), as in [64], to yield a better recognition scheme.
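As a minimal illustration of the eigenface-style pipeline just described (vectorize the images, project onto principal components, identify by nearest neighbor under the FERET gallery/probe setup), the following sketch uses hypothetical gallery and probe arrays; it is an illustrative simplification, not the exact implementation evaluated in this thesis.

```python
import numpy as np

def eigenface_features(gallery, probe, q=20):
    """Sketch of the 'Eigenface' pipeline: vectorize, PCA, project.

    gallery: (n_gallery, M, N) face images with known identities
    probe:   (n_probe,   M, N) query face images
    q:       number of principal components kept
    (Array names, shapes, and q are illustrative assumptions.)
    """
    X = gallery.reshape(len(gallery), -1).astype(float)   # each row is an MN-vector
    mean = X.mean(axis=0)
    Xc = X - mean
    # PCA via SVD of the centered data; rows of Vt span the eigenface subspace
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    U = Vt[:q].T                                           # MN x q basis
    g_feat = Xc @ U                                        # gallery features
    p_feat = (probe.reshape(len(probe), -1) - mean) @ U    # probe features
    return g_feat, p_feat

def identify(g_feat, p_feat):
    """Nearest-neighbor identification: index of best-matching gallery image."""
    d = ((p_feat[:, None, :] - g_feat[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)
```

A Fisherface-style variant would replace the final PCA basis with LDA directions computed after this PCA step.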
Other subspace methods, such as independent component analysis (ICA) [20, 40, 155], local feature analysis (LFA) [164], probabilistic subspaces [54, 55, 56], and multi-exemplar discriminant analysis [211], have been used in face recognition. A comparison of these subspace methods is reported in [56, 200]. Other than the subspace methods, classical pattern recognition tools such as neural networks [51], learning methods [57], and evolutionary pursuit/genetic algorithms [52] have also been applied to face recognition.

One concern in a general pattern recognition problem is the "curse of dimensionality", since usually M and N themselves are quite large. In face recognition, because of limitations of image acquisition, practical face recognition systems store only a small number of samples per subject. This further worsens the "curse of dimensionality" problem. Face recognition also differs from general pattern recognition problems in various aspects. Some of the differences are illustrated below.

Visual pattern and visual recognition

In the middle of the pyramid in Figure 1.3 sits the visual pattern layer. A face is a visual pattern in the sense that it is a 2D appearance of a 3D object captured by an imaging system. Certainly, visual appearance is affected by the configuration of the imaging system. An illustration of the imaging system is presented in Figure 1.4.

Figure 1.4: An illustration of the imaging system.

There are two distinct characteristics of the imaging system: photometric and geometric.

• Photometric characteristics are related to the light sources distributed in the scene. Figure 1.5 shows the face images of one object captured under varying illumination conditions. Numerous models have been proposed to describe the illuminating phenomenon, i.e., how the light travels when it hits the object. In addition to its relationship with the light distribution, such as the light direction and intensity, an illumination model is in general also related to the surface material properties of the object.

• Geometric characteristics concern the camera properties and the relative positioning of the camera and the object. Camera properties include camera intrinsic parameters and camera imaging models. The imaging models widely studied in the computer vision literature are the orthographic, scaled orthographic, and perspective models. Because the perspective model is difficult to deal with, as it requires depth information, the orthographic or scaled orthographic model is more commonly used in the face recognition community. The relative positioning of the camera and the object results in pose variation, a key factor determining how the 2D appearances are produced. Figure 1.5 shows the face images of one object captured at different poses.

Figure 1.5: One PIE [75] individual under different illumination and poses.

Studying photometric and geometric characteristics is the key problem in the computer vision literature, and consequently visual recognition under illumination and pose variations is the main challenge in the recognition community. A full review of the visual recognition literature is beyond the scope of the thesis. However, face recognition methods that address the photometric and geometric characteristics are still in a nascent stage and need to be fully explored.

Approaches to face recognition under illumination variation are usually treated as extensions of research efforts on illumination models.
For example, if a simplified Lambertian reflectance model ignoring shadow pixels [96, 101, 103] is used, a rank-3 subspace can be constructed to cover the appearances arbitrarily illuminated by a distant point source. Similarly, low-dimensional subspaces [94, 95] can be found using a Lambertian model with attached shadows. Face recognition can be performed by checking if a query face image lies in the object-specific illumination subspace. To generalize from the object-specific illumination subspace to a class-specific illumination subspace, bilinear models are used in [74, 138, 204]. Most face recognition approaches across pose variation use a view-based appearance representation [67, 69, 72]. Face recognition across illumination and pose is more difficult than recognition across a single modality. Proposed approaches in the literature include [66, 70, 208], among which the 3D morphable model [66] yields the best recognition performance. The feature-based approach [48] is reported to be partially robust to illumination and pose variations.

An important feature of a visual pattern is its presence in video. The ubiquitousness of video sequences calls for recognition algorithms based on videos. Because a video sequence is a collection of still images, face recognition from still images certainly applies. However, an important property of a video sequence is its temporal dimension. Recent psychophysical and neural studies [37, 39] demonstrate the role of movement in face recognition: famous faces are easier to recognize when presented in moving sequences than in still photographs, even under a range of different types of degradation. Computational approaches utilizing such temporal information include [86, 193, 194, 185, 186, 190]. Figure 1.7 shows the tracked face appearance in a video sequence captured in an office environment [84]. Clearly, due to free movement of the human face and an uncontrolled environment, issues like illumination and pose variations still exist. Besides these issues, localizing faces or face segmentation in a cluttered environment in video sequences is very challenging. In surveillance scenarios, further challenges include poor video quality and lower resolution. For example, the face region can be as small as 15 x 15, while most feature-based approaches [48, 66] need large face images, of size as large as 128 x 128. However, video provides multiple observations linked by their temporal continuity.

Face pattern and face recognition

At the top of the pyramid lies the face pattern. The face pattern specializes the visual pattern by letting the object be a human face. Therefore, face-specific properties or characteristics should be taken into account when performing face recognition.

• Deformation. Humans express emotions through facial expressions, yielding patterns under nonrigid deformations. The non-rigidity has a very high degree of freedom and complicates the recognition task. Figure 1.6(a) shows the face images of a person exhibiting different expressions. While facial expression analysis attracts a lot of attention [42, 60, 61], recognition under facial expression variation has not been fully explored.

Figure 1.6: (a) Appearances of one individual with different facial expressions (from [53]). (b) Appearances of one individual at different ages (from [50]).

• Aging. Face appearances vary significantly with aging, and such variations are specific to an individual.
As a result, theoretical modeling of aging [50] is very difficult due to the individualized variation. Figure 1.6(b) shows the face images of a person at different ages.

• Face surface. One special property of the face surface is its bilateral symmetry. The symmetry constraint has been widely exploited in [102, 104, 204]. In addition, surface integrability is an inherent property of any surface, which has also been used in [99, 103, 137, 204].

• Self-similarity. There is a strong visual similarity among face images of different individuals. The geometric positioning of facial features such as the eyes, nose, and mouth is alike across individuals. Early face recognition approaches in the 1970s [24, 46] used the distances between feature points to describe the face and achieved some success. Also, face surface material properties are similar within the same race. As a consequence of visual similarity, the "shapes" of the face appearance manifolds belonging to different subjects are similar. This is the foundation of approaches [55, 56, 211] that attempt to capture the "shape" characteristics by constructing the so-called intra-person space.

• Makeup, cosmetics, etc. These factors are specific to an individual and so are unpredictable. Except for the effect of glasses, which has been studied in [41], effects induced by such factors have not been widely investigated.

Face appearances of the same individual under variations in illumination, pose, deformation, aging, etc. lie in a nonlinear manifold. Figure 1.7 visualizes such a manifold by projecting the appearances in its top row onto the top three principal components. Manifold characterization can be done in various ways. One way is to embed a manifold in a low-dimensional space [162, 166]. The other way is to learn the nonlinearity using machine learning techniques [9, 19, 63, 172, 177, 179, 181, 189, 198].

1.2 Unconstrained Face Recognition

State-of-the-art face recognition systems yield satisfactory performance under controlled conditions. To be specific, the face images are typically acquired in frontal views and are often illuminated by a frontal light source. These conditions pose strong restrictions on the patterns that can be acquired. In other words, the clustering nature of the produced patterns (usually tightly clustered) is amenable to classical pattern analysis. Therefore, most face recognition approaches lie in the first level of the hierarchy. Unfortunately, recognition performance degrades significantly when face recognition systems are presented with patterns that go beyond these controlled conditions.

Recently, researchers have begun to investigate face recognition under unconstrained conditions. Examples of unconstrained conditions include illumination and pose variations, video sequences, expression, aging, and so on. In general, recognition approaches addressing the second and third levels of the hierarchy can be considered in the category of unconstrained face recognition.

Figure 1.7: Face appearances in a video sequence, forming a nonlinear manifold.

The present thesis presents several unconstrained face recognition approaches. It consists of three parts: Part I is on Face Recognition under Variations, Part II on Face Recognition via Kernel Learning, and Part III on Face Tracking and Recognition from Videos.

1.2.1 Face recognition under variations

Part I of the thesis studies face recognition under illumination and pose variations. Pose and illumination are related to the second level of Figure 1.3.
In Chapter 2, we present a generalized photometric stereo algorithm for recognizing faces under illumination variation and then in Chapter 3 an illuminating light field algorithm for recognizing faces under illumination and pose variations. Most photometric stereo algorithms employ a Lambertian reflectance model with a varying albedo field and involve the appearances of only one object. The recovered albedos and surface normals are object-specific and appearances not be- longing to the object cannot be easily handled. In Chapter 2, we generalize pho- tometric stereo algorithms to handle all appearances of all objects in a class, in particular the human face class, by assuming that albedos and surface normals of all objects in the class be rank-constrained, i.e. lie in a subspace. Rank con- straints lead us to a factorization of an observation matrixthat consists of exemplar images of different objects under different illuminations. To fully recover the sub- space bases or class-specific albedos and surface normals, we employ integrability and face symmetry constraints and propose a linearized algorithm. This algorithm takes into account the effects of varying albedo field by approximating the inte- grability terms using only the surface normals. We then apply our generalized photometric stereo algorithm for recognizing faces under illumination variations. As far as recognition is concerned, we can utilize a bootstrap set which is just a collection of 2D image observations to avoid an explicit requirement that 3D infor- 15 mation be available. We obtain good recognition results using the PIE database [187, 202, 204]. The illuminating light field algorithm presented in Chapter 3 is an image-based method for face recognition across different illumination and different poses, where the term image-based means that no explicit prior 3D models are needed. As face recognition under illumination and pose variations involves three factors, namely identity, illumination, and pose, generalizations in all these three factors are de- sired. The illuminating light field approach is able to generalize in identity and illumination and handle a given set of poses. The proposed approach derives an identity signature that is illumination- and pose-invariant, where the identity is tackled using subspace encoding, the illumination is characterized using a Lam- bertian reflectance model, and the given set of poses is treated as a whole. Ex- perimental results using the PIE database demonstrate the effectiveness of the proposed approach [188, 208]. 1.2.2 Face recognition via kernel learning As mentioned earlier, the visual pattern lies in a nonlinear manifold, which is further complicated by face-specific characteristics. Nonlinear data modeling is an important research topic in machine learning. While linear data modeling such as PCA and LDA utilizes first- and second-order statistics, higher-order statistics play essential roles in nonlinear data modeling. Kernel learning methods (or kernel methods) are able to capture the higher-order statistical information. In the core of kernel learning methods lie two important components: a learning algorithm using linear geometry and a nonlinear feature space induced by a kernel function. Such a space is referred as reproducing kernel Hilbert space (RKHS) 16 in the literature. Kernel methods are linear learning algorithms operating on the nonlinear feature space. In Part II, we introduce two kernel learning methods. 
Chapter 4 presents a probabilistic approach to analyzing kernel principal components by naturally combining in one treatment the theory of probabilistic principal component analysis and that of kernel principal component analysis. In this formulation, the kernel component enhances the nonlinear modeling power, while the probabilistic structure offers (i) a mixture model for a nonlinear data structure containing nonlinear sub-structures, and (ii) an effective classification scheme. It also turns out that the original loading matrix [15] is replaced by the newly defined empirical loading matrix. The expectation-maximization algorithm for learning the parameters of interest is then developed. Computation of the reconstruction error and the Mahalanobis distance is also discussed. Finally, we apply this approach to face recognition [198, 209].

Probabilistic distance measures are important quantities in many research areas. For example, the Chernoff distance (or the Bhattacharyya distance as a special case) is often used to bound the Bayes error in a pattern classification task, and the Kullback-Leibler (KL) distance is a key quantity in the information theory literature. However, computing these distances is a difficult task, and analytic solutions are not available except under some special conditions. One popular example is the Gaussian density. The Gaussian density employs only up to second-order statistics, and its modeling capacity is linear and hence rather limited. In Chapter 5, we enhance this capacity through a nonlinear mapping from the original data space to an RKHS, which is implemented using kernel embedding. Since this mapping is nonlinear, we achieve a new paradigm for studying these distances, whose feasibility and efficiency are demonstrated using experiments on synthetic and face recognition examples [189].

1.2.3 Face tracking and recognition from videos

Video sequences are becoming ubiquitous due to the advances in digital imaging devices and the advent of the internet era. A face in video sequences presents further challenges to recognition algorithms besides those common to face recognition from still images.

In Chapter 6, we present an approach called adaptive visual tracking that incorporates appearance-adaptive models in a particle filter to realize robust visual tracking. Tracking needs modeling of inter-frame motion and appearance changes, whereas recognition needs modeling of appearance changes between frames and gallery images. In conventional tracking algorithms, the appearance model is either fixed or rapidly changing, and the motion model is simply a random walk with fixed noise variance. Also, the number of particles is typically fixed. All these factors make the visual tracker unstable. To stabilize the tracker, we propose the following features: an observation model arising from an adaptive appearance model, an adaptive-velocity motion model with adaptive noise variance, and an adaptive number of particles. The adaptive-velocity model is derived using a first-order linear predictor based on the appearance difference between the incoming observation and the existing particle configuration. Occlusion analysis is implemented using robust statistics. Experimental results [186, 201, 203] on tracking visual objects in long outdoor and indoor video sequences demonstrate the effectiveness and robustness of our tracking algorithm.
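The sketch below shows a generic sequential importance sampling / resampling particle-filter skeleton of the kind such trackers build on. The state parameterization, the user-supplied propagate and likelihood functions, and the fixed particle count are illustrative assumptions; it does not reproduce the adaptive appearance, adaptive-velocity, or occlusion-handling components of Chapter 6.

```python
import numpy as np

def particle_filter(frames, likelihood, propagate, n_particles=200, seed=0):
    """Generic SIR particle-filter skeleton (a sketch, not the adaptive tracker).

    frames:     iterable of observations (e.g., video frames)
    likelihood: function(frame, particles) -> nonnegative weights, shape (n_particles,)
    propagate:  function(particles, rng) -> predicted particles (motion model)
    Both model functions are assumed to be supplied by the caller.
    """
    rng = np.random.default_rng(seed)
    # A hypothetical 4-D motion state per particle: [x, y, scale, rotation]
    particles = propagate(np.zeros((n_particles, 4)), rng)
    estimates = []
    for frame in frames:
        particles = propagate(particles, rng)                  # predict
        w = likelihood(frame, particles)                       # weight by appearance model
        w = w / w.sum()
        estimates.append((w[:, None] * particles).sum(axis=0)) # MMSE state estimate
        idx = rng.choice(n_particles, size=n_particles, p=w)   # resample
        particles = particles[idx]
    return np.array(estimates)
```

The simultaneous tracking-and-recognition model of Chapter 7 augments this state with a discrete identity variable, so that the same weighting and resampling steps also propagate a posterior over identities.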
In Chapter 7, recognition of human faces using a gallery of still images and a probe set of videos is systematically investigated using a probabilistic framework called simultaneous tracking and recognition. In still-to-video recognition, where the gallery consists of still images, a time series state space model is proposed to fuse temporal information in a probe video; it simultaneously characterizes the kinematics and the identity using a motion vector and an identity variable, respectively. The joint posterior distribution of the motion vector and the identity variable is estimated at each time instant and then propagated to the next time instant. Marginalization over the motion vector yields a robust estimate of the posterior distribution of the identity variable. A computationally efficient sequential importance sampling (SIS) algorithm is developed to estimate the posterior distribution. Empirical results demonstrate that, due to the propagation of the identity variable over time, a degeneracy in the posterior probability of the identity variable is achieved, giving improved recognition. We perform experiments [192, 193, 194, 195, 196, 197, 199] using images/videos with pose/illumination variations to illustrate the effectiveness of this approach for the still-to-video scenario with appropriate model choices.

In Chapter 8, we present the most general framework for characterizing the face identity in a single image or a group of images, with each image containing a transformed version of the object. In terms of the transformation, the group is made up of either still images or frames of a video sequence. The face identity signature is either discrete- or continuous-valued. This framework, referred to as probabilistic identity characterization, integrates all the evidence of the set and handles the localization problem as well as illumination and pose variations through subspace identity encoding. Issues and challenges arising in this framework are addressed and efficient computational schemes are given. All instances of face recognition algorithms can be interpreted in this most general framework [210].

Part I: Face Recognition under Variations

Chapter 2

Generalized Photometric Stereo

In this chapter, we present a theory of generalized photometric stereo and its application to face recognition across illumination. We first present the generalized photometric stereo algorithm, which is able to handle all appearances, under different illumination, of all objects in a class, in particular the human face class. In contrast, the ordinary photometric stereo algorithm handles the appearances belonging to one object under different illumination. We then evaluate this algorithm in its application to face recognition under illumination variation. Since this generalization is linear, the blending linear coefficients offer an illuminant-invariant identity signature.

Figure 2.1 motivates the proposed approach. The first row of Figure 2.1 displays one Yale object [68] under eight different illumination conditions. Photometric stereo algorithms can recover the varying albedos and surface normals for the object, even assuming no knowledge of the illumination conditions. Here, by photometric stereo algorithm we mean any algorithm that utilizes a Lambertian reflectance model to describe the visual appearance and has the capability to recover the albedos and surface normals involved in the reflectance model.
This can be handled by the ordinary photometric stereo algorithm. Bottom row: Eight different objects illuminated by eight different lighting sources. This cannot be handled by the ordinary photometric stereo algorithm but can be handled by the proposed generalized photometric stereo algorithm. ric stereo algorithm cannot handle the images in the second row of Figure 2.1, where each image represents a different object under a different illumination. This motivates us to propose a generalized photometric stereo approach. As in ordinary photometric stereo algorithm, the generalized photometric stereo algorithm utilizes a Lambertain reflectance model to depict the visual appearance. The significant difference between the ordinary and generalized photometric stereo algorithms lies in the image ensemble they analyze. The image ensemble that the ordinary photometric stereo algorithm analyzes consists of the appearances of one object under different illumination while, in general, the image ensemble that the generalized photometric stereo algorithm analyzes consists of the appearances of different objects, with each object under a different illumination. Analysis of the latter image ensemble is very difficult. To this end, we introduce a key assumption: These different objects belong to one class (for example, the human face class) so that they are linearly spanned by a fixed number of basis objects. Generalized pho- tometric stereo does not assume any knowledge of the lighting sources as well as the blending coefficients. Rather, the generalized photometric stereo approach ac- tually recovers such information. To further complicate the matter, the knowledge 22 of the basis objects is also unknown and needs to be recovered. We evaluate the generalized photometric stereo algorithm for a face recognition application. The key assumption has two important implications. Firstly, it fits with the requirement of a recognition task that needs a generalization capability built on a training set. The idea is to learn the basis objects from the training set. Once learned, we use them to cope with arbitrary images belonging to objects other than those in the training set. Secondly, because the bases are for the object class only, the blending coefficients provide an identity encoding which is invariant to illumination. We use the blending coefficients for face recognition under illumination variation, which results in good recognition performance. Chapter organization Section 2.1 elaborates the generalized photometric stereo algorithm and addresses its issues and challenges. Section 2.2 details the face recognition setting and presents the experimental results using the PIE database. Appendices 2.I and 2.II give supplementary details of the algorithms proposed in the chapter. A glossary of notations In general, we denote a scalar by a, a vector by a, and a matrix with r rows and c columns by Ar?c. The matrix transpose is donate by AT, the pseudo-inverse by A?. The matrix L2-norm is denoted by ||.||2. The following notations are introduced for the sake of notational conciseness and emphasis of special structure. ? Concatenation notations: ? and ?. ? and ? mean horizontal and vertical concatenations, respectively. For 23 example, we can represent a n ? 1 vector an?1 by a = [a1,a2,...,an]T = [?ni=1 ai] and its transpose by aT = [a1,a2,...,an] = [?ni=1 ai]. We can use ? and ? to concatenate matrices to form a new matrix. For instance, given a collection of matrices {A1,A2,...,An} of size r ? 
c, we construct a r ?cn matrix1 [?ni=1 Ai] = [A1,A2,...,An] and a rn?c matrix [?ni=1 Ai] = [AT1 ,AT2 ,...,ATn]T. In addition, we can combine ? and ? to achieve a concise notation. Rather than representing a matrix Ar?c as [aij], we represent it as Ar?c = [?ri=1 [?cj=1 aij] ] = [?cj=1 [?ri=1 aij] ]. Also we can easily construct ?big? matrices using ?small? matrices {A11,A12,...,A1n,...,Amn} of size r?c. The matrix [?mi=1 [?nj=1 Aij] ] is of size rm?cn, the matrix [?mi=1 [?nj=1 Aij] ] of size r?cmn. ? Kronecker (tensor) product: ?. It is defined as Am?n ?Br?c = [?mi=1 [?nj=1 aijB] ]mr?nc. ? Hadamard (element-wise) product: ?. It is defined as Am?n ?Bm?n = [?mi=1 [?nj=1 aijbij] ]m?n. ? Special notation: circledot. This is used for the special structure of the object-specific albedo-shape ma- trix T (The definitions of T, p, and N are listed below), i.e., Td?3 = [?di=1 (pinTi )] = pcircledotNT = (pd?1 ?11?3)?NT3?d Some special scalars, vectors, and matrices are defined as follows: ? d: number of pixels; 1We do not need the size of {A1,A2,...,An} to be exactly same. We use the same matrix size for simplicity. For example, for [?ni=1 Ai], we only need the number of rows of these matrices to be same. 24 ? m: the rank used in the first rank constraint. ? i, j, iprime, jprime, l, and k: loop indices. ? 1r?c: a r?c matrix of ones. ? In: an identity matrix of size n?n. ? h: a pixel; hd?1: an image. ? p: albedo at a pixel. pd?1: albedo vector ? n3?1 = [?a,?b,?c]T: unit surface normal vector; ?a, ?b, and ?c: elements of n. ? N3?d = [?di=1 ni]: the surface normal matrix. ? t3?1 = [a,b,c]T: product of albedo and surface normal; a, b, and c: elements of t. ? Td?3 = [?di=1 (pinTi )]: the object-specific albedo-shape matrix. Also, Td?3 = [a,b,c] where a, b, and c are d?1 vectors. ? s3?1: illumination vector. S3?n: the matrix consisting of a collection of different illumination vectors. ? fm?1: the vector of blending linear coefficients under the first rank constraint. Fm?n: the matrix consisting of a collection of different f?s. ? Wd?3m = [?mi=1 Ti]: the class-specific albedo-shape matrix. Also, Wd?3m = [?mi=1 [ai,bi,ci]]. ? A, B, C: A = [?mi=1 ai], B = [?mi=1 bi], and C = [?mi=1 ci]. ? Wf: Wf = [?mi=1 (Tis)]d?m. 25 ? Ws: Ws = [Af,Bf,Cf]. ? Hd?n = [?ni=1 hi]: the observation matrix consisting of a collection of images. ? ?Wd?3m: the U matrix after a rank-3m SVD factorization of H. ? ?w(x): a 3m?1 vector same as the row in ?W associated with the pixel x ? R3m?3m: the ambiguity matrix in the factorization. ? raj, rbj, and rcj: the (3j ?2)th, (3j ?1)th, and (3j)th columns of the matrix R. ? ?: an indicator function. ? x = (x,y): pixel coordinate; ?x = (?x,y): the symmetric point of x. ? ?: the integrability constraint term. ? ?: the face symmetry constraint term. 2.1 Principle of Generalized Photometric Stereo This section describes the generalized photometric stereo algorithm. We start in Section 2.1.1 by a brief review of related literature and highlight the advantages of the proposed approach. We list in Section 2.1.2 the setting and constraints. Then we present a method to recover the albedos and surface normal for a class of objects in Sections 2.1.3 and 2.1.4. Section 2.1.3 handles the isolated task of separating the illumination (v.i.z. finding the illuminant vector and the blending coefficients) from an arbitrary image, which is used in the recovery algorithm presented in Section 2.1.4. 
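To make the notation above concrete, the structured products it introduces (horizontal and vertical concatenation, the Kronecker and Hadamard products, and the circledot construction T = p circledot N^T) can be reproduced in a few lines of NumPy. This is only an illustrative sketch; the array sizes and random values below are arbitrary and do not come from the dissertation's experiments.

```python
import numpy as np

# Illustrative sizes only (not taken from the experiments in this chapter)
d, m = 4, 2                                    # d pixels, m basis objects
rng = np.random.default_rng(0)

# Albedo vector p (d x 1) and unit surface normals N (3 x d)
p = rng.uniform(0.2, 1.0, size=d)
N = rng.normal(size=(3, d))
N /= np.linalg.norm(N, axis=0, keepdims=True)  # make every column a unit vector

# The circledot construction: T is d x 3 with i-th row equal to p_i * n_i^T
T = p[:, None] * N.T

# Kronecker product (used to form f (x) s) and Hadamard (element-wise) product
f = rng.normal(size=m)
s = rng.normal(size=3)
k = np.kron(f, s)                              # length-3m vector f (x) s
A, B = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
hadamard = A * B                               # element-wise product of A and B

# Horizontal and vertical concatenation of a family of matrices
mats = [rng.normal(size=(3, 2)) for _ in range(m)]
W_horizontal = np.hstack(mats)                 # 3 x 2m
W_vertical = np.vstack(mats)                   # 3m x 2
```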
26 2.1.1 Literature review and proposed approach Recovery of albedos and surface normals has been studied in the computer vision research for a long time. Usually a Lambertian reflectance model, ignoring both attached and cast shadows, is employed. Early works from the shape from shading (SFS) literature have typically assumed a constant albedo field: this assumption is not valid for many real objects and thus limits the practical applicability of the SFS algorithms. Early photometric stereo approaches require the knowledge of lighting conditions, but such knowledge is hard to gather under uncontrolled scenarios. Recent research efforts [74, 68, 94, 95, 96, 101, 103, 104] attempt to go beyond these restrictions by (i) using a varying albedo field, a more accurate model of the real world, and (ii) assuming no prior knowledge or requiring no control of the lighting sources. As a consequence, the complexity of the problem has also significantly increased. If we fix the imaging geometry and only move the lighting source to illumi- nate one object, the observed images (ignoring the cast and attached shadows) lie in a subspace completely determined by three images illuminated by three in- dependent lighting sources [101]. If an ambient component is added [103], this subspace becomes 4-D. If attached shadows are considered, the subspace dimen- sion grows to infinity [97] but most of its energy is packed in a limited number of harmonic components, thereby leading to a low-dimensional subspace approx- imations in [94, 95, 100]. However, all the photometric-stereo-type approaches (except [74]) commonly restrict themselves to using object-specific samples and cannot perform reconstruction combining images produced by different objects. In this chapter, we present a generalized photometric stereo algorithm that is able to handle all appearances of all objects in a class, in particular the human face 27 class. To this end, we impose a rank constraint (i.e. a linear generalization) on the albedos and surface normals of all human faces. We choose the human face as a working example because it naturally fits in our framework and is widely studied in the photometric stereo literature; however this does not pose any limitations in applying our algorithm to other object classes such as vehicles. We propose a rank constraint on the product of albedo and surface normal. The rank constraint enables us to accomplish a factorization of the observation matrix that decomposes a class-specific ensemble into a product of two matrices: one encoding the albedos and surfaces normals for a class of objects and the other encoding blending linear coefficients and lighting conditions. A class-specific en- semble consists of exemplar images of different objects with each under a different illumination, which is beyond what can be analyzed using the bilinear analysis of [138]. Bilinear analysis requires exemplar images of different objects under the same set of illumination conditions. Because a factorization is always up to an in- vertible matrix, unique recovery of the albedos and surface normals is not possible and requires additional constraints. We use two constraints: surface integrability and face symmetry. The surface integrability constraint [99, 137] has been used in several ap- proaches [68, 103] to successfully recover albedo and shape. The symmetry con- straint has also been employed in [102, 104] for face images. 
We present an ap- proach to fusing these constraints to recover the class-specific albedos and surface normals, even in the presence of shadows. More importantly, this approach takes into account the effects of a varying albedo field by approximating the integra- bility terms using only the surface normals instead of the product of the albedos and the surface normals. Due to the nonlinearity embedded in the integrability 28 terms, regular algorithms such as the steepest descent are inefficient. We derive a linearized algorithm to find the solution. 2.1.2 Setting and constraints Photometric stereo We assume a Lambertian imaging model with a varying albedo field. A pixel h is represented as h = p nTs = tTs, (2.1) where [.]T denotes the transpose, p is the albedo at the pixel, n ? [?a,?b,?c]T is the unit surface normal vector at the pixel, t3?1 ? [a ? p?a,b ? p?b,c ? p?c]T is the product of albedo and surface normal, and s (a 3 ? 1 unit vector multiplied by its intensity) specifies a distant illuminant. For time being, we consider the case without the shadow pixels and will deal with the shadow pixels later on. An image h is a collection of d pixels {hi,i = 1,...,d} 2. By stacking all the pixels into a column vector, we have hd?1 ? [?di=1 hi] = [?di=1 (pi nTi )]s = [?di=1 tTi ]s = [?di=1 [ai,bi,ci]]s = (pd?1 circledotNT3?d)s3?1 = [ad?1,bd?1,cd?1]s3?1 (2.2) = Td?3 s3?1, (2.3) where p ? [?di=1 pi] is the albedo vector, N ? [?di=1 ni] is the surface normal matrix, a ? [?di=1 ai] = [?di=1 pi?ai], b ? [?di=1 bi] = [?di=1 pi?bi], and c ? [?di=1 ci] = [?di=1 pi?ci]. To emphasize the structure of the T matrix which is a ?product? 2The index i corresponds to a spatial position x = (x,y). We will interchange both notations. For instance, we might also use x = 1,...,d. 29 of the albedo vector p and the surface normal N, we introduce a special notation circledot to denote T by T ? pcircledotNT ? [?di=1 tTi ] ? [a,b,c]. (2.4) We call the T matrix as the object-specific albedo-shape matrix. In the case of photometric stereo, we have n images of the same object, say {h1,h2,...,hn}, observed at a fixed pose illuminated by n different lighting sources, forming an object-specific ensemble. Simple algebraic manipulation gives: Hd?n ? [?ni=1 hi] = T[?ni=1 si] = Td?3 S3?n, (2.5) where H is the observation matrix and S ? [?ni=1 si] encodes the information on the illuminants. Hence photometric stereo is rank-3 constrained. Therefore, given at least three exemplar images for one object under three different independent illumination, we can determine the identity of a new probe image by checking if it lies in the linear span of the three exemplar images. This requires capturing at least three images for one object in the gallery set, which can be prohibitive in practical scenarios. Note that in this recognition setting, there is no need for the training set; in other words, the training set is equivalent to the gallery set. A typical recognition setting [58], however, assumes no identity overlap between the gallery set and the training set and often stores only one exemplar image for each object in the gallery set. However, the training set can have multiple images for one object. In order to generalize from the training set to the gallery and probe sets, we note that all images in the training, gallery, and probe sets belong to the same face class, which naturally leads to the rank constraint. 
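The rank-3 structure of the object-specific ensemble in Eq. (2.5) is easy to check numerically. The following sketch, which is an illustration rather than part of the original derivation, synthesizes a shadow-free Lambertian ensemble H = TS from random albedos, unit surface normals, and light sources, confirms that rank(H) = 3, and verifies that an image of the same object under a new light is explained by three exemplar images.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 500, 8                                   # d pixels, n lighting conditions

# Synthetic object-specific albedo-shape matrix T = p circledot N^T (d x 3)
p = rng.uniform(0.2, 1.0, size=d)
N = rng.normal(size=(3, d))
N /= np.linalg.norm(N, axis=0, keepdims=True)
T = p[:, None] * N.T

# n random distant light sources (direction times intensity)
S = rng.normal(size=(3, n))

# Shadow-free Lambertian images stacked as columns: H = T S  (cf. Eq. 2.5)
H = T @ S
print(np.linalg.matrix_rank(H))                 # prints 3: the ensemble is rank-3

# An image of the same object under a new light lies in the span of 3 exemplars
h_new = T @ rng.normal(size=3)
_, residual, *_ = np.linalg.lstsq(H[:, :3], h_new, rcond=None)
print(residual)                                 # essentially zero
```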
30 The rank constraint We impose the rank constraint on the T matrix by assuming that any T matrix is a linear combination of some basis matrices {T1,T2,...,Tm} coming from some m basis objects. Rank constraints are often found in the literature [110, 111, 129, 116, 117, 122]. Mathematically, there exist coefficients {fj; j = 1,...,m} such that Td?3 = msummationdisplay j=1 fjTj = [?mj=1 Tj](f?I3) = Wd?3m(fm?1 ?I3), (2.6) where f ? [?mj=1 fj], W ? [?mj=1 Tj], In denotes an identity matrix of dimension n?n, and ? denotes the Kronecker (tensor) product. Since the W matrix encodes all albedos and surface normals for a class of objects, we call it a class-specific albedo-shape matrix. Substitution of (2.6) into (2.3) yields hd?1 = Ts = W(f?I3)s = W(f?s) = Wd?3m k3m?1, (2.7) where k ? f?s. This leads to a two-factor bilinear analysis [138]. With the availability of n images {h1,h2,...,hn} for different objects, observed at a fixed pose illuminated by n different lighting sources, forming a class-specific ensemble, we have Hd?n = [?ni=1 hi] = W[?ni=1 (fi ?si)] = W[?ni=1 ki] = Wd?3m K3m?n, (2.8) where K ? [?ni=1 (fi ?si)] = [?ni=1 ki]. It is a rank-3m problem, which combines the rank of 3 for the illumination and the rank of m for the identity. The rank constraint generalizes many approaches in the literature other than the photometric stereo. If the surface normal is fixed and the albedo field lies in a rank-m linear subspace, we have (2.6) satisfied. Interestingly, the ?Eigenface? approach [62] is just a special case of this approach for a fixed illumination source. 31 Suppose that the fixed illuminant vector is ?s. (2.7) and (2.8) reduce to hd?1 = W(f??s) = ?Wd?mfm?1; Hd?n = [?ni=1 hi] = ?W[?ni=1 fi] = ?Wd?mFm?n, (2.9) where ?W ? [?mi=1 Ti?s]. Therefore, our approach can also be regarded as a gener- alized ?Eigenface? analysis able to handle illumination variation. Our immediate goal is to estimate W and K from the observation matrix H. The first step is to invoke an SVD factorization, H = U?VT, and retain the top 3m components as H = U3m?3mVT3m=?W ?K, where ?W = U3m and ?K = ?3mVT3m. Thus, we can recover W and K up to an 3m ? 3m invertible matrix R with W = ?WR, K = R?1?K. Additional constraints are required to determine the R matrix. We will use the integrability and face symmetry constraints, both related to W. Moreover, K must take the special structure K = [?i (fi ?si)]. Incidentally, by noting that T = p circledot NT, we can introduce a second rank constraint which assumes that (i) any p vector is a linear combination of some basis vectors {p1,p2,...,pm1} with m1 < d and (ii) any N matrix is a linear combination of some basis matrices{N1,N2,...,Nm2}with m2 < d. This is a common constraint used in the face recognition literature. For example, in [43, 66, 76], they all assume that shape and texture have separate bases. However, it turns out that the second rank constraint is not systematically superior to the first rank constraint in terms of recognition performance. Also, it is computationally inconvenient to use the second rank constraint. Hence, there exist two vectors fm1?1 ? [?i fi] and gm2?1 ? [?i gi] such that p = [?m1i=1 pi]f;NT = [?m2j=1 NTj ](g?I3), (2.10) 32 and similarly the image h can be expressed as hd?1 = [?m1i=1 [?m2j=1 (pi circledotNTj )] ](f?g?s) = Yd?3m1m2(fm1?1 ?gm2?1 ?s3?1), (2.11) where Y ? [?m1i=1 [?m2j=1 (pi circledotNTj )] ]. The integrability constraint One common constraint used in SFS research is the integrability of the surface [68, 99, 103, 137]. 
Suppose that the surface function is z = z(x) with x ? (x,y), we must have ??x ?z?y = ??y ?z?x. For the given unit surface normal vector n(x) ? [?a(x),?b(x),?c(x)]T at pixel x, the integrability constraint requires that ? ?x ?b(x) ?c(x) = ? ?y ?a(x) ?c(x) . (2.12) In other words, with ?(x) defined as an integrability constraint term, ?(x) ? ?c(x)? ?b(x) ?x ? ?b(x)??c(x) ?x + ?a(x) ??c(x) ?y ??c(x) ??a(x) ?y = 0. (2.13) If given the product of the albedo and the surface normal t(x) ? [a(x),b(x),c(x)]T with a(x) ? p(x)?a(x), b(x) ? p(x)?b(x), and c(x) ? p(x)?c(x), Eq. (2.13) still holds with ?a, ?b, and ?c replaced by a, b, and c, respectively. Practical algorithms approximate the partial derivatives by forward or backward differences or other differences with the inherent smoothness assumption. Hence, the approximations based on t(x) are very rough especially at places where abrupt albedo variations exist (e.g. the boundaries of eyes, iris, eyebrow, etc.) since the smoothness assumption is seriously violated. We should by all means use n(x) in order to remove this effect. 33 The face symmetry constraint For a face image in a frontal view, one natural constraint is its symmetry about the central y-axis [102, 104]: p(x,y) = p(?x,y);?a(x,y) = ??a(?x,y);?b(x,y) = ?b(?x,y);?c(x,y) = ?c(?x,y), (2.14) which is equivalent to, using x ? (x,y) and its symmetric point ?x ? (?x,y), a(x) = ?a(?x); b(x) = b(?x);c(x) = c(?x). (2.15) If a face image in a non-frontal view, such a symmetry still exists but the coordinate system should be modified to take into account the view change. 2.1.3 Separating illumination In this section, we temporarily assume that the class-specific albedo-shape matrix W is available and solve the problem of separating illumation, v.i.z., foran arbitrary image h, find the illuminant vector s and the coefficient f under the first constraint (or f and g under the second constraint). For convenience in performing tasks such as recognition, we also normalize the solution f to the same range. The first rank constraint gives rise to the basic equation h = W (f?s). So, we convert the separation task to a minimization task of finding f and s to minimize the least square (LS) cost, i.e., minf ,s E(f,s) ?bardblh?W (f?s)bardbl 2, (2.16) Note that f and s can be recovered only up to a non-zero scalar; one can always multiply f by a non-zero scalar and divide s by the same scalar. Therefore, without loss of generality, we can simply pose an additional constraint: 1Tf = 1, where 1m?1 is a vector of 1?s. 34 One way to solve this is indicated in [74]. It is a two-step algorithm. First, k is approximated by k = W?h. Then k = f ? s is used to solve for f and s, again using the LS approximation, i.e. finding f and s such that the cost bardblk ? f ? sbardbl2 is minimized. However, as pointed out in [74], the above algorithm is not robust since two approximations are involved. Before we proceed to the actual separation algorithm, note that shadows in principle increase the rank (for the illumination only) to infinity. However, if those pixels are successfully excluded in our calculations, the rank for the illumination is still maintained to be 3 and the overall rank is 3m. In view of the above and considering the normalization requirement, we modify the cost function as E(f,s) ?bardbl? ?(h?W (f?s))bardbl2 + (1Tf?1)2, (2.17) where ?d?1 indicates the inclusion or exclusion of the pixels of the image h and ? denotes the Hadamard (or element-wise) product. 
Notice that (2.17) can be easily generalized to a cost function used in robust estimation if the vector norm is replaced by a robust function, and ? by an appropriate weight function. Using the fact that Eq. (2.7) provides a series of sub-equations, which is linear in f if s is fixed and in s if f is fixed, we can design a simple iterative algorithm. Each iteration of the algorithm has three steps. In the first step, we solve for the LS estimate of f, given s and ?. f = ? ?? ? Wf 1T ? ?? ? ?? ?? ? ? ?h 1 ? ?? ?; Wf ? [?mi=1 (Tis)]d?m. (2.18) In the second step, we solve for the LS estimate of s, given f and ?: s = W?s(? ?h); Ws ? [ [?mi=1 ai]f, [?mi=1 bi]f, [?mi=1 ci]f ]d?3 ? [Af,Bf,Cf], (2.19) 35 where Ad?m ? [?mi=1 ai], Bd?m ? [?mi=1 bi], and Cd?m ? [?mi=1 ci], respectively. In the third step, given f and s we update ? as follows3: ? = [ |h?W (f?s)| < ? ], (2.20) where ? is a pre-defined threshold. Note that in (2.18) and (2.19), additional saving in computation is possible. We can form dimension-reduced matrices Wprimef and Wprimes and vector hprime and apply the primed version in (2.18) and (2.19) The matrices Wprimef and Wprimes and vector hprime are formed from Wf, Ws, and h, respectively, by discarding those rows corresponding to the excluded pixels. The initial conditions can be arbitrary. But, for fast convergence, we need good initial values. In our implementation, we estimate s using the algorithm presented in [105]. To initialize ?, we employ heuristics to distinguish pixels in shadows: their intensities are close to zero. In practice, we set those pixels whose intensities are smaller than a certain threshold as missing values. In addition, we also set those pixels whose intensities are above a certain threshold as missing values to remove pixels possibly in a specular region. This is only for initialization, we update ? during iterations. To test the stability of our algorithm, we perturb the initial conditions and find that our algorithm is very stable in the sense that it always reaches the same solution (up to the convergence error) regardless of initial conditions and generates a smaller residual than the algorithm reported in [74]. Learning f, g, and s from h using the second constraint is a straightforward generalization of the above algorithm. Appendix 2.I presents such a recovery al- gorithm in an even more general setting, i.e. a multilinear setting. 3This is a Matlab operation which performs an element-wise comparison. 36 2.1.4 Recovering class-specific albedos and surface normals The recovery task is to find from the observation matrix H the class-specific albedo-shape matrix W (or equivalently R), which satisfies both the integrabil- ity and symmetry constraints, as well as the matrices F and S. We decompose R as R3m?3m ? [?mj=1 [raj,rbj,rcj]] and treat the column vectors {raj,rbj,rcj; j = 1,...,m} as our computational ?units?. We also decompose ?W as ?W ? [?dx=1 ?wT(x)] where ?w(x) is a 3m?1 vector same as the row in ?W corresponding to the pixel x. As W ? [?dx=1 [?mj=1 [aj(x),bj(x),cj(x)]]] = ?WR, we have aj(x) = ?wT(x)raj, bj(x) = ?wT(x)rbj, cj(x) = ?wT(x)rcj; j = 1,...,m. (2.21) As mentioned in Section 2.1.3, we must take into account attached and cast shadows. After setting them as missing values, we perform SVD with missing val- ues [149] to find ?W. Other approaches for dealing with missing value are available in [141, 165, 169]. 
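Before formulating the full recovery problem, it is convenient to summarize the three-step illumination-separation iteration of Section 2.1.3 in code, since the same routine reappears below as the sub-step that updates F, S, and the inclusion masks. The sketch is a simplified illustration, not the exact implementation: the initialization of s and of the mask, the threshold tau, and the fixed iteration count are placeholders, and the shadow/specularity heuristics of the text are reduced to a single residual test.

```python
import numpy as np

def separate_illumination(h, W, m, n_iter=50, tau=0.1):
    """Alternating least-squares sketch for h ~ W (f (x) s), cf. Eqs. (2.18)-(2.20).

    h : (d,) image;  W : (d, 3m) class-specific albedo-shape matrix.
    Initialization, the threshold tau, and the iteration count are placeholders.
    """
    # Crude initial mask: drop near-black (shadow) and near-saturated pixels
    eps = ((h > 0.05 * h.max()) & (h < 0.95 * h.max())).astype(float)
    s = np.array([0.0, 0.0, 1.0])               # arbitrary frontal-light initial guess
    f = np.full(m, 1.0 / m)

    for _ in range(n_iter):
        # Step 1: least-squares estimate of f with s and eps fixed (cf. Eq. 2.18);
        # the appended row of ones enforces 1^T f = 1 in the least-squares sense.
        Wf = W @ np.kron(np.eye(m), s[:, None])     # columns are T_i s
        A = np.vstack([eps[:, None] * Wf, np.ones((1, m))])
        b = np.concatenate([eps * h, [1.0]])
        f = np.linalg.lstsq(A, b, rcond=None)[0]

        # Step 2: least-squares estimate of s with f and eps fixed (cf. Eq. 2.19);
        # W @ (f (x) I_3) equals sum_i f_i T_i, the blended albedo-shape matrix.
        Ws = W @ np.kron(f[:, None], np.eye(3))
        s = np.linalg.lstsq(eps[:, None] * Ws, eps * h, rcond=None)[0]

        # Step 3: re-estimate which pixels to include (cf. Eq. 2.20)
        eps = (np.abs(h - W @ np.kron(f, s)) < tau).astype(float)
    return f, s, eps
```

In the recovery algorithm of Section 2.1.4, a routine of this form is simply invoked once per image h_i with the current estimate of W held fixed.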
In view of the above, we formulate the following optimization problem: mini- mize over R, F, and S the cost function E defined as E(R,F,S) = 12 nsummationdisplay i=1 dsummationdisplay x=1 ?i(x){hi(x) ? ?w(x)TR(fi ?si)}2 +?12 msummationdisplay j=1 dsummationdisplay x=1 {?j(x)}2 + ?22 msummationdisplay j=1 dsummationdisplay x=1 {?j(x)}2, = E0(R,F,S)+ ?1E1(R)+ ?2E2(R), (2.22) where ?i(x) is an indicator function which takes the value one if the pixel x of the image hi is not in shadow and zero otherwise, ?j(x) is the integrability constraint term based only on surface normals as defined in (2.13), and ?j(x) is the symmetry constraint term given as ?2j(x) = {aj(x) + aj(?x)}2 +{bj(x) ?bj(?x)}2 +{cj(x) ?cj(?x)}2; j = 1,...,m. (2.23) 37 One approach could be to directly minimize the cost function over W, F, and S. This is in principle possible but numerically difficult as the number of unknowns depends on the image size, which can be quite large in practice. As shown in [98], the recovered surface normal is up to a generalized bas-relief (GBR) ambiguity. To avoid trivial solutions such as a planar object4, we normalize the matrix R by setting ||R||2 = 1 where ||.||2 is a matrix norm. Another ambiguity between fj and sj is a nonzero scale, which can be removing by normalizing f to same range: fTj 1 = 1, where 1m?1 is a vector of 1?s. To summarize, we perform the following task: minR,F,S E(R,F,S) subject to ||R||2 = 1,FT1 = 1. (2.24) An iterative algorithm can be designed to solve (2.24). While solving for F and S with R fixed is quite easy, solving for R with F and S is very difficult because the integrability constraint terms involve partial derivatives of the surface normals that are nonlinear in R. Regular algorithms such as the steepest descent are inefficient. One main contribution of this chapter is that we propose a linearized algorithm to solve for R, which is detailed in Appendix 2.II. We now illustrate how to update F = [?i fi], S = [?i si], and ? = [?i ?i] with R fixed (or W fixed). First notice that F, S, and ? are only involved in the term E0. Moreover, fi, si and ?i are related to only the image hi. This becomes the same as the illumination separation problem defined in Section 2.1.3. The proposed algorithm is also iterative in nature. After running one iterative step to obtain the updated F, S, and ?, we proceed to update R again and this process 4In this way, the surface normals we are recovering are versions up to a GBR ambiguity with respect to the true physical surface normals [68]. However, they are enough for tasks such as face recognition under illumination variation. 38 carries on until convergence. To demonstrate how the algorithm works, we design the following scenario with m = 2 so that the rank of interest is 2x3=6. To defeat the photometric stereo algorithm, which requires one object illuminated by at least three sources, and the bilinear analysis, which requires two fixed objects illuminated by at least three same lighting sources, we construct eight images by taking random linear combinations of two basis objects illuminated by eight different lighting sources. Figure 2.2 displays the two basis objects under the same set of eight illumination and the synthesized images. The recovered class-specific albedo-shape matrix is also presented in Figure 2.2, which clearly shows the two basis objects. The quality of reconstruction is quite good except the nose part. 
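The synthetic experiment just described can also be mimicked numerically to verify the rank-3m structure of Eq. (2.8) and to exhibit the factorization ambiguity. The sketch below is illustrative only (sizes and random draws are arbitrary): it builds a class-specific ensemble with m = 2 basis objects and n = 8 images, checks that rank(H) = 3m = 6, and recovers W up to the invertible 3m x 3m matrix R from a rank-3m SVD.

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, n = 400, 2, 8                     # pixels, basis objects, images

def random_T(d, rng):
    # synthetic object-specific albedo-shape matrix T = p circledot N^T
    p = rng.uniform(0.2, 1.0, size=d)
    N = rng.normal(size=(3, d))
    N /= np.linalg.norm(N, axis=0, keepdims=True)
    return p[:, None] * N.T

W = np.hstack([random_T(d, rng) for _ in range(m)])          # d x 3m

# Each image is a random blend of the basis objects under its own random light
F = rng.dirichlet(np.ones(m), size=n).T                      # m x n, columns sum to 1
S = rng.normal(size=(3, n))
K = np.column_stack([np.kron(F[:, i], S[:, i]) for i in range(n)])  # 3m x n
H = W @ K                                                    # cf. Eq. (2.8)
print(np.linalg.matrix_rank(H))                              # prints 3m = 6

# Rank-3m SVD factorization H = W~ K~; the true W equals W~ R for invertible R
U, sv, Vt = np.linalg.svd(H, full_matrices=False)
W_tilde = U[:, :3 * m]
K_tilde = np.diag(sv[:3 * m]) @ Vt[:3 * m]
print(np.allclose(W_tilde @ K_tilde, H))                     # True
R = np.linalg.lstsq(W_tilde, W, rcond=None)[0]               # 3m x 3m ambiguity
print(np.allclose(W_tilde @ R, W))                           # True
```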
The reason might be that the two basis objects have quite distinct noses so that the nose part of their linear combinations is not visually good (see the image in the last column of the third row), which propagates to the recovery results of albedos and surface normals from these combination images. Our algorithm usually converges within 100 iterations. One notes that the special case m = 1 of our algorithm can be readily applied to photometric stereo (with the symmetry constraint removed) to robustly recover the albedos and surface normals for one object. 2.2 Face Recognition across Illumination This section deals with the face recognition part, which serves as a main evaluation tool for the generalized photometric stereo algorithm. Section 2.2.1 briefly reviews the literature on face recognition across illumination. In Section 2.2.2, we relax the requirement of recovering the albedos and surface normals by utilizing sample imagery as a bootstrap set for the recognition task. We then report in Section 39 Figure 2.2: The first row: The first basis object under eight different illumination. The second row: The second basis object under the same set of eight different illumination. The third row: Eight images (constructed by random linear combi- nations of two basis objects) illuminated by eight different lighting sources. The fourth row: Recovered class-specific albedo-shape matrix W showing the product of varying albedos and surface normals of two basis objects (i.e. the three columns of T1 and T2) using the generalized photometric stereo algorithm. 2.2.3 face recognition results using the PIE database. 2.2.1 Literature review and proposed approach Face recognition under illumination variation is a very challenging problem. The key is to successfully separate the illumination source from the observed appear- ance. Once separated, what remains is illuminant-invariant and appropriate for recognition. In addition to illumination variation, various issues embedded in the recognition setting make recognition even more difficult. We follow the recogni- tion protocol introduced in [58]. Assuming the availability of the following three sets, namely one training set, one gallery set, and one probe set, the recognition algorithm learns from the training set the characteristic features, associates de- 40 scriptive features with the objects in the gallery set, and determines the identity for the objects in the probe set. Different recognition settings can be formed in terms of identity and illumination overlaps among the training, gallery, and probe sets. The most difficult setting, which is the focus of this chapter, is obviously the one in which there is no overlap at all among the three sets in terms of both identity and illumination, except the identity overlap between the gallery and probe sets. In this setting, generalizations from known illumination to unknown illumination and from known identities to unknown identities are particularly desired. State-of-the-art research efforts can be grouped into three streams: subspace methods, reflectance-model methods, and 3D-model-based methods. (i) The first approach is very popular for the recognition problem. After removing the first three eigenvectors, principal component analysis (PCA) was reported to be more robust to illumination variation than the ordinary PCA or the ?Eigenface? approach [62]. Fisher discriminant analysis (FDA) [41, 70] has also been modified to handle illumination variations. 
In general, subspace learning methods are able to cap- ture the generic face space and thus to recognize new objects not present in the training set. The disadvantage is that subspace learning is actually tuned to the lighting conditions of the training set; therefore if the illumination conditions are not similar among the training, gallery, and probe sets, recognition performance may not be acceptable. (ii) The second approach [68, 74, 101, 104] employs a Lambertian reflectance model with a varying albedo field ignoring both attached and cast shadows. The main disadvantage of this approach is the lack of general- ization from known objects to unknown objects. (iii) The third approach employs 3D models. The ?Eigenhead? approach [65] assumes that the 3D geometry (or 3D depth information) of any face lies in a linear space spanned by the 3D geometry 41 of the training ensemble and uses a constant albedo field. The morphable model approach [66] is based on a synthesis-and-analysis strategy. Both geometry and texture are linearly spanned by those of the training ensemble. It is able to han- dle both illumination and pose variations with illumination directions specified. The weakness of the 3D model approaches is that they require 3D models and complicated fitting algorithms. Compared to the above, the proposed recognition scheme possesses the follow- ing properties: (i) It is able to recognize new objects not present in the training set; (ii) It is able to handle new lighting conditions not present in the training set; and (iii) No explicit 3D model and no prior knowledge about illumination condi- tions are needed. In other words, we combine the advantages of subspace learning and reflectance model-based methods. Further, we can avoid the recovery burden as far as recognition is concerned by using a proper bootstrap set under the first constraint. 2.2.2 Bootstrap set A procedure for learning the W matrix was presented in Section 2.1.4. Even though the learning algorithm is quite robust, it is possible that it gets trapped in local minima, which might subsequently yield inferior recognition results. Thus, an alternative approach without explicitly learning the W matrix is very beneficial. We now show that, as far as recognition is concerned, the W matrix under the first constraint can be replaced by a bootstrap set ?W consisting of sample imagery only. The bootstrap set can take various forms. In this chapter, we focus on such a bootstrap set that contains m exemplar objects captured at a fixed pose, each with three images illuminated by three independent but fixed lighting sources. 42 We denote ?hij as the image for the ith exemplar object illuminated by the jth exemplar lighting source. As an image can be expressed in a two-factor form using (2.7), we can write ?hij as ?hij = W(?fi ??sj); i = 1,...,m;j = 1,2,3. (2.25) where?fi isthe blending coefficient vector forthe ith exemplar object and?sj describes the jth exemplar lighting source. The bootstrap set ?W is then expressed as ?Wd?3m = [?mi=1 [?3j=1 ?hij] ] = W[?mi=1 [?3j=1 (?fi ??sj)] ] = Wd?3m(?Fm?m ? ?S3?3), (2.26) where ?F ? [?mi=1 ?fi] and ?S ? [?3j=1 ?sj] define the (not necessarily orthogonal) bases for the identity coefficients and the light sources, respectively. Thus, any vector f lies in the linear span of ?F, i.e., there exists a coefficient vector ? = [?mi=1 ?i] relating f with ?F in the following way: f = msummationdisplay i=1 ?i ?fi = ?F?; (2.27) Similarly, for any vector s, there exists ? 
= [?3j=1 ?j] such that s = 3summationdisplay j=1 ?j ?sj = ?S?. (2.28) Substituting (2.27) and (2.28) into (2.7), we have hd?1 = W(f?s) = W((?F ?)?(?S ?)) = W(?F??S)(???) = ?Wd?3m(?m?1 ??3?1) (2.29) Therefore, if the bootstrap set ?W is given, finding f and s for image h is equivalent to finding ? and ?. Since (2.29) is in a bilinear form, we can compute ? and ? 43 via the same algorithm described in Section 2.1.3 and employ ? for subsequent recognition task. The use of the bootstrap set yields an additional benefit. As indicated before, the rank for covering illumination variations in practice exceeds 3. Suppose that this rank is r > 3, we can use a bootstrap set of dimension d by rm, i.e. using images for m exemplar objects taken under r exemplar lighting conditions, to improve the recognition performance. Obviously, our separation algorithm can be generalized to handle s with dimension r?1. Unfortunately, no bootstrap set can be easily constructed for the second constraint using exemplar images. ?1 ?0.8 ?0.6 ?0.4 ?0.2 0 0.2 0.4 0.6 0.8 1 ?0.1 0 0.1 0.2 0.3 0.4 0 0.2 0.4 0.6 0.8 1 f17 f16 f15 f22 f14 f21 f13 f12 f 9 f20 f11 o ?? ground truth, x ?? estimated value head f 8 f 6 f19 f 7 f 5 f18 f10 f 4 f 2 f 3 Figure 2.3: Right: Flash distribution in the PIE database. For illustrative pur- poses, we move their positions on a unit sphere as only the illuminant directions matter. ?o? means the ground truth and ?x? the estimated values. 44 2.2.3 Recognition experiments We study an extreme recognition setting with the following features: there is no identity overlap between the training set and the gallery and probe sets; only one image per object is stored in the gallery set; the lighting conditions for the training, gallery and probe sets are completely unknown. Our strategy is to: (i) Learn W, if needed, from the training set using the recovery algorithm described in Section 2.1.4 or construct a bootstrap set ?W for simplicity; (ii) With W (or ?W) given, learn the identity signature f?s (or ??s) for both the gallery and probe sets using the recovery algorithm described in Section 2.1.3, assuming no knowledge of illumination directions; and (iii) Perform recogni- tion using the nearest correlation coefficient. Suppose that a gallery image g has its signature5 fg (or ?g) and a probe image p has its signature fp (or ?g), their correlation coefficient is k(p,g) = (fp,fg)/ radicalBig (fp,fp)(fg,fg), (2.30) where (x,y) is an inner-product such as (x,y) = xT?y with ? learned or given. We use ? as an identity matrix. PIE database We use the Pose and Illumination and Expression (PIE) database [75] in our ex- periment6. Figure 2.3 shows the distribution of all 21 flashes used in PIE and their estimated positions using our algorithm. Since the flashes are almost symmetri- cally distributed about the head position, we only use 12 of them distributed on 5In the sequel, we simply refer as f = [fT,gT]T for the second rank constraint 6We use the ?illum? part of the PIE database that is close to obeying the Lambertian model as in [70] while the ?light? part that includes an ambient light is used in [66]. 45 the right half of the unit sphere in Figure 2.3. More specifically, the flashes we used are f08, f09, f11-f17, and f20-f22. In total, we used 68x12=816 images in a fixed view as there are 68 subjects in the PIE database. Figure 2.4 displays one PIE object under the selected 12 illuminants. Registration is performed by aligning the eyes and mouth to desired posi- tions. 
No flow computation [66] is carried on for further alignment. After the pre-processing step, the cropped out face image is of size 50 by 50, i.e. d = 2500. Also, we only study gray images by taking the average of the red, green, and blue channels of their color versions. We use all 68 images under one illumination to form a gallery set and under another illumination to form a probe set. The training set is taken from sources other than the PIE dataset. Thus, we have 12x11=132 tests, with each test giving rise to a recognition score. Gallery f08 f09 f11 f12 f13 f14 f15 f16 f17 f20 f21 f22 Average Probe f08 - 96 96 87 66 60 46 29 22 85 78 53 65 f09 94 - 96 96 90 87 56 40 24 84 96 68 75 f11 94 91 - 97 72 72 38 28 16 100 94 51 69 f12 88 94 97 - 88 93 57 41 28 94 100 76 78 f13 56 87 59 85 - 100 90 71 50 54 87 100 76 f14 51 85 63 93 100 - 90 66 49 59 91 99 77 f15 33 40 37 49 85 88 - 93 78 32 49 97 62 f16 19 26 26 32 59 44 84 - 93 26 31 63 46 f17 14 28 19 26 50 41 68 94 - 19 26 44 39 f20 90 85 99 97 65 69 38 26 21 - 93 53 67 f21 79 94 93 100 88 94 62 49 28 91 - 76 78 f22 43 65 46 75 99 99 97 76 59 43 74 - 70 Average 60 72 66 76 78 77 66 56 42 63 74 71 67 Table 2.1: Recognition rate obtained by our approach using the first rank con- straint and the Yale?s database as the training set. 46 Figure 2.4: The first and second rows display one PIE object under the selected 12 illuminants (from left to right, row 1 to row 2: f08, f09, f11-f17, and f20-f22) and the third and fourth rows one Yale object under 9 lights (most frontal lights) used in the training set. Recognition across illumination We first assume that all the images have been captured in a frontal view, but we do not assume that the directions and intensities of the illuminants are known. [Yale training set] The training (or bootstrap) set is first taken as the Yale?s illumination database [68]. There are only 10 subjects (i.e. m = 10) in this database and each subject has 64 images in frontal view illuminated by 64 different lights. We pick out images under 9 lights (mostly frontal) in order to cover up to second-order harmonic components [95]. Figure 2.3 shows one Yale object under r = 9 lights. Table 2.1 lists the recognition rate for the PIE database using the first rank 47 Gallery f08 f09 f11 f12 f13 f14 f15 f16 f17 f20 f21 f22 Average Probe f08 - 100 90 66 21 9 1 9 4 60 60 1 38 f09 100 - 72 94 59 31 10 24 13 51 84 13 50 f11 97 91 - 100 29 24 13 15 10 100 94 19 54 f12 93 97 100 - 93 90 56 59 35 96 100 69 81 f13 19 62 22 68 - 97 82 100 68 13 84 81 63 f14 9 15 12 62 100 - 100 84 82 12 72 100 59 f15 0 3 1 4 76 100 - 74 76 1 18 100 41 f16 6 25 3 31 82 65 71 - 100 3 41 57 44 f17 4 12 3 31 51 56 81 100 - 3 28 59 39 f20 88 76 100 99 28 28 15 12 16 - 99 19 53 f21 84 97 97 100 96 88 57 74 46 96 - 71 82 f22 3 4 3 13 72 100 100 50 57 3 24 - 39 Average 46 53 46 61 64 62 53 54 46 40 64 54 54 Table 2.2: Recognition rate obtained by the ?Eigenface?approach (discarding the first 3 components) using the Yale?s database as the training set. 
Gallery f08 f09 f11 f12 f13 f14 f15 f16 f17 f20 f21 f22 Average Probe f08 - 97 97 93 63 56 29 16 9 94 85 29 61 f09 99 - 97 99 96 88 38 21 12 91 96 57 72 f11 99 96 - 99 62 63 29 16 12 100 94 41 65 f12 96 99 100 - 93 91 40 22 13 99 100 69 75 f13 74 93 69 84 - 100 71 37 16 62 87 97 72 f14 66 88 74 93 100 - 76 34 19 71 93 100 74 f15 22 34 24 35 71 66 - 82 46 28 44 99 50 f16 12 21 13 18 28 26 74 - 85 18 22 47 33 f17 6 7 9 13 15 18 40 81 - 13 16 24 22 f20 93 88 100 96 63 68 32 19 13 - 96 43 65 f21 87 94 100 100 93 99 51 22 15 99 - 84 77 f22 41 65 43 62 96 100 100 56 29 46 71 - 64 Average 63 71 66 72 71 70 53 37 24 65 73 63 61 Table 2.3: Recognition rate obtained by the ?Fisherface? approach using the Yale?s database as the training set. 48 Gallery f08 f09 f11 f12 f13 f14 f15 f16 f17 f20 f21 f22 Average Probe f08 - 100 99 99 97 97 79 72 43 99 97 93 88 f09 100 - 99 99 99 99 97 91 60 97 97 97 94 f11 99 99 - 100 100 100 90 76 65 100 100 99 93 f12 99 99 100 - 100 100 100 93 76 100 100 100 97 f13 99 99 100 100 - 100 100 100 88 99 100 100 99 f14 99 99 100 100 100 - 100 100 96 99 100 100 99 f15 84 94 93 100 100 100 - 100 100 88 100 100 96 f16 69 87 78 90 100 100 100 - 100 69 90 100 89 f17 44 60 51 71 84 91 99 100 - 56 75 94 75 f20 97 97 100 100 100 100 90 74 68 - 100 99 93 f21 97 97 100 100 100 100 100 97 82 100 - 100 98 f22 90 97 96 100 100 100 100 100 99 97 100 - 98 Average 89 93 92 96 98 99 96 91 80 91 96 98 93 Table 2.4: Recognition rate obtained by our approach with the first rank constraint and Vetter?s database as the training set. Gallery f08 f09 f11 f12 f13 f14 f15 f16 f17 f20 f21 f22 Average Probe f08 - 100 99 99 97 93 82 59 35 99 97 88 86 f09 100 - 99 99 99 99 91 84 53 99 99 96 92 f11 99 99 - 100 100 100 91 71 44 100 100 94 90 f12 99 99 100 - 100 100 99 90 72 100 100 99 96 f13 99 99 100 100 - 100 99 99 79 99 100 99 97 f14 99 99 100 100 100 - 99 97 87 99 100 99 98 f15 93 96 93 97 99 99 - 100 99 96 99 100 97 f16 75 90 69 93 97 99 100 - 99 69 94 100 89 f17 47 68 51 78 84 90 100 100 - 57 82 94 77 f20 99 99 100 100 99 100 91 76 51 - 100 94 92 f21 99 99 100 100 100 100 99 94 78 100 - 99 97 f22 97 96 96 99 99 99 100 100 90 96 99 - 97 Average 91 94 91 97 97 98 95 88 71 92 97 96 92 Table 2.5: Recognition rate obtained by our approach with the second rank con- straint and Vetter?s database as the training set. 49 f08 f09 f11 f12 f13 f14 f15 f16 f17 f20 f21 f22 Average Gly: Front f12, Prb: Front 99 99 100 - 100 100 97 93 78 100 100 99 96 Gly: Front f12, Prb: Side 85 89 88 94 96 96 88 81 68 86 95 91 88 Gly: Side f12, Prb: Front 92 91 99 97 85 87 72 53 33 94 96 87 82 Gly: Side f12, Prb: Side 100 100 100 - 100 100 99 85 63 100 100 100 95 Gly: Side f08, Prb: Side - 100 100 100 99 97 72 59 35 100 100 90 86 Gly: Side f17, Prb: Side 26 41 37 57 76 84 100 100 - 43 65 91 66 Gly: Side f22, Prb: Side 75 97 88 99 100 100 100 100 100 91 100 - 95 Table 2.6: Recognition rate across poses and illumination. The front view is from camera 27, and the side view from camera 05. constraint and the Yale?s database as the training set. Even with m = 10, we obtain quite good results, especially when the gallery and probe sets are close in terms of their flash positions. When the flashes of the gallery and probe sets become separated, the recognition rate decreases. The worst performance is with the gallery set at f08 and the probe set at f17, two most separated flashes. In general, using images under frontal or near-frontal illuminants (e.g. f09, f12, and f21) as gallery sets produces good results. 
For comparison, we also implemented the ?Eigenface? approach (discarding the first 3 components) and the ?Fisherface? approach by training the subspace pro- jection vectors from the same training set. The recognition rates are presented in Tables 2.2 and 2.3. The ?Fisherface? approach outperforms the ?Eigenface? ap- proach, but their performances are worse than our approach. This highlights the virtue of decoupling the illumination variations. [Vetter training set] Generalization capacity with m = 10 is rather restrictive. We now increase m from 10 to 100 by using Vetter?s 3D face database [66]. As this is a 3D database, we actually have W (even p and N) available. However, we believe that using a training set of m = 100 from other sources, which to the best of our 50 knowledge is not available in the literature, can yield similar performances. Table 2.4 tabulates the recognition rates obtained by imposing the first rank constraint. Significant improvements have been achieved by increasing m. This seems to suggest that a moderate sample size of 100 is enough to span the entire face space under a fixed view. As an interesting comparison, Blanz and Vetter [66] also reported the recog- nition rates across the illumination variation (with only ?f12? being the gallery set and using the ?light? part of the PIE database) and their average is 98% for color images while ours is 96% for gray images under the first rank constraint. We believe that our performances can be boosted using the color images and finer alignment. Note that our approaches look similar to [66], but there are significant differences. In [66] depths and texture maps of explicit 3D face models are used, while our image-based approach uses the concepts of albedo and surface normal and can recover the 3D models under the first constraint. Also, [66] needs a very good initialization for the lighting source. We then experiment with the second rank constraint. Note that here we need explicit knowledge of p and N, while under the first constraint we can use a boot- strap set instead. Table 2.5 tabulates the recognition rate obtained. It seems that the use of the second rank constraint does not help much. In fact, it is slightly worse due to possible over-parameterization. In addition, it is difficult to estimate p and N using the second rank constraint. Thus, it seems beneficial to use the first rank constraint in practice. 51 Recognition across views and illumination We now present our preliminary results on recognition across poses and illumina- tion. Our approach in principle can also handle pose variation since the W matrix contains all the needed 3D information, i.e., we can recover the 3D model from it. Also as mentioned earlier, learning the W matrix can be avoided by using a boot- strap set. Here, we simply use Vetter?s database to handle pose variation. Pose is roughly estimated from the geometric calibration information provided in the PIE database. We then warp the 3D model to the desired pose. The motivation is the following: suppose the pose parameter is ?, then the image h? at pose ? can be expressed as h? = W?(f?s). (2.31) In other words, the illumination-invariant signature f for image h? is kept the same if we have the class-specific albedo and shape matrix at pose ?. The rest just follows using the first constraint approach. Table 2.6 lists the recognition results obtained. In general, using the side view still yields quite good recognition result. 
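To summarize the recognition procedure used in these experiments, the following sketch strings together the bootstrap-set signature of Section 2.2.2 and the nearest-correlation-coefficient rule of Eq. (2.30). It assumes the r = 3 bootstrap case, it reuses the separate_illumination routine sketched earlier in this chapter (so it is not fully self-contained), and the function names are illustrative rather than taken from an actual implementation.

```python
import numpy as np

def identity_signature(h, W_boot, m):
    # Recover the illumination-invariant coefficients (alpha) of image h from the
    # bootstrap matrix W_boot (d x 3m), reusing the alternating solver
    # separate_illumination(...) sketched after Section 2.1.3.
    alpha, beta, _ = separate_illumination(h, W_boot, m)
    return alpha

def correlation(f_p, f_g):
    # Nearest-correlation-coefficient score, cf. Eq. (2.30) with Psi = I
    return float(f_p @ f_g / np.sqrt((f_p @ f_p) * (f_g @ f_g)))

def recognize(probe_images, gallery_images, W_boot, m):
    gallery_sigs = [identity_signature(g, W_boot, m) for g in gallery_images]
    labels = []
    for p in probe_images:
        f_p = identity_signature(p, W_boot, m)
        labels.append(int(np.argmax([correlation(f_p, f_g) for f_g in gallery_sigs])))
    return labels                               # index of best gallery match per probe
```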
Illuminant estimation In the above process, we achieve illuminant estimation. Figure 2.3 also shows the estimated illuminant directions. It is quite accurate for estimation of directions of flashes near frontal pose. But when the flashes are significantly off-frontal, accuracy slightly goes down. 52 2.3 Appendix Appendix 2.I: Recovering multilinear coefficients from h The algorithm presented in Sec. 2.1.4 can be generalized to recover {f1,...,fn} from h if the following multilinear form is satisfied: hd?1 = Wd?producttextn i=1 mi (f1m1?1 ?...?fnmn?1), (2.32) where W ? [?j1,...,jn wj1,...,jn]. Again, we impose the addition constraints: 1Tfi = 1; i = 1,...,n?1. In the iteration for computing fi given all other fj?s (j negationslash= i) fixed, we have, h = Aifi, (2.33) where Ai ? [?miji=1 aiji] and aiji = m1summationdisplay j1=1 ... mnsummationdisplay jn=1 c1j1 ...ci?1ji?1ci+1ji+1 ...cnjnwj1,...,jn. (2.34) If 1Tfi = 1 is imposed for i = 1,...,n?1, the LS solution to fi is fi = ?? ??? ??? ??? ??? ? ? ?? ? Ai 1T ? ?? ? ?? ?? ? h 1 ? ?? ?, i = 1,...,n?1; [An]? h, i = n. (2.35) Appendix 2.II: Computing R from H This appendix concentrates on the most difficult part of recovering the albedos and surface normals from H: updating R with F, S, and ? fixed. We will take vector derivatives of E with respective to {rij; i = a,b,c; j = 1,...,m} and treat the three terms in E separately. 53 [About E0.] With fjprime ? [?mj=1 fjprimej] and sjprime ? [sjprimea,sjprimeb,sjprimec]T, ?E0 ?rij = nsummationdisplay jprime=1 dsummationdisplay x=1 ?jprime(x){?w(x)TR(fjprime ?sjprime)?hjprime(x)}?w(x)fjprimejsjprimei = nsummationdisplay jprime=1 dsummationdisplay x=1 ?jprime(x){ summationdisplay l=a,b,c msummationdisplay k=1 ?w(x)Trlkfjprimeksjprimel ?hjprime(x)}?w(x)fjprimejsjprimei = summationdisplay l=a,b,c msummationdisplay k=1 { nsummationdisplay jprime=1 dsummationdisplay x=1 ?jprime(x)fjprimeksjprimelfjprimejsjprimei?w(x)?w(x)T}rlk ? nsummationdisplay jprime=1 dsummationdisplay x=1 ?jprime(x)hjprime(x)fjprimejsjprimei?w(x) = summationdisplay l=a,b,c msummationdisplay k=1 Olkijrlk ??ij, (2.36) where {Olkij;l = a,b,c;k = 1,...,m} are properly defined 3m?3m matrices, and ?ij is a properly defined 3m?1 vector. [About E1.] Using forward differences to approximate the partial derivatives7, ??aj(x,y) ?y similarequal ?aj(x,y+1) ??aj(x,y); ??bj(x,y) ?x similarequal ?bj(x+1,y) ??bj(x,y); ??cj(x,y) ?x similarequal ?cj(x+1,y) ??cj(x,y); ??cj(x,y) ?y similarequal ?cj(x,y+1) ??cj(x,y), (2.37) we have ?j(x,y) ??bj(x+1,y)?cj(x,y) ??bj(x,y)?cj(x+1,y) + ?aj(x,y)?cj(x,y+1) ??aj(x,y+1)?cj(x,y). (2.38) Suppose we are given the product of albedo and surface normal [aj(x),bj(x),cj(x)] as in (2.21), we can derive the albedo pj(x) and surface normals ?aj(x), ?bj(x), and ?cj(x) as follows: pj(x) = radicalbigg (?wT(x)raj)2 + (?wT(x)rbj)2 + (?wT(x)rcj)2, (2.39) ?aj(x) = ?w T (x)raj pj(x) , ?bj(x) = ?w T (x)rbj pj(x) , ?cj(x) = ?wT(x)rcj pj(x) . (2.40) 7Partial derivatives of boundary pixels require different approximations. But, similar deriva- tions can be derived. 54 So, their partial derivatives with respect to raj are ??aj(x) ?raj = ?w(x) pj(x) ? ?w T (x)raj ?w(x)?wT(x)raj p3j(x) = 1??a2j(x) pj(x) ?w(x), (2.41) ??aj(x) ?rbj = ??w T (x)raj ?w(x)?wT(x)rbj p3j(x) = ??aj(x)?bj(x) pj(x) ?w(x), ??aj(x) ?rcj = ??aj(x)?cj(x) pj(x) ?w(x). 
(2.42) Similarly, we can derive their partial derivatives with respect to rbj and rcj, which are summarized as follows: ??kj(x) ?rlj = ??kj(x)?lj(x) pj(x) ?w(x), ??kj(x) ?rkj = 1??k2j(x) pj(x) ?w(x), k,l ?{a,b,c}, k negationslash= l. (2.43) Notice that ??aj(x)?r bj = ??bj(x)?r aj , ??aj(x)?r cj = ??cj(x)?r aj , and ??bj(x)?r cj = ??cj(x)?r bj , which imply saving in computations. We now compute the partial derivative of ?j(x,y) with respect to raj: ??j(x,y) ?raj = ? ?raj{ ?bj(x+1,y)?cj(x,y) ??bj(x,y)?cj(x+1,y) + ?aj(x,y)?cj(x,y+1) ??aj(x,y+1)?cj(x,y)} = { ?aj(x,y)?cj(x,y)p j(x,y)pj(x,y+1) ?w(x,y)?wT(x,y+1) ? ?aj(x,y+1)?cj(x,y+1)p j(x,y)pj(x,y+1) ?w(x,y+1)?wT(x,y)}raj + {?aj(x+1,y)?cj(x+1,y)p j(x,y)pj(x+1,y) ?w(x+1,y)?wT(x,y) ? ?aj(x,y)?cj(x,y)p j(x,y)pj(x+1,y) ?w(x,y)?wT(x+1,y)}rbj + { ?aj(x,y) ?bj(x,y) pj(x,y)pj(x+1,y) ?w(x,y)?w T (x+1,y) ? ?aj(x+1,y)?bj(x+1,y) pj(x,y)pj(x+1,y) ?w(x+1,y)?w T (x,y) + 1??a2j(x,y) pj(x,y)pj(x,y+1) ?w(x,y)?w T (x,y+1) ? 1??a2j(x,y+1) pj(x,y)pj(x+1,y) ?w(x,y+1)?w T (x,y)}rcj = Paaj(x,y)raj +Pbaj(x,y)rbj +Pcaj(x,y)rcj = summationdisplay l=a,b,c Plaj(x,y)rlj, (2.44) where Paaj(x,y), Pbaj(x,y), and Pcaj(x,y) are properly defined matrices of dimension 3m? 3m. By the same token, using properly defined Pabj(x,y), Pbbj(x,y), Pcbj(x,y), Pacj(x,y), Pbcj(x,y), and Pccj(x,y), we can calculate ??j(x,y) ?rij = summationdisplay l=a,b,c Plij(x,y)rlj; i = a,b,c, (2.45) 55 and, finally, ?E1 ?rij = dsummationdisplay x=1 ?j(x) summationdisplay l=a,b,c Plij(x)rlj = summationdisplay l=a,b,c Plijrlj; Plij ? dsummationdisplay x=1 ?j(x)Plij(x). (2.46) [About E2.] The symmetry constraint term ?j(x) defined as in (2.23) can be expressed as ?2j(x) = rTajQa(x)raj +rTbjQb(x)rbj +rTcjQc(x)rcj, (2.47) where Qa(x), Qb(x), and Qc(x) are symmetric matrices with size 3m?3m: Qa(x) = (?w(x) + ?w(?x))(?w(x) + ?w(x))T,Qb(x) = (?w(x) ? ?w(?x))(?w(x) ? ?w(x))T,Qc(x) = Qb(x). (2.48) The derivatives of ?2j(x)/2 and E2 with respective to raj, rbj, and rcj are ?{?2j(x)/2} ?rij = Q i (x)rij; ?E2 ?rij = dsummationdisplay x=1 Qi(x)rij = Qirij; Qi = dsummationdisplay x=1 Qi(x). (2.49) Combining the above derivations and using ?E?rij = 0, we have summationdisplay l=a,b,c msummationdisplay k=1 Olkijrlk + ?1 summationdisplay l=a,b,c Plijrlj + ?2Qirij = ?ij;i = a,b,c; j = 1,...,m. (2.50) We therefore arrive at a set of equations linear in {rij; i = a,b,c; j = 1,...,m} that can be solved easily. After finding the new R, we normalize it using R=R/||R||2. 56 Chapter 3 Illuminating Light Field State-of-the-art algorithms are not able to produce satisfactory recognition per- formance when confronted by pose and illumination variations. In general, pose variation is slightly more difficult to handle than illumination variation. The pres- ence of both variations further challenges the recognition algorithms. This chapter extends the generalized photometric stereo algorithm presented in Chapter 2 to handle pose variation. The way we handle pose variation is through the ?Eigen? light approach [69]. This unified approach is image-based, in the sense that, in the training set, only 2D images are used and no explicit 3D models are needed. The unification is achieved by exploiting the fact that both approaches use a subspace model for identity. The ?Eigen? light field approach combines subspace modeling with light field and offers a pose-invariant encoding of identity. The generalized photometric stereo algorithm combines the identity subspace with the illumination model and provides an illumination-invariant description. 
However, the ?Eigen? light field approach assumes a fixed illumination and cannot handle illumination variations, i.e., its pose-invariant identity encoding is not invariant to variations in illumination. The generalized photometric stereo algorithm assumes a 57 fixed pose and cannot easily handle pose variations, i.e., its illumination-invariant identity description is not invariant to variations in pose. This motivates our integrated approach for handling both pose and illumination variations using an illumination- and pose-invariant identity signature. Chapter organization Section 3.1 presents the principle of the illuminating light field approach. It starts by reviewing in Section 3.1.1 the related literature, then describes Section 3.1.2 the ?Eigen? light field approach [69] that performs FR under pose variations, and finally introduces in Section 3.1.3 our integrated approach. Section 3.1.4 presents algorithms for recovering the identity signature that is invariant to illumination and pose. Section 3.2 gives our experimental results on the PIE database [75] and comparisons with other approaches. 3.1 Principle of Illuminating Light Field 3.1.1 Literature review Identity, illumination, and pose Three factors are involved in face recognition, namely illumination, pose, and iden- tity. Using the human face images as examples, we now address issues involved in each of the three factors by fixing the other two. ? Illumination. Various illumination models are available in the literature, ranging from models for highly specular objects such as mirrors to models for matte objects. Mostly objects belong to the latter category, which is described by a Lambertian reflectance model for its simplicity. Early shape 58 from shading approaches [10] assumed a constant albedo field. However, this assumption is violated at locations such as eyes and mouth edges. For the human face, the Lambertian reflectance model with a varying albedo field provides a reasonable approximation [68, 74, 95, 103, 204]. The Phong illumination model also has found application [66]. This proposed method adopts the Lambertian reflectance model with a varying albedo field to model the effect of illumination. ? Pose. The issue of pose essentially amounts to a correspondence problem. If dense correspondences across poses are available and if a Lambertian re- flectance model is further assumed, a rank-1 constraint is implied because theoretically, a 3D model can be recovered and used to render novel poses. However, recovering a 3D model from 2D images is a difficult task. There are two types of approaches: model-based and image-based. Model-based approaches [66, 139, 145, 146] require explicit knowledge of prior 3D mod- els, while image-based approaches [125, 129, 142, 143, 144] do not use prior 3D models. In general, model-based approaches [66, 139, 145, 146] register the 2D face image to 3D models that are given beforehand. In [139, 146], a generative face model is deformed through bundle adjustment to fit 2D images. In [145], a generative face model is used to regularize the 3D model recovered using the SfM algorithm. In [66], 3D morphable models are con- structed based on many prior 3D models. There are mainly three types of image-based approaches: Structure from motion (SfM) [125, 129], visual hull [142, 144], and light field rendering [143, 140] methods. The SfM approach [125] works with sparse correspondence and does not reliably recover the 3D model amenable for practical use. 
The visual hull methods [142, 144] assume 59 that the shape of the object is convex, which is not always satisfied by the human face, and also require accurate calibration information. The light field rendering methods [143, 140] relax the requirement of calibration by a fine quantization of the pose space and recover a novel view by sampling the captured data that form the so-called light field. The proposed method is image-based, so no prior 3D models are used. It handles a given set of views through an analysis analogous to the light field concept. However, no novel poses are rendered. ? Identity. One straightforward method to describe the identity is through discrete labels. However, using this discrete description it is impossible to establish a link between objects used in the training and testing stages in terms of the identity. An alternative way is to associate a discrete label with a continuous-valued variable, which is regarded as an identity signature. One goodexample is to use subspace encoding [47, 62], where linear generalization is assumed to incorporate the fact that all human faces are similar. Once the subspace basis are learned from the training set, they are used to characterize the gallery/probe set, thus enabling the required generalization capability. In this chapter, we also use the subspace method to describe the identity. Face recognition under illumination variation FR under illumination variation must take into account the two factors of identity and illumination. Refer to Section 2.2.1 in Chapter 2 for a review of related work. 60 Face recognition under pose variation As mentioned earlier, pose variation essentially amounts to a correspondence prob- lem. If dense correspondences across poses are available and a Lambertian re- flectance is assumed, then a rank-1 constraint is implied. Unfortunately, finding correspondences is a very difficult task and, therefore there exist no subspace based on an appearance representation when confronted with pose variation. Approaches to face recognition under pose variation [68, 69, 72] avoid the correspondence prob- lem by sampling the continuous pose space into a set of poses, v.i.z. storing mul- tiple images at different poses for each person at least in the training set. In [72], view-based ?Eigenfaces? are learned from the training set and used for recognition. In [68], a denser sampling is used to cover the pose space. However, as [68] uses object-specific images, appearances belonging to a novel object (i.e. not in the training set) cannot be handled. In [69], the concept of light field [143] is used to characterize the continuous pose space. ?Eigen? light fields are learnt from the training set. However, the implementation of [69] still discretizes the pose space and recognition can be based on probe images at poses in the discretized set. One should note that the light field is not related to variation in illumination. Face recognition under illumination and pose variations Approaches to handling both illumination and pose variations include [66, 70, 77, 78, 202]. The approach [66] uses morphable 3D models to characterize the human faces. Both geometry and texture are linearly spanned by those of the training ensemble consisting of 3D prior models. It is able to handle both illumination and pose variations. Its only weakness is a complicated fitting algorithm. Recently, a fitting algorithm more efficient than suggested in [66] is proposed in [73]. 
In [70], the Fisher light field is proposed to handle both illumination and pose variations, where the light field covers the pose variation and Fisher discriminant analysis covers the illumination variation. Since discriminant analysis is a purely statistical tool that minimizes the within-class scatter while maximizing the between-class scatter, and bears no relationship to any physical illumination model, it is questionable whether discriminant analysis can generalize to new lighting conditions. This generalization may in fact be inferior because discriminant analysis tends to over-tune to the lighting conditions present in the training set. The "Tensorface" approach [77, 78] uses a multilinear analysis to handle various factors such as identity, illumination, pose, and expression. The factors of identity and illumination are suitable for linear analysis, as evidenced by the "Eigenface" approach (assuming a fixed illumination and a fixed pose) and the subspace induced by the Lambertian model, respectively. However, the factor of expression is only arguably amenable to linear analysis, and the factor of pose is not. In [202], preliminary results are reported by first warping the albedo and surface normal fields to the desired pose and then carrying out recognition as usual.

3.1.2 Pose-invariant identity signature

The light field measures the radiance in free space (free of occluders) as a 4D function of position and direction. An image is a 2D slice of the 4D light field. If the space is only 2D, the light field is a 2D function. This is illustrated in Figure 3.1 (also see [69] for another illustration), where a camera conceptually moves along a circle, within which a square object with four differently colored sides resides. The 2D light field $L$ is a function of $\theta$ and $\phi$ as defined in Figure 3.1. The image of the 2D object is just a vertical line. If the camera is allowed to leave the circle, then a curve is traced out in the light field to form the image, i.e., the light field is sampled accordingly. Even though the light field of a 3D object is a 4D function, we still use the notation $L(\theta,\phi)$ for simplicity.

Figure 3.1: This figure illustrates the 2D light field of a 2D object (a square with four differently colored sides), which is placed within a circle. The angles $\theta$ and $\phi$ relate the viewpoint to the radiance from the object. The right image shows the actual light field for the square object.

Starting from the light fields $\{L_n(\theta,\phi);\ n = 1,\ldots,N\}$ of the training samples, the "Eigen" light field approach conducts a PCA to find the eigenvectors $\{e_i(\theta,\phi);\ i = 1,\ldots,m\}$ that span a rank-$m$ subspace. The "Eigen" light field [69] is again motivated by the similarity among human faces. Using the fact [47, 62] that if $Y^TY$ has an eigenpair $(\lambda, v)$, then $YY^T$ has a corresponding eigenpair $(\lambda, Yv)$, we know that $e_i(\theta,\phi)$ is just a linear combination of the $L_n(\theta,\phi)$'s, i.e., there exist $a_{in}$'s such that
$$e_i(\theta,\phi) = \sum_n a_{in} L_n(\theta,\phi). \quad (3.1)$$
For an arbitrary subject, its light field $L(\theta,\phi)$ lies in this rank-$m$ subspace. In other words, there exist coefficients $f_i$'s such that, $\forall (\theta,\phi)$,
$$L(\theta,\phi) = \sum_{i=1}^{m} f_i e_i(\theta,\phi) = e(\theta,\phi)^T f, \quad (3.2)$$
where $e(\theta,\phi) \equiv [\oplus_{i=1}^{m} e_i(\theta,\phi)]_{m\times 1}$ and $f \equiv [\oplus_{i=1}^{m} f_i]_{m\times 1}$. As mentioned earlier, to obtain an image $h^v$ at a particular pose $v$ (a collection of $d$ pixels), one should sample the light field.
Suppose that one pixel $h^v$ is the point sample of the light field associated with the coordinate $(\theta^v,\phi^v)$, i.e.,
$$h^v = L(\theta^v,\phi^v). \quad (3.3)$$
The image $h^v$ can be expressed as
$$h^v \equiv [\oplus_{i=1}^{d} h^v_i] = [\oplus_{i=1}^{d} L(\theta^v_i,\phi^v_i)], \quad (3.4)$$
where $(\theta^v_i,\phi^v_i)$ is the corresponding coordinate in the light field for the pixel $h^v_i$. Substituting (3.2) into (3.4) yields
$$h^v = [\oplus_{i=1}^{d} e(\theta^v_i,\phi^v_i)^T] f = E^v f, \quad (3.5)$$
where $E^v \equiv [\oplus_{i=1}^{d} e(\theta^v_i,\phi^v_i)^T]_{d\times m}$. Eq. (3.5) has an important implication: $f$ is a pose-invariant identity signature because the pose information is encoded in $E^v$. This is summarized in Proposition 3.1.

Proposition 3.1: The identity signature $f$ as derived in (3.5) is pose-invariant.

Constructing a light field is a practically difficult task. However, if only some specific poses are of interest, with each pose sampling a subset of the light field, we can focus on just the portion of the light field equivalent to the union of these subsets. Suppose that the $K$ poses of interest are $\{v_1,\ldots,v_K\}$ and the corresponding images at these poses are $\{h^{v_1},\ldots,h^{v_K}\}$, with $h^{v_k}$ expressed as in (3.4); the portion of the light field of interest is nothing but $[\oplus_{k=1}^{K} [\oplus_{i=1}^{d} L(\theta^{v_k}_i,\phi^{v_k}_i)]]$, a "long" $Kd \times 1$ vector obtained by stacking all the images at all these poses. The introduction of such a "long" vector eases our computation: (i) if we are interested in a particular view $v$, we simply take out the rows corresponding to that view; (ii) in this context, computing the "Eigen" light field is equivalent to performing PCA on the ensemble consisting of a collection of such "long" vectors.

The concept of light field was introduced in the computer graphics literature [143]. A strict assumption is that the scene be static. While characterizing the appearances of one object at given views using the light field concept is legitimate, generalizing this to many objects is questionable since the light fields belonging to different objects are not in correspondence, i.e., they are not shape-free in the terminology of [49, 76]. The mismatch in correspondence arises from differences in head sizes, locations in the world coordinate system of different objects, and so on. Typically, correspondences between different objects are established by performing face normalization or registration. Unfortunately, the normalization step violates the static-scene requirement of light field theory. On the other hand, as argued in [49, 76], since the shape-free appearance is amenable to linear analysis, we can pursue PCA on the shape-free vector $L$, similar to the "Eigen" light field approach [69]. This point is illustrated in [71]. Following [71], we also use the term light field in a loose sense.

3.1.3 Illumination- and pose-invariant identity signature

As mentioned earlier and in [143], the underlying assumption behind the light field concept is one of fixed illumination. We now consider the light fields formed under varying illumination, i.e., illuminating the light field. Clearly, the light field under a fixed illumination $s$, $L^s(\theta,\phi)$, follows the Lambertian reflectance model:
$$L^s(\theta,\phi) = t(\theta,\phi)^T s, \quad (3.6)$$
where $t(\theta,\phi)$ is the product of the albedo and the surface normal at the proper pixel and does not depend on $s$. Combining (3.1) and (3.6) yields the "Eigen" light field $e^s_i(\theta,\phi)$ under the illumination $s$ as
$$e^s_i(\theta,\phi) = \sum_n a_{in} t_n(\theta,\phi)^T s = t_{e_i}(\theta,\phi)^T s, \quad (3.7)$$
where $t_{e_i}(\theta,\phi) \equiv \sum_n a_{in} t_n(\theta,\phi)$. Eq. (3.2) then becomes
$$L^s(\theta,\phi) = [\oplus_{i=1}^{m} t_{e_i}(\theta,\phi)^T s]^T f = W(\theta,\phi)(f \otimes s), \quad (3.8)$$
where
$W(\theta,\phi) \equiv [\oplus_{i=1}^{m} t_{e_i}(\theta,\phi)^T]_{1\times 3m}$ does not depend on $s$. This leads to a two-factor analysis [138, 187]. A pixel $h^v_s$ under a pose $v$ and an illumination $s$ is a point sample of the light field $L^s(\theta,\phi)$ at coordinate $(\theta^v,\phi^v)$, i.e.,
$$h^v_s = L^s(\theta^v,\phi^v) = W(\theta^v,\phi^v)(f \otimes s), \quad (3.9)$$
and an image $h^v_s$ under the pose $v$ and illumination $s$, which traces a set of $d$ samples of the light field under illumination $s$, is
$$h^v_s = [\oplus_{i=1}^{d} h^v_{s,i}] = [\oplus_{i=1}^{d} W(\theta^v_i,\phi^v_i)](f \otimes s) = W^v(\theta,\phi)(f \otimes s), \quad (3.10)$$
where $W^v(\theta,\phi) \equiv [\oplus_{i=1}^{d} W(\theta^v_i,\phi^v_i)]_{d\times 3m}$. Eq. (3.10) has an important implication: the coefficient vector $f$ provides an identity signature invariant to both pose and illumination because the pose is absorbed in $W^v(\theta,\phi)$ and the illumination is absorbed in $s$.

Proposition 3.2: The identity signature $f$ as derived in (3.10) is illumination- and pose-invariant.

The remaining questions are how to learn the basis matrix $W(\theta,\phi)$ from a given training ensemble and how to compute the blending coefficient vector $f$, as well as $s$, for an arbitrary image $h^v_s$. The next section presents the algorithms in detail.

3.1.4 Learning algorithms

Learning the basis matrix $W(\theta,\phi)$

Suppose that the training ensemble is given as $\{L^s_n(\theta,\phi);\ n = 1,\ldots,N,\ s = 1,\ldots,S\}$, where $L^s_n(\theta,\phi)$ is the light field of the $n$th training object under illumination $s$ (a $Kd \times 1$ vector as explained in Section 3.1.2). Learning $W(\theta,\phi)$ (a $Kd \times mr$ matrix, where $m$ is the rank for the identity and $r$ is the rank for the illumination) from the training ensemble is detailed in [138] and further extended in [187] by imposing the integrability constraint. The main difference between [138] and [187] is the following: in [138], the recovered $W(\theta,\phi)$ minimizes the approximation error in the mean-square sense and does not necessarily satisfy the integrability constraint; in other words, the hypothetical base objects in $W(\theta,\phi)$ are not integrable. In [187], the recovered $W(\theta,\phi)$ minimizes the above approximation error as well as a cost function invoked by violating the integrability constraint. As a consequence, [138] can only process an image ensemble consisting of different objects under the same set of illuminations (e.g., the case considered here), while [187] can process an image ensemble consisting of different objects under completely different illuminations. Here, we follow the approach in [138] to derive $W(\theta,\phi)$ for simplicity. The basic underlying principle is a two-fold SVD algorithm, reviewed below.

The following two matrices (A-type and B-type) are first constructed by grouping the "long" vectors $\{L^s_n(\theta,\phi);\ n = 1,\ldots,N,\ s = 1,\ldots,S\}$ in two ways:
$$A = [\oplus_{n=1}^{N} [\oplus_{s=1}^{S} L^s_n(\theta,\phi)]], \qquad B = [\oplus_{s=1}^{S} [\oplus_{n=1}^{N} L^s_n(\theta,\phi)]], \quad (3.11)$$
where $A$ is a $KNd \times S$ matrix whose rows stack together the light fields of different identities under the same illumination and whose columns correspond to different illuminations, and $B$ is a $KSd \times N$ matrix whose rows stack together the light fields under different illuminations for the same identity and whose columns correspond to different identities. It is obvious that we can convert from an A-type matrix to a B-type matrix and vice versa. We perform the SVD of the $A$ matrix as $A = U_A D_A V_A^T$ and keep the top $r$ rows of the column basis $V_A^T$ for the illumination, denoted by $S$. We do the same for the $B$ matrix and keep the top $m$ rows of the column basis $V_B^T$ for the identities, denoted by $F$.
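As a concrete illustration, the following is a minimal NumPy sketch of the two-fold SVD just described; the thesis reports a Matlab implementation, so this Python version, the (N, S, D) input layout, and all variable names are illustrative assumptions rather than the author's code. It already uses the small Gram matrices $A^TA$ and $B^TB$ and the regrouping of $W'$ into $W$ that the next paragraphs explain.

import numpy as np

def learn_basis(Lf, m, r):
    # Lf: light-field vectors of N training identities under S illuminations,
    #     arranged as an array of shape (N, S, D) with D = K*d (layout assumed here).
    # m : identity rank, r : illumination rank.
    N, S, D = Lf.shape

    # A-type matrix (KNd x S): column s stacks all identities under illumination s.
    A = Lf.transpose(0, 2, 1).reshape(N * D, S)
    # B-type matrix (KSd x N): column n stacks all illuminations of identity n.
    B = Lf.transpose(1, 2, 0).reshape(S * D, N)

    # Right singular vectors via the small Gram matrices A^T A (S x S) and B^T B (N x N).
    wA, VA = np.linalg.eigh(A.T @ A)
    S_basis = VA[:, np.argsort(wA)[::-1][:r]].T      # top-r rows of V_A^T
    wB, VB = np.linalg.eigh(B.T @ B)
    F_basis = VB[:, np.argsort(wB)[::-1][:m]].T      # top-m rows of V_B^T

    # A' = A S^T, regroup A' (still A-type) into the B-type matrix B', then W' = B' F^T.
    Ap = A @ S_basis.T                               # KNd x r
    Bp = Ap.reshape(N, D, r).transpose(2, 1, 0).reshape(r * D, N)
    Wp = Bp @ F_basis.T                              # Krd x m

    # Group W' into the Kd x (m*r) basis W; column (i*r + j) holds W_ij (0-based indexing).
    W = Wp.reshape(r, D, m).transpose(1, 2, 0).reshape(D, m * r)
    return W, F_basis, S_basis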
Direct SVD of the $A$ and $B$ matrices is numerically inefficient or even prohibitive since they are extremely "tall". Also, it is unnecessary to compute $U$ and $D$, as we are interested only in the $V$ part of the SVD result. For computational savings, we observe that $V_A$ encodes the eigenvectors of $A^TA = V_A D_A^2 V_A^T$. Since the size of $A^TA$ is only $S \times S$, computing its eigenvalues is numerically stable. Therefore, we simply first compute $A^TA$ and then perform its eigendecomposition to find $V_A$. Similarly, we can compute $V_B$. We now have the matrices $S$ and $F$ at our disposal. To find $W(\theta,\phi)$, we first compute $A' = AS^T$, where $A'$ is a $KNd \times r$ matrix. Notice that $A'$ is still an A-type matrix, so we can convert $A'$ to a B-type matrix $B'$ following the strategy described in (3.11), where $B'$ is a $Krd \times N$ matrix. Then we compute $W' = B'F^T$, where $W'$ is a $Krd \times m$ matrix. The rest is to group $W'$ to form a $Kd \times mr$ matrix $W$.

Recovering the blending coefficient vector $f$ from an image

Given $W(\theta,\phi) = [\oplus_{i=1}^{m} [\oplus_{j=1}^{r} W_{ij}(\theta,\phi)]]_{Kd \times mr}$, where $W_{ij}(\theta,\phi)$ denotes the $((i-1)r+j)$th column of the $W(\theta,\phi)$ matrix, computing $f$ and $s$ for an arbitrary image $h^v_s$ utilizes (3.10) iteratively [187]. Notice that we need only the portion of $W(\theta,\phi)$ corresponding to the pose $v$, denoted by $W^v(\theta,\phi) = [\oplus_{i=1}^{m} [\oplus_{j=1}^{r} W^v_{ij}(\theta,\phi)]]_{d \times mr}$. If $f$ is fixed, (3.10) is linear in $s$ and its least-squares (LS) solution is
$$s = [\oplus_{j=1}^{r} ([\oplus_{i=1}^{m} W^v_{ij}(\theta,\phi)] f)]^{\dagger} h^v_s, \quad (3.12)$$
where $[\,\cdot\,]^{\dagger}$ denotes the matrix pseudo-inverse; if $s$ is fixed, (3.10) is linear in $f$ and its LS solution is
$$f = \begin{bmatrix} [\oplus_{i=1}^{m} ([\oplus_{j=1}^{r} W^v_{ij}(\theta,\phi)] s)] \\ \mathbf{1}^T \end{bmatrix}^{\dagger} \begin{bmatrix} h^v_s \\ 1 \end{bmatrix}, \quad (3.13)$$
where $\mathbf{1}$ is a vector of 1's. To obtain (3.13), we also impose $f^T\mathbf{1} = 1$ to normalize the solution to the same range, which facilitates the recognition task. We iterate this process until convergence. Meanwhile, we can also take into account the pixels in shadows as in [187].

Recovering the blending coefficient vector $f$ from a group of images

This iterative algorithm can easily be modified to handle a group of $Q$ images $\{h^{v_1}_{s_1},\ldots,h^{v_Q}_{s_Q}\}$ having the same $f$ but different $s$'s, since multiple equations like (3.10) can be formulated. To be specific, we have the following iterative equations:
$$s_q = [\oplus_{j=1}^{r} ([\oplus_{i=1}^{m} W^{v_q}_{ij}(\theta,\phi)] f)]^{\dagger} h^{v_q}_{s_q}; \quad q = 1,2,\ldots,Q, \quad (3.14)$$
$$f = \begin{bmatrix} [\oplus_{q=1}^{Q} [\oplus_{i=1}^{m} ([\oplus_{j=1}^{r} W^{v_q}_{ij}(\theta,\phi)] s_q)]] \\ \mathbf{1}^T \end{bmatrix}^{\dagger} \begin{bmatrix} [\oplus_{q=1}^{Q} h^{v_q}_{s_q}] \\ 1 \end{bmatrix}. \quad (3.15)$$
In practice, using a group of images yields a robust estimate of $f$. The presence of shadow pixels affects the learning algorithm; handling shadows can be performed in the same fashion as in Chapter 2.

3.2 Face Recognition across Illumination and Poses

3.2.1 PIE database and recognition setting

We use the "illum" subset of the PIE database [75] in our experiments. This subset has 68 subjects under 21 illuminations and 13 poses. Out of the 21 illumination configurations, we select 12, denoted by F = {f16, f15, f13, f21, f12, f11, f08, f06, f10, f18, f04, f02} as in [70], which typically span the set of variations. Out of the 13 poses, we select 9, denoted by C = {c22, c02, c37, c05, c27, c29, c11, c14, c34}, which cover from the left profile to the right profile. In total, we have 68*12*9 = 7344 images. Figure 3.2 displays one PIE object under illumination and pose variations. Registration is performed by aligning the eyes and mouth to desired positions. No flow computation is carried out for further alignment. After the pre-processing step, each face image is of size 48 by 40, i.e., d = 1920. Also, we use only gray-scale images, obtained by averaging the red, green, and blue channels of the color versions.
We believe that our recognition rates could be boosted by using color images and finer registration. Figure 3.2 shows some examples of the face images actually used in recognition.

We randomly divide the 68 subjects into two parts. The first 34 subjects are used in the training set and the remaining 34 subjects are used in the gallery and probe sets. It is guaranteed that there is no identity overlap between the training set and the gallery and probe sets. To form the light field, we use images at all available poses. Since the illumination model has generalization capability, we can select as few as 3 illuminations for the training set. In our experiments, the training set includes only 9 selected illuminations to cover the second-order harmonic components [95]. Notice that this is not possible in the Fisher light field approach [70], which exhausts all illumination configurations.

Figure 3.2: Examples of the face images of one PIE object (used in the testing stage) under the selected illuminations and poses.

The images belonging to the remaining 34 subjects are used in the gallery and probe sets. The construction of the gallery and probe sets conforms to the following two scenarios:

(A) We use all 34 images under one illumination $s_g$ and one pose $v_g$ to form a gallery set, and those under another illumination $s_p$ and another pose $v_p$ to form a probe set. There are three cases of interest: same pose but different illumination, different pose but same illumination, and different pose and different illumination. We mainly concentrate on the third case, with $s_p \neq s_g$ and $v_p \neq v_g$. Also, our approach reduces to the "Eigen" light field approach [69] if $s_p = s_g$ and to the generalized photometric stereo approach [187] if $v_p = v_g$. Thus, we have $(9 \times 12)^2 - (9 \times 12) = 11{,}556$ tests, with each test giving rise to a recognition score.

(B) We divide C into three sets: C1 = {c22, c02, c37} (left-profile views), C2 = {c05, c27, c29} (frontal views), and C3 = {c11, c14, c34} (right-profile views), and F into three sets: F1 = {f16, f15, f13, f21} (left lights), F2 = {f12, f11, f08, f06} (frontal lights), and F3 = {f10, f18, f04, f02} (right lights). For each of the thirty-four subjects, the gallery set contains all twelve images under the illuminations in Fg and the poses in Cg, and the probe set contains all twelve images under the illuminations in Fp and the poses in Cp. We make sure that (Cp, Fp) $\neq$ (Cg, Fg). Thus, we have $(3 \times 3)^2 - (3 \times 3) = 72$ tests in this scenario, which has no counterpart in the Fisher light field [70].

To make the recognition more difficult, we assume that the lighting conditions for the training, gallery, and probe sets are completely unknown when recovering the identity signatures. The testing strategy is similar to that described in Chapter 2 and is summarized below; a sketch of steps 2 and 3 follows the list.

1. Learn W from the training set using the bilinear learning algorithm [138, 204]. Figure 3.3 shows the W matrix obtained using the training set.

2. With W given, learn the identity signatures f (as well as the s's) for all gallery and probe elements (an element is an image in Scenario A and a group of images in Scenario B) using the iterative algorithms in Section 3.1.4. Learning f and s from a single image takes about 1-2 seconds in a Matlab implementation. Figure 3.4 shows the reconstructed images using the learned f and s.

3. Perform recognition using the nearest correlation coefficient.
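As a rough illustration of steps 2 and 3, the sketch below alternates the least-squares updates of Eqs. (3.12)-(3.13) and then matches signatures by the correlation coefficient. This is a minimal NumPy sketch under stated assumptions (a fixed iteration count instead of a convergence test, a uniform initialization of f, and the column grouping in which W_ij is column (i-1)r+j); the thesis used a Matlab implementation, so names and layout here are illustrative.

import numpy as np

def recover_signature(h, Wv, m, r, n_iter=20):
    # h  : one image at pose v, flattened to a d-vector.
    # Wv : the d x (m*r) rows of W for pose v; column (i*r + j) holds W_ij (0-based).
    d = h.size
    W3 = Wv.reshape(d, m, r)                  # W3[:, i, j] = W_ij at pose v
    f = np.full(m, 1.0 / m)                   # uniform start, consistent with f^T 1 = 1
    s = None
    for _ in range(n_iter):                   # fixed iteration count for simplicity
        # Eq. (3.12): with f fixed, the d x r design matrix has columns sum_i f_i W_ij.
        X = np.einsum('dmr,m->dr', W3, f)
        s = np.linalg.lstsq(X, h, rcond=None)[0]
        # Eq. (3.13): with s fixed, columns sum_j s_j W_ij, plus a row of ones for f^T 1 = 1.
        Y = np.einsum('dmr,r->dm', W3, s)
        f = np.linalg.lstsq(np.vstack([Y, np.ones((1, m))]),
                            np.append(h, 1.0), rcond=None)[0]
    return f, s

def nearest_correlation(f_probe, F_gallery):
    # Step 3: return the gallery index with the largest correlation coefficient.
    fp = (f_probe - f_probe.mean()) / f_probe.std()
    Fg = (F_gallery - F_gallery.mean(1, keepdims=True)) / F_gallery.std(1, keepdims=True)
    return int(np.argmax(Fg @ fp))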
Table 3.1: Recognition rates (%) for all the probe sets with a fixed gallery set (c27, f11). Rows index the probe camera (pose) and columns the probe flash (illumination).

        f16  f15  f13  f21  f12  f11  f08  f06  f10  f18  f04  f02  Average
c22      56   41   62   68   71   71   53   65   41   44   38   21   52
c02      71   76   76   91   88   94   94   94   85   71   50   32   77
c37      79   82   82   94   94   97   94   94   76   65   65   50   81
c05      68   85   97  100  100   97   97   97   91   82   71   44   86
c27      94  100  100  100  100    -  100  100  100   97   94   76   97
c29      74   82   91  100  100  100   97   97   94   91   88   65   90
c11      50   53   68   79   85   97   97   88   79   82   71   62   76
c14      15   24   44   71   76   82   74   82   82   74   79   56   63
c34      18   18   47   50   56   65   62   56   44   44   41   38   45
Average  58   62   74   84   86   88   85   86   77   72   66   49   74

3.2.2 Recognition performance

Scenario A

Table 3.1 shows the recognition results for all probe sets with a fixed gallery set (c27, f11), whose gallery images are in a frontal pose and under a frontal illumination. Using this table we compare the three cases. The case of same pose but different illumination has an average rate of 97% (i.e., the average of all 11 cells in the row c27), the case of different pose but same illumination has an average rate of 88% (i.e., the average of all 8 cells in the column f11), and the case of different pose and different illumination has an average rate of 70% (i.e., the average of all 88 cells excluding the row c27 and the column f11). This shows that illumination variation is easier to handle than pose variation, and that variations in both pose and illumination are the most difficult to deal with.

Table 3.2: Average recognition rates (%) for all the gallery sets. Rows index the gallery camera $v_g$ and columns the gallery flash $s_g$. For each cell, say the gallery set at $(v_g = c27, s_g = f12)$, the average rate is taken over all probe sets $(v_p, s_p)$ with $v_p \neq v_g$ and $s_p \neq s_g$. For example, the average rate for (c27, f11) is the average of the rates in Table 3.1 excluding the row c27 and the column f11.

        f16  f15  f13  f21  f12  f11  f08  f06  f10  f18  f04  f02  Average
c22      44   44   46   45   46   49   46   49   44   32   30   14   41
c02      55   58   59   62   63   62   60   60   54   48   40   22   54
c37      56   59   61   64   65   62   60   58   51   47   45   34   55
c05      56   63   66   67   68   65   59   58   54   51   45   36   57
c27      62   66   69   70   70   70   65   69   68   67   65   54   66
c29      46   53   53   61   60   63   59   62   66   68   62   60   60
c11      41   43   50   53   55   61   57   58   56   61   58   51   54
c14      19   24   39   49   53   58   58   61   60   61   57   48   49
c34      16   21   38   44   46   51   48   51   46   45   45   42   41
Average  44   48   53   57   59   60   57   59   56   53   50   40   53

We now focus on the case of different pose and different illumination. For each gallery set, we average the recognition scores of all the probe sets whose pose and illumination both differ from those of the gallery set. Table 3.2 shows the average recognition rates for all the gallery sets. As an interesting comparison, the "grand" average is 53% (the last cell in Table 3.2), while that of the Fisher light field approach [70] is 36%. In general, when the poses and illuminations of the gallery and probe sets become far apart, the recognition rates decrease. The best gallery sets for recognition are those in frontal poses and under frontal illumination, and the worst gallery sets are those in profile views and under off-frontal illumination. As shown in Figures 1.5 and 3.2, the worst gallery sets consist of face images that are almost invisible (see, for example, the images (c22, f02), (c34, f16), etc.), on which recognition can hardly be performed. Figure 3.5 presents the curves of the average recognition rates (i.e., the last columns and last rows of Tables 3.1 and 3.2) across poses and illuminations.
Clearly, the effect of illumination variation is not as strong as that of pose variation, in the sense that the curves of average recognition rates across illuminations are flatter than those across poses. Figure 3.5 also shows the curves of the average recognition rates obtained based on the top 3 and top 5 matches. Using more matches increases the recognition rates significantly, which demonstrates the effectiveness of our recognition scheme. For comparison, Figure 3.5 also plots the average rates obtained using the baseline PCA. These rates are well below ours; the "grand" average is below 10% if the top 1 match is used.

Figure 3.3: The first nine columns of the learned W matrix.

Figure 3.4: The reconstruction results of the object in Figure 3.2. Notice that only the f's and s's for the row c27 are used for reconstructing all the images.

Figure 3.5: The average recognition rates across illumination (the top row) and across poses (the bottom row) for three cases. Case (a) shows the average recognition rate (averaging over all illuminations/poses and all gallery sets) obtained by the proposed algorithm using the top n matches. Case (b) shows the average recognition rate (averaging over all illuminations/poses for the gallery set (c27, f11) only) obtained by the proposed algorithm using the top n matches. Case (c) shows the average recognition rate (averaging over all illuminations/poses and all gallery sets) obtained by the "Eigenface" algorithm using the top n matches.

Scenario B

This test scenario is designed for face recognition based on a group of images, which can be under different poses and different illuminations. Table 3.3 lists the recognition rates, which are much higher than those in Tables 3.1 and 3.2. Also, similar observations can be made regarding the effects of illumination and pose variations.

Table 3.3: The recognition rates (%) for test scenario B. Rows index the probe set and columns the gallery set.

Probe   C1F1  C1F2  C1F3  C2F1  C2F2  C2F3  C3F1  C3F2  C3F3  Average
C1F1       -   100    85   100    94    82    62    85    94   88
C1F2     100     -   100   100   100    85    71    82    94   92
C1F3      85    97     -    88    88    91    76    62    65   82
C2F1      97    94    71     -   100    85    71    85    76   85
C2F2      97   100    85   100     -   100    76    91    85   92
C2F3      79    82    76    97   100     -    74    88    91   86
C3F1      59    59    68    85    76    71     -   100    82   75
C3F2      74    85    62    91    94    82   100     -   100   86
C3F3      88    82    62    79    79    94    85   100     -   84
Average   85    88    76    93    92    86    77    87    86   85

3.2.3 Comparisons

Comparison with the Fisher light field

It is interesting to compare the proposed approach with the Fisher light field [70], since both handle pose variation in a similar fashion. The main difference lies in handling the illumination variation: our approach uses the Lambertian model, while [70] uses Fisher discriminant analysis. Therefore, our approach can
generalize to novel illuminations, while [70] does not have such a generalization capability. Also, as shown in Section 3.2, the proposed approach leads to a new recognition scenario that is not available in [70].

Comparison with the 3D morphable model

The 3D morphable model (3DMM) [66] is the state-of-the-art approach for identifying faces across illuminations and poses. The proposed approach differs from the 3DMM approach mainly as follows:

• Model-based vs. image-based. The 3DMM approach requires prior 3D models, while the proposed approach is image-based and needs only 2D images. Linear assumptions are used in both approaches. The operating units in the 3DMM approach are 3D depth and texture, respectively, and two independent linear models are assumed for the two units. The operating unit in the proposed approach is the product of the albedo and the surface normal, and a single linear model is assumed. As in the 3DMM approach, it seems that the dimensionality of the proposed model could be "decomposed" as the product (or the sum) of the dimensionality of the surface normals and that of the albedo field. However, empirical analysis [202] shows that such a decomposition is not necessary and might overfit the problem, thereby indicating that a subspace of rather low dimensionality can be used.

• Handling illumination. The Lambertian model is used in the proposed algorithm, and pixels in shadows and specular reflection regions are inferred and excluded from consideration. The 3DMM approach uses the standard Phong model to directly model diffuse and specular reflection on the face surface. The 3DMM also takes into account inputs illuminated by colored lights using a color transformation, while the proposed approach only processes inputs illuminated by white lights.

• Handling pose. The 3DMM approach can handle images at any pose, while the current implementation of the proposed approach handles images sampled from a given set of poses. In order to handle an arbitrary pose other than those in the given set, the system should incorporate a tool to render novel poses from the given ones, which is left for future work. In the proposed approach, pixels at different poses might correspond to the same point in the physical 3D model; in the 3DMM approach, one point is represented only once for all poses since the 3D model is used.

• Experiments. Both the 3DMM and the proposed approach conducted experiments on the PIE database, but different portions of the database are used. The 3DMM approach worked on the "lights" part, where an ambient light source is always present; the proposed approach worked on the "illum" part, with no ambient light source. As a consequence, some images appear almost dark (refer to Figure 3.2), and there is little hope of performing correct recognition on these extreme images, which explains the relatively low recognition rates compared with those produced by the 3DMM approach.

In terms of computational complexity, the proposed algorithm is more efficient than the 3DMM approach. The proposed fitting algorithm, which takes 1-2 seconds to process one input image in a Matlab implementation, is simply linear (rather than bilinear) and has a unique minimum, while the 3DMM approach, which takes 4.5 minutes to process one input image, invokes a gradient descent algorithm that does not guarantee a global minimum. Also, the proposed algorithm is able to handle face images of very small size: in the reported experiments, gray-level images are normalized to a size of 48×40.
The size of color images used in the 3DMM approach is unclear, but typically much larger. 80 Part II: Face Recognition via Kernel Learning 81 Chapter 4 Probabilistic Kernel Principal Component Analysis Principal component analysis [12] is one of the most popular statistical data analy- sis techniques with applications in numerous areas such as data compression, image processing, computer vision, and pattern recognition, to name a few. However, the PCA has two disadvantages: (i) it lacks a probabilistic model structure which is important in many contexts such as mixture modeling and Bayesian decision (also see [167]); and (ii) it restricts itself to a linear setting, where high-order statistical information is discarded [181]. Probabilistic principal component analysis (PPCA) proposed by Tipping and Bishop [167, 168] overcomes the first disadvantage. By letting the noise com- ponent possess an isotropic structure, the PCA is implicitly embedded in a pa- rameter learning stage for this model using the maximum likelihood estimation (MLE) method. An efficient expectation/maximization (EM) algorithm [152] is also developed to iteratively learn the parameters. Kernel principal component analysis (KPCA) proposed by Sch?olkopf, Smola 82 and M?uller [181] overcomes the second disadvantage by using the so-called ?kernel trick?. The essential idea of the KPCA is to avoid the direct evaluation of the required dot product in a high-dimensional feature space using the kernel function. The feature space is called reproducing kernel Hilbert space (RKHS). Hence, no explicit nonlinear mapping function projecting the data from the original space to the feature space is needed. Since a nonlinear function is used, albeit in an implicit fashion, high-order statistical information is captured. See [179] for a recent survey on the kernel space and application on discovering pre-image and denoised pattern in the original space. We propose an approach to analyze kernel principal components in a probabilis- tic manner. It naturally unifies PPCA and KPCA in one treatment to overcome the both disadvantages of PCA. We call it the probabilistic kernel principal com- ponent analysis (PKPCA). In this chapter, we present our development of the PKPCA approach by treating the KPCA as a special case of PCA where the num- ber of samples is smaller than the data dimension. One speciality of KPCA is the data centering issue, which is also taken into account in Section 4.2. While the kernel part retains the nonlinear modeling power, resulting in a smaller reconstruction error, the additional probabilistic structure offers us (i) a mixture modeling capacity of PKPCA, and (ii) an efficient classification scheme. Mixture of PKPCA is derived to model the nonlinear structure containing non- linear substructures in a systematic way. Mixture of PKPCA nontrivially extends to the feature space induced by the kernel function, the theory of mixture of PPCA proposed by Tipping and Bishop [167, 168]. An EM algorithm [152] is also devel- oped to iteratively but efficiently learn the parameters of interest. We also show how to compute two important quantities, namely the reconstruction error and 83 the Mahalanobis distance. Our analysis can be easily incorporated for a classification task. 
Our performance is competitive with that of mainstream kernel classifiers such as the support vector machine (SVM) and the kernel Fisher discriminant (KFD) classifier, but our analysis provides a more regularized approximation to the data structure.

Chapter organization

Section 4.1 briefly reviews the essentials of RKHS. Section 4.2 presents how to compute the kernel principal components and how to analyze these components in a probabilistic manner. Section 4.3 presents the mixture of PKPCA, and Section 4.4 presents classification results on synthetic data and in a face recognition application.

Two examples

Figure 4.1 shows two examples of nonlinear data structures to be modeled, i.e., structures that are badly approximated if conventional linear modeling techniques such as linear PCA are used. Figure 4.1(a) presents the first example: a C-shaped structure in the foreground. In the context of data modeling, we consider only the foreground and assume a uniform distribution within the C-shaped region and zero outside. Figure 4.1(b) displays 200 sample points drawn from this density. In the context of pattern classification, we consider both the foreground and the background and further assume that the background class possesses a uniform distribution outside the C-shaped region and zero inside. Figure 4.1(c) shows the samples for the background class. Figure 4.1(d) shows the second example, where the foreground nonlinear data structure consists of two C-shaped substructures. Figures 4.1(e) and 4.1(f) present the drawn samples for the foreground and background classes, respectively. We mainly use this example for mixture modeling.

Figure 4.1: Two nonlinear data structures (a)(d) and their drawn samples (of size 200) for the foreground class (b)(e) and the background class (c)(f).

4.1 Reproducing Kernel Hilbert Space (RKHS)

We illustrate the principle of the RKHS by drawing an analogy between the RKHS, a functional space, and a regular vector space $R^d$. We start with a $d \times d$ positive definite matrix $T = [t_i(j)]$, where $t_i(j)$ is its $(i,j)$th element. By denoting the $i$th column by $t_i = [t_i(1),\ldots,t_i(d)]^T$, we have $T = [t_1,t_2,\ldots,t_d]$. The eigendecomposition of $T$ is given as
$$T = \sum_{n=1}^{d} \lambda_n \phi_n \phi_n^T; \quad \lambda_n > 0,$$
where the $(\lambda_n,\phi_n)$'s are eigenpairs. We define an inner product between two elements $a$ and $b$ in $R^d$ as
$$\langle a,b \rangle \equiv a^T T^{-1} b = \sum_{n=1}^{d} \lambda_n^{-1} a^T \phi_n \phi_n^T b = \sum_{n=1}^{d} \lambda_n^{-1} (a,\phi_n)(b,\phi_n),$$
where $(u,v) \equiv u^T v$. Suppose that $g = [g(1),g(2),\ldots,g(d)]^T \in R^d$ and that the identity matrix $I_d$ is written as $I_d = [e_1,e_2,\ldots,e_d]$, where $e_i$ is the $i$th column of $I_d$. The inner product $\langle \cdot,\cdot \rangle$ possesses two important properties:
$$P1: \ \langle t_i,t_j \rangle = t_i^T T^{-1} t_j = t_i^T e_j = t_i(j), \qquad P2: \ \langle t_i,g \rangle = t_i^T T^{-1} g = e_i^T g = g(i).$$
The RKHS, denoted by $H$, can be heuristically thought of as an $f$-dimensional "vector space" $R^f$ ($f$ might be finite or infinite) associated with a positive kernel function $k_x(y) = k(x,y)$. The existence of such kernel functions is guaranteed by Mercer's Theorem [176], and the eigensystem of $k(x,y)$ is given as
$$k(x,y) = \sum_{n=1}^{f} \lambda_n \phi_n(x)\phi_n(y); \quad \lambda_n > 0; \quad \sum_{n=1}^{f} \lambda_n^2 < \infty.$$
(4.1) Similarly, the inner product is defined as, with a(x), b(x), and g(x) in H, < a,b >H? fsummationdisplay n=1 ??1n (a,?n)(b,?n), where (u,v) ?integraltextxu(x)v(x)dx. Furthermore, the two properties known as reproduc- ing properties hold too. P1 : < kx,ky >H= kx(y), P2 : < kx,g >H= g(x). 86 An alternative perspective to view Eq. (4.1) is to consider a hypothetical nonlinear mapping ? : Rd ?Rf defined as ?(x) = [?1/21 (x,?1),...,?1/2f (x,?f)]T. It is easy to verify that ?(x)T?(y) = k(x,y) =< kx,ky >H . Thus evaluating the dot product can be easily done by computing k(x,y) which usually takes a parametric form. This is so-called ?kernel trick?, which plays an essential role in many kernel methods, such as SVM [19] and KPCA [181], kernel Fisher discriminant analysis [177, 172], and kernel independent component analysis [170]. In this chapter, we also adopt this viewpoint. There are a lot of ways to construct a kernel function: see [17] for a list. One example of k(x,y) is the radial basis function (RBF) kernel which is widely studied in the literature and the focus of this chapter. It is defined as k(x,y) = exp(? 12?2bardblx?ybardbl2) ?x,y ?Rd, where ? controls the kernel width. This is an infinite-dimensional RKHS, i.e., f = ?. The RBF kernel is a special example of translation-invariant kernels of the form k(x,y) = k(x?y) whose characteristics can be easily described using Fourier theory [173]. In particular, the functions in the RKHS exhibit smoothness since their Fourier transforms decay rapidly. 87 4.2 ProbabilisticAnalysisofKernelPrincipalCom- ponents 4.2.1 Kernel principal component analysis Suppose that {x1,x2,...,xN} are the given training samples in the original data space Rd. KPCA operates in a feature space that is in fact a RKHS Hk induced by a kernel function k. There exists a hypothetical nonlinear mapping function ? : Rd ? Rf, where f > d and f could even be infinite. The training samples in Rf are denoted by ?f?N = [?1,?2,...,?N], where ?n ? ?(xn) ? Rf. Denote the sample mean in the feature space as ??0 ? 1 N Nsummationdisplay n=1 ?(xn) = ?e, (4.2) where eN?1 = N?11. The f ?f covariance matrix in the feature space denoted by ? is given as ? ? 1N Nsummationdisplay n=1 (?n ? ??0)(?n ? ??0)T = ?JJT?T = ??T, (4.3) where J ? N?1/2(IN ?e1T), ? ? ?J. KPCA performs eigen-decomposition of the covariance matrix ? in the feature space. Due to the high dimensionality of the feature space, we often have insuffi- cient number of samples, i.e., the rank of the ? matrix is maximally N instead of f. However, computing the eigensystem is still possible using the method presented in [47, 62]. The explicit knowledge of the nonlinear feature mapping can be avoided using the ?kernel trick? as in Section 4.1. Define ?K ? ?T? = JT?T?J = JTKJ, (4.4) 88 where K ? ?T? is the Gram matrix or the dot product matrix. The (i,j)th entry of the Gram matrix K can be calculated as follows: Kij = ?(xi)T?(xj) = k(xi,xj). As in Appendix 4.I and [47, 62], the eigensystem for ? can be derived from ?K. Suppose that the top r eigenpairs for ?K are {(?n,vn)}qn=1, where ?n?s are sorted in a non-increasing order, and the r top eigenpairs for ? are {(?n,un)}qn=1, then we can compute un as un = (?n)?1/2?vn. In a matrix form (if only the top q eigenvectors are retained), Uq ? [u1,u2,...,uq] = ?Vq??1/2q = ?JVq??1/2q , (4.5) where Vq ? [v1,v2,...,vq] and ?q ? D[?1,?2,...,?1], a diagonal matrix whose diag- onal elements are {?1,?2,...,?q}. 
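To make Eqs. (4.2)-(4.5) concrete, here is a minimal NumPy sketch of KPCA with the RBF kernel discussed in Section 4.1. The helper names are illustrative, not the thesis implementation, and the projection routine uses the centered kernel vector c(y) - Ke that reappears later in this chapter.

import numpy as np

def rbf_kernel(X, Y, sigma):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), the RBF kernel considered in Section 4.1.
    d2 = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-d2 / (2.0 * sigma**2))

def kpca_fit(X, q, sigma):
    # Centering and eigendecomposition following Eqs. (4.2)-(4.5).
    N = X.shape[0]
    K = rbf_kernel(X, X, sigma)
    e = np.full(N, 1.0 / N)
    J = (np.eye(N) - np.outer(e, np.ones(N))) / np.sqrt(N)   # J = N^{-1/2} (I_N - e 1^T)
    K_tilde = J.T @ K @ J                                    # Eq. (4.4)
    evals, evecs = np.linalg.eigh(K_tilde)
    idx = np.argsort(evals)[::-1][:q]
    Lam_q = np.maximum(evals[idx], 1e-12)                    # guard against tiny negative round-off
    V_q = evecs[:, idx]
    return dict(X=X, sigma=sigma, K=K, e=e, J=J, Lam_q=Lam_q, V_q=V_q)

def kpca_project(model, Y):
    # U_q^T (phi(y) - mean) = Lam_q^{-1/2} V_q^T J^T (c(y) - K e), with c(y) = [k(x_i, y)].
    c = rbf_kernel(model['X'], Y, model['sigma'])
    b = c - (model['K'] @ model['e'])[:, None]
    return (np.diag(model['Lam_q'] ** -0.5) @ model['V_q'].T @ model['J'].T @ b).T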
It is clear that we are not operating in the full feature space, but in a low- dimensional subspace of it, which is spanned by the training samples. It seems that the modeling capacity is limited by subspace dimensionality, or by the number of the samples. In reality, it however turns out that even in this subspace, the smallest eigenvalues are very close to zero, which means that the full feature space can be further captured by a subspace with an even-lower dimensionality. This motivates us to use a latent model. 89 4.2.2 Theory of PKPCA Probabilistic analysis assumes that the data in the feature space follows a special factor analysis model [15] which relates an f-dimensional data ?(x) to a latent q-dimensional variable z as ?(x) = ? +Wz + epsilon1, where z ? N(0,Iq), epsilon1 ? N(0,?If), and W is a f ? q loading matrix. Therefore, ?(x) ? N(?,S), where S = WWT + ?If. Typically, we have q << N << f. As shown in [167, 168], the MLE?s for ? and W, denoted by ?? and ?W, respec- tively, are given by ?? = ??0 = 1N Nsummationdisplay n=1 ?(xn) = ?e, (4.6) ?W = Uq(?q ??Iq)1/2R, (4.7) where R is any q ? q orthogonal matrix, i.e., RTR = RRT = Iq, and Uq and ?q contain the top q eigenvectors and eigenvalues of the ? matrix. It is in this sense that our probabilistic analysis coincides with the plain KPCA. Substituting (4.5) into (4.7), we obtain the following: ?W = ?Vq??1/2q (?q ??Iq)1/2R = ?Q = ?JQ, (4.8) where the N ?q matrix Q is defined as Q ? Vq(Iq ????1q )1/2R. (4.9) Equation (4.8) has a very important implication: ?W lies in a linear subspace of ?. We name the Q matrix as empirical loading matrix since this relates the loading 90 matrix to the empirical data. Also since the matrix (Iq????1q ) in (4.9) is diagonal, additional savings in computing its square root are realized. The MLE for ?, ??, is given [167, 168] as ?? = 1f ?q{tr(S)?tr(?q)}. (4.10) Assuming that the remaining eigenvalues are zero, (this is a reasonable assumption supported by empirical evidences when f is finite), it is approximated as ?? similarequal 1f ?q{tr(K)?tr(?q)}. (4.11) But when f is infinite, this is doubtful since this always gives ?? = 0. In such a case, there is no automatic way of learning this. We temporarily set a manual choice for ??. as in [182]. However, as shown later on, we can in fact study the limiting case by letting ?? approach zero in various cases. Even when a fixed ?? is used, the optimal estimate for W (or ?W) is still the same as in (4.8). It is interesting to note that Moghaddam and Pentland [54] derived (4.10) in a different context by minimizing the Kullback-Leibler divergence distance [4, 13]. Now, the covariance matrix is estimated by ?S = ?JQQTJT?T + ??If = ?A?T + ??If, where A ? JQQTJT. This offers a regularized approximation to ? = ?JJT?T. In ridge regression [9], the form of S1 = ?JJT?T+?If (with rho a pre-specified small positive number) is used to provide a regularized approximation. This has a smoothness interpretation of the regression parameters. However, the eigenvalues of S1 always increase those of ? by an amount of ? but the eigenvectors of the S1 are the same as those of ?. 91 Although S is in a compact form and also regularized, inversion of the S1 matrix involves inverting an N ?N matrix, which is still prohibitive in real applications with a large N, whereas ?S?1 involves inverting only a r?r M matrix (defined later). This form of S1 is also used in [170, 171] for estimating the canonical correlation and [175] for constructing the Bhattacharyya kernel. 
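A compact sketch of the quantities just derived may help: the empirical loading matrix Q of Eq. (4.9) (with the arbitrary rotation R = I_q), the "reciprocal" matrix M of Eq. (4.12), and the matrix B entering the Woodbury form of the inverse covariance. The fixed noise variance is supplied manually, as discussed above for the infinite-dimensional RKHS case; function and variable names are illustrative assumptions, not the thesis code.

import numpy as np

def pkpca_fit(K, q, sigma2):
    # K      : N x N Gram matrix of the training samples.
    # q      : latent dimension; sigma2 : manually chosen noise variance
    #          (the RBF kernel gives an infinite f, so Eq. (4.11) is not used).
    N = K.shape[0]
    e = np.full(N, 1.0 / N)
    J = (np.eye(N) - np.outer(e, np.ones(N))) / np.sqrt(N)
    K_tilde = J.T @ K @ J
    evals, evecs = np.linalg.eigh(K_tilde)
    idx = np.argsort(evals)[::-1][:q]
    Lam_q = np.maximum(evals[idx], 1e-12)
    V_q = evecs[:, idx]

    # Empirical loading matrix, Eq. (4.9), with R = I_q.
    Q = V_q @ np.diag(np.sqrt(np.maximum(1.0 - sigma2 / Lam_q, 0.0)))
    # "Reciprocal" matrix, Eq. (4.12); with R = I_q it should equal diag(Lam_q).
    M = sigma2 * np.eye(q) + Q.T @ K_tilde @ Q
    # B = J Q M^{-1} Q^T J^T, the N x N matrix entering S^{-1} via the Woodbury formula.
    B = J @ Q @ np.linalg.inv(M) @ Q.T @ J.T
    return Q, M, B, e, J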
In [182] the covariance matrix ? is approximated as S2 = ?JDJT?T + ?If, where D is a diagonal matrix whose many diagonal entries empirically shown to be zero. This is not surprising as in our computation D = QQT is rank deficient. However, we do not enforce D to be diagonal. Inverting ?S is also easy by invoking the Woodbury formula [8], ?S?1 = (??If + ?W?WT)?1 = ???1(If ? ?WM?1 ?WT) = ???1(If ??B?T), where B ? JQM?1QTJT, and the matrix Mr?r can be thought of as a ?reciprocal? matrix for ?S, M ? ??Iq + ?WT ?W = ??Iq +L, (4.12) with L ? QT?KQ. Using the Q matrix in 4.9, Appendix 4.II calculates various quantities in a closed form. For example, M = RT?qR, |S| = ??(f?q)|M|. Refer to Appendix 4.II for details. From now on, we will drop the (?.) notation that denotes the MLE estimate. Whenever we mention some parameters requiring estimates, we mean the MLE values. 92 Parameter learning using EM The key for the approach developed in Section 4.2.2 is (4.8) which relates W to ? using a linear equation and the empirical loading matrix Q. This motivates us to use the EM learning algorithm to learn the Q matrix instead of the W matrix. We now present the EM algorithm for learning the parameters Q and ? in PKPCA. Assume that Q(j) and ?(j) are the estimates obtained after the jth itera- tion. The iteration proceeds as follows: Q(j+1) = ?KQ(j)(?(j)Iq +M?1Q(j)T?K2Q(j))?1, (4.13) ?(j+1) = 1ftr(?K? ?KQ(j)M(j)?1Q(j+1)T?K), (4.14) where M(j), defined in (4.9), is evaluated using Q(j). As mentioned earlier, when f is infinite, using (4.14) is not appropriate and hence a manual choice of ? is used instead. With ? fixed, Q is nothing but the solution to (4.13) and one can check that Q given in (4.9) is the solution. Computational efficiency The above EM algorithm involves only inversions of q ?q matrices and arrives at the same results (up to an orthogonal matrix R) as direct computation. However, in practice one may still use direct computation of complexity O(N3) since the complexity of computing ?K2 is O(N3). If we pre-compute ?K2, the complexity for each iteration reduces to O(qN2). Clearly, the overall computation complexity depends on the number of iterations needed for desired accuracy and the ratio of N to q. In our experiment, the EM algorithm converges to reasonable accuracy very fast, usually in less than 20 iterations. 93 Reconstruction error and Mahalanobis distance Given a vector y ? Rd, we are often interested in computing the following two quantities: 1. the reconstruction error epsilon1?(y) ? (?(y)???(y))T(?(y)???(y)) where ??(y) is the reconstructed version of ?(y); 2. the Mahalanobis distance ??(y) ? (?(y)? ??0)TS?1(?(y)? ??0). As shown in [167], the best predictor for ?(y) is ??(y) given by ??(y) = W(WTW)?1WT(?(y)? ??0)+ ??0, and ?(y)? ??(y) is given by ?(y)? ??(y) = (If ?W(WTW)?1WT)(?(y)? ??0) = ?(?(y)??0), where the f ?f matrix ? ? If ?W(WTW)?1WT is symmetric and idempotent as ?2 = ?. So, epsilon1?(y) is computed as follows: epsilon1?(y) = (?(y)? ??0)T?(?(y)? ??0) = a(y)?b(y)TCb(y), where C, ay, and by are defined by: CN?N ? JQ(QT?KQ)?1QTJT, a(y) ? (?(y)? ??0)T(?(y)? ??0) = k(y,y)?2c(y)Te +eTKe, b(y)N?1 ? ?T(?(y)? ??0) = c(y)?Ke, 94 with c(y)N?1 ? ?T?(y) = [k(x1,y),...,k(xN,y)]T. The Mahalanobis distance is calculated as follows: ??(y) = (?(y)? ??0)TS?1(?(y)? ??0) = ??1{a(y)?b(y)TBb(y)}. (4.15) Finally, an important observation is that as long as we can express ??0 and S as in (4.2) and (4.3), i.e. 
there exist e and J that relate ??0 and S to ?, we can safely use the derivations presented in this section. This lays a solid foundation for the development of the mixture of PKPCA theory. We can study a limiting behavior of ??(y) by defining ???(y) ? lim ??0???(y) = a(y)?b(y) T?Bb(y), (4.16) where ?B ? lim??0 B. Experiments on kernel modeling This part addresses the power of kernel modeling part in PKPCA in terms of the reconstruction error. The probabilistic nature of PKPCA will be illustrated in the next sections. We compare PPCA and PKPCA since the only difference between them is the kernel modeling part. We define the reconstruction error percentage ? as follows: ?(y) = epsilon1(y)yTy, ??(y) = epsilon1?(y)k(y,y), where ?(y) is for PPCA and ??(y) for PKPCA. Figure 4.2 shows the histogram of ? for the famous iris data2. This dataset consists of 150 samples and is used in pattern classification tasks. We, however, 2This is available at the UCI Machine Learning Repository. The URL is http://www.ics.uci.edu/?mlearn/MLRepository.html. 95 Algorithm PPCA PPCA PKPCA PKPCA q = 2 q = 3 q = 9 q = 15 Mean 8.23% 1.42% 3.88% 1.39% Std. dev. 13.12% 4.52% 3.86% 1.39% Table 4.1: PPCA and PKPCA reconstruction error percentage. just treat it as a whole regardless of its class labels. Since it is just 4-d data, PPCA keeps at most 3 principal component, i.e. q ? 3, while PKPCA has no such limit and can have q ? 149. Figure 4.2 and Table 4.1 show that PKPCA with q = 9, i.e. using 6% percent principal components produces a small ? than PPCA with q = 2 that uses 50% components. In addition, PKPCA with q = 15 that uses 10% percent principal components produces a small ? than PPCA with q = 3, using 75% components. A larger q produces even smaller ?. This improvement benefits from kernel modeling, which is able to capture the nonlinear structure of the data. However, PKPCA involves much more computation than PPCA. 4.3 MixtureModelingofProbabilisticKernelPrin- cipal Components 4.3.1 Theory of mixture of PKPCA Mixture of PKPCA models the data in a high-dimensional feature space using a mixture of I densities with each mixture component p(.|i) being a PKPCA density associated with an empirical loading matrix Qi that can be derived from corresponding ei and Ji (as shown below). For ?i?s, we assume ?i ? ? with ? fixed. 96 0 0.5 10 50 100 150 (a) PPCA: q=2 0 0.5 10 50 100 150 (b) PPCA: q=3 0 0.5 10 50 100 150 (c) PKPCA: q=9, ?=1.6, ?=0.001 0 0.5 10 50 100 150 (d) PKPCA: q=15, ?=1.6, ?=0.001 Figure 4.2: Histogram of ? for iris data obtained by (a) PPCA with q = 2, (b) PPCA with q = 3, (c) PKPCA with Gaussian kernel with q = 9, ? = 2 and ? = 0.001, and (d) PKPCA with Gaussian kernel with q = 15, ? = 2 and ? = 0.001. Mathematically, p(?(x)) = Isummationdisplay i=1 mip(?(x)|i) = Isummationdisplay i=1 miN(??i,Si), where mi?s are mixing probabilities summing up to 1, and p(?(x)|i) = N(??i,Si) is the PKPCA density for the ith component defined as N(??i,Si) = (2pi) ?f/2 |Si|1/2 exp{? 1 2??,i(x)} = (2pi)?f/2 ?(f?qi)/2|Mi|1/2 exp{? 1 2??,i(x)} = (2pi?)?f/2 exp{?12???,i(x)} where ??,i(x) is the Mahalanobis distance as in (4.15) with all parameters involved coming from the ith component, and ???,i(x) ? ??,i(x)+ log(|Mi|)+ qi log(??1). 97 We call ???(x) as the ?generalized? Mahalanobis distance. Parameter learning using EM We invoke the ML principle to estimate the parameters of interest, i.e., {mi,Qi}?s from the training data. 
It turns out that direct maximization is cumbersome since the log-likelihood involves summations within logarithms. The iterative EM algorithm [152, 167] is used instead. Assume that {m(j)i ,Q(j)i } are the values obtained in the jth iteration. We begin by computing the posterior responsibility rni. r(j)ni ? p(j)(i|?n) = mip (j)(?n|i) p(j)(?n) = m(j)i exp{?12??(j)?,i(x)} summationtextI l=1 m (j) l exp{? 1 2 ??(j)?,l(x)}. (4.17) There is no need to calculate rni by exactly following (4.17). One only needs to evaluate the numerator mi exp{?12???,i(x)} and perform normalization to guarantee that summationtextIi=1 rni = 1. The EM iterations compute the following quantities: m(j+1)i = 1N Nsummationdisplay n=1 r(j)ni , (4.18) ??(j+1)i = summationtextN n=1 r (j) ni ?nsummationtext N n=1 r (j) ni = Nsummationdisplay n=1 e(j)ni ?n = ?e(j)i , where e(j)i = [e(j)1i ,e(j)2i ,...,e(j)Ni]T with e(j)ni ? r (j) nisummationtext N n=1 r (j) ni . It is easy to show that the local responsibility-weighted covariance matrix for component i, Si, is obtained as S(j+1)i ? Nsummationdisplay n=1 e(j)ni (?n ? ??(j+1)i )(?n ? ??(j+1)i )T = ?J(j+1)i J(j+1)i T?T, 98 where J(j+1)i ? (IN ?e(j)i 1T) D1/2[e(j)1i ,e(j)2i ,...,e(j)Ni]. Using ?K(j+1)i = J(j+1)i TKJ(j+1)i , the updated Q(j+1)i can obtained as Q(j+1)i = V(j+1)qi,i (Iqi ???(j+1)qi,i ?1)1/2, (4.19) where ?(j+1)qi,i and V(j+1)qi,i are the top qi eigenvalues and eigenvectors of ?K(j+1)i . Also, an EM algorithm for learning the Qi matrix as shown in Section 4.2.2 can be used instead of direct computation. The above derivations indicate that it is not necessary to start the EM itera- tions from initializing the parameters e.g. {mi,Qi}?s. Instead, we can start from assigning the posterior responsibility {rni}?s. Once assigned, we follow equations (4.18) to (4.19) to compute the updated {mi,Qi}?s. The iterations then move on. This way we can easily incorporate any prior knowledge gained from cluster- ing techniques such as the ?kernelized? version of the K-means algorithm [181], or other algorithms [180]. Parameter learning experiments We now demonstrate how mixture of PKPCA performs by fitting it to the two C-shapes shown in Figure 4.4(d). We set the following parameters: I = 2, q = 2, ? = 1e?2, and ? = 8. The algorithm iterations are terminated if the changes in the {rni}?s are small enough. Figure 4.3(a) presents the initial configuration for the two C-shapes. We just generate random numbers for{rni}?s followed by a normalization step to guarantee 99 summationtextI i=1 rni = 1. Figure 4.3(b) shows the mixture assignment after the first iteration and Figure 4.3(c) the final configuration (only after 3 iterations). A final note is that the EM algorithm can still converge to a local minimum. In this case, the clustering method [180] is very helpful for initialization. 0 20 40 60 80 1000 10 20 30 40 50 60 70 80 90 100 0 20 40 60 80 1000 10 20 30 40 50 60 70 80 90 100 0 20 40 60 80 1000 10 20 30 40 50 60 70 80 90 100 (a) (b) (c) Figure 4.3: (a) Initial configuration. (b) After first iteration. (c) Final configura- tion. ?+? and ?x? denote two different mixture components. 4.3.2 Why mixture of PKPCA? It is well known [180, 181] that kernel embedding results in clustering capability. This raises the doubt whether PKPCA is sufficient to model a nonlinear struc- ture with nonlinear substructures. We demonstrate the effectiveness of mixture of PKPCA with the following examples. 
Figure 4.4(a) shows a nonlinear structure containing a single C-shape and Fig- ures 4.4(b) and 4.4(c) the contour plots, for the 1st and 2nd kernel principal com- ponents, i.e. all points in the contour share the same principal component values. These plots capture the nonlinear shape very precisely. Now in Figure 4.4(d), a nonlinear structure containing two C-shapes is presented. Figures 4.4(e) and 4.4(f) display the contour plots corresponding to 4.4(d). Clearly, they attempt to cap- ture both C-shapes at the same time. This is not desirable. Ideally, we want to 100 have two KPCAs, each modeling a different C-shape more precisely. However, the ordinary KPCA has no such capability but PKPCA does. This naturally leads us to considering a mixture of PKPCA. Section 4.4 also demonstrates this using classification results. The successful kernel clustering algorithm [180] shows that after kernel em- bedding, the clusters become more separable. This further sheds light on the effectiveness of mixture of PKPCA. One may also ask: why not use the mixture of PPCA directly? Although a mixture of PPCA is legitimate, its use is not elegant in this scenario since one may need more than 2 components for Figure 4.4(d) to capture the data structure due to the limitation of the linear setting in PCA. But mixture of PKPCA can elegantly model it using two components. 4.4 Classification 4.4.1 PKPCA or mixture of PKPCA classifier We now demonstrate the probabilistic interpretation embedded in PKPCA using a pattern classification problem. Suppose we have N classes. For class n, a PKPCA or mixture of PKPCA density p(?n(x)|n) is trained; then, the class label for a point x is determined using the Bayesian decision principle by ?n = arg maxn=1,...,N p(n)p(x|n) = arg maxn=1,...,N p(n)p(?n(x)|n)|Jn(x)|, (4.20) where p(n) is the prior distribution, p(x|n) is the conditional density for class n in the original space, and Jn(x) is the Jacobi matrix for class n. To use (4.20), we are confronted by two dilemmas: (i) the Jacobi matrices, Jn(x)?s, are unknown since we have no knowledge of ?n(x); and (ii) the densities, 101 0 10 20 30 0 10 20 30 (a) one C?shape 0 10 20 30 0 10 20 30 (b) 1st KPC 0 10 20 30 0 10 20 30 (c) 2nd KPC 0 10 20 30 0 10 20 30 (d) two C?shapes 0 10 20 30 0 10 20 30 (e) 1st KPC 0 10 20 30 0 10 20 30 (f) 2nd KPC Figure 4.4: (a) One C-shape and contour plots of its (b) 1st and (c) 2nd KPCA features. (d) Two C-shapes and its contour plots of its (e) 1st and (f) 2nd KPCA features. p(?n(x)|n)?s, involves infinite f. The latter is easily fixed by assuming ?c ? ? for all classes, where ?n is the parameter in the density p(?n(x)|n) for class n. One trick to attack the first dilemma is to use the same kernel function for all the classes with the same kernel width ?, i.e. ?n = ?. However, it might not be appropriate since different classes possess different data structures. An alternative approach is that we still use different kernel functions for different classes but we approximate the Jacobi matrices. We use the following approximation: |Jn(x)| similarequal const, ?x. Figure 4.5 demonstrates our rationale. Figure 4.5(a) presents the contour plots 102 for the true density to be modeled, which is uniform inside the black C-shaped region (Figure 4.1(a)). All contour plots are located on the boundary. We fit a PKPCA density (? = 15, q = 20, and ? = 1e ? 6) based on the samples shown in Figure 4.1(b) and visualize the density using Figure 4.5(b), which displays the map of log(??(x)). 
To verify that the values in the C-shaped region are uniform, we show in Figure 4.5(c) the contour plots for ???(x) inside the C-shaped region. Most contours are close to the boundary, which indicates the uniformity of the density p(?(x)) inside the C-shaped region and thus the Jacobi approximation which relates p(?(x)) and p(x) is reasonable. The above approximation leads to a linear decision rule. For example, in a two-class problem, the decision rule is, for some ? > 0, If p(?1(x)|1) ? ? p(?2(x)|2) then class 1; Else class 2 In the sequel, we simply take ? = 1. 10 20 30 40 50 60 70 80 90 100 10 20 30 40 50 60 70 80 90 100 20 40 60 80 100 10 20 30 40 50 60 70 80 90 100 10 20 30 40 50 60 70 80 90 100 10 20 30 40 50 60 70 80 90 100 (a) (b) (c) Figure 4.5: The approximation of the Jacobi matrix. (a) The contour plots of the true density: uniform inside the C-shaped region. (b) The map of log(??). (c) The contour plots of ??? inside the C-shaped region. Putting the above discussions together, we have the following decision rules: ? If PKPCA densities are learned for all classes, i.e., for class n, we learn 103 {?n = ?,Qn}, it is easy to check that the classifier performs the following: arg minn=1,...,N ??n?(x), where ??n is the ?generalized? Mahalanobis distance. ? If mixture of PKPCA densities are learned for all classes, i.e., for the class c, we learn {?n = ?,mn,1,Qn,1,...,mn,Ic,Qn,In} with In being the number of mixture components, then the classifier decides as follows: arg maxn=1,...,N Insummationdisplay j=1 mn,j exp{?12??n?,j(x)}. 4.4.2 Experiments Synthetic Data We consider a 2-class problem with foreground (class 1) and background (class 2) classes given in Figure 4.1(a), where the letter ?C? or ?O? means the foreground class. We then draw 200 samples for both classes as shown in Figures 4.1 and 4.8. Figure 4.6 presents the classification results obtained by the PKPCA classifier with different kernel widths for different classes (PKPCA-d), the PKPCA classifier with same kernel widths for different classes (PKPCA-s), the support vector ma- chine (SVM) [19], and the kernel Fisher discriminant analysis (KFDA) [177]. In PKPCA-s, SVM and KFD, the kernel width ? is tuned (via exhaustive search from 1 to 100) to yield the best empirical classification results and reported in Table 4.2. The PKPCA-d parameters actually used are also reported in Table 4.2, where the kernel widths for the background and foreground classes are found via the procedures described in Appendix 4.III. As shown in Figure 4.6, the classification boundary obtained by PKPCA-d is very smooth and very similar to the original 104 Algorithm Single C-shape Single O-shape Double C-shapes PKPCA-d 1.57% 3.80% 7.49% q = 30, ? = 10?8 q = 20, ? = 10?6 q = 20, ? = 10?6 ?1 = 15,?2 = 35 ?1 = 15, ?2 = 35 ?1 = 15, ?2 = 35 PKPCA-s 1.95% 5.50% 1.85% q = 30, ? = 10?8 q = 30, ? = 10?8 q = 20, ? = 10?6 ? = 1 ? = 1 ? = 1 SVM 1.80% 5.45% 1.69% ? = 1 ? = 1 ? = 1 KFDA 1.84% 5.47% 1.82% ? = 1, 30 components ? = 1, 20 components ? = 1, 20 components mix. PKPCA NA NA 0.70% q = 20, ? = 10?6, I1 = 2, ?1 = 8, I2 = 1, ?2 = 35 Table 4.2: Classification error on the single C-shaped, the single O-shape, and the double C-shapes. boundary, while those of PKPCA-s, SVM and KFDA seem to only replicate the training samples, with holes and gaps. Table 4.2 indicates that our PKPCA-d classifier outperforms the SVM and KFDA classifiers by some margin. Similar observations can be made based on the experimental results on a single O-shape as shown in Figure 4.8. 
The superior performance of the PKPCA-d classifier mainly arises from its ability to model different classes with different kernel functions, while PKPCA-s, SVM, and KFDA employ only one kernel. This is a big advantage since, as seen in our synthetic examples, we clearly need different kernel widths for the foreground and background classes.

Figure 4.6: The classification results on the single C-shape obtained by (a) PKPCA-d, (b) PKPCA-s, (c) SVM, and (d) KFDA.

Figure 4.7: The classification results on the double C-shapes obtained by (a) the PKPCA-d classifier, (b) SVM, and (c) the mixture of PKPCA classifier with different kernel widths.

Figure 4.8: The classification results on the single O-shape. Panels: (a) original 2-class data, (b) foreground samples, (c) background samples, (d) PKPCA, (e) SVM, (f) KFDA.

More importantly, PKPCA provides a regularized approximation to the data structure; thus its decision boundary is very smooth. Also, the probabilistic interpretation of PKPCA enables the PKPCA classifier to deal with an N-class problem as easily as KFDA, while the SVM is basically designed for a two-class problem and extending it to a multi-class problem is not straightforward.

We now illustrate the mixture of PKPCA classifier by applying it to the double C-shapes shown in Figure 4.1(d). We fit the mixture of PKPCA density for the foreground class based on the samples shown in Figure 4.1(e) and the PKPCA density for the background class based on the samples shown in Figure 4.1(f). Figure 4.7 and Table 4.2 present the classification results. Clearly, the mixture of PKPCA classifier produces the best performance in terms of classification error, and its decision boundary is very smooth.

One important observation is that the PKPCA classifier with different kernel widths performs poorly here. This is because the selected kernel width attempts to cover both nonlinear substructures simultaneously, which actually over-smoothes each substructure (see Figure 4.7(a)). Hence, caution should be exercised when modeling mixture data via PKPCA densities with different kernel widths.

IDA Benchmark

We also test our classifier on the IDA benchmark repository3 [179]. To make our results comparable, we use cross-validation (the same procedure as in [179]) to choose our parameters; also, we invoke the PKPCA density without mixture modeling and with the same kernel parameter for different classes. As tabulated in Table 4.3, our PKPCA classifier compares favorably to kernel classifiers such as SVM and KFD. We believe that the classification results can be improved by using the PKPCA-d or even the mixture of PKPCA classifier.

A real application: face recognition

We report face recognition results using a subset of the FERET database [58] with 200 subjects only. Each subject has 3 images: (i) one taken under a controlled lighting condition with a neutral expression; (ii) one taken under the same lighting condition as (i) but with a different facial expression (mostly smiling); and (iii) one taken under a different lighting condition and mostly with a neutral expression.
Figure 4.9 shows some face examples from this database.

Our experiment focuses on testing the generalization capability of our algorithm. It is our hope that the training stage can learn the intrinsic characteristics of the space we are interested in. Therefore, we always keep the gallery and probe sets separate. We randomly select 300 images belonging to 100 subjects as the gallery set for learning and the remaining 300 images as the probe set for testing. This random division is repeated 20 times and we take the average as the final result.

3 This is available at http://ida.first.gmd.de/~raetsch/data/benchmarks.htm.

            PKPCA-s      SVM          KFD
Banana      10.5 ± 0.4   11.5 ± 0.7   10.8 ± 0.5
B. Cancer   28.0 ± 4.7   26.0 ± 4.7   25.8 ± 4.6
Diabetes    24.8 ± 1.9   23.5 ± 1.7   23.2 ± 1.6
German      24.9 ± 2.2   23.6 ± 2.1   23.7 ± 2.2
Heart       16.8 ± 3.4   16.0 ± 3.3   16.1 ± 3.4
Image        2.8 ± 0.6    3.0 ± 0.6    3.3 ± 0.6
Ringnorm     1.6 ± 0.1    1.7 ± 0.1    1.5 ± 0.1
F. Solar    34.8 ± 1.9   32.4 ± 1.8   33.2 ± 1.7
Splice      12.2 ± 0.8   10.9 ± 0.7   10.5 ± 0.6
Thyroid      4.0 ± 2.0    4.8 ± 2.2    4.2 ± 2.1
Titanic     22.6 ± 1.3   22.4 ± 1.0   23.2 ± 2.0
Twonorm      2.6 ± 0.2    3.0 ± 0.2    2.6 ± 0.2
Waveform    11.4 ± 0.5    9.9 ± 0.4    9.9 ± 0.4
Table 4.3: The classification error on the IDA benchmark repository. The SVM and KFD results are reported in [179].

Figure 4.9: Top row: neutral faces. Middle row: faces with facial expression. Bottom row: faces under different illumination. Image size is 24 by 21 pixels.

Component analysis in general is not geared towards discrimination and thus yields inferior recognition results in practice. To this end, Moghaddam et al. [55, 56] introduced the concept of intra-personal space (IPS). The IPS is constructed by collecting all the difference images between any two image pairs belonging to the same individual; it is meant to capture all the possible intra-personal variations introduced during image acquisition. Suppose that we have learned some density p_{IPS} on top of the IPS and we are given a gallery set consisting of images {x_1, x_2, ..., x_N} for N different individuals. Given a probe image y, its identity n̂ is determined by

n̂ = arg max_{n=1,...,N} p_{IPS}(y − x_n) = arg min_{n=1,...,N} δ_{IPS,∞}(y − x_n).

Here we use the limiting Mahalanobis distance δ_∞.

For comparison, we have implemented the following four methods. In PKPCA/IPS and PPCA/IPS, the IPS is constructed based on the gallery set and the PKPCA/PPCA density is fitted on top of it. In KPCA and PCA, all 300 training images are regarded as lying in one face space and KPCA/PCA is then learned on that space; the classifier sets the identity of a probe image to the identity of its nearest neighbor in the gallery set.

Table 4.4 lists the recognition rate, averaged over the 20 simulations, using the top-1 match. The PKPCA/IPS algorithm attains the best performance since it combines the discriminative power of the IPS model and the merit of PKPCA. However, compared to PPCA/IPS, the improvement is not significant, indicating that second-order statistics might be enough after IPS modeling for the face recognition problem. Still, PKPCA may be more effective in general since it also takes higher-order statistics into account. Another observation is that variations in illumination are easier to model than facial expression using subspace methods.

                PKPCA/IPS   PPCA/IPS   KPCA     PCA
Expression      78.55%      78.35%     63.85%   67.65%
Illumination    83.9%       81.85%     51.9%    73.1%
Average         81.23%      80.1%      57.88%   70.38%
Table 4.4: Recognition rate of various kernel and non-kernel subspace methods.
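As an illustration of the IPS-based matching rule above, here is a small sketch that builds the intra-personal space from a gallery and identifies a probe by the smallest subspace Mahalanobis-type score. For simplicity it fits a plain PCA subspace model in place of the PKPCA/PPCA density, so it is not the method evaluated in Table 4.4; all function names are hypothetical.

```python
import numpy as np

def build_ips(gallery_images, gallery_ids):
    """Intra-personal space: difference images between image pairs of the same person."""
    diffs = []
    for pid in np.unique(gallery_ids):
        imgs = gallery_images[gallery_ids == pid]
        for a in range(len(imgs)):
            for b in range(a + 1, len(imgs)):
                diffs.append(imgs[a] - imgs[b])
                diffs.append(imgs[b] - imgs[a])
    return np.asarray(diffs)

def fit_subspace_density(D, q=20):
    """PCA-style second-order model of the IPS (a stand-in for the PKPCA/PPCA density)."""
    mean = D.mean(axis=0)
    U, s, Vt = np.linalg.svd(D - mean, full_matrices=False)
    basis = Vt[:q].T                      # top-q principal directions of the IPS
    var = (s[:q] ** 2) / len(D)           # corresponding variances
    return mean, basis, var

def identify(probe, gallery_images, mean, basis, var):
    """Return the gallery index minimizing the Mahalanobis-type score on the IPS subspace."""
    scores = []
    for x in gallery_images:
        proj = basis.T @ ((probe - x) - mean)
        scores.append(np.sum(proj ** 2 / var))
    return int(np.argmin(scores))
```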
4.5 Appendix

Appendix 4.I: Two Lemmas on Matrix Computation

We introduce some related results on matrix computation in the following two lemmas. The proofs are straightforward and hence skipped here.

Lemma 4.1. Suppose that A is of size d × q with q < d and the matrix A^T A is of full rank. Then the matrices A^T A and A A^T have the same nonzero eigenvalues.

Lemma 4.2. Suppose that B = ε I_d + A A^T and {λ_i; i = 1, 2, ..., q} are the eigenvalues of the matrix A^T A. Then the determinant |B| is given by

|B| = ∏_{i=1}^{q} (ε + λ_i) · ε^{d−q},    (4.21)

and the inverse matrix B^{−1} is given by

B^{−1} = ε^{−1}{ I_d − A(ε I_q + A^T A)^{−1} A^T }.

Appendix 4.II: A List of Important Quantities

Important quantities:
RKHS: H = R^f.
Original observations: X_{d×N} = [x_1, x_2, ..., x_N].
Nonlinear mapping: φ(x): R^d → R^f.
Observations in RKHS: Φ_{f×N} = [φ_1, φ_2, ..., φ_N].
Weight vector: e_{N×1} = N^{−1} 1 (for example).
Mean: μ_{f×1} = Φ e.
Centering matrix: J_{N×N} = N^{−1/2}(I_N − e 1^T).
Covariance matrix (c.m.): Σ_{f×f} = Φ J J^T Φ^T.
Gram matrix (g.m.): K_{N×N} = Φ^T Φ.
Centered g.m.: K̃_{N×N} = J^T K J.
Eigenvalues of K̃: Λ_q = D[λ_1, λ_2, ..., λ_q]_{q×q}.
Eigenvectors of K̃: V_q = [v_1, v_2, ..., v_q]_{N×q}.
Approximate c.m.: S_{f×f} = Φ A Φ^T + ε I_f.
A matrix: A_{N×N} = J V_q (I_q − ε Λ_q^{−1}) V_q^T J^T.
Inverse of S: S^{−1} = ε^{−1}(I_f − Φ B Φ^T).
B matrix: B_{N×N} = J V_q (Λ_q^{−1} − ε Λ_q^{−2}) V_q^T J^T.
C matrix: C_{N×N} = J Q (Q^T K̃ Q)^{−1} Q^T J^T.
Q matrix: Q_{N×q} = V_q (I_q − ε Λ_q^{−1})^{1/2} R.
M matrix: M_{q×q} = ε I_q + Q^T K̃ Q.

Computation related to L and M

We first compute L = Q^T K̃ Q and then M:

L = Q^T K̃ Q = R^T (I_q − ε Λ_q^{−1})^{1/2} V_q^T K̃ V_q (I_q − ε Λ_q^{−1})^{1/2} R
  = R^T (I_q − ε Λ_q^{−1})^{1/2} Λ_q (I_q − ε Λ_q^{−1})^{1/2} R
  = R^T (Λ_q − ε I_q) R,

where the fact that V_q^T K̃ V_q = V_q^T J^T K J V_q = Λ_q is used. Therefore,

M = ε I_q + L = ε I_q + R^T (Λ_q − ε I_q) R = R^T Λ_q R,
|M| = |Λ_q| = ∏_{i=1}^{q} λ_i,    M^{−1} = R^T Λ_q^{−1} R.

Computation related to A, B, and C

A = J Q Q^T J^T = J V_q (I_q − ε Λ_q^{−1})^{1/2} R R^T (I_q − ε Λ_q^{−1})^{1/2} V_q^T J^T = J V_q (I_q − ε Λ_q^{−1}) V_q^T J^T,

B = J Q M^{−1} Q^T J^T = J V_q (I_q − ε Λ_q^{−1})^{1/2} R R^T Λ_q^{−1} R R^T (I_q − ε Λ_q^{−1})^{1/2} V_q^T J^T = J V_q (Λ_q^{−1} − ε Λ_q^{−2}) V_q^T J^T,

C = J Q (Q^T K̃ Q)^{−1} Q^T J^T = J V_q (I_q − ε Λ_q^{−1})^{1/2} R R^T (Λ_q − ε I_q)^{−1} R R^T (I_q − ε Λ_q^{−1})^{1/2} V_q^T J^T = J V_q Λ_q^{−1} V_q^T J^T,

tr[AK] = tr[J V_q (I_q − ε Λ_q^{−1}) V_q^T J^T K] = tr[(I_q − ε Λ_q^{−1}) V_q^T J^T K J V_q] = tr[(I_q − ε Λ_q^{−1}) Λ_q] = tr[Λ_q] − εq = Σ_{i=1}^{q} λ_i − εq,

tr[BK] = tr[J V_q (Λ_q^{−1} − ε Λ_q^{−2}) V_q^T J^T K] = tr[(Λ_q^{−1} − ε Λ_q^{−2}) V_q^T J^T K J V_q] = tr[(Λ_q^{−1} − ε Λ_q^{−2}) Λ_q] = q − ε tr[Λ_q^{−1}] = q − ε Σ_{i=1}^{q} λ_i^{−1}.

Computation related to S

We have shown that S^{−1} = ε^{−1}(I_f − Φ B Φ^T). We are also often interested in computing tr(S^{−1} Σ):

tr(S^{−1} Σ) = tr(S^{−1} Φ J J^T Φ^T) = tr(J^T Φ^T S^{−1} Φ J)
  = ε^{−1}(tr(K̃) − tr(J^T Φ^T Φ B Φ^T Φ J))
  = ε^{−1}(tr(K̃) − tr(K̃ V_q (Λ_q^{−1} − ε Λ_q^{−2}) V_q^T K̃))
  = ε^{−1}(tr(K̃) − tr((Λ_q^{−1} − ε Λ_q^{−2}) Λ_q^2))
  = ε^{−1}(tr(K̃) − tr(Λ_q − ε I_q))
  = ε^{−1}(tr(K̃) − Σ_{i=1}^{q} λ_i) + q.

Also, using Lemma 4.2 in Appendix 4.I, the determinant of S is given by

|S| = ε^{f−q} |M| = ε^{f−q} |Λ_q| = ε^{f−q} ∏_{i=1}^{q} λ_i.

Appendix 4.III: Kernel selection

Only functions satisfying Mercer's theorem [176] can be used as kernel functions. In general, the kernel function lies in some parameterized function family. Denote the parameter of interest by θ. For example, θ can be the polynomial degree in the polynomial kernel, or the kernel width in the Gaussian kernel. The choice of θ remains an open question because there is no systematic criterion to judge its goodness. Again, we focus only on the Gaussian kernel case, so θ = σ and f = ∞. It seems that PKPCA offers a systematic ML principle to follow, i.e., picking the σ that maximizes the likelihood or log-likelihood.
However, it turns out that the ML principle fails, as it has an inherent bias towards a large σ value. The log-likelihood L is given by:

L = −(Nf/2) log(2π) − (N/2) log|S| − (1/2) Σ_{n=1}^{N} (φ(x_n) − μ)^T S^{−1} (φ(x_n) − μ)
  ∝ −(N/2) Σ_{i=1}^{q} log(λ_i) − (N/2) tr(S^{−1} Σ)
  ∝ −(N/2) Σ_{i=1}^{q} log(λ_i) − (N/2) ε^{−1}(tr(K̃) − Σ_{i=1}^{q} λ_i).

By defining the quantity

E(σ) = −(2/N) L ∝ Σ_{i=1}^{q} log(λ_i) + ε^{−1}(tr(K̃) − tr(Λ_q)),    (4.22)

the goal is to min_σ E(σ) subject to λ_q(σ) > ε.

Figure 4.10: (a) The curve of E(σ). (b) The curve of λ_1(σ). We have set q = 30 and ε = 1e−6.

We now show how this works. Figure 4.10(a) presents the curve of E(σ) obtained using (4.22) for the C-shaped data (Figure 4.1(a)); it always has a bias toward favoring a large σ. This is not surprising since a large σ makes the Gram matrix K close to a matrix of ones; hence the centered Gram matrix K̃ becomes close to a matrix of zeros, the data variation is reduced, and therefore the likelihood is increased. If σ goes to ∞, all data essentially reduce to one point in the feature space. This is also explained by Williams in [183], who studied the ratio of the sum of the top q eigenvalues to that of all eigenvalues and discovered the same bias.

We propose an alternative approach by examining the first eigenvalue, which equals the maximum variance of the projected data, where the projection occurs in the feature space induced by the kernel function. Figure 4.10(b) shows the plot of the first eigenvalue λ_1(σ) against σ. There is a unique maximum, and we pick it as our kernel width. This choice of the kernel width seems to have a close relationship with the assumption on the Jacobi matrix in Section 4.4.1. Figure 4.11(a) presents the map of log p̂(φ(x)) for the single C-shape with σ = 3 and Figure 4.11(b) the contour plots of p̂(φ(x)); the map is very granular and the uniformity inside the C-shaped region disappears. Figure 4.11(c) shows the map of log p̂(φ(x)) with σ = 36 and Figure 4.11(d) the contour plots of p̂(φ(x)); now the map is over-smoothed (compare the intensity change inside and outside the C-shaped region with that of Figure 4.5(b)).

Figure 4.11: (a) The map of log p̂ and (b) the contour plots of p̂ inside the C-shaped region, when σ = 3. (c) The map of log p̂ and (d) the contour plots of p̂ inside the C-shaped region, when σ = 36.

Chapter 5

Probability Distances in Reproducing Kernel Hilbert Space

Probabilistic distance measures, defined as distances between two probability distributions, are important quantities that find use in many research areas such as probability and statistics, pattern recognition, information theory, communication, and so on. In statistics, probabilistic distances are often used in asymptotic analysis. In pattern recognition, pattern separability is usually calibrated using probabilistic distance measures [5] such as the Chernoff distance and the Bhattacharyya distance because they provide bounds on the probability of error in a pattern classification problem. In information theory, mutual information, a special example of the Kullback-Leibler divergence or relative entropy [4], is a fundamental quantity related to the channel capacity.
In communication, divergence and Bhattacharyya distance measures are used for signal selection [156].

Direct evaluation of probabilistic distances is nontrivial since they involve integrals. Only within certain parametric families, say the widely used Gaussian density, do we have analytic expressions for probability distances. However, the Gaussian density employs only up to second-order statistics; its modeling capacity is linear and hence rather limited when confronted with a nonlinear data structure. By a nonlinear data structure, we mean that if conventional linear modeling techniques, such as fitting a Gaussian density, are used, the responses are badly approximated. To absorb the nonlinearity, mixture models or non-parametric densities are used in practice. For such cases, one has to resort to numerical methods for computing the probabilistic distances. Such computation is not robust in nature since two approximations are invoked: one in estimating the density and the other in evaluating the numerical integral.

In this chapter, we model the nonlinearity through a different approach: kernel methods. The essence of kernel methods is to combine a linear algorithm with a nonlinear embedding, which maps the data from the original vector space to the reproducing kernel Hilbert space (RKHS). We do not require any explicit knowledge of the nonlinear mapping function as long as we can cast our computations into dot product evaluations. Since a nonlinear function is used, albeit in an implicit fashion, we achieve a new paradigm to study these distances and investigate their uses in a different space.

Clearly, our computation depends on the assumption that the data is Gaussian in the RKHS. This assumption has been implicitly used in many kernel methods such as [172, 181]. In [181], PCA operates on the RKHS; even though it seems that PCA needs only the covariance matrix without the Gaussianity assumption, it is the deviation of the data from Gaussianity in the original space that drives us to search for the principal components in the nonlinear feature space. In [172], discriminant analysis is performed on the feature space; it is well known that discriminant analysis originated as a two-class problem by assuming that each class is distributed as a Gaussian with a common covariance matrix. Recently, Gaussianity has been directly adopted in the literature [170, 171, 175]. In [170, 171], it is used to compute the mutual information between two Gaussian random vectors in the RKHS. In [175], it is used to construct the so-called Bhattacharyya kernel. In fact, the validity of this assumption boils down to a Gaussian process argument [175]. However, since the induced RKHS is certainly limited by the number of available samples, a regularized covariance matrix is needed in [170, 171]. We also propose a way to regularize the covariance matrix in this chapter.

Chapter organization

This chapter is organized as follows. Section 5.1 introduces several probabilistic distances often used in the literature and Section 5.2 presents a method for estimating the first- and second-order statistics of the data in the RKHS. Section 5.3 elaborates the derivations of the probabilistic distances in the RKHS and their limiting behavior. Section 5.4 demonstrates the feasibility and efficiency of the proposed measures using experiments on synthetic and real examples.
5.1 Probabilistic Distances in R^d

Consider a two-class problem and suppose that class 1 has prior probability π_1 and class-dependent density p_1(x) and class 2 has prior probability π_2 and class-dependent density p_2(x), both defined on R^d. The following is a list of probabilistic distance measures often found in the literature [5]:

- Chernoff distance [151]:
  J_C(p_1, p_2) = −log { ∫_x p_1^{α}(x) p_2^{1−α}(x) dx };    (5.1)

- Bhattacharyya distance [150]:
  J_B(p_1, p_2) = −log { ∫_x [p_1(x) p_2(x)]^{1/2} dx };    (5.2)

- Hellinger or Matusita distance [161]:
  J_T(p_1, p_2) = { ∫_x [√p_1(x) − √p_2(x)]^2 dx }^{1/2};    (5.3)

- The symmetric divergence [13]:
  J_D(p_1, p_2) = ∫_x [p_1(x) − p_2(x)] log (p_1(x)/p_2(x)) dx;    (5.4)

- Patrick-Fisher distance [163]:
  J_P(p_1, p_2) = { ∫_x [p_1(x) π_1 − p_2(x) π_2]^2 dx }^{1/2};    (5.5)

- Lissack-Fu distance [158]:
  J_L(p_1, p_2) = ∫_x |p_1(x) π_1 − p_2(x) π_2|^{α} p^{1−α}(x) dx;    (5.6)

- Kolmogorov distance [147]:
  J_K(p_1, p_2) = ∫_x |p_1(x) π_1 − p_2(x) π_2| dx;    (5.7)

where 0 < α < 1 and p(x) = p_1(x) π_1 + p_2(x) π_2. It is obvious that (i) the Bhattacharyya distance is a special case of the Chernoff distance with α = 1/2; (ii) the Hellinger distance is related to the Bhattacharyya distance as follows:

J_T = {2[1 − exp(−J_B)]}^{1/2};    (5.8)

and (iii) the Kolmogorov distance is a special case of the Lissack-Fu distance with α = 1. Some interesting properties of these distances can be found in [5, 156].

In particular, the symmetric divergence is of great interest in the information theory literature [4] and has a close connection with the famous Kullback-Leibler (KL) divergence [13]. The KL divergence or relative entropy between two densities p_1(x) and p_2(x) is given by

J_R(p_1||p_2) = ∫_x p_1(x) log { p_1(x)/p_2(x) } dx.    (5.9)

However, the KL divergence is not a true metric because neither the symmetry constraint nor the triangle inequality is satisfied. The symmetric divergence, which is symmetric, is equal to

J_D(p_1, p_2) = J_R(p_1||p_2) + J_R(p_2||p_1).    (5.10)

As mentioned earlier, computing the above probabilistic distance measures is nontrivial. Only within certain parametric families, say the Gaussian density, do we know how to analytically compute some of the above distance measures. Suppose that N(x; μ, Σ) is a multivariate Gaussian density defined as

N(x; μ, Σ) = (2π)^{−d/2} |Σ|^{−1/2} exp{ −(1/2)(x − μ)^T Σ^{−1} (x − μ) },    (5.11)

where x ∈ R^d and |·| is the matrix determinant. With p_1(x) = N(x; μ_1, Σ_1) and p_2(x) = N(x; μ_2, Σ_2), we evaluate some of the above probabilistic distance measures as follows:

- Chernoff distance:
  J_C(p_1, p_2) = (1/2) α(1−α) (μ_1 − μ_2)^T [(1−α)Σ_1 + αΣ_2]^{−1} (μ_1 − μ_2) + (1/2) log { |(1−α)Σ_1 + αΣ_2| / (|Σ_1|^{1−α} |Σ_2|^{α}) };    (5.12)

- Bhattacharyya distance:
  J_B(p_1, p_2) = (1/8) (μ_1 − μ_2)^T [(Σ_1 + Σ_2)/2]^{−1} (μ_1 − μ_2) + (1/2) log { |(Σ_1 + Σ_2)/2| / (|Σ_1|^{1/2} |Σ_2|^{1/2}) };    (5.13)

- Kullback-Leibler divergence or relative entropy:
  J_R(p_1||p_2) = (1/2) (μ_1 − μ_2)^T Σ_2^{−1} (μ_1 − μ_2) + (1/2) log (|Σ_2|/|Σ_1|) + (1/2) tr[Σ_1 Σ_2^{−1} − I_d];    (5.14)

- The symmetric divergence:
  J_D(p_1, p_2) = (1/2) (μ_1 − μ_2)^T (Σ_1^{−1} + Σ_2^{−1}) (μ_1 − μ_2) + (1/2) tr[Σ_1^{−1}Σ_2 + Σ_2^{−1}Σ_1 − 2 I_d];    (5.15)

- Patrick-Fisher distance:
  J_P(p_1, p_2) = [(2π)^d |2Σ_1|]^{−1/2} + [(2π)^d |2Σ_2|]^{−1/2} − 2 [(2π)^d |Σ_1 + Σ_2|]^{−1/2} exp{ −(1/2)(μ_1 − μ_2)^T (Σ_1 + Σ_2)^{−1} (μ_1 − μ_2) };    (5.16)

where d is the dimensionality of the random vector x and tr[·] is the matrix trace. In particular, when the covariance matrices of the two densities are the same, i.e., Σ_1 = Σ_2 = Σ, the Bhattacharyya distance and the symmetric divergence reduce to the Mahalanobis distance [160]:

J_M = J_D = 8 J_B = (μ_1 − μ_2)^T Σ^{−1} (μ_1 − μ_2).    (5.17)
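For reference, the closed-form Gaussian expressions (5.13)-(5.15) can be evaluated directly. The sketch below mirrors the equations as written and uses NumPy; the function names are assumptions for illustration.

```python
import numpy as np

def kl_divergence(mu1, S1, mu2, S2):
    """J_R(p1||p2) between N(mu1, S1) and N(mu2, S2), per Eq. (5.14)."""
    d = len(mu1)
    S2inv = np.linalg.inv(S2)
    dm = mu1 - mu2
    logdet = np.linalg.slogdet(S2)[1] - np.linalg.slogdet(S1)[1]
    return 0.5 * (dm @ S2inv @ dm + logdet + np.trace(S1 @ S2inv) - d)

def bhattacharyya(mu1, S1, mu2, S2):
    """J_B(p1, p2) per Eq. (5.13)."""
    S = 0.5 * (S1 + S2)
    dm = mu1 - mu2
    logdet = (np.linalg.slogdet(S)[1]
              - 0.5 * (np.linalg.slogdet(S1)[1] + np.linalg.slogdet(S2)[1]))
    return 0.125 * dm @ np.linalg.inv(S) @ dm + 0.5 * logdet

def symmetric_divergence(mu1, S1, mu2, S2):
    """J_D = J_R(p1||p2) + J_R(p2||p1), consistent with Eqs. (5.10) and (5.15)."""
    return kl_divergence(mu1, S1, mu2, S2) + kl_divergence(mu2, S2, mu1, S1)
```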
In this chapter, we focus only on the distances defined in (5.12)-(5.15).

5.2 Mean and Covariance Matrix in RKHS

5.2.1 First- and second-order statistics

Computing the probabilistic distance measures requires first- and second-order statistics in the RKHS, as shown in Section 5.1. In practice, we have to estimate these statistics from a set of training samples. Chapter 4 presented a detailed treatment of this topic and here we recapitulate some important points.

Suppose that {x_1, x_2, ..., x_N} are given observations in the original data space R^d. We operate in the RKHS R^f induced by a nonlinear mapping function φ: R^d → R^f, where f > d and f could even be infinite. The training samples in R^f are denoted by Φ_{f×N} = [φ_1, φ_2, ..., φ_N], where φ_n ≡ φ(x_n) ∈ R^f. Using the maximum likelihood estimation (MLE) principle, the mean μ and the covariance matrix Σ are estimated as

μ = (1/N) Σ_{n=1}^{N} φ(x_n) = Φ e;    Σ = (1/N) Σ_{n=1}^{N} (φ_n − μ)(φ_n − μ)^T = Φ J J^T Φ^T = Ψ Ψ^T,    (5.18)

where the weight vector e_{N×1} ≡ N^{−1} 1 with 1 being a vector of ones, Ψ ≡ Φ J, and J is an N × N centering matrix given by

J ≡ N^{−1/2}(I_N − e 1^T).    (5.19)

5.2.2 Covariance matrix approximation

The covariance matrix Σ in (5.18) is rank-deficient since f > N. Thus, inverting such a matrix is impossible and an approximation to the covariance matrix is necessary. Later, in Section 5.3, we show that this approximation can be made exact by studying the limiting behavior. Such an approximation S should possess the following features:

- It keeps the principal structure of the covariance matrix Σ. In other words, the dominant eigenvalues and eigenvectors of Σ and S should be the same.

- It is compact and regularized. The compactness is inspired by the fact that the smallest eigenvalues of the covariance matrix are very close to zero. Regularity is always desirable in approximation theory.

- It is easy to invert.

As shown in Chapter 4, we suggest the following approximation form:

S = ε I_f + Φ J Q Q^T J^T Φ^T = ε I_f + Φ A Φ^T,    (5.20)

where Q is an N × q matrix, A ≡ J Q Q^T J^T, and ε > 0 is a pre-specified constant. Typically, q ≪ N ≪ f. First, when Q = V_q (I_q − ε Λ_q^{−1})^{1/2} R, where V_q and Λ_q encode the top q eigenvectors and eigenvalues of the centered Gram matrix K̃, the top q eigenpairs of Σ are maintained; hence, if ε = 0, we exactly maintain the subspace containing the top q eigenpairs. Second, S is regularized and its compactness is achieved through the Q matrix. Finally, inverting S is easy using the Woodbury formula [8]: with W ≡ Φ J Q,

S^{−1} = (ε I_f + W W^T)^{−1} = ε^{−1}(I_f − W M^{−1} W^T) = ε^{−1}(I_f − Φ B Φ^T),    (5.21)

where B ≡ J Q M^{−1} Q^T J^T and the q × q matrix M is

M ≡ ε I_q + W^T W = ε I_q + Q^T K̃ Q.    (5.22)

After obtaining Q, it is easy to check that the following equations hold:

M = Λ_q,    |M| = |Λ_q| = ∏_{i=1}^{q} λ_i,    M^{−1} = Λ_q^{−1},    |S| = ε^{f−q} |Λ_q|.    (5.23)

A = J V_q (I_q − ε Λ_q^{−1}) V_q^T J^T,    B = J V_q (Λ_q^{−1} − ε Λ_q^{−2}) V_q^T J^T.    (5.24)

tr[AK] = tr[Λ_q] − εq,    tr[BK] = q − ε tr[Λ_q^{−1}].    (5.25)

5.3 The Probabilistic Distances in RKHS

Since the probabilistic distances involve two densities p_1 and p_2, we need two sets of training samples: Φ_1 for p_1 and Φ_2 for p_2. For each density p_i, we can find its corresponding e_i, J_i, μ_i, Σ_i, K_i, S_i, V_{q_i,i}, Λ_{q_i,i} = D[λ_{1,i}, λ_{2,i}, ..., λ_{q_i,i}], A_i, B_i, etc., by keeping the top q_i principal components. In general, we can have q_1 ≠ q_2 and N_1 ≠ N_2, with N_i being the number of samples for the ith density. In addition, we define the following dot product matrix:

[Φ_1  Φ_2]^T [Φ_1  Φ_2] = [[Φ_1^T Φ_1, Φ_1^T Φ_2], [Φ_2^T Φ_1, Φ_2^T Φ_2]] ≡ [[K_{11}, K_{12}], [K_{21}, K_{22}]],    (5.26)

where K_{ij} ≡ Φ_i^T Φ_j and K_{21} = K_{12}^T.
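Before deriving the distances, here is a small numerical sketch of the second-order quantities of Eqs. (5.18)-(5.25) computed from a precomputed Gram matrix. It reflects my reading of these formulas rather than the author's implementation, and the helper names are assumptions.

```python
import numpy as np

def rkhs_second_order(K, q, eps):
    """Given an N x N Gram matrix K, compute the centered Gram matrix, its top-q
    eigenpairs, and the A/B matrices, traces, and (partial) log-determinant used in
    the distance formulas (Eqs. (5.18)-(5.25)). Sketch; kernel computed outside."""
    N = K.shape[0]
    e = np.full(N, 1.0 / N)                                   # weight vector (5.18)
    J = (np.eye(N) - np.outer(e, np.ones(N))) / np.sqrt(N)    # centering matrix (5.19)
    Kt = J.T @ K @ J                                          # centered Gram matrix
    lam, V = np.linalg.eigh(Kt)
    lam, V = lam[::-1][:q], V[:, ::-1][:, :q]                 # top-q eigenpairs of Kt
    A = J @ V @ np.diag(1.0 - eps / lam) @ V.T @ J.T          # (5.24)
    B = J @ V @ np.diag(1.0 / lam - eps / lam ** 2) @ V.T @ J.T
    tr_AK = lam.sum() - eps * q                               # (5.25)
    tr_BK = q - eps * (1.0 / lam).sum()
    # log|S| = (f - q) log(eps) + sum_i log(lam_i); only the finite part is returned
    logdet_top = np.log(lam).sum()
    return dict(e=e, J=J, Kt=Kt, lam=lam, V=V, A=A, B=B,
                tr_AK=tr_AK, tr_BK=tr_BK, logdet_top=logdet_top)
```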
5.3.1 The Chernoff distance and the Bhattacharyya distance

As mentioned before, the Bhattacharyya distance is a special case of the Chernoff distance with α = 1/2; hence, we focus only on the Chernoff distance. The key quantity in computing the Chernoff distance is α_1 S_1 + α_2 S_2 with α_1 + α_2 = 1. We now analyze this quantity in detail:

α_1 S_1 + α_2 S_2 = α_1{ε I_f + Φ_1 A_1 Φ_1^T} + α_2{ε I_f + Φ_2 A_2 Φ_2^T}
 = ε I_f + α_1 Φ_1 A_1 Φ_1^T + α_2 Φ_2 A_2 Φ_2^T
 = ε I_f + [Φ_1  Φ_2] [[α_1 A_1, 0], [0, α_2 A_2]] [Φ_1  Φ_2]^T
 = ε I_f + [Φ_1  Φ_2] [[α_1 J_1 Q_1 Q_1^T J_1^T, 0], [0, α_2 J_2 Q_2 Q_2^T J_2^T]] [Φ_1  Φ_2]^T
 = ε I_f + [Φ_1  Φ_2] A_{ch} [Φ_1  Φ_2]^T,    (5.27)

where the matrix A_{ch} is rank-deficient since A_{ch} = P P^T with

P_{(N_1+N_2)×(q_1+q_2)} ≡ [[√α_1 J_1 Q_1, 0], [0, √α_2 J_2 Q_2]].    (5.28)

Therefore, the matrix α_1 S_1 + α_2 S_2 is of such a form that we can easily find its determinant and inverse. The determinant is given by

|α_1 S_1 + α_2 S_2| = ε^{f−(q_1+q_2)} |ε I_{q_1+q_2} + L| = ε^{f−(q_1+q_2)} ∏_{i=1}^{q_1+q_2} (γ_i + ε),    (5.29)

where {γ_i; i = 1, ..., q_1+q_2} are the eigenvalues of the L matrix. The L matrix is given by

L_{(q_1+q_2)×(q_1+q_2)} = P^T [Φ_1  Φ_2]^T [Φ_1  Φ_2] P = P^T [[K_{11}, K_{12}], [K_{21}, K_{22}]] P
 = [[α_1 Q_1^T J_1^T K_{11} J_1 Q_1, √(α_1 α_2) Q_1^T J_1^T K_{12} J_2 Q_2], [√(α_1 α_2) Q_2^T J_2^T K_{21} J_1 Q_1, α_2 Q_2^T J_2^T K_{22} J_2 Q_2]]
 = [[α_1(Λ_{q_1,1} − ε I_{q_1}), √(α_1 α_2) L_{12}], [√(α_1 α_2) L_{12}^T, α_2(Λ_{q_2,2} − ε I_{q_2})]],    (5.30)

with L_{12} ≡ Q_1^T J_1^T K_{12} J_2 Q_2. The inverse {α_1 S_1 + α_2 S_2}^{−1} is given by

{α_1 S_1 + α_2 S_2}^{−1} = ε^{−1}{ I_f − [Φ_1  Φ_2] B_{ch} [Φ_1  Φ_2]^T },    B_{ch} = P(ε I_{q_1+q_2} + L)^{−1} P^T.    (5.31)

We now show how to compute the two quantities needed in (5.12):

μ_i^T {α_1 S_1 + α_2 S_2}^{−1} μ_j = e_i^T Φ_i^T ε^{−1}{ I_f − [Φ_1  Φ_2] B_{ch} [Φ_1  Φ_2]^T } Φ_j e_j
 = ε^{−1}{ e_i^T K_{ij} e_j − e_i^T [K_{i1}  K_{i2}] B_{ch} [K_{1j}; K_{2j}] e_j } ≡ ε^{−1} ζ_{ij},    (5.32)

log { |α_1 S_1 + α_2 S_2| / (|S_1|^{α_1} |S_2|^{α_2}) }
 = Σ_{i=1}^{q_1+q_2} log(ε + γ_i) + (f − q_1 − q_2) log ε
   − α_1{ Σ_{i=1}^{q_1} log(λ_{i,1}) + (f − q_1) log ε }
   − α_2{ Σ_{i=1}^{q_2} log(λ_{i,2}) + (f − q_2) log ε }
 = α_1 Σ_{i=1}^{q_1+q_2} log ((ε + γ_i)/λ_{i,1}) + α_2 Σ_{i=1}^{q_1+q_2} log ((ε + γ_i)/λ_{i,2}),    (5.33)

where {λ_{i,1}; i = 1, 2, ..., q_1} and {λ_{i,2}; i = 1, 2, ..., q_2} are the eigenvalues of S_1 and S_2, respectively. Notice that (i) {λ_{i,1}; i = q_1+1, ..., q_1+q_2} and {λ_{i,2}; i = q_2+1, ..., q_1+q_2}, all equal to ε, are introduced only for notational convenience; (ii) the infinite dimensionality f in (5.32) and (5.33) has disappeared, as needed; and (iii) all calculations are based on the Gram matrix defined in (5.26).

Finally, we compute the Chernoff distance as follows (with α_1 = 1 − α and α_2 = α):

2 J_C(p_1, p_2) = ε^{−1} α_1 α_2 { ζ_{11} + ζ_{22} − 2ζ_{12} } + α_1 Σ_{i=1}^{q_1+q_2} log ((ε + γ_i)/λ_{i,1}) + α_2 Σ_{i=1}^{q_1+q_2} log ((ε + γ_i)/λ_{i,2}).    (5.34)

5.3.2 The KL divergence and the symmetric divergence

Computing the KL divergence in the RKHS reduces to collecting terms of the form μ_i^T S_j^{−1} μ_k and tr{S_i S_j^{−1}}:

μ_i^T S_j^{−1} μ_k = e_i^T Φ_i^T ε^{−1}(I_f − Φ_j B_j Φ_j^T) Φ_k e_k = ε^{−1}(e_i^T K_{ik} e_k − e_i^T K_{ij} B_j K_{jk} e_k) ≡ ε^{−1} ζ_{ijk}.    (5.35)

tr[S_i S_j^{−1}] = tr[(Φ_i A_i Φ_i^T + ε I_f) ε^{−1}(I_f − Φ_j B_j Φ_j^T)]
 = ε^{−1} tr[Φ_i A_i Φ_i^T] − ε^{−1} tr[Φ_i A_i Φ_i^T Φ_j B_j Φ_j^T] + f − tr[Φ_j B_j Φ_j^T]
 = ε^{−1} tr[A_i K_{ii}] − ε^{−1} tr[A_i K_{ij} B_j K_{ji}] + f − tr[B_j K_{jj}]
 = ε^{−1} tr[Λ_{q_i,i}] − q_i − ε^{−1} tr[A_i K_{ij} B_j K_{ji}] + f + ε tr[Λ_{q_j,j}^{−1}] − q_j
 = ε^{−1}{ tr[Λ_{q_i,i}] − ξ_{ij} } + ε tr[Λ_{q_j,j}^{−1}] + f − (q_i + q_j),    (5.36)

where ξ_{ij} ≡ tr[A_i K_{ij} B_j K_{ji}]. Finally, we obtain the KL divergence and the symmetric divergence in the RKHS by substituting (5.35) and (5.36) into (5.14) and (5.15) with d replaced by f:

2 J_R(p_1||p_2) = ε^{−1}{ ζ_{121} + ζ_{222} − ζ_{122} − ζ_{221} } + { log|Λ_{q_2,2}| − log|Λ_{q_1,1}| } + (q_1 − q_2) log ε + ε^{−1}{ tr[Λ_{q_1,1}] − ξ_{12} } + ε tr[Λ_{q_2,2}^{−1}] − (q_1 + q_2).    (5.37)
2 J_D(p_1, p_2) = ε^{−1}{ ζ_{111} + ζ_{121} + ζ_{212} + ζ_{222} − ζ_{112} − ζ_{122} − ζ_{211} − ζ_{221} } + ε^{−1}{ tr[Λ_{q_1,1}] + tr[Λ_{q_2,2}] − ξ_{12} − ξ_{21} } + ε{ tr[Λ_{q_1,1}^{−1}] + tr[Λ_{q_2,2}^{−1}] } − 2(q_1 + q_2).    (5.38)

5.3.3 The Patrick-Fisher distance

Given the derivations in Sections 5.3.1 and 5.3.2, computing the Patrick-Fisher distance J_P(p_1, p_2) is easily done by putting together the related terms:

J_P(p_1, p_2) = [2(2π)^f ε^{f−q_1} ∏_{i=1}^{q_1} λ_{i,1}]^{−1/2} + [2(2π)^f ε^{f−q_2} ∏_{i=1}^{q_2} λ_{i,2}]^{−1/2} − 2[2(2π)^f ε^{f−q_1−q_2} ∏_{i=1}^{q_1+q_2}(ε + γ_i)]^{−1/2} exp{ −ε^{−1}(ζ_{11} + ζ_{22} − 2ζ_{12}) },

where {γ_i; i = 1, 2, ..., q_1+q_2} are the eigenvalues of the L matrix defined in (5.30) with α = 1/2.

5.3.4 Limiting behavior

It is interesting to study the behavior of the distances as ε approaches zero. First,

lim_{ε→0} A = Ã ≡ J V_q V_q^T J^T,    lim_{ε→0} B = B̃ ≡ J V_q Λ_q^{−1} V_q^T J^T.    (5.39)

Then,

lim_{ε→0} ζ_{ijk} = ζ̃_{ijk} ≡ e_i^T K_{ik} e_k − e_i^T K_{ij} B̃_j K_{jk} e_k,    lim_{ε→0} ξ_{ij} = ξ̃_{ij} ≡ tr[Ã_i K_{ij} B̃_j K_{ji}].    (5.40)

Similarly,

lim_{ε→0} ζ_{ij} = ζ̃_{ij} ≡ e_i^T K_{ij} e_j − e_i^T [K_{i1}  K_{i2}] B̃_{ch} [K_{1j}; K_{2j}] e_j,    (5.41)

where B̃_{ch} = lim_{ε→0} B_{ch}. Finally,

lim_{ε→0} ε J_C(p_1, p_2) = J̃_C(p_1, p_2),    (5.42)
lim_{ε→0} ε J_R(p_1||p_2) = J̃_R(p_1||p_2),    (5.43)
lim_{ε→0} ε J_D(p_1, p_2) = J̃_D(p_1, p_2),    (5.44)

where

2 J̃_C(p_1, p_2) = α(1−α){ ζ̃_{11} + ζ̃_{22} − 2ζ̃_{12} },    (5.45)

2 J̃_R(p_1||p_2) = ζ̃_{121} + ζ̃_{222} − ζ̃_{122} − ζ̃_{221} + tr[Λ_{q_1,1}] − ξ̃_{12},    (5.46)

2 J̃_D(p_1, p_2) = ζ̃_{111} + ζ̃_{121} + ζ̃_{212} + ζ̃_{222} − ζ̃_{112} − ζ̃_{122} − ζ̃_{211} − ζ̃_{221} + tr[Λ_{q_1,1}] + tr[Λ_{q_2,2}] − ξ̃_{12} − ξ̃_{21}.    (5.47)

When α = 1/2, we obtain the limiting distance corresponding to the Bhattacharyya distance:

2 J̃_B(p_1, p_2) = (1/4){ ζ̃_{11} + ζ̃_{22} − 2ζ̃_{12} }.    (5.48)

The limiting behavior of the Patrick-Fisher distance J_P(p_1, p_2) is not interesting since it involves f, so we omit its discussion. As mentioned earlier, when ε = 0 and q_1 = q_2 = q, we actually use the subspace of the RKHS containing the top q eigenpairs. Therefore, the derived limiting distances calibrate the pattern separability on this subspace of the RKHS and carry many of the optimal features their original counterparts possess, while additionally being equipped with a nonlinear embedding.

5.3.5 Kernel for sets

A set here is a collection of observations. A kernel for sets is a two-input kernel function that takes two sets as inputs and satisfies the requirement of positive definiteness. Several kernels for sets have emerged in the literature. In [184], Wolf and Shashua proposed the kernel principal angle: the principal angle is defined as the angle between the principal subspaces of the two input sets and is then 'kernelized'. In [174], Jebara and Kondor showed that the Bhattacharyya coefficient [156], which operates on probability distributions defined on the original data space, is a kernel. In [175], they extended the Bhattacharyya kernel to operate on probability distributions defined on the RKHS. In [178], Moreno et al. proposed a kernel function based on the Kullback-Leibler divergence in the original data space.

It is obvious that our probabilistic distance measures can be adapted as kernel functions for sets. First, the Bhattacharyya kernel defined in [174] differs from the Bhattacharyya distance by −log(·). Second, the adaptation can be in the sense of [178]. Other ways are possible by utilizing the construction rules of kernel functions.

5.4 Experimental Results

In the following experiments, with both synthetic examples and a real face recognition application, we use only the limiting distances, namely J̃_C(p_1, p_2) (or J̃_B(p_1, p_2)), J̃_R(p_1||p_2), and J̃_D(p_1, p_2), since they do not depend on ε, which frees us from the burden of choosing it. Also, we set q_1 = q_2 = q.
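As a worked illustration of the limiting distances, the sketch below computes J̃_B between two sample sets from their Gram matrices, following Eqs. (5.28), (5.30), (5.41), and (5.48) at ε = 0. The RBF parameterization and all function names are assumptions for illustration, not the code used for the experiments that follow.

```python
import numpy as np

def rbf(A, B, sigma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def _top_eig(Kt, q):
    lam, V = np.linalg.eigh(Kt)
    return lam[::-1][:q], V[:, ::-1][:, :q]

def limiting_bhattacharyya(X1, X2, sigma=1.0, q=3):
    """Limiting (eps -> 0) Bhattacharyya distance in RKHS (sketch of Eqs. (5.41), (5.48))."""
    N1, N2 = len(X1), len(X2)
    K11, K12, K22 = rbf(X1, X1, sigma), rbf(X1, X2, sigma), rbf(X2, X2, sigma)
    K21 = K12.T
    e1, e2 = np.full(N1, 1 / N1), np.full(N2, 1 / N2)
    J1 = (np.eye(N1) - np.outer(e1, np.ones(N1))) / np.sqrt(N1)
    J2 = (np.eye(N2) - np.outer(e2, np.ones(N2))) / np.sqrt(N2)
    lam1, V1 = _top_eig(J1.T @ K11 @ J1, q)
    lam2, V2 = _top_eig(J2.T @ K22 @ J2, q)
    L12 = V1.T @ J1.T @ K12 @ J2 @ V2
    L = 0.5 * np.block([[np.diag(lam1), L12], [L12.T, np.diag(lam2)]])   # Eq. (5.30), eps = 0
    P = np.block([[np.sqrt(0.5) * J1 @ V1, np.zeros((N1, q))],
                  [np.zeros((N2, q)), np.sqrt(0.5) * J2 @ V2]])          # Eq. (5.28)
    Bch = P @ np.linalg.inv(L) @ P.T                                     # limit of Eq. (5.31)
    rows = {1: np.hstack([K11, K12]), 2: np.hstack([K21, K22])}
    cols = {1: np.vstack([K11, K21]), 2: np.vstack([K12, K22])}
    e = {1: e1, 2: e2}
    Kij = {(1, 1): K11, (1, 2): K12, (2, 1): K21, (2, 2): K22}

    def zeta(i, j):                                                      # Eq. (5.41)
        return e[i] @ Kij[(i, j)] @ e[j] - (e[i] @ rows[i]) @ Bch @ (cols[j] @ e[j])

    return 0.125 * (zeta(1, 1) + zeta(2, 2) - 2 * zeta(1, 2))            # Eq. (5.48)
```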
5.4.1 Synthetic examples

To illustrate the failure of the KL distance between two Gaussian fits in the original space, we designed four different 2-D densities sharing the same mean (zero mean) and covariance matrix (identity matrix). As shown in Figure 5.1, the four densities are a 2-D Gaussian and 'O'-, 'D'-, and 'X'-shaped uniform densities, where, say, the 'O'-shaped uniform density is uniform in the 'O'-shaped region and zero outside the region. Figure 5.1 actually shows 300 i.i.d. realizations sampled from these four densities. Due to the identical first- and second-order statistics, the probabilistic distance between any two of these densities in the original space is simply zero. This highlights the virtue of a nonlinear mapping that provides us with information embedded in higher-order statistics.

Figure 5.1: 300 i.i.d. realizations of four different densities with the same mean (zero mean) and covariance matrix (identity matrix). (a) 2-D Gaussian. (b) 'O'-shaped uniform. (c) 'D'-shaped uniform. (d) 'X'-shaped uniform.

Obviously, the probabilistic distances depend on q, the number of eigenpairs, and σ, the RBF kernel width. Figure 5.2 displays J̃_D and J̃_B as functions of q and σ. The effect of σ is biased: it always disfavors a large σ, since a large σ tends to pool the data together. For example, when σ is infinite, all data points collapse to one single point in the RKHS and become inseparable. Generally, it is not necessary that a large q (or, equivalently, a nonlinear subspace with a large dimension) yields a large distance; a typical subspace yielding the maximum distances is low-dimensional.

Figure 5.2: (a) The symmetric divergence J̃_D(σ, q) and (b) the Bhattacharyya distance J̃_B(σ, q) between the 2-D Gaussian and the 'O'-shaped uniform as a function of σ and q.

Table 5.1 lists some computed values of the probabilistic distances. It is interesting to observe that when the shapes of two densities are close, their distance is small. For example, 'O' is closest to 'D' among all possible pairs, and the closest density to the 2-D Gaussian is the 'O'-shaped uniform.

(a) J̃_R(p_1||p_2)   Gau     'O'     'D'     'X'
    Gau              -       .0740   .0782   .0808
    'O'              .0584   -       .0281   .0523
    'D'              .0670   .0295   -       .0436
    'X'              .0944   .0505   .0417   -

(b) J̃_B(p_1, p_2)   Gau     'O'     'D'     'X'
    Gau              -       .0033   .0037   .0048
    'O'              .0033   -       .0021   .0099
    'D'              .0037   .0021   -       .0086
    'X'              .0048   .0099   .0086   -

Table 5.1: (a) The KL distances in the RKHS with σ = 1 and q = 3. (b) The Bhattacharyya distances in the RKHS with σ = 0.5 and q = 1. p_1 is listed in the first column and p_2 in the first row.

5.4.2 Face recognition from a group of images

The gallery set consists of 15 sets of images (one per person), while the probe set consists of 15 new sets of the same people (one per person). In these sets, the people move their heads freely, so pose and illumination variations abound. The existence of these variations violates the Gaussianity assumption on the original data space used in [91]. Figure 5.3 shows some example faces of the 4th gallery person, the 9th gallery person, and the 4th probe person (whose identity is the same as that of the 4th gallery person).
The face images shown, of size 32 by 32, are automatically cropped from video sequences (courtesy of [84]) using a flow tracking algorithm.

                                       Symmetric divergence   Bhattacharyya distance
J̃(p_1, p_2) in the RKHS               13/15                  13/15
J(p_1, p_2) in the original space R^d  11/15                  11/15
Table 5.2: The recognition scores obtained using the symmetric divergence and the Bhattacharyya distance.

A generic principal component analysis is performed to reduce the dimensionality to 300. Figure 5.3 also plots the first three PCA coefficients of the 4th gallery person, the 9th gallery person, and the 4th probe person. Clearly, the manifolds are highly nonlinear, which indicates the need for nonlinear modeling.

Table 5.2 reports the recognition rates; the top match with the smallest distance is declared the winner. For comparison, we also implemented the approaches that use the symmetric divergence [91] and the Bhattacharyya distance in the original space for face recognition. Clearly, using the distances in the RKHS yields better results: out of 15 probe sets, we successfully classified 13. In fact, Figure 5.3 shows a misclassification example in [91], where the 4th probe person is misclassified as the 9th gallery person, while our approach corrects this error.

Figure 5.3: Examples of face images in the gallery and probe sets. (a) The 4th gallery person in 10 frames (every 8 frames) of an 80-frame sequence. (b) The 9th gallery person in 10 frames (every 10 frames) of a 105-frame sequence. (c) The 4th probe person in 10 frames (every 6 frames) of a 60-frame sequence. (d) The plot of the first three PCA coefficients of the above three sets.

Part III: Face Tracking and Recognition from Videos

Chapter 6

Adaptive Visual Tracking

Particle filtering [114, 157, 159, 153, 6] is an inference technique [3, 18] for estimating the unknown motion state θ_t from a noisy collection of observations y_{1:t} = {y_1, ..., y_t} arriving in a sequential fashion. A state space model is often employed to accommodate such a time series. Two important components of this approach are the state transition and observation models, whose most general forms can be defined as follows:

State transition model:  θ_t = f_t(θ_{t−1}, u_t),    (6.1)
Observation model:       y_t = g_t(θ_t, v_t),    (6.2)

where u_t is the system noise, f_t(·,·) characterizes the kinematics, v_t is the observation noise, and g_t(·,·) models the observer. The particle filter approximates the posterior distribution p(θ_t|y_{1:t}) by a set of weighted particles {θ_t^{(j)}, w_t^{(j)}}_{j=1}^{J}. The state estimate θ̂_t can then be the minimum mean square error (MMSE) estimate,

θ̂_t = θ̂_t^{mmse} = E[θ_t|y_{1:t}] ≈ J^{−1} Σ_{j=1}^{J} w_t^{(j)} θ_t^{(j)},    (6.3)

where E is the expectation operator, the maximum a posteriori (MAP) estimate,

θ̂_t = θ̂_t^{map} = arg max_{θ_t} p(θ_t|y_{1:t}) ≈ arg max_{θ_t^{(j)}} w_t^{(j)},    (6.4)

or other forms based on p(θ_t|y_{1:t}).

The state transition model characterizes the motion change between frames. In a visual tracking problem, it is ideal to have an exact motion model governing the kinematics of the object. In practice, however, approximate models are used. There are two types of approximations commonly found in the literature. (i) One is to learn a motion model directly from a training video [118, 124]. However, such a model may overfit the training data and may not necessarily succeed when presented with test videos containing objects moving arbitrarily at different times and places. Also, one cannot always rely on the availability of training data.
(ii) The other is to fit a fixed constant-velocity model with fixed noise variance, as in [109, 133, 135, 185]:

θ_t = θ_{t−1} + ν_t + u_t,    (6.5)

where ν_t is a constant velocity, i.e., ν_t = ν_0, and u_t has a fixed noise variance of the form u_t = r_0 · u_0, with r_0 a fixed constant measuring the extent of noise and u_0 a 'standardized' random variable/vector.1 Since a constant ν_0 has difficulty handling arbitrary movement, ν_0 is typically set to ν_0 = 0. If r_0 is small, it is very hard to model rapid movements; if r_0 is large, it is computationally inefficient since many more particles are needed to accommodate the large noise variance. All these factors make such a model ineffective. In this chapter, we overcome this by introducing an adaptive-velocity model.

1 Consider the scalar case for example. If u_t is distributed as N(0, σ²), we can write u_t = σ u_0, where u_0 is standard normal N(0, 1). This also applies to multivariate cases.

While the contour is the visual cue used in many tracking algorithms [118], another class of tracking approaches [115, 127, 185] exploits an appearance model A_t. In its simplest form, we have the following observation equation:2

z_t = T{y_t; θ_t} = A_t + v_t,    (6.6)

where z_t is the image patch of interest in the video frame y_t, parameterized by θ_t. In [115], a fixed template, A_t = A_0, is matched with observations to minimize a cost function in the form of a sum of squared distances (SSD). This is equivalent to assuming that the noise v_t is a normal random vector with zero mean and a diagonal (isotropic) covariance matrix. At the other extreme, one could use a rapidly changing model [127], say A_t = ẑ_{t−1}, i.e., the 'best' patch of interest in the previous frame. However, a fixed template cannot handle appearance changes in the video, while a rapidly changing model is susceptible to drift. Thus, it is necessary to have a model that is a compromise between these two cases. In [120], Jepson et al. proposed an online appearance model (OAM) for a robust visual tracker, which is a mixture of three components; two EM algorithms are used, one for updating the appearance model and the other for deriving the tracking parameters. Our approach to visual tracking is to make both the observation and state transition models adaptive in the framework of a particle filter, with provisions for handling occlusion. The main features of our tracking approach are as follows:

2 For the sake of simplicity, we denote z_t ≡ T{y_t; θ_t}, z_t^{(j)} ≡ T{y_t; θ_t^{(j)}}, and ẑ_t ≡ T{y_t; θ̂_t}. Also, we can always vectorize the 2-D image by a lexicographical scanning of all pixels and denote the number of pixels by d.

- Appearance-based. The only visual cue used in our tracker is the 2-D appearance; i.e., we employ only image intensities, though in general features derived from image intensities, such as the phase information of filter responses [120] or the Gabor feature graph representation [85], are also applicable. No prior object models are invoked. In addition, we use only gray-scale images.

- Adaptive observation model. We adopt an appearance-based approach. The original OAM is modified and then embedded in our particle filter. Therefore, the observation model is adaptive, as the appearance A_t involved in (6.6) is adaptive.

- Adaptive state transition model. Instead of using a fixed model, we use an adaptive-velocity model, where the adaptive motion velocity ν_t is predicted using a first-order linear approximation based on the appearance difference between the incoming observation and the previous particle configuration.
  We also use an adaptive noise component, i.e., u_t = r_t · u_0, whose magnitude r_t is a function of the prediction error. It is natural to vary the number of particles based on the degree of uncertainty r_t in the noise component.

- Handling occlusion. Occlusion is handled using robust statistics [11, 115, 108]. We robustify the likelihood measurement and the adaptive velocity estimate by downweighting the 'outlier' pixels. If occlusion is declared, we stop updating the appearance model and estimating the motion velocity.

Chapter organization

This chapter is organized as follows. We briefly review the related literature on visual tracking and particle filters in Section 6.1. We examine the details of an adaptive observation model in Section 6.2.1, with a special focus on the adaptive appearance model, and of an adaptive state transition model in Section 6.2.2, with a special focus on how to calculate the motion velocity. Handling occlusion is discussed in Section 6.2.3, and experimental results on tracking vehicles and human faces are presented in Section 6.3.

6.1 Related Literature

6.1.1 Visual tracking

Roughly speaking, previous work on visual tracking can be divided into two groups: deterministic tracking and stochastic tracking. Our approach combines the merits of both stochastic and deterministic tracking approaches in a unified framework using a particle filter. We give below a brief review of both approaches.

Deterministic approaches usually reduce to an optimization problem, e.g., minimizing an appropriate cost function. The definition of the cost function is a key issue. A common choice in the literature is the SSD used in many optical flow approaches [115].3 A gradient descent algorithm is most commonly used to find the minimum; very often, only a local minimum can be reached. In [115], the cost function is defined as the SSD between the observation and a fixed template, and the motion is parameterized as affine; hence the task is to find the affine parameters minimizing the cost function. Using a Taylor series expansion and keeping only the first-order terms, a linear prediction equation is obtained. It has been shown that for the affine case the system matrix can be computed efficiently since a fixed template is used. Mean shift [113] is an alternative deterministic approach to visual tracking, where the cost function is derived from the color histogram.

3 We note that using SSD is equivalent to using a model where the noise obeys an i.i.d. Gaussian distribution; therefore this case can also be viewed as stochastic tracking.

Stochastic tracking approaches often reduce to an estimation problem, e.g., estimating the state of a time series state space model. Early works [106, 112] used the Kalman filter or its variants [1] to provide solutions; however, this restricts the type of model that can be used. Recently, sequential Monte Carlo (SMC) algorithms [6, 114, 157, 159], which can model nonlinear/non-Gaussian cases, have gained prevalence in the tracking literature, due in part to the CONDENSATION algorithm [118]. Stochastic tracking improves robustness over its deterministic counterpart through its capability of escaping local minima, since the search directions are for the most part random, even though they are governed by a deterministic state transition model.
Toyama and Blake [130] proposed a probabilistic paradigm for tracking with the following properties: exemplars are learned from the raw training data and embedded in a mixture density; the kinematics is also learned; and the likelihood measurement is constructed on a metric space. Other approaches are discussed in Section 6.1.2. However, as far as computational load is concerned, stochastic algorithms are in general more demanding. Note that the stochastic approaches can often be formulated as optimization problems.

6.1.2 Particle filter

General particle filter algorithm

Given the state transition model in (6.1), characterized by the state transition probability p(θ_t|θ_{t−1}), and the observation model in (6.2), characterized by the likelihood function p(y_t|θ_t), the problem reduces to computing the posterior probability p(θ_t|y_{1:t}). The nonlinearity/non-normality in (6.1) and (6.2) renders the Kalman filter [1] ineffective. The particle filter is a means to approximate the posterior distribution p(θ_t|y_{1:t}) by a set of weighted particles S_t = {θ_t^{(j)}, w_t^{(j)}}_{j=1}^{J} with Σ_{j=1}^{J} w_t^{(j)} = 1. It can be shown [159] that S_t is properly weighted with respect to p(θ_t|y_{1:t}) in the sense that, for every bounded function h(·),

lim_{J→∞} Σ_{j=1}^{J} w_t^{(j)} h(θ_t^{(j)}) = E_p[h(θ_t)].    (6.7)

Given S_{t−1} = {θ_{t−1}^{(j)}, w_{t−1}^{(j)}}_{j=1}^{J}, which is properly weighted with respect to p(θ_{t−1}|y_{1:t−1}), we first resample S_{t−1} to reach a new set of samples with equal weights, {θ'^{(j)}_{t−1}, 1}_{j=1}^{J}. We then draw samples {u_t^{(j)}}_{j=1}^{J} for u_t and propagate θ'^{(j)}_{t−1} to θ_t^{(j)} via (6.1). The new weight is updated as

w_t ∝ p(y_t|θ_t).    (6.8)

The complete algorithm is summarized in Figure 6.1.

Initialize a sample set S_0 = {(θ_0^{(j)}, 1)}_{j=1}^{J} according to the prior distribution p(θ_0).
For t = 1, 2, ...
    For j = 1, 2, ..., J
        Resample S_{t−1} = {θ_{t−1}^{(j)}, w_{t−1}^{(j)}} to obtain a new sample (θ'^{(j)}_{t−1}, 1).
        Predict the sample by drawing u_t^{(j)} for u_t and computing θ_t^{(j)} = f_t(θ'^{(j)}_{t−1}, u_t^{(j)}).
        Compute the transformed image z_t^{(j)} = T{y_t; θ_t^{(j)}}.
        Update the weight using w_t^{(j)} = p(y_t|θ_t^{(j)}) = p(z_t^{(j)}|θ_t^{(j)}).
    End
    Normalize the weights using w_t^{(j)} = w_t^{(j)} / Σ_{j=1}^{J} w_t^{(j)}.
End

Figure 6.1: The general particle filter algorithm.
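As a concrete illustration of Figure 6.1, here is a minimal generic particle filter for a toy one-dimensional state. The function names and the toy random-walk model are assumptions for illustration only, not the tracker developed in this chapter.

```python
import numpy as np

def particle_filter(y, f, likelihood, prior_sample, J=200, rng=None):
    """Generic particle filter in the spirit of Figure 6.1.
    f(theta, u): state transition; likelihood(y_t, theta): p(y_t | theta);
    prior_sample(J): draws the initial particles."""
    rng = np.random.default_rng() if rng is None else rng
    theta = prior_sample(J)                          # S_0 with equal weights
    w = np.full(J, 1.0 / J)
    estimates = []
    for yt in y:
        # resample to equal weights
        theta = theta[rng.choice(J, size=J, p=w)]
        # predict: draw system noise and propagate through the state transition model
        u = rng.normal(0.0, 1.0, size=J)
        theta = f(theta, u)
        # update and normalize the weights with the observation likelihood
        w = likelihood(yt, theta)
        w = w / w.sum()
        estimates.append(np.sum(w * theta))          # weighted state estimate (cf. Eq. (6.3))
    return np.array(estimates)

# toy usage: random-walk state observed in Gaussian noise
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true = np.cumsum(rng.normal(0, 0.5, 50))
    obs = true + rng.normal(0, 1.0, 50)
    est = particle_filter(
        obs,
        f=lambda th, u: th + 0.5 * u,
        likelihood=lambda yt, th: np.exp(-0.5 * (yt - th) ** 2),
        prior_sample=lambda J: rng.normal(0, 1, J),
        rng=rng)
```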
We use a modified version of OAM as developed in [120]. The differences between our appearance model and the original OAM are highlighted below. Mixture appearance model The original OAM assumes that the observations are explained by different causes, thereby indicating the use of a mixture density of components. In the originalOAM 145 presented in [120], three components are used, namely the W-component charac- terizing the two-frame variations, the S-component depicting the stable structure within all past observations (though it is slowly-varying), and the L-component accounting for outliers such as occluded pixels. We modify the OAM to accommodate our appearance analysis in the following aspects. (i) We directly use the image intensities while they use phase information derived from image intensities. Direct use of image intensities is computationally more efficient than using the phase information that requires filtering and visually more interpretable. (ii) As an option, in order to further stabilize the tracker one could use an F-component which is a fixed template that one is expecting to observe most often. For example, in face tracking this could be just the facial image as seen from a frontal view. In the sequel, we derive the equations as if there is an F-component. However, the effect of this component can be ignored by setting its initial mixing probability to zero. (iii) We embed the appearance model in a particle filter to perform tracking while they use the EM algorithm. (iv) In our implementation, we do not incorporate the L-component because we model the occlusion in a different manner (using robust statistics) as discussed in Section 6.2.3. We now describe the mixture appearance model. The appearance model at time t, At = {Wt,St,Ft}, is a time-varying one that models the appearances present in all observations up to time t?1. It obeys a mixture of Gaussians, with Wt,St,Ft as mixture centers {?i,t; i = w,s,f} and their corresponding variances {?2i,t; i = w,s,f} and mixing 146 probabilities {mi,t; i = w,s,f}. Notice that {mi,t,?i,t,?2i,t; i = w,s,f} are ?images? consisting of d pixels that are assumed to be independent of each other. In summary, the observation likelihood is written as p(yt|?t) = p(zt|?t) = dproductdisplay j=1 { summationdisplay i=w,s,f mi,t(j)N(zt(j);?i,t(j),?2i,t(j))}, (6.10) where N(x;?,?2) is a normal density N(x;?,?2) = (2pi?2)?1/2 exp{??(x??? )}, ?(x) = 12x2. (6.11) Model update To keep the chapter self-contained, we show how to update the current appearance model At to At+1 after ?zt becomes available, i.e., we want to compute the new mixing probabilities, mixture centers, and variances for time t+ 1, {mi,t+1,?i,t+1,?2i,t+1; i = w,s,f}. It is assumed that the past observations are exponentially ?forgotten? with re- spect to their contributions to the current appearance model. Denote the expo- nential envelop by ?exp(???1(t ? k)) for k ? t, where ? = nh/log2, nh is the half-life of the envelope in frames, and ? = 1?exp(???1) to guarantee that the area under the envelope is 1. We just sketch the updating equations as follows and refer the interested readers to [120] for technical details and justifications. The EM algorithm [152] is invoked. Since we assume that the pixels are in- dependent of each other, we can deal with each pixel separately. The following 147 computation is valid for j = 1,2,...,d where d is the number of pixels in the appearance model. First, the posterior responsibility probabilities are computed as oi,t(j) ? 
Model update

To keep the chapter self-contained, we show how to update the current appearance model A_t to A_{t+1} after ẑ_t becomes available, i.e., we want to compute the new mixing probabilities, mixture centers, and variances for time t+1, {m_{i,t+1}, μ_{i,t+1}, σ²_{i,t+1}; i = w,s,f}.

It is assumed that the past observations are exponentially 'forgotten' with respect to their contributions to the current appearance model. Denote the exponential envelope by α exp(−τ^{−1}(t − k)) for k ≤ t, where τ = n_h/log 2, n_h is the half-life of the envelope in frames, and α = 1 − exp(−τ^{−1}) guarantees that the area under the envelope is 1. We just sketch the updating equations below and refer the interested reader to [120] for technical details and justifications.

The EM algorithm [152] is invoked. Since we assume that the pixels are independent of each other, we can deal with each pixel separately; the following computation is valid for j = 1, 2, ..., d, where d is the number of pixels in the appearance model. First, the posterior responsibility probabilities are computed as

o_{i,t}(j) ∝ m_{i,t}(j) N(ẑ_t(j); μ_{i,t}(j), σ²_{i,t}(j)),  i = w,s,f,   with  Σ_{i=w,s,f} o_{i,t}(j) = 1.    (6.12)

Then, the mixing probabilities are updated as

m_{i,t+1}(j) = α o_{i,t}(j) + (1 − α) m_{i,t}(j);  i = w,s,f,    (6.13)

and the first- and second-moment images {M_{p,t+1}; p = 1, 2} are evaluated as

M_{p,t+1}(j) = α ẑ_t^p(j) o_{s,t}(j) + (1 − α) M_{p,t}(j);  p = 1, 2.    (6.14)

Finally, the mixture centers and variances are updated as

S_{t+1}(j) = μ_{s,t+1}(j) = M_{1,t+1}(j) / m_{s,t+1}(j),    σ²_{s,t+1}(j) = M_{2,t+1}(j) / m_{s,t+1}(j) − μ²_{s,t+1}(j),    (6.15)

W_{t+1}(j) = μ_{w,t+1}(j) = ẑ_t(j),    σ²_{w,t+1}(j) = σ²_{w,1}(j),    (6.16)

F_{t+1}(j) = μ_{f,t+1}(j) = F_1(j),    σ²_{f,t+1}(j) = σ²_{f,1}(j).    (6.17)

Model initialization

To initialize A_1, we set W_1 = S_1 = F_1 = T_0 (with T_0 supplied by a detection algorithm or manually), choose {m_{i,1}, σ²_{i,1}; i = w,s,f}, and set M_{1,1} = m_{s,1} z_0 and M_{2,1} = m_{s,1} σ²_{s,1} + T_0².

6.2.2 Adaptive state transition model

The state transition model we use incorporates a term for modeling adaptive velocity. The adaptive velocity is calculated using a first-order linear prediction method based on the appearance differences between two successive frames; the previous particle configuration is incorporated in the prediction scheme.

Constructing the particle configuration involves the costly computation of image warping (in the experiments reported here, it usually accounts for about half of the computation). In a conventional particle filtering algorithm, the particle configuration is used only to update the weights, i.e., the weight for each particle is computed by comparing the warped image with the online appearance model using the observation equation. Our approach, in addition, uses the particle configuration in the state transition equation. In some sense, we 'maximally' utilize the information contained in the particles (without wasting the costly computation of image warping) since we use it in both the state and observation models.

In [128], random samples are guided by deterministic search. Momentum for each particle is computed as the sum of absolute differences between two frames. If the momentum is below a threshold, a deterministic search is first performed using a gradient descent method and a small number of offspring is then generated using stochastic diffusion; otherwise, stochastic diffusion is performed to generate a large number of offspring. The stochastic diffusion is based on a second-order autoregressive process. However, the gradient descent method does not utilize the previous particle configuration in its entirety. Also, the generated particle configuration could deviate severely from the second-order autoregressive model, which clearly implies the need for an adaptive model.
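Before moving to the velocity computation, here is a compact rendering of the appearance-model update rules (6.12)-(6.16) above; it is one possible sketch under the stated equations (the F-component is left untouched, per (6.17)), with all names assumed.

```python
import numpy as np

def oam_update(z_hat, m, mu, var, M1, M2, alpha):
    """One appearance-model update: responsibilities (6.12), mixing probabilities (6.13),
    moment images (6.14), and the S- and W-components (6.15)-(6.16). Sketch only."""
    comps = ('w', 's', 'f')
    # (6.12) posterior responsibilities, normalized per pixel
    o = {i: m[i] * np.exp(-0.5 * (z_hat - mu[i]) ** 2 / var[i]) / np.sqrt(2 * np.pi * var[i])
         for i in comps}
    tot = sum(o.values())
    o = {i: o[i] / tot for i in comps}
    # (6.13) mixing probabilities
    m_new = {i: alpha * o[i] + (1 - alpha) * m[i] for i in comps}
    # (6.14) first- and second-moment images driven by the S-responsibility
    M1_new = alpha * z_hat * o['s'] + (1 - alpha) * M1
    M2_new = alpha * z_hat ** 2 * o['s'] + (1 - alpha) * M2
    # (6.15)-(6.16) component centers and variances
    mu_new, var_new = dict(mu), dict(var)
    mu_new['s'] = M1_new / m_new['s']
    var_new['s'] = M2_new / m_new['s'] - mu_new['s'] ** 2
    mu_new['w'] = z_hat                      # W-component is reset to the newest patch
    return m_new, mu_new, var_new, M1_new, M2_new
```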
??t similarequal?Bt(T{yt; ??t}??zt?1), (6.21) where Bt is the pseudo-inverse of the Ct matrix, which can be efficiently estimated from the available data ?t?1 and Zt?1. Specifically, to estimate Bt we stack into matrices the differences in motion vectors and image patches, using ??t?1 and ?zt?1 as pivotal points: ??t?1 = [?(1)t?1 ? ??t?1, ..., ?(J)t?1 ? ??t?1], (6.22) ?Zt?1 = [z(1)t?1 ??zt?1, ..., z(J)t?1 ??zt?1]. (6.23) The least square (LS) solution for Bt is Bt = (??t?1?ZTt?1)(?Zt?1?ZTt?1)?1. (6.24) 150 However, it turns out that the matrix ?Zt?1?ZTt?1 is very often rank-deficient due to the high dimensionality of the data (unless the number of the particles at least exceeds the data dimension). To overcome this, we use the SVD as ?Zt?1 = USVT (6.25) It can be easily shown that Bt = ??t?1VS?1UT. (6.26) To gain some computational efficiency, we can further approximate Bt = ??t?1VqS?1q UTq , (6.27) by retaining the top q components. Notice that if only a fixed template is used [121], the B matrix is fixed and pre-computable. But, in our case, the appearance is changing so that we have to compute the Bt matrix in each time step. In practice, one may run several iterations till ?zt = T{yt; ??t + ?t} stabilizes, i.e., the error epsilon1t defined below is small enough. epsilon1t = ?(?zt,At) = 2d dsummationdisplay j=1 { summationdisplay i=w,s,f mi,t(j)?(?zt(j)??i,t(j)? i,t(j) )}. (6.28) In (6.28), epsilon1t measures the distance between T{yt; ??t + ?t} and the updated ap- pearance model At. The iterations proceed as follows: We initially set ??1t = ??t?1. For the first iteration, we compute ?1t as usual. For the kth iteration, we use the predicted ??kt = ??k?1t + ?k?1t as a pivotal point for the Taylor expansion in (6.19) and the rest of the calculation then follows. It is rather beneficial to run sev- eral iterations especially when the object moves very fast in two successive frames since ??t?1 might cover the target in yt in a small portion. After one iteration, the computed ?t might be not accurate, but indicates a good minimization direction. Using several iterations helps to find ?t (compared to ??t?1) more accurately. 151 We use the following adaptive state transition model ?t = ??t?1 + ?t +ut, (6.29) where ?t is the predicted shift in the motion vector. The choice of ut is discussed below. One should note that we are not using (6.29) as a proposal function to draw particles, which requires using (6.9) to compute the particle weight. Instead we directly use it as the state transition model and hence use (6.8) to compute the particle weight. Our model can be easily interpreted as a time-varying state model. It is interesting to note that the approach proposed in [131] also uses motion cues as well as color parameter adaptation. Our approach is different from [131] in that: (i) We use the motion cue in the state transition model while they use it as part of observations; (ii) We only use the gray images without using the color cue which is used in [131]; and (iii) We use an adaptive appearance model which is updated by the EM algorithm while they use an adaptive color model which is updated by a stochastic version of the EM algorithm. Adaptive noise The value of epsilon1t determines the quality of prediction. 
Therefore, if epsilon1t is small, which implies a good prediction, we only need noise with small variance to absorb the residual motion; if epsilon1t is large, which implies a poor prediction, we then need noise with large variance to model the potentially large jumps in the motion state. To this end, we use ut of the form ut = rt?u0, where rt is a function of epsilon1t. Since epsilon1t defined in (6.28) is a ?variance?-type measure, we use rt = max(min(r0?epsilon1t,rmax),rmin), (6.30) 152 where rmin is the lower bound to maintain a reasonable sample coverage and rmax is the upper bound to constrain the computational load. Adaptive number of particles If the noise variance rt is large, we need more particles, while conversely, fewer particles are needed for noise with small variance rt. Based on the principle of asymptotic relative efficiency (ARE) [3], we should adjust the particle number Jt in a similar fashion, i.e., Jt = J0rt/r0. (6.31) Fox [154] also presents an approach to improve the efficiency of particle filters by adapting the particle numbers on-the-fly. His approach is to divide the state space into bins and approximate the posterior distribution by a multinomial distribution. A small number of particles is used if the density is focused on a small part of the state space and a large number of particles if the uncertainty in the state space is high. In this way, the error between the empirical distribution and the true distribution (approximated as a multinomial in his analysis) measured by Kullback-Leilber distance is bounded. However, in his approach, since the state space (only 2D) is exhaustively divided, the number of particles is at least several thousand, while our approach uses at most a few hundred. Our attempt is not to explore the state space (6-D affine space) exhaustively, but only regions that have high potential for the object to be present. 153 Comparison between the adaptive velocity model and the zero velocity model We demonstrate the necessity of the adaptive velocity model by comparing it with the zero velocity model. Figure 6.2 shows the particle configurations created from the adaptive velocity model (with Jt < J0 and rt < r0 computed as above) and the zero velocity model (with Jt = J0 and rt = r0). Clearly, the adaptive- velocity model generates particles very efficiently, i.e, they are tightly centered around the object of interest so that we can easily track the object at time t; while the zero-velocity model generates more particles widely spread to explore larger regions, leading to unsuccessful tracking as widespread particles often lead to a local minimum. Tracking result at t?1 Particle configuration at t Tracking result at t Figure 6.2: Particle configurations from (top row) the adaptive velocity model and (bottom row) the zero-velocity model. 6.2.3 Handling occlusion Occlusion is usually handled in two ways. One way is to use joint probabilistic data associative filter (JPDAF) [2, 126]; and the other one is to use robust statistics 154 [11]. We use robust statistics here. Robust statistics We assume that occlusion produces large image differences which can be treated as ?outliers?. Outlier pixels cannot be explained by the underlying process and their influences on the estimation process should be reduced. Robust statistics provide such mechanisms. We use the ?? function defined as follows: ??(x) = ?? ?? ??? 1 2x 2 if |x| ? c cx? 12c2 if |x| > c , (6.32) where x is normalized to have unit variance and the constant c controls the outlier rate. 
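For reference, this robust cost and the associated outlier test are simple to implement. The sketch below is illustrative only; the function names are ours, and the threshold c is left as a parameter whose experimental value is given next.

```python
import numpy as np

def rho(x, c):
    # Huber-type robust cost of Eq. (6.32); x is assumed to be normalized
    # to unit variance, and c controls the outlier rate.
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= c, 0.5 * x**2, c * np.abs(x) - 0.5 * c**2)

def outlier_mask(x, c):
    # Pixels whose normalized residual exceeds c are declared outliers.
    return np.abs(np.asarray(x, dtype=float)) > c
```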
In our experiment, we take c = 1.435 based on experimental experience. If |x| > c is satisfied, we declare the corresponding pixel as an outlier. Robust likelihood measure and adaptive velocity estimate The likelihood measure defined in Eq. (6.10) involves a multi-dimensional normal density. Since we assume that each pixel is independent, we consider the one- dimensional normal density. To make the likelihood measure robust, we replace the one-dimensional normal density N(x;?,?2) by ?N(x;?,?2) = (2pi?2)?1/2 exp(???(x??? )). (6.33) Note that this is not a density function any more, but since we are dealing with discrete approximation in the particle filter, normalization makes it a probability mass function. Existence of outlier pixels severely violates the constant brightness constraint and hence affects our estimate of the adaptive velocity. To downweight the influ- 155 ence of the outlier pixels in estimating the adaptive velocity, we introduce a d?d diagonal matrix Lt with its ith diagonal element being Lt(i) = ?(xi) where xi is the pixel intensity of the difference image (T{yt; ??t}??zt?1) normalized by the variance of the OAM stable component and ?(x) = 1xd??(x)dx = ? ??? ??? 1 if |x| ? c c/|x| if |x| > c , (6.34) Eq. (6.21) becomes ?t similarequal ?BtLt(T{yt; ??t?1}??zt?1). (6.35) This is similar in principle to the weighted least square algorithm. Occlusion declaration If the number of the outlier pixels in?zt (compared with the OAM), say dout, exceeds a certain threshold, i.e., dout > ?d where 0 < ? < 1 (we take ? = 0.15), we declare occlusion. Since the OAM has more than one component, we count the number of outlier pixels with respect to every component and take the maximum. If occlusion is declared, we stop updating the appearance model and estimating the motion velocity. Instead, we (i) keep the current appearance model, i.e., At+1 = At and (ii) set the motion velocity to zero, i.e., ?t = 0 and use the maximum number of particles sampled from the diffusion process with largest variance, i.e., rt = rmax, and Jt = Jmax. The adaptive particle filtering algorithm with occlusion analysis is summarized in Figure 6.3. 156 Initialize a sample set S0 = {?(j)0 ,1/J0)}J0j=1 according to prior distribution p(?0). Initialize the appearance model A1. Set OCCFLAG = 0 to indicate no occlusion. For t = 1,2,... If (OCCFLAG == 0) Calculate the state estimate ??t?1 by Eq. (6.3) or (6.4), the adaptive velocity ?t by Eq. (6.21), the noise variance rt by Eq. (6.30), and the particle number Jt by Eq. (6.31). Else rt = rmax, Jt = Jmax, ?t = 0. End For j = 1,2,...,Jt Draw the sample u(j)t for ut with variance rt. Construct the sample ?(j)t = ??t?1 + ?t + u(j)t by Eq. (6.29). Compute the transformed image z(j)t . Update the weight using w(j)t = p(yt|?(j)t ) = p(z(j)t |?(j)t ). End Normalize the weight using w(j)t = w(j)t /summationtextJj=1 w(j)t . Set OCCFLAG according to the number of the outlier pixels in ?zt. If (OCCFLAG == 0) Update the appearance model At+1 using ?zt. End End Figure 6.3: The proposed visual tracking algorithm with occlusion handling. 6.3 Experimental results on visual tracking In our implementation, we used the following choices. We consider affine transfor- mation only. Specifically, the motion is characterized by ? = (a1,a2,a3,a4,tx,ty) where {a1,a2,a3,a4}are deformation parameters and{tx,ty}denote the 2-D trans- 157 lation parameters. 
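The next paragraph spells out the complete transformation T{y;θ} step by step (affine warp, crop, zero-mean-unit-variance normalization). As a rough sketch only, and under conventions the text does not fix (we assume the 2x2 deformation matrix is laid out as [[a1, a2], [a3, a4]], that (tx, ty) locate the patch centre, and we use nearest-neighbour sampling), the patch extraction could look like this:

```python
import numpy as np

def extract_patch(frame, theta, patch_shape):
    # T{y; theta}: affine-warp the frame, crop a patch of patch_shape at
    # (tx, ty), and apply zero-mean-unit-variance normalization (the only
    # photometric correction used in the experiments).
    a1, a2, a3, a4, tx, ty = theta
    A = np.array([[a1, a2], [a3, a4]], dtype=float)      # assumed layout of the deformation
    h, w = patch_shape
    ys, xs = np.mgrid[0:h, 0:w]
    grid = np.stack([ys - (h - 1) / 2.0, xs - (w - 1) / 2.0]).reshape(2, -1)
    src = A @ grid + np.array([[ty], [tx]])              # assumed: (tx, ty) is the patch centre
    r = np.clip(np.round(src[0]).astype(int), 0, frame.shape[0] - 1)
    c = np.clip(np.round(src[1]).astype(int), 0, frame.shape[1] - 1)
    patch = frame[r, c].reshape(h, w).astype(float)      # nearest-neighbour sampling
    return (patch - patch.mean()) / (patch.std() + 1e-8)
```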
Even though significant pose/illumincation changes are present in the video, we believe that our adaptive appearance model can easily absorb them and therefore for our purposes the affine transformation is a reasonable approxi- mation. Regarding photometric transformations, only a zero-mean-unit-variance normalization is used to partially compensate for contrast variations. The com- plete image transformation T{y;?} is implemented as follows: affine transform y using {a1,a2,a3,a4}, crop out the region of interest at position {tx,ty} with the same size as the still template in the appearance model, and perform zero-mean- unit-variance normalization. We demonstrate our algorithm by tracking a disappearing car, a moving tank acquired by a camera mounted on a micro air vehicle, and a moving face under occlusion. Table 6.1 summarizes some statistics about the video sequences and the appearance model size used. We initialize the particle filter and the appearance model with a detector algo- rithm (we actually used the face detector described in [132] for the face sequence) or a manually specified image patch in the first frame. r0 and J0 are also manually set, depending on the sequence. 6.3.1 Car tracking We first test our algorithm to track a vehicle with the F-component but without occlusion analysis. The result of tracking a fast moving car is shown in Figure 6.4 (column 1)4. The tracking result is shown with a bounding box. We also show the stable and wandering components separately (in a double-zoomed size) at the corner of each frame. The video is captured by a camera mounted on the 4Accompanying videos are available at http://www.cfar.umd.edu/?shaohua/research/. 158 Video Car Tank Face # of frames 500 300 800 Frame size 576x768 240x360 240x360 At size 24x30 24x30 30x26 Occlusion No No Yes (twice) ?adp? o o x ?fa? o o x ?fm? x x x ?fb? x x x ?adp & occ? o o o Table 6.1: Comparison of tracking results obtained by particle filters with different configurations. ?At size? means pixel size in the component(s) of the appearance model. ?o? means success in tracking. ?x? means failure in tracking. car. In this footage the relative velocity of the car with respect to the camera platform is very large, and the target rapidly decreases in size. Our algorithm?s adaptive particle filter successfully tracks this rapid change in scale. Figure 6.5(a) plots the scale estimate (calculated as radicalBig (a21 + a22 + a23 + a24)/2 ) recovered by our algorithm. It is clear that the scale follows a decreasing trend as time proceeds. The pixels located on the car in the final frame are about 12 by 15 in size, which makes the vehicle almost invisible. In this sequence we set J0 = 50 and r0 = 0.25. The algorithm implemented in a standard Matlab environment processes about 1.2 frames per second (with J0 = 50) running on a PC with a PIII 650 CPU and 512M memory. 159 Frame 1 Frame 100 Frame 300 Frame 500 Figure 6.4: The car sequence. Notice the fast scale change present in the video. Column 1: the tracking results obtained with an adaptive motion model and an adaptive appearance model (?adp?). Column 2: the tracking results obtained with an adaptive motion model but a fixed appearance model (?fa?). In this case, the corner shows the tracked region. Column 3: the tracking results obtained with an adaptive appearance model but a fixed motion model (?fm?). 6.3.2 Tank tracking in an aerial video Figure 6.6 shows our results on tracking a tank in an aerial video with degraded image quality due to motion blur. 
Also, the movement of the tank is very jerky and arbitrary because of platform motion, as seen in Figure 6.5(b) which plots the 160 0 100 200 300 400 5000 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 time scale estimate 0 50 100 150 200 250 300 350 0 50 100 150 200 column index row index 0 50 100 150 200 250 30060 70 80 90 100 110 120 130 140 time particle number (a) (b) (c) 0 50 100 150 200 250 3000 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 time mean square error adp fa 0 100 200 300 400 500 600 700 8001 1.5 2 2.5 3 3.5 time scale estimate (d) (e) Figure 6.5: (a) The scale estimate for the car. (b) The 2-D trajectory of the centroid of the tracked tank. ?*? means the starting and ending points and ?.? points are marked along the trajectory every 10 frames. (c) The particle number Jt vs. t obtained when tracking the tank. (d) The MSE invoked by the ?adp? and ?fa? algorithms. (e) The scale estimate for the face sequence. 2-D trajectory of the centroid of the tracked tank every 10 frames, covering from the left to the right in 300 frames. Although the tank moved about 100 pixels in column index in a certain period of 10 frames, the tracking is still successful. Figure 6.5(c) displays the plot of actual number of particles Jt as a function of time t. The average number of particle is about 83, where we set J0 to be 100, which means that in this case we actually saved about 20% in computation by using an adaptive Jt instead of a fixed number of particles. To further illustrate the importance of the adaptive appearance model, we 161 Frame 1 Frame 31 Frame 49 Frame 116 Frame 228 Frame 300 Figure 6.6: Tracking a moving tank in a video acquired by an airborne camera. computed the mean square error (MSE) invoked by two particle filter algorithms, one (referred as?adp? inSection 6.3.4)using the adaptive appearance model and the other (referred as ?fa? in Section 6.3.4) using a fixed appearance model. Computing the MSE for the ?fa? algorithm is straightforward, with T0 denoting the fixed template, MSEfa(t) = d?1 dsummationdisplay j=1 (?zt(j)?T0(j))2. (6.36) Computing the MSE for the ?adp? algorithm is as follows: MSEadp(t) = d?1 dsummationdisplay j=1 { summationdisplay i=w,s,f mi,t(?zt(j)??i,t(j))2}. (6.37) Figure 6.5(d) plots the functions of MSEfa(t) and MSEadp(t). Clearly, using the adaptive appearance model invokes smaller MSE for almost all 300 frames. The average MSE for the ?adp? algorithm is 0.1394 5 while that for the ?fa? algorithm is 0.3169! 5The range of MSE is very reasonable since we are using image patches after the zero-mean- unit-variance normalization not the raw image intensities. 162 6.3.3 Face tracking We present one example of successful tracking of a human face using a hand-held video camera in an office environment, where both camera and object motion are present. Figure 6.7 presents the tracking results on the video sequence featuring the following variations: moderate lighting variations, quick scale changes (back and forth)in themiddle of thesequence, andocclusion (twice). The results areobtained by incorporating the occlusion analysis in the particle filter, but we did not use the F-component. Notice that the adaptive appearance model remains fixed during occlusion. Figure 6.8 presents the tracking results obtained using the particle filter without occlusion analysis. We have found that the predicted velocity actually accounts for the motion of the occluding hand since the outlier pixels (mainly on the hand) dominate the image difference (T{yt; ??t}??zt?1). 
Updating the appearance model deteriorates the situation. Figure 6.5(e) plots the scale estimate against time t. We clearly observe a rapid scale change (a sudden increase followed by a decrease within about 50 frames) in the middle of the sequence (though hard to display the recovered scale estimates are in perfect synchrony with the video data). 6.3.4 Comparison We illustrate the effectiveness of our adaptive approach (?adp?) by comparing the particle filter either with (a) an adaptive motion model but a fixed appearance model (?fa?), or with (b) a fixed motion model but an adaptive appearance model (?fm?); or with (c) a fixed motion model and a fixed appearance model (?fb?). Table 163 Frame 1 Frame 145 Frame 148 Frame 155 Frame 470 Frame 517 Frame 685 Frame 695 Frame 800 Figure 6.7: The face sequence. Frames 145, 148, and 155 show the first occlusion. Frames 470 and 517 show the smallest and largest face observed. Frames 685, 690, and 710 show the second occlusion. 6.1 lists the tracking results obtained using particle filters under the above situa- tions, where ?adp & occ? refers to the adaptive approach with occlusion handling. Figure 6.4 also shows the tracking results on the car sequence when the ?fa? and ?fm? options are used. Table 6.1 seems to suggest that the adaptive motion model plays a more im- portant role than the adaptive appearance model since ?fa? always yields successful tracking while ?fm? fails, the reasons being that (i) the fixed motion model is unable to adapt to quick motion present in the video sequences, and (ii) the appearance 164 Frame 1 Frame 145 Frame 148 Frame 155 Frame 170 Frame 200 Figure 6.8: Tracking results on the face sequence using the adaptive particle filter without occlusion analysis. changes in the video sequences, though significant in some cases, are still within the range of the fixed appearance model. However, as seen in the videos, ?adp? produces much smoother tracking results than ?fa?, demonstrating the power of the adaptive appearance model. 165 Chapter 7 Simultaneous Tracking and Recognition Following [58], we define a still-to-video scenario: the gallery consists of still facial templates and the probe set consists of video sequences containing the facial re- gion. Denote the gallery as I = {I1,I2,...,IN}, indexed by the identity variable n, which lies in a finite sample space N = {1,2,...,N}. Though significant research has been conducted on the still-to-still face recognition problem, research efforts on still-to-video recognition, are relatively fewer due to the following challenges [27] in typical surveillance applications: poor video quality, significant illumina- tion and pose variations, and low image resolution. Most existing video-based recognition systems [79] attempt the following: the face is first detected and then tracked over time. Only when a frame satisfying certain criteria (size, pose) is acquired, recognition is performed using still-to-still recognition technique. For this, the face part is cropped from the frame and transformed or registered using appropriate transformations. This tracking-then-recognition approach attempts to resolve uncertainties in tracking and recognition sequentially and separately. 166 There are several unresolved issues in the tracking-then-recognition approach: criteria for selecting good frames and estimation of parameters for registration. Also, still-to-still recognition does not effectively exploit temporal information. 
A common strategy that selects several good frames, performs recognition on each frame and then votes on these recognition results for a final solution is rather ad hoc. To overcome these difficulties, we propose a tracking-and-recognition approach, which attempts to resolve uncertainties in tracking and recognition simultaneously in a unified probabilistic framework. To fuse temporal information, the time series state space model is adopted to characterize the evolving kinematics and identity in the probe video. Three basic components of the model are: ? a motion equation governing the kinematic behavior of the tracking motion vector, ? an identity equation governing the temporal evolution of the identity variable, ? an observation equation establishing a link between the motion vector and the identity variable. Using the SIS [114, 118, 153, 157, 159] technique, the joint posterior distribution of the motion vector and the identity variable, i.e., p(nt,?t|y0:t) 1 is estimated at each time instant and then propagated to the next time instant governed by mo- tion and identity equations. The marginal distribution of the identity variable, i.e., p(nt|y0:t), is estimated to provide a recognition result. An SIS algorithm is developed to approximate the distribution p(nt|y0:t) in the still-to-video scenario. 1For notational convenience, e.g. in (7.5) and (7.6), we introduce in this chapter a dummy variable y0. 167 It achieves computational efficiency over its CONDENSATION counterpart by con- sidering the discrete nature of the identity variable. It is worth emphasizing that (i) our model can take advantage of any still-to- still recognition algorithm [41, 44, 48, 62] by embedding distance measures used therein in our likelihood measurement; and (ii) it allows a variety of image repre- sentations and transformations. Section 7.3.4 presents an enhancement technique by incorporating the sophisticated appearance-based models in Chapter 6. The ap- pearance models are used for tracking (modeling inter-frame appearance changes) and recognition (modeling appearance changes between video frames and gallery images), respectively. Table 7.1 summarizes the proposed approach and others, in term of using temporal information. Process Operation Temporal information Visual tracking Modeling the inter-frame Used in tracking differences Visual recognition Modeling the difference between Not applicable probe and gallery images Tracking-then-recognition Combining tracking and Used only in tracking recognition sequentially Tracking-and-recognition Unifying tracking and Used in both tracking recognition and recognition Table 7.1: Use of temporal information in various tracking/recognition processes. Chapter organization The organization of the chapter is as follows: Section 7.1 reviews some related stud- ies on (i) face modeling and recognition and (ii) video-based tracking and recog- 168 nition in the literature. Section 7.2 introduces the time series state space model for recognition and establishes the time-evolving behavior of p(nt|y0:t). Section 7.2.3 briefly reviews the SIS principles from the viewpoint of a general state space model and develops a SIS algorithm to solve the still-to-video recognition prob- lem, with special emphasis on its computational efficiency. Section 7.3 describes the experimental scenarios for still-to-video recognition and presents results using data collected at UMD, NIST/USF, and CMU (MoBo database) as part of the DARPA HumanID effort. 
7.1 Related Literature 7.1.1 Face modeling and recognition Statistical approaches to face modeling have been very popular since Turk and Pentland?s work on eigenface [62]. In the statistical approach, the two-dimensional appearance of face image is treated as a vector by scanning the image in lexico- graphical order, with the vector dimension being the number of pixels in the image. In the eigenface approach [62], all face images consists of a distinctive face sub- space. This subspace is linear and spanned by the eigenvectors of the covariance matrix found using PCA. Typically we keep the number of eigenvectors much less than the true dimension of the vector space. The task of face recognition is then to find the closest matches in this face subspace. However, PCA might not be ef- ficient in terms of recognition accuracy since the construction of the face subspace does not capture discrimination between humans. This motivates the use of LDA [41, 44] and its variants. In LDA, the linear subspace is constructed [7] in such a manner that the within-class scatter is minimized and the between-class scatter is 169 maximized. This idea is further generalized in the approach called Bayesian face recognition [55], where intra-personal space (IPS) and extra-personal space (EPS) are used in lieu of within-class scatter and between-class scatter measures. The IPS models the variations in the appearance of the same individual and the EPS models the variations in appearances due to differences in the identity. Probabilis- tic subspace density is then fitted on each space. A Bayesian decision is taken using a maximum a posteriori (MAP) rule to determine the identity. In the famous EGM [48] algorithm, the face is represented as a labeled graph. The nodes of the graph are located at facial landmarks, e.g., the pupils, the tip of nose, etc. Also, each node is labeled with jets derived from responses obtained by convolving the image with a family of Gabor functions. The edge characterizes the geometric distance between two nodes. Face recognition is then formalized as a graph matching problem. All the above approaches are based on 2-D appearance and perform poorly when significant pose and illumination variations are present [58]. To completely resolve such challenges, 3-D face modeling [66, 83] is necessary. However, building a 3-D face model is a very difficult and complicated task in the literature even though structure from motion has been studied for several decades. 7.1.2 Video-based tracking and recognition Nearly all video-based recognition systems apply still-image-based recognition to selected good frames. The face images are warped into frontal views whenever pose and depth information about the faces is available [79]. In [82, 90, 93], RBF (Radial Basis Function) networks are used for tracking and recognition purposes. In [82], the system uses an RBF (Radial Basis Function) 170 network for recognition. Since no warping is done, the RBF network has to learn the individual variations as well as possible transformations. The performance appears to vary widely, depending on the size of the training data. [93] presents a fully automatic person authentication system. The system uses video break, face detection, and authentication modules and cycles over successive video images until a high recognition confidence is reached. 
This system was tested on three image sequences; the first was taken indoors with one subject present, the second was taken outdoors with two subjects, and the third was taken outdoors with one subject in stormy conditions. Perfect results were reported on all three sequences, when verified against a database of 20 still face images. In [92], a system called PersonSpotter is described. This system is able to capture, track and recognize a person walking toward or passing a stereo CCD camera. It has several modules, including a head tracker, and a landmark finder. The landmark finder uses a dense graph consisting of 48 nodes learned from 25 example images to find landmarks such as eyes and nose tip. An elastic graph matching scheme is employed to identify the face. A multimodal based person recognition system is described in [79]. This system consists of a face recognition module, a speaker identification module, and a classi- fier fusion module. The most reliable video frames and audio clips are selected for recognition. The 3D head information is used to detect the presence of an actual person as opposed to an image of that person. Recognition and verification rates of 100% were achieved for 26 registered clients. In [87, 88], recognition of face over time is implemented by constructing a face identity surface. The face is first warped to a frontal view, and its Kernel Discriminant Analysis (KDA) features over time form a trajectory. It is shown 171 that the trajectory distances accumulate recognition evidence over time. In [86], a generic approach to simultaneous object tracking and verification is proposed. The approach is based on posterior probability density estimation using sequential Monte Carlo methods [118, 153, 157, 159]. Tracking is formulated as a probability density propagation problem and the algorithm also provides verifi- cation results. However, no systematic evaluation of recognition was done. Our approach looks similar to this algorithm; however, there are significant differences from the algorithm described in [86]. (i) In [86], basically only the tracking motion vector is parameterized in the state-space model. The identity is involved only in the initialization step to rectify the template onto the first frame of the sequence. However, in our approach both tracking motion vector and identity variables are parameterized in the state-space model, which offers us one more degree of freedom and leads to a different approach for deriving the solution. (ii) The SIS technique is used in both approaches to numerically approximate the posterior probability given the observation. Again in [86], it is the posterior probability of motion vector and the verification probability is estimated by marginalizing over a proper region of state space redefined at each time instant. However, we always compute the joint density, i.e., the posterior probability of motion vector and identity variable and the posterior probability of identity variable is just a free estimate obtained by marginalizing over the motion vector. Note that there is no time propagation of verification probability in [86] while we always propagate the joint density. One consequence is that we guarantee that summationtextnt?N p(nt|y0:t) = 1, but there is no such guarantee in [86]. 172 7.2 StochasticModelsand Algorithmsfor Recog- nition from Video In this section, we present the details on the propagation model for recognition and discuss its impact on the posterior distribution of identity variable. 
7.2.1 Time series state space model Motion equation In its most general form, the motion model can be written as ?t = g(?t?1,ut); t ? 1, (7.1) where ut is noise in the motion model, whose distribution determines the motion state transition probability p(?t|?t?1). The function g(.,.) characterizes the evolv- ing motion and it could be a function learned o?ine or given a priori. One of the simplest choice is an additive function, i.e., ?t = ?t?1 + ut, which leads to a first-order Markov chain. Choice of ?t is application dependent. Affine motion parameters are often used when there is no significant pose variation available in the video sequence. However, if a 3-D face model is used, then the 3-D motion parameters should be used accordingly. Identity equation nt = nt?1; t ? 1, (7.2) assuming that the identity does not change as time proceeds. 173 Observation equation By assuming that the transformed observation is a noise-corrupted version of some still template in the gallery, the observation equation can be written as T{yt;?t} = Int +vt; t ? 1, (7.3) where vt is observation noise at time t, whose distribution determines the observa- tion likelihood p(yt|nt,?t), and T{yt;?t} is a transformed version of the observation yt. This transformation could be either geometric or photometric or both. How- ever, when confronting sophisticated scenarios, this model is far from sufficient. One should use the complicated likelihood measurement as shown in Section 7.3.2. We assume statistical independence between all noise variables and prior knowl- edge on the distributions p(?0|y0) and p(n0|y0). Using the overall state vector xt = (nt,?t), Eq. (7.1) and (7.2) can be combined into one state equation (in a normal sense) which is completely described by the overall state transition proba- bility p(xt|xt?1) = p(nt|nt?1)p(?t|?t?1) . (7.4) Given this model, our goal is to compute the posterior probability p(nt|y0:t). It is in fact a probability mass function (PMF) since nt only takes values from N = {1,2,...,N}, as well as a marginal probability of p(nt,?t|y0:t), which is a mixed- type distribution. Therefore, the problem is reduced to computing the posterior probability. 7.2.2 Posterior probability of identity variable The evolution of the posterior probability p(nt|y0:t) as time proceeds is very in- teresting to study as the identity variable does not change by assumption, i.e., 174 p(nt|nt?1) = ?(nt ?nt?1), where ?(.) is a discrete impulse function at zero. Using time recursion, Markov properties, and statistical independence embed- ded in the model, we can easily derive: p(n0:t,?0:t|y0:t) = p(n0:t?1,?0:t?1|y0:t?1)p(yt|nt,?t)p(nt|nt?1)p(?t|?t?1)p(y t|y0:t?1) = p(n0,?0|y0) tproductdisplay s=1 p(ys|ns,?s)p(ns|ns?1)p(?s|?s?1) p(ys|y0:s?1) = p(n0|y0)p(?0|y0) tproductdisplay s=1 p(ys|ns,?s)?(ns ?ns?1)p(?s|?s?1) p(ys|y0:s?1) .(7.5) Therefore, by marginalizing over ?0:t and n0:t?1, we obtain p(nt = l|y0:t) = p(l|y0) integraldisplay ?0 ... integraldisplay ?t p(?0|y0) tproductdisplay s=1 p(ys|l,?s)p(?s|?s?1) p(ys|y0:s?1) d?t ...d?0. (7.6) Thus p(nt = l|y0:t) is determined by the prior distribution p(n0 = l|y0) and the product of the likelihood functions, producttextts=1 p(ys|l,?s). If a uniform prior is assumed, then producttextts=1 p(ys|l,?s) is the only determining factor. In the appendix, we show that, under some minor assumptions, the poste- rior probability for the correct identity l, p(nt = l|y0:t), is lower-bounded by an increasing curve which converges to 1. 
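To make the recursion in (7.5)-(7.6) concrete, the toy sketch below (our own illustration, not the experimental code) updates the identity PMF by multiplying in per-frame likelihoods that are assumed to have already been marginalized over the motion samples, for instance by the particle filter described later. Because the identity is fixed over time, even a small but consistent likelihood advantage for the correct identity drives its posterior toward one, which is the behaviour formalized above.

```python
import numpy as np

def update_identity_posterior(log_post, frame_loglik):
    # One recursion of p(n_t | y_{0:t}) in Eq. (7.6): multiply the running
    # posterior by the marginalized frame likelihood of each identity and
    # renormalize so the PMF sums to one.
    log_post = log_post + frame_loglik
    return log_post - np.logaddexp.reduce(log_post)

# Toy illustration with synthetic log-likelihoods.
N, T = 5, 40
rng = np.random.default_rng(0)
log_post = np.full(N, -np.log(N))        # uniform prior p(n_0 | y_0)
for t in range(T):
    loglik = rng.normal(0.0, 0.3, size=N)
    loglik[2] += 0.4                     # identity 2 is the correct one in this toy
    log_post = update_identity_posterior(log_post, loglik)
print(np.exp(log_post))                  # the mass concentrates on identity 2
```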
To measure the evolving uncertainty remaining in the identity variable as ob- servations accumulate, we use the notion of entropy [4]. In the context of this problem, conditional entropy H(nt|y0:t) is used. However, the knowledge of p(y0:t) is needed to compute H(nt|y0:t). We assume that it degenerates to an impulse at the actual observations ?y0:t since we observe only this particular sequence, i.e., p(y0:t) = ?(y0:t ??y0:t). Thus, H(nt|y0:t) = ? Nsummationdisplay nt=1 p(nt|?y0:t)log2 p(nt|?y0:t). (7.7) Under the assumptions listed in the appendix, we expect that H(nt|y0:t) decreases 175 as time proceeds since we start from an equi-probable distribution to a degenerate one. 7.2.3 SIS algorithms and computational efficiency Consider a general time series state space model fully determined by (i) the over- all state transition probability p(xt|xt?1), (ii) the observation likelihood p(yt|xt), and (iii) prior probability p(x0) and statistical independence among all the noise variables. We wish to compute the posterior probability p(xt|y0:t). If the modelis linear withGaussian noise, itis analytically solvable by a Kalman filter which essentially propagates the mean and variance ofa Gaussian distribution over time. For nonlinear and non-Gaussian cases, an extended Kalman filter (EKF) and its variants have been used to arrive at an approximate analytic solution [1]. Recently, the SIS technique or particle filter algorithm, a special case of Monte Carlo method, [118, 153, 157, 159] has been used to provide a numerical solution and propagate an arbitrary distribution over time. However, since we are dealing with a mixed-type distribution, additional properties are available to be exploited when developing the SIS algorithms. First, two following two propositions are useful. Proposition 7.1 When pi(x) is a PMF defined on a finite sample space, the proper sample set should exactly include all samples in the sample space. Proposition 7.2 If a set of weighted random samples {(x(m),y(m),w(m))}Mm=1 is proper with respect to pi(x,y), then a new set of weighted random samples {(yprime(k),wprime(k))}Kk=1, which is proper with respect to pi(y), the marginal of pi(x,y), can be constructed as follows: 1) Remove the repetitive samples from {y(m)}Mm=1 to obtain {yprime(k)}Kk=1, where all 176 yprime(k)?s are distinct; 2) Sum the weight w(m) belonging to the same sample yprime(k) to obtain the weight wprime(k), i.e., wprime(k) = Msummationdisplay m=1 w(m)?(y(m) ?yprime(k)) (7.8) In the context of this framework, the posterior probability p(nt,?t|y0:t) is rep- resented by a set of indexed and weighted samples St = {(n(m)t ,?(m)t ,w(m)t )}Mm=1 (7.9) with nt as the above index. By Proposition 7.2, we can sum the weights of the samples belonging to the same index nt to obtain a proper sample set {nt,?nt}Nnt=1 with respect to the posterior PMF p(nt|y0:t). A straightforward implementation of the particle filter algorithm (Figure 7.1) for simultaneous tracking and recognition is not efficient in terms of its compu- tational load. Since N = {1,2,...,N} is a countable sample space, we need N samples for the identity variable nt according to Proposition 7.1. Assume that, for each identity variable nt, J samples are needed to represent ?t. Hence, we need M = J ?N samples in total. Further assume that one resampling step takes Tr seconds (s), one predicting step Tp s, computing one transformed image Tt s, evaluating likelihood once Tl s, one updating step Tu s. 
Obviously, the bulk of computation is J?N?(Tr+Tp+Tt+Tl) s to deal with one video frame as the com- putational time for the normalizing step and the marginalizing step is negligible. It is well known that computing the transformed image is much more expensive than other operations, i.e., Tt >> max(Tr,Tp,Tl). Therefore, as the number of templates N grows, the computational load increases dramatically. There are various approaches in the literature to reduce the computational cost of the conventional particle filter algorithm. In [128], random particles are guided 177 Initialize a sample set S0 = {(n(m)0 ,?(m)0 ,1)}Mm=1 according to prior distribu- tions p(n0|y0) and p(?0|y0). For t = 1,2,... For m = 1,2,...,M Resample St?1 = {(n(m)t?1,?(m)t?1,w(m)t?1)}Mm=1 to obtain a new sample (nprime(m)t?1 ,?prime(m)t?1 ,1). Predict a sample by drawing (n(m)t ,?(m)t ) from p(nt|nprime(m)t?1 ) and p(?t|?prime(m)t?1 ). Compute the transformed image z(m)t = T{yt;?(m)t }. Update the weight using ?(m)t = p(yt|n(m)t ,?(m)t ). End Normalize each weight using w(m)t = ?(m)t /summationtextMm=1 ?(m)t . Marginalize over ?t to obtain the weight ?nt for nt. End Figure 7.1: The conventional particle filter algorithm for simultaneous tracking and recognition. by deterministic search. Assumed density filtering approach [148], different from particle filter, is even more efficient. Those approaches are general and do not explicitly exploit the special structure of the distribution in this setting: a mixed distribution of continuous and discrete variables. To this end, we propose the following algorithm. As the sample space N is countably finite, an exhaustive search of sample space N is possible. Mathematically, we release the random sampling in the identity variable nt by constructing samples as follows: for each ?(j)t , (1,?(j)t ,w(j)t,1),(2,?(j)t ,w(j)t,2),...,(N,?(j)t ,w(j)t,N). 178 We in fact use the following notation for the sample set, St = {(?(j)t ,w(j)t ,w(j)t,1,w(j)t,2,...,w(j)t,N)}Jj=1, (7.10) with w(j)t = summationtextNn=1 w(j)t,n. The proposed algorithm is summarized in Figure 7.2. Initialize a sample set S0 = {(?(j)0 ,N,1,...,1)}Jj=1 according to prior distribu- tion p(?0|z0). For t = 1,2,... For j = 1,2,...,J Resample St?1 = {(?(j)t?1,w(j)t?1)}Jj=1 to obtain a new sample (?prime(j)t?1,1,wprime(j)t?1,1,...,wprime(j)t?1,N), where wprime(j)t?1,n = w(j)t?1,n/w(j)t?1 for n = 1,2,...,N. Predict a sample by drawing (?(j)t ) from p(?t|?prime(j)t?1). Compute the transformed image z(m)t = T{yt;?(m)t }. For n = 1,...,N Update the weight using ?(j)t,n = wprime(j)t?1,n ?p(yt|n,?(j)t ). End End Normalize each weight using w(j)t,n = ?(j)t,n/summationtextNn=1summationtextJj=1 ?(j)t,n and w(j)t = summationtextNn=1 w(j)t,n. Marginalize over ?t to obtain the weight ?nt for nt. End Figure 7.2: The computationally efficient particle filter algorithm for simultaneous tracking and recognition. The crux of this algorithm lies in the fact that, instead of propagating random samples on both motion vector and identity variable, we can keep the samples on the identity variable fixed and let those on the motion vector be random. Although 179 we propagate only the marginal distribution for motion tracking, we still propagate the joint distribution for recognition purposes. The bulk of computation of the proposed algorithm is J?(Tr+Tp+Tt)+J?N?Tl s, a tremendous improvement over the conventional particle filter when dealing with a large database since the majority computational time J?Tt does not depend on N. 
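A compact Python rendering of one time step of the algorithm in Figure 7.2 is sketched below. It is illustrative only: warp(), likelihood(), and propagate() are hypothetical callables standing in for T{.;.}, p(y_t|n,θ_t), and p(θ_t|θ_{t-1}), and the (J, N) weight array mirrors the sample set in (7.10). The structural point is visible in the inner loops: the costly warp is computed once per motion sample and reused across all N identities.

```python
import numpy as np

def efficient_sis_step(thetas, weights, frame, gallery, warp, likelihood,
                       propagate, rng):
    # One time step of the efficient simultaneous tracking-and-recognition
    # filter of Figure 7.2.
    #   thetas  : (J, p) motion samples from the previous step
    #   weights : (J, N) per-sample, per-identity weights w_{t-1,n}^{(j)}
    J, N = weights.shape
    total = weights.sum(axis=1)                         # w_{t-1}^{(j)}
    idx = rng.choice(J, size=J, p=total / total.sum())  # resample on the total weight
    cond = weights[idx] / total[idx, None]              # w'_{t-1,n} after resampling
    thetas = np.array([propagate(thetas[j], rng) for j in idx])
    new_w = np.empty((J, N))
    for j in range(J):
        patch = warp(frame, thetas[j])                  # one costly warp per particle ...
        for n in range(N):                              # ... shared by all N identities
            new_w[j, n] = cond[j, n] * likelihood(patch, gallery[n])
    new_w /= new_w.sum()                                # normalize over all (j, n)
    beta = new_w.sum(axis=0)                            # marginal posterior p(n_t | y_{0:t})
    return thetas, new_w, beta
```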
7.3 Still-to-Video Face Recognition Experiments In this section we describe the still-to-video scenarios used in our experiments and their practical model choices, followed by a discussion of experiments. Three databases are used in the still-to-video experiments. Database-0 was collected outside a building. Subjects walked straight towards a video camera inorder tosimulate typical scenarios in visual surveillance. Database- 0 includes one face gallery, and one probe set. The images in the gallery are listed in Figure 7.3. The probe contains 12 videos, one for each individual. Figure 7.3 gives some frames in a probe video. In Database-1, we have video sequences with subjects walking in a slant path towards the camera. There are 30 subjects, each having one face template. There are one face gallery and one probe set. The face gallery is shown in Figure 7.4. The probe contains 30 video sequences, one for each subject. Figure 7.4 gives some example frames extracted from one probe video. As far as imaging conditions are concerned, the gallery is very different from the probe, especially in lighting. This is similar to the ?FC? test protocol of the FERET test [58]. These images/videos were collected, as part of the HumanID project, by National Institute of Standards and Technology and University of South Florida researchers. 180 Database-2, Motion of Body (MoBo) database, was collected at the Carnegie Mellon University [81] under the HumanID project. There are 25 different indi- viduals in total. The video sequences show the individuals walking on a tread-mill so that they move their heads naturally. Different walking styles have been simu- lated to assure a variety of conditions that are likely to appear in real life: walking slowly, walking fast, inclining and carrying an object. Therefore, four videos per person and 99 videos in total ( with one carrying video missing ) are available. However, the probe set we use in this section includes only 25 slowWalk videos. Some example images of the videos (slowWalk) are shown in Figure 7.5. Figure 7.5 also shows the face gallery in Database-2 with face images in almost frontal view cropped from probe videos and then normalized using their eye positions. Table 7.2 summaries the features of the three databases. Database Database-0 Database-1 Database-2 No. of subjects 12 30 25 Gallery Frontal face Frontal face Frontal face Motion in probe Walking straight Walking in an angle Walking towards the camera towards the camera on tread-mill Illumination variation No Large No Pose variation No Slight Large Table 7.2: Summary of three databases experimented. 7.3.1 Results for Database-0 We consider an affine transformation. Specifically, the motion is characterized by ? = (a1,a2,a3,a4,tx,ty) where {a1,a2,a3,a4} are deformation parameters and {tx,ty}are2-Dtranslation parameters. Itis a reasonable approximation since there 181 Figure 7.3: Database-0. The 1st row: the face gallery with image size being 30?26. The 2nd and 3rd rows: 4 example frames in one probe video with image size being 320?240 while the actual face size ranges approximately from 30?30 in the first frame to 50?50 in the last frame. Notice that the sequence is taken under a well- controlled condition so that there are no illumination or pose variations between the gallery and the probe. is no significant out-of-plane motion as the subjects walk towards the camera. 
Re- garding the photometric transformation, only zero-mean-unit-variance operator is performed to partially compensate for contrast variations. The complete trans- formation T{y;?} is processed as follows: affine transform y using {a1,a2,a3,a4}, 182 Figure 7.4: Database-1. The 1st row: the face gallery with image size being 30?26. The 2nd and 3rd rows: 4 example frames in one probe video with image size being 720?480 while the actual face size ranges approximately from 20?20 in the first frame to 60?60 in the last frame. Notice the significant illumination variations between the probe and the gallery. crop out the interested region at position {tx,ty} with the same size as the still template in the gallery, and perform zero-mean-unit-variance operation. Prior distribution p(?0|y0) is assumed to be Gaussian, whose mean comes from 183 Figure 7.5: Database-2. The 1st row: the face gallery with image size being 30?26. The 2nd and 3rd rows: some example frames in one probe video (slowWalk). Each video consists of 300 frames (480?640 pixels per frame) captured at 30 Hz. The inner face regions in these videos contain between 30?30 and 40?40 pixels. Notice the significant pose variation available in the video. the initial detector and whose covariance matrix is manually specified. A time-invariant first-order Markov Gaussian model with constant velocity is used for modeling motion transition. Given the scenario that the subject is walking 184 towards the camera, the scale increases with time. However, under perspective projection, this increase is no longer linear, causing the constant-velocity model to be not optimal. However, experimental results show that as long as the samples of ? can cover the motion, this model is sufficient. The likelihood measurement is simply set as a ?truncated? Laplacian: p1(yt|nt,?t) = LAP(bardblT{yt;?t}?Intbardbl;?1,?1) (7.11) where, bardbl.bardbl is sum of absolute distance, ?1 and ?1 are manually specified, and LAP(x;?,?) = ? ??? ??? ??1 exp(?x/?) if x ? ?? ??1 exp(??) otherwise (7.12) Gaussian distribution is widely used as a noise model, accounting for sensor noise, digitization noise, etc. However, given the observation equation: vt = T{yt;?t}? Int, the dominant part of vt becomes the high-frequency residual if ?t is not proper, and it is well known that the high-frequency residual of natural images is more Laplacian-like. The ?truncated? Laplacian is used to give a ?surviving? chance for samples to accommodate abrupt motion changes. Figure 7.6 presents the plot of the posterior probability p(nt|y0:t), the condi- tional entropy H(nt|y0:t) and the minimum mean square error (MMSE) estimate of the scale parameter sc = radicalBig (a21 + a22 + a23 + a24)/2, all against t. In Figure 7.3, the tracked face is superimposed on the image using a bounding box. Suppose the correct identity for Figure 7.3 is l. From Figure 7.6, we can easily observe that the posterior probability p(nt = l|y0:t) increases as time proceeds and eventually approaches 1, and all others p(nt = j|y0:t) for j negationslash= l go to 0. Figure 7.6 also plots the decrease in conditional entropy H(nt|y0:t) and the increase in scale parameter, which matches with the scenario of a subject walking towards a camera. 185 Table 7.3 summarizes the average recognition performance and computational time of the conventional and the proposed particle filter algorithm when applied to Database-0. Both algorithms achieved 100% recognition rate with top match. 
The proposed algorithm is much more efficient than the conventional one. It is more than 10 times faster as shown in Table I. This experiment was implemented in C++ on a PC with P-III 1G CPU and 512M RAM with the number of motion samples J chosen to be 200, the number of templates in the gallery N to be 12. 5 10 15 20 25 30 35 40 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 time t posterior probability p(n t|z 0:t ) 5 10 15 20 25 30 35 40 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 time t posterior probability p(n t|z 0:t ) 5 10 15 20 25 30 35 40 0 0.5 1 1.5 2 2.5 3 3.5 time t conditional entropy H(n t|z 0:t ) 5 10 15 20 25 30 35 40 0.9 1 1.1 1.2 1.3 1.4 1.5 time t scale estime Figure 7.6: Posterior probability p(nt|y0:t) against time t, obtained by the CONDENSATION algorithm (top left) and the proposed algorithm (top right). Con- ditional entropy H(nt|y0:t) (bottom left) and MMSE estimate of scale parameter sc (bottom right) against time t. The conditional entropy and the MMSE estimate are obtained using the proposed algorithm. 186 Algorithm Conventional algorithm Efficient algorithm Recognition rate within top 1 match 100% 100% Time per frame 7s 0.5s Table 7.3: Recognition performance of algorithms when applied to Database-0. 7.3.2 Results for Database-1 Case 1: Tracking and Recognition using Laplacian Density We first investigate the performance using the same setting as described in Section 7.3.1. In other words, we still use the affine transformation, first-order Markov Gaussian state transition model, ?truncated? Laplacian observation likelihood, etc. Table 7.4 shows that the recognition rate is very poor, only 13% are correctly identified using top match. The main reason is that the ?truncated? Laplacian density is farfrom sufficient to capture the appearance difference between the probe and the gallery, thereby indicating a need for a different appearance modeling. Nevertheless, the tracking accuracy 2 is reasonable with 83% successfully tracked because we are using multiple face templates in the gallery to track the specific face in the probe video. After all, faces in both the gallery and the probe belong to the same class of human face and it seems that the appearance change is within the class range. 2We manually inspect the tracking results by imposing the MMSE motion estimate on the final frame as shown in Figs. 7.3 and 7.4 and determine if tracking is successful or not for this sequence. This is done for all sequences and tracking accuracy is defined as the ratio of the number of sequences successfully tracked to the total number of all sequences. 187 Case 2: Pure Tracking using Laplacian Density In Case 2, we measure the appearance change within the probe video as well as the noise in the background. To this end, we introduce a dummy template T0, a cut version in the first frame of the video. Define the observation likelihood for tracking as p2(yt|?t) = LAP(bardblT{yt;?t}?T0bardbl;?2,?2), (7.13) where ?2 and ?2 are set manually. The other setting, such as motion parameter and model, is the same as in Case 1. We still can run the CONDENSATION algorithm to perform pure tracking. Table 7.4 shows that87% are successfully tracked by this simple tracking model, which implies that the appearance within the video remains similar. Case Case 1 Case 2 Case 3 Case 4 Case 5 Tracking accuracy 83% 87% 93% 100% NA Recognition w/in top 1 match 13% NA 83% 93% 57% Recognition w/in top 3 matches 43% NA 97% 100% 83% Table 7.4: Performances of algorithms when applied to Database-1. 
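For reference, the "truncated" Laplacian of (7.12), used by the likelihoods in Cases 1 and 2, is straightforward to implement. The sketch below is illustrative; the function names are ours, and the scale and truncation parameters are supplied manually as in the text.

```python
import numpy as np

def truncated_laplacian(x, sigma, lam):
    # LAP(x; sigma, lambda) of Eq. (7.12): a Laplacian density whose tail is
    # clipped at lambda * sigma so that samples corresponding to abrupt motion
    # changes keep a small "surviving" probability.
    x = np.asarray(x, dtype=float)
    return np.where(x <= lam * sigma,
                    np.exp(-x / sigma) / sigma,
                    np.exp(-lam) / sigma)

def likelihood_case1(warped_patch, template, sigma, lam):
    # p1(y_t | n_t, theta_t) of Eq. (7.11): truncated Laplacian of the sum of
    # absolute differences between the warped patch and a gallery template.
    sad = np.abs(warped_patch - template).sum()
    return truncated_laplacian(sad, sigma, lam)
```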
Case 3: Tracking and Recognition using Probabilistic Subspace Density As mentioned in Case 1, we need a new appearance model to improve the recog- nition accuracy. As reviewed in Section 7.1.1, there are various approaches in the literature. We decided to use the approach suggested by Moghaddam et al. [55] due to its computational efficiency and high recognition accuracy. However, in our implementation, we model only intra-personal variations instead of both intra/extra-personal variations for simplicity. 188 We need at least two facial images for one identity to construct the intra- personal space (IPS). Apart from the available gallery, we crop out the second image from the video ensuring no overlap with the frames actually used in probe videos. Figure 7.7 (top row) shows a list of such images. Compare with Figure 7.4 to see how the illumination varies between the gallery and the probe. We then fit a probabilistic subspace density [56] on top of the IPS. It proceeds as follows: a regular PCA is performed for the IPS. Suppose the eigensystem for the IPS is {(?i,ei)}di=1, where d is the number of pixels and ?1 ? ... ? ?d. Only top s principal components corresponding to top s eigenvalues are then kept while the residual components are considered as isotropic. We refer the reader to the original paper [56] for the full details. Figure 7.7 (middle row) shows the eigenvectors for the IPS. The density is written as follows: QIPS(x) = {exp(? 1 2 summationtexts i=1 y2i ?i) (2pi)s/2producttextsi=1 ?1/2i }{ exp(?epsilon122?) (2pi?)(d?s)/2}, (7.14) where yi = eTi xfori = 1,...,s is theith principal component ofx, epsilon12 = bardblxbardbl2?summationtextsi=1 y2i is the reconstruction error, and ? = (summationtextdi=s+1 ?i)/(d ? q). It is easy to write the likelihood as follows: p3(yt|nt,?t) = QIPS(T{yt;?t}?Int). (7.15) Table 7.4 lists the performance by using this new likelihood measurement. It turns out that the performance is significantly better that in Case 1, with 93% tracked successfully and 83% recognized within top 1 match. If we consider the top 3 matches, 97% are correctly identified. Case 4: Tracking and Recognition using Combined Density In Case 2, we have studied appearance changes within a video sequence. In Case 3, we have studied the appearance change between the gallery and the probe. In 189 Figure 7.7: Database-1. Top row: the second facial images for estimating prob- abilistic density. Middle row: top 10 eigenvectors for the IPS. Bottom row: the facial images cropped out from the largest frontal view. Case 4, we attempt to take advantage of both cases by introducing a combined likelihood defined as follows: p4(yt|nt,?t) = p3(yt|nt,?t)p2(yt|?t) (7.16) Again, all other setting is the same as in Case 1. We now obtain the best perfor- mance so far: no tracking error, 93% are correctly recognized as the first match, and no error in recognition when top 3 matches are considered. 190 Case 5: Still-to-still Face Recognition To make a comparison, we also performed an experiment on still-to-still face recog- nition. We selected the probe video frames with the best frontal face view (i.e. biggest frontal view) and cropped out the facial region by normalizing with respect to the eye coordinates manually specified. This collection of images is shown in Figure 7.7 (bottom row) and it is fed as probes into a still-to-still face recognition system with the learned probabilistic subspace as in Case 3. 
It turns out that the recognition result is 57% correct for the top one match, and 83% for the top 3 matches. The cumulative match curves for Case 1 and Cases 3-5 are presented in Figure 7.8. Clearly, Case 4 is the best among all. We also implemented the original algorithm by Moghaddam et al. [56], i.e., both intra/extra-personal variations are considered, the recognition rate is similar to that obtained in Case 5. 7.3.3 Results for Database-2 The recognition result for Database-2 is presented in Figure 7.8, using the cumu- lative match curve. We still use the same setting as in Case 1 of section 7.3.2. However, due to the pose variations present in the database, using one frontal view is not sufficient to represent all the appearances under different poses and the recognition rate is hence not so high, 56% when only the top match is considered and 88% when top 3 matches are considered. We do not use probabilistic subspace modeling for this database because such modeling requires manually cropping out multiple templates for each individual. Also, pre-selecting video frames from the same probe video and ensuring that they do not overlap with the probe frames is time-consuming. What is desirable is to automatically select such templates from different sources other than the probe video. Since we have multiple videos 191 available for one individual in Database-2, this motivates us to obtain more repre- sentative views for one face class, leading to the discussions in [194]. 5 10 15 20 25 30 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 rank cumulative match score Case1 Case 3 Case 4 Case 5 5 10 15 20 250 0.2 0.4 0.6 0.8 1 rank cumulative match score Figure 7.8: Cumulative match curves for Database-1 (left) and Database-2 (right). 7.3.4 Enhanced results Visual tracking models the inter-frame appearance differences and visual recogni- tion models the appearance differences between video frames and gallery images. Simultaneous tracking and recognition provides a mechanism of jointly modeling inter-frame appearance differences and the appearance differences between video frames and gallery images. As in Section 7.3.2, this joint modeling of appearance differences in both tracking and recognition in one framework actually improves both tracking and recognition accuracies over approaches that separate tracking and recognition as two tasks. The more effective the model choices are, improved performance in tracking and recognition is expected. We explore this avenue by incorporating the models used in Chapter 6. We use the same adaptive-velocity motion model (6.29) and the same identity equation (7.2). The observation likelihood is modified to combine contributions (or scores) from both tracking and recognition in the likelihood yields the best performance in both tracking and recognition. 192 To compute the tracking score pa(yt|?t) which measures the inter-frame appear- ance changes, we use the appearance model introduced in Section 6.2.1 and the quantity defined in (6.10) as pa(yt|?t). To compute the recognition score which measures the appearance changes be- tween probe videos and gallery images, we assume the same model as in (7.3), i.e., the transformed observation is a noise-corrupted version of some still template in the gallery, and the noise distribution determines the recognition score pn(yt|nt,?t). We will physically define this quantity below. 
To fully exploit the fact that all gallery images are in frontal view, we also compute below how likely the patch zt is in frontal view and denote this score by pf(yt|?t). If the patch is in frontal view, we accept a recognition score; otherwise, we simply set the recognition score as equiprobable among all identities, i.e., 1/N. The complete likelihood p(yt|nt,?t) is now defined as p(yt|nt,?t) ? pa {pf pn + (1?pf) N?1}. (7.17) Model components in detail ? A. Modeling inter-frame appearance changes Inter-frame appearance changes are related to the motion transition model and the appearance model for tracking, which were explained in Sections 6.2.1 and 6.2.2. ? B. Being in frontal view Since all gallery images are in frontal view, we simply measure the extent of being frontal by fitting a probabilistic subspace (PS) density on the top of the gallery images [54, 56], assuming that they are i.i.d. samples from the 193 frontal face space (FFS). pf(yt|?t) is written as follows: pf(yt|?t) = QFFS(zt), (7.18) where the density Q(.) is defined same as that in (7.14). ? C. Modeling appearance changes between probe video frames and gallery im- ages WeadopttheMAPrule developed in[56]fortherecognitionscore pn(yt|nt,?t). Two subspaces are constructed to model appearance variations. The IPS is meant to cover all the variations in appearances belonging to the same person while the EPS is used to cover all the variations in appearances belonging to different people. More than one facial image per person is needed to con- struct the IPS. Apart from the available gallery, we crop out four images from the video ensuring no overlap with frames used in probe videos. The above PS density estimation method is applied separately to the IPS and the EPS, yielding two different eigensystems. The recognition score pn(yt|nt,?t) is finally computed as, assuming equal priors on the IPS and the EPS, pn(yt|nt,?t) = QIPS(zt ?Int)Q IPS(zt ?Int)+ QEPS(zt ?Int) . (7.19) D. Proposed algorithm We adjust the particle number Jt based on the following considerations. (i) The first issue is same as (6.31) based on the prediction error. (ii) As shown above, the uncertainty in the identity variable nt is characterized by an entropy measure Ht for p(nt|y1:t) and Ht is a non-increasing function (under one weak assumption). Accordingly, we increase the number of particles by a fixed amount Jfix if Ht 194 Initialize a sample set S0 = {?(j)0 ,w(j)0 = 1/J0)}J0j=1 according to prior distribution p(?0). Set ?0,l = 1/N. Initialize the appearance mode A1. For t = 1,2,... Calculate the MAP estimate ??t?1, the adaptive motion shift ?t by Eq. (6.21), the noise variance rt by Eq. (6.30), and particle number Jt by Eq. (7.20). For j = 1,2,...,Jt Draw the sample u(j)t for ut with variance Rt. Construct the sample ?(j)t by Eq. (6.29). Compute the transformed image z(j)t . For l = 1,2,...,N Update the weight using ?(j)t,l = ?t?1,lp(yt|l,?(j)t ) = ?t?1,lp(z(j)t |l,?(j)t ) by Eq. (7.17). End End Normalize the weight using w(j)t,l = ?(j)t,l /summationtextj,l ?(j)t,l and compute w(j)t = summationtextj w(j)t,l and ?t,l =summationtextj w(j)t,l . Update the appearance model At+1 using ?zt. End Figure 7.9: The visual tracking and recognition algorithm. increases; otherwise we deduct Jfix from Jt. Combining these two, we have Jt = J0 rtr 0 + Jfix ?(?1)i[Ht?1 1 such that, p(yt|nt = l,?t) ? ?p(yt|nt = j,?t); t ? 1,j ? N,j negationslash= l. (7.22) Substitution of Eq. (7.21) and (7.22) into Eq. (7.6) gives rise to p(nt = l|y0:t) = 1N integraldisplay ?0 ... 
integraldisplay ?t p(?0|y0) tproductdisplay s=1 p(ys|ns = l,?s)p(?s|?s?1) p(ys|y0:s?1) d?t ...d?0 ? 1N integraldisplay ?0 ... integraldisplay ?t p(?0|y0) tproductdisplay s=1 ?p(ys|ns = j,?s)p(?s|?s?1) p(ys|y0:s?1) d?t ...d?0 198 = ? t N integraldisplay ?0 ... integraldisplay ?t p(?0|z0) tproductdisplay s=1 p(ys|ns = j,?s)p(?s|?s?1) p(ys|y0:s?1) d?t ...d?0 = ?tp(nt = j|y0:t); j ? N,j negationslash= l, (7.23) where ?t = producttextts=1 ?. More interestingly, from Eq. (7.23), we have (N ?1)p(nt = l|y0:t) ? ?t Nsummationdisplay j=1,jnegationslash=l p(nt = j|y0:t) = ?t(1?p(nt = l|y0:t)), (7.24) i.e., p(nt = l|y0:t) ? h(?,t), (7.25) where h(?,t) = ? t ?t + N ?1. (7.26) Eq. (7.25) has two implications. 1. Since the function h(?,t) which provides a lower bound for p(nt = l|y0:t) is monotonically increasing against time t, p(nt = l|y0:t) has a probable trend of increase over t, even though not in a monotonic manner. 2. Since ? > 1 and p(nt = l|y0:t) ? 1, limt??p(nt = l|y0:t) = 1, (7.27) implying that p(nt = l|y0:t) degenerates in the identity l for some sufficiently large t. However, all these derivations are based on assumptions (A) and (B). Though it is easy to satisfy (A), difficulty arises in practice in order to satisfy (B) for all the frames in the sequence. Fortunately, as we have seen in the experiment in Section 7.3, numerically this degeneracy is still reached even if (B) is satisfied only for most but not all frames in the sequence. 199 Appendix 7.II: More on assumption (B) A trivial choice for ? is the lower bound on the likelihood ratio, i.e., ? = inf t?1,jnegationslash=l,?t?? p(yt|nt = l,?t) p(yt|nt = j,?t). (7.28) This choice is of theoretical interest. In practice, how good is the assumption (B) satisfied? Figure 7.13 plots against the logarithm of the scale parameter, the ?average? likelihood of the correct identity, 1 N summationdisplay n?N p(In|n,?), and that of the incorrect identities, 1 N(N ?1) summationdisplay m?N,n?N,mnegationslash=n p(Im|n,?), of the face gallery as well as the ?average? likelihood ratio, i.e., the ratio between the above two quantities. The observation is that only within a narrow ?band? the condition (B) is well satisfied. Therefore, the success of SIS algorithm depends on how good the samples lie in a similar ?band? in the high-dimensional affine space. Also, the lower bound ? in assumption (B) is too strict. If we take the mean of the ?average? likelihood ratio shown in Figure 7.13 as an estimate of ? ( roughly 1.5 ), Eq. (7.25) tells that, after 20 frames, the probability p(l|y0:t) reaches 0.99! However, this is not reached in the experiments due to noise in the observations and incomplete parameterization of transformations. 200 ?1 ?0.5 0 0.5 10 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 log of scale parameter likelihood correct incorrect ?1 ?0.5 0 0.5 10 1 2 3 4 5 6 7 8 log of scale parameter likelihood ratio Figure 7.13: Left: The ?average? likelihood of the correct hypothesis and incorrect hypotheses against the log of scale parameter. Right: The ?average? likelihood ratio against the log of scale parameter. 201 Chapter 8 Probabilistic Identity Characterization Visual face recognition is an important task. Even though a lot of research has been carried out, state-of-the-art recognizers still yield unsatisfactory results especially when confronted with pose and illumination variations. 
In addition, the recognizers are further complicated by the registration requirement, since the images that a recognizer processes contain transformed appearances of the object. Below, we simply use the term "transformation" to cover all the variations involved, be it registration, pose, or illumination variation.

While most recognizers process a single image, there is a growing interest in using a group of images [80, 84, 88, 89, 91, 184, 185]. Depending on the transformations embedded in the group and the temporal continuity between them, the group can be either independent or dependent. Examples of the independent group (I-group) are face databases that store multiple appearances for one object. Examples of the dependent group are video sequences. If the temporal information is stripped, video sequences reduce to I-groups. In this chapter, whenever we mention video sequences, we mean dependent groups of images.

Approaches that use I-groups can be roughly divided into two categories. The first category is based on manifold matching. In [88], hypothetical identity surfaces are constructed by computing the linear coefficients of a view space; illumination variations are not accounted for, and discriminant features are then extracted to overcome other variations. In [80], a manifold is formed for every I-group and recognition is performed by computing the shortest distance between two manifolds. The manifold takes a certain parameterized form and the parameters are learned directly from the visual appearances; robustness to pose and illumination variations is not reported. The second category is based on statistical learning. In [91], a multivariate Gaussian density is fitted to every I-group, and recognition is achieved by computing the Kullback-Leibler distance [4] between two Gaussian densities. However, the Gaussian assumption is easily violated if pose and illumination variations exist. In [184], a principal subspace is learned for each I-group and the principal angles between two principal subspaces are used for recognition; the principal angles can also be computed in the feature space induced by kernel functions. One common disadvantage of the above approaches is that they assume the face regions have already been cropped beforehand, using either a detector or a tracker.

Approaches using video sequences utilize temporal information for recognition as well. In [185], simultaneous tracking and recognition is implemented in a probabilistic framework: the joint posterior probability of the tracking parameter and the identity variable is approximated using the SIS algorithm, and the marginal posterior probability of the identity variable is used for recognition. However, only an affine localization parameter is used for tracking, and pose and illumination variations are not considered. In addition, exemplars are learned from the gallery videos to cover pose and illumination variations. In [89], hidden Markov models are used to learn the dynamics between successive appearances. In [84], pose variations are handled by learning view-discretized appearance manifolds from the training ensemble, and transition probabilities from one view to another are used to regularize the search space. However, in [84, 89], cropped images are used for testing.

In this chapter, we propose a general framework which possesses the following features:

• It processes either a single image or a group of images (including the I-group and the video sequence).

•
It handles the localization problem, illumination and pose variations. ? The identity description could be either discrete or continuous. The contin- uous identity encoding typically arises from subspace modeling. ? It is probabilistic and integrates all the available evidence. Chapter organization In Section 8.1 we introduce the generic framework which provides a probabilis- tic characterization of the object identity. In Section 8.2 we address issues and challenges arising in this framework. In Section 8.3 we focus on how to achieve an identity encoding which is invariant to localization, illumination and pose vari- ations. In Section 8.3.2, we present some efficient computational methods. In Section 8.3.3, we present experimental results. 204 8.1 Principle of Probabilistic Identity Charac- terization Suppose ? is the identity signature, which represents the identity in an abstract manner. It can be either discrete- or continuous- valued. If we have an N-class problem, ? is discrete taking value in {1,2,...,N}. If we associate the identity with image intensity or feature vectors derived from say subspace projections, ? is continuous-valued. Given a group of images y1:T .= {y1,y2,...,yT} containing the appearances of the same but unknown identity, probabilistic identity characteriza- tion is equivalent to finding the posterior probability p(?|y1:T). As the image only contains a transformed version of the object, we also need to associate it a transformation parameter ?, which lies in a transformation space ?. The transformation space ? is usually application dependent. Affine trans- formation is often used to compensate for the localization problem. To handle illumination variation, the lighting direction is used. If pose variation is involved, 3D transformation is needed or a discrete set is used if we quantize the continuous view space. We assume that the prior probability of ? is pi(?), which is assumed to be, in practice, a non-informative prior. A non-informative prior is uniform in the discrete case and treated as a constant, say 1, in the continuous case. The key to our probabilistic identity characterization is as follows: p(?|y1:T) ? pi(?)p(y1:T|?) = pi(?) integraldisplay ?1:T p(y1:T|?1:T,?)p(?1:T)d?1:T = pi(?) integraldisplay ?1:T Tproductdisplay t=1 p(yt|?t,?)p(?t|?1:t?1)d?1:T, (8.1) 205 where the following rules, namely (a) observational conditional independence and (b) chain rule, are applied: (a) p(y1:T|?1:T,?) = Tproductdisplay t=1 p(yt|?t,?); (8.2) (b) p(?1:T) = Tproductdisplay t=1 p(?t|?1:t?1); p(?1|?0) .= p(?1). (8.3) Equation (8.1) involves two key quantities: the observation likelihood p(yt|?t,?) and the state transition probability p(?t|?1:t?1). The former is essential to a recog- nition task, the ideal case being that it possesses a discriminative power in the sense that it always favors the correct identity and disfavors the others; the latter is also very helpful especially when processing video sequences, which constrains the search space. We now study two special cases of p(?t|?1:t?1). 8.1.1 Independent group (I-group) In this case, the transformations {?t; t = 1,...,T} are independent of each other, i.e. p(?t|?1:t?1) = p(?t). (8.4) Eq. (8.1) becomes p(?|y1:T) ? pi(?) Tproductdisplay t=1 integraldisplay ?t p(yt|?t,?)p(?t)d?t. (8.5) In this context, the probability p(?t) can be regarded as a prior for ?t, which is often assumed to be Gaussian with mean ?? or non-informative. The most widely studied case in the literature is T = 1, i.e. 
there is only a single image in the group. Due to its importance, sometimes we will distinguish 206 it from the I-group (with T > 1) depending on the context. We will present in Section 8.2 the shortcomings of many contemporary approaches. It all boils down to how to compute the integral in (8.5) in real applications. In the sequel, we show how to efficiently approximate it. 8.1.2 Video sequence In the case of video sequence, temporal continuity between successive video frames implies thatthe transformations{?t; t = 1,...,T}follow a Markov chain. Without loss of generality, we assume a first-order Markov chain, i.e. p(?t|?1:t?1) = p(?t|?t?1). (8.6) Eq. (8.1) becomes p(?|y1:T) ? pi(?) integraldisplay ?1:T Tproductdisplay t=1 p(yt|?t,?)p(?t|?t?1)d?1:T. (8.7) The difference between (8.5) and (8.7) is whether the product lies inside or outside the integral. In (8.5), the product lies outside the integral, which divides the quantity of interest into ?small? integrals that can be computed efficiently; while (8.7) does not have such a decomposition, causing computational difficulty. 8.1.3 Difference from Bayesian estimation Our framework is very different fromthe traditional Bayesian parameter estimation setting, where a certain parameter ? should be estimated from the i.i.d. observa- tions {x1,x2,...,xT} generated from a parametric density p(x|?). If we assume that ? has a prior probability pi(?), then the posterior probability p(?|x1:T) is computed as p(?|x1:T) ? pi(?)p(x1:T|?) = pi(?) Tproductdisplay t=1 p(xt|?) (8.8) 207 and used to derive the parameter estimate ??. One should not confuse our trans- formation parameter ? with the parameter ?. Notice that ? is fixed in p(xt|?) for different t?s. However, each yt is associates with a ?t. Also, ? is different from ? in the sense that ? describes the identity and ? helps to describe the parametric density. To make our framework more general, we can also incorporate the ? parameter by letting the observation likelihood be p(y|?,?,?). Equation (8.1) then becomes p(?|y1:T) ? pi(?)p(y1:T|?) (8.9) = pi(?) integraldisplay ?,?1:T p(y1:T|?1:T,?,?)p(?1:T)pi(?)d?1:Td? = pi(?) integraldisplay Tproductdisplay t=1 p(yt|?t,?,?)p(?t|?1:t?1)pi(?)d?1:Td?, where ?1:T and ? are assumed to be statistically independent. In this chapter, we will focus only on (8.1) as if we already know the true parameter ? in (8.9). This greatly simplifies our computation. 8.2 Recognition Setting and Issues Equation (8.1) lays a theoretical foundation, which is universal for all recognition settings: (i) recognition is based on a single image (an I-group with T = 1), an I- group with T ? 2, or a video sequence; (ii) the identity signature is either discrete- or continuous-valued; and (iii) the transformation space takes into account all available variations, such as localization and variations in illumination and pose. 208 8.2.1 Discrete identity signature In a typical pattern recognition scenario, say an N-class problem, the identity signature for y1:T, ??, is determined by the Bayesian decision rule: ?? = arg max {1,2,...,N} p(?|y1:T). (8.10) Usually p(y|?,?) is a class-dependent density, either pre-specified or learned. This is a well studied problem and we will not focus on this. 8.2.2 Continuous identity signature If the identity signature is continuous-valued, two recognition schemes are possible. The first is to derive a point estimate ?? (e.g. conditional mean, mode) from p(?|y1:T) to represent the identity of image group y1:T. 
Recognition is performed by matching ???s belonging to different groups of images using a metric k(.,.). Say, ??1 is for group 1 and ??2 for group 2, the point distance ?k1,2 .= k(??1, ??2) is computed to characterize the difference between groups 1 and 2. Instead of comparing the point estimates, the second scheme directly compares different distributions that characterize the identities for different groups of images. Therefore, for two groups 1 and 2 with the corresponding posterior probabilities p(?1) and p(?2), we use the following expected distance [134] ?k1,2 .= integraldisplay ?1 integraldisplay ?2 k(?1,?2)p(?1)p(?2)d?1d?2. Ideally, we wish to compare the two probability distributions using quantities such as the Kullback-Leibler distance [4]. However, computing such quantities is nu- merically prohibitive when ? is of high dimensionality. 209 The second scheme is preferred as it utilizes the complete statistical informa- tion, while in the first one, point estimates use partial information. For examples, if only the conditional mean is used, the covariance structure or higher-order statis- tics is thrown away. However, there are circumstances when the first scheme makes sense: the posterior distribution p(?|y1:T) is highly peaked or even degenerate at ??. This might occur when (i) the variance parameters are taken to be very small; or (ii) we let T go to ?, i.e. keep observing the same object for a long time. 8.2.3 The effects of the transformation Even though recognition based on single images has been studied for a long time, most efforts assume only one alignment parameter ?? and compute the probabil- ity p(y|??,?). Any recognition algorithm computing some distance measures can be thought of as using a properly defined Gibbs distribution. The underlying assumption is that p(?) = ?(?? ??), (8.11) where ?(.) is an impulse function. Using (8.11), (8.5) becomes p(?|y) ? pi(?) integraldisplay ? p(y|?,?)?(?? ??)d? = pi(?)p(y|??,?). (8.12) Incidentally, if the Laplace?s method is used to approximate the integral (refer to the Appendix 8.I for details) and the maximizer ??? = argmax? p(y|?,?)p(?) does not depend on ?, say ??? = ??, then p(?|y) ? pi(?) integraldisplay ? p(y|?,?)p(?)d? similarequal pi(?)p(y|??,?)p(??) radicalBig (2pi)r/|I(??)|. (8.13) This gives rise to the same decision rule as implied by (8.12) and also partly explains why the simple assumption (8.11) can work in practice. 210 The alignment parameter is therefore very crucial for a good recognition perfor- mance. Even a slightly erroneous ?? may affect the recognition system significantly. It is very beneficial to have a continuous density p(?) such as a Gaussian or even a non-informative since marginalization of p(?,?|y) over ? yields a robust estimate of p(?|y). In addition, our Bayesian framework also provides a way to estimate the best alignment parameter through the posterior probability: p(?|y) ? integraldisplay ? p(y|?,?)pi(?)d?. (8.14) 8.2.4 Asymptotic behaviors When we have an I-group or a video sequence, we are often interested in dis- covering the asymptotic (or large-sample) behaviors of the posterior distribution p(?|y1:T) when T is large. In [185], the discrete case of ? in a video sequence is studied. However it is very challenging to extend this study to a continuous case. Experimentally (refer to Section 8.3.3), we find that p(?|y1:T) becomes more and more peaked as N increase, which seems to suggest a degenerancy in the true value ?true. 
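To make the contrast in Section 8.2.3 concrete, namely the delta-prior rule (8.12) versus marginalizing over the transformation as in (8.5), the following toy sketch compares the two for a discrete identity set. The circular-shift transformation space, the isotropic-Gaussian likelihood, and all data below are simplifying assumptions of my own, not the models used in the experiments.

```python
import numpy as np

def log_likelihood(y, template, shift, sigma=1.0):
    """Hypothetical observation log-likelihood log p(y | theta, alpha): the observation is
    modeled as a circularly shifted template plus isotropic Gaussian noise."""
    return -np.sum((y - np.roll(template, shift)) ** 2) / (2.0 * sigma ** 2)

def normalize(log_post):
    """Turn unnormalized log-posteriors over identities into probabilities."""
    log_post = log_post - log_post.max()
    p = np.exp(log_post)
    return p / p.sum()

def posterior_delta(y, templates, shift_hat):
    """Delta-prior rule (8.11)-(8.12): trust a single alignment shift_hat."""
    return normalize(np.array([log_likelihood(y, t, shift_hat) for t in templates]))

def posterior_marginal(y, templates, shift_samples):
    """Marginalize the transformation as in (8.5): average p(y|theta,alpha) over
    samples theta^(k) drawn from the prior p(theta) (log-mean-exp for stability)."""
    def logmeanexp(v):
        m = v.max()
        return m + np.log(np.mean(np.exp(v - m)))
    log_post = np.array([logmeanexp(np.array([log_likelihood(y, t, s) for s in shift_samples]))
                         for t in templates])
    return normalize(log_post)

# Toy example with hypothetical 1-D "appearances" of N = 3 identities.
rng = np.random.default_rng(1)
templates = [rng.normal(size=32) for _ in range(3)]
y = np.roll(templates[1], 2) + 0.05 * rng.normal(size=32)    # identity 1, shifted by 2

shift_samples = rng.integers(-3, 4, size=50)                  # draws from a broad prior p(theta)
print(posterior_delta(y, templates, shift_hat=0))             # computed under a wrong alignment; unreliable
print(posterior_marginal(y, templates, shift_samples))        # concentrates on the correct identity (index 1)
```

The second call illustrates the robustness argued for above: averaging the likelihood over transformation samples rescues the decision even when a single assumed alignment is wrong.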
8.3 Subspace Identity Encoding The main challenge is to specify the likelihood p(y|?,?). Practical considerations require that (i) the identity encoding coefficient ? is compact so that our target space where ? resides is of low dimensional; and (ii) ? should be invariant to transformations and tightly clustered so that we can safely focus on a small portion of the spaces. 211 Inspired by the popularity of subspace analysis, we assume that the observation y can be well explained by a subspace, whose basis vectors are encoded in a matrix denoted by B, i.e. there exists linear coefficients ? such that y ? B?. Clearly, ? naturally encodes the identity. However, the observation under the transformation condition (parameterized by ?) deviates from the canonical condition (parameter- ized by say ??) under which the B matrix is defined. To achieve an identity encoding that is invariant to the transformation, there are two possible ways. One way is to inverse-warp the observation y from the transformation condition ? to the canoni- cal condition ?? and the other way is to warp the basis matrix B from the canonical condition ?? to the transformation condition ?. In practice, inverse-warping is typ- ically difficult. For example, we cannot easily warp an off-frontal view to a frontal view without explicit 3D depth information that is unavailable. Hence, we follow the second approach, which is also known as analysis-by-synthesis approach. We denote the basis matrix under the transformation condition ? by B?. 8.3.1 Invariant to localization, illumination, and pose Localization parameter, denoted by ?, includes the face location, scale and in-plane rotation. Typically, an affine transformation is used. We absorb the localization parameter ? in the observation using T{y;?}, where the T{.;?} is a localization operator, extracting the region of interest and normalizing it to match with the size of the basis. The illumination parameter, denoted by ?, is a vector specifying the illumi- nant direction (and intensity if required). The pose parameter, denoted by ?, is a continuous-valued random variable. However, practical systems [67, 69] often discretize this due to the difficulty in handling 3D to 2D projection. Suppose the 212 quantized pose set is {1,...,V}. To achieve pose invariance, we concatenate all the images [69] {y1,...,yV} under all the views and a fixed illumination ? to form a high-dimensional vector Y? = [y1,?,...,yV,?]T. To further achieve invariance to illuminations, we invoke the Lambertian reflectance model, ignoring shadow pixels. Now, ? is actually a 3-D vector describing the illuminant. We now follow Chapter 3 to derive a bilinear analysis summarized below. Since all yv?s are illuminated by the same ?, the Lambertian model gives, Y? = W?. (8.15) Following [204], we assume that W = msummationdisplay i=1 ?iWi, (8.16) and we have Y? = msummationdisplay i=1 ?iWi?, (8.17) where Wi?s are illumination-invariant bilinear basis and ? = [?1,...,?m]T pro- vides an illuminant-invariant identity signature. Those bilinear basis can be easily learned as shown in [138, 202]. Thus ? is also pose-invariant because, for a given view ?, we take the part in Y corresponding to this view and still have y?,? = msummationdisplay i=1 ?iW?i ?. (8.18) In summary, the basis matrix B? for ? = (?,?,?) with ? absorbed in y is expressed as B?,? = [W?1?,...,W?m?]. We focus on the following likelihood: p(y|?) = p(y|?,?,?,?) = Z?1?,?,? exp{?D(T{y;?},B?,??)}, (8.19) 213 where D(y,B??) is some distance measure and Z?,?,? 
is the so-called partition func- tion which plays a normalization role. In particular, if we take D as D(T{y;?},B?,??) = (T{y;?}?B?,??)T??1(T{y;?}?B?,??)/2, (8.20) with a given ? (say ? = ?2I where I is an identity matrix), then (8.19) becomes a multivariate Gaussian and the partition function Z?,?,? does not depend on the parameters any more. However, even though (8.19) is a multivariate Gaussian, the posterior distribution p(?|y1:T) is no longer Gaussian. 8.3.2 Computational issues The integral If the transformation space ? is discrete, it is easy to evaluate the integral1 integraltext ? p(y|?,?)p(?)d?, which becomes a sum. If ? is continuous, in general, com- puting integral integraltext? p(y|?,?)p(?)d? is a difficult task. Many techniques are available in the literature. Here we mainly focus on two techniques: Monte Carlo simulation [14, 16] and Laplace?s method [16, 136]. Monte Carlo simulation. The underlying principle is the law of large number (LLN). If {x(1),x(2),...,x(K)} are K i.i.d. samples of the density p(x), for any bounded function h(x), limK?? 1K Ksummationdisplay k=1 h(x(k)) = integraldisplay xh(x)p(x)dx = Ep[h]. (8.21) Alternatively, when drawing i.i.d. samples from p(x) is difficult, we can use importance sampling [14, 16]. Suppose that the importance function q(x) has i.i.d. realizations {x(1),x(2),...,x(K)}. The pdf p(x) can be represented by a weighted 1We drop the subscript [.]t notation as this is a general treatment. 214 sample set {(x(k),w(k)p )}Kk=1, where the weight for the sample x(k) is w(k)p = p(x(k))/q(x(k)), (8.22) in the sense that for any bounded function h(x), limK?? Ksummationdisplay k=1 w(k)p h(x(k)) = Ksummationdisplay k=1 p(x(k)) q(x(k))h(x (k)) = Ep[h]. (8.23) Laplace?s method [16, 136]. The general approach of this method is presented in Appendix 8.I. This is a good approximation to the integral only if the integrand is uniquely peaked and reasonably mimics the Gaussian function. In our context, we use importance sampling (or i.i.d sampling if possible) for ? and the Laplace?s method for ? and enumerate ?. We draw i.i.d. samples {?(1),?(2),...,?(K)} from q(?) and, for each sample ?(k), compute the weight w?(k) = p(?(k))/q(?(k)). If the i.i.d. sampling is used, the weights are always ones. Putting things together, we have (assuming pi(?) is a non-informative prior) p(?|y) ? integraldisplay ?,?,? p(y|?,?,?,?)p(?)p(?)p(?)d?d?d? similarequal 1K Ksummationdisplay k=1 w?(k) 1V Vsummationdisplay ?=1 p(y|?(k),???(k),?,?,?,?)? p(???(k),?,?) radicalBig (2pi)r/|I(???(k),?,?)|, (8.24) where ???k,?,? is the maximizer ???(k),?,? = argmin ? p(y|? (k),?,?,?)p(?), (8.25) r is the dimensionality of ?, and I(???,?,?) is a properly defined matrix. Refer to Appendix 8.II for computing ???,?,? and I(???,?,?) if the likelihood is given as (8.19) and (8.20) and a non-informative prior p(?) is assumed. Similar derivations can be conducted for an I-group of observations y1:T. 215 The distances ?k and ?k To evaluate the expected distance ?k, we use the Monte Carlo method. In our context, the target distribution is p(?|y1:T). Based on the above derivations, we know how to evaluate the target distribution, but not to draw sample from it. Therefore, we use importance sampling. Other sampling techniques such as Monte Carlo Markov chain [14, 16] can also be applied. 
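As a reminder of how such an importance-sampling estimate behaves before it is specialized to the two image groups, here is a generic toy example, unrelated to the face data. It uses a self-normalized variant of the weights in (8.22), which is convenient when the target density is known only up to a constant; the target and importance densities below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target density p(x): a standard Gaussian restricted to x > 0, known only up to scale.
def p_unnorm(x):
    return np.exp(-0.5 * x ** 2) * (x > 0)

# Importance function q(x): an exponential density, easy to draw i.i.d. samples from.
def q_pdf(x):
    return np.exp(-x) * (x > 0)

K = 100_000
x = rng.exponential(size=K)          # i.i.d. samples x^(k) ~ q
w = p_unnorm(x) / q_pdf(x)           # weights w^(k) = p(x^(k)) / q(x^(k)), cf. (8.22)

# Estimate E_p[h] for h(x) = x; dividing by the weight sum removes the unknown
# normalizing constant of p (self-normalized form of (8.23)).
estimate = np.sum(w * x) / np.sum(w)
print(estimate)                      # approx. sqrt(2/pi) ~ 0.80, the mean of the half-Gaussian
```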
Suppose that, say for group 1, the importance function is q1(?1), and weighted sample set is {?(i)1 ,w(i)1 }Ii=1, the expected distance is approximated as ?k1,2 similarequal summationtextI i=1 summationtextJ j=1 w (i) 1 w (j) 2 k(? (i) 1 ,? (j) 2 )summationtext I i=1 w (i) 1 summationtextJ j=1 w (j) 2 . (8.26) The point distance is approximated as ?k1,2 similarequal k( summationtextI i=1 w (i) 1 ? (i) 1summationtext I i=1 w (i) 1 , summationtextJ j=1 w (j) 2 ? (j) 2summationtext J j=1 w (j) 2 ). (8.27) 8.3.3 Experimental results We use the ?illum? subset of the PIE database [75] in our experiments. This subset has 68 subjects under 21 illumination configurations and 13 poses. Out of the 21 illumination configurations, we select 12 of them denoted by F, F = {f16,f15,f13,f21,f12,f11,f08,f06,f10,f18,f04,f02}, which typically span the set of variations. Out of the 13 poses, we select 9 of them denoted by C, C = {c22,c02,c37,c05,c27,c29,c11,c14,c34}, which cover from the left profile to the frontal to the right profile. In total, we have 68?12?9 = 7344 images. Fig 3.2 displays one PIE object under the illumination and pose variations. 216 We randomly divide the 68 subjects into two parts. The first 34 subjects are used in the training set and the remaining 34 subjects are used in the gallery and probe sets. It is guaranteed that there is no identity overlap between the training set and the gallery set. During training, the images are pre-preprocessed by aligning the eyes and mouth to desired positions. No flow computation is carried on for further align- ment. After the pre-processing step, the used face image is of size 48 by 40, i.e. d = 48?40 = 1920. Also, we only study gray images by taking the average of the red, green, and blue channels of their color versions. The training set is used to learn the basis matrix B? or the bilinear basis Wi?s. As mentioned before, ? includes the illumination direction ? and the view pose ?, where ? is a continuous-valued random vector and ? is a discrete random variable taking values in {1,...,V} with p = 9 (corresponding to C). The images belonging to the remaining 34 subjects are used in the gallery and probe sets. The construction of the gallery and probe sets conforms the following: To form a gallery set of the 34 subjects, for each subject, we use an I-group of 12 images under all the illuminations under one pose ?p; to form a probe set, we use I-groups under the other pose ?g. We mainly concentrate on the case with ?p negationslash= ?g. Thus, we have 9?8 = 72 tests, with each test giving rise to a recognition score. The 1-NN (nearest neighbor) rule is applied to find the identity for a probe I-group. During testing, we no longer use the pre-processed images and therefore the unknown transformation parameter includes the affine localization parameter, the light direction, and the discrete view pose. The prior distribution p(?t) is assumed to be a Gaussian, whose mean is found by a background subtraction algorithm 217 and whose covariance matrix is manually specified. We use i.i.d. sampling from p(?t) since it is Gaussian. The metric k(.,.) actually used in our experiments is the correlation coefficient: k(x,y) = {(xTy)2}/{(xTx)(yTy)}. Figure 8.1 shows the marginal posterior distribution of the first element ?1 of the identity variable ?, i.e., p(?1|y1:T), with different N?s. 
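As an implementation aside, the group-matching scores (8.26) and (8.27) used in these tests reduce to a few lines once each group's posterior is represented by a weighted sample set. The sketch below is illustrative only: the samples and weights are made-up stand-ins for the posterior sample sets, and the correlation-coefficient metric k(x, y) defined above plays the role of k(., .).

```python
import numpy as np

def corr_metric(x, y):
    """k(x, y) = (x^T y)^2 / ((x^T x)(y^T y)), the metric used in the experiments."""
    return (x @ y) ** 2 / ((x @ x) * (y @ y))

def expected_distance(samples1, w1, samples2, w2):
    """Eq. (8.26): weighted average of k over all pairs of posterior samples."""
    total, norm = 0.0, 0.0
    for a1, u1 in zip(samples1, w1):
        for a2, u2 in zip(samples2, w2):
            total += u1 * u2 * corr_metric(a1, a2)
            norm += u1 * u2
    return total / norm

def point_distance(samples1, w1, samples2, w2):
    """Eq. (8.27): k evaluated at the weighted means (point estimates) of the two groups."""
    m1 = np.average(samples1, axis=0, weights=w1)
    m2 = np.average(samples2, axis=0, weights=w2)
    return corr_metric(m1, m2)

# Made-up weighted sample sets standing in for the posteriors of two image groups.
rng = np.random.default_rng(2)
s1 = rng.normal(loc=1.0, size=(40, 5)); w1 = rng.random(40)
s2 = rng.normal(loc=1.0, size=(60, 5)); w2 = rng.random(60)
print(expected_distance(s1, w1, s2, w2), point_distance(s1, w1, s2, w2))
```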
From Figure 8.1, we notice that (i) the posterior probability $p(\alpha_1|y_{1:T})$ has two modes, which might fail algorithms that rely on a point estimate, and (ii) it becomes more peaked and more tightly supported as $T$ increases, which empirically supports the asymptotic behavior mentioned in Section 8.2.

Figure 8.1: The posterior distributions $p(\alpha_1|y_{1:T})$ with different $T$'s: (a) $p(\alpha_1|y_1)$; (b) $p(\alpha_1|y_{1:6})$; (c) $p(\alpha_1|y_{1:12})$; and (d) the posterior distribution of the pose given $y_{1:12}$. Notice that $p(\alpha_1|y_{1:T})$ has two modes and becomes more peaked as $T$ increases.

Figure 8.2 shows the recognition rates for all the 72 tests. In general, when the poses of the gallery and probe sets are far apart, the recognition rates decrease. The best gallery sets for recognition are those in frontal poses and the worst gallery sets are those in profile views. These observations are similar to those made in Chapter 3.

For comparison, Table 8.1 shows the average recognition rates for four different methods: our two probabilistic approaches using $\bar{k}$ and $\hat{k}$, respectively, the PCA approach [62], and the statistical approach [91] using the KL distance. When implementing the PCA approach, we learned a generic face subspace from all the training images, stripping their illumination and pose conditions; when implementing the KL approach, we fit a Gaussian density on every I-group and the learning set is not used. Our approaches outperform the other two approaches significantly due to the transformation-invariant subspace modeling. The KL approach [91] performs even worse than the PCA approach simply because no illumination and pose learning is used in the KL approach, while the PCA approach has a learning algorithm based on image ensembles taken under different illuminations and poses (though this specific information is stripped).

Method            | $\bar{k}$ | $\hat{k}$ | PCA | KL [91]
Rec. Rate (top 1) | 82%       | 76%       | 36% | 6%
Rec. Rate (top 3) | 94%       | 91%       | 56% | 15%

Table 8.1: Recognition rates of different methods.

As mentioned earlier in Section 8.2.3, we can infer the transformation parameters using the posterior probability $p(\theta|y_{1:T})$. Figure 8.1(d) also shows the obtained pose posterior given $y_{1:12}$ for one probe I-group. In this case, the actual pose is the fifth one in the pose set (i.e., camera c27), which has the maximum probability in Figure 8.1(d). Similarly, we can find an estimate of the localization parameter, which is quite accurate since the background subtraction algorithm already provides a clean position.

Figure 8.2: The recognition rates of all tests. (a) Our method based on $\bar{k}$. (b) Our method based on $\hat{k}$. (c) The PCA approach [62]. (d) The KL approach. Notice the different ranges of values for different methods; the diagonal entries should be ignored.

8.4 Appendix

Appendix 8.I: Laplace's method

We are interested in computing the following quantity, for $\theta$
= [?1,?2,...,?r]T ? Rr, J = integraltext p(?)d?. Suppose that ?? is the maximizer of p(?) or equivalently logp(?) 220 which satisfies ?p(?) ?? |?? = 0 or ? logp(?) ?? |?? = 0. (8.28) We expand logp(?) around ?? using a Taylor series: logp(?) similarequal logp(??)? 12(????)TI(??)(?? ??), (8.29) where I(?) is an r?r matrix whose ijth element is Iij(?) = ?? 2 logp(?) ??i??j . (8.30) Note that the first-order term in (8.29) is zero by virtue of (8.28). If p(?) is a pdf function with parameter ?, then I(?) is the famous Fisher information matrix [16]. Substituting (8.29) into J gives J similarequal p(??) integraldisplay exp{?12(?? ??)TI(??)(? ? ??)}d? = p(??) radicalBig (2pi)r/|I(??)|. (8.31) Appendix 8.II ? About ???,?,? If a non-information prior p(?) is assumed2, the maximizer ???,?,? satisfies ???,?,? = argmax ? p(y|?,?,?,?) (8.32) = argmin ? (T{y;?}?B?,??)T(T{y;?}?B?,??) = argmin ? L(?,?,?,?) where L(?,?,?,?) .= (T{y;?}?B?,??)T(T{y;?}?B?,??). Using the fact that B?,?? = [W?1?,...,W?m?]? = B?,??; B?,? .= msummationdisplay i=1 ?iW?i , (8.33) 2If a Gaussian prior is assumed, a similar derivation can be carried. 221 The term L(?,?,?,?) becomes L(?,?,?,?) = (T{y;?}?B?,??)T(T{y;?}?B?,??), (8.34) which is quadratic in ?. The optimum ???,?,? is unique and its value is ???,?,? = (B?,?TB?,?)?1B?,?Ty = B?,??T{y;?}. (8.35) where [.]? is the pseudo-inverse. Substituting (8.35) into L(?,?,?,?) yields L(?,?,???,?,?,?) = T{y;?}T(Id ?B?,?B?,??)T?{y}. (8.36) It is easy to show that I(?) is no longer a function of ? and equals to I = ??2B?,?TB?,?. (8.37) 222 Chapter 9 Conclusions 9.1 Summary This doctoral dissertation addressed several approaches for unconstrained face recognition from three aspects. The first aspect is to directly model illumination and pose variations. The second aspect is to use nonlinear kernel learning to char- acterize the face appearance manifold. The third aspect is to perform recognition using video sequences. Here are some of the key contributions made in the thesis: ? In the generalized photometric stereo approach in Chapter 2, we proposed a rank constraint on the product of albedo and surface normal that provides a very compact yet efficient encoding of the identity. In the literature, usually two separate linear subspaces [43, 66] are constructed for shape and texture, respectively, assuming the independence between them. This assumption might result in an overfit for the problem [202]. By using the integrability and symmetry constraints, we then achieve a lin- 223 earized algorithm that recovers the class-specific albedos and surface normals under the most general and hence most difficult setting, i.e., the observation matrix consists of different objects under different illuminations. In particu- lar, this algorithm takes into account the effect of varying albedo field in the integrability term. ? The proposed illuminating light field approach in Chapter 3 is image-based and requires no explicit 3D model. It is computationally efficient and able to deal with images of small size. In contrast, the 3D model-based approach [66] is computationally intense and needs image of large size. ? Probabilistic analysis of kernel principal components in Chapter 4 provides a tool for modeling nonlinear manifold in an interpretable manner. This also implicitly characterizes the high order statistical information. The prob- abilistic nature enables a mixture modeling of kernel principal component analysis and an effective classification scheme. ? 
Computing the probabilistic distance measures (e.g. the Chernoff distance, the Bhatacharyya distance, the KL distance, and the divergence distance) between two Gaussian densities in the RKHS is presented in Chapter 5. Since the RKHS might be infinite-dimensional, we derive a limiting distance which can be easily computed. This leads to a novel paradigm for studying pattern separability, especially for visual pattern lying in a nonlinear manifold. ? Presented in Chapter 6 is an adaptive method for visual tracking which stabi- lizes the tracker by embedding deterministic linear prediction into stochastic diffusion. Numerical solutions have been provided using particle filters with the adaptive observation model arising from the adaptive appearance model, 224 adaptive state transition model, and adaptive number of particles. Occlusion analysis is also embedded in the particle filter. ? A systematic method for face recognition from a probe video, compared with a gallery of still templates is introduced in Chapter 7. A time series state space model is used to accommodate the video and SIS algorithms provide the numerical solutions to the model. This probabilistic framework, which overcomes many difficulties arising in conventional recognition approaches using video, is registration-free and poses no need for selecting good frames. It turns out that an immediate recognition decision can be made in our framework due to the degeneracy of the posterior probability of the identity variable. The conditional entropy can also serve as a good indication for the convergence. ? We present in Chapter 8 a generic framework of modeling human identity for a single image, a group of images, or a video sequence . This framework provides a complete statistic description of the identity. Various current recognition schemes are just instances of this generic framework. 9.2 Future works Unconstrained face recognition can be expanded in a multitude of ways. The following just lists some potential avenues to explore in the context of the proposed approaches: ? In Chapters 2 and 3, we utilize a Lambertian reflectance model to describe illumination phenomenon. However, the Lambertian reflectance model is a rather simple model and unable to handle cast shadows and specular regions. 225 Although we employ a simple technique to exclude pixels in cast shadow and specular regions, it turns out when the light comes from extreme directions (e.g. highly off-frontal ones), the recognition performance drops quickly. We need to investigate these lighting conditions. Alternatively, a complex illumination model providing a better illumination description can be used. ? In the illuminating light field approach of Chapter 3, we need an image-based rendering technique to handle novel poses. Some promising works along this line are [67, 110, 111]. ? On probabilistic analysis of kernel principal components and probability dis- tances on RKHS, possible future works include (i) how to design or select the kernel function for a given task, be it classification or modeling; (ii) evaluat- ing the kernels for set based on the derived probabilistic distances (as argued in Section 5.3.5) in a classification device such as Support Vector Machine for various applications; (iii) utilizing probabilistic distances for an independent component analysis (ICA) as in [170]. ? The visual tracking algorithm of Chapter 6 can be extended in many ways [206, 212]. (i) Combining shape information into appearance. 
Appearance and shape are two very important visual cues arguably presented in a comple- mentary fashion [133]. (ii) Utilizing appearance from multiple views. Using multiple views can overcome some difficulty in a single view. For example, an object might be occluded in one view but not the other one. Using the multi- view geometry, we can infer the movement of the object in the occluded view [207]. (iii) Here we mostly model the movement of the foreground object. Joint modeling of foreground and background movements is very promising 226 [212, 213] since the stabilization obtained by background modeling signifi- cantly reduces the clutter in the background that confuses the foreground tracking algorithm. ? In simultaneous tracking and recognition of Chapter 7, various issues ex- ist. (i) Robustness. Generally speaking, our approach is more robust than still-image-based approach since we essentially compute the recognition score based on all video frames and, in each frame, all kinds of transformed ver- sions of the face part corresponding to the sample configurations that are considered. However, since we take no explicit measure when handling frames with outlier or other unexpected factors, recognition scores based on those frames might be low. But, this is a problem for other approaches too. The assumption that the identity does not change as time proceeds, i.e., p(nt|nt?1) = ?(nt?nt?1), could be relaxed by having nonzero transition probabilities between different identity variables. Using nonzero transition probabilities will enable us an easier transition to the correct choice in case that the initial choice is incorrectly chosen, making the algorithm more ro- bust. (ii) Resampling. In the recognition algorithm, the marginal distribution {(?(j)t?1,wprime(j)t?1)}Jj=1 is sampled to obtain the sample set {(?(j)t ,1)}Jj=1. This may cause problems in principle since there is no conditional independence between ?t and nt given y0:t. However, in a practical sense, this is not a big disadvantage because the purpose of resampling is to ?provide chances for the good streams (samples) to amplify themselves and hence rejuvenate the sampler to produce better results for future states as the system evolves? [159]. The resampling scheme can either be simple random sampling with 227 weights (like in CONDENSATION), residual sampling, or local Monte Carlo methods. ? Further, in the experimental part of Chapter 8, we can extend our approach to perform recognition from video sequences with localization, illumination, and pose variations. Again, Sequential Monte Carlo methods can be used to accommodate temporal continuity. This leads to a very high-dimensional state space to explore. Efficient simulation techniques are desired. In fact, the issue of computation load also exist for the efficient algorithm in Chapter 7. There, two important numbers affecting the computation are J, the num- ber of motion samples, and N, the size of the database. (i) The choice of J is an open question in the statistics literature. In general, larger J produces more accurate results. (ii) The choice of N depends on application. Since a small database is used in this experiment, it is not a big issue here. However, the computational burden may be excessive if N is large. One possibility is to use a continuous parameterized representation, say ? as in Chapter 8, instead of discrete identity variable n. Now the task reduces to computing p(?t,?t|y0:t). 
The approaches taken in this thesis by no means cover the whole spectrum of the unconstrained face recognition problem and address only a small portion of all available issues. Some possible important issues, other than those addressed in the thesis, include the following: ? Aging. Aging is a very important topic in unconstrained face recognition. Often the stored gallery images are taken well before the probe images. For example, passengers hold passports with photos taken when the passport was issued years ago. While one solution is to maintain the gallery images 228 up-to-date, a systematic solution is theoretical modeling of the generic affect of aging. This modeling is very difficult due to the individualized variation. Presented in [50] is just one attempt with limited success. More research efforts are certainly worthwhile. ? Expression. Facial expression analysis and modeling attracts a lot of atten- tion [42, 60, 61] and some approaches [60] focus on expression recognition, i.e., identifying different modalities of facial expression such as happy, angry, disgust, etc. Face recognition under expression variation has not been fully explored. Clearly expression recognition and face recognition under expres- sion variation are two different topics. However, expression recognition and modeling is a crucial component for accurate face recognition under expres- sion variation. Further, facial expressions manifest themselves in a temporal dimension. The manner that an individual poses expressions (in natural contexts) presents certain behavioral aspect of the face biometric. Utilizing temporal infor- mation embedded in facial expression for face recognition under expression variation is an interesting research topic. ? Distorted imagery. Images as one main digital media are to be compressed, stored, transmitted and so on. Compression schemes sacrifice image quality for fewer bits to encode the image, storage devices are susceptible to various damages, trans- mission channels are often noisy. All these results in distorted images. How to perform face recognition accounting for sources of distortions [199] is a very practical research topic that needs to be explored. 229 BIBLIOGRAPHY [Books on general topics] [1] B. Anderson and J. Moore, Optimal Filtering. New Jersey: Prentice Hall, Engle-wood Cliffs, 1979. [2] Y. Bar-Shalom and T. Fortmann, Tracking and Data Association. Academic Press, 1988. [3] G. Casella and R. L. Berger, Statistical Inference. Duxbury, 2002. [4] T.M. Cover and J.A. Thomas, Elements of Information Theory. Wiley, 1991. [5] P. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach. Prentice Hall International, 1982. [6] A. Doucet, N. d. Freitas, and N. Gordon (Eds.), Sequential Monte Carlo Methods in Practice. Springer-Verlag, New York, 2001. [7] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. Wiley- Interscience, 2001. [8] G. H. Golub and C. F. Van Loan, Matrix Computations. The Johns Hopkins University Press, 1996. 230 [9] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learn- ing: Data Mining, Inference, and Prediction. Springer-Verlag, New York, 2001. [10] B. Horn and M. Brooks (Eds.) Shape from Shading. MIT Press, 1989. [11] P.J. Huber, Robust statistics. Wiley, 1981. [12] I. T. Jolliffe, Principal Component Analysis. New York: Springer-Verlag, 2002. [13] Kullback, Information Theory and Statistics. Wiley, New York, 1959. [14] J.S. Liu, Monte Carlo Strtegies in Scientific Computing. Springer, 2001. [15] K. V. 
Mardia, J. T. Kent, and J. M. Bibby, Multivariate Analysis. Academic Press, 1979. [16] C. Robert and G. Casella, Monte Carlo Statistical Methods. Springer, 1999. [17] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge University Press, 2004. [18] M.A. Tanner, Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions. Springer, 1996. [19] V. N. Vapnik, The Nature of Statistical Learning Theory. Springer-Verlag, New York, ISBN 0-387-94559-8, 1995. [Books and Review Papers on face recognition] [20] M.S.Bartlett, Face Image Analysis by Unsupervised Learning. Kluwer Aca- demic Publishers, 2001. 231 [21] R. Chellappa, C. L. Wilson, and S. Sirohey, ?Human andmachine recognition of faces: A survey,? Proceedings of IEEE, vol. 83, pp. 705?740, 1995. [22] S. Gong, S.J. McKenna, Dynamic Vision: From Images to Face Recognition. Imperial College Press, 2000. [23] P.W. Hallinan, G. Gordon, A. Yuille, P. Giblin, and D. Mumford, Two- and Three-Dimensional Patterns of the Face. A. K. Peters, Ltd., 1999. [24] T. Kanade, Computer Recognition of Human Faces. Birhauser, Basel, Switzerland, and Stuggart, Germany, 1973. [25] S.Z. Li, A.K. Jain (Eds.), Handbook of Face Recognition. Springer-Verlag, 2004. [26] H. Wechsler, P.J. Phillips, V. Bruce, F.F. Soulie, and T.S. Huang (Eds.), Face Recognition: From Theory to Applications. Springer-Verlag, 1998. [27] W. Zhao, R. Chellappa, A. Rosenfeld, and J. Phillips, ?Face recognition: A literature survey,? ACM Computing Surveys, vol. 12, 2003. [Biometrics] [28] Biometric Catalog. http://www.biomtricscatalog.org. [29] Biometric Consortium. http://www.biometrics.org. [30] Deparment of Homeland Security (DHS), US-VISIT Program. http://www.dhs.goc/dhspublic/interapp/editorial/editorial 0333.xml. [31] National Institue of Standards and Technologies (NIST), Biometrics Web Site. http://www.nist.gov/biometrics. 232 [32] D.M. Blackburn, ?Biometrics 101 (version 3.1)? http://www.biometricscatalog.org/biometrics/Introduction.asp, March 2004. [33] R. Hietmeyer, ?Biometric identification promises fast and secure processings of airline passengers,? The Internationl Civil Aviation Organization Journal, vol. 55, no. 9, pp. 10-11, 2000. [34] P.J. Phillips, R.M. McCabe, and R. Chellappa, ?Biometric image process- ing and recognition,? Proceedings of European Signal Processing Conference, 1998. [Psychophysical and neural aspects] [35] I. Biederman and P. Kalocsai, ?Neural and psychophysical analysis of object and face recognition,? In Face Recognition: From Theory to Applications, H. Wechsler, P.J. Phillips, V. Bruce, F.F. Soulie, and T.S. Huang (Eds.), Springer-Verlag, 1998. [36] V. Bruce, Recognizing Faces. Lawrence Erlbaum Associates, London, U.K., 1988. [37] V. Bruce, P.J.B. Hancock, and A.M. Burton, ?Human face perception and identification,? In Face Recognition: From Theory to Applications, H. Wech- sler, P.J. Phillips, V. Bruce, F.F. Soulie, and T.S. Huang (Eds.), Springer- Verlag, 1998. [38] A.J. O?Toole, ?Psychological and neural perspectives on human faces recog- nition,? In Handbook of Face Recognition, S.Z. Li and A.K. Jain (Eds.), Springer, 2004. 233 [39] B. Knight and A. Johnston, ?The role of movement in face recognition,? Visual Cognition, vol. 4, pp. 265-274, 1997. [Face recognition from still images] [40] M.S. Barlett, H.M. Ladesand, and T.J. Sejnowski, ?Independent component representations for face recognition,? Proceedings of SPIE 3299, pp. 528-539, 1998. [41] P. N. Belhumeur, J. P. 
Hespanha, and D. J. Kriegman, ?Eigenfaces vs. fish- erfaces: Recognition using class specific linear projection,? IEEE Trans. Pat- tern Analysis and Machine Intelligence, vol. 19, pp. 711?720, 1997. [42] M.J. Black and Y. Yacoob, ?Recognizing facial expressions in image se- quences using local paramterized models of image motion,? International Journal of Computer Vision, vol. 25, pp. 23-48, 1997. [43] T. Cootes, G.Edwards, andC. Taylor, ?Active appearance model,? European Conference on Computer Vision, 1998. [44] K. Etemad and R. Chellappa, ?Discriminant analysis for recognition of hu- man face images,? Journal of Optical Society of America A, pp. 1724?1733, 1997. [45] T. Huang, Z. Xiong, and Z. Zhang, ?Face recognition applications,? Hand- book of Face Recognition, S. Li and A. K. Jain (Eds.), Springer, 2004. [46] M.D. Kelly, ?Visual identification of people by computer,? Tech. rep. AI-130, Stanford AI project, Stanform, CA, 1970. 234 [47] M. Kirby and L. Sirovich, ?Application of Karhunen-Lo?eve procedure of the characterization of human faces,? IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 12, pp. 103?108, 1990. [48] M. Lades, J.C. Vorbruggen, J. Buhmann, J. Lange, C. v. d. Malsburg, R.P. Wurtz, and W. Konen, ?Distortion invariant object recognition in the dy- namic link architecture,? IEEE Trans. Computers, vol. 42, no. 3, pp. 300? 311, 1993. [49] A. Lanitis, C.J. Taylor, and T.F. Cootes, ?Automatic interpretation and coding of face images using flexible models,? IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 442-455, 1997. [50] A. Lanitis, C.J. Taylor, and T.F. Cootes, ?Toward automatic simulation of aging affects on face images,? IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, pp. 442-455, 2002. [51] S.H. Lin, S.Y. Kung, and J.J. Lin, ?Face recognition/detection by probabilis- tic decision based neural network,? IEEE Trans. Neural Networks, vol. 9, pp. 114-132, 1997. [52] C. Liu and H. Wechsler, ?Evolutionary pursuit and its applications to face recognition,? IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, pp. 570-582, 2000. [53] M.J. Lyons, J. Biudynek, and S. Akamatsu, ?Automatic classification of sin- gle facial images,? IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 21, no. 12, pp. 1357-1362, 1999. 235 [54] B. Moghaddam and A. Pentland, ?Probabilistic visual learning for object representation,? IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. PAMI-19, no. 7, pp. 696?710, 1997. [55] B. Moghaddam, T. Jebara, and A. Pentland, ?Bayesian modeling of facial similarity,? Advances in Neural Information Processing Systems, vol. 11, pp. 910?916, 1999. [56] B. Moghaddam, ?Principal manifolds and probabilistic subspaces for vi- sual recognition,? IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, pp. 780?788, 2002. [57] P.J. Phillips, ?Support vector machines applied to face recognition,? Ad- vances in Neural Information Processing Systems, vol. 11, pp. 803-809, 1998. [58] P.J. Phillips, H. Moon, S. Rizvi, and P.J. Rauss, ?The FERET evaluation methodology fro face-recognition algorithms,? IEEE Trans. attern Analysis and Machine Intelligence, vol. 22, pp. 1090?1104, 2000. [59] P.J. Phillips, P. Grother, R.J. Micheals, D.M. Blackburn, E. Tabbssi, and M. Bone, ?Face recognition vendor test 2002: evaluation report? NISTIR 6965, http://www.frvt.org, 2003. [60] Y. Tian, T. Kanade, and J. Cohn, ?Recognizing action units of facial ex- pression analysis,? IEEE Trans. 
Pattern Analysis and Machine Intelligence, vol. 23, pp. 1-19, 2001. [61] Y. Tian, T. Kanade, and J. Cohn, ?Recognizing action units of facial ex- pression analysis,? In Handbook of Face Recognition, S.Z. Li and A.K. Jain (Eds.), Springer, 2004. 236 [62] M. Turk and A. Pentland, ?Eigenfaces for recognition,? Journal of Cognitive Neuroscience, vol. 3, pp. 72?86, 1991. [63] M.-H. Yang, ?Kernel eigenfaces vs. kernel Fisherfaces: Face recognition using kernel methods,? Proceedings of International Conference on Automatic Face and Gesture Recognition, 2002. [64] W. Zhao, R. Chellappa, and A. Krishnaswamy, ?Discriminant analysis of principal components forface recognition,? Proceedings of International Con- ference on Automatic Face and Gesture Recognition, pp. 361-341, Nara, Japan, 1998. [Face recognition across illumination and poses] [65] J. Atick, P. Griffin, and A. Redlich, ?Statistical approach to shape from shad- ing: Reconstrunction of3-dimensional facesurfaces fromsingle 2-dimentional images,? Neural Computation, vol. 8, pp. 1321?1340, 1996. [66] V. Blanz and T. Vetter, ?Face recognition based on fitting a 3D morphable model,? IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 25, pp. 1063?1074, 2003. [67] T. Cootes, K. Walker, and C. Taylor, ?View-based Active appearance mod- els,? Proceedings of International Conference on Automatic Face and Gesture Recognition, 2000. [68] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, ?From few to many: Illumination cone models for face recognition under variable lighting and pose,? IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, pp. 643 ?660, 2001. 237 [69] R. Gross, I. Matthews, and S. Baker, ?Eigen light-fields and face recognition across pose,? Proceedings of Intenational Conference on Automatic Face and Gesture Recognition, Washington D.C., 2002. [70] R. Gross, I. Matthews, and S. Baker, ?Fisher light-fields for face recognition across pose and illumination,? Proceedings of the German Symposium on Pattern Recognition, Washington D.C., 2002. [71] R. Gross, I. Matthews, and S. Baker, ?Appearance-based face recognition and light-fields,? IEEE Transactions on Pattern Analysis and Machine In- telligence, vol. 26, no. 4, pp. 449 - 465, April, 2004. [72] A. Pentland, B. Moghaddam, and T. Starner, ?View-based and modular eigenspaces for face recognition,? Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Seattle, WA, 1994. [73] S. Romdhani and T. Vetter, ?Efficient, robust and accurate fitting of a 3D morphable model,? Proceedings of IEEE Internationl Conference on Com- puter Vision, pp. 59-66, Nice, France, 2003. [74] A. Shashua and T. R. Raviv, ?The quotient image: Class based re-rendering and recognition with varying illuminations,? IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, pp. 129?139, 2001. [75] T. Sim, S. Baker, and M. Bast, ?The CMU pose, illuminatin, and expression (PIE) database,? Proceedings of International Conference on Automatic Face and Gesture Recognition, pp. 53?58, Washington D.C., 2002. 238 [76] T. Vetter and T. Poggio, ?Linear object classes and image synthesis from a single example image,? IEEE Trans. Pattern Analysis and Machine Intelli- gence, vol. 11, pp. 733?742, 1997. [77] M.A.O. Vasilescu and D. Terzopoulos, ?Multilinear analysis of image en- sembles: Tensorfaces,? European Conference on Computer Vision, vol. 2350, pp. 447-460, Copenhagen, Denmark, May 2002. [78] M. Vasilescu and D. 
Terzopoulos, ?Multilinear image analysis for facialrecog- nition,? Proceedings of International Conference on Pattern Recognition, Quebec City, Canada, 2002. [Face recognition from video sequences] [79] T. Choudhury, B. Clarkson, T. Jebara, and A. Pentland, ?Multimodal per- son recognition using unconstrained audio and video,? Proceedings of Inter- national Conference on Audio- and Video-Based Person Authentication, pp. 176?181, Washington D.C., 1999. [80] A. Fitzgibbon and A. Zisserman, ?Joint manifold distance: a new approach to appearance based clustering,? Proceedings of IEEE Conference on Com- puter Vision and Pattern Recognition, Madison, WI, 2003. [81] R. Gross and J. Shi, ?The CMU Motion of Body (MoBo) Database,? CMU- RI-TR-01-18, 2001. [82] A. Howell and H.Buxton, ?Facerecognition using radialbasis functionneural networks,? Proceedings of British Machine Vision Conference, pp. 455?464, 1996. 239 [83] T. Jebara and A. Pentland, ?Parameterized structure from motion for 3D adaptive feedback tracking of faces,? Proceedings of IEEE Computer Soci- ety Conference on Computer Vision and Pattern Recognition, pp. 144 ?150, Puerto Rico, 1997. [84] K. Lee, M. Yang, and D. Kriegman, ?Video-based face recognition using probabilistic appearance manifolds,? Proceedings of IEEE Computer Soci- ety Conference on Computer Vision and Pattern Recognition, Madison, WI, 2003. [85] B. Li and R. Chellappa, ?Face verification through tracking facial features,? Journal of Optical Society of America A, vol. 18, no. 12, pp. 2969?2981, 2001. [86] B. Li and R. Chellappa, ?A generic approach to simultaneous tracking and verification in video,? IEEE Transaction on Image Processing, vol. 11, no. 5, pp. 530?554, 2002. [87] Y. Li, S. Gong, and H. Liddell, ?Modelling faces dynamically across views and over time,? Proceedings of International Conference on Computer Vi- sion, pp. 554 ?559, Hawaii, 2001. [88] Y. Li, S. Gong, and H. Liddell, ?Constructing facial identity surfaces in a nonlinear discriminant space,? Proceedings of IEEE Computer Society Con- ference on Computer Vision and Pattern Recognition, Hawaii, 2001. [89] X. Liu and T. Chen, ?Video-based face recognition using adaptive hidden markov models,? Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Madison, WI, 2003. 240 [90] S. McKenna and S. Gong, ?Non-intrusive person authentication for access control by visual tracking and face recognition,? Proceedings of International Conference on Audio- and Video-based Biometric Person Authentication, pp. 177?183, Crans-Montana, Switzerland, 1997. [91] G. Shakhnarovich, J. Fisher, and T. Darrell, ?Face recognition from long- term observations,? Proc. European Conference on Computer Vision, Copen- hagen, Denmark, 2002. [92] J. Steffens, E. Elagin, and H. Neven, ?Personspotter - fast and robust system for human detection, tracking, and recognition,? Proceedings of Internationl Conference on Automatic Face and Gesture Recognition, pp. 516?521, Nara, Japan, 1998. [93] H. Wechsler, V. Kakkad, J. Huang, S. Gutta, and V. Chen, ?Automatic video-based person authentication using the RBF network,? Proceedings of International Conference on Audio- and Video-based Biometric Person Au- thentication, pp. 85?92, Crans-Montana, Switzerland, 1997. [Lighting and illumination] [94] R. Basri and D. Jacobs, ?Photometric stereo with general, unknown light- ing,? Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, vol. II, pp. 374?381, Hawaii, 2001. 
[95] R. Basri and D. Jacobs, "Lambertian reflectance and linear subspaces," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, pp. 218-233, 2003.
[96] H. Hayakawa, "Photometric stereo under a light source with arbitrary motion," Journal of the Optical Society of America A, vol. 11, 1994.
[97] P. N. Belhumeur and D. J. Kriegman, "What is the set of images of an object under all possible illumination conditions?" International Journal of Computer Vision, vol. 28, pp. 245-260, 1998.
[98] P. Belhumeur, D. Kriegman, and A. Yuille, "The bas-relief ambiguity," International Journal of Computer Vision, vol. 35, pp. 33-44, 1999.
[99] R. T. Frankot and R. Chellappa, "A method for enforcing integrability in shape from shading algorithms," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 10, pp. 439-451, 1987.
[100] R. Ramamoorthi and P. Hanrahan, "On the relationship between radiance and irradiance: Determining the illumination from images of a convex Lambertian object," Journal of the Optical Society of America A, vol. 18, pp. 2448-2459, 2001.
[101] A. Shashua, "On photometric issues in 3D visual recognition from a single 2D image," International Journal of Computer Vision, vol. 21, pp. 99-122, 1997.
[102] I. Shimshoni, Y. Moses, and M. Lindenbaum, "Shape reconstruction of 3D bilaterally symmetric surfaces," International Journal of Computer Vision, vol. 39, pp. 97-100, 2000.
[103] A. L. Yuille, D. Snow, R. Epstein, and P. N. Belhumeur, "Determining generative models of objects under varying illumination: Shape and albedo from multiple images using SVD and integrability," International Journal of Computer Vision, vol. 35, pp. 203-222, 1999.
[104] W. Zhao and R. Chellappa, "Symmetric shape from shading using self-ratio image," International Journal of Computer Vision, vol. 45, pp. 55-75, 2001.
[105] Q. F. Zheng and R. Chellappa, "Estimation of illuminant direction, albedo and shape from shading," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 13, pp. 680-702, 1991.

[Tracking, detection, and registration]

[106] A. Azarbayejani and A. Pentland, "Recursive estimation of motion, structure, and focal length," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 17, pp. 562-575, 1995.
[107] A. Bergen, P. Anandan, K. Hanna, and R. Hingorani, "Hierarchical model-based motion estimation," European Conference on Computer Vision, pp. 237-252, Stockholm, Sweden, 1992.
[108] M. J. Black and A. D. Jepson, "EigenTracking: Robust matching and tracking of articulated objects using a view-based representation," European Conference on Computer Vision, vol. 1, pp. 329-342, Cambridge, UK, 1996.
[109] M. J. Black and D. J. Fleet, "Probabilistic detection and tracking of motion discontinuities," Proceedings of International Conference on Computer Vision, vol. 2, pp. 551-558, Greece, 1999.
[110] M. E. Brand, "Morphable 3D models from video," Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Hawaii, 2001.
[111] C. Bregler, A. Hertzmann, and H. Biermann, "Recovering nonrigid 3D shape from image streams," Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Hilton Head, SC, 2000.
[112] T. J. Broida, S. Chandra, and R. Chellappa, "Recursive techniques for estimation of 3-D translation and rotation parameters from noisy image sequences," IEEE Trans. Aerospace and Electronic Systems, vol. AES-26, pp. 639-656, 1990.
[113] D. Comaniciu, V. Ramesh, and P. Meer, "Real-time tracking of non-rigid objects using mean shift," Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 142-149, Hilton Head, SC, 2000.
[114] N. J. Gordon, D. J. Salmond, and A. F. M. Smith, "Novel approach to nonlinear/non-Gaussian Bayesian state estimation," IEE Proceedings on Radar and Signal Processing, vol. 140, pp. 107-113, 1993.
[115] G. D. Hager and P. N. Belhumeur, "Efficient region tracking with parametric models of geometry and illumination," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 20, pp. 1025-1039, 1998.
[116] M. Irani, "Multi-frame optical flow estimation using subspace constraints," Proceedings of International Conference on Computer Vision, pp. 626-633, Greece, 1999.
[117] M. Irani and P. Anandan, "Factorization with uncertainty," European Conference on Computer Vision, pp. 539-553, Dublin, Ireland, 2000.
[118] M. Isard and A. Blake, "Contour tracking by stochastic propagation of conditional density," European Conference on Computer Vision, pp. 343-356, Cambridge, UK, 1996.
[119] M. Isard and A. Blake, "ICONDENSATION: Unifying low-level and high-level tracking in a stochastic framework," European Conference on Computer Vision, vol. 1, pp. 767-781, Freiburg, Germany, 1998.
[120] A. D. Jepson, D. J. Fleet, and T. El-Maraghi, "Robust online appearance models for visual tracking," Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 415-422, Hawaii, 2001.
[121] F. Jurie and M. Dhome, "A simple and efficient template matching algorithm," Proceedings of International Conference on Computer Vision, vol. 2, pp. 544-549, Vancouver, BC, 2001.
[122] Q. Ke and T. Kanade, "A subspace approach to layer extraction," Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Hawaii, 2001.
[123] B. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," International Joint Conference on Artificial Intelligence, 1981.
[124] B. North, A. Blake, M. Isard, and J. Rittscher, "Learning and classification of complex dynamics," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, pp. 1016-1034, 2000.
[125] G. Qian and R. Chellappa, "Structure from motion using sequential Monte Carlo methods," Proceedings of International Conference on Computer Vision, pp. 614-621, Vancouver, BC, 2001.
[126] C. Rasmussen and G. Hager, "Probabilistic data association methods for tracking complex visual objects," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 560-576, 2001.
[127] H. Sidenbladh, M. J. Black, and D. J. Fleet, "Stochastic tracking of 3D human figures using 2D image motion," European Conference on Computer Vision, vol. 2, pp. 702-718, Copenhagen, Denmark, 2002.
[128] J. Sullivan and J. Rittscher, "Guiding random particles by deterministic search," Proceedings of International Conference on Computer Vision, vol. 1, pp. 323-330, Vancouver, BC, 2001.
[129] C. Tomasi and T. Kanade, "Shape and motion from image streams under orthography: a factorization method," International Journal of Computer Vision, vol. 9, no. 2, pp. 137-154, 1992.
[130] K. Toyama and A. Blake, "Probabilistic tracking in a metric space," Proceedings of International Conference on Computer Vision, pp. 50-59, Vancouver, BC, 2001.
[131] J. Vermaak, P. Pérez, M. Gangnet, and A. Blake, "Towards improved observation models for visual tracking: selective adaptation," European Conference on Computer Vision, pp. 645-660, Copenhagen, Denmark, 2002.
[132] P. Viola and M. Jones, "Robust real-time object detection," Second Intl. Workshop on Statistical and Computational Theories of Vision, Vancouver, BC, 2001.
[133] Y. Wu and T. S. Huang, "A co-inference approach to robust visual tracking," Proceedings of International Conference on Computer Vision, vol. 2, pp. 26-33, Vancouver, BC, 2001.
[134] C. Yang, R. Duraiswami, A. Elgammal, and L. Davis, "Real-time kernel-based tracking in joint feature-spatial spaces," Tech. Report CS-TR-4567, Univ. of Maryland, 2004.

[Others in computer vision and graphics]

[135] M. J. Black and A. D. Jepson, "A probabilistic framework for matching temporal trajectories," Proceedings of International Conference on Computer Vision, pp. 176-181, Greece, 1999.
[136] R. Bolle and D. Cooper, "On optimally combining pieces of information, with application to estimating 3-D complex-object position from range data," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 8, pp. 619-638, 1986.
[137] D. Forsyth, "Shape from texture and integrability," Proc. International Conference on Computer Vision, pp. 447-453, Vancouver, BC, 2001.
[138] W. T. Freeman and J. B. Tenenbaum, "Learning bilinear models for two-factor problems in vision," Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Puerto Rico, 1997.
[139] P. Fua, "Regularized bundle adjustment to model heads from image sequences without calibrated data," International Journal of Computer Vision, vol. 38, pp. 153-157, 2000.
[140] S. J. Gortler, R. Grzeszczuk, R. Szeliski, and M. Cohen, "The lumigraph," Proceedings of SIGGRAPH, pp. 43-54, New Orleans, LA, USA, 1996.
[141] D. Jacobs, "Linear fitting with missing data for structure-from-motion," Computer Vision and Image Understanding, vol. 82, pp. 57-81, 2001.
[142] A. Laurentini, "The visual hull concept for silhouette-based image understanding," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 16, no. 2, pp. 150-162, 1994.
[143] M. Levoy and P. Hanrahan, "Light field rendering," Proceedings of ACM SIGGRAPH, New Orleans, LA, USA, 1996.
[144] W. Matusik, C. Buehler, R. Raskar, S. Gortler, and L. McMillan, "Image-based visual hulls," Proceedings of SIGGRAPH, pp. 369-374, New Orleans, LA, USA, 2000.
[145] A. Roy Chowdhury and R. Chellappa, "Face reconstruction from video using uncertainty analysis and a generic model," Computer Vision and Image Understanding, vol. 91, pp. 188-213, 2003.
[146] Y. Shan, Z. Liu, and Z. Zhang, "Model-based bundle adjustment with application to face modeling," Proceedings of International Conference on Computer Vision, pp. 645-651, Vancouver, BC, 2001.

[Statistical analysis and computing]

[147] B. Adhikari and D. Joshi, "Distance, discrimination et résumé exhaustif," Publ. Inst. Statist., vol. 5, pp. 57-74, 1956.
[148] X. Boyen and D. Koller, "Tractable inference for complex stochastic processes," Proceedings of the 14th Annual Conference on Uncertainty in AI (UAI), pp. 33-42, Madison, Wisconsin, 1998.
[149] M. Brand, "Incremental singular value decomposition of uncertain data with missing values," European Conference on Computer Vision, pp. 707-720, Copenhagen, Denmark, 2002.
[150] A. Bhattacharyya, "On a measure of divergence between two statistical populations defined by their probability distributions," Bull. Calcutta Math. Soc., vol. 35, pp. 99-109, 1943.
[151] H. Chernoff, "A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations," Annals of Math. Stat., vol. 23, pp. 493-507, 1952.
[152] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Roy. Statist. Soc. B, vol. 39, 1977.
[153] A. Doucet, S. J. Godsill, and C. Andrieu, "On sequential Monte Carlo sampling methods for Bayesian filtering," Statistics and Computing, vol. 10, no. 3, pp. 197-209, 2000.
[154] D. Fox, "KLD-sampling: Adaptive particle filters and mobile robot localization," Neural Information Processing Systems (NIPS), 2001.
[155] A. Hyvärinen, "Survey on independent component analysis," Neural Computing Surveys, vol. 2, pp. 94-128, 1999.
[156] T. Kailath, "The divergence and Bhattacharyya distance measures in signal selection," IEEE Trans. on Comm. Tech., vol. COM-15, pp. 52-60, 1967.
[157] G. Kitagawa, "Monte Carlo filter and smoother for non-Gaussian nonlinear state space models," J. Computational and Graphical Statistics, vol. 5, pp. 1-25, 1996.
[158] T. Lissack and K. Fu, "Error estimation in pattern recognition via L-distance between posterior density functions," IEEE Trans. Information Theory, vol. 22, pp. 34-45, 1976.
[159] J. S. Liu and R. Chen, "Sequential Monte Carlo methods for dynamic systems," Journal of the American Statistical Association, vol. 93, pp. 1031-1041, 1998.
[160] P. Mahalanobis, "On the generalized distance in statistics," Proc. National Inst. Sci. (India), vol. 12, pp. 49-55, 1936.
[161] K. Matusita, "Decision rules based on the distance for problems of fit, two samples and estimation," Ann. Math. Stat., vol. 26, pp. 631-640, 1955.
[162] S. Roweis and L. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323-2326, December 2000.
[163] E. Patrick and F. Fisher, "Nonparametric feature selection," IEEE Trans. Information Theory, vol. 15, pp. 577-584, 1969.
[164] P. Penev and J. Atick, "Local feature analysis: A general statistical theory for object representation," Network: Computation in Neural Systems, vol. 7, pp. 477-500, 1996.
[165] H. Shum, K. Ikeuchi, and R. Reddy, "Principal component analysis with missing data and its applications to polyhedral object modeling," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 17, pp. 854-867, 1995.
[166] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, no. 5500, pp. 2319-2323, December 2000.
[167] M. E. Tipping and C. M. Bishop, "Mixtures of probabilistic principal component analysers," Neural Computation, vol. 11, no. 2, pp. 443-482, 1999.
[168] M. E. Tipping and C. M. Bishop, "Probabilistic principal component analysis," Journal of the Royal Statistical Society, Series B, vol. 61, pp. 611-622, 1999.
[169] T. Wiberg, "Computation of principal components when data are missing," Proc. Second Symp. Computational Statistics, pp. 229-236, 1976.

[Machine learning and kernel methods]

[170] F. Bach and M. I. Jordan, "Kernel independent component analysis," Journal of Machine Learning Research, vol. 3, pp. 1-48, 2002.
[171] F. Bach and M. I. Jordan, "Learning graphical models with Mercer kernels," Advances in Neural Information Processing Systems, 2002.
[172] G. Baudat and F. Anouar, "Generalized discriminant analysis using a kernel approach," Neural Computation, vol. 12, pp. 2385-2404, 2000.
[173] F. Girosi, M. Jones, and T. Poggio, "Regularization theory and neural networks architectures," Neural Computation, vol. 7, pp. 219-269, 1995.
[174] T. Jebara and R. Kondor, "Bhattacharyya and expected likelihood kernels," Conference on Learning Theory (COLT), 2003.
[175] R. Kondor and T. Jebara, "A kernel between sets of vectors," International Conference on Machine Learning (ICML), 2003.
[176] J. Mercer, "Functions of positive and negative type and their connection with the theory of integral equations," Philos. Trans. Roy. Soc. London, vol. A 209, pp. 415-446, 1909.
[177] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller, "Fisher discriminant analysis with kernels," in Neural Networks for Signal Processing IX, Y.-H. Hu, J. Larsen, E. Wilson, and S. Douglas (Eds.), IEEE, pp. 41-48, 1999.
[178] P. Moreno, P. Ho, and N. Vasconcelos, "A Kullback-Leibler divergence based kernel for SVM classification in multimedia applications," Neural Information Processing Systems, 2003.
[179] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, "An introduction to kernel-based learning algorithms," IEEE Trans. Neural Networks, vol. 12, pp. 181-202, 2001.
[180] A. Ng, M. Jordan, and Y. Weiss, "On spectral clustering: analysis and an algorithm," Neural Information Processing Systems, 2002.
[181] B. Schölkopf, A. Smola, and K.-R. Müller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, vol. 10, pp. 1299-1319, 1998.
[182] M. Tipping, "Sparse kernel principal component analysis," Neural Information Processing Systems, 2001.
[183] C. K. I. Williams, "On a connection between kernel PCA and metric multidimensional scaling," Neural Information Processing Systems, 2001.
[184] L. Wolf and A. Shashua, "Kernel principal angles for classification machines with applications to image sequence interpretation," IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Madison, WI, 2003.

[Shaohua Zhou's publications]

[185] S. Zhou, V. Krueger, and R. Chellappa, "Probabilistic recognition of human faces from video," Computer Vision and Image Understanding, vol. 91, pp. 214-245, 2003.
[186] S. Zhou, R. Chellappa, and B. Moghaddam, "Visual tracking and recognition using appearance-adaptive models in particle filters," IEEE Trans. Image Processing (to appear), 2004.
[187] S. Zhou, R. Chellappa, and D. Jacobs, "Generalized photometric stereo and its applications to face recognition," International Journal of Computer Vision (submitted).
[188] S. Zhou and R. Chellappa, "Image-based face recognition under illumination and pose variations," Journal of the Optical Society of America (submitted).
[189] S. Zhou and R. Chellappa, "Probabilistic distances in reproducing kernel Hilbert space," IEEE Trans. on Information Theory (under preparation).
[190] R. Chellappa and S. Zhou, "Face tracking and recognition from video," Handbook of Face Recognition, S. Li and A. K. Jain (Eds.), Springer, 2004.
[191] S. Zhou and R. Chellappa, "Face recognition from still images and videos," Handbook of Image and Video Processing, A. Bovik (Ed.), Academic Press, 2004.
[192] S. Zhou, V. Krueger, and R. Chellappa, "Face recognition from video: A condensation approach," Proceedings of International Conference on Automatic Face and Gesture Recognition, Washington, D.C., USA, May 2002.
[193] S. Zhou and R. Chellappa, "Probabilistic human recognition from video," European Conference on Computer Vision, vol. 3, pp. 681-697, Copenhagen, Denmark, May 2002.
[194] V. Krueger and S. Zhou, "Exemplar-based face recognition from video," European Conference on Computer Vision, Copenhagen, Denmark, 2002.
[195] R. Chellappa, S. Zhou, and B. Li, "Bayesian methods for probabilistic human recognition from video," Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Orlando, Florida, USA, 2002.
[196] S. Zhou and R. Chellappa, "A robust algorithm for probabilistic human recognition from video," Proceedings of International Conference on Pattern Recognition, Quebec City, Canada, 2002.
[197] R. Chellappa, V. Krueger, and S. Zhou, "Probabilistic recognition of human faces from video," Proceedings of IEEE International Conference on Image Processing, Rochester, NY, 2002.
[198] S. Zhou, "Probabilistic analysis of kernel principal components: classification and mixture modeling," CfAR Technical Report, CAR-TR-993, 2003.
[199] S. Zhou and R. Chellappa, "Simultaneous tracking and recognition of human faces from video," Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003.
[200] J. Li, S. Zhou, and C. Shekhar, "A comparison of subspace analysis for face recognition," Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003.
[201] S. Zhou, R. Chellappa, and B. Moghaddam, "Adaptive visual tracking and recognition using particle filters," Proceedings of IEEE International Conference on Multimedia & Expo, Baltimore, USA, 2003.
[202] S. Zhou and R. Chellappa, "Rank constrained recognition under unknown illuminations," IEEE Intl. Workshop on Analysis and Modeling of Faces and Gestures, Nice, France, 2003.
[203] S. Zhou, R. Chellappa, and B. Moghaddam, "Appearance tracking using adaptive models in a particle filter," Proceedings of Asian Conference on Computer Vision, Korea, January 2004.
[204] S. Zhou, R. Chellappa, and D. Jacobs, "Characterization of human faces under illumination variations using rank, integrability, and symmetry constraints," European Conference on Computer Vision, Prague, Czech Republic, May 2004.
[205] J. Li and S. Zhou, "Probabilistic face recognition with compressed imagery," Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Canada, May 2004.
[206] J. Shao, S. Zhou, and R. Chellappa, "Appearance-based visual tracking and recognition with trilinear tensor," Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Canada, May 2004.
[207] Z. Yue, S. Zhou, and R. Chellappa, "Robust two-camera visual tracking with homography," Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Canada, May 2004.
[208] S. Zhou and R. Chellappa, "Illuminating light field: Image-based face recognition across illuminations and poses," Proceedings of International Conference on Automatic Face and Gesture Recognition, Seoul, Korea, May 2004.
[209] S. Zhou, R. Chellappa, and B. Moghaddam, "Intra-personal kernel space for face recognition," Proceedings of International Conference on Automatic Face and Gesture Recognition, Seoul, Korea, May 2004.
[210] S. Zhou and R. Chellappa, "Probabilistic identity characterization for face recognition," Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington D.C., USA, June 2004.
[211] S. Zhou and R. Chellappa, "Multiple-exemplar discriminant analysis for face recognition," Proceedings of International Conference on Pattern Recognition, Cambridge, UK, August 2004.
[212] J. Shao, S. Zhou, and Q. Zheng, "Robust appearance-based tracking of moving object from moving platform," Proceedings of International Conference on Pattern Recognition, Cambridge, UK, August 2004.
[213] J. Shao, S. Zhou, and R. Chellappa, "Simultaneous background and foreground modeling for tracking in surveillance video," Proceedings of IEEE International Conference on Image Processing, Singapore, October 2004.