ABSTRACT

Title of dissertation: NON-LINEAR AND SPARSE REPRESENTATIONS FOR MULTI-MODAL RECOGNITION
Hien Van Nguyen, Doctor of Philosophy, 2013
Dissertation directed by: Professor Rama Chellappa, Department of Electrical and Computer Engineering

In the first part of this dissertation, we address the problem of representing 2D and 3D shapes. In particular, we introduce a novel implicit shape representation based on Support Vector Machine (SVM) theory. Each shape is represented by an analytic decision function obtained by training an SVM, with a Radial Basis Function (RBF) kernel, so that the interior shape points are given higher values. This endows the support vector shape (SVS) representation with several advantages. First, the representation uses a sparse subset of feature points determined by the support vectors, which significantly improves the discriminative power against noise, fragmentation and other artifacts that often come with the data. Second, the use of the RBF kernel provides scale, rotation, and translation invariant features, and allows a shape to be represented accurately regardless of its complexity. Finally, the decision function can be used to select reliable feature points. These features are described using gradients computed from highly consistent decision functions instead of conventional edges. Our experiments on 2D and 3D shapes demonstrate promising results.

The availability of inexpensive 3D sensors like Kinect necessitates the design of new representations for this type of data. We present a 3D feature descriptor that represents local topologies within a set of folded concentric rings by distances from local points to a projection plane. This feature, called the Concentric Ring Signature (CORS), possesses similar computational advantages to point signatures yet provides more accurate matches. CORS produces compact and discriminative descriptors, which makes it more robust to noise and occlusions.

It is also well-known to computer vision researchers that there is no universal representation that is optimal for all types of data or tasks. Sparsity has proved to be a good criterion for working with natural images. This motivates us to develop efficient sparse and non-linear learning techniques for automatically extracting useful information from visual data. Specifically, we present dictionary learning methods for sparse and redundant representations in a high-dimensional feature space. Using the kernel method, we describe how the well-known dictionary learning approaches such as the method of optimal directions (MOD) and KSVD can be made non-linear. We analyse their kernel constructions and demonstrate their effectiveness through several experiments on classification problems. It is shown that non-linear dictionary learning approaches can provide significantly better discrimination compared to their linear counterparts and kernel PCA, especially when the data is corrupted by different types of degradations.

Visual descriptors are often high dimensional. This results in high computational complexity for sparse learning algorithms. Motivated by this observation, we introduce a novel framework, called sparse embedding (SE), for simultaneous dimensionality reduction and dictionary learning. We formulate an optimization problem for learning a transformation from the original signal domain to a lower-dimensional one in a way that preserves the sparse structure of data.
We propose an efficient optimization algorithm and present its non-linear extension based on the kernel methods. One of the key features of our method is that it is computationally efficient, as the learning is done in the lower-dimensional space, and it discards the irrelevant part of the signal that derails the dictionary learning process. Various experiments show that our method is able to capture the meaningful structure of data and can perform significantly better than many competitive algorithms on signal recovery and object classification tasks.

In many practical applications, we are often confronted with the situation where the data that we use to train our models are different from those presented during testing. In the final part of this dissertation, we present a novel framework for domain adaptation using a sparse and hierarchical network (DASH-N), which makes use of the old data to improve the performance of a system operating on a new domain. Our network jointly learns a hierarchy of features together with transformations that rectify the mismatch between different domains. The building block of DASH-N is the latent sparse representation. It employs a dimensionality reduction step that can prevent the data dimension from increasing too fast as one traverses deeper into the hierarchy. Experimental results show that our method consistently outperforms the current state-of-the-art by a significant margin. Moreover, we found that a multi-layer DASH-N has an edge over the single-layer DASH-N.

NON-LINEAR AND SPARSE REPRESENTATIONS FOR OBJECT RECOGNITION
by Hien Van Nguyen

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2013

Advisory Committee:
Professor Rama Chellappa (Chair/Advisor)
Professor Larry Davis
Professor Min Wu
Doctor Jonathon Phillips
Associate Professor Wojciech Czaja (Dean's Representative)

© Copyright by Hien Van Nguyen 2013

DEDICATION

To my parents, Ut Hien and Cam Ha.

Acknowledgments

First and foremost I would like to express my deepest gratitude to my advisor, Professor Rama Chellappa, for his extraordinary supervision. He has always been there whenever I needed advice and help, and there hasn't been an occasion when I've come to his office and he didn't give me time. His intellectual charisma is a tremendous source of inspiration and his witty sense of humor teaches me to look on the bright side of life. Professor Rama's approach to mentoring is like being a beacon: he encouraged me to find my own way but provided wise counsel whenever I was at a loss. He inspired me and gave me enough freedom to independently explore my own research interests. There are many more good deeds that he has done for me and my family, for which I want to thank him in heart and not in word.

I am also extremely grateful to Dr. Fatih Porikli and Dr. Oncel Tuzel for their hospitality when I was doing my internships at the Mitsubishi Electric Research Laboratories. Dr. Fatih's whole-hearted support has had a tremendous positive impact on my current research and future career. I am very thankful to Dr. Nasser Nasrabadi for teaching me to learn from the basics. Dr. Vishal Patel deserves my special appreciation for his help in editing papers, discussing new ideas and sharing with me his research experience.

I am grateful to my dissertation committee members, Professor Larry Davis, Professor Min Wu, Professor Wojciech Czaja and Dr.
Jonathon Phillips for spending their valuable time on reading my thesis and attending my dissertation defence. I would also want to acknowledge the help from the administrative staff, including Ms. Janice Perrone, Dr. Tracy Chung, Ms. Melanie Prange, Ms. Maria Hoo, and Ms. Meg Richmond, for helping me to smoothly get through all the complicated paperwork.

The bright fellow graduate students at the Center for Automation Research have enriched my graduate life in many ways and I am truly grateful to all of them: Jaishanker Pillai, Nazare Batool, Tho Ho, Ashish Srivastava, Sumit Shekhar, Yi-Chen Chen, Raviteja Vemulapalli, Garrett Warnell, Wu Tao, Sima Taheri, Qiang Qiu, Jie Ni, Aswin Sankaranarayanan, Pavan Turaga, Kaushik Mitra, Dikpal Reddy, Ming-Yu Liu, Ruonan Li, Raghuram Gopalan, Mahesh Ramachandran, and other colleagues. Without these companions, the journey would not be as fulfilling and enjoyable.

I have been extremely fortunate to know many wonderful Vietnamese friends in Maryland who have been a significant part of my life in the last five years and made me feel at home. Special gratitude is due to Dr. Ha Minh Nguyen, who frequently drove me to grocery stores for the first two years. I am immensely grateful to Mr. Ba and his wife, my grandfather, my aunts, Mrs. Hang, and Mrs. Diep for helping my parents when I am away from home. I can't be more grateful to Dr. Amador and Ms. Ling for their contributions in improving my parents' well-being. I owe my deepest gratitude to my family (father, mother, Ut Hien, Cam Ha), who have endured a difficult time for me to accomplish my graduate program. Words just can't do enough justice for what they have done for me. It is impossible to remember all, and I apologize to those I've inadvertently left out. Finally, thank you all!

Table of Contents List of Tables vii List of Figures viii 1 Introduction 1 2 Shape Representations 6 2.1 Related Work On 2D and 3D Shape Representations . . . 6 2.1.1 Global Approaches . . . 7 2.1.2 Part-Based Approaches . . . 8 2.2 Support Vector Shape Representation . . . 9 2.2.1 Formulation of Support Vector Shape . . . 10 2.2.1.1 Learning a Decision Function: . . . 12 2.2.1.2 ν-SVM and One-Class SVM: . . . 17 2.2.2 SVS Feature Points . . . 20 2.2.2.1 Gradient-based Feature Points . . . 21 2.2.2.2 Support Vector-based Feature Points . . . 23 2.2.2.3 Curvature-based Feature Points . . . 24 2.2.2.4 Entropy-based Feature Points . . . 25 2.2.3 SVS Descriptors . . . 26 2.2.4 Experiments . . . 29 2.2.5 Proof of SVM-RBF Irreducibility . . . 36 2.3 Concentric Ring Signature . . . 37 2.3.1 CORS Construction . . . 38 2.3.2 Fast Approximation CORS . . . 43 2.3.3 Similarity Measure and Matching . . . 45 2.3.4 Experimental Results . . . 45 2.3.4.1 Saliency . . . 46 2.3.4.2 Shape Detection and Registration . . .
49 2.3.4.3 Shape Recognition and Retrieval . . . . . . . . . . . . 53 3 Sparse Representations 57 3.1 Background on Sparse Coding and Dictionary Learning . . . . . . . . . . 57 3.2 Non-Linear Kernel Dictionary Learning . . . . . . . . . . . . . . . . . . 59 3.2.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . 60 3.2.2 Optimization Procedure . . . . . . . . . . . . . . . . . . . . . . 63 3.2.2.1 Kernel Orthogonal Matching Pursuit (KOMP) . . . . . 63 3.2.3 Dictionary Update with Kernel MOD . . . . . . . . . . . . . . . 66 3.2.4 Dictionary Update with Kernel KSVD . . . . . . . . . . . . . . . 67 3.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 3.3.1 Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 3.3.2 Kernel sparse representation . . . . . . . . . . . . . . . . . . . . 75 3.3.3 Digit Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . 76 v 3.3.3.1 Pre-images of learned atoms . . . . . . . . . . . . . . 78 3.3.3.2 Classification results . . . . . . . . . . . . . . . . . . . 79 3.3.4 Caltech-101 Object Recognition . . . . . . . . . . . . . . . . . . 83 3.4 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . 89 3.5 Proof of Proposition 3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 4 Sparse Representations On Low-Dimensional Feature Space 92 4.1 Motivation and Related Work . . . . . . . . . . . . . . . . . . . . . . . . 92 4.2 Sparse Embedding Framework . . . . . . . . . . . . . . . . . . . . . . . 94 4.3 Optimization Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 4.4 Non-linear Extension of Sparse Embedding . . . . . . . . . . . . . . . . 100 4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 4.5.1 Recovery of Dictionary Atoms . . . . . . . . . . . . . . . . . . . 103 4.5.2 Latent Sparse Embedding Residual Classifier (LASERC) . . . . . 106 4.5.2.1 USPS Digit Recognition . . . . . . . . . . . . . . . . . 107 4.5.2.2 Caltech-101 and Caltech-256 Object Detection . . . . . 109 4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 4.7 Proof of Proposition 4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 5 Domain Adaptation Using a Sparse and Hierarchical Network 117 5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 5.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 5.3 Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 5.3.1 Latent Sparse Representation . . . . . . . . . . . . . . . . . . . . 121 5.3.2 Multi-layer Feature Learning . . . . . . . . . . . . . . . . . . . . 124 5.3.3 Hierarchical Domain Adaptation . . . . . . . . . . . . . . . . . . 125 5.4 Optimization Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 5.5.1 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 131 5.5.2 Object Recognition . . . . . . . . . . . . . . . . . . . . . . . . . 134 5.5.3 Half-toning and Edge . . . . . . . . . . . . . . . . . . . . . . . . 136 5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 6 Future Works 140 6.1 Invariant Representations . . . . . . . . . . . . . . . . . . . . . . . . . . 140 6.2 Learning Transformation Models . . . . . . . . . . . . . . . . . . . . . . 141 6.3 Representations On Compressed Domain . . . . . . . . . 
. . . . . . . . 141 Bibliography 143 vi List of Tables 2.1 Comparison of MPEG7 classification results using four different feature point selection methods in Section 2.2.2 based on gradients, support vectors, curva- tures, and entropies. The last column shows the baseline classification result when gradients are computed from edges instead of SVS. . . . . . . . . . . . 30 2.2 Comparison for the MPEG7 dataset. . . . . . . . . . . . . . . . . . . . . . . 31 2.3 MPEG7 classification results versus the number of features selected using the gradient-based selection strategy. . . . . . . . . . . . . . . . . . . . . . . . 33 2.4 Random distortion results on partial MPEG7 dataset . . . . . . . . . . . . . . 33 2.5 Matching results on the articulation dataset . . . . . . . . . . . . . . . . . . 35 2.6 Amount of time taken to compute 500 descriptors. . . . . . . . . . . . . . 44 3.1 Parameters? dimensions. . . . . . . . . . . . . . . . . . . . . . . . . . . 61 3.2 Comparison of recognition results on Caltech-101 dataset . . . . . . . . . 86 3.3 Comparison of recognition results on Caltech-256 dataset . . . . . . . . . 87 4.1 Comparison of recognition results on Caltech-101 dataset (left), recogni- tion results on Caltech-256 dataset (upper right), and the computing time (lower right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 5.1 Recognition rates of different approaches on four domains (C: Caltech, A: Amazon, D: DSLR, W: Webcam). 10 common classes are used. Red color denotes the best recognition rates. Blue color denotes the second best recognition rates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 5.2 Single-source recognition rates on all 31 classes. . . . . . . . . . . . . . 133 5.3 Multiple-source recognition rates on all 31 classes . . . . . . . . . . . . . 134 5.4 Recognition rates of different approaches on the half-toning and edge datasets. 10 common classes are used. . . . . . . . . . . . . . . . . . . . 139 vii List of Figures 2.1 Illustration of SVS decision boundary. The dotted line is the tangent space. . . . 14 2.2 Orientation of ?f(x) remains stable even if the shape transforms and different set of parameters and points are used to train the SVS on the right. . . . . . . . 16 2.3 Support vectors (red circles ?i > 0, blue circles ?i ? 0), the decision function boundaries f(x) = 0 (yellow), and the original shape (green) for support vector numbers 25, 44, and 61. Classifier accuracies are 93%, 97.2%, and 99.8%, re- spectively. Original shape contains ?54K points (?1K points on the boundary), yet only a small fraction (e.g. 61) of points are needed to encode the shape. . . 17 2.4 Comparison of the decision function responses for noise-free and noisy shapes. In the first experiment, a car shape is distorted by randomly removing 20% and 50% of points and adding speckle noise to the background. In the second exper- iment, the letter A is interleaved with blank spaces. The differences of decision function for both cases are hardly noticeable. They are both trained using one- class SVM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.5 The relative gradient magnitude at a point is computed by dividing its gradient magnitude by mean of gradient magnitude across the entire shape. . . . . . . . 22 2.6 Decision function responses (left) and the gradient magnitude of the decision function (right) for the SVS representing Stanford Bunny. 
Points with higher values in ?(fx) seem to be good candidates to be selected as discriminative feature points. Black circles are the 51 feature points (102 support vectors). . . . 23 2.7 Stability of support vectors with respect to variation of RBF kernel width (?) and articulation. (top) Support vectors when ? varies. ? = 0.0027 in the first shape, and ? = 0.0054 in the second shape. 62 out of 78 support vectors in the first shape appear in the second shape. The overall positioning of the support vectors is very similar. (bottom) Support vectors with shape articulation. Blue circles indicate SVs that do not have correspondences on the other shape. 83 out of 114 SVs in the first shape appear in the second shape. . . . . . . . . . . . . . . . 25 2.8 Comparison of IDSC with the SVS computed at four locations indicated by square dots on the shapes. Descriptors are displayed from left to right corre- sponding to the following order of dots? colors: red, yellow, green, cyan. The second image has a crack in the shape. IDSC changes drastically (it highly de- pends on the boundary) while the SVS remains robust generating almost the identical descriptors. Box color indicates the severity of the mismatch: Green (< 20%) ; Orange (20? 50%); Red (> 50%). . . . . . . . . . . . . . . . . . 27 2.9 Comparison of Shape Context (SC) with the SVS computed at four locations indicated by square dots on the shapes. Descriptors are displayed from left to right corresponding to the following order of dots? colors: red, yellow, green, cyan. Box color indicates the severity of the mismatch: Green (< 20%) ; Orange (20? 50%); Red (> 50%). . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.10 SVS results: red circles show the incorrect matches. Note that none is in top 4 rankings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 viii 2.11 Comparison of the retrieval results for noisy dataset. Accuracy of IDSC is 57.0%, SVS is 93.7%. Red circles show incorrect matches. . . . . . . . . . . . 32 2.12 Samples from the articulation database. Note that, each column corresponds to a different object. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.13 a) CORS is constructed at p by finding a spherical support region S . b) A plane is fitted to local neighborhood and translated to the point p. The normal direction is taken to be z-axis c) Selecting a reference orientation for x-axis and projecting the distances from the surface to plane into the corresponding patches. . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 2.14 A special case where fitting a plane to local support is different from fitting a plane to its perimeter. a) Fitting to the perimeter results in a tangent plane, while b) fitting to the entire local neighborhood results in a slicing plane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 2.15 Illustration of CORS (red border) and spin image (black border) at differ- ent points on a 3D human face. The radius of support region is set to 15 for both descriptors. The number of azimuthal and radial quantizations of CORS are 10 and 5, respectively. The bin size of spin image is set equal to the scanner resolution as suggested by its author. For this setting, the dimension of CORS is around 6.5? more compact than that of spin image. 42 2.16 Percentage of correct matches within k-nearest neighbors. . . . . . . . . 46 2.17 Comparison of correct correspondences for two methods: random selec- tion and selecting with the discriminant ratio ? ? 1.5. 
Gaussian noise is added to all descriptors before matching. The noise standard deviation varies from 0.1 to 2, and database size varies from 10K to 100K. . . . . . 47 2.18 Correspondences of CORS with the discriminant ratio ? ? 1.5 a) Two shapes taken from TOSCA dataset with nonrigid transformations (Note that all correspondences for this pair are correct while the plotted lines are sometimes hidden behind the surface giving an impression of matching the face with the back of the head). b) A model and an occluded scene from Mian et al. dataset. (All correspondences, except two on the leg of the reference model, are correct.) . . . . . . . . . . . . . . . . . . . . . . 50 2.19 Comparison of the recognition rates vs. the percentage of occlusions: for spin image, tensor matching, Drost method, and CORS. . . . . . . . . . . 51 2.20 Recognition and registration of 3D model point clouds into the occluded scenes. The cyan and pink colors are used to render the scenes and mod- els, respectively. The transformed models closely overlap with the targets within all the scenes. 50 different scenes are used for testing our approach. 52 2.21 Six different articulations from a model in the TOSCA dataset. . . . . . . 53 2.22 Top retrieval shapes of our CORS method. All shapes are correctly re- trieved. (The 5th model gorilla has three and the 9th model wolf has two instances in the dataset.) . . . . . . . . . . . . . . . . . . . . . . . . . . 55 2.23 Top four retrieval shapes of spin image. Wrong retrievals are highlighted with orange boxes. Distance scores are erratic (see red underlines) inhibit- ing the determination of a cut-off value between the correct and incorrect shapes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 ix 3.1 The KOMP algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 3.2 The kernel KSVD algorithm. . . . . . . . . . . . . . . . . . . . . . . . . 68 3.3 Comparison of error ratios for (a) KSVD and (b) Kernel KSVD on a com- mon logarithm scale. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 3.4 Comparison between the level curves of the projection coefficients for three different dictionary atoms corresponding to linear KSVD and kernel KSVD. In this figure, the first row corresponds to KSVD and the second row corresponds to kernel KSVD. . . . . . . . . . . . . . . . . . . . . . 72 3.5 Comparison of error percentage using kernel KSVD and kernel PCA. . . 74 3.6 The pre-images of the kernel KSVD dictionary atoms. . . . . . . . . . . 76 3.7 Comparison of digit recognition accuracies for different methods in the presence of Gaussian noise and missing-pixel effects. Red color and orange color represent the distributive and collective classification ap- proaches for kernel KSVD, respectively. . . . . . . . . . . . . . . . . . . 78 3.8 Kernel KSVD classification accuracy versus the polynomial degree of the USPS dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 3.9 Comparison of digit recognition accuracies for different sparsity levels in the presence of Gaussian noise with standard deviation ?. All algorithms train on clean images and test on noisy images. The dictionary size is set to 50 for this experiments. . . . . . . . . . . . . . . . . . . . . . . . . . 81 3.10 Comparison of digit recognition accuracies for different dictionary sizes in the presence of Gaussian noise with standard deviation ?. All algo- rithms train on clean images and test on noisy images. 
The sparsity is set to T0 = 5 in this experiment. . . . . . . . . . . . . . . . . . . . . . . . . 81 3.11 The first two rows and the last two rows show the classes that our method performed the best and worst, respectively. . . . . . . . . . . . . . . . . . 85 3.12 Confusion matrix of our recognition performances on Caltech 101 dataset using kernel KSVD. The rows and columns correspond to true labels and predicted labels, respectively. The dictionary is learned from 3030 im- ages where each class contributes 30 images. The sparsity is set to be 30 for both training and testing. Although the confusion matrix contains all classes, only a subset of class labels is displayed for better legibility. . . . 89 3.13 Average accuracy of each class in Caltech-101 dataset. Classes are sorted in the ascending order of their recognition accuracies. . . . . . . . . . . 90 4.1 The SE algorithm for both linear and non-linear cases. . . . . . . . . . . 101 4.2 Comparison of number of recovered dictionary atoms over 40 trials. . . . 104 4.3 Average number of successfully recovered dictionary atoms versus the dimension of the reduced space for different distortion level ?. Blue color line corresponds to results for PCA+KSVD scheme, green color line for KSVD, and red color line for SE. . . . . . . . . . . . . . . . . . . . . . . 105 4.4 Comparison of PCA mapping (left) and the transformation P learned from SE (right). Distortion level ? = 1, dimension of the reduced space is 40. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 x 4.5 (a,b) Comparison of classification accuracy against noise level. (c) Ac- curacy versus dimension. (d) Projection coefficient of all samples onto a dictionary atom. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 4.6 Sample images from the classes corresponding to the highest accuracy (top row) and the lowest accuracy (bottom row) of LASERC. . . . . . . . 110 4.7 Caltech-101 Per Class Accuracy . . . . . . . . . . . . . . . . . . . . . . 111 4.8 Caltech-101 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . 112 5.1 An illustration of DASH-N algorithm. The source domain is RGB images and the target domain is halftone images. First, images are divided into small overlapping patches. These patches are vectorized while maintain- ing their spatial arrangements. (a) Performing contrast-normalization and dimensionality reduction using PS for source images and PT for target images. The circular feedbacks between PS and PT indicate that these two transformations are learned jointly. (b) Obtaining sparse codes using the common dictionary D1. (c) Performing max pooling. The process then repeats for layer 2 (d & e), except that the input is the sparse codes from layer 1 instead of pixel intensities. At the final stage, spatial pyra- mid with max pooling are used to create image descriptors. Classification is done using linear support vector machine. . . . . . . . . . . . . . . . . 122 5.2 Example images from the LAPTOP-101 class in different domains: (a) Amazon, (b) Caltech, (c) DSLR, (d) Webcam. First row: original images, second row: halftone images, third row: edge images. . . . . . . . . . . . 131 5.3 Dictionary responses of training (left) and testing (right) data for the BACKPACK class for the pair DSLR-Webcam domains in the first layer. 132 5.4 Recognition rates with respect to different dimensions of the latent do- main in the first and second layer. . . . . . . . . . . . . . . . . . . . . . 
137 5.5 The reconstructed dictionaries at layer 1. . . . 138

Chapter 1
Introduction

We live in a complex visual world. A typical visual scene is composed of a large number of image elements that vary in luminance, color, shape, and motion. Unfortunately, the information of interest is often highly non-linear. High-level concepts such as car or human form very convoluted regions within the image space. Several theorists have proposed that natural images lie along a continuous curved "manifold" embedded in the high-dimensional state space of images. The manifold represents the smooth changes that follow from the transformations, such as translations and rotations, that are likely to occur in natural scenes. For this reason, finding a good way to represent images or videos is not a trivial task.

Good representations of data are crucial to the successes of many computer vision systems. However, there is no universal representation that is optimal for all types of data. As a result, significant research efforts have been spent on designing good representations that are appropriate for specific tasks such as detection and recognition. The availability of new sensor modalities, such as 3D and hyperspectral sensors capable of collecting high-dimensional data in real-time, necessitates the design of new representations and processing techniques. In this dissertation, we address the problems of designing representations for 2D and 3D shapes. In addition, we propose novel learning techniques capable of learning non-linear and sparse representations directly from a set of images.

2D shapes are often represented by a set of features that are computed based on boundary or edge points. The most popular features include shape context [1] and inner distance shape context [2]. While these approaches are widely used to design features for 2D shape matching, they have some disadvantages. For example, the construction of these features is sensitive to various effects such as missing pixels, internal discontinuities, and branching offshoots. As opposed to the conventional approach, in chapter 2 we propose to represent a shape as an analytic function similar to the decision function of the support vector machine [3]. This representation enables the extraction of a novel set of invariant features that is robust against the above-referenced degradations. These features are extracted from the highly consistent gradients of the decision function instead of edges, as in conventional approaches. Our experiments demonstrate promising results of shape matching and object recognition on several datasets including MPEG7.

Range images are more and more popular due to the availability of inexpensive sensors like Kinect. However, it is not straightforward to extend traditional 2D features to this sensor modality. The reasons include low resolution, irregular sampling density, and differences in underlying physics. Common 3D feature extractors such as the spin image [4], 3D shape context [5], and spherical harmonics [6] are often computationally expensive. These drawbacks motivate the design of new features for this type of data. In this dissertation, we propose a novel feature for 2.5D range images called the Concentric Ring Signature (CORS) [7]. Our feature is patch-based and thus is a natural choice for range images. Unlike volume-based features (e.g.
spin image and spherical harmonics) that require the normalization of the sampling density at every 3D voxel, CORS is more compact and robust against irregular sampling effects. CORS achieves state-of-the-art recognition results on several difficult datasets, including TOSCA and MIAN, which consist of multiple objects and scenes that come with various non-rigid 3D transformations.

The process of designing representations for visual data is tedious and time consuming. It requires a deep understanding and a careful examination of the underlying physics that governs the generation of data. Recently, there has been an explosion of interest in modelling data using sparse representations. It has been shown that sparse methods can automatically extract useful information such as edges and corners, as well as more complex structures, from natural images. The set of techniques that learn sparse representations directly from the data is called dictionary learning. The dictionaries designed by MOD [8] and KSVD [9] are based on a linear representation of the data. However, linear representations are almost always inadequate for representing non-linear structures of the data which arise in many practical applications. For example, images of a human face are subject to various sources of non-linear transformations such as rotation and deformation.

In addition, many types of descriptors in computer vision have intrinsically non-linear similarity measure functions. The most popular ones include the spatial pyramid descriptor [10], which uses a pyramid match kernel, and the region covariance descriptor [11], which uses a Riemannian metric as the similarity measure between two descriptors. Both of these distance measures are highly non-linear. The linear model used by the traditional dictionary learning methods, e.g. MOD and KSVD, inevitably leads to poor performance on many datasets, e.g., object classification on the Caltech-101 [12] dataset, even when discriminant power is taken into account during training [13].

Motivated by the drawbacks of the current methods and the needs of many practical applications, in chapter 3 we propose kernel dictionaries, which are basically dictionaries in high-dimensional feature spaces. Our dictionary learning methods yield representations that are more compact than kernel principal component analysis (KPCA) [14] and are able to handle non-linearity better than their linear counterparts.

To enhance the efficacy and adaptability of kernel dictionaries, we introduce a novel framework, called sparse embedding (SE), for simultaneous dimensionality reduction and dictionary learning in chapter 4. We formulate an optimization problem for learning a transformation from the original signal domain to a lower-dimensional one in a way that preserves the sparse structure of data. We propose an efficient optimization algorithm and present its non-linear extension based on the kernel methods. One of the key features of our method is that it is computationally efficient, as the learning is done in the lower-dimensional space, and it discards the irrelevant part of the signal that derails the dictionary learning process. Various experiments show that our method is able to capture the meaningful structure of data and can perform significantly better than many competitive algorithms on signal recovery and object classification tasks.

In many practical computer vision applications, we are often confronted with the situation where the data that we use to train our algorithm have a different distribution or representation from that presented during testing. The ubiquity of this problem is well-known to computer vision researchers. For instance, indoor images are quite different from outdoor images, just as videos captured with a high-definition camera are different from those collected with a webcam. This detrimental effect is often a dominant factor accounting for the poor performance of many computer vision algorithms. As an example of the effect of distribution mismatch, Ben-David et al. [15] show that, under certain assumptions, the bound on the test error linearly increases with the ℓ1 divergence between the training and testing distributions. Even worse, data from the test domain are often scarce and expensive to obtain. This makes it impractical to re-train an algorithm from scratch, since a learning algorithm would generalize poorly when an insufficient amount of data is presented [16].

In order to address this issue, chapter 5 proposes a novel approach for domain adaptation, based on the sparse embedding framework, that possesses the following advantages. First, the adaptation is performed on multiple levels of the feature hierarchy in order to maximize the knowledge transfer. The hierarchical structure allows the transfer of useful information that might not be well captured by existing domain adaptation techniques. Second, domain adaptation is done jointly with feature learning. Our method learns a hierarchy of sparse codes and uses them to describe a visual object instead of relying on any low-level feature. Finally, unlike existing hierarchical networks, our network is more computationally efficient, with a mechanism to prevent the data dimension from increasing too fast as the number of layers increases. We provide extensive experiments to show that our approach outperforms the current state-of-the-art by a large margin. This is interesting since our training is entirely generative followed by a linear SVM, while several other methods employ discriminative training together with non-linear kernels.

Chapter 2
Shape Representations

The shape of an object represents geometrical information that is independent of transformational (scaling, rotation, articulation, etc.) effects. Understanding shape is essential in many computer vision applications, from recognition of people and their actions in video surveillance to design and inspection in industrial manufacturing [17, 18]. Recent psychophysical findings [19] suggest that the perceptual representation of a shape is primarily based on qualitative properties whose topological structures remain relatively stable over transformational conditions. Other empirical studies [20, 21] have shown that the neural processing of shape in the brain is broadly distributed throughout the ventral (what) pathway that is involved in object recognition, and the dorsal (where) pathway that is involved in spatial localization. In other words, an adequate mathematical representation of shape needs to be invariant to viewpoint changes and articulated object motion, and discriminative enough to enable detection and classification.
2.1 Related Work On 2D and 3D Shape Representations Two main approaches dominate previous work on shape representation: global ap- proaches model an object as a whole segment, while part-based approaches advocate seg- mentation of shape into constituent regions. The drawback of a purely global approach is the exclusion of articulation and the sensitivity to occlusion. The drawback of a purely 6 part-based approach is that a consistent partitioning is generally not possible in the face of numerous combinations of possibilities and object shape variations. Besides, segmen- tation itself is ill-posed, except under controlled environments or in restricted application domains. 2.1.1 Global Approaches Prominent global shape representations include variational [22] and level set ap- proaches [23, 24]. These approaches have been applied for scene segmentation [25] and tracking [26, 27]. Brookstein initiated the use of thin-plate splines [28] to analyze de- formable shape, which were then improved by [29, 30]. These methods are landmark- based and suffer from inconsistency in landmark selection. [31] fits a parametric model to a shape using mixture of Gaussian densities. This method requires a clustering process to estimate cluster centers and therefore has the same drawback with other landmark- based approaches. [32] assigns every internal point of the silhouette a value proportional to mean time of random walk from the point to the boundary. This can be achieved by solving a Poisson equation. Other popular methods are statistical moments [33], eigen- shapes [34], curvature scale space [35], elastic matching [36], parametric curves (poly- lines), image signatures, etc. Zernike moments are a class of orthogonal moments that are invariant to rotation and translation. Eigenshapes decompose a distance matrix of bound- ary points into an ordered set of eigenvectors and finds the modes of these eigenvectors. Elastic matching evaluates the similarity as a sum of local deformations needed to change one shape into another. Scale space representation successively smooths contour while 7 decreasing the number of curvature zero crossings. In general, global models need addi- tional mechanisms to compensate for articulated motion and non-rigid deformation. 2.1.2 Part-Based Approaches In comparison, part-based approaches describe shapes in terms of their part struc- ture. Parts are defined to be nearly convex shapes separated from the rest of the object at concavity extrema [37, 38, 39, 40]. It is possible to build a discriminative classifier from a collection of parts [41] or a bag-of-feature to solve correspondence [42]. These methods often require a multitude of training samples, prior knowledge on the number of parts, and a precise formulation of articulation. Other part-based methods try to learn the part structure by considering the shape interior. For instance, shock graphs [43] are defined as the cyclic tree of singularities of a curve evolution. The inner distance [44], geodesic distance [45] and random walk [46] also consider the interior of the shape to build descriptors. Given a pair of points, the inner distance is determined by finding their closest points on the shape skeleton, then measuring the distance along the skeleton. The geodesic distance is the length of the shortest path on the surface. While shock graphs benefit from the skeletons robustness to articulation they suffer from boundary noise. 
The inner and geodesic distances are robust to disturbances along boundaries, yet they are highly sensitive to occlusions and fragmentations. Recently, [47] proposed Gibbs Ran- dom Fields to model shapes as spatial compositions of simple parts. Pioneering work on the spin image [48] describes the relative spatial distribution of shape points around a set of feature points. It considers a cylindrical support region 8 and accumulates a histogram of points. The shape context [49, 50] is similar to the spin image except that the support region is a sphere. Since both generate sparse matrices, the distance computation is sensitive to the shape structure. In [51] each shape is indexed based on a variety of features such as inner distance, Euclidean distance, contour distance, etc. that characterize pairwise geometric relationships between interest points on the shape. Shapes in the database are ordered according to their similarity with the query shape and similar shapes are retrieved using a scheme which does not involve shape-wise alignment. 2.2 Support Vector Shape Representation The above referenced methods provide satisfactory results under ideal conditions with strong priors and clean segmentation masks. Their representation capacity substan- tially degrades when the shape boundary is noisy (part-based methods, shock graphs), shape has internal crevices, branching offshoots and excessive discontinuities (inner- distance, spin images, shape context), and non-conforming articulations (global models). Besides, they would not necessarily extend to higher dimensions or generalize over most shape classes. Quite different from existing approaches, we propose a novel implicit shape repre- sentation based on SVM theory. Each shape is represented by a classification decision function obtained by training an SVM with interior and exterior shape points providing positive and negative training samples, respectively. The RBF kernel is used with SVM to make our representation rotation and translation invariant. The decision boundary is 9 a hypersurface on the high-dimensional feature space that separates the positive labelled points, shape, from the negative labelled points, its surroundings. Our shape represen- tation is not just the shape boundary or the decision function boundary but the function itself. Instead of using the edge or surface gradients on a discrete grid, we use the gradient of the classification decision function, which is an analytic function that is defined every- where in the data space. Furthermore, the use of the RBF kernel enables SVM to model any complicated shape due to the infinite dimensional nature of the associated Hilbert space. Several other advantages are explained in the following section. To summarize the main contributions, this chapter 1. proposes a novel method to represent 2D and 3D shapes using support vector clas- sifiers, 2. provides an in depth theoretical analysis for a better understanding of this represen- tation, and 3. presents experimental results on several datasets to evaluate its performance on challenging datasets. 2.2.1 Formulation of Support Vector Shape We define the SVS representation to be the decision function of a classifier. The name SVS comes from the fact that the decision function is parametrized by a set of support vectors, learned from an SVM. Through out the chapter, the terms ?SVS? and ?decision function? are used interchangeably. 
This representation facilitates the extrac- tion of feature points that correspond to salient components of the shape, which are then 10 described using local statistics of the decision function around each point. To the best of our knowledge, this is the first effort that considers the shape repre- sentation as a classification problem. This classifier-based representation offers several advantages. First, it is general enough to be applied to 2D shapes and 3D volumes. Sec- ond, the classification function depends only on a sparse subset of key points, which makes the representation robust against noise, missing data, and other artifacts including interior fragmentations. Finally, the descriptors are also more discriminative and stable against transformation and disturbances than edge-based descriptors [48, 52, 44] since they are extracted from the dense gradient field of the decision function but not from the original data. Basically, SVS involves the following tasks: 1. Learn a decision function from a given shape, 2. Select feature points using the gradient, i.e. the first derivative, of the decision function. (sec. 2.2.2), 3. Compute local descriptors (sec. 2.2.3). which are explained in the following sections. As it will be clear, SVS enables selecting a small set of salient features for shape matching and retrieval. These features are described using local statistics of the decision function around each point. For instance, a 2D variant of SVS features picks the high gradient points of the decision function as feature points and uses the local histogram of oriented gradients computed on the decision function as the descriptors. 11 2.2.1.1 Learning a Decision Function: Let S = {xn}Nn=1 be a set of points representing a shape1 and ?S be the set of points not in S, i.e outside. We wish to learn a classifier: f(x) = ? ???? ???? ? 0, x ? S < 0, x ? ?S (2.1) Two classifiers, f(x) and g(x), are said to represent the same shape if sign[f(x)] = sign[g(x)] = ? ???? ???? ? 0, x ? S < 0, x ? ?S (2.2) The following theorem states that if the binary shapes generated by taking the zero-level crossings of these classifiers are equivalent (as in 2.2) to each other then the decision functions are equivalent with a constant factor: Theorem 2.1 (Curtis ?87 [53]). Let f(x) and g(x) be real, 2D, band-limited and irre- ducible functions. If f(x) and g(x) take on both positive and negative values in a closed bounded region D ? R2, and sign[f(x)] = sign[g(x)] for all x in D, then f(x) = ?g(x), where ? is a real positive constant. Here, irreducible function means that its Fourier transform is not factorable. The band-limited condition implies the Fourier transform will exist and it will have a com- pact region of support for finite-energy signals (interested readers are referred to [54] for further discussions on band-limitedness condition). 1We denote vectors in bold letters 12 The above theorem suggests that the gradient orientation is equivalent for such two functions satisfying the above constraints. In other words, two decision functions rep- resenting the same shape will be consistent and exhibit invariance properties in terms of their gradient orientations if these functions are real, 2D, band-limited, irreducible, and have almost identical responses. One such example is the radial basic function (RBF) an SVM [55] operates on. 
It is real and band-limited: it has the form of the sum of finite number of weighted exponential, thus, its Fourier transform is a sum of weighted exponentials, which has finite-energy and a compact region of support especially when insignificant coefficients of the radial kernels are disregarded. A Theorem for the irreducibility of the general class of functions including the RBF kernel is given below with a proof in Sec. 2.2.5. Theorem 2.2. A function of the following form: f(x) = m? i=1 ?ig(x? x?i ), m ? 5 (2.3) is irreducible if the Fourier transform of g(x) does not have any zero (either real or complex). For an ideal shape classifier as defined in (2.1), the decision function is positive for the shape regions, zero on the shape boundary B, and negative otherwise. In other words, the gradient of the decision function along the tangent space of a point on the boundary B is zero (see Figure 2.1). This means that the gradient of the decision function must be perpendicular to the tangent plane, thus, the gradient itself coincides with the normal vector of the shape. 13 Proposition 2.1. Gradient?f(x) at a shape boundary point is a close approximation for the normal vector at that point. Figure 2.1: Illustration of SVS decision boundary. The dotted line is the tangent space. This property is very desirable for SVS since the normal direction is an essential input for the construction of many descriptors [49, 56, 48]. It is especially useful for com- puting descriptors from 3D point clouds where the knowledge about points orientations is missing. As moving from the boundary B to the interior of shape, the gradient ?f(x) are an indication of the combined effects of local edge segments. This effectively fuses local topologies and enhances the discriminative power of local descriptors (Section 2.2.3). We are interested in representing a shape by a continuous function that has the mathematical form of (2.3) and at the same time satisfies (2.1) as much as possible. No- tice that the class of functions in (2.3) contains the decision function of SVM with the RBF kernel [57, 58]. Therefore, we employ SVM to learn our parametric shape repre- sentation. To our advantage, the SVM decision function is analytic, i.e., it is in the form of weighted sum of kernel responses, thus its gradient can be efficiently computed at a point in the space. Furthermore, the RBF kernel functions can effectively map data xn to an infinite-dimensional space where S and ?S would be linearly separable, thus even for intricate boundaries a high classification performance, and accurate shape representation, 14 is guaranteed. SVMs construct a hyperplane in a high (or infinite) dimensional feature space be- tween a set of labelled input vectors x that can be either +1 for shape pixels, or ?1 for non-shape pixels by definition for binary SVMs. The decision boundary is defined in terms of a typically small subset of training examples, called as support vectors, that re- sult in a maximum margin separation between these classes. The decision function of SVM is given as f(x) = m? i=1 ?i[?(x).?(x?i )] (2.4) where x?i are support vectors, ?i are the corresponding weights of the support vectors, m is the number of non-zero support vectors, and ? is a mapping function in some dot product spaceH. By defining a similarity measure k inH as k(x,x?i ) = ?(x).?(x?i ), (2.5) every dot product in the decision function is replaced by a kernel function. 
This allows efficient computations without having to venture into the high dimensional feature space H. The transformation may be non-linear; thus though the classifier is a hyperplane in H, it may be non-linear in the original input space. If the kernel used is a Gaussian, the corresponding feature space is a Hilbert space of infinite dimension ?(x).?(x?i ) = exp(??||x? x?i ||2) (2.6) where ? stands for the Gaussian kernel width. By using the RBF, it is always possible to find a decision function that perfectly represents a shape. Such a decision function has the form f(x) = m? i=1 ?i exp(??||x? x?i ||2) (2.7) 15 Figure 2.2: Orientation of?f(x) remains stable even if the shape transforms and different set of parameters and points are used to train the SVS on the right. and the final classification is made by l(x) = sign[f(x)]. It is worth noting that the set of support vectors is usually small in comparison with the entire training set. Since the decision function in (2.7) only depends on the distance between points, the SVS representation is robust to translation and rotation2. This can be observed from the Figure 2.2 which shows that the gradient directions, computed from SVS decision functions of transformed versions of the same shape, are almost same. Furthermore, the ? multiplier in the kernel function is inversely proportional to the squared scale change. For a given shape with unknown scale change from its original, the mean of pairwise distances between all points can be used to normalize the shape. This scale normalization technique has been used in shape context [50] and has been proven to be effective. In addition, our analysis demonstrates that small variations of ? would not perturb the best possible classification accuracy. For training, we select a random subset of internal points to be positive training 2Note that, for linear and polynomial kernels such an invariance does not apply as they impose inner products of point coordinates 16 Figure 2.3: Support vectors (red circles ?i > 0, blue circles ?i ? 0), the decision function boundaries f(x) = 0 (yellow), and the original shape (green) for support vector numbers 25, 44, and 61. Classifier accuracies are 93%, 97.2%, and 99.8%, respectively. Original shape contains ?54K points (?1K points on the boundary), yet only a small fraction (e.g. 61) of points are needed to encode the shape. samples from a given shape. Another random subset of points surrounding the shape is chosen to be negative samples. Random selection is preferred just for computational effi- ciency. The input to the classifier is the coordinates and the corresponding inside/outside labels of the training points. Figure 2.3 shows support vectors and decision boundaries for a sample shape. It can be noticed that support vectors are not required to lie on shape edges. This is because the kernel mapping data to the high dimensional space, where non-edge points might happen to lie on the decision boundary of the learning algorithms. The number of support vectors typically varies from 0.1% to 3% of the total number of points. 2.2.1.2 ??SVM and One-Class SVM: We employ ??SVM [57] and one-class SVM [58] for learning the decision function as its parameters have a natural interpretation for shapes. Given a set of labelled samples 17 Figure 2.4: Comparison of the decision function responses for noise-free and noisy shapes. In the first experiment, a car shape is distorted by randomly removing 20% and 50% of points and adding speckle noise to the background. 
In the second experiment, the letter A is interleaved with blank spaces. The differences of decision function for both cases are hardly noticeable. They are both trained using one-class SVM. (xi, yi), the learning problem of ??SVM is formulated as the minimization of arg min w,?,? 1 2 ||w||2 ? ?? + 1l l? i=1 ?i (2.8) subject to: yi.(w.?(xi) + b) ? ?? ?i, (2.9) ?i ? 0, ? ? 0 (2.10) where ? is a function that maps the input data to some Hilbert space. The above opti- mization tries to correctly classify as many data as possible by penalizing the misclassified samples through variable ?i. At the same time, the minimization of ||w|| keeps the model as simple as possible, and the margin is made as large as possible through maximization of variable ?. The trade-off between the model complexity and the training error is controlled by parameter ? ? [0, 1]. It is also the lower and upper bound on the number of examples 18 that are support vectors and lie on the wrong side of the hyperplane, respectively. The larger (smaller) we make ? the more (less) points are allowed to lie inside the margin, which gives coarser (finer) shape representations. It is possible to use a small value which results in a larger number of support vectors to allow accurate representation of complex shapes, while smaller numbers of support vectors enhance robustness against corrupted training data. The formulation and parameters of one-class SVM is similar with ??SVM. The only difference is that one-class SVM allows learning from data with positive samples only. It separates data from the origin in the feature space instead of separating positive samples from negative samples. This becomes extremely useful to deal with missing data due to occlusion and camera sampling error. The selection of parameters for SVM algorithms (e.g. kernel width ? of (2.6), error margin ? of (2.8) ) is done automatically by imposing a constraint on the cross-validation accuracy. Specifically, we divide the data points into two subsets, one for training and another one for testing. We then perform cross-validation and select the set of parameters that produce a classification accuracy higher than 99%. Note that if heavy occlusions are present, we allow the classification accuracy to be lower. Figure 2.4 illustrates the robustness of SVS representation under different noise effects. In particular, we compare the color-coded response of the decision functions for a car shape and an A-letter shape before and after being distorted. As for the car, we randomly remove pixels from the shape and also add noise to the background. The A- letter shape is interwoven with white spaces to create distortion of the shape boundary. For both cases, it can be seen from Figure 2.4 that the color-coded responses, thus, the 19 associated decision functions remain quite stable. From the computational perspective, the SVS complexity is essentially the same with that of SVM algorithms. In general, it is polynomial in the number of input points. In our experiments, it takes about 0.15 seconds to compute SVS for a shape in the MPEG7 dataset [59] using a 2.4 GHz Quad Core machine. The learning process can be signifi- cantly speed up using approximate variants of SVM [60, 61]. The query of the decision function is very efficient. It is linear in the number of support vectors, which is a small fraction of the total number of points (e.g. 1%). In addition, it can be accelerated by two orders of magnitude [62] using statistical approximations. 
2.2.2 SVS Feature Points Shape matching algorithms using SVS representation comprise two constituents: feature (interest) points and their descriptors. This section discusses possible ways of selecting the feature points. All these methods are based on the previously explained decision function f(x). The feature points are desired to be stable under local and global shape perturba- tions (including affine transformations and articulated motion) as well as noise and other artifacts. Such feature points should be reliably computed with high degree of repro- ducibility. In addition, the local descriptors computed at these feature points have to be discriminative enough for reliable shape matching and registration, therefore structure around the feature points should be rich in terms of local information content. In what follows we will elaborate on different possibilities of selecting good feature points for 20 SVS representation. 2.2.2.1 Gradient-based Feature Points A corollary of Theorem 2.1 is that the gradient orientation is stable while gradient magnitude differs only by a constant factor. The gradient orientation is given by ?f(x) ||?f(x)|| , where (2.11) ?f(x) = 2? m? i=1 ?i exp(??||x? x?i ||2)(x?i ? x). (2.12) To evaluate the stability of the gradient orientation, we randomly choose a set of 500 points on each of 70 SVSs created from different shapes in the MPEG7 database [59] and examine their gradients as the training parameters (?, ?) vary (? by 6? and ? by 10? with respect to the smallest value of each parameter). Note that each different parametrization may generate a different decision function in magnitude, however we are interested in how the direction of the gradient of the decision function changes. Therefore, to account for the multiplication factor of the decision function, we normalize the decision function values by their mean value yielding relative magnitudes. Figure 2.5 shows how the standard deviation of gradient direction changes with re- spect to the relative gradient magnitudes for 500 points from one of the 70 SVSs. One can easily notice that the gradients vary with respect to the training parameters. This is because large variation of the training parameters (?, ?) results in different classification functions that do not satisfy strictly the condition sign[f(x)] = sign[g(x)] as in Theo- rem 2.1. However, the variation exhibits a strong dependency on gradient magnitude. The 21 Figure 2.5: The relative gradient magnitude at a point is computed by dividing its gradient mag- nitude by mean of gradient magnitude across the entire shape. higher the gradient magnitude gets, the smaller the standard deviation is. For points with gradient magnitude of more than two, the standard deviation, which directly corresponds to direction changes, is as small as 4?. Note that in practice the variation should be smaller than what we see in Figure 2.5 since the constraint of high classification accuracy implicitly requires a consistent set of parameters (?, ?). Since the gradient orientation is stable especially for higher gradient magnitude points we choose a small subset of such points for matching. We apply iterative search method that finds the maximum gradient magnitude point on ?f(x) until it selects 100 points and build a list by ordering them according to their angles from the center of the shape in a circular fashion. The starting (0?) orientation of circular sweep is set with respect to a dominant gradient direction of ?f(x). 
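Because the gradient in Eq. (2.12) is a closed-form sum over the support vectors, the gradient-based selection above is straightforward to prototype. A hedged sketch follows; the dual coefficients and support vectors are assumed to come from a trained RBF SVM (e.g., dual_coef_ and support_vectors_ in scikit-learn), and the minimum-distance suppression is an assumed detail of the iterative search, not a value from the text.

```python
import numpy as np

def svs_gradient(x, support_vectors, alphas, gamma):
    """Eq. (2.12): grad f(x) = 2*gamma * sum_i alpha_i k(x, x_i) (x_i - x)."""
    diff = support_vectors - x                    # (m, d)
    k = np.exp(-gamma * np.sum(diff**2, axis=1))  # RBF responses, (m,)
    return 2.0 * gamma * np.sum((alphas * k)[:, None] * diff, axis=0)

def gradient_based_points(candidates, support_vectors, alphas, gamma,
                          n_points=100, min_dist=1.0):
    """Greedily keep the n_points candidates with the largest gradient
    magnitude, suppressing points closer than min_dist to ones already
    selected (an assumed form of the iterative search)."""
    grads = np.array([svs_gradient(x, support_vectors, alphas, gamma)
                      for x in candidates])
    mags = np.linalg.norm(grads, axis=1)
    selected = []
    for i in np.argsort(-mags):
        if all(np.linalg.norm(candidates[i] - candidates[j]) >= min_dist
               for j in selected):
            selected.append(i)
        if len(selected) == n_points:
            break
    return candidates[selected], grads[selected]
```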
Severe occlusions or distortions might lead to change of the dominant gradient di- rections. In such a scenario, we allow the generation of multiple sets of descriptors corre- 22 Figure 2.6: Decision function responses (left) and the gradient magnitude of the decision function (right) for the SVS representing Stanford Bunny. Points with higher values in ?(fx) seem to be good candidates to be selected as discriminative feature points. Black circles are the 51 feature points (102 support vectors). sponding to different tentative orientations. During the shape matching phase, the distance between two shapes is the smallest matching cost among all the orientations. Figure 2.6 shows the selected feature points for a sample shape. 2.2.2.2 Support Vector-based Feature Points The support vectors are sufficient to construct the decision function and its gradient, thus they are good candidates for feature points. One can ask whether the set of support vectors enable reliable shape matching. The answer to this question largely depends on the problem at hand. For non-articulated trans- formations, the support vectors remain stable. In Figure 2.7 we show that support vector locations are quite stable when the kernel width varies 2?. Constraining the classification accuracy to be sufficiently high (e.g. 99%) and preventing the kernel width from chang- ing too much would produce similar support vectors. Yet, if the kernel width changes too much, e.g. 20?, support vectors change significantly as in Figure 2.3. Besides, the sup- port vectors are sensitive to shape articulations due to the topology changes as illustrated 23 in Figure 2.7 bottom-pair. 2.2.2.3 Curvature-based Feature Points It is possible to select points with high-curvature on SVS by looking at the Hes- sian matrix of a decision function. More specifically, for the 2D case, we first solve the eigenvalues of the Hessian matrix, which is proportional to curvatures. H = ? ??? ?2f ?x21 ?2f ?x1?x2 ?2f ?x1?x2 ?2f ?x22 ? ??? = Q ? ????1 0 0 ?2 ? ???Q?1 (2.13) Points with high curvatures associate with large eigenvalues of both dimensions (?1, ?2). We randomly sample 200 points to compute curvatures. Feature points are chosen where the first eigenvalue ?1 and the second eigenvalue ?2 are larger than the median of ?1 and ?2, respectively. To mitigate noise effects, the decision function can be smoothed at different scales before computing curvatures. Gaussian smoothing of the decision func- tion (2.7) is a mixture of Gaussian functions, which can be computed efficiently with a closed-form expression. An advantage of this selection scheme is that local descriptors are highly discrimi- native. For instance, SIFT can be computed to find point-wise correspondences for align- ing two similar shapes. However, this point selection method is not appropriate for shape matching. For example, choosing only those points around corners makes it impossible to differentiate a rectangle from a square. This disadvantage arises for shapes characterized mainly by their dimensions and shapes characterized by the arrangement of similar local structures. Another disadvantage of this method is the difficulty of computing curvatures when generalizing to higher dimensions. 24 Figure 2.7: Stability of support vectors with respect to variation of RBF kernel width (?) and articulation. (top) Support vectors when ? varies. ? = 0.0027 in the first shape, and ? = 0.0054 in the second shape. 62 out of 78 support vectors in the first shape appear in the second shape. 
The overall positioning of the support vectors is very similar. (bottom) Support vectors with shape articulation. Blue circles indicate SVs that do not have correspondences on the other shape. 83 out of 114 SVs in the first shape appear in the second shape. 2.2.2.4 Entropy-based Feature Points Another possibility is selecting a small subsets of points whose local gradient orien- tation have high entropy. To compute this entropy, we first create a histogram of gradients over the local window of size 0.25? 0.25 (with respect to mean pairwise distances). The entropy is then computed as follows: ? n? i=1 pilog2(pi) (2.14) where pi is the weight of the ith bin and n is the number of bins in the histogram. High entropy is equivalent to high variation of gradient orientations. Therefore, it is a good 25 indication of complex local topologies of SVS, thus, more discriminative local descrip- tors. This strategy is similar to the high-curvature selection method. However, it does not involve the computation of principal curvatures which can be difficult for higher dimen- sional cases. 2.2.3 SVS Descriptors SVS facilitates the computation of a set of descriptors extracted from the decision function. Below, we give only an examples of possible descriptors. We compute a local histogram of oriented gradients (HOG) descriptor around each point on the decision function gradient ?f but not on the conventional edge gradient, thus our HOGf is significantly different from existing descriptors. For a given feature point, a 4?4 array of 8-bin orientation histograms is constructed within a local window. The size of the window is set to be 0.25?0.25 (relative with respect to mean pairwise distance). Our experiments indicated that this size provides satisfactory results for both very coarse and fine shapes. A histogram is populated for each sub-block by aggregating the gradient magnitude of the decision function into the corresponding orientation bin of the gradient direction of the decision function. Since gradients with larger magnitudes are more stable, the contribution of each gradient to the histogram is set to be proportional to its magnitude. We impose a Gaussian kernel to weight gradients based on their relative distances with respect to the feature point. This spatial weighting puts more emphasis on gradients that are closer to the center and helps improve the discriminative power of the local descriptors. 26 Figure 2.8: Comparison of IDSC with the SVS computed at four locations indicated by square dots on the shapes. Descriptors are displayed from left to right corresponding to the following order of dots? colors: red, yellow, green, cyan. The second image has a crack in the shape. IDSC changes drastically (it highly depends on the boundary) while the SVS remains robust generating almost the identical descriptors. Box color indicates the severity of the mismatch: Green (< 20%) ; Orange (20? 50%); Red (> 50%). To prevent problems due to the coarse binning issues, the value of each gradient point is interpolated into adjacent histogram bins. Finally, the mean of gradients for the local window is taken to be the orientation of the descriptor and the histogram bins are reoriented accordingly with respect to this mean orientation to achieve the rotation invariance of the descriptor. The histograms are then concatenated into a 128-dimensional vector. 
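Both the entropy score of Eq. (2.14) and the HOGf descriptor above are built from local orientation histograms of ∇f. A small sketch of the entropy computation follows; the bin count and the magnitude weighting of the histogram are assumptions, since the text only specifies the bin weights p_i.

```python
import numpy as np

def orientation_entropy(local_grads, n_bins=8):
    """Entropy of gradient orientations within a local window, Eq. (2.14).
    local_grads: (m, 2) decision-function gradients inside the window."""
    angles = np.arctan2(local_grads[:, 1], local_grads[:, 0])    # in [-pi, pi]
    mags = np.linalg.norm(local_grads, axis=1)
    hist, _ = np.histogram(angles, bins=n_bins, range=(-np.pi, np.pi),
                           weights=mags)
    p = hist / max(hist.sum(), 1e-12)                            # bin weights p_i
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Points whose entropy exceeds, e.g., 1.5x the median entropy over all
# sampled points are retained as feature points.
```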
Figure 2.8 shows a comparison of the Inner Distance Shape Context descriptors (IDSC) [44] with the SVS descriptors for a pair of images where one contains irregularity 27 Figure 2.9: Comparison of Shape Context (SC) with the SVS computed at four locations indi- cated by square dots on the shapes. Descriptors are displayed from left to right corresponding to the following order of dots? colors: red, yellow, green, cyan. Box color indicates the severity of the mismatch: Green (< 20%) ; Orange (20? 50%); Red (> 50%). in the shape (assuming even after some morphology such artifacts remain). The inner distance responses change drastically while the descriptors computed from SVS stays very robust. Figure 2.9 demonstrates the strength of the SVS descriptors even for very noisy data (note that, fitting an outer shell, morphological filtering, etc. would not help for this shape as the noise is dispersed into the background). In this case, we compare with Shape Context (SC) since IDSC will not work due to the presence of disconnected interior. To the best of our knowledge, there is no other representation that gives such a robustness that SVS provides for heavily noisy and disconnected data. For all of the experiments presented in Figure 2.8, and 2.9, we use one-class SVM to learn the decision functions. 28 2.2.4 Experiments For our experiments, we need to establish the distance between two shapes. Given two SVS decision functions fA(x) and fB(x), for shapes A and B respectively, we com- pute a distance score between their descriptors. As we explained before, since the feature points are already ordered with respect to the dominant gradient direction, initial alignment for the shapes with similar overall structures is almost accurate. Next, we use the local descriptors for comparison of two shapes. The advantage of using local descriptors is that they are robust against occlusion and shape articulation. Let two sets of the descriptors for two shapes A and B be ?A : {?A1 , ?A2 , ..?At } and ?B : {?B1 , ?B2 , ..?Bs } where t does not have to be equal to s assuming t ? s. The correspondence is established through a mapping function h such that h : ?B ? ?A. If a descriptor ?Bi is matched to another ?Aj then h(i) = j. We define a cost function as E(h) = ? 1?i?s (h(i), i) (2.15) where the descriptor distance is computed using ?2 statistic (h(i), i) = ? 1?k?128 [?Ah(i)(k)? ?Bi (k)]2 ?Ah(i)(k) + ?Bi (k) . (2.16) This cost represents the overall distance for the corresponding pairing of the descriptors. Note that, the mapping h is neither one-to-one nor overlapping, but keeps the ordering of the descriptors. To minimize E, we use dynamic programming to find the solution to (2.16). It is worth noting that the start points and the end points of two sequences are already roughly 29 aligned for the dynamic programming algorithms to converge to the correct solution. Un- der certain conditions, the initial alignment provided by the ordered lists may not be valid. To overcome this, we find and compensate for the angle that maximizes the correlation between two ?global? histogram of oriented gradients, which are defined as HOGf yet computed for the entire shapes, of two given shapes before minimizing the above cost function. We run the first set of experiments on the entire MPEG7 shape benchmark dataset [59], which has 70 classes and 20 shapes for each class, a total of 1400 images. The perfor- mance is measured by the standard Bullseye test. 
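Before the results, the descriptor distance and matching cost used throughout these experiments can be written compactly. The sketch below implements the χ² statistic of Eq. (2.16) and the total cost E(h) of Eq. (2.15); the small constant added to the denominator is an implementation detail not stated in the text.

```python
import numpy as np

def chi_square(h1, h2, eps=1e-12):
    """Chi-square distance between two 128-dimensional HOGf descriptors,
    Eq. (2.16): sum_k (h1[k] - h2[k])^2 / (h1[k] + h2[k])."""
    return np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def matching_cost(desc_A, desc_B, mapping):
    """Total cost E(h) of Eq. (2.15) for an order-preserving mapping h
    from descriptors of shape B to descriptors of shape A."""
    return sum(chi_square(desc_A[mapping[i]], desc_B[i])
               for i in range(len(desc_B)))
```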
For each shape, the retrieval accuracy is measured by counting how many of twenty correct shapes are in the top forty matches. For the gradient-based method, we apply an iterative search method that finds the maximum gradient magnitude point on?f(x) until it selects 100 features. In the entropy- based feature selection method, 200 points are randomly selected in each shape where HOGf descriptors and their entropy are computed. Points associated with entropy larger than 1.5 times the median of overall entropy value are selected for matching. In the curvature-based method, we also randomly sample 200 points to compute curvatures. Table 2.1: Comparison of MPEG7 classification results using four different feature point se- lection methods in Section 2.2.2 based on gradients, support vectors, curvatures, and entropies. The last column shows the baseline classification result when gradients are computed from edges instead of SVS. Method Gradient SV Curvature Entropy Edges Accuracy (%) 91.07 70.13 72.52 62.25 30.25 30 Feature points are chosen where the first eigenvalue ?1 and the second eigenvalue ?2 are larger than the median of ?1 and ?2, respectively. For the support vector-based method, we randomly select half of the support vectors set as feature points for shape matching (we observed using only the most significant support vectors is in fact deteriorates the performance). Algorithm Accuracy (%) Gopalan [63] 93.67 IDSC + LCDP [64] 93.32 Mixture of Gaussian + tSL [65] 89.1 Shape-tree [66] 87.7 IDSC + DP + EMD [67] 86.56 Biswas [51] 86.48 Hierarchical Proscrustes [68] 86.35 IDSC + DP [44] 85.4 Algorithm Accuracy (%) Shape L? ?Ane Rouge [69] 85.25 Generative Models [70] 80.03 Curve Edit [71] 78.14 SC+TPS [49] 76.51 Visual Parts [59] 76.45 CSS [72] 75.44 SVS + DP 91.07 Table 2.2: Comparison for the MPEG7 dataset. Table 2.1 gives the correct retrieval percentages for four feature point selection methods (gradient-based, support vector-based, curvature-based, and entropy-based) pre- sented in Section 2.2.2. Results indicate that the gradient-based method produces more consistent feature points and better matching performances. We also compare our perfor- mances with HOG computed on the edges gradients (30.25%) to verify that SVSs improve the discriminative power of our descriptors. Edge gradients are commonly used to extract feature points for 2D shape representation. As in Table 2.1, the best overall accuracy on MPEG7 dataset using the Bulleyes 31 Figure 2.10: SVS results: red circles show the incorrect matches. Note that none is in top 4 rankings. Figure 2.11: Comparison of the retrieval results for noisy dataset. Accuracy of IDSC is 57.0%, SVS is 93.7%. Red circles show incorrect matches. test is 91.07%, which is based on the gradient-based feature selection method. This is better than the performances of SC [49] (76.51%), IDSC [44] (85.40%), the post-refined version of [51] (86.48%), and shape tree [66] (87.70%) as summarized in Table 2.2. The best performance 93.67% is reported in [63], however their part-based algorithm is highly sensitive to segmentation errors especially for noisy data. A matching scheme [64], which takes into account the influence of the other shapes while computing the similarity of a 32 Table 2.3: MPEG7 classification results versus the number of features selected using the gradient- based selection strategy. 
] Features 10 50 100 200 300 Accuracy (%) 15.93 68.83 91.07 90.38 89.92 Table 2.4: Random distortion results on partial MPEG7 dataset Method IDSC Denoised IDSC SVS-BoW SVS-DP Accuracy 57.0% 85.9% 81.4% 93.7% pair of shapes, has reported an accuracy of 93.32%. SVS can be effectively used as the shape representation in [64]. Fig. 2.10 shows sample retrieval results in the descending order of matching scores using the gradient-based features. We also investigate the trade-off between the model complexity and the classifica- tion performance by varying the number of feature points from 10? 300. The results are summarized in the Table 2.3. In general, more feature points would yield better accura- cies. However, it can be noticed that the classification accuracy slightly suffers when the number of features exceeds 100. It is because a large number of features inevitably leads to less-discriminative interest points to be selected, therefore, negatively interfering with the shape matching algorithm. In the second experiment, we pick a subset of 6 shape classes (8 samples per class) from the MPEG7 database and add random distortions into them. The performance is measured by counting how many of 8 correct shapes appear in the first 16 matches for each shape. Instead of using ??SVM algorithm like in the first two experiments, we use 33 Figure 2.12: Samples from the articulation database. Note that, each column corresponds to a different object. one-class SVM for this experiment since it produces better retrieval results. In addition to classification using SVS and IDSC, we zero-threshold our SVS representations to re- construct the original binary shapes. These binary shapes are then used as the input for computing IDSC. We call this scheme denoised IDSC. The overall performance of IDSC is 57.0%, while the SVS method, using gradient- based feature points, gives 93.7% as given in Table 2.4. Besides, Table 2.4 shows that denoised IDSC (85.9%) performs significantly better than IDSC. The result implies that our shape representation provides some denoising effects over random distortions. We also compare the performance of SVS when using the bag-of-word approach (81.4%) in- stead of the dynamic programming technique. Figure 2.11 shows retrieval results for both IDSC and SVS methods in descending order of the matching scores. IDSC?s performance dramatically degrades on this database because artifacts within a shape severely changes the inner distances. Shape context and other edge-based methods will also have the same issue. In contrast, the SVS method invariably gives accurate retrieval results thanks to classifier-based representation. In the third set of experiments, we test our matching algorithm, with the gradient- 34 Table 2.5: Matching results on the articulation dataset Descriptor type 1st Match 2nd Match 3rd Match 4th Match `2 (baseline) 25/40 15/40 12/40 10/40 SC [49] 20/40 10/40 11/40 5/40 IDSC [44] 40/40 34/40 35/40 27/40 Biswas [51] 40/40 38/40 33/40 20/40 Lin [73] 40/40 38/40 36/40 33/40 SVS HOGf 40/40 38/40 35/40 31/40 based feature selection strategy, on the articulation database reported in [44]. This database includes 40 images from 8 objects with different articulations as shown in Figure 2.12. This consists of highly similar shapes like types of scissors with only minor differences but significant articulations. The recognition result is evaluated for each shape by choos- ing four most similar matches and then sort them according to their matching scores. 
Table 2.5 summarizes the matching results. We compare against the `2 distance on a bag-of-features, the shape context (SC), the inner distance with shape context (IDSC), the multiple feature indexing [51], and the layered graph matching [73]. In this experiment, the SVS uses the gradient-based feature point selection. It is apparent that our method handles articulation as well as (even better for later matches) IDSC, and much more ac- curately than the SC thanks to the robust nature of the SVS feature point selection and the locality of its descriptor. 35 2.2.5 Proof of SVM-RBF Irreducibility The decision function of SVM with RBF kernel has the following form: f(x) = m? i=1 ?i exp(??||x? x?i ||2) (2.17) Fourier transform of the function is as follows: F(?) = m? i=1 ?iF{exp(??||x? x?i ||2)} = m? i=1 ?iF{exp(??||x||2)} exp(?j?x?i ) = m? i=1 ?i [? pi ? ]n exp ( ? pi2 ? ||?||2 ) exp(?j?x?i ) = ?(?)?(?) (2.18) where n is the dimension of x and, ?(?) = [? pi ? ]n exp ( ? pi2 ? ||?||2 ) (2.19) ?(?) = m? i=1 ?i exp(?j?x?i ) (2.20) where n is the dimension of x. The Fourier transform of RBF kernel obviously does not have any zero. From now on, consider ?(?) as the Fourier transform of an arbitrary kernel func- tion. By assumption, the entire function ?(?) never vanishes, i.e. it does not have any zero just like the case of the RBF kernel. Therefore, it remains to show that the function ?(?) is not factorable to conclude that f(x) is irreducible. Assume that ?(?) is reducible. It means that it can be factored into a product of two entire functions. These functions are required to have non-empty zeros set (either real or complex). Then they must have the following form: K? k=1 ak exp(?j?yk) L? l=1 bl exp(?j?zl) = K? k=1 L? l=1 akbl exp{?j?(yk + zl)} (2.21) 36 Note that an exponential function ej?? never vanishes. This property together with non- empty zeros set constrain implicitly require: K ? 2, L ? 2 (2.22) The number of constrains when equating (2.21) to (2.20) is (KL + m). The breakdown of constrains is as follows: ? KL constrains to equate the set of exponents {yk + zl} with the set {x?i }. ? m constrains to equate {akbl} with {?i}. The total number of variables is 2(K + L), and we require this to be at least equal to the total number of constrains, which gives: 2(K + L) ? KL + m (2.23) It is easy to verify that this condition does not hold because K ? 2, L ? 2, m ? 5 due to (2.22) and our initial our assumption. Therefore f(x) in (2.3) is irreducible. 2.3 Concentric Ring Signature In the second part of this chapter, we present a 3D feature descriptor that represents local topologies within a set of folded concentric rings by distances from local points to a projection plane. This feature, called as Concentric Ring Signature (CORS), possesses similar computational advantages to point signatures yet provides more accurate matches. CORS produces compact and discriminative descriptors, which makes it more robust to noise and occlusions. 37 Figure 2.13: a) CORS is constructed at p by finding a spherical support region S . b) A plane is fitted to local neighborhood and translated to the point p. The normal direction is taken to be z-axis c) Selecting a reference orientation for x-axis and projecting the distances from the surface to plane into the corresponding patches. 
2.3.1 CORS Construction To construct CORS we first find the 3D data points within the local support region, then determine a plane of projection, decide the reference orientation in that plane, and finally compute the patch responses that are arranged into a matrix form as illustrated in Fig. 2.13. Local Support Let p a point in a 3D cloud data and r the radius of a spherical volume centered on p. We assign the set of points falling within the spherical volume as the local support S = {pi : ||pi ? p|| ? r}. The choice of the radius r is data-dependent. For example, larger radius is preferred for smooth and rigid shapes while smaller radius is preferred for shapes with articulations or structural variations. As r increases, CORS is more discriminative but more vulnerable against occlusions. A good choice of r would balance well these two 38 factors. We perform cross-validation on a subset of databases to choose the value r that gives best results. Plane of Projection and Reference Axes A plane P is fitted to the local support S . There are two possible choices for fitting planes. One can use the all data points within the local support, fit a plane by least-squares as the system is almost always over-determined, and parallel move the origin of P at p. Alternatively, it is possible to select a subset of points along the perimeter of the local support, e.g. intersecting the sphere support with the object surface. Fitting to the perimeter would be more appropriate particularly for the points along ridges as illustrated in Fig. 2.14. It is more meaningful for the plane of projection to be tangent to the surface rather than slice into it as our descriptor is based on elevations of points. In case the projection plane slices into the surface, the resulting descriptor would be zero. We define a local reference coordinates so that the local descriptor is invariant to camera view. Let c be the (Karcher) mean, that is the coordinate having the minimal overall distance to the other points in the local support c = arg min ? i ?pi ? c?. We set the z-axis to be orthogonal to P and pointing in a direction such as the dot product of the unit vector ~z with vector ~cp is positive. We define a local reference axis (x-axis) so that the local descriptor is invariant to camera view. x-axis points away from p to the projection of the 3D point that has the maximum distance from the fitted plane P within the local support S . y-axis is defined by the cross product ~z ? ~x. With such assignments, 39 P corresponds to the xy plane going through point p. These two conditions define z-axis without any ambiguity. Figure 2.14: A special case where fitting a plane to local support is different from fitting a plane to its perimeter. a) Fitting to the perimeter results in a tangent plane, while b) fitting to the entire local neighborhood results in a slicing plane In case the projection distances from P to the xy plane have more than one peak, multiple reference axes can be generated. When this situation occurs during training phase, multiple descriptors, each corresponds to one peak, are computed and stored in the database. During the matching phase, only one descriptor corresponding to the largest peak is needed for each query point, even when its projection distances have multiple similarly large peaks. The reason is because the query descriptor can always find its correct match in the model database where multiple descriptors have been generated to take into account ambiguity of peaks. 
We observe that this situation occurs at only around 1% of points and the inclusion of multiple peaks improves matching of descriptors. 40 Populating Patches After fitting the plane and determining the reference axes, each 3D point pi in the local neighborhood S is now represented by a tensor pi(x, y, z) independent of the camera viewing angle. The z coordinates pi(z) correspond to the distance from the plane in this tensor, and the xy-plane coordinates pi(x, y) correspond to the projection on the plane P . We estimate a representative elevation value of the given data points within the patches of this grid as follows: 1. We apply a polar grid along azimuth and radial directions on the xy plane centered on the original point p. The patches of this grid can be considered as the 2D his- togram bins. Let {(k, l)} be the set of sampled grid locations with k = 1 . . . K and l = 1 . . . L, where K and L are the numbers of sampling intervals along the radial and the azimuthal directions, respectively. In other words, we will extract a 2D matrix FK?L on this grid where each coefficient corresponds to a patch of the grid. 2. At each grid location (k, l), we estimate a representative elevation value F (k, l). El- evation at a location is estimated by interpolating the elevation values of the given 3D points within the immediate patches. This significantly improves sparsity re- lated issues, e.g. sudden jumps of the estimated elevation when there are insuffi- cient number of 3D points and boundary issues e.g. being very close to boundary but falling into another bin, etc. The representative elevation score F (k, l) is estimated as follows: 41 Figure 2.15: Illustration of CORS (red border) and spin image (black border) at different points on a 3D human face. The radius of support region is set to 15 for both descriptors. The number of azimuthal and radial quantizations of CORS are 10 and 5, respectively. The bin size of spin image is set equal to the scanner resolution as suggested by its author. For this setting, the dimension of CORS is around 6.5? more compact than that of spin image. F (k, l) = ? i wi.pi(z)? i wi (2.24) where pi are 3D points within the immediate neighboring bins of the bin of (k, l) and the weight is computed as: wi = ? ????????? ????????? 1/?, d ? ? 1/d, ? ? d ? 2? 0, otherwise (2.25) and d = ?(k, l)? pi(x, y)?. Basically, F (k, l) is the weighted average of elevation of points surrounding (k, l). The contribution of each surrounding point?s elevation to the estimation of representative 42 elevation is controlled by a weight wi negatively proportional to distance to (k, l). the parameter ? controls the smoothness of a descriptor. Higher ? values yield smoother descriptors while a smaller ? makes the descriptor sensitive to positional shifts. The choice of ? depends on the sampling interval along azimuth and radial directions. We observed that the average Euclidean distance between bin centers and their adjacent bins is a satisfactory value. Using a fixed value of ? makes bins close to the origin in a polar coordinate system look more similar than those further away. ? could be set in an adaptive manner to overcome this issue. Also, imposing a minimum distance constraint enables improving the robustness against small differences in shape very close to the center. 
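To make the construction concrete, here is a hedged sketch of the two steps above: fitting the projection plane and local axes (using a least-squares plane over the full support, one of the two options discussed, and the Euclidean mean as a stand-in for the Karcher mean), and populating the K×L patch matrix with the weighted elevations of Eqs. (2.24)–(2.25). The grid spacing, the default σ, and function names such as cors_matrix are illustrative assumptions.

```python
import numpy as np

def cors_reference_frame(p, support_pts):
    """Local axes for a CORS descriptor at point p; support_pts is (m, 3)."""
    centered = support_pts - support_pts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    z = vt[-1]                                # normal of the least-squares plane
    c = support_pts.mean(axis=0)              # Euclidean mean (stand-in for Karcher mean)
    if np.dot(z, p - c) < 0:                  # orient z so that z . (p - c) > 0
        z = -z
    elev = (support_pts - p) @ z              # elevations above the plane through p
    top = support_pts[np.argmax(elev)]        # point farthest from the plane
    x = (top - p) - np.dot(top - p, z) * z    # its in-plane direction is the x-axis
    x /= np.linalg.norm(x) + 1e-12
    y = np.cross(z, x)
    return x, y, z

def cors_matrix(p, support_pts, r, K=5, L=10, sigma=None):
    """K x L CORS matrix of weighted mean elevations, Eqs. (2.24)-(2.25)."""
    x_ax, y_ax, z_ax = cors_reference_frame(p, support_pts)
    rel = support_pts - p
    xs, ys, zs = rel @ x_ax, rel @ y_ax, rel @ z_ax
    rc = (np.arange(K) + 0.5) * (r / K)               # radial bin centers
    ac = (np.arange(L) + 0.5) * (2 * np.pi / L)       # azimuthal bin centers
    sigma = sigma if sigma is not None else r / K     # ~ radial bin spacing (assumption)
    F = np.zeros((K, L))
    for k in range(K):
        for l in range(L):
            cx, cy = rc[k] * np.cos(ac[l]), rc[k] * np.sin(ac[l])
            d = np.hypot(xs - cx, ys - cy)            # planar distance to bin center
            w = np.zeros_like(d)                      # Eq. (2.25): 1/sigma, 1/d, or 0
            w[d <= sigma] = 1.0 / sigma
            ring = (d > sigma) & (d <= 2 * sigma)
            w[ring] = 1.0 / d[ring]
            if w.sum() > 0:
                F[k, l] = np.sum(w * zs) / w.sum()    # Eq. (2.24)
    return F
```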
In addition to the mean orthogonal distance from S to the P , the standard deviation of the projection distances and the density of points falling into each bin also possess complementing discriminant power and can be incorporated into similar matrices. An advantage of the mean distance is that it does not require point density estimation and normalization. Figure 2.15 provides a visual illustration of CORS computed at different locations on a 3D data cloud of human face. Note that the dimension of CORS is 6.5 times smaller than that of spin image. Such dimensional reduction increases the descriptors matching efficiency, yet does not compromise the discriminative power as shown in the experimental analysis section. 2.3.2 Fast Approximation CORS In practice, the computation of CORS can be significantly speed up by using the normal vectors whenever available as z-axis of the local reference frame. This eliminates 43 Fast CORS Point Signature CORS Spin Image 1.02s 1.27s 1.34s 2.76s Table 2.6: Amount of time taken to compute 500 descriptors. the need for fitting a plane to the neighborhood at every location. Table 2.6 compares the computation of CORS, speed-up CORS, and other descriptors. As visible, CORS is as fast or faster than the conventional methods. In this experiments, we use models from TOSCA dataset, each of which has around 50K-60K points. The computation times are for 500 descriptors. No normals are available beforehand for CORS. A wide spectrum of applications involve 2.5D image that is a view-point specific simplified representation of 3D data in which every pixel on the camera?s image plane contains only one depth value measuring its distance to the scene. This data structure is effectively a 2D hash table allowing efficient constant-time retrieval of local neighborhood S around a point p. To speed up the matching of a CORS descriptor to large databases (hundreds of thousands of signatures), a coarse-to-fine approach can be adapted. For each CORS de- scriptor, we row-sum up all elements of the K ? L matrix, i.e. those lying on the same ring to create a subordinate signature. This can be used for quickly pruning unlikely candidates. 44 2.3.3 Similarity Measure and Matching Euclidean distance is used as the dissimilarity measure between two CORS matri- ces: dist(F1, F2) = ?? k,l (F1(k, l)? F2(k, l))2 (2.26) Matching of CORS descriptors is not limited to Euclidean distance. Since the represen- tation of CORS is in a matrix form, it can be considered to possess a manifold structure where the matching score is defined as the geodesic distance connecting two CORS de- scriptors on the manifold. In addition, a manifold can be flattened using ISOMAP, etc. In this work, we concentrate on the Euclidean distance. The dissimilarity measure is mainly dictated by the applications at hand and the contributions of different bins can be modified accordingly. For example, if the goal is to highlight symmetric local structures, CORS with similar bin values along the azimuthal dimension should be weighted significantly higher than the others. The best match of a query descriptor can be efficiently extracted using approximate nearest neighbor techniques such as kd-trees and box decomposition tree based search. 2.3.4 Experimental Results We conducted several detection, recognition, registration, and retrieval experiments to analyse the performance of CORS descriptors in comparison to existing descriptors including spin images and point signatures. 
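Before turning to the experiments, the matching machinery of the previous subsection can be summarized in a few lines: the Euclidean distance of Eq. (2.26) and the ring-sum subordinate signature used for coarse-to-fine pruning. The pruning fraction below is an assumption rather than a value from the text.

```python
import numpy as np

def cors_distance(F1, F2):
    """Euclidean distance between two CORS matrices, Eq. (2.26)."""
    return np.sqrt(np.sum((F1 - F2) ** 2))

def ring_signature(F):
    """Subordinate signature: row sums of the K x L matrix (one value per
    ring), used to prune unlikely candidates cheaply."""
    return F.sum(axis=1)

def match_query(F_query, database, prune_keep=0.1):
    """Coarse-to-fine matching: keep the prune_keep fraction of database
    entries closest in ring-signature space, then rank them by Eq. (2.26)."""
    q_ring = ring_signature(F_query)
    coarse = np.array([np.linalg.norm(q_ring - ring_signature(F)) for F in database])
    keep = np.argsort(coarse)[:max(1, int(prune_keep * len(database)))]
    fine = [(i, cors_distance(F_query, database[i])) for i in keep]
    return min(fine, key=lambda t: t[1])       # (index, distance) of the best match
```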
2.3.4.1 Saliency

In the first experiment, we use five synthetic models, each of around 150K points, to compare the saliency, i.e., discriminative power, of point signature, spin image, and CORS. A reference database of 10K signatures is computed at random points on each model. Another set of 10K signatures at randomly sampled points on the same model is used as the query database. We make sure that no point from the reference set is selected for the query set. A query signature at location $q_i$ is said to have a good match to its model at location $m_i$ if their distance $d_i = \|m_i - q_i\| \leq \epsilon$. We chose $\epsilon$ to be 5 times the scanner resolution. Figure 2.16 shows the percentage of correct matches within the first k nearest neighbors. CORS outperforms both point signature and spin image. The correct matching rate of CORS is approximately 2.5 times higher than that of point signatures. The error rate drops from 18% for spin image to 12% for CORS, an improvement of more than 33%.

Figure 2.16: Percentage of correct matches within k-nearest neighbors.

Discriminant Ratio

Finding descriptor correspondences is essential to many recognition and retrieval problems. It is desirable to have a robust mechanism for filtering out potentially erroneous matches and keeping only the useful ones for further processing. When searching for descriptors in a large database or finding correspondences within noisy observations, nearest-neighbor matching would result in a large number of incorrect pairs. For instance, Fig. 2.17 shows that the rate of finding correct correspondences (dashed lines) decreases with the noise and the database size.

Figure 2.17: Comparison of correct correspondences for two methods: random selection and selection with the discriminant ratio ρ ≥ 1.5. Gaussian noise is added to all descriptors before matching. The noise standard deviation varies from 0.1 to 2, and the database size varies from 10K to 100K.

To deal with this issue, it is possible to impose a global threshold on the Euclidean distances of the descriptors to their closest matches. However, applying a threshold does not work, as most of the less discriminative descriptors tend to have multiple matches with only small distances. Taking an approach similar to [74], we compare the distance of the closest neighbor to that of the second nearest neighbor. We define the discriminant ratio as a measure for this comparison:

$\rho = \mathrm{dist}_2 / \mathrm{dist}_1$  (2.27)

where $\mathrm{dist}_1$ and $\mathrm{dist}_2$ are the Euclidean distances between a query descriptor and its first and second best matches in the database, respectively. Higher discriminant ratios require correct matches to have the closest neighbor significantly closer than the closest incorrect match. As a result, matches with high discriminant ratios tend to be much more reliable. For false matches, there will likely be a number of other false matches within similar distances due to the high dimensionality of the feature space.

Imposing a limit on the discriminant ratio dramatically increases the correct correspondence rate. Figure 2.17 shows the probability of correct correspondences within selected points under different settings. Shapes from the TOSCA and Mian datasets are used to construct the descriptor databases. Dashed lines represent the conventional correspondences obtained by nearest-neighbor matching and random point selection. Solid lines represent the refined correspondences obtained by the nearest neighbors with the constraint that their discriminant ratios are at least ρ ≥ 1.5.
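A compact sketch of this discriminant-ratio filter (Eq. 2.27) on top of nearest-neighbor matching is given below; the brute-force distance computation is for clarity only (a kd-tree or box-decomposition tree would normally be used), and ρ₀ = 1.5 follows the setting reported above.

```python
import numpy as np

def ratio_filtered_matches(queries, database, rho0=1.5):
    """Keep only query descriptors whose second-best match is at least
    rho0 times farther than the best match (Eq. 2.27).
    queries: (Q, d) array; database: (M, d) array, M >= 2."""
    matches = []
    for qi, q in enumerate(queries):
        d = np.linalg.norm(database - q, axis=1)   # distances to all entries
        best, second = np.partition(d, 1)[:2]      # two smallest distances
        if best > 0 and second / best >= rho0:
            matches.append((qi, int(np.argmin(d)), best))
    return matches
```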
While the performance of the conventional method degrades almost linearly with the Gaussian noise standard deviation, the performance of the second method remains high and stable as the noise level increases. Impressively, the 48 correct correspondence rate of the refined method reaches up to 100% for the the database of the largest size (100K), while the conventional method has the lowest performance on this setting (see red lines). The reason is that as more false matches are available, the comparison based on the distance to the closest neighbor and to the second closest neigh- bor becomes more robust. In other words , a point that can stand out from a crowded database, in terms of discriminant ratio, must be more reliable. Although the reliability of selected points is proportional to the discriminant ratio, the number of detected points reduces as the discriminant ratio increases. We observed that ? ? 1.5 is a good trade-off between the correct correspondences rate and the number of total correspondences. We use both Mian and TOSCA datasets to create the descriptors databases. CORS parameters set are as follows: the local support radius r = 12, K = 10, L = 20, where the value is relative to scanner resolution. Figure 2.18 gives sample sets of correspondences found using the proposed discriminant ratio. Feature correspondences are demonstrated to be very accurate even under nonrigid transformations and occlusions. 2.3.4.2 Shape Detection and Registration Given a 2.5D range scan query scene the task is to make a reliable decision whether the scene contains the objects of interests. If an instance of the target is detected, either complete or partially occluded, the algorithm will estimate the transformation that regis- ters the complete model to the target in the scene. This problem is challenging for sev- eral reasons. First, range scan images usually contain hundreds of thousands of points, posing a question of how to process them in an efficient yet reliable manner. Second, 49 (a) Tosca Non-rigid Shapes (b) Mian Occluded Shapes Figure 2.18: Correspondences of CORS with the discriminant ratio ? ? 1.5 a) Two shapes taken from TOSCA dataset with nonrigid transformations (Note that all corre- spondences for this pair are correct while the plotted lines are sometimes hidden behind the surface giving an impression of matching the face with the back of the head). b) A model and an occluded scene from Mian et al. dataset. (All correspondences, except two on the leg of the reference model, are correct.) target is only partially visible due to the self-occlusion and the cluttering effects (see Fig- ure 2.18b), rendering many global shape descriptors useless. Three main steps of our detection method using CORS, then, can be listed as: ? Computate CORS for a subset of randomly distributed points in the scene, ? Find correspondence between the query signatures of the randomly distributed points and the model signatures computed off-line, and ? Iteratively estimate motion parameters with geometric constraints within the RANSAC 50 framework to locate and find the pose of the object. We evaluate our method on the datasets provided by Mian [75], which consists of 5 models and 50 scenes taken with a Konica-Minolta range scanner. The scenes in this dataset are highly occluded and cluttered by putting objects very close to each other. We are interested in evaluating the recognition rate that is defined as the number of correct detections over the total number of the scenes. 
An object is said to be correctly detected if the resulting errors of the translation and pose estimations, compared to the ground truth, are smaller than one-tenth of the object?s diameter and 12 ?, respectively. These criteria are the same with that of Drost et al. [56], therefore allowing the comparison to their methods. Figure 2.19: Comparison of the recognition rates vs. the percentage of occlusions: for spin image, tensor matching, Drost method, and CORS. Our algorithm converges after, on average, only 3 RANSAC iterations. It produces satisfactory estimates of R and t even without any further processing. Figure 2.20 shows 51 Figure 2.20: Recognition and registration of 3D model point clouds into the occluded scenes. The cyan and pink colors are used to render the scenes and models, respectively. The transformed models closely overlap with the targets within all the scenes. 50 different scenes are used for testing our approach. the registration results of four different models without any refinement. Figure 2.19 shows the overall recognition result of our method. As given, it outperforms all other methods in terms of the recognition rate with respect to occlusions. The key to the higher ro- bustness against occlusions lies in the facts that the detected correspondences are highly reliable and accurate, and the use of the discriminant ratio constraint successfully filters 52 out potentially invalid matches. 2.3.4.3 Shape Recognition and Retrieval Figure 2.21: Six different articulations from a model in the TOSCA dataset. Given a query shape, the goal is to extract all instances of this shape, which might come with 3D non-rigid articulations, from the database . It is necessary for an algorithm to be 3D articulation-independent in order to perform well on this test. A bag-of-feature approach is used to retrieve 3D shapes. We compare CORS at a subset of salient points to evaluate the similarity between a model and a query. Similar to the approach in the previous section, we perform salient points selection and shapes retrieval jointly. Given a query point clouds, we first compute CORS at a subset of ran- domly selected points. Two best matches for each query CORS are extracted from the model database from which discriminant ratios are computed. We define a dissimilar- ity score ? between a query and a model using only those correspondences with high discriminant ratios. More specifically, it is computed as: ? = 1|S? | ? qi?S? dist1(qi), S? = {qi : ?qi ? ?0} (2.28) where S? is the set of feature points having discriminant ratio higher than a threshold ?0. 53 In contrast to the previous section where the threshold ?0 is kept constant, here we keep the size of |S? | fixed (50 in our experiments) and let ?0 to vary. This effectively forces two dissimilar shapes to accept some poor matches leading to higher dissimilarity score. Shapes with dissimilarity score lower than a predefined threshold are considered to be from the same object. We evaluate our algorithm on the TOSCA dataset which consists of 9 models and 80 shapes. This is a challenging database since all models come with various nonrigid shape transformations. Figure 2.21 shows a model in the dataset with six different 3D articulations. We perform a leave-one-out matching experiment. In other words, each shape is excluded from and matched against the rest of the database. The recognition rate is computed by counting how many best matches that come from the same models of query shapes. 
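A hedged sketch of the dissimilarity score in Eq. (2.28) follows, with |S_τ| fixed at 50 correspondences as in our experiments rather than a fixed threshold τ₀; the variable names are illustrative.

```python
import numpy as np

def dissimilarity_score(query_descs, model_descs, n_keep=50):
    """Eq. (2.28): mean best-match distance over the n_keep query
    descriptors with the highest discriminant ratio rho."""
    ratios, best_dists = [], []
    for q in query_descs:
        d = np.linalg.norm(model_descs - q, axis=1)
        d1, d2 = np.partition(d, 1)[:2]
        ratios.append(d2 / max(d1, 1e-12))          # discriminant ratio
        best_dists.append(d1)                       # dist1(q_i)
    keep = np.argsort(ratios)[::-1][:n_keep]        # highest-ratio correspondences
    return float(np.mean(np.array(best_dists)[keep]))

# A query is assigned to the model with the smallest score; scores above a
# predefined threshold are rejected as coming from an unknown object.
```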
Recognition and retrieval results of our method are presented in Fig. 2.22. Our method achieves a 100% correct recognition and retrieval rate, i.e., all shapes from correct models are retrieved before those from incorrect models. The recognition rate of spin image is only 55%, and its retrieval result is presented in Fig. 2.23. It is easy to observe that our method produces consistent scores. This makes it possible to impose a threshold on the score and reject shapes from incorrect models. On the contrary, the dissimilarity scores of spin image are quite erratic, inhibiting the determination of a cut-off point between correct and incorrect models.

Figure 2.22: Top retrieved shapes of our CORS method. All shapes are correctly retrieved. (The 5th model, gorilla, has three and the 9th model, wolf, has two instances in the dataset.)

Figure 2.23: Top four retrieved shapes of spin image. Wrong retrievals are highlighted with orange boxes. Distance scores are erratic (see red underlines), inhibiting the determination of a cut-off value between the correct and incorrect shapes.

Chapter 3
Sparse Representations

3.1 Background on Sparse Coding and Dictionary Learning

Sparse and redundant signal representations have recently drawn much interest in vision, signal and image processing [76], [77], [78], [79]. This is due in part to the fact that signals and images of interest can be sparsely represented or are compressible given an appropriate dictionary. In particular, we say a signal $y \in \mathbb{R}^n$ is sparsely represented by a dictionary $D \in \mathbb{R}^{n \times K}$ when it can be well approximated by a linear combination of a few columns of D as $y \approx Dx$, where $x \in \mathbb{R}^K$ is the sparse representation vector and D is a dictionary that contains a set of basis elements (atoms) as its columns. Finding a sparse representation vector entails solving the following optimization problem

$x^* = \arg\min_{x} \|x\|_0 \quad \text{s.t.} \quad \|y - Dx\|_2 \leq \epsilon$,  (3.1)

where $\epsilon$ is an error tolerance, $\|x\|_0$ is the $\ell_0$ sparsity measure that counts the number of nonzero elements in the vector x, and $\|y - Dx\|_2$ is the error resulting from the sparse approximation. Solving (3.1) is NP-hard and can be approximated by various methods [80], [81], [82].

Instead of using a pre-determined dictionary, one can directly learn a dictionary from the data. Indeed, it has been observed that learning a dictionary directly from the training data rather than using a predetermined dictionary (e.g., wavelets) usually leads to a more compact representation and hence can provide better results in many image processing applications such as restoration and classification [76], [77], [78], [83], [84], [85]. Several algorithms have been developed for the task of learning a dictionary. Two of the most well-known algorithms are the method of optimal directions (MOD) [8] and the KSVD algorithm [9]. Given a set of N signals $Y = [y_1, \cdots, y_N]$, the goal of the KSVD and MOD algorithms is to find a dictionary D and a sparse matrix X that minimize the following representation error

$(\hat{D}, \hat{X}) = \arg\min_{D,X} \|Y - DX\|_F^2 \quad \text{s.t.} \quad \|x_i\|_0 \leq T_0, \ \forall i = 1 \ldots N$  (3.2)

where $x_i$ represents the i-th column of X, $\|A\|_F$ denotes the Frobenius norm of A, and $T_0$ denotes the sparsity level. Both MOD and KSVD are iterative methods that alternate between sparse-coding and dictionary-update steps. First, a dictionary D with $\ell_2$-normalized columns is initialized. Then, the main iteration is composed of the following two stages:
Sparse coding: In this step, D is fixed and the following optimization problem is solved to compute the representation vector xi for each example yi min xi ?yi ?Dxi?22 s.t. ?xi?0 ? T0, ?i = 1 . . . N (3.3) ? Dictionary update: This is where both MOD and KSVD algorithms differ. The MOD algorithm updates all the atoms simultaneously by solving an optimization problem whose solution is given by D = YX?, where X? denotes the Moore- Penrose pseudo-inverse. Even though the MOD algorithm is very effective and 58 usually converges in a few iterations, it suffers from the high complexity of the matrix inversion as discussed in [9]. In the case of KSVD, the dictionary update is performed atom-by-atom in an efficient way rather than using a matrix inversion. It has been observed that the KSVD algorithm requires fewer iterations to converge than the MOD method. 3.2 Non-Linear Kernel Dictionary Learning The dictionaries designed by MOD and KSVD are based on a linear representation of the data. However, linear representations are almost always inadequate for representing non-linear structures of the data which arise in many practical applications. For example, many types of descriptors in computer vision have intrinsic non-linear similarity measure functions. The most popular ones include the spatial pyramid descriptor [10] which uses a pyramid match kernel, and the region covariance descriptor [11] which uses a Rieman- nian metric as the similarity measure between two descriptors. Both of these distance measures are highly non-linear. Unfortunately, the traditional dictionary learning meth- ods, e.g. MOD and KSVD, are based on a linear model. This inevitably leads to poor performances for many datasets, e.g., object classification of Caltech-101 [12] dataset, even when discriminant power is taken into account during the training [13]. Motivated by the drawbacks of the current methods and the needs of many practical applications, we propose kernel dictionaries which are basically dictionaries in high dimensional feature spaces. Our dictionary learning methods yield representations that are more compact than kernel principal component analysis (KPCA) [14] and are able to handle the non-linearity 59 better their linear counterparts. 3.2.1 Problem Formulation Let ? : Rn ? F ? Rn? be a non-linear mapping from Rn into a dot product space F . We restrict our formulation to Hilbert spaces. The reason is because a Hilbert space associates each pair of vectors with a dot product. This allows the use of Mercer kernels to carry out computations implicitly without venturing into the high-dimensional feature space. It is worth noting that in practice n? is often much larger than n, and possibly infinite. More discussion on the use of Mercer kernels is delayed until later in this section. Our goal is to learn a non-linear dictionary D = [d1, . . . ,dK ] in the feature space F by solving the following optimization problem: arg min D,X ??(Y)?DX?2F s.t ?xi?0 ? T0, ?dj?2 = 1, ?i, j. (3.4) where ?(Y) = [?(y1), . . . , ?(yN)], D ? Rn??K is the sought dictionary, X ? RK?N is a matrix whose i-th column is the sparse vector xi corresponding to the sample ?(yi), with maximum of T0 non-zero entries. The following proposition facilitates the re-formulation of (3.4). To the best of our knowledge, none of the previous work on dictionary learning has proven this property. Proposition 3.1. There exists an optimal solution D? to (3.4) that has the following form: D? = ?(Y)A, (3.5) for some A ? RN?K . 60 Table 3.1: Parameters? 
dimensions. Y ?(Y) D A X di xi n?N n??N n??K N ?K K ?N n?? 1 K ? 1 Proof. See Appendix 3.5. At first sight, the solution in (5.9) might look similar to the double-sparsity model D = BA proposed in [86]. However, there are significant differences between the two models. First, we prove in the proposition 3.1 that B = ?(Y) is a natural choice for our case instead of relying on any manual selection of B. In addition, we do not require columns of D to be sparse with respect to the base B as in [86]. Although B is fixed in both models, the dictionary can be tuned to the training data via modifying the coefficient matrix A. For the reader?s convenience, table 3.1 summarizes the dimensions of all im- portant parameters. Parameters that involve n? are those associated with the feature space F . As a result, we can seek an optimal dictionary through optimizing A instead of D. By substituting (5.9) into (3.4), the problem can be re-written as follows: arg min A,X ??(Y)? ?(Y)AX?2F s.t. ?xi?0 ? T0, ?i = 1 . . . N. (3.6) An important property of this formulation is that the objective function in (3.6) can be minimized through the use of Mercer kernels as we shall see shortly in the next section. A Mercer kernel is a function ?(x,y) that satisfies Mercer?s condition [87] [88]: For all data {yi}Ni=1, the function gives rise to a positive semi-definite matrix [Kij ] = 61 [?(yi,yj)]. If ? satisfies the Mercer? condition, it can be shown that ? corresponds to some mapping ? in the Hilbert feature space F . That is, ?(x,y) = ??(x), ?(y)?. Mer- cer kernels are often used to implicitly specify the mapping ?. It helps avoid the com- putationally expensive mapping of data into the high-dimensional feature space. Some commonly used kernels include polynomial kernels ?(x,y) = ?(x,y?+c)d and Gaussian kernels ?(x,y) = exp(??x?y?2 c ), where c and d are the parameters. In order to see the advantage of the formulation in (3.6) over the original one, we will examine the objective function. Through some algebraic manipulations, the cost function can be re-written as: ??(Y)? ?(Y)AX?2F = ??(Y)(I?AX)?2F = tr((I?AX)T ?(Y)T ?(Y)(I?AX)) tr((I?AX)T K(Y,Y)(I?AX)), (3.7) where K(Y,Y) is a kernel matrix whose elements are computed from ?(yi,yj) = ?(yi)T ?(yj). It is more efficient to optimize this objective function since it only involves a kernel matrix of finite dimension K ? RN?N , instead of dealing with a possibly infinite dimensional dictionary in the original formulation (3.4). In order to illustrate this point, we take an example where training data are images of size 16 ? 16 pixels. That is, n = 256. Assuming that a polynomial kernel of degree 4 is used, then the dimension of the feature space is in order of n? ? (256 4 ) ? 108. This is often much larger than the number of training samples. For instance, Caltech-256 dataset only contains n = 30607 images for both training and testing. If a Gaussian kernel is used, the corresponding feature space is the Hilbert space of infinite dimension. Therefore, it is computationally prohibited to carry out computations directly on feature 62 spaces. Instead, we can work with the kernel matrix as in (3.7) and achieve a better computational efficiency. 3.2.2 Optimization Procedure In this section, we present our approach for learning dictionaries in the feature space. Just as in the case of KSVD and MOD, our method of learning dictionaries involve two stages: sparse coding and dictionary update. In what follows, we describe them in detail. 
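As a quick illustration of the identity in (3.7), the objective of (3.6) can be evaluated with nothing more than the N×N kernel matrix. A minimal sketch follows, using the Gaussian kernel mentioned above; the function names are illustrative.

```python
import numpy as np

def gaussian_kernel_matrix(Y, c=1.0):
    """K(Y, Y) with kappa(x, y) = exp(-||x - y||^2 / c).
    Y is n x N, with samples as columns (the convention of this chapter)."""
    sq = np.sum(Y**2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (Y.T @ Y)
    return np.exp(-d2 / c)

def kernel_objective(K, A, X):
    """||Phi(Y) - Phi(Y) A X||_F^2 = tr((I - AX)^T K(Y,Y) (I - AX)), Eq. (3.7)."""
    N = K.shape[0]
    R = np.eye(N) - A @ X
    return float(np.trace(R.T @ K @ R))
```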
3.2.2.1 Kernel Orthogonal Matching Pursuit (KOMP) In this stage, the matrix A is assumed to be fixed and then we seek for the sparse codes contained in the matrix X. Note that, the penalty term in (3.6) can be re-written as ??(Y)? ?(Y)AX?2F = N? j=1 ??(yi)? ?(Y)Axi?22. (3.8) Hence, (3.6) can be reformulated as solving N different problems of the following form min xi ??(yi)? ?(Y)Axi?22 s.t ?xi?0 ? T0, (3.9) for i = 1, ? ? ? , N. The above problem can be solved by any pursuit algorithms [80, 81], with the modification that signals are now in the feature space with a very high dimen- sionality. In what follows, we show how the well-known orthogonal matching pursuit algorithm (OMP) [81, 82] can be generalized using kernels to solve (3.11). In this stage, the matrix A is assumed to be fixed and then we seek for the sparse 63 codes contained in the matrix X. Note that, the penalty term in (3.6) can be re-written as ??(Y)? ?(Y)AX?2F = N? j=1 ??(yi)? ?(Y)Axi?22. (3.10) Hence, (3.6) can be reformulated as solving N different problems of the following form min xi ??(yi)? ?(Y)Axi?22 s.t ?xi?0 ? T0, (3.11) for i = 1, ? ? ? , N. The above problem can be solved by any pursuit algorithms [80, 81], with the modification that signals are now in the feature space with a very high dimen- sionality. In what follows, we show how the well-known orthogonal matching pursuit algorithm (OMP) [81, 82] can be generalized using kernels to solve (3.11). Given a signal z ? Rn and the kernel dictionary represented via A, we seek a sparse combinations of dictionary atoms that approximate z in the feature space: ?(z) = ?(Y)Axs + rs = ?(Y)?s + rs. (3.12) Here, ?s = Axs ? RN is an intermediate notation. We can also think of it as the current loading of the signal ?(z) over a base ?(Y). Finally, rs is the current residual. The pseudo-code for KOMP is given in the Fig. 3.1. Let Is denote the set of indices of the atoms that have already been selected. In the first step, the residual is projected onto the remaining dictionary atoms di = ?(Y)ai, where i /? Is and ai is the i-th column of A. Let ?i denote the projection of the residual onto di, which is computed simply by: ?i = r T s (?(Y)ai) = ( ?(z)? ?(Y)?s)T(?(Y)ai) = ( K(z,Y)? ?Ts K(Y,Y) ) ai, i /? Is (3.13) 64 Input: A signal z, a kernel function ?, a coefficient matrix A, and a sparsity level T0. Task: Find a coefficient vector x ? RK with at most T0 non-zero coefficients such that ??(Y)Ax? ?(z)?2 is minimized. Initialize: s = 0, I0 = ?, x0 = 0, ?0 = 0 Procedure: 1. ?i = ( K(z,Y)? ?Ts K(Y,Y) ) ai, ?i /? Is?1 2. imax = arg maxi|?i|, ?i /? Is?1 3. Update the index set Is = Is?1 ? imax 4. xs = ( ATIsK(Y,Y)AIs ) ?1 (K(z,Y)AIs)T 5. ?s = AIsxs 6. s? s + 1; Repeat steps 1?6 T0 times Output: Sparse vector x ? RK satisfying x(Is(j)) = xs(j), ?j ? Is and zero elsewhere. Figure 3.1: The KOMP algorithm. where, with a slight abuse of notation, we denote K(z,Y) = [?(z,y1), ?(z,y2), . . . , ?(z,yN)]. (3.14) The algorithm then selects a new dictionary atom in the remaining set that gives the largest projection coefficient in (3.13). This selection guarantees the biggest reduction of the approximation error. In the fourth step of Fig. 3.1, we want to update the coefficients in x corresponding 65 to the selected indices in Is. Let xs denote a vector formed by removing from x all coefficients whose indices do not belong to Is. Similarly, let AIs indicate the set of dictionary atoms whose indices are from the set Is. 
3.2.3 Dictionary Update with Kernel MOD

Once the sparse codes for all training data are calculated, the dictionary is updated so that the error

\|\Phi(Y) - \Phi(Y)AX\|_F^2   (3.17)

is minimized. Note that, unlike the traditional dictionary learning framework, a kernel dictionary is learned by optimizing the coefficient matrix $A$ instead of the dictionary $D$. Taking the derivative of this error with respect to $A$ and after some manipulations, we have:

\frac{\partial \|\Phi(Y) - \Phi(Y)AX\|_F^2}{\partial A} = \frac{\partial \, \mathrm{tr}\big((I - AX)^T K(Y,Y)(I - AX)\big)}{\partial A} = -2 K(Y,Y)(I - AX)X^T.   (3.18)

By setting (3.18) to zero, we obtain the relation $A = X^T(XX^T)^{-1}$, which leads to the following iterative update:

A_{k+1} = X_k^T (X_k X_k^T)^{-1} = X_k^{\dagger},   (3.19)

where the subscript $k$ denotes the iteration index. This procedure of updating the dictionary is essentially the idea behind the MOD method [8]. We call it kernel MOD, since the update is done in the feature space.

As discussed in [9], one of the major drawbacks of MOD is that it suffers from the high complexity of the matrix inversion during the dictionary update stage. Specifically, the inversion of the matrix $X_k X_k^T$ has complexity $O(K^3)$, where $K$ is the number of dictionary atoms. This becomes very expensive as the dictionary size is usually large. Several other methods have been proposed that focus on reducing this complexity. One such algorithm is KSVD [9], which eliminates the need for a large matrix inversion and converges faster than MOD [9]. Following the procedure of KSVD, in what follows we describe a more sophisticated way of updating the dictionary.

3.2.4 Dictionary Update with Kernel KSVD

Let $a_k$ and $x_T^j$ denote the $k$-th column of $A$ and the $j$-th row of $X$, respectively. The error

\|\Phi(Y) - \Phi(Y)AX\|_F^2   (3.20)
can be re-written as:

\Big\| \Phi(Y) - \Phi(Y) \sum_{j=1}^{K} a_j x_T^j \Big\|_F^2   (3.21)
= \Big\| \Phi(Y)\Big(I - \sum_{j \ne k} a_j x_T^j\Big) - \Phi(Y)\big(a_k x_T^k\big) \Big\|_F^2   (3.22)
= \big\| \Phi(Y)E_k - \Phi(Y)M_k \big\|_F^2,   (3.23)

where

E_k = \Big(I - \sum_{j \ne k} a_j x_T^j\Big), \qquad M_k = a_k x_T^k.   (3.24)

$\Phi(Y)E_k$ indicates the error between the approximated signals and the true signals when the $k$-th dictionary atom is removed. $\Phi(Y)M_k$ indicates the contribution of the $k$-th dictionary atom to the estimated signals.

At this stage, we assume that only $(a_k, x_T^k)$ are variables and the rest are fixed; hence $E_k$ is also constant for each $k$. The minimization of the above problem is equivalent to finding $(a_k, x_T^k)$ such that the rank-1 matrix $\Phi(Y)M_k$ best approximates $\Phi(Y)E_k$ in terms of mean squared error. The optimal solution can be obtained via a singular value decomposition (SVD). However, there are two reasons that make a direct SVD inappropriate. First, it would yield a dense vector $x_T^k$, which increases the number of non-zeros in the representation $X$. Second, the matrix might have infinitely large row dimension, which is computationally prohibitive.

In order to minimize the objective function while keeping the sparsity of all the representations fixed, we work only with a subset of columns. Note that the columns of $M_k$ associated with zero-valued elements of $x_T^k$ are all zero. These columns do not affect the minimization of the objective function. Therefore, we can shrink the matrices $E_k$ and $M_k$ by discarding these zero columns. An advantage of working with the reduced matrices is that only the non-zero coefficients in $x_T^k$ are allowed to vary, and therefore the sparsity is preserved [9]. Define $\omega_k$ as the group of indices pointing to examples $\{\Phi(y_i)\}$ that use the atom $(\Phi(Y)A)_k$:

\omega_k = \{ i \mid 1 \le i \le N, \; x_T^k(i) \ne 0 \}.   (3.25)

Let $\Omega_k$ be a matrix of size $N \times |\omega_k|$, with ones on the $(\omega_k(i), i)$-th entries and zeros elsewhere. When multiplying by $\Omega_k$, all zeros within the row vector $x_T^k$ are discarded, resulting in the row vector $x_R^k$ of length $|\omega_k|$. The column-reduced matrices are obtained as

E_k^R = E_k \Omega_k; \qquad M_k^R = M_k \Omega_k.   (3.26)

We can now modify the cost function in (3.21) so that its solution has the same support as the original $x_T^k$:

\big\| \Phi(Y)E_k^R - \Phi(Y)M_k^R \big\|_F^2 = \big\| \Phi(Y)E_k^R - \Phi(Y)a_k x_R^k \big\|_F^2.   (3.27)

Recall that $\Phi(Y)a_k x_R^k$ is a rank-1 matrix. Therefore, the optimal solution of (3.27) can be obtained by first decomposing $\Phi(Y)E_k^R$ into rank-1 matrices using the SVD, and then equating $\Phi(Y)a_k x_R^k$ to the rank-1 matrix corresponding to the largest singular value. That is,

\Phi(Y)E_k^R = U \Sigma V^T   (3.28)
\Phi(Y)a_k x_R^k = \sigma_1 u_1 v_1^T,   (3.29)

where $u_1$ and $v_1$ are the first columns of $U$ and $V$ corresponding to the largest singular value $\sigma_1 = \Sigma(1,1)$. A valid breakdown for the assignment in (3.29) is

x_R^k = \sigma_1 v_1^T   (3.30)
\Phi(Y)a_k = u_1.   (3.31)

The reason for putting the multiplier $\sigma_1$ in (3.30) instead of in (3.31) will become clearer when solving for $a_k$; basically, such an assignment guarantees that the resulting dictionary atom in the feature space is normalized to unit norm.

However, as mentioned before, we cannot perform a direct SVD of $\Phi(Y)E_k^R$ as in (3.28), since this matrix can have infinitely large row dimension. This is an important difference between the linear KSVD and kernel KSVD algorithms. A remedy for this issue comes from the fact that the SVD is closely related to the eigen-decomposition of the Gram matrix, which is independent of the row dimension:

\big(\Phi(Y)E_k^R\big)^T \big(\Phi(Y)E_k^R\big) = (E_k^R)^T K(Y,Y)(E_k^R) = V \Delta V^T,   (3.32)

where $\Delta = \Sigma^T \Sigma$. This gives us $v_1$ as the first column of $V$, and $\sigma_1 = \sqrt{\Delta(1,1)}$. Hence, $x_R^k$ can be found using the relation in (3.30). To solve for $a_k$, we first observe that by right-multiplying both sides of (3.28) by $V$ and considering only the first column, we get

\Phi(Y)E_k^R v_1 = \sigma_1 u_1.   (3.33)

The solution for $a_k$ is obtained by substituting (3.31) into (3.33), yielding:

\Phi(Y)E_k^R v_1 = \sigma_1 \Phi(Y)a_k.   (3.34)

Hence,

a_k = \sigma_1^{-1} E_k^R v_1.   (3.35)

One can easily verify that this updating procedure of $a_k$ results in a dictionary atom of unit norm in the feature space. The pseudo-code for the kernel KSVD algorithm is given in Fig. 3.2.

Input: A set of signals $Y$, a kernel function $\kappa$.
Task: Find a dictionary, represented via $A$, for the sparse decomposition of these signals in the feature space by solving the optimization problem in (3.6).
Initialize: Set a random element of each column of $A^{(0)}$ to 1. Normalize each column of the initial dictionary to unit norm. Set iteration $J = 1$.
Stage 1: Sparse coding
Use the KOMP algorithm described in Fig. 3.1 to obtain the sparse coefficient matrix $X^{(J)}$ given a fixed dictionary $A^{(J-1)}$.
Stage 2: Dictionary update
For each column $a_k^{(J-1)}$ in $A^{(J-1)}$, where $k = 1, \dots, K$, update it by:
- Define the group of examples that use this atom: $\omega_k = \{ i \mid 1 \le i \le N, \; x_T^k(i) \ne 0 \}$
- Define the representation error matrix $E_k$ by (3.24).
- Restrict $E_k$ to the columns corresponding to $\omega_k$: $E_k^R = E_k \Omega_k$
- Apply an SVD to get $(E_k^R)^T K(Y,Y)(E_k^R) = V \Delta V^T$. Choose the updated $a_k^{(J)} = \sigma_1^{-1} E_k^R v_1$, where $v_1$ is the first column of $V$, corresponding to the largest value $\sigma_1^2$ in $\Delta$.
Set $J = J + 1$. Repeat from Stage 1 until a stopping criterion is met.
Output: $A$ and $X$.

Figure 3.2: The kernel KSVD algorithm.
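The following is a minimal NumPy sketch (not the original code) of one atom update in Stage 2 of Fig. 3.2, following (3.25)-(3.35); the function name and the in-place updates of A and X are illustrative choices.

```python
import numpy as np

def kernel_ksvd_atom_update(K_YY, A, X, k):
    """One kernel-KSVD atom update: revise column a_k of A and row k of X (cf. (3.25)-(3.35)).
    K_YY is the N x N kernel matrix, A is N x K, X is K x N."""
    omega = np.nonzero(X[k, :])[0]            # examples that use atom k, eq. (3.25)
    if omega.size == 0:
        return A, X                           # unused atom; left unchanged in this sketch
    E_k = np.eye(A.shape[0]) - A @ X + np.outer(A[:, k], X[k, :])   # I - sum_{j != k} a_j x^j_T
    E_kR = E_k[:, omega]                      # column-reduced error matrix, eq. (3.26)
    G = E_kR.T @ K_YY @ E_kR                  # Gram matrix, eq. (3.32)
    evals, V = np.linalg.eigh(G)              # ascending eigenvalues of the symmetric Gram
    sigma1 = np.sqrt(max(evals[-1], 1e-12))
    v1 = V[:, -1]
    A[:, k] = (E_kR @ v1) / sigma1            # a_k = sigma_1^{-1} E_k^R v_1, eq. (3.35)
    X[k, :] = 0.0
    X[k, omega] = sigma1 * v1                 # x_R^k = sigma_1 v_1^T, eq. (3.30)
    return A, X
```

Only the restricted Gram matrix, whose size is the number of examples using the atom, is ever decomposed, which mirrors the point of (3.32).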
Figure 3.3: Comparison of error ratios for (a) KSVD and (b) kernel KSVD on a common logarithm scale.

Figure 3.4: Comparison between the level curves of the projection coefficients for three different dictionary atoms for linear KSVD and kernel KSVD. In this figure, the first row corresponds to KSVD and the second row corresponds to kernel KSVD.

3.3 Experimental Results

In this section, we present several experimental results demonstrating the effectiveness of the proposed dictionary learning method for classification tasks. In particular, we present classification results on the USPS digit dataset and the Caltech-101 object dataset. We first examine the effectiveness of a learned dictionary in the feature space using synthetic data.

3.3.1 Synthetic Data

In the first set of experiments, we generate 1500 samples for a two-class problem by randomly sampling a two-dimensional circle $\{ y = [y_1, y_2] \in \mathbb{R}^2 \mid y_1^2 + y_2^2 = r^2 \}$. The radius $r$ of the first circle (class 1) is half that of the second circle (class 2). The radius of the second circle is set to 1 in this experiment. Separate dictionaries are learned for different classes and different methods. This means that we learn two different dictionaries for the two classes using KSVD, and two kernel dictionaries using kernel KSVD. The following parameters are used to learn dictionaries with both KSVD and kernel KSVD: dictionaries are learned with 30 atoms, $T_0 = 3$, and the maximum number of training iterations is set to 80. For kernel KSVD, we use a polynomial kernel of degree 2.

Fig. 3.3 shows the color-coded map of the error ratios obtained by dividing the reconstruction errors of the second class by those of the first class for all points on the $\mathbb{R}^2$ plane. The reconstruction errors are computed when the sparse coding uses only one dictionary atom (i.e., $T_0 = 1$). Since data samples from the two classes lie roughly on the same linear subspace, which is the entire plane in $\mathbb{R}^2$, dictionaries learned using KSVD are indistinguishable for the two classes. This is clearly seen from Fig. 3.3(a), where the error ratios are quite random even for points lying on the circles.

Figure 3.5: Comparison of error percentage using kernel KSVD and kernel PCA.
On the other hand, as can be seen from Fig. 3.3(b), the error ratios corresponding to a dictionary learned in the feature space exhibit strong differences between the two classes. In particular, the error ratios are very high for points lying close to the first class, and very small for points lying close to the second class. Moreover, points on the same circle have similar error ratios. This observation implies that kernel KSVD correctly learns the non-linear structure of the data and embeds this information into the kernel dictionary atoms.

In the second set of synthetic experiments, we learn dictionaries from 1500 data samples generated from a two-dimensional parabola $\{ y = [y_1, y_2] \in \mathbb{R}^2 \mid y_2 = y_1^2 \}$. All parameters are set to the same values as before. Fig. 3.4 shows the level curves of the projection coefficients for three different dictionary atoms. The curves are obtained as follows. First, we project every point $y \in \mathbb{R}^2$ onto the selected dictionary atom $d_i$ to get the projection coefficient. Then, points with the same projection coefficient are shown with the same color. In other words, every contour in Fig. 3.4 is a collection of points with the same projection coefficient onto a dictionary atom. For kernel KSVD, the projection coefficient for a specific $y$ is computed in the feature space:

\Phi(y)^T d_i = \Phi(y)^T \Phi(Y) a_i = K(y, Y) a_i.   (3.36)

Here, $Y$ denotes the set of points sampled along the parabola and $y$ is any point in $\mathbb{R}^2$. The direction in which the projection coefficient varies (i.e., the direction in which the contour's color changes) indicates the direction of the dictionary atom. For instance, in the top-left image of Fig. 3.4, the KSVD dictionary atom points in the horizontal direction, from left to right. Similarly, for the second and third images in the first row of Fig. 3.4, the KSVD dictionary atoms point diagonally. In contrast to the KSVD coefficients, the coefficients of kernel KSVD (second row of Fig. 3.4) change in a non-linear manner. Each kernel dictionary atom captures variations along both branches of the parabola. This observation implies that our dictionary learning method can capture non-linearities within the data and provide more compact representations.

3.3.2 Kernel sparse representation

In this section, we examine the representations obtained by kernel KSVD and kernel PCA using the USPS dataset. Fig. 3.5 compares the mean squared error (MSE) of an image from the USPS dataset when approximated using the first $m$ dominant kernel PCA components and $m \in \{1, \dots, 20\}$ kernel dictionary atoms (i.e., $T_0 = m$). It is clearly seen that the MSE decays much faster for kernel KSVD than for kernel PCA with respect to the number of selected bases. This observation implies that the image is non-linearly sparse, and learning a dictionary in the high-dimensional feature space can provide a better representation of the data.

3.3.3 Digit Recognition

We apply our dictionary learning approach to the real-world handwritten digit classification problem. We use the USPS database [89], which contains ten classes of 256-dimensional handwritten digits. In other words, the input training samples are the vectorizations of USPS digit images with dimension $n = 256$. For each class, we randomly select $N = 500$ samples for training and $N_{test} = 200$ samples for testing. The selection of parameters is done through a 5-fold cross-validation. Specifically, we choose the following parameters for learning the dictionaries for all classes: each class dictionary is learned with $K = 300$ atoms, $T_0 = 5$, and the maximum number of training iterations is set to 80.
A polynomial kernel of degree 4 is used for both kernel KSVD and kernel PCA unless otherwise stated.

We first use a sparse subspace approach for classification. Let $Y_i = [y_{i,1} \dots y_{i,N}] \in \mathbb{R}^{256 \times 500}$ represent the set of training samples of the $i$-th class, where $i \in \{1, \dots, 10\}$. Let $D_i = \Phi(Y_i)A_i$ denote the learned kernel dictionary of each class, where $A_i \in \mathbb{R}^{500 \times 300}$. Given a query image $z \in \mathbb{R}^{256}$, we first perform KOMP separately for each $D_i$, as in Fig. 3.1, to get the sparse code $x_i$. Since sparse coding is done separately for each class, we call this the distributive approach. The sparsity setting is the same as in the training phase, i.e., $T_0 = 5$. The reconstruction error is then computed as:

r_i = \| \Phi(z) - \Phi(Y_i)A_i x_i \|_2^2 = K(z,z) - 2 K(z, Y_i) A_i x_i + x_i^T A_i^T K(Y_i, Y_i) A_i x_i, \quad i = 1, \dots, 10.   (3.37)

The test sample is simply classified to the class that gives the smallest reconstruction error.

We also evaluate the classification performance using a collective approach where the dictionaries from all classes are concatenated to form a common dictionary $D = [D_1 \dots D_{10}]$. This approach differs from the first one only in the sparse coding step. For a given test sample $z$, the sparse coding is done jointly using KOMP on $D$ to obtain a sparse coefficient vector $x = [x_1, \dots, x_{10}] \in \mathbb{R}^{3000}$, where $x_i$ contains the sparse coefficients associated with the dictionary $D_i$. The sparsity level $T_0$ is chosen to be 10, which is the setting that yields the best performance in a 5-fold cross-validation. The residual of each class is computed as in (3.37). The test sample is classified in a similar fashion to the distributive approach. Our method can be considered as the kernel version of [90].

For kernel PCA, we follow the same approach as in [14] to perform classification. First, we aggregate the training samples from all classes. This results in a training set of 5000 samples, $Y = [Y_1, \dots, Y_{10}] \in \mathbb{R}^{256 \times 5000}$. Training samples are projected onto the first 3000 principal components of largest eigenvalues, which are found by applying the kernel PCA method to $Y$. Classification is then done with a linear SVM classifier trained on the 3000-dimensional projection coefficients.

Figure 3.7: Comparison of digit recognition accuracies for different methods in the presence of (a) Gaussian noise and (b) missing pixels. Red and orange represent the distributive and collective classification approaches for kernel KSVD, respectively.

3.3.3.1 Pre-images of learned atoms

In order to visualize the learned kernel dictionary atoms, we find images that lie in the input space and best approximate these atoms in terms of mean squared error. Recall that the $k$-th kernel dictionary atom of the $i$-th class is represented by $\Phi(Y_i)a_{i,k}$, where $a_{i,k} \in \mathbb{R}^N$ is the representation of the kernel dictionary atom with respect to the base $\Phi(Y_i)$ in the feature space $\mathcal{F}$. The pre-image of $\Phi(Y_i)a_{i,k}$ is obtained by seeking a vector $d_{i,k} \in \mathbb{R}^n$ in the input space that minimizes the cost function $\| \Phi(d_{i,k}) - \Phi(Y_i)a_{i,k} \|_2^2$. Due to various noise effects and the generally non-invertible mapping $\Phi$, the exact pre-image does not always exist. However, an approximate pre-image can be reconstructed without venturing into the feature space using the techniques described in [91, 92]. Fig. 3.6 shows the pre-images of the kernel KSVD dictionary atoms for the 10 digits.

Figure 3.6: The pre-images of the kernel KSVD dictionary atoms.
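Returning to the classification rule above, the following is a minimal NumPy sketch (not the original code) of the class-wise residual in (3.37) and the resulting decision; recomputing the class kernel matrix inside the function is only for clarity, and the function names are illustrative.

```python
import numpy as np

def class_residual(kappa, z, Y_i, A_i, x_i):
    """Residual of eq. (3.37) for one class. kappa is the kernel function, the columns of Y_i
    are the class-i training samples, A_i is the learned coefficient matrix, and x_i is the
    sparse code of the query z obtained with KOMP."""
    N_i = Y_i.shape[1]
    k_zz = kappa(z, z)
    k_zY = np.array([kappa(z, Y_i[:, j]) for j in range(N_i)])                    # K(z, Y_i)
    K_YY = np.array([[kappa(Y_i[:, a], Y_i[:, b]) for b in range(N_i)]
                     for a in range(N_i)])                                        # K(Y_i, Y_i)
    Ax = A_i @ x_i
    return k_zz - 2.0 * (k_zY @ Ax) + Ax @ K_YY @ Ax

def classify(kappa, z, classes):
    """classes is a list of (Y_i, A_i, x_i) tuples; the label with the smallest residual wins."""
    residuals = [class_residual(kappa, z, Y, A, x) for (Y, A, x) in classes]
    return int(np.argmin(residuals))
```

In practice the per-class kernel matrices would be precomputed once; a degree-4 polynomial kernel, as used above, could be passed in as `kappa`.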
3.3.3.2 Classification results

In the first set of experiments, we present results for the situation where the test samples are corrupted by random Gaussian noise with different standard deviations, as shown in Fig. 3.7a. Likewise, Fig. 3.7b shows the results obtained when pixels are randomly removed from the test images. In both experiments, kernel KSVD (both distributive and collective) and kernel MOD (distributive) consistently outperform linear KSVD and kernel PCA. As the distortion level increases, the performance difference between kernel dictionaries and linear dictionaries becomes more drastic. Another interesting observation is that the distributive classification approach outperforms the collective one in the cases with high noise levels. For this reason, we choose the former classification scheme for all other experiments on the USPS dataset.

The second set of experiments examines the effects of parameter choices on the overall recognition performance. Fig. 3.8 shows the classification accuracy of kernel KSVD as we vary the degree of the polynomial kernel. The best error rate of 1.6% is achieved with a polynomial degree of 4. This error rate is lower than the error rates of other methods in the literature: nearest neighbor (5.6%) [93], neural net (4.2%) [94], neural net + boosting (2.6%) [94], invariant support vectors (3%) [95], and supervised dictionary learning (3.54%) [96].

We also compare the performance of our method in the case of missing pixels. When 50% of the pixels in the test samples are randomly removed, the kernel KSVD method shows a similar trend: the accuracy peaks when the polynomial degree is 4, and as the polynomial degree increases further, the accuracy decreases slightly and saturates. For this reason, we fix the polynomial degree to 4 for all other experiments on the USPS dataset.

Fig. 3.9 compares the performances of KSVD, kernel PCA, and kernel KSVD when varying the sparsity $T_0$ in the range from 5 to 50. For kernel PCA, sparsity means the number of principal components, or the dimension of the PCA subspace. It can easily be seen that the performance of KSVD is very sensitive to the sparsity parameter, while kernel KSVD and kernel PCA remain rather stable against this variation. Kernel KSVD also outperforms the other methods considered here for all Gaussian noise levels and all sparsity settings.

Figure 3.8: Kernel KSVD classification accuracy versus the polynomial degree on the USPS dataset, without noise and with 50% missing pixels.

Figure 3.9: Comparison of digit recognition accuracies for different sparsity levels in the presence of Gaussian noise with standard deviation $\sigma$. All algorithms train on clean images and test on noisy images. The dictionary size is set to 50 for this experiment.
This can be seen by examining the curves of the same color in the three plots of Fig. 3.9. Interestingly, while KSVD (with a proper sparsity setting) beats kernel PCA for moderate noise levels, its performance rapidly deteriorates as the noise level increases and becomes even worse than kernel PCA for heavy noise, e.g., $\sigma \ge 1.5$.

Fig. 3.10 compares the recognition accuracies on the USPS dataset for different numbers of dictionary atoms and principal components. Both KSVD and kernel KSVD exhibit a similar trend. In particular, in the presence of noise, their performances decrease as the number of dictionary atoms increases, and the performance of KSVD decreases at a faster rate than that of kernel KSVD. In addition, both KSVD and kernel KSVD are rather insensitive to the number of dictionary atoms when there is no noise. These observations are understandable, since larger dictionaries do not generalize well, making them more brittle in dealing with noise. Interestingly, kernel PCA performs slightly better when allowed more principal components. This is because the principal components are orthogonal to each other and thus less compact, so more components are needed to capture enough useful information; the benefit of this additional information outweighs the inclusion of noise and leads to slightly better performance for kernel PCA. Overall, we observe that the setting with dictionary size $K = 200$ and sparsity $T_0 = 5$ provides a good trade-off between performance, robustness against noise, and computational complexity for this experiment.

Figure 3.10: Comparison of digit recognition accuracies for different dictionary sizes in the presence of Gaussian noise with standard deviation $\sigma$. All algorithms train on clean images and test on noisy images. The sparsity is set to $T_0 = 5$ in this experiment.

3.3.4 Caltech-101 Object Recognition

We report the results of object recognition experiments using the Caltech-101 database [97]. This database comprises 101 object classes and 1 background class collected randomly from the Internet. Each category contains from 31 to 800 images. Most objects are centered in the foreground, and same-class objects are aligned to a similar stereotypical pose. The average size of the images is about $300 \times 300$ pixels. The database is very diverse and challenging, as it includes objects like buildings, musical instruments, animals, and natural scenes.

We combine 39 different features using the boosting approach of [98]. These features include the spatial pyramid [10], the PHOG shape descriptor [99], the region covariance descriptor [11], local binary patterns (LBP) [100], V1S+ [101], etc. All features have non-linear distance metrics. For instance, the $\chi^2$ distance is used for the PHOG shape and LBP descriptors, the geodesic distance is used for the region covariance descriptor, and the Gaussian kernel is used for all other types of features. Kernel functions are computed as $\kappa(y_i, y_j) = \exp(-d(y_i, y_j)/\sigma)$, with $d$ being the distance and $\sigma$ set to the mean of the pairwise distances. Any learning method should be able to take these non-linear similarity measures into account to be effective. Our kernel dictionary learning algorithm fits naturally into this multiple-feature combination framework.
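As an illustration of the step just described, the following is a small sketch (not from the dissertation) of building a kernel matrix from a non-Euclidean distance matrix via $\kappa = \exp(-d/\sigma)$ with $\sigma$ the mean pairwise distance; the $\chi^2$ helper and the toy histograms are assumptions added for the example.

```python
import numpy as np

def kernel_from_distances(D):
    """Turn a pairwise distance matrix (e.g. chi-square or geodesic distances between
    descriptors) into a kernel matrix kappa(y_i, y_j) = exp(-d(y_i, y_j) / sigma),
    with sigma set to the mean pairwise distance."""
    sigma = np.mean(D)
    return np.exp(-D / sigma)

def chi_square_distance(h1, h2, eps=1e-10):
    """Chi-square distance between two histograms (illustrative helper)."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

# toy usage: five normalized toy histograms standing in for PHOG/LBP descriptors
H = np.abs(np.random.default_rng(0).standard_normal((5, 16)))
H /= H.sum(axis=1, keepdims=True)
D = np.array([[chi_square_distance(a, b) for b in H] for a in H])
K = kernel_from_distances(D)   # kernel matrix that could be fed to kernel KSVD
```

The resulting matrix plays the role of $K(Y,Y)$ in the kernel dictionary learning formulation.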
In contrast, current dictionary learning techniques [12, 13], which use the Euclidean distance as the similarity measure between feature vectors, are unable to adequately capture the non-linear structures of the feature space.

We follow the protocol suggested in [10, 12, 102]; namely, we train on $N$ images, where $N \in \{5, 10, 15, 20, 25, 30\}$, and test on the rest. Note that because some categories are very small, we might end up with just a single image for testing. To compensate for the variations in class sizes, we normalize the recognition results by the number of test images to get per-class accuracies. The final recognition accuracy is then obtained by averaging the per-class accuracies across the 102 categories.

We use similar approaches to those in Sec. 3.3.3 for classification. The algorithm starts by learning a dictionary for each class. In particular, we aggregate the descriptors of all training images for each category into $Y_i = [y_{i,1} \dots y_{i,N}]$, where $i \in \{1, \dots, 102\}$. The kernel KSVD algorithm is then run on $Y_i$ with the sparsity level $T_0 = 3$ and the maximum number of iterations set to 80. The dictionary size is set to $\{4, 8, 12, 15, 20, 25\}$, respectively. The parameter selection is done with a 5-fold cross-validation. Let $D_i = \Phi(Y_i)A_i$ denote the learned kernel dictionary, where $A_i \in \mathbb{R}^{N \times K}$. In order to classify a new test feature $y_t$ using the collective approach, we first concatenate the dictionaries from all classes:

D = [D_1 \dots D_{102}] = [\Phi(Y_1)A_1 \dots \Phi(Y_{102})A_{102}].   (3.38)

The final dictionary $D$ is then used in kernel OMP, as in Fig. 3.1, to solve for the sparse decomposition of $y_t$, yielding the sparse coefficients $x_t = [x_{1,t} \dots x_{102,t}]$, where $x_{i,t}$ contains the sparse coefficients associated with the dictionary $D_i$ of the $i$-th class. During the testing phase, we set the sparsity level to $T_0 = 10$. The reconstruction error is computed as:

r_i = \| \Phi(y_t) - \Phi(Y_i)A_i x_{i,t} \|_2^2 = K(y_t, y_t) - 2 K(y_t, Y_i) A_i x_{i,t} + x_{i,t}^T A_i^T K(Y_i, Y_i) A_i x_{i,t}.   (3.39)

Classification is done simply by assigning the test feature to the class with the smallest reconstruction error. The distributive classification scheme is similar, except that the sparse coding is done separately for each dictionary.

Figure 3.11: The first two rows show the classes on which our method performed best: (a) Minaret (100%), (b) Panda (100%), (c) Okapi (100%), (d) Face easy (99.75%). The last two rows show the classes on which it performed worst: (e) Wild cat (25%), (f) Background (34%), (g) Beaver (37.5%), (h) Crocodile (40%).

Table 3.2 and Table 3.3 show the average recognition accuracy of our algorithm in comparison with other methods on the Caltech-101 and Caltech-256 datasets, respectively.

Table 3.2: Comparison of recognition results on the Caltech-101 dataset
Number of train samples    5      10     15     20     25     30
Malik [103]                46.6   55.8   59.1   62.0   -      66.2
Lazebnik [10]              -      -      56.4   -      -      64.6
Griffin [104]              44.2   54.5   59.0   63.3   65.8   67.6
Irani [105]                -      -      65.0   -      -      70.4
Grauman [102]              -      -      61.0   -      -      69.1
Venkatesh [106]            -      -      42.0   -      -      -
Gemert [107]               -      -      -      -      -      64.16
Yang [108]                 -      -      67.0   -      -      73.2
Wang [109]                 51.15  59.77  65.43  67.74  70.16  73.44
SRC [90]                   48.8   60.1   64.9   67.7   69.2   70.7
KSVD [9]                   49.8   59.8   65.2   68.7   71.0   73.2
D-KSVD [13]                49.6   59.5   65.1   68.6   71.1   73.0
LC-KSVD [12]               54.0   63.1   67.7   70.5   72.3   73.6
LP-$\beta$ [98]            54.2   65.0   70.4   73.6   75.7   77.8
D-Kernel KSVD              52.1   63.5   68.3   72.4   74.6   76.8
C-Kernel KSVD              56.5   67.2   72.5   75.8   77.6   80.1
Table 3.3: Comparison of recognition results on the Caltech-256 dataset
Number of train samples    15     30
Griffin [104]              28.3   34.1
Gemert [107]               -      27.2
Yang [110]                 34.4   41.2
D-Kernel KSVD              34.5   41.4
C-Kernel KSVD              34.8   42.5

Kernel KSVD with a collective dictionary outperforms all the other methods for every number of training images. In contrast to the previous experiments on the USPS dataset, we notice that the collective classification approach significantly outperforms the distributive approach on the Caltech-101 dataset. An explanation for this observation is that the set of training data is insufficient (fewer than 30 samples per class). Consequently, the learned dictionaries do not generalize well enough. This leads to a deterioration of the distributive classification scheme, which depends heavily on the representation power of the learned dictionaries.

It is worth noting that the best recognition performance of [98], which uses multiple features with boosting techniques, is only 77.8% on Caltech-101, in comparison with 80.1% for our method. Since the classifier in [98] is learned directly from the feature vectors, it is quite brittle and may not generalize well, especially when the number of training samples is small (e.g., fewer than 30 per class). A more robust approach is to extract sparse features that are common among images of the same category and learn a classifier on top of these features. LC-KSVD [12] follows such an approach, using spatial pyramid descriptors together with a discriminative dictionary learning algorithm to improve the classification result. One drawback of this approach is that the pyramid match kernel, or any other non-linear kernel, cannot be incorporated into the algorithm. As a result, the learning operates in the Euclidean space instead of on the manifold on which the combined features lie. This inhibits the learning of common non-linear structures that can be very useful for classification. For these reasons, the recognition accuracies of both methods ([98] and LC-KSVD [12]) are lower than ours. This observation justifies the effectiveness of kernel dictionary learning and its important role in improving classification accuracy.

Fig. 3.12 plots the confusion table between all categories for our best result (i.e., 80.1%). Fig. 3.13 displays the per-class accuracy, where classes are sorted in ascending order of recognition accuracy. There are 15 categories that achieve perfect recognition. Fig. 3.11 displays a few of the 'easiest' and 'hardest' object categories. The successful classes include 'Okapi' and 'Minaret', while the difficult classes include 'Beaver' and 'Crocodile'. This is consistent with the results in [10]. Interestingly, several classes like 'Ant' achieve rather high performance even though they are among the hardest classes in [10]. This shows the benefit of combining multiple features with dictionary learning.
Figure 3.12: Confusion matrix of our recognition performance on the Caltech-101 dataset using kernel KSVD (average recognition accuracy = 80.1%). The rows and columns correspond to true labels and predicted labels, respectively. The dictionary is learned from 3030 images, where each class contributes 30 images. The sparsity is set to 30 for both training and testing. Although the confusion matrix contains all classes, only a subset of class labels is displayed for better legibility.

Figure 3.13: Average accuracy of each class in the Caltech-101 dataset. Classes are sorted in ascending order of their recognition accuracies.

3.4 Discussion and Conclusion

We have presented two non-linear dictionary learning algorithms that exploit the sparsity of data in a high-dimensional feature space by extending the MOD and KSVD dictionary learning algorithms through an appropriate choice of kernel. Experimental results indicate that exploiting non-linear sparsity by learning dictionaries in a non-linear feature space can provide better discrimination than traditional linear approaches and kernel PCA. Significant improvements over the current state of the art on two standard datasets show that the kernel dictionary learning technique is superior to its linear counterpart in terms of classification accuracy. In particular, kernel dictionaries demonstrate robustness with respect to noise, allowing robust recognition and better tolerance against different types of degradation.

3.5 Proof of Proposition 3.1

Proof. Using the standard orthogonal decomposition, we can write $D^*$ as follows:

D^* = \bar{D} + D_{\perp},   (3.40)

where $\bar{D}$ and $D_{\perp}$ are matrices whose columns lie within, and orthogonal to, the column subspace of $\Phi(Y)$, respectively. In other words, $\bar{D} = \Phi(Y)A$ and $\Phi(Y)^T D_{\perp} = 0$. The optimal cost is equivalent to:

\|\Phi(Y) - D^* X^*\|_F^2 = \|\Phi(Y) - (\bar{D} + D_{\perp})X^*\|_F^2
  = \|\Phi(Y)(I - AX^*) - D_{\perp}X^*\|_F^2
  = \mathrm{tr}\big([\Phi(Y)(I - AX^*) - D_{\perp}X^*]^T [\Phi(Y)(I - AX^*) - D_{\perp}X^*]\big)
  = \mathrm{tr}\big[(I - AX^*)^T \Phi(Y)^T \Phi(Y)(I - AX^*)\big] + \mathrm{tr}\big[(X^*)^T D_{\perp}^T D_{\perp} X^*\big],   (3.41)

where the cross terms vanish because $\Phi(Y)^T D_{\perp} = 0$. The second term can be expressed as:

\mathrm{tr}\big[(X^*)^T D_{\perp}^T D_{\perp} X^*\big] = \sum_{i=1}^{N} x_i^T D_{\perp}^T D_{\perp} x_i \ge 0.
The inequality comes from the fact that $D_{\perp}^T D_{\perp}$ is a positive semidefinite matrix. As a result, the second term of (3.41) can only increase the cost function. For the cost to be optimal, the second term must vanish. In short, $D_{\perp} = 0$, and $D^* = \bar{D} = \Phi(Y)A$ is an optimal solution.

Chapter 4

Sparse Representations On Low-Dimensional Feature Space

In this chapter, we propose a novel framework, called sparse embedding (SE), that brings the strengths of dimensionality reduction and sparse learning together. In this framework, the dimension of the signals is reduced in a way that promotes the sparse structures of the signals. The algorithm simultaneously learns a dictionary in the reduced space, yet allows the recovery of the dictionary in the original domain. This gives the algorithm two important advantages: 1) the ability to remove the distracting part of the signal that negatively interferes with the learning process, and 2) learning in the reduced space with smaller computational complexity. In addition, our framework is able to handle sparsity in non-linear models through the use of Mercer kernels.

4.1 Motivation and Related Work

Signals are usually assumed to lie on a low-dimensional manifold embedded in a high-dimensional space. Dealing with the high dimension is not practical for either learning or inference tasks. As an example of the effect of dimension on learning, Stone [111] showed that, under certain regularity assumptions, including that samples are independent and identically distributed, the optimal rate of convergence for non-parametric regression decreases exponentially with the dimension of the data. As the dimension increases, the Euclidean distances between feature vectors become closer to each other, making the inference task harder. This is known as the concentration phenomenon [112]. To address these issues, various linear and non-linear dimensionality reduction (DR) techniques have been developed [113]. Some examples include PCA [114], ISOMAP [115], LLE [116], and Laplacian Eigenmaps [117].

Recently, there has been an explosion of activity in modelling a signal using appropriate sparse representations (see [118] and the references therein). This approach is motivated by the observation that most signals encountered in practical applications are compressible. In other words, their sorted coefficient magnitudes in some basis obey a power-law decay. For this reason, signals can be well approximated by linear combinations of a few columns of some appropriate basis or dictionary $D$. Although predefined bases such as wavelets or the Fourier basis give rather good performance in signal compression, it has been shown that dictionaries learned from the data can be more compact, leading to better performance in many important tasks such as reconstruction and classification [119, 90].

However, the current algorithms for finding a good dictionary have some drawbacks. The learning of $D$ is challenging due to the high-dimensional nature of the training data, as well as the lack of training samples. Therefore, DR seems to be a natural solution. Unfortunately, current DR techniques are not designed to respect and promote the underlying sparse structures of the data. Therefore, they cannot help the process of learning the dictionary $D$.
Note that the recently developed DR technique [120] based on the sparse linear model is also not suitable for the purpose of sparse learning, since it assumes that the dictionary is given. Ideally, we want an algorithm that can discard the non-informative part of the signal and yet does not destroy the useful sparse structures within the data.

The second disadvantage of the current sparse learning framework is its inability to handle sparse signals within non-linear models. The linear model used for learning $D$ is often inadequate to capture the non-linear relationships within the data that naturally arise in many important applications. For example, in [121, 122, 123] it has been shown that by taking non-linearity into account, one can do better in reconstruction and classification. In addition, the spatial pyramid [10], a popular descriptor for object and scene classification, and the region covariance [11], a popular descriptor for object detection and tracking, both have non-linear distance measures, making the current sparse representation inappropriate.

4.2 Sparse Embedding Framework

The classical approach to learning sparse representations [9] is to minimize the empirical loss over a finite set of signals subject to some sparsity constraint. Let $Y = [y_1, \dots, y_N] \in \mathbb{R}^{n \times N}$ denote the matrix of $N$ input signals, where $y_i \in \mathbb{R}^n$. A popular formulation is:

\{D^*, X^*\} = \arg\min_{D,X} \|Y - DX\|_F^2, \quad \text{subject to:} \quad \|x_i\|_0 \le T_0, \; \forall i,   (4.1)

where $D = [d_1, d_2, \dots, d_K] \in \mathbb{R}^{n \times K}$ is the dictionary that we seek to learn, $X = [x_1, x_2, \dots, x_N] \in \mathbb{R}^{K \times N}$ is the sparse representation of $Y$ over $D$, and $T_0$ is the sparsity level. The cost function in (4.1) promotes a dictionary $D$ that can best represent $Y$ by linearly combining only a few of its columns. This type of optimization can be done efficiently using current methods [123, 9].

Different from the classical approaches, we develop an algorithm that embeds the input signals into a low-dimensional space and simultaneously learns an optimized dictionary. Let $\mathcal{M}$ denote the mapping that transforms input signals into the output space. In general, $\mathcal{M}$ can be non-linear. However, for simplicity of notation, we temporarily restrict our discussion to linear transformations; the extension to the non-linear case is presented in Section 4.4. The mapping $\mathcal{M}$ is then characterized by a matrix $P \in \mathbb{R}^{d \times n}$. We can learn the mapping together with the dictionary by minimizing some appropriate cost function $C_Y$:

\{P^*, D^*, X^*\} = \arg\min_{P,D,X} C_Y(P, D, X).   (4.2)

This cost function $C_Y$ needs to have several desirable properties. First, it has to promote sparsity within the reduced space. At the same time, the transformation $P$ resulting from optimizing $C_Y$ has to retain the useful information within the original signals. The second criterion is needed in order to prevent the pathological case in which everything is mapped to the origin, which yields the sparsest solution but is obviously of no interest. Towards this end, we propose the following optimization:

\{P^*, D^*, X^*\} = \arg\min_{P,D,X} \big( \|PY - DX\|_F^2 + \lambda \|Y - P^T P Y\|_F^2 \big)
\quad \text{subject to:} \quad PP^T = I, \quad \|x_i\|_0 \le T_0, \; \forall i,   (4.3)

where $I \in \mathbb{R}^{d \times d}$ is the identity matrix, $\lambda$ is a non-negative constant, and the dictionary is now in the reduced space, i.e., $D = [d_1, d_2, \dots, d_K] \in \mathbb{R}^{d \times K}$. The first term of the cost function promotes sparsity of the signals in the reduced space. The second term is the amount of energy discarded by the transformation $P$, or the difference between the low-dimensional approximations and the original signals. In fact, the second term is closely related to PCA: if the first term in (4.3) is removed, it can be shown that the solution for $P$ coincides with the principal components corresponding to the largest eigenvalues when the data are centered.

In addition, we require the rows of $P$ to be orthogonal and normalized to unit norm. $P$ plays the role of selecting the right subspace, or equivalently the right features, in which the useful sparse structures within the data are revealed. There are several compelling reasons to keep the orthogonality constraint. First, this constraint leads to a computationally efficient scheme for optimization and classification. Second, it allows the extension of SE to the non-linear case. Note that the columns of the dictionary $D$ still form a non-orthogonal basis in the output space despite the orthogonality constraint on $P$.
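The following is a minimal NumPy sketch (not from the dissertation) of evaluating the objective in (4.3); the toy dimensions, the QR-based construction of an orthonormal $P$, and the variable name `lam` for $\lambda$ are illustrative assumptions.

```python
import numpy as np

def se_cost(Y, P, D, X, lam):
    """Sparse-embedding objective of (4.3): sparse-approximation error in the reduced
    space plus lam times the energy discarded by the orthogonal projection P."""
    fit = np.linalg.norm(P @ Y - D @ X, 'fro') ** 2              # ||PY - DX||_F^2
    discarded = np.linalg.norm(Y - P.T @ (P @ Y), 'fro') ** 2    # ||Y - P^T P Y||_F^2
    return fit + lam * discarded

# toy usage: n-dimensional signals reduced to d dimensions with a random orthonormal P
rng = np.random.default_rng(0)
n, N, d, K0 = 80, 200, 20, 50
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))   # columns of Q are orthonormal
P = Q.T                                            # rows of P orthonormal, so P P^T = I
Y = rng.standard_normal((n, N))
D = rng.standard_normal((d, K0))
X = rng.standard_normal((K0, N))
print(se_cost(Y, P, D, X, lam=1.1))
```

The two terms make the trade-off explicit: the first rewards sparse fidelity in the reduced space, the second penalizes information thrown away by $P$.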
4.3 Optimization Procedure

Proposition 4.1. There exists an optimal solution $P^*$ and $D^*$ to the problem (4.3) that has the following form:

P^* = (YA)^T; \qquad D^* = A^T Y^T Y B,   (4.4)

for some $A \in \mathbb{R}^{N \times d}$ and some $B \in \mathbb{R}^{N \times K}$.

Proof. See Appendix 4.7.

As a corollary of Proposition 4.1, it is sufficient to seek an optimal solution of the optimization in (4.3) through $A$ and $B$. By substituting (4.4) into (4.3), we have:

C_Y(P, D, X) = \|A^T K(I - BX)\|_F^2 + \lambda \|Y(I - AA^T K)\|_F^2,   (4.5)

where $K = Y^T Y$ and $\lambda$ is a regularization parameter. The equality constraint becomes

PP^T = A^T K A = I.   (4.6)

The solution can then be derived as

\{A^*, B^*, X^*\} = \arg\min_{A,B,X} \big( \|A^T K(I - BX)\|_F^2 + \lambda \|Y(I - AA^T K)\|_F^2 \big)
\quad \text{subject to:} \quad A^T K A = I, \quad \|x_i\|_0 \le T_0.   (4.7)

The advantage of this formulation will become clear later. Basically, it allows a joint update of $P$ and $D$ via $A$. As we shall see in Section 4.4, because the objective function is not explicitly represented in terms of $Y$, it is possible to use the kernel method to make the algorithm non-linear. Despite (4.7) being non-convex, our experiments show that effective solutions can be found through iterative minimization.

Solving for A

In this stage, we assume that $(B, X)$ are fixed. As a result, we can remove the sparsity constraint from (4.7). The following proposition shows that $A$ can be solved for efficiently after some algebraic manipulation:

Proposition 4.2. The optimal solution of (4.7) when $B$ and $X$ are fixed is:

A^* = V S^{-\frac{1}{2}} G^*,   (4.8)

where $V$ and $S$ come from the eigendecomposition $K = VSV^T$, and $G^* \in \mathbb{R}^{N \times d}$ is the optimal solution of the following minimum-eigenvalue problem:

\{G^*\} = \arg\min_{G} \mathrm{tr}\big[ G^T H G \big] \quad \text{subject to:} \quad G^T G = I,   (4.9)

where $H = S^{\frac{1}{2}} V^T \big( (I - BX)(I - BX)^T - \lambda I \big) V S^{\frac{1}{2}} \in \mathbb{R}^{N \times N}$.

Proof. The cost function can be expanded as follows:

C_Y(P, D, X) = \|A^T K(I - BX)\|_F^2 + \lambda \|Y(I - AA^T K)\|_F^2   (4.10)
= \mathrm{tr}\big[ (I - BX)(I - BX)^T K^T Q^T K + \lambda (K - 2K^T Q^T K + K^T Q^T K Q K) \big],   (4.11)

where $Q = AA^T \in \mathbb{R}^{N \times N}$. The constraint $A^T K A = I$ leads to the new constraint $AA^T K AA^T = AA^T$, or $QKQ^T = Q$. Using this equality constraint, and noticing that $\mathrm{tr}(K)$ is a constant, the objective function in (4.11) can be simplified to a more elegant form:

\tilde{C}_Y(P, D, X) = \mathrm{tr}\big[ \big( (I - BX)(I - BX)^T - \lambda I \big) K^T Q^T K \big].   (4.12)

With the simple change of variable $G = S^{\frac{1}{2}} V^T A$, and noting that $Q = AA^T$, the cost function can be further simplified as:

\tilde{C}_Y(P, D, X) = \mathrm{tr}\big[ \big( (I - BX)(I - BX)^T - \lambda I \big) V S^{\frac{1}{2}} (S^{\frac{1}{2}} V^T A)(A^T V S^{\frac{1}{2}}) S^{\frac{1}{2}} V^T \big]
= \mathrm{tr}\big[ G^T S^{\frac{1}{2}} V^T \big( (I - BX)(I - BX)^T - \lambda I \big) V S^{\frac{1}{2}} G \big]
= \mathrm{tr}\big[ G^T H G \big].   (4.13)

On the other hand, the equality constraint can also be simplified as:

A^T K A = A^T V S^{\frac{1}{2}} S^{\frac{1}{2}} V^T A = G^T G = I.   (4.14)

Equations (4.13) and (4.14) show that the original optimization in (4.7) is equivalent to (4.9), and the optimal solution $A^*$ can be recovered as in (4.8), i.e., $A^* = V S^{-\frac{1}{2}} G^*$. Note that since $K$ is a positive semidefinite matrix, the diagonal matrix $S$ has non-negative entries. $S^{-\frac{1}{2}}$ is obtained by setting the non-zero entries along the diagonal of $S$ to the inverses of their square roots and keeping the zero entries the same. This completes the proof.
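The following is a minimal NumPy sketch (not the original implementation) of the update in Proposition 4.2; the small-eigenvalue threshold and the function name are illustrative choices.

```python
import numpy as np

def solve_A(K, B, X, lam, d):
    """Update A as in Proposition 4.2: eigendecompose K, build H, take the eigenvectors
    of H with the d smallest eigenvalues, and map back via A = V S^{-1/2} G (eq. (4.8))."""
    S, V = np.linalg.eigh(K)                     # K = V diag(S) V^T, ascending eigenvalues
    S = np.clip(S, 0.0, None)                    # guard against tiny negative eigenvalues
    s_half = np.sqrt(S)
    s_inv_half = np.where(s_half > 1e-10, 1.0 / s_half, 0.0)
    R = np.eye(K.shape[0]) - B @ X               # I - BX
    M = R @ R.T - lam * np.eye(K.shape[0])       # (I - BX)(I - BX)^T - lam * I
    H = (s_half[:, None] * V.T) @ M @ (V * s_half[None, :])   # S^{1/2} V^T M V S^{1/2}
    _, U = np.linalg.eigh(H)
    G = U[:, :d]                                 # eigenvectors of the d smallest eigenvalues
    return (V * s_inv_half[None, :]) @ G         # A = V S^{-1/2} G
```

The minimum-eigenvalue problem (4.9) is solved here directly by `eigh`, exploiting the symmetry of $H$.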
Solving for B and X

In order to solve for $B$, we keep $A$ and $X$ fixed. The second term in (4.7) disappears, and the objective function reduces to:

\|A^T K(I - BX)\|_F^2 = \mathrm{tr}\big( K^T Q K - 2 X^T B^T K^T Q K + X^T B^T K^T Q K B X \big).   (4.15)

A possible way of solving for $B$ is to take the derivative of the objective function with respect to $B$ and set it to zero, which yields:

-2 (K^T Q K) X^T + 2 (K^T Q K) B (X X^T) = 0   (4.16)
B = X^T (X X^T)^{\dagger}.   (4.17)

This is similar to the updating step of the method of optimal directions (MOD) [124], except that $B$ is not the dictionary but its representation coefficients over the training data $Y$. It is also possible to update $B$ using the KSVD algorithm [9]. First, let us rewrite the objective function (4.15) in a more KSVD-friendly form:

\|A^T K(I - BX)\|_F^2 = \|Z - DX\|_F^2,   (4.18)

where $Z = A^T K \in \mathbb{R}^{d \times N}$ is the set of output signals. We can solve for $B$ in two steps. First, we apply the KSVD algorithm to learn a dictionary $D_{ksvd}$ from $Z$. Second, we recover $B$ from $D_{ksvd}$. From Proposition 4.1, it follows that the optimal dictionary has to lie in the column subspace of the output signals $Z$. We can observe that the output dictionary of the KSVD algorithm also obeys this property. As a result, we can recover $B$ exactly by simply taking the pseudo-inverse:

B = Z^{\dagger} D_{ksvd}.   (4.19)

We choose the KSVD-like updating strategy because it is more computationally efficient. Our experiments show that both updating strategies produce similar performance for applications like object classification.

Sparse coding, which solves for $X$, can be done by any off-the-shelf pursuit algorithm. We use orthogonal matching pursuit (OMP) [81] due to its high efficiency and effectiveness. Note that sparse coding is the most expensive step in many dictionary learning algorithms. Kernel KSVD [122] is an extreme example where sparse coding has to be done in the high-dimensional feature space. Our algorithm performs sparse coding in the reduced space, leading to a huge computational advantage, yet it is capable of taking the non-linearity into account, as we demonstrate in the next section.

4.4 Non-linear Extension of Sparse Embedding

There are many important applications of computer vision that deal with non-linear data [10, 11]. Non-linear structures in data can be exploited by transforming the data into a high-dimensional feature space where they may exist as a simple Euclidean geometry.

Input: A kernel matrix $K$, sparsity setting $T_0$, dictionary size $K$, dimension $d$, and $\lambda$.
Task: Find $A^*$ and $B^*$ by solving Eq. (4.4).
Initialize:
- Set iteration $J = 1$. Perform the eigendecomposition $K = VSV^T$.
- Set $A = V(:, I_0)$, where $I_0$ is the index set of the $d$ largest eigenvalues of $K$.
Stage 1: Dictionary update
- Learn a dictionary $D$ and codes $X$ from the reduced signals $Z = A^T K$ using KSVD.
- Update $B = Z^{\dagger} D$.
Stage 2: Embedding update
- Compute $H = S^{\frac{1}{2}} V^T \big( (I - BX)(I - BX)^T - \lambda I \big) V S^{\frac{1}{2}}$.
- Perform the eigendecomposition $H = U \Lambda U^T$.
- Set $G = U(:, I_J)$, where $I_J$ is the index set of the $d$ smallest eigenvalues of $H$.
- Update $A = V S^{-\frac{1}{2}} G$.
- Increment $J = J + 1$. Repeat from Stage 1 until the stopping conditions are reached.
Output: $A$, $B$ and $X$.

Figure 4.1: The SE algorithm for both the linear and non-linear cases.
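Before turning to the non-linear extension, the following is a simplified NumPy sketch of the iteration in Fig. 4.1. It is not the original implementation: for brevity it uses a plain OMP in the reduced space, a random initial $B$, and the MOD-style update (4.17) in place of the KSVD dictionary update; all function names are illustrative.

```python
import numpy as np

def omp(z, D, T0):
    """Plain OMP in the reduced space: greedily pick T0 atoms of D to approximate z."""
    x = np.zeros(D.shape[1]); idx = []; r = z.copy()
    for _ in range(T0):
        c = D.T @ r
        c[idx] = 0.0
        idx.append(int(np.argmax(np.abs(c))))
        coef, *_ = np.linalg.lstsq(D[:, idx], z, rcond=None)
        r = z - D[:, idx] @ coef
    x[idx] = coef
    return x

def sparse_embedding(K, d, n_atoms, T0, lam, n_iter=5):
    """Sketch of the loop in Fig. 4.1 with a MOD-style B update instead of KSVD."""
    S, V = np.linalg.eigh(K)                      # K = V diag(S) V^T
    S = np.clip(S, 0.0, None)
    s_half = np.sqrt(S)
    s_inv_half = np.where(s_half > 1e-10, 1.0 / s_half, 0.0)
    A = V[:, np.argsort(S)[::-1][:d]]             # initialization: d leading eigenvectors
    B = np.random.default_rng(0).standard_normal((K.shape[0], n_atoms))
    for _ in range(n_iter):
        Z = A.T @ K                               # reduced signals
        D = Z @ B                                 # dictionary in the reduced space (D = A^T K B)
        X = np.column_stack([omp(Z[:, i], D, T0) for i in range(Z.shape[1])])
        B = X.T @ np.linalg.pinv(X @ X.T)         # MOD-style update, eq. (4.17)
        R = np.eye(K.shape[0]) - B @ X
        H = (s_half[:, None] * V.T) @ (R @ R.T - lam * np.eye(K.shape[0])) @ (V * s_half[None, :])
        _, U = np.linalg.eigh(H)
        A = (V * s_inv_half[None, :]) @ U[:, :d]  # embedding update, eqs. (4.8)-(4.9)
    return A, B, X
```

In the linear case $K = Y^T Y$; in the non-linear case described next, the same routine can be run on a kernel matrix.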
In order to avoid the computational issues related to high-dimensional mapping, Mercer kernels are often used to carry out the mapping implicitly. We adopt Mercer kernels to extend our analysis to the non-linear case.

Let $\Phi: \mathbb{R}^n \rightarrow \mathcal{H}$ be a mapping from the input space to a reproducing kernel Hilbert space $\mathcal{H}$, and let $k: \mathbb{R}^n \times \mathbb{R}^n \rightarrow \mathbb{R}$ be the kernel function associated with $\Phi$. The mapping $\mathcal{M}$ from the input space to the reduced space is no longer linear. It can, however, be characterized by a compact linear operator $P: \mathcal{H} \rightarrow \mathbb{R}^d$ that maps every input signal $s \in \mathbb{R}^n$ to $P\Phi(s) \in \mathbb{R}^d$. Similar to Proposition 4.1, by letting $K = \langle \Phi(Y), \Phi(Y) \rangle_{\mathcal{H}} = [k(y_i, y_j)]_{i,j=1}^{N}$, we can show that:

P^* = A^T \Phi(Y)^T; \qquad D^* = A^T K B.   (4.20)

Using Eq. (4.20), we can write the mapping $\mathcal{M}$ in an explicit form:

\mathcal{M}: s \in \mathbb{R}^n \rightarrow P\Phi(s) = A^T \langle \Phi(Y), \Phi(s) \rangle_{\mathcal{H}} = A^T [k(y_1, s), \dots, k(y_N, s)]^T.   (4.21)

Similar to the linear case, the non-linear SE gives rise to the following cost function:

\|P\Phi(Y) - DX\|_F^2 + \lambda \|\Phi(Y) - P^T P\Phi(Y)\|_{\mathcal{H}}^2,   (4.22)

which can be expressed in terms of $A$ and $B$ using Eq. (4.20), yielding the equivalent optimization:

\{A^*, B^*, X^*\} = \arg\min_{A,B,X} \|A^T K(I - BX)\|_F^2 + \lambda \, \mathrm{tr}\big[ (I - AA^T K)^T K (I - AA^T K) \big]
\quad \text{subject to:} \quad A^T K A = I, \quad \|x_i\|_0 \le T_0, \; \forall i.   (4.23)

The resulting optimization can be solved in the same way as in the linear case. Fig. 4.1 summarizes the SE algorithm. Note that in the non-linear case, the dimension of the output space can be higher than the dimension of the input space, and is only upper-bounded by the number of training samples.

4.5 Experiments

In this section, we evaluate our algorithm on both synthetic and real datasets. In addition, we propose a novel classification scheme that leads to competitive performance on three different datasets: USPS, Caltech-101, and Caltech-256. We also analyse our method and compare it with the state of the art. For all the experiments in this section, the maximum number of iterations $J$ of our SE algorithm in Fig. 4.1 and that of the KSVD algorithm are set to 5 and 80, respectively.

4.5.1 Recovery of Dictionary Atoms

Similar to previous works [9, 81], we first run our algorithm on a set of synthetic signals. The goal is to verify whether our algorithm is able to learn the sparse patterns from a set of training data that comes with distortions.

Generation of Training Signals: Let $D \in \mathbb{R}^{80 \times 50}$ be a generating dictionary. The first 30 elements in each column of $D$ are generated randomly, and the last 50 elements are set to zero. Each column of $D$, which we call a dictionary atom, is normalized to unit norm. 2000 training signals are generated by linearly combining 3 random atoms from this dictionary and superimposing distortion signals:

Y = DX + \alpha E,   (4.24)

where $\alpha$ is the distortion level; $X \in \mathbb{R}^{50 \times 2000}$ is a matrix in which each column has at most 3 non-zero elements at independent locations with random values; and $E \in \mathbb{R}^{80 \times 2000}$ is a matrix in which each column, called a distortion signal, is generated as follows: the first 30 elements are set to zero, and the last 50 elements are generated randomly and independently from a Gaussian distribution. Each column of $E$ is also normalized to unit norm.
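The following is a small NumPy sketch (not from the dissertation) of generating the synthetic signals in (4.24); the distortion-level symbol and the function name are illustrative.

```python
import numpy as np

def make_synthetic(alpha, n=80, n_sparse=30, n_atoms=50, n_signals=2000, s=3, seed=0):
    """Generate Y = D X + alpha * E as described above: atoms live in the first n_sparse
    dimensions, distortions in the remaining ones; alpha is the distortion level."""
    rng = np.random.default_rng(seed)
    D = np.zeros((n, n_atoms))
    D[:n_sparse, :] = rng.standard_normal((n_sparse, n_atoms))
    D /= np.linalg.norm(D, axis=0)                       # unit-norm generating atoms
    X = np.zeros((n_atoms, n_signals))
    for j in range(n_signals):
        support = rng.choice(n_atoms, size=s, replace=False)
        X[support, j] = rng.standard_normal(s)            # combine 3 random atoms per signal
    E = np.zeros((n, n_signals))
    E[n_sparse:, :] = rng.standard_normal((n - n_sparse, n_signals))
    E /= np.linalg.norm(E, axis=0)                        # unit-norm distortion signals
    return D @ X + alpha * E, D, X

Y, D_true, X_true = make_synthetic(alpha=1.0)
```

The sparse structure lives entirely in the first 30 coordinates, which is what the learned transformation is expected to isolate.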
Our task is to recover $D$ from $Y$. We first use SE to simultaneously reduce the dimension via $A$ and learn a dictionary via $B$. The original dictionary can then be estimated as $\hat{D} = YB$. We compare our results with KSVD. To demonstrate the benefit of the proposed joint dimensionality reduction and dictionary learning, we also compare with the approach in which the dimensionality reduction is done using PCA before learning the dictionary using KSVD. Let $P_{pca}$ represent the PCA transformation. We learn a dictionary, denoted $D_{pca}$, using KSVD on the set of reduced signals $P_{pca}Y$. The original dictionary is recovered by first computing the coefficient matrix $B_{pca} = (P_{pca}Y)^{\dagger} D_{pca}$, and then $\hat{D} = Y B_{pca}$. Note that since the columns of $D_{pca}$ are in the column subspace of $P_{pca}Y$, the computation of the coefficient matrix $B_{pca}$ is exact.

Figure 4.2: Comparison of the number of recovered dictionary atoms over 40 trials, for reduced dimensions 40 (left) and 20 (right).

Verification of Recovered Dictionaries: For all methods, the computed dictionary is compared against the generating dictionary. This comparison is done by sweeping through the columns of the generating dictionary to find the closest column (in $\ell_2$ distance). The distance is measured by $(1 - |d_i^T \hat{d}_i|)$, where $\hat{d}_i$ is the $i$-th estimated dictionary atom and $d_i$ is its closest atom in the original dictionary. A distance of less than 0.01 is considered a success. We learn dictionaries with sparsity level $T_0 = 3$, dictionary size $K = 50$, and $\lambda = 1.1$.

Figure 4.3: Average number of successfully recovered dictionary atoms versus the dimension of the reduced space, for distortion levels $\alpha = 0.60, 0.80, 1.00, 1.20$. The blue lines correspond to the PCA+KSVD scheme, the green lines to KSVD, and the red lines to SE.

Fig. 4.2 compares the number of successes over 40 trials. Fig. 4.3 plots the average number of successes versus the dimension of the output space for different distortion levels. SE consistently outperforms both KSVD and PCA+KSVD, with larger and larger performance gaps as the distortion level increases. Fig. 4.3 shows that the performance of PCA+KSVD decreases drastically not only when the distortion level increases, but also when the dimension becomes smaller than 30, which is the true dimension of our sparse signals. In contrast, SE outperforms KSVD even when the dimension goes below 30. Interestingly, for a high distortion level such as $\alpha = 1.2$, it is beneficial to reduce the dimension to even below the true dimension of the sparse signals (see the rightmost chart of Fig. 4.3).

Figure 4.4: Comparison of the PCA mapping (left) and the transformation $P$ learned by SE (right). Distortion level $\alpha = 1$; the dimension of the reduced space is 40.

In order to understand the reason behind the good results of SE, we visually inspect the transformation $P$. Fig. 4.4 shows images of $P$ and of the PCA mapping. The first 30 rows of $P$ weight heavily on the first 30 dimensions of $Y$. In other words, SE efficiently preserves the sparse structures of the signals and discards non-sparse distortions. In contrast, PCA does not preserve the sparse patterns, since it attempts to capture more of the signal variation: only around 24 rows of the PCA mapping focus on the first 30 dimensions, and the rest put more emphasis on the non-sparse structures. Being able to get rid of distortions while preserving the sparse structures enables SE to achieve a higher recovery rate.
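The following is a minimal sketch (not the original evaluation code) of the verification step just described: each generating atom is matched to its closest estimated atom and counted as recovered when $1 - |d_i^T \hat{d}_i| < 0.01$; the function name is illustrative.

```python
import numpy as np

def count_recovered(D_true, D_est, tol=0.01):
    """Count recovered atoms: for each generating atom, find the closest estimated atom
    and call it a success when 1 - |d_i^T d_hat| < tol."""
    D_true = D_true / np.linalg.norm(D_true, axis=0)
    D_est = D_est / np.linalg.norm(D_est, axis=0)
    sims = np.abs(D_true.T @ D_est)              # |d_i^T d_hat_j| for all pairs
    return int(np.sum(1.0 - sims.max(axis=1) < tol))
```

With the estimates $\hat{D} = YB$ (for SE) or $\hat{D} = Y B_{pca}$ (for PCA+KSVD), this count is what Figs. 4.2 and 4.3 report.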
4.5.2 Latent Sparse Embedding Residual Classifier (LASERC)

Classification is an important component of many computer vision applications. In this section, we propose a novel classification scheme motivated by the SE framework. For generality, we consider the non-linear setting. Let there be $C$ different classes of signals $\{ Y_i = [y_{ij}]_{j=1}^{N_i} \in \mathbb{R}^{n \times N_i} \}_{i=1}^{C}$. We use the SE algorithm in Fig. 4.1 to learn $\{A_i, B_i\}_{i=1}^{C}$, which implicitly provides $\{P_i, D_i\}_{i=1}^{C}$. Given a new test sample $s_t$, the classification is done in three steps:

I) We compute the output signals $z_i$ by mapping $s_t$ via $\mathcal{M}_i$:

\mathcal{M}_i: s_t \rightarrow z_i = P_i \Phi(s_t) = A_i^T \langle \Phi(Y_i), \Phi(s_t) \rangle_{\mathcal{H}} = A_i^T k_{i,t},   (4.25)

where

k_{i,t} = [k(y_{i1}, s_t), \dots, k(y_{iN_i}, s_t)]^T \in \mathbb{R}^{N_i}.   (4.26)

II) We obtain the sparse code $x_i$ of $z_i$ over the dictionary $D_i = A_i^T K_i B_i$ using the OMP algorithm, where

K_i = \langle \Phi(Y_i), \Phi(Y_i) \rangle_{\mathcal{H}} = [k(y_{ij}, y_{ik})]_{j,k=1}^{N_i} \in \mathbb{R}^{N_i \times N_i}.   (4.27)

III) We compute the residual for each class as follows:

r_i = \|\Phi(s_t) - P_i^T D_i x_i\|_{\mathcal{H}}^2 = k(s_t, s_t) - 2 k_{i,t}^T A_i D_i x_i + x_i^T D_i^T A_i^T K_i A_i D_i x_i.   (4.28)

Here, $D_i x_i$ is the estimated signal in the output space, and $P_i^T D_i x_i$ is the estimated signal in the feature space. The sparse coding step makes our algorithm more resilient to noise. Finally, each signal is assigned to the class that yields the smallest reconstruction residual. We call this the latent sparse embedding residual classifier (LASERC). The term 'latent' comes from the fact that the residual errors are computed in the feature space instead of the input space, which does not take into account the non-linearities of the signal. The output space is also not suitable for classification because it does not retain sufficient discriminatory power due to its low-dimensional nature.
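The following is a minimal sketch (not the original code) of the three-step LASERC decision rule (4.25)-(4.28); it assumes the `omp` helper from the sketch after Fig. 4.1, and the model tuple layout and function names are illustrative.

```python
import numpy as np

def laserc_predict(s_t, kernel, models):
    """LASERC sketch. `models` is a list with one entry per class:
    (Y_i, A_i, B_i, K_i, T0), where K_i = [k(y_ij, y_ik)].
    `omp` is assumed to be the plain reduced-space OMP sketched earlier."""
    residuals = []
    for (Y_i, A_i, B_i, K_i, T0) in models:
        k_it = np.array([kernel(Y_i[:, j], s_t) for j in range(Y_i.shape[1])])   # eq. (4.26)
        z_i = A_i.T @ k_it                          # step I, eq. (4.25)
        D_i = A_i.T @ K_i @ B_i                     # class dictionary in the reduced space
        x_i = omp(z_i, D_i, T0)                     # step II: sparse code of z_i
        Dx = D_i @ x_i
        r_i = (kernel(s_t, s_t) - 2.0 * k_it @ (A_i @ Dx)
               + Dx @ (A_i.T @ K_i @ A_i) @ Dx)     # step III, eq. (4.28)
        residuals.append(r_i)
    return int(np.argmin(residuals))
```

The residual is evaluated entirely through kernel values, so the feature-space estimate $P_i^T D_i x_i$ is never formed explicitly.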
It is also worthwhile to investigate the cases when the objective function in (4.7) has only the first term (λ = 0), denoted by LASERC1, and when there is only the second term (λ → ∞), denoted by LASERC2. Figs. 4.5(a) and 4.5(b) show that LASERC2 performs better in the low-noise cases and worse in the high-noise cases in comparison with LASERC1.

In order to see the effect of dimension on the classification performance of LASERC, we compare the results for different values of the dimension d = {5, 10, 100, 200, 500}. The corresponding sparsity levels are T_0 = {2, 3, 5, 5, 10}. Fig. 4.5(c) shows that the classification result improves as the dimension increases from 5 to 100. Beyond 100, the accuracy decreases slightly for the noiseless case, but faster for very noisy cases such as Gaussian noise with σ = 1.2.

We project the test samples onto a random dictionary atom of the first class (digit 0). Fig. 4.5(d) plots the sorted projection coefficients of all the samples, color-coded by their class labels. We can observe from Fig. 4.5(d) that, in the feature space, the dictionary atom is almost perpendicular to all samples except those from the first class (orange color at the two ends). This implies that the learned dictionary has taken into account the non-linearities of the signals.

4.5.2.2 Caltech-101 and Caltech-256 Object Recognition

We perform the second set of object recognition experiments on the Caltech-101 database [97]. This database comprises 101 object classes and 1 background class collected from the Internet. It contains a diverse and challenging set of images of buildings, musical instruments, animals, natural scenes, etc.

(Figure 4.6: Sample images from the classes corresponding to the highest accuracy of LASERC, (a) Okapi (100%), (b) Leopard (100%), (c) Face (100%), and the lowest accuracy, (d) Mayfly (20%), (e) Cougar (35%), (f) Seahorse (42%).)

Table 4.1: Comparison of recognition results on the Caltech-101 dataset (left), recognition results on the Caltech-256 dataset (upper right), and the computing time (lower right).

Caltech-101:
# train samples      5      10     15     20     25     30
Malik [103]         46.6   55.8   59.1   62.0    -     66.2
Lazebnik [10]        -      -     56.4    -      -     64.6
Griffin [104]       44.2   54.5   59.0   63.3   65.8   67.6
Irani [105]          -      -     65.0    -      -     70.4
Yang [108]           -      -     67.0    -      -     73.2
Wang [109]          51.15  59.77  65.43  67.74  70.16  73.44
SRC [90]            48.8   60.1   64.9   67.7   69.2   70.7
KSVD [9]            49.8   59.8   65.2   68.7   71.0   73.2
D-KSVD [13]         49.6   59.5   65.1   68.6   71.1   73.0
LC-KSVD [12]        54.0   63.1   67.7   70.5   72.3   73.6
LASERC              55.2   65.6   69.5   73.1   75.8   77.3

Caltech-256:
# train samples      15     30
Griffin [104]       28.3   34.1
Gemert [107]         -     27.2
Yang [110]          34.4   41.2
LASERC              35.2   43.6

Computing time:
Method        Train (s)   Test (ms)
SVM             2.1         8.1
Ker. KSVD       2598        3578
SRC             N/A         520
D-KSVD          450         12.8
LASERC          7.2         9.4

Spatial pyramid descriptors [10] are used with the following settings: 16×16 patch size, 8-pixel grid spacing, 3 pyramid levels, and 200 quantized clusters of feature vectors. This results in a 4200-dimensional descriptor. The similarity is measured using the pyramid match kernel [10], which is also a Mercer kernel. We follow the protocol suggested in [10, 102]: we train on m images, where m ∈ {5, 10, 15, 20, 25, 30}, and test on the rest. The corresponding parameter settings of SE are T_0 = {3, 4, 5, 7, 8, 9}, d = {5, 10, 15, 20, 25, 30}, and λ = 1.1. To compensate for the variation in class size, we normalize the recognition results by the number of test images to get per-class accuracies. The final recognition accuracy is then obtained by averaging the per-class accuracies across 102 categories. We also repeat the same experiment on the Caltech-256 dataset.
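As a small illustration of the reporting protocol just described, the hedged sketch below (illustrative names only) averages class-wise accuracies so that every category contributes equally, regardless of how many test images it has.

```python
import numpy as np

def mean_per_class_accuracy(y_true, y_pred, num_classes=102):
    """Average of per-class accuracies, as used for Caltech-101/256 reporting."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    accs = []
    for c in range(num_classes):
        mask = (y_true == c)
        if mask.any():
            # accuracy of this class alone, independent of its test-set size
            accs.append(np.mean(y_pred[mask] == c))
    return float(np.mean(accs))
```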
(Figure 4.7: Caltech-101 per-class accuracy.)

Table 4.1 shows the comparison of our classification accuracy with the state of the art. It is interesting that our method significantly outperforms the other discriminative approaches such as LC-KSVD [12] and D-KSVD [13]. Thanks to the dimensionality reduction step, our method also achieves a significant speed-up in the training process over the other sparse learning methods, as shown in Table 4.1. Fig. 4.6 shows sample images from the easiest classes as well as the most difficult classes. Fig. 4.7 shows the recognition accuracy per class, and Fig. 4.8 shows the confusion matrix.

(Figure 4.8: Caltech-101 confusion matrix.)

4.6 Conclusion

This chapter presented a novel framework for joint dimensionality reduction and sparse learning. It proposed an efficient algorithm for solving the resulting optimization problem, and designed a novel classification scheme leading to state-of-the-art performance and robustness on several popular datasets. Through extensive experimental results on real and synthetic data, we showed that sparse learning techniques can benefit significantly from dimensionality reduction in terms of both computation and accuracy.

4.7 Proof of Proposition 4.1

The proposed optimization is:

{P*, D*, X*} = argmin_{P,D,X} ‖PY − DX‖²_F + λ‖Y − PᵀPY‖²_F   (4.29)
subject to: PPᵀ = I, and ‖x_i‖₀ ≤ T₀, ∀i.

Recall the proposition that we want to prove:

Proposition 4.1. There exists an optimal solution P* and D* to (4.29), for sufficiently large λ, that has the following form:

D* = AᵀYᵀYB,   P* = (YA)ᵀ   (4.30)

for some A ∈ R^{N×d} and some B ∈ R^{N×K}.

Proof. First, we will show that, for a specific P, there exists an optimal solution of D that has the form PYB for some B ∈ R^{N×K}. Let D*_P be an optimal solution corresponding to P. For simplicity of notation, we will write D* instead of D*_P. Since the columns of D* have the same dimension as the columns of PY, we can express D* using the orthogonal decomposition

D* = D_∥ + D_⊥   (4.31)

where

D_∥ = (PY)B and D_⊥ᵀ(PY) = 0   (4.32)

for some appropriate B ∈ R^{N×K}. Basically, the columns of D_∥ and D_⊥ are in, and orthogonal to, the column subspace of PY, respectively. Since P is fixed, the cost function in (4.3) only depends on the first term:
‖PY − D*X‖²_F = ‖PY − (D_∥ + D_⊥)X‖²_F   (4.33)
 = tr(YᵀPᵀPY − 2YᵀPᵀ(D_∥ + D_⊥)X + Xᵀ(D_∥ + D_⊥)ᵀ(D_∥ + D_⊥)X)   (4.34)
 = tr(YᵀPᵀPY − 2YᵀPᵀD_∥X + XᵀD_∥ᵀD_∥X + XᵀD_⊥ᵀD_⊥X)   (4.35)
 ≥ tr(YᵀPᵀPY − 2YᵀPᵀD_∥X + XᵀD_∥ᵀD_∥X)   (4.36)

The last inequality holds because XᵀD_⊥ᵀD_⊥X is a positive semi-definite matrix and thus its trace is non-negative. In order for D* to be an optimal solution, the cost function must achieve its smallest value, i.e., the term in (4.35) has to be equal to the term in (4.36). A sufficient condition is D_⊥ = 0. In other words, D_∥ is also an optimal solution. As a result, for each P there exists an optimal solution of D of the form

D* = PYB   (4.37)

Moreover, the optimal solution of this form has the smallest Frobenius norm among all optimal solutions of D, because ‖D*‖²_F = ‖D_∥‖²_F + ‖D_⊥‖²_F ≥ ‖D_∥‖²_F, i.e., ‖D_∥‖_F ≤ ‖D*‖_F.

Second, we will show that the optimal solution of P, denoted by P*, must have the form P* = (YA)ᵀ for some A ∈ R^{N×d}. Using the orthogonal decomposition of P*, we have

P* = P_∥ + P_⊥   (4.38)

where

P_∥ = (YA)ᵀ and P_⊥Y = 0   (4.39)

for some appropriate A ∈ R^{N×d}. The rows of P_∥ and P_⊥ are in, and orthogonal to, the column subspace of Y, respectively. Using Eqs. (4.37) and (4.38), we can rewrite the first term of the objective function in (4.3) as

‖P*Y(I − BX)‖²_F = ‖(P_∥ + P_⊥)Y(I − BX)‖²_F   (4.40)
 = ‖P_∥Y(I − BX)‖²_F = tr(P_∥Y(I − BX)(I − BX)ᵀYᵀP_∥ᵀ)   (4.41)

and the second term as

‖Y − P*ᵀP*Y‖²_F = tr(YᵀY − 2YᵀP*ᵀP*Y + YᵀP*ᵀP*P*ᵀP*Y)   (4.42)
 = tr(YᵀY − YᵀP*ᵀP*Y) = tr(YᵀY − Yᵀ(P_∥ + P_⊥)ᵀ(P_∥ + P_⊥)Y)   (4.43)
 = tr(YᵀY − YᵀP_∥ᵀP_∥Y) = tr(YᵀY − P_∥YYᵀP_∥ᵀ)   (4.44)

From Eq. (4.42) to Eq. (4.43), we use the constraint P*P*ᵀ = I. Putting Eqs. (4.39), (4.41), and (4.44) together and letting K = YᵀY, the objective function becomes

tr(K) − tr(P_∥Y(λI − (I − BX)(I − BX)ᵀ)YᵀP_∥ᵀ)   (4.45)
 = tr(K) − tr(AᵀK(λI − (I − BX)(I − BX)ᵀ)KA)   (4.46)

Let K = VSVᵀ, H̃ = S^{1/2}Vᵀ(λI − (I − BX)(I − BX)ᵀ)VS^{1/2}, and G = S^{1/2}VᵀA. Note that H̃ is positive semi-definite for a sufficiently large value of λ. We can derive a lower bound of the cost function:

tr(K) − tr(AᵀVS^{1/2} H̃ S^{1/2}VᵀA)   (4.47)
 = tr(K) − tr(Gᵀ H̃ G) ≥ tr(K) − Σ_{i=1..d} γ_i   (4.48)

where γ_i is the i-th largest eigenvalue of H̃. In order for the objective function to achieve its minimum, the columns of G have to be the eigenvectors corresponding to the largest eigenvalues of H̃. Therefore, GᵀG = I. Equivalently,

GᵀG = AᵀKA = P_∥P_∥ᵀ = I − P_⊥P_⊥ᵀ = I   (4.49)

which means that P_⊥ = 0. In short, the optimal solution of P has the form

P* = P_∥ = (YA)ᵀ   (4.50)

This solution also has the smallest Frobenius norm, since all optimal solutions of P have Frobenius norm equal to √d:

‖P*‖_F = √(tr(P*ᵀP*)) = √(tr(P*P*ᵀ)) = √(tr(I)) = √d   (4.51)

Using the form of P* and Eq. (4.37), we conclude that there exists an optimal solution of D that has the form D* = P*YB = AᵀYᵀYB. This completes the proof.

Chapter 5
Domain Adaptation Using a Sparse and Hierarchical Network

5.1 Motivation

In many practical computer vision applications, we are often confronted with the situation where the data that we use to train our algorithm have a different distribution or representation from the data presented during testing. The ubiquity of this problem is well known to computer vision researchers. For instance, indoor images are quite different from outdoor images, just as videos captured with a high-definition camera differ from those collected with a webcam. This detrimental effect is often a dominant factor accounting for the poor performance of many computer vision algorithms. As an example of the effect of distribution mismatch, Ben-David et al.
[15] show that, under certain assumptions, the bound on the test error increases linearly with the ℓ1 divergence between the training and testing distributions. Even worse, data from the test domain are often scarce and expensive to obtain. This makes it impractical to re-train an algorithm from scratch, since a learning algorithm generalizes poorly when an insufficient amount of data is presented [16].

Understanding the problem caused by domain change has received substantial attention in recent years (see [125] and references therein). The problem can be informally stated as follows: given a source domain whose representation or distribution can be different from that of the target domain, how can the model trained on the source be effectively utilized to achieve good performance on the target data? It is also often assumed that the source domain has plentiful labelled training samples, while only a few (both labelled and unlabelled) samples are available from the target domain. It has been shown in [126, 127, 128, 129, 130] that domain adaptation techniques can significantly improve the performance of computer vision tasks such as visual object detection and recognition.

Although the formulations differ, most of the algorithms for adapting a recognition system to a new visual domain share a common architecture containing two main stages. First, features are extracted separately for source and target using hand-crafted feature descriptors, followed by a second stage where transformations are learned in order to rectify the discrepancy between the two domains. This architecture has several drawbacks. Without any knowledge about the target domain, the feature extraction performed on the source data can ignore information important to the target data. For instance, let us consider a picture of a small bird in a forest background. The forest features apparently dominate the picture description. While these features might be sufficient for classifying the picture into land animals or sea animals, they do not benefit the target domain, which has the same task on images of animals without background. Another issue is that discriminative information can be embedded in multiple levels of the feature hierarchy. High-level features are sometimes more useful than low-level ones. In fact, this is one of the main motivations behind the development of hierarchical networks (e.g., [131, 132]), so that more complex abstractions of a visual object can be captured. The traditional framework of domain adaptation employs a shallow architecture containing a single layer. This ignores the possibility of transferring at multiple levels of the feature hierarchy. Finally, the process of designing features, such as SIFT [133] or SURF [134], is tedious and time-consuming. It requires a deep understanding and a careful examination of the underlying physics that governs the generation of data. Such requirements might be impractical given that the data from the target domain are often very scarce.

Contributions: In order to address the limitations of existing approaches, we propose a novel approach for domain adaptation that possesses the following advantages:

• Adaptation is performed on multiple levels of the feature hierarchy in order to maximize the knowledge transfer. The hierarchical structure allows the transfer of useful information that might not be well captured by existing domain adaptation techniques.

• Adaptation is done jointly with feature learning.
Our method learns a hierarchy of sparse codes and uses them to describe a visual object instead of relying on any low-level feature.

• Unlike existing hierarchical networks, our network is more computationally efficient, with a mechanism that prevents the data dimension from increasing too fast as the number of layers increases.

We provide extensive experiments to show that our approach outperforms the current state of the art by a large margin. This is interesting since our training is entirely generative, followed by a linear support vector machine, while several other methods employ discriminative training together with non-linear kernels. Furthermore, we introduce a new set of data for benchmarking the performance of our algorithm. The new dataset has two domains containing half-toning and edge images, respectively.

5.2 Related Works

While domain adaptation was first investigated in speech and natural language processing [135, 136], it has been studied extensively in other areas such as machine learning [125, 15] and computer vision, especially in the context of visual object recognition [126, 127, 128, 129, 130]. For instance, the semi-supervised approach proposed by Saenko et al. [126] employed metric learning to learn the domain shift using partially labeled data from the target domain. This work was extended by Kulis et al. [127] to handle asymmetric domain transformations. Gopalan et al. [128] addressed the problem of unsupervised domain adaptation, where samples from the target domain are unlabeled, using an incremental approach based on Grassmann manifolds. By formulating a geodesic flow kernel, [129, 137] extended this approach to integrate an infinite number of subspaces on the geodesic flow from the source domain to the target domain. Sparse methods have also been used to address the domain shift problem [138, 139]. In particular, [138] modelled dictionaries across different domains with a parametric mapping function, while [139] enforced different domains to share a common sparse representation on some latent domain. Another class of techniques performs domain adaptation by directly learning a target classifier from classifiers trained on the source domain(s) [140, 141]. A drawback of the existing approaches is that the domain-shifting transformation is considered only at a single layer, which may not adequately capture the shift between the source and target domains. It is worth noting that although [142] also named their method hierarchical domain adaptation, it is not closely related to ours: they made use of a hierarchical Bayesian prior, while we employ a multi-layer network of sparse representations.

The design of multi-layer networks has been an active research topic in computer vision. Some of the early work includes [143], which used a multistage system to extract salient features in the image at different spatial scales. By learning higher-level feature representations from unlabelled data, deep belief networks (DBNs) [131] and their variants, such as convolutional DBNs [132] and deep autoencoders [144], have been shown to be effective when applied to classification problems. Motivated by recent work on deep learning, multi-layer sparse coding networks [145, 146, 147] have been proposed to build feature hierarchies layer by layer using sparse codes and spatial pooling. A common problem with these hierarchical approaches is that the dimension of the learned feature vectors keeps increasing after each layer, which may make the computations costly.
5.3 Proposed Approach

In this section, we briefly summarize some related work in sparse coding and dictionary learning. Recent advances in feature learning using hierarchical sparse coding are also discussed. We then proceed to introduce the main formulation of DASH-N.

5.3.1 Latent Sparse Representation

Given a set of training samples Y = [y₁, ..., y_n] ∈ R^{d×n}, the problem of learning a dictionary together with the sparse codes is typically posed as the minimization of the following cost function over (D, X):

‖Y − DX‖²_F + λ‖X‖₁   s.t. ‖d_i‖₂ = 1, ∀i ∈ [1, K]   (5.1)

where D = [d₁, ..., d_K] ∈ R^{d×K} is the sought dictionary, X = [x₁, ..., x_n] ∈ R^{K×n} is the horizontal concatenation of the sparse codes, and λ is a non-negative constant. The ℓ1 penalty is well known for promoting sparsity of the coefficients in X. The ℓ0 pseudo-norm can also be used for sparse regularization. However, we choose the ℓ1 norm because, due to its convex nature, the optimization algorithms possess a better theoretical guarantee for successfully recovering the true sparse signals [148].

(Figure 5.1: An illustration of the DASH-N algorithm. The source domain is RGB images and the target domain is halftone images. First, images are divided into small overlapping patches. These patches are vectorized while maintaining their spatial arrangements. (a) Contrast normalization and dimensionality reduction using P_S for source images and P_T for target images; the circular feedback between P_S and P_T indicates that these two transformations are learned jointly. (b) Obtaining sparse codes using the common dictionary D¹. (c) Max pooling. The process then repeats for layer 2 (d & e), except that the input is the sparse codes from layer 1 instead of pixel intensities. At the final stage, a spatial pyramid with max pooling is used to create image descriptors. Classification is done using a linear support vector machine.)

From the observation that signals often lie on a low-dimensional manifold, several authors have proposed to perform dictionary learning and sparse coding on a latent space [149, 150]. We call this the latent sparse representation, to distinguish it from the formulation in (5.1). This is done by minimizing the following cost function over (P, D, X):

L(Y, P, D, X, λ, β) = ‖PY − DX‖²_F + β‖Y − PᵀPY‖²_F + λ‖X‖₁
  s.t. PPᵀ = I and ‖d_i‖₂ = 1, ∀i ∈ [1, K]   (5.2)

where P ∈ R^{p×d} is a linear transformation that brings the data to a low-dimensional feature space (p < d). Note that the dictionary now lives in the low-dimensional space, D ∈ R^{p×K}. Besides the computational advantage, [150] shows that this optimization can recover the underlying sparse representation better than traditional dictionary learning methods. This formulation is attractive since it allows the transformation of the data into another domain to better handle different sources of variation such as illumination and geometric articulation.

5.3.2 Multi-layer Feature Learning

Designing features for visual objects is a time-consuming and challenging task that requires a deep understanding of the domain knowledge. It is also non-trivial to adapt these manually designed features to new types of data such as hyperspectral or range-scan images. For this reason, learning features from the raw data has become increasingly popular and has demonstrated competitive performance on practical computer vision tasks [146]. In order to capture the richness of data, a hierarchical network is employed to learn a spectrum of features, layer by layer.
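As a quick numerical illustration of the latent sparse representation described above, the sketch below evaluates the cost in (5.2) for random inputs. The assignment of λ to the sparsity term and β to the preservation term follows the convention used in this chapter, and all names and values are illustrative.

```python
import numpy as np

def latent_sparse_cost(Y, P, D, X, lam, beta):
    """Evaluate the latent sparse representation objective of Eq. (5.2).

    Y : (d, n) data, P : (p, d) with P P^T = I, D : (p, K) latent dictionary,
    X : (K, n) sparse codes.  lam weights the l1 term, beta the preservation term.
    """
    fit = np.linalg.norm(P @ Y - D @ X, "fro") ** 2
    preserve = np.linalg.norm(Y - P.T @ (P @ Y), "fro") ** 2
    sparsity = np.abs(X).sum()
    return fit + beta * preserve + lam * sparsity

# Example with a random orthogonal P obtained from a QR factorization
d, p, K, n = 64, 16, 32, 100
rng = np.random.default_rng(0)
Y = rng.standard_normal((d, n))
P = np.linalg.qr(rng.standard_normal((d, p)))[0].T           # (p, d), P P^T = I
D = rng.standard_normal((p, K))
D /= np.linalg.norm(D, axis=0, keepdims=True)                # unit-norm atoms
X = rng.standard_normal((K, n)) * (rng.random((K, n)) < 0.1)  # sparse codes
print(latent_sparse_cost(Y, P, D, X, lam=0.3, beta=1.5))
```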
Very recently, multi-level sparse coding networks, including hierarchical sparse coding [145, 146, 147], have been proposed for feature learning. These networks contain a coding layer and a pooling layer. A dictionary is learned at each coding layer using (5.1), which then serves as a codebook for obtaining sparse codes from image patches or pooled features. Spatial pooling schemes, most notably max pooling, group the sparse codes from adjacent blocks into common entities. This operation makes the resulting features more invariant to certain changes caused by translation and rotation. The pooled sparse codes from one layer serve as the input to the next layer. While feature learning has been successfully applied to object recognition on RGB images and 3-D point clouds [151], to the best of our knowledge, our method is one of the first attempts at integrating the feature learning step into domain adaptation.

5.3.3 Hierarchical Domain Adaptation

Motivated by the work of [146], we propose a method to perform hierarchical domain adaptation jointly with feature learning. Figure 5.1 shows an overview of the proposed method. The network contains multiple layers, each of which contains 3 sub-layers, as illustrated in Figure 5.1. The first sub-layer performs contrast normalization and dimensionality reduction on the input data. Sparse coding is carried out in the second sub-layer. In the final sub-layer, adjacent features are max-pooled together to produce new features. The output from one layer becomes the input to the next layer.

For simplicity of notation, we consider a single source domain. The extension of DASH-N to multiple source domains is straightforward (see the supplementary material). In each layer of DASH-N, we learn a joint latent sparse representation. This can be formulated as the minimization of the following cost function w.r.t. (P_S, P_T, D, X_S, X_T):

L(Y_S, P_S, D, X_S, λ, β) + γ L(Y_T, P_T, D, X_T, λ, β)   (5.3)
s.t. P_S P_Sᵀ = P_T P_Tᵀ = I, ‖d_i‖₂ = 1, ∀i ∈ [1, K]   (5.4)

where (λ, β, γ) are non-negative constants and D ∈ R^{p×K} is the common dictionary. Y_S ∈ R^{d_S×n_S} and Y_T ∈ R^{d_T×n_T} are the input data at each layer, P_S ∈ R^{p×d_S} and P_T ∈ R^{p×d_T} are the transformations to the latent domain, and X_S ∈ R^{K×n_S} and X_T ∈ R^{K×n_T} are the sparse codes of the source and the target, respectively. In this formulation, the two domains are forced to share a common dictionary in the latent domain. Together with the sparsity constraint, the common D provides a coupling effect that promotes the discovery of common structure between the two domains.

In our experiments, we learn a network of two layers, as we found empirically that adding a third layer did not provide significant improvements over the first two layers.

Layer 1: We perform dense sampling on each training image to get a set of overlapping patches of size 5×5 pixels. These patches are then contrast-normalized as explained in layer 2. In order to make the computation more efficient, only a random subset of patches from each image is used for learning the latent sparse representation. We found that setting this number to 150, for images of maximum size 150×150, provides a good trade-off between accuracy and computational efficiency. After learning the dictionary D¹ and the transformations (P¹_S, P¹_T), the sparse codes (X¹_S, X¹_T) are computed for all sampled patches by solving

min_{X¹_∗} ‖P¹_∗Y¹_∗ − D¹X¹_∗‖²₂ + λ₁‖X¹_∗‖₁   (5.5)

where ∗ indicates either source or target. Each column of Y¹_∗ is the vectorized pixel values inside a patch.
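The layer-1 coding sub-layer can be sketched as below; this is only a rough illustration under stated assumptions: scikit-learn's lasso-LARS coder stands in for the LARS implementation of [152], the ε = 0.1 normalization of Eq. (5.6) is reused for raw patches, and the array names are made up.

```python
import numpy as np
from sklearn.feature_extraction.image import extract_patches_2d
from sklearn.decomposition import sparse_encode

def layer1_codes(image, P, D, lam=0.3, patch_size=(5, 5)):
    """Sparse-code the 5x5 patches of one image in the latent domain, Eq. (5.5).

    image : 2-D grayscale array; P : (p, 25) latent projection for this domain;
    D : (p, K) dictionary shared by source and target.
    """
    patches = extract_patches_2d(image, patch_size)               # (n, 5, 5)
    Y = patches.reshape(len(patches), -1).T                       # (25, n), one patch per column
    Y = Y / (np.linalg.norm(Y, axis=0, keepdims=True) + 0.1)      # contrast normalization
    Z = P @ Y                                                     # project to the latent domain
    # lasso coding of each projected patch over the shared dictionary D
    X = sparse_encode(Z.T, D.T, algorithm="lasso_lars", alpha=lam).T   # (K, n)
    return X
```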
A fast implementation of the LARS algorithm is used for solving this optimization problem [152]. Spatial max pooling is used to aggregate the sparse codes over each 4×4 neighborhood, as this pooling method is particularly well suited for the separation of sparse features [153].

Layer 2: In this layer, we perform similar computations, except that the input is the sparse codes from layer 1 instead of image pixels. The features obtained from the previous layer are aggregated by concatenation over each 4×4 neighborhood and contrast-normalized. This results in a new representation that is more robust to occlusion and illumination. If f is an aggregated feature vector, the contrast normalization can be performed as in [147]:

f̃ = f / (‖f‖₂ + ε)   (5.6)

where ε = 0.1 is found to work well in our experiments. Similar to layer 1, we also randomly sample 150 normalized feature vectors f̃ from each image for training. ℓ1 optimization is again employed to compute the sparse codes of the normalized features f̃.

At the end of layer 2, the sparse codes are aggregated using max pooling in a multi-level patch decomposition (spatial pyramid max pooling). At level 0 of the spatial pyramid, a single feature vector is obtained by performing max pooling over the whole image. At level 1, the image is divided into four quadrants and max pooling is applied to each quadrant, yielding 4 feature vectors. Similarly, for level 2 we obtain 9 feature vectors, and so on. In this chapter, max pooling over a three-level spatial pyramid is used. As a result, the final feature vector returned by the second layer for each image is the concatenation of 14 feature vectors from the spatial pyramid.

5.4 Optimization Procedure

In this section, we elaborate on the details of how the cost function in (5.3) is minimized. First, let us define

K_S = Y_SᵀY_S,   K_T = Y_TᵀY_T,   K = [ K_S  0 ; 0  √γ K_T ]   (5.7)

to be the Gram matrix of the source, the Gram matrix of the target, and their block-diagonal concatenation, respectively. It can be shown (see the supplementary material) that the optimal solution of (5.3) takes the form

P_S = (Y_S A_S)ᵀ,   P_T = (Y_T A_T)ᵀ   (5.8)
D = [A_Sᵀ K_S,  √γ A_Tᵀ K_T] B   (5.9)

for some A_S ∈ R^{n_S×p}, A_T ∈ R^{n_T×p}, and B ∈ R^{(n_S+n_T)×K}. Notice that the rows of each transformation live in the column subspace of the data from its own domain. In contrast, the columns of the dictionary are jointly created by the data of both source and target.

Solving for (A_S, A_T): The orthogonality constraint in (5.4) can be re-written using (5.8) as

A_Sᵀ K_S A_S = I,   A_Tᵀ K_T A_T = I.   (5.10)

By substituting (5.8) and (5.9) into (5.3) and making use of the orthogonality constraint in (5.10), the formulation can be simplified to (see the derivation in the supplementary material)

min_G tr(GᵀHG)   s.t. G_Sᵀ G_S = G_Tᵀ G_T = I   (5.11)

where H is defined as

H = Λ^{-1/2} Vᵀ K ((I − BX)(I − BX)ᵀ − βI) K V Λ^{-1/2}   (5.12)
V = [ V_S  0 ; 0  V_T ],   Λ = [ Λ_S  0 ; 0  √γ Λ_T ]   (5.13)
K_S = V_S Λ_S V_Sᵀ,   K_T = V_T Λ_T V_Tᵀ   (5.14)

where (5.14) gives the eigen-decompositions of the Gram matrices. Finally, G is defined as

G = [G_S,  √γ G_T]   (5.15)
G_S = Λ_S^{1/2} V_Sᵀ A_S,   G_T = Λ_T^{1/2} V_Tᵀ A_T.   (5.16)

The optimization in (5.11) is non-convex due to the orthogonality constraints. However, G can be learned efficiently using the algorithm proposed by [154]. Given G, the solution of (A_S, A_T) is simply given by

A_S = V_S Λ_S^{-1/2} G_S,   A_T = V_T Λ_T^{-1/2} G_T.   (5.17)

We note that the optimization step involves the eigen-decompositions of large Gram matrices whose dimensions equal the number of training samples (≈ 10⁵ in our experiments).
This is computationally infeasible. We propose a remedy for this; the source is used for illustration, and the computation for the target is similar. First, we compute the eigen-decomposition of the matrix

C_S = Y_S Y_Sᵀ = U_S Λ̃_S U_Sᵀ ∈ R^{d_S×d_S}   (5.18)

then the d_S dominant eigenvectors of K_S can be recovered as in (5.19). The relationship in (5.19) between V_S and U_S can be easily verified using an SVD of Y_S:

V_S = Y_Sᵀ U_S Λ̃_S^{-1/2}.   (5.19)

The signal dimension d_S is much smaller than the number of training samples n_S in our experiments (e.g., 10³ versus 10⁵). The eigen-decomposition of C_S is therefore much more efficient than that of K_S. Finally, the non-zero eigenvalues in Λ_S are given by the diagonal coefficients of Λ̃_S.

Solving for (B, X): If we fix (A_S, A_T), then learning (B, X_S, X_T) can be done using any dictionary learning algorithm. To see this, let us define

Z = [A_Sᵀ K_S,  √γ A_Tᵀ K_T]   (5.20)
X = [X_S,  √γ X_T].   (5.21)

The cost function can then be re-written in the familiar form

‖Z − DX‖²_F + λ(‖X_S‖₁ + γ‖X_T‖₁).   (5.22)

We use the lasso to solve for the sparse codes X and the efficient online dictionary learning algorithm of [152] to solve for D. The solution of B can then be recovered, using the relationship in (5.9), simply as B = Z†D, where † denotes the Moore-Penrose pseudo-inverse.

5.5 Experiments

The proposed algorithm is evaluated in the context of object recognition using a recent domain adaptation dataset [126] containing 31 classes, with the addition of images from the Caltech-256 dataset [155]. There are 10 classes common to the two datasets (BACKPACK, TOURING-BIKE, CALCULATOR, COMPUTER-KEYBOARD, HEADPHONES, LAPTOP-101, COMPUTER-MONITOR, COMPUTER-MOUSE, COFFEE-MUG, and VIDEO-PROJECTOR), which together contain a total of 2533 images. Domain shifts are caused by variations in factors such as pose, lighting, resolution, etc., between images in different domains. Figure 5.2 shows example images from the LAPTOP-101 class for the different domains. We compare our method with state-of-the-art adaptation algorithms such as [126, 128, 129, 139]. In order to better assess the ability to adapt to a wide range of domains, experimental results are also reported on new images obtained by applying halftoning [156] and edge detection [157] algorithms to images from the datasets in [126, 155].

(Figure 5.2: Example images from the LAPTOP-101 class in different domains: (a) Amazon, (b) Caltech, (c) DSLR, (d) Webcam. First row: original images; second row: halftone images; third row: edge images.)

5.5.1 Experiment Setup

We follow the experimental set-up of [129]. The results using 10 as well as all 31 classes are reported. In both cases, experiments are performed over 20 random trials for each pair of source and target domains. If the source domain is Amazon or Caltech, 20 samples are used in training; otherwise, only 8 training samples are used for DSLR and Webcam. The number of target training samples is always set to 3. The remaining images from the target domain in each split are used for testing.
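Before turning to the results, here is a small numpy sketch of the eigen-decomposition shortcut in Eqs. (5.18)-(5.19): the eigenvectors of the large Gram matrix K_S = Y_SᵀY_S are recovered from the much smaller matrix C_S = Y_SY_Sᵀ. The function and variable names are illustrative.

```python
import numpy as np

def gram_eigvecs_via_outer(Y):
    """Recover the dominant eigenvectors of K = Y^T Y from the much smaller
    C = Y Y^T, as in Eqs. (5.18)-(5.19).  Y is (d, n) with d << n."""
    C = Y @ Y.T                                  # (d, d), cheap to decompose
    evals, U = np.linalg.eigh(C)                 # ascending eigenvalues
    order = np.argsort(evals)[::-1]
    evals, U = evals[order], U[:, order]
    keep = evals > 1e-10                         # keep non-zero eigenvalues only
    evals, U = evals[keep], U[:, keep]
    V = Y.T @ U / np.sqrt(evals)                 # Eq. (5.19): V = Y^T U Lambda^{-1/2}
    return V, evals                              # columns of V are eigenvectors of K

# sanity check against the large Gram matrix (only feasible for small n)
rng = np.random.default_rng(1)
Y = rng.standard_normal((20, 500))
V, lam = gram_eigvecs_via_outer(Y)
K = Y.T @ Y
assert np.allclose(K @ V, V * lam, atol=1e-8)
```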
Table 5.1: Recognition rates of different approaches on four domains (C: Caltech, A: Amazon, D: DSLR, W: Webcam). 10 common classes are used. Red color denotes the best recognition rates; blue color denotes the second-best recognition rates.

Method                   | C→A       | C→D       | A→C       | A→W       | W→C       | W→A       | D→A       | D→W
Metric [126]             | 33.7±0.8  | 35.0±1.1  | 27.3±0.7  | 36.0±1.0  | 21.7±0.5  | 32.3±0.8  | 30.3±0.8  | 55.6±0.7
SGF [128]                | 40.2±0.7  | 36.6±0.8  | 37.7±0.5  | 37.9±0.7  | 29.2±0.7  | 38.2±0.6  | 39.2±0.7  | 69.5±0.9
GFK (PLS+PCA) [129]      | 46.1±0.6  | 55.0±0.9  | 39.6±0.4  | 56.9±1.0  | 32.8±0.1  | 46.2±0.6  | 46.2±0.6  | 80.2±0.4
SDDL [139]               | 49.5±2.6  | 76.7±3.9  | 27.4±2.4  | 72.0±4.8  | 29.7±1.9  | 49.4±2.1  | 48.9±3.8  | 72.6±2.1
DASH-N (1st layer)       | 60.3±2.7  | 79.6±3.1  | 52.2±2.1  | 74.1±4.6  | 45.31±3.7 | 68.7±2.9  | 65.9±2.1  | 76.3±2.3
DASH-N (1st+2nd layers)  | 71.6±2.2  | 81.4±3.5  | 54.9±1.8  | 75.5±4.2  | 50.2±3.3  | 70.4±3.2  | 68.9±2.9  | 77.1±2.8

(Figure 5.3: Dictionary responses of the training (left) and testing (right) data from the BACKPACK class, for the DSLR-Webcam domain pair in the first layer. Both curves show prominent responses at atoms 43, 103, 136, and 160.)

Parameter Settings: In our experiments, all images are resized to be no larger than 150×150 with preserved aspect ratio and converted to grayscale. The patch size is set to 5×5. The parameter γ is set to 4 in order to account for there being fewer training samples from the target than from the source, and β is set to 1.5 for all experiments. We also found that using λ_train = 0.3 for training and λ_test = 0.15 for testing yields the best performance. A smaller sparsity constant often makes the decoding more stable and thus leads to more consistent sparse codes. This is similar to the finding in [146]. The number of dictionary atoms is set to 200 and 1500 in the first and second layers, respectively. The dimension of the latent domain is set to 20 and 750 in the first and second layers, respectively. It is worth noting that the input feature to layer 2 has a dimension of 3200, which results from aggregating the sparse codes obtained from the first layer over a 4×4 spatial cell (4×4×200). By projecting them into a latent domain of dimension 750, the computations become more tractable. A three-level spatial pyramid, partitioned into 1×1, 2×2, and 3×3, is used to perform the max pooling in the final layer. A linear SVM [158] with a regularization parameter of 10 is employed for classification.

Table 5.2: Single-source recognition rates on all 31 classes.

Method                   | A→W       | D→W       | W→D
Metric [126]             | 44        | 31        | 27
RDALR [130]              | 50.7±0.8  | 36.9±19.9 | 32.9±1.2
SGF [128]                | 57±3.5    | 36±1.1    | 37±2.3
GFK (PLS+PCA) [129]      | 46.4±0.5  | 61.3±0.4  | 66.3±0.4
SDDL [139]               | 50.1±2.5  | 51.2±2.1  | 50.6±2.6
DASH-N (1st layer)       | 59.9±2.7  | 65.8±1.3  | 69.6±2.1
DASH-N (1st+2nd layers)  | 60.6±3.5  | 67.9±1.1  | 71.1±1.7

Computation Time: It takes an average of 35 minutes to perform the dictionary learning and feature extraction for all training samples using our Matlab implementation on a computer with a 3.8 GHz Intel i7 processor. It takes less than 2 seconds to compute the feature for a test image of size 150×150 using both layers of the hierarchy.

Table 5.3: Multiple-source recognition rates on all 31 classes.

Method                   | {D, A}→W  | {A, W}→D  | {W, D}→A
A-SVM [159]              | 30.4±0.6  | 25.3±1.1  | 17.3±0.9
RDALR [130]              | 36.9±1.1  | 31.2±1.3  | 20.9±0.9
SGF [128]                | 52±2.5    | 39±1.1    | 28±0.8
FDDL [160]               | 41.0±2.4  | 38.4±3.4  | 19.30±1.2
SDDL [139]               | 57.8±2.4  | 56.7±2.3  | 24.1±1.6
DASH-N (1st layer)       | 61.7±2.5  | 64.1±3.5  | 39.6±1.3
DASH-N (1st+2nd layers)  | 64.5±2.3  | 68.6±3.7  | 41.8±1.1

5.5.2 Object Recognition

10 Common Classes: The recognition results of the different algorithms on 8 pairs of source-target domains are shown in Table 5.1.
It can be seen that DASH-N outperforms all compared methods in 7 out of 8 pairs of source-target domains. For pairs such as Caltech-Amazon, Webcam-Amazon, or DSLR-Amazon, we achieve more than 20% improvement over the next best algorithm used in the comparison (from 49.5% to 71.6%, 49.4% to 70.4%, and 48.9% to 68.9%, respectively). It is worth noting that while we employ a generative approach for learning the features, our method consistently achieves better performance than [139], even though that method uses discriminative training together with non-linear kernels. It is also clear from the table that the multi-layer DASH-N outperforms the single-layer DASH-N. In the case of adapting from Caltech to Amazon, the performance gain from using a combination of features obtained from both layers, rather than just features from the first layer, is more than 10% (from 60.3% to 71.6%).

In order to illustrate the encoding of features using the learned dictionary in the first layer, Figure 5.3 shows the responses of the training and testing data for the BACKPACK class with respect to each atom of the first-layer dictionary for the DSLR-Webcam domain pair. The sparse codes for all the patches of the training and testing images belonging to the class are computed. The absolute values of these sparse vectors are summed together and normalized to unit length. Small components of the normalized sparse codes are thresholded to better show the correspondences between the training and testing data. It can be seen from the figure that the sparse codes for the training and testing data of the BACKPACK class both have high responses on four dictionary atoms (43, 103, 136 and 160).

31 Classes and Multiple Sources: We also compare the recognition results for all 31 classes between our approach and other methods in both the single-source (Table 5.2) and the multiple-source (Table 5.3) settings. It can be seen from Tables 5.2 and 5.3 that our results, even using only features extracted from the first layer, are consistently better than those of the other algorithms in all the domain settings. This proves the effectiveness of the feature learning process. The performance of our algorithm further increases when combining features learned from both layers of the hierarchy. Especially in the case of adapting from Webcam and DSLR to Amazon, we achieve an improvement of more than 15% compared to the result of SDDL [139] (from 24.1% to 41.8%).

Dimensions of Latent Domains: These are important parameters affecting the performance of DASH-N. Figure 5.4(a) shows the recognition rates with respect to different dimensions of the latent domain in the first layer for three pairs of source-target domains (Amazon-DSLR, Caltech-Webcam and Webcam-DSLR), while keeping the dimension of the latent domain in the second layer at 750. As the patch size is set at 5×5, we vary the dimension of the first-layer latent domain from 5 to 25. It can be seen from the figure that if the latent domain dimension is too low, the accuracy decreases. The optimal dimension is achieved at 20. Similarly, the recognition rates with respect to different dimensions of the second-layer latent domain are shown in Figure 5.4(b), while the first-layer latent dimension is kept at 20. It can be seen from Figure 5.4(b) that the curves for all three pairs of source-target domains peak at dimension 750. Once again, we observe that the performance decreases if the dimension of the latent domain is too low.
More interestingly, as we can observe for the pairs Caltech-Webcam and Webcam-DSLR, setting the dimension of the latent domain too high is as detrimental as setting it too low. In all of our experiments, we set the dimension of the latent domain using cross validation.

(Figure 5.4: Recognition rates with respect to different dimensions of the latent domain in the (a) first and (b) second layer, for the Amazon-DSLR, Caltech-Webcam, and Webcam-DSLR pairs.)

5.5.3 Half-toning and Edge

In order to evaluate the ability of DASH-N to adapt to a wide range of domains, we also perform experiments on object recognition from the original image domain to two new domains generated by applying half-toning and edge extraction algorithms to the original images. Half-toning images, which imitate the effect of the jet-printing technology of the past, are generated using the dithering algorithm in [156]. Edge images are obtained by applying the Canny edge detector [157] with the threshold set to 0.07.

Figure 5.5 shows the reconstructed dictionary atoms at layer 1 when adapting from the original images (source) to edge images (target). The reconstructed dictionaries are obtained as D̃¹_∗ = (P¹_∗)†D¹, where † denotes the Moore-Penrose pseudo-inverse. We observe that the dictionary atoms of the original images contain rather fat and smooth regions. In contrast, the dictionary atoms of the edge images have many thin and highly varying patterns that are more suitable for capturing edges.

Table 5.4 shows the performance of the different algorithms when adapting to these new domains. It can be seen from the table that DASH-N outperforms the other methods used in the comparison for both half-toning and edge images. This proves the ability of our approach to adapt well to new domains.

(Figure 5.5: The reconstructed dictionaries at layer 1. (a) Source: Amazon original images. (b) Target: Amazon edge images.)

5.6 Conclusion

We have presented a hierarchical method for performing domain adaptation using multi-layer representations of images. In the proposed approach, the features and domain shifts are learned jointly in each layer of the hierarchy in order to obtain a better representation of data from different domains. Unlike other hierarchical approaches, our method prevents the dimension of the feature vectors from increasing too fast as the number of layers increases. Experimental results show that the proposed approach significantly outperforms the other domain adaptation algorithms used in the comparison. In the future, we plan to incorporate non-linear learning frameworks into DASH-N.

Table 5.4: Recognition rates of different approaches on the half-toning and edge datasets. 10 common classes are used.

(a) Half-toning
Method                   | C→A      | C→D      | A→C      | A→W      | W→C      | W→A      | D→A      | D→W
kNN                      | 50.1±5.1 | 41.9±5.5 | 29.8±4.3 | 42.9±2.8 | 28.9±2.5 | 48.3±2.1 | 48.6±1.1 | 41.4±4.1
Metric [126]             | 41.1±6.4 | 38.8±5.6 | 31.9±5.4 | 49.4±3.4 | 32.8±3.2 | 49.9±3.3 | 43.8±2.6 | 49.3±2.6
SDDL [139]               | 52.2±3.9 | 66.7±5.5 | 34.1±3.5 | 69.2±4.2 | 34.6±2.8 | 51.2±3.4 | 54.1±2.7 | 71.6±5.3
DASH-N (1st layer)       | 68.1±3.1 | 76.6±3.5 | 49.7±2.8 | 85.4±3.1 | 42.1±3.7 | 62.7±3.2 | 66.4±2.1 | 79.3±2.5
DASH-N (1st+2nd layer)   | 70.2±2.7 | 79.6±4.3 | 52.4±2.3 | 86.2±4.1 | 43.3±3.9 | 66.1±3.7 | 67.2±3.5 | 80.7±2.1

(b) Edge
Method                   | C→A      | C→D      | A→C      | A→W      | W→C      | W→A      | D→A      | D→W
kNN                      | 50.8±1.8 | 50.4±1.4 | 32.8±2.9 | 47.5±4.2 | 30.4±3.3 | 51.9±3.1 | 48.9±1.8 | 50.2±2.1
Metric [126]             | 42.8±2.7 | 43.8±2.5 | 35.2±2.1 | 53.6±1.4 | 36.8±1.8 | 53.2±3.1 | 40.8±3.9 | 54.5±2.7
SDDL [139]               | 52.9±5.2 | 63.8±6.3 | 32.4±3.2 | 62.5±5.7 | 33.5±2.9 | 55.2±2.8 | 55.4±3.3 | 65.3±4.7
DASH-N (1st layer)       | 68.3±4.2 | 72.9±3.3 | 33.1±2.1 | 66.2±4.2 | 42.6±3.4 | 59.9±2.7 | 62.7±2.3 | 61.5±2.6
DASH-N (1st+2nd layer)   | 74.2±3.9 | 75.7±2.6 | 44.3±2.5 | 74.2±4.5 | 46.7±3.1 | 68.9±2.2 | 67.7±3.4 | 74.5±3.2
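As a rough illustration of how the two synthetic domains evaluated in Table 5.4 can be produced, the sketch below uses Pillow's Floyd-Steinberg dithering as a stand-in for the dithering algorithm of [156] and scikit-image's Canny detector for [157]. The mapping of the 0.07 threshold onto Canny's hysteresis thresholds is an assumption, as are the function names.

```python
import numpy as np
from PIL import Image
from skimage import feature

def make_new_domains(path, max_size=150):
    """Generate a half-toned (dithered) version and a Canny edge map of one image.

    The 1-bit Pillow conversion applies Floyd-Steinberg dithering; the 0.07
    value is used as the high hysteresis threshold on a [0, 1] grayscale image,
    which is only an assumed translation of the MATLAB-style threshold.
    """
    img = Image.open(path).convert("L")
    img.thumbnail((max_size, max_size))            # resize, preserving aspect ratio
    halftone = np.array(img.convert("1"))          # dithered, boolean array
    gray = np.asarray(img, dtype=np.float64) / 255.0
    edges = feature.canny(gray, sigma=1.0, low_threshold=0.03, high_threshold=0.07)
    return halftone, edges
```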
Chapter 6
Future Works

This chapter discusses several potential directions for future research. They include learning invariant representations, learning transformation models to relate images, and learning useful representations of data from compressed observations.

6.1 Invariant Representations

There are various sources that cause variations of a visual scene, such as luminance, shape, pose, etc. While these variations are important to some applications like shape from shading, they are detrimental to many other important applications such as object recognition and detection. Good representations should be sensitive to desirable variations while being invariant to other types of variations. Recall that we have proposed novel approaches such as kernel KSVD and sparse embedding for learning sparse and non-linear representations directly from the data. Effectively, these methods approximate the manifold of possible scenes by using a set of subspaces in a non-linear feature domain. Careful analyses and designs of sparse learning algorithms might make it possible to capture desirable variations while being insensitive to other effects. Future research will focus on extending our algorithms to learn invariant representations directly from the data.

6.2 Learning Transformation Models

The human visual system is excellent not only at recognizing objects but also at predicting their possible movements and deformations. Humans can even make accurate predictions about an entirely novel object that has not been presented to them beforehand. This suggests that human brains might learn a set of universal transformations which can be applied to predict the motions and deformations of data. For example, generic transformations such as translation, rotation and scaling are likely to be embedded somewhere in our visual system. This observation motivates the development of new methods for learning transformation models to relate images.

An example in this research direction is the work by Memisevic et al. [161], who proposed the relational autoencoder model for learning transformations from pairs of images. The work of Wang et al. [162] follows a similar spirit; however, it requires the transformation to have a Lie group structure. An exciting research direction would be investigating what role sparsity plays in learning these transformation models.

6.3 Representations On Compressed Domain

Random projection has been proven to be efficient for data collection and dimensionality reduction. It has many desirable properties. First, if the dimension of the mapping is sufficiently large, on the order of K log(N), where K is the sparsity level and N is the original signal dimension, then we can recover the original signals from the compressed ones with overwhelming probability. In addition, the well-known Johnson-Lindenstrauss lemma implies that geometric structures of data are well-preserved under
While much works have been focused on signal reconstruction with random pro- jection, there is little effort in analysing the performances of and designing algorithms for computer vision tasks such as object detection and recognition in a compressed domain associated with the random projection mapping. Lying at the heart of this problem is the issue of finding representations of data, such as videos and images, in the compressed do- main. Our future works will address this question. This will enable many computer vision tasks such as object recognition and detection to be performed directly on the compressed domain. Our first step toward solving this problem is to examine the effects of different ran- dom and deterministic measurement matrices on the performances of object recognition for USPS and Caltech-101 datasets. In this experiment, we will use the compressed mea- surements of entire images as the primary features. Based on the experimental insights, we would develop algorithms for extracting more compact and discriminative features from the compressed measurements for recognition and detection tasks. 142 Bibliography [1] Serge Belongie, Jitendra Malik, and Jan Puzicha. Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 24:509?522, 2001. [2] Haibin Ling and D.W. Jacobs. Shape classification using the inner-distance. Pat- tern Analysis and Machine Intelligence, IEEE Transactions on, 29(2):286 ?299, feb. 2007. [3] H.V. Nguyen and F. Porikli. Support vector shape: A classifier-based shape repre- sentation. IEEE PAMI (Under Review). [4] A. Johnson and M. Hebert. Using spin images for efficient object recognition in cluttered 3D scenes. IEEE PAMI, 21, 1999. [5] Marcel Kortgen, Marcin Novotni, and Reinhard Klein. 3d shape matching with 3d shape contexts. In In The 7th Central European Seminar on Computer Graphics, 2003. [6] D. Saupe and D. Vranic. 3D model retrieval with spherical harmonics and mo- ments. In DAGM, 2001. [7] Hien Van Nguyen and F. Porikli. Concentric ring signature descriptor for 3d ob- jects. In Image Processing (ICIP), 2011 18th IEEE International Conference on, pages 2893 ?2896, sept. 2011. 143 [8] K. Engan, S. O. Aase, and J. H. Husoy. Method of optimal directions for frame de- sign. Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 5:2443?2446, 1999. [9] M. Aharon, M. Elad, and A. M. Bruckstein. The k-svd: an algorithm for designing of overcomplete dictionaries for sparse representation. IEEE Trans. Signal Pro- cess., 54(11):4311?4322, 2006. [10] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In IEEE CVPR, volume 2, pages 2169 ? 2178, 2006. [11] O. Tuzel, F. M. Porikli, and P. Meer. Region covariance: A fast descriptor for detection and classification. In ECCV, pages II: 589?600, 2006. [12] Zhuolin Jiang, Zhe Lin, and L.S. Davis. Learning a discriminative dictionary for sparse coding via label consistent K-SVD. In CVPR, pages 1697 ?1704, june 2011. [13] Qiang Zhang and Baoxin Li. Discriminative K-SVD for dictionary learning in face recognition. In IEEE CVPR, pages 2691 ?2698, june 2010. [14] Bernhard Schlkopf, Alexander J. Smola, and Klaus R. Mu?ller. Kernel principal component analysis. Advances in kernel methods: support vector learning, pages 327?352, 1999. [15] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, and F. Pereira. A Theory of Learning from Different Domains. 
Machine Learning, 79:151?175, 2010. 144 [16] Charles J Stone. Optimal Global Rates of Convergence for Nonparametric Regres- sion. The Annals of Statistics, pages 1040?1053, 1982. [17] I. Biederman. Human image understanding: Recent research and a theory. Com- puter Vision, Graphics, and Image Processing, 32:29?73, 1985. [18] B. B. Kimia and K. Siddiqi. Parts of visual form: Computational aspects. IEEE Trans. Pattern Analysis and Machine Intelligence, 17(5):544?544, May 1995. [19] J. Todd. Visual perception of 3D shape. TRENDS in Cognitive Sciences, 2004. [20] Z. Kourtzi and N. Kanwisher. Cortical regions involved in perceiving object shape. Journal of Neuroscience, 20:33103318, 2000. [21] S. Murray, B. Olshausen, and D. Woods. Processing shape, motion and threedi- mensional shape-from-motion in the human cortex. Cerebral Cortex, 13?, 2003. [22] David Cohen-Steiner, Pierre Alliez, and Mathieu Desbrun. Variational shape ap- proximation. ACM Transactions on Graphics, 23(3):905?914, August 2004. [23] S. Osher and J. Sethian. Fronts propagating with curvature-dependent speed: Algo- rithms based on Hamilton-Jacobi formulations. Journal of Computational Physics, 79:12?49, 1988. [24] R. Malladi, J. A. Sethian, and B. C. Vemuri. Shape modeling with front prop- agation: A level set approach. IEEE Trans. on Pattern Analysis and Machine Intelligence, 17(2):158?175, 1995. 145 [25] D. Cremers, N. A. Sochen, and C. Schnorr. A multiphase dynamic labeling model for variational recognition-driven image segmentation. International Journal of Computer Vision, 66(1):67?81, January 2006. [26] N. Paragios. A level set approach for shape-driven segmentation and tracking of the left ventricle. Medical Imaging, IEEE Transactions on, 22(6):773 ?776, june 2003. [27] Victor Adrian Prisacariu and Ian Reid. Nonlinear shape manifolds as shape priors in level set segmentation and tracking. IEEE, 2011. [28] F. Bookstein. Principal warps: Thin-plate splines and the decomposition of de- formations. IEEE Transactions of Pattern Analysis and Machine Intelligence, 11(6):567?585, 1989. [29] H. Chui and A. Rangarajan. A new point matching algorithm for non-rigid reg- istration. Computer Vision and Image Understanding, 89(2-3):114?141, February 2003. [30] Haili Chui, Anand Rangarajan, Jie Zhang, and Christiana Morison Leonard. Unsu- pervised learning of an atlas from unlabeled point-sets. IEEE Trans. Pattern Anal. Mach. Intell, 26(2):160?172, 2004. [31] F. Wang, B. C. Vemuri, A. Rangarajan, and S. J. Eisenschenk. Simultaneous non- rigid registration of multiple point sets and atlas construction. IEEE Trans. Pattern Analysis and Machine Intelligence, 30(11):2011?2022, November 2008. 146 [32] L. Gorelick, M. Galun, E. Sharon, R. Basri, and A. Brandt. Shape representation and classification using the poisson equation. IEEE Trans. Pattern Analysis and Machine Intelligence, 28(12):1991?2005, December 2006. [33] Y. Mei and D. Androutsos. Robust affine invariant shape image retrieval using the ICA Zernike moment shape descriptor. ICIP, 2009?. [34] B. Gunsel and M. Tekalp. Shape similarity matching for query by example. Else- vier Journal of Pattern Recognition, 31(7), 1998. [35] S. Abbasi, F. Mokhtarian, and J. Kittler. Curvature scale space image in shape similarity retrieval. ACM Multimedia System, 7, 1999. [36] R. Basri, L. Costa, D. Geiger, and D. Jacobs. Determining the similarity of de- formable shapes. Elsevier Journal on Vision Research, 38, 1998. [37] M. D. Levine and K. N. Wu. 
3D part segmentation using simulated electrical charge distributions. In ICPR, pages I: 14?18, 1996. [38] Lien and Amato. Approximate convex decomposition of polygons. CGTA: Com- putational Geometry: Theory and Applications, 35, 2006. [39] Jyh-Ming Lien and Nancy M. Amato. Approximate convex decomposition of poly- hedra and its applications. Computer Aided Geometric Design, 25(7), 2008. [40] Hairong Liu, Wenyu Liu, and L.J. Latecki. Convex shape decomposition. In Com- puter Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 97 ?104, june 2010. 147 [41] G. Heitz, G. Elidan, B. Packer, and D. Koller. Shape-based object localization for classification. IJCV, 84(1), 2009. [42] A. Berg, T. Berg, and J. Malik. Multi-scale object detection by clustering lines. CVPR, 2005. [43] T. Sebastian, P. Klein, and B. Kimia. Recognition of shapes by editing their shock graphs. PAMI, 26(5), 2004. [44] H. Ling and D. Jacobs. Shape classification using the inner-distance. PAMI, 29, 2007. [45] J. Hua, Z. Lai, M. Dong, X. Gu, and H. Qin. Geodesic distance-weighted shape vector image diffusion. IEEE Transactions on Visualization and Computer Graph- ics, 14, 2008. [46] L. Gorelick, M. Galun, E. Sharon, R. Basri, and A. Brandt. Shape representation and classification using the poisson equation. CVPR, 2004. [47] B. Flach and D. Schlesinger. Modelling composite shapes by gibbs random fields. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 2177 ?2182, june 2011. [48] A. Johnson and M. Hebert. Using spin images for efficient object recognition in cluttered 3D scenes. PAMI, 21(5), 1999. [49] S. Belongie, J. Malik, and J. Puzicha. Shape matching object recognition using shape context. PAMI, 24(24), 2002. 148 [50] A. Frome, D. Huber, R. Kolluri, T. Bulow, and J. Malik. Recognizing objects in range data using regional point descriptors. 2004. [51] S. Biswas, G. Aggarwal, and R. Chellappa. An efficient and robust algorithm for shape indexing and retrieval. IEEE Transactions on Multimedia, 12:372385, 2010. [52] H. Zhang and J. Malik. Learning a discriminative classifier using shape context distances. CVPR, 2003. [53] Susan R. Curtis and Alan V. Oppenheim. Reconstruction of multidimensional sig- nals from zero crossings. J. Opt. Soc. Am. A, 4(1):221?231, 1987. [54] D. Slepian. On bandwidth. Proceedings of the IEEE, 64:292?300, 1976. [55] V. Vapnik. The nature of statistical learning theory. Springer Verlag, New York, 1995. [56] B. Drost, M. Ulrich, N. Navab, and S. Ilic. Model globally, match locally: efficient and robust 3D object recognition. In CVPR, pages 998?1005. IEEE, 2010. [57] B. Scho?lkopf, J.C. Platt, J. Shawe-Taylor, A.J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443?1471, 2001. [58] B. Scho?lkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443?1471, 2001. 149 [59] L. J. Latecki, R. Lakamper, and U. Eckhardt. Shape descriptors for non-rigid shapes with a single closed contour. In CVPR, pages I: 424?429, 2000. [60] T. Joachims. Making large?scale SVM learning practical. In B. Scho?lkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods ? Support Vector Learning, pages 169?184, Cambridge, MA, 1999. MIT Press. [61] Ivor W. Tsang, James T. Kwok, and Pak-Ming Cheung. Core vector machines: Fast SVM training on very large data sets. 
Journal of Machine Learning Research, 6:363?392, 2005. [62] Michele Cossalter, Rong Yan, and Lu Zheng. Adaptive kernel approximation for large-scale Non-Linear SVM prediction. International Conference on Machine Learning, 2011. [63] R. Gopalan, P. Turaga, and R. Chellappa. Articulation-invariant representation of non-planar shapes. ECCV, 2010. [64] X. Yang, S. Koknar-Tezel, and L. Latecki. Locally constrained diffusion process on locally densified distance spaces with applications to shape retrieval. CVPR, 2009. [65] M. Liu, B. C. Vemuri, S.-I. Amari, and F. Nielsen. Total Bregman divergence and its applications to shape retrieval. In CVPR, pages 3463 ?3468, 2010. [66] P. Felzenszwalb and J. Schwartz. Hierarchical matching of deformable shapes. CVPR, 2007. 150 [67] H. Ling and K. Okada. An efficient earth mover?s distance algorithm for robust histogram comparison. PAMI, 29(5):840 ?853, may 2007. [68] G. McNeill and S. Vijayakumar. Hierarchical procrustes matching for shape re- trieval. In CVPR, pages 885 ? 894, 2006. [69] A. M. Peter, A. Rangarajan, and J. Ho. Shape L?ane rouge: Sliding wavelets for indexing and retrieval. In CVPR, pages 1?8, 2008. [70] Z. W. Tu and A. L. Yuille. Shape matching and recognition: Using generative models and informative features. In ECCV, pages Vol III: 195?209, 2004. [71] T.B. Sebastian, P.N. Klein, and B.B. Kimia. On aligning curves. PAMI, 25(1):116 ?125, 2003. [72] F. Mokhtarian, S. Abbasi, and J. Kittler. Efficient and robust retrieval by shape content through curvature scale space, 1996. [73] L. Lin, X. Liu, and S.-C. Zhu. Layered graph matching with composite cluster sampling. PAMI, 32(8):1426 ?1442, 2010. [74] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91?110, November 2004. [75] A. Mian, M. Bennamoun, and R. Owens. On the repeatability and quality of key- points for local feature-based 3D object retrieval from cluttered scenes. Interna- tional Journal of Computer Vision, 89(2-3):348?361, 2010. 151 [76] R. Rubinstein, A. M. Bruckstein, and M. Elad. Dictionaries for sparse representa- tion modeling. Proceedings of the IEEE, submitted 2009. [77] J. Wright, Yi Ma, J. Mairal, G. Sapiro, T.S. Huang, and Shuicheng Yan. Sparse representation for computer vision and pattern recognition. Proceedings of the IEEE, 98(6):1031 ?1044, june 2010. [78] M. Elad, M.A.T. Figueiredo, and Yi Ma. On the role of sparse and redundant representations in image processing. Proceedings of the IEEE, 98(6):972 ?982, june 2010. [79] Alfred M. Bruckstein, David L. Donoho, and Michael Elad. From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Review, 51(1):34 ?81, 2010. [80] S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM J. Sci. Comp., 20(1):33?61, 1998. [81] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad. Orthogonal matching pursuit: re- cursive function approximation with applications to wavelet decomposition. 1993 Conference Record of the 27th Asilomar Conference on Signals, Systems and Com- puters, pages 40?44, 1993. [82] J. A. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Trans. Info. Theory, 50(10):2231?2242, Oct. 2004. 152 [83] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field prop- erties by learning a sparse code for natural images. Nature, 381(6583):607?609, June 1996. [84] Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. 
[85] J. Mairal, F. Bach, and J. Ponce. Task-driven dictionary learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4):791–804, April 2012.
[86] R. Rubinstein, M. Zibulevsky, and M. Elad. Double sparsity: Learning sparse dictionaries for sparse signal approximation. IEEE Transactions on Signal Processing, 58(3):1553–1564, March 2010.
[87] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[88] R. Courant and D. Hilbert. Methods of Mathematical Physics. Interscience, 1953.
[89] USPS handwritten digit database. http://www-i6.informatik.rwth-aachen.de/~keysers/usps.html.
[90] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE Trans. Pattern Analysis and Machine Intelligence, 31(2):210–227, Feb. 2009.
[91] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
[92] James Tin-Yau Kwok and Ivor Wai-Hung Tsang. The pre-image problem in kernel methods. IEEE Transactions on Neural Networks, 15(6):1517–1525, 2004.
[93] D. Keysers, J. Dahmen, T. Theiner, and H. Ney. Experiments with an extended tangent distance. In ICPR, pages Vol II: 38–42, 2000.
[94] Patrice Simard, Yann LeCun, John S. Denker, and Bernard Victorri. Transformation invariance in pattern recognition: Tangent distance and tangent propagation. In Neural Networks: Tricks of the Trade, pages 239–274, London, UK, 1998. Springer-Verlag.
[95] Bernhard Schölkopf, Patrice Simard, Alex J. Smola, and Vladimir Vapnik. Prior knowledge in support vector kernels. The MIT Press, 1997.
[96] Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro, and Andrew Zisserman. Supervised dictionary learning. CoRR, abs/0809.3083, 2008.
[97] P. Perona, R. Fergus, and F. F. Li. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In Workshop on Generative Model Based Vision, page 178, 2004.
[98] Peter Gehler and Sebastian Nowozin. On feature combination for multiclass object classification. In IEEE 12th International Conference on Computer Vision (ICCV), pages 221–228, 2009.
[99] Anna Bosch, Andrew Zisserman, and Xavier Munoz. Representing shape with a spatial pyramid kernel. In Proceedings of the 6th ACM International Conference on Image and Video Retrieval, CIVR '07, pages 401–408, New York, NY, USA, 2007. ACM.
[100] Timo Ojala, Matti Pietikäinen, and Topi Mäenpää. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):971–987, 2002.
[101] Nicolas Pinto, David D. Cox, and James J. DiCarlo. Why is real-world visual object recognition hard? PLoS Computational Biology, 2008.
[102] P. Jain, B. Kulis, and K. Grauman. Fast image search for learned metrics. In IEEE CVPR, pages 1–8, June 2008.
[103] Hao Zhang, A. C. Berg, M. Maire, and J. Malik. SVM-KNN: Discriminative nearest neighbor classification for visual category recognition. In IEEE CVPR, 2006.
[104] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical Report, 2007.
[105] O. Boiman, E. Shechtman, and M. Irani. In defense of nearest-neighbor based image classification. In IEEE CVPR, pages 1–8, June 2008.
[106] Duc-Son Pham and S. Venkatesh. Joint learning and dictionary construction for pattern recognition. In IEEE CVPR, pages 1–8, June 2008.
[107] J. C. van Gemert, J. M. Geusebroek, C. J. Veenman, and A. W. M. Smeulders. Kernel codebooks for scene categorization. In ECCV, pages III: 696–709, 2008.
[108] Jianchao Yang, Kai Yu, Yihong Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In IEEE CVPR, pages 1794–1801, June 2009.
[109] Jinjun Wang, Jianchao Yang, Kai Yu, Fengjun Lv, T. Huang, and Yihong Gong. Locality-constrained linear coding for image classification. In IEEE CVPR, pages 3360–3367, June 2010.
[110] N. Kulkarni and Baoxin Li. Discriminative affine sparse codes for image classification. In IEEE CVPR, pages 1609–1616, June 2011.
[111] C. J. Stone. Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, 10:1040–1053, 1982.
[112] Beyer, Goldstein, Ramakrishnan, and Shaft. When is "nearest neighbor" meaningful? In ICDT: 7th International Conference on Database Theory, 1999.
[113] J. A. Lee and M. Verleysen. Nonlinear Dimensionality Reduction. Information Science and Statistics, Springer, 2006.
[114] K. Pearson. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2:559–572, 1901.
[115] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, December 2000.
[116] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, December 2000.
[117] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
[118] M. Elad. Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer, New York, 2010.
[119] M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans. Image Processing, 15(12):3736–3745, December 2006.
[120] I. Gkioulekas and T. Zickler. Dimensionality reduction using the sparse linear model. In Advances in Neural Information Processing Systems (NIPS), 2011.
[121] Hanchao Qi and Shannon Hughes. Using the kernel trick in compressive sensing: Accurate signal recovery from fewer measurements. In IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pages 3940–3943, May 2011.
[122] H. V. Nguyen, V. M. Patel, N. M. Nasrabadi, and R. Chellappa. Kernel dictionary learning. IEEE Int. Conference on Acoustics, Speech and Signal Processing, 2012.
[123] Julien Mairal, Francis Bach, and Jean Ponce. Task-driven dictionary learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4):791–804, April 2012.
[124] K. Engan, S. O. Aase, and J. H. Husoy. Multi-frame compression: Theory and design. Signal Processing, 80(10):2121–2140, October 2000.
[125] S. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
[126] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In Proc. ECCV, pages 213–226, 2010.
[127] B. Kulis, K. Saenko, and T. Darrell. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In Proc. CVPR, pages 1785–1792, 2011.
[128] R. Gopalan, R. Li, and R. Chellappa. Domain adaptation for object recognition: An unsupervised approach. In Proc. ICCV, pages 999–1006, 2011.
[129] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In Proc. CVPR, pages 2066–2073, 2012.
[130] I.-H. Jhuo, D. Liu, D. T. Lee, and S.-F. Chang. Robust visual domain adaptation with low-rank reconstruction. In Proc. CVPR, pages 2168–2175, 2012.
[131] G. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[132] H. Lee, R. Grosse, R. Ranganath, and A. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proc. ICML, 2009.
[133] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
[134] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features. Computer Vision and Image Understanding, 110:346–359, 2008.
[135] C. Leggetter and P. Woodland. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech and Language, 9(2):171–185, 1995.
[136] H. Daumé III and D. Marcu. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26(1):101–126, 2006.
[137] J. Zheng, M.-Y. Liu, R. Chellappa, and J. Phillips. A Grassmann manifold-based domain adaptation approach. In Proc. ICPR, pages 2095–2099, 2012.
[138] Qiang Qiu, Vishal M. Patel, Pavan Turaga, and Rama Chellappa. Domain adaptive dictionary learning. In Proc. ECCV, pages 631–645, 2012.
[139] S. Shekhar, V. Patel, H. Nguyen, and R. Chellappa. Generalized domain-adaptive dictionaries. In Proc. CVPR, 2013.
[140] L. Duan, I. W. Tsang, D. Xu, and T.-S. Chua. Domain adaptation from multiple sources via auxiliary classifiers. In Proc. ICML, 2009.
[141] L. Duan, I. W. Tsang, D. Xu, and T.-S. Chua. Domain transfer multiple kernel learning. IEEE Trans. PAMI, 34(3):465–479, 2012.
[142] Jenny Rose Finkel and Christopher D. Manning. Hierarchical Bayesian domain adaptation. In Proc. NAACL, pages 602–610, 2009.
[143] B. Manjunath and R. Chellappa. A unified approach to boundary perception: Edges, textures, and illusory contours. IEEE Trans. Neural Networks, 4(1):96–108, 1993.
[144] P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proc. ICML, 2008.
[145] K. Yu, Y. Lin, and J. Lafferty. Learning image representations from the pixel level via hierarchical sparse coding. In Proc. CVPR, pages 1713–1720, 2011.
[146] L. Bo, X. Ren, and D. Fox. Hierarchical matching pursuit for image classification: Architecture and fast algorithms. In Proc. NIPS, pages 2115–2123, 2011.
[147] L. Bo, X. Ren, and D. Fox. Multipath sparse coding using hierarchical matching pursuit. In Proc. CVPR, 2013.
[148] M. Elad. Sparse and Redundant Representations. Springer, 2010.
[149] Julien Mairal, Francis Bach, and Jean Ponce. Task-driven dictionary learning. IEEE Trans. PAMI, 34(4):791–804, 2012.
[150] Hien Nguyen, Vishal Patel, Nasser Nasrabadi, and Rama Chellappa. Sparse embedding: A framework for sparsity promoting dimensionality reduction. In Proc. ECCV, pages 414–427, 2012.
[151] M. Ruhnke, L. Bo, D. Fox, and W. Burgard. Compact RGBD surface models based on sparse coding. In Proc. AAAI, 2013.
[152] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In Proc. ICML, pages 689–696, 2009.
[153] Y.-L. Boureau, J. Ponce, and Y. LeCun. A theoretical analysis of feature pooling in visual recognition. In Proc. ICML, 2010.
[154] Z. Wen and W. Yin. A feasible method for optimization with orthogonality constraints. Mathematical Programming, pages 1–38, 2013.
[155] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical Report 7694, Caltech, 2007.
[156] V. Monga, N. Damera-Venkata, H. Rehman, and B. Evans. Halftoning MATLAB Toolbox. http://users.ece.utexas.edu/~bevans/projects/halftoning/toolbox/, 2005.
[157] J. Canny. A computational approach to edge detection. IEEE Trans. PAMI, 8(6):679–698, 1986.
[158] V. N. Vapnik. Statistical Learning Theory. John Wiley, 1998.
[159] J. Yang, R. Yan, and A. Hauptmann. Cross-domain video concept detection using adaptive SVMs. In Proc. ACM Multimedia, 2007.
[160] M. Yang, L. Zhang, X. Feng, and D. Zhang. Fisher discrimination dictionary learning for sparse representation. In Proc. CVPR, pages 543–550, 2011.
[161] Roland Memisevic. Gradient-based learning of higher-order image features. In IEEE International Conference on Computer Vision (ICCV), pages 1591–1598, 2011.
[162] Ching Ming Wang, Jascha Sohl-Dickstein, Ivana Tosic, and Bruno A. Olshausen. Lie group transformation models for predictive video coding. In Data Compression Conference (DCC), pages 83–92, 2011.