ABSTRACT Title of dissertation: STATISTICAL MODELS AND OPTIMIZATION ALGORITHMS FOR HIGH-DIMENSIONAL COMPUTER VISION PROBLEMS Kaushik Mitra, Doctor of Philosophy, 2011 Dissertation directed by: Professor Rama Chellappa Department of Electrical and Computer Engineering Data-driven and computational approaches are showing significant promise in solving several challenging problems in various fields such as bioinformatics, finance and many branches of engineering. In this dissertation, we explore the potential of these approaches, specifically statistical data models and optimization algorithms, for solving several challenging problems in computer vision. In doing so, we con- tribute to the literatures of both statistical data models and computer vision. In the context of statistical data models, we propose principled approaches for solving robust regression problems, both linear and kernel, and missing data matrix factor- ization problem. In computer vision, we propose statistically optimal and efficient algorithms for solving the remote face recognition and structure from motion (SfM) problems. The goal of robust regression is to estimate the functional relation between two variables from a given data set which might be contaminated with outliers. Under the reasonable assumption that there are fewer outliers than inliers in a dataset, we formulate the robust linear regression problem as a sparse learning problem, which can be solved using efficient polynomial-time algorithms. We also provide sufficient conditions under which the proposed algorithms correctly solve the robust regression problem. We then extend our robust formulation to the case of kernel regression, specifically to propose a robust version for relevance vector machine (RVM) regression. Matrix factorization is used for finding a low-dimensional representation for data embedded in a high-dimensional space. Singular value decomposition is the standard algorithm for solving this problem. However, when the matrix has many missing elements this is a hard problem to solve. We formulate the missing data matrix factorization problem as a low-rank semidefinite programming problem (es- sentially a rank constrained SDP), which allows us to find accurate and efficient solutions for large-scale factorization problems. Face recognition from remotely acquired images is a challenging problem be- cause of variations due to blur and illumination. Using the convolution model for blur, we show that the set of all images obtained by blurring a given image forms a convex set. We then use convex optimization techniques to find the distances be- tween a given blurred (probe) image and the gallery images to find the best match. Further, using a low-dimensional linear subspace model for illumination variations, we extend our theory in a similar fashion to recognize blurred and poorly illuminated faces. Bundle adjustment is the final optimization step of the SfM problem where the goal is to obtain the 3-D structure of the observed scene and the camera parameters from multiple images of the scene. The traditional bundle adjustment algorithm, based on minimizing the l2 norm of the image re-projection error, has cubic com- plexity in the number of unknowns. We propose an algorithm, based on minimizing the l? norm of the re-projection error, that has quadratic complexity in the number of unknowns. This is achieved by reducing the large-scale optimization problem into many small scale sub-problems each of which can be solved using second-order cone programming. STATISTICAL MODELS AND OPTIMIZATION ALGORITHMS FOR HIGH-DIMENSIONAL COMPUTER VISION PROBLEMS by Kaushik Mitra Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2011 Advisory Committee: Professor Rama Chellappa, Chair/Advisor Professor Andre L. Tits Professor Ramani Duraiswami Professor David Jacobs, Dean?s Representative Professor Ashok Veeraraghavan c? Copyright by Kaushik Mitra 2011 Dedication To One and All ii Acknowledgments I owe my gratitude to all the people who have made this dissertation possible and because of whom my graduate experience has been one that I will cherish forever. First and foremost I?d like to thank my advisor Professor Rama Chellappa for giving me the wonderful opportunity to work with him. Under his able guidance, I had the opportunity to work on many challenging and interesting problems. He has also given me the freedom to explore many new topics and ideas on my own, which has given me the confidence to work as an individual researcher. He has always made himself available for help and advice and I am very grateful to him for that. It has been a great pleasure to work with and learn from such an extraordinary individual. I would also like to thank Professor Andre L. Tits, Professor Ramani Du- raiswami, Professor David Jacobs and Professor Ashok Veeraraghavan for agreeing to serve on my thesis committee and for sparing their invaluable time reviewing the manuscript. Professor Ashok Veeraraghavan has also been my mentor and collab- orator and I would like to thank him for his guidance. During my dissertation, I had the opportunity of working closely with some wonderful people - Dr. Sameer Sheorey, who is also a mentor to me, Priyanka Vageeswaran and Jun-Cheng Chen, who were my juniors and their energy and enthusiasm invigorated me. I would like like to thank many professors in the University of Maryland, from whom I have taken courses - (Electrical and Computer Engineering:) Professor iii Anthony Ephremides, Professor Haralabos (Babis) Papadopoulos, Professor Andre L. Tits, Professor Rama Chellappa, Professor Min Wu, Professor. Steve Marcus, Professor Sennur Ulukus, (Mathematics and Statistics:) Professor Wojciech Czaja, Professor Mike Fitzpatrick, Professor Grace Yang, Professor Serguei Novikov, Pro- fessor Karin Melnick, Professor Richardo Nochetto and Professor Dionisios Margetis. The courses were very enjoyable and formed the foundations for my research work. I would also take this opportunity to thank all my teachers from Indian Institute of Science (Bangalore, India), Institute of Radio Physics and Electronics (Kolkata, India), Presidency College (Kolkata, India) and Kendriya Vidyalaya (Itarsi, India). I am specially grateful to my undergraduate (IRPE) advisor Professor Subal Kar and my graduate (IISc) advisor Professor Anurag Kumar for their able guidance, support and encouragement. My colleagues at the computer vision laboratory have enriched my graduate life in many ways and deserve a special mention - Ashish Srivastava, Aravind Sun- daresan, Ashok Veeraraghavan, Narayanan Ramanathan, Gaurav Aggarwal, Seong- Wook Joo, Naresh Cuntoor, Feng Guo, Arun Mohanchettiar, Pavan Turaga, Soma Biswas, James Sherman, Aswin Sankaranarayanan, Mahesh Ramachandran, Wu Hao, Volkan Cevher, Vishal Patel, Ruonan Li, Dikpal Reddy, Nitesh Shroff, Raghu- raman Gopalan, Sima Taheri, Nazre Batool, Priyanka Vageeswaran, Jun-Cheng Chen, Sumit Shekhar, Jaishankar Pillai, Jie Ni, Hien Nguyen, Huy Tho Ho, Garrett Warnell, Qiang Qiu, Dave Shaw, Mohammed Abdelkader, Ming Du, Ming Liu, Wu Tao. Apart from my laboratory colleagues, I would also like to thank my other Uni- versity colleagues and friends - Anne Jorstad, who was my wonderful office mate; iv Ajay Mishra, with whom I had many philosophical discussions on a wide variety of topics; Punyaslok Purkayastha, whose knowledge of maths, physics and literature in- spires me; and many other friends and colleagues - Indroda (Indrajit Bhattacharya), Sandeep Manocha and Sameer Kibey. I would like to acknowledge the help and support of several staff members who made my graduate life smooth - Janice Perrone (CfAR), Maria Hoo (ECE), Arlene Schenk (UMIACS) and the UMIACS computing staff. I owe my deepest thanks to my family - my mother, father and sister, who have always stood by me and guided me through my career. Words cannot express the gratitude I owe them. I am also grateful to my grandparents, uncles, aunties and cousins, who made this experience enjoyable. I am thankful to my house mates and friends for making my stay in Mary- land a wonderful experience - Bhuwan Thapa, Purushottam Shetty, Bhaskar Dutta, Basudev Roy, Sandeep Haldar, Mainak Sen, Senthilkumar Palaniyandi, Sanjeev Howlader, Muhammad S. Noon, Mili Duggal, Kanthi Sarpatwar, Jagadeesh Jagar- lamudi, Nitin Madnani, Ashish Markanday, Padmaja, Punarbasu Purkayastha, Arya Mazumdar, Barna Saha, Biswadip Dey, Arijit Biswas, Saptashati Tania Biswas, and Rajibul Kaji. It is impossible to remember all, and I apologize to those I?ve inadvertently left out. Lastly, thank you all and thank God! v Table of Contents List of Figures viii 1 Introduction 1 2 Robust Linear Regression Using Sparse Learning for High-Dimensional Ap- plications 14 2.1 Robust Regression Based on Basis Pursuit (BPRR) . . . . . . . . . . 18 2.2 Proof of the Main Theorem 2.1.1 . . . . . . . . . . . . . . . . . . . . 24 2.3 A Bayesian Approach: Bayesian Robust Regression (BRR) . . . . . . 29 2.4 Theoretical and Empirical studies of the Parameter space of Robust Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.5 Age Estimation From Face Images . . . . . . . . . . . . . . . . . . . . 36 3 Robust RVM Regression Using Sparse Outlier Model 44 3.1 Robust RVM Regression . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.1.1 Model Specification . . . . . . . . . . . . . . . . . . . . . . . . 47 3.1.2 Robust Bayesian RVM (RB-RVM) . . . . . . . . . . . . . . . 48 3.1.2.1 Inference . . . . . . . . . . . . . . . . . . . . . . . . 49 3.1.2.2 Prediction . . . . . . . . . . . . . . . . . . . . . . . . 51 3.1.2.3 Advantage over other Robust RVM Algorithms . . . 51 3.1.3 Basis Pursuit RVM (BP-RVM) . . . . . . . . . . . . . . . . . 52 3.2 Empirical Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.3 Robust Image Denoising . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.4 Age Estimation from Facial Images . . . . . . . . . . . . . . . . . . . 59 4 Large-Scale Matrix Factorization with Missing Data under Additional Con- straints 64 4.1 Background: Low-rank semidefinite programming (LRSDP) . . . . . 67 4.2 Matrix factorization using LRSDP (MF-LRSDP) . . . . . . . . . . . 68 4.2.1 Noiseless Case . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 4.2.2 Noisy case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.2.3 Enforcing Additional Constraints . . . . . . . . . . . . . . . . 72 4.3 Matrix Completion, Uniqueness and Convergence of MF-LRSDP . . . 73 4.3.1 Matrix Completion Theory . . . . . . . . . . . . . . . . . . . . 74 4.3.2 Relation with Matrix Factorization and its Implications . . . . 74 4.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 75 4.4.1 Evaluation with Synthetic Data . . . . . . . . . . . . . . . . . 75 4.4.2 Evaluation with Real Data . . . . . . . . . . . . . . . . . . . . 78 5 Direct Recognition of Faces across Blur and Illumination Variations 82 5.1 Face Recognition Across Blur (FRB) . . . . . . . . . . . . . . . . . . 84 5.2 Incorporating the Illumination Model . . . . . . . . . . . . . . . . . . 89 5.3 Experimental Evaluations . . . . . . . . . . . . . . . . . . . . . . . . 92 vi 5.3.1 Recognition across Blur . . . . . . . . . . . . . . . . . . . . . 93 5.3.2 Incorporating Illumination Model . . . . . . . . . . . . . . . . 99 6 A Scalable Projective Bundle Adjustment Algorithm using the l? Norm 107 6.1 Background: geometric reconstruction problems using L? norm . . . 111 6.1.1 Triangulation/Intersection . . . . . . . . . . . . . . . . . . . . 111 6.1.2 Resection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 6.2 The l? projective bundle adjustment algorithm . . . . . . . . . . . . 113 6.2.1 Decoupling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 6.2.2 Cheirality and quasi-affine initialization . . . . . . . . . . . . . 115 6.3 Computational complexity and memory requirement . . . . . . . . . . 118 6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 6.4.1 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 6.4.2 Computational scalability . . . . . . . . . . . . . . . . . . . . 123 6.4.3 Behavior with noise . . . . . . . . . . . . . . . . . . . . . . . . 124 7 Conclusion and Future Directions 127 7.1 Robust Linear Regression Using Sparse Learning for High-Dimensional Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 7.2 Robust RVM Regression Using Sparse Outlier Model . . . . . . . . . 128 7.3 Sparse Regularization for Regression and Classification on Manifolds . 129 7.4 Large-Scale Matrix Factorization with Missing Data under Additional Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 7.5 Direct Recognition of Faces across Blur and Illumination Variations . 131 7.6 Hierarchical Dictionary for Face and Activity Recognition: . . . . . . 132 7.7 A Scalable Projective Bundle Adjustment Algorithm using the L? Norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 Bibliography 133 vii List of Figures 1.1 Many computer vision problems are solved using the following frame- work: first extract relevant features from images/video and then use statistical data models such as regression, classification, matrix fac- torization, etc. to find pattern/structure in the data. . . . . . . . . . 2 1.2 Generally computer vision data have the following characteristics: they are high-dimensional, outliers are present in the data set and some elements of the data are missing. . . . . . . . . . . . . . . . . . 3 1.3 Outliers occur frequently in computer vision data set. For example , in finding lines in an image, points belonging to one line are outliers for the other lines. (Image courtesy OpenCV 2.0 C Reference) . . . . 4 1.4 Missing data problem arises frequently in Structure from Motion (SfM) problem. In SfM, feature points are tracked through all the images, but since not all the features are visible in all the images, this gives rise to the missing data problem. We will see later that completing the missing tracks solves the SfM problem. . . . . . . . . 5 1.5 Robust linear regression: The popular linear regression techique ?least squares? is very sensitive to outliers. Random Sample Consen- sus (RANSAC), a robust algorithm, is mostly used for solving low- dimensional vision problems. However, it is a combinatorial algorithm and hence can not be used for solving high-dimensional problems. We propose robust polynomial time algorithms and analyze their perfor- mances. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.6 Robust RVM regression: RVM regression is a kernel regression technique, which has been used for solving many problems such as age and pose estimation. However, it is very susecptible to outliers as can be seen here. We propose two robust versions of RVM. . . . . 7 1.7 Missing data matrix factorization: We encounter missing data (missing tracks) in the SfM problem. We can solve the SfM problem (complete the missing tracks) by solving a missing data matrix fac- torization problem. We propose a large-scale factorization algorithm that can handle large amounts of missing data. . . . . . . . . . . . . . 8 1.8 Remote Face Recognition: Face recognition from remotely ac- quired images is a challenging problem because of variations due to blur, illumination, pose and occlusions. We address the problem of recognizing blurred and poorly illuminated faces by using the gener- ative models for blur and illumination variations. . . . . . . . . . . . 11 1.9 Scalable Bundle Adjustment: Bundle adjustment is the final op- timization step of the SfM problem, where the structure and camera parameters are refined starting from an initial reconstruction. We propose an efficient bundle adjustment algorithm based on minimiz- ing the l?-norm of reprojection error. (Image courtesy Dr. Noah Snavely) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 viii 2.1 Mean estimation error vs. outlier fraction for dimension 2, 6 and 25 respectively. Only BPRR, BRR and M-estimator are shown for dimension 25 as the other algorithms very slow. BRR performs very well for all the dimensions; the other algorithms are comparable with each other. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 2.2 Recovery rate, i.e. the fraction of successful recovery, vs. outlier fraction for dimensions 2 and 50 for algorithms BPRR, BRR and M-estimators; we do have plots for LMedS and RANSAC as these algorithms are very slow. From the figure we can conclude that each of the algorithms exhibit a sharp transtion from success to failure at a certain outlier fraction. . . . . . . . . . . . . . . . . . . . . . . . . . 39 2.3 Phase transition curves of the algorithms BPRR, BRR and M-estimator. BRR gives the best performance followed by BPRR and M-estimator. 40 2.4 Mean angle error vs. inlier noise standard deviation for dimension 6 and 0.4 outlier fraction. All algorithms, except LMedS, perform well. 41 2.5 Some outlier and inliers found by BRR. Most of the outliers were images of older subjects. This could be because a linear (regression) model may not be sufficient to capture the relation between age and facial geometry for all age groups. Since, the majority of the images in the dataset are of young subjects, the older subjects become outliers with respect to them. . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 2.6 Mean absolute error (MAE) of age estimation Vs outlier fraction. BRR has almost constant MAE until outlier fraction increases beyond 0.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.1 Prediction by the three algorithms: RVM, RB-RVM and BP-RVM in the presence of symmetric outliers for N = 100, f = 0.2 and ? = 0.1. Data which are enclosed by a box are the outliers found by the robust algorithms. Prediction error are also shown in the figures. RB-RVM gives the lowest prediction error. . . . . . . 53 3.2 Prediction by the three algorithms: RVM, RB-RVM and BP-RVM in the presence of asymmetric outliers for N = 100, f = 0.2 and ? = 0.1. Data which are enclosed by a box are the outliers found by the robust algorithms. Prediction error are also shown in the figures. Clearly, RB-RVM gives the best result. . . . . . . . . 54 3.3 Prediction error vs. outlier fraction for the symmetric and asymmetric outlier cases. RB-RVM gives the best result for both the cases. For the symmetric case, BP-RVM gives lower prediction error than RVM but for the asymmetric case they give similar result. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.4 Prediction error vs. inlier noise standard deviation for the symmetric and asym- metric outlier cases. RB-RVM gives the lowest prediction error until about ? = 0.2, after which RVM gives better result. This is because for our experi- mental setup, at approximately ? = 0.3, the distinction between the inliers and outliers cease to exist. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.5 Prediction error vs. number of data points for the symmetric and asymmetric outlier cases. For all the three algorithms, performance improves with increasing N . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 ix 3.6 Results on Salt and pepper noise removal: first column: RVM, second column: RB-RVM, third column: Median filter, fourth column: Gaussian filter. The RMSE values are also shown in the figure; RB-RVM gives the best result. . . . . 59 3.7 Mean RMSE value over seven images vs. percentage of salt and pepper noise. RB-RVM gives better performance than the median filter. . . . . . . . . . . . 60 3.8 Mixture of Gaussian and salt and pepper noise removal experiment: denoised images by RVM and RB-RVM with their corresponding RMSE values. This experiment again shows that the RB-RVM based denoising algorithm gives much better result than the RVM based one. . . . . . . . . . . . . . . . . . . . . 61 3.9 Some inliers and outliers found by RB-RVM. Most of the outliers are images of older subjects like Outlier A and B. This is because there are less number of samples of older subjects in the FG-Net database. Outlier C has an extreme pose variation from the usual frontal faces of the database; hence, it is an outlier. The facial geometry of Outlier D is very similar to that of younger subjects, such as big forehead and small chin, so it is classified as an outlier. . . . . . . . . . . . 62 3.10 Mean absolute error (MAE) of age prediction vs. fraction of controlled outliers added to the training dataset. RB-RVM gives much lower prediction error as compared to the RVM. Also, note that the prediction error is reasonable even with outlier fraction as high as 0.7. . . . . . . . . . . . . . . . . . . . . . . 63 4.1 (a) Reconstruction rate vs. fraction of revealed entries per column |?|/n for 500 ? 500 matrices of rank 5 by MF-LRSDP, alternation and OptSpace. The proposed algorithm MF-LRSDP gives the best reconstruction results since it can reconstruct matrices with fewer observed entries. (b) Time taken for reconstruc- tion by different algorithms. MF-LRSDP takes the least time. . . . . . . . . . 77 4.2 (a) Reconstruction rate vs. fraction of revealed entries per column |?|/n for rank 5 square matrices of different sizes n by MF-LRSDP and OptSpace. MF-LRSDP reconstructs matrices from fewer observed entries than OptSpace. (b) Recon- struction rate vs. |?|/n for 500? 500 matrices of different ranks by MF-LRSDP and OptSpace. Again MF-LRSDP needs fewer observations than OptSpace. (c) RMSE vs. noise standard deviation for rank 5, 200?200 matrices by MF-LRSDP, OptSpace, alternation and damped Newton. All algorithms perform equally well. 78 4.3 Cumulative histogram (of 25 trials) for the Dinosaur, Giraffe and the Face se- quence. For all of them, MF-LRSDP consistently gives good results. . . . . . . 80 4.4 (a) Input (incomplete) point tracks of the Dinosaur turntable sequence, (b) recon- structed tracks without orthonormality constraints and (c) reconstructed tracks with orthonormality contraints. Without the constraints many tracks fail to be circular, whereas with the constraints all of them are circular (the dinosaur se- quence is a turntable sequence and the tracks are supposed to be circular). . . . 81 5.1 Sample probe images from FERET dataset. The probe images are synthetically blurred with Gaussian filters of ? = 0, 2, 4, 6, 8 respec- tively. ? = 0 stands for ?no blur?. . . . . . . . . . . . . . . . . . . . . 93 x 5.2 a) Recognition by our proposed algorithm FRB and b) by FADEIN, LPQ and FADEIN+LPQ (figure courtesy [70]) on the FERET dataset. FRB is better than FADEIN and LPQ. FRB is comparable with FADEIN+LPQ for small values of ?, but outperms it for large values. 95 5.3 The effect of kernel size on the performance of our algorithm FRB. The probe images are blurred by a Gaussian kernel of ? = 4. From these curves we conclude the following:1) FRB is not very sensitive to the choice of kernel-size and 2) the imposition of symmtery constraints further relaxes the need for accurate choice of kernel-size. . . . . . . . 103 5.4 For the experiment on REMOTE dataset, we have divided the probe images into four categories: a) sharp and well-illuminated images, b) sharp and poorly-illuminated images, c) blurred and well-illuminated images and d) blurred and poorly-illuminated images. These images were acquired at distances between 5? 250 meters. . . . . . . . . . . 104 5.5 The nine illumination basis images of an individual in the PIE dataset. These basis images are used in the FRBI algorithm to model illumi- nation variations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 5.6 The nine illumination basis images of an individual in the REMOTE dataset. These basis images are used in the FRBI algorithm to model illumination variations. The nine illumination positions from which the basis images are created has been optimized for this dataset. . . . 106 6.1 l? BA Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 6.2 l? reprojection error versus iteration for the four algorithms on the data sets: sphere, corridor, hotel and dinosaur. l? error decreases monotonically for l? BA but not so for the other algorithms. . . . . . 122 6.3 RMS reprojection error versus iteration : RMS error decreases mono- tonically for l? BA and l2 BA but not so for WIE and IE. IE fails to converge for the dinosaur data set. . . . . . . . . . . . . . . . . . . . 123 6.4 3-D reconstruction result for the datasets, Sphere and Corridor. The Red ?*? represents the camera center and Blue ?o? represents the struc- ture point. The first column shows the initialization, second col- umn shows the final reconstruction and the third column shows the groundtruth. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 6.5 Total convergence time of l2 BA and l? BA as the number of cameras is varied with number of points fixed at 500. . . . . . . . . . . . . . . 125 6.6 Behavior of l? BA and l2 BA with image feature noise for the sphere data set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 xi Chapter 1 Introduction In recent times data-driven approaches are being used to solve many challeng- ing problems in areas such as bioinformatics, finance and many other engineering sciences. The main reason behind this trend is that some problems are very difficult to model. Modeling difficulty arises because it is not clear what the factors involved in the problem are or how they interact with each other. For example, in bioin- formatics one would like to know which genes are responsible for which diseases. Modeling here would mean knowing the functionalities of each gene and how they interact each other. This is, no doubt, a very challenging problem given the fact that there are 20, 000 ? 25, 000 genes in a human cell. Similar situations arise in many other areas, such as predicting financial markets, weather patterns and so on. Statistical data models are very useful in these situations; it is convenient to collect several data or examples and use it to learn the parameters of an appropriate sta- tistical model. For example, in the gene-disease problem, one can collect data of the type ?active genes in a patients suffering from a certain disease?, and one can then use statistical tools such as missing-data matrix factorization to find the underlying relation. Along with the popularity of statistical data models, the need for efficient computational algorithms is also increasing. As the statistical models become more sophisticated and datasets become larger, there is definitely a need for more efficient 1 optimization algorithms. Figure 1.1: Many computer vision problems are solved using the following frame- work: first extract relevant features from images/video and then use statistical data models such as regression, classification, matrix factorization, etc. to find pattern/structure in the data. The success of statistical data models and optimization algorithms in other areas motivates us to look for appropriate statistical data models and algorithms in the context of computer vision. There are many problems in computer vision which are difficult to model, such as visual representation of objects and scenes, facial age progression, etc.. If we take the example of facial age progression, there are many factors that play a role, such as bone growth, loss in elasticity of facial muscles, facial fat atrophy, ethnicity, gender, dietary habits, climatic conditions, etc. and it is not easy to model them. Hence, the prevalent approach for solving this problem is to extract relevant features from the face images and use statistical model such 2 Figure 1.2: Generally computer vision data have the following characteristics: they are high-dimensional, outliers are present in the data set and some elements of the data are missing. as regression to learn the relation between the extracted features and age [78]. This approach, in general, is used for solving many other vision problems, where the first step involves extracting relevant features from images/video, followed by using statistical data models such as regression, classification, matrix factorization, etc. for finding pattern/structure in the data, see figure 1.1. The need for computationally efficient algorithms has always been felt in vision, the main reason being images, when treated as a vector, are points in very high-dimensional spaces. Our goal in this dissertation is to design statistical data models and optimiza- tion algorithms which can be used for solving many vision problems. Towards this goal, we first list the common characteristics of many computer vision data (see figure 1.2): ? Most of the computer vision data are high-dimensional. This becomes clear from the fact that even a small (black and white) image of size 100? 100 is a point in R10,000. Also, the recent trend towards concatenating many different 3 Figure 1.3: Outliers occur frequently in computer vision data set. For example , in finding lines in an image, points belonging to one line are outliers for the other lines. (Image courtesy OpenCV 2.0 C Reference) features such as ?histogram of oriented gradients? (HOG) [29], ?scale-invariant feature transform? (SIFT) [59], ?histogram of Gabor phase patterns? (HGPP) [109], etc. as a big feature vector results in very high-dimensional data. Hence, statistical models and algorithms that we design should be able to handle high- dimensional data. ? Outliers (data that deviates from a model by a large extent) occur very fre- quently in computer vision data sets, see figure 1.3. The main reasons for this are: the presence of multiple models in images/videos and variations in visual data. Multiple models are frequently encountered in the problem of surface reconstruction from range (depth) images, where it is very likely that a scene will have more than one surface (model) and data drawn from one model become outliers for the other models [90]. Multiple models are also encountered while estimating the motion of moving objects in a video and 4 Figure 1.4: Missing data problem arises frequently in Structure from Motion (SfM) problem. In SfM, feature points are tracked through all the images, but since not all the features are visible in all the images, this gives rise to the missing data problem. We will see later that completing the missing tracks solves the SfM problem. finding lines/curves in images. There are many sources of variations in visual data such as that due to illumination, geometry and noise, and if a variation is not accounted for in a data model, then data suffering from that variation are likely to become outliers. In the presence of outliers, it is important to design robust statistical models. ? We also frequently encounter missing elements in visual data. For example in the SfM problem [41], where the goal is to reconstruct the 3D scene from multiple images or video, we track 2D features through the images or frames of the video and then estimate the geometry of the scene using the features. However many features are not visible in all the images/frames and this gives rise to the missing data problem (see figure 1.4). The missing data problem also arises when solving the photometric stereo problem [107], where the goal is to reconstruct the surface of an imaged object under different illumination 5 conditions. Both these problems can be solved by filling in these missing data [14]. Hence, it is important that statistical models, designed for solving computer vision problems, should be able to handle missing elements in the data. Figure 1.5: Robust linear regression: The popular linear regression techique ?least squares? is very sensitive to outliers. Random Sample Consensus (RANSAC), a robust algorithm, is mostly used for solving low-dimensional vision problems. However, it is a combinatorial algorithm and hence can not be used for solving high-dimensional problems. We propose robust polynomial time algorithms and analyze their performances. Keeping the above characteristics of the computer vision datasets in mind, we propose the following statistical data models: ? Robust Linear Regression For High-Dimensional Data: The goal of regression is to learn the functional relation between two variables from many 6 Figure 1.6: Robust RVM regression: RVM regression is a kernel regression technique, which has been used for solving many problems such as age and pose estimation. However, it is very susecptible to outliers as can be seen here. We propose two robust versions of RVM. examples/data. If we know the functional form (linear, quadratic, etc.) of the relation, then the goal of regression becomes estimating the parameters of the function. Many problems in computer vision can be posed as a regression problem. Some examples are: finding primitive structures (lines and curves) in images, epipolar geometry estimation [41], age estimation from facial images [78], human head and body pose estimation [3] and surface estimation from gradient fields [5]. Many of these problems are high-dimensional such as the age, pose and surface estimation problems. And all of these problems usually suffer from outliers and hence we need robust regression algorithms for solving 7 Figure 1.7: Missing data matrix factorization: We encounter missing data (missing tracks) in the SfM problem. We can solve the SfM problem (complete the missing tracks) by solving a missing data matrix factorization problem. We propose a large-scale factorization algorithm that can handle large amounts of missing data. them, see figure 1.5. Low-dimensional problems, such as line/curve estimation and epipolar geometry estimation, are usually solved using the popular (in vision literature) robust algorithm RANSAC [36]. However, this algorithm is combinatorial in the dimension of the problem and hence can not be used for solving high-dimensional problems. We propose polynomial time robust linear regression algorithms, which can be used for solving high-dimensional problems. Using the assumption that outliers in a dataset are usually sparse, we formulate the robust regression problem based on two techniques from sparse representation/learning theory: Basis pursuit [26] and Bayesian sparse learning [95]. We analyze the precise conditions under which the basis pursuit based algorithm can correctly solve the robust regression problem. These conditions are based on the angle difference between the regressor subspace and the outlier subspaces. We also empirically study the performance of various robust algorithms and use them to solve the age estimating problem. Chapter 8 2 presents this work in more details. ? Robust Kernel Regression Using Sparse Outliers Model: We general- ize our robust framework for the linear regression to kernel regression. Linear regression is an example of parametric regression, where we assume the regres- sion model to be of a certain parametric form. However, if we are not certain about the appropriate parametric model to use for a particular problem, the alternative is to use a non-parametric model such as kernel regression. Kernel regression approximates the dependent variable by kernel functions located at each data point. In this dissertation, we consider the Relevance Vector Machine (RVM) regression, which is a particular type of kernel regression. In RVM, a Gaussian distribution is assumed for the noise term in the model, which makes it susceptible to the presence of outliers in the data set, see figure 1.6. We propose robust versions of the RVM regression. We decompose the noise term in the RVM formulation into a (sparse) outlier noise term and a Gaussian noise term. We then estimate the outlier noise along with the model parameters. We present two approaches for solving this estimation problem: 1) a Bayesian approach, which essentially follows the RVM framework and 2) a regularization approach based on basis pursuit. In the Bayesian approach, the robust RVM problem essentially becomes a bigger RVM problem with the advantage that it can be solved efficiently by a fast algorithm. Empiri- cal evaluations, and real experiments on image denoising and age estimation demonstrate the better performance of the robust RVM algorithms over that 9 of the RVM regression. Chapter 3 presents this work in more details. ? Large-Scale Matrix Factorization in the Presence of Missing Data: Low-rank factorization of the ?data matrix? (data collected as columns of a matrix) reveals the low-dimensional structure of the data. Many problems in computer vision, such as SfM and photometric stereo, are solved using the low-rank matrix factorization technique. If the data matrix is complete, the low-rank factors can be obtained by singular value decomposition (SVD) of the matrix. However, if there are many missing elements in the matrix, which happens frequently in SfM (see figure 1.7) and photometric stereo problems, it is hard problem to solve. The popular algorithm in vision literature for solving this problem is based on damped Newton?s method [14], which is a very slow and memory intensive algorithm. We formulate the matrix factorization with missing data problem as a low-rank semidefinite program (LRSDP) with the advantage that: 1) an efficient quasi-Newton implementation of the LRSDP enables us to solve large-scale factorization problems, and 2) additional con- straints such as ortho-normality, required in orthographic SfM, can be directly incorporated in the new formulation. Our empirical evaluations suggest that, under the conditions of matrix completion theory [21], the proposed algorithm finds the optimal solution, and also requires fewer observations compared to the current state of the art algorithms. We further demonstrate the effective- ness of the proposed algorithm in solving the affine SfM problem, non-rigid SfM and photometric stereo problems. Chapter 4 presents this work in more 10 details. Figure 1.8: Remote Face Recognition: Face recognition from remotely acquired images is a challenging problem because of variations due to blur, illumination, pose and occlusions. We address the problem of recognizing blurred and poorly illuminated faces by using the generative models for blur and illumination variations. Apart from designing statistical data models and optimization algorithms that can be used for solving many computer vision problems, we also address two specific vision problems and propose statistically optimal and efficient algorithms for solving them. ? Direct Face Recognition Across Blur and Illumination Variations: We are interested in recognizing faces acquired from distant cameras. The main factors that make this a challenging problem are image degradations due to blur and noise, and variations in appearance due to illumination and pose, see figure 1.8. In this dissertation, we address the problem of recogniz- ing faces across blur and illumination variations. The current state of the art approach for recognizing blurred faces first deblurs the face image and then recognize it using classical face recognition algorithms [70]. However, deblur- ring (blind deconvolution) is an ill-posed problem and, more importantly, is 11 Figure 1.9: Scalable Bundle Adjustment: Bundle adjustment is the final opti- mization step of the SfM problem, where the structure and camera parameters are refined starting from an initial reconstruction. We propose an efficient bundle ad- justment algorithm based on minimizing the l?-norm of reprojection error. (Image courtesy Dr. Noah Snavely) not an essential step for recognizing faces. We take a direct approach for face recognition. Using the convolution model for blur, we show that the set of all images obtained by blurring a given image forms a convex set. We then use the set theoretic notion of distance between a given blurred (probe) image and the gallery sets to find the best match. Further, to handle illumination variations we use the low-dimensional linear subspace model [8], and define a set for each gallery image that represents all possible variations of that gallery image due to blur and illumination. The probe image is then assigned the identity of the closest gallery image. The proposed recognition algorithm is also statistically optimal; it is the maximum likelihood estimate of the blur filter kernel, illumination coefficients and identity. Further, using the set the- 12 oretic notion of distance between sets, we can characterize the amount of blur our algorithm can handle for a given dataset. Chapter 5 presents this work in more details. ? A Scalable Bundle Adjustment Algorithm Using the l? Norm: SfM is the problem of reconstructing the 3-D structure of an observed scene and the camera parameters (orientations and locations) from multiple images or video of the scene. Bundle adjustment is the final optimization step of the SfM prob- lem, where the structure and camera parameters are refined starting from an initial reconstruction, see figure 1.9. Traditionally this is done by minimizing the l2-norm of the image reprojection error [7]. LevenbergMarquardt algo- rithm is used for solving this problem, which has a computational complexity of O((m+n)3) per iteration and memory requirement of O(mn(m+n)), where m is the number of cameras and n is the number of structure points. We pro- pose an algorithm that has a computational complexity of O(mn(?m+?n)) per iteration and memory requirement of O(max(m,n)). The proposed algo- rithm is based on minimizing the l? norm of reprojection error. It alternately estimates the camera and structure parameters, thus reducing the potentially large scale optimization problem to many small scale subproblems each of which is a quasi-convex optimization problem and hence can be solved globally. Experiments using synthetic and real data show that the proposed algorithm gives good performance in terms of minimizing the reprojection error and also has a good convergence rate. Chapter 6 presents this work in more details. 13 Chapter 2 Robust Linear Regression Using Sparse Learning for High-Dimensional Applications The goal of regression is to infer a functional relationship between two sets of variables from a given data set. Many a times the functional form is already known and the parameters of the model (function) are estimated from the data set. In most of the data sets, there are some data which differ markedly from the rest of the data; these are known as outliers. The goal of robust regression techniques is to properly account for the outliers while estimating the model parameters. Since, any subset of the data could be outliers, robust regression is, in general, a combinato- rial problem and (robust) algorithms such as ?least median squares? (LMedS) [81] and RANSAC [36] inherit this combinatorial nature. We propose polynomial-time algorithms and state the conditions under which we can correctly solve the robust regression problem. We express the regression error as a sum of two error terms: an outlier (gross) error term and an inlier (small) error term. Under the reasonable assumption that the number of outliers is fewer than the number of inliers, the robust regression problem can be formulated as a l0-norm regularization problem, where we mini- mize the number of outliers subject to satisfying the regression model. We provide conditions under which the above optimization problem will find the correct model 14 parameters (and outliers). These conditions are in terms of the smallest principal angle between the regression subspace and the outlier subspaces, which we show is related to the restricted isometry constant of the compressive sensing theory [22]. However, the l0-norm regularization problem is a combinatorial problem and hence we relax it to a l1-norm regularized problem, which is related to the basis pursuit algorithm [26]. We then show that under stricter conditions on the angular distance between the regression subspace and the outlier subspaces, the proposed algorithm will correctly solve the robust regression problem. We also propose a Bayesian for- mulation for solving the robust regression problem. We use the sparse Bayesian learning technique [95] to impose a sparse prior on the outliers and then obtain the outliers using maximum a-posterior (MAP) criterion. Finally, we study the theo- retical computational complexity of various robust regression algorithms to identify algorithms that are efficient for solving high-dimensional problems. Related works: LMedS technique [81] minimizes the median of the squared residuals. A random sampling algorithm is used for solving this problem. This sampling algorithm is combinatorial in the dimension (number of the parameters) of the problem which makes LMedS impractical for solving high-dimensional re- gression problems. The RANSAC algorithm [36] and its improvements such as MSAC, MLESAC [99] are the most widely used robust algorithms in computer vi- sion [90]. RANSAC estimates the model parameters by minimizing the number of outliers, which are defined as data points that have residual greater than a pre- defined threshold. The same random sampling algorithm as used in LMedS is used for solving this problem, which makes RANSAC, MSAC and MLESAC impracti- 15 cal for high-dimension problems. Another famous class of robust algorithms is the M-estimates [44]. M-estimates are a generalization of the maximum likelihood es- timates (MLEs), where the negative log likelihood function of the data is replaced by a robust cost function. Amongst the many possible choices of cost functions, redescending cost functions are the most robust ones. However, these cost func- tions are non-convex and the resulting non-convex optimization problem has many local minima. Generally, a polynomial time algorithm ?iteratively reweighted least squares? (IRLS) is used for solving the optimization problem, which often converges to local minima. There are many other robust algorithms, proposed as improve- ments over M-estimates, such as S-estimates, L-estimates and MM-estimates, but all of them are solved using the (combinatorial) random sampling algorithm [63], and hence, can not be used for solving high-dimensional problems. Apart from robust cost function-based approaches, there are methods that first identify the outliers using ?outlier diagnostics techniques?, remove them, and then use a (non-robust) regression algorithm such as Least Squares (LS) to estimate the model parameters [82]. However, these methods are not known be very successful when there are many outliers. A similar mathematical formulation (as robust regression) arises in the con- text of error-correcting codes over the reals [22], [24]. Error-correcting codes are used for encoding messages in such a way so that it can reliably transmitted over a channel and correctly decoded at the receiver. The decoding schemes, in particular, are very similar to robust regression algorithms. The decoding scheme used in [22] is the l1 ? regression (least absolute deviations). It was shown that if a certain 16 orthogonal matrix, related to the encoding matrix, satisfies the restricted isometry property (RIP) and the gross error vector is sufficiently sparse, then the message can be successfully recovered. In [24], this error-correcting scheme was further ex- tended to the case where the channel could introduce (dense) small errors along with sparse gross errors. Two decoding schemes were proposed and it was shown that if a properly scaled version of the encoding matrix satisfies the RIP property and the gross error vector is sufficiently sparse, the message can be correctly recov- ered. The robust regression problem is different from the error-correcting codes in the following manner: In error-correcting codes, one is free to design the encoding matrix, whereas, in robust regression we are provided a data set and hence there is no question of designing the regression matrix, which plays a similar mathemati- cally role as the encoding matrix. Also, the sufficient conditions that we provide for correctly estimating the model parameters are more appropriate in the context of robust regression and also tighter than that provided in [24]. Concurrently with us [67], a Bayesian approach based on sparse learning was proposed for solving the ro- bust regression problem in [47]. This approach is similar in principal to our Bayesian approach and the paper reports similar results. The organization of the rest of this chapter is as follows: in section 2.1, we formulate the robust regression problem as a l0-norm regularization problem and relaxed convex versions of it (l1 regression and modified form of basis pursuit) and provide conditions under which the proposed optimization problems correctly solves the robust regression problem. We prove our main result in section 2.2. In sec- tion 2.3, we propose a Bayesian approach for robust regression. In section 2.4, we 17 perform many empirical experiments to compare various robust algorithms and, fi- nally, in section 2.5, we present a real application of age estimation using the robust algorithms. 2.1 Robust Regression Based on Basis Pursuit (BPRR) Regression is the problem of estimating the functional relation f between two sets of variables: independent variable or regressor x ? RD and dependent variable or regressand y ? R, from many examples pairs (x, y). In linear regression, the function f is a linear function of the model parameter w ? RD: y = xTw + e, (2.1) where e is the observation noise. We want to estimate w from a given training dataset of N observations (yi, xi)i = 1, 2 ? ? ? , N , i.e. yi = xTi w + ei. We can write all the observation equations collectively as: y = Xw + e, (2.2) where y = (y1, . . . , yN)T , X = [x1T , . . . , xNT ] ? RN?D and e = (e1, . . . , eN)T . The most popular estimator of w is the least squares (LS), which is statistically opti- mal (in the maximum likelihood sense) for independent and identically distributed Gaussian noise case. However, in the presence of outliers or gross error, the noise distribution is far from Gaussian and, hence, LS gives poor estimates of w. To handle outliers, we express the noise variable e as sum of two independent components, e = s + n, where s represents the outliers and n represents the small 18 noise, which can be modeled, for example, by Gaussian distribution. With this the linear regression model is given by y = Xw + s+ n. (2.3) Note that this is an ill-posed problem as there are more unknowns, w and s, than equations and, hence, there are many solutions. Clearly, we need to restrict the solution space in order to make it a well posed problem. A reasonable assumption is that outliers are sparse in a dataset, i.e., the number of outliers are much less than the number of inliers. RANSAC also makes this assumption: it finds that parameter w which results in the least number of data being labeled as outliers. Under this sparse outlier assumption, we should solve the following optimization problem: mins,w?s?0 such that ||y?Xw ? s||2 ? , (2.4) where ?s?0 is the number of non-zero elements in s and  is a measure of the magnitude of the small noise n. If we assume n to be a Gaussian random variable, then  may be chosen as a small multiple of the variance. However, before looking at the case where both outliers and small noise is present, we first treat the case where only outliers are present, i.e., n = 0. In the absence of small noise (n = 0), we should solve mins,w||s||0 such that y = Xw + s. (2.5) We are interested in the question: Under what conditions, by solving the above equation, can we recover the original w from the observation y? It is quite obvious that X should be full column rank (as N ? D), otherwise, even when there are 19 no outliers, we will not be able to recover the original w. To discover the other conditions, we re-write the constraint in (2.5) as y = [X I]ws, (2.6) where I is a N ? N identity matrix and ws = [w; s]1 is the augmented vector of unknowns. Now, consider a particular dataset y,X where amongst the N data, characterized by the index set J = [1, 2, . . . , N ], k of them are affected by outliers. Let these k outlier affected data be specified by the subset T ? J . Then, equation (2.6) can be written as y = [X IT ]wsk , (2.7) where IT is a matrix consisting of column vectors from I indexed by T , wsk = [w; sk] and sk ? Rk represents the k non-zero outliers. Given the information about the index subset T , i.e. given which data (indices) are affected by outliers, we can recover w and the non-zero outliers sk by solving (2.5) if and only if [X IT ] is full column rank. The condition [X IT ] being full rank can also be expressed in terms of the smallest principal angle between the subspace spanned by the regressor, span(X), and the subspace spanned by outliers, span(IT ). The smallest principle angle ? between two subspaces U and W of RN is defined as the smallest angle between a vector in U and a vector in W [38]: cos(?) = max u?U max w?W uTw ?u??w? . (2.8) Equivalently for any vectors u ? span(X) and w ? span(IT ) |uTw| ? ??u??w? (2.9) 1Throughout this chapter, we will use the MATLAB notation [w; s] to mean [wT sT ]T 20 where ? = cos(?) is smallest such number. To generalize this inequality for all subset T with cardinality at most k, we introduce the following definition. Definition 2.1.1. For every integer 1 ? k ? N we define a constant ?k to be the smallest quantity such that for all u ? span(X) and w ? span(IT ) with |T | ? k, the following holds |?u, w?| ? ?k?u??w? (2.10) The quantity ?k ? [0, 1] is a measure of how well separated the regressor subspace span(X) is from the all the outlier subspaces span(IT ) with dimension at most k. When ?k = 1, the regressor subspace and one of the outlier subspaces of dimension at most k, share at least a common vector, whereas, when ?k = 0, the regressor subspace is orthogonal to all the outlier spaces of dimension at most k. With the definition of ?k, we are now in a position to state the sufficient conditions for recovering w by solving (2.5). Proposition 2.1.1. Assume that ?2k < 1 and X is a full column rank matrix. Then, by solving (2.5), we can recover w exactly if there are at most k outliers in the y variable. Proof. The conditions ?2k < 1 and X a full rank matrix together implies that all matrices of the form [X IT ] with |T | ? 2k are full rank. This fact can be proved by a simple contradiction argument. Now, suppose w0 and s0 with ||s0||0 ? k satisfy the equation y = Xw + s. (2.11) 21 Then to show that we can recover w0 and s0 by solving (2.5), it is sufficient to show that there exists no other w and s, with ||s||0 ? k, which also satisfy (2.11). We show this by contradiction: Suppose there is another such pair, say w1 and s1 with ||s1||0 ? k, which also satisfies (2.11). Then Xw0 + s0 = Xw1 + s1. Re-arranging, we have: [X I]?ws = 0 (2.12) where ?ws = [?w; ?s], ?w = (w0 ? w1) and ?s = (s0 ? s1). Since ||s0||0 ? S and ||s1||0 ? S, ||?s||0 ? 2k. If T? denotes the corresponding non-zero index set, then T? has a cardinality of at most 2k and, thus, [X IT? ] is a full rank matrix. This in turn implies that ?ws = 0, i.e. w0 = w1 and s0 = s1. Hence, the solution of (2.5) is unique and correct under the assumed conditions. From the above theorem, we can find a lower bound on the maximum number of outliers (in the y variable) that the l0 norm regression (2.5) can handle in a dataset of regressor matrix X. This is given by the largest integer k such that ?2k < 1. Note that the l0 norm regression (2.5) is a hard combinatorial problem to solve. So, as in compressive sensing theory, we would like to approximate it by the following convex problem: mins,w?s?1 such that y = Xw + s (2.13) where the ||s||0 term is replaced by the l1 norm of s. Note that the above problem can be re-written as minw?y ? Xw?1, and hence this is the l1 regression problem. Again, we are interested in the question: Under what conditions, by solving the above problem, can we recover the original w? Not surprisingly, the answer is that 22 we need a bigger angular separation between the regressor subspace and the outlier subspaces. Proposition 2.1.1. Assume that ?2k < 23 and X is a full column rank matrix. Then, by solving (2.13), we can recover w exactly if there are at most k outliers in the y variable. Furthermore, if there are more than k outliers, then the estimation error of w (?w) is given in terms of the best k-sparse approximation of the outliers sk, the vector s with all but the k-largest entries set to zero, by ||?w||2 ? ??1C0k? 12 ||s? sk||1, (2.14) where ? is the smallest singular value of X and C0 is a constant which depends only on ?2k. Note that if there are at most k outliers, then sk = s, and equation (2.14) implies that ||?w||2 ? 0, i.e., w can be exactly recovered. Similar to the l0 regression case, we can obtain a lower bound on the maximum number of outliers that the l1 regression can handle in the y variable; it is given by the largest integer k for which ?2k < 23 . Proposition 2.1.1 is a special case of the next theorem which considers the small noise case (n > 0). In the presence of small bounded noise with ||n||2 ? , we propose to solve the following convex approximation of the combinatorial problem (2.4) mins,w||s||1 such that ||y?Xw ? s||2 ? . (2.15) Note that the above problem is a modified form of the basis pursuit denoising problem [26]. Under the same conditions on the angular separation between the regressor subspace and the outliers subspaces, we have the following result. 23 Theorem 2.1.1. Assume that ?2k < 23 , X is a full column rank matrix and (2.15) is feasible. Then the error in estimation of w (?w) by the solution of (2.15) is given in terms of the best k-sparse approximation of the outliers (sk) and  as ||?w||2 ? ??1(C0k? 12 ||s? sk||1 + C1), (2.16) where ? is the smallest singular value of X, and C0, C1 are constants which depend only on ?2k. Note that we get fact 2.1.1 by setting  = 0. Also note that if there are at most k outliers, sk = s and the estimation error ||?w||2 is bounded by a constant times . We prove the above theorem in the next section. 2.2 Proof of the Main Theorem 2.1.1 The proof parallels that in [20]. The main assumption of the theorem is in terms of the smallest principal angle between the regressor subspace, span(X), and the outlier subspaces, span(IT ). This angle is best expressed in terms of orthonormal bases of the subspaces. IT is already an orthonormal basis, but we can not say the same for X. Hence we first orthonormalize X by the reduced QR decomposition, i.e. X = QR where Q is an N ?D matrix which forms an orthonormal basis for X and R is an D ? D upper triangular matrix. Since X is assumed to be full rank, R is a full rank matrix. Using this decomposition of X, we can solve (2.15) in an alternative way. First, we substitute z = Rw and then solve the problem: mins,z||s||0 such that ||y ?Qz? s||2 ? . (2.17) 24 w can be then be obtained by w = R?1z. This way of solving for w is exactly equivalent to that of (2.15), and hence for solving practical problems any of the two approaches can be used. However, the proof of the theorem is based on the alternative approach. We first obtain an estimation error bound on z and then use w = R?1z to obtain a bound on w. For the main proof we will need some more results. One of the results is on the relation between ?k and a quantity ?k, defined below, which is very similar to the concept of restricted isometry constant [22]. Definition 2.2.1. For each integer k = 1, 2, . . . , N we define a constant ?k as the smallest number such that (1? ?k)?x?2 ? ?[Q IT ]x?2 ? (1 + ?k)?x?2 (2.18) for all T with cardinality at most k. Lemma 2.2.1. ?k = ?k for all k = 1, 2, . . . , N . Proof. From definition of ?k, for any IT with |T | ? k, z and s: |?Qz, IT s?| ? ?k?z??s? (2.19) where we have used ?Qz? = ?z? and ?IT s? = ?s? since Q and IT are orthonormal matrices. Writing x = [z; s], ?[Q IT ]x?2 is given by ?[Q IT ]x?2 = ?z?2 + ?s?2 + 2?Qz, IT s? ? ?z?2 + ?s?2 + 2?k?z??s? ? ?z?2 + ?s?2 + ?k(?z?2 + ?s?2), 25 where we use the fact 2?z??s? ? ?z?2 + ?s?2 for the last inequality. Further, using the fact ?x?2 = ?z?2 + ?s?2, we get ?[Q IT ]x?2 ? (1+ ?k)?x?2. Using the inequality ?Qz, IT s? ? ??k?z??s?, it is easy to show that ?[QIT ]x?2 ? (1? ?k)?x?2. Thus, we have (1? ?k)?x?2 ? ?[Q IT ]x?2 ? (1 + ?k)?x?2. (2.20) This implies ?k ? ?k. However, since all the inequalities involved can be satisfied with equality, ?k = ?k. Suppose y = Qz + s+ n and let z? and s? be the solution of (2.17) for this y. Then ?Q(z ? z?) + (s? s?)? ? ?Qz + s? y?+ ?y ?Qz? ? s?? ? 2 (2.21) This follows from triangular inequality and that both z, s and z?, s? are feasible for problem (2.17). Let ?z = z? ? z and h = s? ? s. For the rest of the proof, we are going to use the following notation: vector xT is equal to x on the index set T and zero elsewhere. Note that this notation is different from that used for matrices, where IT denotes the matrix consisting of column vectors from I indexed by T . Now, let?s decompose h into a sum of vectors hT0 , hT1 , hT2 , . . . , where each of the index set Ti, i = 0, 1, 2, . . . , is of cardinality k except for the last index set which can be of lesser cardinality. The index T0 corresponds to the locations of k largest coefficients of s, T1 to the locations of k largest coefficients of hT c0 , T2 to that of the next largest k coefficients of hT c0 and so on. In the main proof, we will need a bound on the quantity ? j?2 ?hTj?2, which we obtain first. We use the following results 26 from [20]: ? j?2 ?hTj?2 ? k? 12?hT c0 ?1 (2.22) and ?hT c0 ?1 ? ?hT0?1 + 2?sT c0 ?1. (2.23) These results correspond to equation (10) and (12) in [20], with some change in notations. The first result holds because of the way h has been decomposed into hT0 , hT1, hT2 , . . . , and the second result is based on ?s + h?1 ? ?s?1, which holds because s + h = s? is the minimum l1-norm solution of (2.17). Based on the above two equations, we have ? j?2 ?hTj?2 ? k? 12?hT0?1 + 2k? 12?sT c0 ?1 ? ?hT0?2 + 2e0, (2.24) where we have used the inequality k? 12?hT0?1 ? ?hT0?2 and e0 is defined as e0 = k? 12?sT c0 ?1. Since by definition sT0 = sk, the k-sparse approximation of s, sT c0 = s?sk and hence e0 = k? 1 2?s ? sk?1. With these results, we are in a position to prove theorem 2.1.1. Proof. Our goal is to find a bound on ?z, from which we can find a bound on ?w. We do this by first finding a bound for [?z; hT0?T1 ] through bounds on the quantity ?Q?z + hT0?T1?2. Using hT0?T1 = h??j?2 hTj , we get ?Q?z + hT0?T1?2 = ?Q?z + hT0?T1, Q?z + h? ? ?Q?z + hT0?T1 , ? j?2 hTj?. (2.25) Using triangular inequality, the first term in the right hand side can be bounded as ?Q?z + hT0?T1 , Q?z + h? ? ?Q?z + hT0?T1?2?Q?z + h?2. (2.26) 27 Since hT0?T1 is 2k sparse, using (2.20), we get ?Q?z + hT0?T1?2 ? ? 1 + ?2k?[?z; hT0?T1 ]?2. Further, using the bound ?Q?z + h?2 ? 2, see equation (2.21), we get ?Q?z + hT0?T1 , Q?z + h? ? 2 ? 1 + ?2k?[?z; hT0?T1 ]?2. (2.27) Now, we look at the second term in the right hand side of equation (2.25). Since the support of hT0?T1 and hTj , j ? 2 are different, ?hT0?T1, hTj? = 0 for all j ? 2, and we get ??Q?z + hT0?T1, ? j?2 hTj? = ? j?2 ?Q?z,?hTj ? ? ?2k??z?2 ? j?2 ?hTj?2, (2.28) where we used the definition of ?2k and the fact that hTj is k-sparse, and hence also 2k sparse. Further, using (2.24), ?hT0?2 ? ?hT0?T1?2 and ??z?2 ? ?[?z; hT0?T1 ]?2 ?2k??z?2 ? j?2 ?hTj?2 ? ?2k??z?2?hT0?T1?2 + 2e0?2k?[?z; hT0?T1 ]?2 (2.29) ??z?2?hT0?T1?2 can be further bounded by 12?[?z; hT0?T1 ]?22 (by applying the in- equality 2ab ? a2 + b2). Therefore, ?2k??z?2 ? j?2 ?hTj?2 ? ?2k2 ?[?z; hT0?T1 ]? 2 2 + 2e0?2k?[?z; hT0?T1 ]?2. (2.30) Finally, we obtain the following bound for ?Q?z + hT0?T1?2 ?Q?z + hT0?T1?2 ? (2 ? 1 + ?2k + 2e0?2k)?[?z; hT0?T1 ]?2 + ?2k2 ?[?z; hT0?T1 ]? 2 2. (2.31) Since hT0?T1 is 2k sparse, from equation (2.20), we get (1? ?2k)?[?z; hT0?T1]?22 ? ?Q?z + hT0?T1?22. (2.32) 28 From the above two equations, it follows that (1? 3 2 ?2k)?[?z; hT0?T1 ]?2 ? 2e0?2k + 2 ? 1 + ?2k. (2.33) Since ?2k < 23 is an assumption of the theorem, 1? 3 2?2k > 0, and hence ?[?z; hT0?T1 ]?2 ? 2e0?2k1? 32?2k + 2 ? 1 + ?2k 1? 32?2k . (2.34) Since ??z?2 ? ?[?z; hT0?T1 ]?2, we obtain ?z?2 ? C0k? 12?s? sk?1 + C1 where C0 = 2?2k1? 32?2k , C1 = 2 ? 1 + ?2k 1? 32?2k . (2.35) Using the definition w = R?1z, we get ?w ? ?R?1?2??z?2, where ?R?1?2 is the spectral norm of R?1. Note that the spectral norm of R?1 is given by its largest singular value, which is the reciprocal of the smallest singular value of R. Further, since X = QR and R share the same singular values, ?R?1?2 = ??1, where ? is the smallest singular value of X. Hence, we have the final result ?w ? ??1(C0k? 12?s? sk?1 + C1). (2.36) 2.3 A Bayesian Approach: Bayesian Robust Regression (BRR) We also take a Bayesian approach towards solving (2.4). In the Bayesian approach, a (joint) prior distribution is proposed for the unknown variables of the problem and the (joint) posterior distribution of the variables is computed using the proposed prior and the likelihood distribution. Generally the mean or the mode of this posterior distribution is taken to be the solution. Since we have assumed 29 outliers are sparse in a dataset, an appropriate prior for s would be the sparse prior as introduced in [95]. However, to avoid a choice of prior on w, since we want an unbiased estimate for w, we propose to solve the problem (2.4) in two steps: First, we reduce the joint estimation problem (estimating w and s) to a simpler problem of estimating s. This is done by projecting y onto the left null space of the regressor X, which has contribution only from the outliers s. Recall that X is a full rank N ?D matrix. Let CT be an orthonormal basis for the left null space of X, i.e. CT is a N ? (N ?D) ortho-normal matrix which satisfies C ?X = 0. Pre-multiplying (2.3) by C, we get Cy = CXw + Cs+ Cn z = Cs+ g, (2.37) where z = Cy and g = Cn, again, a small noise. We would like to solve the following problem using sparse Bayesian prior on s: mins?s?0 such that ?z ? Cs?2 ? ? (2.38) Note that ? is related to  of the original problem (2.4). If we assume an isotropic Gaussian distribution for n, then ? = ?(N ?D)/N ? . Once we find a solution for s, we can subtract s from y and estimate w using least squares. The Bayesian approach towards solving problems of the form (2.38) goes by the name of sparse Bayesian learning [95, 106]. The sparse prior on s is defined in the following manner: Each element of s = [s1s2 . . . sN ]T is assumed to be a 30 zero-mean Gaussian random variable with hyper-parameter ? = [?1?2 . . . ?N ]T : p(s/?) = N? i=1 N (si|0, ?i?1) where ?i represents the inverse variance of the Gaussian distribution of si. Gamma distribution (hyper-prior) is specified for each of the hyper-parameters ?i. This is a hierarchical description of the prior, to get a direct description of the prior we need to marginalize out (integrate over) the hyper-parameters. For example, if a uniform distribution (obtained as a particular parameter setting of the Gamma distribution) is assumed for ?i, then by marginalizing out ?i, the improper prior p(si) = 1/|si| is obtained, which is a sparsity promoting prior. The likelihood term is given by p(z/s, ?2) = N (z|Cs, ?2I), where Gaussian distribution is assumed for the small noise g and ? is a gamma distributed random variable. With the above prior and likelihood, the maximum a posteriori estimate (MAP) of s is obtained is obtained as follows: The unknowns ?i, ? are first solved using evidence maximization technique, which maximizes the marginal distribution p(y/?i, ?) over ?i and ?. These values are then used for obtaining the MAP estimate of s. This Bayesian algorithm has a complexity of O(N3). 31 2.4 Theoretical and Empirical studies of the Parameter space of Ro- bust Regression Three important parameters of the robust regression problem are: fraction of outliers in the dataset f , dimension of the problem D and inlier noise variance ?2. We study the performances of the proposed algorithms, BPRR and BRR, and compare them to that of M-estimators, LMedS and RANSAC. The performance criteria are estimation accuracy and computational complexity. We first discuss the theoretical computational complexity of the algorithms and then empirically study them for estimation accuracy. BPRR (equation (2.15)) is a second order cone programming problem with N+ D variables and one cone constraint of dimension N , hence, it has a computational complexity of O ((N +D)2.5N)) [56]. BRR involves solving the sparse Bayesian learning problem, which has a complexity of O(N 3) [95], and a least squares problem of complexity O ((N +D/3)D2) [38]. M-estimators are usually solved using the IRLS algorithm, which has a complexity of O ((N +D/3)D2). Note that none of these algorithms have any direct dependence on the outlier fraction f or inlier noise variance ?2. As discussed in the introduction, LMedS and RANSAC are solved using a random sampling algorithm, where D data are randomly sampled from the data set of N data and the LMedS/RANSAC cost is evaluated based on the these D data. The number of such samplings that we need to perform so as to get a successful sampling (where all the data are inliers) with a high probability p is given 32 by [82, 36] k = min ( log(1? p)log(1? (1? f)D) , ( N D ) ). (2.39) Therefore, these algorithms are combinatorial in D. From the above discussion, we can conclude that BPRR, BRR and M-estimates are the feasible algorithms for high-dimensional robust regression problems, whereas LMedS and RANSAC are not. We perform a series of experiments using synthetically generated data. For each trial in the experiments, we generate the dataset (xi, yi), i = 1, 2, . . . , N , xi ? RD, y ? R, and the model parameters w ? RD in the following manner: xis are obtained by uniformly sampling a D-dimensional hypercube centered around the origin and w is a randomly sampled from a standard Gaussian random variable. Depending on the outlier fraction f , we randomly categorize the N indices into either inlier or outlier indices. The yis corresponding to the inlier indices are obtained from yi =< xi, w > +n, where n is the inlier Gaussian noise N(0, ?2). The yi corresponding to the outlier indices are obtained by uniformly sampling the interval [?r, r], where r is the range (maximum absolute value) of the inlier ys. Regression accuracy is measured by the l2 norm of estimation error of w. BPRR, BRR and RANSAC need estimates of the inlier noise standard deviation, which we provide as the median absolute residual of the l1 regression. In our experiments, we have used the MATLAB implementation of bisquare (Tukey?s biweight) M-estimates, other M-estimates give similar results. In the first experiment, we study the performances of the algorithms as a function of outlier fraction and dimension. We generate N = 500 synthetic data 33 with inlier noise variance ? = 0.001. Fig. 2.1 shows the mean estimation error over 20 trials vs. outlier fraction for dimension 2, 6 and 25. For dimension 25, we only show BPRR, BRR and M-estimates as the other algorithms LMedS and RANSAC, which are combinatorial in nature, are very slow. BRR performs very well for all the dimensions. The other algorithms are comparable with each other. We further study the performances of the algorithms with respect to outlier fraction and dimension using the phase transition curves. In compressive sensing theory, where the goal is to find the sparsest solution for an under-determined system of equations, a sharp transition between success and failure of the basis pursuit algorithm has been observed: For a given level of under-determinacy, basis pursuit successfully recovers the correct solution (with high probability) if the sparsity is below a certain level and fails to do so (with high probability) if the sparsity is above that level [30], [31]. This phenomenon is termed phase transition in the compressive sensing literature and it has been used to characterize and compare the performances of several compressive sensing algorithms [62]. We also use this measure to compare the various robust regression algorithms. In the context of robust regression, the notion of under-determinacy depends on N and D. Since, there are N observations and N +D unknowns in robust regression, by varying D for a fixed N we can vary the level of under-determinacy. The notion of sparsity is associated with the outlier fraction. Hence, to obtain the phase transition curves, we vary the dimension D of the problem for a fixed N and for each D find the outlier fraction where the transition from success to failure (in parameter estimation) occurs. As before, we choose N = 500 and ? = 0.001. We vary D over a range of values 34 from 1 to 450. At each D, we vary the outlier fractions over a range of values and measure the fraction of trials in which the algorithms successfully found the correct solution. We consider a solution to be correct if ?w?w??2?w?2 ? 0.01. Figure 2.2 shows the fraction of successful recovery vs. outlier fraction for dimensions 2 and 50 for algorithms BPRR, BRR and M-estimators; we do not show LMedS and RANSAC as these algorithms are very slow. From the figure, we can conclude that each of the algorithms exhibit a sharp transition from success to failure at a certain outlier fraction, which confirms that phase transition do occur in robust regression also. For each regression algorithm and dimension, we find that outlier fraction where the probability of success is 0.5. Similar to [62], we use logistic regression to find this outlier fraction. Figure 2.3 shows the phase transition curves of the algorithms; it is easy to conclude that BRR gives the best performance followed by BPRR and M-estimators. We also study the effect of inlier noise variance on the performance of the algorithms. For this we fixed the dimension at 6, the outlier fraction at 0.4 and the number of data points at 500. Fig. 2.4 shows that all algorithms, except LMedS, perform well. From the above experiments, it is easy to conclude that BRR should be the preferred robust regression algorithm for low as well as high-dimensional problems. 35 2.5 Age Estimation From Face Images In this section, we use the BRR algorithm for robust age estimation from face images. We use the publicly available FG-Net dataset 2, which contains 1002 facial images of 82 subjects along with their ages. The dependent variable for this problem is the age and the independent variable is a geometric feature obtained by computing the flow field at 68 fiducial features on each image with respect to a reference face image. We categorize the whole dataset into inliers and outliers using the BRR algo- rithm. The algorithm found 177 outliers out of the total database of 1002 images. Some of the inliers and outliers are shown in figure 2.5. Most of the outliers were images of older subjects. This could be because a linear model may not be sufficient to capture the relation between age and facial geometry for all age groups. Since, the majority of the images in the dataset are of young subjects, the older subjects become outliers with respect to them. Next, we perform a leave-one-out testing in which the regression algorithm is trained on the entire dataset except for one sam- ple on which testing is done. We measure the mean absolute error (MAE) of age estimation for inliers and outliers separately. The results are shown in Table 2.1. The low inlier MAE and the high outlier MAE indicates that the inlier vs outlier categorization was good. To further test BRR, we remove the outliers detected in the previous exper- iment and then introduce controlled outliers. We use 90% of the whole dataset as 2The fg-net aging database, http://www.fgnet.rsunit.com 36 Inlier MAE Outlier MAE All MAE BRR 3.73 19.14 6.45 Table 2.1: Mean absolute error (MAE) of age estimation for inliers and outliers using BRR. The low inlier MAE and the high outlier MAE indicates that the inlier vs outlier categorization was good. training set and the remaining 10% as the test set. Controlled outliers are intro- duced only in the training set and age estimation is done on the test set by both BRR and LS. We vary the percentage of outlier on the training set and measure the MAE of age estimation on the test set. Fig. 2.6 shows that BRR gives much lower MAE as compared to LS. Table 2.2 shows the percentage of correctly detected outliers and inliers wrongly classified as outliers by BRR. BRR detects most of the outliers though it removes some the inliers. Outlier fraction 0.1 0.2 0.3 0.4 0.5 0.6 Correctly detected 97.3230 96.0524 96.1059 96.0970 95.3474 95.5894 outliers Inlier wrongly classified 16.2609 16.2415 16.4443 16.2629 16.9942 19.6336 as outliers Table 2.2: Outlier detection rate and False alarm rate of BRR for the FG-Net dataset.BRR detects almost all of the outliers though it removes some the inliers. 37 0 0.2 0.4 0.6 0.8 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Outlier fraction Mean estimation erro r Estimation error vs. outlier fraction for dimension = 2 M?est BRR BPRR RANSAC LMedS 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 Outlier fraction Mean estimation erro r Estimation error vs. outlier fraction for dimension = 6 M?est BRR BPRR RANSAC LMedS 0 0.2 0.4 0.6 0.8 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Outlier fraction Mean estimation erro r Estimation error vs. outlier fraction for dimension=25 M?est BRR BPRR Figure 2.1: Mean estimation error vs. outlier fraction for dimension 2, 6 and 25 respectively. Only BPRR, BRR and M-estimator are shown for dimension 25 as the other algorithms very slow. BRR performs very well for all the dimensions; the other algorithms are comparable with each other. 38 0 0.2 0.4 0.6 0.8 0 0.2 0.4 0.6 0.8 1 Outlier fraction Recovery rat e Recovery rate vs. outlier fraction for Dimension = 25 M?est BRR BPRR 0 0.2 0.4 0.6 0.8 0 0.2 0.4 0.6 0.8 1 Outlier fraction Recovery rat e Recovery rate vs. outlier fraction for dimension = 125 M?est BRR BPRR Figure 2.2: Recovery rate, i.e. the fraction of successful recovery, vs. outlier fraction for dimensions 2 and 50 for algorithms BPRR, BRR and M-estimators; we do have plots for LMedS and RANSAC as these algorithms are very slow. From the figure we can conclude that each of the algorithms exhibit a sharp transtion from success to failure at a certain outlier fraction. 39 0 100 200 300 400 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Dimension Outlier fractio n Phase transition curves M?est BRR BPRR Figure 2.3: Phase transition curves of the algorithms BPRR, BRR and M-estimator. BRR gives the best performance followed by BPRR and M-estimator. 40 10?3 10?2 10?1 100 0 0.5 1 1.5 2 Noise standard deviation Mean estimation erro r Estimation error vs. noise M?est BRR BPRR RANSAC LMedS Figure 2.4: Mean angle error vs. inlier noise standard deviation for dimension 6 and 0.4 outlier fraction. All algorithms, except LMedS, perform well. 41 Figure 2.5: Some outlier and inliers found by BRR. Most of the outliers were images of older subjects. This could be because a linear (regression) model may not be sufficient to capture the relation between age and facial geometry for all age groups. Since, the majority of the images in the dataset are of young subjects, the older subjects become outliers with respect to them. 42 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 3.5 4 4.5 5 5.5 6 Outlier fraction MAE of age estimatio n MAE of Age Estimation Vs Outlier fraction LS BSPP Figure 2.6: Mean absolute error (MAE) of age estimation Vs outlier fraction. BRR has almost constant MAE until outlier fraction increases beyond 0.5. 43 Chapter 3 Robust RVM Regression Using Sparse Outlier Model Kernel regression techniques such as Support Vector Regression (SVR) [103], RVM regression [95] and Gaussian processes [79] are widely used for solving many vision problems. Some examples are age estimation from facial images [54, 53, 37, 40], head pose estimation [68], 3D human pose estimation [4] and lighting estimation [86]. Recently, kernel regression has also been used for solving image processing problems such as image de-noising and image reconstruction with a great deal of success [92, 93]. However, many of these kernel regression methods, especially the RVM, are not robust to outliers in the training dataset, and hence, will produce unreliable estimates in the presence of outliers. To make the RVM model robust to outliers, we decompose the noise term in the RVM model into an outlier noise term, which we assume to be sparse, and a Gaussian noise term. The assumption of outliers being sparse is justified as we generally expect the majority of the data to be inliers. During inference, we estimate the outlier noise along with the model parameters. We present two approaches for solving this estimation problem: 1) a Bayesian approach and 2) an regularization- based approach. In the Bayesian approach, we assume a joint sparse prior for the model parameters and the outliers, and then solve the Bayesian inference problem. The mean of the posterior distribution of the model parameters is used for prediction. 44 The joint sparse assumption for the model parameters and the outliers, effectively, makes the robust RVM problem a bigger RVM problem with the advantage that we can use a fast algorithm, developed for the RVM [96], to solve this problem. In the regularization-based approach, we propose to minimize the l0 norm of the model parameters and the outliers, subject to a certain amount of observation error (which depends on the inlier noise variance). However, this is a combinatorial optimization problem and hence can not be used for solving large-scale regression problems. So, we propose to relax the problem to an l1 regularized problem, which is of the same form as the basis pursuit denoising problem [26]. We then empirically evaluate the robust algorithms by varying the following parameters of the robust regression problem: the outlier fraction, the inlier noise variance and the number of data points in the training dataset. We further demonstrate the effectiveness of the robust approaches in solving the image denoising and age estimation problems. Related works: Robust versions of the RVM regression have been proposed in [35], [97] and [108]. In [35], the noise term is modeled as a mixture of Gaussian (for the inlier noise), and uniform or Gaussian with large variance for the outlier noise. But the mixture density model makes inference difficult; a variational method is used for solving this problem making it computationally much more expensive than the RVM. In [97], a Student?s t-distribution is assumed for the noise, and the parameters of the distribution are estimated along with the model parameters. Though, this is a very elegant approach, a variational method is used for inference, which makes it computationally expensive. In [108], a trimmed likelihood function is minimized over a ?trimmed? subset that does not include the outliers. The robust trimmed 45 subset and the model parameters are found by an iterative re-weighting strategy, which at each iteration solves the RVM regression problem over the current trimmed subset. However, the method needs an initial robust estimate of the trimmed subset, which determines the accuracy of the final solution. It also needs many iterations, where in each iteration a RVM regression problem is solved, and this makes it slow. The organization of the rest of this chapter is as follows: in section 3.1, we in- troduce the RVM regression model and its proposed robust versions: robust Bayesian RVM (RB-RVM) and basis pursuit RVM (BP-RVM). In section 3.2, we evaluate the proposed robust algorithms on synthetically generated data. In section 3.3, we use the RB-RVM algorithm for robust image denoising and in section 3.4, for solving the age estimation problem. 3.1 Robust RVM Regression For both the robust Bayesian approach and the robust regularization approach, we replace the Gaussian noise assumption in the RVM formulation by an implicit heavy-tailed distribution. This is achieved by decomposing the noise term into a sparse outlier noise term and a Gaussian noise term. The outliers are then treated as unknowns and are estimated together with the model parameters. In the following sub-sections, we first describe the RVM regression model, followed by the robust Bayesian and regularization approaches. 46 3.1.1 Model Specification Let (xi, yi), i = 1, 2, ..., N be the given training dataset with dependent vari- ables yi, i = 1, 2, . . . , N and independent variables xi, i = 1, 2, . . . , N . In the RVM formulation, yi is related to xi by the model yi = N? j=1 wjK(xi,xj) + w0 + ei (3.1) where with each xj, there is an associated kernel function K(.,xj), and ei is the Gaussian noise. The objective is to estimate the weight vectorw = [w0, w1, . . . , wN ]T using the training dataset. Once this is done, we can predict the dependent variable y for any new x by y = N? i=j wjK(x,xj) + w0 (3.2) In the presence of outliers, Gaussian noise is not an appropriate assumption for ei. We propose to split the noise ei into two components: a Gaussian component ni and a component due to outliers si, which we assume to be sparse. With this, we have yi = N? j=1 wjK(xi,xj) + w0 + ni + si (3.3) In matrix-vector form, this is given by y = ?w + n+ s (3.4) where y = [y1, . . . , yN ]T , n = [n1, . . . , nN ]T , s = [s1, . . . , sN ]T and ? is a N?(N+1) matrix with ? = [?(x1),?(x2), . . . ,?(xN)]T , (3.5) 47 where ?(xi) = [1, K(xi,x1), K(xi,x2), . . . , K(xi,xN)]T . The two unknowns w and s can be augmented into a single unknown vector ws = [wT sT ]T and the above equation can be written as y = ?ws + n (3.6) where ? = [?|I] is a N ? (2N + 1) matrix with I, a N ?N identity matrix. 3.1.2 Robust Bayesian RVM (RB-RVM) In the Bayesian approach, we estimate the joint posterior distribution of w and s, given the observations y and the prior distributions on w and s. We then use the mean of the posterior distribution of w for prediction (3.2). The posterior variance also provides us with a measure of uncertainty in the prediction. The joint posterior distribution of w and s is given by p(w, s|y) = p(w, s)p(y|w, s) p(y) (3.7) From (3.6), the likelihood term p(y|w, s) is given by p(y|w, s) = N (?ws, ?2I) (3.8) where ?2 is the inlier Gaussian noise variance. To proceed further, we need to specify the prior distribution p(w, s). We assume that w and s are independent: p(w, s) = p(w)p(s). Next, we keep the same ?sparsity promoting? prior for w as in RVM [95], that is, p(w|?) = N? i=0 N (wi|0, ??1i ) (3.9) 48 where ? = [?0, ?1, . . . , ?N ]T is a vector of (N + 1) hyper-parameters. A uniform distribution (hyper-prior) is assumed for each of the ?i (For more details, please see [95]). For s, we specify a similar sparsity promoting prior given by p(s|?) = N? i=1 N (si|0, ??1i ) (3.10) where ? = [?1, ?2, . . . , ?N ]T is a vector of N hyper-parameters, and each of the ?i follows a uniform distribution. This completes the description of the likelihood p(y|w, s) and the prior p(w, s). Next, we proceed to the inference stage. 3.1.2.1 Inference Our inference method follows the RVM inference steps. We first find point- estimates for the hyper-parameters ?, ? and the inlier noise variance ?2 by maxi- mizing p(y|?,?, ?2) with respect to these parameters, where p(y|?, ?, ?2) is given by p(y|?,?, ?2) = ? p(y|w, s, ?2)p(w|?)p(s|?) dwds (3.11) Since all the distributions in the right hand side are Gaussian with zero mean, it can be shown that p(y|?,?, ?2) is a zero-mean Gaussian distribution with covariance matrix ?2I+?A?1?T , where A = diag(?0, . . . , ?N , ?1, . . . , ?N). The maximization of p(y|?,?, ?2) with respect to the hyper-parameters ?, ? and the noise variance ?2 is known as evidence maximization and can be solved by an EM algorithm [95] or a faster implementation proposed in [96]. We will refer to these estimated parameters as ?MP , ?MP and ?2MP . 49 With this point estimation of the hyper-parameters and the noise variance, the (conditional) posterior distribution p(w, s|y,?MP ,?MP , ?2MP ) is given by p(y|w, s, ?2MP )p(w|?MP )p(s|?MP ) p(y|?MP ,?MP , ?2MP ) (3.12) Since all the terms in the numerator are Gaussian, it can be shown that this is again a Gaussian distribution with covariance and mean given by ? = (??2MP?T?+AMP)?1 and ? = ??2MP??Ty (3.13) where AMP = diag(?MP0, . . . , ?MPN , ?MP1, . . . , ?MPN). To obtain the posterior distribution p(w, s|y), we need to integrate out ?,?, ?2 from p(w, s|y,?,?, ?2), that is, p(w, s|y) = ? p(w, s|y,?,?, ?2)p(?,?, ?2|y) d?d?d?2 (3.14) However, this is analytically intractable; it has been empirically observed in [95], that for predictive purposes, p(?,?, ?2|y) is very well approximated by ?(?MP ,?MP , ?2MP ). With this approximation, we have p(w, s|y) = p(w, s|y,?MP ,?MP , ?2MP ) (3.15) Thus, the desired joint posterior distribution of w and s is Gaussian with the pos- terior covariance and mean given by (3.13). For prediction, we use the mean as an estimate of w in the prediction model (3.2). 50 3.1.2.2 Prediction We use the prediction model (3.2) to predict y? for any new data x?. The predictive distribution of y? is given by p(y?|y,?MP , ?2MP ) = ? p(y?|w, ?2MP )p(w|y,?MP ) dw (3.16) where the posterior distribution of w, p(w|y,?MP ), can be easily obtained from the joint posterior distribution p(w, s|y,?MP ,?MP , ?2MP ). p(w|y,?MP ) is a Gaussian distribution with mean and covariance given by the mean and covariance of the parameter part (w) of the ws vector, that is, ?w = ?(1 : N + 1, 1 : N + 1) and ?w = ?(1 : N + 1) (3.17) With this, it can be shown that the predictive distribution of y? is Gaussian with mean ?? and variance ??2 given by ?? = ?wT?(x?) and ??2 = ?2MP + ?(x?)T?w?(x?) (3.18) 3.1.2.3 Advantage over other Robust RVM Algorithms The proposed robust Bayesian formulation (RB-RVM) is very similar to the original RVM formulation. All we have to do is, instead of inferring just the param- eter vector w, infer the joint parameter-outlier vector ws, by replacing the ? matrix with the corresponding ? = [?|I] matrix, and use only the parameter part of the estimated ws for prediction. It is this simple modification of the original RVM that gives RB-RVM the computational advantage over [35, 97, 108] because we can use an existing fast implementation of RVM [96] to solve the robust RVM problem. 51 3.1.3 Basis Pursuit RVM (BP-RVM) A very similar objective, as in the Bayesian approach, can be achieved by solving the following optimization problem: min ws ||ws||0 subject to ||y??ws||2 ?  (3.19) where ||ws||0 is the l0 norm, which counts the number of non-zero elements in ws. The cost function promotes a sparse solution for ws and the constraint term is essentially the likelihood term of the Bayesian approach, with  related to the inlier noise variance ?2. w obtained after solving this problem can be used for prediction. However, this is a combinatorial problem; hence, it cannot be solved directly. This problem has been studied extensively in the sparse representation literature [26, 32], where a convex relaxation of the problem is solved: min ws ||ws||1 subject to ||y??ws||2 ?  (3.20) where the l0 norm in the cost function is replaced by the l1 norm, which makes it a convex problem; hence, it can be solved in polynomial time. This approach is known as Basis Pursuit Denoising (BPD) [26, 32], and we will refer to the robust algorithm based on BPD as the Basis Pursuit RVM (BP-RVM). Initially, the justification for using the l1 norm approximation was based on empirical observations [26]. However, recently in [23, 32], it has been shown that if ws is sparse to begin with, then under certain condition (?Restricted Isometry Property? or ?incoherence?) on the matrix ?, (3.19) and (3.20) will have the same solution up to a bounded uncertainty due to . However, in our case the matrix ? depends on the training dataset and the 52 associated kernel function, and it might not satisfy the desired conditions mentioned above. ?4 ?3 ?2 ?1 0 1 2 3 4 ?1 ?0.5 0 0.5 1 1.5 2 Prediction by RVM for symmetric outliers case sinc(x) data points RVM curve RMS error : 0.1152 ?4 ?3 ?2 ?1 0 1 2 3 4 ?1 ?0.5 0 0.5 1 1.5 2 Prediction by RB?RVM for symmetric outliers case sinc(x) data points RB?RVM curve detected outliers RMS error : 0.0654 ?4 ?3 ?2 ?1 0 1 2 3 4 ?1 ?0.5 0 0.5 1 1.5 2 Prediction by BP?RVM for symmetric outliers case sinc(x) data points BP?RVM curve detected outliers RMS error : 0.0951 Figure 3.1: Prediction by the three algorithms: RVM, RB-RVM and BP-RVM in the presence of symmetric outliers for N = 100, f = 0.2 and ? = 0.1. Data which are enclosed by a box are the outliers found by the robust algorithms. Prediction error are also shown in the figures. RB-RVM gives the lowest prediction error. 3.2 Empirical Evaluation In this section, we empirically evaluate the proposed robust versions of the RVM, RB-RVM and BP-RVM, with respect to the baseline RVM. We consider three important intrinsic parameters of the robust regression problem: the outlier fraction (f), the inlier noise variance (?2) and the number of training data points (N), and study the performance of the three algorithms for different settings of these 53 ?4 ?3 ?2 ?1 0 1 2 3 4 ?0.5 0 0.5 1 1.5 2 Prediction by RVM for asymmetric outlier case sinc(x) data points RVM curve RMS error : 0.1672 ?4 ?3 ?2 ?1 0 1 2 3 4 ?0.5 0 0.5 1 1.5 2 Prediction by RB?RVM for asymmetric outliers case sinc(x) data points RB?RVM curve detected outliers RMS error : 0.0663 ?4 ?3 ?2 ?1 0 1 2 3 4 ?0.5 0 0.5 1 1.5 2 Prediction by BP?RVM for asymmetric outliers case sinc(x) data points BP?RVM curve detected outliers RMS error : 0.1588 Figure 3.2: Prediction by the three algorithms: RVM, RB-RVM and BP-RVM in the presence of asymmetric outliers for N = 100, f = 0.2 and ? = 0.1. Data which are enclosed by a box are the outliers found by the robust algorithms. Prediction error are also shown in the figures. Clearly, RB-RVM gives the best result. parameters.1 Next, we describe the experimental setup, which is quite similar to that of [35]. We generate our training data using the normalized sinc function sinc(x) = sin(pix)/(pix). yi of the inlier data are obtained by adding a Gaussian noise N (0, ?2) to sinc(xi). For the outliers, we consider two generative models: 1) symmetric and 2) asymmetric. In the symmetric model, yi is obtained by adding a uniform noise of range [?1,+1] to sinc(xi), and in the asymmetric model, yi is obtained by 1For solving RVM and RB-RVM, we have used the publicly available code in http://www.vectoranomaly.com/downloads/downloads.htm. For solving BP-RVM, we have used l1-magic: http://www.acm.caltech.edu/l1magic/ 54 adding a uniform noise of range [0,+1] to sinc(xi). With each training data xj , we associate a Gaussian kernel: K(x, xj) = exp (?(x? xj)2/r2), with r = 2. Figures 3.1 and 3.2 show the performance of the three algorithms for the symmetric and asymmetric outlier cases for N = 100, f = 0.2 and ? = 0.1. The performance criterion used for comparison is the root mean square (RMS) prediction error. Note that, after inference, robust methods can also classify the training data as inliers or outliers. We classify a data as an outlier if the prediction error (absolute difference between the predicted and the observed value) is greater than three times the inlier noise standard deviation, which is also estimated during inference. From figures 3.1 and 3.2, we conclude that RB-RVM gives the lowest prediction error, followed by BP-RVM and RVM. In the following sections, we study the performance of the algorithms by varying the intrinsic parameters: f , ? and N . Varying the Outlier fraction: We vary the outlier fraction f , with the other parameters fixed at N = 100 and ? = 0.1. Figure 3.3 shows the prediction error vs. outlier fraction for the symmetric and asymmetric outliers cases. For both the cases, RB-RVM gives the best result. For the symmetric case, BP-RVM gives lower prediction error than RVM but for the asymmetric case they give similar result. Varying the Inlier Noise Std: We vary the inlier noise standard deviation ?, with the other parameters fixed at N = 100 and f = 0.2. Figure 3.4 shows that RB-RVM gives the lowest prediction error until about ? = 0.2, after which RVM gives better result. This is because in our experimental setup, at approximately ? = 0.3, the distinction between the inliers and outliers cease to exist. For Gaussian distribution, most of the probability density mass lies within 3? of the mean, and any 55 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.04 0.06 0.08 0.1 0.12 0.14 0.16 Outlier fraction RMS prediction erro r Prediction error Vs Outlier fraction for symmetric outliers RVM RB?RVM BP?RVM 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 Outlier fraction RMS prediction erro r Prediction error Vs Outlier fraction for asymmetric outliers RVM RB?RVM BP?RVM Figure 3.3: Prediction error vs. outlier fraction for the symmetric and asymmetric outlier cases. RB-RVM gives the best result for both the cases. For the symmetric case, BP-RVM gives lower prediction error than RVM but for the asymmetric case they give similar result. data within this region can be considered as inliers and those outside as outliers. Thus, for ? = 0.3, 3? = 0.9; most of the outliers will be within this range and effectively become inliers. 0 0.1 0.2 0.3 0.4 0 0.05 0.1 0.15 0.2 0.25 Inlier noise standard deviation RMS prediction erro r Prediction error vs. Inlier noise std for symmetric outliers RVM RB?RVM BP?RVM 0 0.1 0.2 0.3 0.4 0 0.05 0.1 0.15 0.2 0.25 Inlier noise standard deviation RMS prediction erro r Prediction error Vs Inlier noise std for asymmetric outliers RVM RB?RVM BP?RVM Figure 3.4: Prediction error vs. inlier noise standard deviation for the symmetric and asymmetric outlier cases. RB-RVM gives the lowest prediction error until about ? = 0.2, after which RVM gives better result. This is because for our experimental setup, at approximately ? = 0.3, the distinction between the inliers and outliers cease to exist. Varying the Number of Data Points: We vary the number of data points N , with f = 0.2 and ? = 0.1. Figure 3.5 shows that the performance of all the three 56 algorithms improve with increasing N . 0 200 400 600 800 1000 0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 Number of data points RMS prediction erro r Prediction error Vs Number of data points for symmetric outliers RVM RB?RVM BP?RVM 0 200 400 600 800 1000 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 Number of data points RMS prediction erro r Prediction error vs. Number of data points for asymmetric outliers RVM RB?RVM BP?RVM Figure 3.5: Prediction error vs. number of data points for the symmetric and asymmetric outlier cases. For all the three algorithms, performance improves with increasing N . Discussion: We conclude that in presence of outliers RB-RVM and BP-RVM perform better than RVM. The performance of BP-RVM is poor as compared to RB- RVM; this indicates that the l1 norm relaxation (3.20) is not a good approximation of the l0 norm problem (3.19), when ? does not satisfy the desired Restricted Isometry Property [23]. Henceforth, we will only consider RB-RVM for solving the image denoising and age regression problems. 3.3 Robust Image Denoising Recently, kernel regression has been used for solving a number of traditional image processing tasks such as image denoising, image interpolation and super- resolution with a great deal of success [92, 93]. The success of these kernel regression methods prompted us to test RB-RVM for solving the problem of image denoising in the presence of salt and pepper noise. Salt and pepper noise are randomly occurring white and black pixels in an image and can be considered as outliers. 57 Any image I(x, y) can be considered as a surface over a 2D grid. Given a noisy image, we can use regression to learn the relation between the intensity and the 2D grid of the image. If some kind of a local smoothness is imposed by the regression machine, we can use it for denoising the image. Here, we consider RVM and RB-RVM for achieving this purpose. We divide the image into many (overlap- ping) patches, and for each patch we infer the parameters of RVM and RB-RVM. We then use the inferred parameters for predicting the intensity of the central pixel of the patch, which is the denoised intensity at that pixel. This is done for all the pixels of the image to obtain the denoised image. Motivated by [92], we consider a composition of Gaussian and polynomial kernel as the choice of kernel in our regres- sion machines. The Gaussian kernel is defined as Kg(x,xj) = exp (?||x? xj||2/r2), where r is the scale of the Gaussian kernel, and the polynomial kernel is defined as Kp(x, xj) = (xTxj +1)p, where p is the order of the polynomial kernel. We consider kernels of the form: K(x,xj) = Kg(x,xj)Kp(x,xj). To test the proposed kernel denoising algorithms, we add 20% salt and pepper noise to the original images. For RVM and RB-RVM, we choose patch size of 6? 6, r = 2.1 and p = 1. Figure 3.6 shows the image denoising result by RVM, RB-RVM and 3? 3 median filter. The denoised images and the corresponding RMSE values show that RB-RVM gives the best denoising result. Next, we vary the amount of salt and pepper noise, and obtain the mean RMSE value over the commonly used images of Barbara, House, Boat, Baboon, Pepper and Elaine. Figure 3.7 shows that RB-RVM gives better result than the median filter, which is the most commonly used filter for denoising images with salt and pepper noise. Further, we test RB- 58 Figure 3.6: Results on Salt and pepper noise removal: first column: RVM, second column: RB- RVM, third column: Median filter, fourth column: Gaussian filter. The RMSE values are also shown in the figure; RB-RVM gives the best result. RVM for denoising an image corrupted by a mixture of Gaussian noise of ? = 5 and 5% salt and pepper noise. From figure 3.8, we conclude again that RB-RVM gives much better denoising result as compared to RVM. 3.4 Age Estimation from Facial Images The goal of facial age estimation is to estimate the age of a person from his/her image. The most common approach for solving this problem is to extract some relevant features from the image, and then learn the functional relationship between these features and the age of the person using regression techniques [54, 53, 37, 40]. 59 0.1 0.15 0.2 0.25 0.3 0.35 0.4 10 15 20 25 30 35 40 Percentage of Salt and pepper noise Mean RMSE over seven image s Mean RMSE vs. Percentage of Salt and pepper noise Median filter RB?RVM RVM Figure 3.7: Mean RMSE value over seven images vs. percentage of salt and pepper noise. RB-RVM gives better performance than the median filter. Here, we intend to test the RB-RVM regression for the age estimation problem. For our experiments, we use the publicly available FG-Net dataset [1], which contains 1002 images of 82 subjects at different ages. As a choice of features, we use geometric features proposed in [100], which are obtained by computing the ?flow field? at 68 fiducial points with respect to a reference face image. To decide on a particular kernel for regression, we perform leave-one-person- out testing, by RB-RVM, for different choices of kernel. Table 3.1 shows the mean absolute error (MAE) of age prediction for different values of the scale parameter r of the Gaussian kernel. r = 0.2 gives the best result, and we use this value of r for all the subsequent experiments. Next, we use RB-RVM to categorize the whole dataset into inliers and outliers. The algorithm found 90 outliers; some of the inliers and outliers are shown in figure 3.9. With this knowledge of the inliers and the outliers, we perform the leave-one-person-out test again. Table 3.2 shows the mean absolute error (MAE) of age prediction for the inliers and the outliers separately. The small 60 Figure 3.8: Mixture of Gaussian and salt and pepper noise removal experiment: denoised images by RVM and RB-RVM with their corresponding RMSE values. This experiment again shows that the RB-RVM based denoising algorithm gives much better result than the RVM based one. prediction error for the inliers and the large prediction error for the outliers indicate that the inlier vs. outlier categorization by RB-RVM was good. Table 3.2 also shows that the prediction error of the RB-RVM for the whole dataset is lower than that of the RVM. To put the numbers in the table in context, the state-of-the-art algorithm [40] gives a prediction error of 5.07 as compared to the prediction error of 4.61 obtained for the inliers by the RB-RVM. r 0.1 0.2 0.3 0.4 MAE 7.10 6.52 6.54 6.62 Table 3.1: Mean absolute error (MAE) of age prediction for different values of the scale parameter r of the Gaussian kernel. The prediction errors are for the leave-one-person-out testing by RB- RVM. r = 0.2 gives the best result, and we use this r for all the subsequent experiments. To further test RB-RVM, we add various amount of controlled outliers. Before 61 Figure 3.9: Some inliers and outliers found by RB-RVM. Most of the outliers are images of older subjects like Outlier A and B. This is because there are less number of samples of older subjects in the FG-Net database. Outlier C has an extreme pose variation from the usual frontal faces of the database; hence, it is an outlier. The facial geometry of Outlier D is very similar to that of younger subjects, such as big forehead and small chin, so it is classified as an outlier. doing this, we remove the outliers found in the previous experiment. We use 90% of this new dataset as the training set and the remaining 10% as the test set. We introduce controlled outliers only in the training set, and perform age prediction on the test set by both RVM and RB-RVM. We vary the fraction of the outliers on the training set and measure the age prediction error on the test set. Figure 3.10 shows that RB-RVM gives much lower prediction error as compared to RVM. This experiment again suggests that RB-RVM should be preferred over RVM for the age estimation problem. 62 Inlier MAE Outlier MAE All MAE RB-RVM 4.61 25.87 6.52 RVM N.A. N.A. 6.80 Table 3.2: Mean absolute error (MAE) of age prediction for the inliers, outliers and the whole dataset using RB-RVM. Since RVM does not differentiate between inliers and outliers, we only show the prediction error for the whole dataset. The small MAE for the inliers and the large MAE for the outliers indicates that the inlier vs. outlier categorization by RB-RVM was good. Also, note that the prediction error of the RB-RVM for the whole dataset is lower than that of the RVM. 0.2 0.4 0.6 0.8 1 5 6 7 8 9 10 11 Outlier fraction MAE of age estimatio n MAE of Age Prediction vs. Outlier fraction RVM RB?RVM Figure 3.10: Mean absolute error (MAE) of age prediction vs. fraction of controlled outliers added to the training dataset. RB-RVM gives much lower prediction error as compared to the RVM. Also, note that the prediction error is reasonable even with outlier fraction as high as 0.7. 63 Chapter 4 Large-Scale Matrix Factorization with Missing Data under Additional Constraints Many computer vision problems such as SfM [98], non-rigid SfM [12] and photometric stereo [42] can be formulated as a matrix factorization problem. In all these problems, the measured data can be arranged as a matrix of a known rank and the low-ranks factors of this matrix (obtained by matrix factorization) provide the solution of the problems. Let M be the measurement matrix of dimension m?n and rank r. The objective is to factorize this measurement matrix M into factors A and B of dimensions m?r and n?r, respectively such that the error ||M?ABT || is minimized. When all the elements of M are known, and assuming that the elements are corrupted by Gaussian noise, the solution to this problem is given by the singular value decomposition (SVD) of M . However, in most real applications, many of the elements of M will be missing and we need to solve a modified problem given by: min A,B ||W (M ?ABT )||2F + ?1||A||2F + ?2||B||2F (4.1) where is the Hadamard element-wise product, W is a weight matrix with zeroes at indices corresponding to the missing elements of M , and ||A||2F , ||B||2F are regu- larization terms which prevent data over-fitting. Matrix factorization with missing data is a difficult non-convex problem with no known globally convergent algorithm. The damped Newton algorithm [14], a variant of Newton?s method, is one of the 64 most popular algorithms for solving this problem. However, this algorithm has high computational complexity and memory requirements and so cannot be used for solving large scale problems. We formulate the matrix factorization with missing data problem as a low-rank semidefinite program LRSDP [16], which is essentially a rank constrained semidefi- nite programming problem (SDP) and was proposed to solve large SDP in an efficient way. The advantages of formulating the matrix factorization problem as a LRSDP problem are the following: 1) It inherits the efficiency of the LRSDP algorithm. The LRSDP algorithm is based on a quasi-Newton method, which has lower computa- tional complexity and memory requirements than that of Newton?s method, and so is ideally suited for solving large scale problems. 2) Many additional constraints, such as the ortho-normality constraints for the orthographic SfM problem, can be easily incorporated into the LRSDP-based factorization formulation; this is possible because of the flexible framework of the LRSDP (see section 4.1). Related works: Algorithms for matrix factorization in the presence of miss- ing data can be broadly divided into two main categories: initialization algorithms and iterative algorithms. Initialization algorithms [98, 46, 39, 64, 94] generally minimize an algebraic or approximate cost of (4.1) and are used for providing a good starting point for the iterative algorithms. Iterative algorithms are those al- gorithms that directly minimize the cost function (4.1). Alternation algorithms [84, 105, 45, 2, 11, 50], damped Newton algorithm [14] and our approach fall under this category. Alternation algorithms are based on the fact that if one of the factors A or B is known, then there are closed form or numerical solutions for the other 65 factor. Though the alternation-based algorithms minimize the cost in each itera- tion, they suffer from flatlining, requiring an excessive number of iterations before convergence [14]. To solve this problem, damped Newton and hybrid algorithms between damped Newton and alternation were proposed in [14]. Although these al- gorithms give very good results, they cannot be used for solving large-scale problems because of their high computational complexity and memory requirements. Other algorithms, based on Newton?s method, have been proposed in [17, 72], which also cannot be used for solving large-scale problems. The matrix factorization with missing data problem is closely related to the matrix completion problem [21]. The goal of matrix completion is to find a low-rank matrix which agrees with the observed entries of the matrix M . Recently, many effi- cient algorithms have been proposed for solving this problem [19, 60, 65, 55, 52, 66]. Some of them [55, 52, 66] are formulated as matrix factorization problems. How- ever, these algorithms can not handle additional constraints. Matrix factorization also arises while solving the collaborative filtering problem. Collaborative filtering is the task of predicting the interests of a user by collecting the taste information from many users, for example in a movie recommendation system. In [88], collab- orative filtering is formulated as a matrix completion problem and solved using a semidefinite program. Later a fast version, using conjugate gradient, was proposed in [80], but this also cannot handle additional constraints. The organization of the rest of the chapter is as follows: in section 4.1, we set up the necessary background for LRSDP. In section 4.2, we formulate the factorization problem as a LRSDP problem and in section 4.3, discuss its relation with the matrix 66 completion problem. In section 4.4, we experimentally evaluate the LRSDP-based factorization algorithm on synthetic data and for some computer vision problems drawn from SfM and photometric stereo. 4.1 Background: Low-rank semidefinite programming (LRSDP) LRSDP was proposed in [16] to efficiently solve a large scale SDP [101]. In the following paragraphs, we briefly define the SDP and LRSDP problems, and discuss the efficient algorithm used for solving the LRSDP problem. SDP is a subfield of convex optimization concerned with the optimization of a linear objective function over the intersection of the cone of positive semidefinite matrices with an affine space. The standard-form SDP is given by: min C ?X subject to Ai ?X = bi, i = 1, . . . , k X  0 (4.2) where C and Ai are n ? n real symmetric matrices, b is k-dimensional vector, and X is an n ? n matrix variable, which is required to be symmetric and positive semidefinite, as indicated by the constraint X  0. The operator ? denotes the inner product in the space of n? n symmetric matrices defined as A ?B = trace(ATB) = ?n i=1 ?n j=1 AijBij . The most common algorithms for solving (4.2) are the interior point methods [101]. However, these are second-order methods, which need to store and factorize a large (and often dense) matrix and hence are not suitable for solving large scale problems. In LRSDP a change of variables is introduced as X = RRT , where R is a real, n ? r matrix with r ? n. This has the advantage that it removes the non-linear 67 constraint X  0, which is the most challenging aspect of solving (4.2). However, this comes with the cost that the problem may no longer be a convex problem. The LRSDP formulation is given by: (Nr) min C ?RRT subject to Ai ?RRT = bi, i = 1, . . . , k (4.3) Note that the LRSDP formulation depends on r; when r = n, (4.3) is equivalent to (4.2). But the intention is to choose r as small as possible so as to reduce the number of variables, while the problem remains equivalent to the original problem (4.2). A non-linear optimization technique called the augmented Lagrangian method is used for solving (4.3). The majority of the iterations in this algorithm involve the minimization of an augmented Lagrangian function with respect to the variable R which is done by a limited memory BroydenFletcherGoldfarbShanno (BFGS) method. BFGS, a quasi-Newton method, is much more efficient than Newton?s method both in terms of computations and memory requirement. The LRSDP algorithm further optimizes the computations and storage requirements for sparse C and Ai matrices, which is true for problems of our interest. For further details on the algorithm, see [16, 15]. 4.2 Matrix factorization using LRSDP (MF-LRSDP) In this section, we formulate the matrix factorization with missing data as an LRSDP problem. We do this in the following stages: in section 4.2.1, we look at the noiseless case, that is, where the measurement matrix M is not corrupted with 68 noise, followed by the noisy measurement case in section 4.2.2, and finally in section 4.2.3, we look at how additional constraints can be incorporated in the LRSDP formulation. 4.2.1 Noiseless Case When the observed elements of the m ? n dimensional measurement matrix M are not corrupted with noise, a meaningful cost to minimize would be: min A,B ||A||2F + ||B||2F subject to (ABT )i,j = Mi,j for (i, j) ? ?, (4.4) where ? is the index set of the observed entries of M , and A, B are the desired factor matrices of dimensions m? r and n? r respectively. We assume r is known, for example, in affine SfM r = 4 and in photometric stereo r = 3. To formulate this as a LRSDP problem, we introduce a (m+n)? r dimensional matrix R = ? ??? A B ? ???. Then RRT = ? ??? AAT ABT BAT BBT ? ??? (4.5) We observe that the cost function ||A||2F + ||B||2F can be expressed as trace(RRT ) and the constraints as (RRT )i,j+m = Mi,j. Thus, (4.4) is equivalent to: min R trace(RRT ) subject to (RRT )i,j+m = Mi,j for (i, j) ? ? (4.6) This is already in the LRSDP form, since we can express the above equation as min R C ?RRT subject to Al ?RRT = bl, l = 1, . . . , |?| (4.7) 69 where C is an (m + n) ? (m + n) identity matrix, and to simplify the notations we have introduced the index l with ?(l) = (i, j) l = 1, . . . , |?|. Al are sparse matrices with the non-zero entries at indices (i, j + m) and (j + m, i) equal to 1/2 and bl = Mi,j. This completes the formulation of the matrix factorization problem as an LRSDP problem for the noiseless case. Next we look at the noisy case. 4.2.2 Noisy case When the observed entries of M are corrupted with noise, an appropriate cost function to minimize would be: min A,B ||W (M ? ABT )||2F + ?||A||2F + ?||B||2F (4.8) where is the Hadamard element-wise product and W is a weight matrix with zeros corresponding to the missing entries and 1 to the observed entries in M . To formulate this as an LRSDP problem, we introduce noise variables el, l = 1, 2, . . . , |?| which are defined as el = (M ? (ABT ))l . Now, (4.8) can be expressed as min A,B,e ||e||22+?||A||2F +?||B||2F subject to (M?ABT )l = el for l = 1, 2, . . . , |?| (4.9) Next, we aim to formulate this as a LRSDP problem. For this, we construct an augmented noise vector E = [eT 1]T and define R to be R = ? ??????? ? ??? A B ? ??? 0 0 E ? ??????? (4.10) 70 R is a ?block-diagonal? matrix, where the blocks are of sizes (m+n)?r and (|?|+1)?1 respectively. With this definition, RRT is a block-diagonal matrix given by RRT = ? ??????? ? ??? AAT ABT BAT BBT ? ??? 0 0 EET ? ??????? (4.11) We can now express (4.8) in the following LRSDP form: min R C ?RRT subject to Al ?RRT = bl, l = 1, . . . , |?|+ 1 (4.12) with C = ? ??? ?I(m+n)?(m+n) 0 0 I(|?|+1)?(|?|+1) ? ??? (4.13) Note that the number of constraints |?|+ 1 in (4.12) is one more than the number of observations |?|. This is because the last constraint is used to set E|?|+1 = 1, which is done by choosing A|?|+1 to be a sparse matrix with the non-zero entry at index (|?|+ l+m+n, |?|+1+m+n) equal to 1 and b|?|+1 = 1. For the remaining values of l, the Al are sparse matrices with the non-zero entries at indices (i, j+m), (j +m, i), (|?|+1+m+n, l+m+n) and (l+m+n, |?|+1+m+n) equal to 1/2 and bl = Ml. Note that (4.12) is a block-LRSDP problem (R has a block-diagonal structure), which is a simple extension of the original LRSDP problem [15]. This completes the LRSDP formulation for the noisy case. Next, we look at incorporating additional constraints in this framework. 71 4.2.3 Enforcing Additional Constraints Many additional constraints can be easily incorporated in the LRSDP formu- lation. We illustrate this using the specific example of orthographic SfM [98]. SfM is the problem of reconstructing the scene structure (3-D point positions and camera parameters) from 2-D projections of the points in the cameras. Suppose that m/2 cameras are looking at n 3-D points, then under the affine camera model, the 2-D imaged points can be arranged as an m? n measurement matrix M with columns corresponding to the n 3-D points and rows corresponding to the m/2 cameras (2 consecutive rows per camera). M is a rank 4 matrix and can be factorized as M = ABT , where A is a m?4 camera matrix and B is a n?4 structure matrix with the last column of B an all-one vector, i.e. B = [X 1]. Under the orthographic camera model, A has more structure (constraints). To state these constraints pre- cisely, we express the A matrix as A = [P t], where P is a m ? 3 sub-matrix consisting of the first three columns and t is the last column vector. A satisfies the following constraints: rows of P that corresponds to the same camera are ortho- normal. This implies that the diagonal elements of the matrix PP T are equal to 1 (normality constraint) and appropriate off-diagonal elements are 0 (orthogonality constraint). Now, ABT = PXT + t1T and the observation error can be expressed as ei,j = (M ? PX)i,j ? ti for (i, j) ? ?. A meaningful optimization problem to solve here would be to minimize the observation error subject to the ortho-normality 72 constraints: min e,P,X,t ||e||22 subject to ei,j = (M ? PX)i,j ? ti, (i, j) ? ? (PP T )k,k = 1, k = 1, 2, . . . , m (PP T )k,l = 0, if k and l are rows from same camera (4.14) To formulate this as an LRSDP problem, we introduce the augmented trans- lation variable T = [tT 1]T , and propose the following block-diagonal matrix R: R = ? ??????????? ? ??? P X ? ??? 0 0 0 T 0 0 0 E ? ??????????? (4.15) With this definition of R, we can express (4.14) as a LRSDP problem, following steps similar to the previous sections. This completes our illustration on the incorporation of the ortho-normality constraints for the orthographic SfM case. This example should convince the reader that many other application-specific constraints can be directly incorporated into the LRSDP formulation; this is because of the underlying SDP structure of the LRSDP. 4.3 Matrix Completion, Uniqueness and Convergence of MF-LRSDP In this section, we state the main result of the matrix completion theory and discuss its implications for the matrix factorization problem. 73 4.3.1 Matrix Completion Theory Matrix completion theory considers the problem of recovering a low-rank ma- trix from a few samples of its entries: min X rank(X) subject to Xi,j = Mi,j for (i, j) ? ? (4.16) More specifically, it considers the following questions: 1) when does a partially ob- served matrix have a unique low-rank solution? 2) How can this matrix be recovered? The answers to these questions were provided in theorem 1.3 of [21] which states that if 1) the matrix M , that we want to recover, has row and columns spaces incoherent with the standard basis and 2) we are given enough entries (? O(rd6/5 log d), where d = max(m,n)), then there exists a unique low-rank solution to (4.16). Further, the solution can be obtained by solving a convex relaxation of (4.16) given by: min X ||X||? subject to Xi,j = Mi,j for (i, j) ? ? (4.17) where ||X||? is the nuclear norm of X, given by the sum of its singular values. 4.3.2 Relation with Matrix Factorization and its Implications In matrix completion the objective is to find a minimum rank matrix which agrees with the partial observations (4.16), whereas in matrix factorization we as- sume the rank r to be known, as in the problems of SFM and photometric stereo, and we use the rank as a constraint. For example, in our LRSDP formulation, we have imposed this rank constraint by fixing the number of columns of the factors A and B to r. However, though the matrix completion and factorization problems 74 are defined differently, they are closely related as revealed by their very similar La- grangian formulations. This fact has been used in solving the matrix completion problem via matrix factorization with an appropriate rank [55, 52, 66]. We should also note that matrix completion theory helps us answer the question raised in [14]: when is missing data matrix factorization unique (up to a gauge)? And from the discussion in the previous section, it should be clear that the conditions of the ma- trix completion theory are sufficient for guaranteeing us the required uniqueness. Further, in our experimental evaluations (see next section), we have found that the LRSDP formulation, though a non-convex problem in general, typically converges to the global minimum solution under these conditions. 4.4 Experimental Evaluation We evaluate the performance of the proposed LRSDP-based factorization al- gorithm (MF-LRSDP) on both synthetic and real data and compare it against other algorithms such as alternation [14], damped Newton [14] and OptSpace [52], which is one of state-of-the-art algorithms for matrix completion. 4.4.1 Evaluation with Synthetic Data The important parameters in the matrix factorization with missing data prob- lem are: the size of the matrix M characterized by m and n, rank r, fraction of missing data and the variance ?2 of the observation noise. We evaluate the fac- torization algorithms by varying these parameters. We consider two cases: data 75 without noise and data with noise. For synthetic data without noise, we generate n ? n matrices M of rank r by M = ABT , where A and B are n ? r random matrices with each entry being sampled independently from a standard Gaussian distribution N (0, 1). Each entry is then revealed randomly according to the missing data fraction. For synthetic data with noise, we add independent Gaussian noise N (0, ?2) to the observed entries generated as above. Exact Factorization: a first comparison. We study the reconstruction rate of different algorithms by varying the fraction of revealed entries per column (|?|/n) for noiseless 500? 500 matrices of rank 5. We declare a matrix to be recon- structed if ||M ? ?M ||F/||M ||F ? 10?4, where ?M = ?A ?B is the reconstructed matrix and ||.||F denotes the Frobenius norm. Reconstruction rate is defined as the fraction of trials for which the matrix was successfully reconstructed. In all the synthetic data experiments, we performed 10 trials. Figure 4.1(a) shows the reconstruction rate by MF-LRSDP, alternation and OptSpace. MF-LRSDP gives the best recon- struction results as it needs fewer observations for matrix reconstruction than the other algorithms. It is followed by OptSpace and alternation, respectively. MF- LRSDP also takes the least time, followed by OptSpace and alternation. For similar comparison to other matrix completion algorithms such as ADMiRA [55], SVT [19] and FPCA [60], the interested reader can look at [52], where OptSpace was shown to be consistently better than these algorithms. For the remaining experiments on synthetic data, we compare MF-LRSDP against OptSpace. Note that we have not included the damped Newton algorithm in this comparison because it is very slow for matrices of this size. 76 0 10 20 30 400 0.2 0.4 0.6 0.8 1 |?|/n Reconstruction rat e MF?LRSDP Alternation OptSpace (a) Reconstruction rate 0 10 20 30 4010 ?2 100 102 104 |?|/n Time in seconds in log scal e MF?LRSDP Alternation OptSpace (b) Timing results Figure 4.1: (a) Reconstruction rate vs. fraction of revealed entries per column |?|/n for 500?500 matrices of rank 5 by MF-LRSDP, alternation and OptSpace. The proposed algorithm MF-LRSDP gives the best reconstruction results since it can reconstruct matrices with fewer observed entries. (b) Time taken for reconstruction by different algorithms. MF-LRSDP takes the least time. Exact Factorization: vary size. We study the reconstruction rate vs. fraction of revealed entries per column |?|/n for different sizes n of rank 5 square matrices by MF-LRSDP and OptSpace. Figure 4.2(a) shows that MF-LRSDP re- constructs matrices from fewer observed entries than OptSpace. Exact Factorization: vary rank. We study the reconstruction rate vs. |?|/n as we vary the rank r of 500? 500 matrices. Figure 4.2(b) again shows that MF-LRSDP gives better results than OptSpace. Noisy Factorization: vary noise standard deviation. For noisy data, we use the root mean square error RMSE = 1/?mn||M ? ?M ||F as a performance measure. We vary the standard deviation ? of the additive noise for rank 5, 200?200 matrices and study the performance by MF-LRSDP, OptSpace, alternation and damped Newton. Figure 4.2(c) shows that all the algorithms perform equally well. 77 5 10 15 20 25 30 35 400 0.2 0.4 0.6 0.8 1 |?|/n Reconstruction rat e OptSpace n=400 n=1000 n=2000 MF?LRSDP n=400 n=1000 n=2000 (a) Reconstruction rate for dif- ferent sizes 0 50 100 1500 0.2 0.4 0.6 0.8 1 |?|/n Reconstruction rat e OptSpace r=5 r=10 r=20 MF?LRSDP r=5 r=10 r=20 (b) Reconstruction rate for dif- ferent ranks 0 1 2 3 4 50 0.5 1 1.5 2 2.5 Noise standard deviation ? RMS E MF?LRSDP Alternation Damped Newton OptSpace (c) RMSE vs. noise std Figure 4.2: (a) Reconstruction rate vs. fraction of revealed entries per column |?|/n for rank 5 square matrices of different sizes n by MF-LRSDP and OptSpace. MF-LRSDP reconstructs matrices from fewer observed entries than OptSpace. (b) Reconstruction rate vs. |?|/n for 500? 500 matrices of different ranks by MF-LRSDP and OptSpace. Again MF-LRSDP needs fewer observations than OptSpace. (c) RMSE vs. noise standard deviation for rank 5, 200?200 matrices by MF-LRSDP, OptSpace, alternation and damped Newton. All algorithms perform equally well. For timing comparisons, please refer to the supplementary material. 4.4.2 Evaluation with Real Data We consider three problems: 1) affine SfM 2) non-rigid SfM and 3) photometric stereo. Affine SfM. As discussed in section 4.2.3, for affine SfM, the m ? n mea- surement matrix M is a rank 4 matrix with the last column of matrix B an all-one vector. M is generally an incomplete matrix because not all the points are visible in all the cameras. We evaluate the performance of MF-LRSDP on the ?Dinosaur? sequence used in [14, 17], for which M is a 72?319 matrix with 72% missing entries. We perform 25 trials and at each trial we provide the same random initializations 78 to MF-LRSDP, alternation and damped Newton (OptSpace has its only initial- ization technique). We use the root mean square error over the observed entries, ||W (M ? ?M)||F/ ?|?|, as our performance measure. Figure 4.3 shows the cumu- lative histogram over the RMS pixel error. MF-LRSDP gives the best performance followed by damped Newton, alternation and OptSpace. We further tested the algo- rithms on a ?longer Dinosaur?, the result of which is provided in the supplementary material. Non-rigid SfM. In non-rigid SfM, non-rigid objects are expressed as a linear combination of b basis shapes. In this case, the m? n measurement matrix M can be expressed as M = ABT , where A is an m? 3b matrix and B is an n? 3b matrix [12]. This makes M a rank 3b matrix. We test the performance of the algorithms on the ?Giraffe? sequence [14, 17] for which M is a 240? 167 matrix with 30% missing entries. We choose the rank as 6. Figure 4.3 shows the cumulative histogram of 25 trials from which we conclude that MF-LRSDP, alternation and damped Newton give good results. Photometric Stereo. Photometric stereo is the problem of estimating the surface normals of an object by imaging that object under different lighting condi- tions. Suppose we have n images of the object under different lighting conditions with each image consisting of m pixels (m surface normals) and we arrange them as an m?n measurement matrix M . Then under Lambertian assumptions, we can ex- press M as M = ABT , where A is an m?3 matrix representing the surface normals and reflectance and B is an n?3 matrix representing the light-source directions and intensities [42]. Thus, M is a rank 3 matrix. Some of the image pixels are likely to 79 be affected by shadows and specularities and those pixels should not be included in the M matrix as they do not obey the Lambertian assumption. This makes M , an incomplete matrix. We test the algorithms on the ?Face? sequence [14, 17] for which M is a 2944?20 matrix with 42% missing entries. The cumulative histogram in fig- ure 4.3 shows that MF-LRSDP and damped Newton gives the best results followed by alternation and OptSpace. 2 4 6 80 5 10 15 20 25 RMS pixel error Cumulative histogra m MF?LRSDP Alternation Damped Newton OptSpace (a) Dinosaur sequence 0 0.2 0.4 0.6 0.8 10 5 10 15 20 25 RMS pixel error Cumulative histogra m MF?LRSDP Alternation Damped Newton OptSpace (b) Giraffe sequence 0.035 0.04 0.045 0.05 0.0550 5 10 15 20 25 RMS error Cumulative histogra m MF?LRSDP Alternation damped Newton OptSpace (c) Face sequence Figure 4.3: Cumulative histogram (of 25 trials) for the Dinosaur, Giraffe and the Face sequence. For all of them, MF-LRSDP consistently gives good results. Additional constraints: Orthographic SfM. Orthographic SfM is a spe- cial case of affine SfM, where the camera matrix A satisfies the additional constraint of ortho-normality, see section 4.2.3. We show here that incorporating these con- straints leads to a better solution. Figure 4.4 shows the input point tracks, recon- structed point tracks without the constraints and reconstructed point tracks with the constraints for the Dinosaur turntable sequence. Without the constraints many tracks fail to be circular, whereas with the constraints all of them are circular (the dinosaur sequence is a turntable sequence and the tracks are supposed to be circu- lar). Thus, incorporating all the constraints of a problem leads to better solution 80 and MR-LRSDP provides a very flexible framework for doing so. (a) Input point tracks (b) Reconstructed tracks with- out constraints (c) Reconstructed tracks with constraints Figure 4.4: (a) Input (incomplete) point tracks of the Dinosaur turntable sequence, (b) recon- structed tracks without orthonormality constraints and (c) reconstructed tracks with orthonormal- ity contraints. Without the constraints many tracks fail to be circular, whereas with the constraints all of them are circular (the dinosaur sequence is a turntable sequence and the tracks are supposed to be circular). 81 Chapter 5 Direct Recognition of Faces across Blur and Illumination Variations Face recognition is one of the most important problems of computer vision and significant progress has been made towards a solution for this problem [110]. Under controlled environments, with well regulated illumination, expression and pose conditions, the current state of the art face recognition algorithms perform quite well. However, for images acquired in uncontrolled environments, it is still a very challenging problem. We are interested in recognizing faces acquired from distant cameras. The main factors that make this a challenging problem are image degradations due to blur and noise, and variations in appearance due to illumination and pose [69]. In this dissertation, we address the problem of recognizing faces across blur and illumination variations. The current state of the art approach for recognizing blurred faces first deblurs the face image and then uses this deblurred image for recognition [70]. However, this involves solving the challenging problem of blind image deconvolution, which is not a necessary step for solving the recognition problem. We take a direct approach for recognizing blurred faces. Using the convolution model for blur, that is a blurred image can be modeled by convolution of a sharp image with a blur filter kernel, we show that the set of all images obtained by blurring a given image forms a convex set. Hence, we can associate such a set with each of the gallery images. Given a 82 probe image, we find its distance from each of the convex sets (associated with the gallery images) and assign it the identity of the gallery image with the minimum distance. We show that this algorithm is also statistically optimal; it is the maximum likelihood estimator of the blur kernel and the identity. Further, based on the set theoretic notions of blur, we present an algorithm to characterize the amount of blur our algorithm can handle for a given data set. In uncontrolled environments, for good recognition performance, it is impera- tive to model the appearance of a face under different illumination conditions. To- wards this end, we use the low-dimensional linear subspace model theory proposed in [8, 77]. For each face in the gallery, we create nine basis images that spans its cor- responding low-dimensional subspace. Using these basis images and the convolution model for blur, we associate a (non-convex) set, which represents all variations due to blur and illumination, with each gallery image. Given a probe image, we find the closest such set and assign its identity to the probe image. The main optimization step involves solving a quadratically constrained quadratic programming problem (QCQP), which we solve by alternately optimizing over the blur kernels and the illumination coefficients. The proposed algorithm is also statistically optimal; it is the maximum likelihood estimator of the blur kernel, illumination coefficients and identity. Related works: Face recognition from blurred images can be classified into three major approaches. In one approach, blur invariant features are extracted from the blurred image and then used for recognition. [6] follows this approach, where local phase quantization (LPQ) [71] method is used to extract blur invariant features. 83 Though this approach works very well for small blurs, it is not so effective for large blurs [70]. In another approach, the blurred image is first deblurred and then used for recognition. This is the approach taken in [43] and [70]. As discussed earlier, the drawback of this approach is that it first solves the challenging problem of blind image deconvolution, which is not a necessary step for solving the face recognition problem. Also, in [70], statistical models are learned for each blur kernel type and amount; this step might become infeasible when we try to capture the complete space of blur kernels. Finally, the last approach is the direct recognition approach. This is the approach taken in [89] and by us. In [89], artificially blurred versions of the gallery images are created and the blurred probe image is matched to them. Again, it is not possible to capture the whole space of blur kernels using this method. We avoid this problem by optimizing over the space of blur kernels. Our approach also has the additional advantage of incorporating the low-dimensional linear subspace model for capturing illumination variations. The organization of the rest of this chapter is as follows: In section 5.1 we propose our approach for recognizing blurred faces, in section 5.2 we incorporate the illumination model in our approach and in section 5.3 we perform experiments to evaluate the efficacy of our approach on many synthetic and real datasets. 5.1 Face Recognition Across Blur (FRB) We first present the convolution model for blur. Next, we describe the set of all images obtained by blurring a given image. We then describe our recognition 84 algorithm based on distances from such set. We also present an algorithm that characterizes the amount of blur our algorithm can handle in a given dataset. Convolution model for blur. A pixel in a blurred image is the weighted average of that pixel?s neighborhood in the original image. Thus, blur is modeled by a convolution operation between the original image and a blur kernel (filter) which represents the weights. Let I be the original image and H be the blur kernel, then the blurred image Ib is given by Ib = I ?H (5.1) where ? represents the convolution operation. For a blur kernel size of (2k + 1) ? (2k + 1), which is generally much smaller than the image size, the blurred image Ib is given by Ib(n1, n2) = k? i=?k k? j=?k H(i, j)I(n1 ? i, n2 ? j) (5.2) Blur kernels satisfy the following properties: their coefficients are non-negative H ? 0 and they sum up to 1 (?ki=?k?kj=?k H(i, j) = 1). These properties basically represents the fact that the weights in the weighted averaging operation are non- negative and sum up to unity. The blur kernel H may possess additional structures depending on the type of blur, such as circular-symmetry for out-of-focus blur, and these structures could be exploited during recognition. The set of all blurred images. We want to characterize the set of all images obtained by blurring a given image I. To do that we re-write (5.1) in a matrix-vector form. Let h ? R(2k+1)2 be the vector obtained by concatenating the columns of H , i.e. h = H(:), and similarly ib ? RN be the representation of Ib in the vector form. 85 Then (5.2) suggests that we can write (5.1), along with the blur kernel constraints, as ib = {Ah h ? 0, ?hi?1 = 1} (5.3) where A is a N ? (2k+1) matrix, obtained from I, with each row of A representing the neighborhood pixel intensities about the pixel indexed by the row. From the above equation, it is clear that the set of all blurred images obtained from I is given by B = {Ah|h ? 0, ?hi?1 = 1} (5.4) We have the following result about the set B. Proposition 5.1.1. The set of all images B obtained by blurring an image I forms a convex set. Moreover, this convex set is given by the convex hull of the columns of matrix A, which represents the neighborhood structure of the blur operation on I. Proof. Let i1 and i2 be elements from the set B. Then there exists h1 and h2, with both satisfying the conditions h ? 0 and ?h? = 1, such that i1 = Ah1 and i2 = Ah2. To show the set B is convex we need to show that for any ? satisfying 0 ? ? ? 1, i3 = ?i1 + (1? ?)i2 is an element of B. Now i3 = ?i1 + (1? ?)i2 = A(?h1 + (1? ?)h2) = Ah3. (5.5) Note that h3 satisfies both the non-negativity and sum conditions and hence i3 is 86 an element of B. Thus, B is a convex set. B is defined as B = {? i Aihi|hi ? 0, ? i hi = 1}, (5.6) which by definition is the convex hull of the columns Ai of A. A geometric face recognition algorithm. Let Ij , j = 1, 2, . . . ,M be the set of M gallery images. From analysis given above, every gallery image Ij has an associated convex set of blurred images Bj . Given the probe image Ib, we find the minimum distance between the image and a point in the set Bj by solving: rj = min h ||Ib ?Aj ? h||2 subject to h ? 0, ?h?1 = 1 (5.7) This is a convex quadratic program which can be solved efficiently. We compute rj for each j = 1, 2, . . . ,M and assign Ib the identity of the gallery image with minimum rj. If there are multiple images per class (person), we can use the k-nearest neighbor rule, i.e. we arrange the rj in an ascending order and find the class with repeats the most in the first k instances. In this algorithm, we can also incorporate additional information about the type of blur. The most commonly occurring blurs are out-of- focus, motion and atmospheric blur [10]. The out-of-focus blur and the atmospheric blur are circular-symmetric, i.e. the coefficients of H at the same radius are equal, whereas the motion blur is symmetric about the origin, i.e. H(i, j) = H(?i,?j). So if we know the type of blur, we can impose its corresponding symmetry constraint while solving for (5.7). Imposing these constraints reduces the number of parameters in the optimization problem, giving better recognition accuracy and faster solutions. Statistical interpretation of the algorithm. Though our algorithm is geometrically motivated it also has a statistical interpretation. For that we need to 87 introduce the concept of noise in the blur model (5.1): Ib = I ?H + n, (5.8) where n is the image noise. If we assume a Gaussian distribution N (0, ?2I) for noise, then the likelihood of Ib given I and H is N (Ib ? I ? H, ?2I). Given the gallery images Ij , j = 1, 2, . . . ,M and the probe image Ib, the maximum likelihood estimate for the gallery image and the blur kernel H is obtained by solving [?j, ?H] = arg max j,H N (Ib ? Ij ?H) subject to H ? 0, ? H = 1 (5.9) Using the fact that maximizing the likelihood is same as minimizing the negative log likelihood and the matrix-vector notations, we get [?j, ?h] = arg min j,h ||Ib ? Ajh||2 subject to h ? 0, ?h?1 = 1, (5.10) The joint minimization over the index j and blur kernel H can also be solved by first minimizing over H , i.e. by solving (5.7), and then minimizing rj over the index j. This is exactly our proposed algorithm. Effect of the amount of blur on recognition. Given a dataset of gallery images, we can make a prediction for the amount of blur that our algorithm can handle. Here, by amount of blur we mean the length of the blur vector h. For any given length of blur kernel, we can find the separation between the convex sets Bj associated with each gallery image Ij. We define the separation between two sets Bi and Bj as the minimum distance s(i, j) between two points from each set: s(i, j) = min hi,hj ||Aihi ?Ajhj||2 subject to hi ? 0, hj ? 0, ?hi?1 = 1, ?hj?1 = 1 (5.11) 88 Note that this is again a convex quadratic program and hence can be efficiently solved. With increasing amount of blur, all the pairwise distances will decrease. At a certain blur amount, the separation between a pair of sets might become zero which essentially means that the corresponding sets have non-zero intersection. Any probe image that falls in such an intersection can be classified as belonging to either of the classes, leading to poor recognition results. Thus, the amount of blur which reduces the minimal separation between sets to zero is a good measure of the amount of blur a particular dataset can handle. 5.2 Incorporating the Illumination Model The facial images of a person under different illuminations can look very dif- ferent, and hence for any recognition algorithm to work in practice, it must account for these variations. First, we discuss about the low-dimensional subspace model for handling appearance variations due to illumination. Next, we use this model along with the convolution model to define the set of images of face under all possible lighting conditions and blur. We then propose a recognition algorithm based on minimizing the distance of the probe image from such sets. Low-dimensional linear model for illumination variations. It has been shown in [8, 77] that when the object is convex and Lambertian, the set of images form a low-dimensional linear subspace of approximate dimension 9. Though a hu- man face is not exactly convex or Lambertian, it is very close to being one and hence the nine-dimensional subspace model captures its variations due to illumina- 89 tion quite well [34]. The nine-dimensional linear subspace corresponding to a face image I can be characterized by 9 basis images. In terms of these nine basis images Im, m = 1, 2, . . . , 9, an image I of a person under any illumination condition can be written as I = 9? m=1 ?mIm (5.12) where ?m, m = 1, 2, . . . , 9 are the corresponding linear coefficients. To obtain these basis images, we use the ?universal configuration? of lighting positions proposed in [27]. These are a configuration of 9 lighting positions sm, m = 1, 2, . . . , 9 such that images taken under these lighting positions can serve as basis images for the subspace. These basis images are generated using the Lambertian reflectance model: Im(i, j) = ?(i, j) max(< sm, n(i, j) >, 0) (5.13) where ?(i, j) and n(i, j) are the albedo and surface-normal corresponding to pixel (i, j). We use the average 3-D face normals for n and we approximate the albedo ? by a well-illuminated gallery image under diffuse lighting. One can, potentially, also use the algorithm in [9] to estimate the albedo and the surface-normals from the single gallery image. The set of all images under different lighting and blur. For a given face characterized by the nine basis images Im, m = 1, 2, . . . , 9, the set of images under all possible lighting conditions and blur is given by BI = {I = 9? m=1 ?mAmh|h ? 0, ?h?1 = 1}, (5.14) where we have used the matrix-vector notations for I, Im and H introduced in 90 section 5.1. This set is not a convex set though if we fix either the filter kernel h or the illumination condition ?m the set becomes convex. Face recognition across blur and illumination (FRBI). Corresponding to each gallery image Ij, j = 1, 2, . . . ,M , we obtain the nine basis images Ij,m, m = 1, 2, . . . , 9 corresponding to each of the gallery images. Given a probe image Ib, we find the minimum distance between the image and a point in the set BIj by solving: rj = min h,?m ?Ib ? ( 9? m=1 ?mAj,mh)||2 subject to h ? 0, ?h?1 = 1 (5.15) We then assign to the probe image the identity of the gallery image that has the minimum residual value rj . If there are multiple gallery images per person, we can use the k-nearest neighbor rule for assigning identity. The major computational step of the algorithm is the optimization problem of (5.15). This is a non-convex problem. To solve this problem we use an alternation algorithm in which we alter- nately minimize over h and ?s, i.e., in one step we minimize over h keeping ?ms fixed and in the other step we minimize over ?ms keeping h fixed and we iterate till convergence. Each step is now a convex problem: the optimization over h given ?m reduces to the same problem as (5.7) and the optimization of ?s given h is just a lin- ear least squares problem. One can also formulate the optimization problem (5.15) as a non-convex quadratically constrained quadratic program (QCQP) and obtain a lower bound by solving a relaxed convex problem via semidefinite programming (SDP) [102]. Because of large number of variables involved, we could not pursue this approach. Similar to the previous section, we can show that the proposed al- gorithm can also be derived as a maximum likelihood estimator of the blur kernel, 91 illumination coefficients and the identities. Like section 5.1, we can also find the amount of blur and illumination variations our algorithm can handle for any given dataset. For each gallery image Ij, we have an associated set BIj of images which captures all the variations due to blur and illumination of Ij . For a given amount of blur we can find the separations between any pairs of sets BIi and BIj by solving s(i, j) = min hi,?m,hj ,?n || 9? m=1 ?mAi,mhi ? 9? n=1 ?nAj,nhj ||2 subject to hi ? 0, hj ? 0, ?hi?1 = 1, ?hj?1 = 1. (5.16) The amount of blur for which the minimum separation between any pair of sets becomes zero is a good measure of the maximum blur that the proposed algorithm can handle for a given dataset under all possible illumination conditions. 5.3 Experimental Evaluations We evaluate our algorithm on some synthetically blurred datasets like FERET [75] and PIE [85], and a real dataset of remotely acquired face images with lots of blur and illumination variations [69], which we will refer to as the REMOTE dataset. In section 5.3.1 we evaluate our ?blur only? recognition algorithm (FRB) of section 5.1 on well-illuminated but blurred images and in section 5.2 we evaluate our ?joint blur and illumination? algorithm (FRBI) of section 5.2 on blurred as well as poorly illuminated images. 92 5.3.1 Recognition across Blur We evaluate our algorithm for handling blur on the artificially blurred images of the FERET dataset and the naturally blurred images of the REMOTE dataset. Experiments on the FERET dataset. On this dataset we compare our algorithm with FADEIN, the algorithm proposed in [70]. For a fair comparison, we use the same experimental set-up as used for FADEIN, i.e., a subset of 1001 individuals from the ?fa? and ?fb? sub-directories of FERET are used as gallery and probe respectively. There is one image per person in the gallery and the probe images are obtained by artificially blurring the fb images by Gaussian kernels of ? values 0, 2, 4, 6, 8, see figure 5.1. (a) No blur (b) ? = 2 (c) ? = 4 (d) ? = 6 (e) ? = 8 Figure 5.1: Sample probe images from FERET dataset. The probe images are synthetically blurred with Gaussian filters of ? = 0, 2, 4, 6, 8 respectively. ? = 0 stands for ?no blur?. To handle small variations in illumination we histogram-equalize all the images in the gallery and probe datasets. While optimizing over the space of blur kernels in (5.7), we use kernel size of 2?+1 along with circular symmetry constraint for the blur kernel. Figure 5.2(a) shows the recognition result for different ? values obtained by our algorithm. For comparison we show the corresponding recognition results by 93 FADEIN, LPQ [71] and FADEIN+LPQ [70] as well. LPQ (local phase quantization) are blur insensitive image descriptors which has been used in [6] for recognizing blurred faces. FADEIN+LPQ has been proposed in [70] as an improvement over FADEIN, where a LPQ descriptor is constructed from the deblurred image produced by FADEIN. Referring back to figure 5.2(b), we can conclude that FRB is better than FADEIN and LPQ. FRB is comparable with FADEIN+LPQ for small values of ?, but outperforms it for large values. Next we test the effectiveness of our algorithm in handling motion blur. For this experiment, we use a smaller dataset of 100 images each from the fa and fb sub- directories of FERET. As before images from fa form our gallery, and the images from fb are blurred with different motion kernels to form the probe. While optimiz- ing over the space of blur kernels in (5.7), we impose symmetry about the origin. The recognition results by FRB are tabulated in 5.1. It is easy to conclude that FRB performs well for motion-blurred images too. To generalize our experiments further we construct datasets where no implicit assumption can be made on the type of blur. We blur each image of the (smaller) fb dataset by 3 different motion blur kernels and 3 different Gaussian blur kernels. While solving for the blur kernels, we do not impose any symmetry constraints. The results are tabulated in 5.2. Clearly, our algorithm fares well in this case too. In all the previous experiments, we made an optimal choice for the kernel-size. For example, for a Gaussian blur of specific ?, we chose a kernel size of 2? + 1. Here we study the effect of kernel size on the performance of our algorithm. We also examine the implications of imposing the symmetry constraints on the performance. 94 We create a probe dataset by blurring images from the smaller fb sub-directory by Gaussian blur of ? = 4 and solve the optimization problem (5.7) with choices of kernel size ranging form 1 to 16? + 1. We perform two experiments: One where we impose the circular symmetry constraint and the other where we do not impose any symmetry constraint. From figure 5.3 we can conclude the following: 1) Our algorithm is not very sensitive to the choice of kernel-size and 2) the imposition of symmetry constraints further relaxes the need for accurate choice of kernel-size. 0 1 2 3 4 5 6 7 8 30 40 50 60 70 80 90 100 Standard deviation of Gaussian kernel for blurred probe set Recognition rate (% ) Proposed approach FRB (a) FRB (b) FADEIN, LPQ, FADEIN+LPQ Figure 5.2: a) Recognition by our proposed algorithm FRB and b) by FADEIN, LPQ and FADEIN+LPQ (figure courtesy [70]) on the FERET dataset. FRB is better than FADEIN and LPQ. FRB is comparable with FADEIN+LPQ for small values of ?, but outperms it for large values. Experiments on the REMOTE dataset. Images in the REMOTE dataset were captured from distances ranging between 5 to 250 meters under uncontrolled outdoor conditions. Hence, the images suffer from varying amount of blurs, mostly, low-resolution and out of focus blurs. It also has large variations in illumination 95 Motion blur kernel (L,?) (1,0) (5,0) (5,45) (10,45) (10,90) (20,0) (20,45) Recognition rate (%) 100 100 100 98 98 100 98 Table 5.1: Recognition results for images from the FERET dataset blurred by dif- ferent motion kernels. While solving for blur kernel in (5.7) we also impose the ?symmetry about the origin? constraint on the kernel. These results show that FRB generalizes well for motion blur. Blur kernels No Blur M(10,45) M(10,90) M(20,0) G(2) G(4) Recognition rate (%) 100 100 98 96 100 100 Table 5.2: Recognition results for images from FERET that have been degraded by different blur kernels. M(L,?) represents motion blur and G(?) represents Gaussian blur. No symmetry constraints have been imposed while solving (5.7). This shows that FRB generalizes well for all types of blur. 96 Probe image FRB with no FRB with categories illumination pre-processing histogram equalization Sharp and well 41.0 55.7 illuminated Sharp and poorly 29.0 38.2 illuminated Blurred and well 36.7 50.4 illuminated Blurred and poorly 32.1 42.4 illuminated Overall recognition 35.2 47.5 result Table 5.3: Recognition result by FRB on the REMOTE dataset with and with- out illumination pre-processing. This shows that even a simple illumination pre- processing step such as histogram-equalization improves the recognition result by about 12% and hence there is clearly a need for better illumination modeling. 97 and pose, and suffer from occlusions. The dataset has been manually labeled with qualitative labels specifying the amount of blur (mild or severe), (good or poor) illumination, (frontal or profile) pose and (no or partial or full) occlusion. Since we are handling blur and illumination, we use a subset of images that are mostly frontal and has no occlusions but has large variations due to blur and illumination. The dataset contains images of 17 individuals. We designed our gallery to have a single frontal, sharp and well-illuminated image per person. The rest of the images form the probe set. We further sub-divided the probe set into four categories: 1) sharp and well-illuminated images (290 images), 2) sharp and poorly-illuminated images (217 images), 3) blurred and well-illuminated images (371 images) and 4) blurred and poorly-illuminated images (271 images). Figure 5.4 shows some images from each category. We use our FRB algorithm to perform face recognition on this dataset. Since most of the images have low-resolution and out-of-focus blur, we have used the circular-symmetry constraint while solving for the blur kernel. We have used a blur kernel size of 17 ? 17. Table 5.3 shows the recognition results by the algorithm. We have done two experiments, one on the raw images and another on histogram-equalized images. From our experiment, we can conclude the following: 1) this dataset is a much more challenging dataset than the FERET dataset and 2) even a simple illumination pre-processing step such as histogram-equalization improves the recognition result by about 12% and hence there is clearly a need for better illumination modeling. 98 5.3.2 Incorporating Illumination Model We evaluate FRBI, our proposed algorithm to handle blur and illumination variations simultaneously, on PIE and REMOTE datasets both of which have signif- icant variations in illumination. On the PIE dataset, first we test the efficacy of the FRBI algorithm in modeling illumination only without taking blur into considera- tion. The PIE dataset contains images of 68 individuals under different illumination conditions. We follow the same experimental set up as in the face recognition ex- periments of [9]: Recognition is performed across illumination with images from one illumination condition forming the gallery and the images from a different il- lumination set forming the probe. We use the well-illuminated images under the illumination condition f21 as our gallery. Since, in this experiment, we are testing only the illumination model of the FRBI algorithm, we only optimize over the space of illumination coefficients ?m, while H is set to the impulse function in (5.15). Table 5.4 shows the result from our algorithm and that from [9]. The recognition algorithm proposed in [9] estimates the albedo maps of the gallery and probe images and use them for recognition. Our better results indicate that we have modeled the illumination well in FRBI. In our next experiment, we artificially blur the PIE dataset and then use FRBI to perform recognition on it. We use the images under the illumination con- dition f21 as our gallery, and thus our gallery has a single image per individual. We divide the rest of the illumination conditions into two categories: 1) sharp and well-illuminated images (f09, f11, f12, f20) and 2) sharp and poorly-illuminated im- 99 f09 f11 f12 f13 f14 f15 f16 f17 f20 f22 Ave Proposed 98.6 100 100 100 100 100 100 100 100 100 99.9 approach Albedo-based 99 100 100 100 100 93 94 85 100 97 97 approach [9] Table 5.4: We test the illumination modeling capability of the proposed al- gorithm FRBI on the PIE dataset. We use the well-illuminated images un- der the illumination condition f21 as our gallery and the rest of the images in f09, f11, f12, f13, f14, f15, f16, f17, f20, f22 as probe. We also compare our results with the albedo-based recognition algorithm of [9]. From our better results we can con- clude that we have modeled the illumination well in FRBI. Probe image categories FRB with histogram FRBI equalization Sharp and well illuminated 94.5 99.7 Sharp and poorly illuminated 72.5 100 Blurred and well illuminated 75.7 99.7 Blurred and poorly illuminated 59.5 99.1 Overall recognition result 71.1 99.5 Table 5.5: Recognition results by FRB with histogram pre-processing and FRBI on the PIE dataset. FRBI has an average accuracy rate of almost 100%, whereas that of FRB is 71%. 100 Probe image FRB with no FRB with histogram FRBI categories illumination pre-processing equalization Sharp and well 41.0 55.7 53.8 illuminated Sharp and poorly 29.0 38.2 44.7 illuminated Blurred and well 36.7 50.4 45.3 illuminated Blurred and poorly 32.1 42.4 47.6 illuminated Overall recognition 35.2 47.5 47.9 result Table 5.6: Recognition results by FRB with and without illumination pre-processing and FRBI on the REMOTE dataset. On the poor-illumination categories, FRBI gives better results than FRB with histogram pre-processing. However, on the well-illuminated image categories FRB with illumination pre-processing gives better result than FRBI. This result can be attributed to the fact that we have used an average 3-D face model for obtaining the nine basis images which, effectively, means that we are not using the discrimative nature of the shape information for recognition. However, potentially, we can improve the performance of FRBI by using individual 3-D shape models, which can be estimated using the method presented in [9]. 101 ages (f13, f14, f15, f16, f17, f22). We further blur these images by Gaussian blur of ? = 1, 2 to obtain two more categories: 3) well-illuminated and blurred images and 4) poorly-illuminated and blurred. Next for each gallery image we obtain the corresponding nine illumination basis images. Figure 5.5 shows the basis images of a gallery image. After obtaining these basis images, we perform recognition using FRBI. Table 5.5 shows the results obtained by FRBI and by FRB with histogram pre-processing. FRBI has an average accuracy rate of almost 100% whereas that of FRB is 71%. This experiment clearly demonstrates the importance of proper illu- mination modeling along with the blur. Next we use FRBI to perform recognition on the challenging REMOTE dataset. Figure 5.6 shows the nine illumination basis images of an individual in the REMOTE dataset. These basis images are used in the FRBI algorithm to model illumination variations. Table 5.6 shows the result by FRBI for the 4 categories of probe images. It also shows the result obtained by FRB, with and without the pre-processing step of histogram-equalization. On the poor-illumination categories FRBI gives better results than FBI with histogram pre-processing. However, on the well-illuminated image categories FRB with illumi- nation pre-processing gives better result than FRBI. This result can be attributed to the fact that we have used an average 3-D face model for obtaining the nine basis images which effectively means that we are not using the discriminative nature of the shape information for recognition. However, potentially, we can improve the performance of FRBI by using individual 3-D shape models which can be estimated in the framework of [9]. 102 0 10 20 30 40 50 60 70 93 94 95 96 97 98 99 100 101 Kernel Size Recognition rate (% ) Without symmetry constraints With circular symmetry (a) Figure 5.3: The effect of kernel size on the performance of our algorithm FRB. The probe images are blurred by a Gaussian kernel of ? = 4. From these curves we conclude the following:1) FRB is not very sensitive to the choice of kernel-size and 2) the imposition of symmtery constraints further relaxes the need for accurate choice of kernel-size. 103 (a) Sharp and well-illuminated (b) Sharp and poorly-illuminated (c) Blurred and well-illuminated (d) Blurred and poorly-illuminated Figure 5.4: For the experiment on REMOTE dataset, we have divided the probe images into four categories: a) sharp and well-illuminated images, b) sharp and poorly-illuminated images, c) blurred and well-illuminated images and d) blurred and poorly-illuminated images. These images were acquired at distances between 5? 250 meters. 104 (a) Basis 1 (b) Basis 2 (c) Basis 3 (d) Basis 4 (e) Basis 5 (f) Basis 6 (g) Basis 7 (h) Basis 8 (i) Basis 9 Figure 5.5: The nine illumination basis images of an individual in the PIE dataset. These basis images are used in the FRBI algorithm to model illumination variations. 105 (a) Basis 1 (b) Basis 2 (c) Basis 3 (d) Basis 4 (e) Basis 5 (f) Basis 6 (g) Basis 7 (h) Basis 8 (i) Basis 9 Figure 5.6: The nine illumination basis images of an individual in the REMOTE dataset. These basis images are used in the FRBI algorithm to model illumination variations. The nine illumination positions from which the basis images are created has been optimized for this dataset. 106 Chapter 6 A Scalable Projective Bundle Adjustment Algorithm using the l? Norm Structure from Motion is the problem of reconstructing the 3-D structure of an observed scene and the camera parameters (orientations and locations) from multiple images or video of the scene. Bundle adjustment is the final optimization step of the SfM problem, where the structure and camera parameters are refined starting from an initial reconstruction. This is done by minimizing a norm of the image reprojection error, which is defined as the error between the reprojected image points, obtained from the current estimates of camera and structure parameters, and the observed image points. The l2 norm of reprojection error is the most commonly used cost function [7]. The main reason for this choice of the norm is that the cost function becomes a differentiable function of parameters and this allows the use of gradient and Hessian- based optimization methods. However, there are two main problems in minimizing this cost function: One, it is a non-linear and non-convex function of the camera and structure parameters. Even the simpler problem of estimating the structure parameters given the camera parameters, known as the triangulation or intersection problem, is a nonlinear and non-convex optimization problem. The correspond- ing cost function might have multiple minima and finding the global minimum is a 107 difficult problem. The same is true for the problem of estimating the camera param- eters given the structure parameters; this problem is known as the resection problem. Another problem with minimizing the l2 norm-based cost function is it?s high com- putational requirements. A second order algorithm, Levenberg-Marquardt(LM), is used for solving the problem, which has a computational complexity of O((m+n)3) per iteration and memory requirement of O(mn(m + n)), where m is the number of cameras and n is the number of structure points. There exists a more efficient algorithm, sparse bundle adjustment [58], that takes advantage of the sparse struc- ture of the Jacobian matrix of the problem. The computational complexity of this algorithm is O(m3 +mn) per iteration and the memory requirement is O(mn). Another norm that is geometrically and statistically meaningful is the l? norm. Minimizing the l? norm is the same as minimax estimation in statistics. Apart from this significance, the l? norm of reprojection error has a nice analytical form for the triangulation and resection problems: It is a quasi-convex function of the unknown parameters. A quasi-convex function has the property that any local minimum is also a global minimum. The global minimum can be obtained using a bisection algorithm ([48],[50]). Further, each step of the bisection algorithm can be solved by checking for the feasibility of a second order cone programming problem(SOCP) [48],[50], for which efficient software packages such as SeDuMi [91] are readily avail- able. The availability of efficient means of finding the global solution for the tri- angulation and resection problems in the l? norm prompts us to look for bundle adjustment algorithms using the same norm. Joint structure and camera parameter 108 estimation in the l? norm is not a quasi-convex optimization problem because of the non-linear coupling of structure and camera parameters. However, if we fix one of the unknowns, say structure, and optimize over the camera parameters, we are back to the original problem of l? resection. The next step would be to fix the camera parameters and optimize over the structure (l? triangulation problem). These two steps could be iterated till convergence to a local minimum. This algorithm is an instance of alternation algorithms, and, more specifically, of resection-intersection algorithms in bundle adjustment [7]1. The proposed algorithm, using the l? norm, has two advantages. One, fixing one of the unknown set of parameters, say the structure parameters, the camera parameters estimation problem can be solved for each camera separately, which effectively reduces the high-dimensional problem to many low-dimensional subproblems. The same is true when the camera parameters are fixed and the structure parameters are estimated. Hence, a high-dimensional parameter estimation problem gets transformed into many low-dimensional sub- problems. The second advantage is that the subproblems that we have to solve are all quasi-convex optimization problems, whose global minimum can be efficiently found. Our goal is to design a projective bundle adjustment algorithm and so the tri- angulation and resection subproblems have to be solved in the projective space [41]. In [48] and [50], the l? triangulation problem is solved in the Euclidean/affine space where the optimization is done over the convex region in front of all the cameras from which the point is visible. This region is well defined in Euclidean/affine space, 1In this discussion, we use ?intersection? and ?triangulation? interchangeably. 109 but not in the projective space. The search for this region can increase the compu- tational cost of the projective triangulation problem [49]; the same is true for the projective resection problem. We have avoided these computations by initializing our algorithm using a quasi-affine reconstruction, which can be easily obtained from an initial projective reconstruction by solving a linear programming problem [41]. Other related works: There are resection-intersection algorithms based on the l2 norm. The algorithms proposed by Chen et. al. [25] and Mahamud et. al. [83] are some examples of this. These algorithms have almost the same computational complexity as our algorithm but they are based on minimizing algebraic errors, which are approximations of the l2 reprojection error. These approximations make them susceptible to wrong solutions [83]. The organization of the rest of this chapter is as follows: In section 6.1, we pro- vide some necessary background on solving the triangulation and resection problems in the l? framework. In section 6.2, we discuss the proposed l? bundle adjustment algorithm. In section 6.3, we compare the computational complexity and memory requirements of our algorithm with the l2 bundle adjustment algorithm and the l2 based resection-intersection algorithms. In section 6.4, we evaluate our algorithm for convergence, computational complexity, and robustness to noise with appropriate comparisons to other algorithms. 110 6.1 Background: geometric reconstruction problems using L? norm We give a very brief overview of the problem of minimizing the L? norm for (Euclidean/affine) triangulation/intersection and resection problems. For more details see Kahl [48] and Ke et. al [50]. We begin with the definition of a quasi- convex function since the triangulation and resection cost functions reduce to this form. Definition 1. A function f(X) is a quasi-convex function if its sublevel sets are convex. 6.1.1 Triangulation/Intersection Let P j = (ajT ; bjT ; cjT ), i = 1, 2, ...,M be the 3? 4 projection matrices for M cameras and (uj, vj), j = 1, 2, ...,M be the images of the unknown 3D point X in these M cameras. The problem is to estimate X given P j and (uj, vj). Let ?X be the homogeneous coordinate of X i.e, ?X = (X; 1). Then the reprojected 2-D image point in camera j (in Euclidean coordinates) is given by ( ajT ?X cjT ?X , bjT ?X cjT ?X ) and the L2 norm of reprojection error function is given by: f j(X) = ????? ????? ( ajT ?X cjT ?X ? uj, bjT ?X cjT ?X ? vj ) ????? ????? 2 = ????? ????? ( ajT ?X ? ujcjT ?X cjT ?X , bjT ?X ? vjcjT ?X cjT ?X ) ????? ????? 2 (6.1) In Euclidean triangulation, i.e., when the projection matrices are of the form P j = [Rjtj], where Rj ? SO(3) and tj ? R3, the fact that X is in front of camera i is 111 expressed by: cjT ?X > 0 (also known as cheirality constraint). With this constraint, we have: f j(X) = ??? ???ajT ?X ? ujcjT ?X, bjT ?X ? vjcjT ?X ??? ??? 2 cjT ?X = pj(X) qj(X) (6.2) pj(X) is a convex function because it is a composition of a convex function (norm) and an affine function. qj(X) is an affine function. Functions of the form of f j(X) are quasi-convex [102]. The L? norm of reprojection error is F?(X) = maxj f j(X), (6.3) which is again a quasi-convex function, as point-wise maximum of quasi-convex functions is also quasi-convex [102]. Minimization of the quasi-convex function F? can be done using a bisection algorithm in the range of F? ([48], [50]). One step in the bisection algorithm involves solving the following feasibility problem: find X s.t. X ? S? (6.4) where S? is the alpha sub-level set of F?(X) with the cheirality constraint qj(X) > 0. For triangulation, S? = {X|f j(X) ? ?, qj(X) > 0, ?j} = {X|pj(X)? ?qj(X) ? 0, qj(X) > 0, ?j} (6.5) 112 S? is a convex set and hence we have to solve a convex feasibility problem. Moreover, since pj(X) is a L2 norm, this problem is a second order cone programming problem which can be efficiently solved using software packages like Sedumi [91]. 6.1.2 Resection Here we are given N 3D points Xi, i = 1, 2, ..., N and their corresponding image points (ui, vi), i = 1, 2, ..., N . The problem is to estimate the 3?4 camera projection matrix P = [aT ; bT ; cT ]. The reprojected 2-D image point corresponding to the ith 3-D point is given by ( aT ?Xi cT ?Xi , bT ?Xi cT ?Xi ) and the l2 norm of reprojection error function is given by: fi(P ) = ????? ????? ( aT ?Xi ? uicT ?Xi cT ?Xi , bT ?Xi ? vicT ?Xi cT ?Xi ) ????? ????? 2 (6.6) The l? norm of reprojection error is F?(P ) = max i fi(P ), (6.7) Again, the l? reprojection error is a quasi-convex function of the unknown camera parameters and the global minimum can be obtained as in the triangulation case [48],[50]. 6.2 The l? projective bundle adjustment algorithm For l? projective bundle adjustment, we propose an iterative algorithm based on the principle of resection-intersection. We partition the unknown structure and camera parameters into two separate sets and minimize the l? norm of reprojection 113 error over one set of parameters while keeping the other set fixed. In the resec- tion step, the minimization is done over the camera parameters while keeping the structure parameters fixed and in the intersection step, the optimization is done over the structure parameters while keeping the camera parameters fixed. These resection-intersection steps are iterated many times till the algorithm converges to a stationary point. The resection and intersection steps of the proposed algorithm are still a high- dimensional optimization problem. In section 6.2.1, we show that how these two steps can be further simplified by solving a large number of small optimization problems. In section 6.2.2, we discuss the correct way to initialize our algorithm. 6.2.1 Decoupling Consider the intersection step of the algorithm, where the camera parameters are fixed and minimization of the l? norm of reprojection error is done over the structure parameters. Let P j, j = 1, 2, ...,M be the given projection matrices of M cameras and Xi, i = 1, 2, ..., N be the N 3D points, which are to be estimated. Let f ji be the l2 norm of reprojection error for the i-th 3D point imaged in the j-th camera. The l? norm of reprojection error is: F?(X1, X2, ..., XN) = max i,j f ji (X1, X2, . . . , XN) = max i max j f ji (Xi) = max i f?,i(Xi) (6.8) 114 where, f?,i(Xi) = maxj f j i (Xi). (6.9) From equation (6.8), we conclude that F?(X1, X2, ..., XN) can be minimized jointly over all the structure variables by minimizing each of the f?,i(Xi) individually over Xi. Hence, the large optimization problem can be solved by solving many (N) small problems. Moreover, we can solve all these problems in parallel. The same is true for the resection step; we can optimize each of f?,j(P j) individually over P j to obtain the joint optimal solution F?(P 1, P 2, ..., PN). 6.2.2 Cheirality and quasi-affine initialization A camera projection matrix is of the form P = K[R t], where K is a 3? 3 upper-triangular matrix (which has information about the focal length of the camera and is called the internal parameter matrix) and R ? SO(3), t ? R3 are rotation and translation of the camera coordinate system with respect to the world coordinate system (the 3-D points are described with respect to this system). In a reconstruc- tion problem, the goal is to find the camera projection matrices and 3-D structure points from their 2-D image points. While solving the reconstruction problem, if we impose the upper-triangular matrix constraint on K and the ortho-normality constraint on R, the reconstruction we obtain is an Euclidean reconstruction, i.e., the reconstructed camera matrices and structure points are related to the original quantities by a Euclidean transformation (global rotation and translation). How- ever, if we do not impose the above constraints and just solve for an unconstrained 115 P matrix, the reconstruction that we get is a projective reconstruction, i.e., it is related to the original reconstruction by a projective transformation. Generally, we first obtain a projective reconstruction, and later use the various constraints to find a projective transformation that can convert the projective reconstruction to an Euclidean one [41]. Our goal is to design a projective bundle adjustment algorithm. Since bundle adjustment algorithm is an iterative algorithm, we need to ini- tialize the algorithm with an initial reconstruction. The usual way to initialize a projective bundle adjustment algorithm is a projective reconstruction obtained from the given images. Any of the methods mentioned in [41] can be used for projec- tive reconstruction. However doing this for the proposed algorithm will increase its computational complexity. To understand and get around this problem, we need to understand a property known as cheirality. Let X = (X, Y, Z, T ) be a homogeneous representation of a point and P = [aT ; bT ; cT ] = [M p4] be the projection matrix of a camera, with M a 3 ? 3 sub- matrix and p4 a column vector. The imaged point x is given by PX = ?x?, where x? denotes the homogeneous representation of x in which the last coordinate is 1. The depth of the point X with respect to the camera is given by: depth(X;P ) = sign(detM)? T ||m3|| (6.10) where m3 is the third row of M [41]. A point X is said to be in front of the camera if and only if depth(X;P ) > 0. Definition 2. The quantity sign(depth(X;P )) is known as the cheirality of the point X with respect to the camera [41]. 116 If detM > 0 and T > 0, then cTX > 0 implies that the point is in front of the camera(since ? = cTX). Cheirality is an invariant quantity under Euclidean, affine transformations and quasi-affine transformations, but it is not so under projective transformation [41]. In section 6.1, since we were solving Euclidean triangulation and resection problems, we were justified in using the cheirality constraint. However, when solving the projective triangulation problem, we can?t just restrict our search for X in the convex region of {X : cjTX > 0, ?j}. If there are M cameras, then the M principal planes divide the projective space P 3 into (M3 +3M2 +8M)/6 regions [49]. The l? cost, with respect to X, has to be minimized over each of these regions and the minimum among them is the desired solution of the projective triangulation problem[49]. From the discussion above, to avoid additional computations, we should ini- tialize the projective bundle adjustment algorithm with either Euclidean, affine or quasi-affine reconstruction. Quasi-affine reconstruction is the best choice as it is very easy to convert any (initial) projective reconstruction into a quasi-affine one. The only information required for this conversion is the fact that if a point is imaged by a camera, then it must be in front of the camera. The transformation that takes a projective reconstruction to a quasi-affine reconstruction can be found by solving a linear programming problem [41]. To summarize, we first obtain an initial projective reconstruction from the images and then convert this to a quasi-affine reconstruction by solving a linear programming problem. This reconstruction is then used as an initialization for our algorithm. After this initialization, we can use the triangulation/resection method 117 Algorithm Bundle Adjustment using l? norm minimization Input: Set of images Output: Projective reconstruction 1. Do initial projective reconstruction from set of images. 2. Convert to quasi-affine reconstruction. 3. while Reprojection error >  4. do Get camera parameters for each camera by l? resection. 5. Get 3D structure parameters for each point by l? triangulation. Figure 6.1: l? BA Algorithm of section 6.1. The summary of the algorithm is given in Figure 6.1. 6.3 Computational complexity and memory requirement This section first describes the computational complexity and memory require- ment of the proposed algorithm and then compares it with that of l2 based bundle adjustment and l2 based resection-intersection algorithms. As discussed in section 6.2.1, at any time we are either solving the triangula- tion problem for one structure point or the resection problem for one camera. We first analyze the computational complexity and memory requirements for solving one triangulation problem. The triangulation problem is solved by a bisection algorithm 118 [48],[50]. At each step of the bisection, we solve a convex feasibility problem, given in (6.5) of section 6.1. If the point that we are triangulating is visible in m cameras then we have to solve m second order cone feasibility problem. This problem has a computational complexity of O(m1.5) and a memory requirement of O(m) [56]. By analogy, the resection problem for one camera has an computational complexity O(n1.5) and memory requirement O(n) where n points are visible in that camera. Now consider one iteration of our algorithm. For the case where there are m cameras and n points and all the points are visible in all the cameras, per iteration empiri- cal computational complexity is O(mn(?m+?n)) and the memory requirement is O(max(m,n)). Furthermore a parallel implementation of the algorithm is possible because during each resection/intersection step all the cameras/points can be esti- mated at the same time. This is because of the decoupling discussed in section 6.2.1. Such an implementation will result in a reduction of computational complexity. The l2 norm bundle adjustment algorithm (l2 BA) [7] is based on the Levenberg- Marquardt (LM) method. The central step involves solving an equation with all the camera and structure parameters as unknowns. Hence its computational complexity is O((m + n)3) per iteration. For memory requirement, we can consider the Jaco- bian which is O(mn(m+n)). There exists a sparse LM method which uses the fact that the Jacobian matrix for the bundle adjustment problem has a sparse structure [41]. For this method, computation complexity is O(m3 + mn) per iteration. The memory requirement is O(mn) [41]. The l2 based resection-intersection algorithms [25],[83] have computational complexity of O(mn) per iteration and same memory requirement as our algorithm, i.e, O(max(m,n)) [83]. But since these algorithms 119 minimize approximate algebraic errors, they are not so reliable as we found in our experiments 6.4.1. 6.4 Experiments We have done experimental evaluation of the proposed algorithm (l? BA) for convergence, computational scalability and robustness to noise. Comparisons with l2 bundle adjustment based on LM algorithm (l2 BA) and l2 resection-intersection al- gorithms are also given. For l2 BA we have used a publicly available implementation of projective bundle adjustment [58] based on the sparse LM method. The l2 based resection-intersection algorithms that we have compared with are the Weighted It- erative Eigen algorithm (WIE) proposed by Chen et al [25] and a variation of the same algorithm where we avoid the reweighting step, henceforth called the IE algo- rithm. In section 6.4.1, while studying the convergence of the four algorithms we found that the performance of WIE and IE are unreliable and hence in the rest of the sections we have compared l? BA with only l2 BA. For initial reconstruction, we have used the projective factorization method of Triggs et al [13] with proper handling of missing data. This reconstruction was then converted to a quasi-affine reconstruction. Any other projective reconstruction followed by a conversion to a quasi-affine reconstruction will also work fine. When comparing different algorithms, all of them have been provided with the same initial reconstruction. 120 6.4.1 Convergence Here we study the convergence with iteration of l? BA, l2 BA, WIE and IE algorithms. Experiments are done on one synthetic data set and three real data sets. The synthetic data set consists of 100 points distributed uniformly within a unit sphere and 50 cameras in a circle around the sphere looking straight at the sphere. Gaussian noise of standard deviation 1 pixel is added to the feature points. In all the experiments, reprojection error for this data set is the mean reprojection error over ten trials. The real data sets used are the corridor and dinosaur data set2 and the hotel data set3. We have used a subset of 464 structure points from the dinosaur data set and a subset of 60 views from the hotel data set. For the corridor and the dinosaur data sets, feature points are already available and we have used them as they are. For the hotel data set we have used the KLT tracker to track feature points and then used Torr?s Matlab SFM Toolbox [99] to remove the outliers. We compare the convergence of the algorithms in the l? norm of reprojection error and the Root Mean Squares reprojection error (RMS) which is a measure of the l2 norm. Figure 6.2 shows that the l? error decreases monotonically for l? BA, but not so for the other algorithms. Figure 6.3 shows RMS error decreases (almost) monotonically for l? BA and l2 BA but not so for WIE and IE. From Figure 6.3, we can further conclude the following. All the algorithms converge well for the sphere data set. For the corridor data set, WIE converges at a higher value than others. For the hotel data set, both WIE and IE first diverge and then later converge. For 2http://www.robots.ox.ac.uk/?vgg/data/data-mview.html 3http://vasc.ri.cmu.edu/idb/ 121 the dinosaur data set, IE fails to converge. To further study the nature of WIE and IE, we added Gaussian noise of standard deviation 1 pixel to the real data sets and found that the algorithms fail to converge at many trials. However, each time these algorithms have converged in the algebraic cost that they minimize. This study tells us that algebraic cost based algorithms may not be very reliable. For all of the above data sets, our algorithm converges within ten iterations with similar RMS reprojection error as l2 BA. Figure 6.4 shows the final 3-D reconstruction by our algorithm for the datasets, sphere and corridor. 0 5 10 15 20 0 10 20 30 40 50 Iterations L ? reprojection error in pixel s L ? reprojection error Vs Iterations for Sphere Linf_BA L2_BA WIE IE 0 20 40 60 80 100 0 5 10 15 20 Iterations L ? reprojection error in pixel s L ? reprojection error Vs Iterations for Corridor Linf_BA L2_BA WIE IE 0 20 40 60 80 100 0 5 10 15 20 Iterations L ? reprojection error in pixel s L ? reprojection error Vs Iterations for Hotel Linf_BA L2_BA WIE IE 0 20 40 60 80 100 0 20 40 60 80 100 Iterations L ? reprojection error in pixel s L ? reprojection error Vs Iterations for Dinosaur Linf_BA L2_BA WIE IE Figure 6.2: l? reprojection error versus iteration for the four algorithms on the data sets: sphere, corridor, hotel and dinosaur. l? error decreases monotonically for l? BA but not so for the other algorithms. 122 0 5 10 15 20 0 5 10 15 20 Iterations RMS reprojection error in pixel s RMS reprojection error Vs Iterations for Sphere Linf_BA L2_BA WIE IE 0 10 20 30 40 0.5 1 1.5 2 Iterations RMS reprojection error in pixel s RMS reprojection error Vs Iterations for Corridor Linf_BA L2_BA WIE IE 0 10 20 30 40 0 2 4 6 8 10 Iterations RMS reprojection error in pixel s RMS reprojection error Vs Iterations for Hotel Linf_BA L2_BA WIE IE 0 10 20 30 40 0 2 4 6 8 10 Iterations RMS reprojection error in pixel s RMS reprojection error Vs Iterations for Dinosaur Linf_BA L2_BA WIE IE Figure 6.3: RMS reprojection error versus iteration : RMS error decreases mono- tonically for l? BA and l2 BA but not so for WIE and IE. IE fails to converge for the dinosaur data set. 6.4.2 Computational scalability We did experiment on the synthetic sphere data set to compare the total convergence time for l2 BA and l? BA as the number of cameras is varied with the number of points fixed at 500, Figure 6.5. To ensure a fair comparison, both the algorithms were implemented in Matlab with the computationally intensive routines as mex files. l2 BA converges at about 10 iterations and l? BA at about 2 iterations. Figure 6.5 clearly shows that our algorithm has the advantage in terms of time from 250 cameras onwards. For a video with 30 frames per second this is approximately 123 ?2 0 2 ?1?0.50 0.51 ?2 ?1 0 1 2 Initial reconstruction: Sphere ?2 0 2 ?1?0.50 0.51 ?1 0 1 Final reconstruction: Sphere ?2 0 2 ?1?0.50 0.51 ?1 0 1 Groundtruth: Sphere ?100 0 100 200 ?20 0 20 40 60 ?300 ?200 ?100 0 100 Initial reconstruction: Corridor ?20 0 20 ?10 0 10 ?100 ?50 0 50 Final reconstruction: Corridor ?20 0 20 ?10 0 10 ?100 ?50 0 50 Groundtruth: Corridor Figure 6.4: 3-D reconstruction result for the datasets, Sphere and Corridor. The Red ?*? represents the camera center and Blue ?o? represents the structure point. The first column shows the initialization, second column shows the final reconstruction and the third column shows the groundtruth. 8 sec of data. Note that we have to estimate the camera parameters corresponding to each frame of the video. Thus our algorithm is suitable for solving reconstruction problems for video data where the number of frames can be large. Recently, there has been some work on faster computations of l? triangulation and resection problems [18] and incorporating this will reduce the convergence time of our algorithm. Further reduction in convergence time is possible by a parallel implementation, which we have not done here. 6.4.3 Behavior with noise Gaussian noise of different standard deviations are added to the feature points. Figure 6.6 shows the RMS reprojection error in pixels with noise for the synthetic 124 200 400 600 800 1000 0 100 200 300 400 Number of cameras Total convergence time in min s Total convergence time Vs Number of cameras L2_BA Linf_BA Figure 6.5: Total convergence time of l2 BA and l? BA as the number of cameras is varied with number of points fixed at 500. data set, sphere. Generally the l? norm has the reputation of being very sensitive to noise, but here we see a graceful degradation with noise. Further to handle noise with strong directional dependence, we can incorporate the directional uncertainty model of Ke et. al. [51] into the resection and triangulation steps of our algorithm, though we have not done it here. We have not considered outliers here, as bundle adjustment is considered to be the last step in the reconstruction process and outlier detection is generally done in the earlier stages of the reconstruction. In fact as mentioned earlier in section 6.4.1, we have removed the outliers from the hotel data set before the initial reconstruction step. 125 0.5 1 1.5 2 2.5 0.5 1 1.5 2 2.5 3 3.5 4 Standard deviation of Gaussian noise RMS reprojection error in pixel s Reprojection error Vs feature noise for Sphere Linf_BA L2_BA Figure 6.6: Behavior of l? BA and l2 BA with image feature noise for the sphere data set. 126 Chapter 7 Conclusion and Future Directions We summarize and suggest future directions for each of the topics covered in the dissertation. We also propose interesting directions for some related topics. 7.1 Robust Linear Regression Using Sparse Learning for High-Dimensional Applications Successful robust regression algorithms, such as LMedS and RANSAC are combinatorial in the dimension of the problem, and hence are not useful for solving high-dimensional problems. We proposed robust polynomial time algorithms based on techniques from sparse learning theory. We decomposed the error term in regres- sion as the sum of two terms: an outlier or gross error term, which is assumed to be sparse, and an inlier or small error term. We then formulated the robust regression problem as an l0-norm optimization problem and stated the conditions under which it can correctly recover the model parameters in presence of k outliers: The smallest principal angle between the regression subspace and all the 2k-dimensional outlier subspaces should be greater than zero and X should be full column rank. Since the above optimization is a combinatorial problem, we proposed a relaxed convex problem BPRR, which is a modified version of the basis pursuit algorithm. We then showed that the if the smallest principal angle between the regression and all 127 the 2k-dimensional outlier subspaces is more than cos?1(23) and X is full column rank, then BPRR finds the correct model parameters provided there are at most k outliers. We also proposed a Bayesian approach, BRR, for solving the robust regres- sion problem, which is based on the sparse Bayesian learning technique. We then empirically studied the parameter space of the robust regression algorithm, which showed that BRR gives the best performance. Finding the Maximum Number of Outliers that a Dataset can Han- dle. The sufficient conditions that we provided for BPRR, Theorem 2.1.1, are in terms of the quantity ?k (cosine of the smallest principal angle between the regres- sion subspace and all k dimensional outlier subspace). The largest integer k for which ?2k < 23 provides us a lower bound on the maximum number of outliers (in y variable) that a given dataset can handle. However, the computation of this quan- tity is itself a combinatorial problem. An interesting direction of research would be to find greedy algorithms that can provide lower and/or upper bound on the maximum number of outliers that a given dataset can handle. 7.2 Robust RVM Regression Using Sparse Outlier Model We extended our robust linear regression formulation to a particular kernel (non-linear) regression technique, the RVM regression. We explored two natural approaches for incorporating robustness in the RVM model: a Bayesian approach and a regularization approach. In the Bayesian approach (RB-RVM), the robust RVM problem is formulated as a bigger RVM problem with the advantage that it can 128 be solved efficiently by a fast algorithm. The regularization approach (BP-RVM) is based on the Basis Pursuit Denoising algorithm, which is a popular algorithm in the sparse representation literature. Empirical evaluations of the two robust algorithms show that RB-RVM performs better than BP-RVM. Further, we used RB-RVM to solve the robust image denoising and age estimation problem, which clearly demonstrated the superiority of RB-RVM over the original RVM. As a future direction of research, it would be interesting to look at a similar robust version for RVM classification. Also, the RB-RVM can be applied for solving the image interpolation problem and the 3d human pose estimation problem, where RVM regression gives one of the best performances [4]. 7.3 Sparse Regularization for Regression and Classification on Man- ifolds There are many applications in vision such as dynamic textures [87], human activity modeling and recognition [104] and shape analysis [73], where the data lies on a non-Euclidean manifold. We are interested in developing regression and clas- sification techniques which would be suitable for such problems. Recent papers by Pelletier et. al. [74, 57] have proposed kernel techniques for regression and clas- sification on closed Riemannian manifolds. However, these techniques lack proper regularization and hence may not generalize well, i.e., they may not predict well for unseen data. It would be interesting to look at sparse regularization for these problems. Another direction would be to make them robust to outliers. 129 7.4 Large-Scale Matrix Factorization with Missing Data under Ad- ditional Constraints Many problems in computer vision, such as SfM and photometric stereo, can be formulated as a missing-data matrix factorization problem, which is a hard problem to solve. We have formulated this problem as a low-rank semidefinite programming problem (MF-LRSDP). MF-LRSDP is an efficient algorithm that can be used for solving large-scale factorization problems. It is also flexible for handling many addi- tional constraints such as the ortho-normality constraints of the orthographic SfM. Our empirical evaluations on synthetic data show that it needs fewer observations for matrix factorization as compared to other algorithms and it gives very good results on the real problems of SfM, non-rigid SfM and photometric stereo. We note that though MF-LRSDP is a non-convex problem, it finds the global minimum under the conditions of the matrix completion theory. As a future work, it would be interesting to find a theoretical justification for this. Subspace Clustering in the presence of Missing Data. As seen in Chapter 4, the motion of a single object can be well formulated by missing data matrix factorization. If there are multiple objects undergoing different motions, then it can be shown that this problem can be formulated as a subspace clustering problem, where each cluster represents a single motion. For solving this problem, Vidal et. al. [33] have proposed a sparse subspace clustering technique. It would be interesting to extend this technique to the missing data scenario. 130 7.5 Direct Recognition of Faces across Blur and Illumination Varia- tions Motivated by the problem of remote face recognition, we have addressed the problem of recognizing blurred and poorly-illuminated faces. We have used the convolution model for blur and a low-dimensional linear subspace model for illu- mination to propose a direct recognition method. For each gallery image, we have an associated set which represents all the variations due to blur and illumination. Given a probe image, we find its distance from each such (gallery) set and assign it the identity of the closest (gallery) set. We have shown that this algorithm, though based on set theoretic concept, is also statistically optimal; it gives the maximum likelihood estimates for the blur kernel, illumination coefficients and identity. We also provided a way to theoretically characterize the amount of blur our algorithm can handle in a given dataset. Finally, we have demonstrated very good recognition results on many synthetic and real datasets. As an extension, it would be interesting to address the problem of pose variations under the same framework. Also, instead of maximizing the likelihood of the probe image over the joint space of identities, blur kernels and illumination coefficients, one can maximize the marginal likelihood of the probe image over the space of identities. This can be done by integrating the joint likelihood function over the space of blur kernels and illumination coefficients. This approach is likely to improve the recognition accuracy but it will also increase the computational complexity of the algorithm. Beyond Nearest Neighbor Classification for Face Recognition Across 131 Blur and Illumination. The proposed algorithm is a nearest neighbor algorithm for the recognizing faces across blur and illumination variations. It is a well known fact that nearest neighbor classifiers are computationally intensive and do not gener- alize well. It would be interesting to explore other classifiers, such as support vector machines (SVM) [28], for solving this problem. 7.6 Hierarchical Dictionary for Face and Activity Recognition: Dictionary based face and activity recognition is a promising new direction [61]. We are interested in learning hierarchical (multi-resolution) dictionaries, which will reveal the proper structure of the data and will also lead to scalable algorithms for dictionary-based recognition. 7.7 A Scalable Projective Bundle Adjustment Algorithm using the L? Norm The traditional bundle adjustment algorithm, based on minimizing the L2 norm of the image re-projection error, has cubic complexity in the number of un- knowns, and hence, is slow. We have proposed an efficient projective bundle ad- justment algorithm using the L? norm. It is a resection-intersection (coordinate descent/alternation) based algorithm which converts the large scale optimization problem to many small scaled ones. It is possible to make the present algorithm faster using a parallel implementation and by a more efficient implementation of L? resection and triangulation. 132 Bibliography [1] The fg-net aging database, http://www.fgnet.rsunit.com. [2] H. Aan?s, R. Fisker, K. ?Astro?m, and J. M. Carstensen. Robust factorization. IEEE TPAMI, 2002. [3] A. Agarwal and B. Triggs. 3d human pose from silhouettes by relevance vector regression. In CVPR, 2004. [4] Ankur Agarwal and Bill Triggs. Recovering 3d human pose from monocular images. IEEE TPAMI, 2006. [5] A. Agrawal, R. Raskar, and R. Chellappa. An algebraic approach to surface reconstruction from gradient fields. In Intl Conf. Computer Vision, 2005. [6] T. Ahonen, E. Rahtu, V. Ojansivu, and J. Heikkila. Recognition of blurred faces using local phase quantization. In International Conference on Pattern Recognition, 2008. [7] R. Hartley B. Triggs, P. McLauchlan and A. Fitzgibbon. Bundle adjustment - a modern synthesis. In Vision Algorithms: Theory and Practice, Springer- Verlag, 2000. [8] Ronen Basri and David W. Jacobs. Lambertian reflectance and linear sub- spaces. IEEE Trans. Pattern Anal. Mach. Intell., 2003. [9] Soma Biswas, Gaurav Aggarwal, and Rama Chellappa. Robust estimation of albedo for illumination-invariant matching and shape recovery. IEEE Trans. Pattern Anal. Mach. Intell., 2009. [10] Alan C. Bovic. The essential guide to image processing. Elsevier, 2009. [11] S. Brandt. Closed-form solutions for affine reconstruction under missing data. In Stat. Methods for Video Proc. (ECCV 02 Workshop), 2002. [12] C. Bregler, A. Hertzmann, and H. Biermann. Recovering non-rigid 3d shape from image streams. In CVPR, 2000. [13] B.Triggs. Factorization methods for projective structure and motion. In Proc. IEEE Conf. on CVPR, 1996. [14] A. M. Buchanan and A. W. Fitzgibbon. Damped newton algorithms for matrix factorization with missing data. In CVPR, 2005. [15] S. Burer and C. Choi. Computational enhancements in low-rank semidefinite programming. Optimization Methods and Software, 2006. 133 [16] S. Burer and R.D.C. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming (series B, 2001. [17] Pei C. Optimization algorithms on subspaces: Revisiting missing data problem in low-rank matrix. IJCV, 2008. [18] A. P. Eriksson C. Olsson and F. Kahl. Efficient optimization for l?-problems using pseudoconvexity. In IEEE Int. Conf. Computer Vision, 2007. [19] J. Cai, E. J. Cande`s, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 2010. [20] E. J. Candes. The restricted isometry property and its implications for com- pressed sensing. Comptes Rendus Mathematique, 2008. [21] E. J. Cande`s and B. Recht. Exact matrix completion via convex optimization. Foundations on Computational Mathematics, 2009. [22] E. J. Candes and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 2005. [23] E. J. Candes and M. Wakin. An introduction to compressive sampling. IEEE Signal Processing Magazine, 2008. [24] E.J. Candes and P.A. Randall. Highly robust error correction byconvex pro- gramming. Information Theory, IEEE Transactions on, 2008. [25] Q. Chen and G. Medioni. Efficient iterative solution to m-view projective reconstruction problem. In IEEE Conf. on CVPR, 1999. [26] Scott Shaobing Chen, David L. Donoho, and Michael A. Saunders. Atomic decomposition by basis pursuit. SIAM Jour. Scient. Comp., 1998. [27] Kuang chih Lee, Jeffrey Ho, and David Kriegman. Acquiring linear subspaces for face recognition under variable lighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005. [28] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 1995. [29] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2005. [30] D. L. Donoho. High-dimensional centrally symmetric polytopes with neigh- borliness proportional to dimension. Discrete Computational Geometry, 2006. [31] D. L. Donoho and J. Tanner. Counting faces of randomly projected polytopes when the projection radically lowers dimension. J. Amer. Math. Soc., 2009. 134 [32] David L. Donoho, Michael Elad, and Vladimir N. Temlyakov. Stable recov- ery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1), 2006. [33] E. Elhamifar and R. Vidal. Sparse subspace clustering. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2009. [34] R. Epstein, P.W. Hallinan, and A.L. Yuille. 5 plusmn;2 eigenimages suffice: an empirical investigation of low-dimensional lighting models. In Physics-Based Modeling in Computer Vision, 1995., Proceedings of the Workshop on, 1995. [35] Anita C. Faul and Michael E. Tipping. A variational approach to robust regression. In ICANN, 2001. [36] M.A. Fischler and R.C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Comm. Assoc. Mach, 1981. [37] Yun Fu, Ye Xu, and Thomas S. Huang. Estimating human age by manifold analysis of face pictures and regression on aging features. In ICME, 2007. [38] G. H. Golub and C. F. Van Loan. Matrix computations. Johns Hopkins University Press, Baltimore, Md, 1996. [39] N. Guilbert, A.E. Bartoli, and A. Heyden. Affine approximation for direct batch recovery of euclidian structure and motion from sparse data. IJCV, 2006. [40] Guodong Guo, Yun Fu, Charles R. Dyer, and Thomas S. Huang. Image- based human age estimation by manifold learning and locally adjusted robust regression. IEEE Transactions on Image Processing, 2008. [41] R. Hartley and A. Zisserman. Multiple view geometry in computer vision. Cambridge University press, 2nd edition, 2004. [42] H. Hayakawa. Photometric stereo under a light source with arbitrary motion. JOSA, 1994. [43] Hao Hu and Gerard de Haan. Low cost robust blur estimator. In ICIP, 2006. [44] P.J. Huber. Robust statistics. Wiley Series in Probability and Statistics, 1981. [45] D. Q. Huynh, R. Hartley, and A. Heyden. Outlier correction in image se- quences for the affine camera. In ICCV, 2003. [46] D. W. Jacobs. Linear fitting with missing data for structure-from-motion. CVIU, 2001. [47] Y. Jin and B. D. Rao. Algorithms for robust linear regression by exploiting the connection to sparse signal recovery. In ICASSP, 2010. 135 [48] F. Kahl. Multiple view geometry and the l?-norm. In IEEE Int. Conf. On Computer Vision, 2005. [49] F. Kahl and R. Hartley. Multiple view geometry under the l? -norm. IEEE Tran. Pattern Analysis and Machine Intelligence, 2008. [50] Q. Ke and T. Kanade. Quasiconvex optimization for robust geometric recon- struction. In IEEE Int. Conf. On Computer Vision, 2005. [51] Q. Ke and T. Kanade. Uncertainty models in quasiconvex optimization for geometric reconstruction. In IEEE CVPR, 2006. [52] R. H. Keshavan and S. Oh. A gradient descent algorithm on the grassman manifold for matrix completion. CoRR, abs/0910.5260, 2009. [53] A Lanitis, C Draganova, and C. Christodoulou. Comparing different classifiers for automatic age estimation. IEEE TSMC, 2004. [54] A. Lanitis, C. J. Taylor, and T. F. Cootes. Toward automatic simulation of aging effects on face images. IEEE TPAMI, 2002. [55] K. Lee and Y. Bresler. Admira: Atomic decomposition for minimum rank approximation. CoRR, abs/0905.0044, 2009. [56] M. S. Lobo, L. Vandenberghe, S. Boyd, and H. Lebret. Applications of second- order cone programming. Linear Algebra and its Applications, 1998. [57] J. Loubes and B. Pelletier. A kernel-based classifier on a riemannian manifold. Statistics and Decisions, 2008. [58] M. I. A. Lourakis and A. A. Argyros. The design and implementation of a generic sparse bundle adjustment software package based on the levenberg- marquardt algorithm. In ICS/FORTH Technical Report No. 340, 2004. [59] D. G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 2004. [60] S. Ma, D. Goldfarb, and L. Chen. Fixed point and bregman iterative methods for matrix rank minimization. Mathematical Programming, 2009. [61] J. Mairal, F. Bach, J. Ponce, G. Shapiro, and A. Zisserman. Supervised dictionary learning. In Advances in Neural Information Processing Systems (NIPS), 2008. [62] A. Maleki and D. L. Donoho. Optimally tuned iterative reconstruction al- gorithms for compressed sensing. IEEE Journal of Selected Topics in Signal Processing, 2010. [63] R. A. Maronna, R. D. Martin, and V. J. Yohai. Robust statistics, theory and methods. Wiley Series in Probability and Statistics, 2006. 136 [64] D. Martinec and T. Pajdla. 3d reconstruction by fitting low-rank matrices with missing data. In CVPR, 2005. [65] R. Mazumder, T. Hastie, and R. Tibshirani. Spectral regulariza- tion algorithms for learning large incomplete matrices. http://www- stat.stanford.edu/ hastie/Papers/SVD JMLR.pdf, 2009. [66] R. Meka, P. Jain, and I. S. Dhillon. Guaranteed rank minimization via singular value projection. CoRR, abs/0909.5457, 2009. [67] K. Mitra, A. Veeraraghavan, and R. Chellappa. Robust regression using sparse learning for high dimensional parameter estimation problems. In ICASSP, 2010. [68] Erik Murphy-Chutorian and Mohan Manubhai Trivedi. Head pose estimation in computer vision: A survey. IEEE TPAMI, 31, 2009. [69] J. Ni and R. Chellappa. Evaluation of state-of-the-art algorithms for remote face recognition. In ICIP, 2010. [70] Masashi Nishiyama, Abdenour Hadid, Hidenori Takeshima, Jamie Shotton, Tatsuo Kozakaya, and Osamu Yamaguch. Facial deblur inference using sub- space analysis for recognition of blurred faces. Accepted in IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010. [71] Ville Ojansivu and Janne Heikkil. Blur insensitive texture classification using local phase quantization. In Image and Signal Processing. Springer Berlin / Heidelberg, 2008. [72] T. Okatani and K. Deguchi. On the wiberg algorithm for matrix factorization in the presence of missing components. IJCV, 2007. [73] V. Patrangenaru and K. V. Mardia. Affine shape analysis and image analysis. In 22nd Leeds Annual Statistics Research Workshop, 2003. [74] B. Pelletier. Non-parametric regression estimation on closed riemannian man- ifolds. Journal of Nonparametric Statistics, 2006. [75] P. Jonathon Phillips, Hyeonjoon Moon, Syed A. Rizvi, and Patrick J. Rauss. The feret evaluation methodology for face-recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000. [76] J. Portilla, V. Strela, M. Wainwright, and E. P. Simoncelli. Image denoising using scale mixtures of gaussians in the wavelet domain. IEEE Transactions on Image Processing, 2003. [77] Ravi Ramamoorthi and Pat Hanrahan. A signal-processing framework for inverse rendering. In SIGGRAPH, 2001. 137 [78] N. Ramanathan, R. Chellappa, and S. Biswas. Computational methods for modeling facial aging: A survey. J. Vis. Lang. Comput., 2009. [79] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian processes for machine learning. The MIT press, 2006. [80] J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In ICML, 2005. [81] P. J. Rousseeuw. Least median of squares regression. J. of Amer. Sta. Assoc., 1984. [82] P. J. Rousseeuw and A. M. Leroy. Robust regression and outlier detection. Wiley Series in Prob. and Math. Stat., 1986. [83] Y. Omori S. Mahamud, M. Herbert and J. Ponce. Provably-convergent itera- tive methods for projective structure from motion. In IEEE Conf. on CVPR, 2001. [84] H. Shum, K. Ikeuchi, and R. Reddy. Principal component analysis with missing data and its application to polyhedral object modeling. IEEE TPAMI, 1995. [85] Terence Sim, Simon Baker, and Maan Bsat. The cmu pose, illumination, and expression (pie) database, 2002. [86] Terence Sim and Takeo Kanade. Combining models and exemplars for face recognition: An illuminating example. In CVPR 2001 Workshop on Models versus Exemplars in Computer Vision, 2001. [87] Stefano Soatto, Gianfranco Doretto, and Ying Nian Wu. Dynamic textures. In ICCV, 2001. [88] N. Srebro, J. D. M. Rennie, and T. Jaakkola. Maximum-margin matrix fac- torization. In NIPS, 2004. [89] Inna Stainvas and Nathan Intrator. Blurred face recognition via a hybrid network architecture. Pattern Recognition, International Conference on, 2000. [90] Charles V. Stewart. Robust parameter estimation in computer vision. SIAM Reviews, 1999. [91] J. Sturm. Using sedumi 1.02, a matlab toolbox for optimization over sym- metric cones. Optimization methods and software, 1999. [92] Hiroyuki Takeda, Sina Farsiu, and Peyman Milanfar. Robust kernel regression for restoration and reconstruction of images from sparse noisy data. In ICIP, 2006. [93] Hiroyuki Takeda, Sina Farsiu, and Peyman Milanfar. Kernel regression for image processing and reconstruction. IEEE TIP, 2007. 138 [94] J. P. Tardif, A. Bartoli, M. Trudeau, N. Guilbert, and S. Roy. Algorithms for batch matrix factorization with application to structure-from-motion. In CVPR, 2007. [95] Michael E. Tipping. Sparse bayesian learning and the relevance vector ma- chine. J. Mach. Learn. Res., 2001. [96] Michael E. Tipping and Anita Faul. Fast marginal likelihood maximisation for sparse bayesian models. In Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, 2003. [97] Michael E. Tipping and Neil D. Lawrence. Variational inference for student- models: Robust bayesian interpolation and generalised component analysis. Neurocomputing, 69(1-3), 2005. [98] C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: a factorization method. IJCV, 1992. [99] P. H. S. Torr. A structure and motion toolkit in matlab ?interactive advantures in s and m?. In Technical report, MSR-TR-2002-56, 2002. [100] P. Turaga, S. Biswas, and R. Chellappa. Role of geometry of age estimation. In ICASSP, 2010. [101] L. Vandenberghe and S. Boyd. Semidefinite programming. SIAM Rev., 1996. [102] L. Vandenberghe and S. Boyd. Semidefinite programming. SIAM Review, 1996. [103] V. N. Vapnik. The nature of statistical learning theory. 1995. [104] Ashok Veeraraghavan, Amit K. Roy-Chowdhury, and Rama Chellappa. Matching shape sequences in video with applications in human movement analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005. [105] R. Vidal and R. Hartley. Motion segmentation with missing data using pow- erfactorization and gpca. In In CVPR, 2004. [106] D. P. Wipf and B. D. Rao. Sparse bayesian learning for basis selection. IEEE Trans. Signal Process, 52, 2004. [107] R. J. Woodham. Photometric method for determining surface orientation from multiple images. Optical Engineerings, 1980. [108] B. Yang, Z. Zhang, and Z. Sun. Robust relevance vector regression with trimmed likelihood function. In IEEE Sig. Proc. Letters, 2007. 139 [109] B. Zhang, S. Shan, X. Chen, and W. Gao. Histogram of gabor phase patterns (hgpp): A novel object representation approach for face recognition. IEEE Transactions on Image Processing, 2007. [110] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld. Face recognition: A literature survey. ACM Comput. Surv., 2003. 140