ABSTRACT

Title of dissertation: ROBUST FACIAL LANDMARK LOCALIZATION WITH APPLICATIONS IN FACIAL BIOMETRICS

Amit Kumar, Doctor of Philosophy, 2019

Dissertation directed by: Professor Rama Chellappa, Department of Electrical and Computer Engineering

Localization of regions of interest in images and videos is a well-studied problem in the computer vision community. Usually, localization tasks involve localizing objects in a given image, as in object detection and segmentation. However, the region of interest can be as small as a single pixel, as in the tasks of facial landmark localization or human pose estimation. This dissertation studies robust facial landmark detection algorithms for faces in the wild, using learning methods based on Convolutional Neural Networks.

Detection of specific keypoints on face images is an integral pre-processing step in facial biometrics and in numerous other applications, including face verification and identification. Detecting keypoints allows us to align face images to a canonical coordinate system using geometric transforms such as similarity or affine transformations, mitigating the adverse effects of rotation and scaling. This challenging problem has become more attractive in recent years as a result of advances in deep learning and the release of more unconstrained datasets. The research community keeps pushing the boundaries to achieve better and better performance on unconstrained images, where the images are diverse in pose, expression and lighting conditions.

Over the years, researchers have developed various hand-crafted techniques to extract meaningful features from face images, most of them being appearance- and geometry-based features. However, these features do not perform well for data collected in unconstrained settings, due to large variations in appearance and other nuisance factors. Convolutional Neural Networks (CNNs) have become prominent because of their ability to extract discriminative features. Unlike hand-crafted features, deep CNNs (DCNNs) perform feature extraction and feature classification from the data itself in an end-to-end fashion. This enables DCNNs to be robust to variations present in the data and at the same time improves their discriminative ability.

In this dissertation, we discuss three different methods for facial keypoint detection based on Convolutional Neural Networks. The methods are generic and can be extended to the related problem of keypoint detection for human pose estimation. The first method, called Cascaded Local Deep Descriptor Regression, uses deep features extracted around local points to learn linear regressors for incrementally correcting the initial estimate of the keypoints. In the second method, called KEPLER, we develop efficient Heatmap CNNs to directly learn the non-linear mapping between the input and target spaces. We also apply different regularization techniques to tackle the effects of imbalanced data and vanishing gradients. In the third method, we model the spatial correlation between different keypoints using Pose Conditioned Convolution-Deconvolution Networks (PCD-CNN), while at the same time making the network pose agnostic by disentangling pose from the face image. Next, we show an application in which facial landmark localization is used to align face images for the task of apparent age estimation of humans from unconstrained images. In the fourth part of this dissertation we discuss the impact of good quality landmarks on the task of face verification.
Previously proposed methods perform with reasonable accuracy on high-resolution, good quality images, but fail when the input image suffers from degradation. To this end, we propose a semi-supervised method which aims at predicting landmarks in low quality images. This method learns to predict landmarks in low-resolution images by modeling the learning process used for high-resolution images. In this algorithm, we use Generative Adversarial Networks, which first learn to model the distribution of real low-resolution images, after which another CNN learns to model the distribution of keypoint heatmaps on such images. Additionally, we propose another high quality facial landmark detection method, which achieves state-of-the-art results. Finally, we discuss the extension of the ideas developed for facial keypoint localization to the task of human pose estimation, which is one of the important cues for Human Activity Recognition. As in PCD-CNN, the parts of the human body can also be modeled in a tree structure, where the relationships between these parts are learned through convolutions while being conditioned on the 3D pose and orientation. Another interesting avenue for research is extending facial landmark localization to naturally degraded images.

ROBUST FACIAL LANDMARK LOCALIZATION WITH APPLICATIONS IN FACIAL BIOMETRICS

by Amit Kumar

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2019

Advisory Committee:
Professor Rama Chellappa, Chair/Advisor
Professor Behtash Babadi
Professor Larry Davis
Professor Vishal Patel
Professor Ramani Duraiswami

© Copyright by Amit Kumar 2019

Dedication

Dedicated to my parents, who have always provided unconditional support throughout my life.

Acknowledgments

While the rest of the dissertation is meant to convey the technical work done, this is the only place where I can take the liberty of expressing my personal gratitude. I owe it to all the people who have made this thesis possible and because of whom my graduate experience has been one that I will cherish forever.

First and foremost I'd like to thank my advisor, Professor Rama Chellappa, for giving me an invaluable opportunity to work on challenging and extremely interesting projects over the past four years. He has always been supportive and has given me the freedom to pursue research in many directions. He has always made himself available for help and advice, and there has never been an occasion when I've knocked on his door and he hasn't given me time. It has been a pleasure to work with and learn from such an extraordinary individual.

It is an honor to have Professor Larry Davis, Professor Behtash Babadi, Professor Abhinav Shrivastava and Professor Vishal Patel on my dissertation committee. I am thankful to them for serving on my committee and providing insightful and diverse suggestions to improve this dissertation.

I am thankful to Professor Vishal Patel, Dr. Jun-Cheng Chen, Dr. Swami Sankaranarayanan, and other UMIACS graduate students for intense and fruitful research discussions that led to a good number of publications.

I owe my deepest thanks to my family - my mother and father who have always stood by me and guided me through my career, and have pulled me through against impossible odds at times. Words cannot express the gratitude I owe them.
iii This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. 2014-14071600012 ,2019-022600002 and D17PC00345. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Gov- ernment. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. iv Table of Contents Dedication ii Acknowledgments iii Table of Contents v List of Tables viii List of Figures xi 1 Introduction 1 1.0.1 Proposed Methods . . . . . . . . . . . . . . . . . . . . . . . . 4 2 Local Deep Descriptor Regression 7 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2.1 Model-based Approaches . . . . . . . . . . . . . . . . . . . . . 10 2.2.2 Regression-based Approaches . . . . . . . . . . . . . . . . . . 11 2.2.3 Part-based Deformable Models . . . . . . . . . . . . . . . . . 12 2.3 Regression of Deep Descriptors . . . . . . . . . . . . . . . . . . . . . 12 2.3.1 Deep Descriptor Construction . . . . . . . . . . . . . . . . . . 13 2.3.2 Computing Shape Indexed Features . . . . . . . . . . . . . . . 16 2.3.3 Learning the Global Regression . . . . . . . . . . . . . . . . . 17 2.3.4 Incorporating Shape Constraint . . . . . . . . . . . . . . . . . 18 2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.4.2 Comparison with state-of-the-art Methods . . . . . . . . . . . 22 2.4.3 Runtime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3 KEPLER: Keypoint and Pose Estimation of Unconstrained Faces by Learn- ing Efficient H-CNN Regressors 28 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.3 KEPLER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.3.1 Network Architecture . . . . . . . . . . . . . . . . . . . . . . . 39 3.3.2 Iteration 1 and 2: Constrained Training . . . . . . . . . . . . 41 v 3.3.3 Iteration 3: Variant of Euclidean loss . . . . . . . . . . . . . . 43 3.3.4 Iteration 4: Hard sample mining . . . . . . . . . . . . . . . . . 44 3.3.5 Iteration 5: Local Error Correction . . . . . . . . . . . . . . . 46 3.4 Experiments and Comparison . . . . . . . . . . . . . . . . . . . . . . 48 3.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4 Disentangling 3D Pose in A Dendritic CNN for Unconstrained 2D Face Align- ment 60 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 4.2 Prior Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.3 Pose Conditioned Dendritic CNN . . . . . . . . . . . . . . . . . . . . 66 4.4 Magnified version of the Tree . . . . . . . . . . . . . . . . . . . . . . 74 4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . 75 4.6 Training Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 4.6.1 Effect of Pose Disentaglement . . . . . . . . . . . . . . . . . . 78 4.6.2 Improvement in localization by augmentation during testing . 78 4.6.3 Training PCD-CNN for COFW . . . . . . . . . . . . . . . . . 79 4.7 Hard mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 4.8 More results on AFLW, AFW, LFPW and HELEN . . . . . . . . . . 85 4.8.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 4.9 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 5 A Cascaded Convolutional Neural Network for Age Estimation of Uncon- strained Faces 90 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 5.3 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 5.3.1 Face Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . 95 5.3.2 Deep Face Feature Representation . . . . . . . . . . . . . . . . 95 5.3.3 Age Group Classifier . . . . . . . . . . . . . . . . . . . . . . . 96 5.3.4 Apparent Age Regressor Per Age Group . . . . . . . . . . . . 96 5.3.5 Age Error Correction . . . . . . . . . . . . . . . . . . . . . . . 98 5.3.6 Non-linear Regression . . . . . . . . . . . . . . . . . . . . . . . 100 5.3.7 A Toy Example . . . . . . . . . . . . . . . . . . . . . . . . . . 101 5.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 5.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 5.4.2 Experimental Details . . . . . . . . . . . . . . . . . . . . . . . 104 5.4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 5.4.4 Runtime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 5.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 vi 6 S2LD : Semi Supervised Landmark Detection for Low Resolution Images 111 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 6.3 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 6.3.1 High to Low Generator and Discriminator . . . . . . . . . . . 117 6.3.2 Semi-Supervised Landmark Localization . . . . . . . . . . . . 120 6.3.2.1 Heatmap Generator G2 . . . . . . . . . . . . . . . . 120 6.3.2.2 Heatmap Discriminator D2 . . . . . . . . . . . . . . 121 6.3.2.3 Heatmap Confidence Discriminator D3 . . . . . . . . 122 6.3.3 Semi-supervised Learning . . . . . . . . . . . . . . . . . . . . 122 6.4 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . 125 6.4.1 Ablation Experiments . . . . . . . . . . . . . . . . . . . . . . 125 6.4.2 Experiments on Low Resolution images . . . . . . . . . . . . . 128 6.4.3 Face Recognition experiments . . . . . . . . . . . . . . . . . . 128 6.5 Evaluation on the IJB-S dataset . . . . . . . . . . . . . . . . . . . . . 133 6.5.1 Additional Experiments: . . . . . . . . . . . . . . . . . . . . . 135 6.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 7 Conclusion 138 7.0.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 Bibliography 141 Bibliography 141 vii List of Tables 2.1 Input size and the number of strides in conv1, max1, conv2 and max2 layers for 4 stages of regression. . . . . . . . . . . . . . . . . . . . . . 
17 2.2 Averaged error comparison of different methods on the LFPW dataset. 23 2.3 Averaged error comparison of different methods on the Helen dataset. 25 2.4 Averaged error comparison of different methods on the iBUG chal- lenging dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.1 Comparison of KEPLER with other state of the art methods. NME stands for normalized mean error. For AFLW, the numbers for other methods are taken from respective papers following the PIFA proto- col. For AFW, the numbers are taken from respective works published following the protocol of [177]. . . . . . . . . . . . . . . . . . . . . . . 52 3.2 Performance comparison of the proposed method on COFW dataset. It is to be noted that NME in FPLL, ESR, FLD and RCPR (trained on COFW) is calculated over 29 points, which is calculated for 21 points in KEPLER. It can be observed that the performance of KE- PLER is comparable to RCPR without finetuning on the training set of COFW. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.3 Summary of performance on different protocols of AFLW and AFW by KEPLER. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.4 Comparison of Mean error in 3D pose estimation by KEPLER on AFLW testset. For AFLW [146] only compares mean average error in Yaw. For AFW we we compare the percentage of images for which error is less than 15?. . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.1 Root mean square error normalized by bounding box size, calculated on the AFLW validation set following the PIFA protocol. The pro- posed PCD-CNN when conditioned on pose yields better performance for the task of keypoint localization. . . . . . . . . . . . . . . . . . . . 70 viii 4.2 Mean square error normalized by bounding box size, calculated on the AFLW validation set following the PIFA protocol. This table shows that PCD-CNN when followed by another classification stage results in lower localization error compared to classification followed by regression. Note that conditioning on pose is not used in both the cases above for fair comparison. . . . . . . . . . . . . . . . . . . . . . 70 4.3 Root mean square error normalized by bounding box calculated on the AFLW validation set following PIFA protocol. This table indi- cates the effect of using Mask-softmax over Softmax. . . . . . . . . . 73 4.4 Root mean square error normalized by bounding box calculated on the AFLW validation set following PIFA protocol. This table depicts the effect of offline hard sample mining. . . . . . . . . . . . . . . . . . 74 4.5 Root mean square error normalized by bounding box calculated on the AFLW validation set following PIFA protocol. This table shows the effect of offline hard-mining and quadrupling the number of de- convolution filters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.6 Comparison of the proposed method with other state of the art meth- ods. C+C stands for classification+classification. For AFLW, num- bers for other methods are taken from respective papers following the PIFA protocol. For AFW, the numbers are taken from respective published works following the protocol of [177]. . . . . . . . . . . . . 79 4.7 Comparison of the proposed method with other state of the art meth- ods on AFLW-PIFA test set, categorized by absolute yaw angles. The numbers represent the normalized mean error. . . . . . . . . . . . . 80 4.8 Comparison of the proposed method with other state-of-the-art meth- ods on 300W dataset. 
The NME for comparison are taken from the Table 3 of [103]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 4.9 Comparison of the proposed method with other state of the art meth- ods on COFW dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . 82 4.10 Mean square error normalized by bounding box calculated on AFLW test set following PIFA protocol. When PCD-CNN and fine-grained localization network both are conditioned on pose yields lower error rate. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 4.11 NME on different datasets Pre-Augmentation and Post-Augmentation during testing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 5.1 The base architecture of DCNN model used in this work [162] to finetune on the age group classification and ?age regression for each age group. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 5.2 Age estimation results on the Adience benchmark. Listed are the mean accuracy ? standard error over all age categories. Best results are marked in bold. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 5.3 Performance comparison on the Chalearn Challenge dataset. . . . . . 107 ix 5.4 Performance comparison of different age estimation algorithms on the FG-Net aging database using mean absolute error(MAE). Since the training of DCNNs is computationally intensive, the evaluation of the proposed approach does not follow the full LOPO protocol. The results are for an empirical evaluation to show the performance level of the proposed approach. . . . . . . . . . . . . . . . . . . . . . . . . 109 6.1 (a) Landmark Detection Error on Real Low Resolution dataset. (b) Table for ablation experiments under different settings on synthesized LR images. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 6.2 Verification performance on Tinyface dataset under different settings (a) LightCNN trained from scratch (b) Using Inception-ResNet pre- trained on MsCeleb-1M . . . . . . . . . . . . . . . . . . . . . . . . . . 130 6.3 Face recognition performance using super-resolution before face-alignment132 6.4 Retrieval rates at different ranks(Higher is better) . . . . . . . . . . . 134 6.5 False negative rates at different false positive rates. (Lower is better) 134 6.6 Comparison of the proposed method with other state of the art meth- ods on AFLW (Full) and 300-W testsets. The NMEs for comparison on 300W dataset are taken from the Table 3 of [103]. In this case G2 is trained in supervised manner using high resolution images of size 128? 128. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 x List of Figures 1.1 Face alignment in a face analysis system . . . . . . . . . . . . . . . . 2 1.2 Rigid image transformations: translation, rotation, scale, shear; Non- rigid image transformations and out of plane rotation: deformation. The face alignment poses all these five transformations. . . . . . . . . 3 2.1 We present a deep descriptor-based regression approach for fiducial point extraction. This figure shows fiducial points extracted on all the detected faces on an image from the IJB-A [81] dataset using the proposed method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2 Overview of the proposed method. During training, we extract deep descriptors for each landmark and concatenate them to form a shape- indexed feature vector. Given these features and target shape incre- ments ?Sti , we learn the linear regression weights W t. 
During test- ing, deep descriptors are extracted around each point of the initial- ized mean shape. Intermediate shape is predicted using the regressor weights W t. This process is iterated to reach the final estimated shape. 9 2.3 Architecture of the proposed Deep Descriptor Network. The height and width represents the dimensions of each feature map, whereas the depth denotes the number of features maps for a given layer. The number of strides for each layer is restricted to 1. . . . . . . . . . . . 14 2.4 Average pt-pt error (normalized by face size) vs fraction of images in (a) LFPW, (b) Helen, (c) AFW and (d) iBUG. . . . . . . . . . . . . 20 2.5 Qualitative results of our landmark localization method. First row: LFPW, Second row: Helen, Third row: AFW and Fourth row: IBUG. Fifth row: IJB-A. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.6 Average 3-pt error (normalized by eye-nose distance) vs fraction of images in the IJB-A dataset. . . . . . . . . . . . . . . . . . . . . . . . 26 3.1 Sample results generated by the proposed method. The numbers in black are the predicted 3D pose P:Pitch Y:Yaw R:Roll. Green dots represent the predicted keypoints. The bar graphs show the visibility confidence of each of the 21 keypoints. . . . . . . . . . . . . . . . . . 28 xi 3.2 Sample results generated by the proposed method KEPLER. White dots represent the location of keypoints after each iteration. The first row shows an image from the AFLW dataset. The points move at subpixel level after fourth iteration. The second row is a sample image from the AFW dataset, which shows how the last stage of error correction can effectively mitigate the inconsistent bounding box across datasets. The numbers in red are the predicted 3D pose P:Pitch Y:Yaw R:Roll . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.3 Overview of the architecture of KEPLER. The function f() predicts the visibility, pose and the corrections for the next stage. The rep- resentation function h() forms the input representation for the next iteration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.4 The KEPLER network architecture. The dotted line shows the chan- neled inception network. The intermediate features are convolved and the responses are concatenated in a similar fashion as inception module. Tasks such as pose are abstract and contained in deeper layers, however, the localization property is in the shallower layers. . 40 3.5 Qualitative results of KEPLER after second stage. The green dots represent the predicted points after second stage. Red dots represent the ground truth. It can be seen that the visible points have taken the shape of input face image. . . . . . . . . . . . . . . . . . . . . . . 43 3.6 Error Histogram of training samples after stage 3 . . . . . . . . . . . 45 3.7 Red dots in the left image represent the ground truth while green dots represent the predicted points after the fourth iteration. Local patches centered around predicted points are extracted and fed to the network. The network shown in Fig 3.4 is trained on the task of local fiducial correction and visibility of fiducials inside the patch. The image on the right shows the predictions after local correction. . 47 3.8 Schema to convert COFW 29 point format to AFLW 21 point format. 51 3.9 Cumulative error distribution curves for landmark localization on the AFLW dataset. The numbers in the legend are the average normal- ized mean error normalized by the face size. . . . . . . . . . . . . . . 
53 3.10 Cumulative error distribution curves for landmark localization on the AFW dataset. The numbers in the legend are the fraction of testing faces that have average error below (5%) of the face size. . . . . . . . 55 3.11 Cumulative error distribution curves for pose estimation on AFW dataset. The numbers in the legend are the percentage of faces that are labeled within ?15? error tolerance . . . . . . . . . . . . . . . . . 55 3.12 Cumulative error distribution curves for landmark localization on the COFW dataset. This is to be noted that the error is calculated over 21 points normalized by inter-occular distance. . . . . . . . . . . . . . 56 3.13 Cumulative error distribution curves for landmark localization on the IJBA dataset. The error is calculater for 3 points normalized by the distance between midpoint of eyes and the nose. . . . . . . . . . . . . 56 xii 3.14 Qualitative results of KEPLER after last stage. The green dots rep- resent the final predicted points after last stage. First row are the test samples from AFLW. Second row shows the samples from AFW dataset. The last two rows are the results of KEPLER after last stage from AFLW testset for all variants protocol. The green dots represent the final predicted points after second stage. . . . . . . . . . . . . . . 58 3.15 Qualitative results of KEPLER after last stage on COFW dataset. The green dots represent the final predicted points after last stage. . . 59 3.16 Qualitative results of KEPLER after last stage on IJBA dataset. The green dots represent the final predicted points after last stage. . . . . 59 4.1 (a) A bird?s eye view of the proposed method. Dendritic CNN is explicitly conditioned on 3D pose. A generic CNN is used for auxiliary tasks such as fine-grained localization or occlusion detection. . . . . . 61 4.2 (a) Details of the proposed method. The dotted lines on top of con- volution layers denote residual connections. The feature maps from the pose model are multiplied element-wise with the feature maps of the keypoint model. The network inside the grey box represents the proposed PCD-CNN, whereas the second network inside the blue box is modular and can be replaced for an auxiliary task. A conv-deconv network for finer localization is used alongside a second regression network for occlusion detection. (b) Proposed dendritic structure of facial landmark points for effective information sharing among land- mark points. The nodes of the dendritic structure are the outputs of deconvolutions while the edges between nodes i and j are modeled by convolution functions fij. For the architecture of deconvolution network refer to Figure 4.3. . . . . . . . . . . . . . . . . . . . . . . . 63 4.3 Detailed description of a single Squeezenet-DeconvNet network. Note the fewer number of deconvolution filters. Each deconvolution net- work is identical to the one shown above. . . . . . . . . . . . . . . . . 71 4.4 The proposed extension of the dendritic structure from Figure 4.2 generalizing to other datasets (COFW and 300W) each with different number of points. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 4.5 Cumulative error distribution curves for landmark localization on AFLW, AFW and COFW dataset respectively. (a) Numbers in the legend represents mean error normalized by the face size. (b) Num- bers in the legend are the fraction of testing faces that have average normalized error below 5%. (c) The numbers in the legend are the fraction of testing faces that have average normalized error below 10%. . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 4.6 Comparison of NME and failure rate over visible landmarks out of 29 landmarks from the COFW dataset. . . . . . . . . . . . . . . . . . . 83 4.7 Histogram of error, when evaluated on the training set of (a) AFLW (b) COFW. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 xiii 4.8 (a) Precision Recall for the occlusion detection on the COFW dataset. (b)Cumulative error distribution curves for pose estimation on AFW dataset. The numbers in the legend are the percentage of faces that are labeled within ?15? error tolerance. Cumulative Error Distribu- tion curve for (c) Helen (d) LFPW, when the average error is nor- malized by the bounding box size. . . . . . . . . . . . . . . . . . . . . 84 4.9 The proposed extension of the dendritic structure from Figure 1, gen- eralizing to other datasets with variable number of points. . . . . . . 84 4.10 Qualitative results generated from the proposed method. The green dots represent the predicted points. Every two show randomly se- lected samples from AFLW, AFW, COFW, and 300W respectively with all the visible predicted points. . . . . . . . . . . . . . . . . . . . 86 4.11 Qualitative results generated from the proposed method. The green dots represent the predicted points. Each row shows some of the difficult samples from AFLW, AFW, COFW, and 300W respectively with all the visible predicted points. . . . . . . . . . . . . . . . . . . . 87 5.1 Estimated age on sample images from [45]. Our method is able to predict the age in unconstrained images with variations in pose, illu- mination, age groups, and expressions. . . . . . . . . . . . . . . . . . 91 5.2 An overview of the proposed age cascade apparent age estimator. . . 92 5.3 The 3-layer neural network used for estimating the increment in age for each age group. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 5.4 Training data distribution of ICCV-2015 Chalearn Looking at People Apparent Age Estimation Challenge, with regard to age groups. . . . 105 5.5 Age estimates on the Chalearn Validation set. The incorrect age obtained without using the self correcting module is shown in blue, while the corrected age is given in red. . . . . . . . . . . . . . . . . . 107 6.1 Inaccurate landmark detections on low resolution images. We show landmark predicted by different systems. (a) MTCNN [169] and (b) [19] are not able to detect any face in the LR image. (c) Current practice of directly upsampling the low-resolution image to a fixed size of 128 ? 128 by bilinear interpolation. (d) Output from a net- work trained on downsampled version of HR images. (e) Landmark detection using super-resolved images. Note: For visualization pur- poses images have been reshaped after respective processing. Actual size of the images is in the range of 20? 20 pixels . . . . . . . . . . . 112 xiv 6.2 Overview of the proposed approach. High resolution input is passed through High-to-Low generator G1 (shown in cyan colored block). The discriminator D1 learns to distinguish generated LR images vs. real LR images in an unpaired fashion. This generated image is fed to heatmap generator G2. Heatmap discriminator D2 distinguishes generated heatmap vs. groundtruth heatmaps. The pair G2, D2 is inspired from BEGAN [13]. In addition to generated and groundtruth heatmaps, the discriminator D3 also receives predicted heatmaps for real LR images. This enables the generator G2 to generate realistic heatmaps for un-annotated LR images. . . . . . . . . . . . 
116 6.3 (a) High to low generator G1. Each ? represents two residual blocks followed by a convolution layer. (b) Discriminator used in D1 and D2. Each ? represents one residual block followed by a convolution layer. . . . 119 6.4 Sample outputs of High to Low generation of AFLW dataset. For more results please refer to the supplementary material. . . . . . . . . 120 6.5 Architecture of the heatmap generator G2. Architecture of this network is based on U-Net. Each ? represents two residual blocks. 99K represents skip connections between the encoder and decoder. . . . . . . . . 121 6.6 Sample key-point detections on TinyFace images. . . . . . . . . . . . . 125 6.7 Snippet of the annotation tool used. . . . . . . . . . . . . . . . . . . . 129 6.8 (a) Retrieval rates at different ranks. (b) False negatives at different false positive rates. . . . . . . . . . . . . . . . . . 135 6.9 Sample outputs obtained by training G2 with HR images. First row shows samples from AFLW test set. Second row shows sample images from 300W test set. Last two columns of second row shows outputs from challenging subset of 300W . . . . . . . . . . 137

Chapter 1: Introduction

Interpretation and analysis of faces are fundamental functions of the human visual system and improve social interaction. Recently, with the increase in the use of portable image and video recording devices, the trend has been shifting towards automatic face analysis in uncontrolled scenarios. To achieve a fully automatic face analysis system, a face detector and a robust facial landmark detector are crucial. More generally, localization in images refers to detecting or segmenting objects in a given image. However, the region of interest can be as small as a single pixel. One such task is facial landmark localization, which refers to automatically detecting important keypoints on a face, such as the eye corners and the nose tip. Localizing regions of interest is extremely challenging and has been researched quite extensively in the literature. Objects vary in appearance and appear in a variety of shapes and scales. Humans appear in different poses and are usually occluded. Face images can be captured under extreme pose, occlusion or low resolution. This dissertation studies robust facial landmark detection algorithms for faces in the wild using learning methods based on Convolutional Neural Networks.

In general, an automatic face analysis system comprises four main steps: face detection, face association, facial landmark localization and face alignment, and facial feature extraction and face analysis, as illustrated in Figure 1.1.

Figure 1.1: Face alignment in a face analysis system
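The face alignment step in this pipeline typically maps the detected landmarks onto a canonical template using a similarity (or affine) transform. The snippet below is a minimal sketch of that idea using scikit-image; the five template coordinates and the 224-pixel output size are illustrative assumptions made here for concreteness, not values taken from this dissertation.

```python
import numpy as np
from skimage import transform

# Hypothetical canonical template: five landmark positions (eye centers,
# nose tip, mouth corners) in a 224 x 224 aligned crop, (x, y) order.
# These coordinates are illustrative only.
TEMPLATE_224 = np.array([
    [70.0,  92.0],   # left eye center
    [154.0, 92.0],   # right eye center
    [112.0, 128.0],  # nose tip
    [80.0,  160.0],  # left mouth corner
    [144.0, 160.0],  # right mouth corner
])

def align_face(image, landmarks, out_size=224):
    """Warp `image` so that the detected `landmarks` (5 x 2 array, x-y order)
    map onto the canonical template via a similarity transform
    (rotation + uniform scale + translation)."""
    tform = transform.SimilarityTransform()
    tform.estimate(np.asarray(landmarks, dtype=np.float64), TEMPLATE_224)
    # skimage's warp expects the inverse mapping (output -> input coordinates).
    aligned = transform.warp(image, tform.inverse,
                             output_shape=(out_size, out_size))
    return aligned
```

Once faces are warped into such a canonical frame, in-plane rotation and scale differences are largely removed before feature extraction, which is precisely why accurate landmarks matter for the downstream tasks discussed next.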
Facial landmark localization is an integral component of almost every facial biometric task, such as face identification, face synthesis, and 3D modeling of faces. These landmarks are used to align faces, which mitigates the effects of in-plane rotation and scaling. Facial landmarks are used both directly and indirectly. Typical direct applications include facial expression analysis [147], where landmarks are used to decode a specific set of emotions or a non-verbal message, and marker-less motion capture [139], where landmarks assist in computer-generated imagery. Indirect applications are those in which the facial landmarks are used for some pre-processing step, for example: face verification [30, 80]; 3D face reconstruction [120], where, for instance, the landmarks are used to aid the structure-from-motion algorithm; head-pose estimation [27], where a 3D face model is fitted to estimated 2D landmark positions; face tracking; and other face processing tasks such as prediction of gender, age, expression, or other facial attributes [46].

Figure 1.2: Rigid image transformations: translation, rotation, scale, shear; non-rigid image transformations and out-of-plane rotation: deformation. Face alignment has to contend with all five of these transformations.

Detection of facial landmarks in uncontrolled environments is a non-trivial problem for several reasons. The key factor is the large intra-class variability of the input image due to changes in the position, scale, and rotation of the face, lighting conditions, background clutter, facial expression, occlusions and self-occlusions, hair style, make-up, race, aging, modality (webcam, camera, scanned image) and so on. Figure 1.2 illustrates the different transformations an image can undergo and shows that, due to the deformable nature of the human face, the problem of landmark detection is extremely challenging.

With the advent of Deep Convolutional Neural Networks, facial biometrics problems such as facial landmark detection have received a great deal of attention from the computer vision community. DCNNs have been shown to be very effective for several computer vision tasks such as image classification [64, 122, 137] and object detection [55, 117]. Deep CNNs (DCNNs) are highly non-linear regressors because of the presence of hierarchical convolutional layers with non-linear activations. Moreover, deep networks have been shown to improve the performance of facial landmark detection by a large margin [89, 171, 176]. Existing methods for the facial keypoint localization task have focused primarily on detecting essential landmarks for frontal faces (pose yaw angles between −60° and 60°). Most of these methods fail to correctly localize keypoints for off-frontal or profile faces, which occur frequently in images collected in unconstrained settings. Moreover, manually annotating facial keypoint locations is a tedious task, and hence it is very difficult to collect a large number of training samples to train a DCNN for this task.

1.0.1 Proposed Methods

In the second part of this dissertation, we discuss a deep learning-based method called Local Deep Descriptor Regression that addresses the task of facial keypoint localization. The proposed method consists of several stages of feature extraction followed by linear regression. It is worth noting that networks trained for the tasks of face detection or face verification capture abstract information about the structure of the face.
Hence, such a network is used for feature extraction, and the extracted features are then used to design linear regressors. The spatial resolution of the areas used for feature extraction is reduced in a step-wise manner to achieve better localization over the image space.

Chapter 3 discusses another cascaded-regression-based method, called KEPLER. This method shows an application of multi-task learning in Convolutional Neural Networks, where a single network is used to jointly estimate the facial keypoints, their visibility, and the 3D head pose. Information is pooled from shallow as well as deeper layers of the network to achieve better localization. Practical issues such as vanishing gradients are tackled by designing improved loss functions and by using training policies such as hard sample mining and local error correction.
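Heatmap-based keypoint networks of this kind learn a mapping from the input image to one response map per keypoint. As a rough illustration of what such a target space looks like, the sketch below builds ground-truth heatmaps by placing a 2D Gaussian at each annotated landmark and leaving invisible landmarks as all-zero maps. This is a common construction for heatmap regression in general; the exact target design, map resolution, and Gaussian width used in KEPLER may differ.

```python
import numpy as np

def gaussian_heatmap(height, width, center_xy, sigma=3.0):
    """Ground-truth heatmap with a 2D Gaussian bump at `center_xy` (x, y).
    `sigma` is in pixels; the peak value is 1."""
    xs = np.arange(width, dtype=np.float32)
    ys = np.arange(height, dtype=np.float32)[:, None]
    cx, cy = center_xy
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def keypoint_targets(landmarks, visibility, height, width, sigma=3.0):
    """Stack one heatmap per landmark; invisible landmarks get an all-zero
    map, so the loss can treat them separately."""
    maps = np.zeros((len(landmarks), height, width), dtype=np.float32)
    for k, ((x, y), vis) in enumerate(zip(landmarks, visibility)):
        if vis:
            maps[k] = gaussian_heatmap(height, width, (x, y), sigma)
    return maps
```

A network trained against such targets predicts one response map per keypoint, and the landmark location is typically read off as the argmax of each predicted map.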
Finally in chapter 7 we conclude the discussion by presenting future plans of extension and open issues in landmark localization. 6 Chapter 2: Local Deep Descriptor Regression 2.1 Introduction Most of the recent methods use discriminative shape regression approach to esti- mate the face landmark positions. With their ability to utilize large amount of training data, and enforce shape constraints adaptively, regression-based methods have achieved state-of-the-art performance on various unconstrained face alignment datasets. However, the success of these methods is limited by the strength of the features they use. In previous works, the features used are either hand crafted ; for example SIFT was used as features in [158], or learned from a limited set of training samples [25, 116]. In recent years, features obtained using deep CNNs have yielded impressive re- sults for various computer vision applications. They significantly outperform meth- ods proposed earlier for the tasks of face detection and recognition. It has been shown in [84] that a deep CNN pre-trained with a large generic dataset such as Ima- genet [122], can be used as a meaningful feature extractor. Although these features are effective for reliable classification, they are global in nature. Hence, this approach may not be effective for problems such as face alignment where local features are desirable. To overcome this problem, Overfeat [130] uses predicted detection bound- 7 Figure 2.1: We present a deep descriptor-based regression approach for fiducial point ex- traction. This figure shows fiducial points extracted on all the detected faces on an image from the IJB-A [81] dataset using the proposed method. aries, but lacks the needed pixel-based localization feature. [138] and [48] propose pixel-based localization, the former based on the Restricted Boltzmann machine while the latter processes the image to determine a key-point descriptor. In this chapter, we address the localization problem in existing deep CNNs by constructing a deep convolutional key-point descriptor model. We build a network which takes a small local image patch around a pixel as an input and produces a feature vector as the output. We claim that the proposed deep descriptor network can be used as a substitute for SIFT [100] descriptors in most vision problems. To support our claim, we apply the descriptor model for facial landmark detection. Local features calculated for a small rectangular patch around each estimated land- mark position are used by a linear regressor to learn the shape increment during training, and predict the landmark positions at test time. Figure ?? shows several faces where our method is able to locate fiducial points on all the detected faces. Overall, this chapter makes the following contributions: 8 Figure 2.2: Overview of the proposed method. During training, we extract deep descrip- tors for each landmark and concatenate them to form a shape-indexed feature vector. Given these features and target shape increments ?Sti , we learn the linear regression weights W t. During testing, deep descriptors are extracted around each point of the initialized mean shape. Intermediate shape is pre- dicted using the regressor weights W t. This process is iterated to reach the final estimated shape. 1. We construct a novel deep descriptor network to evaluate the local features for a given key-point. 2. We perform face alignment by applying linear regression to the deep descrip- tors evaluated for facial landmarks. 
This chapter is organized as follows. Section 2.2 reviews a few related works. Details of our deep descriptor-based face alignment method are given in Section 2.3. Section 6.4.3 provides the landmark localization results on five challenging datasets. Finally, Section 2.5 concludes the chapter with a brief summary and discussion. 9 2.2 Previous Work The task of face alignment can be classified broadly into three categories depending on the approach. 2.2.1 Model-based Approaches Model-based approaches learn a shape model during training and use it to fit new faces during testing. The pioneering works of Cootes et al. such as Active Ap- pearance Models (AAM) [36] and Active Shape Models (ASM) [35] were built using PCA constraints on appearance and shape. In recent years many improvements over these models have been proposed in [57, 58, 95, 105, 128, 141]. In [37], Cristi- nacce and Cootes generalised the ASM model to a Constrained Local Model (CLM), in which every landmark has a shape constrained descriptor to capture the appear- ance. In [127], a more sophisticated local model and mean shift was used to obtain good results. However, these methods depend upon the goodness of the error cost function and how well it is optimised. For example, AAM estimates the shape by minimizing the texture residual. Recently, Antonakos et al. [7] proposed a method along similar lines by modeling the appearance of the object using multiple graph- based pairwise normal distributions (Gaussian Markov Random Field) between the patches extracted from the regions. However, the learned models lack the power to capture complex face image variations in pose, expression and illumination. Also, they are sensitive to initialization due to gradient descent optimization, a critical step. 10 2.2.2 Regression-based Approaches Since face alignment is naturally a regression problem there has been a plethora of regression-based approaches in recent years. These methods learn a regression model that directly maps image appearance to target output. But the performance of these methods depends on the robustness of local descriptors. Sun et al. [135] proposed a cascade of carefully designed CNNs in which at each level outputs of multiple net- works are fused for landmark estimation. Our work is different from [135], in that we use a single CNN carefully designed to provide a unique key-point descriptor. Xiong et al. [158] predicts the increment in shape by applying linear regression on SIFT features. Burgos et al. [151] proposed a cascade of T-regressors to estimate the pose in image sequence using pose-indexed features. Cao et al. [25] sequen- tially learned a cascade of random fern regressors using pixel intensity difference as the feature and regresses the shape stage-wise over the learnt cascade. They per- formed regression on all parameters simultaneously, thus effectively exploiting the shape constraint. Following this, Sun et al. [116] proposed cascaded regression using fern regressors and local binary features. Subsequently, Burgos et al. [24] extended their work to face alignment with occlusion handling, enhanced shape indexed fea- tures and more robust initialization which they refer to as Robust cascaded pose regression (RCPR). Li et al. [159] combined multiple final shapes from multiple ini- tializations in a cascade regression manner using weights matrices learnt to combine these hypotheses accurately. Recently, Lee et al. 
[93] proposed a Gaussian Process Regression face alignment method based on the responses of the Gaussian filters 11 around the patches extracted from the region adjacent to intermediate landmarks. Finally, Zhu et al. [174] proposed a hierarchical face alignment , starting from a coarse shape estimate and refining it to reach the target landmark. Also, Xiong et al. [157] proposed the global supervised descent method where they consider direct optimization over the landmarks independent of any shape model. 2.2.3 Part-based Deformable Models Part-based deformable models perform alignment by maximizing the posterior like- lihood of part locations given an input image I. The models vary in the optimization techniques or the shape priors used. In [126] Saragih et al. used a method similar to mean shift to optimize the posterior likelihood. Recently, Saragih [125] developed a sample specific prior which significantly improves over the original PCA prior in ASM , CLM and AAM. Zhu and Ramanan [177] used a part-based model for face detection, pose estimation and landmark localization assuming the face shape to be a tree structure. Asthana et al. [9] combined discriminative response map fitting with CLM, which learns a dictionary of probability response maps based on local features and adopts linear regression-based fitting in the CLM framework. 2.3 Regression of Deep Descriptors The proposed method for facial landmark detection, called Local Deep Descrip- tor Regression (LDDR), consists of two modules. The first module generates local features for each estimated facial landmark points using the deep descriptor frame- 12 work. These features are concatenated together to form a global shape-indexed feature. The second module is a linear regressor which learns the relationship be- tween the shape feature and the corresponding shape increment during training. The process is repeated stage-by-stage in a cascaded fashion. Figure 2.2 shows the overview of our method. 2.3.1 Deep Descriptor Construction In order to construct a deep CNN descriptor, we start with the Alexnet [84] network. We use the publicly available network weights trained on the Imagenet [122] data using Caffe [72], that are distributed with RCNN [55] . However, this particular CNN cannot be used directly as a key-point descriptor because of the following limitations. Firstly, the CNN requires a fixed input image size of 224 ? 224 pixels which is too large to be considered for the patch size around the key-point. Secondly, a single activation unit at the fifth convolutional layer (conv5) has a highly overlapping receptive field of size 195 ? 195 pixels, which makes localization difficult. As a result, two pixel points in close vicinity cannot be distinguished from one another. On further analysis of the first problem, we found that a CNN requires fixed size input only because of its fully-connected layers. A convolutional layer can process any input as long as it is larger than the convolutional kernel. On the other hand, a fully connected layer needs a fixed size input as its output dimension is predetermined. To resolve this issue, we remove the last max pooling layer (pool5) and all the subsequent fully-connected layers (fc6, fc7, fc8, and softmax) from the 13 Figure 2.3: Architecture of the proposed Deep Descriptor Network. The height and width represents the dimensions of each feature map, whereas the depth denotes the number of features maps for a given layer. The number of strides for each layer is restricted to 1. network. 
The CNN output is, therefore, computed by the conv5 layer containing 256 feature channels. Analyzing the second problem, we find that a major contributor for the large size of receptive field is the inter-layer subsampling operation, which is implemented in the form of strides in the convolutional as well as max pooling layers. They are deployed mainly to reduce the number of parameters and feature computation time, which are not required for a key-point descriptor since the small patch input will drastically bring down the convolution time anyway. Hence, strides in all the existing layers are set to 1. Also, padding from all the convolutional layers are removed as they contribute very little to describing a key-point. Instead, we apply a single pixel padding in the max pooling layer to further reduce the size of the receptive field without altering the output. With these architectural changes, the receptive field size is reduced to 21 ? 21 pixels which is good enough for the size of a local patch surrounding a key-point. The final network structure obtained for the deep descriptor is shown in Figure 2.3. With the input size as small as the 14 receptive field, single pixel feature maps are obtained at the conv5 layer forming a 256 dimensional output vector. The proposed deep descriptor satisfies the essential properties of being a key- point descriptor. It is position independent, as it depends only on the image patch relative to the point. It is robust to small geometric transformations because of the max pooling operation in CNN. The normalization operation after each convolu- tional layer makes it robust to illumination variations. Since the network weights are trained using fixed sized inputs, the descriptor works best when the input im- ages are scaled to the same size prior to key-point extraction, thus reducing the dependency on scale. Hence it can be used as a generic keypoint descriptor in many computer vision applications. Additionally, for domain specific problems, the model weights can be fine-tuned before evaluating the features. For the application of face alignment, we fine-tune the model weights using face images from the FDDB [70] dataset. Fine-tuning was done for the face detection task, which classifies the in- put as face or non-face. The procedure adopted is similar to the method described in [55]. During fine-tuning, the network learns features specific to face parts which is a crucial part in our work. As a result, the activations at the fifth convolutional layer become more discriminatory to local face patches such as eyes, nose, lips, etc. The other advantage of fine-tuning is that the same network weights can be used for both face detection as well as face alignment. Once the network is fine-tuned, the test image just goes through a forward pass to generate CNN features, which are then fed to a simple linear regression method to generate incremental shapes. 15 2.3.2 Computing Shape Indexed Features Given an initial mean shape containing L landmarks, we compute the 256 dimen- sional deep descriptor ?tl for each landmark l ? 1, 2, ....L at a given stage t. A global shape indexed feature is composed by concatenating the set of deep descriptors, i.e., ?t = [?t , ?t1 1, ..., ?tL] , which is subsequently used to learn the ground truth shape increment, as explained in section 2.3.3. We adopt a coarse to fine regression approach. It is important in face alignment that the features used to describe the landmark points are local. 
To predict the offset ?s of a single landmark, we extract the deep descriptors from a local region of size r. It has been shown in [116] that the optimal size is almost linear to the standard deviation of individual shape increment ?s. Since, we want ?s to decrease sharply at every stage, we need to choose the size of the local patch region around the landmark accordingly. Following [116], we keep the patch size for deep descriptor larger in the first stage and decrease it linearly in subsequent stages. With this modification, the deep descriptor is bound to generate higher dimensional output for the initial stages. Additional structural modification is needed for uniform output dimension, which limits us to consider only four stages of regression. The patch sizes normalized by face rectangle are taken to be 0.4, 0.3, 0.2, 0.1 for respective stages. Since the face is resized to 224?224 pixels (the input face size used for fine-tuning), the actual patch sizes correspond approximately to 92, 68, 42, 21. Moreover, variable amounts of strides are added to conv1, max1, conv2 and max2 layers for each stage as listed in Table 2.1. The network for the last stage remains unchanged as its input 16 patch size matches the requirement for our deep descriptor network. This ensures a consistent output dimension of 256 at each stage and for every landmark. In addition to just removing the fully connected layers, our network has reduced the amount of subsampling/stride for different regression stages as shown in Table 2.1. Stage 1 Input Size (pixels) conv1 max1 conv2 max2 Stage 1 92? 92 4 2 1 1 Stage 2 68? 68 3 2 1 1 Stage 3 42? 42 2 1 1 2 Stage 4 21? 21 1 1 1 1 Table 2.1: Input size and the number of strides in conv1, max1, conv2 and max2 layers for 4 stages of regression. 2.3.3 Learning the Global Regression In this section, we introduce our basic shape regression methodology for the face alignment problem. Unlike [25] and [116] which have two level cascaded regression framework, we perform a single global regression at each stage. Given a face image I and initial shape S0, the regressor computes the shape increment ?S from the deep descriptors and updates the face shape using (2.1) St = St?1 +W t?t(I, St?1) (2.1) After extracting the deep descriptors, we concatenate them to a form a global shape- indexed feature ?t = [?t1, ?t1, ..., ?tL]. Our aim is to learn a global linear projection 17 W t by minimizing the following objective function: ?N min ??S?ti ?W t?t(Ii, St?1i )?22 + ??W t?22, (2.2) W t i=1 where the first term is the regression target and the second term is a regulariza- tion of W t in L2 sense. The parameter ? controls the strength of regularization. Regularization here plays a major role due to the high dimensionality of the shape- indexed feature. In the experiments, the dimensionality of features for 68 landmarks points could be as high as 17K+. Without regularization there could be substantial amount of over-fitting. For implementing regression, we use L2 regularized L2-loss support vector regression using the LIBLINEAR [47] package. Since the objective function is quadratic in W t, we can always reach a global minimum. 2.3.4 Incorporating Shape Constraint As mentioned in [25], the shape constraint is preserved by learning a vector regressor and explicitly minimizing the shape alignment error as in (2.2). 
2.3.4 Incorporating Shape Constraint

As mentioned in [25], the shape constraint is preserved by learning a vector regressor and explicitly minimizing the shape alignment error as in (2.2). Since each shape is updated in an additive manner, and each shape increment is a linear combination of certain training shapes, the final shape is modeled as a linear combination of the initial shape $S^0$ and all training shapes:

$S = S^0 + \sum_{i=1}^{N} w_i \hat{S}_i.$  (2.3)

Hence, as long as the initial shape satisfies the shape constraint, the regressed final shape is bound to lie in the linear subspace constructed by all the training shapes. In fact, all the intermediate shapes also satisfy the shape constraint, since they are constructed in a similar fashion.

2.4 Experiments

There are several landmark-annotated datasets publicly available; we choose the most recent and challenging ones. These are Helen [90], LFPW [12], AFW [177] and IBUG [124]. In addition to these, we evaluate the performance of our method on the recently introduced IARPA Janus Benchmark A (IJB-A) dataset [81]. These datasets present different variations in face shape, appearance and pose, and are described in the following subsections. To maintain consistency in the experiments, we perform face alignment using the Multi-PIE [59] 68-point markup format.

2.4.1 Datasets

LFPW [12] is one of the widely used datasets for benchmarking face alignment. It consists of 811 training and 220 testing images. The dataset contains unconstrained images from the internet which have large variations in pose, illumination and expression. Since some of the image links listed in the dataset are invalid, we downloaded the LFPW images from the ibug [124] website, which has accumulated all valid images and their 68-point annotations.

Figure 2.4: Average pt-pt error (normalized by face size) vs fraction of images in (a) LFPW, (b) Helen, (c) AFW and (d) iBUG.

Helen [90] has 2300 high resolution web images, each one marked with 194 landmark points. To be consistent with the 68-point markup in our experiments, we downloaded this dataset from the ibug website, which provides the 68-point annotations along with the dataset.

AFW is the Annotated Faces in the Wild dataset created by Zhu and Ramanan [177]. It consists of 205 in-the-wild faces with varying illumination, pose, attributes and expressions. It was originally annotated with six landmark points. However, we perform our experiment on the AFW dataset provided on the ibug website, as it contains 68-point annotated ground truth, helping us to maintain consistency in the experiments.

IBUG is a challenging subset of 135 images taken from the 300-W [124] dataset. 300-W contains IBUG and images from the existing datasets LFPW, Helen, AFW and XM2VTS [106]. It inherently follows the 68-point annotation format.

IJB-A [81] is a recently released face verification dataset. The dataset is annotated with three key-points on the faces (two eyes and nose base). The dataset contains images and videos from 500 subjects collected from online media. In total, there are 67,183 faces of which 13,741 are from images and the remaining are from videos.
The locations of all faces in the IJB-A dataset were manually annotated by human annotators. The subjects were captured so that the dataset contains a wide geographic distribution. The faces in this dataset have significant variations in pose, illumination and resolution.

Training and testing: We evaluated the performance of our method on these challenging datasets. First, we performed training and testing on the LFPW and Helen datasets, taking only their own training and testing sets. Using this model, we test on the AFW dataset. In order to evaluate on the IBUG dataset, we generated our own cumulative training set consisting of 3148 images taken from the LFPW, Helen and AFW datasets. This is done since AFW has more pose variations compared to LFPW and Helen. To test on the IJB-A dataset, we use the same model.

Evaluation Metric: Following the standards of [25], [12], we computed the average error for all landmarks in an image normalized by the inter-pupil distance. For each dataset, the mean error evaluated over all the images is reported. In the following sub-section, we compare our LDDR algorithm against existing state-of-the-art methods and validate our results. Since the IJB-A dataset has only three annotated points, the error was instead normalized by the distance between the nose tip and the midpoint of the eye centers.

2.4.2 Comparison with state-of-the-art Methods

During training we augmented the data to improve the generalization ability. A single training sample is turned into multiple samples by flipping all the images and then randomly rotating them. The initial shapes are also randomly assigned. Our method has only one fitting parameter, i.e., the number of stages of regression, which, following the principles of [116], [25], has been set to 4 in our case. We compare our results with those reported in [25], [116], [24], [9], [144]. Tables 2.2, 2.3, 2.4 and Figure 2.4 provide the Normalized Mean Square Error and the average pt-pt error (normalized by face size) vs fraction of images plots for the different methods, respectively.

Method             68-pts   49-pts
Zhu et al. [177]   8.29     7.78
DRMF [9]           6.57     -
RCPR [24]          6.56     5.48
SDM [158]          5.67     4.47
GN-DPM [144]       5.92     4.43
CFAN [168]         5.44     -
CFSS [174]         4.87     3.78
LDDR               4.67     2.38
Table 2.2: Averaged error comparison of different methods on the LFPW dataset.

In Figure 2.6 we present the comparison of our algorithm with [177], [9] and [79]. Our deep descriptor-based global shape regression method outperforms the above mentioned state-of-the-art methods. The tables also show a comparison of our method with many other pioneering methods such as Gauss-Newton based Deformable Part Models [144] and Robust Cascaded Pose Regression (RCPR) [24], as well as some recent methods like [174]. Figure 2.5 shows some landmark localization results on the five datasets. It can be seen from this figure that the proposed method is able to localize landmarks on near profile faces as well as low-resolution, partially visible and expressive faces from the IJB-A dataset.

Randomly rotating and flipping doubles the amount of data and hence improves generalization, reducing the error by ≈ 2%. After the advent of deep learning, it was seen that the conv5 features capture a lot of salient information. Our method depends on the generalization of the deep descriptors, and hence an increase in the amount of data available for training favors the learning step. After training only on the Helen and LFPW training sets, we get an error of 5.09% and 5.08%, respectively.
However, after training on the cumulative data we achieve improved performance, obtaining 4.76% on the former and 4.67% on the latter.

Figure 2.5: Qualitative results of our landmark localization method. First row: LFPW, Second row: Helen, Third row: AFW, Fourth row: IBUG, Fifth row: IJB-A.

Also, as can be seen from Tables 2.2 and 2.3, the error with 68 landmark points is higher than that with 49 points, as the former includes the face contour points. It is evident from our experiments that the proposed method performs better than [177] and [158], where HOG and SIFT were used as features. Table 2.4 shows the performance of our method on the challenging ibug subset of 300-W. The error of CFSS [174] is lower than that of our method. This may be due to the fact that CFSS performs its initial search over a space of multiple mean shapes, whereas we initialize with only one mean shape at test time. We do this to reduce the time and space complexity during training. In our experiments we only flipped and rotated the shapes, in contrast to conventional techniques where the shapes are flipped, rotated, translated and scaled. This also demonstrates the discriminative quality of our deep descriptors and how much better they can get given a large amount of diversified training data.

Method             68-pts   49-pts
Zhu et al. [177]   8.16     7.43
DRMF [9]           6.70     -
RCPR [24]          5.93     4.64
SDM [158]          5.50     4.25
GN-DPM [144]       5.69     4.06
CFAN [168]         5.53     -
CFSS [174]         4.63     3.47
LDDR               4.76     2.36
Table 2.3: Averaged error comparison of different methods on the Helen dataset.

Figure 2.6: Average 3-pt error (normalized by eye-nose distance) vs fraction of images in the IJB-A dataset.

2.4.3 Runtime

All the experiments were performed using an NVIDIA TITAN-X GPU with the cudnn library on a 2.3 GHz computer. Training on LFPW took 5.5 hours and on Helen took 9 hours. Training on the cumulative data took around 15 hours. Due to a different CNN being initialized in each stage, testing was initially observed to be slow, taking ≈ 4 seconds given a face bounding box. However, in our implementation testing was close to real-time performance, taking only ≈ 0.8 seconds per face, thereby reducing the testing time by 80%. This includes the time taken for feature extraction and regression. The time consuming part of the landmark localization was the initialization of a different CNN in each stage. To counter this delay in testing, we merged the 4 CNN models into a single CNN model which is initialized only once. To reduce the runtime even further, the 68 patches extracted around the intermediate shape were passed as a single batch.

Method            68-pts
Zhu et al. [177]  18.33
DRMF [9]          19.75
RCPR [24]         17.26
SDM [158]         15.40
GN-DPM [144]      -
CFAN [168]        -
ESR [25]          17.00
LBF [116]         11.98
LBF Fast [116]    15.50
CFSS [174]        9.98
LDDR              11.49
Table 2.4: Averaged error comparison of different methods on the iBUG challenging dataset.

2.5 Conclusions

In this work, we presented a deep descriptor-based method for face alignment using regression of local descriptors. The highly informative nature of the deep descriptor makes it as useful as SIFT, SURF and HOG features. This means deep descriptors have potential in many different kinds of applications in machine vision, such as pose estimation, activity recognition, human detection and many others.
We also presented an effective way of reducing the testing time by combining four CNNs into one, achieving real-time performance. Extensive experiments on five publicly available unconstrained face datasets demonstrate the effectiveness of our proposed image alignment approach.

Chapter 3: KEPLER: Keypoint and Pose Estimation of Unconstrained Faces by Learning Efficient H-CNN Regressors

3.1 Introduction

Figure 3.1: Sample results generated by the proposed method. The numbers in black are the predicted 3D pose P: Pitch, Y: Yaw, R: Roll. Green dots represent the predicted keypoints. The bar graphs show the visibility confidence of each of the 21 keypoints.

In the last five years, keypoint localization using DCNNs has received great attention from computer vision researchers. This is mainly due to the availability of large scale annotated unconstrained face datasets such as AFLW [82]. Recently, Bulat et al. [19] released an even larger dataset with more than 200K annotated face images. Works such as [166] have hypothesized that as the network becomes deeper and deeper, more semantic information such as identity, pose and attributes is retained while immediate local features are lost. However, various methods such as [135], [168], and [174] directly used CNNs as regressors or used deeper features from CNNs to design regressors for predicting keypoints. Some of the methods used global features to regress the keypoints, while others opted for local deep features and trained in a coarse to fine manner. On the other hand, an earlier method known as Explicit Shape Regression (ESR) [25], proposed by Cao et al., achieved superior results by introducing the important concept of non-parametric shape regression for facial keypoint localization. Many of its variants [92], [116], [79], [135], [89] were published later, using a variety of features and producing incremental improvements over [25]. However, they are all limited by the fixed number of points on the face. In real life applications, there are more challenging datasets such as IJB-A [81] and AFW [177], which do not always have 68 or 49 fixed points due to occlusion or pose variation. As alternatives, researchers moved towards more sophisticated techniques by incorporating 3D shape models [178], [74], [73], domain learning [176], recurrent autoencoder-decoders [1] and many others. The LS3D-W dataset by Bulat et al. [19] is annotated with 34 3D-coordinates. However, in applications such as face recognition, the images are aligned directly from the 2D images/coordinates, skipping the 3D mapping step. Thus, unconstrained face alignment on 2D face images has received much attention as an emerging research topic in the recent past. With all the methods of recent years, one question still remains unanswered: Can cascaded shape regression be applied to an arbitrary face with no prior knowledge of its pose?

The motivation for this work stems from a desire to adapt cascaded regression to predicting landmarks of arbitrary faces, while taking advantage of CNNs. We transform the cascaded regression formulation into an iterative scheme for arbitrary faces. In each iteration the regressor predicts the increment for the next stage, which progressively moves the initial estimate closer to the ground truth. By jointly training for all the points, the inherent shape constraint is maintained implicitly. As by-products of KEPLER, we get the visibility confidence of each keypoint and the 3D pose (pitch, yaw and roll) of the face image.
Figure 3.1 shows a set of sample results from the proposed method, indicating the 3D pose, keypoint locations and their visibility confidences.

Figure 3.2: Sample results generated by the proposed method KEPLER. White dots represent the location of keypoints after each iteration. The first row shows an image from the AFLW dataset. The points move at subpixel level after the fourth iteration. The second row is a sample image from the AFW dataset, which shows how the last stage of error correction can effectively mitigate the inconsistent bounding box across datasets. The numbers in red are the predicted 3D pose P: Pitch, Y: Yaw, R: Roll.

The main contributions of this chapter are:

• We design a novel GoogLenet-based [137] architecture with a channel inception module which pools features from intermediate layers and concatenates them similarly to the inception module. We call the proposed architecture Channeled Inception in the rest of the chapter. This network is used in all the stages of KEPLER.

• Inspired by [26], we present an iterative method for estimating the face landmarks using the fixed point consolidation scheme. Fixed point consolidation refers to estimating the error correction in an iterative way by partitioning the total error correction into multiple steps. We observe that estimating landmarks on a face is more challenging than estimating keypoints on a human body. An overview of the pipeline is shown in Figure 3.3.

• After each stage, the error from the ground truth decreases, making the gradient smaller. This is because regression-based approaches use the Euclidean loss, whose gradient also depends on the error. Hence we employ different training policies in each stage for the efficient training of H-CNNs. Figure 5.1 shows how, by correcting the estimates of landmark points locally, the issue of inconsistent bounding boxes can be handled.

• We evaluate the performance of the landmark estimation method on the challenging benchmark datasets AFLW and AFW, which include faces in diverse poses, expressions and occlusions. Different from many previous methods such as [158], [176], this work estimates a variable number of points depending on the head pose. We also introduce a new protocol for evaluating facial keypoint localization on AFLW which is more challenging and usually left out when evaluating unconstrained face alignment methods.

This chapter builds upon KEPLER [86] by Kumar et al. by evaluating KEPLER on two other challenging datasets. To test the robustness of the proposed method during deployment, we evaluate it on datasets whose image quality differs from that on which KEPLER was trained. Without retraining or finetuning, we test the proposed method on the COFW [23] dataset, which is a standard benchmark for evaluating face alignment schemes designed to work on images under heavy internal and external object occlusion. We show that it performs comparably to methods such as RCPR [23], which uses the COFW training set to develop the method. We also evaluate the method on the IJB-A dataset, which is one of the most challenging datasets publicly available for face verification. Without finetuning, we test the performance of the proposed method on IJB-A. We show in Figure 3.13 that earlier methods [9], [79], which yielded good performance on high resolution images, almost fail on the IJB-A dataset. However, due to the efficient training scheme of KEPLER, it is able to yield improved landmark estimates on images with lower resolution and extreme head-pose.
The rest of the chapter is organized as follows. Section II reviews closely related works. Section III presents the proposed method in detail. Section IV describes the experiments and comparisons, which are then followed by conclusions and suggestions for future works in section V. 3.2 Related Work Following [25], we classify the previous works on face alignment into two basic cat- egories. 32 Part-Based Deformable models: These methods perform alignment by max- imizing the likelihood of part locations in the given input image. One of the major works in this category was done by Zhu and Ramanan [177], where they used a part-based model for face detection, pose estimation and landmark localization as- suming the face shape to be a tree structure. Discriminative Response Map Fitting (DRMF) [9] by Asthana et al., learned a dictionary of probability response maps followed by linear regression in a Constrained Local Model (CLM) framework. How- ever, it is widely acknowledged that the formulation based on CLMs is non-convex, and may converge to local minima. Hsu et al. [66] extended the mixture of tree model [177] in a coarse to fine manner to achieve improved accuracy and efficiency. However, their method again assumes face shape to be a tree structure, enforcing strong constraints specific to shape variations. However, formulating keypoint de- tection problem as a classification problem, Kumar et al. [87] attempted to capture the structural relationships between different keypoints through convolutional filter assuming the keypoints to be arranged in a tree structure. Regression-based approaches: A multitude of regression-based approaches has been proposed in recent years by formulating the keypoint detection as a regression problem using local or global features. Methods reported in [95], [25], [174] are based on learning a regression model that directly maps image appearance to tar- get outputs. Different low-level features such as Local Binary Patterns (LBP) [3], Histogram of Oriented Gradients (HOG) [39], Scale Invariant Feature Transform (SIFT) [101] have been used in a variety of regression methods such as Support Vec- tor Regression and Random Forests. However, these methods along with methods 33 from [6], [142], [5], [8], [144] and [134] were mostly evaluated either in a lab setting or on face images where all the facial keypoints are visible. These methods depend highly on the bounding box annotation and hence the training data is augmented by jittering the images to accomodate for different bounding box annotation. However, when evaluated on challenging datasets such as IJB-A, these methods do not yield accurate results as we show in section 3.4 in Figure 3.13. Wu et al. [155] proposed an occlusion-robust cascaded regressor to handle occlusion by including two sepa- rate models for landmark localization and visibility estimation in an iterative way. Xiong et al. [157] pointed out that standard cascaded regression approaches such as Supervised Descent Method (SDM) [158] tend to average conflicting gradient directions resulting in reduced performance. Hence, Xiong et. al [157] suggested domain dependent descent maps. Inspired by this, Cascade Compositional Learn- ing (CCL) [176] and Ensemble of Model Regression Trees (EMRT) [175] developed head pose-based and domain selective regressors respectively. 
[176] partitioned the optimization domain into multiple directions based on head pose and learned to combine the results of multiple domain regressors through a composition estimator function. Similarly, [175] trained an ensemble of random forests to directly predict the locations of keypoints for a given face image, and face alignment is then achieved by aggregating the consensus of the different models. Recently, methods based on 3D models have been proposed for aligning faces. PIFA [73] by Jourabloo et al. proposed a 3D approach that employed cascaded regression to predict the coefficients of the 3D-to-2D projection matrix and the base shape coefficients. Another recent work from Jourabloo et al. [74] formulated the face alignment problem as a dense 3D model fitting problem, where the camera projection matrix and 3D shape parameters were estimated by a cascade of CNN-based regressors. However, [176] suggests that optimizing the base shape coefficients and projection is indirect and sub-optimal, since smaller parameter errors are not necessarily equivalent to smaller alignment errors. 3DDFA [178] by Zhu et al. fitted a dense 3D face model to the image via a CNN, by modelling the depth data in a Z-Buffer. In [15, 98], the authors used a 3D morphable model to learn the 3D camera projection matrix parameters and warping parameters while simultaneously training for 2D face alignment. Although these methods provide 3D coordinates of the keypoints for a given image, they do not outperform the state of the art methods for 2D face alignment. This can be attributed to the fact that learning 3D points from 2D data is a complex problem where the ground-truth data itself is noisy. In contrast, KEPLER simultaneously learns the keypoints, visibility and pose directly from the 2D image, and hence is able to capture the inherent structural dependencies among them. We show that, even without finetuning, KEPLER performs comparably to the state of the art methods on the COFW dataset.

Our work falls in the category of regression-based approaches and addresses the issue of adapting cascaded shape regression to unconstrained settings. Different from all other previous works, it performs joint training on the three fundamental tasks simultaneously, namely, 3D pose, visibility of each keypoint and the location of keypoints. It also demonstrates that efficient joint training on the three tasks achieves superior performance. One closely related work is [172], where the authors used multi-tasking for many attributes, but did not leverage the intermediate features.

Figure 3.3: Overview of the architecture of KEPLER. The function f() predicts the visibility, pose and the corrections for the next stage. The representation function h() forms the input representation for the next iteration.

3.3 KEPLER

KEPLER is an iterative method which consists of three modules. Figure 3.3 illustrates the basic building blocks of KEPLER. The first module is a rendering module h which models the structure in an N-dimensional input space, with N being the maximum number of keypoints on a face image. The current locations of the keypoints are represented by the vector $y_t = \{y_t^1, \ldots, y_t^N\}$. The output of the rendering module is concatenated to the raw RGB input image I along the channel dimension, which is then fed to the function f. The second module is the function f which calculates the correction to be made at the next stage.
The function f is modeled by a convolutional neural network whose architecture is described in section 3.3.1. The third module is the correction stage which adds the increments, predicted by f, to the current locations. The output goes again into the rendering module h, which prepares the rendered data for the next iteration. The rendering function is not learned in this work, but is represented by a 2D Gaussian with a fixed variance centered at the current keypoint locations in each of the N channels. Finally, the Gaussian rendered images are stacked together with the image I. The overall method can therefore be summarized by the following set of equations:

$\delta_t = f_t(X_t, \Theta_t)$  (3.1)
$y_{t+1} = y_t + \delta_t$  (3.2)
$X_{t+1} = h(y_{t+1})$  (3.3)

where $f_t$ is a function with learned parameters $\Theta_t$. The prediction function f is indexed by t as it is trained separately for every iteration. In the first iteration, the function h renders Gaussians at $y_0$, which is the mean shape. In this work, we use five iterations. We perform the last iteration only to account for the inconsistent bounding boxes across different datasets (see Figure 5.1). After four stages of global corrections, no significant improvement was observed on the validation set, and hence we adopted local corrections as the last stage of KEPLER. The loss functions for each task are given below.

Keypoint localization
Keypoint localization is the task of predicting the keypoints in a face. In this chapter, we consider predicting the locations of N = 21 keypoints on the face. Associated with each point is its visibility. The loss function for this task is given by

$L_1(y, g) = \sum_{i=1}^{N} v_i \left( y_i^t - g_i \right)^2,$  (3.4)

where $y_i^t$ and $g_i$ are the predicted and the ground truth locations of the $i$th keypoint, respectively, at time t, and $v_i$ is the ground truth visibility associated with each keypoint. According to this formulation of the keypoint loss, since there is no penalty for invisible points, no gradient is back-propagated for such points. We discuss this loss function and its variant in detail in section 3.3.3.

Pose Prediction
Pose prediction refers to the task of estimating the 3D pose of the face. We use the Euclidean loss function for pose prediction:

$L_2(p_p, g_p) = (p_{yaw} - g_{yaw})^2 + (p_{pitch} - g_{pitch})^2 + (p_{roll} - g_{roll})^2$  (3.5)

where p stands for predicted and g for the ground truth. In an alternate formulation this task can be cast as a classification problem where the face images are to be classified into different classes. However, this would result in binning the pose into discrete bins. Since we have access to accurate 3D pose, we use the Euclidean loss for this task.

Visibility
This task is associated with estimating the visibility of each keypoint. The number of keypoints visible on the face varies with pose. Hence, we use the Euclidean loss to estimate the visibility confidence of each point:

$L_3(v_p, v_g) = \sum_{i=1}^{N} (v_{p,i} - v_{g,i})^2.$  (3.6)

Alternatively, one can also use a multi-target cross-entropy loss for this task. The net loss in the network is therefore the weighted linear combination of the above loss functions:

$L(p, g) = \lambda L_1(y, g) + \mu L_2(p_p, g_p) + \nu L_3(v_p, v_g)$  (3.7)

where $\lambda$, $\mu$ and $\nu$ are weight parameters suitably chosen depending on the iteration.
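The rendering module h and the iterative update of (3.1)-(3.3) can be sketched as follows (an illustrative Python/NumPy fragment only; the Gaussian width, the channel ordering and the `stages` interface are assumptions rather than the exact training-time implementation):

```python
import numpy as np

def render_heatmaps(keypoints, visibility, size, sigma=3.0, vis_thresh=0.03):
    """Rendering function h: one Gaussian channel per keypoint (Eq. 3.3).

    keypoints  : N x 2 array of current (x, y) estimates
    visibility : N predicted visibility confidences
    size       : output spatial resolution (assumed square here)
    Channels whose visibility falls below the threshold are left empty,
    as described for the third iteration.
    """
    xs, ys = np.meshgrid(np.arange(size), np.arange(size))
    maps = np.zeros((len(keypoints), size, size), dtype=np.float32)
    for i, ((cx, cy), v) in enumerate(zip(keypoints, visibility)):
        if v < vis_thresh:
            continue
        maps[i] = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
    return maps

def kepler_inference(image, y0, vis0, stages):
    """Iterative refinement of Eqs. (3.1)-(3.3).

    image  : 3 x H x W face crop (channels first, assumed square)
    stages : list of trained functions f_t returning (correction, pose, visibility)
    """
    y, vis, pose = y0.copy(), vis0.copy(), None
    for f_t in stages:
        # Stack the rendered keypoint channels with the RGB image.
        X = np.concatenate([image, render_heatmaps(y, vis, image.shape[1])], axis=0)
        delta, pose, vis = f_t(X)   # Eq. (3.1)
        y = y + delta               # Eq. (3.2)
    return y, pose, vis
```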
3.3.1 Network Architecture

For the modeling function f we design a unique ConvNet architecture based on GoogLenet [137] by pruning the inception network after inception 4c. As PReLU has shown better performance in many vision tasks such as object recognition [63], in this pruned network we first replace the ReLU non-linearity with PReLU. We pool the intermediate features from the pruned GoogLenet. Convolutions are then performed on the output of each branch, and the output maps are concatenated similarly to the inception module. We call this module the Channeled Inception module. Since the output maps after conv1 are larger in size, we first perform a 4×4 convolution and then again a 4×4 convolution, both with a stride of 3, to finally match the dimension of the output to 7×7. Similarly, after conv2 we first perform a 4×4 convolution and then a 3×3 convolution to match the output to 7×7; the former uses a stride of 4 and the latter a stride of 2.

Figure 3.4: The KEPLER network architecture. The dotted line shows the channeled inception network. The intermediate features are convolved and the responses are concatenated in a similar fashion as the inception module. Tasks such as pose are abstract and contained in deeper layers, whereas the localization property is in the shallower layers.

The most naïve way of combining features is by concatenation. However, the concatenated output blob can be very high dimensional, and hence we perform a 1×1 convolution for dimensionality reduction. This lets the network decide the weights to effectively combine the pooled features into a lower dimension. It has been shown in [166] that adjacent layers are correlated and therefore we only pool features from alternate layers. Next, the network is trained on three tasks, namely pose, visibilities and the bounded error, using the ground truth. The joint training is helpful since it models the inherent relationships among the number of visible points, the pose and the amount of correction needed for a keypoint in a particular pose. Choosing an architecture like GoogLenet is appropriate as it has fewer parameters (compared to VGGNet [131]) and the training of GoogLenet is faster when batch normalization is added after each convolution layer. In order to further speed up the process, we only use convolution layers until the last layer, where we use a fully connected layer to get the final output. Recently, Residual Networks [64] with skip connections have been proposed where the number of parameters is even fewer; furthermore, these networks have achieved improved classification results on the Imagenet [40] classification task. In that case, the backbone network in each stage of the whole KEPLER pipeline can be replaced by a ResNet, while keeping the training process the same. The architecture of the whole network is shown in Figure 3.4.

3.3.2 Iteration 1 and 2: Constrained Training

In this section, we explain the first stage of training for keypoint estimation. The first stage is the most crucial one for face alignment. Since the network is trained from scratch, precautions have to be taken regarding what the network learns. Directly learning the locations of keypoints with a network is difficult not only because of the highly non-linear mapping between the input and target spaces, but also because when the network gets deeper it loses localization capability. This is due to the fact that the outputs of the final convolution layers have a larger receptive field on the input image. We devise a strategy in which the corrections for the first two stages are bounded.
Let us suppose the key-points are represented by their 2D coordinates $y = \{y_i \in \mathbb{R}^2, i \in [1, \ldots, N]\}$, where N is the number of keypoints and $y_i$ denotes the $i$th keypoint. The bounded corrections are calculated using (3.8) below:

$\delta_i^t(g_i, y_i^t) = \min(L, \|u\|) \cdot \hat{u}$  (3.8)

where L denotes the bound of the correction, and $u = g_i - y_i^t$ and $\hat{u} = \frac{u}{\|u\|}$ represent the error vector and the error unit vector, respectively. In our experiments, we set the bound L to a maximum of 20 pixels. This simplifies the learning problem for the network in the first stage. According to this formulation, the error correction for points whose ground truth is far away gets bounded by L. The interesting property of this formulation is that in the first and second stages the network only learns the direction in which the points have to shift. This can be thought of as learning the direction of the error unit vector, to which the magnitude is added later. In addition to the keypoint locations, we also have access to the facial 3D pose and the visibility of each point. One-shot prediction of the keypoint locations is difficult since the input-output mapping is typically nonlinear. Also, learning small corrections should be easier when the network is being trained for the first time. Hence, to impart this prior knowledge to the network, we jointly learn the pose and the visibility of each point. The loss functions used for the three tasks are described in equations (3.4), (3.5) and (3.6) in the previous section 3.3. The function f for the second iteration is trained in a similar fashion, with the weights initialized from the first iteration.

Figure 3.5: Qualitative results of KEPLER after the second stage. The green dots represent the predicted points after the second stage. Red dots represent the ground truth. It can be seen that the visible points have taken the shape of the input face image.

3.3.3 Iteration 3: Variant of Euclidean loss

We show the outputs of the network after the second stage of training in Figure 3.5. Visual inspection of the outputs shows that for many of the faces, the network has already learned the magnitude and direction of the correction vector. However, there are misalignments in some images or in some keypoints in the images. But repeating the training methodology exactly as in the second iteration revealed that our architecture suffered from vanishing gradients. While back-propagating the gradients, the loss is averaged over a batch, and if there are only a few misalignments in a batch, there is very little gradient to be propagated. To maintain consistency we stick with the same architecture. Even though GoogLenet [137] claims not to have the vanishing gradient problem, KEPLER faced it because of the absence of the intermediate supervision which GoogLenet originally had. This motivated us to design a loss function that satisfies both of the following conditions: on the one hand, the loss function should minimize the error between prediction and ground truth; on the other hand, it should have sufficient gradients to be propagated to make the learning process reach a global minimum. Towards this end, we use the following loss function:

$L_1(y, g) = \frac{1}{n} \left( \sum_{i=1}^{N} v_i (y_i - g_i)^2 + \alpha \sum_{i=1}^{N} v_i \, | y_i - g_i | \right)$  (3.9)

$\frac{\partial L_1(y, g)}{\partial y} = \frac{1}{n} \left( 2 \sum_{i=1}^{N} v_i (y_i - g_i) + \alpha \sum_{i=1}^{N} v_i \, \frac{y_i - g_i}{| y_i - g_i |} \right)$  (3.10)

where $\alpha$ is a parameter which controls the strength of the gradient and n is the number of samples in a batch. We would like to emphasize that the additional term is not a regularizer, as it is added to the objective function and does not directly regularize the weights.
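A minimal sketch of this loss and its gradient, assuming per-sample computation and omitting the batch average, is given below (illustrative Python/NumPy only):

```python
import numpy as np

def keypoint_loss_with_gradient(y, g, v, alpha=0.2):
    """Loss of Eq. (3.9) and its gradient Eq. (3.10) for one training sample.

    y, g  : N x 2 predicted and ground-truth keypoint locations
    v     : N ground-truth visibility flags (0/1)
    alpha : weight of the absolute-difference term
    The 1/n batch average is omitted here for clarity.
    """
    vis = v[:, None]                     # broadcast visibility to both coordinates
    diff = y - g
    loss = np.sum(vis * diff ** 2) + alpha * np.sum(vis * np.abs(diff))
    # d|x|/dx = sign(x): the second term contributes a constant-magnitude gradient,
    # which keeps gradients alive even when the squared-error term becomes small.
    grad = 2.0 * vis * diff + alpha * vis * np.sign(diff)
    return loss, grad
```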
However, this term is able to provide substantial gradients for the training of the ConvNet because, depending on the sign of the difference, the gradient of the second term in (3.9) is always +1 or -1, as seen in (3.10). The representation function h in this stage does not render any Gaussian in the channel for which the predicted visibility is below the threshold $\tau$. In this work, we set this threshold $\tau$ to 0.03 and $\alpha$ to 0.2, which were determined experimentally. Now that the network has learned the unit vectors in the first and second iterations, we do not constrain the amount of error correction for the third stage of training.

3.3.4 Iteration 4: Hard sample mining

Face alignment is a task which requires precise localization, as errors in alignment can propagate to errors in verification/recognition or other tasks which depend on the aligned image. In our case, although most of the images are aligned after the third iteration, they lack precision in local alignment. Recently, Kabkab et al. [77] suggested that by efficiently sampling the data one can make optimal use of the training data while training ConvNets, leading to improved performance. [77] developed an online data sampling method based on a convex optimization formulation and showed how their formulation can make the classifier robust when class imbalance is present. Inspired by [77], we reuse the hard samples of the dataset to build a more robust keypoint localization system.

Figure 3.6: Error histogram of training samples after stage 3.

Using the keypoints predicted after the third iteration, we plot the histogram (Figure 3.6) of the normalized mean error (NME), after calculating it for all the training samples. We denote the NME on the x-axis at which the maximum number of samples are centered as C. In an ideal case, the value of C should be low, implying that the average alignment error is small. Therefore, the objective of this stage is to lower the value of C by hard sample mining. We select a threshold C + Δ (0.03 in our experiments), to the right of C, beyond which at least 30-40% of the samples lie, as the threshold for hard samples. Using C + Δ, we partition the dataset into two groups of hard and easy samples. We first select an equal number of samples from both groups to form a batch, which is then presented to the ConvNet for training. This effectively results in reusing the hard samples, since the number of samples in the hard group is much lower than in the easy group. Then, to counter the group imbalance, we finetune the network with the whole dataset again with a lower learning rate. We use the loss function of (3.9) with $\alpha = 0.1$ for this stage.
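The sampling strategy of this stage can be sketched as follows (an illustrative Python/NumPy fragment; the histogram binning and batch size are assumptions, not the values used in the experiments):

```python
import numpy as np

def build_balanced_batches(errors, delta=0.03, batch_size=32, rng=None):
    """Partition training indices into easy/hard groups around the error mode C
    and yield batches containing an equal share of both, as described above.

    errors : per-sample NME measured after the third iteration
    delta  : offset added to the histogram mode C to define hard samples
    """
    rng = rng or np.random.default_rng()
    hist, edges = np.histogram(errors, bins=50)
    C = edges[np.argmax(hist)]                 # NME around which most samples lie
    hard = np.where(errors > C + delta)[0]
    easy = np.where(errors <= C + delta)[0]
    half = batch_size // 2
    while True:
        # Sampling with replacement implicitly reuses the smaller hard group.
        batch = np.concatenate([rng.choice(hard, half),
                                rng.choice(easy, half)])
        yield rng.permutation(batch)
```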
3.3.5 Iteration 5: Local Error Correction

There is a lot of inconsistency among the bounding boxes provided by different datasets. AFLW [82] provides larger bounding box annotations compared to AFW [177]. Regression-based alignment methods are dependent on the mean shape initialization, which is scaled to the bounding box size. It is also impractical to come up with a heuristic that tries to determine compatible bounding boxes. Almost all existing methods perform data augmentation by randomly perturbing the bounding boxes by some amount. However, it is not clear by how much the bounding boxes should be perturbed to obtain, during testing, reasonably good bounding boxes that are consistent with the dataset the network was trained on. We train our networks on the larger bounding box provided by AFLW. AFLW bounding boxes tend to be square, and for almost all the images the nose tip appears at the center of the bounding box. This is a big limitation for deploying the system in real world scenarios. It is worth noting that the previous four stages are trained on full images and hence produce global corrections. Our last stage of local correction is optional; whether it is needed depends on the test set and the bounding box annotations it comes with. We train a similar network as before, but only for the tasks of predicting the visibility and the corrections in the local patches (see Figure 3.7).

Figure 3.7: Red dots in the left image represent the ground truth while green dots represent the predicted points after the fourth iteration. Local patches centered around predicted points are extracted and fed to the network. The network shown in Figure 3.4 is trained on the task of local fiducial correction and visibility of fiducials inside the patch. The image on the right shows the predictions after local correction.

Predicting the pose from a local patch of, say, W × W pixels is difficult, and can lead the network to learn improper weights. We choose all N patches irrespective of the visibility factor. Learning visibility and corrections together is important because we do not want the network to propagate any gradient if the point is invisible. We observe during experimentation that training the ConvNet on the two tasks together achieves significantly better performance than when the network is trained only for the task of error correction. We again partition the dataset into easy and hard sample groups according to the strategy explained in section 3.3.4. We finally finetune the network with the whole dataset with a lower learning rate.

3.4 Experiments and Comparison

3.4.1 Datasets

We select two challenging datasets with their most recent benchmarks.

In-the-wild datasets: To make the system robust to images from real life scenarios with challenging shape variations and significant view changes, we select AFLW [82] for training, and AFLW and AFW [177] as the main test sets. AFLW contains 24,386 in-the-wild faces (obtained from Flickr) with head pose ranging from 0° to 120° for yaw and up to 90° for pitch and roll, with extremely challenging shape variations and deformations. AFLW provides at most 21 points for each face. It excludes coordinates for invisible landmarks, which we consider to be the best policy, because there is no way of correctly knowing the exact location of those points; in many cases such invisible points are largely hallucinated and annotated thereafter. Along with this, AFLW also demonstrates a limited amount of external-object occlusion.

COFW is a collection of 1007 face images, out of which 507 images are partitioned as the test set. The Caltech Occluded Faces in the Wild (COFW) dataset exhibits a wide range of images in diverse poses and is mainly used for the evaluation of face alignment methods designed to perform on images under extreme occlusion. In addition, one important point to note is that COFW also provides annotations for the invisible landmarks, while in the case of AFLW the invisible landmarks are unavailable.

IJB-A is one of the most challenging face verification datasets. The face images in the dataset are annotated with three key-points: two eyes and the nose base. The dataset contains images and videos from 500 subjects collected from online media. In total, there are 67,183 faces of which 13,741 are from images and the remaining are from videos.
The locations of all faces in the IJB-A dataset were manually annotated by human annotators. The images were captured so that the dataset contains a wide geographic distribution. The challenge comes from the wide diversity in pose, illumination and resolution.

AFW is a popular benchmark for the evaluation of face alignment algorithms. AFW contains 468 in-the-wild faces (obtained from Flickr) with yaw up to 90°. The images are diverse in terms of pose, expression and illumination. The number of visible points also varies depending on the image, but the locations of occluded points are to be predicted as well.

Testing Protocols:

(I) AFLW-PIFA: We follow the protocol used in PIFA [73]. We randomly select 23,386 images for training and the remaining 1,000 for testing. We divide the testing images into three groups as done in [73]: [0°, 30°], [30°, 60°] and [60°, 90°], where the number of images in each group is taken to be equal.

(II) AFLW-Full: We also test on the full test set of AFLW of sample size 1,000.

(III) AFLW-All variants: In the next experiment, to have a more rigorous analysis, we perform the test on all variants of images from (I) above. To create the all-variants images, we first rotate the full images from (I) at angles of 15°, 30°, 45° and 60°. We do the same with the horizontally flipped versions of these images. We then rotate the bounding box coordinates and the key-points at the same angles and crop the faces. This is done for all the images following the AFLW-PIFA protocol. One important effect of this rotation is that some of the images have smaller faces compared to others due to the rotated bounding box. This experiment tests the robustness of the algorithm on faces of different effective scales and orientations.

(IV) AFW: We only use AFW for testing purposes. We follow the protocol stated in [177]. AFW provides 468 images in total, out of which 329 faces have height and width greater than 150 pixels. We only evaluate on those 329 images following the protocol of [177]. It is to be noted that methods such as PIFA [73] and CCL [176] also exclude images with pose greater than 75 degrees, following the protocol of TCDCN [171].

(V) Occlusion: We use the COFW dataset only for evaluation purposes, without fine-tuning. This shows the efficacy of the proposed method on other datasets. COFW face images are annotated with 29 facial landmarks; however, we only evaluate on 21 points as in AFLW. We show that even without retraining, KEPLER performs comparably to Robust Cascaded Pose Regression (RCPR) [23], which is the baseline method. We show in Figure 3.8 the schema to convert the 29-point format to the 21-point format.

(VI) Real Life Scenario: We use the IJB-A dataset to evaluate on images and videos which are taken in challenging situations. We only evaluate against the three points which were manually annotated. The error in the eye coordinates is normalized by the distance between the nose coordinate and the midpoint of the two eye coordinates.

Figure 3.8: Schema to convert the COFW 29-point format to the AFLW 21-point format.

Evaluation metric: Following most previous works, we obtain the error for each test sample by averaging the normalized errors of all annotated landmarks. We demonstrate our results with the mean error over all samples, or via the Cumulative Error Distribution (CED) curve. For pose, we evaluate on continuous pose predictions as well as their discretized versions rounded to the nearest 15°.
We report the continuous mean absolute error for the AFLW test set and plot the Cumulative Error Distribution curve for the AFW dataset. For the COFW dataset we normalize by the inter-ocular distance following the protocol of [23]. The Normalized Mean Error (NME), which is the average of the normalized estimation error of the visible landmarks, is calculated as follows:

$NME = \frac{1}{N_t} \sum_{i=1}^{N_t} \left( \frac{1}{N_f} \, \frac{1}{|v_i|_1} \sum_{j=1}^{N} v_i^j \, \| p_i(:, j) - g_i(:, j) \|_2 \right)$  (3.11)

where $N_f$ is the normalization factor, which for AFLW and AFW is the ground truth bounding box size, calculated as $\sqrt{w_{box} \times h_{box}}$, and for COFW is the inter-ocular distance. All the experiments, including training and testing, were performed using the Caffe [72] framework and two Nvidia TITAN-X GPUs. Our method can process up to 12-16 frames per second in batch mode.

Method              AFLW NME   AFW NME
TSPM [177]          -          11.09
CDM [2]             12.44      9.13
RCPR [24]           7.85       -
ESR [25]            8.24       -
PIFA [73]           6.8        8.61
3DDFA [178]         5.32       -
LPFA-3D [74]        4.72       7.43
EMRT [175]          4.01       3.55
CCL [176]           5.85       2.45
Rec Enc-Dec [1]     >6         -
FA-3DFR [98]        4.49       -
Tree CNN [87]       3.93       3.28
3D STN [15]         4.23       -
KEPLER              2.98       3.01
Table 3.1: Comparison of KEPLER with other state of the art methods. NME stands for normalized mean error. For AFLW, the numbers for the other methods are taken from the respective papers following the PIFA protocol. For AFW, the numbers are taken from the respective works published following the protocol of [177].

3.4.2 Results

Table 3.1 compares the performance of KEPLER with other existing methods. Table 3.3 summarizes the performance of KEPLER under the different protocols of the AFLW test set. Table 3.4 shows the mean error, in degrees, in estimating the 3D pose of a face image.

Method        COFW
FPLL [177]    14.40
ESR [25]      11.20
FLD [155]     5.18
RCPR [23]     8.5
KEPLER        8.8
Table 3.2: Performance comparison of the proposed method on the COFW dataset. It is to be noted that the NME in FPLL, ESR, FLD and RCPR (trained on COFW) is calculated over 29 points, whereas it is calculated over 21 points in KEPLER. It can be observed that the performance of KEPLER is comparable to RCPR without finetuning on the training set of COFW.

Figure 3.9: Cumulative error distribution curves for landmark localization on the AFLW dataset. The numbers in the legend are the average normalized mean error normalized by the face size.

Table 3.2 compares the performance of KEPLER on the COFW test set. It can be observed that even without finetuning, KEPLER performs comparably to RCPR. Figures 3.9 and 3.10 show the cumulative error distribution for predicting keypoints on the AFLW and AFW test sets. Figure 3.11 shows the cumulative error distribution for pose estimation on AFW. Figures 3.12 and 3.13 show the cumulative error distribution curves for the COFW and IJB-A datasets.

Comparison with CCL [176]: It is clear from the tables that KEPLER outperforms all state of the art methods on the AFLW dataset. It also outperforms all state of the art methods except CCL [176] on the AFW dataset. Visual inspection of our results suggests that KEPLER is a little farther from the ground truth on invisible points. We note that CCL [176] manually re-annotates the AFLW dataset with 19 landmarks, including the invisible landmarks but leaving out the ear points. In our experiments we prefer to use the dataset as provided by AFLW [82], although we believe that a CCL-kind of reannotation may boost the performance (since during AFW evaluation the locations of occluded points also need to be predicted). In KEPLER there is no loss propagated for the invisible points.
We believe that training KEPLER on the revised annotation by [176] would make the prediction of occluded points more precise.

Method    AFLW-PIFA   AFLW-Full   AFLW-All variants   AFW
KEPLER    2.98        2.90        2.35                3.01
Table 3.3: Summary of the performance of KEPLER on the different protocols of AFLW and AFW.

Method                 AFLW Yaw   AFLW Pitch   AFLW Roll   AFW MAE   AFW Accuracy (≤15°)
Random Forest [146]    -          -            -           12.26°    83.54%
KEPLER                 6.45°      5.85°        8.75°       6.45°     96.67%
Table 3.4: Comparison of mean error in 3D pose estimation by KEPLER on the AFLW test set. For AFLW, [146] only compares the mean absolute error in yaw. For AFW, we compare the percentage of images for which the error is less than 15°.

We also verify our claim that iteration 5 is optional and only required for transferring the algorithm to other datasets with different bounding box annotations. To support our claim, we calculate the normalized mean error after iteration 4 for both datasets and compare it with the error obtained after iteration 5. The error after iteration 4 on the AFLW test set was 0.0369 (which is already lower than all existing works) and after the fifth iteration it was 0.0299, bringing the performance up by 18%. On the other hand, the improvement on AFW (whose bounding box annotation is different from AFLW) was close to 60%: the error after iteration 4 on the AFW dataset was 0.0757, which decreases to 0.0301 after the fifth iteration. We demonstrate some qualitative results from the AFLW and AFW test sets in Figure 3.14, and from the COFW and IJB-A datasets in Figures 3.15 and 3.16.

Figure 3.10: Cumulative error distribution curves for landmark localization on the AFW dataset. The numbers in the legend are the fraction of testing faces that have average error below 5% of the face size.

Figure 3.11: Cumulative error distribution curves for pose estimation on the AFW dataset. The numbers in the legend are the percentage of faces that are labeled within ±15° error tolerance.

Figure 3.12: Cumulative error distribution curves for landmark localization on the COFW dataset. It is to be noted that the error is calculated over 21 points, normalized by the inter-ocular distance.

Figure 3.13: Cumulative error distribution curves for landmark localization on the IJB-A dataset. The error is calculated for 3 points, normalized by the distance between the midpoint of the eyes and the nose.

3.5 Conclusions

In this work, we showed that by efficiently capturing the structure of the face through additional channels, we can obtain precise keypoint localization on unconstrained faces. We proposed a novel Channeled Inception deep network which pools features from intermediate layers and combines them in the same manner as the Inception module. We showed how cascaded regressors can outperform other recently developed works while being designed to yield a variable number of keypoints. As a byproduct of KEPLER, 3D pose information is also generated, which can be used for other tasks such as pose dependent verification methods, 3D model generation and many others. In conclusion, KEPLER demonstrates that with improved initialization and multitask training, cascaded regressors outperform state of the art methods not only in predicting the keypoints but also for head pose estimation. One future avenue for extending this work is to develop methods in which the Gaussians are learned and estimated directly from the image.

Figure 3.14: Qualitative results of KEPLER after the last stage. The green dots represent the final predicted points after the last stage. The first row shows test samples from AFLW. The second row shows samples from the AFW dataset.
The last two rows are the results of KEPLER after last stage from AFLW testset for all variants protocol. The green dots represent the final predicted points after second stage. 58 Figure 3.15: Qualitative results of KEPLER after last stage on COFW dataset. The green dots represent the final predicted points after last stage. Figure 3.16: Qualitative results of KEPLER after last stage on IJBA dataset. The green dots represent the final predicted points after last stage. 59 Chapter 4: Disentangling 3D Pose in A Dendritic CNN for Unconstrained 2D Face Alignment 4.1 Introduction As shown in [10], accurate face alignment improves the performance of a face verifica- tion system, as well as other applications such as 3D face modelling, face animation etc. Currently, face alignment is still dominated by regression-based approaches which yield a fixed number of points. Explicit Shape Regression (ESR) [25] and Su- pervised Descent Method (SDM) [158] have addressed the problem of face alignment for faces in medium pose. To achieve sub-pixel accuracy on such face images, coarse to fine approaches have also been proposed in the literature [89, 168, 174]. It is evi- dent that such methods perform poorly on face images with extreme pose, expression and lighting mainly because they are dependent on bounding box and mean face shape intializations. On the other hand, Convolutional Neural Networks (CNNs) have achieved breakthroughs in many vision tasks including the task of keypoints es- timation [109]. Lately, researchers have used heatmap regression extensively for the task of face alignment and pose estimation using an Encoder-Decoder architecture in the form of Convolution-Deconvolution Networks [32]. Most of the approaches in 60 Figure 4.1: (a) A bird?s eye view of the proposed method. Dendritic CNN is explicitly conditioned on 3D pose. A generic CNN is used for auxiliary tasks such as fine-grained localization or occlusion detection. the literature perform heatmap classification followed by regression [11, 17, 18, 21]. In this work, we propose the Pose Conditioned Dendritic Convolution Neural Net- work (PCD-CNN); which models the dendritic structure of facial landmarks using a single CNN (see Figure 4.1). Shape constraint: Methods such as ESR [25] and SDM [158] impose the shape constraint by jointly regressing over all the points. Such a shape constraint cannot be applied to a profile face as a consequence of extreme pose leading to a variable number of points. Tree structured part models (TSPM) [177] by Zhu et al. had two major limitations associated with it; namely pre-determined models and slower run-time. With an intent to solve these, we propose a tree structure model in a single Dendritic CNN (PCD-CNN), which is able to capture the shape constraint in a deep learning framework. Pose: Works such as Hyperface [115] and TCDCN [172] have used 3D pose in a multitask framework and demonstrated that learning pose and keypoints jointly using a deep network improves the performance of both tasks. However, in contrast to multi-tasking approaches, we condition the landmark estimates on the head pose, following a Bayesian formulation and demonstrate the effectiveness of the proposed 61 approach through extensive experiments. We wish to point out that our primary goal is not to predict the head pose, instead, use 3D head pose to condition the landmark points. This makes our work different from multitask approaches. 
Speed-vs-Accuracy: We observe that systems which process images at real time, such as [14,75] have higher error rate as opposed to cascade methods which are accurate but slow. Researchers have proposed many different network architectures like Hourglass [109], Binarized CNN (based on hourglass) [18] in order to achieve accuracy in keypoints estimation. Although, such methods are fully convolutional , they suffer from slower run time as a result of cascaded deep bottleneck modules which perform a large number of FLOPs during test time. The proposed PCD- CNN works at the same scale as the input image and thus reduces the extrapolation errors. PCD-CNN is fully convolutional with fewer parameters and is capable of processing images almost at real time speed (20FPS). Limited generalizability as a consequence of smaller number of parameters is tackled by efficiently training the network using Mask-Softmax loss and difficult sample mining. Generalizability: Methods for domain-limited face images have been de- veloped, mostly following the cascade regression approach. [24, 156, 167] have been shown to work well for faces under extreme external object occlusion. On the other hand, [92, 116, 142, 144, 145, 174] achieved satisfactory results on the 300W [123] dataset which contains images in medium pose with almost no occlusion. [73,85,176] have demonstrated their effectiveness for extreme pose datasets with a limited num- ber of fiducial points. However, they do not generalize very well to other datasets. We show that by a small increase in the number of parameters, PCD-CNN can be 62 extended to most of the publicly available datasets including 300W, COFW, AFLW and AFW yielding variable number of points depending on the protocol. (a) (b) Figure 4.2: (a) Details of the proposed method. The dotted lines on top of convolution layers denote residual connections. The feature maps from the pose model are multiplied element-wise with the feature maps of the keypoint model. The network inside the grey box represents the proposed PCD-CNN, whereas the second network inside the blue box is modular and can be replaced for an auxiliary task. A conv-deconv network for finer localization is used alongside a second regression network for occlusion detection. (b) Proposed dendritic structure of facial landmark points for effective information sharing among landmark points. The nodes of the dendritic structure are the outputs of deconvolutions while the edges between nodes i and j are modeled by con- volution functions fij . For the architecture of deconvolution network refer to Figure 4.3. To summarize, the main contributions of this work are : ? We propose the Pose Disentangled Dendritic CNN for unconstrained 2D face alignment, where the shape constraint is imposed by the dendritic structure of facial landmarks. The proposed method uses classification followed by classifi- 63 cation approach as opposed to classification followed by regression. The second auxiliary network is modular and can be designed for fine grained localization or any other auxiliary tasks. ? The proposed method disentangles the head pose using a Bayesian framework and experimentally demonstrates that conditioning on 3D head pose improves the localization performance. The proposed method processes images at real- time speed producing accurate results. ? With a recursive extension, the proposed method can be extended to datasets with arbitrarily different number of points and different auxiliary tasks. ? 
As a by-product, the network outputs pose estimates of the face image where we achieve close to state-of-the-art result on pose estimation on the AFW dataset. In another experiment, the auxiliary classification network is trained for occlusion detection where we obtain state-of-the-art result for occlusion detection on COFW dataset. 4.2 Prior Work We briefly review prior work in the area of keypoint localization under the following two categories: Deep Learning-based and Hand crafted features-based methods. Parametric part-based models such as Active Appearance Models (AAMs) [36] and Constrained Local Models [38] are statistical methods which perform keypoint detection by maximizing the confidence of part locations in a given input image using handcrafted features such as SIFT and HOG. The tree structure part 64 model (TSPM) proposed in [177] used deformable part-based model for simultaneous detection, pose estimation and landmark localization of face images modeling the face shape in a mixture of trees model. Later, [9] proposed learning a dictionary of probability response maps followed by linear regression in a Constrained Local Model (CLM) framework. Early cascade regression-based methods such as [8, 25, 134, 142, 144,158,174] also used hand crafted features such as SIFT to capture appearance of the face image. The major drawback of regression-based methods is their inability to learn models for unconstrained faces in extreme pose. Deep learning-based methods have achieved breakthroughs in a variety of vision tasks including landmark localization. One of the earliest works was done in [89,135] where a cascade of deep models was learnt for fiducial detection. 3DDFA [178] modeled the depth of the face image in a Z-buffer, after which a dense 3D face model was fitted to the image via CNNs. Pose Invariant Face Alignment (PIFA) [73] by Jourabloo et al. predicted the coefficients of 3D to 2D projection matrix via deep cascade regressors. [14] used 3D spatial transformer networks to capture 3D to 2D projection. [69, 76, 99] extended [73] by using CNNs to directly learn the dense 3D coordinates. The proposed method has a dendritic structure which looks at the global appearance of the image while the local interactions are captured by pose conditioned convolutions. PCD-CNN does not assume that all the keypoints are visible and the interactions between keypoints are learned. PCD-CNN is entirely based on 2D images, which captures the 3D information by conditioning on 3D head pose. Formulating keypoint estimation as the per-pixel labeling task, Hourglass net- 65 works [109] and Structured feature learning [34] were proposed. Hourglass networks use a stack of 8 very deep hourglass modules and hence, even though based en- tirely on convolution can process only 8-10 frames per second. [34] implemented message passing between keypoints, however was able to process images at lower resolution due to large number of parameters. PCD-CNN models the dendritic structure in branched deconvolution networks where each network is implemented in Squeezenet [68] fashion and hence has fewer parameters, contributing to real-time operation at full image scale. In the next few sections, we describe Pose Conditioned Dendritic-CNN in detail and present ablative studies to arrive at the desired architecture. 4.3 Pose Conditioned Dendritic CNN The task of keypoint detection is to estimate the 2D coordinates of, say N landmark points, given a face image. 
Observing the effectiveness of deep networks for a variety of vision tasks, we present a single end-to-end trainable deep neural network for landmark localization. It has been shown in previous works that capturing structural dependencies between different keypoints is important [34]. This work derives its motivation from the work by Zhu and Ramanan [177], where every keypoint was modeled as a part and a mixture of trees was used to select the best fitting model. Modeling such structural interactions between keypoints poses a great challenge in a deep learning framework, as the invisible points are not annotated.

Conditioning on 3D pose: Keypoints are susceptible to variations in external factors such as emotion, occlusion and intrinsic face shape. On the other hand, 3D pose is fairly stable to these factors and can be estimated directly from a 2D image [85]. Reasonably accurate 2D keypoint coordinates can also be inferred given the 3D pose and a generic 3D model of a human face. However, the converse problem of estimating 3D pose from 2D keypoints is ill posed. Therefore, we make use of a probabilistic formulation over the variables: the image I \in \mathbb{R}^{w \times h \times 3} of height h and width w, the 3D head pose P \in \mathbb{R}^{3}, and the 2D keypoints C \in \mathbb{R}^{N \times 2}, where N is the number of keypoints. Following the natural hierarchy between the two tasks, the joint and the conditional probabilities can be written as:

p(C, P, I) = p(C \mid P, I)\, p(P \mid I)\, p(I)   (4.1)

p(C, P \mid I) = \frac{p(C, P, I)}{p(I)} = \underbrace{p(P \mid I)}_{\text{CNN}} \cdot \underbrace{p(C \mid P, I)}_{\text{PCD-CNN}}   (4.2)

We implement the first factor with an image-based CNN learned to predict the 3D pose of the face image. The second factor is implemented through a ConvNet and multiple DeconvNets arranged in a dendritic structure. The convolution network maps the image to a lower dimension, after which the outputs of several deconvolution networks are stacked to form the keypoint heatmap. The models are tied together by an element-wise product (as in (4.1) and (4.2)) to condition the measurement of the 2D coordinates on 3D pose. We choose the element-wise product as the conditioning operation because keypoint heatmaps can be interpreted as probability distributions over the keypoints. The visibility of each keypoint is learnt implicitly, as the invisible points are labeled as background.

Multi-tasking-vs-Conditioning: In a multi-tasking method such as [85], several tasks are learnt synergistically and backpropagation impacts all the tasks. On the other hand, in the proposed PCD-CNN, the error gradients backpropagated from the keypoint network affect both the keypoint network and the pose network; however, the pose network affects the keypoint network only during the forward pass. In other words, multi-tasking approaches try to model the joint distribution p(C, P | I), whereas the proposed approach explicitly models the decomposed form p(P | I) p(C | P, I) by learning the individual factors.

Proposed Pose Conditioned Dendritic CNN: We propose the dendritic structure of facial landmarks shown in Figure 4.2(b), where the nose tip is assumed to be the root node. Such a structure is feasible even in faces with extreme pose. Following this, the keypoint estimation network is modeled with a single CNN in a tree structure composed of convolution and deconvolution layers. The pairwise relationships between different keypoints are modeled via specialized functions, f_{i,j}, which are implemented through convolutions and are analogous to the spring weights in the spring-weight model of Deformable Part Models [49].
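In the implementation, the conditioning in (4.2) reduces to an element-wise product between the feature maps of the two branches; a minimal PyTorch sketch is given below (the channel count is a placeholder and the exact PCD-CNN configuration differs), before we return to the pairwise functions f_{i,j}.

import torch.nn as nn

class PoseConditioning(nn.Module):
    """Minimal sketch of the conditioning step of Eq. (4.2): a convolution
    (with ReLU and batch normalization) is applied to the pose branch's
    activations and the result modulates the keypoint branch's activations
    through an element-wise product."""
    def __init__(self, channels=256):
        super().__init__()
        self.pose_proj = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(channels),
        )

    def forward(self, keypoint_feat, pose_feat):
        # both inputs: (B, C, H, W) feature maps from the two branches
        return keypoint_feat * self.pose_proj(pose_feat)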
A low confidence of a particular keypoint is reinforced when the response of f_{i,j} corresponding to the adjacent node is added. With experimental justification we show that such a deformable tree model outperforms the recently published works [14, 75, 76, 99] which use 3D models and 3D spatial transformer networks to supplement keypoint detection models. Figure 4.2 shows the overall architecture of the proposed PCD-CNN and the proposed dendritic structure of the facial landmarks.

Instead of going deeper or wider [18, 109] with deep networks, we base our work on the Squeezenet-11 [68] architecture, owing to its capability to maintain performance with fewer parameters. We use two Squeezenet-11 networks, one for pose and the other for keypoints, named PoseNet and KeypointNet respectively. Convolutions are performed on the pool8 activation maps of the PoseNet, the response of which is then multiplied element-wise with the response maps of the pool8 layers of the KeypointNet. Each convolution layer is followed by a ReLU non-linearity and batch normalization. In Table 4.10, we show that the keypoint localization error reduces when conditioned on 3D head pose.

The design of the deconvolution network is non-trivial. To maintain the same property as SqueezeNet, we first upsample the feature maps using parametrized strided convolutions and then squeeze the output feature maps using 1x1 convolutions. We call this network the Squeezenet-DeconvNet. Figure 4.3 shows the detailed architecture of the Squeezenet-DeconvNet. Since each keypoint in the proposed network is modeled by a separate Squeezenet-DeconvNet, it alleviates the need for a large number of deconvolution parameters (256 and 512 3×3 filters in Hourglass networks). In fact, in the practical version of PCD-CNN there are only 32 and 16 deconvolution filters, which results in networks that are small enough to fit in a single GPU. The design of networks with fewer filters is motivated by real-time processing considerations. With experiments we show that disentangling the pose by conditioning on it reinforces the learning of the proposed PCD-CNN with fewer parameters (Table 4.10).

Method                       Normalised Error
Without pose conditioning    3.45
With pose conditioning       2.85

Table 4.1: Root mean square error normalized by bounding box size, calculated on the AFLW validation set following the PIFA protocol. The proposed PCD-CNN, when conditioned on pose, yields better performance for the task of keypoint localization.

Method                           Normalised Error
Classification+Regression        3.93
Classification+Classification    3.09

Table 4.2: Mean square error normalized by bounding box size, calculated on the AFLW validation set following the PIFA protocol. PCD-CNN followed by another classification stage results in lower localization error compared to classification followed by regression. Note that conditioning on pose is not used in either case, for a fair comparison.

In order to obtain fine-grained localization results, we concatenate to the input data a learned function of the predicted probabilities (represented as the purple box in Figure 4.2(a)) and pass them through the second Squeezenet-based conv-deconv network. This function is modeled by a residual unit with 1×1 and 3×3 filters, which are learned end-to-end with the second classification network (while keeping the weights of PCD-CNN frozen). For experimental purposes, we replace the second conv-deconv network by another regression network designed along the lines of GoogleNet [137].
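A minimal sketch of one upsampling stage of the Squeezenet-DeconvNet described above is given below; the kernel size, padding and filter counts are illustrative rather than the exact configuration shown in Figure 4.3.

import torch.nn as nn

class SqueezeDeconvBlock(nn.Module):
    """Rough sketch of one Squeezenet-DeconvNet stage: a strided transposed
    convolution doubles the spatial resolution, then a 1x1 convolution
    'squeezes' the channel count. The 32/16 filter counts follow the text."""
    def __init__(self, in_ch=32, out_ch=16):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, in_ch, kernel_size=4, stride=2, padding=1)
        self.squeeze = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.act = nn.ReLU(inplace=True)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        # upsample, squeeze channels, then apply non-linearity and normalization
        return self.bn(self.act(self.squeeze(self.up(x))))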
Table 4.2 shows a comparison between the two-stage classification approach and classification followed by regression.

One of the goals of this work is to generalize facial landmark detection to other datasets in order to broaden its applicability. A trivial extension would be to increase the number of deconvolution branches, which however is infeasible due to limited GPU memory. With a non-trivial extension, PCD-CNN can be extended to yield more landmark points arranged in different configurations. In Figure 4.9 we show the proposed tree structures for the COFW and 300W datasets with 29 and 68 landmark points respectively. Keeping the basic dendritic structure intact, first the number of output response maps in the last deconvolution layer is increased and then network slicing is performed to produce the desired number of keypoints. For instance, the output of the deconvolution network for the eye-center is sliced to produce four outputs, as required by the 300W dataset. Depending on the dataset, the second network can be replaced to perform auxiliary tasks, resulting in a modular architecture; for instance, in the case of the COFW dataset we replace the second conv-deconv network with another Squeezenet network to detect occlusion. We direct the readers to the supplementary material for more details on network surgery and a magnified view of Figures 4.2(b) and 4.9.

Figure 4.3: Detailed description of a single Squeezenet-DeconvNet network. Note the smaller number of deconvolution filters. Each deconvolution network is identical to the one shown above.

Figure 4.4: The proposed extension of the dendritic structure from Figure 4.2, generalizing to other datasets (COFW and 300W), each with a different number of points.

Each branch of PCD-CNN is designed according to the proposed Squeezenet-Deconv network shown in Figure 4.3. Due to the fewer parameters in the Squeezenet-Deconv, we hypothesize limited generalization capacity of the deconvolution network. By means of experiments, we show that effective training methods such as Mask-Softmax and hard sample mining improve the performance of PCD-CNN by a large margin as a result of better generalization capacity.

Mask-Softmax Loss: To train the network, the localization of fiducial keypoints is formulated as a classification problem. The label for an input image of size h × w × 3 is a label tensor of the same size as the image with N + 1 channels, where N is the number of keypoints. The first N channels represent the locations of the keypoints, whereas the last channel represents the background. Each pixel is assigned a class label, with invisible points being labeled as background. The objective is to minimize the following loss function:

L_0(p, g) = -\sum_{i=1}^{h} \sum_{j=1}^{w} \sum_{k=1}^{N+1} m(i, j)\, g_k(i, j) \log \frac{e^{p_k(i, j)}}{\sum_l e^{p_l(i, j)}}   (4.3)

where k \in \{1, 2, \ldots, N+1\} is the class index and g_k(i, j) represents the ground truth at location (i, j). p_l(i, j) is the score obtained for location (i, j) after a forward pass through the network. Since the number of negative examples is orders of magnitude larger than the number of positives, we design a strategic mask m(i, j) which selects all the positive pixel samples, and keeps only 50% of the 4-neighborhood pixels and 0.025% of the negative background samples by random selection.

Method          Normalised Error
Softmax         4.56
Mask-Softmax    2.85

Table 4.3: Root mean square error normalized by bounding box, calculated on the AFLW validation set following the PIFA protocol. This table indicates the effect of using Mask-Softmax over Softmax.
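The masked loss of (4.3) can be sketched as follows; the code mirrors the sampling ratios stated above, while the exact neighborhood handling and the Caffe implementation used in the experiments are not reproduced.

import numpy as np

def mask_softmax_loss(scores, labels, rng=np.random.default_rng(0)):
    """Sketch of the Mask-Softmax loss in Eq. (4.3).
    scores: (N+1, h, w) raw network outputs; labels: (h, w) integer class map
    with the background assigned to class N."""
    num_classes, h, w = scores.shape
    background = num_classes - 1
    # softmax over the class dimension
    e = np.exp(scores - scores.max(axis=0, keepdims=True))
    prob = e / e.sum(axis=0, keepdims=True)

    pos = labels != background
    # 4-neighborhood of the positive pixels
    neigh = np.zeros_like(pos)
    neigh[1:, :] |= pos[:-1, :]
    neigh[:-1, :] |= pos[1:, :]
    neigh[:, 1:] |= pos[:, :-1]
    neigh[:, :-1] |= pos[:, 1:]
    neigh &= ~pos

    mask = pos.astype(float)                                   # keep all positives
    mask[neigh] = (rng.random(int(neigh.sum())) < 0.5).astype(float)    # 50% of neighbors
    bg = ~pos & ~neigh
    mask[bg] = (rng.random(int(bg.sum())) < 0.00025).astype(float)      # 0.025% of background

    log_p = np.log(prob[labels, np.arange(h)[:, None], np.arange(w)[None, :]] + 1e-12)
    return -(mask * log_p).sum()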
During backward pass, the gradients are weighed accordingly. We experimentally show the effect of using Mask- Softmax Loss by training two separate PCD-CNN; with and without the Mask- Softmax Loss; trained under identical training policies(Table 4.3) . Hard Sample Mining: [77] by Kabkab et al. showed that effective sampling of data improves the classification performance of the network. Following [77], we use an offline hard sample mining procedure to train the proposed PCD-CNN. The histogram of error on the training data is plotted after the network is trained for 10 epochs by random sampling (refer supplementary material). We denote the mode of the distribution as C, and categorize all the training samples producing errors larger than C as hard samples. Next we retrain the proposed PCD-CNN with hard and easy samples, sampled at the respective proportion. This effectively results in retraining the network by reusing the hard samples. Table 4.4 shows that such hard sample mining improves the performance of PCD-CNN (with fewer parameters) by a large margin. In the next set of experiments, we train PCD-CNN by increasing the number of deconvolution filters to 128 and 64 in each deconvolution network. We follow 73 Method Normalised Error Without Hard Mining 2.85 With Hard Mining 2.49 Table 4.4: Root mean square error normalized by bounding box calculated on the AFLW validation set following PIFA protocol. This table depicts the effect of offline hard sample mining. Method Normalised Error Less Filters+Hard Mining 2.49 More Filters+Hard Mining 2.40 Table 4.5: Root mean square error normalized by bounding box calculated on the AFLW validation set following PIFA protocol. This table shows the effect of offline hard-mining and quadrupling the number of deconvolution filters. the same strategy of Mask-Softmax and hard sample mining to train this network. Unsurprisingly, we see an improvement in performance for the task of keypoint localization (Table 4.5), although, increasing the number of deconvolution filters leads to slower run time of 11FPS as opposed to 20FPS. 4.4 Magnified version of the Tree One expects to receive information from all other keypoints in order to optimize the features at a specific keypoint. However, this has two drawbacks: First, to model the interaction between keypoints lying far away such as ?eye corner? and ?chin?, convolution kernels with larger size have to be introduced. This leads to increase in the number of parameters. Secondly, relationships between some keypoints are unstable, such as ?left eye corner? and ?right eye corner?. In a profile face image one of the points may not be visible and passing information between those two keypoints may lead to erroneous results. Hence, convolution kernels are learned at the size of 74 14 ? 14 which ensures keypoints which are closer and have stable relationships to be connected together. We also describe the process of extending the proposed dendritic structure of facial landmarks to other datasets with variable number of landmark points. Figure 4.9a shows the tree structure of the 21 landmark points compatible with the AFLW dataset. In figure 4.9b and 4.9c the number of points is increased to 29 and 68 respec- tively compatible with COFW and 300W datasets. We wish to keep the structure of the facial landmarks intact while increasing the number of landmark points. For this, we make use of the network surgery. 
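For reference, the offline hard-mining step described above can be sketched as follows; the number of histogram bins and the exact re-sampling scheme are assumptions.

import numpy as np

def build_hard_mining_sampler(per_sample_errors, rng=np.random.default_rng(0)):
    """Sketch of offline hard-sample mining: after an initial training run, the
    histogram of per-sample errors is computed, its mode C is used as a
    threshold, and samples with error larger than C are marked as hard.
    Retraining then draws hard and easy samples in their respective proportions."""
    errors = np.asarray(per_sample_errors, dtype=float)
    counts, edges = np.histogram(errors, bins=50)
    m = counts.argmax()
    C = 0.5 * (edges[m] + edges[m + 1])          # mode of the error distribution
    hard = np.where(errors > C)[0]
    easy = np.where(errors <= C)[0]

    def sample_epoch(num_samples):
        n_hard = int(round(len(hard) / len(errors) * num_samples))
        picks = [rng.choice(hard, n_hard, replace=True),
                 rng.choice(easy, num_samples - n_hard, replace=True)]
        return np.concatenate(picks)
    return sample_epoch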
First, the number of deconvolution filters in the penultimate and ultimate deconvolution layers is increased to 128 and 64 respectively. Next 1? 1 convolutions are used to obtain desire number of outputs, which is then sliced and concatenated in order for loss computation. For instance, eye center points is split into 4 landmark points in the case of COFW and 300W datasets, and ear corner points are dropped. An advantage of network surgery is that, it leads to yielding a variable number of landmark points with minimal increase in parameters while keeping the face structure intact. 4.5 Experiments We select four different datasets with different characteristics to train and evaluate the proposed two stage PCD-CNN. AFLW [82]and AFW [177] are two difficult datatsets which comprises of images in extreme pose, expression and occlusion. AFLW consists of 24, 386 in-the- 75 wild faces (obtained from Flickr) with head pose ranging from 0? to 120? for yaw and upto 90? for pitch and roll. AFLW provides at most 21 points for each face. It excludes coordinates for invisible landmarks and in our method such invisible points are labelled as background. For AFLW we follow the PIFA protocol; i.e. the test set is divided into three groups corresponding to three pose groups with equal number of images in each group. AFW which is a popular benchmark for the evaluation of face alignment algo- rithms, consisting of 468 in-the-wild faces (also obtained from Flickr) with yaw up to 90?. The images are diverse in terms of pose, expression and illumination and was considered the most difficult publicly available dataset, until AFLW. The number of visible points varies depending on the pose and occlusion with a maximum of 6 points per face image. We use AFW only for evaluation purposes. A medium pose dataset from the popular 300W face alignment competition [123]. The dataset consists of re-annotated five existing datasets with 68 landmarks: iBug, LFPW, AFW, HELEN and XM2VTS. We follow the work [174] to use 3, 148 images for training and 689 images for testing. The testing dataset is split into three parts: common subset (554 images), challenging subset (135 images) and the full set (689 images). Another dataset showing extreme cases of external and internal object occlu- sion; COFW [155]. COFW is the most challenging dataset that is designed to depict faces in real-world conditions with partial occlusions [24]. The face images show large variations in shape and occlusions due to differences in pose, expression, hairstyle, use of accessories or interactions with other objects. All 1,007 images were 76 annotated using the same 29 landmarks as in the LFPW dataset, with their indi- vidual visibilities. The training set includes 845 LFPW faces + 500 COFW faces, that is 1,345 images in total. The remaining 507 COFW faces are used for testing. Evaluation Metric: Following most previous works, we obtain the error for each test sample via averaging normalized errors for all annotated landmarks. We illustrate our results with mean error over all samples, or via Cumulative Error Distribution (CED) curve. For AFLW and AFW, the obtained error is normalized by the ground truth bounding box size over all visible points whereas for 300W and COFW, error is normalized by the inter-occular distance. Wherever applicable NME stands for Normalized Mean Error. Training: The PCD-CNN was first trained using the AFLW training set which was augmented by random cropping, flipping and rotation. 
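For reference, the normalized error reported per test sample throughout the following sections can be computed as in this sketch.

import numpy as np

def normalized_mean_error(pred, gt, visible, normalizer):
    """Sketch of the evaluation metric: mean Euclidean error over the visible
    annotated landmarks of one test sample, normalized either by the ground-
    truth bounding-box size (AFLW, AFW) or by the inter-ocular distance
    (300W, COFW). pred, gt: (N, 2) arrays; visible: (N,) boolean mask."""
    err = np.linalg.norm(np.asarray(pred, float) - np.asarray(gt, float), axis=1)
    return float(err[np.asarray(visible)].mean() / normalizer)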
The network was trained for 10 epochs where the learning rate starting from 0.01 was dropped every 3 epochs. Keeping the weights of PCD-CNN fixed, the auxiliary network for fine grained classifcation was trained for another 10 epochs using the hard mining strategy explained in section 4.3. PoseNet was kept frozen while training the network for COFW and 300W datasets. All the experiments including training and testing were performed using the Caffe [72] framework and Nvidia TITAN-X GPUs and p6000 GPUs. Being a non-iterative and single shot keypoint prediction method, our method is fast and can process 20 frames per second on 1 GPU only in batch mode. 77 4.6 Training Details KeypointNet and PoseNet described in section 3 are designed based on the SqueezeNet architecture, attributing its lower parameter count. The proposed PCD-CNN was first trained using AFLW training set, where Mask-Softmax is used for keypoints and Euclidean Loss for 3D pose estimation. Starting from the learning rate of 0.001, the network was trained for 10 epochs with momentum set to 0.95. The learning rate was dropped by a factor of 10 every 3 epochs. While training PCD- CNN for COFW and 300W datasets, the convolution branch was initialized with the previously trained network, whereas the deconvolution branches were trained from scratch. Since, COFW and 300W datasets does not provide 3D pose ground truth, we leverage the previously trained PoseNet and freeze its weights. 4.6.1 Effect of Pose Disentaglement Next, we also perform an experiment to observe the effect of 3D pose conditioning on the second auxiliary network designed for fine grained localization. Table 4.10 shows the effect of disentangling pose by conditioning, when the auxiliary conv- deconv network does not receive information from the PoseNet. 4.6.2 Improvement in localization by augmentation during testing For a fair comparison with the previous state-of-the-art methods we did not perform augmentation during testing. In the next set of experiments along with the test image, we also pass the flipped version of it and the final output is taken as the mean 78 AFLW AFW Method NME NME TSPM [177] - 11.09 CDM [2] 12.44 9.13 RCPR [24] 7.85 - ESR [25] 8.24 - PIFA [73] 6.8 9.42 3DDFA [178] 5.32 - LPFA-3D [74] 4.72 7.43 EMRT [175] 4.01 3.55 Hyperface [115] 4.26 - Rec Enc-Dec [1] >6 - PIFAS [76] 4.45 6.27 FRTFA [14] 4.23 - CALE [21] 2.63 - KEPLER [85] 2.98 3.01 Binary-CNN [18] 2.85 - PCD-CNN(Fast) 2.85 2.80 PCD-CNN(C+C) 2.49 2.52 PCD-CNN(Best: C+C+more filters) 2.40 2.47 Table 4.6: Comparison of the proposed method with other state of the art methods. C+C stands for classification+classification. For AFLW, numbers for other methods are taken from respective papers following the PIFA protocol. For AFW, the numbers are taken from respective published works following the protocol of [177]. of the two outputs. With experimentation we observe that data augmentation while testing also improves the localization performance. This does not incur any increase in run-time as the inputs can be passed through the network in batch mode, keeping the runtime still at 20FPS. Table 4.11 shows the effects of data augmentation during testing. 4.6.3 Training PCD-CNN for COFW This section covers the details of training for the COFW dataset. The PCD-CNN network was trained using the Mask Softmax and hard negative mining. 
The second auxiliary network was trained for the task of occlusion detection. According to the released details about the COFW dataset, around 23% of the landmark points are invisible. Hence, to tackle the class imbalance between the visible and invisible points, the following loss function was used:

L(p, g) = \sum_{i=1}^{29} \left(0.23 \cdot \mathbf{1}_{g_i^{vis}=1} + 0.77 \cdot \mathbf{1}_{g_i^{vis}=0}\right) \left(p_i^{vis} - g_i^{vis}\right)^2   (4.4)

where p and g are the vectors of predicted and ground-truth visibilities, and p_i^{vis} and g_i^{vis} are the individual elements of these vectors. The weighted loss function also balances the gradients that are back-propagated during training.

Method            [0,30]   [30,60]   [60,90]   Mean
HyperFace [115]   3.93     4.14      4.71      4.26
AIO [114]         2.84     2.94      3.09      2.96
Binary-CNN [18]   2.77     2.86      2.90      2.85
PCD-CNN (C+C)     2.33     2.60      2.64      2.49

Table 4.7: Comparison of the proposed method with other state-of-the-art methods on the AFLW-PIFA test set, categorized by absolute yaw angle. The numbers represent the normalized mean error.

Method        Common   Challenge   Full
RCPR [24]     6.18     17.26       8.35
SDM [158]     5.57     15.40       7.52
ESR [25]      5.28     17.00       7.58
CFAN [168]    5.50     16.78       7.69
LBF [116]     4.95     11.98       6.32
CFSS [174]    4.73     9.98        5.76
TCDCN [172]   4.80     8.60        5.54
DDN [165]     -        -           5.59
MDM [142]     4.83     10.14       5.88
TSR [103]     4.36     7.56        4.99
PCD-CNN       3.67     7.62        4.44

Table 4.8: Comparison of the proposed method with other state-of-the-art methods on the 300W dataset. The NMEs for comparison are taken from Table 3 of [103].

Figure 4.5: Cumulative error distribution curves for landmark localization on the AFLW, AFW and COFW datasets respectively. (a) Numbers in the legend represent the mean error normalized by the face size. (b) Numbers in the legend are the fraction of testing faces that have an average normalized error below 5%. (c) Numbers in the legend are the fraction of testing faces that have an average normalized error below 10%.

Figure 4.6 shows the failure rate and error rate on the COFW dataset. The failure rate on the COFW dataset drops to 4.53%, bringing down the error rate to 6.02. When testing with the augmented images, the error rate further drops to 5.77, bringing it closer to the human performance of 5.6. Figure 4.8a shows the precision-recall curve for the task of occlusion detection on the COFW dataset. PCD-CNN achieves a significantly higher recall of 44.7% at a precision of 80%, as opposed to RCPR's [24] 38.2%.

Method        NME    Failure Rate
RCPR [24]     8.5    20%
OFA [167]     6.46   -
HPM [54]      8.48   6.99%
ERCLM [16]    6.49   6.3%
RPP [160]     7.52   16.2%
Human [24]    5.6    0%
PCD-CNN       6.02   4.53%

Table 4.9: Comparison of the proposed method with other state-of-the-art methods on the COFW dataset.

Method                                          NME
PCD-CNN + Auxiliary Network                     2.99
PCD-CNN + Pose Conditioned Auxiliary Network    2.49

Table 4.10: Mean square error normalized by bounding box, calculated on the AFLW test set following the PIFA protocol. When both PCD-CNN and the fine-grained localization network are conditioned on pose, the error rate is lower.

4.7 Hard mining

Figure 4.7 shows the distribution of the average normalized error on the training sets of the AFLW and COFW datasets. The error distributions were obtained by evaluating the PCD-CNN network on the training set, after it was trained with the whole dataset for 10 epochs. The dataset is partitioned into hard and easy samples after choosing the mode of the distribution as the threshold.
Next, the network is trained again, Dataset Pre-Aug Post-Aug AFLW-PIFA (PCD-CNN-Fast) 2.85 2.81 AFW (PCD-CNN-Fast) 2.80 2.66 AFLW-PIFA (PCD-CNN-C+C) 2.49 2.40 AFW (PCD-CNN-C+C) 2.52 2.36 COFW (PCD-CNN-Fast) 6.02 5.77 300W-Challenge (PCD-CNN-Fast) 7.62 7.17 Table 4.11: NME on different datasets Pre-Augmentation and Post-Augmentation during testing. 82 Figure 4.6: Comparison of NME and failure rate over visible landmarks out of 29 land- marks from the COFW dataset. (a) (b) Figure 4.7: Histogram of error, when evaluated on the training set of (a) AFLW (b) COFW. 83 (a) (b) (c) (d) Figure 4.8: (a) Precision Recall for the occlusion detection on the COFW dataset. (b)Cumulative error distribution curves for pose estimation on AFW dataset. The numbers in the legend are the percentage of faces that are labeled within ?15? error tolerance. Cumulative Error Distribution curve for (c) Helen (d) LFPW, when the average error is normalized by the bounding box size. (a) (b) (c) Figure 4.9: The proposed extension of the dendritic structure from Figure 1, generalizing to other datasets with variable number of points. 84 by sampling equal number of images from both groups, which results in an effective reuse of the hard examples. 4.8 More results on AFLW, AFW, LFPW and HELEN In this section, we show some more results obtained by the PCD-CNN on AFW, LFPW and Helen datasets. Figure 4.8b shows the cumulative error distribution curves for the prediction of face pose on AFW dataset. We observe that even though the primary objective of PCD-CNN is not pose prediction, it achieves state-of-the- art results when compared to recently published works Face-DPL [177],RTSM [67]. Figures 4.8c and 4.8d show the cumulative error distribution curve on LFPW and Helen datasets, when the average error is normalized by face size. PCD-CNN achieves significant improvement over the recent work of GNDPM [144]. Figure 4.11 shows some of the difficult test samples from AFLW, AFW, COFW and IBUG datasets respectively. 4.8.1 Results Table 4.6 compares the performance of proposed method over other existing methods on AFLW-PIFA and AFW dataset. Table 4.7 compares the performance on AFLW- PIFA with respect to each pose group. Tables 6.6 and 4.9 compares the mean normalized error on the 300W and COFW datasets respectively. It is clear from the tables that while the proposed PCD-CNN performs comparable to previous state- of-the-art method [18], the two stage PCD-CNN outperforms the state-of-the-art 85 Figure 4.10: Qualitative results generated from the proposed method. The green dots rep- resent the predicted points. Every two show randomly selected samples from AFLW, AFW, COFW, and 300W respectively with all the visible predicted points. 86 methods on all three datasets: AFLW, AFW and COFW by large margins. It is not surprising that increasing the number of deconvolution filters improves the performance on all the datasets. Figures 4.5a, 4.5b and 4.5c show the cumulative error distribution for landmark localization in AFLW, AFW and COFW test sets. From the plots, we observe that the proposed PCD-CNN leads to a significant increase in the percentage of images with mean normalized error less than 5%. On AFW, fraction of images having an error of less than 15? for pose estimation is 87.22% compared to 82% in the recent work [67]. On COFW dataset, the NME reduces to 6.02 (close human performance of 5.6) bringing down the failure rate to 4.53%. PCD-CNN achieves a higher recall of 44.7% at the precision of 80% as opposed to RCPR?s [24] 38.2%. 
(refer to the supplementary material for more results.) Figure 4.11: Qualitative results generated from the proposed method. The green dots represent the predicted points. Each row shows some of the difficult sam- ples from AFLW, AFW, COFW, and 300W respectively with all the visible predicted points. 87 Based on our experiments, we observe that two major factors are responsi- ble for achieving state-of-the-art results on the task of face alignment. First, the choices made during the design of PCD-CNN and efficient training; and secondly, disentangling of pose by conditioning on it. With the assistance of above two fac- tors PCD-CNN is able to effectively localize landmark points on unconstrained faces directly from 2D images without using 3D morphable models. Figure 4.11 shows some of the difficult images and the predicted visible keypoints on the four datasets. We also achieve state of the art results on the performance of auxiliary tasks, such as pose estimation on AFW and occlusion prediction on COFW dataset. 4.9 Conclusions In this work, we present a dendritic CNN which processes images at full scale looking at the images globally and capturing local interactions through convolutions. We also demonstrate that disentangling pose by conditioning on it can influence the localization of landmark points by reducing the mean pixel error by a large margin. We show that due to effective design choices made, the proposed model is not limited to yield a fixed number of points and can be extended to other datasets with different protocols. With the help of ablative studies, impact of effective training of the convolutional network by using sampling strategies such as Mask-Softmax and hard instance sampling is shown. Using smaller and fewer convolution filters, the proposed network is able to process images close to real-time and can be deployed in a real life scenario. The proposed method can be easily extended to 3D face 88 alignment and human pose estimation tasks, which we plan to pursue in the future. 89 Chapter 5: A Cascaded Convolutional Neural Network for Age Estimation of Unconstrained Faces 5.1 Introduction Face analysis is an active research topic in computer vision with applications in surveillance, human-computer interaction, access control, and security. In this work, we focus on apparent age estimation. Traditionally, the problem is tackled through pure classification or regression approaches. In this chapter, we present a cascaded approach which incorporates the advantages of both classification and regression approaches. Given an input image, we first apply the age group classification algo- rithm to obtain a rough estimate and then perform age group specific regression to obtain an accurate age estimate. Like other facial analysis techniques, age estimation is affected by many in- trinsic and extrinsic challenges, such as illumination variation, race, attributes, etc. One may define the age estimation task as a process of automatically labeling face images with the exact age, or the age group (age range) for each individual. It was suggested in [50] to differentiate the problem of age estimation along four concepts: ? Actual age: real age of an individual. 90 Figure 5.1: Estimated age on sample images from [45]. Our method is able to predict the age in unconstrained images with variations in pose, illumination, age groups, and expressions. ? Appearance age: age information shown on the visual appearance. ? Apparent age: suggested age by human subjects from the visual appearance. 
? Estimated age: recognized age by an algorithm from the visual appearance. The proposed cascaded classification and regression approach for apparent age estimation is based on a deep convolutional neural network. Our method consists of three main stages: (1) a single coarse age classifier, (2) multiple age regressors, and (3) an error correcting stage to correct the mistakes made by the age group classifer. Since the number of samples for apparent age estimation is limited, we exploit a DCNN model pretrained for large-scale face identification task and finetune the model for age group classification and age regression tasks. This strategy is effective since the face recognition model trained on the CASIA-WebFace dataset [162] (i.e. it consists of 10,575 subjects and 494,414 images.) encodes rich information reflecting large variations in facial appearances due to aging and variations in pose, expression 91   1st: Age Group Classifier  ? 2nd : Apparent Age Regressor per Age Group  :  Predicted Age: 14 Age Error     +  Correction Figure 5.2: An overview of the proposed age cascade apparent age estimator. and illumination. The main contribution of this work is to propose the age error correction mod- ule which mitigates the common disadvantage of coarse-to-fine approaches. Typi- cally, the errors made at the initial classification stage cannot be recovered by the regressors at the following stage. In this work, we set up the baseline algorithm which is based on the proposed regression algorithm in Section 5.3.6 and study how the coarse-to-fine strategy and the error correction module improve the predic- tion performance. Figure 5.2 presents an overview of the proposed age estimation method. The rest of the chapter is organized as follows: Section 5.2 provides a brief overview of the related works. The proposed approach is presented in Section 5.3 with a concrete example. Experimental results are provided in Section 5.4, and Section 5.5 concludes the chapter with a brief summary and discussion. 92 5.2 Related Work Most of the earlier age estimation methods have focused on using shape or textural features. These features are then fed to a regression method or a classifier to estimate the apparent age [111,112,143,152]. Holistic approaches usually adopt subspace-based methods, while feature-based approaches typically extract different facial regions and compute anthropometric dis- tances. Geometry-based methods [111,143] are inspired by studies in neuroscience, which suggest that facial geometry strongly influences age perception [111]. As such, these methods address the age estimation problem by capturing the face geometry, which refers to the location of 2D facial landmarks on images. Recently, Wu et al. [152] proposed an age estimation method that presents the facial geometry as points on a Grassmann manifold. To solve the regression problem on the Grass- mann manifold, [152] then used the differential geometry of the manifold. However, the Grassmannian manifold-based geometry method suffers from a number of draw- backs. First, it heavily relies on the accuracy of landmark detection step, which might be difficult to obtain in practice. For instance, if an image is taken from a bearded person, then detecting landmarks would become a very challenging task. In addition, different ethnic-groups usually have slightly different face geometry, and to appropriately learn the age model, a large number of samples from different ethnic groups is required. 
Unlike the traditional methods discussed, the proposed method is based on DCNN to encode the age information from a given image. Recent advances in deep 93 learning methods have shown that compact and discriminative image representation can be learned using DCNN from very large datasets [29]. There are various neural- network-based methods, which have been developed for facial age estimation [52, 83, 129] . However, as the number of samples for estimating the apparent age task is limited, (i.e. not enough to properly learn discriminative features, unless a large number of external data is added), the traditional neural network methods often fail to learn an appropriate model. Thukral et. al. [140] proposed a cascaded approach for apparent age esti- mation based on classifiers using the naive-Bayes approach and a support vector machine (SVM) and regressors using the relevance vector machine (RVM). How- ever, the difference between [140] and the proposed approach is that we leverage the rich information contained in the DCNN model pretrained using a large-scale face dataset for age estimation. Also, the proposed error correction module mitigates the influences of the errors made at initial classification stage. 5.3 Proposed Method Figure 5.2 shows an overview of our CNN-based cascaded age estimation method. Our approach consists of three main components: (1) age group classifier, (2) age regressor to predict the relative age with respect to each age group mean, and (3) apparent age error correction. Given a face image, we first apply the age group classifier to get a rough estimate of the age range from the image. Then, we choose the corresponding age regressor based on the classification results to predict the 94 relative age with respect to the predicted group mean and combine them to get the apparent age estimate. Then, we utilize the characteristic of the classification plus regression framework to design an age error correction scheme to correct age classification and regression errors. Finally, the algorithm outputs the final age estimate for the given input image. In what follows next, we will describe each of these component in detail. 5.3.1 Face Preprocessing In our work, all the face detection and facial landmark detection are handled using the open source library dlib [148] [78]. Three landmark points (the center of the left eye, the center of the right eye, and the nose base) are used to align the detected faces into the canonical coordinate system using the similarity transform. 5.3.2 Deep Face Feature Representation We use the DCNN model with the architecture similar to the one proposed in [162] which is pretrained for the face-identification task with softmax loss using the CASIA-WebFace dataset [162]. The CASIA-WebFace dataset consists of 10,575 subjects and 494,414 images. The architecture is composed of 10 convolutional lay- ers, 5 pooling layers and 1 fully connected layer. In our work, we use PReLU [62] instead of ReLU as the nonlinear activation function and data augmentation to train the network. The input is a color image of aligned faces of dimension 100? 100? 3. The details of this architecture are given in Table 5.1. We do net surgery on this 95 network (i.e., we cut off the part after pool5 layer.) and use its pretrained weights on the CASIA-WebFace dataset to finetune on the age group dataset and apparent age estimation dataset to perform age group classification and relative age regression with respect to each age group. 
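For illustration, the similarity-transform alignment of Section 5.3.1 can be estimated from the three detected points as in the following sketch; the canonical target coordinates are chosen by the user and are not specified here.

import numpy as np

def similarity_transform(src, dst):
    """Sketch of the alignment step: estimate the 2D similarity transform
    (scale, rotation, translation) mapping the detected points (both eye
    centers and the nose base) to their canonical positions by linear least
    squares. src, dst: (K, 2) arrays of corresponding points. The returned
    2x3 matrix can be used to warp the face into canonical coordinates."""
    A, b = [], []
    for (x, y), (u, v) in zip(np.asarray(src, float), np.asarray(dst, float)):
        A.append([x, -y, 1.0, 0.0]); b.append(u)
        A.append([y,  x, 0.0, 1.0]); b.append(v)
    a, c, tx, ty = np.linalg.lstsq(np.asarray(A), np.asarray(b), rcond=None)[0]
    return np.array([[a, -c, tx],
                     [c,  a, ty]])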
5.3.3 Age Group Classifier Inspired by the Viola and Jones face detection algorithm [148], we quantize the human age into several age groups (e.g. 0-7, 8-14, 15-23, etc.) which is an easier problem than directly performing classification or regression for the whole age range which requires a large amount of training data. To train the age group classifier, we remove the original fully connected layer, add the PReLU units and the fully connected layer with 512 outputs and finetune it on the the Images of Groups [51], Adience [43] and FGNet [61] datasets to obtain the DCNN-based age group classifier. 5.3.4 Apparent Age Regressor Per Age Group To train the age regressor for each age group, we prepare the training data by splitting each training sample into the corresponding age group based on its ground truth age, and then subtract the mean of that group. The regressors are trained in two ways. The first one is to extract the pool5 features and use them to train the regressors with a large batch size. The other is to train the regressor through end- to-end network finetuning but with a smaller batch size. (i.e., Similarly, we keep the part before pool5 layer and add fully connected layers.) Since the pool5 feature 96 in the face identification task is followed by the fully connected layer with 10,575 output corresponding to the number of subject in the CASIA-WebFace dataset, the pool5 features should contain strong discriminative information from all the face images to classify a large number of subjects in the training data. In addition, we also adopt a novel loss function called, the Gaussian Loss, which takes the a rough age (i.e. the age is represented as a mean and a standard derivation instead of the exact age) as input and is robust for apparent age estimation. The role of the new loss function in learning the nonlinear regression method is discussed in Section 5.3.6. For the pre-training of DCNN face representation model, we use the standard batch size 128 for the training phase. The initial negative slope for PReLU is set to 0.25 as suggested in [62]. The weight decay rates of all the convolutional layers are set to 0, and the weight decay of the final fully connected layer to 5e-4. In addition, the learning rate is set to 1e-2 initially and reduced by half every 100,000 iterations. The momentum is set to 0.9. Finally, we use the snapshot of 1,000,000th iteration as our pretrained model. For the finetuning of the age group classifier, we use the learning rate, 1e-4, for the convolutional layers and 1e-3 for the fully connected layers with 100,000 iterations. For training each age regressor, we first extract all the 320-d feature vectors for each age group and feed them at once into the age regressor network. We train it with 30,000 iterations using the learning rate, 1e-2, and momentum, 0.9. For the end-to-end finetuning of the regressors, we use batch size, 128, with the learning rate, 1e-4, for the convolutional layers and 1e-3 for the fully connected layers. The 120,000th models are used for each age regressor. Data 97 augmentation is performed by randomly cropping 100 ? 100 regions from a 128 ? 128 box and horizontally face flipping. 5.3.5 Age Error Correction In practice, the age group classifier will make errors and these errors significantly affect the final age estimation results for the second stage regressors. To handle these errors, we employ an error correcting approach. 
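Before describing the correction scheme, a minimal sketch of how the per-group regression targets of Section 5.3.4 are formed is given below; the group boundaries shown are illustrative only and do not reproduce the exact ranges used per dataset.

# Illustrative age-group boundaries only; the experiments quantize ages
# following the Adience-style groups described in the text.
AGE_GROUPS = [(0, 7), (8, 14), (15, 23), (24, 32), (33, 43), (44, 53), (54, 100)]

def regression_target(age):
    """Assign a training sample to its age group and return the group index
    together with the shift of the true age from that group's mean, which is
    the target learned by the per-group regressor."""
    for idx, (lo, hi) in enumerate(AGE_GROUPS):
        if lo <= age <= hi:
            return idx, age - 0.5 * (lo + hi)
    raise ValueError("age outside the modeled range")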
When we train the regressor for each age group, we also include the training examples from the neighboring age groups. For example, given 3 age groups, (1) 8-14, (2) 15-21, and (3) 22-28, if we want to train the age regressor for the first age group, then besides the training samples with ages ranging from 8 to 14 years old, we also add the training samples from its neighboring group (i.e., we added the samples from ±2 groups for the experiments), that is, the second age group. Thus, when the classifier mistakenly assigns the subject to the neighboring age group, the regressor is able to predict a large enough value and correct the error caused by the age group classifier. Furthermore, to take the classifier error into consideration, we also add the misclassified samples to augment the training samples of all the regressors in between the true and wrong groups, to increase the chance of correcting the imprecise age estimate so that it is close to the ground truth through our error correction scheme. A detailed step-by-step illustration of the age error correction scheme and the other components is presented in the following subsection. The pseudo code for our age correction approach is given in Algorithm 1.

Algorithm 1 Age Estimation Algorithm
Require: (a) input face image I, (b) maximum number of iterations maxIter, (c) age group classifier G_0, and age regressor per age group A_0, A_1, ..., A_{N-1}, where N is the number of age groups and both the age group classifier and the age regressors are DCNN-based models.
Ensure: predicted apparent age â.
 1: ĝ = G_0(I), where ĝ is the predicted age group label.
 2: for i = 0 to N-1 do
 3:     Δa_i = A_i(I)
 4: end for
 5: â = mean(ĝ) + Δa_ĝ
 6: // Age estimation error correction
 7: for i = 0 to maxIter-1 do
 8:     g' = L(â), where L(·) returns the age group label of â
 9:     if g' = ĝ then
10:         return â
11:     else
12:         â = mean(g') + Δa_{g'}
13:     end if
14:     ĝ = g'
15: end for
16: return â

Name     Type              Filter Size/Stride   #Params
Conv11   convolution       3×3×1 / 1            0.28K
Conv12   convolution       3×3×32 / 1           18K
Pool1    max pooling       2×2 / 2
Conv21   convolution       3×3×64 / 1           36K
Conv22   convolution       3×3×64 / 1           72K
Pool2    max pooling       2×2 / 2
Conv31   convolution       3×3×128 / 1          108K
Conv32   convolution       3×3×96 / 1           162K
Pool3    max pooling       2×2 / 2
Conv41   convolution       3×3×192 / 1          216K
Conv42   convolution       3×3×128 / 1          288K
Pool4    max pooling       2×2 / 2
Conv51   convolution       3×3×256 / 1          360K
Conv52   convolution       3×3×160 / 1          450K
Pool5    avg pooling       7×7 / 1
Dropout  dropout (40%)
Fc6      fully connected   10575                3305K
Cost     softmax
Total                                           5015K

Table 5.1: The base architecture of the DCNN model used in this work [162], which is finetuned for age group classification and Δage regression for each age group.

5.3.6 Non-linear Regression

We use a 3-layer neural network to learn the age regressor for each age group. The number of layers was determined experimentally to be 3. The regression is learned by optimizing the following Gaussian loss function [45], which is useful since the apparent age labels are usually not exact:

L = \frac{1}{N} \sum_{i=1}^{N} \left(1 - e^{-\frac{(\Delta x_i - \Delta_i)^2}{2 \sigma_i^2}}\right)   (5.1)

where L is the average loss over all the training samples, Δx_i is the predicted shift in age from the mean of the corresponding age group, Δ_i is the ground truth shift in age, and σ_i is the standard deviation in age increment for the i-th training sample. The network parameters are trained using the back-propagation algorithm [118] with batch gradient descent. The gradient obtained for the loss function is given by (5.2).
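For concreteness, a small numerical sketch of this loss and of the per-sample gradient that follows from it (written out in Eq. (5.2) below) is given here; the variable names mirror Eq. (5.1) and the code is not the actual training implementation.

import numpy as np

def gaussian_loss_and_grad(dx, delta, sigma):
    """Gaussian loss of Eq. (5.1) and its per-sample gradient.
    dx: predicted age shifts, delta: ground-truth shifts, sigma: per-sample
    standard deviations of the apparent-age annotations."""
    dx, delta, sigma = (np.asarray(v, dtype=float) for v in (dx, delta, sigma))
    diff = dx - delta
    expo = np.exp(-diff ** 2 / (2.0 * sigma ** 2))
    loss = np.mean(1.0 - expo)
    grad = diff * expo / (dx.size * sigma ** 2)   # dL/d(dx_i), back-propagated into the network
    return loss, grad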
This gradient is used for updating the network weights during training using back-propagation:

\frac{\partial L}{\partial \Delta x_i} = \frac{1}{N \sigma_i^2} (\Delta x_i - \Delta_i)\, e^{-\frac{(\Delta x_i - \Delta_i)^2}{2 \sigma_i^2}}   (5.2)

We apply dropout [132] after each fully connected layer to reduce the over-fitting caused by the limited amount of training data. The amount of dropout applied is 0.4, 0.3 and 0.2 for the input, first and second layers of the network respectively. The dropout ratio is decreased for the deeper layers to cope with the decrease in the number of parameters. Each layer is followed by the PReLU [62] activation function except the last one, which predicts the age. The first layer is the input layer, which takes the 320-dimensional feature vector obtained from the face-identification task. The output of this layer, after the dropout and PReLU operations, is fed to the first hidden layer containing 320 hidden units. Subsequently, the output propagates to the second hidden layer containing 160 hidden units. The output from this layer is used to generate a scalar value that describes the apparent age. Figure 5.3 depicts the 3-layer neural network used.

Figure 5.3: The 3-layer neural network used for estimating the increment in age for each age group.

5.3.7 A Toy Example

To illustrate the end-to-end pipeline of the proposed age estimation algorithm, we present a toy example below. In this example, we use a 3 age group setting for the age group classifier, where (1) the first age group is from 8 to 14 years, (2) the second from 15 to 21, and (3) the third from 22 to 28. The age regressor predicts a Δage with respect to the mean age of its corresponding group. For example, the regressor for the first age group is in charge of predicting a real value ranging from -3 (i.e., 8 - 11 = -3, where 11 is the mean age of the first group) to +3 (i.e., 14 - 11 = 3). Now, given a face image with ground truth age 27 years, ideally the predicted age group label should be 3 after passing the image through the age group classifier. Then, we use the third age regressor to predict its Δage, which should ideally be +2, and we can estimate the apparent age as 25 + 2 = 27 by combining the results of the age group classifier and its corresponding age regressor, where 25 is the group mean of the third age group. However, as mentioned in Section 5.3.5, in practice, if the age group classifier makes mistakes, the age estimation results will be wrong. To handle this error, we perform the age error correction described in Section 5.3.5. Now, given another face image with ground truth age 14 that is incorrectly classified into the third age group, we augment the misclassified samples when we train the regressor. Thus, it can be expected that the Δage will be negative enough, say -5, and as a result the age estimate will be 25 - 5 = 20, which is still wrong but falls in the range of the second group. Then, we can pass the image again to the second group regressor to get a new estimate, say 18 - 4 = 14. We stop correcting the error when the predicted age and the previous predicted age fall in the same group, or when we reach the maximum number of iterations. That is, we pass the image to the first regressor again, it predicts 11 + 3 = 14, and then we stop. Otherwise, we continue to perform the correction. The proposed age estimation algorithm is summarized in Algorithm 1.
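A runnable sketch of Algorithm 1, together with stand-in components that reproduce the toy example above, is given below; the mapping from an age back to its group is done here by nearest group mean, which is an assumption, and the lambdas stand in for the trained classifier and regressors.

def estimate_age(image, classify_group, regressors, group_means, max_iter=5):
    """Sketch of Algorithm 1. classify_group(image) returns a group index,
    regressors[g](image) returns the delta-age for group g, and group_means[g]
    is the mean age of group g."""
    def group_of(age):
        return min(range(len(group_means)), key=lambda g: abs(group_means[g] - age))

    g = classify_group(image)
    age = group_means[g] + regressors[g](image)
    for _ in range(max_iter):            # error-correction loop
        g_new = group_of(age)
        if g_new == g:                   # prediction is consistent with its group
            break
        g = g_new
        age = group_means[g] + regressors[g](image)
    return age

# Stand-ins reproducing the toy example (groups 8-14, 15-21, 22-28 with means
# 11, 18, 25): a 14-year-old face is misclassified into the third group, and
# two correction steps recover the age: 25 - 5 = 20, then 18 - 4 = 14, then stop.
group_means = [11, 18, 25]
classifier = lambda img: 2
regressors = [lambda img: 3.0, lambda img: -4.0, lambda img: -5.0]
print(estimate_age("face.jpg", classifier, regressors, group_means))   # prints 14.0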
The ex- ecution orders for both the classification and regression parts are written in parallel, and thus it runs in one age group classification plus N ?age regression simultane- ously in total. The maximum number of iterations is preset to avoid looping. 102 5.4 Experimental Results We evaluate the proposed method on two publicly available datasets: Adience [43] and FG-Net [61]. Both datasets include unconstrained images of individuals which are labeled by their actual biological ages. In addition to these two datasets, we present results on the ICCV 2015 Chalearn ?Looking at people-Age Estimation? challenge dataset [45]. The main difference between this dataset and Adience and FG-Net datasets is that Chalearn includes unconstrained images of individuals la- beled by their apparent ages. 5.4.1 Datasets Adience dataset [43] consists of 26, 580 unconstrained images of 2, 284 subjects in 8 age groups (0-2, 4-6, 8-13, 15-20, 25-32, 38-43, 48-53, 60+). The standard five-fold, subject-exclusive cross-validation protocol is used for testing (i.e., we merge 0-2 and 4-6 into one for the experiments of Challenge and FG-Net datasets.) FG-Net aging dataset [61] contains a collection of 1, 002 images of 82 sub- jects, where each image is annotated with true age. Images of groups [51] consists of 28, 231 faces in 5, 080 images. Each face is annotated with a label corresponding to one of the seven age groups; 0-2, 3-7, 8-12, 13-19, 20-36, 37-65, 66+ . Chalearn Workshop Challenge dataset is the first dataset on apparent age estimation containing annotations. The dataset consists of 2, 476 training images, 1, 136 validation images, and 1, 087 test images, which were taken from individuals 103 aged between 0 to 100. The images are captured in the wild, with variations in pose, illumination and quality. Figure 5.4 shows the distribution of the ?Chalearn Looking at People? Challenge dataset across the different age groups. It is evident from this figure that most of the data are distributed around the age group of 20-50, while there are very few samples in the range of 0-15 and above 55. The remaining data consists of the test set which has not been released publicly. 5.4.2 Experimental Details For the first stage of age classification, we augmented the training set with the training splits of Adience [43], FG-Net [61] and Images of groups [51] datasets. To evaluate on the FG-Net, we train the seven regressor networks and then pass them through our proposed error correcting mechanism to predict the final age. Although the recently released IMDB-WIKI dataset [121] contains a large collection of images with ages, the number of the images for the young and old age groups is much smaller than other groups and some of the annotations for the dataset are noisy. Due to these factors, we confine the age group ranges to the ones defined by Adience [43] and focus on those previosly well-labelled datasets for this work. The study of the influences by different ranges of age group intervals is left for future work. All the models were trained using Caffe [71]. We also compare the performance of our proposed method with a recently proposed geometry-based method [152], which is referred to as Grassmann-Regression (G-LR). 104 Figure 5.4: Training data distribution of ICCV-2015 Chalearn Looking at People Appar- ent Age Estimation Challenge, with regard to age groups. 
5.4.3 Results

To evaluate the performance of the age classification algorithm, we conduct experiments on the Adience dataset [43], following the five-fold cross-validation protocol described in [94]. From Table 5.2, it can be seen that our approach achieves better performance than the previous state-of-the-art methods. One thing worth noting is that the accuracy for exact age group classification is around 53%, but the 1-off accuracy is 88.45% (1-off means the predicted label is within the neighboring groups of the true one, and 2-off means within ±2 groups). The results demonstrate the need for our error correction module to make the coarse-to-fine strategy work better.

Method           Exact         1-off
Best from [43]   45.1 ± 2.6    79.5 ± 1.4
Best from [94]   50.7 ± 5.1    84.7 ± 2.2
Ours             52.88 ± 6     88.45 ± 2.2

Table 5.2: Age estimation results on the Adience benchmark. Listed are the mean accuracy ± standard error over all age categories. Best results are marked in bold.

After age group classification, we evaluated the performance of the proposed method following the protocol provided by the Chalearn "Looking at People" challenge dataset, to further investigate how the coarse-to-fine strategy and the error correction mechanism help the age estimation. The error is computed as follows:

\epsilon = 1 - e^{-\frac{(x - \mu)^2}{2 \sigma^2}}   (5.3)

where x is the estimated age, µ is the provided apparent age label for a given face image (the average of at least 10 different user opinions), and σ is the standard deviation of all (at least 10) gauged ages for the given image. We evaluate our method on the validation set of the challenge [45], as the test set annotations are not available for performing analysis. Our baseline approach performs age estimation with a single deep regressor (as described in Section 5.3.6) on top of all the DCNN features. Table 5.3 shows that the coarse-to-fine strategy improves the prediction results of the baseline approach, and that the error correction module further boosts the performance significantly, which also demonstrates that the error correction module effectively fixes the errors made by the age classification step. In addition, we show that the results of end-to-end finetuning on the challenge training data, for both the baseline and our approach, outperform the models that are trained separately. (For the baseline with end-to-end finetuning, we use the 500,000th model, trained with the same batch size and learning rate as the proposed approach.)

Method                                                        Gaussian Error
G-LR [152]                                                    0.62
Baseline                                                      0.39
Our method without error correction                           0.382
Our method with error correction                              0.355
Baseline with end-to-end finetuning                           0.312
Our method with end-to-end finetuning and error correction    0.297

Table 5.3: Performance comparison on the Chalearn Challenge dataset.

Figure 5.5: Age estimates on the Chalearn validation set. The incorrect age obtained without using the self-correcting module is shown in blue, while the corrected age is given in red.

Some sample prediction results from this dataset are shown in Figure 5.5. By looking at the images, we can infer that our method is robust to pose and resolution changes to a certain extent. It fails mostly in extreme illumination and extreme pose scenarios. On further inspection of the Chalearn challenge dataset, we observe that the first-stage classification fails to classify correctly when the images contain attributes such as hats, glasses, microphones, etc.
However, the proposed error correcting mechanism makes it robust to such artifacts. The performance of our method can be improved considerably if we train using age labeled data. Finally, we further evaluate the proposed method with end-to-end finetuning on the FG-Net dataset (i.e., For FGNet, we set ? = 2 for Gaussian loss.). Since the training of DCNN is computationally intensive, a fair amount of time is needed to complete the full leave-one-person out (LOPO) evaluations. Thus, we chose to compromise and show a result that demonstrates the performance level as compared to other methods. We randomly chose 73 subjects and used their images as the training data and the rest for testing. Table 5.4 shows the empirical evaluation of our method compared with several other methods proposed in recent years (i.e., Since the test protocol is different from LOPO used for other methods, the results of the proposed method are not directly comparable to others but only as an empirical performance evaluation.). From this table, it can be seen that our method performs comparable to other state-of-the-art age estimation methods. The approach with error correction module performs much better than the one without considering neighboring samples for error correction during training. 108 Reference Method Training/Testing Result (MAE) Luu2009 [102] 2 stage SVR in AAM subspace 800/200 4.37 Ylioinas2013 [164] LBP Kernel Density Estimate LOPO 5.09 Geng2013 [53] Label Distribution (CPNN) LOPO 4.76 Chen2013 [31] Cumulative Attribute SVR LOPO 4.67 El Dib2010 [44] Biologically-Inspired features LOPO 3.17 Han2013 [61] Component and holistic BIF LOPO 4.6 Hong2013 [65] Biologically InspiredAAM LOPO 4.18 Chao2013 [28] Label-sensitive learning LOPO 4.38 Proposed method Classification+Regression 890 train , 112 test 4.8 Proposed method Classification+Regression+EC 890 train , 112 test 3.49 Table 5.4: Performance comparison of different age estimation algorithms on the FG-Net aging database using mean absolute error(MAE). Since the training of DCNNs is computationally intensive, the evaluation of the proposed approach does not follow the full LOPO protocol. The results are for an empirical evaluation to show the performance level of the proposed approach. 5.4.4 Runtime All the experiments were performed using NVIDIA GTX TITAN-X GPU and the CUDNN library on a 2.3Ghz computer. The first stage training for the classifica- tion task took approximately 8 hours whereas training for the second stage took approximately 8 hours per regressor. The system is fully automated with minimal human intervention. The end-to-end system takes about 2.5 seconds per image for age estimation, with only 0.8 seconds being spent in age estimation given the aligned face while the remaining time being spent on face detection and alignment. 5.5 Conclusions In this work, we proposed a cascaded classification-regression framework to perform unconstrained facial apparent age estimation. The proposed approach estimates the apparent age in a coarse-to-fine manner. The age group classifier gives the 109 rough age estimate, the regressor per age group gives the fine-grained age estimate, and the age error correcting module fixes incorrect prediction. Our experimental results demonstrate the effectiveness of the proposed approach, especially when only a limited number of training data available in the target domain. Although our age classifiers and regressors are all based on DCNN, our frame- work is generic and can be extended to other non-DCNN models. 
In addition, the same classification-regression framework can be also applied to other vision prob- lems, such as head pose estimation. 110 Chapter 6: S2LD : Semi Supervised Landmark Detection for Low Resolution Images 6.1 Introduction Convolution Neural Networks have revolutionized the computer vision research, to the point that current systems can recognize faces with more than 99.7% [41] accu- racy or achieve detection, segmentation and pose estimation results upto subpixel accuracy. These are only few of the many tasks which have seen a significant perfor- mance improvements in the last five years. However, CNN-based methods assume access to good quality images. ImageNet [122], COCO [97], CASIA [163], 300W [123] or MPII [4] datasets all consist of high resolution images. As a result of domain shift, much lower performance is observed when networks trained on these datasets are applied to images which have suffered degradation due to intrinsic or extrinsic factors. In this work, we address landmark localization in low resolution images. Although, we use face images in our case, the proposed method is also applicable to other tasks, such as human pose estimation. Throughout this chapter we use HR and LR to denote high and low resolutions respectively. Facial landmark localization, also known as keypoint or fiducial detection, 111 Figure 6.1: Inaccurate landmark detections on low resolution images. We show landmark predicted by different systems. (a) MTCNN [169] and (b) [19] are not able to detect any face in the LR image. (c) Current practice of directly upsampling the low-resolution image to a fixed size of 128? 128 by bilinear interpolation. (d) Output from a network trained on downsampled version of HR images. (e) Landmark detection using super-resolved images. Note: For visualization purposes images have been reshaped after respective processing. Actual size of the images is in the range of 20? 20 pixels refers to the task of detecting specific points such as eye corners and nose tip on a face image. The detected keypoints are used to align images to canonical coordi- nates, which are then used as inputs to different convolution networks. It has been experimentally shown in [10], that accurate face alignment leads to improved perfor- mance in face verification. Though great strides have been made in this direction, mainly addressing large-pose face alignment, landmark localization for low resolu- tion images, still remains an understudied problem, mostly because of the absence of large scale labeled dataset(s). To the best of our knowledge, for the first time, landmark localization directly on low resolution images is addressed in this work. Main motivation: In Figure 6.1, we examine possible scenarios which are currently practiced when low resolution images are encountered. Figure 6.1 shows the predicted landmarks when the input image is a LR image of size less than 32?32 pixels. Typically, landmark detection networks are trained with 224 ? 224 crops of HR images taken from AFLW [82] and 300W [123] datasets. During inference, irrespective of resolution, an incoming image is rescaled to 224?224. We deploy two 112 methods: MTCNN [169] and Bulat et al. [19], which have detection and localization built in a single system. In Figure 6.1(a) and (b) we see that these networks failed to detect face in the given image. Figure 6.1(c), shows the outputs when a network trained on high resolution images is applied to a rescaled low resolution one. 
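The rescaling practice probed in Figure 6.1(c) amounts to the short sketch below: the LR crop is bilinearly upsampled to the detector's training resolution and the landmarks are read off the predicted heatmaps. Here `hr_landmark_detector` is a stand-in for any heatmap network trained on HR crops; it is an assumption for illustration, not part of the proposed method.

```python
import torch
import torch.nn.functional as F

def detect_on_rescaled_lr(lr_face, hr_landmark_detector, hr_size=224):
    """lr_face: (1, 3, h, w) tensor with h, w around 20-32 pixels."""
    # Current practice: bilinear upsampling to the size the detector was trained on.
    upsampled = F.interpolate(lr_face, size=(hr_size, hr_size),
                              mode='bilinear', align_corners=False)
    heatmaps = hr_landmark_detector(upsampled)      # (1, N, H, W) keypoint heatmaps
    n, c, h, w = heatmaps.shape
    idx = heatmaps.view(n, c, -1).argmax(dim=-1)    # peak of each heatmap channel
    xs = idx % w
    ys = torch.div(idx, w, rounding_mode='floor')
    return torch.stack([xs, ys], dim=-1).float()    # (1, N, 2) landmark coordinates
```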
It is important to note that the trained network, say HR-LD high resolution landmark detector (detailed in Section 6.5.1) achieves state of the art performance on AFLW and 300W test sets. A possible solution is to train a network on sub-sampled images as a substitute for low resolution images. Figure 6.1(d) shows the output of one such network. It is evident from these experiments that networks trained with HR images or subsampled images are not effective for real life LR images. It can also be concluded that subsampled images are unable to capture the distribution of real LR images. Super-resolution is widely used to resolve LR images to reveal more details. Significant developments have been made in this field and methods based on encoder- decoder architectures and GANs [56] have been proposed. We employ two recent deep learning based methods, SRGAN [91] and ESRGAN [149] to resolve given LR images. It is worth noting that the training data for these networks also include face images. Figure 6.1(e) shows the result when the super-resolved image is passed through HR-LD. It can be hypothesized that possibly, the super-resolved images do not lie in the same space of images using which HR-LD was trained. Super resolution networks are trained using synthetic low resolution images obtained by downsampling the image after applying Gaussian smoothing. In some cases, training data for super-resolution networks consists of paired low and high resolution images. 113 Neither of the mentioned scenarios is applicable in real life situations. Main Idea: Different from these approaches, the proposed method is based on the concept of ?generate to adapt?. This work aims to show that landmark local- ization in LR images can not only be achieved, but it also improves the performance over the current practice. To this end, we first train a deep network which generates LR images from HR images and tries to model the distribution of real LR images in pixel space. Since, there is no publicly available dataset, containing low resolution images along with landmark annotations, we take a semi-supervised approach for landmark detection. We train an adversarial landmark localization network on the generated LR images and hence, switching the roles of generated and real LR im- ages. Heatmaps predicted for unlabelled LR images are also included in the inputs of the discriminators. The adversarial training procedure is designed in a way that in order to fool the discriminators, the heatmap generator has to learn the struc- ture of the face even in low resolution. We perform extensive set of experiments explaining all the design choices. In addition, we also propose new state of the art landmark detector for HR images. 6.2 Related Work Being one of the most important pre-processing steps in face analysis tasks, facial landmark detection has been a topic of immense interest among computer vision researchers. We briefly discuss some of the methods which use Convolution Neural Networks (CNN). Different algorithms have been proposed in the recent past such 114 as direct regression approaches of MTCNN by Zhang et al. [172] and KEPLER by Kumar et al. [85]. The convolution neural networks in MTCNN and KEPLER act as non-linear regressors and learn to directly predict the landmarks. Both works are designed to predict other attributes along with keypoints such as 2D pose, visibility of keypoints, gender and many others. Hyperface by Ranjan et al. 
[115] has shown that learning tasks in one single network does in fact, improves the performance of individual tasks. Recently, architectures based on Encoder-Decoder architecture have become popular and have been used intensively in tasks which require per- pixel labeling such as semantic segmentation [110, 119] and keypoint detection [1, 87, 88, 168]. Despite making significant progress in this field, predicting landmarks on low resolution faces still remains a relatively unexplored topic. All of the works mentioned above are trained on high quality images and their performance degrades on LR images. One of the closely related works, is Super-FAN [20] by Bulat et al., which makes an attempt to predict landmarks on LR images by super-resolution. However, as shown in experiments in Section 6.4.3, face recognition performance degrades even on super-resolved images. This necessitates that super-resolution, face-alignment and face recognition be learned in a single model, trained end to end, making it not only slow in inference but also limited by the GPU memory constraints. The proposed work is different from [20] in many respects as it needs labeled data only in HR and learns to predict landmarks in LR images in an unsupervised way. Due to adversarial training, the network not only acts as a facial parts detector but also learns the inherent structure of the facial parts. The proposed method makes 115 ? ?4 H' Heatmap Reconstructed Discriminator Heatmap IHR ILRG H High to Heatmap Heatmap Low Generated Generator HR Generator LR Image Image ILRR Real (1) Real (1) Fake (0) Fake (0) Real LR Image Fake (0) Resolution Heatmap Confidence Discriminator Discriminator Figure 6.2: Overview of the proposed approach. High resolution input is passed through High-to-Low generator G1 (shown in cyan colored block). The discrimina- tor D1 learns to distinguish generated LR images vs. real LR images in an unpaired fashion. This generated image is fed to heatmap generator G2. Heatmap discriminator D2 distinguishes generated heatmap vs. groundtruth heatmaps. The pair G2, D2 is inspired from BEGAN [13]. In addition to generated and groundtruth heatmaps, the discriminator D3 also receives pre- dicted heatmaps for real LR images. This enables the generator G2 to generate realistic heatmaps for un-annotated LR images. the pre-processing task faster and independent of face verification training. During inference, only the heatmap generator network is used which is based on the fully convolutional architecture of U-Net [119] and works at the spatial resolution of 32? 32 making the alignment process real time. 6.3 Proposed Method S2LD predicts landmarks directly on a LR image of spatial size less than 32 ? 32 pixels. We show that predicting landmarks directly in LR is more effective than the current practices of rescaling or super-resolution. The entire pipeline can be divided into two stages: (a) Generation of LR images in an unpaired manner (b) Generating heatmaps for target LR images in a semi-supervised fashion. An overview of the proposed approach is shown in Figure 6.2. Being a semi-supervised method, it is 116 D3 D2 G2 D1 G1 important to first describe the datasets chosen for the experiments. High Resolution Dataset: We construct the HR dataset by combining the 20, 000 training images from AFLW and the entire 300W dataset. We divide the Widerface dataset [161] into two groups based on their spatial size. 
The first group consists of images with spatial size between 20×20 and 40×40 pixels, whereas the second group consists of images with more than 100×100 pixels. We combine the second group with the HR training set, resulting in a total of 35,543 HR faces. The remaining 4,386 images from AFLW are used as validation images for the ablative study and as the test set for the landmark localization task.

Low Resolution Datasets:

• The first group from the Widerface dataset, consisting of 47,046 faces, is used as real LR images for the ablative study.

• For face verification experiments, we use the recently published TinyFace dataset [33] as the target LR dataset.

• Due to the absence of an annotated LR dataset, we create a real LR landmark detection dataset, which we call Annotated LR Faces (ALRF), by manually annotating 700 LR images from the TinyFace dataset. The details of the ALRF creation are discussed in the supplementary materials.

6.3.1 High to Low Generator and Discriminator

The high-to-low generator G1, shown in Figure 6.3(a), is designed following the encoder-decoder architecture, where both the encoder and the decoder consist of multiple residual blocks. The input to the first convolution layer is the HR image concatenated with a noise vector, which has been projected using a fully connected layer and reshaped to match the input size. Similar architectures have also been used in [?, 91]. The encoder in the generator consists of eight residual blocks, each followed by a convolution layer to increase the dimensionality. Max-pooling is used after every two residual blocks to decrease the spatial resolution to 4×4 for an HR image of 128×128 pixels. The decoder is composed of six residual units followed by up-sampling and convolution layers. Finally, one convolution layer is added in order to output a three-channel image. BatchNorm is used after every convolution layer. The discriminator D1, shown in Figure 6.3(b), is constructed in a similar way, except that, due to the low spatial resolution of the input image, max-pooling is only used in the last three layers. In Figure 6.2, we use I^{HR} for HR input images of size 128×128, I^{LR}_G for generated LR images of size 32×32, and I^{LR}_R for target LR images of the same size. Spectral Normalization [107] is also used in the convolutional layers of D1 to satisfy the Lipschitz constraint σ(W) = 1, as presented in Equation 6.1:

    W_{SN}(W) = \frac{W}{\sigma(W)}    (6.1)

We train G1 using a weighted combination of a GAN loss, an L2 pixel loss to encourage convergence in the initial training iterations, and a perceptual loss back-propagated from a pre-trained VGG network. The final loss is summarized in Equation 6.2:

    l_{G_1} = \alpha l^{G}_{GAN} + \beta l_{pixel} + \gamma l_{perceptual}    (6.2)

Figure 6.3: (a) High-to-low generator G1. Each block represents two residual blocks followed by a convolution layer. (b) Discriminator used in D1 and D2. Each block represents one residual block followed by a convolution layer.

where α, β and γ are hyperparameters which are set empirically. Following recent developments in GANs, we experimented with different loss functions; however, we settled on the hinge loss. In Equation 6.2, l^{G}_{GAN} is computed as:

    l^{G}_{GAN} = \mathbb{E}_{\hat{x} \sim P_g}[\min(0, -1 + D_1(\hat{x}))]    (6.3)

where P_g is the distribution of generated images I^{LR}_G. The L2 pixel loss, l_{pixel}, is derived from the following expression:

    l_{pixel} = \frac{1}{WH} \sum_{i=1}^{W} \sum_{j=1}^{H} \left( F(I^{HR}) - I^{LR}_G \right)^2    (6.4)

where W and H represent the generated image width and height respectively; the operation F is implemented as a sub-sampling operation obtained by passing I^{HR} through four average pooling layers. This loss is used to minimize the distance between the generated and sub-sampled images, which ensures that the content is not lost during the generation process. To train the discriminator D1, we use the hinge loss with a gradient penalty and Spectral Normalization for faster training. The discriminator D1 loss can be defined as:

    l_{D_1} = l^{D}_{GAN} + GP    (6.5)

where

    l^{D}_{GAN} = \mathbb{E}_{x \sim P_r}[\min(0, -1 + D_1(x))] + \mathbb{E}_{\hat{x} \sim P_g}[\min(0, -1 - D_1(\hat{x}))]    (6.6)

and P_r is the distribution of real LR images I^{LR}_R from the Widerface dataset. GP in Equation 6.5 represents the gradient penalty term. Figure 6.4 shows some sample LR images generated by the network G1.

Figure 6.4: Sample outputs of high-to-low generation on the AFLW dataset. For more results please refer to the supplementary material.

6.3.2 Semi-Supervised Landmark Localization

6.3.2.1 Heatmap Generator G2

The key-point heatmap generator G2, shown in Figure 6.5, produces heatmaps corresponding to N (in our case 19 or 68) key-points in a given image. As mentioned earlier, the objective of this work is to show that landmark prediction directly on LR images is feasible even in the absence of labeled LR data. To this end, we choose a simple network based on the U-Net architecture as the heatmap generator. The network consists of 16 residual blocks, where both the encoder and the decoder have eight residual blocks. In the last layer, G2 outputs (N+1) feature maps corresponding to the N key-points and one background channel. After experimentation, this design for landmark detection has proven to be very effective and yields state-of-the-art results for HR landmark prediction. Further architectural details are presented in the supplementary materials.

Figure 6.5: Architecture of the heatmap generator G2, which is based on U-Net. Each block represents two residual blocks; dashed arrows represent skip connections between the encoder and decoder.

6.3.2.2 Heatmap Discriminator D2

The heatmap discriminator D2 follows the same architecture as the heatmap generator G2, with a different number of input channels, i.e., the input to the discriminator is a set of heatmaps concatenated with their respective color images. D2 receives two sets of inputs: generated LR images with down-sampled groundtruth heatmaps, and generated LR images with predicted heatmaps. This discriminator predicts another set of heatmaps and learns whether the key-points described by the input heatmaps are correct and correspond to the input face image. The quality of the output heatmaps is determined by their similarity to the input heatmaps, following the notion of an autoencoder. The loss is computed as the error between the input heatmaps and the reconstructed heatmaps.

6.3.2.3 Heatmap Confidence Discriminator D3

The architecture of D3 is identical to D1 except for the number of input channels. This discriminator receives three inputs: a generated LR image with its corresponding groundtruth heatmaps, a generated LR image with predicted heatmaps, and a target LR image with predicted heatmaps. D3 learns to distinguish between the groundtruth and predicted heatmaps.
To fool this discriminator, G2 should learn to: (a) generate heatmaps for generated LR images that are similar to their respective groundtruth, and (b) generate heatmaps for unlabeled target LR images with statistical properties similar to the groundtruth heatmaps, i.e., G2 should understand the inherent structure of the face in LR images and generate accurate and realistic heatmaps.

6.3.3 Semi-supervised Learning

The learning process of this setup is inspired by the seminal works BEGAN [13] and Energy-based GANs [173]. It is worth recalling that the HR images have annotations associated with them, and we assume that the key-point locations in a generated LR image stay relatively the same as in its down-sampled version. Therefore, while training G2, the down-sampled annotations are considered to be the groundtruth for the generated LR images.

The discriminator D2, when the input consists of groundtruth heatmaps, is trained to recognize it and reconstruct a similar one, so as to minimize the error between the groundtruth and reconstructed heatmaps. On the other hand, if the input consists of generated heatmaps, the discriminator is trained to reconstruct different heatmaps, driving the error as large as possible. The losses are expressed as

    l^{real}_D = \sum_{i=1}^{N+1} \left( H_i - D_2(H_i, I^{LR}_G) \right)^2    (6.7)

    l^{fake}_D = \sum_{i=1}^{N+1} \left( \hat{H}_i - D_2(\hat{H}_i, I^{LR}_G) \right)^2    (6.8)

    l^{kp}_D = l^{real}_D - k_t\, l^{fake}_D    (6.9)

where H_i and \hat{H}_i represent the i-th key-point groundtruth and generated heatmap of the generated LR image I^{LR}_G. Inspired by BEGAN, we use a variable k_t to control the balance between the heatmap generator and discriminator. The variable is updated every t iterations. The adaptive term k_t is defined by

    k_{t+1} = k_t + \lambda_k \left( \gamma\, l^{real}_D - l^{fake}_D \right)    (6.10)

where k_t is bounded between 0 and 1, and λ_k is a hyperparameter. As in Equation 6.9, k_t controls the emphasis on l^{fake}_D. When the generator is able to fool the discriminator, l^{fake}_D becomes smaller than γ l^{real}_D; as a result, k_t increases, making the term l^{fake}_D dominant. The amount of acceleration to train on l^{fake}_D is adjusted proportionally to γ l^{real}_D - l^{fake}_D, i.e., the distance by which the discriminator falls behind the generator. Similarly, when the discriminator gets better than the generator, k_t decreases to slow down the training on l^{fake}_D, making the generator and the discriminator train together.

The discriminator D3 is trained using the loss function from Least Squares GAN [104], as shown in Equation 6.11. This loss function was chosen to be consistent with the losses computed by D2.

    l^{conf}_D = \mathbb{E}_{x \sim P_r}[(D_3(x) - 1)^2] + \mathbb{E}_{\hat{x} \sim P_g}[D_3(\hat{x})^2] + \mathbb{E}_{\hat{y} \sim P_g}[D_3(\hat{y})^2]    (6.11)

It is noteworthy that, in this case, P_r represents the distribution of groundtruth heatmaps on generated LR images, while P_g represents the distribution of generated heatmaps on generated LR images and real LR images.

The generator G2 is trained using a weighted combination of the losses from the discriminators D2 and D3 and the l_{MSE} heatmap loss. The loss functions for the generator G2 are described in the following equations:

    l^{MSE}_G = \sum_{i=1}^{N+1} \left( H_i - G_2(I^{LR}_G) \right)^2    (6.12)

    l^{kp}_G = \sum_{i=1}^{N+1} \left( \hat{H}_i - D_2(\hat{H}_i, I^{LR}_G) \right)^2    (6.13)

    l^{conf}_G = \mathbb{E}_{x \sim P_g}[(D_3(x) - 1)^2]    (6.14)

    l_G = a\, l^{MSE}_G + b\, l^{kp}_G + c\, l^{conf}_G    (6.15)

where a, b and c are hyperparameters set empirically, obeying a l^{MSE}_G > b l^{kp}_G > c l^{conf}_G. We put more emphasis on l^{MSE}_G to encourage convergence of the model in the initial iterations. Some target LR images with key-points predicted by G2 are shown in Figure 6.6.

Figure 6.6: Sample key-point detections on TinyFace images.
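A minimal sketch of the balancing mechanism described above is given next: one D2 step computing Equations 6.7–6.9, the k_t update of Equation 6.10, and the weighted generator loss of Equation 6.15. The tensors, the λ_k and γ values, and the a, b, c weights are illustrative assumptions; only the update rules mirror the equations.

```python
import torch

def heatmap_recon_error(d2, heatmaps, image):
    """Autoencoder-style error of D2: sum over the N+1 maps of ||H - D2(H, I)||^2."""
    return ((heatmaps - d2(heatmaps, image)) ** 2).sum()

def d2_step(d2, gt_heatmaps, pred_heatmaps, lr_image, k_t, lambda_k=1e-3, gamma=0.5):
    l_real = heatmap_recon_error(d2, gt_heatmaps, lr_image)             # Eq. 6.7
    l_fake = heatmap_recon_error(d2, pred_heatmaps.detach(), lr_image)  # Eq. 6.8
    l_kp_d = l_real - k_t * l_fake                                      # Eq. 6.9
    # Eq. 6.10: k_t grows when the generator gets ahead and shrinks otherwise,
    # clipped to [0, 1] as stated in the text.
    k_next = min(max(k_t + lambda_k * (gamma * l_real - l_fake).item(), 0.0), 1.0)
    return l_kp_d, k_next

def g2_loss(l_mse, l_kp, l_conf, a=10.0, b=1.0, c=0.1):
    """Eq. 6.15; the weights are chosen so that the MSE term dominates early on."""
    return a * l_mse + b * l_kp + c * l_conf
```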
6.4 Experiments and Results 6.4.1 Ablation Experiments We experimentally demonstrated in Section 6.1 (Figure 6.1) that networks trained on HR images perform poorly on LR images. Therefore, we propose the semi- supervised learning as mentioned in Section 6.3. With the above mentioned networks and loss functions it is important to understand the implication of each component. This section examines each of the design choices quantitatively. To this end, we first train the high to low resolution networks, and generate LR images of 4, 386 AFLW test images. In the absence of real LR images with annotated landmarks, this is done to create a substitute for low resolution dataset with annotations on which localization performance can be evaluated. We also generate subsampled version of the 20, 000 AFLW trainset and 4, 386 AFLW testset using average pooling after applying Gaussian smoothing. Data augmentation techniques such as random scaling (0.9, 1.1), random rotation (?30?, 30?) and random translation upto 20 pixels are used. Evaluation Metric: Following most previous works, we obtain error for each 125 test sample by averaging normalized errors for all annotated landmarks. For AFLW, the obtained error is normalized by the ground truth bounding box size over all visible points whereas for 300W, the error is normalized by the inter-pupil distance. Wherever applicable NRMSE stands for Normalized Root Mean Square Error. Training Details: All the networks are trained in Pytorch using the Adam optimizer with an initial learning rate of 2E?4 and ?1, ?2 values of 0.5, 0.9. We train the networks with a batch size of 32 for 200 epochs, while dropping the learning rates by 0.5 after 80 and 160 epochs. Setting S1: Train networks on subsampled images? We only train network G2 with the subsampled AFLW training images using the loss function in Equation 6.12, and evaluate the performance on generated LR AFLW test images. Setting S2: Train networks on generated LR images? In this experiment, we train the network G2 using generated LR images, in a supervised way using the loss function from Equation 6.12. We again evaluate the performance on generated LR AFLW test images. Observation: From the results summarized in Table 6.1b it is evident that there is a significant reduction in localization error when G2 is trained on gener- ated LR images validating our hypothesis that subsampled images on which many super-resolution networks are trained may not be a correct representative of real LR images. Hence, we need to train the networks on real LR images. Setting S3: Does adversarial training help? This question is asked in order to understand the importance of training the heatmap generator G2 in an adversarial way. In this experiment, we train G2 and D2 using the losses in Eqs 6.7, 6.8, 6.12, 126 Method NRMSE (all) NRMSE (479 images) Time MTCNN [169] - 0.9736 0.388 s HRNet [133] 0.4055 0.3107 0.076 s SAN [42] 0.3901 0.3141 0.0178 s Proposed 0.257 0.1803 0.0105 s (a) Setting NRMSE?std auc@0.07 auc@0.08 S1 11.33? 9.81 11.897 21.894 S2 4.23? 4.52 50.843 55.751 S3 4.120? 4.43 51.889 56.791 S4 4.123? 4.394 51.775 56.697 (b) Table 6.1: (a) Landmark Detection Error on Real Low Resolution dataset. (b) Table for ablation experiments under different settings on synthesized LR images. 6.13. Metrics are calculated on the generated LR AFLW test images and compared against the experimental setting mentioned in S2 above. Setting S4: Does G2 trained in adversarial manner scale to real LR images? 
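For reference, the normalized error used throughout this section reduces to the following sketch: the mean point-to-point error over visible landmarks, divided by the ground-truth bounding-box size for AFLW or the inter-pupil distance for 300W. Taking the box size as the geometric mean of its dimensions and the pupils as the mean of each eye's landmarks are our assumptions for illustration.

```python
import numpy as np

def normalized_error(pred, gt, visible, norm):
    """pred, gt: (N, 2) landmark arrays; visible: (N,) boolean mask; norm: scalar."""
    d = np.linalg.norm(pred[visible] - gt[visible], axis=1)
    return float(d.mean() / norm)

def nrmse_aflw(pred, gt, visible, box_w, box_h):
    # Face size taken here as sqrt(width * height) of the ground-truth box (assumption).
    return normalized_error(pred, gt, visible, np.sqrt(box_w * box_h))

def nrmse_300w(pred, gt, left_eye_idx, right_eye_idx):
    # Inter-pupil distance approximated by the distance between mean eye landmarks.
    inter_pupil = np.linalg.norm(gt[left_eye_idx].mean(0) - gt[right_eye_idx].mean(0))
    return normalized_error(pred, gt, np.ones(len(pred), dtype=bool), inter_pupil)
```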
In this experiment, we wish to examine if training networks G2, D2 and D3 jointly, improves the performance on real LR images from Widerface dataset.(see Section 6.3 for datasets) Observation: From Table 6.1b we observe that the network trained with setting S3 performs marginally better compared to setting S4. However, since there are no keypoint annotations available for the Widerface dataset, conclusions cannot be drawn from the drop in performance. Hence, in the following subsection 6.4.3, we leap towards understanding this phenomenon indirectly, by aligning the faces using the models from setting S3 and setting S4 and evaluating face recognition performances. 127 6.4.2 Experiments on Low Resolution images We choose to perform direct comparison on a real LR dataset. Two recent state of the art methods Style Aggregated Networks [42] and HRNet [133]. To create a real LR landmark detection dataset which we call Annotated LR Faces (ALRF), we randomly selected 700 identities from the TinyFace dataset, out of which one LR image (less than 32? 32 pixels and more than 15? 15 pixels) per identity was randomly selected, resulting in a total of 700 LR images. Next, three individuals were asked to manually annotated all the images with 5 landmarks(two eye centers, nose tip and mouth corners) in MTCNN [169] style, where invisible points were annotated with ?1. The mean of the points obtained from the three users were taken to be the groundtruth. As per convention, we used Normalised Mean Square Error (NRMSE), averaged over all visible points and normalized by the face size as the comparison metric. Table 6.1a shows the results of this experiment. We also calculate time for forward pass of one image in a single gtx1080. Without loss of generality, the results can be extrapolated to other existing works as [42] and [133] are currently state of the art. MTCNN which has detection and alignment in a single system was able to detect only 479 faces out of 700 test images. 6.4.3 Face Recognition experiments In the previous section, we performed ablative studies on the generated LR AFLW images. Although convenient to quantify the performance, it does not uncover the importance of training three networks jointly in a semi-supervised way. Therefore, 128 Figure 6.7: Snippet of the annotation tool used. in this section, we choose to evaluate the models from setting S3 and setting S4 (Section 6.4.1), by comparing the statistics obtained by applying the two models to align face images for face recognition task. We use recently published and publicly available, Tinyface [33] dataset for our experimental evaluation. It is one of the very few datasets aimed towards under- standing LR face recognition and consists of 5, 139 labeled facial identities with an average of three face images per identity, giving a total of 15, 975 LR face images (average 20 ? 16 pixels). All the LR faces in TinyFace are collected from the web (PIPA [170] and MegaFace2 [108]) across diverse imaging scenarios, captured under uncontrolled viewing conditions in pose, illumination, occlusion and background. 5, 139 known identities is divided into two splits: 2, 570 for training and the remain- ing 2, 569 for test. Evaluation Protocol: In order to compare model performances, we adopt the closed-set face identification (1:N matching) protocol. Specifically, the task is to match a given probe face against a gallery set of enrolled face images with true match from the gallery at top-1 of the ranking list. 
For each test class, half of 129 Setting L1 L2 L3 L4 L5 top-1 31.17 35.11 39.03 39.87 43.82 (a) Setting top-1 top-5 top-10 top-20 mAP Baseline (ArcFace [41]) 34.71 44.82 49.01 53.70 0.32 I1 34.01 41.98 45.36 49.22 0.29 I2 45.04 56.30 60.11 63.71 0.43 I3 51.10 61.05 64.38 67.89 0.47 (b) Table 6.2: Verification performance on Tinyface dataset under different settings (a) LightCNN trained from scratch (b) Using Inception-ResNet pretrained on MsCeleb-1M the face images are randomly assigned to the probe set, and the remaining to the gallery set. For the purpose of this chapter, we drop the distractor set as this does not divulge new information while significantly slowing down the evaluation process. For face recognition evaluation, we report statistics on Top-k (k=1,5,10,20) statistics and mean average precision (mAP). Experiments with network trained from scratch: Since the number of images in TinyFace dataset is much smaller compared to larger datasets such as CASIA [163] or MsCeleb-1M [60], we observed that training a very deep model like Inception-ResNet [136], quickly leads to over-fitting. Therefore, we adopt a CNN with fewer parameters, specifically, LightCNN [154]. Since inputs to the network are images of size 32? 32, we disable first two max-pooling layers. After detecting the landmarks, training and testing images are aligned to the canonical coordinates using affine transformation. We train 29 layer LightCNN models using the training split of TinyFace dataset under the following settings: Setting L1: Train networks on generated LR images? In this setting, we use the model trained under the setting S2 from the previous section 6.4.1. In this 130 setting, network G2 is trained using generated LR images in a supervised way using the loss function from Equation 6.12. Setting L2: Does adversarial training help? We use the model trained from setting S3 (section 6.4.1) to align the faces in training and testing sets. In this setting networks G2 and D2 are trained using a weighted combination of L2 pixel loss and GAN losses from Equations 6.7, 6.8, 6.12, 6.13. Setting L3: Does G2 trained in adversarial manner scale to real LR images? In this setting, networks G2, D2 and D3 are trained jointly in a semi-supervised way. We use Tinyface training images as real low resolution images. Later, Tiny- face training and testing images are aligned using the trained model for training LightCNN model. Setting L4: End-to-end training? Under this setting, we also train the High to Low networks G1 and D1, using the training images from Tinyface dataset as real LR images. We reduce the amount of data-augmentation in this case to resemble tiny face dataset images. With the obtained trained model, landmarks are extracted and images are aligned for LightCNN training. Setting L5: End-to-end training with pre-trained weights? This setting is similar to the setting L4 above, except instead of training a LightCNN model from scratch we initialize the weights from a pre-trained model, trained with CASIA- Webface dataset. Observation: The results in Table 6.2a summarizes the results of the exper- iments done under the settings discussed above. We see that although, we observed a drop in performance in landmark localization when training the three networks 131 jointly (Table 6.1b), there is a significant gap in rank-1 performance between setting L2 and L3. This indicates that with semi-supervised learning G2 generalizes well to real LR data, and hence also validates our hypothesis of training G2, D2 and D3 together. 
Unsurprisingly, insignificant difference is seen between settings L3 and L4. Experiments with pre-trained network: Next, to further understand the implications of joint semi-supervised learning, we design another set of experiments. In these experiments, we use a pre-trained Inception-ResNet model, trained on MsCeleb-1M using ArcFace [41] and Focal Loss [96]. This model expects an input of size 112? 112 pixels, hence the images are resized after alignment in low resolution. Using this pre-trained network, we perform the following experiments: Setting top-1 top-5 top-10 top-20 mAP A1 11.75 14.58 24.57 30.47 0.10 A2 26.21 34.76 39.03 43.99 0.24 Table 6.3: Face recognition performance using super-resolution before face-alignment Baseline: For the baseline experiment, we choose to follow the usual practice of re-scaling the images to a fixed size irrespective of resolution. We trained our own HR landmark detector (HR-LD) on 20, 000 AFLW images for this purpose. Tinyface gallery and probe images are resized to 128?128 and used by the landmark detector as inputs. Using the predicted landmarks, images are aligned to a canonical co- ordinates similar to ArcFace [41]. Baseline performance was obtained by computing cosine similarity between gallery and probe features extracted from the network after feed-forwarding the aligned images. Setting I1: Does adversarial training help? The model trained for S3 (Section 132 6.4.1) is used to align the images directly in low resolution. Features for gallery and probe images are extracted after the rescaling the images and cosine distance is used to measure the similarity and retrieve the images from the gallery. Setting I2: Does G2 trained in adversarial manner scale to real LR images? For this experiment, the model trained for L3 in Section 6.4.3 is used for landmark detection in LR. To recall, in this setting, the three models G2, D2 and D3 (with G1 and D1 frozen) are trained jointly in a semi-supervised way and Tinyface training images are used as real LR data for D3. Setting I3: End-to-end training? In this case, we align the images using the model from setting L4 from Section 6.4.3. In this case, we also trained High to low networks (G1 and D1) using training images from Tinyface dataset as real LR images. After training the model for 200 epochs, the weights are frozen to train G2, D2 and D3 in a semi-supervised way. Observation: With no surprise, we observe that (from Table 6.2b) training the heatmap prediction networks in a semi-supervised manner, and aligning the images directly in low resolution, improves the performance of any face recognition system trained with HR images. 6.5 Evaluation on the IJB-S dataset Along with the method to predict landmarks in low resolution images, this work presents a rather counter-intuitive result that performing landmark detection di- rectly in low resolution leads to higher face recognition performance. To understand 133 UltraFace Semi-Supervised Rank 1 23.65 28.88 Rank 2 26.03 32.42 Rank 3 27.58 33.57 Rank 4 28.14 34.46 Rank 5 28.64 35.05 Rank 7 29.54 36.61 Rank 10 30.42 37.46 Rank 20 32.58 39.95 Rank 30 34.38 42.05 Rank 40 35.79 43.34 Rank 50 36.69 44.61 Table 6.4: Retrieval rates at different ranks(Higher is better) FPIR/Method UltraFace Semi-Supervised 1e2 0.9450 0.8959 1e3 0.9081 0.8767 1e4 0.8808 0.8485 1e5 0.8114 0.7720 Table 6.5: False negative rates at different false positive rates. (Lower is better) this further we performed experiments on recently released IJB-S dataset [?]. 
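The identification numbers reported in Tables 6.2–6.5 are computed from cosine similarities between probe and gallery features. The sketch below shows one way to obtain rank-k accuracy and mAP from such features; the array shapes and variable names are assumptions for illustration, not the evaluation code of the benchmarks.

```python
import numpy as np

def closed_set_stats(probe_feat, probe_id, gallery_feat, gallery_id, ks=(1, 5, 10, 20)):
    """probe_feat: (P, d), gallery_feat: (G, d); probe_id, gallery_id: identity labels."""
    p = probe_feat / np.linalg.norm(probe_feat, axis=1, keepdims=True)
    g = gallery_feat / np.linalg.norm(gallery_feat, axis=1, keepdims=True)
    sim = p @ g.T                                    # cosine similarity matrix
    order = np.argsort(-sim, axis=1)                 # best gallery match first
    hits = gallery_id[order] == probe_id[:, None]    # relevant gallery items per probe
    topk = {k: float(hits[:, :k].any(axis=1).mean()) for k in ks}
    ranks = np.arange(1, hits.shape[1] + 1)
    precision = np.cumsum(hits, axis=1) / ranks      # precision at every rank
    ap = (precision * hits).sum(axis=1) / np.maximum(hits.sum(axis=1), 1)
    return topk, float(ap.mean())                    # rank-k accuracies and mAP
```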
IJB-S dataset is one of the most challenging dataset available, and consists of several videos collected with surveillance cameras. The subjects in this dataset are extremely chal- lenging to verify because of the distance from the camera and low resolution. We randomly selected 10 videos from the dataset which contained at least 5 subjects from the two galleries the dataset provides. We used surveillance to booking proto- col for the purpose of this experiment. Only 10 videos were chosen attributing to the fact that IJB-S is an extremely large dataset and experimenting on the entire dataset takes more than a month on a single GPU machine. Tables 6.4 and 6.5 shows retrieval rates at different ranks and false negative rates vs false positives. We compare with [113]. 134 (a) (b) Figure 6.8: (a) Retrieval rates at different ranks. (b) False negatives at different false positive rates. 6.5.1 Additional Experiments: Setting A1: Does Super-resolution help? The aim of this experiment is to under- stand if super-resolution can be used to enhance the image quality before landmark detection. We use SRGAN [91] to super-resolve the images before using face align- ment method from Bulat et al. [19] to align the images. Setting A2: Does Super-resolution help? In this case, we use ESRGAN [149] to super-resolve the images before using HR-LD (below) to align. Observation: It can be observed from Table 6.3, that face recognition per- 135 formance obtained after aligning super-resolved images is not at par even with the baseline. It can be hypothesized that possibly super-resolved images do not repre- sent HR images using which [19] or HR-LD are trained. High Resolution Landmark Detector (HR-LD) For this experiment, we train G2 on high resolution images of size 128 ? 128 (for AFLW and 300W) using lMSE loss from Equation 6.12. We evaluate the performance of this network on common benchmarks of AFLW-Full test and 300W test sets, shown in Table 6.6. A few sample outputs are shown in Figure 6.9 Method 300W AFLW Common Challenge Full Full RCPR [24] 6.18 17.26 8.35 - SDM [158] 5.57 15.40 7.52 5.43 CFAN [168] 5.50 16.78 7.69 - LBF [116] 4.95 11.98 6.32 4.25 CFSS [174] 4.73 9.98 5.76 3.92 TCDCN [172] 4.80 8.60 5.54 - MDM [142] 4.83 10.14 5.88 - PCD-CNN [88] 3.67 7.62 4.44 2.36 SAN [42] 3.41 7.55 4.24 1.91 LAB [153] 3.42 6.98 4.12 1.85 HR-LD 3.60 7.301 4.325 1.753 Table 6.6: Comparison of the proposed method with other state of the art methods on AFLW (Full) and 300-W testsets. The NMEs for comparison on 300W dataset are taken from the Table 3 of [103]. In this case G2 is trained in supervised manner using high resolution images of size 128? 128. 136 Figure 6.9: Sample outputs obtained by training G2 with HR images. First row shows samples from AFLW test set. Second row shows sample images from 300W test set. Last two columns of second row shows outputs from challenging subset of 300W 6.6 Conclusion In this chapter, we first present an analysis of landmark detection methods when applied to LR images, and the implications on face recognition. We also discuss the proposed method for predicting landmarks directly on LR images. We show that the proposed method improves face recognition performance over commonly used practices of rescaling and super-resolution. As a by-product, we also developed a simple but state of the art landmark detection network. 
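HR-LD, like the l_MSE term of Equation 6.12, is supervised with per-landmark heatmaps plus a background channel. A common construction, assumed here, is one Gaussian blob per landmark; the σ value below is an illustrative choice rather than the exact setting used in our experiments.

```python
import numpy as np

def make_target_heatmaps(landmarks, size=128, sigma=2.0):
    """landmarks: (N, 2) array of (x, y) in pixel coordinates.
    Returns (N + 1, size, size) maps: one Gaussian per landmark plus background."""
    xs, ys = np.meshgrid(np.arange(size), np.arange(size))
    maps = np.zeros((len(landmarks) + 1, size, size), dtype=np.float32)
    for i, (x, y) in enumerate(landmarks):
        maps[i] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
    maps[-1] = 1.0 - maps[:-1].max(axis=0)   # background channel
    return maps
```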
Although, low resolution is chosen as the source of degradation, however, the method can trivially be extended to capture other degradations in the imaging process, such as motion blur or climatic turbulence. In addition, the proposed method can be applied to detect human keypoints in LR in order to improve skeletal action recognition. In the era of deep learning, LR landmark detection and face recognition is a fairly untouched topic, however, we believe this work will open new avenues in this direction. 137 Chapter 7: Conclusion This dissertation has addressed one of the major face-centric computer vision prob- lems: non-rigid alignment of deformable faces. We discussed four different methods for facial keypoint localization. With extensive experiments we demonstrated the state-of-the art performance of each of the method. In Chapter 1 we discussed the motivation behind the problem of face alignment and the associated challenges. Next we presented a cascade linear regressor based method which takes localized deep features from a face verification network in order to localize landmark points. It was shown by experiments that face verification networks capture localized information to verify faces and can also be used for landmark localization. The proposed method is one of the first methods to use deep features for keypoint localization. We detailed another cascade regression based method KEPLER, based on multi-task learning framework in Chapter 2. The approach of cascade regression makes the method somewhat slower but yields precise locations of keypoints. Along with the keypoints KEPLER is also able to predict 3D head pose from a single image. We also developed a new Channeled Inception Network which was trained in a multi-task fashion to achieve precision over keypoint locations. To tackle the 138 effect of vanishing gradients in a very deep network we also used a novel loss function. In Chapter 3, we discussed Pose Conditioned Dendritic CNN, where the pre- diction of keypoints was conditioned on the 3D head pose. We showed that the knowledge of 3D headpose assist in obtaining accurate keypoints. We also modelled the geometric relationships among different facial parts in a dendritic network. An auxiliary network was used to predict other attributes, such as occlusion and visi- bility. The proposed method is able to predict different attributes of a face image including keypoints in a single pass. This tackles the slower run time of the two methods by learning the locations of keypoints in a single convolution method mak- ing it faster. To tackle the imbalance between positive and negative samples we also discussed a novel Mask Softmax Loss Function. In Chapter 5, we discussed an application of face alignment for the task of apparent age estimation. Face images are aligned with LDDR before being passed through the CNN for age estimation. We analyzed the properties of the convolution networks and develop efficient error correction strategy for better age estimates. The above methods assumed access to high quality images while training and testing. However, a huge amount of data collected are from closed circuit cameras which capture images in much lower resolution. In the semi-supervised method presented in Chapter 6 we showed how we can transfer the knowledge learnt from high resolution images to predict keypoints in naturally degraded images. We also showed the impact keypoint localization has on the task of face verification. 
With experiments we demonstrated that aligning keypoints in lower resolution achieves better face verification performance than the current practice of upsampling and 139 aligning. 7.0.1 Future Work ? Alignment in videos: The proposed methods are suitable for obtaining pre- cise keypoint locations from still images. However, we observe a temporal relationships between keypoints in a video. One future direction is in exploit- ing the temporal information and utilizing it for simultaneous tracking and keypoint localization. ? Alignment of climatically degraded images: In the age of technical ad- vancement, people are always taking images, in adverse climatic and illumina- tion conditions, such as in rain or under the sun. Images are also taken while in motion, such as running or in a bus. These degrade the quality of images and the current systems of keypoint localization perform poorly on these im- ages. In future, we plan to extend this research, which will enable accurate keypoint localization even under extreme degradation. 140 Bibliography [1] A recurrent autoencoder-decoder for sequential face alignment. http:// arxiv.org/abs/1608.05477. Accessed: 2016-08-16. [2] Pose-free Facial Landmark Fitting via Optimized Part Mixtures and Cascaded Deformable Shape Model, 2013. [3] T. Ahonen, A. Hadid, and M. Pietikainen. Face description with local binary patterns: Application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12):2037?2041, Dec 2006. [4] Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014. [5] E. Antonakos, J. Alabort i medina, and S. Zafeiriou. Active pictorial struc- tures. In CVPR, pages 5435?5444, Boston, MA, USA, June 2015. [6] E. Antonakos, P. Snape, G. Trigeorgis, and S. Zafeiriou. Adaptive cascaded regression. In ICIP?16, Phoenix, AZ, USA, September 2016. 141 [7] Epameinondas Antonakos, Joan Alabort-i Medina, and Stefanos Zafeiriou. Active pictorial structures. June 2015. [8] A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic. Incremental face alignment in the wild. In CVPR 2014, 2014. [9] Akshay Asthana, Stefanos Zafeiriou, Shiyang Cheng, and Maja Pantic. Ro- bust discriminative response map fitting with constrained local models. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, CVPR ?13, pages 3444?3451, Washington, DC, USA, 2013. IEEE Computer Society. [10] Ankan Bansal, Carlos Castillo, Rajeev Ranjan, and Rama Chellappa. The do?s and don?ts for cnn-based face verification. arXiv preprint arXiv:1705.07426, 2017. [11] V. Belagiannis and A. Zisserman. Recurrent human pose estimation. In 2017 12th IEEE International Conference on Automatic Face Gesture Recognition (FG 2017), pages 468?475, May 2017. [12] P. N. Belhumeur, D. W. Jacobs, D. J. Kriegman, and N. Kumar. Localizing parts of faces using a consensus of exemplars. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, CVPR ?11, pages 545?552, Washington, DC, USA, 2011. IEEE Computer Society. [13] David Berthelot, Tom Schumm, and Luke Metz. BEGAN: boundary equilib- rium generative adversarial networks. CoRR, abs/1703.10717, 2017. 142 [14] Chandrasekhar Bhagavatula, Chenchen Zhu, Khoa Luu, and Marios Savvides. Faster than real-time facial alignment: A 3d spatial transformer network ap- proach in unconstrained poses. 
In The IEEE International Conference on Computer Vision (ICCV), Oct 2017. [15] Chandraskehar Bhagavatula, Chenchen Zhu, Khoa Luu, and Marios Savvides. Faster than real-time facial alignment: A 3d spatial transformer network ap- proach in unconstrained poses. CoRR, abs/1707.05653, 2017. [16] Vishnu Naresh Boddeti, Myung-Cheol Roh, Jongju Shin, Takaharu Oguri, and Takeo Kanade. Face alignment robust to pose, expressions and occlusions. CoRR, abs/1707.05938, 2017. [17] Adrian Bulat and Georgios Tzimiropoulos. Human Pose Estimation via Con- volutional Part Heatmap Regression, pages 717?732. Springer International Publishing, Cham, 2016. [18] Adrian Bulat and Georgios Tzimiropoulos. Binarized convolutional landmark localizers for human pose estimation and face alignment with limited resources. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017. [19] Adrian Bulat and Georgios Tzimiropoulos. How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In International Conference on Computer Vision, 2017. 143 [20] Adrian Bulat and Georgios Tzimiropoulos. Super-fan: Integrated facial land- mark localization and super-resolution of real-world low resolution faces in arbitrary poses with gans. CoRR, abs/1712.02765, 2017. [21] Adrian Bulat and Yorgos Tzimiropoulos. Convolutional aggregation of local evidence for large pose face alignment. In Edwin R. Hancock Richard C. Wil- son and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 86.1?86.12. BMVA Press, September 2016. [22] X. P. Burgos-Artizzu, P. Perona, and P. Dollar. Robust face landmark estima- tion under occlusion. In 2013 IEEE International Conference on Computer Vision, pages 1513?1520, Dec 2013. [23] Xavier P. Burgos-Artizzu, Pietro Perona, and Piotr Dollar. Robust face land- mark estimation under occlusion. Computer Vision, IEEE International Con- ference on, 0:1513?1520, 2013. [24] Xudong Cao, Yichen Wei, Fang Wen, and Jian Sun. Face alignment by explicit shape regression. International Journal of Computer Vision, 107(2):177?190, 2014. [25] Joao Carreira, Pulkit Agrawal, Katerina Fragkiadaki, and Jitendra Malik. Human pose estimation with iterative error feedback. 2015. [26] Jan Cech, Vojte?ch Franc, Michal Uricar, and Jiri Matas. Multi-view facial landmark detection by using a 3d shape model. Image and Vision Computing, 144 47:60 ? 70, 2016. 300-W, the First Automatic Facial Landmark Detection in-the-Wild Challenge. [27] Wei-Lun Chao, Jun-Zuo Liu, and Jian-Jiun Ding. Facial age estimation based on label-sensitive learning and age-oriented regression. Pattern Recognition, 46(3):628 ? 641, 2013. [28] Jun-Cheng Chen, Vishal M. Patel, and Rama Chellappa. Unconstrained face verification using deep CNN features. CoRR, abs/1508.01722, 2015. [29] Jun-Cheng Chen, Rajeev Ranjan, Amit Kumar, Ching-Hui Chen, Vishal M. Patel, and Rama Chellappa. An end-to-end system for unconstrained face ver- ification with deep convolutional neural networks. In The IEEE International Conference on Computer Vision (ICCV) Workshops, December 2015. [30] Ke Chen, Shaogang Gong, Tao Xiang, and C.C. Loy. Cumulative attribute space for age and crowd density estimation. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 2467?2474, June 2013. [31] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. 
CoRR, abs/1412.7062, 2014. [32] Zhiyi Cheng, Xiatian Zhu, and Shaogang Gong. Low-resolution face recogni- tion. CoRR, abs/1811.08965, 2018. [33] Xiao Chu, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang. Structured feature learning for pose estimation. In CVPR, 2016. 145 [34] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. Active shape models—their training and application. Comput. Vis. Image Underst., 61(1):38?59, January 1995. [35] T.F. Cootes, G.J. Edwards, and C.J. Taylor. Active appearance models. Pat- tern Analysis and Machine Intelligence, IEEE Transactions on, 23(6):681?685, Jun 2001. [36] David Cristinacce and Tim Cootes. Feature detection and tracking with con- strained local models. pages 929?938, 2006. [37] David Cristinacce and Tim Cootes. Automatic feature localisation with con- strained local models. Pattern Recogn., 41(10):3054?3067, October 2008. [38] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR?05), volume 1, pages 886?893 vol. 1, June 2005. [39] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009. [40] Jiankang Deng, Jia Guo, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. CoRR, abs/1801.07698, 2018. [41] Xuanyi Dong, Yan Yan, Wanli Ouyang, and Yi Yang. Style aggregated network for facial landmark detection. In CVPR, pages 379?388, 2018. 146 [42] E. Eidinger, R. Enbar, and T. Hassner. Age and gender estimation of unfiltered faces. Information Forensics and Security, IEEE Transactions on, 9(12):2170? 2179, Dec 2014. [43] M.Y. El Dib and M. El-Saban. Human age estimation using enhanced bio- inspired features (ebif). In Image Processing (ICIP), 2010 17th IEEE Inter- national Conference on, pages 1589?1592, Sept 2010. [44] S. Escalera, J. Fabian, P. Pardo, X. Baro, J. Gonzalez, H.J. Escalante, and I. Guyon. Chalearn 2015 apparent age and cultural event recognition: datasets and results. [45] S. Escalera, M. T. Torres, B. Mart??nez, X. Baro?, H. J. Escalante, I. Guyon, G. Tzimiropoulos, C. Corneanu, M. Oliu, M. A. Bagheri, and M. Valstar. Chalearn looking at people and faces of the world: Face analysisworkshop and challenge 2016. In 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 706?713, June 2016. [46] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871?1874, 2008. [47] Clement Farabet, Camille Couprie, Laurent Najman, and Yann LeCun. Learn- ing hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1915?1929, 2013. 147 [48] Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester, and Deva Ra- manan. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell., 32(9):1627?1645, September 2010. [49] Y. Fu, G. Guo, and T. Huang. Age synthesis and estimation via faces: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(11):1955?1976, 2010. [50] A. Gallagher and T. Chen. Understanding images of groups of people. In Proc. CVPR, 2009. [51] X. Geng, C. Yin, and Z. Zhou. Facial age estimation by learning from label dis- tributions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(10):2401?2412, 2013. 
[52] Xin Geng, Zhi-Hua Zhou, and K. Smith-Miles. Automatic age estimation based on facial aging patterns. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 29(12):2234?2240, Dec 2007. [53] G. Ghiasi and C. C. Fowlkes. Occlusion coherence: Localizing occluded faces with a hierarchical deformable part model. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 1899?1906, June 2014. [54] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich fea- ture hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition, 2014. 148 [55] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde- Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative ad- versarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672?2680. Curran Associates, Inc., 2014. [56] Ralph Gross, Iain Matthews, and Simon Baker. Generic vs. person specific ac- tive appearance models. Image Vision Comput., 23(12):1080?1093, November 2005. [57] Ralph Gross, Iain Matthews, and Simon Baker. Active appearance models with occlusion. Image Vision Comput., 24(6):593?604, June 2006. [58] Ralph Gross, Iain Matthews, Jeffrey Cohn, Takeo Kanade, and Simon Baker. Multi-pie. Image Vision Comput., 28(5):807?813, May 2010. [59] Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. Ms- celeb-1m: A dataset and benchmark for large-scale face recognition. CoRR, abs/1607.08221, 2016. [60] Hu Han, C. Otto, and A.K. Jain. Age estimation from face images: Human vs. machine performance. In Biometrics (ICB), 2013 International Conference on, pages 1?8, June 2013. [61] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Sur- passing human-level performance on imagenet classification. arXiv preprint arXiv:1502.01852, 2015. 149 [62] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. CoRR, abs/1502.01852, 2015. [63] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016. [64] Lijun Hong, Di Wen, Chi Fang, and Xiaoqing Ding. A new biologically inspired active appearance model for face age estimation by using local ordinal ranking. In Proceedings of the Fifth International Conference on Internet Multimedia Computing and Service, ICIMCS ?13, pages 327?330, New York, NY, USA, 2013. ACM. [65] G. S. Hsu, K. H. Chang, and S. C. Huang. Regressive tree structured model for facial landmark localization. In ICCV, Dec 2015. [66] Gee-Sern Hsu, Kai-Hsiang Chang, and Shih-Chieh Huang. Regressive tree structured model for facial landmark localization. In The IEEE International Conference on Computer Vision (ICCV), December 2015. [67] Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5mb model size. arXiv:1602.07360, 2016. 150 [68] Aaron S Jackson, Adrian Bulat, Vasileios Argyriou, and Georgios Tzimiropou- los. Large pose 3d face reconstruction from a single image via direct volumetric cnn regression. International Conference on Computer Vision, 2017. [69] Vidit Jain and Erik Learned-Miller. Fddb: A benchmark for face detection in unconstrained settings. 
[70] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM International Conference on Multimedia, pages 675–678, 2014.
[71] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[72] Amin Jourabloo and Xiaoming Liu. Pose-invariant 3D face alignment. In ICCV, Santiago, Chile, December 2015.
[73] Amin Jourabloo and Xiaoming Liu. Large-pose face alignment via CNN-based dense 3D model fitting. In CVPR, Las Vegas, NV, June 2016.
[74] Amin Jourabloo and Xiaoming Liu. Large-pose face alignment via CNN-based dense 3D model fitting. In Proc. IEEE Computer Vision and Pattern Recognition, Las Vegas, NV, June 2016.
[75] Amin Jourabloo, Xiaoming Liu, Mao Ye, and Liu Ren. Pose-invariant face alignment with a single CNN. In Proceedings of the International Conference on Computer Vision, Venice, Italy, October 2017.
[76] Maya Kabkab, Azadeh Alavi, and Rama Chellappa. DCNNs on a diet: Sampling strategies for reducing the training set size. CoRR, abs/1606.04232, 2016.
[77] V. Kazemi and J. Sullivan. One millisecond face alignment with an ensemble of regression trees. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1867–1874, 2014.
[78] Vahid Kazemi and Josephine Sullivan. One millisecond face alignment with an ensemble of regression trees. In CVPR, 2014.
[79] I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard. The MegaFace benchmark: 1 million faces for recognition at scale. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4873–4882, June 2016.
[80] Brendan F. Klare, Ben Klein, Emma Taborsky, Austin Blanton, Jordan Cheney, Kristen Allen, Patrick Grother, Alan Mah, Mark Burge, and Anil K. Jain. Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark A. June 2015.
[81] Martin Koestinger, Paul Wohlhart, Peter M. Roth, and Horst Bischof. Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In First IEEE International Workshop on Benchmarking Facial Image Analysis Technologies, 2011.
[82] S. N. Kohail. Using artificial neural network for human age estimation based on facial images. In International Conference on Innovations in Information Technology, pages 215–219. IEEE, 2012.
[83] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
[84] A. Kumar, A. Alavi, and R. Chellappa. KEPLER: Keypoint and pose estimation of unconstrained faces by learning efficient H-CNN regressors. In 2017 12th IEEE International Conference on Automatic Face Gesture Recognition (FG 2017), pages 258–265, May 2017.
[85] Amit Kumar, Azadeh Alavi, and Rama Chellappa. KEPLER: Keypoint and pose estimation of unconstrained faces by learning efficient H-CNN regressors. CoRR, abs/1702.05085, 2017.
[86] Amit Kumar and Rama Chellappa. A convolution tree with deconvolution branches: Exploiting geometric relationships for single shot keypoint detection. CoRR, abs/1704.01880, 2017.
[87] Amit Kumar and Rama Chellappa. Disentangling 3D pose in a dendritic CNN for unconstrained 2D face alignment. CoRR, abs/1802.06713, 2018.
[88] Amit Kumar, Rajeev Ranjan, Vishal M. Patel, and Rama Chellappa. Face alignment by local deep descriptor regression. CoRR, abs/1601.07950, 2016.
[89] Vuong Le, Jonathan Brandt, Zhe Lin, Lubomir Bourdev, and Thomas S. Huang. Interactive facial feature localization. In Proceedings of the 12th European Conference on Computer Vision - Volume Part III, ECCV'12, pages 679–692, Berlin, Heidelberg, 2012. Springer-Verlag.
[90] Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew P. Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-realistic single image super-resolution using a generative adversarial network. CoRR, abs/1609.04802, 2016.
[91] D. Lee, H. Park, and C. D. Yoo. Face alignment using cascade Gaussian process regression trees. In CVPR, pages 4204–4212, June 2015.
[92] Donghoon Lee, Hyunsin Park, and Chang D. Yoo. Face alignment using cascade Gaussian process regression trees. June 2015.
[93] Gil Levi and Tal Hassner. Age and gender classification using convolutional neural networks. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2015.
[94] Lin Liang, Rong Xiao, Fang Wen, and Jian Sun. Face alignment via component-based discriminative search. In David A. Forsyth, Philip H. S. Torr, and Andrew Zisserman, editors, ECCV (2), volume 5303 of Lecture Notes in Computer Science, pages 72–85. Springer, 2008.
[95] Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. CoRR, abs/1708.02002, 2017.
[96] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, pages 740–755, Cham, 2014. Springer International Publishing.
[97] Feng Liu, Dan Zeng, Qijun Zhao, and Xiaoming Liu. Joint Face Alignment and 3D Face Reconstruction, pages 545–560. Springer International Publishing, Cham, 2016.
[98] Yaojie Liu, Amin Jourabloo, William Ren, and Xiaoming Liu. Dense face alignment. In Proceedings of the International Conference on Computer Vision Workshops, Venice, Italy, October 2017.
[99] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91–110, 2004.
[100] David G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2):91–110, November 2004.
[101] Khoa Luu, K. Ricanek, T. D. Bui, and C. Y. Suen. Age estimation using active appearance models and support vector machine regression. In Biometrics: Theory, Applications, and Systems, 2009. BTAS '09. IEEE 3rd International Conference on, pages 1–5, September 2009.
[102] Jiangjing Lv, Xiaohu Shao, Junliang Xing, Cheng Cheng, and Xi Zhou. A deep regression architecture with two-stage re-initialization for high performance facial landmark detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[103] Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, and Zhen Wang. Multi-class generative adversarial networks with the L2 loss function. CoRR, abs/1611.04076, 2016.
[104] Iain Matthews and Simon Baker. Active appearance models revisited. Int. J. Comput. Vision, 60(2):135–164, November 2004.
[105] K. Messer, J. Matas, J. Kittler, and K. Jonsson. XM2VTSDB: The extended M2VTS database. In Second International Conference on Audio and Video-based Biometric Person Authentication, pages 72–77, 1999.
[106] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. CoRR, abs/1802.05957, 2018.
[107] Aaron Nech and Ira Kemelmacher-Shlizerman. Level playing field for million scale face recognition. CoRR, abs/1705.00393, 2017.
[108] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked Hourglass Networks for Human Pose Estimation, pages 483–499. Springer International Publishing, Cham, 2016.
[109] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. arXiv preprint arXiv:1505.04366, 2015.
[110] A. J. O'Toole, T. Price, T. Vetter, J. C. Bartlett, and V. Blanz. 3D shape and 2D surface textures of human faces: The role of 'averages' in attractiveness and age. Image and Vision Computing, 18(1):9–19, 1999.
[111] S. Ramanathan, B. Narayanan, and R. Chellappa. Computational methods for modeling facial aging: A survey. Journal of Visual Languages & Computing, 20(3):131–144, 2009.
[112] R. Ranjan, A. Bansal, J. Zheng, H. Xu, J. Gleason, B. Lu, A. Nanduri, J. Chen, C. D. Castillo, and R. Chellappa. A fast and accurate system for face detection, identification, and verification. IEEE Transactions on Biometrics, Behavior, and Identity Science, 1(2):82–96, April 2019.
[113] R. Ranjan, S. Sankaranarayanan, C. D. Castillo, and R. Chellappa. An all-in-one convolutional neural network for face analysis. In 2017 12th IEEE International Conference on Automatic Face Gesture Recognition (FG 2017), pages 17–24, May 2017.
[114] Rajeev Ranjan, Vishal M. Patel, and Rama Chellappa. HyperFace: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. CoRR, abs/1603.01249, 2016.
[115] Shaoqing Ren, Xudong Cao, Yichen Wei, and Jian Sun. Face alignment at 3000 FPS via regressing local binary features. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 1685–1692, 2014.
[116] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.
[117] Martin Riedmiller and Heinrich Braun. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In IEEE International Conference on Neural Networks, pages 586–591, 1993.
[118] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
[119] J. Roth, Y. Tong, and X. Liu. Adaptive 3D face reconstruction from unconstrained photo collections. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4197–4206, June 2016.
[120] R. Rothe, R. Timofte, and L. V. Gool. DEX: Deep expectation of apparent age from a single image. In ICCV, ChaLearn Looking at People Workshop, December 2015.
[121] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 2015.
[122] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. 300 faces in-the-wild challenge: The first facial landmark localization challenge. In 2013 IEEE International Conference on Computer Vision Workshops, pages 397–403, December 2013.
[123] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. A semi-automatic methodology for facial landmark annotation. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2013 IEEE Conference on, pages 896–903, June 2013.
[124] Jason Saragih. Principal regression analysis. In The 24th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2011, Colorado Springs, CO, USA, 20-25 June 2011, pages 2881–2888, 2011.
[125] Jason M. Saragih, Simon Lucey, and Jeffrey F. Cohn. Face alignment through subspace constrained mean-shifts. In ICCV, pages 1034–1041. IEEE, 2009.
[126] Jason M. Saragih, Simon Lucey, and Jeffrey F. Cohn. Deformable model fitting by regularized landmark mean-shift. Int. J. Comput. Vision, 91(2):200–215, January 2011.
[127] Patrick Sauer, Tim Cootes, and Chris Taylor. Accurate regression procedures for active appearance models. In BMVC, 2011.
[128] A. Saxena, S. Sharma, and V. K. Chaurasiya. Neural network based human age-group estimation in curvelet domain. Procedia Computer Science, 54:781–789, 2015.
[129] Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, and Yann LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In International Conference on Learning Representations (ICLR 2014). CBLS, April 2014.
[130] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[131] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
[132] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In CVPR, 2019.
[133] Y. Sun, X. Wang, and X. Tang. Deep convolutional network cascade for facial point detection. In CVPR, pages 3476–3483, June 2013.
[134] Yi Sun, Xiaogang Wang, and Xiaoou Tang. Deep convolutional network cascade for facial point detection. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, CVPR '13, pages 3476–3483, Washington, DC, USA, 2013. IEEE Computer Society.
[135] Christian Szegedy, Sergey Ioffe, and Vincent Vanhoucke. Inception-v4, Inception-ResNet and the impact of residual connections on learning. CoRR, abs/1602.07261, 2016.
[136] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
[137] Graham W. Taylor, Rob Fergus, Yann LeCun, and Christoph Bregler. Convolutional learning of spatio-temporal features. In Proceedings of the 11th European Conference on Computer Vision: Part VI, ECCV'10, pages 140–153, Berlin, Heidelberg, 2010. Springer-Verlag.
[138] J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner. Face2Face: Real-time face capture and reenactment of RGB videos. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2016.
[139] P. Thukral, K. Mitra, and R. Chellappa. A hierarchical approach for human age estimation. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1529–1532. IEEE, 2012.
[140] Philip Tresadern, Patrick Sauer, and Tim Cootes. Additive update predictors in active appearance models. In Proceedings of the British Machine Vision Conference, pages 91.1–91.12. BMVA Press, 2010. doi:10.5244/C.24.91.
[141] G. Trigeorgis, P. Snape, M. A. Nicolaou, E. Antonakos, and S. Zafeiriou. Mnemonic descent method: A recurrent process applied for end-to-end face alignment. In CVPR, Las Vegas, NV, USA, June 2016.
[142] P. Turaga, S. Biswas, and R. Chellappa. The role of geometry in age estimation. In IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), pages 946–949. IEEE, 2010.
[143] G. Tzimiropoulos and M. Pantic. Gauss-Newton deformable part models for face alignment in-the-wild. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1851–1858, June 2014.
[144] Georgios Tzimiropoulos. Project-out cascaded regression with an application to face alignment. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[145] Roberto Valle, José Miguel Buenaposada, Antonio Valdés, and Luis Baumela. Head-Pose Estimation In-the-Wild Using a Random Forest, pages 24–33. Springer International Publishing, Cham, 2016.
[146] M. F. Valstar, T. Almaev, J. M. Girard, G. McKeown, M. Mehu, L. Yin, M. Pantic, and J. F. Cohn. FERA 2015 - second facial expression recognition and analysis challenge. In 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), volume 06, pages 1–8, May 2015.
[147] P. Viola and M. J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.
[148] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Chen Change Loy, Yu Qiao, and Xiaoou Tang. ESRGAN: Enhanced super-resolution generative adversarial networks. CoRR, abs/1809.00219, 2018.
[149] Peter Welinder and Pietro Perona. Cascaded pose regression. In IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[150] T. Wu, P. Turaga, and R. Chellappa. Age estimation and face verification across aging using landmarks. IEEE Transactions on Information Forensics and Security, 7(6):1780–1788, 2012.
[151] Wayne Wu, Chen Qian, Shuo Yang, Quan Wang, Yici Cai, and Qiang Zhou. Look at boundary: A boundary-aware face alignment algorithm. In CVPR, 2018.
[152] Xiang Wu, Ran He, and Zhenan Sun. A lightened CNN for deep face representation. CoRR, abs/1511.02683, 2015.
[153] Y. Wu and Q. Ji. Robust facial landmark detection under significant head poses and occlusion. In ICCV, pages 3658–3666, December 2015.
[154] Yue Wu, Chao Gou, and Qiang Ji. Simultaneous facial landmark detection, pose and deformation estimation under facial occlusion. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[155] X. Xiong and F. De la Torre. Global supervised descent method. In CVPR, 2015.
[156] Xuehan Xiong and Fernando De la Torre. Supervised descent method and its application to face alignment. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[157] Junjie Yan, Zhen Lei, Dong Yi, and Stan Z. Li. Learn to combine multiple hypotheses for accurate face alignment. In Proceedings of the 2013 IEEE International Conference on Computer Vision Workshops, ICCVW '13, pages 392–396, Washington, DC, USA, 2013. IEEE Computer Society.
[158] H. Yang, X. He, X. Jia, and I. Patras. Robust face alignment under occlusion via regional predictive power estimation. IEEE Transactions on Image Processing, 24(8):2393–2403, August 2015.
[159] Shuo Yang, Ping Luo, Chen Change Loy, and Xiaoou Tang. WIDER FACE: A face detection benchmark. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[160] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch. arXiv preprint arXiv:1411.7923, 2014.
[161] Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z. Li. Learning face representation from scratch. CoRR, abs/1411.7923, 2014.
[162] Juha Ylioinas, Abdenour Hadid, Xiaopeng Hong, and Matti Pietikäinen. Age estimation using local binary pattern kernel density estimate. In Alfredo Petrosino, editor, Image Analysis and Processing – ICIAP 2013, volume 8156 of Lecture Notes in Computer Science, pages 141–150. Springer Berlin Heidelberg, 2013.
[163] Xiang Yu, Feng Zhou, and Manmohan Chandraker. Deep deformation network for object landmark localization. CoRR, abs/1605.01014, 2016.
[164] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. CoRR, abs/1311.2901, 2013.
[165] Jie Zhang, Meina Kan, Shiguang Shan, and Xilin Chen. Occlusion-free face alignment: Deep regression networks coupled with de-corrupt autoencoders. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[166] Jie Zhang, Shiguang Shan, Meina Kan, and Xilin Chen. Coarse-to-fine auto-encoder networks (CFAN) for real-time face alignment. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, volume 8690 of Lecture Notes in Computer Science, pages 1–16. Springer International Publishing, 2014.
[167] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, October 2016.
[168] Ning Zhang, Manohar Paluri, Yaniv Taigman, Rob Fergus, and Lubomir D. Bourdev. Beyond frontal faces: Improving person recognition using multiple cues. CoRR, abs/1501.05703, 2015.
[169] Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. Facial Landmark Detection by Deep Multi-task Learning, pages 94–108. Springer International Publishing, Cham, 2014.
[170] Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. Facial landmark detection by deep multi-task learning. In ECCV, pages 94–108, 2014.
[171] Junbo Jake Zhao, Michaël Mathieu, and Yann LeCun. Energy-based generative adversarial network. CoRR, abs/1609.03126, 2016.
[172] Shizhan Zhu, Cheng Li, Chen Change Loy, and Xiaoou Tang. Face alignment by coarse-to-fine shape searching. June 2015.
[173] Shizhan Zhu, Cheng Li, Chen Change Loy, and Xiaoou Tang. Towards arbitrary-view face alignment by recommendation trees. CoRR, abs/1511.06627, 2015.
[174] Shizhan Zhu, Cheng Li, Chen-Change Loy, and Xiaoou Tang. Unconstrained face alignment via cascaded compositional learning. In CVPR, June 2016.
[175] Xiangxin Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2879–2886, June 2012.
[176] Xiangyu Zhu, Zhen Lei, Xiaoming Liu, Hailin Shi, and Stan Z. Li. Face alignment across large poses: A 3D solution. CoRR, abs/1511.07212, 2015.