ABSTRACT

Title of dissertation: USING CNNS TO UNDERSTAND LIGHTING WITHOUT REAL LABELED TRAINING DATA
Hao Zhou, Doctor of Philosophy, 2019
Dissertation directed by: Professor David W. Jacobs, Department of Computer Science

The task of computer vision is to make computers understand the physical world through images. Lighting is the medium through which we capture images of the physical world. Without lighting, there is no image, and different lighting leads to different images of the same physical world. In this dissertation, we study how to understand lighting from images.

With the emergence of large datasets and deep learning in recent years, learning based methods play a more and more important role in computer vision, and deep Convolutional Neural Networks (CNNs) now dominate most of the problems in computer vision. Despite their success, deep CNNs are notorious for their data hungry nature compared with traditional learning based methods. While collecting images from the internet is easy and fast, labeling those images is both time consuming and expensive, and sometimes even impossible. In this work, we focus on understanding lighting from faces and natural scenes, for which ground truth lighting labels are impossible to obtain.

As a preliminary topic, we first study the capacity of deep CNNs. Designing deep CNNs with less capacity and good generalization is one way to reduce the amount of labeled data needed to train them, and understanding the capacity of deep CNNs is the first step towards that goal. In this work, we empirically study the capacity of deep CNNs by studying the redundancy of their parameters. More specifically, we aim at optimizing the number of neurons in a network, and thus the number of parameters. To achieve that goal, we incorporate sparse constraints into the objective function and apply a forward-backward splitting method to solve this sparse constrained optimization problem efficiently. The proposed method can significantly reduce the number of parameters, showing that networks with small capacity can work well.

We then study an important problem in computer vision: inverse lighting from a single face image. Lacking massive ground truth lighting labels, we generate a large amount of synthetic data with ground truth lighting to train a deep network. However, due to the large domain gap between real and synthetic data, a network trained using synthetic data alone cannot generalize well to real data. We thus propose to train the deep CNN with real data together with synthetic data. We apply an existing method to estimate the lighting conditions of real face images; however, these lighting labels are noisy. We therefore propose a Label Denoising Adversarial Network (LDAN) that uses the synthetic data to help train a deep CNN to regress lighting from real face images, denoising the labels of the real images. We show that the proposed method generates more consistent lighting for faces taken under the same lighting condition.

Third, we study how to relight a face image using deep CNNs. We formulate this problem as a supervised image-to-image translation problem. Due to the lack of an "in the wild" face dataset that is suitable for this task, we apply a physically-based face relighting method to generate a large scale, high resolution, "in the wild" portrait relighting dataset (DPR). A deep Convolutional Neural Network (CNN) is then trained using this dataset to generate a relighted portrait image by taking a source image and a target lighting as input.
We show that our training procedure can regularize the generated results, removing the artifacts caused by physically-based relighting methods. Fourth, we study how to understand lighting from a natural scene based on an RGB image. We propose a Global-Local Spherical Harmonics (GLoSH) lighting model to improve the lighting representation, and jointly predict reflectance and surface normals. The global SH models the holistic lighting while local SHs account for the spatial variation of lighting. A novel non-negative lighting constraint is proposed to encourage the estimated SHs to be physically meaningful. To seamlessly make use of the GLoSH model, we design a coarse-to-fine network structure. Lacking labels for reflectance and lighting, we apply synthetic data for model pre-training and fine-tune the model with real data in a self-supervised way. We have shown that the proposed method outperforms state-of-the-art methods in understanding lighting, reflectance and shading of a natural scene. USING CNNS TO UNDERSTAND LIGHTING WITHOUT REAL LABELED TRAINING DATA by Hao Zhou Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2019 Advisory Committee: Professor David W. Jacobs, Chair/Advisor Professor Larry S. Davis Professor Rama Chellappa Professor Tom Goldstein Dr. Yaser Yacoob ?c Copyright by Hao Zhou 2019 In memory of my grandfather ii Acknowledgments First and foremost, I would like to express my deepest appreciation to my advisor Prof. David W. Jacobs, an excellent research mentor who helped me generously during my PhD study. David provided me the precious opportunity to do research studies in computer vision under his supervision six years ago. Ever since then, he helped me find my own research interests and helped me on how to become a qualified researcher. David was always very patient and encouraged me when I met difficulties, especially for the first two years when I struggled to find my own research interests. He was accommodating to my immature ideas and helped me to turn those ideas into mature research projects. His enthusiasm for research and curiosity in all kinds of new knowledge has greatly influenced me. It has been a great pleasure to work with and learn from him. I would like to extend my sincere thanks to my dissertation committee mem- bers: Prof. Larry S. Davis, Rama Chellappa, Tom Goldstein and Dr. Yaser Yacoob for their insightful comments and suggestions for this dissertation. I also thank Prof. Larry S. Davis for his advice on Medifor project and his kind help in my job search. I would like to thank Dr. Jose M. Alvarez and Prof. Fatih Porikli for their great help during my internship in NICTA. They introduced me to the filed of deep learning. I am also grateful to Dr. Quoc-Huy Tran and Prof. Manmohan Chandraker, who were my mentors in NEC Labs America. It was a great pleasure to work with them on a research project that went beyond the scope of my own research. I learned a lot from the way they work and their enthusiasm for research. iii I especially thank Prof. Manmohan Chandraker for his kind suggestions during my job search. I owe my thanks to Dr. Sunil Hadap and Dr. Kalyan Sunkavalli for their invaluable advice on the deep portrait relighting project during my internship in Adobe Research. 
I am thankful for the support and insightful discussions from my great lab mates: Joao Soares, Angjoo Kanazawa, Jin Sun, Soumyadip Sengupta, Abhay Ya- dav, Ryen Krusinga, Koutilya PNVR and Daniel Lichy. I would like to thank all my other collaborators who are not mentioned above for their helpful contributions to my research work: Torsten Sattler, Ronen Basri, Xiang Yu, Hui Ding, Hong Wei. Many thanks to all other friends at the University of Maryland: Shanshan Li, Jingjing Zheng, Pan Xu, Junhui Li, Hao Li, Yaming Wang, Xiyang Dai, Zebao Gao, Xing Niu, Jinfeng Rao, Xintong Han, Ruofei Du, Hongyu Xu, Peng Zhou, Xitong Yang, Zuxuan Wu and Jun-Cheng Chen. The time we spent together was unforgettable. Special thanks to Pan Xu for listening to my complaints when I met difficulties and helped me get out of bad mood. I am grateful for my house mate: Zheng Xu, Peng Lei, Chen Zhao, Chunfeng Yang, Xue Li, Han Zhou and Zuolin Tian for their help and support in daily life. Last but not least, I owe my deepest thanks to my family for their love and unconditional support. They always stand behind me and help me get through all those challenging times. iv Table of Contents Dedication ii Acknowledgements iii Table of Contents v List of Tables viii List of Figures ix 1 Introduction 1 1.1 Empirical study of the capacity of deep CNNs. . . . . . . . . . . . . . 3 1.2 Training deep CNNs with synthetic data. . . . . . . . . . . . . . . . . 4 1.2.1 Label Denoising Adversarial Networks (LDAN) for Inverse Lighting of Faces . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.2.2 Deep Single-Image Portrait Relighting . . . . . . . . . . . . . 7 1.2.3 GLoSH: Global-Local Spherical Harmonics for Intrinsic Image Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2 Less is More: Towards Compact CNNs 10 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.3 Sparse Constrained Convolutional Neural Networks . . . . . . . . . . 16 2.3.1 Training a Sparse Constrained CNN . . . . . . . . . . . . . . 16 2.3.2 Forward-Backward Splitting . . . . . . . . . . . . . . . . . . . 18 2.3.3 Sparse Constraints . . . . . . . . . . . . . . . . . . . . . . . . 20 2.3.4 Importance of Rectified Linear Units in Sparse Constrained CNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.4.1 LeNet on MNIST . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.4.2 CIFAR-10 quick on CIFAR-10 . . . . . . . . . . . . . . . . . . 33 2.4.3 AlexNet and VGG on ImageNet . . . . . . . . . . . . . . . . . 35 2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 v 3 Label Denoising Adversarial Networks (LDAN) for Inverse Lighting of Faces 38 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.3 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.3.1 Spherical Harmonics . . . . . . . . . . . . . . . . . . . . . . . 44 3.3.2 Label Denoising Adversarial Network . . . . . . . . . . . . . . 46 3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.4.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.4.2 Implementation Details . . . . 
. . . . . . . . . . . . . . . . . . 50 3.4.3 Evaluation Metric . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.4.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 53 3.4.5 Object 2D Keypoints Detection . . . . . . . . . . . . . . . . . 59 3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4 Deep Single-Image Portrait Relighting 65 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 4.3 Deep Portrait Relighting Dataset . . . . . . . . . . . . . . . . . . . . 71 4.3.1 Ratio Image to Relight Faces . . . . . . . . . . . . . . . . . . 71 4.3.2 Normal Estimation . . . . . . . . . . . . . . . . . . . . . . . . 72 4.3.2.1 ARAP Based Normal Refinement . . . . . . . . . . . 73 4.3.3 Relighting Images . . . . . . . . . . . . . . . . . . . . . . . . . 75 4.4 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.4.1 Main architecture for portrait relighting . . . . . . . . . . . . 76 4.4.2 Supervision for training the network . . . . . . . . . . . . . . 77 4.4.3 Skip Training . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 4.4.4 Implementation Details . . . . . . . . . . . . . . . . . . . . . . 80 4.4.4.1 Network Structure . . . . . . . . . . . . . . . . . . . 80 4.4.4.2 Training Detail . . . . . . . . . . . . . . . . . . . . . 81 4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 4.5.1 Dataset and Evaluation Metric . . . . . . . . . . . . . . . . . 83 4.5.2 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . 84 4.5.3 Comparison with the Rendering Pipeline . . . . . . . . . . . . 86 4.5.4 Comparison with State-of-the-art Methods . . . . . . . . . . . 87 4.5.5 Results on challenging images . . . . . . . . . . . . . . . . . . 91 4.5.6 Visual Results on High Resolution Images . . . . . . . . . . . 91 4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 5 GLoSH: Global-Local Spherical Harmonics for Intrinsic Image Decomposition 95 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 5.3 Reflectance, Normal and Shading from a Single RGB Image . . . . . 102 5.3.1 GLoSH Lighting Modeling . . . . . . . . . . . . . . . . . . . . 102 5.3.1.1 Global and local Spherical Harmonics . . . . . . . . 103 5.3.1.2 Non-negative Constraints on SH . . . . . . . . . . . 103 vi 5.3.2 Coarse-to-fine Network Structure . . . . . . . . . . . . . . . . 104 5.3.3 Supervision on Training . . . . . . . . . . . . . . . . . . . . . 106 5.3.3.1 Reflectance . . . . . . . . . . . . . . . . . . . . . . . 106 5.3.3.2 Normal . . . . . . . . . . . . . . . . . . . . . . . . . 107 5.3.3.3 Shading . . . . . . . . . . . . . . . . . . . . . . . . . 108 5.4 Network Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 5.5 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . 112 5.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 5.6.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 5.6.2 Spherical Harmonics Lighting Evaluation . . . . . . . . . . . . 115 5.6.3 Intrinsic Image Decomposition . . . . . . . . . . . . . . . . . . 120 5.6.4 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . 122 5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . 128 6 Conclusion 129 A Evaluating Local Features for Day-Night Matching 132 A.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 A.2 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 A.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 A.3.1 Keypoint Detectors . . . . . . . . . . . . . . . . . . . . . . . . 136 A.3.2 Repeatability of Detectors . . . . . . . . . . . . . . . . . . . . 138 A.3.3 Matching Day-Night Image Pairs . . . . . . . . . . . . . . . . 141 A.4 Potential of Improving Detectors . . . . . . . . . . . . . . . . . . . . 147 A.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 Bibliography 151 vii List of Tables 2.1 Neuron reduction in the first fully connected layer, the total param- eter compression, reduced memory, the top-1 validation error rate (red: error rate without sparse constraints). . . . . . . . . . . . . . . 12 2.2 Results of adding group sparse constraints on three layers. . . . . . . 33 2.3 Results of adding group sparse constraints on two layers. The best compression results within 1% decrease in top 1 error rate is shown in bold. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 2.4 Some compression results of proposed method on fc1 for vgg-B. Neu- ron: compression of neurons in the fc1. Parameter: compression of total parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.1 Accuracy of different methods. Standard deviation is shown in the parentheses for learning based methods. . . . . . . . . . . . . . . . . 49 3.2 Results of ablation study. Standard derivation is shown in the bracket. 54 3.3 Accuracy of LDAN for different scales of face images. . . . . . . . . 58 3.4 PCK value at ? = 0.1 for different methods. We notice that LDAN outperforms regression and fine-tune. . . . . . . . . . . . . . . . . . . 59 4.1 Details about each block of our network. . . . . . . . . . . . . . . . . 80 4.2 Ablation Study on MultiPie Dataset . . . . . . . . . . . . . . . . . . 85 4.3 Evaluation MultiPie Dataset . . . . . . . . . . . . . . . . . . . . . . . 87 5.1 Details of each block in our network. . . . . . . . . . . . . . . . . . . 110 5.2 Details about each block in our second and third scale network. . . . 111 5.3 SH lighting Evaluation on SUNCG synthetic data. . . . . . . . . . . . 115 5.4 Surface normal evaluation on NYUv2. . . . . . . . . . . . . . . . . . . 117 5.5 Reflectance evaluation on IIW and shading evaluation on SAW. . . . 121 5.6 Ablation study on loss, without synthetic SUNCG data, and the coarse-to-fine scales, evaluated on IIW reflectance, SAW shading and NYUv2 surface normal. . . . . . . . . . . . . . . . . . . . . . . . . . . 126 viii List of Figures 1.1 Illustration of LDAN. . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1 Importance of Rectified Linear Units in Sparse Constrained CNNs. . 26 2.2 The role of momentum in compressing CNNs. . . . . . . . . . . . . . 27 2.3 Comparison with Data-free method on MNIST. . . . . . . . . . . . . 29 2.4 Comparing with l0 constraints. . . . . . . . . . . . . . . . . . . . . . . 30 2.5 Distribution of Norm of neurons. . . . . . . . . . . . . . . . . . . . . 31 2.6 Compare with Data-free method on ImageNet. . . . . . . . . . . . . . 35 3.1 Domain adaptation with and without adversarial loss. . . . . . . . . . 41 3.2 Structure of network for lighting regression. . . . . . . . . . . . . . . 
52 3.3 Two baseline models compare with LDAN. . . . . . . . . . . . . . . . 55 3.4 Visualization results on MultiPie. . . . . . . . . . . . . . . . . . . . . 56 3.5 Visualization results on CelebA. . . . . . . . . . . . . . . . . . . . . . 57 3.6 Illustration of network for keypoints regression. . . . . . . . . . . . . 60 3.7 Structure of network for keypoints regression . . . . . . . . . . . . . . 62 3.8 PCK curve for keypoints regression. . . . . . . . . . . . . . . . . . . . 63 4.1 Relighted images for Obama. . . . . . . . . . . . . . . . . . . . . . . . 65 4.2 ARAP based normal refinement. . . . . . . . . . . . . . . . . . . . . . 73 4.3 Visual comparison of normals estimated by 3DDFA and the proposed method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.4 Examples of relighted images using the proposed rendering pipeline. . 75 4.5 Illustration of network structure. . . . . . . . . . . . . . . . . . . . . 76 4.6 Illustration of the effect of skip layers. . . . . . . . . . . . . . . . . . . 79 4.7 Visual comparison of Hourglass network with and without skip training. 79 4.8 Illustration of repeating lighting feature spatially. . . . . . . . . . . . 81 4.9 Network structure for 1024? 1024 images. . . . . . . . . . . . . . . . 81 4.10 Visual results of ablation study. . . . . . . . . . . . . . . . . . . . . . 85 4.11 Visual comparison of our rendering pipeline and the proposed deep portrait relighting method. . . . . . . . . . . . . . . . . . . . . . . . . 87 4.12 Visual results of the proposed method and state-of-the-art methods on MultiPie. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 ix 4.13 Visual comparison of the proposed method with state-of-the-art meth- ods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 4.14 Some challenging examples. . . . . . . . . . . . . . . . . . . . . . . . 92 4.15 Results on flicker portrait dataset [1]. . . . . . . . . . . . . . . . . . . 93 5.1 Visual comparison of the proposed method with [2] . . . . . . . . . . 95 5.2 Illustration of the effectiveness of the proposed GLoSH model. . . . . 98 5.3 Our coarse-to-fine network structure. . . . . . . . . . . . . . . . . . . 104 5.4 Network structure for the first scale. . . . . . . . . . . . . . . . . . . 109 5.5 Network structure for the second and third scale. . . . . . . . . . . . 110 5.6 Illustration of lighting predicted with and without non-negative con- straint. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 5.7 1. Comparison with [2]. . . . . . . . . . . . . . . . . . . . . . . . . . 117 5.8 2. Comparison with [2] . . . . . . . . . . . . . . . . . . . . . . . . . . 118 5.9 3. Comparison with [2]. . . . . . . . . . . . . . . . . . . . . . . . . . 119 5.10 1. Comparison with state-of-the-art intrinsic image decomposition method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 5.11 2. Comparison with state-of-the-art intrinsic image decomposition method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 5.12 3. Comparison with state-of-the-art intrinsic image decomposition method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 5.13 4. Comparison with state-of-the-art intrinsic image decomposition method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 A.1 Images taken from 00:00 - 23:00 in one image sequence of our dataset. 136 A.2 The number of feature points detected at different time . . . . . . . . 
138 A.3 Average number of ground truth feature points and the repeatability for different detectors. . . . . . . . . . . . . . . . . . . . . . . . . . . 140 A.4 Precision of RootSIFT for day-night image matching for different detectors and their number of correct matched feature points. . . . . 143 A.5 Histogram of scales for correctly matched RootSIFT features and comparison of precision for TILDE4 and MultiscaleTILDE4. . . . . . 145 A.6 Recall of RootSIFT matching of day-night image pairs at different time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 A.7 Precision and recall of matching day-night image pairs using cnn feature.146 A.8 Precision and number of correct matches of dense RootSIFT for day- night image pairs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 A.9 Correct matches of DoG+RootSIFT and dense RootSIFT. . . . . . . 148 A.10 Examples of nighttime and daytime image with detected feature points using DoG and the heat map of their cosine distance of dense RootSIFT.149 x Chapter 1: Introduction In nature, light creates the color. In the picture, color creates the light. ? Hans Hofmann The task of computer vision is to make computers understand the physical world through images. Lighting, as the medium through which we capture images of the physical world, is a key factor to understand the physical world through images. For instance, the same world under different lighting conditions can form quite different images. As a result, the three general topics in computer vision: recognition, registration and reconstruction (known as the 3Rs) are all significantly affected by lighting. In this dissertation, we study how to understand lighting from images. More specifically, we study how to estimate lighting conditions from faces and natural scenes through an RGB image. With the emergence of large datasets and deep learning in recent years, learning based methods play a more and more important role in computer vision, and Deep Convolutional Neural Networks (CNNs) [3] now dominate almost all problems of the 3Rs. Deep CNNs first achieved the best perfor- mance in many recognition problems [4?7]. Then, researchers began to apply deep CNNs to registration [8, 9] and reconstruction [10?12] and achieved state-of-the-art 1 performance. Inspired by the success of Deep CNNs in the 3Rs, we propose to apply it to understand lighting from images. However, despite their success in the 3Rs of com- puter vision, deep CNNs are notorious for their data hungry nature compared with traditional learning based methods. Taking image classification as an example; to train a deep CNN that performs reasonably well, researchers need to feed hundreds of thousands of labeled images into the deep network. While collecting images from the internet is easy and fast, collecting labels for those images is both time consum- ing and expensive, and sometimes, even impossible (e.g. ground truth lighting of faces and natural scenes) [13]. There are several ways to save the human labor of labeling images: designing user friendly interfaces for labeling data; applying active learning to label the most informative data; using domain adaption to make use of a large dataset with labels to help the target task which has a small amount of data; unsupervised learning in which labels are not needed, to name a few. Inspired by the recent progress of rendering techniques in graphics, we propose to use synthetic data to alleviate the large demands of real labels in lighting estimation. 
In this dissertation, we first empirically study the capacity of current deep CNNs to get an idea of how to design deep CNNs with less capacity. We then apply deep CNNs to estimate lighting from faces with the help of synthetic face images. A portrait relighting algorithm is then proposed, in which physically-based rendering is used to generate a large amount of high quality, "in the wild" face images as training data. Finally, we propose to estimate lighting together with reflectance and normals from a natural scene using an RGB image, where synthetic images with ground truth labels are used to pre-train our network.

1.1 Empirical study of the capacity of deep CNNs

One way to reduce the heavy demand for data in training deep CNNs is to design deep CNNs with less capacity but good generalization. Analyzing the capacity of a deep CNN is the first step towards that goal. However, theoretically analyzing the capacity of a network is difficult due to its complex structure. In Chapter 2, we empirically study the capacity of deep CNNs by studying how many parameters of a deep CNN can be removed without affecting its performance.

To attain favorable performance on large-scale datasets, a common practice is to design CNNs with a lot of parameters. However, recent studies have shown that the capacities of CNNs are much larger than necessary [14-17]. Inspired by these works, we propose to reduce the number of parameters. More specifically, we aim at optimizing the number of neurons in a network, and thus the number of parameters. To achieve that goal, we incorporate sparse constraints into the objective function. The forward-backward splitting method [18, 19] is applied to solve this sparse constrained optimization problem efficiently. The main advantage of forward-backward splitting is that it bypasses the sparse constraint evaluations during the standard back-propagation step, making the implementation very practical. We also investigate the importance of rectified linear units (ReLU) [20] in sparse constrained CNNs, showing that using ReLU can lead to more pruned neurons. We study two sparse constraints, tensor low rank [21] and group sparsity [22], and carry out experiments on four well-known models (LeNet [3], CIFAR-10 quick [23], AlexNet [4] and VGG [5]) using three public datasets including ImageNet. Our experiments demonstrate that the proposed method can remove a huge number of parameters during the training stage, showing that a network with small capacity can work well.

1.2 Training deep CNNs with synthetic data

Though getting ground truth labels for real data is difficult, generating synthetic data with ground truth labels is usually easy and cheap. As a result, synthetic data is becoming more and more popular for helping to train deep CNNs. [12, 24-30] are some examples of applying synthetic data to train deep CNNs. [27] and [30] use synthetic data as an intermediary to bridge between real data and their labels. [12, 24, 26] directly train their deep CNNs on synthetic data and apply them to real data without considering the domain gap between synthetic and real data. [29] argued that domain adaptation is necessary to better make use of synthetic data and proposed a simple strategy that consists of training on synthetic and real data simultaneously. [25, 28], on the other hand, applied adversarial losses to bridge the domain gap between synthetic and real data.
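To make this adversarial domain-bridging idea concrete (it is also the core of the LDAN training described in the next subsection), the following is a minimal PyTorch-style sketch of matching the feature distribution of real images to that of synthetic ones while regressing noisy labels. The network sizes, the loss weight, and the train_step helper are illustrative assumptions rather than the actual implementation used in this dissertation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for the sub-networks discussed above. The real networks are
# CNNs operating on face images; here the "images" are flat vectors so the
# sketch stays self-contained and runnable.
feat_dim, light_dim = 64, 27  # hypothetical sizes (e.g., 9 SH coefficients x 3 colors)

feature_net_real = nn.Sequential(nn.Linear(128, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim))
lighting_net = nn.Linear(feat_dim, light_dim)          # assumed pretrained on synthetic data
discriminator = nn.Sequential(nn.Linear(feat_dim, 32), nn.ReLU(), nn.Linear(32, 1))

for p in lighting_net.parameters():                    # the synthetic-trained lighting net stays fixed
    p.requires_grad = False

bce = nn.BCEWithLogitsLoss()
opt_f = torch.optim.Adam(feature_net_real.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

def train_step(real_images, noisy_lighting, synthetic_features, lam=0.5):
    """One optimization step of adversarial feature alignment (a sketch)."""
    real_features = feature_net_real(real_images)

    # Update the discriminator: synthetic features labeled 1, real-image features labeled 0.
    d_loss = bce(discriminator(synthetic_features), torch.ones(synthetic_features.size(0), 1)) + \
             bce(discriminator(real_features.detach()), torch.zeros(real_features.size(0), 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Update the real-image feature net: fool the discriminator and regress the noisy labels.
    adv_loss = bce(discriminator(real_features), torch.ones(real_features.size(0), 1))
    reg_loss = F.mse_loss(lighting_net(real_features), noisy_lighting)
    f_loss = adv_loss + lam * reg_loss
    opt_f.zero_grad()
    f_loss.backward()
    opt_f.step()
    return d_loss.item(), f_loss.item()

# Usage with random stand-in data.
d, f = train_step(torch.randn(8, 128), torch.randn(8, light_dim), torch.randn(8, feat_dim))
```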
1.2.1 Label Denoising Adversarial Networks (LDAN) for Inverse Lighting of Faces

In Chapter 3, we propose to train a deep CNN for inverse lighting from a single face image. Since getting ground truth lighting labels for face images in the wild is difficult, inspired by the above-mentioned methods, we propose to apply synthetic data to help train a deep CNN. We notice that a deep CNN trained only on synthetic data cannot generalize well to real-world face images due to the big domain gap between synthetic and real data. To reduce this gap, we also use real data while training the network. We use an existing method [31] to estimate lighting parameters for real face images, which are treated as ground truth with noise. Synthetic data with noise-free ground truth labels are then used to help alleviate the effect of such noise.

More specifically, we utilize the idea of Generative Adversarial Networks (GAN) [32] and propose a Label Denoising Adversarial Network (LDAN). We design the lighting regression deep CNN to have two sub-networks: a feature net that extracts lighting-related features, and a lighting net that takes these features as input and predicts the lighting. Since synthetic data contains no noise, the lighting net trained with synthetic data is accurate; we thus directly apply it to real data. As a result, we only need to train a feature net for real data. Since the lighting net expects lighting-related features of synthetic data as input, while training the feature net for real data we use an adversarial loss to map the distribution of lighting-related features of real data to that of synthetic data. The training of LDAN thus has two steps: (1) train with synthetic data; (2) fix the feature net for synthetic data and the lighting net, and train another feature net for real data with an adversarial loss and a regression loss. Figure 1.1 illustrates the training procedure of LDAN.

Figure 1.1: Training of an LDAN model has two steps: (1) Train the feature net and lighting net on synthetic data with two losses: faces with similar lighting should have similar lighting-related features (||s1 - s2||^2), and the estimated lighting should be close to the ground truth lighting (||l1 - l*||^2 and ||l2 - l*||^2). (2) Train the feature net for real data while fixing both the feature net and the lighting net trained in step 1. We use two losses in this step: the distributions of synthetic and real features should be close (||Ps - Pr||), and the estimated lighting should be close to the noisy ground truth lighting (||lr - lr*||).

We test our proposed method on the MultiPie dataset. While the lighting of each image is not known, there are groups of images that all have the same lighting. Our experiments show that our network outperforms existing methods in producing consistent lighting parameters for different faces taken under similar lighting conditions.

1.2.2 Deep Single-Image Portrait Relighting

In Chapter 4, we propose to train a deep CNN for single-image portrait relighting. Conventional physically-based methods need to solve an inverse rendering problem, estimating face geometry, albedo and lighting. However, inaccurate estimation of these face components can cause strong artifacts in relighting, leading to unsatisfactory results. In the proposed work, we apply a physically-based portrait relighting method to generate a large scale, high resolution, "in the wild" portrait relighting dataset (DPR).
A deep Convolutional Neural Network (CNN) is then trained using the proposed data for portrait relighting. We show that the training procedure regularizes the generated results, removing the artifacts caused by physically-based relighting methods. We design our network to have an hourglass structure [33] which takes a source image and a target lighting as input and generates a new portrait image under the target lighting condition. We notice that the skip connections in an hourglass network prevent the bottleneck layers from learning good facial information, resulting in low-quality generated face images. A skip training strategy is thus proposed to enforce more facial information in the bottleneck layer, which also improves the quality of the generated images. A GAN loss is further applied to alleviate the artifacts in the relighted portrait images. Our trained network can relight portrait images with resolutions as high as 1024 x 1024. We evaluate the proposed method on the proposed DPR dataset, the Flickr portrait dataset and the MultiPie dataset, both qualitatively and quantitatively. Our experimental results demonstrate that the proposed method achieves state-of-the-art results.

1.2.3 GLoSH: Global-Local Spherical Harmonics for Intrinsic Image Decomposition

In Chapter 5, we study how to estimate lighting from a natural scene with an RGB image. Traditional intrinsic image decomposition focuses on decomposing images into reflectance and shading, leaving surface normals and lighting entangled in shading. One challenge of estimating lighting from a natural scene is the lack of a suitable lighting model. Unlike the lighting of a single object such as a face, which can be modeled using a single set of Spherical Harmonics [34, 35], the lighting of a natural scene is much more complicated due to its spatial variation caused by shadows, inter-reflection and the presence of light sources in the scene. In this work, we propose a Global-Local Spherical Harmonics (GLoSH) lighting model to better model the lighting of a natural scene. The global SH models the holistic lighting while local SHs account for the spatial variation of lighting. A novel non-negative lighting constraint is proposed to encourage the estimated SHs to be physically meaningful. To seamlessly reflect the GLoSH model, we design a coarse-to-fine network structure: the coarse network predicts global SH, reflectance and normals, and the fine network predicts their local residuals. As in our previous work, ground truth labels of reflectance and lighting are not available for real data; we thus apply synthetic data for model pre-training and then fine-tune the model with real data in a self-supervised way. Compared to state-of-the-art methods that only target normals or reflectance and shading, our method recovers all components and achieves consistently better results on three real datasets: IIW, SAW and NYUv2.

1.3 Summary

As the medium through which we capture images of the physical world, lighting is one of the key components for understanding the physical world through images. Understanding lighting from images is thus an important topic in computer vision which can help computers better understand the world. The biggest challenge of this task is the lack of ground truth lighting for real images.
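All three of these research threads represent lighting with low-order Spherical Harmonics (a single SH vector for faces in Chapters 3 and 4, and global plus local SH vectors in Chapter 5). As a concrete reference for what such a lighting vector encodes, the NumPy sketch below evaluates diffuse shading from nine second-order SH coefficients and per-pixel surface normals, using the standard irradiance formula of Ramamoorthi and Hanrahan; the coefficient ordering and normalization here are assumptions and may differ from the conventions used in later chapters.

```python
import numpy as np

def sh_irradiance(normals, L):
    """Diffuse shading from second-order SH lighting.

    normals: (..., 3) array of unit surface normals (x, y, z).
    L: 9 SH lighting coefficients, assumed ordered as
       [L00, L1-1, L10, L11, L2-2, L2-1, L20, L21, L22].
    Returns irradiance with the same leading shape as `normals`.
    """
    # Constants from Ramamoorthi & Hanrahan, "An Efficient Representation
    # for Irradiance Environment Maps" (SIGGRAPH 2001).
    c1, c2, c3, c4, c5 = 0.429043, 0.511664, 0.743125, 0.886227, 0.247708
    x, y, z = normals[..., 0], normals[..., 1], normals[..., 2]
    return (c4 * L[0]
            + 2.0 * c2 * (L[3] * x + L[1] * y + L[2] * z)
            + c1 * L[8] * (x**2 - y**2)
            + c3 * L[6] * z**2 - c5 * L[6]
            + 2.0 * c1 * (L[4] * x * y + L[7] * x * z + L[5] * y * z))

# Example: shading of a frontal-facing pixel under light coming mostly from +z.
n = np.array([[0.0, 0.0, 1.0]])
L = np.array([0.8, 0.0, 0.5, 0.0, 0.0, 0.0, 0.3, 0.0, 0.0])
print(sh_irradiance(n, L))
```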
In this dissertation, we show the power of synthetic data in training CNNs to understand lighting from images by summarizing three research works: lighting estimation from faces, portrait relighting, and intrinsic image decomposition of natural scenes. In the following chapters, we discuss these research works in detail.

Chapter 2: Less is More: Towards Compact CNNs

In this chapter, we study the capacity of current deep CNNs. Understanding the capacity of deep CNNs is important for reducing the amount of labeled data used to train a network, since it is the first step towards designing a network with less capacity and good generalization. As theoretically analyzing the capacity of a deep CNN is difficult, we do it in an empirical way, by studying how many parameters of a deep CNN can be removed without affecting its performance. This work [36] is in collaboration with Jose M. Alvarez and Fatih Porikli.

2.1 Introduction

The last few years have witnessed the success of deep convolutional neural networks (CNNs) in many computer vision applications. One important reason is the emergence of large annotated datasets and the development of high-performance computing hardware, facilitating the training of high-capacity CNNs with an exceptionally large number of parameters.

When defining the structure of a network, large networks are often preferred, and strong regularizers [17] tend to be applied, to give the network as much discriminative power as possible. As such, state-of-the-art CNNs nowadays contain hundreds of millions of parameters [5]. Most of these parameters come from one or two layers that host a large number of neurons. Take AlexNet [4] as an example: the first and the second fully connected layers, which have 4096 neurons each, contain around 89% of all the parameters. Having to use such a large number of parameters leads to high memory requirements. Consequently, using deep CNNs requires significant hardware resources, which hinders their practicality for mobile computing devices that have limited memory.

Several recent studies have focused on reducing the size of the network, concluding that there is substantial redundancy in the number of parameters of a CNN. For instance, [37, 38] represent the filters in a CNN using a linear combination of basis filters. However, these methods can only be applied to an already trained network. Other works have investigated directly reducing the number of parameters in a CNN by using sparse filters instead of the original full-size filters. To the best of our knowledge, existing deep learning toolboxes [39-41] do not support sparse data structures yet; therefore, special structures need to be implemented. Moreover, the FFT has been shown to be very efficient for computing convolutions on GPUs [42]. However, a sparse matrix is usually no longer sparse in the Fourier domain, which limits the applicability (i.e., the reduction of the memory footprint) of these methods.

To address the shortcomings of the aforementioned works, we propose an efficient method to decimate the number of neurons. Our approach has four key advantages: (1) Neurons are assessed and reduced during the training phase. Therefore, no pre-trained network is required.
(2) The reduction of parameters does not rely on the sparsity of neurons; thus, our method can be directly included in popular deep learning toolboxes. (3) The number of filters in the Fourier domain is proportional to the number of neurons in the spatial domain. Therefore, our method can also reduce the number of parameters in the Fourier domain. (4) Reducing the number of neurons of a layer directly condenses the data dimensionality when the output of an internal layer is utilized as image features [7, 43].

Our method consists of imposing sparse constraints on the neurons of a CNN in the objective function during the training process. Furthermore, to minimize the extra computational cost of adding these constraints, we solve the sparse constrained optimization problem using the forward-backward splitting method [18, 19]. The main advantage of forward-backward splitting is to bypass the sparse constraint evaluations during the standard back-propagation step, making the implementation very practical. We also investigate the importance of rectified linear units in sparse constrained CNNs. Our experiments show that rectified linear units help to reduce the number of neurons, leading to even more compact networks.

We conduct a comprehensive set of experiments to validate our method using four well-known models (i.e., LeNet, CIFAR-10 quick, AlexNet and VGG) on three public datasets including ImageNet. Table 2.1 summarizes a representative set of our results when constraints are applied to the first fully connected layer of each model, which has the largest number of neurons and parameters. As shown, even with a large portion of neurons removed and, therefore, a significant reduction in the memory footprint of the original model, the resulting networks still perform as well as the original ones. These results are also convenient for deployment on mobile platforms where computational resources are limited. For instance, using the single float type (4 bytes) to store the network, the amount of memory saved for AlexNet in Table 2.1 is 152 MB.

Table 2.1: Neuron reduction in the first fully connected layer, the total parameter compression, the reduced memory, and the top-1 validation error rate (the value in parentheses is the error rate without sparse constraints).

Network          Neuron compression (%)^b   Parameter compression (%)^c   Memory reduced^a   Top-1 error (%)
LeNet            97.80                      92.00                         1.52 MB            0.63 (0.72)
CIFAR-10 quick   73.44                      33.42                         0.19 MB            25.25 (24.75)
AlexNet          73.73                      65.42                         152.14 MB          46.10 (45.57)
VGG-13           76.21                      61.29                         311.06 MB          39.26 (37.50)

^a Supposing a single float type is used to store the weights.
^b Compression of the number of neurons in the first fully connected layer.
^c Compression of the number of parameters of the whole network.

To reiterate, the contribution of the proposed method in this chapter is threefold. First, we remove neurons of a CNN during the training stage. Second, we analyze the importance of rectified linear units for sparse constraints. Finally, our experimental results on four well-known CNN architectures demonstrate a significant reduction in the number of neurons and in the memory footprint of the testing phase without affecting the classification accuracy.

2.2 Related Work

Deep CNNs demand large memory and excessive computational time. This has motivated studies to simplify the structure of CNNs. However, so far only one pioneering work has explored the redundancy in the number of neurons [17].
We organize the related work into three categories: network distillation, approximating parameters or neurons by memory efficient structures, and parameter pruning in large networks.

Network distillation: The work in [44] is among the first papers that tried to mimic a complicated model with a simple one. The key idea is to train a powerful ensemble model and use it to label a large amount of unlabeled data. Then a neural network is trained on these data so as to perform similarly to the ensemble model. Following the idea of [44], [45-47] present training a simple neural network based on the soft output of a complicated one. Their results show that a much simpler network can imitate a complicated network. However, these methods require a two-step training process, and the simple network must be trained after the complicated one.

Memory efficient structures: Most of the studies in this category are inspired by the work of Denil et al. [14], which demonstrated redundancy in the parameters of neural networks. It proposed to learn only 5% of the parameters and predict the rest by choosing an appropriate dictionary. Inspired by this, [37, 38, 48] proposed to represent the filters in convolutional layers by a linear combination of basis filters. As a result, convolving with the original filters is equivalent to convolving with the basis filters followed by a linear combination of the outputs of these convolutions. These methods focus on speeding up the testing phase of CNNs. In [49], the use of the canonical polyadic decomposition (CP-decomposition) to approximate the filter in each layer as a sequence of four convolutional layers with small kernels is investigated in order to speed up testing time when the network is run on a CPU. In [50], several schemes to quantize the parameters in CNNs are presented. All these methods require a trained network and then fine-tuning for the new structures. [51] applied the Tensor Train (TT) format to approximate the fully connected layers and trained the TT-format CNN from scratch. They can achieve a high compression rate with little loss of accuracy. However, it is not clear whether the TT format can be applied to convolutional layers.

Parameter pruning: A straightforward way to reduce the size of CNNs is to directly remove some of the unimportant parameters. Back in 1990, LeCun et al. [52] introduced computing the second derivatives of the objective function with respect to the parameters as a saliency measure, and removing the parameters with low saliency values. Later, Hassibi and Stork [53] followed their work and proposed the optimal brain surgeon, which achieved better results. These two methods, nevertheless, require computation of the second derivatives, which is computationally costly for large-scale neural networks [54]. In [15], directly adding sparse constraints on the parameters was discussed. This method, different from ours, cannot reduce the number of neurons. Another approach [16] combines memory efficient structures and parameter pruning: first, unimportant parameters are removed; then, after a fine-tuning process, the remaining parameters are quantized for further compression. This work is complementary to ours and can be used to further compress a network trained using our method.

Directly related to our proposed method is [17]. Given a pretrained CNN, it proposes a theoretically sound approach to combine similar neurons together.
Yet, their method is mainly for pruning a trained network, whereas our method directly removes neurons during training. Although [17] is data-free when pruning, it requires a pretrained network, which is data dependent. Moreover, our results show that, without degrading the performance, we can remove more neurons than the suggested cut-off value reported in [17].

2.3 Sparse Constrained Convolutional Neural Networks

Notation: In the following discussion in this chapter, if not otherwise specified, |·| is the ℓ1 norm and ||x|| = √(∑_i x_i²) is the ℓ2 norm for a vector (the Frobenius norm for a matrix). We use W and b to denote all the parameters in the filters and bias terms of a CNN, respectively. w_l and b_l represent the filter and bias parameters in the l-th layer. w_l is a tensor whose size is w × h × m × n, where w and h are the width and height of a 2D filter, m is the number of channels of the input feature and n is the number of channels of the output feature. b_l is an n-dimensional vector, i.e., each output feature of this layer has one bias term. w_lj represents a w × h × m filter that creates the j-th channel of the output feature for layer l, and b_lj is its corresponding bias term. w_lj and b_lj together form a neuron. We use Ŵ, ŵ_l and ŵ_lj to represent the augmented filters (they contain the corresponding bias terms).

2.3.1 Training a Sparse Constrained CNN

Let {X, Ŷ} be the training samples and the corresponding ground-truth labels. A CNN can then be represented as a function Y = f(Ŵ, X), where Y is the output of the network. Ŵ is learned by minimizing an objective function:

    min_Ŵ ψ(f(Ŵ, X), Ŷ).    (2.1)

We use ψ(Ŵ) to represent the objective function for simplicity. The objective function ψ(Ŵ) is usually defined as the average cross entropy of the ground truth labels with respect to the output of the network for each training image. Equation (2.1) is usually solved using a gradient descent based method such as back-propagation [3].

We add sparse constraints on the neurons of a CNN. Therefore, the optimization problem of (2.1) can be written as:

    min_Ŵ ψ(Ŵ) + g(Ŵ),    (2.2)

where g(Ŵ) represents the set of constraints added to Ŵ. Given this new optimization problem, the k-th iteration of standard back-propagation can be defined as:

    Ŵ^k = Ŵ^{k-1} − η ∂ψ(Ŵ)/∂Ŵ |_{Ŵ=Ŵ^{k-1}} − η ∂g(Ŵ)/∂Ŵ |_{Ŵ=Ŵ^{k-1}},    (2.3)

where Ŵ^k represents the parameters learned at the k-th iteration and η is the learning rate. Based on (2.3), a new term ∂g(Ŵ)/∂Ŵ |_{Ŵ=Ŵ^{k-1}} must be added to the gradient of each constrained layer during back-propagation. In those cases where g(Ŵ) is non-differentiable at some points, sub-gradient methods for g(Ŵ) are usually needed. However, these methods have three main problems. First, iterates of the sub-gradient at the points of non-differentiability hardly ever occur [18]. Second, sub-gradient methods usually cannot generate an accurate sparse solution [55]. Finally, sub-gradients of some sparse constraints are difficult to choose due to their complex form, or they may not be unique [19]. To avoid these problems, and in particular for ℓ1 constrained optimization problems, [56] and [15] proposed to use a proximal mapping. Following their idea, we apply a proximal operator to our problem.

2.3.2 Forward-Backward Splitting

Our proposal to solve problem (2.2) and, therefore, train a constrained CNN consists of using a forward-backward splitting algorithm [18, 19].
Forward-backward splitting provides a way to solve non-differentiable and constrained large-scale optimization problems of the generic form:

    min_z f(z) + h(z),    (2.4)

where z ∈ R^N, f(z) is differentiable and h(z) is an arbitrary convex function [18, 19]. The algorithm consists of two stages: first, a forward gradient descent step on f(z); then, a backward gradient step evaluating the proximal operator of h(z), i.e.,

    prox(z) = arg min_u ( h(u) + β ||u − z||² ),

where β is a weight to be chosen. Using this algorithm has two main advantages. First, it is usually easy to estimate the proximal operator of h(z), which may even have a closed form solution. Second, the backward step has an important effect on the convergence of the method when f(z) is convex [19].

Though there is no guarantee of convergence when f(z) is non-convex, the forward-backward splitting method usually works quite well for non-convex optimization problems [19]. By treating ψ(Ŵ) in Equation (2.2) as f(z), the forward gradient descent step can be computed exactly as the standard back-propagation algorithm used in training CNNs. As a result, using the forward-backward splitting method to solve sparse constrained CNNs involves two steps in each iteration. Algorithm 1 shows how to apply this method to optimize Equation (2.2), where η^k is the learning rate of the forward step at the k-th iteration.

Algorithm 1: Forward-backward splitting for sparse constrained CNNs
1: while not reaching the maximum number of iterations do
2:   One step of back-propagation for ψ(Ŵ) to get Ŵ^{k−}
3:   Ŵ^k = arg min_Ŵ g(Ŵ) + (1/(2η^k)) ||Ŵ − Ŵ^{k−}||²₂
4: end while

In practice, we define one step in line 2 of Algorithm 1 as one epoch instead of one iteration of the stochastic gradient descent algorithm. There are two main reasons for this. First, it minimizes the computational training overhead of the algorithm, as we need to evaluate fewer proximal operators of g(Ŵ). Second, the gradient of ψ(Ŵ) at each iteration is a noisy approximation of the exact gradient [56]; computing the backward step only after a certain number of iterations makes the learned parameters more stable [55].

2.3.3 Sparse Constraints

Our goal is to remove neurons ŵ_lj, each of which is a tensor. To this end, we consider two sparse constraints for g(Ŵ) in Equation (2.2): tensor low rank constraints [21] and group sparsity [22].

Tensor Low Rank Constraints

Although low-rank constraints for 2D matrices and their approximations have been extensively studied, as far as we know, there are few works considering low-rank constraints for higher dimensional tensors. In [21], the authors proposed to minimize the average rank of different unfoldings of a tensor. To relax the problem to a convex one, they proposed to approximate the average rank using the average of the trace norms of the different unfoldings, which is called the tensor trace norm [21]. We use this formulation as our tensor low rank constraint.

The tensor trace norm of a neuron ŵ_lj is ||ŵ_lj||_tr = (1/n) ∑_{i=1}^n ||ŵ_lj(i)||_tr, where n is the order of the tensor ŵ_lj and ŵ_lj(i) is the result of unfolding the tensor ŵ_lj along the i-th mode. Under this definition, the function g(Ŵ) can be defined as:

    g(Ŵ) = λ ∑_{(j,l)∈Ω} (1/n) ∑_{i=1}^n ||ŵ_lj(i)||_tr,    (2.5)

where Ω is a set containing all the neurons to be constrained and λ is the weight of the sparse constraint.

As a result, the backward step of the forward-backward splitting is given by:

    ŵ^k_lj = arg min_{ŵ_lj} (1/n) ∑_{i=1}^n ||ŵ_lj(i)||_tr + (1/(2λη)) ||ŵ_lj − ŵ^{k−}_lj||²
           = arg min_{ŵ_lj} ∑_{i=1}^n ( (1/n) ||ŵ_lj(i)||_tr + (1/(2ληn)) ||ŵ_lj(i) − ŵ^{k−}_lj(i)||² ).    (2.6)
This problem can be solved using the Low Rank Tensor Completion (LRTC) algorithm proposed in [21]. Suppose we have a 2D matrix X and let X = UΣVᵀ be the singular value decomposition of X, where Σ is a diagonal matrix of the singular values of X. Denoting the i-th singular value by σ_i, define Σ_τ = diag(max(σ_i − τ, 0)). The shrinkage operator is then defined as:

    D_τ(X) = U Σ_τ Vᵀ.    (2.7)

Algorithm 2 shows how to use LRTC to optimize Equation (2.6).

Algorithm 2: Backward step for tensor low rank constraints
1: Initialize ŵ^k_lj = ŵ^{k−}_lj
2: while not converged do
3:   for i = 1 to n do
4:     M_i = D_{λη}( (ŵ^k_lj(i) + ŵ^{k−}_lj(i)) / 2 )
5:   end for
6:   ŵ^k_lj = (1/n) ∑_{i=1}^n M_i
7: end while

Group Sparse Constraints

Group sparse constraints are defined as an ℓ2,1 regularizer. Applying ℓ2,1 to our objective function, we have:

    g(Ŵ) = λ ∑_{(j,l)∈Ω} ||ŵ_lj||,    (2.8)

where λ and Ω are the same as in Equation (2.5), and ||·|| is defined in Section 2.3. According to [22], the backward step of the forward-backward splitting method at the k-th iteration now becomes:

    ŵ^k_lj = max{ ||ŵ^{k−}_lj|| − λη, 0 } · ŵ^{k−}_lj / ||ŵ^{k−}_lj||,    (2.9)

where η is the learning rate and ŵ^{k−}_lj is the optimized neuron from the forward step of the forward-backward splitting.

2.3.4 Importance of Rectified Linear Units in Sparse Constrained CNNs

Convolutional layers in a CNN are usually followed by a nonlinear activation function, and Rectified Linear Units (ReLU) [4] have been heavily used in CNNs for computer vision tasks. Besides the advantages of ReLU discussed in [20, 57, 58], we show that ŵ_lj = 0 is a local minimum of a sparse constrained CNN whose non-linear function is ReLU. This explains our finding that ReLU helps in removing more neurons, and it inspires us to set the momentum to 0 for neurons that reach their sparse local minimum during training, as discussed in Section 2.4.1.

The ReLU function is defined as ReLU(x) = max(0, x), which is non-differentiable at 0. This creates difficulty in analyzing the local minima of sparse constrained CNNs. Based on the observation that, in practical implementations, the gradient of the ReLU function at 0 is set to 0 [39-41], we consider the following practical definition of ReLU:

    ReLU(x) = x if x > ε,  and  ReLU(x) = 0 if x ≤ ε,    (2.10)

where ε is chosen such that, for any real number x that a computer can represent, if x > 0 then x > ε, and if x ≤ 0 then x < ε. The non-differentiable point of the ReLU function is now at ε, which will never appear in practice. Under this definition, the ReLU function is differentiable and continuous at 0, and the gradient of ReLU(x) at 0 is 0. As a result, this practical definition of ReLU is consistent with the implementations of the ReLU function in practice [39-41].

We now show that a particular neuron ŵ_lj = 0 lies in a flat region of ψ(ŵ_lj) under the above definition of ReLU. Take a particular neuron ŵ_lj as an example and suppose all other neurons are fixed. Also, suppose x is one of the inputs that will go through ŵ_lj in the l-th layer. The output for x, assuming the convolutional layer is followed by a ReLU function, is z_x = ReLU(ŵ_lj x̂), where x̂ is x augmented by 1 to account for the bias term. Next, we show that ŵ_lj = 0 is a local minimum of Equation (2.2) along the dimensions of ŵ_lj, referred to as a local minimum of the CNN for simplicity.

Given any such input x, define φ(Δŵ_lj) = Δŵ_lj x̂ as a function of the perturbation Δŵ_lj. It is easy to
As a result, we can always find a ?x such that |?(?w?lj) ? ?(0)| < , ?||?w?lj ? 0|| < ?x. According to Equation (2.10), for those w?lj, ReLU(?w?ljx?) = 0. Denoting ? as the smallest one among all ?x, we know that for any ||?w?lj ? 0|| < ?, ReLU(?w?ljx) = 0, ?x ? ?. As all other neurons are fixed, we know that for any ||?w?lj ? 0|| < ?, ?(?w?lj) = c, where ? is the first part in Equation (2.2) and c is a scalar. Since g(W?) contains a sparse constraint on w?lj, g(w?lj = 0) ? g(?w?lj) and the equality holds if and only if ?w?lj = 0. As a result, for any ||?w?lj?0|| < ?, ?(w?li = 0) + g(w?li = 0) ? ?(?w?li) + g(?w?li), i.e. w?lj = 0 is the local minimum of the objective function ?(w?lj) + g(w?lj). The fact that w?lj = 0 is a local minimum for sparse constrained CNNs using ReLU as a nonlinear activation function can explain the improvement in the number of neurons removed as discussed in Section 2.4.1. Importantly, we find that using momentum during the optimization may push a zero neuron away from being 0 since momentum memorizes the gradient of this neuron in previous steps. As a conse- quence, using momentum would lead to more non-zero neurons without performance improvement. In practice, once a neuron reaches 0, we set its momentum to zero, forcing the neuron to maintain its value. As we will demonstrate in Section 2.4.1, this results in more zero neurons without affecting the performance. 2.4 Experiments We test our method on four well-known convolutional neural networks on three well-known datasets: LeNet on MNIST [3], CIFAR10-quick [23] on CIFAR-10 [59] 24 and AlexNet [4] and VGG [5] on ILSVRC-2012 [60]. We use LeNet and CIFAR10- quick provided by Matconvnet [41] and AlexNet and VGG provided by [61] to carry out all our experiments. All the structures of the networks are provided by Matconvnet [41] or [61], the only change we made is to add a ReLU function after each convolutional layer in LeNet, which we will discuss in detail later. For all our experiments, data augmen- tation and pre-processing strategies are provided by [41] and [61]. 2.4.1 LeNet on MNIST MNIST is a well-known dataset for machine learning and computer vision. It contains 60, 000 handwritten digits for training and 10, 000 for testing. All these digits are images of 28? 28 with a single channel. The mean of each set of data will be subtracted, as suggested in [41]. The average top 1 validation error rate of LeNet on MNIST adding a ReLU layer is 0.73%. As training a LeNet on MNIST is fast, we use this experiment as a sanity check. Results are computed as the average over four runs of each experiment using different random seeds. ReLU helps removing more neurons: The LeNet structure provided by Matconvnet has two convolutional layers and two fully connected layers. The two convolutional layers are each followed by a max pooling layer. Under this structure, the sparse solution may not be a local minimum. 
Based on the discussion in Section 2.3.4, we add a ReLU layer after each of these two convolutional layers so that $\hat{w}_{lj} = 0$ is a local minimum. We compare these two structures by using different weights for the sparse constraints added to the second convolutional layer.

Figure 2.1: (a) Percentage of nonzero neurons in the second convolutional layer of LeNet under different weights for the sparse constraints, with and without adding a ReLU layer. (b) Corresponding top 1 validation error rate. The baseline in (b) shows the top 1 validation error rate without adding any sparse constraints. Error bars represent the standard deviation over the four experiments.

As shown in Figure 2.1, adding ReLU improves the performance of LeNet regardless of whether we add sparse constraints. Comparing the two structures, adding a ReLU layer always leads to more zero neurons than the original structure. What is more interesting is that, with a ReLU layer, the top 1 validation error rate of LeNet with sparse constraints is closer to, if not better than, the performance without sparse constraints. This may be explained by the fact that $\hat{w}_{lj} = 0$ is a local minimum of the structure with ReLU and this local minimum is usually a good one. In the following discussion, when LeNet is mentioned, we mean LeNet with a ReLU function after each of its convolutional layers.

Figure 2.2: Comparison of the proposed results with setting momentum to zero. (a) and (c) show the percentage of non-zero neurons in the second convolutional layer and the first fully connected layer under different weights for the sparse constraints, respectively. (b) and (d) show their corresponding top 1 validation error rates.

Momentum for sparse local minimum: Momentum plays a significant role in training CNNs. It can be treated as a memory of the gradients computed in previous iterations and has been proven to accelerate the convergence of SGD.
However, this memory-of-gradient effect may push a neuron away from its sparse local minimum, leading to results with more non-zero neurons and no improvement in performance. To avoid this problem, we directly set the momentum to zero for those neurons whose sparse local minimum has been reached. In Figure 2.2, we compare the number of non-zero neurons with and without setting the momentum to zero for LeNet on the MNIST dataset. We add the sparse constraints on the second convolutional layer and the first fully connected layer, since they contain most of the filters. As shown in Figure 2.2, the performance does not drop when momentum is set to zero for sparse local minima. Additionally, we sporadically achieve better performance. Moreover, for the first fully connected layer, setting momentum to zero leads to results with more zero neurons under large sparse weights.¹

¹ For a clear comparison, we only show results under large weights for the first fully connected layer in the figure.

From Figure 2.1 and Figure 2.2, we find that the top-1 error rates of using tensor low-rank constraints are a little better than those of using group sparse constraints. Group sparse constraints, however, can produce more zero neurons, which can result in more compact networks. These differences are very small, so we conclude that both methods can help remove neurons without hurting the performance of the network. In the following discussion, only results from group sparse constraints are shown.

Figure 2.3: (a) MNIST: Comparison between single layer results of the proposed method and the data-free method presented in [17] using LeNet on MNIST. (b) CIFAR-10: Top 1 validation error rate versus percentage of remaining non-zero neurons on conv3 and fc4 using CIFAR10-quick.

Comparison with $\ell_0$ constraints on LeNet: Both [55] and [15] proposed to add $\ell_0$ constraints to the parameters of a neural network. They show that this performs quite well in the sense of zeroing out parameters. We apply this idea to remove neurons in LeNet and compare it with adding tensor low rank and group sparse constraints. We use exactly the same setup as in the paper and test the idea on the MNIST dataset [3]. Following the idea of [55] and [15], given a parameter $t$, the neurons whose magnitudes are among the largest $t$ are kept, while the other neurons are set to zero during the training process. Figure 2.4 compares the results of the $\ell_0$ constraints with the proposed tensor low rank and group sparse constraints.
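To make this $\ell_0$ baseline concrete, here is a minimal NumPy sketch of the top-$t$ selection rule just described. Function and variable names are illustrative, not taken from the original implementation, and a neuron is assumed to be flattened into one row of W:

```python
import numpy as np

def keep_top_t_neurons(W, t):
    """l0-style baseline of [55] and [15] applied to whole neurons: keep the t
    neurons (rows of W) with the largest l2 norm and zero out the rest."""
    norms = np.linalg.norm(W, axis=1)
    keep = np.argsort(norms)[-t:]          # indices of the t largest-norm neurons
    W_pruned = np.zeros_like(W)
    W_pruned[keep] = W[keep]
    return W_pruned, keep

# Example: keep 150 of 500 neurons after a gradient update.
rng = np.random.default_rng(0)
W = rng.normal(size=(500, 800))
W_pruned, kept = keep_top_t_neurons(W, t=150)
print(len(kept), "neurons kept,", W_pruned.shape[0] - len(kept), "zeroed")
```

Unlike the proximal updates of Equations (2.7) and (2.9), this rule zeroes neurons abruptly regardless of how far their norms are from zero, which relates to the risk discussed next.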
For the experiment, $t$ is chosen based on the number of non-zero neurons obtained by using the proposed sparse constraints under different weights.

Figure 2.4: (a) and (b) show the top 1 validation error rate versus the percentage of non-zero neurons for the second convolutional layer and the first fully connected layer of LeNet, respectively. We compare results of low rank constraints, group sparse constraints and directly using $\ell_0$ constraints.

Generally speaking, the proposed constraints perform better than the $\ell_0$ constraints for training the network. This is more obvious when we try to get more zero filters in the first fully connected layer. One reason is that, without using any sparse constraints, the norms of the learned neurons are distributed compactly and far from 0; as a result, directly setting some of the neurons to zero is risky. The distribution of the absolute values of the individual parameters, on the other hand, is biased towards 0 even without adding any sparse constraints, which explains why the $\ell_0$ constraint works well in [55] and [15]. This can be seen from Figure 2.5, which shows the distribution of the neuron norms and of the absolute values of the parameters.

Figure 2.5: (a) shows the distribution of the norms of the neurons in the first fully connected layer of LeNet. (b) shows the distribution of the absolute values of the parameters in the first fully connected layer of LeNet. The network is trained for one epoch without adding sparse constraints.

Compression for LeNet: Figure 2.3 (a) shows the comparison of the proposed method with [17]. Since [17] only adds sparse constraints on the first fully connected layer of LeNet (fc3), we only show our results using group sparse constraints on fc3. The compression rate and top 1 validation error are averaged over four runs with different random seeds. Since the two networks may be trained differently, we show the relative top 1 validation error rate, which is defined as the top 1 validation error rate of the proposed method minus that without sparse constraints; a smaller value means better performance. As shown in this figure, our method outperforms the one proposed in [17] in both compression and accuracy. Please note that [17] predicts the cut-off number of neurons for fc3 to be 440 with a performance drop of 1.07%. The proposed method, on the contrary, can remove 489 neurons while maintaining the performance of the original network.

To improve the compression rate, we add group sparse constraints to the layers containing most of the parameters in LeNet: the second convolutional layer (conv2) and the first fully connected layer (fc3). We compare two strategies: first, adding sparse constraints in a layer-by-layer fashion, and second, jointly constraining both layers. Empirically, we found that the first strategy performs better.
Thus, we first add sparse constraints on fc3 and, after the number of zero neurons in this layer is stable, we add sparse constraints on conv2. We report the results of using group sparse constraints. The weight used for fc3 is set to 100, and the weights for conv2 are set to 60, 80 and 100. Table 2.3 summarizes the average results over the four runs of the experiment. As shown, our method not only significantly reduces the number of neurons, which leads to a significant reduction in the number of parameters in these two layers, but also compresses the total number of parameters of the network by more than 90%, leading to a memory footprint reduction larger than 1 MB.

λ (conv1 / conv2 / fc3) | Non-zero neurons (conv1 / conv2 / fc3) | Number of parameters | Top-1 error (%)
100 / 80 / 90           | 7 / 23 / 20                            | 11820                | 0.72
100 / 120 / 80          | 7 / 13 / 21                            | 7079                 | 0.76
120 / 120 / 90          | 6 / 14 / 20                            | 6980                 | 0.81
Table 2.2: Results of adding group sparse constraints on three layers.

We further try to add group sparse constraints on all three layers of LeNet to check whether our method can work on more layers. To introduce more redundancy in conv1 and conv2, we initialize the number of non-zero neurons of conv1 and conv2 to 100. Similar to adding sparse constraints on two layers, we add the sparse constraints layer by layer. We show some of our results in Table 2.2. For comparison, the best compression result of adding sparse constraints on conv2 and fc3 leads to a model with 13062 parameters (third row of LeNet in Table 2.3). We find that by adding sparse constraints on three layers, a more compact network can be achieved even though we initialize the network with more neurons.

2.4.2 CIFAR-10 quick on CIFAR-10

CIFAR-10 [59] is a database consisting of 50,000 training and 10,000 testing RGB images with a resolution of 32 × 32 pixels, split into 10 classes. As suggested in [41], the data is standardized using zero-mean unit-length normalization followed by a whitening process. Furthermore, we use random flips as data augmentation. For a fair comparison, we train the original network, CIFAR10-quick, without any sparse constraints using the same training set-up and achieve a top 1 validation error rate of 24.75%.

Since the third convolutional layer (conv3) and the first fully connected layer (fc4) contain most of the parameters, we add sparse constraints on these two layers independently. Figure 2.3 (b) shows the top 1 validation error rate versus the percentage of remaining non-zero neurons. As shown, 20% and 70% of the neurons in conv3 and fc4, respectively, can be removed without a noticeable drop in performance. To obtain the best compression results, we jointly constrain conv3 and fc4 as we did with LeNet. To this end, we first add the constraints on fc4, and then we include the same ones on conv3. We run the experiment for more epochs than the default values in Matconvnet [41]. For a fair comparison, we train the baseline (i.e., CIFAR10-quick without sparse constraints) for the same number of epochs. As a result, we obtain a top 1 validation error of 22.33%. We use this result to compute the relative top 1 validation error rate. The weight of the group sparse constraints for fc4 is fixed to 280, while the three different weights for conv3 are 220, 240 and 280. A summary of results is listed in the second part of Table 2.3. This experiment shows that even for this simple network, we can remove a large number of neurons, which leads to a great compression in the number of parameters.
Considering that there are only 64 neurons in each of these two layers, the compression results are significant.

Figure 2.6: ImageNet. Comparison between single layer results on fc6 and fc7 and the data-free method in [17] using AlexNet on ImageNet. Data-free results are from [17].

2.4.3 AlexNet and VGG on ImageNet

ImageNet [62] is a dataset with over 15 million labeled images split into 22,000 categories. AlexNet [4] was proposed to be trained on ILSVRC-2012 [60], which is a subset of ImageNet with 1.2 million training images and 50,000 validation images. We use the implementations of AlexNet and VGG-13 provided by [61] in order to test ImageNet on our cluster. Random flipping is applied to augment the data. Quantitative results are reported using a single crop at the center of the image. For comparison, we consider the networks trained without adding any sparse constraints. The top 1 validation error of this baseline is 45.57% for AlexNet and 37.50% for VGG-13. For the rest of the experiments we only report results using group sparse constraints.

Figure 2.6 shows the top 1 validation error rate versus the percentage of non-zero neurons for AlexNet. Group sparse constraints are added on the first and second fully connected layers (fc6 and fc7) independently. Results from [17] are copied from their paper and shown in this figure for comparison. We show the relative top 1 validation error rates because the training of the two methods may differ. This figure clearly shows that, compared with [17], the proposed method can remove a large number of neurons without decreasing the performance. For instance, the best compression result of the proposed method eliminates 76.76% of all the parameters of AlexNet with a negligible drop in performance (0.57% in top 1 validation error). The best performing model in [17], on the other hand, can only remove 34.89% of the parameters, with the top 1 validation error rate increasing by 2.24%. A representative set of compression results obtained using sparse constraints on two layers is shown in the third part of Table 2.3.

We test the proposed method on VGG-13 on the first fully connected layer, as it contains most of the parameters of the network. Table 2.4 summarizes the outcomes of the experiment for different group sparsity weights. As shown, for this state-of-the-art network structure, our method can eliminate nearly half of the parameters and significantly reduce the memory footprint at the expense of a slight drop in performance.

LeNet (constraints on conv2 / fc3):
λ (conv2 / fc3) | Neurons pruned % (conv2 / fc3) | Top-1 error % (absolute / relative) | Parameter reduction (%) | Memory reduced (MB)
60 / 100        | 45.5 / 97.75                   | 0.73 / 0.00                         | 95.35                   | 1.57
80 / 100        | 56.5 / 97.75                   | 0.77 / 0.04                         | 96.31                   | 1.58
100 / 100       | 63.0 / 97.75                   | 0.76 / 0.03                         | 96.79                   | 1.59

CIFAR10-quick (constraints on conv3 / fc4):
λ (conv3 / fc4) | Neurons pruned % (conv3 / fc4) | Top-1 error % (absolute / relative) | Parameter reduction (%) | Memory reduced (KB)
220 / 280       | 31.25 / 70.31                  | 22.21 / -0.12                       | 47.17                   | 268.24
240 / 280       | 46.88 / 71.86                  | 22.73 / 0.4                         | 55.15                   | 313.62
280 / 280       | 54.69 / 70.31                  | 23.78 / 1.45                        | 58.56                   | 333.01

AlexNet (constraints on fc6 / fc7):
λ (fc6 / fc7)   | Neurons pruned % (fc6 / fc7)   | Top-1 error % (absolute / relative) | Parameter reduction (%) | Memory reduced (MB)
40 / 35         | 48.46 / 56.49                  | 44.58 / -0.98                       | 55.15                   | 128.26
45 / 30         | 77.05 / 60.21                  | 46.14 / 0.57                        | 76.76                   | 178.52
45 / 35         | 73.39 / 65.80                  | 45.88 / 0.31                        | 74.88                   | 174.14

Table 2.3: Results of adding group sparse constraints on two layers. The best compression result within a 1% increase in top-1 error rate is shown in bold.

Layer | λ  | Compression % (neurons / parameters) | Memory reduced (MB) | Top-1 error % (absolute / relative)
fc1   | 5  | 39.04 / 35.08                        | 178.02              | 38.30 / 0.80
fc1   | 10 | 49.27 / 44.28                        | 224.67              | 38.54 / 1.04
fc1   | 20 | 76.21 / 61.30                        | 311.06              | 39.26 / 1.76
Table 2.4: Some compression results of the proposed method on fc1 of VGG-B. Neurons: compression of the neurons in fc1. Parameters: compression of the total number of parameters.

2.5 Summary

We proposed an algorithm to significantly reduce the number of neurons in a convolutional neural network by adding sparse constraints during the training step. The forward-backward splitting method is applied to solve the resulting sparse constrained problem. We also analyze the benefit of using rectified linear units as the non-linear activation function in order to remove a larger number of neurons. Experiments using four popular CNNs, including AlexNet and VGG-13, demonstrate the capability of the proposed method to reduce the number of neurons, and therefore the number of parameters and the memory footprint, with a negligible loss in performance.
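To make the procedure above concrete, the following is a minimal NumPy sketch of one training iteration with group sparse constraints, combining the shrinkage operator of Equation (2.7), the backward step of Equation (2.9), and the momentum-zeroing rule of Section 2.4.1. It is an illustrative sketch under the assumption that each neuron is flattened into one row of W; it is not the Matconvnet code used in the experiments.

```python
import numpy as np

def shrink_singular_values(X, t):
    """Singular value shrinkage D_t(X) of Equation (2.7)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt

def group_sparse_prox(W, t):
    """Backward step of Equation (2.9): soft-threshold the l2 norm of each
    neuron (row of W)."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(norms - t, 0.0) / np.maximum(norms, 1e-12)
    return W * scale

def train_step(W, grad, velocity, lr=0.01, momentum=0.9, lam=100.0):
    """One forward-backward splitting step with the momentum-zeroing rule of
    Section 2.4.1; grad is the loss gradient for the current mini-batch."""
    velocity = momentum * velocity - lr * grad   # forward step: SGD with momentum
    W = W + velocity
    W = group_sparse_prox(W, lr * lam)           # backward step: Equation (2.9)
    zeroed = np.all(W == 0.0, axis=1)            # neurons at the sparse local minimum
    velocity[zeroed] = 0.0                       # clear their momentum so they stay zero
    return W, velocity
```

The same structure applies to the tensor low rank variant, with group_sparse_prox replaced by the LRTC iterations of Algorithm 2 built on shrink_singular_values.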
Chapter 3: Label Denoising Adversarial Networks (LDAN) for Inverse Lighting of Faces

Obtaining ground truth labels for real data is sometimes not just difficult, but impossible. In this situation, synthetic data can be very helpful in training deep CNNs, since it is easy to generate and ground truth labels are easy to obtain. In this chapter, we study one of these problems: inverse lighting from a single face image. This work [13] is in collaboration with Jin Sun, Yaser Yacoob and David W. Jacobs.

3.1 Introduction

Estimating lighting sources from an image is a fundamental problem in computer vision. In general, this is a particularly difficult task when the scene has unknown shape and reflectance properties. On the other hand, estimating the lighting of a human face, one of the most popular and well studied objects, is easier due to its approximately known geometry and near Lambertian reflectance. Lighting estimation can be used in applications such as image editing, 3D structure estimation, and image forgery detection. This chapter focuses on estimating lighting from a single face image. We consider the most common face image type: near frontal pose. The same idea can be applied to face images with other poses.

There exist many approaches for lighting estimation from a single face image [31, 63-65]; however, they are not learning-based and rely on complicated optimization during testing, making the process inefficient. Moreover, the performance of these methods (e.g., [31]) depends on the resolution of the face images, and they cannot give accurate predictions for low resolution images. Witnessing the dominant success of neural network models in other computer vision problems such as image classification, we are interested in a supervised learning approach that directly regresses lighting parameters from a single face image. Given an input face image, the approach outputs low dimensional Spherical Harmonics (SH) coefficients [34, 35] of its environment lighting condition. This is a very difficult problem, especially due to the scarcity of accurate ground truth lighting labels for real face images in the wild. In fact, building a dataset with realistic images and ground truth lighting parameters is extremely hard and currently there exists no such dataset.
Lacking ground truth labels, we propose to use synthetic data to help train a deep CNN for real data. However, a network trained using synthetic data cannot generalize well to real data due to the domain gap between synthetic and real data. As a result, we also include real data to train the deep CNN together with the synthetic data. We apply an existing method [31] to estimate lighting parameters of real face images. However, these lighting parameters are not the real "ground truth" as they contain unknown noise. Synthetic face images, on the other hand, have noise-free ground truth lighting labels. In this work, we show that this synthetic data with accurate labels can help train a deep CNN to regress lighting from real face images, "denoising" the unreliable labels.

The proposed method is based on two assumptions: (1) a deep CNN trained with synthetic data is accurate, i.e., it is not affected by any noise; (2) ground truth labels for real data are noisy, but still contain useful information. We design the lighting regression deep CNN to consist of two sub-networks: a feature net that extracts lighting related features, and a lighting net that takes these features as input and predicts the Spherical Harmonics parameters. Based on the first assumption, the lighting net trained with synthetic data is accurate. However, this lighting net expects lighting related features of synthetic data as input. To make it work for real data, the lighting related features of real data should be mapped to the same space. For that purpose, we utilize the idea of Generative Adversarial Networks (GAN) [32]. Specifically, a discriminator is trained to distinguish between lighting related features from synthetic data and real data, while the feature net (instead of a generator as in the standard GAN) is trained to fool the discriminator. The discriminator and our feature net play a minimax two player game, with the objective of pulling the distribution of lighting related features of real data towards that of the synthetic data. Under the second assumption, we have an additional objective of reducing the regression loss between predicted lightings and ground truth labels. Moreover, we design the network to take 64 × 64 RGB face images as input so that it will work for low resolution face images.

Figure 1.1 illustrates the proposed LDAN model. It consists of two steps during training: (1) train with synthetic data; (2) fix the feature net for synthetic data and the lighting net, and train another feature net for real data with an adversarial loss and a regression loss.

Figure 3.1: Two different functions that map data from the source domain to the target domain with similar adversarial loss. With an additional regression loss, our model is encouraged to learn a better behaved mapping function.

The authors of [66] proposed similar ideas, applying an adversarial loss to map the distribution of features from the source domain to the target domain. However, they only use the adversarial loss. We argue that such a mapping can be unpredictably arbitrary. As illustrated by Figure 3.1, both mapping A to A′ and B to B′, and mapping A to B′ and B to A′, make the source and target data have similar distributions. This may be acceptable for classification tasks if A and B belong to the same class. However, for regression, mapping A to B′ takes the mapped feature far away from where it should be.
As a result, using the regression loss for real data is critical in our regression problem: it regularizes the domain mapping function to have reasonable behavior. At the same time, the noise in the labels of the real data is suppressed by training with the adversarial loss.

Since real ground truth labels for SH do not exist, we propose to use a classification based method to evaluate the consistency of the estimated SH. However, this is still an indirect approach. To further evaluate the effectiveness of the proposed method, and to show its potential for other applications, we apply it to an object keypoint regression problem where ground truth labels are available. Similar to lighting regression from faces, we apply an existing method [27] to get the noisy ground truth and use synthetic data to help train an object keypoint regression network. Evaluated using the real ground truth, we demonstrate that LDAN works better than directly training a network with these noisy ground truth labels.

The main contributions of our work are: 1) we propose a lighting regression network for face images; 2) we propose a novel method, LDAN, to utilize accurate synthetic image lighting labels when training on real face images with noisy labels; 3) the proposed method increases the accuracy by 9% compared to [31] in quantitative evaluation, is robust to low resolution images, and is thousands of times faster.

3.2 Related Work

Lighting Estimation from a Single Face Image. Estimating lighting conditions from a single face image is an interesting but difficult problem. Blanz and Vetter [67] proposed to estimate the ambient and directional light as a byproduct of fitting 3D Morphable Models (3DMM) to a single face image. Since then, several 3DMM based methods have been proposed [63-65, 68, 69]. The performance of these methods relies on a good 3DMM of faces. However, existing 3DMMs are usually built with face images taken in a controlled environment, so their expressive power (especially that of the texture model) for faces in the wild is limited [70]. Barron and Malik proposed an optimization based method for estimating shape, albedo and lighting for general objects [31]. To solve such an underconstrained problem, their method relies heavily on prior knowledge about the shape, albedo and lighting of general objects. Though they achieved promising results, their method is slow and may fail to give reasonable results in some cases due to the non-convexity of the objective function. [71] proposed to use deep learning to disentangle representations of pose, lighting and identity from a face image. The authors only show the effectiveness of their method on synthetic images; its performance on real face images is unclear. Recently, there has been a trend to disentangle real faces using deep CNNs [72-74]. These methods, however, mainly focus on evaluating their performance on shape and albedo estimation. It is not clear whether the lighting estimated by these methods is accurate.

Learning with Noisy Labels. Learning with noisy labels has attracted the interest of researchers for a long time. [75] gives a comprehensive introduction to this problem. With the development of deep learning, many research studies have focused on how to train deep neural networks with noisy labels [76-81]. [76, 77, 80, 81] assume the probability of a noisy label only depends on the noise-free label but not on the input data, and try to model the conditional probability explicitly.
[79] models the type of noise as a hidden variable and proposes a novel probabilistic model to infer the true labels. [78] proposes to use CNNs pre-trained with noise-free data to help select data with noisy labels in order to better handle the noise. All the above mentioned methods focus on classification problems, and a considerable portion of the data is assumed to have noise-free labels. However, estimating lighting from face images is a regression problem, and the transition probability from a noise-free label to a noisy label is much more difficult to model. Moreover, the labels of all our real data are noisy. As a result, we are dealing with a much harder problem than the methods mentioned above.

GAN for Domain Adaptation. Since Goodfellow et al. [32] first proposed Generative Adversarial Networks, several works have used this idea for unsupervised domain adaptation [66, 82-84]. All these methods solve a problem in which the labels in the target domain are not enough to train a deep neural network. However, the problem we try to solve is intrinsically different from theirs in that the labels in the target domain are sufficient, but all of these labels are noisy. Moreover, all these methods apply domain adaptation to classification tasks, where the adversarial loss is enough to achieve good performance. On the contrary, the adversarial loss alone cannot work in our regression task. Though the adversarial loss can map the distribution of data in the target domain to that of the source domain, for a single point in the target domain the mapping is arbitrary, which is problematic as every data point has its unique label in a regression task.

3.3 Proposed Method

3.3.1 Spherical Harmonics

[34, 35] have shown that for convex objects with Lambertian reflectance and distant light sources, the lighting of the environment can be well approximated by 9 (gray scale) or 27 (color) dimensions of Spherical Harmonics (SH). In this chapter, we use SH as the lighting representation, as it has been widely used to represent environmental lighting in face related applications [31, 65, 69, 85-87].

All dimensions of SH can be fully recovered from an image if the pixels are equally distributed over a sphere. However, the pixels of a face image, loosely speaking, are distributed over a hemisphere. The SH that can be recovered from a face image, as discussed in [88], lie in a lower dimensional subspace, and the SH for faces under different poses lie in different subspaces. As a result, we consider regressing the SH in a lower dimensional subspace instead of the original 27 dimensional SH, and we focus on near frontal faces since most face images are taken under this pose.

Taking the red color channel as an example, we now show how to get the lower dimensional subspace of SH for near frontal faces. Let $I_r$ be a column vector in which each element represents one pixel value of a face image for the red channel; then $I_r = \rho_r Y l_r$, where $\rho_r$ is an $n \times n$ diagonal matrix whose elements are the albedo of the corresponding pixels, $l_r$ is a 9 dimensional vector of SH parameters, $Y$ is an $n \times 9$ matrix and $n$ is the number of pixels in the image. Each column of $Y$ corresponds to one SH basis image whose elements are determined by the normal of the corresponding pixel (see [34]). By applying SVD to $Y$, we get $Y = UDV^{T}$; then $I_r = \rho_r UDV^{T} l_r$. $V$ is a $9 \times 9$ matrix that spans the entire 9 dimensions of SH. We use synthetic data to get $V$, since we know the ground truth normal of every pixel and thus $Y$ is known.
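As a concrete illustration, here is a minimal NumPy sketch of this step. Y is assumed to be the n × 9 matrix of SH basis images evaluated at the known synthetic normals; the truncation of V used in the projection follows the choice described in the next paragraph, and the function names are illustrative:

```python
import numpy as np

def sh_subspace(Y, k=6):
    """Y: n x 9 matrix of SH basis images computed from the known synthetic
    normals. Returns V_k (9 x k), the leading right singular vectors, and the
    cumulative fraction of singular-value energy."""
    U, D, Vt = np.linalg.svd(Y, full_matrices=False)   # Y = U D V^T
    return Vt[:k].T, np.cumsum(D) / np.sum(D)

def project_color_sh(sh27, Vk):
    """Project a 27-dim color SH vector (9 coefficients per channel) onto the
    subspace, giving a 3*k dimensional target (18-dim for k = 6)."""
    return (sh27.reshape(3, 9) @ Vk).reshape(-1)
```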
We then keep only the first 6 columns of $V$, denoted $V_6$, corresponding to the 6 largest singular values, since they capture 99% of the energy of the singular values. With $V_6$, we project all the SH to their 18 dimensional subspace throughout the experiments.

3.3.2 Label Denoising Adversarial Network

Training a regression deep CNN needs a lot of data with ground truth labels. However, getting the ground truth lighting parameters of a realistic face image is extremely difficult: it usually needs a mirror ball or a panorama camera that is carefully set up to record an environment map relative to the position of the face. Instead, we adapted [31] to predict lighting parameters from a large number of face images. These parameters are then projected to the lower dimensional subspace using the $V_6$ discussed above. We use these projected lighting parameters as noisy ground truth labels and denote them by $\bar{y}_r$. Together with the real face images $r$, $(r, \bar{y}_r)$ will be used as (data, label) pairs to train a deep CNN to regress lighting parameters. Because these labels are noisy, directly training a deep CNN with these data cannot give the best performance.

We propose to use synthetic face images, whose ground truth lighting parameters are known, to help train the deep CNN. The proposed deep CNN has two sub-networks: a feature network that is used to extract lighting related features, and a lighting network that takes lighting related features as input and predicts the SH for the face images. For synthetic data $s$, we denote the feature network by $S$ and the lighting network by $L$. The predicted SH is then represented as $y_s = L(S(s))$. Since $S$ and $L$ are trained using synthetic data whose ground truth labels are known, they are accurate. The feature network $R$ and lighting network $L_r$ for real data, on the other hand, will both be affected by the noisy labels if directly trained using the noisy ground truth of the real data. To alleviate the effect of noisy labels, we propose to use $L$ as the lighting net for real data, i.e., $L_r = L$, since it is not affected by noise. However, since $L$ is trained using synthetic data, it only works if the input is from the space of lighting related features of synthetic data. As a result, $R$ needs to be trained such that the lighting related features of real data are mapped into the same space as those of synthetic data.

Given a set of synthetic images $s$ and their ground truth labels $y^{*}_{s}$, we train the feature net $S$ and lighting net $L$ through the following loss function:

$$\min_{S,L} \sum_{(i,j)\in\Phi} \underbrace{\big[\|L(S(s_i)) - y^{*}_{s_i}\|^2 + \|L(S(s_j)) - y^{*}_{s_i}\|^2\big]}_{\text{regression loss for synthetic}} + \underbrace{\lambda\,\|S(s_i) - S(s_j)\|^2}_{\text{feature loss}}, \qquad (3.1)$$

where $s_i$ and $s_j$ are a pair of synthetic images with the same SH lighting, different IDs, and different small random deviations from frontal pose, $y^{*}_{s_i}$ represents their ground truth label, $\Phi$ is the set containing all such pairs, and $\lambda$ is the weight of the feature loss. Besides the regression loss, we also add an MSE feature loss that encourages the lighting related features of face images with the same SH to be the same. This encourages the lighting related features to contain no information about face ID and pose.

With trained $S$ and $L$, we need to train the feature net $R$ for real face images $r$ so that the lighting related features of real data ($f_r = R(r)$) lie in the same space as those of synthetic data ($f_s = S(s)$). Our idea is inspired by the recently proposed Generative Adversarial Networks (GAN) [32], which have proved to be very effective for synthesizing realistic images.
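Before turning to the adversarial part, the synthetic-data objective of Equation (3.1) can be summarized in a short sketch. Here S and L are stand-ins for the feature net and lighting net (any callables), lam plays the role of the feature-loss weight λ, and the code is an illustration rather than the Keras implementation described later:

```python
import numpy as np

def synthetic_loss(S, L, s_i, s_j, y_star, lam=0.01):
    """Equation (3.1) for one pair of synthetic images (s_i, s_j) that share
    the ground truth SH y_star; lam is the feature-loss weight."""
    f_i, f_j = S(s_i), S(s_j)
    regression = np.sum((L(f_i) - y_star) ** 2) + np.sum((L(f_j) - y_star) ** 2)
    feature = lam * np.sum((f_i - f_j) ** 2)
    return regression + feature
```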
In our setting, a discriminator $D$ is trained to distinguish $f_r$ from $f_s$, while $R$ is trained so that $f_r$ makes $D$ fail. By playing this minimax game, the distribution of $f_r$ is pulled close to that of $f_s$. Wasserstein GAN (WGAN) [89] is used as our training strategy, since it can alleviate the "mode dropping" problem and generate more realistic samples for image synthesis. However, making the distribution of $f_r$ similar to that of $f_s$ is not enough for our regression problem, since the mapping can be arbitrary to some extent. As shown in Figure 3.1, both mappings make the two sets of points have similar distributions, but they are not both correct, since every point has its unique label. To deal with this problem, we use the noisy ground truth of the real data as "anchor points" during training. As a result, the final loss function for our problem is defined as follows:

$$\min_{R}\max_{D} \underbrace{\sum_{i} \big\|L(R(r_i)) - \bar{y}_{r_i}\big\|^2}_{\text{regression loss for real}} + \mu\,\underbrace{\big(\mathbb{E}_{S(s)\sim P_s}[D(S(s))] - \mathbb{E}_{R(r)\sim P_r}[D(R(r))]\big)}_{\text{adversarial loss}}, \qquad (3.2)$$

where $P_s$ and $P_r$ are the distributions of lighting related features for synthetic and real images, respectively, and $\mu$ is the weight of the adversarial loss. Following [32, 89], the discriminator $D$ and feature net $R$ are trained in an alternating fashion. While training $D$, RMSProp [90] is applied, and Adadelta [91] is used to train $S$, $R$ and $L$, as discussed in [89]. The details of how to train the whole model are given in Algorithm 3.

Algorithm 3: Training procedure for LDAN
1: Train $S$ and $L$ for synthetic data using the loss function in Equation (3.1) with Adadelta.
2: Compute lighting related features for synthetic images using $f_{s_i} = S(s_i)$.
3: for the number of training epochs do
4:   for $k = 1$ to 1 iterations do
5:     Sample 128 $f_s$ and $r$. Train the discriminator $D$ through the following loss using RMSProp: $\max_D \mathbb{E}_{f_s\sim P_s}[D(f_s)] - \mathbb{E}_{R(r)\sim P_r}[D(R(r))]$
6:   end for
7:   for $k = 1$ to 4 iterations do
8:     Sample 128 $r$ and train $R$ through the following loss using Adadelta: $\min_R \sum_i \|L(R(r_i)) - \bar{y}_{r_i}\|^2 - \mu\,\mathbb{E}_{R(r)\sim P_r}[D(R(r))]$
9:   end for
10: end for

          | SIRFS log | SIRFS SH | REAL          | LDAN          | Model B       | Model C
top-1 (%) | 60.72     | 56.04    | 61.29 (±1.8)  | 65.73 (±1.78) | 56.62 (±3.86) | 63.03 (±0.91)
top-2 (%) | 79.65     | 74.39    | 81.95 (±1.3)  | 84.57 (±1.35) | 76.94 (±4.10) | 82.79 (±0.35)
top-3 (%) | 87.27     | 83.74    | 90.59 (±0.7)  | 92.43 (±0.59) | 86.69 (±3.39) | 91.21 (±0.47)
Table 3.1: Accuracy of different methods. The standard deviation is shown in parentheses for the learning based methods.

3.4 Experiments

3.4.1 Data Collection

Real Face Images: The proposed LDAN requires a large number of both synthetic and real face images for training. To collect the real face images, we download images with faces from the Internet. The SIRFS method proposed by Barron and Malik [31] is then applied to these face images to get the noisy ground truth SH for lighting. Since SIRFS was proposed to estimate lighting for general objects, the prior it uses is not face-specific. To get a better constraint on the face shape, we apply Discriminative Response Map Fitting (DRMF) [92] to estimate the facial landmarks and pose. Then, a 3DMM [67] is fitted to get an estimate of the face depth map, which is used as a prior to constrain the face shape estimation of SIRFS. We collected 40,000 faces with noisy ground truth SH for training.

Synthetic Face Images: We apply the 3D face model provided by [93] to generate 40,000 pairs of faces. Each pair of faces is under the same lighting but with different identities and a small random variation with respect to the frontal pose.
MultiPie: The MultiPie dataset [94] contains a large number of face images of different IDs taken under different poses and illumination conditions. From this dataset, 4,980 face images are chosen, containing 250 IDs in frontal pose under 19 lighting conditions. Though ground truth lighting parameters are not provided for these face images, the lighting condition group under which each face image is taken is given. This data is used only for evaluation in our experiments.

3.4.2 Implementation Details

We use the same network structure for the feature nets $S$ and $R$. We apply the ResNet structure [6] to define a feature net. It takes a 64 × 64 RGB face image as input and outputs a 128 dimensional feature vector. We define our lighting net $L$ and discriminator $D$ to be 2 and 3 fully connected layers, respectively. The lighting net outputs 18 dimensional lighting parameters, and $D$ outputs the score for being a lighting related feature of real data. We show the structure of our networks in this section.

As mentioned above, we borrow the structure of ResNet [6] to define our feature net. Figure 3.2 (a) shows the details. A block like "Conv 3×3, 16" means a convolutional layer with 16 filters, where the size of each filter is 3 × 3 × n and n is the number of input channels; this convolutional layer is followed by a batch normalization layer and a ReLU layer. A block like "Residual 3×3, 32" means a residual block of two 3 × 3 convolutional layers with a skip connection: each of the two convolutional layers has 32 filters and is followed by batch normalization and a ReLU layer. ",/2" means the stride of the first convolutional layer in the residual block is 2. The output of the feature net is a 128 dimensional feature. Figure 3.2 (b) shows the structure of the lighting net. "FC ReLU, 128" means a fully connected layer with 128 outputs followed by a ReLU layer; "Dropout" means a dropout layer with dropout ratio 0.5; "FC, 18" means a fully connected layer with 18 outputs. Figure 3.2 (c) shows the structure of the discriminator. "FC tanh, 1" means a fully connected layer with one output followed by a tanh layer.

While training the proposed model, we first train the discriminator $D$ for 1 iteration and then train the feature net $R$ for 4 iterations. We alternate these two steps for 25 epochs. We choose $\lambda = 0.01$ and $\mu = 0.01$. Our algorithm is implemented using Keras [95] with Tensorflow [96] as the backend.

Figure 3.2: (a), (b) and (c) show the structures of the feature net, lighting net, and discriminator used in our paper.

3.4.3 Evaluation Metric

Since ground truth lighting parameters for real face images are not available, it is difficult to evaluate the accuracy of the regressed lighting quantitatively. We propose an "indirect" quantitative evaluation metric based on classification, and test our method on the MultiPie dataset, which contains face images taken under 19 lighting conditions. More specifically, after regressing the SH for each test face image, 90% of the images are used to compute the mean SH for each lighting condition. Then, each of the remaining face images is assigned to one of the 19 lighting conditions based on the Euclidean distance between its estimated SH and the mean SH.
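A minimal NumPy sketch of this evaluation protocol is given below; the array shapes and the function name are assumptions, and the 19 condition labels come from MultiPie:

```python
import numpy as np

def lighting_classification_accuracy(sh_pred, cond_ids, train_frac=0.9, topk=1, seed=0):
    """Use train_frac of the images to compute a mean SH per lighting
    condition, then assign each held-out image to the condition with the
    nearest mean SH (Euclidean distance) and report top-k accuracy.
    sh_pred: (N, 18) predicted SH; cond_ids: (N,) lighting-condition labels."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(sh_pred))
    n_train = int(train_frac * len(sh_pred))
    train, test = idx[:n_train], idx[n_train:]

    conds = np.unique(cond_ids)
    means = np.stack([sh_pred[train][cond_ids[train] == c].mean(axis=0) for c in conds])

    dists = np.linalg.norm(sh_pred[test][:, None, :] - means[None, :, :], axis=2)
    ranked = conds[np.argsort(dists, axis=1)[:, :topk]]        # top-k nearest conditions
    correct = (ranked == cond_ids[test][:, None]).any(axis=1)
    return correct.mean()
```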
We carry out 10 cross validations of this classification measurement to make use of all the data.

3.4.4 Experimental Results

To show the effectiveness of the proposed method, we compare our results with the SIRFS [31] based method in this section. In SIRFS, the shading of a face is formulated in logarithm space, i.e., $\log\{s_i\} = Y_i l$, where $s_i$ is the shading at the $i$-th pixel, $Y_i$ is the $i$-th row of $Y$ and $l$ represents the SH. The $l$ estimated in this way is not the correct SH. To estimate the correct SH lighting, we assume that the normal of each pixel estimated by SIRFS is in Euclidean space instead of logarithm space. This assumption is reasonable since we adapted the SIRFS method by estimating a face depth map using a 3DMM in Euclidean space and constrained the estimated face shape to be consistent with it. Supposing $\bar{l}$ is the correct SH, the shading can be found by $s_i = Y_i\bar{l}$. Then $\bar{l}$ can be found by solving the following equation:

$$Y\bar{l} = \exp\{Yl\}. \qquad (3.3)$$

This is an overdetermined linear system, as the number of pixels is larger than the dimension of the SH.

          | LDAN          | LDAN w/o Adversarial | LDAN w/o Regression | LDAN w/o Fixed Lighting Net
top-1 (%) | 65.73 (±1.78) | 63.63 (±2.12)        | 30.72 (±0.63)       | 63.95 (±0.60)
top-2 (%) | 84.75 (±1.35) | 83.44 (±1.57)        | 49.12 (±0.85)       | 83.97 (±0.25)
top-3 (%) | 92.43 (±0.59) | 91.48 (±1.09)        | 61.58 (±1.07)       | 92.07 (±0.46)
Table 3.2: Results of the ablation study. The standard deviation is shown in parentheses.

Comparison with baselines. Table 3.1 compares the proposed method with the SIRFS based method using the classification measurement on the MultiPie dataset. We denote the original output of the SIRFS method as SIRFS log, and SIRFS SH is used to denote the corrected SH obtained from Equation (3.3). We test these two methods on the original resolution of the MultiPie data, which is roughly 220 × 270 after cropping the faces. REAL in Table 3.1 represents our baseline method, which uses SIRFS SH as ground truth to train a deep CNN without synthetic data. REAL and LDAN are trained 5 times, and the mean accuracies are shown in Table 3.1.

We notice that SIRFS SH, which solves Equation (3.3) based on SIRFS log, performs worse than SIRFS log. According to Equation (3.3), the accuracy of SIRFS SH depends not only on the accuracy of SIRFS log, but also on the accuracy of the estimated normals; noisy estimates of the normals make the estimated SIRFS SH noisier. The performance of REAL is better than SIRFS SH, though it is trained directly using the output of SIRFS SH as the ground truth label. This shows that by observing a large amount of data, the deep CNN itself can be robust to noise to some extent. This is an advantage of learning based methods compared with optimization based algorithms. LDAN outperforms REAL by more than 4% and SIRFS SH by more than 9% in top-1 accuracy, showing the effectiveness of the proposed method.

Figure 3.3: Two models we use to compare with the proposed LDAN. Different from LDAN, Model B uses the same feature net for synthetic and real data; Model C trains feature nets for synthetic and real data together.

We further propose two other baselines to compare with LDAN, as shown in Figure 3.3. Different from LDAN, Model B and Model C learn the feature nets for synthetic and real data simultaneously and map their lighting related features to the same space. These two models are inspired by [82] and [84]. For Model B, synthetic and real data share the same feature net.
Since synthetic data and real data are quite different from each other, it is difficult to make their lighting features have the same distribution using a single feature net, and we do not expect good performance. Model C defines different feature nets for synthetic and real data. The difference between Model C and LDAN is that Model C tries to map the lighting related features of synthetic and real data to a common space, which might be different from the one learned with synthetic data alone, whereas LDAN tries to directly map the lighting related features of real data to the space of the synthetic data. Intuitively, compared with LDAN, Model C is more easily affected by the noisy labels of the real data, since the training of the feature net for synthetic data is affected by the real data. Model B and Model C are also trained 5 times, and their mean accuracies are shown in Table 3.1 for comparison. We notice that Model B performs even worse than REAL, which shows that a single feature net for both synthetic and real data is not enough. LDAN and Model C outperform all other methods in Table 3.1. Moreover, LDAN performs better than Model C, showing that it is more robust to the noise in the labels of the real data.

Figure 3.4: First row: MultiPie face image, rendered synthetic face with SIRFS estimated lighting, and rendered synthetic face with LDAN estimated lighting. Second row: the hemisphere visualization of the corresponding estimated lightings. Images are best viewed on screen.

Figure 3.5: First row: CelebA face image, rendered synthetic face with SIRFS estimated lighting, and rendered synthetic face with LDAN estimated lighting. Second row: the hemisphere visualization of the corresponding estimated lightings. Images are best viewed on screen.

Ablation Study. To investigate the effectiveness of the adversarial loss and the regression loss, we carry out ablation studies for LDAN. We train the feature net for real data 5 times without the adversarial loss or the regression loss, respectively, and compare the results with LDAN in Table 3.2. Without the adversarial loss, the performance of LDAN is still better than REAL in Table 3.1, which means that the synthetic data can help to regress lighting in this case. Without the regression loss, on the other hand, the performance of LDAN drops dramatically. This is because the mapping of the distribution of lighting related features of real data to that of synthetic data is arbitrary, as shown in Figure 3.1. This is problematic for a regression task where each data point has its unique label. Having the noisy ground truth as "anchor points", as we do in LDAN, can alleviate this problem and give much better results. We also train LDAN without fixing the lighting net and show the results in Table 3.2. We notice that the performance is similar to training LDAN without the adversarial loss. This is expected, since the lighting net will be trained to adapt to the noisy labels, reducing the impact of the adversarial loss.

Method     | LDAN          | LDAN          | LDAN          | SIRFS SH
Resolution | 64 × 64       | 32 × 32       | 16 × 16       | 64 × 64
top-1 (%)  | 65.73 (±1.78) | 64.89 (±1.65) | 61.72 (±1.72) | 42.17
top-2 (%)  | 84.75 (±1.35) | 84.39 (±1.33) | 82.17 (±1.01) | 61.94
top-3 (%)  | 92.43 (±0.59) | 92.10 (±0.74) | 90.94 (±0.81) | 74.51
Table 3.3: Accuracy of LDAN for different scales of face images.
Visualizing Estimated Lighting. Figure 3.4 and Figure 3.5 visualize the SH parameters estimated by SIRFS and by LDAN from MultiPie images and from the CelebA [97] dataset, respectively. Note that for visualization purposes, we show images cropped by bounding boxes instead of images cropped by facial landmarks. Though there are only a few images with strong side light effects, we notice that LDAN can still work reasonably well for such images, as shown in Figure 3.4 (b) and (d). However, the predicted lightings are not as sharp as those of SIRFS. This is mainly because the performance of learning based methods depends heavily on the training data; without sufficient face images with strong side light for training, the performance of LDAN on those images may not be optimal. We notice that the lighting predicted by SIRFS can have incorrect directions (Figure 3.5 (a), (b), (c) and (d), as well as Figure 3.4 (c)). One of the reasons is the effect of hair: since the facial landmark detection method is not perfect, some of the hair regions are included in the cropped face images, and this confuses SIRFS. Moreover, some of the lighting predicted by SIRFS has an incorrect color tone, especially for faces with dark reflectance, as shown in Figure 3.5 (e), (f), (g) and (h). Our learning based method, LDAN, is not affected by these two issues.

Robustness to Low Resolution Images. To investigate the robustness of the proposed method to low resolution images, we downsample the face images of MultiPie to 32 × 32 and 16 × 16, resize them back to 64 × 64, and evaluate the lighting classification accuracy using our trained LDAN model. Table 3.3 shows the performance of LDAN on low resolution face images. We notice that the trained model is quite robust to low resolution images: even for face images of size 16 × 16, the top-1 accuracy only drops 4% compared with the original resolution (64 × 64). For comparison, we also run SIRFS on 64 × 64 face images. Since we cannot run the 3DMM on lower resolution images to get a good initialization directly, we fit the 3D model at the original resolution and then resize it accordingly. We notice that the performance of SIRFS drops a lot (14%) even on 64 × 64 images.

Running Time. We run experiments on a workstation with 4 Intel Xeon CPUs and 80 GB of memory. When running on a GPU, we use one NVIDIA GeForce TITAN X. For a 64 × 64 RGB face image, SIRFS [31] takes 47 seconds to predict the lighting parameters. The proposed deep CNN can predict lighting for 390 such face images per second on the CPU and 2,400 per second on the GPU, so it is potentially 100,000 times faster.

3.4.5 Object 2D Keypoints Detection

Method        | synthetic | regression | fine-tune | LDAN   | 3DINN [27] | regression gt
PCK (α = 0.1) | 31.90%    | 69.61%     | 76.12%    | 79.66% | 81.95%     | 90.57%
Table 3.4: PCK value at α = 0.1 for different methods. We notice that LDAN outperforms regression and fine-tune.

Figure 3.6: Illustration of the network for keypoint regression.

Since ground truth lighting is hard to obtain, to better quantitatively check the effectiveness of the proposed method, we further apply LDAN to object 2D keypoint detection, for which ground truth labels are available. The keypoint-5 dataset provided by 3DINN [27] has ground truth labels for 2D keypoints of sofas, chairs, beds and swivel chairs. [26] provided synthetic images of sofas and chairs and their corresponding labels. Since 3DINN has achieved very high accuracy on the chair dataset, we focus on using the sofa data to test our method.
Proposed Method for 2D Keypoint Regression. Similar to the lighting regression formulation, let us denote by $S$ the feature network for synthetic data, by $R$ the feature network for real data, and by $L$ the regression network. Specifically, we use $L_j$ to represent the regression network for the $j$-th keypoint, where $j = 1, 2, \ldots, 14$. We use $(s_i, y^{*}_{s_i})$ as the (data, ground truth label) pair for synthetic data, and $y^{*j}_{s_i}$ to represent the ground truth location of the $j$-th keypoint. We train the feature net $S$ and regression nets $L$ using the following loss function:

$$\min_{S,L} \sum_{i,j} \big\|L_j(S(s_i)) - y^{*j}_{s_i}\big\|^2. \qquad (3.4)$$

Note that since we do not have two sofa images with exactly the same 2D keypoints, we drop the feature loss. After training $S$ and $L$, we fix them and train the feature net $R$ with real data together with a discriminator $D$. Suppose $r_i$ represents the $i$-th real image and $\bar{y}_{r_i}$ represents its corresponding noisy label; more specifically, let $\bar{y}^{j}_{r_i}$ represent the location of the $j$-th keypoint. We train $R$ and $D$ using the following loss function:

$$\min_{R}\max_{D} \underbrace{\sum_{i,j} \big\|L_j(R(r_i)) - \bar{y}^{j}_{r_i}\big\|^2}_{\text{regression loss for real}} + \mu\,\underbrace{\big(\mathbb{E}_{S(s)\sim P_s}[D(S(s))] - \mathbb{E}_{R(r)\sim P_r}[D(R(r))]\big)}_{\text{adversarial loss}}, \qquad (3.5)$$

where $\mu = 0.05$ and $P_s$ and $P_r$ are the distributions of keypoint related features for synthetic and real images, respectively.

Training Details. To mimic noisy labels, we apply the code provided by 3DINN to predict the keypoints of sofas, and then double the noise of these labels so that they contain more noise. Supposing $l^{*}$ is the ground truth location of a keypoint (a 2 dimensional vector) and $\hat{l}$ is the keypoint location predicted by 3DINN, we double the noise in the label and get $\bar{l} = 2\hat{l} - l^{*}$. Unless otherwise specified, we use $\bar{l}$ as the noisy labels to train the networks. Different from 3DINN, we formulate keypoint detection as a regression problem. Similar to our LDAN model, the network is designed to have a feature network and regression networks. Inspired by [27], we define a separate regression network for each keypoint, resulting in 14 different regression networks, as illustrated in Figure 3.6. Following [26], we normalize the keypoint locations by the width and height of the image, so that the $(x, y)$ coordinates of each 2D keypoint are within $[0, 1]$. We resize each of the images from the dataset to 256 × 256. Our feature net takes a 256 × 256 image as input and outputs a 16 × 16 × 512 tensor as the feature. The regression networks take this feature as input and predict the 2D location of the corresponding keypoint. Figure 3.7 (a), (b) and (c) show the structures of the feature net, regression net and discriminator, respectively. The notation of each block is the same as in Figure 3.2.

Figure 3.7: (a), (b) and (c) show the structures of the feature net, keypoint regression net, and discriminator used in our system.

Experiments. We use the Percentage of Correct Keypoints (PCK) metric [98] to evaluate the accuracy. A 2D keypoint prediction is correct if it lies within a radius $\alpha L$ of the ground truth, where $L$ is the diagonal of the image and $0 < \alpha < 1$. Following [27], we show the PCK curve with the value of $\alpha$
between 0.0 and 0.2 in Figure 3.8.

Figure 3.8: PCK curve of real sofa images for different methods.

We also show the PCK value at $\alpha = 0.1$ (as in [26]) in Table 3.4 and compare it with several reasonable baselines: (1) training the network using real data with the noisy labels ("regression"); (2) training the network using synthetic data and testing it on real data ("synthetic"); (3) fine-tuning the network trained on synthetic data using real images with the noisy labels ("fine-tune"). We also show the performance of 3DINN and the performance of training the network using real data with the real ground truth labels as supervision as references ("3DINN" and "regression gt", both trained with real ground truth). These results show that the proposed method works better than the other methods that train with noisy labels, including directly training the network and fine-tuning it. This demonstrates the effectiveness of the proposed method.

3.5 Summary

In this chapter, we propose a lighting regression network to predict the Spherical Harmonics of environment lighting from face images. Lacking ground truth labels for real face images, we apply an existing method to get noisy ground truth. To alleviate the effect of the noise, we propose to apply the idea of adversarial networks and use synthetic face images with known ground truth to help train a deep CNN for lighting regression. Compared with existing methods, the proposed method is more efficient and can predict more consistent Spherical Harmonics from different faces taken under the same lighting. Due to the lack of a direct measurement of the estimated Spherical Harmonics, we further apply the proposed method to regress 2D keypoints, for which ground truth labels are provided. These experiments further demonstrate the effectiveness of the proposed method.

Chapter 4: Deep Single-Image Portrait Relighting

In this chapter, we introduce a single image portrait relighting algorithm. Given a source portrait and a target lighting condition, the proposed method can generate a new portrait image under the target lighting condition. Due to the lack of an "in the wild" dataset that is suitable for this task, we apply physically-based rendering to generate a large scale, high resolution, "in the wild" dataset. A deep CNN is then trained using the proposed dataset. Our experiments demonstrate that the proposed method can generate high resolution relighted images accurately. This work [99] is in collaboration with Sunil Hadap, Kalyan Sunkavalli and David W. Jacobs.

Figure 4.1: Our algorithm takes a portrait image and a target lighting as input and generates a new portrait image.

4.1 Introduction

Portrait relighting is one of the most interesting and important applications of image editing. The goal of this work is to design an automatic single-image portrait relighting algorithm, which takes a portrait image and a target lighting as input and generates a new portrait image under the target lighting condition. There are physically-based relighting methods that explicitly reconstruct the face geometry, reflectance, and lighting, and then re-render this reconstruction using a novel lighting [72, 100-104]. However, single image face reconstruction is still an open problem, and even the state-of-the-art methods make significant errors, e.g., inaccurate estimation of the face geometry. These errors can propagate into the relighting and lead to poor results.
As a result, while these relighting methods are generally good at capturing lighting variations, they may contain artifacts that prevent them from looking realistic. In this work we leverage this property: we use a physically-based relighting method to generate a large-scale training dataset, and then use it to train a generative network to reproduce them while imposing an adversarial loss based only on real photographs. The supervised reconstruction loss allows the network to learn how to relight, while the adversarial loss ensures that the results are on the manifold of real photographs and do not have the errors due to the physically-based relighting method. We first propose a ratio image [105] (RI) based rendering algorithm to generate a large scale, high resolution, ?in the wild? deep portrait relighting dataset (DPR). In this algorithm, an image under a target lighting condition can be represented as 66 the ratio of the target shading and source shading, multiplied by the source image. Face normals and SH lighting of the source image are estimated by 3DDFA [106] and SfSNet [104] respectively. A novel As-Rigid-As-Possible [107] based warping method is then proposed to accurately align the estimated face normal to the portrait image. Spherical Harmonics (SH) [34,35] lighting is then randomly sampled from a lighting prior dataset [102] to relight the portrait image. We apply our proposed RI based algorithm to the high resolution CelebA dataset (CelebA-HQ) [108] and generate 138,135 relighted 1024? 1024 portrait images with known SH lighting. An hourglass network [33] is trained using the proposed DPR dataset for the portrait relighting task. It takes a source image and a target lighting as input and generates the relighted image. It also predicts the SH lighting for the source image using the features from the bottleneck layer to disentangle lighting information from the source image. We notice that the skip connections in the hourglass network prevent the bottle neck layer from learning meaningful facial information. A simple skip training strategy is then proposed to enforce facial information in the bottleneck layer, which can improve the quality of the generated images. Our network is first trained on 512 ? 512 images and then fine tuned on 1024 ? 1024 images. To the best of our knowledge, the proposed method can generate relighted images at the highest resolution among all deep learning based algorithms. We test the proposed method qualitatively on our proposed DPR dataset and flicker portrait dataset [1] and quantitatively on the MultiPie dataset [94]. All these experiments demonstrate that the proposed method can achieve state-of-the-art results both qualitatively and quantitatively. 67 To reiterate, the contributions of the proposed method are threefold. First, we propose a ratio image based algorithm to generate a large scale, high resolution ?in the wild? deep portrait relighting dataset. A novel As-Rigid-As-Possible based warping method is proposed to align the face normals accurately with the face image. Second, we design an automatic single-image portrait relighting algorithm which takes a source image and target SH lighting as input and generates a face image under the target lighting. Third, our trained network can generate 1024 ? 1024 relighted portrait images, which, to the best of our knowledge, is the highest resolution among all deep learning based portrait relighting methods. 
68 4.2 Related Work Quotient Image for Portrait Relighting Shashua and Riklin-raviv [105] pro- posed to use the quotient (ratio) image for portrait relighting. They require mul- tiple reference images as input and assume all these images are in frontal view. Stoschek [109] extended the ratio image to arbitrary pose by aligning facial land- marks of the source and target image. Wen et al [110] proposed to render a new image using the ratio of the radiance environment map. [111] proposed to apply ratio images to real time portrait illumination editing. However, their method requires capturing images of a static subject using a Light Stage apparatus. Due to the success of ratio images in portrait relighting applications, we apply this technique in our data preparation pipeline. Inverse Rendering of Portrait Images Starting with the 3D Morphable Model (3DMM) [67], many inverse rendering methods for portrait images have been pro- posed [72, 100?104, 112, 113]. These methods decompose a portrait image into re- flectance, normal and lighting. A relit portrait image can then be rendered by changing the lighting and keeping the normal and reflectance fixed. [100?102] are optimization based method, which are time consuming. [72,103,104,112,113] are all deep learning based methods. Compared with optimization based methods, they are more time efficient. However, due to the complexity of inverse rendering, all these methods can only work on low resolution images. On the contrary, our pro- posed method, focusing on portrait relighting, can be designed to generate very high resolution (1024? 1024) images. 69 Photo and Portrait Style Transfer Photo and portrait style transfer [1,114?116] takes a source image and a reference image as input and transfers the style of the reference image to the source image. Since lighting can be treated as a kind of style, these methods can also be applied in portrait relighting applications. To generate a high quality portrait image, these methods usually require a high quality, non- occluded reference image that contains the desired lighting with a different subject as input, which limits the possible application scenarios. Different from these methods, our proposed method is a single-image based algorithm, lighting is specified as input and no reference image is required. 70 4.3 Deep Portrait Relighting Dataset In this section, we introduce the Deep Portrait Relighting (DPR) dataset, which is a large scale, high resolution, ?in the wild? image dataset generated for portrait relighting purposes. DPR is built on the high resolution CelebA dataset (CelebA-HQ) published by [108], which contains 30,000 face images from the CelebA [97] dataset with 1024? 1024 resolution. We remove images on which the landmark detector [117] fails to detect landmarks, resulting in 27,627 images in the DPR dataset. For each of these images, we randomly select 5 lighting conditions from a lighting prior dataset [102] to generate relighted face images, leading to 138,135 relighted images. 4.3.1 Ratio Image to Relight Faces We propose a ratio image based algorithm for data generation. Optimization based inverse rendering methods are either too slow [101, 102] or cannot generate high resolution images [72, 104]. Portrait style transfer methods, such as [1, 116], require a reference image as target light source, which is not flexible. To render a face image I, we need the reflectance R, normal N and lighting L. We further assume that the reflectance of human face is Lambertian. 
A face image I can thus be represented as:

I = R \odot f(N, L),    (4.1)

where \odot represents the element-wise product and f is the rendering function. To relight a face image, we apply the ratio image trick proposed in [105]. According to Eq. 4.1, the same face under two different lighting conditions L and L* can be represented as I = R \odot f(N, L) and I* = R \odot f(N, L*). We then have

I* = R \odot f(N, L*) = (R \odot f(N, L)) \odot \frac{f(N, L*)}{f(N, L)} = I \odot \frac{f(N, L*)}{f(N, L)}.    (4.2)

As a result, a portrait image I* under lighting L* can be generated given a portrait image I together with its normals and lighting.
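To make Eq. 4.2 concrete, the following is a minimal NumPy sketch of the ratio-image relighting step. It assumes second-order (9-coefficient) Spherical Harmonics lighting evaluated with the standard real SH basis constants, with any cosine-convolution factors folded into the estimated coefficients, and it relights only a single (luminance) channel, as done later in Section 4.3.3. The function names, the clamping constant and the clipping to [0, 1] are illustrative choices, not part of the original pipeline.

import numpy as np

def sh_basis(normals):
    # Second-order real spherical harmonics basis evaluated at unit vectors.
    # normals: array of shape (..., 3); returns shape (..., 9).
    x, y, z = normals[..., 0], normals[..., 1], normals[..., 2]
    return np.stack([
        0.282095 * np.ones_like(x),
        0.488603 * y,
        0.488603 * z,
        0.488603 * x,
        1.092548 * x * y,
        1.092548 * y * z,
        0.315392 * (3.0 * z * z - 1.0),
        1.092548 * x * z,
        0.546274 * (x * x - y * y),
    ], axis=-1)

def ratio_relight(luminance, normals, sh_source, sh_target, eps=1e-3):
    # Eq. 4.2 applied per pixel on the luminance channel:
    # I_new = I * f(N, L_target) / f(N, L_source).
    basis = sh_basis(normals)                      # (H, W, 9)
    shading_src = basis @ sh_source                # f(N, L)
    shading_tgt = basis @ sh_target                # f(N, L*)
    ratio = shading_tgt / np.maximum(shading_src, eps)
    return np.clip(luminance * ratio, 0.0, 1.0)

# Example with toy inputs (real usage would take refined normals and estimated lighting):
normals = np.dstack([np.zeros((4, 4)), np.zeros((4, 4)), np.ones((4, 4))])
relit = ratio_relight(np.full((4, 4), 0.5), normals, np.random.rand(9), np.random.rand(9))

In the actual pipeline the two shadings would be computed from the refined normals described next and from the source lighting estimated by SfSNet and a target lighting sampled from the lighting prior dataset.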
4.3.2 Normal Estimation
Many research studies target estimating normals from portrait images. We use 3DDFA [118]¹ since it outputs the shape parameters of a 3DMM, which can be used to generate portrait normal images at arbitrary resolution. Although 3DDFA takes facial expression into consideration while fitting the 3DMM, the estimated normals still cannot be accurately aligned with the portrait image. We believe this is due to the limited power of the 3DMM to model variations of face geometry, as the 3DMM is built from a limited number of faces. To better align the estimated normals with the portrait image, and thus avoid possible artifacts in the relighted images, we propose an As-Rigid-As-Possible (ARAP) based normal refinement algorithm.
¹ Code provided by [106].
Figure 4.2: ARAP based normal refinement.
4.3.2.1 ARAP Based Normal Refinement
Figure 4.2 illustrates the procedure of the ARAP based normal refinement algorithm. Using the 3DMM parameters predicted by 3DDFA [106], a mesh can be created. The "reflectance" image of the portrait is obtained by projecting the generic reflectance map of the 3DMM model onto this mesh.² We then apply [117] to detect 68 facial landmarks on this "reflectance" image. These 68 detected facial landmarks, together with 198 points evenly sampled along the boundaries of the image, are combined as "anchor points" and used to create a triangle mesh on the reflectance image using Delaunay triangulation. Similarly, a triangle mesh is created for the portrait image. An As-Rigid-As-Possible (ARAP) transformation [107] is then applied to warp the triangle mesh of the "reflectance" image to the portrait image. The warp function estimated by ARAP is then applied to the face normals estimated by 3DDFA to obtain refined normals, as illustrated in Figure 4.2.
² Note that this is not the real reflectance image of the portrait; it is only used to help refine the face normals.
Figure 4.3: We show the original face overlaid with the normals estimated by 3DDFA [106] in (a) and with our refined normals in (b). The second row of (a) and (b) shows the right eye region.
To demonstrate the effectiveness of the proposed normal refinement method, we overlay the normals estimated by 3DDFA [106] and our refined normals on the original image and show them in Figure 4.3 (a) and (b) respectively. It is clear that the alignment of the normals with the portrait image at the eyes and mouth is improved significantly by the proposed normal refinement method. We notice that our ARAP normal refinement method cannot fix the misalignment of the ear and neck regions. This is because the 3DMM cannot model the deformation of ears and necks well, and, to the best of our knowledge, there is no landmark detection algorithm for ears and necks. As a result, we remove the ear and neck regions from the refined normals to avoid possible artifacts in relighted images.
In order to get a full normal image, we solve a Poisson equation to fill in the missing normals for the ear, neck, mouth and background regions, as suggested by [72]. Figure 4.3 (c) shows the normals after filling in the missing regions.
Figure 4.4: The first column is the original image; the second to fourth columns of the first row are relighted images generated by our rendering pipeline; the second row shows the half sphere rendered using the corresponding SH lighting.
4.3.3 Relighting Images
For a portrait image I*, we apply our method to estimate its normals N and use SfSNet [104] to estimate its SH lighting L*. A target SH lighting L is then randomly sampled from the lighting prior dataset [102], and Eq. 4.2 is used to generate the relighted face image I. Due to the color ambiguity between lighting and reflectance, we apply the rendering pipeline to the luminance channel only and keep the color of the portrait image unchanged. We show some examples of relighted face images in Figure 4.4.
Figure 4.5: Our network takes a portrait image and a target SH lighting as input; it outputs the SH lighting of the input portrait image and generates a new, relit portrait image under the target SH lighting.
4.4 Method
In this section, we introduce our proposed deep learning based single-image portrait relighting algorithm. We design an hourglass network for this task and use the DPR dataset created in Section 4.3 to train the network.
4.4.1 Main architecture for portrait relighting
Figure 4.5 shows the structure of our proposed hourglass network [33]. It has an encoder part and a decoder part. Four skip connections connect the features at different scales in the encoder to their corresponding scales in the decoder. To relight a face, our network takes a face image I and a target lighting L* as input. The encoder extracts a feature Z, which is divided into two parts: a face feature Zf, which is independent of lighting, and a lighting feature Zs. Zs is fed into a lighting regression network to predict the lighting L of the input face image I. The target lighting L* is mapped to a lighting feature Z*s. Zf and Z*s are then concatenated and fed into the decoder to generate the relighted face image.
4.4.2 Supervision for training the network
As discussed in Section 4.3, our data preparation process generates 5 relighted images with known ground truth lighting for each image in the CelebA-HQ dataset. To generate one training sample, we randomly select a source image Is and a target image It, with their corresponding ground truth SH lightings Ls and Lt, from these 5 relighted images. Our network takes the source image Is and the target lighting Lt as input and generates L*s and I*t. Ls and It are used as ground truth to supervise the training. We apply an L1 loss to the generated portrait image I*t and an L2 loss to the predicted lighting L*s. An L1 loss is further applied to the gradient of I*t to preserve edges and avoid blurring:

L_I = \frac{1}{N_I}\left( \|I_t^* - I_t\|_1 + \|\nabla I_t^* - \nabla I_t\|_1 \right) + (L_s^* - L_s)^2,    (4.3)

where N_I is the number of pixels in the image.
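For concreteness, the supervised portion of the objective (Eq. 4.3) can be written as the following PyTorch sketch. The finite-difference approximation of the image gradient and the function names are assumptions on our part; the adversarial term discussed next is not included here.

import torch
import torch.nn.functional as F

def grad_xy(img):
    # Finite-difference image gradients (one common way to realize the gradient term in Eq. 4.3).
    return img[..., :, 1:] - img[..., :, :-1], img[..., 1:, :] - img[..., :-1, :]

def supervised_loss(pred_img, gt_img, pred_sh, gt_sh):
    # L1 on the image, L1 on its gradients, squared error on the 9-D SH lighting.
    gx_p, gy_p = grad_xy(pred_img)
    gx_t, gy_t = grad_xy(gt_img)
    loss_img = F.l1_loss(pred_img, gt_img) + F.l1_loss(gx_p, gx_t) + F.l1_loss(gy_p, gy_t)
    loss_sh = ((pred_sh - gt_sh) ** 2).sum(dim=-1).mean()
    return loss_img + loss_sh

# Example shapes: a batch of 2 relit luminance images and their 9-D target lightings.
loss = supervised_loss(torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64),
                       torch.rand(2, 9), torch.rand(2, 9))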
Since our "ground truth" images are generated using the ratio image trick, they may contain artifacts due to inaccurate estimation of the face normals or lighting. We thus propose to use a GAN loss to improve the quality of the generated images. As these artifacts mostly appear locally, we use a patch GAN [119] to force the distribution of local image patches to be close to that of natural images. We follow the implementation of [119] (LSGAN [120]) and use an MSE criterion for our GAN loss:

L_{GAN} = E_I\left[(1 - D(I))^2\right] + E_{I_s}\left[D(G(I_s, L_t))^2\right],    (4.4)

where I is a real image, and G and D represent our relighting network and discriminator respectively. We use 1 as the label for real images and 0 as the label for fake images. During training, we use images from the FFHQ dataset [121] as the real images in our GAN loss, since this dataset contains more lighting variation.
A feature matching loss is further proposed to increase the accuracy of the relighted portrait images. More specifically, images of the same face under different lighting conditions should have the same face feature; we thus define a feature loss as

L_F = \frac{1}{N_F}\left(Z_{f1} - Z_{f2}\right)^2,    (4.5)

where Z_{f1} and Z_{f2} are the face features of I_{s1} and I_{s2}, and N_F is the number of elements in the feature Zf.
4.4.3 Skip Training
When the hourglass network is trained end-to-end (denoted as vanilla hourglass), we notice that most of the facial information is passed through the skip layers; our facial feature Zf, on the other hand, contains little facial information. We thus propose a skip training strategy in which we first train our network without skip connections and then add the skip layers one by one during subsequent training. We denote this as skip training. Figure 4.6 compares the relighted images generated by removing the skip layers of the vanilla hourglass network and of the hourglass network trained with skip training. We can see that with the skip training strategy, more facial information is kept in the feature layer. Figure 4.7 further demonstrates that skip training can improve the quality of the generated results by removing artifacts around the nose. In the following discussion, unless otherwise specified, our network is trained with skip training.
Figure 4.6: From left to right: output without skip layer S4, output without S4/S3, output without S4/S3/S2, output without S4/S3/S2/S1. Top row: vanilla hourglass network; bottom row: hourglass network with skip training.
Figure 4.7: (a) output of the vanilla hourglass network, (b) rectangle region of (a), (c) output of the hourglass network with skip training, (d) rectangle region of (c). We increase the pixel intensity of (b) and (c) for visualization purposes.
4.4.4 Implementation Details
We discuss the implementation details of the proposed method in this section.
4.4.4.1 Network Structure
h1, h2, h3 and h4 are down-sampling layers followed by residual blocks defined in [6]. h5, h6, h7 and h8 are residual blocks [6] followed by upsampling layers. s1, s2, s3 and s4 are residual blocks [6]. For convenience, we define one convolutional block as one convolutional layer followed by a batch normalization layer and a ReLU activation. c1 consists of one convolutional block. c2 consists of three convolutional blocks (denoted as c2_1, c2_2, c2_3) followed by one convolutional layer (denoted as c2_o). More details of these blocks are shown in Table 4.1. Note that the output of h4 has 155 channels, of which 128 channels belong to the face feature Zf and 27 channels belong to the lighting feature Zs.
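The following is a drastically simplified PyTorch sketch of the architecture in Figure 4.5: an encoder-decoder acting on the luminance channel whose bottleneck is split into a face feature Zf and a lighting feature Zs, with a skip connection that can be switched on during training as in the skip training strategy. The channel counts and block structure are toy values and do not match Table 4.1 below; this is an illustration of the wiring, not the actual network definition.

import torch
import torch.nn as nn

class ToyRelightNet(nn.Module):
    # A heavily simplified stand-in for the hourglass of Figure 4.5.
    def __init__(self, face_ch=12, light_ch=4):
        super().__init__()
        self.face_ch = face_ch
        self.enc1 = nn.Sequential(nn.Conv2d(1, 16, 3, 2, 1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(16, face_ch + light_ch, 3, 2, 1), nn.ReLU())
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(face_ch + light_ch, 16, 4, 2, 1), nn.ReLU())
        self.dec1 = nn.ConvTranspose2d(16, 1, 4, 2, 1)
        self.sh_out = nn.Linear(light_ch, 9)   # predict the source SH from Zs
        self.sh_in = nn.Linear(9, light_ch)    # map the target SH to a lighting feature

    def forward(self, lum, target_sh, use_skip=False):
        f1 = self.enc1(lum)
        z = self.enc2(f1)
        zf, zs = z[:, :self.face_ch], z[:, self.face_ch:]
        pred_sh = self.sh_out(zs.mean(dim=(2, 3)))                       # lighting of the input
        zs_tgt = self.sh_in(target_sh)[:, :, None, None].expand_as(zs)   # repeat spatially
        d = self.dec2(torch.cat([zf, zs_tgt], dim=1))
        if use_skip:                     # enabled only in later epochs under skip training
            d = d + f1
        return self.dec1(d), pred_sh

net = ToyRelightNet()
out, sh = net(torch.rand(2, 1, 64, 64), torch.rand(2, 9), use_skip=True)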
Block                  h1    h2    h3    h4     h5    h6    h7    h8
input channel number   16    16    32    64     155   64    32    16
output channel number  16    32    64    155    64    32    16    16
filter size            3     3     3     3      3     3     3     3

Block                  s1    s2    s3    s4    c1    c2_1   c2_2   c2_3   c2_o
input channel number   64    32    16    16    1     16     16     16     16
output channel number  64    32    16    16    16    16     16     16     1
filter size            3     3     3     3     5     3      1      1      1

Table 4.1: Details of each block of our network.
The lighting prediction network, which takes Zs as input and predicts L, is defined as an average pooling layer followed by two fully connected layers with 128 and 9 channels respectively. The network that maps the target lighting L* to the lighting feature Z*s is defined as two fully connected layers with 128 and 27 channels. The 27-dimensional lighting feature is then repeated spatially so that it has the same spatial resolution as Zs, as illustrated in Figure 4.8.
Figure 4.8: Illustration of repeating the lighting feature spatially.
4.4.4.2 Training Details
The overall loss for our network is a linear combination of the losses described in Sec. 4.4.2:

L = L_I + L_{GAN} + \lambda L_F,    (4.6)

where λ = 0.5. Our network is trained for 14 epochs. We add the feature loss L_F after 10 epochs. For skip training, we train our network without any skip connections for 5 epochs, and add one skip connection at each epoch after the fifth epoch, until all skip layers are added. We first train our network with images of resolution 512 × 512; most of our experiments are carried out at this resolution. Finally, we fine-tune the trained network using images of resolution 1024 × 1024 with a simple modification: an additional downsampling and upsampling layer is added to make our network compatible with 1024 × 1024 images, as shown in Figure 4.9. We train our network from scratch using the Adam optimizer [122] with default parameters.
Figure 4.9: Network structure for 1024 × 1024 images.
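Putting the training objective together, the sketch below shows how Eq. 4.6 and the LSGAN objective of Eq. 4.4 could be organized in PyTorch, together with the epoch schedule described above. Splitting Eq. 4.4 into separate discriminator and generator updates is standard LSGAN practice but is an assumption about the exact implementation; the helper names, the precomputed supervised loss l_i and the patch-score tensors are all illustrative.

import torch
import torch.nn.functional as F

def lsgan_d_loss(d_real, d_fake):
    # Discriminator side of the LSGAN objective (Eq. 4.4): real patches -> 1, fake patches -> 0.
    return ((d_real - 1.0) ** 2).mean() + (d_fake ** 2).mean()

def lsgan_g_loss(d_fake):
    # Generator side: push the discriminator's scores on generated patches toward 1.
    return ((d_fake - 1.0) ** 2).mean()

def generator_objective(l_i, d_fake, zf1, zf2, epoch, lam=0.5):
    # Eq. 4.6: L = L_I + L_GAN + lambda * L_F, with the feature loss enabled after 10 epochs.
    loss = l_i + lsgan_g_loss(d_fake)
    if epoch >= 10:
        loss = loss + lam * F.mse_loss(zf1, zf2)
    return loss

def num_active_skips(epoch, total_skips=4):
    # Skip training schedule: no skip connections for 5 epochs, then one more per epoch.
    return min(max(epoch - 5, 0), total_skips)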
4.5 Experiments
In this section, we evaluate our proposed method both quantitatively and qualitatively and compare it with some of the state-of-the-art methods. Since our network can predict lighting, it can be used in two ways for portrait relighting: (A) given a source image Is and an SH lighting Lt, generate an image It (denoted as the SH-based way); (B) given a source image Is and a reference image If, extract the SH lighting Lt from If and use it to relight Is to get It (denoted as the image-based way). Our network is designed to be used as in (A). For that reason, when the target SH lighting Lt is known (e.g., on our DPR dataset) we use (A) for relighting. For datasets such as MultiPie [94], in which the ground truth SH lighting is unknown, we use (B) for relighting.
4.5.1 Dataset and Evaluation Metric
Dataset: We demonstrate the effectiveness of the proposed method on the test set of our proposed DPR dataset. However, due to the lack of real ground truth, we cannot evaluate the accuracy of the relighted images using this dataset. We thus propose to use the MultiPie dataset [94] for quantitative evaluation. The MultiPie dataset contains images of the same person under different lighting conditions, which can be used as source and target image pairs. Each MultiPie image is lit by a dominant point light source, while the lighting conditions of most "in the wild" portrait images are diffuse. We thus create 7 lighting conditions by averaging 3 to 4 original face images from MultiPie, so as to obtain images under more realistic, diffuse lighting conditions. We created 440 groups of images from these generated face images, each of which contains a source image Is, a target image It and a reference image If. Is and It are images of the same identity but under different lighting conditions; It and If are images of different identities but under the same lighting condition. When evaluating, a relighting algorithm takes Is and If as input and predicts It.
Evaluation metric: Since lighting is ambiguous up to a scale (e.g., a longer exposure time may lead to an SH with higher energy under the same lighting condition), we propose to use a scale-invariant Mean Squared Error (Si-MSE) [31] to evaluate the error between the generated image I*t and the ground truth image It:

Si-MSE = \frac{1}{N_I} \min_{\alpha} \left( I_t - \alpha \cdot I_t^* \right)^2,    (4.7)

where α is a scalar and N_I is the number of pixels in the image. To further check whether the generated image portrays the target lighting, we run SfSNet [104] to extract the lightings Lt and L*t from It and I*t respectively, and compute the scale-invariant L2 (Si-L2) distance between L*t and Lt. We choose SfSNet [104] since it has been shown to predict consistent lighting for face images taken under the same lighting condition.
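Both metrics reduce to a one-dimensional least-squares problem in the scale α, which has a closed-form solution. The following NumPy sketch is a hedged illustration of how Si-MSE (Eq. 4.7) and Si-L2 could be computed; the closed-form α and the use of a sum over the 9 SH coefficients for Si-L2 are assumptions, since the text does not spell these details out.

import numpy as np

def si_mse(gt, pred, eps=1e-8):
    # Scale-invariant MSE (Eq. 4.7): solve for the scalar alpha that best matches
    # pred to gt in the least-squares sense, then report the per-pixel MSE.
    alpha = (gt * pred).sum() / max((pred * pred).sum(), eps)
    return float(np.mean((gt - alpha * pred) ** 2))

def si_l2(gt_sh, pred_sh, eps=1e-8):
    # The same idea applied to two 9-D SH lighting vectors (the Si-L2 metric).
    alpha = (gt_sh * pred_sh).sum() / max((pred_sh * pred_sh).sum(), eps)
    return float(((gt_sh - alpha * pred_sh) ** 2).sum())

# Example: compare a generated image against ground truth up to an exposure scale.
err = si_mse(np.random.rand(256, 256), np.random.rand(256, 256))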
4.5.2 Ablation Study
To demonstrate the effectiveness of the GAN loss and the feature loss, we show the quantitative and qualitative results of our network trained using L_I, L_I + L_GAN and L_I + L_GAN + L_F (i.e., the full model) in Table 4.2 and Figure 4.10.

                     Si-MSE     Si-L2
L_I                  0.00504    0.1307
L_I + L_GAN          0.00658    0.1686
L_I + L_GAN + L_F    0.00590    0.1444

Table 4.2: Ablation study on the MultiPie dataset.
Figure 4.10: From left to right, first row: the input image, and images generated using L_I, L_I + L_GAN and L_I + L_GAN + L_F; second row: the red rectangle region of the corresponding image in the first row. Note the edge in the middle of the nose generated using L_I.
We notice that with the GAN loss, the accuracy of our trained network is worse than that of the network trained without it. This is because the GAN loss is used to make the distribution of the generated images closer to that of real images, i.e., to improve the visual quality of the generated images. Adding the GAN loss may thus distract the training process from matching the "ground truth" images as closely as possible. However, Figure 4.10 shows that with the GAN loss, the artifacts on the nose are alleviated compared with the network trained without it. This demonstrates the effectiveness of the GAN loss in improving the visual quality. Adding the feature loss L_F significantly improves the accuracy of the images generated by our model, as shown in Table 4.2. We believe this is because the feature loss forces generated images of the same identity to have similar latent features, thus better preserving the identity information in the generated images. Moreover, Figure 4.10 shows that the feature loss does not affect the quality of the generated images. As a result, we conclude that our full model achieves a good balance between the accuracy and the quality of the generated images.
4.5.3 Comparison with the Rendering Pipeline
Our proposed ARAP based normal refinement method reduces the misalignment of the face normals, as discussed in Section 4.3. However, there are still cases in which the face normals cannot be perfectly aligned with the face image, especially in the nose and mouth regions. These misalignments can cause ghost effects on the nose and artificial highlights at the corner of the mouth, as shown in Figure 4.11. Though our training data contains images with these artifacts, Figure 4.11 shows that these artifacts can be avoided by the proposed method. This is because a deep learning based method can regularize the results, avoiding outlier effects.
Figure 4.11: (a) original image, (c) results of RI based rendering, (e) our results. (b), (d) and (f) show the red rectangle regions of (a), (c) and (e) respectively. Note that the proposed method removes the ghost effect and the artificial highlights.

Method               Si-MSE     Si-L2
Li et al [115]       0.01322    0.3939
Shih et al [1]       0.01513    0.3415
Shu et al [116]      0.01384    0.3908
SfSNet [104]         0.00659    0.1593
Proposed Method      0.00590    0.1444

Table 4.3: Evaluation on the MultiPie dataset.
4.5.4 Comparison with State-of-the-art Methods
In this section, we compare the proposed method with [1, 104, 115, 116], which can also perform portrait relighting. Since there is no ground truth lighting for images in the MultiPie dataset, we use the image-based way to evaluate our proposed method and SfSNet [104] on this dataset, i.e., the target lighting is extracted from the reference image and used for relighting. Both the proposed method and SfSNet [104] use their own lighting estimation method to extract the target lighting. [116] and [1] are two state-of-the-art portrait style transfer methods. They take two images Is and If as input and transfer the style of the reference image If to the source image Is. To get relighted images using these two methods, we convert Is and If from RGB to Lab and only apply their algorithms to the L channel. [115] is designed for general photo style transfer, and, similar to [116] and [1], we use the L channel for portrait relighting.
We quantitatively compare the performance of our proposed method with these methods on the MultiPie dataset and show the results in Table 4.3. Our proposed method achieves state-of-the-art results on both the Si-MSE and Si-L2 metrics. This demonstrates that the proposed method can accurately generate relighted images under the target lighting condition. [116] and [1] both require accurately detected facial landmarks. The built-in facial landmark detector [123] of [116] fails to detect landmarks in 90 testing face images, so we exclude those results when computing Si-MSE and Si-L2 for [116] in Table 4.3. We show some examples of relighted faces from the MultiPie dataset in Figure 4.12.
Figure 4.12: Visual results of the proposed method and state-of-the-art methods on MultiPie (columns: source, reference, ground truth, proposed, SfSNet [104], Shih et al [1], Shu et al [116], Li et al [115]).
We visually compare the proposed method with these state-of-the-art methods on the DPR dataset and show the results in Figure 4.13. Since the target lighting is known in this dataset, we apply the SH-based way to evaluate the proposed method and SfSNet [104]. We see that although SfSNet [104] can generate images under the correct lighting condition, its results are of low quality. Also, SfSNet [104] works on 128 × 128 images, which is too small for portrait relighting applications. Furthermore, SfSNet cannot deal with the background correctly, making the results visually unpleasant. [1], [116] and [115] do not generate images under the correct lighting in these examples. These three methods are all reference-image based; when the reference image is of low quality (e.g.
occluded by hair region or sunglasses), they fail to understand the lighting correctly. As a result, they cannot generate images with accurate lighting conditions. We believe this is the common drawback for all methods which require a reference image as input. On the contrary, the proposed method and SfSNet [104] can directly take the target lighting as input, no referene image is required. Moreover, we notice that [1] and [116] cannot generate attached shadows on the nose, whereas the proposed method can generate very natural attached shadows. From these experiments, we conclude that the proposed method outperforms the state-of-the-art methods both quantitatively and qualitatively, demonstrating the effectiveness of the proposed method. 89 (A) (B) (C) reference/target SH input our SfSNet [104] Shih et al [1] Shu et al [116] Li et al [115] Figure 4.13: Qualitative comparison of the proposed method with state-of-the-art methods. The first column in (A) (B) and (C): first row is the reference image, second row is the target SH. The second column in (A), (B) and (C) shows the input image, third to seventh columns show the results of our method, SfSNet [104], Shih et al [1], Shu et al [116] and Li et al [115]. 90 4.5.5 Results on challenging images In Figure 4.5.5 we show our results on some challenging images. We notice that the proposed algorithm performs well on images with non-frontal faces, faces with occlusions and even faces with makeup. 4.5.6 Visual Results on High Resolution Images We fine tune our network on 1024 ? 1024 images and test the trained model on the flicker portrait dataset [1]. Figure 4.15 shows some of the results. 91 Figure 4.14: Some challenging examples. The first row shows the target SH lighting. The first column of the remaining rows shows the input image, the other columns show relighted images by the proposed method under target SH lighting. Our pro- posed method can deal with non-frontal faces, faces with occlusions and faces with makeup well. 92 Figure 4.15: Results on flicker portrait dataset [1]. The first column shows the input image, the remaining columns show our relighted images using the corresponding target lighting. 93 4.6 Summary In this chapter, we proposed an automatic single-image portrait relighting algorithm. We first apply a ratio image based relighting algorithm to generate a large scale, high quality, ?in the wild? deep portrait relighting dataset. We then design a hourglass network that takes a source portrait image and a target lighting as input and generate a relighted portrait image. We train our network on the proposed DPR dataset, and find that deep network training can regularize the results, removing the artifacts in relighted images generated by ratio image based relighting. Moreover, our network can generate images with resolution 1024?1024. Extensive experiments demonstrate the effectiveness of the proposed method. 94 Chapter 5: GLoSH: Global-Local Spherical Harmonics for Intrinsic Image Decomposition (a) (b) (c) (d) (e) (f) (g) (h) (i) (j) Figure 5.1: Top row: result of [2]. [2] takes an RGB-D image as input and pre- dicts: (b) reflectance, (c) shading, (d) normal, (e) lighting. Bottom row: result of our method. It takes an RGB image as input and predicts: (g) reflectance, (h) shading, (i) normal and (j) lighting. The red boxes show that our algorithm cor- rectly predicts cast shadows and highlights to shading while [2] incorrectly predict them to reflectance. 
Our lighting (j) captures local lighting variation better than (e) from [2]. In this chapter, we introduce our research work: intrinsic image decomposition from a natural scene. Different from conventional methods that decompose an image into reflectance and shading, the proposed method further decomposes shading into geometry (normal) and lighting. A Global-Local Spherical Harmonics is proposed to model the lighting of a natural scene. Due to the lack of real labels for reflectance, 95 shading and lighting, synthetic data is used for this problem. We demonstrate the power of our proposed lighting model and synthetic data through experiments. This work [124] is in collaboration with Xiang Yu and David W. Jacobs. 5.1 Introduction Understanding the physical world that produces an image is a core problem in computer vision. [125] first proposed to estimate the intrinsic scene characteristics from images, including range, orientation, reflectance and incident lighting. This is a notoriously difficult inverse problem as it is highly under-constrained. Moreover, we lack models of the physical components of the problem, such as lighting, that are both accurate and easy to use. Early works start with investigating the reflectance, shape and illumination of a single object [31,126], as the lighting for a single object is easier to model, for instance, by using a single set of low dimensional Spherical Har- monics [34,35]. The lighting of a natural scene, however, is much more complicated due to its spatial variation caused by shadow, inter-reflection and the presence of light sources in the scene. As a result, most works that address scenes have lumped normal and lighting together as shading, and try to recover that, known as intrinsic image decomposition. In this work, we propose a new representation of lighting for scenes, which al- lows us to disentangle lighting and surface normals, while also recovering reflectance. One way to model lighting is Spherical Harmonics (SHs) [34,35], which approximates the lighting with 9 low frequency components. While this works well for modeling 96 the lighting of small objects, such as faces [34], such a global lighting cannot capture the spatially varying lighting in a complex scene, as shown in Figure 2 (e). Allowing independent lighting in each pixel, however, creates too many degrees of freedom and would allow lighting variation alone to explain the image. To overcome the problem, we propose a Global-Local SHs (GLoSH) lighting model. Our global SH represents the holistic lighting of the entire scene. On top of it, the local SHs, produced by the sum of global SH and local residual SHs, account for the spatial variation of the lighting. An L2 regularization on the local residual SHs limits the effects of over-parameterization. Figure 5.2 (c) shows our GLoSH and Figure 5.2 (f) shows the reconstructed shading, which is much closer to ground truth than only using global SH. Spherical Harmonics with arbitrary coefficients would represent lighting in a physically unrealistic way, if the lighting is negative in some directions. Nevertheless, enforcing non-negative SH lighting is not trivial. Existing methods either introduce many more parameters to constrain non-negative lighting [34] or require solving a semi-definite programming problem [127], which is difficult to directly incorporate with deep networks. In this work, we propose to sample the intensity of the lighting uniformly distributed on a sphere generated from the predicted SH. 
A non-negative loss is then defined on the sampled lighting. Our non-negative constraint is only applied to global SH, because practically the local residual SHs regularized by L2 are not likely to change the sign of the lighting. We apply a CNN to achieve an end-to-end coarse-to-fine solution. Training deep CNNs requires huge amounts of data and ground truth labels, and labeling 97 (a) synthetic image (b) global SH (c) GLoSH (d) GT shading (e) shading w.r.t. (b) (f) shading w.r.t. (c) Figure 5.2: Visualization of global SH modeling (b) and its reconstructed shading (e), comparing to our GLoSH (c) and its reconstructed shading (f). With GLoSH, clearly our method generates the shading much closer to ground truth. images for reflectance and lighting is extremely difficult. Intrinsic Images in the Wild (IIW) [128] labels the relative darkness of the reflectance from pairs of pix- els. Shading Annotations in the Wild (SAW) [129] labels constant shading regions, shadow boundaries and depth/normal discontinuities. However, these datasets only provide sparse labels and a limited number of images. Inspired by the recent success of synthetic data on computer vision applications, we propose to use the synthesized SUNCG dataset [130], in which ground truth reflectance, normal and shading can be easily determined, to pre-train the models. The pre-trained model is then further trained with real data in a self-supervised way. To sum up, we propose a Global-Local Spherical Harmonics (GLoSH) lighting model, and apply a coarse-to-fine CNN structure to predict GLoSH together with reflectance and normal. The synthetic data pre-training and self-supervised training with real data lead to state-of-the-art performance across three real scene datasets, 98 IIW, SAW and NYUv2. The contributions of our work are the following. 1. We propose a GLoSH lighting model with global and local SHs, and a novel non-negative constraint to estimate physically realistic lighting. 2. To the best of our knowledge, under a single RGB image setting, we are the first to apply CNNs to jointly estimate reflectance, normal and lighting. 3. We propose a coarse-to-fine network that is compatible with our proposed global-local lighting model. 4. Our method achieves the best results on IIW reflectance, the second best on SAW shading, and strongly competitive performance on NYUv2 normal. Notice that the state-of-the-art methods only focus on one or two components, while our method jointly estimates reflectance, normal and lighting. 99 5.2 Related Work Intrinsic Image Characteristics. We categorize the literature into two main streams: single object based and natural scene based methods. Researchers have long been studying the estimation of intrinsic image characteristics for a single ob- ject. For example, shape from shading [131, 132] focuses on recovering the shape assuming illumination and reflectance are known. Photometric stereo [133] esti- mates geometry from multiple images assuming known lighting. Recent progress in photometric stereo [126] can estimate geometry and lighting up to a bas-relief transformation [134]. [135?137] proposed to decompose a single object image into its reflectance and shading. [31] and [138] proposed to jointly estimate reflectance, shape and lighting from a single object image. Estimating a natural scene is more difficult due to more complicated geometry and lighting. 
Recent studies [139, 140] show the capability of accurately estimating scene geometry thanks to large scale training data and the success of deep learning. Some advanced methods [2, 128, 141-144] proposed optimization based approaches to decompose an image into reflectance and shading, where [2, 141, 142] require depth to be known. Most recent work [145-154] applies deep Convolutional Neural Networks (CNNs) to this task and achieves impressive performance. [153] proposed to render realistic synthetic data, use it to train the deep models, and then adapt to the real dataset. Our work follows a similar idea. However, we not only estimate reflectance and shading, but also further decompose the shading into normal and lighting.
Barron and Malik [2] first proposed to estimate reflectance, depth/normal and lighting from RGB-D images. They model the lighting at each pixel as a linear combination of eight sets of Spherical Harmonics. In contrast, we jointly estimate reflectance, surface normals and lighting from a single RGB image without depth, which is a much harder problem. Moreover, we propose our Global-Local SHs (GLoSH) with a coarse-to-fine neural network to represent the lighting at each pixel, which accounts for not only the holistic lighting but also the local lighting variations.
Non-negative Spherical Harmonics. When using Spherical Harmonics lighting, one challenge is how to enforce the lighting to be non-negative. [34] proposed to represent lighting using a non-negative linear combination of delta functions to solve this problem. One drawback of this method is that many delta functions are needed for an accurate representation. [127] proved that the Toeplitz matrix of a non-negative SH is positive semi-definite, and proposed to solve a semi-definite programming (SDP) problem to enforce non-negative lighting. However, the SDP constraint is not easily incorporated into deep network training. In contrast, we formulate a non-negative lighting loss by sampling hundreds of points on a predicted lighting sphere, which is computationally efficient and fits naturally into network training.
5.3 Reflectance, Normal and Shading from a Single RGB Image
Intrinsic image decomposition assumes an image I to be the product of reflectance R and shading S, i.e., I = R \odot S, where \odot represents the element-wise product. Most research focuses on decomposing an image I into R and S, where geometry and lighting remain entangled in the shading. In our work, we propose to further decompose the shading S into surface normal (i.e., geometry) N and lighting L. Assuming S = \lambda(N, L), an image I can be represented as

I = R \odot \lambda(N, L),    (5.1)

where \lambda is a rendering function. Our target is to estimate R, N and L given a single image I.
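The rendering function λ in Eq. 5.1 is, throughout this chapter, the Lambertian second-order SH shading of [34, 35]. The following NumPy sketch shows one possible implementation for a single color channel; it accepts either a single global 9-D SH vector or a per-pixel SH map, anticipating the GLoSH model introduced next. The basis constants are the standard real SH normalization factors, treating the cosine-convolution factors as folded into the coefficients is an assumption of this sketch, and all names are illustrative.

import numpy as np

def sh_basis(normals):
    # Second-order real SH basis at unit normals; shape (..., 3) -> (..., 9).
    x, y, z = normals[..., 0], normals[..., 1], normals[..., 2]
    return np.stack([
        0.282095 * np.ones_like(x), 0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z, 0.315392 * (3.0 * z * z - 1.0),
        1.092548 * x * z, 0.546274 * (x * x - y * y)], axis=-1)

def render_shading(normals, sh):
    # lambda(N, L): shading from normals and SH lighting.  `sh` may be a single
    # 9-vector (a global SH) or a per-pixel (H, W, 9) map as in the GLoSH model.
    basis = sh_basis(normals)                              # (H, W, 9)
    if sh.ndim == 1:
        return basis @ sh
    return np.einsum('hwk,hwk->hw', basis, sh)

def reconstruct_image(reflectance, normals, sh):
    # Eq. 5.1: I = R (elementwise) lambda(N, L); the shading is broadcast over color channels.
    return reflectance * render_shading(normals, sh)[..., None]

In the full model the lighting has 9 coefficients per color channel (27 in total), so in practice this computation would be applied to each channel separately.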
5.3.1 GLoSH Lighting Modeling
While a single, global set of low-dimensional SH coefficients has been used to represent the lighting of objects, it is unable to capture the complex lighting conditions of a scene. On the other hand, estimating SHs for each pixel easily leads to over-parameterization. We propose a neural network based Global-Local Spherical Harmonics (GLoSH) model, where the global SHs serve as a low frequency approximation of the lighting, and local SH residuals account for its spatial variation. A coarse-to-fine neural structure is designed to implement this global and local lighting modeling.
5.3.1.1 Global and Local Spherical Harmonics
Following [34, 35], we propose to use SH up to the second order, resulting in a 9-dimensional SH for each color channel. Denote the global SH as L_c ∈ R^9; L_c is predicted by our coarse-level network. As revealed in Figure 5.2, based only on the global SH L_c, the shading is far from satisfactory, lacking much spatial variation. To better model the spatial variation of the lighting, we predict local SH residuals for each pixel with a fine-level network. Our local SH is then formulated as the global SH plus the local SH residual:

L_f = L_c + \Delta L_f,    (5.2)

where \Delta L_f represents the local residual SH predicted by the fine-scale network.
5.3.1.2 Non-negative Constraints on SH
Physically realistic lighting requires non-negative SH lighting, which previous work [2, 31] does not properly consider. To enforce non-negative SH lighting, we propose a simple yet effective constraint on the SH. According to [34], given an SH coefficient vector L_c, the lighting intensity in a direction (θ, φ) is a function of L_c, i.e., f_L(L_c, θ, φ). Non-negative lighting means f_L(L_c, θ, φ) ≥ 0 for all 0 ≤ θ ≤ π, 0 ≤ φ ≤ 2π. Based on this, we uniformly sample the value of the function f_L on a unit sphere and constrain all the sampled values to be non-negative. The non-negative loss function is thus defined as

L_{L_c} = \frac{1}{K} \sum_{i=1}^{K} \min\left(0, f_L(L_c, \theta_i, \phi_i)\right)^2,    (5.3)

where K is the number of directions sampled on the sphere. We apply the above non-negative constraint to the global SH. We further apply an L2 regularization to the local residual SH:

L_{L_f} = \|\Delta L_f\|_2^2.    (5.4)

This regularization penalizes the L2 norm of the residuals, encouraging the local lighting not to vary too much from the global lighting. Our experiments demonstrate that Equation 5.4, together with the non-negative constraint on the global SH, is sufficient to guarantee non-negative lighting for the local SHs.
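To illustrate Eq. 5.3, here is a hedged PyTorch sketch of the sampling-based non-negativity penalty. It draws directions approximately uniformly on the sphere by normalizing Gaussian samples (the text only states that directions are sampled uniformly, so this particular sampler, the number of samples and the function names are assumptions), evaluates the SH lighting in those directions with the standard real SH basis, and penalizes the squared magnitude of any negative values. Because every step is differentiable, the term can simply be added to the coarse network's training loss.

import torch

def sh_basis_dirs(dirs):
    # Second-order real SH basis at unit directions; (K, 3) -> (K, 9).
    x, y, z = dirs[:, 0], dirs[:, 1], dirs[:, 2]
    return torch.stack([
        0.282095 * torch.ones_like(x), 0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z, 0.315392 * (3.0 * z * z - 1.0),
        1.092548 * x * z, 0.546274 * (x * x - y * y)], dim=-1)

def nonneg_sh_loss(global_sh, num_dirs=256):
    # Eq. 5.3: sample directions on the unit sphere, evaluate the lighting there,
    # and penalize the squared magnitude of any negative values.
    dirs = torch.randn(num_dirs, 3)
    dirs = dirs / dirs.norm(dim=1, keepdim=True)          # roughly uniform on the sphere
    values = sh_basis_dirs(dirs) @ global_sh              # f_L(L_c, theta_i, phi_i)
    return torch.clamp(values, max=0.0).pow(2).mean()

# Differentiable, so it can be added to the coarse network's loss:
sh = torch.randn(9, requires_grad=True)
loss = nonneg_sh_loss(sh)
loss.backward()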
5.3.2 Coarse-to-fine Network Structure
Figure 5.3: Our coarse-to-fine network structure. The coarse net predicts the first-level reflectance, lighting and surface normal, of which the latter two form the shading. The fine net takes the stacked previous outputs as input and predicts residuals of the reflectance, lighting and surface normal. The final reflectance, lighting and normal are recovered by adding the predicted residuals to the first-level results.
To exactly match the proposed GLoSH lighting model, we design the coarse-to-fine network structure shown in Figure 5.3. The coarse network is defined as an hourglass network [155]. It takes an image x ∈ R^{64×64×3} as input and predicts the reflectance R_c ∈ R^{64×64×3}, the normal N_c ∈ R^{64×64×3}, and the global SH L_c ∈ R^9. The shading S_c ∈ R^{64×64×3} can then be constructed by a simple rendering function:

S_c = \lambda(N_c, L_c).    (5.5)

The fine-scale network is designed with a fully convolutional structure. It takes x ∈ R^{128×128×3} and the upsampled R_c, N_c and S_c at resolution 128 × 128 as input and predicts residual maps. The recovered local reflectance, normal and local SHs are:

R_f = R_c + \Delta_f^R(R_c, N_c, L_c),
N_f = N_c + \Delta_f^N(R_c, N_c, L_c),    (5.6)
L_f = L_c + \Delta_f^L(R_c, N_c, L_c),

where \Delta_f^R, \Delta_f^N and \Delta_f^L represent the fine-level networks for reflectance, normal and lighting respectively. The fine-scale shading is calculated by Equation 5.5 as S_f = \lambda(N_f, L_f). The fine-scale network structure can be applied recurrently at finer scales. Our full model is defined to have three scales, which can predict reflectance, normal and lighting at resolution 256 × 256. Please refer to Section 5.4 for more details of the network structure.
5.3.3 Supervision on Training
It is difficult to obtain dense, accurate ground truth annotations for reflectance, normal and lighting. We thus leverage rendered synthetic data for supervised pre-training. The pre-trained network is then fine-tuned using sparsely annotated real data (IIW [128], SAW [129] and NYUv2 [156]) in a self-supervised way, i.e., by applying the pre-trained model to provide pseudo ground truth labels for fine-tuning.
5.3.3.1 Reflectance
In the pre-training stage, we directly apply the ground truth reflectance to guide the training in a fully supervised way, where an L1 loss is applied as shown in Equation 5.7:

L_{R1} = \|R - R^*\|_1 + \|\nabla R - \nabla R^*\|_1,    (5.7)

where R is the predicted reflectance and R^* is the corresponding ground truth. Moreover, similar to [153], we add supervision on the gradient of the reflectance to encourage the predicted reflectance to be piece-wise smooth.
For real data, there is no dense annotation for reflectance, normal or lighting. Instead, IIW [128] provides sparse ordinal reflectance judgments. Given a pair of reflectances R_1 and R_2, the label indicates whether R_1 is darker than, lighter than, or equal to R_2 (denoted as J = 1, J = -1 and J = 0 respectively), with a confidence score w. We use the WHDR hinge loss proposed in [146] as the loss for reflectance in real images:

L_R(R_1, R_2, J) = w \cdot
\begin{cases}
\max\left(0, \frac{R_1}{R_2} - \frac{1}{1+\delta+\xi}\right) & \text{if } J = 1, \\
\max\left(0, \frac{1}{1+\delta-\xi} - \frac{R_1}{R_2}, \frac{R_1}{R_2} - (1+\delta-\xi)\right) & \text{if } J = 0, \\
\max\left(0, (1+\delta+\xi) - \frac{R_1}{R_2}\right) & \text{if } J = -1.
\end{cases}    (5.8)

We set δ = 0.12 and ξ = 0.08 during training, as in [146]. Notice that the above loss is not symmetric, i.e., L_R(R_1, R_2, J) ≠ L_R(R_2, R_1, -J). We thus adapt the above loss and define the modified WHDR loss as:

L_{R2} = L_R(R_1, R_2, J) + L_R(R_2, R_1, -J).    (5.9)

5.3.3.2 Normal
Ground truth normals are available for the synthetic data and for part of the real data (NYUv2). For the data with ground truth normals, we define the loss as

L_N = -N^T N^* + \|\nabla N - \nabla N^*\|_1.    (5.10)

Similar to the reflectance regularization in Equation 5.7, we further apply a first-order derivative smoothness term to encourage the normals to be piece-wise continuous.
5.3.3.3 Shading
There is no direct supervision for lighting: the non-negative constraint and the L2 regularization are both unsupervised losses. Applying rendering to generate the shading S = \lambda(N, L) from the normal and lighting, we use the supervision on shading, together with the normal supervision discussed in Sec. 5.3.3.2, to indirectly supervise the lighting. The supervised signal for shading is similar to that of reflectance:

L_{S1} = \|S - S^*\|_1 + \|\nabla S - \nabla S^*\|_1,    (5.11)

where S and S^* are the predicted shading and its ground truth. For real images, SAW [129] provides annotations for smooth shading regions and shadow boundaries. We thus apply the same loss as in [153] for the shading:

L_{S2} = \lambda_{cs} L_{constant\text{-}shading} + L_{shadow},    (5.12)

where \lambda_{cs} = 10, and L_{constant\text{-}shading} and L_{shadow} are the losses for constant shading regions and shadow boundaries defined in [153].
5.4 Network Structure
Overall, our framework can be divided into three scales. We employ an hourglass network [155] for the first-scale network and fully convolutional structures for the second- and third-scale networks. Details of the three network structures are discussed in the following.
Network of First Scale. The first scale network takes a 64 ×
64 image as input and predicts 64 ? 64 reflectance Rc, normal Nc, and a single global Lighting Lc. Shading Sc can be constructed based on Nc and Lc. The branches used to predict reflectance and normal have the same hourglass network structure [155], which is illustrated in Figure 5.4 (a). The branch to predict global SH is shown in Figure 5.4 (b). The green blocks are shared by all the three branches. S2 S2 S3 D1 D2 I U1 U2 U3 C1 C2 C3 C4 (a) D1 D2 I C2 C3 C4 AP FC C1 (b) Figure 5.4: (a) shows the network structure to predict reflectance and normal and (b) shows the network structure for predicting global SH. In Figure 5.4, C1, C2, C3, and C4 represent convolutional layers. D1, D2, I, U1, U2, U3, S1, S2, S3 are residual blocks defined in [6]. AP is an average pooling layer and FC represents a fully-connected layer. Each of the convolution layers is followed by batch normalization [157] and ReLU except for the output layer. 109 C1 D1 D2 D3 D4 D5 C2 C3 F1 F2 F3 Figure 5.5: Network structure to predict reflectance, normal and lighting at second and third scale. Table 5.4 shows the detailed definition of the block in Figure 5.4. Since we predict 9 Spherical Harmonics for each channel of the global SH, the number of output channels for global SH is 27. C1 C2 C3 C4 D1 D2 I U1 U2 U3 S1 S2 S3 FC input channel number 3 64 32 3 64 128 256 248 256 128 64 128 256 - output channel number 64 32 3 3 128 256 248 256 128 64 64 128 256 - (a) filter size 5 3 3 3 3 3 3 3 3 3 3 3 3 - input feature size 64 64 64 64 64 32 16 8 16 32 64 32 16 - output feature size 64 64 64 64 32 16 8 16 32 64 64 32 16 - input channel number 3 16 64 - 64 128 256 - - - - - - 128 output channel number 64 64 128 - 128 256 16 - - - - - - 27 (a) filter size 5 3 3 - 3 3 3 - - - - - - - input feature size 64 8 4 - 64 32 16 - - - - - - - output feature size 64 4 2 - 32 16 8 - - - - - - - Table 5.1: Details of each block in our network. (a) shows the details about each block in Figure 5.4 (a) and (b) shows the details about each block in Figure 5.4 (b). Network of Second and Third Scale. Our second and third scale network has the same network structure. Our second network works on images with resolution 128 ? 128 and our third network works on images with resolution 256 ? 256. We define the network to predict residuals of reflectance, normal and lighting using separate networks with no shared layers. The network structure is illustrated in Figure 5.5. C1, C2, C3, F1, F2, F3 110 are convolutional layers, D1, D2, D3, D4, D5 are residual blocks defined in [6]. Each convolutional layer is followed by a Batch Normalization layer [157] and ReLU except for the output layer. Table 5.4 (a) shows the detailed definition of each block for networks used to predict residual reflectance and normal. Table 5.4 (b) shows the detailed definition of each block for networks used to determine local residual SHs. For reflectance, we concatenate the image and the upsampled reflectance from the coarse network as input, so the number of input channels is 6. Similarly, we concatenate the image and upsampled normals from the coarse network as input for the network for normals. The number of input channels is also 6. For the network used to predict local SHs, we concatenate the image, upsampled normal, reflectance and shading as input. As a result, the number of input channels for local SHs prediction is 12. Since we predict the color SH for each pixel, the number of output channels is 27. 
C1 C2 C3 D1 D2 D3 D4 D5 F1 F2 F3 input channel number 6 32 3 32 32 32 64 32 16 32 32 output channel number 16 3 3 32 32 64 32 32 32 32 32 (a) filter size 5 3 3 3 3 3 32 32 1 1 1 dilation size 1 1 1 1 2 4 1 1 1 1 1 input channel number 12 32 3 32 32 32 64 32 16 32 32 output channel number 16 3 27 32 32 64 32 32 32 32 32 (b) filter size 5 3 3 3 3 3 32 32 1 1 1 dilation size 1 1 1 1 2 4 1 1 1 1 1 Table 5.2: Details about each block in our second and third scale network shown in Figure 5.5. (a) shows the detailed structure for the network used to predict reflectance and normal. (b) shows the detailed structure for the network used to predict lighting. 111 5.5 Implementation Details Pre-training on Synthetic Data: We first train our network using the SUNCG dataset with synthesized ground truth normal, reflectance, and shading. The loss to train our network on synthetic data is Ls = ?sRLR1 + ?sSLS1 + ?sNLN + ?LcLLc + ?LfLLf (5.13) where LR1, LS1, LN , LLc and LLf are losses for reflectance, shading, normal, global and local residual lighting defined above, and ?sR, ?sS, ?sN , ?Lc and ?Lf are their corresponding weights. We set ?sR = ?sS = ?sN = ?Lc = 1 and ?Lf = 0.2. Our coarse-to-fine network is trained step by step using the Adam [122] optimizer with initial learning rate 0.001 and weight decay 0. Fine-tuning on Real Data: Due to the lack of annotation from real datasets, we use the rendered SUNCG dataset as supervision, with the loss denoted as Lcgr . In addition, we apply our network trained on synthetic data to predict reflectance, shading and normal of real images and use the results as pseudo supervision (self- supervision), with the loss denoted as Lssr . Lcg cg cgr = ?sRLR1 + ?sSL + ? cg S1 sNL + ? cg cg N LcLLc + ?LfLLf , Lss = ?ss ssr rRLR1 + ?rSL + ?ssS1 rNL + ?ss ssN LcLLc + ?LfLLf (5.14) where we set ?cg = ?cg cgsR sS = ?Lc = ? cg Lf = 1, ? cg sN = 10, ? ss = ?ss = 5, ?ss = ?ssrS rN Lf Lc = 1 112 and ?ssrR = 0.1. Our loss defined on the annotation and ground truth of IIW, SAW and NYUv2 is: Lo = ?or rRL oR2 + ?rSLS2 + ?orNLN (5.15) where ?o o oLc = ?rN = 10, ?rS = 1. Inspired by [128], we introduce the L2 regularization to achieve a reasonable color for reflectance. Lc Rr = ? ?1 ? ?Ic 1 ?c 1 (5.16) 3 c R c I3 where R and I are predicted reflectance and input image, and Rc and Ic, c ? {R,G,B} denote the color channel of R and I. Importantly, a reconstruction loss is further introduced to guarantee the predicted reflectance, normal and lighting preserve the input?s characteristics. Lrcr = ?Ii ?Ri Si?2 (5.17) The overall loss that we apply to fine-tune our network on real images is: Lr = Lcg ss or + Lr + Lr + Lc + ?rcLrcr r r (5.18) where ?rcr = 0.1. The coarse-to-fine network is fine tuned scale by scale. The Adam optimizer with learning rate 0.0005 and weight decay 0.00001 is used for fine-tuning. 113 5.6 Experiments In this section, we introduce the synthetic dataset that we create for pre- training and the public real datasets. Then we compare to Barron and Malik [2], who first proposed to predict reflectance, normal and lighting from an RGB-D image. Further, we compare to the state-of-the-art intrinsic image decomposition methods to indicate the overall advantage of our method. An ablation study is then carried out to demonstrate the contribution of each of our proposed modules. 5.6.1 Datasets Synthetic Dataset: we make use of the SUNCG dataset [158] to generate syn- thetic data. 
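Among the loss terms used for fine-tuning on real data, the ordinal reflectance loss L_R2 from Eqs. 5.8 and 5.9 is the least standard. The following is a small, hedged Python sketch of that hinge for a single annotated pair; the function names are illustrative, the thresholds follow the values stated earlier (δ = 0.12, ξ = 0.08), and in the actual training the same computation would be applied to predicted reflectance intensities at the annotated pixel pairs and weighted by the annotation confidences.

def whdr_hinge(r1, r2, j, w, delta=0.12, xi=0.08):
    # Eq. 5.8: hinge penalty for one ordinal reflectance judgment.
    # j = 1: R1 darker than R2;  j = -1: R2 darker than R1;  j = 0: roughly equal.
    ratio = r1 / r2
    if j == 1:
        penalty = max(0.0, ratio - 1.0 / (1.0 + delta + xi))
    elif j == -1:
        penalty = max(0.0, (1.0 + delta + xi) - ratio)
    else:
        penalty = max(0.0, 1.0 / (1.0 + delta - xi) - ratio, ratio - (1.0 + delta - xi))
    return w * penalty

def symmetric_whdr_hinge(r1, r2, j, w):
    # Eq. 5.9: symmetrize by also evaluating the swapped pair with the flipped label.
    return whdr_hinge(r1, r2, j, w) + whdr_hinge(r2, r1, -j, w)

# Example: predicted reflectances 0.3 and 0.5 for a pair annotated "R1 darker" with confidence 1.0.
loss = symmetric_whdr_hinge(0.3, 0.5, 1, 1.0)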
It contains 568, 793 images rendered using Mitsuba [159] and their corresponding ground truth surface normals, depths, semantic labels and object boundaries. Since our task also requires ground truth reflectance and shading, we re-render 58, 949 images of SUNCG using the multi-channel renderer of Mitsuba. We further split the images into a set of 51,507 training images and a set of 7,442 validation images. Instead of directly rendering images, we render shading by set- ting all the materials to diffuse and the reflectance to be 1. Then the final image is I = R S. We believe rendering in this way has two main advantages: (1) The generated images strictly follow the assumption of intrinsic image decomposition. (2) The pixel value of ground truth shading has bounded range which makes data preparation easier. Though the rendered images do not contain non-diffuse effects of the material, our experiments show that this does not degrade the performance. 114 (a) (b) (c) (d) Figure 5.6: (a) synthetic images, (b) shading images, (c) and (d) are lighting pre- dicted by training the network without and with non-negative constraint respec- tively. [2] GLoSH SUNCG GLoSH SUNCG + real MSE 0.098 0.038 0.032 Table 5.3: SH lighting Evaluation on SUNCG synthetic data. Public Real Datasets: we use IIW [128], SAW [129] and NYUv2 [156] as real data for training and testing. More specifically, SAW is a combination of IIW and NYUv2 (3761 images from IIW and 381 images from NYUv2 with ground truth normals). The real dataset we use is the same as [153] in addition to ground truth normals from NYUv2. We strictly follow the train/val/test splitting strategy of [153]. 5.6.2 Spherical Harmonics Lighting Evaluation Quantitative comparison to [2]. We compare to [2] as they also propose a lighting model to jointly predict reflectance, normal and lighting of a natural scene. Notice that [2] uses RGB-D images, which simplifies the problem. Lighting for real data is hard to obtain. We instead evaluate the shading from 115 the SUNCG synthetic data by fixing the surface normal from ground truth, at which we can indirectly evaluate the SH lighting. We calculate the per-pixel Mean squared error (MSE) of the reconstructed shading w.r.t. ground truth shading and show the results in Table 5.3. Our method shows a significant advantage over [2] and the real data self-supervision provides a further performance boost. We also evaluate the shading of [2] on NYUv2 dataset using the AP challenge metric proposed by [153]. They achieve 90.38% shading accuracy, while under the same setup, our method achieves 95.36%. We believe all these evidences prove the proposed method can predict much more accurate lighting than [2]. Qualitative comparison to [2]. Figure 5.7, 5.8 and 5.9 compare their visual results with ours. The red rectangles in reflectance and shading images show that [2] mistakenly decomposes cast shadow into reflectance instead of shading. We believe the limited number of SH basis in their method prevents them from modeling the spatial variation of the lighting well, resulting in a lack of ability to model cast shadows. Non-negative lighting: [127] proved that a SH represents non-negative lighting if its Toeplitz matrix is positive semi-definite. We use their proposed method to evaluate the effectiveness of our non-negative constraint. We train our coarse scale network with and without the proposed non-negative constraint, i.e, Equation (5.3), and then test on the validation set of our synthetic SUNCG data. 
Non-negative lighting: [127] proved that an SH represents non-negative lighting if its Toeplitz matrix is positive semi-definite. We use their proposed method to evaluate the effectiveness of our non-negative constraint. We train our coarse-scale network with and without the proposed non-negative constraint, i.e., Equation (5.3), and then test on the validation set of our synthetic SUNCG data. Without the proposed non-negative constraint, the percentage of global SH that represents negative lighting is 13.39%. It decreases drastically to 1.09% with this constraint. Figure 5.6 visualizes the predicted lighting with and without the non-negative constraint. After fine-tuning on real data, the percentage of global SH that represents negative lighting is reduced to 0%, and only one image contains negative local lighting.

Method   Avg. (°) ↓   Med. (°) ↓   11.25° ↑   22.5° ↑   30° ↑
[158]    27.90        21.29        26.76      52.21     63.75
Ours     28.63        21.05        27.68      52.42     62.87
Table 5.4: Surface normal evaluation on NYUv2. Average (Avg.) and Median (Med.) give the average and median angular error in degrees; smaller values are better. 11.25°, 22.5° and 30° give the percentage of normals with angular error smaller than 11.25°, 22.5° and 30°; higher values are better.

Figure 5.7: Comparison with [2]. First row of A and B, from left to right: input image, reflectance by [2], shading by [2], normal by [2], lighting by [2]. Second row of A and B, from left to right: ground truth normal; reflectance, shading, normal, global SH and local SHs by the proposed method.

Figure 5.8: Comparison with [2]. First row of A, B, C and D, from left to right: input image, reflectance by [2], shading by [2], normal by [2], lighting by [2]. Second row of A, B, C and D, from left to right: ground truth normal; reflectance, shading, normal, global SH and local SHs by the proposed method.

Figure 5.9: Comparison with [2]. First row of A, B, C and D, from left to right: input image, reflectance by [2], shading by [2], normal by [2], lighting by [2]. Second row of A, B, C and D, from left to right: ground truth normal; reflectance, shading, normal, global SH and local SHs by the proposed method.

Figure 5.10: Comparison with state-of-the-art intrinsic image decomposition methods. (a) image, (b) reflectance of [154], (c) reflectance of [153], (d) our reflectance, (e) our normal, (f) shading of [154], (g) shading of [153], (h) our shading, (i) our global SH, (j) our local SH. Note that although [153] achieves the best AP score on shading, the generated shading image has very low contrast. The red rectangle shows that the shading of [154] suffers seriously from the reflectance bleeding problem.

5.6.3 Intrinsic Image Decomposition

Model trained on synthetic data. We evaluate our network trained using synthetic data on IIW, SAW and NYUv2. For reflectance on IIW, we use the WHDR metric proposed in [128], which computes the weighted error of the predicted reflectance with respect to human annotations. The challenge average precision (AP) proposed by [153] is used to evaluate the predicted shading. It computes the average precision of classification for constant shading regions and shadow boundaries. Table 5.5 (a) compares our trained network with [153] on the IIW and SAW datasets. It shows that our proposed method is closely comparable to [153] on IIW and produces much better results than [153] on SAW when trained on the SUNCG dataset. [153] claimed that the dataset they provide (denoted CGI) has a smaller domain gap with real data than SUNCG. As a sanity check, we train our coarse network using CGI and achieve a WHDR of 37.98, while the WHDR of our coarse network trained on SUNCG is 28.20. We do not see an advantage to using CGI data for training, and thus we train our network on the SUNCG dataset.

     Method        Dataset        IIW: WHDR (%) ↓   SAW: AP (%) ↑
(a)  Li [153]      SUNCG          26.1              87.09
     Proposed      SUNCG          26.8              92.40
(b)  Grosse [160]  -              26.9              85.26
     Garces [161]  -              24.8              92.39
     Zhao [162]    -              23.8              89.72
     Bi [144]      -              17.7              -
     Bell [128]    -              20.6              92.18
     Zhou [145]    IIW            19.9              86.34
     [146]         IIW            19.5              89.94
     Fan [154]     IIW            15.4              -
     Li [153]      CGI + real     15.5              96.57
     Proposed      SUNCG + real   15.2              95.01
Table 5.5: Reflectance evaluation on IIW and shading evaluation on SAW. For WHDR, a lower value (↓) is better; for AP, a higher value (↑) is better.
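As a reference for how the WHDR numbers above are computed, here is a minimal sketch in the spirit of [128]; the judgement tuple format and the threshold of 0.10 are illustrative assumptions rather than the exact IIW data format.

import numpy as np

def whdr(reflectance, judgements, delta=0.10):
    """Weighted Human Disagreement Rate in the spirit of Bell et al. [128].

    reflectance -- H x W array of predicted (grayscale) reflectance
    judgements  -- iterable of (y1, x1, y2, x2, darker, weight) with
                   darker in {"1", "2", "E"}
    """
    error, total = 0.0, 0.0
    for y1, x1, y2, x2, darker, w in judgements:
        r1 = max(float(reflectance[y1, x1]), 1e-10)
        r2 = max(float(reflectance[y2, x2]), 1e-10)
        if r1 / r2 > 1.0 + delta:      # point 2 predicted darker
            pred = "2"
        elif r2 / r1 > 1.0 + delta:    # point 1 predicted darker
            pred = "1"
        else:                          # predicted roughly equal
            pred = "E"
        error += w * (pred != darker)
        total += w
    return error / max(total, 1e-10)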
Model fine-tuned on real data. Table 5.5 (b) compares our method with several state-of-the-art methods on IIW and SAW. Our method achieves the best performance on IIW and the second best on SAW. [154] demonstrated that by incorporating a guided filter into the training of their network, they can achieve a WHDR of 14.5%, which is the state-of-the-art result. By applying a guided filter to our model as suggested by [146], we achieve 14.6%, which is closely comparable to this result. However, the challenge AP for the shading of [154] on the IIW dataset (images provided by the authors) is 85.77%. Under the same setting, we achieve 96.81%, an improvement of more than 10%. Besides reflectance and shading, Table 5.4 shows that the normals predicted by our model are strongly competitive with [158] when trained on SUNCG synthetic data and evaluated on NYUv2. We further fine-tuned the models with limited real data (381 images with surface normal ground truth) and achieved a 25° average angular error, close to the 21.74° of [158].

Visual comparison. We visualize the shading predicted by [154] in Figure 5.10 (f). It shows that the shading images of [154] still retain the effect of reflectance in their shading prediction. Although [153] achieves the best performance on SAW, Figure 5.10 (g) shows that their predicted shading images have low contrast. That is, the quality of the shading image is low. Across the compared methods, our method achieves relatively better visual quality on both reflectance and shading. Figures 5.11, 5.12 and 5.13 show more visual results of [154], [153] and the proposed method.

To conclude, our GLoSH achieves consistently better results than state-of-the-art methods, both when trained on synthetic data and when fine-tuned on real data, across the tasks of estimating reflectance, normal, shading and lighting. We believe this also indicates the effectiveness of the proposed coarse-to-fine network structure.

Figure 5.11: Comparison with state-of-the-art intrinsic image decomposition methods. First row of A, B and C, from left to right: input image, reflectance by [154], shading by [154], reflectance by [153], shading by [153]. Second row of A, B and C, from left to right: reflectance, shading, normal, global SH and local SHs of the proposed method.

5.6.4 Ablation Study

Without synthetic data. Synthetic data is very important for the proposed method. Table 5.6 "w/o SUNCG" shows the WHDR on IIW, the average precision (AP) on SAW and the mean error on the NYUv2 dataset when our network is trained using only real data. It is clear that without synthetic data, the performance of our network on reflectance, shading and normal shows a significant gap relative to the "full" model. This is because training a network that performs reasonably well
requires a huge amount of data. The sparsity of the annotations for reflectance and shading, and the small number of real images, make such training intractable.

Figure 5.12: Comparison with state-of-the-art intrinsic image decomposition methods. First row of A, B, C and D, from left to right: input image, reflectance by [154], shading by [154], reflectance by [153], shading by [153]. Second row of A, B, C and D, from left to right: reflectance, shading, normal, global SH and local SHs of the proposed method.

Figure 5.13: Comparison with state-of-the-art intrinsic image decomposition methods. First row of A, B, C and D, from left to right: input image, reflectance by [154], shading by [154], reflectance by [153], shading by [153]. Second row of A, B, C and D, from left to right: reflectance, shading, normal, global SH and local SHs of the proposed method.

Without pseudo supervision. Table 5.6 "w/o L^ss_r" shows that on IIW and NYUv2, performance degrades relative to the "full" model, except for the AP on the SAW dataset. This shows that the self-supervision helps to provide rough guidance on reflectance and normal for the unlabeled real data. The degradation for shading is probably due to the large domain gap between the lighting of synthetic data and that of real data. However, when compared with the shading of [153] in Figure 5.10, we see that even with weak supervision, our model can still predict more reasonable shading.

Method          IIW: WHDR (%) ↓   SAW: AP (%) ↑   NYUv2: Mean Error (°) ↓
w/o SUNCG       17.82             88.52           35.14
w/o L^ss_r      15.50             95.79           25.93
w/o L_R2        15.34             91.89           25.96
scale1          18.70             90.35           26.68
scale1+scale2   16.62             94.98           25.59
full            15.20             95.01           25.44
Table 5.6: Ablation study on the losses, the synthetic SUNCG data, and the coarse-to-fine scales, evaluated on IIW reflectance, SAW shading and NYUv2 surface normals.

Contribution of multiple scales. We clearly see in Table 5.6 that "scale1+scale2" outperforms "scale1", and our "full" model further outperforms "scale1+scale2". This suggests that adding a finer-scale module indeed helps the local lighting modeling and boosts the overall performance. It is worth noting that the gains gradually saturate as finer modules are added, since the improvement from "scale1+scale2" to "full" is smaller than that from "scale1" to "scale1+scale2". In practice, we define our full model to have three scales, a coarse net with two cascaded finer nets, which strikes a good balance between accuracy and model complexity.

Without symmetric loss. The WHDR hinge loss proposed by [146] (Equation 5.8) is not symmetric. This leads to unequal losses when the same points are used in a different order. By adapting the WHDR to our proposed symmetric version (Equation 5.9), we observe an improvement of 0.14% on IIW.

Model complexity: We count the parameters of CGI [153] and of our full model. There are 68,572,482 floating-point parameters in CGI and only 14,665,594 in our model, which is much smaller. Among the state-of-the-art CNN-based methods, our method achieves consistently better performance with a smaller model size.
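For completeness, parameter counts such as those reported above can be reproduced for any model with a few lines of PyTorch; this sketch is illustrative, and the toy block is only an example shaped like the entries of Table 5.2.

import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total number of floating-point parameters in a model."""
    return sum(p.numel() for p in model.parameters())

# Toy example: a small convolutional block resembling the C1/D1 entries of Table 5.2.
block = nn.Sequential(
    nn.Conv2d(6, 16, kernel_size=5, padding=2),
    nn.ReLU(inplace=True),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
)
print(count_parameters(block))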
5.7 Summary

In this chapter, we propose to estimate reflectance, normal and lighting from a single image, which is a very hard problem that has not been well addressed. A global and local SH model is proposed to model the lighting of a natural scene, accounting for both the holistic lighting and its spatial variation. A novel non-negative constraint is proposed to force the SH lighting to be physically meaningful. A synthetic dataset is used to augment the real data. Extensive experiments on the SAW, IIW, and NYUv2 datasets demonstrate the effectiveness of our proposed method.

Chapter 6: Conclusion

Lighting is the medium through which we capture images of the physical world. Understanding lighting from images can help computers better understand the world. Conventional computer vision algorithms usually try to understand lighting from images by optimizing complicated objective functions with priors designed by an expert. They are usually slow, and their performance relies on the quality of the priors. Recently, deep CNNs have been applied to many computer vision problems with great success. In this dissertation, we apply deep CNNs to the task of understanding lighting from images.

The biggest challenge in understanding lighting from images is the lack of labeled data. Deep CNNs are notorious for their data-hungry nature: millions of labeled examples are needed in order to train a CNN that works reasonably well. However, labeling the lighting of real images is impossible. In this dissertation, we therefore study how to apply synthetic data to train CNNs for understanding lighting from images.

1. We first empirically study the capacity of CNNs by compressing them. Our sparsity-induced method can remove a huge number of parameters from several popular CNNs (e.g., more than 76% of the parameters in AlexNet when training on ImageNet), which shows that current CNNs contain a lot of redundancy.

2. We then designed a label denoising network to make use of synthetic data to help estimate lighting from real face images, a task for which it is extremely difficult to collect real labels. We demonstrate that our proposed method significantly outperforms the current state-of-the-art lighting estimation methods.

3. A deep CNN based portrait relighting method is then proposed. Lacking an existing dataset, we generate a large-scale, high-resolution, "in the wild" dataset for this task. Our model trained on the proposed dataset outperforms existing methods significantly.

4. Finally, we proposed a novel intrinsic image decomposition algorithm for natural scenes. We are the first to decompose an RGB image of a natural scene into reflectance, normal and lighting. A novel global and local lighting model is proposed to model the complicated lighting conditions of a natural scene.

Future directions. Understanding lighting from images is an interesting but hard problem. Though our proposed methods make great progress, there are still many open questions in this field. We list a few possible research directions below:

Powerful and compact lighting model. One direction is to find a better lighting model. The Spherical Harmonics (SH) model [34, 35] used in our work assumes that lighting is distant and objects are convex; moreover, it cannot model cast shadows or high-frequency lighting components. Modeling lighting with environment maps, on the other hand, requires a huge number of parameters. As a result, finding a powerful and compact lighting representation is necessary.

Realistic synthetic data with ground truth lighting. Due to the difficulty of labeling ground truth lighting for real images, synthetic data is used to train the deep CNNs in our work. However, the synthetic data either have a large domain gap with real data (e.g., the synthetic data we used in [13]) or the lighting is not accurate [99]. How to generate more realistic synthetic data with accurate lighting is an interesting direction to explore.
131 Appendix A: Evaluating Local Features for Day-Night Matching In this chapter, we introduce our work of evaluating the performance of local features in the presence of large illumination changes that occur between day and night. Through our evaluation, we find that repeatability of detected features, as a de facto standard measure, is not sufficient in evaluating the performance of feature detectors; we must also consider the distinctiveness of the features. Moreover, we find that feature detectors are severely affected by illumination changes between day and night and that there is great potential to improve both feature detectors and descriptors. This work [163] is in collaboration with Torsten Sattler and David W. Jacobs. A.1 Introduction Feature detection and matching is one of the central problems in computer vision and a key step in many applications such as Structure-from-Motion [164], 3D reconstruction [165], place recognition [166], image-based localization [167], Aug- mented Reality and robotics [168], and image retrieval [169]. Many of these appli- cations require robustness under changes in viewpoint. Consequently, research on feature detectors [170?175] and descriptors [176?179] has for a long time focused 132 on improving their stability under viewpoint changes. Only recently has robustness against seasonal [180] and illumination changes [174,181] come into focus. Especially the latter is important for large-scale localization and place recognition applications, e.g., for autonomous vehicles. In these scenarios, the underlying visual representa- tion is often obtained by taking photos during daytime and it is infeasible to capture large-scale scenes also during nighttime. Many popular feature detectors such as Difference of Gaussians (DoG) [176], Harris-affine [182], and Maximally Stable Extremal Regions (MSER), as well as the popular SIFT descriptor [176] are invariant against (locally) uniform changes in illumination. However, the illumination changes that can be observed between day and night are often highly non-uniform, especially in urban environments (cf. Fig. A.1). Recent work has shown that this causes problems for standard feature detectors: Verdie et al. [174] demonstrated that a detector specifically trained to handle temporal changes significantly outperforms traditional detectors in challeng- ing conditions such as day-night illumination variations. Torii et al. [166] observed that foregoing the feature detection stage and densely extracting descriptors instead results in a better matching quality when comparing daytime and nighttime images. Naturally, these results lead to a set of interesting questions: (i) to what extent is the feature detection stage affected by the illumination changes between day and night? (ii) the number of repeatable features provides an upper bound on how many correspondences can be found via descriptor matching. How tight is this bound, i.e., is finding repeatable feature detections the main challenge of day-night matching? (iii) how much potential is there to improve the matching performance of local de- 133 tectors and descriptors, i.e., is it worthwhile to invest more time in the day-night matching problem? In this work, we aim at answering these questions through extensive quan- titative experiments, with the goal of stimulating further research on the topic of day-night feature matching. We are interested in analyzing the impact of day-night changes on feature detection and matching performance. 
Thus, we eliminate the impact of viewpoint changes by collecting a large dataset of daytime and nighttime images from publicly available webcams [183] 1. Through our experiments on this large dataset, we find that: (i) the repeatability of feature detectors for day-night image pairs is much smaller than that for day-day and night-night image pairs, meaning that detectors are severely affected by illumination changes between day and night; (ii) for day-night image pairs, high repeatability of feature detectors does not necessarily lead to a high matching performance. For example, the TILDE [174] detector specifically learned for handling illumination changes has a very high re- peatability, but the precision and recall of matching local features are very low. A low recall shows that the number of repeatable points provides a loose bound for the number of correspondences that could be found via descriptor matching. As a re- sult, further research is necessary for improving both detectors and descriptors; (iii) through dense local feature matching, we find that there are a lot more correspon- dences that could be found using local descriptors than are produced by current detectors, i.e., there is great potential to improve detectors for day-night feature matching. 1Please find the data set at http://www.umiacs.umd.edu/~hzhou/dnim.html 134 A.2 Dataset Illumination and viewpoint changes are two main factors that would affect the performance of feature detectors and descriptors. Ideally, both detectors and descriptors should be robust to both type of changes. However, obtaining a large dataset with both types of changes with ground truth transformations is difficult. In this work, we thus focus on pure illumination changes and collect data that does not contain any viewpoint changes. Our results show that already this simpler version of the day-night matching problem is very hard. The AMOS dataset [183], which contains a huge number of images taken (usu- ally) every half an hour by outdoor webcams with fixed positions and orientations, satisfies our requirements perfectly. [174] has collected 6 sequences of images taken at different times of the day for training illumination robust detectors from the AMOS dataset. However, the dataset has no time stamps and some of the sequences have no nighttime images. As a consequence, we collect our own dataset from AMOS. 17 image sequences with relatively high resolution containing 1722 images are selected. Since the time stamps of the images provided by AMOS are usually not correct, we choose image sequences with time stamp watermarks. The time of the images will be decided by the watermarks which are removed afterwards. For each image sequence, images taken in one or two days are collected. Fig. A.1 gives an example of images we collected. 135 Figure A.1: Images taken from 00:00 - 23:00 in one image sequence of our dataset. A.3 Evaluation A.3.1 Keypoint Detectors For evaluation, we focus on the keypoint detectors most commonly used in practice. We choose DoG [176], Hessian, HessianLaplace, MultiscaleHessian, Har- risLaplace and MultiscaleHarris [182] implemented by vlfeat [184] for evaluation. Their default parameters are used to determine how well these commonly used set- tings perform under strong illumination changes. DoG detects feature points as the extrema of the difference of Gaussian functions. 
By considering the extrema of the difference of two images, DoG detections are invariant against additive or multiplica- tive (affine) changes in illuminati?on. Hessian, Hessia?nLaplace and MultiscaleHessian? Lxx(?) Lxy(?) ? are based on the Hessian matrix?? ??, where L represents the image Lyx(?) Lyy(?) smoothed by a Gaussian with standard deviation ? and Lxx, Lxy, and Lyy are the second-order derivatives of L. Hessian detects feature points as the local maxima 136 of the determinant of the Hessian matrix. HessianLaplace chooses a scale for the Hessian detector that maximizes the normalized Laplacian |?2(Lxx(?) + Lyy(?))|. MultiscaleHessian instead applies the Hessian detector on multiple scales of images and detects feature points at each scale independently. HarrisLaplace and Multi- scaleHarris extended the Harris corner detector to multiple scales in a similar way to HessianLaplace and MultiscaleHessian. The Harris corner detector is based on the determinant and trace of the second moment matrix of gradient distribution. All these gradient based methods are essentially invariant to additive and multiplicative illumination changes. We also included the learning based detector TILDE [174], since it is designed to be robust to illumination changes. We use the model trained on the St. Louis sequence as it has the highest repeatability when testing on the other image se- quences [174]. TILDE detects feature points at a fixed scale. In this work, we define a multiple scale version by detecting features at multiple scales, denoted as MultiscaleTILDE. Feature points are detected from the original image and images smoothed by a Gaussian with standard deviation of 2 and 4. When TILDE detects feature points from the original image, the scale of it is set to be 10. Accordingly, the scale of detected feature points from those three images are set to be 10, 20 and 40. As suggested by [174], we keep a fixed number of feature points based on the resolution of the image. For the proposed MultiscaleTILDE, the same number of feature points as that of TILDE are selected for the first scale. For other scales, the number of feature points selected are reduced by half compared with the previous scale. In modified versions, we include 4 times as many points as suggested, naming 137 14000 DoG Hessian 12000 HessianLaplace HarrisLaplace MultiscaleHessian 10000 MultiscaleHarris TILDE TILDE4 8000 MultiscaleTILDE MultiscaleTILDE4 6000 4000 2000 0 0 1 2 3 4 5 6 7 8 9 1011121314151617181920212223 Hour Figure A.2: The number of feature points detected at different time these TILDE4 and MultiscaleTILDE4 respectively. A.3.2 Repeatability of Detectors In this section we address the question: to what extent are feature detec- tions affected by illumination changes between day and night? by evalu- ating how many feature points are detected, and how repeatable they are. First we show the number of detected feature points at different times of the day for different detectors in Fig. A.2. The numbers are averaged from all 17 image sequences in our dataset. The number of feature points for TILDE is the same across different times, since a fixed number of feature points are extracted. For the other detectors, fewer feature points are detected at nighttime. Especially, the number of feature points detected by HessianLaplace and MultiscaleHessian are affected most by illumination changes between day and night. We then use the repeatability of the detected feature points to evaluate the performance of detectors. 
According to [170], the measurement of repeatability is related to the detected region of feature points. Suppose ?a and ?b are the 138 Number of Feature Points scale of two points A and B, (xa, ya) and (xb, yb) are their locations, the detected regions ?a and ?b are defined as the region of (x ? x 2a) + (y ? y )2a = (3?a)2 and (x?xb)2 +(y?y )2b = (3? 2b) respectively, where 3? is the size of one spatial bin from which the SIFT feature is extracted. Then A and B are considered to correspond to each other if 1? ?a??b? ? 0.5, i.e. the intersection of these two regions are larger than?a ?b or equal to half of the union of these two regions. This overlap error is the same as the one proposed in [170] except that we do not normalize the region size. This is because if the detected regions do not overlap, we cannot extract matchable feature descriptors; normalizing the size of the region would obscure this. For example, two regions with small scales may be judged to correspond after normalization. However, the detected region from which the feature descriptor is extracted may not overlap at all, making it impossible to extract feature descriptors to match them. Some of the images in our dataset may contain moving objects. To avoid the ef- fect of those objects, we define ?ground truth? points and compute the repeatability of detectors at different times w.r.t. them. To make the experiments comprehensive, we use daytime ground truth and nighttime ground truth. Images taken at 10:00 to 14:00 are used to get the daytime ground truth feature points (and 00:00 to 02:00 together with 21:00 to 23:00 for nighttime ground truth feature points). We select the image that has the largest number of detected feature points and match them to those in other images in that time period. A feature point is chosen as a ground truth if it appears in more than half of all the images of that time period. Fig. A.3 (a) and (b) shows the number of daytime and nighttime ground truth feature points detected for different detectors respectively. We notice that though Fig. A.2 shows 139 Daytime Nighttime 2500 2235 1600 1546 2000 1400 1200 1148 1500 1000 800 1000 1022 683 600 612 745 734 549 483 400 398 500 501 512 347 383 240 204 249 200 184 147 0 0 oG an e e n s DE 4 E 4D si G n c c a ri L DE DE E4 o e e E E 4 s la la si ar TI IL IL D D ss ia lac n is D e p p s L e p pl ac ia rr IL LD LD DE H La La e eH T T I s a T I I n s H l H a a es H T T IL a ri le ca ca le T L ale an ris L H le ale eT si r a l s a a le c H i s is c i r c a e isc ult ul t ltis es s a ca s is c lt H ltis lti ultH M u H u M ul tis u M M u MM M M (a) (b) Daytime 0.9 Nighttime 1 0.8 0.9 0.7 0.8 0.6 0.7 0.5 0.6 0.4 0.5 0.3 0.4 0.2 0.3 0.1 0.2 0 0.1 0 1 2 3 4 5 6 7 8 9 15 16 17 18 19 20 21 22 23 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Hour Hour (c) (d) Figure A.3: (a) and (b) show average number of daytime and nighttime ground truth feature points for each detector respectively. We show repeatability of different detectors at different times of the day w.r.t. (c) daytime and (d) nightime ground truth feature points. Please note time periods that are used to compute the ground truth feature points are excluded for fair comparison. the number of detected feature points for TILDE4 at daytime is the second small- est among all the detectors, the number of daytime ground truth feature points of TILDE4 is larger than 6 detectors. This implies that the feature points detected by TILDE4 for daytime images are quite stable across different images. 
We use these ground truth feature points to compute the repeatability of the chosen detectors over different times of the day. Thus, repeatability is determined by measuring how often the ground truth points are re-detected. Fig. A.3 (c) and (d) show that the repeatability of features for nighttime images w.r.t. nighttime 140 Repeatability Number of Ground Truth Points Repeatability Number of Ground Truth Points ground truth is very high for all the detectors; this is because the illumination of nighttime images are usually quite stable without the effect of sunlight (cf. Fig. A.1). For comparison, the repeatability of daytime images w.r.t. daytime ground truth is smaller and the performance of different detectors varies a lot. Moreover, both Fig. A.3 (c) and (d) show that the repeatability of day-night image pairs is very low for most detectors, which implies that detectors are heavily affected by day-night illumination changes. The drop-off between 05:00-07:00 and 17:00-18:00 is caused by illumination changes between dusk and dawn. The peaks of the repeatability, as 09:00 in Fig. A.3 (c) and 03:00 and 20:00 in Fig. A.3 (d), appear because they are close to the time from which the ground truth feature points are computed. Among all the detectors, both single scale and multiple scale TILDE have high repeatabilities of around 50% for day-night image pairs. This is not surprising since the TILDE detector was constructed to be robust to illumination changes by learning the impact of these changes from data. Based on the fact that nearly every second TILDE keypoint is repeatable, we would expect that TILDE is well-suited for feature matching between day and night. A.3.3 Matching Day-Night Image Pairs In theory, every repeatable keypoint should be matchable with a descriptor since its corresponding regions in the two images have a high overlap. In practice, the number of repeatable keypoints is only an upper bound since the descriptors ex- tracted from the regions might not match. For example, local illumination changes 141 might lead to very different descriptors. In this section, we thus study the perfor- mance of detector+descriptor on matching day-night image pairs. We try to answer the question whether finding repeatable feature detections is the main chal- lenge of day-night feature matching, i.e., whether finding repeatable keypoints is the main bottleneck or whether additional problems are created by the descriptor matching stage. We use both precision and recall of feature descriptor matching to answer this question. Suppose for a day-night image pair, N true matches are provided by detectors. Nf matched feature points are found by matching descrip- tors, among which Nc matches are true matches. Then the precision and recall of detector+descriptor are defined as Nc/Nf and Nc/N , respectively. Precision is a usual way to evaluate the accuracy of matching by detector+descriptor. Recall, on the other hand, tells us what is the main challenge to increase the number of matches. A low recall means improving feature descriptors is the key to getting more matches. On the contrary, feature detection is the bottleneck for getting more matches if a high recall is observed, but still an insufficient number of matching features are found. For each image sequence, images taken at 00:00 - 05:00 and 19:00 - 23:00 are used as nighttime images and those taken at 09:00 - 16:00 are daytime images. 
One image is randomly selected from each hour in these time periods, and every nighttime image is paired with every daytime image to create the day-night image pairs. As the SIFT descriptor is still the first choice in many computer vision problems, and its extension RootSIFT [185] performs better than SIFT, we use RootSIFT as the feature descriptor. To match descriptors, we use nearest neighbor search and apply Lowe's ratio test [176] to remove unstable matches. The default ratio provided by vlfeat [184] is used in our evaluation. In practice, the ratio test is used to reject wrong correspondences, but it also rejects some correct matches. The run-time of subsequent geometric estimation stages typically depends on the percentage of wrong matches; sacrificing recall for precision is thus often preferred, since there is enough redundancy in the matches.

Figure A.4: (a) The precision of RootSIFT matching of day-night image pairs for the different detectors at different nighttimes. (b) The corresponding number of correctly matched feature points.
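The descriptor pipeline just described can be sketched as follows; this is illustrative rather than the vlfeat-based implementation used in the experiments, and the ratio threshold of 0.8 is an assumed default.

import numpy as np

def root_sift(descs, eps=1e-7):
    """Convert SIFT descriptors (N x 128) to RootSIFT [185]:
    L1-normalize, then take the element-wise square root."""
    descs = descs / (np.abs(descs).sum(axis=1, keepdims=True) + eps)
    return np.sqrt(descs)

def ratio_test_match(d1, d2, ratio=0.8):
    """Nearest-neighbor matching with Lowe's ratio test [176].
    Returns a list of (i, j) index pairs from d1 to d2."""
    dists = np.linalg.norm(d1[:, None, :] - d2[None, :, :], axis=2)
    nn = np.argsort(dists, axis=1)[:, :2]
    matches = []
    for i, (j1, j2) in enumerate(nn):
        if dists[i, j1] < ratio * dists[i, j2]:
            matches.append((i, j1))
    return matches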
A.6 (a). We find that the recall of each detector is very low. As a consequence, one way to improve the performance of day-night image matching is to improve 144 0.45 0.1 TILDEP4 0.09 0.4 ModifiedTILDEP4 0.08 MultiscaleTILDE4 0.35 ModifiedMultiscaleTILDE4 0.07 0.3 0.06 0.05 0.25 0.04 0.2 0.03 0.15 0.02 0.1 0.01 0 0.05 0 10 20 30 40 50 60 70 80 90 100 0 1 2 3 4 5 19 20 21 22 23 Scale Hour (a) (b) Figure A.5: (a) shows the histogram of scales for correctly matched RootSIFT features using DoG as detector. (b) compares the precision of TILDE4 and Multi- scaleTILDE4 with small and large scales. 0.15 0.55 0.5 0.45 0.4 0.1 0.35 0.3 0.25 0.05 0.2 0.15 0.1 0 0.05 0 1 2 3 4 5 19 20 21 22 23 9 10 11 12 13 14 15 16 Hour Hour (a) (b) Figure A.6: (a) shows the recall of RootSIFT matching of day-night image pairs for different detectors at different nighttimes. (b) shows the recall of RootSIFT matching of day-day image pairs for different detectors at different daytimes. the robustness of descriptors to severe illumination changes. As shown in Fig. A.6, a much higher recall can be noticed for day-day image pairs, meaning that RootSIFT is robust to small illumination changes at daytime. However, it is not so robust to severe illumination changes between day and night. The low recall of day-night image pairs implies that there are a lot of ?hard? patches from which RootSIFT cannot extract good descriptors. With the development of deep learning, novel feature descriptors based on 145 Recall Probability Recall Precision 0.45 0.15 0.4 0.35 0.1 0.3 0.25 0.2 0.05 0.15 0.1 0.05 0 0 1 2 3 4 5 19 20 21 22 23 0 1 2 3 4 5 19 20 21 22 23 Hour Hour (a) (b) Figure A.7: (a) shows the precision of matching of day-night image pairs for different detectors at different nighttime using cnn feature. (b) shows the recall of matching of day-night image pairs for different detectors at different nighttime using cnn feature. convolutional neural networks have been proposed. Many of them [8, 186, 187] out- perform SIFT. We choose the feature descriptor proposed in [186] as an example to evaluate the performance of the learned descriptor+detector. [186] is chosen since their evaluation method is Euclidean distance, which can be used easily in our eval- uation framework. Fig. A.7 shows that this CNN feature performs even worse than RootSIFT+detectors. One reason is that [186] is learned from the data provided by [177], which mainly focus on viewpoint changes, and illumination changes that are small. Though [186] shows its robustness to small illumination changes as in DaLI dataset [188], it is not very robust to illumination changes between day and night in our dataset.2 146 Precision Recall 0.4 30 30 1200 0.35 25 25 0.3 1000 998 987 975 964 0.27 0.27 9120.27 0.27 906 20.62 20.620.26 0.26 0.27 0.26 0.26 20.54 20.56 0.25 838 20 19.620.25 0.25 800 786 18.75 20 19.42 763 758 18.00 18.65 17.90 705 16.6215.88 15.77 16.49 0.2 15 14.25 15 600 13.62 14.10 13.54 12.50 12.35 0.15 10.50 10.42 400 10 10 0.1 5 5 0.05 200 0 0 0 0 0 1 2 3 4 5 19 20 21 22 23 0 1 2 3 4 5 19 20 21 22 23 0 1 2 3 4 5 19 20 21 22 23 0 1 2 3 4 5 19 20 21 22 23 Day Time Hour Day Time Hour Day Time Hour Day Time Hour (a) (b) (c) (d) Figure A.8: (a) shows the precision of matching dense RootSIFT for day-night image pairs at different nighttimes. (b) shows the number of correct matches of dense RootSIFT for day-night image pairs. (c) shows the number of connected components of matched points at different nighttime. 
A.4 Potential of Improving Detectors

In this section, we try to examine the potential for improving feature detectors by fixing the descriptor to be RootSIFT. Inspired by [166], we extract dense RootSIFT features from the day-night image pairs for matching. When doing the ratio test, we select the neighbor that lies outside the region from which the nearest neighbor's RootSIFT feature is extracted, to avoid comparing similar features.

Figure A.8: (a) The precision of matching dense RootSIFT for day-night image pairs at different nighttimes. (b) The number of correct matches of dense RootSIFT for day-night image pairs. (c) The number of connected components of matched points at different nighttimes. (d) The number of connected components that contain no matched points of DoG+RootSIFT.

Figure A.8 (a) and (b) show the precision of dense RootSIFT matching and the number of matched feature points. Though the precision is not improved compared with the best-performing detector+RootSIFT, the number of matched feature points improves a lot. This means that there are many "easy" RootSIFT features that could be matched for day-night image pairs. However, we find that the matched RootSIFT features tend to cluster. Since detectors usually perform non-maximum suppression to obtain stable detections, in the worst case only one feature could be detected from each cluster. As a result, the number of these matched features is an upper bound that cannot be reached. Instead, we try to get a lower bound on the number of additional potential matches that could be found. To achieve that, we count the number of connected components of the matched RootSIFT features and show the result in Fig. A.8 (c). Taking DoG as an example, we show the number of connected components that contain no correct matches found by detector+RootSIFT in Fig. A.8 (d). We find that the matches found by detector+RootSIFT have almost no overlap with the connected components of the matched dense RootSIFT features, meaning that there is great potential to improve feature detectors. Moreover, we notice that there are generally 10 - 20 connected components found by dense RootSIFT. This is on the order of the number of correct matches we could get for day-night image matching, shown in Fig. A.4.

Fig. A.9 shows an example of the correct matches found by DoG+RootSIFT and by dense RootSIFT. For this day-night image pair, DoG+RootSIFT can only find 4 correct matches, whereas dense RootSIFT can find 188 correct matches.

Figure A.9: (a) Correct matches of DoG+RootSIFT. (b) Correct matches of dense RootSIFT.

Fig. A.10 shows the feature points detected using DoG for the day and night images, together with the corresponding heat map of the cosine distance of dense RootSIFT. The colored rectangles in Fig. A.9 (b) and those in Fig. A.10 (a), (b) and (c) mark the same area. It is clearly shown that the cosine distances of points in that area between the day and night images are very large, and Fig. A.9 (b) shows that they can be matched using dense RootSIFT. However, though many feature points can be detected in the daytime image, no feature points are detected by DoG in the nighttime image (the area inside the rectangle of the night image actually has a lot of structure; it appears totally dark due to the low resolution). As a result, matches that could be found by RootSIFT are missed due to the detector. In conclusion, a detector that is more robust to severe illumination changes can help improve the performance of matching day-night image pairs.

Figure A.10: (a) and (b) show an example of a nighttime and a daytime image with feature points detected using DoG. (c) shows the heat map of the cosine distance of dense RootSIFT for (a) and (b).
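The connected-component counting used for the lower bound in this section can be sketched as follows; treating matches as cells on the dense sampling grid with 8-connectivity is an assumption, not the exact clustering rule used here.

import numpy as np
from scipy.ndimage import label

def count_match_components(matched_cells, grid_shape):
    """Lower-bound estimate described above: count connected components of the
    matched dense-feature locations. matched_cells is an (N, 2) array of
    (row, col) indices on the dense sampling grid; 8-connectivity is assumed."""
    mask = np.zeros(grid_shape, dtype=bool)
    mask[matched_cells[:, 0], matched_cells[:, 1]] = True
    _, num_components = label(mask, structure=np.ones((3, 3), dtype=int))
    return num_components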
A.5 Summary

In this work, we evaluated the performance of local features for day-night image matching. Extensive experiments show that repeatability alone is not enough for evaluating feature detectors; descriptors should also be considered. Through the discussion of the precision and recall of matching day-night images, and by examining the performance of dense feature matching, we concluded that there is great potential for improving both feature detectors and descriptors. Thus, further evaluation with parameter tuning and advanced descriptors [189], as well as principled research on the day-night matching problem, is needed.

Bibliography

[1] YiChang Shih, Sylvain Paris, Connelly Barnes, William T. Freeman, and Frédo Durand. Style transfer for headshot portraits. ACM Trans. Graph., 33(4), 2014.

[2] Jonathan T. Barron and Jitendra Malik. Intrinsic scene properties from a single rgb-d image. In CVPR, 2013.

[3] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 1998.

[4] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.

[5] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.

[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.

[7] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.

[8] Xufeng Han, Thomas Leung, Yangqing Jia, Rahul Sukthankar, and Alexander C. Berg. Matchnet: Unifying feature and metric learning for patch-based matching. In CVPR, 2015.

[9] Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua. Lift: Learned invariant feature transform. In ECCV, 2016.

[10] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, 2014.

[11] Abhishek Kar, Christian Häne, and Jitendra Malik. Learning a multi-view stereo machine. In NIPS, 2017.
Rectified linear units improve restricted boltzmann machines. In ICML, 2010. [21] Ji Liu, P. Musialski, P. Wonka, and Jieping Ye. Tensor completion for esti- mating missing values in visual data. IEEE Transactions on PAMI, 35(1), 2013. [22] Wei Deng, Wotao Yin, and Yin Zhang. Group sparse optimization by alter- nating direction method. In SPIE, volume 8858, 2013. [23] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian opti- mization of machine learning algorithms. In NIPS, 2012. [24] Gu?l Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J. Black, Ivan Laptev, and Cordelia Schmid. Learning from synthetic humans. In CVPR, 2017. [25] Wenzheng Chen, Huan Wang, Yangyan Li, Hao Su, Changhe Tu, Dani Lischin- ski, Daniel Cohen-Or, and Baoquan Chen. Synthesizing training images for boosting human 3d pose estimation. In 3DV, 2017. [26] Chi Li, Zeeshan Zia, Quoc huy Tran, Xiang Yu, Gregory D. Hager, and Man- mohan Chandraker. Deep supervision with shape concepts for occlusion-aware 3d object parsing. In CVPR, 2017. 152 [27] Jiajun Wu, Tianfan Xue, Joseph J Lim, Yuandong Tian, Joshua B Tenen- baum, Antonio Torralba, and William T Freeman. Single image 3d interpreter network. In ECCV, 2016. [28] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda Wang, and Russell Webb. Learning from simulated and unsupervised images through adversarial training. In CVPR, July 2017. [29] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio Lopez. The SYNTHIA Dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In CVPR, 2016. [30] Tinghui Zhou, Philipp Kra?henbu?hl, Mathieu Aubry, Qixing Huang, and Alexei A. Efros. Learning dense correspondence via 3d-guided cycle consis- tency. In CVPR, 2016. [31] J. T. Barron and J. Malik. Shape, illumination, and reflectance from shading. IEEE Transactions on PAMI, 37(8), 2015. [32] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde- Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adver- sarial nets. In NIPS, 2014. [33] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016. [34] R. Basri and D. W. Jacobs. Lambertian reflectance and linear subspaces. IEEE Transactions on PAMI, 25(2), 2003. [35] R. Ramamoorthi and P. Hanrahan. On the relationship between radiance and irradiance: Determining the illumination from images of a convex lambertian object. JOSA, 2001. [36] Hao Zhou, Jose M. Alvarez, and Fatih Porikli. Less is more: Towards compact cnns. In ECCV, 2016. [37] Emily L. Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, 2014. [38] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convo- lutional neural networks with low rank expansions. In BMVC, 2014. [39] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014. [40] Ronan Collobert, Koray Kavukcuoglu, and Cle?ment Farabet. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011. 153 [41] A. Vedaldi and K. Lenc. Matconvnet ? convolutional neural networks for matlab. In ACM MM, 2014. [42] Michae?l Mathieu, Mikael Henaff, and Yann LeCun. Fast training of convolu- tional networks through ffts. 
CoRR, abs/1312.5851, 2013. [43] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, 2014. [44] Cristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil. Model com- pression. In ACM SIGKDD, 2006. [45] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In NIPS, 2014. [46] Geoffrey E. Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NIPS 2014 Deep Learning Workshop, 2014. [47] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. In ICLR, 2015. [48] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Penksy. Sparse convolutional neural networks. In CVPR, 2015. [49] Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan V. Oseledets, and Victor S. Lempitsky. Speeding-up convolutional neural networks using fine- tuned cp-decomposition. In ICLR, 2015. [50] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir D. Bourdev. Compressing deep convolutional networks using vector quantization. CoRR, abs/1412.6115, 2014. [51] Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P Vetrov. Tensorizing neural networks. In NIPS, 2015. [52] Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. In NIPS, 1990. [53] Babak Hassibi and David G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In NIPS, 1993. [54] Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for efficient neural networks. In NIPS, 2015. [55] Dong Yu, F. Seide, Gang Li, and Li Deng. Exploiting sparseness in deep neural networks for large vocabulary speech recognition. In ICASSP, 2012. 154 [56] Yoshimasa Tsuruoka, Jun?ichi Tsujii, and Sophia Ananiadou. Stochastic gra- dient descent training for l1-regularized log-linear models with cumulative penalty. In ACL-IJCNLP, 2009. [57] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In AISTATS, 2011. [58] M. D. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G.E. Hinton. On rectified linear units for speech processing. In ICASSP, 2013. [59] Alex Krizhevsky. Learning multiple layers of features from tiny images. Tech- nical report, Technical report, Department of Computer Science, University of Toronto, 2009. [60] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bern- stein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recogni- tion challenge. IJCV, 2015. [61] Soumith Chintala. soumith/imagenet-multigpu.torch. url=https://github.com/soumith/imagenet-multiGPU.torch, 2015. [62] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009. [63] D. Shahlaei and V. Blanz. Realistic inverse lighting from a single 2d image of a face, taken under unknown and complex lighting. In FG, 2015. [64] Miguel Heredia Conde, Davoud Shahlaei, Volker Blanz, and Otmar Loffeld. Efficient and robust inverse lighting of a single face image using compressive sensing. In ICCV Workshops, 2015. [65] B. Peng, W. Wang, J. Dong, and T. Tan. Optimized 3d lighting environment estimation for image forgery detection. IEEE Transactions on IFS, 12(2), 2017. 