ABSTRACT
Title of dissertation: MACHINE LEARNING OF FACIAL
ATTRIBUTES USING EXPLAINABLE,
SECURE, AND GENERATIVE
ADVERSARIAL NETWORKS
Pouya Samangouei
Doctor of Philosophy, 2018
Dissertation directed by: Professor Rama Chellappa
Department of Electrical and
Computer Engineering
“Attributes” are abstractions that humans use to group entities
and phenomena that have a common characteristic. In machine learning (ML), at-
tributes are fundamental because they bridge the semantic gap between humans and
ML systems. Thus, researchers have been using this concept to transform compli-
cated ML systems into interactive ones. However, training the attribute detectors
which are central to attribute-based ML systems can still be challenging. It might
be infeasible to gather attribute labels for rare combinations to cover all the corner
cases, which can result in weak detectors. Also, it is not clear how to fill in the
semantic gap with attribute detectors themselves. Finally, it is not obvious how to
interpret the detectors’ outputs in the presence of adversarial noise.
First, we investigate the effectiveness of attributes for bridging the semantic
gap in complicated ML systems. We turn a system that does continuous authenti-
cation of human faces on mobile phones into an interactive attribute-based one. We
employ deep multi-task learning in conjunction with multi-view classification using
facial parts to tackle this problem. We show how the proposed system decompo-
sition enables efficient deployment of deep networks for authentication on mobile
phones with limited resources.
Next, we seek to improve the attribute detectors by using conditional image
synthesis. We take a generative modeling approach for manipulating the semantics
of a given image to provide novel examples. Previous works condition the generation
process on binary attribute existence values. We take this type of approach one
step further by modeling each attribute as a distributed representation in a vector
space. These representations allow us to not only toggle the presence of attributes
but also to transfer an attribute style from one image to another. Furthermore, we show
diverse image generation from the same set of conditions, which was not possible
using existing methods with a single dimension per attribute.
We then investigate filling in the semantic gap between humans and attribute
classifiers by proposing a new way to explain the pre-trained attribute detectors.
We use adversarial training in conjunction with an encoder-decoder model to learn
the behavior of binary attribute classifiers. We show that after our proposed model
is trained, one can see which areas of the image contribute to the presence/absence
of the target attribute, and also how to change image pixels in those areas so that
the attribute classifier decision changes in a way consistent with human perception.
Finally, we focus on protecting the attribute models from un-interpretable
behaviors provoked by adversarial perturbations. These behaviors create an inex-
plainable semantic gap since they are visually unnoticeable. We propose a method
based on generative adversarial networks to alleviate this issue. We learn the train-
ing data distribution that is used to train the core classifier and use it to detect
and denoise test samples. We show that the method is effective for defending facial
attribute detectors.
Machine Learning of Facial Attributes Using Explainable, Secure,
and Generative Adversarial Networks
by
Pouya Samangouei
Dissertation submitted to the Faculty of the Graduate School of the
University of Maryland, College Park in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
2018
Advisory Committee:
Professor Rama Chellappa, Chair/Advisor
Professor Larry Davis
Professor Behtash Babadi
Professor David Jacobs
Dr. Carlos D. Castillo
© Copyright by
Pouya Samangouei
2018

Acknowledgments
First and foremost I’d like to thank my advisor, Professor Rama Chellappa,
for giving me an invaluable opportunity to work on challenging and extremely in-
teresting projects over the past four years. Although he had been extremely busy,
he always made himself available for help and advice. Besides this dissertation, he
has greatly influenced other aspects of my life and I see him as both an academic
and a life mentor. Words cannot describe how much I feel lucky and grateful to be
advised by such an extraordinary individual.
I owe my sincerest thanks to my family - my mother, father, brother, and
my dear aunt Minoo Mohagheghzadeh who was my first math teacher. They have
always stood by and guided me throughout my career. Words cannot express the
gratitude I owe them. I hope this dissertation is worth the difficulties that they had
to endure because of the forced inhuman travel ban that was in place in the second
half of my Ph.D. period.
My colleagues at the University of Maryland have enriched my graduate life in
many ways and deserve a special mention. Vishal Patel held my hand and directed
me throughout the first chapter of this dissertation, which has led to many
publications, a journal paper, a patent, and a book chapter. Emily Hand, with whom I
wrote a book chapter and collaborated on a patent, was the best office mate
that anybody could want. Maya Kabkab and Mahyar Najibi were terrific coauthors
of four papers, which are already attracting a lot of attention. I also want to thank
Carlos Castillo, Azadeh Alavi, and Jun-Cheng Chen, from whom I learned a lot and
who had a significant role in making my life easier throughout my Ph.D. I also want to
thank Micheal Rotkowitz, Ali Koochakzadeh, and Sina Miran for collaborating on an
interesting paper that was not a part of this dissertation but has been a source of
inspiration.
I want to thank my internship mentors Hamid Sheikh and Raja Bala at Sam-
sung, Wael Abdal-Majeed at Information Sciences Institute, and Nathan Silberman
at Butterfly Network, who exposed me to a diverse set of problems and great
teams. I learned a lot during the periods that I was interning.
I would not have been able to enjoy my time here while working on this
dissertation if it were not for my friends. I want to thank Amir Montazeri, Ali Ak-
bari, Mohammad Ganji, Aida Mostafazadeh, Sina Zamani, Rahmtin Rotabi, Farhad
Saffaraval, Alborz Alavian, Ali Moeini, Mahshid Najafi, Ladan Rabiee, Niloufar
Shadab, Alireza Sheikhattar, Melika Abolhassani, Bahar Zarrin, Arian Asefzadeh,
Kia Karbasi, Nima Moeini, Hasan Lotfi, Soudeh Montazeri, and many more with
whom I made a lot of fantastic memories.
I would also like to acknowledge help and support from some of the staff of the University
of Maryland Center for Advanced Computer Studies. They maintained the computational
resources that I was using throughout my Ph.D. studies and were always
there to address any issues that I had, even on weekends.
My housemates at my place of residence have been a crucial factor in my
finishing smoothly. I’d like to express my gratitude to Daniel Voica, Amy Steele,
Jason Fang, and Bjorn Adamatti for being wonderful housemates.
Chapter five of this research is based upon work supported by the Office of the
Director of National Intelligence (ODNI), Intelligence Advanced Research Projects
Activity (IARPA), via IARPA R&D Contract No. 2014-14071600012. The views
and conclusions contained herein are those of the authors and should not be in-
terpreted as necessarily representing the official policies or endorsements, either
expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S.
Government is authorized to reproduce and distribute reprints for Governmental
purposes notwithstanding any copyright annotation thereon.
Chapter one of this research is based upon work supported by the Office of
the Director of National Intelligence (ODNI), Defense Advanced Research Projects
Agency (DARPA), via DARPA R&D Contract No. FA8750-13-2-0279. The views
and conclusions contained herein are those of the authors and should not be in-
terpreted as necessarily representing the official policies or endorsements, either
expressed or implied, of the ODNI, DARPA, or the U.S. Government. The U.S.
Government is authorized to reproduce and distribute reprints for Governmental
purposes notwithstanding any copyright annotation thereon.
It is impossible to remember all, and I apologize to those I’ve inadvertently
left out.
Table of Contents
Acknowledgments ii
List of Tables vii
List of Figures x
List of Abbreviations xiv
1 Introduction 1
1.1 Attributes for continuous authentication . . . . . . . . . . . . . . . . 4
1.2 Conditional image synthesis from attribute representations . . . . . . 6
1.3 Explaining attribute detectors . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Protecting against induced semantic gap . . . . . . . . . . . . . . . . 10
1.5 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2 Attribute-based continuous active authentication 12
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Attribute-based Active Authentication on Mobile Devices . . . . . . . 19
2.4 Efficient Deep Features for Attribute Detection on Mobile Phones . . 41
3 Conditional Image Generation From Attribute Vectors 61
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.3 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4 Model Explanation via Decision Boundary Crossing Transformations 76
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.4 Priors for Interpretable Image Transformations . . . . . . . . . . . . . 86
4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5 Defense-GAN: Protecting Classifiers Against Adversarial Attacks Using Gen-
erative Models 99
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.2 Related work and background information . . . . . . . . . . . . . . . 101
5.3 Proposed Defense-GAN . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.5 Optimality of p_g = p_data for WGANs . . . . . . . . . . . . . . . . . 120
5.6 Difficulty of GD-based white-box attacks on Defense-GAN . . . . . . 121
5.7 Neural network architectures . . . . . . . . . . . . . . . . . . . . . . . 122
5.8 Qualitative examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.9 Additional results on the effect of varying the number of GD iterations
L and random restarts R . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.10 Additional results on white-box attacks . . . . . . . . . . . . . . . . . 126
5.11 Time complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.12 An unsuccessful attempt to attack Defense-GAN . . . . . . . . . . . . 128
6 Summary and Future Directions 132
6.1 Summaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
Bibliography 138
List of Tables
2.1 Accuracies of the 44 attribute classifiers proposed in this work on the
PubFig dataset [Kumar et al., 2009]. . . . . . . . . . . . . . . . . . . 26
2.2 Accuracies of the attribute classifiers proposed in this work on avail-
able attributes on the FaceTracer dataset [Kumar et al., 2008]. . . . . 27
2.3 Accuracy of the attribute classifiers for CNNAA [Fathy et al., 2015]
dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4 The EER values for different methods on the MOBIO dataset. . . . . 29
2.5 The EER values corresponding to the cross-device experiment on the
MOBIO dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.6 The EER values of different methods for the AA01 dataset. . . . . . . 32
2.7 The EER values corresponding to the cross-session experiments for
the AA01 dataset. 1 is the office light session, 2 is the low light
session, 3 is the natural light session. . . . . . . . . . . . . . . . . . . 34
2.8 Average memory usage per attribute classifier for full face . . . . . . 36
2.9 Comparison of EER values for LBP, attribute detectors of Section
2.3.1, linear models of Section 2.3.8. The scale L-SVM 1 is trained
on images of size 128 × 168 and the rest are scaled by the indicated
value. The best EER is gained from L-SVMs of Section 2.3.8 with
scale 0.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.10 The speed and power consumption of different realizations of the classifiers
learned with the simplified training framework on a Google Nexus
5 device. The W/O column means our algorithm extracts all the attributes
given an aligned and cropped face. In the last row, we assume that we are
doing authentication at a speed of 1 fps. The W/ column first detects
the face and then extracts attributes. We can authenticate every second
for 17.6 hours employing our classifiers with the best EER of Table 2.9 on
a Nexus 5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.11 The performance comparison of attribute detection methods. . . . . . 42
2.12 The architectures of our networks. The number of parameters de-
pends on the face region that they operate on and can be found in
Table 2.16. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.13 The face regions that are extracted by cropping around the land-
mark points and their corresponding number of attributes. A Multi*-
CNNAA that operates on a face crop has “No. of Attributes” tasks. . 44
2.14 The EER values for the different experiments on AA01 [Fathy et al.,
2015] dataset. The sessions numbers are: 1. Office light 2. Low light
3. Natural light. DiscAttrs column contains the EER values using
the discovered attributes. . . . . . . . . . . . . . . . . . . . . . . . . 53
2.15 The EER values corresponding to MOBIO dataset experiments. . . . 57
2.16 Network size and prediction speed of the networks. The D-* means it
has MultiDeep-CNNAA architecture and W-* means it is MultiWide-
CNNAA . The Binary*-CNNAA network prediction times are just for
one attribute. For all of them together it will be 40 times this value. . 58
4.1 AOSPC value (higher is better, see (4.17)) after 10 steps for different segmentation
thresholds. Although ExplainGAN is not directly optimized for this metric, its
performance is comparable to reasonable baselines for explanation in classifiers.
A larger AOSPC means that the sensitivity of the segments that are perturbed
in 10 steps is higher. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.2 Quantitative substitutability experiments across datasets. Class 0 and Class 1 are
the classes that the given classifier is trained to identify. Transformed/Composite
0/1 column shows the accuracy of the classifiers when just transformations/compositions
of the images are used at training time. Ceiling represents the accuracy of the base
classifier on the same test set. . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.3 Substitutability on Ultrasound Dataset. Transformed/Composite 0/1 shows the
accuracy of a classifier on test set when the original samples are replaced with
Transformed/Composite 0/1 at training phase. Both Transformed/Composite
shows the accuracy of the classifier when all of the images are replaced with
Transformed/Composite. Note that PixelDA is a one-way transformer. . . . . . 98
5.1 Classification accuracies of different classifier and substitute model
combinations using various defense strategies on the MNIST dataset,
under FGSM black-box attacks with ε = 0.3. Defense-GAN has L =
200 and R = 10. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.2 Classification accuracies of different classifier and substitute model
combinations using various defense strategies on the F-MNIST dataset,
under FGSM black-box attacks with ε = 0.3. Defense-GAN has
L = 200 and R = 10. . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.3 Classification accuracy of Model F using Defense-GAN (L = 400,
R = 10), under FGSM black-box attacks for various noise norms ε
and substitute Model E. . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.5 Neural network architectures used for classifiers and substitute models. 123
5.8 Classification accuracy of Model F using Defense-GAN with various
number of iterations L (R = 10), on the MNIST dataset, under FGSM
black-box attack with ε = 0.3. . . . . . . . . . . . . . . . . . . . . . . 125
5.11 Classification accuracy of Model F using Defense-GAN with various
number of random restarts R (L = 100), on the F-MNIST dataset,
under FGSM black-box attack with ε = 0.3. . . . . . . . . . . . . . . 126
5.13 Average time, in seconds, to compute reconstructions of MNIST/F-
MNIST images for various values of L and R. . . . . . . . . . . . . . 127
5.4 Classification accuracies of different classifier models using various de-
fense strategies on the MNIST (top) and F-MNIST (bottom) datasets,
under FGSM, RAND+FGSM, and CW white-box attacks. Defense-
GAN has L = 200 and R = 10. . . . . . . . . . . . . . . . . . . . . . 129
5.6 Neural network architectures used for GANs. . . . . . . . . . . . . . . 130
5.7 Neural network architecture used for the MagNet encoder. . . . . . . 130
5.9 Classification accuracy of Model F using Defense-GAN with various
number of iterations L (R = 10), on the F-MNIST dataset, under
FGSM black-box attack with ε = 0.3. . . . . . . . . . . . . . . . . . . 130
5.10 Classification accuracy of Model F using Defense-GAN with various
number of random restarts R (L = 100), on the MNIST dataset,
under FGSM black-box attack with ε = 0.3. . . . . . . . . . . . . . . 131
5.12 Classification accuracies of different classifier models using various
defense strategies on the CelebA gender classification task, under
FGSM, RAND+FGSM, and CW white-box attacks. Defense-GAN
has L = 200 and R = 2. . . . . . . . . . . . . . . . . . . . . . . . . . 131
List of Figures
2.1 Overview of our attribute-based authentication method. . . . . . . . . 19
2.2 Training phase pipeline for each attribute classifier. Landmarks are
first detected on a given face. Different facial components are then
extracted from these landmarks. Then for each part, features are
extracted with different cell sizes and the dimensionality of features
is reduced using principal component analysis. Classifiers are then
learned on these low-dimensional features. Finally, top five Cls are
selected as our attribute classifier. . . . . . . . . . . . . . . . . . . . 20
2.3 Illustration of our attribute classifiers on sample face images from the
AA01 (first two images) and the MOBIO (last image) datasets. . . . 25
2.4 Sample images from the MOBIO dataset. One can clearly see the
different illumination conditions in this dataset. . . . . . . . . . . . . 28
2.5 Performance evaluation on the MOBIO dataset. . . . . . . . . . . . 29
2.6 Cross device robustness. Laptop session videos are used for enroll-
ment and the data from the remaining sessions are used for testing.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.7 Sample images from the AA01 dataset. (a), (b) and (c) show some
sample images from session 1, 2 and 3, respectively. . . . . . . . . . . 31
2.8 Performance evaluation for the AA01 dataset. . . . . . . . . . . . . . 32
2.9 Session-specific performance evaluations for the AA01 dataset. (a)
Gallery and probe data from session 1. (b) Gallery and probe data
from session 2. (c) Gallery and probe data from session 3. (d) Gallery
data from session 1 and probe data from sessions 2 and 3. (e) Gallery
data from session 2 and probe data from sessions 1 and 3. (f) Gallery
data from session 3 and probe data from sessions 2 and 1. . . . . . . 33
2.10 Comparison of linear SVMs with model learned in section 2.3.1 and
LBP. The best result among all is achieved with linear models of scale
0.5 i.e. face crop size of 64× 80. . . . . . . . . . . . . . . . . . . . . . 39
2.11 Sample images from subspace clustering of face part embedding in
attribute space. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.12 Sample images of the three sessions of the AA01 dataset. . . . . . . . 52
2.13 ROC curve of different experiments on AA01 [Fathy et al., 2015] and
MOBIO [McCool et al., 2012] dataset. (a) is the ROC curve of AA01
with all of the sessions together in gallery and probe. (b) is the ROC
curve of MOBIO with all of the mobile sessions together with the last
session videos as gallery and the rest of the session as probe. (c) is
the ROC curve of the cross-device experiment. . . . . . . . . . . . . 53
2.14 Sample images of the three sessions of the MOBIO dataset. First row
images are from different sites, second row is the pairs with the same
identities in two different sessions. . . . . . . . . . . . . . . . . . . . 55
3.1 Visualization of approaches to attribute manipulation as graphical
models. (a) The graphical model for Fader Networks. (b) The graph-
ical model for CRISPR. . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.2 Architectural overview of the CRISPR architecture, best viewed in
color. The image x is encoded by E producing invariance vector z and
attribute vectors a0 and a1. These are concatenated and passed to de-
coder G which produces reconstruction x̂. Each continuous attribute
vector is decoded using a linear classifier Hi. Auxiliary components
of the architecture are framed by the dashed green outline. Discrim-
inator D encourages the reconstructions to both appear natural and
exhibit the expected attributes, classifier Cinv encourages z to encode
only non-attribute information and classifier Cattr encourages each
attribute vector to encode only its attribute information. . . . . . . . 66
3.3 Examples illustrating the Diverse Swap operation using the ’bangs’
attribute. The left-most column demonstrates the inputs and the
remaining columns illustrate the results of sampling from the space
of bangs attributes. . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.4 Examples illustrating the Diverse Swap operation using the ’eye-
glasses’ attribute. The left-most column demonstrates the inputs
and the remaining columns illustrate the results of sampling from
the space of eye-glasses attributes. The results demonstrate not only the
diversity of the selection of eye-glasses, but also a failure case (second
row, last image): the sampled attributes may not all be atomic. . . . 75
3.5 Examples illustrating the Borrow operation using the ’smile’ attribute.
As these examples illustrate, CRISPR is able to effectively borrow the
smile from the reference image and apply it to the query image. . . . 75
4.1 Model architecture of ExplainGAN. Inference (in blue frame) consists
of passing an image x of class j into the appropriate encoder Ej to
produce a hidden vector zj. The hidden vector is decoded to simulta-
neously create its reconstruction Gj(zj), a transformed image of the
opposite class G1−j(zj) and a mask showing where the changes were
made Gm(zj). Composite images C0 and C1 merge the reconstruction
and transformation with the original image x. . . . . . . . . . . . . . 83
4.2 An example of Ultrasound images from our Medical Ultrasound dataset.
(a) A canonical Apical 2 Chamber view. (b) A canonical Apical 4
Chamber view. (c) A difficult Apical 2 Chamber view that is easily
confused for a 4 Chamber view. (d) A difficult Apical 4 Chamber
view that is easily confused for a 2 Chamber view. . . . . . . . . . . 89
4.3 Qualitative visualization of the ExplainGAN model on two datasets:
CelebA and our Medical Ultrasound dataset. The “input” column
represents images x ∈ S0, the “transformed” column represents Ex-
plainGAN’s transformation, G1(z0), to the opposite class. The “mask”
column illustrates the model’s changes, Gm(z0), and the “composite”
column shows the composite images, C1(z0). The results indicate
that in the case of object-related transformations, such as glasses
or mustaches, ExplainGAN effectively performs a weakly supervised
segmentation of the object. In the ultrasound case, ExplainGAN il-
lustrates which anatomical areas the model is cuing on: the right
ventricle and pericardium. . . . . . . . . . . . . . . . . . . . . . . . . 92
4.4 Comparison of different methods for explaining the model’s decision. Fashion-MNIST:
transforming from pullover to shirt, Ultrasound: transforming from
A2C to A4C (see Figure 4.2 for examples of A2C and A4C views), CelebA: transforming
from faces without eyeglasses to faces with eyeglasses, MNIST: transforming
from 4 to 9. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.5 Boundary-crossing images have varying explanatory power: images carry more
explanatory power if (1) they can be used as substitutes in the original dataset
without affecting the classifier and (2) they are different from a query image in
small and easily localized ways. (a) displays an image classified as a jacket and
not a pullover. (b) shows an image of a pullover which is substitutable and whose
localized mask illustrates the model’s belief that removing a zipper and the jacket
ribs would make the original image into a pullover. (c) shows another pullover
but its non-localized mask doesn’t explain why this is a pullover and not a jacket.
(d) shows an adversarial image which is completely unsubstitutable and provides
no localized explanation. . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.1 Overview of the Defense-GAN algorithm. . . . . . . . . . . . . . . . . 110
5.2 L steps of Gradient Descent are used to estimate the projection of
the image onto the range of the generator. . . . . . . . . . . . . . . . 110
5.3 Classification accuracy of Model F using Defense-GAN on the MNIST
dataset, under FGSM black-box attacks with ε = 0.3 and substitute
Model E. Left: various number of iterations L are used (R = 10).
Right: various number of random restarts R are used (L = 100). . . . 116
5.4 ROC Curves when using Defense-GAN MSE for FGSM attack detec-
tions on the MNIST dataset (Classifier Model F, Substitute Model
E). Left: Results for various number of GD iterations are shown with
R = 10, ε = 0.30. Middle: Results for various number of random
restarts R are shown with L = 100, ε = 0.30. Right: Results for
various ε are shown with L = 400, R = 10. . . . . . . . . . . . . . . . . 117
5.5 ROC Curves when using Defense-GAN MSE for FGSM attack detec-
tions on the F-MNIST dataset (Classifier Model F, Substitute Model
E). Left: Results for various number of GD iterations are shown with
R = 10, ε = 0.30. Middle: Results for various number of random
restarts R are shown with L = 100, ε = 0.30. Right: Results for
various ε are shown with L = 200, R = 10. . . . . . . . . . . . . . . . . 118
5.6 Examples from MNIST and F-MNIST. Left: Original, FGSM adversarial
ε = 0.3, and reconstruction images for R = 1 and various L are
shown. Right: Original, FGSM adversarial ε = 0.3, and reconstruction
images for L = 25 and various R are shown. . . . . . . . . . . . . . 124
5.7 Examples from MNIST and F-MNIST: Original, FGSM adversarial
and reconstruction images for L = 50, R = 15 and various ε are shown. 124
5.8 Classification accuracy of different models using Defense-GAN on the
MNIST dataset, under FGSM white-box attack with ε = 0.3, for
various number of iterations L and R = 10. . . . . . . . . . . . . . . . 126
List of Abbreviations
ACAA Attribute-based Continuous Active Authentication
CNN Convolutional Neural Networks
CPU Central Processing Unit
CW Carlini and Wagner
CNNAA Convolutional Neural Networks for Active Authentication
DCGAN Deep Convolutional GAN
DCNN Deep Convolutional Neural Networks
DTD Deep Taylor Decomposition
EER Equal Error Rate
FGSM Fast Gradient Sign Method
FAR False Acceptance Rate
GAN Generative Adversarial Networks
GB Gigabyte
GPU Graphics Processing Unit
GradCAM Gradient-weighted Class Activation Mapping
HOG Histogram of Oriented Gradients
KNN K-nearest neighbors
LBP Local Binary Patterns
LRP Layer-wise Relevance Propagation
MB Megabyte
ML Machine Learning
MSE Mean squared error
PCA Principal Component Analysis
RAND Random
RAM Random Access Memory
RBF Radial Basis Function
ROC Receiver Operating Characteristic
SSC Sparse Subspace Clustering
SVM Support Vector Machine
TAR True Acceptance Rate
TV Total Variation
WGAN Wasserstein GAN
Chapter 1: Introduction
The notion of the “semantic gap” between humans and machine learning (ML)
algorithms/systems has become increasingly important. When the gap exists, there
is a lack of interpretation for how an ML algorithm or system behaves. This
disconnect can arise at different levels of the ML system, which often results in
unpredictable behaviors, invalid confidence measures, and unnecessarily complicated
machine learning models. Besides the semantic gaps that occur naturally, there
can be induced semantic gaps, which can make ML systems insecure and cause them
to show unexplainable behaviors.
In this dissertation, we focus on the concept of “attributes” for bridging the
semantic gap. Given a set of entities, an attribute is a common characteristic of the
items which divides the set into non-overlapping partitions (usually two:
one partition that has the attribute and one that does not). For example, if the
entities are face images, “gender” is an attribute that divides the set into images
of “male” and “female” faces. More formally, an attribute detector f is defined as
$f : X \to [0, 1]$, where $X \in \mathbb{R}^{W \times H \times C}$ is the input image, 1 corresponds to the
presence of the attribute, and 0 to its absence.
Using multiple attributes to identify entities has been a popular approach for
decomposing complicated ML tasks into simpler ones. This approach can either
mimic the complex behavior or narrow down the search space significantly leading
to a more robust and efficient system. For example, in face recognition
[Kumar et al., 2009] and object recognition [Farhadi et al., 2009], a collectively
large set of attributes has been shown to be enough to obtain an accurate
identity-preserving representation for faces and objects. This idea can be especially interesting
for converting demanding algorithms into amenable ones for devices with limited
resources. In this dissertation, we look at the problem of continuous authentication
on mobile phones.
Training robust attribute detectors is therefore central to attribute-based ML
systems. In this dissertation, we employ deep convolutional neural networks (CNNs)
for the attribute classifiers since they have proven to work well for this task [Levi and
Hassner, 2015], as well as many other computer vision tasks. However, deploying
CNNs on mobile phones is challenging because of the limited resources [Sarkar et al.,
2016]. We address this challenge by introducing deep CNNs for attribute-based
continuous authentication that run efficiently on mobile phones.
To be more efficient and to exploit inter-attribute dependencies, a single CNN
is usually trained to predict all of the attributes together. During the training
of such models, each training example x ∈ X_train has a target vector
y = [y_0 ... y_{n−1}], where each dimension indicates the presence of one of the n
attributes. As n increases, the label space expands exponentially, and therefore
more training data is needed to cover the whole label space. Thus, there are usually
attributes that are not present in most of the images. For example, there might not
be enough face images with the attribute “mustache” present. We take a generative
modeling approach to address this issue by learning to capture the attributes in
vectorized representations and transferring them to images that do not have those
attributes.
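The coverage problem described above can be made concrete with a small sketch. The three attribute names and the toy label set are purely illustrative assumptions, not data from this dissertation; the point is only that with n binary attributes there are 2^n possible label vectors, and a training set can easily miss every combination that involves a rare attribute.

```python
from itertools import product


def label_space_coverage(labels, n):
    """Return (number of observed combinations, number of possible ones)."""
    observed = {tuple(y) for y in labels}
    return len(observed), 2 ** n


# Toy labels for n = 3 hypothetical attributes: (male, smiling, mustache).
# The values are illustrative only; "mustache" never occurs.
train_labels = [
    (1, 0, 0), (1, 1, 0), (0, 0, 0), (0, 1, 0), (1, 0, 0), (0, 1, 0),
]

observed, possible = label_space_coverage(train_labels, n=3)
missing = sorted(set(product((0, 1), repeat=3)) - set(train_labels))
print(f"{observed} of {possible} combinations observed; missing: {missing}")
```

In this toy set every missing combination has the third attribute switched on, which is the situation that motivates synthesizing examples of rare attributes.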
After converting an ML system into an attribute-based one and training good
attribute detectors, there still exists one semantic gap: what does a deep black-box
attribute detector do to make a prediction? Understanding the behavior of
deep models can be critically important for certain applications, such as ML-based
evidence in courts of law, education, or even model debugging. To understand this
behavior, we need an algorithm whose outcomes are guaranteed to be
interpretable. Earlier works have defined the explanations as a sensitivity
map of the model output with respect to the given input image [Zhou et al., 2016,
Selvaraju et al., 2016a]. However, there are no guarantees that these masks are
interpretable by humans, and also it is not known how the pixel values should change
in those areas so that the outcome of the binary attribute classifier is toggled. In
this dissertation, we propose a new way of exploring this issue which results in
interpretable masks and also provides pixel-wise values that make the classifier
cross the decision boundary.
Besides explaining the behavior of deep attribute models, it is essential to
protect the detectors against induced uninterpretable outcomes. Adversarial agents
induce such outcomes by adding small perturbations to the input pixels. These
perturbations are usually unnoticeable; thus, from the user perspective, it seems
that the network gives two different outputs for the same image. In the literature,
defenses against these adversarial perturbations rely on strong prior assumptions about the
type of attacks [Goodfellow et al., 2014b]. However, in a realistic scenario, we have
no prior information about the type of attacks that can happen on a given ML
system. We propose a method based on generative adversarial networks to solve the
problem with no assumptions on the attacks.
1.1 Attributes for continuous authentication
Continuous authentication, as opposed to one-time initial authentication
of users with passwords or biometrics, protects the mobile device even if
the initial authentication is bypassed. Such a system can be designed without using
any machine learning algorithms, for instance by asking for the password every
minute. However, it is more desirable if the continuous authentication method is
non-intrusive, meaning that it does not interrupt the usage of the device. ML-
based solutions have been pursued to alleviate this issue [McCool et al., 2012]. Such
systems produce a similarity score between the current and enrolled user repre-
sentations. These representations are inferred from sensory inputs such as camera
images. The calculated representations usually take the form of vectors that are
not interpretable on their own.
We are the first to propose using attributes for mobile continuous authentication
and demonstrate its multiple benefits [Samangouei et al., 2015, Samangouei
et al., 2017b, Samangouei et al., 2017a]. From the semantic gap perspective, different
types of interaction are possible with our proposed system. One can enroll the users
by entering the attributes manually. This is desirable because enrolling the user
using conventional methods boils down to saving an inferred representation which
may not be robust enough. Another type of interaction can be an understandable
explanation of why the user is locked out: “Attribute A” does not match the at-
tribute of the enrolled user. Besides user experience advantages, this has technical
benefits too: the system can first check the attributes, which is the easier task, and
only then invest resources in running a more complicated model.
We train binary attribute classifiers which provide compact visual descriptions
of faces. The learned classifiers are applied to the image of the current user of a
mobile device to extract the attributes and then authentication is done by simply
comparing the calculated attributes with the enrolled attributes of the original user.
Extensive experiments on two publicly available unconstrained mobile face video
datasets show that our method is able to capture meaningful attributes of faces
and performs better than the previously proposed authentication method. We also
provide a feasible variant of our method for efficient continuous authentication on
an actual mobile device by doing extensive platform evaluations of memory usage,
power consumption, and authentication speed.
1.1.1 Bringing CNNs to mobile phones
We also use the fact that attributes decompose complicated systems into sim-
pler ones to bring deep convolutional neural networks (CNNs) to mobile phones
[Samangouei and Chellappa, 2016]. Despite the excellent performance of CNNs in
various inference tasks, the ones that are used for face verification in the wild are too
computationally expensive to be deployed on mobile phones for continuous authenti-
cation. We follow our initial approach for performing attribute-based authentication
and bring deep efficient CNNs to devices. The proposed CNNs implement multi-
tasking and multi-view approaches. The multi-tasking approach is needed to predict
multiple attributes from the same backbone network. The multi-view approach enables
the processing of different components of the face, such as the eyes or mouth.
Our multi-task, part-based CNN architecture for attribute detection performs
better than previously proposed methods in terms of accuracy for attribute pre-
diction. As a byproduct of the proposed architecture, we are able to explore the
embedding space of the attributes extracted from different facial parts, such as the
mouth and eyes, to discover new attributes. Furthermore, through extensive
experimentation, we show that the attribute features extracted by our method outperform
our initially presented attribute-based method and a baseline method for the task of
active authentication. Lastly, we demonstrate the effectiveness of the proposed ar-
chitecture in terms of speed and power consumption by deploying it on an actual
mobile device.
1.2 Conditional image synthesis from attribute representations
As the number of attributes increases for training attribute detectors,
there is a higher chance of missing, or having only a few examples of, some
attribute combinations. To address this issue, we try to disentangle the attribute
information from invariant information using generative models. As a result, we
can transfer the attribute from one image to the other or change the existence
of multiple attributes at the same time. Our method falls under the umbrella of
conditional image synthesis. Aside from this application, such methods can
be used for semantic image manipulation, i.e., controlling the existence of specific
attributes in the image.
Early works on conditional image generation have represented an attribute
with a single dimension [Lample et al., 2017, Choi et al., 2017] and allotted a separate
representation to the non-attribute characteristics of the image. These methods consist of
an encoder f(X) → z that maps images into latent codes z and a generator
g(z, y) → X̂ that takes z and attribute labels y and generates the image X̂. Transferring
the attributes of image X_1 to X_2 is therefore trivial and happens by generating the
image g(z_2, y_1). Although these methods can toggle the presence of attributes, they
have no sense of the style of attributes. The style of an attribute can be thought of as
the attribute of the attribute. For example, in the case of face images, the attribute
“smiling” occurs in either of two styles: mouth open or mouth closed. This
limitation arises because the encoder is encouraged to output a representation z that has
no information about the attributes. This results in a generator that couples the
“style” of the attributes to the z vector. Thus, when toggling the “smiling” dimen-
sion from “off” to “on” for some input image, the generator has no way of knowing
which type of smile to put on the face.
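The loss of style under a binary attribute label can be illustrated with a deliberately simplified toy. Everything here is a hypothetical stand-in: `ToyFace`, `encode`, `binary_label`, and `generate` are not the models discussed above, and the fixed `default_style` merely stands in for whatever style a real generator has tied to z. The sketch only shows that g(z, y) receives the same inputs whether the source smile was open or closed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyFace:
    identity: str                # stands in for attribute-invariant content
    smile_style: Optional[str]   # None (not smiling), "open", or "closed"


def encode(x: ToyFace) -> str:
    """f(X) -> z: keep only the attribute-invariant information."""
    return x.identity


def binary_label(x: ToyFace) -> int:
    """y: 1 if the 'smiling' attribute is present, else 0."""
    return int(x.smile_style is not None)


def generate(z: str, y: int, default_style: str = "closed") -> ToyFace:
    """g(z, y) -> X_hat: y carries no style, so the style is a fixed guess."""
    return ToyFace(identity=z, smile_style=default_style if y else None)


x_open = ToyFace("person_a", "open")
x_closed = ToyFace("person_b", "closed")
x_target = ToyFace("person_c", None)

# Transfer "smiling" from two differently styled sources to the same target.
from_open = generate(encode(x_target), binary_label(x_open))
from_closed = generate(encode(x_target), binary_label(x_closed))
print(from_open == from_closed)
```

Both transfers produce the same face, because the only information passed from the source image is a single bit.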
We introduce a new method for conditional image generation based on at-
tribute manipulation. As mentioned before, recently introduced algorithms explic-
itly encode images as a combination of an embedding vector and a series of discrete
binary attributes which encode whether a visual attribute is present or not in a
given image. By encoding a query image and flipping these binary attributes, visual
elements can be toggled, e.g., turned on or off. Unfortunately, such an approach
is limited to representing the presence, or lack thereof, of a particular, categorical
attribute and cannot be used to sample from a variety of attribute realizations. For
example, this approach might be used to specify whether or not a person is wearing
a hat, but cannot be used to sample from a variety of hats. To address this limita-
tion, we introduce a model for attribute-based image manipulation that represents
visual attributes in a continuous embedding space. This approach allows for two
new types of attribute-based manipulations: Diverse Swaps and Borrows. Diverse
Swaps allow us to turn visual attributes on and off by sampling, producing diverse
results. A Borrow allows us to encode a particular realization of an attribute
from a reference image and inject it into a query image. We demonstrate the efficacy
of our method on a challenging dataset.
1.3 Explaining attribute detectors
Attribute detectors can be studied themselves in terms of the semantic gap.
The semantic gap can be investigated from multiple views. In this dissertation we
are interested in looking at this problem from another viewpoint: given a black box
function fθ (an attribute detector for example), “what” does it do with respect to
a given input? If we can understand what a classifier is doing, we know how we
should change the input image to change the decision of the classifier. Therefore,
answering the question boils down to understanding which pixel values we should change,
and how we should change those values to change the output of fθ. One therefore
expects an “explanation” algorithm to produce a saliency map and a pixel value
map that together cross the decision boundary of the black box classifier fθ. One
other component of “explanation” is the context that we want to explain fθ in. In
this dissertation, we focus on explanations that are meant for humans.
Previous methods have focused on the gradient of the classification loss
of fθ with respect to the input [Simonyan and Zisserman, 2014]. Such solutions
have some limitations. First of all, the gradient values depend on the loss function,
which is not a part of fθ and may not necessarily be the same loss function that fθ was
trained with. The second fundamental problem comes from the fact that gradients
are first-order approximations of highly nonlinear functions. Therefore, as we get
further away from the input image X, the gradient directions become less accurate
in changing the loss value. Besides conceptual problems, there are also technical
issues such as the vanishing gradient phenomenon that occurs when fθ is a deep
architecture. Because of these issues, the saliency maps that come out of these methods
are not visually comprehensible by humans. Although there have been attempts to
make the produced saliency map more visually plausible [Fong and Vedaldi, 2017],
existing works provide no means of changing the pixel values so that the decision of
fθ is changed.
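The first-order limitation mentioned above can be checked numerically on a toy model. The sketch below assumes a three-input logistic "detector" with made-up weights, and it differentiates the detector output directly rather than a loss, purely for simplicity. It computes a gradient-magnitude saliency and then measures how far the linear prediction drifts from the true output as the step along the gradient grows.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def f(x, w):
    """Toy detector: logistic function of a linear score."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def grad_f(x, w):
    """Analytic gradient of f with respect to the input x."""
    p = f(x, w)
    return [p * (1.0 - p) * wi for wi in w]


w = [2.0, -1.0, 0.5]   # hypothetical weights
x = [0.1, 0.4, -0.2]   # hypothetical input
g = grad_f(x, w)
saliency = [abs(gi) for gi in g]   # gradient-magnitude saliency


def linearization_error(step):
    """|true output - first-order prediction| after moving step * g from x."""
    x_new = [xi + step * gi for xi, gi in zip(x, g)]
    predicted = f(x, w) + step * sum(gi * gi for gi in g)
    return abs(f(x_new, w) - predicted)


errors = [linearization_error(s) for s in (0.1, 1.0, 10.0)]
print(saliency, errors)
```

The error is small for a short step and grows as the step increases, since the sigmoid saturates while the linear prediction keeps rising.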
We introduce a new method for interpreting computer vision models: visually
perceptible, decision-boundary crossing transformations [Samangouei et al., 2018b].
Our goal is to answer a simple question: why did a model classify an image as being
of class A instead of class B? Existing approaches to model interpretation, including
saliency and explanation-by-nearest neighbor, fail to explicitly illustrate examples
of transformations required for a specific input to alter a model’s prediction. On
the other hand, algorithms for creating decision-boundary crossing transformations
(e.g., adversarial examples) produce differences that are visually imperceptible and
do not enable insightful explanation. To address this we introduce ExplainGAN,
a generative model that produces visually perceptible decision-boundary crossing
transformations. These transformations provide high-level conceptual insights which
illustrate how a model makes decisions. We validate our model using traditional
quantitative interpretation metrics and also introduce a new validation scheme for such
an approach.
1.4 Protecting against induced semantic gap
The goal of adversarial attacks is to change the input image in such a way that the
decision of the target classifier changes. These types of attacks are usually carried out
by solving an optimization problem with respect to the input image and an adversarial
loss function. The existing defense methods [Goodfellow et al., 2014b, Meng
and Chen, 2017] assume prior information about the type of attack that is going to
happen and are thus limited to the types of attack they consider.
We propose Defense-GAN [Samangouei et al., 2018a], a new framework lever-
aging the expressive capability of generative models to protect deep neural networks
against such attacks. Defense-GAN is trained to model the distribution of unperturbed
images. At inference time, it finds an output close to a given image that does
not contain the adversarial changes. This output is then fed to the classifier. Our
proposed method can be used with any classification model and does not modify the
classifier structure or training procedure. It can also be used as a defense against
any attack as it does not assume knowledge of the process for generating the ad-
versarial examples. We empirically show that Defense-GAN is consistently effective
against different attack methods and improves on existing defense strategies.
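The idea of finding a generator output close to a given image can be sketched on a toy problem. The sketch assumes that the search is done by gradient descent on the latent code z with a few random restarts; this introduction does not spell out the search procedure, so treat that as an assumption. The linear two-pixel `generator` is a hypothetical stand-in for a trained GAN, and the final classifier step is omitted.

```python
import random


def generator(z):
    """Toy stand-in for a trained generator: 1-D latent z to a 2-pixel 'image'."""
    return [z, 2.0 * z]


def reconstruction_loss(z, x):
    return sum((gi - xi) ** 2 for gi, xi in zip(generator(z), x))


def project_onto_generator(x, steps=200, lr=0.05, restarts=4, seed=0):
    """Search for z minimizing ||G(z) - x||^2 and return G(z)."""
    rng = random.Random(seed)
    best_z, best_loss = None, float("inf")
    for _ in range(restarts):
        z = rng.uniform(-1.0, 1.0)
        for _ in range(steps):
            # Analytic d/dz of (z - x0)^2 + (2z - x1)^2 for this toy generator.
            grad = 2.0 * (z - x[0]) + 4.0 * (2.0 * z - x[1])
            z -= lr * grad
        loss = reconstruction_loss(z, x)
        if loss < best_loss:
            best_z, best_loss = z, loss
    return generator(best_z)


clean = generator(1.0)                          # [1.0, 2.0]
perturbed = [clean[0] + 0.3, clean[1] - 0.15]   # off-manifold perturbation
purified = project_onto_generator(perturbed)
print(purified)
```

Because the perturbation here points away from the set of images the toy generator can produce, the projection discards it; in a real system the purified output would then be passed to the unmodified classifier.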
1.5 Organization
The organization of this dissertation is as follows. Chapter 2 discusses
attribute-based continuous authentication and a deep model for authenticating on
mobile phones. In Chapter 3 we propose the conditional model for manipulating
semantic attributes. Chapter 4 introduces our method for explaining black box
attribute detectors. Chapter 5 presents our method for protecting the attribute
classifiers against adversarial attacks. Finally, Chapter 6 concludes the dissertation
and discusses future research directions.
Chapter 2: Attribute-based continuous active authentication
2.1 Introduction
Attributes are semantic entities that are easier to learn than individual classes.
A collection of them, however, is beneficial both intuitively and in practice
[Kumar et al., 2009] for discriminating between object classes. We exploit this fact to design efficient
methods for attribute detection. To test the efficiency, we focus on continuous au-
thentication on mobile devices which have constraints on computation and power
resources. Mobile devices, such as cell phones, tablets, and smart watches have
become integral components of people’s lives. The users often store valuable infor-
mation such as bank account details or credentials to access their sensitive accounts
on their mobile phones. Typical devices integrate no automatic mechanism to au-
thenticate the users. According to the survey in [Inc, 2013], nearly half of the users
do not have any form of authentication for their phones. Besides this, if the ini-
tial password-based authentication is compromised, the personal information of the
users is exposed since there are no active authentication methods incorporated in
the mobile phone.
In the first part of this chapter, we focus on demonstrating that attribute fea-
tures are practical for active authentication on mobile devices as a part of DARPA’s
active authentication program. We are the first to propose using attributes
as an intuitive, effective, and efficient way of authenticating users on smartphones.
We train support vector machine (SVM) classifiers on conventional features such
as Histogram of Oriented Gradients (HOG) [Dalal and Triggs, 2005] and Local Binary
Pattern (LBP) [Ahonen et al., 2006] features extracted from different face parts and fuse
their scores into the final attribute response. We show the feasibility of this approach by
implementing it on a smartphone and testing the speed and battery consumption
of different parts.
In the second part, motivated by the emergence of the large-scale face attribute dataset
CelebA [Liu et al., 2015], we train convolutional neural networks for attribute
detection. A dedicated CNN per attribute is impractical on mobile devices
since such networks consume a lot of resources. We therefore bring
CNNs to cell phones by exploiting the fact that each attribute can correspond to
specific regions of the face. We use a lightweight deep network for each predefined
area of the face, obtained by cropping around facial landmarks. Instead of
learning a single model per attribute, we use multi-task training to share the base
networks for each face region and their assigned attributes. Since an attribute can
be assigned to multiple regions of the face, we get the final prediction by fusing the
features from different parts. We also deploy these networks on a cell phone and
show that their response time and power consumption are reasonable when they use
the phone’s CPU to compute the attribute features.
The rest of this chapter is organized as follows. In Section 2.2 we review
work related to this chapter. We discuss attribute-based authentication
in Section 2.3. In Section 2.4 we describe how to train efficient deep architectures
for attribute detection.
2.2 Related work
In almost all computer vision problems, the very first step is to extract
features from a given visual signal. Attributes were first used as higher-order
features in Content Based Image Retrieval, where they were presented
as a solution for narrowing the semantic gap [Liu et al., 2007, Datta et al., 2005, Obeid
et al., 2001]. Attributes were also referred to as a kind of “intermediate features”.
This term initially appeared in [Obeid et al., 2001], where it referred to features that are
“low-level” semantic features but “high-level” image features.
Later applications of attributes were in the object recognition and human
identification domains. Ferrari and Zisserman [Ferrari and Zisserman, 2007] learned visual attributes
for objects such as “dotted” or “striped”. In [Farhadi et al., 2009], Farhadi et al. use
L1-regularized logistic regression to learn object attributes such as “has wheels” or
“metallic” from images of PASCAL VOC 2008 [Everingham et al., 2008] and then
use them to describe objects in the image. In [Lampert et al., 2014], Lampert et
al. learn object attributes via kernel Support Vector Machines (SVMs) [Cortes and
Vapnik, 1995] in two learning paradigms, Direct and Indirect Attribute Prediction,
and then use those to perform object recognition. They demonstrate good results on
their Animals with Attributes dataset. There are several other areas where attribute
features have been shown to be useful: zero-shot learning [Liu et al., 2014], scene
classification [Patterson and Hays, 2012], and action recognition [Liu et al., 2011].
Human attributes or “soft biometrics” such as age and gender suggested in
[Jain et al., 2004], have been successfully used for identity recognition/verification in
many applications. In [Jain et al., 2004], Jain et al. combine height, race and gender
information with fingerprint to improve the recognition accuracy on an in-house
dataset. Face image retrieval solely based on attributes was investigated in [Kumar
et al., 2008] by Kumar et al. For face verification, Kumar et al. in [Kumar et al.,
2009] extracted attribute feature vectors. Zhang et al. [Zhang et al., 2014a] used
attributes to improve face clustering and overcome variations of faces like pose and
illumination. Klare et al. [Klare et al., 2014] defined 46 facial attributes to perform
suspect identification. In [Layne et al., 2012], Layne et al. showed that attributes
such as “jeans”, “headphones”, and “sunglasses” can help re-identify people seen
on different cameras of a distributed camera network. Vaquero et al. [Vaquero et al.,
2009] developed a method for searching with attributes in surveillance environments
using Viola-Jones attribute detectors.
Detecting the presence of each attribute has been the focus of many researchers.
These algorithms can be roughly divided into two groups: those which learn a specific
model per attribute and those which present a general framework to learn all the
target attributes together at once. Our focus in this chapter is on the second group of
algorithms. Bourdev et al. [Bourdev et al., 2011] define poselets based on Histogram
of Oriented Gradients (HOG) [Dalal and Triggs, 2005] features and train SVMs
on them. Zhang et al. [Zhang et al., 2014b] train a Convolutional Neural Network
(CNN) on parts extracted from full-body person images to detect attributes; they
achieved good results on the Berkeley Attributes of People dataset and the Attributes25k
[Bourdev et al., 2011] dataset. Berg and Belhumeur [Berg and Belhumeur, 2013] learned
one SVM per pair of classes and pair of parts to take into account the class and part
relationships, and then created a feature vector out of all the SVMs. These
features were then used to learn classifiers for each attribute. Kumar et al. [Kumar
et al., 2008] trained local SVMs and used AdaBoost to select the best ones
for ten attributes, showing the performance on the FaceTracer [Kumar et al., 2008]
dataset. In [Kumar et al., 2009], Kumar et al. concatenated different low-level
features extracted from face components, incrementally learned SVMs for each
attribute, and tested them on PubFig [Kumar et al., 2009]. We present two approaches:
one consists of model selection between different SVMs, and another, simpler one
gives efficient linear SVMs for platform implementation.
Early research on alternatives to password-based authentication
focused on extracting unique characteristics from users’ keystrokes. In [Spillane,
1975], Spillane suggested using the timing between key presses and the pressure
patterns of keystrokes to identify users. Then, in [Monrose et al., 2002], Monrose
et al. created a method using pseudorandom polynomials to generate a secure
sequence based on the keystroke time intervals of users to increase password security.
In [Klosterman and Ganger, 2000], Klosterman and Ganger introduce the first continuous
face verification system implemented in Linux. They also present a comprehensive
set of differences between biometric and password-based authentication systems.
The next biometric-based continuous authentication system design was introduced
by Carrillo [Carrillo, 2003] to secure aircraft cockpits against unverified access. Then
followed many studies on continuous authentication, mostly for desktop computers,
such as [Altinok and Turk, 2003, Sim et al., 2007, Niinuma and Jain, 2010, Niinuma et al.,
2010, Janakiraman et al., 2005].
With exponential growth in the use of mobile devices, active authentication
on them has become the focus of many researchers. Various biometrics have been
proposed to continuously authenticate the users. In [Frank et al., 2013], Frank et al.
proposed a set of 30 behavioral touch features and then used a k-nearest neighbor
classifier and a Gaussian kernel SVM on horizontal and vertical strokes of the user
to perform authentication. [Feng et al., 2012], [Zhang et al., 2015b] also use touch-
screen gestures for this purpose. Gait as well as device movement patterns measured
by the smartphone accelerometer were used in [Derawi et al., 2010], [Primo et al.,
2014] for continuous authentication. Stylometry, GPS location, web browsing be-
havior, and application usage patterns were used in [Fridman et al., 2015] for active
authentication.
Face-based continuous user authentication has also been under study by researchers.
In [Hadid et al., 2007], Hadid et al. use the Haar-like features and AdaBoost
of Viola et al. [Viola et al., 2005] for part detection, and Local
Binary Patterns (LBP) [Ahonen et al., 2006] followed by nearest neighbor thresholding
for identification. In [Fathy et al., 2015], Fathy et al. extracted two intensity
features from images, one from the whole face and one from face components. They
then compared the recognition rates of four still-image algorithms and five convex-hull
image-set comparison methods on the AA01 dataset.
Lastly, Gunther et al. [Gunther et al., 2013] provide an overview of methods that
depend on low-level features for this task such as [Zhao et al., 1998], [Cox and
Pinto, 2011], [Zhang et al., 2005], [Wiskott et al., 1997] and their results on the
MOBIO [McCool et al., 2012] dataset.
Multi-modal methods have always been of interest when it comes to biometrics.
Fusion of speech and face was proposed in [McCool et al., 2012] by McCool et al.,
who extract LBP features and use nearest neighbor thresholding for faces. [Crouse
et al., 2015] proposed to fuse face images with the inertial measurement unit data to
continuously authenticate the users. A low-rank representation-based method was
proposed in [Zhang et al., 2015a] for fusing touch gestures with faces for continuous
authentication. Finally, a domain adaptation method was proposed in [Zhang et al.,
2015c] for dealing with data mismatch problem in continuous authentication.
Face-based continuous user authentication has also been proposed in [Samangouei
et al., 2015].
State of the art methods for face recognition employ Deep Convolutional Neu-
ral Networks (DCNN) [Taigman et al., 2014], [Parkhi et al., 2015], [Schroff et al.,
2015], [Sun et al., 2014a]. Since deep networks are very scalable, they achieve good
results by having a large number of parameters learned using large datasets. The
harder the problem, the more parameters and data are required. As a
result of their size, their architectures are not efficient enough to be deployed on a mobile
phone. It has been shown in [Sarkar et al., 2016] that a DCNN with an architecture
similar to AlexNet [Krizhevsky et al., 2012] can drain the battery very fast.
Figure 2.1: Overview of our attribute-based authentication method.
2.3 Attribute-based Active Authentication on Mobile Devices
In this section, we present the details of the proposed attribute-based au-
thentication system. In particular, we describe the training data used to learn the
attribute classifiers, how different classifiers are trained for each attribute and how
verification is performed using the attributes.
Figure 2.2: Training phase pipeline for each attribute classifier. Landmarks are
first detected on a given face. Different facial components are then extracted from
these landmarks. Then for each part, features are extracted with different cell sizes
and the dimensionality of features is reduced using principal component analysis.
Classifiers are then learned on these low-dimensional features. Finally, the top five
classifiers are selected as our attribute classifier.
2.3.1 Methodology
Training Data. The PubFig dataset [Kumar et al., 2009] is one of the few publicly
available datasets that provide facial attributes along with face images. We use this
dataset to train our attribute classifiers. The PubFig dataset consists of unconstrained
faces collected from the Internet by using a person’s name as the search query on a
variety of image search engines, such as Google Images and Flickr. However,
several challenges have to be overcome before this dataset can be effectively
utilized for our application. Since the release of this dataset in 2009, many links to
the images in this dataset have broken. Hence, not all the images listed in this dataset
are available for downloading. As a result, we use a subset of this dataset for which we
could establish proper links to the images. Furthermore, the true attribute labels
of the images are not provided; instead, the outputs of their attribute classifiers are
provided. We therefore applied a suitable threshold to these outputs to get the labels for each attribute
of the available images, ensuring that the classifier is certain enough about the
label given to the image. Finally, rather than using all 73 binary attributes in
the PubFig dataset, we selected a more meaningful subset of 44 attributes in our
implementation.
FaceTracer [Kumar et al., 2008] is another publicly available dataset that has
face images with 18 attributes. This dataset is smaller than the PubFig dataset and
again several hyperlinks to the images in this dataset are broken. Also, only a
subset of attribute labels has been provided.
2.3.2 Attribute Classifiers
Each attribute classifier Cl_i ∈ {Cl_1, ..., Cl_N} is trained by an automatic procedure
of model selection for each attribute A_i ∈ {A_1, ..., A_N}, where N is the total
number of attributes. Automatic selection is necessary since each attribute needs a
different model. Our models are indexed as follows:
1 Facial parts: For each attribute, a different set of facial components can be
more discriminative. The face components considered for training are: eyes,
nose, mouth, hair, eyes&nose, mouth&nose, eyes&nose&mouth, eyes&eyebrows,
and the full face. In total, nine different face components are considered.
2 Features: For different attributes, different types of features may be needed.
For example, for the attribute “blond hair”, features related to color can be
more discriminative than features related to texture. The following features
are considered in this work: LBP [Ahonen et al., 2006], ColorLBP, HOG [Dalal
and Triggs, 2005], and ColorHOG. ColorLBP and ColorHOG are obtained by
concatenating the LBP or HOG features of each RGB channel. In total, four types
of features are extracted using the VLFeat toolbox [Vedaldi and Fulkerson,
2008].
3 Locality of features: In order to capture the local information, we consider
different cell sizes for the HOG and LBP features. In total, six different cell
sizes, 6, 8, 12, 16, 24, 32, are used.
The algorithm for this section is implemented in MATLAB [MATLAB,
2014]. We use a state-of-the-art publicly available fiducial point detection
method [Asthana et al., 2013] to extract the different facial components. Furthermore,
the detected landmarks are also used to align the faces to a canonical coordinate
system. After extracting each set of features, principal component analysis
(PCA) retaining 99% of the energy is used to project each feature onto a low-dimensional
subspace. An SVM with the RBF kernel is then learned on these features. This
process is run exhaustively to train all possible models. For each attribute classifier,
80% of the available data is used for training the SVMs and 20% of the data is used
for model selection. The face images in the test set do not overlap with those in the
training set. The numbers of negative and positive examples are the same for both
training and testing. Finally, among all 216 SVMs (9 face components × 4 feature
types × 6 cell sizes), the five with the best accuracies are selected.
For a given test face image $F$, a feature vector $[f_{a_1} \ldots f_{a_N}]$ is calculated by
$$f_{a_k} = \frac{\sum_{i=1}^{5} w_k^i \, Cl_k^i(F)}{\sum_{i=1}^{5} w_k^i}, \qquad (2.1)$$
where $Cl_k^i(F) \in \{0, 1\}$ is the output of the $i$th most accurate classifier for the $k$th
attribute $A_k$ on face image $F$, and $w_k^i$ is the accuracy of $Cl_k^i$. The entire training
pipeline of our method is shown in Figure 2.2.
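As a minimal sketch of Eq. 2.1, the function below computes one attribute dimension as the accuracy-weighted average of the binary outputs of the selected classifiers. The function name and the example accuracies are hypothetical and serve only as an illustration.

```python
def attribute_score(decisions, accuracies):
    """Accuracy-weighted average of binary classifier outputs (Eq. 2.1).

    decisions:  0/1 outputs of the selected classifiers for one attribute.
    accuracies: validation accuracy of each classifier, used as its weight.
    """
    if len(decisions) != len(accuracies) or not decisions:
        raise ValueError("need exactly one accuracy per decision")
    weighted = sum(w * d for w, d in zip(accuracies, decisions))
    return weighted / sum(accuracies)


# Hypothetical example: three of five classifiers vote "present".
score = attribute_score([1, 1, 0, 1, 0], [0.9, 0.8, 0.8, 0.7, 0.8])
```

Because every decision is 0 or 1, the score always lies in [0, 1] and equals 1 only when all five classifiers agree that the attribute is present.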
2.3.3 Verification
We consider the continuous authentication problem as a verification problem
in which, given a pair of videos or images, we determine whether they correspond
to the same person or not. The well-known receiver operating characteristic (ROC)
curve, which describes the relation between the false acceptance rate (FAR) and
the true acceptance rate (TAR), is used to evaluate the performance of verification
algorithms. Lowering the decision threshold raises the TAR but also raises the
FAR. An ideal verification framework would therefore have a TAR equal to 1 at
every FAR. The ROC curves can be computed given a similarity matrix.
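The following sketch shows one way an ROC curve can be obtained from a similarity matrix. It assumes a ground-truth matrix `same_person` of the same shape and accepts a pair when its score is at least the threshold; both the helper name and this acceptance convention are assumptions, not details given in the text.

```python
def roc_points(similarity, same_person):
    """Return (FAR, TAR) pairs, one per distinct score used as threshold.

    similarity[i][j]  is the score between gallery i and probe j.
    same_person[i][j] is True when both entries belong to the same subject.
    """
    genuine = [s for row, lab in zip(similarity, same_person)
               for s, m in zip(row, lab) if m]
    impostor = [s for row, lab in zip(similarity, same_person)
                for s, m in zip(row, lab) if not m]
    points = []
    # Sweep the threshold from the highest score to the lowest.
    for t in sorted(set(genuine + impostor), reverse=True):
        tar = sum(s >= t for s in genuine) / len(genuine)
        far = sum(s >= t for s in impostor) / len(impostor)
        points.append((far, tar))
    return points


sim = [[0.9, 0.2], [0.4, 0.8]]
same = [[True, False], [False, True]]
points = roc_points(sim, same)
```

In this toy example the genuine scores are all higher than the impostor scores, so the curve reaches a TAR of 1 while the FAR is still 0.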
We use the proposed framework to extract the attribute vector from each
image in a given video. We then simply average them to obtain a single attribute
vector that represents the entire video. Then, the (i, j) entry of the similarity matrix
$S_{attrs}$ is calculated as
$$s_{i,j} = \frac{1}{\|e_i - t_j\|_2}, \qquad (2.2)$$
where $e_i$ is the $i$th attribute vector representing the gallery (or enrollment) video,
and $t_j$ is the $j$th attribute vector representing the probe video. We evaluate the
performance of the proposed attribute-based authentication method on two publicly
available mobile video datasets, MOBIO [McCool et al., 2012] and AA01 [Fathy
et al., 2015]. In addition to the ROC curves, the Equal Error Rate (EER) is used to
measure the performance of different methods. The EER is the error rate at the
operating point where the false acceptance rate is equal to the false rejection
rate. The lower the EER value, the higher the accuracy of the authentication system.
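The steps above can be sketched as follows: per-frame attribute vectors are averaged into one vector per video, Eq. 2.2 turns a gallery/probe pair into a similarity, and the EER is approximated by scanning thresholds. The `eps` guard against division by zero and the threshold scan are assumptions added for the sketch; the text does not say how either case was handled.

```python
import math


def mean_vector(frames):
    """Average per-frame attribute vectors into one vector per video."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def similarity(e, t, eps=1e-12):
    """Inverse Euclidean distance (Eq. 2.2); eps is an added guard."""
    return 1.0 / (math.dist(e, t) + eps)


def equal_error_rate(genuine, impostor):
    """Approximate EER: the threshold where FAR and FRR are closest."""
    best = None
    for t in sorted(set(genuine + impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)
        frr = sum(s < t for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2)
    return best[1]


gallery = mean_vector([[0.0, 1.0], [0.2, 0.8]])
score = similarity(gallery, [0.1, 0.5])
eer = equal_error_rate([0.9, 0.8, 0.3], [0.1, 0.2, 0.4])
```

With a finite set of scores the FAR and FRR rarely cross exactly, so the sketch reports the midpoint of the two rates at the closest threshold.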
We use an LBP-based method as a baseline for comparison. In this method,
each detected face is represented by the histogram of LBP features. The same
aligned faces that are used for attribute feature extraction are also used to extract
the LBP features. Similar to the attribute features, the LBP features from each
image in a video are extracted and averaged to represent a single video. The LBP
features are extracted using the VLFeat toolbox. The similarity matrix, $S_{LBP}$, is then
built by comparing two feature vectors. This LBP-based method has been used for
mobile face authentication in [McCool et al., 2012] and [Hadid et al., 2007]. A third
fusion score matrix, $S_{fusion} = \tilde{S}_{LBP} + \tilde{S}_{attrs}$, is calculated after z-score normalization
$$\tilde{s}_{i,j} = \frac{s_{i,j} - \bar{S}}{\sigma(S)}, \qquad (2.3)$$
where $\bar{S}$ and $\sigma(S)$ are the mean and the standard deviation of the entries in similarity
matrix $S$, respectively.
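A minimal sketch of the z-score fusion of Eq. 2.3 is given below. It uses the population standard deviation (`statistics.pstdev`); whether the original implementation used the population or the sample estimate is not stated, so that choice is an assumption.

```python
from statistics import mean, pstdev


def z_normalize(matrix):
    """Z-score every entry with the mean and std of the whole matrix (Eq. 2.3)."""
    flat = [s for row in matrix for s in row]
    mu, sigma = mean(flat), pstdev(flat)
    return [[(s - mu) / sigma for s in row] for row in matrix]


def fuse(s_lbp, s_attrs):
    """Element-wise sum of the two normalized similarity matrices."""
    a, b = z_normalize(s_lbp), z_normalize(s_attrs)
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


# Toy matrices on very different scales end up contributing equally.
fused = fuse([[1.0, 3.0], [3.0, 1.0]], [[10.0, 30.0], [30.0, 10.0]])
```

Normalizing before summing keeps the matrix with the larger numeric range from dominating the fused score.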
Figure 2.3: Illustration of our attribute classifiers on sample face images from the
AA01 (first two images) and the MOBIO (last image) datasets.
2.3.4 Attribute Classifiers
In Tables 2.1 and 2.2, the accuracies of the attribute classifiers trained using
our method on the PubFig and FaceTracer datasets are given. As can be seen from
these tables, most of the accuracies are high. The accuracies for AA01 [Fathy
et al., 2015] are provided in Table 2.3. This dataset consists of 50 subjects, and
its attributes were labeled by four volunteers: for each of the 44 binary attributes,
the label is set to one if 3 out of 4 volunteers agreed on the presence of the
attribute, and to zero otherwise.
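The labeling rule can be written as a small voting function. The sketch assumes that "3 out of 4" means at least three positive votes; the helper name is hypothetical.

```python
def consensus_label(votes, min_agree=3):
    """Return 1 when at least min_agree annotators marked the attribute present."""
    return 1 if sum(votes) >= min_agree else 0


labels = [consensus_label(v) for v in ([1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 1])]
```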
Furthermore, in Figure 2.3 we show some sample outputs of our attribute
classifiers. Results of the classifiers are scaled to be between -0.5 and 0.5. For the
first face, eyeglasses, chubby, round jaw, Asian, male, no beard, sideburns, bangs
Attribute Accuracy Attribute Accuracy
Blond Hair 0.9089 Child 0.9538
Partially Visible Forehead 0.8645 Narrow Eyes 0.7777
Round Face 0.9156 Big Nose 0.8039
Indian 0.9714 Male 0.9451
Gray Hair 0.9091 Pointy Nose 0.816
Bags Under Eyes 0.8986 Asian 0.9225
Obstructed Forehead 0.8913 White 0.6992
Shiny Skin 0.9532 Youth 0.7299
No Eyewear 0.8875 Brown Hair 0.6725
Middle Aged 0.929 Bald 0.7909
Senior 0.8867 Wavy Hair 0.9357
Eyeglasses 0.9397 Straight Hair 0.7408
Sunglasses 0.9701 Bangs 0.9397
Mustache 0.8606 Arched Eyebrows 0.6462
Chubby 0.8815 Strong Lines 0.9308
Receding Hairline 0.8164 Pale Skin 0.793
Round Jaw 0.9357 Flushed Face 0.7819
Big Lips 0.7578 Double Chin 0.9727
No Beard 0.7766 Black Hair 0.8029
Goatee 0.9775 Curly Hair 0.8746
Black 0.7818 Bushy Eyebrows 0.836
Sideburns 0.8756 Oval Face 0.82
Table 2.1: Accuracies of the 44 attribute classifiers proposed in this work on the
PubFig dataset [Kumar et al., 2009].
classifiers give high scores. This clearly matches with the image shown on the
left. For the second face, it is interesting to see that the Male classifier produces a
negative score since the image corresponds to a female subject. Finally, for the last
face, “mustache”, “goatee”, “chubby” and “bags under eyes” produce high positive
scores which clearly match with the image shown on the left.
Attribute Accuracy Attribute Accuracy
Asian 0.8786 middle aged 0.7321
eyeglasses 0.7214 black 0.808
sunglasses 0.89 female 0.88
smiling false 0.8 senior 0.7933
no eyewear 0.7481 hair color blond 0.7875
child 0.8276 white 0.763
mustache 0.815 youth 0.692
Table 2.2: Accuracies of the attribute classifiers proposed in this work on available
attributes on the FaceTracer dataset [Kumar et al., 2008].
Attribute Indoor Lights off Outdoor Attribute Indoor Lights off Outdoor
Asian 0.64 0.62 0.54 Bags Under Eyes 0.96 0.96 0.96
Bald 0.98 0.98 0.98 Bangs 0.88 0.88 0.88
Big Lips 0.80 0.80 0.80 Big Nose 0.90 0.90 0.92
Black 0.98 0.98 0.98 Black Hair 0.62 0.62 0.72
Blond Hair 0.96 0.96 0.96 Brown Hair 0.96 0.96 0.96
Bushy Eyebrows 0.94 0.94 0.94 Child 0.74 0.76 0.78
Chubby 0.74 0.74 0.76 Curly Hair 0.96 0.96 0.96
Double Chin 0.92 0.94 0.94 Eyeglasses 0.60 0.58 0.58
Flushed Face 0.98 0.98 0.98 Goatee 0.96 0.96 0.96
Gray Hair 0.96 0.96 0.96 Indian 0.86 0.86 0.86
Male 0.82 0.82 0.84 Middle Aged 0.96 0.96 0.96
Mustache 0.86 0.86 0.86 Narrow Eyes 0.64 0.68 0.62
No Beard 0.58 0.56 0.58 No Eyewear 0.74 0.74 0.74
Obstructed Forehead 0.84 0.84 0.88 Oval Face 0.78 0.78 0.78
Pale Skin 0.98 0.98 0.98 Partially Visible Forehead 0.70 0.70 0.70
Pointy Nose 0.98 0.98 0.98 Receding Hairline 0.86 0.86 0.90
Round Face 0.96 0.96 0.96 Round Jaw 0.76 0.74 0.76
Senior 0.98 0.98 0.98 Shiny Skin 0.98 0.98 0.98
Sideburns 0.98 0.98 0.98 Straight Hair 0.72 0.72 0.74
Strong Nose-Mouth Lines 0.82 0.82 0.82 Sunglasses 0.98 0.98 0.98
Wavy Hair 0.96 0.96 0.96 White 0.90 0.90 0.90
Table 2.3: Accuracy of the attribute classifiers for the AA01 dataset [Fathy et al.,
2015].
2.3.5 MOBIO Dataset
The MOBIO dataset [McCool et al., 2012] consists of video data taken from
152 subjects. The dataset was collected in six different sites from five different
countries. In total twelve sessions were captured for each subject - six sessions
for phase 1 and six sessions for phase 2. The database was recorded using two
mobile devices: a NOKIA N93i mobile phone and a standard 2008 MacBook laptop
computer. The laptop was only used to capture videos of part of the first session.
So the first session consists of data captured with both the laptop and the mobile
phone. Figure 2.4 shows some frames from the MOBIO dataset.
Figure 2.4: Sample images from the MOBIO dataset. One can clearly see the
different illumination conditions in this dataset.
In the MOBIO protocol, for each person, the data from one session is used for
enrollment and the data from the remaining sessions are used for testing. In the
first set of experiments with the MOBIO dataset, we do not consider the data from
the laptop session. The first mobile session is considered as the enrollment session
and the data from the next 11 sessions are considered for testing. The ROC curves
Site LBP Attributes Fusion
but 0.29 0.28 0.25
idiap 0.18 0.20 0.14
lia 0.31 0.24 0.25
uman 0.20 0.25 0.18
unis 0.24 0.28 0.24
uoulu 0.27 0.24 0.23
All together 0.22 0.23 0.19
Table 2.4: The EER values for different methods on the MOBIO dataset.
corresponding to this experiment are shown in Figure 2.5 for the entire dataset.
As can be seen from this figure, our attribute-based method performs comparably
to the LBP-based methods. However, the best performance is achieved when the
similarity matrices corresponding to the LBP and attribute features are fused. The
EER values corresponding to this experiment are compared in Table 2.4.
[Plot: ROC curves, True Acceptance Rate versus False Acceptance Rate, for LBP, Attributes, and Z-Score fusion.]
Figure 2.5: Performance evaluation on the MOBIO dataset.
2.3.6 Cross-device Experiments
Images captured by different cameras have different characteristics. Since the
MOBIO dataset has videos that were captured using different sensors, we conduct
cross-device experiments in which the data from the laptop session are used
as the enrollment data and the data from the mobile phone are used as the test
videos. This experiment allows us to study the robustness of different algorithms
to changes in image quality. Figure 2.6 and Table 2.5 show the ROC
curves and the EER values corresponding to this experiment. As can be seen from
these results, attributes are more robust to a change of camera sensor than LBP
features. In this experiment, fusion does not necessarily improve the performance
over the attributes alone, since the LBP features perform poorly.
[Plot: ROC curves, True Acceptance Rate versus False Acceptance Rate, for LBP, Attributes, and Z-Score fusion.]
Figure 2.6: Cross-device robustness. Laptop session videos are used for enrollment
and the data from the remaining sessions are used for testing.
Enrollment LBP Attributes Fusion
Laptop 0.33 0.27 0.27
Table 2.5: The EER values corresponding to the cross-device experiment on the
MOBIO dataset.
2.3.7 AA01 Dataset
The AA01 dataset, also referred to as UMDAA-01, consists of 750 videos from 50
different individuals collected in three different sessions corresponding to three
different illumination conditions. The dataset was collected using an app on an
iPhone 5s. Each user performed five tasks in each of the three sessions: an
enrollment task, a document task, a picture task, a popup task, and a scrolling
task. Figure 2.7 shows some sample images from the dataset, where one can clearly
see the different illumination conditions present in this dataset.
(a) (b) (c)
Figure 2.7: Sample images from the AA01 dataset. (a), (b) and (c) show some
sample images from session 1, 2 and 3, respectively.
In the first set of experiments using this dataset, we use the data corresponding
to the enrollment task as gallery and the data from the remaining tasks for testing.
Figure 2.8 and Table 2.6 show the ROC curves and the EER values, respectively
Enrollment LBP Attributes Fusion
Indoor light 0.13 0.14 0.10
Low light 0.31 0.18 0.20
Natural light 0.19 0.16 0.14
AA01 all 0.34 0.30 0.30
Table 2.6: The EER values of different methods for the AA01 dataset.
corresponding to this experiment. As can be seen from these results, our attribute-
based method performs much better than the LBP-based authentication system.
Fusion of the LBP and the attribute similarity matrices results in performance
comparable to our method as the LBP features do not perform well on this dataset.
[Plot: ROC curves, True Acceptance Rate versus False Acceptance Rate, for LBP, Attributes, and Z-Score fusion.]
Figure 2.8: Performance evaluation for the AA01 dataset.
Furthermore, we conducted several session-specific experiments on this dataset.
We used the enrollment data as gallery and the data from other tasks from the same
session as probe. The ROC curves corresponding to these experiments are shown
in Figures 2.9(a)-(c). It can be seen from these figures that our attribute-based
method works better than the LBP-based method, and fusion improves the result
as expected. The reason that attributes work better here is that the sessions are all
taken on the same day, so the changes in attributes are less severe than in the
MOBIO dataset.
[Plots: six ROC panels, (a) to (f), True Acceptance Rate versus False Acceptance Rate, each comparing LBP, Attributes, and Z-Score fusion.]
Figure 2.9: Session-specific performance evaluations for the AA01 dataset. (a)
Gallery and probe data from session 1. (b) Gallery and probe data from session
2. (c) Gallery and probe data from session 3. (d) Gallery data from session 1 and
probe data from sessions 2 and 3. (e) Gallery data from session 2 and probe data
from sessions 1 and 3. (f) Gallery data from session 3 and probe data from sessions
2 and 1.
Finally, similar to the cross-device experiments on the MOBIO dataset, we
conducted cross-session experiments on the AA01 dataset. We used the data from
the enrollment task from one session as gallery and the data from the other sessions
as probe. This experiment shows the robustness of our attribute-based method to
different illumination conditions. From Figures 2.9(d)-(f), we see that even when
the illumination conditions are different, our attribute-based method is more robust
than the LBP feature-based method. From Figures 2.9(d)-(f) and Table 2.7 we see
that in all cases, attributes performed better than LBP and the fusion of both gives
Gallery→Probe LBP Attributes Fusion
1→ 2, 3 0.36 0.33 0.32
2→ 1, 3 0.35 0.31 0.30
3→ 1, 2 0.38 0.33 0.31
Table 2.7: The EER values corresponding to the cross-session experiments for the
AA01 dataset. 1 is the office light session, 2 is the low light session, 3 is the natural
light session.
the best results.
2.3.8 Platform implementation and evaluations
One set of challenges in continuous mobile authentication is the computational
complexity and memory usage of the algorithm. The limited computation
capacity of a mobile phone is shared among many processes, so if the algorithm
takes most of the CPU time, other processes will slow down. Computations also
consume energy: the more complex they are, the sooner the battery of the phone
needs to be recharged. In addition, the memory capacity of the phone is limited.
Algorithms with high memory usage will force other running processes into the
swap memory of the phone. These costly I/O operations result in both slowdown
and high power consumption.
As a consequence, algorithms with high complexity are run on a server, and
the mobile device is used just as a client that takes pictures, sends them to the
server, and waits for the response, as suggested in [Takacs et al., 2008]. This solution
has two drawbacks for continuous authentication. The phone will get locked if the
mobile device gets disconnected from the server. Furthermore, the system will be
less secure, since the communication between the mobile device and the server can
be tampered with. This can result in either locking the device of the victim or, even
worse, keeping it unlocked through a fake server that responds in a way that keeps
the phone unlocked. Also, depending on the enrollment policy, we may need to
re-enroll the user multiple times after the first enrollment, to account for changes in
appearance or environment and to create a better template. Re-enrolling the user on
the server takes time, which is an unproductive experience for the user. In this
section, we show that our approach allows enrollment and authentication of the user
on the device.
Our implementation is tested on a Google Nexus 5 with 2GB of RAM and
a quad core 2.2GHz CPU. The implementation is done on the Android operating
system using the well-known OpenCV [Bradski, 2000] library. Since the authenti-
cation should be done continuously, efficiency-accuracy trade-off will become very
important. To explore this trade-off, we looked at three measures: memory, running
time, and power consumption. Exhaustively varying all the parameters and
evaluating each setting is beyond the scope of this work, but we present one pathway
to platform implementation that highlights the decisions that impact the efficiency-
accuracy trade-off.
Learning method PCA+RBFSVM RBFSVM Linear SVM
Average memory usage 80MB 54MB 1MB
Table 2.8: Average memory usage per attribute classifier for full face
2.3.9 Memory
Memory usage, or space complexity, has always been a challenge when implementing
computer vision algorithms. We changed the last two steps of our learning
method to evaluate the memory usage of different models. The average test-time
memory requirement of each learning approach for attribute classifiers can be found
in Table 2.8. The memory usage is calculated by loading all the attribute classifiers,
measuring the increase in memory usage, and dividing that increase by the
number of classifiers. In this experiment, we use LBP features of the intensity image
and the RGB channels on 128 × 168 face images, with a dimensionality of 76800 per
face crop. In PCA, we keep 90% of the energy. As can be seen from Table 2.8, since
we have at least 44 classifiers, using PCA for dimensionality reduction or the RBF
kernel would need more than 2GB of memory in total. We therefore focus on linear
SVMs for attributes.
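The memory argument is simple arithmetic over Table 2.8 and can be checked with the sketch below. The per-classifier figures and the classifier count come from the text; the variable names are illustrative, and the 2GB budget is taken as the total RAM of the test device, which ignores memory used by the operating system and other applications.

```python
NUM_CLASSIFIERS = 44
DEVICE_RAM_MB = 2 * 1024  # 2GB of RAM on the test device

# Average per-classifier footprint from Table 2.8, in MB.
PER_CLASSIFIER_MB = {"PCA+RBFSVM": 80, "RBFSVM": 54, "Linear SVM": 1}


def total_memory_mb(per_classifier_mb, count=NUM_CLASSIFIERS):
    """Memory needed to hold one classifier per attribute."""
    return per_classifier_mb * count


totals = {name: total_memory_mb(mb) for name, mb in PER_CLASSIFIER_MB.items()}
fits = {name: total <= DEVICE_RAM_MB for name, total in totals.items()}
```

The totals are 3520MB for PCA+RBFSVM, 2376MB for RBFSVM, and 44MB for linear SVMs, so only the linear models fit.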
2.3.10 Final attribute classifiers for platform
Given the memory usage per classifier in Table 2.8, we have no
choice but to simplify our classifier learning framework. For training the classifiers,
we use the LFW [Huang et al., 2007] dataset. It has more subjects and hence contains
more variations for each attribute. Also, the output of the classifiers from [Kumar et al.,
2009] is available for the LFW dataset, so to train our classifiers, the same framework
as in Section 2.3.1 is followed with some changes. Since we cannot afford five PCA-
RBFSVM classifiers per attribute, our goal is to train the smallest number of classifiers
that gives us the desired accuracy.
We simplify our training procedure to learn one single linear SVM for each at-
tribute while trying to consider challenges of learning attribute classifiers addressed
in Section 2.3.1. We reran the experiments for the AA01 dataset from Section 2.3.7
with the linear classifiers learned with our simplified learning procedure. The
resulting ROC curves can be seen in Figure 2.10 and the corresponding EER values
in Table 2.9. The differences from the classifiers of Section 2.3.1 and Figure 2.2 are:
• Feature extraction: In the approach discussed in Section 2.3.1, we extracted
different types of features to capture the dependence of each attribute on color
and scale. In the simplified model, we just extract LBP features on the grayscale
image and the three color channels and concatenate them. We do not change
the cell size of LBP to capture dependence on locality. Instead, we perform the
evaluation with different image sizes and choose the one that works best for
all attributes together.
• No PCA: No dimensionality reduction step is employed after feature extraction,
because loading the PCA basis sets on a phone needs significant memory
space.
• No kernel: A linear classifier is learned, because the memory usage in Table
2.8 shows that it is impractical to load the kernelized classifiers into memory.
• No part-based attribute classifier: We just train classifiers on the full face
image in the simplified model. Extracting the face parts from the face image is not
a trivial task and adds to the complexity of the model. Also, linear classifiers
optimize weights that are directly related to pixel values, so choosing face
parts is taken care of by the SVM optimization objective to some extent.
• Attribute dimension value: In Section 2.3.1, we took the weighted average
of the binary decision values of the top five attribute classifiers as the dimension
value. In the simplified learning approach, we just use the distance from the
margin of each attribute classifier. This is valid since we trained the attribute
classifiers with the same image size.
The interesting result is that the classifier with scale 0.5 performs better than
the ones from Section 2.3.1. The most important difference is probably the last step
of the approach presented in Section 2.3.1. In that step, we fuse the outputs of
the top five SVMs into a score by taking the weighted average of the binary decision
values of each attribute classifier. This results in a discrete and finite range of scores.
However, the scores of the simplified model are the distances from the margins of
the linear classifiers, which are continuous and hence give a more discriminative
range of values for different faces.
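The contrast between the two scoring schemes can be illustrated with a short sketch. `voted_levels` enumerates every score that the weighted vote over five binary classifiers can produce, and `margin_score` computes a signed distance from a linear SVM hyperplane. Normalizing by the weight norm is an assumption, since the text does not say whether the raw decision value or the geometric distance was used, and the example accuracies are hypothetical.

```python
import math
from itertools import product


def voted_levels(accuracies):
    """All scores reachable by the accuracy-weighted vote of binary classifiers."""
    total = sum(accuracies)
    return {round(sum(w * d for w, d in zip(accuracies, bits)) / total, 12)
            for bits in product((0, 1), repeat=len(accuracies))}


def margin_score(weights, bias, features):
    """Signed distance of a feature vector from a linear SVM hyperplane."""
    norm = math.sqrt(sum(w * w for w in weights))
    return (sum(w * x for w, x in zip(weights, features)) + bias) / norm


levels = voted_levels([0.9, 0.8, 0.8, 0.7, 0.8])
distance = margin_score([3.0, 4.0], -5.0, [3.0, 4.0])
```

Five binary votes can produce at most 2^5 = 32 distinct scores per attribute, whereas the margin distance varies continuously with the input.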
2.3.11 Frames per second and power consumption
Since linear attribute classifiers on the full face turned out to be the winner
of accuracy-efficiency tradeoff, we test their speed and power consumption. For
Method LBP AttrsSection3 L-SVM 1 L-SVM 0.7 L-SVM 0.5 L-SVM 0.3
Feature dim 19200 variable 76800 33280 16128 3840
Indoor lighting 0.13 0.14 0.16 0.19 0.11 0.22
Low lighting 0.31 0.18 0.20 0.20 0.15 0.24
Natural light 0.19 0.16 0.18 0.18 0.11 0.21
Altogether 0.33 0.30 0.29 0.29 0.25 0.37
Table 2.9: Comparison of EER values for LBP, attribute detectors of Section 2.3.1,
linear models of Section 2.3.8. The scale L-SVM 1 is trained on images of size
128 × 168 and the rest are scaled by the indicated value. The best EER is gained
from L-SVMs of Section 2.3.8 with scale 0.5.
[Plots: four ROC panels, True Acceptance Rate versus False Acceptance Rate, each comparing LBP, AttributesSection3, Attributes1, Attributes0.7, Attributes0.5, and Attributes0.3.]
Figure 2.10: Comparison of linear SVMs with the model learned in Section 2.3.1 and
LBP. The best result among all is achieved with linear models of scale 0.5, i.e., a face
crop size of 64 × 80.
speed, we look at the number of frames that we can authenticate per second and
for power consumption we use the power consumption profiler presented by Zhang
et al. [Zhang et al., 2010]. We extract the 44-dimensional feature vector for 5000
frames with each set of attribute classifiers indexed by scale. Android provides a
mechanism to get the time in milliseconds, so we can measure, to millisecond
precision, the time that it takes each set of classifiers to process the 5000 frames.
The power profiler also provides the energy consumption of each running application
in Joules, with a resolution of 0.1J. The numbers for different setups of our algorithm
are provided in Table 2.10. The landmark detection, which was done with the method
of Asthana et al. [Asthana et al., 2013] in Section 2.3.1, is replaced by the algorithm
of Kazemi et al. [Kazemi and Sullivan, 2014], which is implemented in DLib [King,
2009]. This algorithm adds 90MB to the memory consumption, but it is very fast. The
evaluation for each scale is done in two settings, one with Haar face detection and
DLib alignment and one without this step. From the table, we see that the face
detection and alignment step adds around 20mJ of energy consumption per frame and
reduces the FPS significantly. In the worst case, with the fastest available face and
landmark detection methods, our algorithm can authenticate users at a speed of 4
frames per second. This is more than enough for the authentication task, which
probably requires authenticating only once every couple of seconds.
The Google Nexus 5 battery capacity is 2300mAh and the average working voltage is
3.8V, which can be verified by running the power profiler of [Zhang et al., 2010].
This means that the battery stores 8740mWh in total. If we run the profiler without
our authentication program on the phone for more than 5 minutes, it shows an average
power usage of 520mW, which means the phone will last for 16.8 hours. The last
row of Table 2.10 shows how many hours our algorithm can run in the background if
Scale 0.3 0.5 0.7 1
Size/dim 32× 48/3840 64× 80/16128 88× 112/33280 128× 168/76800
Detection/Alignment W/O W/ W/O W/ W/O W/ W/O W/
FPS 114 29 31 16 13 8 5 4
Energy 26.8J 128.9J 93.5J 201.2J 207J 369.1J 524.9J 603J
Energy per frame 5.4mJ 25.8mJ 18.7mJ 40.2mJ 41.4mJ 73.8mJ 105mJ 120.6mJ
Endurance (hours) 16.6 16 16.2 15.6 15.6 14.7 14 13.6
Table 2.10: The speed and power consumption of different realizations of the
classifiers learned with the simplified training framework on a Google Nexus 5 device.
The W/O columns mean that our algorithm extracts all the attributes given an aligned
and cropped face. The W/ columns first detect the face and then extract the attributes.
In the last row, we assume that authentication is done at a speed of 1fps. We can
authenticate once every second for 17.6 hours employing our classifiers with the best
EER of Table 2.9 on a Nexus 5.
we do authentication once every second considering these numbers.
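The energy and endurance rows of Table 2.10 follow from the battery figures above. The sketch below reproduces that arithmetic under the assumption that authentication at 1fps adds the per-frame energy (in mJ) as extra milliwatts on top of the 520mW idle draw; the function names are illustrative.

```python
BATTERY_MWH = 2300 * 3.8   # 2300mAh at 3.8V gives 8740mWh
IDLE_POWER_MW = 520.0      # profiler reading without the authentication app


def energy_per_frame_mj(total_joules, frames=5000):
    """Convert the energy measured over all frames into mJ per frame."""
    return total_joules * 1000.0 / frames


def endurance_hours(per_frame_mj, fps=1.0):
    """Battery life in hours; mJ per frame times frames per second is mW."""
    return BATTERY_MWH / (IDLE_POWER_MW + per_frame_mj * fps)


idle_hours = endurance_hours(0.0)
scale_03_with_detection = endurance_hours(25.8)
scale_05_with_detection = endurance_hours(40.2)
```

Rounded to one decimal place, these give 16.8 hours for the idle phone and 16.0 and 15.6 hours for scales 0.3 and 0.5 with detection and alignment, in line with the endurance row of Table 2.10.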
2.4 Efficient Deep Features for Attribute Detection on Mobile Phones
2.4.1 Methodology
In the mobile setting, there is a trade-off between hardware limitations such
as battery life and accuracy of the models. We design our models with the goal of
balancing this trade-off. Namely, we move from a computationally expensive but
specialized model to a computationally cheaper but accurate model.
We train and test four different sets of DCNNs, in total 100 of them, for the
task of attribute classification on a set of face regions. We crop the functional face
regions using landmarks detected by [Asthana et al., 2013]. These face regions can
be seen in Table 2.13. Then, for each part, we find the maximum size of the window
for that part in the dataset and crop the regions by placing the center of the
face part at the center of the crop window, which avoids any scaling.
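The cropping rule can be sketched as follows. Part boxes are assumed to be given as `(x0, y0, x1, y1)` tuples derived from the landmarks; this representation and the helper names are hypothetical, and handling of windows that extend past the image border is not described in the text and is left out.

```python
def part_window_size(boxes):
    """Largest width and height over all boxes of one face part in the dataset."""
    width = max(x1 - x0 for x0, _, x1, _ in boxes)
    height = max(y1 - y0 for _, y0, _, y1 in boxes)
    return width, height


def centered_crop_box(box, window):
    """Fixed-size crop window centered on the part, so no rescaling is needed."""
    x0, y0, x1, y1 = box
    width, height = window
    cx, cy = (x0 + x1) // 2, (y0 + y1) // 2
    left, top = cx - width // 2, cy - height // 2
    return left, top, left + width, top + height


boxes = [(10, 10, 50, 30), (0, 0, 53, 39)]
window = part_window_size(boxes)
crop = centered_crop_box(boxes[0], window)
```

Every crop of a given part has the same size, so smaller instances are padded with surrounding context instead of being stretched.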
Attribute
5 o Clock Shadow 93 93 91 89 85 88 91
Arched Eyebrows 81 82 82 83 76 78 79
Attractive 81 81 81 82 78 81 81
Bags Under Eyes 83 84 83 82 76 79 79
Bald 99 99 96 98 89 96 98
Bangs 95 95 94 94 88 92 95
Big Lips 67 70 69 67 64 67 68
Big Nose 82 83 78 78 74 75 78
Black Hair 86 86 88 87 70 85 88
Blond Hair 95 95 94 94 80 93 95
Blurry 95 95 92 80 81 86 84
Brown Hair 86 86 86 84 60 77 80
Bushy Eyebrows 92 92 89 89 80 86 90
Chubby 95 95 87 91 86 86 91
Double Chin 96 96 89 93 88 88 92
Eyeglasses 99 99 99 99 98 98 99
Goatee 97 97 93 96 93 93 95
Gray Hair 98 98 92 97 90 94 97
Heavy Makeup 90 90 90 91 85 90 90
High Cheekbones 86 85 87 87 84 86 87
Male 98 97 97 98 91 97 98
Mouth Slightly Open 93 93 94 94 87 78 92
Mustache 97 96 88 95 91 87 95
Narrow Eyes 87 87 83 81 82 73 81
No Beard 95 95 95 96 90 75 95
Oval Face 72 73 73 70 64 72 66
Pale Skin 97 97 93 94 83 84 91
Pointy Nose 75 73 75 74 68 76 72
Receding Hairline 92 92 88 90 76 84 89
Rosy Cheeks 94 94 87 91 84 73 90
Sideburns 95 95 95 96 94 76 96
Smiling 92 92 92 92 89 89 92
Straight Hair 79 79 78 79 63 73 73
Wavy Hair 71 73 82 81 73 75 80
Wearing Earrings 83 84 86 79 73 92 82
Wearing Hat 98 98 98 98 89 82 99
Wearing Lipstick 92 92 93 93 89 93 93
Wearing Necklace 86 86 71 71 68 86 71
Wearing Necktie 95 96 93 95 86 79 93
Young 87 87 87 88 80 82 87
Average 89.4 89.5 87.7 87.9 81.1 83.6 87.3
Table 2.11: The performance comparison of attribute detection methods. The columns
are, from left to right: DeepMulti-CNNAA, WideMulti-CNNAA, DeepBinary-CNNAA,
WideBinary-CNNAA, FaceTracer, PANDA, and LNet+ANet.
2.4.2 Network architecture
The DCNNs have two different architectures. The architectures of the Deep
Convolutional Neural Network for Active Authentication (Deep-CNNAA) and
Wide-CNNAA can be found in Table 2.12. The four sets of models compared are:
BinaryDeep-CNNAA, BinaryWide-CNNAA, MultiDeep-CNNAA, and MultiWide-
CNNAA. First we describe the shared configuration that is used to train these
networks and then the ones that are specific to each class of the networks.
*Wide-CNNAA                          *Deep-CNNAA
input w × h × 128                    input w × h × 3
type        patch size               type        patch size
conv relu   7 × 7 × 128              conv relu   7 × 7 × 32
maxpool     3 × 3 / 2 (stride)       maxpool     3 × 3 / 2 (stride)
conv relu   5 × 5 × 128              conv relu   5 × 5 × 32
-           -                        conv relu   5 × 5 × 32
-           -                        conv relu   5 × 5 × 32
maxpool     3 × 3 / 2                maxpool     3 × 3 / 2
conv relu   3 × 3 × 128              conv relu   3 × 3 × 32
-           -                        conv relu   3 × 3 × 32
-           -                        conv relu   3 × 3 × 32
-           -                        conv relu   3 × 3 × 32
maxpool     3 × 3 / 2                maxpool     3 × 3 / 2
FC relu     dim × 128                FC relu     dim × 64
FC relu     128 × 128                FC relu     64 × 32
logits      Num Attr × 2             logits      Num Attr × 2
Softmax loss                         Softmax loss
Table 2.12: The architectures of our networks. The number of parameters depends
on the face region that they operate on and can be found in Table 2.16.
Shared configuration All of these 100 networks, 20 Multi*-CNNAA and 80
face part
No. of Attributes 10 10 10 7 16
Size 53× 39 115× 41 65× 38 40× 56 90× 62
face part
No. of Attributes 15 21 15 15 14
Size 55× 82 115× 107 128× 52 128× 45 62× 100
Table 2.13: The face regions that are extracted by cropping around the landmark
points and their corresponding number of attributes. A Multi*-CNNAA that oper-
ates on a face crop has “No. of Attributes” tasks.
Binary*-CNNAA, are trained on the publicly available CelebA [Liu et al., 2015]
dataset. It has about 200 thousand images of 10 thousand identities, each with 40
attribute labels. It is divided into 160k training, 20k development, and 20k test
images. The DCNNs are trained using the recently released TensorFlow [Abadi
et al., 2015a], which also has a mobile implementation. All of the networks are
initialized with random weights and are trained with the same policy. The Adam
optimizer is used to train all of these networks since it incorporates an adaptive
learning rate update step and performs well without careful fine-tuning of the
learning parameters [Kingma and Ba, 2014]. Subsequent fine-tuning can give better
results. Early stopping [Prechelt, 1998] using the accuracy on the development set
is used to select the final model for each network. The inputs are color face part
images that are randomly flipped; their contrast and gamma are also randomly
changed to augment the data and prevent over-fitting.
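The augmentation described above can be sketched on a single-channel image stored as rows of intensities in [0, 1]. The flip probability and the contrast and gamma ranges are illustrative assumptions, since the text does not give them, and the original pipeline operates on color images in TensorFlow and not on nested Python lists.

```python
import random


def augment(image, rng):
    """Randomly flip an image and perturb its contrast and gamma.

    image: rows of pixel intensities in [0, 1].
    rng:   a random.Random instance, passed in for reproducibility.
    """
    if rng.random() < 0.5:
        image = [row[::-1] for row in image]      # horizontal flip
    contrast = rng.uniform(0.8, 1.2)              # assumed range
    gamma = rng.uniform(0.8, 1.2)                 # assumed range
    pixel_count = sum(len(row) for row in image)
    mean = sum(sum(row) for row in image) / pixel_count
    out = []
    for row in image:
        new_row = []
        for p in row:
            # Scale the deviation from the mean, clip, then apply gamma.
            p = min(1.0, max(0.0, (p - mean) * contrast + mean))
            new_row.append(p ** gamma)
        out.append(new_row)
    return out


augmented = augment([[0.0, 0.5], [1.0, 0.25]], random.Random(0))
```

Drawing new random parameters on every call means the same training image is presented to the network in a slightly different form each time.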
Due to the nature of the attributes, most of them have an unequal number of
positive and negative labels. Extra care has been taken, with the help of data
augmentation and stochastic optimization, to make sure the networks are not
biased toward one class.
Binary*-CNNAA The binary networks are for a single task and are trained
by the labels of one single attribute. The input face images are aligned to a canonical
coordinate using the landmarks given by [Asthana et al., 2013]. To balance the
training data, the class with the lower number of training data is distorted and
added to the input queue so that the number of images for each class is equal. Then
the data is shuffled and fed in batches to the training algorithm. The softmax cross
entropy loss $l_B$ is used to train these binary networks:
\[
l_B(w) = \frac{1}{N} \sum_{i=1}^{N} \Big[ (1 - y_i) \log p(y_i = 0 \mid w) + y_i \log p(y_i = 1 \mid w) \Big] \tag{2.4}
\]
where $y_i \in \{0, 1\}$ is the attribute presence label, $p(y = j \mid w) = \frac{\exp(f_j^w(x))}{\sum_{i=0}^{1} \exp(f_i^w(x))}$, and
$f_i^w(x)$ is the logit of the $i$th output neuron of the network with weights $w$.
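As a minimal sketch of Eq. 2.4, the following Python computes the two-way softmax over a pair of logits and averages the per-image terms. The function names are hypothetical. The leading minus sign is an assumption: it follows the usual convention of minimizing the negative log-likelihood, whereas Eq. 2.4 as printed has no sign.

```python
import math

def softmax2(logits):
    """Two-way softmax over the logits (f_0, f_1) of one image."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def binary_cross_entropy(batch_logits, labels):
    """Average of the per-image terms of Eq. 2.4, negated (assumed convention)."""
    total = 0.0
    for logits, y in zip(batch_logits, labels):
        p0, p1 = softmax2(logits)
        total += (1 - y) * math.log(p0) + y * math.log(p1)
    return -total / len(labels)
```

With equal logits the network is maximally uncertain and the loss for a single image is log 2; the loss shrinks as the logit of the correct class grows.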
Multi*-CNNAA The Multi* networks are the proposed models that are
as complex as the binary models but predict multiple attributes at once. The face
parts and the number of attributes that are assigned to them can be found in Table
2.13. For each part, the corresponding network has an output layer that contains
neurons for each attribute that is assigned to that face part. We use the softmax
cross entropy loss for part $q$ as specified below:
\[
l^q(w) = \frac{1}{N} \sum_{a=1}^{N^q} \sum_{i=1}^{n_a^q} \Big[ (1 - y_i^a) \log p(y_i^a = 0 \mid w) + y_i^a \log p(y_i^a = 1 \mid w) \Big] \tag{2.5}
\]
where $N^q$ is the number of attributes assigned to part $q$, $n_a^q$ is the number of images
with the $a$th attribute of part $q$ in the current batch, $y_i^a \in \{0, 1\}$ is 1 if the $i$th image
has the $a$th attribute, and $N$ is the batch size. $p(y_i^a = 1 \mid w)$ is the same softmax as in
Eq. 2.4.
To deal with the class ratio imbalance of the attributes, we shuffle the training
data in a way that the network sees the rare class for each attribute frequently. For
example, for the attribute “Mustache”, the positive class is the rare one since most
of the 202k images do not have this attribute. To handle this imbalance, a queue
is created for each attribute and images that have the rare class are added to that
queue. A queue is also created for images with all the attributes belonging to the
major class. Then all of the queues are shuffled. We treat each queue as a circular
buffer so that the training batches are created by sampling with replacement from
one of these queues at random. Also, each time the images are distorted differently.
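The queue-based balancing described above can be sketched as follows. This is an illustration under stated assumptions, not the original training code: the function names and data layout are hypothetical, each batch element is drawn from a randomly chosen queue (the text could also be read as drawing a whole batch from one queue), and the per-draw image distortion is omitted.

```python
import random

def build_queues(labels, rare_value):
    """labels: list of dicts {attribute: 0 or 1}, one per image.
    rare_value: dict {attribute: the rare class (0 or 1)}.
    Returns one queue of image indices per attribute plus a 'majority' queue."""
    queues = {attr: [] for attr in rare_value}
    queues["majority"] = []
    for idx, image_labels in enumerate(labels):
        rare_attrs = [a for a, r in rare_value.items() if image_labels[a] == r]
        for attr in rare_attrs:
            queues[attr].append(idx)
        if not rare_attrs:
            # every attribute of this image belongs to its major class
            queues["majority"].append(idx)
    for q in queues.values():
        random.shuffle(q)
    return {name: q for name, q in queues.items() if q}  # drop empty queues

def sample_batch(queues, cursors, batch_size, rng=random):
    """Fill a batch by picking a queue at random for each slot and reading
    that queue as a circular buffer, so rare classes are revisited often."""
    names = list(queues)
    batch = []
    for _ in range(batch_size):
        name = rng.choice(names)
        q = queues[name]
        position = cursors.get(name, 0)
        batch.append(q[position % len(q)])  # wrap around: circular buffer
        cursors[name] = position + 1
    return batch
```

Because every queue is equally likely to be chosen, an attribute whose rare class covers only a small fraction of the images still contributes to a fixed share of each batch.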
After training all the networks, most of the attributes are present in multiple
networks. For each attribute, we only take the embeddings of relevant parts. For
instance, for the attribute “Mustache” in the MultiDeep-CNNAA, the 32-dimensional
embeddings of the parts mouth, mouth and nose, and mouth and chin are
taken and concatenated together. First, 10000 examples sampled from the training
portion of CelebA are selected for training, and the development set of CelebA is
used for tuning the hyperparameters of the linear SVMs. Then, following the protocol
of [Liu et al., 2015], linear SVMs are trained with the selected parameters on the
development set as their training set and tested on the test set.
2.4.3 Comparison of attribute detection methods
[Figure 2.11 image: two rows of sample face-part crops, panels (a) to (e) in each row; row labels: Mouth and Chin, Forehead and hair.]
Figure 2.11: Sample images from subspace clustering of face part embeddings in
attribute space.
We compare our proposed networks with FaceTracer [Kumar et al., 2008],
PANDA [Zhang et al., 2014b], and CelebA [Liu et al., 2015] attribute networks.
These models capture a broad spectrum of possible automatic attribute detection
models.
FaceTracer [Kumar et al., 2008] attribute classifiers are trained on traditional
low-level features, such as HOG and color histograms, extracted from aligned face parts.
The best set of features is found incrementally, and Support Vector Machines (SVMs)
are trained on the selected features and parts for attribute detection. The
face crops are extracted from the ground truth landmarks.
PANDA ensembles multiple CNNs for the face parts, concatenates the
outputs of their last layers, and trains SVMs for each attribute. There are two differences
between our network architecture and the PANDA networks. First, in PANDA, all of
the attributes are associated with all of the parts. Second, in our Multi*-CNNAA
networks, the last layer is shared between the softmax losses of all the attributes, but in
PANDA there are two fully connected layers after the shared fully connected layer
for each one. As a result, in our network, the different attributes that are associated
with one network lie in the same Euclidean space of the last fully connected layer
of the network.
CelebA takes a different approach by pre-training their network with face iden-
tities of CelebFaces [Sun et al., 2014b] for both face verification and identification.
Then they extract features from multiple overlapping crops of the face and train
SVMs for each crop for each attribute. To predict the attribute they average over
the scores of the SVMs. They use a localization network to detect the face region and
pass it on to the classifier networks.
We also follow [Levi and Hassner, 2015] and train a single task network for
each attribute in Binary*-CNNAA on the full face. Table 2.11 shows the accuracy
of each of these methods.
As can be seen, our Multi*-CNNAA networks give equal or better results
than the rest. The MultiWide-CNNAA performs slightly better than the
MultiDeep-CNNAA in attribute prediction. This may be due to the larger number of parameters
of the wide networks, as shown in Table 2.16. However, they are slower and consume more
energy.
2.4.4 Attribute discovery
As mentioned in the previous section, our Multi*-CNNAA networks transform
the input face regions to a shared Euclidean space for the attributes associated with
that part. To further explore this Euclidean space, we perform Sparse Subspace
Clustering (SSC) [Elhamifar and Vidal, 2009] on 10000 points that are selected
from the training portion of CelebA dataset. The intuition behind this clustering
is that the face parts that have the same attribute lie in the same subspace. SSC
uses the fact that each data point can be represented by a sparse linear combination
of the other points in the same subspace. Therefore it formulates the clustering
problem as
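\[
\underset{C \in \mathbb{R}^{n \times n}}{\text{minimize}} \quad |C|_1^2 + \|D - DC\|_F \tag{2.6}
\]
\[
\text{subject to} \quad \operatorname{diag}(C) = 0 \tag{2.7}
\]
where $D \in \mathbb{R}^{d \times n}$ is the data matrix containing $n$ points of dimension $d$ and $C \in \mathbb{R}^{n \times n}$
is the affinity matrix. To enforce the constraint, for each datapoint they take it out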
of D and then perform sparse coding. To get the clusters they perform spectral
clustering on C. We find 10 clusters per face region. The clusters corresponding to
the “Hair-Forehead” region of the face and the “eyes” region can be seen in Figure
2.11. As illustrated, the “discovered” attributes mostly overlap with the labels that we had
at training time, but some attributes are also divided into finer categories.
For example, in the mouth and chin category, cluster (b) contains images of people with
a chin shape similar to that of African-Americans, which was not present in the labels. In
the “Hair-Forehead” region, cluster (c) contains male images with short hair, which
again was not seen in the labels. Also, the gender of the people in the same cluster
is the same for these two parts. As shown in the next section, these attributes give
good results for authentication.
2.4.5 Experiments
We evaluate the performance of CNNAA for the task of active authentication
using two publicly available datasets MOBIO [McCool et al., 2012] and AA [Fathy
et al., 2015]. These datasets contain videos of the users interacting with cell phones.
We compare the authentication performance of our DCNN attribute detectors and
discovered attributes with baseline Local Binary Patterns [Ahonen et al., 2006] and
ACAA [Samangouei et al., 2015] which is the only attribute-based approach for this
task. The extracted attribute features of ACAA [Samangouei et al., 2015] from
the videos of these two datasets are provided by the authors. We follow the same
protocol as ACAA to extract facial parts and video features. So, we average over
the extracted attribute outputs for the video frames to get the video descriptors.
We cast the problem of continuous authentication as a face verification prob-
lem in which a pair of videos is given and we determine whether they contain the
same identity or not. To compare the performance of the algorithms, the receiver
operating characteristic (ROC) curve is used. Many other measures of performance
can be readily extracted from the ROC curve. The ROC curve plots the relationship
between false acceptance rates (FARs) and true acceptance rates (TARs). The ROC
curve can be computed from a similarity matrix S between gallery and probe videos.
We also report the equal error rate (EER), the operating point at which the FAR equals
the false rejection rate, 1 − TAR. The EER gives a good idea of the shape of the ROC
curve, since it is the point where the curve crosses the anti-diagonal line
TAR = 1 − FAR. Thus, the better the algorithm, the
lower its EER value.
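A small sketch of how ROC points and an approximate EER can be read off a set of similarity scores is given below. It assumes the scores have already been split into same-identity (genuine) and different-identity (impostor) pairs, uses the common definition of the EER as the point where FAR equals the false rejection rate 1 − TAR, and picks the closest observed operating point rather than interpolating. The function names are hypothetical.

```python
def roc_points(genuine, impostor):
    """Sweep a threshold over all observed scores.
    genuine: similarity scores of same-identity pairs.
    impostor: similarity scores of different-identity pairs.
    Returns (FAR, TAR) pairs; a pair is accepted when score >= threshold."""
    thresholds = sorted(set(genuine) | set(impostor), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing accepted
    for t in thresholds:
        tar = sum(s >= t for s in genuine) / len(genuine)
        far = sum(s >= t for s in impostor) / len(impostor)
        points.append((far, tar))
    return points

def equal_error_rate(genuine, impostor):
    """Approximate EER: the operating point where FAR is closest to 1 - TAR."""
    far, tar = min(roc_points(genuine, impostor),
                   key=lambda p: abs(p[0] - (1.0 - p[1])))
    return (far + (1.0 - tar)) / 2.0
```

Perfectly separated scores give an EER of 0, and the value grows toward 0.5 as the genuine and impostor score distributions overlap.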
We give each video frame to the CNNAA networks and predict the attributes
with linear SVMs. For the learned attributes, we use the probabilistic outputs of the
SVMs, which are trained with LIBSVM [Chang and Lin, 2011], as our final attribute
features. Since the attribute outputs of our models are probability values, we compute the
similarity value $s_{i,j} = \langle e_i, t_j \rangle$, where $e_i$ is the feature vector of the $i$th enrollment video
and $t_j$ is the feature vector of the $j$th test video.
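The video-level features and the inner-product similarity can be sketched in a few lines. The helper names are hypothetical, and each frame is assumed to be already reduced to a list of attribute probabilities.

```python
def video_descriptor(frame_outputs):
    """Average the per-frame attribute probability vectors of one video."""
    n = len(frame_outputs)
    return [sum(column) / n for column in zip(*frame_outputs)]

def similarity_matrix(enroll, test):
    """S[i][j] = <e_i, t_j> for enrollment descriptors e_i and test descriptors t_j."""
    return [[sum(a * b for a, b in zip(e, t)) for t in test] for e in enroll]
```

Each row of the resulting matrix holds the similarities between one enrollment video and every test video, which is the input needed to sweep an acceptance threshold.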
To use the discovered attributes (DiscAttrs) for authentication, we extract
the attribute features by a similar approach to Sparse Representation Classification
[Wright et al., 2009]. Each face crop from the video frame is embedded to the
attribute space of MultiDeep-CNNAA . It is represented by the dictionary which
we used in Section 2.4.4, so that we know the cluster assignment of its atoms.
We normalize all of the dictionary atoms and the embedding and then get each
feature value by a softmax over the representation contribution of each cluster in
the attribute space. To do so, we first solve
\[
\underset{f \in \mathbb{R}^{n}}{\text{minimize}} \quad |f|_1^2 + \|f - Df\|_F \tag{2.8}
\]
to get the sparse representation $f$ of the face crop of that video frame. Then we set
the $i$th feature for that face crop to $p(c = i \mid D)$, which is calculated by
\[
p(c = i \mid D) = \frac{\exp(\|D_{:,i} f_i\|)}{\sum_{k=1}^{10} \exp(\|D_{:,k} f_k\|)} \tag{2.9}
\]
where $D_{:,i}$ are the dictionary atoms of cluster $i$ and $f_i$ are the coefficients corresponding
to those atoms. Thus, if $f$ is in the subspace spanned by the points in $D$ that are in
cluster $i$, it will have more energy in the non-zero values for those atoms. To solve (2.8)
we use the Orthogonal Matching Pursuit [Tropp and Gilbert, 2007] algorithm with
sparsity 20. We choose 20 because the subspaces of the MultiDeep-CNNAA embedding
must have dimension less than the embedding space dimension, which is 32 for the
Deep*-CNNAA networks. Then we concatenate
Figure 2.12: Sample images of the three sessions of the AA01 dataset.
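The cluster softmax of Eq. 2.9 can be illustrated with the sketch below. It assumes the sparse code has already been computed (for example by an Orthogonal Matching Pursuit solver, which is not implemented here), that dictionary atoms are stored as plain lists, and that the function and argument names are hypothetical.

```python
import math

def cluster_features(atoms, coeffs, cluster_of, n_clusters):
    """atoms: dictionary atoms (columns of D), each a list of floats.
    coeffs: sparse code f, one coefficient per atom.
    cluster_of: cluster index of each atom.
    Returns the softmax over clusters of ||D_{:,i} f_i|| (Eq. 2.9)."""
    dim = len(atoms[0])
    contribution = [[0.0] * dim for _ in range(n_clusters)]
    for atom, c, k in zip(atoms, coeffs, cluster_of):
        for d in range(dim):
            contribution[k][d] += c * atom[d]  # accumulate D_{:,k} f_k
    norms = [math.sqrt(sum(v * v for v in vec)) for vec in contribution]
    m = max(norms)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in norms]
    total = sum(exps)
    return [e / total for e in exps]
```

A cluster whose atoms carry most of the reconstruction receives the largest feature value, and the values over all clusters sum to one.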
[Figure 2.13 plots: three ROC panels, (a) to (c), each showing True Acceptance Rate against False Acceptance Rate on axes from 0 to 1, with curves for ACAA, LBP, MultiDeep-CNNAA, MultiWide-CNNAA, and DiscAttrs.]
Figure 2.13: ROC curves of the different experiments on the AA01 [Fathy et al., 2015] and
MOBIO [McCool et al., 2012] datasets. (a) is the ROC curve of AA01 with all of the
sessions together in gallery and probe. (b) is the ROC curve of MOBIO with all of
the mobile sessions together, with the last session videos as gallery and the rest of
the sessions as probe. (c) is the ROC curve of the cross-device experiment.
e → t        ACAA   LBP    MD-CNNAA   MW-CNNAA   DiscAttrs
1 → 1        0.14   0.13   0.11       0.14       0.16
2 → 2        0.19   0.31   0.18       0.22       0.17
3 → 3        0.16   0.20   0.10       0.10       0.13
1 → 2, 3     0.38   0.38   0.18       0.25       0.23
2 → 1, 3     0.31   0.33   0.26       0.30       0.31
3 → 1, 2     0.31   0.38   0.19       0.24       0.25
Altogether   0.30   0.34   0.20       0.25       0.25
Table 2.14: The EER values for the different experiments on the AA01 [Fathy et al.,
2015] dataset. The session numbers are: 1. Office light 2. Low light 3. Natural
light. The DiscAttrs column contains the EER values using the discovered attributes.
Results
To plot the ROC curves and evaluate our method, in each dataset, for each
person one session’s videos are considered as the enrollment videos and the other
videos as test videos. The similarity matrix is then generated by pairwise distance
between the enrollment and the test videos. The corresponding ROC curve is plotted
for each experiment.
AA01 AA01 is a mobile dataset with 750 videos of 50 subjects. Each subject
has three sets of videos with three different lighting conditions. Each user is asked to
perform a set of actions on the phone while the front camera is recording the video.
The videos are captured by an iPhone 4 camera. The three lighting conditions
are: office light, low light, and natural light. The sample images of this dataset in
Figure 2.12 show the three different illuminations in each session. Figure 2.12 also
presents some partial faces in the dataset. Each person has five videos of performing
five different tasks on the phone. There is a designated enrollment video for each
person. Three different experiments have been conducted on this dataset.
First, the enrollment and the test videos for all of the 50 subjects are taken from
the session with the same lighting condition. The EER values of this experiment can
be found in the first three rows of Table 2.14. It can be seen that our MultiDeep-
CNNAA has the lowest EER in all cases. This experiment reveals the discriminative
power of the features when the surrounding environment is the same. It can be seen
that in this case, the high-dimensional LBP features even beat ACAA in the office light
session.
In the second one, the enrollment video is taken from one illumination session
and the test videos from another. The EER values corresponding to this experiment
are depicted in the next three rows of Table 2.14. The performance drop in our
method is 0.08 on average while ACAA suffers 0.17 and LBP 0.15. The reason is that
ACAA attribute classifiers use low level features that are sensitive to illumination
changes, but CNNAA is trained on a large-scale unconstrained dataset containing
a lot of variations and thus gives more robust features.
In the last experiment, all enrollment videos of the three sessions are put in
the gallery and all the test videos in the probe set to compute the similarity matrix. The
ROC curve corresponding to this third experiment is plotted in Figure 2.13(a).
It can be seen that MultiDeep-CNNAA performs the best and MultiWide-CNNAA
and the discovered attributes are tied as second best.
One explanation for lower performance of MultiWide-CNNAA compared to
MultiDeep-CNNAA is that it has many more parameters than MultiDeep-CNNAA
according to Table 2.16 and has overfitted to the celebrity faces distribution.
Figure 2.14: Sample images of the three sessions of the MOBIO dataset. First row
images are from different sites, second row is the pairs with the same identities in
two different sessions.
MOBIO MOBIO [McCool et al., 2012] is a more challenging dataset of 152
subjects. The videos are taken in six different universities across Europe. For most
of the subjects 12 sessions of video are captured. All of the mobile videos are
captured with a Nokia N93i. The first session’s videos are also recorded with a
2008 MacBook laptop. We perform two experiments on this dataset. We take the
12th session videos as our training videos since they are the most widely available videos
across the dataset. A few subjects have fewer than 12 session videos.
In the first experiment, we just consider videos that are taken by the mobile
device. We show the EER values for the mobile videos of the subjects within each site
as well as all of the videos together in Table 2.15. This experiment is similar to the
third experiment of AA01 dataset since the environment conditions for enrollment
videos and test videos can be the same or different. However, there are three times
more subjects in MOBIO and more variations in illumination condition of the videos.
The ROC curve for this experiment is plotted in Figure 2.13(b).
The second experiment concerns cross-sensor authentication, in which a user
enrolls on one device and is tested on another. To see how important
sensor change can be for low-level features, one can compare the performance of the
LBP feature in this experiment and the previous one in Table 2.15. The EER increases
by 0.10 for the LBP feature and by 0.05 for ACAA, which depends on low-level
features, while for the CNNAA methods it increases by only 0.01. Again, this is due to the
fact that CNNAA has seen more variations in the large training set. The
ROC curve for this experiment is presented in Figure 2.13(c). Our method
can also handle partial face verification if a partial face detector like [Mahbub et al.,
2016] is available.
Experiment   ACAA   LBP    MD-CNNAA   MW-CNNAA   DiscAttrs
but          0.26   0.36   0.19       0.20       0.23
idiap        0.25   0.35   0.27       0.25       0.24
lia          0.24   0.34   0.17       0.15       0.16
uman         0.27   0.33   0.18       0.20       0.21
unis         0.2    0.27   0.07       0.1        0.1
uoulu        0.18   0.23   0.14       0.14       0.19
Altogether   0.22   0.28   0.17       0.18       0.19
Mobile-PC    0.27   0.38   0.19       0.21       0.2
Table 2.15: The EER values corresponding to MOBIO dataset experiments.
Mobile Efficiency
There is a trade-off among power consumption, authentication speed, and
accuracy of the model for the task of active authentication on mobile devices. The
response time is important since we do not want to freeze other running processes
and create an unpleasant user experience while authenticating. Power consumption
is also important because frequent demands for charging the battery can be
annoying.
To show the effectiveness of our approach, we measure the attribute prediction
speed of our networks and the battery consumption on an LG Nexus 5 device. The
results are shown in Table 2.16. This mobile device has a quad-core Qualcomm
Snapdragon 800 clocked at 2.26 GHz and 2 GB of RAM. This specification is considered
average compared to current smartphones. We use the TensorFlow [Abadi
et al., 2015a] implementation of CNNs on Android devices.
Input size   Network            Parameters   Prediction time   Network            Parameters   Prediction time
128 × 52     D-UpperHead        275,360      0.15s             W-UpperHead        1,825,664    0.26s
115 × 41     D-BothEyes         227,936      0.11s             W-BothEyes         1,447,552    0.19s
90 × 62      D-EyesNose         244,704      0.13s             W-EyesNose         1,580,160    0.22s
40 × 56      D-Nose             170,400      0.06s             W-Nose             988,032      0.1s
55 × 82      D-NoseMouth        232,352      0.10s             W-NoseMouth        1,481,600    0.18s
65 × 38      D-Mouth            164,448      0.06s             W-Mouth            939,648      0.11s
115 × 107    D-EyesNoseMouth    441,632      0.28s             W-EyesNoseMouth    3,154,304    0.48s
128 × 45     D-MouthChin        244,640      0.13s             W-MouthChin        1,579,904    0.23s
62 × 100     D-Ear              256,864      0.14s             W-Ear              1,677,952    0.25s
53 × 39      D-Eye              162,400      0.06s             W-Eye              923,264      0.08s
Overall      MultiDeep-CNNAA    2.4M         1.22s             MultiWide-CNNAA    15.6M        2.10s
128 × 128    BinaryDeep-Full    584,160      0.36s             BinaryWide-Full    4,289,664    0.637s
Table 2.16: Network size and prediction speed of the networks. D-* means the network
has the MultiDeep-CNNAA architecture and W-* means it has the MultiWide-CNNAA architecture.
The Binary*-CNNAA network prediction times are for one attribute only. For all of
the attributes together, the time is 40 times this value.
We take one shot with the smartphone camera, feed it to the network
200 times, and measure the prediction speed as the average duration per
frame. To measure the power usage we use PowerTutor [Zhang et al., 2010], which
registers the energy usage per running application and also in total. We do not use
the camera continuously because it would bias the response time and power usage of
the network. We take the image once, and the application then runs in the background. The
default Android processes are the only other processes that are running besides the
application that runs the networks and the PowerTutor application.
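The timing protocol amounts to repeating the forward pass on a fixed image and averaging. A generic sketch is shown below; `predict` stands in for whatever callable runs the network and is a hypothetical placeholder, not part of the Android implementation used in the experiments.

```python
import time

def average_prediction_time(predict, image, repeats=200):
    """Run `predict` on the same image `repeats` times and
    return the mean wall-clock time per call in seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        predict(image)
    return (time.perf_counter() - start) / repeats
```

Reusing one captured frame keeps camera overhead out of the measurement, which matches the reason given above for not running the camera continuously.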
According to Table 2.16, all of the attributes are detected in 1.22s with
MultiDeep-CNNAA running on the CPU in the background without blocking other applications.
MultiWide-CNNAA takes 2.10s. The BinaryDeep-CNNAA takes 14.4s and
BinaryWide-CNNAA 25.5s.
The MultiDeep-CNNAA architecture consumes 780mW power on average and
MultiWide-CNNAA drains 1100mW of the battery power. The average battery
usage of Android when it is not running the CNNAA networks is 600mW according
to PowerTutor. To see how this affects the battery life, suppose the battery capacity
is C watt-hours (Wh). Then
\[
d = \frac{C}{P_n + \beta \alpha P_d} \tag{2.10}
\]
where $d$ is the mobile device’s battery life, $P_n$ is the power consumption in normal
use, $P_d$ is the power usage of the attribute detection algorithm, $\beta$ is the fraction of
time that the mobile device is being used, and $\alpha$ is the authentication ratio constant. $\alpha$
shows how often we want to authenticate the user considering the prediction time
of the algorithm, i.e., we authenticate every $T_a / \alpha$, where $T_a$ is the prediction time of
the model. For instance, if $\alpha = 0.5$ we authenticate every 2.44s using MultiDeep-CNNAA
and every 4.2s using MultiWide-CNNAA.
To make the feasibility of CNNAA clearer, suppose we authenticate the user
using the MultiDeep-CNNAA architecture on the Nexus 5 device. We choose the
MultiDeep-CNNAA since it performs well in the authentication task and also
has a better runtime and power usage. The Nexus 5 has a 2300mAh battery with
3.8V voltage, so C = 8.74Wh. Pn = 0.6W for the “normal usage” state, which
is when just Android 5 and the default applications are running. This gives 14.5
hours of battery life. Now if α = 1, which means we want to authenticate at the
highest speed possible, and if we are using the phone all the time with β = 1, then
the battery life will be reduced to 6.3 hours in the worst case. In a realistic setting
with β = 0.2 and α = 0.5 it becomes 12.85 hours, which is reasonable. Also, if a
GPU implementation of CNNs on Android [Sarkar et al., 2016] is used, attribute
prediction can happen much faster with less energy consumption.
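The battery-life arithmetic of Eq. 2.10 can be checked with a short script. It assumes that the 780mW reported for MultiDeep-CNNAA is the value of P_d; the function and constant names are hypothetical.

```python
def battery_life_hours(capacity_wh, p_normal_w, p_detect_w, alpha, beta):
    """Eq. 2.10: d = C / (P_n + beta * alpha * P_d)."""
    return capacity_wh / (p_normal_w + beta * alpha * p_detect_w)

CAPACITY_WH = 2.3 * 3.8   # 2300 mAh at 3.8 V = 8.74 Wh
P_NORMAL_W = 0.6          # normal Android usage, as reported by PowerTutor
P_DETECT_W = 0.78         # MultiDeep-CNNAA average power (assumed to be P_d)

idle = battery_life_hours(CAPACITY_WH, P_NORMAL_W, P_DETECT_W, alpha=0.0, beta=0.0)
worst = battery_life_hours(CAPACITY_WH, P_NORMAL_W, P_DETECT_W, alpha=1.0, beta=1.0)
typical = battery_life_hours(CAPACITY_WH, P_NORMAL_W, P_DETECT_W, alpha=0.5, beta=0.2)
```

Under these assumptions the three settings evaluate to roughly 14.6, 6.3, and 12.9 hours, close to the values quoted in the text.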
Chapter 3: Conditional Image Generation From Attribute Vectors
3.1 Introduction
High fidelity conditional image generation remains an elusive but highly sought
after goal of computer vision. Any such well-trained algorithm would allow users to
create an unlimited quantity of digital content for use in consumer photography, dig-
ital photo editing, and dataset generation. While this ultimate goal remains out of
reach of state-of-the-art algorithms, an approximation of this task has been phrased
as attribute-level image manipulation. The ability to manipulate and edit images
based on pre-defined attributes has various real-world applications. Consumer photo
editing software would ideally allow users to edit their images in specific ways, such
as re-touching hair-color or removing a hat from someone’s head. Additionally,
the ability to adequately model visual attributes may improve supervised learning
generalization via data augmentation.
Two recent works [Lample et al., 2017,Perarnau et al., 2016] have focused on
the problem of attribute-based image manipulation. In each, they train a condi-
tional generator on discretely represented binary attributes and a nuisance vector
which together completely encode an input image. This mechanism allows users to
manipulate, or Swap, a binary attribute of the input image and obtain a re-rendered
output image. For example, if one of the attributes is “has hat”, they can optionally
add a hat to a person’s head or conversely, remove one.
These efforts are indeed impressive. However, attribute Swaps performed on
discrete binary representations are limited. They do not allow users to sample
from the space of possible realizations of a binary attribute. For example, a user who
wants to activate the “has hat” attribute might want to sample from various
hats. Furthermore, a user might observe a realization of a particular attribute in
a reference image and want to apply it to a query image. For example, one might want to
activate the “has hat” attribute with a particular Fedora from a reference image.
To address these shortcomings, we introduce CRISPR, an algorithm for image
attribute manipulation that supports two new types of attribute manipulation
operations: Diverse Swaps and Borrows. A Swap is a change to a visual attribute
such that the input and output to our model have different visual attribute values,
such as hat versus no hat. A Diverse Swap is a change to a visual attribute that is
random: each time the operation is performed, it samples from the visual
space in a way that ensures both that the input and output images have different visual
attributes and that subsequent samples have variety. For example, if the input
attribute is “no hat”, then all output images should be wearing hats. However, each
sample from our model should produce a different hat. Alternatively, a Borrow is an
operation where we can change a visual attribute in such a way that it “borrows” the
visual component from a reference image and substitutes it into a query image. In
particular, say the input has no hat and we wish to flip the “hat” attribute to “has
hat” in such a way that the particular hat matches a top hat; then we should
be able to “borrow” the “top hat” representation from an example image and apply
it to the query image.
To summarize, our method exhibits the following unique characteristics:
• Models visual attributes more flexibly through a continuous, rather than dis-
crete, representation.
• Allows a user to alter an image by sampling from a learned distribution of
image attributes.
• Allows a user to alter an image by ’borrowing’ an attribute exemplar from
another image.
3.2 Related Work
Our work is related to research on two problems in generative models of images:
attribute-based image manipulation and learning disentangled representations.
[Yan et al., 2016] was the first to use a variational auto-encoder (VAE) to
build a conditional generative model where the image manipulation is performed
by inferring the latent state given the correct attributes and then changing the
attributes. [Perarnau et al., 2016] employed a similar encoder-decoder architecture,
but devised an adversarial training regime [Goodfellow et al., 2014a] in order to
improve the realism of manipulated images. [Antipov et al., 2017] utilized the same
idea to alter the facial appearance as a function of age. However, in both cases, the
training schemes lack a mechanism to ensure that the decoder utilizes the information in
the annotated labels, which limits the quality of image manipulation. Fader Networks
[Lample et al., 2017] aimed to alleviate this problem by imposing an additional
constraint on the latent code to be invariant under any attribute-specific information
of input images, and showed a great improvement. However, all these approaches
feed the attribute labels as input to the decoder, which incurs two limitations: (1)
detailed information about the manifestation of each attribute label is lost, e.g.,
we may know if the person wears glasses, but not the kind of glasses; (2) there
is no natural means to generate diverse examples of altered attributes. Our method,
in contrast, learns multi-dimensional, continuous representations of attribute-specific
information with a mechanism to sample from these spaces.
Another strand of related research is that of learning disentangled representations.
The closest to our work is the Inverse Graphics Network proposed in [Kulkarni
et al., 2015], in which attribute-specific latent codes are learned within an
auto-encoder network and can be used to alter image appearance. However, the model
does not learn an attribute-invariant latent code, which limits the range of
viable image manipulation operations, e.g., attribute transfer from one image to
another is not possible with this approach. By contrast, CRISPR models both
attribute-invariant and attribute-specific representations. Many other works in this
space, such as the predictability minimization framework [Schmidhuber, 1992],
InfoGAN [Chen et al., 2016], the conditional GAN based approach in [Mathieu et al., 2016],
and the neural photo editor [Brock et al., 2016], employ fully unsupervised methods,
without specification of attribute types. These methods aim to automatically
extract factors of variation, which may not necessarily be aligned with the specific
demands of users in image editing applications.
Figure 3.1: Visualization of approaches to attribute manipulation as graphical models.
(a) The graphical model for Fader Networks. (b) The graphical model for
CRISPR.
3.3 Model
Let $x \in \mathbb{R}^{W \times H \times C}$ be an image with $L$ binary attribute labels represented by
$y = [y_1, \ldots, y_L] \in \{0, 1\}^L$. For example, visual attribute labels assigned to images
of faces can include man/woman, glasses/no glasses, beard/no beard. We define
our model as an encoder-generator architecture, endowed with disentangled latent
structures between attribute-invariant and attribute-specific components. More
precisely, the encoder network $E$ is a convolutional network that maps a given image
$x$ to a set of multi-dimensional, continuous vectors $\{z, a_1, \ldots, a_L\}$, where $z \in \mathbb{R}^p$ is
a representation of $x$, invariant under any of the visual attributes $[y_1, \ldots, y_L]$, while
each $a_i \in \mathbb{R}^q$ encodes information in $x$ specific to the $i$th attribute. On the other
hand, the generator $G$ is a deconvolutional network that maps latent codes to an
image $\hat{x}$. We refer to $E_{inv}(x) := z$ and $E_{attr}(x) := [a_1, \ldots, a_L]$ as attribute-invariant
and attribute-specific representations, respectively. We discuss our proposed training
scheme to impose such structures on the latent space in Sec. 3.4.
While we represent each attribute in a high-dimensional continuous space, we
need to make sure that each attribute vector can be mapped to a discrete binary
label $y_i$. Let $H_i(a_i) \to y_i$ represent an attribute-specific classifier that maps each
continuous $a_i$ to a binary variable. Let $\mu_i^0$ and $\mu_i^1$ represent cluster centroids of
attribute $i$ for values 0 and 1, respectively. The graphical model of CRISPR can be
seen in Figure 3.1.
[Figure 3.2 image: block diagram with components labeled Encoder, Decoder, and Discriminator.]
Figure 3.2: Architectural overview of CRISPR, best viewed in
color. The image $x$ is encoded by $E$, producing invariance vector $z$ and attribute
vectors $a_0$ and $a_1$. These are concatenated and passed to decoder $G$, which produces
reconstruction $\hat{x}$. Each continuous attribute vector is decoded using a linear classifier
$H_i$. Auxiliary components of the architecture are framed by the dashed green
outline. Discriminator $D$ encourages the reconstructions to both appear natural
and exhibit the expected attributes, classifier $C_{inv}$ encourages $z$ to encode only
non-attribute information, and classifier $C_{attr}$ encourages each attribute vector to encode
only its attribute information.
3.3.1 Inference
Reconstruction with our model is straightforward: x̂ = G(E(x)). However,
it is our continuous representation of attributes that permits two unique modes of
inference, Diverse Swaps and Borrows.
With a discrete representation of attributes, regular attribute swaps are
easy in that one need only artificially alter a discrete binary embedding. However,
one cannot sample from the attribute space. Our model allows for Diverse Swaps:
stochastic invocations of the swap operation that produce different visual
manifestations. To formalize this operation, consider the oracle classifier $C_{oracle}(x, i) \to y_i$,
which is able to classify the attributes of an image $x$. Then a swap on attribute $i$ is an
operation $S(x, i) \to \hat{x}$ such that $C_{oracle}(x, i) \neq C_{oracle}(\hat{x}, i)$. A Diverse Swap adheres
to the same constraint, but because it is a stochastic transformation, subsequent
invocations of $S$ produce $\hat{x}$ and $\hat{x}'$ such that $\hat{x} \neq \hat{x}'$.
In order to perform a Diverse Swap, one first embeds image $x$ as $\{a_0, \ldots, a_i, \ldots, z\}$.
Next, one need only sample from a Gaussian distribution with mean $\mu_i^0$ or $\mu_i^1$ and
scaled standard deviation to produce a new sampled attribute vector $a_i'$. Finally, the
resulting image is decoded via $\hat{x} = G(a_0, \ldots, a_i', \ldots, z)$. This partitioned representation
not only allows us to swap visual attributes, but also lets us produce a diverse set
of possible swaps.
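The latent-space step of a Diverse Swap can be sketched as below. This is only an illustration: the latent code is assumed to be a plain dictionary with an invariant vector and a list of attribute vectors, the Gaussian is assumed isotropic with a single standard deviation `sigma`, and the encoder and generator networks are left out, so the function returns the edited code that would then be passed to G. All names are hypothetical.

```python
import random

def diverse_swap(code, i, centroid, sigma, rng=random):
    """code: dict with 'z' (invariant vector) and 'attrs' (list of attribute vectors).
    Replaces attribute vector i with a Gaussian sample around `centroid`,
    the cluster centre of the target attribute value, and returns a new code."""
    sampled = [rng.gauss(mu, sigma) for mu in centroid]
    attrs = list(code["attrs"])  # copy so the input code is left untouched
    attrs[i] = sampled
    return {"z": code["z"], "attrs": attrs}
```

Calling the function twice with the same arguments yields two different attribute vectors near the same centroid, which is what makes the decoded swaps diverse while the invariant part z and the other attributes stay fixed.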
A Borrow represents a transformation on attribute i such that attribute i
is borrowed from exemplar image xexemplar to image x. To perform a Borrow, x is
encoded as $\{a_0, \ldots, a_i, \ldots, z\}$, xexemplar is encoded as $\{a_0^{ex}, \ldots, a_i^{ex}, \ldots, z^{ex}\}$, and an attribute
from xexemplar is borrowed to produce the encoding $\{a_0, \ldots, a_i^{ex}, \ldots, z\}$. Finally, the resulting
image is decoded with the generator via $\hat{x} = G(a_0, \ldots, a_i^{ex}, \ldots, z)$. The overview of our method is shown
in Figure 3.2.
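The two inference modes can be sketched as operations on the partitioned code before decoding. The following is a minimal illustration, not the implementation used in this work: the dictionary layout of the code, the function names `diverse_swap` and `borrow`, and the `scale` parameter are assumptions made for the example, and the trained encoder E and generator G are taken as given and not shown.

```python
import numpy as np

def diverse_swap(codes, i, mu, sigma, scale=1.0, rng=None):
    """Resample attribute code i from a Gaussian around a class centroid.

    codes: dict mapping "z" and attribute indices to 1-D arrays (hypothetical layout).
    mu, sigma: centroid and standard deviation of the target attribute state.
    """
    rng = np.random.default_rng() if rng is None else rng
    swapped = dict(codes)  # all other codes are left unchanged
    swapped[i] = mu + scale * sigma * rng.standard_normal(np.shape(mu))
    return swapped

def borrow(codes, exemplar_codes, i):
    """Replace attribute code i of the query with that of the exemplar."""
    borrowed = dict(codes)
    borrowed[i] = np.array(exemplar_codes[i], copy=True)
    return borrowed

rng = np.random.default_rng(0)
query = {"z": np.zeros(4), 0: np.ones(3), 1: np.full(3, 2.0)}
exemplar = {"z": np.ones(4), 0: np.full(3, 5.0), 1: np.full(3, 7.0)}

borrowed = borrow(query, exemplar, 1)
swap_a = diverse_swap(query, 0, mu=np.zeros(3), sigma=1.0, rng=rng)
swap_b = diverse_swap(query, 0, mu=np.zeros(3), sigma=1.0, rng=rng)
```

Decoding either result with G would then give the edited image. Repeated calls to `diverse_swap` return different codes for attribute i, which is what yields different visual manifestations of the same swap.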
3.4 Training
An ideally trained CRISPR model should exhibit several characteristics. First,
the image reconstructions should be high fidelity. Second, the encoder should
produce a partitioned representation with attribute-invariant and attribute-specific
components. Third, the latent attribute space must be amenable to sampling in
order to produce diverse and plausible samples. To this end, we introduce a set of
auxiliary components and losses to the model.
3.4.1 Reconstruction
To ensure that our reconstructions are high fidelity, we use the thresholded
mean squared error Lrecon:
$$\min_{G,E}\ \mathbb{E}_x\left[\max\left(N^{-1}\,\|G(E(x)) - x\|_2^2,\ \kappa\right)\right] \tag{3.1}$$
where N = H×W ×C is the number of pixels and κ is a threshold used to avoid the
reconstruction loss dominating optimization. Additionally, we use an adversarial loss
to capture fine textures [Isola et al., 2017]. Formally, let discriminator D : X ×Y →
[0, 1] be a binary classifier trained to discriminate if a given image-label pair (x, y)
is real or fake. The following loss function LcGAN is additionally optimized during training:
$$\min_{G,E}\ \max_{D}\ \mathbb{E}_{x,y}\left[\log D(x, y) + \log\left(1 - D(G(E(x)), y)\right)\right] \tag{3.2}$$
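To make the role of the threshold κ concrete, the sketch below evaluates the thresholded mean squared error on a batch. It is only an illustration under our own assumptions: the function name `thresholded_mse`, the (batch, H, W, C) array layout and the averaging over the batch are not specified in the text.

```python
import numpy as np

def thresholded_mse(x, x_hat, kappa):
    """Mean over the batch of max(per-image MSE, kappa).

    x, x_hat: arrays of shape (batch, H, W, C) (assumed layout).
    """
    per_image = np.mean((x_hat - x) ** 2, axis=(1, 2, 3))  # N^{-1} ||x_hat - x||^2
    return float(np.mean(np.maximum(per_image, kappa)))

x = np.zeros((2, 4, 4, 3))
x_hat = np.full((2, 4, 4, 3), 0.1)                     # per-image MSE is 0.01
loss_above = thresholded_mse(x, x_hat, kappa=0.001)    # error above the threshold
loss_clamped = thresholded_mse(x, x_hat, kappa=0.05)   # error below the threshold
```

Once the per-image error falls below κ the term is constant, so it no longer contributes a gradient and the remaining losses drive optimization.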
3.4.2 Learning attribute-invariant representation, z
Encoder E should learn to produce latent codes z that are independent of the
attributes [y0, ..., yL]. In particular, following [Lample et al., 2017, Mathieu et al.,
2016] we employ adversarial training on z where additional classifier Cinv is trained
to identify the true attributes y of an input image x from its latent code z = E(x).
This scheme aims to obtain invariance in z by training encoder E such that Cinv is
unable to identify the attributes from any input images.
The classifier Cinv takes z = Einv(x) and generates a vector of probabilities
$[C_{inv}^{(1)}(z), \ldots, C_{inv}^{(L)}(z)] \in [0, 1]^L$ of the respective attributes being present in the image, i.e.
we estimate $P(y \mid z) \approx \prod_{i=1}^{L} C_{inv}^{(i)}(z)^{y_i} \cdot (1 - C_{inv}^{(i)}(z))^{1-y_i}$. The weights of Cinv are
trained in an adversarial fashion: In one stage, we train the encoder to produce z
vectors that maximize the probability of misclassification by Cinv. In another step
we train the classifier Cinv to predict all of the attributes correctly from z. More
formally we optimize the invariance loss Linv:
$$\min_{C_{inv}}\ \max_{E_{inv}}\ \mathbb{E}_{x,y}\left[-\sum_{i=1}^{L}\Big(y_i \cdot \log C_{inv}^{(i)}(E_{inv}(x)) + (1 - y_i) \cdot \log\big(1 - C_{inv}^{(i)}(E_{inv}(x))\big)\Big)\right] \tag{3.3}$$
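The quantity inside the expectation of the invariance loss is a sum of binary cross-entropies over attributes. A minimal sketch of its evaluation follows; the function name `attribute_cross_entropy`, the clipping constant `eps` and the (batch, L) layout are assumptions made for illustration. The classifier step lowers this value, while the encoder step raises it.

```python
import numpy as np

def attribute_cross_entropy(probs, y, eps=1e-7):
    """Batch mean of -sum_i [y_i log p_i + (1 - y_i) log(1 - p_i)].

    probs: (batch, L) outputs of C_inv on z; y: (batch, L) binary labels.
    """
    p = np.clip(probs, eps, 1.0 - eps)  # avoid log(0)
    per_example = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p), axis=1)
    return float(per_example.mean())

y = np.array([[1, 0, 1], [0, 0, 1]])
confused = attribute_cross_entropy(np.full((2, 3), 0.5), y)
confident = attribute_cross_entropy(np.where(y == 1, 0.99, 0.01), y)
```

A classifier that outputs 0.5 for every attribute gives the value L · ln 2, which is the behaviour an attribute-invariant code z is meant to push Cinv toward.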
3.4.3 Decoupling attribute specific representations {ai}
In addition to latent code z not representing any attribute specific information,
we also want each attribute code ai to encode only information about that particular
attribute. To this end, we use an additional classifier Cattr to identify the true
attributes y based on one of the latent codes {a1, ..., aL}. On the other hand, for each
attribute i, the encoder E is trained to generate a code $a_i = E_{attr}^{(i)}(x)$ such that
the classifier Cattr can correctly infer the attribute yi while being maximally
confused about the other attributes, i.e. $C_{attr}^{(i)}(a_i) = y_i$ and $C_{attr}^{(j)}(a_i) = 0.5$ for
j ≠ i, for any input image. More concretely, we propose the following encoder loss
Lattr to be maximized:
$$\min_{C_{attr}}\ \max_{E_{attr}}\ \mathbb{E}_{x,y}\left[\sum_{i,j=1}^{L} (-1)^{\mathbb{1}_{i=j}} \cdot \Big[y_j \cdot \log C_{attr}^{(j)}(E_{attr}^{(i)}(x)) + (1 - y_j) \cdot \log\big(1 - C_{attr}^{(j)}(E_{attr}^{(i)}(x))\big)\Big]\right] \tag{3.4}$$
In the maximization step, the encoder tries to fool the classifier with the ith attribute
representation on all other labels but yi. In the discriminator step, the classifier
tries to adjust its weights to learn to predict all the attributes from every single ai.
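For a single image, the double sum in Lattr can be evaluated from an L × L table of classifier outputs, with the sign pattern separating the matching pairs (i = j) from the cross pairs (i ≠ j). The sketch below simply evaluates that bracketed sum as written; the table layout `P[i, j]` and the function name `attr_objective` are assumptions made for illustration.

```python
import numpy as np

def attr_objective(P, y, eps=1e-7):
    """Evaluate the bracketed double sum of the attribute loss for one image.

    P[i, j]: output of classifier head j on attribute code a_i (hypothetical layout).
    y[j]: binary label of attribute j.
    """
    P = np.clip(P, eps, 1.0 - eps)
    log_lik = y[None, :] * np.log(P) + (1 - y[None, :]) * np.log(1 - P)
    sign = np.where(np.eye(len(y), dtype=bool), -1.0, 1.0)  # (-1)^{1[i=j]}
    return float(np.sum(sign * log_lik))

y = np.array([1, 0])
P = np.array([[0.9, 0.5],    # a_1: informative about y_1, uninformative about y_2
              [0.5, 0.1]])   # a_2: uninformative about y_1, informative about y_2
value = attr_objective(P, y)
```

In this toy table each code is predictive only of its own attribute, which is the configuration the text describes as the target of training.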
3.4.4 Sampling Attributes
Up to now, by just training with the aforementioned losses, we can already map
images to attribute specific representations $\{a_i\}_{i=1}^{L}$, which allows us to perform the
“Borrow” operation mentioned in section 3.3 by transplanting these codes from one
image to another. However, there is no mechanism to map each of these attribute
specific codes to a discrete binary label yi, which prevents us from performing an
attribute swap, let alone generating multiple plausible samples of it (“DiverseSwap”).
To address this limitation, we further regularize the latent space with the addition
of a center loss [Wen et al., 2016], which not only improves the quality of the embedding
by encouraging images with similar attributes to cluster together, but also provides
“switches” in the latent space for manipulating attributes.
We now formalize our proposition. Suppose we want to alter the first attribute
of image x from y1 = 1 to y1 = 0. Given the latent representation E(x) =
[z, a1, ..., aL], we can define this operation of attribute swap as finding the alternative
attribute code such that the probability that the reconstructed image does not
have the first attribute is maximized, i.e. $a_1^{new} = \arg\max_{a_1} P(y_1 = 0 \mid z, a_1, \ldots, a_L)$,
and subsequently reconstructing from the altered latent codes, $x^{new} = G(z, a_1^{new}, \ldots, a_L)$.
If we assume that given the code a1, the attribute label y1 is conditionally independent
of the invariant code z and other attribute specific codes a2, ..., aL, and the
marginal distributions P(y1) and p(a1) are constant, we see that the attribute swap
reduces to finding $a_1^{new} = \arg\max_{a_1} P(a_1 \mid y_1 = 0)$. Our center loss based scheme
estimates $P(a_1 \mid y_1 = 0)$ with a Gaussian distribution $\mathcal{N}(a_1; \mu_1^0, \sigma_1^0)$, and thus alters
the attribute by setting $a_1^{new} = \mu_1^0$. To perform “DiverseSwap” in this example, we
generate a set of plausible attribute codes instead by sampling from this distribution,
$\{a_1^{new,1}, a_1^{new,2}, \ldots\} \sim \mathcal{N}(a_1; \mu_1^0, \sigma_1^0)$. More generally, this scheme models the latent
of every attribute state as a Gaussian distribution, i.e. $P(a_i \mid y_i) \approx \mathcal{N}(a_i; \mu_i^{y_i}, \sigma_i^{y_i})$ for
i = 1, ..., L, yi ∈ {0, 1}.
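The reduction of the swap to a centroid lookup follows from Bayes' rule. Spelled out under the assumptions stated above (conditional independence of y1 from the other codes given a1, and P(y1) and p(a1) treated as constants), the chain of steps is roughly:

```latex
\begin{aligned}
a_1^{new} &= \arg\max_{a_1} P(y_1 = 0 \mid z, a_1, \ldots, a_L) \\
          &= \arg\max_{a_1} P(y_1 = 0 \mid a_1)
             && \text{(conditional independence)} \\
          &= \arg\max_{a_1} \frac{P(a_1 \mid y_1 = 0)\, P(y_1 = 0)}{p(a_1)}
             && \text{(Bayes' rule)} \\
          &= \arg\max_{a_1} P(a_1 \mid y_1 = 0)
             && \text{($P(y_1)$ and $p(a_1)$ constant)} \\
          &= \mu_1^0
             && \text{(mode of } \mathcal{N}(a_1; \mu_1^0, \sigma_1^0)\text{)}
\end{aligned}
```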
We now describe how to learn the means (centroids) and standard deviations
$\{\mu_i^0, \mu_i^1, \sigma_i^0, \sigma_i^1\}_{i=1}^{L}$ of the attribute-specific latent codes. We learn the two in an
alternating fashion. First, given a set of $\{\mu_i^0, \mu_i^1\}$ initialized to zero vectors, we estimate
the standard deviations as follows. We train attribute specific linear classifiers
$C_{spec}^{i}(a_i) = w_i^T a_i + b_i$ in the space of attribute representations by minimizing the
standard cross-entropy loss $L_{spec}^{i}$, and estimate $\sigma_i^0, \sigma_i^1$ by the distance of the centroid
from the decision boundary of $C_{spec}^{i}$, where c ∈ {0, 1}:
$$\sigma_i^c = \frac{|w_i^T \mu_i^c + b_i|}{\|w_i\|} \tag{3.5}$$
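The standard deviation estimate is the usual point-to-hyperplane distance. A small numerical sketch follows; the function name `sigma_from_boundary` and the toy values are ours.

```python
import numpy as np

def sigma_from_boundary(w, b, mu):
    """Distance of centroid mu from the hyperplane w^T a + b = 0."""
    return abs(float(np.dot(w, mu)) + b) / float(np.linalg.norm(w))

w = np.array([3.0, 4.0])
sigma = sigma_from_boundary(w, b=5.0, mu=np.array([1.0, 2.0]))    # |3 + 8 + 5| / 5
sigma_at_origin = sigma_from_boundary(w, b=5.0, mu=np.zeros(2))   # |b| / ||w||
```

The second call shows the value obtained while a centroid is still at its zero initialization, where the estimate depends only on the classifier's bias and weight norm.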
We then estimate the centroids $\{\mu_i^0, \mu_i^1\}$ by minimizing Lcenter:
$$\min_{E,\,\mu_i^0,\,\mu_i^1}\ \mathbb{E}_{x,y}\left[\sum_{i=1}^{L} \frac{1}{\sigma_i^{y_i}} \left\|E_{attr}^{(i)}(x) - \mu_i^{y_i}\right\|_2^2\right] \tag{3.6}$$
where $E_{attr}^{(i)}(x)$ is the attribute code ai for image x. The centers are updated in every
step of stochastic gradient descent with a batch-wise maximum likelihood update
step with momentum as in [Wen et al., 2016]:
$$\nabla\mu_i^c\{t\} = \mu_i^c - \frac{1}{|\mathrm{Ind}(y_i = c)|} \sum_{k \in \mathrm{Ind}(y_i = c)} E(x_k) \tag{3.7}$$
$$\mu_i^c\{t+1\} = \mu_i^c\{t\} + \alpha \nabla\mu_i^c\{t\} \tag{3.8}$$
where $\mu_i^c$ is the centroid of the ith attribute for binary class c and Ind(yi = c) denotes
the set of indices in the batch in which attribute i has yi = c. On the other hand,
minimizing this loss with respect to the parameters of the encoder E encourages the
points to be closer to centroids that are near the hyperplane.
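A batch-wise centroid update of this kind can be sketched as follows. The function name `update_center` and the handling of batches with no sample of class c are assumptions, and the step direction (each centroid moves toward the batch mean of its class, as in the center loss of Wen et al.) is our reading of the sign convention rather than something the text fixes unambiguously.

```python
import numpy as np

def update_center(mu, codes, labels, c, alpha):
    """One batch-wise update of the centroid for class c of a single attribute.

    codes: (batch, d) attribute codes; labels: (batch,) binary labels y_i.
    """
    idx = np.flatnonzero(labels == c)      # Ind(y_i = c)
    if idx.size == 0:                      # no sample of this class in the batch
        return mu
    grad = mu - codes[idx].mean(axis=0)    # centroid minus batch mean of the codes
    return mu - alpha * grad               # assumed direction: step toward the batch mean

codes = np.array([[2.0, 2.0], [4.0, 4.0], [10.0, 10.0]])
labels = np.array([1, 1, 0])
mu_half = update_center(np.zeros(2), codes, labels, c=1, alpha=0.5)
mu_full = update_center(np.zeros(2), codes, labels, c=1, alpha=1.0)
```

With α = 1 the centroid jumps to the batch mean of its class; smaller values of α give a smoothed estimate across batches.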
3.4.5 Prior on latent codes
We regularize the outputs of the encoder to be in the interval [−β, β] by Lprior:
$$\min_{E}\ \mathbb{E}_x\left[\mathrm{ReLU}(|E(x)| - \beta)\right] \tag{3.9}$$
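The prior penalizes only the part of each code entry that falls outside [−β, β]. A minimal sketch, assuming the penalty is averaged over code entries (the reduction to a scalar and the function name `prior_loss` are our assumptions):

```python
import numpy as np

def prior_loss(codes, beta):
    """Mean of ReLU(|code| - beta) over all code entries."""
    return float(np.mean(np.maximum(np.abs(codes) - beta, 0.0)))

inside = prior_loss(np.array([-0.9, 0.0, 0.5]), beta=1.0)    # all entries within [-1, 1]
outside = prior_loss(np.array([-3.0, 0.5, 2.0]), beta=1.0)   # excesses are 2, 0 and 1
```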
3.4.6 Optimization
Our final loss is defined as:
$$L_{CRISPR} = L_{GAN} + L_{recon} + L_{inv} + L_{attr} + L_{center} + \sum_i L_{spec}^{i} \tag{3.10}$$
We iteratively optimize the model via three repetitive steps. First, we optimize
LGAN, Linv, Lattr for two steps with respect to D, Cinv, Cattr and H. Next, we
optimize Lprior, LGAN, Linv, Lattr, Lcenter for a single step with respect to E. Finally,
we optimize Lprior, LGAN, Lrecon, Lattr, Lcenter for a single step with respect to G.
All three are optimized using an Adam optimizer (beta=0.5).
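The alternation described above can be written down as a simple schedule. The generator below only enumerates which parameter group is updated with which losses at each step; the name `crispr_schedule` and the string labels are placeholders of ours, and the actual gradient steps are omitted.

```python
def crispr_schedule(num_rounds):
    """Yield (parameter group, losses) for each optimizer step."""
    critic_losses = ("L_GAN", "L_inv", "L_attr")
    encoder_losses = ("L_prior", "L_GAN", "L_inv", "L_attr", "L_center")
    generator_losses = ("L_prior", "L_GAN", "L_recon", "L_attr", "L_center")
    for _ in range(num_rounds):
        for _ in range(2):                       # two steps for D, C_inv, C_attr and H
            yield ("D, C_inv, C_attr, H", critic_losses)
        yield ("E", encoder_losses)              # a single encoder step
        yield ("G", generator_losses)            # a single generator step

steps = list(crispr_schedule(num_rounds=2))
```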
3.5 Experiments
We evaluate our model on the CelebA dataset. Our goal is to verify that
the model produces plausible images, that we can sample from the space of possible
attribute realizations and observe good diversity (Diverse Swaps), and that we can encode
attributes from a reference image and borrow those attributes to apply to a query
image (Borrows).
We use the standard train/validation/test splits in the following manner: 2k
images were used from the original validation set as the classifier-training set, all
160k images were used to train CRISPR, the remaining 14k validation images were
used for validation. We used the standard test set. We used binary class pairs
(glasses, no glasses) and (mustache, no mustache).
We illustrate examples of Diverse Swaps in Figure 3.3 and Figure 3.4. As
these images illustrate, our model is able to produce examples via sampling that
are not only visually pleasing but also diverse. For example, each row and column
in Figure 3.3 illustrates a sample from CRISPR with a different ’bangs’ attribute
vector. As the images illustrate, some of the samples differ subtly while others
differ significantly. Figure 3.4 shows different samples of the ’eyeglasses’ attribute
vector. As the results indicate, different vectors result in very different styles of
attributes. For example, the final row shows several different glasses of varying
shape and shading. The second row also illustrates an interesting failure mode: one
side of the eyeglasses is light while the other is dark. This suggests that while our
sampling strategy largely produces coherent attributes, we may be sampling from a
part of the space that is not atomic: in other words, certain aspects of attributes
are being incorrectly broken up in the representation.
Examples of the Borrows are shown in Figure 3.5. The first column shows the
reference image from which we want to borrow a smile. The second column shows
the query image, whose smile we want to alter. The final column shows the output
of CRISPR when borrowing the smile attribute vector from the reference image.
Figure 3.3: Examples illustrating the Diverse Swap operation using the ’bangs’
attribute. The left-most column shows the inputs and the remaining columns
illustrate the results of sampling from the space of bangs attributes.
Figure 3.4: Examples illustrating the Diverse Swap operation using the ’eyeglasses’
attribute. The left-most column shows the inputs and the remaining columns
illustrate the results of sampling from the space of eyeglasses attributes. The results
demonstrate not only the diversity of the selection of eyeglasses, but also a failure
case (second row, last image): the sampled attributes may not all be atomic.
[Figure 3.5 column labels: Reference, Query, Borrowed]
Figure 3.5: Examples illustrating the Borrow operation using the ’smile’ attribute.
As these examples illustrate, CRISPR is able to effectively borrow the smile from
the reference image and apply it to the query image.
Chapter 4: Model Explanation via Decision Boundary Crossing Transformations
4.1 Introduction
Given a classifier, one may ask: What high-level, semantic features of an
input is the model using to discriminate between specific classes? Being able to
reliably answer this question amounts to an understanding of the classifier’s decision
boundary at the level of concepts or attributes, rather than pixel-level statistics.
The ability to produce a conceptual understanding of a model’s decision bound-
ary would be extremely powerful. It would enable researchers to ensure that a model
is extracting relevant, high-level concepts, rather than picking up on spurious features
of a dataset. For example, criminal justice systems could determine whether
their ethical standards were consistent with those of a model [Goodman and Flaxman,
2016]. Additionally, in domains where decisions are difficult to formalize and
automatically verify (e.g., medical applications, self-driving cars), it would provide
some measure of validation to consumers.
Unfortunately, directly visualizing or interpreting decision boundaries in high
dimensions is effectively impossible and existing post-hoc interpretation methods
fall short of adequately solving this problem. Dimensionality reduction approaches,
such as t-SNE [Maaten and Hinton, 2008], are often highly sensitive to their hyperparameters,
whose values may drastically alter the visualization [Wattenberg et al.,
2016]. Saliency maps are typically designed to highlight the set of pixels that contributed
highly to a particular classification. While they can be useful for explaining
factors that are present, they cannot adequately describe predictions made due to
objects that are missing from the input. Explanation-by-Nearest-Neighbor-Example
can indeed demonstrate similar images to a particular query, but there is no guarantee
that similar enough images exist to be useful and similarity itself is often
ill-defined.
To overcome these limitations, we introduce a novel technique for post-hoc
model explanation. Our approach visually explains a model’s decisions by producing
images on either side of its decision boundary whose differences are perceptually
clear. Such an approach makes it possible for a practitioner to conceptualize how
a model is making its decisions at the level of semantics or concepts, rather than
vectors or pixels.
Our algorithm is motivated by recent successes in both pixel-wise domain
adaptation [Bousmalis et al., 2017, Liu et al., 2017, Zhu et al., 2017] and style transfer
[Johnson et al., 2016] in which generative models are used to transform images from
one domain to another. Given a pre-trained classifier, we introduce a second, post-hoc
explaining network called ExplainGAN that takes a query image that falls on
one side of the decision boundary and produces a transformed version of this image
that falls on the other. ExplainGAN exhibits three important properties that make
it ideal for post-hoc model interpretation:
Easily Visualizable Differences: Adversarial example [Szegedy et al., 2013]
algorithms produce decision boundary crossing images whose differences from the
originals are not perceptible, by design. In contrast, our model transforms the input
image in a manner that is clearly detectable by the human eye.
Localized Differences: Style transfer [Gatys et al., 2015] and domain adap-
tation approaches typically produce low-level, global changes. If every pixel in the
image changes, even slightly, it is not clear which of those changes actually influenced
the classifier to produce a different prediction. In contrast, our model yields changes
that are spatially localized. Such sparse changes are more easily interpretable by a
viewer as fewer elements change.
Semantically Consistent: Our model must be consistent with the behavior
of the pre-trained classifier to be useful: the class predicted for a transformed image
must not match with the predicted class of the original image.
We evaluate our model using standard approaches as well as a new metric for
evaluating this novel style of model interpretation by visualizing boundary-crossing
transformations. We also utilize a new medical image dataset where the concept
of objectness is not well defined, making it less amenable to domain adaptation
approaches that hinge on identifying an object and altering or removing it. Furthermore,
this dataset represents a clear and practical use-case for model explanation.
To summarize, our work makes several contributions:
1. A new approach to model interpretation: visualizing human-interpretable,
decision-boundary crossing images.
2. A new model, ExplainGAN, that produces post-hoc model-explanations via
such decision-boundary crossing images.
3. A new metric for evaluating the amount of information retained in decision-
boundary crossing transformations.
4. A new and challenging medical image dataset.
4.2 Related work
Post-Hoc Model Interpretation methods typically seek to provide some
kind of visualization of why a model has made a particular decision in terms of the
saliency of local regions of an input image. These approaches broadly fall into two
main categories: perturbation-based methods and gradient-based methods.
Perturbation-based methods [Zeiler and Fergus, 2014, Fong and Vedaldi, 2017]
perturb the input image and evaluate the consequent change in the output of the
classifier. Such perturbations remove information from specific regions of the in-
put by applying blur or noise, among other pixel manipulations. Perturbation-
based methods require multiple iterations and are computationally more costly than
activation-based methods.
The perturbation of finer regions also makes these methods vulnerable to the
artifacts of the classifier, potentially resulting in the assignment of high saliency to
arbitrary, uninterpretable image regions. In order to combat these artifacts, current
methods such as [Fong and Vedaldi, 2017] are forced to perturb larger, less precise
regions of the input.
Gradient-based methods such as [Simonyan et al., 2013, Sundararajan et al.,
2017, Shrikumar et al., 2017, Shrikumar et al., 2016, Springenberg et al., 2014] backpropagate
the gradient for a given class label to the input image and estimate how
moving along the gradient affects the output. Although these methods are computationally
more efficient compared to perturbation-based methods, they rely on
heuristics for backpropagation and may not support different network architectures.
A subset of gradient-based methods, which we call activation-based methods,
also incorporate neuron activations into their explanations. Methods such as
Gradient-weighted Class Activation Mapping (Grad-CAM) [Selvaraju et al., 2016b],
Layer-wise Relevance Propagation (LRP) [Bach et al., 2015] and Deep Taylor Decomposition
(DTD) [Montavon et al., 2017] can be considered activation-based
methods. Grad-CAM visualizes the linear combination of (typically) the last convolution
layer and class specific gradients. LRP and DTD decompose the activations
of each neuron in terms of contributions (i.e. relevances) from its input.
All these explanation methods are based on identifying pixels which contribute
the most to the model output. In other words, these methods explain a model’s
decision by illustrating which pixels most affect a classifier’s prediction. This takes
the form of an attribution map, a heat map of the same size as the input image,
in which each element of the attribution map indicates the degree to which its
associated pixel contributed to the model output. In contrast, our model takes a
different approach by generating a similar image on the other side of the model’s
decision boundary.
Adversarial Examples [Szegedy et al., 2013, Goodfellow et al., 2014b] are
created by performing minute perturbations to image pixels to produce decision-
boundary crossing transformations which are visually imperceptible to human ob-
servers. Such approaches are extremely useful for exploring ways in which a classifier
might be attacked. They do not, however, provide any high-level intuition for why
a model is making a particular decision.
Image-to-Image Transformation approaches, such as those used in do-
main adaptation [Bousmalis et al., 2017, Liu and Tuzel, 2016, Ganin et al., 2016]
have shown increased success in transforming an image in one domain to appear
as if drawn from another domain, such as synthetic-to-real or winter-to-summer.
These approaches are clearly the most similar to our own in that we seek to trans-
form images predicted as one class to appear to a pre-trained classifier as those
from another. These approaches do not, however, constrain the types of transformations
allowed, and we demonstrate in Section 4.5.3 that significant constraints must
be applied (Section 4.4) to ensure that the transformations produced are easily interpretable.
Other image-to-image techniques such as Style Transfer [Zhu et al.,
2017, Gatys et al., 2015, Gatys et al., 2016] typically produce very low-level and
comprehensive transformations to every pixel. In contrast, our own approach seeks
highly localized and high-level, semantic changes.
4.3 Model
The goal of our model is to take a pre-trained binary classifier and a query
image and generate both a new, transformed image and a binary mask. The trans-
formed image should be similar to the query image, excepting a visually perceptible
difference, such that the pre-trained classifier assigns different labels to the query
and transformed image. The binary mask indicates which pixels from the query
image were changed in order to produce the transformed image. In this way, our
model is able to produce a decision-boundary crossing transformation of the query
image and illustrate both where, via the binary mask, and how, via the transformed
image, the transformation occurs.
Figure 4.1: Model architecture of ExplainGAN. Inference (in blue frame) consists
of passing an image x of class j into the appropriate encoder Ej to produce a hidden
vector zj. The hidden vector is decoded to simultaneously create its reconstruction
Gj(zj), a transformed image of the opposite class G1−j(zj) and a mask showing
where the changes were made Gm(zj). Composite images C0 and C1 merge the
reconstruction and transformation with the original image x.
More formally, given a binary classifier C(x) ∈ {0, 1} operating on an image
x, we seek to learn a function which predicts a transformed image t and a mask m
such that:
$$C(x) \neq C(t) \tag{4.1}$$
$$x \odot m \neq t \odot m \tag{4.2}$$
$$x \odot \neg m = t \odot \neg m \tag{4.3}$$
where (4.1) indicates that the model believes x and t to be of different classes,
(4.2) indicates that the query and transformed image differ in pixels whose mask
values are 1 and (4.3) indicates that the query and transformed image match in
pixels where mask values are 0.
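Conditions (4.1) to (4.3) can be checked directly for a candidate triple (x, t, m). The sketch below is illustrative only: `satisfies_constraints` and `toy_classifier` are hypothetical names, the toy classifier stands in for the pre-trained classifier C, and (4.2) is read as requiring the two images to differ somewhere inside the mask.

```python
import numpy as np

def satisfies_constraints(classifier, x, t, m):
    """Check conditions (4.1)-(4.3) for query x, transformed image t and binary mask m."""
    m = m.astype(bool)
    crosses_boundary = classifier(x) != classifier(t)     # (4.1)
    differs_in_mask = not np.array_equal(x[m], t[m])      # (4.2)
    matches_outside = np.array_equal(x[~m], t[~m])        # (4.3)
    return bool(crosses_boundary and differs_in_mask and matches_outside)

def toy_classifier(img):
    """Hypothetical stand-in for C: class 1 if the image is bright on average."""
    return int(img.mean() > 0.5)

x = np.zeros((4, 4))
m = np.ones((4, 4))
m[0, :] = 0                               # the top row must stay untouched
t = np.where(m == 1, 1.0, x)              # brighten only the masked pixels
valid = satisfies_constraints(toy_classifier, x, t, m)

t_leaky = t.copy()
t_leaky[0, 0] = 1.0                       # also changes an unmasked pixel
leaky = satisfies_constraints(toy_classifier, x, t_leaky, m)
```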
4.3.1 Prerequisites
Given a dataset of images S = {xi|i ∈ 1 . . . N}, our pre-trained classifier
produces a set of predictions {ȳi|i ∈ 1 . . . N}. Given these predictions, we now can
split the dataset into two groups S0 = {xi|ȳi = 0} and S1 = {xi|ȳi = 1}.
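Note that the two groups are formed from the classifier's predictions rather than from ground-truth labels. A minimal sketch of this partition, with a hypothetical `toy_classifier` standing in for the pre-trained model:

```python
import numpy as np

def split_by_prediction(images, classifier):
    """Return (S0, S1): the images the classifier labels 0 and 1, respectively."""
    preds = np.array([classifier(img) for img in images])
    return images[preds == 0], images[preds == 1]

def toy_classifier(img):
    """Hypothetical stand-in for the pre-trained classifier."""
    return int(img.mean() > 0.5)

images = np.stack([np.full((2, 2), v) for v in (0.1, 0.9, 0.2, 0.8, 0.7)])
S0, S1 = split_by_prediction(images, toy_classifier)
```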
4.3.2 Inference
Given a query image and a predicted label for that image, our model maps to
a reconstructed version of that image, an image of the opposite class and a mask
that indicates which pixels it changed. Formally, our model is composed of several
components. First, our model uses two class-specific encoders to produce hidden
codes:
$$z_j = E_j(x), \quad j \in \{0, 1\},\ x \in S_j \tag{4.4}$$
Next, a decoder G maps the hidden representation zj to a reconstructed image
Gj(zj), a transformed image of the opposite class G1−j(zj) and a mask indicating
which pixels changed Gm(zj). In this manner, images of either class can be trans-
formed into similar looking images of the opposite class with a visually interpretable
change.
We also define the concept of a composite image Cj(x) of class j:
$$C_j(x_{1-j}) = x_{1-j} \odot (1 - G_m(z_{1-j})) + G_j(z_{1-j}) \odot G_m(z_{1-j}) \tag{4.5}$$
where z1−j is the code produced by encoding x1−j. The composite image
uses the mask to blend the original image x with either the reconstruction or the
transformed image.
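The composite is an element-wise blend of the original and generated images. A small sketch follows, with the function name `composite` and the toy arrays chosen by us for illustration:

```python
import numpy as np

def composite(x, generated, mask):
    """Keep x where the mask is 0 and use the generated image where it is 1."""
    return x * (1.0 - mask) + generated * mask

x = np.zeros((2, 2))
generated = np.ones((2, 2))
mask = np.array([[1.0, 0.0],
                 [0.0, 0.5]])
blended = composite(x, generated, mask)
unchanged = composite(x, generated, np.zeros((2, 2)))
```

With a soft mask value such as 0.5 the two images are averaged at that pixel, which is one reason a near-binary mask is desirable for interpretation.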
4.3.3 Training
To train the model, several auxiliary components of the network are required.
First, two discriminators Dj(x) → {real, fake}, j ∈ {0, 1} are trained to discriminate
between real and fake images of class j.
To train the model we optimize the following objective:
$$\min_{G,E_0,E_1}\ \max_{D_0,D_1}\ L_{GAN} + L_{classifier} + L_{recon} + L_{prior} \tag{4.6}$$
where LGAN is a typical GAN loss, Lclassifier is a loss that encourages the
generated and composite images to be likely according to the classifier, Lrecon ensures
that the reconstructions are accurate, and Lprior encodes our prior for the types of
transformations we want to encourage. LGAN is a combination of the GAN losses
for each class:
LGAN = LGAN:0 + LGAN:1 (4.7)
LGAN:j for class j discriminates between images x originally classified as class
j and reconstructions of x, transformations from x and composites from x. It is
defined as:
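$$\begin{aligned}
L_{GAN:j} ={} & \mathbb{E}_{x \sim S_j}\left[\log D_j(x)\right] \\
& + \mathbb{E}_{x \sim S_j}\left[\log\left(1 - D_j(G_j(E_j(x)))\right)\right] \\
& + \mathbb{E}_{x \sim S_{1-j}}\left[\log\left(1 - D_j(G_j(E_{1-j}(x)))\right)\right] \\
& + \mathbb{E}_{x \sim S_{1-j}}\left[\log\left(1 - D_j(C_j(E_{1-j}(x)))\right)\right]
\end{aligned} \tag{4.8}$$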
Note that in this formulation the reconstructions of x are also penalized. This is
part of ensuring that the auto-encoded images are accurate [Larsen et al., 2015]
and is included here, rather than as part of Lrecon, out of convenience.
Next, we encourage the composite images to produce images that the classifier
correctly predicts:
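$$L_{classifier} = \mathbb{E}_{x \in S_0}\left[-\log C(C_1(x))\right] \tag{4.9}$$
$$\qquad\qquad + \mathbb{E}_{x \in S_1}\left[-\log\left(1 - C(C_0(x))\right)\right] \tag{4.10}$$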
Finally, we have an auto-encoding loss for the reconstruction:
$$L_{recon} = \sum_{j \in \{0,1\}} \mathbb{E}_{x \in S_j}\, \|G_j(E_j(x)) - x\|_2 \tag{4.11}$$
The mask priors are discussed in the following section.
4.4 Priors for Interpretable Image Transformations
There are many image transformations that will transform an image of one
class to appear like an image from another class. Not all of these transformations,
however, are equally useful for interpreting a model’s behavior at a conceptual level.
Adversarial example transformations will change the label but are not perceptible.
Style transfer transformations make low-level but not semantic changes. Domain
adaptation approaches may change every pixel in the image, which makes it difficult
to determine which of these changes actually influenced the classifier. We want to
craft a set of priors that encourage transformations that are local to a particular part
of the image and visually perceptible. To this end, we define our prior loss term as:
Lprior = Lconst + Lcount + Lsmoothness + Lentropy (4.12)
The consistency loss Lconst ensures that if a pixel is not masked, then the
transformed image has not altered it:
$$L_{const} = \sum_{j \in \{0,1\}} \mathbb{E}_{x \in S_j}\left[\left\|(1 - G_m(z_j)) \odot x - (1 - G_m(z_j)) \odot G_{1-j}(z_j)\right\|_2\right] \tag{4.13}$$
where zj = Ej(x). The count loss Lcount allows us to encode prior information
regarding a coarse estimate of the number of pixels we anticipate changing. We
approximate the l0 norm via an l1 norm:
$$L_{count} = \sum_{j \in \{0,1\}} \mathbb{E}_{x \in S_j}\left[\max\left(\frac{1}{n}\,|G_m(z_j)|,\ \kappa\right)\right] \tag{4.14}$$
where κ is a constant that corresponds to the ratio of the number of changed pixels
to the total number of pixels. The smoothness loss encourages masks that are
localized by penalizing transitions via a total variation [Rudin et al., 1992] penalty:
$$L_{smoothness} = \sum_{j \in \{0,1\}} \mathbb{E}_{x \in S_j}\, |\nabla G_m(z_j)| \tag{4.15}$$
Finally, we want to encourage the mask to be as binary as possible:
$$L_{entropy} = \sum_{j \in \{0,1\}} \mathbb{E}_{x \in S_j}\left[\min\left(G_m(z_j),\ 1 - G_m(z_j)\right)\right] \tag{4.16}$$
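The three mask priors can be evaluated on a single 2-D mask as sketched below. Several details are assumptions on our part rather than statements from the text: n is taken to be the number of pixels, the total variation is discretized as summed absolute differences between neighbouring pixels, and the entropy term is reduced by a mean. The function names are ours as well.

```python
import numpy as np

def count_loss(mask, kappa):
    """max(mean absolute mask value, kappa); n is taken to be the pixel count."""
    return max(float(np.abs(mask).mean()), kappa)

def smoothness_loss(mask):
    """Anisotropic total variation: summed absolute differences of neighbouring pixels."""
    dy = np.abs(np.diff(mask, axis=0)).sum()
    dx = np.abs(np.diff(mask, axis=1)).sum()
    return float(dx + dy)

def entropy_loss(mask):
    """Mean of min(m, 1 - m); zero for a mask whose entries are all 0 or 1."""
    return float(np.minimum(mask, 1.0 - mask).mean())

mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0                  # one 2x2 blob covering 4 of 16 pixels
soft = np.full((4, 4), 0.5)           # maximally non-binary mask
```

On the toy blob, the count term equals the covered fraction 0.25 unless κ is larger, the smoothness term counts the eight boundary transitions, and the entropy term vanishes because the mask is binary.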
4.5 Experiments
Our goal is to provide model explainability via visualization of samples on
either side of a model’s decision boundary. This is an entirely new way of performing
model explanation and requires a unique approach to evaluation.
To this end, we first demonstrate qualitative results of our approach and compare
to related approaches (Section 4.5.3). Next, we evaluate our model using traditional
criteria by demonstrating that our model’s inferred masks are highly competitive
as saliency maps when compared to state-of-the-art attribution approaches
(Section 4.5.4). Finally, we introduce two new metrics for evaluating the explainability
of decision-boundary crossing examples (Section 4.5.5) and evaluate how our model
performs using these quantitative methods.
Figure 4.2: An example of Ultrasound images from our Medical Ultrasound dataset.
(a) A canonical Apical 2 Chamber view. (b) A canonical Apical 4 Chamber view.
(c) A difficult Apical 2 Chamber view that is easily confused for a 4 Chamber view.
(d) A difficult Apical 4 Chamber view that is easily confused for a 2 Chamber view.
4.5.1 Datasets
We used four datasets as part of our evaluation: MNIST [LeCun et al., 1998],
Fashion-MNIST [Xiao et al., 2017a], CelebA [Liu et al., 2015] and a new Medical
Ultrasound dataset that will be released with the publication of this work. For each
dataset, four splits were used: A classifier-training set used to train the black-box
classifier, a training set used to train ExplainGAN, a validation set used to tune
hyperparameters and a test set.
MNIST, Fashion-MNIST: We use the standard train/test splits in the
following manner: The 60k training set is first split into 3 components: a 2k
classifier-training set, a 50k training set and an 8k validation set. We used the
standard test set. For MNIST, we used binary class pairs (3, 8), (4, 9) and (5, 6).
For Fashion-MNIST, we used binary class pairs (coat, shirt), (pullover, shirt) and
(coat, pullover).
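The 2k / 50k / 8k partition of the 60k training set can be expressed as disjoint index sets. The sketch below assumes a random assignment with a fixed seed, which the text does not specify; the helper name `split_indices` and the split names are ours.

```python
import numpy as np

def split_indices(n_total, sizes, seed=0):
    """Partition range(n_total) into disjoint index arrays of the given sizes."""
    if sum(sizes.values()) != n_total:
        raise ValueError("split sizes must add up to n_total")
    perm = np.random.default_rng(seed).permutation(n_total)  # random assignment is an assumption
    splits, start = {}, 0
    for name, size in sizes.items():
        splits[name] = perm[start:start + size]
        start += size
    return splits

splits = split_indices(60_000, {"classifier_train": 2_000, "train": 50_000, "val": 8_000})
```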
CelebA: We use the standard train/validation/test splits in the following
manner: 2k images were used from the original validation set as the classifier-training
set, all 160k images were used to train ExplainGAN, the remaining 14k validation
images were used for validation. We used the standard test set. We used binary
class pairs (glasses, no glasses) and (mustache, no mustache).
Medical Ultrasound: Our new medical ultrasound dataset is a collection
of 72k cardiac images taken from 5 different views of the heart. Each image was
labeled by several cardiac sonographers to determine the correct labels. Examples
of images from the dataset can be found in Figure 4.2. As the figure illustrates, the dataset
is very challenging and is not as amenable to certain senses of ’objectness’ found in
most standard vision datasets. Of the 72k images, 2k were used as the classifier-
training set, 60k were used for training ExplainGAN, 4k were used for validation
and 6k were used for testing. We used the binary class pair (Apical 2-Chamber,
Apical 4-Chamber).
4.5.2 Implementation
The model architecture implementation for E, G and D is quite similar to the
DCGAN architecture [Radford et al., 2015]. We share the last few layers of E0 and
E1 and the last few layers of D0 and D1. Each loss term in our objective is scaled
by a coefficient whose values were obtained via cross-validation. In practice, the
coefficients were quite stable across datasets (we use the same set), other than the κ
hyperparameter which controls the effect of the count loss and the scaling coefficient
for Lsmoothness, the smoothness loss.
4.5.3 Explanation by Qualitative Evaluation
We evaluated our model qualitatively on a number of datasets. We show
results on both the Medical Ultrasound dataset and CelebA dataset in Figure 4.3. The use
of CelebA and a medical image dataset provides a useful contrast between images
whose relationships should be quite familiar to the average reader (glasses vs no-
glasses) and relationships that are likely to be foreign to the average reader (apical
2 chamber views versus apical 4 chamber views).
In each block, the “input” column represents images x ∈ S0, the “transformed”
column represents ExplainGAN’s transformation, G1(z0), to the opposite class. The
“mask” column illustrates the model’s changes, Gm(z0), and the “composite” column
shows the composite images, C1(z0).
The CelebA (top) results in Figure 4.3 illustrate that the model’s transformations
for both “glasses vs no-glasses” and “mustache vs no-mustache” perform highly
localized changes and the corresponding mask effectively produces a segmentation
of the only visual feature being altered. Furthermore, the model is able to make
quite minimal but perceptible changes. For example, in the first row of the “glasses
vs no-glasses” task, the mask has preserved the hair over the eyeglasses.
The Ultrasound (bottom) results in Figure 4.3 illustrate that the model has both
learned to model the anatomy of the heart and is able to transform from one view
of the heart to the other with minimal changes. The transformations and masks
clearly illustrate that the model is cuing predominantly on the presence of the right
ventricle, but interestingly not the right atrium, and the shape of the pericardium.
[Figure 4.3 column labels, repeated for each panel: input, transformed, mask, composite]
Figure 4.3: Qualitative visualization of the ExplainGAN model on two datasets:
CelebA and our Medical Ultrasound dataset. The “input” column represents im-
ages x ∈ S0, the “transformed” column represents ExplainGAN’s transformation,
G1(z0), to the opposite class. The “mask” column illustrates the model’s changes,
Gm(z0), and the “composite” column shows the composite images, C1(z0). The re-
sults indicate that in the case of object-related transformations, such as glasses or
mustaches, ExplainGAN effectively performs a weakly supervised segmentation of
the object. In the ultrasound case, ExplainGAN illustrates which anatomical areas
the model is cuing on: the right ventricle and pericardium.
4.5.4 Explanation via Pixel-Wise Attribution
Many post-hoc explanation methods that use attribution or saliency rely on
visual, qualitative comparisons of attribution maps. Recently, [Samek et al., 2017]
introduced a quantitative approach for comparing attribution maps in which pixels
are progressively perturbed in the order of predicted saliency. Performance is judged
by evaluating which methods require fewer perturbations to affect the classifier’s
prediction.
Our model is not designed for attribution or saliency: it produces a binary,
rather than continuous, mask, and that mask is tied to a particular transformed
image. However, our masks can be loosely interpreted as attribution maps
in which the relative priority of the pixels inside the mask is not known.
While the work of [Samek et al., 2017] perturbed individual pixels, we wanted
to avoid a comparison in which individual pixel changes, which are neither them-
selves interpretable, nor plausible as images, might alter the classification results.
Consequently, we adapt the approach of [Samek et al., 2017] by perturbing the image
by segments, rather than pixels. To choose the order of perturbation, we normalize
the maps to the range [0, 1], threshold them with t ∈ {0.5, 0.7, 0.9}, and segment the
resulting binary maps. We then rank the segments based on the average map value
within each segment.¹ For perturbation, we replace each pixel in each segment with
uniform random noise in the range of the pixel values.
More concretely, we denote the image with k segments perturbed by \( x_{SP}^{(k)} \). We
compute the area over the segment perturbation curve (AOSPC) as follows:
\[ \mathrm{AOSPC} = \frac{1}{K+1} \left\langle \sum_{k=0}^{K} f\big(x_{SP}^{(0)}\big) - f\big(x_{SP}^{(k)}\big) \right\rangle_{p_x}, \tag{4.17} \]
where K is the number of steps, \( \langle \cdot \rangle_{p_x} \) denotes the average over all the images, and
f : R^d → R is the classification function.
¹ For ExplainGAN we take the average of the sigmoid outputs over all pixels in a segment.
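As a rough illustration of (4.17), the sketch below computes AOSPC from classifier scores that have already been collected. It assumes a hypothetical nested list `scores`, where `scores[i][k]` holds the classifier output for image i after k segments have been perturbed; the segmentation and perturbation steps themselves are not shown, and the toy numbers are made up for illustration.

```python
from statistics import mean


def aospc(scores):
    """Area over the segment perturbation curve, following (4.17).

    scores[i][k] is the classifier output f(x_SP^(k)) for image i after
    k segments have been perturbed, for k = 0, ..., K (hypothetical layout).
    """
    per_image = []
    for curve in scores:
        k_plus_1 = len(curve)  # K + 1 evaluation points, including k = 0
        # Sum of score drops relative to the unperturbed image.
        drop = sum(curve[0] - f_k for f_k in curve)
        per_image.append(drop / k_plus_1)
    # <.>_px in (4.17): average over all images.
    return mean(per_image)


# Made-up classifier scores for two images and K = 2 perturbation steps.
toy_scores = [
    [0.9, 0.5, 0.1],  # score falls quickly: influential segments come first
    [0.8, 0.8, 0.6],  # score falls slowly: less informative ordering
]
print(aospc(toy_scores))
```

An ordering that places influential segments first makes the score fall earlier and therefore yields a larger AOSPC, which is why higher values are read as better in this evaluation.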
We report AOSPC after 10 steps for the explanation methods of Section 5.2
in Table 4.1. We choose the methods to cover the three main groups of methods (i.e.,
perturbation-based, gradient-based, and activation-based). A larger AOSPC means
that the classifier is more sensitive to the segments that are perturbed in those
10 steps. To avoid cases where the segmentation assigns all or more than half of
the pixels to one segment, we only use thresholds of at least 0.5. Our results demonstrate
that, despite not being explicitly optimized for finding the most informative pixels,
ExplainGAN performs on par with other explanation methods for classifiers. For a
qualitative comparison of these methods, see Figure 4.4.
Table 4.1: AOSPC values (higher is better, see (4.17)) after 10 steps for different segmentation
thresholds. Although ExplainGAN is not directly optimized for this metric, its performance is
comparable to reasonable baselines for explanation in classifiers. A larger AOSPC means that the
sensitivity of the segments that are perturbed in 10 steps is higher.

Dataset                                MNIST                  Ultrasound
Threshold                              0.5    0.7    0.9      0.5    0.7    0.9
Grad [Shrikumar et al., 2016]          1474   1563   240      712    291    81
Grad-CAM [Selvaraju et al., 2016b]     17.2   8      −        −      70     432
Saliency [Simonyan et al., 2013]       817    718    126      30     63     298
Occlusion [Zeiler and Fergus, 2014]    2099   1946   1486     1215   539    142
LRP [Bach et al., 2015]                1736   1478   244      700    511    71
ExplainGAN                             2622   2083   1474     1167   542    374
[Figure 4.4 column labels: Original, LRP, GradCam, Integrated Gradients, Occlusion, ExplainGAN (mask), ExplainGAN (transformed)]
Figure 4.4: Comparison of different methods for explaining the model’s decision. Fashion-MNIST:
transforming from pullover to shirt, Ultrasound: transforming from A2C to A4C (see
4.2 for examples of A2C and A4C views), CelebA: transforming from faces without eyeglasses to
faces with eyeglasses, MNIST: transforming from 4 to 9.
Figure 4.5: Boundary-crossing images have varying explanatory power: images carry more
explanatory power if (1) they can be used as substitutes in the original dataset without affecting
the classifier and (2) they are different from a query image in small and easily localized ways. (a)
displays an image classified as a jacket and not a pullover. (b) shows an image of a pullover which
is substitutable and whose localized mask illustrates the model’s belief that removing a zipper and
the jacket ribs would make the original image into a pullover. (c) shows another pullover whose
non-localized mask does not explain why this is a pullover and not a jacket. (d) shows an adversarial
image which is completely unsubstitutable and provides no localized explanation.
Table 4.2: Quantitative substitutability experiments across datasets. Class 0 and Class 1 are
the classes that the given classifier is trained to identify. The Transformed/Composite 0/1 columns
show the accuracy of the classifiers when only transformations/compositions of the images are used
at training time. Ceiling represents the accuracy of the base classifier on the same test set.

Dataset     Class 0          Class 1         Transformed 0  Transformed 1  Composite 0  Composite 1  Ceiling
Ultrasound  A2C              A4C             95.5           94.2           91.4         95.6         99.6
CelebA      W/O Eyeglasses   W/ Eyeglasses   93.6           96.2           96.05        96.2         96.5
CelebA      W/O Mustache     W/ Mustache     76.65          75.2           74.05        71.4         83.9
CelebA      W/O Black hair   W/ Black hair   75.65          74.8           79.05        77.4         84.3
FMNIST      Coat             Pullover        75.8           73.7           84.8         69.1         94.1
FMNIST      Coat             Shirt           79.7           78.5           71.8         77.2         91.7
MNIST       Three            Eight           99.6           99.1           99.3         98.9         99.9
MNIST       Four             Nine            98.6           99.0           98.6         98.5         99.0
MNIST       Three            Five            98.5           99.3           98.2         98.2         99.2
4.5.5 Quantitative Assessment of Explainability
Given two similar images on either side of a model’s decision boundary, how
can we determine quantitatively whether they provide a conceptual explanation of
why a model discriminates between them? There are several high-level criteria that
must be met in order for people to find such explanatory images useful.
Localized but not minimal: In order for the boundary-crossing image to
clearly demonstrate which pixels caused a label-changing event, it must deviate from
the original image in a way that is localized to a clear sub-component of the image,
as opposed to every pixel changing or only one or two pixels changing.
Substitutable: If we are explaining a model by comparing an original image
from class A, and a boundary-crossing image is produced to appear like it came from
class B, then we define substitutability to be the property that we can substitute our
boundary-crossing image for one of the original images labeled as class B without
affecting our classifier’s performance.
To this end, we propose two metrics aimed at quantifying such an explanation’s
utility. First, the degree to which changes to a query image are localized can be
represented by the number of non-zero elements of the mask. Note that while other
measures of locality can be used (cohesiveness, connected components), we make no
such assumption, as we found empirically that such specific measures often do not
correlate well with conveying the set of items changing.
Second, we define the substitutability metric as follows. Consider an original
training set D_train = {(x_i, y_i) | i = 1..N}, a test set D_test, and a classifier F(x) → y whose
empirical performance on the test set is some score S. Given a new set of model-
generated boundary-crossing images D′_trans = {(x_i, y′_i) | i = 1..N}, we say that this set
is R%-substitutable if our classifier can be retrained using D′_trans to achieve
performance that is R% of S. For example, if our original dataset and classifier yield 90%
performance, and we substitute a generated dataset for our original dataset and a
re-trained classifier yields 45%, we would say the new dataset is 50% substitutable.
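The arithmetic behind the substitutability score can be sketched in a few lines. The function name and argument names below are illustrative only; the retraining of the classifier on the generated set is assumed to have happened elsewhere, and only the final ratio is shown.

```python
def substitutability(retrained_score, original_score):
    """Return R, the percentage of the original test score S that is retained
    when the classifier is retrained on the generated (substitute) dataset."""
    if original_score <= 0:
        raise ValueError("original_score must be positive")
    return 100.0 * retrained_score / original_score


# Worked example from the text: 90% originally, 45% after substitution.
print(substitutability(45.0, 90.0))
```

Applied to the first row of Table 4.2 (95.5 for Transformed 0 against a ceiling of 99.6), the same ratio comes out at roughly 95.9%, which is one way to read "nearly perfectly substitutable".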
Table 4.2 illustrates the substitutability performance of our model on various
datasets. These results illustrate that our model produces images that are nearly
perfectly substitutable on MNIST, the Ultrasound dataset, and CelebA for the
Eyeglasses attribute. That being said, despite compelling qualitative results (Fig-
ure 4.4), there is still much room for improvement in terms of substitutability for
the other CelebA attributes.

Table 4.3: Substitutability on the Ultrasound dataset. Transformed/Composite 0/1 shows the accu-
racy of a classifier on the test set when the original samples are replaced with Transformed/Composite
0/1 in the training phase. Both Transformed/Composite shows the accuracy of the classifier when
all of the images are replaced with Transformed/Composite. Note that PixelDA is a one-way
transformer.

                    Transformed 0  Transformed 1  Both Transformed  Composite 0  Composite 1  Both Composite
PixelDA             87.6           N/A            N/A               N/A          N/A          N/A
CycleGAN            94             64             84.1              N/A          N/A          N/A
ExplainGAN-norec    94.5           83.9           96.1              N/A          N/A          N/A
ExplainGAN-nomask   93.9           97.3           95.1              N/A          N/A          N/A
ExplainGAN-full     95.5           94.2           97.3              91.4         95.6         91.4
Ceiling             99.7           99.7           99.7              99.7         99.7         99.7
Chapter 5: Defense-GAN: Protecting Classifiers Against Adversarial
Attacks Using Generative Models
5.1 Introduction
Despite their outstanding performance on several machine learning tasks, deep
neural networks have been shown to be susceptible to adversarial attacks [Szegedy
et al., 2013, Goodfellow et al., 2014b]. These attacks come in the form of adversarial
examples: carefully crafted perturbations added to a legitimate input sample. In
the context of classification, these perturbations cause the legitimate sample to
be misclassified at inference time [Szegedy et al., 2013, Goodfellow et al., 2014b,
Papernot et al., 2016b, Liu et al., 2016]. Such perturbations are often small in
magnitude and do not affect human recognition but can drastically change the
output of the classifier.
Recent literature has considered two types of threat models: black-box and
white-box attacks. Under the black-box attack model, the attacker does not have
access to the classification model parameters; whereas in the white-box attack model,
the attacker has complete access to the model architecture and parameters, including
potential defense mechanisms [Papernot et al., 2017, Tramèr et al., 2017, Carlini and
Wagner, 2017].
Various defenses have been proposed to mitigate the effect of adversarial at-
tacks. These defenses can be grouped under three different approaches: (1) modify-
ing the training data to make the classifier more robust against attacks, e.g., adver-
sarial training which augments the training data of the classifier with adversarial
examples [Szegedy et al., 2013, Goodfellow et al., 2014b], (2) modifying the training
procedure of the classifier to reduce the magnitude of gradients, e.g., defensive
distillation [Papernot et al., 2016e], and (3) attempting to remove the adversarial noise
from the input samples [Hendrycks and Gimpel, 2017, Meng and Chen, 2017]. All of
these approaches have limitations in the sense that they are effective against either
white-box attacks or black-box attacks, but not both [Tramèr et al., 2017, Meng and
Chen, 2017]. Furthermore, some of these defenses are devised with specific attack
models in mind and are not effective against new attacks.
In this chapter, we propose a novel defense mechanism which is effective against
both white-box and black-box attacks. We propose to leverage the representative
power of GANs [Goodfellow et al., 2014a] to diminish the effect of the adversarial
perturbation, by “projecting” input images onto the range of the GAN’s generator
prior to feeding them to the classifier. In the GAN framework, two models are
trained simultaneously in an adversarial setting: a generative model that emulates
the data distribution, and a discriminative model that predicts whether a certain
input came from real data or was artificially created. The generative model learns
a mapping G from a low-dimensional vector z ∈ Rk to the high-dimensional input
sample space Rn. During training of the GAN, G is encouraged to generate samples
which resemble the training data. It is, therefore, expected that legitimate samples
will be close to some point in the range of G, whereas adversarial samples will be
further away from the range of G. Furthermore, “projecting” the adversarial exam-
ples onto the range of the generator G can have the desirable effect of reducing the
adversarial perturbation. The projected output, computed using Gradient Descent
(GD), is fed into the classifier instead of the original (potentially adversarially mod-
ified) image. We empirically demonstrate that this is an effective defense against
both black-box and white-box attacks on two benchmark image datasets.
The rest of the chapter is organized as follows. We introduce the necessary
background regarding known attack models, defense mechanisms, and GANs in Sec-
tion 5.2. Our defense mechanism, which we call Defense-GAN, is formally motivated
and introduced in Section 5.3. Finally, experimental results, under different threat
models, as well as comparisons to other defenses are presented in Section 5.4.
5.2 Related work and background information
In this work, we use GANs for the purpose of defending against adversarial
attacks in classification problems. Before detailing our approach in the next section,
we explain related work in three parts. First, we discuss different attack models
employed in the literature. We, then, go over related defense mechanisms against
these attacks and discuss their strengths and shortcomings. Lastly, we explain
necessary background information regarding GANs.
5.2.1 Attack models and algorithms
Various attack models and algorithms have been used to target classifiers. All
attack models we consider aim to find a perturbation δ to be added to a (legitimate)
input x ∈ Rn, resulting in the adversarial example x̃ = x + δ. The ℓ∞-norm of the
perturbation is denoted by ε [Goodfellow et al., 2014b] and is chosen to be small
enough so as to remain undetectable. We consider two threat levels: black- and
white-box attacks.
White-box attack models
White-box models assume that the attacker has complete knowledge of all the
classifier parameters, i.e., network architecture and weights, as well as the details
of any defense mechanism. Given an input image x and its associated ground-truth
label y, the attacker thus has access to the loss function J(x, y) used to train the
network, and uses it to compute the adversarial perturbation δ. Attacks can be
targeted, in that they attempt to cause the perturbed image to be misclassified to a
specific target class, or untargeted when no target class is specified.
In this work, we focus on untargeted white-box attacks computed using the
Fast Gradient Sign Method (FGSM) [Goodfellow et al., 2014b], the Randomized
Fast Gradient Sign Method (RAND+FGSM) [Tramèr et al., 2017], and the Carlini-
Wagner (CW) attack [Carlini and Wagner, 2017]. Although other attack models exist,
such as the Iterative FGSM [Kurakin et al., 2016], the Jacobian-based Saliency
Map Attack (JSMA) [Papernot et al., 2016b], and Deepfool [Moosavi-Dezfooli et al.,
2016], we focus on these three models as they cover a good breadth of attack
algorithms. FGSM is a very simple and fast attack algorithm, which makes it extremely
amenable to real-time attack deployment. RAND+FGSM, an equally simple attack,
increases the power of FGSM for white-box attacks [Tramèr et al., 2017]. Finally,
the CW attack is one of the most powerful white-box attacks to date [Carlini and
Wagner, 2017].
Fast Gradient Sign Method (FGSM): Given an image x and its corresponding true
label y, the FGSM attack sets the perturbation δ to:
\[ \delta = \epsilon \cdot \mathrm{sign}(\nabla_x J(x, y)). \tag{5.1} \]
FGSM [Goodfellow et al., 2014b] was designed to be extremely fast rather than
optimal. It simply uses the sign of the gradient at every pixel to determine the
direction in which to change the corresponding pixel value.
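A minimal sketch of the FGSM update in (5.1) follows. To keep it self-contained, it assumes a toy logistic-regression classifier with hypothetical weights `w` and bias `b`, for which the input gradient of the cross-entropy loss has a closed form; a real attack would obtain the gradient from the target network through automatic differentiation, and would typically also clip the result to the valid pixel range.

```python
import math


def sign(v):
    """Sign function returning -1, 0, or 1."""
    return (v > 0) - (v < 0)


def loss_gradient(x, y, w, b):
    """Gradient of the cross-entropy loss J(x, y) with respect to the input x
    for a logistic-regression classifier p = sigmoid(w . x + b)."""
    logit = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-logit))
    return [(p - y) * wi for wi in w]


def fgsm(x, y, w, b, eps):
    """Adversarial example x + eps * sign(grad_x J(x, y)), as in (5.1)."""
    grad = loss_gradient(x, y, w, b)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]


w, b = [2.0, -1.0, 0.5], 0.0   # hypothetical trained weights
x, y = [0.6, 0.2, 0.4], 1      # legitimate input with true label 1
x_adv = fgsm(x, y, w, b, eps=0.3)
print(x_adv)
```

Every coordinate moves by exactly ε (unless its gradient is zero), so the ℓ∞-norm of the perturbation equals ε, matching the definition in Section 5.2.1.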
Randomized Fast Gradient Sign Method (RAND+FGSM): The RAND+FGSM
[Tramèr et al., 2017] attack is a simple yet effective method to increase the power of
FGSM against models which were adversarially trained. The idea is to first apply a
small random perturbation before using FGSM. More explicitly, for α < ε, random
noise is first added to the legitimate image x:
\[ x' = x + \alpha \cdot \mathrm{sign}(\mathcal{N}(\mathbf{0}^n, \mathbf{I}^n)). \tag{5.2} \]
Then, the FGSM attack is computed on x′, resulting in
\[ \tilde{x} = x' + (\epsilon - \alpha) \cdot \mathrm{sign}(\nabla_{x'} J(x', y)). \tag{5.3} \]
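The two-step structure of (5.2) and (5.3) can be sketched as follows, again under the assumption of a toy logistic-regression classifier with hypothetical weights so that the gradient is available in closed form. The random step uses the sign of standard Gaussian noise, and the FGSM step spends the remaining budget ε − α.

```python
import math
import random


def sign(v):
    """Sign function returning -1, 0, or 1."""
    return (v > 0) - (v < 0)


def loss_gradient(x, y, w, b):
    """Input gradient of the cross-entropy loss for p = sigmoid(w . x + b)."""
    logit = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-logit))
    return [(p - y) * wi for wi in w]


def rand_fgsm(x, y, w, b, eps, alpha, rng):
    """RAND+FGSM: random sign step of size alpha, then FGSM with eps - alpha."""
    if not 0.0 <= alpha < eps:
        raise ValueError("alpha must satisfy 0 <= alpha < eps")
    # (5.2): x' = x + alpha * sign(N(0, I))
    x_prime = [xi + alpha * sign(rng.gauss(0.0, 1.0)) for xi in x]
    # (5.3): x~ = x' + (eps - alpha) * sign(grad_{x'} J(x', y))
    grad = loss_gradient(x_prime, y, w, b)
    return [xp + (eps - alpha) * sign(g) for xp, g in zip(x_prime, grad)]


w, b = [2.0, -1.0, 0.5], 0.0   # hypothetical trained weights
x, y = [0.6, 0.2, 0.4], 1      # legitimate input with true label 1
x_adv = rand_fgsm(x, y, w, b, eps=0.3, alpha=0.05, rng=random.Random(0))
print(x_adv)
```

Because the two steps have sizes α and ε − α, the total perturbation never exceeds ε in the ℓ∞-norm, so the attack stays within the same budget as plain FGSM.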
The Carlini-Wagner (CW) attack: The CW attack is an effective optimization-
based attack model. In many cases, it can reduce the classifier accuracy to almost 0%
[Carlini and Wagner, 2017]. The perturbation δ is found by solving an optimization
problem of the form:
\[ \min_{\delta \in \mathbb{R}^n} \ \|\delta\|_p + c \cdot f(x + \delta) \quad \text{subject to} \quad x + \delta \in [0, 1]^n, \tag{5.4} \]
where f is an objective function that drives the example x to be misclassified,
and c > 0 is a suitably chosen constant. The ℓ2, ℓ0, and ℓ∞ norms are considered.
We refer the reader to [Carlini and Wagner, 2017] for details regarding the approach
to solving (5.4) and setting the constant c.
Black-box attack models
For black-box attacks we consider untargeted FGSM attacks computed on a
substitute model [Papernot et al., 2017]. As previously mentioned, black-box adver-
saries have no access to the classifier or defense parameters. It is further assumed
that they do not have access to a large training dataset but can query the targeted
DNN as a black-box, i.e., access labels produced by the classifier for specific query
images. The adversary trains a model, called substitute, which has a (potentially)
different architecture than the targeted classifier, using a very small dataset aug-
mented by synthetic images labeled by querying the classifier. Adversarial examples
are then found by applying any attack method on the substitute network. It was
found that such examples designed to fool the substitute often end up being mis-
classified by the targeted classifier [Szegedy et al., 2013, Papernot et al., 2017]. In
other words, black-box attacks are easily transferable from one model to the other.
5.2.2 Defense mechanisms
Various defense mechanisms have been employed to combat the threat from
adversarial attacks. In what follows, we describe one representative defense strategy
from each of the three general groups of defenses.
Adversarial training
A popular approach to defend against adversarial noise is to augment the
training dataset with adversarial examples [Szegedy et al., 2013, Goodfellow et al.,
2014b, Moosavi-Dezfooli et al., 2016]. Adversarial examples are generated using one
or more chosen attack models and added to the training set. This often results
in increased robustness when the attack model used to generate the augmented
training set is the same as that used by the attacker. However, adversarial training
does not perform as well when a different attack strategy is used by the attacker.
Additionally, it tends to make the model more robust to white-box attacks than to
black-box attacks due to gradient masking [Papernot et al., 2016c, Papernot et al.,
2017, Tramèr et al., 2017].
Defensive distillation
Defensive distillation [Papernot et al., 2016e] trains the classifier in two rounds
using a variant of the distillation [Hinton et al., 2015] method. This has the desir-
able effect of learning a smoother network and reducing the amplitude of gradients
around input points, making it difficult for attackers to generate adversarial exam-
ples [Papernot et al., 2016e]. It was, however, shown that, while defensive distillation
is effective against white-box attacks, it fails to adequately protect against black-box
attacks transferred from other networks [Carlini and Wagner, 2017].
MagNet
Recently, [Meng and Chen, 2017] introduced MagNet as an effective defense
strategy. It trains a reformer network (which is an auto-encoder or a collection of
auto-encoders) to move adversarial examples closer to the manifold of legitimate, or
natural, examples. When using a collection of auto-encoders, one reformer network
is chosen at random at test time, thus strengthening the defense. It was shown to be
an effective defense against gray-box attacks where the attacker knows everything
about the network and defense, except the parameters. MagNet is the closest defense
to our approach, as it attempts to reform an adversarial sample using a learnt auto-
encoder. The main differences between MagNet and our approach are: (1) we use
GANs instead of auto-encoders, and, most importantly, (2) we use GD minimization
to find latent codes as opposed to a feedforward encoder network. This makes
Defense-GAN more robust, especially against white-box attacks.
5.2.3 Generative Adversarial Networks
Generative Adversarial Networks, originally introduced by [Goodfellow et al.,
2014a], consist of two neural networks, G and D. G : Rk → Rn maps a low-
dimensional latent space to the high dimensional sample space of x. D is a binary
neural network classifier. In the training phase, G and D are typically learned in
an adversarial fashion using actual input data samples x and random vectors z. An
isotropic Gaussian prior is usually assumed on z. While G learns to generate outputs
G(z) that have a distribution similar to that of x, D learns to discriminate between
“real” samples x and “fake” samples G(z). D and G are trained in an alternating
fashion to minimize the following min-max loss [Goodfellow et al., 2014a]:
\[ \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]. \tag{5.5} \]
It was shown that the optimal GAN is obtained when the resulting generator
distribution pg = pdata [Goodfellow et al., 2014a].
However, GANs turned out to be difficult to train in practice [Gulrajani et al.,
2017], and alternative formulations have been proposed. [Arjovsky et al., 2017] in-
troduced Wasserstein GANs (WGANs) which are a variant of GANs that use the
Wasserstein distance, resulting in a loss function with more desirable properties:
\[ \min_G \max_D V_W(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[D(x)] - \mathbb{E}_{z \sim p_z(z)}[D(G(z))]. \tag{5.6} \]
In this work, we use WGANs as our generative model due to the stability of their
training methods, especially using the approach in [Gulrajani et al., 2017].
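To make the difference between (5.5) and (5.6) concrete, the sketch below estimates both objectives from batches of discriminator outputs. The function names and the example values are illustrative assumptions; in practice the outputs would come from a trained network, with D producing probabilities in (0, 1) for the GAN loss and unbounded real scores for the WGAN loss.

```python
import math
from statistics import mean


def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) in (5.5).

    d_real: discriminator probabilities D(x) on real samples, each in (0, 1).
    d_fake: discriminator probabilities D(G(z)) on generated samples.
    """
    real_term = mean([math.log(d) for d in d_real])
    fake_term = mean([math.log(1.0 - d) for d in d_fake])
    return real_term + fake_term


def wgan_value(c_real, c_fake):
    """Monte Carlo estimate of V_W(D, G) in (5.6) from real-valued scores."""
    return mean(c_real) - mean(c_fake)


# A discriminator that cannot tell real from generated outputs 0.5 everywhere,
# giving V = -log 4, the value reached when p_g = p_data.
print(gan_value([0.5, 0.5], [0.5, 0.5]))
# Illustrative WGAN scores: the gap between the two means is the estimate.
print(wgan_value([1.0, 3.0], [0.5, 1.5]))
```

The WGAN estimate is a plain difference of means with no logarithm, which is one reason its gradients are better behaved when the discriminator becomes confident.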
5.3 Proposed Defense-GAN
We propose a new defense strategy which uses a WGAN trained on legitimate
(un-perturbed) training samples to “denoise” adversarial examples. At test time,
prior to feeding an image x to the classifier, we project it onto the range of the
generator by minimizing the reconstruction error ‖G(z) − x‖₂², using L steps of GD.
The resulting reconstruction G(z) is then given to the classifier. Since the generator
was trained to model the unperturbed training data distribution, we expect this
added step to result in a substantial reduction of any potential adversarial noise.
We formally motivate this approach in the following section.
5.3.1 Motivation
As mentioned in Section 5.2.3, the GAN min-max loss in (5.5) admits a global
optimum when pg = pdata [Goodfellow et al., 2014a]. It can be similarly shown
that WGAN admits an optimum to its own min-max loss in (5.6), when the set
{x | pg(x) ≠ pdata(x)} has zero Lebesgue-measure. Formally,
Lemma 1 A generator distribution pg is a global optimum for the WGAN min-
max game defined in (5.6), if and only if pg(x) = pdata(x) for all x ∈ Rn, potentially
except on a set of zero Lebesgue-measure.
A sketch of the proof can be found in Section 5.5.
Additionally, it was shown that, if G and D have enough capacity to represent
the data, and if the training algorithm is such that pg converges to pdata, then
\[ \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[ \min_z \| G_t(z) - x \| \right] \longrightarrow 0, \tag{5.7} \]
where G_t is the generator of a GAN or WGAN¹ after t steps of its training algorithm
[Kabkab et al., 2018].
This serves to show that, under ideal conditions, the addition of the GAN re-
construction loss minimization step should not affect the performance of the classifier
on natural, legitimate samples, as such samples should be almost exactly recovered.
Furthermore, we hypothesize that this step will help reduce the adversarial noise
which follows a different distribution than that of the GAN training examples.
5.3.2 Defense-GAN algorithm
Defense-GAN is a defense strategy to combat both white-box and black-box
adversarial attacks against classification networks. At inference time, given a trained
GAN generator G and an image x to be classified, z∗ is first found so as to minimize
the reconstruction error:
\[ z^* = \arg\min_z \|G(z) - x\|_2^2. \tag{5.8} \]
G(z∗) is then given as the input to the classifier. The algorithm is illustrated in
Figure 5.1. As (5.8) is a highly non-convex minimization problem, we approximate
it by doing a fixed number L of GD steps using R different random initializations
of z (which we call random restarts), as shown in Figures 5.1 and 5.2.
¹ For simplicity, we will use GAN and WGAN interchangeably in the rest of this manuscript,
with the understanding that our implementation follows the WGAN loss.
Figure 5.1: Overview of the Defense-GAN algorithm.
Figure 5.2: L steps of Gradient Descent are used to estimate the projection of the
image onto the range of the generator.
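The projection step, with L gradient-descent steps and R random restarts, can be sketched as below. This is only an illustration under strong assumptions: the "generator" is a hypothetical linear map G(z) = Wz, chosen so that the gradient of the reconstruction error can be written by hand, and the step size `lr` is an arbitrary choice. With a real GAN generator the objective is non-convex and the gradient would come from automatic differentiation.

```python
import random


def generate(W, z):
    """Toy linear generator G(z) = W z, a stand-in for a trained GAN generator."""
    return [sum(wij * zj for wij, zj in zip(row, z)) for row in W]


def recon_error(W, z, x):
    """Squared reconstruction error ||G(z) - x||_2^2."""
    return sum((g - xi) ** 2 for g, xi in zip(generate(W, z), x))


def project(W, x, L=200, R=10, lr=0.05, seed=0):
    """Approximate the minimizer of ||G(z) - x||_2^2 with L GD steps, R restarts."""
    rng = random.Random(seed)
    k = len(W[0])
    best_z, best_err = None, float("inf")
    for _ in range(R):
        z = [rng.gauss(0.0, 1.0) for _ in range(k)]  # random initialization
        for _ in range(L):
            residual = [g - xi for g, xi in zip(generate(W, z), x)]
            # For the linear toy model, the gradient is 2 W^T (W z - x).
            grad = [2.0 * sum(W[i][j] * residual[i] for i in range(len(W)))
                    for j in range(k)]
            z = [zj - lr * gj for zj, gj in zip(z, grad)]
        err = recon_error(W, z, x)
        if err < best_err:  # keep the restart with the lowest error
            best_z, best_err = z, err
    return best_z, generate(W, best_z)


W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical generator, R^2 -> R^3
x_clean = generate(W, [0.5, -0.25])       # a sample in the range of G
x_adv = [x_clean[0] + 0.1, x_clean[1] - 0.1, x_clean[2] + 0.1]
z_star, x_rec = project(W, x_adv)
print(z_star, x_rec)
```

In this toy setting the reconstruction `x_rec` discards the part of the perturbation that lies outside the range of the generator, so it ends up closer to `x_clean` than `x_adv` was. The classifier would then receive `x_rec` instead of `x_adv`.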
The GAN is trained on the available classifier training dataset in an unsuper-
vised manner. The classifier can be trained on the original training images, their
reconstructions using the generator G, or a combination of the two. As was discussed
in Section 5.3.1, as long as the GAN is appropriately trained and has enough ca-
pacity to represent the data, original clean images and their reconstructions should
not differ much. Therefore, these two classifier training strategies should, at least
theoretically, not differ in performance.
Compared to existing defense mechanisms, our approach is different in the
following aspects:
1. Defense-GAN can be used in conjunction with any classifier and does not
modify the classifier structure itself. It can be seen as an add-on or pre-
processing step prior to classification.
2. If the GAN is representative enough, re-training the classifier should not be
necessary and any drop in performance due to the addition of Defense-GAN
should not be significant.
3. Defense-GAN can be used as a defense to any attack: it does not assume an at-
tack model, but simply leverages the generative power of GANs to reconstruct
adversarial examples.
4. Defense-GAN is highly non-linear and white-box gradient-based attacks will
be difficult to perform due to the GD loop. A detailed discussion about this
can be found in Section 5.6.
5.4 Experiments
We assume three different attack threat levels:
1. Black-box attacks: the attacker does not have access to the details of the
classifier and defense strategy. It therefore trains a substitute network to find
adversarial examples.
2. White-box attacks: the attacker knows all the details of the classifier and de-
fense strategy. It can compute gradients on the classifier and defense networks
in order to find adversarial examples.
3. White-box attacks, revisited: in addition to the details of the architectures
and parameters of the classifier and defense, the attacker has access to the
random seed and random number generator. In the case of Defense-GAN, this
means that the attacker knows all the random initializations \( \{z_0^{(i)}\}_{i=1}^{R} \).
We compare our method to adversarial training [Goodfellow et al., 2014b] and
MagNet [Meng and Chen, 2017] under the FGSM, RAND+FGSM, and CW (with
ℓ2 norm) white-box attacks, as well as the FGSM black-box attack. Details of all
network architectures used in this chapter can be found in Section 5.7. When the
classifier is trained using the reconstructed images (G(z∗)), we refer to our method
as Defense-GAN-Rec, and we use Defense-GAN-Orig when the original images (x)
are used to train the classifier. Our GAN follows the WGAN training procedure
in [Gulrajani et al., 2017], and details of the generator and discriminator network
architectures are given in Table 5.6. The reformer network (encoder) for the MagNet
baseline is provided in Table 5.7. Our implementation is based on TensorFlow [Abadi
et al., 2015b] and builds on open-source software: CleverHans by [Papernot et al.,
2016a] and improved WGAN training by [Gulrajani et al., 2017]. We use machines
equipped with NVIDIA GeForce GTX TITAN X GPUs.
In our experiments, we use two different image datasets: the MNIST handwrit-
ten digits dataset [LeCun et al., 1998] and the Fashion-MNIST (F-MNIST) clothing
articles dataset [Xiao et al., 2017b]. Both datasets consist of 60,000 training images
and 10,000 testing images. We split the training images into a training set of 50,000
images and hold out a validation set containing 10,000 images. For white-box
attacks, the testing set is kept the same (10,000 samples). For black-box attacks, the
testing set is divided into a small hold-out set of 150 samples reserved for adversary
substitute training, as was done in [Papernot et al., 2017], and the remaining 9,850
samples are used for testing the different methods.
5.4.1 Results on black-box attacks
In this section, we present experimental results on FGSM black-box attacks.
As previously mentioned, the attacker trains a substitute model, which could differ
in architecture from the targeted model, using a limited dataset consisting of 150 le-
gitimate images augmented with synthetic images labeled using the target classifier.
The classifier and substitute model architectures used and referred to throughout
this section are described in Table 5.5.
In Tables 5.1 and 5.2, we present our classification accuracy results and com-
pare to other defense methods. As can be seen, FGSM black-box attacks were
successful at reducing the classifier accuracy by up to 70%. All considered defense
mechanisms are relatively successful at diminishing the effect of the attacks. We
note that, as expected, the performance of Defense-GAN-Rec and that of Defense-
GAN-Orig are very close. In addition, they both perform consistently well across
different classifier and substitute model combinations. MagNet also performs in a
consistent manner, but achieves lower accuracy than Defense-GAN. Two adversar-
ial training defenses are presented: the first one obtains the adversarial examples
assuming the same attack ε = 0.3, and the second assumes a different ε = 0.15.
With incorrect knowledge of ε, the performance of adversarial training generally
decreases. In addition, the classification performance of this defense method has
very large variance across the different architectures. It is worth noting that the
adversarial training defense is tailored only to FGSM attacks, because the adversarially
augmented data, even with a different ε, is generated using the same method as
the black-box attack (FGSM). In contrast, Defense-GAN and MagNet are general
defense mechanisms which do not assume a specific attack model.
The performances of defenses on the F-MNIST dataset, shown in Table 5.2,
are noticeably lower than on MNIST. This is due to the large ε = 0.3 in the FGSM
attack. Please see Appendix 5.8 for qualitative examples showing that ε = 0.3
represents very high noise, which makes F-MNIST images difficult to classify, even
by a human.
In addition, the Defense-GAN parameters used in this experiment were kept
the same for both Tables, in order to study the effect of dataset complexity, and can
be further optimized as investigated in the next section.
Effect of number of GD iterations L and random restarts R
Figure 5.3 shows the effect of varying the number of GD iterations L as well
as the random restarts R used to compute the GAN reconstructions of input im-
ages. Across different L and R values, Defense-GAN-Rec and Defense-GAN-Orig
have comparable performance. Increasing L has the expected effect of improving
114
Table 5.1: Classification accuracies of different classifier and substitute model
combinations using various defense strategies on the MNIST dataset, under FGSM
black-box attacks with ε = 0.3. Defense-GAN has L = 200 and R = 10.

Classifier/   No       No        Defense-   Defense-   MagNet   Adv. Tr.   Adv. Tr.
Substitute    Attack   Defense   GAN-Rec    GAN-Orig            ε = 0.3    ε = 0.15
A/B           0.9970   0.6343    0.9312     0.9282     0.6937   0.9654     0.6223
A/E           0.9970   0.5432    0.9139     0.9221     0.6710   0.9668     0.9327
B/B           0.9618   0.2816    0.9057     0.9105     0.5687   0.2092     0.3441
B/E           0.9618   0.2128    0.8841     0.8892     0.4627   0.1120     0.3354
C/B           0.9959   0.6648    0.9357     0.9322     0.7571   0.9834     0.9208
C/E           0.9959   0.8050    0.9223     0.9182     0.6760   0.9843     0.9755
D/B           0.9920   0.4641    0.9272     0.9323     0.6817   0.7667     0.8514
D/E           0.9920   0.3931    0.9164     0.9155     0.6073   0.7676     0.7129
Table 5.2: Classification accuracies of different classifier and substitute model
combinations using various defense strategies on the F-MNIST dataset, under FGSM
black-box attacks with ε = 0.3. Defense-GAN has L = 200 and R = 10.

Classifier/   No       No        Defense-   Defense-   MagNet   Adv. Tr.   Adv. Tr.
Substitute    Attack   Defense   GAN-Rec    GAN-Orig            ε = 0.3    ε = 0.15
A/B           0.9346   0.5131    0.586      0.5803     0.5404   0.7393     0.6600
A/E           0.9346   0.3653    0.4790     0.4616     0.3311   0.6945     0.5638
B/B           0.7470   0.4017    0.4940     0.5530     0.3812   0.3177     0.3560
B/E           0.7470   0.3123    0.3720     0.4187     0.3119   0.2617     0.2453
C/B           0.9334   0.2635    0.5289     0.6079     0.4664   0.7791     0.6838
C/E           0.9334   0.2066    0.4871     0.4625     0.3016   0.7504     0.6655
D/B           0.8923   0.4541    0.5779     0.5853     0.5478   0.6172     0.6395
D/E           0.8923   0.2543    0.4007     0.4730     0.3396   0.5093     0.4962
performance when no attack is present. Interestingly, with an FGSM attack, the
classification performance decreases after a certain L value. With too many GD
iterations on the mean squared error (MSE) $\|G(z) - (x + \delta)\|_2^2$, some of the
adversarial noise components are retained. In the right Figure, the effect of varying R
is shown to be extremely pronounced. This is due to the non-convex nature of the
MSE, and increasing R enables us to sample different local minima.
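The roles of L and R in the reconstruction step can be illustrated with a minimal NumPy sketch. It is not the implementation used in the experiments: the generator G(z) = tanh(Wz) is a hypothetical stand-in for a trained GAN generator, and the names make_toy_generator, reconstruct, and the fixed learning rate lr are assumptions made only for illustration.

```python
import numpy as np

def make_toy_generator(latent_dim, image_dim, seed=0):
    """Hypothetical stand-in for a trained generator: G(z) = tanh(W z)."""
    W = np.random.default_rng(seed).normal(size=(image_dim, latent_dim))

    def G(z):
        return np.tanh(W @ z)

    def grad_loss(z, x):
        # Gradient of ||G(z) - x||_2^2 with respect to z (chain rule through tanh).
        g = np.tanh(W @ z)
        return 2.0 * W.T @ ((1.0 - g ** 2) * (g - x))

    return G, grad_loss

def reconstruct(x, G, grad_loss, latent_dim, L=200, R=10, lr=0.01, seed=0):
    """Run R random restarts of L gradient-descent steps and keep the best z."""
    rng = np.random.default_rng(seed)
    best_z, best_err = None, np.inf
    for _ in range(R):
        z = rng.normal(size=latent_dim)       # random initialization z_0
        for _ in range(L):
            z = z - lr * grad_loss(z, x)      # one descent step on the MSE
        err = float(np.sum((G(z) - x) ** 2))
        if err < best_err:                    # keep the lowest-error restart
            best_z, best_err = z, err
    return G(best_z), best_err
```

Because the objective is non-convex in z, each restart may settle in a different local minimum; taking the minimum over restarts can only lower the final reconstruction error for a given seed, which is one way to read the strong dependence on R reported above.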
[Figure 5.3 plots. Left panel: "MNIST Classification accuracy of Model F using Defense-GAN varying L"; x-axis: number of GD iterations L (0 to 1600); y-axis: accuracy (0.5 to 1.0). Right panel: "MNIST Classification accuracy of Model F using Defense-GAN varying R"; x-axis: number of random restarts R (0 to 20); y-axis: accuracy (0.5 to 1.0). Curves in both panels: Defense-GAN-Rec and Defense-GAN-Orig, each with no attack and under FGSM.]
Figure 5.3: Classification accuracy of Model F using Defense-GAN on the MNIST
dataset, under FGSM black-box attacks with ε = 0.3 and substitute Model E. Left:
various numbers of iterations L are used (R = 10). Right: various numbers of random
restarts R are used (L = 100).
Effect of adversarial noise norm ε
We now investigate the effect of changing the attack ε in Table 5.3. As expected,
with higher ε, the FGSM attack is more successful, especially on the F-MNIST
dataset, where the noise norm seems to have a more pronounced effect with
a nearly 37% drop in performance between ε = 0.1 and 0.3. Figure 5.7 in Section
5.8 shows adversarial samples as well as their reconstructions with Defense-GAN at
different values of ε. We can see that for large ε, the class is difficult to discern, even
for the human eye.
Even though it seems that increasing ε is a desirable strategy for the attacker,
this increases the likelihood that the adversarial noise is discernible and therefore
the attack is detected. It is trivial for the attacker to provide adversarial images
at very high ε, and a good measure of an attack’s strength is its ability to affect
performance at low ε. In fact, in the next section, we discuss how Defense-GAN can
be used not only to diminish the effect of attacks, but also to detect them.
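To make the role of ε concrete, the following sketch applies FGSM in its commonly stated form, x_adv = x + ε · sign(∇_x J(x, y)), to a small logistic-regression classifier. The classifier, its weights w and b, and the helper names are hypothetical and chosen only so that the input gradient has a closed form; they are not Models A through F used in the experiments.

```python
import numpy as np

def cross_entropy(x, y, w, b):
    """Binary cross-entropy of a logistic-regression classifier on one sample."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

def fgsm(x, y, w, b, eps):
    """FGSM: move every pixel by eps in the direction that increases the loss."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    grad_x = (p - y) * w                      # gradient of the loss w.r.t. the input
    return np.clip(x + eps * np.sign(grad_x), 0.0, 1.0)   # keep pixels in [0, 1]
```

Every pixel moves by exactly ε unless clipping intervenes, so the perturbation has L∞ norm ε; this is why large values such as ε = 0.3 produce noise that is visible to a human observer.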
Table 5.3: Classification accuracy of Model F using Defense-GAN (L = 400, R =
10), under FGSM black-box attacks for various noise norms ε and substitute Model
E.

ε      Defense-GAN-Rec    Defense-GAN-Rec
       MNIST              F-MNIST
0.10   0.9864 ± 0.0011    0.8844 ± 0.0017
0.15   0.9836 ± 0.0026    0.8267 ± 0.0065
0.20   0.9772 ± 0.0019    0.7492 ± 0.0170
0.25   0.9641 ± 0.0001    0.6384 ± 0.0159
0.30   0.9307 ± 0.0034    0.5126 ± 0.0096
Attack detection
[Figure 5.4 plots: three ROC panels; x-axis: False Positive Rate; y-axis: True Positive Rate. Left panel "MNIST, R = 10, Epsilon = 0.30"; legend entries recoverable from the extraction: L = 25 (AUC = 0.973), L = 50 (AUC = 0.985), L = 400 (AUC = 0.999), L = 800 (AUC = 1.0), L = 1600 (AUC = 1.0). Middle panel "MNIST, L = 100, Epsilon = 0.30": R = 1 (AUC = 0.836), R = 2 (AUC = 0.922), R = 5 (AUC = 0.982). Right panel "MNIST, L = 400, R = 10": ε = 0.10 (AUC = 0.914), ε = 0.15 (AUC = 0.975), ε = 0.20 (AUC = 0.989), ε = 0.25 (AUC = 0.998).]
Figure 5.4: ROC curves when using Defense-GAN MSE for FGSM attack detection
on the MNIST dataset (Classifier Model F, Substitute Model E). Left: Results for
various numbers of GD iterations are shown with R = 10, ε = 0.30. Middle: Results
for various numbers of random restarts R are shown with L = 100, ε = 0.30. Right:
Results for various ε are shown with L = 400, R = 10.
We intuitively expect that clean, unperturbed images will lie closer to the
range of the Defense-GAN generator G than adversarial examples. This is due
to the fact that G was trained to produce images which resemble the legitimate
data. In light of this observation, we propose to use the MSE between an image and
its reconstruction from (??) as a “metric” to decide whether or not the image
[Figure 5.5 plots: three ROC panels; x-axis: False Positive Rate; y-axis: True Positive Rate. Left panel "F-MNIST, R = 10, Epsilon = 0.30": L = 25 (AUC = 0.935), L = 50 (AUC = 0.954), L = 100 (AUC = 0.965), L = 400 (AUC = 0.983), L = 800 (AUC = 0.987). Middle panel "F-MNIST, L = 100, Epsilon = 0.30": R = 1 (AUC = 0.794), R = 2 (AUC = 0.876), R = 5 (AUC = 0.945), R = 10 (AUC = 0.965). Right panel "F-MNIST, L = 200, R = 10": ε = 0.10 (AUC = 0.775), ε = 0.15 (AUC = 0.884), ε = 0.20 (AUC = 0.94), ε = 0.25 (AUC = 0.969), ε = 0.30 (AUC = 0.985).]
Figure 5.5: ROC curves when using Defense-GAN MSE for FGSM attack detection
on the F-MNIST dataset (Classifier Model F, Substitute Model E). Left: Results for
various numbers of GD iterations are shown with R = 10, ε = 0.30. Middle: Results
for various numbers of random restarts R are shown with L = 100, ε = 0.30. Right:
Results for various ε are shown with L = 200, R = 10.
was adversarially manipulated. In other words, for a given threshold θ > 0, the
hypothesis test is:
$$\|G(z^*) - x\|_2^2 \;\underset{\text{no attack}}{\overset{\text{attack}}{\gtrless}}\; \theta. \tag{5.8}$$
We compute the reconstruction MSEs for every image from the test dataset,
and its adversarially manipulated version using FGSM. We show the Receiver Operating
Characteristic (ROC) curves as well as the Area Under the Curve (AUC)
metric for different Defense-GAN parameters and ε values in Figures 5.4 and 5.5.
The results show that this attack detection strategy is effective, especially when the
number of GD iterations L and random restarts R are large. From the left and
middle Figures, we can conclude that the number of random restarts plays a very
important role in the detection false positive and true positive rates, as was discussed
in Section 5.4.1. Furthermore, when ε is very small, it becomes difficult to detect
attacks at low false positive rates.
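The thresholding test and the AUC summary can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function names are hypothetical, and the AUC is computed with the standard pairwise-ranking identity (the probability that an adversarial image has a larger reconstruction MSE than a clean one, with ties counted as one half) rather than with whatever tooling produced Figures 5.4 and 5.5.

```python
import numpy as np

def detect_attack(mse, theta):
    """Flag an input as adversarial when its reconstruction MSE exceeds theta."""
    return np.asarray(mse) > theta

def roc_auc(clean_mse, adv_mse):
    """Area under the ROC curve of the MSE detector, via pairwise comparisons."""
    clean = np.asarray(clean_mse, dtype=float)[:, None]   # shape (n_clean, 1)
    adv = np.asarray(adv_mse, dtype=float)[None, :]       # shape (1, n_adv)
    wins = (adv > clean).sum() + 0.5 * (adv == clean).sum()
    return float(wins / (clean.size * adv.size))
```

For example, `roc_auc([0.1, 0.3], [0.2, 0.4])` compares four clean/adversarial pairs, three of which are ranked correctly. Sweeping theta over the observed MSE values traces the ROC curve itself.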
Results on white-box attacks
We now present results on white-box attacks using three different strategies:
FGSM, RAND+FGSM, and CW. We perform the CW attack for 100 iterations of
projected GD, with learning rate 10.0, and use c = 100 in equation (5.4). Table
5.4 shows the classification performance of different classifier models across different
attack and defense strategies. We note that Defense-GAN significantly outperforms
the two other baseline defenses. We even give the adversarial attacker access to
the random initializations of z. However, we noticed that the performance does
not change much when the attacker does not know the initialization. Adversarial
training was done using FGSM to generate the adversarial samples. It is interesting
to note that when the CW attack is used, adversarial training performs extremely
poorly. As previously discussed, adversarial training does not generalize well against
different attack methods.
Due to the loop of L steps of GD, Defense-GAN is resilient to GD-based white-
box attacks, since the attacker needs to “un-roll” the GD loop and propagate the
gradient of the loss all the way across L steps. In fact, from Table 5.4, the perfor-
mance of classifier A with Defense-GAN on the MNIST dataset drops less than 1%
from 0.997 to 0.988 under FGSM. In comparison, from Figure 5.8, when L = 25, the
performance of the same network drops to 0.947 (more than 5% drop). This shows
that using a larger L significantly increases the robustness of Defense-GAN against
GD-based white-box attacks. This comes at the expense of increased inference time
complexity. We present a more detailed discussion about the difficulty of GD-based
white-box attacks in Section 5.6 and time complexity in Section 5.11. Additional
white-box experimental results on higher-dimensional images are reported in Section 5.10.
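5.5 Optimality of p_g = p_data for WGANs
Sketch of proof of Lemma 1: The WGAN min-max loss is given by:
\begin{align}
V_W(D, G) &= \mathbb{E}_{x \sim p_{\text{data}}(x)}[D(x)] - \mathbb{E}_{z \sim p_z(z)}[D(G(z))] \tag{5.9} \\
&= \int_x p_{\text{data}}(x)\, D(x)\, dx - \int_z p_z(z)\, D(G(z))\, dz \tag{5.10} \\
&= \int_x \left(p_{\text{data}}(x) - p_g(x)\right) D(x)\, dx \tag{5.11}
\end{align}
For a fixed G, the optimal discriminator D which maximizes V_W(D, G) is such that:
$$D^*_G(x) = \begin{cases} 1 & \text{if } p_{\text{data}}(x) \geq p_g(x) \\ 0 & \text{otherwise} \end{cases} \tag{5.12}$$
Plugging D^*_G back into (5.11), we get:
\begin{align}
V_W(D^*_G, G) &= \int_x \left(p_{\text{data}}(x) - p_g(x)\right) D^*_G(x)\, dx \tag{5.13} \\
&= \int_{\{x \,|\, p_{\text{data}}(x) \geq p_g(x)\}} \left(p_{\text{data}}(x) - p_g(x)\right) dx \tag{5.14}
\end{align}
Let $\mathcal{X} = \{x \,|\, p_{\text{data}}(x) \geq p_g(x)\}$. Clearly, to minimize (5.14), we need to set
$p_{\text{data}}(x) = p_g(x)$ for $x \in \mathcal{X}$. Then, since both pdfs should integrate to 1,
$$\int_{\mathcal{X}^c} p_g(x)\, dx = \int_{\mathcal{X}^c} p_{\text{data}}(x)\, dx \tag{5.15}$$
However, this is a contradiction since $p_{\text{data}}(x) < p_g(x)$ for $x \in \mathcal{X}^c$ by the definition of $\mathcal{X}$, unless $\mu(\mathcal{X}^c) = 0$
where µ is the Lebesgue measure. This concludes the proof.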
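The argument can be checked numerically on a finite sample space, where the integrals in (5.11) through (5.14) become sums. The sketch below is only an illustration under two assumptions: the distributions are discrete, and the discriminator takes values in [0, 1], as the form of (5.12) implies. The function names are hypothetical.

```python
import numpy as np

def wgan_value(p_data, p_g, D):
    """Discrete analogue of (5.11): sum over x of (p_data(x) - p_g(x)) * D(x)."""
    return float(np.sum((p_data - p_g) * D))

def optimal_discriminator(p_data, p_g):
    """Discrete analogue of (5.12): D*(x) = 1 where p_data(x) >= p_g(x), else 0."""
    return (p_data >= p_g).astype(float)

p_data = np.array([0.5, 0.3, 0.2])
p_g = np.array([0.2, 0.3, 0.5])
d_star = optimal_discriminator(p_data, p_g)

# Value at the optimal discriminator: only outcomes with p_data >= p_g contribute.
v_mismatch = wgan_value(p_data, p_g, d_star)
# When the generator matches the data distribution, the value is zero.
v_match = wgan_value(p_data, p_data, optimal_discriminator(p_data, p_data))
```

With D restricted to [0, 1], no other discriminator can exceed the value obtained with d_star, and that value vanishes only when the two distributions coincide, which mirrors the conclusion of the proof sketch.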
5.6 Difficulty of GD-based white-box attacks on Defense-GAN
In order to perform a GD-based white-box attack on models using Defense-
GAN, an attacker needs to compute the gradient of the output of the classifier
with respect to the input. From Figure 5.1, the generator and the classifier can be
seen as one, combined, feedforward network, through which it is easy to propagate
gradients. The difficulty lies in the orange box of the GD optimization detailed in
Figure 5.2.
For the sake of simplicity, let’s assume that R = 1. Define $L(x, z) = \|G(z) - x\|_2^2$.
Then $z^* = z_L$, which is computed recursively as follows:
\begin{align}
z_1 &= z_0 + \eta_0 \nabla_z L(x, z)\big|_{z = z_0} \tag{5.16} \\
z_2 &= z_1 + \eta_1 \nabla_z L(x, z)\big|_{z = z_1} \tag{5.17} \\
&= z_0 + \eta_0 \nabla_z L(x, z)\big|_{z = z_0} + \eta_1 \nabla_z L(x, z)\big|_{z = z_0 + \eta_0 \nabla_z L(x, z)|_{z = z_0}} \tag{5.18}
\end{align}
and so on. Therefore, computing the gradient of z∗ with respect to x involves a
large number (L) of recursive chain rules and high-dimensional Jacobian tensors.
This computation gets increasingly prohibitive for large L.
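The recursion can be written out one step further to show where the cost comes from. The derivation below is a sketch that follows the sign convention of (5.16) and (5.17); the symbols $g$, $H_l$, and $J_G$ are notation introduced here for illustration, and $z_0$ is assumed to be independent of $x$.

```latex
% Notation (assumed): g(x, z) = \nabla_z L(x, z), H_l = \nabla_z^2 L(x, z)|_{z = z_l},
% and J_G(z) is the Jacobian of the generator G at z.
\begin{align*}
z_{l+1} &= z_l + \eta_l \, g(x, z_l), \qquad \frac{d z_0}{d x} = 0, \\
\frac{d z_{l+1}}{d x}
  &= \frac{d z_l}{d x}
   + \eta_l \left[ \frac{\partial g}{\partial x}(x, z_l)
   + \frac{\partial g}{\partial z}(x, z_l) \, \frac{d z_l}{d x} \right]
   = \left( I + \eta_l H_l \right) \frac{d z_l}{d x}
   + \eta_l \, \frac{\partial g}{\partial x}(x, z_l), \\
\frac{d z_L}{d x}
  &= \sum_{l=0}^{L-1}
     \left( I + \eta_{L-1} H_{L-1} \right) \cdots \left( I + \eta_{l+1} H_{l+1} \right)
     \, \eta_l \, \frac{\partial g}{\partial x}(x, z_l).
\end{align*}
% For L(x, z) = \|G(z) - x\|_2^2:
%   g(x, z) = 2 J_G(z)^\top (G(z) - x), \qquad \partial g / \partial x = -2 J_G(z)^\top .
```

Each of the L terms requires second-order information about the generator through $H_l$, and the matrix products grow with L, which is consistent with the statement that the computation becomes prohibitive for large L.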
5.7 Neural network architectures
We describe the neural network architectures used throughout the chapter.
The detail of models A through F used for classifier and substitute networks can be
found in Table 5.5. In Table 5.6, the GAN architectures are described, and in Table
5.7, the encoder architecture for the MagNet baseline is given. In what follows:
• Conv(m, k × k, s) refers to a convolutional layer with m feature maps, filter
size k × k, and stride s
• ConvT(m, k × k, s) refers to the transpose (gradient) of Conv (sometimes referred
to as “deconvolution”) with m feature maps, filter size k × k, and stride s
• FC(m) refers to a fully-connected layer with m outputs
• Dropout(p) refers to a dropout layer with probability p
• ReLU refers to the Rectified Linear Unit activation
• LeakyReLU(α) is the leaky version of the Rectified Linear Unit with parameter
α
Table 5.5: Neural network architectures used for classifiers and substitute models.

A                    B, F*                 C                     D, E*
Conv(64, 5 × 5, 1)   Dropout(0.2)          Conv(128, 3 × 3, 1)   FC(200)
ReLU                 Conv(64, 8 × 8, 2)    ReLU                  ReLU
Conv(64, 5 × 5, 2)   ReLU                  Conv(64, 3 × 3, 2)    Dropout(0.5)
ReLU                 Conv(128, 6 × 6, 2)   ReLU                  FC(200)
Dropout(0.25)        ReLU                  Dropout(0.25)         ReLU
FC(128)              Conv(128, 5 × 5, 1)   FC(128)               Dropout(0.5)
ReLU                 ReLU                  ReLU                  FC(10) + Softmax
Dropout(0.5)         Dropout(0.5)          Dropout(0.5)
FC(10) + Softmax     FC(10) + Softmax      FC(10) + Softmax

[ * : F (resp. E) shares the same architecture as B (resp. D) with the dropout layers removed ]
5.8 Qualitative examples
[Figure 5.6 images. Left panel columns: Original, Adv, L = 10, L = 25, L = 50, L = 100, L = 200. Right panel columns: Original, Adv, R = 1, R = 2, R = 5, R = 10, R = 20.]
Figure 5.6: Examples from MNIST and F-MNIST. Left: Original, FGSM adversarial
(ε = 0.3), and reconstruction images for R = 1 and various L are shown. Right:
Original, FGSM adversarial (ε = 0.3), and reconstruction images for L = 25 and
various R are shown.
[Figure 5.7 images. Labels recoverable from the extraction: column label "Original"; row labels ε = 0.4, ε = 0.3, ε = 0.2, ε = 0.1.]
Figure 5.7: Examples from MNIST and F-MNIST: Original, FGSM adversarial and
reconstruction images for L = 50, R = 15 and various ε are shown.
5.9 Additional results on the effect of varying the number of GD
iterations L and random restarts R
Table 5.8: Classification accuracy of Model F using Defense-GAN with various
numbers of iterations L (R = 10), on the MNIST dataset, under FGSM black-box
attack with ε = 0.3.

L      Defense-GAN-Rec    Defense-GAN-Orig   Defense-GAN-Rec    Defense-GAN-Orig
       No attack          No attack          Adversarial        Adversarial
25     0.9273 ± 0.0215    0.9141 ± 0.0033    0.7955 ± 0.0045    0.7998 ± 0.0063
50     0.9567 ± 0.0203    0.9371 ± 0.0048    0.8516 ± 0.0078    0.8472 ± 0.0026
100    0.9728 ± 0.0164    0.9560 ± 0.0051    0.8953 ± 0.0027    0.8911 ± 0.0024
200    0.9860 ± 0.0010    0.9712 ± 0.0028    0.9210 ± 0.0023    0.9155 ± 0.0032
400    0.9869 ± 0.0082    0.9808 ± 0.0044    0.9332 ± 0.0027    0.9307 ± 0.0034
800    0.9934 ± 0.0009    0.9938 ± 0.0004    0.9319 ± 0.0038    0.9216 ± 0.0005
1600   0.9963 ± 0.0013    0.9967 ± 0.0005    0.9081 ± 0.0062    0.9008 ± 0.0095
Table 5.11: Classification accuracy of Model F using Defense-GAN with various
numbers of random restarts R (L = 100), on the F-MNIST dataset, under FGSM
black-box attack with ε = 0.3.

R      Defense-GAN-Rec    Defense-GAN-Orig   Defense-GAN-Rec    Defense-GAN-Orig
       No attack          No attack          Adversarial        Adversarial
1      0.8425 ± 0.0008    0.5597 ± 0.0015    0.3504 ± 0.0102    0.3380 ± 0.0043
2      0.8994 ± 0.0051    0.7793 ± 0.0023    0.4050 ± 0.0148    0.3508 ± 0.0167
5      0.9260 ± 0.0028    0.6726 ± 0.0006    0.4521 ± 0.0177    0.4024 ± 0.0085
10     0.9101 ± 0.0032    0.8190 ± 0.0043    0.4808 ± 0.0088    0.4221 ± 0.0255
Figure 5.8: Classification accuracy of different models using Defense-GAN on the
MNIST dataset, under FGSM white-box attack with ε = 0.3, for various numbers of
iterations L and R = 10.
5.10 Additional results on white-box attacks
We report results on white-box attacks on the CelebFaces Attributes dataset
(CelebA) [Liu et al., 2015] in Table 5.12. The CelebA dataset is a large-scale face
dataset consisting of more than 200,000 face images, split into training, validation,
and testing sets. The RGB images were center-cropped and resized to 64× 64. We
performed the task of gender classification on this dataset. The GAN architecture
is the same as that in Table 5.6, except for an additional ConvT(128, 5× 5, 1) layer
in the generator network.
5.11 Time complexity
The computational complexity of reconstructing an image using Defense-GAN
is on the order of the number of GD iterations performed to estimate z∗, multiplied
by the time to compute gradients. The number of random restarts R has less
effect on the running time, since random restarts are independent and can run in
parallel if enough resources are available. Table 5.13 shows the average running
time, in seconds, to find the reconstructions of MNIST and F-MNIST images on
one NVIDIA GeForce GTX TITAN X GPU. For most applications, these running
times are not prohibitive. We can see a tradeoff between running time and defense
robustness as well as accuracy.
Table 5.13: Average time, in seconds, to compute reconstructions of MNIST/F-MNIST
images for various values of L and R.

         L = 10          L = 25          L = 50          L = 100         L = 200
R = 1    0.043 ± 0.027   0.070 ± 0.003   0.137 ± 0.004   0.273 ± 0.006   0.543 ± 0.017
R = 2    0.042 ± 0.026   0.067 ± 0.002   0.131 ± 0.003   0.261 ± 0.006   0.510 ± 0.006
R = 5    0.043 ± 0.029   0.070 ± 0.002   0.136 ± 0.004   0.270 ± 0.004   0.535 ± 0.008
R = 10   0.051 ± 0.032   0.086 ± 0.001   0.170 ± 0.002   0.338 ± 0.008   0.675 ± 0.016
R = 20   0.060 ± 0.035   0.105 ± 0.003   0.209 ± 0.006   0.414 ± 0.012   0.825 ± 0.022
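The scaling claims can be read directly off Table 5.13. The short sketch below divides the mean times of the R = 1 and R = 20 rows by L; the helper name is hypothetical and the numbers are copied from the table, so the result is only as precise as those reported means.

```python
# Mean reconstruction times (seconds) copied from the R = 1 and R = 20 rows of Table 5.13.
L_values = [10, 25, 50, 100, 200]
mean_times = {
    1: [0.043, 0.070, 0.137, 0.273, 0.543],
    20: [0.060, 0.105, 0.209, 0.414, 0.825],
}

def seconds_per_iteration(L_values, times):
    """Average time per GD iteration for each value of L."""
    return [t / L for L, t in zip(L_values, times)]

per_iter_r1 = seconds_per_iteration(L_values, mean_times[1])
# Cost of 20 parallel restarts relative to a single restart at L = 200.
restart_overhead = mean_times[20][-1] / mean_times[1][-1]
```

From L = 25 onward the per-iteration time for R = 1 stays nearly constant, which is what a cost linear in L looks like, while going from R = 1 to R = 20 at L = 200 costs roughly 1.5 times as much rather than 20 times, in line with the remark that restarts run in parallel.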
5.12 An unsuccessful attempt to attack Defense-GAN
Among seven defense methods proposed at the 2018 International Conference on
Learning Representations (ICLR), this work is the only method that is
reported as not broken in the best paper award winner of the 2018 International
Conference on Machine Learning (ICML) [Athalye et al., 2018]. In that work, the authors first
approximate Defense-GAN with a differentiable function g(x) and use the gradients
of g(x) as an approximation to the gradients of Defense-GAN at test time. This
method is the same as the black-box attack methods mentioned in Section 5.2.1, which
are not effective for attacking Defense-GAN. Their partial success is due to the
non-optimal setting of hyperparameters that they used at test time.
In general, the attacks that try to approximate Defense-GAN fail because of
the non-convex nature of the optimization that is solved at test time. Defense-GAN
does not reach the solutions that are found from the approximation methods since
the initialization, learning rate, and the number of steps are essential parameters of
the optimization which are usually discarded by such methods.
Table 5.4: Classification accuracies of different classifier models using various defense
strategies on the MNIST (top) and F-MNIST (bottom) datasets, under FGSM,
RAND+FGSM, and CW white-box attacks. Defense-GAN has L = 200 and R = 10.

MNIST
Attack              Classifier   No       No        Defense-   MagNet   Adv. Tr.
                    Model        Attack   Defense   GAN-Rec             ε = 0.3
FGSM                A            0.997    0.217     0.988      0.191    0.651
ε = 0.3             B            0.962    0.022     0.956      0.082    0.060
                    C            0.996    0.331     0.989      0.163    0.786
                    D            0.992    0.038     0.980      0.094    0.732
RAND+FGSM           A            0.997    0.179     0.988      0.171    0.774
ε = 0.3, α = 0.05   B            0.962    0.017     0.944      0.091    0.138
                    C            0.996    0.103     0.985      0.151    0.907
                    D            0.992    0.050     0.980      0.115    0.539
CW                  A            0.997    0.141     0.989      0.038    0.077
ℓ2 norm             B            0.962    0.032     0.916      0.034    0.280
                    C            0.996    0.126     0.989      0.025    0.031
                    D            0.992    0.032     0.983      0.021    0.010

F-MNIST
Attack              Classifier   No       No        Defense-   MagNet   Adv. Tr.
                    Model        Attack   Defense   GAN-Rec             ε = 0.3
FGSM                A            0.934    0.102     0.879      0.089    0.797
ε = 0.3             B            0.747    0.102     0.629      0.168    0.136
                    C            0.933    0.139     0.896      0.110    0.804
                    D            0.892    0.082     0.875      0.099    0.698
RAND+FGSM           A            0.934    0.102     0.888      0.096    0.447
ε = 0.3, α = 0.05   B            0.747    0.131     0.661      0.161    0.119
                    C            0.933    0.105     0.893      0.112    0.699
                    D            0.892    0.091     0.862      0.104    0.626
CW                  A            0.934    0.076     0.896      0.060    0.157
ℓ2 norm             B            0.747    0.172     0.656      0.131    0.118
                    C            0.933    0.063     0.896      0.084    0.107
                    D            0.892    0.090     0.875      0.069    0.149
Table 5.6: Neural network architectures used for GANs.

Generator               Discriminator
FC(4096)                Conv(64, 5 × 5, 2)
ReLU                    LeakyReLU(0.2)
ConvT(256, 5 × 5, 1)    Conv(128, 5 × 5, 2)
ReLU                    LeakyReLU(0.2)
ConvT(128, 5 × 5, 1)    Conv(256, 5 × 5, 2)
ReLU                    LeakyReLU(0.2)
ConvT(1, 5 × 5, 1)      FC(1)
Sigmoid                 Sigmoid
Table 5.7: Neural network architecture used for the MagNet encoder.
Encoder
Conv(64, 5× 5, 2)
LeakyReLU(0.2)
Conv(128, 5× 5, 2)
LeakyReLU(0.2)
Conv(256, 5× 5, 2)
LeakyReLU(0.2)
FC(128) + tanh
Table 5.9: Classification accuracy of Model F using Defense-GAN with various
numbers of iterations L (R = 10), on the F-MNIST dataset, under FGSM black-box
attack with ε = 0.3.

L      Defense-GAN-Rec    Defense-GAN-Orig   Defense-GAN-Rec    Defense-GAN-Orig
       No attack          No attack          Adversarial        Adversarial
25     0.8037 ± 0.0050    0.7595 ± 0.0009    0.4040 ± 0.0149    0.3910 ± 0.0119
50     0.8676 ± 0.0018    0.7898 ± 0.0016    0.4412 ± 0.0023    0.3980 ± 0.0114
100    0.9101 ± 0.0032    0.8190 ± 0.0043    0.4808 ± 0.0088    0.4221 ± 0.0255
200    0.9145 ± 0.0014    0.8373 ± 0.0054    0.5119 ± 0.0038    0.4594 ± 0.0056
400    0.9490 ± 0.0013    0.8557 ± 0.0049    0.5126 ± 0.0096    0.4754 ± 0.0102
800    0.9588 ± 0.0065    0.8832 ± 0.0042    0.5520 ± 0.0098    0.4644 ± 0.0092
1600   0.9640 ± 0.0010    0.9125 ± 0.0040    0.5335 ± 0.0226    0.4952 ± 0.0155
Table 5.10: Classification accuracy of Model F using Defense-GAN with various
numbers of random restarts R (L = 100), on the MNIST dataset, under FGSM
black-box attack with ε = 0.3.

R      Defense-GAN-Rec    Defense-GAN-Orig   Defense-GAN-Rec    Defense-GAN-Orig
       No attack          No attack          Adversarial        Adversarial
1      0.7035 ± 0.0035    0.6436 ± 0.0017    0.5329 ± 0.0094    0.5011 ± 0.0085
2      0.8619 ± 0.0010    0.8080 ± 0.0029    0.6722 ± 0.0041    0.6605 ± 0.0050
5      0.9523 ± 0.0006    0.9213 ± 0.0024    0.8199 ± 0.0097    0.8228 ± 0.0038
10     0.9810 ± 0.0015    0.9560 ± 0.0051    0.8956 ± 0.0032    0.8911 ± 0.0024
20     0.9966 ± 0.0009    0.9753 ± 0.0010    0.9456 ± 0.0031    0.9310 ± 0.0023
Table 5.12: Classification accuracies of different classifier models using various
defense strategies on the CelebA gender classification task, under FGSM,
RAND+FGSM, and CW white-box attacks. Defense-GAN has L = 200 and R = 2.

Attack              Classifier   No       No        Defense-   MagNet   Adv. Tr.
                    Model        Attack   Defense   GAN-Rec             ε = 0.3
FGSM                A            0.9652   0.0870    0.9255     0.0985   0.1225
ε = 0.3             B            0.9468   0.0995    0.9140     0.0920   0.2345
                    C            0.9459   0.0460    0.9255     0.1085   0.1130
                    D            0.9476   0.0605    0.9205     0.0975   0.7755
RAND+FGSM           A            0.9652   0.0560    0.9280     0.1105   0.0700
ε = 0.3, α = 0.05   B            0.9468   0.1785    0.9030     0.1015   0.4515
                    C            0.9459   0.0470    0.9200     0.1045   0.1055
                    D            0.9476   0.0665    0.9165     0.1105   0.696
CW                  A            0.9652   0.0460    0.8210     0.0985   0.5690
ℓ2 norm             B            0.9468   0.0575    0.7465     0.0955   0.0725
                    C            0.9459   0.0435    0.7985     0.0985   0.2635
                    D            0.9476   0.0660    0.7740     0.1040   0.5010
Chapter 6: Summary and Future Directions
6.1 Summaries
In Chapter 2, we presented a continuous face-based authentication method
using facial attributes for mobile devices. We trained binary attribute classifiers
and showed their effectiveness as feature vectors for active authentication with ex-
tensive experiments. We showed that attribute-based scores alone could improve
the verification results. Furthermore, we showed that in situations where the low-
level features such as LBP are reliable, verification results could be further improved
by fusing the resulting scores with the attribute-based scores. We also evaluated
the different realizations of our method on an actual cell phone and showed that
the authentication algorithm could be implemented with low memory usage and power
consumption at more than four frames per second. We also proposed a feasible
multi-task DCNN architecture to extract accurate describable facial attributes on
mobile devices. Each network predicted multiple facial attributes from a given face
component by mapping it to a shared embedding space. We showed that our attribute
prediction performance is comparable to the state of the art. We explored the
embedding space and illustrated that we could extract new attributes by looking
at subspace clusters of this space. We also showed that our networks perform
attribute-based authentication better than the previously proposed method. Finally,
we analyzed the feasibility of our method by performing battery usage and prediction
speed experiments on an actual mobile device.
In Chapter 3, we presented a new attribute manipulation model that offers
unprecedented flexibility. Unlike previous models that represent attributes as dis-
crete classes, our multi-dimensional, continuous attribute representation allows for
a much richer set of attribute manipulations. These include Diverse Swaps, in which
an attribute can be changed but realized via a diverse set of choices, and Borrows,
in which an attribute realization can be taken from a reference image and applied
to a query image. Qualitative evaluation of our approach illustrates its efficacy for
use in a variety of applications.
In Chapter 4, we introduced ExplainGAN to interpret black-box classifiers by
visualizing boundary-crossing transformations. These transformations are designed
to be interpretable by humans and provide a high-level, conceptual intuition under-
lying a classifier’s decisions. This style of visualization can overcome limitations of
attribution and example-by-nearest-neighbor methods by making spatially localized
changes along with visual examples. While not explicitly trained to act as a saliency
map, ExplainGAN’s maps are very competitive at demonstrating saliency. We also
introduced a new metric, Substitutability, that evaluates how much label-capturing
information is retained when performing boundary-crossing image transformations.
In Chapter 5, we proposed Defense-GAN, a novel defense strategy utilizing
GANs to enhance the robustness of classification models against black-box and
white-box adversarial attacks. Our method does not assume a particular attack
model and was shown to be effective against most commonly considered attack
strategies. We empirically showed that Defense-GAN consistently provides adequate
defense on two benchmark computer vision datasets, whereas other methods had
many shortcomings on at least one type of attack.
It is worth mentioning that, although Defense-GAN was shown to be a possi-
ble defense mechanism against adversarial attacks, one might come across practical
difficulties while implementing and deploying this method. The success of Defense-
GAN relies on the expressiveness and generative power of the GAN. However, train-
ing GANs is still a challenging task and an active area of research, and if the GAN
is not adequately trained and tuned, the performance of Defense-GAN will suffer on
both original and adversarial examples. Moreover, the choice of hyper-parameters
L and R is also critical to the effectiveness of the defense, and it may be challenging
to tune them without knowing the attack model.
6.2 Future Directions
As seen in Chapter 2, attributes are potent features for designing real-world
ML systems efficiently and interactively. We have shown that raw attribute scores
are useful for active authentication; however, a better way of coming up with the
score is possible. Although we have presented a sampling scheme to see rare labels
more frequently, it is worthwhile to see how active sampling methods can be used to
sample data more efficiently. For example, if the network is performing “well” for
an attribute, it might be better to sample data points in a way that is “beneficial”
for training the other attributes.
Besides training the attribute classifier, given that a single CNN can be trained
for multiple tasks, an extension of our work may include loss functions for face
verification. More specifically, we can have a network that is trained with metric
learning loss or identity classification loss to get both attribute predictions and
identity-based similarity scores. This approach, however, requires a dataset labeled
with attributes and identities. In the case of not having such a dataset but having
multiple datasets with identity labels and multiple datasets with attribute labels, it
will be interesting to see how one can train different parts of the network with the
label at hand and not train for losses that we do not have labels for.
Another aspect of using attributes for active authentication is the phenomenon
of domain shift. It arises because the attributes are learned from datasets that are
usually gathered from the web, but the models are tested on mobile phone images.
These two domains are different in many ways. First of all, the images that are
acquired by the front camera of a cell phone are usually of low quality compared to
the ones that are available online. The phone cameras have lens distortion which
matters because we tend to hold the phone close to our face. The illumination is
different since usually, the cell phone images are acquired with indoor lighting or
low light, but the web images are usually outdoors, or if captured indoor, they are
taken in well-lit conditions. All of these domain differences can be dealt with
using data-driven domain adaptation methods.
In Chapter 3 we looked at the problem of conditional image generation. The
model that was used in this chapter had an encoder-decoder structure. As a future
research direction, it will be interesting to see how we can extend the idea to just
a single generator. Conditional GANs already exist, but they work with
binary attribute labels. One possible way of exploring this is to use an alternating
optimization method, with one step reconstructing a given image and constraining
the latent codes, and the other step performing the minimax optimization.
In Chapter 4 we filled the semantic gap with attribute detectors. The pro-
posed method works for binary classifiers; thus the next step is to extend the work
for multiple classes. The primary issue in explaining a multi-class classifier is the
quadratic number of pairwise relationships between classes. More specifically, in an
n-class problem, given an input of one class, we want to get n − 1 transformations
and explanations for the same input. One way would be to embed the solution into
the network architecture. For example, one can put n transformation heads and
mask heads. Then for a given input of class A, all other heads will explain the
transfer from one class to the other. Several issues arise in this case. First of all, the
complexity of the network goes up linearly as n increases. This results in inefficient
training and inference speed. The second issue is that each head will be trained
with a subset of the data for that specific class, which means that we need a large
and balanced dataset regarding the labels.
Besides dataset size and label imbalance, there is a conceptual issue with considering
all of the pairwise relationships. It might not be obvious a priori that all
the pairwise relationships are meaningful. Therefore, approaches that consider all
of them together might fail at training time for this reason. One possible way to
overcome this issue is to selectively train different parts of the network for the meaningful
relationships. However, this approach will still have the same computational
complexity. Another way would be to add target labels as inputs to the network
and keep the same network architecture. This way the network complexity will stay
almost the same, and we can efficiently train the network for meaningful class pairs.
In Chapter 5 we looked at how to use GANs to protect attribute detectors from the
semantic gap induced by adversarial perturbations. The primary challenge
with the current approach is the inference speed. This issue can be addressed by
approximating the latent codes with a feedforward CNN and continuing the gradient
descent steps from there.
Another future research direction for Chapter 5 is attacking a classifier that
is protected by the proposed method. As mentioned in Section 5.6, gradient-based
attacks, which are the dominant type of attacks, cannot work for the proposed
framework. One possible way is to attack each step of the unrolled gradient descent
steps. Such approaches have been successful for attacking RNNs [Papernot et al.,
2016d] and can possibly be extended to Defense-GAN.
Bibliography
[Abadi et al., 2015a] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z.,
Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Good-
fellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L.,
Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah,
C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P.,
Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg,
M., Wicke, M., Yu, Y., and Zheng, X. (2015a). TensorFlow: Large-scale machine
learning on heterogeneous systems. Software available from tensorflow.org.
[Abadi et al., 2015b] Abadi, M. et al. (2015b). TensorFlow: Large-scale machine
learning on heterogeneous systems. Software available from tensorflow.org.
[Ahonen et al., 2006] Ahonen, T., Hadid, A., and Pietikainen, M. (2006). Face
description with local binary patterns: Application to face recognition. Pattern
Analysis and Machine Intelligence, IEEE Transactions on, 28(12):2037–2041.
[Altinok and Turk, 2003] Altinok, A. and Turk, M. (2003). Temporal integration for
continuous multimodal biometrics. In Proceedings of the Workshop on Multimodal
User Authentication. Citeseer.
[Antipov et al., 2017] Antipov, G., Baccouche, M., and Dugelay, J.-L. (2017).
Face aging with conditional generative adversarial networks. arXiv preprint
arXiv:1702.01983.
[Arjovsky et al., 2017] Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasser-
stein gan. arXiv preprint arXiv:1701.07875.
[Asthana et al., 2013] Asthana, A., Zafeiriou, S., Cheng, S., and Pantic, M. (2013).
Robust discriminative response map fitting with constrained local models. In
Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on,
pages 3444–3451. IEEE.
[Athalye et al., 2018] Athalye, A., Carlini, N., and Wagner, D. (2018). Obfuscated
gradients give a false sense of security: Circumventing defenses to adversarial
examples. arXiv preprint arXiv:1802.00420.
[Bach et al., 2015] Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-
R., and Samek, W. (2015). On pixel-wise explanations for non-linear classifier
decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140.
[Berg and Belhumeur, 2013] Berg, T. and Belhumeur, P. N. (2013). Poof: Part-
based one-vs.-one features for fine-grained categorization, face verification, and
attribute estimation. In Computer Vision and Pattern Recognition (CVPR), 2013
IEEE Conference on, pages 955–962. IEEE.
[Bourdev et al., 2011] Bourdev, L., Maji, S., and Malik, J. (2011). Describing peo-
ple: A poselet-based approach to attribute classification. In Computer Vision
(ICCV), 2011 IEEE International Conference on, pages 1543–1550. IEEE.
[Bousmalis et al., 2017] Bousmalis, K., Silberman, N., Dohan, D., Erhan, D., and
Krishnan, D. (2017). Unsupervised pixel-level domain adaptation with generative
adversarial networks. In The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), volume 1, page 7.
[Bradski, 2000] Bradski, G. (2000). Opencv toolbox. Dr. Dobb’s Journal of Software
Tools.
[Brock et al., 2016] Brock, A., Lim, T., Ritchie, J. M., and Weston, N. (2016).
Neural photo editing with introspective adversarial networks. arXiv preprint
arXiv:1609.07093.
[Carlini and Wagner, 2017] Carlini, N. and Wagner, D. (2017). Towards evaluating
the robustness of neural networks. In Security and Privacy (SP), 2017 IEEE
Symposium on, pages 39–57. IEEE.
[Carrillo, 2003] Carrillo, C. M. (2003). Continuous biometric authentication for
authorized aircraft personnel: A proposed design. Technical report, DTIC Docu-
ment.
[Chang and Lin, 2011] Chang, C.-C. and Lin, C.-J. (2011). Libsvm: a library for
support vector machines. ACM Transactions on Intelligent Systems and Technol-
ogy (TIST), 2(3):27.
[Chen et al., 2016] Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I.,
and Abbeel, P. (2016). Infogan: Interpretable representation learning by informa-
tion maximizing generative adversarial nets. In Advances in Neural Information
Processing Systems, pages 2172–2180.
[Choi et al., 2017] Choi, Y., Choi, M., Kim, M., Ha, J.-W., Kim, S., and Choo,
J. (2017). Stargan: Unified generative adversarial networks for multi-domain
image-to-image translation. arXiv preprint arXiv:1711.09020.
[Cortes and Vapnik, 1995] Cortes, C. and Vapnik, V. (1995). Support-vector net-
works. Machine learning, 20(3):273–297.
[Cox and Pinto, 2011] Cox, D. and Pinto, N. (2011). Beyond simple features: A
large-scale feature search approach to unconstrained face recognition. In Auto-
matic Face & Gesture Recognition and Workshops (FG 2011), 2011 IEEE Inter-
national Conference on, pages 8–15. IEEE.
[Crouse et al., 2015] Crouse, D., Han, H., Chandra, D., Barbello, B., and Jain,
A. K. (2015). Continuous authentication of mobile user: Fusion of face image and
inertial measurement unit data. In International Conference on Biometrics.
[Dalal and Triggs, 2005] Dalal, N. and Triggs, B. (2005). Histograms of oriented
gradients for human detection. In Computer Vision and Pattern Recognition,
2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–
893. IEEE.
[Datta et al., 2005] Datta, R., Li, J., and Wang, J. Z. (2005). Content-based image
retrieval: approaches and trends of the new age. In Proceedings of the 7th ACM
SIGMM international workshop on Multimedia information retrieval, pages 253–
262. ACM.
[Derawi et al., 2010] Derawi, M., Nickel, C., Bours, P., and Busch, C. (2010). Un-
obtrusive user-authentication on mobile phones using biometric gait recognition.
In International Conference on Intelligent Information Hiding and Multimedia
Signal Processing, pages 306–311.
[Elhamifar and Vidal, 2009] Elhamifar, E. and Vidal, R. (2009). Sparse subspace
clustering. In Computer Vision and Pattern Recognition, 2009. CVPR 2009.
IEEE Conference on, pages 2790–2797. IEEE.
[Everingham et al., 2008] Everingham, M., Van Gool, L., Williams, C., Winn, J.,
and Zisserman, A. (2008). The pascal visual object classes challenge 2008.
[Farhadi et al., 2009] Farhadi, A., Endres, I., Hoiem, D., and Forsyth, D. (2009).
Describing objects by their attributes. In Computer Vision and Pattern Recogni-
tion, 2009. CVPR 2009. IEEE Conference on, pages 1778–1785. IEEE.
[Fathy et al., 2015] Fathy, M. E., Patel, V. M., and Chellappa, R. (2015). Face-
based active authentication on mobile devices. In IEEE International Conference
on Acoustics, Speech and Signal Processing.
[Feng et al., 2012] Feng, T., Liu, Z., Kwon, K.-A., Shi, W., Carbunar, B., Jiang,
Y., and Nguyen, N. (2012). Continuous mobile authentication using touchscreen
gestures. In IEEE Conference on Technologies for Homeland Security, pages 451–
456.
[Ferrari and Zisserman, 2007] Ferrari, V. and Zisserman, A. (2007). Learning visual
attributes. In Advances in Neural Information Processing Systems, pages 433–440.
[Fong and Vedaldi, 2017] Fong, R. C. and Vedaldi, A. (2017). Interpretable
explanations of black boxes by meaningful perturbation. arXiv preprint
arXiv:1704.03296.
[Frank et al., 2013] Frank, M., Biedert, R., Ma, E., Martinovic, I., and Song, D.
(2013). Touchalytics: On the applicability of touchscreen input as a behav-
ioral biometric for continuous authentication. IEEE Transactions on Information
Forensics and Security, 8(1):136–148.
[Fridman et al., 2015] Fridman, L., Weber, S., Greenstadt, R., and Kam, M. (2015).
Active authentication on mobile devices via stylometry, gps location, web brows-
ing behavior, and application usage patterns. IEEE Systems Journal.
[Ganin et al., 2016] Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle,
H., Laviolette, F., Marchand, M., and Lempitsky, V. (2016). Domain-adversarial
training of neural networks. The Journal of Machine Learning Research,
17(59):1–35.
[Gatys et al., 2015] Gatys, L. A., Ecker, A. S., and Bethge, M. (2015). A neural
algorithm of artistic style. arXiv preprint arXiv:1508.06576.
[Gatys et al., 2016] Gatys, L. A., Ecker, A. S., and Bethge, M. (2016). Image style
transfer using convolutional neural networks. In Computer Vision and Pattern
Recognition (CVPR), 2016 IEEE Conference on, pages 2414–2423. IEEE.
[Goodfellow et al., 2014a] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B.,
Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014a). Generative
adversarial nets. In Advances in neural information processing systems, pages
2672–2680.
[Goodfellow et al., 2014b] Goodfellow, I. J., Shlens, J., and Szegedy, C. (2014b).
Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
[Goodman and Flaxman, 2016] Goodman, B. and Flaxman, S. (2016). European
union regulations on algorithmic decision-making and a "right to explanation".
arXiv preprint arXiv:1606.08813.
[Gulrajani et al., 2017] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and
Courville, A. (2017). Improved training of wasserstein gans. arXiv preprint
arXiv:1704.00028.
[Gunther et al., 2013] Gunther, M., Costa-Pazo, A., Ding, C., Boutellaa, E., Chi-
achia, G., Zhang, H., de Assis Angeloni, M., Struc, V., Khoury, E., Vazquez-
Fernandez, E., et al. (2013). The 2013 face recognition evaluation in mobile
environment. In Biometrics (ICB), 2013 International Conference on, pages 1–7.
IEEE.
[Hadid et al., 2007] Hadid, A., Heikkila, J., Silven, O., and Pietikainen, M. (2007).
Face and eye detection for person authentication in mobile phones. In ACM/IEEE
International Conference on Distributed Smart Cameras, pages 101–108.
[Hendrycks and Gimpel, 2017] Hendrycks, D. and Gimpel, K. (2017). Early meth-
ods for detecting adversarial images.
[Hinton et al., 2015] Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the
knowledge in a neural network. arXiv preprint arXiv:1503.02531.
[Huang et al., 2007] Huang, G. B., Ramesh, M., Berg, T., and Learned-Miller, E.
(2007). Labeled faces in the wild: A database for studying face recognition in un-
constrained environments. Technical Report 07-49, University of Massachusetts,
Amherst.
[Inc, 2013] Inc, N. M. (2013). Nearly one in three consumers who have
lost their mobile devices still do not lock them, new survey shows.
”http://www.prnewswire.com/news-releases/nearly-one-in-three-consumers-
who-have-lost-their-mobile-devices-still-do-not-lock-them-new-survey-shows-
200410151.html”.
[Isola et al., 2017] Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. (2017). Image-
to-image translation with conditional adversarial networks. arXiv preprint.
[Jain et al., 2004] Jain, A. K., Dass, S. C., and Nandakumar, K. (2004). Soft bio-
metric traits for personal recognition systems. In Biometric Authentication, pages
731–738. Springer.
[Janakiraman et al., 2005] Janakiraman, R., Kumar, S., Zhang, S., and Sim, T.
(2005). Using continuous face verification to improve desktop security. In Ap-
plication of Computer Vision, 2005. WACV/MOTIONS’05 Volume 1. Seventh
IEEE Workshops on, volume 1, pages 501–507. IEEE.
[Johnson et al., 2016] Johnson, J., Alahi, A., and Fei-Fei, L. (2016). Perceptual
losses for real-time style transfer and super-resolution. In European Conference
on Computer Vision, pages 694–711. Springer.
[Kabkab et al., 2018] Kabkab, M., Samangouei, P., and Chellappa, R. (2018). Task-
aware compressed sensing with generative adversarial networks. arXiv preprint
arXiv:1802.01284.
[Kazemi and Sullivan, 2014] Kazemi, V. and Sullivan, J. (2014). One millisecond
face alignment with an ensemble of regression trees. In Computer Vision and
Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1867–1874. IEEE.
[King, 2009] King, D. E. (2009). Dlib-ml: A machine learning toolkit. Journal of
Machine Learning Research, 10:1755–1758.
[Kingma and Ba, 2014] Kingma, D. and Ba, J. (2014). Adam: A method for
stochastic optimization. arXiv preprint arXiv:1412.6980.
[Klare et al., 2014] Klare, B. F., Klum, S., Klontz, J. C., Taborsky, E., Akgul, T.,
and Jain, A. K. (2014). Suspect identification based on descriptive facial at-
tributes. In Biometrics (IJCB), 2014 IEEE International Joint Conference on,
pages 1–8. IEEE.
[Klosterman and Ganger, 2000] Klosterman, A. J. and Ganger, G. R. (2000). Secure
continuous biometric-enhanced authentication.
[Krizhevsky et al., 2012] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012).
Imagenet classification with deep convolutional neural networks. In Advances in
neural information processing systems, pages 1097–1105.
[Kulkarni et al., 2015] Kulkarni, T. D., Whitney, W. F., Kohli, P., and Tenenbaum,
J. (2015). Deep convolutional inverse graphics network. In Advances in Neural
Information Processing Systems, pages 2539–2547.
[Kumar et al., 2008] Kumar, N., Belhumeur, P. N., and Nayar, S. K. (2008). Face-
Tracer: A Search Engine for Large Collections of Images with Faces. In European
Conference on Computer Vision (ECCV), pages 340–353.
[Kumar et al., 2009] Kumar, N., Berg, A. C., Belhumeur, P. N., and Nayar, S. K.
(2009). Attribute and Simile Classifiers for Face Verification. In IEEE Interna-
tional Conference on Computer Vision (ICCV).
[Kurakin et al., 2016] Kurakin, A., Goodfellow, I., and Bengio, S. (2016). Adver-
sarial examples in the physical world. arXiv preprint arXiv:1607.02533.
[Lampert et al., 2014] Lampert, C. H., Nickisch, H., and Harmeling, S. (2014).
Attribute-based classification for zero-shot visual object categorization. Pattern
Analysis and Machine Intelligence, IEEE Transactions on, 36(3):453–465.
[Lample et al., 2017] Lample, G., Zeghidour, N., Usunier, N., Bordes, A., Denoyer,
L., et al. (2017). Fader networks: Manipulating images by sliding attributes. In
Advances in Neural Information Processing Systems, pages 5969–5978.
[Larsen et al., 2015] Larsen, A. B. L., Sønderby, S. K., Larochelle, H., and Winther,
O. (2015). Autoencoding beyond pixels using a learned similarity metric. arXiv
preprint arXiv:1512.09300.
[Layne et al., 2012] Layne, R., Hospedales, T. M., and Gong, S. (2012).
Person re-identification by attributes. In BMVC, volume 2, page 8.
[LeCun et al., 1998] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998).
Gradient-based learning applied to document recognition. Proceedings of the
IEEE, 86(11):2278–2324.
[Levi and Hassner, 2015] Levi, G. and Hassner, T. (2015). Age and gender classifi-
cation using convolutional neural networks. In IEEE Conf. on Computer Vision
and Pattern Recognition (CVPR) workshops.
[Liu et al., 2011] Liu, J., Kuipers, B., and Savarese, S. (2011). Recognizing human
actions by attributes. In Computer Vision and Pattern Recognition (CVPR),
2011 IEEE Conference on, pages 3337–3344. IEEE.
[Liu et al., 2014] Liu, M., Zhang, D., and Chen, S. (2014). Attribute relation learn-
ing for zero-shot classification. Neurocomputing, 139:34–46.
[Liu et al., 2017] Liu, M.-Y., Breuel, T., and Kautz, J. (2017). Unsupervised image-
to-image translation networks. In Advances in Neural Information Processing
Systems, pages 700–708.
[Liu and Tuzel, 2016] Liu, M.-Y. and Tuzel, O. (2016). Coupled generative adver-
sarial networks. In Advances in neural information processing systems, pages
469–477.
[Liu et al., 2016] Liu, Y., Chen, X., Liu, C., and Song, D. (2016). Delving
into transferable adversarial examples and black-box attacks. arXiv preprint
arXiv:1611.02770.
[Liu et al., 2007] Liu, Y., Zhang, D., Lu, G., and Ma, W.-Y. (2007). A survey
of content-based image retrieval with high-level semantics. Pattern Recognition,
40(1):262–282.
[Liu et al., 2015] Liu, Z., Luo, P., Wang, X., and Tang, X. (2015). Deep learning face
attributes in the wild. In Proceedings of International Conference on Computer
Vision (ICCV).
[Maaten and Hinton, 2008] Maaten, L. v. d. and Hinton, G. (2008). Visualizing
data using t-sne. Journal of machine learning research, 9(Nov):2579–2605.
[Mahbub et al., 2016] Mahbub, U., Patel, V. M., Chandra, D., Barbello, B., and
Chellappa, R. (2016). Partial face detection for continuous authentication. arXiv
preprint arXiv:1603.09364.
[Mathieu et al., 2016] Mathieu, M. F., Zhao, J. J., Zhao, J., Ramesh, A., Sprech-
mann, P., and LeCun, Y. (2016). Disentangling factors of variation in deep rep-
resentation using adversarial training. In Advances in Neural Information Pro-
cessing Systems, pages 5040–5048.
[MATLAB, 2014] MATLAB (2014). version 8.4.0.150421 (R2014b). The Math-
Works Inc., Natick, Massachusetts.
[McCool et al., 2012] McCool, C., Marcel, S., Hadid, A., Pietikainen, M., Matejka,
P., Cernocky, J., Poh, N., Kittler, J., Larcher, A., Levy, C., Matrouf, D., Bonastre,
J.-F., Tresadern, P., and Cootes, T. (2012). Bi-modal person recognition on
a mobile phone: using mobile phone data. In IEEE ICME Workshop on Hot
Topics in Mobile Multimedia.
[Meng and Chen, 2017] Meng, D. and Chen, H. (2017). Magnet: a two-pronged
defense against adversarial examples. arXiv preprint arXiv:1705.09064.
[Monrose et al., 2002] Monrose, F., Reiter, M. K., and Wetzel, S. (2002). Password
hardening based on keystroke dynamics. International Journal of Information
Security, 1(2):69–83.
[Montavon et al., 2017] Montavon, G., Lapuschkin, S., Binder, A., Samek, W., and
Müller, K.-R. (2017). Explaining nonlinear classification decisions with deep tay-
lor decomposition. Pattern Recognition, 65:211–222.
[Moosavi-Dezfooli et al., 2016] Moosavi-Dezfooli, S.-M., Fawzi, A., and Frossard, P.
(2016). Deepfool: a simple and accurate method to fool deep neural networks.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recog-
nition, pages 2574–2582.
[Niinuma and Jain, 2010] Niinuma, K. and Jain, A. K. (2010). Continuous user au-
thentication using temporal information. In SPIE Defense, Security, and Sensing,
pages 76670L–76670L. International Society for Optics and Photonics.
[Niinuma et al., 2010] Niinuma, K., Park, U., and Jain, A. K. (2010). Soft biometric
traits for continuous user authentication. Information Forensics and Security,
IEEE Transactions on, 5(4):771–780.
[Obeid et al., 2001] Obeid, M., Jedynak, B., and Daoudi, M. (2001). Image index-
ing & retrieval using intermediate features. In Proceedings of the ninth ACM
international conference on Multimedia, pages 531–533. ACM.
[Papernot et al., 2016a] Papernot, N., Goodfellow, I., Sheatsley, R., Feinman, R.,
and McDaniel, P. (2016a). cleverhans v1.0.0: an adversarial machine learning
library. arXiv preprint arXiv:1610.00768.
[Papernot et al., 2017] Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik,
Z. B., and Swami, A. (2017). Practical black-box attacks against machine learn-
ing. In Proceedings of the 2017 ACM on Asia Conference on Computer and
Communications Security, pages 506–519. ACM.
[Papernot et al., 2016b] Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik,
Z. B., and Swami, A. (2016b). The limitations of deep learning in adversarial
settings. In Security and Privacy (EuroS&P), 2016 IEEE European Symposium
on, pages 372–387. IEEE.
[Papernot et al., 2016c] Papernot, N., McDaniel, P., Sinha, A., and Wellman, M.
(2016c). Towards the science of security and privacy in machine learning. arXiv
preprint arXiv:1611.03814.
[Papernot et al., 2016d] Papernot, N., McDaniel, P., Swami, A., and Harang, R.
(2016d). Crafting adversarial input sequences for recurrent neural networks. In
Military Communications Conference, MILCOM 2016-2016 IEEE, pages 49–54.
IEEE.
[Papernot et al., 2016e] Papernot, N., McDaniel, P., Wu, X., Jha, S., and Swami,
A. (2016e). Distillation as a defense to adversarial perturbations against deep
neural networks. In Security and Privacy (SP), 2016 IEEE Symposium on, pages
582–597. IEEE.
[Parkhi et al., 2015] Parkhi, O. M., Vedaldi, A., and Zisserman, A. (2015). Deep
face recognition. In British Machine Vision Conference.
[Patterson and Hays, 2012] Patterson, G. and Hays, J. (2012). Sun attribute
database: Discovering, annotating, and recognizing scene attributes. In Com-
puter Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages
2751–2758. IEEE.
[Perarnau et al., 2016] Perarnau, G., van de Weijer, J., Raducanu, B., and Álvarez,
J. M. (2016). Invertible conditional gans for image editing. arXiv preprint
arXiv:1611.06355.
[Prechelt, 1998] Prechelt, L. (1998). Automatic early stopping using cross valida-
tion: quantifying the criteria. Neural Networks, 11(4):761–767.
[Primo et al., 2014] Primo, A., Phoha, V., Kumar, R., and Serwadda, A. (2014).
Context-aware active authentication using smartphone accelerometer measure-
ments. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2014
IEEE Conference on, pages 98–105.
[Radford et al., 2015] Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised
representation learning with deep convolutional generative adversarial networks.
arXiv preprint arXiv:1511.06434.
[Rudin et al., 1992] Rudin, L. I., Osher, S., and Fatemi, E. (1992). Nonlinear total
variation based noise removal algorithms. Physica D: nonlinear phenomena, 60(1-
4):259–268.
[Samangouei and Chellappa, 2016] Samangouei, P. and Chellappa, R. (2016). Con-
volutional neural networks for attribute-based active authentication on mobile
devices. In Biometrics Theory, Applications and Systems (BTAS), 2016 IEEE
8th International Conference on, pages 1–8. IEEE.
[Samangouei et al., 2017a] Samangouei, P., Hand, E., Patel, V. M., and Chellappa,
R. (2017a). Active authentication using facial attributes. Mobile Biometrics,
3:131.
[Samangouei et al., 2018a] Samangouei, P., Kabkab, M., and Chellappa, R. (2018a).
Defense-gan: Protecting classifiers against adversarial attacks using generative
models. International Conference on Learning Representations.
[Samangouei et al., 2015] Samangouei, P., Patel, V. M., and Chellappa, R. (2015).
Attribute-based continuous user authentication on mobile devices. In Biometrics
Theory, Applications and Systems (BTAS), 2015 IEEE 7th International Confer-
ence on, pages 1–8.
[Samangouei et al., 2017b] Samangouei, P., Patel, V. M., and Chellappa, R.
(2017b). Facial attributes for active authentication on mobile devices. Image
and Vision Computing, 58:181–192.
[Samangouei et al., 2018b] Samangouei, P., Saeidi, A., Nakagiwa, L., and Silber-
man, N. (2018b). Explaingan: Model explanation via decision boundary cross-
ing transformations. Proceedings of European Conference on Computer Vision
(ECCV).
[Samek et al., 2017] Samek, W., Binder, A., Montavon, G., Lapuschkin, S., and
Müller, K.-R. (2017). Evaluating the visualization of what a deep neural net-
work has learned. IEEE transactions on neural networks and learning systems,
28(11):2660–2673.
[Sarkar et al., 2016] Sarkar, S., Patel, V. M., and Chellappa, R. (2016). Deep
feature-based face detection on mobile devices. arXiv preprint arXiv:1602.04868.
[Schmidhuber, 1992] Schmidhuber, J. (1992). Learning factorial codes by pre-
dictability minimization. Neural Computation, 4(6):863–879.
[Schroff et al., 2015] Schroff, F., Kalenichenko, D., and Philbin, J. (2015). Facenet:
A unified embedding for face recognition and clustering. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823.
[Selvaraju et al., 2016a] Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R.,
Parikh, D., and Batra, D. (2016a). Grad-cam: Visual explanations from deep
networks via gradient-based localization. arXiv preprint arXiv:1610.02391.
[Selvaraju et al., 2016b] Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R.,
Parikh, D., and Batra, D. (2016b). Grad-cam: Visual explanations from deep
networks via gradient-based localization. arXiv preprint arXiv:1610.02391.
[Shrikumar et al., 2017] Shrikumar, A., Greenside, P., and Kundaje, A. (2017).
Learning important features through propagating activation differences. arXiv
preprint arXiv:1704.02685.
[Shrikumar et al., 2016] Shrikumar, A., Greenside, P., Shcherbina, A., and Kun-
daje, A. (2016). Not just a black box: Learning important features through
propagating activation differences. arXiv preprint arXiv:1605.01713.
[Sim et al., 2007] Sim, T., Zhang, S., Janakiraman, R., and Kumar, S. (2007). Con-
tinuous verification using multimodal biometrics. Pattern Analysis and Machine
Intelligence, IEEE Transactions on, 29(4):687–700.
[Simonyan et al., 2013] Simonyan, K., Vedaldi, A., and Zisserman, A. (2013). Deep
inside convolutional networks: visualising image classification models and saliency
maps. arXiv preprint arXiv:1312.6034.
[Simonyan and Zisserman, 2014] Simonyan, K. and Zisserman, A. (2014). Very
deep convolutional networks for large-scale image recognition. arXiv preprint
arXiv:1409.1556.
[Spillane, 1975] Spillane, R. (1975). Keyboard apparatus for personal identification.
IBM Technical Disclosure Bulletin, 17(3346):3346.
[Springenberg et al., 2014] Springenberg, J. T., Dosovitskiy, A., Brox, T., and Ried-
miller, M. (2014). Striving for simplicity: The all convolutional net. arXiv preprint
arXiv:1412.6806.
[Sun et al., 2014a] Sun, Y., Chen, Y., Wang, X., and Tang, X. (2014a). Deep learn-
ing face representation by joint identification-verification. In Advances in Neural
Information Processing Systems, pages 1988–1996.
[Sun et al., 2014b] Sun, Y., Chen, Y., Wang, X., and Tang, X. (2014b). Deep
learning face representation by joint identification-verification. In Advances in
Neural Information Processing Systems, pages 1988–1996.
[Sundararajan et al., 2017] Sundararajan, M., Taly, A., and Yan, Q. (2017). Ax-
iomatic attribution for deep networks. arXiv preprint arXiv:1703.01365.
[Szegedy et al., 2013] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D.,
Goodfellow, I., and Fergus, R. (2013). Intriguing properties of neural networks.
arXiv preprint arXiv:1312.6199.
[Taigman et al., 2014] Taigman, Y., Yang, M., Ranzato, M., and Wolf, L. (2014).
Deepface: Closing the gap to human-level performance in face verification. In Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 1701–1708.
[Takacs et al., 2008] Takacs, G., Chandrasekhar, V., Gelfand, N., Xiong, Y., Chen,
W.-C., Bismpigiannis, T., Grzeszczuk, R., Pulli, K., and Girod, B. (2008). Out-
doors augmented reality on mobile phone using loxel-based visual feature orga-
nization. In Proceedings of the 1st ACM international conference on Multimedia
information retrieval, pages 427–434. ACM.
[Tramèr et al., 2017] Tramèr, F., Kurakin, A., Papernot, N., Boneh, D., and Mc-
Daniel, P. (2017). Ensemble adversarial training: Attacks and defenses. arXiv
preprint arXiv:1705.07204.
[Tropp and Gilbert, 2007] Tropp, J. A. and Gilbert, A. C. (2007). Signal recov-
ery from random measurements via orthogonal matching pursuit. Information
Theory, IEEE Transactions on, 53(12):4655–4666.
[Vaquero et al., 2009] Vaquero, D., Feris, R. S., Tran, D., Brown, L., Hampapur,
A., Turk, M., et al. (2009). Attribute-based people search in surveillance environ-
ments. In Applications of Computer Vision (WACV), 2009 Workshop on, pages
1–8. IEEE.
[Vedaldi and Fulkerson, 2008] Vedaldi, A. and Fulkerson, B. (2008). VLFeat: An
open and portable library of computer vision algorithms.
[Viola et al., 2005] Viola, P., Jones, M. J., and Snow, D. (2005). Detecting pedestri-
ans using patterns of motion and appearance. International Journal of Computer
Vision, 63(2):153–161.
[Wattenberg et al., 2016] Wattenberg, M., Viégas, F., and Johnson, I. (2016).
How to use t-sne effectively. Distill.
[Wen et al., 2016] Wen, Y., Zhang, K., Li, Z., and Qiao, Y. (2016). A discriminative
feature learning approach for deep face recognition. In European Conference on
Computer Vision, pages 499–515. Springer.
[Wiskott et al., 1997] Wiskott, L., Fellous, J.-M., Krüger, N., and Von Der
Malsburg, C. (1997). Face recognition by elastic bunch graph matching. Pattern
Analysis and Machine Intelligence, IEEE Transactions on, 19(7):775–779.
[Wright et al., 2009] Wright, J., Yang, A. Y., Ganesh, A., Sastry, S. S., and Ma, Y.
(2009). Robust face recognition via sparse representation. Pattern Analysis and
Machine Intelligence, IEEE Transactions on, 31(2):210–227.
[Xiao et al., 2017a] Xiao, H., Rasul, K., and Vollgraf, R. (2017a). Fashion-mnist: a
novel image dataset for benchmarking machine learning algorithms.
[Xiao et al., 2017b] Xiao, H., Rasul, K., and Vollgraf, R. (2017b). Fashion-mnist:
A novel image dataset for benchmarking machine learning algorithms. arXiv
preprint arXiv:1708.07747.
[Yan et al., 2016] Yan, X., Yang, J., Sohn, K., and Lee, H. (2016). Attribute2image:
Conditional image generation from visual attributes. In European Conference on
Computer Vision, pages 776–791. Springer.
[Zeiler and Fergus, 2014] Zeiler, M. D. and Fergus, R. (2014). Visualizing and un-
derstanding convolutional networks. In European conference on computer vision,
pages 818–833. Springer.
[Zhang et al., 2015a] Zhang, H., Patel, V. M., and Chellappa, R. (2015a). Ro-
bust multimodal recognition via multitask multivariate low-rank representations.
In IEEE International Conference on Automatic Face and Gesture Recognition.
IEEE.
[Zhang et al., 2015b] Zhang, H., Patel, V. M., Fathy, M. E., and Chellappa, R.
(2015b). Touch gesture-based active user authentication using dictionaries. In
IEEE Winter conference on Applications of Computer Vision. IEEE.
[Zhang et al., 2015c] Zhang, H., Patel, V. M., Shekhar, S., and Chellappa, R.
(2015c). Domain adaptive sparse representation-based classification. In IEEE
International Conference on Automatic Face and Gesture Recognition. IEEE.
[Zhang et al., 2014a] Zhang, L., Kalashnikov, D. V., and Mehrotra, S. (2014a).
Context-assisted face clustering framework with human-in-the-loop. International
Journal of Multimedia Information Retrieval, 3(2):69–88.
[Zhang et al., 2010] Zhang, L., Tiwana, B., Qian, Z., Wang, Z., Dick, R. P., Mao,
Z. M., and Yang, L. (2010). Accurate online power estimation and automatic
battery behavior based power model generation for smartphones. In Proceedings
of the eighth IEEE/ACM/IFIP international conference on Hardware/software
codesign and system synthesis, pages 105–114. ACM.
[Zhang et al., 2014b] Zhang, N., Paluri, M., Ranzato, M., Darrell, T., and Bourdev,
L. (2014b). Panda: Pose aligned networks for deep attribute modeling. In Com-
puter Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages
1637–1644. IEEE.
[Zhang et al., 2005] Zhang, W., Shan, S., Gao, W., Chen, X., and Zhang, H. (2005).
Local gabor binary pattern histogram sequence (lgbphs): A novel non-statistical
model for face representation and recognition. In Computer Vision, 2005. ICCV
2005. Tenth IEEE International Conference on, volume 1, pages 786–791. IEEE.
[Zhao et al., 1998] Zhao, W., Krishnaswamy, A., Chellappa, R., Swets, D. L., and
Weng, J. (1998). Discriminant analysis of principal components for face recogni-
tion. In Face Recognition, pages 73–85. Springer.
[Zhou et al., 2016] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A.
(2016). Learning deep features for discriminative localization. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–
2929.
[Zhu et al., 2017] Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. (2017). Unpaired
image-to-image translation using cycle-consistent adversarial networks. arXiv
preprint arXiv:1703.10593.