ABSTRACT

Title of Dissertation: EXPANDING ROBUSTNESS IN RESPONSIBLE AI FOR NOVEL BIAS MITIGATION

Samuel Dooley, Doctor of Philosophy, 2023

Dissertation Directed by: John Dickerson, Department of Computer Science

The conventional belief in the fairness community is that one should first find the highest-performing model for a given problem and then apply a bias mitigation strategy. One starts with an existing model architecture and hyperparameters, and then adjusts model weights, learning procedures, or input data to make the model fairer using a pre-, post-, or in-processing bias mitigation technique. While existing methods for de-biasing machine learning systems use a fixed neural architecture and hyperparameter setting, I instead ask a fundamental question which has received little attention: how much does model bias arise from the architecture and hyperparameters, and how can we exploit the extensive research in the fields of neural architecture search (NAS) and hyperparameter optimization (HPO) to search for more inherently fair models? By thinking of bias mitigation in this new way, we expand our conceptualization of robustness in responsible AI.

Robustness is an emerging aspect of responsible AI that focuses on maintaining model performance in the face of uncertainties and variations for all subgroups of a data population. Robustness often deals with protecting models from intentional or unintentional manipulations of data, handling noisy or corrupted data, and preserving accuracy in real-world scenarios. In other words, robustness, as commonly defined, examines the output of a system under changes to input data. However, I will broaden the idea of what robustness in responsible AI is in a manner which defines new fairness metrics, yields insights into the robustness of deployed AI systems, and proposes an entirely new bias mitigation strategy.

This thesis explores the connection between robust machine learning and responsible AI. It introduces a fairness metric that quantifies disparities in susceptibility to adversarial attacks. It also audits face detection systems for robustness to common natural noises, revealing biases in these systems. Finally, it proposes using neural architecture search to find fairer architectures, challenging the conventional approach of starting with accurate architectures and applying bias mitigation strategies.

EXPANDING ROBUSTNESS IN RESPONSIBLE AI FOR NOVEL BIAS MITIGATION

by Samuel Dooley

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2023

Advisory Committee:
Dr. John Dickerson, Chair/Advisor
Dr. Philip Resnik
Dr. Hal Daumé III
Dr. Tom Goldstein
Dr. Furong Huang
Dr. Elissa Redmiles

© Copyright by Samuel Dooley 2023

Acknowledgments

To my mother and father

Without the support of my family, friends, and colleagues, this work would not have been possible. I have had the great fortune to share in the creation of this work with amazing co-authors, including Vedant Nanda, Rhea Sukthanker, George Wei, Micah Goldblum, Colin White, Frank Hutter, Tom Goldstein, and John Dickerson. These collaborations have been instrumental in my progress during the PhD program. I also worked with many other individuals on projects throughout my PhD which did not make it into this thesis due to space constraints.
These projects and collaborators are listed below — thank you to everyone for their guidance and support.

I had the pleasure of working primarily with Michael Curry and John Dickerson on human value alignment in deep learning for auction design. Specifically, I worked with Kevin Kuo, Anthony Ostuni, Elizabeth Horishny, Michael J. Curry, Ping Chiang, Tom Goldstein, and John Dickerson in Kuo et al. (2020), and with Neehar Peri, Michael Curry, and John Dickerson in Peri et al. (2021).

I’d also like to thank Elissa Redmiles for her guidance and collaboration on projects of deep interest to me. We worked with John Dickerson and Dana Turjeman to understand how messaging impacted Covid-19 app adoption in Louisiana (Dooley et al., 2022a). Through Elissa, I also met Angelica Goetzen and had the pleasure of working with her on statistical analyses of survey data on privacy sentiment (Goetzen et al., 2022). In other user studies, I worked closely with Michelle Mazurek on a line of work that examined the ways in which library IT staff conceived of privacy and security for their patrons. Originally devised as a class project with Michael Rosenberg, Elliot Sloate, and Sungbok Shin (Dooley et al., 2020), this work grew into a survey that we designed with Nora McDonald and Rachel Greenstadt, which I then conducted and which Alan Luo and Noel Warford analyzed (Luo et al., 2023). From these collaborations, I learned a great deal about qualitative research as well as my own working style and motivations. Michael Rosenberg and I also worked on a topological data analysis project (Rawson et al., 2022), where I got to apply my undergraduate math knowledge to my PhD.

I worked closely with Tom Goldstein, Micah Goldblum, and Valeriia Cherepanova on face recognition projects (Cherepanova et al., 2022; Dooley et al., 2021a). From them, I learned much about technical writing and the research process. Early in my PhD studies, John introduced me to some labor market problems (Dooley and Dickerson, 2020), which I worked on with Marina Knittel (Knittel et al., 2022).

Finally, a special thanks to Colin White, who has been instrumental in my progression towards the end of my program. We first started collaborating on Chapter 4 of this thesis, and this further led to an internship at Abacus.AI and ultimately a job. We’ve further collaborated on ForecastPFN (Khurana et al., 2023) and forthcoming work in NAS. Thank you, Colin, for your kindness, generosity, and support.

My family has been a constant voice of support through my education, particularly my husband, mother, father, brother, sister-in-law, and niece. I particularly appreciate the constant love and encouragement of my husband as well as his family as they have cheered me on and celebrated my wins. My friends have always been there to encourage and distract me at critical times – thank you particularly to Claire and Annette. Finally, I’d like to thank all the musicians whose music I enjoyed during the many long hours and late nights working on this thesis — particularly Beyoncé and her inspiring RENAISSANCE album.

The work in this thesis was supported in part by NSF CAREER Award IIS-1846237, NIST MSE Award #20126334, NSF D-ISN Award #2039862, NSF Award CCF-1852352, NIH R01 Award NLM-013039-01, DARPA GARD #HR00112020007, DARPA SI3-CMD #S4761, DoD WHS Award #HQ003420F0035, and ARPA-E Award #4334192.
iv Table of Contents Acknowledgements ii Table of Contents v List of Tables viii List of Figures xiv Chapter 1: Introduction 1 Chapter 2: Adversarial Robustness 10 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2 Heterogeneous Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.3 Robustness Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.3.1 Real-world Implications: Degradation of Quality of Service . . . . . . . 22 2.4 Measuring Robustness Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.4.1 Adversarial Attacks (Upper Bound) . . . . . . . . . . . . . . . . . . . . 23 2.4.2 Randomized Smoothing (Lower Bound) . . . . . . . . . . . . . . . . . . 23 2.5 Empirical Evidence of Robustness Bias in the Wild . . . . . . . . . . . . . . . . 24 2.6 Exact Computation in a Simple Model: Multinomial Logistic Regression . . . . . 27 2.7 Evaluation of Robustness Bias using Adversarial Attacks . . . . . . . . . . . . . 29 2.7.1 Evaluation of ÎDF P and ÎCW P . . . . . . . . . . . . . . . . . . . . . . . . 29 2.7.2 Audit of Commonly Used Models . . . . . . . . . . . . . . . . . . . . . 30 2.8 Evaluation of Robustness Bias using Randomized Smoothing . . . . . . . . . . . 31 2.8.1 Evaluation of ÎRS P . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.8.2 Audit of Commonly Used Models . . . . . . . . . . . . . . . . . . . . . 32 2.8.3 Comparison of Randomized Smoothing and Upper Bounds . . . . . . . . 34 2.9 Reducing Robustness Bias through Regularization . . . . . . . . . . . . . . . . . 35 2.9.1 Experimental Results using Regularized Models . . . . . . . . . . . . . . 37 2.10 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 Chapter 3: Robustness Disparities in Face Detection 50 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.3 Benchmark Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 v 3.3.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 3.4.1 RQ1: Overall Model Performance . . . . . . . . . . . . . . . . . . . . . 68 3.4.2 RQ2: Demographic Disparities in Noise Robustness . . . . . . . . . . . 69 3.4.3 RQ3: Disparity Comparison to Between Academic and Commercial Models 73 3.5 Implications and Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 3.6 Discussion & A Call to action . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 Chapter 4: Rethinking Bias Mitigation: Fairer Architectures Make for Fairer Face Recognition 79 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 4.2 Background and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 4.2.1 Face Identification. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 4.2.2 Bias Mitigation in Face Recognition. . . . . . . . . . . . . . . . . . . . . 84 4.2.3 Neural Architecture Search (NAS) and Hyperparameter Optimization (HPO). 85 4.3 Architectures and Hyperparameters: A Case Study . . . . . . . . . . . . . . . . . 86 4.3.1 NAS-based Bias Mitigation . . . . . . . . . . . . . . . . . . . . . . . . . 
86 4.3.2 Architectures and Hyperparameter Experiments . . . . . . . . . . . . . . 87 4.3.3 Experimental Setup. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 4.3.4 Evaluation procedure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 4.3.5 Results and Discussion. . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 4.4 Neural Architecture Search for Bias Mitigation . . . . . . . . . . . . . . . . . . 92 4.4.1 Search Space Design and Search Strategy . . . . . . . . . . . . . . . . . 92 4.4.2 Hyperparameter Search Space Design. . . . . . . . . . . . . . . . . . . . 92 4.4.3 Architecture Search Space Design. . . . . . . . . . . . . . . . . . . . . . 93 4.4.4 Obtained architectures and hyperparameter configurations from Black- Box-Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 4.4.5 Multi-fidelity optimization. . . . . . . . . . . . . . . . . . . . . . . . . . 94 4.4.6 Multi-objective optimization. . . . . . . . . . . . . . . . . . . . . . . . . 95 4.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 4.5.1 Novel architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 4.5.2 Analysis of the Pareto-Front of different Fairness Metrics . . . . . . . . . 97 4.5.3 Novel Architectures Outperform other Bias Mitigation Strategies . . . . . 99 4.5.4 Comparison to other Bias Mitigation Techniques on all Fairness Metrics . 101 4.5.5 Novel Architectures Generalize to Other Datasets . . . . . . . . . . . . . 102 4.5.6 Novel Architectures Generalize to Other Sensitive Attributes . . . . . . . 103 4.5.7 Novel Architectures Have Less Linear-Separability of Protected Attributes 104 4.6 Conclusion, Future Work and Limitations . . . . . . . . . . . . . . . . . . . . . 106 Chapter 5: Open Questions 113 5.1 Implications of Adversarial Robustness Bias . . . . . . . . . . . . . . . . . . . . 113 5.2 Causal Reasoning Behind Robustness Disparities in Face Detection . . . . . . . . 114 5.3 Applicability of NAS+HPO to Other Domains . . . . . . . . . . . . . . . . . . . 115 Appendix A: Additional Results: Robustness Bias in Face Detection 118 vi A.1 Statistical Significance Regressions for Average Precision . . . . . . . . . . . . . 118 A.1.1 Main Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 A.1.2 AP — Corruption Comparison Claims . . . . . . . . . . . . . . . . . . . 118 A.1.3 AP — Age Comparison Claims . . . . . . . . . . . . . . . . . . . . . . . 129 A.1.4 AP — Gender Comparison Claims . . . . . . . . . . . . . . . . . . . . . 136 A.1.5 AP — Skin Type Comparison Claims . . . . . . . . . . . . . . . . . . . 142 Bibliography 150 vii List of Tables 2.1 Test data performance of all models on different datasets. . . . . . . . . . . . . . 27 3.1 Adience Dataset Counts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 3.2 CCD Dataset Counts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 3.3 MIAP Dataset Counts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.4 UTKFace Dataset Counts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 3.5 Total Costs of Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.1 The fairness metrics explored in this chapter. Rank Disparity is explored in the main paper and the other metrics are reported in 4.5.2 . . . . . . . . . . . . . . . 90 4.2 Searchable hyperparameter choices. . . . . . . . . . . . . . . . . . . . . . . . . 
98 4.3 Operation choices and definitions. . . . . . . . . . . . . . . . . . . . . . . . . . 98 4.4 Comparison bias mitigation techniques where the SMAC models were found on VGGFace2 with NAS bias mitigation technique and the other three techniques are standard in facial recognition: Flipped (Chang et al., 2020), Angular (Morales et al., 2020), and Discriminator (Wang and Deng, 2020). Items in bold are Pareto- optimal. The values show (Error;metric). . . . . . . . . . . . . . . . . . . . . . . 108 4.5 Taking the highest performing models from the Pareto front of both VGGFace2 and CelebA, we transfer their evaluation onto six other common face recognition datasets: LFW (Huang et al., 2008), CFP_FF (Sengupta et al., 2016), CFP_FP (Sen- gupta et al., 2016), AgeDB (Moschoglou et al., 2017), CALFW (Zheng et al., 2017), CPLPW (Zheng and Deng, 2018). The novel architectures which we found with our bias mitigation strategy significantly out perform all other models. . . . 109 4.6 Comparison bias mitigation techniques where the SMAC models were found with NAS bias mitigation technique and the other three techniques are standard in facial recognition: Flipped (Chang et al., 2020), Angular (Morales et al., 2020), and Discriminator (Wang and Deng, 2020). Items in bold are Pareto-optimal. The values show (Error;Rank Disparity). . . . . . . . . . . . . . . . . . . . . . . . . 110 4.7 We transfer the evaluation of top performing models on VGGFace2 and CelebA onto six other common face recognition datasets: LFW (Huang et al., 2008), CFP_FF (Sengupta et al., 2016), CFP_FP (Sengupta et al., 2016), AgeDB (Moschoglou et al., 2017), CALFW (Zheng et al., 2017), CPLPW (Zheng and Deng, 2018). The novel architectures which we found with our bias mitigation strategy significantly out perform all other models. Full table is reported in Table 4.5. . . . . . . . . . 111 viii 4.8 Taking the highest performing models from the Pareto front of both VGGFace2 and CelebA, we transfer their evaluation onto a dataset with a different protected attribue – race – on the RFW dataset (Wang et al., 2019a). The novel architectures which we found with our bias mitigation strategy are always on the Pareto front, and mostly Pareto-dominant of the traditional architectures. . . . . . . . . . . . . 111 4.9 Linear Probes on VGGFace2. Lower accuracy is better . . . . . . . . . . . . . . 111 A.1 The best and worst performing perturbations for each dataset and model. . . . . . 119 A.2 AP. Pairwise Wilcoxon test with Bonferroni correction for model on Adience . . 119 A.3 AP. Pairwise Wilcoxon test with Bonferroni correction for model on CCD . . . . 119 A.4 AP. Pairwise Wilcoxon test with Bonferroni correction for model on MIAP . . . . 120 A.5 AP. Pairwise Wilcoxon test with Bonferroni correction for model on UTK . . . . 120 A.6 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on AWS and Adience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 A.7 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on Azure and Adience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 A.8 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on GCP and Adience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 A.9 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on MogFace and Adience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 A.10 AP. 
Pairwise Wilcoxon test with Bonferroni correction for corruption on TinaFace and Adience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 A.11 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on Yolov5 and Adience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 A.12 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on AWS and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 A.13 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on Azure and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 A.14 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on GCP and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 A.15 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on MogFace and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 A.16 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on TinaFace and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 A.17 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on Yolov5 and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 A.18 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on AWS and MIAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 A.19 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on Azure and MIAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 A.20 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on GCP and MIAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 A.21 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on MogFace and MIAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 ix A.22 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on TinaFace and MIAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 A.23 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on Yolov5 and MIAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 A.24 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on AWS and UTK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 A.25 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on Azure and UTK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 A.26 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on GCP and UTK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 A.27 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on MogFace and UTK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 A.28 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on TinaFace and UTK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 A.29 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on Yolov5 and UTK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 A.30 AP. 
Pairwise Wilcoxon test with Bonferroni correction for Age on AWS and Adience129 A.31 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on Azure and Adience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 A.32 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on GCP and Adience130 A.33 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on MogFace and Adience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 A.34 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on TinaFace and Adience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 A.35 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on Yolov5 and Adience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 A.36 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on AWS and CCD 131 A.37 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on Azure and CCD 132 A.38 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on GCP and CCD 132 A.39 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on MogFace and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 A.40 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on TinaFace and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 A.41 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on Yolov5 and CCD133 A.42 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on AWS and MIAP133 A.43 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on Azure and MIAP133 A.44 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on GCP and MIAP 134 A.45 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on MogFace and MIAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 A.46 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on TinaFace and MIAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 A.47 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on Yolov5 and MIAP134 A.48 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on AWS and UTK 135 A.49 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on Azure and UTK 135 A.50 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on GCP and UTK 135 x A.51 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on MogFace and UTK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 A.52 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on TinaFace and UTK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 A.53 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on Yolov5 and UTK136 A.54 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on AWS and Adience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 A.55 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on Azure and Adience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 A.56 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on GCP and Adience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 A.57 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on MogFace and Adience . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 A.58 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on TinaFace and Adience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 A.59 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on Yolov5 and Adience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 A.60 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on AWS and CCD138 A.61 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on Azure and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 A.62 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on GCP and CCD138 A.63 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on MogFace and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 A.64 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on TinaFace and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 A.65 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on Yolov5 and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 A.66 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on AWS and MIAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 A.67 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on Azure and MIAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 A.68 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on GCP and MIAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 A.69 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on MogFace and MIAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 A.70 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on TinaFace and MIAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 A.71 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on Yolov5 and MIAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 A.72 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on AWS and UTK141 A.73 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on Azure and UTK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 A.74 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on GCP and UTK141 A.75 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on MogFace and UTK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 xi A.76 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on TinaFace and UTK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 A.77 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on Yolov5 and UTK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 A.78 AP. Pairwise Wilcoxon test with Bonferroni correction for Skin Type on AWS and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 A.79 AP. Pairwise Wilcoxon test with Bonferroni correction for Skin Type on Azure and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 A.80 AP. 
Pairwise Wilcoxon test with Bonferroni correction for Skin Type on GCP and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 A.81 AP. Pairwise Wilcoxon test with Bonferroni correction for Skin Type on MogFace and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 A.82 AP. Pairwise Wilcoxon test with Bonferroni correction for Skin Type on TinaFace and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 A.83 AP. Pairwise Wilcoxon test with Bonferroni correction for Skin Type on Yolov5 and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 A.84 AP. Pairwise Wilcoxon test with Bonferroni correction for Skin Type and the interaction with Lighting on AWS and CCD . . . . . . . . . . . . . . . . . . . . 144 A.85 AP. Pairwise Wilcoxon test with Bonferroni correction for Skin Type and the interaction with Lighting on Azure and CCD . . . . . . . . . . . . . . . . . . . . 144 A.86 AP. Pairwise Wilcoxon test with Bonferroni correction for Skin Type and the interaction with Lighting on GCP and CCD . . . . . . . . . . . . . . . . . . . . 145 A.87 AP. Pairwise Wilcoxon test with Bonferroni correction for Skin Type and the interaction with Lighting on MogFace and CCD . . . . . . . . . . . . . . . . . . 145 A.88 AP. Pairwise Wilcoxon test with Bonferroni correction for Skin Type and the interaction with Lighting on TinaFace and CCD . . . . . . . . . . . . . . . . . . 145 A.89 AP. Pairwise Wilcoxon test with Bonferroni correction for Skin Type and the interaction with Lighting on Yolov5 and CCD . . . . . . . . . . . . . . . . . . . 145 A.90 AP. Pairwise Wilcoxon test with Bonferroni correction for Skin Type and the interaction with Age and Gender on AWS and CCD . . . . . . . . . . . . . . . . 146 A.91 AP. Pairwise Wilcoxon test with Bonferroni correction for Skin Type and the interaction with Age and Gender on Azure and CCD . . . . . . . . . . . . . . . 146 A.92 AP. Pairwise Wilcoxon test with Bonferroni correction for Skin Type and the interaction with Age and Gender on GCP and CCD . . . . . . . . . . . . . . . . 146 A.93 AP. Pairwise Wilcoxon test with Bonferroni correction for Skin Type and the interaction with Age and Gender on MogFace and CCD . . . . . . . . . . . . . . 146 A.94 AP. Pairwise Wilcoxon test with Bonferroni correction for Skin Type and the interaction with Age and Gender on TinaFace and CCD . . . . . . . . . . . . . . 147 A.95 AP. Pairwise Wilcoxon test with Bonferroni correction for Skin Type and the interaction with Age and Gender on Yolov5 and CCD . . . . . . . . . . . . . . . 147 A.96 AP. Pairwise Wilcoxon test with Bonferroni correction for Lighting on AWS and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 A.97 AP. Pairwise Wilcoxon test with Bonferroni correction for Lighting on Azure and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 xii A.98 AP. Pairwise Wilcoxon test with Bonferroni correction for Lighting on GCP and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 A.99 AP. Pairwise Wilcoxon test with Bonferroni correction for Lighting on MogFace and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 A.100AP. Pairwise Wilcoxon test with Bonferroni correction for Lighting on TinaFace and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 A.101AP. 
Pairwise Wilcoxon test with Bonferroni correction for Lighting on Yolov5 and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 xiii List of Figures 2.1 A toy example showing robustness bias. A.) the classifier (solid line) has 100% accuracy for blue and green points. However for a budget τ (dotted lines), 70% of points belonging to the “round” subclass (showed by dark blue and dark green) will get attacked while only 30% of points in the “cross” subclass will be attacked. This shows a clear bias against the “round” subclass which is less robust in this case. B.) shows a different classifier for the same data points also with 100% accuracy. However, in this case, with the same budget τ , 30% of both “round” and “cross” subclass will be attacked, thus being less biased. . . . . . . . . . . . . . . 16 2.2 An example of multinomial logistic regression. . . . . . . . . . . . . . . . . . . 17 2.3 An example of robustness bias in the UTKFace dataset. A model trained to predict age group from faces is fooled for an inputs belonging to certain subgroups (black and female in this example) for a given perturbation, but is robust for inputs belonging to other subgroups (white and male in this example) for the same magnitude of perturbation. We use the UTKFace dataset to make a broader point that robustness bias can cause harms. In the specific case of UTKFace (and similar datasets), the task definition of predicting age from faces itself is flawed, as has been noted in many previous studies (Cramer et al., 2019; Crawford and Paglen, 2019; Buolamwini and Gebru, 2018). . . . . . . . . . . . . . . . . . . . . . . . . 19 2.4 For each dataset, we plot Îc(τ) for each class c in each dataset. Each blue line represents one class. The red line represents the mean of the blue lines, i.e.,∑ c∈C Îc(τ) for each τ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.5 For each dataset, we plot Îτs for each sensitive attribute s in each dataset. . . . . . 21 2.6 UTKFace partitioned by race. We can see that across models, that different populations are at different levels of robustness as calculated by different proxies (DeepFool on the left, CarliniWagner in the middle and Randomized Smoothing on the right). This suggests that robustness bias is an important criterion to consider when auditing models for fairness. . . . . . . . . . . . . . . . . . . . . . . . . . 40 2.7 Depiction of σDF P and σCW P for the UTKFace dataset with partitions corresponding to the (1) class labels C and the, (2) gender, and (3) race/ethnicity. These values are reported for all five convolutional models both at the beginning of their training (after one epoch) and at the end. We observe that, largely, the signedness of the functions are consistent between the five models and also across the training cycle. 41 xiv 2.8 Depiction of σRS P for the UTKFace dataset with partitions corresponding to the (1) class labels C and the, (2) gender, and (3) race/ethnicity. A more negative value indicates less robustness bias for the partition. Darker regions indicate high robustness bias. We observe that the trend is largely consistent amongst models and also similar to the trend observed when using adversarial attacks to measure robustness bias (see Figure 2.7). . . . . . . . . . . . . . . . . . . . . . . . . . . 41 2.9 In the unregularized model, “truck” in CIFAR10 tends to be more robust than other classes (2.9a); however, using ADVERM reduces that disparity (2.9b). 
We see similar behavior for UTKFace (2.6 & 2.9d). . . . . . . . . . . . . . . . . . . 41 2.10 Depiction of σDF P and σCW P for the CIFAR10 dataset with partitions corresponding to the class labels C. These values are reported for all five convolutional models both at the beginning of their training (after one epoch) and at the end. We observe that, largely, the signedness of the functions are consistent between the five models and also across the training cycle. . . . . . . . . . . . . . . . . . . . . . . . . . . 42 2.11 Depiction of σRS P for the CIFAR10 dataset with partitions corresponding to the class labels C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 2.12 Depiction of σDF P and σCW P for the CIFAR100 dataset with partitions correspond- ing to the class labels C. These values are reported for all five convolutional models both at the beginning of their training (after one epoch) and at the end. We observe that, largely, the signedness of the functions are consistent between the five models and also across the training cycle. . . . . . . . . . . . . . . . . . . . . . . . . . . 43 2.13 Depiction of σRS P for the CIFAR100 dataset with partitions corresponding to the class labels C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 2.14 Depiction of σDF P and σCW P for the CIFAR100super dataset with partitions corre- sponding to the class labels C. These values are reported for all five convolutional models both at the beginning of their training (after one epoch) and at the end. We observe that, largely, the signedness of the functions are consistent between the five models and also across the training cycle. . . . . . . . . . . . . . . . . . . . 44 2.15 Depiction of σRS P for the CIFAR100super dataset with partitions corresponding to the class labels C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 2.16 Depiction of σDF P and σCW P for the Adience dataset with partitions corresponding to the (1) class labels C and the and (2) gender. These values are reported for all five convolutional models both at the beginning of their training (after one epoch) and at the end. We observe that, largely, the signedness of the functions are consistent between the five models and also across the training cycle. . . . . . 45 2.17 Depiction of σRS P for the Adience dataset with partitions corresponding to the (1) class labels C and the and (2) gender. . . . . . . . . . . . . . . . . . . . . . . . . 45 2.18 [Regularization] CIFAR10 - Deep CNN . . . . . . . . . . . . . . . . . . . . . . 46 2.19 [Regularization] CIFAR10 - Resnet50 . . . . . . . . . . . . . . . . . . . . . . . 47 2.20 [Regularization] CIFAR10 - VGG19 . . . . . . . . . . . . . . . . . . . . . . . . 48 2.21 [Regularization] UTKFace partitioned by race - UTK Classifier. . . . . . . . . . 49 2.22 [Regularization] UTKFace partitioned by race - Resnet50. . . . . . . . . . . . . 49 2.23 [Regularization] UTKFace partitioned by race - VGG. . . . . . . . . . . . . . . 49 xv 3.1 Our benchmark consists of 5,066,312 images of the 15 types of algorithmically generated corruptions produced by ImageNet-C. We use data from four datasets (Adience, CCD, MIAP, and UTKFace) and present examples of corruptions from each dataset here. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 3.2 Depiction of how Average Precision (AP) metric is calculated by using clean image as ground truth. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
66 3.3 Overall performance (AP) of each model on each dataset. . . . . . . . . . . . . . 66 3.4 Gender disparity plots for each dataset and model. Values below 1 indicate that predominantly feminine presenting subjects are more susceptible to noise- induced changes. Values above 1 indicate that predominantly masculine presenting subjects are are more susceptible to noise-induced changes. Error bars indicate 95% confidence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 3.5 Age disparity plots for each dataset and model. Values greater than 1 indicate that older subjects are more susceptible to noise-induced changes compared to middle aged subjects. Error bars indicate 95% confidence. . . . . . . . . . . . . . . . . . 71 3.6 Skin type disparity plots for CCD. Values above 1 indicate that darker-skinned subjects are more susceptible to noise-induced changes. Error bars indicate 95% confidence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 3.7 Lighting disparity plots for CCD. Values above 1 indicate that dimly-lit subjects are more susceptible to noise-induced changes. Error bars indicate 95% confidence. 72 4.1 Overview of our methodology. . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 4.2 (Left) CelebA (Right) VGGFace2. Error-Rank Disparity Pareto front of the architectures with lowest error (< 0.3). Models in the lower left corner are better. The Pareto front is notated with a dashed line. Other points are architecture and hyperparameter combinations which are not Pareto-optimal. . . . . . . . . . . . 90 4.3 SMAC discovers the above building blocks with (a) corresponding to architecture with CosFace, with SGD optimizer and learning rate of 0.2813 as hyperparamters (b) corresponding to CosFace, with SGD as optimizer and learning rate of 0.32348 and (c) corresponding to CosFace, with AdamW as optimizer and learning rate of 0.0006 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 4.4 Pareto front of the models discovered by SMAC and the rank-1 models from timm for the (a) validation and (b) test sets on CelebA. Each point corresponds to the mean and standard error of an architecture after training for 3 seeds. The SMAC models Pareto-dominate the top performing timm models (Error < 0.1). . . . . 96 4.5 Pareto front of the models discovered by SMAC and the rank-1 models from timm for the (a) validation and (b) test sets on VGGFace2. Each point corresponds to the mean and standard error of an architecture after training for 3 seeds. The SMAC models Pareto-dominate the top performing timm models (Error<0.1). . . 97 4.6 DPN block (left) vs. our searchable block (right). . . . . . . . . . . . . . . . . . 98 4.7 Replication of CelebA 4.2 with all data points. Error-Rank Disparity Pareto front of the architectures with any non-trivial error. Models in the lower left corner are better. The Pareto front is notated with a dashed line. Other points are architecture and hyperparameter combinations which are not Pareto-dominant. . . . . . . . . 99 xvi 4.8 Replication of VGGFace2 4.2 with all data points. Error-Rank Disparity Pareto front of the architectures with any non-trivial error. Models in the lower left corner are better. The Pareto front is notated with a dashed line. Other points are architecture and hyperparameter combinations which are not Pareto-dominant. . . 100 4.9 Replication of 4.7 on the CelebA validation dataset with Ratio of Ranks (left) and Ratio of Errors (right) metrics. . . . . . . . . 
. . . 101
4.10 Replication of 4.7 on the CelebA validation dataset with the Disparity in accuracy metric. . . . 102
4.11 Replication of 4.7 on the CelebA validation dataset with the Ratio in accuracy metric. . . . 103
4.12 Replication of 4.8 on the VGGFace2 validation dataset with Ratio of Ranks metric. . . . 104
4.13 Replication of 4.8 on the VGGFace2 validation dataset with Ratio of Errors metric. . . . 105
4.14 Replication of 4.8 on the VGGFace2 validation dataset with the Disparity in accuracy metric. . . . 106
4.15 Replication of 4.8 on the VGGFace2 validation dataset with the Ratio in accuracy metric. . . . 107
4.16 Models trained on CelebA (left) and VGGFace2 (right) evaluated on a dataset with a different protected attribute, specifically on RFW with the racial attribute, and with the Rank Disparity metric. The novel architectures outperform the existing architectures in both settings. . . . 110
4.17 TSNE plots for models pretrained on VGGFace2 on the test set: (a) SMAC model, last layer; (b) DPN MagFace, last layer; (c) SMAC model, second-to-last layer; (d) DPN MagFace, second-to-last layer. Note the better linear separability for DPN MagFace in comparison with the SMAC model. . . . 112

He talks, he talks, how he talks, and waves his arms. He fills up ornate vases. Twenty-seven an hour. And keeps the words in with cork stoppers (If you hold the vases to your ears you can hear the muted syllables colliding into each other). I want vases, some of them ornate, But simple ones too. And most of them Will have flowers

On Verbosity
Annette Ryan

Chapter 1: Introduction

Artificial Intelligence (AI) has emerged as a transformative technology with the potential to revolutionize various aspects of our lives. From personalized recommendations to autonomous vehicles, AI systems are becoming increasingly prevalent in our daily interactions. However, as AI becomes more advanced and integrated into society, concerns about its responsible use have gained significant attention. Responsible AI refers to the ethically informed and transparent development, deployment, and utilization of AI technologies, ensuring that they are designed and used in a manner that respects human values, rights, and well-being.

The need for responsible AI arises from the potential risks associated with its widespread adoption. AI systems can inadvertently perpetuate bias and discrimination and reinforce societal inequalities if not developed and implemented with care. For example, biased training data can lead to discriminatory outcomes, such as AI-powered hiring algorithms favoring certain demographics. Machine learning is applied to a wide variety of socially consequential domains, e.g., credit scoring, fraud detection, hiring decisions, criminal recidivism, loan repayment, and face recognition (Mukerjee et al., 2002; Ngai et al., 2011; Learned-Miller et al., 2020; Barocas et al., 2017), with many of these applications impacting the lives of people more than ever — often in biased ways (Buolamwini and Gebru, 2018; Joo and Kärkkäinen, 2020; Wang et al., 2020b).
Dozens of formal definitions of fairness have been proposed (Narayanan, 2018), and many algorithmic techniques have been developed for debiasing according to these definitions (Verma and Rubin, 2018). Automated decision-making systems that are driven by data are being used in a variety of different real-world applications, creating the risk that these systems will perpetuate and/or create harms to people. In many cases, these systems make decisions on data points that represent humans (e.g., targeted ads (Speicher et al., 2018; Ribeiro et al., 2019), personalized recommendations (Singh and Joachims, 2018; Biega et al., 2018), hiring (Schumann et al., 2019, 2020), credit scoring (Khandani et al., 2010), or recidivism prediction (Chouldechova, 2017)). In such scenarios, there is often concern regarding the fairness of outcomes of the systems (Barocas and Selbst, 2016; Galhotra et al., 2017). This has resulted in a growing body of work from the Responsible AI community that—drawing on prior legal and philosophical doctrine—aims to define, measure, and (attempt to) mitigate manifestations of unfairness in automated systems (Chouldechova, 2017; Feldman et al., 2015a; Leben, 2020; Binns, 2017). Responsible AI aims to address such concerns by emphasizing fairness, accountability, transparency, and inclusivity in AI development and deployment processes.

One crucial aspect of responsible AI is fairness. AI systems should be designed to treat all individuals fairly and without discrimination. This means avoiding bias in data collection, ensuring diverse representation during the development process, and regularly auditing AI algorithms for unintended biases. Additionally, responsible AI involves being accountable for the outcomes of AI systems. Developers and organizations should take responsibility for any harm caused by their AI technologies and implement mechanisms for redress and accountability.

Transparency is another fundamental principle of responsible AI. Users and stakeholders should have access to understandable and explainable AI systems. This means that AI algorithms should be designed in a way that allows for clear explanations of their decision-making processes. Transparent AI fosters trust, enables users to understand how AI systems work, and helps identify and rectify any potential biases or errors. Responsible AI also emphasizes inclusivity, ensuring that the benefits and opportunities created by AI are accessible to all. This involves considering the needs and perspectives of diverse populations during AI development, addressing issues of the digital divide and accessibility, and actively working towards reducing biases and disparities present in AI systems.

Another emerging aspect of responsible AI is the robustness of systems. In traditional machine learning, robustness refers to the ability of a model to maintain its performance and generalization capabilities even in the face of uncertainties, adversarial attacks, or variations in the input data. A robust model is not only accurate on the training data but also exhibits resilience to perturbations, noise, and outliers that it may encounter during deployment. The importance of robustness arises from the fact that real-world data is often noisy, incomplete, and subject to unpredictable variations. While traditional machine learning algorithms focus on optimizing for average-case scenarios, robust machine learning aims to handle the worst-case scenarios and mitigate the risks associated with unpredictable inputs.
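To make this notion concrete, the following is a minimal sketch (not code from this thesis) that trains a toy classifier and compares clean versus noise-perturbed accuracy separately for two subgroups. The synthetic data, the subgroup structure, and the noise scale are all illustrative assumptions.

```python
# Minimal sketch: contrast clean and noise-perturbed accuracy per subgroup for a
# toy classifier. The dataset, groups, and noise scale are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data: two subgroups generated at different distances from the boundary.
n = 2000
group = rng.integers(0, 2, size=n)            # a stand-in "sensitive attribute"
margin = np.where(group == 0, 2.0, 0.5)       # group 1 sits closer to the boundary
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 2))
X[:, 0] = (2 * y - 1) * margin + 0.3 * rng.normal(size=n)

clf = LogisticRegression().fit(X, y)

def group_accuracy(X_eval, y_eval, g):
    """Accuracy restricted to the rows where group == g."""
    mask = group == g
    return clf.score(X_eval[mask], y_eval[mask])

sigma = 0.8                                   # perturbation scale (assumed)
X_noisy = X + rng.normal(scale=sigma, size=X.shape)

for g in (0, 1):
    clean = group_accuracy(X, y, g)
    noisy = group_accuracy(X_noisy, y, g)
    print(f"group {g}: clean acc {clean:.3f}, noisy acc {noisy:.3f}, drop {clean - noisy:.3f}")
```

Even in this toy setting, the group generated closer to the decision boundary loses noticeably more accuracy under the same perturbation, which is the kind of disparity the later chapters formalize and measure.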
Robustness in machine learning encompasses various dimensions, each presenting unique challenges and trade-offs. One prominent aspect is adversarial robustness, which examines the model’s vulnerability to adversarial attacks, where malicious actors deliberately manipulate the input data to deceive or mislead the model’s predictions. Adversarial attacks have demonstrated the susceptibility of machine learning models to subtle perturbations that are often imperceptible to human observers. Developing models that are resistant to such attacks is crucial for security-sensitive applications. I will explore this topic in depth in Chapter 2.

Most of the initial work on fairness in machine learning considered notions that were one-shot and treated the model and data distribution as static (Zafar et al., 2019, 2017c; Chouldechova, 2017; Barocas and Selbst, 2016; Dwork et al., 2012; Zemel et al., 2013). Recently, there has been more work exploring notions of fairness that are dynamic and consider the possibility that the world (i.e., the model as well as data points) might change over time (Heidari et al., 2019; Heidari and Krause, 2018; Hashimoto et al., 2018; Liu et al., 2018b). Our proposed notion of robustness bias differs subtly from existing one-shot and dynamic notions of fairness in that it requires that each partition of the population be equally robust to imperceptible changes in the input (e.g., noise, adversarial perturbations, etc.).

Another dimension of robustness focuses on handling noisy or corrupted data. Real-world datasets may contain outliers, missing values, or measurement errors, which can significantly impact the performance of machine learning models. Robust techniques that can effectively handle such anomalies and preserve the model’s accuracy and reliability are essential. We explore robustness to noisy or corrupted data in Chapter 3 by auditing face detection systems, showing deeper and more pernicious forms of robustness bias in these systems.

Face detection identifies the presence and location of faces in images and video. Automated face detection is a core component of myriad systems—including face recognition technologies (FRT), wherein a detected face is matched against a database of faces, typically for identification or verification purposes. FRT-based systems are widely deployed (Hartzog, 2020; Derringer, 2019; Weise and Singer, 2020). Automated face recognition enables capabilities ranging from the relatively morally neutral (e.g., searching for photos on a personal phone (Google, 2021a)) to the morally laden (e.g., widespread citizen surveillance (Hartzog, 2020), or target identification in warzones (Marson and Forrest, 2021)). Legal and social norms regarding the usage of FRT are evolving (e.g., Grother et al., 2019). For example, in June 2021, the first county-wide ban on its use for policing (see, e.g., Garvie, 2016) went into effect in the US (Gutman, 2021). Some use cases for FRT will be deemed socially repugnant and thus be either legally or de facto banned from use; yet, it is likely that pervasive use of facial analysis will remain—albeit with more guardrails than today (Singer, 2018). One such guardrail that has spurred positive, though insufficient, improvements and widespread attention is the use of benchmarks.
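Before turning to existing benchmarks, here is a heavily simplified sketch of how a corruption-robustness audit of a face detector can be organized. It is not the released benchmark code: the detect_faces stub, the box coordinates, and the single Gaussian-noise corruption (standing in for the fifteen ImageNet-C corruption types) are illustrative assumptions, and a single IoU check stands in for full average precision.

```python
# Hedged sketch of the benchmark's evaluation idea, not the released benchmark code.
# Detections on the *clean* image act as ground truth, and detections on corrupted
# copies are scored against them (here with a single IoU check instead of full AP).
import numpy as np

def corrupt_gaussian(image: np.ndarray, severity: int) -> np.ndarray:
    """Additive Gaussian noise at ImageNet-C-style severity levels 1-5 (assumed scales)."""
    scale = [0.04, 0.06, 0.08, 0.09, 0.10][severity - 1]
    noisy = image / 255.0 + np.random.normal(scale=scale, size=image.shape)
    return np.clip(noisy, 0.0, 1.0) * 255.0

def iou(box_a, box_b) -> float:
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def detect_faces(image: np.ndarray) -> list:
    """Placeholder for a real detector (TinaFace, MogFace, a cloud API, ...)."""
    return [[10, 10, 50, 60]]   # one fake box so the sketch runs end to end

image = np.random.randint(0, 256, size=(128, 128, 3)).astype(float)
clean_boxes = detect_faces(image)

for severity in range(1, 6):
    noisy_boxes = detect_faces(corrupt_gaussian(image, severity))
    score = np.mean([max(iou(c, n) for n in noisy_boxes) for c in clean_boxes])
    print(f"severity {severity}: mean IoU vs. clean detections = {score:.2f}")
```

In the actual benchmark of Chapter 3, scores of this kind are aggregated into average precision per image, with detections on the clean image serving as ground truth, and then compared across demographic groups such as age, gender, skin type, and lighting.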
One prominent example of such a benchmark came in late 2019, when the US National Institute of Standards and Technology (NIST) adapted its venerable Face Recognition Vendor Test (FRVT) to explicitly include concerns for demographic effects (Grother et al., 2019), ensuring such concerns propagate into industry systems. Yet, differential treatment of groups by FRT has been known for at least a decade (e.g., Klare et al., 2012; El Khiyari and Wechsler, 2016), and more recent work spearheaded by Buolamwini and Gebru (2018) uncovers unequal performance at the phenotypic subgroup level. That latter work brought widespread public, and thus burgeoning regulatory, attention to bias in FRT (e.g., Lohr, 2018; Kantayya, 2020).

One as-yet unexplored benchmark examines the bias present in a model’s robustness (e.g., to noise, or to different lighting conditions), both in aggregate and with respect to different dimensions of the population on which it will be used. Many detection and recognition systems are not built in house, instead adapting an existing academic model or making use of commercial cloud-based “ML as a Service” (MLaaS) platforms offered by tech giants such as Amazon, Microsoft, Google, Megvii, etc. I will present a first-of-its-kind, detailed robustness benchmark of six different face detection models, covering fifteen types of realistic noise (Hendrycks and Dietterich, 2019), on four well-known datasets. Across all the datasets and systems, I generally find that photos of individuals who are older, masculine presenting, of darker skin type, or photographed in dim lighting are more susceptible to errors than their counterparts in other identities.

Addressing robustness in machine learning involves a combination of algorithmic design, feature engineering, and data preprocessing techniques. These approaches seek to make models more resilient to uncertainties and perturbations, either by introducing regularization mechanisms, utilizing ensemble methods, or leveraging domain knowledge to guide the learning process.

In this thesis, I’ll broaden and deepen the connection between robust machine learning and responsible AI. In Chapter 2, I define a new fairness metric in responsible AI which quantifies the disparity between groups with respect to how susceptible they are to adversarial attack. In Chapter 3, I audit existing academic and commercial face detection systems for their robustness to common types of natural noise. Finally, in Chapter 4, I expand the conceptualization of robustness to include notions of model architecture and hyperparameters, and propose a novel bias mitigation technique which employs neural architecture search to find fairer architectures.

Conventional wisdom is that in order to effectively mitigate bias, we should start by selecting a model architecture and set of hyperparameters which are optimal in terms of accuracy, and then apply a mitigation strategy to reduce bias while minimally impacting accuracy. As datasets become larger and training becomes more computationally intensive, especially in the case of computer vision and natural language processing, it is becoming increasingly common in applications to start with a very large pretrained model and then fine-tune it for the specific use case (Chi et al., 2017; Käding et al., 2016; Ouyang et al., 2016; Too et al., 2019).
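Under that conventional pipeline, fairness interventions are bolted onto a model whose architecture and hyperparameters are already fixed. The sketch below is a toy illustration of the flavor of a post-processing step, not a method from the literature cited later: per-group decision thresholds are tuned on a fixed model's scores to roughly equalize true-positive rates. The score distributions, groups, and target rate are synthetic assumptions.

```python
# Toy post-processing mitigation on a *fixed* model's scores: pick per-group
# thresholds that approximately equalize true-positive rates. Scores are synthetic.
import numpy as np

rng = np.random.default_rng(1)
n = 5000
group = rng.integers(0, 2, size=n)
y = rng.integers(0, 2, size=n)
# A fixed model's scores: systematically less separated for group 1.
sep = np.where(group == 0, 2.0, 0.8)
scores = rng.normal(loc=y * sep, scale=1.0)

def tpr(threshold, mask):
    """True-positive rate within a group at a given decision threshold."""
    pos = (y == 1) & mask
    return ((scores >= threshold) & pos).sum() / max(pos.sum(), 1)

target_tpr = 0.80                      # assumed policy target
thresholds = {}
for g in (0, 1):
    mask = group == g
    grid = np.quantile(scores[mask], np.linspace(0.01, 0.99, 99))
    # choose the candidate threshold whose group TPR is closest to the target
    thresholds[g] = min(grid, key=lambda t: abs(tpr(t, mask) - target_tpr))
    print(f"group {g}: threshold {thresholds[g]:.2f}, TPR {tpr(thresholds[g], mask):.2f}")
```

Whatever thresholds such a step picks, the underlying representation, and any bias encoded in it, is left untouched.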
While existing methods for de-biasing machine learning systems use a fixed neural architecture and hyperparameter setting, I instead ask a fundamental question which has received little attention: how much does model bias arise from the architecture and hyperparameters? I further ask whether we can exploit the extensive research in the fields of neural architecture search (NAS) (Elsken et al., 2019) and hyperparameter optimization (HPO) (Feurer and Hutter, 2019) to search for more inherently fair models. Many debiasing algorithms fit into one of three (or arguably four (Savani et al., 2020)) categories: pre-processing (e.g., Feldman et al., 2015b; Ryu et al., 2018; Quadrianto et al., 2019; Wang and Deng, 2020), in-processing (e.g., Zafar et al., 2017b, 2019; Donini et al., 2018; Goel et al., 2018; Padala and Gujar, 2020; Wang and Deng, 2020; Martinez et al., 2020; Nanda et al., 2021; Diana et al., 2020; Lahoti et al., 2020), or post-processing (e.g., Hardt et al., 2016; Wang et al., 2020b). I, however, pose a simple question: what if these approaches are using an architecture which is inherently less fair than another? To explore this topic, I employ neural architecture search.
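Concretely, the idea is to treat the architecture and its training hyperparameters as a single joint configuration space and to score every sampled configuration on both error and a fairness disparity. The sketch below illustrates that framing with a plain random search and a placeholder evaluation function; the search space, the train_and_evaluate stub, and the metrics are illustrative assumptions, and the thesis itself uses SMAC-based multi-objective, multi-fidelity optimization rather than random search.

```python
# Hedged sketch of searching architectures + hyperparameters for fairness: sample
# joint configurations, score each on (error, disparity), and keep the Pareto front.
import random

SEARCH_SPACE = {                      # illustrative, not the thesis's search space
    "backbone":      ["dpn_like", "resnet_like", "searchable_block"],
    "head":          ["cosface", "arcface", "magface"],
    "optimizer":     ["sgd", "adamw"],
    "learning_rate": [0.0005, 0.005, 0.05, 0.3],
}

def sample_config():
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def train_and_evaluate(config):
    """Placeholder: a real run would train the model and report
    (validation error, fairness disparity such as rank disparity)."""
    return random.uniform(0.05, 0.3), random.uniform(0.0, 0.2)

def pareto_front(results):
    """Keep configurations not dominated on (error, disparity); lower is better on both."""
    front = []
    for cfg, err, disp in results:
        dominated = any(e <= err and d <= disp and (e, d) != (err, disp)
                        for _, e, d in results)
        if not dominated:
            front.append((cfg, err, disp))
    return front

random.seed(0)
results = []
for _ in range(30):
    cfg = sample_config()
    err, disp = train_and_evaluate(cfg)
    results.append((cfg, err, disp))

for cfg, err, disp in pareto_front(results):
    print(f"error={err:.3f}  disparity={disp:.3f}  {cfg}")
```

Chapter 4 describes the actual search spaces (Tables 4.2 and 4.3) and shows that the resulting models Pareto-dominate standard architectures on both error and disparity.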
This work challenges the assumption that bias mitigation pipelines should default to existing popular architectures which were optimized for accuracy — instead I'll show that it may be more beneficial to begin with a fairer architecture as the foundation of such pipelines.

You gave up on expecting things to make sense Knew at some point this would not be a puzzle you would ever complete So you learn to hold it all bodies that are home and how the days unfold one after the next A continuous scroll of doing your best and trusting what the darkness will hold Call it Love

Jena Schwartz

Chapter 2: Adversarial Robustness

This work was done in collaboration with my co-first author Vedant Nanda, as well as Sahil Singla, John P. Dickerson, and Soheil Feizi, and was presented at FAccT, 2021 (Nanda et al., 2021).

Deep neural networks (DNNs) are increasingly used in real-world applications (e.g., facial recognition). This has resulted in concerns about the fairness of decisions made by these models. Various notions and measures of fairness have been proposed to ensure that a decision-making system does not disproportionately harm (or benefit) particular subgroups of the population. In this chapter, we argue that traditional notions of fairness that are based only on models' outputs are not sufficient when the model is vulnerable to adversarial attacks. We argue that in some cases, it may be easier for an attacker to target a particular subgroup, resulting in a form of robustness bias. We show that measuring robustness bias is a challenging task for DNNs and propose two methods to measure this form of bias. We then conduct an empirical study on state-of-the-art neural networks on commonly used real-world datasets such as CIFAR-10, CIFAR-100, Adience, and UTKFace and show that in almost all cases there are subgroups (in some cases based on sensitive attributes like race, gender, etc.) which are less robust and are thus at a disadvantage. We argue that this kind of bias arises due to both the data distribution and the highly complex nature of the learned decision boundary in the case of DNNs, thus making mitigation of such biases a non-trivial task. Our results show that robustness bias is an important criterion to consider while auditing real-world systems that rely on DNNs for decision making.

2.1 Introduction

Automated decision-making systems that are driven by data are being used in a variety of different real-world applications. In many cases, these systems make decisions on data points that represent humans (e.g., targeted ads (Speicher et al., 2018; Ribeiro et al., 2019), personalized recommendations (Singh and Joachims, 2018; Biega et al., 2018), hiring (Schumann et al., 2019, 2020), credit scoring (Khandani et al., 2010), or recidivism prediction (Chouldechova, 2017)). In such scenarios, there is often concern regarding the fairness of outcomes of the systems (Barocas and Selbst, 2016; Galhotra et al., 2017). This has resulted in a growing body of work from the nascent Fairness, Accountability, Transparency, and Ethics (FATE) community that—drawing on prior legal and philosophical doctrine—aims to define, measure, and (attempt to) mitigate manifestations of unfairness in automated systems (Chouldechova, 2017; Feldman et al., 2015a; Leben, 2020; Binns, 2017).
Most of the initial work on fairness in machine learning considered notions that were one-shot and considered the model and data distribution to be static (Zafar et al., 2019, 2017c; Chouldechova, 2017; Barocas and Selbst, 2016; Dwork et al., 2012; Zemel et al., 2013). Recently, there has been more work exploring notions of fairness that are dynamic and consider the possibility that the world (i.e., the model as well as data points) might change over time (Heidari et al., 2019; Heidari and Krause, 2018; Hashimoto et al., 2018; Liu et al., 2018b). Our proposed notion of robustness bias differs subtly from existing one-shot and dynamic notions of fairness in that it requires each partition of the population to be equally robust to imperceptible changes in the input (e.g., noise, adversarial perturbations, etc.).

We propose a simple and intuitive notion of robustness bias which requires subgroups of populations to be equally "robust." Robustness can be defined in multiple different ways (Szegedy et al., 2014; Goodfellow et al., 2015; Papernot et al., 2016). We take a general definition which assigns higher robustness to points that are farther away from the decision boundary. Our key contributions are as follows:

• We define a simple, intuitive notion of robustness bias that requires all partitions of the dataset to be equally robust. We argue that such a notion is especially important when the decision-making system is a deep neural network (DNN) since these have been shown to be susceptible to various attacks (Carlini and Wagner, 2017; Moosavi-Dezfooli et al., 2016). Importantly, our notion depends not only on the outcomes of the system, but also on the distribution of distances of data points from the decision boundary, which in turn is a characteristic of both the data distribution and the learning process.

• We propose different methods to measure this form of bias. Measuring the exact distance of a point from the decision boundary is a challenging task for deep neural networks, which have a highly non-convex decision boundary. This makes the measurement of robustness bias a non-trivial task. In this chapter we leverage the literature on adversarial machine learning and show that we can efficiently approximate robustness bias by using adversarial attacks and randomized smoothing to get estimates of a point's distance from the decision boundary.

• We do an in-depth analysis of robustness bias on popularly used datasets and models. Through extensive empirical evaluation we show that unfairness can exist due to different partitions of a dataset being at different levels of robustness for many state-of-the-art models that are trained on common classification datasets. We argue that this form of unfairness can happen due to both the data distribution and the learning process and is an important criterion to consider when auditing models for fairness.

2.1.1 Related Work

Fairness in ML. Models that learn from historic data have been shown to exhibit unfairness, i.e., they disproportionately benefit or harm certain subgroups (often a sub-population that shares a common sensitive attribute such as race, gender, etc.) of the population (Barocas and Selbst, 2016; Chouldechova, 2017; Khandani et al., 2010).
This has resulted in a lot of work on quantifying, measuring, and, to some extent, also mitigating unfairness (Dwork et al., 2012; Dwork and Ilvento, 2018; Zemel et al., 2013; Zafar et al., 2019, 2017c; Hardt et al., 2016; Grgić-Hlac̆a et al., 2018; Adel et al., 2019; Wadsworth et al., 2018; Saha et al., 2020; Donini et al., 2018; Calmon et al., 2017; Kusner et al., 2017; Kilbertus et al., 2017; Pleiss et al., 2017; Wang et al., 2020b). Most of these works consider notions of fairness that are one-shot—that is, they do not consider how these systems would behave over time as the world (i.e., the model and data distribution) evolves. Recently more works have taken into account the dynamic nature of these decision-making systems and consider fairness definitions and learning algorithms that fare well across multiple time steps (Heidari et al., 2019; Heidari and Krause, 2018; Hashimoto et al., 2018; Liu et al., 2018b). We take inspiration from both the one-shot and dynamic notions, but take a slightly different approach by requiring all subgroups of the population to be equally robust to minute changes in their features. These changes could either be random (e.g., natural noise in measurements) or carefully crafted adversarial noise. This is closely related to Heidari et al. (2019)'s effort-based notion of fairness; however, their notion has a very specific use case of societal-scale models, whereas our approach is more general and applicable to all kinds of models. Our work is also closely related to and inspired by Zafar et al.'s use of a regularized loss function which captures fairness notions and reduces disparity in outcomes (Zafar et al., 2019). There are major differences in both approach and application between our work and that of Zafar et al. Their disparate impact formulation aims to equalize the average distance of points to the decision boundary, $\mathbb{E}[d(x)]$; our approach, instead, aims to equalize the number of points that are "safe," i.e., $\mathbb{E}[\mathbf{1}\{d(x) > \tau\}]$ (see Section 2.3 for a detailed description). Our proposed metric is preferable for applications of adversarial attack or noisy data, the focus of our paper, whereas the metric of Zafar et al. is more applicable for an analysis of the consequence of a decision in a classification setting.

Robustness. Deep neural networks (DNNs) have been shown to be susceptible to carefully crafted adversarial perturbations which—imperceptible to a human—result in a misclassification by the model (Szegedy et al., 2014; Goodfellow et al., 2015; Papernot et al., 2016). In the context of our paper, we use adversarial attacks to approximate the distance of a data point to the decision boundary. For this we use state-of-the-art white-box attacks proposed by Moosavi-Dezfooli et al. (2016) and Carlini and Wagner (2017). Alongside the many works on adversarial attacks, there have been many recent works on provable robustness to such attacks. The high-level goal of these works is to estimate a (tight) lower bound on the distance of a point from the decision boundary (Cohen et al., 2019; Salman et al., 2019; Singla and Feizi, 2020). We leverage these methods to estimate distances from the decision boundary, which helps assess robustness bias (defined formally in Section 2.3).

Fairness and Robustness. Recent works have proposed poisoning attacks on fairness (Solans et al., 2020; Mehrabi et al., 2020). Khani and Liang (2019) analyze why noise in features can cause disparity in error rates when learning a regression.
We believe that our work is the very first to show that different subgroups of the population can have different levels of robustness, which can lead to unfairness. We hope that this will lead to more work at the intersection of these two important subfields of ML.

2.2 Heterogeneous Robustness

In a classification setting, a learner is given data $D = \{(x_i, y_i)\}_{i=1}^{N}$ consisting of inputs $x_i \in \mathbb{R}^d$ and outputs $y_i \in C$ which are labels in some set of classes $C = \{c_1, \ldots, c_k\}$. These classes form a partition on the dataset such that $D = \bigsqcup_{c_j \in C} \{(x_i, y_i) \mid y_i = c_j\}$. The goal of learning in decision boundary-based optimization is to draw delineations between points in feature space which sort the data into groups according to their class label. The learning generally tries to maximize the classification accuracy of the decision boundary choice. A learner chooses some loss function $\mathcal{L}$ to minimize on a training dataset, parameterized by parameters $\theta$, while maximizing the classification accuracy on a test dataset.

Of course there are other aspects to classification problems that have recently become more salient in the machine learning community. Considerations about the fairness of classification decisions, for example, are one such way in which additional constraints are brought into a learner's optimization strategy. In these settings, the data $D = \{(x_i, y_i, s_i)\}_{i=1}^{N}$ is imbued with metadata: a sensitive attribute $s_i \in S = \{s_1, \ldots, s_t\}$ associated with each point. Like the classes above, these sensitive attributes form a partition on the data such that $D = \bigsqcup_{s \in S} \{(x_i, y_i, s_i) \mid s_i = s\}$. Without loss of generality, we assume a single sensitive attribute. Generally speaking, learning with fairness in mind considers the output of a classifier based on the partition of data by the sensitive attribute, where some objective behavior, like minimizing disparate impact or treatment (Zafar et al., 2019), is integrated into the loss function or learning procedure to find the optimal parameters $\theta$.

Figure 2.1: A toy example showing robustness bias. A.) The classifier (solid line) has 100% accuracy for blue and green points. However, for a budget $\tau$ (dotted lines), 70% of points belonging to the "round" subclass (shown by dark blue and dark green) will get attacked while only 30% of points in the "cross" subclass will be attacked. This shows a clear bias against the "round" subclass, which is less robust in this case. B.) A different classifier for the same data points, also with 100% accuracy. However, in this case, with the same budget $\tau$, 30% of both the "round" and "cross" subclasses will be attacked, which is thus less biased.

Figure 2.2: An example of multinomial logistic regression. (a) Three-class classification problem for randomly generated data. (b) Proportion of samples which are greater than $\tau$ away from a decision boundary.

There is not a one-to-one correspondence between decision boundaries and classifier performance. For any given performance level on a test dataset, there are infinitely many decision boundaries which produce the same performance; see Figure 2.1. This raises the question: if we consider all decision boundaries or model parameters which achieve a certain performance, how do we choose among them? What are the properties of a desirable, high-performing decision boundary?
As the community has discovered, one undesirable characteristic of a decision boundary is its proximity to data which might be susceptible to adversarial attack (Goodfellow et al., 2015; Szegedy et al., 2014; Papernot et al., 2016). This provides intuition that we should prefer boundaries that are as far away as possible from example data (Suykens and Vandewalle, 1999; Boser et al., 1992).

Let us look at how this plays out in a simple example. In multinomial logistic regression, the decision boundaries are well understood and can be written in closed form. This makes it easy for us to compute how close each point is to a decision boundary. Consider, for example, a dataset and learned classifier as in Figure 2.2a. For this dataset, we observe that the brown class, as a whole, is closer to a decision boundary than the yellow or blue classes. We can quantify this by plotting the proportion of data that are greater than a distance $\tau$ away from a decision boundary, and then varying $\tau$. Let $d_\theta(x)$ be the minimal distance between a point $x$ and a decision boundary corresponding to parameters $\theta$. For a given partition $\mathcal{P}$ of a dataset $D$, such that $D = \bigsqcup_{P \in \mathcal{P}} P$, we define the function:

$$\hat{I}_P(\tau) = \frac{\big|\{(x, y) \in P \mid d_\theta(x) > \tau,\ y = \hat{y}\}\big|}{|P|}$$

If each element of the partition is uniquely defined by an element, say a class label $c$ or a sensitive attribute label $s$, we will equivalently write $\hat{I}_c(\tau)$ or $\hat{I}_s(\tau)$, respectively. We plot this over a range of $\tau$ in Figure 2.2b for the toy classification problem in Figure 2.2a. Observe that the function for the brown class decreases significantly faster than the other two classes, quantifying how much closer the brown class is to the decision boundary.

From a strictly classification-accuracy point of view, the brown class being significantly closer to the decision boundary is not of concern; all three classes achieve similar classification accuracy. However, when we move away from this toy problem and into neural networks on real data, this difference between the classes could become a potential vulnerability to exploit, particularly when we consider adversarial examples.

Figure 2.3: An example of robustness bias in the UTKFace dataset. A model trained to predict age group from faces is fooled for inputs belonging to certain subgroups (black and female in this example) for a given $\ell_2$ perturbation of 0.5, but is robust for inputs belonging to other subgroups (white and male in this example) for the same magnitude of perturbation. We use the UTKFace dataset to make a broader point that robustness bias can cause harms. In the specific case of UTKFace (and similar datasets), the task definition of predicting age from faces itself is flawed, as has been noted in many previous studies (Cramer et al., 2019; Crawford and Paglen, 2019; Buolamwini and Gebru, 2018).

2.3 Robustness Bias

Our goal is to understand how susceptible different classes are to perturbations (e.g., natural noise, adversarial perturbations). Ideally, no one class would be more susceptible than any other, but this may not be possible. We have observed that, for the same dataset, some classifiers exhibit differences between partitions in their distance to a decision boundary, and some do not.
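The distances $d_\theta(x)$ in the toy example above can be computed exactly because the decision boundaries of a multinomial logistic regression are linear. Below is a minimal numpy sketch of that computation; it assumes a weight matrix W and bias vector b from a fitted model (for example, scikit-learn's coef_ and intercept_), and the function name is my own.

```python
# Minimal sketch: exact distance to the nearest decision boundary of a linear
# (multinomial logistic regression) classifier. W, b, and names are illustrative.
import numpy as np

def distances_to_boundary(X, W, b):
    """X: (n, d) inputs; W: (k, d) class weights; b: (k,) biases.
    Returns exact L2 distances d_theta(x) and predicted labels."""
    scores = X @ W.T + b                                  # (n, k)
    preds = scores.argmax(axis=1)
    n, k = X.shape[0], W.shape[0]
    dists = np.full(n, np.inf)
    for j in range(k):
        dw = W[preds] - W[j]                              # (n, d)
        margin = scores[np.arange(n), preds] - scores[:, j]
        denom = np.linalg.norm(dw, axis=1)
        valid = (preds != j) & (denom > 0)
        # distance from x to the hyperplane where score(pred) == score(j)
        dists[valid] = np.minimum(dists[valid], margin[valid] / denom[valid])
    return dists, preds
```

For a point predicted as class $i$, the distance to the boundary shared with class $j$ is the score margin divided by $\|w_i - w_j\|_2$; the minimum over $j$ is exactly $d_\theta(x)$ for a linear model.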
There may also be one partition $\mathcal{P}$ which exhibits this discrepancy, and another partition $\mathcal{P}'$ which does not. Therefore, we make the following statement about robustness bias:

Definition 1. A dataset $D$ with a partition $\mathcal{P}$ and a classifier parameterized by $\theta$ exhibits robustness bias if there exists an element $P \in \mathcal{P}$ for which the elements of $P$ are either significantly closer to (or significantly farther from) a decision boundary than elements not in $P$.

A partition $\mathcal{P}$ may be based on sensitive attributes such as race, gender, or ethnicity—or other class labels. For example, given a classifier and a dataset with sensitive attribute "race," we might say that the classifier exhibits robustness bias if, partitioning on that sensitive attribute, the members with some particular value of "race" are, on average, substantially closer to the decision boundary than other members. We might say that a dataset, partition, and classifier do not exhibit robustness bias if, for all $P, P' \in \mathcal{P}$ and all $\tau > 0$,

$$\mathbb{P}_{(x,y) \in D}\{d_\theta(x) > \tau \mid x \in P,\ y = \hat{y}\} \approx \mathbb{P}_{(x,y) \in D}\{d_\theta(x) > \tau \mid x \in P',\ y = \hat{y}\}. \tag{2.1}$$

Intuitively, this definition requires that, for a given perturbation budget $\tau$ and a given partition element $P$, one should not have any incentive to perturb data points from $P$ over points that do not belong to $P$. Even when examining this criterion, we can see that it might be particularly hard to satisfy. Thus, we want to quantify the disparate susceptibility of each element of a partition to adversarial attack, i.e., how much farther or closer it is to a decision boundary when compared to all other points. We can do this with the following function for a dataset $D$ with partition element $P \in \mathcal{P}$ and classifier parameterized by $\theta$:

$$\mathrm{RB}(P, \tau) = \Big|\, \mathbb{P}_{x \in D}\{d_\theta(x) > \tau \mid x \in P,\ y = \hat{y}\} - \mathbb{P}_{x \in D}\{d_\theta(x) > \tau \mid x \notin P,\ y = \hat{y}\} \,\Big| \tag{2.2}$$

Observe that $\mathrm{RB}(P, \tau)$ is a large value if and only if the elements of $P$ are much more (or less) adversarially robust than elements not in $P$. We can then quantify this for each element $P \in \mathcal{P}$—but a more pernicious variable to handle is $\tau$. We propose to look at the area under the curve $\hat{I}_P$ over all $\tau$:

$$\sigma(P) = \frac{\mathrm{AUC}(\hat{I}_P) - \mathrm{AUC}\big(\sum_{P' \neq P} \hat{I}_{P'}\big)}{\mathrm{AUC}\big(\sum_{P' \neq P} \hat{I}_{P'}\big)} \tag{2.3}$$

Figure 2.4: For each dataset, we plot $\hat{I}_c(\tau)$ for each class $c$. Each blue line represents one class. The red line represents the mean of the blue lines, i.e., $\frac{1}{|C|}\sum_{c \in C} \hat{I}_c(\tau)$ for each $\tau$.

Figure 2.5: For each dataset, we plot $\hat{I}_s(\tau)$ for each sensitive attribute $s$.

Note that these notions take into account the distances of data points from the decision boundary and hence are orthogonal and complementary to other traditional notions of bias or fairness (e.g., disparate impact/disparate mistreatment (Zafar et al., 2019), etc.). This means that having lower robustness bias does not necessarily come at the cost of fairness as measured by these notions. Consider the motivating example shown in Figure 2.1: the decision boundary on the right has lower robustness bias but preserves all other common notions (e.g., Hardt et al., 2016; Dwork et al., 2012; Zafar et al., 2017c) as both classifiers maintain 100% accuracy.

2.3.1 Real-world Implications: Degradation of Quality of Service

Deep neural networks are the core of many real-world applications, for example, facial recognition, object detection, etc. In such cases, perturbations in the input can occur due to multiple factors such as noise due to the environment or malicious intent by an adversary.
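The quantities defined in Section 2.3 are straightforward to compute once per-point distances (or distance proxies) are available. Below is a minimal numpy sketch of $\hat{I}_P$, $\mathrm{RB}(P, \tau)$, and $\sigma(P)$; the function names and the trapezoidal AUC approximation are my own choices, not the exact implementation used in this chapter.

```python
# Minimal sketch of the Section 2.3 quantities, computed from per-point distances
# d_theta(x) (exact, or a proxy from Section 2.4) and boolean masks.
import numpy as np

def I_hat_P(dists, correct, in_P, taus):
    """I_hat_P(tau) over a grid of taus: fraction of P that is correctly
    classified and farther than tau from the decision boundary."""
    return np.array([np.sum(in_P & correct & (dists > t)) for t in taus]) / in_P.sum()

def RB(dists, correct, in_P, tau):
    """RB(P, tau) (Eq. 2.2): gap in the 'safe' fraction between P and its
    complement, conditioned on correctly classified points."""
    p_in = np.mean(dists[in_P & correct] > tau)
    p_out = np.mean(dists[~in_P & correct] > tau)
    return abs(p_in - p_out)

def sigma(dists, correct, labels, P_value, taus):
    """sigma(P) (Eq. 2.3), following the formula as written: the reference curve
    is the sum of the I_hat curves over all other partition elements."""
    curve_P = I_hat_P(dists, correct, labels == P_value, taus)
    curve_rest = sum(I_hat_P(dists, correct, labels == v, taus)
                     for v in np.unique(labels) if v != P_value)
    auc_P, auc_rest = np.trapz(curve_P, taus), np.trapz(curve_rest, taus)
    return (auc_P - auc_rest) / auc_rest
```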
Previous works have highlighted how harms can be caused by the degradation in quality of service for certain sub-populations (Cramer et al., 2019; Holstein et al., 2019). Figure 2.3 shows an example of inputs from the UTKFace dataset where an $\ell_2$ perturbation of 0.5 could change the predicted label for an input with race "black" and gender "female," but an input with race "white" and gender "male" was robust to the same magnitude of perturbation. In such a case, the system worked better for a certain sub-group (white, male), thus resulting in unfairness. It is important to note that we use datasets such as Adience and UTKFace (described in detail in Section 2.5) only to demonstrate the importance of having unbiased robustness. As noted in previous works, the very task of predicting age from a person's face is a flawed task definition with many ethical concerns (Cramer et al., 2019; Buolamwini and Gebru, 2018; Crawford and Paglen, 2019).

2.4 Measuring Robustness Bias

Robustness bias as defined in the previous section requires a way to measure the distance between a point and the (closest) decision boundary. For the deep neural networks in use today, a direct computation of $d_\theta(x)$ is not feasible due to their highly complicated and non-convex decision boundary. However, we show that we can leverage existing techniques from the literature on adversarial attacks to efficiently approximate $d_\theta(x)$. We describe these in more detail in this section.

2.4.1 Adversarial Attacks (Upper Bound)

For a given input and model, one can compute an upper bound on $d_\theta(x)$ by performing an optimization which alters the input image slightly so as to place the altered image into a different category than the original. Assume that for a given data point $x$ we are able to compute an adversarial image $\tilde{x}$; then the distance between these two images provides an upper bound on the distance to a decision boundary, i.e., $\|x - \tilde{x}\| \geq d_\theta(x)$. We evaluate two adversarial attacks: DeepFool (Moosavi-Dezfooli et al., 2016) and Carlini-Wagner's $\ell_2$ attack (Carlini and Wagner, 2017). We extend $\hat{I}_P$ for DeepFool and CarliniWagner as

$$\hat{I}^{DF}_P(\tau) = \frac{\big|\{(x, y) \in P \mid \tau < \|x - \tilde{x}\|,\ y = \hat{y}\}\big|}{|P|} \tag{2.4}$$

and

$$\hat{I}^{CW}_P(\tau) = \frac{\big|\{(x, y) \in P \mid \tau < \|x - \tilde{x}\|,\ y = \hat{y}\}\big|}{|P|} \tag{2.5}$$

respectively. We use similar notation to define $\sigma^{DF}(P)$ and $\sigma^{CW}(P)$ ($\sigma$ as defined in Eq. 2.3). While these methods are guaranteed to yield upper bounds on $d_\theta(x)$, they need not yield similar behavior to $\hat{I}_P$ or $\sigma(P)$. We perform an evaluation of this in Section 2.7.1.

2.4.2 Randomized Smoothing (Lower Bound)

Alternatively, one can compute a lower bound on $d_\theta(x)$ using techniques from recent works on training provably robust classifiers (Salman et al., 2019; Cohen et al., 2019). For each input, these methods calculate a radius within which the prediction of $x$ will not change (i.e., the robustness certificate). In particular, we use the randomized smoothing method (Cohen et al., 2019; Salman et al., 2019) since it is scalable to large and deep neural networks and leads to the state of the art in provable defenses. Randomized smoothing transforms the base classifier $f$ into a new smooth classifier $g$ by averaging the output of $f$ over noisy versions of $x$. This new classifier $g$ is more robust to perturbations while also having accuracy on par with the original classifier. It is also possible to calculate the radius $\delta_x$ (in the $\ell_2$ distance) within which, with high probability, a given input's prediction remains the same for the smoothed classifier (i.e., $d_\theta(x) \geq \delta_x$).
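To make the certification step concrete, the sketch below estimates the smoothed prediction and an approximate certified $\ell_2$ radius in the style of Cohen et al. (2019). The noise level, sample count, and function names are placeholders; the actual CERTIFY procedure additionally uses separate selection and estimation samples, a one-sided confidence bound on the top-class probability, and abstention.

```python
# Minimal sketch of a randomized-smoothing certificate in the style of
# Cohen et al. (2019). Plain Monte Carlo estimates are used for clarity; the
# real CERTIFY procedure uses separate selection/estimation samples, a lower
# confidence bound on p_A, and abstains when that bound is <= 1/2.
import torch
from scipy.stats import norm

@torch.no_grad()
def certified_radius(model, x, num_classes, sigma=0.25, n_samples=1000, batch=100):
    """Return the smoothed classifier's prediction for x and an approximate
    certified l2 radius delta_x = sigma * Phi^{-1}(p_A)."""
    model.eval()
    counts = torch.zeros(num_classes, dtype=torch.long)
    remaining = n_samples
    while remaining > 0:
        b = min(batch, remaining)
        noise = sigma * torch.randn((b,) + tuple(x.shape), device=x.device)
        preds = model(x.unsqueeze(0) + noise).argmax(dim=1)
        counts += torch.bincount(preds.cpu(), minlength=num_classes)
        remaining -= b
    top_class = int(counts.argmax())
    p_a = counts[top_class].item() / n_samples   # point estimate of the top-class probability
    if p_a <= 0.5:
        return top_class, 0.0                    # the real method would abstain here
    return top_class, sigma * norm.ppf(p_a)
```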
A given input $x$ is then said to be provably robust, with high probability, to a $\delta_x$ $\ell_2$-perturbation, where $\delta_x$ is the robustness certificate of $x$. For each point we use its $\delta_x$, calculated using the method proposed by Salman et al. (2019), as a proxy for $d_\theta(x)$. The magnitude of $\delta_x$ for an input is a measure of how robust that input is: inputs with higher $\delta_x$ are more robust than inputs with smaller $\delta_x$. Again, we extend $\hat{I}_P$ for randomized smoothing as

$$\hat{I}^{RS}_P(\tau) = \frac{\big|\{(x, y) \in P \mid \tau < \delta_x,\ y = \hat{y}\}\big|}{|P|} \tag{2.6}$$

We use similar notation to define $\sigma^{RS}(P)$ (see Eq. 2.3).

2.5 Empirical Evidence of Robustness Bias in the Wild

We hypothesize that there exist datasets and model architectures which exhibit robustness bias. To investigate this claim, we examine several image-based classification datasets and common model architectures.

2.5.0.1 Datasets and Model Architectures:

We perform these tests on the datasets CIFAR-10 (Krizhevsky, 2009), CIFAR-100 (Krizhevsky, 2009) (using both the 100 classes and the 20 superclasses), Adience (Eidinger et al., 2014), and UTKFace (Zhang et al., 2017). The first two are widely accepted benchmarks in image classification, while the latter two provide significant metadata about each image, permitting various partitions of the data by final classes and sensitive attributes.

CIFAR-10, CIFAR-100, CIFAR100Super. These are standard deep learning benchmark datasets. Both CIFAR-10 and CIFAR-100 contain 60,000 images in total, split into 50,000 train and 10,000 test images. The task is to classify a given image. Images are normalized with a mean and standard deviation of (0.5, 0.5, 0.5).

UTKFace. Contains images of people labeled with race, gender, and age. We split the dataset into a random 80:20 train:test split to get 4,742 test and 18,966 train samples. We bin the age into 5 age groups and convert this into a 5-class classification problem. Images are normalized with a mean and standard deviation of 0 and 1, respectively.

Adience. Contains images of people labeled with gender and age group. The task is to classify a given image into one of 8 age groups. We split the dataset into a random 80:20 train:test split to get 14,007 train and 3,445 test samples. Images are normalized with a mean and standard deviation of (0.485, 0.456, 0.406) and (0.229, 0.224, 0.225), respectively.

Our experiments were performed using PyTorch's torchvision module (Paszke et al., 2019). We first explore a simple multinomial logistic regression model which can be fully analyzed with a direct computation of the distance to the nearest decision boundary. For convolutional neural networks, we focus on Alexnet (Krizhevsky, 2014), VGG19 (Simonyan and Zisserman, 2015), ResNet50 (He et al., 2016a), DenseNet121 (Huang et al., 2017), and Squeezenet1_0 (Iandola et al., 2016), which are all available through torchvision. We use these models since they are widely used for a variety of tasks. We achieve performance that is comparable to state-of-the-art performance on these datasets for these models. Additionally, we also train some other popularly used dataset-specific architectures, such as a deep convolutional neural network (we call this Deep CNN) [1] and PyramidNet (α = 64, depth=110, no bottleneck) (Han et al., 2017) for CIFAR-10. We re-implemented Deep CNN in PyTorch and used the publicly available repo to train PyramidNet [2]. We use another deep convolutional neural network (which we refer to as Deep CNN CIFAR100) [3] and PyramidNet (α = 48, depth=164, with bottleneck) for CIFAR-100 and CIFAR-100Super.

[1] http://torch.ch/blog/2015/07/30/cifar.html
[2] https://github.com/dyhan0920/PyramidNet-PyTorch
[3] https://github.com/aaron-xichen/pytorch-playground/blob/master/cifar/model.py
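As a concrete illustration of the preprocessing described for UTKFace and Adience above, here is a minimal sketch of age binning, normalization, and an 80:20 split. The bin edges and image size are placeholders (not necessarily those used in this chapter); the normalization constants shown are the Adience ones quoted above.

```python
# Minimal preprocessing sketch for the UTKFace/Adience-style setup described above:
# bin ages into 5 groups, normalize images, and make an 80:20 train/test split.
import bisect
import torch
from torchvision import transforms

AGE_BIN_UPPER = [18, 30, 45, 60]          # placeholder edges -> 5 age groups

def age_to_group(age: int) -> int:
    return bisect.bisect_right(AGE_BIN_UPPER, age)

transform = transforms.Compose([
    transforms.Resize((224, 224)),        # placeholder input size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def split_80_20(dataset, seed=0):
    n_train = int(0.8 * len(dataset))
    return torch.utils.data.random_split(
        dataset, [n_train, len(dataset) - n_train],
        generator=torch.Generator().manual_seed(seed))
```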
For Adience and UTKFace we additionally use simple deep convolutional neural networks with multiple convolutional layers, each of which is followed by a ReLU activation, dropout, and max-pooling. As opposed to the architectures from torchvision (which are pre-trained on ImageNet), these architectures are trained from scratch on the respective datasets. We refer to them as UTK Classifier and Adience Classifier, respectively. These simple models serve two purposes: they form reasonable baselines for comparison with pre-trained ImageNet models fine-tuned on the respective datasets, and they allow us to analyze robustness bias when models are trained from scratch. Accuracy of models trained on the datasets can be found in Table 2.1.

In Sections 2.7 and 2.8 we audit these datasets and the listed models for robustness bias. In Section 2.6, we train logistic regression on all the mentioned datasets and evaluate robustness bias using an exact computation. We then show in Sections 2.7 and 2.8 that robustness bias can be efficiently approximated using the techniques mentioned in Sections 2.4.1 and 2.4.2, respectively, for much more complicated models, which are often used in the real world. We also provide a thorough analysis of the types of robustness biases exhibited by some of the popularly used models on these datasets.

Table 2.1: Test data performance (accuracy, %) of all models on different datasets.

Dataset | Deep CNN | PyramidNet | Adience Classifier | UTK Classifier | Resnet50 | Alexnet | VGG | Densenet | Squeezenet
Adience | - | - | 48.80 | - | 49.75 | 46.04 | 51.41 | 50.80 | 49.49
UTKFace | - | - | - | 66.25 | 69.82 | 68.09 | 69.89 | 69.15 | 70.73
CIFAR10 | 86.97 | 86.92 | - | - | 83.26 | 92.08 | 89.53 | 85.17 | 76.97
CIFAR100 | 59.60 | 56.42 | - | - | 55.81 | 71.31 | 64.39 | 61.05 | 40.36
CIFAR100Super | 71.78 | 67.55 | - | - | 67.27 | 80.70 | 76.06 | 71.22 | 55.16

2.6 Exact Computation in a Simple Model: Multinomial Logistic Regression

We begin our analysis by studying the behavior of multinomial logistic regression. Admittedly, this is a simple model compared to modern deep-learning-based approaches; however, it enables us to explicitly compute the exact distance to a decision boundary, $d_\theta(x)$. We fit a regression to each of our vision datasets on their native classes and plot $\hat{I}_c(\tau)$ for each dataset. Figure 2.2 shows the distributions of $\hat{I}_c(\tau)$, from which we observe three main phenomena: (1) the general shape of the curves is similar for each dataset, (2) there are classes which are significant outliers from the other classes, and (3) the range of support of $\tau$ for each dataset varies significantly. We discuss each of these individually.

First, we note that the shape of the curves for each dataset is qualitatively similar. Since the decision boundaries in multinomial logistic regression are linear delineations in the input space, it is fair to assume that this similarity in shape in Figure 2.4 can be attributed to the nature of the classifier.

Second, there are classes $c$ which receive disparate treatment under $\hat{I}_c(\tau)$. The treatment disparities are most notable in UTKFace, the superclass version of CIFAR-100, and regular CIFAR-100. This suggests that, when considering the dataset as a whole, these outlier classes are less susceptible to adversarial attack than other classes.
Further, in UTKFace, there are some classes that are considerably more susceptible to adversarial attack because a larger proportion of those classes is closer to the decision boundaries.

We also observe that the median distance to a decision boundary can vary based on the dataset. The median distance to a decision boundary for each dataset is: 0.40 for CIFAR-10; 0.10 for CIFAR-100; 0.06 for the superclass version of CIFAR-100; 0.38 for Adience; and 0.12 for UTKFace. This is no surprise, as $d_\theta(x)$ depends both on the location of the data points (which are fixed and immovable in a learning environment) and on the choice of architecture/parameters.

Finally, we consider another partition of the datasets. Above, we considered the partition of each dataset by its class labels. With the Adience and UTKFace datasets, we have an additional partition by sensitive attributes: Adience admits a partition based on gender; UTKFace admits partitions by gender and ethnicity. We note that Adience and UTKFace use categorical labels for these multidimensional and socially complex concepts. We know this to be reductive and to minimize the contextualization within which race and gender derive their meaning (Hanna et al., 2020; Buolamwini and Gebru, 2018). Further, we acknowledge the systems and notions that were used to reify such data partitions and the subsequent implications and conclusions drawn therefrom. We use these socially and systemically laden partitions to demonstrate that the functions we define, $\hat{I}_P$ and $\sigma$, depend upon how the data are divided for analysis.

To that end, the function $\hat{I}_P$ is visualized in Figure 2.5. We observe that the Adience dataset, which exhibited some adversarial robustness bias in the partition on $C$, only exhibits minor adversarial robustness bias in the partition on $S$ for the attribute "Female." On the other hand, UTKFace, which had significant adversarial robustness bias, does exhibit the phenomenon for the sensitive attribute "Black" but not for the sensitive attribute "Female." This emphasizes that adversarial robustness bias is dependent upon the dataset and the partition. We will demonstrate later that it is also dependent on the choice of classifier. First, we discuss ways to approximate $d_\theta(x)$ for more complicated models.

2.7 Evaluation of Robustness Bias using Adversarial Attacks

In Section 2.4.1, we argued that adversarial attacks can be used to obtain upper bounds on $d_\theta(x)$ which can then be used to measure robustness bias. In this section we audit some popularly used models on the datasets mentioned in Section 2.5 for robustness bias as measured using the approximation given by adversarial attacks.

2.7.1 Evaluation of $\hat{I}^{DF}_P$ and $\hat{I}^{CW}_P$

To compare the estimates of $d_\theta(x)$ by DeepFool and CarliniWagner, we first look at the signedness of $\sigma(P)$, $\sigma^{DF}(P)$, and $\sigma^{CW}(P)$. For a given partition element $P$, $\sigma(P)$ captures the disparity in robustness between points in $P$ relative to points not in $P$ (see Eq. 2.3). Considering all 151 possible partitions (based on class labels and sensitive attributes, where available) for all five datasets, both CarliniWagner and DeepFool agree with the signedness of the direct computation 125 times, i.e.,

$$\sum_{P} \mathbb{1}\big[\operatorname{sign}(\sigma(P)) = \operatorname{sign}(\sigma^{DF}(P))\big] = 125 = \sum_{P} \mathbb{1}\big[\operatorname{sign}(\sigma(P)) = \operatorname{sign}(\sigma^{CW}(P))\big].$$

Further, the mean difference between $\sigma(P)$ and the attack-based estimates, i.e., $\sigma(P) - \sigma^{DF}(P)$ and $\sigma(P) - \sigma^{CW}(P)$, is 0.17 for DeepFool and 0.19 for CarliniWagner, with variances of 0.07 and 0.06, respectively.
There is 83% agreement between the direct computation and the DeepFool and CarliniWagner estimates of $\hat{I}_P$. This behavior provides evidence that adversarial attacks provide meaningful upper bounds on $d_\theta(x)$ in terms of identifying instances of robustness bias.

2.7.2 Audit of Commonly Used Models

We now evaluate five commonly used convolutional neural networks (CNNs): Alexnet, VGG, ResNet, DenseNet, and Squeezenet. We trained these networks using PyTorch with standard stochastic gradient descent. We achieve performance comparable to the documented state of the art for these models on these datasets. After training each model on each dataset, we generated adversarial examples using both methods and computed $\sigma(P)$ for each possible partition of the dataset. An example of the results for the UTKFace dataset can be seen in Figure 2.7.

With evidence from Section 2.7.1 that DeepFool and CarliniWagner can approximate the robustness bias behavior of direct computations of $d_\theta$, we first ask if there are any major differences between the two methods. If DeepFool exhibits adversarial robustness bias for a dataset, a model, and a class, does CarliniWagner exhibit the same, and vice versa? Since there are 5 different convolutional models, we have $151 \cdot 5 = 755$ different comparisons to make. Again, we first look at the signedness of $\sigma^{DF}(P)$ and $\sigma^{CW}(P)$, and we see that

$$\sum_{P} \mathbb{1}\big[\operatorname{sign}(\sigma^{DF}(P)) = \operatorname{sign}(\sigma^{CW}(P))\big] = 708.$$

This means there is 94% agreement between DeepFool and CarliniWagner about the direction of the adversarial robustness bias.

To investigate whether this behavior is exhibited earlier in the training cycle than at the final, fully trained model, we compute $\sigma^{CW}(P)$ and $\sigma^{DF}(P)$ for the various models and datasets using the trained models after the first epoch and the middle epoch. For the first epoch, 637 of the 755 partitions were internally consistent for DeepFool, i.e., the signedness of $\sigma$ was the same at the first and last epoch, and 621 were internally consistent for CarliniWagner. At the middle epoch, 671 of the 755 partitions were internally consistent for DeepFool and 665 were internally consistent for CarliniWagner. Unsurprisingly, this implies that as training progresses, the adversarial robustness bias behavior increasingly matches that of the final model. However, it is surprising that much more than 80% of the final behavior is determined after the first epoch, and there is a slight increase in agreement by the middle epoch.

We note that, of course, adversarial robustness bias is not necessarily an intrinsic property of a dataset; it may be exhibited by some models and not by others. However, in our studies, we see that the UTKFace dataset partition on Race/Ethnicity does appear to be significantly prone to adversarial attacks given its comparatively low $\sigma^{DF}(P)$ and $\sigma^{CW}(P)$ values across all models.

2.8 Evaluation of Robustness Bias using Randomized Smoothing

In Section 2.4.2, we argued that randomized smoothing can be used to obtain lower bounds on $d_\theta(x)$ which can then be used to measure robustness bias. In this section we audit popular models on a variety of datasets (described in detail in Section 2.5) for robustness bias, as measured using the approximation given by randomized smoothing.

2.8.1 Evaluation of $\hat{I}^{RS}_P$

To assess whether the estimate of $d_\theta(x)$ by randomized smoothing is an appropriate measure of robustness bias, we compare the signedness of $\sigma(P)$ and $\sigma^{RS}(P)$.
When $\sigma(P)$ has positive sign, a higher magnitude indicates higher robustness for members of partition element $P$ compared to members not included in $P$; similarly, when $\sigma(P)$ is negatively signed, a higher magnitude corresponds to lesser robustness for the members of $P$ (see Eq. 2.3). We may interpret shared signedness of $\sigma(P)$ (where $d_\theta(x)$ is computed directly) and $\sigma^{RS}(P)$ (where $d_\theta(x)$ is measured by randomized smoothing as described in Section 2.4.2) as positive support for the $\hat{I}^{RS}_P$ measure.

As in Section 2.7.1, we consider all 151 possible partitions across CIFAR-10, CIFAR-100, CIFAR-100Super, UTKFace, and Adience. For each of these partitions, we compare $\sigma^{RS}(P)$ to the corresponding $\sigma(P)$. We find that their signs agree 101 times, i.e.,

$$\sum_{P} \mathbb{1}\big[\operatorname{sign}(\sigma(P)) = \operatorname{sign}(\sigma^{RS}(P))\big] = 101,$$

giving 66.9% agreement. Furthermore, the mean difference between $\sigma(P)$ and $\sigma^{RS}(P)$, i.e., $\sigma(P) - \sigma^{RS}(P)$, is 0.08 with a variance of 0.19. This provides evidence that randomized smoothing can also provide a meaningful estimate of $d_\theta(x)$ in terms of measuring robustness bias.

2.8.2 Audit of Commonly Used Models

We now evaluate the same models and all the datasets for robustness bias as measured by randomized smoothing. Our comparison is analogous to the one performed in Section 2.7.2 using adversarial attacks. Figure 2.8 shows results for all models on the UTKFace dataset. Here we plot $\sigma^{RS}(P)$ for each partition of the dataset (on the x-axis) and for each model (on the y-axis). A darker color in the heatmap indicates higher robustness bias (darker red indicates that the partition is less robust than others, whereas darker blue indicates that the partition is more robust). We can see that some partitions, for example the partition based on the class label "40-60" and the partition based on race "black," tend to be less robust in the final trained model, for all models (indicated by a red color across all models).
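The agreement statistics and heatmaps above reduce to a few lines of array manipulation once the $\sigma$ values are collected. Below is a minimal sketch; the input containers and toy numbers are placeholders, not results from this chapter.

```python
# Minimal sketch: sign agreement between two sets of sigma estimates, and the
# per-model / per-partition matrix behind heatmaps such as Figures 2.7 and 2.8.
import numpy as np

def sign_agreement(sigma_a, sigma_b):
    """Count and fraction of partitions on which two sigma estimates share a sign."""
    sigma_a, sigma_b = np.asarray(sigma_a), np.asarray(sigma_b)
    agree = np.sum(np.sign(sigma_a) == np.sign(sigma_b))
    return int(agree), agree / len(sigma_a)

def sigma_matrix(sigma_by_model):
    """Stack {model_name: [sigma(P) for each partition]} into a
    (models x partitions) array suitable for a heatmap (e.g., imshow)."""
    names = sorted(sigma_by_model)
    return names, np.vstack([sigma_by_model[m] for m in names])

# Example with toy numbers (not results from this chapter):
# count, frac = sign_agreement([-0.2, 0.1, 0.3], [-0.1, 0.2, -0.4])
```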