ABSTRACT

Title of Dissertation: EXPANDING ROBUSTNESS IN RESPONSIBLE AI FOR NOVEL BIAS MITIGATION

Samuel Dooley, Doctor of Philosophy, 2023

Dissertation Directed by: John Dickerson, Department of Computer Science

The conventional belief in the fairness community is that one should first find the highest-performing model for a given problem and then apply a bias mitigation strategy. One starts with an existing model architecture and hyperparameters, and then adjusts model weights, learning procedures, or input data to make the model fairer using a pre-, post-, or in-processing bias mitigation technique. While existing methods for de-biasing machine learning systems use a fixed neural architecture and hyperparameter setting, I instead ask a fundamental question which has received little attention: how much does model bias arise from the architecture and hyperparameters, and how can we exploit the extensive research in the fields of neural architecture search (NAS) and hyperparameter optimization (HPO) to search for more inherently fair models? By thinking of bias mitigation in this new way, we expand our conceptualization of robustness in responsible AI.

Robustness is an emerging aspect of responsible AI that focuses on maintaining model performance in the face of uncertainties and variations for all subgroups of a data population. Robustness often deals with protecting models from intentional or unintentional manipulations of data, handling noisy or corrupted data, and preserving accuracy in real-world scenarios. In other words, robustness, as commonly defined, examines the output of a system under changes to input data. However, I will broaden the idea of what robustness in responsible AI is in a manner which defines new fairness metrics, yields insights into the robustness of deployed AI systems, and proposes an entirely new bias mitigation strategy.

This thesis explores the connection between robust machine learning and responsible AI. It introduces a fairness metric that quantifies disparities in susceptibility to adversarial attacks. It also audits face detection systems for robustness to common natural noises, revealing biases in these systems. Finally, it proposes using neural architecture search to find fairer architectures, challenging the conventional approach of starting with accurate architectures and applying bias mitigation strategies.

EXPANDING ROBUSTNESS IN RESPONSIBLE AI FOR NOVEL BIAS MITIGATION

by Samuel Dooley

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2023

Advisory Committee:
Dr. John Dickerson, Chair/Advisor
Dr. Philip Resnik
Dr. Hal Daumé III
Dr. Tom Goldstein
Dr. Furong Huang
Dr. Elissa Redmiles

© Copyright by Samuel Dooley 2023

Acknowledgments

To my mother and father

Without the support of my family, friends, and colleagues, this work would not have been possible. I have had the great fortune to share in the creation of this work with amazing co-authors, including Vedant Nanda, Rhea Sukthanker, George Wei, Micah Goldblum, Colin White, Frank Hutter, Tom Goldstein, and John Dickerson. These collaborations have been instrumental in my progress during the PhD program. I also worked with many other individuals on projects throughout my PhD which did not make it into this thesis due to space constraints.
These projects and collaborators are listed below — thank you to everyone for their guidance and support.

I had the pleasure of working primarily with Michael Curry and John Dickerson on human value alignment in deep learning for auction design. Specifically, I worked with Kevin Kuo, Anthony Ostuni, Elizabeth Horishny, Michael J. Curry, Ping Chiang, Tom Goldstein, and John Dickerson in Kuo et al. (2020), and with Neehar Peri, Michael Curry, and John Dickerson in Peri et al. (2021).

I’d also like to thank Elissa Redmiles for her guidance and collaboration on projects of deep interest to me. We worked with John Dickerson and Dana Turjeman to understand how messaging impacted Covid-19 app adoption in Louisiana (Dooley et al., 2022a). Through Elissa, I also met Angelica Goetzen and had the pleasure of working with her on statistical analyses of survey data on privacy sentiment (Goetzen et al., 2022). In other user studies, I worked closely with Michelle Mazurek on a line of work that examined the ways in which library IT staff conceived of privacy and security for their patrons. Originally devised as a class project with Michael Rosenberg, Elliot Sloate, and Sungbok Shin (Dooley et al., 2020), this work grew into a survey that we designed with Nora McDonald and Rachel Greenstadt, which I then conducted and which Alan Luo and Noel Warford analyzed (Luo et al., 2023). From these collaborations, I learned a great deal about qualitative research as well as my own working style and motivations. Michael Rosenberg and I also worked on a topological data analysis project (Rawson et al., 2022), where I got to apply my undergraduate math knowledge to my PhD.

I worked closely with Tom Goldstein, Micah Goldblum, and Valeriia Cherepanova on face recognition projects (Cherepanova et al., 2022; Dooley et al., 2021a). From them, I learned much about technical writing and the research process. Early in my PhD studies, John introduced me to some labor market problems (Dooley and Dickerson, 2020), which I worked on with Marina Knittel (Knittel et al., 2022).

Finally, a special thanks to Colin White, who has been instrumental in my progression towards the end of my program. We first started collaborating on Chapter 4 of this thesis, and this further led to an internship at Abacus.AI and ultimately a job. We’ve further collaborated on ForecastPFN (Khurana et al., 2023) and forthcoming work in NAS. Thank you, Colin, for your kindness, generosity, and support.

My family has been a constant voice of support through my education, particularly my husband, mother, father, brother, sister-in-law, and niece. I particularly appreciate the constant love and encouragement of my husband as well as his family as they have cheered me on and celebrated my wins. My friends have always been there to encourage and distract me at critical times – thank you particularly to Claire and Annette. Finally, I’d like to thank all the musicians whose music I enjoyed during the many long hours and late nights working on this thesis — particularly Beyoncé and her inspiring RENAISSANCE album.

The work in this thesis was supported in part by NSF CAREER Award IIS-1846237, NIST MSE Award #20126334, NSF D-ISN Award #2039862, NSF Award CCF-1852352, NIH R01 Award NLM-013039-01, DARPA GARD #HR00112020007, DARPA SI3-CMD #S4761, DoD WHS Award #HQ003420F0035, and ARPA-E Award #4334192.
iv Table of Contents Acknowledgements ii Table of Contents v List of Tables viii List of Figures xiv Chapter 1: Introduction 1 Chapter 2: Adversarial Robustness 10 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2 Heterogeneous Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.3 Robustness Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.3.1 Real-world Implications: Degradation of Quality of Service . . . . . . . 22 2.4 Measuring Robustness Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.4.1 Adversarial Attacks (Upper Bound) . . . . . . . . . . . . . . . . . . . . 23 2.4.2 Randomized Smoothing (Lower Bound) . . . . . . . . . . . . . . . . . . 23 2.5 Empirical Evidence of Robustness Bias in the Wild . . . . . . . . . . . . . . . . 24 2.6 Exact Computation in a Simple Model: Multinomial Logistic Regression . . . . . 27 2.7 Evaluation of Robustness Bias using Adversarial Attacks . . . . . . . . . . . . . 29 2.7.1 Evaluation of ÎDF P and ÎCW P . . . . . . . . . . . . . . . . . . . . . . . . 29 2.7.2 Audit of Commonly Used Models . . . . . . . . . . . . . . . . . . . . . 30 2.8 Evaluation of Robustness Bias using Randomized Smoothing . . . . . . . . . . . 31 2.8.1 Evaluation of ÎRS P . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.8.2 Audit of Commonly Used Models . . . . . . . . . . . . . . . . . . . . . 32 2.8.3 Comparison of Randomized Smoothing and Upper Bounds . . . . . . . . 34 2.9 Reducing Robustness Bias through Regularization . . . . . . . . . . . . . . . . . 35 2.9.1 Experimental Results using Regularized Models . . . . . . . . . . . . . . 37 2.10 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 Chapter 3: Robustness Disparities in Face Detection 50 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.3 Benchmark Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 v 3.3.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 3.4.1 RQ1: Overall Model Performance . . . . . . . . . . . . . . . . . . . . . 68 3.4.2 RQ2: Demographic Disparities in Noise Robustness . . . . . . . . . . . 69 3.4.3 RQ3: Disparity Comparison to Between Academic and Commercial Models 73 3.5 Implications and Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 3.6 Discussion & A Call to action . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 Chapter 4: Rethinking Bias Mitigation: Fairer Architectures Make for Fairer Face Recognition 79 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 4.2 Background and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 4.2.1 Face Identification. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 4.2.2 Bias Mitigation in Face Recognition. . . . . . . . . . . . . . . . . . . . . 84 4.2.3 Neural Architecture Search (NAS) and Hyperparameter Optimization (HPO). 85 4.3 Architectures and Hyperparameters: A Case Study . . . . . . . . . . . . . . . . . 86 4.3.1 NAS-based Bias Mitigation . . . . . . . . . . . . . . . . . . . . . . . . . 
86 4.3.2 Architectures and Hyperparameter Experiments . . . . . . . . . . . . . . 87 4.3.3 Experimental Setup. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 4.3.4 Evaluation procedure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 4.3.5 Results and Discussion. . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 4.4 Neural Architecture Search for Bias Mitigation . . . . . . . . . . . . . . . . . . 92 4.4.1 Search Space Design and Search Strategy . . . . . . . . . . . . . . . . . 92 4.4.2 Hyperparameter Search Space Design. . . . . . . . . . . . . . . . . . . . 92 4.4.3 Architecture Search Space Design. . . . . . . . . . . . . . . . . . . . . . 93 4.4.4 Obtained architectures and hyperparameter configurations from Black- Box-Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 4.4.5 Multi-fidelity optimization. . . . . . . . . . . . . . . . . . . . . . . . . . 94 4.4.6 Multi-objective optimization. . . . . . . . . . . . . . . . . . . . . . . . . 95 4.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 4.5.1 Novel architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 4.5.2 Analysis of the Pareto-Front of different Fairness Metrics . . . . . . . . . 97 4.5.3 Novel Architectures Outperform other Bias Mitigation Strategies . . . . . 99 4.5.4 Comparison to other Bias Mitigation Techniques on all Fairness Metrics . 101 4.5.5 Novel Architectures Generalize to Other Datasets . . . . . . . . . . . . . 102 4.5.6 Novel Architectures Generalize to Other Sensitive Attributes . . . . . . . 103 4.5.7 Novel Architectures Have Less Linear-Separability of Protected Attributes 104 4.6 Conclusion, Future Work and Limitations . . . . . . . . . . . . . . . . . . . . . 106 Chapter 5: Open Questions 113 5.1 Implications of Adversarial Robustness Bias . . . . . . . . . . . . . . . . . . . . 113 5.2 Causal Reasoning Behind Robustness Disparities in Face Detection . . . . . . . . 114 5.3 Applicability of NAS+HPO to Other Domains . . . . . . . . . . . . . . . . . . . 115 Appendix A: Additional Results: Robustness Bias in Face Detection 118 vi A.1 Statistical Significance Regressions for Average Precision . . . . . . . . . . . . . 118 A.1.1 Main Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 A.1.2 AP — Corruption Comparison Claims . . . . . . . . . . . . . . . . . . . 118 A.1.3 AP — Age Comparison Claims . . . . . . . . . . . . . . . . . . . . . . . 129 A.1.4 AP — Gender Comparison Claims . . . . . . . . . . . . . . . . . . . . . 136 A.1.5 AP — Skin Type Comparison Claims . . . . . . . . . . . . . . . . . . . 142 Bibliography 150 vii List of Tables 2.1 Test data performance of all models on different datasets. . . . . . . . . . . . . . 27 3.1 Adience Dataset Counts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 3.2 CCD Dataset Counts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 3.3 MIAP Dataset Counts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.4 UTKFace Dataset Counts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 3.5 Total Costs of Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.1 The fairness metrics explored in this chapter. Rank Disparity is explored in the main paper and the other metrics are reported in 4.5.2 . . . . . . . . . . . . . . . 90 4.2 Searchable hyperparameter choices. . . . . . . . . . . . . . . . . . . . . . . . . 
98 4.3 Operation choices and definitions. . . . . . . . . . . . . . . . . . . . . . . . . . 98 4.4 Comparison bias mitigation techniques where the SMAC models were found on VGGFace2 with NAS bias mitigation technique and the other three techniques are standard in facial recognition: Flipped (Chang et al., 2020), Angular (Morales et al., 2020), and Discriminator (Wang and Deng, 2020). Items in bold are Pareto- optimal. The values show (Error;metric). . . . . . . . . . . . . . . . . . . . . . . 108 4.5 Taking the highest performing models from the Pareto front of both VGGFace2 and CelebA, we transfer their evaluation onto six other common face recognition datasets: LFW (Huang et al., 2008), CFP_FF (Sengupta et al., 2016), CFP_FP (Sen- gupta et al., 2016), AgeDB (Moschoglou et al., 2017), CALFW (Zheng et al., 2017), CPLPW (Zheng and Deng, 2018). The novel architectures which we found with our bias mitigation strategy significantly out perform all other models. . . . 109 4.6 Comparison bias mitigation techniques where the SMAC models were found with NAS bias mitigation technique and the other three techniques are standard in facial recognition: Flipped (Chang et al., 2020), Angular (Morales et al., 2020), and Discriminator (Wang and Deng, 2020). Items in bold are Pareto-optimal. The values show (Error;Rank Disparity). . . . . . . . . . . . . . . . . . . . . . . . . 110 4.7 We transfer the evaluation of top performing models on VGGFace2 and CelebA onto six other common face recognition datasets: LFW (Huang et al., 2008), CFP_FF (Sengupta et al., 2016), CFP_FP (Sengupta et al., 2016), AgeDB (Moschoglou et al., 2017), CALFW (Zheng et al., 2017), CPLPW (Zheng and Deng, 2018). The novel architectures which we found with our bias mitigation strategy significantly out perform all other models. Full table is reported in Table 4.5. . . . . . . . . . 111 viii 4.8 Taking the highest performing models from the Pareto front of both VGGFace2 and CelebA, we transfer their evaluation onto a dataset with a different protected attribue – race – on the RFW dataset (Wang et al., 2019a). The novel architectures which we found with our bias mitigation strategy are always on the Pareto front, and mostly Pareto-dominant of the traditional architectures. . . . . . . . . . . . . 111 4.9 Linear Probes on VGGFace2. Lower accuracy is better . . . . . . . . . . . . . . 111 A.1 The best and worst performing perturbations for each dataset and model. . . . . . 119 A.2 AP. Pairwise Wilcoxon test with Bonferroni correction for model on Adience . . 119 A.3 AP. Pairwise Wilcoxon test with Bonferroni correction for model on CCD . . . . 119 A.4 AP. Pairwise Wilcoxon test with Bonferroni correction for model on MIAP . . . . 120 A.5 AP. Pairwise Wilcoxon test with Bonferroni correction for model on UTK . . . . 120 A.6 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on AWS and Adience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 A.7 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on Azure and Adience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 A.8 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on GCP and Adience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 A.9 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on MogFace and Adience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 A.10 AP. 
Pairwise Wilcoxon test with Bonferroni correction for corruption on TinaFace and Adience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 A.11 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on Yolov5 and Adience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 A.12 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on AWS and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 A.13 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on Azure and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 A.14 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on GCP and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 A.15 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on MogFace and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 A.16 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on TinaFace and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 A.17 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on Yolov5 and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 A.18 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on AWS and MIAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 A.19 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on Azure and MIAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 A.20 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on GCP and MIAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 A.21 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on MogFace and MIAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 ix A.22 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on TinaFace and MIAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 A.23 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on Yolov5 and MIAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 A.24 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on AWS and UTK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 A.25 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on Azure and UTK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 A.26 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on GCP and UTK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 A.27 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on MogFace and UTK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 A.28 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on TinaFace and UTK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 A.29 AP. Pairwise Wilcoxon test with Bonferroni correction for corruption on Yolov5 and UTK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 A.30 AP. 
Pairwise Wilcoxon test with Bonferroni correction for Age on AWS and Adience129 A.31 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on Azure and Adience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 A.32 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on GCP and Adience130 A.33 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on MogFace and Adience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 A.34 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on TinaFace and Adience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 A.35 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on Yolov5 and Adience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 A.36 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on AWS and CCD 131 A.37 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on Azure and CCD 132 A.38 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on GCP and CCD 132 A.39 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on MogFace and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 A.40 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on TinaFace and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 A.41 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on Yolov5 and CCD133 A.42 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on AWS and MIAP133 A.43 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on Azure and MIAP133 A.44 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on GCP and MIAP 134 A.45 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on MogFace and MIAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 A.46 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on TinaFace and MIAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 A.47 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on Yolov5 and MIAP134 A.48 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on AWS and UTK 135 A.49 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on Azure and UTK 135 A.50 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on GCP and UTK 135 x A.51 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on MogFace and UTK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 A.52 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on TinaFace and UTK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 A.53 AP. Pairwise Wilcoxon test with Bonferroni correction for Age on Yolov5 and UTK136 A.54 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on AWS and Adience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 A.55 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on Azure and Adience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 A.56 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on GCP and Adience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 A.57 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on MogFace and Adience . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 A.58 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on TinaFace and Adience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 A.59 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on Yolov5 and Adience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 A.60 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on AWS and CCD138 A.61 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on Azure and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 A.62 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on GCP and CCD138 A.63 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on MogFace and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 A.64 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on TinaFace and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 A.65 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on Yolov5 and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 A.66 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on AWS and MIAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 A.67 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on Azure and MIAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 A.68 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on GCP and MIAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 A.69 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on MogFace and MIAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 A.70 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on TinaFace and MIAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 A.71 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on Yolov5 and MIAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 A.72 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on AWS and UTK141 A.73 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on Azure and UTK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 A.74 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on GCP and UTK141 A.75 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on MogFace and UTK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 xi A.76 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on TinaFace and UTK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 A.77 AP. Pairwise Wilcoxon test with Bonferroni correction for Gender on Yolov5 and UTK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 A.78 AP. Pairwise Wilcoxon test with Bonferroni correction for Skin Type on AWS and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 A.79 AP. Pairwise Wilcoxon test with Bonferroni correction for Skin Type on Azure and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 A.80 AP. 
Pairwise Wilcoxon test with Bonferroni correction for Skin Type on GCP and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 A.81 AP. Pairwise Wilcoxon test with Bonferroni correction for Skin Type on MogFace and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 A.82 AP. Pairwise Wilcoxon test with Bonferroni correction for Skin Type on TinaFace and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 A.83 AP. Pairwise Wilcoxon test with Bonferroni correction for Skin Type on Yolov5 and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 A.84 AP. Pairwise Wilcoxon test with Bonferroni correction for Skin Type and the interaction with Lighting on AWS and CCD . . . . . . . . . . . . . . . . . . . . 144 A.85 AP. Pairwise Wilcoxon test with Bonferroni correction for Skin Type and the interaction with Lighting on Azure and CCD . . . . . . . . . . . . . . . . . . . . 144 A.86 AP. Pairwise Wilcoxon test with Bonferroni correction for Skin Type and the interaction with Lighting on GCP and CCD . . . . . . . . . . . . . . . . . . . . 145 A.87 AP. Pairwise Wilcoxon test with Bonferroni correction for Skin Type and the interaction with Lighting on MogFace and CCD . . . . . . . . . . . . . . . . . . 145 A.88 AP. Pairwise Wilcoxon test with Bonferroni correction for Skin Type and the interaction with Lighting on TinaFace and CCD . . . . . . . . . . . . . . . . . . 145 A.89 AP. Pairwise Wilcoxon test with Bonferroni correction for Skin Type and the interaction with Lighting on Yolov5 and CCD . . . . . . . . . . . . . . . . . . . 145 A.90 AP. Pairwise Wilcoxon test with Bonferroni correction for Skin Type and the interaction with Age and Gender on AWS and CCD . . . . . . . . . . . . . . . . 146 A.91 AP. Pairwise Wilcoxon test with Bonferroni correction for Skin Type and the interaction with Age and Gender on Azure and CCD . . . . . . . . . . . . . . . 146 A.92 AP. Pairwise Wilcoxon test with Bonferroni correction for Skin Type and the interaction with Age and Gender on GCP and CCD . . . . . . . . . . . . . . . . 146 A.93 AP. Pairwise Wilcoxon test with Bonferroni correction for Skin Type and the interaction with Age and Gender on MogFace and CCD . . . . . . . . . . . . . . 146 A.94 AP. Pairwise Wilcoxon test with Bonferroni correction for Skin Type and the interaction with Age and Gender on TinaFace and CCD . . . . . . . . . . . . . . 147 A.95 AP. Pairwise Wilcoxon test with Bonferroni correction for Skin Type and the interaction with Age and Gender on Yolov5 and CCD . . . . . . . . . . . . . . . 147 A.96 AP. Pairwise Wilcoxon test with Bonferroni correction for Lighting on AWS and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 A.97 AP. Pairwise Wilcoxon test with Bonferroni correction for Lighting on Azure and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 xii A.98 AP. Pairwise Wilcoxon test with Bonferroni correction for Lighting on GCP and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 A.99 AP. Pairwise Wilcoxon test with Bonferroni correction for Lighting on MogFace and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 A.100AP. Pairwise Wilcoxon test with Bonferroni correction for Lighting on TinaFace and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 A.101AP. 
Pairwise Wilcoxon test with Bonferroni correction for Lighting on Yolov5 and CCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 xiii List of Figures 2.1 A toy example showing robustness bias. A.) the classifier (solid line) has 100% accuracy for blue and green points. However for a budget τ (dotted lines), 70% of points belonging to the “round” subclass (showed by dark blue and dark green) will get attacked while only 30% of points in the “cross” subclass will be attacked. This shows a clear bias against the “round” subclass which is less robust in this case. B.) shows a different classifier for the same data points also with 100% accuracy. However, in this case, with the same budget τ , 30% of both “round” and “cross” subclass will be attacked, thus being less biased. . . . . . . . . . . . . . . 16 2.2 An example of multinomial logistic regression. . . . . . . . . . . . . . . . . . . 17 2.3 An example of robustness bias in the UTKFace dataset. A model trained to predict age group from faces is fooled for an inputs belonging to certain subgroups (black and female in this example) for a given perturbation, but is robust for inputs belonging to other subgroups (white and male in this example) for the same magnitude of perturbation. We use the UTKFace dataset to make a broader point that robustness bias can cause harms. In the specific case of UTKFace (and similar datasets), the task definition of predicting age from faces itself is flawed, as has been noted in many previous studies (Cramer et al., 2019; Crawford and Paglen, 2019; Buolamwini and Gebru, 2018). . . . . . . . . . . . . . . . . . . . . . . . . 19 2.4 For each dataset, we plot Îc(τ) for each class c in each dataset. Each blue line represents one class. The red line represents the mean of the blue lines, i.e.,∑ c∈C Îc(τ) for each τ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.5 For each dataset, we plot Îτs for each sensitive attribute s in each dataset. . . . . . 21 2.6 UTKFace partitioned by race. We can see that across models, that different populations are at different levels of robustness as calculated by different proxies (DeepFool on the left, CarliniWagner in the middle and Randomized Smoothing on the right). This suggests that robustness bias is an important criterion to consider when auditing models for fairness. . . . . . . . . . . . . . . . . . . . . . . . . . 40 2.7 Depiction of σDF P and σCW P for the UTKFace dataset with partitions corresponding to the (1) class labels C and the, (2) gender, and (3) race/ethnicity. These values are reported for all five convolutional models both at the beginning of their training (after one epoch) and at the end. We observe that, largely, the signedness of the functions are consistent between the five models and also across the training cycle. 41 xiv 2.8 Depiction of σRS P for the UTKFace dataset with partitions corresponding to the (1) class labels C and the, (2) gender, and (3) race/ethnicity. A more negative value indicates less robustness bias for the partition. Darker regions indicate high robustness bias. We observe that the trend is largely consistent amongst models and also similar to the trend observed when using adversarial attacks to measure robustness bias (see Figure 2.7). . . . . . . . . . . . . . . . . . . . . . . . . . . 41 2.9 In the unregularized model, “truck” in CIFAR10 tends to be more robust than other classes (2.9a); however, using ADVERM reduces that disparity (2.9b). 
We see similar behavior for UTKFace (2.6 & 2.9d). . . . . . . . . . . . . . . . . . . 41 2.10 Depiction of σDF P and σCW P for the CIFAR10 dataset with partitions corresponding to the class labels C. These values are reported for all five convolutional models both at the beginning of their training (after one epoch) and at the end. We observe that, largely, the signedness of the functions are consistent between the five models and also across the training cycle. . . . . . . . . . . . . . . . . . . . . . . . . . . 42 2.11 Depiction of σRS P for the CIFAR10 dataset with partitions corresponding to the class labels C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 2.12 Depiction of σDF P and σCW P for the CIFAR100 dataset with partitions correspond- ing to the class labels C. These values are reported for all five convolutional models both at the beginning of their training (after one epoch) and at the end. We observe that, largely, the signedness of the functions are consistent between the five models and also across the training cycle. . . . . . . . . . . . . . . . . . . . . . . . . . . 43 2.13 Depiction of σRS P for the CIFAR100 dataset with partitions corresponding to the class labels C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 2.14 Depiction of σDF P and σCW P for the CIFAR100super dataset with partitions corre- sponding to the class labels C. These values are reported for all five convolutional models both at the beginning of their training (after one epoch) and at the end. We observe that, largely, the signedness of the functions are consistent between the five models and also across the training cycle. . . . . . . . . . . . . . . . . . . . 44 2.15 Depiction of σRS P for the CIFAR100super dataset with partitions corresponding to the class labels C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 2.16 Depiction of σDF P and σCW P for the Adience dataset with partitions corresponding to the (1) class labels C and the and (2) gender. These values are reported for all five convolutional models both at the beginning of their training (after one epoch) and at the end. We observe that, largely, the signedness of the functions are consistent between the five models and also across the training cycle. . . . . . 45 2.17 Depiction of σRS P for the Adience dataset with partitions corresponding to the (1) class labels C and the and (2) gender. . . . . . . . . . . . . . . . . . . . . . . . . 45 2.18 [Regularization] CIFAR10 - Deep CNN . . . . . . . . . . . . . . . . . . . . . . 46 2.19 [Regularization] CIFAR10 - Resnet50 . . . . . . . . . . . . . . . . . . . . . . . 47 2.20 [Regularization] CIFAR10 - VGG19 . . . . . . . . . . . . . . . . . . . . . . . . 48 2.21 [Regularization] UTKFace partitioned by race - UTK Classifier. . . . . . . . . . 49 2.22 [Regularization] UTKFace partitioned by race - Resnet50. . . . . . . . . . . . . 49 2.23 [Regularization] UTKFace partitioned by race - VGG. . . . . . . . . . . . . . . 49 xv 3.1 Our benchmark consists of 5,066,312 images of the 15 types of algorithmically generated corruptions produced by ImageNet-C. We use data from four datasets (Adience, CCD, MIAP, and UTKFace) and present examples of corruptions from each dataset here. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 3.2 Depiction of how Average Precision (AP) metric is calculated by using clean image as ground truth. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
66 3.3 Overall performance (AP) of each model on each dataset. . . . . . . . . . . . . . 66 3.4 Gender disparity plots for each dataset and model. Values below 1 indicate that predominantly feminine presenting subjects are more susceptible to noise- induced changes. Values above 1 indicate that predominantly masculine presenting subjects are are more susceptible to noise-induced changes. Error bars indicate 95% confidence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 3.5 Age disparity plots for each dataset and model. Values greater than 1 indicate that older subjects are more susceptible to noise-induced changes compared to middle aged subjects. Error bars indicate 95% confidence. . . . . . . . . . . . . . . . . . 71 3.6 Skin type disparity plots for CCD. Values above 1 indicate that darker-skinned subjects are more susceptible to noise-induced changes. Error bars indicate 95% confidence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 3.7 Lighting disparity plots for CCD. Values above 1 indicate that dimly-lit subjects are more susceptible to noise-induced changes. Error bars indicate 95% confidence. 72 4.1 Overview of our methodology. . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 4.2 (Left) CelebA (Right) VGGFace2. Error-Rank Disparity Pareto front of the architectures with lowest error (< 0.3). Models in the lower left corner are better. The Pareto front is notated with a dashed line. Other points are architecture and hyperparameter combinations which are not Pareto-optimal. . . . . . . . . . . . 90 4.3 SMAC discovers the above building blocks with (a) corresponding to architecture with CosFace, with SGD optimizer and learning rate of 0.2813 as hyperparamters (b) corresponding to CosFace, with SGD as optimizer and learning rate of 0.32348 and (c) corresponding to CosFace, with AdamW as optimizer and learning rate of 0.0006 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 4.4 Pareto front of the models discovered by SMAC and the rank-1 models from timm for the (a) validation and (b) test sets on CelebA. Each point corresponds to the mean and standard error of an architecture after training for 3 seeds. The SMAC models Pareto-dominate the top performing timm models (Error < 0.1). . . . . 96 4.5 Pareto front of the models discovered by SMAC and the rank-1 models from timm for the (a) validation and (b) test sets on VGGFace2. Each point corresponds to the mean and standard error of an architecture after training for 3 seeds. The SMAC models Pareto-dominate the top performing timm models (Error<0.1). . . 97 4.6 DPN block (left) vs. our searchable block (right). . . . . . . . . . . . . . . . . . 98 4.7 Replication of CelebA 4.2 with all data points. Error-Rank Disparity Pareto front of the architectures with any non-trivial error. Models in the lower left corner are better. The Pareto front is notated with a dashed line. Other points are architecture and hyperparameter combinations which are not Pareto-dominant. . . . . . . . . 99 xvi 4.8 Replication of VGGFace2 4.2 with all data points. Error-Rank Disparity Pareto front of the architectures with any non-trivial error. Models in the lower left corner are better. The Pareto front is notated with a dashed line. Other points are architecture and hyperparameter combinations which are not Pareto-dominant. . . 100 4.9 Replication of 4.7 on the CelebA validation dataset with Ratio of Ranks (left) and Ratio of Errors (right) metrics. . . . . . . . . 
. . . 101
4.10 Replication of 4.7 on the CelebA validation dataset with the Disparity in accuracy metric. . . . 102
4.11 Replication of 4.7 on the CelebA validation dataset with the Ratio in accuracy metric. . . . 103
4.12 Replication of 4.8 on the VGGFace2 validation dataset with Ratio of Ranks metric. . . . 104
4.13 Replication of 4.8 on the VGGFace2 validation dataset with Ratio of Errors metric. . . . 105
4.14 Replication of 4.8 on the VGGFace2 validation dataset with the Disparity in accuracy metric. . . . 106
4.15 Replication of 4.8 on the VGGFace2 validation dataset with the Ratio in accuracy metric. . . . 107
4.16 Models trained on CelebA (left) and VGGFace2 (right) evaluated on a dataset with a different protected attribute, specifically on RFW with the racial attribute, and with the Rank Disparity metric. The novel architectures outperform the existing architectures in both settings. . . . 110
4.17 TSNE plots for models pretrained on VGGFace2 on the test set: (a) SMAC model, last layer; (b) DPN MagFace, last layer; (c) SMAC model, second-to-last layer; (d) DPN MagFace, second-to-last layer. Note the better linear separability for DPN MagFace in comparison with the SMAC model. . . . 112

He talks, he talks, how he talks, and waves his arms. He fills up ornate vases. Twenty-seven an hour. And keeps the words in with cork stoppers (If you hold the vases to your ears you can hear the muted syllables colliding into each other). I want vases, some of them ornate, But simple ones too. And most of them Will have flowers

On Verbosity
Annette Ryan

Chapter 1: Introduction

Artificial Intelligence (AI) has emerged as a transformative technology with the potential to revolutionize various aspects of our lives. From personalized recommendations to autonomous vehicles, AI systems are becoming increasingly prevalent in our daily interactions. However, as AI becomes more advanced and integrated into society, concerns about its responsible use have gained significant attention. Responsible AI refers to the ethically informed and transparent development, deployment, and utilization of AI technologies, ensuring that they are designed and used in a manner that respects human values, rights, and well-being.

The need for responsible AI arises from the potential risks associated with its widespread adoption. AI systems can inadvertently perpetuate bias and discrimination and reinforce societal inequalities if not developed and implemented with care. For example, biased training data can lead to discriminatory outcomes, such as AI-powered hiring algorithms favoring certain demographics. Machine learning is applied to a wide variety of socially consequential domains, e.g., credit scoring, fraud detection, hiring decisions, criminal recidivism, loan repayment, and face recognition (Mukerjee et al., 2002; Ngai et al., 2011; Learned-Miller et al., 2020; Barocas et al., 2017), with many of these applications impacting the lives of people more than ever — often in biased ways (Buolamwini and Gebru, 2018; Joo and Kärkkäinen, 2020; Wang et al., 2020b).
Dozens of formal definitions of fairness have been proposed (Narayanan, 2018), and many algorithmic techniques have been developed for debiasing according to these definitions (Verma and Rubin, 2018). Automated decision-making systems that are driven by data are being used in a variety of different real-world applications, creating the risk that these systems will perpetuate and/or create harms to people. In many cases, these systems make decisions on data points that represent humans (e.g., targeted ads (Speicher et al., 2018; Ribeiro et al., 2019), personalized recommendations (Singh and Joachims, 2018; Biega et al., 2018), hiring (Schumann et al., 2019, 2020), credit scoring (Khandani et al., 2010), or recidivism prediction (Chouldechova, 2017)). In such scenarios, there is often concern regarding the fairness of outcomes of the systems (Barocas and Selbst, 2016; Galhotra et al., 2017). This has resulted in a growing body of work from the Responsible AI community that—drawing on prior legal and philosophical doctrine—aims to define, measure, and (attempt to) mitigate manifestations of unfairness in automated systems (Chouldechova, 2017; Feldman et al., 2015a; Leben, 2020; Binns, 2017). Responsible AI aims to address such concerns by emphasizing fairness, accountability, transparency, and inclusivity in AI development and deployment processes.

One crucial aspect of responsible AI is fairness. AI systems should be designed to treat all individuals fairly and without discrimination. This means avoiding bias in data collection, ensuring diverse representation during the development process, and regularly auditing AI algorithms for unintended biases. Additionally, responsible AI involves being accountable for the outcomes of AI systems. Developers and organizations should take responsibility for any harm caused by their AI technologies and implement mechanisms for redress and accountability.

Transparency is another fundamental principle of responsible AI. Users and stakeholders should have access to understandable and explainable AI systems. This means that AI algorithms should be designed in a way that allows for clear explanations of their decision-making processes. Transparent AI fosters trust, enables users to understand how AI systems work, and helps identify and rectify any potential biases or errors. Responsible AI also emphasizes inclusivity, ensuring that the benefits and opportunities created by AI are accessible to all. This involves considering the needs and perspectives of diverse populations during AI development, addressing issues of the digital divide and accessibility, and actively working towards reducing biases and disparities present in AI systems.

Another emerging aspect of responsible AI is the robustness of systems. In traditional machine learning, robustness refers to the ability of a model to maintain its performance and generalization capabilities even in the face of uncertainties, adversarial attacks, or variations in the input data. A robust model is not only accurate on the training data but also exhibits resilience to perturbations, noise, and outliers that it may encounter during deployment. The importance of robustness arises from the fact that real-world data is often noisy, incomplete, and subject to unpredictable variations. While traditional machine learning algorithms focus on optimizing for average-case scenarios, robust machine learning aims to handle the worst-case scenarios and mitigate the risks associated with unpredictable inputs.
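To make this notion concrete, the following is a minimal sketch (not code from this thesis) that trains a toy classifier and compares clean versus noise-perturbed accuracy separately for two subgroups. The synthetic data, the subgroup structure, and the noise scale are all illustrative assumptions.

```python
# Minimal sketch: contrast clean and noise-perturbed accuracy per subgroup for a
# toy classifier. The dataset, groups, and noise scale are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data: two subgroups generated at different distances from the boundary.
n = 2000
group = rng.integers(0, 2, size=n)            # a stand-in "sensitive attribute"
margin = np.where(group == 0, 2.0, 0.5)       # group 1 sits closer to the boundary
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 2))
X[:, 0] = (2 * y - 1) * margin + 0.3 * rng.normal(size=n)

clf = LogisticRegression().fit(X, y)

def group_accuracy(X_eval, y_eval, g):
    """Accuracy restricted to the rows where group == g."""
    mask = group == g
    return clf.score(X_eval[mask], y_eval[mask])

sigma = 0.8                                   # perturbation scale (assumed)
X_noisy = X + rng.normal(scale=sigma, size=X.shape)

for g in (0, 1):
    clean = group_accuracy(X, y, g)
    noisy = group_accuracy(X_noisy, y, g)
    print(f"group {g}: clean acc {clean:.3f}, noisy acc {noisy:.3f}, drop {clean - noisy:.3f}")
```

Even in this toy setting, the group generated closer to the decision boundary loses noticeably more accuracy under the same perturbation, which is the kind of disparity the later chapters formalize and measure.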
Robustness in machine learning encompasses various dimensions, each presenting unique challenges and trade-offs. One prominent aspect is adversarial robustness, which examines the model’s vulnerability to adversarial attacks, where malicious actors deliberately manipulate the input data to deceive or mislead the model’s predictions. Adversarial attacks have demonstrated the susceptibility of machine learning models to subtle perturbations that are often imperceptible to human observers. Developing models that are resistant to such attacks is crucial for security-sensitive applications. I will explore this topic in depth in Chapter 2.

Most of the initial work on fairness in machine learning considered notions that were one-shot and treated the model and data distribution as static (Zafar et al., 2019, 2017c; Chouldechova, 2017; Barocas and Selbst, 2016; Dwork et al., 2012; Zemel et al., 2013). Recently, there has been more work exploring notions of fairness that are dynamic and consider the possibility that the world (i.e., the model as well as data points) might change over time (Heidari et al., 2019; Heidari and Krause, 2018; Hashimoto et al., 2018; Liu et al., 2018b). Our proposed notion of robustness bias differs subtly from existing one-shot and dynamic notions of fairness in that it requires that each partition of the population be equally robust to imperceptible changes in the input (e.g., noise, adversarial perturbations, etc.).

Another dimension of robustness focuses on handling noisy or corrupted data. Real-world datasets may contain outliers, missing values, or measurement errors, which can significantly impact the performance of machine learning models. Robust techniques that can effectively handle such anomalies and preserve the model’s accuracy and reliability are essential. We explore robustness to noisy or corrupted data in Chapter 3 by auditing face detection systems, showing deeper and more pernicious forms of robustness bias in these systems.

Face detection identifies the presence and location of faces in images and video. Automated face detection is a core component of myriad systems—including face recognition technologies (FRT), wherein a detected face is matched against a database of faces, typically for identification or verification purposes. FRT-based systems are widely deployed (Hartzog, 2020; Derringer, 2019; Weise and Singer, 2020). Automated face recognition enables capabilities ranging from the relatively morally neutral (e.g., searching for photos on a personal phone (Google, 2021a)) to the morally laden (e.g., widespread citizen surveillance (Hartzog, 2020), or target identification in warzones (Marson and Forrest, 2021)). Legal and social norms regarding the usage of FRT are evolving (e.g., Grother et al., 2019). For example, in June 2021, the first county-wide ban on its use for policing (see, e.g., Garvie, 2016) went into effect in the US (Gutman, 2021). Some use cases for FRT will be deemed socially repugnant and thus be either legally or de facto banned from use; yet, it is likely that pervasive use of facial analysis will remain—albeit with more guardrails than today (Singer, 2018). One such guardrail that has spurred positive, though insufficient, improvements and widespread attention is the use of benchmarks.
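Before turning to existing benchmarks, here is a heavily simplified sketch of how a corruption-robustness audit of a face detector can be organized. It is not the released benchmark code: the detect_faces stub, the box coordinates, and the single Gaussian-noise corruption (standing in for the fifteen ImageNet-C corruption types) are illustrative assumptions, and a single IoU check stands in for full average precision.

```python
# Hedged sketch of the benchmark's evaluation idea, not the released benchmark code.
# Detections on the *clean* image act as ground truth, and detections on corrupted
# copies are scored against them (here with a single IoU check instead of full AP).
import numpy as np

def corrupt_gaussian(image: np.ndarray, severity: int) -> np.ndarray:
    """Additive Gaussian noise at ImageNet-C-style severity levels 1-5 (assumed scales)."""
    scale = [0.04, 0.06, 0.08, 0.09, 0.10][severity - 1]
    noisy = image / 255.0 + np.random.normal(scale=scale, size=image.shape)
    return np.clip(noisy, 0.0, 1.0) * 255.0

def iou(box_a, box_b) -> float:
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def detect_faces(image: np.ndarray) -> list:
    """Placeholder for a real detector (TinaFace, MogFace, a cloud API, ...)."""
    return [[10, 10, 50, 60]]   # one fake box so the sketch runs end to end

image = np.random.randint(0, 256, size=(128, 128, 3)).astype(float)
clean_boxes = detect_faces(image)

for severity in range(1, 6):
    noisy_boxes = detect_faces(corrupt_gaussian(image, severity))
    score = np.mean([max(iou(c, n) for n in noisy_boxes) for c in clean_boxes])
    print(f"severity {severity}: mean IoU vs. clean detections = {score:.2f}")
```

In the actual benchmark of Chapter 3, scores of this kind are aggregated into average precision per image, with detections on the clean image serving as ground truth, and then compared across demographic groups such as age, gender, skin type, and lighting.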
One prominent example of such a benchmark came in late 2019, when the US National Institute of Standards and Technology (NIST) adapted its venerable Face Recognition Vendor Test (FRVT) to explicitly include concerns for demographic effects (Grother et al., 2019), ensuring such concerns propagate into industry systems. Yet, differential treatment of groups by FRT has been known for at least a decade (e.g., Klare et al., 2012; El Khiyari and Wechsler, 2016), and more recent work spearheaded by Buolamwini and Gebru (2018) uncovers unequal performance at the phenotypic subgroup level. That latter work brought widespread public, and thus burgeoning regulatory, attention to bias in FRT (e.g., Lohr, 2018; Kantayya, 2020).

One as-yet unexplored benchmark examines the bias present in a model’s robustness (e.g., to noise, or to different lighting conditions), both in aggregate and with respect to different dimensions of the population on which it will be used. Many detection and recognition systems are not built in house, instead adapting an existing academic model or making use of commercial cloud-based “ML as a Service” (MLaaS) platforms offered by tech giants such as Amazon, Microsoft, Google, Megvii, etc. I will present a first-of-its-kind, detailed robustness benchmark of six different face detection models, covering fifteen types of realistic noise (Hendrycks and Dietterich, 2019), on four well-known datasets. Across all the datasets and systems, I generally find that photos of individuals who are older, masculine presenting, of darker skin type, or photographed in dim lighting are more susceptible to errors than their counterparts in other identities.

Addressing robustness in machine learning involves a combination of algorithmic design, feature engineering, and data preprocessing techniques. These approaches seek to make models more resilient to uncertainties and perturbations, either by introducing regularization mechanisms, utilizing ensemble methods, or leveraging domain knowledge to guide the learning process.

In this thesis, I’ll broaden and deepen the connection between robust machine learning and responsible AI. In Chapter 2, I define a new fairness metric in responsible AI which quantifies the disparity between groups with respect to how susceptible they are to adversarial attack. In Chapter 3, I audit existing academic and commercial face detection systems for their robustness to common types of natural noise. Finally, in Chapter 4, I expand the conceptualization of robustness to include notions of model architecture and hyperparameters, and propose a novel bias mitigation technique which employs neural architecture search to find fairer architectures.

Conventional wisdom is that in order to effectively mitigate bias, we should start by selecting a model architecture and set of hyperparameters which are optimal in terms of accuracy, and then apply a mitigation strategy to reduce bias while minimally impacting accuracy. As datasets become larger and training becomes more computationally intensive, especially in the case of computer vision and natural language processing, it is becoming increasingly common in applications to start with a very large pretrained model and then fine-tune it for the specific use case (Chi et al., 2017; Käding et al., 2016; Ouyang et al., 2016; Too et al., 2019).
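Under that conventional pipeline, fairness interventions are bolted onto a model whose architecture and hyperparameters are already fixed. The sketch below is a toy illustration of the flavor of a post-processing step, not a method from the literature cited later: per-group decision thresholds are tuned on a fixed model's scores to roughly equalize true-positive rates. The score distributions, groups, and target rate are synthetic assumptions.

```python
# Toy post-processing mitigation on a *fixed* model's scores: pick per-group
# thresholds that approximately equalize true-positive rates. Scores are synthetic.
import numpy as np

rng = np.random.default_rng(1)
n = 5000
group = rng.integers(0, 2, size=n)
y = rng.integers(0, 2, size=n)
# A fixed model's scores: systematically less separated for group 1.
sep = np.where(group == 0, 2.0, 0.8)
scores = rng.normal(loc=y * sep, scale=1.0)

def tpr(threshold, mask):
    """True-positive rate within a group at a given decision threshold."""
    pos = (y == 1) & mask
    return ((scores >= threshold) & pos).sum() / max(pos.sum(), 1)

target_tpr = 0.80                      # assumed policy target
thresholds = {}
for g in (0, 1):
    mask = group == g
    grid = np.quantile(scores[mask], np.linspace(0.01, 0.99, 99))
    # choose the candidate threshold whose group TPR is closest to the target
    thresholds[g] = min(grid, key=lambda t: abs(tpr(t, mask) - target_tpr))
    print(f"group {g}: threshold {thresholds[g]:.2f}, TPR {tpr(thresholds[g], mask):.2f}")
```

Whatever thresholds such a step picks, the underlying representation, and any bias encoded in it, is left untouched.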
While existing methods for de-biasing machine learning systems use a fixed neural architecture and hyperparameter setting, I instead ask a fundamental question which has received little attention: how much does model bias arise from the architecture and hyperparameters? I further ask whether we can exploit the extensive research in the fields of neural architecture search (NAS) (Elsken et al., 2019) and hyperparameter optimization (HPO) (Feurer and Hutter, 2019) to search for more inherently fair models. Many debiasing algorithms fit into one of three (or arguably four (Savani et al., 2020)) categories: pre-processing (e.g., Feldman et al., 2015b; Ryu et al., 2018; Quadrianto et al., 2019; Wang and Deng, 2020), in-processing (e.g., Zafar et al., 2017b, 2019; Donini et al., 2018; Goel et al., 2018; Padala and Gujar, 2020; Wang and Deng, 2020; Martinez et al., 2020; Nanda et al., 2021; Diana et al., 2020; Lahoti et al., 2020), or post-processing (e.g., Hardt et al., 2016; Wang et al., 2020b). I, however, pose a simple question: what if these approaches are using an architecture which is inherently less fair than another? To explore this topic, I employ neural architecture search.
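Concretely, the idea is to treat the architecture and its training hyperparameters as a single joint configuration space and to score every sampled configuration on both error and a fairness disparity. The sketch below illustrates that framing with a plain random search and a placeholder evaluation function; the search space, the train_and_evaluate stub, and the metrics are illustrative assumptions, and the thesis itself uses SMAC-based multi-objective, multi-fidelity optimization rather than random search.

```python
# Hedged sketch of searching architectures + hyperparameters for fairness: sample
# joint configurations, score each on (error, disparity), and keep the Pareto front.
import random

SEARCH_SPACE = {                      # illustrative, not the thesis's search space
    "backbone":      ["dpn_like", "resnet_like", "searchable_block"],
    "head":          ["cosface", "arcface", "magface"],
    "optimizer":     ["sgd", "adamw"],
    "learning_rate": [0.0005, 0.005, 0.05, 0.3],
}

def sample_config():
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def train_and_evaluate(config):
    """Placeholder: a real run would train the model and report
    (validation error, fairness disparity such as rank disparity)."""
    return random.uniform(0.05, 0.3), random.uniform(0.0, 0.2)

def pareto_front(results):
    """Keep configurations not dominated on (error, disparity); lower is better on both."""
    front = []
    for cfg, err, disp in results:
        dominated = any(e <= err and d <= disp and (e, d) != (err, disp)
                        for _, e, d in results)
        if not dominated:
            front.append((cfg, err, disp))
    return front

random.seed(0)
results = []
for _ in range(30):
    cfg = sample_config()
    err, disp = train_and_evaluate(cfg)
    results.append((cfg, err, disp))

for cfg, err, disp in pareto_front(results):
    print(f"error={err:.3f}  disparity={disp:.3f}  {cfg}")
```

Chapter 4 describes the actual search spaces (Tables 4.2 and 4.3) and shows that the resulting models Pareto-dominate standard architectures on both error and disparity.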
This work challenges the assumption that bias mitigation pipelines should default to existing popular architectures which were optimized for accuracy — instead I'll show that it may be more beneficial to begin with a fairer architecture as the foundation of such pipelines.

You gave up on expecting things to make sense Knew at some point this would not be a puzzle you would ever complete So you learn to hold it all bodies that are home and how the days unfold one after the next A continuous scroll of doing your best and trusting what the darkness will hold Call it Love

Jena Schwartz

Chapter 2: Adversarial Robustness

This work was done in collaboration with my co-first author Vedant Nanda, as well as Sahil Singla, John P. Dickerson, and Soheil Feizi, and was presented at FAccT, 2021 (Nanda et al., 2021).

Deep neural networks (DNNs) are increasingly used in real-world applications (e.g., facial recognition). This has resulted in concerns about the fairness of decisions made by these models. Various notions and measures of fairness have been proposed to ensure that a decision-making system does not disproportionately harm (or benefit) particular subgroups of the population. In this chapter, we argue that traditional notions of fairness that are based only on models' outputs are not sufficient when the model is vulnerable to adversarial attacks. We argue that in some cases, it may be easier for an attacker to target a particular subgroup, resulting in a form of robustness bias. We show that measuring robustness bias is a challenging task for DNNs and propose two methods to measure this form of bias. We then conduct an empirical study on state-of-the-art neural networks on commonly used real-world datasets such as CIFAR-10, CIFAR-100, Adience, and UTKFace and show that in almost all cases there are subgroups (in some cases based on sensitive attributes like race, gender, etc.) which are less robust and are thus at a disadvantage. We argue that this kind of bias arises due to both the data distribution and the highly complex nature of the learned decision boundary in the case of DNNs, thus making mitigation of such biases a non-trivial task. Our results show that robustness bias is an important criterion to consider while auditing real-world systems that rely on DNNs for decision making.

2.1 Introduction

Automated decision-making systems that are driven by data are being used in a variety of different real-world applications. In many cases, these systems make decisions on data points that represent humans (e.g., targeted ads (Speicher et al., 2018; Ribeiro et al., 2019), personalized recommendations (Singh and Joachims, 2018; Biega et al., 2018), hiring (Schumann et al., 2019, 2020), credit scoring (Khandani et al., 2010), or recidivism prediction (Chouldechova, 2017)). In such scenarios, there is often concern regarding the fairness of outcomes of the systems (Barocas and Selbst, 2016; Galhotra et al., 2017). This has resulted in a growing body of work from the nascent Fairness, Accountability, Transparency, and Ethics (FATE) community that—drawing on prior legal and philosophical doctrine—aims to define, measure, and (attempt to) mitigate manifestations of unfairness in automated systems (Chouldechova, 2017; Feldman et al., 2015a; Leben, 2020; Binns, 2017).
Most of the initial work on fairness in machine learning considered notions that were one-shot and considered the model and data distribution to be static (Zafar et al., 2019, 2017c; Chouldechova, 2017; Barocas and Selbst, 2016; Dwork et al., 2012; Zemel et al., 2013). Recently, there has been more work exploring notions of fairness that are dynamic and consider the possibility that the world (i.e., the model as well as data points) might change over time (Heidari et al., 2019; Heidari and Krause, 2018; Hashimoto et al., 2018; Liu et al., 2018b). Our proposed notion of robustness bias differs subtly from existing one-shot and dynamic notions of fairness in that it requires each partition of the population to be equally robust to imperceptible changes in the input (e.g., noise, adversarial perturbations, etc.).

We propose a simple and intuitive notion of robustness bias which requires subgroups of populations to be equally "robust." Robustness can be defined in multiple different ways (Szegedy et al., 2014; Goodfellow et al., 2015; Papernot et al., 2016). We take a general definition which assigns higher robustness to points that are farther away from the decision boundary. Our key contributions are as follows:

• We define a simple, intuitive notion of robustness bias that requires all partitions of the dataset to be equally robust. We argue that such a notion is especially important when the decision-making system is a deep neural network (DNN) since these have been shown to be susceptible to various attacks (Carlini and Wagner, 2017; Moosavi-Dezfooli et al., 2016). Importantly, our notion depends not only on the outcomes of the system, but also on the distribution of distances of data points from the decision boundary, which in turn is a characteristic of both the data distribution and the learning process.

• We propose different methods to measure this form of bias. Measuring the exact distance of a point from the decision boundary is a challenging task for deep neural networks, which have a highly non-convex decision boundary. This makes the measurement of robustness bias a non-trivial task. In this chapter we leverage the literature on adversarial machine learning and show that we can efficiently approximate robustness bias by using adversarial attacks and randomized smoothing to get estimates of a point's distance from the decision boundary.

• We do an in-depth analysis of robustness bias on popularly used datasets and models. Through extensive empirical evaluation we show that unfairness can exist due to different partitions of a dataset being at different levels of robustness for many state-of-the-art models that are trained on common classification datasets. We argue that this form of unfairness can happen due to both the data distribution and the learning process and is an important criterion to consider when auditing models for fairness.

2.1.1 Related Work

Fairness in ML. Models that learn from historic data have been shown to exhibit unfairness, i.e., they disproportionately benefit or harm certain subgroups (often a sub-population that shares a common sensitive attribute such as race, gender, etc.) of the population (Barocas and Selbst, 2016; Chouldechova, 2017; Khandani et al., 2010).
This has resulted in a lot of work on quantifying, measuring, and, to some extent, also mitigating unfairness (Dwork et al., 2012; Dwork and Ilvento, 2018; Zemel et al., 2013; Zafar et al., 2019, 2017c; Hardt et al., 2016; Grgić-Hlac̆a et al., 2018; Adel et al., 2019; Wadsworth et al., 2018; Saha et al., 2020; Donini et al., 2018; Calmon et al., 2017; Kusner et al., 2017; Kilbertus et al., 2017; Pleiss et al., 2017; Wang et al., 2020b). Most of these works consider notions of fairness that are one-shot—that is, they do not consider how these systems would behave over time as the world (i.e., the model and data distribution) evolves. Recently more works have taken into account the dynamic nature of these decision-making systems and consider fairness definitions and learning algorithms that fare well across multiple time steps (Heidari et al., 2019; Heidari and Krause, 2018; Hashimoto et al., 2018; Liu et al., 2018b). We take inspiration from both the one-shot and dynamic notions, but take a slightly different approach by requiring all subgroups of the population to be equally robust to minute changes in their features. These changes could either be random (e.g., natural noise in measurements) or carefully crafted adversarial noise. This is closely related to Heidari et al. (2019)'s effort-based notion of fairness; however, their notion has a very specific use case of societal-scale models, whereas our approach is more general and applicable to all kinds of models. Our work is also closely related to and inspired by Zafar et al.'s use of a regularized loss function which captures fairness notions and reduces disparity in outcomes (Zafar et al., 2019). There are major differences in both approach and application between our work and that of Zafar et al. Their disparate impact formulation aims to equalize the average distance of points to the decision boundary, $\mathbb{E}[d(x)]$; our approach, instead, aims to equalize the number of points that are "safe," i.e., $\mathbb{E}[\mathbf{1}\{d(x) > \tau\}]$ (see Section 2.3 for a detailed description). Our proposed metric is preferable for applications of adversarial attack or noisy data, the focus of our paper, whereas the metric of Zafar et al. is more applicable for an analysis of the consequence of a decision in a classification setting.

Robustness. Deep neural networks (DNNs) have been shown to be susceptible to carefully crafted adversarial perturbations which—imperceptible to a human—result in a misclassification by the model (Szegedy et al., 2014; Goodfellow et al., 2015; Papernot et al., 2016). In the context of our paper, we use adversarial attacks to approximate the distance of a data point to the decision boundary. For this we use state-of-the-art white-box attacks proposed by Moosavi-Dezfooli et al. (2016) and Carlini and Wagner (2017). Alongside the many works on adversarial attacks, there have been many recent works on provable robustness to such attacks. The high-level goal of these works is to estimate a (tight) lower bound on the distance of a point from the decision boundary (Cohen et al., 2019; Salman et al., 2019; Singla and Feizi, 2020). We leverage these methods to estimate distances from the decision boundary, which helps assess robustness bias (defined formally in Section 2.3).

Fairness and Robustness. Recent works have proposed poisoning attacks on fairness (Solans et al., 2020; Mehrabi et al., 2020). Khani and Liang (2019) analyze why noise in features can cause disparity in error rates when learning a regression.
We believe that our work is the very first to show that different subgroups of the population can have different levels of robustness, which can lead to unfairness. We hope that this will lead to more work at the intersection of these two important subfields of ML.

2.2 Heterogeneous Robustness

In a classification setting, a learner is given data $D = \{(x_i, y_i)\}_{i=1}^{N}$ consisting of inputs $x_i \in \mathbb{R}^d$ and outputs $y_i \in C$ which are labels in some set of classes $C = \{c_1, \ldots, c_k\}$. These classes form a partition on the dataset such that $D = \bigsqcup_{c_j \in C} \{(x_i, y_i) \mid y_i = c_j\}$. The goal of learning in decision boundary-based optimization is to draw delineations between points in feature space which sort the data into groups according to their class label. The learning generally tries to maximize the classification accuracy of the decision boundary choice. A learner chooses some loss function $\mathcal{L}$ to minimize on a training dataset, parameterized by parameters $\theta$, while maximizing the classification accuracy on a test dataset.

Of course there are other aspects to classification problems that have recently become more salient in the machine learning community. Considerations about the fairness of classification decisions, for example, are one such way in which additional constraints are brought into a learner's optimization strategy. In these settings, the data $D = \{(x_i, y_i, s_i)\}_{i=1}^{N}$ is imbued with metadata: a sensitive attribute $s_i \in S = \{s_1, \ldots, s_t\}$ associated with each point. Like the classes above, these sensitive attributes form a partition on the data such that $D = \bigsqcup_{s \in S} \{(x_i, y_i, s_i) \mid s_i = s\}$. Without loss of generality, we assume a single sensitive attribute. Generally speaking, learning with fairness in mind considers the output of a classifier based on the partition of data by the sensitive attribute, where some objective behavior, like minimizing disparate impact or treatment (Zafar et al., 2019), is integrated into the loss function or learning procedure to find the optimal parameters $\theta$.

Figure 2.1: A toy example showing robustness bias. A.) The classifier (solid line) has 100% accuracy for blue and green points. However, for a budget $\tau$ (dotted lines), 70% of points belonging to the "round" subclass (shown by dark blue and dark green) will get attacked while only 30% of points in the "cross" subclass will be attacked. This shows a clear bias against the "round" subclass, which is less robust in this case. B.) A different classifier for the same data points, also with 100% accuracy. However, in this case, with the same budget $\tau$, 30% of both the "round" and "cross" subclasses will be attacked, which is thus less biased.

Figure 2.2: An example of multinomial logistic regression. (a) Three-class classification problem for randomly generated data. (b) Proportion of samples which are greater than $\tau$ away from a decision boundary.

There is not a one-to-one correspondence between decision boundaries and classifier performance. For any given performance level on a test dataset, there are infinitely many decision boundaries which produce the same performance; see Figure 2.1. This raises the question: if we consider all decision boundaries or model parameters which achieve a certain performance, how do we choose among them? What are the properties of a desirable, high-performing decision boundary?
As the community has discovered, one undesirable characteristic of a decision boundary is its proximity to data which might be susceptible to adversarial attack (Goodfellow et al., 2015; Szegedy et al., 2014; Papernot et al., 2016). This provides intuition that we should prefer boundaries that are as far away as possible from example data (Suykens and Vandewalle, 1999; Boser et al., 1992).

Let us look at how this plays out in a simple example. In multinomial logistic regression, the decision boundaries are well understood and can be written in closed form. This makes it easy for us to compute how close each point is to a decision boundary. Consider, for example, a dataset and learned classifier as in Figure 2.2a. For this dataset, we observe that the brown class, as a whole, is closer to a decision boundary than the yellow or blue classes. We can quantify this by plotting the proportion of data that are greater than a distance $\tau$ away from a decision boundary, and then varying $\tau$. Let $d_\theta(x)$ be the minimal distance between a point $x$ and a decision boundary corresponding to parameters $\theta$. For a given partition $\mathcal{P}$ of a dataset $D$, such that $D = \bigsqcup_{P \in \mathcal{P}} P$, we define the function:

$$\hat{I}_P(\tau) = \frac{\big|\{(x, y) \in P \mid d_\theta(x) > \tau,\ y = \hat{y}\}\big|}{|P|}$$

If each element of the partition is uniquely defined by an element, say a class label $c$ or a sensitive attribute label $s$, we will equivalently write $\hat{I}_c(\tau)$ or $\hat{I}_s(\tau)$, respectively. We plot this over a range of $\tau$ in Figure 2.2b for the toy classification problem in Figure 2.2a. Observe that the function for the brown class decreases significantly faster than the other two classes, quantifying how much closer the brown class is to the decision boundary.

From a strictly classification-accuracy point of view, the brown class being significantly closer to the decision boundary is not of concern; all three classes achieve similar classification accuracy. However, when we move away from this toy problem and into neural networks on real data, this difference between the classes could become a potential vulnerability to exploit, particularly when we consider adversarial examples.

Figure 2.3: An example of robustness bias in the UTKFace dataset. A model trained to predict age group from faces is fooled for inputs belonging to certain subgroups (black and female in this example) for a given $\ell_2$ perturbation of 0.5, but is robust for inputs belonging to other subgroups (white and male in this example) for the same magnitude of perturbation. We use the UTKFace dataset to make a broader point that robustness bias can cause harms. In the specific case of UTKFace (and similar datasets), the task definition of predicting age from faces itself is flawed, as has been noted in many previous studies (Cramer et al., 2019; Crawford and Paglen, 2019; Buolamwini and Gebru, 2018).

2.3 Robustness Bias

Our goal is to understand how susceptible different classes are to perturbations (e.g., natural noise, adversarial perturbations). Ideally, no one class would be more susceptible than any other, but this may not be possible. We have observed that, for the same dataset, some classifiers exhibit differences between partitions in their distance to a decision boundary, and some do not.
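The distances $d_\theta(x)$ in the toy example above can be computed exactly because the decision boundaries of a multinomial logistic regression are linear. Below is a minimal numpy sketch of that computation; it assumes a weight matrix W and bias vector b from a fitted model (for example, scikit-learn's coef_ and intercept_), and the function name is my own.

```python
# Minimal sketch: exact distance to the nearest decision boundary of a linear
# (multinomial logistic regression) classifier. W, b, and names are illustrative.
import numpy as np

def distances_to_boundary(X, W, b):
    """X: (n, d) inputs; W: (k, d) class weights; b: (k,) biases.
    Returns exact L2 distances d_theta(x) and predicted labels."""
    scores = X @ W.T + b                                  # (n, k)
    preds = scores.argmax(axis=1)
    n, k = X.shape[0], W.shape[0]
    dists = np.full(n, np.inf)
    for j in range(k):
        dw = W[preds] - W[j]                              # (n, d)
        margin = scores[np.arange(n), preds] - scores[:, j]
        denom = np.linalg.norm(dw, axis=1)
        valid = (preds != j) & (denom > 0)
        # distance from x to the hyperplane where score(pred) == score(j)
        dists[valid] = np.minimum(dists[valid], margin[valid] / denom[valid])
    return dists, preds
```

For a point predicted as class $i$, the distance to the boundary shared with class $j$ is the score margin divided by $\|w_i - w_j\|_2$; the minimum over $j$ is exactly $d_\theta(x)$ for a linear model.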
There may also be one partition $\mathcal{P}$ which exhibits this discrepancy, and another partition $\mathcal{P}'$ which does not. Therefore, we make the following statement about robustness bias:

Definition 1. A dataset $D$ with a partition $\mathcal{P}$ and a classifier parameterized by $\theta$ exhibits robustness bias if there exists an element $P \in \mathcal{P}$ for which the elements of $P$ are either significantly closer to (or significantly farther from) a decision boundary than elements not in $P$.

A partition $\mathcal{P}$ may be based on sensitive attributes such as race, gender, or ethnicity—or other class labels. For example, given a classifier and a dataset with sensitive attribute "race," we might say that the classifier exhibits robustness bias if, partitioning on that sensitive attribute, the members with some particular value of "race" are, on average, substantially closer to the decision boundary than other members. We might say that a dataset, partition, and classifier do not exhibit robustness bias if, for all $P, P' \in \mathcal{P}$ and all $\tau > 0$,

$$\mathbb{P}_{(x,y) \in D}\{d_\theta(x) > \tau \mid x \in P,\ y = \hat{y}\} \approx \mathbb{P}_{(x,y) \in D}\{d_\theta(x) > \tau \mid x \in P',\ y = \hat{y}\}. \tag{2.1}$$

Intuitively, this definition requires that, for a given perturbation budget $\tau$ and a given partition element $P$, one should not have any incentive to perturb data points from $P$ over points that do not belong to $P$. Even when examining this criterion, we can see that it might be particularly hard to satisfy. Thus, we want to quantify the disparate susceptibility of each element of a partition to adversarial attack, i.e., how much farther or closer it is to a decision boundary when compared to all other points. We can do this with the following function for a dataset $D$ with partition element $P \in \mathcal{P}$ and classifier parameterized by $\theta$:

$$\mathrm{RB}(P, \tau) = \Big|\, \mathbb{P}_{x \in D}\{d_\theta(x) > \tau \mid x \in P,\ y = \hat{y}\} - \mathbb{P}_{x \in D}\{d_\theta(x) > \tau \mid x \notin P,\ y = \hat{y}\} \,\Big| \tag{2.2}$$

Observe that $\mathrm{RB}(P, \tau)$ is a large value if and only if the elements of $P$ are much more (or less) adversarially robust than elements not in $P$. We can then quantify this for each element $P \in \mathcal{P}$—but a more pernicious variable to handle is $\tau$. We propose to look at the area under the curve $\hat{I}_P$ over all $\tau$:

$$\sigma(P) = \frac{\mathrm{AUC}(\hat{I}_P) - \mathrm{AUC}\big(\sum_{P' \neq P} \hat{I}_{P'}\big)}{\mathrm{AUC}\big(\sum_{P' \neq P} \hat{I}_{P'}\big)} \tag{2.3}$$

Figure 2.4: For each dataset, we plot $\hat{I}_c(\tau)$ for each class $c$. Each blue line represents one class. The red line represents the mean of the blue lines, i.e., $\frac{1}{|C|}\sum_{c \in C} \hat{I}_c(\tau)$ for each $\tau$.

Figure 2.5: For each dataset, we plot $\hat{I}_s(\tau)$ for each sensitive attribute $s$.

Note that these notions take into account the distances of data points from the decision boundary and hence are orthogonal and complementary to other traditional notions of bias or fairness (e.g., disparate impact/disparate mistreatment (Zafar et al., 2019), etc.). This means that having lower robustness bias does not necessarily come at the cost of fairness as measured by these notions. Consider the motivating example shown in Figure 2.1: the decision boundary on the right has lower robustness bias but preserves all other common notions (e.g., Hardt et al., 2016; Dwork et al., 2012; Zafar et al., 2017c) as both classifiers maintain 100% accuracy.

2.3.1 Real-world Implications: Degradation of Quality of Service

Deep neural networks are the core of many real-world applications, for example, facial recognition, object detection, etc. In such cases, perturbations in the input can occur due to multiple factors such as noise due to the environment or malicious intent by an adversary.
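The quantities defined in Section 2.3 are straightforward to compute once per-point distances (or distance proxies) are available. Below is a minimal numpy sketch of $\hat{I}_P$, $\mathrm{RB}(P, \tau)$, and $\sigma(P)$; the function names and the trapezoidal AUC approximation are my own choices, not the exact implementation used in this chapter.

```python
# Minimal sketch of the Section 2.3 quantities, computed from per-point distances
# d_theta(x) (exact, or a proxy from Section 2.4) and boolean masks.
import numpy as np

def I_hat_P(dists, correct, in_P, taus):
    """I_hat_P(tau) over a grid of taus: fraction of P that is correctly
    classified and farther than tau from the decision boundary."""
    return np.array([np.sum(in_P & correct & (dists > t)) for t in taus]) / in_P.sum()

def RB(dists, correct, in_P, tau):
    """RB(P, tau) (Eq. 2.2): gap in the 'safe' fraction between P and its
    complement, conditioned on correctly classified points."""
    p_in = np.mean(dists[in_P & correct] > tau)
    p_out = np.mean(dists[~in_P & correct] > tau)
    return abs(p_in - p_out)

def sigma(dists, correct, labels, P_value, taus):
    """sigma(P) (Eq. 2.3), following the formula as written: the reference curve
    is the sum of the I_hat curves over all other partition elements."""
    curve_P = I_hat_P(dists, correct, labels == P_value, taus)
    curve_rest = sum(I_hat_P(dists, correct, labels == v, taus)
                     for v in np.unique(labels) if v != P_value)
    auc_P, auc_rest = np.trapz(curve_P, taus), np.trapz(curve_rest, taus)
    return (auc_P - auc_rest) / auc_rest
```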
Previous works have highlighted how harms can be caused by the degradation in quality of service for certain sub-populations (Cramer et al., 2019; Holstein et al., 2019). Figure 2.3 shows an example of inputs from the UTKFace dataset where an $\ell_2$ perturbation of 0.5 could change the predicted label for an input with race "black" and gender "female," but an input with race "white" and gender "male" was robust to the same magnitude of perturbation. In such a case, the system worked better for a certain sub-group (white, male), thus resulting in unfairness. It is important to note that we use datasets such as Adience and UTKFace (described in detail in Section 2.5) only to demonstrate the importance of having unbiased robustness. As noted in previous works, the very task of predicting age from a person's face is a flawed task definition with many ethical concerns (Cramer et al., 2019; Buolamwini and Gebru, 2018; Crawford and Paglen, 2019).

2.4 Measuring Robustness Bias

Robustness bias as defined in the previous section requires a way to measure the distance between a point and the (closest) decision boundary. For the deep neural networks in use today, a direct computation of $d_\theta(x)$ is not feasible due to their highly complicated and non-convex decision boundary. However, we show that we can leverage existing techniques from the literature on adversarial attacks to efficiently approximate $d_\theta(x)$. We describe these in more detail in this section.

2.4.1 Adversarial Attacks (Upper Bound)

For a given input and model, one can compute an upper bound on $d_\theta(x)$ by performing an optimization which alters the input image slightly so as to place the altered image into a different category than the original. Assume that for a given data point $x$ we are able to compute an adversarial image $\tilde{x}$; then the distance between these two images provides an upper bound on the distance to a decision boundary, i.e., $\|x - \tilde{x}\| \geq d_\theta(x)$. We evaluate two adversarial attacks: DeepFool (Moosavi-Dezfooli et al., 2016) and Carlini-Wagner's $\ell_2$ attack (Carlini and Wagner, 2017). We extend $\hat{I}_P$ for DeepFool and CarliniWagner as

$$\hat{I}^{DF}_P(\tau) = \frac{\big|\{(x, y) \in P \mid \tau < \|x - \tilde{x}\|,\ y = \hat{y}\}\big|}{|P|} \tag{2.4}$$

and

$$\hat{I}^{CW}_P(\tau) = \frac{\big|\{(x, y) \in P \mid \tau < \|x - \tilde{x}\|,\ y = \hat{y}\}\big|}{|P|} \tag{2.5}$$

respectively. We use similar notation to define $\sigma^{DF}(P)$ and $\sigma^{CW}(P)$ ($\sigma$ as defined in Eq. 2.3). While these methods are guaranteed to yield upper bounds on $d_\theta(x)$, they need not yield similar behavior to $\hat{I}_P$ or $\sigma(P)$. We perform an evaluation of this in Section 2.7.1.

2.4.2 Randomized Smoothing (Lower Bound)

Alternatively, one can compute a lower bound on $d_\theta(x)$ using techniques from recent works on training provably robust classifiers (Salman et al., 2019; Cohen et al., 2019). For each input, these methods calculate a radius within which the prediction of $x$ will not change (i.e., the robustness certificate). In particular, we use the randomized smoothing method (Cohen et al., 2019; Salman et al., 2019) since it is scalable to large and deep neural networks and leads to the state of the art in provable defenses. Randomized smoothing transforms the base classifier $f$ into a new smooth classifier $g$ by averaging the output of $f$ over noisy versions of $x$. This new classifier $g$ is more robust to perturbations while also having accuracy on par with the original classifier. It is also possible to calculate the radius $\delta_x$ (in the $\ell_2$ distance) within which, with high probability, a given input's prediction remains the same for the smoothed classifier (i.e., $d_\theta(x) \geq \delta_x$).
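To make the certification step concrete, the sketch below estimates the smoothed prediction and an approximate certified $\ell_2$ radius in the style of Cohen et al. (2019). The noise level, sample count, and function names are placeholders; the actual CERTIFY procedure additionally uses separate selection and estimation samples, a one-sided confidence bound on the top-class probability, and abstention.

```python
# Minimal sketch of a randomized-smoothing certificate in the style of
# Cohen et al. (2019). Plain Monte Carlo estimates are used for clarity; the
# real CERTIFY procedure uses separate selection/estimation samples, a lower
# confidence bound on p_A, and abstains when that bound is <= 1/2.
import torch
from scipy.stats import norm

@torch.no_grad()
def certified_radius(model, x, num_classes, sigma=0.25, n_samples=1000, batch=100):
    """Return the smoothed classifier's prediction for x and an approximate
    certified l2 radius delta_x = sigma * Phi^{-1}(p_A)."""
    model.eval()
    counts = torch.zeros(num_classes, dtype=torch.long)
    remaining = n_samples
    while remaining > 0:
        b = min(batch, remaining)
        noise = sigma * torch.randn((b,) + tuple(x.shape), device=x.device)
        preds = model(x.unsqueeze(0) + noise).argmax(dim=1)
        counts += torch.bincount(preds.cpu(), minlength=num_classes)
        remaining -= b
    top_class = int(counts.argmax())
    p_a = counts[top_class].item() / n_samples   # point estimate of the top-class probability
    if p_a <= 0.5:
        return top_class, 0.0                    # the real method would abstain here
    return top_class, sigma * norm.ppf(p_a)
```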
A given input $x$ is then said to be provably robust, with high probability, to a $\delta_x$ $\ell_2$-perturbation, where $\delta_x$ is the robustness certificate of $x$. For each point we use its $\delta_x$, calculated using the method proposed by Salman et al. (2019), as a proxy for $d_\theta(x)$. The magnitude of $\delta_x$ for an input is a measure of how robust that input is: inputs with higher $\delta_x$ are more robust than inputs with smaller $\delta_x$. Again, we extend $\hat{I}_P$ for randomized smoothing as

$$\hat{I}^{RS}_P(\tau) = \frac{\big|\{(x, y) \in P \mid \tau < \delta_x,\ y = \hat{y}\}\big|}{|P|} \tag{2.6}$$

We use similar notation to define $\sigma^{RS}(P)$ (see Eq. 2.3).

2.5 Empirical Evidence of Robustness Bias in the Wild

We hypothesize that there exist datasets and model architectures which exhibit robustness bias. To investigate this claim, we examine several image-based classification datasets and common model architectures.

2.5.0.1 Datasets and Model Architectures:

We perform these tests on the datasets CIFAR-10 (Krizhevsky, 2009), CIFAR-100 (Krizhevsky, 2009) (using both the 100 classes and the 20 superclasses), Adience (Eidinger et al., 2014), and UTKFace (Zhang et al., 2017). The first two are widely accepted benchmarks in image classification, while the latter two provide significant metadata about each image, permitting various partitions of the data by final classes and sensitive attributes.

CIFAR-10, CIFAR-100, CIFAR100Super. These are standard deep learning benchmark datasets. Both CIFAR-10 and CIFAR-100 contain 60,000 images in total, split into 50,000 train and 10,000 test images. The task is to classify a given image. Images are normalized with a mean and standard deviation of (0.5, 0.5, 0.5).

UTKFace. Contains images of people labeled with race, gender, and age. We split the dataset into a random 80:20 train:test split to get 4,742 test and 18,966 train samples. We bin the age into 5 age groups and convert this into a 5-class classification problem. Images are normalized with a mean and standard deviation of 0 and 1, respectively.

Adience. Contains images of people labeled with gender and age group. The task is to classify a given image into one of 8 age groups. We split the dataset into a random 80:20 train:test split to get 14,007 train and 3,445 test samples. Images are normalized with a mean and standard deviation of (0.485, 0.456, 0.406) and (0.229, 0.224, 0.225), respectively.

Our experiments were performed using PyTorch's torchvision module (Paszke et al., 2019). We first explore a simple multinomial logistic regression model which can be fully analyzed with a direct computation of the distance to the nearest decision boundary. For convolutional neural networks, we focus on Alexnet (Krizhevsky, 2014), VGG19 (Simonyan and Zisserman, 2015), ResNet50 (He et al., 2016a), DenseNet121 (Huang et al., 2017), and Squeezenet1_0 (Iandola et al., 2016), which are all available through torchvision. We use these models since they are widely used for a variety of tasks. We achieve performance that is comparable to state-of-the-art performance on these datasets for these models. Additionally, we also train some other popularly used dataset-specific architectures, such as a deep convolutional neural network (we call this Deep CNN) [1] and PyramidNet (α = 64, depth=110, no bottleneck) (Han et al., 2017) for CIFAR-10. We re-implemented Deep CNN in PyTorch and used the publicly available repo to train PyramidNet [2]. We use another deep convolutional neural network (which we refer to as Deep CNN CIFAR100) [3] and PyramidNet (α = 48, depth=164, with bottleneck) for CIFAR-100 and CIFAR-100Super.

[1] http://torch.ch/blog/2015/07/30/cifar.html
[2] https://github.com/dyhan0920/PyramidNet-PyTorch
[3] https://github.com/aaron-xichen/pytorch-playground/blob/master/cifar/model.py
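As a concrete illustration of the preprocessing described for UTKFace and Adience above, here is a minimal sketch of age binning, normalization, and an 80:20 split. The bin edges and image size are placeholders (not necessarily those used in this chapter); the normalization constants shown are the Adience ones quoted above.

```python
# Minimal preprocessing sketch for the UTKFace/Adience-style setup described above:
# bin ages into 5 groups, normalize images, and make an 80:20 train/test split.
import bisect
import torch
from torchvision import transforms

AGE_BIN_UPPER = [18, 30, 45, 60]          # placeholder edges -> 5 age groups

def age_to_group(age: int) -> int:
    return bisect.bisect_right(AGE_BIN_UPPER, age)

transform = transforms.Compose([
    transforms.Resize((224, 224)),        # placeholder input size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def split_80_20(dataset, seed=0):
    n_train = int(0.8 * len(dataset))
    return torch.utils.data.random_split(
        dataset, [n_train, len(dataset) - n_train],
        generator=torch.Generator().manual_seed(seed))
```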
For Adience and UTKFace we additionally use simple deep convolutional neural networks with multiple convolutional layers, each of which is followed by a ReLU activation, dropout, and max-pooling. As opposed to the architectures from torchvision (which are pre-trained on ImageNet), these architectures are trained from scratch on the respective datasets. We refer to them as UTK Classifier and Adience Classifier, respectively. These simple models serve two purposes: they form reasonable baselines for comparison with pre-trained ImageNet models fine-tuned on the respective datasets, and they allow us to analyze robustness bias when models are trained from scratch. Accuracy of models trained on the datasets can be found in Table 2.1.

In Sections 2.7 and 2.8 we audit these datasets and the listed models for robustness bias. In Section 2.6, we train logistic regression on all the mentioned datasets and evaluate robustness bias using an exact computation. We then show in Sections 2.7 and 2.8 that robustness bias can be efficiently approximated using the techniques mentioned in Sections 2.4.1 and 2.4.2, respectively, for much more complicated models, which are often used in the real world. We also provide a thorough analysis of the types of robustness biases exhibited by some of the popularly used models on these datasets.

Table 2.1: Test data performance (accuracy, %) of all models on different datasets.

Dataset | Deep CNN | PyramidNet | Adience Classifier | UTK Classifier | Resnet50 | Alexnet | VGG | Densenet | Squeezenet
Adience | - | - | 48.80 | - | 49.75 | 46.04 | 51.41 | 50.80 | 49.49
UTKFace | - | - | - | 66.25 | 69.82 | 68.09 | 69.89 | 69.15 | 70.73
CIFAR10 | 86.97 | 86.92 | - | - | 83.26 | 92.08 | 89.53 | 85.17 | 76.97
CIFAR100 | 59.60 | 56.42 | - | - | 55.81 | 71.31 | 64.39 | 61.05 | 40.36
CIFAR100Super | 71.78 | 67.55 | - | - | 67.27 | 80.70 | 76.06 | 71.22 | 55.16

2.6 Exact Computation in a Simple Model: Multinomial Logistic Regression

We begin our analysis by studying the behavior of multinomial logistic regression. Admittedly, this is a simple model compared to modern deep-learning-based approaches; however, it enables us to explicitly compute the exact distance to a decision boundary, $d_\theta(x)$. We fit a regression to each of our vision datasets on their native classes and plot $\hat{I}_c(\tau)$ for each dataset. Figure 2.2 shows the distributions of $\hat{I}_c(\tau)$, from which we observe three main phenomena: (1) the general shape of the curves is similar for each dataset, (2) there are classes which are significant outliers from the other classes, and (3) the range of support of $\tau$ for each dataset varies significantly. We discuss each of these individually.

First, we note that the shape of the curves for each dataset is qualitatively similar. Since the decision boundaries in multinomial logistic regression are linear delineations in the input space, it is fair to assume that this similarity in shape in Figure 2.4 can be attributed to the nature of the classifier.

Second, there are classes $c$ which receive disparate treatment under $\hat{I}_c(\tau)$. The treatment disparities are most notable in UTKFace, the superclass version of CIFAR-100, and regular CIFAR-100. This suggests that, when considering the dataset as a whole, these outlier classes are less susceptible to adversarial attack than other classes.
Further, in UTKFace, there are some classes that are considerably more susceptible to adversarial attack because a larger proportion of those classes is closer to the decision boundaries.

We also observe that the median distance to a decision boundary can vary based on the dataset. The median distance to a decision boundary for each dataset is: 0.40 for CIFAR-10; 0.10 for CIFAR-100; 0.06 for the superclass version of CIFAR-100; 0.38 for Adience; and 0.12 for UTKFace. This is no surprise, as $d_\theta(x)$ depends both on the location of the data points (which are fixed and immovable in a learning environment) and on the choice of architecture/parameters.

Finally, we consider another partition of the datasets. Above, we considered the partition of each dataset by its class labels. With the Adience and UTKFace datasets, we have an additional partition by sensitive attributes: Adience admits a partition based on gender; UTKFace admits partitions by gender and ethnicity. We note that Adience and UTKFace use categorical labels for these multidimensional and socially complex concepts. We know this to be reductive and to minimize the contextualization within which race and gender derive their meaning (Hanna et al., 2020; Buolamwini and Gebru, 2018). Further, we acknowledge the systems and notions that were used to reify such data partitions and the subsequent implications and conclusions drawn therefrom. We use these socially and systemically laden partitions to demonstrate that the functions we define, $\hat{I}_P$ and $\sigma$, depend upon how the data are divided for analysis.

To that end, the function $\hat{I}_P$ is visualized in Figure 2.5. We observe that the Adience dataset, which exhibited some adversarial robustness bias in the partition on $C$, only exhibits minor adversarial robustness bias in the partition on $S$ for the attribute "Female." On the other hand, UTKFace, which had significant adversarial robustness bias, does exhibit the phenomenon for the sensitive attribute "Black" but not for the sensitive attribute "Female." This emphasizes that adversarial robustness bias is dependent upon the dataset and the partition. We will demonstrate later that it is also dependent on the choice of classifier. First, we discuss ways to approximate $d_\theta(x)$ for more complicated models.

2.7 Evaluation of Robustness Bias using Adversarial Attacks

In Section 2.4.1, we argued that adversarial attacks can be used to obtain upper bounds on $d_\theta(x)$ which can then be used to measure robustness bias. In this section we audit some popularly used models on the datasets mentioned in Section 2.5 for robustness bias as measured using the approximation given by adversarial attacks.

2.7.1 Evaluation of $\hat{I}^{DF}_P$ and $\hat{I}^{CW}_P$

To compare the estimates of $d_\theta(x)$ by DeepFool and CarliniWagner, we first look at the signedness of $\sigma(P)$, $\sigma^{DF}(P)$, and $\sigma^{CW}(P)$. For a given partition element $P$, $\sigma(P)$ captures the disparity in robustness between points in $P$ relative to points not in $P$ (see Eq. 2.3). Considering all 151 possible partitions (based on class labels and sensitive attributes, where available) for all five datasets, both CarliniWagner and DeepFool agree with the signedness of the direct computation 125 times, i.e.,

$$\sum_{P} \mathbb{1}\big[\operatorname{sign}(\sigma(P)) = \operatorname{sign}(\sigma^{DF}(P))\big] = 125 = \sum_{P} \mathbb{1}\big[\operatorname{sign}(\sigma(P)) = \operatorname{sign}(\sigma^{CW}(P))\big].$$

Further, the mean difference between $\sigma(P)$ and the attack-based estimates, i.e., $\sigma(P) - \sigma^{DF}(P)$ and $\sigma(P) - \sigma^{CW}(P)$, is 0.17 for DeepFool and 0.19 for CarliniWagner, with variances of 0.07 and 0.06, respectively.
There is 83% agreement between the direct computation and the DeepFool and CarliniWagner estimates of $\hat{I}_P$. This behavior provides evidence that adversarial attacks provide meaningful upper bounds on $d_\theta(x)$ in terms of identifying instances of robustness bias.

2.7.2 Audit of Commonly Used Models

We now evaluate five commonly used convolutional neural networks (CNNs): Alexnet, VGG, ResNet, DenseNet, and Squeezenet. We trained these networks using PyTorch with standard stochastic gradient descent. We achieve performance comparable to the documented state of the art for these models on these datasets. After training each model on each dataset, we generated adversarial examples using both methods and computed $\sigma(P)$ for each possible partition of the dataset. An example of the results for the UTKFace dataset can be seen in Figure 2.7.

With evidence from Section 2.7.1 that DeepFool and CarliniWagner can approximate the robustness bias behavior of direct computations of $d_\theta$, we first ask if there are any major differences between the two methods. If DeepFool exhibits adversarial robustness bias for a dataset, a model, and a class, does CarliniWagner exhibit the same, and vice versa? Since there are 5 different convolutional models, we have $151 \cdot 5 = 755$ different comparisons to make. Again, we first look at the signedness of $\sigma^{DF}(P)$ and $\sigma^{CW}(P)$, and we see that

$$\sum_{P} \mathbb{1}\big[\operatorname{sign}(\sigma^{DF}(P)) = \operatorname{sign}(\sigma^{CW}(P))\big] = 708.$$

This means there is 94% agreement between DeepFool and CarliniWagner about the direction of the adversarial robustness bias.

To investigate whether this behavior is exhibited earlier in the training cycle than at the final, fully trained model, we compute $\sigma^{CW}(P)$ and $\sigma^{DF}(P)$ for the various models and datasets using the trained models after the first epoch and the middle epoch. For the first epoch, 637 of the 755 partitions were internally consistent for DeepFool, i.e., the signedness of $\sigma$ was the same at the first and last epoch, and 621 were internally consistent for CarliniWagner. At the middle epoch, 671 of the 755 partitions were internally consistent for DeepFool and 665 were internally consistent for CarliniWagner. Unsurprisingly, this implies that as training progresses, the adversarial robustness bias behavior increasingly matches that of the final model. However, it is surprising that much more than 80% of the final behavior is determined after the first epoch, and there is a slight increase in agreement by the middle epoch.

We note that, of course, adversarial robustness bias is not necessarily an intrinsic property of a dataset; it may be exhibited by some models and not by others. However, in our studies, we see that the UTKFace dataset partition on Race/Ethnicity does appear to be significantly prone to adversarial attacks given its comparatively low $\sigma^{DF}(P)$ and $\sigma^{CW}(P)$ values across all models.

2.8 Evaluation of Robustness Bias using Randomized Smoothing

In Section 2.4.2, we argued that randomized smoothing can be used to obtain lower bounds on $d_\theta(x)$ which can then be used to measure robustness bias. In this section we audit popular models on a variety of datasets (described in detail in Section 2.5) for robustness bias, as measured using the approximation given by randomized smoothing.

2.8.1 Evaluation of $\hat{I}^{RS}_P$

To assess whether the estimate of $d_\theta(x)$ by randomized smoothing is an appropriate measure of robustness bias, we compare the signedness of $\sigma(P)$ and $\sigma^{RS}(P)$.
When $\sigma(P)$ has positive sign, a higher magnitude indicates higher robustness for members of partition element $P$ compared to members not included in $P$; similarly, when $\sigma(P)$ is negatively signed, a higher magnitude corresponds to lesser robustness for the members of $P$ (see Eq. 2.3). We may interpret shared signedness of $\sigma(P)$ (where $d_\theta(x)$ is computed directly) and $\sigma^{RS}(P)$ (where $d_\theta(x)$ is measured by randomized smoothing as described in Section 2.4.2) as positive support for the $\hat{I}^{RS}_P$ measure.

As in Section 2.7.1, we consider all 151 possible partitions across CIFAR-10, CIFAR-100, CIFAR-100Super, UTKFace, and Adience. For each of these partitions, we compare $\sigma^{RS}(P)$ to the corresponding $\sigma(P)$. We find that their signs agree 101 times, i.e.,

$$\sum_{P} \mathbb{1}\big[\operatorname{sign}(\sigma(P)) = \operatorname{sign}(\sigma^{RS}(P))\big] = 101,$$

giving 66.9% agreement. Furthermore, the mean difference between $\sigma(P)$ and $\sigma^{RS}(P)$, i.e., $\sigma(P) - \sigma^{RS}(P)$, is 0.08 with a variance of 0.19. This provides evidence that randomized smoothing can also provide a meaningful estimate of $d_\theta(x)$ in terms of measuring robustness bias.

2.8.2 Audit of Commonly Used Models

We now evaluate the same models and all the datasets for robustness bias as measured by randomized smoothing. Our comparison is analogous to the one performed in Section 2.7.2 using adversarial attacks. Figure 2.8 shows results for all models on the UTKFace dataset. Here we plot $\sigma^{RS}(P)$ for each partition of the dataset (on the x-axis) and for each model (on the y-axis). A darker color in the heatmap indicates higher robustness bias (darker red indicates that the partition is less robust than others, whereas darker blue indicates that the partition is more robust). We can see that some partitions, for example the partition based on the class label "40-60" and the partition based on race "black," tend to be less robust in the final trained model, for all models (indicated by a red color across all models).
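The agreement statistics and heatmaps above reduce to a few lines of array manipulation once the $\sigma$ values are collected. Below is a minimal sketch; the input containers and toy numbers are placeholders, not results from this chapter.

```python
# Minimal sketch: sign agreement between two sets of sigma estimates, and the
# per-model / per-partition matrix behind heatmaps such as Figures 2.7 and 2.8.
import numpy as np

def sign_agreement(sigma_a, sigma_b):
    """Count and fraction of partitions on which two sigma estimates share a sign."""
    sigma_a, sigma_b = np.asarray(sigma_a), np.asarray(sigma_b)
    agree = np.sum(np.sign(sigma_a) == np.sign(sigma_b))
    return int(agree), agree / len(sigma_a)

def sigma_matrix(sigma_by_model):
    """Stack {model_name: [sigma(P) for each partition]} into a
    (models x partitions) array suitable for a heatmap (e.g., imshow)."""
    names = sorted(sigma_by_model)
    return names, np.vstack([sigma_by_model[m] for m in names])

# Example with toy numbers (not results from this chapter):
# count, frac = sign_agreement([-0.2, 0.1, 0.3], [-0.1, 0.2, -0.4])
```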