ABSTRACT Title of Dissertation: BUILDING RELIABLE AI UNDER DISTRIBUTION SHIFTS Bang An Doctor of Philosophy, 2025 Dissertation Directed by: Associate Professor Furong Huang Department of Computer Science Machine learning models are increasingly deployed in real-world settings where distribu- tion shifts—differences between training and deployment data—can significantly impact their reliability. These shifts affect models in multiple ways, leading to degraded generalization, fair- ness collapse, loss of robustness, and new safety vulnerabilities. This dissertation investigates how to build reliable AI under distribution shifts, providing theoretical insights and practical solutions across diverse applications. We begin by studying generalization under distribution shifts, exploring how model in- variance affects performance. We introduce a theoretical framework that quantifies the role of data transformations in shaping generalization, providing insights into selecting transformations that improve model robustness in shifted environments. This foundation also extends to fair- ness, where we examine how pre-trained fair models fail when deployed in new distributions and propose a method to transfer fairness reliably under distribution shifts. Next, we focus on robust perception and AI-generated content under shifting distributions. We investigate how models interpret visual information, showing that contextual reasoning can help mitigate spurious correlations and improve robustness under domain shifts. We also assess the reliability of AI-generated content, revealing how image watermarks, designed for prove- nance tracking, often fail when subjected to real-world distortions and adversarial attacks. To address this, we introduce a comprehensive benchmark for evaluating watermark robustness, providing a framework for improving their reliability. Finally, we turn to safety in large language models (LLMs) and investigate how distribu- tion shifts in training and deployment introduce new vulnerabilities. We analyze false refusals in safety-aligned LLMs, demonstrating that misaligned decision boundaries lead to excessive con- servatism at test time. We also explore retrieval-augmented generation (RAG) models, showing that despite their promise, they can introduce new safety risks when deployed in settings for which they were not originally trained. Our findings highlight critical gaps in existing AI safety evaluations and emphasize the need for new methods tailored to evolving AI architectures. By addressing generalization, robustness, and safety under distribution shifts, this disser- tation contributes to a deeper understanding of these challenges and provides practical strategies for improving AI reliability in real-world deployment. BUILDING RELIABLE AI UNDER DISTRIBUTION SHIFTS by Bang An Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2025 Advisory Committee: Associate Professor Furong Huang, Chair/Advisor Professor Hal Daume Professor Min Wu Asssociate Professor Jia-Bin Huang Assistant Professor Sanghamitra Dutta © Copyright by Bang An 2025 Acknowledgments I would like to express my sincere gratitude to everyone who has contributed to the com- pletion of this thesis and made my graduate journey a truly memorable and enriching experience. First and foremost, I am deeply thankful to my advisor, Prof. 
Furong Huang, for granting me the invaluable opportunity to work on Reliable AI and giving me the flexibility to explore my interests. Her consistent guidance, accessibility, and unwavering support have been instrumental in my academic growth. I am incredibly fortunate to have had the chance to learn from and work alongside such an inspiring mentor. Without her support and belief in me, I would not have made it through many of the tough times. I am also grateful to my lab mates for their camaraderie, support, and collaboration. Spe- cial thanks to Dr. Jiahao Su and Dr. Yanchao Sun for their warm welcome and for helping me get started and feel at home in the lab when I first arrived at UMD. I have greatly benefited from my collaboration with Sicheng Zhu, Michael-Andrei Panaitescu-Liess, Yuancheng Xu, Tahseen Rabbani, Mucong Ding, Souradip Chakraborty, Amrit Singh Bedi, Zora Che, Zikui Cai, Cheng- hao Deng, Aakriti Agrawal, Pankayaraj Pathmanathan, Yuhang Zhou, Paiheng Xu, Xiaoyu Liu, Anirudh Satheesh, Shayan Shabihi, and my interactions with all other group members. I’m also thankful to Yuxin Wen from Tom’s lab for his generous sharing of ideas and valuable discussions. I would like to thank the members of my Ph.D. committee, Prof. Min Wu, Prof. Jia-bin Huang, Prof. Hal Daum´e III, and Prof. Sanghamitra Dutta. Their valuable time, feedback, and ii suggestions have helped me a lot to improve my research, dissertation, and presentation. I would also like to thank Dr. Tom Goldstein for advising many of my projects. I would like to express my appreciation to my internship mentors, Dr. Mark Dredze, Dr. Shiyue Zhang, Sam Sharpe, Dr. Bayan Bruss, Dr. Zhe Zhao, Dr. Lichan Hong, Dr. Ed Chi, and Dr. Xueting Han, who provided me with generous guidance and support during my internships at Bloomberg, Capital One, Google Brain, and Microsoft Research. I would also like to thank my other mentors and collaborators, Dr. Ruiyi Zhang, Dr. Chaithanya Kumar Mummadi, Prof. Changyou Chen, and Prof. Sanghyun Hong for their valuable contributions to my work. My deepest appreciation goes to my family—my mother and father—for their unwavering belief in me, constant encouragement, and steadfast support through every challenge. Their love and guidance have been my foundation. Special thanks to my husband, collaborator, lab mate, and soul mate, Sicheng Zhu. Thank you for walking this path with me—for your love, your presence, and for helping me grow into someone I’m proud of. Thank you to my beloved dog, Lucky, for bringing love, joy, and peace to me. Finally, I want to thank myself—for not giving up when things got tough, for staying committed through long nights and setbacks, and for believing in myself. I sincerely apologize if I have inadvertently left out anyone who has contributed to the completion of this thesis. You are a significant component of this journey. iii Table of Contents Acknowledgements ii Table of Contents iv List of Tables viii List of Figures xii List of Abbreviations xix Chapter 1: Introduction 1 1.1 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Dissertation Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 I Generalization and Fairness under Distribution Shifts 5 Chapter 2: Understanding the Generalization Benefit of Model Invariance from a Data Perspective 6 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . 10 2.3 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.4 Generalization Benefit of Model Invariance . . . . . . . . . . . . . . . . . . . . 13 2.4.1 Sample Cover Induced by Data Transformations . . . . . . . . . . . . . 13 2.4.2 Refined Complexity Analysis of Lipschitz Models . . . . . . . . . . . . 16 2.4.3 Framework for Model-invariance-sensitive Generalization Bounds . . . . 17 2.5 Sample Cover Estimation Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 19 2.6 Data-driven Selection of Data Transformations . . . . . . . . . . . . . . . . . . 21 2.7 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.7.1 Estimation of Sample Covering Numbers . . . . . . . . . . . . . . . . . 24 2.7.2 Evaluation of Generalization Benefit . . . . . . . . . . . . . . . . . . . . 25 2.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.9 Supplemental Materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.9.1 Complexity Measurements and Generalization Bounds . . . . . . . . . . 28 2.9.2 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.9.3 Refined Complexity Analysis for Linear Models . . . . . . . . . . . . . 37 2.9.4 Empirical Estimation of Sample Covering Numbers . . . . . . . . . . . . 40 iv 2.9.5 Experimental Details and Extended Experiments . . . . . . . . . . . . . 42 2.9.6 Extended Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 Chapter 3: Transferring Fairness under Distribution Shifts via Fair Consistency Regu- larization 52 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.3 Preliminaries and Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 3.4 Fairness under Distribution Shifts . . . . . . . . . . . . . . . . . . . . . . . . . 60 3.5 Transfer Fairness via Fair Consistency Regularization . . . . . . . . . . . . . . . 63 3.5.1 Theoretical Analysis: A Sufficient Condition for Transferring Fairness . . 63 3.5.2 Practical Algorithm: Fair Consistency Regularization . . . . . . . . . . . 66 3.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 3.6.1 Evaluation under Different Types of Distribution Shifts with a Synthetic Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 3.6.2 Evaluation on Real Datasets . . . . . . . . . . . . . . . . . . . . . . . . 71 3.6.3 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 3.8 Supplemental Materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 3.8.1 Proof and More Discussion of Fairness under Distribution Shifts . . . . . 76 3.8.2 Proof of the Sufficient Condition for Transferring Fairness . . . . . . . . 81 3.8.3 Details of Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 3.8.4 More Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 93 3.8.5 Impact and Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 II Vision Models under Distribution Shifts 99 Chapter 4: PerceptionCLIP: Visual Classification by Inferring and Conditioning on Contexts 100 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
101 4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 4.3 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 4.4 Preparing CLIP for Perception . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 4.4.1 Structuring and Describing Contextual Attributes . . . . . . . . . . . . . 106 4.4.2 Connecting Conditional Probabilities with CLIP Score . . . . . . . . . . 107 4.5 Contextual Attributes are Helpful and Inferable . . . . . . . . . . . . . . . . . . 109 4.5.1 Conditioning on Contextual Attributes is Helpful . . . . . . . . . . . . . 109 4.5.2 Contextual Attributes are Inferable . . . . . . . . . . . . . . . . . . . . . 112 4.6 PerceptionCLIP: Emulating Human Perception . . . . . . . . . . . . . . . . . 113 4.7 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 4.7.1 Zero-shot Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . 115 4.7.2 Group Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 4.9 Supplemental Materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 v 4.9.1 Extended Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 120 4.9.2 Image Caption Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 122 4.9.3 Human Visual Perception . . . . . . . . . . . . . . . . . . . . . . . . . 122 4.9.4 Approximating Conditional Probabilities . . . . . . . . . . . . . . . . . 123 4.9.5 Experimental Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 4.9.6 Additional Results and Analysis . . . . . . . . . . . . . . . . . . . . . . 132 4.9.7 Impact, Limitation and Future Work . . . . . . . . . . . . . . . . . . . . 140 Chapter 5: WAVES: Benchmarking the Robustness of Image Watermarks 142 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 5.2 Image Watermarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 5.3 Standardized Evaluation through WAVES . . . . . . . . . . . . . . . . . . . . . 147 5.3.1 Standardized Evaluation Workflow and Metrics . . . . . . . . . . . . . . 147 5.3.2 Stress-testing Watermarks . . . . . . . . . . . . . . . . . . . . . . . . . 150 5.4 Benchmarking Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . 157 5.4.1 Benchmarking Watermark Robustness . . . . . . . . . . . . . . . . . . 157 5.4.2 Benchmarking Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 5.4.3 Benchmarking Results for User Identification . . . . . . . . . . . . . . . 160 5.4.4 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162 5.4.5 Summary of Takeaway Messages . . . . . . . . . . . . . . . . . . . . . 164 5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 5.6 Supplementary Materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166 5.6.1 A Mini Survey of Image Watermarks . . . . . . . . . . . . . . . . . . . 166 5.6.2 Formalism of Watermark Detection and Identification . . . . . . . . . . . 170 5.6.3 Details on Performance Metrics . . . . . . . . . . . . . . . . . . . . . . 174 5.6.4 Design Choices of WAVES . . . . . . . . . . . . . . . . . . . . . . . . . 177 5.6.5 Evaluation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180 5.6.6 Details of Attacks . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . 187 5.6.7 Additional Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193 5.6.8 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207 III AI Safety Challenges due to Distribution Shifts 209 Chapter 6: Automatic Pseudo-Harmful Prompt Generation for Evaluating False Re- fusals in Large Language Models 210 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210 6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213 6.3 Defining Harmless, Controversial, and Harmful Prompts . . . . . . . . . . . . . 216 6.4 Automatic Pseudo-Harmful Prompt Generation . . . . . . . . . . . . . . . . . . 218 6.4.1 Surrogate Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218 6.4.2 Generation Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220 6.4.3 Steering the Generated Content . . . . . . . . . . . . . . . . . . . . . . 221 6.5 PHTest: A Dataset for False Refusal Evaluation . . . . . . . . . . . . . . . . . . 222 6.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224 vi 6.6.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224 6.6.2 Safety vs False-Refusal Trade-off . . . . . . . . . . . . . . . . . . . . . 226 6.6.3 Jailbreak Defenses Should Be Calibrated by False-Refusal Rates . . . . . 228 6.7 A Preliminary Exploration of Fine-Grained Alignment . . . . . . . . . . . . . . 229 6.7.1 Experimental Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229 6.7.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231 6.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234 6.9 Supplementary Materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235 6.9.1 Experimental Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235 6.9.2 Additional Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237 Chapter 7: RAG LLMs are Not Safer: A Safety Analysis of Retrieval-Augmented Gen- eration for Large Language Models 242 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242 7.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245 7.3 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246 7.4 RQ1: Are RAG-based LLMs safer than their non-RAG counterparts? . . . . . . . 248 7.5 RQ2: What makes RAG-based LLMs unsafe? . . . . . . . . . . . . . . . . . . . 250 7.5.1 Factor 1: Safety of the LLM . . . . . . . . . . . . . . . . . . . . . . . . 250 7.5.2 Factor 2: Retrieved Document Safety . . . . . . . . . . . . . . . . . . . 252 7.5.3 Factor 3: An LLM’s Capability on RAG Tasks . . . . . . . . . . . . . . 255 7.6 RQ3: Are red-teaming methods effective for RAG-based models? . . . . . . . . 258 7.6.1 Do non-RAG jailbreaks work for RAG? . . . . . . . . . . . . . . . . . . 260 7.6.2 Applying Jailbreaking Methods to RAG . . . . . . . . . . . . . . . . . . 261 7.7 Discussions on Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . 262 7.8 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262 7.9 Supplementary Materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264 7.9.1 Experimental Details and Additional Results for RQ1 . . . . . . . . . . . 264 7.9.2 Experimental Details and Additional Results for RQ2 . . . . 
. . . . . . . 268 7.9.3 Experimental Details and Additional Results for RQ3 . . . . . . . . . . . 289 Chapter 8: Conclusion 294 Bibliography 296 vii List of Tables 2.1 Classification accuracy and generalization gap (the difference between training and test accuracy) for ResNet18 on CIFAR-10. n denotes the sample size per class. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.2 Classification accuracy and generalization gap (the difference between training and test accuracy) for ResNet18 on ShapeNet. n denotes the sample size per class. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.3 Evaluation of ResNet18 on ShapeNet under 3D-view transformations. Linv de- notes the test invariance loss. Ainv denotes the test consistency accuracy (indicat- ing whether the model’s prediction is unchanged after data transformation) under the worst-case data transformations. . . . . . . . . . . . . . . . . . . . . . . . . 27 2.4 A diagram of the proof of Theorem 2.4.4. . . . . . . . . . . . . . . . . . . . . . 30 2.5 Some notations used in the proof of Theorem 2.4.4. . . . . . . . . . . . . . . . . 30 2.6 Data transformations used in our experiments. . . . . . . . . . . . . . . . . . . . 45 2.7 Sample covering numbers, classification accuracy, and generalization gap (the difference between training and test accuracy) for ResNet18 on CIFAR-100. . . . 47 2.8 Sample covering numbers, classification accuracy, and generalization gap (the difference between training and test accuracy) for ResNet18 on Restricted Ima- geNet. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 2.9 Sample covering number (SCN) without and with normalization and generaliza- tion performance of ResNet18 on CIFAR-10. . . . . . . . . . . . . . . . . . . . 48 2.10 Classification accuracy and generalization gap (the difference between training and test accuracy) for MLP on ShapeNet. n denotes the sample size per class. . . 49 3.1 Transfer fairness and accuracy from UTKFace to FairFace . . . . . . . . . . . . 72 3.2 Transfer fairness and accuracy from UTKFace to FairFace with weak transfor- mations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 3.3 Ablation study on UTKFace-FairFace task . . . . . . . . . . . . . . . . . . . . . 75 3.4 Latent factors in 3dshapes dataset. . . . . . . . . . . . . . . . . . . . . . . . . . 85 3.5 Simulate different distribution shifts. P(Y,A) is represented by the proportions of four groups as [P(Y = 0, A = 0),P(Y = 0, A = 1),P(Y = 1, A = 0),P(Y = 1, A = 1)]. P(D) is represented by the proportions of eight possible values of scale. Other factors have uniform marginal distributions. Images in two domains are sampled according to the marginal distributions of six latent factors. . . . . . 86 3.6 Statistics of UTKFace and FairFace datasets. . . . . . . . . . . . . . . . . . . . 88 3.7 Statistics of NewAdult dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 3.8 Statistics of UTK and FairFace datasets used in Table 3.9. . . . . . . . . . . . . . 93 3.9 Transfer fairness and accuracy from UTKFace to FairFace with less source data. . 94 viii 3.10 Results by using different transformations in our method. Average results of three trials. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 4.1 Conditional probabilities. x, y, and z denote image, class, and contextual at- tributes. z denotes (z1, . . . , zm) for simplicity. . . . . . . . 
. . . . . . . . . . . . 108 4.2 Classification accuracy (%) on ImageNet. We apply the left-side image transfor- mations to alter the corresponding attributes. Different methods condition on dif- ferent values of the contextual attributes. Conditioning on correct or self-inferred attribute values improves accuracy the most. . . . . . . . . . . . . . . . . . . . . 110 4.3 Inference accuracy (%) of two contextual attribute inference methods on Ima- geNet. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 4.4 Zero-shot classification accuracy on five datasets using ViT-B/16. The best result in each column is highlighted in bold, while the next three highest values are underlined. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 4.5 Classification accuracy of ViT-B/16 on different data domains with CLIP. . . . . 116 4.6 Intervening in inferring contextual attributes improves zero-shot classification. . 117 4.7 Average accuracy and worst group accuracy on the Waterbirds dataset. . . . . . 119 4.8 Average accuracy and worst group accuracy on the CelebA dataset. . . . . . . . 119 4.9 Image caption examples from LAION-400M (comparable to CLIP’s pretraining dataset). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 4.12 The average saliency (%) of the core feature and the spurious feature evaluated on the Waterbirds test set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 4.10 Summary of descriptions for different attributes used in Figure 4.3, Table 4.2 and Table 4.3. z∗ denotes the correct value of the contextual attribute, and zwrong denotes the wrong value of the contextual attribute. Ideally, each attribute has a distribution of text descriptions. Here, we use three descriptions and use the averaged text embeddings of them to calculate the CLIP score. . . . . . . . . . . 135 4.11 Similarity score and classification accuracy on ImageNet test set. We apply a composition of two transformation functions on images, and use the composition of attributes’ descriptions for text. . . . . . . . . . . . . . . . . . . . . . . . . . 136 4.13 Summary of contextual attributes and their value descriptions used in ImageNet- related datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 4.14 Datasets, domain templates and contextual attributes used in Table 4.5 . . . . . . 136 4.15 Domain templates, contextual attributes and their descriptions used in Table 4.7 and Table 4.8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 4.16 Ablation study on ImageNet and related datasets. . . . . . . . . . . . . . . . . . 137 4.17 Ablation study on different data domains. . . . . . . . . . . . . . . . . . . . . . 137 4.18 Performance of PerceptionCLIP using two order types in the attribute concate- nation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 4.19 Contextual attributes and their value descriptions for EuroSAT generated by GPT- 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 ix 5.1 Comparison of robustness evaluations with existing works. For categories of attacks, D, R, and A denote distortions, image regeneration, and adversarial attacks. Joint test means whether the performance and quality are jointly tested under a range of attack strengths. Our benchmark is the most comprehensive one, with a large scale of attacks, data, metrics, and more realistic evaluation setups. . 
144 5.2 A taxonomy of all the attacks in our stress-testing set. Novel attacks proposed by WAVES are marked with ∗. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 5.3 Comparison of attacks across three watermarking methods in detection setup. Q denotes the normalized quality degradation, and P denotes the performance as derived from Figure 5.7. Q@0.95P measures quality degradation at a 0.95 perfor- mance threshold where "inf" denotes cases where all tested attack strengths yield performance above 0.95, and "-inf" where all are below. A similar notation ap- plies to Q@0.7P. Avg P and Avg Q are the average performance and quality over all the attack strengths. The lower the performance and the smaller the quality degradation, the stronger the attack is. For each watermarking method, we rank attacks by Q@0.95P, Q@0.7P, Avg P, Avg Q, in that order, with lower values (↓) indicating stronger attacks. The top 5 attacks of each watermarking method are highlighted in red. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 5.4 A list of alternative watermarking algorithms not tested by WAVES in this work. 179 5.5 The knowledge of attackers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187 5.6 Comparison of attacks across three watermarking methods under the identi- fication setup (one million users). Q denotes the normalized quality degradation and P denotes the performance as derived from aggregated 2D plots. Q@0.7P measures quality degradation at a 0.7 performance threshold where "inf" denotes cases where all tested attack strengths yield performance above 0.7, and "-inf" where all are below. Q@0.4P is defined analogously. Avg P and Avg Q are the average performance and quality over all the attack strengths. The lower the per- formance and the smaller the quality degradation, the stronger the attack. For each watermarking method, we rank attacks by Q@0.7P, Q@0.4P, Avg P, Avg Q, in that order, with lower values (↓) indicating stronger attacks. The top 5 attack of each watermarking method are highlighted in red. . . . . . . . . . . . . . . . 194 6.1 Two experimental settings. The only difference is whether to use pseudo-harmful prompts in the training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230 6.2 Type-I false refusal: misinterpretation. LLMs falsely refuse some generated prompts because they misunderstand the prompts’ literal meanings or the users’ intentions. We label these prompts separately. Such false refusals imply a lack of understanding by the LLM, and they diminish as the LLM’s scale increases in our evaluations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240 6.3 Type-II false refusal: misalignment. LLMs falsely refuse some generated pseudo- harmful prompts because they apply the rules learned during safety alignment to inappropriate scenarios. We observe that such false refusals do not automatically diminish as the LLM’s scale increases, suggesting that mitigation may require more refined alignment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240 6.4 Some types of unnatural prompts in existing pseudo-harmful datasets. . . . . . . 241 x 7.1 Ranking of models from safe to unsafe. ≳ denotes the difference of unsafety is less than 1%. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251 7.2 Comparison of probabilities for generating unsafe responses in non-RAG and RAG settings. ✓ denotes safe, and ✗ denotes unsafe ones. . . . . . . . . . . . . 
252 7.3 Safety of retrieved documents. . . . . . . . . . . . . . . . . . . . . . . . . . . . 253 7.4 Evaluation of extraction and summarization ability. Gemma performs poorly, leading to frequent refusals, which gives a false appearance of safety. . . . . . . . 256 7.5 Evaluation of models’ attention to documents via testing the accuracy with re- trieved and random documents. Most models do not rely fully on documents. . . 257 7.6 The average perplexity of the jailbreaking prompts created by two methods. . . . 291 xi List of Figures 1.1 Distribution shift often happens in reality, causing poor generalization and many other issues. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Dissertation Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.1 Illustration of the pseudometric and sample cover induced by data transformations. 9 2.2 Estimated sample covering numbers induced by different data transformations at different resolutions ϵ. “base” indicates no transformation. Note that as ϵ increases, it starts to exceed the L2 distance between some images and thus some images get covered by others without doing any transformation. Three vertical dashed lines indicate the maximum resolution ϵ at which the “base” yields a certain sample covering number, and from left to right they are 100%n, 99%n, 95%n. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.3 An illustration of data transformations . . . . . . . . . . . . . . . . . . . . . . . 44 2.4 (a)-(c): Estimated sample covering numbers induced by different data transfor- mations on ShapeNet. n denotes the total sample size. (d): The normalized sam- ple covering number (=sample covering number /n) of 3D-view transformations estimated with different sample sizes. . . . . . . . . . . . . . . . . . . . . . . . 50 3.1 Illustration of intra-group expansion assumption in the input space. An ex- ample of gender classification task with the sensitive attribute being race. Intra- group expansion assumes that different groups are separated but every group is self-connected under certain transformations. If a model has consistent predic- tions under those transformations, we can propagate labels within each group. Under this assumption, we propose to obtain fairness and accuracy in both do- mains by a self-training algorithm with fair consistency regularization. . . . . . . 53 3.2 Training diagram. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 3.3 Accuracy and unfairness (error bar denotes the standard deviation) in two do- mains under subpopulation shifts (Sshift 1, Sshift 2), domain shift (Dshift), and hybrid shift (Hshif). (S) and (T) denotes the evaluation in the source and target domains respectively. Results show that domain shift is more challenging than subpopulation shift, and our method can effectively transfer accuracy and fairness under all the distribution shifts considered. . . . . . . . . . . . . . . . . . . . . 69 3.4 Comparison of Pareto frontiers. Upper left is preferred. Our method outperforms baseline methods in achieving accuracy and fairness at the same time. . . . . . . 73 3.5 Unfairness and accuracy tested on NewAdult. CA as the source domain (red star) and other states as the target domain (blue dots). Red lines indicate the average of unfairness. The relative decrease is calculated by comparing with Laftr. . . . . 74 xii 3.6 Per-group accuracy and consistency. 
Compared with the standard consistency regularization (SCR), the model trained with fair consistency regularization (FCR) has more balanced consistency and accuracy. . . . . . . . . . . . . . . . . . . . 75 3.7 Randomly sampled examples from two domains under different distribution shifts. 87 3.8 With fair consistency regularization, our method alleviates the disparate impact of FixMatch. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 4.1 (Left): CLIP co-relates natural language descriptions of contextual attributes with visual cues (orientation: upside-down). (Center): Unlike CLIP’s stan- dard zero-shot inference that uses fixed template(s) for class name retrieval, our method first infers contextual attributes (background: on the grass) using CLIP and then let CLIP predicts the class conditioned on the inferred contextual at- tributes. Here, background and orientation are both examples of contextual at- tributes. (Right): Grad-CAM visualization illustrates that our method focuses more on core features (on the dog) and is less distracted by spurious features (grass background) when performing the object classification. . . . . . . . . . . 101 4.2 Illustration of contextual attributes, their symbolic discrete values, and the possi- ble textual descriptions mapped by the annotation function. . . . . . . . . . . . . 106 4.3 Evaluating CLIP scores on ImageNet with different transformations altering the contextual attributes. The attribute-aware CLIP score gives higher scores for cor- rectly matched image-attribute pairs (green) while giving lower scores for mis- matched pairs (grey) and random pairs (blue), confirming CLIP’s understanding of our contextual attribute descriptions. CLIP score measures the similarity be- tween images and contextual attributes, while the original CLIP score (orange) is attribute-agnostic. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 4.4 Images of a leopard and a waterbird, core and spurious features, and Grad-CAM heatmaps using no, incorrect, and ground-truth contextual attributes (with text below images). The bar shows core vs. spurious ratio in the heatmap. Visual- ization shows that classification conditioned on correct contextual attributes en- forces CLIP’s focus on core features. . . . . . . . . . . . . . . . . . . . . . . . 112 4.5 The increase in (left) CLIP scores and the (right) prediction probabilities by incorporating the descriptions of the correct contextual attribute into the text prompts. We compare the increased CLIP scores and prediction probabilities for the ground-truth class y∗, the Top-5 and Top-10 wrong classes. (left) Incorporat- ing ground-truth attributes into text prompts results in increased CLIP scores for both correct and incorrect classes. This improvement is attributed to the enhanced alignment of the text prompts with the images, addressing previously overlooked contextual attributes. Notably, the CLIP score of the correct class benefits more from this enhancement for all the attributes considered. This is because the ac- curate description of the class, combined with the contextual attributes, achieves a more precise alignment with the corresponding image. (right) Therefore, the model is more likely to predict the correct class after being provided with the correct context description in the prompt. . . . . . . . . . . . . . . . . . . . . . 128 xiii 4.6 Leopard images from ImageNet dataset. 
Visualization of the original image, the regions of core and spurious features, and the Grad-CAMs obtained using no, incorrect, and ground-truth contextual attributes. . . . . . . . . . . . . . . . . . 134 4.7 Waterbird images from Waterbirds dataset. Visualization of the original image, the regions of core and spurious features, and the Grad-CAMs obtained using no, incorrect, and ground-truth contextual attributes. . . . . . . . . . . . . . . . . . 138 5.1 WAVES establishes a standardized evaluation framework that encompasses a comprehensive suite of stress tests including both existing and newly proposed stronger attacks (denoted by ∗). . . . . . . . . . . . . . . . . . . . . . . . . . . 143 5.2 Evaluation workflow. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 5.3 Regeneration attacks on Tree-Ringk. Regen-Diff is a single diffusive regener- ation and Rinse-[N]xDiff is a rinsing one with N repeated diffusions, with the number of noising steps as attack strength. Regen-VAE uses a pre-trained VAE with quality factor as strength and Regen-KLVAE uses pre-trained KL-VAEs with bottleneck size as strength. RinseD-VAE applies a VAE as a denoiser after Rinse- 4xDiff. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 5.4 Adversarial embedding attacks target Tree-Ring at strengths of {2/255, 4/255, 6/255, 8/255}. Tree-Ring shows vulnerability to embedding attacks, especially when the adversary can access the VAE being used. . . . . . . . . . . . . . . . . 153 5.5 Three settings for training the surrogate detector. The Generator is the victim generator under attack. We externalize the watermarking process for simplicity, but it could be in-processing watermarks. After training the surrogate detectors, the adversary performs PGD attacks on them to flip the labels. . . . . . . . . . . 154 5.6 Adversarial surrogate detector attacks on Tree-Ring. . . . . . . . . . . . . . . . . 154 5.7 Unified performance vs. quality degradation 2D plots under detection setup. We evaluate each watermarking method under various attacks. Two dashed lines show the thresholds used for ranking attacks. . . . . . . . . . . . . . . . . . . . 156 5.8 (a) Detection performance of three watermarks after attacks, measured by Aver- age TPR@0.1%FPR with lower values (near center) indicating higher vulnera- bilities. (b) The distribution of quality degradation. The lower, the better. . . . . . 158 5.9 Identification accuracy of three watermarks after attacks. . . . . . . . . . . . . . 161 5.10 An illustration of a robust watermarking workflow. An AI company provides two services: (1) generate watermarked images, i.e., embed invisible messages, and (2) detect these messages when shown any of their watermarked images. There is an attack stage between the watermarking and detection stages. The watermarked images may experience natural distortions (e.g., compression, re- scaling) or manipulated by malicious users attempting to remove the watermarks. A robust watermarking method should still be able to detect the original message after an attack. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167 5.11 Word clouds of DiffusionDB, MS-COCO, and DALL·E3 prompts. . . . . . . . . 179 5.12 Image examples of DiffusionDB, MS-COCO, and DALL·E3. . . . . . . . . . . . 179 xiv 5.13 Ranking attacks with different quality metrics on DiffusionDB images wa- termarked by Tree-Ring. Attack potency is ranked by image quality at 0.95 TPR@0.1%FPR. 
Colors indicate the ranks (1=best, 9=worst), and values show the measured quality. We use ’NA’ to label an attack if its attack curve lies en- tirely above TPR=0.95; the attack is automatically ranked last. . . . . . . . . . . 183 5.14 Cumulative distribution functions (CDFs) for eight image quality metrics across all attacked watermarked images. The horizontal dashed lines mark the 10% and 90% quantiles, and the intersecting vertical dashed lines delineate the bounds of the normalization intervals. Values at the lower bound are normalized to 0.1, and those at the upper bound to 0.9. . . . . . . . . . . . . . . . . . . . . . . . . . . . 185 5.15 Distortions and their combinations. We combine three types of distortions: ge- ometric, photometric, and degradation, both individually and collectively. By comparing quality-performance plots, we see combinations of distortions do not necessarily lead to better attacks. . . . . . . . . . . . . . . . . . . . . . . . . . . 189 5.16 Regenerative diffusion with varying depth of noising steps and a VAE regenera- tion with a low quality factor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190 5.17 4x rinsing regeneration with varying depth of noising steps per diffusion. . . . . . 190 5.18 An image of a dragon attacked using a 3x rinsing regeneration. Pushing the image through a VAE restores image quality, noticeable in the eye color of the dragon (indicated by the green box). Image is drawn from the Gustavosta Stable Diffusion dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191 5.19 Aggregated performance vs. quality degradation 2D plots under identifica- tion setup (one million users). We evaluate each watermarking method under various attacks. Two dashed lines show to thresholds used for ranking attacks. . . 194 5.20 The spoofing attack fails for AdvCls-UnWM&WM. . . . . . . . . . . . . . . . 195 5.21 Visualization of AdvCls-UnWM&WM attack. (a) shows the watermarking mask of Tree-Ring where there are four channels, and we only watermark the last chan- nel. The watermark message is the rings, which contain ten complex numbers that are not shown in the figure. (b) and (c) show the inversed latent before and after the attack in the Fourier space. We only show the real part of the latent. Clearly, the rings exist before the attack and vanish after the attack. (d) shows the magnitude of the element-wise difference before and after the attack. The attack not only perturbs the watermark part but also other features. The average magnitude change of the watermark-part and non-watermark-part is around 2:1. The attack successfully disturbs the watermark, albeit in an imprecise manner. . 196 5.22 Visualization of AdvCls-WM1&WM2 attack. (a) and (b) are the same as that in Figure 5.21. (c) shows the inversed latent after the attack, where the watermark vanishes instead of changing to another watermark. (d) shows the magnitude of the element-wise difference before and after the attack. The attack not only per- turbs the watermark part but also other features. The average magnitude change of the watermark-part and non-watermark-part is also around 2:1. Although the surrogate detector is trained to classify two different watermark messages. The attack based on it cannot change the watermark message from one to another but can effectively disturb the watermark. . . . . . . . . . . . . . . . . . . . . . . . 197 xv 5.23 The user identification results for Tree-Ring under AdvCls-WM1&WM2 attacks. 
The original watermarked images are embedded with User1’s message. AdvCls- WM1&WM2 tries to disrupt the latent feature of those images so that they can be misidentified as User2 generated. We simulate two settings: 100 users and 1000 users in total. The blue curves represent the proportion of images correctly iden- tified as belonging to User1, while the orange curves show those misidentified as User2’s. Note that, the axes for blue and orange curves have different ranges in the figure. With increasing attack strengths, the likelihood of correctly identify- ing them as User1’s decreases significantly under both 100 and 1K user scenarios. However, misidentification as User2’s images occurs notably only when the total number of users is small (e.g., 100 users). . . . . . . . . . . . . . . . . . . . . . 198 5.24 A visual demonstration of various adversarial, regeneration, and distortion at- tacks on a Tree-Ring watermarked image. Figure (a) is the base unattacked image. The base prompt, drawn from DiffusionDB, is “digital painting of a lake at sunset surrounded by forests and mountains,” along with further styling details. 200 5.25 Evaluation on DiffusionDB dataset under the detection setup (part 1). . . . . . . 201 5.26 Evaluation on DiffusionDB dataset under the detection setup (part 2). . . . . . . 202 5.27 Evaluation on MS-COCO dataset under the detection setup (part 1). . . . . . . . 203 5.28 Evaluation on MS-COCO dataset under the detection setup (part 2). . . . . . . . 204 5.29 Evaluation on DALL·E3 dataset under the detection setup (part 1). . . . . . . . . 205 5.30 Evaluation on DALL·E3 dataset under the detection setup (part 2). . . . . . . . . 206 5.31 Stress test results for DWT-DCT. It is highly susceptible to regeneration attacks (cross markers) and most distortion attacks (square markers), but relatively robust against adversarial attacks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207 5.32 Stress test results for MBRS. It is vulnerable to certain distortion attacks (resized- cropping, blurring, rotation, combo distortions) and regeneration attacks, but ro- bust against other distortions (JPEG compression, brightness/contrast, random erasing, noise) and adversarial attacks. . . . . . . . . . . . . . . . . . . . . . . . 208 6.1 Examples of pseudo-harmful prompts generated by our method using llama2 as the target LLM, then transferred to closed-source LLMs. . . . . . . . . . . . . . 212 6.2 Some controversial prompts generated by our method. Claude 3.5 Sonnet (shown) refuses to respond, while GPT 4o and Gemini 1.5 Flash do. The left and middle’s harmfulness depends on definitions, while the right could have either innocent or malicious intentions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216 6.3 Diagram of our automatic pseudo-harmful prompt generation. . . . . . . . . . . 218 6.4 (Left) LLM "recognizes" pseudo-harmfulness. Using only Llama2-8B’s refusal likelihood, we classify pseudo-harmful (green) and harmful (red) XSTest prompts with AUC 79.3%. This suggests that pseudo-harmful prompts often lie on the boundary of the LLM’s refusal decision. (Right) Using the LLM as a harmful- ness judge often aligns better with human evaluation than seeing if it refuses the prompt. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220 xvi 6.5 Comparison of quantity and distribution between PHTest and XSTest. (Left) PHTest prompts have lower perplexity (mainly because XSTest prompts are gen- erally shorter). 
(Right) XSTest prompts generally have a higher negative log- likelihood (NLL), making them more common in practice, while PHTest covers broader long-tail distributions. . . . . . . . . . . . . . . . . . . . . . . . . . . . 222 6.6 False refusal rates of different LLMs on PHTest. . . . . . . . . . . . . . . . . . . 225 6.7 Tested LLMs demonstrate a trade-off between safety (low ASR on HarmBench) and usability (low FRR on PHTest’s harmless prompts). The safety of *-marked LLMs are potentially underestimated. We test their jailbreak ASR on a small available prompt set from HarmBench, while taking others directly from Harm- Bench’s report. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227 6.8 False refusal rates before and after applying some jailbreak defenses. . . . . . . . 228 6.9 Fine-grained alignment requires a comprehensive safety policy and/or edge cases. 229 6.10 Testing accuracy during the training. Alignment w/ PHP that uses PHTest in the training has a significantly larger accuracy than the alignment w/o PHP that uses ShareGPT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232 6.11 The safety of the model in the two settings steadily improves. However, there is a large gap between the two settings in the accuracy on the benign prompts. Alignment w/o PHP (ShareGPT) soon gets plateaued with large variance. . . . . 232 6.12 Decision boundary "visualization". . . . . . . . . . . . . . . . . . . . . . . . . . 233 6.13 Decision boundary "visualization". . . . . . . . . . . . . . . . . . . . . . . . . . 234 6.14 (Left) An example of our auto-generated pseudo-harmful prompt. (Right) Defin- ing harmfulness is complex, requiring detailed rules and supporting examples. We conjecture that safety alignment also requires extensive training examples to characterize the model’s rejection boundary. Jailbreak defenses without using additional data may only “shift” the boundary, leading to more false refusals. . . 237 6.15 Pseudo-harmful prompt examples generated by our method. . . . . . . . . . . . 238 6.16 Pseudo-harmful prompt examples generated by our method. . . . . . . . . . . . 239 7.1 RAG can make safe models unsafe, even if the retrieved documents are safe. . . . 243 7.2 Safety of LLMs in non-RAG vs. RAG settings. Most LLMs in the RAG setting exhibit a significantly higher percentage of unsafe responses. . . . . . . . . . . . 246 7.3 The change of risk profile from non-RAG to RAG is model-dependent. . . . . . . 247 7.4 Risk profile of Llama-3-8B. It is vulnerable in 7 categories in a non-RAG setting, but is vulnerable in all 16 categories in RAG, with an increase in risk across all categories. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247 7.5 RAG is unsafe at points where non-RAG is unsafe, and more. . . . . . . . . . . . 251 7.6 Using one document in the RAG setting can change the safety behavior of mod- els. Provided with more documents, LLMs tend to be more vulnerable. . . . . . . 255 7.7 Capability of LLMs on RAG tasks. . . . . . . . . . . . . . . . . . . . . . . . . . 258 7.8 Train jailbreaking prompts on non-RAG Llama-3-8B and test them in the RAG setting with a varying number of retrieved documents. . . . . . . . . . . . . . . . 259 xvii 7.9 Train jailbreaking prompts on Llama-3-8B in the RAG setting using five docu- ments retrieved from the original queries, and test them in the RAG setting by retrieving documents using the optimized prompts with varying numbers of re- trieved documents. . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . 260 7.10 Risk taxonomy and the distribution in the dataset. . . . . . . . . . . . . . . . . . 265 7.11 Non-RAG (the upper) vs RAG (the bottom) pipelines. . . . . . . . . . . . . . . . 266 7.12 (Part 1) Risk profile of every LLM in non-RAG vs. RAG settings. . . . . . . . . 271 7.13 (Part 2) Risk profile of every LLM in non-RAG vs. RAG settings. . . . . . . . . 272 7.14 The change of risk profile from non-RAG to RAG. . . . . . . . . . . . . . . . . 273 7.15 Distribution of unsafe documents. . . . . . . . . . . . . . . . . . . . . . . . . . 275 7.16 Train jailbreaking prompts on non-RAG Mistral-V0.3 and test them in the RAG setting with a varying number of retrieved documents. . . . . . . . . . . . . . . . 290 7.17 Train jailbreaking prompts on Mistral-V0.3 in the RAG setting using five docu- ments retrieved from the original queries, and test them in the RAG setting by retrieving documents using the optimized prompt with a varying number of re- trieved documents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291 7.18 The fine selection phase of gradient-based methods involves calculating the jail- breaking loss for a large set of adversarial suffix candidates chosen through gra- dients. Previous work has addressed this using batch inference. However, in the RAG setting, the input query to LLMs—comprising both the retrieved docu- ments and the question—is significantly longer, leading to memory issues when performing batch inference with large batch sizes. . . . . . . . . . . . . . . . . . 292 7.19 We employ tree-attention to convert a batch of inputs into a sequence. The at- tention mask visualizes the tree-attention structure in the case of two candidates. Since the long query remains fixed during adversarial suffix optimization, we pre-process it and use it as a KV cache. The position IDs should also be adjusted accordingly. This approach allows us to efficiently compute the jailbreaking loss. 292 xviii List of Abbreviations AI Artificial Intelligence ASR Attack Success Rate CLIP Contrastive Language-Image Pretraining DNN Deep Neural Network FRR False Refusal Rate LLM Large Language Model ML Machine Learning PGD Projected Gradient Decent RAG Retrieval Augmented Generation VLM Vision Language Model xix Chapter 1: Introduction As machine learning (ML) systems move from controlled research environments into com- plex real-world settings, their reliability under changing conditions emerges as a central concern. Models trained under idealized assumptions often encounter significant performance degradation when faced with real-world distribution shifts—situations where the data observed during de- ployment differs from the training data. Distribution shifts undermine critical aspects of model reliability, including generalization, fairness, robustness, and safety, raising fundamental ques- tions about the trustworthiness of deployed AI systems. 1.1 Background and Motivation Historically, machine learning research has focused primarily on achieving high perfor- mance under the assumption of independently and identically distributed (i.i.d.) training and testing data. While this assumption has facilitated rapid advances in model accuracy and effi- ciency, it does not hold in many practical applications. Real-world data often evolves over time, varies across domains, and is susceptible to manipulation, adversarial attacks, and shifts in de- mographics or contexts. 
Such distribution shifts have become a widely recognized challenge for deploying reliable AI in sensitive or high-stakes domains such as healthcare, autonomous driving, financial decision-making, and content moderation.

Figure 1.1: Distribution shift often happens in reality, causing poor generalization and many other issues.

Recent real-world failures highlight this issue: models trained to recognize pedestrians perform poorly under changed lighting conditions; models optimized for fairness in one demographic group fail to generalize fairness constraints across different populations; and large language models fine-tuned for safety in controlled settings become overly conservative or dangerously inaccurate when encountering unforeseen queries. Addressing these challenges is crucial for safe, ethical, and reliable AI deployment.

Research on distribution shifts spans several subfields, including domain adaptation [1, 2], domain generalization [3, 4], robust optimization [5, 6], fairness-aware learning [7, 8], and AI safety [9, 10]. Domain adaptation techniques aim to adjust models trained on source distributions to perform well on a specific shifted target distribution [2]. In contrast, domain generalization seeks models robust to multiple unseen target distributions [3]. Recent methods involve invariance-based learning [11, 12], self-training [13], and consistency regularization [14], emphasizing alignment between training and deployment contexts.

In fairness-aware learning, significant progress has been made toward enforcing algorithmic fairness under static distributions [7]. However, little attention has been given to fairness preservation across changing environments [15, 16]. Meanwhile, robustness research, including adversarial robustness [6] and robustness to spurious correlations [5], focuses on ensuring that models are resilient under deliberate or incidental shifts in input distributions. Lastly, the AI safety literature increasingly highlights unintended behaviors of large foundation models under shifting distributions, particularly as models scale and complexity increases [9, 10].

Figure 1.2: Dissertation Overview.

Despite these advances, existing literature tends to focus narrowly on isolated aspects of distribution shifts, leaving gaps in a holistic understanding of reliability across generalization, fairness, robustness, and safety.

1.2 Dissertation Overview

This dissertation explores the question of how to build reliable AI under distribution shifts, structured into three main parts:

Part I: Generalization and Fairness under Distribution Shifts. We study the generalization of models facing distribution shifts in Chapter 2, introducing a theoretical framework that connects model invariance to generalization performance. This work guides the selection of transformations to improve model robustness. Extending this perspective, we address how fairness can collapse under distribution shifts and propose an approach, fair consistency regularization, to reliably transfer fairness across shifted environments in Chapter 3.

Part II: Vision Models under Distribution Shifts. We focus on generalization and robustness of vision models under distribution shifts, particularly how perception and generative AI systems respond to shifts.
In Chapter 4, inspired by the shift between training and deployment as well as by human visual perception, we introduce a method to improve zero-shot classification robustness by explicitly modeling contextual attributes, mitigating reliance on spurious features. In Chapter 5, we examine the robustness of watermarking methods used to authenticate generative AI content, providing a benchmark that uncovers vulnerabilities under diverse adversarial and distributional transformations.

Part III: AI Safety Challenges due to Distribution Shifts. We explore the safety implications of distribution shifts for large language models (LLMs). In Chapter 6, we investigate the phenomenon of false refusals, where safety-aligned models reject benign prompts at deployment due to the misalignment of learned safety boundaries. In Chapter 7, we further study retrieval-augmented generation (RAG) frameworks, uncovering new safety vulnerabilities arising from shifts between standard LLM training distributions and retrieval-augmented deployment contexts.

The dissertation offers key contributions including theoretical insights, novel methods, and comprehensive benchmarks that collectively enhance our understanding of how to build AI systems capable of maintaining reliability across diverse real-world conditions.

Part I: Generalization and Fairness under Distribution Shifts

Chapter 2: Understanding the Generalization Benefit of Model Invariance from a Data Perspective

Machine learning models that are developed with invariance to certain types of data transformations have demonstrated superior generalization performance in practice. However, the underlying mechanism that explains why invariance leads to better generalization is not well understood, limiting our ability to select appropriate data transformations for a given dataset. This chapter studies the generalization benefit of model invariance by introducing the sample cover induced by transformations, i.e., a representative subset of a dataset that can approximately recover the whole dataset using transformations. Based on this notion, we refine the generalization bound for invariant models and characterize the suitability of a set of data transformations by the sample covering number induced by transformations, i.e., the smallest size of its induced sample covers. We show that the generalization bound can be tightened for suitable transformations that have a small sample covering number. Moreover, our proposed sample covering number can be empirically evaluated, providing a practical guide for selecting transformations to develop model invariance for better generalization. We evaluate the sample covering numbers for commonly used transformations on multiple datasets and demonstrate that a smaller sample covering number for a set of transformations indicates a smaller gap between the test and training error for invariant models, thus validating our propositions.

2.1 Introduction

Invariance is ubiquitous in many real-world problems. For instance, categorical classification of visual objects is invariant to slight viewpoint changes [17–19], and text understanding is invariant to synonymous substitution and minor typos [20–22]. Intuitively, models capturing the underlying invariance exhibit improved generalization in practice [23–26, 27].
Such generalization benefit is especially crucial when the data are scarce, as in some medical tasks [28], or when the task requires more data than usual, as in cases of distribution shift [29] and adversarial attack [30–32].

A commonly accepted intuition attributes the generalization benefit of model invariance to the reduced model complexity, especially the reduced sensitivity to spurious features. However, a principled understanding of why model invariance helps generalization remains elusive, thus leaving many open questions. Since model invariance may come at a cost (e.g., compromised accuracy, increased computational overhead), given a task, how should we choose among various data transformations under which model invariance guarantees better generalization? If existing data transformations are not good enough for a given task, what is the guiding principle to find new ones? The lack of a principled understanding limits better exploitation of model invariance to further improve generalization. In addition, since identifying instructive generalization bounds is a central topic in machine learning, we may expect to tighten existing generalization bounds by additionally considering the data-dependent model invariance property.

The many faces of data transformations and model classes pose significant challenges to a principled understanding of model invariance's generalization benefit. To address this, [18, 33–36] characterize the input space and show that certain data transformations equivalently shrink the input space for invariant models, which then simplifies the input and improves generalization. From another perspective, [37, 38] directly characterize the function space and show that the volume of the invariant model class is reduced, which then simplifies the learning problem and improves generalization. These understandings provide valuable insights, yet they may become less informative on high-dimensional input data or require model invariance to be obtained exclusively via feature averaging. Certain assumptions on data transformations (e.g., finiteness, group structure with certain measures) also make these understandings less applicable to more general data transformations.

In this paper, we derive generalization bounds for invariant models based on the sample cover induced by data transformations and empirically show that the introduced notion can guide the selection of data transformations. Different from previous understandings, we first identify a data-dependent property of data transformations in a model-agnostic way, and then establish its connections with the refined generalization bounds of invariant models. The analysis applies to more general data transformations regardless of how model invariance is obtained and naturally provides model-agnostic guidance for data transformation selection. We summarize our contributions as follows.

At the core of our understanding is the notion of sample cover induced by data transformations, defined informally as a representative subset of a dataset that can approximately recover the whole dataset using data transformations (illustrated in Figure 2.1). We show that this notion identifies a data-dependent property of data transformations which is related to the generalization benefit of the corresponding invariant models. Under a special setting of the sample cover, we first bound the model complexity of any invariant and output-bounded model class in terms of the sample covering numbers. Since this general bound requires a restrictive condition on data transformations in order to be informative, we then assume model Lipschitzness to relax the requirement and refine the model complexity bound for invariant models. Finally, we outline a framework for model-invariance-sensitive generalization bounds based on the invariant models' complexities, and use it to discuss the generalization benefit of model invariance.

Figure 2.1: Illustration of the pseudometric and sample cover induced by data transformations. (a) G = ∅; (b) G = {rotation}.

Given the usefulness of the sample cover in the analysis, we propose an algorithm to empirically estimate the sample cover. This algorithm exactly verifies whether a given subset of a sample forms a valid sample cover, and always estimates a sample covering number that upper-bounds the ground truth. Inspired by our analysis, we also propose to use the sample covering number as a suitability measurement for practical data transformation selection. This measurement is data-driven, widely applicable, and empirically correlates with invariant models' actual generalization performance. We discuss its limitations and empirical mitigation.

To empirically verify our propositions, we first estimate the sample covering number for some commonly used data transformations on four image datasets, including CIFAR-10 and ShapeNet (a 3D dataset). Under typical settings, the 3D-view transformation induces a much smaller sample covering number than others on ShapeNet, while cropping induces the smallest sample covering number on the other datasets. For those data transformations, we then train invariant models via data augmentation and invariance loss regularization to evaluate the actual generalization benefit. Results show a clear correlation between smaller sample covering numbers induced by data transformations and the better generalization benefit enjoyed by invariant models.

2.2 Related Work

Understandings from the input space perspective. One line of work characterizes the input space of invariant models. [33, 39] show that invariant representations equivalently reduce the input dimension for downstream tasks and thus significantly reduce the model complexity (exponential in input dimensions) of downstream linear models. [35, 36] essentially factorize the input space into the product of a base space and a finite set of data transformations. Since the covering number needed to cover the base space is smaller, the associated generalization bound for invariant models is reduced. Compared with these works, our work tries to cover the sample instead of the input space, which circumvents the strong dependence on input dimensions and also enables practical evaluation.

Understandings from the function space perspective. Another line of work directly characterizes the function space of invariant models. [38] uses PAC-Bayes to show the reduction of the generalization upper bound when the model class is symmetrized to be invariant. [37] analyzes the function space under the feature averaging operator and shows the first strict generalization gap (instead of an upper bound) via a linear model. This line of work currently restricts model invariance to be obtained exclusively via feature averaging.
Note that the categorization of different understanding perspectives is only for presentation convenience and has no formal distinctions. Additionally, we mention some work that studies model invariance but does not focus on understanding its benefit. [40] proves that the VC dimension of an invariant model cannot be larger than that of its non-invariant counterpart. [41] characterizes the general functional representations of invariant probability distributions as well as neural network structures that implement them. [42] uses group theory to show the benefit of learning with a data-augmented loss. In the predicting-generalization competition at NeurIPS 2020 [43], the runner-up team [44] shows that model robustness to data transformations can serve as an empirical proxy for predicting models' generalization performance. [45] enforces model invariance to learned data transformations that capture inter-domain variation to improve out-of-distribution generalization. [46] proposes to select data transformations automatically during model training by optimizing parameterized distributions of data transformations. Interestingly, our sample covering numbers may be used to determine their regularization coefficients for better trade-offs.

2.3 Preliminaries

Data transformations. We refer to a data transformation as a function from the input space X to X, and to data transformations as a set of such functions. Unless otherwise specified, we do not assume data transformations to have group structures, since many non-invertible transformations (e.g., cropping) do not fit into a group structure directly. For a set of data transformations G = {g : X → X} and a data point (also referred to as an example) x ∈ X, we overload the notion of orbit from group theory and denote by G(x) the orbit of x, defined as the collection of outputs after applying any transformation g ∈ G to x: G(x) = {g(x) ∈ X : g ∈ G}.

Model invariance. Let D be the underlying data distribution and supp(D) be its support. A model h : X → Y is said to be invariant under data transformations G on D if h(g(x)) = h(x) for any x ∈ supp(D) and any g ∈ G. We refer to a class of invariant models as a G-invariant model class.

Complexity measurements. The covering number and the Rademacher complexity [47] are two commonly used complexity measurements for model classes (including neural networks [48]) that can provide uniform generalization bounds. The covering number can also be directly used to upper bound the Rademacher complexity via Dudley's entropy integral theorem [49, 50].

Covering number. Let (F, d) be a (pseudo)metric space with some (pseudo)metric d. (A pseudometric is a metric if and only if it separates distinct points, namely d(x, y) > 0 for any x ≠ y.) An ϵ-cover of a set H ⊆ F is defined as a subset Ĥ ⊆ H such that for any h ∈ H, there exists ĥ ∈ Ĥ such that d(h, ĥ) ≤ ϵ. The covering number N(ϵ, H, d) is defined as the minimum cardinality of an ϵ-cover (among all ϵ-covers) of H. In this paper, we use the concept of covering number both for measuring model class complexities and for defining the sample covering number on datasets.

Empirical Rademacher complexity. Let H be a class of functions h : X → R. Given a sample S = {x_i}_{i=1}^n, the empirical Rademacher complexity of the model class H is defined as

\mathcal{R}_S(\mathcal{H}) = \mathbb{E}_{\sigma}\Big[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} \sigma_i h(x_i)\Big],

where σ = [σ_1, ..., σ_n]^⊤ is a vector of i.i.d. Rademacher random variables, each uniformly chosen from {−1, 1}.
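For concreteness, the following is a minimal sketch (not part of the original analysis) that estimates the empirical Rademacher complexity of a small, hypothetical finite model class of threshold classifiers by Monte Carlo sampling of the Rademacher vector σ; the sample and the class are illustrative only.

import numpy as np

rng = np.random.default_rng(0)

# A hypothetical sample of n scalar inputs and a small finite model class:
# threshold classifiers h_t(x) = sign(x - t) for a grid of thresholds t.
n = 50
sample = rng.normal(size=n)
thresholds = np.linspace(-2.0, 2.0, 41)
# predictions[j, i] = h_{t_j}(x_i), each in {-1, +1}
predictions = np.sign(sample[None, :] - thresholds[:, None])
predictions[predictions == 0] = 1.0

def empirical_rademacher(preds, num_draws=2000, rng=rng):
    """Monte Carlo estimate of R_S(H) = E_sigma[ sup_h (1/n) sum_i sigma_i h(x_i) ]
    for a finite class given its prediction matrix preds of shape (|H|, n)."""
    n = preds.shape[1]
    total = 0.0
    for _ in range(num_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)   # i.i.d. Rademacher variables
        correlations = preds @ sigma / n          # (1/n) sum_i sigma_i h(x_i) per model
        total += correlations.max()               # supremum over the finite class
    return total / num_draws

print(f"estimated R_S(H) ~ {empirical_rademacher(predictions):.4f}")

For a finite class, the supremum is an exact maximum over its members; for richer model classes, this inner step would require an optimization instead.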
Generalization error and gap. Let S = {x_i}_{i=1}^n be a sample drawn i.i.d. from some data distribution D, and let H be a model class. Given a loss function ℓ : R × Y → [0, 1], for an h ∈ H we define the empirical error as R_S(h) = (1/n) Σ_{i=1}^n ℓ(h(x_i), y_i), the generalization error as R(h) = E_{(x,y)∼D}[ℓ(h(x), y)], and the generalization gap as R(h) − R_S(h).

2.4 Generalization Benefit of Model Invariance

In this section, we derive generalization bounds for invariant models by identifying model invariance properties. We start by introducing the notion of the sample cover induced by data transformations and, based on it, bound the Rademacher complexity of any invariant model class with bounded output (Section 2.4.1). Then, we assume model Lipschitzness to provide a more informative model complexity bound for any data transformations (Section 2.4.2). Finally, we provide a framework for model-invariance-sensitive generalization bounds and discuss the generalization benefit of model invariance (Section 2.4.3).

2.4.1 Sample Cover Induced by Data Transformations

Existing empirical results suggest that, compared with standard models, invariant models may have certain properties reducing their effective model complexities. To identify such properties, we alternatively identify the related properties of the corresponding data transformations via the notion of the sample cover induced by data transformations. We now formalize the introduced notion.

The definition of the sample cover relies on the pseudometric induced by the data transformations G. Note that G generates an orbit G(x) ⊆ X for each example x ∈ S. Let ∥ · ∥ be any norm on the input space X. Given a set of transformations G, we define the G-induced pseudometric as

\rho_{\mathcal{G}}(x_1, x_2) = \inf_{\gamma \in \Gamma(x_1, x_2)} \int_{\gamma} c(r)\, ds, \qquad c(r) = \begin{cases} 0, & \text{if } r \in \bigcup_{x \in S}\mathcal{G}(x), \\ 1, & \text{otherwise,} \end{cases} \qquad (2.4.1)

where ds = ∥dr∥ and Γ(x_1, x_2) denotes the set of all paths (curves) in X from x_1 to x_2. Note that ρ_G is a pseudometric rather than a metric, since it allows ρ_G(x, y) = 0 for x ≠ y. The quantity ρ_G essentially calculates the line integral along the shortest (if achievable) path γ in the scalar field c, where c can be viewed as a "moving cost" function depending on G. The norm ∥ · ∥ here can be any meaningful norm on the input space (e.g., the Euclidean norm, as in our experiments) and will later be used in defining the model's Lipschitz constant. It can be checked that ρ_G satisfies the pseudometric axioms.

Definition 2.4.1 (Sample cover induced by data transformations). Let (X, ρ_G) be a pseudometric space and S = {x_i}_{i=1}^n be a sample of size n. An ϵ-sample cover Ŝ_{G,ϵ} of the sample S induced by data transformations G at resolution ϵ is defined as a subset of the sample S such that for any x ∈ S, there exists x̂ ∈ Ŝ_{G,ϵ} satisfying ρ_G(x, x̂) ≤ ϵ.

Definition 2.4.2 (Sample covering number induced by data transformations). The sample covering number N(ϵ, S, ρ_G) induced by data transformations G is defined as the minimum cardinality of an ϵ-sample cover:

N(\epsilon, S, \rho_{\mathcal{G}}) = \min\{|\hat{S}_{\mathcal{G},\epsilon}| : \hat{S}_{\mathcal{G},\epsilon} \text{ is an } \epsilon\text{-sample cover of } S\}. \qquad (2.4.2)

Informally, the G-induced sample cover specifies a representative subset of examples that can approximately recover all the original examples using the given data transformations G. This notion is closely related to sample compression [51], which represents a scheme to prove the learnability of concepts through a compressed set of samples. While identifying the generalization-related properties of data transformations, this notion is insensitive to other unrelated properties (e.g., finiteness, group structures) and thus applies to any data transformations.

The intuition behind the sample cover is that G-invariant models may have consistent behaviors on an example and its associated approximation in the G-induced sample cover. As such, we can analyze the model complexities of invariant models by considering the models' behavior only on the potentially small-sized sample covers. Indeed, we directly have the following model complexity result. The proof is in Section 2.9.2.

Proposition 2.4.3. Let S = {x_i}_{i=1}^n be a sample of size n. Let H be a model class mapping from X to [−B, B] for some B > 0 that is invariant to data transformations G. Then the empirical Rademacher complexity of H satisfies

\mathcal{R}_S(\mathcal{H}) \le 24B\sqrt{\frac{N(0, S, \rho_{\mathcal{G}})}{n}}. \qquad (2.4.3)

Proposition 2.4.3 generally bounds the model complexity of any output-bounded and G-invariant model class in terms of the sample covering number N(0, S, ρ_G) induced by G. A small G-induced sample covering number at resolution ϵ = 0 thus tightens the model complexity bound for a general class of G-invariant models.

Note, however, that Proposition 2.4.3 is informative only when the data transformations G yield N(0, S, ρ_G) ≪ n on the sample S, a condition requiring G to be able to exactly recover S from a small-sized subset of S. This condition is unfortunately too strict to hold for many commonly used data transformations, which only generate orbits with measure zero (with respect to the data measure) at most data points. For example, the rotation transformations on CIFAR-10 do not satisfy this condition, since no two images in CIFAR-10 are rotated versions of each other. To better understand the generalization benefit brought by any data transformations (e.g., rotation), we further assume specific model properties which equivalently expand the orbits in order to obtain more general results. We study Lipschitz models in Section 2.4.2, and relegate a sharper (and relatively independent) analysis for linear models under linear data transformations to Section 2.9.3.

2.4.2 Refined Complexity Analysis of Lipschitz Models

This subsection refines the model complexity analysis for Lipschitz models that are invariant. Characterizing the Lipschitz constant of models has been the focus of a line of work; for example, the Lipschitz constant of ReLU networks can be upper-bounded by the product of the spectral norms of the weight matrices, considering the worst-case inputs [48, 52]. Assuming Lipschitzness, the following theorem refines the covering number analysis for invariant models. The proof is in Section 2.9.2.

Theorem 2.4.4. Let S = {x_i}_{i=1}^n be a sample of size n. Let H be a model class such that every h ∈ H is κ-Lipschitz with respect to ∥ · ∥ (the norm used in defining the sample cover) and is invariant to G. Then the covering number of H satisfies

N\big(\tau, \mathcal{H}, L_2(P_S)\big) \le \inf_{\epsilon \ge 0,\ \hat{S}_{\mathcal{G},\epsilon}} N\Big(\tau - \kappa\epsilon\sqrt{1 - \tfrac{|\hat{S}_{\mathcal{G},\epsilon}|}{n}},\ \mathcal{H},\ L_2(P_{\hat{S}_{\mathcal{G},\epsilon}})\Big), \qquad (2.4.4)

where, for all h, g ∈ H, the L_2(P_S) metric is defined as

\|h - g\|_{L_2(P_S)} = \Big(\sum_{x \in S} \tfrac{1}{n}\big(h(x) - g(x)\big)^2\Big)^{1/2},

and the L_2(P_{\hat{S}_{\mathcal{G},\epsilon}}) metric is defined as

\|h - g\|_{L_2(P_{\hat{S}_{\mathcal{G},\epsilon}})} = \Big(\sum_{x \in \hat{S}_{\mathcal{G},\epsilon}} \tfrac{p(x)}{n}\big(h(x) - g(x)\big)^2\Big)^{1/2}.

Here, p(x)/n can be viewed as the probability mass at x, where the numerator p(x) indicates the number of examples that x covers; see Section 2.9.2.1 for the formal definition of p(x).
Theorem 2.4.4 upper-bounds the covering number of H evaluated at the sample S by a new covering number evaluated at any sample cover Ŝ_{G,ϵ}, under a modified metric and at the cost of an additional error term depending on ϵ and κ. Equality trivially holds by taking Ŝ_{G,ϵ} = S, while by searching over all sample covers with different resolutions ϵ it is possible to tighten the covering number bound for invariant models. Additionally, Theorem 2.4.4 leads to a refined version of Dudley's entropy integral theorem (see Lemma 2.9.3) that bounds the Rademacher complexity of invariant models.

Theorem 2.4.4 suggests that we may improve existing covering-number-based model complexity analyses by weakening the dependence on input dimensions. Note that covering numbers that do not yield N(τ, H, L_2(P_S))/n → 0 as n → ∞ are vacuous. Therefore, existing covering number results typically avoid linear dependence on n at the cost of (explicitly or implicitly) increased dependence on the input dimension [53]. With the refined result in Theorem 2.4.4, however, a covering number linear in n can now be replaced by one that is linear in a potentially much smaller sample covering number N(ϵ, S, ρ_G) and consequently becomes informative, thus circumventing the increased dependence on input dimensions. An interesting direction for future work is to instantiate the result in Equation (2.4.4) for specific model classes to obtain more interpretable results.

2.4.3 Framework for Model-invariance-sensitive Generalization Bounds

This subsection presents the framework for generalization bounds sensitive to model invariance. While the results are straightforward applications of the derived complexities of invariant models, our goal is to justify the selection of suitable data transformations to maximize the generalization benefit. We start with the generalization analysis of invariant models and then present the framework.

Generalization benefit for invariant models. The generalization bounds of invariant models follow immediately by applying the Rademacher model complexities (Proposition 2.4.3, Proposition 2.9.3, and Theorem 2.9.4) to the standard generalization bound (Theorem 2.9.2). Compared with standard models, invariant models' tightened model complexity bounds already imply their reduced generalization gaps, whereas for a reduced generalization error they further need to have low empirical error. Since enforcing model invariance may simultaneously increase the empirical error, we can use standard model selection techniques (e.g., structural risk minimization [47]) to select suitable data transformations and control the trade-off.

Model-invariance-sensitive generalization bound. We outline the generalization bound that identifies model invariance properties based on the derived complexities of invariant models. It follows from a post-hoc analysis which specifies a proper set of invariant models using an "invariant loss": a loss that, when composed with any model, makes the composition invariant. For data transformations with group structures, we can construct such a loss by averaging (assuming a Haar measure) or adversarially perturbing any given loss over the orbits of input examples [37, 38].
Specifically, the adversarial loss with respect to data transformations G is defined as

\tilde{\ell}_{\mathcal{G}}(h(x), y) = \max_{x' \in \mathcal{G}(x)} \ell(h(x'), y),

where ℓ is any given loss. Using the adversarial loss, the following proposition provides a model-invariance-dependent generalization bound by applying the model selection framework [47]. Section 2.9.2.3 further describes a binary coding construction of combinations of data transformation classes.

Proposition 2.4.5. Let S = {x_i}_{i=1}^n be a sample of size n. Let H be any given model class and ℓ be any given loss. Suppose we have K sets of group-structured data transformations {G_1, G_2, ..., G_K}. Then with probability at least 1 − δ, the following generalization bound holds for any h ∈ H and any k ∈ [K]:

R(h) \le \frac{1}{n}\sum_{i=1}^{n} \tilde{\ell}_{\mathcal{G}_k}(h(x_i), y_i) + 4\mathcal{R}_S(\tilde{\ell}_{\mathcal{G}_k} \circ \mathcal{H}) + \sqrt{\frac{\log k}{n}} + 3\sqrt{\frac{\log\frac{4}{\delta}}{2n}}, \qquad (2.4.5)

where R_S(ℓ̃_{G_k} ∘ H) is upper-bounded by the complexity of G_k-invariant models.

For any model trained on S, Proposition 2.4.5 shows that we can optimize over all selections of data transformations to improve its generalization bound. Note that the selection of G_k is subject to a potential trade-off between the reduced model complexity R_S(ℓ̃_{G_k} ∘ H) and the increased empirical adversarial loss (1/n) Σ_{i=1}^n ℓ̃_{G_k}(h(x_i), y_i). Thus, if a suitable G_k reduces the model complexity while keeping the empirical error low, then the trained model will benefit from a tightened generalization bound. This generalization bound does not require the models to be (strictly) invariant and potentially explains the improved generalization of models with trained invariance (e.g., via data augmentation [54, 55] or consistency regularization [56, 57]). The difficulty in instantiating Proposition 2.4.5 is that the model complexity with the adversarial loss may be hard to compute for general data transformations. Therefore, we discuss more practical data transformation selections based on the sample covering numbers in Section 2.6.
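In practice, the inner maximization in ℓ̃_G above can be approximated by sampling a finite number of transformations from G and taking the per-example worst case. The following PyTorch-style sketch is illustrative only; the model and the transformation list are placeholders rather than the constructions used in our analysis.

import torch
import torch.nn.functional as F

def adversarial_orbit_loss(model, x, y, transforms, num_samples=8):
    """Approximate the adversarial loss max_{x' in G(x)} loss(h(x'), y)
    by sampling transformations from G and taking the per-example maximum."""
    losses = []
    for _ in range(num_samples):
        g = transforms[torch.randint(len(transforms), (1,)).item()]
        logits = model(g(x))
        losses.append(F.cross_entropy(logits, y, reduction="none"))
    return torch.stack(losses, dim=0).max(dim=0).values.mean()

# Example usage with placeholder transformations (identity and horizontal flip).
transforms = [lambda t: t, lambda t: torch.flip(t, dims=[-1])]
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x, y = torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,))
loss = adversarial_orbit_loss(model, x, y, transforms)
loss.backward()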
2.5 Sample Cover Estimation Algorithm

The sample cover induced by data transformations plays a central role in our understanding of model invariance. Despite its usefulness, exactly computing the sample cover turns out to be non-trivial in general. Indeed, computing the transformation-induced metrics can be difficult for continuous data transformations, and finding the smallest sample cover is NP-hard. To address this problem, we propose an algorithm to estimate the sample covering number and find the associated sample cover. We outline the algorithm and discuss the algorithmic challenges in this section. The algorithmic details appear in Section 2.9.4.

Setup. The estimation algorithm takes as input a sample S, a set of data transformations G, and the resolution parameter ϵ. It returns the estimated sample covering number N(ϵ, S, ρ_G) and the associated sample cover Ŝ_{G,ϵ}. The estimation algorithm has the following steps; a simplified end-to-end sketch is given after the step descriptions.

Step 1. Compute (or approximate) the direct orbit distance between any two examples in S. The direct orbit distance between any two examples x_i, x_j ∈ S is

d_{\mathcal{G}}(x_i, x_j) = \|\mathcal{G}(x_i) - \mathcal{G}(x_j)\| = \min_{g_1, g_2 \in \mathcal{G}} \|g_1(x_i) - g_2(x_j)\|,

which can be exactly computed for finite transformations (e.g., flipping) with complexity O(|G|^2), or approximated for continuous transformations (e.g., rotation) via optimization or sampling.

Step 2. Compute the ρ_G distance between any two examples in S. Given the results of Step 1, computing the ρ_G distance between any two examples can be formulated as a shortest path problem on a complete graph, where each node represents an example and the cost of each edge is the direct orbit distance computed in Step 1 (see the formulations in Section 2.9.4). Note that the shortest path is always included in our finite set of candidates even though the ρ_G distance considers infinitely many paths, because any path outside our finite candidates will be longer than its counterparts (depending on which orbits it intersects) among the finite candidates. Standard shortest path algorithms solve for all pairs of examples in polynomial time (e.g., via Dijkstra's algorithm [58] in O(n^3)).

Step 3. Construct the pairwise distance matrix [ρ_G(x_i, x_j)]_{i,j} and approximate the sample covering number. This step can be formulated as a set cover problem where each example x covers a subset of S in which each element's ρ_G distance to x is less than or equal to ϵ. Our goal is to find a minimum number of such subsets whose union contains S. This problem is known to be NP-hard in general but admits polynomial-time approximations [59]. In our experiments, we use a modified k-medoids [60] clustering method to find an approximation of N(ϵ, S, ρ_G) (see Algorithm 1).

Note that the estimated sample covering number returned by the algorithm is always an upper bound of the ground truth, regardless of the approximation error in Steps 1 and 3. When Step 1 is exact, the algorithm also exactly verifies whether a given subset of S forms a valid sample cover. In our experiments, Step 2 becomes the computational bottleneck for large samples. We leave improving the scalability as well as evaluating the approximation quality for future work.
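Putting the three steps together, the following is a simplified, self-contained sketch of the estimation pipeline. It is not the exact Algorithm 1 of Section 2.9.4: orbits are approximated by a finite list of (possibly sampled) transformations, ρ_G is computed with Floyd-Warshall, and a greedy set-cover heuristic stands in for the modified k-medoids step. Since sampled orbit distances can only overestimate the true distances and the returned set is a valid sample cover, the returned size is still an upper bound on N(ϵ, S, ρ_G).

import numpy as np

def estimate_sample_covering_number(sample, transforms, eps):
    """Upper-bound N(eps, S, rho_G) following Steps 1-3.
    sample: list of arrays; transforms: finite list of transformation callables
    (for continuous families, pass a list of sampled transformations); eps: resolution."""
    n = len(sample)

    # Step 1: approximate direct orbit distances d_G(x_i, x_j) over the listed transformations.
    orbits = [[t(x) for t in transforms] for x in sample]
    d = np.array([[min(np.linalg.norm(a - b) for a in orbits[i] for b in orbits[j])
                   for j in range(n)] for i in range(n)])

    # Step 2: rho_G = all-pairs shortest paths on the complete graph with edge costs d.
    rho = d.copy()
    for k in range(n):
        rho = np.minimum(rho, rho[:, k:k + 1] + rho[k:k + 1, :])

    # Step 3: greedy set cover -- repeatedly pick the example covering the most
    # uncovered examples within distance eps.
    uncovered, cover = set(range(n)), []
    while uncovered:
        best = max(range(n), key=lambda j: sum(rho[j, i] <= eps for i in uncovered))
        cover.append(best)
        uncovered -= {i for i in uncovered if rho[best, i] <= eps}
    return len(cover), cover

# Example usage on a toy sample that is closed under horizontal flips.
rng = np.random.default_rng(0)
base = [rng.normal(size=(8, 8)) for _ in range(10)]
sample = base + [x[:, ::-1] for x in base]
flips = [lambda x: x, lambda x: x[:, ::-1]]
size, representatives = estimate_sample_covering_number(sample, flips, eps=0.0)
print(size)  # 10: one representative per flip orbit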
2.6 Data-driven Selection of Data Transformations

The pool of candidate data transformations on a given dataset may be infinitely large. To maximize the generalization benefit of model invariance, we usually make selections based on expensive cross-validation due to the absence of model-training-free guidance. Section 2.4 suggests that invariant models may benefit from improved generalization guarantees if the corresponding data transformations induce small sample covering numbers. Therefore, we propose to use the sample covering number as an empirical suitability measurement to guide data transformation selection. We discuss its advantages, limitations, and empirical mitigation in this section.

Suitability measurement. To maximize the generalization benefit of model invariance on a dataset S, we measure the suitability of data transformations G by the sample covering number induced by G and favor small ones.

Advantages. One advantage of this suitability measurement is that it is model-training-free. It provides a priori guidance depending only on the dataset and the data transformations, thus avoiding expensive cross-validation and fueling the exploration of new types of data transformations. Another advantage is that it applies to any type of data transformation (including continuous and non-invertible ones) and provides a uniform benchmark.

Limitations and empirical mitigation. Being model-agnostic also poses two limitations to the suitability measurement. One limitation is that this suitability measurement, while capturing invariant models' reduced generalization gap, ignores their potentially increased empirical error. Note that certain data transformations on a dataset may drastically increase invariant models' empirical error and overturn the benefit of a reduced generalization gap. To mitigate this limitation, we consider two necessary conditions for maintaining low empirical error. First, the data transformations should preserve the underlying ground-truth labeling. We may use domain knowledge to meet this condition. Second, the model class should be rich enough to contain a low-error invariant hypothesis. In our experiments, neural networks that are invariant and achieve low training error satisfy this condition.

Another limitation is that this suitability measurement ignores the models' potential Lipschitz constant change after enforcing the invariance. Theorem 2.4.4 suggests that the generalization benefit enjoyed by invariant models depends on the models' Lipschitz constant and can be overturned if enforcing invariance leads to a significantly larger Lipschitz constant. To mitigate this limitation, we use the fact that we are doing classification tasks and use the label information to heuristically offset the Lipschitz constant increase. We use the minimum inter-class distance change after applying data transformations to capture the Lipschitz constant change and use it to normalize the sample covering number for better data transformation selection (see Section 2.9.6.2).

2.7 Experiments

In this section, we implement the sample cover estimation algorithm and verify the effectiveness of using sample covering numbers to guide data transformation selection. We first estimate the sample covering number induced by different types of data transformations on some image datasets. Then, we investigate the actual generalization benefit for models invariant to those data transformations and analyze the correlation. (Code is available at https://github.com/bangann/understanding-invariance.)

Datasets. We report experimental results on CIFAR-10 [61] and ShapeNet [62] in this section, and relegate results on CIFAR-100 and Restricted ImageNet to Section 2.9.6.1. ShapeNet is a large-scale 3D data repository that enables us to apply more complex data transformations (e.g., change of 3D view) beyond the common 2D geometric transformations. The work [63] provides 24 multi-view pre-rendered images for each 3D object in 10 chosen categories. For convenience, we use those images to approximate the random perturbation of the 3D view.

Figure 2.2: Estimated sample covering numbers induced by different data transformations at different resolutions ϵ on (a) CIFAR-10 and (b) ShapeNet. "base" indicates no transformation. Note that as ϵ increases, it starts to exceed the L2 distance between some images, and thus some images get covered by others without any transformation. Three vertical dashed lines indicate the maximum resolution ϵ at which the "base" yields a certain sample covering number; from left to right, they are 100%n, 99%n, and 95%n.

Data transformations. We evaluate some commonly used data transformations with typical parameter settings, which we assume to be label-preserving. We choose flipping, cropping, and rotation on CIFAR-10, and additionally consider the 3D-view change on ShapeNet. We use the same data transformations with the same parameter settings when estimating the sample covering number and when evaluating the generalization benefit. Section 2.9.5 provides more details of our experimental settings.
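For reference, the three 2D transformations above can be instantiated with standard torchvision samplers. The parameter values shown here are typical choices and only a hypothetical stand-in for the exact settings listed in Section 2.9.5.

from torchvision import transforms

# Hypothetical parameter choices for the three 2D transformations on CIFAR-10;
# the chapter's exact settings are given in Section 2.9.5 and may differ.
flip   = transforms.RandomHorizontalFlip(p=0.5)
crop   = transforms.RandomCrop(32, padding=4)
rotate = transforms.RandomRotation(degrees=15)

# Each candidate set of transformations G is evaluated separately, using the same
# sampler both for estimating N(eps, S, rho_G) and for training the invariant model.
candidate_transformations = {"flip": flip, "crop": crop, "rotate": rotate}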
2.7.1 Estimation of Sample Covering Numbers

We implement the algorithm in Section 2.5 to estimate the sample covering number induced by different data transformations. For efficiency, we randomly sample 1000 training images from CIFAR-10 and 800 training images from ShapeNet. Section 2.9.5 compares results with different sample sizes. We use the Euclidean norm for defining the sample cover. For continuous data transformations, we use uniform random sampling to approximate the orbit of a data point.

Figure 2.2 illustrates the estimated sample covering numbers induced by different transformations at different resolutions ϵ. As the resolution ϵ increases, the sample covering number N(ϵ, S, ρ_G) induced by any data transformation starts to decrease, indicating that a smaller-sized sample cover suffices to cover the entire dataset. Meanwhile, different transformations behave differently. On CIFAR-10, cropping induces the smallest sample covering number. On ShapeNet, the 3D-view transformation induces the smallest sample covering number, and the gap is significant. Our propositions suggest that data transformations that induce smaller sample covering numbers tend to bring more generalization benefit for the corresponding invariant models. Therefore, Figure 2.2 indicates that models should generalize well if they are invariant to the 3D-view transformation on ShapeNet or to cropping on CIFAR-10.

2.7.2 Evaluation of Generalization Benefit

We now evaluate the actual generalization performance of invariant models to verify whether the sample covering number is a good suitability measurement. We use ResNet18 [64] on both datasets and discuss the influence of the model class's implicit bias in Section 2.9.5. A simple method to learn invariant models is to perform data augmentation. The augmented loss function is L_aug(x) = L(f(g(x))), where f(·) denotes the model and g(x) denotes a randomly sampled example in x's orbit induced by transformation G. We use this method on CIFAR-10 and ShapeNet and show results in Tables 2.1 and 2.2; a minimal sketch of this training objective is given after the tables.

           n = 100                      n = 1000                     n = all
Model      acc (%)        gap           acc (%)        gap           acc (%)        gap
Base       41.05 ± 0.52   58.95 ± 0.52  68.62 ± 0.90   31.38 ± 0.90  85.43 ± 0.35   14.57 ± 0.35
Flip       44.19 ± 0.74   55.81 ± 0.74  75.12 ± 0.20   24.88 ± 0.20  89.67 ± 0.24   10.33 ± 0.24
Rotate     47.02 ± 0.46   52.93 ± 0.51  76.07 ± 0.28   23.92 ± 0.27  89.91 ± 0.13   10.05 ± 0.16
Crop       50.47 ± 0.48   49.53 ± 0.48  81.84 ± 0.12   18.15 ± 0.11  92.52 ± 0.08   7.48 ± 0.08

Table 2.1: Classification accuracy and generalization gap (the difference between training and test accuracy) for ResNet18 on CIFAR-10. n denotes the sample size per class.

           n = 100                      n = 1000                     n = all
Model      acc (%)        gap           acc (%)        gap           acc (%)        gap
Base       67.75 ± 2.02   32.25 ± 2.02  83.33 ± 0.38   16.67 ± 0.38  91.81 ± 0.22   8.18 ± 0.22
Flip       69.75 ± 1.55   30.25 ± 1.55  84.24 ± 0.30   15.76 ± 0.30  92.07 ± 0.20   7.92 ± 0.20
Rotate     70.25 ± 1.19   29.50 ± 1.15  83.93 ± 0.38   15.94 ± 0.35  91.85 ± 0.20   8.03 ± 0.26
Crop       74.88 ± 1.03   23.53 ± 1.30  86.13 ± 0.39   13.75 ± 0.32  92.64 ± 0.12   7.17 ± 0.19
3D-View    78.13 ± 1.31   14.94 ± 1.76  88.79 ± 0.34   8.38 ± 0.79   94.38 ± 0.08   3.09 ± 0.10

Table 2.2: Classification accuracy and generalization gap (the difference between training and test accuracy) for ResNet18 on ShapeNet. n denotes the sample size per class.
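The following is a minimal PyTorch-style sketch of the augmented objective L_aug(x) = L(f(g(x))) described above, with a placeholder transformation sampler and placeholder optimizer settings; it is illustrative rather than our exact training configuration.

import torch
import torch.nn.functional as F
from torchvision import models, transforms

model = models.resnet18(num_classes=10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
augment = transforms.RandomCrop(32, padding=4)   # g sampled from G; placeholder choice

def augmented_training_step(x, y):
    """One step minimizing L_aug(x) = L(f(g(x))) with a freshly sampled g per batch."""
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(augment(x)), y)
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with a random batch shaped like CIFAR-10 images.
x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
print(augmented_training_step(x, y))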
Sample covering number correlates well with generalization benefit. We use the generalization gap (the gap between training accuracy and test accuracy) to measure the actual generalization benefit. Compared with the baseline, invariant models show a reduced generalization gap and also improved test accuracy. On CIFAR-10, the cropping-invariant model shows the smallest generalization gap and the highest accuracy. On ShapeNet, the model that is invariant to 3D-view changes shows the smallest generalization gap and the highest accuracy, especially when the training data size is small. By comparing the results in Figure 2.2 and Tables 2.1-2.2, we observe a clear correlation between a smaller sample covering number and a larger generalization benefit. This verifies our proposition: invariance to more suitable data transformations gives the model more generalization benefit.

Model invariance indeed improves after learning. To verify that the improved generalization is indeed brought by model invariance, we further enforce the invariance using an invariance regularization loss similar to [65, 66]: L = L_cls(f(x)) + λ · KL(f(x), f(g(x))). Specifically, in addition to minimizing the classification loss on original images, we penalize the model by minimizing the KL divergence between model outputs on original images and on transformed ones. At test time, we use L_inv(x) = E_{g_1, g_2 ∈ G}[KL(f(g_1(x)), f(g_2(x)))] to evaluate model invariance under transformations G.

λ       train acc (%)   test acc (%)    gap            L_inv             A_inv (%)
0       99.99 ± 0.01    91.81 ± 0.22    8.19 ± 0.22    0.0548 ± 0.0028   62.0 ± 0.6
0.01    99.98 ± 0.00    92.77 ± 0.16    7.21 ± 0.16    0.0290 ± 0.0029   74.78 ± 1.61
0.1     99.99 ± 0.00    93.87 ± 0.19    6.11 ± 0.19    0.0152 ± 0.0003   83.12 ± 0.50
0.3     99.98 ± 0.00    94.23 ± 0.11    5.76 ± 0.11    0.0121 ± 0.0003   85.10 ± 0.20
1       99.58 ± 0.04    94.68 ± 0.09    4.90 ± 0.09    0.0095 ± 0.0001   86.94 ± 0.08
3       97.74 ± 0.19    94.48 ± 0.19    3.26 ± 0.09    0.0060 ± 0.0003   88.15 ± 0.18
10      95.67 ± 0.26    93.56 ± 0.29    2.11 ± 0.04    0.0037 ± 0.0002   89.20 ± 0.16
100     92.89 ± 0.25    91.85 ± 0.26    1.03 ± 0.03    0.0018 ± 0.0001   89.82 ± 0.10

Table 2.3: Evaluation of ResNet18 on ShapeNet under 3D-view transformations. L_inv denotes the test invariance loss. A_inv denotes the test consistency accuracy (indicating whether the model's prediction is unchanged after data transformation) under the worst-case data transformations.

Table 2.3 shows that, as we increase the invariance penalty by increasing λ, invariant models enjoy a smaller generalization gap. Moreover, the decreased invariance loss and increased consistency accuracy show that model invariance indeed improves after training, demonstrating the generalization benefit brought by model invariance.
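For concreteness, the following is a minimal PyTorch-style sketch of the invariance-regularized objective L = L_cls(f(x)) + λ · KL(f(x), f(g(x))) described above; the KL direction, the transformation sampler, and the toy model are illustrative assumptions rather than our exact implementation.

import torch
import torch.nn.functional as F

def consistency_regularized_loss(model, x, y, transform, lam):
    """Classification loss on original images plus a lambda-weighted KL divergence
    between predictions on original and transformed images."""
    logits = model(x)
    logits_aug = model(transform(x))
    cls_loss = F.cross_entropy(logits, y)
    # KL(f(x) || f(g(x))), computed from log-probabilities for numerical stability.
    kl = F.kl_div(F.log_softmax(logits_aug, dim=-1),
                  F.log_softmax(logits, dim=-1),
                  log_target=True, reduction="batchmean")
    return cls_loss + lam * kl

# Example usage with a placeholder model and horizontal flips as the transformation.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x, y = torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,))
loss = consistency_regularized_loss(model, x, y, lambda t: torch.flip(t, dims=[-1]), lam=1.0)
loss.backward()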
2.8 Conclusion

This paper investigates the generalization advantage of model invariance by establishing model complexity bounds using the sample cover generated by data transformations. Additionally, we introduce an algorithm to estimate the sample cover and demonstrate through empirical analysis that the sample covering number can aid in selecting suitable data transformations. Our hope is that this research will encourage the exploration of more appropriate data transformations for particular datasets. One potential avenue for future research is to examine the implicit biases of model classes to improve our understanding of the generalization benefits of model invariance.

2.9 Supplemental Materials

2.9.1 Complexity Measurements and Generalization Bounds

In this section, we provide additional details on complexity measurements and generalization bounds. The following lemma bounds the empirical Rademacher complexity of a function class H via the covering number of H evaluated at the sample S.

Lemma 2.9.1 (Dudley's Entropy Integral Theorem [47, 49]). Let H be a function class from X to R. Then, for any α > 0,

\mathcal{R}_S(\mathcal{H}) \le 4\alpha + 12\int_{\alpha}^{\infty} \sqrt{\frac{\log N\big(\tau, \mathcal{H}, L_2(P_S)\big)}{n}}\, d\tau.

The following theorem provides a uniform generalization bound for a function class via the empirical Rademacher complexity.

Theorem 2.9.2 ([47, 67]). Let H be a function class from X to [0, B]. For any δ > 0, with probability at least 1 − δ over the draw of a sample S of size n according to the data distribution D, the following holds for any h ∈ H:

R(h) \le R_S(h) + 2B\,\mathcal{R}_S(\mathcal{H}) + 3B\sqrt{\frac{\log\frac{2}{\delta}}{2n}}. \qquad (2.9.1)

We can plug the refined Rademacher complexity bounds in Proposition 2.9.3 and Theorem 2.9.4 into (2.9.1) to obtain refined generalization bounds for certain invariant models.

2.9.2 Proofs

We first prove Theorem 2.4.4, and then Proposition 2.4.3.

2.9.2.1 Proof of Theorem 2.4.4

Proof of Theorem 2.4.4. The general idea of this proof is to show that any cover of a model class evaluated at a sample cover also generates a same-sized cover of the model class evaluated at the original sample, with some additional approximation error. The covering number inequality in (2.4.4) then follows by taking the minimization over all covers of the model class evaluated at the original sample. Since this proof involves some tedious notation, we first restate the problem setup and then go into the details.

Problem setup. Let S = {x_1, x_2, . . . , x_n} be a sample of size n. Let Ŝ ⊆ S be an ϵ-cover of S with respect to ρ_G and have size m. Without loss of generality, we then vectorize S and Ŝ for notational simplicity. Denote by