ABSTRACT Title of Dissertation: BUILDING RELIABLE AI UNDER DISTRIBUTION SHIFTS Bang An Doctor of Philosophy, 2025 Dissertation Directed by: Associate Professor Furong Huang Department of Computer Science Machine learning models are increasingly deployed in real-world settings where distribu- tion shifts—differences between training and deployment data—can significantly impact their reliability. These shifts affect models in multiple ways, leading to degraded generalization, fair- ness collapse, loss of robustness, and new safety vulnerabilities. This dissertation investigates how to build reliable AI under distribution shifts, providing theoretical insights and practical solutions across diverse applications. We begin by studying generalization under distribution shifts, exploring how model in- variance affects performance. We introduce a theoretical framework that quantifies the role of data transformations in shaping generalization, providing insights into selecting transformations that improve model robustness in shifted environments. This foundation also extends to fair- ness, where we examine how pre-trained fair models fail when deployed in new distributions and propose a method to transfer fairness reliably under distribution shifts. Next, we focus on robust perception and AI-generated content under shifting distributions. We investigate how models interpret visual information, showing that contextual reasoning can help mitigate spurious correlations and improve robustness under domain shifts. We also assess the reliability of AI-generated content, revealing how image watermarks, designed for prove- nance tracking, often fail when subjected to real-world distortions and adversarial attacks. To address this, we introduce a comprehensive benchmark for evaluating watermark robustness, providing a framework for improving their reliability. Finally, we turn to safety in large language models (LLMs) and investigate how distribu- tion shifts in training and deployment introduce new vulnerabilities. We analyze false refusals in safety-aligned LLMs, demonstrating that misaligned decision boundaries lead to excessive con- servatism at test time. We also explore retrieval-augmented generation (RAG) models, showing that despite their promise, they can introduce new safety risks when deployed in settings for which they were not originally trained. Our findings highlight critical gaps in existing AI safety evaluations and emphasize the need for new methods tailored to evolving AI architectures. By addressing generalization, robustness, and safety under distribution shifts, this disser- tation contributes to a deeper understanding of these challenges and provides practical strategies for improving AI reliability in real-world deployment. BUILDING RELIABLE AI UNDER DISTRIBUTION SHIFTS by Bang An Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2025 Advisory Committee: Associate Professor Furong Huang, Chair/Advisor Professor Hal Daume Professor Min Wu Asssociate Professor Jia-Bin Huang Assistant Professor Sanghamitra Dutta © Copyright by Bang An 2025 Acknowledgments I would like to express my sincere gratitude to everyone who has contributed to the com- pletion of this thesis and made my graduate journey a truly memorable and enriching experience. First and foremost, I am deeply thankful to my advisor, Prof. 
Furong Huang, for granting me the invaluable opportunity to work on Reliable AI and giving me the flexibility to explore my interests. Her consistent guidance, accessibility, and unwavering support have been instrumental in my academic growth. I am incredibly fortunate to have had the chance to learn from and work alongside such an inspiring mentor. Without her support and belief in me, I would not have made it through many of the tough times. I am also grateful to my lab mates for their camaraderie, support, and collaboration. Spe- cial thanks to Dr. Jiahao Su and Dr. Yanchao Sun for their warm welcome and for helping me get started and feel at home in the lab when I first arrived at UMD. I have greatly benefited from my collaboration with Sicheng Zhu, Michael-Andrei Panaitescu-Liess, Yuancheng Xu, Tahseen Rabbani, Mucong Ding, Souradip Chakraborty, Amrit Singh Bedi, Zora Che, Zikui Cai, Cheng- hao Deng, Aakriti Agrawal, Pankayaraj Pathmanathan, Yuhang Zhou, Paiheng Xu, Xiaoyu Liu, Anirudh Satheesh, Shayan Shabihi, and my interactions with all other group members. I’m also thankful to Yuxin Wen from Tom’s lab for his generous sharing of ideas and valuable discussions. I would like to thank the members of my Ph.D. committee, Prof. Min Wu, Prof. Jia-bin Huang, Prof. Hal Daum´e III, and Prof. Sanghamitra Dutta. Their valuable time, feedback, and ii suggestions have helped me a lot to improve my research, dissertation, and presentation. I would also like to thank Dr. Tom Goldstein for advising many of my projects. I would like to express my appreciation to my internship mentors, Dr. Mark Dredze, Dr. Shiyue Zhang, Sam Sharpe, Dr. Bayan Bruss, Dr. Zhe Zhao, Dr. Lichan Hong, Dr. Ed Chi, and Dr. Xueting Han, who provided me with generous guidance and support during my internships at Bloomberg, Capital One, Google Brain, and Microsoft Research. I would also like to thank my other mentors and collaborators, Dr. Ruiyi Zhang, Dr. Chaithanya Kumar Mummadi, Prof. Changyou Chen, and Prof. Sanghyun Hong for their valuable contributions to my work. My deepest appreciation goes to my family—my mother and father—for their unwavering belief in me, constant encouragement, and steadfast support through every challenge. Their love and guidance have been my foundation. Special thanks to my husband, collaborator, lab mate, and soul mate, Sicheng Zhu. Thank you for walking this path with me—for your love, your presence, and for helping me grow into someone I’m proud of. Thank you to my beloved dog, Lucky, for bringing love, joy, and peace to me. Finally, I want to thank myself—for not giving up when things got tough, for staying committed through long nights and setbacks, and for believing in myself. I sincerely apologize if I have inadvertently left out anyone who has contributed to the completion of this thesis. You are a significant component of this journey. iii Table of Contents Acknowledgements ii Table of Contents iv List of Tables viii List of Figures xii List of Abbreviations xix Chapter 1: Introduction 1 1.1 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Dissertation Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 I Generalization and Fairness under Distribution Shifts 5 Chapter 2: Understanding the Generalization Benefit of Model Invariance from a Data Perspective 6 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . 10 2.3 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.4 Generalization Benefit of Model Invariance . . . . . . . . . . . . . . . . . . . . 13 2.4.1 Sample Cover Induced by Data Transformations . . . . . . . . . . . . . 13 2.4.2 Refined Complexity Analysis of Lipschitz Models . . . . . . . . . . . . 16 2.4.3 Framework for Model-invariance-sensitive Generalization Bounds . . . . 17 2.5 Sample Cover Estimation Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 19 2.6 Data-driven Selection of Data Transformations . . . . . . . . . . . . . . . . . . 21 2.7 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.7.1 Estimation of Sample Covering Numbers . . . . . . . . . . . . . . . . . 24 2.7.2 Evaluation of Generalization Benefit . . . . . . . . . . . . . . . . . . . . 25 2.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.9 Supplemental Materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.9.1 Complexity Measurements and Generalization Bounds . . . . . . . . . . 28 2.9.2 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.9.3 Refined Complexity Analysis for Linear Models . . . . . . . . . . . . . 37 2.9.4 Empirical Estimation of Sample Covering Numbers . . . . . . . . . . . . 40 iv 2.9.5 Experimental Details and Extended Experiments . . . . . . . . . . . . . 42 2.9.6 Extended Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 Chapter 3: Transferring Fairness under Distribution Shifts via Fair Consistency Regu- larization 52 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.3 Preliminaries and Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 3.4 Fairness under Distribution Shifts . . . . . . . . . . . . . . . . . . . . . . . . . 60 3.5 Transfer Fairness via Fair Consistency Regularization . . . . . . . . . . . . . . . 63 3.5.1 Theoretical Analysis: A Sufficient Condition for Transferring Fairness . . 63 3.5.2 Practical Algorithm: Fair Consistency Regularization . . . . . . . . . . . 66 3.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 3.6.1 Evaluation under Different Types of Distribution Shifts with a Synthetic Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 3.6.2 Evaluation on Real Datasets . . . . . . . . . . . . . . . . . . . . . . . . 71 3.6.3 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 3.8 Supplemental Materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 3.8.1 Proof and More Discussion of Fairness under Distribution Shifts . . . . . 76 3.8.2 Proof of the Sufficient Condition for Transferring Fairness . . . . . . . . 81 3.8.3 Details of Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 3.8.4 More Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 93 3.8.5 Impact and Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 II Vision Models under Distribution Shifts 99 Chapter 4: PerceptionCLIP: Visual Classification by Inferring and Conditioning on Contexts 100 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
101 4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 4.3 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 4.4 Preparing CLIP for Perception . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 4.4.1 Structuring and Describing Contextual Attributes . . . . . . . . . . . . . 106 4.4.2 Connecting Conditional Probabilities with CLIP Score . . . . . . . . . . 107 4.5 Contextual Attributes are Helpful and Inferable . . . . . . . . . . . . . . . . . . 109 4.5.1 Conditioning on Contextual Attributes is Helpful . . . . . . . . . . . . . 109 4.5.2 Contextual Attributes are Inferable . . . . . . . . . . . . . . . . . . . . . 112 4.6 PerceptionCLIP: Emulating Human Perception . . . . . . . . . . . . . . . . . 113 4.7 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 4.7.1 Zero-shot Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . 115 4.7.2 Group Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 4.9 Supplemental Materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 v 4.9.1 Extended Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 120 4.9.2 Image Caption Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 122 4.9.3 Human Visual Perception . . . . . . . . . . . . . . . . . . . . . . . . . 122 4.9.4 Approximating Conditional Probabilities . . . . . . . . . . . . . . . . . 123 4.9.5 Experimental Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 4.9.6 Additional Results and Analysis . . . . . . . . . . . . . . . . . . . . . . 132 4.9.7 Impact, Limitation and Future Work . . . . . . . . . . . . . . . . . . . . 140 Chapter 5: WAVES: Benchmarking the Robustness of Image Watermarks 142 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 5.2 Image Watermarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 5.3 Standardized Evaluation through WAVES . . . . . . . . . . . . . . . . . . . . . 147 5.3.1 Standardized Evaluation Workflow and Metrics . . . . . . . . . . . . . . 147 5.3.2 Stress-testing Watermarks . . . . . . . . . . . . . . . . . . . . . . . . . 150 5.4 Benchmarking Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . 157 5.4.1 Benchmarking Watermark Robustness . . . . . . . . . . . . . . . . . . 157 5.4.2 Benchmarking Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 5.4.3 Benchmarking Results for User Identification . . . . . . . . . . . . . . . 160 5.4.4 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162 5.4.5 Summary of Takeaway Messages . . . . . . . . . . . . . . . . . . . . . 164 5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 5.6 Supplementary Materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166 5.6.1 A Mini Survey of Image Watermarks . . . . . . . . . . . . . . . . . . . 166 5.6.2 Formalism of Watermark Detection and Identification . . . . . . . . . . . 170 5.6.3 Details on Performance Metrics . . . . . . . . . . . . . . . . . . . . . . 174 5.6.4 Design Choices of WAVES . . . . . . . . . . . . . . . . . . . . . . . . . 177 5.6.5 Evaluation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180 5.6.6 Details of Attacks . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . 187 5.6.7 Additional Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193 5.6.8 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207 III AI Safety Challenges due to Distribution Shifts 209 Chapter 6: Automatic Pseudo-Harmful Prompt Generation for Evaluating False Re- fusals in Large Language Models 210 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210 6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213 6.3 Defining Harmless, Controversial, and Harmful Prompts . . . . . . . . . . . . . 216 6.4 Automatic Pseudo-Harmful Prompt Generation . . . . . . . . . . . . . . . . . . 218 6.4.1 Surrogate Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218 6.4.2 Generation Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220 6.4.3 Steering the Generated Content . . . . . . . . . . . . . . . . . . . . . . 221 6.5 PHTest: A Dataset for False Refusal Evaluation . . . . . . . . . . . . . . . . . . 222 6.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224 vi 6.6.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224 6.6.2 Safety vs False-Refusal Trade-off . . . . . . . . . . . . . . . . . . . . . 226 6.6.3 Jailbreak Defenses Should Be Calibrated by False-Refusal Rates . . . . . 228 6.7 A Preliminary Exploration of Fine-Grained Alignment . . . . . . . . . . . . . . 229 6.7.1 Experimental Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229 6.7.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231 6.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234 6.9 Supplementary Materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235 6.9.1 Experimental Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235 6.9.2 Additional Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237 Chapter 7: RAG LLMs are Not Safer: A Safety Analysis of Retrieval-Augmented Gen- eration for Large Language Models 242 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242 7.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245 7.3 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246 7.4 RQ1: Are RAG-based LLMs safer than their non-RAG counterparts? . . . . . . . 248 7.5 RQ2: What makes RAG-based LLMs unsafe? . . . . . . . . . . . . . . . . . . . 250 7.5.1 Factor 1: Safety of the LLM . . . . . . . . . . . . . . . . . . . . . . . . 250 7.5.2 Factor 2: Retrieved Document Safety . . . . . . . . . . . . . . . . . . . 252 7.5.3 Factor 3: An LLM’s Capability on RAG Tasks . . . . . . . . . . . . . . 255 7.6 RQ3: Are red-teaming methods effective for RAG-based models? . . . . . . . . 258 7.6.1 Do non-RAG jailbreaks work for RAG? . . . . . . . . . . . . . . . . . . 260 7.6.2 Applying Jailbreaking Methods to RAG . . . . . . . . . . . . . . . . . . 261 7.7 Discussions on Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . 262 7.8 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262 7.9 Supplementary Materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264 7.9.1 Experimental Details and Additional Results for RQ1 . . . . . . . . . . . 264 7.9.2 Experimental Details and Additional Results for RQ2 . . . . 
. . . . . . . 268 7.9.3 Experimental Details and Additional Results for RQ3 . . . . . . . . . . . 289 Chapter 8: Conclusion 294 Bibliography 296 vii List of Tables 2.1 Classification accuracy and generalization gap (the difference between training and test accuracy) for ResNet18 on CIFAR-10. n denotes the sample size per class. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.2 Classification accuracy and generalization gap (the difference between training and test accuracy) for ResNet18 on ShapeNet. n denotes the sample size per class. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.3 Evaluation of ResNet18 on ShapeNet under 3D-view transformations. Linv de- notes the test invariance loss. Ainv denotes the test consistency accuracy (indicat- ing whether the model’s prediction is unchanged after data transformation) under the worst-case data transformations. . . . . . . . . . . . . . . . . . . . . . . . . 27 2.4 A diagram of the proof of Theorem 2.4.4. . . . . . . . . . . . . . . . . . . . . . 30 2.5 Some notations used in the proof of Theorem 2.4.4. . . . . . . . . . . . . . . . . 30 2.6 Data transformations used in our experiments. . . . . . . . . . . . . . . . . . . . 45 2.7 Sample covering numbers, classification accuracy, and generalization gap (the difference between training and test accuracy) for ResNet18 on CIFAR-100. . . . 47 2.8 Sample covering numbers, classification accuracy, and generalization gap (the difference between training and test accuracy) for ResNet18 on Restricted Ima- geNet. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 2.9 Sample covering number (SCN) without and with normalization and generaliza- tion performance of ResNet18 on CIFAR-10. . . . . . . . . . . . . . . . . . . . 48 2.10 Classification accuracy and generalization gap (the difference between training and test accuracy) for MLP on ShapeNet. n denotes the sample size per class. . . 49 3.1 Transfer fairness and accuracy from UTKFace to FairFace . . . . . . . . . . . . 72 3.2 Transfer fairness and accuracy from UTKFace to FairFace with weak transfor- mations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 3.3 Ablation study on UTKFace-FairFace task . . . . . . . . . . . . . . . . . . . . . 75 3.4 Latent factors in 3dshapes dataset. . . . . . . . . . . . . . . . . . . . . . . . . . 85 3.5 Simulate different distribution shifts. P(Y,A) is represented by the proportions of four groups as [P(Y = 0, A = 0),P(Y = 0, A = 1),P(Y = 1, A = 0),P(Y = 1, A = 1)]. P(D) is represented by the proportions of eight possible values of scale. Other factors have uniform marginal distributions. Images in two domains are sampled according to the marginal distributions of six latent factors. . . . . . 86 3.6 Statistics of UTKFace and FairFace datasets. . . . . . . . . . . . . . . . . . . . 88 3.7 Statistics of NewAdult dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 3.8 Statistics of UTK and FairFace datasets used in Table 3.9. . . . . . . . . . . . . . 93 3.9 Transfer fairness and accuracy from UTKFace to FairFace with less source data. . 94 viii 3.10 Results by using different transformations in our method. Average results of three trials. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 4.1 Conditional probabilities. x, y, and z denote image, class, and contextual at- tributes. z denotes (z1, . . . , zm) for simplicity. . . . . . . . 
. . . . . . . . . . . . 108 4.2 Classification accuracy (%) on ImageNet. We apply the left-side image transfor- mations to alter the corresponding attributes. Different methods condition on dif- ferent values of the contextual attributes. Conditioning on correct or self-inferred attribute values improves accuracy the most. . . . . . . . . . . . . . . . . . . . . 110 4.3 Inference accuracy (%) of two contextual attribute inference methods on Ima- geNet. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 4.4 Zero-shot classification accuracy on five datasets using ViT-B/16. The best result in each column is highlighted in bold, while the next three highest values are underlined. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 4.5 Classification accuracy of ViT-B/16 on different data domains with CLIP. . . . . 116 4.6 Intervening in inferring contextual attributes improves zero-shot classification. . 117 4.7 Average accuracy and worst group accuracy on the Waterbirds dataset. . . . . . 119 4.8 Average accuracy and worst group accuracy on the CelebA dataset. . . . . . . . 119 4.9 Image caption examples from LAION-400M (comparable to CLIP’s pretraining dataset). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 4.12 The average saliency (%) of the core feature and the spurious feature evaluated on the Waterbirds test set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 4.10 Summary of descriptions for different attributes used in Figure 4.3, Table 4.2 and Table 4.3. z∗ denotes the correct value of the contextual attribute, and zwrong denotes the wrong value of the contextual attribute. Ideally, each attribute has a distribution of text descriptions. Here, we use three descriptions and use the averaged text embeddings of them to calculate the CLIP score. . . . . . . . . . . 135 4.11 Similarity score and classification accuracy on ImageNet test set. We apply a composition of two transformation functions on images, and use the composition of attributes’ descriptions for text. . . . . . . . . . . . . . . . . . . . . . . . . . 136 4.13 Summary of contextual attributes and their value descriptions used in ImageNet- related datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 4.14 Datasets, domain templates and contextual attributes used in Table 4.5 . . . . . . 136 4.15 Domain templates, contextual attributes and their descriptions used in Table 4.7 and Table 4.8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 4.16 Ablation study on ImageNet and related datasets. . . . . . . . . . . . . . . . . . 137 4.17 Ablation study on different data domains. . . . . . . . . . . . . . . . . . . . . . 137 4.18 Performance of PerceptionCLIP using two order types in the attribute concate- nation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 4.19 Contextual attributes and their value descriptions for EuroSAT generated by GPT- 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 ix 5.1 Comparison of robustness evaluations with existing works. For categories of attacks, D, R, and A denote distortions, image regeneration, and adversarial attacks. Joint test means whether the performance and quality are jointly tested under a range of attack strengths. Our benchmark is the most comprehensive one, with a large scale of attacks, data, metrics, and more realistic evaluation setups. . 
144 5.2 A taxonomy of all the attacks in our stress-testing set. Novel attacks proposed by WAVES are marked with ∗. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 5.3 Comparison of attacks across three watermarking methods in detection setup. Q denotes the normalized quality degradation, and P denotes the performance as derived from Figure 5.7. Q@0.95P measures quality degradation at a 0.95 perfor- mance threshold where "inf" denotes cases where all tested attack strengths yield performance above 0.95, and "-inf" where all are below. A similar notation ap- plies to Q@0.7P. Avg P and Avg Q are the average performance and quality over all the attack strengths. The lower the performance and the smaller the quality degradation, the stronger the attack is. For each watermarking method, we rank attacks by Q@0.95P, Q@0.7P, Avg P, Avg Q, in that order, with lower values (↓) indicating stronger attacks. The top 5 attacks of each watermarking method are highlighted in red. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 5.4 A list of alternative watermarking algorithms not tested by WAVES in this work. 179 5.5 The knowledge of attackers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187 5.6 Comparison of attacks across three watermarking methods under the identi- fication setup (one million users). Q denotes the normalized quality degradation and P denotes the performance as derived from aggregated 2D plots. Q@0.7P measures quality degradation at a 0.7 performance threshold where "inf" denotes cases where all tested attack strengths yield performance above 0.7, and "-inf" where all are below. Q@0.4P is defined analogously. Avg P and Avg Q are the average performance and quality over all the attack strengths. The lower the per- formance and the smaller the quality degradation, the stronger the attack. For each watermarking method, we rank attacks by Q@0.7P, Q@0.4P, Avg P, Avg Q, in that order, with lower values (↓) indicating stronger attacks. The top 5 attack of each watermarking method are highlighted in red. . . . . . . . . . . . . . . . 194 6.1 Two experimental settings. The only difference is whether to use pseudo-harmful prompts in the training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230 6.2 Type-I false refusal: misinterpretation. LLMs falsely refuse some generated prompts because they misunderstand the prompts’ literal meanings or the users’ intentions. We label these prompts separately. Such false refusals imply a lack of understanding by the LLM, and they diminish as the LLM’s scale increases in our evaluations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240 6.3 Type-II false refusal: misalignment. LLMs falsely refuse some generated pseudo- harmful prompts because they apply the rules learned during safety alignment to inappropriate scenarios. We observe that such false refusals do not automatically diminish as the LLM’s scale increases, suggesting that mitigation may require more refined alignment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240 6.4 Some types of unnatural prompts in existing pseudo-harmful datasets. . . . . . . 241 x 7.1 Ranking of models from safe to unsafe. ≳ denotes the difference of unsafety is less than 1%. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251 7.2 Comparison of probabilities for generating unsafe responses in non-RAG and RAG settings. ✓ denotes safe, and ✗ denotes unsafe ones. . . . . . . . . . . . . 
252 7.3 Safety of retrieved documents. . . . . . . . . . . . . . . . . . . . . . . . . . . . 253 7.4 Evaluation of extraction and summarization ability. Gemma performs poorly, leading to frequent refusals, which gives a false appearance of safety. . . . . . . . 256 7.5 Evaluation of models’ attention to documents via testing the accuracy with re- trieved and random documents. Most models do not rely fully on documents. . . 257 7.6 The average perplexity of the jailbreaking prompts created by two methods. . . . 291 xi List of Figures 1.1 Distribution shift often happens in reality, causing poor generalization and many other issues. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Dissertation Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.1 Illustration of the pseudometric and sample cover induced by data transformations. 9 2.2 Estimated sample covering numbers induced by different data transformations at different resolutions ϵ. “base” indicates no transformation. Note that as ϵ increases, it starts to exceed the L2 distance between some images and thus some images get covered by others without doing any transformation. Three vertical dashed lines indicate the maximum resolution ϵ at which the “base” yields a certain sample covering number, and from left to right they are 100%n, 99%n, 95%n. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.3 An illustration of data transformations . . . . . . . . . . . . . . . . . . . . . . . 44 2.4 (a)-(c): Estimated sample covering numbers induced by different data transfor- mations on ShapeNet. n denotes the total sample size. (d): The normalized sam- ple covering number (=sample covering number /n) of 3D-view transformations estimated with different sample sizes. . . . . . . . . . . . . . . . . . . . . . . . 50 3.1 Illustration of intra-group expansion assumption in the input space. An ex- ample of gender classification task with the sensitive attribute being race. Intra- group expansion assumes that different groups are separated but every group is self-connected under certain transformations. If a model has consistent predic- tions under those transformations, we can propagate labels within each group. Under this assumption, we propose to obtain fairness and accuracy in both do- mains by a self-training algorithm with fair consistency regularization. . . . . . . 53 3.2 Training diagram. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 3.3 Accuracy and unfairness (error bar denotes the standard deviation) in two do- mains under subpopulation shifts (Sshift 1, Sshift 2), domain shift (Dshift), and hybrid shift (Hshif). (S) and (T) denotes the evaluation in the source and target domains respectively. Results show that domain shift is more challenging than subpopulation shift, and our method can effectively transfer accuracy and fairness under all the distribution shifts considered. . . . . . . . . . . . . . . . . . . . . 69 3.4 Comparison of Pareto frontiers. Upper left is preferred. Our method outperforms baseline methods in achieving accuracy and fairness at the same time. . . . . . . 73 3.5 Unfairness and accuracy tested on NewAdult. CA as the source domain (red star) and other states as the target domain (blue dots). Red lines indicate the average of unfairness. The relative decrease is calculated by comparing with Laftr. . . . . 74 xii 3.6 Per-group accuracy and consistency. 
Compared with the standard consistency regularization (SCR), the model trained with fair consistency regularization (FCR) has more balanced consistency and accuracy. . . . . . . . . . . . . . . . . . . . 75 3.7 Randomly sampled examples from two domains under different distribution shifts. 87 3.8 With fair consistency regularization, our method alleviates the disparate impact of FixMatch. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 4.1 (Left): CLIP co-relates natural language descriptions of contextual attributes with visual cues (orientation: upside-down). (Center): Unlike CLIP’s stan- dard zero-shot inference that uses fixed template(s) for class name retrieval, our method first infers contextual attributes (background: on the grass) using CLIP and then let CLIP predicts the class conditioned on the inferred contextual at- tributes. Here, background and orientation are both examples of contextual at- tributes. (Right): Grad-CAM visualization illustrates that our method focuses more on core features (on the dog) and is less distracted by spurious features (grass background) when performing the object classification. . . . . . . . . . . 101 4.2 Illustration of contextual attributes, their symbolic discrete values, and the possi- ble textual descriptions mapped by the annotation function. . . . . . . . . . . . . 106 4.3 Evaluating CLIP scores on ImageNet with different transformations altering the contextual attributes. The attribute-aware CLIP score gives higher scores for cor- rectly matched image-attribute pairs (green) while giving lower scores for mis- matched pairs (grey) and random pairs (blue), confirming CLIP’s understanding of our contextual attribute descriptions. CLIP score measures the similarity be- tween images and contextual attributes, while the original CLIP score (orange) is attribute-agnostic. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 4.4 Images of a leopard and a waterbird, core and spurious features, and Grad-CAM heatmaps using no, incorrect, and ground-truth contextual attributes (with text below images). The bar shows core vs. spurious ratio in the heatmap. Visual- ization shows that classification conditioned on correct contextual attributes en- forces CLIP’s focus on core features. . . . . . . . . . . . . . . . . . . . . . . . 112 4.5 The increase in (left) CLIP scores and the (right) prediction probabilities by incorporating the descriptions of the correct contextual attribute into the text prompts. We compare the increased CLIP scores and prediction probabilities for the ground-truth class y∗, the Top-5 and Top-10 wrong classes. (left) Incorporat- ing ground-truth attributes into text prompts results in increased CLIP scores for both correct and incorrect classes. This improvement is attributed to the enhanced alignment of the text prompts with the images, addressing previously overlooked contextual attributes. Notably, the CLIP score of the correct class benefits more from this enhancement for all the attributes considered. This is because the ac- curate description of the class, combined with the contextual attributes, achieves a more precise alignment with the corresponding image. (right) Therefore, the model is more likely to predict the correct class after being provided with the correct context description in the prompt. . . . . . . . . . . . . . . . . . . . . . 128 xiii 4.6 Leopard images from ImageNet dataset. 
Visualization of the original image, the regions of core and spurious features, and the Grad-CAMs obtained using no, incorrect, and ground-truth contextual attributes. . . . . . . . . . . . . . . . . . 134 4.7 Waterbird images from Waterbirds dataset. Visualization of the original image, the regions of core and spurious features, and the Grad-CAMs obtained using no, incorrect, and ground-truth contextual attributes. . . . . . . . . . . . . . . . . . 138 5.1 WAVES establishes a standardized evaluation framework that encompasses a comprehensive suite of stress tests including both existing and newly proposed stronger attacks (denoted by ∗). . . . . . . . . . . . . . . . . . . . . . . . . . . 143 5.2 Evaluation workflow. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 5.3 Regeneration attacks on Tree-Ringk. Regen-Diff is a single diffusive regener- ation and Rinse-[N]xDiff is a rinsing one with N repeated diffusions, with the number of noising steps as attack strength. Regen-VAE uses a pre-trained VAE with quality factor as strength and Regen-KLVAE uses pre-trained KL-VAEs with bottleneck size as strength. RinseD-VAE applies a VAE as a denoiser after Rinse- 4xDiff. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 5.4 Adversarial embedding attacks target Tree-Ring at strengths of {2/255, 4/255, 6/255, 8/255}. Tree-Ring shows vulnerability to embedding attacks, especially when the adversary can access the VAE being used. . . . . . . . . . . . . . . . . 153 5.5 Three settings for training the surrogate detector. The Generator is the victim generator under attack. We externalize the watermarking process for simplicity, but it could be in-processing watermarks. After training the surrogate detectors, the adversary performs PGD attacks on them to flip the labels. . . . . . . . . . . 154 5.6 Adversarial surrogate detector attacks on Tree-Ring. . . . . . . . . . . . . . . . . 154 5.7 Unified performance vs. quality degradation 2D plots under detection setup. We evaluate each watermarking method under various attacks. Two dashed lines show the thresholds used for ranking attacks. . . . . . . . . . . . . . . . . . . . 156 5.8 (a) Detection performance of three watermarks after attacks, measured by Aver- age TPR@0.1%FPR with lower values (near center) indicating higher vulnera- bilities. (b) The distribution of quality degradation. The lower, the better. . . . . . 158 5.9 Identification accuracy of three watermarks after attacks. . . . . . . . . . . . . . 161 5.10 An illustration of a robust watermarking workflow. An AI company provides two services: (1) generate watermarked images, i.e., embed invisible messages, and (2) detect these messages when shown any of their watermarked images. There is an attack stage between the watermarking and detection stages. The watermarked images may experience natural distortions (e.g., compression, re- scaling) or manipulated by malicious users attempting to remove the watermarks. A robust watermarking method should still be able to detect the original message after an attack. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167 5.11 Word clouds of DiffusionDB, MS-COCO, and DALL·E3 prompts. . . . . . . . . 179 5.12 Image examples of DiffusionDB, MS-COCO, and DALL·E3. . . . . . . . . . . . 179 xiv 5.13 Ranking attacks with different quality metrics on DiffusionDB images wa- termarked by Tree-Ring. Attack potency is ranked by image quality at 0.95 TPR@0.1%FPR. 
Colors indicate the ranks (1=best, 9=worst), and values show the measured quality. We use ’NA’ to label an attack if its attack curve lies en- tirely above TPR=0.95; the attack is automatically ranked last. . . . . . . . . . . 183 5.14 Cumulative distribution functions (CDFs) for eight image quality metrics across all attacked watermarked images. The horizontal dashed lines mark the 10% and 90% quantiles, and the intersecting vertical dashed lines delineate the bounds of the normalization intervals. Values at the lower bound are normalized to 0.1, and those at the upper bound to 0.9. . . . . . . . . . . . . . . . . . . . . . . . . . . . 185 5.15 Distortions and their combinations. We combine three types of distortions: ge- ometric, photometric, and degradation, both individually and collectively. By comparing quality-performance plots, we see combinations of distortions do not necessarily lead to better attacks. . . . . . . . . . . . . . . . . . . . . . . . . . . 189 5.16 Regenerative diffusion with varying depth of noising steps and a VAE regenera- tion with a low quality factor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190 5.17 4x rinsing regeneration with varying depth of noising steps per diffusion. . . . . . 190 5.18 An image of a dragon attacked using a 3x rinsing regeneration. Pushing the image through a VAE restores image quality, noticeable in the eye color of the dragon (indicated by the green box). Image is drawn from the Gustavosta Stable Diffusion dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191 5.19 Aggregated performance vs. quality degradation 2D plots under identifica- tion setup (one million users). We evaluate each watermarking method under various attacks. Two dashed lines show to thresholds used for ranking attacks. . . 194 5.20 The spoofing attack fails for AdvCls-UnWM&WM. . . . . . . . . . . . . . . . 195 5.21 Visualization of AdvCls-UnWM&WM attack. (a) shows the watermarking mask of Tree-Ring where there are four channels, and we only watermark the last chan- nel. The watermark message is the rings, which contain ten complex numbers that are not shown in the figure. (b) and (c) show the inversed latent before and after the attack in the Fourier space. We only show the real part of the latent. Clearly, the rings exist before the attack and vanish after the attack. (d) shows the magnitude of the element-wise difference before and after the attack. The attack not only perturbs the watermark part but also other features. The average magnitude change of the watermark-part and non-watermark-part is around 2:1. The attack successfully disturbs the watermark, albeit in an imprecise manner. . 196 5.22 Visualization of AdvCls-WM1&WM2 attack. (a) and (b) are the same as that in Figure 5.21. (c) shows the inversed latent after the attack, where the watermark vanishes instead of changing to another watermark. (d) shows the magnitude of the element-wise difference before and after the attack. The attack not only per- turbs the watermark part but also other features. The average magnitude change of the watermark-part and non-watermark-part is also around 2:1. Although the surrogate detector is trained to classify two different watermark messages. The attack based on it cannot change the watermark message from one to another but can effectively disturb the watermark. . . . . . . . . . . . . . . . . . . . . . . . 197 xv 5.23 The user identification results for Tree-Ring under AdvCls-WM1&WM2 attacks. 
The original watermarked images are embedded with User1’s message. AdvCls- WM1&WM2 tries to disrupt the latent feature of those images so that they can be misidentified as User2 generated. We simulate two settings: 100 users and 1000 users in total. The blue curves represent the proportion of images correctly iden- tified as belonging to User1, while the orange curves show those misidentified as User2’s. Note that, the axes for blue and orange curves have different ranges in the figure. With increasing attack strengths, the likelihood of correctly identify- ing them as User1’s decreases significantly under both 100 and 1K user scenarios. However, misidentification as User2’s images occurs notably only when the total number of users is small (e.g., 100 users). . . . . . . . . . . . . . . . . . . . . . 198 5.24 A visual demonstration of various adversarial, regeneration, and distortion at- tacks on a Tree-Ring watermarked image. Figure (a) is the base unattacked image. The base prompt, drawn from DiffusionDB, is “digital painting of a lake at sunset surrounded by forests and mountains,” along with further styling details. 200 5.25 Evaluation on DiffusionDB dataset under the detection setup (part 1). . . . . . . 201 5.26 Evaluation on DiffusionDB dataset under the detection setup (part 2). . . . . . . 202 5.27 Evaluation on MS-COCO dataset under the detection setup (part 1). . . . . . . . 203 5.28 Evaluation on MS-COCO dataset under the detection setup (part 2). . . . . . . . 204 5.29 Evaluation on DALL·E3 dataset under the detection setup (part 1). . . . . . . . . 205 5.30 Evaluation on DALL·E3 dataset under the detection setup (part 2). . . . . . . . . 206 5.31 Stress test results for DWT-DCT. It is highly susceptible to regeneration attacks (cross markers) and most distortion attacks (square markers), but relatively robust against adversarial attacks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207 5.32 Stress test results for MBRS. It is vulnerable to certain distortion attacks (resized- cropping, blurring, rotation, combo distortions) and regeneration attacks, but ro- bust against other distortions (JPEG compression, brightness/contrast, random erasing, noise) and adversarial attacks. . . . . . . . . . . . . . . . . . . . . . . . 208 6.1 Examples of pseudo-harmful prompts generated by our method using llama2 as the target LLM, then transferred to closed-source LLMs. . . . . . . . . . . . . . 212 6.2 Some controversial prompts generated by our method. Claude 3.5 Sonnet (shown) refuses to respond, while GPT 4o and Gemini 1.5 Flash do. The left and middle’s harmfulness depends on definitions, while the right could have either innocent or malicious intentions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216 6.3 Diagram of our automatic pseudo-harmful prompt generation. . . . . . . . . . . 218 6.4 (Left) LLM "recognizes" pseudo-harmfulness. Using only Llama2-8B’s refusal likelihood, we classify pseudo-harmful (green) and harmful (red) XSTest prompts with AUC 79.3%. This suggests that pseudo-harmful prompts often lie on the boundary of the LLM’s refusal decision. (Right) Using the LLM as a harmful- ness judge often aligns better with human evaluation than seeing if it refuses the prompt. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220 xvi 6.5 Comparison of quantity and distribution between PHTest and XSTest. (Left) PHTest prompts have lower perplexity (mainly because XSTest prompts are gen- erally shorter). 
(Right) XSTest prompts generally have a higher negative log- likelihood (NLL), making them more common in practice, while PHTest covers broader long-tail distributions. . . . . . . . . . . . . . . . . . . . . . . . . . . . 222 6.6 False refusal rates of different LLMs on PHTest. . . . . . . . . . . . . . . . . . . 225 6.7 Tested LLMs demonstrate a trade-off between safety (low ASR on HarmBench) and usability (low FRR on PHTest’s harmless prompts). The safety of *-marked LLMs are potentially underestimated. We test their jailbreak ASR on a small available prompt set from HarmBench, while taking others directly from Harm- Bench’s report. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227 6.8 False refusal rates before and after applying some jailbreak defenses. . . . . . . . 228 6.9 Fine-grained alignment requires a comprehensive safety policy and/or edge cases. 229 6.10 Testing accuracy during the training. Alignment w/ PHP that uses PHTest in the training has a significantly larger accuracy than the alignment w/o PHP that uses ShareGPT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232 6.11 The safety of the model in the two settings steadily improves. However, there is a large gap between the two settings in the accuracy on the benign prompts. Alignment w/o PHP (ShareGPT) soon gets plateaued with large variance. . . . . 232 6.12 Decision boundary "visualization". . . . . . . . . . . . . . . . . . . . . . . . . . 233 6.13 Decision boundary "visualization". . . . . . . . . . . . . . . . . . . . . . . . . . 234 6.14 (Left) An example of our auto-generated pseudo-harmful prompt. (Right) Defin- ing harmfulness is complex, requiring detailed rules and supporting examples. We conjecture that safety alignment also requires extensive training examples to characterize the model’s rejection boundary. Jailbreak defenses without using additional data may only “shift” the boundary, leading to more false refusals. . . 237 6.15 Pseudo-harmful prompt examples generated by our method. . . . . . . . . . . . 238 6.16 Pseudo-harmful prompt examples generated by our method. . . . . . . . . . . . 239 7.1 RAG can make safe models unsafe, even if the retrieved documents are safe. . . . 243 7.2 Safety of LLMs in non-RAG vs. RAG settings. Most LLMs in the RAG setting exhibit a significantly higher percentage of unsafe responses. . . . . . . . . . . . 246 7.3 The change of risk profile from non-RAG to RAG is model-dependent. . . . . . . 247 7.4 Risk profile of Llama-3-8B. It is vulnerable in 7 categories in a non-RAG setting, but is vulnerable in all 16 categories in RAG, with an increase in risk across all categories. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247 7.5 RAG is unsafe at points where non-RAG is unsafe, and more. . . . . . . . . . . . 251 7.6 Using one document in the RAG setting can change the safety behavior of mod- els. Provided with more documents, LLMs tend to be more vulnerable. . . . . . . 255 7.7 Capability of LLMs on RAG tasks. . . . . . . . . . . . . . . . . . . . . . . . . . 258 7.8 Train jailbreaking prompts on non-RAG Llama-3-8B and test them in the RAG setting with a varying number of retrieved documents. . . . . . . . . . . . . . . . 259 xvii 7.9 Train jailbreaking prompts on Llama-3-8B in the RAG setting using five docu- ments retrieved from the original queries, and test them in the RAG setting by retrieving documents using the optimized prompts with varying numbers of re- trieved documents. . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . 260 7.10 Risk taxonomy and the distribution in the dataset. . . . . . . . . . . . . . . . . . 265 7.11 Non-RAG (the upper) vs RAG (the bottom) pipelines. . . . . . . . . . . . . . . . 266 7.12 (Part 1) Risk profile of every LLM in non-RAG vs. RAG settings. . . . . . . . . 271 7.13 (Part 2) Risk profile of every LLM in non-RAG vs. RAG settings. . . . . . . . . 272 7.14 The change of risk profile from non-RAG to RAG. . . . . . . . . . . . . . . . . 273 7.15 Distribution of unsafe documents. . . . . . . . . . . . . . . . . . . . . . . . . . 275 7.16 Train jailbreaking prompts on non-RAG Mistral-V0.3 and test them in the RAG setting with a varying number of retrieved documents. . . . . . . . . . . . . . . . 290 7.17 Train jailbreaking prompts on Mistral-V0.3 in the RAG setting using five docu- ments retrieved from the original queries, and test them in the RAG setting by retrieving documents using the optimized prompt with a varying number of re- trieved documents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291 7.18 The fine selection phase of gradient-based methods involves calculating the jail- breaking loss for a large set of adversarial suffix candidates chosen through gra- dients. Previous work has addressed this using batch inference. However, in the RAG setting, the input query to LLMs—comprising both the retrieved docu- ments and the question—is significantly longer, leading to memory issues when performing batch inference with large batch sizes. . . . . . . . . . . . . . . . . . 292 7.19 We employ tree-attention to convert a batch of inputs into a sequence. The at- tention mask visualizes the tree-attention structure in the case of two candidates. Since the long query remains fixed during adversarial suffix optimization, we pre-process it and use it as a KV cache. The position IDs should also be adjusted accordingly. This approach allows us to efficiently compute the jailbreaking loss. 292 xviii List of Abbreviations AI Artificial Intelligence ASR Attack Success Rate CLIP Contrastive Language-Image Pretraining DNN Deep Neural Network FRR False Refusal Rate LLM Large Language Model ML Machine Learning PGD Projected Gradient Decent RAG Retrieval Augmented Generation VLM Vision Language Model xix Chapter 1: Introduction As machine learning (ML) systems move from controlled research environments into com- plex real-world settings, their reliability under changing conditions emerges as a central concern. Models trained under idealized assumptions often encounter significant performance degradation when faced with real-world distribution shifts—situations where the data observed during de- ployment differs from the training data. Distribution shifts undermine critical aspects of model reliability, including generalization, fairness, robustness, and safety, raising fundamental ques- tions about the trustworthiness of deployed AI systems. 1.1 Background and Motivation Historically, machine learning research has focused primarily on achieving high perfor- mance under the assumption of independently and identically distributed (i.i.d.) training and testing data. While this assumption has facilitated rapid advances in model accuracy and effi- ciency, it does not hold in many practical applications. Real-world data often evolves over time, varies across domains, and is susceptible to manipulation, adversarial attacks, and shifts in de- mographics or contexts. 
Such distribution shifts have become a widely recognized challenge for deploying reliable AI in sensitive or high-stakes domains such as healthcare, autonomous driving, financial decision-making, and content moderation.

Figure 1.1: Distribution shift often happens in reality, causing poor generalization and many other issues.

Recent real-world failures highlight this issue: models trained to recognize pedestrians perform poorly under changed lighting conditions; models optimized for fairness in one demographic group fail to generalize fairness constraints across different populations; and large language models fine-tuned for safety in controlled settings become overly conservative or dangerously inaccurate when encountering unforeseen queries. Addressing these challenges is crucial for safe, ethical, and reliable AI deployment.

Research on distribution shifts spans several subfields, including domain adaptation [1, 2], domain generalization [3, 4], robust optimization [5, 6], fairness-aware learning [7, 8], and AI safety [9, 10]. Domain adaptation techniques aim to adjust models trained on source distributions to perform well on a specific shifted target distribution [2]. In contrast, domain generalization seeks models robust to multiple unseen target distributions [3]. Recent methods involve invariance-based learning [11, 12], self-training [13], and consistency regularization [14], emphasizing alignment between training and deployment contexts.

In fairness-aware learning, significant progress has been made toward enforcing algorithmic fairness under static distributions [7]. However, little attention has been given to fairness preservation across changing environments [15, 16]. Meanwhile, robustness research, including adversarial robustness [6] and robustness to spurious correlations [5], focuses on ensuring that models are resilient under deliberate or incidental shifts in input distributions. Lastly, the AI safety literature increasingly highlights unintended behaviors of large foundation models under shifting distributions, particularly as models scale and complexity increases [9, 10].

Figure 1.2: Dissertation Overview.

Despite these advances, existing literature tends to focus narrowly on isolated aspects of distribution shifts, leaving gaps in a holistic understanding of reliability across generalization, fairness, robustness, and safety.

1.2 Dissertation Overview

This dissertation explores the question of how to build reliable AI under distribution shifts, structured into three main parts:

Part I: Generalization and Fairness under Distribution Shifts. We study the generalization of models facing distribution shifts in Chapter 2, introducing a theoretical framework that connects model invariance to generalization performance. This work guides the selection of transformations to improve model robustness. Extending this perspective, we address how fairness can collapse under distribution shifts and propose an approach, fair consistency regularization, to reliably transfer fairness across shifted environments in Chapter 3.

Part II: Vision Models under Distribution Shifts. We focus on generalization and robustness of vision models under distribution shifts, particularly how perception and generative AI systems respond to shifts.
In Chapter 4, inspired by the shift between training and deployment as well as by human visual perception, we introduce a method to improve zero-shot classification robustness by explicitly modeling contextual attributes, mitigating reliance on spurious features. In Chapter 5, we examine the robustness of watermarking methods used to authenticate generative AI content, providing a benchmark that uncovers vulnerabilities under diverse adversarial and distributional transformations.

Part III: AI Safety Challenges due to Distribution Shifts. We explore the safety implications of distribution shifts for large language models (LLMs). In Chapter 6, we investigate the phenomenon of false refusals, where safety-aligned models reject benign prompts at deployment due to the misalignment of learned safety boundaries. In Chapter 7, we further study retrieval-augmented generation (RAG) frameworks, uncovering new safety vulnerabilities arising from shifts between standard LLM training distributions and retrieval-augmented deployment contexts.

The dissertation offers key contributions including theoretical insights, novel methods, and comprehensive benchmarks that collectively enhance our understanding of how to build AI systems capable of maintaining reliability across diverse real-world conditions.

Part I: Generalization and Fairness under Distribution Shifts

Chapter 2: Understanding the Generalization Benefit of Model Invariance from a Data Perspective

Machine learning models that are developed with invariance to certain types of data transformations have demonstrated superior generalization performance in practice. However, the underlying mechanism that explains why invariance leads to better generalization is not well understood, limiting our ability to select appropriate data transformations for a given dataset. This chapter studies the generalization benefit of model invariance by introducing the sample cover induced by transformations, i.e., a representative subset of a dataset that can approximately recover the whole dataset using transformations. Based on this notion, we refine the generalization bound for invariant models and characterize the suitability of a set of data transformations by the sample covering number induced by transformations, i.e., the smallest size of its induced sample covers. We show that the generalization bound can be tightened for suitable transformations that have a small sample covering number. Moreover, our proposed sample covering number can be empirically evaluated, providing a practical guide for selecting transformations to develop model invariance for better generalization. We evaluate the sample covering numbers for commonly used transformations on multiple datasets and demonstrate that a smaller sample covering number for a set of transformations indicates a smaller gap between the test and training error for invariant models, thus validating our propositions.

2.1 Introduction

Invariance is ubiquitous in many real-world problems. For instance, categorical classification of visual objects is invariant to slight viewpoint changes [17–19], and text understanding is invariant to synonymous substitution and minor typos [20–22]. Intuitively, models capturing the underlying invariance exhibit improved generalization in practice [23–26, 27].
Such generalization benefit is especially crucial when the data are scarce, as in some medical tasks [28], or when the task requires more data than usual, as in cases of distribution shift [29] and adversarial attack [30–32].

A commonly accepted intuition attributes the generalization benefit of model invariance to the reduced model complexity, especially the reduced sensitivity to spurious features. However, a principled understanding of why model invariance helps generalization remains elusive, thus leaving many open questions. Since model invariance may come at a cost (e.g., compromised accuracy, increased computational overhead), given a task, how should we choose among various data transformations under which model invariance guarantees better generalization? If existing data transformations are not good enough for a given task, what is the guiding principle to find new ones? The lack of a principled understanding limits better exploitation of model invariance to further improve generalization. In addition, since identifying instructive generalization bounds is a central topic in machine learning, we may expect to tighten existing generalization bounds by additionally considering the data-dependent model invariance property.

The many faces of data transformations and model classes pose significant challenges to a principled understanding of model invariance's generalization benefit. To address this, [18, 33–36] characterize the input space and show that certain data transformations equivalently shrink the input space for invariant models, which then simplifies the input and improves generalization. From another perspective, [37, 38] directly characterize the function space and show that the volume of the invariant model class is reduced, which then simplifies the learning problem and improves generalization. These understandings provide valuable insights, yet they may become less informative on high-dimensional input data or require model invariance to be obtained exclusively via feature averaging. Certain assumptions on data transformations (e.g., finiteness, group structure with certain measures) also make these understandings less applicable to more general data transformations.

In this paper, we derive generalization bounds for invariant models based on the sample cover induced by data transformations and empirically show that the introduced notion can guide the selection of data transformations. Different from previous understandings, we first identify a data-dependent property of data transformations in a model-agnostic way, and then establish its connections with the refined generalization bounds of invariant models. The analysis applies to more general data transformations regardless of how model invariance is obtained and naturally provides model-agnostic guidance for data transformation selection. We summarize our contributions as follows.

At the core of our understanding is the notion of sample cover induced by data transformations, defined informally as a representative subset of a dataset that can approximately recover the whole dataset using data transformations (illustrated in Figure 2.1). We show that this notion identifies a data-dependent property of data transformations which is related to the generalization benefit of the corresponding invariant models. Under a special setting of the sample cover, we first bound the model complexity of any invariant and output-bounded model class in terms of the sample covering numbers. Since this general bound requires a restrictive condition on data transformations in order to be informative, we then assume model Lipschitzness to relax the requirement and refine the model complexity bound for invariant models. Finally, we outline a framework for model-invariance-sensitive generalization bounds based on the invariant models' complexities, and use it to discuss the generalization benefit of model invariance.

Figure 2.1: Illustration of the pseudometric and sample cover induced by data transformations. (a) G = ∅; (b) G = {rotation}.

Given the usefulness of the sample cover in the analysis, we propose an algorithm to empirically estimate the sample cover. This algorithm exactly verifies whether a given subset of a sample forms a valid sample cover, and always estimates a sample covering number that upper-bounds the ground truth. Inspired by our analysis, we also propose to use the sample covering number as a suitability measurement for practical data transformation selection. This measurement is data-driven, widely applicable, and empirically correlates with invariant models' actual generalization performance. We discuss its limitations and empirical mitigation.

To empirically verify our propositions, we first estimate the sample covering number for some commonly used data transformations on four image datasets, including CIFAR-10 and ShapeNet (a 3D dataset). Under typical settings, the 3D-view transformation induces a much smaller sample covering number than others on ShapeNet, while cropping induces the smallest sample covering number on the other datasets. For those data transformations, we then train invariant models via data augmentation and invariance loss regularization to evaluate the actual generalization benefit. Results show a clear correlation between smaller sample covering numbers induced by data transformations and the better generalization benefit enjoyed by invariant models.

2.2 Related Work

Understandings from the input space perspective. One line of work characterizes the input space of invariant models. [33, 39] show that invariant representations equivalently reduce the input dimension for downstream tasks and thus significantly reduce the model complexity (exponential in input dimensions) of downstream linear models. [35, 36] essentially factorize the input space into the product of a base space and a finite set of data transformations. Since the covering number needed to cover the base space is smaller, the associated generalization bound for invariant models is reduced. Compared with these works, our work tries to cover the sample instead of the input space, which circumvents the strong dependence on input dimensions and also enables practical evaluation.

Understandings from the function space perspective. Another line of work directly characterizes the function space of invariant models. [38] uses PAC-Bayes to show the reduction of the generalization upper bound when the model class is symmetrized to be invariant. [37] analyzes the function space under the feature averaging operator and shows the first strict generalization gap (instead of an upper bound) via a linear model. This line of work currently restricts model invariance to be obtained exclusively via feature averaging.
Note that the categorization of different understanding perspectives is only for presentation convenience and has no formal distinctions. Additionally, we mention some work that studies model invariance but does not focus on understanding its benefit. [40] proves that the VC dimension of an invariant model cannot be larger than that of its non-invariant counterpart. [41] characterizes the general functional representations of invariant probability distributions as well as neural network structures that implement them. [42] uses group theory to show the benefit of learning with a data-augmented loss. In the predicting-generalization competition at NeurIPS 2020 [43], the runner-up team [44] shows that model robustness to data transformations can serve as an empirical proxy for predicting models' generalization performance. [45] enforces model invariance to learned data transformations that capture inter-domain variation to improve out-of-distribution generalization. [46] proposes to select data transformations automatically during model training by optimizing parameterized distributions of data transformations. Interestingly, our sample covering numbers may be used to determine their regularization coefficients for better trade-offs.

2.3 Preliminaries

Data transformations. We refer to a data transformation as a function from the input space X to X, and to data transformations as a set of such functions. Unless otherwise specified, we do not assume data transformations to have group structures, since many non-invertible transformations (e.g., cropping) do not fit into a group structure directly. For a set of data transformations G = {g : X → X} and a data point (also referred to as an example) x ∈ X, we overload the notion of orbit from group theory and denote by G(x) the orbit of x, defined as the collection of outputs after applying any transformation g ∈ G to x: G(x) = {g(x) ∈ X : g ∈ G}.

Model invariance. Let D be the underlying data distribution and supp(D) be its support. A model h : X → Y is said to be invariant under data transformations G on D if h(g(x)) = h(x) for any x ∈ supp(D) and any g ∈ G. We refer to a class of invariant models as a G-invariant model class.

Complexity measurements. The covering number and the Rademacher complexity [47] are two commonly used complexity measurements for model classes (including neural networks [48]) that can provide uniform generalization bounds. The covering number can also be directly used to upper bound the Rademacher complexity via Dudley's entropy integral theorem [49, 50].

Covering number. Let (F, d) be a (pseudo)metric space with some (pseudo)metric d. (A pseudometric is a metric if and only if it separates distinct points, namely d(x, y) > 0 for any x ≠ y.) An ϵ-cover of a set H ⊆ F is defined as a subset Ĥ ⊆ H such that for any h ∈ H, there exists ĥ ∈ Ĥ such that d(h, ĥ) ≤ ϵ. The covering number N(ϵ, H, d) is defined as the minimum cardinality of an ϵ-cover (among all ϵ-covers) of H. In this paper, we use the concept of covering number both for measuring model class complexities and for defining the sample covering number on datasets.

Empirical Rademacher complexity. Let H be a class of functions h : X → R. Given a sample S = {x_i}_{i=1}^n, the empirical Rademacher complexity of the model class H is defined as

\mathcal{R}_S(\mathcal{H}) = \mathbb{E}_{\sigma}\Big[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} \sigma_i h(x_i)\Big],

where σ = [σ_1, ..., σ_n]^⊤ is a vector of i.i.d. Rademacher random variables, each uniformly chosen from {−1, 1}.
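For concreteness, the following is a minimal sketch (not part of the original analysis) that estimates the empirical Rademacher complexity of a small, hypothetical finite model class of threshold classifiers by Monte Carlo sampling of the Rademacher vector σ; the sample and the class are illustrative only.

import numpy as np

rng = np.random.default_rng(0)

# A hypothetical sample of n scalar inputs and a small finite model class:
# threshold classifiers h_t(x) = sign(x - t) for a grid of thresholds t.
n = 50
sample = rng.normal(size=n)
thresholds = np.linspace(-2.0, 2.0, 41)
# predictions[j, i] = h_{t_j}(x_i), each in {-1, +1}
predictions = np.sign(sample[None, :] - thresholds[:, None])
predictions[predictions == 0] = 1.0

def empirical_rademacher(preds, num_draws=2000, rng=rng):
    """Monte Carlo estimate of R_S(H) = E_sigma[ sup_h (1/n) sum_i sigma_i h(x_i) ]
    for a finite class given its prediction matrix preds of shape (|H|, n)."""
    n = preds.shape[1]
    total = 0.0
    for _ in range(num_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)   # i.i.d. Rademacher variables
        correlations = preds @ sigma / n          # (1/n) sum_i sigma_i h(x_i) per model
        total += correlations.max()               # supremum over the finite class
    return total / num_draws

print(f"estimated R_S(H) ~ {empirical_rademacher(predictions):.4f}")

For a finite class, the supremum is an exact maximum over its members; for richer model classes, this inner step would require an optimization instead.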
Generalization error and gap. Let S = {x_i}_{i=1}^n be a sample drawn i.i.d. from some data distribution D, and let H be a model class. Given a loss function ℓ : R × Y → [0, 1], for an h ∈ H we define the empirical error as R_S(h) = (1/n) Σ_{i=1}^n ℓ(h(x_i), y_i), the generalization error as R(h) = E_{(x,y)∼D}[ℓ(h(x), y)], and the generalization gap as R(h) − R_S(h).

2.4 Generalization Benefit of Model Invariance

In this section, we derive generalization bounds for invariant models by identifying model invariance properties. We start by introducing the notion of the sample cover induced by data transformations and, based on it, bound the Rademacher complexity of any invariant model class with bounded output (Section 2.4.1). Then, we assume model Lipschitzness to provide a more informative model complexity bound for any data transformations (Section 2.4.2). Finally, we provide a framework for model-invariance-sensitive generalization bounds and discuss the generalization benefit of model invariance (Section 2.4.3).

2.4.1 Sample Cover Induced by Data Transformations

Existing empirical results suggest that, compared with standard models, invariant models may have certain properties reducing their effective model complexities. To identify such properties, we alternatively identify the related properties of the corresponding data transformations via the notion of the sample cover induced by data transformations. We now formalize the introduced notion.

The definition of the sample cover relies on the pseudometric induced by the data transformations G. Note that G generates an orbit G(x) ⊆ X for each example x ∈ S. Let ∥ · ∥ be any norm on the input space X. Given a set of transformations G, we define the G-induced pseudometric as

\rho_{\mathcal{G}}(x_1, x_2) = \inf_{\gamma \in \Gamma(x_1, x_2)} \int_{\gamma} c(r)\, ds, \qquad c(r) = \begin{cases} 0, & \text{if } r \in \bigcup_{x \in S}\mathcal{G}(x), \\ 1, & \text{otherwise,} \end{cases} \qquad (2.4.1)

where ds = ∥dr∥ and Γ(x_1, x_2) denotes the set of all paths (curves) in X from x_1 to x_2. Note that ρ_G is a pseudometric rather than a metric, since it allows ρ_G(x, y) = 0 for x ≠ y. The quantity ρ_G essentially calculates the line integral along the shortest (if achievable) path γ in the scalar field c, where c can be viewed as a "moving cost" function depending on G. The norm ∥ · ∥ here can be any meaningful norm on the input space (e.g., the Euclidean norm, as in our experiments) and will later be used in defining the model's Lipschitz constant. It can be checked that ρ_G satisfies the pseudometric axioms.

Definition 2.4.1 (Sample cover induced by data transformations). Let (X, ρ_G) be a pseudometric space and S = {x_i}_{i=1}^n be a sample of size n. An ϵ-sample cover Ŝ_{G,ϵ} of the sample S induced by data transformations G at resolution ϵ is defined as a subset of the sample S such that for any x ∈ S, there exists x̂ ∈ Ŝ_{G,ϵ} satisfying ρ_G(x, x̂) ≤ ϵ.

Definition 2.4.2 (Sample covering number induced by data transformations). The sample covering number N(ϵ, S, ρ_G) induced by data transformations G is defined as the minimum cardinality of an ϵ-sample cover:

N(\epsilon, S, \rho_{\mathcal{G}}) = \min\{|\hat{S}_{\mathcal{G},\epsilon}| : \hat{S}_{\mathcal{G},\epsilon} \text{ is an } \epsilon\text{-sample cover of } S\}. \qquad (2.4.2)

Informally, the G-induced sample cover specifies a representative subset of examples that can approximately recover all the original examples using the given data transformations G. This notion is closely related to sample compression [51], which represents a scheme to prove the learnability of concepts through a compressed set of samples. While identifying the generalization-related properties of data transformations, this notion is insensitive to other unrelated properties (e.g., finiteness, group structures) and thus applies to any data transformations.

The intuition behind the sample cover is that G-invariant models may have consistent behaviors on an example and its associated approximation in the G-induced sample cover. As such, we can analyze the model complexities of invariant models by considering the models' behavior only on the potentially small-sized sample covers. Indeed, we directly have the following model complexity result. The proof is in Section 2.9.2.

Proposition 2.4.3. Let S = {x_i}_{i=1}^n be a sample of size n. Let H be a model class mapping from X to [−B, B] for some B > 0 that is invariant to data transformations G. Then the empirical Rademacher complexity of H satisfies

\mathcal{R}_S(\mathcal{H}) \le 24B\sqrt{\frac{N(0, S, \rho_{\mathcal{G}})}{n}}. \qquad (2.4.3)

Proposition 2.4.3 generally bounds the model complexity of any output-bounded and G-invariant model class in terms of the sample covering number N(0, S, ρ_G) induced by G. A small G-induced sample covering number at resolution ϵ = 0 thus tightens the model complexity bound for a general class of G-invariant models.

Note, however, that Proposition 2.4.3 is informative only when the data transformations G yield N(0, S, ρ_G) ≪ n on the sample S, a condition requiring G to be able to exactly recover S from a small-sized subset of S. This condition is unfortunately too strict to hold for many commonly used data transformations, which only generate orbits with measure zero (with respect to the data measure) at most data points. For example, the rotation transformations on CIFAR-10 do not satisfy this condition, since no two images in CIFAR-10 are rotated versions of each other. To better understand the generalization benefit brought by any data transformations (e.g., rotation), we further assume specific model properties which equivalently expand the orbits in order to obtain more general results. We study Lipschitz models in Section 2.4.2, and relegate a sharper (and relatively independent) analysis for linear models under linear data transformations to Section 2.9.3.

2.4.2 Refined Complexity Analysis of Lipschitz Models

This subsection refines the model complexity analysis for Lipschitz models that are invariant. Characterizing the Lipschitz constant of models has been the focus of a line of work; for example, the Lipschitz constant of ReLU networks can be upper-bounded by the product of the spectral norms of the weight matrices, considering the worst-case inputs [48, 52]. Assuming Lipschitzness, the following theorem refines the covering number analysis for invariant models. The proof is in Section 2.9.2.

Theorem 2.4.4. Let S = {x_i}_{i=1}^n be a sample of size n. Let H be a model class such that every h ∈ H is κ-Lipschitz with respect to ∥ · ∥ (the norm used in defining the sample cover) and is invariant to G. Then the covering number of H satisfies

N\big(\tau, \mathcal{H}, L_2(P_S)\big) \le \inf_{\epsilon \ge 0,\ \hat{S}_{\mathcal{G},\epsilon}} N\Big(\tau - \kappa\epsilon\sqrt{1 - \tfrac{|\hat{S}_{\mathcal{G},\epsilon}|}{n}},\ \mathcal{H},\ L_2(P_{\hat{S}_{\mathcal{G},\epsilon}})\Big), \qquad (2.4.4)

where, for all h, g ∈ H, the L_2(P_S) metric is defined as

\|h - g\|_{L_2(P_S)} = \Big(\sum_{x \in S} \tfrac{1}{n}\big(h(x) - g(x)\big)^2\Big)^{1/2},

and the L_2(P_{\hat{S}_{\mathcal{G},\epsilon}}) metric is defined as

\|h - g\|_{L_2(P_{\hat{S}_{\mathcal{G},\epsilon}})} = \Big(\sum_{x \in \hat{S}_{\mathcal{G},\epsilon}} \tfrac{p(x)}{n}\big(h(x) - g(x)\big)^2\Big)^{1/2}.

Here, p(x)/n can be viewed as the probability mass at x, where the numerator p(x) indicates the number of examples that x covers; see Section 2.9.2.1 for the formal definition of p(x).
Theorem 2.4.4 upper-bounds the covering number of H evaluated at the sample S by a new covering number evaluated at any sample cover Ŝ_{G,ϵ}, under a modified metric and at the cost of an additional error term depending on ϵ and κ. Equality trivially holds by taking Ŝ_{G,ϵ} = S, while by searching over all sample covers with different resolutions ϵ it is possible to tighten the covering number bound for invariant models. Additionally, Theorem 2.4.4 leads to a refined version of Dudley's entropy integral theorem (see Lemma 2.9.3) that bounds the Rademacher complexity of invariant models.

Theorem 2.4.4 suggests that we may improve existing covering-number-based model complexity analyses by weakening the dependence on input dimensions. Note that covering numbers that do not yield N(τ, H, L_2(P_S))/n → 0 as n → ∞ are vacuous. Therefore, existing covering number results typically avoid linear dependence on n at the cost of (explicitly or implicitly) increased dependence on the input dimension [53]. With the refined result in Theorem 2.4.4, however, a covering number linear in n can now be replaced by one that is linear in a potentially much smaller sample covering number N(ϵ, S, ρ_G) and consequently becomes informative, thus circumventing the increased dependence on input dimensions. An interesting direction for future work is to instantiate the result in Equation (2.4.4) for specific model classes to obtain more interpretable results.

2.4.3 Framework for Model-invariance-sensitive Generalization Bounds

This subsection presents the framework for generalization bounds sensitive to model invariance. While the results are straightforward applications of the derived complexities of invariant models, our goal is to justify the selection of suitable data transformations to maximize the generalization benefit. We start with the generalization analysis of invariant models and then present the framework.

Generalization benefit for invariant models. The generalization bounds of invariant models follow immediately by applying the Rademacher model complexities (Proposition 2.4.3, Proposition 2.9.3, and Theorem 2.9.4) to the standard generalization bound (Theorem 2.9.2). Compared with standard models, invariant models' tightened model complexity bounds already imply their reduced generalization gaps, whereas for a reduced generalization error they further need to have low empirical error. Since enforcing model invariance may simultaneously increase the empirical error, we can use standard model selection techniques (e.g., structural risk minimization [47]) to select suitable data transformations and control the trade-off.

Model-invariance-sensitive generalization bound. We outline the generalization bound that identifies model invariance properties based on the derived complexities of invariant models. It follows from a post-hoc analysis which specifies a proper set of invariant models using an "invariant loss": a loss that, when composed with any model, makes the composition invariant. For data transformations with group structures, we can construct such a loss by averaging (assuming a Haar measure) or adversarially perturbing any given loss over the orbits of input examples [37, 38].
Specifically, the adversarial loss with respect to data transformations G is defined as

\tilde{\ell}_{\mathcal{G}}(h(x), y) = \max_{x' \in \mathcal{G}(x)} \ell(h(x'), y),

where ℓ is any given loss. Using the adversarial loss, the following proposition provides a model-invariance-dependent generalization bound by applying the model selection framework [47]. Section 2.9.2.3 further describes a binary coding construction of combinations of data transformation classes.

Proposition 2.4.5. Let S = {x_i}_{i=1}^n be a sample of size n. Let H be any given model class and ℓ be any given loss. Suppose we have K sets of group-structured data transformations {G_1, G_2, ..., G_K}. Then with probability at least 1 − δ, the following generalization bound holds for any h ∈ H and any k ∈ [K]:

R(h) \le \frac{1}{n}\sum_{i=1}^{n} \tilde{\ell}_{\mathcal{G}_k}(h(x_i), y_i) + 4\mathcal{R}_S(\tilde{\ell}_{\mathcal{G}_k} \circ \mathcal{H}) + \sqrt{\frac{\log k}{n}} + 3\sqrt{\frac{\log\frac{4}{\delta}}{2n}}, \qquad (2.4.5)

where R_S(ℓ̃_{G_k} ∘ H) is upper-bounded by the complexity of G_k-invariant models.

For any model trained on S, Proposition 2.4.5 shows that we can optimize over all selections of data transformations to improve its generalization bound. Note that the selection of G_k is subject to a potential trade-off between the reduced model complexity R_S(ℓ̃_{G_k} ∘ H) and the increased empirical adversarial loss (1/n) Σ_{i=1}^n ℓ̃_{G_k}(h(x_i), y_i). Thus, if a suitable G_k reduces the model complexity while keeping the empirical error low, then the trained model will benefit from a tightened generalization bound. This generalization bound does not require the models to be (strictly) invariant and potentially explains the improved generalization of models with trained invariance (e.g., via data augmentation [54, 55] or consistency regularization [56, 57]). The difficulty in instantiating Proposition 2.4.5 is that the model complexity with the adversarial loss may be hard to compute for general data transformations. Therefore, we discuss more practical data transformation selections based on the sample covering numbers in Section 2.6.
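In practice, the inner maximization in ℓ̃_G above can be approximated by sampling a finite number of transformations from G and taking the per-example worst case. The following PyTorch-style sketch is illustrative only; the model and the transformation list are placeholders rather than the constructions used in our analysis.

import torch
import torch.nn.functional as F

def adversarial_orbit_loss(model, x, y, transforms, num_samples=8):
    """Approximate the adversarial loss max_{x' in G(x)} loss(h(x'), y)
    by sampling transformations from G and taking the per-example maximum."""
    losses = []
    for _ in range(num_samples):
        g = transforms[torch.randint(len(transforms), (1,)).item()]
        logits = model(g(x))
        losses.append(F.cross_entropy(logits, y, reduction="none"))
    return torch.stack(losses, dim=0).max(dim=0).values.mean()

# Example usage with placeholder transformations (identity and horizontal flip).
transforms = [lambda t: t, lambda t: torch.flip(t, dims=[-1])]
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x, y = torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,))
loss = adversarial_orbit_loss(model, x, y, transforms)
loss.backward()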
2.5 Sample Cover Estimation Algorithm

The sample cover induced by data transformations plays a central role in our understanding of model invariance. Despite its usefulness, exactly computing the sample cover turns out to be non-trivial in general. Indeed, computing the transformation-induced metrics can be difficult for continuous data transformations, and finding the smallest sample cover is NP-hard. To address this problem, we propose an algorithm to estimate the sample covering number and find the associated sample cover. We outline the algorithm and discuss the algorithmic challenges in this section. The algorithmic details appear in Section 2.9.4.

Setup. The estimation algorithm takes as input a sample S, a set of data transformations G, and the resolution parameter ϵ. It returns the estimated sample covering number N(ϵ, S, ρ_G) and the associated sample cover Ŝ_{G,ϵ}. The estimation algorithm has the following steps; a simplified end-to-end sketch is given after the step descriptions.

Step 1. Compute (or approximate) the direct orbit distance between any two examples in S. The direct orbit distance between any two examples x_i, x_j ∈ S is

d_{\mathcal{G}}(x_i, x_j) = \|\mathcal{G}(x_i) - \mathcal{G}(x_j)\| = \min_{g_1, g_2 \in \mathcal{G}} \|g_1(x_i) - g_2(x_j)\|,

which can be exactly computed for finite transformations (e.g., flipping) with complexity O(|G|^2), or approximated for continuous transformations (e.g., rotation) via optimization or sampling.

Step 2. Compute the ρ_G distance between any two examples in S. Given the results of Step 1, computing the ρ_G distance between any two examples can be formulated as a shortest path problem on a complete graph, where each node represents an example and the cost of each edge is the direct orbit distance computed in Step 1 (see the formulations in Section 2.9.4). Note that the shortest path is always included in our finite set of candidates even though the ρ_G distance considers infinitely many paths, because any path outside our finite candidates will be longer than its counterparts (depending on which orbits it intersects) among the finite candidates. Standard shortest path algorithms solve for all pairs of examples in polynomial time (e.g., via Dijkstra's algorithm [58] in O(n^3)).

Step 3. Construct the pairwise distance matrix [ρ_G(x_i, x_j)]_{i,j} and approximate the sample covering number. This step can be formulated as a set cover problem where each example x covers a subset of S in which each element's ρ_G distance to x is less than or equal to ϵ. Our goal is to find a minimum number of such subsets whose union contains S. This problem is known to be NP-hard in general but admits polynomial-time approximations [59]. In our experiments, we use a modified k-medoids [60] clustering method to find an approximation of N(ϵ, S, ρ_G) (see Algorithm 1).

Note that the estimated sample covering number returned by the algorithm is always an upper bound of the ground truth, regardless of the approximation error in Steps 1 and 3. When Step 1 is exact, the algorithm also exactly verifies whether a given subset of S forms a valid sample cover. In our experiments, Step 2 becomes the computational bottleneck for large samples. We leave improving the scalability as well as evaluating the approximation quality for future work.
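Putting the three steps together, the following is a simplified, self-contained sketch of the estimation pipeline. It is not the exact Algorithm 1 of Section 2.9.4: orbits are approximated by a finite list of (possibly sampled) transformations, ρ_G is computed with Floyd-Warshall, and a greedy set-cover heuristic stands in for the modified k-medoids step. Since sampled orbit distances can only overestimate the true distances and the returned set is a valid sample cover, the returned size is still an upper bound on N(ϵ, S, ρ_G).

import numpy as np

def estimate_sample_covering_number(sample, transforms, eps):
    """Upper-bound N(eps, S, rho_G) following Steps 1-3.
    sample: list of arrays; transforms: finite list of transformation callables
    (for continuous families, pass a list of sampled transformations); eps: resolution."""
    n = len(sample)

    # Step 1: approximate direct orbit distances d_G(x_i, x_j) over the listed transformations.
    orbits = [[t(x) for t in transforms] for x in sample]
    d = np.array([[min(np.linalg.norm(a - b) for a in orbits[i] for b in orbits[j])
                   for j in range(n)] for i in range(n)])

    # Step 2: rho_G = all-pairs shortest paths on the complete graph with edge costs d.
    rho = d.copy()
    for k in range(n):
        rho = np.minimum(rho, rho[:, k:k + 1] + rho[k:k + 1, :])

    # Step 3: greedy set cover -- repeatedly pick the example covering the most
    # uncovered examples within distance eps.
    uncovered, cover = set(range(n)), []
    while uncovered:
        best = max(range(n), key=lambda j: sum(rho[j, i] <= eps for i in uncovered))
        cover.append(best)
        uncovered -= {i for i in uncovered if rho[best, i] <= eps}
    return len(cover), cover

# Example usage on a toy sample that is closed under horizontal flips.
rng = np.random.default_rng(0)
base = [rng.normal(size=(8, 8)) for _ in range(10)]
sample = base + [x[:, ::-1] for x in base]
flips = [lambda x: x, lambda x: x[:, ::-1]]
size, representatives = estimate_sample_covering_number(sample, flips, eps=0.0)
print(size)  # 10: one representative per flip orbit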
2.6 Data-driven Selection of Data Transformations

The pool of candidate data transformations on a given dataset may be infinitely large. To maximize the generalization benefit of model invariance, we usually make selections based on expensive cross-validation due to the absence of model-training-free guidance. Section 2.4 suggests that invariant models may benefit from improved generalization guarantees if the corresponding data transformations induce small sample covering numbers. Therefore, we propose to use the sample covering number as an empirical suitability measurement to guide data transformation selection. We discuss its advantages, limitations, and empirical mitigation in this section.

Suitability measurement. To maximize the generalization benefit of model invariance on a dataset S, we measure the suitability of data transformations G by the sample covering number induced by G and favor small ones.

Advantages. One advantage of this suitability measurement is that it is model-training-free. It provides a priori guidance depending only on the dataset and the data transformations, thus avoiding expensive cross-validation and fueling the exploration of new types of data transformations. Another advantage is that it applies to any type of data transformation (including continuous and non-invertible ones) and provides a uniform benchmark.

Limitations and empirical mitigation. Being model-agnostic also poses two limitations to the suitability measurement. One limitation is that this suitability measurement, while capturing invariant models' reduced generalization gap, ignores their potentially increased empirical error. Note that certain data transformations on a dataset may drastically increase invariant models' empirical error and overturn the benefit of a reduced generalization gap. To mitigate this limitation, we consider two necessary conditions for maintaining low empirical error. First, the data transformations should preserve the underlying ground-truth labeling. We may use domain knowledge to meet this condition. Second, the model class should be rich enough to contain a low-error invariant hypothesis. In our experiments, neural networks that are invariant and achieve low training error satisfy this condition.

Another limitation is that this suitability measurement ignores the models' potential Lipschitz constant change after enforcing the invariance. Theorem 2.4.4 suggests that the generalization benefit enjoyed by invariant models depends on the models' Lipschitz constant and can be overturned if enforcing invariance leads to a significantly larger Lipschitz constant. To mitigate this limitation, we use the fact that we are doing classification tasks and use the label information to heuristically offset the Lipschitz constant increase. We use the minimum inter-class distance change after applying data transformations to capture the Lipschitz constant change and use it to normalize the sample covering number for better data transformation selection (see Section 2.9.6.2).

2.7 Experiments

In this section, we implement the sample cover estimation algorithm and verify the effectiveness of using sample covering numbers to guide data transformation selection. We first estimate the sample covering number induced by different types of data transformations on some image datasets. Then, we investigate the actual generalization benefit for models invariant to those data transformations and analyze the correlation. (Code is available at https://github.com/bangann/understanding-invariance.)

Datasets. We report experimental results on CIFAR-10 [61] and ShapeNet [62] in this section, and relegate results on CIFAR-100 and Restricted ImageNet to Section 2.9.6.1. ShapeNet is a large-scale 3D data repository that enables us to apply more complex data transformations (e.g., change of 3D view) beyond the common 2D geometric transformations. The work [63] provides 24 multi-view pre-rendered images for each 3D object in 10 chosen categories. For convenience, we use those images to approximate the random perturbation of the 3D view.

Figure 2.2: Estimated sample covering numbers induced by different data transformations at different resolutions ϵ on (a) CIFAR-10 and (b) ShapeNet. "base" indicates no transformation. Note that as ϵ increases, it starts to exceed the L2 distance between some images, and thus some images get covered by others without any transformation. Three vertical dashed lines indicate the maximum resolution ϵ at which the "base" yields a certain sample covering number; from left to right, they are 100%n, 99%n, and 95%n.

Data transformations. We evaluate some commonly used data transformations with typical parameter settings, which we assume to be label-preserving. We choose flipping, cropping, and rotation on CIFAR-10, and additionally consider the 3D-view change on ShapeNet. We use the same data transformations with the same parameter settings when estimating the sample covering number and when evaluating the generalization benefit. Section 2.9.5 provides more details of our experimental settings.
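For reference, the three 2D transformations above can be instantiated with standard torchvision samplers. The parameter values shown here are typical choices and only a hypothetical stand-in for the exact settings listed in Section 2.9.5.

from torchvision import transforms

# Hypothetical parameter choices for the three 2D transformations on CIFAR-10;
# the chapter's exact settings are given in Section 2.9.5 and may differ.
flip   = transforms.RandomHorizontalFlip(p=0.5)
crop   = transforms.RandomCrop(32, padding=4)
rotate = transforms.RandomRotation(degrees=15)

# Each candidate set of transformations G is evaluated separately, using the same
# sampler both for estimating N(eps, S, rho_G) and for training the invariant model.
candidate_transformations = {"flip": flip, "crop": crop, "rotate": rotate}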
2.7.1 Estimation of Sample Covering Numbers

We implement the algorithm in Section 2.5 to estimate the sample covering number induced by different data transformations. For efficiency, we randomly sample 1000 training images from CIFAR-10 and 800 training images from ShapeNet. Section 2.9.5 compares results with different sample sizes. We use the Euclidean norm for defining the sample cover. For continuous data transformations, we use uniform random sampling to approximate the orbit of a data point.

Figure 2.2 illustrates the estimated sample covering numbers induced by different transformations at different resolutions ϵ. As the resolution ϵ increases, the sample covering number N(ϵ, S, ρ_G) induced by any data transformation starts to decrease, indicating that a smaller-sized sample cover suffices to cover the entire dataset. Meanwhile, different transformations behave differently. On CIFAR-10, cropping induces the smallest sample covering number. On ShapeNet, the 3D-view transformation induces the smallest sample covering number, and the gap is significant. Our propositions suggest that data transformations that induce smaller sample covering numbers tend to bring more generalization benefit for the corresponding invariant models. Therefore, Figure 2.2 indicates that models should generalize well if they are invariant to the 3D-view transformation on ShapeNet or to cropping on CIFAR-10.

2.7.2 Evaluation of Generalization Benefit

We now evaluate the actual generalization performance of invariant models to verify whether the sample covering number is a good suitability measurement. We use ResNet18 [64] on both datasets and discuss the influence of the model class's implicit bias in Section 2.9.5. A simple method to learn invariant models is to perform data augmentation. The augmented loss function is L_aug(x) = L(f(g(x))), where f(·) denotes the model and g(x) denotes a randomly sampled example in x's orbit induced by transformation G. We use this method on CIFAR-10 and ShapeNet and show results in Tables 2.1 and 2.2; a minimal sketch of this training objective is given after the tables.

           n = 100                      n = 1000                     n = all
Model      acc (%)        gap           acc (%)        gap           acc (%)        gap
Base       41.05 ± 0.52   58.95 ± 0.52  68.62 ± 0.90   31.38 ± 0.90  85.43 ± 0.35   14.57 ± 0.35
Flip       44.19 ± 0.74   55.81 ± 0.74  75.12 ± 0.20   24.88 ± 0.20  89.67 ± 0.24   10.33 ± 0.24
Rotate     47.02 ± 0.46   52.93 ± 0.51  76.07 ± 0.28   23.92 ± 0.27  89.91 ± 0.13   10.05 ± 0.16
Crop       50.47 ± 0.48   49.53 ± 0.48  81.84 ± 0.12   18.15 ± 0.11  92.52 ± 0.08   7.48 ± 0.08

Table 2.1: Classification accuracy and generalization gap (the difference between training and test accuracy) for ResNet18 on CIFAR-10. n denotes the sample size per class.

           n = 100                      n = 1000                     n = all
Model      acc (%)        gap           acc (%)        gap           acc (%)        gap
Base       67.75 ± 2.02   32.25 ± 2.02  83.33 ± 0.38   16.67 ± 0.38  91.81 ± 0.22   8.18 ± 0.22
Flip       69.75 ± 1.55   30.25 ± 1.55  84.24 ± 0.30   15.76 ± 0.30  92.07 ± 0.20   7.92 ± 0.20
Rotate     70.25 ± 1.19   29.50 ± 1.15  83.93 ± 0.38   15.94 ± 0.35  91.85 ± 0.20   8.03 ± 0.26
Crop       74.88 ± 1.03   23.53 ± 1.30  86.13 ± 0.39   13.75 ± 0.32  92.64 ± 0.12   7.17 ± 0.19
3D-View    78.13 ± 1.31   14.94 ± 1.76  88.79 ± 0.34   8.38 ± 0.79   94.38 ± 0.08   3.09 ± 0.10

Table 2.2: Classification accuracy and generalization gap (the difference between training and test accuracy) for ResNet18 on ShapeNet. n denotes the sample size per class.
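The following is a minimal PyTorch-style sketch of the augmented objective L_aug(x) = L(f(g(x))) described above, with a placeholder transformation sampler and placeholder optimizer settings; it is illustrative rather than our exact training configuration.

import torch
import torch.nn.functional as F
from torchvision import models, transforms

model = models.resnet18(num_classes=10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
augment = transforms.RandomCrop(32, padding=4)   # g sampled from G; placeholder choice

def augmented_training_step(x, y):
    """One step minimizing L_aug(x) = L(f(g(x))) with a freshly sampled g per batch."""
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(augment(x)), y)
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with a random batch shaped like CIFAR-10 images.
x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
print(augmented_training_step(x, y))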
Sample covering number correlates well with generalization benefit. We use the generalization gap (the gap between training accuracy and test accuracy) to measure the actual generalization benefit. Compared with the baseline, invariant models show a reduced generalization gap and also improved test accuracy. On CIFAR-10, the cropping-invariant model shows the smallest generalization gap and the highest accuracy. On ShapeNet, the model that is invariant to 3D-view changes shows the smallest generalization gap and the highest accuracy, especially when the training data size is small. By comparing the results in Figure 2.2 and Tables 2.1-2.2, we observe a clear correlation between a smaller sample covering number and a larger generalization benefit. This verifies our proposition: invariance to more suitable data transformations gives the model more generalization benefit.

Model invariance indeed improves after learning. To verify that the improved generalization is indeed brought by model invariance, we further enforce the invariance using an invariance regularization loss similar to [65, 66]: L = L_cls(f(x)) + λ · KL(f(x), f(g(x))). Specifically, in addition to minimizing the classification loss on original images, we penalize the model by minimizing the KL divergence between model outputs on original images and on transformed ones. At test time, we use L_inv(x) = E_{g_1, g_2 ∈ G}[KL(f(g_1(x)), f(g_2(x)))] to evaluate model invariance under transformations G.

λ       train acc (%)   test acc (%)    gap            L_inv             A_inv (%)
0       99.99 ± 0.01    91.81 ± 0.22    8.19 ± 0.22    0.0548 ± 0.0028   62.0 ± 0.6
0.01    99.98 ± 0.00    92.77 ± 0.16    7.21 ± 0.16    0.0290 ± 0.0029   74.78 ± 1.61
0.1     99.99 ± 0.00    93.87 ± 0.19    6.11 ± 0.19    0.0152 ± 0.0003   83.12 ± 0.50
0.3     99.98 ± 0.00    94.23 ± 0.11    5.76 ± 0.11    0.0121 ± 0.0003   85.10 ± 0.20
1       99.58 ± 0.04    94.68 ± 0.09    4.90 ± 0.09    0.0095 ± 0.0001   86.94 ± 0.08
3       97.74 ± 0.19    94.48 ± 0.19    3.26 ± 0.09    0.0060 ± 0.0003   88.15 ± 0.18
10      95.67 ± 0.26    93.56 ± 0.29    2.11 ± 0.04    0.0037 ± 0.0002   89.20 ± 0.16
100     92.89 ± 0.25    91.85 ± 0.26    1.03 ± 0.03    0.0018 ± 0.0001   89.82 ± 0.10

Table 2.3: Evaluation of ResNet18 on ShapeNet under 3D-view transformations. L_inv denotes the test invariance loss. A_inv denotes the test consistency accuracy (indicating whether the model's prediction is unchanged after data transformation) under the worst-case data transformations.

Table 2.3 shows that, as we increase the invariance penalty by increasing λ, invariant models enjoy a smaller generalization gap. Moreover, the decreased invariance loss and increased consistency accuracy show that model invariance indeed improves after training, demonstrating the generalization benefit brought by model invariance.
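For concreteness, the following is a minimal PyTorch-style sketch of the invariance-regularized objective L = L_cls(f(x)) + λ · KL(f(x), f(g(x))) described above; the KL direction, the transformation sampler, and the toy model are illustrative assumptions rather than our exact implementation.

import torch
import torch.nn.functional as F

def consistency_regularized_loss(model, x, y, transform, lam):
    """Classification loss on original images plus a lambda-weighted KL divergence
    between predictions on original and transformed images."""
    logits = model(x)
    logits_aug = model(transform(x))
    cls_loss = F.cross_entropy(logits, y)
    # KL(f(x) || f(g(x))), computed from log-probabilities for numerical stability.
    kl = F.kl_div(F.log_softmax(logits_aug, dim=-1),
                  F.log_softmax(logits, dim=-1),
                  log_target=True, reduction="batchmean")
    return cls_loss + lam * kl

# Example usage with a placeholder model and horizontal flips as the transformation.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x, y = torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,))
loss = consistency_regularized_loss(model, x, y, lambda t: torch.flip(t, dims=[-1]), lam=1.0)
loss.backward()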
2.8 Conclusion

This paper investigates the generalization advantage of model invariance by establishing model complexity bounds using the sample cover generated by data transformations. Additionally, we introduce an algorithm to estimate the sample cover and demonstrate through empirical analysis that the sample covering number can aid in selecting suitable data transformations. Our hope is that this research will encourage the exploration of more appropriate data transformations for particular datasets. One potential avenue for future research is to examine the implicit biases of model classes to improve our understanding of the generalization benefits of model invariance.

2.9 Supplemental Materials

2.9.1 Complexity Measurements and Generalization Bounds

In this section, we provide additional details on complexity measurements and generalization bounds. The following lemma bounds the empirical Rademacher complexity of a function class H via the covering number of H evaluated at the sample S.

Lemma 2.9.1 (Dudley's Entropy Integral Theorem [47, 49]). Let H be a function class from X to R. Then, for any α > 0,

\mathcal{R}_S(\mathcal{H}) \le 4\alpha + 12\int_{\alpha}^{\infty} \sqrt{\frac{\log N\big(\tau, \mathcal{H}, L_2(P_S)\big)}{n}}\, d\tau.

The following theorem provides a uniform generalization bound for a function class via the empirical Rademacher complexity.

Theorem 2.9.2 ([47, 67]). Let H be a function class from X to [0, B]. For any δ > 0, with probability at least 1 − δ over the draw of a sample S of size n according to the data distribution D, the following holds for any h ∈ H:

R(h) \le R_S(h) + 2B\,\mathcal{R}_S(\mathcal{H}) + 3B\sqrt{\frac{\log\frac{2}{\delta}}{2n}}. \qquad (2.9.1)

We can plug the refined Rademacher complexity bounds in Proposition 2.9.3 and Theorem 2.9.4 into (2.9.1) to obtain refined generalization bounds for certain invariant models.

2.9.2 Proofs

We first prove Theorem 2.4.4, and then Proposition 2.4.3.

2.9.2.1 Proof of Theorem 2.4.4

Proof of Theorem 2.4.4. The general idea of this proof is to show that any cover of a model class evaluated at a sample cover also generates a same-sized cover of the model class evaluated at the original sample, with some additional approximation error. The covering number inequality in (2.4.4) then follows by taking the minimization over all covers of the model class evaluated at the original sample. Since this proof involves some tedious notation, we first restate the problem setup and then go into the details.

Problem setup. Let S = {x_1, x_2, . . . , x_n} be a sample of size n. Let Ŝ ⊆ S be an ϵ-cover of S with respect to ρ_G and have size m. Without loss of generality, we then vectorize S and Ŝ for notational simplicity. Denote by