ABSTRACT

Title of dissertation: ROBUSTNESS, DETECTABILITY, AND DATA PRIVACY IN AI
Vinu Sankar Sadasivan, Doctor of Philosophy, 2025
Dissertation directed by: Professor Soheil Feizi, University of Maryland

Recent rapid advancements in Artificial Intelligence (AI) have made it widely popular in applications ranging from autonomous systems to multimodal content generation. However, these models are vulnerable to several security and safety problems. These pitfalls can allow attackers to jailbreak AI systems, enabling harmful tasks and the leakage of sensitive information. As AI models are increasingly deployed in sensitive applications, such as autonomous robots and healthcare devices, the importance of AI safety is growing. To address these issues in today’s AI systems, it is critical to study their vulnerabilities. In this thesis, we explore fundamental aspects of AI safety and security, including robustness, detectability, and data privacy. We analyze challenges posed by adversaries to contribute toward enhancing AI safety and security. First, we examine the robustness of discriminative models, identifying blind spots in computer vision models. We discover the existence of high-confidence star-like paths in the input space, along which models consistently predict the same output label despite the addition of significant noise. We also develop provable robustness certificates for streaming classification models. Next, we analyze the challenges of detecting AI-generated content in the text and image modalities. Through empirical and theoretical studies, we find that detecting AI content is a difficult task, especially as generative models improve over time. We demonstrate attacks on a wide range of text and image detectors, showing that adversaries can efficiently break these systems to induce both type-I and type-II errors.
We then investigate the robustness of generative Large Language Models (LLMs), focusing on adversarial attacks and hallucinations. Our fast, beam search-based adversarial attacks can compromise LLMs in under a GPU minute. We also demonstrate that multi-modal LLMs have a larger attack surface and are more vulnerable to adversarial attacks in the real world. These attacks can jailbreak LLMs, induce hallucinations, and carry out privacy attacks. To mitigate these risks, we propose a suite of methods for detecting hallucinations in both white-box and black-box settings by analyzing the internal dynamics and output probabilities of LLMs. Lastly, we address user privacy concerns by proposing a method to create unlearnable datasets through data poisoning. This method, which is model-agnostic and computationally efficient, ensures that datasets used without user consent become unlearnable by AI systems. In summary, this thesis analyzes various attack frameworks to understand the robustness and security vulnerabilities of AI models. We hope the insights provided by this work will be useful in designing safer, more secure AI systems.

ROBUSTNESS, DETECTABILITY, AND DATA PRIVACY IN AI

by Vinu Sankar Sadasivan

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2025

Advisory Committee:
Professor Soheil Feizi, Chair/Advisor
Professor Hernisa Kacorri, Dean’s Representative
Professor Behtash Babadi
Professor David Jacobs
Professor Dinesh Manocha
Professor Abhinav Shrivastava

© Copyright by Vinu Sankar Sadasivan 2025

Dedication

To my family — mother, father, brother, Timmy, Monash, and grandparents — and to God!

Acknowledgments

I started my PhD in Fall 2021. Moving to a new country and leaving my home was challenging.
I am grateful to the people and circumstances that contributed to, supported, and inspired me throughout my journey in graduate school. I believe my efforts would be meaningless without acknowledging them. I would like to start by thanking my advisor, Professor Soheil Feizi. Half of a PhD student’s struggles become invisible with a great advisor, and I could not have asked for a more understanding and supportive one. His advice and mentorship always guided me along the right path during my PhD. Professor Feizi gave me all the freedom to do great research and supported me whenever I was in a dilemma, shaping the PhD thesis that I have today. The most important part of my whole life has been my beloved family. I am the luckiest to have the most supportive parents – Rekha, my mother, and Sadasivan, my father – who have always stood by my side. Their sacrifice and hard work have been the major driving force behind my brother’s and my successes in our lives. I will be forever indebted to my parents for their contributions. I believe teachers play the most important role in forming the intellectual curiosity of a child. For me, the most important teacher, or Guru, has been my brother, Hari. In my childhood, he was the one who motivated and inspired me the most to follow the path that took me where I am today. I am extremely thankful to have him as my mentor and brother. I would also like to remember my late pet Timmy, who was a sibling to me. I am sure she would have been proud of me and greatly delighted as I complete my PhD requirements. I would also like to dedicate my PhD efforts to my grandparents, who were a significant part of and influence on my life. Acknowledging my family without thanking my girlfriend, Monash, would be incomplete. She has been an integral part of my life, especially supporting me through some of the most stressful and low points in the final phase of my PhD. I am deeply grateful to have her in my life.
I would also like to thank her parents, who wish the best for me. My journey at UMD would have been far more difficult without my friends in College Park. A special thanks to my roommate and friend, Souradip, who made my life in College Park far more interesting and fun-filled. Thanks also to my other friends at UMD – Kishen, Roshan, Peeyush, Maharshi, Gaurang, Shelvin, Sharmila, Raju, Shah, Panky, Shishira, and others – for making my life here more lively. I would like to thank all my amazing labmates, especially Gaurang, Sriram, Neha, Samyadeep, and Sahil. I have had the opportunity to work with several researchers throughout my career, from whom I have learned invaluable skills – from Prateek and Harsha at Microsoft Research; from Professor Chandra at IISc; from Ashish at Caltech; from Pierre, Matthijs, and Jakob at Meta; and from Lun and Rajiv at Google. I thank all my mentors and teachers. I was also fortunate to have mentors and well-wishers who have helped me make several important decisions in my life – Professor Nithin, Professor Anirban, and Dr. Dilip. I am indebted to them for all the times they answered my questions. Finally, I thank God for everything.

Table of Contents

Dedication ii
Acknowledgments iii
1 Introduction 1
  1.1 Thesis Statement 1
  1.2 Thesis Overview 1
  1.3 Thesis Contributions 3
  1.4 Publications and Authorship 8
2 Related Works 10
  2.1 Exploring Geometry Of Blind Spots In Vision Models 10
  2.2 Robustness for Streaming Models 11
  2.3 AI-Generated Text Detection 13
  2.4 AI-Generated Image Detection And Data Retrieval 14
  2.5 Robustness And Privacy in Large Language Models 17
  2.6 Hallucination Detection In Large Language Models 19
  2.7 Robustness Of Discriminative Models 21
3 Investigating Robustness Of Classification Models 23
  3.1 Exploring Geometry of Blind Spots in Vision Models 23
  3.2 Preliminaries 25
    3.2.1 Conjugate Nature of Adversarial and Confidence Preserving Perturbations 26
  3.3 Theoretical Analysis for Common Models 28
  3.4 Proposed Method: Level Set Traversal 30
  3.5 Disconnected Nature of Standard Adversarial Attacks 36
  3.6 Quantifying Under-Sensitivity of Vision Models 39
  3.7 Conclusion I 43
  3.8 Provable Robustness For Streaming Models 44
  3.9 Preliminaries and Notation 49
  3.10 Robustness Certificate 52
  3.11 Attacking Each Window 54
  3.12 Experiments 55
    3.12.1 Attacking each input only once 56
    3.12.2 Attacking each window 59
  3.13 Conclusion II 60
4 Hardness of AI-Generated Text Detection 61
  4.1 Introduction 61
  4.2 Evading AI-Detectors using Paraphrasing Attacks 67
    4.2.1 Attack Setup and Paraphrasing Methods 67
    4.2.2 Quality of the Paraphrases 70
    4.2.3 Paraphrasing Attacks on Watermarked AI Text 72
    4.2.4 Paraphrasing Attacks on Non-Watermarked AI Text 75
  4.3 Adversarial Paraphrasing For Universal AI Text Humanization 77
  4.4 Spoofing Attacks on Generative AI-text Models 79
  4.5 Hardness of Reliable AI Text Detection 83
  4.6 Conclusion 86
5 Robustness of AI Image Detectors: Fundamental Limits And Practical Attacks 89
  5.1 Introduction 89
  5.2 Robustness of Image Watermarking for AI-Image Detection 94
    5.2.1 Fundamental constraints for Watermarking methods 94
    5.2.2 Low Perturbation Budget Watermarks: Empirical Attacks 96
    5.2.3 High Perturbation Budget Watermarks: Empirical Attacks 99
    5.2.4 Spoofing attacks on watermarking methods 101
  5.3 Robustness-Reliability Trade-off of Deepfake Detectors 102
  5.4 Conclusion I 105
  5.5 Data Retrieval with Error-Corrected Codes and Watermarking 105
    5.5.1 Problem Definition 111
    5.5.2 Preliminaries 112
    5.5.3 Methodology: DREW 113
    5.5.4 Performance Analysis 114
  5.6 Experiments 117
    5.6.1 Experimental Setup 117
    5.6.2 Results 119
  5.7 Discussion and Limitations 120
  5.8 Conclusion II 121
6 IConMark: Robust Interpretable Concept-Based Watermark For AI Images 123
  6.1 Introduction 123
  6.2 IConMark: Interpretable Watermarking 125
    6.2.1 Initialization: Concept Database Generation 126
    6.2.2 Watermarked Image Generation 126
    6.2.3 Watermark detection 127
  6.3 IConMark (+SS and +TM): Harnessing the Combinatorial Strength of IConMark 128
  6.4 Experiments 130
    6.4.1 Experimental Settings 132
    6.4.2 Watermark Detection And Image Quality 134
    6.4.3 Robustness of Watermarks 135
  6.5 Conclusion 136
7 Fast Adversarial Attacks on Language Models 138
  7.1 Introduction 138
  7.2 Beam Search-based Adversarial Attack 141
    7.2.1 Preliminaries 141
    7.2.2 Our Threat Model 142
    7.2.3 Our Method: BEAST 142
  7.3 Jailbreaking Attacks 143
    7.3.1 Setup 143
    7.3.2 Baselines 144
    7.3.3 Evaluation Methods 147
    7.3.4 Results 148
    7.3.5 Multiple Behaviour and Transferability 149
    7.3.6 Model Transferability 150
  7.4 Hallucination Attacks 151
    7.4.1 Setup 152
    7.4.2 Results 152
  7.5 Privacy Attack 153
    7.5.1 Setup 153
    7.5.2 Baselines 155
    7.5.3 Results 155
  7.6 Conclusion 155
8 On Manipulating Audio-Based Large Language Models In The Real World 157
  8.1 Introduction 157
  8.2 Adversarial Attacks for Audio-based LLMs 159
    8.2.1 Targeted Attacks 160
    8.2.2 Untargeted Attacks 163
  8.3 Manipulating an Audio-based LLM in the Real World 168
  8.4 Defenses For Audio-Based Manipulations 171
  8.5 Conclusion 173
9 Investigating Detection Of Hallucination In Large Language Models 174
  9.1 Introduction 174
  9.2 Preliminaries 177
    9.2.1 Notation 177
    9.2.2 Background and Formalisms for Hallucination Detection 178
  9.3 Proposed Method 179
  9.4 Experimental Results 184
    9.4.1 Detection Results on Datasets with no External References as Context 187
    9.4.2 Detection with External References 188
  9.5 Conclusion 189
10 Unlearnable Datasets For Preserving User Privacy 192
  10.1 Introduction 192
  10.2 Convolution-based Unlearnable DAtaset (CUDA) 195
    10.2.1 Preliminaries 196
    10.2.2 Limitations of existing works 196
    10.2.3 Our method: CUDA 198
  10.3 Theory for CUDA 199
  10.4 Experiments 203
    10.4.1 Experimental setup 203
    10.4.2 Effectiveness of CUDA 205
  10.5 Conclusion 209
11 Appendix 211
  11.1 Exploring Geometry of Blind Spots in Vision Models 211
  11.2 Robustness for Streaming Models with a Sliding Window 223
  11.3 Hardness of AI Text Detection 236
  11.4 Hardness of AI Image Detection 269
  11.5 IConMark: Robust Interpretable Concept-Based Watermark For AI Images 293
    11.5.1 Additional Examples of Generated Images 293
  11.6 Data Retrieval with Error-Corrected Codes and Watermarking 297
  11.7 Fast Adversarial Attacks on Language Models 305
  11.8 Detection Of Hallucination In Large Language Models 325
  11.9 Unlearnable Datasets For Preserving User Privacy 330
Bibliography 367

Chapter 1: Introduction

1.1 Thesis Statement

In this thesis, we explore fundamental aspects of AI model robustness, detectability, and data privacy. We analyze challenges posed by adversaries to AI systems to contribute towards improving their robustness, security, and privacy.

1.2 Thesis Overview

Artificial Intelligence (AI) has become an integral part of numerous real-world applications, from self-driving cars to models that can generate creative multimodal content. Despite their success, these models exhibit several security and privacy vulnerabilities. Adversaries could attack these AI systems to potentially jailbreak them to cause harm or leak sensitive private information. In order to make these systems more robust and secure, we need to understand their failure modes and vulnerabilities. In this thesis, we analyze the robustness of various AI models to better understand their security and privacy issues. We study robustness, detectability, and data privacy in AI.

First, we study the robustness of discriminative models. Adversarial attacks on computer vision models are well studied: in an adversarial attack on an image classifier, an attacker adds imperceptible noise to the input to make the model misclassify. We study a novel phenomenon called blind spots in vision models, where we find multiple continuous, equi-confident, star-like paths in the input space along which the model produces invariant output. We also study the robustness of streaming classification models to provide provable certificates.

In the next part, we study the detectability of AI. Recent advancements in generative AI have led to the wide usage of Large Language Models (LLMs) and Diffusion Models (DMs). Attackers could use these generative models to produce fake content for social engineering.
In order to tackle this, many works have suggested employing AI-content detectors. In our study, we analyze the hardness of AI-content detection in both the text and image modalities. We show, empirically and theoretically, that attackers can easily evade these AI detection systems, exposing their security vulnerabilities. Further, we discuss how the insights from our analysis can be leveraged to design more robust detection technologies. We also extend a specific detection technology, watermarking, to show its benefits in data retrieval applications.

We then investigate the robustness of LLMs to adversarial attacks. Adversarial attacks on LLMs are relatively new, since it is hard to perform optimization in the discrete input token space. Though recent works present various approaches to attack LLMs, they are either computationally expensive or dependent on a more powerful model to attack a weaker one. We propose an algorithm that attacks LLMs in under one GPU minute. We also demonstrate how multi-modal LLMs, with their increased attack surface, are more vulnerable to these attacks in the real world. We extend this with experiments that jailbreak LLMs, elicit hallucinations, and perform privacy attacks. We show that an adversary can attack existing transformer-based language models efficiently to make them unsafe and less useful, and that these risks increase as models gain more input modalities. Specifically, we demonstrate scenarios where adversaries could harm innocent users of audio-based LLMs.

Because LLMs perform impressively on various tasks, they are becoming popular tools in many applications. However, they are known to hallucinate, raising concerns about their use in sensitive fields such as healthcare or robotics. In our study, we propose multiple methods to detect LLM hallucinations. We introduce several algorithms that can detect hallucinations in both white-box and black-box settings.
We demonstrate that the proposed detection methods are extremely compute-efficient, requiring only a fraction of the run-time of other baselines while achieving significant improvements in detection performance over diverse datasets.

Finally, we discuss an approach that uses poisoning for user privacy preservation. Big data has contributed a lot to the success of today’s AI. This has led various developers to train their models on data available on the Internet without the data owners’ consent. We propose an unlearnable dataset that can be generated efficiently in a model-agnostic manner. Unlearnable datasets produced by our algorithm, using just data-wise random convolutions, cause models trained on this unconsented data to perform poorly on the user data distribution. We study this theoretically and empirically, showing promising results.

In summary, our thesis develops various attack frameworks to understand the robustness and security vulnerabilities of various AI models. The insights we provide from our attacks can be useful in designing better AI systems. We also propose approaches to address some of these issues in existing AI models.

1.3 Thesis Contributions

Our thesis explores fundamental aspects of AI model robustness, detectability, and privacy. We propose various frameworks to attack existing AI models to better understand their vulnerabilities. We also devise new defense methods to make AI more secure and reliable. We state our contributions below.

Investigating Robustness Of Classification Models

• We propose a Level Set Traversal algorithm that iteratively explores regions of high confidence in the input space using orthogonal components of the local gradients. Given a source image, we use this algorithm to identify inputs that lie in the same equi-confidence level set as the source image despite being perceptually similar to arbitrary images from other classes.
We further observe that the source image is linearly connected by a high-confidence path to these inputs, uncovering a star-like structure for the level sets of deep networks. Furthermore, we attempt to identify and estimate the extent of these connected higher-dimensional regions over which the model maintains a high degree of confidence. [NeurIPS 2023]

• We derive robustness certificates for models that use a fixed-size sliding window over an input stream. Our guarantees hold for the average model performance across the entire stream and are independent of stream size, making them suitable for large data streams. We perform experiments on speech detection and human activity recognition tasks and show that our certificates can produce meaningful performance guarantees against adversarial perturbations. [arXiv 2023]

Hardness of AI-Generated Text Detection

• We introduce a recursive paraphrasing attack to stress test a wide range of detection schemes, including those based on watermarking as well as neural network-based detectors, zero-shot classifiers, and retrieval-based detectors. Our experiments, conducted on passages of approximately 300 tokens each, reveal the varying sensitivities of these detectors to our attacks. We also observe that these paraphrasing attacks cause only slight degradation in text quality. Additionally, we investigate the susceptibility of watermarked LLMs to spoofing attacks aimed at misclassifying human-written text as AI-generated. We also provide a theoretical framework connecting the AUROC of the best possible detector to the total variation distance between human and AI text distributions. [TMLR 2025]

• We introduce adversarial paraphrasing, a strong universal framework to break existing AI text detectors. Our adversarial attack uses an LLM to auto-regressively paraphrase AI texts, guided by an AI text classifier, to humanize them.
We find that our attack effectively transfers to all kinds of text detection schemes, including watermarking, zero-shot detectors, and trained detectors. [NeurIPS Submission]

Hardness of AI-Generated Image Detection

• For watermarking methods that introduce subtle image perturbations (i.e., low perturbation budget methods), we reveal a fundamental trade-off between the evasion error rate (i.e., the fraction of watermarked images detected as non-watermarked) and the spoofing error rate (i.e., the fraction of non-watermarked images detected as watermarked) under a diffusion purification attack. For high perturbation budget watermarking methods, we develop a model substitution adversarial attack that can successfully remove watermarks. Moreover, we show that watermarking methods are vulnerable to spoofing attacks in which the attacker aims to have real images identified as watermarked, damaging the reputation of the developers. We also extend our theory to characterize a fundamental trade-off between the robustness and reliability of classifier-based deepfake detectors and demonstrate it through experiments. [ICLR 2024]

• We propose IConMark, a novel in-generation robust semantic watermarking method that embeds interpretable concepts into AI-generated images, as a first step toward interpretable watermarking. Our method is not only robust against various image augmentations but also human-readable, enabling manual verification of watermarks. We present a detailed evaluation of IConMark, demonstrating its superiority in terms of detection accuracy while maintaining image quality. Moreover, IConMark can be combined with existing watermarking techniques to further enhance and complement its robustness.
IConMark and its variants achieve 10.8%, 14.5%, and 15.9% higher mean area under the receiver operating characteristic curve (AUROC) scores for watermark detection, respectively, compared to the best baseline on various datasets. [ICLR 2025 GenAI Watermarking Workshop and ICCV Submission]

• We propose Data Retrieval with Error-corrected codes and Watermarking (DREW), which randomly clusters the reference dataset, injects unique error-controlled watermark keys into each cluster, and uses these keys at query time to identify the appropriate cluster for a given sample. DREW maintains baseline performance while also providing opportunities for performance improvements, due to the increased likelihood of correctly matching queries to their origin when performing retrieval on a smaller subset of the dataset. [arXiv 2024]

Adversarial Attacks on Language Models

• We introduce a novel class of fast, beam search-based adversarial attacks (BEAST) for Large Language Models (LLMs). BEAST employs interpretable parameters, enabling attackers to balance attack speed, success rate, and the readability of adversarial prompts. The computational efficiency of BEAST enables us to investigate its applications to LLMs for jailbreaking, eliciting hallucinations, and privacy attacks. We find that BEAST can jailbreak LLMs in one to two GPU minutes with as high as ∼100% attack success. Our attack can also increase hallucination rates in LLMs by 15% to 30%. BEAST can adversarially increase the membership inference attack accuracies of existing baseline methods as well. We also show that our attack can be used to break new LLMs in a black-box manner. [ICML 2024]

• We demonstrate the real-world vulnerabilities of open-sourced LLMs that use audio as an input modality, such as Qwen2-Audio.
We first demonstrate that an adversary can craft stealthy audio perturbations to manipulate audio-based LLMs into exhibiting specific targeted behaviors, such as eliciting responses to wake-keywords (e.g., “Hey Qwen”). Subsequently, we experimentally show that adding adversarial noise to benign speech signals can significantly degrade the speech processing performance of these models. Crucially, our research illustrates that these attacks scale to real-world scenarios, impacting other innocent users. We find that our methods can design adversarial audio signals that remain effective even when played through the air, capable of compromising audio-based LLMs used by innocent users and compelling their models to execute unintended or harmful agentic tasks. Finally, we discuss potential defensive measures, specifically input audio augmentation techniques, to mitigate these attacks. [NeurIPS Submission]

Investigating Detection Of Hallucination In Large Language Models

• We conduct a comprehensive investigation into the nature of hallucinations within LLMs and explore effective techniques for detecting such inaccuracies in various real-world settings. We seek to identify hallucinations within a single response in both white-box and black-box settings by analyzing the internal hidden states, attention maps, and output prediction probabilities of an auxiliary LLM. We also study hallucination detection in scenarios where ground-truth references are available, such as in the setting of Retrieval-Augmented Generation (RAG). We demonstrate that the proposed detection methods are extremely compute-efficient, requiring only a fraction of the run-time of other baselines, while achieving significant improvements in detection performance over diverse datasets. [NeurIPS 2024]

Unlearnable Datasets For Preserving User Privacy

• We propose a novel, model-free, Convolution-based Unlearnable DAtaset (CUDA) generation technique.
CUDA is generated using controlled class-wise convolutions with filters that are randomly generated via a private key. CUDA encourages the network to learn the relation between filters and labels rather than the informative features needed for classifying the clean data. We develop a theoretical analysis demonstrating that CUDA can successfully poison Gaussian mixture data by reducing the clean-data performance of the optimal Bayes classifier. We also show that CUDA is robust even to adaptive defenses designed specifically to break it. [CVPR 2023]

1.4 Publications and Authorship

This thesis is based on the published manuscripts, preprints, manuscripts under review, and ongoing works listed in Fig. 1.1. These works are the result of collaborations with my advisor, Soheil Feizi, and my other colleagues at the University of Maryland. In this thesis, the pronoun “we” is used to acknowledge the collective contribution of all my collaborators.

Contributions. For all my first-author works, I led the work, formulated the ideas, and performed the experiments to present the key results. For my joint first-author work (Cheng & Sadasivan et al., 2025), I formulated the idea and performed the experiments presenting the key algorithmic results, except for the experimental ablation results. I produced the experimental results in Kumar et al., 2023, the experimental ablations in Balasubramanian & Sriramanan et al., 2023, the entropy-based detector in Sriramanan et al., 2024, the detector spoofing and deepfake experiments in Saberi et al., 2024a, and insights to improve the retrieval algorithm in Saberi et al., 2024b.

Figure 1.1: Primary works directly related to the thesis.
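The class-wise convolution idea behind the CUDA contribution above can be pictured with a minimal NumPy sketch. This is an illustration only, not the exact CVPR 2023 procedure: the 3×3 filter size, the blending strength, and the normalization step are our assumptions, chosen to show how a private key can seed one random filter per class so that the filter-label correlation, rather than semantic content, predicts the label.

```python
import numpy as np

def make_class_filters(num_classes, k=3, key=0):
    # One random k x k filter per class, seeded by a private key.
    rng = np.random.default_rng(key)
    return rng.uniform(-1.0, 1.0, size=(num_classes, k, k))

def correlate2d(img, filt):
    # Naive 'same'-size 2-D cross-correlation with zero padding
    # (filter flip omitted for simplicity).
    k = filt.shape[0]
    pad = k // 2
    padded = np.pad(img, pad)
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * filt)
    return out

def make_unlearnable(images, labels, filters, strength=0.3):
    # Blend each image with its class-specific filtered version so the
    # filter-label shortcut, not semantic content, predicts the label.
    poisoned = []
    for img, y in zip(images, labels):
        conv = correlate2d(img, filters[y])
        conv = (conv - conv.min()) / (conv.max() - conv.min() + 1e-8)
        poisoned.append((1 - strength) * img + strength * conv)
    return np.array(poisoned)

# Toy usage: four 8x8 grayscale "images" from two classes.
filters = make_class_filters(num_classes=2, key=42)
images = np.random.default_rng(1).uniform(0, 1, size=(4, 8, 8))
labels = [0, 1, 0, 1]
poisoned = make_unlearnable(images, labels, filters)
assert poisoned.shape == images.shape
```

Because generation only requires per-class random filters and a convolution pass, no surrogate model or optimization is involved, which is what makes such a scheme model-agnostic and cheap.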
Chapter 2: Related Works
2.1 Exploring Geometry Of Blind Spots In Vision Models
The phenomenon of under-sensitivity in classification models was first pointed out by [170], wherein the authors utilize a class of invertible neural networks called fully Invertible RevNets [171, 192] to specifically craft input images that do not affect network activations at a given layer. In contrast, our proposed algorithm is applicable to general network architectures since we solely utilize input-gradients to perform the traversal over image space. [379] further demonstrated that due to the misalignment of ℓp norm bounded balls and the ideal set of human-imperceptible perturbations, networks that are adversarially trained against such ℓp bounded perturbations of relatively large radius are overly smooth, and become excessively susceptible to invariance-based attacks within the same ℓp radius. To find such images that lie within a given ℓp threat model but induce human oracles to change their label assignment, the authors propose to identify the training image from another class closest in image space and apply a series of semantic-preserving transformations, and additionally use techniques such as realignment and spectral clustering. Given that these operations are fairly complex, the attack algorithm is slow, with alignment alone requiring an amortized time of minutes per input example. In contrast, our proposed method relies upon gradient-backpropagation steps which are efficiently parallelized across a minibatch of input images. Furthermore, our technique is seen to be successful for arbitrary source-target image pairs, since we do not utilize near-neighbour training images as the target. On the theoretical front, [178] analyze the problem of span-recovery of neural networks given only oracle access to the network predictions.
They characterize the feasibility of span recovery, and thus approximations of the null space of networks in a provable setting, but remark that its success in practical settings is potentially limited to networks that are extremely thin.
2.2 Robustness for Streaming Models
The adversarial streaming setup has been studied extensively in recent years. [270] designed an attack for transient data streams that do not allow the adversary to re-attack past input items. In their setting, the adversary only has partial knowledge of the target DNN and the perturbations applied in previous time steps are irrevocable. Their objective is to produce an adversarial attack with minimal access to the data stream and the target model. Our goal, on the other hand, is to design a provably robust method that can defend against as general and strong an adversary as possible. We assume that the adversary has full knowledge of the parameters of the target DNN and can change the adversarial perturbations added in previous time steps. Our threat model includes transient data streams as a special case and applies even to adversaries that only have partial access to the DNN. Streaming adversarial attacks have also been studied for sampling algorithms such as Bernoulli sampling and reservoir sampling [25]. Here, the goal of the adversary is to create a stream that is unrepresentative of the actual data distribution. Other works have studied the adversarial streaming setup for specific data analysis problems like frequency moment estimation [24], submodular maximization [269], coreset construction and row sampling [38]. In this work, we focus on a robustness certificate for general DNN models in the streaming setting under the conventional notion of adversarial attacks in machine learning literature. We use a sliding-window computational model which has been extensively studied over several years for many streaming applications [87, 110, 118].
Recently, [106] showed that short-term memory is sufficient for several real-world reinforcement learning tasks. A closely related setting is that of adversarial reinforcement learning. Adversarial attacks have been designed that either directly corrupt the observations of the agent [23, 163, 295] or introduce adversarial behavior in a competing agent [122]. Robust training methods, such as adding adversarial noise [188, 384] and training with a learned adversary in an online alternating fashion [429], have been proposed to improve the robustness of RL agents. Several certified defenses have also been developed over the years. For instance, [431] developed a method that can certify the actions of an RL agent at each time step under a fixed adversarial perturbation budget. It can certify the total reward obtained at the end of an episode if each of the intermediate actions is certifiably robust. Our streaming formulation allows the adversary to choose the budget at each time step as long as the average perturbation size remains below ε over time. Our framework also does not require each prediction to be robust in order to certify the average performance of the DNN. More recent works in certified RL can produce robustness guarantees on the total reward without requiring every intermediate action to be robust or the adversarial budget to be fixed [204, 410]. However, these certificates degrade for longer streams, and the tightness analysis of these certificates indicates that this dependence on stream size may not be improvable. Our goal is to keep the robustness guarantees independent of stream size so that they are suitable even for large streams. The literature on provable robustness has primarily focused on static prediction problems like image classification. One of the most prominent techniques in this line of research is randomized smoothing.
For a given input image, this technique aggregates the output of a DNN on several noisy versions of the image to produce a robust class label [71, 221]. This was the first approach to scale up to high-dimensional image datasets like ImageNet for ℓ2-norm bounded adversaries. It does not make any assumptions on the underlying neural network, such as Lipschitz continuity or a specific architecture, making it suitable for conventional DNNs that are several layers deep. However, randomized smoothing also suffers from some fundamental limitations for higher norms such as the ℓ∞-norm [206]. Due to its flexible nature, randomized smoothing has also been adapted for tasks beyond classification, such as segmentation and deep generative modeling, with multi-dimensional and structured outputs like images, segmentation masks, and language [203]. For such outputs, robustness certificates are designed in terms of a distance metric in the output space such as LPIPS distance, intersection-over-union, and total variation distance. However, provable robustness in the static setting assumes a fixed budget on the size of the adversarial perturbation for each input instance and does not allow the adversary to choose a different budget for each instance. In our streaming threat model, we allow the adversary the flexibility of allocating the adversarial budget to different time steps in an effective way, attacking more critical input items with a higher budget and conserving its budget at other time steps. Recent work on provable robustness against Wasserstein shifts of the data distribution allows the adversary to choose the attack budget for each instance differently [207]. However, unlike our streaming setting, the input instances are drawn independently from the data distribution and the adversarial perturbation applied to one instance does not impact the performance of the DNN on another.
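The smoothing procedure described above can be sketched in a few lines (a minimal illustration with a toy classifier; the noise scale, sample count, and classifier are illustrative stand-ins, and the full method of [71, 221] additionally computes confidence intervals over the vote counts to produce the certificate):

```python
import numpy as np

def smoothed_predict(classifier, x, sigma=0.25, n_samples=1000, rng=None):
    """Majority-vote prediction over Gaussian-noised copies of the input.

    `classifier` maps a batch of inputs to integer class labels; the smoothed
    classifier returns the most frequent label under noise N(0, sigma^2 I).
    """
    rng = rng or np.random.default_rng(0)
    noisy = x[None, :] + sigma * rng.standard_normal((n_samples, x.size))
    labels = classifier(noisy)
    return np.bincount(labels).argmax()

# Toy 2-class linear classifier: predict by the sign of the first coordinate.
toy = lambda batch: (batch[:, 0] > 0).astype(int)

x = np.array([0.8, -0.3])
print(smoothed_predict(toy, x))  # robustly predicts class 1 well inside the halfspace
```

Because the smoothed label depends only on vote counts under isotropic noise, small ℓ2 perturbations of x cannot flip a sufficiently lopsided majority, which is the intuition behind the certificate.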
2.3 AI-Generated Text Detection
Several detection works study this problem as a binary classification problem [19, 109, 177, 268, 280] and use neural network-based detectors. For example, OpenAI fine-tunes RoBERTa-based [249] GPT-2 detector models to distinguish between non-AI-generated and GPT-2-generated texts [280]. This requires such a detector to be fine-tuned with supervision on each new LLM for reliable detection. Another stream of work focuses on zero-shot AI text detection without any additional training overhead [120, 168, 357]. These works evaluate the expected per-token log probability of texts and perform thresholding to detect AI-generated texts. [268] observe that AI-generated passages tend to lie in regions of negative curvature of the text log-probability function. They propose DetectGPT, a zero-shot LLM text detection method, to leverage this observation. Since these approaches rely on a neural network for their detection, they can be vulnerable to adversarial and poisoning attacks [126, 208, 330, 390]. Another line of work aims to watermark AI-generated texts to ease their detection [15, 193, 403, 442]. Watermarking eases the detection of LLM output text by imprinting specific patterns on it. Soft watermarking, proposed in [193], partitions the vocabulary into “green” and “red” lists to help create these patterns. A watermarked LLM samples a token, with high probability, from the green list determined by a pseudo-random generator seeded by its prefix token. The watermarking detector would classify a passage with a large number of tokens from the green list as AI-generated. These watermarks are often imperceptible to humans. [199] introduces an information retrieval-based detector that stores the outputs of the LLM in a database. For a candidate passage, their algorithm searches this database for semantically similar matches for detection. However, storing user-LLM conversations might cause serious privacy concerns.
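The green-list detection rule sketched above amounts to a one-proportion z-test on the number of green tokens. Below is a minimal sketch; the vocabulary size, green-list fraction, seeding scheme, and detection threshold are illustrative choices rather than those of [193]:

```python
import math
import random

def green_list(prefix_token: int, vocab_size: int, gamma: float) -> set:
    # Pseudo-random partition of the vocabulary, seeded by the prefix token.
    rng = random.Random(prefix_token)
    vocab = list(range(vocab_size))
    rng.shuffle(vocab)
    return set(vocab[: int(gamma * vocab_size)])

def watermark_z_score(tokens, vocab_size=1000, gamma=0.5):
    """z-score of the green-token count against the unwatermarked null."""
    n = len(tokens) - 1
    hits = sum(
        tok in green_list(prev, vocab_size, gamma)
        for prev, tok in zip(tokens, tokens[1:])
    )
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))

# A "watermarked" sequence that always samples from the green list scores
# far above a typical detection threshold (e.g., z > 4), while unwatermarked
# text concentrates near z = 0.
seq = [0]
for _ in range(60):
    seq.append(min(green_list(seq[-1], 1000, 0.5)))
print(watermark_z_score(seq) > 4.0)  # True
```

The z-score grows with passage length for watermarked text, which is why longer passages are easier to detect and why paraphrasing attacks that break the prefix-seeded pattern degrade the statistic.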
2.4 AI-Generated Image Detection And Data Retrieval
Image Watermarking. Image watermarking is a versatile technology with applications in copyright protection, content authenticity, data authentication, privacy preservation, and branding. Its evolution began with manual methods like LSB [406], and later techniques involved altering spatial or frequency domains [35, 76, 121, 153, 286, 297]. Various transformations such as DCT, DWT, SVD-decomposition [57], and Radon transformations [339] were explored. Recent advancements incorporate deep learning and generative models like SteganoGAN [432], StegaStamp [368], RivaGAN [433], WatermarkDM [444], MBRS [181], and Tree Ring [400], each employing different methods to embed watermarks into images. There have been several works trying to attack watermarking methods [183, 387]. Notably, a recent concurrent work [443] also proves that the diffusion purification attack is successful against invisible (low perturbation budget) watermarking. However, [443] is unable to attack high perturbation budget watermarking methods such as Tree Ring or StegaStamp and argues that they are more reliable watermarking alternatives. In contrast, we show that our model substitution adversarial attack can effectively break those watermarking methods. Additionally, we show that several watermarking approaches are vulnerable to spoofing attacks and characterize a robustness-reliability trade-off for a classification-based deepfake detector.
Classifier-based Detectors. Several machine-learning approaches focusing on detecting artifacts in AI-generated content have been studied. For instance, [261] target irregularities in face editing algorithms, while [69] exploit biological signals. [234] introduce a technique for identifying partially manipulated videos, and [135] harness the traces from convolutional layers of generative adversarial networks in fake image detection.
[36] analyze spatiotemporal texture dynamics of video signals for Deepfake detection. A plethora of works focus on facial forgery or Deepfake detection using convolution net-based classifiers [22, 77, 97, 310, 311, 447]. [318] propose a face forensics dataset and train ResNet [144] and XceptionNet [64] based classifiers using it. However, as noted in [142], machine learning-based detectors are often vulnerable to novel input perturbations. Such limitations challenge the practical utility of these methods, as any real-world detector should achieve good performance while being robust to small perturbations in the input.
Error Control Codes. Error control codes (ECC) [143] are fundamental in ensuring the reliable transmission and storage of data across noisy channels. These codes detect and correct errors without needing re-transmission, enhancing data integrity and communication efficiency. Notable ECCs include Reed-Solomon codes [313], known for robust error correction in digital communications and storage like CDs and DVDs. LDPC codes [117] have transformed ECC with near-capacity performance, crucial in modern systems like 5G and satellite communications. BCH codes [37], introduced in the 1950s, are versatile for error detection and correction in applications from QR codes to digital broadcasting. Polar codes [12], a recent advancement, achieve channel capacity with low-complexity decoding, essential for 5G technology. These advancements highlight ECC’s evolution, driven by the need for high reliability and efficiency in complex communication networks.
Data Retrieval. Data retrieval techniques refer to the methods and algorithms used to efficiently search, access, and retrieve relevant information from large data repositories or databases [60]. Recent advancements include embedding-based retrieval systems, which utilize deep learning models to transform data into feature vectors, enabling accurate and context-aware searches.
Approximate Nearest Neighbors (ANN) [13, 349] is a notable technique in this field, facilitating rapid similarity searches by approximating the nearest neighbor search in large datasets. Data retrieval techniques are especially useful in applications such as recommendation systems [275], retrieval augmented generation [227], near-duplicate detection [358], and anomaly detection [73].
Watermarking for Data Retrieval. Some existing works have explored the use of watermarks in retrieval-based tasks in various ways. Certain studies [319, 320] propose manually shifting data (similar to watermarking) to enhance the mutual separation of samples in feature space, thereby improving the robustness of feature-based data retrieval. However, these methods do not leverage recent advancements in watermarking or efficient retrieval techniques, rendering them impractical compared to contemporary approaches. Additionally, other works [169, 214] have utilized watermarks in image retrieval systems for user identification and to prevent illegal data distribution. Nonetheless, these approaches fail to address the low robustness of watermarks when used alone and without error correction modules, making them unreliable in practical applications.
2.5 Robustness And Privacy in Large Language Models
Adversarial attacks. Adversarial machine learning literature shows that the inputs to models can be perturbed to get a desired target output [27, 53, 126, 290, 363]. Several works investigate adversarial examples in the text domain for question answering [180], document classification [105], and sentiment analysis [9], either using discrete optimization or greedy methods [185, 386]. Though recent works [139, 185, 344, 399] show that they can generate adversarial prompts with automatic prompt-tuning, [49] claim that such methods are insufficiently powerful to attack LMs reliably.
Jailbreaking. A lot of research has been done to align LMs to human values to make them safe and useful [287, 397].
However, [85, 296] show that prompts can be manually written to jailbreak aligned LLMs. [449] introduced a gradient-based optimization technique to generate adversarial prompts automatically by adding gibberish adversarial token suffixes. [448] also proposes a gradient-based jailbreaking method that improves upon the readability of the adversarial token suffixes. [218, 247, 419] propose black-box jailbreak attacks using genetic search algorithms. [58, 119] propose black-box attacks inspired by social engineering, where aligned LMs generate jailbreaking prompts by querying a target LM iteratively. [164] demonstrates that the alignment of LM chatbots can be disrupted by manipulating their decoding strategies.
With the increasing prevalence of vision-based LLMs, research has consequently begun exploring how jailbreaking can affect these models. Such attacks often exploit cross-modal interactions between vision and text inputs to circumvent safety alignments. For instance, techniques involve crafting adversarial images with subtle perturbations designed to elicit harmful text generation when paired with simple prompts [300]. Others focus on encoding harmful requests directly into the visual input, such as embedding hidden text or using typographic visual prompts like those created by FigStep [123]. A related approach, exemplified by Imgtrojan [371], aims to create a single ‘Trojan’ image that acts as a trigger, enabling the model to be jailbroken for various subsequent harmful text prompts.
The domain of attacking audio-based LLMs remains significantly under-explored. AdvWave [189] represents a pioneering contribution in this area, presenting a framework engineered to circumvent the safety protocols of speech-processing audio-based LLMs to elicit prohibited content, via a dual-phase optimization method: an adaptive target search algorithm, followed by techniques that render the adversarial audio inconspicuous, resembling background noise.
The methodology in their study targets the direct creation of adversarial speech, aligning with digital text-based attack paradigms, without explicitly modeling real-world acoustic propagation effects, which is a key consideration in our research. Related research has investigated audio vulnerabilities in other machine learning contexts like Automatic Speech Recognition or Text-To-Speech systems [10, 52, 173, 243, 388].
Hallucinations. Aligned LMs, at times, produce nonsensical outputs that are irrelevant to the input prompts [4] or previously generated context [246], or factually incorrect [232, 264, 271]. While a plethora of benchmarks [232, 240, 264], detection methods [272, 424], and mitigation techniques [129, 347, 385] exist for hallucination, it is still a problem that needs more investigation [440].
Privacy attacks. Overparameterized networks are known to memorize training data. Several works demonstrate that this can be a pitfall of these models and can be leveraged to leak potentially private information from them, such as training data membership [47, 345, 417] or the data itself [47, 48, 274]. [248] performs prompt injection with manually crafted prompts to leak system prompts. [448] uses their automated jailbreak attack to leak system prompts from aligned LMs.
2.6 Hallucination Detection In Large Language Models
SelfCheckGPT [259] presented a suite of hallucination detection methods in a zero-resource grey-box/black-box setting to assess the veracity of an LM response, assuming only access to the output probability distributions. This is highly practical in real-world usage, since API calls often preclude access to internal model activations. On the other hand, INSIDE [59] proposed to measure the self-consistency of the hidden states across multiple independent responses of an LLM, by performing a centered eigen-analysis of the covariance matrix of these hidden states.
Notably, INSIDE follows a slightly different evaluation framework from other techniques, which generally rely upon human annotations: to assess the correctness of a model response, INSIDE checks whether the ROUGE-L score or BERT-similarity score with respect to a ground-truth exceeds a threshold (0.5 and 0.9, respectively). However, ROUGE-L is an n-gram based method that evaluates the longest common subsequence between the model response and the ground-truth. Thus, several hallucinatory words can be incorporated without a considerable change in the ROUGE-L score overall. [186] proposed to use the LLM itself to predict the probability of the generated response being True, referred to as Self-Prompt; this is primarily relevant in the context of multiple-choice questions where the model explicitly has all the options in context. Furthermore, they also train models to predict the probability that a given multiple-choice question will be correctly answered by calibrating its internal confidence values. [17] proposed to train feedforward neural networks on the hidden activations of intermediate layers of LMs on a True-False dataset of relatively short, simple sentences such as "The zebra uses flying for locomotion." This method is thus a white-box detection technique which also requires supervised training data to train the feedforward network (as noted by [259]); in contrast, we do not incorporate training of any sub-network or the LLM itself. Furthermore, it is fairly non-trivial to extend the technique towards more general settings such as when external references or multiple model responses are available at inference time. [421] investigate factual errors made by LLMs by modeling them as a constraint-satisfaction problem. Indeed, the factual query is specified by considering a sequence of Constraint-tokens and using a subsequent Verifier (such as ExactMatch or WikiData Search) to determine the satisfaction of each such constraint.
This method is highly pertinent in settings where a fact-checking service is available for ready oracle-access that can provide feedback and annotations in real-time. [136] study hallucinations specifically in the setup of neural machine translation (NMT), and even propose mitigation techniques to overwrite hallucinatory sections of the output. FAVA [267] is a fine-grained hallucination detection dataset, wherein a novel hierarchy of hallucination types is identified and analyzed comprehensively. FAVA first generates a synthetic hallucination training dataset by inserting specific errors using ChatGPT [40], and then fine-tunes a Llama-2 model to explicitly detect these fine-grained hallucinations. The work also introduces a human-annotated dataset where the specific hallucination types are recorded with specified hallucination spans. We make extensive use of the FAVA datasets so released to compare multiple baselines on a common test-bed. Though retrieval augmented generation (RAG) helps to mitigate hallucination in LLMs, they are still found to hallucinate [230, 348]. Recently, [411] showed how popular LLMs still hallucinate on different tasks even with the integration of RAG, and compiled the RAGTruth dataset with span-level human annotation. They proposed to fine-tune an LLM on their training data to detect hallucination, which is computationally more expensive than our proposed method.
2.7 Robustness Of Discriminative Models
Adversarial attacks. Adversarial examples are specially designed examples that can fool deep learning models at test time [50, 125, 212, 215, 364]. The adversary crafts these examples by adding error-maximizing noise to the clean data. Even slightly perturbed data can serve as adversarial examples. Adversarial training (AT) is a framework proposed to make deep learning models robust to adversarial examples [176, 216, 254, 393, 426].
AT is a min-max optimization problem where the model is trained to minimize loss on adversarial examples that have the maximum loss.
Poisoning attacks. In data poisoning, an attacker aims to hurt the deep learning model’s performance by perturbing the training data [29, 195, 223, 245, 340, 389]. The backdoor attack is a special type of poisoning attack where a trigger pattern is injected into clean data at training time [61, 237, 250, 276]. A model trained on this data would misclassify an image with a trigger pattern at test time. Gu et al. [134] and Li et al. [238] use perceptible amounts of noise, similar to CUDA, for data poisoning. However, backdoor attacks do not affect the performance of the model on clean data [21, 61, 340]. Recent literature utilizes data poisoning to protect data from being used for model training without authorization. Yuan and Wu [420] use neural tangent kernels [172] to generate clean-label attacks that can hurt the generalization of deep learning models. Huang et al. [161] show that error-minimizing noise addition can serve as a poisoning technique. Fowl et al. [113] show that error-maximizing noise can also make strong poisoning attacks. However, none of these poisoning techniques offer data protection under AT [370, 418]. Fu et al. [116] propose a min-min-max optimization technique to generate poisoned data that offers better unlearnability effects under AT.
Chapter 3: Investigating Robustness Of Classification Models
3.1 Exploring Geometry of Blind Spots in Vision Models
Deep neural networks have demonstrated remarkable success in various diverse domains including computer vision and natural language processing, surpassing the performance of classical algorithms by a significant margin.
However, though they achieve state-of-the-art results on many computer vision tasks like image classification, sometimes even exceeding human-level performance [145, 146], the overall visual processing conducted by such models can deviate significantly from that observed in the human visual system. Perhaps the most iconic and representative of these differences is the overwhelming susceptibility of neural networks to near-imperceptible changes to their inputs — commonly called adversarial attacks [365] — an extraordinary failure mode that highlights the over-sensitivity of such models. Indeed, an active area of research over the past few years has focused on analyzing adversarial perturbations under different threat models [80, 127, 217, 378, 408] and on addressing adversarial vulnerabilities using robust training methodologies [217, 226, 253, 355, 409, 427]. On the other hand, it has been shown that such models may also be under-sensitive, wherein input images that are unmistakably disparate to a human oracle induce near-identical network activations or predictions [170]. To tractably analyze this phenomenon, [170] utilize a special class of neural networks that are bijective functions, called fully Invertible RevNets [171, 192], to craft large-magnitude semantic perturbations in input space that are designed to leave the corresponding network representations unchanged. Furthermore, [379] show that robust training with ℓp bounded adversaries can be a source of excessive model-invariance in itself, due to the poor approximation of the true imperceptible threat model of human oracles by ℓp norm-bounded balls in RGB-pixel space [217]. Indeed, the authors ‘break’ a provably robust defense on MNIST [430] with a certified accuracy of 87%, by crafting perturbations within the certified ℓ∞ radius of 0.4 that nevertheless cause model agreement with human oracles to diminish to 60%.
However, these methods either rely upon special invertible network architectures, or the selection of the nearest training image of another class as a target followed by a sequence of complex alignment and spectral clustering techniques, so as to semantically alter the input in a conspicuous manner that induces a change in the human-assigned oracle label while leaving the model prediction unchanged. This leads us to our research question: Is it possible to analyze the phenomenon of under-sensitivity of general vision models in a systematic manner on natural image datasets, and characterize the geometry and extent of “blind spots” of such networks? Indeed, we empirically demonstrate the veracity of this claim: in this work, we present a novel Level Set Traversal algorithm to explore the “equi-confidence” level sets of popular vision models. Given an arbitrary source and target image pair, our proposed algorithm successfully finds inputs that lie in the same level set as the source image, despite being near-identical perceptually to the target image. Furthermore, the proposed algorithm identifies a connected path between the original source image and the “blind spot” input so generated, wherein high prediction confidence with respect to the source class is maintained throughout the path. In summary, we make the following contributions in this work:
• We present a novel Level Set Traversal algorithm that iteratively uses orthogonal components of the local gradient to identify the “blind spots” of common vision models such as CNNs and ViTs on CIFAR-10 and ImageNet.
• We thereby show that there exist piecewise-linear connected paths in input space between images that a human oracle would deem to be extremely disparate, though vision models retain a near-uniform level of confidence on the same path.
• Furthermore, we show that the linear interpolant path between these images also remarkably lies within the same level set; as we observe the consistent presence of this phenomenon across arbitrary source-target image pairs, this unveils a star-like set substructure within these equi-confidence level sets.
• We demonstrate that adversarially robust models tend to be under-sensitive over subsets of the input domain that lie well beyond their original threat model, and display level sets of high confidence that extend over a significant fraction of the triangular convex hull of a given source image and an arbitrary pair of target images.
3.2 Preliminaries
Notation: In this paper, we primarily consider the setting of classification with access to a labelled dataset. Let x ∈ X denote d-dimensional input images, and let their corresponding labels be denoted as y ∈ {1, . . . , N}. Let f : X → [0,1]^N denote the classification model considered, where f(x) = (f^1(x), . . . , f^N(x)) represents the softmax predictions over the N classes. Further, let C_f(x) be the argmax over the N-dimensional softmax output, representing the class predicted by the model for input x. For a given data sample (x, y), let the cross-entropy loss achieved by the model be denoted as CE(f(x), y). Given a prediction confidence value p ∈ [0,1] and a class j ∈ {1, . . . , N}, we define the Level Set L_f(p, j) and Superlevel Set L_f^+(p, j) for the function f as follows:
L_f(p, j) = {x ∈ X : f^j(x) = p},   L_f^+(p, j) = {x ∈ X : f^j(x) ≥ p}
Given a pair of inputs x_1 and x_2, we define the linear interpolant path between them as P(λ; x_1, x_2) = λ·x_1 + (1−λ)·x_2, for λ ∈ [0,1]. A given set S is thus said to be convex if P(λ; x_1, x_2) ∈ S, ∀ x_1, x_2 ∈ S and λ ∈ [0,1]. Further, a set S is said to be star-like if there exists some s_0 ∈ S such that P(λ; s_0, x) ∈ S, ∀ x ∈ S and λ ∈ [0,1].
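Membership of an interpolant path in a superlevel set, as defined above, can be checked numerically. Below is a minimal sketch with a toy linear-softmax model; the model, inputs, and confidence threshold are illustrative, whereas the experiments in this chapter use CNNs and ViTs:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def path_in_superlevel_set(f, x1, x2, j, p, num_lambdas=101):
    """Check f^j(P(lambda; x1, x2)) >= p along the linear interpolant path."""
    lambdas = np.linspace(0.0, 1.0, num_lambdas)
    return all(f(lam * x1 + (1 - lam) * x2)[j] >= p for lam in lambdas)

# Toy 3-class model f(x) = softmax(W x). Its superlevel sets are convex
# (the logits are affine in x), so interpolants between members stay inside.
W = np.array([[3.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
f = lambda x: softmax(W @ x)
x1, x2 = np.array([2.0, 0.0]), np.array([1.5, -0.5])
print(path_in_superlevel_set(f, x1, x2, j=0, p=0.9))  # True
```

For deep networks the superlevel sets are not convex, which is what makes the observed star-like substructure (interpolants from a common source staying in the set) an empirical finding rather than a foregone conclusion.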
3.2.1 Conjugate Nature of Adversarial and Confidence Preserving Perturbations
Given a classification model f and a correctly classified benign input (x, y), an adversarial image is a specially crafted image x̃ = x + ε such that both x and x̃ appear near-identical to a human oracle, but x̃ induces the network to misclassify, i.e. C_f(x + ε) ≠ C_f(x) = y. To enforce imperceptibility in a computationally tractable manner, several adversarial threat models have been proposed, with the ℓ2 and ℓ∞ norm constraint models being the most popular. Amongst the earliest adversarial attacks specific to the latter threat model was the Fast Gradient Sign Method (FGSM) attack, proposed by [127], wherein the adversarial perturbation is found by single-step direct ascent along the local gradient with pixel-wise clipping. A stronger multi-step variant of this attack called Iterated-FGSM (IFGSM) was later introduced by [210], wherein iterated gradient ascent is performed alternately with projection operations onto the constraint set. A popular variant, called the Projected Gradient Descent (PGD) attack, was introduced by [253], which incorporates an initial random perturbation to the clean image, observed to help mitigate gradient masking effects [209]. A large class of adversarial attacks [46, 79, 131, 359] thus utilize perturbations parallel to the gradient, with appropriate projection operations to ensure constraint satisfaction. In contrast, perturbations that leave the network prediction confidence unchanged are locally orthogonal to the gradient direction. Indeed, for any differentiable function g : R^d → R, denote the level set as L_g(c) for a given output c ∈ R. Let γ(t) : [0,1] → L_g(c) be any differentiable curve within the level set. Then, g(γ(t)) = c ∀ t ∈ [0,1]. Thus, d/dt (g(γ(t))) = 0 = ⟨∇g(γ(t)), γ′(t)⟩, implying that γ′(t) is orthogonal to ∇g(γ(t)), ∀ t ∈ [0,1].
Since this is true for any curve γ contained in the level set, we conclude that the gradient vector is always perpendicular to the level set. Furthermore, we can additionally show that the level set L_g(c) is often a differentiable submanifold, under mild additional conditions on the gradient. Indeed, a level set L_g(c) is said to be regular if ∇g(x) ≠ 0 ∀ x ∈ L_g(c).

Lemma 1. If g : R^d → R is a continuously differentiable function, then each of its regular level sets is a (d−1) dimensional submanifold of R^d.

We present the proof of Lemma 1 in Section 11.1 of the Appendix. We thus observe that adversarial perturbations, largely parallel to the gradient, are locally orthogonal to confidence preserving directions, which correspond to the (d−1) dimensional tangent space of the level set. We take inspiration from this observation to develop a general framework that applies to a broad class of differentiable neural networks, as opposed to previous works [170] that require the network to be invertible to identify confidence preserving perturbations. We also remark that adversarial attacks and level set preserving perturbations are complementary from another perspective as well, as noted by prior works: the former attempts to find inputs that change model predictions without modifying human oracle assignments, while the latter attempts to keep network predictions unchanged though human oracles would likely change their original label assignment. Thus, both classes of adversarial attacks and level set preserving perturbations induce misalignment between oracle and model predictions, and cast light onto independent means of evaluating the coherence of models towards human expectations.

3.3 Theoretical Analysis for Common Models

To better understand the geometry of level set submanifolds for a given prediction confidence threshold, we analyze the nature of confidence preserving perturbations in a few simplified model settings.
1) Linear Functional: First, consider the classification model to be a linear functional f : R^d → R, that is, f(x1 + x2) = f(x1) + f(x2) and f(λx) = λ f(x). Then, by the Riesz Representation theorem, there exists a unique vector w_f ∈ R^d such that f(x) = ⟨w_f, x⟩ ∀ x ∈ R^d. We thus observe that for any vector v orthogonal to w_f, f(x + v) = ⟨w_f, x + v⟩ = ⟨w_f, x⟩ + ⟨w_f, v⟩ = ⟨w_f, x⟩ = f(x). Thus, in this setting, the level sets of f are (d−1) dimensional affine subspaces, namely translates of the subspace {v ∈ R^d : ⟨w_f, v⟩ = 0}. We observe that a similar argument can be extended to affine functions of the form f(x) = ⟨w_f, x⟩ + c, where c ∈ R. We remark that by applying a first-order Taylor series approximation to a general real-valued smooth function, we observe near-affine behavior locally within a small neighborhood: f(x + εa) = f(x) + ε⟨∇f(x), a⟩ + O(ε²∥a∥²). Thus, locally, confidence-preserving perturbations can arise from a (d−1) dimensional plane orthogonal to the gradient. Indeed, we incorporate this implicitly in our proposed Level Set Traversal algorithm, wherein the orthogonal projection ∆x⊥ with respect to the local gradient is iteratively computed, and its relative success can partly be attributed to the orthogonal hyperplane having a large dimension, namely (d−1). Another related setting worthy of note is that of neural networks that utilize ReLU activations, which induce a piece-wise linear structure over tessellated subsets of the input domain. Thus, a given output neuron of a ReLU network functionally has the form f(x) = ⟨w, x⟩ + c within a given tessellated region, and thus has a constant gradient within that region. Further, between two such adjacent regions, the two orthogonal (d−1) dimensional hyperplanes typically intersect over an affine space of dimension at least (d−2).
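As a brief aside, both observations can be checked with a few lines of numpy: perturbations orthogonal to w_f preserve a linear functional exactly, and the gradients of n piece-wise linear regions in generic position share a (d−n)-dimensional common null space. The dimensions and random matrices below are our own arbitrary toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20

# 1) Linear functional f(x) = <w, x>: any v in the orthogonal complement
#    of w preserves the output exactly.
w = rng.normal(size=d)
x = rng.normal(size=d)
v = rng.normal(size=d)
v_perp = v - (v @ w) / (w @ w) * w           # project v onto the hyperplane w-perp
assert abs(w @ (x + v_perp) - w @ x) < 1e-9

# 2) Gradients g_1, ..., g_n of n adjoining piece-wise linear (ReLU) regions:
#    generically, their common null space has dimension d - n.
n = 3
G = rng.normal(size=(n, d))                  # stand-in region gradients
assert d - np.linalg.matrix_rank(G) == d - n
_, _, Vt = np.linalg.svd(G)
V_null = Vt[n:]                              # basis of the (d - n)-dim null space
assert np.allclose(G @ V_null.T, 0.0, atol=1e-10)
```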
Thus, if x1, x2 are two inputs such that their linear interpolant path cuts across n distinct tessellated regions, the common intersection of these orthogonal hyperplanes will typically be (d−n) dimensional, indicating that there exist perturbations which lie in the common null-space of the gradients as defined along each tessellated region that is crossed. Indeed, in the following section, we demonstrate empirically for Residual Networks that though the exact iterative path followed by the Level Set Traversal algorithm is discretized and non-linear, the final outputs so found are often linearly connected through paths of high confidence to the source image, thereby lying within the level set of the original source class. This indicates that the overlap of different (d−1) dimensional hyperplanes is non-trivial at a non-local scale in image space, whereby we observe extended connected regions of high confidence.

2) Full-Rank Linear Transformations: Let us now consider a setting apart from classification, such as regression, wherein the complete vector representation of the output is of principal interest. Let the model be of the form f : R^d → R^d, f(x) = Ax, where A ∈ R^{d×d} is a full-rank matrix. In this setting, we observe that if f(x1) = f(x2), then Ax1 = Ax2, implying that A(x1 − x2) = 0, the zero-vector. But since A is full-rank, this implies that x1 − x2 = 0, i.e., x1 = x2. Thus, f is a bijective function and has a trivial null space. In this setting, we therefore have to relax the problem to that of identifying perturbations of large magnitude to the input that minimally change the function output: that is, we solve for min_{v≠0} ∥Av∥²/∥v∥². Indeed, let the Singular Value Decomposition (SVD) of the full-rank matrix A be given by A = UΣVᵀ, where U, V are d×d orthogonal matrices, and Σ is a diagonal matrix consisting of the positive singular values σ_i for 1 ≤ i ≤ d, in descending order without loss of generality.
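The solution of this minimization can be verified numerically before working through the derivation: in the small sketch below (our own toy matrix), the extremal ratios ∥Av∥²/∥v∥² are attained at the right-singular vectors of A.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
A = rng.normal(size=(d, d))                  # full-rank with probability 1
U, s, Vt = np.linalg.svd(A)                  # singular values s in descending order

ratio = lambda v: np.linalg.norm(A @ v) ** 2 / np.linalg.norm(v) ** 2

# min_{v != 0} ||Av||^2 / ||v||^2 = sigma_min^2, attained at the right-singular
# vector of the smallest singular value ...
assert np.isclose(ratio(Vt[-1]), s[-1] ** 2)
# ... while the adversarial counterpart max ||Av||^2 / ||v||^2 = sigma_max^2
assert np.isclose(ratio(Vt[0]), s[0] ** 2)
# No random direction does better than the minimizer:
for _ in range(100):
    v = rng.normal(size=d)
    assert ratio(v) >= s[-1] ** 2 - 1e-9
```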
Then, ∥Av∥ = ∥UΣVᵀv∥ = ∥U(ΣVᵀv)∥ = ∥ΣVᵀv∥ since U is an orthogonal matrix. Similarly, since V is orthogonal as well, let z = Vᵀv, so that ∥z∥ = ∥v∥. Thus,

min_{v≠0} ∥Av∥²/∥v∥² = min_{z≠0} ∥Σz∥²/∥z∥² = min_z ∥Σz∥² such that zᵀz = 1.

But since Σ is diagonal, ∥Σz∥² = Σ_{i=1}^{d} σ_i² z_i², under the constraint that ∥z∥² = 1. It is then easy to observe that the minimum value attained is σ_d² = σ_min², attained when the input vector to f is the right-singular vector corresponding to the minimum singular value of A. We remark that adversarial perturbations, in contrast, can be formulated in this setting as max_{v≠0} ∥Av∥²/∥v∥², with the maximum value given by σ_max² and attained by the right-singular vector corresponding to the maximum singular value of A, highlighting the complementary nature of the two problems as indicated previously in Section 3.2.1. Indeed, the condition number κ of an invertible matrix A is defined as κ = σ_max/σ_min = ∥A∥·∥A⁻¹∥. If the condition number is large, with κ ≫ 1, the matrix A is said to be ill-conditioned, and induces an inescapable compromise: if σ_max = ∥A∥ ≈ 1 (as potentially expected in "robust" networks), then 1/σ_min = ∥A⁻¹∥ ≫ 1 is necessarily large, thereby inducing extreme under-sensitivity along some dimensions; while if 1/σ_min = ∥A⁻¹∥ ≈ 1, then σ_max = ∥A∥ ≫ 1 and the model is extremely over-sensitive, akin to the phenomenon of adversarial vulnerability.

3.4 Proposed Method: Level Set Traversal

We now describe our algorithm for traversing the level set, which we call the Level Set Traversal (LST) algorithm (Algorithm 1). We try to find a path from a source image to a target image such that all points on that path are classified by the model as the source class with high confidence. Given that these (d−1) dimensional level set submanifolds can potentially be highly complex in their geometries, we use a discretized approximation with small yet finite step sizes to tractably explore these regions.
Let the source image be x_s with true label y_s, and let g = ∇_x CE(f(x), y_s) be the gradient of the cross-entropy loss of the model prediction with respect to x. The key idea is to get as close as possible to a target image x_t from some image x by computing the projection of ∆x = x_t − x onto the orthogonal complement of g, to obtain a new image x_new that leaves the model confidence unchanged. We can compute this projection by subtracting the component of ∆x parallel to g, that is, x_∥g = g(∆xᵀg/∥g∥²) (L6). Then, using a scale factor η, the vector update to x can be expressed as ∆x⊥ = η(∆x − x_∥g) (L7), and therefore x_new = x + ∆x⊥. Starting from the source image x_s, we repeatedly perform this iteration to get reasonably close to the target image, while carefully ensuring that the confidence of the model prediction at any given point does not drop below a preset confidence threshold δ relative to the source image (L10).

Figure 3.1: Pictorial representation of the LST algorithm. The contour lines represent level sets of the model confidence f(x), with blue representing high confidence (low loss) and red representing low confidence (high loss). At each iteration, we obtain x_new by adding two vectors, ∆x⊥ = η(∆x − c_∥g) (the projection of ∆x onto the orthogonal complement of g) and x_∥ (a small perturbation to increase the confidence and remain within the level set).

Figure 3.2: Intermediate images over the path traversed by the LST algorithm using a source image of a ‘goose’ and a target image of a ‘Scottish terrier’ (a dog) for a normally trained ResNet-50 model. We observe that the model predicts the ‘goose’ class with very high confidence for all images over the path, though the target blindspot found by LST clearly appears as a ‘dog’ to human oracles.
Algorithm 1 Level Set Traversal (LST)
1: Input: Source image x_s with label y, target image x_t, model f, max iterations m, scale factor η, step size ε, confidence threshold δ
2: Initialize x = x_s, x_∥ = 0
3: for i = 1 to m do
4:   ∆x = x_t − x
5:   g = ∇_x CE(f(x), y)
6:   c_∥ = (g · ∆x)/∥g∥²
7:   ∆x⊥ = η(∆x − c_∥ g)
8:   x_∥ = Π∞(x_∥ − εg, −ε, ε)
9:   x_new = x + ∆x⊥ + x_∥
10:  if f(x_s)[j] − f(x_new)[j] > δ then
11:    return x
12:  end if
13:  x = x_new
14: end for
15: return x

While the above method works well with a relatively large η if the curvature of f is low, it risks a non-trivial drop in model confidence (or even a change in the model prediction) if the curvature of f at x is high enough. Therefore, after each step, we add a small perturbation vector x_∥ in the direction of −g, scaled by a factor of ε. This step decreases the cross-entropy loss, and thus increases the confidence so as to offset any confidence drops incurred due to the addition of ∆x⊥. Thus, we can maintain a higher model confidence over the path. In order to ensure that the norm of this perturbation is not arbitrarily large, we bound its ℓ∞ norm by projecting the vector onto an ℓ∞ ball of radius ε. Then, the step x_∥ can be computed as clamp(εg, −ε, ε), where the function clamp(v, a, b) is the element-wise application of the function min(max(·, a), b) to all elements of v. Thus, we modify L9 so that now x_new = x + ∆x⊥ − x_∥. We present a pictorial schematic of Algorithm 1 in Fig 3.1. We further employ an exponential moving average of x_∥, in order to smoothen components that potentially undulate heavily during the iterative traversal. Since the frequent changes in x_∥ are thus smoothened out, we find that the final output images are often linearly connected to the source image with high confidence over the linear interpolations (see Fig 3.6). We now apply this algorithm to explore the level sets of standard and robust ResNet-50 models.
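To make the procedure concrete, here is a minimal numpy sketch of Algorithm 1 on a toy softmax-over-linear-logits "model". This is our own stand-in (the experiments in this chapter use ResNet/DeiT models), it omits the exponential moving average of x_∥, and the hyperparameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
d, N = 20, 3
W = rng.normal(size=(N, d)) * 0.3            # toy classifier: f(x) = softmax(W x)

def f(x):
    z = W @ x
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_ce(x, y):
    # gradient of CE(f(x), y) with respect to x, for the toy model
    return W.T @ (f(x) - np.eye(N)[y])

def lst(x_s, y, x_t, m=200, eta=0.05, eps=0.001, delta=0.1):
    """Level Set Traversal (Algorithm 1), without the EMA smoothing."""
    x, x_par = x_s.copy(), np.zeros(d)
    p_src = f(x_s)[y]
    for _ in range(m):
        dx = x_t - x
        g = grad_ce(x, y)
        c = (g @ dx) / (g @ g)                       # L6
        dx_perp = eta * (dx - c * g)                 # L7: move orthogonal to g
        x_par = np.clip(x_par - eps * g, -eps, eps)  # L8: confidence-restoring step
        x_new = x + dx_perp + x_par                  # L9
        if p_src - f(x_new)[y] > delta:              # L10: confidence floor
            return x
        x = x_new
    return x

x_s = rng.normal(size=d)
y = int(np.argmax(f(x_s)))
x_t = rng.normal(size=d)
out = lst(x_s, y, x_t)
# Source-class confidence stays within delta of the source image's confidence,
# while the iterate has moved toward the target image
assert f(out)[y] >= f(x_s)[y] - 0.1
assert np.linalg.norm(out - x_t) <= np.linalg.norm(x_s - x_t)
```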
In Fig 3.2, we present the path traversed by the LST algorithm, wherein the model predicts the source ‘goose’ class with very high confidence over the entire path, though the target blindspot found by LST clearly appears as a ‘dog’ to human oracles. To substantiate the efficacy of the LST algorithm, we randomly select five images from five arbitrary ImageNet classes (‘goose’, ‘Scottish Terrier’, ‘meerkat’, ‘academic gown’, ‘cleaver’), and compute LST blindspots for all possible source-target image pairs. We show the final images output by the LST algorithm in Fig 3.3 as an image-grid. If we order the five selected images, the ith row and jth column of the image-grid is the LST output obtained using the ith image as target and the jth image as source. Thus, each column represents the source image being transformed iteratively into the other target images.

Figure 3.3: The images returned by LST for 5 random source and target images for normally trained (left) and adversarially trained (right) ResNet-50 models (grid axes: class label predicted by the model vs. class label perceived by humans). The image in the ith row and jth column is the LST output obtained using the ith image as target and the jth image as source. The confidence of the model prediction for the source class (names on top of each column) is displayed just below each image. For example, all four LST output blindspots in the first column (highlighted), using the source ‘goose’ image, are predicted to be of the ‘goose’ class with very high confidence. Diagonal images are unchanged, as source equals target. We observe that almost any source image can be iteratively modified using LST to resemble any target image very closely without any loss in confidence for both normal and adversarially trained ResNet-50 models.

The confidence of the
model prediction for the source class (names on top of each column) is displayed just below each image. For both the normally trained and adversarially trained models, these images are almost indistinguishable from the target while retaining high model confidence for the source class. Since adversarially robust models have perceptually aligned gradients, we can sometimes visually notice a few traces of the source image in the final LST image; for example, the ‘meerkat’ image in the 3rd row, 2nd column on the right side of Fig 3.3 has some traces of the source ‘terrier’ image, but the differences are usually hard to perceive. We also examine the model confidence over linear interpolations between the source image and LST outputs for all target pairs in Fig 3.6. Formally, consider a source image x_s with label y, and let x_op represent the LST output when applied toward a target image x_t. Denote the difference vector as ∆v = x_op − x_s. Then, we observe that x_s + α∆v is assigned high confidence with respect to class y by the model ∀ α ∈ [0,1], which represents the entire linear interpolant path between x_s and x_op. Furthermore, we observe that the path discovered by the Level Set Traversal algorithm enjoys two key properties: (1) Uniqueness (once the target image is fixed) and (2) Extremality:

(1) Uniqueness: Since the local tangent space of the level set is (d−1) dimensional, several independent directions are orthogonal to the local gradient, which a priori do not yield a unique path like a gradient-flow. However, once we fix our target image, we use its difference vector with respect to the current iterate (L4) and compute its projection onto the local tangent space (L7) of the level set, thereby generating a uniquely defined path.
(2) Extremality: Though this flow-based path may be non-linear, we additionally discover that the final output-point of this flow is surprisingly linearly connected with high confidence to the source image, after we apply discretized approximations in practical settings for common vision models. Formally, the LST output x_op is linearly extremal in the sense that x_s + (1+ε)∆v is rapidly assigned low confidence by the model even for extremely small values of ε > 0, where ∆v = x_op − x_s. Thus, using the LST algorithm, we find that the level sets of common models extend outwards in an expansive, connected manner to include images of arbitrary classes, which human oracles would never state as being similar. Since the linear path from any given source image to LST outputs for arbitrary target images retains high model confidence throughout, it unveils a remarkable star-like substructure for superlevel sets, as shown in Fig 3.4, where the number of “limbs” or linear protuberances of the star-like structure is extraordinarily large, plausibly as large as the number of images in all other classes. Furthermore, to study the size of the level sets beyond one-dimensional interpolant paths, we analyze the two-dimensional triangular convex hull between a given source image and two LST output blindspot images, using quantitative metrics in Section 3.6.

3.5 Disconnected Nature of Standard Adversarial Attacks

At first glance, the images output by the LST algorithm may seem similar to those produced using targeted variants of standard adversarial attacks. In this section, we explore the connectivity of high-confidence paths as induced by standard adversarial attacks, and show that adversarially attacked source images are not linearly connected via high-confidence paths to their target images. In detail, let (x1, y1) and (x2, y2) be any two data samples from different classes (y1 ≠ y2).
A targeted adversarial attack [51] on input x1 with respect to class y2 is formulated as solving ε12 = argmin_{ε12 : ∥ε12∥ ≤ ε} CE(f(x1 + ε12), y2). A related variant, called a feature-level targeted adversarial attack [326], instead uses intermediate network activations to craft image perturbations. If f_|L(x) represents the network activations from hidden layer L for an input x, then this attack attempts to match features by optimizing ε12 = argmin_{ε12 : ∥ε12∥ ≤ ε} ∥f_|L(x1 + ε12) − f_|L(x2)∥. The hidden layer L that is selected often corresponds to pre-activations of the final fully-connected layer, and thus often induces misclassification.

Figure 3.4: Schematic of the star-like set substructure of superlevel sets: The linear interpolant paths between the source image x_s and blindspots found using LST maintain high confidence throughout for arbitrary target images of other classes.

Figure 3.5: Model Confidence over Linear Interpolant Paths: We plot the median prediction confidence with respect to class y2 over linear paths between the target benign sample (x2) and adversarial examples (such as x1 + ε12 targeted towards input x2) in input space. While model confidence falls to near-zero using standard targeted adversarial attacks, the linear path between the LST output and source image has high confidence of almost 1.0 throughout.

Using this framework, we can thus analyze the connectivity of high-confidence paths with respect to class y2 between the target benign sample (x2) and adversarial examples (such as x1 + ε12 targeted towards x2), using the linear interpolant path between the two. By doing so, we can analyze the level set with respect to class y2 using targeted adversarial examples alone, though the perturbation utilized is inherently norm-limited. In Fig 3.5, we plot the median confidence of a normally trained ResNet-50 model over the linear interpolant paths for 1000 source-target pairs.
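For intuition, the targeted objective argmin_{∥ε∥∞ ≤ ε} CE(f(x1 + ε12), y2) can be sketched as projected gradient descent on a toy softmax model. This is our own stand-in (the chapter's experiments attack a ResNet-50), and the budget and step size below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
d, N = 20, 3
W = rng.normal(size=(N, d)) * 0.3            # toy model: f(x) = softmax(W x)

def f(x):
    z = W @ x
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_ce(x, y):
    return W.T @ (f(x) - np.eye(N)[y])

def targeted_attack(x, y_tgt, eps=1.0, steps=300, lr=0.1):
    # eps_12 = argmin_{||e||_inf <= eps} CE(f(x + e), y_tgt), via projected descent
    e = np.zeros(d)
    for _ in range(steps):
        e -= lr * grad_ce(x + e, y_tgt)
        e = np.clip(e, -eps, eps)            # project back onto the l_inf ball
    return e

x1 = rng.normal(size=d)
y2 = (int(np.argmax(f(x1))) + 1) % N         # target a class other than the prediction
e12 = targeted_attack(x1, y2)
assert int(np.argmax(f(x1 + e12))) == y2     # the attacked input is classified as y2
```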
We find that though the model confidence with respect to class y2 is high for both the target benign input x2 and the targeted adversarially attacked input x1 + ε12 (that is, at the end-points of the linear path), the model does not maintain high confidence over the linear interpolant path between the two, but sharply declines to a valley of near-zero confidence over the path. For specific inputs, we often observe sharp, erratic diminutions in target-class prediction confidence over the convex combination, with a non-trivial region of near-zero confidence. This contrasts sharply with the linear connectivity observed between a source image and LST blindspot images. We thus crucially note that adversarial attacks are insufficient on their own to study the structure of level sets of common vision models. However, we incorporate them into our proposed Level Set Traversal algorithm in a fruitful manner.

3.6 Quantifying Under-Sensitivity of Vision Models

In this paper, we primarily consider standard vision datasets such as ImageNet [89] and CIFAR-10 [201] (the latter in Section 11.1 of the Appendix). We thereby explore the “blindspots” of popular vision models such as ResNet [146] and Vision Transformers [98, 374]. Furthermore, we explore the connectivity of such level sets on normally trained variants of such networks, as well as their adversarially robust counterparts. For the latter, we analyze robust models trained adversarially against (4/255) ℓ∞-constrained adversaries, available on RobustBench [78]. Specifically, we utilize a robust ResNet-50 model from [333] and a robust DeiT model from [353]. We also fix the hyperparameters of LST across all models for a fair comparison (detailed in Section 11.1 of the Appendix).

Image Quality Metrics: We now proceed to quantitatively verify our observations in Section 3.4.
First, to help quantify the deviation between target images and blindspot outputs generated by our proposed Level Set Traversal algorithm, we utilize a combination of classical image metrics, such as RMSE, ℓ∞ distance, and the Structural Similarity Index (SSIM), and perceptual measures, such as LPIPS distance using AlexNet [437]. For SSIM, a higher value indicates a closer match, while for all other metrics, a lower value indicates a closer match. To calculate these metrics, we sample around 1000 source images from ImageNet, and select five other random target images of different classes for each source image. We present the image quality metrics for blindspots discovered by LST in Table 3.1. Here, the standard deviation is over the different randomly chosen source and target images.

Metrics for Model Confidence: To evaluate the extent of the model invariance over the regions between the LST outputs and the source image, we evaluate the model confidence with respect to the source class over one-dimensional linear interpolant paths, as well as over two-dimensional subspaces. For the latter, we evaluate the model confidence over the triangular convex hull obtained by linear interpolation over three reference points, namely the source image and two target blindspot images produced using LST. For example, the input at the ‘centroid’ of the triangle formed by a source image and a pair of target blindspots is the arithmetic mean of the three images. We visualize these in Fig 3.6, wherein the prediction confidence (in the range [0,1]) assigned by the model with respect to the source class is mapped to a continuous colorbar, with high-confidence points (close to 1.0) appearing as bright yellow, and low-confidence points (close to 0.0) appearing as dark violet.
Specifically, we use the following metrics: (a) Average Triangle (∆) Confidence: the mean of the model’s source-class confidence over the enclosed triangle; (b) Average Triangle (∆) Fraction for various values of δ: the fraction of inputs in the triangular region for which the model confidence is greater than p_src − δ, averaged over all possible target blindspot pairs, where p_src is the confidence on the source image; (c) Average Path Confidence: the average model confidence over all linear paths from the source image to all LST blindspot images. The higher these metrics, the more confident, and thus invariant, the model is in this region. For computing these metrics, we use linear interpolations between the source images and the 5 LST outputs found previously for computing the distance metrics. We thus use C(5,2) = 10 triangles for each source image, and sample each triangular area in an equispaced manner to obtain 66 images for the computation of the triangle (∆) metrics. We present these metrics in Table 3.2, along with the mean (and standard deviation) of model confidence on the source class images (p_src) for reference.

Table 3.1: Quantitative image distance metrics between outputs of Level Set Traversal and target images.

Models             | RMSE (µ±σ)    | ℓ∞ dist (µ±σ) | SSIM (µ±σ)    | LPIPS dist (µ±σ)
ResNet-50 (Normal) | 0.008 ± 0.001 | 0.046 ± 0.020 | 0.990 ± 0.021 | 0.002 ± 0.004
ResNet-50 (AT)     | 0.029 ± 0.008 | 0.746 ± 0.124 | 0.915 ± 0.041 | 0.057 ± 0.037
DeiT-S (Normal)    | 0.011 ± 0.002 | 0.116 ± 0.030 | 0.973 ± 0.024 | 0.024 ± 0.017
DeiT-S (AT)        | 0.046 ± 0.010 | 0.821 ± 0.117 | 0.898 ± 0.041 | 0.219 ± 0.068

Table 3.2: Quantitative confidence metrics over the triangular convex hull (∆) of a given source image and two target LST blindspot image-pairs, and over linear interpolant paths between source and blindspot images. (For reference, a random classifier would have a confidence of 0.001.)
Models             | p_src (µ±σ) | Avg ∆ Conf. (µ±σ) | Avg ∆ Frac. (µ±σ), δ = 0.0 | δ = 0.1 | δ = 0.2 | δ = 0.3 | Avg Path Conf. (µ±σ)
ResNet-50 (Normal) | 0.99 ± 0.02 | 0.56 ± 0.10 | 0.13 ± 0.15 | 0.51 ± 0.11 | 0.53 ± 0.1 | 0.54 ± 0.10 | 0.96 ± 0.05
ResNet-50 (AT)     | 0.88 ± 0.11 | 0.83 ± 0.09 | 0.49 ± 0.29 | 0.79 ± 0.13 | 0.85 ± 0.1 | 0.88 ± 0.09 | 0.93 ± 0.06
DeiT-S (Normal)    | 0.85 ± 0.06 | 0.68 ± 0.05 | 0.54 ± 0.11 | 0.67 ± 0.06 | 0.71 ± 0.06 | 0.73 ± 0.06 | 0.94 ± 0.02
DeiT-S (AT)        | 0.76 ± 0.08 | 0.59 ± 0.07 | 0.20 ± 0.09 | 0.43 ± 0.14 | 0.63 ± 0.15 | 0.76 ± 0.12 | 0.73 ± 0.06

Here, the standard deviation is over the different randomly chosen source images. We can now quantitatively confirm many of the trends we qualitatively observed in Section 3.4. In Table 3.1, we observe that the LST outputs found for the normally trained models are closer to the targets than those for the adversarially trained models. The difference is particularly stark when comparing the ℓ∞ distances, as ℓ∞ is a particularly sensitive metric and is also the threat model against which the robust models were trained. However, for other distance metrics like RMSE or LPIPS, the difference between the normally trained and adversarially trained ResNet-50 is not as large. In particular, the LPIPS distances for both models are low, which indirectly implies that the perceived difference between the images for human oracles is relatively low. This can also be confirmed by visually inspecting the images in Fig 3.3. For the normally trained DeiT-S, the distance metrics are very similar to those of ResNet-50, with slightly higher RMSE and LPIPS distances.

Figure 3.6: Visualization of confidence of standard (top) and robust (bottom) ResNet-50 models over the triangular convex hull of a source ‘goose’ image and two LST outputs for all pairs of target images from 4 other classes (same images as in Fig 3.3).
In all source-target image pairs, the linear interpolant path maintains high confidence, implying that the source image is linearly connected to the LST target output in a star-like substructure within the level set. For adversarially trained models, we observe that a significant fraction of the triangular hull lies in the superlevel sets of high confidence, thereby indicating their under-sensitivity in regions far beyond their original threat model.

However, the adversarially trained (AT) variant is significantly less under-sensitive compared to both its normal counterpart and ResNet-50 (AT). Specifically, the LPIPS distance metric is much greater for DeiT-S (AT), which implies there exist significant human-perceptible differences in the images. We can confirm this in Fig 11.1, where the LST output for the adversarially trained DeiT-S model contains visible traces of the source image. Nevertheless, the LST outputs are still clearly much closer to the target class than to the source, which indicates that there are still significant blind spots in DeiT-S (AT). However, when we measure the extent of model invariance over the convex regions enclosed between the LST output and the target images, we find that the adversarially trained ResNet-50 is overall more invariant (or under-sensitive) than the normally trained variant. For example, the average triangle confidence for ResNet-50 (AT) is higher than that of the normally trained ResNet-50, even though its source confidence is much lower. We also find that a much larger fraction of the triangular convex hull lies within the superlevel set for p_src − δ for ResNet-50 (AT) as compared to the normal ResNet-50, for all values of δ. The average path confidence is much closer to p_src for ResNet-50 (AT) as compared to the normally trained ResNet-50. This quantitatively verifies the observation made in Fig 3.6 that adversarial training demonstrably exacerbates under-sensitivity.
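The equispaced triangular sampling behind these ∆ metrics can be sketched in a few lines. The code below is our own illustration (the arrays are stand-ins, not real images); with a barycentric resolution of n = 10 it yields exactly the 66 samples per triangle and the C(5,2) = 10 triangles per source image used in the evaluation.

```python
import numpy as np
from itertools import combinations
from math import comb

def triangle_samples(x_src, x_b1, x_b2, n=10):
    # Equispaced barycentric grid over the convex hull of the three images:
    # points (i*x_src + j*x_b1 + k*x_b2)/n with i + j + k = n, giving
    # (n + 1)(n + 2)/2 samples (66 for n = 10)
    pts = [(i * x_src + j * x_b1 + (n - i - j) * x_b2) / n
           for i in range(n + 1) for j in range(n + 1 - i)]
    return np.stack(pts)

d = 8
x_src = np.zeros(d)
blindspots = [np.eye(d)[i] for i in range(5)]     # stand-ins for 5 LST outputs
pairs = list(combinations(blindspots, 2))
assert len(pairs) == comb(5, 2) == 10             # 10 triangles per source image

grid = triangle_samples(x_src, *pairs[0])
assert grid.shape == (66, d)
# Each sample is a convex combination of the three reference images, so an
# "average triangle confidence" is just the mean model confidence over the grid.
```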
Interestingly, these robust models are under-sensitive over subsets of the input domain that lie well beyond the original threat model used in their training. Moreover, between the normally trained DeiT-S and ResNet-50 models, the former appears to be more invariant, with greater average confidence over the triangular convex hulls, despite having lower source image confidences p_src. For the robust variant of DeiT-S, however, the trend is less apparent due to the significantly lower average source image confidences p_src. However, the average relative ∆ fraction becomes higher for larger values of δ (such as 0.3), indicating that the superlevel sets are indeed expansive, albeit for lower confidence thresholds.

3.7 Conclusion I

While the existence of level sets alone is not very significant in itself, using the proposed LST algorithm, we find that the level set for common vision models is remarkably expansive — large enough to contain inputs that look near-identical to arbitrary target images from other classes. Since the linear path from any given source image to LST blindspot outputs retains high model confidence throughout, the level sets have a star-like connected substructure, where the number of ‘limbs’ or linear protuberances of the star-like structure is extraordinarily large, plausibly as large as the number of images in all other classes. This is considerably noteworthy since it indicates the hitherto unknown and unappreciated scale and extent of under-sensitivity in common vision models. Moreover, this hints at the potential degree of difficulty of adequately mitigating this phenomenon in practical settings.
For instance, if the level set for images of class y1 contained sizable protuberances towards only one other class y2 alone, the problem could perhaps be tackled by introducing a contrastive training objective that encourages the network to better discriminate between y1–y2 image pairs by utilizing a denser sampling of related image augmentations, likely resulting in the diminution of these specific ‘directed’ protuberances (assuming reasonable train-test generalization). But since the star-like connected substructure uncovered by LST implies that such protuberances exist towards any generic image of any other class, such simple approaches will likely be ineffective and possibly computationally infeasible from a combinatorial perspective. Thus, based on the observations uncovered with LST, we hypothesize that addressing the pervasive issue of under-sensitivity in conventional vision models might present a significantly non-trivial challenge.

3.8 Provable Robustness For Streaming Models

Deep neural network (DNN) models are increasingly being adopted for real-time decision-making tasks. They are often required to make predictions on an evolving stream of inputs in applications like algorithmic trading [112, 196, 198, 288, 435], human action recognition [285, 317, 415] and speech detection [90, 132, 156]. However, DNNs are known to malfunction under tiny perturbations of the input, such as an imperceptible noise added to an image, designed to fool them into making incorrect predictions [28, 54, 128, 256, 366]. This vulnerability is not limited to static models like classifiers, and has been demonstrated for streaming models as well [24, 25, 38, 270]. Such input corruptions, commonly known as adversarial attacks, make DNNs especially risky for safety-critical applications such as health monitoring [44, 167, 222, 360] and autonomous driving [34, 175, 412]. Over the years, a long line of research has been dedicated to mitigating this weakness of DNNs.
These methods seek to improve the robustness of a model by introducing input corruptions during training [41, 96, 124, 133, 138, 211, 235, 330]. However, such empirical defenses have been shown to break down under newer and stronger attacks [16, 54, 377, 380]. This motivated the study of provable robustness in machine learning (ML), which seeks to obtain verifiable guarantees on the adversarial performance of a DNN. Several certified robustness techniques have been developed over the years, most notable of which are based on convex relaxation [63, 309, 354, 356, 407], interval-bound propagation [103, 130, 162, 266] and randomized smoothing [71, 221, 225, 228, 334]. However, most of this work focuses on static tasks with independently generated inputs, and the adversarial streaming setting still remains open. What makes the streaming setting more challenging is that the adversary can choose the attack budget based on previous inputs. For instance, it could wait for a critical decision-making point, such as a trading algorithm making a buy/sell recommendation or an autonomous vehicle approaching a stop sign, before generating an adversarial perturbation. In this work, we derive provable robustness guarantees for the streaming setting, where inputs are presented as a sequence of potentially correlated items. We design certificates that produce guarantees on the average model performance over long, potentially infinite, data streams. Our threat model is defined as a man-in-the-middle adversary present between the DNN and the data stream. It can perturb the input items before they are observed by the DNN. The adversary is constrained by a limit on the average size of the perturbation added to the inputs. We show that a DNN that randomizes the inputs before making predictions is guaranteed to achieve a