ABSTRACT

Title of dissertation: ROBUSTNESS, DETECTABILITY, AND DATA PRIVACY IN AI
Vinu Sankar Sadasivan, Doctor of Philosophy, 2025
Dissertation directed by: Professor Soheil Feizi, University of Maryland

Recent rapid advancements in Artificial Intelligence (AI) have made it widely popular in applications ranging from autonomous systems to multimodal content generation. However, these models are vulnerable to several security and safety problems. These pitfalls can allow attackers to jailbreak AI systems, enabling harmful tasks and the leakage of sensitive information. As AI models are increasingly deployed in sensitive applications, such as autonomous robots and healthcare devices, the importance of AI safety is growing. To address these issues in today’s AI systems, it is critical to study their vulnerabilities. In this thesis, we explore fundamental aspects of AI safety and security, including robustness, detectability, and data privacy. We analyze challenges posed by adversaries to contribute toward enhancing AI safety and security. First, we examine the robustness of discriminative models, identifying blind spots in computer vision models. We discover the existence of high-confidence star-like paths in the input space, along which models consistently predict the same output label despite the addition of significant noise. We also develop provable robustness certificates for streaming classification models. Next, we analyze the challenges of detecting AI-generated content in the text and image modalities. Through empirical and theoretical studies, we find that detecting AI content is a difficult task, especially as generative models improve over time. We demonstrate attacks on a wide range of text and image detectors, showing that adversaries can efficiently break these systems to induce both type-I and type-II errors.
We then investigate the robustness of generative Large Language Models (LLMs), focusing on adversarial attacks and hallucinations. Our fast, beam search-based adversarial attacks can compromise LLMs in under a GPU minute. We also demonstrate that multi-modal LLMs have a larger attack surface and are more vulnerable to adversarial attacks in the real world. These attacks can jailbreak LLMs, induce hallucinations, and carry out privacy attacks. To mitigate these risks, we propose a suite of methods for detecting hallucinations in both white-box and black-box settings by analyzing the internal dynamics and output probabilities of LLMs. Lastly, we address user privacy concerns by proposing a method to create unlearnable datasets through data poisoning. This method, which is model-agnostic and computationally efficient, ensures that datasets used without user consent become unlearnable by AI systems. In summary, this thesis analyzes various attack frameworks to understand the robustness and security vulnerabilities of AI models. We hope the insights provided by this work will be useful in designing safer, more secure AI systems.

ROBUSTNESS, DETECTABILITY, AND DATA PRIVACY IN AI

by Vinu Sankar Sadasivan

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2025

Advisory Committee:
Professor Soheil Feizi, Chair/Advisor
Professor Hernisa Kacorri, Dean’s Representative
Professor Behtash Babadi
Professor David Jacobs
Professor Dinesh Manocha
Professor Abhinav Shrivastava

© Copyright by Vinu Sankar Sadasivan 2025

Dedication

To my family — mother, father, brother, Timmy, Monash, and grandparents — and to God!

Acknowledgments

I started my PhD in Fall 2021. Moving to a new country and leaving my home was challenging.
I am grateful to the people and circumstances that contributed to, supported, and inspired me throughout my journey in graduate school. I believe my efforts would be meaningless without acknowledging them. I would like to start by thanking my advisor, Professor Soheil Feizi. Half of a PhD student’s struggles become invisible with a great advisor, and I could not have asked for a more understanding and supportive one. His advice and mentorship always guided me along the right path during my PhD. Professor Feizi gave me all the freedom to do great research and supported me whenever I was in a dilemma, shaping the PhD thesis that I have today. The most important part of my whole life has been my beloved family. I am the luckiest to have the most supportive parents – Rekha, my mother, and Sadasivan, my father – who have always stood by my side. Their sacrifice and hard work have been the major driving force behind my brother’s and my successes in our lives. I will be forever indebted to my parents for their contributions. I believe teachers play the most important role in forming the intellectual curiosity of a child. For me, the most important teacher, or Guru, has been my brother, Hari. In my childhood, he was the one who motivated and inspired me the most to follow the path that took me where I am today. I am extremely thankful to have him as my mentor and brother. I would also like to remember my late pet Timmy, who was a sibling to me. I am sure she would have been proud of me and greatly delighted as I complete my PhD requirements. I would also like to dedicate my PhD efforts to my grandparents, who were a significant part of and influence on my life. Acknowledging my family without thanking my girlfriend, Monash, would be incomplete. She has been an integral part of my life, especially supporting me through some of the most stressful and low points in the final phase of my PhD. I am deeply grateful to have her in my life.
I would also like to thank her parents, who wish the best for me. My journey at UMD would have been far more difficult without my friends in College Park. A special thanks to my roommate and friend, Souradip, who made my life in College Park far more interesting and fun-filled. Thanks also to my other friends at UMD – Kishen, Roshan, Peeyush, Maharshi, Gaurang, Shelvin, Sharmila, Raju, Shah, Panky, Shishira, and others – for making my life here more lively. I would like to thank all my amazing labmates, especially Gaurang, Sriram, Neha, Samyadeep, and Sahil. I have had the opportunity to work with several researchers throughout my career, from whom I have learned invaluable skills – from Prateek and Harsha at Microsoft Research; from Professor Chandra at IISc; from Ashish at Caltech; from Pierre, Matthijs, and Jakob at Meta; and from Lun and Rajiv at Google. I thank all my mentors and teachers. I was also fortunate to have mentors and well-wishers who have helped me make several important decisions in my life – Professor Nithin, Professor Anirban, and Dr. Dilip. I am indebted to them for all the times they answered my questions. Finally, I thank God for everything.

Table of Contents

Dedication ii
Acknowledgments iii
1 Introduction 1
  1.1 Thesis Statement 1
  1.2 Thesis Overview 1
  1.3 Thesis Contributions 3
  1.4 Publications and Authorship 8
2 Related Works 10
  2.1 Exploring Geometry Of Blind Spots In Vision Models 10
  2.2 Robustness for Streaming Models 11
  2.3 AI-Generated Text Detection 13
  2.4 AI-Generated Image Detection And Data Retrieval 14
  2.5 Robustness And Privacy in Large Language Models 17
  2.6 Hallucination Detection In Large Language Models 19
  2.7 Robustness Of Discriminative Models 21
3 Investigating Robustness Of Classification Models 23
  3.1 Exploring Geometry of Blind Spots in Vision Models 23
  3.2 Preliminaries 25
    3.2.1 Conjugate Nature of Adversarial and Confidence Preserving Perturbations 26
  3.3 Theoretical Analysis for Common Models 28
  3.4 Proposed Method: Level Set Traversal 30
  3.5 Disconnected Nature of Standard Adversarial Attacks 36
  3.6 Quantifying Under-Sensitivity of Vision Models 39
  3.7 Conclusion I 43
  3.8 Provable Robustness For Streaming Models 44
  3.9 Preliminaries and Notation 49
  3.10 Robustness Certificate 52
  3.11 Attacking Each Window 54
  3.12 Experiments 55
    3.12.1 Attacking each input only once 56
    3.12.2 Attacking each window 59
  3.13 Conclusion II 60
4 Hardness of AI-Generated Text Detection 61
  4.1 Introduction 61
  4.2 Evading AI-Detectors using Paraphrasing Attacks 67
    4.2.1 Attack Setup and Paraphrasing Methods 67
    4.2.2 Quality of the Paraphrases 70
    4.2.3 Paraphrasing Attacks on Watermarked AI Text 72
    4.2.4 Paraphrasing Attacks on Non-Watermarked AI Text 75
  4.3 Adversarial Paraphrasing For Universal AI Text Humanization 77
  4.4 Spoofing Attacks on Generative AI-text Models 79
  4.5 Hardness of Reliable AI Text Detection 83
  4.6 Conclusion 86
5 Robustness of AI Image Detectors: Fundamental Limits And Practical Attacks 89
  5.1 Introduction 89
  5.2 Robustness of Image Watermarking for AI-Image Detection 94
    5.2.1 Fundamental constraints for Watermarking methods 94
    5.2.2 Low Perturbation Budget Watermarks: Empirical Attacks 96
    5.2.3 High Perturbation Budget Watermarks: Empirical Attacks 99
    5.2.4 Spoofing attacks on watermarking methods 101
  5.3 Robustness-Reliability Trade-off of Deepfake Detectors 102
  5.4 Conclusion I 105
  5.5 Data Retrieval with Error-Corrected Codes and Watermarking 105
    5.5.1 Problem Definition 111
    5.5.2 Preliminaries 112
    5.5.3 Methodology: DREW 113
    5.5.4 Performance Analysis 114
  5.6 Experiments 117
    5.6.1 Experimental Setup 117
    5.6.2 Results 119
  5.7 Discussion and Limitations 120
  5.8 Conclusion II 121
6 IConMark: Robust Interpretable Concept-Based Watermark For AI Images 123
  6.1 Introduction 123
  6.2 IConMark: Interpretable Watermarking 125
    6.2.1 Initialization: Concept Database Generation 126
    6.2.2 Watermarked Image Generation 126
    6.2.3 Watermark detection 127
  6.3 IConMark (+SS and +TM): Harnessing the Combinatorial Strength of IConMark 128
  6.4 Experiments 130
    6.4.1 Experimental Settings 132
    6.4.2 Watermark Detection And Image Quality 134
    6.4.3 Robustness of Watermarks 135
  6.5 Conclusion 136
7 Fast Adversarial Attacks on Language Models 138
  7.1 Introduction 138
  7.2 Beam Search-based Adversarial Attack 141
    7.2.1 Preliminaries 141
    7.2.2 Our Threat Model 142
    7.2.3 Our Method: BEAST 142
  7.3 Jailbreaking Attacks 143
    7.3.1 Setup 143
    7.3.2 Baselines 144
    7.3.3 Evaluation Methods 147
    7.3.4 Results 148
    7.3.5 Multiple Behaviour and Transferability 149
    7.3.6 Model Transferability 150
  7.4 Hallucination Attacks 151
    7.4.1 Setup 152
    7.4.2 Results 152
  7.5 Privacy Attack 153
    7.5.1 Setup 153
    7.5.2 Baselines 155
    7.5.3 Results 155
  7.6 Conclusion 155
8 On Manipulating Audio-Based Large Language Models In The Real World 157
  8.1 Introduction 157
  8.2 Adversarial Attacks for Audio-based LLMs 159
    8.2.1 Targeted Attacks 160
    8.2.2 Untargeted Attacks 163
  8.3 Manipulating an Audio-based LLM in the Real World 168
  8.4 Defenses For Audio-Based Manipulations 171
  8.5 Conclusion 173
9 Investigating Detection Of Hallucination In Large Language Models 174
  9.1 Introduction 174
  9.2 Preliminaries 177
    9.2.1 Notation 177
    9.2.2 Background and Formalisms for Hallucination Detection 178
  9.3 Proposed Method 179
  9.4 Experimental Results 184
    9.4.1 Detection Results on Datasets with no External References as Context 187
    9.4.2 Detection with External References 188
  9.5 Conclusion 189
10 Unlearnable Datasets For Preserving User Privacy 192
  10.1 Introduction 192
  10.2 Convolution-based Unlearnable DAtaset (CUDA) 195
    10.2.1 Preliminaries 196
    10.2.2 Limitations of existing works 196
    10.2.3 Our method: CUDA 198
  10.3 Theory for CUDA 199
  10.4 Experiments 203
    10.4.1 Experimental setup 203
    10.4.2 Effectiveness of CUDA 205
  10.5 Conclusion 209
11 Appendix 211
  11.1 Exploring Geometry of Blind Spots in Vision Models 211
  11.2 Robustness for Streaming Models with a Sliding Window 223
  11.3 Hardness of AI Text Detection 236
  11.4 Hardness of AI Image Detection 269
  11.5 IConMark: Robust Interpretable Concept-Based Watermark For AI Images 293
    11.5.1 Additional Examples of Generated Images 293
  11.6 Data Retrieval with Error-Corrected Codes and Watermarking 297
  11.7 Fast Adversarial Attacks on Language Models 305
  11.8 Detection Of Hallucination In Large Language Models 325
  11.9 Unlearnable Datasets For Preserving User Privacy 330
Bibliography 367

Chapter 1: Introduction

1.1 Thesis Statement

In this thesis, we explore fundamental aspects of AI model robustness, detectability, and data privacy. We analyze challenges posed by adversaries to AI systems to contribute towards improving their robustness, security, and privacy.

1.2 Thesis Overview

Artificial Intelligence (AI) has become an integral part of numerous real-world applications, from self-driving cars to models that can generate creative multimodal content. Despite their success, these models exhibit several security and privacy vulnerabilities. Adversaries could attack these AI systems to potentially jailbreak them to cause harm or leak sensitive private information. In order to make these systems more robust and secure, we need to understand their failure modes and vulnerabilities. In this thesis, we analyze the robustness of various AI models to better understand their security and privacy issues. We study robustness, detectability, and data privacy in AI.

First, we study the robustness of discriminative models. Adversarial attacks on computer vision models are well studied: in an adversarial attack on an image classifier, an attacker adds imperceptible noise to the input to make the model misclassify. We study a novel phenomenon called blind spots in vision models, where we find multiple continuous, equi-confident, star-like paths in the input space along which the model produces invariant output. We also study the robustness of streaming classification models to provide provable certificates.

In the next part, we study the detectability of AI. Recent advancements in generative AI have led to the wide usage of Large Language Models (LLMs) and Diffusion Models (DMs). Attackers could use these generative models to produce fake content for social engineering.
In order to tackle this, many works have suggested employing AI-content detectors. In our study, we analyze the hardness of AI-content detection in both the text and image modalities. We show, empirically and theoretically, that attackers can easily evade these AI detection systems, exposing their security vulnerabilities. Further, we discuss how the insights from our analysis can be leveraged to design more robust detection technologies. We also extend a specific detection technology, watermarking, to show its benefits in data retrieval applications.

We then investigate the robustness of LLMs to adversarial attacks. Adversarial attacks on LLMs are relatively new, since it is hard to perform optimization in the discrete input token space. Though recent works present various approaches to attack LLMs, they are either computationally expensive or dependent on a more powerful model to attack a weaker one. We propose an algorithm that attacks LLMs in under one GPU minute. We also demonstrate how multi-modal LLMs, with their increased attack surface, are more vulnerable to these attacks in the real world. We extend this with experiments that jailbreak LLMs, elicit hallucinations, and perform privacy attacks. We show that an adversary can attack existing transformer-based language models efficiently to make them unsafe and less useful, and that these risks increase as models gain more input modalities. Specifically, we demonstrate scenarios where adversaries could harm innocent users of audio-based LLMs.

Because LLMs perform impressively on various tasks, they are becoming popular tools in many applications. However, they are known to hallucinate, raising concerns about their use in sensitive fields such as healthcare or robotics. In our study, we propose multiple methods to detect LLM hallucinations. We introduce several algorithms that can detect hallucinations in both white-box and black-box settings.
We demonstrate that the proposed detection methods are extremely compute-efficient, requiring only a fraction of the run-time of other baselines while achieving significant improvements in detection performance over diverse datasets.

Finally, we discuss an approach that uses poisoning for user privacy preservation. Big data has contributed a lot to the success of today’s AI. This has led various developers to train their models on data available on the Internet without the data owners’ consent. We propose an unlearnable dataset that can be generated efficiently in a model-agnostic manner. Unlearnable datasets produced by our algorithm, using just data-wise random convolutions, cause models trained on this unconsented data to perform poorly on the user data distribution. We study this theoretically and empirically, showing promising results.

In summary, our thesis develops various attack frameworks to understand the robustness and security vulnerabilities of various AI models. The insights we provide from our attacks can be useful in designing better AI systems. We also propose approaches to address some of these issues in existing AI models.

1.3 Thesis Contributions

Our thesis explores fundamental aspects of AI model robustness, detectability, and privacy. We propose various frameworks to attack existing AI models to better understand their vulnerabilities. We also devise new defense methods to make AI more secure and reliable. We state our contributions below.

Investigating Robustness Of Classification Models

• We propose a Level Set Traversal algorithm that iteratively explores regions of high confidence in the input space using orthogonal components of the local gradients. Given a source image, we use this algorithm to identify inputs that lie in the same equi-confidence level set as the source image despite being perceptually similar to arbitrary images from other classes.
We further observe that the source image is linearly connected by a high-confidence path to these inputs, uncovering a star-like structure for the level sets of deep networks. Furthermore, we attempt to identify and estimate the extent of these connected higher-dimensional regions over which the model maintains a high degree of confidence. [NeurIPS 2023]

• We derive robustness certificates for models that use a fixed-size sliding window over an input stream. Our guarantees hold for the average model performance across the entire stream and are independent of stream size, making them suitable for large data streams. We perform experiments on speech detection and human activity recognition tasks and show that our certificates can produce meaningful performance guarantees against adversarial perturbations. [arXiv 2023]

Hardness of AI-Generated Text Detection

• We introduce a recursive paraphrasing attack to stress test a wide range of detection schemes, including those based on watermarking as well as neural network-based detectors, zero-shot classifiers, and retrieval-based detectors. Our experiments, conducted on passages of approximately 300 tokens each, reveal the varying sensitivities of these detectors to our attacks. We also observe that these paraphrasing attacks cause only slight degradation in text quality. Additionally, we investigate the susceptibility of watermarked LLMs to spoofing attacks aimed at misclassifying human-written text as AI-generated. We also provide a theoretical framework connecting the AUROC of the best possible detector to the total variation distance between human and AI text distributions. [TMLR 2025]

• We introduce adversarial paraphrasing, a strong universal framework to break existing AI text detectors. Our adversarial attack uses an LLM to auto-regressively paraphrase AI texts, guided by an AI text classifier, to humanize them.
We find that our attack effectively transfers to all kinds of text detection schemes, including watermarking, zero-shot detectors, and trained detectors. [NeurIPS Submission]

Hardness of AI-Generated Image Detection

• For watermarking methods that introduce subtle image perturbations (i.e., low perturbation budget methods), we reveal a fundamental trade-off between the evasion error rate (i.e., the fraction of watermarked images detected as non-watermarked) and the spoofing error rate (i.e., the fraction of non-watermarked images detected as watermarked) under a diffusion purification attack. For high perturbation budget watermarking methods, we develop a model substitution adversarial attack that can successfully remove watermarks. Moreover, we show that watermarking methods are vulnerable to spoofing attacks in which the attacker aims to have real images identified as watermarked, damaging the reputation of the developers. We also extend our theory to characterize a fundamental trade-off between the robustness and reliability of classifier-based deepfake detectors and demonstrate it through experiments. [ICLR 2024]

• We propose IConMark, a novel in-generation robust semantic watermarking method that embeds interpretable concepts into AI-generated images, as a first step toward interpretable watermarking. Our method is not only robust against various image augmentations but also human-readable, enabling manual verification of watermarks. We present a detailed evaluation of IConMark, demonstrating its superiority in terms of detection accuracy while maintaining image quality. Moreover, IConMark can be combined with existing watermarking techniques to further enhance and complement its robustness.
IConMark and its variants achieve 10.8%, 14.5%, and 15.9% higher mean area under the receiver operating characteristic curve (AUROC) scores for watermark detection, respectively, compared to the best baseline on various datasets. [ICLR 2025 GenAI Watermarking Workshop and ICCV Submission]

• We propose Data Retrieval with Error-corrected codes and Watermarking (DREW), which randomly clusters the reference dataset, injects unique error-controlled watermark keys into each cluster, and uses these keys at query time to identify the appropriate cluster for a given sample. DREW maintains baseline performance while also providing opportunities for performance improvements, due to the increased likelihood of correctly matching queries to their origin when performing retrieval on a smaller subset of the dataset. [arXiv 2024]

Adversarial Attacks on Language Models

• We introduce a novel class of fast, beam search-based adversarial attacks (BEAST) for Large Language Models (LLMs). BEAST employs interpretable parameters, enabling attackers to balance attack speed, success rate, and the readability of adversarial prompts. The computational efficiency of BEAST enables us to investigate its applications to LLMs for jailbreaking, eliciting hallucinations, and privacy attacks. We find that BEAST can jailbreak LLMs in one to two GPU minutes with as high as ∼100% attack success. Our attack can also increase hallucination rates in LLMs by 15% to 30%. BEAST can adversarially increase the membership inference attack accuracies of existing baseline methods as well. We also show that our attack can be used to break new LLMs in a black-box manner. [ICML 2024]

• We demonstrate the real-world vulnerabilities of open-sourced LLMs that use audio as an input modality, such as Qwen2-Audio.
We first demonstrate that an adversary can craft stealthy audio perturbations to manipulate audio-based LLMs into exhibiting specific targeted behaviors, such as eliciting responses to wake-keywords (e.g., “Hey Qwen”). Subsequently, we experimentally show that adding adversarial noise to benign speech signals can significantly degrade the speech processing performance of these models. Crucially, our research illustrates that these attacks scale to real-world scenarios, impacting other innocent users. We find that our methods can design adversarial audio signals that remain effective even when played through the air, capable of compromising audio-based LLMs used by innocent users and compelling their models to execute unintended or harmful agentic tasks. Finally, we discuss potential defensive measures, specifically input audio augmentation techniques, to mitigate these attacks. [NeurIPS Submission]

Investigating Detection Of Hallucination In Large Language Models

• We conduct a comprehensive investigation into the nature of hallucinations within LLMs and explore effective techniques for detecting such inaccuracies in various real-world settings. We seek to identify hallucinations within a single response in both white-box and black-box settings by analyzing the internal hidden states, attention maps, and output prediction probabilities of an auxiliary LLM. We also study hallucination detection in scenarios where ground-truth references are available, such as in the setting of Retrieval-Augmented Generation (RAG). We demonstrate that the proposed detection methods are extremely compute-efficient, requiring only a fraction of the run-time of other baselines, while achieving significant improvements in detection performance over diverse datasets. [NeurIPS 2024]

Unlearnable Datasets For Preserving User Privacy

• We propose a novel, model-free, Convolution-based Unlearnable DAtaset (CUDA) generation technique.
CUDA is generated using controlled class-wise convolutions with filters that are randomly generated via a private key. CUDA encourages the network to learn the relation between filters and labels rather than the informative features needed for classifying the clean data. We develop a theoretical analysis demonstrating that CUDA can successfully poison Gaussian mixture data by reducing the clean-data performance of the optimal Bayes classifier. We also show that CUDA is robust even to adaptive defenses designed specifically to break it. [CVPR 2023]

1.4 Publications and Authorship

This thesis is based on the published manuscripts, preprints, manuscripts under review, and ongoing works listed in Fig. 1.1. These works are the result of collaborations with my advisor, Soheil Feizi, and my other colleagues at the University of Maryland. In this thesis, the pronoun “we” is used to acknowledge the collective contribution of all my collaborators.

Contributions. For all my first-author works, I led the work, formulated the ideas, and performed the experiments to present the key results. For my joint first-author work (Cheng & Sadasivan et al., 2025), I formulated the idea and performed the experiments presenting the key algorithmic results, except for the experimental ablation results. I produced the experimental results in Kumar et al., 2023, the experimental ablations in Balasubramanian & Sriramanan et al., 2023, the entropy-based detector in Sriramanan et al., 2024, the detector spoofing and deepfake experiments in Saberi et al., 2024a, and insights to improve the retrieval algorithm in Saberi et al., 2024b.

Figure 1.1: Primary works directly related to the thesis.
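The class-wise convolution idea behind the CUDA contribution above can be pictured with a minimal NumPy sketch. This is an illustration only, not the exact CVPR 2023 procedure: the 3×3 filter size, the blending strength, and the normalization step are our assumptions, chosen to show how a private key can seed one random filter per class so that the filter-label correlation, rather than semantic content, predicts the label.

```python
import numpy as np

def make_class_filters(num_classes, k=3, key=0):
    # One random k x k filter per class, seeded by a private key.
    rng = np.random.default_rng(key)
    return rng.uniform(-1.0, 1.0, size=(num_classes, k, k))

def correlate2d(img, filt):
    # Naive 'same'-size 2-D cross-correlation with zero padding
    # (filter flip omitted for simplicity).
    k = filt.shape[0]
    pad = k // 2
    padded = np.pad(img, pad)
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * filt)
    return out

def make_unlearnable(images, labels, filters, strength=0.3):
    # Blend each image with its class-specific filtered version so the
    # filter-label shortcut, not semantic content, predicts the label.
    poisoned = []
    for img, y in zip(images, labels):
        conv = correlate2d(img, filters[y])
        conv = (conv - conv.min()) / (conv.max() - conv.min() + 1e-8)
        poisoned.append((1 - strength) * img + strength * conv)
    return np.array(poisoned)

# Toy usage: four 8x8 grayscale "images" from two classes.
filters = make_class_filters(num_classes=2, key=42)
images = np.random.default_rng(1).uniform(0, 1, size=(4, 8, 8))
labels = [0, 1, 0, 1]
poisoned = make_unlearnable(images, labels, filters)
assert poisoned.shape == images.shape
```

Because generation only requires per-class random filters and a convolution pass, no surrogate model or optimization is involved, which is what makes such a scheme model-agnostic and cheap.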
Chapter 2: Related Works
2.1 Exploring Geometry Of Blind Spots In Vision Models
The phenomenon of under-sensitivity in classification models was first pointed out by [170], wherein the authors utilize a class of invertible neural networks called fully Invertible RevNets [171, 192] to specifically craft input images that do not affect network activations at a given layer. In contrast, our proposed algorithm is applicable to general network architectures since we solely utilize input-gradients to perform the traversal over image space. [379] further demonstrated that due to the misalignment of ℓp norm bounded balls and the ideal set of human-imperceptible perturbations, networks that are adversarially trained against such ℓp bounded perturbations of relatively large radius are overly smooth, and become excessively susceptible to invariance-based attacks within the same ℓp radius. To find such images that lie within a given ℓp threat model but induce human oracles to change their label assignment, the authors propose to identify the training image from another class closest in image space and apply a series of semantic-preserving transformations, and additionally use techniques such as realignment and spectral clustering. Given that these operations are fairly complex, the attack algorithm is slow, with alignment alone requiring an amortized time of minutes per input example. In contrast, our proposed method relies upon gradient-backpropagation steps which are efficiently parallelized across a minibatch of input images. Furthermore, our technique is seen to be successful for arbitrary source-target image pairs, since we do not utilize near-neighbour training images as the target. On the theoretical front, [178] analyze the problem of span-recovery of neural networks given only oracle access to the network predictions.
They characterize the feasibility of span recovery, and thus approximations of the null space of networks in a provable setting, but remark that its success in practical settings is potentially limited to networks that are extremely thin.
2.2 Robustness for Streaming Models
The adversarial streaming setup has been studied extensively in recent years. [270] designed an attack for transient data streams that do not allow the adversary to re-attack past input items. In their setting, the adversary only has partial knowledge of the target DNN and the perturbations applied in previous time steps are irrevocable. Their objective is to produce an adversarial attack with minimal access to the data stream and the target model. Our goal, on the other hand, is to design a provably robust method that can defend against as general and strong an adversary as possible. We assume that the adversary has full knowledge of the parameters of the target DNN and can change the adversarial perturbations added in previous time steps. Our threat model includes transient data streams as a special case and applies even to adversaries that only have partial access to the DNN. Streaming adversarial attacks have also been studied for sampling algorithms such as Bernoulli sampling and reservoir sampling [25]. Here, the goal of the adversary is to create a stream that is unrepresentative of the actual data distribution. Other works have studied the adversarial streaming setup for specific data analysis problems like frequency moment estimation [24], submodular maximization [269], coreset construction and row sampling [38]. In this work, we focus on a robustness certificate for general DNN models in the streaming setting under the conventional notion of adversarial attacks in machine learning literature. We use a sliding-window computational model which has been extensively studied over several years for many streaming applications [87, 110, 118].
Recently, [106] showed that short-term memory is sufficient for several real-world reinforcement learning tasks. A closely related setting is that of adversarial reinforcement learning. Adversarial attacks have been designed that either directly corrupt the observations of the agent [23, 163, 295] or introduce adversarial behavior in a competing agent [122]. Robust training methods, such as adding adversarial noise [188, 384] and training with a learned adversary in an online alternating fashion [429], have been proposed to improve the robustness of RL agents. Several certified defenses have also been developed over the years. For instance, [431] developed a method that can certify the actions of an RL agent at each time step under a fixed adversarial perturbation budget. It can certify the total reward obtained at the end of an episode if each of the intermediate actions is certifiably robust. Our streaming formulation allows the adversary to choose the budget at each time step as long as the average perturbation size remains below ε over time. Our framework also does not require each prediction to be robust in order to certify the average performance of the DNN. More recent works in certified RL can produce robustness guarantees on the total reward without requiring every intermediate action to be robust or the adversarial budget to be fixed [204, 410]. However, these certificates degrade for longer streams, and the tightness analysis of these certificates indicates that this dependence on stream size may not be improvable. Our goal is to keep the robustness guarantees independent of stream size so that they are suitable even for large streams. The literature on provable robustness has primarily focused on static prediction problems like image classification. One of the most prominent techniques in this line of research is randomized smoothing.
For a given input image, this technique aggregates the output of a DNN on several noisy versions of the image to produce a robust class label [71, 221]. This was the first approach to scale up to high-dimensional image datasets like ImageNet for ℓ2-norm bounded adversaries. It does not make any assumptions on the underlying neural network, such as Lipschitz continuity or a specific architecture, making it suitable for conventional DNNs that are several layers deep. However, randomized smoothing also suffers from some fundamental limitations for higher norms such as the ℓ∞-norm [206]. Due to its flexible nature, randomized smoothing has also been adapted for tasks beyond classification, such as segmentation and deep generative modeling, with multi-dimensional and structured outputs like images, segmentation masks, and language [203]. For such outputs, robustness certificates are designed in terms of a distance metric in the output space such as LPIPS distance, intersection-over-union, and total variation distance. However, provable robustness in the static setting assumes a fixed budget on the size of the adversarial perturbation for each input instance and does not allow the adversary to choose a different budget for each instance. In our streaming threat model, we allow the adversary the flexibility of allocating the adversarial budget to different time steps in an effective way, attacking more critical input items with a higher budget and conserving its budget at other time steps. Recent work on provable robustness against Wasserstein shifts of the data distribution allows the adversary to choose the attack budget for each instance differently [207]. However, unlike our streaming setting, the input instances are drawn independently from the data distribution and the adversarial perturbation applied to one instance does not impact the performance of the DNN on another.
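The smoothing procedure described above can be sketched in a few lines (a minimal illustration with a toy classifier; the noise scale, sample count, and classifier are illustrative stand-ins, and the full method of [71, 221] additionally computes confidence intervals over the vote counts to produce the certificate):

```python
import numpy as np

def smoothed_predict(classifier, x, sigma=0.25, n_samples=1000, rng=None):
    """Majority-vote prediction over Gaussian-noised copies of the input.

    `classifier` maps a batch of inputs to integer class labels; the smoothed
    classifier returns the most frequent label under noise N(0, sigma^2 I).
    """
    rng = rng or np.random.default_rng(0)
    noisy = x[None, :] + sigma * rng.standard_normal((n_samples, x.size))
    labels = classifier(noisy)
    return np.bincount(labels).argmax()

# Toy 2-class linear classifier: predict by the sign of the first coordinate.
toy = lambda batch: (batch[:, 0] > 0).astype(int)

x = np.array([0.8, -0.3])
print(smoothed_predict(toy, x))  # robustly predicts class 1 well inside the halfspace
```

Because the smoothed label depends only on vote counts under isotropic noise, small ℓ2 perturbations of x cannot flip a sufficiently lopsided majority, which is the intuition behind the certificate.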
2.3 AI-Generated Text Detection
Several detection works study this problem as a binary classification problem [19, 109, 177, 268, 280] and use neural network-based detectors. For example, OpenAI fine-tunes RoBERTa-based [249] GPT-2 detector models to distinguish between non-AI-generated and GPT-2-generated texts [280]. This requires such a detector to be fine-tuned with supervision on each new LLM for reliable detection. Another stream of work focuses on zero-shot AI text detection without any additional training overhead [120, 168, 357]. These works evaluate the expected per-token log probability of texts and perform thresholding to detect AI-generated texts. [268] observe that AI-generated passages tend to lie in regions of negative curvature of the text log-probability function. They propose DetectGPT, a zero-shot LLM text detection method, to leverage this observation. Since these approaches rely on a neural network for their detection, they can be vulnerable to adversarial and poisoning attacks [126, 208, 330, 390]. Another line of work aims to watermark AI-generated texts to ease their detection [15, 193, 403, 442]. Watermarking eases the detection of LLM output text by imprinting specific patterns on it. Soft watermarking, proposed in [193], partitions the vocabulary into “green” and “red” lists to help create these patterns. A watermarked LLM samples a token, with high probability, from the green list determined by a pseudo-random generator seeded by its prefix token. The watermarking detector would classify a passage with a large number of tokens from the green list as AI-generated. These watermarks are often imperceptible to humans. [199] introduces an information retrieval-based detector that stores the outputs of the LLM in a database. For a candidate passage, their algorithm searches this database for semantically similar matches for detection. However, storing user-LLM conversations might cause serious privacy concerns.
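The green-list detection rule sketched above amounts to a one-proportion z-test on the number of green tokens. Below is a minimal sketch; the vocabulary size, green-list fraction, seeding scheme, and detection threshold are illustrative choices rather than those of [193]:

```python
import math
import random

def green_list(prefix_token: int, vocab_size: int, gamma: float) -> set:
    # Pseudo-random partition of the vocabulary, seeded by the prefix token.
    rng = random.Random(prefix_token)
    vocab = list(range(vocab_size))
    rng.shuffle(vocab)
    return set(vocab[: int(gamma * vocab_size)])

def watermark_z_score(tokens, vocab_size=1000, gamma=0.5):
    """z-score of the green-token count against the unwatermarked null."""
    n = len(tokens) - 1
    hits = sum(
        tok in green_list(prev, vocab_size, gamma)
        for prev, tok in zip(tokens, tokens[1:])
    )
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))

# A "watermarked" sequence that always samples from the green list scores
# far above a typical detection threshold (e.g., z > 4), while unwatermarked
# text concentrates near z = 0.
seq = [0]
for _ in range(60):
    seq.append(min(green_list(seq[-1], 1000, 0.5)))
print(watermark_z_score(seq) > 4.0)  # True
```

The z-score grows with passage length for watermarked text, which is why longer passages are easier to detect and why paraphrasing attacks that break the prefix-seeded pattern degrade the statistic.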
2.4 AI-Generated Image Detection And Data Retrieval
Image Watermarking. Image watermarking is a versatile technology with applications in copyright protection, content authenticity, data authentication, privacy preservation, and branding. Its evolution began with manual methods like LSB [406], and later techniques involved altering spatial or frequency domains [35, 76, 121, 153, 286, 297]. Various transformations such as DCT, DWT, SVD-decomposition [57], and Radon transformations [339] were explored. Recent advancements incorporate deep learning and generative models like SteganoGAN [432], StegaStamp [368], RivaGAN [433], WatermarkDM [444], MBRS [181], and Tree Ring [400], each employing different methods to embed watermarks into images. There have been several works trying to attack watermarking methods [183, 387]. Notably, a recent concurrent work [443] also proves that the diffusion purification attack is successful against invisible (low perturbation budget) watermarking. However, [443] is unable to attack high perturbation budget watermarking methods such as Tree Ring or StegaStamp and argues that they are more reliable watermarking alternatives. In contrast, we show that our model substitution adversarial attack can effectively break those watermarking methods. Additionally, we show that several watermarking approaches are vulnerable to spoofing attacks and characterize a robustness-reliability trade-off for a classification-based deepfake detector.
Classifier-based Detectors. Several machine-learning approaches focusing on detecting artifacts in AI-generated content have been studied. For instance, [261] target irregularities in face editing algorithms, while [69] exploit biological signals. [234] introduce a technique for identifying partially manipulated videos, and [135] harness the traces from convolutional layers of generative adversarial networks in fake image detection.
[36] analyze spatiotemporal texture dynamics of video signals for Deepfake detection. A plethora of works focus on facial forgery or Deepfake detection using convolution net-based classifiers [22, 77, 97, 310, 311, 447]. [318] propose a face forensics dataset and train ResNet [144] and XceptionNet [64] based classifiers using it. However, as noted in [142], machine learning-based detectors are often vulnerable to novel input perturbations. Such limitations challenge the practical utility of these methods, as any real-world detector should achieve good performance while being robust to small perturbations in the input.
Error Control Codes. Error control codes (ECC) [143] are fundamental in ensuring the reliable transmission and storage of data across noisy channels. These codes detect and correct errors without needing re-transmission, enhancing data integrity and communication efficiency. Notable ECCs include Reed-Solomon codes [313], known for robust error correction in digital communications and storage like CDs and DVDs. LDPC codes [117] have transformed ECC with near-capacity performance, crucial in modern systems like 5G and satellite communications. BCH codes [37], introduced in the 1950s, are versatile for error detection and correction in applications from QR codes to digital broadcasting. Polar codes [12], a recent advancement, achieve channel capacity with low-complexity decoding, essential for 5G technology. These advancements highlight ECC’s evolution, driven by the need for high reliability and efficiency in complex communication networks.
Data Retrieval. Data retrieval techniques refer to the methods and algorithms used to efficiently search, access, and retrieve relevant information from large data repositories or databases [60]. Recent advancements include embedding-based retrieval systems, which utilize deep learning models to transform data into feature vectors, enabling accurate and context-aware searches.
Approximate Nearest Neighbors (ANN) [13, 349] is a notable technique in this field, facilitating rapid similarity searches by approximating the nearest neighbor search in large datasets. Data retrieval techniques are especially useful in applications such as recommendation systems [275], retrieval augmented generation [227], near-duplicate detection [358], and anomaly detection [73].
Watermarking for Data Retrieval. Some existing works have explored the use of watermarks in retrieval-based tasks in various ways. Certain studies [319, 320] propose manually shifting data (similar to watermarking) to enhance the mutual separation of samples in feature space, thereby improving the robustness of feature-based data retrieval. However, these methods do not leverage recent advancements in watermarking or efficient retrieval techniques, rendering them impractical compared to contemporary approaches. Additionally, other works [169, 214] have utilized watermarks in image retrieval systems for user identification and to prevent illegal data distribution. Nonetheless, these approaches fail to address the low robustness of watermarks when used alone and without error correction modules, making them unreliable in practical applications.
2.5 Robustness And Privacy in Large Language Models
Adversarial attacks. Adversarial machine learning literature shows that the inputs to models can be perturbed to get a desired target output [27, 53, 126, 290, 363]. Several works investigate adversarial examples in the text domain for question answering [180], document classification [105], and sentiment analysis [9], either using discrete optimization or greedy methods [185, 386]. Though recent works [139, 185, 344, 399] show that they can generate adversarial prompts with automatic prompt-tuning, [49] claim that such methods are insufficiently powerful to attack LMs reliably.
Jailbreaking. A lot of research has been done to align LMs to human values to make them safe and useful [287, 397].
However, [85, 296] show that prompts can be manually written to jailbreak aligned LLMs. [449] introduced a gradient-based optimization technique to generate adversarial prompts automatically by adding gibberish adversarial token suffixes. [448] also proposes a gradient-based jailbreaking method that improves upon the readability of the adversarial token suffixes. [218, 247, 419] propose black-box jailbreak attacks using genetic search algorithms. [58, 119] propose black-box attacks inspired by social engineering, where aligned LMs generate jailbreaking prompts by querying a target LM iteratively. [164] demonstrates that the alignment of LM chatbots can be disrupted by manipulating their decoding strategies.
With the increasing prevalence of vision-based LLMs, research has consequently begun exploring how jailbreaking can affect these models. Such attacks often exploit cross-modal interactions between vision and text inputs to circumvent safety alignments. For instance, techniques involve crafting adversarial images with subtle perturbations designed to elicit harmful text generation when paired with simple prompts [300]. Others focus on encoding harmful requests directly into the visual input, such as embedding hidden text or using typographic visual prompts like those created by FigStep [123]. A related approach, exemplified by Imgtrojan [371], aims to create a single ‘Trojan’ image that acts as a trigger, enabling the model to be jailbroken for various subsequent harmful text prompts.
The domain of attacking audio-based LLMs remains significantly under-explored. AdvWave [189] represents a pioneering contribution in this area, presenting a framework engineered to circumvent the safety protocols of speech-processing audio-based LLMs to elicit prohibited content, via a dual-phase optimization method: an adaptive target search algorithm, followed by techniques that render the adversarial audio inconspicuous, resembling background noise.
The methodology in their study targets the direct creation of adversarial speech, aligning with digital text-based attack paradigms, without explicitly modeling real-world acoustic propagation effects, which is a key consideration in our research. Related research has investigated audio vulnerabilities in other machine learning contexts like Automatic Speech Recognition or Text-To-Speech systems [10, 52, 173, 243, 388].
Hallucinations. Aligned LMs, at times, produce nonsensical outputs that are irrelevant to the input prompts [4] or previously generated context [246], or factually incorrect [232, 264, 271]. While a plethora of benchmarks [232, 240, 264], detection methods [272, 424], and mitigation techniques [129, 347, 385] exist for hallucination, it is still a problem that needs more investigation [440].
Privacy attacks. Overparameterized networks are known to memorize training data. Several works demonstrate that this can be a pitfall of these models and can be leveraged to leak potentially private information from them, such as training data membership [47, 345, 417] or the data itself [47, 48, 274]. [248] performs prompt injection with manually crafted prompts to leak system prompts. [448] uses their automated jailbreak attack to leak system prompts from aligned LMs.
2.6 Hallucination Detection In Large Language Models
SelfCheckGPT [259] presented a suite of hallucination detection methods in a zero-resource grey-box/black-box setting to assess the veracity of an LM response, assuming only access to the output probability distributions. This is highly practical in real-world usage, since API calls often preclude access to internal model activations. On the other hand, INSIDE [59] proposed to measure the self-consistency of the hidden states across multiple independent responses of an LLM, by performing a centered eigen-analysis of the covariance matrix of these hidden states.
Notably, INSIDE follows a slightly different evaluation framework from other techniques, which generally rely upon human annotations: to assess the correctness of a model response, INSIDE checks whether the ROUGE-L score or BERT-similarity score with respect to a ground-truth exceeds a threshold (0.5 and 0.9, respectively). However, ROUGE-L is an n-gram based method that evaluates the longest common subsequence between the model response and the ground-truth. Thus, several hallucinatory words can be incorporated without a considerable change in the ROUGE-L score overall. [186] proposed to use the LLM itself to predict the probability of the generated response being True, referred to as Self-Prompt; this is primarily relevant in the context of multiple-choice questions where the model explicitly has all the options in context. Furthermore, they also train models to predict the probability that a given multiple-choice question will be correctly answered by calibrating its internal confidence values. [17] proposed to train feedforward neural networks on the hidden activations of intermediate layers of LMs on a True-False dataset of relatively short, simple sentences such as "The zebra uses flying for locomotion." This method is thus a white-box detection technique which also requires supervised training data to train the feedforward network (as noted by [259]); in contrast, we do not incorporate training of any sub-network or the LLM itself. Furthermore, it is fairly non-trivial to extend the technique towards more general settings such as when external references or multiple model responses are available at inference time. [421] investigate factual errors made by LLMs by modeling them as a constraint-satisfaction problem. Indeed, the factual query is specified by considering a sequence of Constraint-tokens and using a subsequent Verifier (such as ExactMatch or WikiData Search) to determine the satisfaction of each such constraint.
This method is highly pertinent in settings where a fact-checking service is available for ready oracle-access that can provide feedback and annotations in real-time. [136] study hallucinations specifically in the setup of neural machine translation (NMT), and even propose mitigation techniques to overwrite hallucinatory sections of the output. FAVA [267] is a fine-grained hallucination detection dataset, wherein a novel hierarchy of hallucination types is identified and analyzed comprehensively. FAVA first generates a synthetic hallucination training dataset by inserting specific errors using ChatGPT [40], and then fine-tunes a Llama-2 model to explicitly detect these fine-grained hallucinations. The work also introduces a human-annotated dataset where the specific hallucination types are recorded with specified hallucination spans. We make extensive use of the FAVA datasets so released to compare multiple baselines on a common test-bed. Though retrieval augmented generation (RAG) helps to mitigate hallucination in LLMs, they are still found to hallucinate [230, 348]. Recently, [411] showed how popular LLMs still hallucinate on different tasks even with the integration of RAG, and compiled the RAGTruth dataset with span-level human annotation. They proposed to fine-tune an LLM on their training data to detect hallucination, which is computationally more expensive than our proposed method.
2.7 Robustness Of Discriminative Models
Adversarial attacks. Adversarial examples are specially designed examples that can fool deep learning models at test time [50, 125, 212, 215, 364]. The adversary crafts these examples by adding error-maximizing noise to the clean data. Even slightly perturbed data can serve as adversarial examples. Adversarial training (AT) is a framework proposed to make deep learning models robust to adversarial examples [176, 216, 254, 393, 426].
AT is a min-max optimization problem where the model is trained to minimize loss on adversarial examples that have the maximum loss.
Poisoning attacks. In data poisoning, an attacker aims to hurt the deep learning model’s performance by perturbing the training data [29, 195, 223, 245, 340, 389]. The backdoor attack is a special type of poisoning attack where a trigger pattern is injected into clean data at training time [61, 237, 250, 276]. A model trained on this data would misclassify an image with a trigger pattern at test time. Gu et al. [134] and Li et al. [238] use perceptible amounts of noise, similar to CUDA, for data poisoning. However, backdoor attacks do not affect the performance of the model on clean data [21, 61, 340]. Recent literature utilizes data poisoning to protect data from being used for model training without authorization. Yuan and Wu [420] use neural tangent kernels [172] to generate clean-label attacks that can hurt the generalization of deep learning models. Huang et al. [161] show that error-minimizing noise addition can serve as a poisoning technique. Fowl et al. [113] show that error-maximizing noise can also make strong poisoning attacks. However, none of these poisoning techniques offer data protection under AT [370, 418]. Fu et al. [116] propose a min-min-max optimization technique to generate poisoned data that offers better unlearnability effects under AT.
Chapter 3: Investigating Robustness Of Classification Models
3.1 Exploring Geometry of Blind Spots in Vision Models
Deep neural networks have demonstrated remarkable success in various diverse domains including computer vision and natural language processing, surpassing the performance of classical algorithms by a significant margin.
However, though they achieve state-of-the-art results on many computer vision tasks like image classification, sometimes even exceeding human-level performance [145, 146], the overall visual processing conducted by such models can deviate significantly from that observed in the human visual system. Perhaps the most iconic and representative of these differences is the overwhelming susceptibility of neural networks to near-imperceptible changes to their inputs — commonly called adversarial attacks [365] — an extraordinary failure mode that highlights the over-sensitivity of such models. Indeed, an active area of research over the past few years has focused on analyzing adversarial perturbations under different threat models [80, 127, 217, 378, 408] and on addressing adversarial vulnerabilities using robust training methodologies [217, 226, 253, 355, 409, 427]. On the other hand, it has been shown that such models may also be under-sensitive, wherein input images that are unmistakably disparate to a human oracle induce near-identical network activations or predictions [170]. To tractably analyze this phenomenon, [170] utilize a special class of neural networks that are bijective functions, called fully Invertible RevNets [171, 192], to craft large-magnitude semantic perturbations in input space that are designed to leave the corresponding network representations unchanged. Furthermore, [379] show that robust training with ℓp bounded adversaries can be a source of excessive model-invariance in itself, due to the poor approximation of the true imperceptible threat model of human oracles by ℓp norm-bounded balls in RGB-pixel space [217]. Indeed, the authors ‘break’ a provably robust defense on MNIST [430] with a certified accuracy of 87%, by crafting perturbations within the certified ℓ∞ radius of 0.4 that nevertheless cause model agreement with human oracles to diminish to 60%.
However, these methods either rely upon special invertible network architectures, or the selection of the nearest training image of another class as a target followed by a sequence of complex alignment and spectral clustering techniques, so as to semantically alter the input in a conspicuous manner that induces a change in the human-assigned oracle label while leaving the model prediction unchanged. This leads us to our research question: Is it possible to analyze the phenomenon of under-sensitivity of general vision models in a systematic manner on natural image datasets, and characterize the geometry and extent of “blind spots” of such networks? Indeed, we empirically demonstrate the veracity of this claim: in this work, we present a novel Level Set Traversal algorithm to explore the “equi-confidence” level sets of popular vision models. Given an arbitrary source and target image pair, our proposed algorithm successfully finds inputs that lie in the same level set as the source image, despite being near-identical perceptually to the target image. Furthermore, the proposed algorithm identifies a connected path between the original source image and the “blind spot” input so generated, wherein high prediction confidence with respect to the source class is maintained throughout the path. In summary, we make the following contributions in this work:
• We present a novel Level Set Traversal algorithm that iteratively uses orthogonal components of the local gradient to identify the “blind spots” of common vision models such as CNNs and ViTs on CIFAR-10 and ImageNet.
• We thereby show that there exist piecewise-linear connected paths in input space between images that a human oracle would deem to be extremely disparate, though vision models retain a near-uniform level of confidence on the same path.
• Furthermore, we show that the linear interpolant path between these images also remarkably lies within the same level set; as we observe the consistent presence of this phenomenon across arbitrary source-target image pairs, this unveils a star-like set substructure within these equi-confidence level sets.
• We demonstrate that adversarially robust models tend to be under-sensitive over subsets of the input domain that lie well beyond their original threat model, and display level sets of high confidence that extend over a significant fraction of the triangular convex hull of a given source image and an arbitrary pair of target images.
3.2 Preliminaries
Notation: In this paper, we primarily consider the setting of classification with access to a labelled dataset. Let x ∈ X denote d-dimensional input images, and let their corresponding labels be denoted as y ∈ {1, . . . , N}. Let f : X → [0,1]^N denote the classification model considered, where f(x) = (f^1(x), . . . , f^N(x)) represents the softmax predictions over the N classes. Further, let C_f(x) be the argmax over the N-dimensional softmax output, representing the class predicted by the model for input x. For a given data sample (x, y), let the cross-entropy loss achieved by the model be denoted as CE(f(x), y). Given a prediction confidence value p ∈ [0,1] and a class j ∈ {1, . . . , N}, we define the Level Set L_f(p, j) and Superlevel Set L_f^+(p, j) for the function f as follows:
L_f(p, j) = {x ∈ X : f^j(x) = p},   L_f^+(p, j) = {x ∈ X : f^j(x) ≥ p}
Given a pair of inputs x_1 and x_2, we define the linear interpolant path between them as P(λ; x_1, x_2) = λ·x_1 + (1−λ)·x_2, for λ ∈ [0,1]. A given set S is thus said to be convex if P(λ; x_1, x_2) ∈ S, ∀ x_1, x_2 ∈ S and λ ∈ [0,1]. Further, a set S is said to be star-like if there exists some s_0 ∈ S such that P(λ; s_0, x) ∈ S, ∀ x ∈ S and λ ∈ [0,1].
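Membership of an interpolant path in a superlevel set, as defined above, can be checked numerically. Below is a minimal sketch with a toy linear-softmax model; the model, inputs, and confidence threshold are illustrative, whereas the experiments in this chapter use CNNs and ViTs:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def path_in_superlevel_set(f, x1, x2, j, p, num_lambdas=101):
    """Check f^j(P(lambda; x1, x2)) >= p along the linear interpolant path."""
    lambdas = np.linspace(0.0, 1.0, num_lambdas)
    return all(f(lam * x1 + (1 - lam) * x2)[j] >= p for lam in lambdas)

# Toy 3-class model f(x) = softmax(W x). Its superlevel sets are convex
# (the logits are affine in x), so interpolants between members stay inside.
W = np.array([[3.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
f = lambda x: softmax(W @ x)
x1, x2 = np.array([2.0, 0.0]), np.array([1.5, -0.5])
print(path_in_superlevel_set(f, x1, x2, j=0, p=0.9))  # True
```

For deep networks the superlevel sets are not convex, which is what makes the observed star-like substructure (interpolants from a common source staying in the set) an empirical finding rather than a foregone conclusion.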
3.2.1 Conjugate Nature of Adversarial and Confidence Preserving Perturbations
Given a classification model f and a correctly classified benign input (x, y), an adversarial image is a specially crafted image x̃ = x + ε such that both x and x̃ appear near-identical to a human oracle, but x̃ induces the network to misclassify, i.e. C_f(x + ε) ≠ C_f(x) = y. To enforce imperceptibility in a computationally tractable manner, several adversarial threat models have been proposed, with the ℓ2 and ℓ∞ norm constraint models being the most popular. Amongst the earliest adversarial attacks specific to the latter threat model was the Fast Gradient Sign Method (FGSM) attack, proposed by [127], wherein the adversarial perturbation is found by single-step direct ascent along the local gradient with pixel-wise clipping. A stronger multi-step variant of this attack called Iterated-FGSM (IFGSM) was later introduced by [210], wherein iterated gradient ascent is performed alternately with projection operations onto the constraint set. A popular variant, called the Projected Gradient Descent (PGD) attack, was introduced by [253], which incorporates an initial random perturbation to the clean image, observed to help mitigate gradient masking effects [209]. A large class of adversarial attacks [46, 79, 131, 359] thus utilize perturbations parallel to the gradient, with appropriate projection operations to ensure constraint satisfaction. In contrast, perturbations that leave the network prediction confidence unchanged are locally orthogonal to the gradient direction. Indeed, for any differentiable function g : R^d → R, denote the level set as L_g(c) for a given output c ∈ R. Let γ(t) : [0,1] → L_g(c) be any differentiable curve within the level set. Then, g(γ(t)) = c ∀ t ∈ [0,1]. Thus, d/dt (g(γ(t))) = 0 = ⟨∇g(γ(t)), γ′(t)⟩, implying that γ′(t) is orthogonal to ∇g(γ(t)), ∀ t ∈ [0,1].
Since this is true for any curve γ contained in the level set, we conclude that the gradient vector is always perpendicular to the level set. Furthermore, we can additionally show that the level set L_g(c) is often a differentiable submanifold, under mild additional conditions on the gradient. Indeed, a level set L_g(c) is said to be regular if ∇g(x) ≠ 0 ∀ x ∈ L_g(c).

Lemma 1. If g : R^d → R is a continuously differentiable function, then each of its regular level sets is a (d−1) dimensional submanifold of R^d.

We present the proof of Lemma 1 in Section 11.1 of the Appendix. We thus observe that adversarial perturbations, largely parallel to the gradient, are locally orthogonal to confidence preserving directions, which correspond to the (d−1) dimensional tangent space of the level set. We take inspiration from this observation to develop a general framework that applies to a broad class of differentiable neural networks, as opposed to previous works [170] that require the network to be invertible to identify confidence preserving perturbations. We also remark that adversarial attacks and level set preserving perturbations are complementary from another perspective as well, as noted by prior works: the former attempts to find inputs that change model predictions without modifying human oracle assignments, while the latter attempts to keep network predictions unchanged though human oracles would likely change their original label assignment. Thus, both classes of adversarial attacks and level set preserving perturbations induce misalignment between oracle and model predictions, and cast light onto independent means of evaluating the coherence of models towards human expectations.

3.3 Theoretical Analysis for Common Models

To better understand the geometry of level set submanifolds for a given prediction confidence threshold, we analyze the nature of confidence preserving perturbations in a few simplified model settings.
1) Linear Functional: First, consider the classification model to be a linear functional f : R^d → R, that is, f(x1 + x2) = f(x1) + f(x2) and f(λx) = λ f(x). Then, by the Riesz Representation theorem, there exists a unique vector w_f ∈ R^d such that f(x) = ⟨w_f, x⟩ ∀ x ∈ R^d. We thus observe that for any vector v orthogonal to w_f, f(x + v) = ⟨w_f, x + v⟩ = ⟨w_f, x⟩ + ⟨w_f, v⟩ = ⟨w_f, x⟩ = f(x). Thus, in this setting, the level sets of f are (d−1) dimensional affine subspaces, namely translates of the subspace {v ∈ R^d : ⟨w_f, v⟩ = 0}. We observe that a similar argument can be extended to affine functions of the form f(x) = ⟨w_f, x⟩ + c, where c ∈ R. We remark that by applying a first-order Taylor series approximation to a general real-valued smooth function, we observe near-affine behavior locally within a small neighborhood: f(x + εa) = f(x) + ε⟨∇f(x), a⟩ + O(ε²∥a∥²). Thus, locally, confidence-preserving perturbations can arise from a (d−1) dimensional plane orthogonal to the gradient. Indeed, we incorporate this implicitly in our proposed Level Set Traversal algorithm, wherein the orthogonal projection ∆x⊥ with respect to the local gradient is iteratively computed, and its relative success can partly be attributed to the orthogonal hyperplane having a large dimension, namely (d−1). Another related setting worthy of note is that of neural networks that utilize ReLU activations, which induce a piece-wise linear structure over tessellated subsets of the input domain. Thus, a given output neuron of a ReLU network functionally has the form f(x) = ⟨w, x⟩ + c within a given tessellated region, and thus has a constant gradient within that region. Further, between two such adjacent regions, the two orthogonal (d−1) dimensional hyperplanes typically intersect over an affine space of dimension at least (d−2).
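As a brief aside, both observations can be checked with a few lines of numpy: perturbations orthogonal to w_f preserve a linear functional exactly, and the gradients of n piece-wise linear regions in generic position share a (d−n)-dimensional common null space. The dimensions and random matrices below are our own arbitrary toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20

# 1) Linear functional f(x) = <w, x>: any v in the orthogonal complement
#    of w preserves the output exactly.
w = rng.normal(size=d)
x = rng.normal(size=d)
v = rng.normal(size=d)
v_perp = v - (v @ w) / (w @ w) * w           # project v onto the hyperplane w-perp
assert abs(w @ (x + v_perp) - w @ x) < 1e-9

# 2) Gradients g_1, ..., g_n of n adjoining piece-wise linear (ReLU) regions:
#    generically, their common null space has dimension d - n.
n = 3
G = rng.normal(size=(n, d))                  # stand-in region gradients
assert d - np.linalg.matrix_rank(G) == d - n
_, _, Vt = np.linalg.svd(G)
V_null = Vt[n:]                              # basis of the (d - n)-dim null space
assert np.allclose(G @ V_null.T, 0.0, atol=1e-10)
```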
Thus, if x1, x2 are two inputs such that their linear interpolant path cuts across n distinct tessellated regions, the common intersection of these orthogonal hyperplanes will typically be (d−n) dimensional, indicating that there exist perturbations which lie in the common null-space of the gradients as defined along each tessellated region that is crossed. Indeed, in the following section, we demonstrate empirically for Residual Networks that though the exact iterative path followed by the Level Set Traversal algorithm is discretized and non-linear, the final outputs so found are often linearly connected through paths of high confidence to the source image, thereby lying within the level set of the original source class. This indicates that the overlap of different (d−1) dimensional hyperplanes is non-trivial at a non-local scale in image space, whereby we observe extended connected regions of high confidence.

2) Full-Rank Linear Transformations: Let us now consider a setting apart from classification, such as regression, wherein the complete vector representation of the output is of principal interest. Let the model be of the form f : R^d → R^d, f(x) = Ax, where A ∈ R^{d×d} is a full-rank matrix. In this setting, we observe that if f(x1) = f(x2), then Ax1 = Ax2, implying that A(x1 − x2) = 0, the zero-vector. But since A is full-rank, this implies that x1 − x2 = 0, i.e., x1 = x2. Thus, f is a bijective function and has a trivial null space. In this setting, we therefore have to relax the problem to that of identifying perturbations of large magnitude to the input that minimally change the function output: that is, we solve for min_{v≠0} ∥Av∥²/∥v∥². Indeed, let the Singular Value Decomposition (SVD) of the full-rank matrix A be given by A = UΣVᵀ, where U, V are d×d orthogonal matrices, and Σ is a diagonal matrix consisting of the positive singular values σ_i for 1 ≤ i ≤ d, in descending order without loss of generality.
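The solution of this minimization can be verified numerically before working through the derivation: in the small sketch below (our own toy matrix), the extremal ratios ∥Av∥²/∥v∥² are attained at the right-singular vectors of A.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
A = rng.normal(size=(d, d))                  # full-rank with probability 1
U, s, Vt = np.linalg.svd(A)                  # singular values s in descending order

ratio = lambda v: np.linalg.norm(A @ v) ** 2 / np.linalg.norm(v) ** 2

# min_{v != 0} ||Av||^2 / ||v||^2 = sigma_min^2, attained at the right-singular
# vector of the smallest singular value ...
assert np.isclose(ratio(Vt[-1]), s[-1] ** 2)
# ... while the adversarial counterpart max ||Av||^2 / ||v||^2 = sigma_max^2
assert np.isclose(ratio(Vt[0]), s[0] ** 2)
# No random direction does better than the minimizer:
for _ in range(100):
    v = rng.normal(size=d)
    assert ratio(v) >= s[-1] ** 2 - 1e-9
```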
Then, ∥Av∥ = ∥UΣVᵀv∥ = ∥U(ΣVᵀv)∥ = ∥ΣVᵀv∥ since U is an orthogonal matrix. Similarly, since V is orthogonal as well, let z = Vᵀv, so that ∥z∥ = ∥v∥. Thus,

min_{v≠0} ∥Av∥²/∥v∥² = min_{z≠0} ∥Σz∥²/∥z∥² = min_z ∥Σz∥² such that zᵀz = 1.

But since Σ is diagonal, ∥Σz∥² = Σ_{i=1}^{d} σ_i² z_i², under the constraint that ∥z∥² = 1. It is then easy to observe that the minimum value attained is σ_d² = σ_min², attained when the input vector to f is the right-singular vector corresponding to the minimum singular value of A. We remark that adversarial perturbations, in contrast, can be formulated in this setting as max_{v≠0} ∥Av∥²/∥v∥², with the maximum value given by σ_max² and attained by the right-singular vector corresponding to the maximum singular value of A, highlighting the complementary nature of the two problems as indicated previously in Section 3.2.1. Indeed, the condition number κ of an invertible matrix A is defined as κ = σ_max/σ_min = ∥A∥·∥A⁻¹∥. If the condition number is large, with κ ≫ 1, the matrix A is said to be ill-conditioned, and induces an inescapable compromise: if σ_max = ∥A∥ ≈ 1 (as potentially expected in "robust" networks), then 1/σ_min = ∥A⁻¹∥ ≫ 1 is necessarily large, thereby inducing extreme under-sensitivity along some dimensions; while if 1/σ_min = ∥A⁻¹∥ ≈ 1, then σ_max = ∥A∥ ≫ 1 and the model is extremely over-sensitive, akin to the phenomenon of adversarial vulnerability.

3.4 Proposed Method: Level Set Traversal

We now describe our algorithm for traversing the level set, which we call the Level Set Traversal (LST) algorithm (Algorithm 1). We try to find a path from a source image to a target image such that all points on that path are classified by the model as the source class with high confidence. Given that these (d−1) dimensional level set submanifolds can potentially be highly complex in their geometries, we use a discretized approximation with small yet finite step sizes to tractably explore these regions.
Let the source image be x_s with true label y_s, and let g = ∇_x CE(f(x), y_s) be the gradient of the cross-entropy loss of the model prediction with respect to x. The key idea is to get as close as possible to a target image x_t from some image x by computing the projection of ∆x = x_t − x onto the orthogonal complement of g, to obtain a new image x_new that leaves the model confidence unchanged. We can compute this projection by subtracting the component of ∆x parallel to g, that is, x_∥g = g(∆xᵀg/∥g∥²) (L6). Then, using a scale factor η, the vector update to x can be expressed as ∆x⊥ = η(∆x − x_∥g) (L7), and therefore x_new = x + ∆x⊥. Starting from the source image x_s, we repeatedly perform this iteration to get reasonably close to the target image, while carefully ensuring that the confidence of the model prediction at any given point does not drop below a preset confidence threshold δ relative to the source image (L10).

Figure 3.1: Pictorial representation of the LST algorithm. The contour lines represent level sets of the model confidence f(x), with blue representing high confidence (low loss) and red representing low confidence (high loss). At each iteration, we obtain x_new by adding two vectors, ∆x⊥ = η(∆x − c_∥g) (the projection of ∆x onto the orthogonal complement of g) and x_∥ (a small perturbation to increase the confidence and remain within the level set).

Figure 3.2: Intermediate images over the path traversed by the LST algorithm using a source image of a ‘goose’ and a target image of a ‘Scottish terrier’ (a dog) for a normally trained ResNet-50 model. We observe that the model predicts the ‘goose’ class with very high confidence for all images over the path, though the target blindspot found by LST clearly appears as a ‘dog’ to human oracles.
Algorithm 1 Level Set Traversal (LST)
1: Input: Source image x_s with label y, target image x_t, model f, max iterations m, scale factor η, step size ε, confidence threshold δ
2: Initialize x = x_s, x_∥ = 0
3: for i = 1 to m do
4:   ∆x = x_t − x
5:   g = ∇_x CE(f(x), y)
6:   c_∥ = (g · ∆x)/∥g∥²
7:   ∆x⊥ = η(∆x − c_∥ g)
8:   x_∥ = Π∞(x_∥ − εg, −ε, ε)
9:   x_new = x + ∆x⊥ + x_∥
10:  if f(x_s)[j] − f(x_new)[j] > δ then
11:    return x
12:  end if
13:  x = x_new
14: end for
15: return x

While the above method works well with a relatively large η if the curvature of f is low, it risks a non-trivial drop in model confidence (or even a change in the model prediction) if the curvature of f at x is high enough. Therefore, after each step, we add a small perturbation vector x_∥ in the direction of −g, scaled by a factor of ε. This step decreases the cross-entropy loss, and thus increases the confidence so as to offset any confidence drops incurred due to the addition of ∆x⊥. Thus, we can maintain a higher model confidence over the path. In order to ensure that the norm of this perturbation is not arbitrarily large, we bound its ℓ∞ norm by projecting the vector onto an ℓ∞ ball of radius ε. Then, the step x_∥ can be computed as clamp(εg, −ε, ε), where the function clamp(v, a, b) is the element-wise application of the function min(max(·, a), b) to all elements of v. Thus, we modify L9 so that now x_new = x + ∆x⊥ − x_∥. We present a pictorial schematic of Algorithm 1 in Fig 3.1. We further employ an exponential moving average of x_∥, in order to smoothen components that potentially undulate heavily during the iterative traversal. Since the frequent changes in x_∥ are thus smoothened out, we find that the final output images are often linearly connected to the source image with high confidence over the linear interpolations (see Fig 3.6). We now apply this algorithm to explore the level sets of standard and robust ResNet-50 models.
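To make the procedure concrete, here is a minimal numpy sketch of Algorithm 1 on a toy softmax-over-linear-logits "model". This is our own stand-in (the experiments in this chapter use ResNet/DeiT models), it omits the exponential moving average of x_∥, and the hyperparameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
d, N = 20, 3
W = rng.normal(size=(N, d)) * 0.3            # toy classifier: f(x) = softmax(W x)

def f(x):
    z = W @ x
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_ce(x, y):
    # gradient of CE(f(x), y) with respect to x, for the toy model
    return W.T @ (f(x) - np.eye(N)[y])

def lst(x_s, y, x_t, m=200, eta=0.05, eps=0.001, delta=0.1):
    """Level Set Traversal (Algorithm 1), without the EMA smoothing."""
    x, x_par = x_s.copy(), np.zeros(d)
    p_src = f(x_s)[y]
    for _ in range(m):
        dx = x_t - x
        g = grad_ce(x, y)
        c = (g @ dx) / (g @ g)                       # L6
        dx_perp = eta * (dx - c * g)                 # L7: move orthogonal to g
        x_par = np.clip(x_par - eps * g, -eps, eps)  # L8: confidence-restoring step
        x_new = x + dx_perp + x_par                  # L9
        if p_src - f(x_new)[y] > delta:              # L10: confidence floor
            return x
        x = x_new
    return x

x_s = rng.normal(size=d)
y = int(np.argmax(f(x_s)))
x_t = rng.normal(size=d)
out = lst(x_s, y, x_t)
# Source-class confidence stays within delta of the source image's confidence,
# while the iterate has moved toward the target image
assert f(out)[y] >= f(x_s)[y] - 0.1
assert np.linalg.norm(out - x_t) <= np.linalg.norm(x_s - x_t)
```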
In Fig 3.2, we present the path traversed by the LST algorithm, wherein the model predicts the source ‘goose’ class with very high confidence over the entire path, though the target blindspot found by LST clearly appears as a ‘dog’ to human oracles. To substantiate the efficacy of the LST algorithm, we randomly select five images from five arbitrary ImageNet classes (‘goose’, ‘Scottish Terrier’, ‘meerkat’, ‘academic gown’, ‘cleaver’), and compute LST blindspots for all possible source-target image pairs. We show the final images output by the LST algorithm in Fig 3.3 as an image-grid. If we order the five selected images, the ith row and jth column of the image-grid is the LST output obtained using the ith image as target and the jth image as source. Thus, each column represents the source image being transformed iteratively into the other target images.

Figure 3.3: The images returned by LST for 5 random source and target images for normally trained (left) and adversarially trained (right) ResNet-50 models (grid axes: class label predicted by the model vs. class label perceived by humans). The image in the ith row and jth column is the LST output obtained using the ith image as target and the jth image as source. The confidence of the model prediction for the source class (names on top of each column) is displayed just below each image. For example, all four LST output blindspots in the first column (highlighted), using the source ‘goose’ image, are predicted to be of the ‘goose’ class with very high confidence. Diagonal images are unchanged, as source equals target. We observe that almost any source image can be iteratively modified using LST to resemble any target image very closely without any loss in confidence for both normal and adversarially trained ResNet-50 models.

The confidence of the
model prediction for the source class (names on top of each column) is displayed just below each image. For both the normally trained and adversarially trained models, these images are almost indistinguishable from the target while retaining high model confidence for the source class. Since adversarially robust models have perceptually aligned gradients, we can sometimes visually notice a few traces of the source image in the final LST image; for example, the ‘meerkat’ image in the 3rd row, 2nd column on the right side of Fig 3.3 has some traces of the source ‘terrier’ image, but the differences are usually hard to perceive. We also examine the model confidence over linear interpolations between the source image and LST outputs for all target pairs in Fig 3.6. Formally, consider a source image x_s with label y, and let x_op represent the LST output when applied toward a target image x_t. Denote the difference vector as ∆v = x_op − x_s. Then, we observe that x_s + α∆v is assigned high confidence with respect to class y by the model ∀ α ∈ [0,1], which represents the entire linear interpolant path between x_s and x_op. Furthermore, we observe that the path discovered by the Level Set Traversal algorithm enjoys two key properties: (1) Uniqueness (once the target image is fixed) and (2) Extremality:

(1) Uniqueness: Since the local tangent space of the level set is (d−1) dimensional, several independent directions are orthogonal to the local gradient, which a priori do not yield a unique path like a gradient-flow. However, once we fix our target image, we use its difference vector with respect to the current iterate (L4) and compute its projection onto the local tangent space (L7) of the level set, thereby generating a uniquely defined path.
(2) Extremality: Though this flow-based path may be non-linear, we additionally discover that the final output-point of this flow is surprisingly linearly connected with high confidence to the source image, after we apply discretized approximations in practical settings for common vision models. Formally, the LST output x_op is linearly extremal in the sense that x_s + (1+ε)∆v is rapidly assigned low confidence by the model even for extremely small values of ε > 0, where ∆v = x_op − x_s. Thus, using the LST algorithm, we find that the level sets of common models extend outwards in an expansive, connected manner to include images of arbitrary classes, which human oracles would never state as being similar. Since the linear path from any given source image to LST outputs for arbitrary target images retains high model confidence throughout, it unveils a remarkable star-like substructure for superlevel sets, as shown in Fig 3.4, where the number of “limbs” or linear protuberances of the star-like structure is extraordinarily large, plausibly as large as the number of images in all other classes. Furthermore, to study the size of the level sets beyond one-dimensional interpolant paths, we analyze the two-dimensional triangular convex hull between a given source image and two LST output blindspot images, using quantitative metrics in Section 3.6.

3.5 Disconnected Nature of Standard Adversarial Attacks

At first glance, the images output by the LST algorithm may seem similar to those produced using targeted variants of standard adversarial attacks. In this section, we explore the connectivity of high-confidence paths as induced by standard adversarial attacks, and show that adversarially attacked source images are not linearly connected via high-confidence paths to their target images. In detail, let (x1, y1) and (x2, y2) be any two data samples from different classes (y1 ≠ y2).
A targeted adversarial attack [51] on input x1 with respect to class y2 is formulated as solving ε12 = argmin_{ε12 : ∥ε12∥ ≤ ε} CE(f(x1 + ε12), y2). A related variant, called a feature-level targeted adversarial attack [326], instead uses intermediate network activations to craft image perturbations. If f_|L(x) represents the network activations from hidden layer L for an input x, then this attack attempts to match features by optimizing ε12 = argmin_{ε12 : ∥ε12∥ ≤ ε} ∥f_|L(x1 + ε12) − f_|L(x2)∥. The hidden layer L that is selected often corresponds to pre-activations of the final fully-connected layer, and thus often induces misclassification.

Figure 3.4: Schematic of the star-like set substructure of superlevel sets: The linear interpolant paths between the source image x_s and blindspots found using LST maintain high confidence throughout for arbitrary target images of other classes.

Figure 3.5: Model Confidence over Linear Interpolant Paths: We plot the median prediction confidence with respect to class y2 over linear paths between the target benign sample (x2) and adversarial examples (such as x1 + ε12 targeted towards input x2) in input space. While model confidence falls to near-zero using standard targeted adversarial attacks, the linear path between the LST output and source image has high confidence of almost 1.0 throughout.

Using this framework, we can thus analyze the connectivity of high-confidence paths with respect to class y2 between the target benign sample (x2) and adversarial examples (such as x1 + ε12 targeted towards x2), using the linear interpolant path between the two. By doing so, we can analyze the level set with respect to class y2 using targeted adversarial examples alone, though the perturbation utilized is inherently norm-limited. In Fig 3.5, we plot the median confidence of a normally trained ResNet-50 model over the linear interpolant paths for 1000 source-target pairs.
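For intuition, the targeted objective argmin_{∥ε∥∞ ≤ ε} CE(f(x1 + ε12), y2) can be sketched as projected gradient descent on a toy softmax model. This is our own stand-in (the chapter's experiments attack a ResNet-50), and the budget and step size below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
d, N = 20, 3
W = rng.normal(size=(N, d)) * 0.3            # toy model: f(x) = softmax(W x)

def f(x):
    z = W @ x
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_ce(x, y):
    return W.T @ (f(x) - np.eye(N)[y])

def targeted_attack(x, y_tgt, eps=1.0, steps=300, lr=0.1):
    # eps_12 = argmin_{||e||_inf <= eps} CE(f(x + e), y_tgt), via projected descent
    e = np.zeros(d)
    for _ in range(steps):
        e -= lr * grad_ce(x + e, y_tgt)
        e = np.clip(e, -eps, eps)            # project back onto the l_inf ball
    return e

x1 = rng.normal(size=d)
y2 = (int(np.argmax(f(x1))) + 1) % N         # target a class other than the prediction
e12 = targeted_attack(x1, y2)
assert int(np.argmax(f(x1 + e12))) == y2     # the attacked input is classified as y2
```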
We find that though the model confidence with respect to class y2 is high for both the target benign input x2 and the targeted adversarially attacked input x1 + ε12 (that is, at the end-points of the linear path), the model does not maintain high confidence over the linear interpolant path between the two, but sharply declines to a valley of near-zero confidence over the path. For specific inputs, we often observe sharp, erratic diminutions in target-class prediction confidence over the convex combination, with a non-trivial region of near-zero confidence. This contrasts sharply with the linear connectivity observed between a source image and LST blindspot images. We thus crucially note that adversarial attacks are insufficient on their own to study the structure of level sets of common vision models. However, we incorporate them into our proposed Level Set Traversal algorithm in a fruitful manner.

3.6 Quantifying Under-Sensitivity of Vision Models

In this paper, we primarily consider standard vision datasets such as ImageNet [89] and CIFAR-10 [201] (the latter in Section 11.1 of the Appendix). We thereby explore the “blindspots” of popular vision models such as ResNet [146] and Vision Transformers [98, 374]. Furthermore, we explore the connectivity of such level sets on normally trained variants of such networks, as well as their adversarially robust counterparts. For the latter, we analyze robust models trained adversarially against (4/255) ℓ∞-constrained adversaries, available on RobustBench [78]. Specifically, we utilize a robust ResNet-50 model from [333] and a robust DeiT model from [353]. We also fix the hyperparameters of LST across all models for a fair comparison (detailed in Section 11.1 of the Appendix).

Image Quality Metrics: We now proceed to quantitatively verify our observations in Section 3.4.
First, to help quantify the deviation between target images and blindspot outputs generated by our proposed Level Set Traversal algorithm, we utilize a combination of classical image metrics, such as RMSE, ℓ∞ distance, and the Structural Similarity Index (SSIM), and perceptual measures, such as LPIPS distance using AlexNet [437]. For SSIM, a higher value indicates a closer match, while for all other metrics, a lower value indicates a closer match. To calculate these metrics, we sample around 1000 source images from ImageNet, and select five other random target images of different classes for each source image. We present the image quality metrics for blindspots discovered by LST in Table 3.1. Here, the standard deviation is over the different randomly chosen source and target images.

Metrics for Model Confidence: To evaluate the extent of the model invariance over the regions between the LST outputs and the source image, we evaluate the model confidence with respect to the source class over one-dimensional linear interpolant paths, as well as over two-dimensional subspaces. For the latter, we evaluate the model confidence over the triangular convex hull obtained by linear interpolation over three reference points, namely the source image and two target blindspot images produced using LST. For example, the input at the ‘centroid’ of the triangle formed by a source image and a pair of target blindspots is the arithmetic mean of the three images. We visualize these in Fig 3.6, wherein the prediction confidence (in the range [0,1]) assigned by the model with respect to the source class is mapped to a continuous colorbar, with high-confidence points (close to 1.0) appearing as bright yellow, and low-confidence points (close to 0.0) appearing as dark violet.
Specifically, we use the following metrics: (a) Average Triangle (∆) Confidence: the mean of the model’s source-class confidence over the enclosed triangle; (b) Average Triangle (∆) Fraction for various values of δ: the fraction of inputs in the triangular region for which the model confidence is greater than p_src − δ, averaged over all possible target blindspot pairs, where p_src is the confidence on the source image; (c) Average Path Confidence: the average model confidence over all linear paths from the source image to all LST blindspot images. The higher these metrics, the more confident, and thus invariant, the model is in this region. For computing these metrics, we use linear interpolations between the source images and the 5 LST outputs found previously for computing the distance metrics. We thus use C(5,2) = 10 triangles for each source image, and sample each triangular area in an equispaced manner to obtain 66 images for the computation of the triangle (∆) metrics. We present these metrics in Table 3.2, along with the mean (and standard deviation) of model confidence on the source class images (p_src) for reference.

Table 3.1: Quantitative image distance metrics between outputs of Level Set Traversal and target images.

Models             | RMSE (µ±σ)    | ℓ∞ dist (µ±σ) | SSIM (µ±σ)    | LPIPS dist (µ±σ)
ResNet-50 (Normal) | 0.008 ± 0.001 | 0.046 ± 0.020 | 0.990 ± 0.021 | 0.002 ± 0.004
ResNet-50 (AT)     | 0.029 ± 0.008 | 0.746 ± 0.124 | 0.915 ± 0.041 | 0.057 ± 0.037
DeiT-S (Normal)    | 0.011 ± 0.002 | 0.116 ± 0.030 | 0.973 ± 0.024 | 0.024 ± 0.017
DeiT-S (AT)        | 0.046 ± 0.010 | 0.821 ± 0.117 | 0.898 ± 0.041 | 0.219 ± 0.068

Table 3.2: Quantitative confidence metrics over the triangular convex hull (∆) of a given source image and two target LST blindspot image-pairs, and over linear interpolant paths between source and blindspot images. (For reference, a random classifier would have a confidence of 0.001.)
Models             | p_src (µ±σ) | Avg ∆ Conf. (µ±σ) | Avg ∆ Frac. (µ±σ), δ = 0.0 | δ = 0.1 | δ = 0.2 | δ = 0.3 | Avg Path Conf. (µ±σ)
ResNet-50 (Normal) | 0.99 ± 0.02 | 0.56 ± 0.10 | 0.13 ± 0.15 | 0.51 ± 0.11 | 0.53 ± 0.1 | 0.54 ± 0.10 | 0.96 ± 0.05
ResNet-50 (AT)     | 0.88 ± 0.11 | 0.83 ± 0.09 | 0.49 ± 0.29 | 0.79 ± 0.13 | 0.85 ± 0.1 | 0.88 ± 0.09 | 0.93 ± 0.06
DeiT-S (Normal)    | 0.85 ± 0.06 | 0.68 ± 0.05 | 0.54 ± 0.11 | 0.67 ± 0.06 | 0.71 ± 0.06 | 0.73 ± 0.06 | 0.94 ± 0.02
DeiT-S (AT)        | 0.76 ± 0.08 | 0.59 ± 0.07 | 0.20 ± 0.09 | 0.43 ± 0.14 | 0.63 ± 0.15 | 0.76 ± 0.12 | 0.73 ± 0.06

Here, the standard deviation is over the different randomly chosen source images. We can now quantitatively confirm many of the trends we qualitatively observed in Section 3.4. In Table 3.1, we observe that the LST outputs found for the normally trained models are closer to the targets than those for the adversarially trained models. The difference is particularly stark when comparing the ℓ∞ distances, as ℓ∞ is a particularly sensitive metric and is also the threat model against which the robust models were trained. However, for other distance metrics like RMSE or LPIPS, the difference between the normally trained and adversarially trained ResNet-50 is not as large. In particular, the LPIPS distances for both models are low, which indirectly implies that the perceived difference between the images for human oracles is relatively low. This can also be confirmed by visually inspecting the images in Fig 3.3. For the normally trained DeiT-S, the distance metrics are very similar to those of ResNet-50, with slightly higher RMSE and LPIPS distances.

Figure 3.6: Visualization of confidence of standard (top) and robust (bottom) ResNet-50 models over the triangular convex hull of a source ‘goose’ image and two LST outputs for all pairs of target images from 4 other classes (same images as in Fig 3.3).
In all source-target image pairs, the linear interpolant path maintains high confidence, implying that the source image is linearly connected to the LST target output in a star-like substructure within the level set. For adversarially trained models, we observe that a significant fraction of the triangular hull lies in the superlevel sets of high confidence, thereby indicating their under-sensitivity in regions far beyond their original threat model.

However, the adversarially trained (AT) variant is significantly less under-sensitive compared to both its normal counterpart and ResNet-50 (AT). Specifically, the LPIPS distance metric is much greater for DeiT-S (AT), which implies there exist significant human-perceptible differences in the images. We can confirm this in Fig 11.1, where the LST output for the adversarially trained DeiT-S model contains visible traces of the source image. Nevertheless, the LST outputs are still clearly much closer to the target class than to the source, which indicates that there are still significant blind spots in DeiT-S (AT). However, when we measure the extent of model invariance over the convex regions enclosed between the LST output and the target images, we find that the adversarially trained ResNet-50 is overall more invariant (or under-sensitive) than the normally trained variant. For example, the average triangle confidence for ResNet-50 (AT) is higher than that of the normally trained ResNet-50, even though its source confidence is much lower. We also find that a much larger fraction of the triangular convex hull lies within the superlevel set for p_src − δ for ResNet-50 (AT) as compared to the normal ResNet-50, for all values of δ. The average path confidence is much closer to p_src for ResNet-50 (AT) as compared to the normally trained ResNet-50. This quantitatively verifies the observation made in Fig 3.6 that adversarial training demonstrably exacerbates under-sensitivity.
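The equispaced triangular sampling behind these ∆ metrics can be sketched in a few lines. The code below is our own illustration (the arrays are stand-ins, not real images); with a barycentric resolution of n = 10 it yields exactly the 66 samples per triangle and the C(5,2) = 10 triangles per source image used in the evaluation.

```python
import numpy as np
from itertools import combinations
from math import comb

def triangle_samples(x_src, x_b1, x_b2, n=10):
    # Equispaced barycentric grid over the convex hull of the three images:
    # points (i*x_src + j*x_b1 + k*x_b2)/n with i + j + k = n, giving
    # (n + 1)(n + 2)/2 samples (66 for n = 10)
    pts = [(i * x_src + j * x_b1 + (n - i - j) * x_b2) / n
           for i in range(n + 1) for j in range(n + 1 - i)]
    return np.stack(pts)

d = 8
x_src = np.zeros(d)
blindspots = [np.eye(d)[i] for i in range(5)]     # stand-ins for 5 LST outputs
pairs = list(combinations(blindspots, 2))
assert len(pairs) == comb(5, 2) == 10             # 10 triangles per source image

grid = triangle_samples(x_src, *pairs[0])
assert grid.shape == (66, d)
# Each sample is a convex combination of the three reference images, so an
# "average triangle confidence" is just the mean model confidence over the grid.
```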
Interestingly, these robust models are under-sensitive over subsets of the input domain that lie well beyond the original threat model used in their training. Moreover, between the normally trained DeiT-S and ResNet-50 models, the former appears to be more invariant, with greater average confidence over the triangular convex hulls, despite having lower source image confidences p_src. For the robust variant of DeiT-S, however, the trend is less apparent due to the significantly lower average source image confidences p_src. However, the average relative ∆ fraction becomes higher for larger values of δ (such as 0.3), indicating that the superlevel sets are indeed expansive, albeit for lower confidence thresholds.

3.7 Conclusion I

While the existence of level sets alone is not very significant in itself, using the proposed LST algorithm, we find that the level set for common vision models is remarkably expansive — large enough to contain inputs that look near-identical to arbitrary target images from other classes. Since the linear path from any given source image to LST blindspot outputs retains high model confidence throughout, the level sets have a star-like connected substructure, where the number of ‘limbs’ or linear protuberances of the star-like structure is extraordinarily large, plausibly as large as the number of images in all other classes. This is considerably noteworthy since it indicates the hitherto unknown and unappreciated scale and extent of under-sensitivity in common vision models. Moreover, this hints at the potential degree of difficulty of adequately mitigating this phenomenon in practical settings.
For instance, if the level set for images of class y1 contained sizable protuberances towards only one other class y2 alone, the problem could perhaps be tackled by introducing a contrastive training objective that encourages the network to better discriminate between y1–y2 image pairs by utilizing a denser sampling of related image augmentations, likely resulting in the diminution of these specific ‘directed’ protuberances (assuming reasonable train-test generalization). But since the star-like connected substructure uncovered by LST implies that such protuberances exist towards any generic image of any other class, such simple approaches will likely be ineffective and possibly computationally infeasible from a combinatorial perspective. Thus, based on the observations uncovered with LST, we hypothesize that addressing the pervasive issue of under-sensitivity in conventional vision models might present a significantly non-trivial challenge.

3.8 Provable Robustness For Streaming Models

Deep neural network (DNN) models are increasingly being adopted for real-time decision-making tasks. They are often required to make predictions on an evolving stream of inputs in applications like algorithmic trading [112, 196, 198, 288, 435], human action recognition [285, 317, 415] and speech detection [90, 132, 156]. However, DNNs are known to malfunction under tiny perturbations of the input, such as an imperceptible noise added to an image, designed to fool them into making incorrect predictions [28, 54, 128, 256, 366]. This vulnerability is not limited to static models like classifiers, and has been demonstrated for streaming models as well [24, 25, 38, 270]. Such input corruptions, commonly known as adversarial attacks, make DNNs especially risky for safety-critical applications such as health monitoring [44, 167, 222, 360] and autonomous driving [34, 175, 412]. Over the years, a long line of research has been dedicated to mitigating this weakness of DNNs.
These methods seek to improve the robustness of a model by introducing input corruptions during training [41, 96, 124, 133, 138, 211, 235, 330]. However, such empirical defenses have been shown to break down under newer and stronger attacks [16, 54, 377, 380]. This motivated the study of provable robustness in machine learning (ML), which seeks to obtain verifiable guarantees on the adversarial performance of a DNN. Several certified robustness techniques have been developed over the years, most notable of which are based on convex relaxation [63, 309, 354, 356, 407], interval-bound propagation [103, 130, 162, 266] and randomized smoothing [71, 221, 225, 228, 334]. However, most of this work focuses on static tasks with independently generated inputs, and the adversarial streaming setting still remains open. What makes the streaming setting more challenging is that the adversary can choose the attack budget based on previous inputs. For instance, it could wait for a critical decision-making point, such as a trading algorithm making a buy/sell recommendation or an autonomous vehicle approaching a stop sign, before generating an adversarial perturbation. In this work, we derive provable robustness guarantees for the streaming setting, where inputs are presented as a sequence of potentially correlated items. We design certificates that produce guarantees on the average model performance over long, potentially infinite, data streams. Our threat model is defined as a man-in-the-middle adversary present between the DNN and the data stream. It can perturb the input items before they are observed by the DNN. The adversary is constrained by a limit on the average size of the perturbation added to the inputs. We show that a DNN that randomizes the inputs before making predictions is guaranteed to achieve a