ABSTRACT Title of dissertation: INTERPRETING DEEP LEARNING MODELS AND UNLOCKING NEW APPLICATIONS WITH IT Samyadeep Basu, Doctor of Philosophy, 2025 Dissertation directed by: Professor Soheil Feizi University of Maryland In recent years, modern deep learning has made significant strides across various domains, including natural language processing, computer vision, and speech recognition. These advancements have been driven by innovations in scaling pre-training data, developing new model architectures, integrating distinct modalities (e.g., vision and language, audio and language), and employing modern engineering practices. However, despite these innovations in building better models, progress in understanding these models to enhance their reliability has been relatively slow. In this thesis, we lay the groundwork for interpreting modern deep learning models—such as vision, text-to-image, and multimodal language models—by examining them through the perspectives of data and internal model components. We aim to unlock various capabilities, including model editing and model steering, to enhance their reliability. First, we build on the principles of robust statistics to interpret test-time predictions by identifying important training examples using higher-order influence functions. However, we find that influence functions can be fragile for large deep models, which limits their practical applications. To address this, we develop optimization-based data selection strategies to automatically generate stress-testing sets from large vision datasets, testing the reliability of vision models within a few-shot learning framework. Overall, our investigations show that while analyzing models through the lens of data provides valuable insights for potential improvements, it does not offer a direct method for controlling and enhancing the reliability of these models. To this end, we investigate deep models by focusing on their internal components. We develop causal mediation analysis methods to understand knowledge storage in text-to-image generative models like Stable Diffusion. Based on these insights, we create novel model editing techniques that can remove copyrighted styles and objects from text-to-image models with minimal weight updates. We scale these methods to edit large open-source models such as SD-XL and DeepFloyd. As a follow-up, we then introduce innovative causal mediation analysis methods and a richly annotated probe dataset to interpret multimodal large language models like LLaVa. Our approach allows us to understand how these models internally retrieve relevant knowledge for factual Visual Question Answering (VQA) tasks. Leveraging these insights, we develop a novel model editing method that can effectively introduce rare, long-tailed knowledge or correct specific failure modes in multimodal large language models. Using similar principles, we explore vision models (in particular the ViT architecture), developing methods to interpret image representations based on internal components such as attention heads, using text descriptions. We apply these interpretability insights to (i) mitigate spurious correlations, (ii) enable zero-shot segmentation, and (iii) facilitate text or image-conditioned image retrieval. We also extend our mechanistic interpretability techniques to understand and control language models for real-world tasks, such as context-augmented generation in question-answering systems (i.e., extractive QA).
In particular, we find that insights from mechanistic circuits can be useful towards context-data attribution and model steering towards improved context faithfulness. Finally, we leverage interpretability insights from multimodal models to enhance their compositionality in image-conditioned text retrieval and text-guided image generation. For vision-language models (VLMs) like CLIP, we propose a distillation method that transfers compositional knowledge from diffusion models to CLIP. For diffusion models, we introduce a lightweight fine-tuning approach that learns a linear layer on the conditioning text encoder, improving compositional generation for attribute binding. Overall, our thesis designs and adapts interpretable methods and leverages interpretable insights to uncover various capabilities in pre-trained models. INTERPRETING DEEP LEARNING MODELS AND UNLOCKING NEW APPLICATIONS WITH IT by Samyadeep Basu Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2025 Advisory Committee: Professor Soheil Feizi, Chair/Advisor Professor Hernisa Kacorri, Dean’s Representative Professor Furong Huang, Professor Abhinav Shrivastava, Committee Members Dr. Varun Manjunatha, Dr. Daniela Massiceti, External Members © Copyright by Samyadeep Basu 2025 Dedication To my parents, partner and friends for their love and support. ii Acknowledgments First and foremost, I would like to express my deepest gratitude to Dr. Soheil Feizi for his invaluable guidance and unwavering support throughout my PhD. His mentorship has been one of the most significant factors in the successful completion of my dissertation. I arrived at the University of Maryland in 2018 to pursue a Master’s degree, initially intending to focus on Computational Biology. However, due to various circumstances, that path did not materialize. While searching for an advisor, Dr. Feizi took a chance on me in my second year and guided me through my first research project, which led to a submission at AISTATS. With no prior background in machine learning or deep learning, the project was a steep learning curve, but Dr. Feizi’s encouragement ensured that I persevered. Due to personal circumstances, I had to defer my PhD admission, and Dr. Feizi was incredibly supportive of my decision. After completing my Master’s, he gave me the freedom to explore diverse topics in deep learning, allowing me to build a broad foundation. This combination of intellectual freedom and strong mentorship was instrumental in helping me publish at top conferences and secure valuable internships. I am profoundly grateful to Dr. Feizi for his guidance, patience, and support throughout my PhD journey—his influence has truly shaped my academic and professional path. The PhD journey can be long and often lonely, and I am incredibly grateful to have had the unwavering support of my parents, partner—now my wife—Sneha and in-laws. Sneha has stood by me through my lowest moments in ways no one else could, and for that, I will always be indebted to her. During my time in Maryland, I was fortunate to have the support of a wonderful group of friends who made this journey fulfilling. My College Park friends—Aman, Shlok, Ameya, Ryan, Naman, Anjali, Sai, Noor, Vasu, Komal, Sanchita, Yatharth, Neha, Pavan, Amanpreet, Ishita, Shishira, Shramay, Anshul, Susmija, Pranav, Ketul, Aadesh, and Manas—played an integral role in making these years memorable. 
I am also deeply thankful to my friends from my undergraduate days and beyond—Aditya, Siddhant, Dhairya, Srajit, Parikshit, Vandit, Surbhi, Dewanshu, Anish, Kunal, Fabian, and Himanshu—whose regular conversations and encouragement kept me going. Their support made all the difference in this journey, and I am truly grateful to have them in my life. During my PhD, I also had the opportunity to do internships at Microsoft Research and Adobe Research. From MSR, I would particularly like to thank Dr. Daniela Massiceti, who supported me not only during my internship but also beyond it, as a mentor throughout my PhD. Even though there was a time difference, she made sure to schedule regular meetings to mentor me and carve a path towards a successful PhD. From Adobe, Varun has been the main motivator for me to work on interpretability. His philosophy of reverse engineering large models has shaped my PhD and is, in fact, one of the core parts of this thesis. He has not only supported me on projects, but has also been a guiding light towards a successful PhD and the transition beyond it. I will forever be indebted to both Varun and Daniela. I can easily say that they have turned from great mentors to friends along the way—for which I am grateful. Finally, I would like to thank my amazing labmates, without whom this PhD would not have been possible. Table of Contents Dedication ii Acknowledgements iii 1 Introduction 1 1.1 Thesis Statement . . . . . . . . . . 1 1.2 Thesis Overview . . . . . . . . . . 1 1.3 Thesis Contributions . . . . . . . . . . 4 1.4 Publications and Authorship . . . . . . . . . . 8 2 Related Work 10 2.1 Interpreting Test-Time Predictions Through Influence Functions . . . . . . . . . . 10 2.2 Automatically Designing Difficult Few-Shot Benchmarks for Model Reliability 11 2.3 Mechanistically Understanding and Editing Text-to-Image Diffusion Models 12 2.4 Mechanistically Understanding and Editing Multimodal Language Models . . . . . . . . . . 13 2.5 Mechanistically Understanding and Unlocking Zero-Shot Capabilities in Vision Transformers . . . . . . . . . . 15 2.6 Mechanistic Circuits for Extractive Question-Answering . . . . . . . . . . 16 2.7 Improving Compositionality in Multimodal Models . . . . . . . . . . 17 2.7.1 Compositionality in CLIP . . . . . . . . . . 17 2.7.2 Compositionality in Text-to-Image Models . . . . . . . . . . 17 3 Interpreting Test-Time Predictions With Influence Functions 19 3.1 Introduction . . . . . . . . . . 19 3.2 Background . . . . . . . . . . 22 3.3 Group Influence Function . . . . . . . . . . 24 3.4 Computational Complexity . . . . . . . . . . 31 3.5 Experiments . . . . . . . . . . 32 3.5.1 Setup . . . . . . . . . . 32 3.5.2 Datasets . . . . . . . . . . 32 3.5.3 Observations and Analysis . . . . . . . . . . 33 Linear Models . . . . . . . . . . 33 Neural Networks . . . . . . . . . . 34 3.6 Conclusion for Second-Order Group Influence Functions . . . . . . . . . .
35 3.7 Influence Functions in Deep Learning . . . . . . . . . . . . . . . . . . . . 36 3.8 Basics of Influence Function . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.9 What Can Go Wrong for Influence Functions In Deep Learning? . . . . . . 40 3.10 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.10.1 Understanding Influence Functions when the Exact Hessian Can be Computed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.10.2 Understanding Influence Functions in Shallow CNN Architectures . 45 3.10.3 Understanding Influence Functions in Deep Architectures . . . . . . 47 3.10.4 Is Scaling Influence Estimates To ImageNet Possible? . . . . . . . . 49 3.11 Conclusion for Influence Functions in Deep Learning . . . . . . . . . . . . 51 4 Automatically Designing Difficult Few-Shot Benchmarks for Model Reliability 52 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.2 Few-Shot Classification: Preliminaries and Notations . . . . . . . . . . . . 55 4.3 FASTDIFFSEL: An Efficient Algorithm to Select Difficult Support Sets . . . 56 4.3.1 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.4 Difficult Support Set Extraction on META-DATASET . . . . . . . . . . . . . 60 4.4.1 Test task samplers for META-DATASET . . . . . . . . . . . . . . . . 61 4.4.2 Validation of difficult META-DATASET tasks . . . . . . . . . . . . . 62 4.5 Stress Testing With HARD-META-DATASET++ . . . . . . . . . . . . . . . . 63 4.5.1 Test datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.5.2 Metrics and training . . . . . . . . . . . . . . . . . . . . . . . . . . 65 4.5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 Results on Difficult Tasks from META-DATASET . . . . . . . . . . . 66 Results on Difficult Tasks from CURE-OR, ORBIT and OBJECTNET . 68 4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 5 Mechanistically Understanding and Editing Text-to-Image Generative Models 71 5.1 Knowledge Localization and Model Editing in Early Stable-Diffusion Variants 71 5.2 Causal Tracing for Text-to-Image Generative Models . . . . . . . . . . . . 75 5.2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 5.2.2 Adapting Causal Tracing For Text-to-Image Diffusion Models . . . 76 5.2.3 Tracing Knowledge in UNet . . . . . . . . . . . . . . . . . . . . . 77 5.2.4 Tracing Knowledge in the Text-Encoder . . . . . . . . . . . . . . . 79 5.2.5 Extracting Causal States Using CLIP-Score . . . . . . . . . . . . . 80 5.3 How is Knowledge Stored in Text-to-Image Models? . . . . . . . . . . . . 81 5.4 DIFF-QUICKFIX: Fast Model Editing for Text-to-Image Models . . . . . . 84 5.4.1 Editing Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 5.4.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 85 vi 5.4.3 Editing Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 5.5 Conclusion I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 5.6 Knowledge Localization and Model Editing Across Various Open-Source Text-to-Image Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 5.7 On the Effectiveness of Causal Tracing for Text-to-Image Models . . . . . . 91 5.8 LOCOGEN: Towards Mechanistic Knowledge Localization . . . . . . . . . 93 5.8.1 Knowledge Control in Cross-Attention Layers . . . . . . . . . . . . 94 Altered Inputs . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . 94 LOCOGEN Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 96 5.8.2 Empirical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 5.9 LOCOEDIT : Editing to Ablate Concepts . . . . . . . . . . . . . . . . . . . 101 5.9.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 5.9.2 Model Editing Results . . . . . . . . . . . . . . . . . . . . . . . . 103 5.10 On Neuron-Level Model Editing . . . . . . . . . . . . . . . . . . . . . . . 104 5.11 Conclusion II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 6 Mechanistically Understanding and Editing Multimodal Language Models 109 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 6.2 A Constraint-Based Framework for Studying Information Storage and Trans- fer in MLLMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 6.2.1 A Multi-modal Constraint-based Framework . . . . . . . . . . . . . 113 6.2.2 MULTIMODALCAUSALTRACE: Studying Information Storage in MLLMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 6.2.3 Studying Information Transfer in MLLMs with Attention Contribu- tions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 6.2.4 VQA-Constraints: A Constraint Annotated Test-Bed for VQA . . . 117 6.3 Key Findings in how MLLMs Store and Transfer Information . . . . . . . . 118 6.3.1 Finding 1: Early MLPs and self-attention layers are causal . . . . . 118 6.3.2 Finding 2: Only a subset of visual tokens are involved in transferring information from the image to the early causal MLP layers. . . . . . 120 6.3.3 Finding 3: Mid-layer self-attention layers are involved in transfer- ring information from the early causal layers to the question’s final token . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 6.3.4 Finding 4: Mid-layer self-attention contributions can be used to predict whether a MLLM will generate a correct answer, but model confidence is a more reliable predictor . . . . . . . . . . . . . . . . 121 6.4 Correcting and Inserting Long-Tailed Information in MLLMs . . . . . . . . 122 6.4.1 MULTEDIT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 6.4.2 Experimental details . . . . . . . . . . . . . . . . . . . . . . . . . 124 6.4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 vii 7 Mechanistically Understanding and Unlocking Zero-Shot Capabilities in Vision Transformers 127 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 7.1.1 REPDECOMPOSE: Automated Representation Decomposition for ViTs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 7.2 Aligning the component representations to CLIP space . . . . . . . . . . . 132 7.3 Component ablation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 7.4 Feature-based component analysis . . . . . . . . . . . . . . . . . . . . . . 137 7.4.1 Text based image retrieval . . . . . . . . . . . . . . . . . . . . . . 139 7.4.2 Image based image retrieval . . . . . . . . . . . . . . . . . . . . . 141 7.4.3 Zero-shot spurious correlation mitigation . . . . . . . . . . . . . . 142 7.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 8 A Mechanistic Circuit for Extractive Question-Answering 144 8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . 144 8.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 8.3 Deciphering a Circuit for Extractive QA . . . . . . . . . . . . . . . . . . . 148 8.3.1 Designing the Probe Dataset . . . . . . . . . . . . . . . . . . . . . 150 8.3.2 Interventional Steps for Extracting Circuits . . . . . . . . . . . . . 151 8.3.3 Insights For Extractive QA through Circuits . . . . . . . . . . . . . 154 Context Faithfulness Circuit Differs from Parametric Memory Circuit154 Validation of the Extracted Circuit . . . . . . . . . . . . . . . . . . 155 A Small Set of Attention Heads in the context circuit are interpretable156 One Can Switch Between Memory and Copy Faithfulness Circuits . 156 8.4 Application 1: Attribution for Free Via One Attention Head . . . . . . . . . 158 8.4.1 ATTNATTRIB: A Simple and Strong Data Attribution Method for Extractive QA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 8.4.2 Evaluation on Extractive QA Benchmarks . . . . . . . . . . . . . . 159 8.5 Application 2: Towards Improved Context Faithfulness . . . . . . . . . . . 161 8.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162 9 Improving Compositionality in Multimodal Models 163 9.1 Compositionality in CLIP . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 9.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 9.1.2 Denoising Diffusion Score for Visio-Linguistic Reasoning . . . . . 165 9.1.3 SDS-CLIP: Our Method . . . . . . . . . . . . . . . . . . . . . . . 167 9.1.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 168 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 9.1.5 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170 9.1.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171 9.2 Compositionality in Text-to-Image Diffusion Models . . . . . . . . . . . . 171 viii 9.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171 9.2.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174 9.2.3 Sources of Compositionality Failures . . . . . . . . . . . . . . . . 175 Source (i) : Erroneous Attention Contributions in CLIP . . . . . . . 176 Source (ii) : Sub-optimality of CLIP Text-Encoder for Composi- tional Prompts . . . . . . . . . . . . . . . . . . . . . . . 178 9.2.4 Projection Layer for Enhancing Compositionality in the CLIP Text Embedding Space . . . . . . . . . . . . . . . . . . . . . . . . . . . 180 CLP: Token-wise Compositional Linear Projection . . . . . . . . . . 180 WiCLP: Window-based Compositional Linear Projection . . . . . . 181 9.2.5 SWITCH-OFF: Trade-off between Compositionality and Clean Ac- curacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182 9.2.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183 Qualitative and Quantitative Evaluation . . . . . . . . . . . . . . . 185 9.2.7 Impact of WiCLP on Subsets of Tokens . . . . . . . . . . . . . . . . 187 9.2.8 Alternatives to WiCLP . . . . . . . . . . . . . . . . . . . . . . . . . 187 9.2.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188 10 Conclusion and Future Work 189 10.1 Reading List . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190 10.2 Understanding Model Through the Lens of Data . . . . . . . . . . . . . . . 
190 10.3 Understanding Model Through Internal Model Components . . . . . . . . . 191 10.4 Model Steering or Editing . . . . . . . . . . . . . . . . . . . . . . . . . . . 192 11 Appendix 194 11.1 Interpretation of Models Through Lens of Data . . . . . . . . . . . . . . . 194 11.2 Running Times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194 11.2.1 Faithfulness and Plausibility of Influence functions . . . . . . . . . 195 11.3 Automatically Designing Difficult Few-Shot Tasks . . . . . . . . . . . . . . 196 11.3.1 Support Set Extraction Algorithm . . . . . . . . . . . . . . . . . . 196 Steps For Solving the Projection Step . . . . . . . . . . . . . . . . 196 Hyperparameters of the Framework . . . . . . . . . . . . . . . . . 198 11.4 Mechanistically Understanding and Editing Text-to-Image Models . . . . . 198 11.4.1 Probe Dataset Design Details . . . . . . . . . . . . . . . . . . . . . 198 11.5 Mechanistically Understanding and Editing Multimodal Language Models . 204 11.5.1 VQA-Constraints Details . . . . . . . . . . . . . . . . . . . . . . . 204 11.5.2 Standard Causal Tracing Does Not Recover Causal States . . . . . . 206 11.6 Mechanistically Understanding and Unlocking Zero-Shot Capabilities in Vision Transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207 11.6.1 Scoring Function . . . . . . . . . . . . . . . . . . . . . . . . . . . 207 11.6.2 Proof of Theorem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 207 11.7 A Mechanistic Circuit for Extractive Question-Answering . . . . . . . . . . 210 ix 11.7.1 Note on Second-order Circuit Components . . . . . . . . . . . . . 210 11.7.2 On Modifying Circuit Components . . . . . . . . . . . . . . . . . . 210 11.7.3 Extracted Circuit Components Across Language Models . . . . . . 211 11.7.4 Vicuna . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211 Context Faithfulness . . . . . . . . . . . . . . . . . . . . . . . . . 211 Memory Faithfulness . . . . . . . . . . . . . . . . . . . . . . . . . 211 11.7.5 Llama-3-8B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212 Context Faithfulness . . . . . . . . . . . . . . . . . . . . . . . . . 212 Memory Faithfulness . . . . . . . . . . . . . . . . . . . . . . . . . 212 11.7.6 Phi-3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213 Context Faithfulness . . . . . . . . . . . . . . . . . . . . . . . . . 213 Memory Faithfulness . . . . . . . . . . . . . . . . . . . . . . . . . 213 11.7.7 Do we need a larger probe dataset? . . . . . . . . . . . . . . . . . . 213 11.7.8 Probe Dataset Details . . . . . . . . . . . . . . . . . . . . . . . . . 214 Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214 Example 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 11.7.9 Data Attribution Evaluation Dataset Descriptions . . . . . . . . . . 216 11.7.10Validating Long Extractive Answer Generations . . . . . . . . . . . 216 11.7.11 Results on CNN-Dailymail . . . . . . . . . . . . . . . . . . . . . . 218 11.7.12 Results on NQ-Long . . . . . . . . . . . . . . . . . . . . . . . . . 219 11.7.13 Circuit Components and Data Attribution in Llama-3-70B . . . . . 219 11.8 Improving Compositionality in CLIP . . . . . . . . . . . . . . . . . . . . . 220 11.8.1 Benchmark Datasets . . . . . . . . . . . . . . . . . . . . . . . . . 220 11.8.2 Does distilling features directly from UNet help? . . . . . . . . . . 221 11.8.3 Additional Method Details . . . . . . . . . . . . . . . . . . . . . . 
222 11.8.4 When does distillation not help CLIP? . . . . . . . . . . . . . . . . 222 11.8.5 More Experimental Details . . . . . . . . . . . . . . . . . . . . . . 223 11.8.6 Fine-tuning with Conceptual Captions . . . . . . . . . . . . . . . . 224 11.8.7 Results with OpenCLIP . . . . . . . . . . . . . . . . . . . . . . . . 224 11.8.8 Additional Results on CLEVR . . . . . . . . . . . . . . . . . . . . 224 11.8.9 Is it the Scale of Pre-Training Data Which Helps? . . . . . . . . . 225 11.8.10 Beyond CLIP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225 11.9 Improving Compositionality in Text-to-Image Models . . . . . . . . . . . . 225 Bibliography 228 x Chapter 1: Introduction 1.1 Thesis Statement In this thesis, we develop and investigate methods for interpreting deep models through the lens of data and internal model components. We use these insights towards developing fast and scalable model editing methods, automatically generating difficult few-shot learning benchmarks and mitigating spurious correlations amongst others. 1.2 Thesis Overview In recent years, a vast array of deep learning models has been developed and implemented in real-world applications. These models encompass unimodal types, such as those for text, images, videos, and audio, as well as multimodal models that combine modalities like vision and language, video and language, or audio and language. As these models have grown rapidly in size—driven by increases in model size, the scale of pre-training data, and the availability of advanced computing infrastructure—the research community has struggled to fully understand how these models make specific decisions. Furthermore, it remains unclear whether gaining a deeper understanding of these models would directly contribute to targeted improvements or enhancements in their downstream applications. In this thesis, we establish a framework for efficiently interpreting recently developed deep learning models, including 1 both unimodal and multimodal types. We explore these models through the perspectives of their pre-training data, internal model components, and fine-tuning algorithms. First, we investigate how a classifier’s decision-making process can be attributed to a group of training samples to understand the failure modes of deep models. We develop second-order group influence functions, which can efficiently approximate leave-k-out retraining. Through a range of experiments on synthetic data and standard image datasets, we demonstrate that our proposed second-order influence function better approximates leave- k-out retraining than first-order influence functions. However, for deep models involving non-convex losses, we also find that the first-order influence function baseline is often inaccurate when compared to ground-truth influence. We then conduct a comprehensive large-scale empirical study to highlight the advantages and limitations of influence functions for interpreting deep models in the context of training data. Using datasets up to the scale of ImageNet, we identify the conditions under which the approximation provided by influence functions is relatively error-free. Given that influence functions are unstable for highly overparameterized models, we explore a different algorithmic approach to understanding the failure modes of pre-trained models. Specifically, we examine these failure modes through the lens of few-shot tasks. 
To understand the worst-case failure scenarios of deep models, we design FastDiffSel, an optimization-based algorithm that can automatically extract challenging training sets for a given test set. We use FastDiffSel to identify difficult few-shot tasks from vision datasets, including ImageNet, ObjectNet, and CURE-OR. Our findings reveal that pre-trained models often fail when there is a natural distribution shift between the few-shot training set and the test examples. As a result, we curate a challenging few-shot testing set, HardMetaDataset++, which can be used to stress-test models. While analyzing deep models through the lens of data helps identify their failure modes, it offers limited flexibility for post-training model control. To address this, we shift our focus to a "mechanistic" investigation, examining the internal components of these models. We first develop methodologies to understand how knowledge is stored in large-scale text-to-image models. Based on our findings, we design scalable, efficient, and data-free model editing techniques to remove copyrighted concepts from these models. Our empirical experiments demonstrate the effectiveness of our model editing methods in modifying large-scale open-source text-to-image generative models. We then extend our approach to multimodal language models, developing MultimodalCausalTrace, a tool that identifies crucial model components for factual Visual Question Answering (VQA) tasks. Building on these insights, we introduce MultEdit, a method for editing multimodal language models to insert new, rare knowledge or fix existing issues. Although our methods currently focus on interpreting multimodal models, a significant challenge remains in understanding the internal components of general Vision Transformers (ViTs) using human-understandable concepts, such as text. To tackle this, we create RepDecompose, an approach that automatically decomposes final representations in general ViTs through a recursive process. These components are then aligned with CLIP's image encoder, allowing interpretation via the text encoder. Our analysis reveals that different attention heads in ViTs encode distinct concepts, such as patterns, colors, and locations. We leverage these insights to modify the identified attention heads, mitigating spurious correlations and utilizing their embeddings for tasks like zero-shot segmentation, text-conditioned image retrieval, and general image retrieval. While our current work has laid the groundwork for interpreting and controlling vision and multimodal models, we also focus on adapting these methods to control language models. In particular, we enhance language models for tasks such as data attribution to context and mitigating hallucinations. Given the recent advancements in retrieval-augmented generation, there are significant practical applications in context-augmented question-answering setups. In this phase of our research, we investigate the internal circuits (e.g., sub-graphs) of language models that are causally linked to retrieval-augmented generation tasks. By analyzing different components of these circuits, we design zero-shot data attribution methods. Finally, we investigate compositionality issues in VLMs (e.g., CLIP) and text-guided image generation models (e.g., diffusion models). In particular, we find that diffusion models are strong in terms of compositionality and such knowledge can be transferred to CLIP to improve its compositionality. To this end, we introduce SDS-CLIP, a light-weight fine-tuning-based distillation method which can improve CLIP's compositionality without harming its zero-shot capabilities. For diffusion models, we find that the text embedding for compositional prompts is often sub-optimal. We show that solely fine-tuning a linear projection layer on CLIP's text embedding can improve compositional generation for a variety of open-source text-to-image diffusion models (including SDv3). Overall, our thesis has developed new approaches towards understanding deep models and has shown the possibilities of practical applications of model interpretability.

1.3 Thesis Contributions

This thesis makes several research contributions towards interpreting deep learning models, spanning both unimodal and multimodal models. In particular, we make contributions towards interpreting deep learning models through the lens of data as well as internal model components. Using our interpretability insights, we further develop light-weight methods towards unlocking capabilities in these models such as model editing. Below we state our contributions:

Interpreting Test-Time Predictions Through Influence Functions
• We develop second-order group influence functions which can attribute test-time predictions to a group of samples in the training data. Our second-order group influence function effectively approximates leave-k-out retraining via a second-order Taylor expansion around the model trained optimally on all the training examples. [ICML 2020]
• We empirically investigate the limits of influence functions for deep networks. To this end, we first analyse influence functions in a controlled experimental setup with synthetic data. We then scale up the analysis across different pre-training data and models up to the ImageNet scale – highlighting the fragilities of influence functions at larger model scales. [ICLR 2021]

Automatically Designing Difficult Few-Shot Benchmarks for Model Reliability
• We design an optimization-based data selection algorithm which can automatically curate difficult few-shot benchmarks from large-scale vision datasets. The curated dataset from our algorithm can be used towards stress-testing deep models for reliability. [ICLR 2023]

Mechanistically Understanding and Editing Text-to-Image Diffusion Models
• We design a causal tracing methodology which can locate internal model components that causally store knowledge corresponding to various visual attributes such as style, objects or facts. We then design model editing methods towards updating the weights of the identified components in a light-weight manner. [ICLR 2024]
• We investigate knowledge storage about visual attributes in the cross-attention layers across various open-source text-to-image diffusion models. We then use model editing towards updating the weights in those locations to remove copyrighted styles and objects and to update the model with new facts. [ICML 2024]

Mechanistically Understanding and Editing Multimodal Language Models
• We develop MultimodalCausalTrace, which can identify causal locations for a factual VQA task using a constraint-based formulation. Along with providing salient interpretability insights into the inner workings of multimodal language models, we introduce MultEdit, which can effectively introduce long-tailed knowledge into these models. [NeurIPS 2024]

Mechanistically Understanding and Unlocking Zero-Shot Capabilities in Vision Transformers
• We introduce RepDecompose, which decomposes the final representation in Vision Transformers as a function of internal model components such as attention heads. We then interpret these attention heads via text, by aligning their embeddings to CLIP's image encoder and then using CLIP's text-encoder to interpret them. Based on our interpretability insights, we unlock various zero-shot capabilities in Vision Transformers: (i) spurious correlation mitigation; (ii) zero-shot segmentation; (iii) image- or text-conditioned image retrieval. [NeurIPS 2024]

Mechanistically Understanding and Enhancing Context-Augmented Language Models
• Large language models are widely used for document processing and question-answering. In this work, we extract mechanistic circuits for context-augmented extractive QA using causal mediation analysis on model components (e.g., attention heads, MLPs). Our analysis reveals how models balance parametric memory and retrieved context, identifying a small set of attention heads that reliably perform data attribution by default. Leveraging this, we introduce ATTNATTRIB, a fast attribution algorithm achieving state-of-the-art results across QA benchmarks. We further demonstrate that ATTNATTRIB can steer models to prioritize context over parametric memory. Beyond insights into model behavior, our work highlights practical applications of circuits in attribution and model control. [ICML Review]

Improving Compositionality in Multimodal Models
• Image-text contrastive models like CLIP excel in zero-shot classification, retrieval, and transfer learning but struggle with compositional visio-linguistic tasks (e.g., attribute binding, object relationships), often performing at chance levels. To address this, we propose SDS-CLIP, a lightweight, sample-efficient distillation method that enhances CLIP's compositional reasoning. Our approach fine-tunes CLIP using a distillation objective from large text-to-image generative models like Stable Diffusion, known for strong visio-linguistic reasoning. SDS-CLIP improves CLIP's performance by up to 7% on Winoground and 3% on ARO, demonstrating the potential of generative model distillation to enhance contrastive learning. [EMNLP 2024]
• Text-to-image diffusion-based generative models have the stunning ability to generate photo-realistic images and achieve state-of-the-art low FID scores on challenging image generation benchmarks. However, one of the primary failure modes of these text-to-image generative models is in composing attributes, objects, and their associated relationships accurately into an image. In our paper, we investigate compositional attribute binding failures, where the model fails to correctly associate descriptive attributes (such as color, shape, or texture) with the corresponding objects in the generated images, and highlight that imperfect text conditioning with the CLIP text-encoder is one of the primary reasons behind the inability of these models to generate high-fidelity compositional scenes. In particular, we show that (i) there exists an optimal text-embedding space that can generate highly coherent compositional scenes, showing that the output space of the CLIP text-encoder is sub-optimal, and (ii) the final token embeddings in CLIP are erroneous as they often include attention contributions from unrelated tokens in compositional prompts.
Our main finding shows that significant compositional improvements can be achieved (without harming the model's FID score) by fine-tuning only a simple and parameter-efficient linear projection on CLIP's representation space in Stable-Diffusion variants, using a small set of compositional image-text pairs. [ACL Submission]

1.4 Publications and Authorship

This thesis draws upon previously published manuscripts, manuscripts currently under review, and ongoing works listed below. While I serve as the principal author (except for Chapters 7 and 9.2), the research presented here reflects the culmination of collaborative efforts with my advisor, Soheil Feizi, and mentors Daniela Massiceti and Varun Manjunatha, alongside invaluable contributions from mentors and colleagues at UMD, Adobe Research and Microsoft Research. Throughout Chapters 3-9, I use the pronoun 'we' to acknowledge the collective contributions of all my collaborators.

Figure 1.1: Primary Works Directly Related to the Thesis.
Figure 1.2: Additional Relevant Works done during the Thesis.

Chapter 2: Related Work

2.1 Interpreting Test-Time Predictions Through Influence Functions

Influence functions, a classical technique from robust statistics introduced by [50, 51], were first used in the machine learning community for interpretability by [119] to approximate the effect of upweighting a training point on the model parameters and the test loss for a particular test sample. In the past few years, there has been an increase in the applications of influence functions for a variety of machine learning tasks. [203] used influence functions to produce confidence intervals for a prediction and to audit the reliability of predictions. [242] used influence functions to approximate the gradient in order to recover a counterfactual distribution and increase model fairness, while [30] used influence functions to understand the origins of bias in word-embeddings. [117] crafted stronger data poisoning attacks using influence functions. Influence functions can also be used to detect extrapolation [159] in certain specific cases, validate causal inference models [7] and identify influential pre-training points [40]. The infinitesimal jackknife and the delta method are ideas closely related to influence functions for linear approximations of leave-one-out cross-validation [66, 107]. Recently, a higher-order instance [85] of the infinitesimal jackknife [107] was used to approximate cross-validation procedures. While their setting, corresponding to approximations of leave-k-out re-training, is relatively similar to our paper, our higher-order terms preserve the empirical weight distribution of the training data in the ERM and are derived from influence functions, while [85] uses instances of the infinitesimal jackknife. These differences lead to our higher-order terms being marginally different from the one proposed in [85]. Our proposed second-order approximation for the group influence function is additionally backed by a thorough empirical study across different settings in the case of linear models, which has not yet been explored in prior works. Recently, alternative methods to find influential samples in deep networks have been proposed. In [258], test-time predictions are explained by a kernel function evaluated at the training samples. Influential training examples can also be obtained by tracking the change in loss for a test prediction through model checkpoints, which are stored during training time [189]. While these alternative methods [189, 258] work well for deep networks in interpreting model predictions, they lack the "jackknife"-like ability of influence functions, which makes them useful in multiple applications other than interpretability (e.g., uncertainty estimation).

2.2 Automatically Designing Difficult Few-Shot Benchmarks for Model Reliability

Difficult tasks. Previous works [2, 12, 57] have shown that state-of-the-art few-shot classifiers generally display a wide range in performance when adapting to different test tasks. [2] use this observation to develop a greedy search-based algorithm that can specifically extract difficult tasks for further study. They consider only meta-learning approaches and, due to the computational requirements of a greedy search, are limited to small-scale datasets including mini-ImageNet and CIFAR-FS. [12] also study difficult tasks through a correlation-based analysis. We extend on both of these works by (i) proposing a scalable algorithm, FastDiffSel, that can extract difficult tasks from any large-scale vision dataset, and (ii) conducting a deep empirical evaluation on the robustness of a broader range of meta-learning and transfer learning approaches on these difficult tasks. Ideas from subset selection [114, 247] can potentially be adapted for few-shot task extraction, but we leave this for future work. Few-shot classification benchmarks. Meta-Dataset [230] and Hard-Meta-Dataset++ [65] are two of the most challenging few-shot image classification benchmarks in the current literature. They cover a wide range of domains and primarily evaluate the ability of a few-shot classifier to generalise to novel object classes, datasets and domains. Other few-shot benchmarks have been introduced to specifically target adaptation to images with high real-world variation, including ORBIT [163], and cross-domain transfer beyond natural images, including BSCD-FSL [89]. We note, however, that unlike Hard-Meta-Dataset++, none of these benchmarks specifically target difficult tasks for few-shot classification.

2.3 Mechanistically Understanding and Editing Text-to-Image Diffusion Models

Text-to-Image Diffusion Models. In the last year, a large number of text-to-image models such as Stable-Diffusion [197], DALLE [193], Imagen [201] and others [16, 38, 59, 111] have been released. In addition, the open-source community has released DeepFloyd (https://www.deepfloyd.ai) and Midjourney (https://www.midjourney.com/), which can generate photorealistic images given a text prompt. While most of these models operate in the latent space of the images, they differ in the text-encoder used. For example, Stable-Diffusion uses CLIP as the text-encoder, whereas Imagen uses T5. These text-to-image diffusion models have been used as a basis for various applications such as image editing, semantic segmentation, object detection, image restoration and zero-shot classification. Interpretability of Text-to-Image Models. To our knowledge, few works delve into the mechanisms of large text-to-image models like Stable-Diffusion. DAAM [221] interprets diffusion models by analyzing cross-attention maps between text tokens and images, emphasizing their semantic accuracy for interpretation. In contrast, our approach focuses on comprehending the inner workings of diffusion models by investigating the storage of visual knowledge related to different attributes. We explore various model layers beyond just the cross-attention layer. [23] leverage causal tracing to understand how knowledge is stored in text-to-image models such as Stable-Diffusion-v1. Editing Text-to-Image Models. Understanding knowledge storage in diffusion models has significant implications for model editing. The ability to modify a diffusion model's behavior without retraining from scratch was first explored in Concept-Ablation [126] and Concept-Erasure [78]. TIME [182] is another model editing method which translates between concepts by modifying the key and value matrices in cross-attention layers. However, the experiments in [182] do not specifically target removing or updating concepts such as those used in [78, 126]. We also acknowledge the concurrent works [79] and [10], which use a closed-form update on the cross-attention layers and the text-encoder, respectively, to ablate concepts. However, we note that our work focuses primarily on first understanding how knowledge is stored in text-to-image models and subsequently using this information to design a closed-form editing method for editing concepts.

2.4 Mechanistically Understanding and Editing Multimodal Language Models

Multimodal Large Language Models. We consider an MLLM to be a model that takes an image and text as input, and generates a text output [5]. Over the last year, such models have made tremendous advances in tasks like VQA and image captioning, including BLIP [139], BLIP-2 [138], Instruct-BLIP [53], LLaVA [148, 149], Flamingo [8] and multimodal Phi-2 (from the Bunny repo) [94]. These MLLMs can broadly be categorized into two families based on how their visual information is integrated into the language model: (i) by embedding the vision encoder's output into each layer of the language model with a cross-attention layer (e.g., Flamingo, BLIP) or, (ii) by mapping the vision encoder's output into "visual tokens" in the language model's input space (i.e., alongside the text tokens) via a projection layer (e.g., LLaVA, Bunny). Both families are widely used; however, the projection-layer family has recently shown stronger performance on popular benchmarks [94, 148, 149]. We, therefore, focus our study of information storage and transfer on this model family. Interpretability of MLLMs. A well-established arm of model interpretability examines the relationship between a model's performance and its internals. A range of recent works have studied the internal mechanisms of information storage [165, 167, 243] and transfer [83, 263] in LLMs. However, to the best of our knowledge, only a few works [217, 225] have studied the interpretability of MLLMs, with none specifically investigating the relationship between a model's outputs and its internal states. [217], for example, designs an interactive interface to visualize the attention maps in an MLLM, while [225] explores the shortcomings of the CLIP vision encoder in MLLMs. Neither considers the influence of both vision and text inputs on model internals or offers causal insights, as our work does. Our model editing approach, which targets the projection-layer MLLM family, is complemented by [46], who propose baselines for inserting information into the cross-attention-layer MLLM family.

2.5 Mechanistically Understanding and Unlocking Zero-Shot Capabilities in Vision Transformers

Several studies attempt to elucidate model predictions by analyzing either a subset of input examples through heatmaps [157, 205, 212, 219] or a subset of training examples [120, 183, 190].
Nevertheless, empirical evidence suggests that these approaches are often unreliable in real-world scenarios [22, 115]. These methods do not interpret model predictions in relation to the model’s internal mechanisms, which is essential for gaining a deeper understanding of the reliability of model outputs. Internal Mechanisms of Vision Models: Our work is closely related to the studies by [77] and [238], both of which analyze vanilla ViTs in terms of their components and interpret them using either CLIP text encoders or pretrained ImageNet heads. Like these studies, our research can be situated within the emerging field of representation engineering [271] and mechanistic interpretability [29, 32]. Other works [24, 86, 178] focus on interpreting individual neurons to understand vision models’ internal mechanisms. However, these methods often fail to break down the model’s output into its subcomponents, which is crucial for understanding model reliability. [206] examine the direct effect of model weights on output, but do not study the fine-grained role of these components in building the final image representation. [17] focus on expressing CNN representations as a sum of contributions from input regions via masking. Interpreting models using CLIP: Many recent works utilize CLIP [191] to interpret models via text. [170] align model representations to CLIP space with a linear layer, but it is limited to only the final representation and can not be applied to model components. [177] annotate individual neurons in CNNs via CLIP, but their method cannot be extended 15 easily to high-dimensional component vectors. Our method is related to model stitching in which one representation space is interpreted in terms of another by training a map between two spaces [18, 134]. 2.6 Mechanistic Circuits for Extractive Question-Answering Circuit Based Interpretability in Language Models. With the advent of language models, a lot of recent works have focused on a mechanistic understanding of language models [87, 144, 164, 166, 232]. One of the primary benefit of transformer based language models is that the final logit representation can be decomposed as a sum of individual model components [68]. Based on this decomposition, one can extract task-specific causal sub- graphs (i.e., circuits) of internal model components in language models. Early works have extracted such circuits for indirect-object identification [244], greater-than operation [91] and more recently for entity-tracking [188]. Recently, there has been an increasing focus on the practical aspects of mechanistic interpretability such as refusal mediation [11, 267] or safety in general [272]. Circuits can also be constructed as sub-graphs of neurons in the language model, but it often comes with increased complexity of interpretation [67]. In our paper, we focus on extracting circuits where the nodes are different architectural components such as attention-heads, layers or MLPs. Applications in Context-Augmented Question-Answering. With the advent of retrieval- augmented generation [81, 135] language models have been increasingly used for real-world Question-Answering (QA) tasks. One of the primary enhancement of context-augmented QA lies in the ability to provide reliable grounding (i.e., attribution) in the context for the generated answer [102, 113, 137, 257]. In the recent times, there have been a large set of works which improve LLM responses by reducing hallucinations and improving grounding in the input context [14, 252, 257, 265]. 
Our paper tests the ability of the mechanistic 16 insights from circuits towards performing these applications. 2.7 Improving Compositionality in Multimodal Models 2.7.1 Compositionality in CLIP While CLIP models [191] are renowned for their robust zero-shot classification, recent research [60, 223] has exposed their limitations in visio-linguistic reasoning. In contrast, recent studies have demonstrated that text-to-image models [41, 49, 125, 136] outperform CLIP in reasoning tasks. These models in fact leverage scores computed from the diffusion objective. We note that while [187] use score-distillation sampling for text to 3D generation, ours is the first work to adapt the formulation as a regularizer and improve compositional abilities in CLIP. 2.7.2 Compositionality in Text-to-Image Models Compositionality in text-to-image models refers to the ability of a model to accurately capture the correct compositions of objects, their corresponding attributes, and the rela- tionships between objects described in a given prompt. [103] introduced a benchmark designed to evaluate compositionality in text-to-image models, highlighting the limitations of models when handling compositional prompts. The benchmark employs disentangled BLIP-Visual Question Answering (VQA) as a metric for assessing image compositional quality. The VQA score assesses how accurately an image captures the compositional elements described in the prompt by utilizing a vision-language model. This metrics demon- strates a closer correlation with human judgment compared to metrics like CLIP-Score [97]. The authors also proposed a fine-tuning baseline to enhance compositionality in these models. Alternatively, compositionality issues can be addressed at inference by modifying 17 cross-attention maps using hand-crafted loss functions and bounding boxes derived from a language model [1, 39, 73, 143, 150, 175, 245]. However, [103] showed that data-driven fine-tuning is more effective for improving compositionality. 18 Chapter 3: Interpreting Test-Time Predictions With Influence Func- tions 3.1 Introduction Recently, there has been a rapid and significant success in applying machine learning methods to a wide range of applications including vision [220], natural language processing [204], medicine [158], finance [147], etc. In sensitive applications such as medicine, we would like to explain test-time model predictions to humans. An important question is : why the model makes a certain prediction for a particular test sample. One way to address this is to trace back model predictions to its training data. More specifically, one can ask which training samples were the most influential ones for a given test prediction. Influence functions [51] from robust statistics measure the dependency of optimal model parameters on training samples. Previously [119] used first-order approximations of influ- ence functions to estimate how much model parameters would change if a training point was up-weighted by an infinitesimal amount. Such an approximation can be used to identify most influential training samples in a test prediction. Moreover, this approximation is similar to the leave-one-out re-training, thus the first-order influence function proposed in [119] bypasses the expensive process of repeated re-training the model to find influential training samples in a test-time prediction. 19 In some applications, one may want to understand how model parameters would change when large groups of training samples are removed from the training set. 
This could be useful to identify groups of training data which drive the decision for a particular test prediction. As shown in [118], finding influential groups can be useful in real-world applications such as diagnosing batch effects [253], apportioning credit between different data sources [13], understanding effects of different demographic groups [42] or in a multi-party learning setting [92]. [118] approximates the group influence by sum of first-order individual influences over training samples in the considered group. However, removal of a large group from training can lead to a large perturbation to model parameters. Therefore, influence functions based on first-order approximations may not be accurate in this setup. Moreover, approximating the group influence by adding individual sample influences ignores possible cross correlations that may exist among samples in the group. In this paper, we relax the first-order approximations of current influence functions and study how second-order approximations can be used to capture model changes when a potentially large group of training samples is up-weighted. Considering a training set S and a group U ⊂S , existing first-order approximations of the group influence function [118] can be written as the sum of first-order influences of individual points. That is, I (1)(U ) = |U | ∑ i=1 I (1) i where I (1)(U ) is the first-order group influence function and I (1) i is the first-order influ- ence for the ith sample in U . On the other hand, our proposed second-order group influence function has the following form: I (2)(U ) = I (1)(U )+I ′ (U ) 20 where I ′ (U ) captures informative cross-dependencies among samples in the group and is a function of gradient vectors and the Hessian matrix evaluated at the optimal model parameters. We present a more precise statement of this result in Theorem 1. We note that the proposed second-order influence function can be computed efficiently even for large models. We discuss its computational complexity in Section 3.4. Our analysis shows that the proposed second-order influence function captures model changes efficiently even when the size of the groups are relatively large or the changes to the model parameters are significant as in the case of groups with similar properties. For example, in an MNIST classification problem using logistic regression, when 50% of the training samples are removed, the correlation between the ground truth estimate and second-order influence values improves by over 55% when compared to the existing first-order influence values. We note that higher-order influence functions have been used in statistics [108] for point and interval estimates of non-linear functionals in parameteric, semi-parametric and non-parametric models. However, to the best of our knowledge, this is the first time, higher-order influence functions are used for the interpretability task in the machine learning community. Similar to [119] and [118], our main results for the second-order influence functions hold for linear prediction models where the underlying optimization is convex. However, we also additionally explore effectiveness of both first-order and second-order group influence functions in the case of deep neural networks. We observe that none of the methods provide good estimates of the ground-truth influence across different groups 1. 
In summary, we make the following contributions:

• We propose second-order group influence functions that consider cross-dependencies among the samples in the considered group.
• Through several experiments on linear models, across different sizes and types of groups, we show that the second-order influence estimates have higher correlations with the ground truth than the first-order ones, especially when the changes to the underlying model are relatively large.
• We show that our proposed second-order group influence function can also be used to improve the selection of the most influential training group.

3.2 Background

We consider the classical supervised learning setup, where the task is to learn a function h (also called the hypothesis) mapping from an input space X to an output space Y. We denote an input-output pair by {x, y}. We assume that our learning algorithm is given training examples S := {z_i = (x_i, y_i)}, i = 1, ..., m, drawn i.i.d. from some unknown distribution P. Let Θ be the parameter space of the considered hypothesis class. The goal is to select model parameters θ to minimize the empirical risk:

\[
\min_{\theta \in \Theta} L_{\emptyset}(\theta) := \frac{1}{|S|} \sum_{z \in S} \ell(h_{\theta}(z)), \tag{3.1}
\]

where |S| = m denotes the cardinality of the training set, the subscript ∅ indicates that the whole set S is used in training, and ℓ is the associated loss function. We refer to the optimal parameters computed by the above optimization as θ*. Let ∇_θ L_∅(θ) and H_{θ*} = ∇²_θ L_∅(θ*) denote the gradient and the Hessian of the loss function, respectively.

First, we discuss the case where we want to compute the effect of an individual training sample z on the optimal model parameters as well as on the test predictions made by the model. The effect, or influence, of a training sample on the model parameters can be characterized by removing that particular training sample and retraining the model:

\[
\theta^{*}_{\{z\}} = \arg\min_{\theta \in \Theta} L_{\{z\}}(\theta) = \frac{1}{|S| - 1} \sum_{z_i \neq z} \ell(h_{\theta}(z_i)) \tag{3.2}
\]

We can then compute the change in model parameters due to the removal of a training point z as Δθ = θ*_{\{z\}} − θ*. However, re-training the model for every such training sample is expensive when |S| is large. Influence functions based on first-order approximations, introduced by [50, 51], were used by [119] to approximate this change. Up-weighting a training point z by an infinitesimal amount ε leads to new optimal model parameters, θ^ε_{\{z\}}, obtained by solving the following optimization problem:

\[
\theta^{\epsilon}_{\{z\}} = \arg\min_{\theta \in \Theta} \frac{1}{|S|} \sum_{z_i \in S} \ell(h_{\theta}(z_i)) + \epsilon\, \ell(h_{\theta}(z)) \tag{3.3}
\]

Removing a point z is equivalent to up-weighting it by ε = −1/|S|. The main idea used by [119] is to approximate θ*_{\{z\}} by minimizing a first-order Taylor series approximation around θ*. Following the classical result of [51], the change in the model parameters θ* upon up-weighting z can be approximated by the influence function [119], denoted by I:

\[
\mathcal{I}(z) = \frac{d\theta^{\epsilon}_{\{z\}}}{d\epsilon}\Big|_{\epsilon=0} = -H_{\theta^{*}}^{-1} \nabla_{\theta}\ell(h_{\theta^{*}}(z)) \tag{3.4}
\]

A detailed proof can be found in [119]. Using this formulation, we can track the change with respect to any function of θ*.
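As a concrete illustration of Equation (3.4), the following is a minimal sketch of the parameter-change influence for a binary logistic regression model, where the Hessian of the empirical risk can be formed in closed form. The helper names are illustrative, and the small weight-decay term lam is assumed here purely to keep the Hessian well-conditioned; it is not part of the formulation above.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def parameter_influence(theta_star, X, y, z_x, z_y, lam=1e-3):
    """I(z) = -H^{-1} grad_theta l(h_{theta*}(z)) as in Eq. (3.4), for binary logistic regression.
    X, y: training set; (z_x, z_y): the up-weighted training point; lam: assumed damping term."""
    n, d = X.shape
    p = sigmoid(X @ theta_star)                              # predicted probabilities on the training set
    H = (X.T * (p * (1.0 - p))) @ X / n + lam * np.eye(d)    # Hessian of the (damped) empirical risk
    g_z = (sigmoid(z_x @ theta_star) - z_y) * z_x            # gradient of the loss at the point z
    return -np.linalg.solve(H, g_z)                          # change in parameters per unit of up-weighting
```

Since removing z corresponds to ε = −1/|S|, the predicted parameter change from removing z under this sketch is approximately −parameter_influence(...)/|S|.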
The change in the test loss at a particular test point z_t when a training point z is up-weighted can be approximated in closed form:

\[
\mathcal{I}(z, z_t) = -\nabla_{\theta}\ell(h_{\theta^{*}}(z_t))^{T} H_{\theta^{*}}^{-1} \nabla_{\theta}\ell(h_{\theta^{*}}(z)) \tag{3.5}
\]

This result is based on the assumption [119] that the loss function L(θ) is strictly convex in the model parameters θ and that the Hessian H_{θ*} is therefore positive-definite. The approximation is very similar to forming a quadratic approximation around the optimal parameters θ* and taking a single Newton step. However, explicitly computing H_{θ*} and its inverse H_{θ*}^{-1} is not required: using Hessian-vector products [185], influence functions can be computed efficiently.

3.3 Group Influence Function

Our goal in this section is to understand how the model parameters would change if a particular group of samples were up-weighted in the training set. However, up-weighting a group can lead to large perturbations of the training data distribution, and therefore of the model parameters, which violates the small-perturbation assumption underlying first-order influence functions. In this section, we extend influence functions using second-order approximations to better capture changes in model parameters due to up-weighting a group of training samples. The empirical risk minimization (ERM) when we remove a set U of samples from training can be written as:

\[
L_{U}(\theta) = \frac{1}{|S| - |U|} \sum_{z \in S \setminus U} \ell(h_{\theta}(z)) \tag{3.6}
\]

To approximate how the optimal solution of this optimization relates to θ*, we study the effect of up-weighting a group of training samples on the model parameters. Note that in this case the updated weights should still form a valid distribution: if a group of training samples is up-weighted, the remaining samples should be down-weighted to preserve the sum-to-one constraint on the weights in the ERM formulation. In the individual influence function case (when the size of the group is one), up-weighting a sample by ε leads to down-weighting the other samples by ε/(m−1), whose effect can be neglected, as in the formulation of [119]. In our formulation of the group influence function, we assume that the weights of all samples in the set U are up-weighted by ε, and we use p = |U|/|S| to denote the fraction of up-weighted training samples. This leads to a down-weighting of the rest of the training samples by ε̃ = (|U|/(|S|−|U|)) ε, which preserves the empirical weight distribution of the training data. This is also important for a fair comparison with the ground-truth leave-out-retraining estimates. Therefore, the resulting ERM can be written as:

\[
\theta^{\epsilon}_{U} = \arg\min_{\theta} L^{\epsilon}_{U}(\theta), \qquad
L^{\epsilon}_{U}(\theta) = \frac{1}{|S|} \Big( \sum_{z \in S \setminus U} (1 - \tilde{\epsilon})\, \ell(h_{\theta}(z)) + \sum_{z \in U} (1 + \epsilon)\, \ell(h_{\theta}(z)) \Big). \tag{3.7}
\]

In this formulation, setting ε = 0 recovers the original loss function L_∅(θ) (where none of the training samples are removed), and setting ε = −1 recovers the loss function L_U(θ) (where the samples in U are removed from training). Let θ^ε_U denote the optimal parameters for the minimization of L^ε_U. Essentially, we are concerned with the change in the model parameters, Δθ = θ^ε_U − θ*, when each training sample in a group of size |U| is up-weighted by a factor of ε. The key step of the derivation is to expand θ^ε_U around θ* (the minimizer of L^0_U(θ), i.e., of L_∅(θ)) with respect to the order of ε, the up-weighting parameter. To do so, we use perturbation theory [15] to expand θ^ε_U around θ*.
Perturbation theory is frequently used in quantum mechanics, as well as in other areas of physics such as particle physics, condensed matter, and atomic physics; it finds an approximate solution to a problem (here, θ^ε_U) by starting from the exact solution of a closely related, simpler problem (here, θ*). As ε gets smaller, the higher-order terms of the expansion become less significant. However, for large model perturbations (such as in the case of group influence functions), using higher-order terms can reduce approximation errors significantly. The following perturbation series forms the core of our derivation of second-order influence functions:

\[
\theta^{\epsilon}_{U} - \theta^{*} = O(\epsilon)\,\theta^{(1)} + O(\epsilon^{2})\,\theta^{(2)} + O(\epsilon^{3})\,\theta^{(3)} + \cdots \tag{3.8}
\]

where θ^(1) characterizes the first-order (in ε) perturbation vector of the model parameters and θ^(2) the second-order (in ε) perturbation vector. We absorb the dependence of these perturbation vectors on constants (such as |U|) into the O(·) notation. When computing the influence of individual points, as shown by [119], the scaling of θ^(1) is on the order of 1/|S| while the scaling of the second-order coefficient is on the order of 1/|S|², which is very small when S is large; in that case the second-order term can be ignored. When computing the group influence, however, the second-order coefficient is on the order of |U|²/|S|², which can be large when the size of U is large. Thus, in our definition of the group influence function, both θ^(1) and θ^(2) are taken into account. The first-order group influence function (denoted by I^(1)), when all the samples in a group U are up-weighted by ε, is defined as:

\[
\mathcal{I}^{(1)}(U) = \frac{\partial \theta^{\epsilon}_{U}}{\partial \epsilon}\Big|_{\epsilon=0}
= \frac{\partial \big(\theta^{*} + O(\epsilon)\theta^{(1)} + O(\epsilon^{2})\theta^{(2)}\big)}{\partial \epsilon}\Big|_{\epsilon=0} = \theta^{(1)}
\]

To capture the dependence of the group influence function on the O(ε²) terms, we define I′ as follows:

\[
\mathcal{I}'(U) = \frac{\partial^{2} \theta^{\epsilon}_{U}}{\partial \epsilon^{2}}\Big|_{\epsilon=0}
= \frac{\partial^{2} \big(\theta^{*} + O(\epsilon)\theta^{(1)} + O(\epsilon^{2})\theta^{(2)}\big)}{\partial \epsilon^{2}}\Big|_{\epsilon=0} = \theta^{(2)}
\]

Although one could consider even higher-order terms, in this paper we restrict our derivations to second-order approximations of the group influence function. We now state our main result in the following theorem.

Theorem 1. If the third derivative of the loss function at θ* is sufficiently small, the second-order group influence function (denoted by I^(2)(U)) when all samples in a group U are up-weighted by ε is:

\[
\mathcal{I}^{(2)}(U) = \mathcal{I}^{(1)}(U) + \mathcal{I}'(U) \tag{3.9}
\]

where

\[
\mathcal{I}^{(1)}(U) = -\frac{1}{1-p}\,\frac{1}{|S|}\, H_{\theta^{*}}^{-1} \sum_{z \in U} \nabla \ell(h_{\theta^{*}}(z))
\]

and

\[
\mathcal{I}'(U) = \frac{p}{1-p} \Big( I - \big(\nabla^{2} L_{\emptyset}(\theta^{*})\big)^{-1} \frac{1}{|U|} \sum_{z \in U} \nabla^{2} \ell(h_{\theta^{*}}(z)) \Big) \theta^{(1)}
\]

This result is based on the assumption that the third-order derivatives of the loss function at θ* are small. For the quadratic loss, the third-order derivatives are exactly zero, and our experiments with the cross-entropy loss indicate that this assumption approximately holds for the classification problem as well. Below, we present a concise sketch of this result.

Figure 3.1: Comparison of first-order and second-order group influences on a synthetic dataset with 10,000 samples, using logistic regression, for a misclassified test point. Across different sizes of randomly selected groups, the second-order influence values are more correlated with the ground truth than the first-order ones. The green line highlights the y = x line.

Proof Sketch. We now derive θ^(1) and θ^(2), which are used in the second-order group influence function I^(2)(U).
Since θ^ε_U is the optimal parameter set for the interpolated loss function L^ε_U(θ), the first-order stationarity condition gives:

\[
0 = \nabla L^{\epsilon}_{U}(\theta^{\epsilon}_{U})
= \nabla L_{\emptyset}(\theta^{\epsilon}_{U})
+ \frac{1}{|S|} \Big( -\tilde{\epsilon} \sum_{z \in S \setminus U} \nabla \ell(h_{\theta^{\epsilon}_{U}}(z))
+ \epsilon \sum_{z \in U} \nabla \ell(h_{\theta^{\epsilon}_{U}}(z)) \Big) \tag{3.10}
\]

The main idea is to use a Taylor series to expand ∇L_∅(θ^ε_U) around θ*, together with the perturbation series defined in Equation (3.8), and to compare terms of the same order in ε:

\[
\nabla L_{\emptyset}(\theta^{\epsilon}_{U}) = \nabla L_{\emptyset}(\theta^{*}) + \nabla^{2} L_{\emptyset}(\theta^{*}) (\theta^{\epsilon}_{U} - \theta^{*}) + \cdots \tag{3.11}
\]

Similarly, we expand ∇ℓ(h_{θ^ε_U}(z)) around θ* using a Taylor series. To derive θ^(1) we compare the terms with coefficient O(ε) in Equation (3.10), and for θ^(2) we compare the terms with coefficient O(ε²). Based on this, θ^(1) can be written as:

\[
\theta^{(1)} = -\frac{1}{1-p}\,\frac{1}{|S|}\, H_{\theta^{*}}^{-1} \sum_{z \in U} \nabla \ell(h_{\theta^{*}}(z)) \tag{3.12}
\]

In more detail, expanding Equation (3.10) and comparing the terms with coefficient O(ε):

\[
\begin{aligned}
\epsilon\, \nabla^{2} L_{\emptyset}(\theta^{*})\, \theta^{(1)}
&= \frac{1}{|S|} \Big( \tilde{\epsilon} \sum_{z \in S \setminus U} \nabla \ell(h_{\theta^{*}}(z)) - \epsilon \sum_{z \in U} \nabla \ell(h_{\theta^{*}}(z)) \Big) \\
&= \tilde{\epsilon}\, \nabla L_{\emptyset}(\theta^{*}) - \frac{1}{|S|} (\tilde{\epsilon} + \epsilon) \sum_{z \in U} \nabla \ell(h_{\theta^{*}}(z)) \\
&= -\frac{1}{|S|} (\tilde{\epsilon} + \epsilon) \sum_{z \in U} \nabla \ell(h_{\theta^{*}}(z)) \\
&= -\frac{1}{|S|}\, \frac{\epsilon}{1-p} \sum_{z \in U} \nabla \ell(h_{\theta^{*}}(z))
\end{aligned} \tag{3.13}
\]

θ^(1) is the first-order approximation of the group influence function and is denoted by I^(1). Note that our first-order approximation I^(1) differs slightly from that of [118], with an additional factor of 1 − p in the denominator. For θ^(2), we compare the terms with coefficients of order O(ε²) in Equation (3.10):

\[
\epsilon^{2}\, \nabla^{2} L_{\emptyset}(\theta^{*})\, \theta^{(2)}
+ \frac{1}{2} L'''_{\emptyset}(\theta^{*}) \big[ \epsilon \theta^{(1)}, \epsilon \theta^{(1)}, I \big]
+ \frac{1}{|S|} \Big( -\tilde{\epsilon} \sum_{z \in S \setminus U} \nabla^{2} \ell(h_{\theta^{*}}(z))
+ \epsilon \sum_{z \in U} \nabla^{2} \ell(h_{\theta^{*}}(z)) \Big) \epsilon\, \theta^{(1)} = 0 \tag{3.14}
\]

For the θ^(2) term, we ignore the third-order term ½ L'''_∅(θ*)[εθ^(1), εθ^(1), I], since it is small. Substituting the value of ε̃ and equating the terms with coefficient of order O(ε²):

\[
\nabla^{2} L_{\emptyset}(\theta^{*})\, \theta^{(2)}
= \frac{|U|}{|S| - |U|} \Big( \frac{1}{|S|} \sum_{z \in S} \nabla^{2} \ell(h_{\theta^{*}}(z))
- \frac{1}{|U|} \sum_{z \in U} \nabla^{2} \ell(h_{\theta^{*}}(z)) \Big) \theta^{(1)} \tag{3.15}
\]

Rearranging Equation (3.15) yields the same identity as I′ in Theorem 1.

It can be observed that the additional term I′ in our second-order approximation captures cross-dependencies among the samples in U through a function of the gradients and Hessians of the loss at the optimal model parameters. This makes the second-order group influence function more informative when the training samples are correlated. In Section 3.5, we empirically show that adding I′ also improves the correlation with the ground-truth influence. To track the change in the test loss at a particular test point z_t when a group U is removed, we use the chain rule to compute the influence score as:

\[
\mathcal{I}^{(2)}(U, z_t) = \nabla \ell(h_{\theta^{*}}(z_t))^{T} \big( \mathcal{I}^{(1)}(U) + \mathcal{I}'(U) \big) \tag{3.16}
\]

Our second-order approximation of the group influence function consists of a first-order term similar to the one proposed in [118], with an additional scaling factor of 1/(1−p). This scaling arises because our formulation preserves the empirical weight-distribution constraint in ERM, which is essential when a large group is up-weighted. The second-order influence function has an additional term I′ that is directly proportional to p and captures large perturbations to the model parameters more effectively.

Figure 3.2: Group size vs. correlation with the ground truth on MNIST for logistic regression, with random groups (left panel) and coherent groups (right panel).

3.4 Computational Complexity

For models with a relatively large number of parameters, computing the inverse of the Hessian, H_{θ*}^{-1}, can be expensive, on the order of O(n³).
However, computing a Hessian-vector product [185] is relatively inexpensive. In our experiments, similar to [40, 118, 119], we used conjugate gradients (a second-order optimization technique) [209] to compute the inverse-Hessian-vector product; this routine relies only on Hessian-vector products and thus avoids the expense of inverting the Hessian directly. The proposed second-order group influence function can be computed similarly to the first-order group influence function, with only one additional Hessian-vector product.

3.5 Experiments

3.5.1 Setup

Our goal in these experiments is to determine whether the second-order approximation of group influence functions improves the correlation with the ground-truth estimates across different settings. We compare the computed second-order group influence score with the ground-truth influence, which is computed by leave-k-out retraining for a group of size k. Our evaluation metric is the Pearson correlation, which measures how linearly related the computed influence and the actual ground-truth estimate are. We perform our experiments primarily on logistic regression, where the group influence function is well-defined. Additionally, we check the accuracy of first-order and second-order group influence functions in the case of neural networks.

3.5.2 Datasets

To evaluate the accuracy of both first-order and second-order group influence functions on linear models, we use two datasets. In our first set of experiments, we use a synthetic dataset with logistic regression. The synthetic dataset has 10,000 points drawn from a Gaussian distribution, with 5 features and 2 classes; details of the synthetic data can be found in the Appendix. The second set of experiments uses the standard handwritten digits database MNIST [131], which consists of 10 classes of digits. To study how group influence functions behave in the case of neural networks, we also use the MNIST dataset. For each of the two datasets, we pick random groups as well as coherent groups, as in [118], with sizes ranging from 1.6% to 60% of the training points. The computed group influence was primarily investigated for a test point that was misclassified by the model. A detailed description of how the groups were selected in our experiments is given in the Appendix. For the optimal group selection, we used a synthetic dataset of 20,000 training points with 5 features, in the form of 4 isotropic Gaussian blobs.

3.5.3 Observations and Analysis

Linear Models: For logistic regression, the general observation for randomly selected groups was that the second-order group influence function improves the correlation with the ground-truth estimates across different group sizes, on both the synthetic dataset and MNIST. For the synthetic dataset, Figure (3.1) shows that the approximation provided by the second-order group influence function is fairly close to the ground truth even when a large fraction of the training data (60%) is removed. For such large group sizes, the first-order approximation of the group influence function is relatively inaccurate and far from the ground-truth influence, which is consistent with the small-perturbation assumption of first-order influence functions. For smaller group sizes, although the second-order approximation improves over the existing first-order group influence function, the gain in correlation is small.
On MNIST, the observation was similar: the gain in correlation was significant when the size of the considered group was large. For example, Figure (3.2) shows that when more than 36% of the samples were removed, the gain in correlation is almost always more than 40%. While the improvement in correlation for larger group sizes is consistent with our theory that the second-order approximation is effective in the case of large changes to the model, the gain in correlation is non-monotonic with respect to group size. For small groups selected uniformly at random, the model parameters do not change significantly, and the second-order approximation improves only marginally over the existing first-order approximation. However, when a coherent group (a group of training examples from the same class) is removed, even at a relatively small size, the perturbation to the model is larger than when a random group is removed, because the model parameters can change significantly in a particular direction. In such settings, we observe that even for small group sizes the second-order approximation consistently and significantly improves the correlation with the ground truth (Figure (3.2)). For coherent groups, across different group sizes of the MNIST dataset, we observed an improvement in correlation when the second-order approximation was used, with a gain of at least 15% across different group sizes. These observations (shown in Figure (3.2)) reinforce our theory that second-order (and, more generally, higher-order) approximations of influence functions are particularly effective when the perturbation, or the change in the model parameters, is large. The second-order approximation of the influence function could thus be used instead of existing first-order approximations for practical purposes such as understanding the effect of training groups with similar properties (e.g., demographic groups) on model predictions, without the need to actually retrain the model.

Neural Networks: In the case of neural networks, the Hessian is not positive semi-definite in general, which violates the assumptions of influence functions. Previously, [119] regularized the Hessian in the form H_{θ*} + λI and showed that, for the top few influential training points (not groups) and a given test point, the correlation with the ground-truth influence is still satisfactory, if not highly significant. However, how influence functions behave in the case of groups is a topic that has not yet been well explored. For MNIST, we used a regularized Hessian with λ = 0.01 and conducted experiments, for both first-order and second-order group influence functions, on a relatively simple feed-forward network with two hidden layers and sigmoid activations. The general observation was that both the existing first-order and the proposed second-order group influence functions underestimate the ground-truth influence values across different group sizes, leading to a non-significant correlation; the corresponding figure can be found in the Appendix. However, we observed that while the second-order influence values still suffer from this underestimation issue, they improve the correlation marginally across different group sizes. This observation was consistent for both random and coherent group selections.
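For reference, the first- and second-order scores compared throughout these experiments can be computed in closed form for binary logistic regression. The sketch below follows Theorem 1 and Equation (3.16); the function names are illustrative, and the small damping term lam added to the empirical-risk Hessian is an assumption of this sketch (to keep the Hessian invertible) rather than part of the formulation.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def per_sample_grads(theta, X, y):
    """Per-sample gradients of the logistic loss: row i is (p_i - y_i) * x_i."""
    return (sigmoid(X @ theta) - y)[:, None] * X

def avg_hessian(theta, X, lam=0.0):
    """Average per-sample Hessian of the logistic loss over the rows of X (plus optional damping)."""
    p = sigmoid(X @ theta)
    return (X.T * (p * (1.0 - p))) @ X / X.shape[0] + lam * np.eye(X.shape[1])

def group_influence_scores(theta_star, X, y, group_idx, x_t, y_t, lam=1e-3):
    """First- and second-order group influence of the samples in `group_idx`
    on the test loss at (x_t, y_t), following Theorem 1 and Eq. (3.16)."""
    S = X.shape[0]
    frac = len(group_idx) / S                                 # p = |U| / |S|
    H = avg_hessian(theta_star, X, lam)                       # Hessian of the (damped) empirical risk
    g_U = per_sample_grads(theta_star, X[group_idx], y[group_idx]).sum(axis=0)
    theta1 = -np.linalg.solve(H, g_U) / ((1.0 - frac) * S)    # I^(1)(U), with the 1/(1-p) correction
    H_U = avg_hessian(theta_star, X[group_idx])               # average per-sample Hessian over the group
    # I'(U): second-order correction capturing cross-dependencies within the group
    second = (frac / (1.0 - frac)) * (theta1 - np.linalg.solve(H, H_U @ theta1))
    g_t = (sigmoid(x_t @ theta_star) - y_t) * x_t             # gradient of the test loss
    return g_t @ theta1, g_t @ (theta1 + second)              # first-order and second-order scores
```

The ground-truth group influence against which these scores are correlated is obtained separately, by re-training the model without the group and measuring the resulting change in the test loss.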
3.6 Conclusion for Second-Order Group Influence Functions

In this paper, we proposed second-order group influence functions for approximating model changes when a group is removed from the training set. Empirically, for linear models, across different group sizes and types, we showed that the second-order influence has a higher correlation with the ground-truth values than the first-order influence and is more effective than existing first-order approximations. We observed that the second-order influence is significantly more informative when the changes to the underlying model are relatively large. We also showed that the proposed second-order group influence function can be used in conjunction with optimization techniques to select the most influential group in the training set for a particular test prediction. For non-linear models such as deep neural networks, we observed that both first-order and second-order influence functions lead to a non-significant correlation with the ground truth across different group sizes (although the correlation values for the second-order method were marginally better). Developing accurate group influence functions for neural networks, training neural networks to have improved influence functions, and extending group influence functions to the transfer learning setting as in [40] are directions for future work.

3.7 Influence Functions in Deep Learning

In machine learning, influence functions [51] can be used to estimate the change in model parameters when the empirical weight distribution of the training samples is perturbed infinitesimally. This approximation is cheaper to compute than the expensive process of repeatedly re-training the model to obtain the exact parameter changes. Influence functions can thus be used to understand the effect of removing an individual training point (or a group of training samples) on the model's predictions at test time. Leveraging a first-order Taylor approximation of the loss function, [119] showed that a (first-order) influence function, computed using the gradient and the Hessian of the loss function, can be used to interpret machine learning models, fix mislabelled training samples, and create data-poisoning attacks. Influence functions are, in general, well-defined and well-studied for models such as logistic regression [119], where the underlying loss function is convex. For convex loss functions, influence functions remain accurate even when the model perturbations are fairly large (e.g., in the group influence case [118]). However, when the convexity assumption on the underlying loss function is violated, as is the case in deep learning, the behaviour of influence functions is not well understood and is still an open area of research. With recent advances in computer vision [220], natural language processing [204], and high-stakes applications such as medicine [158], it has become particularly important to interpret deep model predictions. This makes it critical to understand influence functions in the context of deep learning, which is the main focus of our paper.

Despite this non-convexity, it is sometimes believed that influence functions should work for deep networks. The excellent work of [119] successfully demonstrated one example of influence estimation for a deep network: a small (2,600-parameter), "all-convolutional" network. To the best of our knowledge, this is one of the few cases for deep networks where influence estimation has been shown to work.
A question of key importance to practitioners then arises: for what other classes of deep networks does influence estimation work? In this work, we provide a comprehensive study of this question and find a pessimistic answer: influence estimation is quite fragile for a variety of deep networks. In the case of deep networks, several factors can affect influence estimates: (i) due to the non-convexity of the loss function, different initializations of the perturbed model can lead to significantly different model parameters (with approximately similar loss values); (ii) even if the initialization of the model is fixed, the curvature values of the network (i.e., the eigenvalues of the Hessian matrix) at the optimal model parameters might be very large for very deep networks, leading to a significant Taylor approximation error of the loss function and thus to poor influence estimates; (iii) for large neural networks, computing the exact inverse-Hessian-vector product required for influence estimates can be computationally very expensive, so one must resort to approximate inverse-Hessian-vector product techniques, which can be erroneous and result in low-quality influence estimates; and (iv) different architectures can have different loss-landscape geometries near the optimal model parameters, leading to varying influence estimates.

In this paper, we study the aforementioned issues of using influence functions in deep learning through an extensive experimental study on progressively more complex models and datasets. We first start our analysis with a case study of a small neural network on the Iris dataset, where the exact Hessian matrix can be computed. We then progressively increase the complexity of the network and analyse a CNN architecture (depth 6) trained on 10% of the MNIST dataset, similar to [119]. Next, we evaluate the accuracy of influence estimates for more complex deep architectures (e.g., ResNets) trained on MNIST and CIFAR-10. Finally, we compute influence estimates on the ImageNet dataset using ResNet-50. We make the following observations through our analysis:

• We find that network depth and width have a strong impact on influence estimates. In particular, we show that influence estimates are fairly accurate when the network is shallow, while for deeper models they are often erroneous. We attribute this partially to the increasing curvature values of the network as depth increases.
• We observe that weight-decay regularization is important for obtaining high-quality influence estimates for certain architectures and datasets.
• We show that inverse-Hessian-vector product approximation techniques such as stochastic estimation [4] are erroneous, especially when the network is deep. This can contribute to the low quality of influence estimates in deep models.
• We observe that the choice of test point has a significant impact on the quality of influence estimates, across different datasets and architectures.
• On very large-scale datasets such as ImageNet, we find that even ground-truth influence estimates (obtained by leave-one-out re-training) can be inaccurate and noisy, partially due to the model's training and convergence.

These results highlight the sensitivity of current influence functions in deep learning and call for the development of robust influence estimators for use in large-scale machine learning applications.
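Because the stochastic estimation referenced in the third observation above recurs throughout our experiments, we include a minimal sketch of it for reference. It follows a LiSSA-style recursion in the spirit of the stochastic estimator of [4]; the damping and scaling constants, the helper names, and the use of a mini-batch loader are illustrative assumptions rather than the exact implementation evaluated in this chapter.

```python
import torch

def hvp(loss, params, vec):
    """Hessian-vector product of `loss` with respect to `params`, applied to the list of tensors `vec`."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    dot = sum((g * v).sum() for g, v in zip(grads, vec))
    return torch.autograd.grad(dot, params)

def stochastic_inverse_hvp(model, loss_fn, loader, vec, damping=0.01, scale=25.0, steps=100):
    """Stochastic estimate of H^{-1} v: repeat h <- v + (1 - damping) * h - (H_batch h) / scale
    over mini-batch Hessians, then return h / scale. Constants need per-model tuning."""
    params = [p for p in model.parameters() if p.requires_grad]
    h = [v.clone() for v in vec]                      # running (scaled) estimate of H^{-1} v
    data_iter = iter(loader)
    for _ in range(steps):
        try:
            x, y = next(data_iter)
        except StopIteration:                         # recycle the loader if it is exhausted
            data_iter = iter(loader)
            x, y = next(data_iter)
        batch_loss = loss_fn(model(x), y)
        Hh = hvp(batch_loss, params, h)               # mini-batch Hessian-vector product
        h = [v + (1.0 - damping) * h_i - Hh_i / scale
             for v, h_i, Hh_i in zip(vec, h, Hh)]
    return [h_i / scale for h_i in h]
```

For this recursion to behave well, the scale has to be large relative to the dominant Hessian eigenvalues; since those eigenvalues grow with network depth, as we show later (Fig. 3.4-(b)), this is one plausible source of the degradation observed for deeper networks.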
3.8 Basics of Influence Functions

Consider h to be a function, parameterized by θ, that maps from an input feature space X to an output space Y. The training samples are denoted by the set S = {z_i = (x_i, y_i)}, i = 1, ..., n, and the loss for a particular training example z is denoted by ℓ(h_θ(z)).

Figure 3.3: Iris dataset experimental results: (a, b) comparison of the norm of parameter changes computed with the influence function vs. re-training, for a network trained (a) with weight decay and (b) without weight decay; (c) Spearman correlation vs. network depth; (d) Spearman correlation vs. network width.

Standard empirical risk minimization solves the following optimization problem:

\[
\theta^{*} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \ell(h_{\theta}(z_i)). \tag{3.17}
\]

Up-weighting a training example z by an infinitesimal amount ε leads to a new set of model parameters, denoted θ^ε_{\{z\}}, obtained by solving:

\[
\theta^{\epsilon}_{\{z\}} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \ell(h_{\theta}(z_i)) + \epsilon\, \ell(h_{\theta}(z)). \tag{3.18}
\]

Removing a training point z is equivalent to up-weighting it by ε = −1/n in Equation (3.18). The main idea used by [119] is to approximate θ^ε_{\{z\}} by a first-order Taylor series expansion around the optimal model parameters θ*, which leads to:

\[
\theta^{\epsilon}_{\{z\}} \approx \theta^{*} - \epsilon H_{\theta^{*}}^{-1} \nabla_{\theta} \ell(h_{\theta^{*}}(z)), \tag{3.19}
\]

where H_{θ*} is the Hessian with respect to the model parameters at θ*. Following the classical result of [51], the change in the model parameters (Δθ = θ^ε_{\{z\}} − θ*) upon up-weighting the training example z can be approximated by the influence function I(z) as follows:

\[
\mathcal{I}(z) = \frac{d\theta^{\epsilon}_{\{z\}}}{d\epsilon}\Big|_{\epsilon=0} = -H_{\theta^{*}}^{-1} \nabla_{\theta} \ell(h_{\theta^{*}}(z)). \tag{3.20}
\]

The change in the loss at a particular test point z_t when a training point z is up-weighted can be approximated in closed form by the chain rule [119]:

\[
\mathcal{I}(z, z_t) = -\nabla \ell(h_{\theta^{*}}(z_t))^{T} H_{\theta^{*}}^{-1} \nabla \ell(h_{\theta^{*}}(z)). \tag{3.21}
\]

I(z, z_t)/n is approximately the change in the loss on the test sample z_t when the training sample z is removed from the training set. This result, however, rests on the assumption that the underlying loss function is strictly convex in the model parameters θ and that the Hessian H_{θ*} is a positive-definite matrix [119]. For large models, inverting the exact Hessian H_{θ*} is expensive. In such cases, the inverse-Hessian-vector product can be computed efficiently with a combination of Hessian-vector products [185] and optimization techniques (see the Appendix for details).

3.9 What Can Go Wrong for Influence Functions in Deep Learning?

First-order influence functions [119] assume that the underlying loss function is convex and that the change in model parameters is small when the empirical weight distribution of the training data is infinitesimally perturbed. In essence, this requires the Taylor's gap in Equation (3.19) to be small for an accurate influence estimate. In the case of non-convex loss functions, however, this assumption is not generally true. Empirically, we find that the Taylor's gap is strongly affected by common hyper-parameters of deep networks. For example, in Fig. (3.3)-(a, b), we find that for networks trained without weight-decay regularization on Iris, the Taylor's gap is large, resulting in low-quality influence estimates. In a similar vein, when the network depth and width are significantly large (i.e., in the over-parameterized regime), the Taylor's gap increases and substantially degrades the quality of influence estimates (Fig. (3.4)).
Empirically, this increase in the Taylor's gap strongly correlates with the curvature values of the loss function evaluated at the optimal model parameters, as observed in Fig. (3.4)-(b). Further complications arise for larger models, where influence estimation requires an additional approximation to compute the inverse-Hessian-vector product. Nonetheless, we observe in Fig. (3.4)-(a) that, on Iris, this approximation has only a marginal impact on the influence estimates. These results show that the network architecture, the hyper-parameters, and the loss curvature are significant factors for accurate influence estimation. In the next section, we discuss these issues in detail through controlled experiments on datasets and models of increasing complexity.

3.10 Experiments

Datasets: We first study the behaviour of influence functions on the small Iris dataset [9], where the exact Hessian can be computed. We then progressively increase the complexity of the models and datasets: we use small MNIST [119] to evaluate the accuracy of influence functions for a small CNN architecture with a depth of 6. Next, we study influence functions for modern deep architectures trained on the standard MNIST [131] and CIFAR-10 [124] datasets. Finally, to understand how influence functions scale to large datasets, we compute influence estimates on ImageNet [55].

Evaluation Metrics: We evaluate the accuracy of influence estimates at a given test point z_t using both the Pearson [116] and the Spearman rank-order correlation with the ground truth (obtained by re-training the model) across a set of training points. Most existing interpretability methods require influential examples to be ranked in the correct order of their importance [84]; therefore, to evaluate the accuracy of influence estimates, the Spearman correlation is often the better choice.

3.10.1 Understanding Influence Functions when the Exact Hessian Can Be Computed

Setup: Computing influence estimates with the exact Hessian has two advantages for our study: (a) it bypasses the inverse-Hessian-vector product approximation techniques that introduce errors into influence estimates, so we can compare influence estimates computed with exact vs. approximate inverse-Hessian-vector products and quantify this type of error; and (b) the deviation of the parameters computed with the influence function from the exact parameters can be computed exactly, which further quantifies the error incurred by (first-order) influence estimates in the non-convex setting. However, computing the exact Hessian matrix and its inverse is only feasible for models with a small number of parameters. We therefore use the Iris dataset along with a small feed-forward neural network to analyse the behaviour of influence functions computed with the exact Hessian in a non-convex setting. We train models to convergence for 60k iterations with full-batch gradient descent. To obtain the ground-truth estimates, we re-train the models for 7.5k steps, starting from the optimal model parameters. For our analysis, we choose the test point with the maximum loss and evaluate the accuracy of influence estimates against the ground truth over the top 16.6% of the training points. Through our experiments with the exact Hessian, we answer several questions about how properties of the network, such as depth, width, and regularizers (e.g., weight decay), affect influence estimates.
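As a concrete reference for this setup, the sketch below computes influence estimates with the exact (damped) Hessian and their Spearman rank correlation with ground-truth estimates obtained by re-training. It is a minimal PyTorch sketch under stated assumptions: the model is small enough for the exact Hessian to be materialized, and retrain_fn is a hypothetical user-supplied helper that re-trains the model without a given training point and returns the resulting change in the test loss.

```python
import torch
from scipy.stats import spearmanr

def flat_grad(loss, params, create_graph=False):
    """Gradient of `loss` with respect to `params`, flattened into a single vector."""
    grads = torch.autograd.grad(loss, params, create_graph=create_graph)
    return torch.cat([g.reshape(-1) for g in grads])

def exact_hessian(loss, params):
    """Exact Hessian of `loss` with respect to the flattened parameters (feasible only for tiny models)."""
    g = flat_grad(loss, params, create_graph=True)
    rows = []
    for i in range(g.numel()):
        row = torch.autograd.grad(g[i], params, retain_graph=True, allow_unused=True)
        rows.append(torch.cat([(torch.zeros_like(p) if r is None else r).reshape(-1)
                               for r, p in zip(row, params)]))
    return torch.stack(rows)

def spearman_vs_retraining(model, loss_fn, X, y, x_t, y_t, retrain_fn, damping=1e-3):
    """Spearman correlation between influence estimates (Eq. 3.21, with a damped exact Hessian)
    and ground-truth estimates from leave-one-out re-training (`retrain_fn` is user-supplied)."""
    params = [p for p in model.parameters() if p.requires_grad]
    H = exact_hessian(loss_fn(model(X), y), params)
    H = H + damping * torch.eye(H.shape[0], dtype=H.dtype)   # damping, cf. H + lambda * I
    g_t = flat_grad(loss_fn(model(x_t), y_t), params)        # gradient of the test loss
    H_inv_gt = torch.linalg.solve(H, g_t)                    # one linear solve, reused for every training point
    scores, ground_truth = [], []
    for i in range(X.shape[0]):
        g_i = flat_grad(loss_fn(model(X[i:i + 1]), y[i:i + 1]), params)
        scores.append(-(H_inv_gt @ g_i).item())              # I(z_i, z_t) from Eq. (3.21)
        ground_truth.append(retrain_fn(i))                   # change in test loss after re-training without z_i
    return spearmanr(scores, ground_truth).correlation
```

Solving the linear system once for the test gradient and reusing the result for every training point avoids a separate solve per training example; this relies on the symmetry of the (damped) Hessian.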
The Effect of Weight Decay: One of the simplest and most common regularization techniques used to train neural networks is weight-decay regularization. In particular, a term λ∥θ∥²₂, penalizing the scaled norm of the model parameters, is added to the objective function during training, where λ is a hyper-parameter that needs to be tuned. We train a simple feed-forward network (with a width of 5, a depth of 1, and ReLU activations) with and without weight-decay regularization. For the network trained with weight decay, we observe a Spearman correlation of 0.97 between the influence estimates and the ground-truth estimates. In comparison, for the network trained without weight-decay regularization, the Spearman correlation decreases to 0.508. In this case, we notice that the Hessian matrix is singular, so a damping factor of 0.001 is added to the Hessian to make it invertible. To further understand the reason for this decrease in the quality of influence estimates, we compare the following quantities across all training examples: (a) the norm of the model parameter changes computed by re-training, and (b) the norm of the model parameter changes computed using the influence function (i.e., ∥H_{θ*}^{-1}∇ℓ(z_i)∥₂ for all i ∈ [1, n]); see Fig. 3.3-(a, b). We observe that when the network is trained without weight decay, the changes in model parameters computed with the influence function deviate significantly more from those computed by re-training. This suggests that the gap in the Taylor expansion underlying (first-order) influence estimates is large when the model is trained without weight decay. We observe similar results with smooth activation functions such as tanh (see the Appendix for details).

Figure 3.4: Iris dataset experimental results: (a) Spearman correlation of influence estimates with the ground-truth estimates, computed with stochastic estimation vs. the exact inverse-Hessian-vector product; (b) top eigenvalue of the Hessian vs. network depth; (c) Spearman correlation between the norm of parameter changes computed with the influence function vs. re-training.

The Effect of Network Depth: From Fig. 3.3-(c), we see that network depth has a dramatic effect on the quality of influence estimates. For example, when the depth of the network is increased to 8, we notice a significant decrease in the Spearman correlation. To better understand this decrease in quality for deeper networks, we compute the gap between the ground-truth parameter changes (computed by re-training) and the approximate parameter changes (computed using the influence function). To quantify this error gap, we compute the Spearman correlation between the norms of the true and approximate parameter changes across the top 16.6% of the influential examples. We find that with increasing depth, this Spearman correlation decreases. From Fig. 3.4-(c), we see that the approximation error gap is particularly large when the depth of the network is more than 5. We also notice a consistent increase in the curvature of the loss function (Fig. 3.4-(b)) as the network becomes deeper. This possibly suggests that the curvature of the network upper-bounds the approximation error gap between the true parameters and those computed using the influence function. We make a similar observation for non-smooth activation functions such as ReLU
(see the Appendix for more details).

The Effect of Network Width: To see the effect of network width on the quality of influence estimates, we evaluate influence estimates for a feed-forward network of constant depth while progressively increasing its width. From Fig. 3.3-(d), we observe that as the network width increases, the Spearman correlation decreases consistently. For example, the Spearman correlation decreases from 0.82 to 0.56 when the width of the network is increased from 8 to 50. This observation suggests that over-parameterizing a network by increasing its width has a strong impact on the quality of influence estimates.

Figure 3.5: Experiments on small MNIST using a CNN architecture. Estimation of influence, with and without weight decay, for (a) the top influential points and (b) training points at the 30th percentile of the influence-score distribution; (c) correlation vs. the weight-decay factor (evaluated on the top influential points).

The Effect of Stochastic Estimation on the Inverse-Hessian-Vector Product: For large deep networks, the inverse-Hessian-vector product is computed using stochastic estimation [3], as the exact Hessian matrix cannot b