ABSTRACT Title of dissertation: INTERPRETING DEEP LEARNING MODELS AND UNLOCKING NEW APPLICATIONS WITH IT Samyadeep Basu, Doctor of Philosophy, 2025 Dissertation directed by: Professor Soheil Feizi University of Maryland In recent years, modern deep learning has made significant strides across various domains, including natural language processing, computer vision, and speech recognition. These advancements have been driven by innovations in scaling pre-training data, developing new model architectures, integrating distinct modalities (e.g., vision and language, audio and language), and employing modern engineering practices. However, despite these innovations in building better models, progress in understanding these models to enhance their reliability has been relatively slow. In this thesis, we lay the groundwork for interpreting modern deep learning models—such as vision, text-to-image, and multimodal language models—by examining them through the perspectives of data and internal model components. We aim to unlock various capabilities, including model editing and model steering, to enhance their reliability. First, we build on the principles of robust statistics to interpret test-time predictions by identifying important training examples using higher-order influence functions. However, we find that influence functions can be fragile for large deep models, which limits their practical applications. To address this, we develop optimization-based data selection strategies to automatically generate stress-testing sets from large vision datasets, testing the reliability of vision models within a few-shot learning framework. Overall, our investigations show that while analyzing models through the lens of data provides valuable insights for potential improvements, it does not offer a direct method for controlling and enhancing the reliability of these models. To this end, we investigate deep models by focusing on their internal components. We develop causal mediation analysis methods to understand knowledge storage in text-to-image generative models like Stable Diffusion. Based on these insights, we create novel model editing techniques that can remove copyrighted styles and objects from text-to-image models with minimal weight updates. We scale these methods to edit large open-source models such as SD-XL and DeepFloyd. As a follow-up, we then introduce innovative causal mediation analysis methods and a richly annotated probe dataset to interpret multimodal large language models like LLaVa. Our approach allows us to understand how these models internally retrieve relevant knowledge for factual Visual Question Answering (VQA) tasks. Leveraging these insights, we develop a novel model editing method that can effectively introduce rare, long-tailed knowledge or correct specific failure modes in multimodal large language models. Using similar principles, we explore vision models (in particular the ViT architecture), developing methods to interpret image representations based on internal components such as attention heads, using text descriptions. We apply these interpretability insights to (i) mitigate spurious correlations, (ii) enable zero-shot segmentation, and (iii) facilitate text or image-conditioned image retrieval. We also extend our mechanistic interpretability techniques to understand and control language models for real-world tasks, such as context-augmented generation in question-answering systems (i.e., extractive QA).
In particular, we find that insights from mechanistic circuits can be useful towards context-data attribution and model steering towards improved context faithfulness. Finally, we leverage interpretability insights from multimodal models to enhance their compositionality in image-conditioned text retrieval and text-guided image generation. For vision-language models (VLMs) like CLIP, we propose a distillation method that transfers compositional knowledge from diffusion models to CLIP. For diffusion models, we introduce a lightweight fine-tuning approach that learns a linear layer on the conditioning text encoder, improving compositional generation for attribute binding. Overall, our thesis designs and adapts interpretable methods and leverages interpretable insights to uncover various capabilities in pre-trained models. INTERPRETING DEEP LEARNING MODELS AND UNLOCKING NEW APPLICATIONS WITH IT by Samyadeep Basu Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2025 Advisory Committee: Professor Soheil Feizi, Chair/Advisor Professor Hernisa Kacorri, Dean’s Representative Professor Furong Huang, Professor Abhinav Shrivastava, Committee Members Dr. Varun Manjunatha, Dr. Daniela Massiceti, External Members © Copyright by Samyadeep Basu 2025 Dedication To my parents, partner and friends for their love and support. ii Acknowledgments First and foremost, I would like to express my deepest gratitude to Dr. Soheil Feizi for his invaluable guidance and unwavering support throughout my PhD. His mentorship has been one of the most significant factors in the successful completion of my dissertation. I arrived at the University of Maryland in 2018 to pursue a Master’s degree, initially intending to focus on Computational Biology. However, due to various circumstances, that path did not materialize. While searching for an advisor, Dr. Feizi took a chance on me in my second year and guided me through my first research project, which led to a submission at AISTATS. With no prior background in machine learning or deep learning, the project was a steep learning curve, but Dr. Feizi’s encouragement ensured that I persevered. Due to personal circumstances, I had to defer my PhD admission, and Dr. Feizi was incredibly supportive of my decision. After completing my Master’s, he gave me the freedom to explore diverse topics in deep learning, allowing me to build a broad foundation. This combination of intellectual freedom and strong mentorship was instrumental in helping me publish at top conferences and secure valuable internships. I am profoundly grateful to Dr. Feizi for his guidance, patience, and support throughout my PhD journey—his influence has truly shaped my academic and professional path. The PhD journey can be long and often lonely, and I am incredibly grateful to have had the unwavering support of my parents, partner—now my wife—Sneha and in-laws. Sneha has stood by me through my lowest moments in ways no one else could, and for that, I will always be indebted to her. During my time in Maryland, I was fortunate to have the support of a wonderful group of friends who made this journey fulfilling. My College Park friends—Aman, Shlok, Ameya, Ryan, Naman, Anjali, Sai, Noor, Vasu, Komal, Sanchita, Yatharth, Neha, Pavan, Amanpreet, Ishita, Shishira, Shramay, Anshul, Susmija, Pranav, Ketul, Aadesh, and Manas—played an integral role in making these years memorable. 
I am also deeply thankful to my friends from my undergraduate days and beyond—Aditya, Siddhant, Dhairya, Srajit, Parikshit, Vandit, Surbhi, Dewanshu, Anish, Kunal, Fabian, and Himanshu—whose regular conversations and encouragement kept me going. Their support made all the difference in this journey, and I am truly grateful to have them in my life. During my PhD, I also had the opportunity to do internships at Microsoft Research and Adobe Research. From MSR, I would particularly like to thank Dr. Daniela Massiceti, who supported me not only during my internship but also beyond it, as a mentor throughout my PhD. Even though there was a time difference, she made sure to schedule regular meetings to mentor me and carve a path towards a successful PhD. From Adobe, Varun has been the main motivator for me to work on interpretability. His philosophy of reverse engineering large models has shaped my PhD and is, in fact, one of the core parts of this thesis. He has not only supported me on projects, but has also been a guiding light towards a successful PhD and the transition beyond it. I will forever be indebted to both Varun and Daniela. I can easily say that they have turned from great mentors to friends along the way—for which I am grateful. Finally, I would like to thank my amazing labmates, without whom this PhD would not have been possible. Table of Contents Dedication ii Acknowledgements iii 1 Introduction 1 1.1 Thesis Statement . . . . . . . . . . 1 1.2 Thesis Overview . . . . . . . . . . 1 1.3 Thesis Contributions . . . . . . . . . . 4 1.4 Publications and Authorship . . . . . . . . . . 8 2 Related Work 10 2.1 Interpreting Test-Time Predictions Through Influence Functions . . . . . . . . . . 10 2.2 Automatically Designing Difficult Few-Shot Benchmarks for Model Reliability 11 2.3 Mechanistically Understanding and Editing Text-to-Image Diffusion Models 12 2.4 Mechanistically Understanding and Editing Multimodal Language Models . . . . . . . . . . 13 2.5 Mechanistically Understanding and Unlocking Zero-Shot Capabilities in Vision Transformers . . . . . . . . . . 15 2.6 Mechanistic Circuits for Extractive Question-Answering . . . . . . . . . . 16 2.7 Improving Compositionality in Multimodal Models . . . . . . . . . . 17 2.7.1 Compositionality in CLIP . . . . . . . . . . 17 2.7.2 Compositionality in Text-to-Image Models . . . . . . . . . . 17 3 Interpreting Test-Time Predictions With Influence Functions 19 3.1 Introduction . . . . . . . . . . 19 3.2 Background . . . . . . . . . . 22 3.3 Group Influence Function . . . . . . . . . . 24 3.4 Computational Complexity . . . . . . . . . . 31 3.5 Experiments . . . . . . . . . . 32 3.5.1 Setup . . . . . . . . . . 32 3.5.2 Datasets . . . . . . . . . . 32 3.5.3 Observations and Analysis . . . . . . . . . . 33 Linear Models . . . . . . . . . . 33 Neural Networks . . . . . . . . . . 34 3.6 Conclusion for Second-Order Group Influence Functions . . . . . . . . . .
35 3.7 Influence Functions in Deep Learning . . . . . . . . . . . . . . . . . . . . 36 3.8 Basics of Influence Function . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.9 What Can Go Wrong for Influence Functions In Deep Learning? . . . . . . 40 3.10 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.10.1 Understanding Influence Functions when the Exact Hessian Can be Computed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.10.2 Understanding Influence Functions in Shallow CNN Architectures . 45 3.10.3 Understanding Influence Functions in Deep Architectures . . . . . . 47 3.10.4 Is Scaling Influence Estimates To ImageNet Possible? . . . . . . . . 49 3.11 Conclusion for Influence Functions in Deep Learning . . . . . . . . . . . . 51 4 Automatically Designing Difficult Few-Shot Benchmarks for Model Reliability 52 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.2 Few-Shot Classification: Preliminaries and Notations . . . . . . . . . . . . 55 4.3 FASTDIFFSEL: An Efficient Algorithm to Select Difficult Support Sets . . . 56 4.3.1 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.4 Difficult Support Set Extraction on META-DATASET . . . . . . . . . . . . . 60 4.4.1 Test task samplers for META-DATASET . . . . . . . . . . . . . . . . 61 4.4.2 Validation of difficult META-DATASET tasks . . . . . . . . . . . . . 62 4.5 Stress Testing With HARD-META-DATASET++ . . . . . . . . . . . . . . . . 63 4.5.1 Test datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.5.2 Metrics and training . . . . . . . . . . . . . . . . . . . . . . . . . . 65 4.5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 Results on Difficult Tasks from META-DATASET . . . . . . . . . . . 66 Results on Difficult Tasks from CURE-OR, ORBIT and OBJECTNET . 68 4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 5 Mechanistically Understanding and Editing Text-to-Image Generative Models 71 5.1 Knowledge Localization and Model Editing in Early Stable-Diffusion Variants 71 5.2 Causal Tracing for Text-to-Image Generative Models . . . . . . . . . . . . 75 5.2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 5.2.2 Adapting Causal Tracing For Text-to-Image Diffusion Models . . . 76 5.2.3 Tracing Knowledge in UNet . . . . . . . . . . . . . . . . . . . . . 77 5.2.4 Tracing Knowledge in the Text-Encoder . . . . . . . . . . . . . . . 79 5.2.5 Extracting Causal States Using CLIP-Score . . . . . . . . . . . . . 80 5.3 How is Knowledge Stored in Text-to-Image Models? . . . . . . . . . . . . 81 5.4 DIFF-QUICKFIX: Fast Model Editing for Text-to-Image Models . . . . . . 84 5.4.1 Editing Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 5.4.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 85 vi 5.4.3 Editing Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 5.5 Conclusion I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 5.6 Knowledge Localization and Model Editing Across Various Open-Source Text-to-Image Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 5.7 On the Effectiveness of Causal Tracing for Text-to-Image Models . . . . . . 91 5.8 LOCOGEN: Towards Mechanistic Knowledge Localization . . . . . . . . . 93 5.8.1 Knowledge Control in Cross-Attention Layers . . . . . . . . . . . . 94 Altered Inputs . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . 94 LOCOGEN Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 96 5.8.2 Empirical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 5.9 LOCOEDIT : Editing to Ablate Concepts . . . . . . . . . . . . . . . . . . . 101 5.9.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 5.9.2 Model Editing Results . . . . . . . . . . . . . . . . . . . . . . . . 103 5.10 On Neuron-Level Model Editing . . . . . . . . . . . . . . . . . . . . . . . 104 5.11 Conclusion II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 6 Mechanistically Understanding and Editing Multimodal Language Models 109 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 6.2 A Constraint-Based Framework for Studying Information Storage and Trans- fer in MLLMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 6.2.1 A Multi-modal Constraint-based Framework . . . . . . . . . . . . . 113 6.2.2 MULTIMODALCAUSALTRACE: Studying Information Storage in MLLMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 6.2.3 Studying Information Transfer in MLLMs with Attention Contribu- tions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 6.2.4 VQA-Constraints: A Constraint Annotated Test-Bed for VQA . . . 117 6.3 Key Findings in how MLLMs Store and Transfer Information . . . . . . . . 118 6.3.1 Finding 1: Early MLPs and self-attention layers are causal . . . . . 118 6.3.2 Finding 2: Only a subset of visual tokens are involved in transferring information from the image to the early causal MLP layers. . . . . . 120 6.3.3 Finding 3: Mid-layer self-attention layers are involved in transfer- ring information from the early causal layers to the question’s final token . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 6.3.4 Finding 4: Mid-layer self-attention contributions can be used to predict whether a MLLM will generate a correct answer, but model confidence is a more reliable predictor . . . . . . . . . . . . . . . . 121 6.4 Correcting and Inserting Long-Tailed Information in MLLMs . . . . . . . . 122 6.4.1 MULTEDIT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 6.4.2 Experimental details . . . . . . . . . . . . . . . . . . . . . . . . . 124 6.4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 vii 7 Mechanistically Understanding and Unlocking Zero-Shot Capabilities in Vision Transformers 127 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 7.1.1 REPDECOMPOSE: Automated Representation Decomposition for ViTs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 7.2 Aligning the component representations to CLIP space . . . . . . . . . . . 132 7.3 Component ablation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 7.4 Feature-based component analysis . . . . . . . . . . . . . . . . . . . . . . 137 7.4.1 Text based image retrieval . . . . . . . . . . . . . . . . . . . . . . 139 7.4.2 Image based image retrieval . . . . . . . . . . . . . . . . . . . . . 141 7.4.3 Zero-shot spurious correlation mitigation . . . . . . . . . . . . . . 142 7.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 8 A Mechanistic Circuit for Extractive Question-Answering 144 8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . 144 8.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 8.3 Deciphering a Circuit for Extractive QA . . . . . . . . . . . . . . . . . . . 148 8.3.1 Designing the Probe Dataset . . . . . . . . . . . . . . . . . . . . . 150 8.3.2 Interventional Steps for Extracting Circuits . . . . . . . . . . . . . 151 8.3.3 Insights For Extractive QA through Circuits . . . . . . . . . . . . . 154 Context Faithfulness Circuit Differs from Parametric Memory Circuit154 Validation of the Extracted Circuit . . . . . . . . . . . . . . . . . . 155 A Small Set of Attention Heads in the context circuit are interpretable156 One Can Switch Between Memory and Copy Faithfulness Circuits . 156 8.4 Application 1: Attribution for Free Via One Attention Head . . . . . . . . . 158 8.4.1 ATTNATTRIB: A Simple and Strong Data Attribution Method for Extractive QA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 8.4.2 Evaluation on Extractive QA Benchmarks . . . . . . . . . . . . . . 159 8.5 Application 2: Towards Improved Context Faithfulness . . . . . . . . . . . 161 8.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162 9 Improving Compositionality in Multimodal Models 163 9.1 Compositionality in CLIP . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 9.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 9.1.2 Denoising Diffusion Score for Visio-Linguistic Reasoning . . . . . 165 9.1.3 SDS-CLIP: Our Method . . . . . . . . . . . . . . . . . . . . . . . 167 9.1.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 168 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 9.1.5 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170 9.1.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171 9.2 Compositionality in Text-to-Image Diffusion Models . . . . . . . . . . . . 171 viii 9.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171 9.2.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174 9.2.3 Sources of Compositionality Failures . . . . . . . . . . . . . . . . 175 Source (i) : Erroneous Attention Contributions in CLIP . . . . . . . 176 Source (ii) : Sub-optimality of CLIP Text-Encoder for Composi- tional Prompts . . . . . . . . . . . . . . . . . . . . . . . 178 9.2.4 Projection Layer for Enhancing Compositionality in the CLIP Text Embedding Space . . . . . . . . . . . . . . . . . . . . . . . . . . . 180 CLP: Token-wise Compositional Linear Projection . . . . . . . . . . 180 WiCLP: Window-based Compositional Linear Projection . . . . . . 181 9.2.5 SWITCH-OFF: Trade-off between Compositionality and Clean Ac- curacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182 9.2.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183 Qualitative and Quantitative Evaluation . . . . . . . . . . . . . . . 185 9.2.7 Impact of WiCLP on Subsets of Tokens . . . . . . . . . . . . . . . . 187 9.2.8 Alternatives to WiCLP . . . . . . . . . . . . . . . . . . . . . . . . . 187 9.2.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188 10 Conclusion and Future Work 189 10.1 Reading List . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190 10.2 Understanding Model Through the Lens of Data . . . . . . . . . . . . . . . 
190 10.3 Understanding Model Through Internal Model Components . . . . . . . . . 191 10.4 Model Steering or Editing . . . . . . . . . . . . . . . . . . . . . . . . . . . 192 11 Appendix 194 11.1 Interpretation of Models Through Lens of Data . . . . . . . . . . . . . . . 194 11.2 Running Times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194 11.2.1 Faithfulness and Plausibility of Influence functions . . . . . . . . . 195 11.3 Automatically Designing Difficult Few-Shot Tasks . . . . . . . . . . . . . . 196 11.3.1 Support Set Extraction Algorithm . . . . . . . . . . . . . . . . . . 196 Steps For Solving the Projection Step . . . . . . . . . . . . . . . . 196 Hyperparameters of the Framework . . . . . . . . . . . . . . . . . 198 11.4 Mechanistically Understanding and Editing Text-to-Image Models . . . . . 198 11.4.1 Probe Dataset Design Details . . . . . . . . . . . . . . . . . . . . . 198 11.5 Mechanistically Understanding and Editing Multimodal Language Models . 204 11.5.1 VQA-Constraints Details . . . . . . . . . . . . . . . . . . . . . . . 204 11.5.2 Standard Causal Tracing Does Not Recover Causal States . . . . . . 206 11.6 Mechanistically Understanding and Unlocking Zero-Shot Capabilities in Vision Transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207 11.6.1 Scoring Function . . . . . . . . . . . . . . . . . . . . . . . . . . . 207 11.6.2 Proof of Theorem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 207 11.7 A Mechanistic Circuit for Extractive Question-Answering . . . . . . . . . . 210 ix 11.7.1 Note on Second-order Circuit Components . . . . . . . . . . . . . 210 11.7.2 On Modifying Circuit Components . . . . . . . . . . . . . . . . . . 210 11.7.3 Extracted Circuit Components Across Language Models . . . . . . 211 11.7.4 Vicuna . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211 Context Faithfulness . . . . . . . . . . . . . . . . . . . . . . . . . 211 Memory Faithfulness . . . . . . . . . . . . . . . . . . . . . . . . . 211 11.7.5 Llama-3-8B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212 Context Faithfulness . . . . . . . . . . . . . . . . . . . . . . . . . 212 Memory Faithfulness . . . . . . . . . . . . . . . . . . . . . . . . . 212 11.7.6 Phi-3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213 Context Faithfulness . . . . . . . . . . . . . . . . . . . . . . . . . 213 Memory Faithfulness . . . . . . . . . . . . . . . . . . . . . . . . . 213 11.7.7 Do we need a larger probe dataset? . . . . . . . . . . . . . . . . . . 213 11.7.8 Probe Dataset Details . . . . . . . . . . . . . . . . . . . . . . . . . 214 Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214 Example 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 11.7.9 Data Attribution Evaluation Dataset Descriptions . . . . . . . . . . 216 11.7.10Validating Long Extractive Answer Generations . . . . . . . . . . . 216 11.7.11 Results on CNN-Dailymail . . . . . . . . . . . . . . . . . . . . . . 218 11.7.12 Results on NQ-Long . . . . . . . . . . . . . . . . . . . . . . . . . 219 11.7.13 Circuit Components and Data Attribution in Llama-3-70B . . . . . 219 11.8 Improving Compositionality in CLIP . . . . . . . . . . . . . . . . . . . . . 220 11.8.1 Benchmark Datasets . . . . . . . . . . . . . . . . . . . . . . . . . 220 11.8.2 Does distilling features directly from UNet help? . . . . . . . . . . 221 11.8.3 Additional Method Details . . . . . . . . . . . . . . . . . . . . . . 
222 11.8.4 When does distillation not help CLIP? . . . . . . . . . . . . . . . . 222 11.8.5 More Experimental Details . . . . . . . . . . . . . . . . . . . . . . 223 11.8.6 Fine-tuning with Conceptual Captions . . . . . . . . . . . . . . . . 224 11.8.7 Results with OpenCLIP . . . . . . . . . . . . . . . . . . . . . . . . 224 11.8.8 Additional Results on CLEVR . . . . . . . . . . . . . . . . . . . . 224 11.8.9 Is it the Scale of Pre-Training Data Which Helps? . . . . . . . . . 225 11.8.10 Beyond CLIP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225 11.9 Improving Compositionality in Text-to-Image Models . . . . . . . . . . . . 225 Bibliography 228 x Chapter 1: Introduction 1.1 Thesis Statement In this thesis, we develop and investigate methods for interpreting deep models through the lens of data and internal model components. We use these insights towards developing fast and scalable model editing methods, automatically generating difficult few-shot learning benchmarks and mitigating spurious correlations amongst others. 1.2 Thesis Overview In recent years, a vast array of deep learning models has been developed and implemented in real-world applications. These models encompass unimodal types, such as those for text, images, videos, and audio, as well as multimodal models that combine modalities like vision and language, video and language, or audio and language. As these models have grown rapidly in size—driven by increases in model size, the scale of pre-training data, and the availability of advanced computing infrastructure—the research community has struggled to fully understand how these models make specific decisions. Furthermore, it remains unclear whether gaining a deeper understanding of these models would directly contribute to targeted improvements or enhancements in their downstream applications. In this thesis, we establish a framework for efficiently interpreting recently developed deep learning models, including 1 both unimodal and multimodal types. We explore these models through the perspectives of their pre-training data, internal model components, and fine-tuning algorithms. First, we investigate how a classifier’s decision-making process can be attributed to a group of training samples to understand the failure modes of deep models. We develop second-order group influence functions, which can efficiently approximate leave-k-out retraining. Through a range of experiments on synthetic data and standard image datasets, we demonstrate that our proposed second-order influence function better approximates leave- k-out retraining than first-order influence functions. However, for deep models involving non-convex losses, we also find that the first-order influence function baseline is often inaccurate when compared to ground-truth influence. We then conduct a comprehensive large-scale empirical study to highlight the advantages and limitations of influence functions for interpreting deep models in the context of training data. Using datasets up to the scale of ImageNet, we identify the conditions under which the approximation provided by influence functions is relatively error-free. Given that influence functions are unstable for highly overparameterized models, we explore a different algorithmic approach to understanding the failure modes of pre-trained models. Specifically, we examine these failure modes through the lens of few-shot tasks. 
To understand the worst-case failure scenarios of deep models, we design FastDiffSel, an optimization-based algorithm that can automatically extract challenging training sets for a given test set. We use FastDiffSel to identify difficult few-shot tasks from vision datasets, including ImageNet, ObjectNet, and CURE-OR. Our findings reveal that pre-trained models often fail when there is a natural distribution shift between the few-shot training set and the test examples. As a result, we curate a challenging few-shot testing set, HardMetaDataset++, which can be used to stress-test models. While analyzing deep models through the lens of data helps identify their failure modes, it offers limited flexibility for post-training model control. To address this, we shift our focus to a "mechanistic" investigation, examining the internal components of these models. We first develop methodologies to understand how knowledge is stored in large-scale text-to-image models. Based on our findings, we design scalable, efficient, and data-free model editing techniques to remove copyrighted concepts from these models. Our empirical experiments demonstrate the effectiveness of our model editing methods in modifying large-scale open-source text-to-image generative models. We then extend our approach to multimodal language models, developing MultimodalCausalTrace, a tool that identifies crucial model components for factual Visual Question Answering (VQA) tasks. Building on these insights, we introduce MultEdit, a method for editing multimodal language models to insert new, rare knowledge or fix existing issues. Although our methods currently focus on interpreting multimodal models, a significant challenge remains in understanding the internal components of general Vision Transformers (ViTs) using human-understandable concepts, such as text. To tackle this, we create RepDecompose, an approach that automatically decomposes final representations in general ViTs through a recursive process. These components are then aligned with CLIP's image encoder, allowing interpretation via the text encoder. Our analysis reveals that different attention heads in ViTs encode distinct concepts, such as patterns, colors, and locations. We leverage these insights to modify the identified attention heads, mitigating spurious correlations and utilizing their embeddings for tasks like zero-shot segmentation, text-conditioned image retrieval, and general image retrieval. While our current work has laid the groundwork for interpreting and controlling vision and multimodal models, we also focus on adapting these methods to control language models. In particular, we enhance language models for tasks such as data attribution to context and mitigating hallucinations. Given the recent advancements in retrieval-augmented generation, there are significant practical applications in context-augmented question-answering setups. In this phase of our research, we investigate the internal circuits (e.g., sub-graphs) of language models that are causally linked to retrieval-augmented generation tasks. By analyzing different components of these circuits, we design zero-shot data attribution methods. Finally, we investigate compositionality issues in VLMs (e.g., CLIP) and text-guided image generation models (e.g., diffusion models). In particular, we find that diffusion models are strong in terms of compositionality and such knowledge can be transferred to CLIP to improve its compositionality. To this end, we introduce SDS-CLIP, a light-weight fine-tuning-based distillation method which can improve CLIP's compositionality without harming its zero-shot capabilities. For diffusion models, we find that the text embedding for compositional prompts is often sub-optimal. We show that solely fine-tuning a linear projection layer on CLIP's text embedding can improve compositional generation for a variety of open-source text-to-image diffusion models (including SDv3). Overall, our thesis has developed new approaches towards understanding deep models and has shown the possibilities of practical applications of model interpretability.

1.3 Thesis Contributions

This thesis makes several research contributions towards interpreting deep learning models, spanning both unimodal and multimodal models. In particular, we make contributions towards interpreting deep learning models through the lens of data as well as internal model components. Using our interpretability insights, we further develop light-weight methods towards unlocking capabilities in these models such as model editing. Below we state our contributions:

Interpreting Test-Time Predictions Through Influence Functions
• We develop second-order group influence functions which can attribute test-time predictions to a group of samples in the training data. Our second-order group influence function effectively approximates leave-k-out retraining via a second-order Taylor expansion around the model trained optimally on all the training examples. [ICML 2020]
• We empirically investigate the limits of influence functions for deep networks. To this end, we first analyse influence functions in a controlled experimental setup with synthetic data. We then scale up the analysis across different pre-training data and models up to the ImageNet scale – highlighting the fragilities of influence functions at larger model scales. [ICLR 2021]

Automatically Designing Difficult Few-Shot Benchmarks for Model Reliability
• We design an optimization-based data selection algorithm which can automatically curate difficult few-shot benchmarks from large-scale vision datasets. The curated dataset from our algorithm can be used towards stress-testing deep models for reliability. [ICLR 2023]

Mechanistically Understanding and Editing Text-to-Image Diffusion Models
• We design a causal tracing methodology which can locate internal model components that causally store knowledge corresponding to various visual attributes such as style, objects or facts. We then design model editing methods towards updating the weights of the identified components in a light-weight manner. [ICLR 2024]
• We investigate knowledge storage about visual attributes in the cross-attention layers across various open-source text-to-image diffusion models. We then use model editing towards updating the weights in those locations to remove copyrighted styles and objects and to update the model with new facts. [ICML 2024]

Mechanistically Understanding and Editing Multimodal Language Models
• We develop MultimodalCausalTrace, which can identify causal locations for a factual VQA task using a constraint-based formulation. Along with providing salient interpretability insights into the inner workings of multimodal language models, we introduce MultEdit, which can effectively introduce long-tailed knowledge into these models. [NeurIPS 2024]

Mechanistically Understanding and Unlocking Zero-Shot Capabilities in Vision Transformers
• We introduce RepDecompose, which decomposes the final representation in Vision Transformers as a function of internal model components such as attention heads. We then interpret these attention heads via text, by aligning their embeddings to CLIP's image encoder and then using CLIP's text-encoder to interpret them. Based on our interpretability insights, we unlock various zero-shot capabilities in Vision Transformers: (i) spurious correlation mitigation; (ii) zero-shot segmentation; (iii) image- or text-conditioned image retrieval. [NeurIPS 2024]

Mechanistically Understanding and Enhancing Context-Augmented Language Models
• Large language models are widely used for document processing and question-answering. In this work, we extract mechanistic circuits for context-augmented extractive QA using causal mediation analysis on model components (e.g., attention heads, MLPs). Our analysis reveals how models balance parametric memory and retrieved context, identifying a small set of attention heads that reliably perform data attribution by default. Leveraging this, we introduce ATTNATTRIB, a fast attribution algorithm achieving state-of-the-art results across QA benchmarks. We further demonstrate that ATTNATTRIB can steer models to prioritize context over parametric memory. Beyond insights into model behavior, our work highlights practical applications of circuits in attribution and model control. [ICML Review]

Improving Compositionality in Multimodal Models
• Image-text contrastive models like CLIP excel in zero-shot classification, retrieval, and transfer learning but struggle with compositional visio-linguistic tasks (e.g., attribute binding, object relationships), often performing at chance levels. To address this, we propose SDS-CLIP, a lightweight, sample-efficient distillation method that enhances CLIP's compositional reasoning. Our approach fine-tunes CLIP using a distillation objective from large text-to-image generative models like Stable Diffusion, known for strong visio-linguistic reasoning. SDS-CLIP improves CLIP's performance by up to 7% on Winoground and 3% on ARO, demonstrating the potential of generative model distillation to enhance contrastive learning. [EMNLP 2024]
• Text-to-image diffusion-based generative models have the stunning ability to generate photo-realistic images and achieve state-of-the-art low FID scores on challenging image generation benchmarks. However, one of the primary failure modes of these text-to-image generative models is in composing attributes, objects, and their associated relationships accurately into an image. In our paper, we investigate compositional attribute binding failures, where the model fails to correctly associate descriptive attributes (such as color, shape, or texture) with the corresponding objects in the generated images, and highlight that imperfect text conditioning with the CLIP text-encoder is one of the primary reasons behind the inability of these models to generate high-fidelity compositional scenes. In particular, we show that (i) there exists an optimal text-embedding space that can generate highly coherent compositional scenes, showing that the output space of the CLIP text-encoder is sub-optimal, and (ii) the final token embeddings in CLIP are erroneous as they often include attention contributions from unrelated tokens in compositional prompts.
Our main finding shows that significant compositional improvements can be achieved (without harming the model's FID score) by fine-tuning only a simple and parameter-efficient linear projection on CLIP's representation space in Stable-Diffusion variants, using a small set of compositional image-text pairs. [ACL Submission]

1.4 Publications and Authorship

This thesis draws upon previously published manuscripts, manuscripts currently under review, and ongoing works listed below. While I serve as the principal author (except for Chapters 7 and 9.2), the research presented here reflects the culmination of collaborative efforts with my advisor, Soheil Feizi, and mentors Daniela Massiceti and Varun Manjunatha, alongside invaluable contributions from mentors and colleagues at UMD, Adobe Research and Microsoft Research. Throughout Chapters 3-9, I use the pronoun 'we' to acknowledge the collective contributions of all my collaborators.

Figure 1.1: Primary Works Directly Related to the Thesis.
Figure 1.2: Additional Relevant Works done during the Thesis.

Chapter 2: Related Work

2.1 Interpreting Test-Time Predictions Through Influence Functions

Influence functions, a classical technique from robust statistics introduced by [50, 51], were first used in the machine learning community for interpretability by [119] to approximate the effect of upweighting a training point on the model parameters and the test loss for a particular test sample. In the past few years, there has been an increase in the applications of influence functions for a variety of machine learning tasks. [203] used influence functions to produce confidence intervals for a prediction and to audit the reliability of predictions. [242] used influence functions to approximate the gradient in order to recover a counterfactual distribution and increase model fairness, while [30] used influence functions to understand the origins of bias in word-embeddings. [117] crafted stronger data poisoning attacks using influence functions. Influence functions can also be used to detect extrapolation [159] in certain specific cases, validate causal inference models [7] and identify influential pre-training points [40]. The infinitesimal jackknife and the delta method are ideas closely related to influence functions for linear approximations of leave-one-out cross-validation [66, 107]. Recently, a higher-order instance [85] of the infinitesimal jackknife [107] was used to approximate cross-validation procedures. While their setting, corresponding to approximations of leave-k-out re-training, is relatively similar to our paper, our higher-order terms preserve the empirical weight distribution of the training data in the ERM and are derived from influence functions, while [85] uses instances of the infinitesimal jackknife. These differences lead to our higher-order terms being marginally different from the one proposed in [85]. Our proposed second-order approximation for the group influence function is additionally backed by a thorough empirical study across different settings in the case of linear models, which has not yet been explored in prior works. Recently, alternative methods to find influential samples in deep networks have been proposed. In [258], test-time predictions are explained by a kernel function evaluated at the training samples. Influential training examples can also be obtained by tracking the change in loss for a test prediction through model checkpoints, which are stored during training time [189]. While these alternative methods [189, 258] work well for deep networks in interpreting model predictions, they lack the "jackknife"-like ability of influence functions, which makes them useful in multiple applications other than interpretability (e.g., uncertainty estimation).

2.2 Automatically Designing Difficult Few-Shot Benchmarks for Model Reliability

Difficult tasks. Previous works [2, 12, 57] have shown that state-of-the-art few-shot classifiers generally display a wide range in performance when adapting to different test tasks. [2] use this observation to develop a greedy search-based algorithm that can specifically extract difficult tasks for further study. They consider only meta-learning approaches and, due to the computational requirements of a greedy search, are limited to small-scale datasets including mini-ImageNet and CIFAR-FS. [12] also study difficult tasks through a correlation-based analysis. We extend on both of these works by (i) proposing a scalable algorithm, FastDiffSel, that can extract difficult tasks from any large-scale vision dataset, and (ii) conducting a deep empirical evaluation on the robustness of a broader range of meta-learning and transfer learning approaches on these difficult tasks. Ideas from subset selection [114, 247] can potentially be adapted for few-shot task extraction, but we leave this for future work. Few-shot classification benchmarks. Meta-Dataset [230] and Hard-Meta-Dataset++ [65] are two of the most challenging few-shot image classification benchmarks in the current literature. They cover a wide range of domains and primarily evaluate the ability of a few-shot classifier to generalise to novel object classes, datasets and domains. Other few-shot benchmarks have been introduced to specifically target adaptation to images with high real-world variation, including ORBIT [163], and cross-domain transfer beyond natural images, including BSCD-FSL [89]. We note, however, that unlike Hard-Meta-Dataset++, none of these benchmarks specifically target difficult tasks for few-shot classification.

2.3 Mechanistically Understanding and Editing Text-to-Image Diffusion Models

Text-to-Image Diffusion Models. In the last year, a large number of text-to-image models such as Stable-Diffusion [197], DALLE [193], Imagen [201] and others [16, 38, 59, 111] have been released. In addition, the open-source community has released DeepFloyd (https://www.deepfloyd.ai) and Midjourney (https://www.midjourney.com/), which can generate photorealistic images given a text prompt. While most of these models operate in the latent space of the images, they differ in the text-encoder used. For example, Stable-Diffusion uses CLIP as the text-encoder, whereas Imagen uses T5. These text-to-image diffusion models have been used as a basis for various applications such as image editing, semantic segmentation, object detection, image restoration and zero-shot classification. Interpretability of Text-to-Image Models. To our knowledge, few works delve into the mechanisms of large text-to-image models like Stable-Diffusion. DAAM [221] interprets diffusion models by analyzing cross-attention maps between text tokens and images, emphasizing their semantic accuracy for interpretation. In contrast, our approach focuses on comprehending the inner workings of diffusion models by investigating the storage of visual knowledge related to different attributes. We explore various model layers beyond just the cross-attention layer. [23] leverage causal tracing to understand how knowledge is stored in text-to-image models such as Stable-Diffusion-v1. Editing Text-to-Image Models. Understanding knowledge storage in diffusion models has significant implications for model editing. The ability to modify a diffusion model's behavior without retraining from scratch was first explored in Concept-Ablation [126] and Concept-Erasure [78]. TIME [182] is another model editing method which translates between concepts by modifying the key and value matrices in cross-attention layers. However, the experiments in [182] do not specifically target removing or updating concepts such as those used in [78, 126]. We also acknowledge the concurrent works [79] and [10], which use a closed-form update on the cross-attention layers and the text-encoder, respectively, to ablate concepts. However, we note that our work focuses primarily on first understanding how knowledge is stored in text-to-image models and subsequently using this information to design a closed-form editing method for editing concepts.

2.4 Mechanistically Understanding and Editing Multimodal Language Models

Multimodal Large Language Models. We consider an MLLM to be a model that takes an image and text as input, and generates a text output [5]. Over the last year, such models have made tremendous advances in tasks like VQA and image captioning, including BLIP [139], BLIP-2 [138], Instruct-BLIP [53], LLaVA [148, 149], Flamingo [8] and multimodal Phi-2 (from the Bunny repo) [94]. These MLLMs can broadly be categorized into two families based on how their visual information is integrated into the language model: (i) by embedding the vision encoder's output into each layer of the language model with a cross-attention layer (e.g., Flamingo, BLIP) or, (ii) by mapping the vision encoder's output into "visual tokens" in the language model's input space (i.e., alongside the text tokens) via a projection layer (e.g., LLaVA, Bunny). Both families are widely used; however, the projection-layer family has recently shown stronger performance on popular benchmarks [94, 148, 149]. We, therefore, focus our study of information storage and transfer on this model family. Interpretability of MLLMs. A well-established arm of model interpretability examines the relationship between a model's performance and its internals. A range of recent works have studied the internal mechanisms of information storage [165, 167, 243] and transfer [83, 263] in LLMs. However, to the best of our knowledge, only a few works [217, 225] have studied the interpretability of MLLMs, with none specifically investigating the relationship between a model's outputs and its internal states. [217], for example, designs an interactive interface to visualize the attention maps in an MLLM, while [225] explores the shortcomings of the CLIP vision encoder in MLLMs. Neither considers the influence of both vision and text inputs on model internals or offers causal insights, as our work does. Our model editing approach, which targets the projection-layer MLLM family, is complemented by [46], who propose baselines for inserting information into the cross-attention-layer MLLM family.

2.5 Mechanistically Understanding and Unlocking Zero-Shot Capabilities in Vision Transformers

Several studies attempt to elucidate model predictions by analyzing either a subset of input examples through heatmaps [157, 205, 212, 219] or a subset of training examples [120, 183, 190].
Nevertheless, empirical evidence suggests that these approaches are often unreliable in real-world scenarios [22, 115]. These methods do not interpret model predictions in relation to the model’s internal mechanisms, which is essential for gaining a deeper understanding of the reliability of model outputs. Internal Mechanisms of Vision Models: Our work is closely related to the studies by [77] and [238], both of which analyze vanilla ViTs in terms of their components and interpret them using either CLIP text encoders or pretrained ImageNet heads. Like these studies, our research can be situated within the emerging field of representation engineering [271] and mechanistic interpretability [29, 32]. Other works [24, 86, 178] focus on interpreting individual neurons to understand vision models’ internal mechanisms. However, these methods often fail to break down the model’s output into its subcomponents, which is crucial for understanding model reliability. [206] examine the direct effect of model weights on output, but do not study the fine-grained role of these components in building the final image representation. [17] focus on expressing CNN representations as a sum of contributions from input regions via masking. Interpreting models using CLIP: Many recent works utilize CLIP [191] to interpret models via text. [170] align model representations to CLIP space with a linear layer, but it is limited to only the final representation and can not be applied to model components. [177] annotate individual neurons in CNNs via CLIP, but their method cannot be extended 15 easily to high-dimensional component vectors. Our method is related to model stitching in which one representation space is interpreted in terms of another by training a map between two spaces [18, 134]. 2.6 Mechanistic Circuits for Extractive Question-Answering Circuit Based Interpretability in Language Models. With the advent of language models, a lot of recent works have focused on a mechanistic understanding of language models [87, 144, 164, 166, 232]. One of the primary benefit of transformer based language models is that the final logit representation can be decomposed as a sum of individual model components [68]. Based on this decomposition, one can extract task-specific causal sub- graphs (i.e., circuits) of internal model components in language models. Early works have extracted such circuits for indirect-object identification [244], greater-than operation [91] and more recently for entity-tracking [188]. Recently, there has been an increasing focus on the practical aspects of mechanistic interpretability such as refusal mediation [11, 267] or safety in general [272]. Circuits can also be constructed as sub-graphs of neurons in the language model, but it often comes with increased complexity of interpretation [67]. In our paper, we focus on extracting circuits where the nodes are different architectural components such as attention-heads, layers or MLPs. Applications in Context-Augmented Question-Answering. With the advent of retrieval- augmented generation [81, 135] language models have been increasingly used for real-world Question-Answering (QA) tasks. One of the primary enhancement of context-augmented QA lies in the ability to provide reliable grounding (i.e., attribution) in the context for the generated answer [102, 113, 137, 257]. In the recent times, there have been a large set of works which improve LLM responses by reducing hallucinations and improving grounding in the input context [14, 252, 257, 265]. 
Our paper tests the ability of the mechanistic 16 insights from circuits towards performing these applications. 2.7 Improving Compositionality in Multimodal Models 2.7.1 Compositionality in CLIP While CLIP models [191] are renowned for their robust zero-shot classification, recent research [60, 223] has exposed their limitations in visio-linguistic reasoning. In contrast, recent studies have demonstrated that text-to-image models [41, 49, 125, 136] outperform CLIP in reasoning tasks. These models in fact leverage scores computed from the diffusion objective. We note that while [187] use score-distillation sampling for text to 3D generation, ours is the first work to adapt the formulation as a regularizer and improve compositional abilities in CLIP. 2.7.2 Compositionality in Text-to-Image Models Compositionality in text-to-image models refers to the ability of a model to accurately capture the correct compositions of objects, their corresponding attributes, and the rela- tionships between objects described in a given prompt. [103] introduced a benchmark designed to evaluate compositionality in text-to-image models, highlighting the limitations of models when handling compositional prompts. The benchmark employs disentangled BLIP-Visual Question Answering (VQA) as a metric for assessing image compositional quality. The VQA score assesses how accurately an image captures the compositional elements described in the prompt by utilizing a vision-language model. This metrics demon- strates a closer correlation with human judgment compared to metrics like CLIP-Score [97]. The authors also proposed a fine-tuning baseline to enhance compositionality in these models. Alternatively, compositionality issues can be addressed at inference by modifying 17 cross-attention maps using hand-crafted loss functions and bounding boxes derived from a language model [1, 39, 73, 143, 150, 175, 245]. However, [103] showed that data-driven fine-tuning is more effective for improving compositionality. 18 Chapter 3: Interpreting Test-Time Predictions With Influence Func- tions 3.1 Introduction Recently, there has been a rapid and significant success in applying machine learning methods to a wide range of applications including vision [220], natural language processing [204], medicine [158], finance [147], etc. In sensitive applications such as medicine, we would like to explain test-time model predictions to humans. An important question is : why the model makes a certain prediction for a particular test sample. One way to address this is to trace back model predictions to its training data. More specifically, one can ask which training samples were the most influential ones for a given test prediction. Influence functions [51] from robust statistics measure the dependency of optimal model parameters on training samples. Previously [119] used first-order approximations of influ- ence functions to estimate how much model parameters would change if a training point was up-weighted by an infinitesimal amount. Such an approximation can be used to identify most influential training samples in a test prediction. Moreover, this approximation is similar to the leave-one-out re-training, thus the first-order influence function proposed in [119] bypasses the expensive process of repeated re-training the model to find influential training samples in a test-time prediction. 19 In some applications, one may want to understand how model parameters would change when large groups of training samples are removed from the training set. 
This could be useful to identify groups of training data which drive the decision for a particular test prediction. As shown in [118], finding influential groups can be useful in real-world applications such as diagnosing batch effects [253], apportioning credit between different data sources [13], understanding effects of different demographic groups [42] or in a multi-party learning setting [92]. [118] approximates the group influence by sum of first-order individual influences over training samples in the considered group. However, removal of a large group from training can lead to a large perturbation to model parameters. Therefore, influence functions based on first-order approximations may not be accurate in this setup. Moreover, approximating the group influence by adding individual sample influences ignores possible cross correlations that may exist among samples in the group. In this paper, we relax the first-order approximations of current influence functions and study how second-order approximations can be used to capture model changes when a potentially large group of training samples is up-weighted. Considering a training set S and a group U ⊂S , existing first-order approximations of the group influence function [118] can be written as the sum of first-order influences of individual points. That is, I (1)(U ) = |U | ∑ i=1 I (1) i where I (1)(U ) is the first-order group influence function and I (1) i is the first-order influ- ence for the ith sample in U . On the other hand, our proposed second-order group influence function has the following form: I (2)(U ) = I (1)(U )+I ′ (U ) 20 where I ′ (U ) captures informative cross-dependencies among samples in the group and is a function of gradient vectors and the Hessian matrix evaluated at the optimal model parameters. We present a more precise statement of this result in Theorem 1. We note that the proposed second-order influence function can be computed efficiently even for large models. We discuss its computational complexity in Section 3.4. Our analysis shows that the proposed second-order influence function captures model changes efficiently even when the size of the groups are relatively large or the changes to the model parameters are significant as in the case of groups with similar properties. For example, in an MNIST classification problem using logistic regression, when 50% of the training samples are removed, the correlation between the ground truth estimate and second-order influence values improves by over 55% when compared to the existing first-order influence values. We note that higher-order influence functions have been used in statistics [108] for point and interval estimates of non-linear functionals in parameteric, semi-parametric and non-parametric models. However, to the best of our knowledge, this is the first time, higher-order influence functions are used for the interpretability task in the machine learning community. Similar to [119] and [118], our main results for the second-order influence functions hold for linear prediction models where the underlying optimization is convex. However, we also additionally explore effectiveness of both first-order and second-order group influence functions in the case of deep neural networks. We observe that none of the methods provide good estimates of the ground-truth influence across different groups 1. 
In summary, we make the following contributions:

• We propose second-order group influence functions that consider cross-dependencies among the samples in the considered group.
• Through several experiments on linear models, across different sizes and types of groups, we show that the second-order influence estimates have higher correlations with the ground truth than the first-order ones, especially when the changes to the underlying model are relatively large.
• We show that our proposed second-order group influence function can also be used to improve the selection of the most influential training group.

3.2 Background

We consider the classical supervised learning setup, where the task is to learn a function h (also called the hypothesis) mapping from an input space X to an output space Y. We denote an input-output pair by {x, y}. We assume that our learning algorithm is given training examples S := {z_i = (x_i, y_i)}, i = 1, ..., m, drawn i.i.d. from some unknown distribution P. Let Θ be the parameter space of the considered hypothesis class. The goal is to select model parameters θ to minimize the empirical risk:

\[
\min_{\theta \in \Theta} L_{\emptyset}(\theta) := \frac{1}{|S|} \sum_{z \in S} \ell(h_{\theta}(z)), \tag{3.1}
\]

where |S| = m denotes the cardinality of the training set, the subscript ∅ indicates that the whole set S is used in training, and ℓ is the associated loss function. We refer to the optimal parameters computed by the above optimization as θ*. Let ∇_θ L_∅(θ) and H_{θ*} = ∇²_θ L_∅(θ*) denote the gradient and the Hessian of the loss function, respectively.

First, we discuss the case where we want to compute the effect of an individual training sample z on the optimal model parameters as well as on the test predictions made by the model. The effect, or influence, of a training sample on the model parameters can be characterized by removing that particular training sample and retraining the model:

\[
\theta^{*}_{\{z\}} = \arg\min_{\theta \in \Theta} L_{\{z\}}(\theta) = \frac{1}{|S| - 1} \sum_{z_i \neq z} \ell(h_{\theta}(z_i)) \tag{3.2}
\]

We can then compute the change in model parameters due to the removal of a training point z as Δθ = θ*_{\{z\}} − θ*. However, re-training the model for every such training sample is expensive when |S| is large. Influence functions based on first-order approximations, introduced by [50, 51], were used by [119] to approximate this change. Up-weighting a training point z by an infinitesimal amount ε leads to new optimal model parameters, θ^ε_{\{z\}}, obtained by solving the following optimization problem:

\[
\theta^{\epsilon}_{\{z\}} = \arg\min_{\theta \in \Theta} \frac{1}{|S|} \sum_{z_i \in S} \ell(h_{\theta}(z_i)) + \epsilon\, \ell(h_{\theta}(z)) \tag{3.3}
\]

Removing a point z is equivalent to up-weighting it by ε = −1/|S|. The main idea used by [119] is to approximate θ*_{\{z\}} by minimizing a first-order Taylor series approximation around θ*. Following the classical result of [51], the change in the model parameters θ* upon up-weighting z can be approximated by the influence function [119], denoted by I:

\[
\mathcal{I}(z) = \frac{d\theta^{\epsilon}_{\{z\}}}{d\epsilon}\Big|_{\epsilon=0} = -H_{\theta^{*}}^{-1} \nabla_{\theta}\ell(h_{\theta^{*}}(z)) \tag{3.4}
\]

A detailed proof can be found in [119]. Using this formulation, we can track the change with respect to any function of θ*.
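As a concrete illustration of Equation (3.4), the following is a minimal sketch of the parameter-change influence for a binary logistic regression model, where the Hessian of the empirical risk can be formed in closed form. The helper names are illustrative, and the small weight-decay term lam is assumed here purely to keep the Hessian well-conditioned; it is not part of the formulation above.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def parameter_influence(theta_star, X, y, z_x, z_y, lam=1e-3):
    """I(z) = -H^{-1} grad_theta l(h_{theta*}(z)) as in Eq. (3.4), for binary logistic regression.
    X, y: training set; (z_x, z_y): the up-weighted training point; lam: assumed damping term."""
    n, d = X.shape
    p = sigmoid(X @ theta_star)                              # predicted probabilities on the training set
    H = (X.T * (p * (1.0 - p))) @ X / n + lam * np.eye(d)    # Hessian of the (damped) empirical risk
    g_z = (sigmoid(z_x @ theta_star) - z_y) * z_x            # gradient of the loss at the point z
    return -np.linalg.solve(H, g_z)                          # change in parameters per unit of up-weighting
```

Since removing z corresponds to ε = −1/|S|, the predicted parameter change from removing z under this sketch is approximately −parameter_influence(...)/|S|.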
The change in the test loss at a particular test point z_t when a training point z is up-weighted can be approximated in closed form:

\[
\mathcal{I}(z, z_t) = -\nabla_{\theta}\ell(h_{\theta^{*}}(z_t))^{T} H_{\theta^{*}}^{-1} \nabla_{\theta}\ell(h_{\theta^{*}}(z)) \tag{3.5}
\]

This result is based on the assumption [119] that the loss function L(θ) is strictly convex in the model parameters θ and that the Hessian H_{θ*} is therefore positive-definite. The approximation is very similar to forming a quadratic approximation around the optimal parameters θ* and taking a single Newton step. However, explicitly computing H_{θ*} and its inverse H_{θ*}^{-1} is not required: using Hessian-vector products [185], influence functions can be computed efficiently.

3.3 Group Influence Function

Our goal in this section is to understand how the model parameters would change if a particular group of samples were up-weighted in the training set. However, up-weighting a group can lead to large perturbations of the training data distribution, and therefore of the model parameters, which violates the small-perturbation assumption underlying first-order influence functions. In this section, we extend influence functions using second-order approximations to better capture changes in model parameters due to up-weighting a group of training samples. The empirical risk minimization (ERM) when we remove a set U of samples from training can be written as:

\[
L_{U}(\theta) = \frac{1}{|S| - |U|} \sum_{z \in S \setminus U} \ell(h_{\theta}(z)) \tag{3.6}
\]

To approximate how the optimal solution of this optimization relates to θ*, we study the effect of up-weighting a group of training samples on the model parameters. Note that in this case the updated weights should still form a valid distribution: if a group of training samples is up-weighted, the remaining samples should be down-weighted to preserve the sum-to-one constraint on the weights in the ERM formulation. In the individual influence function case (when the size of the group is one), up-weighting a sample by ε leads to down-weighting the other samples by ε/(m−1), whose effect can be neglected, as in the formulation of [119]. In our formulation of the group influence function, we assume that the weights of all samples in the set U are up-weighted by ε, and we use p = |U|/|S| to denote the fraction of up-weighted training samples. This leads to a down-weighting of the rest of the training samples by ε̃ = (|U|/(|S|−|U|)) ε, which preserves the empirical weight distribution of the training data. This is also important for a fair comparison with the ground-truth leave-out-retraining estimates. Therefore, the resulting ERM can be written as:

\[
\theta^{\epsilon}_{U} = \arg\min_{\theta} L^{\epsilon}_{U}(\theta), \qquad
L^{\epsilon}_{U}(\theta) = \frac{1}{|S|} \Big( \sum_{z \in S \setminus U} (1 - \tilde{\epsilon})\, \ell(h_{\theta}(z)) + \sum_{z \in U} (1 + \epsilon)\, \ell(h_{\theta}(z)) \Big). \tag{3.7}
\]

In this formulation, setting ε = 0 recovers the original loss function L_∅(θ) (where none of the training samples are removed), and setting ε = −1 recovers the loss function L_U(θ) (where the samples in U are removed from training). Let θ^ε_U denote the optimal parameters for the minimization of L^ε_U. Essentially, we are concerned with the change in the model parameters, Δθ = θ^ε_U − θ*, when each training sample in a group of size |U| is up-weighted by a factor of ε. The key step of the derivation is to expand θ^ε_U around θ* (the minimizer of L^0_U(θ), i.e., of L_∅(θ)) with respect to the order of ε, the up-weighting parameter. To do so, we use perturbation theory [15] to expand θ^ε_U around θ*.
Perturbation theory is frequently used in quantum mechanics, as well as in other areas of physics such as particle physics, condensed matter, and atomic physics; it finds an approximate solution to a problem (here, θ^ε_U) by starting from the exact solution of a closely related, simpler problem (here, θ*). As ε gets smaller, the higher-order terms of the expansion become less significant. However, for large model perturbations (such as in the case of group influence functions), using higher-order terms can reduce approximation errors significantly. The following perturbation series forms the core of our derivation of second-order influence functions:

\[
\theta^{\epsilon}_{U} - \theta^{*} = O(\epsilon)\,\theta^{(1)} + O(\epsilon^{2})\,\theta^{(2)} + O(\epsilon^{3})\,\theta^{(3)} + \cdots \tag{3.8}
\]

where θ^(1) characterizes the first-order (in ε) perturbation vector of the model parameters and θ^(2) the second-order (in ε) perturbation vector. We absorb the dependence of these perturbation vectors on constants (such as |U|) into the O(·) notation. When computing the influence of individual points, as shown by [119], the scaling of θ^(1) is on the order of 1/|S| while the scaling of the second-order coefficient is on the order of 1/|S|², which is very small when S is large; in that case the second-order term can be ignored. When computing the group influence, however, the second-order coefficient is on the order of |U|²/|S|², which can be large when the size of U is large. Thus, in our definition of the group influence function, both θ^(1) and θ^(2) are taken into account. The first-order group influence function (denoted by I^(1)), when all the samples in a group U are up-weighted by ε, is defined as:

\[
\mathcal{I}^{(1)}(U) = \frac{\partial \theta^{\epsilon}_{U}}{\partial \epsilon}\Big|_{\epsilon=0}
= \frac{\partial \big(\theta^{*} + O(\epsilon)\theta^{(1)} + O(\epsilon^{2})\theta^{(2)}\big)}{\partial \epsilon}\Big|_{\epsilon=0} = \theta^{(1)}
\]

To capture the dependence of the group influence function on the O(ε²) terms, we define I′ as follows:

\[
\mathcal{I}'(U) = \frac{\partial^{2} \theta^{\epsilon}_{U}}{\partial \epsilon^{2}}\Big|_{\epsilon=0}
= \frac{\partial^{2} \big(\theta^{*} + O(\epsilon)\theta^{(1)} + O(\epsilon^{2})\theta^{(2)}\big)}{\partial \epsilon^{2}}\Big|_{\epsilon=0} = \theta^{(2)}
\]

Although one could consider even higher-order terms, in this paper we restrict our derivations to second-order approximations of the group influence function. We now state our main result in the following theorem.

Theorem 1. If the third derivative of the loss function at θ* is sufficiently small, the second-order group influence function (denoted by I^(2)(U)) when all samples in a group U are up-weighted by ε is:

\[
\mathcal{I}^{(2)}(U) = \mathcal{I}^{(1)}(U) + \mathcal{I}'(U) \tag{3.9}
\]

where

\[
\mathcal{I}^{(1)}(U) = -\frac{1}{1-p}\,\frac{1}{|S|}\, H_{\theta^{*}}^{-1} \sum_{z \in U} \nabla \ell(h_{\theta^{*}}(z))
\]

and

\[
\mathcal{I}'(U) = \frac{p}{1-p} \Big( I - \big(\nabla^{2} L_{\emptyset}(\theta^{*})\big)^{-1} \frac{1}{|U|} \sum_{z \in U} \nabla^{2} \ell(h_{\theta^{*}}(z)) \Big) \theta^{(1)}
\]

This result is based on the assumption that the third-order derivatives of the loss function at θ* are small. For the quadratic loss, the third-order derivatives are exactly zero, and our experiments with the cross-entropy loss indicate that this assumption approximately holds for the classification problem as well. Below, we present a concise sketch of this result.

Figure 3.1: Comparison of first-order and second-order group influences on a synthetic dataset with 10,000 samples, using logistic regression, for a misclassified test point. Across different sizes of randomly selected groups, the second-order influence values are more correlated with the ground truth than the first-order ones. The green line highlights the y = x line.

Proof Sketch. We now derive θ^(1) and θ^(2), which are used in the second-order group influence function I^(2)(U).
Since θ^ε_U is the optimal parameter set for the interpolated loss function L^ε_U(θ), the first-order stationarity condition gives:

\[
0 = \nabla L^{\epsilon}_{U}(\theta^{\epsilon}_{U})
= \nabla L_{\emptyset}(\theta^{\epsilon}_{U})
+ \frac{1}{|S|} \Big( -\tilde{\epsilon} \sum_{z \in S \setminus U} \nabla \ell(h_{\theta^{\epsilon}_{U}}(z))
+ \epsilon \sum_{z \in U} \nabla \ell(h_{\theta^{\epsilon}_{U}}(z)) \Big) \tag{3.10}
\]

The main idea is to use a Taylor series to expand ∇L_∅(θ^ε_U) around θ*, together with the perturbation series defined in Equation (3.8), and to compare terms of the same order in ε:

\[
\nabla L_{\emptyset}(\theta^{\epsilon}_{U}) = \nabla L_{\emptyset}(\theta^{*}) + \nabla^{2} L_{\emptyset}(\theta^{*}) (\theta^{\epsilon}_{U} - \theta^{*}) + \cdots \tag{3.11}
\]

Similarly, we expand ∇ℓ(h_{θ^ε_U}(z)) around θ* using a Taylor series. To derive θ^(1) we compare the terms with coefficient O(ε) in Equation (3.10), and for θ^(2) we compare the terms with coefficient O(ε²). Based on this, θ^(1) can be written as:

\[
\theta^{(1)} = -\frac{1}{1-p}\,\frac{1}{|S|}\, H_{\theta^{*}}^{-1} \sum_{z \in U} \nabla \ell(h_{\theta^{*}}(z)) \tag{3.12}
\]

In more detail, expanding Equation (3.10) and comparing the terms with coefficient O(ε):

\[
\begin{aligned}
\epsilon\, \nabla^{2} L_{\emptyset}(\theta^{*})\, \theta^{(1)}
&= \frac{1}{|S|} \Big( \tilde{\epsilon} \sum_{z \in S \setminus U} \nabla \ell(h_{\theta^{*}}(z)) - \epsilon \sum_{z \in U} \nabla \ell(h_{\theta^{*}}(z)) \Big) \\
&= \tilde{\epsilon}\, \nabla L_{\emptyset}(\theta^{*}) - \frac{1}{|S|} (\tilde{\epsilon} + \epsilon) \sum_{z \in U} \nabla \ell(h_{\theta^{*}}(z)) \\
&= -\frac{1}{|S|} (\tilde{\epsilon} + \epsilon) \sum_{z \in U} \nabla \ell(h_{\theta^{*}}(z)) \\
&= -\frac{1}{|S|}\, \frac{\epsilon}{1-p} \sum_{z \in U} \nabla \ell(h_{\theta^{*}}(z))
\end{aligned} \tag{3.13}
\]

θ^(1) is the first-order approximation of the group influence function and is denoted by I^(1). Note that our first-order approximation I^(1) differs slightly from that of [118], with an additional factor of 1 − p in the denominator. For θ^(2), we compare the terms with coefficients of order O(ε²) in Equation (3.10):

\[
\epsilon^{2}\, \nabla^{2} L_{\emptyset}(\theta^{*})\, \theta^{(2)}
+ \frac{1}{2} L'''_{\emptyset}(\theta^{*}) \big[ \epsilon \theta^{(1)}, \epsilon \theta^{(1)}, I \big]
+ \frac{1}{|S|} \Big( -\tilde{\epsilon} \sum_{z \in S \setminus U} \nabla^{2} \ell(h_{\theta^{*}}(z))
+ \epsilon \sum_{z \in U} \nabla^{2} \ell(h_{\theta^{*}}(z)) \Big) \epsilon\, \theta^{(1)} = 0 \tag{3.14}
\]

For the θ^(2) term, we ignore the third-order term ½ L'''_∅(θ*)[εθ^(1), εθ^(1), I], since it is small. Substituting the value of ε̃ and equating the terms with coefficient of order O(ε²):

\[
\nabla^{2} L_{\emptyset}(\theta^{*})\, \theta^{(2)}
= \frac{|U|}{|S| - |U|} \Big( \frac{1}{|S|} \sum_{z \in S} \nabla^{2} \ell(h_{\theta^{*}}(z))
- \frac{1}{|U|} \sum_{z \in U} \nabla^{2} \ell(h_{\theta^{*}}(z)) \Big) \theta^{(1)} \tag{3.15}
\]

Rearranging Equation (3.15) yields the same identity as I′ in Theorem 1.

It can be observed that the additional term I′ in our second-order approximation captures cross-dependencies among the samples in U through a function of the gradients and Hessians of the loss at the optimal model parameters. This makes the second-order group influence function more informative when the training samples are correlated. In Section 3.5, we empirically show that adding I′ also improves the correlation with the ground-truth influence. To track the change in the test loss at a particular test point z_t when a group U is removed, we use the chain rule to compute the influence score as:

\[
\mathcal{I}^{(2)}(U, z_t) = \nabla \ell(h_{\theta^{*}}(z_t))^{T} \big( \mathcal{I}^{(1)}(U) + \mathcal{I}'(U) \big) \tag{3.16}
\]

Our second-order approximation of the group influence function consists of a first-order term similar to the one proposed in [118], with an additional scaling factor of 1/(1−p). This scaling arises because our formulation preserves the empirical weight-distribution constraint in ERM, which is essential when a large group is up-weighted. The second-order influence function has an additional term I′ that is directly proportional to p and captures large perturbations to the model parameters more effectively.

Figure 3.2: Group size vs. correlation with the ground truth on MNIST for logistic regression, with random groups (left panel) and coherent groups (right panel).

3.4 Computational Complexity

For models with a relatively large number of parameters, computing the inverse of the Hessian, H_{θ*}^{-1}, can be expensive, on the order of O(n³).
However, computing a Hessian-vector product [185] is relatively inexpensive. In our experiments, similar to [40, 118, 119], we used conjugate gradients (a second-order optimization technique) [209] to compute the inverse-Hessian-vector product; this routine relies only on Hessian-vector products and thus avoids the expense of inverting the Hessian directly. The proposed second-order group influence function can be computed similarly to the first-order group influence function, with only one additional Hessian-vector product.

3.5 Experiments

3.5.1 Setup

Our goal in these experiments is to determine whether the second-order approximation of group influence functions improves the correlation with the ground-truth estimates across different settings. We compare the computed second-order group influence score with the ground-truth influence, which is computed by leave-k-out retraining for a group of size k. Our evaluation metric is the Pearson correlation, which measures how linearly related the computed influence and the actual ground-truth estimate are. We perform our experiments primarily on logistic regression, where the group influence function is well-defined. Additionally, we check the accuracy of first-order and second-order group influence functions in the case of neural networks.

3.5.2 Datasets

To evaluate the accuracy of both first-order and second-order group influence functions on linear models, we use two datasets. In our first set of experiments, we use a synthetic dataset with logistic regression. The synthetic dataset has 10,000 points drawn from a Gaussian distribution, with 5 features and 2 classes; details of the synthetic data can be found in the Appendix. The second set of experiments uses the standard handwritten digits database MNIST [131], which consists of 10 classes of digits. To study how group influence functions behave in the case of neural networks, we also use the MNIST dataset. For each of the two datasets, we pick random groups as well as coherent groups, as in [118], with sizes ranging from 1.6% to 60% of the training points. The computed group influence was primarily investigated for a test point that was misclassified by the model. A detailed description of how the groups were selected in our experiments is given in the Appendix. For the optimal group selection, we used a synthetic dataset of 20,000 training points with 5 features, in the form of 4 isotropic Gaussian blobs.

3.5.3 Observations and Analysis

Linear Models: For logistic regression, the general observation for randomly selected groups was that the second-order group influence function improves the correlation with the ground-truth estimates across different group sizes, on both the synthetic dataset and MNIST. For the synthetic dataset, Figure (3.1) shows that the approximation provided by the second-order group influence function is fairly close to the ground truth even when a large fraction of the training data (60%) is removed. For such large group sizes, the first-order approximation of the group influence function is relatively inaccurate and far from the ground-truth influence, which is consistent with the small-perturbation assumption of first-order influence functions. For smaller group sizes, although the second-order approximation improves over the existing first-order group influence function, the gain in correlation is small.
On MNIST, the observation was similar: the gain in correlation was significant when the size of the considered group was large. For example, Figure (3.2) shows that when more than 36% of the samples were removed, the gain in correlation is almost always more than 40%. While the improvement in correlation for larger group sizes is consistent with our theory that the second-order approximation is effective in the case of large changes to the model, the gain in correlation is non-monotonic with respect to group size. For small groups selected uniformly at random, the model parameters do not change significantly, and the second-order approximation improves only marginally over the existing first-order approximation. However, when a coherent group (a group of training examples from the same class) is removed, even at a relatively small size, the perturbation to the model is larger than when a random group is removed, because the model parameters can change significantly in a particular direction. In such settings, we observe that even for small group sizes the second-order approximation consistently and significantly improves the correlation with the ground truth (Figure (3.2)). For coherent groups, across different group sizes of the MNIST dataset, we observed an improvement in correlation when the second-order approximation was used, with a gain of at least 15% across different group sizes. These observations (shown in Figure (3.2)) reinforce our theory that second-order (and, more generally, higher-order) approximations of influence functions are particularly effective when the perturbation, or the change in the model parameters, is large. The second-order approximation of the influence function could thus be used instead of existing first-order approximations for practical purposes such as understanding the effect of training groups with similar properties (e.g., demographic groups) on model predictions, without the need to actually retrain the model.

Neural Networks: In the case of neural networks, the Hessian is not positive semi-definite in general, which violates the assumptions of influence functions. Previously, [119] regularized the Hessian in the form H_{θ*} + λI and showed that, for the top few influential training points (not groups) and a given test point, the correlation with the ground-truth influence is still satisfactory, if not highly significant. However, how influence functions behave in the case of groups is a topic that has not yet been well explored. For MNIST, we used a regularized Hessian with λ = 0.01 and conducted experiments, for both first-order and second-order group influence functions, on a relatively simple feed-forward network with two hidden layers and sigmoid activations. The general observation was that both the existing first-order and the proposed second-order group influence functions underestimate the ground-truth influence values across different group sizes, leading to a non-significant correlation; the corresponding figure can be found in the Appendix. However, we observed that while the second-order influence values still suffer from this underestimation issue, they improve the correlation marginally across different group sizes. This observation was consistent for both random and coherent group selections.
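For reference, the first- and second-order scores compared throughout these experiments can be computed in closed form for binary logistic regression. The sketch below follows Theorem 1 and Equation (3.16); the function names are illustrative, and the small damping term lam added to the empirical-risk Hessian is an assumption of this sketch (to keep the Hessian invertible) rather than part of the formulation.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def per_sample_grads(theta, X, y):
    """Per-sample gradients of the logistic loss: row i is (p_i - y_i) * x_i."""
    return (sigmoid(X @ theta) - y)[:, None] * X

def avg_hessian(theta, X, lam=0.0):
    """Average per-sample Hessian of the logistic loss over the rows of X (plus optional damping)."""
    p = sigmoid(X @ theta)
    return (X.T * (p * (1.0 - p))) @ X / X.shape[0] + lam * np.eye(X.shape[1])

def group_influence_scores(theta_star, X, y, group_idx, x_t, y_t, lam=1e-3):
    """First- and second-order group influence of the samples in `group_idx`
    on the test loss at (x_t, y_t), following Theorem 1 and Eq. (3.16)."""
    S = X.shape[0]
    frac = len(group_idx) / S                                 # p = |U| / |S|
    H = avg_hessian(theta_star, X, lam)                       # Hessian of the (damped) empirical risk
    g_U = per_sample_grads(theta_star, X[group_idx], y[group_idx]).sum(axis=0)
    theta1 = -np.linalg.solve(H, g_U) / ((1.0 - frac) * S)    # I^(1)(U), with the 1/(1-p) correction
    H_U = avg_hessian(theta_star, X[group_idx])               # average per-sample Hessian over the group
    # I'(U): second-order correction capturing cross-dependencies within the group
    second = (frac / (1.0 - frac)) * (theta1 - np.linalg.solve(H, H_U @ theta1))
    g_t = (sigmoid(x_t @ theta_star) - y_t) * x_t             # gradient of the test loss
    return g_t @ theta1, g_t @ (theta1 + second)              # first-order and second-order scores
```

The ground-truth group influence against which these scores are correlated is obtained separately, by re-training the model without the group and measuring the resulting change in the test loss.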
3.6 Conclusion for Second-Order Group Influence Functions

In this paper, we proposed second-order group influence functions for approximating model changes when a group is removed from the training set. Empirically, for linear models, across different group sizes and types, we showed that the second-order influence has a higher correlation with the ground-truth values than the first-order influence and is more effective than existing first-order approximations. We observed that the second-order influence is significantly more informative when the changes to the underlying model are relatively large. We also showed that the proposed second-order group influence function can be used in conjunction with optimization techniques to select the most influential group in the training set for a particular test prediction. For non-linear models such as deep neural networks, we observed that both first-order and second-order influence functions lead to a non-significant correlation with the ground truth across different group sizes (although the correlation values for the second-order method were marginally better). Developing accurate group influence functions for neural networks, training neural networks to have improved influence functions, and extending group influence functions to the transfer learning setting as in [40] are directions for future work.

3.7 Influence Functions in Deep Learning

In machine learning, influence functions [51] can be used to estimate the change in model parameters when the empirical weight distribution of the training samples is perturbed infinitesimally. This approximation is cheaper to compute than the expensive process of repeatedly re-training the model to obtain the exact parameter changes. Influence functions can thus be used to understand the effect of removing an individual training point (or a group of training samples) on the model's predictions at test time. Leveraging a first-order Taylor approximation of the loss function, [119] showed that a (first-order) influence function, computed using the gradient and the Hessian of the loss function, can be used to interpret machine learning models, fix mislabelled training samples, and create data-poisoning attacks. Influence functions are, in general, well-defined and well-studied for models such as logistic regression [119], where the underlying loss function is convex. For convex loss functions, influence functions remain accurate even when the model perturbations are fairly large (e.g., in the group influence case [118]). However, when the convexity assumption on the underlying loss function is violated, as is the case in deep learning, the behaviour of influence functions is not well understood and is still an open area of research. With recent advances in computer vision [220], natural language processing [204], and high-stakes applications such as medicine [158], it has become particularly important to interpret deep model predictions. This makes it critical to understand influence functions in the context of deep learning, which is the main focus of our paper.

Despite this non-convexity, it is sometimes believed that influence functions should work for deep networks. The excellent work of [119] successfully demonstrated one example of influence estimation for a deep network: a small (2,600-parameter), "all-convolutional" network. To the best of our knowledge, this is one of the few cases for deep networks where influence estimation has been shown to work.
A question of key importance to practitioners then arises: for what other classes of deep networks does influence estimation work? In this work, we provide a comprehensive study of this question and find a pessimistic answer: influence estimation is quite fragile for a variety of deep networks. In the case of deep networks, several factors can affect influence estimates: (i) due to the non-convexity of the loss function, different initializations of the perturbed model can lead to significantly different model parameters (with approximately similar loss values); (ii) even if the initialization of the model is fixed, the curvature values of the network (i.e., the eigenvalues of the Hessian matrix) at the optimal model parameters might be very large for very deep networks, leading to a significant Taylor approximation error of the loss function and thus to poor influence estimates; (iii) for large neural networks, computing the exact inverse-Hessian-vector product required for influence estimates can be computationally very expensive, so one must resort to approximate inverse-Hessian-vector product techniques, which can be erroneous and result in low-quality influence estimates; and (iv) different architectures can have different loss-landscape geometries near the optimal model parameters, leading to varying influence estimates.

In this paper, we study the aforementioned issues of using influence functions in deep learning through an extensive experimental study on progressively more complex models and datasets. We first start our analysis with a case study of a small neural network on the Iris dataset, where the exact Hessian matrix can be computed. We then progressively increase the complexity of the network and analyse a CNN architecture (depth 6) trained on 10% of the MNIST dataset, similar to [119]. Next, we evaluate the accuracy of influence estimates for more complex deep architectures (e.g., ResNets) trained on MNIST and CIFAR-10. Finally, we compute influence estimates on the ImageNet dataset using ResNet-50. We make the following observations through our analysis:

• We find that network depth and width have a strong impact on influence estimates. In particular, we show that influence estimates are fairly accurate when the network is shallow, while for deeper models they are often erroneous. We attribute this partially to the increasing curvature values of the network as depth increases.
• We observe that weight-decay regularization is important for obtaining high-quality influence estimates for certain architectures and datasets.
• We show that inverse-Hessian-vector product approximation techniques such as stochastic estimation [4] are erroneous, especially when the network is deep. This can contribute to the low quality of influence estimates in deep models.
• We observe that the choice of test point has a significant impact on the quality of influence estimates, across different datasets and architectures.
• On very large-scale datasets such as ImageNet, we find that even ground-truth influence estimates (obtained by leave-one-out re-training) can be inaccurate and noisy, partially due to the model's training and convergence.

These results highlight the sensitivity of current influence functions in deep learning and call for the development of robust influence estimators for use in large-scale machine learning applications.
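Because the stochastic estimation referenced in the third observation above recurs throughout our experiments, we include a minimal sketch of it for reference. It follows a LiSSA-style recursion in the spirit of the stochastic estimator of [4]; the damping and scaling constants, the helper names, and the use of a mini-batch loader are illustrative assumptions rather than the exact implementation evaluated in this chapter.

```python
import torch

def hvp(loss, params, vec):
    """Hessian-vector product of `loss` with respect to `params`, applied to the list of tensors `vec`."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    dot = sum((g * v).sum() for g, v in zip(grads, vec))
    return torch.autograd.grad(dot, params)

def stochastic_inverse_hvp(model, loss_fn, loader, vec, damping=0.01, scale=25.0, steps=100):
    """Stochastic estimate of H^{-1} v: repeat h <- v + (1 - damping) * h - (H_batch h) / scale
    over mini-batch Hessians, then return h / scale. Constants need per-model tuning."""
    params = [p for p in model.parameters() if p.requires_grad]
    h = [v.clone() for v in vec]                      # running (scaled) estimate of H^{-1} v
    data_iter = iter(loader)
    for _ in range(steps):
        try:
            x, y = next(data_iter)
        except StopIteration:                         # recycle the loader if it is exhausted
            data_iter = iter(loader)
            x, y = next(data_iter)
        batch_loss = loss_fn(model(x), y)
        Hh = hvp(batch_loss, params, h)               # mini-batch Hessian-vector product
        h = [v + (1.0 - damping) * h_i - Hh_i / scale
             for v, h_i, Hh_i in zip(vec, h, Hh)]
    return [h_i / scale for h_i in h]
```

For this recursion to behave well, the scale has to be large relative to the dominant Hessian eigenvalues; since those eigenvalues grow with network depth, as we show later (Fig. 3.4-(b)), this is one plausible source of the degradation observed for deeper networks.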
3.8 Basics of Influence Functions

Consider h to be a function, parameterized by θ, that maps from an input feature space X to an output space Y. The training samples are denoted by the set S = {z_i = (x_i, y_i)}, i = 1, ..., n, and the loss for a particular training example z is denoted by ℓ(h_θ(z)).

Figure 3.3: Iris dataset experimental results: (a, b) comparison of the norm of parameter changes computed with the influence function vs. re-training, for a network trained (a) with weight decay and (b) without weight decay; (c) Spearman correlation vs. network depth; (d) Spearman correlation vs. network width.

Standard empirical risk minimization solves the following optimization problem:

\[
\theta^{*} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \ell(h_{\theta}(z_i)). \tag{3.17}
\]

Up-weighting a training example z by an infinitesimal amount ε leads to a new set of model parameters, denoted θ^ε_{\{z\}}, obtained by solving:

\[
\theta^{\epsilon}_{\{z\}} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \ell(h_{\theta}(z_i)) + \epsilon\, \ell(h_{\theta}(z)). \tag{3.18}
\]

Removing a training point z is equivalent to up-weighting it by ε = −1/n in Equation (3.18). The main idea used by [119] is to approximate θ^ε_{\{z\}} by a first-order Taylor series expansion around the optimal model parameters θ*, which leads to:

\[
\theta^{\epsilon}_{\{z\}} \approx \theta^{*} - \epsilon H_{\theta^{*}}^{-1} \nabla_{\theta} \ell(h_{\theta^{*}}(z)), \tag{3.19}
\]

where H_{θ*} is the Hessian with respect to the model parameters at θ*. Following the classical result of [51], the change in the model parameters (Δθ = θ^ε_{\{z\}} − θ*) upon up-weighting the training example z can be approximated by the influence function I(z) as follows:

\[
\mathcal{I}(z) = \frac{d\theta^{\epsilon}_{\{z\}}}{d\epsilon}\Big|_{\epsilon=0} = -H_{\theta^{*}}^{-1} \nabla_{\theta} \ell(h_{\theta^{*}}(z)). \tag{3.20}
\]

The change in the loss at a particular test point z_t when a training point z is up-weighted can be approximated in closed form by the chain rule [119]:

\[
\mathcal{I}(z, z_t) = -\nabla \ell(h_{\theta^{*}}(z_t))^{T} H_{\theta^{*}}^{-1} \nabla \ell(h_{\theta^{*}}(z)). \tag{3.21}
\]

I(z, z_t)/n is approximately the change in the loss on the test sample z_t when the training sample z is removed from the training set. This result, however, rests on the assumption that the underlying loss function is strictly convex in the model parameters θ and that the Hessian H_{θ*} is a positive-definite matrix [119]. For large models, inverting the exact Hessian H_{θ*} is expensive. In such cases, the inverse-Hessian-vector product can be computed efficiently with a combination of Hessian-vector products [185] and optimization techniques (see the Appendix for details).

3.9 What Can Go Wrong for Influence Functions in Deep Learning?

First-order influence functions [119] assume that the underlying loss function is convex and that the change in model parameters is small when the empirical weight distribution of the training data is infinitesimally perturbed. In essence, this requires the Taylor's gap in Equation (3.19) to be small for an accurate influence estimate. In the case of non-convex loss functions, however, this assumption is not generally true. Empirically, we find that the Taylor's gap is strongly affected by common hyper-parameters of deep networks. For example, in Fig. (3.3)-(a, b), we find that for networks trained without weight-decay regularization on Iris, the Taylor's gap is large, resulting in low-quality influence estimates. In a similar vein, when the network depth and width are significantly large (i.e., in the over-parameterized regime), the Taylor's gap increases and substantially degrades the quality of influence estimates (Fig. (3.4)).
Empirically, this increase in the Taylor's gap strongly correlates with the curvature values of the loss function evaluated at the optimal model parameters, as observed in Fig. (3.4)-(b). Further complications arise for larger models, where influence estimation requires an additional approximation to compute the inverse-Hessian-vector product. Nonetheless, we observe in Fig. (3.4)-(a) that, on Iris, this approximation has only a marginal impact on the influence estimates. These results show that the network architecture, the hyper-parameters, and the loss curvature are significant factors for accurate influence estimation. In the next section, we discuss these issues in detail through controlled experiments on datasets and models of increasing complexity.

3.10 Experiments

Datasets: We first study the behaviour of influence functions on the small Iris dataset [9], where the exact Hessian can be computed. We then progressively increase the complexity of the models and datasets: we use small MNIST [119] to evaluate the accuracy of influence functions for a small CNN architecture with a depth of 6. Next, we study influence functions for modern deep architectures trained on the standard MNIST [131] and CIFAR-10 [124] datasets. Finally, to understand how influence functions scale to large datasets, we compute influence estimates on ImageNet [55].

Evaluation Metrics: We evaluate the accuracy of influence estimates at a given test point z_t using both the Pearson [116] and the Spearman rank-order correlation with the ground truth (obtained by re-training the model) across a set of training points. Most existing interpretability methods require influential examples to be ranked in the correct order of their importance [84]; therefore, to evaluate the accuracy of influence estimates, the Spearman correlation is often the better choice.

3.10.1 Understanding Influence Functions when the Exact Hessian Can Be Computed

Setup: Computing influence estimates with the exact Hessian has two advantages for our study: (a) it bypasses the inverse-Hessian-vector product approximation techniques that introduce errors into influence estimates, so we can compare influence estimates computed with exact vs. approximate inverse-Hessian-vector products and quantify this type of error; and (b) the deviation of the parameters computed with the influence function from the exact parameters can be computed exactly, which further quantifies the error incurred by (first-order) influence estimates in the non-convex setting. However, computing the exact Hessian matrix and its inverse is only feasible for models with a small number of parameters. We therefore use the Iris dataset along with a small feed-forward neural network to analyse the behaviour of influence functions computed with the exact Hessian in a non-convex setting. We train models to convergence for 60k iterations with full-batch gradient descent. To obtain the ground-truth estimates, we re-train the models for 7.5k steps, starting from the optimal model parameters. For our analysis, we choose the test point with the maximum loss and evaluate the accuracy of influence estimates against the ground truth over the top 16.6% of the training points. Through our experiments with the exact Hessian, we answer several questions about how properties of the network, such as depth, width, and regularizers (e.g., weight decay), affect influence estimates.
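As a concrete reference for this setup, the sketch below computes influence estimates with the exact (damped) Hessian and their Spearman rank correlation with ground-truth estimates obtained by re-training. It is a minimal PyTorch sketch under stated assumptions: the model is small enough for the exact Hessian to be materialized, and retrain_fn is a hypothetical user-supplied helper that re-trains the model without a given training point and returns the resulting change in the test loss.

```python
import torch
from scipy.stats import spearmanr

def flat_grad(loss, params, create_graph=False):
    """Gradient of `loss` with respect to `params`, flattened into a single vector."""
    grads = torch.autograd.grad(loss, params, create_graph=create_graph)
    return torch.cat([g.reshape(-1) for g in grads])

def exact_hessian(loss, params):
    """Exact Hessian of `loss` with respect to the flattened parameters (feasible only for tiny models)."""
    g = flat_grad(loss, params, create_graph=True)
    rows = []
    for i in range(g.numel()):
        row = torch.autograd.grad(g[i], params, retain_graph=True, allow_unused=True)
        rows.append(torch.cat([(torch.zeros_like(p) if r is None else r).reshape(-1)
                               for r, p in zip(row, params)]))
    return torch.stack(rows)

def spearman_vs_retraining(model, loss_fn, X, y, x_t, y_t, retrain_fn, damping=1e-3):
    """Spearman correlation between influence estimates (Eq. 3.21, with a damped exact Hessian)
    and ground-truth estimates from leave-one-out re-training (`retrain_fn` is user-supplied)."""
    params = [p for p in model.parameters() if p.requires_grad]
    H = exact_hessian(loss_fn(model(X), y), params)
    H = H + damping * torch.eye(H.shape[0], dtype=H.dtype)   # damping, cf. H + lambda * I
    g_t = flat_grad(loss_fn(model(x_t), y_t), params)        # gradient of the test loss
    H_inv_gt = torch.linalg.solve(H, g_t)                    # one linear solve, reused for every training point
    scores, ground_truth = [], []
    for i in range(X.shape[0]):
        g_i = flat_grad(loss_fn(model(X[i:i + 1]), y[i:i + 1]), params)
        scores.append(-(H_inv_gt @ g_i).item())              # I(z_i, z_t) from Eq. (3.21)
        ground_truth.append(retrain_fn(i))                   # change in test loss after re-training without z_i
    return spearmanr(scores, ground_truth).correlation
```

Solving the linear system once for the test gradient and reusing the result for every training point avoids a separate solve per training example; this relies on the symmetry of the (damped) Hessian.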
The Effect of Weight Decay: One of the simplest and most common regularization techniques used to train neural networks is weight-decay regularization. In particular, a term λ∥θ∥²₂, penalizing the scaled norm of the model parameters, is added to the objective function during training, where λ is a hyper-parameter that needs to be tuned. We train a simple feed-forward network (with a width of 5, a depth of 1, and ReLU activations) with and without weight-decay regularization. For the network trained with weight decay, we observe a Spearman correlation of 0.97 between the influence estimates and the ground-truth estimates. In comparison, for the network trained without weight-decay regularization, the Spearman correlation decreases to 0.508. In this case, we notice that the Hessian matrix is singular, so a damping factor of 0.001 is added to the Hessian to make it invertible. To further understand the reason for this decrease in the quality of influence estimates, we compare the following quantities across all training examples: (a) the norm of the model parameter changes computed by re-training, and (b) the norm of the model parameter changes computed using the influence function (i.e., ∥H_{θ*}^{-1}∇ℓ(z_i)∥₂ for all i ∈ [1, n]); see Fig. 3.3-(a, b). We observe that when the network is trained without weight decay, the changes in model parameters computed with the influence function deviate significantly more from those computed by re-training. This suggests that the gap in the Taylor expansion underlying (first-order) influence estimates is large when the model is trained without weight decay. We observe similar results with smooth activation functions such as tanh (see the Appendix for details).

Figure 3.4: Iris dataset experimental results: (a) Spearman correlation of influence estimates with the ground-truth estimates, computed with stochastic estimation vs. the exact inverse-Hessian-vector product; (b) top eigenvalue of the Hessian vs. network depth; (c) Spearman correlation between the norm of parameter changes computed with the influence function vs. re-training.

The Effect of Network Depth: From Fig. 3.3-(c), we see that network depth has a dramatic effect on the quality of influence estimates. For example, when the depth of the network is increased to 8, we notice a significant decrease in the Spearman correlation. To better understand this decrease in quality for deeper networks, we compute the gap between the ground-truth parameter changes (computed by re-training) and the approximate parameter changes (computed using the influence function). To quantify this error gap, we compute the Spearman correlation between the norms of the true and approximate parameter changes across the top 16.6% of the influential examples. We find that with increasing depth, this Spearman correlation decreases. From Fig. 3.4-(c), we see that the approximation error gap is particularly large when the depth of the network is more than 5. We also notice a consistent increase in the curvature of the loss function (Fig. 3.4-(b)) as the network becomes deeper. This possibly suggests that the curvature of the network upper-bounds the approximation error gap between the true parameters and those computed using the influence function. We make a similar observation for non-smooth activation functions such as ReLU
(see the Appendix for more details).

The Effect of Network Width: To see the effect of network width on the quality of influence estimates, we evaluate influence estimates for a feed-forward network of constant depth while progressively increasing its width. From Fig. 3.3-(d), we observe that as the network width increases, the Spearman correlation decreases consistently. For example, the Spearman correlation decreases from 0.82 to 0.56 when the width of the network is increased from 8 to 50. This observation suggests that over-parameterizing a network by increasing its width has a strong impact on the quality of influence estimates.

Figure 3.5: Experiments on small MNIST using a CNN architecture. Estimation of influence, with and without weight decay, for (a) the top influential points and (b) training points at the 30th percentile of the influence-score distribution; (c) correlation vs. the weight-decay factor (evaluated on the top influential points).

The Effect of Stochastic Estimation on the Inverse-Hessian-Vector Product: For large deep networks, the inverse-Hessian-vector product is computed using stochastic estimation [3], as the exact Hessian matrix cannot b