ABSTRACT

Title of Dissertation: PRUNING FOR EFFICIENT DEEP LEARNING: FROM CNNS TO GENERATIVE MODELS

Alireza Ganjdanesh
Doctor of Philosophy, 2025

Dissertation Directed by: Professor Heng Huang, Department of Computer Science

Deep learning models have shown remarkable success in visual recognition and generative modeling tasks in computer vision in the last decade. A general trend is that their performance improves with an increase in the size of their training data, model capacity, and training iterations on modern hardware. However, the increase in model size naturally leads to higher computational complexity and memory footprint, thereby necessitating high-end hardware for deployment. This trade-off prevents the deployment of deep learning models in resource-constrained environments such as robotic applications, mobile phones, and edge devices employed in the Artificial Internet of Things (AIoT). In addition, private companies and organizations have to spend significant resources on cloud services to serve deep models for their customers.

In this dissertation, we develop model pruning and Neural Architecture Search (NAS) methods to improve the inference efficiency of deep learning models for visual recognition and generative modeling applications. We design our methods to be tailored to the unique characteristics of each model and its task.

In the first part, we present model pruning and efficient NAS methods for Convolutional Neural Network (CNN) classifiers. We start by proposing a pruning method that leverages interpretations of a pretrained model's decisions to prune its redundant structures. Then, we provide an efficient NAS method that learns the kernel sizes of a CNN model from its training dataset and a given parameter budget, enabling the design of efficient CNNs customized for their target application. Finally, we develop a framework for simultaneous pretraining and pruning of CNNs, which combines the first two stages of the pretrain-prune-finetune pipeline commonly used in model pruning and reduces its complexity.

In the second part, we propose model pruning methods for visual generative models. First, we present a pruning method for conditional Generative Adversarial Networks (GANs) in which we prune the generator and discriminator models in a collaborative manner. We then address the inference efficiency of diffusion models by proposing a method that prunes a pretrained diffusion model into a mixture of efficient experts, each handling a separate part of the denoising process. Finally, we develop an adaptive prompt-tailored pruning method for modern text-to-image diffusion models. It prunes a pretrained model like Stable Diffusion into a mixture of efficient experts such that each expert specializes in a certain type of input prompt.

PRUNING FOR EFFICIENT DEEP LEARNING: FROM CNNS TO GENERATIVE MODELS

by

Alireza Ganjdanesh

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2025

Advisory Committee:
Professor Heng Huang, Chair/Advisor
Professor Abhinav Shrivastava
Professor Furong Huang
Professor Tianyi Zhou
Professor Shuvra S. Bhattacharyya, Dean's Representative

© Copyright by Alireza Ganjdanesh 2025

To the dearest people in my life, Melina, Kasra, Ayli, Soheila, and Yousef.

Acknowledgments

I start by expressing my appreciation to my Ph.D. advisor, Dr. Heng Huang, for his guidance and constant support during my Ph.D. journey.
He provided me with the invaluable opportunity to enter the exciting field of deep learning research, for which I will always be grateful. I am especially thankful for his patience and encouragement during the early stages of my Ph.D., as he guided me in building the foundational knowledge and skills necessary to conduct meaningful research. His mentorship has been instrumental in shaping my growth as a researcher and inspiring me to push the boundaries of my work.

I extend my gratitude to my committee members, Dr. Abhinav Shrivastava, Dr. Furong Huang, Dr. Tianyi Zhou, and Dr. Shuvra S. Bhattacharyya, for their generosity in serving on my dissertation committee. Their insightful feedback and constructive comments have been invaluable in refining my research and strengthening my dissertation.

I express my heartfelt appreciation to my collaborators, whose invaluable contributions have been instrumental to the success of my research. First and foremost, I want to thank Dr. Shangqian Gao, who has been both a mentor and a guiding figure, akin to an older brother in my professional journey. He introduced me to the fascinating field of deep learning efficiency and was an outstanding collaborator on most of the projects presented in this dissertation. I am deeply thankful for the hard work we shared, the countless hours dedicated to brainstorming ideas, implementing them, and conducting experiments, and the late-night meetings essential to achieving these milestones together. I also thank Reza Shirkavand for his dedication during our collaboration in the final year of my Ph.D. His efforts were instrumental in bringing our project to fruition.

I am profoundly thankful to Yan Kang for giving me the opportunity to join Adobe Research as an intern in Summer 2023. This internship marked my first experience working at a major technology company, where I was fortunate to have Dr. Yuchen Liu, Dr. Richard Zhang, and Dr. Zhe Lin, along with Yan, as my mentors. Through this experience, I also had the privilege of meeting my current manager, Dr. Soheil Darabi, which led to securing my first full-time position with the brilliant Adobe Firefly team.

I would also like to thank my former collaborators, Dr. Wei Chen, Dr. Kamran Ghasedi, Dr. Liang Zhan, and Jipeng Zhang. Their expertise and dedication were invaluable in completing the projects during the early stages of my Ph.D.

I thank my former and current lab-mates at Huang Lab at the University of Pittsburgh and the University of Maryland, College Park. I was lucky to be a part of such a talented group of researchers, including Feihu Huang, Bin Gu, Lei Luo, Kamran Ghasedi, Hongchang Gao, Xiaoqian Wang, Zhouyuan Huo, Shangqian Gao, Yanfu Zhang, Runxue Bao, Wenhan Xian, Chao Li, An Xu, Xiaotian Dou, Guodong Liu, Peiran Yu, Zhenyi Wang, Junfeng Guo, Zhengmian Hu, Junyi Li, Yihan Wu, Xidong Wu, Reza Shirkavand, Hirad Alipanah, Lichang Chen, Yanshuo Chen, Ruibo Chen, Chenxi Liu, and Tianyi Xiong. I had my first hot pot at our group's lunch gathering, and since then, my wife and I have had multiple Mediterranean-style hot pot meals together. I also will not forget the fun times we had with Ruibo Chen, Chenxi Liu, and Tianyi Xiong at the Yahentamitsi dining hall at UMD. Thank you all for the great memories.

I thank Dr. Natasa Miskov-Zivanov, Dr. Azime Can-Cimino, and Dr. Steven Jacobs, under whom I had the privilege of serving as a teaching assistant at the University of Pittsburgh.
They entrusted me with the opportunity to lead several lecture sessions, allowing me to experience teaching in a language other than my mother tongue. These unique experiences were both challenging and rewarding, and I am grateful for their trust in me. I also thank Handa Ding, who was a friend and a colleague during my time as a teaching assistant for ECE Analytical Methods under Natasa. It was the first semester of my Ph.D., and I was having a hard time balancing coursework and research while adapting to a new environment. Handa's support during that period was truly invaluable, and I appreciate his willingness to step in and cover for me when I needed it most. I will always cherish the enjoyable moments we shared while grading exams with Natasa and Handa.

Next, I would like to thank my friends who made my Ph.D. journey not only tolerable but also memorable during my years in Pittsburgh and College Park. I am particularly grateful to Kazem Meidani, my best friend and roommate during my time in Pittsburgh. He stood by me through all the ups and downs, particularly during the challenging COVID-19 pandemic lockdowns, which began just six months after we arrived in the United States. His unwavering support during such a difficult time was invaluable, and I could not have remained sane without him.

I extend my heartfelt thanks to Kamran Ghasedi and Najmeh Sadoughi, who were like an older brother and sister to me and Kazem during our first year of the Ph.D. program. Their willingness to help and the wonderful memories we created together in Pittsburgh mean so much to me. I feel fortunate to still be in touch with them in Seattle.

I sincerely thank Mohsen Tabrizi and Amirreza Hashemi for their unwavering willingness to selflessly support the younger generation of students in the Iranian community in Pittsburgh. Their humility and kindness have left a lasting impression on me. The world would undoubtedly be a better place with more individuals like you. Thank you both for your efforts and support during my time in Pittsburgh.

I am also grateful for the fun times and cherished memories shared with my other friends in Pittsburgh: Parshin Shojaee, Soroosh Shafieezadeh Abadeh, Hirad Alipanah, Reza Shirkavand, Maryam Hakimzadeh, Fahimeh Dehghan, Yasin Karimi, Sanaz Saadatifar, Saba Dadsetan, Alireza Golestaneh, Parand Akbari, Masoud Zamani, Tahere Mokhtari, Sina Malakouti, and Mohammad Bakhshalipour. Additionally, I thank Matin Mortaheb and Maryam Maghsoudi Shaghaghi for the great memories we shared during the last year of my Ph.D. in College Park and Washington, DC. Matin, a longtime friend since high school, has been a pillar of support and friendship through all these years, and I deeply appreciate him for always being there.

Finally, I want to express my heartfelt gratitude to my family for their unwavering love and support throughout this journey. Their encouragement has been my anchor, providing me with the strength to persevere through challenges and celebrate successes.

I am especially grateful to my wife, Melina Emily Adrangi, for her constant love, patience, and understanding. She has been my confidant, cheerleader, and partner in every sense, helping me navigate the highs and lows of this journey with grace and resilience. Her sacrifices, from enduring my late working hours to supporting me through stressful times, have not gone unnoticed, and I am deeply thankful for her unwavering commitment to our shared dreams.
I am truly fortunate to have her by my side, and this accomplishment would not have been possible without her.

I express my deepest gratitude to my parents, Yousef and Soheila, whose unwavering encouragement and support have been the foundation of my success. They have always inspired me to work hard, dream big, and pursue my aspirations, including my journey to the United States. From them, I learned the fundamental principles of classical liberalism, free markets, and individual freedom, values that have contributed to the unprecedented prosperity of the United States in human history. My parents have selflessly stood by me through every stage of my life, offering their love and guidance during both triumphs and challenges. I am endlessly grateful to have such incredible parents, and I truly could not have achieved this without you. Thank you for everything.

I thank my siblings, Kasra (Amirreza) and Ayli, for all the joy and laughter you have brought into my life. The fun times we've shared together have been a source of happiness and balance during this journey. I am incredibly proud of both of you and have cherished every moment of our conversations and the opportunity to share my experiences with you. Your support throughout these years has meant the world to me. Thank you for always being there.

I would also like to thank my parents-in-law, Farid and Emilia, for their support and kindness over the years. Your constant encouragement and love have been a source of strength for both Melina and me throughout this journey. Your presence has provided us with confidence and reassurance during challenging times, and I am deeply grateful for everything you have done.

To all of you and the others I met during these years, thank you for supporting me throughout this journey.

Table of Contents

Dedication ii
Acknowledgements iii
Table of Contents viii
List of Tables xi
List of Figures xiii
List of Abbreviations xx

Chapter 1: Introduction 1
1.1 Motivation . . . 1
1.2 Dissertation Outline . . . 6

I Pruning and Efficient Architecture Search Techniques for Convolutional Neural Networks 14

Chapter 2: Interpretations Steered Network Pruning via Amortized Inferred Saliency Maps 15
2.1 Introduction . . . 15
2.2 Related Works . . . 18
2.3 Methodology . . . 21
2.4 Experiments . . . 33
2.5 Chapter Summary and Conclusions . . . 37
Supplementary Materials for Chapter 2 . . . 38
S2.1 REAL-X Formulation Development for Interpretation of Classifiers . . . 38
S2.2 Implementation Details of Our AEM . . . 41
S2.3 Experimental Setup . . . 43

Chapter 3: Efficient Learning of Kernel Sizes for Convolution Layers of CNNs 61
3.1 Introduction . . . 61
3.2 Related Works . . . 64
3.3 Methodology . . . 66
3.4 Experiments . . . 70
3.5 Chapter Summary and Conclusion . . . 81
Supplementary Materials for Chapter 3 . . . 82
S3.1 Experimental settings . . . 82
S3.2 Results . . . 83

Chapter 4: Jointly Training and Pruning CNNs via Learnable Agent Guidance and Alignment 90
4.1 Introduction . . . 90
4.2 Related Work . . . 93
4.3 Method . . . 96
4.4 Experiments . . . 103
4.5 Chapter Summary and Conclusion . . . 109
Supplementary Materials for Chapter 4 . . . 111
S4.1 Bounding our Agent's Actions . . . 111
S4.2 Experimental Settings . . . 112

II Pruning Methods for Efficient Inference of Deep Generative Models 115

Chapter 5: Compressing Image-to-Image Translation GANs Using Local Density Structures on Their Learned Manifold 116
5.1 Introduction . . . 116
5.2 Related Work . . . 119
5.3 Proposed Method . . . 121
5.4 Experiments . . . 128
5.5 Chapter Summary and Conclusion . . . 134
Supplementary Materials for Chapter 5 . . . 136
S5.1 Details of Our Experiments . . . 136
S5.2 Our Pruning Agents . . . 136
S5.3 Experimental Details . . . 139

Chapter 6: Mixture of Efficient Diffusion Experts Through Automatic Interval and Sub-Network Selection 142
6.1 Introduction . . . 142
6.2 Related Work . . . 146
6.3 Method . . . 148
6.4 Experiments . . . 159
6.5 Chapter Summary and Conclusions . . . 164
Supplementary Materials for Chapter 6 . . . 166
S6.1 Experimental Results . . . 166
S6.2 Details of Our Method . . . 167
S6.3 Experiments . . . 175

Chapter 7: Not All Prompts Are Made Equal: Prompt-based Pruning of Text-to-Image Diffusion Models 180
7.1 Introduction . . . 180
7.2 Related Work . . . 183
7.3 Method . . . 186
7.4 Experiments . . . 195
7.5 Chapter Summary and Conclusions . . . 201
Supplementary Materials for Chapter 7 . . . 202
S7.1 Overview of Diffusion Models . . . 202
S7.2 More Details of APTP . . . 203
S7.3 Experiments . . . 207

Chapter 8: Conclusion and Discussion 220
8.1 Broader Context and Future Directions . . . 223

Bibliography 225

List of Tables

2.1 Comparison of results on CIFAR-10. ∆-Acc represents the performance changes relative to the baseline, and +/− indicates an increase/decrease, respectively. . . . 36
2.2 Comparison results on ImageNet with ResNet-34/50/101 and MobileNet-V2. . . . 37
3.1 Results on the MNIST dataset. . . . 73
3.2 Results on the CIFAR10 dataset. . . . 73
3.3 Results on the STL-10 dataset. . . . 77
3.4 Results on the ImageNet-32 dataset. . . . 78
3.5 The architecture of our size predictor. . . . 83
4.1 Comparison results on CIFAR-10 for pruning ResNet-56 and MobileNet-V2. . . . 104
4.2 Comparison results on ImageNet for pruning ResNet-18/34 and MobileNet-V2. . . . 105
4.3 Ablation results of our method for pruning ResNet-56 on the CIFAR-10 dataset. EE represents the Epoch Embeddings. SR represents the Soft Regularization in Eq. 4.10. . . . 108
5.1 Quantitative comparison of our proposed method with state-of-the-art GAN compression methods. . . . 129
5.2 Ablation results of our proposed method. . . . 133
5.3 The architecture of gen∗ (∗ ∈ {G, D}) used in our method. . . . 137
5.4 Hyperparameter settings for training original models. . . . 139
5.5 Hyperparameter settings for pruning agents. . . . 140
5.6 Hyperparameter settings for fine-tuning. . . . 141
6.1 Comparison results of DiffPruning vs. baselines. Throughput values are calculated using an NVIDIA A100 GPU. †: the values are the average of our two efficient experts. ∗: calculated by sampling from provided checkpoints. ‡: speed-ups relative to the LDM model. The shadowed values are inaccurate, and we refer to supplementary S6.3.3 for a detailed discussion. . . . 161
6.2 Ablation results of our proposed method for pruning the LDM model [1] for LSUN-Bedroom to 50% MACs budget. . . . 164
6.3 The architecture of our Expert Routing Agent. We calculate width architecture vectors v^(i) from the outputs o_k^(i) (k ∈ [1, L]). We compute the depth architecture vector u^(i) from o_{L+1}^(i). We refer to Sec. S6.2.3.1 for detailed formulations. . . . 171
6.4 Hyperparameters of fine-tuning our models with elastic dimensions. . . . 174
6.5 Hyperparameters for the pruning and fine-tuning stages of our method for different MACs pruning ratios (30%, 50%, and 70%). . . . 177
6.6 Comparison of the number of training iterations for different methods on LSUN-Bedroom.
The "Method's Iterations" column denotes the number of all the training iterations that the pruning method performs to obtain its final efficient model. . . . 178
6.7 Comparison of the number of training iterations for different methods on LSUN-Church. The "Method's Iterations" column denotes the number of all the training iterations that the pruning method performs to obtain its final efficient model. . . . 179
7.1 Results on CC3M and MS-COCO. We report performance metrics using samples generated at the resolution of 768 and then downsampled to 256 [2]. We measure models' MACs/Latency with the input resolution of 768 on an A100 GPU. @30/50k shows fine-tuning iterations after pruning. . . . 198
7.2 The most frequent words in prompts assigned to each expert of APTP-Base pruned on CC3M. The resource utilization of each expert is indicated in parentheses. . . . 198
7.3 Ablation results of APTP's components on 30k samples from the MS-COCO [3] validation set. We fine-tune all models for 10k iterations after pruning. . . . 200
7.4 The most frequent words in prompts assigned to each expert of APTP-Base pruned on COCO. The resource utilization of each expert is indicated in parentheses. . . . 211
7.5 Quantitative results on CC3M and MS-COCO. We report the performance metrics using samples generated at the resolution of 768 and downsampled to 256 [2]. We measure models' MACs and Latency with the input resolution of 768 on an A100 GPU. @30/50k shows the model's fine-tuning iterations after pruning. . . . 213
7.6 Prompts for Fig. 7.3 . . . 219

List of Figures

2.1 Input features selected by a) REAL-X [4] and b) our model to explain decisions of a ResNet-56 classifier for samples from CIFAR-10 [5]. In the sub-figures from left to right: the 1st column shows the original image. Both models output an array (2nd columns) in which each value is the parameter of the predicted Bernoulli distribution over the corresponding mask pixel. In the 3rd column, we show the masks generated such that a pixel's value is one provided that its predicted Bernoulli parameter is bigger than 0.5, and zero otherwise. The 4th columns show the masked inputs. Our model's explanations are easier to interpret than the ones by REAL-X, which may seem random for some samples. . . . 25
2.2 Our AEM model. The goal is to train the selector model on the right (U-Net model in dashed line) to predict interpretations (saliency maps) of the classifier for each input sample. We train the selector by encouraging it to follow Eq. 2.2. (Left): We train a predictor model that learns to predict the classifier's output distribution given a masked input (RHS of Eq. 2.2). We do so using inputs masked by random RBF masks, as our selector's masks have RBF style. (Sec. 2.3.4) (Right): Given the trained predictor, we train the selector model using obj. 2.8 that enforces it to follow Eq. 2.2. We use the classifier's convolutional backbone as the encoder of the selector and only train its decoder for computational efficiency. Then, we use the trained decoder to prune the encoder. (Fig. 2.3) . . . 28
2.3 Our pruning method. The classifier to be pruned is shown on top.
(Conv layers and FC). The U-Net (Conv layers and the Decoder) is our trained selector model that can predict RBF parameters of the saliency map of each input for the classifier. The selector model is trained such that the pretrained backbone of the classifier is used as its encoder (Conv layers) and kept frozen during training. (see Fig. 2.2) Thus, we freeze the selector and classifier's weights and insert our pruning gates between the selector's encoder layers for pruning the classifier. Given a pruning pair (a sample and its RBF saliency map's parameters for the original classifier), we train the gate parameters to prune the classifier such that the pruned model has similar interpretations (L_interpr) and accuracy (L_class) to the original classifier while requiring lower computational resources (L_Res). . . . 30
2.4 (a): Test accuracy of different masks' parameterization schemes (RBF (ours) vs. Independent (REAL-X [4])). (b): Test accuracy w/wo using the classification loss. All results are for 3 runs with ResNet-56 on CIFAR-10. Shaded areas represent variance. . . . 33
2.5 (a), (b): The model's test accuracy and the FLOPs regularization term when changing γ_1, and (c), (d): when varying γ_2. All results are run 3 times with ResNet-56 on CIFAR-10. Shaded areas represent variance. . . . 35
2.6 Our proposed nonlinear function to calculate the center's coordinates of a predicted RBF kernel of the selector for the CIFAR-10 dataset. . . . 48
2.7 Our proposed nonlinear function to calculate the expansion parameter of a predicted RBF kernel of the selector for the CIFAR-10 dataset. . . . 49
2.8 Our proposed nonlinear function to calculate the center's coordinates of a predicted RBF kernel of the selector for the ImageNet dataset. . . . 50
2.9 ImageNet examples. Columns from left to right: input image, distribution over explanatory masks predicted by the selector, predicted distribution shown over the input, a sampled mask from the predicted distribution, and the input image masked by the sampled mask. Class of input images from top to bottom: 'Gyromitra', 'Honeycomb', 'Strainer', 'English springer', 'Indri brevicaudatus', 'Hartebeest'. . . . 52
2.10 ImageNet examples. Columns from left to right: input image, distribution over explanatory masks predicted by the selector, predicted distribution shown over the input, a sampled mask from the predicted distribution, and the input image masked by the sampled mask. Class of input images from top to bottom: 'Australian terrier', 'Scoreboard', 'Microwave oven', 'Barn', 'Rosehip', 'Samoyed'. . . . 53
2.11 ImageNet examples. Columns from left to right: input image, distribution over explanatory masks predicted by the selector, predicted distribution shown over the input, a sampled mask from the predicted distribution, and the input image masked by the sampled mask. Class of input images from top to bottom: 'Miniskirt', 'Soccer ball', 'Jeep', 'Albatross', 'Tench', 'China cabinet'. . . . 54
2.12 ImageNet examples. Columns from left to right: input image, distribution over explanatory masks predicted by the selector, predicted distribution shown over the input, a sampled mask from the predicted distribution, and the input image masked by the sampled mask.
Class of input images from top to bottom: 'Kimono', 'Whippet', 'Poncho', 'Drilling Platform', 'Steel Drum', 'Black Grouse'. . . . 55
2.13 ImageNet examples. Columns from left to right: input image, distribution over explanatory masks predicted by the selector, predicted distribution shown over the input, a sampled mask from the predicted distribution, and the input image masked by the sampled mask. Class of input images from top to bottom: 'Binoculars', 'Horned viper', 'Native bear', 'Hedgehog', 'Japanese spaniel', 'Reel'. . . . 56
2.14 CIFAR-10 examples. Class of input images from top to bottom: 'Ship', 'Truck', 'Automobile', 'Frog', 'Horse', 'Bird', 'Airplane', 'Bird', 'Dog', 'Cat', 'Cat', 'Cat', 'Bird'. . . . 57
2.15 CIFAR-10 examples. Class of input images from top to bottom: 'Bird', 'Ship', 'Frog', 'Cat', 'Dog', 'Frog', 'Airplane', 'Automobile', 'Horse', 'Bird', 'Bird', 'Deer', 'Horse'. . . . 58
2.16 CIFAR-10 examples. Class of input images from top to bottom: 'Frog', 'Ship', 'Ship', 'Cat', 'Airplane', 'Ship', 'Horse', 'Horse', 'Truck', 'Dog', 'Automobile', 'Frog', 'Deer'. . . . 59
2.17 CIFAR-10 examples. Class of input images from top to bottom: 'Dog', 'Airplane', 'Horse', 'Automobile', 'Horse', 'Ship', 'Ship', 'Automobile', 'Cat', 'Airplane', 'Ship', 'Airplane', 'Dog'. . . . 60
3.1 Overview of our method. Our size predictor model learns to predict the kernel sizes for the classifier. It predicts soft kernel sizes v that are rounded to integer values. Then, our adaptive weight predictor model predicts optimal kernel weights ŵ given the predicted sizes. We modulate the predicted weights using masks m_l parameterized by the soft sizes v to make the resulting weights w differentiable w.r.t. the size predictor's weights. Finally, the weights w are used as the kernel weights of the classifier, and the training is guided by the classification objective (L_class) and the parameter budget loss (L_param). . . . 66
3.2 Learned sizes (top) and weights (bottom) for the kernels of our EffConv-20 model with 0.66M parameters on CIFAR-10. . . . 76
3.3 Results of our ablation studies. . . . 79
3.4 MNIST, EffConv-20, 0.66M. . . . 84
3.5 CIFAR-10, EffConv-26, 0.66M. . . . 85
3.6 CIFAR-10, EffConv-32, 0.66M. . . . 85
3.7 CIFAR-10, EffConv-38, 0.66M. . . . 86
3.8 CIFAR-10, EffConv-44, 0.66M. . . . 86
3.9 CIFAR-10, EffConv-50, 0.66M. . . . 86
3.10 CIFAR-10, EffConv-56, 0.66M. . . . 87
3.11 CIFAR-10, Wide-EffConv-28-1, 0.66M. . . . 87
3.12 STL-10, EffConv-20, 0.66M. . . . 87
3.13 STL-10, EffConv-20, 0.71M. . . . 88
3.14 STL-10, EffConv-20, 0.78M. . . . 88
3.15 ImageNet-32, EffConv-20, 0.50M. . . . 88
3.16 CIFAR-10, EffConv-20, 0.3M. . . . 89
3.17 CIFAR-10, EffConv-20, 0.4M. . . . 89
3.18 CIFAR-10, EffConv-20, 0.5M. . . . 89
4.1 Overview of our method. We jointly train and prune a CNN model using an RL agent by iteratively training the agent's policy and the model's weights. In each iteration, we train the model's weights for one epoch and perform several episodic observations of the agent. Left: Each action of our agent prunes one layer of the model, and the procedure of pruning the l-th layer is shown. The agent's actions on the previous layers and the remaining layers' FLOPs determine its state, and we take the resulting model's accuracy as its reward (Sec. 4.3.2). As the model's weights change between iterations, the reward function also changes accordingly. Thus, we map each epoch to an embedding and employ a recurrent model to provide a state of the environment z to the agent. (Sec. 4.3.2.1) Right: Given a sub-network selected by the agent, we train the model's weights while softly regularizing them to align with the selected structure (Sec. 4.3.2.3). . . . 94
4.2 Results of ablation experiments on CIFAR-10. (a-c) Best reward of our agent when using a different number of episodes per epoch for three pruning rates when pruning ResNet-56. (d-f) Best reward with/without using our mechanism to provide representations of the environment to our agent during training for three pruning rates for ResNet-56. (g-i) Same results as (d-f) for MobileNet-V2. . . . 107
5.1 Our GAN pruning method. We encourage the pruned generator to preserve the density structure of the original model over its learned manifold during pruning. To do so, we partition the manifold into local neighborhoods around the samples generated by the original generator (Fig. 5.2) and represent each local neighborhood with a 'Center' sample (shown with a red frame) and its neighbors (blue frames). We use these samples as 'real' samples and the one generated by the pruned generator as a 'fake' one in our adversarial pruning objective. We implement our adversarial game with two pruning agents, gen_G and gen_D, that collaboratively learn to prune the original pretrained G and D. gen_G (gen_D) takes the architecture embedding of its colleague gen_D (gen_G) when determining the architecture of G (D). By doing so, gen_G and gen_D can maintain the balance between the capacity of G and D during pruning and make the process stable. (Fig. 5.4) . . . 119
5.2 Our method to find local neighborhoods on the learned manifold of the original generator. (Top Left): First, we obtain the original model's predictions in the target domain for training samples in the source domain. (Right and Down): We call the sample whose local neighborhood on the manifold we want to find the 'Center' sample (shown with a red solid frame). We pass the predicted samples in the previous step to a pretrained self-supervised encoder [6] that is fine-tuned on the target images in the training dataset. Then, we take the samples whose representations have the highest cosine similarity with the representation of the 'Center' sample as its approximate neighbors on the manifold. Neighbor samples and the approximate neighborhood on the manifold are shown with blue crosses and a dashed line. . . . 122
5.3 Qualitative results for 1) Pix2Pix: Cityscapes (top left), Edges2Shoes (bottom left), and 2) CycleGAN: Horse2Zebra (top right) and Summer2Winter (bottom right). . . . 130
5.4 Different losses given different λ_1 during the pruning process. (a)-(c) Loss values for CycleGAN on the Horse2Zebra dataset. (d)-(f) Loss values for Pix2Pix on the Cityscapes dataset. We normalize R to the range [0, 1] for better visualization. . . . 131
5.5 Visualization of approximate neighborhoods on the learned manifold of our pruned model vs. the original model. . . . 134
6.1 Overview of DiffPruning. We prune a pre-trained LDM model [1] (top) into a mixture of efficient experts (bottom). Each expert handles an interval, which allows their architectures to be separately specialized by removing layers or channels. . . . 143
6.2 Our Pruning Scheme: We train our Expert Routing Agent (ERA) to prune the experts into a mixture of efficient experts (Sec. S6.2.3). The ERA predicts the architecture vectors (v, u) to prune the experts' width and depth. Then, we calculate the denoising objectives of the selected sub-networks of experts, L_DDPM,I_i, as well as our resource regularization term, R, which encourages the ERA to provide a mixture of efficient experts with a desired compute budget (MACs). We train the ERA's parameters to minimize the objective functions. Thus, it learns to automatically allocate the compute budget (MACs) between experts in an end-to-end manner. . . . 144
6.3 Our Interval Selection Scheme: We calculate gradients of the denoising timesteps' objectives w.r.t. the pre-trained LDM's parameters and take the cosine similarity value of two timesteps' gradients as their alignment score. The dashed lines show our selected cluster intervals for the experts. One can observe that the optimal cluster assignments are different for distinct datasets, and employing a deterministic clustering strategy [7] like uniform clustering [8] for all datasets is sub-optimal. . . . 149
6.4 U-Net architecture of the LDM [1]. We randomly drop/preserve each colored layer in our elastic depth fine-tuning. . . . 151
6.5 Samples from the LDM [1] model and our pruned mixture of experts for different MACs budgets. The green numbers show the relative sampling throughput speed-up of our pruned models compared to the LDM on an NVIDIA A100 GPU. . . . 162
6.6 Comparison results of our method vs. baselines, SP [9], OMS-DPM [10], DDPM [11], and LDM [1]. First row: FID vs. MACs curves. Second row: FID vs. Throughput curves. We calculate the Throughput values with an NVIDIA A100 GPU. Higher Throughput and lower FID and MACs indicate better performance. . . . 166
6.7 Weighted average J(t_1) (Eq. 6.16) of the mean of alignment scores in two clusters for the LDM trained on FFHQ. . . . 169
6.8 Weighted average J(t_1) (Eq. 6.16) of the mean of alignment scores in two clusters for the LDM trained on ImageNet. . . . 169
6.9 Weighted average J(t_1) (Eq. 6.16) of the mean of alignment scores in two clusters for the LDM trained on LSUN-Beds. . . . 170
6.10 Weighted average J(t_1) (Eq. 6.16) of the mean of alignment scores in two clusters for the LDM trained on LSUN-Church. . . . 170
6.11 Illustration of our Elastic Width training. We sort the convolution channels (attention heads) based on their importance (L1 norm) before starting elastic width training. We drop a random ratio of the least important channels (heads) for convolution layers (attention layers) for each batch of training. The values o_1:4 represent different possible dropping ratios for a convolution layer with 4 channels. . . . 171
6.12 U-Net architecture of the LDM [1]. . . . 173
7.1 Overview: We prune a text-to-image diffusion model like Stable Diffusion (left) into a mixture of efficient experts (right) in a prompt-based manner. Our prompt router routes distinct types of prompts to different experts, allowing the experts' architectures to be separately specialized by removing layers or channels. . . . 181
7.2 Our pruning scheme. We train our prompt router and the set of architecture codes to prune a text-to-image diffusion model into a mixture of experts. The prompt router consists of three modules. We use a Sentence Transformer [12] as our prompt encoder to encode the input prompt into a representation z. Then, the architecture predictor transforms z into the architecture embedding e that has the same dimensionality as the architecture codes. Finally, the router routes the embedding e into an architecture code a^(i). We use optimal transport to evenly assign the prompts in a training batch to the architecture codes. The architecture code a^(i) = (u^(i), v^(i)) determines pruning of the model's width and depth. We train the prompt router's parameters and architecture codes in an end-to-end manner using the denoising objective of the pruned model L_DDPM, the distillation loss between the pruned and original models L_distill, the average resource usage for the samples in the batch R, and the contrastive objective L_cont, which encourages the embeddings e to preserve the semantic similarity of the representations z. . . . 187
7.3 Samples of the APTP-Base experts after pruning Stable Diffusion V2.1 using CC3M [13] and COCO [3] as the target datasets. Expert IDs are shown on the top right of the images. (See Table 7.6 for prompts) . . . 196
7.4 Comparison of samples generated by low and high budget experts of APTP-Base vs. SD V2.1 on the CC3M and MS-COCO validation sets. . . . 199
7.5 Ablation results for the number of experts of APTP on MS-COCO. . . . 200
7.6 Resource and Contrastive loss observed when applying APTP-Base with a MAC budget of 0.77 to prune Stable Diffusion 2.1 using the COCO dataset. The comparison is made between two settings: with and without optimal transport. APTP both adheres to the target MAC budget and finds architecture vectors that maintain the similarity between the prompts. . . . 210
7.7 Comparison of sample assignments in a batch to experts with and without optimal transport. The incorporation of optimal transport results in a more diverse assignment pattern. In the figure, each square represents a prompt within the batch, and the color signifies the budget level of the expert assigned to the prompt. Higher-resource experts are indicated by darker blue. . . . 211
7.8 Distribution of CC3M samples mapped to each expert of APTP-Base, including resource utilization ratios. . . . 212
7.9 The block-level retained MAC ratio of the UNet architecture of all experts of APTP-Base applied to Stable Diffusion 2.1 with CC3M as the target dataset. . . . 215
7.10 The block-level retained MAC ratio of the UNet architecture of all experts of APTP-Base applied to Stable Diffusion 2.1 with COCO as the target dataset. The groups of ResBlocks and the heads of Attention Blocks are pruned based on the outputs of the architecture predictor. The intensity of the color of each block represents its resource utilization. The number in each block indicates the precise ratio of retained MACs of the block. Conv in, Conv out, and skip connections between corresponding down and up blocks are omitted for brevity. . . . 216
7.11 Samples of the APTP-Base experts after pruning Stable Diffusion V2.1 using CC3M [13] as the target dataset. Each row corresponds to a unique expert. Please refer to Table 7.2 for the groups of prompts assigned to each expert. . . . 217
7.12 Samples of the APTP-Base experts after pruning Stable Diffusion V2.1 using MS-COCO [3] as the target dataset. Each row corresponds to a unique expert. Please refer to Table 7.4 for the groups of prompts assigned to each expert. . . . 218

List of Abbreviations

ADAM    Adaptive Moment Estimation
AEM     Amortized Explanation Models
AIoT    Artificial Internet of Things
APTP    Adaptive Prompt-Tailored Pruning
CLIP    Contrastive Language-Image Pre-training
CMMD    CLIP Maximum Mean Discrepancy
CNN     Convolutional Neural Network
DDIM    Denoising Diffusion Implicit Models
DDPM    Denoising Diffusion Probabilistic Models
DNN     Deep Neural Network
DPM     Diffusion Probabilistic Model
EIE     Efficient Inference Engine
ERA     Expert Routing Agent
FID     Fréchet inception distance
FLOPs   Floating-point Operations
GAN     Generative Adversarial Network
GPU     Graphics Processing Unit
GRU     Gated Recurrent Unit
GPT     Generative Pre-trained Transformer
I2IGAN  Image-to-Image Translation Generative Adversarial Network
ISP     Interpretations Steered Pruning
IWFS    Instance-Wise Feature Selection
KD      Knowledge Distillation
KDE     Kernel Density Estimation
KL      Kullback-Leibler Divergence
LDM     Latent Diffusion Model
LHS     Left Hand Side
LIME    Local Interpretable Model-agnostic Explanations
LLM     Large Language Model
LLaVA   Large Language and Vision Assistant
MACs    Multiply Accumulate Operations
MGGC    Manifold Guided GAN Compression
MLP     Multi-layer Perceptron
MoE     Mixture of Experts
NAS     Neural Architecture Search
RBF     Radial Basis Function
ResNet  Residual Network
RHS     Right Hand Side
RL      Reinforcement Learning
SAC     Soft Actor-Critic
SD      Stable Diffusion
SGD     Stochastic Gradient Descent
SHAP    Shapley Additive explanations
STE     Straight-Through Estimator
T2I     Text-to-Image
TPU     Tensor Processing Unit
VAE     Variational Autoencoder

Chapter 1: Introduction

1.1 Motivation

Deep learning methods have achieved unprecedented capabilities for visual recognition and generative modeling tasks in computer vision in the past decade. They have significantly outperformed hand-crafted baselines on traditional image classification (assigning a label to an input image) and object detection (localizing as well as classifying objects within an image) benchmarks.
Moreover, deep vision language models like GPT-4 [14], Gemini [15], and LLaVA [16] have made significant strides, enabling them to provide fine-grained text descriptions that not only identify objects in their input images but also explain their relationships and attributes. In addition, modern deep generative models like DALL-E [17], Stable Diffusion [1], Imagen [18], and Adobe Firefly [19] can generate high-quality, realistic, and detailed images given input text prompts. They have also shown impressive performance on prompt-based image editing. Thus, deploying deep learning models in various real-world applications like disease diagnosis, autonomous driving, robotics, and content creation is of great interest. The key ingredient of their success is that they can learn to extract or generate useful patterns for downstream tasks from vast amounts of data.

Empirical trends in the literature indicate that the performance of deep models improves when they benefit from 1) an increased training sample size, 2) higher architectural capacity (also known as model size), and 3) longer training schedules on modern hardware. These factors have driven the development of large-scale models with billions of parameters, fueling a race among private companies to push the performance boundaries of deep models by increasing the model size. For instance, the vision language LLaVA V1.5 [20] and QWEN2-VL [21] models are among the top performing models in multimodal vision and language modeling tasks like object localization and visual question answering. LLaVA V1.5 has two variants with 7 and 13 billion parameters, and QWEN2-VL has three variants with 2, 7, and 72 billion parameters. Similarly, the Stable Diffusion (SD) models have shown improved performance in image generation and editing tasks from SD-V1, with about 980M parameters, to SD-XL [22] and SD-V3 [23], having about 3.5B and 8B parameters, respectively.

However, the increase in model size leads to a natural trade-off between model performance and computational complexity and memory footprint, thereby making the deployment of these models challenging or even infeasible in various real-world scenarios. On the one hand, organizations and private companies need to invest in expensive GPU or TPU clusters or spend excessive budgets to rent them from cloud providers to serve their models for their customers. On the other hand, directly deploying large-scale models for edge applications like smartphones, robotics, self-driving cars, and the Artificial Internet of Things (AIoT) exhausts their limited memory, computational resources, and battery. Further, these applications demand real-time inference and low-latency responses, which are infeasible to achieve if one runs large-scale models with the limited computational resources available in these applications. Therefore, compressing and reducing the computational burden of deep models while maintaining their performance is crucial before deploying them in practice.

Model pruning is an efficient and effective technique for compressing trained over-parameterized deep models to improve their inference efficiency. In fact, it has been shown [24, 25] that over-parameterization is beneficial for the optimization and generalization of deep models. Further, one can prune such a model into a simpler one without significantly affecting its performance, whereas directly training the pruned model from scratch typically results in worse performance due to optimization difficulties [25].

Model pruning can be fine-grained, called weight pruning, in which individual weights of a model are removed. It can achieve high compression rates, significantly reducing the required storage memory, but weight pruning usually cannot provide inference speed-ups in practice. The reason is that GPUs and TPUs cannot effectively utilize irregular sparsity patterns, and one needs inference engines like EIE [26] to exploit the sparsity and accelerate inference. In contrast, structural pruning removes channels or depth layers of the model, thereby both reducing the model size and accelerating its inference on modern hardware like GPUs without requiring post-processing or special inference libraries. Thus, structural pruning is more practical for real-world applications and has been widely used in different domains.
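To make the contrast concrete, the following minimal PyTorch sketch (an illustration of the two pruning styles in general, not the exact procedure used in later chapters; the layer shape and the 90%/50% ratios are arbitrary choices for the example) applies both styles to a single convolution layer:

    import torch
    import torch.nn as nn

    conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)

    # Unstructured (weight) pruning: zero out the 90% smallest-magnitude
    # weights. The tensor keeps its dense shape, so GPUs see no speed-up
    # unless a sparse inference engine such as EIE is used.
    k = int(0.9 * conv.weight.numel())
    threshold = conv.weight.abs().flatten().kthvalue(k).values
    conv.weight.data.mul_((conv.weight.abs() > threshold).float())

    # Structured (channel) pruning: drop half of the output channels,
    # ranked by the L1 norm of each filter. The layer becomes physically
    # smaller, so it runs faster on standard hardware with no special
    # inference library.
    l1_scores = conv.weight.abs().sum(dim=(1, 2, 3))  # one score per filter
    keep = l1_scores.topk(64).indices                 # keep the 64 strongest
    pruned = nn.Conv2d(64, 64, kernel_size=3, padding=1)
    pruned.weight.data = conv.weight.data[keep].clone()
    pruned.bias.data = conv.bias.data[keep].clone()

Note that the structured variant changes the layer's output shape, so adjacent layers must be pruned consistently; this coupling is part of what makes structural pruning a non-trivial search problem.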
In this dissertation, we develop structural pruning and architecture search techniques to reduce the memory footprint and improve the inference efficiency of deep visual recognition and generative models, tailored to their unique characteristics and requirements. We focus on the following two main research directions:

I. Inference Efficiency of Visual Recognition Models, where we develop structural pruning and efficient architecture search methods for Convolutional Neural Network (CNN) classifiers to achieve compact models given different constraints on the model's computational requirements, such as its number of Multiply-Accumulate operations (MACs) and its parameter count. We contribute to three main aspects of this research direction:

First, we approach the pruning problem from a novel perspective and aim to answer whether one can use interpretations of a CNN classifier's decisions to prune it. This is in contrast with the prominent techniques that focus on either the model's outputs or its weights to prune it. We discuss in Chapter 2 that existing interpretation techniques cannot be deployed for this purpose, as they are either shown to be independent of the model's decisions or computationally expensive for pruning. Thus, we develop an amortized explanation model tailored for CNN classifiers and employ it in our framework to guide the pruning process.

Second, we introduce an efficient architecture search method to find kernel sizes for CNNs (Chapter 3). Although kernel sizes are crucial design choices for CNNs' performance and efficiency, existing architectures usually contain convolution layers with fixed kernel sizes stacked on top of each other. This design choice is suboptimal since it does not consider the target task. We propose a differentiable architecture search method that determines kernel sizes given a training dataset and a parameter budget, securing up to a 60× speed-up compared to baseline methods while achieving superior final performance.

Third, we develop a method that reduces the complexity of the pruning process for CNNs. Typically, structural pruning methods perform a three-step process of pretraining the model, pruning it, and then fine-tuning the pruned model, where each step has its own design choices and hyperparameters. We propose a method that accomplishes the first two steps at the same time using a reinforcement learning agent that learns to determine the optimal structure of the model during the pretraining phase (Chapter 4). By doing so, we improve the efficiency of the pruning process.
II. Inference Acceleration for Deep Generative Models: Due to their fundamental differences, existing pruning methods for discriminative models are not directly applicable to generative models, and heuristically stacking them for pruning generative models usually leads to unsatisfactory performance. Therefore, we design pruning techniques for conditional Generative Adversarial Networks (GANs) and modern diffusion models while taking their specific characteristics into account.

First, we address pruning a conditional GAN model. In contrast with previous works that mainly apply distillation, we focus on the learned density structure of a pretrained GAN model as a generative model. Specifically, we propose a structural pruning method that encourages the pruned model to preserve local density structures of the original model on neighborhoods of its learned manifold, resembling the kernel density estimation method. Further, we design a collaborative pruning scheme in which two agents prune both the generator and the discriminator by exchanging feedback. Thus, our method can properly maintain the balance between the capacities of the two models during pruning and alleviate mode collapse during the pruning process, which is a common challenge for baselines.

Second, we leverage the gradual denoising process of modern diffusion models to prune them into a mixture of efficient experts, each handling a separate part of the denoising path of the model's sampling process. We propose a dataset-specific approach to cluster the denoising timesteps into intervals using their alignment scores and assign a separate expert to each interval. We introduce a framework in which we prune the experts for all intervals simultaneously, thereby allocating compute resources between them automatically.

Finally, we design a pruning approach tailored for Text-to-Image (T2I) diffusion models. Precisely, our method prunes a pretrained T2I diffusion model into a set of efficient experts such that each expert is a specialized model for the prompts routed to it. Our method is the first to enable using different amounts of compute resources for various prompt types. We do so using a prompt router model that routes input prompts to a set of architecture codes that determine the sub-network of the model to be used. We design a framework in which we train both the prompt router and the architecture codes in an end-to-end manner.

1.2 Dissertation Outline

The structure of this dissertation follows the organization of the research topics discussed in Sec. 1.1. Accordingly, we divide the dissertation into two parts, addressing the inference efficiency of visual recognition models in Part I (Chapters 2, 3, and 4) and of deep generative models in Part II (Chapters 5, 6, and 7).

In general, pruning and architecture search for neural networks can be formulated as a combinatorial "selection" problem in which one should determine whether to preserve or remove each structural component of the model. In the case of deep over-parameterized models, the search space of model configurations is discrete, complex, and exponentially large, and the design choices are highly non-trivial. Further, evaluating model configurations is extremely costly, as each evaluation requires training the model from scratch.

In this dissertation, we design algorithms to tackle the model pruning and architecture search problems efficiently. The general framework of our ideas is that we convert the discrete optimization problem of pruning and architecture search into a continuous one; by doing so, we can leverage gradient-based optimization techniques to efficiently search for compact, high-performing architectures. In more detail, we implement our "selection" scheme using a neural network that can be trained end-to-end by employing differentiable selection gates and propagating gradients using the straight-through estimator. We briefly describe our design choices for adapting this framework to different discriminative and generative models in the following.
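As a concrete illustration of this recipe, the following minimal sketch (an illustrative toy, not the exact gate used in later chapters; all tensor shapes are arbitrary) implements a differentiable channel-selection gate with a straight-through estimator in PyTorch:

    import torch

    class BinaryGate(torch.autograd.Function):
        # Hard 0/1 decision in the forward pass; straight-through
        # (identity) gradient in the backward pass.
        @staticmethod
        def forward(ctx, logits):
            return (logits > 0).float()

        @staticmethod
        def backward(ctx, grad_output):
            return grad_output  # pretend the gate was the identity map

    logits = torch.randn(128, requires_grad=True)  # learnable score per channel
    features = torch.randn(8, 128, 16, 16)         # a batch of feature maps

    gate = BinaryGate.apply(logits)                # values in {0, 1}
    pruned_features = features * gate.view(1, -1, 1, 1)

    # Any loss on pruned_features now backpropagates into the gate logits,
    # so gradient descent can search the discrete space of sub-networks.
    pruned_features.sum().backward()
    print(logits.grad.shape)                       # torch.Size([128])

During training, a resource regularizer pushes the logits of redundant channels negative; once training converges, channels whose gates are zero can be physically removed from the model.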
The general framework of our ideas is that we convert the discrete optimization problem of pruning and architecture search as a continuous one, and by doing so, we leverage gradient-based optimization techniques to efficiently search for compact, high-performing architectures. In more details, we implement our “selection” scheme using a neural network that can be trained end-to-end by employing differentiable selection gates and propagating gradients using the straight-through estimator. We briefly describe our design choices to adapt our framework for different discriminative and gener- ative models in the following. In the part I, we address inference efficiency of CNNs. CNNs have consistently shown state-of-the-art performance on various computer vision tasks, surpassing Transformers [27, 28] and other counterparts [29, 30]. Therefore, optimizing CNNs’ architectures for inference efficiency is practically crucial. In Chapter 2, Chapter 3, and Chapter 4, we develop techniques for pruning and designing CNN architectures to make them efficient for inference. In Chapter 2, we propose a pruning method that leverages interpretations of a CNN’s predictions to guide its pruning process. Existing channel pruning algorithms approach the pruning problem from various perspectives and use different metrics to guide the pruning process. However, these metrics mainly focus on the model’s ‘outputs’ or ‘weights’ and ne- 7 glect its ‘interpretations’ information. To fill in this gap, we propose to address the channel pruning problem from a novel perspective by leveraging the interpretations of a model to steer the pruning process, thereby utilizing information from both inputs and outputs of the model. However, existing interpretation methods cannot get deployed to achieve our goal as either they are inefficient for pruning or may predict non-coherent explanations. We tackle this challenge by introducing a selector model that predicts real-time smooth saliency masks for pruned models. We parameterize the distribution of explanatory masks by Radial Basis Function (RBF)-like functions to incorporate geometric prior of natural images in our selector model’s inductive bias. Thus, we can obtain compact representations of explanations to reduce the computational costs of our pruning method. We leverage our selector model to steer the network pruning by maximizing the similarity of explanatory representations for the pruned and original models. Chapter 3 presents an efficient kernel size learning method for CNNs. Determining kernel sizes of a CNN model is a crucial and non-trivial design choice and significantly impacts its performance and efficiency. The majority of existing kernel size design methods rely on complex heuristic tricks or leverage neural architecture search that requires extreme computational resources. Thus, learning kernel sizes, using methods such as modeling ker- nels as a combination of basis functions, jointly with the model weights has been proposed as a workaround. However, previous methods cannot achieve satisfactory results or are inefficient for high-resolution and large-scale datasets. To fill this gap, we design an effi- cient kernel size learning method in which a size predictor model learns to predict optimal kernel sizes for a classifier given a desired number of parameters. 
In Part I, we address the inference efficiency of CNNs. CNNs have consistently shown state-of-the-art performance on various computer vision tasks, surpassing Transformers [27, 28] and other counterparts [29, 30]. Therefore, optimizing CNN architectures for inference efficiency is practically crucial. In Chapter 2, Chapter 3, and Chapter 4, we develop techniques for pruning and designing CNN architectures to make them efficient for inference.

In Chapter 2, we propose a pruning method that leverages interpretations of a CNN's predictions to guide its pruning process. Existing channel pruning algorithms approach the pruning problem from various perspectives and use different metrics to guide the pruning process. However, these metrics mainly focus on the model's 'outputs' or 'weights' and neglect its 'interpretations'. To fill this gap, we address the channel pruning problem from a novel perspective by leveraging the interpretations of a model to steer the pruning process, thereby utilizing information from both the inputs and outputs of the model. However, existing interpretation methods cannot be deployed to achieve our goal, as they are either inefficient for pruning or may predict incoherent explanations. We tackle this challenge by introducing a selector model that predicts real-time smooth saliency masks for pruned models. We parameterize the distribution of explanatory masks with Radial Basis Function (RBF)-like functions to incorporate the geometric prior of natural images into our selector model's inductive bias. Thus, we obtain compact representations of explanations that reduce the computational costs of our pruning method. We leverage our selector model to steer network pruning by maximizing the similarity of the explanatory representations of the pruned and original models.

Chapter 3 presents an efficient kernel size learning method for CNNs. Determining the kernel sizes of a CNN is a crucial and non-trivial design choice that significantly impacts its performance and efficiency. The majority of existing kernel size design methods rely on complex heuristic tricks or leverage neural architecture search, which requires extreme computational resources. Thus, learning kernel sizes jointly with the model weights, using methods such as modeling kernels as a combination of basis functions, has been proposed as a workaround. However, previous methods cannot achieve satisfactory results or are inefficient for high-resolution and large-scale datasets. To fill this gap, we design an efficient kernel size learning method in which a size predictor model learns to predict optimal kernel sizes for a classifier given a desired number of parameters. It does so in collaboration with a kernel predictor model that predicts the weights of the kernels - given the kernel sizes predicted by the size predictor - to minimize the training objective, and both models are trained end-to-end. Our method needs only a small fraction of the training epochs of the original CNN to train these two models and find proper kernel sizes for it. Thus, it offers an efficient and effective solution to the kernel size learning problem.

In Chapter 4, we introduce a method to reduce the complexity of the model pruning process. The majority of structural pruning methods require a pretrained model before pruning, which is costly to obtain. We propose a novel structural pruning approach to jointly learn the weights and structurally prune the architectures of CNN models. The core element of our method is a Reinforcement Learning (RL) agent whose actions determine the pruning ratios of the CNN model's layers, with the resulting model's accuracy serving as its reward. We conduct joint training and pruning by iteratively training the model's weights and the agent's policy, and we regularize the model's weights to align with the structure selected by the agent. The evolving model weights result in a dynamic reward function for the agent, which prevents the use of prominent episodic RL methods with a stationary environment assumption for our purpose. We address this challenge by designing a mechanism to model the complex changing dynamics of the reward function and provide a representation of it to the RL agent. To do so, we take a learnable embedding for each training epoch and employ a recurrent model to calculate a representation of the changing environment. We train the recurrent model and embeddings using a decoder model that reconstructs observed rewards. Such a design empowers our agent to effectively leverage episodic observations along with the environment representations to learn a proper policy for determining performant sub-networks of the CNN model.

In Part II, we propose model pruning techniques to improve the inference efficiency of deep generative models. First, we introduce a method to prune conditional GANs in Chapter 5. Although diffusion models have achieved state-of-the-art performance on various generative modeling tasks, they are still computationally expensive and slow to sample from, requiring tens to hundreds of forward passes to generate a sample. In contrast, GANs can generate samples in a single forward pass, making them more practical for real-time applications. Therefore, pruning GANs can prepare them for deployment in low-latency applications. We then address the inference efficiency of diffusion models in Chapter 6 and Chapter 7. We propose a pruning approach for diffusion models that leverages their gradual sampling process in Chapter 6. Finally, we develop a prompt-based pruning framework for text-to-image diffusion models in Chapter 7.

We present our method for pruning conditional GANs in Chapter 5. GANs have shown remarkable success in modeling complex data distributions for image-to-image translation. Still, their high computational demands prohibit their deployment in practical scenarios like edge devices. Existing GAN compression methods mainly rely on knowledge distillation or pruning techniques developed for convolutional classifiers. Thus, they neglect a critical characteristic of GANs: their local density structure over their learned manifold.
Accordingly, we approach GAN compression from a new perspective by explicitly encouraging the pruned model to preserve the density structure of the original parameter-heavy model on its learned manifold. We facilitate this objective for the pruned model by partitioning the learned manifold of the original generator into local neighborhoods around its generated samples. Then, we propose a pruning objective that regularizes the pruned model to preserve the local density structure over each neighborhood, resembling the kernel density estimation method. Also, we develop a collaborative pruning scheme in which the discriminator and generator are pruned by two pruning agents. We design the agents to capture interactions between the generator and discriminator by exchanging their peer's feedback when determining their corresponding models' architectures. Thanks to such a design, our pruning method can efficiently find performant sub-networks and maintain the balance between the generator and discriminator more effectively than baselines during pruning, thereby showing more stable pruning dynamics.

In Chapter 6, we propose a pruning method to reduce the sampling cost of diffusion models. Diffusion models have shown better mode coverage and superior image generation quality compared to GANs. Yet, their sampling process requires numerous denoising steps, making it slow and computationally intensive. We propose to reduce the sampling cost by pruning a pretrained diffusion model into a mixture of efficient experts. First, we study the similarities between pairs of denoising timesteps, observing a natural clustering, even across different datasets. This suggests that, rather than having a single model for all timesteps, separate models can serve as "experts" for their respective time intervals. As such, we separately fine-tune the pretrained model on each interval, with elastic dimensions in depth and width, to obtain experts specialized in their corresponding denoising intervals. To optimize resource usage across experts, we introduce our Expert Routing Agent, which learns to select a set of proper network configurations. By doing so, our method can allocate the computing budget between the experts in an end-to-end manner without requiring manual heuristics. Finally, with a selected configuration, we fine-tune our pruned experts to obtain our mixture of efficient experts.

We present our prompt-based pruning method for text-to-image diffusion models in Chapter 7. Text-to-image (T2I) diffusion models have demonstrated impressive image generation capabilities, synthesizing novel images given an input text prompt. Still, their computational intensity prohibits resource-constrained organizations from deploying T2I models after fine-tuning them on their internal target data. While pruning techniques offer a potential solution to reduce the computational burden of T2I models, static pruning methods use the same pruned model for all input prompts, overlooking the varying capacity requirements of different prompts. Dynamic pruning addresses this issue by utilizing a separate sub-network for each prompt, but it prevents batch parallelism on GPUs. To overcome these limitations, we introduce Adaptive Prompt-Tailored Pruning (APTP), a novel prompt-based pruning method designed for T2I diffusion models.
Central to our approach is a prompt router model, which learns to determine the required capacity for an input text prompt and routes it to an architecture code, given a total desired compute budget for the prompts. Each architecture code represents a specialized model tailored to the prompts assigned to it, and the number of codes is a hyperparameter. We train the prompt router and architecture codes using contrastive learning, ensuring that similar prompts are mapped to nearby codes. Further, we employ optimal transport to prevent the codes from collapsing onto a single one. We demonstrate APTP's effectiveness by pruning Stable Diffusion (SD) V2.1 using CC3M and COCO as target datasets. APTP outperforms single-model pruning baselines in terms of FID, CLIP, and CMMD scores. Our analysis of the clusters learned by APTP reveals that they are semantically meaningful. We also show that APTP can automatically discover prompts previously found empirically to be challenging for SD, e.g., prompts for generating text images, assigning them to higher-capacity codes.

Finally, we conclude the dissertation in Chapter 8 by summarizing our contributions, discussing future directions, and putting our work in the context of the broader research landscape.

Part I: Pruning and Efficient Architecture Search Techniques for Convolutional Neural Networks

Chapter 2: Interpretations Steered Network Pruning via Amortized Inferred Saliency Maps

2.1 Introduction

Convolutional Neural Networks (CNNs) have been continuously achieving state-of-the-art results on various computer vision tasks [31, 32, 33, 34, 35, 36, 37], but the resource requirements of popular deep models [38, 39, 40] are also exploding. Their substantial computational and storage costs prohibit deploying these models on edge and mobile devices, making CNN compression a crucial task. Many ideas have attempted to address this problem by reducing models' sizes while maintaining their prediction performance. These ideas can usually be classified into one of the main model compression categories: weight pruning [41], weight quantization [42, 43], structural pruning [44], knowledge distillation [45], neural architecture search [46], etc. We focus on pruning channels of CNNs (structural pruning) since it can effectively and practically reduce the computational costs of a deep model without any post-processing steps or specially designed hardware.

Although existing channel pruning methods have achieved excellent results, they do not consider the model's interpretations during the pruning process. They tackle the pruning problem from various perspectives such as reinforcement learning [46], greedy search [47], and evolutionary algorithms [48]. In addition, they utilize a wide range of metrics like channels' norms [44], loss [49], and accuracy [50] as guidance for pruning the model. Thus, they emphasize the model's outputs or weights but ignore its valuable interpretation information.

We aim to approach the structural model pruning problem from a novel perspective by exploiting the model's interpretations (a subset of input features called saliency maps) to steer the pruning. Our intuition is that the saliency maps of the pruned model should be similar to those of the original model. However, existing interpretation methods are either inefficient or unreliable for pruning.
Firstly, locally linear models (e.g., LIME [51] and SHAP [52]) fit a separate linear model to explain the behavior of a nonlinear classifier in the vicinity of each data point. However, they need to fit a new model at each pruning iteration in which the classifier's architecture changes, which makes them inefficient for pruning. Secondly, previous works [53, 54] empirically observed that the feature importance assignments of gradient-based methods (e.g., Grad-CAM [55] and DeepLIFT [56]) might not be more meaningful than random ones. Moreover, Srinivas and Fleuret [57] theoretically showed that the input gradients used by these methods might seem explanatory because they are related to an implicit generative model hidden in classifiers [58], not to their discriminative function. Thus, their use for interpreting classifiers should be avoided. Finally, perturbation-based methods [59, 60] need multiple forward passes and rely on perturbed samples that are out-of-distribution for the trained model [53] to obtain its explanations. Hence, they are neither efficient nor reliable for pruning. Different from the aforementioned methods, Amortized Explanation Models (AEMs) [4, 61, 62] provide a theoretical framework for obtaining a model's interpretations. They train a fast saliency prediction model that can be applied in real-time systems, as it provides saliency maps with a single forward pass, making it suitable for pruning. We refer to Section 2.2 for more discussion of interpretation methods.

In this chapter, we first present a new AEM method that overcomes the disadvantages of previous AEM models, and we then employ it to prune convolutional classifiers. Previous AEMs [4, 61, 62] cannot be applied to guide pruning due to several key drawbacks. REAL-X [4] proved that L2X [61] and INVASE [62] can suffer from degenerate cases in which the saliency map selector predicts meaningless explanations. Although REAL-X overcomes this problem, it generates masks independently for each input feature (pixel). Thus, it neglects the geometric prior [63] of natural images that adjacent features (pixels) often correlate with each other. We empirically show in Section 2.3.3 and Fig. 2.1 that the saliency maps predicted by REAL-X may lack visual interpretability. In addition, the provided explanations have the same size as the input image, which adds non-trivial computational costs when used for pruning. We propose a novel AEM model to tackle these problems. In contrast with REAL-X, which assumes feature independence, we embed a proper geometric prior in our model. We use a Radial Basis Function (RBF)-like function to parameterize the saliency masks' distribution. By doing so, the mask generation is no longer independent for each pixel in our framework. Moreover, it enables us to infer explanations for each image with only three parameters (the center coordinates and a kernel expansion parameter), saving substantial computation. We utilize such compact saliency representations to steer network pruning by reconstruction in real time. We also find that merging guidance from the model's interpretations and outputs can further improve the pruning results. Our experimental results on benchmark datasets illustrate that our interpretation-steered pruning method consistently achieves superior performance compared to baselines. Our contributions are as follows:

• We propose a novel structural pruning method for CNNs designed from a new and different perspective compared to existing methods.
We utilize the interpretations of the model's decisions to steer the pruning procedure. By doing so, we effectively merge the guidance from the model's interpretations and outputs to discover high-performing sub-networks.

• We introduce a new Amortized Explanation Model (AEM) that embeds a proper geometric prior for natural images in its inductive bias, enabling it to predict smooth explanations for input images. We parameterize the distribution of saliency masks using RBF-like functions. Thus, our AEM can provide compact explanatory representations and save computational costs. Further, it empowers us to dynamically obtain saliency maps of pruned models and leverage them to steer the pruning procedure.

The contents of this chapter are based on our work [64] published in ECCV 2022.

2.2 Related Works

2.2.1 Interpretation Methods

Interpretation methods can be classified into four [4] main categories:

1. Gradient-based methods such as CAM [65], Grad-CAM [55], DeepLIFT [56], and LRP [66] rely on the gradients of a model's outputs w.r.t. its input features and assume that features with larger gradients have more influence on the model's outcome [67, 68, 69], which has been shown not to be a necessarily valid assumption [70]. In addition, their feature importance assignments might not be more meaningful than random ones [53, 54, 57], which makes them unreliable for pruning. Further, Srinivas and Fleuret [57] theoretically proved that input gradients are equal to the score function of the implicit generative model in classifiers [58] and are not related to the discriminative function of classifiers. Thus, they are not interpretations of the model's predictions.

2. Perturbation-based models explore the effect of perturbing input features on the model's output or inner layers to infer their importance [59, 60, 71]. Yet, they are inefficient for pruning as they need multiple forward passes to obtain importance scores. Also, they may underestimate features' importance [56].

3. Locally linear models fit a linear model to approximate the behavior of a classifier in the vicinity of each data point [51, 52]. However, they require fitting a new model for each sample whenever the model's architecture changes during pruning, which makes them inefficient for pruning. Also, they rely on the classifier's outputs for out-of-distribution samples to train the linear model [53], which makes them unreliable.

4. Amortized Explanation Models (AEMs) [4, 61, 62, 72] overcome the inefficiencies of the previous methods by training a global model - called a selector [4] - that amortizes the cost of inferring saliency maps for each sample by selecting salient input features with a single forward pass. AEMs [4, 61, 62] provide a theoretical framework for training the selector model. To do so, they use a second predictor model that estimates the classifier's output target distribution given an input masked by the selector model's predicted mask. L2X [61] and INVASE [62] jointly train the selector and predictor. However, REAL-X [4] proved that doing so results in degenerate cases. REAL-X overcame this problem by training the predictor model separately with random masks. However, we show in Section 2.3.3 that its predicted masks may not be interpretable for complex image classifiers. Our conjecture is that it neglects the geometric prior [63] of natural images, namely that nearby pixels correlate more with each other.
2.2.2 Network Compression

Weight pruning [41] and quantization [42, 43], structural pruning [44, 73, 74, 75, 76, 77, 78, 79, 80, 81], knowledge distillation [45], and NAS [46] are popular directions for compressing CNNs. Structural pruning has attracted more attention as it can readily decrease the computational burden of CNN models without any specific hardware changes. Early channel pruning methods [44] propose that channels with larger norms are more critical and remove weights/filters with small L1/L2 norms. An L1 penalty can also be applied to the scaling factors of batch normalization [82] to remove redundant channels [83]. Recent channel pruning methods adopt more sophisticated designs. Automatic model compression [46] learns the width of each layer with reinforcement learning. MetaPruning [50] generates parameters for sub-networks and uses evolutionary algorithms to find the best sub-network. Greedy subnetwork selection [47] greedily chooses each channel based on its L2 norm. Pruning can also be used for fairness [84]. We refer to [85] for a more detailed discussion of pruning techniques.

2.2.3 Network Pruning Using Interpretations

A few recent works attempt to use interpretations of a model to determine importance scores for its weights. Sabih et al. [86] leverage DeepLIFT [56]; Yeom et al. [87] use LRP [66]; and Yao et al. [88] utilize activation maximization [59] to determine weights' importance. However, all of these approaches rely on gradient-based methods whose predictions, as discussed above, are unreliable and should not be used as the model's interpretations. Alqahtani et al. [89] visualize feature maps in the input space and use a segmentation model to find the filters that have the highest alignment with visual concepts. Nonetheless, their method needs an accurate segmentation model to find reliable importance scores for filters, which may not be available in some domains. We develop a new AEM model that is theoretically supported and improves upon REAL-X [4]. Moreover, in contrast with these methods, our pruning method finds the optimal sub-network end-to-end. We also show in Section 2.4.2 that our model outperforms [89].

2.3 Methodology

2.3.1 Overview

We present a novel pruning method in which we steer the pruning process of CNN classifiers using feature-wise interpretations of their decisions. First, we develop a new, intuitive AEM model that overcomes the limitations of REAL-X [4] (the state-of-the-art AEM) by incorporating the geometric prior of high correlation between adjacent input features (pixels) [63] of images into the inductive bias of our AEM model. We parameterize the distribution of saliency masks using Radial Basis Function (RBF)-style functions. By doing so, we can represent interpretations (saliency maps) of input images compactly. Then, we elaborate on our pruning method, in which we leverage our AEM model to provide interpretations of the original and pruned classifiers. Our intuition is that the saliency maps of the original and pruned models should be similar. Thus, we propose a new loss function for pruning that encourages the pruned model to have saliency explanations similar to the original one. In the following sub-sections, we introduce AEM methods and empirically show the limitations of REAL-X. Then, we elaborate on our method and its intuitions for tackling the drawbacks of previous AEMs. Finally, we present our pruning scheme.
2.3.2 Notations

We denote our dataset as $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ such that $(x, y) \sim P(x, y)$, where $P$ is the unknown underlying joint distribution over features and targets, and we assume $x \in \mathbb{R}^D$ and $y \in \{1, 2, \ldots, K\}$. We denote the $j$th feature of sample $x$ by $x_j$ and represent a mask $m$ by the indices of the input features that it preserves, i.e., $m \subseteq \{1, 2, \ldots, D\}$. A masked input $m(x)$ is defined as follows:

$$[m(x)]_j = \text{mask}(x, m)_j = \begin{cases} x_j & j \in m \\ 0 & \text{otherwise}^1 \end{cases} \tag{2.1}$$

We refer to the model that we aim to prune as the 'classifier' in the following sections.

¹We use zero values for the masked input features, following the literature [4, 61, 62].

2.3.3 Amortized Explanation Models (AEMs)

AEMs are a subgroup of Instance-Wise Feature Selection (IWFS) methods that aim to compute a mask of minimum cardinality for each input sample that preserves its outcome-related features. An outcome may be a classifier's predictions (usually calculated as a softmax distribution) for interpretation purposes. It can also be the population distribution of the targets (one-hot representations) when performing dimensionality reduction on the original raw data [4, 61, 62]. Although previous works [4, 61, 62] describe their formulations for the latter, we focus on the former here. Concretely, let $Q_{class}(y \mid \mathbf{x})$ be the classifier's conditional distribution of targets given input features; the objective of AEM models is to find a mask $m(x)$ for each sample $x$ such that

$$Q_{class}(y \mid \mathbf{x} = x) = Q_{class}(y \mid \mathbf{x} = m(x)) \tag{2.2}$$

AEMs tackle this problem by training a global model, called the selector, that learns to predict a local (sample-dependent) mask $m(x)$ for each sample $x$ [4]. They train the selector by encouraging it to follow Eq. 2.2. To do so, one should quantify the discrepancy between the RHS and LHS of Eq. 2.2 when the selector model generates the mask $m$ in the RHS. The LHS can be readily calculated by forwarding the sample $x$ through the classifier. However, the classifier should not be used to compute the RHS, because the masked sample $m(x)$ is an out-of-distribution input for it [4]. AEMs solve this issue by training a predictor model that predicts the classifier's conditional distribution given a masked input (the RHS of Eq. 2.2). Then, they train the selector guided by supervision from the predictor. We present the formulation of REAL-X [4] in supplementary S2.1.

2.3.3.1 Visualization of REAL-X Predictions

We visualize the predicted explanations of REAL-X for a ResNet-56 model [39] trained on CIFAR-10 [5] in Fig. 2.1(a) (we refer to supplementary S2.3 for implementation details). As can be seen, the formulation of REAL-X cannot guide the selector model to learn to select a coherent subset of input pixels on the salient parts of the images. Thus, it may not provide interpretable explanations for the classifier. Our conjecture for the cause is that the formulation of REAL-X does not include a proper inductive bias related to natural images in the selector model. Typically, nearby pixels' values and semantic information are more correlated in natural images, which is known as their geometric prior [63]. REAL-X does not have such a prior in its formulation because it factorizes the explanatory mask distribution given an input $x$ as:

$$q_{sel}(m \mid x; \beta) = \prod_{i=1}^{D} q_i(m_i \mid x; \beta) \tag{2.3}$$

where $q_i(m_i \mid x; \beta) \sim \text{Bernoulli}((f_\beta(x))_i)$ and $f_\beta(x)$ is the selector model parameterized by $\beta$; i.e., the distribution over the selector's output mask is factorized as a product of marginal Bernoulli distributions over the mask's elements, and the parameter of each element is calculated independently. Hence, the selector model lacks the inductive bias that the parameters of nearby Bernoulli distributions should be close to each other to make the sampled masks coherent. Instead, it would have to 'discover' such a prior during training, which is infeasible with limited data and training epochs in practice.
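To make Eqs. 2.1 and 2.3 concrete, below is a minimal sketch of zero-fill masking and of REAL-X-style independent per-pixel Bernoulli sampling; the names are illustrative, and in REAL-X the per-pixel probabilities would come from the selector $f_\beta(x)$ rather than the random stand-in used here.

```python
import torch

def apply_mask(x: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
    """Eq. 2.1: keep feature x_j where m_j = 1 and fill masked features with zero."""
    return x * m

def sample_independent_mask(probs: torch.Tensor) -> torch.Tensor:
    """Eq. 2.3: every mask pixel is an independent Bernoulli draw, so nothing
    encourages neighboring pixels to agree -- masks can be spatially scattered."""
    return torch.bernoulli(probs)

x = torch.randn(3, 32, 32)      # an image of shape (C, H, W)
probs = torch.rand(1, 32, 32)   # stand-in for the selector's per-pixel parameters
masked_x = apply_mask(x, sample_independent_mask(probs))
```

The independent draws are precisely what produce the scattered masks visible in Fig. 2.1(a).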
Figure 2.1: Input features selected by (a) REAL-X [4] and (b) our model to explain the decisions of a ResNet-56 classifier on samples from CIFAR-10 [5]. In the sub-figures, from left to right: the 1st column shows the original image; both models output an array (2nd column) whose values are the parameters of the predicted Bernoulli distributions over the corresponding mask pixels; the 3rd column shows masks generated such that a pixel's value is one if its predicted Bernoulli parameter is greater than 0.5 and zero otherwise; the 4th column shows the masked inputs. Our model's explanations are easier to interpret than those of REAL-X, which may appear random for some samples.

2.3.4 Proposed AEM Model

We introduce a new selector scheme that respects the proximity geometric prior. To do so, we assume that the parameters of the Bernoulli distributions of the mask pixels have a Radial Basis Function (RBF)-style functional form over the pixel locations. The center of the RBF kernel should lie on the salient part of the image most relevant to the classifier's prediction, and the Bernoulli parameters should decrease as the pixel location moves away from the kernel's center. A parameter $\sigma$ controls the area of a mask. Our assumption is reasonable for multi-class classifiers, in which a single object/region of the input image typically determines the target class. Formally, for a 2D mask whose coordinates are parameterized by $(z, t)$ and a 2D RBF kernel with parameters $(c_z, c_t, \sigma)$, we calculate the Bernoulli parameter (BP) of a pixel at location $(z, t)$ as follows:

$$f_{BP}(z, t; c_z, c_t, \sigma) = \exp\left(-\frac{1}{2\sigma^2}\left[(z - c_z)^2 + (t - c_t)^2\right]\right) \tag{2.4}$$

This formulation has two crucial benefits: 1) It ensures that the Bernoulli parameters of a mask's proximal pixels are close to each other; thus, the resulting sampled masks are much more coherent and smooth than those of REAL-X. 2) It significantly simplifies the selector model's task. In REAL-X, the selector must learn to calculate a Bernoulli parameter for each pixel, which amounts to, for instance, 224 × 224 = 50176 independent functions for standard ImageNet [31] training. In contrast, in our formulation, the selector only needs to accurately estimate three values: the center coordinates $(c_z, c_t)$ and an expansion parameter $\sigma$ of the RBF kernel. Given the estimated values, the Bernoulli parameters of the output mask's pixels can be readily calculated by Eq. 2.4. In other words, if the input images have spatial dimensions $M \times N$, and we denote the selector function (implemented by a deep neural network) by $f_{sel}(x; \beta)$, our selector's distribution over masks given input images is:

$$[c_z, c_t, \sigma] = f_{sel}(x; \beta)$$
$$q_{i,j}(m_{i,j} \mid x; \beta) = \text{Bernoulli}(f_{BP}(i, j; c_z, c_t, \sigma))$$
$$q_{sel}(m \mid x; \beta) = \prod_{i=1}^{M} \prod_{j=1}^{N} q_{i,j}(m_{i,j} \mid x; \beta) \tag{2.5}$$

In Eq. 2.5, $\beta$ denotes the selector's parameters, and we illustrate an RBF kernel predicted by our selector in Fig. 2.2.
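For concreteness, here is a minimal sketch of Eqs. 2.4 and 2.5 (function and variable names are illustrative): given the three values $(c_z, c_t, \sigma)$ predicted by the selector, the Bernoulli parameter of every mask pixel follows the RBF form, so nearby pixels receive similar probabilities by construction.

```python
import torch

def rbf_bernoulli_params(cz: float, ct: float, sigma: float,
                         M: int, N: int) -> torch.Tensor:
    """Eq. 2.4: Bernoulli parameter of every pixel (z, t) from one RBF kernel."""
    z = torch.arange(M, dtype=torch.float32).view(-1, 1)  # row coordinates
    t = torch.arange(N, dtype=torch.float32).view(1, -1)  # column coordinates
    sq_dist = (z - cz) ** 2 + (t - ct) ** 2
    return torch.exp(-sq_dist / (2.0 * sigma ** 2))

# Eq. 2.5: the selector predicts only three numbers per image; the full
# M x N map of Bernoulli parameters follows in closed form, and a sampled
# mask is a coherent blob around the predicted center.
bp = rbf_bernoulli_params(cz=16.0, ct=20.0, sigma=6.0, M=32, N=32)
mask = torch.bernoulli(bp)
```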
In summary, our intuition is that by incorporating the geometric prior into the inductive bias of our framework, the selector searches for a proper functional form for the Bernoulli parameters over pixel locations within the RBF family of functions, rather than among all possible ones. As a result, it can find the optimal functional form more readily and robustly. Moreover, our selector model provides a real-time, compact representation (the RBF parameters) of saliency maps, which enables us to efficiently compare the interpretations of the original and pruned models to steer the pruning process (Section 2.3.6, Fig. 2.3).

2.3.5 AEM Training

We train our selector model by encouraging it to generate an explanatory mask $m$ for each sample $x$ that follows Eq. 2.2. To do so, as mentioned in Section 2.3.3, we need to estimate the classifier's conditional distribution of targets given masked inputs (the RHS of Eq. 2.2). Such an estimate quantifies the quality of a mask generated by the selector model by measuring the discrepancy between the LHS and RHS of Eq. 2.2.

2.3.5.1 Predictor Model

We train a predictor model to calculate the classifier's conditional distribution of targets given a masked input (the RHS of Eq. 2.2). As we designed our selector to predict RBF-style masks (Eq. 2.5), we train our predictor to predict the classifier's output distribution when the input is masked by a random RBF-style mask. Using random RBF masks allows us to mimic any potential RBF-masked input. Hence, our predictor's training objective is:

$$\min_\theta \; L_{pred}(\theta) = \mathbb{E}_{x \sim P(x)} \, \mathbb{E}_{c'_z, c'_t, \sigma'} \left[ \mathbb{E}_{m' \sim B(m \mid c'_z, c'_t, \sigma')} \, L_\theta(x, m'(x)) \right] \tag{2.6}$$

where $L_\theta(\cdot, \cdot)$ and $B(\cdot)$ are defined as:

$$L_\theta(x, m'(x)) = KL\big(Q_{class}(y \mid \mathbf{x} = x), \; q_{pred}(y \mid \mathbf{x} = m'(x); \theta)\big)$$
$$B(m \mid c'_z, c'_t, \sigma') = \prod_{i=1}^{M} \prod_{j=1}^{N} \text{Bernoulli}(f_{BP}(i, j; c'_z, c'_t, \sigma')) \tag{2.7}$$

In Eq. 2.6, $L_\theta$ forms the predictor's objective for learning the classifier's conditional distribution of targets given masked inputs (the RHS of Eq. 2.2), $B(\cdot)$ generates random RBF-style masks (via $f_{BP}$), and KL denotes the Kullback-Leibler divergence [90].

Figure 2.2: Our AEM model. The goal is to train the selector model on the right (the U-Net shown with a dashed line) to predict interpretations (saliency maps) of the classifier for each input sample. We train the selector by encouraging it to follow Eq. 2.2. (Left): We train a predictor model that learns to predict the classifier's output distribution given a masked input (the RHS of Eq. 2.2). We do so using inputs masked by random RBF masks, since our selector's masks have RBF style (Sec. 2.3.4). (Right): Given the trained predictor, we train the selector model using Obj. 2.8, which enforces Eq. 2.2. We use the classifier's convolutional backbone as the encoder of the selector and only train its decoder, for computational efficiency. Then, we use the trained decoder to prune the encoder (Fig. 2.3).
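As an illustrative sketch of Eqs. 2.6 and 2.7 (assuming both networks return logits, and reusing the `rbf_bernoulli_params` helper from the previous sketch), one predictor update draws a random RBF mask and matches the predictor's output on the masked input to the classifier's output on the clean input:

```python
import torch
import torch.nn.functional as F

def predictor_step(classifier, predictor, x, M, N, optimizer):
    """One optimization step of Eq. 2.6 (illustrative sketch)."""
    # Random RBF parameters: c'_z ~ U[0, M], c'_t ~ U[0, N],
    # sigma' ~ U[0, 2 * max(M, N)] (offset slightly to avoid division by zero).
    cz = torch.rand(1).item() * M
    ct = torch.rand(1).item() * N
    sigma = torch.rand(1).item() * 2 * max(M, N) + 1e-3
    mask = torch.bernoulli(rbf_bernoulli_params(cz, ct, sigma, M, N))

    with torch.no_grad():
        target = F.softmax(classifier(x), dim=-1)           # Q_class(y | x)
    log_pred = F.log_softmax(predictor(x * mask), dim=-1)   # q_pred(y | m'(x))
    loss = F.kl_div(log_pred, target, reduction="batchmean")  # KL term of Eq. 2.7

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```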
Now, we must define the distributions of the parameters $c'_z$, $c'_t$, and $\sigma'$ of a random RBF function. Assume that the origin of our 2D coordinate system is at the top left of an input image with spatial dimensions $M$, $N$. In theory, $c'_z$ and $c'_t$ can take any real values, and $\sigma'$ can be any positive real number in Eq. 2.4. However, given that the salient part(s) lie inside the image region, we are interested in the predictor learning to correctly estimate $Q_{class}(y \mid \mathbf{x} = m(x))$ (the RHS of Eq. 2.2) when the selector predicts that the center of the RBF kernel is inside the image area. Hence, we assume that the distributions of $c'_z$ and $c'_t$ are uniform across the image dimensions, i.e., $c'_z \sim U[0, M]$ and $c'_t \sim U[0, N]$. In addition, the parameter $\sigma'$ determines the degree to which an RBF kernel expands over the image, and values $\sigma' \geq 2\max\{M, N\}$ practically yield the same Bernoulli parameter for all mask pixels when $c'_z$ and $c'_t$ are inside the image region. Thus, we can reasonably assume $\sigma' \sim U[0, 2\max\{M, N\}]$ for training the predictor in practice.

2.3.5.2 Selector Training

Given a predictor model $q_{pred}$ trained with random RBF masks, we train our selector model with the following objective:

$$\min_\beta \; L_{sel}(\beta) = \mathbb{E}_{x \sim P(x)} \, \mathbb{E}_{m' \sim q_{sel}(m \mid x; \beta)} \left[ L(x, m'(x)) + \lambda_1 R(m') + \lambda_2 S(m') \right] \tag{2.8}$$

where $L(\cdot, \cdot)$, $R(\cdot)$, and $S(\cdot)$ are defined as:

$$L(x, m'(x)) = KL\big(Q_{class}(y \mid \mathbf{x} = x), \; q_{pred}(y \mid \mathbf{x} = m'(x))\big)$$
$$R(m') = \|m'\|_0$$
$$S(m') = \sum_{i=1}^{M} \sum_{j=1}^{N} \left[ (m'_{i,j} - m'_{i+1,j})^2 + (m'_{i,j} - m'_{i,j+1})^2 \right] \tag{2.9}$$

$L(x, m'(x))$ encourages the selector to follow Eq. 2.2, as $q_{pred}(y \mid \mathbf{x} = m'(x))$ approximates the RHS of Eq. 2.2 given an input masked by the RBF mask predicted by the selector. $R(m')$ regularizes the number of selected features, and the smoothness loss $S(m')$ further encourages the selector to output smooth masks. As Eq. 2.8 requires sampling from the distribution predicted by the selector, direct backpropagation of gradients to its parameters $\beta$ is not possible. Thus, we use the Gumbel-Sigmoid trick [91, 92] to train the model. We use a U-Net [93] architecture to implement the selector module of our AEM model, as shown in Fig. 2.2. We refer to supplementary S2.3 for more details of our AEM training.
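The following sketch summarizes the selector objective (Eqs. 2.8 and 2.9) under stated assumptions: `target_dist` and `pred_dist` are probability vectors (the classifier's output on the clean input and the predictor's output on the masked input), and a Gumbel-Sigmoid relaxation replaces the hard Bernoulli draw so gradients reach the selector's parameters. Names and the boundary handling of the smoothness term are illustrative.

```python
import torch
import torch.nn.functional as F

def gumbel_sigmoid(probs: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Relaxed, differentiable Bernoulli sample (Gumbel-Sigmoid trick, sketch)."""
    eps = 1e-8
    u = torch.rand_like(probs).clamp(eps, 1.0 - eps)
    logistic_noise = torch.log(u) - torch.log(1.0 - u)
    logits = torch.log(probs + eps) - torch.log(1.0 - probs + eps)
    return torch.sigmoid((logits + logistic_noise) / tau)

def selector_loss(target_dist, pred_dist, mask, lam1=1e-4, lam2=1e-4):
    """Eq. 2.8 with the three terms of Eq. 2.9 (illustrative)."""
    fidelity = F.kl_div((pred_dist + 1e-8).log(), target_dist,
                        reduction="batchmean")
    sparsity = mask.sum()  # relaxed surrogate for ||m'||_0
    # Smoothness over neighboring pixels (valid-index version of S(m')).
    smooth = ((mask[..., 1:, :] - mask[..., :-1, :]) ** 2).sum() \
           + ((mask[..., :, 1:] - mask[..., :, :-1]) ** 2).sum()
    return fidelity + lam1 * sparsity + lam2 * smooth

# Usage: bp = rbf_bernoulli_params(...); mask = gumbel_sigmoid(bp) yields a
# differentiable mask that is fed to the predictor to compute `pred_dist`.
```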
2.3.6 Pruning

In this section, we introduce our pruning method, which leverages interpretations of a classifier to steer its pruning process. Our intuition is that the interpretations (saliency maps) of the original and pruned classifiers should be similar. Thus, we design our pruning method as follows. As discussed in Section 2.3.5 and Fig. 2.2, we use the convolutional backbone of the classifier as the encoder of the U-Net architecture for the selector model. We keep the encoder weights frozen and only train the decoder when training the selector model, for computational efficiency (Fig. 2.2). Furthermore, doing so gives us the flexibility to keep the decoder frozen and prune the encoder such that the pruned model produces output RBF parameters similar to those of the original model (Fig. 2.3).

Figure 2.3: Our pruning method. The classifier to be pruned is shown on top (Conv layers and FC). The U-Net (Conv layers and the Decoder) is our trained selector model, which predicts the RBF parameters of each input's saliency map for the classifier. The selector is trained with the pretrained backbone of the classifier as its frozen encoder (see Fig. 2.2). Thus, we freeze the selector's and classifier's weights and insert our pruning gates between the selector's encoder layers to prune the classifier. Given a pruning pair (a sample and the RBF parameters of its saliency map for the original classifier), we train the gate parameters so that the pruned model has interpretations ($L_{interpr}$) and accuracy ($L_{class}$) similar to the original classifier while requiring fewer computational resources ($L_{Res}$).

Formally, we employ our trained selector model to predict saliency maps of the original classifier for the training samples. For each sample $x_k$, it provides the parameters of the RBF kernel of its saliency map as $C_{x_k} = [c_z^k, c_t^k, \sigma^k]$. Then, we insert our pruning gates, parameterized by $\theta_g$, between the layers of the encoder. We denote the architecture vector generated by the gates by $v$. Finally, we prune the encoder (the classifier's backbone) by optimizing the gate parameters to maintain interpretations and accuracy similar to the original classifier while reducing its computational budget:

$$\min_{\theta_g} \; \mathcal{L}(f(x; W, v), y) + \gamma_1 \left\| C_x - f_{sel}(x; \beta, v) \right\|_2^2 + \gamma_2 R_{res}(T(v), p\,T_{all}) \tag{2.10}$$

where $\mathcal{L}(\cdot, \cdot)$ is the classification loss, and $f(\cdot; W, v)$ denotes our classifier (the encoder of the U-Net and the FC layer in Fig. 2.3), parameterized by the weights $W$ and the sub-network selection vector $v$. $f_{sel}(x; \beta, v)$ is our trained selector model ($f_{sel}(x; \beta)$ in Eq. 2.5) augmented with the architecture vector $v$ after inserting the pruning gates into its encoder. We calculate $v$ using the Gumbel-Sigmoid function $g(\cdot)$: $v = g(\theta_g)$ [91, 92], which controls the openness or closeness of each channel. The second term in Eq. 2.10 uses the interpretations of the original and pruned classifiers to steer pruning through the selector model $f_{sel}(x; \beta, v)$ by encouraging the similarity of their predicted RBF parameters. $R_{res}$ is the FLOPs regularization that ensures the pruned model reaches the desired FLOPs rate $p\,T_{all}$, where $T_{all}$ is the total prunable FLOPs of the model, $T(v)$ is the current FLOPs rate determined by the sub-network vector $v$, and $p$ controls the pruning rate. $\gamma_1$ and $\gamma_2$ are hyperparameters controlling the strength of the corresponding terms. During pruning, we only optimize $\theta_g$ and keep $W$ and $\beta$ frozen.

We emphasize that our amortized explanation selector model, $f_{sel}(x; \beta, v)$, enables us to readily perform interpretation-steered pruning because it dynamically predicts each sample's saliency-map RBF parameters ($[c_z, c_t, \sigma]$) for the current sub-network vector $v$ with a single forward pass. In contrast, optimization-based explanation methods [51, 52] need to fit a new model, and perturbation-based methods [59, 60, 71] must make multiple forward passes for each newly selected sub-network to obtain its explanations; therefore, they are inefficient for this goal. We provide the detailed parameterization of the channels ($g(\cdot)$) and $R_{res}$ in supplementary S2.3.1.
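For illustration, the sketch below assembles the three terms of Obj. 2.10; the signatures `classifier(x, v)`, `selector(x, v)`, and `flops_fn(v)` are assumptions for this sketch (the exact parameterization of the gates and of $R_{res}$ is given in supplementary S2.3.1), and a squared penalty stands in for the FLOPs regularizer.

```python
import torch
import torch.nn.functional as F

def pruning_loss(classifier, selector, x, y, rbf_target, v,
                 flops_fn, budget, gamma1=1.0, gamma2=2.0):
    """Obj. 2.10 (illustrative sketch), optimized w.r.t. the gate logits only.

    `rbf_target` holds C_x = [c_z, c_t, sigma] precomputed with the original
    model; `v` is the architecture vector produced by the Gumbel-Sigmoid
    gates; `flops_fn` maps v to the current prunable-FLOPs ratio T(v).
    """
    cls_loss = F.cross_entropy(classifier(x, v), y)        # L(f(x; W, v), y)
    interp_loss = ((selector(x, v) - rbf_target) ** 2).sum(-1).mean()  # RBF match
    res_loss = (flops_fn(v) - budget) ** 2                 # stand-in for R_res
    return cls_loss + gamma1 * interp_loss + gamma2 * res_loss
```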
Figure 2.4: (a) Test accuracy of different mask parameterization schemes (RBF (ours) vs. independent (REAL-X [4])). (b) Test accuracy with/without using the classification loss. All results are over 3 runs with ResNet-56 on CIFAR-10; shaded areas represent variance.

2.4 Experiments

We use CIFAR-10 [5] and ImageNet [31] to validate the effectiveness of our proposed model. We refer to supplementary S2.3 for the details of our experimental setup. We call our method ISP (Interpretations Steered Pruning) in the experiments.

2.4.1 Analysis of Different Settings

Before formally presenting our experimental results against competitive methods, we study the effect of different design choices for our model's components on its performance. We keep the resource regularization term ($R_{res}$) of Obj. 2.10 and add/drop the other terms in all settings.

In our first experiment, we explore the impact of $\gamma_1$ by using only interpretations (the second term of Obj. 2.10) to steer the pruning. Fig. 2.5(a,b) and Fig. 2.4(a) show the results. We can observe in Fig. 2.5(a,b) that small $\gamma_1$ values (e.g., 0.1) result in a weaker supervision signal from the interpretations and make the exploration of sub-networks unstable (showing high variance), whereas larger values make the training smooth. Fig. 2.4(a) illustrates the influence of the RBF (Eq. 2.5, ours) versus independent (Eq. 2.3, REAL-X [4]) mask parameterization schemes. Our RBF-style model yields better performance than the independent parameterization, which becomes unstable and less effective as training proceeds. The instability possibly arises because the pruning gets trapped in local minima due to noisy and unstructured masks. We can also observe that interpretations on their own provide stable and efficient signals for pruning.

In our second experiment, we examine the impact of $\gamma_2$ while utilizing all three terms of Obj. 2.10 for pruning. Fig. 2.5(c,d) indicates that a small $\gamma_2$ (e.g., 1.0) yields higher accuracy but may not push the FLOPs regularization to 0, i.e., reach the predefined pruning rate $p$. Larger values satisfy the resource constraint while showing acceptable performance.

Finally, we examine the performance of different combinations of the components in Obj. 2.10. The results are shown in Fig. 2.4(b). Specifically, 'w/o Classification Loss' repre