ABSTRACT

Title of Dissertation: PRUNING FOR EFFICIENT DEEP LEARNING: FROM CNNS TO GENERATIVE MODELS

Alireza Ganjdanesh
Doctor of Philosophy, 2025

Dissertation Directed by: Professor Heng Huang, Department of Computer Science

Deep learning models have shown remarkable success in visual recognition and generative modeling tasks in computer vision in the last decade. A general trend is that their performance improves with an increase in the size of their training data, model capacity, and training iterations on modern hardware. However, the increase in model size naturally leads to higher computational complexity and memory footprint, thereby necessitating high-end hardware for deployment. This trade-off prevents the deployment of deep learning models in resource-constrained environments such as robotic applications, mobile phones, and edge devices employed in the Artificial Internet of Things (AIoT). In addition, private companies and organizations have to spend significant resources on cloud services to serve deep models for their customers.

In this dissertation, we develop model pruning and Neural Architecture Search (NAS) methods to improve the inference efficiency of deep learning models for visual recognition and generative modeling applications. We design our methods to be tailored to the unique characteristics of each model and its task.

In the first part, we present model pruning and efficient NAS methods for Convolutional Neural Network (CNN) classifiers. We start by proposing a pruning method that leverages interpretations of a pretrained model's decisions to prune its redundant structures. Then, we provide an efficient NAS method that learns the kernel sizes of a CNN model from its training dataset and a given parameter budget, enabling the design of efficient CNNs customized for their target application. Finally, we develop a framework for simultaneous pretraining and pruning of CNNs, which combines the first two stages of the pretrain-prune-finetune pipeline commonly used in model pruning and reduces its complexity.

In the second part, we propose model pruning methods for visual generative models. First, we present a pruning method for conditional Generative Adversarial Networks (GANs) in which we prune the generator and discriminator models in a collaborative manner. We then address the inference efficiency of diffusion models by proposing a method that prunes a pretrained diffusion model into a mixture of efficient experts, each handling a separate part of the denoising process. Finally, we develop an adaptive prompt-tailored pruning method for modern text-to-image diffusion models. It prunes a pretrained model like Stable Diffusion into a mixture of efficient experts such that each expert specializes in a certain type of input prompt.

PRUNING FOR EFFICIENT DEEP LEARNING: FROM CNNS TO GENERATIVE MODELS

by

Alireza Ganjdanesh

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2025

Advisory Committee:
Professor Heng Huang, Chair/Advisor
Professor Abhinav Shrivastava
Professor Furong Huang
Professor Tianyi Zhou
Professor Shuvra S. Bhattacharyya, Dean's Representative

© Copyright by Alireza Ganjdanesh 2025

To the dearest people in my life, Melina, Kasra, Ayli, Soheila, and Yousef.

Acknowledgments

I start by expressing my appreciation to my Ph.D. advisor, Dr. Heng Huang, for his guidance and constant support during my Ph.D. journey.
He provided me with the invaluable opportunity to enter the exciting field of deep learning research, for which I will always be grateful. I am especially thankful for his patience and encouragement during the early stages of my Ph.D., as he guided me in building the foundational knowledge and skills necessary to conduct meaningful research. His mentorship has been instrumental in shaping my growth as a researcher and inspiring me to push the boundaries of my work.

I extend my gratitude to my committee members, Dr. Abhinav Shrivastava, Dr. Furong Huang, Dr. Tianyi Zhou, and Dr. Shuvra S. Bhattacharyya, for their generosity in serving on my dissertation committee. Their insightful feedback and constructive comments have been invaluable in refining my research and strengthening my dissertation.

I express my heartfelt appreciation to my collaborators, whose invaluable contributions have been instrumental to the success of my research. First and foremost, I want to thank Dr. Shangqian Gao, who has been both a mentor and a guiding figure, akin to an older brother in my professional journey. He introduced me to the fascinating field of deep learning efficiency and was an outstanding collaborator on most of the projects presented in this dissertation. I am deeply thankful for the hard work we shared, the countless hours dedicated to brainstorming ideas, implementing them, and conducting experiments, and the late-night meetings essential to achieving these milestones together. I also thank Reza Shirkavand for his dedication during our collaboration in the final year of my Ph.D. His efforts were instrumental in bringing our project to fruition.

I am profoundly thankful to Yan Kang for giving me the opportunity to join Adobe Research as an intern in Summer 2023. This internship marked my first experience working at a major technology company, where I was fortunate to have Dr. Yuchen Liu, Dr. Richard Zhang, and Dr. Zhe Lin, along with Yan, as my mentors. Through this experience, I also had the privilege of meeting my current manager, Dr. Soheil Darabi, which led to securing my first full-time position with the brilliant Adobe Firefly team.

I would also like to thank my former collaborators, Dr. Wei Chen, Dr. Kamran Ghasedi, Dr. Liang Zhan, and Jipeng Zhang. Their expertise and dedication were invaluable in completing the projects during the early stages of my Ph.D.

I thank my former and current lab-mates at Huang Lab at the University of Pittsburgh and the University of Maryland, College Park. I was lucky to be a part of such a talented group of researchers, including Feihu Huang, Bin Gu, Lei Luo, Kamran Ghasedi, Hongchang Gao, Xiaoqian Wang, Zhouyuan Huo, Shangqian Gao, Yanfu Zhang, Runxue Bao, Wenhan Xian, Chao Li, An Xu, Xiaotian Dou, Guodong Liu, Peiran Yu, Zhenyi Wang, Junfeng Guo, Zhengmian Hu, Junyi Li, Yihan Wu, Xidong Wu, Reza Shirkavand, Hirad Alipanah, Lichang Chen, Yanshuo Chen, Ruibo Chen, Chenxi Liu, and Tianyi Xiong. I had my first hot pot at our group's lunch gathering, and since then, my wife and I have had multiple Mediterranean-style hot pot meals together. I also will not forget the fun times we had with Ruibo Chen, Chenxi Liu, and Tianyi Xiong at the Yahentamitsi dining hall at UMD. Thank you all for the great memories.

I thank Dr. Natasa Miskov-Zivanov, Dr. Azime Can-Cimino, and Dr. Steven Jacobs, under whom I had the privilege of serving as a teaching assistant at the University of Pittsburgh.
They entrusted me with the opportunity to lead several lecture sessions, allowing me to experience teaching in a language other than my mother tongue. These unique experiences were both challenging and rewarding, and I am grateful for their trust in me. I also thank Handa Ding, who was a friend and a colleague during my time as a teaching assistant for ECE Analytical Methods under Natasa. It was the first semester of my Ph.D., and I was having a hard time balancing coursework and research while adapting to a new environment. Handa's support during that period was truly invaluable, and I appreciate his willingness to step in and cover for me when I needed it most. I will always cherish the enjoyable moments we shared while grading exams with Natasa and Handa.

Next, I would like to thank my friends who made my Ph.D. journey not only tolerable but also memorable during my years in Pittsburgh and College Park. I am particularly grateful to Kazem Meidani, my best friend and roommate during my time in Pittsburgh. He stood by me through all the ups and downs, particularly during the challenging COVID-19 pandemic lockdowns, which began just six months after we arrived in the United States. His unwavering support during such a difficult time was invaluable, and I could not have remained sane without him.

I extend my heartfelt thanks to Kamran Ghasedi and Najmeh Sadoughi, who were like an older brother and sister to me and Kazem during our first year of the Ph.D. program. Their willingness to help and the wonderful memories we created together in Pittsburgh mean so much to me. I feel fortunate to still be in touch with them in Seattle.

I sincerely thank Mohsen Tabrizi and Amirreza Hashemi for their unwavering willingness to selflessly support the younger generation of students in the Iranian community in Pittsburgh. Their humility and kindness have left a lasting impression on me. The world would undoubtedly be a better place with more individuals like you. Thank you both for your efforts and support during my time in Pittsburgh.

I am also grateful for the fun times and cherished memories shared with my other friends in Pittsburgh: Parshin Shojaee, Soroosh Shafieezadeh Abadeh, Hirad Alipanah, Reza Shirkavand, Maryam Hakimzadeh, Fahimeh Dehghan, Yasin Karimi, Sanaz Saadatifar, Saba Dadsetan, Alireza Golestaneh, Parand Akbari, Masoud Zamani, Tahere Mokhtari, Sina Malakouti, and Mohammad Bakhshalipour. Additionally, I thank Matin Mortaheb and Maryam Maghsoudi Shaghaghi for the great memories we shared during the last year of my Ph.D. in College Park and Washington, DC. Matin, a longtime friend since high school, has been a pillar of support and friendship through all these years, and I deeply appreciate him for always being there.

Finally, I want to express my heartfelt gratitude to my family for their unwavering love and support throughout this journey. Their encouragement has been my anchor, providing me with the strength to persevere through challenges and celebrate successes.

I am especially grateful to my wife, Melina Emily Adrangi, for her constant love, patience, and understanding. She has been my confidant, cheerleader, and partner in every sense, helping me navigate the highs and lows of this journey with grace and resilience. Her sacrifices, from enduring my late working hours to supporting me through stressful times, have not gone unnoticed, and I am deeply thankful for her unwavering commitment to our shared dreams.
I am truly fortunate to have her by my side, and this accomplishment would not have been possible without her.

I express my deepest gratitude to my parents, Yousef and Soheila, whose unwavering encouragement and support have been the foundation of my success. They have always inspired me to work hard, dream big, and pursue my aspirations, including my journey to the United States. From them, I learned the fundamental principles of classical liberalism, free markets, and individual freedom, values that have contributed to the unprecedented prosperity of the United States in human history. My parents have selflessly stood by me through every stage of my life, offering their love and guidance during both triumphs and challenges. I am endlessly grateful to have such incredible parents, and I truly could not have achieved this without you. Thank you for everything.

I thank my siblings, Kasra (Amirreza) and Ayli, for all the joy and laughter you have brought into my life. The fun times we've shared together have been a source of happiness and balance during this journey. I am incredibly proud of both of you and have cherished every moment of our conversations and the opportunity to share my experiences with you. Your support throughout these years has meant the world to me. Thank you for always being there.

I would also like to thank my parents-in-law, Farid and Emilia, for their support and kindness over the years. Your constant encouragement and love have been a source of strength for both Melina and me throughout this journey. Your presence has provided us with confidence and reassurance during challenging times, and I am deeply grateful for everything you have done.

To all of you and the others I met during these years, thank you for supporting me throughout this journey.

Table of Contents

Dedication ii
Acknowledgements iii
Table of Contents viii
List of Tables xi
List of Figures xiii
List of Abbreviations xx

Chapter 1: Introduction 1
1.1 Motivation . . . 1
1.2 Dissertation Outline . . . 6

I Pruning and Efficient Architecture Search Techniques for Convolutional Neural Networks 14

Chapter 2: Interpretations Steered Network Pruning via Amortized Inferred Saliency Maps 15
2.1 Introduction . . . 15
2.2 Related Works . . . 18
2.3 Methodology . . . 21
2.4 Experiments . . . 33
2.5 Chapter Summary and Conclusions . . . 37
Supplementary Materials for Chapter 2 . . . 38
S2.1 REAL-X Formulation Development for Interpretation of Classifiers . . . 38
S2.2 Implementation Details of Our AEM . . . 41
S2.3 Experimental Setup . . . 43

Chapter 3: Efficient Learning of Kernel Sizes for Convolution Layers of CNNs 61
3.1 Introduction . . . 61
3.2 Related Works . . . 64
3.3 Methodology . . . 66
3.4 Experiments . . . 70
3.5 Chapter Summary and Conclusion . . . 81
Supplementary Materials for Chapter 3 . . . 82
S3.1 Experimental settings . . . 82
S3.2 Results . . . 83

Chapter 4: Jointly Training and Pruning CNNs via Learnable Agent Guidance and Alignment 90
4.1 Introduction . . . 90
4.2 Related Work . . . 93
4.3 Method . . . 96
4.4 Experiments . . . 103
4.5 Chapter Summary and Conclusion . . . 109
Supplementary Materials for Chapter 4 . . . 111
S4.1 Bounding our Agent's Actions . . . 111
S4.2 Experimental Settings . . . 112

II Pruning Methods for Efficient Inference of Deep Generative Models 115

Chapter 5: Compressing Image-to-Image Translation GANs Using Local Density Structures on Their Learned Manifold 116
5.1 Introduction . . . 116
5.2 Related Work . . . 119
5.3 Proposed Method . . . 121
5.4 Experiments . . . 128
5.5 Chapter Summary and Conclusion . . . 134
Supplementary Materials for Chapter 5 . . . 136
S5.1 Details of Our Experiments . . . 136
S5.2 Our Pruning Agents . . . 136
S5.3 Experimental Details . . . 139

Chapter 6: Mixture of Efficient Diffusion Experts Through Automatic Interval and Sub-Network Selection 142
6.1 Introduction . . . 142
6.2 Related Work . . . 146
6.3 Method . . . 148
6.4 Experiments . . . 159
6.5 Chapter Summary and Conclusions . . . 164
Supplementary Materials for Chapter 6 . . . 166
S6.1 Experimental Results . . . 166
S6.2 Details of Our Method . . . 167
S6.3 Experiments . . . 175

Chapter 7: Not All Prompts Are Made Equal: Prompt-based Pruning of Text-to-Image Diffusion Models 180
7.1 Introduction . . . 180
7.2 Related Work . . . 183
7.3 Method . . . 186
7.4 Experiments . . . 195
7.5 Chapter Summary and Conclusions . . . 201
Supplementary Materials for Chapter 7 . . . 202
S7.1 Overview of Diffusion Models . . . 202
S7.2 More Details of APTP . . . 203
S7.3 Experiments . . . 207

Chapter 8: Conclusion and Discussion 220
8.1 Broader Context and Future Directions . . . 223

Bibliography 225

List of Tables

2.1 Comparison of results on CIFAR-10. ∆-Acc represents the performance changes relative to the baseline, and +/− indicates an increase/decrease, respectively. . . . 36
2.2 Comparison results on ImageNet with ResNet-34/50/101 and MobileNet-V2. . . . 37
3.1 Results on the MNIST dataset. . . . 73
3.2 Results on the CIFAR10 dataset. . . . 73
3.3 Results on the STL-10 dataset. . . . 77
3.4 Results on the ImageNet-32 dataset. . . . 78
3.5 The architecture of our size predictor. . . . 83
4.1 Comparison results on CIFAR-10 for pruning ResNet-56 and MobileNet-V2. . . . 104
4.2 Comparison results on ImageNet for pruning ResNet-18/34 and MobileNet-V2. . . . 105
4.3 Ablation results of our method for pruning ResNet-56 on the CIFAR-10 dataset. EE represents the Epoch Embeddings. SR represents the Soft Regularization in Eq. 4.10. . . . 108
5.1 Quantitative comparison of our proposed method with state-of-the-art GAN compression methods. . . . 129
5.2 Ablation results of our proposed method. . . . 133
5.3 The architecture of gen∗ (∗ ∈ {G, D}) used in our method. . . . 137
5.4 Hyperparameter settings for training original models. . . . 139
5.5 Hyperparameter settings for pruning agents. . . . 140
5.6 Hyperparameter settings for fine-tuning. . . . 141
6.1 Comparison results of DiffPruning vs. baselines. Throughput values are calculated using an NVIDIA A100 GPU. †: the values are the average of our two efficient experts. ∗: calculated by sampling from provided checkpoints. ‡: speed-ups relative to the LDM model. The shadowed values are inaccurate, and we refer to supplementary S6.3.3 for a detailed discussion. . . . 161
6.2 Ablation results of our proposed method for pruning the LDM model [1] for LSUN-Bedroom to 50% MACs budget. . . . 164
6.3 The architecture of our Expert Routing Agent. We calculate width architecture vectors v^(i) from the outputs o_k^(i) (k ∈ [1, L]). We compute the depth architecture vector u^(i) from o_{L+1}^(i). We refer to Sec. S6.2.3.1 for detailed formulations. . . . 171
6.4 Hyperparameters of fine-tuning our models with elastic dimensions. . . . 174
6.5 Hyperparameters for the pruning and fine-tuning stages of our method for different MACs pruning ratios (30%, 50%, and 70%). . . . 177
6.6 Comparison of the number of training iterations for different methods on LSUN-Bedroom.
The "Method's Iterations" column denotes the number of all the training iterations that the pruning method performs to obtain its final efficient model. . . . 178
6.7 Comparison of the number of training iterations for different methods on LSUN-Church. The "Method's Iterations" column denotes the number of all the training iterations that the pruning method performs to obtain its final efficient model. . . . 179
7.1 Results on CC3M and MS-COCO. We report performance metrics using samples generated at the resolution of 768 and then downsampled to 256 [2]. We measure models' MACs/Latency with the input resolution of 768 on an A100 GPU. @30/50k shows fine-tuning iterations after pruning. . . . 198
7.2 The most frequent words in prompts assigned to each expert of APTP-Base pruned on CC3M. The resource utilization of each expert is indicated in parentheses. . . . 198
7.3 Ablation results of APTP's components on 30k samples from the MS-COCO [3] validation set. We fine-tune all models for 10k iterations after pruning. . . . 200
7.4 The most frequent words in prompts assigned to each expert of APTP-Base pruned on COCO. The resource utilization of each expert is indicated in parentheses. . . . 211
7.5 Quantitative results on CC3M and MS-COCO. We report the performance metrics using samples generated at the resolution of 768 and downsampled to 256 [2]. We measure models' MACs and Latency with the input resolution of 768 on an A100 GPU. @30/50k shows the model's fine-tuning iterations after pruning. . . . 213
7.6 Prompts for Fig. 7.3 . . . 219

List of Figures

2.1 Input features selected by a) REAL-X [4] and b) our model to explain decisions of a ResNet-56 classifier for samples from CIFAR-10 [5]. In the sub-figures from left to right: the 1st column shows the original image. Both models output an array (2nd columns) in which each value is the parameter of the predicted Bernoulli distribution over the corresponding mask pixel. In the 3rd column, we show the masks generated such that a pixel's value is one provided that its predicted Bernoulli parameter is bigger than 0.5, and zero otherwise. The 4th columns show the masked inputs. Our model's explanations are easier to interpret than the ones by REAL-X, which may seem random for some samples. . . . 25
2.2 Our AEM model. The goal is to train the selector model on the right (U-Net model in dashed line) to predict interpretations (saliency maps) of the classifier for each input sample. We train the selector by encouraging it to follow Eq. 2.2. (Left): We train a predictor model that learns to predict the classifier's output distribution given a masked input (RHS of Eq. 2.2). We do so using inputs masked by random RBF masks, as our selector's masks have RBF style. (Sec. 2.3.4) (Right): Given the trained predictor, we train the selector model using obj. 2.8 that enforces it to follow Eq. 2.2. We use the classifier's convolutional backbone as the encoder of the selector and only train its decoder for computational efficiency. Then, we use the trained decoder to prune the encoder. (Fig. 2.3) . . . 28
2.3 Our pruning method. The classifier to be pruned is shown on top.
(Conv layers and FC). The U-Net (Conv layers and the Decoder) is our trained selector model that can predict RBF parameters of the saliency map of each input for the classifier. The selector model is trained such that the pretrained backbone of the classifier is used as its encoder (Conv layers) and kept frozen during training. (see Fig. 2.2) Thus, we freeze the selector and classifier's weights and insert our pruning gates between the selector's encoder layers for pruning the classifier. Given a pruning pair (a sample and its RBF saliency map's parameters for the original classifier), we train the gate parameters to prune the classifier such that the pruned model has similar interpretations (L_interpr) and accuracy (L_class) to the original classifier while requiring lower computational resources (L_Res). . . . 30
2.4 (a): Test accuracy of different masks' parameterization schemes (RBF (ours) vs. Independent (REAL-X [4])). (b): Test accuracy w/wo using the classification loss. All results are for 3 runs with ResNet-56 on CIFAR-10. Shaded areas represent variance. . . . 33
2.5 (a), (b): The model's test accuracy and the FLOPs regularization term when changing γ_1, and (c), (d): when varying γ_2. All results are run 3 times with ResNet-56 on CIFAR-10. Shaded areas represent variance. . . . 35
2.6 Our proposed nonlinear function to calculate the center's coordinates of a predicted RBF kernel of the selector for the CIFAR-10 dataset. . . . 48
2.7 Our proposed nonlinear function to calculate the expansion parameter of a predicted RBF kernel of the selector for the CIFAR-10 dataset. . . . 49
2.8 Our proposed nonlinear function to calculate the center's coordinates of a predicted RBF kernel of the selector for the ImageNet dataset. . . . 50
2.9 ImageNet examples. Columns from left to right: input image, distribution over explanatory masks predicted by the selector, predicted distribution shown over the input, a sampled mask from the predicted distribution, and the input image masked by the sampled mask. Class of input images from top to bottom: 'Gyromitra', 'Honeycomb', 'Strainer', 'English springer', 'Indri brevicaudatus', 'Hartebeest'. . . . 52
2.10 ImageNet examples. Columns from left to right: input image, distribution over explanatory masks predicted by the selector, predicted distribution shown over the input, a sampled mask from the predicted distribution, and the input image masked by the sampled mask. Class of input images from top to bottom: 'Australian terrier', 'Scoreboard', 'Microwave oven', 'Barn', 'Rosehip', 'Samoyed'. . . . 53
2.11 ImageNet examples. Columns from left to right: input image, distribution over explanatory masks predicted by the selector, predicted distribution shown over the input, a sampled mask from the predicted distribution, and the input image masked by the sampled mask. Class of input images from top to bottom: 'Miniskirt', 'Soccer ball', 'Jeep', 'Albatross', 'Tench', 'China cabinet'. . . . 54
2.12 ImageNet examples. Columns from left to right: input image, distribution over explanatory masks predicted by the selector, predicted distribution shown over the input, a sampled mask from the predicted distribution, and the input image masked by the sampled mask.
Class of input images from top to bottom: 'Kimono', 'Whippet', 'Poncho', 'Drilling Platform', 'Steel Drum', 'Black Grouse'. . . . 55
2.13 ImageNet examples. Columns from left to right: input image, distribution over explanatory masks predicted by the selector, predicted distribution shown over the input, a sampled mask from the predicted distribution, and the input image masked by the sampled mask. Class of input images from top to bottom: 'Binoculars', 'Horned viper', 'Native bear', 'Hedgehog', 'Japanese spaniel', 'Reel'. . . . 56
2.14 CIFAR-10 examples. Class of input images from top to bottom: 'Ship', 'Truck', 'Automobile', 'Frog', 'Horse', 'Bird', 'Airplane', 'Bird', 'Dog', 'Cat', 'Cat', 'Cat', 'Bird'. . . . 57
2.15 CIFAR-10 examples. Class of input images from top to bottom: 'Bird', 'Ship', 'Frog', 'Cat', 'Dog', 'Frog', 'Airplane', 'Automobile', 'Horse', 'Bird', 'Bird', 'Deer', 'Horse'. . . . 58
2.16 CIFAR-10 examples. Class of input images from top to bottom: 'Frog', 'Ship', 'Ship', 'Cat', 'Airplane', 'Ship', 'Horse', 'Horse', 'Truck', 'Dog', 'Automobile', 'Frog', 'Deer'. . . . 59
2.17 CIFAR-10 examples. Class of input images from top to bottom: 'Dog', 'Airplane', 'Horse', 'Automobile', 'Horse', 'Ship', 'Ship', 'Automobile', 'Cat', 'Airplane', 'Ship', 'Airplane', 'Dog'. . . . 60
3.1 Overview of our method. Our size predictor model learns to predict the kernel sizes for the classifier. It predicts soft kernel sizes v that are rounded to integer values. Then, our adaptive weight predictor model predicts optimal kernel weights ŵ given the predicted sizes. We modulate the predicted weights using masks m_l parameterized by the soft sizes v to make the resulting weights w differentiable w.r.t. the size predictor's weights. Finally, the weights w are used as the kernel weights of the classifier, and the training is guided by the classification objective (L_class) and the parameter budget loss (L_param). . . . 66
3.2 Learned sizes (top) and weights (bottom) for the kernels of our EffConv-20 model with 0.66M parameters on CIFAR-10. . . . 76
3.3 Results of our ablation studies. . . . 79
3.4 MNIST, EffConv-20, 0.66M. . . . 84
3.5 CIFAR-10, EffConv-26, 0.66M. . . . 85
3.6 CIFAR-10, EffConv-32, 0.66M. . . . 85
3.7 CIFAR-10, EffConv-38, 0.66M. . . . 86
3.8 CIFAR-10, EffConv-44, 0.66M. . . . 86
3.9 CIFAR-10, EffConv-50, 0.66M. . . . 86
3.10 CIFAR-10, EffConv-56, 0.66M. . . . 87
3.11 CIFAR-10, Wide-EffConv-28-1, 0.66M. . . . 87
3.12 STL-10, EffConv-20, 0.66M. . . . 87
3.13 STL-10, EffConv-20, 0.71M. . . . 88
3.14 STL-10, EffConv-20, 0.78M. . . . 88
3.15 ImageNet-32, EffConv-20, 0.50M. . . . 88
3.16 CIFAR-10, EffConv-20, 0.3M. . . . 89
3.17 CIFAR-10, EffConv-20, 0.4M. . . . 89
3.18 CIFAR-10, EffConv-20, 0.5M. . . . 89
4.1 Overview of our method. We jointly train and prune a CNN model using an RL agent by iteratively training the agent's policy and the model's weights. In each iteration, we train the model's weights for one epoch and perform several episodic observations of the agent. Left: Each action of our agent prunes one layer of the model, and the procedure of pruning the l-th layer is shown. The agent's actions on the previous layers and the remaining layers' FLOPs determine its state, and we take the resulting model's accuracy as its reward (Sec. 4.3.2). As the model's weights change between iterations, the reward function also changes accordingly. Thus, we map each epoch to an embedding and employ a recurrent model to provide a state of the environment z to the agent. (Sec. 4.3.2.1) Right: Given a sub-network selected by the agent, we train the model's weights while softly regularizing them to align with the selected structure (Sec. 4.3.2.3). . . . 94
4.2 Results of ablation experiments on CIFAR-10. (a-c) Best reward of our agent when using a different number of episodes per epoch for three pruning rates when pruning ResNet-56. (d-f) Best reward with/without using our mechanism to provide representations of the environment to our agent during training for three pruning rates for ResNet-56. (g-i) Same results as (d-f) for MobileNet-V2. . . . 107
5.1 Our GAN pruning method. We encourage the pruned generator to preserve the density structure of the original model over its learned manifold during pruning. To do so, we partition the manifold into local neighborhoods around the samples generated by the original generator (Fig. 5.2) and represent each local neighborhood with a 'Center' sample (shown with a red frame) and its neighbors (blue frames). We use these samples as 'real' samples and the one generated by the pruned generator as a 'fake' one in our adversarial pruning objective. We implement our adversarial game with two pruning agents, gen_G and gen_D, that collaboratively learn to prune the original pretrained G and D. gen_G (gen_D) takes the architecture embedding of its colleague gen_D (gen_G) when determining the architecture of G (D). By doing so, gen_G and gen_D can maintain the balance between the capacity of G and D during pruning and make the process stable. (Fig. 5.4) . . . 119
5.2 Our method to find local neighborhoods on the learned manifold of the original generator. (Top Left): First, we obtain the original model's predictions in the target domain for training samples in the source domain. (Right and Down): We call the sample whose local neighborhood on the manifold we want to find the 'Center' sample (shown with a red solid frame). We pass the predicted samples in the previous step to a pretrained self-supervised encoder [6] that is fine-tuned on the target images in the training dataset. Then, we take the samples whose representations have the highest cosine similarity with the representation of the 'Center' sample as its approximate neighbors on the manifold. Neighbor samples and the approximate neighborhood on the manifold are shown with blue crosses and a dashed line. . . . 122
5.3 Qualitative results for 1) Pix2Pix: Cityscapes (top left), Edges2Shoes (bottom left), and 2) CycleGAN: Horse2Zebra (top right) and Summer2Winter (bottom right). . . . 130
5.4 Different losses given different λ_1 during the pruning process. (a)-(c) Loss values for CycleGAN on the Horse2Zebra dataset. (d)-(f) Loss values for Pix2Pix on the Cityscapes dataset. We normalize R to the range [0, 1] for better visualization. . . . 131
5.5 Visualization of approximate neighborhoods on the learned manifold of our pruned model vs. the original model. . . . 134
6.1 Overview of DiffPruning. We prune a pre-trained LDM model [1] (top) into a mixture of efficient experts (bottom). Each expert handles an interval, which allows their architectures to be separately specialized by removing layers or channels. . . . 143
6.2 Our Pruning Scheme: We train our Expert Routing Agent (ERA) to prune the experts into a mixture of efficient experts (Sec. S6.2.3). The ERA predicts the architecture vectors (v, u) to prune the experts' width and depth. Then, we calculate the denoising objectives of the selected sub-networks of experts, L_DDPM,I_i, as well as our resource regularization term, R, which encourages the ERA to provide a mixture of efficient experts with a desired compute budget (MACs). We train the ERA's parameters to minimize the objective functions. Thus, it learns to automatically allocate the compute budget (MACs) between experts in an end-to-end manner. . . . 144
6.3 Our Interval Selection Scheme: We calculate gradients of the denoising timesteps' objectives w.r.t. the pre-trained LDM's parameters and take the cosine similarity value of two timesteps' gradients as their alignment score. The dashed lines show our selected cluster intervals for the experts. One can observe that the optimal cluster assignments are different for distinct datasets, and employing a deterministic clustering strategy [7] like uniform clustering [8] for all datasets is sub-optimal. . . . 149
6.4 U-Net architecture of the LDM [1]. We randomly drop/preserve each colored layer in our elastic depth fine-tuning. . . . 151
6.5 Samples from the LDM [1] model and our pruned mixture of experts for different MACs budgets. The green numbers show the relative sampling throughput speed-up of our pruned models compared to the LDM on an NVIDIA A100 GPU. . . . 162
6.6 Comparison results of our method vs. baselines, SP [9], OMS-DPM [10], DDPM [11], and LDM [1]. First row: FID vs. MACs curves. Second row: FID vs. Throughput curves. We calculate the Throughput values with an NVIDIA A100 GPU. Higher Throughput and lower FID and MACs indicate better performance. . . . 166
6.7 Weighted average J(t_1) (Eq. 6.16) of the mean of alignment scores in two clusters for the LDM trained on FFHQ. . . . 169
6.8 Weighted average J(t_1) (Eq. 6.16) of the mean of alignment scores in two clusters for the LDM trained on ImageNet. . . . 169
6.9 Weighted average J(t_1) (Eq. 6.16) of the mean of alignment scores in two clusters for the LDM trained on LSUN-Beds. . . . 170
6.10 Weighted average J(t_1) (Eq. 6.16) of the mean of alignment scores in two clusters for the LDM trained on LSUN-Church. . . . 170
6.11 Illustration of our Elastic Width training. We sort the convolution channels (attention heads) based on their importance (L1 norm) before starting elastic width training. We drop a random ratio of the least important channels (heads) for convolution layers (attention layers) for each batch of training. The values o_1:4 represent different possible dropping ratios for a convolution layer with 4 channels. . . . 171
6.12 U-Net architecture of the LDM [1]. . . . 173
7.1 Overview: We prune a text-to-image diffusion model like Stable Diffusion (left) into a mixture of efficient experts (right) in a prompt-based manner. Our prompt router routes distinct types of prompts to different experts, allowing the experts' architectures to be separately specialized by removing layers or channels. . . . 181
7.2 Our pruning scheme. We train our prompt router and the set of architecture codes to prune a text-to-image diffusion model into a mixture of experts. The prompt router consists of three modules. We use a Sentence Transformer [12] as our prompt encoder to encode the input prompt into a representation z. Then, the architecture predictor transforms z into the architecture embedding e that has the same dimensionality as the architecture codes. Finally, the router routes the embedding e into an architecture code a^(i). We use optimal transport to evenly assign the prompts in a training batch to the architecture codes. The architecture code a^(i) = (u^(i), v^(i)) determines pruning of the model's width and depth. We train the prompt router's parameters and architecture codes in an end-to-end manner using the denoising objective of the pruned model L_DDPM, the distillation loss between the pruned and original models L_distill, the average resource usage for the samples in the batch R, and the contrastive objective L_cont, which encourages the embeddings e to preserve the semantic similarity of the representations z. . . . 187
7.3 Samples of the APTP-Base experts after pruning Stable Diffusion V2.1 using CC3M [13] and COCO [3] as the target datasets. Expert IDs are shown on the top right of the images. (See Table 7.6 for prompts) . . . 196
7.4 Comparison of samples generated by low and high budget experts of APTP-Base vs. SD V2.1 on the CC3M and MS-COCO validation sets. . . . 199
7.5 Ablation results for the number of experts of APTP on MS-COCO. . . . 200
7.6 Resource and Contrastive loss observed when applying APTP-Base with a MAC budget of 0.77 to prune Stable Diffusion 2.1 using the COCO dataset. The comparison is made between two settings: with and without optimal transport. APTP both adheres to the target MAC budget and finds architecture vectors that maintain the similarity between the prompts. . . . 210
7.7 Comparison of sample assignments in a batch to experts with and without optimal transport. The incorporation of optimal transport results in a more diverse assignment pattern. In the figure, each square represents a prompt within the batch, and the color signifies the budget level of the expert assigned to the prompt. Higher-resource experts are indicated by darker blue. . . . 211
7.8 Distribution of CC3M samples mapped to each expert of APTP-Base, including resource utilization ratios. . . . 212
7.9 The block-level retained MAC ratio of the UNet architecture of all experts of APTP-Base applied to Stable Diffusion 2.1 with CC3M as the target dataset. . . . 215
7.10 The block-level retained MAC ratio of the UNet architecture of all experts of APTP-Base applied to Stable Diffusion 2.1 with COCO as the target dataset. The groups of ResBlocks and the heads of Attention Blocks are pruned based on the outputs of the architecture predictor. The intensity of the color of each block represents its resource utilization. The number in each block indicates the precise ratio of retained MACs of the block. Conv in, Conv out, and skip connections between corresponding down and up blocks are omitted for brevity. . . . 216
7.11 Samples of the APTP-Base experts after pruning Stable Diffusion V2.1 using CC3M [13] as the target dataset. Each row corresponds to a unique expert. Please refer to Table 7.2 for the groups of prompts assigned to each expert. . . . 217
7.12 Samples of the APTP-Base experts after pruning Stable Diffusion V2.1 using MS-COCO [3] as the target dataset. Each row corresponds to a unique expert. Please refer to Table 7.4 for the groups of prompts assigned to each expert. . . . 218

List of Abbreviations

ADAM    Adaptive Moment Estimation
AEM     Amortized Explanation Models
AIoT    Artificial Internet of Things
APTP    Adaptive Prompt-Tailored Pruning
CLIP    Contrastive Language-Image Pre-training
CMMD    CLIP Maximum Mean Discrepancy
CNN     Convolutional Neural Network
DDIM    Denoising Diffusion Implicit Models
DDPM    Denoising Diffusion Probabilistic Models
DNN     Deep Neural Network
DPM     Diffusion Probabilistic Model
EIE     Efficient Inference Engine
ERA     Expert Routing Agent
FID     Fréchet inception distance
FLOPs   Floating-point Operations
GAN     Generative Adversarial Network
GPU     Graphics Processing Unit
GRU     Gated Recurrent Unit
GPT     Generative Pre-trained Transformer
I2IGAN  Image-to-Image Translation Generative Adversarial Network
ISP     Interpretations Steered Pruning
IWFS    Instance-Wise Feature Selection
KD      Knowledge Distillation
KDE     Kernel Density Estimation
KL      Kullback-Leibler Divergence
LDM     Latent Diffusion Model
LHS     Left Hand Side
LIME    Local Interpretable Model-agnostic Explanations
LLM     Large Language Model
LLaVA   Large Language and Vision Assistant
MACs    Multiply Accumulate Operations
MGGC    Manifold Guided GAN Compression
MLP     Multi-layer Perceptron
MoE     Mixture of Experts
NAS     Neural Architecture Search
RBF     Radial Basis Function
ResNet  Residual Network
RHS     Right Hand Side
RL      Reinforcement Learning
SAC     Soft Actor-Critic
SD      Stable Diffusion
SGD     Stochastic Gradient Descent
SHAP    Shapley Additive explanations
STE     Straight-Through Estimator
T2I     Text-to-Image
TPU     Tensor Processing Unit
VAE     Variational Autoencoder

Chapter 1: Introduction

1.1 Motivation

Deep learning methods have achieved unprecedented capabilities for visual recognition and generative modeling tasks in computer vision in the past decade. They have significantly outperformed hand-crafted baselines on traditional image classification (assigning a label to an input image) and object detection (localizing as well as classifying objects within an image) benchmarks.
Moreover, deep vision language models like GPT-4 [14], Gemini [15], and LLaVA [16] have made significant strides, enabling them to provide fine-grained text descriptions that not only identify objects in their input images but also explain their relationships and attributes. In addition, modern deep generative models like DALL-E [17], Stable Diffusion [1], Imagen [18], and Adobe Firefly [19] can generate high-quality, realistic, and detailed images given input text prompts. They have also shown impressive performance on prompt-based image editing. Thus, deploying deep learning models in various real-world applications like disease diagnosis, autonomous driving, robotics, and content creation is of great interest. The key ingredient of their success is that they can learn to extract or generate useful patterns for downstream tasks from vast amounts of data.

Empirical trends in the literature indicate that the performance of deep models improves when they benefit from 1) an increased training sample size, 2) higher architectural capacity (also known as model size), and 3) longer training schedules on modern hardware. These factors have driven the development of large-scale models with billions of parameters, fueling a race among private companies to push the performance boundaries of deep models by increasing the model size. For instance, the vision language LLaVA V1.5 [20] and QWEN2-VL [21] models are among the top performing models in multimodal vision and language modeling tasks like object localization and visual question answering. LLaVA V1.5 has two variants with 7 and 13 billion parameters, and QWEN2-VL has three variants with 2, 7, and 72 billion parameters. Similarly, the Stable Diffusion (SD) models have shown improved performance in image generation and editing tasks from SD-V1, with about 980M parameters, to SD-XL [22] and SD-V3 [23], having about 3.5B and 8B parameters, respectively.

However, the increase in model size leads to a natural trade-off between model performance and computational complexity and memory footprint, thereby making the deployment of these models challenging or even infeasible in various real-world scenarios. On the one hand, organizations and private companies need to invest in expensive GPU or TPU clusters or spend excessive budgets to rent them from cloud providers to serve their models for their customers. On the other hand, directly deploying large-scale models for edge applications like smartphones, robotics, self-driving cars, and the Artificial Internet of Things (AIoT) exhausts their limited memory, computational resources, and battery. Further, these applications demand real-time inference and low-latency responses, which are infeasible to achieve if one runs large-scale models with the limited computational resources available in these applications. Therefore, compressing and reducing the computational burden of deep models while maintaining their performance is crucial before deploying them in practice.

Model pruning is an efficient and effective technique for compressing trained over-parameterized deep models to improve their inference efficiency. In fact, it has been shown [24, 25] that over-parameterization is beneficial for the optimization and generalization of deep models. Further, one can prune such a model into a simpler one without significantly affecting its performance, whereas directly training the pruned model from scratch typically results in worse performance due to optimization difficulties [25].

Model pruning can be fine-grained, called weight pruning, in which individual weights of a model are removed. It can achieve high compression rates, significantly reducing the required storage memory, but weight pruning usually cannot provide inference speed-ups in practice. The reason is that GPUs and TPUs cannot effectively utilize irregular sparsity patterns, and one needs inference engines like EIE [26] to exploit the sparsity and accelerate inference. In contrast, structural pruning removes channels or depth layers of the model, thereby both reducing the model size and accelerating its inference on modern hardware like GPUs without requiring post-processing or special inference libraries. Thus, structural pruning is more practical for real-world applications and has been widely used in different domains.
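To make the contrast concrete, the following minimal PyTorch sketch (an illustration of the two pruning styles in general, not the exact procedure used in later chapters; the layer shape and the 90%/50% ratios are arbitrary choices for the example) applies both styles to a single convolution layer:

    import torch
    import torch.nn as nn

    conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)

    # Unstructured (weight) pruning: zero out the 90% smallest-magnitude
    # weights. The tensor keeps its dense shape, so GPUs see no speed-up
    # unless a sparse inference engine such as EIE is used.
    k = int(0.9 * conv.weight.numel())
    threshold = conv.weight.abs().flatten().kthvalue(k).values
    conv.weight.data.mul_((conv.weight.abs() > threshold).float())

    # Structured (channel) pruning: drop half of the output channels,
    # ranked by the L1 norm of each filter. The layer becomes physically
    # smaller, so it runs faster on standard hardware with no special
    # inference library.
    l1_scores = conv.weight.abs().sum(dim=(1, 2, 3))  # one score per filter
    keep = l1_scores.topk(64).indices                 # keep the 64 strongest
    pruned = nn.Conv2d(64, 64, kernel_size=3, padding=1)
    pruned.weight.data = conv.weight.data[keep].clone()
    pruned.bias.data = conv.bias.data[keep].clone()

Note that the structured variant changes the layer's output shape, so adjacent layers must be pruned consistently; this coupling is part of what makes structural pruning a non-trivial search problem.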
In this dissertation, we develop structural pruning and architecture search techniques to reduce the memory footprint and improve the inference efficiency of deep visual recognition and generative models, tailored to their unique characteristics and requirements. We focus on the following two main research directions:

I. Inference Efficiency of Visual Recognition Models, where we develop structural pruning and efficient architecture search methods for Convolutional Neural Network (CNN) classifiers to achieve compact models given different constraints on the model's computational requirements, such as its number of Multiply-Accumulate operations (MACs) and its parameter count. We contribute to three main aspects of this research direction:

First, we approach the pruning problem from a novel perspective and aim to answer whether one can use interpretations of a CNN classifier's decisions to prune it. This is in contrast with the prominent techniques that focus on either the model's outputs or its weights to prune it. We discuss in Chapter 2 that existing interpretation techniques cannot be deployed for this purpose, as they are either shown to be independent of the model's decisions or computationally expensive for pruning. Thus, we develop an amortized explanation model tailored for CNN classifiers and employ it in our framework to guide the pruning process.

Second, we introduce an efficient architecture search method to find kernel sizes for CNNs (Chapter 3). Although kernel sizes are crucial design choices for CNNs' performance and efficiency, existing architectures usually contain convolution layers with fixed kernel sizes stacked on top of each other. This design choice is suboptimal since it does not consider the target task. We propose a differentiable architecture search method that determines kernel sizes given a training dataset and a parameter budget, securing up to a 60× speed-up compared to baseline methods while achieving superior final performance.

Third, we develop a method that reduces the complexity of the pruning process for CNNs. Typically, structural pruning methods perform a three-step process of pretraining the model, pruning it, and then fine-tuning the pruned model, where each step has its own design choices and hyperparameters. We propose a method that accomplishes the first two steps at the same time using a reinforcement learning agent that learns to determine the optimal structure of the model during the pretraining phase (Chapter 4). By doing so, we improve the efficiency of the pruning process.
II. Inference Acceleration for Deep Generative Models: Due to their fundamental differences, existing pruning methods for discriminative models are not directly applicable to generative models, and heuristically stacking them for pruning generative models usually leads to unsatisfactory performance. Therefore, we design pruning techniques for conditional Generative Adversarial Networks (GANs) and modern diffusion models while taking their specific characteristics into account.

First, we address pruning a conditional GAN model. In contrast with previous works that mainly apply distillation, we focus on the learned density structure of a pretrained GAN model as a generative model. Specifically, we propose a structural pruning method that encourages the pruned model to preserve local density structures of the original model on neighborhoods of its learned manifold, resembling the kernel density estimation method. Further, we design a collaborative pruning scheme in which two agents prune both the generator and the discriminator by exchanging feedback. Thus, our method can properly maintain the balance between the capacities of the two models during pruning and alleviate mode collapse during the pruning process, which is a common challenge for baselines.

Second, we leverage the gradual denoising process of modern diffusion models to prune them into a mixture of efficient experts, each handling a separate part of the denoising path of the model's sampling process. We propose a dataset-specific approach to cluster the denoising timesteps into intervals using their alignment scores and assign a separate expert to each interval. We introduce a framework in which we prune the experts for all intervals simultaneously, thereby allocating compute resources between them automatically.

Finally, we design a pruning approach tailored for Text-to-Image (T2I) diffusion models. Precisely, our method prunes a pretrained T2I diffusion model into a set of efficient experts such that each expert is a specialized model for the prompts routed to it. Our method is the first to enable using different amounts of compute resources for various prompt types. We do so using a prompt router model that routes input prompts to a set of architecture codes that determine the sub-network of the model to be used. We design a framework in which we train both the prompt router and the architecture codes in an end-to-end manner.

1.2 Dissertation Outline

The structure of this dissertation follows the organization of the research topics discussed in Sec. 1.1. Accordingly, we divide the dissertation into two parts, addressing the inference efficiency of visual recognition models in Part I (Chapters 2, 3, and 4) and of deep generative models in Part II (Chapters 5, 6, and 7).

In general, pruning and architecture search for neural networks can be formulated as a combinatorial "selection" problem in which one should determine whether to preserve or remove each structural component of the model. In the case of deep over-parameterized models, the search space of model configurations is discrete, complex, and exponentially large, and the design choices are highly non-trivial. Further, evaluating model configurations is extremely costly, as each evaluation requires training the model from scratch.

In this dissertation, we design algorithms to tackle the model pruning and architecture search problems efficiently. The general framework of our ideas is that we convert the discrete optimization problem of pruning and architecture search into a continuous one; by doing so, we can leverage gradient-based optimization techniques to efficiently search for compact, high-performing architectures. In more detail, we implement our "selection" scheme using a neural network that can be trained end-to-end by employing differentiable selection gates and propagating gradients using the straight-through estimator. We briefly describe our design choices for adapting this framework to different discriminative and generative models in the following.
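As a concrete illustration of this recipe, the following minimal sketch (an illustrative toy, not the exact gate used in later chapters; all tensor shapes are arbitrary) implements a differentiable channel-selection gate with a straight-through estimator in PyTorch:

    import torch

    class BinaryGate(torch.autograd.Function):
        # Hard 0/1 decision in the forward pass; straight-through
        # (identity) gradient in the backward pass.
        @staticmethod
        def forward(ctx, logits):
            return (logits > 0).float()

        @staticmethod
        def backward(ctx, grad_output):
            return grad_output  # pretend the gate was the identity map

    logits = torch.randn(128, requires_grad=True)  # learnable score per channel
    features = torch.randn(8, 128, 16, 16)         # a batch of feature maps

    gate = BinaryGate.apply(logits)                # values in {0, 1}
    pruned_features = features * gate.view(1, -1, 1, 1)

    # Any loss on pruned_features now backpropagates into the gate logits,
    # so gradient descent can search the discrete space of sub-networks.
    pruned_features.sum().backward()
    print(logits.grad.shape)                       # torch.Size([128])

During training, a resource regularizer pushes the logits of redundant channels negative; once training converges, channels whose gates are zero can be physically removed from the model.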
The general framework of our ideas is that we convert the discrete optimization problem of pruning and architecture search as a continuous one, and by doing so, we leverage gradient-based optimization techniques to efficiently search for compact, high-performing architectures. In more details, we implement our “selection” scheme using a neural network that can be trained end-to-end by employing differentiable selection gates and propagating gradients using the straight-through estimator. We briefly describe our design choices to adapt our framework for different discriminative and gener- ative models in the following. In the part I, we address inference efficiency of CNNs. CNNs have consistently shown state-of-the-art performance on various computer vision tasks, surpassing Transformers [27, 28] and other counterparts [29, 30]. Therefore, optimizing CNNs’ architectures for inference efficiency is practically crucial. In Chapter 2, Chapter 3, and Chapter 4, we develop techniques for pruning and designing CNN architectures to make them efficient for inference. In Chapter 2, we propose a pruning method that leverages interpretations of a CNN’s predictions to guide its pruning process. Existing channel pruning algorithms approach the pruning problem from various perspectives and use different metrics to guide the pruning process. However, these metrics mainly focus on the model’s ‘outputs’ or ‘weights’ and ne- 7 glect its ‘interpretations’ information. To fill in this gap, we propose to address the channel pruning problem from a novel perspective by leveraging the interpretations of a model to steer the pruning process, thereby utilizing information from both inputs and outputs of the model. However, existing interpretation methods cannot get deployed to achieve our goal as either they are inefficient for pruning or may predict non-coherent explanations. We tackle this challenge by introducing a selector model that predicts real-time smooth saliency masks for pruned models. We parameterize the distribution of explanatory masks by Radial Basis Function (RBF)-like functions to incorporate geometric prior of natural images in our selector model’s inductive bias. Thus, we can obtain compact representations of explanations to reduce the computational costs of our pruning method. We leverage our selector model to steer the network pruning by maximizing the similarity of explanatory representations for the pruned and original models. Chapter 3 presents an efficient kernel size learning method for CNNs. Determining kernel sizes of a CNN model is a crucial and non-trivial design choice and significantly impacts its performance and efficiency. The majority of existing kernel size design methods rely on complex heuristic tricks or leverage neural architecture search that requires extreme computational resources. Thus, learning kernel sizes, using methods such as modeling ker- nels as a combination of basis functions, jointly with the model weights has been proposed as a workaround. However, previous methods cannot achieve satisfactory results or are inefficient for high-resolution and large-scale datasets. To fill this gap, we design an effi- cient kernel size learning method in which a size predictor model learns to predict optimal kernel sizes for a classifier given a desired number of parameters. 
In Part I, we address the inference efficiency of CNNs. CNNs have consistently shown state-of-the-art performance on various computer vision tasks, surpassing Transformers [27, 28] and other counterparts [29, 30]. Therefore, optimizing CNN architectures for inference efficiency is practically crucial. In Chapter 2, Chapter 3, and Chapter 4, we develop techniques for pruning and designing CNN architectures to make them efficient for inference.

In Chapter 2, we propose a pruning method that leverages interpretations of a CNN's predictions to guide its pruning process. Existing channel pruning algorithms approach the pruning problem from various perspectives and use different metrics to guide the pruning process. However, these metrics mainly focus on the model's 'outputs' or 'weights' and neglect its 'interpretations'. To fill this gap, we address the channel pruning problem from a novel perspective by leveraging the interpretations of a model to steer the pruning process, thereby utilizing information from both the inputs and outputs of the model. However, existing interpretation methods cannot be deployed to achieve our goal, as they are either inefficient for pruning or may predict incoherent explanations. We tackle this challenge by introducing a selector model that predicts real-time smooth saliency masks for pruned models. We parameterize the distribution of explanatory masks with Radial Basis Function (RBF)-like functions to incorporate the geometric prior of natural images into our selector model's inductive bias. Thus, we obtain compact representations of explanations that reduce the computational costs of our pruning method. We leverage our selector model to steer network pruning by maximizing the similarity of the explanatory representations of the pruned and original models.

Chapter 3 presents an efficient kernel size learning method for CNNs. Determining the kernel sizes of a CNN is a crucial and non-trivial design choice that significantly impacts its performance and efficiency. The majority of existing kernel size design methods rely on complex heuristic tricks or leverage neural architecture search, which requires extreme computational resources. Thus, learning kernel sizes jointly with the model weights, using methods such as modeling kernels as a combination of basis functions, has been proposed as a workaround. However, previous methods cannot achieve satisfactory results or are inefficient for high-resolution and large-scale datasets. To fill this gap, we design an efficient kernel size learning method in which a size predictor model learns to predict optimal kernel sizes for a classifier given a desired number of parameters. It does so in collaboration with a kernel predictor model that predicts the weights of the kernels - given the kernel sizes predicted by the size predictor - to minimize the training objective, and both models are trained end-to-end. Our method needs only a small fraction of the training epochs of the original CNN to train these two models and find proper kernel sizes for it. Thus, it offers an efficient and effective solution to the kernel size learning problem.

In Chapter 4, we introduce a method to reduce the complexity of the model pruning process. The majority of structural pruning methods require a pretrained model before pruning, which is costly to obtain. We propose a novel structural pruning approach to jointly learn the weights and structurally prune the architectures of CNN models. The core element of our method is a Reinforcement Learning (RL) agent whose actions determine the pruning ratios of the CNN model's layers, with the resulting model's accuracy serving as its reward. We conduct joint training and pruning by iteratively training the model's weights and the agent's policy, and we regularize the model's weights to align with the structure selected by the agent. The evolving model weights result in a dynamic reward function for the agent, which prevents the use of prominent episodic RL methods with a stationary environment assumption for our purpose. We address this challenge by designing a mechanism to model the complex changing dynamics of the reward function and provide a representation of it to the RL agent. To do so, we take a learnable embedding for each training epoch and employ a recurrent model to calculate a representation of the changing environment. We train the recurrent model and embeddings using a decoder model that reconstructs observed rewards. Such a design empowers our agent to effectively leverage episodic observations along with the environment representations to learn a proper policy for determining performant sub-networks of the CNN model.

In Part II, we propose model pruning techniques to improve the inference efficiency of deep generative models. First, we introduce a method to prune conditional GANs in Chapter 5. Although diffusion models have achieved state-of-the-art performance on various generative modeling tasks, they are still computationally expensive and slow to sample from, requiring tens to hundreds of forward passes to generate a sample. In contrast, GANs can generate samples in a single forward pass, making them more practical for real-time applications. Therefore, pruning GANs can prepare them for deployment in low-latency applications. We then address the inference efficiency of diffusion models in Chapter 6 and Chapter 7. We propose a pruning approach for diffusion models that leverages their gradual sampling process in Chapter 6. Finally, we develop a prompt-based pruning framework for text-to-image diffusion models in Chapter 7.

We present our method for pruning conditional GANs in Chapter 5. GANs have shown remarkable success in modeling complex data distributions for image-to-image translation. Still, their high computational demands prohibit their deployment in practical scenarios like edge devices. Existing GAN compression methods mainly rely on knowledge distillation or pruning techniques developed for convolutional classifiers. Thus, they neglect a critical characteristic of GANs: their local density structure over their learned manifold.
Accordingly, we approach GAN compression from a new perspective by explicitly encouraging the pruned model to preserve the density structure of the original parameter-heavy model on its learned manifold. We facilitate this objective for the pruned model by partitioning the learned manifold of the original generator into local neighborhoods around its generated samples. Then, we propose a pruning objective that regularizes the pruned model to preserve the local density structure over each neighborhood, resembling the kernel density estimation method. Also, we develop a collaborative pruning scheme in which the discriminator and generator are pruned by two pruning agents. We design the agents to capture interactions between the generator and discriminator by exchanging their peer's feedback when determining their corresponding models' architectures. Thanks to such a design, our pruning method can efficiently find performant sub-networks and maintain the balance between the generator and discriminator more effectively than baselines during pruning, thereby showing more stable pruning dynamics.

In Chapter 6, we propose a pruning method to reduce the sampling cost of diffusion models. Diffusion models have shown better mode coverage and superior image generation quality compared to GANs. Yet, their sampling process requires numerous denoising steps, making it slow and computationally intensive. We propose to reduce the sampling cost by pruning a pretrained diffusion model into a mixture of efficient experts. First, we study the similarities between pairs of denoising timesteps, observing a natural clustering, even across different datasets. This suggests that, rather than having a single model for all timesteps, separate models can serve as "experts" for their respective time intervals. As such, we separately fine-tune the pretrained model on each interval, with elastic dimensions in depth and width, to obtain experts specialized in their corresponding denoising intervals. To optimize resource usage across experts, we introduce our Expert Routing Agent, which learns to select a set of proper network configurations. By doing so, our method can allocate the computing budget between the experts in an end-to-end manner without requiring manual heuristics. Finally, with a selected configuration, we fine-tune our pruned experts to obtain our mixture of efficient experts.

We present our prompt-based pruning method for text-to-image diffusion models in Chapter 7. Text-to-image (T2I) diffusion models have demonstrated impressive image generation capabilities, synthesizing novel images given an input text prompt. Still, their computational intensity prohibits resource-constrained organizations from deploying T2I models after fine-tuning them on their internal target data. While pruning techniques offer a potential solution to reduce the computational burden of T2I models, static pruning methods use the same pruned model for all input prompts, overlooking the varying capacity requirements of different prompts. Dynamic pruning addresses this issue by utilizing a separate sub-network for each prompt, but it prevents batch parallelism on GPUs. To overcome these limitations, we introduce Adaptive Prompt-Tailored Pruning (APTP), a novel prompt-based pruning method designed for T2I diffusion models.
Central to our approach is a prompt router model, which learns to determine the required capacity for an input text prompt and routes it to an architecture code, given a total desired compute budget for the prompts. Each architecture code represents a specialized model tailored to the prompts assigned to it, and the number of codes is a hyperparameter. We train the prompt router and architecture codes using contrastive learning, ensuring that similar prompts are mapped to nearby codes. Further, we employ optimal transport to prevent the codes from collapsing onto a single one. We demonstrate APTP's effectiveness by pruning Stable Diffusion (SD) V2.1 using CC3M and COCO as target datasets. APTP outperforms single-model pruning baselines in terms of FID, CLIP, and CMMD scores. Our analysis of the clusters learned by APTP reveals that they are semantically meaningful. We also show that APTP can automatically discover prompts previously found empirically to be challenging for SD, e.g., prompts for generating text images, assigning them to higher-capacity codes.

Finally, we conclude the dissertation in Chapter 8 by summarizing our contributions, discussing future directions, and putting our work in the context of the broader research landscape.

Part I: Pruning and Efficient Architecture Search Techniques for Convolutional Neural Networks

Chapter 2: Interpretations Steered Network Pruning via Amortized Inferred Saliency Maps

2.1 Introduction

Convolutional Neural Networks (CNNs) have been continuously achieving state-of-the-art results on various computer vision tasks [31, 32, 33, 34, 35, 36, 37], but the resource requirements of popular deep models [38, 39, 40] are also exploding. Their substantial computational and storage costs prohibit deploying these models on edge and mobile devices, making CNN compression a crucial task. Many ideas have attempted to address this problem by reducing models' sizes while maintaining their prediction performance. These ideas can usually be classified into one of the main model compression categories: weight pruning [41], weight quantization [42, 43], structural pruning [44], knowledge distillation [45], neural architecture search [46], etc. We focus on pruning channels of CNNs (structural pruning) since it can effectively and practically reduce the computational costs of a deep model without any post-processing steps or specially designed hardware.

Although existing channel pruning methods have achieved excellent results, they do not consider the model's interpretations during the pruning process. They tackle the pruning problem from various perspectives such as reinforcement learning [46], greedy search [47], and evolutionary algorithms [48]. In addition, they utilize a wide range of metrics like channels' norms [44], loss [49], and accuracy [50] as guidance for pruning the model. Thus, they emphasize the model's outputs or weights but ignore its valuable interpretation information.

We aim to approach the structural model pruning problem from a novel perspective by exploiting the model's interpretations (a subset of input features called saliency maps) to steer the pruning. Our intuition is that the saliency maps of the pruned model should be similar to those of the original model. However, existing interpretation methods are either inefficient or unreliable for pruning.
Firstly, locally linear models (e.g., LIME [51] and SHAP [52]) fit a separate linear model to explain the behavior of a nonlinear classifier in the vicinity of each data point. However, they need to fit a new model at each pruning iteration in which the classifier's architecture changes, which makes them inefficient for pruning. Secondly, previous works [53, 54] empirically observed that the feature importance assignments of gradient-based methods (e.g., Grad-CAM [55] and DeepLIFT [56]) might not be more meaningful than random ones. Moreover, Srinivas and Fleuret [57] theoretically showed that the input gradients used by these methods might seem explanatory because they are related to an implicit generative model hidden in classifiers [58], not to their discriminative function. Thus, their use for interpreting classifiers should be avoided. Finally, perturbation-based methods [59, 60] need multiple forward passes and rely on perturbed samples that are out-of-distribution for the trained model [53] to obtain its explanations. Hence, they are neither efficient nor reliable for pruning. Different from the aforementioned methods, Amortized Explanation Models (AEMs) [4, 61, 62] provide a theoretical framework for obtaining a model's interpretations. They train a fast saliency prediction model that can be applied in real-time systems, as it provides saliency maps with a single forward pass, making it suitable for pruning. We refer to Section 2.2 for more discussion of interpretation methods.

In this chapter, we first present a new AEM method that overcomes the disadvantages of previous AEM models, and we then employ it to prune convolutional classifiers. Previous AEMs [4, 61, 62] cannot be applied to guide pruning due to several key drawbacks. REAL-X [4] proved that L2X [61] and INVASE [62] can suffer from degenerate cases in which the saliency map selector predicts meaningless explanations. Although REAL-X overcomes this problem, it generates masks independently for each input feature (pixel). Thus, it neglects the geometric prior [63] of natural images that adjacent features (pixels) often correlate with each other. We empirically show in Section 2.3.3 and Fig. 2.1 that the saliency maps predicted by REAL-X may lack visual interpretability. In addition, the provided explanations have the same size as the input image, which adds non-trivial computational costs when used for pruning. We propose a novel AEM model to tackle these problems. In contrast with REAL-X, which assumes feature independence, we embed a proper geometric prior in our model. We use a Radial Basis Function (RBF)-like function to parameterize the saliency masks' distribution. By doing so, the mask generation is no longer independent for each pixel in our framework. Moreover, it enables us to infer explanations for each image with only three parameters (the center coordinates and a kernel expansion parameter), saving substantial computation. We utilize such compact saliency representations to steer network pruning by reconstruction in real time. We also find that merging guidance from the model's interpretations and outputs can further improve the pruning results. Our experimental results on benchmark datasets illustrate that our interpretation-steered pruning method consistently achieves superior performance compared to baselines. Our contributions are as follows:

• We propose a novel structural pruning method for CNNs designed from a new and different perspective compared to existing methods.
We utilize the interpretations of the model's decisions to steer the pruning procedure. By doing so, we effectively merge the guidance from the model's interpretations and outputs to discover high-performing sub-networks.

• We introduce a new Amortized Explanation Model (AEM) that embeds a proper geometric prior for natural images in its inductive bias, enabling it to predict smooth explanations for input images. We parameterize the distribution of saliency masks using RBF-like functions. Thus, our AEM can provide compact explanatory representations and save computational costs. Further, it empowers us to dynamically obtain saliency maps of pruned models and leverage them to steer the pruning procedure.

The contents of this chapter are based on our work [64] published in ECCV 2022.

2.2 Related Works

2.2.1 Interpretation Methods

Interpretation methods can be classified into four [4] main categories:

1. Gradient-based methods such as CAM [65], Grad-CAM [55], DeepLIFT [56], and LRP [66] rely on the gradients of a model's outputs w.r.t. its input features and assume that features with larger gradients have more influence on the model's outcome [67, 68, 69], which has been shown not to be a necessarily valid assumption [70]. In addition, their feature importance assignments might not be more meaningful than random ones [53, 54, 57], which makes them unreliable for pruning. Further, Srinivas and Fleuret [57] theoretically proved that input gradients are equal to the score function of the implicit generative model in classifiers [58] and are not related to the discriminative function of classifiers. Thus, they are not interpretations of the model's predictions.

2. Perturbation-based models explore the effect of perturbing input features on the model's output or inner layers to infer their importance [59, 60, 71]. Yet, they are inefficient for pruning as they need multiple forward passes to obtain importance scores. Also, they may underestimate features' importance [56].

3. Locally linear models fit a linear model to approximate the behavior of a classifier in the vicinity of each data point [51, 52]. However, they require fitting a new model for each sample whenever the model's architecture changes during pruning, which makes them inefficient for pruning. Also, they rely on the classifier's outputs for out-of-distribution samples to train the linear model [53], which makes them unreliable.

4. Amortized Explanation Models (AEMs) [4, 61, 62, 72] overcome the inefficiencies of the previous methods by training a global model - called a selector [4] - that amortizes the cost of inferring saliency maps for each sample by selecting salient input features with a single forward pass. AEMs [4, 61, 62] provide a theoretical framework for training the selector model. To do so, they use a second predictor model that estimates the classifier's output target distribution given an input masked by the selector model's predicted mask. L2X [61] and INVASE [62] jointly train the selector and predictor. However, REAL-X [4] proved that doing so results in degenerate cases. REAL-X overcame this problem by training the predictor model separately with random masks. However, we show in Section 2.3.3 that its predicted masks may not be interpretable for complex image classifiers. Our conjecture is that it neglects the geometric prior [63] of natural images, namely that nearby pixels correlate more with each other.
2.2.2 Network Compression

Weight pruning [41] and quantization [42, 43], structural pruning [44, 73, 74, 75, 76, 77, 78, 79, 80, 81], knowledge distillation [45], and NAS [46] are popular directions for compressing CNNs. Structural pruning has attracted more attention as it can readily decrease the computational burden of CNN models without any specific hardware changes. Early channel pruning methods [44] propose that channels with larger norms are more critical and remove weights/filters with small L1/L2 norms. An L1 penalty can also be applied to the scaling factors of batch normalization [82] to remove redundant channels [83]. Recent channel pruning methods adopt more sophisticated designs. Automatic model compression [46] learns the width of each layer with reinforcement learning. MetaPruning [50] generates parameters for sub-networks and uses evolutionary algorithms to find the best sub-network. Greedy subnetwork selection [47] greedily chooses each channel based on its L2 norm. Pruning can also be used for fairness [84]. We refer to [85] for a more detailed discussion of pruning techniques.

2.2.3 Network Pruning Using Interpretations

A few recent works attempt to use interpretations of a model to determine importance scores for its weights. Sabih et al. [86] leverage DeepLIFT [56]; Yeom et al. [87] use LRP [66]; and Yao et al. [88] utilize activation maximization [59] to determine weights' importance. However, all of these approaches rely on gradient-based methods whose predictions, as discussed above, are unreliable and should not be used as the model's interpretations. Alqahtani et al. [89] visualize feature maps in the input space and use a segmentation model to find the filters that have the highest alignment with visual concepts. Nonetheless, their method needs an accurate segmentation model to find reliable importance scores for filters, which may not be available in some domains. We develop a new AEM model that is theoretically supported and improves upon REAL-X [4]. Moreover, in contrast with these methods, our pruning method finds the optimal sub-network end-to-end. We also show in Section 2.4.2 that our model outperforms [89].

2.3 Methodology

2.3.1 Overview

We present a novel pruning method in which we steer the pruning process of CNN classifiers using feature-wise interpretations of their decisions. First, we develop a new, intuitive AEM model that overcomes the limitations of REAL-X [4] (the state-of-the-art AEM) by incorporating the geometric prior of high correlation between adjacent input features (pixels) [63] of images into the inductive bias of our AEM model. We parameterize the distribution of saliency masks using Radial Basis Function (RBF)-style functions. By doing so, we can represent interpretations (saliency maps) of input images compactly. Then, we elaborate on our pruning method, in which we leverage our AEM model to provide interpretations of the original and pruned classifiers. Our intuition is that the saliency maps of the original and pruned models should be similar. Thus, we propose a new loss function for pruning that encourages the pruned model to have saliency explanations similar to the original one. In the following sub-sections, we introduce AEM methods and empirically show the limitations of REAL-X. Then, we elaborate on our method and its intuitions for tackling the drawbacks of previous AEMs. Finally, we present our pruning scheme.
2.3.2 Notations

We denote our dataset as $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ such that $(x, y) \sim P(x, y)$, where $P$ is the unknown underlying joint distribution over features and targets, and we assume $x \in \mathbb{R}^D$ and $y \in \{1, 2, \ldots, K\}$. We denote the $j$th feature of sample $x$ by $x_j$ and represent a mask $m$ by the indices of the input features that it preserves, i.e., $m \subseteq \{1, 2, \ldots, D\}$. A masked input $m(x)$ is defined as follows:

$$[m(x)]_j = \text{mask}(x, m)_j = \begin{cases} x_j & j \in m \\ 0 & \text{otherwise}^1 \end{cases} \tag{2.1}$$

We refer to the model that we aim to prune as the 'classifier' in the following sections.

¹We use zero values for the masked input features, following the literature [4, 61, 62].

2.3.3 Amortized Explanation Models (AEMs)

AEMs are a subgroup of Instance-Wise Feature Selection (IWFS) methods that aim to compute a mask of minimum cardinality for each input sample that preserves its outcome-related features. An outcome may be a classifier's predictions (usually calculated as a softmax distribution) for interpretation purposes. It can also be the population distribution of the targets (one-hot representations) when performing dimensionality reduction on the original raw data [4, 61, 62]. Although previous works [4, 61, 62] describe their formulations for the latter, we focus on the former here. Concretely, let $Q_{class}(y \mid \mathbf{x})$ be the classifier's conditional distribution of targets given input features; the objective of AEM models is to find a mask $m(x)$ for each sample $x$ such that

$$Q_{class}(y \mid \mathbf{x} = x) = Q_{class}(y \mid \mathbf{x} = m(x)) \tag{2.2}$$

AEMs tackle this problem by training a global model, called the selector, that learns to predict a local (sample-dependent) mask $m(x)$ for each sample $x$ [4]. They train the selector by encouraging it to follow Eq. 2.2. To do so, one should quantify the discrepancy between the RHS and LHS of Eq. 2.2 when the selector model generates the mask $m$ in the RHS. The LHS can be readily calculated by forwarding the sample $x$ through the classifier. However, the classifier should not be used to compute the RHS, because the masked sample $m(x)$ is an out-of-distribution input for it [4]. AEMs solve this issue by training a predictor model that predicts the classifier's conditional distribution given a masked input (the RHS of Eq. 2.2). Then, they train the selector guided by supervision from the predictor. We present the formulation of REAL-X [4] in supplementary S2.1.

2.3.3.1 Visualization of REAL-X Predictions

We visualize the predicted explanations of REAL-X for a ResNet-56 model [39] trained on CIFAR-10 [5] in Fig. 2.1(a) (we refer to supplementary S2.3 for implementation details). As can be seen, the formulation of REAL-X cannot guide the selector model to learn to select a coherent subset of input pixels on the salient parts of the images. Thus, it may not provide interpretable explanations for the classifier. Our conjecture for the cause is that the formulation of REAL-X does not include a proper inductive bias related to natural images in the selector model. Typically, nearby pixels' values and semantic information are more correlated in natural images, which is known as their geometric prior [63]. REAL-X does not have such a prior in its formulation because it factorizes the explanatory mask distribution given an input $x$ as:

$$q_{sel}(m \mid x; \beta) = \prod_{i=1}^{D} q_i(m_i \mid x; \beta) \tag{2.3}$$

where $q_i(m_i \mid x; \beta) \sim \text{Bernoulli}((f_\beta(x))_i)$ and $f_\beta(x)$ is the selector model parameterized by $\beta$; i.e., the distribution over the selector's output mask is factorized as a product of marginal Bernoulli distributions over the mask's elements, and the parameter of each element is calculated independently. Hence, the selector model lacks the inductive bias that the parameters of nearby Bernoulli distributions should be close to each other to make the sampled masks coherent. Instead, it would have to 'discover' such a prior during training, which is infeasible with limited data and training epochs in practice.
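To make Eqs. 2.1 and 2.3 concrete, below is a minimal sketch of zero-fill masking and of REAL-X-style independent per-pixel Bernoulli sampling; the names are illustrative, and in REAL-X the per-pixel probabilities would come from the selector $f_\beta(x)$ rather than the random stand-in used here.

```python
import torch

def apply_mask(x: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
    """Eq. 2.1: keep feature x_j where m_j = 1 and fill masked features with zero."""
    return x * m

def sample_independent_mask(probs: torch.Tensor) -> torch.Tensor:
    """Eq. 2.3: every mask pixel is an independent Bernoulli draw, so nothing
    encourages neighboring pixels to agree -- masks can be spatially scattered."""
    return torch.bernoulli(probs)

x = torch.randn(3, 32, 32)      # an image of shape (C, H, W)
probs = torch.rand(1, 32, 32)   # stand-in for the selector's per-pixel parameters
masked_x = apply_mask(x, sample_independent_mask(probs))
```

The independent draws are precisely what produce the scattered masks visible in Fig. 2.1(a).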
Figure 2.1: Input features selected by (a) REAL-X [4] and (b) our model to explain the decisions of a ResNet-56 classifier on samples from CIFAR-10 [5]. In the sub-figures, from left to right: the 1st column shows the original image; both models output an array (2nd column) whose values are the parameters of the predicted Bernoulli distributions over the corresponding mask pixels; the 3rd column shows masks generated such that a pixel's value is one if its predicted Bernoulli parameter is greater than 0.5 and zero otherwise; the 4th column shows the masked inputs. Our model's explanations are easier to interpret than those of REAL-X, which may appear random for some samples.

2.3.4 Proposed AEM Model

We introduce a new selector scheme that respects the proximity geometric prior. To do so, we assume that the parameters of the Bernoulli distributions of the mask pixels have a Radial Basis Function (RBF)-style functional form over the pixel locations. The center of the RBF kernel should lie on the salient part of the image most relevant to the classifier's prediction, and the Bernoulli parameters should decrease as the pixel location moves away from the kernel's center. A parameter $\sigma$ controls the area of a mask. Our assumption is reasonable for multi-class classifiers, in which a single object/region of the input image typically determines the target class. Formally, for a 2D mask whose coordinates are parameterized by $(z, t)$ and a 2D RBF kernel with parameters $(c_z, c_t, \sigma)$, we calculate the Bernoulli parameter (BP) of a pixel at location $(z, t)$ as follows:

$$f_{BP}(z, t; c_z, c_t, \sigma) = \exp\left(-\frac{1}{2\sigma^2}\left[(z - c_z)^2 + (t - c_t)^2\right]\right) \tag{2.4}$$

This formulation has two crucial benefits: 1) It ensures that the Bernoulli parameters of a mask's proximal pixels are close to each other; thus, the resulting sampled masks are much more coherent and smooth than those of REAL-X. 2) It significantly simplifies the selector model's task. In REAL-X, the selector must learn to calculate a Bernoulli parameter for each pixel, which amounts to, for instance, 224 × 224 = 50176 independent functions for standard ImageNet [31] training. In contrast, in our formulation, the selector only needs to accurately estimate three values: the center coordinates $(c_z, c_t)$ and an expansion parameter $\sigma$ of the RBF kernel. Given the estimated values, the Bernoulli parameters of the output mask's pixels can be readily calculated by Eq. 2.4. In other words, if the input images have spatial dimensions $M \times N$, and we denote the selector function (implemented by a deep neural network) by $f_{sel}(x; \beta)$, our selector's distribution over masks given input images is:

$$[c_z, c_t, \sigma] = f_{sel}(x; \beta)$$
$$q_{i,j}(m_{i,j} \mid x; \beta) = \text{Bernoulli}(f_{BP}(i, j; c_z, c_t, \sigma))$$
$$q_{sel}(m \mid x; \beta) = \prod_{i=1}^{M} \prod_{j=1}^{N} q_{i,j}(m_{i,j} \mid x; \beta) \tag{2.5}$$

In Eq. 2.5, $\beta$ denotes the selector's parameters, and we illustrate an RBF kernel predicted by our selector in Fig. 2.2.
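For concreteness, here is a minimal sketch of Eqs. 2.4 and 2.5 (function and variable names are illustrative): given the three values $(c_z, c_t, \sigma)$ predicted by the selector, the Bernoulli parameter of every mask pixel follows the RBF form, so nearby pixels receive similar probabilities by construction.

```python
import torch

def rbf_bernoulli_params(cz: float, ct: float, sigma: float,
                         M: int, N: int) -> torch.Tensor:
    """Eq. 2.4: Bernoulli parameter of every pixel (z, t) from one RBF kernel."""
    z = torch.arange(M, dtype=torch.float32).view(-1, 1)  # row coordinates
    t = torch.arange(N, dtype=torch.float32).view(1, -1)  # column coordinates
    sq_dist = (z - cz) ** 2 + (t - ct) ** 2
    return torch.exp(-sq_dist / (2.0 * sigma ** 2))

# Eq. 2.5: the selector predicts only three numbers per image; the full
# M x N map of Bernoulli parameters follows in closed form, and a sampled
# mask is a coherent blob around the predicted center.
bp = rbf_bernoulli_params(cz=16.0, ct=20.0, sigma=6.0, M=32, N=32)
mask = torch.bernoulli(bp)
```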
In summary, our intuition is that by incorporating the geometric prior into the inductive bias of our framework, the selector searches for a proper functional form for the Bernoulli parameters over pixel locations within the RBF family of functions, rather than among all possible ones. As a result, it can find the optimal functional form more readily and robustly. Moreover, our selector model provides a real-time, compact representation (the RBF parameters) of saliency maps, which enables us to efficiently compare the interpretations of the original and pruned models to steer the pruning process (Section 2.3.6, Fig. 2.3).

2.3.5 AEM Training

We train our selector model by encouraging it to generate an explanatory mask $m$ for each sample $x$ that follows Eq. 2.2. To do so, as mentioned in Section 2.3.3, we need to estimate the classifier's conditional distribution of targets given masked inputs (the RHS of Eq. 2.2). Such an estimate quantifies the quality of a mask generated by the selector model by measuring the discrepancy between the LHS and RHS of Eq. 2.2.

2.3.5.1 Predictor Model

We train a predictor model to calculate the classifier's conditional distribution of targets given a masked input (the RHS of Eq. 2.2). As we designed our selector to predict RBF-style masks (Eq. 2.5), we train our predictor to predict the classifier's output distribution when the input is masked by a random RBF-style mask. Using random RBF masks allows us to mimic any potential RBF-masked input. Hence, our predictor's training objective is:

$$\min_\theta \; L_{pred}(\theta) = \mathbb{E}_{x \sim P(x)} \, \mathbb{E}_{c'_z, c'_t, \sigma'} \left[ \mathbb{E}_{m' \sim B(m \mid c'_z, c'_t, \sigma')} \, L_\theta(x, m'(x)) \right] \tag{2.6}$$

where $L_\theta(\cdot, \cdot)$ and $B(\cdot)$ are defined as:

$$L_\theta(x, m'(x)) = KL\big(Q_{class}(y \mid \mathbf{x} = x), \; q_{pred}(y \mid \mathbf{x} = m'(x); \theta)\big)$$
$$B(m \mid c'_z, c'_t, \sigma') = \prod_{i=1}^{M} \prod_{j=1}^{N} \text{Bernoulli}(f_{BP}(i, j; c'_z, c'_t, \sigma')) \tag{2.7}$$

In Eq. 2.6, $L_\theta$ forms the predictor's objective for learning the classifier's conditional distribution of targets given masked inputs (the RHS of Eq. 2.2), $B(\cdot)$ generates random RBF-style masks (via $f_{BP}$), and KL denotes the Kullback-Leibler divergence [90].

Figure 2.2: Our AEM model. The goal is to train the selector model on the right (the U-Net shown with a dashed line) to predict interpretations (saliency maps) of the classifier for each input sample. We train the selector by encouraging it to follow Eq. 2.2. (Left): We train a predictor model that learns to predict the classifier's output distribution given a masked input (the RHS of Eq. 2.2). We do so using inputs masked by random RBF masks, since our selector's masks have RBF style (Sec. 2.3.4). (Right): Given the trained predictor, we train the selector model using Obj. 2.8, which enforces Eq. 2.2. We use the classifier's convolutional backbone as the encoder of the selector and only train its decoder, for computational efficiency. Then, we use the trained decoder to prune the encoder (Fig. 2.3).
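As an illustrative sketch of Eqs. 2.6 and 2.7 (assuming both networks return logits, and reusing the `rbf_bernoulli_params` helper from the previous sketch), one predictor update draws a random RBF mask and matches the predictor's output on the masked input to the classifier's output on the clean input:

```python
import torch
import torch.nn.functional as F

def predictor_step(classifier, predictor, x, M, N, optimizer):
    """One optimization step of Eq. 2.6 (illustrative sketch)."""
    # Random RBF parameters: c'_z ~ U[0, M], c'_t ~ U[0, N],
    # sigma' ~ U[0, 2 * max(M, N)] (offset slightly to avoid division by zero).
    cz = torch.rand(1).item() * M
    ct = torch.rand(1).item() * N
    sigma = torch.rand(1).item() * 2 * max(M, N) + 1e-3
    mask = torch.bernoulli(rbf_bernoulli_params(cz, ct, sigma, M, N))

    with torch.no_grad():
        target = F.softmax(classifier(x), dim=-1)           # Q_class(y | x)
    log_pred = F.log_softmax(predictor(x * mask), dim=-1)   # q_pred(y | m'(x))
    loss = F.kl_div(log_pred, target, reduction="batchmean")  # KL term of Eq. 2.7

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```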
Now, we must define the distributions of the parameters $c'_z$, $c'_t$, and $\sigma'$ of a random RBF function. Assume that the origin of our 2D coordinate system is at the top left of an input image with spatial dimensions $M$, $N$. In theory, $c'_z$ and $c'_t$ can take any real values, and $\sigma'$ can be any positive real number in Eq. 2.4. However, given that the salient part(s) lie inside the image region, we are interested in the predictor learning to correctly estimate $Q_{class}(y \mid \mathbf{x} = m(x))$ (the RHS of Eq. 2.2) when the selector predicts that the center of the RBF kernel is inside the image area. Hence, we assume that the distributions of $c'_z$ and $c'_t$ are uniform across the image dimensions, i.e., $c'_z \sim U[0, M]$ and $c'_t \sim U[0, N]$. In addition, the parameter $\sigma'$ determines the degree to which an RBF kernel expands over the image, and values $\sigma' \geq 2\max\{M, N\}$ practically yield the same Bernoulli parameter for all mask pixels when $c'_z$ and $c'_t$ are inside the image region. Thus, we can reasonably assume $\sigma' \sim U[0, 2\max\{M, N\}]$ for training the predictor in practice.

2.3.5.2 Selector Training

Given a predictor model $q_{pred}$ trained with random RBF masks, we train our selector model with the following objective:

$$\min_\beta \; L_{sel}(\beta) = \mathbb{E}_{x \sim P(x)} \, \mathbb{E}_{m' \sim q_{sel}(m \mid x; \beta)} \left[ L(x, m'(x)) + \lambda_1 R(m') + \lambda_2 S(m') \right] \tag{2.8}$$

where $L(\cdot, \cdot)$, $R(\cdot)$, and $S(\cdot)$ are defined as:

$$L(x, m'(x)) = KL\big(Q_{class}(y \mid \mathbf{x} = x), \; q_{pred}(y \mid \mathbf{x} = m'(x))\big)$$
$$R(m') = \|m'\|_0$$
$$S(m') = \sum_{i=1}^{M} \sum_{j=1}^{N} \left[ (m'_{i,j} - m'_{i+1,j})^2 + (m'_{i,j} - m'_{i,j+1})^2 \right] \tag{2.9}$$

$L(x, m'(x))$ encourages the selector to follow Eq. 2.2, as $q_{pred}(y \mid \mathbf{x} = m'(x))$ approximates the RHS of Eq. 2.2 given an input masked by the RBF mask predicted by the selector. $R(m')$ regularizes the number of selected features, and the smoothness loss $S(m')$ further encourages the selector to output smooth masks. As Eq. 2.8 requires sampling from the distribution predicted by the selector, direct backpropagation of gradients to its parameters $\beta$ is not possible. Thus, we use the Gumbel-Sigmoid trick [91, 92] to train the model. We use a U-Net [93] architecture to implement the selector module of our AEM model, as shown in Fig. 2.2. We refer to supplementary S2.3 for more details of our AEM training.
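The following sketch summarizes the selector objective (Eqs. 2.8 and 2.9) under stated assumptions: `target_dist` and `pred_dist` are probability vectors (the classifier's output on the clean input and the predictor's output on the masked input), and a Gumbel-Sigmoid relaxation replaces the hard Bernoulli draw so gradients reach the selector's parameters. Names and the boundary handling of the smoothness term are illustrative.

```python
import torch
import torch.nn.functional as F

def gumbel_sigmoid(probs: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Relaxed, differentiable Bernoulli sample (Gumbel-Sigmoid trick, sketch)."""
    eps = 1e-8
    u = torch.rand_like(probs).clamp(eps, 1.0 - eps)
    logistic_noise = torch.log(u) - torch.log(1.0 - u)
    logits = torch.log(probs + eps) - torch.log(1.0 - probs + eps)
    return torch.sigmoid((logits + logistic_noise) / tau)

def selector_loss(target_dist, pred_dist, mask, lam1=1e-4, lam2=1e-4):
    """Eq. 2.8 with the three terms of Eq. 2.9 (illustrative)."""
    fidelity = F.kl_div((pred_dist + 1e-8).log(), target_dist,
                        reduction="batchmean")
    sparsity = mask.sum()  # relaxed surrogate for ||m'||_0
    # Smoothness over neighboring pixels (valid-index version of S(m')).
    smooth = ((mask[..., 1:, :] - mask[..., :-1, :]) ** 2).sum() \
           + ((mask[..., :, 1:] - mask[..., :, :-1]) ** 2).sum()
    return fidelity + lam1 * sparsity + lam2 * smooth

# Usage: bp = rbf_bernoulli_params(...); mask = gumbel_sigmoid(bp) yields a
# differentiable mask that is fed to the predictor to compute `pred_dist`.
```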
2.3.6 Pruning

In this section, we introduce our pruning method, which leverages interpretations of a classifier to steer its pruning process. Our intuition is that the interpretations (saliency maps) of the original and pruned classifiers should be similar. Thus, we design our pruning method as follows. As discussed in Section 2.3.5 and Fig. 2.2, we use the convolutional backbone of the classifier as the encoder of the U-Net architecture for the selector model. We keep the encoder weights frozen and only train the decoder when training the selector model, for computational efficiency (Fig. 2.2). Furthermore, doing so gives us the flexibility to keep the decoder frozen and prune the encoder such that the pruned model produces output RBF parameters similar to those of the original model (Fig. 2.3).

Figure 2.3: Our pruning method. The classifier to be pruned is shown on top (Conv layers and FC). The U-Net (Conv layers and the Decoder) is our trained selector model, which predicts the RBF parameters of each input's saliency map for the classifier. The selector is trained with the pretrained backbone of the classifier as its frozen encoder (see Fig. 2.2). Thus, we freeze the selector's and classifier's weights and insert our pruning gates between the selector's encoder layers to prune the classifier. Given a pruning pair (a sample and the RBF parameters of its saliency map for the original classifier), we train the gate parameters so that the pruned model has interpretations ($L_{interpr}$) and accuracy ($L_{class}$) similar to the original classifier while requiring fewer computational resources ($L_{Res}$).

Formally, we employ our trained selector model to predict saliency maps of the original classifier for the training samples. For each sample $x_k$, it provides the parameters of the RBF kernel of its saliency map as $C_{x_k} = [c_z^k, c_t^k, \sigma^k]$. Then, we insert our pruning gates, parameterized by $\theta_g$, between the layers of the encoder. We denote the architecture vector generated by the gates by $v$. Finally, we prune the encoder (the classifier's backbone) by optimizing the gate parameters to maintain interpretations and accuracy similar to the original classifier while reducing its computational budget:

$$\min_{\theta_g} \; \mathcal{L}(f(x; W, v), y) + \gamma_1 \left\| C_x - f_{sel}(x; \beta, v) \right\|_2^2 + \gamma_2 R_{res}(T(v), p\,T_{all}) \tag{2.10}$$

where $\mathcal{L}(\cdot, \cdot)$ is the classification loss, and $f(\cdot; W, v)$ denotes our classifier (the encoder of the U-Net and the FC layer in Fig. 2.3), parameterized by the weights $W$ and the sub-network selection vector $v$. $f_{sel}(x; \beta, v)$ is our trained selector model ($f_{sel}(x; \beta)$ in Eq. 2.5) augmented with the architecture vector $v$ after inserting the pruning gates into its encoder. We calculate $v$ using the Gumbel-Sigmoid function $g(\cdot)$: $v = g(\theta_g)$ [91, 92], which controls the openness or closeness of each channel. The second term in Eq. 2.10 uses the interpretations of the original and pruned classifiers to steer pruning through the selector model $f_{sel}(x; \beta, v)$ by encouraging the similarity of their predicted RBF parameters. $R_{res}$ is the FLOPs regularization that ensures the pruned model reaches the desired FLOPs rate $p\,T_{all}$, where $T_{all}$ is the total prunable FLOPs of the model, $T(v)$ is the current FLOPs rate determined by the sub-network vector $v$, and $p$ controls the pruning rate. $\gamma_1$ and $\gamma_2$ are hyperparameters controlling the strength of the corresponding terms. During pruning, we only optimize $\theta_g$ and keep $W$ and $\beta$ frozen.

We emphasize that our amortized explanation selector model, $f_{sel}(x; \beta, v)$, enables us to readily perform interpretation-steered pruning because it dynamically predicts each sample's saliency-map RBF parameters ($[c_z, c_t, \sigma]$) for the current sub-network vector $v$ with a single forward pass. In contrast, optimization-based explanation methods [51, 52] need to fit a new model, and perturbation-based methods [59, 60, 71] must make multiple forward passes for each newly selected sub-network to obtain its explanations; therefore, they are inefficient for this goal. We provide the detailed parameterization of the channels ($g(\cdot)$) and $R_{res}$ in supplementary S2.3.1.
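For illustration, the sketch below assembles the three terms of Obj. 2.10; the signatures `classifier(x, v)`, `selector(x, v)`, and `flops_fn(v)` are assumptions for this sketch (the exact parameterization of the gates and of $R_{res}$ is given in supplementary S2.3.1), and a squared penalty stands in for the FLOPs regularizer.

```python
import torch
import torch.nn.functional as F

def pruning_loss(classifier, selector, x, y, rbf_target, v,
                 flops_fn, budget, gamma1=1.0, gamma2=2.0):
    """Obj. 2.10 (illustrative sketch), optimized w.r.t. the gate logits only.

    `rbf_target` holds C_x = [c_z, c_t, sigma] precomputed with the original
    model; `v` is the architecture vector produced by the Gumbel-Sigmoid
    gates; `flops_fn` maps v to the current prunable-FLOPs ratio T(v).
    """
    cls_loss = F.cross_entropy(classifier(x, v), y)        # L(f(x; W, v), y)
    interp_loss = ((selector(x, v) - rbf_target) ** 2).sum(-1).mean()  # RBF match
    res_loss = (flops_fn(v) - budget) ** 2                 # stand-in for R_res
    return cls_loss + gamma1 * interp_loss + gamma2 * res_loss
```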
Figure 2.4: (a) Test accuracy of different mask parameterization schemes (RBF (ours) vs. independent (REAL-X [4])). (b) Test accuracy with/without using the classification loss. All results are over 3 runs with ResNet-56 on CIFAR-10; shaded areas represent variance.

2.4 Experiments

We use CIFAR-10 [5] and ImageNet [31] to validate the effectiveness of our proposed model. We refer to supplementary S2.3 for the details of our experimental setup. We call our method ISP (Interpretations Steered Pruning) in the experiments.

2.4.1 Analysis of Different Settings

Before formally presenting our experimental results against competitive methods, we study the effect of different design choices for our model's components on its performance. We keep the resource regularization term ($R_{res}$) of Obj. 2.10 and add/drop the other terms in all settings.

In our first experiment, we explore the impact of $\gamma_1$ by using only interpretations (the second term of Obj. 2.10) to steer the pruning. Fig. 2.5(a,b) and Fig. 2.4(a) show the results. We can observe in Fig. 2.5(a,b) that small $\gamma_1$ values (e.g., 0.1) result in a weaker supervision signal from the interpretations and make the exploration of sub-networks unstable (showing high variance), whereas larger values make the training smooth. Fig. 2.4(a) illustrates the influence of the RBF (Eq. 2.5, ours) versus independent (Eq. 2.3, REAL-X [4]) mask parameterization schemes. Our RBF-style model yields better performance than the independent parameterization, which becomes unstable and less effective as training proceeds. The instability possibly arises because the pruning gets trapped in local minima due to noisy and unstructured masks. We can also observe that interpretations on their own provide stable and efficient signals for pruning.

In our second experiment, we examine the impact of $\gamma_2$ while utilizing all three terms of Obj. 2.10 for pruning. Fig. 2.5(c,d) indicates that a small $\gamma_2$ (e.g., 1.0) yields higher accuracy but may not push the FLOPs regularization to 0, i.e., reach the predefined pruning rate $p$. Larger values satisfy the resource constraint while showing acceptable performance.

Finally, we examine the performance of different combinations of the components in Obj. 2.10. The results are shown in Fig. 2.4(b). Specifically, 'w/o Classification Loss' repre