ABSTRACT

Title of Dissertation: TOWARDS EFFECTIVE AND EFFICIENT VIDEO UNDERSTANDING

Xijun Wang, Doctor of Philosophy, 2025

Dissertation Directed by: Professor Dinesh Manocha and Professor Ming Lin, Department of Computer Science

“If a picture is worth a thousand words, what is a video worth?” Video information, owing to its inherent richness and efficiency compared to language, plays a pivotal role in conveying complex information. However, video understanding faces numerous challenges, including selecting informative frames, addressing domain shifts, semantic grounding, deficits in reasoning and attention, and significant computational burdens. Recent advancements in computer vision underscore the need to address these challenges through effective and efficient approaches, which are crucial for applications ranging from autonomous systems to human-computer interaction that demand high accuracy and low latency. In this dissertation, we address five critical areas to overcome these challenges: dataset development, preprocessing, visual reasoning, multimodal alignment, and computational acceleration.

High-quality datasets serve as foundational building blocks, providing diverse, comprehensive, and representative data to train models capable of handling real-world complexity. In this dissertation, we proposed the METEOR dataset, tailored for autonomous driving applications in dense, heterogeneous, and unstructured traffic scenarios with rare and challenging conditions. Additionally, we developed DAVE, a comprehensive benchmark dataset specifically designed to advance video understanding research for the safety of vulnerable road users in complex and unpredictable environments. Our analysis revealed substantial shortcomings of current object detection and behavior prediction models when evaluated on METEOR and DAVE.
Complementing these datasets, for preprocessing we proposed AZTR, which incorporates an automatic zooming algorithm for dynamic target scaling and a temporal reasoning mechanism to accurately capture action sequences. Furthermore, we introduced MITFAS, an alignment and sampling method based on mutual information, specifically designed to address challenges inherent to UAV video action recognition, including varying human resolutions, large positional changes between frames, and occluded action features.

For visual reasoning, we introduced SCP, which guides the model to explicitly learn input-invariant (prompt experts) and input-specific (data-dependent) prompt knowledge, effectively capturing discriminative patterns and significantly improving accuracy on challenging datasets. We also developed ICAR, a compatibility learning framework with a novel category-aware Flexible Bidirectional Transformer (FBT), which can effectively generate features across different domains based on visual similarity and complementarity for reasoning tasks.

For multimodal alignment, we proposed ViLA to address both efficient frame sampling and effective cross-modal alignment in a unified way. Finally, we proposed Bi-VLM, which explores ultra-low-precision post-training quantization to bridge the gap between computational demands and practical limitations. Our method employs a saliency-aware hybrid quantization algorithm combined with a non-uniform model weight partition strategy, substantially reducing computational costs without significantly compromising overall model performance.

TOWARDS EFFECTIVE AND EFFICIENT VIDEO UNDERSTANDING

by

Xijun Wang

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2025

Advisory Committee:
Dr. Dinesh Manocha, Chair/Advisor
Dr. Ming Lin, Co-Chair/Co-Advisor
Dr. Maria K. Cameron, Dean’s Representative
Dr. Abhinav Shrivastava
Dr. Shan Yang

© Copyright by Xijun Wang 2025

Meet the world, meet the multitudes, and in so doing, meet yourself.

Acknowledgments

On this quiet night, I find myself reflecting on my academic journey. Counting from the preschool transition class I attended at the age of five, I have spent 24 years in school. Throughout these years, various reasons kept me in school: curiosity about artificial intelligence, concerns about job prospects, and reluctance toward becoming a heavy coding engineer, among others. Now, standing at the threshold of the highest academic degree, I find that there are no more excuses and no higher degrees left to pursue. I have finally arrived at a crossroads—one that leads toward society and real life. From a student’s perspective, even if mistakes are made, parents and teachers have always offered forgiveness, making student status feel like an ever-present shield of leniency, granting infinite chances to improve. Yet I sense that I am about to lose this “privilege.”

There is an old Chinese saying, “One should establish oneself by the age of thirty,” meaning that by the age of thirty, one is expected to stand on one’s own—financially, intellectually, and spiritually. As I approach thirty, it seems an ideal moment to bid farewell to school and the protective shield of being a student. I am now ready to step into society—this broader “school”—to forge my own shield and establish my own identity.

Before arriving at this crossroads today, every individual I’ve encountered has left an indelible mark on me, forming the foundation from which I now set sail. Firstly, I want to thank my mother, a strong-minded and resilient woman who taught me that “A gentleman acts with principle—he knows when to act and when to refrain,” which embodies integrity and boundaries.
My heartfelt gratitude goes to my father, who has consistently supported my education, always encouraging me to make my own choices without interference, yet steadfastly supporting each decision. He taught me that life isn’t just about work; it also involves enjoyment. I also thank my sister, my childhood companion who showered me with care and provided a peer with whom I could share secrets that were hidden from our parents. Big thanks to my grandmother, my role model for diligent learning, who demonstrated remarkable strength and, of course, always indulged me generously. Thanks to my brother-in-law and my niece and nephew—you collectively made our family extraordinarily joyful. Special thanks to my girlfriend, who flew countless times from Switzerland to the United States to reunite with me, encouraged me through low points, reminded me during prideful moments, and always remembered every important detail of my life. Walking beside you fills me with immense happiness.

Academically, I owe profound gratitude to Dr. Dinesh Manocha and Dr. Ming Lin for their patient and meticulous guidance. Their visionary insights, passion for research, and unwavering dedication constantly inspired and moved me deeply. They generously provided valuable advice and carefully analyzed pros and cons whenever I faced difficult choices. Though sometimes lost, I always found myself back on track thanks to their detailed guidance and support. I am also deeply grateful to my mentor, Dr. Shan Yang, whose sharp vision and commitment to excellence profoundly influenced my research approach. My heartfelt thanks extend to all my collaborators, whose efforts made each project successful and fulfilling.

I would like to express my sincere appreciation to the members of my dissertation committee, Dr. Maria K. Cameron and Dr. Abhinav Shrivastava, whose valuable suggestions and constructive feedback greatly enriched my work. My appreciation also goes to the members of the GAMMA lab.
Your creativity and innovation made working together joyful. We are not only colleagues but also friends who play hard together. I am grateful to my roommates, who enriched my life beyond academics and provided me with small families abroad. Finally, I want to motivate myself and everyone else with a quote that deeply resonates with me: “The magic you’re looking for is in the work you’re avoiding!”

Table of Contents

Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures

Chapter 1: Introduction
  1.1 Video Understanding
  1.2 Dissertation Overview

I Video Dataset

Chapter 2: METEOR: A Dense, Heterogeneous, and Unstructured Traffic Dataset With Rare Behaviors
  2.1 Introduction
    2.1.1 Main Contributions
    2.1.2 Applications and Benefits
  2.2 Comparison with Existing Datasets
    2.2.1 Tracking and Trajectory Prediction Datasets
    2.2.2 Semantic Segmentation Datasets
    2.2.3 Behavior Prediction
  2.3 METEOR dataset
    2.3.1 Dataset Collection and Organization
    2.3.2 Annotations
    2.3.3 Rare and Interesting Behaviors
    2.3.4 Dataset statistics
  2.4 Using METEOR to Extract New Insights in Unstructured Traffic
    2.4.1 2D Object Detection
    2.4.2 Multi-Agent Behavior Recognition
  2.5 Conclusion, Limitations and Future Work

Chapter 3: DAVE: Diverse Atomic Visual Elements Dataset with High Representation of Vulnerable Road Users in Complex and Unpredictable Environments
  3.1 Introduction
  3.2 DAVE Dataset
    3.2.1 Data Collection
  3.3 DAVE-DETR for Vulnerable Road Users Detection
    3.3.1 Hierarchical Query Generator
    3.3.2 Reduce Redundancy Module
  3.4 Datasets for Different Tasks and Experiments
    3.4.1 Detection
    3.4.2 Tracking
    3.4.3 Video Moment Retrieval
    3.4.4 Spatiotemporal Action Localization
    3.4.5 Multi-label Video Action Recognition
  3.5 Conclusion, Limitation, Future Work

II Preprocessing

Chapter 4: AZTR: Aerial Video Action Recognition with Auto Zoom and Temporal Reasoning
  4.1 Introduction
    4.1.1 Main Contributions
  4.2 Related works
    4.2.1 Learning-based Methods for Aerial Video Recognition
    4.2.2 UAV and Drone Datasets
    4.2.3 Activity Recognition on Edge Architectures
  4.3 Our Approach: AZTR
    4.3.1 Overall Learning Method
    4.3.2 Auto Zoom
    4.3.3 Temporal Reasoning
  4.4 Experiments
    4.4.1 Datasets
    4.4.2 Implementation Details and Training
    4.4.3 Results on RoCoG-v2
    4.4.4 Results on UAV Human
    4.4.5 Results on Drone Action
  4.5 Conclusion, Limitations and Future

Chapter 5: MITFAS: Mutual Information based Temporal Feature Alignment and Sampling for Aerial Video Action Recognition
  5.1 Introduction
  5.2 Related Work
    5.2.1 Temporal Feature Alignment
    5.2.2 Similarity Measurement
    5.2.3 Video Recognition for Aerial Videos
  5.3 Video Recognition using Mutual Information
    5.3.1 Temporal Feature Alignment
    5.3.2 Mutual Information Sampling
    5.3.3 MITFAS: Aerial Video Recognition
  5.4 Results
    5.4.1 Results on UAV Human
    5.4.2 Results on NEC Drone
    5.4.3 Results on Drone Action
    5.4.4 Ablation Experiments
  5.5 Conclusion, Limitations and Future Work

III Visual Reasoning

Chapter 6: ICAR: Image-based Complementary Auto Reasoning
  6.1 Introduction
  6.2 Related Work
  6.3 Method
    6.3.1 Conditional Compatibility Auto Reasoning
    6.3.2 Compatibility Learning Framework
  6.4 Experiments
    6.4.1 Setup
    6.4.2 Evaluation Metrics
    6.4.3 Compatibility Learning Results
    6.4.4 Similarity Learning Results
  6.5 Conclusion, Limitation, and Future Work

Chapter 7: SCP: Soft Conditional Prompt Learning for Aerial Video Action Recognition
  7.1 Introduction
  7.2 Related Works
    7.2.1 Action Recognition
    7.2.2 Prompt Learning
  7.3 Our Approach
    7.3.1 Prompt Learning-based Input Encoder
    7.3.2 Auto-regressive Temporal Reasoning
    7.3.3 Single-agent and Multi-agent Objective
  7.4 Datasets and Results
    7.4.1 Datasets and Experiment Settings
    7.4.2 Results on Okutama
    7.4.3 Results on NECDrone
    7.4.4 Results on Something-something V2
    7.4.5 Ablation Study
  7.5 Conclusion

IV Multimodal Alignment

Chapter 8: ViLA: Efficient Video-Language Alignment for Video Question Answering
  8.1 Introduction
  8.2 Related Work
    8.2.1 Visual-Language Alignment
    8.2.2 Knowledge Distillation
    8.2.3 Frame Selection for Video QA
  8.3 Method
    8.3.1 Model Architecture
    8.3.2 Text-guided Frame-Prompter Learning
    8.3.3 Cross-Modal Distillation
  8.4 Experiments
    8.4.1 Implementation Settings
    8.4.2 Results
    8.4.3 Ablation Study
    8.4.4 More Discussion
  8.5 Conclusion, Limitation and Future Work

V Quantization

Chapter 9: Bi-VLM: Pushing Ultra-Low Precision Post-Training Quantization Boundaries in Vision-Language Models
  9.1 Introduction
  9.2 Related Works
    9.2.1 Post-Quantization on VLM
    9.2.2 Network Binarization
  9.3 Our Approach
    9.3.1 Binarization Formulation
    9.3.2 Bi-VLM
    9.3.3 Pruning on Quantized Models
  9.4 Results and Comparisons
    9.4.1 Baseline Model and Datasets
    9.4.2 Quantization on Different Components
    9.4.3 Comparison with SOTA Methods
    9.4.4 Pruning on Quantized Model
  9.5 Conclusions, Limitations, and Future Work

Chapter 10: Conclusion, Limitations and Future Work
  10.1 Conclusion
  10.2 Limitations
  10.3 Future Work

  A.1 Appendix for Chapter 3: DAVE
    A.1.1 Annotations
    A.1.2 More Related Datasets
    A.1.3 Full Dataset Detection
  A.2 Appendix for Chapter 5: MITFAS
    A.2.1 Implementation and Training Details
    A.2.2 Mutual Information
    A.2.3 Ablation Experiments
  A.3 Appendix for Chapter 6: ICAR
    A.3.1 Implementation Details
    A.3.2 More Details of Our Approach
    A.3.3 More Qualitative Evaluation Results
  A.4 Appendix for Chapter 7: SCP
    A.4.1 Experts Number for Learnable Prompt
    A.4.2 Different Inputs for Large Vision Model
    A.4.3 Effect of Each Component of Our Method
    A.4.4 Visualization
    A.4.5 Experiment Settings
  A.5 Appendix for Chapter 9: Bi-VLM
    A.5.1 Memory-Efficient Quantization Pipeline
    A.5.2 Statistical Analysis of Weights: Histograms and Gaussian Fit
    A.5.3 Saliency Threshold Determination
    A.5.4 More results on Pruning in ScienceQA-IMG

Bibliography

List of Tables

2.1 Characteristics of Traffic Datasets. We compare METEOR with state-of-the-art autonomous driving datasets that have been used for trajectory tracking, motion forecasting, semantic segmentation, prediction, and behavior classification. METEOR is the largest (in terms of number of annotated frames) and most diverse in terms of heterogeneity, scenarios, varying behaviors, densities, and rare instances. Darker shades represent a richer collection in that category. Best viewed in color.

2.2 Training Details for Object Detection (BS: batch size, Mom: momentum, WD: weight decay, MGN: max gradient norm).

2.3 Effect of meta features on object detection. We analyze how meta features such as traffic density, type of agents, location, time of day, and weather play a role in 2D object detection using the DETR, Deformable DETR, YOLOv3, and CenterNet object detectors. Bold indicates the type of meta feature that is the most effective for object detection.

2.4 Object detection on Waymo and KITTI. We report the standard mAP for many widely used methods on autonomous driving datasets.

2.5 Swin-T on Waymo and METEOR. We present a more detailed analysis of Swin-T, one of the state-of-the-art object detection approaches, on Waymo and METEOR.

2.6 ACAR-Net on METEOR. PT: pre-train, BS: batch size, Opt.: optimization, LR: learning rate, WD: weight decay, FR (RX-101): Faster R-CNN (ResNeXt-101), Kin.-700: Kinetics-700, CR (Swin-T): Cascade R-CNN (Swin-T).

3.1 Existing traffic datasets with Vulnerable Road Users and bounding boxes.
Total # includes all labeled instances.

3.2 Comparison of datasets with respect to pedestrian, vehicle, and other Vulnerable Road Users (O-VRUs) action and tube annotations.

3.3 DAVE Characteristics. We annotate 16 types of actions performed by 16 types of actors. We highlight the maximum and average number of actions and actors per frame. LaneChanging(m) denotes lane changing on roads with clear lane markings.

3.4 Comparison with SOTAs. Our DAVE-DETR consistently surpasses four strong SOTA methods across every reported metric.

3.5 Comparison of VRU Datasets. Under the same settings, the DAVE training set outperforms the Waymo training set by 20.8% in terms of mAP50. When combining the Waymo and DAVE training sets, the model achieves a 24.0% improvement over Waymo alone and a 3.2% improvement over DAVE alone. These results show that our DAVE dataset is more effective for VRU detection.

3.6 Comparison of Various Tracking Datasets. DAVE is comparable to GOT-10k in AO but more challenging for both success-rate metrics. For SR0.75, ARTrack performs 23.7% worse on DAVE than on GOT-10k, despite our preprocessing to keep the same object present in each frame sequence.

3.7 Statistics of datasets for the Video Moment Retrieval task. The CG-DETR method achieves only 5.1 R1@0.5 on DAVE (versus 58.4 on Charades-STA); this significant performance degradation illustrates that Video Moment Retrieval remains a challenging problem in unstructured environments.

3.8 Spatiotemporal Action Localization. ACAR-Net achieves 6.3% mAP on DAVE, which shows that DAVE is a very challenging dataset with tremendous room for improvement.

3.9 Multi-label Video Action Recognition.
SlowFast achieves 4.2% higher performance on Charades than on DAVE, which indicates that DAVE is harder.

4.1 3D operators are not well supported on most edge devices or processors, as highlighted here. Therefore, we use 2D+1 convolutions and an efficient attention mechanism on the RB5 platform. TL: TensorFlow Lite, C3D: Conv3D, MP3D: MaxPooling3D, AP3D: AveragePooling3D, DC3D: Depthwise Conv3D.

4.2 Inference Time on RB5 CPU. Our method takes 56.5 ms to process one frame (on average), which is 2× faster than MoViNet A3 on the RB5, and also improves top-1 accuracy; see Table 4.3.

4.3 Results on RoCoG-v2. We demonstrate that our approach improves top-1 accuracy by 6.1%–7.4%, outperforming all SOTA methods that can be deployed on the RB5 platform.

4.4 Benchmarking UAV Human and comparisons with prior art. Compared with state-of-the-art methods, our approach demonstrates an improvement of 8.3%–10.4%. Trained on high-end desktop GPUs.

4.5 Results on Drone Action. We demonstrate that AZTR improves the state-of-the-art accuracy by 3.2%, reaching 95.9% on Drone Action. Trained on high-end desktop GPUs.

5.1 Notation and symbols used in Chapter 5.

5.2 Benchmarking UAV Human and comparisons with prior art. For 224 × 224 resolution and 16-frame input, when training from scratch, our approach achieves a 13.2% improvement over the baseline X3D-M and 12.6% over the current state-of-the-art FAR. For 520 × 520 resolution and 8-frame input, MITFAS outperforms the current state-of-the-art FAR by 9.6% when training from scratch.
For 224 × 224 resolution and 16-frame input, when initializing with Kinetics-pretrained weights, MITFAS improves top-1 accuracy by 20.2% over the baseline and by 18.9% over the SOTA method. For resolutions over 620 × 620 and 8-frame input, when initializing with Kinetics-pretrained weights, MITFAS outperforms the current state-of-the-art FAR by 7.5%. Our method obtains better performance in all settings, which illustrates the effectiveness of our proposed MITFAS.

5.3 Results on Drone Action. Our method achieves 100% top-1 accuracy, 16.6% over the baseline method X3D-M [1], outperforming the current state-of-the-art method FAR [2] by 7.3% under the same configuration. (HLPF [3], PCNN [4])

5.4 Results on NEC Drone. Our method shows an improvement of 12.5% in top-1 accuracy over the baseline X3D-M [1] and 7.2% over the current state-of-the-art FAR [2].

5.5 Temporal Feature Alignment (TFA) and Mutual Information Sampling (MIS) ablation studies on UAV-Human-Subset. The baseline is vanilla X3D with random [5] and uniform sampling [6], and we add our methods TFA and MIS step by step. In our experiments, TFA boosts accuracy by 16%–17.5%, and MIS outperforms random sampling, uniform sampling, and MG Sampler [7].

5.6 Comparison with other methods [8, 9].

5.7 Mutual Information Sampling (MIS) ablation studies on UAV-Human-Subset and Drone Action. The baseline is vanilla X3D with TFA; we test MITFAS sampling in terms of the two hyperparameters for mutual information and joint mutual information, α and β respectively. In our experiments, MITFAS obtains the best accuracy when α = 1.0 and β = 1.0.

5.8 Comparison with other similarity measures on UAV-Human-Subset. Compared to other similarity measures, mutual information achieves the best accuracy.
6.1 FITB Results on DeepRooms [10]. Our approach improves FITB accuracy by 9.5% over Visual Similarity Learning, CSA-Net [11], and OutfitTransformer [12].

6.2 FITB Results on STL and Street2Shop (S2S). Our approach improves FITB accuracy by 4.2%–9.6% over IBR [13], Siamese Nets [14], BPR-DAE [15], Complete the Look [16], and OutfitTransformer [12].

6.3 SFID Results on DeepRooms, STL (F: fashion, H: home), and Street2Shop (S2S). Our approach improves SFID accuracy by 11.2 (23.3%, DeepRooms) and 2.9 (31.8%, STL-Home) on furniture images and by 3.4 (22.3%, STL-F) and 1.6 (18.4%, S2S) on fashion images, respectively, over OutfitTransformer [12].

6.4 Masking Method Comparison. Our random-length masking outperforms fixed-length masking.

6.5 Similarity Learning. Visual similarity learning is the most suitable for scene-based CIR. VQGAN [17], Swin [18], BEiT [19].

7.1 Comparison with state-of-the-art results on the Okutama dataset. With bbox information, we achieved a 10.20% improvement over the SOTA method; without bbox information, we outperformed the SOTA by 3.17%. crops: from detection.

7.2 Comparison with existing methods on NEC Drone. Our SCP improves 4.0%–7.4% over X3D and 23.1% over K-centered.

7.3 Comparison with state-of-the-art results on Something-Something V2. Our SCP improves 3.6% over MViTv1 and 1.0% over the strong SOTA MViTv2.

7.4 Ablation study of the effect of different components of our method on the Okutama dataset. We evaluated ROI, the Large Vision Model (SAM), and SCP. The experiments show the effectiveness of our proposed methods.
7.5 Ablation study of different prompts on the Okutama dataset. We evaluated various prompts, including optical flow, a large vision model (SAM [20]), and SCP. In our experiments, the large vision model and SCP achieved better accuracy.

8.1 Comparison results on the NExT-QA dataset. Here we measure the accuracy of choosing the right answer. On Temporal and Causal question types in particular, our ViLA (using only 4 frames) improves by 3.3% and 1.7% respectively, compared with SeViLA. We use boldface to indicate the best results and underline the second best using the same number of frames (brown box for 4 frames and blue box for 8 frames). ViLA using only 2 frames outperforms BLIP-2 using 4 frames by 1.3%. ViLA also achieves up to 3.04× speedup. Notably, our ViLA achieves 75.1% average accuracy with only 4 frames when we finetune the LLM with LoRA [21].

8.2 Comparison results on the STAR Video QA benchmark. For Interaction-type questions, our ViLA improves by 4.6%. On average, our ViLA outperforms the SOTA method by 2.2% when using 4 frames, with a 3.04× speedup. Note that our ViLA using 2 frames outperforms BLIP-2 using 4 frames.

8.3 Comparison results on the How2QA, VLEP, and TVQA Video QA benchmarks. In the 4-frame setting, ViLA improves performance over SeViLA by 1.8% with a 3.04× speedup on TVQA, 0.7% with a 1.45× speedup on VLEP, and 0.3% with a 3.04× speedup on How2QA. Our 2-frame setting outperforms SeViLA's 4-frame setting on VLEP by 0.3% with a 4.2× speedup.

8.4 Frame-Prompter and QFormer-Distiller Ablation Results. Across all four VideoQA datasets, we observe that both the text-aware Frame-Prompter and the cross-modal QFormer-Distiller contribute significantly to our final performance.
We highlight that on STAR, adding our QFormer-Distiller improves the accuracy by 2.9%, and our Frame-Prompter further boosts the accuracy by 1.6%. . . . . . . . . . . . . . . . . . . . 146 8.5 QFormer-Distiller Decoder Ablation on NExT-QA. We find that a simple Fully Connected layer (FC) with Layer Normalization (LN) works best across Temporal, Causal, and Description questions. It is efficient and effective. GELU is the activation function. . . 147 9.1 Quantization of different components of Llama 3.2-Vision Instruct 11B with 1- to 1.1-bit weights. The vision model exhibits high sensitivity to quantization; the adaptor/projector exhibits little sensitivity, barely affecting performance; the language model exhibits considerable sensitivity to quantization. FP: Full precision. Vis: Vision encoder. Adp: Adapt layer. Lm: Language model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 9.2 Quantization of different components of Llava-One-Vision 7B with 1- to 1.1-bit weights. Same conclusion as in Table 9.1. FP: Full precision. Vis: Vision encoder. Adp: Adapt layer. Lm: Language model. . . . . . . . . . . . . . . . . . . . . . 166 9.3 Quantization of different components of Qwen2.5-VL-7B-Instruct with 1- to 1.1-bit weights. Same conclusion as in Table 9.1. FP: Full precision. Vis: Vision encoder. Adp: Adapt layer. Lm: Language model. . . . . . . . . . . . . . . . . 166 xiv 9.4 SOTA comparison on Llama 3.2-Vision Instruct 11B with 1- to 1.1-bit weights. For the language model part, our Bi-VLM outperforms the SOTA by 4%-47%. For the overall VLM, our Bi-VLM outperforms the SOTA by 8%-45%. FP: Full precision. L: Language model. all: the whole VLM model. . . . . . . . . . . . . 167 9.5 SOTA comparison on Llava-One-Vision 7B with 1- to 1.1-bit weights. For the language model part, our Bi-VLM outperforms the SOTA by 3%-20%. For the overall VLM, our Bi-VLM outperforms the SOTA by 4%-19%. FP: Full precision. L: Language model. all: the whole VLM model. . 
. . . . . . . . . . . . . . 167 9.6 SOTA comparison on Qwen2.5-VL-7B-Instruct with 1.1-bit weights. For the language model part, our Bi-VLM outperforms the SOTA by 5%-10%. For the overall VLM, our Bi-VLM outperforms the SOTA by 4%-12%. FP: Full precision. L: Language model. all: the whole VLM model. . . . . . . . . . . . . . . . . . . . 167 1 Comparison of Various Detection Datasets. With the same settings, Swin-T performs 18% better on the COCO dataset than on ours. The results show that our DAVE dataset is more challenging than the existing datasets. . . . . 188 2 Ablation studies on the UAV-Human subset in terms of the number of bins used to calculate mutual information, the reference image size (multiples of the standard size), the stride of the sliding windows, and the search area size. The best performance is achieved with 128 histogram bins, a reference image size of 1.25×, and a sliding stride of 10. The size of the search area does not affect the overall performance of our method: the top-1 accuracy varies by only 0.6% across different search area sizes. This demonstrates the robustness of our MITFAS, as a larger search area contains more noise and outliers. . . . 189 3 Composing Method Comparison. Here bbox means using the bbox from the original scene image to place items in the composed image, and non-bbox means using a fixed size. GT is the ground-truth set and Neg is the randomly chosen set. From the results, we find that using a white background, a fixed size for all items, and random placement in the composed set of images produces the lowest score. . . 199 4 Ablation study in terms of the number of experts on the Okutama dataset. We evaluated 4, 8, 16, and 32 experts. In our experiments, 8 experts achieved the best accuracy. . . . . . . . . . . . . 206 5 Ablation study in terms of different inputs to the large vision model on the Okutama dataset.
We evaluated various inputs, including a single point, two points, four points, and a bbox. In our experiments, the large vision model with a bbox achieved the best accuracy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206 6 Llama: Only prune the vision model; keep the other components unpruned [FP: Evaluated on the full precision model]. The baseline accuracy for the full precision model is 86.22%. . . 217 xv 7 Llama: Keep the vision model unpruned, and selectively prune only image tokens in the language model [FP: Evaluated on the full precision model]. The number in parentheses denotes the cross-attention layers. The baseline accuracy for the full precision model is 86.22%. Results indicate that pruning up to 86.32% of the image tokens across language layers maintains performance above 84%, suggesting significant redundancy in image tokens. However, extreme pruning levels (e.g., 95.02%, 99%) lead to substantial accuracy drops, highlighting the importance of retaining a minimal number of tokens to ensure effective model performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217 8 Llama: Pruning both the vision and language parts (vision tokens in the vision model, text tokens in the language model) [FP: Evaluated on the full precision model]. The baseline accuracy for the full precision model is 86.22%. Results show that pruning both image tokens in the vision encoder and text tokens in the language model significantly degrades performance compared to Table 7, where only image tokens in the language model were pruned. Specifically, accuracy drops sharply as pruning increases. This indicates that pruning text tokens has a much more severe impact on model performance than pruning image tokens. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
218 9 Llama (BiLLM): Keep the vision model unpruned, and selectively prune only image tokens in the language model [vlm: Evaluated on the vision-language model quantized with BiLLM]. The baseline accuracy for the quantized model using BiLLM is 21.42%. This table shows that the quantized model has 90%-99% image-token redundancy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218 10 Llama (BiLLM): Pruning both the vision and language parts (vision tokens in the vision model, text tokens in the language model) [vlm: Evaluated on the vision-language model quantized with BiLLM]. The baseline accuracy for the quantized model using BiLLM is 21.42%. . . . . . . . . . . . . . . . . . . . . . . . . . . . 219 11 Llama (Bi-VLM, Ours): Keep the vision model unpruned, and selectively prune only image tokens in the language model [vlm: Evaluated on the vision-language model quantized with Bi-VLM]. The baseline accuracy for the quantized model using Bi-VLM is 58.35%. This table shows that the quantized model has 86%-95% image-token redundancy. . . . . . . . . . . . . . . . . . . . . . . . . 219 12 Llama (Bi-VLM, Ours): Pruning both the vision and language parts (vision tokens in the vision model, text tokens in the language model) [vlm: Evaluated on the vision-language model quantized with Bi-VLM]. The baseline accuracy for the quantized model using Bi-VLM is 58.35%. . . . . . . . . . . . . . . . . . . 220 13 Llava: Only prune the vision model; keep the other components unpruned [FP: Evaluated on the full precision model]. The baseline accuracy for the full precision model is 95.84%. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220 14 Llava: Keep the vision model unpruned, and selectively prune only image tokens in the language model [FP: Evaluated on the full precision model]. The baseline accuracy for the full precision model is 95.84%. . . . . . . . . . . . . . 
220 15 Llava: Pruning both the vision and language parts (vision tokens in the vision model, text tokens in the language model) [FP: Evaluated on the full precision model]. The baseline accuracy for the full precision model is 95.84%. . . . . . 221 xvi 16 Llava (BiLLM): Keep the vision model unpruned, and selectively prune only image tokens in the language model [vlm: Evaluated on the vision-language model quantized with BiLLM]. The baseline accuracy for the quantized model using BiLLM is 63.81%. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221 17 Llava (BiLLM): Pruning both the vision and language parts (vision tokens in the vision model, text tokens in the language model) [vlm: Evaluated on the vision-language model quantized with BiLLM]. The baseline accuracy for the quantized model using BiLLM is 63.81%. . . . . . . . . . . . . . . . . . . . . . . . . . . 221 18 Qwen: Keep the vision model unpruned, and selectively prune only image tokens in the language model [FP: Evaluated on the full precision model]. The baseline accuracy for the full precision model is 77.29%. . . . . . . . . . . . . . 222 19 Qwen (Bi-VLM, Ours): Keep the vision model unpruned, and selectively prune only image tokens in the language model [vlm: Evaluated on the vision-language model quantized with Bi-VLM]. The baseline accuracy for the quantized model using Bi-VLM is 68.32%. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222 20 Qwen (BiLLM): Keep the vision model unpruned, and selectively prune only image tokens in the language model [vlm: Evaluated on the vision-language model quantized with BiLLM]. The baseline accuracy for the quantized model using BiLLM is 59.49%. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222 xvii List of Figures 2.1 METEOR. 
We summarize various characteristics of our dataset in terms of scene: traffic density, road type, lighting conditions, agents (we indicate the total count of each agent across 1250 videos), and behaviors, along with their size distribu- tion (in GB). The total size of the current version of the dataset is around 100GB, and it will continue to expand. Our dataset can be used to evaluate the perfor- mance of current and new methods for perception, prediction, behavior analysis, and navigation based on some or all of these characteristics. Details of the or- ganization of our dataset are given at https://gamma.umd.edu/meteor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.2 Annotations for rare instances. One of the unique aspects of METEOR is the availability of explicit labels for rare and interesting instances including atypical interactions, traffic violations, and diverse scenarios. These annotations can be used to benchmark new methods for object detection and multi-agent behavior prediction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.3 We highlight the high traffic density, heterogeneity, and the richness of behavior information in METEOR. Abbreviations correspond to various behavior cate- gories and are explained in Section 2.3.3 . . . . . . . . . . . . . . . . . . . . . . 26 3.1 Tasks Overview. We use DAVE for various video recognition tasks, including Tracking, Detection, Video Moment Retrieval, Spatiotemporal Action Localiza- tion, and Multi-label Video Action Recognition. Our large-scale dataset is made up of complex environments that are densely annotated. Each bounding box (bbox) corresponds to an actor, and the text above each bbox serves as either the tracking ID or indicates the associated action. . . . . . . . . . . . . . . . . . 33 3.2 Challenging Characteristics of DAVE. 
These videos correspond to different times of the day with different brightness levels, different geographical landforms spanning city and rural areas, high-density and unpredictable road conditions, and diverse actors including humans, animals, vehicles, etc. . . . . . . . . . . . . . . . . . . . . . 35 3.3 DAVE-DETR consists of a hierarchical query generator to generate a dense query set and a redundancy-reduction module with class-agnostic Non-Maximum Suppression (NMS) to refine these proposals. . . . . . . . . . . . . . . . . . . . . . . 41 xviii 4.1 Our learning pipeline consists of the auto zoom learning algorithm and the temporal reasoning algorithm. For auto zoom learning, we offer different bounding box (bbox) and feature operations. Refer to Section 4.3 for details. For the temporal reasoning algorithm, we perform (2D+1) conv on edge devices, 3D conv on desktop GPUs, and the self-attention (Atten) mechanism on both edge devices and desktop GPUs. Attention layers on desktop GPUs are deeper and wider. . . . . 51 4.2 We designed two different auto zoom methods, with crops or features, for high-end desktops and mobile or edge devices, respectively. (a) For auto zoom with crops, we use a detector to get the target bounding box and crop it from the original frame, then scale the crop size. For auto zoom with features, we use the features to generate the bounding boxes and classification. (b) We use the detector to generate bboxes on key frames to reduce the computational cost. We predict the bbox at the next key frame, and compare the locations of the predicted and generated bboxes to avoid incorrect detection results. Finally, we apply linear interpolation to generate the bboxes between key frames. Details are shown in Section 4.3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
54 4.3 We use different combinations of (2D+1) convolution, 3D convolution, and efficient transformers for temporal reasoning on desktop GPUs and edge devices. The efficient-transformer-based algorithm has two components: the cross-attention maps the input sequences to a new sequence of a specific size according to the computational cost requirement, and the self-attention is the standard transformer component. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.4 Action recognition on RoCoG-v2 aerial video. More details are given in the video. 64 5.1 F_t and F_{t+1} are two frames at times t and t+1, respectively, from the same UAV video. The human actor in the two frames occupies less than 10% of the pixels due to the high camera altitude (top images). (a) MITFAS focuses on the regions corresponding to salient motions and uses mutual information to find the more informative frame. (b) Because of the UAV's motion, the position of the human actor in F_{t+1} appears shifted backward relative to F_t. Our algorithm (MITFAS) computes and aligns these regions so that the recognition model can infer more from the human motions. As shown in the right image, the main body of the human actor in the two frames overlaps after feature alignment. . . 69 5.2 Given a starting frame F_t in a UAV video, we use a localization network to localize the human action and crop the region containing the human motion as the reference image F_r. At time t+1, we use our feature alignment algorithm to estimate the optimal operation parameter ω*_{t+1} and find a region L_{ω*_{t+1}}(F_{t+1}) ⊂ F_{t+1} such that the mutual information between L_{ω*_{t+1}}(F_{t+1}) and the reference image F_r is maximized. Next, we use L_{ω*_{t+1}}(F_{t+1}) as the new reference image to find the optimal parameter ω*_{t+2} at time t+2, and repeat for subsequent frames. Then, we use the criterion illustrated in Section 5.3.2, Eq. 5.11, to find a sequence of the most distinctive and informative frames. 
We use a temporal inference backbone network (e.g., X3D [1]) to generate the predicted action label from the spatial-temporal features associated with the sampled frame sequence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 xix 5.3 We sample the (i+1)-th frame F_{i+1} from the candidate pool by choosing the frame that is least similar not only to the previous frame but also to all previously sampled frames. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 6.1 Cover Image. We present a new self-supervised model for scene-aware, visually compatible object retrieval tasks. In this example, given an inspirational home scene image (sampled from STL-home [16]; columns 1, 3, 5) with a pool of objects (3D-FRONT [22]) from an unseen domain, our model auto-regressively retrieves a set of stylistically compatible items (columns 2, 4, 6). . . . . . . . . . . 91 6.2 Scene-aware Complementary Item Retrieval Task Illustration. Given a query scene image, (optional) scene objects, and item categories, the task goal is to generate a cross-domain set of stylistically compatible items. . . . . . . . . . . . 92 6.3 ICAR Model Overview. In similarity learning, we apply a CNN-based model [23] to learn the visual similarity features across two domains. The learned features are required both for complementary reasoning in the complementarity learning and for the cross-domain retrieval. With the learned features, in the complementarity learning, we propose a Flexible Bidirectional Transformer (FBT) model to learn the multi-object visual compatibility. . . . . . . . . . . . . . . . . . . . . . 96 6.4 VSIM: Visual Similarity Model. . . . . . . . . . . . . . . . . . . . . . . . . . . 97 6.5 FBT: Flexible Bidirectional Transformer. We randomly sample M ∈ [0, N] items from the N items in a scene as the input set, and take the (M+1)-th item, which is not in the input set, as the output target. 
We put the scene embedding at the beginning of the input set, and take the scene embedding as the start token E_I. We set a zero vector as the end token E_e. . . . . . . . . . . . . . . . . . . 98 6.6 Scene-aware Cross-Domain CIR Qualitative Results. We show qualitatively that our model is capable of retrieving stylistically compatible items from both seen (Rows 2 and 4) and unseen domains (Rows 1 and 3), given a home (Rows 1-3) or fashion (Row 4) scene image. Columns 1, 5, 9 are the input scene images. See supplementary materials for more examples. . . . . . . . . . . . . . . . . . . . . 101 6.7 Learned Scene Image Embedding Clustering Results. To validate the style implicitly learned by our network, the first column shows the t-SNE of 2k randomly sampled STL-home and STL-fashion test-split scene images (Columns 2-5). 104 6.8 Human Ratings on Different Datasets. Our SFID score correlates better than the SOTA with human judgment. . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 7.1 Overall Architecture. Our action recognition method is designed to run on edge devices (on mobile robots) and cloud servers. It includes lightweight prompts (embedded), which can be easily embedded in any action recognition model without much extra computational cost. For large vision models, we perform these computations on a cloud server and use low-latency communication with the robots. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 xx 7.2 Task Overview. We use prompt learning for action recognition. Our method leverages the strengths of prompt learning to guide the learning process by helping models better focus on the descriptions or instructions associated with actions in the input videos. We explore various prompts, including optical flow, large vision models, and the proposed SCP, to improve recognition performance. The recognition models can be CNNs or Transformers. . . . . . . . . . . . . . . . . . 
111 7.3 Overview of the action recognition framework. We use transformer-based action recognition methods as an example. We designed a prompt-learning-based encoder to help better extract features, and we use our auto-regressive temporal reasoning algorithm to give recognition models enhanced inference ability. . . . 113 7.4 Soft Conditional Prompt Learning (SCP). Learning input-invariant (prompt experts) and input-specific (data-dependent) prompts. The input-invariant prompts are updated from all the inputs, which contain task information, and we use a dynamic mechanism to generate input-specific prompts for different inputs. Add/Mul denotes element-wise operations. B×S×C is the shape of the input features, and l is the number of experts in the prompt pool. . . . . . . . . . . . . . . . . . . . . 117 7.5 Visualization. We first detect the target of interest and generate the prompts, then predict the action. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 8.1 Our efficient Vision-Language Alignment (ViLA) model, via a Frame-Prompter and distillation, contains two new modules: a text-guided Frame-Prompter and a cross-modal QFormer-Distiller. It learns to extract the most question-related frames while keeping the inference latency low. . . . . . . . . . . . . . . . . . . 129 8.2 Model Overview. Our ViLA model includes 4 sub-modules: the visual encoder, the text-supervised Frame-Prompter (FP), the QFormer-Distiller (QFD), and an LLM. We encode the video frames through a frozen visual encoder. Then we train the Teacher-QFormer using all the frame features. After that, we train the Student-QFormer and Frame-Prompter end-to-end. Unlike the Teacher-QFormer, our Student-QFormer is trained with masked frame features from the text-supervised Frame-Prompter. Finally, the input question text and the QFormer-transformed visual features go through a frozen large language model to generate the answer. 
Our network supports both leveraging the LLM through proper visual prompting without affecting the original LLM's (frozen) ability on language tasks, and simultaneously finetuning the LLM (LoRA) to get optimal performance on specific tasks. . . . 130 8.3 Text-guided Frame-Prompter. Here we show the details of our learnable text-guided Frame-Prompter. We design a learnable Frame-Prompter to sample the most text-query-related frames, with two design choices (a and b). We choose design (a) for diversified temporal sampling. We first encode the mean-pooled segment features. We then apply the Gumbel Softmax to compute the segment mask to guarantee differentiability. The selected frame embeddings then go through the QFormer-Distiller. Here B denotes batch size, T denotes the number of frames, and N × C denotes the frame feature sequences. The Frame-Prompter is learned with the text-supervised gradient. When the VQA loss is applied, the input-question-related gradient further flows to the Frame-Prompter and guides it to select the most critical frames. . . . . . . 135 xxi 8.4 Key-frame Selection Comparison Results (selecting 4 frames from 32 frames). We compare frames selected by our ViLA with those from the SOTA SeViLA [24] method. Across different question types, especially Causal and Temporal questions, the keyframes selected by our network are more relevant and better related to the question. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 8.5 QFormer-Distiller Results Visualization. Here we visualize the keyframes selected after cross-modal distillation. After distillation, we can select the most question-relevant frames even from 16 frames. . . . . . . . . . . . . . . . . . . 146 9.1 Saliency-aware quantile-based partitioning of Gaussian-distributed weights. 
Unsalient weights are divided into equal quantiles and binarized, while salient weights, corresponding to the distribution tails, are quantized using a multi-bit approach. . 156 9.2 Pruning on the Bi-VLM-quantized model and the BiLLM-quantized model. After layer 10, we observe around 95% image-token redundancy in the quantized models. Our Bi-VLM exhibits better performance. . . . . . . . . . . . . . . . . . 170 1 Annotation Statistics. The actor and action distribution for DAVE includes a wide-ranging and rich taxonomy of 16 agents and 16 action categories. This dual focus on both the breadth of agent and action types and the depth of instances allows for more robust and effective training of video recognition models. . . . . . . . . 178 2 SFID Composed Images Comparison. Here we show how we compose the set images. In (a), we randomly place the item images with a fixed size (same height, aspect ratio preserved). In (b), we place the item images using the bbox of the item in the original scene images. After the study, we apply (a) in our SFID computation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196 3 FBT: Flexible Bidirectional Transformer. We randomly sample M ∈ [0, N] items from the N items in a scene as the input set, and take the (M+1)-th item, which is not in the input set, as the output target. We put the scene embedding at the beginning of the input set, and take the scene embedding as the start token E_{Xs}. We set a zero vector as the end token E_{Xe}. The output will be the end token when the input set contains all the items in the scene. . . . . . . . . . . . . . . . . . . . . 197 4 Binary FITB Results. We show the FITB results on the STL-home dataset when there are two candidates (second row). Our model chooses the item in the green box. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201 5 More Visualization for living room. 
In this example, given an inspirational home scene image (sampled from STL-home, shown in columns 1 and 5) and a pool of products (from 3D-FRONT), our model auto-regressively retrieves a set of stylistically compatible items (columns 2, 3, 4, 6, 7, 8). . . . . . . . . 202 6 More Visualization for bedroom. In this example, given an inspirational home scene image (sampled from STL-home, shown in columns 1 and 5) and a pool of products (from 3D-FRONT), our model auto-regressively retrieves a set of stylistically compatible items (columns 2, 3, 4, 6, 7, 8). . . . . . . . . 203 xxii 7 More Visualization for Fashion. In this example, given an inspirational fashion scene image (sampled from STL-fashion, shown in columns 1 and 5) and a pool of products (from STL-fashion), our model auto-regressively retrieves a set of stylistically compatible items (columns 2, 3, 4, 6, 7, 8). . . . . . . 204 8 More Visualization for STL-home. In this example, given an inspirational home scene image (sampled from STL-home, shown in columns 1 and 5) and a pool of products (from STL-home), our model auto-regressively retrieves a set of stylistically compatible items (columns 2, 3, 4, 6, 7, 8). . . . . . . . . 205 9 Large Vision Model. Prompts from the large vision model; no supervision needed. We visualize the outputs in terms of different prompts, including bbox, line, and different points. Bbox and line have more stable outputs, which means better prompts result in better outputs. . . . . . . . . . . . . . . . . . . . . . . . . . . . 207 10 Histogram Analysis and Fitted Gaussian Curve of Layers [0,10,20,30] of the Vision Model. The curve represents the fitted Gaussian distribution over the histogram bar plot. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208 11 Histogram Analysis and Fitted Gaussian Curve of Layers [0,10,20,30] of the Language Model. 
The curve represents the fitted Gaussian distribution over the histogram bar plot. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210 12 KL divergence between weight histograms and fitted Gaussian distributions in a Vision Model. Early self-attention layers exhibit significant deviation from the Gaussian approximation compared to later layers. . . . . . . . . . . . . . . . . . 213 13 KL divergence between weight histograms and fitted Gaussian distributions in a Language Model. Early self-attention layers exhibit significant deviation from the Gaussian approximation compared to later layers. . . . . . . . . . . . . . . . 214 14 Bi-VLM Quantization Error Across Vision Model Layers for Varying Saliency Thresholds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 15 Bi-VLM Quantization Error Across Language Model Layers for Varying Saliency Thresholds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216 xxiii Chapter 1: Introduction 1.1 Video Understanding Video data has become a dominant form of information in the digital age. On platforms like YouTube, over 20 million videos are uploaded daily, and the platform hosts approximately 20 billion videos in total [25], contributing to an ever-growing reservoir of visual data that far exceeds what manual analysis can handle. These statistics highlight the urgent need for automatic and scalable video understanding. Video understanding refers to the process by which computational models analyze, interpret, and extract meaningful information from video data. This involves recognizing objects, actions, scenes, temporal relationships, and complex events over time, often integrating spatial, temporal, and sometimes multimodal information. The goal is to enable machines to “understand” video content in a way that supports tasks such as classification, detection, segmentation, captioning, retrieval, summarization, reasoning, and question answering [26–30]. 
Real-world applications of video understanding span virtually every domain. In autonomous driving [31–33], advanced driver assistance systems use multiple cameras to detect pedestrians, traffic signals, lanes, and other vehicles in real time, enabling self-driving cars to make life-critical decisions based on video feeds. In public security [34, 35], smart surveillance systems leverage computer vision to automatically recognize activities or anomalies in CCTV footage, aiming to enhance safety in smart cities. Unmanned aerial vehicles (UAVs) and drones [36, 37] deploy onboard video analytics for tasks like agricultural monitoring and search-and-rescue, where real-time interpretation of aerial video can guide timely actions. In the entertainment industry [30, 38], video understanding drives content recommendation, moderation, and immersive experiences: streaming services analyze video frames to tag content and personalize what viewers see, while video games and augmented reality rely on understanding live camera input to blend virtual and real worlds. This introduction presents an overview of the core foundational components in video understanding. We discuss key components from the datasets that fuel progress and the preprocessing techniques that prepare raw footage, to the visual reasoning methods for spatial and temporal perception, multimodal alignment for integrating vision with language, and finally model compression approaches for efficient deployment. Dataset Development for Video Understanding: Large-scale annotated datasets have been instrumental in advancing video understanding. Early efforts like UCF101 [39] in 2012 provided one of the first sizable action recognition datasets, with 13,000 clips across 101 human action categories, and similar benchmarks (e.g., HMDB-51 [40], Sports-1M [41], and Kinetics [42]) enabled researchers to train and evaluate video models on diverse “in the wild” footage. 
These datasets established evaluation metrics and uncovered challenges (camera motion, back- ground clutter, etc.) that shaped model development. The progression in the dataset development illustrates how the community has moved towards larger, more diverse, and more task-driven video datasets. These datasets are foundational – they supply the training data needed for modern deep learning models and ensure that research progress translates to real-world generalization. Preprocessing Techniques: Raw video is high-dimensional and often redundant, so ef- fective preprocessing is critical for both accuracy and efficiency. A fundamental step is frame 2 sampling – selecting a subset of frames from the video – to balance information vs. computa- tion. Classical approaches often used uniform or stride-based sampling (e.g. one frame every few frames) and decoding all frames at a fixed frame rate [43]. Standard image preprocessing like re- sizing and cropping is extended to video by applying the same spatial transform to each frame in a clip, maintaining temporal consistency. Data augmentation (random crops, flips, color jitter, etc.) is similarly applied on video frames to improve robustness, with care to apply identically across time to avoid disrupting temporal coherence. Beyond these basics, recent research has developed more advanced preprocessing techniques. One notable trend is differentiable frame sampling or frame selection policies that learn to pick the most informative frames on the fly [44], rather than a fixed sampling rate. Such adaptive sampling saves computation by skipping redundant or uninformative frames, focusing processing on salient moments of the video. In summary, preprocessing has evolved from heuristic frame selection and manual feature computation (e.g. optical flow) to learned and content-aware sampling strategies, as well as leveraging video com- pression to process footage more efficiently. 
These techniques form the first stage of an efficient video understanding pipeline, ensuring that subsequent reasoning modules operate on compact yet information-rich visual inputs.

Spatial Frame Understanding and Visual Reasoning: For video understanding, a fundamental building block is robust spatial understanding of each frame. This involves detecting and recognizing the objects, scenes, and other visual elements present in individual images (frames). Over the years, the computer vision community has achieved remarkable progress in image understanding, which directly carries over to the frames of a video. The introduction of deep Convolutional Neural Networks (CNNs) was a watershed moment: models like AlexNet [45] demonstrated that learnable convolutional filters could far outperform hand-crafted features at image classification. This culminated in the development of ResNet [46], a very deep CNN using residual skip-connections to ease training. ResNet-50 and its deeper variants (with 101 or 152 layers) surpassed human-level performance on ImageNet image recognition, and these architectures became the de facto backbones for many vision tasks. In the context of video, a ResNet pretrained on ImageNet is often used to encode each frame into a rich feature vector, enabling the model to understand what is where in the scene. In the meantime, traditional CNNs have been extended with attention mechanisms and structured representations (e.g. scene graphs) to perform reasoning. The current state of the art for spatial vision modeling has shifted towards Vision Transformers (ViT) [47] and their variants. Transformers dispense with convolution entirely, and instead use global self-attention to model relationships between patches of an image. For video frames, vision transformers provide stronger capabilities for spatial reasoning: they can attend to disparate image regions (e.g.
a person and an object they are reaching for across the frame) and model their relationship explicitly. Empirically, these transformer models now serve as powerful frame encoders. In summary, spatial frame understanding has evolved from early CNN-based recognition of objects to more holistic reasoning with attention-based models.

Temporal Modeling and Motion Analysis: A video is more than a stack of images; the temporal dimension introduces motion and evolving dynamics that must be captured to understand actions and events. Temporal modeling techniques aim to learn representations that integrate information over time, from short motion patterns (e.g. a hand waving) to long-range dependencies (e.g. events unfolding over minutes). One classical approach to temporal modeling is the use of recurrent neural networks (RNNs), such as Long Short-Term Memory (LSTM) networks [48], which process frame features sequentially and maintain a hidden state that evolves. Early video recognition models like LRCN [49] combined CNN frame encoders with an LSTM to recognize actions from frame sequences. Around the same time, another influential approach was the development of 3D Convolutional Networks. In a 3D ConvNet, filters operate across the spatial dimensions (height, width) and time, thereby directly learning motion-sensitive features. A milestone was C3D [50], which showed that a generic 3D CNN trained on videos could automatically learn motion detectors (like rudimentary optical flow) in its early layers and achieve strong action recognition performance. These classical methods established two important paradigms, recurrent sequence modeling and spatiotemporal convolution, both of which substantially outperformed earlier hand-crafted temporal features. In recent years, state-of-the-art temporal modeling has been revolutionized by attention mechanisms and Transformer architectures applied in time.
Just as transformers improved spatial modeling, they have proven extremely effective for temporal sequences. A representative example is TimeSformer [51], the first pure-transformer architecture for video understanding. TimeSformer factorizes attention over space and time: it applies self-attention within each frame (spatial) and across frames (temporal), allowing the model to learn long-range temporal dependencies with global context. Empirically, transformer-based models and their hybrids now achieve state-of-the-art results on video classification benchmarks, often surpassing traditional 3D CNNs in accuracy while offering more flexibility in modeling. In summary, temporal modeling has progressed from sequential or fixed-length processing of frames to more flexible and long-range attentive modeling. These advances allow video understanding systems not only to identify what is happening in a clip but also to understand when and how events unfold, which is crucial for tasks like action recognition, prediction, and temporal segmentation.

Multimodal Alignment for Video Question Answering: Video understanding increasingly involves multimodal analysis, integrating visual information with other modalities such as language and audio, to enable deeper semantic tasks. One prominent multimodal task is Video Question Answering (Video QA), where a model answers natural language questions about a video’s content. This requires aligning visual content (frames, objects, actions) with textual content (questions, narration, or subtitles) and sometimes audio cues. A classical benchmark in this area was MovieQA [52], which provided clips from movies, plot summaries, and question-answer pairs to evaluate story understanding. Early methods for Video QA often used separate pipelines: a CNN/LSTM to encode the video and a text encoder for the question, with a fusion mechanism to produce an answer.
For example, the MovieQA baseline [52] combined visual features (from frame-level CNN descriptors) with simple text representations, and used multiple-choice questions to assess comprehension of characters and events. Another line of classical work was video captioning [53], where models learned to generate descriptive sentences for video clips, effectively learning a mapping from video to language. These early works laid the groundwork for understanding correspondences between visual dynamics and natural language descriptions or questions. Modern state-of-the-art approaches to multimodal video understanding leverage powerful pretrained models and large-scale data. A notable example is VideoCLIP [54], one of the first methods to perform large-scale contrastive pre-training directly on video–subtitle pairs, enabling zero-shot transfer to retrieval, captioning, and action recognition. Other cutting-edge techniques include dual-stream transformers (one stream for video frames, one for text, with cross-attention between them) and the use of large language models augmented with visual encoders. For instance, researchers have begun to integrate vision encoders with models like GPT-4 to enable open-ended question answering about video content, though such systems are still emerging. The progress in this area is exemplified by huge gains on benchmarks: early models often struggled with complex queries, whereas current models can handle questions about temporal order, causal relationships, and even hypothetical events in videos. In summary, the synergy of vision and language in video understanding has grown from simple Q&A on short clips to foundation models that jointly learn from video and text at scale. This multimodal alignment capability is crucial for high-level understanding tasks like video retrieval with text queries, video captioning, and video dialogue systems.
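At inference time, contrastive video–text models of the kind described above score a video against candidate texts by cosine similarity of their embeddings. The sketch below shows only this scoring step with toy pre-computed embeddings; the embedding dimension and variable names are placeholders, not the actual VideoCLIP interface:

```python
import numpy as np

def contrastive_scores(video_emb: np.ndarray, text_emb: np.ndarray,
                       temperature: float = 0.07) -> np.ndarray:
    """Cosine-similarity logits between video and text embeddings,
    as used for zero-shot retrieval in contrastive video-text models."""
    v = video_emb / np.linalg.norm(video_emb, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    return (v @ t.T) / temperature  # shape: (num_videos, num_texts)

# Toy example: 2 videos scored against 3 candidate captions.
videos = np.random.randn(2, 512)
texts = np.random.randn(3, 512)
scores = contrastive_scores(videos, texts)
best_caption = scores.argmax(axis=1)  # retrieved caption index per video
```

During pre-training, the same logits feed a cross-entropy loss that pulls matching video–text pairs together and pushes mismatched pairs apart.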
Video Model Compression and Acceleration: The drive for efficient video understanding has led to extensive research in model compression and acceleration techniques. Video models (especially 3D CNNs or transformers operating on many frames) tend to be computationally heavy, which poses challenges for real-time processing and deployment on edge devices like drones or mobile phones. Classical approaches to compressing models were often inherited from the image domain. For example, network pruning techniques systematically remove less important weights or filters from a CNN after training, while quantization reduces the precision of model parameters (e.g. from 32-bit floating point to 8-bit integers). Han et al. [55] pioneered Deep Compression, showing that one can prune, quantize, and even Huffman-code neural networks to dramatically reduce memory and computation with minimal loss in accuracy. Such methods, when applied to CNNs for video, translate directly into faster inference. Another classical strategy is knowledge distillation [56], where a smaller “student” model is trained to replicate the outputs of a large “teacher” model, thus transferring knowledge. This has been used to compress large video models by training a compact model on the soft outputs or feature maps of a high-capacity reference model. In recent years, researchers have developed specialized techniques for video model acceleration, motivated by the observation that there is substantial redundancy both in video inputs (neighboring frames are often similar) and within the model’s feature maps. One effective approach is adaptive computation in the temporal dimension [57]. Overall, the state of the art in video model compression is characterized by holistic optimizations: from weight pruning and quantization of networks, to dynamic execution that adjusts to input content, and architecture innovations that bake efficiency into the model design.
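The 32-bit-to-8-bit quantization mentioned above can be illustrated with a minimal symmetric, per-tensor post-training scheme. This is a simplification for illustration only; production quantizers are typically per-channel and calibration-based:

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization: map float32 weights onto
    the int8 grid [-127, 127] using a single scale factor."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)  # a toy weight matrix
q, scale = quantize_int8(w)
error = np.abs(dequantize(q, scale) - w).max()
# Storage drops 4x (int8 vs float32); the worst-case rounding error
# is half a quantization step, i.e. at most scale / 2.
assert error <= scale / 2 + 1e-6
```

Pruning and distillation compose naturally with this step, which is what makes pipelines like Deep Compression effective.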
These advancements are crucial for bringing sophisticated video understanding algorithms from research labs to real-world applications, where computation and energy are often limited. An efficient model that can run on-device in real time opens up possibilities for privacy-preserving video analytics on the edge (processing video locally) and scalable deployment (hundreds of cameras or drones each running analytics). The continued research in this area is driving the field towards the thesis goal of effective and efficient video understanding: achieving high accuracy and rich functionality without exorbitant computational cost.

1.2 Dissertation Overview

The landscape of video understanding has evolved significantly over the past decade. We have seen the emergence of extensive datasets that enable the training of deep learning models, the transition from hand-engineered features to end-to-end learned representations, the development of sophisticated mechanisms to model spatial and temporal information, the rise of video-language alignment for high-level video reasoning, and the acceleration and compression of video models. Across all these developments, a common thread is the balance between effectiveness and efficiency. It is now possible to build models that “understand” videos in the sense of recognizing complex actions and even answering semantic questions about them, but often at a tremendous computational cost. The central aim of this thesis is to explore methods that bridge this gap. By drawing on the foundations reviewed above and introducing new ideas to make video models more effective, faster, and more adaptive, we seek to push the field toward techniques that retain state-of-the-art performance while significantly improving efficiency. The following chapters will delve into these contributions, building upon the rich context outlined in this introduction.
Part I: Video Datasets

Our focus here is on the design and creation of challenging datasets tailored to improve the effectiveness and robustness of visual understanding, ensuring diverse, high-quality, and contextually rich data that bridges existing gaps in the field. Specifically, we present METEOR [58] in Chapter 2 and DAVE [59] in Chapter 3. METEOR is a dataset of rare and interesting multi-agent driving behaviors that are grouped into traffic violations, atypical interactions, and diverse scenarios. DAVE is designed for evaluating perception methods on a benchmark with a high representation of Vulnerable Road Users (VRUs: e.g. pedestrians, animals, motorbikes, and bicycles) in complex and unpredictable environments. Furthermore, DAVE can benchmark video tasks like Tracking, Detection, Spatiotemporal Action Localization, Language-Visual Moment Retrieval, and Multi-label Video Action Recognition. Our analysis revealed substantial shortcomings of current perception models when tested against METEOR and DAVE.

Part II: Preprocessing

Frame sampling reduces computational overhead by selecting only key frames, retaining essential temporal cues while eliminating redundant information, and cropping ensures consistent spatial dimensions, focusing the model’s attention on the most relevant regions. Preprocessing improves both the efficiency of model training/inference and the overall accuracy and robustness of video understanding. In Chapter 4, we introduce AZTR [60], a learning-based approach that uses customized auto zoom to automatically identify the target and scale it appropriately; an efficient transformer-based algorithm then maps the input sequence to a new sequence of a specific size, chosen according to the computational cost requirement, for efficient temporal reasoning.
In practice, we achieve a 6.1-7.4% improvement over SOTA in Top-1 accuracy on the RoCoG-v2 dataset [61], an 8.3-10.4% improvement on the UAV-Human dataset [62], and a 3.2% improvement on the Drone Action dataset [63]. In Chapter 5, we present MITFAS [64], which uses the concept of mutual information to compute and align the regions corresponding to the target action or motion in the temporal domain for better recognition reasoning. In practice, we achieve an 18.9% improvement in Top-1 accuracy over current state-of-the-art methods on UAV-Human [62], a 7.3% improvement on Drone-Action [63], and a 7.16% improvement on NEC Drones [65].

Part III: Visual Reasoning

In this segment, we focus on the development of new methodologies for enhancing visual reasoning capabilities, including architecture designs and advanced attention mechanisms that allow models to focus on contextually significant features and improve effectiveness in real-world scenarios. In Chapter 6, we propose ICAR [66], a compatibility learning framework in which a category-aware Flexible Bidirectional Transformer (FBT) is introduced for visual scene-based set compatibility reasoning, together with a cross-domain visual similarity module. Compared with SOTA methods, ICAR achieves up to 5.3% and 9.6% improvement in FITB score, and 22.3% and 31.8% improvement in SFID, on fashion and furniture, respectively. In Chapter 7, we present Conditional Prompt Learning (SCP) [67], which leverages the strengths of prompt learning to further enhance reasoning ability. Our formulation supports various prompts, including learnable prompts, auxiliary visual information, and large vision models to improve recognition performance. In practice, we observe a 3.17-10.2% accuracy improvement on the aerial video datasets (Okutama [68], NECDrone [65]), which consist of scenes with single-agent and multi-agent actions.
We further evaluate our approach on ground camera videos to verify its effectiveness and generalization, achieving a 1.0-3.6% improvement on SSV2 [69].

Part IV: Multimodal Alignment

In Chapter 8, we introduce a novel method for the integration of visual and linguistic modalities, fostering a seamless alignment that enhances cross-modal understanding and paves the way for more intuitive and accurate multimodal applications. With ViLA [70], we propose an efficient Video-Language Alignment network, which addresses both efficient frame sampling and effective cross-modal alignment in a unified way. Compared with prior work, our ViLA model demonstrates the capability of selecting key frames with critical content, thus improving video-language alignment accuracy while reducing inference latency (+3.3% on NExT-QA Temporal with a 3.0× speed-up). Overall, our ViLA network outperforms state-of-the-art methods on video question-answering benchmarks: +4.6% on STAR Interaction and +2.2% on STAR average with a 3.0× speed-up; our 2-frame model outperforms SeViLA with 4 frames on the VLEP dataset with a 4.2× speed-up.

Part V: Acceleration and Compression

In Chapter 9, we propose Bi-VLM, which separates model weights non-uniformly based on Gaussian quantiles. Our formulation groups the model weights into an outlier (salient) subset and multiple inlier (unsalient) subsets, ensuring that each subset contains a proportion of weights corresponding to its quantile in the distribution. We propose a saliency-aware hybrid quantization algorithm and use it to quantize weights by imposing different constraints on the scaler and binary matrices based on the saliency metric and compression objective. We have evaluated our approach on different VLMs. For the language model part of the VLM, our Bi-VLM outperforms the SOTA by 3%-47% on the visual question answering task in terms of four different benchmarks and three different models.
For the overall VLM, our Bi-VLM outperforms the SOTA by 4%-45%. We also perform token pruning on the quantized models and observe 90%-99% redundancy among image tokens in the quantized models. This allows us to further prune the visual tokens to improve efficiency.

By addressing these dimensions, this dissertation not only contributes methodologies and insights for video understanding but also provides practical advancements that facilitate the deployment of more effective, efficient, and interpretable models for real-world applications. It further lays the groundwork for future research in robust, context-aware, and multi-modal AI systems.

Part I: Video Dataset

Chapter 2: METEOR: A Dense, Heterogeneous, and Unstructured Traffic Dataset With Rare Behaviors

2.1 Introduction

Recent research in learning-based techniques for robotics, computer vision, and autonomous driving has been driven by the availability of datasets and benchmarks. Several traffic datasets have been collected from different parts of the world to stimulate research in autonomous driving, driver assistance, and intelligent traffic systems. These datasets correspond to highway or urban traffic, and are widely used in the development and evaluation of new methods for perception [71], prediction [72], behavior analysis [73], and navigation [74]. Many initial autonomous driving datasets were motivated by computer vision or perception tasks such as object recognition, semantic segmentation, or 3D scene understanding. Recently, many other datasets have been released that consist of point-cloud representations of objects captured using LiDAR, pose information, 3D track information, stereo imagery, or detailed map information for applications related to 3D object recognition and motion forecasting.
Many large-scale motion forecasting datasets such as Argoverse [75] and the Waymo Open Motion Dataset [76], among others, have been used extensively by researchers and engineers to develop robust prediction models that can forecast vehicle trajectories.

Table 2.1: Characteristics of Traffic Datasets. We compare METEOR with state-of-the-art autonomous driving datasets that have been used for trajectory tracking, motion forecasting, semantic segmentation, prediction, and behavior classification. METEOR is the largest (in terms of number of annotated frames) and most diverse in terms of heterogeneity, scenarios, varying behaviors, densities, and rare instances. Darker shades represent a richer collection in that category. Best viewed in color.

| Dataset | Location | Bad weather | Night | Road type | Het.⋆ | Size | Density | Lidar | HD Maps | Traffic Violations‡ | Atypical Interactions‡ | Diverse Scenarios‡ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Argoverse [75] | USA | ✓ | ✓ | urban | 10 | 22K | Medium | ✓ | ✓ | ✗ | ✓ | ✗ |
| Lyft Level 5 [77] | USA | ✗ | ✗ | urban | 9 | 46K | Low | ✓ | ✓ | ✗ | ✗ | ✗ |
| Waymo [76] | USA | ✓ | – | urban | 4 | 200K | Medium | ✓ | ✓ | ✗ | ✓ | ✓ |
| ApolloScape [78] | China | ✗ | ✓ | urban, rural | 5 | 144K | High | ✓ | ✓ | ✗ | ✗ | ✗ |
| nuScenes [79] | USA/Sg. | ✓ | ✓ | urban | 13 | 40K | Low | ✓ | ✓ | ✗ | ✓ | ✓ |
| INTERACTION [80] | International | ✗ | ✗ | urban | 1 | − | Medium | ✓ | ✓ | ✗ | ✗ | ✗ |
| CityScapes [81] | Europe | ✗ | ✗ | urban | 10 | 25K | Low | ✗ | ✗ | ✗ | ✗ | ✗ |
| IDD [82] | India | ✗ | ✗ | urban, rural | 12 | 10K | High | ✗ | ✗ | ✗ | ✗ | ✗ |
| HDD [83] | USA | ✗ | ✗ | urban | − | 275K | Medium | ✓ | ✗ | ✗ | ✓ | ✓ |
| Brain4cars [84] | USA | ✗ | ✗ | urban | − | 2000K | Low | ✗ | ✓ | ✗ | ✗ | ✗ |
| D2-City [85] | China | ✓ | ✗ | urban | 12 | 700K | Medium | ✗ | ✗ | ✗ | ✗ | ✓ |
| TRAF [86] | India | ✗ | ✓ | urban, rural | 8 | 72K | High | ✗ | ✗ | ✗ | ✗ | ✗ |
| BDD [87] | USA | ✓ | ✓ | urban | 8 | 3000K | High | ✗ | ✗ | ✗ | ✗ | ✓ |
| ROAD [88] | UK | ✓ | ✓ | urban | 7 | 122K | Low | ✓ | ✗ | ✗ | ✗ | ✓ |
| METEOR | India | ✓ | ✓ | urban, rural† | 16†† | 2027K | High§ | ✗ | ✗ | ✓ | ✓ | ✓ |

‡ Rare instances can be broadly grouped into (i) traffic violations, (ii) atypical interactions, and (iii) difficult scenarios.
† Includes roads without lane markings. Roads in other datasets with rural roads may contain lane markings.
⋆ Heterogeneity. We indicate the classes corresponding to moving traffic agents only, excluding static objects such as poles, traffic lights, etc.
§ Up to 40 agents per frame.
†† Up to 9 unique agents per frame.

However, existing datasets do not capture the rare behaviors or heterogeneous patterns. Therefore, prediction models trained on these existing datasets are not very robust in terms of handling challenging traffic scenarios that arise in the real world. A major challenge currently faced by research in autonomous driving is the heavy tail problem [75, 76], which refers to the challenge of dealing with rare and interesting instances. There are several ways in which existing datasets currently address the heavy tail problem:

1. Mining: The Argoverse and Waymo datasets use a mining procedure that includes scoring each trajectory based on its “interestingness” to explicitly search for difficult and unusual scenarios [75, 76].

2. Diversifying the taxonomy: Train the prediction and forecasting models to identify unknown agents at test time. This approach necessitates annotating a diverse taxonomy of class labels. Argoverse and nuScenes [79] contain 15 and 23 classes, respectively.

3. Increasing dataset size: This approach simply collects more data, with the premise that collecting more traffic data will likely also increase the number of such scenarios in the dataset.

In spite of many efforts along these lines, existing datasets manage to collect only a handful of such instances, due to the infrequent nature of their occurrence. For example, the Waymo Open Motion dataset [76] contains only atypical interactions and diverse scenarios, while the Argoverse dataset [75] contains only atypical interactions. There is clearly a need for a different approach to addressing the heavy tail problem. Our solution is to build a traffic dataset from videos collected in India, where the inherent nature of the traffic is dense, heterogeneous, and unstructured.
The traffic patterns and surrounding environment in parts of India are more challenging than those in other parts of the world. This includes high congestion and traffic density. Some of these roads are unmarked or unpaved. Moreover, the traffic agents moving on these roads correspond to vehicles, buses, trucks, bicycles, pedestrians, auto-rickshaws, two-wheelers such as scooters and motorcycles, etc.

2.1.1 Main Contributions

1. We present a novel dataset, METEOR, corresponding to the dense, heterogeneous, and unstructured traffic in India. METEOR is the first large-scale dataset containing annotated scenes for rare and interesting instances and multi-agent driving behaviors, broadly grouped into:

(a) Traffic violations: running traffic signals, driving in the wrong lanes, taking wrong turns.

(b) Atypical interactions: cut-ins, yielding, overtaking, overspeeding, zigzagging, lane changing.

(c) Diverse scenarios: intersections, roundabouts, and traffic signals.

2. METEOR has more than 2 million labeled frames and 13 million annotated bounding boxes for 16 unique traffic agents, and GPS trajectories for the ego-agent.

3. Every video in METEOR is tagged using a diverse range of factors including weather, time of the day, road conditions, and traffic density.

4. We use METEOR to extract new insights in perception tasks such as 2D object detection and multi-agent behavior recognition in unstructured traffic. Additionally, we present a novel, fine-grained analysis of the relationship between traffic environments (traffic density, mixture of agents, area, time of the day, and weather conditions) and 2D object detection.
2.1.2 Applications and Benefits

We list some promising directions in which METEOR can contribute towards autonomous driving research:

• Towards Robust Perception: We observe that perception tasks like 2D object detection and multi-agent behavior recognition fail in challenging Indian traffic scenarios, compared to their performance on existing datasets captured in the US, Europe, and other developed nations. METEOR can be a useful benchmark for research in perception in unstructured traffic environments and developing nations.

Figure 2.1: METEOR. We summarize various characteristics of our dataset in terms of scene: traffic density, road type, lighting conditions, agents (we indicate the total count of each agent across 1250 videos), and behaviors, along with their size distribution (in GB). The total size of the current version of the dataset is around 100GB, and it will continue to expand. Our dataset can be used to evaluate the performance of current and new methods for perception, prediction, behavior analysis, and navigation based on some or all of these characteristics. Details of the organization of our dataset are given at https://gamma.umd.edu/meteor.

• Towards Risk-Aware Planning and Control: METEOR can aid the development of risk-aware motion planners by predicting the behaviors of surrounding agents. Motion planners can compute controls that guarantee safety around aggressive drivers who are prone to overtaking and overspeeding.

• Towards Fine-grained Traffic Analysis: With METEOR, researchers can study the causality relationship between traffic patterns, static scene elements, and dynamic agent behaviors, resulting in novel ADAS for unstructured traffic environments.
2.2 Comparison with Existing Datasets

2.2.1 Tracking and Trajectory Prediction Datasets

Datasets such as Argoverse [75], Lyft Level 5 [77], the Waymo Open Dataset [76], ApolloScape [78], and the nuScenes dataset [79] are used for trajectory forecasting [86, 89–92] and tracking [71]. Several of these datasets use a mining procedure [75, 76] that heuristically searches the dataset for rare and interesting scenarios. The resulting collection of such scenarios and behaviors, however, is only a fraction of the entire dataset. METEOR, by comparison, exclusively contains such scenarios due to the inherent nature of the unstructured traffic in India.

METEOR has many additional characteristics with respect to these datasets. For instance, METEOR's 2.02 million annotated frames are more than 10× the current highest number of annotated frames among other datasets with high-density traffic (ApolloScape). Furthermore, METEOR consists of 16 different traffic agents that include only on-road moving entities (and not static obstacles). This is, by far, the most diverse in terms of class labels. In comparison, Argoverse and nuScenes contain 10 and 13 traffic agents, respectively. METEOR is the first motion forecasting and behavior prediction dataset with traffic patterns from rural and urban areas that consist of unmarked roads and high-density traffic. In contrast, traffic scenarios in Argoverse, Waymo, Lyft, and nuScenes have been captured in sparse- to medium-density traffic on well-marked, structured roads in urban areas.

2.2.2 Semantic Segmentation Datasets

CityScapes [81] is widely used for several tasks, primarily semantic segmentation. It is based on urban traffic data collected from European cities with structured roads and low traffic density. In contrast, the Indian Driving Dataset (IDD) [82] is collected in India in both urban and rural areas with high-density traffic.
A common aspect of both these datasets (CityScapes and IDD), however, is the relatively low annotated frame count (25K and 10K, respectively). This is probably due to the effort involved in annotating every pixel in each image. IDD also contains high-density traffic scenarios in rural areas, similar to METEOR. However, our dataset has 200× the number of annotated frames and 1.6× the number of traffic-agent classes. Similar to TRAF, the IDD does not contain behavior data.

2.2.3 Behavior Prediction

Behavior prediction corresponds to the task of predicting turns (right, U-turn, or left), acceleration, merging, and braking, in addition to driver-intrinsic behaviors such as over-speeding, overtaking, cut-ins, yielding, and rule-breaking. The two most prominent datasets for action prediction include the Honda Driving Dataset (HDD) [83] and the BDD dataset [87]. Some of the major distinctions between METEOR and the HDD are in terms of size (approximately 10×), the availability of scenes with night driving and rainy weather, and the inclusion of unstructured environments in low-density traffic. The BDD dataset [87] contains more annotated samples than METEOR; however, the BDD dataset contains 100K videos while METEOR contains 1K videos, so the number of annotated samples per video is 66× higher for METEOR. The annotations in prior datasets are limited to actions and do not contain the rare and interesting behaviors contained in METEOR.

Figure 2.2: Annotations for rare instances: (a) Cut-ins/Jaywalking. (b) Yielding/Cut-ins. (c) Overtaking/Overspeeding. (d) Driving in wrong lane. (e) Running red traffic lights. (f) Ignoring lane signs/wrong lane driving. (g) High density. (h) Rainy weather. (i) Night time. (j) Rural areas. One of the unique aspects of METEOR is the availability of explicit labels for rare and interesting instances including atypical interactions, traffic violations, and diverse scenarios. These annotations can be used to benchmark new methods for object detection and multi-agent behavior prediction.

2.3 METEOR Dataset

Our dataset is summarized in Figure 2.1 and visually shown in Figure 2.2. Below, we present some details of the data collection process and discuss some of the salient characteristics of METEOR.

2.3.1 Dataset Collection and Organization

The data was collected in and around the city of Hyderabad, India, within a radius of 42 to 62 miles. Several outskirts were chosen to cover rural and unstructured roads. Our hardware capture setup consists of two wide-angle Thinkware F800 dashcams mounted on an MG Hector and a Maruti Ciaz. The camera sensor has a 2.3-megapixel resolution with a 140◦ field of view. The video is captured in full high definition with a resolution of 1920 × 1080 pixels at a frame rate of 30 frames per second. The dashcam is embedded with an accurate positioning system that stores the GPS coordinates, which were processed into world frame coordinates. The sensor synchronizes between the camera and the GPS.

Recordings from the dashcam are streamed continuously and are clipped into one-minute video segments. The dataset is organized as 1250 one-minute video clips. Each clip contains static and dynamic XML files. Each static file summarizes the meta-data of the entire video clip, including the behaviors, road type, scene structure, etc. Each dynamic file describes frame-level information such as bounding boxes, GPS coordinates, and agent behaviors. Our dataset can be searched using helpful filters that sort the data according to road type, traffic density, area, weather, and behaviors. We also provide many scripts to easily load the data after downloading.
2.3.2 Annotations

We manually annotated the videos using the Computer Vision Annotation Tool (CVAT) and provide the following labels: (i) bounding boxes for every agent, (ii) agent class IDs, (iii) GPS trajectories for the ego-vehicle, (iv) environment conditions including weather, time of the day, traffic density, and heterogeneity, (v) road conditions (urban, rural, lane markings), (vi) road network features including intersections, roundabouts, and traffic signals, (vii) actions corresponding to left/right turns, U-turns, acceleration, and braking, (viii) rare and interesting behaviors, and (ix) the camera intrinsic matrix for depth estimation to generate trajectories of the surrounding vehicles. This set of annotations is the most diverse and extensive compared to prior datasets.

A diverse and rich taxonomy of agent categories is necessary to ensure that autonomous driving systems can detect different types of agents in any given scenario. Towards that goal, datasets for autonomous driving are designed or captured to achieve two goals: (a) capture as many different types of agent categories as possible; (b) capture as many instances of each category as possible. In both these aspects, METEOR outperforms all prior datasets. We annotate 16 types of moving traffic entities with rare and interesting behaviors. Note specifically that the percentage